high_scalability high_scalability-2013 high_scalability-2013-1382 knowledge-graph by maker-knowledge-mining

1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud


meta info for this blog

Source: html

Introduction: This is a guest post by Ivan de Prado and Pere Ferrera, founders of Datasalt, the company behind Pangool and Splout SQL Big Data open-source projects. The amount of payments performed using credit cards is huge. It is clear that there is inherent value in the data that can be derived from analyzing all the transactions. Client fidelity, demographics, heat maps of activity, shop recommendations, and many other statistics are useful to both clients and shops for improving their relationship with the market. At Datasalt we have developed a system in collaboration with the BBVA bank that is able to analyze years of data and serve insights and statistics to different low-latency web and mobile applications. The main challenge we faced, besides processing Big Data input, is that the output was also Big Data, and even bigger than the input. And this output needed to be served quickly, under high load. The solution we developed has an infrastructure cost of just a few tho


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Client fidelity, demographics, heat maps of activity, shop recommendations, and many other statistics are useful to both clients and shops for improving their relationship with the market. [sent-4, score-0.653]

2 At Datasalt we have developed a system in collaboration with the BBVA bank that is able to analyze years of data and serve insights and statistics to different low-latency web and mobile applications. [sent-5, score-0.476]

3 Data, goals and first decisions: The system uses BBVA's credit card transactions, performed in shops all around the world, as the input source for the analysis. [sent-10, score-0.89]

4 We calculate many statistics and data points for each shop and for different periods of time. [sent-14, score-0.541]

5 These are some of them: histogram of payment amounts for each shop, client fidelity, client demographics, shop recommendations (clients buying here also buy at ...). [sent-15, score-0.354]

6 The architecture has three main parts: Data storage: used to maintain the raw data (credit card transactions) and the resulting Voldemort stores. [sent-28, score-0.434]

7 Data serving: a Voldemort cluster that serves the precomputed data from the data processing layer. [sent-30, score-0.43]

8 This allows us to keep all historical data - all credit card transactions performed every single day. [sent-32, score-0.683]
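For illustration only, here is a minimal sketch of how precomputed results could be keyed for the serving layer: one value per shop and time period, fetched with a single key-value get. The composite key layout and the plain Map standing in for the store client are assumptions made for this sketch, not the project's actual Voldemort schema or API.

```java
// Minimal sketch: precomputed statistics keyed by (shop, period) so the serving
// layer answers a lookup with a single key-value get. The key layout and the Map
// standing in for the real store client are assumptions, not the actual
// Voldemort schema or API used by the project.
import java.util.Map;

public class StatsLookup {
    private final Map<String, byte[]> store;   // stand-in for a key-value store client

    public StatsLookup(Map<String, byte[]> store) {
        this.store = store;
    }

    // e.g. statsKey("shop-123", "2012-11") -> "shop-123|2012-11"
    static String statsKey(String shopId, String period) {
        return shopId + "|" + period;
    }

    // Serialized, precomputed statistics for one shop and period, or null if absent.
    public byte[] get(String shopId, String period) {
        return store.get(statsKey(shopId, period));
    }
}
```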

9 The computed data - Statistics: Part of the analysis consists of calculating simple statistics: averages, max, min, stdev, unique counts, etc. [sent-60, score-0.402]

10 Moreover, we can compute all the simple statistics for each shop, together with the associated histograms, in just one MapReduce step, and for an arbitrary number of time periods. [sent-64, score-0.367]
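As a rough illustration of what one such pass can accumulate per (shop, time period) key, the sketch below keeps count, sum, min, max, the sums needed for stdev, and a fixed-width histogram while looping once over that key's payment amounts. All names and the bin layout are hypothetical, not the actual Pangool/BBVA code.

```java
// Rough sketch of the per-(shop, period) aggregation a single reduce call could
// perform: count, sum, min, max, stdev (from sums and sums of squares) and a
// fixed-width histogram, all in one pass over the payment amounts for that key.
public class ShopPeriodStats {
    private long count = 0;
    private double sum = 0, sumSq = 0;
    private double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
    private final long[] bins;        // fixed-width histogram, later compacted
    private final double binWidth;

    public ShopPeriodStats(int numBins, double maxAmount) {
        this.bins = new long[numBins];
        this.binWidth = maxAmount / numBins;
    }

    // Called once per credit card payment belonging to this (shop, period) key.
    // Unique counts (e.g. distinct clients) would additionally need a set or a
    // probabilistic sketch and are omitted here.
    public void add(double amount) {
        count++;
        sum += amount;
        sumSq += amount * amount;
        min = Math.min(min, amount);
        max = Math.max(max, amount);
        int bin = (int) Math.min(bins.length - 1, Math.max(0, amount / binWidth));
        bins[bin]++;
    }

    public double avg() { return count == 0 ? 0 : sum / count; }

    public double stdev() {
        if (count == 0) return 0;
        double mean = avg();
        return Math.sqrt(Math.max(0, sumSq / count - mean * mean));
    }
}
```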

11 In order to reduce the amount of storage used by histograms and to improve their visualization, the originally computed histograms, formed by many bins, are transformed into variable-width bin histograms. [sent-65, score-0.776]

12 The following diagram shows the optimal 3-bin histogram for a particular histogram. The optimal histogram is computed using a random-restart hill-climbing approximation algorithm. [sent-66, score-0.668]
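A minimal sketch of what such a random-restart hill-climbing search could look like: starting from random cut positions, each cut is nudged left or right while that reduces the squared error between the original fine-grained counts and the mean of the variable-width bin they fall into, and the best result over several restarts is kept. The error function and all names are assumptions for illustration; the post does not show the actual algorithm's code.

```java
// Random-restart hill climbing for k variable-width bins approximating a
// fine-grained histogram. A candidate is a set of cut positions; the error is
// the squared difference between each original count and its segment's mean.
import java.util.Random;
import java.util.TreeSet;

public class VariableWidthHistogram {
    private static final Random RND = new Random();

    // counts: original fine-grained histogram; k: target number of bins (>= 1);
    // restarts: number of random restarts (>= 1). Returns k-1 sorted cut positions.
    public static int[] optimize(long[] counts, int k, int restarts) {
        int[] best = null;
        double bestErr = Double.POSITIVE_INFINITY;
        for (int r = 0; r < restarts; r++) {
            int[] cuts = randomCuts(counts.length, k);
            boolean improved = true;
            while (improved) {                      // plain hill climbing
                improved = false;
                for (int i = 0; i < cuts.length; i++) {
                    for (int delta : new int[] {-1, 1}) {
                        int[] cand = cuts.clone();
                        cand[i] += delta;           // nudge one cut left or right
                        if (valid(cand, counts.length)
                                && error(counts, cand) < error(counts, cuts)) {
                            cuts = cand;
                            improved = true;
                        }
                    }
                }
            }
            double err = error(counts, cuts);
            if (err < bestErr) { bestErr = err; best = cuts; }
        }
        return best;
    }

    // k-1 distinct random cut positions in [1, n-1]; assumes k - 1 <= n - 1.
    private static int[] randomCuts(int n, int k) {
        TreeSet<Integer> set = new TreeSet<>();
        while (set.size() < k - 1) set.add(1 + RND.nextInt(n - 1));
        int[] cuts = new int[k - 1];
        int i = 0;
        for (int c : set) cuts[i++] = c;
        return cuts;
    }

    // Cuts must be strictly increasing and stay inside (0, n).
    private static boolean valid(int[] cuts, int n) {
        int prev = 0;
        for (int c : cuts) {
            if (c <= prev || c >= n) return false;
            prev = c;
        }
        return true;
    }

    // Sum of squared differences between each count and the mean of its segment.
    private static double error(long[] counts, int[] cuts) {
        double err = 0;
        int start = 0;
        for (int s = 0; s <= cuts.length; s++) {
            int end = (s == cuts.length) ? counts.length : cuts[s];
            double mean = 0;
            for (int i = start; i < end; i++) mean += counts[i];
            mean /= (end - start);
            for (int i = start; i < end; i++) err += (counts[i] - mean) * (counts[i] - mean);
            start = end;
        }
        return err;
    }
}
```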

13 The top co-occurring shops for a given shop are recommendations for that shop. [sent-71, score-0.634]

14 First, the most popular shops are filtered out using a simple frequency cut, because almost everybody buys at them. [sent-73, score-0.419]

15 Filtering recommendations by location (shops close to each other), by shop category or by both also improves the recommendations. [sent-75, score-0.414]

16 Limiting the time in which a co-occurrence can happen results in recommendations of shops where people bought right after buying at the first one. [sent-77, score-0.552]
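Combining the filters described in the previous sentences, a per-buyer co-occurrence count with a frequency cut and a time window could look roughly like the sketch below; in a real MapReduce flow these per-buyer counts would then be summed across all buyers, and the top entries per shop become its recommendations. All names are illustrative assumptions, not the project's code.

```java
// Sketch of per-buyer co-occurrence counting: every pair of shops a buyer paid
// at within a time window counts as one co-occurrence, after dropping the most
// popular shops with a simple frequency cut. Illustrative only.
import java.util.*;

public class ShopRecommendations {

    public record Payment(String shopId, long timestampMillis) {}

    // buyerPayments: all payments of a single buyer (e.g. grouped by card in a reduce step).
    // popularShops: shops excluded by the frequency cut (almost everybody buys there).
    // windowMillis: only count pairs of payments that happen within this time window.
    public static Map<String, Map<String, Long>> coOccurrences(
            List<Payment> buyerPayments, Set<String> popularShops, long windowMillis) {

        List<Payment> kept = new ArrayList<>();
        for (Payment p : buyerPayments) {
            if (!popularShops.contains(p.shopId())) kept.add(p);
        }
        kept.sort(Comparator.comparingLong(Payment::timestampMillis));

        Map<String, Map<String, Long>> counts = new HashMap<>();
        for (int i = 0; i < kept.size(); i++) {
            for (int j = i + 1; j < kept.size(); j++) {
                Payment a = kept.get(i), b = kept.get(j);
                if (b.timestampMillis() - a.timestampMillis() > windowMillis) break;
                if (a.shopId().equals(b.shopId())) continue;
                counts.computeIfAbsent(a.shopId(), s -> new HashMap<>())
                      .merge(b.shopId(), 1L, Long::sum);
                counts.computeIfAbsent(b.shopId(), s -> new HashMap<>())
                      .merge(a.shopId(), 1L, Long::sum);
            }
        }
        return counts; // top entries per shop become that shop's recommendations
    }
}
```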

17 Particularly, if one buyer is paying in many shops, the number of co-occurrences for this credit card will show quadratic growth, making the analysis not scale linearly. [sent-79, score-0.39]
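To see why the growth is quadratic: a buyer who pays at n distinct shops contributes one co-occurrence for every unordered pair of those shops, that is n(n-1)/2 pairs. A card used at 100 shops yields 4,950 pairs, while one used at 1,000 shops yields 499,500: ten times the shops, roughly a hundred times the co-occurrences.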

18 The cost & some numbers: The amount of information to serve in Voldemort for one year of BBVA's credit card transactions in Spain is 270 GB. [sent-81, score-0.602]

19 The whole infrastructure, including the EC2 instances needed to serve the resulting data, would cost approximately $3,500/month. [sent-84, score-0.413]

20 It would reduce the analysis time and the amount of data to serve, as many aggregations could be performed "on the fly". [sent-90, score-0.429]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('voldemort', 0.405), ('shops', 0.28), ('credit', 0.197), ('shop', 0.187), ('histogram', 0.186), ('histograms', 0.168), ('recommendations', 0.167), ('card', 0.166), ('computed', 0.166), ('hadoop', 0.141), ('statistics', 0.133), ('buyer', 0.132), ('data', 0.109), ('flow', 0.109), ('bought', 0.105), ('serve', 0.104), ('bbva', 0.103), ('bins', 0.103), ('flexbility', 0.103), ('pangool', 0.103), ('input', 0.093), ('performed', 0.087), ('processing', 0.084), ('main', 0.08), ('resulting', 0.079), ('insights', 0.077), ('hill', 0.074), ('filtered', 0.073), ('emr', 0.073), ('serving', 0.072), ('stores', 0.069), ('mapreduce', 0.069), ('amount', 0.068), ('transactions', 0.067), ('simple', 0.066), ('needed', 0.065), ('analysis', 0.061), ('category', 0.06), ('derived', 0.06), ('project', 0.059), ('chunk', 0.058), ('us', 0.057), ('periods', 0.057), ('cluster', 0.056), ('diagram', 0.056), ('approximately', 0.056), ('calculate', 0.055), ('implemented', 0.055), ('developed', 0.053), ('clients', 0.053)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000005 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud


2 0.35192382 634 high scalability-2009-06-20-Building a data cycle at LinkedIn with Hadoop and Project Voldemort

Introduction: Update : Building Voldemort read-only stores with Hadoop . A write up on what LinkedIn is doing to integrate large offline Hadoop data processing jobs with a fast, distributed online key-value storage system, Project Voldemort .

3 0.17095543 663 high scalability-2009-07-28-37signals Architecture

Introduction: Update 7: Basecamp, now with more vroom . Basecamp application servers running Ruby code were upgraded and virtualization was removed. The result: A 66 % reduction in the response time while handling multiples of the traffic is beyond what I expected . They still use virtualization (Linux KVM), just less of it now. Update 6: Things We’ve Learned at 37Signals . Themes: less is more; don't worry be happy. Update 5: Nuts & Bolts: HAproxy . Nice explanation (post, screencast) by Mark Imbriaco of why HAProxy (load balancing proxy server) is their favorite (fast, efficient, graceful configuration, queues requests when Mongrels are busy) for spreading dynamic content between Apache web servers and Mongrel application servers. Update 4: O'Rielly's Tim O'Brien interviews David Hansson , Rails creator and 37signals partner. Says BaseCamp scales horizontally on the application and web tier. Scales up for the database, using one "big ass" 128GB machine. Says: As technology moves on,

4 0.1574185 1313 high scalability-2012-08-28-Making Hadoop Run Faster

Introduction: Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. Batch Processing to the Rescue Hadoop was designed to deal with this challenge in the following ways: 1. Use a distributed file system: This enables us to spread the load and grow our system as needed. 2. Optimize for write speed: To enable fast writes the Hadoop architecture was designed so that writes are first logged, and then processed. This enables fairly fast write speeds. 3. Use batch processing (Map/Reduce) to balance the speed for the data feeds with the processing speed. Batch Processing Challenges The challenge with batch-processing is that it assumes

5 0.15214375 651 high scalability-2009-07-02-Product: Project Voldemort - A Distributed Database

Introduction: Update: Presentation from the NoSQL conference : slides , video 1 , video 2 . Project Voldemort is an open source implementation of the basic parts of Dynamo (Amazon’s Highly Available Key-value Store) distributed key-value storage system. LinkedIn is using it in their production environment for "certain high-scalability storage problems where simple functional partitioning is not sufficient." From their website: Data is automatically replicated over multiple servers. Data is automatically partitioned so each server contains only a subset of the total data Server failure is handled transparently Pluggable serialization is supported to allow rich keys and values including lists and tuples with named fields, as well as to integrate with common serialization frameworks like Protocol Buffers, Thrift, and Java Serialization Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system Each node is independent o

6 0.13768111 675 high scalability-2009-08-08-1dbase vs. many and cloud hosting vs. dedicated server(s)?

7 0.1363001 601 high scalability-2009-05-17-Product: Hadoop

8 0.12919679 448 high scalability-2008-11-22-Google Architecture

9 0.12664099 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge

10 0.12462206 1180 high scalability-2012-01-24-The State of NoSQL in 2012

11 0.11976704 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?

12 0.11846381 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost

13 0.11548519 882 high scalability-2010-08-18-Misco: A MapReduce Framework for Mobile Systems - Start of the Ambient Cloud?

14 0.11440556 96 high scalability-2007-09-18-Amazon Architecture

15 0.110011 1240 high scalability-2012-05-07-Startups are Creating a New System of the World for IT

16 0.10663871 1501 high scalability-2013-08-13-In Memoriam: Lavabit Architecture - Creating a Scalable Email Service

17 0.10469739 589 high scalability-2009-05-05-Drop ACID and Think About Data

18 0.10333045 1618 high scalability-2014-03-24-Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square

19 0.10295285 1355 high scalability-2012-11-05-Gone Fishin': Building Super Scalable Systems: Blade Runner Meets Autonomic Computing In The Ambient Cloud

20 0.10281229 750 high scalability-2009-12-16-Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.202), (1, 0.097), (2, 0.006), (3, 0.022), (4, 0.001), (5, 0.053), (6, 0.063), (7, -0.015), (8, 0.047), (9, 0.062), (10, 0.048), (11, -0.012), (12, 0.062), (13, -0.081), (14, 0.074), (15, 0.003), (16, -0.015), (17, -0.013), (18, -0.019), (19, 0.001), (20, 0.001), (21, 0.016), (22, 0.066), (23, 0.05), (24, 0.04), (25, -0.033), (26, -0.018), (27, -0.006), (28, -0.028), (29, 0.044), (30, 0.032), (31, 0.08), (32, -0.028), (33, -0.005), (34, -0.025), (35, 0.015), (36, -0.019), (37, -0.023), (38, -0.012), (39, -0.026), (40, 0.034), (41, 0.05), (42, -0.029), (43, -0.008), (44, 0.05), (45, -0.0), (46, -0.06), (47, -0.048), (48, -0.016), (49, -0.047)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95710677 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud


2 0.88698256 1313 high scalability-2012-08-28-Making Hadoop Run Faster

Introduction: Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. Batch Processing to the Rescue Hadoop was designed to deal with this challenge in the following ways: 1. Use a distributed file system: This enables us to spread the load and grow our system as needed. 2. Optimize for write speed: To enable fast writes the Hadoop architecture was designed so that writes are first logged, and then processed. This enables fairly fast write speeds. 3. Use batch processing (Map/Reduce) to balance the speed for the data feeds with the processing speed. Batch Processing Challenges The challenge with batch-processing is that it assumes

3 0.85847402 1173 high scalability-2012-01-12-Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs

Introduction: The Peregrine falcon is a bird of prey, famous for its high speed diving attacks , feeding primarily on much slower Hadoops. Wait, sorry, it is Kevin Burton of Spinn3r's new Peregrine project -- a new FAST modern map reduce framework optimized for iterative and pipelined map reduce jobs -- that feeds on Hadoops. If you don't know Kevin, he does a lot of excellent technical work that he's kind enough to share it on his blog . Only he hasn't been blogging much lately, he's been heads down working on Peregrine. Now that Peregrine has been released, here's a short email interview with Kevin on why you might want to take up falconry , the ancient sport of MapReduce. What does Spinn3r do that Peregrine is important to you? Ideally it was designed to execute pagerank but many iterative applications that we deploy and WANT to deploy (k-means) would be horribly inefficient under Hadoop as it doesn't have any support for merging and joining IO between tasks.  It also doesn't support

4 0.83615851 601 high scalability-2009-05-17-Product: Hadoop

Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig . Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3 : Scaling Hadoop to 4000 nodes at Yahoo! . 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2 : Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides . Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity

5 0.82381231 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

Introduction: This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps.... read more on Cloudera website

6 0.82341123 1618 high scalability-2014-03-24-Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square

7 0.81938881 1586 high scalability-2014-01-28-How Next Big Sound Tracks Over a Trillion Song Plays, Likes, and More Using a Version Control System for Hadoop Data

8 0.80853212 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge

9 0.79288161 666 high scalability-2009-07-30-Learn How to Think at Scale

10 0.78047836 956 high scalability-2010-12-08-How To Get Experience Working With Large Datasets

11 0.77112561 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster

12 0.76786029 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012

13 0.76628798 448 high scalability-2008-11-22-Google Architecture

14 0.74199665 1578 high scalability-2014-01-14-Ask HS: Design and Implementation of scalable services?

15 0.7385878 1161 high scalability-2011-12-22-Architecting Massively-Scalable Near-Real-Time Risk Analysis Solutions

16 0.73725533 1362 high scalability-2012-11-26-BigData using Erlang, C and Lisp to Fight the Tsunami of Mobile Data

17 0.73608613 1020 high scalability-2011-04-12-Caching and Processing 2TB Mozilla Crash Reports in memory with Hazelcast

18 0.73583454 882 high scalability-2010-08-18-Misco: A MapReduce Framework for Mobile Systems - Start of the Ambient Cloud?

19 0.73244369 1287 high scalability-2012-07-20-Stuff The Internet Says On Scalability For July 20, 2012

20 0.72788155 1292 high scalability-2012-07-27-Stuff The Internet Says On Scalability For July 27, 2012


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.097), (2, 0.146), (10, 0.061), (30, 0.025), (40, 0.021), (43, 0.013), (49, 0.015), (52, 0.144), (61, 0.107), (77, 0.021), (79, 0.118), (85, 0.069), (94, 0.071)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9100346 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud


2 0.90218008 872 high scalability-2010-08-05-Pairing NoSQL and Relational Data Storage: MySQL with MongoDB

Introduction: I’ve largely steered clear of publicly commenting on the “NoSQL vs. Relational” conflict. Keeping in mind that this argument is more about currently available solutions and the features their developers have chosen to build in, I’d like to dig into this and provide a decidedly neutral viewpoint. In fact, by erring on the side of caution, I’ve inadvertently given myself plenty of time to consider the pros and cons of both data storage approaches, and although my mind was initially swaying toward the NoSQL camp, I can say with a fair amount of certainty, that I’ve found a good compromise.  You can read the full store here .

3 0.8876363 47 high scalability-2007-07-30-Product: Yslow to speed up your web pages

Introduction: Update : Speed up Apache - how I went from F to A in YSlow . Good example of using YSlow to speed up a website with solid code examples. Every layer in the multi-layer cake that is your website contributes to how long a page takes to display. YSlow , from Yahoo, is a cool tool for discovering how the ingredients of your site's top layer contribute to performance. YSlow analyzes web pages and tells you why they're slow based on the rules for high performance web sites. YSlow is a Firefox add-on integrated with the popular Firebug web development tool. YSlow gives you: Performance report card HTTP/HTML summary List of components in the page Tools including JSLint

4 0.88691151 244 high scalability-2008-02-11-Yahoo Live's Scaling Problems Prove: Release Early and Often - Just Don't Screw Up

Introduction: Tech Crunch chomped down on some initial scaling problems with Yahoo's new live video streaming service Yahoo Live . After a bit of chewing on Yahoo's old bones, TC spat out: If Yahoo cant scale something like this (no matter how much they claim it’s an experiment, it’s still a live service), it shows how far the once brightest star of the online world has fallen . This kind of thinking kills innovation. When there's no room for a few hiccups or a little failure you have to cover your ass so completely nothing new will ever see the light of day. I thought we were supposed to be agile . We are supposed to release early and often. Not every 'i' has to be dotted and not every last router has to be installed before we take the first step of a grand new journey. Get it out there. Let users help you make it better. Listen to customers, make changes, push the new code out, listen some more, and fix problems as they come up. Following this process we'll make somethi

5 0.86575574 1117 high scalability-2011-09-16-Stuff The Internet Says On Scalability For September 16, 2011

Introduction: Between love and madness lies  HighScalability : Google now 10x better : MapReduce sorts 1 petabyte of data using 8000 computers in 33 minutes;  1 Billion on Social Networks ; Tumblr at 10 Billion Posts ; Twitter at 100 Million Users ; Testing at Google Scale : 1800 builds, 120 million test suites, 60 million tests run daily. From the Dash Memo on Google's Plan: Go is a very promising systems-programming language in the vein of C++. We fully hope and expect that Go becomes the standard back-end language at Google over the next few years. On GAE  Go can load from  a cold start in 100ms and the typical instance size is 4MB. Is it any wonder Go is a go? Should we expect to see Java and Python deprecated because Go is so much cheaper to run at scale? Potent Quotables: @caciufo : 30x more scalability w/ many-core. So perf doesn't have to level out or vex programmers. #IDF2011 @joerglew : Evaluating divide&conquer; vs. master-slave architecture for wor

6 0.86305517 156 high scalability-2007-11-16-Mogulus Doesn't Own a Single Server and has $1.2 million in funding, 15,000 People Creating Channels

7 0.8611334 1451 high scalability-2013-05-03-Stuff The Internet Says On Scalability For May 3, 2013

8 0.85142893 882 high scalability-2010-08-18-Misco: A MapReduce Framework for Mobile Systems - Start of the Ambient Cloud?

9 0.84403723 1432 high scalability-2013-04-01-Khan Academy Checkbook Scaling to 6 Million Users a Month on GAE

10 0.84343976 1573 high scalability-2014-01-06-How HipChat Stores and Indexes Billions of Messages Using ElasticSearch and Redis

11 0.84227681 1389 high scalability-2013-01-18-Stuff The Internet Says On Scalability For January 18, 2013

12 0.83870506 851 high scalability-2010-07-02-Hot Scalability Links for July 2, 2010

13 0.83704919 1085 high scalability-2011-07-25-Is NoSQL a Premature Optimization that's Worse than Death? Or the Lady Gaga of the Database World?

14 0.83402634 1559 high scalability-2013-12-06-Stuff The Internet Says On Scalability For December 6th, 2013

15 0.8322528 716 high scalability-2009-10-06-Building a Unique Data Warehouse

16 0.83202761 984 high scalability-2011-02-04-Stuff The Internet Says On Scalability For February 4, 2011

17 0.83146524 1649 high scalability-2014-05-16-Stuff The Internet Says On Scalability For May 16th, 2014

18 0.83043623 1109 high scalability-2011-09-02-Stuff The Internet Says On Scalability For September 2, 2011

19 0.83021325 1180 high scalability-2012-01-24-The State of NoSQL in 2012

20 0.82970917 1604 high scalability-2014-03-03-The “Four Hamiltons” Framework for Mitigating Faults in the Cloud: Avoid it, Mask it, Bound it, Fix it Fast