high_scalability high_scalability-2012 high_scalability-2012-1313 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. Batch Processing to the Rescue Hadoop was designed to deal with this challenge in the following ways: 1. Use a distributed file system: This enables us to spread the load and grow our system as needed. 2. Optimize for write speed: To enable fast writes, the Hadoop architecture was designed so that writes are first logged and then processed. This enables fairly fast write speeds. 3. Use batch processing (Map/Reduce) to balance the speed of the data feeds with the processing speed. Batch Processing Challenges The challenge with batch-processing is that it assumes
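To make the batch Map/Reduce model from the introduction concrete, here is a minimal sketch in the style of a Hadoop Streaming job written in Python. The tab-separated event format and the local sort used to stand in for Hadoop's shuffle phase are illustrative assumptions, not details from the article.

```python
#!/usr/bin/env python
# Minimal sketch of a batch Map/Reduce job (Hadoop Streaming style).
# Assumed input format: "timestamp<TAB>event_type" lines on stdin.
import sys

def mapper(lines):
    """Emit one (key, 1) pair per input record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[1], 1          # key on the event type

def reducer(pairs):
    """Sum counts per key; assumes input sorted by key, as Hadoop's shuffle guarantees."""
    current_key, total = None, 0
    for key, count in pairs:
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += count
    if current_key is not None:
        yield current_key, total

if __name__ == "__main__":
    # Local simulation of the pipeline:  cat events.tsv | python job.py
    mapped = sorted(mapper(sys.stdin))  # sorted() stands in for the shuffle/sort step
    for key, total in reducer(mapped):
        print(f"{key}\t{total}")
```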
sentIndex sentText sentNum sentScore
1 Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. [sent-1, score-0.508]
2 This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. [sent-2, score-0.988]
3 Use batch processing (Map/Reduce) to balance the speed of the data feeds with the processing speed. [sent-9, score-1.696]
4 Batch Processing Challenges The challenge with batch-processing is that it assumes that the feeds come in bursts. [sent-10, score-0.37]
5 If our data feeds come in on a continuous basis, the entire assumption and architecture behind batch processing start to break down. [sent-11, score-1.189]
6 If we increase the batch window, the result is higher latency between the time the data comes in and the time we actually get it into our reports and insights. [sent-12, score-0.38]
7 Moreover, the number of available hours is finite -- in many systems the batch window runs on a daily basis. [sent-13, score-0.485]
8 Often, the assumption is that most of the processing can be done during off-peak hours. [sent-14, score-0.638]
9 But as the volume gets bigger, the time it takes to process the data gets longer, until it reaches the limit of the hours in a day, and then we are left with a continuously growing backlog. [sent-15, score-0.664]
10 In addition, if we experience a failure during the processing we might not have enough time to re-process. [sent-16, score-0.52]
11 Making Hadoop Run Faster We can make our Hadoop system run faster by pre-processing some of the work before it gets into our Hadoop system. [sent-17, score-0.308]
12 We can also move the types of workload for which batch processing isn't a good fit out of the Hadoop Map/Reduce system and use Stream Processing, as Google did. [sent-18, score-0.824]
13 Speed Things Up Through Stream-Based Processing The concept of stream-based processing is fairly simple. [sent-19, score-0.603]
14 Instead of logging the data first and then processing it, we can process it as it comes in. [sent-20, score-0.783]
15 A good analogy to explain the difference is a manufacturing pipeline. [sent-21, score-0.266]
16 Think about a car manufacturing pipeline: Compare a process in which all the raw parts are shipped to the assembly site and put together piece by piece, versus one in which each unit is pre-packaged at the manufacturer and only the pre-packaged parts are sent to the assembly line. [sent-22, score-1.463]
17 Putting stream-based processing at the front is analogous to pre-packaging our parts before they get to the assembly line, which in our case is the Hadoop batch processing system. [sent-25, score-1.729]
18 As in manufacturing, even if we pre-package the parts at the manufacturer we still need an assembly line to put all the parts together. [sent-26, score-0.78]
19 In the same way, stream-based processing is not meant to replace our Hadoop system, but rather to reduce the amount of work that the system needs to deal with, and to make the work that does go into the Hadoop process easier, and thus faster, to process. [sent-27, score-0.957]
20 In-memory stream processing can make a good stream processing system; as Curt Monash points out in his research, traditional databases will eventually end up in RAM. [sent-28, score-1.274]
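To illustrate the "pre-packaging" idea in code, here is a minimal sketch of an in-memory stream pre-aggregator that summarizes events as they arrive and periodically flushes compact summaries for the downstream batch system. The event names, flush threshold, and class shape are illustrative assumptions, not anything prescribed by the article.

```python
# Sketch of stream-based pre-processing: aggregate events in memory as they
# arrive, then flush compact summaries downstream (e.g. into HDFS for the
# batch job) instead of shipping every raw record.
from collections import Counter

class StreamPreAggregator:
    def __init__(self, flush_every=10_000):
        self.counts = Counter()
        self.seen = 0
        self.flush_every = flush_every

    def on_event(self, event_type):
        """Called for every incoming event instead of logging the raw record."""
        self.counts[event_type] += 1
        self.seen += 1
        if self.seen >= self.flush_every:
            return self.flush()
        return None

    def flush(self):
        """Emit the compact summary that the batch system actually ingests."""
        summary, self.counts, self.seen = dict(self.counts), Counter(), 0
        return summary

if __name__ == "__main__":
    agg = StreamPreAggregator(flush_every=3)
    for e in ["click", "view", "click", "view", "click", "purchase"]:
        batch = agg.on_event(e)
        if batch:
            print("flush to Hadoop:", batch)   # e.g. {'click': 2, 'view': 1}
```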
wordName wordTfidf (topN-words)
[('processing', 0.52), ('hadoop', 0.292), ('batch', 0.235), ('feeds', 0.229), ('assembly', 0.207), ('manufacturing', 0.198), ('manufacturer', 0.174), ('parts', 0.168), ('assumption', 0.118), ('process', 0.118), ('stream', 0.117), ('window', 0.109), ('speed', 0.105), ('faster', 0.093), ('gets', 0.093), ('piece', 0.09), ('data', 0.087), ('volume', 0.085), ('enables', 0.084), ('assembling', 0.084), ('fairly', 0.083), ('pronounced', 0.081), ('putting', 0.081), ('thus', 0.079), ('analogous', 0.079), ('finite', 0.075), ('challenge', 0.073), ('context', 0.072), ('packaging', 0.07), ('system', 0.069), ('assumes', 0.068), ('analogy', 0.068), ('curt', 0.067), ('hours', 0.066), ('deal', 0.065), ('corresponding', 0.064), ('moreover', 0.064), ('line', 0.063), ('monash', 0.062), ('reaches', 0.061), ('growing', 0.061), ('logged', 0.059), ('writes', 0.059), ('diagram', 0.058), ('comes', 0.058), ('car', 0.057), ('demonstrate', 0.055), ('pipeline', 0.055), ('work', 0.053), ('insights', 0.053)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000002 1313 high scalability-2012-08-28-Making Hadoop Run Faster
Introduction: Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. Batch Processing to the Rescue Hadoop was designed to deal with this challenge in the following ways: 1. Use a distributed file system: This enables us to spread the load and grow our system as needed. 2. Optimize for write speed: To enable fast writes, the Hadoop architecture was designed so that writes are first logged and then processed. This enables fairly fast write speeds. 3. Use batch processing (Map/Reduce) to balance the speed of the data feeds with the processing speed. Batch Processing Challenges The challenge with batch-processing is that it assumes
2 0.16719547 897 high scalability-2010-09-08-4 General Core Scalability Patterns
Introduction: Jesper Söderlund put together an excellent list of four general scalability patterns and four subpatterns in his post Scalability patterns and an interesting story: Load distribution - Spread the system load across multiple processing units Load balancing / load sharing - Spreading the load across many components with equal properties for handling the request Partitioning - Spreading the load across many components by routing an individual request to a component that owns that specific data Vertical partitioning - Spreading the load across the functional boundaries of a problem space, separate functions being handled by different processing units Horizontal partitioning - Spreading a single type of data element across many instances, according to some partitioning key, e.g. hashing the player id and doing a modulus operation, etc. Quite often referred to as sharding. Queuing and batch - Achieve efficiencies of scale by
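As a small illustration of the horizontal-partitioning (sharding) pattern mentioned above -- hashing a key and taking a modulus -- here is a hedged Python sketch; the shard count and key name are assumptions for the example, not details from the post.

```python
# Sketch of hash-based horizontal partitioning: route each record to a shard
# by hashing its partitioning key and reducing it modulo the shard count.
import hashlib

NUM_SHARDS = 4  # illustrative assumption

def shard_for(player_id: str) -> int:
    """Stable hash of the partitioning key, reduced modulo the shard count."""
    digest = hashlib.md5(player_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

if __name__ == "__main__":
    for pid in ["player-1", "player-2", "player-42"]:
        print(pid, "-> shard", shard_for(pid))
```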
3 0.16641219 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2
Introduction: This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps.... read more on Cloudera website
4 0.16167958 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
Introduction: Dr. Daniel Abadi, author of the DBMS Musings blog and Cofounder of Hadapt, which offers a product improving Hadoop performance by 50x on relational data, is now taking his talents to graph data in Hadoop's tremendous inefficiency on graph data management (and how to avoid it), which shares the secrets of getting Hadoop to perform 1000x better on graph data. TL;DR: Analysing graph data is at the heart of important data mining problems. Hadoop is the tool of choice for many of these problems. Hadoop style MapReduce works best on KeyValue processing, not graph processing, and can be well over a factor of 1000 less efficient than it needs to be. Hadoop inefficiency has consequences in the real world. Inefficiencies on graph data problems like improving power utilization, minimizing carbon emissions, and improving product designs lead to a lot of value being left on the table in the form of negative environmental consequences, increased server costs, increased data center spa
5 0.16114375 601 high scalability-2009-05-17-Product: Hadoop
Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig . Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3 : Scaling Hadoop to 4000 nodes at Yahoo! . 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2 : Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides . Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity
7 0.15340465 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
9 0.14853221 666 high scalability-2009-07-30-Learn How to Think at Scale
10 0.14415415 414 high scalability-2008-10-15-Hadoop - A Primer
11 0.14026915 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge
12 0.13916145 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
13 0.13555382 1161 high scalability-2011-12-22-Architecting Massively-Scalable Near-Real-Time Risk Analysis Solutions
14 0.13460037 1110 high scalability-2011-09-06-Big Data Application Platform
15 0.13439861 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
16 0.13108623 650 high scalability-2009-07-02-Product: Hbase
17 0.1290963 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
18 0.12832369 7 high scalability-2007-07-12-FeedBurner Architecture
20 0.12334289 1216 high scalability-2012-03-27-Big Data In the Cloud Using Cloudify
topicId topicWeight
[(0, 0.185), (1, 0.09), (2, -0.007), (3, 0.053), (4, 0.015), (5, 0.07), (6, 0.082), (7, 0.055), (8, 0.056), (9, 0.063), (10, 0.078), (11, 0.019), (12, 0.085), (13, -0.11), (14, 0.092), (15, -0.028), (16, -0.015), (17, -0.025), (18, -0.023), (19, 0.065), (20, -0.016), (21, 0.034), (22, 0.132), (23, 0.038), (24, 0.052), (25, 0.019), (26, 0.016), (27, 0.04), (28, -0.016), (29, 0.079), (30, 0.096), (31, 0.12), (32, -0.008), (33, 0.023), (34, -0.035), (35, 0.022), (36, -0.045), (37, 0.015), (38, 0.003), (39, -0.049), (40, 0.033), (41, 0.025), (42, -0.087), (43, -0.0), (44, 0.091), (45, 0.031), (46, -0.034), (47, -0.076), (48, -0.046), (49, -0.011)]
simIndex simValue blogId blogTitle
same-blog 1 0.97426522 1313 high scalability-2012-08-28-Making Hadoop Run Faster
Introduction: Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. Batch Processing to the Rescue Hadoop was designed to deal with this challenge in the following ways: 1. Use a distributed file system: This enables us to spread the load and grow our system as needed. 2. Optimize for write speed: To enable fast writes, the Hadoop architecture was designed so that writes are first logged and then processed. This enables fairly fast write speeds. 3. Use batch processing (Map/Reduce) to balance the speed of the data feeds with the processing speed. Batch Processing Challenges The challenge with batch-processing is that it assumes
Introduction: This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps.... read more on Cloudera website
Introduction: This is a guest post by Ivan de Prado and Pere Ferrera , founders of Datasalt , the company behind Pangool and Splout SQL Big Data open-source projects. The amount of payments performed using credit cards is huge. It is clear that there is inherent value in the data that can be derived from analyzing all the transactions. Client fidelity, demographics, heat maps of activity, shop recommendations, and many other statistics are useful to both clients and shops for improving their relationship with the market. At Datasalt we have developed a system in collaboration with the BBVA bank that is able to analyze years of data and serve insights and statistics to different low-latency web and mobile applications. The main challenge we faced besides processing Big Data input is that the output was also Big Data, and even bigger than the input . And this output needed to be served quickly, under high load. The solution we developed has an infrastructure cost of just a few tho
4 0.82961828 601 high scalability-2009-05-17-Product: Hadoop
Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig . Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3 : Scaling Hadoop to 4000 nodes at Yahoo! . 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2 : Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides . Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity
5 0.80945331 1173 high scalability-2012-01-12-Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs
Introduction: The Peregrine falcon is a bird of prey, famous for its high speed diving attacks, feeding primarily on much slower Hadoops. Wait, sorry, it is Kevin Burton of Spinn3r's new Peregrine project -- a new FAST modern map reduce framework optimized for iterative and pipelined map reduce jobs -- that feeds on Hadoops. If you don't know Kevin, he does a lot of excellent technical work that he's kind enough to share on his blog. Only he hasn't been blogging much lately; he's been heads down working on Peregrine. Now that Peregrine has been released, here's a short email interview with Kevin on why you might want to take up falconry, the ancient sport of MapReduce. What does Spinn3r do that makes Peregrine important to you? Ideally it was designed to execute pagerank but many iterative applications that we deploy and WANT to deploy (k-means) would be horribly inefficient under Hadoop as it doesn't have any support for merging and joining IO between tasks. It also doesn't support
6 0.78539872 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster
8 0.76952493 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
9 0.76686174 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012
10 0.76619077 666 high scalability-2009-07-30-Learn How to Think at Scale
11 0.76049602 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
12 0.74849951 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
13 0.74539918 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost
15 0.7210688 376 high scalability-2008-09-03-MapReduce framework Disco
16 0.71258348 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge
17 0.7061584 414 high scalability-2008-10-15-Hadoop - A Primer
18 0.70117247 956 high scalability-2010-12-08-How To Get Experience Working With Large Datasets
19 0.68698907 650 high scalability-2009-07-02-Product: Hbase
20 0.67954749 780 high scalability-2010-02-19-Twitter’s Plan to Analyze 100 Billion Tweets
topicId topicWeight
[(1, 0.166), (2, 0.23), (30, 0.018), (40, 0.026), (56, 0.022), (61, 0.099), (73, 0.104), (79, 0.153), (85, 0.015), (94, 0.078)]
simIndex simValue blogId blogTitle
1 0.97015798 1175 high scalability-2012-01-17-Paper: Feeding Frenzy: Selectively Materializing Users’ Event Feeds
Introduction: How do you scale an inbox that has multiple highly volatile feeds? That's a problem faced by social networks like Tumblr, Facebook, and Twitter. Follow a few hundred event sources and it's hard to scalably order an inbox so that you see a correct view as event sources continually publish new events. This can be considered like a view materialization problem in a database. In a database a view is a virtual table defined by a query that can be accessed like a table. Materialization refers to when the data behind the view is created. If a view is a join on several tables and that join is performed when the view is accessed, then performance will be slow. If the view is precomputed access to the view will be fast, but more resources are used, especially considering that the view may never be accessed. Your wall/inbox/stream is a view on all the people/things you follow. If you never look at your inbox then materializing the view in your inbox is a waste of resources, yet you'll be ma
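To illustrate the selective-materialization idea from the paper, here is a hedged Python sketch of the push-vs-pull decision: materialize a producer's events into a follower's inbox only when the follower reads often relative to how fast the producer publishes, and compute the view lazily otherwise. The rate fields and threshold are illustrative assumptions, not the paper's exact cost model.

```python
# Sketch of selective materialization for event feeds: push when reads are
# frequent enough to amortize the write cost, otherwise pull at read time.

def should_materialize(follower_read_rate: float,
                       producer_publish_rate: float,
                       threshold: float = 1.0) -> bool:
    """Return True to push (materialize) events into the follower's inbox."""
    if producer_publish_rate == 0:
        return True
    return (follower_read_rate / producer_publish_rate) >= threshold

if __name__ == "__main__":
    # Active reader following a quiet producer -> push into the inbox.
    print(should_materialize(follower_read_rate=50, producer_publish_rate=2))
    # Rare reader following a firehose -> compute the view lazily at read time.
    print(should_materialize(follower_read_rate=0.1, producer_publish_rate=100))
```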
2 0.96469986 980 high scalability-2011-01-28-Stuff The Internet Says On Scalability For January 28, 2011
Introduction: Submitted for your reading pleasure... Something we get to say more often than you might expect - funny NoSQL comic: How to Write a CV (SFW) Playtomic shows hows how to handle over 300 million events per day, in real time, on a budget . More Speed, at $80,000 a Millisecond . Does latency matter ? Oh yes... “On the Chicago to New York route in the US, three milliseconds can mean the difference between US$2,000 a month and US$250,000 a month.” Quotable Quotes @jkalucki : Throwing 1,920 CPUs and 4TB of RAM at an annoyance, as you do. @jointheflock @hkanji : Scale can come quick and come hard. Be prepared. @elenacarstoiu : When you say #Cloud, everybody's thinking lower cost. Agility, scalability and fast access are advantages far more important. @BillGates : From Melinda - Research proves we can save newborn lives at scale Kosmix with a fascinating look at Cassandra on SSD , summarizing some of what they've learned over the past year runni
3 0.9642449 33 high scalability-2007-07-26-ThemBid Architecture
Introduction: ThemBid provides a market where people needing work done broadcast their request and accept bids from people competing for the job. Unlike many of the sites profiled at HighScalability, ThemBid is not in the popular press as often as Paris Hilton. It's not a media darling or a giant of the industry. But what I like is they have a strategy, a point-of-view for building websites and were gracious enough to share very detailed instructions on how to go about building a website. They even delve into actual installation details of the various software packages they use. Anyone can benefit by taking a look at their work. Site: http://www.thembid.com/ Information Sources Build Scalable Web 2.0 Sites with Ubuntu, Symfony, and Lighttpd Platform Linux (Ubuntu) Symfony Lighttpd PHP eAccelerator Eclipse Munin AWStats What's Inside? The Stats Started work in December of 2006 and had a full demo by March 2007. One developer/sys admin worked with a pa
4 0.96117216 1587 high scalability-2014-01-29-10 Things Bitly Should Have Monitored
Introduction: Monitor, monitor, monitor. That's the advice every startup gives once they reach a certain size. But can you ever monitor enough? If you are Bitly and everyone will complain when you are down, probably not. Here are 10 Things We Forgot to Monitor from Bitly, along with good stories and copious amounts of code snippets. Well worth reading, especially after you've already started monitoring the lower hanging fruit. An interesting revelation from the article is that: We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2. Fork Rate. A strange configuration issue caused processes to be created at a rate of several hundred a second rather than the expected 1-10/second. Flow control packets. A network configuration that honors flow control packets and isn't configured to disable them can temporarily cause dropped traffic. Swap In/Out Rate. Measure the right thing. It's the rate memory is swapped
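As a small sketch of the fork-rate check mentioned above: on Linux, /proc/stat exposes a cumulative "processes" counter (forks since boot), so sampling it twice gives forks per second. The alert threshold below is an illustrative assumption, not Bitly's actual check.

```python
# Sketch of fork-rate monitoring via /proc/stat (Linux-only).
import time

def total_forks() -> int:
    """Cumulative number of forks since boot, from the 'processes' line."""
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("processes "):
                return int(line.split()[1])
    raise RuntimeError("processes counter not found in /proc/stat")

def fork_rate(interval: float = 1.0) -> float:
    before = total_forks()
    time.sleep(interval)
    return (total_forks() - before) / interval

if __name__ == "__main__":
    rate = fork_rate()
    print(f"forks/sec: {rate:.1f}")
    if rate > 100:  # illustrative threshold; Bitly saw several hundred/sec when misconfigured
        print("WARNING: unusually high fork rate")
```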
Introduction: It's time to do something a little different and for me that doesn't mean cutting off my hair and joining a monastery, nor does it mean buying a cherry red convertible (yet), it means doing a webinar! On December 14th, 2:00 PM - 3:00 PM EST, I'll be hosting What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications . The webinar is sponsored by VoltDB, but it will be completely vendor independent, as that's the only honor preserving and technically accurate way of doing these things. The webinar will run about 60 minutes, with 40 minutes of speechifying and 20 minutes for questions. The hashtag for the event on Twitter will be SQLNoSQL . I'll be monitoring that hashtag if you have any suggestions for the webinar or if you would like to ask questions during the webinar. The motivation for me to do the webinar was a talk I had with another audience member at the NoSQL Evening in Palo Alto . He said he came from a Java background and was confused ab
same-blog 7 0.95370531 1313 high scalability-2012-08-28-Making Hadoop Run Faster
8 0.9508127 1642 high scalability-2014-05-02-Stuff The Internet Says On Scalability For May 2nd, 2014
9 0.94830668 517 high scalability-2009-02-21-Google AppEngine - A Second Look
10 0.94331849 1037 high scalability-2011-05-10-Viddler Architecture - 7 Million Embeds a Day and 1500 Req-Sec Peak
11 0.9430849 709 high scalability-2009-09-19-Space Based Programming in .NET
12 0.94244879 301 high scalability-2008-04-08-Google AppEngine - A First Look
13 0.94186461 1102 high scalability-2011-08-22-Strategy: Run a Scalable, Available, and Cheap Static Site on S3 or GitHub
14 0.94154412 229 high scalability-2008-01-29-Building scalable storage into application - Instead of MogileFS OpenAFS etc.
15 0.93986285 264 high scalability-2008-03-03-Read This Site and Ace Your Next Interview!
16 0.93985736 837 high scalability-2010-06-07-Six Ways Twitter May Reach its Big Hairy Audacious Goal of One Billion Users
17 0.93849951 949 high scalability-2010-11-29-Stuff the Internet Says on Scalability For November 29th, 2010
18 0.93749726 776 high scalability-2010-02-12-Hot Scalability Links for February 12, 2010
19 0.93718874 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)
20 0.93682593 1649 high scalability-2014-05-16-Stuff The Internet Says On Scalability For May 16th, 2014