high_scalability high_scalability-2012 high_scalability-2012-1313 knowledge-graph by maker-knowledge-mining

1313 high scalability-2012-08-28-Making Hadoop Run Faster


meta info for this blog

Source: html

Introduction: Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. Batch Processing to the Rescue Hadoop was designed to deal with this challenge in the following ways: 1. Use a distributed file system: This enables us to spread the load and grow our system as needed. 2. Optimize for write speed: To enable fast writes the Hadoop architecture was designed so that writes are first logged, and then processed. This enables fairly fast write speeds. 3. Use batch processing (Map/Reduce) to balance the speed for the data feeds with the processing speed. Batch Processing Challenges The challenge with batch-processing is that it assumes


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. [sent-1, score-0.508]

2 This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. [sent-2, score-0.988]

3 Use batch processing (Map/Reduce) to balance the speed for the data feeds with the processing speed. [sent-9, score-1.696]

4 Batch Processing Challenges The challenge with batch-processing is that it assumes that the feeds come in bursts. [sent-10, score-0.37]

5 If our data feeds come in on a continuous basis, the entire assumption and architecture behind batch processing starts to break down. [sent-11, score-1.189]

6 If we increase the batch window, the result is higher latency between the time the data comes in and the time we actually get it into our reports and insights. [sent-12, score-0.38]

7 Moreover, the number of hours is finite -- in many systems the batch window runs on a daily basis. [sent-13, score-0.485]

8 Often, the assumption is that most of the processing can be done during off-peak hours. [sent-14, score-0.638]

9 But as the volume gets bigger, the time it takes to process the data gets longer, until it reaches the limit of the hours in a day, and then we face a continuously growing backlog. [sent-15, score-0.664]

10 In addition, if we experience a failure during the processing we might not have enough time to re-process. [sent-16, score-0.52]
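To make the backlog argument in sentences 7-9 concrete, here is a toy calculation (the numbers are hypothetical, not from the post): if the daily feed grows while the nightly batch window stays fixed, whatever does not fit in the window carries over, and the backlog compounds day after day.

# Toy model of a fixed nightly batch window vs. a growing feed
# (hypothetical numbers, for illustration only).
throughput_gb_per_hour = 100   # what the cluster can process in an hour
window_hours = 8               # fixed nightly batch window
daily_volume_gb = 700.0        # today's feed
growth_rate = 0.05             # feed grows 5% per day

backlog_gb = 0.0
for day in range(1, 11):
    capacity_gb = throughput_gb_per_hour * window_hours
    to_process = daily_volume_gb + backlog_gb
    backlog_gb = max(0.0, to_process - capacity_gb)   # whatever didn't fit carries over
    print(f"day {day:2d}: arrived {daily_volume_gb:7.1f} GB, "
          f"capacity {capacity_gb:.0f} GB, backlog {backlog_gb:7.1f} GB")
    daily_volume_gb *= 1 + growth_rate

Running this, the backlog is zero for the first few days, appears on day 4, and then grows every day thereafter, exactly the failure mode the sentences above describe.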

11 Making Hadoop Run Faster We can make our Hadoop system run faster by pre-processing some of the work before it gets into our Hadoop system. [sent-17, score-0.308]

12 We can also move the types of workload for which batch processing isn't a good fit out of the Hadoop Map/Reduce system and use Stream Processing, as Google did. [sent-18, score-0.824]

13 Speed Things Up Through Stream-Based Processing The concept of stream-based processing is fairly simple. [sent-19, score-0.603]

14 Instead of logging the data first and then processing it, we can process it as it comes in. [sent-20, score-0.783]
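A minimal sketch of that difference (the handler names and record shape are hypothetical, not from the post): a batch pipeline appends records to a log and only produces results when the batch job later runs over the whole log, while a streaming pipeline folds each record into the running result as it arrives.

# Minimal sketch of log-then-process vs. process-on-arrival (illustrative only).
from collections import defaultdict

# Batch style: write first, process later over the accumulated log.
log = []

def batch_ingest(record):
    log.append(record)               # fast write; no processing yet

def batch_job():
    counts = defaultdict(int)
    for record in log:               # latency: nothing is visible until this runs
        counts[record["key"]] += 1
    return counts

# Stream style: fold each record into the running result as it arrives.
running_counts = defaultdict(int)

def stream_ingest(record):
    running_counts[record["key"]] += 1   # result is always up to date

for r in [{"key": "clicks"}, {"key": "views"}, {"key": "clicks"}]:
    batch_ingest(r)
    stream_ingest(r)

print(batch_job())        # available only after the batch job runs
print(running_counts)     # already current when the last record arrived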

15 A good analogy to explain the difference is a manufacturing pipeline. [sent-21, score-0.266]

16 Think about a car manufacturing pipeline: compare a process in which all the raw parts arrive at the assembly line and are put together piece by piece, versus one in which each unit is packaged at the manufacturer and only the pre-packaged parts are sent to the assembly line. [sent-22, score-1.463]

17 Putting stream-based processing at the front is analogous to pre-packaging our parts before they get to the assembly line, which in our case is the Hadoop batch processing system. [sent-25, score-1.729]

18 As in manufacturing, even if we pre-package the parts at the manufacturer we still need an assembly line to put all the parts together. [sent-26, score-0.78]

19 In the same way, stream-based processing is not meant to replace our Hadoop system, but rather to reduce the amount of work that the system needs to deal with, and to make the work that does go into the Hadoop process easier, and thus faster, to process. [sent-27, score-0.957]
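One way to read the pre-packaging idea of sentences 17-19 in code (a sketch under assumed names; the post does not prescribe an implementation): the stream stage rolls raw events up into per-key aggregates as they arrive, so the batch system ingests a handful of summary rows instead of every raw event.

# Sketch of stream-side pre-aggregation shrinking the batch system's input
# (hypothetical event shape; illustrative only).
from collections import defaultdict

raw_events = [
    {"user": "a", "bytes": 120}, {"user": "b", "bytes": 300},
    {"user": "a", "bytes": 80},  {"user": "a", "bytes": 200},
    {"user": "b", "bytes": 150},
]

# Stream stage: maintain one rollup per key as events arrive ("pre-packaging").
rollups = defaultdict(lambda: {"events": 0, "bytes": 0})
for event in raw_events:
    r = rollups[event["user"]]
    r["events"] += 1
    r["bytes"] += event["bytes"]

# Batch stage now receives 2 summary rows instead of 5 raw events.
for user, summary in rollups.items():
    print(user, summary)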

20 In-memory stream processing can make a good stream processing system; as Curt Monash points out in his research, traditional databases will eventually end up in RAM. [sent-28, score-1.274]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('processing', 0.52), ('hadoop', 0.292), ('batch', 0.235), ('feeds', 0.229), ('assembly', 0.207), ('manufacturing', 0.198), ('manufacturer', 0.174), ('parts', 0.168), ('assumption', 0.118), ('process', 0.118), ('stream', 0.117), ('window', 0.109), ('speed', 0.105), ('faster', 0.093), ('gets', 0.093), ('piece', 0.09), ('data', 0.087), ('volume', 0.085), ('enables', 0.084), ('assembling', 0.084), ('fairly', 0.083), ('pronounced', 0.081), ('putting', 0.081), ('thus', 0.079), ('analogous', 0.079), ('finite', 0.075), ('challenge', 0.073), ('context', 0.072), ('packaging', 0.07), ('system', 0.069), ('assumes', 0.068), ('analogy', 0.068), ('curt', 0.067), ('hours', 0.066), ('deal', 0.065), ('corresponding', 0.064), ('moreover', 0.064), ('line', 0.063), ('monash', 0.062), ('reaches', 0.061), ('growing', 0.061), ('logged', 0.059), ('writes', 0.059), ('diagram', 0.058), ('comes', 0.058), ('car', 0.057), ('demonstrate', 0.055), ('pipeline', 0.055), ('work', 0.053), ('insights', 0.053)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 1313 high scalability-2012-08-28-Making Hadoop Run Faster


2 0.16719547 897 high scalability-2010-09-08-4 General Core Scalability Patterns

Introduction: Jesper Söderlund put together an excellent list of four general scalability patterns and four subpatterns in his post Scalability patterns and an interesting story : Load distribution - Spread the system load across multiple processing units Load balancing / load sharing - Spreading the load across many components with equal properties for handling the request Partitioning - Spreading the load across many components by routing an individual request to a component that owns that data specific Vertical partitioning - Spreading the load across the functional boundaries of a problem space, separate functions being handled by different processing units Horizontal partitioning - Spreading a single type of data element across many instances, according to some partitioning key, e.g. hashing the player id and doing a modulus operation, etc. Quite often referred to as sharding. Queuing and batch  - Achieve efficiencies of scale by

3 0.16641219 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

Introduction: This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps.... read more on Cloudera website

4 0.16167958 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems

Introduction: Dr. Daniel Abadi, author of the DBMS Musings blog and Cofounder of Hadapt , which offers a product improving Hadoop performance by 50x on relational data, is now taking his talents to graph data in Hadoop's tremendous inefficiency on graph data management (and how to avoid it) , which shares the secrets of getting Hadoop to perform 1000x better on graph data. TL;DR: Analysing graph data is at the heart of important data mining problems . Hadoop is the tool of choice for many of these problems. Hadoop style MapReduce works best on KeyValue processing, not graph processing, and can be well over a factor of 1000 less efficient than it needs to be. Hadoop inefficiency has consequences in the real world. Inefficiencies on graph data problems like improving power utilization, minimizing carbon emissions, and improving product designs lead to a lot of value being left on the table in the form of negative environmental consequences, increased server costs, increased data center spa

5 0.16114375 601 high scalability-2009-05-17-Product: Hadoop

Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig . Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3 : Scaling Hadoop to 4000 nodes at Yahoo! . 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2 : Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides . Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity

6 0.1574185 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud

7 0.15340465 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop

8 0.14927897 1618 high scalability-2014-03-24-Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square

9 0.14853221 666 high scalability-2009-07-30-Learn How to Think at Scale

10 0.14415415 414 high scalability-2008-10-15-Hadoop - A Primer

11 0.14026915 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge

12 0.13916145 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python

13 0.13555382 1161 high scalability-2011-12-22-Architecting Massively-Scalable Near-Real-Time Risk Analysis Solutions

14 0.13460037 1110 high scalability-2011-09-06-Big Data Application Platform

15 0.13439861 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop

16 0.13108623 650 high scalability-2009-07-02-Product: Hbase

17 0.1290963 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

18 0.12832369 7 high scalability-2007-07-12-FeedBurner Architecture

19 0.12437218 1293 high scalability-2012-07-30-Prismatic Architecture - Using Machine Learning on Social Networks to Figure Out What You Should Read on the Web

20 0.12334289 1216 high scalability-2012-03-27-Big Data In the Cloud Using Cloudify


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.185), (1, 0.09), (2, -0.007), (3, 0.053), (4, 0.015), (5, 0.07), (6, 0.082), (7, 0.055), (8, 0.056), (9, 0.063), (10, 0.078), (11, 0.019), (12, 0.085), (13, -0.11), (14, 0.092), (15, -0.028), (16, -0.015), (17, -0.025), (18, -0.023), (19, 0.065), (20, -0.016), (21, 0.034), (22, 0.132), (23, 0.038), (24, 0.052), (25, 0.019), (26, 0.016), (27, 0.04), (28, -0.016), (29, 0.079), (30, 0.096), (31, 0.12), (32, -0.008), (33, 0.023), (34, -0.035), (35, 0.022), (36, -0.045), (37, 0.015), (38, 0.003), (39, -0.049), (40, 0.033), (41, 0.025), (42, -0.087), (43, -0.0), (44, 0.091), (45, 0.031), (46, -0.034), (47, -0.076), (48, -0.046), (49, -0.011)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97426522 1313 high scalability-2012-08-28-Making Hadoop Run Faster


2 0.8631683 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

Introduction: This tutorial will show you how to use Amazon EC2 and Cloudera's Distribution for Hadoop to run batch jobs for a data intensive web application. During the tutorial, we will perform the following data processing steps.... read more on Cloudera website

3 0.84247363 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud

Introduction: This is a guest post by Ivan de Prado and Pere Ferrera , founders of Datasalt , the company behind Pangool and Splout SQL Big Data open-source projects. The amount of payments performed using credit cards is huge. It is clear that there is inherent value in the data that can be derived from analyzing all the transactions. Client fidelity, demographics, heat maps of activity, shop recommendations, and many other statistics are useful to both clients and shops for improving their relationship with the market. At Datasalt we have developed a system in collaboration with the BBVA bank that is able to analyze years of data and serve insights and statistics to different low-latency web and mobile applications. The main challenge we faced besides processing Big Data input is that the output was also Big Data, and even bigger than the input . And this output needed to be served quickly, under high load. The solution we developed has an infrastructure cost of just a few tho

4 0.82961828 601 high scalability-2009-05-17-Product: Hadoop

Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig . Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3 : Scaling Hadoop to 4000 nodes at Yahoo! . 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2 : Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides . Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity

5 0.80945331 1173 high scalability-2012-01-12-Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs

Introduction: The Peregrine falcon is a bird of prey, famous for its high speed diving attacks , feeding primarily on much slower Hadoops. Wait, sorry, it is Kevin Burton of Spinn3r's new Peregrine project -- a new FAST modern map reduce framework optimized for iterative and pipelined map reduce jobs -- that feeds on Hadoops. If you don't know Kevin, he does a lot of excellent technical work that he's kind enough to share on his blog . Only he hasn't been blogging much lately; he's been heads down working on Peregrine. Now that Peregrine has been released, here's a short email interview with Kevin on why you might want to take up falconry , the ancient sport of MapReduce. What does Spinn3r do that Peregrine is important to you? Ideally it was designed to execute pagerank but many iterative applications that we deploy and WANT to deploy (k-means) would be horribly inefficient under Hadoop as it doesn't have any support for merging and joining IO between tasks.  It also doesn't support

6 0.78539872 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster

7 0.76961851 1618 high scalability-2014-03-24-Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square

8 0.76952493 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop

9 0.76686174 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012

10 0.76619077 666 high scalability-2009-07-30-Learn How to Think at Scale

11 0.76049602 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python

12 0.74849951 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing

13 0.74539918 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost

14 0.72164446 1586 high scalability-2014-01-28-How Next Big Sound Tracks Over a Trillion Song Plays, Likes, and More Using a Version Control System for Hadoop Data

15 0.7210688 376 high scalability-2008-09-03-MapReduce framework Disco

16 0.71258348 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge

17 0.7061584 414 high scalability-2008-10-15-Hadoop - A Primer

18 0.70117247 956 high scalability-2010-12-08-How To Get Experience Working With Large Datasets

19 0.68698907 650 high scalability-2009-07-02-Product: Hbase

20 0.67954749 780 high scalability-2010-02-19-Twitter’s Plan to Analyze 100 Billion Tweets


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.166), (2, 0.23), (30, 0.018), (40, 0.026), (56, 0.022), (61, 0.099), (73, 0.104), (79, 0.153), (85, 0.015), (94, 0.078)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97015798 1175 high scalability-2012-01-17-Paper: Feeding Frenzy: Selectively Materializing Users’ Event Feeds

Introduction: How do you scale an inbox that has multiple highly volatile feeds? That's a problem faced by social networks like Tumblr, Facebook, and Twitter. Follow a few hundred event sources and it's hard to scalably order an inbox so that you see a correct view as event sources continually publish new events. This can be considered like a view materialization problem in a database. In a database a view is a virtual table defined by a query that can be accessed like a table. Materialization refers to when the data behind the view is created. If a view is a join on several tables and that join is performed when the view is accessed, then performance will be slow. If the view is precomputed access to the view will be fast, but more resources are used, especially considering that the view may never be accessed. Your wall/inbox/stream is a view on all the people/things you follow. If you never look at your inbox then materializing the view in your inbox is a waste of resources, yet you'll be ma

2 0.96469986 980 high scalability-2011-01-28-Stuff The Internet Says On Scalability For January 28, 2011

Introduction: Submitted for your reading pleasure... Something we get to say more often than you might expect - funny NoSQL comic:  How to Write a CV  (SFW) Playtomic shows hows how to handle over 300 million events per day, in real time, on a budget .  More Speed, at $80,000 a Millisecond . Does latency matter ? Oh yes... “On the Chicago to New York route in the US, three milliseconds can mean the difference between US$2,000 a month and US$250,000 a month.” Quotable Quotes @jkalucki : Throwing 1,920 CPUs and 4TB of RAM at an annoyance, as you do. @jointheflock @hkanji : Scale can come quick and come hard. Be prepared. @elenacarstoiu : When you say #Cloud, everybody's thinking lower cost. Agility, scalability and fast access are advantages far more important. @BillGates : From Melinda - Research proves we can save newborn lives at scale  Kosmix with a fascinating look at Cassandra on SSD ,  summarizing some of what they've learned over the past year runni

3 0.9642449 33 high scalability-2007-07-26-ThemBid Architecture

Introduction: ThemBid provides a market where people needing work done broadcast their request and accept bids from people competing for the job. Unlike many of the sites profiled at HighScalability, ThemBid is not in the popular press as often as Paris Hilton. It's not a media darling or a giant of the industry. But what I like is they have a strategy, a point-of-view for building websites and were gracious enough to share very detailed instructions on how to go about building a website. They even delve into actual installation details of the various software packages they use. Anyone can benefit by taking a look at their work. Site: http://www.thembid.com/ Information Sources Build Scalable Web 2.0 Sites with Ubuntu, Symfony, and Lighttpd Platform Linux (Ubuntu) Symfony Lighttpd PHP eAccelerator Eclipse Munin AWStats What's Inside? The Stats Started work in December of 2006 and had a full demo by March 2007. One developer/sys admin worked with a pa

4 0.96117216 1587 high scalability-2014-01-29-10 Things Bitly Should Have Monitored

Introduction: Monitor, monitor, monitor. That's the advice every startup gives once they reach a certain size. But can you ever monitor enough? If you are Bitly and everyone will complain when you are down, probably not. Here are  10 Things We Forgot to Monitor  from Bitly, along with good stories and copious amounts of code snippets. Well worth reading, especially after you've already started monitoring the lower hanging fruit. An interesting revelation from the article is that: We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2.   Fork Rate . A strange configuration issue caused processes to be created at a rate of several hundred a second rather than the expected 1-10/second.  Flow control packets .  A network configuration that honors flow control packets and isn’t configured to disable them, can temporarily cause dropped traffic. Swap In/Out Rate . Measure the right thing. It's the rate memory is swapped

5 0.95653749 957 high scalability-2010-12-13-Still Time to Attend My Webinar Tomorrow: What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications

Introduction: It's time to do something a little different and for me that doesn't mean cutting off my hair and joining a monastery, nor does it mean buying a cherry red convertible (yet), it means doing a webinar! On December 14th, 2:00 PM - 3:00 PM EST, I'll be hosting  What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications . The webinar is sponsored by VoltDB, but it will be completely vendor independent, as that's the only honor preserving and technically accurate way of doing these things. The webinar will run about 60 minutes, with 40 minutes of speechifying and 20 minutes for questions. The hashtag for the event on Twitter will be SQLNoSQL . I'll be monitoring that hashtag if you have any suggestions for the webinar or if you would like to ask questions during the webinar.  The motivation for me to do the webinar was a talk I had with another audience member at the NoSQL Evening in Palo Alto . He said he came from a Java background and was confused ab

6 0.95653677 945 high scalability-2010-11-18-Announcing My Webinar on December 14th: What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications

same-blog 7 0.95370531 1313 high scalability-2012-08-28-Making Hadoop Run Faster

8 0.9508127 1642 high scalability-2014-05-02-Stuff The Internet Says On Scalability For May 2nd, 2014

9 0.94830668 517 high scalability-2009-02-21-Google AppEngine - A Second Look

10 0.94331849 1037 high scalability-2011-05-10-Viddler Architecture - 7 Million Embeds a Day and 1500 Req-Sec Peak

11 0.9430849 709 high scalability-2009-09-19-Space Based Programming in .NET

12 0.94244879 301 high scalability-2008-04-08-Google AppEngine - A First Look

13 0.94186461 1102 high scalability-2011-08-22-Strategy: Run a Scalable, Available, and Cheap Static Site on S3 or GitHub

14 0.94154412 229 high scalability-2008-01-29-Building scalable storage into application - Instead of MogileFS OpenAFS etc.

15 0.93986285 264 high scalability-2008-03-03-Read This Site and Ace Your Next Interview!

16 0.93985736 837 high scalability-2010-06-07-Six Ways Twitter May Reach its Big Hairy Audacious Goal of One Billion Users

17 0.93849951 949 high scalability-2010-11-29-Stuff the Internet Says on Scalability For November 29th, 2010

18 0.93749726 776 high scalability-2010-02-12-Hot Scalability Links for February 12, 2010

19 0.93718874 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)

20 0.93682593 1649 high scalability-2014-05-16-Stuff The Internet Says On Scalability For May 16th, 2014