High Scalability - July 30, 2009 - Learn How to Think at Scale
Introduction: Aaron Kimball of Cloudera gives a wonderful 23 minute presentation titled Cloudera Hadoop Training: Thinking at Scale, which talks about "common challenges and general best practices for scaling with your data." As a company Cloudera offers "enterprise-level support to users of Apache Hadoop." Part of that offering is a really useful series of tutorial videos on the Hadoop ecosystem. Like TV lawyer Perry Mason (or is it Harmon Rabb?), Aaron gradually builds his case. He opens with the problem of storing lots of data. Then a blistering cross examination of the problem of building distributed systems to analyze that data sets up a powerful closing argument. With so much testimony behind him, on closing Aaron really brings it home with why shared nothing systems like map-reduce are the right solution for querying lots of data. The jury loved it. Here's the video Thinking at Scale. And here's a summary of some of the lessons learned from the talk:

Lessons Learned
* We can process data much faster than we can read it off disk, and much faster than we can write the results back to disk.
* The amount of data a machine can store is far greater than the amount it can manipulate in memory, so you have to keep swapping data in and out of RAM.
* With an average job size of 180 GB it would take about 45 minutes just to read that data sequentially off a single disk (a quick back-of-the-envelope check is sketched below).
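To make that number concrete, here is a rough sketch of the arithmetic (not from the talk; the ~70 MB/s figure is an assumed 2009-era sequential disk throughput and the 100 node cluster is an arbitrary example):

```python
# Rough arithmetic behind "180 GB takes ~45 minutes to read sequentially".
JOB_SIZE_GB = 180
DISK_MB_PER_SEC = 70      # assumed sequential throughput of one 2009-era disk
NODES = 100               # assumed cluster size for the parallel case

single_disk_seconds = JOB_SIZE_GB * 1024 / DISK_MB_PER_SEC
print(f"one disk  : {single_disk_seconds / 60:.0f} minutes")           # ~44 minutes

# Spread the same data over many disks and read them all in parallel:
print(f"{NODES} disks : {single_disk_seconds / NODES:.0f} seconds")    # ~26 seconds
```

Parallel reads fix the read bottleneck, which sets up the next point: once the data is spread out, the computation has to come to it.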
* With a parallel system in place, the next step is to move the computation to where the data is already stored rather than shipping the data to the computation. That is the heart of the new large scale computing approach.
* So move processing to individual nodes that each store only a small portion of the data at a time.
* Large distributed systems must be able to tolerate partial failure and adapt to newly added capacity.
* Failure in large systems is inevitable, so partial progress must be preserved for long jobs and work must be restarted when a failure is detected. Complex distributed systems make restarting difficult because of all the state that must be maintained.
* Workload should be rebalanced as new nodes are added and as failures occur (a minimal retry sketch is shown below).
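Here is a minimal sketch of the restart idea in plain Python (the `flaky_node` and `run_task` names are hypothetical, purely for illustration; in Hadoop the scheduler does this for you):

```python
import random

def flaky_node(task_id):
    """Pretend worker: fails some of the time, just like real hardware."""
    if random.random() < 0.3:
        raise RuntimeError(f"node running task {task_id} died")
    return f"result-{task_id}"

def run_task(task_id, max_attempts=5):
    """Tasks hold no shared state, so a failed task can simply be rerun."""
    for attempt in range(1, max_attempts + 1):
        try:
            return flaky_node(task_id)
        except RuntimeError:
            print(f"task {task_id}: attempt {attempt} failed, retrying")
    raise RuntimeError(f"task {task_id} gave up after {max_attempts} attempts")

# Partial progress for the whole job is just the set of tasks already finished.
completed = {task_id: run_task(task_id) for task_id in range(5)}
print(completed)
```

Because a task carries no shared state, recovering from a dead node is just running the same task again somewhere else, and the job's partial progress is simply the set of tasks already completed.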
* The solution to large scale data processing problems is to build a shared nothing architecture.
* In map-reduce (MR) systems data is read locally and processed locally. Data is partitioned onto machines in advance and computations happen where the data is stored.
* Tasks are run on the same node where the data is stored, or at least on the same rack (a toy illustration of the shared nothing flow is sketched below).
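A toy, single-process illustration of that shared nothing flow (plain Python, not Hadoop code; the partitions, map, shuffle, and reduce steps are simulated in memory purely to show that each step touches only its own slice of data):

```python
from collections import defaultdict
from itertools import chain

# Data is partitioned onto "nodes" in advance; each map task sees one partition.
partitions = [
    ["the cat sat", "the dog sat"],
    ["the cat ran", "a dog barked"],
]

def map_task(lines):
    """Runs where the data lives; emits (key, value) pairs from local data only."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped):
    """Group values by key so each reduce task gets exactly one key's values."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Also independent: needs nothing beyond its own key group."""
    return key, sum(values)

mapped = [map_task(p) for p in partitions]               # one map task per partition
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(counts)   # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 2, 'ran': 1, 'a': 1, 'barked': 1}
```

Every map task reads only its local partition and every reduce task needs only its own key group, which is exactly what makes the tasks independent, relocatable, and restartable.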
* Standard MR processes large files of data, typically 1 GB or more. Typical file system block sizes are 4 KB; for MR they are 64 MB to 256 MB, which allows writing large linear chunks and reduces seeks on reading (see the arithmetic below).
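The effect of the bigger block size is easy to see with a little arithmetic (a sketch applying the 4 KB and 64 MB figures above to a 1 GB file):

```python
# How many blocks -- and therefore potential seeks -- does a 1 GB file need?
FILE_SIZE = 1 * 1024**3                                     # 1 GB in bytes

print(f"4 KB blocks : {FILE_SIZE // (4 * 1024):,}")         # 262,144 blocks
print(f"64 MB blocks: {FILE_SIZE // (64 * 1024**2):,}")     # 16 blocks
```

At a typical ~10 ms seek time, a quarter of a million seeks would cost roughly 40 minutes on their own, while 16 seeks are noise.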
* Tasks are restarted transparently on failure because tasks are independent of each other.
* The same task can even be started speculatively on different nodes and the fastest result used, as sketched below.
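A minimal sketch of that speculative execution idea using only Python's standard library (the `run_on_node` function is hypothetical; in real Hadoop the framework's scheduler launches the speculative copies, not user code):

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_on_node(node):
    """The same independent task run on one node; some nodes are stragglers."""
    time.sleep(random.uniform(0.1, 2.0))   # pretend work, with variable node speed
    return f"result from node {node}"

# Launch identical copies of the task on several nodes and keep the first answer.
with ThreadPoolExecutor(max_workers=3) as pool:
    copies = [pool.submit(run_on_node, n) for n in range(3)]
    done, _ = wait(copies, return_when=FIRST_COMPLETED)
    print(done.pop().result())   # fastest copy wins; slower copies are simply ignored
```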