high_scalability high_scalability-2009 high_scalability-2009-601 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low-level map-reduce layer. You don't have to know Java. Using Pig Latin, a scripting data-flow language, you can think about your problem as a data-flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data in 37 minutes; 14,000 map tasks each write (read) 360 MB (about 3 blocks) of data into a single file, for a total of 5.04 TB for the whole job. Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, HBase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity
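To make the "10 lines of Pig Latin = 200 lines of Java" comparison concrete, here is a minimal sketch of the low-level Java code Pig abstracts away: the canonical word-count map and reduce, written against the org.apache.hadoop.mapreduce API (the newer of Hadoop's two Java APIs; the tokenization rule is an illustrative assumption, not code from the post).

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input split becomes a map task; for every line,
// emit (word, 1) pairs. Any node in the cluster can run any task.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);
    }
  }
}

// Reduce phase: the framework groups pairs by word and hands each
// word its list of counts; the reducer just sums them.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}
```

Add the driver boilerplate (a sketch appears after the similar-posts list below) and the Java version grows well past 50 lines, while the same job in Pig Latin is a handful of LOAD/GROUP/FOREACH/STORE statements; that is the verbosity gap the "10 lines vs 200 lines" claim refers to.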
sentIndex sentText sentNum sentScore
1 Using Pig Latin, a scripting data-flow language, you can think about your problem as a data-flow program. [sent-6, score-0.354]
2 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data in 37 minutes; 14,000 map tasks each write (read) 360 MB (about 3 blocks) of data into a single file, for a total of 5.04 TB. [sent-10, score-0.374]
3 Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. [sent-17, score-0.733]
4 Jeremy Zawodny has a wonderful overview of why Hadoop is important for large website builders: For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. [sent-19, score-0.256]
5 While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy. [sent-20, score-0.425]
6 Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy. It's hard work. [sent-23, score-0.163]
7 And it needs to be commoditized, just like the hardware has been. [sent-24, score-0.096]
8 Hadoop also provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. [sent-25, score-0.2]
9 Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework (see the HDFS sketch after this list). [sent-26, score-0.21]
10 Hadoop has been demonstrated on clusters with 2000 nodes. [sent-27, score-0.156]
11 The obvious question of the day is: should you build your website around Hadoop? [sent-29, score-0.086]
12 There seem to be a few types of things you do with lots of data: process, transform, and serve. [sent-31, score-0.094]
13 Yahoo literally has petabytes of log files, web pages, and other data they process. [sent-32, score-0.273]
14 If you are YouTube and you have petabytes of media to serve, do you really need map/reduce? [sent-37, score-0.2]
15 Maybe not, but the clustered file system is great. [sent-38, score-0.109]
16 Perfect for when you have lots of stuff to store. [sent-40, score-0.094]
17 With that you could create thumbnails, previews, transcode media files, and so on. [sent-42, score-0.182]
18 Everyone needs to store structured data in a scalable, reliable, high-performance data store. [sent-44, score-0.182]
19 I can't wait for experience reports about "normal" people, familiar with a completely different paradigm, adopting this infrastructure. [sent-46, score-0.076]
20 I wonder what animal O'Reilly will use on their Hadoop cover? [sent-47, score-0.104]
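Sentences 8 and 9 describe HDFS and its automatic handling of node failures. As a minimal sketch, assuming the standard org.apache.hadoop.fs client API of that era (the namenode address and file path are hypothetical), this is what reading and writing that file system looks like from Java:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical namenode
    conf.set("dfs.replication", "3"); // keep 3 copies of each block

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");

    // Write: blocks are spread across data nodes and replicated,
    // which yields the high aggregate bandwidth sentence 8 mentions.
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello hdfs");
    out.close();

    // Read: the client fetches blocks from whichever replicas are alive,
    // so a failed data node stays invisible to the application (sentence 9).
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
  }
}
```

Because the data lives on the compute nodes themselves, the scheduler can move map tasks to the data rather than moving the data to the tasks, which is where much of the aggregate bandwidth comes from.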
wordName wordTfidf (topN-words)
[('hadoop', 0.408), ('pig', 0.314), ('latin', 0.191), ('paradigm', 0.119), ('petabytes', 0.113), ('sherpa', 0.111), ('file', 0.109), ('animal', 0.104), ('greenplum', 0.104), ('hadoopsummit', 0.104), ('previews', 0.104), ('node', 0.101), ('hadoopby', 0.099), ('manycore', 0.099), ('yahoo', 0.099), ('hardware', 0.096), ('transcode', 0.095), ('lots', 0.094), ('commodity', 0.093), ('symposium', 0.092), ('agrees', 0.092), ('lines', 0.092), ('data', 0.091), ('conquer', 0.088), ('media', 0.087), ('thumbnails', 0.086), ('website', 0.086), ('flow', 0.086), ('large', 0.085), ('commoditized', 0.084), ('fragments', 0.084), ('clusters', 0.083), ('nearly', 0.083), ('builders', 0.08), ('computing', 0.078), ('directions', 0.077), ('adopting', 0.076), ('affinity', 0.075), ('terabyte', 0.074), ('prospect', 0.073), ('sense', 0.073), ('demonstrated', 0.073), ('replicates', 0.073), ('divided', 0.072), ('bet', 0.072), ('files', 0.072), ('everyone', 0.071), ('language', 0.07), ('somebody', 0.07), ('literally', 0.069)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000004 601 high scalability-2009-05-17-Product: Hadoop
2 0.38837776 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
Introduction: Yahoo has developed a new language called Pig Latin that fits into a sweet spot between high-level declarative querying in the spirit of SQL and low-level, procedural programming à la map-reduce, combining the best of both worlds. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig has just graduated from the Apache Incubator and joined Hadoop as a subproject. The paper has a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly (a sketch of embedding Pig follows this list). References: Apache Pig Wiki
3 0.25971073 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
Introduction: Many people in the Apache Hadoop community have asked Yahoo! to publish the version of Apache Hadoop they test and deploy across their large Hadoop clusters. As a service to the Hadoop community, Yahoo is releasing the Yahoo! Distribution of Hadoop -- a source code distribution that is based entirely on code found in the Apache Hadoop project. This source distribution includes code patches that they have added to improve the stability and performance of their clusters. In all cases, these patches have already been contributed back to Apache, but they may not yet be available in an Apache release of Hadoop. Read more and get the Hadoop distribution from Yahoo
4 0.25217238 414 high scalability-2008-10-15-Hadoop - A Primer
Introduction: Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce to process vast amounts of data. "Hadoop is a Free Java software framework that supports data intensive distributed applications running on large clusters of commodity computers. It enables applications to easily scale out to thousands of nodes and petabytes of data" (Wikipedia). What platform does Hadoop run on?
* Java 1.5.x or higher, preferably from Sun
* Linux
* Windows for development
* Solaris
(A job-driver sketch follows this list.)
5 0.21179111 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
Introduction: Has a Java-only Hadoop been getting you down? Now you can be Happy. Happy is a framework for writing map-reduce programs for Hadoop using Jython. It files off the sharp edges on Hadoop and makes writing map-reduce programs a breeze. There's really no history yet on Happy, but I'm delighted at the idea of being able to map-reduce in other languages. The more ways the better. From the website: Happy is a framework that allows Hadoop jobs to be written and run in Python 2.2 using Jython. It is an easy way to write map-reduce programs for Hadoop, and includes some new useful features as well. The current release supports Hadoop 0.17.2. Map-reduce jobs in Happy are defined by sub-classing happy.HappyJob and implementing map(records, task) and reduce(key, values, task) functions. Then you create an instance of the class, set the job parameters (such as inputs and outputs) and call run(). When you call run(), Happy serializes your job instance and copies it and all accompanying…
6 0.20001175 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
7 0.19821298 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
8 0.19379076 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012
9 0.18616119 448 high scalability-2008-11-22-Google Architecture
10 0.18331662 780 high scalability-2010-02-19-Twitter’s Plan to Analyze 100 Billion Tweets
11 0.18286459 851 high scalability-2010-07-02-Hot Scalability Links for July 2, 2010
12 0.1817843 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost
13 0.17657292 650 high scalability-2009-07-02-Product: Hbase
15 0.16114375 1313 high scalability-2012-08-28-Making Hadoop Run Faster
16 0.15613618 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
17 0.15454005 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge
18 0.15366337 666 high scalability-2009-07-30-Learn How to Think at Scale
19 0.14971599 1076 high scalability-2011-07-08-Stuff The Internet Says On Scalability For July 8, 2011
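Entry 2 above describes how Pig compiles Pig Latin into physical plans executed over Hadoop. Keeping to Java, here is a minimal sketch of driving that process through the org.apache.pig.PigServer embedding API from the Apache Pig project; the log schema, paths, and the query itself are illustrative assumptions:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbedExample {
  public static void main(String[] args) throws Exception {
    // MAPREDUCE mode compiles the script below into Hadoop jobs;
    // ExecType.LOCAL runs the same script in-process for testing.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // The data flow is declared, not coded as Mapper/Reducer classes.
    pig.registerQuery("logs = LOAD '/logs/access' AS (user:chararray, url:chararray);");
    pig.registerQuery("byUser = GROUP logs BY user;");
    pig.registerQuery("counts = FOREACH byUser GENERATE group, COUNT(logs);");

    // store() triggers planning and launches the underlying map-reduce jobs.
    pig.store("counts", "/output/counts-by-user");
  }
}
```

Nothing runs until store() (or an equivalent output operator) is reached, which is what lets Pig plan the whole data flow before committing to physical map-reduce stages.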
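Entry 4's primer stops at platform requirements, so as a hedged sketch of the next step, this is the Java driver that configures and submits a job built from the WordCountMapper and WordCountReducer sketched earlier (org.apache.hadoop.mapreduce.Job API; later releases prefer Job.getInstance over the constructor used here):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);   // ships this jar to the cluster
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Each input split becomes one map task; reducer output lands back on HDFS.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```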
topicId topicWeight
[(0, 0.236), (1, 0.095), (2, 0.039), (3, 0.073), (4, -0.008), (5, 0.071), (6, 0.093), (7, 0.009), (8, 0.141), (9, 0.187), (10, 0.068), (11, -0.056), (12, 0.121), (13, -0.137), (14, 0.148), (15, -0.053), (16, -0.063), (17, -0.01), (18, -0.065), (19, 0.064), (20, -0.006), (21, 0.108), (22, 0.053), (23, 0.068), (24, 0.014), (25, 0.01), (26, 0.151), (27, 0.015), (28, -0.011), (29, 0.083), (30, 0.089), (31, 0.126), (32, -0.011), (33, -0.009), (34, -0.012), (35, 0.029), (36, -0.089), (37, 0.06), (38, -0.019), (39, -0.073), (40, -0.005), (41, 0.027), (42, -0.066), (43, -0.051), (44, 0.023), (45, 0.016), (46, -0.001), (47, 0.037), (48, -0.031), (49, 0.023)]
simIndex simValue blogId blogTitle
same-blog 1 0.95929748 601 high scalability-2009-05-17-Product: Hadoop
2 0.91862696 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
Introduction: A demonstration, with repeatable steps, of how to quickly fire up a Hadoop cluster on Amazon EC2, load data onto HDFS (the Hadoop Distributed File System), write map-reduce scripts in Ruby, and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java. Fire Up Your Hadoop Cluster: I chose the Cloudera distribution of Hadoop, which is still 100% Apache licensed but has some additional benefits. One of these benefits is that it is released by Doug Cutting, who started Hadoop and drove its development at Yahoo! He also started Lucene, which is another of my favourite Apache projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire up a Hadoop cluster. I am going to use C
3 0.86821401 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
4 0.83739948 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
5 0.83311063 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
6 0.82844621 414 high scalability-2008-10-15-Hadoop - A Primer
8 0.79654783 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost
9 0.78662473 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster
10 0.78105348 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012
11 0.77692831 1313 high scalability-2012-08-28-Making Hadoop Run Faster
12 0.77340639 254 high scalability-2008-02-19-Hadoop Getting Closer to 1.0 Release
13 0.7704463 1173 high scalability-2012-01-12-Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs
14 0.7691738 650 high scalability-2009-07-02-Product: Hbase
15 0.73188061 851 high scalability-2010-07-02-Hot Scalability Links for July 2, 2010
16 0.72283089 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud
17 0.70037162 666 high scalability-2009-07-30-Learn How to Think at Scale
18 0.6973666 707 high scalability-2009-09-17-Hot Links for 2009-9-17
19 0.68399012 1076 high scalability-2011-07-08-Stuff The Internet Says On Scalability For July 8, 2011
20 0.67059356 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
topicId topicWeight
[(1, 0.103), (2, 0.225), (4, 0.013), (10, 0.05), (30, 0.023), (52, 0.091), (56, 0.018), (61, 0.071), (79, 0.269), (85, 0.027), (94, 0.033)]
simIndex simValue blogId blogTitle
1 0.96448809 1048 high scalability-2011-05-27-Stuff The Internet Says On Scalability For May 27, 2011
Introduction: Submitted for your scaling pleasure: Good idea: Open The Index And Speed Up The Internet. SmugMug estimates 50% of their CPU is spent serving crawler robots. Having a common meta-data repository wouldn't prevent search engines from having their own special sauce. Then the problem becomes one of syncing data between repositories and processing change events. A generous soul could even offer a shared MapReduce service over the data. Now that would speed up the internet. Scaling Achievements: YouTube Sees 3 Billion Views per Day; Twitter produces a sustained feed of 35 Mb per second; companies processing billions of API calls (Twitter, Netflix, Amazon, NPR, Google, Facebook, eBay, Bing); Astronomers Identify the Farthest Object Ever Observed, 13.14 Billion Light Years Away. Quotes that are Quotably Quotable: eekygeeky: When cloud computing news is slow? Switch to "big data" - 100% of the vaguery, none of the used-up, mushy marketing feel!
2 0.96365923 1494 high scalability-2013-07-19-Stuff The Internet Says On Scalability For July 19, 2013
Introduction: Hey, it's HighScalability time: (Still not a transporter: Looping at 685 mph) 898 exabytes: US storage, 1/3 of the global total; 1 Kb/s: data transmit rate from harvestable energy from human motion. Create your own trust-nobody point-to-point private cloud. Dan Brown shows how, step by step, in How I Created My Own Personal Cloud Using BitTorrent Sync, Owncloud, and Raspberry Pi. BitTorrent Sync is used to copy large files around. Raspberry Pi is a cheap, low-power, always-on device with BitTorrent Sync installed. Owncloud is an open source cloud that provides a web interface for accessing files from anywhere. This is different. Funding a startup using Airbnb as a source of start-up capital. It beats getting a part-time job, and one of your guests might even be a VC. This is not different. Old industries clawing and digging in, using the tools of power to beat back competition. Steve Blank details a familiar story in Strangling Innovation: Tesla versus “Rent Seeker
3 0.96317995 1420 high scalability-2013-03-08-Stuff The Internet Says On Scalability For March 8, 2013
Introduction: Hey, it's HighScalability time: Quotable Quotes: @ibogost: Disabling features of SimCity due to ineffective central infrastructure is probably the most realistic simulation of the modern city. antirez: The point is simply to show how SSDs can't be considered, currently, as a bit slower version of memory. Their performance characteristics are a lot more about, simply, "faster disks". @jessenoller: I only use JavaScript so I can gain maximum scalability across multiple cores. Also unicorns. @liammclennan: high-scalability ruby. Why bother? @scomma: Problem with BitCoin is not scalability, not even usability. It's whether someone will crack the algorithm and render BTC entirely useless. @webclimber: Amazing how often I find myself explaining that scalability is not magical. @mvmsan: Flash as Primary Storage - Highest Cost, Lack of HA, scalability and management features #flas
4 0.96090311 786 high scalability-2010-03-02-Using the Ambient Cloud as an Application Runtime
Introduction: This is an excerpt from my article Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud. The future looks many, big, complex, and adaptive: Many clouds. Many servers. Many operating systems. Many languages. Many storage services. Many database services. Many software services. Many adjunct human networks (like Mechanical Turk). Many fast interconnects. Many CDNs. Many cache memory pools. Many application profiles (simple request-response, live streaming, computationally complex, sensor driven, memory intensive, storage intensive, monolithic, decomposable, etc). Many legal jurisdictions. Don't want to perform a function on Patriot Act "protected" systems? Then move the function elsewhere. Many SLAs. Many data-driven pricing policies that, like airplane pricing algorithms, will price "seats" to maximize profit using multi-variate, time-sensitive pricing models. Many competitive products. The need t
5 0.95974785 244 high scalability-2008-02-11-Yahoo Live's Scaling Problems Prove: Release Early and Often - Just Don't Screw Up
Introduction: TechCrunch chomped down on some initial scaling problems with Yahoo's new live video streaming service, Yahoo Live. After a bit of chewing on Yahoo's old bones, TC spat out: If Yahoo can't scale something like this (no matter how much they claim it's an experiment, it's still a live service), it shows how far the once brightest star of the online world has fallen. This kind of thinking kills innovation. When there's no room for a few hiccups or a little failure, you have to cover your ass so completely that nothing new will ever see the light of day. I thought we were supposed to be agile. We are supposed to release early and often. Not every 'i' has to be dotted and not every last router has to be installed before we take the first step of a grand new journey. Get it out there. Let users help you make it better. Listen to customers, make changes, push the new code out, listen some more, and fix problems as they come up. Following this process we'll make somethi
7 0.95744413 1403 high scalability-2013-02-08-Stuff The Internet Says On Scalability For February 8, 2013
same-blog 8 0.95422256 601 high scalability-2009-05-17-Product: Hadoop
9 0.95407987 448 high scalability-2008-11-22-Google Architecture
10 0.9527486 867 high scalability-2010-07-27-YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World
11 0.95141846 882 high scalability-2010-08-18-Misco: A MapReduce Framework for Mobile Systems - Start of the Ambient Cloud?
12 0.94881541 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
13 0.94543636 1328 high scalability-2012-09-24-Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In
14 0.94541508 1392 high scalability-2013-01-23-Building Redundant Datacenter Networks is Not For Sissies - Use an Outside WAN Backbone
15 0.94498569 680 high scalability-2009-08-13-Reconnoiter - Large-Scale Trending and Fault-Detection
16 0.94099176 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
17 0.93935794 89 high scalability-2007-09-10-Is there a difference between partitioning and federation and sharding?
18 0.9392857 38 high scalability-2007-07-30-Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services
19 0.93766218 1098 high scalability-2011-08-15-Should any cloud be considered one availability zone? The Amazon experience says yes.
20 0.93679565 289 high scalability-2008-03-27-Amazon Announces Static IP Addresses and Multiple Datacenter Operation