high_scalability high_scalability-2009 high_scalability-2009-627 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Many people in the Apache Hadoop community have asked Yahoo! to publish the version of Apache Hadoop they test and deploy across their large Hadoop clusters. As a service to the Hadoop community, Yahoo is releasing the Yahoo! Distribution of Hadoop -- a source code distribution that is based entirely on code found in the Apache Hadoop project. This source distribution includes code patches that they have added to improve the stability and performance of their clusters. In all cases, these patches have already been contributed back to Apache, but they may not yet be available in an Apache release of Hadoop. Read more and get the Hadoop distribution from Yahoo.
sentIndex sentText sentNum sentScore
1 Many people in the Apache Hadoop community have asked Yahoo! [sent-1, score-0.295]
2 to publish the version of Apache Hadoop they test and deploy across their large Hadoop clusters. [sent-2, score-0.413]
3 As a service to the Hadoop community, Yahoo is releasing the Yahoo! [sent-3, score-0.196]
4 Distribution of Hadoop -- a source code distribution that is based entirely on code found in the Apache Hadoop project. [sent-4, score-0.855]
5 This source distribution includes code patches that they have added to improve the stability and performance of their clusters. [sent-5, score-1.206]
6 In all cases, these patches have already been contributed back to Apache, but they may not yet be available in an Apache release of Hadoop. [sent-6, score-0.818]
wordName wordTfidf (topN-words)
[('hadoop', 0.521), ('yahoo', 0.397), ('apache', 0.379), ('distribution', 0.325), ('patches', 0.321), ('releasing', 0.158), ('contributed', 0.158), ('community', 0.154), ('publish', 0.139), ('entirely', 0.114), ('code', 0.114), ('asked', 0.104), ('stability', 0.1), ('source', 0.092), ('includes', 0.09), ('release', 0.085), ('cases', 0.078), ('added', 0.071), ('version', 0.07), ('deploy', 0.069), ('improve', 0.067), ('yet', 0.064), ('test', 0.061), ('found', 0.059), ('already', 0.056), ('back', 0.05), ('available', 0.047), ('across', 0.04), ('service', 0.038), ('people', 0.037), ('based', 0.037), ('may', 0.037), ('large', 0.034), ('many', 0.028), ('get', 0.027), ('performance', 0.026)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
2 0.25971073 601 high scalability-2009-05-17-Product: Hadoop
Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds, and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data-flow language, you can think about your problem as a data-flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16 PB of raw disk; sorted 6 TB of data in 37 minutes; 14,000 map tasks write (read) 360 MB (about 3 blocks) of data into a single file, for a total of 5.04 TB for the whole job. Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity
3 0.24616954 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
Introduction: Has a Java-only Hadoop been getting you down? Now you can be Happy. Happy is a framework for writing map-reduce programs for Hadoop using Jython. It files off the sharp edges on Hadoop and makes writing map-reduce programs a breeze (see the Jython-style sketch after this list). There's really no history yet on Happy, but I'm delighted at the idea of being able to map-reduce in other languages. The more ways the better. From the website: Happy is a framework that allows Hadoop jobs to be written and run in Python 2.2 using Jython. It is an easy way to write map-reduce programs for Hadoop, and includes some new useful features as well. The current release supports Hadoop 0.17.2. Map-reduce jobs in Happy are defined by sub-classing happy.HappyJob and implementing a map(records, task) and reduce(key, values, task) function. Then you create an instance of the class, set the job parameters (such as inputs and outputs) and call run(). When you call run(), Happy serializes your job instance and copies it and all accompanyi
4 0.24283022 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
Introduction: A demonstration, with repeatable steps, of how to quickly fire up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File System), write map-reduce scripts in Ruby, and use them to run a map-reduce job on your Hadoop cluster (a streaming-style sketch, shown in Python, follows this list). You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java. Fire Up Your Hadoop Cluster. I chose the Cloudera distribution of Hadoop, which is still 100% Apache licensed but has some additional benefits. One of these benefits is that it is released by Doug Cutting, who started Hadoop and drove its development at Yahoo! He also started Lucene, which is another of my favourite Apache projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire up a Hadoop cluster. I am going to use C
5 0.20452347 596 high scalability-2009-05-11-Facebook, Hadoop, and Hive
Introduction: Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run applications that process vast amounts of data), Yahoo being the first. Learn how they do it and what the challenges are on the DBMS2 blog, a blog for people who care about database and analytic technologies.
6 0.19599724 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
7 0.18907821 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
8 0.18673001 414 high scalability-2008-10-15-Hadoop - A Primer
9 0.16812748 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost
10 0.15340465 1313 high scalability-2012-08-28-Making Hadoop Run Faster
11 0.15019688 254 high scalability-2008-02-19-Hadoop Getting Closer to 1.0 Release
12 0.14697352 617 high scalability-2009-06-04-New Book: Even Faster Web Sites: Performance Best Practices for Web Developers
13 0.14659065 650 high scalability-2009-07-02-Product: Hbase
14 0.13520555 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2
15 0.13115066 56 high scalability-2007-08-03-Running Hadoop MapReduce on Amazon EC2 and Amazon S3
16 0.13061926 665 high scalability-2009-07-29-Strategy: Let Google and Yahoo Host Your Ajax Library - For Free
17 0.1298088 707 high scalability-2009-09-17-Hot Links for 2009-9-17
18 0.12692389 1076 high scalability-2011-07-08-Stuff The Internet Says On Scalability For July 8, 2011
20 0.11665212 244 high scalability-2008-02-11-Yahoo Live's Scaling Problems Prove: Release Early and Often - Just Don't Screw Up
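The Happy entry above (entry 3) describes the framework's whole job API: subclass happy.HappyJob, implement map(records, task) and reduce(key, values, task), set the job parameters, and call run(). Below is a minimal word-count sketch following that description; the emitter name task.collect and the inputpaths/outputpath attribute names are assumptions for illustration, not taken from the Happy documentation.

import happy

class WordCount(happy.HappyJob):
    def map(self, records, task):
        # Emit a (word, "1") pair for every word in every input record.
        for record in records:
            for word in record.split():
                task.collect(word, "1")   # assumed emitter name

    def reduce(self, key, values, task):
        # Sum the counts emitted for each word.
        total = 0
        for v in values:
            total += int(v)
        task.collect(key, str(total))

job = WordCount()
job.inputpaths = ["input/"]   # assumed parameter names; the post only says
job.outputpath = "output/"    # "set the job parameters (such as inputs and outputs)"
job.run()                     # per the post, run() serializes the job and ships it out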
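Entry 4 above runs map-reduce scripts written in Ruby against a Hadoop cluster, which rests on the Hadoop Streaming contract: the mapper reads raw input lines on stdin and emits tab-separated key/value lines on stdout, and the reducer receives those lines grouped (sorted) by key. A sketch of that contract, shown here in Python rather than the post's Ruby; the protocol is identical in either language.

#!/usr/bin/env python
import sys

def mapper():
    # One "word<TAB>1" line per word; Hadoop sorts these by key between phases.
    for line in sys.stdin:
        for word in line.split():
            sys.stdout.write("%s\t1\n" % word)

def reducer():
    # Input arrives sorted by key, so a single pass with a running total works.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                sys.stdout.write("%s\t%d\n" % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        sys.stdout.write("%s\t%d\n" % (current, count))

if __name__ == "__main__":
    if sys.argv[1:] == ["map"]:
        mapper()
    else:
        reducer()

Such a script is typically wired in through the hadoop-streaming jar (hadoop jar .../hadoop-streaming*.jar -input in/ -output out/ -mapper ... -reducer ...); the jar's exact location varies by Hadoop version and distribution.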
topicId topicWeight
[(0, 0.098), (1, 0.01), (2, 0.038), (3, -0.004), (4, 0.065), (5, 0.051), (6, 0.079), (7, 0.002), (8, 0.133), (9, 0.148), (10, 0.064), (11, -0.028), (12, 0.145), (13, -0.161), (14, 0.081), (15, -0.114), (16, 0.02), (17, -0.013), (18, -0.055), (19, 0.003), (20, -0.028), (21, 0.153), (22, 0.104), (23, 0.045), (24, -0.015), (25, 0.091), (26, 0.084), (27, -0.007), (28, 0.028), (29, 0.007), (30, 0.103), (31, 0.146), (32, -0.011), (33, 0.037), (34, 0.055), (35, 0.038), (36, -0.133), (37, 0.057), (38, -0.024), (39, -0.069), (40, -0.008), (41, 0.075), (42, -0.021), (43, -0.084), (44, 0.02), (45, 0.006), (46, 0.037), (47, 0.055), (48, -0.001), (49, 0.096)]
simIndex simValue blogId blogTitle
same-blog 1 0.99635118 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
2 0.91331804 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
Introduction: Yahoo has developed a new language called Pig Latin that fits a sweet spot between high-level declarative querying in the spirit of SQL and low-level, procedural programming à la map-reduce, and combines the best of both worlds. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig has just graduated from the Apache Incubator and joined Hadoop as a subproject. The paper has a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. References: Apache Pig Wiki
3 0.87816232 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
4 0.84058762 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
5 0.6969375 601 high scalability-2009-05-17-Product: Hadoop
6 0.68210506 414 high scalability-2008-10-15-Hadoop - A Primer
7 0.68017405 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2
8 0.66783786 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost
9 0.617024 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012
10 0.60882854 254 high scalability-2008-02-19-Hadoop Getting Closer to 1.0 Release
11 0.60319239 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster
12 0.59667331 707 high scalability-2009-09-17-Hot Links for 2009-9-17
13 0.59271318 647 high scalability-2009-07-02-Hypertable is a New BigTable Clone that Runs on HDFS or KFS
14 0.58675957 650 high scalability-2009-07-02-Product: Hbase
15 0.56186843 1173 high scalability-2012-01-12-Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs
16 0.56079149 596 high scalability-2009-05-11-Facebook, Hadoop, and Hive
17 0.54935932 1313 high scalability-2012-08-28-Making Hadoop Run Faster
18 0.52967191 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
19 0.52112049 851 high scalability-2010-07-02-Hot Scalability Links for July 2, 2010
20 0.50206351 56 high scalability-2007-08-03-Running Hadoop MapReduce on Amazon EC2 and Amazon S3
topicId topicWeight
[(1, 0.048), (2, 0.342), (10, 0.035), (61, 0.021), (79, 0.373)]
simIndex simValue blogId blogTitle
1 0.99156249 202 high scalability-2008-01-06-Email Architecture
Introduction: I would like to know the email architecture used by large ISPs, or even by Google. Can someone point me to some sites? Thanks.
same-blog 2 0.98012418 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
3 0.9658379 1048 high scalability-2011-05-27-Stuff The Internet Says On Scalability For May 27, 2011
Introduction: Submitted for your scaling pleasure: Good idea: Open The Index And Speed Up The Internet. SmugMug estimates 50% of their CPU is spent serving crawler robots. Having a common meta-data repository wouldn't prevent search engines from having their own special sauce. Then the problem becomes one of syncing data between repositories and processing change events. A generous soul could even offer a shared MapReduce service over the data. Now that would speed up the internet. Scaling Achievements: YouTube Sees 3 Billion Views per Day; Twitter produces a sustained feed of 35 Mb per second; companies processing billions of API calls (Twitter, Netflix, Amazon, NPR, Google, Facebook, eBay, Bing); Astronomers Identify the Farthest Object Ever Observed, 13.14 Billion Light Years Away. Quotes that are Quotably Quotable: eekygeeky: When cloud computing news is slow? Switch to "big data" - 100% of the vaguery, none of the used-up, mushy marketing feel!
4 0.95310819 867 high scalability-2010-07-27-YeSQL: An Overview of the Various Query Semantics in the Post Only-SQL World
Introduction: The NoSQL movement faults the SQL query language as the source of many of the scalability issues we face today with the traditional database approach. I think the main reason so many people have come to see SQL as the source of all evil is that, traditionally, the query language was burned into the database implementation. So by saying NoSQL you are basically saying "No" to the traditional non-scalable RDBMS implementations. This view has brought on a flood of alternative query languages, each aiming to address an aspect missing from the traditional SQL query approach, such as a document model, or to provide a simpler approach, such as key/value queries. Most of the people I speak with seem fairly confused on this subject, and tend to use query semantics and architecture interchangeably (a small sketch at the end of this list contrasts the SQL and key/value semantics). In Part I of this post I tried to provide a quick overview of what each query term stands for in the context of the NoSQL world. Part II illustrates those ide
5 0.95157516 601 high scalability-2009-05-17-Product: Hadoop
6 0.95027447 1286 high scalability-2012-07-18-Strategy: Kill Off Multi-tenant Instances with High CPU Stolen Time
7 0.94873822 323 high scalability-2008-05-19-Twitter as a scalability case study
8 0.94693202 1494 high scalability-2013-07-19-Stuff The Internet Says On Scalability For July 19, 2013
9 0.94685483 1420 high scalability-2013-03-08-Stuff The Internet Says On Scalability For March 8, 2013
10 0.94418752 1403 high scalability-2013-02-08-Stuff The Internet Says On Scalability For February 8, 2013
12 0.94278741 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
13 0.93721664 946 high scalability-2010-11-22-Strategy: Google Sends Canary Requests into the Data Mine
14 0.93599433 448 high scalability-2008-11-22-Google Architecture
15 0.93586695 786 high scalability-2010-03-02-Using the Ambient Cloud as an Application Runtime
16 0.93501222 680 high scalability-2009-08-13-Reconnoiter - Large-Scale Trending and Fault-Detection
17 0.93320107 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
18 0.93160576 897 high scalability-2010-09-08-4 General Core Scalability Patterns
19 0.93149543 1485 high scalability-2013-07-01-PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute
20 0.93027145 526 high scalability-2009-03-05-Strategy: In Cloud Computing Systematically Drive Load to the CPU
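As a concrete illustration of the query-semantics point in the YeSQL entry above (entry 4), here is a minimal sketch, not taken from the post itself, expressing the same lookup under two of the semantics it discusses: declarative SQL (via Python's standard-library sqlite3 module) and direct key/value access (a plain dict standing in for a key/value store).

import sqlite3

# Declarative SQL: describe the result you want; the engine plans the access.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users VALUES (1, 'alice')")
row = db.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()
print(row[0])                  # -> alice

# Key/value: address the record directly by key; no planner, no query language.
kv_store = {1: {"name": "alice"}}
print(kv_store[1]["name"])     # -> alice

The difference lies in the calling convention, not the storage engine, which is the post's point that query semantics and architecture are separate concerns.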