high_scalability high_scalability-2008 high_scalability-2008-254 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Update: Yahoo! Launches World's Largest Hadoop Production Application. A 10,000 core Hadoop cluster produces data used in every Yahoo! Web search query. Raw disk is at 5 Petabytes. Their previous 1 petabyte database couldn't handle the load and couldn't grow larger. Greg Linden thinks the Google cluster has way over 133,000 machines. From an InfoQ interview with project lead Doug Cutting, it appears Hadoop, an open source distributed computing platform, is making good progress towards their 1.0 release. They've successfully reached a 1000 node cluster size, improved file system integrity, and jacked performance by 20x in the last year. How they are making progress could be a good model for anyone: The speedup has been an aggregation of our work in the past few years, and has been accomplished mostly by trial-and-error. We get things running smoothly on a cluster of a given size, then double the size of the cluster and see what breaks. We aim for performance to scale linearly as you increase the cluster size. We learn from this process and then increase the cluster size again. Each time you increase the cluster size, reliability becomes a bigger challenge since the number and kind of failures increase. It's tempting to say just jump to the end game, don't bother with all those errors and trials, but there's a lot of learning and experience that must be earned on the way to scaling anything.
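The doubling loop Cutting describes (run smoothly at one size, double the cluster, see what breaks, aim for linear scaling) lends itself to a tiny measurement check. The sketch below is only an illustration, not Hadoop project tooling; the cluster sizes and throughput numbers are hypothetical.

```python
# A minimal sketch of the "double the cluster, check for linear scaling" loop.
# Measurements are hypothetical placeholders, not Yahoo!'s numbers.

def scaling_efficiency(runs):
    """runs: list of (nodes, jobs_per_hour) pairs, smallest cluster first.
    An efficiency of 1.0 means throughput grew exactly in step with cluster size."""
    base_nodes, base_rate = runs[0]
    report = []
    for nodes, rate in runs[1:]:
        ideal = base_rate * (nodes / base_nodes)   # perfectly linear scaling
        report.append((nodes, rate / ideal))
    return report

# Each doubling should keep efficiency near 1.0; a sharp drop is the signal
# to go find what broke (NameNode load, shuffle congestion, failure handling, ...).
measurements = [(125, 100.0), (250, 196.0), (500, 370.0), (1000, 610.0)]
for nodes, eff in scaling_efficiency(measurements):
    print(f"{nodes:>5} nodes: {eff:.0%} of linear")
```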
sentIndex sentText sentNum sentScore
1 A 10,000 core Hadoop cluster produces data used in every Yahoo! Web search query. [sent-3, score-0.515]
2 Their previous 1 petabyte database couldn't handle the load and couldn't grow larger. [sent-6, score-0.27]
3 Greg Linden thinks the Google cluster has way over 133,000 machines. [sent-7, score-0.395]
4 From an InfoQ interview with project lead Doug Cutting, it appears Hadoop, an open source distributed computing platform, is making good progress towards their 1.0 release. [sent-8, score-0.542]
5 They've successfully reached a 1000 node cluster size, improved file system integrity, and jacked performance by 20x in the last year. [sent-10, score-0.773]
6 How they are making progress could be a good model for anyone: The speedup has been an aggregation of our work in the past few years, and has been accomplished mostly by trial-and-error. [sent-11, score-0.943]
7 We get things running smoothly on a cluster of a given size, then double the size of the cluster and see what breaks. [sent-12, score-1.34]
8 We aim for performance to scale linearly as you increase the cluster size. [sent-13, score-0.803]
9 We learn from this process and then increase the cluster size again. [sent-14, score-0.864]
10 Each time you increase the cluster size reliability becomes a bigger challenge since the number and kind of failures increase. [sent-15, score-1.248]
11 It's tempting to say just jump to the end game, don't bother with all those errors and trials, but there's a lot of learning and experience that must be earned on the way to scaling anything. [sent-16, score-0.735]
wordName wordTfidf (topN-words)
[('cluster', 0.395), ('size', 0.298), ('progress', 0.227), ('breakthe', 0.193), ('trials', 0.184), ('yahoo', 0.183), ('doug', 0.171), ('increase', 0.171), ('tempting', 0.167), ('earned', 0.156), ('smoothly', 0.15), ('hadoop', 0.137), ('speedup', 0.135), ('accomplished', 0.132), ('bother', 0.13), ('aim', 0.122), ('produces', 0.12), ('cutting', 0.12), ('integrity', 0.119), ('launches', 0.119), ('linearly', 0.115), ('petabyte', 0.114), ('jump', 0.112), ('reached', 0.112), ('successfully', 0.107), ('aggregation', 0.106), ('double', 0.102), ('raw', 0.101), ('errors', 0.099), ('could', 0.098), ('improved', 0.094), ('towards', 0.092), ('previous', 0.088), ('making', 0.086), ('mostly', 0.083), ('bigger', 0.082), ('failures', 0.08), ('challenge', 0.077), ('past', 0.076), ('reliability', 0.075), ('largest', 0.074), ('game', 0.074), ('lead', 0.073), ('anyone', 0.073), ('learning', 0.071), ('becomes', 0.07), ('grow', 0.068), ('anything', 0.066), ('last', 0.065), ('project', 0.064)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 254 high scalability-2008-02-19-Hadoop Getting Closer to 1.0 Release
2 0.18043467 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster
Introduction: Solving problems while saving money is always a problem. In Nobody ever got fired for using Hadoop on a cluster, they give some counter-intuitive advice by showing a big-memory server may provide better performance per dollar than a cluster: For jobs where the input data is multi-terabyte or larger a Hadoop cluster is the right solution. For smaller problems memory has reached a GB/$ ratio where it is technically and financially feasible to use a single server with 100s of GB of DRAM rather than a cluster. Given the majority of analytics jobs do not process huge data sets, a cluster doesn't need to be your first option. Scaling up RAM saves on programmer time, reduces programmer effort, improves accuracy, and reduces hardware costs.
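The GB/$ argument above is simple arithmetic. Here is a hedged back-of-the-envelope sketch; all prices, node counts, and memory sizes are made-up placeholders, not figures from the post.

```python
# Back-of-the-envelope comparison behind the "lots of RAM" advice.
# All prices and sizes below are hypothetical, for illustration only.

dataset_gb = 200                       # working set that must be processed

big_memory_server = {
    "ram_gb": 256,
    "cost": 8_000,                     # one server with plenty of DRAM
}

hadoop_cluster = {
    "nodes": 20,
    "cost_per_node": 2_500,            # commodity nodes + network + ops overhead
}

single_server_cost = big_memory_server["cost"]
cluster_cost = hadoop_cluster["nodes"] * hadoop_cluster["cost_per_node"]

if dataset_gb <= big_memory_server["ram_gb"]:
    print(f"Fits in RAM: ${single_server_cost:,} vs ${cluster_cost:,} for a cluster")
else:
    print("Working set exceeds a single server's DRAM; a cluster is back on the table")
```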
3 0.15019688 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
Introduction: Many people in the Apache Hadoop community have asked Yahoo! to publish the version of Apache Hadoop they test and deploy across their large Hadoop clusters. As a service to the Hadoop community, Yahoo is releasing the Yahoo! Distribution of Hadoop -- a source code distribution that is based entirely on code found in the Apache Hadoop project. This source distribution includes code patches that they have added to improve the stability and performance of their clusters. In all cases, these patches have already been contributed back to Apache, but they may not yet be available in an Apache release of Hadoop. Read more and get the Hadoop distribution from Yahoo
4 0.13836516 1279 high scalability-2012-07-09-Data Replication in NoSQL Databases
Introduction: This is the third guest post (part 1, part 2) of a series by Greg Lindahl, CTO of blekko, the spam free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters. blekko's home-grown NoSQL database was designed from the start to support a web-scale search engine, with 1,000s of servers and petabytes of disk. Data replication is a very important part of keeping the database up and serving queries. Like many NoSQL database authors, we decided to keep R=3 copies of each piece of data in the database, and not use RAID to improve reliability. The key goal we were shooting for was a database which degrades gracefully when there are many small failures over time, without needing human intervention. Why don't we like RAID for big NoSQL databases? Most big storage systems use RAID levels like 3, 4, 5, or 10 to improve reliability.
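To make the R=3 idea concrete, here is a toy replica-placement sketch using rendezvous-style hashing. It is not blekko's actual scheme; the server names, hash choice, and replication factor are illustrative assumptions only.

```python
# Toy illustration of R=3 replica placement (NOT blekko's actual design):
# each key is hashed to three distinct servers, so losing any single machine
# still leaves two live copies and no RAID rebuild is needed.

import hashlib

SERVERS = [f"node{i:03d}" for i in range(100)]
R = 3

def replica_nodes(key: str, servers=SERVERS, r=R):
    """Rank servers by a hash of (server, key) and keep the top r distinct ones."""
    ranked = sorted(servers,
                    key=lambda s: hashlib.md5(f"{s}/{key}".encode()).hexdigest())
    return ranked[:r]

print(replica_nodes("crawl/example.com/index.html"))
```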
5 0.13835147 601 high scalability-2009-05-17-Product: Hadoop
Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks each write (read) 360 MB (about 3 blocks) of data into a single file, with a total of 5.04 TB for the whole job. Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity
6 0.12871674 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
7 0.12025353 13 high scalability-2007-07-15-Lustre cluster file system
8 0.11382492 820 high scalability-2010-05-03-100 Node Hazelcast cluster on Amazon EC2
9 0.11298979 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost
10 0.10995792 666 high scalability-2009-07-30-Learn How to Think at Scale
11 0.10698353 617 high scalability-2009-06-04-New Book: Even Faster Web Sites: Performance Best Practices for Web Developers
12 0.10382088 596 high scalability-2009-05-11-Facebook, Hadoop, and Hive
13 0.10155185 106 high scalability-2007-10-02-Secrets to Fotolog's Scaling Success
14 0.10041873 1155 high scalability-2011-12-12-Netflix: Developing, Deploying, and Supporting Software According to the Way of the Cloud
15 0.099794455 1020 high scalability-2011-04-12-Caching and Processing 2TB Mozilla Crash Reports in memory with Hazelcast
16 0.096177191 707 high scalability-2009-09-17-Hot Links for 2009-9-17
17 0.095856197 1564 high scalability-2013-12-13-Stuff The Internet Says On Scalability For December 13th, 2013
18 0.094893709 750 high scalability-2009-12-16-Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud
20 0.091876402 454 high scalability-2008-12-01-Deploying MySQL Database in Solaris Cluster Environments
topicId topicWeight
[(0, 0.16), (1, 0.077), (2, 0.03), (3, 0.018), (4, 0.013), (5, 0.04), (6, 0.053), (7, 0.008), (8, 0.052), (9, 0.079), (10, 0.02), (11, -0.023), (12, 0.057), (13, -0.044), (14, 0.069), (15, -0.005), (16, -0.043), (17, -0.011), (18, -0.013), (19, 0.05), (20, 0.014), (21, 0.063), (22, 0.008), (23, -0.007), (24, -0.06), (25, 0.035), (26, 0.038), (27, 0.017), (28, -0.029), (29, 0.036), (30, 0.065), (31, 0.086), (32, -0.011), (33, -0.004), (34, 0.042), (35, 0.047), (36, -0.021), (37, -0.028), (38, -0.01), (39, -0.083), (40, 0.057), (41, 0.0), (42, -0.005), (43, -0.023), (44, -0.031), (45, 0.062), (46, 0.048), (47, 0.0), (48, 0.036), (49, 0.054)]
simIndex simValue blogId blogTitle
same-blog 1 0.98151314 254 high scalability-2008-02-19-Hadoop Getting Closer to 1.0 Release
2 0.7709989 601 high scalability-2009-05-17-Product: Hadoop
3 0.76905543 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
Introduction: A demonstration, with repeatable steps, of how to quickly fire up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File-System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java. Fire-Up Your Hadoop Cluster I choose the Cloudera distribution of Hadoop which is still 100% Apache licensed, but has some additional benefits. One of these benefits is that it is released by Doug Cutting, who started Hadoop and drove its development at Yahoo! He also started Lucene, which is another of my favourite Apache Projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire up a Hadoop cluster. I am going to use C
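The post writes its map-reduce scripts in Ruby; purely as an illustration of the same Hadoop Streaming contract (read stdin, emit tab-separated key/value pairs on stdout), here is a minimal word-count mapper and reducer sketched in Python. The file names are arbitrary and this is not code from the post.

```python
#!/usr/bin/env python3
# mapper.py -- minimal Hadoop Streaming mapper sketch. Hadoop feeds input
# splits on stdin and expects tab-separated key/value lines on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Streaming sorts by key before the reduce phase, so lines for
# the same key arrive contiguously and a running total per key is enough.
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, n = line.rstrip("\n").partition("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

Locally the pair can be exercised with `cat input.txt | ./mapper.py | sort | ./reducer.py`; on a cluster both scripts are submitted through the Hadoop Streaming jar (its path varies by distribution) with -mapper, -reducer, -input and -output arguments.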
4 0.75868446 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster
5 0.70838469 862 high scalability-2010-07-20-Strategy: Consider When a Service Starts Billing in Your Algorithm Cost
Introduction: At Monday's Cloud Computing Meetup, Paco Nathan gave an excellent Getting Started on Hadoop talk (slides). I found one of Paco's strategies particularly interesting: consider when a service starts charging in cost calculations. Depending on your use case it may be cheaper to go with a more expensive service that charges only for work accomplished rather than charging for both work + startup time. The example is comparing the cost of running Hadoop on AWS yourself versus using Amazon's prepackaged Hadoop service, Elastic MapReduce (EMR). The thought may have gone through your mind, as it did mine, that it doesn't necessarily make sense to use Amazon's Hadoop service. Why pay a premium for EMR when Hadoop will run directly on AWS? One reason is that Amazon has made significant changes to Hadoop to make it run more efficiently and easily on AWS. The other more surprising reason is cost. When starting a 500 node Hadoop cluster, for example, you have to wait for all the nodes to start
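The "when does billing start" point reduces to a small comparison. The sketch below uses entirely hypothetical rates, premiums, and startup times, simply to show how paying for startup can outweigh a managed-service premium; it is not pricing from the talk or from Amazon.

```python
# Illustrative arithmetic for "when does billing start?" (all numbers hypothetical).
# Self-managed Hadoop on EC2 is billed from instance launch, so a slow 500-node
# startup is paid for; a managed service charges a per-instance-hour premium but
# the cluster is doing billable work almost immediately.

nodes = 500
ec2_rate = 0.10          # $/instance-hour, hypothetical
emr_premium = 0.03       # extra $/instance-hour for the managed service, hypothetical
startup_hours = 1.0      # time spent launching/configuring before the job runs
job_hours = 3.0

self_managed = nodes * ec2_rate * (startup_hours + job_hours)
managed = nodes * (ec2_rate + emr_premium) * job_hours

print(f"self-managed: ${self_managed:,.0f}   managed: ${managed:,.0f}")
```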
6 0.70565897 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
7 0.69570971 647 high scalability-2009-07-02-Hypertable is a New BigTable Clone that Runs on HDFS or KFS
8 0.69245982 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
9 0.68075615 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012
10 0.67874724 650 high scalability-2009-07-02-Product: Hbase
11 0.67403758 1075 high scalability-2011-07-07-Myth: Google Uses Server Farms So You Should Too - Resurrection of the Big-Ass Machines
12 0.67388153 1173 high scalability-2012-01-12-Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs
13 0.66570687 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
14 0.66351163 851 high scalability-2010-07-02-Hot Scalability Links for July 2, 2010
15 0.6535762 414 high scalability-2008-10-15-Hadoop - A Primer
16 0.6499697 666 high scalability-2009-07-30-Learn How to Think at Scale
17 0.6439597 1020 high scalability-2011-04-12-Caching and Processing 2TB Mozilla Crash Reports in memory with Hazelcast
18 0.63699406 1313 high scalability-2012-08-28-Making Hadoop Run Faster
19 0.62528092 795 high scalability-2010-03-16-1 Billion Reasons Why Adobe Chose HBase
topicId topicWeight
[(1, 0.2), (2, 0.18), (5, 0.018), (10, 0.046), (27, 0.015), (30, 0.058), (52, 0.04), (61, 0.091), (77, 0.037), (79, 0.167), (94, 0.043)]
simIndex simValue blogId blogTitle
1 0.98158908 576 high scalability-2009-04-21-What CDN would you recommend?
Introduction: Update 10: The Value of CDNs by Mike Axelrod of Google. Google implements a distributed content cache from within large ISPs. This allows them to serve content from the edge of the network and save bandwidth on the ISPs' backbone. Update 9: Just Jump: Start using Clouds and CDNs. Bob Buffone gives a really nice and practical tutorial on how to use CloudFront as your CDN. Update 8: Akamai's Services Become Affordable for Anyone! Blazing Web Site Performance by Distribution Cloud. Distribution Cloud starts at $150 per month for access to the best content distribution network in the world and the leader of Content Distribution Networks. Update 7: Where Amazon's Data Centers Are Located, Expanding the Cloud: Amazon CloudFront. Why Amazon's CDN Offering Is No Threat To Akamai, Limelight or CDN Pricing. Amazon has launched their CDN with "low latency, high data transfer speeds, and no commitments." The perfect relationship for many. The m
2 0.97286439 853 high scalability-2010-07-08-Cloud AWS Infrastructure vs. Physical Infrastructure
Introduction: This is a guest post by Frédéric Faure (architect at Ysance) on the differences between using a cloud infrastructure and building your own. Frédéric was kind enough to translate the original French version of this article into English. I’ve been noticing many questions about the differences inherent in choosing between a Cloud infrastructure such as AWS (Amazon Web Services) and a traditional physical infrastructure. Firstly, there are a certain number of preconceived notions on this subject that I will attempt to decode for you. Then, it must be understood that each infrastructure has its advantages and disadvantages: a Cloud-type infrastructure does not necessarily fulfill your requirements in every case; however, it can satisfy some of them by optimizing or facilitating the features offered by a traditional physical infrastructure. I will therefore demonstrate the differences between the two that I have noticed, in order to help you make up your own mind. The Fram
3 0.97146839 1014 high scalability-2011-03-31-8 Lessons We Can Learn from the MySpace Incident - Balance, Vision, Fearlessness
Introduction: A surprising amount of heat and light was generated by the whole Microsoft vs MySpace discussion. Why people feel so passionate about this I'm not quite sure, but fortunately for us, in the best sense of the web, it generated an amazing number of insightful comments and observations. If we stand back and take a look at the whole incident, what can we take away that might help us in the future? All computer companies are technology companies first. A repeated theme was that you can't be an entertainment company first. You are a technology company providing entertainment using technology. The tech can inform the entertainment side, the entertainment side drives features, but they really can't be separated. An awesome stack that does nothing is useless. A great idea on a poor stack is just as useless. There's a difficult balance that must be achieved and both management and developers must be aware that there's something to balance. All pigs are equal. All business f
4 0.97060144 1448 high scalability-2013-04-29-AWS v GCE Face-off and Why Innovation Needs Lower Cost Infrastructures
Introduction: This is a repost of part 2 (part 1) of an interview I did for the Boundary blog. Boundary: There’s another battle coming down the pike between Amazon (AWS) and Google (GCE). How should the CTO decide which one’s best? Hoff: Given that GCE is still closed to public access, we have very little common experience on which to judge. The best way to decide is, as always, by running a few experiments. Pick a few representative projects, a representative team, implement the projects on both infrastructures, crunch some numbers, figure out the bigger picture and then select the one you wanted in the first place. Sebastian Stadil, founder of Scalr, recently wrote about his experiences on both platforms and found some interesting differences: AWS has a much richer set of services; GCE is on-demand only, so AWS can be cheaper; GCE has faster disk and faster network IO, especially between datacenters; GCE has faster boot times and can mount read-only partitions across multiple
5 0.96919245 450 high scalability-2008-11-24-Scalability Perspectives #3: Marc Andreessen – Internet Platforms
Introduction: Scalability Perspectives is a series of posts that highlights the ideas that will shape the next decade of IT architecture. Each post is dedicated to a thought leader of the information age and his vision of the future. Be warned though – the journey into the minds and perspectives of these people requires an open mind. Marc Andreessen Marc Andreessen is known as an internet pioneer, entrepreneur, investor, startup coach, blogger, and a multi-millionaire software engineer best known as co-author of Mosaic, the first widely-used web browser, and founder of Netscape Communications Corporation. He was the chair of Opsware, a software company he founded originally as Loudcloud, when it was acquired by Hewlett-Packard. He is also a co-founder of Ning, a company which provides a platform for social-networking websites. He has recently joined the Board of Directors of Facebook and eBay. Marc is an investor in several startups including Digg, Metaplace, Plazes, Qik, and Twitter. His pas
6 0.96910602 1575 high scalability-2014-01-08-Under Snowden's Light Software Architecture Choices Become Murky
7 0.96895462 1036 high scalability-2011-05-06-Stuff The Internet Says On Scalability For May 6th, 2011
8 0.96894437 888 high scalability-2010-08-27-OpenStack - The Answer to: How do We Compete with Amazon?
9 0.96642715 841 high scalability-2010-06-14-How scalable could be a cPanel Hosting service?
10 0.96488374 195 high scalability-2007-12-28-Amazon's EC2: Pay as You Grow Could Cut Your Costs in Half
11 0.96476853 129 high scalability-2007-10-23-Hire Facebook, Ning, and Salesforce to Scale for You
12 0.96453893 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)
13 0.96396387 1654 high scalability-2014-06-05-Cloud Architecture Revolution
14 0.96392328 851 high scalability-2010-07-02-Hot Scalability Links for July 2, 2010
15 0.96376061 1354 high scalability-2012-11-05-Are we seeing the renaissance of enterprises in the cloud?
16 0.96300596 509 high scalability-2009-02-05-Product: HAProxy - The Reliable, High Performance TCP-HTTP Load Balancer
17 0.96178627 1216 high scalability-2012-03-27-Big Data In the Cloud Using Cloudify
18 0.96172881 803 high scalability-2010-04-05-Intercloud: How Will We Scale Across Multiple Clouds?
19 0.96127892 126 high scalability-2007-10-20-Should you build your next website using 3tera's grid OS?
20 0.9611187 1011 high scalability-2011-03-25-Did the Microsoft Stack Kill MySpace?