high_scalability high_scalability-2008 high_scalability-2008-211 knowledge-graph by maker-knowledge-mining

211 high scalability-2008-01-13-Google Reveals New MapReduce Stats


meta info for this blog

Source: html

Introduction: The Google Operating System blog has an interesting post on Google's scale based on an updated version of Google's paper about MapReduce. The input data for some of the MapReduce jobs run in September 2007 was 403,152 TB (terabytes), the average number of machines allocated for a MapReduce job was 394, while the average completion time was six and a half minutes. The paper mentions that Google's indexing system processes more than 20 TB of raw data. Niall Kennedy calculates that the average MapReduce job runs across a $1 million hardware infrastructure, assuming that Google still uses the same cluster configurations from 2004: two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link. Greg Linden notices that Google's infrastructure is an important competitive advantage. "Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time."


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The Google Operating System blog has an interesting post on Google's scale based on an updated version of Google's paper about MapReduce. [sent-1, score-0.202]

2 The input data for some of the MapReduce jobs run in September 2007 was 403,152 TB (terabytes), the average number of machines allocated for a MapReduce job was 394, while the average completion time was 6 minutes and a half. [sent-2, score-1.048]

3 The paper mentions that Google's indexing system processes more than 20 TB of raw data. [sent-3, score-0.418]

4 Greg Linden notices that Google's infrastructure is an important competitive advantage. [sent-5, score-0.309]

5 And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time. [sent-7, score-0.253]

6 " It is interesting to compare this to Amazon EC2: $0. [sent-8, score-0.09]

7 40 Large Instance price per hour x 400 instances x 10 minutes = $26. [sent-9, score-0.401]

8 10 per GB = $100 For a hundred bucks you could also process a TB of data! [sent-11, score-0.393]
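
The back-of-the-envelope math above is easy to check. A minimal sketch in Python, using only the 2008 figures quoted in the post; the per-machine cost at the end is an inference from Kennedy's $1 million estimate, not a number from the post:

    # Rough check of the EC2 comparison above, at the 2008 prices quoted.
    large_instance_per_hour = 0.40      # USD per hour, EC2 Large instance
    instances = 400                     # rounding up Google's average of 394
    job_hours = 10 / 60                 # ~6.5 min average, rounded to 10 min

    compute_cost = large_instance_per_hour * instances * job_hours
    print(f"Compute for one job:  ${compute_cost:.2f}")    # -> $26.67

    transfer_cost = 0.10 * 1000         # $0.10 per GB in, 1 TB = 1,000 GB
    print(f"Transferring 1 TB in: ${transfer_cost:.2f}")   # -> $100.00

    # Kennedy's $1M cluster estimate works out to ~$1,000,000 / 394,
    # or roughly $2,500 per machine of the 2004 configuration.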


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('tb', 0.386), ('google', 0.276), ('gb', 0.243), ('mapreduce', 0.207), ('average', 0.194), ('minutes', 0.178), ('terabytes', 0.176), ('bucks', 0.167), ('calculates', 0.161), ('kennedy', 0.161), ('ide', 0.156), ('notices', 0.156), ('poston', 0.152), ('aninteresting', 0.148), ('ghz', 0.129), ('xeon', 0.129), ('mentions', 0.127), ('completion', 0.124), ('paper', 0.116), ('september', 0.115), ('gigabit', 0.115), ('configurations', 0.113), ('linden', 0.112), ('ethernet', 0.111), ('assuming', 0.109), ('iterate', 0.109), ('enabled', 0.107), ('allocated', 0.106), ('transfer', 0.098), ('input', 0.093), ('intel', 0.092), ('raw', 0.092), ('compare', 0.09), ('job', 0.089), ('hundred', 0.087), ('updated', 0.086), ('hour', 0.086), ('indexing', 0.083), ('drives', 0.083), ('processors', 0.082), ('wanted', 0.078), ('competitive', 0.077), ('infrastructure', 0.076), ('process', 0.075), ('price', 0.073), ('jobs', 0.07), ('anyone', 0.067), ('else', 0.066), ('per', 0.064), ('two', 0.062)]
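
The (wordName, wordTfidf) pairs above are the model's top-weighted terms for this post. A minimal sketch of how such weights, and the sentScore values in the summary, could be produced; scikit-learn's TfidfVectorizer is an assumption here, not necessarily what this pipeline used:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical corpus: one string per blog post in the collection.
    corpus = [
        "google reveals new mapreduce stats and average job completion times",
        "amazon ec2 pay as you grow could cut your costs in half",
        "google's bigtable costs ten times less than amazon's simpledb",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(corpus)   # rows: posts, cols: terms

    # Top-weighted terms for post 0, analogous to the word list above.
    terms = np.array(vectorizer.get_feature_names_out())
    weights = doc_vectors[0].toarray().ravel()
    top = weights.argsort()[::-1][:5]
    print(list(zip(terms[top], weights[top].round(3))))

    # A sentence score (sentScore) can then be the sum or mean of the tf-idf
    # weights of the terms that the sentence contains.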

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 211 high scalability-2008-01-13-Google Reveals New MapReduce Stats

Introduction: The Google Operating System blog has an interesting post on Google's scale based on an updated version of Google's paper about MapReduce. The input data for some of the MapReduce jobs run in September 2007 was 403,152 TB (terabytes), the average number of machines allocated for a MapReduce job was 394, while the average completion time was 6 minutes and a half. The paper mentions that Google's indexing system processes more than 20 TB of raw data. Niall Kennedy calculates that the average MapReduce job runs across a $1 million hardware infrastructure, assuming that Google still uses the same cluster configurations from 2004: two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link. Greg Linden notices that Google's infrastructure is an important competitive advantage. "Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they ca

2 0.24760088 195 high scalability-2007-12-28-Amazon's EC2: Pay as You Grow Could Cut Your Costs in Half

Introduction: Update 2: Summize Computes Computing Resources for a Startup . Lots of nice graphs showing Amazon is hard to beat for small machines and become less cost efficient for well used larger machines. Long term storage costs may eat your saving away. And out of cloud bandwidth costs are high. Update: via ProductionScale , a nice Digital Web article on how to setup S3 to store media files and how Blue Origin was able to handle 3.5 million requests and 758 GBs in bandwidth in a single day for very little $$$. Also a Right Scale article on Network performance within Amazon EC2 and to Amazon S3 . 75MB/s between EC2 instances, 10.2MB/s between EC2 and S3 for download, 6.9MB/s upload. Now that Amazon's S3 (storage service) is out of beta and EC2 (elastic compute cloud) has added new instance types (the class of machine you can rent) with more CPU and more RAM, I thought it would be interesting to take a look out how their pricing stacks up. The quick conclusion: the m

3 0.21031846 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters

Introduction: Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class . Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve a

4 0.20284784 336 high scalability-2008-05-31-Biggest Under Reported Story: Google's BigTable Costs 10 Times Less than Amazon's SimpleDB

Introduction: Why isn't Google's aggressive new database pricing strategy getting more pub? That's what Bill Katz , instigator of the GAE Meetup and prize winning science fiction author is wondering: It's surprising that the blogosphere hasn't picked up the biggest difference in pricing: Google's datastore is less than a tenth of the price of Amazon's SimpleDB while offering a better API. If money matters to you then the burn rate under GAE could be convincingly lower. Let's compare the numbers: GAE pricing : * $0.10 - $0.12 per CPU core-hour * $0.15 - $0.18 per GB-month of storage * $0.11 - $0.13 per GB outgoing bandwidth * $0.09 - $0.11 per GB incoming bandwidth SimpleDB Pricing : * $0.14 per Amazon SimpleDB Machine Hour consumed * Structured Data Storage - $1.50 per GB-month * $0.100 per GB - all data transfer in * $0.170 per GB - first 10 TB / month data transfer out (more on the site) Clearly Google priced their services to be competitive

5 0.15053301 448 high scalability-2008-11-22-Google Architecture

Introduction: Update 2: Sorting 1 PB with MapReduce . PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks. Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters . Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build

6 0.14299379 242 high scalability-2008-02-07-Looking for good business examples of compaines using Hadoop

7 0.13729708 666 high scalability-2009-07-30-Learn How to Think at Scale

8 0.12928313 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm

9 0.12330562 334 high scalability-2008-05-29-Amazon Improves Diagonal Scaling Support with High-CPU Instances

10 0.11778303 1117 high scalability-2011-09-16-Stuff The Internet Says On Scalability For September 16, 2011

11 0.1166568 912 high scalability-2010-10-01-Google Paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications

12 0.11663112 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)

13 0.10766728 1485 high scalability-2013-07-01-PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute

14 0.10593488 75 high scalability-2007-08-28-Google Utilities : An online google guide,tools and Utilities.

15 0.10590978 851 high scalability-2010-07-02-Hot Scalability Links for July 2, 2010

16 0.10515037 164 high scalability-2007-11-22-Why not Cache from Intersystems?

17 0.10337473 1535 high scalability-2013-10-21-Google's Sanjay Ghemawat on What Made Google Google and Great Big Data Career Advice

18 0.10208017 570 high scalability-2009-04-15-Implementing large scale web analytics

19 0.10157243 998 high scalability-2011-03-03-Stack Overflow Architecture Update - Now at 95 Million Page Views a Month

20 0.10020263 525 high scalability-2009-03-05-Product: Amazon Simple Storage Service
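
For this list, simValue is plausibly the cosine similarity between two posts' tf-idf vectors: a post compared with itself scores 1.0, and the 1.0000001 on the same-blog row above is floating-point noise. A minimal, self-contained sketch with a made-up corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "google reveals new mapreduce stats",
        "amazon ec2 pay as you grow could cut your costs in half",
        "mapreduce simplified data processing on large clusters",
    ]
    doc_vectors = TfidfVectorizer().fit_transform(corpus)

    # Similarity of post 0 against every post, best first; post 0 itself
    # comes out on top at ~1.0, like the same-blog row above.
    sims = cosine_similarity(doc_vectors[0:1], doc_vectors).ravel()
    ranking = sims.argsort()[::-1]
    print(list(zip(ranking, sims[ranking].round(3))))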


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.137), (1, 0.088), (2, 0.008), (3, 0.048), (4, -0.065), (5, -0.006), (6, 0.064), (7, 0.038), (8, 0.111), (9, 0.053), (10, 0.026), (11, -0.164), (12, 0.083), (13, -0.03), (14, 0.032), (15, 0.011), (16, -0.136), (17, -0.063), (18, 0.067), (19, 0.049), (20, 0.088), (21, 0.015), (22, 0.018), (23, -0.127), (24, 0.049), (25, -0.005), (26, -0.011), (27, -0.001), (28, -0.026), (29, 0.027), (30, 0.114), (31, -0.01), (32, -0.011), (33, 0.006), (34, -0.052), (35, 0.011), (36, 0.031), (37, -0.036), (38, 0.043), (39, 0.119), (40, 0.029), (41, 0.008), (42, 0.071), (43, -0.043), (44, -0.02), (45, -0.006), (46, -0.024), (47, -0.059), (48, -0.009), (49, -0.066)]
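
The 50 (topicId, topicWeight) pairs above are this post's coordinates in a 50-dimensional latent semantic space. The usual LSI recipe is truncated SVD over the tf-idf matrix, then cosine similarity in topic space; a minimal sketch, with a random stand-in for the real corpus matrix:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    # Stand-in for the corpus-wide tf-idf matrix (posts x vocabulary terms).
    rng = np.random.default_rng(0)
    doc_vectors = rng.random((1500, 5000))

    svd = TruncatedSVD(n_components=50)            # 50 topics, matching the
    lsi_vectors = svd.fit_transform(doc_vectors)   # 50 weights listed above

    # simValue would then be cosine similarity between topic vectors. Note
    # cosine(v, v) == 1.0 exactly; the ~0.98 same-blog value below suggests
    # the pipeline derives the query vector separately (e.g., folding the
    # post back into topic space) rather than reusing the stored row.
    sims = cosine_similarity(lsi_vectors[:1], lsi_vectors).ravel()
    ranking = sims.argsort()[::-1]
    print(list(zip(ranking[:5], sims[ranking[:5]].round(3))))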

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98000491 211 high scalability-2008-01-13-Google Reveals New MapReduce Stats

Introduction: The Google Operating System blog has an interesting post on Google's scale based on an updated version of Google's paper about MapReduce. The input data for some of the MapReduce jobs run in September 2007 was 403,152 TB (terabytes), the average number of machines allocated for a MapReduce job was 394, while the average completion time was 6 minutes and a half. The paper mentions that Google's indexing system processes more than 20 TB of raw data. Niall Kennedy calculates that the average MapReduce job runs across a $1 million hardware infrastructure, assuming that Google still uses the same cluster configurations from 2004: two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link. Greg Linden notices that Google's infrastructure is an important competitive advantage. "Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they ca

2 0.85548139 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters

Introduction: Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class . Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve a

3 0.73054415 336 high scalability-2008-05-31-Biggest Under Reported Story: Google's BigTable Costs 10 Times Less than Amazon's SimpleDB

Introduction: Why isn't Google's aggressive new database pricing strategy getting more pub? That's what Bill Katz , instigator of the GAE Meetup and prize winning science fiction author is wondering: It's surprising that the blogosphere hasn't picked up the biggest difference in pricing: Google's datastore is less than a tenth of the price of Amazon's SimpleDB while offering a better API. If money matters to you then the burn rate under GAE could be convincingly lower. Let's compare the numbers: GAE pricing : * $0.10 - $0.12 per CPU core-hour * $0.15 - $0.18 per GB-month of storage * $0.11 - $0.13 per GB outgoing bandwidth * $0.09 - $0.11 per GB incoming bandwidth SimpleDB Pricing : * $0.14 per Amazon SimpleDB Machine Hour consumed * Structured Data Storage - $1.50 per GB-month * $0.100 per GB - all data transfer in * $0.170 per GB - first 10 TB / month data transfer out (more on the site) Clearly Google priced their services to be competitive

4 0.73041111 409 high scalability-2008-10-13-Challenges from large scale computing at Google

Introduction: From Greg Linden on a talk Google Fellow Jeff Dean gave last week at University of Washington Computer Science titled "Research Challenges Inspired by Large-Scale Computing at Google" : Coming away from the talk, the biggest points for me were the considerable interest in reducing costs (especially reducing power costs), the suggestion that the Google cluster may eventually contain 10M machines at 1k locations, and the call to action for researchers on distributed systems and databases to think orders of magnitude bigger than they often are, not about running on hundreds of machines in one location, but hundreds of thousands of machines across many locations.

5 0.70530534 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm

Introduction: If Google was a boxer then MapReduce would be a probing right hand that sets up the massive left hook that is  Dremel , Google's—scalable (thousands of CPUs, petabytes of data, trillions of rows), SQL based, columnar, interactive (results returned in seconds), ad-hoc—analytics system. If Google was a magician then MapReduce would be the shiny thing that distracts the mind while the trick goes unnoticed. I say that because even though Dremel has been around internally at Google since 2006, we have not heard a whisper about it. All we've heard about is MapReduce, clones of which have inspired entire new industries. Tricky . Dremel, according to Brian Bershad, Director of Engineering at Google, is targeted at solving BigData class problems : While we all know that systems are huge and will get even huger, the implications of this size on programmability, manageability, power, etc. is hard to comprehend. Alfred noted that the Internet is predicted to be carrying a zetta-byte (10^21

6 0.70238024 912 high scalability-2010-10-01-Google Paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications

7 0.69649047 1535 high scalability-2013-10-21-Google's Sanjay Ghemawat on What Made Google Google and Great Big Data Career Advice

8 0.67700595 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)

9 0.66879636 640 high scalability-2009-06-28-Google Voice Architecture

10 0.6489864 75 high scalability-2007-08-28-Google Utilities : An online google guide,tools and Utilities.

11 0.62235004 1078 high scalability-2011-07-12-Google+ is Built Using Tools You Can Use Too: Closure, Java Servlets, JavaScript, BigTable, Colossus, Quick Turnaround

12 0.62091023 734 high scalability-2009-10-30-Hot Scalabilty Links for October 30 2009

13 0.62079179 1117 high scalability-2011-09-16-Stuff The Internet Says On Scalability For September 16, 2011

14 0.61873442 1107 high scalability-2011-08-29-The Three Ages of Google - Batch, Warehouse, Instant

15 0.60806966 650 high scalability-2009-07-02-Product: Hbase

16 0.6064719 362 high scalability-2008-08-11-Distributed Computing & Google Infrastructure

17 0.60004956 1328 high scalability-2012-09-24-Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In

18 0.59241575 1485 high scalability-2013-07-01-PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute

19 0.58658916 1310 high scalability-2012-08-23-Economies of Scale in the Datacenter: Gmail is 100x Cheaper to Run than Your Own Server

20 0.58127648 1548 high scalability-2013-11-13-Google: Multiplex Multiple Works Loads on Computers to Increase Machine Utilization and Save Money


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.057), (2, 0.153), (10, 0.041), (61, 0.063), (77, 0.384), (79, 0.127), (94, 0.066)]
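
The sparse (topicId, topicWeight) list above is a topic mixture: weights under a small cutoff are dropped, which is why only seven topic ids appear. A minimal sketch with gensim's LdaModel; gensim is a guess based on the sparse output format, and the toy corpus is made up:

    from gensim import corpora, models

    # Toy tokenized corpus; the real model was trained on the full collection.
    texts = [
        ["google", "mapreduce", "tb", "job", "cluster"],
        ["amazon", "ec2", "instance", "price", "hour"],
        ["datacenter", "latency", "ethernet", "network", "paper"],
    ]
    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]

    lda = models.LdaModel(bow, id2word=dictionary, num_topics=3, random_state=0)

    # get_document_topics returns sparse (topicId, topicWeight) pairs, the
    # same shape as the list above; near-zero topics are omitted by default.
    print(lda.get_document_topics(bow[0]))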

similar blogs list:

simIndex simValue blogId blogTitle

1 0.87904793 474 high scalability-2008-12-21-The I.H.S.D.F. Theorem: A Proposed Theorem for the Trade-offs in Horizontally Scalable Systems

Introduction: Successful software design is all about trade-offs. In the typical (if there is such a thing) distributed system, recognizing the importance of trade-offs within the design of your architecture is integral to the success of your system. Despite this reality, I see time and time again, developers choosing a particular solution based on an ill-placed belief in their solution as a “silver bullet”, or a solution that conquers all, despite the inevitable occurrence of changing requirements. Regardless of the reasons behind this phenomenon, I’d like to outline a few of the methods I use to ensure that I’m making good scalable decisions without losing sight of the trade-offs that accompany them. I’d also like to compile (pun intended) the issues at hand, by formulating a simple theorem that we can use to describe this oft occurring situation.

2 0.86604917 1116 high scalability-2011-09-15-Paper: It's Time for Low Latency - Inventing the 1 Microsecond Datacenter

Introduction: In  It's Time for Low Latency   Stephen Rumble et al. explore the idea that it's time to rearchitect our stack to live in the modern era of low-latency datacenter instead of high-latency WANs. The implications for program architectures will be revolutionary .   Luiz André Barroso , Distinguished Engineer at Google, sees ultra low latency as a way to make computer resources, to be as much as possible, fungible, that is they are interchangeable and location independent, effectively turning a datacenter into single computer.  Abstract from the paper: The operating systems community has ignored network latency for too long. In the past, speed-of-light delays in wide area networks and unoptimized network hardware have made sub-100µs round-trip times impossible. However, in the next few years datacenters will be deployed with low-latency Ethernet. Without the burden of propagation delays in the datacenter campus and network delays in the Ethernet devices, it will be up to us to finish

3 0.86051404 1493 high scalability-2013-07-17-Steve Ballmer Says Microsoft has Over 1 Million Servers - What Does that Really Mean?

Introduction: James Hamilton in  Counting Servers is Hard  has an awesome breakdown of what one million plus servers  really means in terms of resource usage. The summary from his calculations are eye popping: Facilities: 15 to 30 large datacenters Capital expense: $4.25 Billion Total power: 300MW Power Consumption: 2.6TWh annually The power consumption is about the same as used by Nicaragua and the capital cost is about a third of what Americans spent on video games in 2012. Now that's web scale.

4 0.84549677 258 high scalability-2008-02-24-Yandex Architecture

Introduction: Update: Anatomy of a crash in a new part of Yandex written in Django . Writing to a magic session variable caused an unexpected write into an InnoDB database on every request. Writes took 6-7 seconds because of index rebuilding. Lots of useful details on the sizing of their system, what went wrong, and how they fixed it. Yandex is a Russian search engine with 3.5 billion pages in their search index. We only know a few fun facts about how they do things, nothing at a detailed architecture level. Hopefully we'll learn more later, but I thought it would still be interesting. From Allen Stern's interview with Yandex's CTO Ilya Segalovich, we learn: 3.5 billion pages in the search index. Over several thousand servers. 35 million searches a day. Several data centers around Russia. Two-layer architecture. The database is split in pieces and when a search is requested, it pulls the bits from the different database servers and brings it together for the user. Languages

same-blog 5 0.83634269 211 high scalability-2008-01-13-Google Reveals New MapReduce Stats

Introduction: The Google Operating System blog has an interesting post on Google's scale based on an updated version of Google's paper about MapReduce. The input data for some of the MapReduce jobs run in September 2007 was 403,152 TB (terabytes), the average number of machines allocated for a MapReduce job was 394, while the average completion time was 6 minutes and a half. The paper mentions that Google's indexing system processes more than 20 TB of raw data. Niall Kennedy calculates that the average MapReduce job runs across a $1 million hardware infrastructure, assuming that Google still uses the same cluster configurations from 2004: two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link. Greg Linden notices that Google's infrastructure is an important competitive advantage. "Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they ca

6 0.83116436 1195 high scalability-2012-02-17-Stuff The Internet Says On Scalability For February 17, 2012

7 0.82188624 753 high scalability-2009-12-21-Hot Holiday Scalability Links for 2009

8 0.81707662 766 high scalability-2010-01-26-Product: HyperGraphDB - A Graph Database

9 0.79944795 525 high scalability-2009-03-05-Product: Amazon Simple Storage Service

10 0.79764986 959 high scalability-2010-12-17-Stuff the Internet Says on Scalability For December 17th, 2010

11 0.77369767 1531 high scalability-2013-10-13-AIDA: Badoo’s journey into Continuous Integration

12 0.77121073 1059 high scalability-2011-06-14-A TripAdvisor Short

13 0.76106089 439 high scalability-2008-11-10-Scalability Perspectives #1: Nicholas Carr – The Big Switch

14 0.74403071 1377 high scalability-2012-12-26-Ask HS: What will programming and architecture look like in 2020?

15 0.72922677 1158 high scalability-2011-12-16-Stuff The Internet Says On Scalability For December 16, 2011

16 0.67955488 1188 high scalability-2012-02-06-The Design of 99designs - A Clean Tens of Millions Pageviews Architecture

17 0.67404121 1567 high scalability-2013-12-20-Stuff The Internet Says On Scalability For December 20th, 2013

18 0.67310143 1571 high scalability-2014-01-02-xkcd: How Standards Proliferate:

19 0.66427422 222 high scalability-2008-01-25-Application Database and DAL Architecture

20 0.66279978 212 high scalability-2008-01-14-OpenSpaces.org community site launched - framework for building scale-out applications