high_scalability high_scalability-2009 high_scalability-2009-483 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class. Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve and make programmers more productive faster.
sentIndex sentText sentNum sentScore
1 For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. [sent-3, score-0.217]
2 With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. [sent-4, score-0.092]
3 This is the best paper on the subject and is an excellent primer on a content-addressable memory future. [sent-5, score-0.218]
4 Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. [sent-6, score-0.878]
5 One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. [sent-7, score-0.314]
6 Hopefully a way will be found to lower the learning curve and make programmers more productive faster. [sent-8, score-0.386]
7 From the abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. [sent-9, score-0.3]
8 Thanks to Kevin Burton for linking to the complete article. [sent-12, score-0.1]
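The abstract quoted above (sentence 7) describes MapReduce as a programming model: the user supplies a map function that emits intermediate key/value pairs and a reduce function that merges all values sharing a key, while the runtime parallelizes, schedules, and groups. A minimal single-process Python sketch of that model, using the paper's canonical word-count example; this toy runner is for illustration only, not Google's clustered implementation:

```python
from collections import defaultdict

def map_phase(documents, map_fn):
    """Apply the user's map function to every record, collecting
    intermediate (key, value) pairs."""
    intermediate = []
    for doc in documents:
        intermediate.extend(map_fn(doc))
    return intermediate

def shuffle(intermediate):
    """Group all intermediate values by key (the framework's job)."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each key's list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The canonical word-count example from the paper, expressed in this model.
def word_count_map(doc):
    for word in doc.split():
        yield word, 1

def word_count_reduce(word, counts):
    return sum(counts)

docs = ["the flu spread fast", "the flu season"]
print(reduce_phase(shuffle(map_phase(docs, word_count_map)), word_count_reduce))
# {'the': 2, 'flu': 2, 'spread': 1, 'fast': 1, 'season': 1}
```

The point of the model is that only word_count_map and word_count_reduce are application code; the map, shuffle, and reduce plumbing is the framework's job, which is what lets the real implementation run the same two functions across large clusters.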
wordName wordTfidf (topN-words)
[('mapreduce', 0.434), ('petabytes', 0.225), ('productive', 0.191), ('articleby', 0.147), ('flu', 0.147), ('dayby', 0.138), ('ghemawat', 0.138), ('googlers', 0.138), ('jeffrey', 0.138), ('parallelizes', 0.138), ('sanjay', 0.138), ('amenable', 0.131), ('ongoogle', 0.131), ('withgoogle', 0.127), ('primer', 0.127), ('computation', 0.125), ('programs', 0.125), ('google', 0.124), ('criticism', 0.123), ('traced', 0.116), ('implemented', 0.115), ('programmers', 0.114), ('twenty', 0.114), ('pagerank', 0.111), ('clusters', 0.11), ('jobs', 0.109), ('generations', 0.101), ('collects', 0.101), ('linking', 0.1), ('day', 0.099), ('schedules', 0.094), ('facts', 0.093), ('dean', 0.093), ('entering', 0.092), ('paper', 0.091), ('gigabit', 0.091), ('executes', 0.091), ('broad', 0.087), ('ethernet', 0.087), ('distinct', 0.087), ('generating', 0.082), ('notes', 0.082), ('curve', 0.081), ('fall', 0.08), ('specify', 0.079), ('executed', 0.077), ('per', 0.075), ('dual', 0.075), ('hopefully', 0.073), ('internally', 0.073)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
2 0.21387266 448 high scalability-2008-11-22-Google Architecture
Introduction: Update 2: Sorting 1 PB with MapReduce. PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1 PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks. (A quick sanity check of these figures, and of the next entry's, follows this list.) Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build
3 0.21031846 211 high scalability-2008-01-13-Google Reveals New MapReduce Stats
Introduction: The Google Operating System blog has an interesting post on Google's scale based on an updated version of Google's paper about MapReduce. The input data for some of the MapReduce jobs run in September 2007 was 403,152 TB (terabytes), the average number of machines allocated for a MapReduce job was 394, and the average completion time was six and a half minutes. The paper mentions that Google's indexing system processes more than 20 TB of raw data. Niall Kennedy calculates that the average MapReduce job runs across a $1 million hardware infrastructure, assuming that Google still uses the same cluster configurations from 2004: two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link. Greg Linden notices that Google's infrastructure is an important competitive advantage. "Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they ca
4 0.16627589 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
Introduction: If Google was a boxer then MapReduce would be a probing right hand that sets up the massive left hook that is Dremel, Google's—scalable (thousands of CPUs, petabytes of data, trillions of rows), SQL-based, columnar, interactive (results returned in seconds), ad-hoc—analytics system. If Google was a magician then MapReduce would be the shiny thing that distracts the mind while the trick goes unnoticed. I say that because even though Dremel has been around internally at Google since 2006, we have not heard a whisper about it. All we've heard about is MapReduce, clones of which have inspired entire new industries. Tricky. Dremel, according to Brian Bershad, Director of Engineering at Google, is targeted at solving BigData class problems: While we all know that systems are huge and will get even huger, the implications of this size on programmability, manageability, power, etc. is hard to comprehend. Alfred noted that the Internet is predicted to be carrying a zettabyte (10^21
5 0.15843295 882 high scalability-2010-08-18-Misco: A MapReduce Framework for Mobile Systems - Start of the Ambient Cloud?
Introduction: Misco: A MapReduce Framework for Mobile Systems is a very exciting paper to me because it's really one of the first explorations of some of the ideas in Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud. What they are trying to do is efficiently distribute work across a set of cellphones using a now familiar MapReduce interface. Usually we think of MapReduce as working across large data center hosted clusters. Here, the cluster nodes are cellphones not contained in any data center, but compute nodes potentially distributed everywhere. I talked with Adam Dou, one of the paper's authors, and he said they don't see cellphone clusters replacing dedicated computer clusters, primarily because of the power required for both network communication and the map-reduce computations. Large multi-terabyte jobs aren't in the cards...yet. Adam estimates that, computationally, cellphones perform similarly to desktops of ten years ago. Instead, they
6 0.15266576 590 high scalability-2009-05-06-Art of Distributed
7 0.13895014 1535 high scalability-2013-10-21-Google's Sanjay Ghemawat on What Made Google Google and Great Big Data Career Advice
8 0.13452943 376 high scalability-2008-09-03-MapReduce framework Disco
9 0.12445892 362 high scalability-2008-08-11-Distributed Computing & Google Infrastructure
10 0.1207296 401 high scalability-2008-10-04-Is MapReduce going mainstream?
11 0.11958767 666 high scalability-2009-07-30-Learn How to Think at Scale
12 0.11587952 414 high scalability-2008-10-15-Hadoop - A Primer
13 0.11117489 644 high scalability-2009-06-29-eHarmony.com describes how they use Amazon EC2 and MapReduce
14 0.10862469 1233 high scalability-2012-04-25-The Anatomy of Search Technology: blekko’s NoSQL database
15 0.10360302 1607 high scalability-2014-03-07-Stuff The Internet Says On Scalability For March 7th, 2014
16 0.10172834 309 high scalability-2008-04-23-Behind The Scenes of Google Scalability
18 0.09721338 1117 high scalability-2011-09-16-Stuff The Internet Says On Scalability For September 16, 2011
19 0.095875986 470 high scalability-2008-12-18-Risk Analysis on the Cloud (Using Excel and GigaSpaces)
20 0.095329173 1181 high scalability-2012-01-25-Google Goes MoreSQL with Tenzing - SQL Over MapReduce
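A back-of-the-envelope check of the figures quoted in entries 2 and 3 above, using only the numbers in those entries (the arithmetic below is mine, not from the posts):

```python
# Entry 2: 1 PB sorted as 10 trillion 100-byte records in 6h02m on 4,000 machines.
records = 10e12             # 10 trillion records
record_size = 100           # bytes each
total_bytes = records * record_size
assert total_bytes == 1e15  # exactly 1 petabyte, so the record count checks out

elapsed = 6 * 3600 + 2 * 60              # 6 hours 2 minutes, in seconds
per_machine = total_bytes / elapsed / 4000
print(f"~{per_machine / 1e6:.0f} MB/s sorted per machine")  # ~12 MB/s

# Entry 3: 403,152 TB of input across (some of) September 2007's jobs.
print(f"~{403_152 / 30:,.0f} TB/day")  # ~13,438 TB/day, i.e. ~13 PB/day --
# the same order of magnitude as the "more than 20 PB/day" figure quoted above,
# which covers all jobs rather than a subset.
```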
topicId topicWeight
[(0, 0.123), (1, 0.084), (2, 0.018), (3, 0.09), (4, -0.011), (5, 0.039), (6, 0.06), (7, 0.065), (8, 0.06), (9, 0.113), (10, 0.049), (11, -0.092), (12, 0.036), (13, -0.049), (14, 0.062), (15, -0.014), (16, -0.095), (17, -0.07), (18, 0.066), (19, 0.03), (20, 0.069), (21, 0.013), (22, -0.009), (23, -0.054), (24, 0.041), (25, 0.03), (26, 0.001), (27, 0.069), (28, -0.029), (29, 0.052), (30, 0.078), (31, 0.054), (32, -0.04), (33, 0.021), (34, -0.019), (35, -0.044), (36, 0.038), (37, -0.029), (38, 0.014), (39, 0.128), (40, 0.015), (41, -0.008), (42, -0.031), (43, -0.007), (44, -0.001), (45, -0.029), (46, -0.039), (47, -0.033), (48, -0.039), (49, -0.092)]
simIndex simValue blogId blogTitle
same-blog 1 0.98291308 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
2 0.83072478 211 high scalability-2008-01-13-Google Reveals New MapReduce Stats
3 0.79233354 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
4 0.78763485 376 high scalability-2008-09-03-MapReduce framework Disco
Introduction: Disco is an open-source implementation of the MapReduce framework for distributed computing. It was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. The Disco core is written in Erlang. The MapReduce jobs in Disco are natively described as Python programs, which makes it possible to express complex algorithmic and data processing tasks often only in tens of lines of code. (A word-count job in this style is sketched after this list.)
Introduction: This paper, Large-scale Incremental Processing Using Distributed Transactions and Notifications by Daniel Peng and Frank Dabek, is Google's much anticipated description of Percolator, their new real-time indexing system. The abstract: Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency. We have built Percolator, a system f
6 0.78018761 362 high scalability-2008-08-11-Distributed Computing & Google Infrastructure
7 0.76262617 401 high scalability-2008-10-04-Is MapReduce going mainstream?
8 0.71961546 1512 high scalability-2013-09-05-Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
9 0.71520019 409 high scalability-2008-10-13-Challenges from large scale computing at Google
10 0.68594623 1535 high scalability-2013-10-21-Google's Sanjay Ghemawat on What Made Google Google and Great Big Data Career Advice
11 0.68428719 850 high scalability-2010-06-30-Paper: GraphLab: A New Framework For Parallel Machine Learning
12 0.68368381 650 high scalability-2009-07-02-Product: Hbase
13 0.65384674 309 high scalability-2008-04-23-Behind The Scenes of Google Scalability
14 0.64200389 734 high scalability-2009-10-30-Hot Scalability Links for October 30 2009
15 0.63298535 448 high scalability-2008-11-22-Google Architecture
16 0.618698 216 high scalability-2008-01-17-Database People Hating on MapReduce
17 0.61629647 1107 high scalability-2011-08-29-The Three Ages of Google - Batch, Warehouse, Instant
18 0.61120701 634 high scalability-2009-06-20-Building a data cycle at LinkedIn with Hadoop and Project Voldemort
19 0.59564269 1181 high scalability-2012-01-25-Google Goes MoreSQL with Tenzing - SQL Over MapReduce
20 0.58789885 666 high scalability-2009-07-30-Learn How to Think at Scale
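Entry 4 above (Disco) claims complex jobs fit in tens of lines of Python. Here is a word count in roughly the shape Disco's tutorial uses; treat the exact imports and signatures (disco.core.Job, result_iterator, disco.util.kvgroup) as assumptions to verify against the Disco documentation, and the input URL as a placeholder:

```python
from disco.core import Job, result_iterator

def fun_map(line, params):
    # Emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word, 1

def fun_reduce(iter, params):
    # Reduce inputs arrive as (key, value) pairs; group by word and sum.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    job = Job().run(input=["http://example.com/bigfile.txt"],  # placeholder
                    map=fun_map,
                    reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```

The user writes only the two functions; the Erlang core distributes the work, which is how the whole job stays this short.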
topicId topicWeight
[(1, 0.055), (2, 0.196), (10, 0.06), (40, 0.014), (61, 0.07), (76, 0.249), (77, 0.041), (79, 0.168), (94, 0.059)]
simIndex simValue blogId blogTitle
1 0.90297335 172 high scalability-2007-12-02-nginx: high performance smtp-pop-imap proxy
Introduction: nginx is a high performance smtp/pop/imap proxy that lets you do custom authorization and lookups and is very scalable (just add nodes). Nginx by default is a reverse proxy, and that is what it is doing here for pop/imap connections. It is also an excellent reverse proxy for web servers. Advantage: you don't have to have a special database or LDAP schema, just a URL to do auth and lookup with, one that may be accessed over a unix or a tcp socket. Write your own auth handler according to your own policy. For example: a user called atif tries to log in with the pass testxyz. You pass this information to a URL such as socket:/var/tmp/xyz.sock or http://auth.corp.mailserver.net:someport/someurl. The auth server replies with either a failure such as Auth-Status: Invalid login or password, or with a success such as Auth-Status: OK, Auth-Server: OneOfThe100Servers, Auth-Port: optionallyAPort. (A minimal sketch of such a handler appears after this list.) We have implemented it at our ISP and it has saved us a
same-blog 2 0.88865197 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
3 0.87848145 966 high scalability-2010-12-31-Facebook in 20 Minutes: 2.7M Photos, 10.2M Comments, 4.6M Messages
Introduction: To celebrate the new year Facebook has shared the results of a little end-of-the-year introspection. It has been a fecund year for Facebook: 43,869,800 changed their status to single; 3,025,791 changed their status to "it's complicated"; 28,460,516 changed their status to in a relationship; 5,974,574 changed their status to engaged; 36,774,801 changed their status to married. If these numbers are simply too large to grasp, it doesn't get any better when you look at what happens in a mere 20 minutes: Shared links: 1,000,000; Tagged photos: 1,323,000; Event invites sent out: 1,484,000; Wall posts: 1,587,000; Status updates: 1,851,000; Friend requests accepted: 1,972,000; Photos uploaded: 2,716,000; Comments: 10,208,000; Messages: 4,632,000. (These are converted to per-second rates after this list.) If you want to see how Facebook supports these huge numbers take a look at a few posts. One wonders what the new year will bring? Related Articles: What the World Eats from Time Magazine; A Day in the Life of an An
4 0.79438114 665 high scalability-2009-07-29-Strategy: Let Google and Yahoo Host Your Ajax Library - For Free
Introduction: Update: Offloading ALL JS Files To Google. Now you can let Google serve all your javascript files. This article tells you how to do it (upload to Google Code Project) and why it's a big win (cheap, fast, caching, parallel downloads, save bandwidth). Don't have a CDN? Why not let Google and Yahoo be your CDN? At least for Ajax libraries. No charge. Google runs a content distribution network and loading architecture for the most popular open source JavaScript libraries, which include: jQuery, prototype, script.aculo.us, MooTools, and dojo. The idea is web pages directly include your library of choice from Google's global, fast, and highly available network. Some have found much better performance and others experienced slower performance. My guess is the performance may be slower if your data center is close to you, but far away users will be much happier. Some negatives: not all libraries are included, you'll load more than you need because all functionality is included. Yahoo
5 0.78522766 1161 high scalability-2011-12-22-Architecting Massively-Scalable Near-Real-Time Risk Analysis Solutions
Introduction: Constructing a scalable risk analysis solution is a fascinating architectural challenge. If you come from Financial Services you are sure to appreciate that. But even architects from other domains are bound to find the challenges fascinating, and the architectural patterns of my suggested solution highly useful in other domains. Recently I held an interesting webinar on architecting scalable, near-real-time risk analysis solutions, based on experience gathered with Financial Services customers. Seeing the vast interest in the webinar, I would like to share the highlights with you here. From an architectural point of view, risk analysis is a data-intensive and compute-intensive process, which also has elaborate orchestration logic. Volumes in this domain are massive and ever-increasing, together with an ever-increasing demand to reduce response time. These trends are aggravated by global financial regulatory reforms set following the late-2000s
6 0.78090447 1122 high scalability-2011-09-23-Stuff The Internet Says On Scalability For September 23, 2011
7 0.77395147 889 high scalability-2010-08-30-Pomegranate - Storing Billions and Billions of Tiny Little Files
8 0.75479233 65 high scalability-2007-08-16-Scaling Secret #2: Denormalizing Your Way to Speed and Profit
9 0.75137991 1564 high scalability-2013-12-13-Stuff The Internet Says On Scalability For December 13th, 2013
10 0.73676223 526 high scalability-2009-03-05-Strategy: In Cloud Computing Systematically Drive Load to the CPU
11 0.73466611 1098 high scalability-2011-08-15-Should any cloud be considered one availability zone? The Amazon experience says yes.
12 0.73172659 289 high scalability-2008-03-27-Amazon Announces Static IP Addresses and Multiple Datacenter Operation
13 0.72979981 1316 high scalability-2012-09-04-Changing Architectures: New Datacenter Networks Will Set Your Code and Data Free
14 0.72831744 1179 high scalability-2012-01-23-Facebook Timeline: Brought to You by the Power of Denormalization
15 0.72787613 1048 high scalability-2011-05-27-Stuff The Internet Says On Scalability For May 27, 2011
16 0.72759449 1545 high scalability-2013-11-08-Stuff The Internet Says On Scalability For November 8th, 2013
17 0.72754532 780 high scalability-2010-02-19-Twitter’s Plan to Analyze 100 Billion Tweets
19 0.72739643 601 high scalability-2009-05-17-Product: Hadoop
20 0.7259919 1112 high scalability-2011-09-07-What Google App Engine Price Changes Say About the Future of Web Architecture
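Entry 1 above (nginx) sketches the mail-proxy auth protocol: nginx calls your auth URL with the attempted credentials and routes the session according to the Auth-Status, Auth-Server, and Auth-Port reply headers. A minimal sketch of such a handler, assuming the convention that nginx passes credentials in Auth-User and Auth-Pass request headers; the user table and backend address are placeholders:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder credential store -- replace with your own policy
# (database, LDAP, rate limits, whatever you like).
USERS = {"atif": "testxyz"}
BACKEND_IP = "192.0.2.10"  # one of your backend servers; nginx expects an address
BACKEND_PORT = 110         # POP3

class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # nginx forwards the attempted credentials as request headers.
        user = self.headers.get("Auth-User", "")
        password = self.headers.get("Auth-Pass", "")
        self.send_response(200)  # the auth reply itself is always HTTP 200
        if USERS.get(user) == password:
            # Success: tell nginx which backend to proxy this session to.
            self.send_header("Auth-Status", "OK")
            self.send_header("Auth-Server", BACKEND_IP)
            self.send_header("Auth-Port", str(BACKEND_PORT))
        else:
            self.send_header("Auth-Status", "Invalid login or password")
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), AuthHandler).serve_forever()
```

Point nginx's mail auth_http directive at this endpoint; all the lookup policy stays inside the handler, which is exactly the advantage the entry describes.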
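And entry 3's 20-minute Facebook counts are easier to grasp as per-second rates; simple arithmetic on the figures quoted above:

```python
# Convert Facebook's 20-minute activity counts to per-second rates.
window = 20 * 60  # 20 minutes, in seconds

counts = {
    "shared links": 1_000_000,
    "tagged photos": 1_323_000,
    "event invites sent": 1_484_000,
    "wall posts": 1_587_000,
    "status updates": 1_851_000,
    "friend requests accepted": 1_972_000,
    "photos uploaded": 2_716_000,
    "messages": 4_632_000,
    "comments": 10_208_000,
}

for action, n in counts.items():
    print(f"{action}: ~{n / window:,.0f}/s")
# e.g. comments work out to ~8,507 per second, sustained.
```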