high_scalability high_scalability-2008 high_scalability-2008-448 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Update 2: Sorting 1 PB with MapReduce . PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks. Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters . Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build
sentIndex sentText sentNum sentScore
1 Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. [sent-6, score-0.366]
2 Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. [sent-23, score-0.414]
3 Computing Platforms: a bunch of machines in a bunch of different data centers. Make sure it's easy for folks in the company to deploy at a low cost. [sent-28, score-0.445]
4 Google File System - large distributed log structured file system in which they throw in a lot of data. [sent-34, score-0.415]
5 They required: - high reliability across data centers - scalability to thousands of network nodes - huge read/write bandwidth requirements - support for large blocks of data which are gigabytes in size. [sent-37, score-0.506]
6 - efficient distribution of operations across nodes to reduce bottlenecks. The system has master and chunk servers. [sent-38, score-0.587]
7 Clients talk to the master servers to perform metadata operations on files and to locate the chunk servers that contain the data they need on disk. [sent-41, score-0.478]
8 Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. [sent-43, score-0.718]
9 Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. [sent-54, score-0.834]
10 - The Master server assigns user tasks to map and reduce servers. [sent-67, score-0.395]
11 - The Map servers accept user input and perform map operations on them. [sent-69, score-0.383]
12 The results are written to intermediate files. - The Reduce servers accept intermediate files produced by map servers and perform reduce operations on them. [sent-70, score-0.849]
13 - In MapReduce a map maps one view of data to another, producing a key/value pair, which in our example is word and count (see the word-count sketch after this sentence list). [sent-75, score-0.465]
14 The Google indexing pipeline has about 20 different map reductions. [sent-78, score-0.337]
15 Data transferred between map and reduce servers is compressed. [sent-88, score-0.434]
16 GFS stores opaque data, but many applications need data with structure. [sent-96, score-0.356]
17 A tablet is a sequence of 64KB blocks in a data format called SSTable. [sent-103, score-0.409]
18 BigTable has three different types of servers: - The Master servers assign tablets to tablet servers. [sent-104, score-0.675]
19 When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers. [sent-108, score-0.986]
20 If every project uses a different file system then there's no continual incremental improvement across the entire stack. [sent-149, score-0.347]
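The word-and-count example above is the canonical way to see the map/reduce contract in action. Below is a minimal, single-process word-count sketch of that contract; it is an illustration only, not Google's MapReduce API, and the function names (map_fn, reduce_fn, run) and the in-memory shuffle are assumptions made for the example. A real deployment distributes map and reduce tasks across servers and passes compressed intermediate files between them, as the sentences above describe.

```python
# Minimal word-count sketch of the map/reduce contract described above.
# Single-process stand-in for illustration; not Google's MapReduce API.
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def map_fn(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map one view of the data to another: emit an intermediate (word, 1) pair per word."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Merge all intermediate values associated with the same intermediate key."""
    return word, sum(counts)


def run(documents: dict) -> dict:
    # Shuffle step: group intermediate pairs by key before the reduce phase.
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for word, count in map_fn(doc_id, text):
            intermediate[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in intermediate.items())


if __name__ == "__main__":
    docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
    print(run(docs))  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'and': 1}
```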
wordName wordTfidf (topN-words)
[('gfs', 0.347), ('tablet', 0.264), ('map', 0.215), ('chunk', 0.184), ('lab', 0.171), ('mapreduce', 0.165), ('intermediate', 0.151), ('data', 0.145), ('master', 0.126), ('across', 0.116), ('google', 0.115), ('servers', 0.113), ('tablets', 0.109), ('reduce', 0.106), ('infrastructure', 0.102), ('large', 0.1), ('machines', 0.093), ('structured', 0.089), ('file', 0.085), ('system', 0.081), ('tbs', 0.081), ('petabytes', 0.076), ('tasks', 0.074), ('stored', 0.073), ('bunch', 0.071), ('terabytes', 0.07), ('types', 0.068), ('pb', 0.068), ('mechanism', 0.067), ('applications', 0.066), ('different', 0.065), ('lock', 0.065), ('commodity', 0.063), ('simplified', 0.063), ('distributed', 0.06), ('locality', 0.06), ('pair', 0.06), ('pairs', 0.06), ('io', 0.057), ('pipeline', 0.057), ('prior', 0.056), ('three', 0.056), ('operations', 0.055), ('maps', 0.054), ('roll', 0.053), ('spend', 0.053), ('executed', 0.052), ('build', 0.052), ('sort', 0.051), ('key', 0.051)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999964 448 high scalability-2008-11-22-Google Architecture
Introduction: Update 2: Sorting 1 PB with MapReduce . PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks. Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters . Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build
2 0.22103155 589 high scalability-2009-05-05-Drop ACID and Think About Data
Introduction: The abstract for the talk given by Bob Ippolito, co-founder and CTO of Mochi Media, Inc: Building large systems on top of a traditional single-master RDBMS data storage layer is no longer good enough. This talk explores the landscape of new technologies available today to augment your data layer to improve performance and reliability. Is your application a good fit for caches, bloom filters, bitmap indexes, column stores, distributed key/value stores, or document databases? Learn how they work (in theory and practice) and decide for yourself. Bob does an excellent job highlighting different products and the key concepts to understand when pondering the wide variety of new database offerings. It's unlikely you'll be able to say oh, this is the database for me after watching the presentation, but you will be much better informed on your options. And I imagine slightly confused as to what to do :-) An interesting observation in the talk is that the more robust products are internal
3 0.21387266 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
Introduction: Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class . Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve a
4 0.20411843 650 high scalability-2009-07-02-Product: Hbase
Introduction: Update 3: Presentation from the NoSQL Conference : slides , video . Update 2: Jim Wilson helps with the Understanding HBase and BigTable by explaining them from a "conceptual standpoint." Update: InfoQ interview: HBase Leads Discuss Hadoop, BigTable and Distributed Databases . "MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing." Hbase is the open source answer to BigTable, Google's highly scalable distributed database. It is built on top of Hadoop ( product ), which implements functionality similar to Google's GFS and Map/Reduce systems. Both Google's GFS and Hadoop's HDFS provide a mechanism to reliably store large amounts of data. However, there is not really a mechanism for organizing the data and accessing only the parts that are of interest to a particular application. Bigtable (and Hbase) provide a means for
5 0.20318747 538 high scalability-2009-03-16-Are Cloud Based Memory Architectures the Next Big Thing?
Introduction: We are on the edge of two potent technological changes: Clouds and Memory Based Architectures. This evolution will rip open a chasm where new players can enter and prosper. Google is the master of disk. You can't beat them at a game they perfected. Disk based databases like SimpleDB and BigTable are complicated beasts, typical last gasp products of any aging technology before a change. The next era is the age of Memory and Cloud which will allow for new players to succeed. The tipping point will be soon. Let's take a short trip down web architecture lane: It's 1993: Yahoo runs on FreeBSD, Apache, Perl scripts and a SQL database It's 1995: Scale-up the database. It's 1998: LAMP It's 1999: Stateless + Load Balanced + Database + SAN It's 2001: In-memory data-grid. It's 2003: Add a caching layer. It's 2004: Add scale-out and partitioning. It's 2005: Add asynchronous job scheduling and maybe a distributed file system. It's 2007: Move it all into the cloud. It's 2008: C
8 0.19355612 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge
9 0.19006076 309 high scalability-2008-04-23-Behind The Scenes of Google Scalability
10 0.18616119 601 high scalability-2009-05-17-Product: Hadoop
11 0.18203363 882 high scalability-2010-08-18-Misco: A MapReduce Framework for Mobile Systems - Start of the Ambient Cloud?
12 0.17894703 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
13 0.17614868 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?
14 0.17605348 920 high scalability-2010-10-15-Troubles with Sharding - What can we learn from the Foursquare Incident?
15 0.16835769 666 high scalability-2009-07-30-Learn How to Think at Scale
16 0.1683002 227 high scalability-2008-01-28-Howto setup GFS-GNBD
17 0.1673878 1529 high scalability-2013-10-08-F1 and Spanner Holistically Compared
18 0.16634963 112 high scalability-2007-10-04-You Can Now Store All Your Stuff on Your Own Google Like File System
19 0.16413689 517 high scalability-2009-02-21-Google AppEngine - A Second Look
20 0.16351523 1240 high scalability-2012-05-07-Startups are Creating a New System of the World for IT
topicId topicWeight
[(0, 0.338), (1, 0.152), (2, 0.01), (3, 0.015), (4, -0.03), (5, 0.093), (6, 0.11), (7, 0.004), (8, 0.028), (9, 0.092), (10, 0.068), (11, -0.024), (12, 0.011), (13, -0.084), (14, 0.133), (15, 0.059), (16, -0.115), (17, -0.039), (18, 0.019), (19, 0.058), (20, 0.058), (21, 0.049), (22, 0.024), (23, -0.029), (24, 0.029), (25, 0.002), (26, 0.058), (27, 0.044), (28, -0.088), (29, 0.057), (30, 0.002), (31, 0.029), (32, -0.005), (33, -0.024), (34, -0.055), (35, -0.026), (36, 0.005), (37, -0.029), (38, 0.006), (39, 0.083), (40, 0.024), (41, -0.043), (42, -0.042), (43, -0.048), (44, -0.011), (45, -0.026), (46, -0.013), (47, -0.047), (48, 0.001), (49, -0.008)]
simIndex simValue blogId blogTitle
same-blog 1 0.97295421 448 high scalability-2008-11-22-Google Architecture
Introduction: Update 2: Sorting 1 PB with MapReduce . PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks. Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters . Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build
2 0.87764931 666 high scalability-2009-07-30-Learn How to Think at Scale
Introduction: Aaron Kimball of Cloudera gives a wonderful 23 minute presentation titled Cloudera Hadoop Training: Thinking at Scale Cloudera which talks about "common challenges and general best practices for scaling with your data." As a company Cloudera offers "enterprise-level support to users of Apache Hadoop." Part of that offering is a really useful series of tutorial videos on the Hadoop ecosystem . Like TV lawyer Perry Mason (or is it Harmon Rabb?), Aaron gradually builds his case. He opens with the problem of storing lots of data. Then a blistering cross examination of the problem of building distributed systems to analyze that data sets up a powerful closing argument. With so much testimony behind him, on closing Aaron really brings it home with why shared nothing systems like map-reduce are the right solution on how to query lots of data. They jury loved it. Here's the video Thinking at Scale . And here's a summary of some of the lessons learned from the talk: Lessons Learned
3 0.83338672 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
Introduction: If Google was a boxer then MapReduce would be a probing right hand that sets up the massive left hook that is Dremel , Google's—scalable (thousands of CPUs, petabytes of data, trillions of rows), SQL based, columnar, interactive (results returned in seconds), ad-hoc—analytics system. If Google was a magician then MapReduce would be the shiny thing that distracts the mind while the trick goes unnoticed. I say that because even though Dremel has been around internally at Google since 2006, we have not heard a whisper about it. All we've heard about is MapReduce, clones of which have inspired entire new industries. Tricky . Dremel, according to Brian Bershad, Director of Engineering at Google, is targeted at solving BigData class problems : While we all know that systems are huge and will get even huger, the implications of this size on programmability, manageability, power, etc. is hard to comprehend. Alfred noted that the Internet is predicted to be carrying a zetta-byte (10 21
4 0.82561219 815 high scalability-2010-04-27-Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure
Introduction: Imagine a single search request coursing through Google's massive infrastructure. A single request can run across thousands of machines and involve hundreds of different subsystems. And oh by the way, you are processing more requests per second than any other system in the world. How do you debug such a system? How do you figure out where the problems are? How do you determine if programmers are coding correctly? How do you keep sensitive data secret and safe? How do ensure products don't use more resources than they are assigned? How do you store all the data? How do you make use of it? That's where Dapper comes in. Dapper is Google's tracing system and it was originally created to understand the system behaviour from a search request. Now Google's production clusters generate more than 1 terabyte of sampled trace data per day . So how does Dapper do what Dapper does? Dapper is described in an very well written and intricately detailed paper: Dapper, a Large-Scale Distributed Sy
5 0.81514382 326 high scalability-2008-05-25-Product: Condor - Compute Intensive Workload Management
Introduction: From their website: Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. While providing functionality similar to that of a more traditional batch queueing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard
7 0.80417949 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge
10 0.79961479 1075 high scalability-2011-07-07-Myth: Google Uses Server Farms So You Should Too - Resurrection of the Big-Ass Machines
11 0.79377234 1270 high scalability-2012-06-22-Stuff The Internet Says On Scalability For June 22, 2012
12 0.79304641 601 high scalability-2009-05-17-Product: Hadoop
13 0.7916792 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
14 0.78568798 50 high scalability-2007-07-31-BerkeleyDB & other distributed high performance key-value databases
15 0.78428227 901 high scalability-2010-09-16-How Can the Large Hadron Collider Withstand One Petabyte of Data a Second?
17 0.77998638 216 high scalability-2008-01-17-Database People Hating on MapReduce
18 0.77679813 1589 high scalability-2014-02-03-How Google Backs Up the Internet Along With Exabytes of Other Data
20 0.76682407 350 high scalability-2008-07-15-ZooKeeper - A Reliable, Scalable Distributed Coordination System
topicId topicWeight
[(1, 0.136), (2, 0.186), (10, 0.04), (30, 0.019), (61, 0.047), (77, 0.011), (79, 0.372), (85, 0.04), (94, 0.052)]
simIndex simValue blogId blogTitle
1 0.99219489 1403 high scalability-2013-02-08-Stuff The Internet Says On Scalability For February 8, 2013
Introduction: Hey, it's HighScalability time: 34TB : storage for GitHub search ; 2,880,000,000: log lines per day Quotable Quotes: @peakscale : The " IKEA effec t" << Contributes to NIH and why ppl still like IaaS over PaaS. :-\ @sheeshee : module named kafka.. creates weird & random processes, sends data from here to there & after 3 minutes noone knows what's happening anymore? @sometoomany : Ceased writing a talk about cloud computing infrastructure, and data centre power efficiency. Bored myself to death, but saved others. Larry Kass on aged bourbon : Where it spent those years is as important has how many years it spent. Lots of heat on Is MongoDB's fault tolerance broken? Yes it is . No it's not . YES it is . And the score: MongoDB Is Still Broken by Design 5-0 . Every insurgency must recruit from an existing population which is already affiliated elsewhere. For web properties the easiest group to recru
2 0.99212933 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
Introduction: If Google was a boxer then MapReduce would be a probing right hand that sets up the massive left hook that is Dremel , Google's—scalable (thousands of CPUs, petabytes of data, trillions of rows), SQL based, columnar, interactive (results returned in seconds), ad-hoc—analytics system. If Google was a magician then MapReduce would be the shiny thing that distracts the mind while the trick goes unnoticed. I say that because even though Dremel has been around internally at Google since 2006, we have not heard a whisper about it. All we've heard about is MapReduce, clones of which have inspired entire new industries. Tricky . Dremel, according to Brian Bershad, Director of Engineering at Google, is targeted at solving BigData class problems : While we all know that systems are huge and will get even huger, the implications of this size on programmability, manageability, power, etc. is hard to comprehend. Alfred noted that the Internet is predicted to be carrying a zetta-byte (10 21
3 0.98593235 680 high scalability-2009-08-13-Reconnoiter - Large-Scale Trending and Fault-Detection
Introduction: One of the top recommendations from the collective wisdom contained in Real Life Architectures is to add monitoring to your system. Now! Loud is the lament for not adding monitoring early and often. The reason is easy to understand. Without monitoring you don't know what your system is doing which means you can't fix it and you can't improve it. Feedback loops require data. Some popular monitor options are Munin, Nagios, Cacti and Hyperic. A relatively new entrant is a product called Reconnoiter from Theo Schlossnagle, President and CEO of OmniTI, leading consultants on solving problems of scalability, performance, architecture, infrastructure, and data management. Theo's name might sound familiar. He gives lots of talks and is the author of the very influential Scalable Internet Architectures book. So right away you know Reconnoiter has a good pedigree. As Theo says, their products are born of pain, from the fire of solving real-life problems and that's always a harbinger of
4 0.98561466 1420 high scalability-2013-03-08-Stuff The Internet Says On Scalability For March 8, 2013
Introduction: Hey, it's HighScalability time: Quotable Quotes: @ibogost : Disabling features of SimCity due to ineffective central infrastructure is probably the most realistic simulation of the modern city. antirez : The point is simply to show how SSDs can't be considered, currently, as a bit slower version of memory. Their performance characteristics are a lot more about, simply, "faster disks". @jessenoller : I only use JavaScript so I can gain maximum scalability across multiple cores. Also unicorns. Paint thinner gingerbread @liammclennan : high-scalability ruby. Why bother? @scomma : Problem with BitCoin is not scalability, not even usability. It's whether someone will crack the algorithm and render BTC entirely useless. @webclimber : Amazing how often I find myself explaining that scalability is not magical @mvmsan : Flash as Primary Storage - Highest Cost, Lack of HA, scalability and management features #flas
5 0.98485243 1494 high scalability-2013-07-19-Stuff The Internet Says On Scalability For July 19, 2013
Introduction: Hey, it's HighScalability time: (Still not a transporter: Looping at 685 mph ) 898 exabytes : US storage, 1/3 global total; 1 Kb/s : data transmit rate from harvestable energy from human motion Create your own trust nobody point-to-point private cloud. Dan Brown shows how step-by-step in How I Created My Own Personal Cloud Using BitTorrent Sync, Owncloud, and Raspberry Pi . BitTorrent Sync is used to copy large files around. Raspberry Pi is a cheap low power always on device with BitTorrent Sync installed. Owncloud is an open source cloud that provides a web interface for file access files from anywhere. This is different. Funding a startup using Airbnb as a source of start-up capital. It beats getting a part-time job and one of your guests might even be a VC. This is not different. Old industries clawing and digging in, using the tools of power to beat back competition. Steve Blank details a familiar story in Strangling Innovation: Tesla versus “Rent Seeker
same-blog 6 0.98155856 448 high scalability-2008-11-22-Google Architecture
7 0.98053503 784 high scalability-2010-02-25-Paper: High Performance Scalable Data Stores
8 0.97616684 1277 high scalability-2012-07-05-10 Golden Principles For Building Successful Mobile-Web Applications
9 0.9731065 323 high scalability-2008-05-19-Twitter as a scalability case study
10 0.97227192 1169 high scalability-2012-01-05-Shutterfly Saw a Speedup of 500% With Flashcache
11 0.96952349 1100 high scalability-2011-08-18-Paper: The Akamai Network - 61,000 servers, 1,000 networks, 70 countries
12 0.96951205 786 high scalability-2010-03-02-Using the Ambient Cloud as an Application Runtime
13 0.96043575 650 high scalability-2009-07-02-Product: Hbase
14 0.95885104 1048 high scalability-2011-05-27-Stuff The Internet Says On Scalability For May 27, 2011
15 0.95750892 107 high scalability-2007-10-02-Some Real Financial Numbers for Your Startup
16 0.95741248 1485 high scalability-2013-07-01-PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute
17 0.95470637 581 high scalability-2009-04-26-Map-Reduce for Machine Learning on Multicore
18 0.94582081 1392 high scalability-2013-01-23-Building Redundant Datacenter Networks is Not For Sissies - Use an Outside WAN Backbone
19 0.94527191 1181 high scalability-2012-01-25-Google Goes MoreSQL with Tenzing - SQL Over MapReduce
20 0.93872786 380 high scalability-2008-09-05-Product: Tungsten Replicator