Introduction: Ever wonder what powers Google's world-spirit-sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream processing system (think streaming MapReduce on steroids) called MillWheel, described in this well-written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't just used for Zeitgeist at Google; it's also used for streaming joins for a variety of Ads customers, a generalized anomaly-detection service, and network switch and cluster health monitoring. Abstract: MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees. This paper describes MillWheel’s programming model as well as it
sentIndex sentText sentNum sentScore
1 Ever wonder what powers Google's world-spirit-sensing Zeitgeist service? [sent-1, score-0.157]
2 It's actually a stream processing system (think streaming MapReduce on steroids) called MillWheel, described in this well-written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. [sent-3, score-0.306]
3 MillWheel isn't just used for Zeitgeist at Google; it's also used for streaming joins for a variety of Ads customers, a generalized anomaly-detection service, and network switch and cluster health monitoring. [sent-4, score-0.172]
4 Abstract: MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. [sent-5, score-0.101]
5 Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees. [sent-6, score-0.315]
6 This paper describes MillWheel’s programming model as well as its implementation. [sent-7, score-0.171]
7 The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. [sent-8, score-0.277]
8 MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. [sent-9, score-0.166]
9 MillWheel was designed from the outset with fault tolerance and scalability in mind. [sent-10, score-0.307]
10 In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google. [sent-11, score-0.43]
11 It (Apache Samza) uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. [sent-14, score-0.101]
12 Storm: a free and open source distributed realtime computation system. [sent-15, score-0.212]
13 Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. [sent-16, score-0.356]
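The abstract's point that a notion of logical time makes time-based aggregations simple to write can be illustrated with a small sketch. This is not MillWheel's API, which the excerpt does not show; the class name WindowedCounter, the 60-second window, and the advance() call that moves logical time forward are all hypothetical choices made for illustration. Records are bucketed by their own timestamps rather than by arrival order, and a window is emitted only once logical time has passed its end.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size, chosen for illustration


class WindowedCounter:
    """Counts records per (key, window) using record timestamps, not wall-clock time."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (key, window_start) -> count
        self.logical_time = 0           # time up to which input is assumed complete

    def process(self, key, timestamp):
        """Assign a record to the window that contains its timestamp."""
        window_start = timestamp - (timestamp % self.window_seconds)
        self.counts[(key, window_start)] += 1

    def advance(self, logical_time):
        """Move logical time forward; emit and drop every window that has ended."""
        self.logical_time = max(self.logical_time, logical_time)
        closed = sorted(
            (k for k in self.counts
             if k[1] + self.window_seconds <= self.logical_time),
            key=lambda k: (k[1], k[0]),  # order by window start, then key
        )
        return [(key, start, self.counts.pop((key, start))) for key, start in closed]
```

Feeding the counter records that arrive out of order (for example timestamps 5, 42, 61, then 59) still places each one in the right window, because only the timestamp decides the bucket; calling advance(60) then emits the first window and leaves the later one open. A real system also has to decide when it is safe to advance logical time and how to persist the counts, which this sketch leaves out.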
wordName wordTfidf (topN-words)
[('millwheel', 0.817), ('zeitgeist', 0.165), ('realtime', 0.142), ('tolerance', 0.124), ('processing', 0.105), ('fault', 0.101), ('streaming', 0.101), ('stream', 0.1), ('homunculus', 0.082), ('friedrich', 0.082), ('outset', 0.082), ('samza', 0.082), ('yarn', 0.082), ('continuous', 0.072), ('lends', 0.071), ('motivate', 0.071), ('ow', 0.071), ('datamining', 0.071), ('steroids', 0.071), ('variety', 0.071), ('computation', 0.07), ('anomaly', 0.069), ('versatile', 0.067), ('nd', 0.067), ('prismatic', 0.067), ('detector', 0.065), ('programming', 0.065), ('envelope', 0.064), ('plumbing', 0.063), ('mapreduce', 0.061), ('sensing', 0.06), ('kafka', 0.059), ('graph', 0.058), ('unbounded', 0.058), ('generalized', 0.057), ('apache', 0.056), ('model', 0.055), ('hadoop', 0.055), ('framework', 0.053), ('google', 0.052), ('reliably', 0.051), ('directed', 0.051), ('paper', 0.051), ('spirit', 0.05), ('articleson', 0.05), ('widely', 0.048), ('storm', 0.047), ('powers', 0.047), ('sitting', 0.047), ('logical', 0.046)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 1512 high scalability-2013-09-05-Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
2 0.10859422 1313 high scalability-2012-08-28-Making Hadoop Run Faster
Introduction: Making Hadoop Run Faster One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. Batch Processing to the Rescue Hadoop was designed to deal with this challenge in the following ways: 1. Use a distributed file system: This enables us to spread the load and grow our system as needed. 2. Optimize for write speed: To enable fast writes the Hadoop architecture was designed so that writes are first logged, and then processed. This enables fairly fast write speeds. 3. Use batch processing (Map/Reduce) to balance the speed for the data feeds with the processing speed. Batch Processing Challenges The challenge with batch-processing is that it assumes
3 0.092372544 1406 high scalability-2013-02-14-When all the Program's a Graph - Prismatic's Plumbing Library
Introduction: At some point as a programmer you might have the insight/fear that all programming is just doing stuff to other stuff. Then you may observe after coding the same stuff over again that stuff in a program often takes the form of interacting patterns of flows. Then you may think hey, a program isn't only useful for coding datastructures, but a program is a kind of datastructure and that with a meta level jump you could program a program in terms of flows over data and flow over other flows. That's the kind of stuff Prismatic is making available in the Graph extension to their plumbing package (code examples), which is described in an excellent post: Graph: Abstractions for Structured Computation. You may remember Prismatic from a previous profile we did on HighScalability: Prismatic Architecture - Using Machine Learning On Social Networks To Figure Out What You Should Read On The Web. We learned how Prismatic, an interest driven content suggestion service, builds programs in
Introduction: This post on Prismatic’s Architecture is adapted from an email conversation with Prismatic programmer Jason Wolfe. What should you read on the web today? Any thoroughly modern person must solve this dilemma every day, usually using some occult process to divine what’s important in their many feeds: Twitter, RSS, Facebook, Pinterest, G+, email, Techmeme, and an uncountable number of other information sources. Jason Wolfe from Prismatic has generously agreed to describe their thoroughly modern solution for answering the “what to read” question using lots of sexy words like Machine Learning, Social Graphs, BigData, functional programming, and in-memory real-time feed processing. The result is possibly even more occult, but this or something very much like it will be how we meet the challenge of finding interesting topics and stories hidden inside infinitely deep pools of information. A couple of things stand out about Prismatic. They want you to know that Prismatic is being built
5 0.072603211 1638 high scalability-2014-04-28-How Disqus Went Realtime with 165K Messages Per Second and Less than .2 Seconds Latency
Introduction: Here's an Update On Disqus: It's Still About Realtime, But Go Demolishes Python. How do you add realtime functionality to a web scale application? That's what Adam Hitchcock, a Software Engineer at Disqus, talks about in an excellent talk: Making DISQUS Realtime (slides). Disqus had to take their commenting system and add realtime capabilities to it. Not something that's easy to do when, at the time of the talk (2013), they had just hit a billion unique visitors a month. What Disqus developed is a realtime commenting system called “realertime” that was tested to handle 1.5 million concurrently connected users, 45,000 new connections per second, 165,000 messages/second, with less than .2 seconds latency end-to-end. The nature of a commenting system is that it is IO bound and has a high fanout, that is, a comment comes in and must be sent out to a lot of readers. It's a problem very similar to what Twitter must solve. Disqus' solution was quite interesting as was th
6 0.069803484 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
7 0.064692616 1644 high scalability-2014-05-07-Update on Disqus: It's Still About Realtime, But Go Demolishes Python
8 0.063897595 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
9 0.063158572 801 high scalability-2010-03-30-Running Large Graph Algorithms - Evaluation of Current State-of-the-Art and Lessons Learned
10 0.062320806 666 high scalability-2009-07-30-Learn How to Think at Scale
12 0.059904873 1630 high scalability-2014-04-11-Stuff The Internet Says On Scalability For April 11th, 2014
13 0.059411664 628 high scalability-2009-06-13-Neo4j - a Graph Database that Kicks Buttox
14 0.058562018 1520 high scalability-2013-09-20-Stuff The Internet Says On Scalability For September 20, 2013
15 0.055814289 1240 high scalability-2012-05-07-Startups are Creating a New System of the World for IT
17 0.055663448 1302 high scalability-2012-08-10-Stuff The Internet Says On Scalability For August 10, 2012
18 0.054575503 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
19 0.054568782 1076 high scalability-2011-07-08-Stuff The Internet Says On Scalability For July 8, 2011
20 0.053152286 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?
topicId topicWeight
[(0, 0.088), (1, 0.044), (2, 0.01), (3, 0.049), (4, 0.012), (5, 0.037), (6, 0.012), (7, 0.029), (8, 0.019), (9, 0.081), (10, 0.044), (11, 0.013), (12, 0.012), (13, -0.078), (14, 0.012), (15, -0.012), (16, -0.014), (17, 0.027), (18, 0.021), (19, 0.002), (20, -0.016), (21, -0.005), (22, 0.002), (23, -0.021), (24, 0.003), (25, 0.027), (26, -0.004), (27, 0.022), (28, 0.008), (29, 0.013), (30, 0.034), (31, 0.005), (32, 0.001), (33, 0.003), (34, -0.027), (35, -0.01), (36, -0.021), (37, -0.012), (38, 0.011), (39, 0.06), (40, 0.023), (41, 0.001), (42, -0.011), (43, -0.012), (44, 0.02), (45, 0.011), (46, 0.0), (47, -0.013), (48, -0.008), (49, -0.017)]
simIndex simValue blogId blogTitle
same-blog 1 0.96914035 1512 high scalability-2013-09-05-Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
2 0.75704587 376 high scalability-2008-09-03-MapReduce framework Disco
Introduction: Disco is an open-source implementation of the MapReduce framework for distributed computing. It was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. The Disco core is written in Erlang. The MapReduce jobs in Disco are natively described as Python programs, which makes it possible to express complex algorithmic and data processing tasks often only in tens of lines of code.
3 0.71160835 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
Introduction: Dr. Daniel Abadi, author of the DBMS Musings blog and Cofounder of Hadapt, which offers a product improving Hadoop performance by 50x on relational data, is now taking his talents to graph data in Hadoop's tremendous inefficiency on graph data management (and how to avoid it), which shares the secrets of getting Hadoop to perform 1000x better on graph data. TL;DR: Analysing graph data is at the heart of important data mining problems. Hadoop is the tool of choice for many of these problems. Hadoop style MapReduce works best on KeyValue processing, not graph processing, and can be well over a factor of 1000 less efficient than it needs to be. Hadoop inefficiency has consequences in the real world. Inefficiencies on graph data problems like improving power utilization, minimizing carbon emissions, and improving product designs lead to a lot of value being left on the table in the form of negative environmental consequences, increased server costs, increased data center spa
4 0.70872885 850 high scalability-2010-06-30-Paper: GraphLab: A New Framework For Parallel Machine Learning
Introduction: In the never ending quest to figure out how to do something useful with never ending streams of data, GraphLab: A New Framework For Parallel Machine Learning wants to go beyond low-level programming, MapReduce, and dataflow languages with a new parallel framework for ML (machine learning) which exploits the sparse structure and common computational patterns of ML algorithms. GraphLab enables ML experts to easily design and implement efficient scalable parallel algorithms by composing problem specific computation, data-dependencies, and scheduling. Our main contributions include: A graph-based data model which simultaneously represents data and computational dependencies. A set of concurrent access models which provide a range of sequential-consistency guarantees. A sophisticated modular scheduling mechanism. An aggregation framework to manage global state. From the abstract: Designing and implementing efficient, provably correct parallel machine lear
Introduction: On the surface nothing appears more different than soft data and hard raw materials like iron. Then isn’t it ironic, in the Alanis Morissette sense, that in this Age of Information, great wealth still lies hidden deep beneath piles of stuff? It's so strange how directly digging for dollars in data parallels the great wealth producing models of the Industrial Revolution. The pile of stuff is the Internet. It takes lots of prospecting to find the right stuff. Mighty web crawling machines tirelessly collect stuff, bringing it into their huge maws, then depositing load after load into rack after rack of distributed file system machines. Then armies of still other machines take this stuff and strip out the valuable raw materials, which in the Information Age, are endless bytes of raw data. Link clicks, likes, page views, content, headlines, searches, inbound links, outbound links, search clicks, hashtags, friends, purchases: anything and everything you do on the Internet is a valu
6 0.70713967 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
7 0.70596272 1136 high scalability-2011-11-03-Paper: G2 : A Graph Processing System for Diagnosing Distributed Systems
8 0.68639624 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
9 0.67488354 155 high scalability-2007-11-15-Video: Dryad: A general-purpose distributed execution platform
10 0.66370296 650 high scalability-2009-07-02-Product: Hbase
11 0.64993435 631 high scalability-2009-06-15-Large-scale Graph Computing at Google
12 0.63892686 1313 high scalability-2012-08-28-Making Hadoop Run Faster
13 0.63431579 601 high scalability-2009-05-17-Product: Hadoop
14 0.63123119 448 high scalability-2008-11-22-Google Architecture
15 0.62884361 1315 high scalability-2012-08-30-Stuff The Internet Says On Scalability For August 31, 2012
16 0.62451971 1385 high scalability-2013-01-11-Stuff The Internet Says On Scalability For January 11, 2013
18 0.61666632 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud
19 0.61497682 666 high scalability-2009-07-30-Learn How to Think at Scale
20 0.6144141 1406 high scalability-2013-02-14-When all the Program's a Graph - Prismatic's Plumbing Library
topicId topicWeight
[(1, 0.087), (2, 0.186), (10, 0.023), (15, 0.303), (47, 0.015), (52, 0.01), (61, 0.06), (77, 0.027), (79, 0.112), (85, 0.022), (94, 0.027)]
simIndex simValue blogId blogTitle
1 0.81732726 362 high scalability-2008-08-11-Distributed Computing & Google Infrastructure
Introduction: A couple of videos about distributed computing with direct reference to Google infrastructure. You will get acquainted with: --MapReduce, the software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on commodity hardware --GFS and the way it stores its data in 64 MB chunks --Bigtable, which is the simple implementation of a non-relational database at Google. Cluster Computing and MapReduce Lectures 1-5.
same-blog 2 0.78729486 1512 high scalability-2013-09-05-Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
3 0.75802392 85 high scalability-2007-09-08-Making the case for PHP at Yahoo! (Oct 2002)
Introduction: This presentation by Michael Radwin describes why Yahoo! standardized on PHP going forward. It describes how, after reviewing all the web technologies including their own internal ones, PHP was chosen. It shows that not only technical reasons, but also business and development processes were taken into account.
4 0.7370916 1455 high scalability-2013-05-10-Stuff The Internet Says On Scalability For May 10, 2013
Introduction: Hey, it's HighScalability time: ( In Thailand, they figured out how to solve the age-old queuing problem! ) Nanoscale : Plants IM Using Nanoscale Sound Waves; 100 petabytes : CERN data storage Quotable Quotes: Geoff Arnold : Arguably all interesting advances in computer science and software engineering occur when a resource that was previously scarce or expensive becomes cheap and plentiful. @jamesurquhart : "Complexity is a characteristic of the system, not of the parts in it." -Dekker @louisnorthmore : Scaling down - now that's scalability! @peakscale : Where distributed systems people retire to forget the madness: http://en.wikipedia.org/wiki/Antipaxos @dozba : "The Linux Game Database" ... Well, at least they will never have scaling problems. Michael Widenius : There is no reason at all to use MySQL @steveloughran : Whenever someone says "unlimited scalability", ask if that exceeds the ber
5 0.72839582 414 high scalability-2008-10-15-Hadoop - A Primer
Introduction: Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce to process vast amounts of data. "Hadoop is a Free Java software framework that supports data intensive distributed applications running on large clusters of commodity computers. It enables applications to easily scale out to thousands of nodes and petabytes of data" (Wikipedia) * What platform does Hadoop run on? * Java 1.5.x or higher, preferably from Sun * Linux * Windows for development * Solaris
6 0.7169379 812 high scalability-2010-04-19-Strategy: Order Two Mediums Instead of Two Smalls and the EC2 Buffet
7 0.70597321 948 high scalability-2010-11-24-Great Introductory Video on Scalability from Harvard Computer Science
8 0.70061213 923 high scalability-2010-10-21-Machine VM + Cloud API - Rewriting the Cloud from Scratch
9 0.69288385 88 high scalability-2007-09-10-Blog: Scalable Web Architectures by Royans Tharakan
10 0.69201112 1297 high scalability-2012-08-03-Stuff The Internet Says On Scalability For August 3, 2012
11 0.68310267 682 high scalability-2009-08-16-ThePort Network Architecture
12 0.67076725 1421 high scalability-2013-03-11-Low Level Scalability Solutions - The Conditioning Collection
13 0.66741467 122 high scalability-2007-10-14-Product: The Spread Toolkit
14 0.64297646 1501 high scalability-2013-08-13-In Memoriam: Lavabit Architecture - Creating a Scalable Email Service
15 0.64026707 1237 high scalability-2012-05-02-12 Ways to Increase Throughput by 32X and Reduce Latency by 20X
16 0.63185728 1231 high scalability-2012-04-20-Stuff The Internet Says On Scalability For April 20, 2012
17 0.62499076 1460 high scalability-2013-05-17-Stuff The Internet Says On Scalability For May 17, 2013
18 0.62496769 1112 high scalability-2011-09-07-What Google App Engine Price Changes Say About the Future of Web Architecture
19 0.62439686 1589 high scalability-2014-02-03-How Google Backs Up the Internet Along With Exabytes of Other Data
20 0.62432927 687 high scalability-2009-08-24-How Google Serves Data from Multiple Datacenters