
376 high scalability-2008-09-03-MapReduce framework Disco


meta information for this blog

Source: html

Introduction: Disco is an open-source implementation of the MapReduce framework for distributed computing. It was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. The Disco core is written in Erlang. The MapReduce jobs in Disco are natively described as Python programs, which makes it possible to express complex algorithmic and data processing tasks often only in tens of lines of code.
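To make the "tens of lines" claim concrete, here is a minimal word-count sketch of the map, shuffle, and reduce phases in plain Python. It runs in a single local process and does not use Disco's actual API; the names `map_words`, `reduce_counts`, and `run_job` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: sum all counts emitted for a single word."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Local stand-in for a MapReduce run (hypothetical helper, not Disco).

    Applies the mapper to every line, groups the emitted values by key
    (the shuffle step), then applies the reducer to each group.
    """
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(line) for line in lines):
        groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["Disco runs MapReduce jobs", "MapReduce jobs in Python"]
    print(run_job(sample, map_words, reduce_counts))
```

In a real framework the map and reduce functions would look much the same, while the grouping step and the distribution of work across machines would be handled by the framework rather than by a local helper like `run_job`.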


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Disco is an open-source implementation of the MapReduce framework for distributed computing. [sent-1, score-0.38]

2 It was started at Nokia Research Center as a lightweight framework for rapid scripting of distributed data processing tasks. [sent-2, score-1.035]

3 The MapReduce jobs in Disco are natively described as Python programs, which makes it possible to express complex algorithmic and data processing tasks often only in tens of lines of code. [sent-4, score-1.692]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('disco', 0.656), ('atnokia', 0.298), ('mapreduce', 0.22), ('algorithmic', 0.203), ('natively', 0.196), ('framework', 0.193), ('express', 0.179), ('scripting', 0.172), ('lightweight', 0.163), ('tens', 0.154), ('processing', 0.152), ('described', 0.138), ('programs', 0.127), ('lines', 0.124), ('rapid', 0.123), ('research', 0.115), ('python', 0.115), ('tasks', 0.112), ('jobs', 0.111), ('started', 0.096), ('implementation', 0.096), ('distributed', 0.091), ('written', 0.08), ('core', 0.08), ('often', 0.074), ('complex', 0.072), ('possible', 0.07), ('makes', 0.062), ('code', 0.051), ('data', 0.045)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 376 high scalability-2008-09-03-MapReduce framework Disco


2 0.13452943 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters

Introduction: Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class. Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve a

3 0.10876185 468 high scalability-2008-12-17-Ringo - Distributed key-value storage for immutable data

Introduction: Ringo is an experimental, distributed, replicating key-value store based on consistent hashing and immutable data. Unlike many general-purpose databases, Ringo is designed for a specific use case: archiving small (less than 4KB) or medium-size (<100MB) data items in real time so that the data can survive K - 1 disk breaks, where K is the desired number of replicas, without any downtime, in a manner that scales to terabytes of data. In addition to storing, Ringo should be able to retrieve individual or small sets of data items with low latencies (<10ms) and provide a convenient on-disk format for bulk data access. Ringo is compatible with the map-reduce framework Disco, and it was started at Nokia Research Center Palo Alto.

4 0.10015146 1313 high scalability-2012-08-28-Making Hadoop Run Faster

Introduction: One of the challenges in processing data is that the speed at which we can input data is quite often much faster than the speed at which we can process it. This problem becomes even more pronounced in the context of Big Data, where the volume of data keeps on growing, along with a corresponding need for more insights, and thus the need for more complex processing also increases. Batch Processing to the Rescue: Hadoop was designed to deal with this challenge in the following ways: 1. Use a distributed file system: this enables us to spread the load and grow our system as needed. 2. Optimize for write speed: to enable fast writes, the Hadoop architecture was designed so that writes are first logged and then processed. This enables fairly fast write speeds. 3. Use batch processing (Map/Reduce) to balance the speed of the data feeds with the processing speed. Batch Processing Challenges: The challenge with batch-processing is that it assumes

5 0.097883999 590 high scalability-2009-05-06-Art of Distributed

Introduction: Art of Distributed, Part 1: Rethinking distributed computing models. I have been getting a lot of questions lately about distributed computing, especially distributed computing models and MapReduce, such as: What is MapReduce? Can MapReduce fit all situations? How does it compare with other technologies such as Grid Computing? And what is the best solution for our situation? So I decided to write an article about distributed computing in two parts. The first part covers distributed computing models and the differences between them. In the second part I will discuss reliability and distributed storage systems. Download the article in PDF format. Download the article in MS Word format. I look forward to your comments and questions, and I will answer them in part two.

6 0.092391267 666 high scalability-2009-07-30-Learn How to Think at Scale

7 0.089997128 882 high scalability-2010-08-18-Misco: A MapReduce Framework for Mobile Systems - Start of the Ambient Cloud?

8 0.087502807 448 high scalability-2008-11-22-Google Architecture

9 0.08544752 401 high scalability-2008-10-04-Is MapReduce going mainstream?

10 0.085374303 362 high scalability-2008-08-11-Distributed Computing & Google Infrastructure

11 0.080960073 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python

12 0.079100348 685 high scalability-2009-08-20-Dependency Injection and AOP frameworks for .NET

13 0.079041928 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm

14 0.078634731 161 high scalability-2007-11-20-Product: SmartFrog a Distributed Configuration and Deployment Framework

15 0.077580534 912 high scalability-2010-10-01-Google Paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications

16 0.075175099 357 high scalability-2008-07-26-Google's Paxos Made Live – An Engineering Perspective

17 0.071605332 414 high scalability-2008-10-15-Hadoop - A Primer

18 0.068113744 1215 high scalability-2012-03-26-7 Years of YouTube Scalability Lessons in 30 Minutes

19 0.064155594 555 high scalability-2009-04-04-Performance Anti-Pattern

20 0.064151049 1003 high scalability-2011-03-14-6 Lessons from Dropbox - One Million Files Saved Every 15 minutes


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.063), (1, 0.036), (2, 0.011), (3, 0.05), (4, 0.029), (5, 0.044), (6, 0.052), (7, 0.021), (8, 0.007), (9, 0.086), (10, 0.033), (11, 0.001), (12, 0.038), (13, -0.082), (14, 0.021), (15, -0.035), (16, -0.03), (17, -0.027), (18, 0.024), (19, 0.008), (20, 0.014), (21, -0.006), (22, 0.003), (23, 0.011), (24, 0.018), (25, 0.027), (26, 0.032), (27, 0.041), (28, -0.017), (29, 0.054), (30, 0.017), (31, 0.077), (32, -0.049), (33, 0.013), (34, -0.008), (35, -0.057), (36, -0.01), (37, -0.013), (38, 0.019), (39, 0.092), (40, -0.007), (41, 0.009), (42, -0.051), (43, 0.026), (44, 0.049), (45, -0.024), (46, -0.019), (47, -0.013), (48, -0.024), (49, -0.059)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9651075 376 high scalability-2008-09-03-MapReduce framework Disco


2 0.74170697 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters

Introduction: Update: MapReduce and PageRank Notes from Remzi Arpaci-Dusseau's Fall 2008 class. Collects interesting facts about MapReduce and PageRank. For example, the history of the solution to searching for the term "flu" is traced through multiple generations of technology. With Google entering the cloud space with Google AppEngine and a maturing Hadoop product, the MapReduce scaling approach might finally become a standard programmer practice. This is the best paper on the subject and is an excellent primer on a content-addressable memory future. Some interesting stats from the paper: Google executes 100k MapReduce jobs each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. One common criticism ex-Googlers have is that it takes months to get up and be productive in the Google environment. Hopefully a way will be found to lower the learning curve a

3 0.73816854 401 high scalability-2008-10-04-Is MapReduce going mainstream?

Introduction: Compares MapReduce to other parallel processing approaches and suggests a new paradigm for clouds and grids.

4 0.72058094 1512 high scalability-2013-09-05-Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale

Introduction: Ever wonder what powers Google's world spirit sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream processing (think streaming MapReduce on steroids) system called MillWheel, described in this very well written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't just used for Zeitgeist at Google, it's also used for streaming joins for a variety of Ads customers, generalized anomaly-detection service, and network switch and cluster health monitoring. Abstract: MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees. This paper describes MillWheel’s programming model as well as it

5 0.69386727 850 high scalability-2010-06-30-Paper: GraphLab: A New Framework For Parallel Machine Learning

Introduction: In the never ending quest to figure out how to do something useful with never ending streams of data, GraphLab: A New Framework For Parallel Machine Learning wants to go beyond low-level programming, MapReduce, and dataflow languages with a new parallel framework for ML (machine learning) which exploits the sparse structure and common computational patterns of ML algorithms. GraphLab enables ML experts to easily design and implement efficient scalable parallel algorithms by composing problem specific computation, data-dependencies, and scheduling. Our main contributions include: A graph-based data model which simultaneously represents data and computational dependencies. A set of concurrent access models which provide a range of sequential-consistency guarantees. A sophisticated modular scheduling mechanism. An aggregation framework to manage global state. From the abstract: Designing and implementing efficient, provably correct parallel machine lear

6 0.63890302 414 high scalability-2008-10-15-Hadoop - A Primer

7 0.63334084 1313 high scalability-2012-08-28-Making Hadoop Run Faster

8 0.62914592 362 high scalability-2008-08-11-Distributed Computing & Google Infrastructure

9 0.61058062 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python

10 0.60943151 634 high scalability-2009-06-20-Building a data cycle at LinkedIn with Hadoop and Project Voldemort

11 0.58949971 433 high scalability-2008-10-29-CTL - Distributed Control Dispatching Framework

12 0.58811915 601 high scalability-2009-05-17-Product: Hadoop

13 0.57914698 415 high scalability-2008-10-15-Need help with your Hadoop deployment? This company may help!

14 0.57771999 912 high scalability-2010-10-01-Google Paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications

15 0.57570672 650 high scalability-2009-07-02-Product: Hbase

16 0.57390058 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2

17 0.5662775 591 high scalability-2009-05-06-Dyrad

18 0.55677134 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm

19 0.55325836 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud

20 0.53881329 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.072), (2, 0.313), (10, 0.031), (61, 0.056), (67, 0.245), (94, 0.107)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.90895057 1422 high scalability-2013-03-12-If Your System was a Symphony it Might Sound Like This...

Introduction: I am in no way a music expert, but when I listen to Symphony No. 4 by Charles Ives, I imagine it's what a complex software/hardware system might sound like if we could hear its inner workings. Ives uses a lot of riotously competing rhythms in this work. It can sound discordant, yet the effect is deeply layered and eventually harmonious, just like the systems we use, create, and become part of. I was pointed to this piece by someone who said there were two conductors. I'd never heard of such a thing! So I was intrigued. The first version of the performance sounds and looks great, but it unfortunately does not use two conductors. The second version uses two conductors, but is unfortunately just a snippet. It's strikingly odd to see two conductors, but I imagine different parts of our systems using different conductors too, running at different rhythms, sometimes slow, sometimes fast, sometimes there are outbursts, sometimes in vicious conflict. Yet conceptually it all stills seem

2 0.90258348 898 high scalability-2010-09-09-6 Scalability Lessons

Introduction: Jesper Söderlund not only put together a few interesting scalability patterns, he also came up with a few interesting scalability lessons: Lesson #1. Put Smarty compile and template caches on an active-active DRBD cluster with high load and your servers will DIE! Lesson #2. Don't use out-of-the-box configurations. Lesson #3. Single points of contention will eventually become a bottleneck. Lesson #4. Plan in advance. Lesson #5. Offload your databases as much as possible. Lesson #6. File systems matter and can run out of space / inodes. For more details and explanations see the original post.

same-blog 3 0.88532478 376 high scalability-2008-09-03-MapReduce framework Disco


4 0.80201817 1172 high scalability-2012-01-10-A Perfect Fifth of Notes on Scalability

Introduction: Jeremiah Peschka with a great set of Notes on Scalability, just in case you do reach your wildest expectations of success: Build it to Break. Plan for the fact that everything you make is going to break. Design in layers that are independent and redundant. Everything is a Feature. Your application is a set of features created by a series of conscious choices made by considering trade-offs. Scale Out, Not Up. Purchasing more hardware is easier than coding and managing horizontal resources. Buy More Storage. Large numbers of smaller, faster drives have more IOPS than fewer, larger drives. You’re Going to Do It Wrong. Be prepared to iterate on your ideas. You will make mistakes. Be prepared to re-write code and to quickly move on to the next idea. Please read the original article for a much more expansive treatment.

5 0.77775246 723 high scalability-2009-10-16-Paper: Scaling Online Social Networks without Pains

Introduction: We saw in Why are Facebook, Digg, and Twitter so hard to scale? that scaling social networks is a lot harder than you might think. This paper, Scaling Online Social Networks without Pains, from a team at Telefonica Research in Spain hopes to meet the challenge of status distribution, user generated content distribution, and managing the social graph through a technique they call One-Hop Replication (OHR). OHR abstracts and delegates the complexity of scaling up from the social network application. The abstract: Online Social Networks (OSN) face serious scalability challenges due to their rapid growth and popularity. To address this issue we present a novel approach to scale up OSN called One Hop Replication (OHR). Our system combines partitioning and replication in a middleware to transparently scale up a centralized OSN design, and therefore, avoid the OSN application to undergo the costly transition to a fully distributed system to meet its scalability needs. OHR exploits some

6 0.77721083 967 high scalability-2011-01-03-Stuff The Internet Says On Scalability For January 3, 2010

7 0.77661508 1591 high scalability-2014-02-05-Little’s Law, Scalability and Fault Tolerance: The OS is your bottleneck. What you can do?

8 0.77575761 1199 high scalability-2012-02-27-Zen and the Art of Scaling - A Koan and Epigram Approach

9 0.77487576 205 high scalability-2008-01-10-Letting Clients Know What's Changed: Push Me or Pull Me?

10 0.77306885 551 high scalability-2009-03-30-Lavabit Architecture - Creating a Scalable Email Service

11 0.77292645 1190 high scalability-2012-02-10-Stuff The Internet Says On Scalability For February 10, 2012

12 0.77137685 752 high scalability-2009-12-17-Oracle and IBM databases: Disk-based vs In-memory databases

13 0.77114433 1373 high scalability-2012-12-17-11 Uses For the Humble Presents Queue, er, Message Queue

14 0.77066427 1126 high scalability-2011-09-27-Use Instance Caches to Save Money: Latency == $$$

15 0.76949805 221 high scalability-2008-01-24-Mailinator Architecture

16 0.76856476 639 high scalability-2009-06-27-Scaling Twitter: Making Twitter 10000 Percent Faster

17 0.76693898 1006 high scalability-2011-03-17-Are long VM instance spin-up times in the cloud costing you money?

18 0.76660275 594 high scalability-2009-05-08-Eight Best Practices for Building Scalable Systems

19 0.76611948 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster

20 0.76498216 1325 high scalability-2012-09-19-The 4 Building Blocks of Architecting Systems for Scale