high_scalability high_scalability-2007 high_scalability-2007-155 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Dryad is Microsoft's answer to Google's map-reduce. What's the question: How do you process really large amounts of data? My initial impression of Dryad is it's like a giant Unix command line filter on steroids. There are lots of inputs, outputs, tees, queues, and merge sorts all connected together by a master exec program. What else does Dryad have to offer the scalable infrastructure wars? Dryad models programs as the execution of a directed acyclic graph. Each vertex is a program and edges are typed communication channels (files, TCP pipes, and shared memory channels within a process). Map-reduce uses a different model. It's more like a large distributed sort where the programmer defines functions for mapping, partitioning, and reducing. Each approach seems to borrow from the spirit of its creating organization. The graph approach seems a bit too complicated and map-reduce seems a bit too simple. How ironic, in the Alanis Morissette sense. Dryad is a middleware layer that executes graphs for you, automatically taking care of scheduling, distribution, and fault tolerance.
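The intro's comparison to a Unix pipeline suggests a concrete picture of the DAG model. Below is a minimal C++ sketch, assuming hypothetical Vertex, Channel, and Graph types invented here for illustration (this is not Dryad's actual API): each vertex is a program, each edge is a typed channel (file, TCP pipe, or in-process shared memory), and a separate middleware layer would schedule and distribute the vertices.

```cpp
// Hypothetical types for illustration only -- this is not Dryad's actual API.
#include <iostream>
#include <string>
#include <vector>

// The three kinds of edge mentioned in the talk.
enum class ChannelKind { File, TcpPipe, SharedMemory };

struct Vertex {
    std::string program;   // each vertex is just a program, like one stage of a Unix pipeline
};

struct Channel {
    int from;              // producing vertex id
    int to;                // consuming vertex id
    ChannelKind kind;      // how items move between the two programs
};

// A directed acyclic multigraph: duplicate edges between the same pair of
// vertices are allowed, which is why channels are kept in a plain vector.
struct Graph {
    std::vector<Vertex>  vertices;
    std::vector<Channel> channels;
};

int main() {
    // Roughly "grep | sort | uniq -c" expressed as a graph instead of a shell pipe.
    Graph g;
    g.vertices = { {"grep"}, {"sort"}, {"uniq -c"} };
    g.channels = {
        {0, 1, ChannelKind::SharedMemory},   // grep -> sort, within one process
        {1, 2, ChannelKind::File},           // sort -> uniq, spilled through a file
    };

    // A real middleware layer would schedule, distribute, and restart these
    // vertices; here we only print the wiring.
    for (const auto& c : g.channels)
        std::cout << g.vertices[c.from].program << " -> "
                  << g.vertices[c.to].program << "\n";
}
```

The point of the sketch is only the shape of the model: the wiring lives in a data structure a runtime can inspect, schedule, and rewrite, while each vertex stays an ordinary program.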
sentIndex sentText sentNum sentScore
1 My initial impression of Dryad is it's like a giant Unix command line filter on steroids. [sent-3, score-0.14]
2 Dryad models programs as the execution of a directed acyclic graph. [sent-6, score-0.19]
3 Each vertex is a program and edges are typed communication channels (files, TCP pipes, and shared memory channels within a process). [sent-7, score-0.669]
4 It's more like a large distributed sort where the programmer defines functions for mapping, partitioning, and reducing. [sent-9, score-0.164]
5 Each approach seems to borrow from the spirit of its creating organization. [sent-10, score-0.366]
6 The graph approach seems a bit too complicated and map-reduce seems a bit too simple. [sent-11, score-0.618]
7 Dryad is a middleware layer that executes graphs for you, automatically taking care of scheduling, distribution, and fault tolerance. [sent-13, score-0.344]
8 It's written in C++, but apparently few write directly to this layer; most people use higher-layer interfaces. [sent-14, score-0.253]
9 It's a library you link in and it loads and executes the graph. [sent-16, score-0.145]
10 The DAG is a multigraph so you can have multiple edges between vertices. [sent-19, score-0.149]
11 A DAG was chosen because it's not too cold and not too hot; the porridge is just right. [sent-20, score-0.11]
12 DAGs support relational algebra and can split multiple inputs and outputs nicely. [sent-23, score-0.417]
13 One interesting aspect is that a channel is a sequence of structured items that are C++ objects. [sent-24, score-0.14]
14 This means pointers can be passed directly, so you don't have to worry about serialization overhead (see the channel sketch after this list). [sent-25, score-0.237]
15 Graphs are dynamically changeable at runtime, which allows for a lot of optimizations (see the graph-rewriting sketch after this list). [sent-27, score-0.105]
16 Everyone can relate to counting words in a document (a map-reduce-style word-count sketch follows this list). [sent-31, score-0.164]
17 My thought while watching was that the graph stuff sounds cool and general, but it's hard to map it efficiently to solutions when the problems have large numbers of inputs. [sent-32, score-0.165]
18 The programmer provides the bits of atomic behaviour and the system can then try various optimizations. [sent-36, score-0.167]
19 The code doesn't have to change because the graph can be manipulated abstractly on its own. [sent-37, score-0.38]
20 Then something like a query planner figures out how to execute the query on Dryad. [sent-39, score-0.326]
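Sentences 13 and 14 above note that a channel is a sequence of C++ objects and that, within a process, pointers can be handed over directly. The sketch below is a hypothetical illustration of that idea, not Dryad's channel interface: an in-process channel just moves ownership of a pointer through a queue, while a file-backed channel has to serialize every item and parse it back.

```cpp
// Hypothetical channel sketch -- not Dryad's real interface.
#include <deque>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <utility>

struct Item {                 // an ordinary C++ object flowing through a channel
    std::string word;
    int count = 0;
};

// A channel is a sequence of items; producers push, consumers pull.
class Channel {
public:
    virtual ~Channel() = default;
    virtual void push(std::unique_ptr<Item> item) = 0;
    virtual std::unique_ptr<Item> pull() = 0;
};

// Shared-memory flavour: ownership of the pointer is handed over directly,
// so there is no serialization cost at all.
class InProcessChannel : public Channel {
    std::deque<std::unique_ptr<Item>> queue_;
public:
    void push(std::unique_ptr<Item> item) override { queue_.push_back(std::move(item)); }
    std::unique_ptr<Item> pull() override {
        if (queue_.empty()) return nullptr;
        auto item = std::move(queue_.front());
        queue_.pop_front();
        return item;
    }
};

// File flavour: every item is written out and parsed back in,
// which is exactly the overhead the in-process channel avoids.
class FileChannel : public Channel {
    std::string path_;
    std::ofstream out_;
    std::ifstream in_;
public:
    explicit FileChannel(const std::string& path) : path_(path), out_(path) {}
    void push(std::unique_ptr<Item> item) override {
        out_ << item->word << ' ' << item->count << '\n';      // serialize to disk
    }
    std::unique_ptr<Item> pull() override {
        out_.flush();
        if (!in_.is_open()) in_.open(path_);
        auto item = std::make_unique<Item>();
        if (in_ >> item->word >> item->count) return item;     // parse back in
        return nullptr;
    }
};

int main() {
    InProcessChannel fast;
    auto item = std::make_unique<Item>();
    item->word = "dryad";
    item->count = 3;
    fast.push(std::move(item));                 // no copying, no encoding: just the pointer
    if (auto received = fast.pull())
        std::cout << received->word << " " << received->count << "\n";
}
```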
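Sentences 15 and 19 above say the graph is dynamically changeable at runtime and can be manipulated abstractly without touching the vertex code. A commonly described optimization of this kind is collapsing a wide fan-in into a tree of partial aggregators once the number of inputs is known. The sketch below performs that rewrite on a toy graph structure; the types and the insertAggregationTree helper are hypothetical, shown only to make "manipulating the graph on its own" concrete.

```cpp
// Hypothetical graph-rewriting sketch -- illustrates dynamic refinement,
// not Dryad's actual optimizer.
#include <iostream>
#include <string>
#include <vector>

struct Vertex  { std::string program; };
struct Channel { int from; int to; };
struct Graph {
    std::vector<Vertex>  vertices;
    std::vector<Channel> channels;
};

// Replace "N readers all feeding one merge vertex" with a two-level tree of
// partial aggregators, once N is known at runtime. The vertex programs are
// untouched; only the graph data structure changes.
void insertAggregationTree(Graph& g, int mergeVertex, int fanIn) {
    std::vector<Channel> rewritten;
    std::vector<int> pendingInputs;
    for (const auto& c : g.channels) {
        if (c.to == mergeVertex) pendingInputs.push_back(c.from);
        else rewritten.push_back(c);
    }
    for (size_t i = 0; i < pendingInputs.size(); i += fanIn) {
        int agg = static_cast<int>(g.vertices.size());
        g.vertices.push_back({"partial-merge"});              // new intermediate vertex
        for (size_t j = i; j < pendingInputs.size() && j < i + fanIn; ++j)
            rewritten.push_back({pendingInputs[j], agg});     // inputs feed the aggregator
        rewritten.push_back({agg, mergeVertex});              // aggregator feeds the final merge
    }
    g.channels = std::move(rewritten);
}

int main() {
    Graph g;
    for (int i = 0; i < 6; ++i) g.vertices.push_back({"read-" + std::to_string(i)});
    g.vertices.push_back({"merge"});                           // vertex id 6
    for (int i = 0; i < 6; ++i) g.channels.push_back({i, 6});  // six-wide fan-in

    insertAggregationTree(g, 6, 3);                            // rewrite: fan-in of 3 per aggregator

    for (const auto& c : g.channels)
        std::cout << g.vertices[c.from].program << " -> " << g.vertices[c.to].program << "\n";
}
```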
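Sentences 4 and 16 above contrast the graph model with map-reduce, where the programmer supplies only mapping, partitioning, and reducing functions and the framework handles distribution. The following toy, single-process word count is not tied to any real framework; it just shows which three pieces the programmer provides.

```cpp
// Toy single-process word count in map/partition/reduce style.
// Not tied to any real framework; it only shows the three user-supplied functions.
#include <functional>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// map: one input record -> a list of (word, 1) pairs
std::vector<KV> mapFn(const std::string& line) {
    std::vector<KV> out;
    std::istringstream in(line);
    std::string word;
    while (in >> word) out.push_back({word, 1});
    return out;
}

// partition: decide which reducer a key belongs to
size_t partitionFn(const std::string& key, size_t numReducers) {
    return std::hash<std::string>{}(key) % numReducers;
}

// reduce: all values for one key -> a single count
int reduceFn(const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> input = {
        "the quick brown fox", "the lazy dog", "the quick dog"};
    const size_t numReducers = 2;

    // "Shuffle": group mapped pairs by reducer, then by key.
    std::vector<std::map<std::string, std::vector<int>>> buckets(numReducers);
    for (const auto& line : input)
        for (const auto& [word, one] : mapFn(line))
            buckets[partitionFn(word, numReducers)][word].push_back(one);

    // Each reducer folds its keys independently (here, sequentially).
    for (const auto& bucket : buckets)
        for (const auto& [word, values] : bucket)
            std::cout << word << " " << reduceFn(values) << "\n";
}
```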
wordName wordTfidf (topN-words)
[('dryad', 0.505), ('dag', 0.234), ('outputs', 0.182), ('channels', 0.166), ('graph', 0.165), ('edges', 0.149), ('executes', 0.145), ('inputs', 0.142), ('acyclic', 0.117), ('layer', 0.111), ('manipulated', 0.11), ('porridge', 0.11), ('seems', 0.107), ('abstractly', 0.105), ('borrow', 0.105), ('changeable', 0.105), ('planner', 0.101), ('ironic', 0.098), ('vertex', 0.095), ('programmer', 0.093), ('algebra', 0.093), ('typed', 0.093), ('pipes', 0.091), ('exec', 0.091), ('figures', 0.091), ('relate', 0.091), ('pointers', 0.089), ('graphs', 0.088), ('approach', 0.083), ('restrictions', 0.078), ('bit', 0.078), ('serialization', 0.076), ('runs', 0.076), ('wars', 0.075), ('behaviour', 0.074), ('counting', 0.073), ('directed', 0.073), ('manually', 0.073), ('directly', 0.072), ('sequence', 0.072), ('unix', 0.072), ('defines', 0.071), ('daemon', 0.071), ('impression', 0.071), ('spirit', 0.071), ('apparently', 0.07), ('filter', 0.069), ('studies', 0.068), ('channel', 0.068), ('query', 0.067)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 155 high scalability-2007-11-15-Video: Dryad: A general-purpose distributed execution platform
2 0.33258381 591 high scalability-2009-05-06-Dyrad
Introduction: The Dryad Project is investigating programming models for writing parallel and distributed programs to scale from a small cluster to a large data-center.
Introduction: On the surface nothing appears more different than soft data and hard raw materials like iron. Then isn’t it ironic, in the Alanis Morissette sense, that in this Age of Information, great wealth still lies hidden deep beneath piles of stuff? It's so strange how directly digging for dollars in data parallels the great wealth-producing models of the Industrial Revolution. The piles of stuff is the Internet. It takes lots of prospecting to find the right stuff. Mighty web crawling machines tirelessly collect stuff, bringing it into their huge maws, then depositing load after load into rack after rack of distributed file system machines. Then armies of still other machines take this stuff and strip out the valuable raw materials, which in the Information Age, are endless bytes of raw data. Link clicks, likes, page views, content, headlines, searches, inbound links, outbound links, search clicks, hashtags, friends, purchases: anything and everything you do on the Internet is a valu
4 0.17032368 592 high scalability-2009-05-06-DyradLINQ
Introduction: The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for ordinary programmers. DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).
5 0.1546582 1136 high scalability-2011-11-03-Paper: G2 : A Graph Processing System for Diagnosing Distributed Systems
Introduction: One of the problems in building distributed systems is figuring out what the heck is going on. Usually endless streams of log files are consulted like ancients using entrails to divine the will of the Gods. To rise above these ancient practices we must rise to another level of abstraction, and that's the approach described in a Microsoft research paper: G2: A Graph Processing System for Diagnosing Distributed Systems, which uses execution graphs that model runtime events and their correlations in distributed systems. The problem with these schemes is viewing applications, written by programmers in low-level code, as execution graphs. But we're heading in this direction in any case. To program a warehouse or an internet-sized computer we'll have to write at higher levels of abstraction so code can be executed transparently at runtime on these giant distributed computers. There are many advantages to this approach; fault diagnosis and performance monitoring are just two of the wins
6 0.14111575 628 high scalability-2009-06-13-Neo4j - a Graph Database that Kicks Buttox
7 0.12733805 1406 high scalability-2013-02-14-When all the Program's a Graph - Prismatic's Plumbing Library
8 0.11599361 683 high scalability-2009-08-18-Hardware Architecture Example (geographical level mapping of servers)
9 0.11551724 805 high scalability-2010-04-06-Strategy: Make it Really Fast vs Do the Work Up Front
10 0.11489086 936 high scalability-2010-11-09-Facebook Uses Non-Stored Procedures to Update Social Graphs
11 0.10718364 626 high scalability-2009-06-10-Paper: Graph Databases and the Future of Large-Scale Knowledge Management
12 0.10220788 621 high scalability-2009-06-06-Graph server
13 0.10039495 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
14 0.095213391 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
15 0.094639026 1093 high scalability-2011-08-05-Stuff The Internet Says On Scalability For August 5, 2011
16 0.093342878 631 high scalability-2009-06-15-Large-scale Graph Computing at Google
17 0.093134649 538 high scalability-2009-03-16-Are Cloud Based Memory Architectures the Next Big Thing?
18 0.092561506 448 high scalability-2008-11-22-Google Architecture
19 0.09203206 1285 high scalability-2012-07-18-Disks Ain't Dead Yet: GraphChi - a disk-based large-scale graph computation
20 0.090990283 658 high scalability-2009-07-17-Against all the odds
topicId topicWeight
[(0, 0.134), (1, 0.091), (2, 0.005), (3, 0.033), (4, 0.015), (5, 0.076), (6, -0.004), (7, 0.023), (8, -0.003), (9, 0.045), (10, 0.044), (11, 0.009), (12, -0.038), (13, -0.077), (14, 0.034), (15, -0.043), (16, -0.038), (17, 0.102), (18, 0.032), (19, 0.087), (20, -0.083), (21, -0.066), (22, -0.014), (23, 0.0), (24, 0.008), (25, 0.064), (26, 0.047), (27, 0.016), (28, 0.054), (29, -0.018), (30, -0.019), (31, -0.022), (32, -0.019), (33, 0.03), (34, -0.008), (35, 0.041), (36, 0.027), (37, -0.047), (38, -0.009), (39, 0.055), (40, -0.029), (41, 0.036), (42, 0.003), (43, -0.025), (44, 0.011), (45, -0.007), (46, 0.025), (47, -0.009), (48, 0.051), (49, -0.01)]
simIndex simValue blogId blogTitle
same-blog 1 0.96206093 155 high scalability-2007-11-15-Video: Dryad: A general-purpose distributed execution platform
2 0.84892905 1136 high scalability-2011-11-03-Paper: G2 : A Graph Processing System for Diagnosing Distributed Systems
3 0.84862131 1406 high scalability-2013-02-14-When all the Program's a Graph - Prismatic's Plumbing Library
Introduction: At some point as a programmer you might have the insight/fear that all programming is just doing stuff to other stuff. Then you may observe after coding the same stuff over again that stuff in a program often takes the form of interacting patterns of flows. Then you may think hey, a program isn't only useful for coding datastructures, but a program is a kind of datastructure, and that with a meta-level jump you could program a program in terms of flows over data and flows over other flows. That's the kind of stuff Prismatic is making available in the Graph extension to their plumbing package (code examples), which is described in an excellent post: Graph: Abstractions for Structured Computation. You may remember Prismatic from a previous profile we did on HighScalability: Prismatic Architecture - Using Machine Learning On Social Networks To Figure Out What You Should Read On The Web. We learned how Prismatic, an interest-driven content suggestion service, builds programs in
4 0.80326271 805 high scalability-2010-04-06-Strategy: Make it Really Fast vs Do the Work Up Front
Introduction: In Cool spatial algos with Neo4j: Part 1 - Routing with A* in Ruby, Peter Neubauer not only does a fantastic job explaining a complicated routing algorithm using the graph database Neo4j, but he surfaces an interesting architectural conundrum: make it really fast so work can be done on the reads, or do all the work on the writes so the reads are really fast. The money quote pointing out the competing options is: [Being] able to do these calculations in sub-second speeds on graphs of millions of roads and waypoints makes it possible in many cases to abandon the normal approach of precomputing indexes with K/V stores and be able to put routing into the critical path with the possibility to adapt to the live conditions and build highly personalized and dynamic spatial services. The poster boy for the precompute strategy is SimpleGeo, a startup that is building a "scaling infrastructure for geodata." Their strategy for handling geodata is to use Cassandra and bui
5 0.80298424 631 high scalability-2009-06-15-Large-scale Graph Computing at Google
Introduction: To continue the graph theme Google has got into the act and released information on Pregel . Pregel does not appear to be a new type of potato chip. Pregel is instead a scalable infrastructure... ...to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology. Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel's applicability is harder to quantify, but so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers
6 0.79565668 1285 high scalability-2012-07-18-Disks Ain't Dead Yet: GraphChi - a disk-based large-scale graph computation
7 0.79031712 628 high scalability-2009-06-13-Neo4j - a Graph Database that Kicks Buttox
8 0.77766395 766 high scalability-2010-01-26-Product: HyperGraphDB - A Graph Database
10 0.74608952 626 high scalability-2009-06-10-Paper: Graph Databases and the Future of Large-Scale Knowledge Management
11 0.72305131 842 high scalability-2010-06-16-Hot Scalability Links for June 16, 2010
12 0.71455282 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
13 0.67468297 1263 high scalability-2012-06-13-Why My Soap Film is Better than Your Hadoop Cluster
14 0.66130531 827 high scalability-2010-05-14-Hot Scalability Links for May 14, 2010
15 0.65252268 621 high scalability-2009-06-06-Graph server
16 0.63253403 58 high scalability-2007-08-04-Product: Cacti
17 0.62367642 1512 high scalability-2013-09-05-Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
18 0.62364841 722 high scalability-2009-10-15-Hot Scalability Links for Oct 15 2009
19 0.61080229 1365 high scalability-2012-11-30-Stuff The Internet Says On Scalability For November 30, 2012
20 0.60143673 216 high scalability-2008-01-17-Database People Hating on MapReduce
topicId topicWeight
[(1, 0.098), (2, 0.184), (8, 0.254), (10, 0.067), (17, 0.03), (61, 0.061), (77, 0.032), (79, 0.142), (85, 0.038)]
simIndex simValue blogId blogTitle
1 0.89981538 1243 high scalability-2012-05-10-Paper: Paxos Made Moderately Complex
Introduction: If you are a normal human being and find the Paxos protocol confusing, then this paper, Paxos Made Moderately Complex, is a great find. Robbert van Renesse from Cornell University has written a clear and well-written paper with excellent explanations. The Abstract: For anybody who has ever tried to implement it, Paxos is by no means a simple protocol, even though it is based on relatively simple invariants. This paper provides imperative pseudo-code for the full Paxos (or Multi-Paxos) protocol without shying away from discussing various implementation details. The initial description avoids optimizations that complicate comprehension. Next we discuss liveness, and list various optimizations that make the protocol practical. Related Articles Paxos on HighScalability.com
2 0.88217551 186 high scalability-2007-12-13-un-article: the setup behind microsoft.com
Introduction: On the blogs.technet.com article on microsoft.com's infrastructure: The article reads like a blatant ad for its own products, and is light on the technical side. The juicy bits are here, so you know what the fuss is about: Citrix NetScaler (= load balancer with various optimizations); W2K8 + IIS7 and antivirus software on the webservers; 650GB/day of IIS log files; 8-9GBit/s (unknown if CDNs are included); simple network filtering: stateless access lists blocking unwanted ports on the routers/switches (hence the debated "no firewalls" claim). Note that this information may not reflect present reality very well; the spokesman appears to be reciting others' words.
same-blog 3 0.87591648 155 high scalability-2007-11-15-Video: Dryad: A general-purpose distributed execution platform
4 0.85580677 1108 high scalability-2011-08-31-Pud is the Anti-Stack - Windows, CFML, Dropbox, Xeround, JungleDisk, ELB
Introduction: Pud of f*ckedcompany.com (FC) fame, a favorite site of the dot bomb era, and a site I absolutely loved until my company became featured, has given us a look at his backend: Why Must You Laugh At My Back End. For those who don't remember FC's history, TechCrunch published a fitting eulogy: [FC] first went live in 2000, chronicling failing and troubled companies in its unique and abrasive style after the dot com bust. Within a year it had a massive audience and was getting serious mainstream press attention. As the startup economy became better in 2004, much of the attention the site received went away. But a large and loyal audience remains at the site, coming back day after day for its unique slant on the news. At its peak, FC had 4 million unique monthly visitors. Delightfully, FC was not a real-names kind of site. Hard witty cynicism ruled and not a single cat picture was in sight. It was a blast of fun when all around was the enclosing dark. So when I saw Pud's post
5 0.85136336 272 high scalability-2008-03-08-Product: FAI - Fully Automatic Installation
Introduction: From their website: FAI is an automated installation tool to install or deploy Debian GNU/Linux and other distributions on a bunch of different hosts or a Cluster. It's more flexible than other tools like kickstart for Red Hat, autoyast and alice for SuSE or Jumpstart for SUN Solaris. FAI can also be used for configuration management of a running system. You can take one or more virgin PCs, turn on the power and after a few minutes Linux is installed, configured and running on all your machines, without any interaction necessary. FAI is a scalable method for installing and updating all your computers unattended with little effort involved. It's a centralized management system for your Linux deployment. FAI's target group are system administrators who have to install Linux onto one or even hundreds of computers. It's not only a tool for doing a Cluster installation but a general purpose installation tool. It can be used for installing a Beowulf cluster, a rendering farm,
6 0.81308401 1229 high scalability-2012-04-17-YouTube Strategy: Adding Jitter isn't a Bug
7 0.77579653 1183 high scalability-2012-01-30-37signals Still Happily Scaling on Moore RAM and SSDs
8 0.74143058 1434 high scalability-2013-04-03-5 Steps to Benchmarking Managed NoSQL - DynamoDB vs Cassandra
9 0.73097879 1207 high scalability-2012-03-12-Google: Taming the Long Latency Tail - When More Machines Equals Worse Results
10 0.72105044 350 high scalability-2008-07-15-ZooKeeper - A Reliable, Scalable Distributed Coordination System
11 0.7203325 913 high scalability-2010-10-01-Hot Scalability Links For Oct 1, 2010
12 0.7123847 1041 high scalability-2011-05-15-Building a Database remote availability site
13 0.71213311 1159 high scalability-2011-12-19-How Twitter Stores 250 Million Tweets a Day Using MySQL
14 0.71182632 1186 high scalability-2012-02-02-The Data-Scope Project - 6PB storage, 500GBytes-sec sequential IO, 20M IOPS, 130TFlops
15 0.71139884 858 high scalability-2010-07-13-Sponsored Post: VoltDB and Digg are Hiring
16 0.71082729 687 high scalability-2009-08-24-How Google Serves Data from Multiple Datacenters
17 0.71028203 1098 high scalability-2011-08-15-Should any cloud be considered one availability zone? The Amazon experience says yes.
18 0.70935506 666 high scalability-2009-07-30-Learn How to Think at Scale
19 0.70853859 734 high scalability-2009-10-30-Hot Scalabilty Links for October 30 2009
20 0.70792288 1316 high scalability-2012-09-04-Changing Architectures: New Datacenter Networks Will Set Your Code and Data Free