high_scalability high_scalability-2011 high_scalability-2011-1088 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Dr. Daniel Abadi, author of the DBMS Musings blog and cofounder of Hadapt, which offers a product improving Hadoop performance by 50x on relational data, is now taking his talents to graph data in Hadoop's tremendous inefficiency on graph data management (and how to avoid it), which shares the secrets of getting Hadoop to perform 1000x better on graph data. TL;DR: Analysing graph data is at the heart of important data mining problems. Hadoop is the tool of choice for many of these problems. Hadoop-style MapReduce works best on key-value processing, not graph processing, and can be well over a factor of 1000 less efficient than it needs to be. Hadoop's inefficiency has consequences in the real world. Inefficiency on graph data problems like improving power utilization, minimizing carbon emissions, and improving product designs leads to a lot of value being left on the table in the form of negative environmental consequences, increased server costs, and increased data center space…
sentIndex sentText sentNum sentScore
1 TL;DR: Analysing graph data is at the heart of important data mining problems. [sent-3, score-0.748]
2 Hadoop-style MapReduce works best on key-value processing, not graph processing, and can be well over a factor of 1000 less efficient than it needs to be. [sent-5, score-0.515]
3 Hadoop inefficiency has consequences in the real world. [sent-6, score-0.462]
4 10x improvement by using a clustering algorithm to partition graph data across the nodes of the Hadoop cluster. [sent-8, score-0.521]
5 By default in Hadoop data is distributed randomly around a cluster, which means data that's close together in the graph can be very far apart on disk. [sent-9, score-0.782]
6 This is very slow for common operations like sub-graph pattern matching, which prefers neighbors to be stored on the same machine. [sent-10, score-0.133]
7 10x improvement by replicating data on the edges of partitions so that vertices are stored on the same physical machine as their neighbors. [sent-11, score-0.118]
8 By default Hadoop replicates data 3 times; treating all data equally in this way is inefficient. [sent-12, score-0.443]
9 HDFS, which is a distributed file system, and HBase, which is an unstructured data storage system, are not optimal data stores for graph data. [sent-14, score-0.696]
10 That's a 10x * 10x * 10x = 1000x performance improvement on graph problems using techniques that make a lot of sense. [sent-16, score-0.403]
11 What may be less obvious is the whole idea of keeping the Hadoop shell and making the component parts more efficient for graph problems. [sent-17, score-0.594]
12 Hadoop stays Hadoop externally, but internally has graph super powers. [sent-18, score-0.461]
13 What I found most intriguing is thinking about the larger consequences of Hadoop being inefficient. [sent-20, score-0.318]
14 From the most obvious angle, money, we are used to thinking this way about mass produced items. [sent-22, score-0.139]
15 If a widget can be cost reduced by 10 cents and millions of them are made, we are talking real money. [sent-23, score-0.139]
16 If Hadoop is going to be used for the majority of data mining problems, then making it more efficient adds up to real effects. [sent-24, score-0.41]
17 Going to the next level, the more efficient Hadoop becomes, the quicker important problems facing the world will be solved. [sent-25, score-0.173]
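The partitioning idea in point 4 can be sketched in a few lines. This is an illustrative toy, not the clustering algorithm the underlying paper actually uses (production systems typically rely on a proper graph partitioner such as METIS); `clustered_partition` and the other names here are hypothetical, shown only to contrast locality-aware placement with Hadoop's default random placement:

```python
import random
from collections import defaultdict

def random_partition(vertices, num_machines):
    # Hadoop-default style: each vertex lands on an arbitrary machine,
    # so graph neighbors are usually far apart on disk.
    return {v: random.randrange(num_machines) for v in vertices}

def clustered_partition(edges, vertices, num_machines):
    # Toy locality-aware placement: grow one partition at a time by BFS
    # so that neighbors tend to share a machine.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    target = -(-len(vertices) // num_machines)  # ceil division
    placement = {}
    unassigned = set(vertices)
    machine, filled, frontier = 0, 0, []
    while unassigned:
        if not frontier:
            frontier = [next(iter(unassigned))]
        v = frontier.pop()
        if v not in unassigned:
            continue
        placement[v] = machine
        unassigned.discard(v)
        filled += 1
        frontier.extend(adj[v] & unassigned)
        if filled >= target and machine < num_machines - 1:
            machine, filled, frontier = machine + 1, 0, []
    return placement

def cut_edges(edges, placement):
    # Edges whose endpoints live on different machines: each one is a
    # potential network hop during sub-graph pattern matching.
    return sum(1 for u, v in edges if placement[u] != placement[v])
```

The point of the sketch: minimizing `cut_edges` is exactly what keeps sub-graph pattern matching on a single machine.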
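Point 7's edge replication can also be sketched: given a placement of vertices onto machines, mirror each boundary vertex onto the machines of its remote neighbors so that a 1-hop lookup never leaves a machine. A minimal illustration under assumed names (`boundary_replicas` is not from the paper):

```python
from collections import defaultdict

def boundary_replicas(edges, placement):
    # For each machine, the set of remote vertices that must be mirrored
    # locally so every locally-owned vertex sees all its neighbors
    # without a network hop. Only boundary vertices get extra copies,
    # unlike Hadoop's blanket 3x replication of everything.
    replicas = defaultdict(set)
    for u, v in edges:
        mu, mv = placement[u], placement[v]
        if mu != mv:
            replicas[mu].add(v)  # mirror v next to u
            replicas[mv].add(u)  # mirror u next to v
    return dict(replicas)
```

The better the partitioning, the fewer boundary vertices there are, so the two 10x techniques reinforce each other.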
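Point 9's claim that HDFS and HBase are poor graph stores comes down to locality of neighborhood reads. A hypothetical sketch of a vertex-centric layout (the names are illustrative, not the storage engine the paper uses): a vertex's whole neighborhood lives under one key, so a two-hop pattern match is a handful of local lookups rather than a MapReduce pass over the data set.

```python
def build_adjacency_store(edges):
    # Vertex-centric layout: all of a vertex's outgoing edges live
    # under one key, so reading a neighborhood is a single lookup
    # instead of a scan over the whole edge set.
    store = {}
    for u, v in edges:
        store.setdefault(u, []).append(v)
    return store

def two_hop_neighbors(store, v):
    # A basic sub-graph pattern matching primitive: expand two hops
    # from v using only local lookups.
    out = set()
    for n in store.get(v, []):
        out.update(store.get(n, []))
    out.discard(v)
    return out
```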
wordName wordTfidf (topN-words)
[('graph', 0.403), ('hadoop', 0.354), ('rdf', 0.33), ('abadi', 0.209), ('consequences', 0.197), ('inefficiency', 0.194), ('sparql', 0.194), ('daniel', 0.17), ('improving', 0.128), ('data', 0.118), ('efficient', 0.112), ('mining', 0.109), ('increased', 0.098), ('querying', 0.091), ('huang', 0.088), ('default', 0.086), ('avi', 0.083), ('oranges', 0.083), ('obvious', 0.079), ('talents', 0.079), ('musings', 0.079), ('cofounder', 0.079), ('scalably', 0.076), ('carbon', 0.076), ('kurt', 0.074), ('externally', 0.071), ('inefficiencies', 0.071), ('real', 0.071), ('prefers', 0.07), ('warehousing', 0.07), ('widget', 0.068), ('processing', 0.067), ('graphs', 0.066), ('erik', 0.066), ('mapreduce', 0.065), ('theoriginal', 0.064), ('angle', 0.064), ('environmental', 0.064), ('treating', 0.063), ('neighbors', 0.063), ('dr', 0.062), ('quicker', 0.061), ('intriguing', 0.061), ('cient', 0.061), ('thinking', 0.06), ('richard', 0.059), ('stays', 0.058), ('replicates', 0.058), ('randomly', 0.057), ('unstructured', 0.057)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000004 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
2 0.30661494 628 high scalability-2009-06-13-Neo4j - a Graph Database that Kicks Buttox
Introduction: Update: Social networks in the database: using a graph database . A nice post on representing, traversing, and performing other common social network operations using a graph database. If you are Digg or LinkedIn you can build your own speedy graph database to represent your complex social network relationships. For those of more modest means Neo4j , a graph database, is a good alternative. A graph is a collection nodes (things) and edges (relationships) that connect pairs of nodes. Slap properties (key-value pairs) on nodes and relationships and you have a surprisingly powerful way to represent most anything you can think of. In a graph database "relationships are first-class citizens. They connect two nodes and both nodes and relationships can hold an arbitrary amount of key-value pairs. So you can look at a graph database as a key-value store, with full support for relationships." A graph looks something like: For more lovely examples take a look at the Graph Image Gal
Introduction: On the surface nothing appears more different than soft data and hard raw materials like iron. Then isn’t it ironic , in the Alanis Morissette sense, that in this Age of Information, great wealth still lies hidden deep beneath piles of stuff? It's so strange how directly digging for dollars in data parallels the great wealth producing models of the Industrial Revolution. The piles of stuff is the Internet. It takes lots of prospecting to find the right stuff. Mighty web crawling machines tirelessly collect stuff, bringing it into their huge maws, then depositing load after load into rack after rack of distributed file system machines. Then armies of still other machines take this stuff and strip out the valuable raw materials, which in the Information Age, are endless bytes of raw data. Link clicks, likes, page views, content, head lines, searches, inbound links, outbound links, search clicks, hashtags, friends, purchases: anything and everything you do on the Internet is a valu
4 0.24590658 626 high scalability-2009-06-10-Paper: Graph Databases and the Future of Large-Scale Knowledge Management
Introduction: Relational databases, document databases, and distributed hash tables get most of the hype these days, but there's another option: graph databases. Back to the future it seems. Here's a really interesting paper by Marko A. Rodriguez introducing the graph model and it's extension to representing the world wide web of data. Modern day open source and commercial graph databases can store on the order of 1 billion relationships with some databases reaching the 10 billion mark. These developments are making the graph database practical for applications that require large-scale knowledge structures. Moreover, with the Web of Data standards set forth by the Linked Data community, it is possible to interlink graph databases across the web into a giant global knowledge structure. This talk will discuss graph databases, their underlying data model, their querying mechanisms, and the benefits of the graph data structure for modeling and analysis.
5 0.23086539 621 high scalability-2009-06-06-Graph server
Introduction: I've seen mentioned in few times sites like Digg or LinkedIn using graph servers to hold their social graphs. But the only sort of open source graph server I've found is http://neo4j.org/ . Can anyone recommend an open source graph server? Thanks Aaron
6 0.20527612 1406 high scalability-2013-02-14-When all the Program's a Graph - Prismatic's Plumbing Library
7 0.20001175 601 high scalability-2009-05-17-Product: Hadoop
8 0.19904131 827 high scalability-2010-05-14-Hot Scalability Links for May 14, 2010
9 0.19296141 805 high scalability-2010-04-06-Strategy: Make it Really Fast vs Do the Work Up Front
10 0.18907821 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
11 0.17031954 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
12 0.16802487 1136 high scalability-2011-11-03-Paper: G2 : A Graph Processing System for Diagnosing Distributed Systems
13 0.16167958 1313 high scalability-2012-08-28-Making Hadoop Run Faster
14 0.16156073 1285 high scalability-2012-07-18-Disks Ain't Dead Yet: GraphChi - a disk-based large-scale graph computation
15 0.16111599 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
16 0.15679941 414 high scalability-2008-10-15-Hadoop - A Primer
17 0.14505507 766 high scalability-2010-01-26-Product: HyperGraphDB - A Graph Database
18 0.14023875 650 high scalability-2009-07-02-Product: Hbase
19 0.13433747 56 high scalability-2007-08-03-Running Hadoop MapReduce on Amazon EC2 and Amazon S3
20 0.12740251 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?
topicId topicWeight
[(0, 0.157), (1, 0.071), (2, 0.032), (3, 0.081), (4, 0.041), (5, 0.166), (6, 0.024), (7, 0.022), (8, 0.12), (9, 0.167), (10, 0.137), (11, -0.012), (12, 0.018), (13, -0.169), (14, 0.021), (15, -0.051), (16, -0.03), (17, 0.166), (18, 0.045), (19, 0.176), (20, -0.167), (21, 0.001), (22, 0.048), (23, -0.034), (24, -0.073), (25, 0.115), (26, 0.03), (27, 0.065), (28, 0.071), (29, 0.006), (30, 0.013), (31, 0.075), (32, -0.002), (33, -0.01), (34, -0.039), (35, 0.137), (36, -0.033), (37, 0.029), (38, -0.025), (39, -0.032), (40, -0.009), (41, 0.053), (42, -0.01), (43, -0.049), (44, 0.044), (45, -0.016), (46, 0.01), (47, 0.027), (48, -0.025), (49, -0.009)]
simIndex simValue blogId blogTitle
same-blog 1 0.95906115 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
2 0.81451011 1285 high scalability-2012-07-18-Disks Ain't Dead Yet: GraphChi - a disk-based large-scale graph computation
Introduction: GraphChi uses a Parallel Sliding Windows method which can: process a graph with mutable edge values efficiently from disk, with only a small number of non-sequential disk accesses, while supporting the asynchronous model of computation. The result is graphs with billions of edges can be processed on just a single machine. It uses a vertex-centric computation model similar to Pregel , which supports iterative algorithims as apposed to the batch style of MapReduce. Streaming graph updates are supported. About GraphChi, Carlos Guestrin, codirector of Carnegie Mellon's Select Lab, says : A Mac Mini running GraphChi can analyze Twitter's social graph from 2010—which contains 40 million users and 1.2 billion connections—in 59 minutes. "The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers Related Articles Aapo Kyrola Home Page Your Laptop Can Now Analyze Big Data by JOHN PAVLUS Example Applications Runn
3 0.80991077 628 high scalability-2009-06-13-Neo4j - a Graph Database that Kicks Buttox
5 0.79350513 1406 high scalability-2013-02-14-When all the Program's a Graph - Prismatic's Plumbing Library
Introduction: At some point as a programmer you might have the insight/fear that all programming is just doing stuff to other stuff. Then you may observe after coding the same stuff over again that stuff in a program often takes the form of interacting patterns of flows. Then you may think hey, a program isn't only useful for coding datastructures, but a program is a kind of datastructure and that with a meta level jump you could program a program in terms of flows over data and flow over other flows. That's the kind of stuff Prismatic is making available in the Graph extension to their plumbing package ( code examples ), which is described in an excellent post: Graph: Abstractions for Structured Computation . You may remember Prismatic from previous profile we did on HighScalability: Prismatic Architecture - Using Machine Learning On Social Networks To Figure Out What You Should Read On The Web . We learned how Prismatic, an interest driven content suggestion service, builds programs in
6 0.77768052 626 high scalability-2009-06-10-Paper: Graph Databases and the Future of Large-Scale Knowledge Management
7 0.74540353 1136 high scalability-2011-11-03-Paper: G2 : A Graph Processing System for Diagnosing Distributed Systems
8 0.739582 827 high scalability-2010-05-14-Hot Scalability Links for May 14, 2010
9 0.7300455 766 high scalability-2010-01-26-Product: HyperGraphDB - A Graph Database
10 0.72986042 805 high scalability-2010-04-06-Strategy: Make it Really Fast vs Do the Work Up Front
11 0.71075243 155 high scalability-2007-11-15-Video: Dryad: A general-purpose distributed execution platform
12 0.70055425 631 high scalability-2009-06-15-Large-scale Graph Computing at Google
13 0.65644073 621 high scalability-2009-06-06-Graph server
14 0.63780624 1512 high scalability-2013-09-05-Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
15 0.62239939 601 high scalability-2009-05-17-Product: Hadoop
16 0.61794341 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
17 0.60410905 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
18 0.5993439 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
19 0.59703815 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster
20 0.59605753 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2
topicId topicWeight
[(1, 0.101), (2, 0.235), (10, 0.021), (27, 0.011), (30, 0.042), (43, 0.264), (51, 0.014), (61, 0.079), (77, 0.023), (79, 0.094), (85, 0.029)]
simIndex simValue blogId blogTitle
1 0.95816487 505 high scalability-2009-02-01-More Chips Means Less Salsa
Introduction: Yes, I just got through watching the Superbowl so chips and salsa are on my mind and in my stomach. In recreational eating more chips requires downing more salsa. With mulitcore chips it turns out as cores go up salsa goes down, salsa obviously being a metaphor for speed. Sandia National Laboratories found in their simulations: a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added. The problem is the lack of memory bandwidth as well as contention between processors over the memory bus available to each processor. The implication for those following a diagonal scaling strategy is to work like heck to make your system fit within eight multicores. After that you'll need to consider some sort of partitioning strategy. What's interesti
2 0.88258702 1182 high scalability-2012-01-27-Stuff The Internet Says On Scalability For January 27, 2012
Introduction: If you’ve got the time, we’ve got the HighScalability: 9nm : IBM's carbon nanotube transistor that outperforms silicon; YouTube : 4 Billion Views/Day; 864GB RAM : 37signals Memcache, $12K Quotable Quotes: Chad Dickerson : You can only get growth by feeding opportunities. @launchany : It amazes me how many NoSQL database vendors spend more time detailing their scalability and no time detailing the data model and design Google : Let's make TCP faster. WhatsApp : we are now able to easily push our systems to over 2 million tcp connections! Sidney Dekker : In a complex system…doing the same thing twice will not predictably or necessarily lead to the same results. @Rasmusfjord : Just heard about an Umbraco site running on Azure that handles 20.000 requests /*second* Herb Sutter with an epic post, Welcome to the Jungle , touching on a lot of themes we've explored on HighScalability, only in a dramatically more competent way. What's after
3 0.87372434 1336 high scalability-2012-10-09-Batoo JPA - The new JPA Implementation that runs over 15 times faster...
Introduction: This post is by Hasan Ceylan , an Open Source software enthusiast from Istanbul. I loved the JPA 1.0 back in early 2000s. I started using it together with EJB 3.0 even before the stable releases. I loved it so much that I contributed bits and parts for JBoss 3.x implementations. Those were the days our company was considerably still small in size. Creating new features and applications were more priority than the performance, because there were a lot of ideas that we have and we needed to develop and market those as fast as we can. Now, we no longer needed to write tedious and error prone xml descriptions for the data model and deployment descriptors. Nor we needed to use the curse called “XDoclet”. On the other side, our company grew steadily, our web site has become the top portal in the country for live events and ticketing. We now had the performance problems! Although the company grew considerably, due to the economics in the industry, we did not make a lot of money. The ch
4 0.8722052 893 high scalability-2010-09-03-Hot Scalability Links For Sep 3, 2010
Introduction: With summer almost gone, it's time to fall into some good links... Hibari - distributed, fault tolerant, highly available key-value store written in Erlang. In this video Scott Lystig Fritchie gives a very good overview of the newest key-value store. Tweets of Gold lenidot : with 12 staff, @ tumblr serves 1.5billion pageviews/month and 25,000 signups/day. Now that's scalability! jmtan24 : Funny that whenever a high scalability article comes out, it always mention the shared nothing approach mfeathers : When life gives you lemons, you can have decades-long conquest to convert lemons to oranges, or you can make lemonade. OyvindIsene : Met an old man with mustache today, he had no opinion on #noSQL . Note to myself: Don't grow a mustache, now or later. vlad003 : Isn't it interesting how P2P distributes data while Cloud Computing centralizes it? And they're both said to be the future. You may be interested in a new DevOps Meetup organized by Dave
same-blog 5 0.86735797 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
6 0.85368508 726 high scalability-2009-10-22-Paper: The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM
7 0.8498069 1624 high scalability-2014-04-01-The Mullet Cloud Selection Pattern
8 0.82764965 37 high scalability-2007-07-28-Product: Web Log Storming
9 0.81704992 1603 high scalability-2014-02-28-Stuff The Internet Says On Scalability For February 28th, 2014
10 0.80760849 342 high scalability-2008-06-08-Search fast in million rows
11 0.8064065 470 high scalability-2008-12-18-Risk Analysis on the Cloud (Using Excel and GigaSpaces)
12 0.79784268 937 high scalability-2010-11-09-Paper: Hyder - Scaling Out without Partitioning
13 0.78687882 1475 high scalability-2013-06-13-Busting 4 Modern Hardware Myths - Are Memory, HDDs, and SSDs Really Random Access?
14 0.7811448 54 high scalability-2007-08-02-Multilanguage Website
15 0.78047019 1131 high scalability-2011-10-24-StackExchange Architecture Updates - Running Smoothly, Amazon 4x More Expensive
16 0.76925391 798 high scalability-2010-03-22-7 Secrets to Successfully Scaling with Scalr (on Amazon) by Sebastian Stadil
17 0.76413363 1123 high scalability-2011-09-23-The Real News is Not that Facebook Serves Up 1 Trillion Pages a Month…
18 0.7640664 1460 high scalability-2013-05-17-Stuff The Internet Says On Scalability For May 17, 2013
19 0.75274199 837 high scalability-2010-06-07-Six Ways Twitter May Reach its Big Hairy Audacious Goal of One Billion Users
20 0.73926711 1385 high scalability-2013-01-11-Stuff The Internet Says On Scalability For January 11, 2013