high_scalability high_scalability-2009 high_scalability-2009-650 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Update 3: Presentation from the NoSQL Conference: slides, video. Update 2: Jim Wilson helps with Understanding HBase and BigTable by explaining them from a "conceptual standpoint." Update: InfoQ interview: HBase Leads Discuss Hadoop, BigTable and Distributed Databases. "MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing." HBase is the open source answer to BigTable, Google's highly scalable distributed database. It is built on top of Hadoop (product), which implements functionality similar to Google's GFS and MapReduce systems. Both Google's GFS and Hadoop's HDFS provide a mechanism to reliably store large amounts of data. However, neither really provides a mechanism for organizing the data and accessing only the parts that are of interest to a particular application. BigTable (and HBase) provide a means for organizing and efficiently accessing these large data sets. HBase is still not ready for production, but it's a glimpse into the power that will soon be available to your average website builder. Google is of course still way ahead of the game. They have huge core competencies in data center roll-out and they will continually improve their stack. It will be interesting to see how these sorts of tools along with Software as a Service can be leveraged to create the next generation of systems.
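The data model the post is describing, a sparse, sorted map keyed by row, column, and timestamp, can be sketched in a few lines. This is a conceptual illustration only, not the HBase API; the class, method, and sample names below are invented for the example.

```python
# Conceptual sketch of the BigTable/HBase data model: a sparse, sorted,
# versioned map from (row key, "family:qualifier" column, timestamp) to an
# uninterpreted byte string. Names are illustrative, not the real HBase API.
class MiniTable:
    def __init__(self):
        self._cells = {}  # (row, column) -> list of (timestamp, value), newest first

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: -tv[0])  # keep the newest version first

    def get(self, row, column):
        """Return the newest version of a cell, or None if absent."""
        versions = self._cells.get((row, column))
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield (row, column, newest value) for rows in [start_row, stop_row),
        in sorted row order -- the access pattern a flat HDFS file can't serve."""
        for (row, column) in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row, column, self.get(row, column)

t = MiniTable()
t.put("com.example/a", "anchor:home", 1, b"v1")
t.put("com.example/a", "anchor:home", 2, b"v2")
t.put("com.example/b", "contents:", 1, b"<html>")
print(t.get("com.example/a", "anchor:home"))  # newest version wins: b'v2'
```

A real HBase table adds column families, region splitting, and persistence on HDFS; the point here is only the keyed, sorted, versioned access that the post says GFS/HDFS alone lack.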
simIndex simValue blogId blogTitle
same-blog 1 0.99999994 650 high scalability-2009-07-02-Product: Hbase
Introduction: You may have read somewhere that Facebook has introduced a new Social Inbox integrating email, IM, SMS, text messages, and on-site Facebook messages. All in all, they need to store over 135 billion messages a month. Where do they store all that stuff? Facebook's Kannan Muthukkaruppan gives the surprise answer in The Underlying Technology of Messages: HBase. HBase beat out MySQL, Cassandra, and a few others. Why a surprise? Facebook created Cassandra, and it was purpose-built for an inbox type application, but they found Cassandra's eventual consistency model wasn't a good match for their new real-time Messages product. Facebook also has an extensive MySQL infrastructure, but they found performance suffered as data sets and indexes grew larger. And they could have built their own, but they chose HBase. HBase is a scaleout table store supporting very high rates of row-level updates over massive amounts of data. Exactly what is needed for a Messaging system. HBase is also a colu
3 0.20411843 448 high scalability-2008-11-22-Google Architecture
Introduction: Update 2: Sorting 1 PB with MapReduce . PB is not peanut-butter-and-jelly misspelled. It's 1 petabyte or 1000 terabytes or 1,000,000 gigabytes. It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers and the results were replicated thrice on 48,000 disks. Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters . Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build
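The MapReduce model those statistics come from can be sketched without a cluster: a map function emits key/value pairs, the framework groups pairs by key, and a reduce function folds each group. This is a single-process Python sketch of the programming model, not Google's or Hadoop's implementation; the sample data is invented.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # "Shuffle" phase: group every (key, value) pair emitted by the mapper.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: fold each group independently (this is what parallelizes
    # across thousands of machines in the real systems).
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count over two invented lines of text.
lines = ["the quick brown fox", "the lazy dog"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts["the"])  # 2
```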
4 0.19230427 795 high scalability-2010-03-16-1 Billion Reasons Why Adobe Chose HBase
Introduction: Cosmin Lehene wrote two excellent articles on Adobe's experiences with HBase: Why we’re using HBase: Part 1 and Why we’re using HBase: Part 2. Adobe needed a generic, real-time, structured data storage and processing system that could handle any data volume, with access times under 50ms, with no downtime and no data loss. The articles go into great detail about their experiences with HBase and their evaluation process, providing a "well reasoned impartial use case from a commercial user". They talk about failure handling, availability, write performance, read performance, random reads, sequential scans, and consistency. One of the knocks against HBase has been its complexity, as it has many parts that need installation and configuration. All is not lost according to the Adobe team: HBase is more complex than other systems (you need Hadoop, Zookeeper, and cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argu
5 0.17657292 601 high scalability-2009-05-17-Product: Hadoop
Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig . Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3 : Scaling Hadoop to 4000 nodes at Yahoo! . 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2 : Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides . Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity
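The Pig Latin claim above ("10 lines of Pig Latin = 200 lines of Java") is about expressing a job as a data flow of load/group/aggregate steps instead of hand-written map and reduce classes. A rough single-process Python analogue of such a flow, using invented sample data and field names:

```python
from itertools import groupby

# A Pig-style data flow: LOAD -> GROUP BY -> FOREACH ... GENERATE SUM, written
# as pipeline steps over tuples. Data and field names are invented examples.
records = [("alice", 3), ("bob", 5), ("alice", 4)]           # LOAD
records.sort(key=lambda r: r[0])                             # implicit in GROUP
grouped = groupby(records, key=lambda r: r[0])               # GROUP records BY user
totals = {user: sum(n for _, n in rows)                      # FOREACH group
          for user, rows in grouped}                         #   GENERATE SUM(n)
print(totals)  # {'alice': 7, 'bob': 5}
```

In Pig Latin each step compiles to map-reduce jobs over HDFS; the sketch only shows why the data-flow style needs so much less code than hand-rolled Java mappers and reducers.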
6 0.17295788 647 high scalability-2009-07-02-Hypertable is a New BigTable Clone that Runs on HDFS or KFS
7 0.15116072 309 high scalability-2008-04-23-Behind The Scenes of Google Scalability
8 0.14659065 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
9 0.14640118 414 high scalability-2008-10-15-Hadoop - A Primer
10 0.14199482 1189 high scalability-2012-02-07-Hypertable Routs HBase in Performance Test -- HBase Overwhelmed by Garbage Collection
11 0.14023875 1088 high scalability-2011-07-27-Making Hadoop 1000x Faster for Graph Problems
12 0.1324828 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
13 0.13165298 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge
14 0.13108623 1313 high scalability-2012-08-28-Making Hadoop Run Faster
15 0.12535018 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
16 0.12500702 327 high scalability-2008-05-27-How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale
17 0.12439718 227 high scalability-2008-01-28-Howto setup GFS-GNBD
18 0.11830788 517 high scalability-2009-02-21-Google AppEngine - A Second Look
19 0.11685464 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
simIndex simValue blogId blogTitle
same-blog 1 0.9770599 650 high scalability-2009-07-02-Product: Hbase
2 0.79205608 647 high scalability-2009-07-02-Hypertable is a New BigTable Clone that Runs on HDFS or KFS
Introduction: Update 3: Presentation from the NoSQL conference: slides, video 1, video 2. Update 2: The folks at Hypertable would like you to know that Hypertable is now officially sponsored by Baidu, China’s leading search engine. As a sponsor of Hypertable, Baidu has committed an industrious team of engineers, numerous servers, and support resources to improve the quality and development of the open source technology. Update: InfoQ interview: Hypertable Lead Discusses Hadoop and Distributed Databases. Hypertable differs from HBase in that it is a higher performance implementation of Bigtable. Skrentablog gives the heads up on Hypertable, Zvents' open-source BigTable clone. It's written in C++ and can run on top of either HDFS or KFS. Performance looks encouraging: 28M rows of data inserted at a per-node write rate of 7 MB/sec.
3 0.77125609 601 high scalability-2009-05-17-Product: Hadoop
4 0.73278224 414 high scalability-2008-10-15-Hadoop - A Primer
Introduction: Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce to process vast amounts of data. "Hadoop is a free Java software framework that supports data intensive distributed applications running on large clusters of commodity computers. It enables applications to easily scale out to thousands of nodes and petabytes of data" (Wikipedia). What platform does Hadoop run on? Java 1.5.x or higher, preferably from Sun; Linux; Windows for development; Solaris.
5 0.71126407 795 high scalability-2010-03-16-1 Billion Reasons Why Adobe Chose HBase
6 0.69300318 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
7 0.69098383 669 high scalability-2009-08-03-Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2
8 0.69035017 968 high scalability-2011-01-04-Map-Reduce With Ruby Using Hadoop
9 0.67473835 254 high scalability-2008-02-19-Hadoop Getting Closer to 1.0 Release
10 0.66717213 415 high scalability-2008-10-15-Need help with your Hadoop deployment? This company may help!
11 0.66574693 627 high scalability-2009-06-11-Yahoo! Distribution of Hadoop
12 0.66437805 1512 high scalability-2013-09-05-Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
13 0.65567482 1313 high scalability-2012-08-28-Making Hadoop Run Faster
14 0.65421009 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
15 0.63730276 483 high scalability-2009-01-04-Paper: MapReduce: Simplified Data Processing on Large Clusters
16 0.62644339 1445 high scalability-2013-04-24-Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster
17 0.62375689 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012
18 0.61567795 666 high scalability-2009-07-30-Learn How to Think at Scale
19 0.61461675 211 high scalability-2008-01-13-Google Reveals New MapReduce Stats
20 0.61413592 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud
simIndex simValue blogId blogTitle
same-blog 1 0.95172888 650 high scalability-2009-07-02-Product: Hbase
2 0.91782618 408 high scalability-2008-10-10-Useful Corporate Blogs that Talk About Scalability
Introduction: Some intrepid company blogs are posting their technical challenges and how they solve them. I wish more would open up and talk about what they are doing as it helps everyone move forward. Here are a few blogs documenting their encounters with the bleeding edge: Flickr Digg LinkedIn Facebook Amazon Web Services blog Twitter blog Reddit blog Photobucket blog Second Life blog PlentyofFish blog Joyent's Blog Any others that should be added?
3 0.90329522 784 high scalability-2010-02-25-Paper: High Performance Scalable Data Stores
Introduction: The world of scalable databases is not a simple one. They come in every race, creed, and color. Rick Cattell has brought some harmony to that world by publishing High Performance Scalable Data Stores, a nicely detailed one-stop-shop paper comparing scalable databases solely on the content of their character. Ironically, the first step in that evaluation is dividing the world into four groups: Key-value stores: Redis, Scalaris, Voldemort, and Riak. Document stores: CouchDB, MongoDB, and SimpleDB. Record stores: BigTable, HBase, Hypertable, and Cassandra. Scalable RDBMSs: MySQL Cluster, ScaleDB, Drizzle, and VoltDB. The paper describes each system and then compares them on the dimensions of concurrency control, data storage, replication, transaction model, general comments, maturity, K-hits, license, and language. And the winner is: there are no winners. Yet. Rick concludes by pointing to a great convergence: I believe that a few of these systems will gain critical mass an
4 0.89509767 1277 high scalability-2012-07-05-10 Golden Principles For Building Successful Mobile-Web Applications
Introduction: Wildly popular VC blogger Fred Wilson defines in an excellent 27 minute video the ten most important criteria he uses when deciding to give the gold, that is, fund a web application. Note, this video is from 2010 , so no doubt the ideas are still valid, but the importance of mobile vs web apps has probably shifted to mobile, as Mr. Wilson says in a recent post: mobile is growing like a weed . Speed - speed is more than a feature, it's a requirement. Mainstream users are unforgiving. If something is slow they won't use it. Pingdom is used to track speed across their portfolio. A trend they've noticed is that as an application slows down they don't grow as quickly. Instant Utility - a service must be instantly useful to users. Lengthy setup and configuration is a killer. Tricks like crawling the web to populate information you expect to get from your users later makes the service initially useful. YouTube won, for example, with instant availability of uploaded video.
5 0.88645059 1169 high scalability-2012-01-05-Shutterfly Saw a Speedup of 500% With Flashcache
Introduction: In the "should I or shouldn't I" debate around deploying SSD, it always helps to have real-world data. Fiesta! posted a live-blog summary of a presentation by Kenny Gorman on MongoDB Performance Tuning at Shutterfly. What if you still need more performance after doing all of this tuning? One option is to use SSDs. Shutterfly uses Facebook’s flashcache: a kernel module to cache data on SSD, designed for MySQL/InnoDB. The SSD sits in front of a disk but is exposed as a single mount point. This only makes sense when you have lots of physical I/O. Shutterfly saw a speedup of 500% with flashcache. A benefit is that you can delay sharding: less complexity. The whole series of posts has a lot of great information and is worth a longer look, especially if you are considering using MongoDB. Related Articles: Slides for MongoSF 2011: MongoDB Performance Tuning; SSD+HDD sharding setup for large and permanently growing collections; Implementing MongoDB at Shutterfly by Kenny
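The flashcache idea, a small fast device absorbing I/O in front of a large slow one, can be illustrated with a read-through LRU cache. This is a conceptual Python sketch, not the flashcache kernel module; the backing-store dict stands in for the disk and the capacity for the SSD size.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Tiny fast tier in front of a slow backing store, in the spirit of
    flashcache (SSD in front of disk). Purely illustrative; names invented."""
    def __init__(self, backing_store, capacity):
        self.backing = backing_store      # the "disk"
        self.capacity = capacity          # the "SSD" size, in entries
        self.cache = OrderedDict()        # LRU order: oldest first
        self.hits = self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing[key]         # slow path: go to "disk"
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

disk = {f"block{i}": i for i in range(1000)}
tier = ReadThroughCache(disk, capacity=10)
tier.read("block1"); tier.read("block1")  # first read misses, second hits
print(tier.hits, tier.misses)  # 1 1
```

The win flashcache reports comes from exactly this effect at the block-device level: hot blocks are served from the fast tier, so only the cold tail pays disk latency.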
6 0.88483834 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
7 0.88318086 680 high scalability-2009-08-13-Reconnoiter - Large-Scale Trending and Fault-Detection
8 0.88041627 1403 high scalability-2013-02-08-Stuff The Internet Says On Scalability For February 8, 2013
9 0.88024437 1100 high scalability-2011-08-18-Paper: The Akamai Network - 61,000 servers, 1,000 networks, 70 countries
10 0.86930943 1420 high scalability-2013-03-08-Stuff The Internet Says On Scalability For March 8, 2013
11 0.86861092 448 high scalability-2008-11-22-Google Architecture
12 0.86766112 1494 high scalability-2013-07-19-Stuff The Internet Says On Scalability For July 19, 2013
13 0.86458099 107 high scalability-2007-10-02-Some Real Financial Numbers for Your Startup
14 0.86212331 786 high scalability-2010-03-02-Using the Ambient Cloud as an Application Runtime
15 0.85861421 323 high scalability-2008-05-19-Twitter as a scalability case study
16 0.85830826 372 high scalability-2008-08-27-Updating distributed web applications
17 0.85285598 1485 high scalability-2013-07-01-PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute
18 0.8520261 581 high scalability-2009-04-26-Map-Reduce for Machine Learning on Multicore
19 0.85194349 380 high scalability-2008-09-05-Product: Tungsten Replicator
20 0.84597552 1181 high scalability-2012-01-25-Google Goes MoreSQL with Tenzing - SQL Over MapReduce