Hadoop - A Primer (High Scalability, 2008-10-15)
Introduction: Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce to process vast amounts of data. "Hadoop is a Free Java software framework that supports data intensive distributed applications running on large clusters of commodity computers. It enables applications to easily scale out to thousands of nodes and petabytes of data" (Wikipedia).

What platform does Hadoop run on?
* Java 1.5.x or higher, preferably from Sun
* Linux
* Windows for development
* Solaris
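To make the map-reduce model concrete, here is a minimal sketch of the canonical word-count job, written against the old org.apache.hadoop.mapred API from the Hadoop 0.x releases that were current when this was posted. It is illustrative rather than taken from the original post; the class names are arbitrary and the input and output directories come from the command line.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map phase: for every input line, emit (word, 1) for each token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce phase: sum the 1s for each word across the whole data set.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);  // local pre-aggregation on each mapper
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // must not exist yet
    JobClient.runJob(conf);  // submit the job and wait for completion
  }
}

Packaged into a jar, it would be run with something like: bin/hadoop jar wordcount.jar WordCount input output.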
Related posts:
* Product: Hadoop (2009-05-17)
Introduction: Update 5: Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds, and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig. Pig lets you skip programming Hadoop at the low map-reduce level; you don't have to know Java. Using the Pig Latin language, a scripting data-flow language, you can think about your problem as a data-flow program: 10 lines of Pig Latin = 200 lines of Java. Update 3: Scaling Hadoop to 4000 nodes at Yahoo!: 30,000 cores with nearly 16 PB of raw disk; sorting 6 TB of data completed in 37 minutes; each of 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file, for a total of 5.04 TB for the whole job. Update 2: Hadoop Summit and Data-Intensive Computing Symposium videos and slides. Topics include Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, and Clouds and ManyCore: The Revolution, Simplicity and Complexity…

* Facebook, Hadoop, and Hive (2009-05-11)
Introduction: Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run applications that process vast amounts of data), Yahoo being the first. Learn how they do it and what the challenges are on the DBMS2 blog, a blog for people who care about database and analytic technologies.

* Deploying MySQL Database in Solaris Cluster Environments (2008-12-01)
Introduction: MySQL™ database, an open source database, delivers high performance and reliability while keeping costs low by eliminating licensing fees. The Solaris™ Cluster product is an integrated hardware and software environment that can be used to create highly-available data services. This article explains how to deploy the MySQL database in a Solaris Cluster environment, covering the advantages of such a deployment, a high-level overview of the Solaris Cluster's hardware and software components, and the installation and configuration procedure. It assumes that readers have a basic understanding of Solaris Cluster and MySQL database installation and administration.

* Product: Happy = Hadoop + Python (2008-09-28)
Introduction: Has a Java-only Hadoop been getting you down? Now you can be Happy. Happy is a framework for writing map-reduce programs for Hadoop using Jython. It files off the sharp edges on Hadoop and makes writing map-reduce programs a breeze. There's really no history yet on Happy, but I'm delighted at the idea of being able to map-reduce in other languages; the more ways the better. From the website: Happy is a framework that allows Hadoop jobs to be written and run in Python 2.2 using Jython. It is an easy way to write map-reduce programs for Hadoop, and includes some new useful features as well. The current release supports Hadoop 0.17.2. Map-reduce jobs in Happy are defined by sub-classing happy.HappyJob and implementing a map(records, task) and reduce(key, values, task) function. Then you create an instance of the class, set the job parameters (such as inputs and outputs) and call run(). When you call run(), Happy serializes your job instance and copies it and all accompanying…

* Yahoo! Distribution of Hadoop (2009-06-11)

* Map-Reduce With Ruby Using Hadoop (2011-01-04)
Introduction: A demonstration, with repeatable steps, of how to quickly fire up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File System), write map-reduce scripts in Ruby, and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java. Fire up your Hadoop cluster: I chose the Cloudera distribution of Hadoop, which is still 100% Apache licensed but has some additional benefits. One of these benefits is that it is released by Doug Cutting, who started Hadoop and drove its development at Yahoo! He also started Lucene, which is another of my favourite Apache projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire up a Hadoop cluster. I am going to use C…

* Web Consolidation on the Sun Fire T1000 using Solaris Containers (2008-12-01)

* Making Hadoop 1000x Faster for Graph Problems (2011-07-27)

* Need help with your Hadoop deployment? This company may help! (2008-10-15)

* Product: Hbase (2009-07-02)

* Making Hadoop Run Faster (2012-08-28)

* Strategy: Consider When a Service Starts Billing in Your Algorithm Cost (2010-07-20)

* Google Architecture (2008-11-22)
Introduction: Update 2: Sorting 1 PB with MapReduce. PB is not peanut-butter-and-jelly misspelled; it's 1 petabyte, or 1000 terabytes, or 1,000,000 gigabytes. It took six hours and two minutes to sort 1 PB (10 trillion 100-byte records) on 4,000 computers, and the results were replicated thrice on 48,000 disks. Update: Greg Linden points to a new Google article, MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory. Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet-scale applications at an alarmingly high, competition-crushing rate. Their goal is always to build…

* Java World Interview on Scalability and Other Java Scalability Secrets (2008-12-03)

* Gigaspaces curbs latency outliers with Java Real Time (2008-12-19)
* Paper: Pig Latin: A Not-So-Foreign Language for Data Processing (2008-11-14)
Introduction: Yahoo has developed a new language called Pig Latin that fits into a sweet spot between high-level declarative querying in the spirit of SQL and low-level, procedural programming à la map-reduce, and combines the best of both worlds. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig has just graduated from the Apache Incubator and joined Hadoop as a subproject. The paper has a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. References: Apache Pig Wiki

* Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2 (2009-08-03)

* Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster (2013-04-24)

* Hadoop Getting Closer to 1.0 Release (2008-02-19)

* MapReduce framework Disco (2008-09-03)

* Stuff The Internet Says On Scalability For July 8, 2011

* Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs (2012-01-12)

* Stuff The Internet Says On Scalability For June 15, 2012

* Paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale (2013-09-05)
Introduction: Ever wonder what powers Google's world-spirit-sensing Zeitgeist service? No, it's not a homunculus of Georg Wilhelm Friedrich Hegel sitting in each browser. It's actually a stream-processing system (think streaming MapReduce on steroids) called MillWheel, described in this very well written paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. MillWheel isn't just used for Zeitgeist at Google; it's also used for streaming joins for a variety of Ads customers, a generalized anomaly-detection service, and network switch and cluster health monitoring. Abstract: MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees. This paper describes MillWheel's programming model as well as it…
* Distributed Computing & Google Infrastructure (2008-08-11)
Introduction: A couple of videos about distributed computing, with direct reference to Google infrastructure. You will get acquainted with: MapReduce, the software framework implemented by Google to support parallel computations over large (greater than 100 terabyte) data sets on commodity hardware; GFS and the way it stores its data in 64 MB chunks; and Bigtable, the simple implementation of a non-relational database at Google. Cluster Computing and MapReduce Lectures 1-5.

* Reconnoiter - Large-Scale Trending and Fault-Detection (2009-08-13)
Introduction: One of the top recommendations from the collective wisdom contained in Real Life Architectures is to add monitoring to your system. Now! Loud is the lament for not adding monitoring early and often. The reason is easy to understand: without monitoring you don't know what your system is doing, which means you can't fix it and you can't improve it. Feedback loops require data. Some popular monitoring options are Munin, Nagios, Cacti and Hyperic. A relatively new entrant is a product called Reconnoiter from Theo Schlossnagle, President and CEO of OmniTI, leading consultants on solving problems of scalability, performance, architecture, infrastructure, and data management. Theo's name might sound familiar: he gives lots of talks and is the author of the very influential Scalable Internet Architectures book. So right away you know Reconnoiter has a good pedigree. As Theo says, their products are born of pain, from the fire of solving real-life problems, and that's always a harbinger of…

* Making the case for PHP at Yahoo! (Oct 2002) (2007-09-08)

* Stuff The Internet Says On Scalability For March 8, 2013

* Stuff The Internet Says On Scalability For February 8, 2013

* Using the Ambient Cloud as an Application Runtime (2010-03-02)

* Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm (2010-08-04)

* Stuff The Internet Says On Scalability For July 19, 2013

* Product: Tungsten Replicator (2008-09-05)

* Paper: High Performance Scalable Data Stores (2010-02-25)

* The VeriScale Architecture - Elasticity and efficiency for private clouds (2009-09-16)

* PRISM: The Amazingly Low Cost of Using BigData to Know More About You in Under a Minute (2013-07-01)

* Map-Reduce for Machine Learning on Multicore (2009-04-26)

* 10 Golden Principles For Building Successful Mobile-Web Applications (2012-07-05)

* Save up to 30% by Selecting Better Performing Amazon Instances (2012-10-18)

* Strategy: Eliminate Unnecessary SQL (2011-02-24)