high_scalability high_scalability-2011 high_scalability-2011-968 knowledge-graph by maker-knowledge-mining
Source: html
Map-Reduce With Ruby Using Hadoop (2011-01-04)

Introduction: A demonstration, with repeatable steps, of how to quickly fire up a Hadoop cluster on Amazon EC2, load data onto the HDFS (Hadoop Distributed File-System), write map-reduce scripts in Ruby and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.

Fire-Up Your Hadoop Cluster

I chose the Cloudera distribution of Hadoop, which is still 100% Apache licensed but has some additional benefits. One of these benefits is that it is released by Doug Cutting, who started Hadoop and drove its development at Yahoo! He also started Lucene, which is another of my favourite Apache projects, so I have good faith that he knows what he is doing. Another benefit, as you will see, is that it is simple to fire up a Hadoop cluster.
I am going to use Cloudera's Whirr script, which will allow me to fire up a production-ready Hadoop cluster on Amazon EC2 directly from my laptop. If you are not familiar with Maven, you can install it via Homebrew on Mac OS X using the brew commands below.

    sudo brew update
    sudo brew install maven

Once the dependencies are installed we can build the whirr tool.

    mvn clean install
    mvn package -Ppackage

In true Maven style, it will download a long list of dependencies the first time you build this. Let's sanity check the whirr script...

    bin/whirr version

You should see a version string like "Apache Whirr 0.…" reported back.
After a long dump of EC2 instance metadata, the launch completes with output ending in something like:

    Completed launch of myhadoopcluster
    Web UI available at http://ec2-72-44-45-199.…

Whirr also generates a proxy .sh script for the cluster; running it starts a SOCKS proxy and prints:

    Running proxy to Hadoop cluster at ec2-72-44-45-199.…

The above output includes the hostname at which you can access the cluster. Use this hostname to view the cluster in your web browser. On Amazon EC2 this new hostname will be the internal hostname of the data-node server, which is visible because you are tunnelling through the SOCKS proxy.
You can interact with Hadoop and HDFS with the hadoop command. We do not have Hadoop installed on our local machine, so we can either log into one of the Hadoop cluster machines and run the hadoop command from there, or install Hadoop on our local machine. With a local install in place, reload your shell profile and check the setup:

    which hadoop      # should output "/usr/local/hadoop/bin/hadoop"
    hadoop version    # should report the Hadoop version, e.g. "Hadoop 0.…"

Copy the cluster's generated configuration .xml into /usr/local/hadoop/conf/, then run your first command from your local machine to interact with HDFS.

    hadoop fs -ls /

You should see output listing the root of the Hadoop filesystem.
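Since the map-reduce scripts later in the post are written in Ruby, you may also want to drive these HDFS interactions from Ruby. The helper below is a minimal sketch, not code from the original post; the hdfs_ls name and the assumption that the hadoop binary is on your PATH are mine.

    #!/usr/bin/env ruby
    # hdfs_ls.rb -- illustrative sketch: shell out to the locally installed
    # `hadoop` CLI and return the HDFS listing as an array of lines.
    def hdfs_ls(path = "/")
      output = `hadoop fs -ls #{path}`
      raise "hadoop fs -ls #{path} failed" unless $?.success?
      output.lines.map(&:chomp)
    end

    puts hdfs_ls("/")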
Uploading Your Data To HDFS (Hadoop Distributed FileSystem)

    hadoop fs -mkdir input
    hadoop fs -put /usr/share/dict/words input/
    hadoop fs -ls input

You should see output similar to the following, which lists the words file on the remote HDFS. Since my local user is "phil", Hadoop has added the file under /user/phil on HDFS.

    Found 1 items
    -rw-r--r--   3 phil supergroup    2486813 2010-12-30 18:43 /user/phil/input/words

Congratulations! You have just uploaded your first file to the Hadoop Distributed File-System on your cluster in the cloud.
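This excerpt stops before the Ruby map-reduce scripts themselves, so here is a minimal sketch of what a Hadoop Streaming mapper and reducer in Ruby look like. This is not the original post's code: the file names mapper.rb and reducer.rb and the choice of counting words by length are assumptions for illustration. A streaming mapper reads raw lines on STDIN and emits tab-separated key/value pairs on STDOUT:

    #!/usr/bin/env ruby
    # mapper.rb -- illustrative sketch, not the original post's script.
    # Emit "<word length><TAB>1" for every word read from STDIN.
    STDIN.each_line do |line|
      word = line.strip
      next if word.empty?
      puts "#{word.length}\t1"
    end

Hadoop Streaming sorts the mapper output by key before handing it to the reducer, so all values for a given key arrive contiguously:

    #!/usr/bin/env ruby
    # reducer.rb -- illustrative sketch, not the original post's script.
    # Sum the counts for each key (word length) read from STDIN.
    current_key = nil
    count = 0
    STDIN.each_line do |line|
      key, value = line.strip.split("\t")
      if key != current_key
        puts "#{current_key}\t#{count}" unless current_key.nil?
        current_key = key
        count = 0
      end
      count += value.to_i
    end
    puts "#{current_key}\t#{count}" unless current_key.nil?

Such scripts are normally submitted with Hadoop's streaming jar, passing the HDFS input and output paths via -input and -output and the two scripts via -mapper, -reducer and -file; the exact jar location depends on the Hadoop/CDH version you installed.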
Similar posts:
Yahoo! Distribution of Hadoop (2009-06-11): Yahoo! publishes the version of Apache Hadoop it tests and deploys across its large clusters -- a source distribution based entirely on code from the Apache Hadoop project, including stability and performance patches that have been contributed back to Apache but may not yet be in an Apache release.
Product: Happy = Hadoop + Python (2008-09-28): Happy is a framework for writing Hadoop map-reduce programs in Python 2.2 using Jython. Jobs sub-class happy.HappyJob, implement map(records, task) and reduce(key, values, task), set the job parameters and call run(); the release at the time supported Hadoop 0.17.2.
Product: Hadoop (2009-05-17): a running set of Hadoop updates -- sorting a petabyte in 16.25 hours and a terabyte in 62 seconds, an introduction to Pig (10 lines of Pig Latin = 200 lines of Java), scaling Hadoop to 4,000 nodes and 30,000 cores at Yahoo!, and Hadoop Summit videos and slides.
Hadoop - A Primer (2008-10-15): Hadoop is a distributed computing platform written in Java that incorporates features similar to the Google File System and MapReduce, scaling data-intensive applications to thousands of nodes and petabytes of data; it runs on Java 1.5.x or higher on Linux, Windows (for development) and Solaris.
Making Hadoop 1000x Faster for Graph Problems (2011-07-27)
Strategy: Consider When a Service Starts Billing in Your Algorithm Cost (2010-07-20)
Making Hadoop Run Faster (2012-08-28)
Product: Hbase (2009-07-02)
How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data (2008-01-30)
Hadoop Getting Closer to 1.0 Release (2008-02-19)
Strategy: Using Lots of RAM Often Cheaper than Using a Hadoop Cluster (2013-04-24)
Running Hadoop MapReduce on Amazon EC2 and Amazon S3 (2007-08-03)
Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2 (2009-08-03)
Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge (2011-03-08)
Stuff The Internet Says On Scalability For June 15, 2012 (2012-06-15)
Peregrine - A Map Reduce Framework for Iterative and Pipelined Jobs (2012-01-12)
Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square (2014-03-24)
Hot Links for 2009-9-17 (2009-09-17)
Paper: Pig Latin: A Not-So-Foreign Language for Data Processing (2008-11-14): Yahoo!'s Pig Latin sits in the sweet spot between high-level declarative querying in the spirit of SQL and low-level procedural map-reduce programming; the accompanying Pig system compiles it into physical plans executed over Hadoop, and Pig has graduated from the Apache Incubator to join Hadoop as a subproject.
Analyzing billions of credit card transactions and serving low-latency insights in the cloud (2013-01-07)
Hypertable is a New BigTable Clone that Runs on HDFS or KFS (2009-07-02)
Tachyon - Fault Tolerant Distributed File System with 300 Times Higher Throughput than HDFS (2013-04-17)
Hot Scalability Links for July 2, 2010 (2010-07-02)
More related posts:

Reverse Proxy (2007-11-18): a reader question -- after NetApp sold NetCache to Blue Coat, a site that cached 83% of its traffic on NetCache found that neither Blue Coat nor F5 WA matched NetCache's performance, and asks whether others have hit the same issue or know a product that can handle as much traffic.
Five Misconceptions on Cloud Portability (2011-11-17): argues that cloud portability is often wrongly equated with cloud API portability; what we really want is application portability between clouds, which is a much simpler requirement, and it is more about business agility than avoiding vendor lock-in -- even startups expecting rapid growth should plan for it.
CTL - Distributed Control Dispatching Framework (2008-10-29): CTL lets you break management processes into reusable control modules and execute them in distributed fashion over the network, leveraging existing scripts and tools to automate distributed systems management and application provisioning without endlessly rewriting node-looping scripts.
10 Core Architecture Pattern Variations for Achieving Scalability (2011-11-07): Srinath Perera's list of architecture patterns built from three meta patterns (distribution, caching and asynchronous processing), including load balancer + shared-nothing units, load balancer + stateless nodes + scalable storage, peer-to-peer architectures (DHT and CAN), distributed queues, publish/subscribe brokers, and gossip/nature-inspired architectures.
Need for change in your IT infrastructure (2009-10-28)
LinkedIn: Creating a Low Latency Change Data Capture System with Databus (2012-03-19)
Vagrant - Build and Deploy Virtualized Development Environments Using Ruby (2010-04-09)
Collectl interface to Ganglia - any interest? (2009-04-03)
Friendster Architecture (2007-07-11)
Moving old to new. Do not be afraid of the re-write -- but take some help (2008-01-17)
When things aren't scalable (2008-01-29)
So, Why is Twitter Really Not Using Cassandra to Store Tweets? (2010-07-11)
6 Lessons from Dropbox - One Million Files Saved Every 15 minutes (2011-03-14)
The Mother of All Database Normalization Debates on Coding Horror (2008-07-16)
Stuff The Internet Says On Scalability For January 11, 2013 (2013-01-11)
Google Megastore - 3 Billion Writes and 20 Billion Read Transactions Daily (2011-01-11)
DbShards Part Deux - The Internals (2010-07-13)
Reddit: Lessons Learned from Mistakes Made Scaling to 1 Billion Pageviews a Month (2013-08-26)
Web 2.0 Killed the Middleware Star (2011-07-26)