high_scalability high_scalability-2010 high_scalability-2010-780 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: If Twitter is the “nervous system of the web” as some people think, then what is the brain that makes sense of all those signals (tweets) from the nervous system? That brain is the Twitter Analytics System and Kevin Weil, as Analytics Lead at Twitter, is the homunculus within in charge of figuring out what those over 100 billion tweets (approximately the number of neurons in the human brain) mean. Twitter has only 10% of the expected 100 billion tweets now, but a good brain always plans ahead. Kevin gave a talk, Hadoop and Protocol Buffers at Twitter, at the Hadoop Meetup, explaining how Twitter plans to use all that data to answer key business questions. What type of questions is Twitter interested in answering? Questions that help them better understand Twitter. Questions like: How many requests do we serve in a day? What is the average latency? How many searches happen in a day? How many unique queries, how many unique users, what is their geographic dist
sentIndex sentText sentNum sentScore
1 If Twitter is the “nervous system of the web” as some people think, then what is the brain that makes sense of all those signals (tweets) from the nervous system? [sent-1, score-0.239]
2 That brain is the Twitter Analytics System and Kevin Weil, as Analytics Lead at Twitter, is the homunculus within in charge of figuring out what those over 100 billion tweets (approximately the number of neurons in the human brain) mean. [sent-2, score-0.52]
3 Twitter has only 10% of the expected 100 billion tweets now, but a good brain always plans ahead. [sent-3, score-0.52]
4 Kevin gave a talk, Hadoop and Protocol Buffers at Twitter, at the Hadoop Meetup, explaining how Twitter plans to use all that data to answer key business questions. [sent-4, score-0.162]
5 What type of questions is Twitter interested in answering? [sent-5, score-0.101]
6 How many unique queries, how many unique users, what is their geographic distribution? [sent-10, score-0.126]
7 The questions help them understand Twitter; their analytics system helps them get the answers faster. [sent-20, score-0.176]
8 Your choice has a lot to do with performance, how much data can be stored, and how agile you can be in reacting to future changes. [sent-34, score-0.223]
9 Each tweet has 12 fields, 3 of which have sub structure, and the fields can and will change over time as new features are added. [sent-35, score-0.118]
10 Protocol Buffer is a way of encoding structured data in an efficient yet extensible format. [sent-38, score-0.247]
11 What is often considered a weakness, Protocol Buffer’s use of an IDL to describe data structures, is actually considered a big win by Twitter. [sent-46, score-0.271]
12 Having to define data structures in an IDL is often seen as a useless waste of time. [sent-47, score-0.166]
13 All the code that once was written by hand for each new data structure is now simply auto-generated from the IDL. [sent-49, score-0.448]
14 This saves a ton of effort and the code is much less buggy. [sent-50, score-0.154]
15 At one point model-driven auto-generation was a common tactic on many projects. [sent-52, score-0.145]
16 Once you hand generate everything you start really worrying about the verbosity of your language, which moved everyone to more dynamic languages, and ironically DSLs were still often listed as an advantage of languages like Ruby. [sent-55, score-0.304]
17 Another consequence of hand coding was framework-of-the-week-itis. [sent-56, score-0.127]
18 It’s good to see code generation coming into fashion again. [sent-58, score-0.157]
19 I like seeing the careful evaluation of different options based on knowing what you want and why. [sent-62, score-0.128]
20 - RECAP Twitter says "Today, we are seeing 50 million tweets per day—that's an average of 600 tweets per second. [sent-66, score-0.572]
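Sentences 9 and 10 above describe tweets whose fields change over time and an encoding that is "efficient yet extensible". As a rough illustration of why that combination works, here is a minimal sketch of a Protocol-Buffers-style wire format restricted to integer (varint) fields. It is not the real protobuf library and not Twitter's schema: the field numbers, the field names `tweet_id` and `user_id`, and all function names are assumptions made up for this example.

```python
# Simplified sketch of a Protocol-Buffers-style wire format (varint fields only).
# Field numbers and names are hypothetical, not Twitter's actual tweet schema.


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)  # high bit clear: last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_message(fields: dict[int, int]) -> bytes:
    """Encode {field_number: int_value} pairs; wire type 0 means varint."""
    out = bytearray()
    for number, value in fields.items():
        out += encode_varint((number << 3) | 0)  # key = field number + wire type
        out += encode_varint(value)
    return bytes(out)


def decode_message(buf: bytes, known: dict[int, str]) -> dict[str, int]:
    """Decode known fields by name and skip unknown field numbers.

    Assumes every field is a varint, which is all this sketch supports.
    """
    decoded = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        value, pos = decode_varint(buf, pos)
        name = known.get(key >> 3)
        if name is not None:
            decoded[name] = value
    return decoded


# A newer writer adds field 3; an older reader that only knows fields 1 and 2
# can still decode the message, which is the extensibility property.
new_payload = encode_message({1: 12345, 2: 600, 3: 7})
old_schema = {1: "tweet_id", 2: "user_id"}
print(decode_message(new_payload, old_schema))
```

The whole three-field message takes 8 bytes here because small integers occupy only as many bytes as they need, and the older reader prints `{'tweet_id': 12345, 'user_id': 600}` while ignoring the field it does not recognize. In real Protocol Buffers the encoder and decoder would be generated from the IDL rather than written by hand, which is the point sentences 11 to 14 make.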
wordName wordTfidf (topN-words)
[('protocol', 0.327), ('buffers', 0.308), ('idl', 0.307), ('tweets', 0.255), ('twitter', 0.242), ('pig', 0.202), ('hadoop', 0.143), ('brain', 0.136), ('retweets', 0.128), ('hand', 0.127), ('fields', 0.118), ('nervous', 0.103), ('questions', 0.101), ('data', 0.091), ('rpc', 0.091), ('considered', 0.09), ('encoding', 0.088), ('fashion', 0.084), ('kevin', 0.084), ('auto', 0.082), ('saves', 0.081), ('structure', 0.075), ('analytics', 0.075), ('buffer', 0.075), ('code', 0.073), ('agile', 0.072), ('json', 0.072), ('homunculus', 0.071), ('plans', 0.071), ('efficient', 0.068), ('blunt', 0.067), ('csv', 0.067), ('sacred', 0.067), ('weil', 0.067), ('evaluation', 0.066), ('dsls', 0.064), ('bson', 0.064), ('many', 0.063), ('seeing', 0.062), ('generate', 0.062), ('parses', 0.061), ('hooked', 0.061), ('feb', 0.061), ('structures', 0.061), ('reacting', 0.06), ('fruit', 0.06), ('refreshing', 0.058), ('billion', 0.058), ('moved', 0.058), ('languages', 0.057)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 780 high scalability-2010-02-19-Twitter’s Plan to Analyze 100 Billion Tweets
2 0.22159991 837 high scalability-2010-06-07-Six Ways Twitter May Reach its Big Hairy Audacious Goal of One Billion Users
Introduction: Twitter has a big hairy audacious goal of reaching one billion users by 2013. Three forces stand against Twitter. The world will end in 2012. But let's be optimistic and assume we'll make it. Next is Facebook. Currently Facebook is the user leader with over 400 million users. Will Facebook stumble or will they rocket to one billion users before Twitter? And lastly, there's Twitter's "low" starting point and "slow" growth rate. Twitter currently has 106 million registered users and adds about 300,000 new users a day. That doesn't add up to a billion in three years. Twitter needs to triple the number of registered users they add per day. How will Twitter reach its goal of over one billion users served? From recent infrastructure announcements and information gleaned at Chirp (videos) and other talks, it has become a little clearer how they hope to reach their billion user goal: 1) Make a Big Hairy Audacious Goal 2) Hire Lots of Quality People 3) Hug Developers and Users 4) D
3 0.19520463 443 high scalability-2008-11-14-Paper: Pig Latin: A Not-So-Foreign Language for Data Processing
Introduction: Yahoo has developed a new language called Pig Latin that fits in a sweet spot between high-level declarative querying in the spirit of SQL, and low-level, procedural programming à la map-reduce, and combines the best of both worlds. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. Pig has just graduated from the Apache Incubator and joined Hadoop as a subproject. The paper has a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. References: Apache Pig Wiki
Introduction: Toy solutions solving Twitter’s “problems” are a favorite scalability trope. Everybody has this idea that Twitter is easy. With a little architectural hand waving we have a scalable Twitter, just that simple. Well, it’s not that simple as Raffi Krikorian, VP of Engineering at Twitter, describes in his superb and very detailed presentation on Timelines at Scale. If you want to know how Twitter works - then start here. It happened gradually so you may have missed it, but Twitter has grown up. It started as a struggling three-tierish Ruby on Rails website to become a beautifully service driven core that we actually go to now to see if other services are down. Quite a change. Twitter now has 150M worldwide active users, handles 300K QPS to generate timelines, and a firehose that churns out 22 MB/sec. 400 million tweets a day flow through the system and it can take up to 5 minutes for a tweet to flow from Lady Gaga’s fingers to her 31 million followers. A couple o
5 0.18331662 601 high scalability-2009-05-17-Product: Hadoop
Introduction: Update 5: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds and has its green cred questioned because it took 40 times the number of machines Greenplum used to do the same work. Update 4: Introduction to Pig. Pig allows you to skip programming Hadoop at the low map-reduce level. You don't have to know Java. Using the Pig Latin language, which is a scripting data flow language, you can think about your problem as a data flow program. 10 lines of Pig Latin = 200 lines of Java. Update 3: Scaling Hadoop to 4000 nodes at Yahoo!. 30,000 cores with nearly 16PB of raw disk; sorted 6TB of data completed in 37 minutes; 14,000 map tasks writes (reads) 360 MB (about 3 blocks) of data into a single file with a total of 5.04 TB for the whole job. Update 2: Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides. Topics include: Pig, JAQL, Hbase, Hive, Data-Intensive Scalable Computing, Clouds and ManyCore: The Revolution, Simplicity and Complexity
6 0.17204854 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?
7 0.17192641 1004 high scalability-2011-03-14-Twitter by the Numbers - 460,000 New Accounts and 140 Million Tweets Per Day
8 0.17069232 234 high scalability-2008-01-30-The AOL XMPP scalability challenge
9 0.16744594 1159 high scalability-2011-12-19-How Twitter Stores 250 Million Tweets a Day Using MySQL
10 0.16670495 639 high scalability-2009-06-27-Scaling Twitter: Making Twitter 10000 Percent Faster
11 0.15521285 1491 high scalability-2013-07-15-Ask HS: What's Wrong with Twitter, Why Isn't One Machine Enough?
12 0.15402615 1551 high scalability-2013-11-20-How Twitter Improved JVM Performance by Reducing GC and Faster Memory Allocation
13 0.14334399 1148 high scalability-2011-11-29-DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second
14 0.14104593 1251 high scalability-2012-05-24-Build your own twitter like real time analytics - a step by step guide
16 0.1314415 568 high scalability-2009-04-14-Designing a Scalable Twitter
17 0.1311318 855 high scalability-2010-07-11-So, Why is Twitter Really Not Using Cassandra to Store Tweets?
18 0.12834889 960 high scalability-2010-12-20-Netflix: Use Less Chatty Protocols in the Cloud - Plus 26 Fixes
19 0.12296713 651 high scalability-2009-07-02-Product: Project Voldemort - A Distributed Database
20 0.12231576 851 high scalability-2010-07-02-Hot Scalability Links for July 2, 2010
topicId topicWeight
[(0, 0.202), (1, 0.113), (2, -0.0), (3, 0.027), (4, 0.057), (5, 0.038), (6, -0.025), (7, 0.079), (8, 0.089), (9, 0.039), (10, 0.062), (11, 0.086), (12, 0.054), (13, -0.035), (14, 0.027), (15, -0.034), (16, 0.016), (17, -0.01), (18, -0.075), (19, 0.001), (20, -0.032), (21, -0.007), (22, 0.068), (23, 0.06), (24, -0.021), (25, -0.008), (26, 0.067), (27, -0.045), (28, -0.003), (29, 0.067), (30, 0.067), (31, -0.007), (32, -0.125), (33, 0.015), (34, -0.139), (35, 0.001), (36, -0.033), (37, 0.132), (38, -0.136), (39, -0.038), (40, 0.068), (41, 0.04), (42, -0.02), (43, -0.001), (44, 0.006), (45, -0.005), (46, -0.024), (47, 0.008), (48, -0.027), (49, 0.058)]
simIndex simValue blogId blogTitle
same-blog 1 0.94366115 780 high scalability-2010-02-19-Twitter’s Plan to Analyze 100 Billion Tweets
2 0.79514474 1251 high scalability-2012-05-24-Build your own twitter like real time analytics - a step by step guide
Introduction: Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution. In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation: Use an In Memory Data Grid (XAP) for handling the real time stream data-processing. Use a BigData database (Cassandra) for storing the historical data and managing the trend analytics. Use Cloudify (cloudifysource.org) for managing and automating the deployment on private or public cloud. The example demonstrates a simple case of word count analytics. It uses Spring Social to plug in to real Twitter feeds. The solution is designed to efficiently cope with getting and processing the large volume of tweets. First, we partition the twee
3 0.79342753 1491 high scalability-2013-07-15-Ask HS: What's Wrong with Twitter, Why Isn't One Machine Enough?
Introduction: Can anyone convincingly explain why properties sporting traffic statistics that may seem in line with the capabilities of a single big-iron machine need so many machines in their architecture? This is a common reaction to architecture profiles on High Scalability: I could do all that on a few machines so they must be doing something really stupid. Lo and behold this same reaction also occurred to the article The Architecture Twitter Uses to Deal with 150M Active Users. On Hacker News papsosouid voiced what a lot of people may have been thinking: I really question the current trend of creating big, complex, fragile architectures to "be able to scale". These numbers are a great example of why, the entire thing could run on a single server, in a very straight forward setup. When you are creating a cluster for scalability, and it has less CPU, RAM and IO than a single server, what are you gaining? They are only doing 6k writes a second for crying out loud. This is a s
4 0.7912817 1159 high scalability-2011-12-19-How Twitter Stores 250 Million Tweets a Day Using MySQL
Introduction: Jeremy Cole, a DBA Team Lead/Database Architect at Twitter, gave a really good talk at the O'Reilly MySQL conference: Big and Small Data at @Twitter, where the topic was thinking of Twitter from the data perspective. One of the interesting stories he told was of the transition from Twitter's old way of storing tweets using temporal sharding, to a more distributed approach using a new tweet store called T-bird, which is built on top of Gizzard, which is built using MySQL. Twitter's original tweet store: Temporally sharded tweets was a good-idea-at-the-time architecture. Temporal sharding simply means tweets from the same date range are stored together on the same shard. The problem is tweets filled up one machine, then a second, and then a third. You end up filling up one machine after another. This is a pretty common approach and one that has some real flaws: Load balancing. Most of the old machines didn't get any traffic because people
5 0.78715181 837 high scalability-2010-06-07-Six Ways Twitter May Reach its Big Hairy Audacious Goal of One Billion Users
6 0.78021109 855 high scalability-2010-07-11-So, Why is Twitter Really Not Using Cassandra to Store Tweets?
8 0.76347029 1148 high scalability-2011-11-29-DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second
9 0.73307347 1265 high scalability-2012-06-15-Stuff The Internet Says On Scalability For June 15, 2012
10 0.71832138 556 high scalability-2009-04-05-At Some Point the Cost of Servers Outweighs the Cost of Programmers
11 0.70239383 639 high scalability-2009-06-27-Scaling Twitter: Making Twitter 10000 Percent Faster
12 0.68217194 544 high scalability-2009-03-18-QCon London 2009: Upgrading Twitter without service disruptions
13 0.675937 323 high scalability-2008-05-19-Twitter as a scalability case study
14 0.67176938 1375 high scalability-2012-12-21-Stuff The Internet Says On Scalability For December 21, 2012
15 0.66083169 1004 high scalability-2011-03-14-Twitter by the Numbers - 460,000 New Accounts and 140 Million Tweets Per Day
16 0.65050125 166 high scalability-2007-11-27-Solving the Client Side API Scalability Problem with a Little Game Theory
17 0.64624637 116 high scalability-2007-10-08-Lessons from Pownce - The Early Years
18 0.63986009 363 high scalability-2008-08-12-Strategy: Limit The New, Not The Old
20 0.63878387 568 high scalability-2009-04-14-Designing a Scalable Twitter
topicId topicWeight
[(1, 0.075), (2, 0.218), (10, 0.094), (27, 0.011), (30, 0.021), (41, 0.123), (61, 0.135), (79, 0.17), (85, 0.022), (94, 0.05)]
simIndex simValue blogId blogTitle
1 0.95778024 1388 high scalability-2013-01-16-What if Cars Were Rented Like We Hire Programmers?
Introduction: Imagine if you will that car rental agencies rented cars like programmers are hired at many software companies... Agency: So sorry you had to wait in the reception area for an hour. Nobody knew you were coming today. I finally found 8 people to interview before we can rent you a car. If we like you, you may have to come in for another round of interviews tomorrow because our manager isn't in today. I didn't have a chance to read your application, so I'll just start with a question. What car do you drive today? Applicant: I drive a 2008 Subaru. Agency: That's a shame. We don't have a Subaru to rent you. Applicant: That's OK. Any car will do. Agency: No, we can only take on clients who know how to drive the cars we stock. We find it's safer that way. There are so many little differences between cars, we just don't want to take a chance. Applicant: I have a driver's license. I know how to drive. I've been driving all kinds of cars for 15 years, I am sure I can adapt.
2 0.94493097 1007 high scalability-2011-03-18-Stuff The Internet Says On Scalability For March 18, 2011
Introduction: Submitted for your reading pleasure on this day of wind and rain... The cloud is falling. Or at least shared networked storage as Reddit has a couple of long periods of downtime. Good write up at Why reddit was down for 6 of the last 24 hours. Upshot: Reddit is moving off EBS to local disks. Quotable Quotes: @thomleggett: 32 or so joins - the sweet-spot of suck for MySQL - @emileifrem #nosql #neo4j #qconlondon @jkalucki: The [Twitter] Streaming API pushes 100MB in less than a second. @LusciousPear: Finding the true test of a DB is recovery when things go wrong, not "who's most web-scale" on paper #nosql @TimelessP: Functionality, scalability, security... pick two. @beaknit: #ccevent #nosql @adrianco: A year from oracle to simpledb. A week from simpledb to cassandra. Mental shift biggest hurdle. Heather Willems captured this very cool Ogilvy Note, which is a visual representation of a panel talk on Scalability: Covering Your Rear with a Good
3 0.94227803 293 high scalability-2008-03-31-Read HighScalability on Your Mobile Phone Using WidSets Widgets
Introduction: Jean-Paul de Vooght of our Switzerland contingent created a nifty little WidSets widget that lets you better read HighScalability from your mobile phone. I thought untethered readers might like to give it a try. Thanks to Jean-Paul for making it available! WidSets is: a simple service that brings you information normally accessed via the Internet by sending it directly to your mobile phone. Using mini-applications called widgets, it sends you the latest updates to your favorite websites. The system uses RSS feeds to push information from these websites directly to your mobile phone the minute they’re updated.
same-blog 4 0.93598771 780 high scalability-2010-02-19-Twitter’s Plan to Analyze 100 Billion Tweets
5 0.92514819 1454 high scalability-2013-05-08-Typesafe Interview: Scala + Akka is an IaaS for Your Process Architecture
Introduction: This is an email interview with Viktor Klang, Director of Engineering at Typesafe, on the Scala Futures model & Akka, both topics on which he is immensely passionate and knowledgeable. How do you structure your application? That’s the question I explored in the article Beyond Threads And Callbacks. An option I did not talk about, mostly because of my own ignorance, is a powerful stack you may not be all that familiar with: Scala and Akka. To remedy my oversight is our acting tour guide, Typesafe’s Viktor Klang, long time Scala hacker and Java enterprise systems architect. Viktor was very patient in answering my questions and was enthusiastic about sharing his knowledge. He’s a guy who definitely knows what he is talking about. I’ve implemented several Actor systems along with the messaging infrastructure, threading, async IO, service orchestration, failover, etc, so I’m innately skeptical about frameworks that remove control from the programmer at
6 0.90721291 1112 high scalability-2011-09-07-What Google App Engine Price Changes Say About the Future of Web Architecture
9 0.90441954 526 high scalability-2009-03-05-Strategy: In Cloud Computing Systematically Drive Load to the CPU
10 0.90406042 289 high scalability-2008-03-27-Amazon Announces Static IP Addresses and Multiple Datacenter Operation
11 0.90286112 1183 high scalability-2012-01-30-37signals Still Happily Scaling on Moore RAM and SSDs
12 0.90151566 687 high scalability-2009-08-24-How Google Serves Data from Multiple Datacenters
14 0.90074193 1395 high scalability-2013-01-28-DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing
15 0.90011626 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud
16 0.89995098 1242 high scalability-2012-05-09-Cell Architectures
17 0.8993631 1491 high scalability-2013-07-15-Ask HS: What's Wrong with Twitter, Why Isn't One Machine Enough?
18 0.89893621 1649 high scalability-2014-05-16-Stuff The Internet Says On Scalability For May 16th, 2014
19 0.89779109 1153 high scalability-2011-12-08-Update on Scalable Causal Consistency For Wide-Area Storage With COPS
20 0.89754099 306 high scalability-2008-04-21-The Search for the Source of Data - How SimpleDB Differs from a RDBMS