high_scalability high_scalability-2010 high_scalability-2010-815 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Imagine a single search request coursing through Google's massive infrastructure. A single request can run across thousands of machines and involve hundreds of different subsystems. And oh by the way, you are processing more requests per second than any other system in the world. How do you debug such a system? How do you figure out where the problems are? How do you determine if programmers are coding correctly? How do you keep sensitive data secret and safe? How do you ensure products don't use more resources than they are assigned? How do you store all the data? How do you make use of it? That's where Dapper comes in. Dapper is Google's tracing system, and it was originally created to understand the system behaviour resulting from a search request. Now Google's production clusters generate more than 1 terabyte of sampled trace data per day. So how does Dapper do what Dapper does? Dapper is described in a very well written and intricately detailed paper: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure.
sentIndex sentText sentNum sentScore
1 Now Google's production clusters generate more than 1 terabyte of sampled trace data per day. [sent-13, score-0.753]
2 The full paper is worth a full read and a re-read, but we'll just cover some of the highlights: There are so many operations going on that Google can't trace every request all the time, so in order to reduce overhead they sample one out of thousands of requests (see the sampling sketch after this list). [sent-20, score-0.724]
3 Google found sampling provides sufficient information for many common uses of the tracing data. [sent-21, score-0.706]
4 Tracing is largely transparent to applications because the trace code in common libraries (threading, control flow, RPC) is sufficient to debug most problems. [sent-23, score-0.858]
5 A trace id is allocated to bind all the spans to a particular trace session. [sent-34, score-1.405]
6 The trace id isn't a globally unique sequence number; it's a probabilistically unique 64-bit integer (see the trace id sketch after this list). [sent-35, score-0.844]
7 Each row is a single trace, with each column mapped to a span. [sent-37, score-0.689]
8 The median latency for sending trace data from applications to the central repository is 15 seconds, but often it can take many hours. [sent-38, score-0.732]
9 There's also always the curious question of how you trace the tracing system. [sent-39, score-0.903]
10 An out-of-band trace mechanism is in place for that purpose. [sent-41, score-0.645]
11 For example, Google can look at the trace to pinpoint which applications are not using proper levels of authentication and encryption, and which applications are accessing sensitive data without logging it at an appropriate level, helping ensure services don't see data they shouldn't see. [sent-45, score-0.929]
12 Originally the sampling rate was uniform; they are now moving to an adaptive sampling rate that specifies a desired number of traces per unit of time. [sent-51, score-0.815]
13 If a trace is kept, all spans for the trace are also kept. [sent-58, score-1.407]
14 DAPI is an API on top of the trace data which makes it possible to write trace applications and analysis tools (a rough analysis sketch appears after this list). [sent-62, score-1.377]
15 During development the trace information is used to characterize performance, determine correctness, understand how an application is working, and verify that an application is behaving as expected. [sent-66, score-0.776]
16 Developers have parallel debug logs that are outside of the trace system. [sent-67, score-0.746]
17 Using the trace data it's possible to generate chargeback data based on actual usage. [sent-73, score-0.747]
18 Using the system view provided by the trace data they were able to identify unintended service interactions and fix them. [sent-76, score-0.802]
19 Dapper has a few downsides too: Work that is batched together for efficiency is not correctly mapped to the trace ids inside the batch. [sent-78, score-0.73]
20 A trace id might get blamed for work it isn't actually doing. [sent-79, score-0.685]
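To make the trace id and span points above concrete, here is a minimal sketch, assuming Python and made-up names (the Span class and its trace_id, span_id, and parent_id fields are illustrative assumptions, not Dapper's actual types), of how a probabilistically unique 64-bit trace id could bind spans into a trace tree, and how a repository could keep one row per trace with one column per span:

```python
import random
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One unit of work in a trace; field names are illustrative, not Dapper's."""
    trace_id: int             # shared by every span in the same trace
    span_id: int              # identifies this span
    parent_id: Optional[int]  # forms the trace tree; None for the root span
    name: str
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

def new_trace_id() -> int:
    # Probabilistically unique: a random 64-bit integer, not a global sequence number.
    return random.getrandbits(64)

def new_span(trace_id: int, name: str, parent: Optional[Span] = None) -> Span:
    return Span(trace_id=trace_id,
                span_id=random.getrandbits(64),
                parent_id=parent.span_id if parent else None,
                name=name)

# Repository layout sketch: one row per trace id, one column per span id.
repository: dict = {}

def store(span: Span) -> None:
    repository.setdefault(span.trace_id, {})[span.span_id] = span

# Usage: a root span for an incoming request, plus a child span for an RPC it makes.
tid = new_trace_id()
root = new_span(tid, "frontend.Search")
child = new_span(tid, "backend.Lookup", parent=root)
for s in (root, child):
    s.end = time.time()
    store(s)
```

The point of the probabilistic scheme is that no global counter has to be coordinated across thousands of machines; a random 64-bit id is cheap to mint anywhere and is unique enough in practice.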
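The sampling described in the highlights (uniform one-out-of-thousands originally, moving toward an adaptive rate that targets a desired number of traces per unit of time, with the keep/drop decision made once per trace so a kept trace keeps all of its spans) could look roughly like the sketch below. The rates, the one-second window, and the controller logic are assumptions for illustration, not Dapper's actual parameters.

```python
import random
import time

class UniformSampler:
    """Keep roughly 1 out of every `rate` traces; decided once at the trace root."""
    def __init__(self, rate: int = 1024):
        self.rate = rate

    def should_sample(self) -> bool:
        return random.randrange(self.rate) == 0

class AdaptiveSampler:
    """Adjust the keep probability to target a desired number of traces per second."""
    def __init__(self, target_per_sec: float = 10.0):
        self.target = target_per_sec
        self.prob = 1.0
        self.kept = 0
        self.window_start = time.time()

    def should_sample(self) -> bool:
        now = time.time()
        elapsed = now - self.window_start
        if elapsed >= 1.0:
            actual = self.kept / elapsed
            if actual > 0:
                # Nudge the probability toward the target rate, clamped to at most 1.
                self.prob = min(1.0, self.prob * self.target / actual)
            self.kept = 0
            self.window_start = now
        keep = random.random() < self.prob
        if keep:
            self.kept += 1
        return keep

# The decision is made once, at the root span, and propagated along with the
# trace id, so that all spans of a kept trace are kept together.
sampler = UniformSampler(rate=1024)
sampled = sampler.should_sample()
```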
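And as a rough idea of the kind of tool the DAPI and chargeback points above describe, the sketch below (the function name, the service-from-span-name convention, and the scaling step are assumptions, not DAPI's real interface) walks the trace repository and estimates per-service usage from the sampled spans, scaled back up by the sampling rate:

```python
from collections import defaultdict

def chargeback(repository: dict, sampling_rate: int) -> dict:
    """Estimate per-service usage (in span-seconds) from sampled traces.

    Scaling by the sampling rate turns the sampled total into a rough
    estimate of actual usage, since only about 1/sampling_rate of traces are kept.
    """
    usage = defaultdict(float)
    for spans in repository.values():              # one row per trace
        for span in spans.values():                # one column per span
            if span.end is not None:
                service = span.name.split(".")[0]  # e.g. "frontend", "backend"
                usage[service] += (span.end - span.start) * sampling_rate
    return dict(usage)

# Usage, with the repository and Span type from the trace id sketch above:
# estimated = chargeback(repository, sampling_rate=1024)
```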
wordName wordTfidf (topN-words)
[('trace', 0.645), ('dapper', 0.37), ('sampling', 0.284), ('tracing', 0.258), ('traces', 0.127), ('debug', 0.101), ('google', 0.098), ('tree', 0.087), ('sample', 0.079), ('spans', 0.075), ('logging', 0.07), ('originating', 0.067), ('payloads', 0.067), ('probabilistically', 0.067), ('system', 0.062), ('binaries', 0.062), ('depths', 0.062), ('rate', 0.06), ('barroso', 0.057), ('sampled', 0.057), ('andre', 0.056), ('workloads', 0.054), ('establish', 0.052), ('data', 0.051), ('indicates', 0.049), ('determine', 0.047), ('unique', 0.046), ('identify', 0.044), ('mapped', 0.044), ('detailed', 0.043), ('threading', 0.042), ('kept', 0.042), ('correctly', 0.041), ('originally', 0.04), ('level', 0.04), ('mapreduce', 0.04), ('id', 0.04), ('sufficient', 0.039), ('bigtable', 0.037), ('common', 0.037), ('information', 0.037), ('applications', 0.036), ('annotation', 0.036), ('patter', 0.036), ('undergo', 0.036), ('coursing', 0.036), ('drilling', 0.036), ('saul', 0.036), ('latencies', 0.035), ('developers', 0.035)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 815 high scalability-2010-04-27-Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure
2 0.21996391 77 high scalability-2007-08-30-Log Everything All the Time
Introduction: This JoelOnSoftware thread asks the age old question of what and how to log. The usual trace/error/warning/info advice is totally useless in a large scale distributed system. Instead, you need to log everything all the time so you can solve problems that have already happened across a potentially huge range of servers. Yes, it can be done. To see why the typical logging approach is broken, imagine this scenario: Your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging deathknell: it's an intermittent problem and customers are pissed. Fix it. Now. So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think. Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server,
3 0.16531959 484 high scalability-2009-01-05-Lessons Learned at 208K: Towards Debugging Millions of Cores
Introduction: How do we debug and profile a cloud full of processors and threads? It's a problem more will be seeing as we code big scary programs that run on even bigger scarier clouds. Logging gets you far, but sometimes finding the root cause of a problem requires delving deep into a program's execution. I don't know about you, but setting up 200,000+ gdb instances doesn't sound all that appealing. Tools like STAT (Stack Trace Analysis Tool) are being developed to help with this huge task. STAT "gathers and merges stack traces from a parallel application’s processes." So STAT isn't a low-level debugger, but it will help you find the needle in a million haystacks. Abstract: Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large paralle
4 0.16345103 311 high scalability-2008-04-29-Strategy: Sample to Reduce Data Set
Introduction: Update: Arjen links to video Supporting Scalable Online Statistical Processing which shows "rather than doing complete aggregates, use statistical sampling to provide a reasonable estimate (unbiased guess) of the result." When you have a lot of data, sampling allows you to draw conclusions from a much smaller amount of data. That's why sampling is a scalability solution. If you don't have to process all your data to get the information you need then you've made the problem smaller and you'll need fewer resources and you'll get more timely results. Sampling is not useful when you need a complete list that matches specific criteria. If you need to know the exact set of people who bought a car in the last week then sampling won't help. But, if you want to know how many people bought a car then you could take a sample and then create an estimate of the full data-set. The difference is you won't really know the exact car count. You'll have a confidence interval saying how confident
5 0.12232403 237 high scalability-2008-02-03-Product: Collectl - Performance Data Collector
Introduction: From their website: There are a number of times in which you find yourself needing performance data. These can include benchmarking, monitoring a system's general health or trying to determine what your system was doing at some time in the past. Sometimes you just want to know what the system is doing right now. Depending on what you're doing, you often end up using different tools, each designed for that specific situation. Features include: You are able to run with non-integral sampling intervals. Collectl uses very little CPU. In fact it has been measured to use <0.1% when run as a daemon using the default sampling interval of 60 seconds for process and slab data and 10 seconds for everything else. Brief, verbose, and plot formats are supported. You can report aggregated performance numbers on many devices such as CPUs, Disks, interconnects such as Infiniband or Quadrics, Networks or even Lustre file systems. Collectl will align its sampling on integral sec
6 0.12042525 934 high scalability-2010-11-04-Facebook at 13 Million Queries Per Second Recommends: Minimize Request Variance
7 0.11763748 1207 high scalability-2012-03-12-Google: Taming the Long Latency Tail - When More Machines Equals Worse Results
8 0.098600782 1107 high scalability-2011-08-29-The Three Ages of Google - Batch, Warehouse, Instant
9 0.088412441 448 high scalability-2008-11-22-Google Architecture
10 0.08805187 750 high scalability-2009-12-16-Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud
12 0.086625718 1032 high scalability-2011-05-02-Stack Overflow Makes Slow Pages 100x Faster by Simple SQL Tuning
13 0.08617413 920 high scalability-2010-10-15-Troubles with Sharding - What can we learn from the Foursquare Incident?
14 0.084804066 517 high scalability-2009-02-21-Google AppEngine - A Second Look
15 0.081978351 1468 high scalability-2013-05-31-Stuff The Internet Says On Scalability For May 31, 2013
16 0.08088956 589 high scalability-2009-05-05-Drop ACID and Think About Data
17 0.079446018 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
18 0.07878238 881 high scalability-2010-08-16-Scaling an AWS infrastructure - Tools and Patterns
19 0.078723654 538 high scalability-2009-03-16-Are Cloud Based Memory Architectures the Next Big Thing?
20 0.078098767 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?
topicId topicWeight
[(0, 0.154), (1, 0.073), (2, -0.003), (3, 0.019), (4, 0.005), (5, 0.03), (6, 0.058), (7, 0.065), (8, -0.02), (9, 0.005), (10, 0.015), (11, 0.01), (12, 0.007), (13, -0.031), (14, 0.039), (15, -0.011), (16, -0.036), (17, -0.037), (18, 0.042), (19, -0.007), (20, 0.067), (21, -0.016), (22, -0.001), (23, 0.006), (24, 0.03), (25, 0.015), (26, -0.04), (27, 0.033), (28, -0.02), (29, -0.004), (30, -0.002), (31, -0.044), (32, 0.03), (33, 0.019), (34, -0.013), (35, 0.027), (36, 0.015), (37, -0.013), (38, 0.001), (39, 0.006), (40, -0.007), (41, 0.02), (42, 0.028), (43, -0.021), (44, -0.013), (45, -0.032), (46, -0.018), (47, -0.04), (48, 0.022), (49, 0.017)]
simIndex simValue blogId blogTitle
same-blog 1 0.94815183 815 high scalability-2010-04-27-Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure
2 0.78415877 946 high scalability-2010-11-22-Strategy: Google Sends Canary Requests into the Data Mine
Introduction: Google runs queries against thousands of in-memory index nodes in parallel and then merges the results. One of the interesting problems with this approach, explains Google's Jeff Dean in this lecture at Stanford , is the Query of Death . A query can cause a program to fail because of bugs or various other issues. This means that a single query can take down an entire cluster of machines, which is not good for availability and response times, as it takes quite a while for thousands of machines to recover. Thus the Query of Death. New queries are always coming into the system and when you are always rolling out new software, it's impossible to completely get rid of the problem. Two solutions: Test against logs . Google replays a month's worth of logs to see if any of those queries kill anything. That helps, but Queries of Death may still happen. Send a canary request . A request is sent to one machine. If the request succeeds then it will probably succeed on all machines, s
Introduction: In Taming The Long Latency Tail we covered Luiz Barroso ’s exploration of the long tail latency (some operations are really slow) problems generated by large fanout architectures (a request is composed of potentially thousands of other requests). You may have noticed there weren’t a lot of solutions. That’s where a talk I attended, Achieving Rapid Response Times in Large Online Services ( slide deck ), by Jeff Dean , also of Google, comes in: In this talk, I’ll describe a collection of techniques and practices lowering response times in large distributed systems whose components run on shared clusters of machines, where pieces of these systems are subject to interference by other tasks, and where unpredictable latency hiccups are the norm, not the exception. The goal is to use software techniques to reduce variability given the increasing variability in underlying hardware, the need to handle dynamic workloads on a shared infrastructure, and the need to use lar
4 0.75508738 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
Introduction: If Google was a boxer then MapReduce would be a probing right hand that sets up the massive left hook that is Dremel , Google's—scalable (thousands of CPUs, petabytes of data, trillions of rows), SQL based, columnar, interactive (results returned in seconds), ad-hoc—analytics system. If Google was a magician then MapReduce would be the shiny thing that distracts the mind while the trick goes unnoticed. I say that because even though Dremel has been around internally at Google since 2006, we have not heard a whisper about it. All we've heard about is MapReduce, clones of which have inspired entire new industries. Tricky . Dremel, according to Brian Bershad, Director of Engineering at Google, is targeted at solving BigData class problems : While we all know that systems are huge and will get even huger, the implications of this size on programmability, manageability, power, etc. is hard to comprehend. Alfred noted that the Internet is predicted to be carrying a zetta-byte (10 21
5 0.75415343 1207 high scalability-2012-03-12-Google: Taming the Long Latency Tail - When More Machines Equals Worse Results
Introduction: Likewise the current belief that, in the case of artificial machines the very large and the very small are equally feasible and lasting is a manifest error. Thus, for example, a small obelisk or column or other solid figure can certainly be laid down or set up without danger of breaking, while the large ones will go to pieces under the slightest provocation, and that purely on account of their own weight. -- Galileo Galileo observed how things broke if they were naively scaled up. Interestingly, Google noticed a similar pattern when building larger software systems using the same techniques used to build smaller systems. Luiz André Barroso , Distinguished Engineer at Google, talks about this fundamental property of scaling systems in his fascinating talk, Warehouse-Scale Computing: Entering the Teenage Decade . Google found the larger the scale the greater the impact of latency variability. When a request is implemented by work done in parallel, as is common with today's service
7 0.73149514 1535 high scalability-2013-10-21-Google's Sanjay Ghemawat on What Made Google Google and Great Big Data Career Advice
8 0.73061621 1010 high scalability-2011-03-24-Strategy: Disk Backup for Speed, Tape Backup to Save Your Bacon, Just Ask Google
9 0.7224074 1404 high scalability-2013-02-11-At Scale Even Little Wins Pay Off Big - Google and Facebook Examples
11 0.71734977 661 high scalability-2009-07-25-Latency is Everywhere and it Costs You Sales - How to Crush it
12 0.71118206 1107 high scalability-2011-08-29-The Three Ages of Google - Batch, Warehouse, Instant
13 0.70638442 1328 high scalability-2012-09-24-Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In
14 0.70501131 1345 high scalability-2012-10-22-Spanner - It's About Programmers Building Apps Using SQL Semantics at NoSQL Scale
15 0.70231301 1559 high scalability-2013-12-06-Stuff The Internet Says On Scalability For December 6th, 2013
16 0.70099431 77 high scalability-2007-08-30-Log Everything All the Time
17 0.70066094 1222 high scalability-2012-04-05-Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory
18 0.698753 981 high scalability-2011-02-01-Google Strategy: Tree Distribution of Requests and Responses
19 0.6975596 1464 high scalability-2013-05-24-Stuff The Internet Says On Scalability For May 24, 2013
20 0.69734663 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)
topicId topicWeight
[(1, 0.124), (2, 0.216), (10, 0.039), (27, 0.013), (30, 0.021), (37, 0.034), (40, 0.026), (47, 0.011), (51, 0.012), (56, 0.183), (61, 0.063), (77, 0.027), (79, 0.076), (85, 0.016), (94, 0.037)]
simIndex simValue blogId blogTitle
1 0.96909434 779 high scalability-2010-02-16-Seven Signs You May Need a NoSQL Database
Introduction: While exploring deep into some dusty old library stacks, I dug up Nostradamus' long lost NoSQL codex. What are the chances? Strangely, it also gave the plot to the next Dan Brown novel, but I left that out for reasons of sanity. About NoSQL, here is what Nosty (his friends call him Nosty) predicted are the signs you may need a NoSQL database... You noticed a lot of your database fields are really serialized complex objects in disguise . Why bother with a RDBMS at all then? Storing serialized objects in a relational database is like being on the pill while trying to get pregnant, a bit counter productive. Just use a schemaless database from the start. Using a standard query language has become too confining . You just want to be free. SQL is so easy, so convenient, and so standard, it's really not a challenge anymore. You need to be different. Then NoSQL is for you. Each has their own completely different query mechanism . Your toolbox only contains a hammer . Hammers wh
2 0.95818001 941 high scalability-2010-11-15-How Google's Instant Previews Reduces HTTP Requests
Introduction: In a strange case of synchronicity, Google just published Instant Previews: Under the hood, a very well written blog post by Matías Pelenur of the Instant Previews team, giving some fascinating inside details on how Google implemented Instant Previews. It's synchronicity because I had just posted Strategy: Biggest Performance Impact Is To Reduce The Number Of HTTP Requests and one of the major ideas behind the design of Instant Previews is to reduce the number of HTTP requests through a few well chosen tricks. Cosmic! Some of what Google does to reduce HTTP requests: Data URIs, which are base64 encodings of image data, are used instead of static images that are served from the server. This means the whole preview can be pieced together from image slices in one request as both the data and the image are returned in the same request. Google found that even though base64 encoding adds about 33% to the size of the image, tests showed that gzip-compressed data URIs are compara
3 0.94402051 732 high scalability-2009-10-29-Digg - Looking to the Future with Cassandra
Introduction: Digg has been researching ways to scale our database infrastructure for some time now. We’ve adopted a traditional vertically partitioned master-slave configuration with MySQL, and also investigated sharding MySQL with IDDB . Ultimately, these solutions left us wanting. In the case of the traditional architecture, the lack of redundancy on the write masters is painful, and both approaches have significant management overhead to keep running. Since it was already necessary to abandon data normalization and consistency to make these approaches work, we felt comfortable looking at more exotic, non-relational data stores. After considering HBase, Hypertable, Cassandra, Tokyo Cabinet/Tyrant, Voldemort, and Dynomite, we settled on Cassandra . Each system has its own strengths and weaknesses, but Cassandra has a good blend of everything. It offers column-oriented data storage, so you have a bit more structure than plain key/value stores. It operates in a distributed, highly available,
4 0.93423343 446 high scalability-2008-11-18-Scalability Perspectives #2: Van Jacobson – Content-Centric Networking
Introduction: Scalability Perspectives is a series of posts that highlights the ideas that will shape the next decade of IT architecture. Each post is dedicated to a thought leader of the information age and his vision of the future. Be warned though – the journey into the minds and perspectives of these people requires an open mind. Van Jacobson Van Jacobson is a Research Fellow at PARC . Prior to that he was Chief Scientist and co-founder of Packet Design. Prior to that he was Chief Scientist at Cisco. Prior to that he was head of the Network Research group at Lawrence Berkeley National Laboratory. He's been studying networking since 1969. He still hopes that someday something will start to make sense. Scaling the Internet – Does the Net needs an upgrade? As the Internet is being overrun with video traffic, many wonder if it can survive. With challenges being thrown down over the imbalances that have been created and their impact on the viability of monopolistic business models, t
same-blog 5 0.92680651 815 high scalability-2010-04-27-Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure
6 0.92547697 1322 high scalability-2012-09-14-Stuff The Internet Says On Scalability For September 14, 2012
7 0.92361295 854 high scalability-2010-07-09-Hot Scalability Links for July 9, 2010
8 0.92311341 659 high scalability-2009-07-20-A Scalability Lament
9 0.91918021 759 high scalability-2010-01-11-Strategy: Don't Use Polling for Real-time Feeds
10 0.91843253 67 high scalability-2007-08-17-What is the best hosting option?
11 0.88351071 1565 high scalability-2013-12-16-22 Recommendations for Building Effective High Traffic Web Software
12 0.88275623 1022 high scalability-2011-04-13-Paper: NoSQL Databases - NoSQL Introduction and Overview
13 0.85559285 45 high scalability-2007-07-30-Product: SmarterStats
14 0.84557807 1334 high scalability-2012-10-04-Stuff The Internet Says On Scalability For October 5, 2012
15 0.84520817 1394 high scalability-2013-01-25-Stuff The Internet Says On Scalability For January 25, 2013
16 0.8423897 1345 high scalability-2012-10-22-Spanner - It's About Programmers Building Apps Using SQL Semantics at NoSQL Scale
17 0.84105539 1339 high scalability-2012-10-12-Stuff The Internet Says On Scalability For October 12, 2012
18 0.83907449 1408 high scalability-2013-02-19-Puppet monitoring: how to monitor the success or failure of Puppet runs
19 0.83855563 1237 high scalability-2012-05-02-12 Ways to Increase Throughput by 32X and Reduce Latency by 20X
20 0.83748615 1455 high scalability-2013-05-10-Stuff The Internet Says On Scalability For May 10, 2013