high_scalability high_scalability-2008 high_scalability-2008-249 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Being an authentic human being is difficult and apparently authenticating all those S3 requests can be a bit overwhelming as well. Amazon fingered a lot of processor-heavy authentication requests as the reason for their downtime: Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types. Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles.
sentIndex sentText sentNum sentScore
1 Being an authentic human being is difficult and apparently authenticating all those S3 requests can be a bit overwhelming as well. [sent-1, score-0.461]
2 Amazon fingered a lot of processor-heavy authentication requests as the reason for their downtime: Early this morning, at 3:30am PST, we started seeing elevated levels of authenticated requests from multiple users in one of our locations. [sent-2, score-1.177]
3 While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. [sent-3, score-0.668]
4 Importantly, these cryptographic requests consume more resources per call than other request types. [sent-4, score-0.437]
5 Shortly before 4:00am PST, we began to see several other users significantly increase their volume of authenticated calls. [sent-5, score-0.43]
6 The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. [sent-6, score-0.357]
7 In addition to processing authenticated requests, the authentication service also performs account validation on every request Amazon S3 handles. [sent-7, score-0.744]
8 This caused Amazon S3 to be unable to process any requests in that location, beginning at 4:31am PST. [sent-8, score-0.439]
9 They need to purchase specialized SSL concentrators to handle the load, which makes capacity planning a lot trickier and more expensive. [sent-12, score-0.192]
10 In the comments Allen conjectured: What caused the problem, however, was a sudden unexpected surge in a particular type of usage (PUTs and GETs of private files, which require cryptographic credentials, rather than GETs of public files, which require no credentials). [sent-13, score-0.512]
11 As I understand what Kathrin said, the surge was caused by several large customers suddenly and unexpectedly increasing their usage. [sent-14, score-0.372]
12 The Skype failure was blamed on software updates which caused all nodes to re-login at the same time. [sent-17, score-0.268]
13 Bring up a new disk storage filer and, if you aren't load balancing requests, all new storage requests will go to that new filer and you'll be down lickety-split. [sent-18, score-0.746]
14 Bandwidth and CPU both become restricted, which causes a cascade of failures. [sent-20, score-0.326]
15 Packets drop, which causes retransmissions, which chew up bandwidth, which uses CPU and causes more drops. [sent-22, score-0.639]
16 CPUs spike, which causes timeouts and reconnects, which again spiral everything out of control. [sent-23, score-0.369]
17 When I worked at a set-top box company we had the scenario of a neighborhood rebooting after a power outage. [sent-24, score-0.237]
18 Lots of houses needing to boot large boot images over asymmetric, low-bandwidth cable connections. [sent-25, score-0.729]
19 As a fix we broadcast boot image blocks to all set-tops. [sent-26, score-0.423]
20 Amazon's problem was a subtle one in a very obscure corner of their system. [sent-29, score-0.152]
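A few minimal sketches of the defenses these sentences point at follow; all are illustrative Python with invented names and numbers, not anything from Amazon's or Skype's actual systems. First, the monitoring gap in the Amazon quote: total volume looked normal while the mix shifted toward expensive authenticated calls, so the useful signal is the per-window ratio of expensive requests, not the raw count.

```python
import time
from collections import Counter

WINDOW_SECONDS = 60
AUTH_RATIO_ALERT = 0.40            # hypothetical threshold, not Amazon's

counts = Counter()                 # per-type request counts, current window
window_start = time.monotonic()

def record_request(kind: str) -> None:
    """Call once per request, e.g. record_request('authenticated')."""
    global window_start
    counts[kind] += 1
    if time.monotonic() - window_start >= WINDOW_SECONDS:
        total = sum(counts.values())
        ratio = counts["authenticated"] / total if total else 0.0
        if ratio > AUTH_RATIO_ALERT:
            print(f"ALERT: {ratio:.0%} of requests are authenticated "
                  "(CPU-expensive); check auth service headroom")
        counts.clear()
        window_start = time.monotonic()
```

The point is the denominator: absolute volume stayed "within normal ranges," so only a ratio- or mix-based alert would have fired early.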
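Second, the Skype-style re-login stampede and the retransmit/reconnect spirals share a standard defense: clients must not retry in lockstep. A minimal sketch of "full jitter" exponential backoff, where connect, the base delay, and the cap are placeholders:

```python
import random
import time

def login_with_backoff(connect, base=1.0, cap=300.0, attempts=10):
    """Retry connect(), sleeping a random time in [0, min(cap, base * 2**n)]
    between failures ("full jitter"), so clients spread their retries out
    instead of hammering the service simultaneously."""
    for n in range(attempts):
        try:
            return connect()
        except ConnectionError:
            time.sleep(random.uniform(0, min(cap, base * 2 ** n)))
    raise ConnectionError(f"gave up after {attempts} attempts")
```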
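Third, the filer-mobbing scenario in sentence 13 is commonly countered with "slow start" weighting in the load balancer: a freshly added node earns traffic gradually instead of absorbing every new request at once. A sketch with an invented linear ramp:

```python
import random
import time

class Node:
    """A storage node whose load-balancer weight ramps up after it is added."""
    def __init__(self, name: str, ramp_seconds: float = 600.0):
        self.name = name
        self.added_at = time.monotonic()
        self.ramp_seconds = ramp_seconds

    def weight(self) -> float:
        # Linear ramp from 10% to full weight over ramp_seconds.
        age = time.monotonic() - self.added_at
        return min(1.0, 0.1 + 0.9 * age / self.ramp_seconds)

def pick(nodes: list) -> Node:
    """Weighted random choice: a brand-new filer gets a trickle, not a flood."""
    return random.choices(nodes, weights=[n.weight() for n in nodes], k=1)[0]
```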
wordName wordTfidf (topN-words)
[('authenticated', 0.43), ('boot', 0.248), ('pst', 0.238), ('requests', 0.198), ('causes', 0.185), ('caused', 0.176), ('filer', 0.175), ('credentials', 0.158), ('authentication', 0.157), ('cryptographic', 0.151), ('surge', 0.119), ('packets', 0.1), ('capacity', 0.1), ('authenticating', 0.097), ('authentic', 0.097), ('broadcasted', 0.097), ('chews', 0.097), ('elevated', 0.097), ('fingered', 0.097), ('spirals', 0.097), ('blamed', 0.092), ('retransmissions', 0.092), ('trickier', 0.092), ('request', 0.088), ('stressful', 0.087), ('arp', 0.087), ('rebooting', 0.087), ('reconnects', 0.087), ('amazon', 0.084), ('bandwidth', 0.08), ('asymmetric', 0.079), ('neighborhood', 0.079), ('obscure', 0.079), ('image', 0.078), ('unexpectedly', 0.077), ('allen', 0.077), ('remained', 0.077), ('cascade', 0.076), ('houses', 0.074), ('corner', 0.073), ('proportion', 0.073), ('worked', 0.071), ('bring', 0.069), ('overwhelming', 0.069), ('validation', 0.069), ('morning', 0.068), ('ranges', 0.067), ('sudden', 0.066), ('unable', 0.065), ('restricted', 0.065)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999976 249 high scalability-2008-02-16-S3 Failed Because of Authentication Overload
2 0.18897197 76 high scalability-2007-08-29-Skype Failed the Boot Scalability Test: Is P2P fundamentally flawed?
Introduction: Skype's 220 million users lost service for a stunning two days. The primary cause for Skype's nightmare (can you imagine the beeper storm that went off?) was a massive global roll-out of a Windows patch triggering the simultaneous reboot of millions of machines across the globe. The secondary cause was a bug in Skype's software that prevented "self-healing" in the face of such attacks. The flood of log-in requests and a lack of "peer-to-peer resources" melted their system. Whose fault is it? Is Skype to blame? Is Microsoft to blame? Or is the peer-to-peer model itself fundamentally flawed in some way? Let's be real, how could Skype possibly test booting 220 million servers over a random configuration of resources? Answer: they can't. Yes, it's Skype's responsibility, but they are in a bit of a pickle on this one. The boot scenario is one of the most basic and one of the most difficult scalability scenarios to plan for and test. You can't simulate the viciousness of real-life
Introduction: In Taming The Long Latency Tail we covered Luiz Barroso’s exploration of the long tail latency (some operations are really slow) problems generated by large fanout architectures (a request is composed of potentially thousands of other requests). You may have noticed there weren’t a lot of solutions. That’s where a talk I attended, Achieving Rapid Response Times in Large Online Services (slide deck), by Jeff Dean, also of Google, comes in: In this talk, I’ll describe a collection of techniques and practices lowering response times in large distributed systems whose components run on shared clusters of machines, where pieces of these systems are subject to interference by other tasks, and where unpredictable latency hiccups are the norm, not the exception. The goal is to use software techniques to reduce variability given the increasing variability in underlying hardware, the need to handle dynamic workloads on a shared infrastructure, and the need to use lar
4 0.11189649 1413 high scalability-2013-02-27-42 Monster Problems that Attack as Loads Increase
Introduction: For solutions take a look at: 7 Life Saving Scalability Defenses Against Load Monster Attacks. This is a look at all the bad things that can happen to your carefully crafted program as loads increase: all hell breaks loose. Sure, you can scale out or scale up, but you can also choose to program better. Make your system handle larger loads. This saves money because fewer boxes are needed and it will make the entire application more reliable and have better response times. And it can be quite satisfying as a programmer. Large Number Of Objects We usually get into scaling problems when the number of objects gets larger. Clearly resource usage of all types is stressed as the number of objects grows. Continuous Failures Makes An Infinite Event Stream During large network failure scenarios there is never time for the system to recover. We are in a continual state of stress. Lots of High Priority Work For example, rerouting is a high priority activity. If there is a large amount
Introduction: This is a guest repost by Ron Pressler, the founder and CEO of Parallel Universe, a Y Combinator company building advanced middleware for real-time applications. Little's Law helps us determine the maximum request rate a server can handle. When we apply it, we find that the dominating factor limiting a server's capacity is not the hardware but the OS. Should we buy more hardware if software is the problem? If not, how can we remove that software limitation in a way that does not make the code much harder to write and understand? Many modern web applications are composed of multiple (often many) HTTP services (this is often called a micro-service architecture). This architecture has many advantages in terms of code reuse and maintainability, scalability and fault tolerance. In this post I'd like to examine one particular bottleneck in the approach, which hinders scalability as well as fault tolerance, and various ways to deal with it (I am using the term "scalability" very loosely in this post
6 0.09957476 728 high scalability-2009-10-26-Facebook's Memcached Multiget Hole: More machines != More Capacity
7 0.097913668 1331 high scalability-2012-10-02-An Epic TripAdvisor Update: Why Not Run on the Cloud? The Grand Experiment.
8 0.09607967 38 high scalability-2007-07-30-Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services
9 0.094337851 1421 high scalability-2013-03-11-Low Level Scalability Solutions - The Conditioning Collection
10 0.094012231 981 high scalability-2011-02-01-Google Strategy: Tree Distribution of Requests and Responses
12 0.081831291 1314 high scalability-2012-08-30-Dramatically Improving Performance by Debugging Brutally Complex Prolems
13 0.0806555 920 high scalability-2010-10-15-Troubles with Sharding - What can we learn from the Foursquare Incident?
14 0.079877324 761 high scalability-2010-01-17-Applications Become Black Boxes Using Markets to Scale and Control Costs
15 0.079664297 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)
16 0.078866437 1051 high scalability-2011-06-01-Why is your network so slow? Your switch should tell you.
17 0.078181498 750 high scalability-2009-12-16-Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud
19 0.077754073 1131 high scalability-2011-10-24-StackExchange Architecture Updates - Running Smoothly, Amazon 4x More Expensive
20 0.077722706 910 high scalability-2010-09-30-Facebook and Site Failures Caused by Complex, Weakly Interacting, Layered Systems
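The Little's Law introduction in the list above reduces to one formula worth making concrete: L = λW relates in-flight requests (L), arrival rate (λ), and time spent in the system (W). A tiny worked example, with invented numbers:

```python
# Little's Law: L = lambda * W, so the sustainable rate is lambda = L / W.
concurrency_limit = 1000    # L: requests the server can hold in flight
service_time_s = 0.05       # W: average time a request spends in the server

max_rate = concurrency_limit / service_time_s
print(f"max sustainable rate ~ {max_rate:,.0f} requests/second")   # 20,000
```

If the OS caps usable concurrency (threads, sockets, scheduler overhead) below what the hardware could sustain, it caps λ too, which is that post's point.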
topicId topicWeight
[(0, 0.138), (1, 0.076), (2, -0.011), (3, -0.04), (4, -0.044), (5, -0.07), (6, 0.073), (7, 0.02), (8, -0.018), (9, -0.045), (10, -0.006), (11, -0.011), (12, 0.024), (13, -0.008), (14, 0.017), (15, 0.016), (16, 0.019), (17, 0.001), (18, -0.017), (19, 0.009), (20, 0.003), (21, 0.002), (22, 0.024), (23, -0.017), (24, 0.013), (25, 0.033), (26, -0.011), (27, 0.027), (28, 0.01), (29, -0.004), (30, 0.033), (31, -0.063), (32, 0.031), (33, -0.0), (34, 0.014), (35, 0.022), (36, 0.018), (37, 0.009), (38, -0.048), (39, -0.039), (40, -0.037), (41, -0.006), (42, 0.026), (43, -0.014), (44, 0.002), (45, 0.056), (46, 0.019), (47, -0.015), (48, -0.002), (49, -0.018)]
simIndex simValue blogId blogTitle
same-blog 1 0.97161758 249 high scalability-2008-02-16-S3 Failed Because of Authentication Overload
2 0.73318213 76 high scalability-2007-08-29-Skype Failed the Boot Scalability Test: Is P2P fundamentally flawed?
4 0.71936065 645 high scalability-2009-06-30-Hot New Trend: Linking Clouds Through Cheap IP VPNs Instead of Private Lines
Introduction: You might think major Internet companies have a latency, availability, and bandwidth advantage because they can afford expensive dedicated point-to-point private line networks between their data centers. And you would be right. It's a great advantage. Or at least it was a great advantage. Cost is the great equalizer and companies are now scrambling for ways to cut costs. Many of the most recognizable Internet companies are moving to IP VPNs (Virtual Private Networks) as a much cheaper alternative to private lines. This is a strategy you can effectively use too. This trend has historical precedent in the data center. In the same way leading edge companies moved early to virtualize their data centers, leading edge companies are now virtualizing their networks using IP VPNs to build inexpensive private networks over a shared public network. In kindergarten we learned sharing was polite, it turns out sharing can also save a lot of money in both the data center and on the network. The
5 0.71632743 1587 high scalability-2014-01-29-10 Things Bitly Should Have Monitored
Introduction: Monitor, monitor, monitor. That's the advice every startup gives once they reach a certain size. But can you ever monitor enough? If you are Bitly and everyone will complain when you are down, probably not. Here are 10 Things We Forgot to Monitor from Bitly, along with good stories and copious amounts of code snippets. Well worth reading, especially after you've already started monitoring the lower hanging fruit. An interesting revelation from the article is that: We run bitly split across two data centers, one is a managed environment with DELL hardware, and the second is Amazon EC2. Fork Rate. A strange configuration issue caused processes to be created at a rate of several hundred a second rather than the expected 1-10/second. Flow control packets. A network configuration that honors flow control packets and isn't configured to disable them can temporarily cause dropped traffic. Swap In/Out Rate. Measure the right thing. It's the rate memory is swapped
6 0.69753861 1413 high scalability-2013-02-27-42 Monster Problems that Attack as Loads Increase
7 0.69654548 960 high scalability-2010-12-20-Netflix: Use Less Chatty Protocols in the Cloud - Plus 26 Fixes
8 0.69244361 533 high scalability-2009-03-11-The Implications of Punctuated Scalabilium for Website Architecture
9 0.68491817 1421 high scalability-2013-03-11-Low Level Scalability Solutions - The Conditioning Collection
10 0.68105513 1622 high scalability-2014-03-31-How WhatsApp Grew to Nearly 500 Million Users, 11,000 cores, and 70 Million Messages a Second
11 0.67934901 1591 high scalability-2014-02-05-Little’s Law, Scalability and Fault Tolerance: The OS is your bottleneck. What you can do?
12 0.67069489 788 high scalability-2010-03-04-How MySpace Tested Their Live Site with 1 Million Concurrent Users
13 0.66745394 1207 high scalability-2012-03-12-Google: Taming the Long Latency Tail - When More Machines Equals Worse Results
14 0.66727424 878 high scalability-2010-08-12-Strategy: Terminate SSL Connections in Hardware and Reduce Server Count by 40%
15 0.66712964 663 high scalability-2009-07-28-37signals Architecture
16 0.66439044 1438 high scalability-2013-04-10-Check Yourself Before You Wreck Yourself - Avocado's 5 Early Stages of Architecture Evolution
17 0.66410196 275 high scalability-2008-03-14-Problem: Mobbing the Least Used Resource Error
18 0.66408867 661 high scalability-2009-07-25-Latency is Everywhere and it Costs You Sales - How to Crush it
19 0.66353536 1051 high scalability-2011-06-01-Why is your network so slow? Your switch should tell you.
20 0.66114444 270 high scalability-2008-03-08-DNS-Record TTL on worst case scenarios
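The Bitly introduction in the list above calls out fork rate as a forgotten metric. One way to measure it on Linux, sketched below, is to sample the "processes" counter in /proc/stat (total forks since boot) twice; the one-second window and the threshold are invented:

```python
import time

def forks_since_boot() -> int:
    # /proc/stat's "processes" line counts forks since boot (Linux only).
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("processes "):
                return int(line.split()[1])
    raise RuntimeError("no 'processes' line in /proc/stat")

before = forks_since_boot()
time.sleep(1.0)
rate = forks_since_boot() - before
print(f"fork rate: {rate}/s" + ("  <-- suspiciously high" if rate > 100 else ""))
```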
topicId topicWeight
[(1, 0.105), (2, 0.197), (10, 0.017), (30, 0.038), (47, 0.024), (60, 0.28), (61, 0.073), (77, 0.021), (79, 0.051), (85, 0.026), (94, 0.076)]
simIndex simValue blogId blogTitle
1 0.9064247 698 high scalability-2009-09-10-Building Scalable Databases: Denormalization, the NoSQL Movement and Digg
Introduction: Database normalization is a technique for designing relational database schemas that ensures that the data is optimal for ad-hoc querying and that modifications such as deletion or insertion of data do not lead to data inconsistency. Database denormalization is the process of optimizing your database for reads by creating redundant data. A consequence of denormalization is that insertions or deletions could cause data inconsistency if not uniformly applied to all redundant copies of the data within the database. Read more on Carnage4life blog...
2 0.89753932 1260 high scalability-2012-06-07-Case Study on Scaling PaaS infrastructure
Introduction: In his blog post, Scaling WSO2 Stratos, Srinath Perera explains the scaling architecture of the WSO2 Stratos Platform as a Service (PaaS) infrastructure. It is explained as a series of solutions where every solution adds a new concept to solve a specific problem found in the earlier solution. Overall, WSO2 Stratos uses a combination of intelligent load balancing and lazy loading to scale up the architecture. More details about Stratos can be found in the paper WSO2 Stratos: An Industrial Stack to Support Cloud Computing. Problem Stratos is multi-tenanted. In other words, there are many tenants. Each tenant generally represents an organization and is isolated from other tenants, where each tenant has his own users, resources, and permissions. Stratos supports multiple PaaS services. Each PaaS service is actually a WSO2 product (e.g. AS, BPS, ESB, etc.) offered as a service. Using those services, tenants may deploy their own Web Services, Mediation logic, Workflows, a
3 0.89365703 618 high scalability-2009-06-05-Google Wave Architecture
Introduction: Update: Good Vibrations by Radovan Semančík. Lots of interesting questions about how Wave works, scalability, security, RESTyness, and so on. Google Wave is a new communication and collaboration platform based on hosted XML documents (called waves) supporting concurrent modifications and low-latency updates. This platform enables people to communicate and work together in new, convenient and effective ways. We will offer these benefits to users of Google Wave and we also want to share them with everyone else by making waves an open platform that everybody can share. We welcome others to run wave servers and become wave providers, for themselves or as services for their users, and to "federate" waves, that is, to share waves with each other and with Google Wave. In this way users from different wave providers can communicate and collaborate using shared waves. We are introducing the Google Wave Federation Protocol for federating waves between wave providers on the Internet. H
same-blog 4 0.85906947 249 high scalability-2008-02-16-S3 Failed Because of Authentication Overload
5 0.76625764 1617 high scalability-2014-03-21-Stuff The Internet Says On Scalability For March 21st, 2014
Introduction: Hey, it's HighScalability time: Isaac Newton's College Notebook, such a slacker. Quotable Quotes: Chris Anderson: Petabytes allow us to say: ‘Correlation is enough.’ @adron: Back to writing the micro-services for my composite app with regionally distributed, highly available, key value crypto stores of unicorns! DevOps Cafe: when the canary dies you don't buy a stronger canary. Mark Pagel: Creativity, like evolution, is merely a series of thefts. GitHub: Whoever has more capacity wins. The Master: We balance probabilities and choose the most likely. It is the scientific use of the imagination. Jonathan Ive: Steve, and I don’t recognize my friend in much of it. Yes, he had a surgically precise opinion. Yes, it could sting. Yes, he constantly questioned. ‘Is this good enough? Is this right?’ but he was so clever. His ideas were bold and magnificent. They could suck the air from the room. And when the ideas didn’t come, he d
6 0.75721204 655 high scalability-2009-07-12-SPHiveDB: A mixture of the Key-Value Store and the Relational Database.
7 0.75062001 1572 high scalability-2014-01-03-Stuff The Internet Says On Scalability For January 3rd, 2014
8 0.73492718 1292 high scalability-2012-07-27-Stuff The Internet Says On Scalability For July 27, 2012
9 0.73487496 49 high scalability-2007-07-30-allowed contributed
10 0.73352283 299 high scalability-2008-04-07-Rumors of Signs and Portents Concerning Freeish Google Cloud
11 0.73322368 1129 high scalability-2011-09-30-Stuff The Internet Says On Scalability For September 30, 2011
12 0.71397078 1408 high scalability-2013-02-19-Puppet monitoring: how to monitor the success or failure of Puppet runs
13 0.71384305 1630 high scalability-2014-04-11-Stuff The Internet Says On Scalability For April 11th, 2014
14 0.70153832 789 high scalability-2010-03-05-Strategy: Planning for a Power Outage Google Style
15 0.6611824 545 high scalability-2009-03-19-Product: Redis - Not Just Another Key-Value Store
16 0.65922368 942 high scalability-2010-11-15-Strategy: Biggest Performance Impact is to Reduce the Number of HTTP Requests
17 0.65914482 1151 high scalability-2011-12-05-Stuff The Internet Says On Scalability For December 5, 2011
18 0.65900099 256 high scalability-2008-02-21-Tracking usage of public resources - throttling accesses per hour
19 0.6588735 1407 high scalability-2013-02-15-Stuff The Internet Says On Scalability For February 15, 2013
20 0.65826768 307 high scalability-2008-04-21-Using Google AppEngine for a Little Micro-Scalability
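The denormalization introduction at the top of this last list is easy to make concrete. A minimal sqlite3 sketch, with an invented two-table schema, contrasting the normalized read (a join) with the denormalized one (a redundant author_name column):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts_norm (id INTEGER PRIMARY KEY, author_id INTEGER,
                             title TEXT);
    -- Denormalized: author_name copied onto each post for cheap reads.
    CREATE TABLE posts_denorm (id INTEGER PRIMARY KEY, author_id INTEGER,
                               author_name TEXT, title TEXT);
    INSERT INTO authors VALUES (1, 'alice');
    INSERT INTO posts_norm VALUES (1, 1, 'hello');
    INSERT INTO posts_denorm VALUES (1, 1, 'alice', 'hello');
""")

# Normalized: every read pays for a join.
print(db.execute("""SELECT p.title, a.name FROM posts_norm p
                    JOIN authors a ON a.id = p.author_id""").fetchall())

# Denormalized: single-table read, but an UPDATE of authors.name must also
# rewrite posts_denorm.author_name or the copies drift (the consistency risk).
print(db.execute("SELECT title, author_name FROM posts_denorm").fetchall())
```

The trade is explicit in the last comment: reads get cheaper, while every write to the source of truth must fan out to all redundant copies.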
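And the Stratos introduction's "lazy loading" is essentially an LRU cache of tenant contexts: load a tenant on first use, evict the idlest, so no single node must hold every tenant in memory. A minimal sketch; TenantCache, the loader, and the capacity are all invented, not WSO2's API:

```python
from collections import OrderedDict

class TenantCache:
    """Lazy-load tenant contexts on first use; evict the least recently used."""
    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # e.g. reads tenant config from storage
        self._lru = OrderedDict()

    def get(self, tenant_id):
        if tenant_id in self._lru:
            self._lru.move_to_end(tenant_id)        # mark as recently used
        else:
            if len(self._lru) >= self.capacity:
                self._lru.popitem(last=False)       # evict the idlest tenant
            self._lru[tenant_id] = self.loader(tenant_id)
        return self._lru[tenant_id]

cache = TenantCache(capacity=2, loader=lambda t: {"tenant": t})
cache.get("acme"); cache.get("globex"); cache.get("initech")   # "acme" evicted
```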