knowledge-graph by maker-knowledge-mining

1098 high scalability-2011-08-15-Should any cloud be considered one availability zone? The Amazon experience says yes.


meta info for this blog

Source: html

Introduction: Amazon has a very well-written account of their 8/8/2011 downtime: Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region . Power failed, backup generators failed to kick in, there weren't enough resources for EBS volumes to recover, API servers were overwhelmed, a DNS failure caused failovers to alternate availability zones to fail, and a double fault occurred as the power event interrupted the repair of a different bug. All kinds of typical stuff that just seems to happen. Considering the previous outage, the big question for programmers is: what does this mean? What does it mean for how systems should be structured? Have we learned something that can't be unlearned? The Amazon post has lots of good insights into how EBS and RDS work, plus lessons learned. The short of the problem is large + complex = high probability of failure. The immediate fixes are to add more resources, more redundancy, and more isolation between components, increase automation, reduce recovery times, and build software that is more aware of large-scale failure modes.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Power failed, backup generators failed to kick in, there weren't enough resources for EBS volumes to recover, API servers were overwhelmed, a DNS failure caused failovers to alternate availability zones to fail, and a double fault occurred as the power event interrupted the repair of a different bug. [sent-2, score-1.456]

2 What does it mean for how systems should be structured? [sent-5, score-0.09]

3 The short of the problem is large + complex = high probability of failure. [sent-8, score-0.183]
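
The arithmetic behind that equation is worth making explicit. A back-of-the-envelope illustration, with a hypothetical failure rate and component count (real failures are also correlated rather than independent, which makes things worse, not better):

```python
# If each of n components fails in a given window with probability p,
# and failures were independent, the chance that at least one fails is
#     P(any failure) = 1 - (1 - p) ** n
p, n = 0.001, 10_000
print(1 - (1 - p) ** n)  # ~0.99995: at this scale, something is always failing
```

As n grows, per-component reliability buys less and less; the system has to be designed to run correctly while parts of it are failing.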

4 The immediate fixes are to add more resources, more redundancy, and more isolation between components, increase automation, reduce recovery times, and build software that is more aware of large-scale failure modes. [sent-9, score-0.247]
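
One concrete example of software that is more aware of large-scale failure modes is retrying with exponential backoff and jitter, so that thousands of recovering clients don't stampede an already overwhelmed API server. A minimal sketch of the pattern, assuming a generic `operation` callable (an illustration, not code from the Amazon post):

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            # Full jitter spreads retries out in time, avoiding the
            # thundering herd that can keep a recovering service down.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```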

5 We can predict, however, that problems like this will continue to happen, not because of any incompetence by Amazon, but because large + complex makes cascading failure an inherent characteristic of the system. [sent-12, score-0.484]

6 At some level of complexity any cloud/region/datacenter could be reasonably considered a single failure domain and should be treated accordingly, regardless of the heroic software infrastructure created to carve out availability zones. [sent-13, score-0.76]

7 Viewing a region as a single point of failure implies that, to be really safe, you would need to be in multiple regions, which is to say multiple locations. [sent-14, score-0.369]
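
In practice that means the failover decision has to be made from outside the failing region. A minimal sketch of a cross-region health check, with hypothetical endpoint URLs standing in for whatever a real deployment exposes:

```python
import urllib.request

# Hypothetical health-check endpoints, one per region.
REGION_ENDPOINTS = {
    "eu-west-1": "https://eu.api.example.com/health",
    "us-east-1": "https://us.api.example.com/health",
}

def pick_live_region(endpoints, timeout=2):
    """Return the first region whose health endpoint answers 200 OK."""
    for region, url in endpoints.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return region
        except OSError:
            continue  # region unreachable or erroring; try the next one
    raise RuntimeError("no healthy region found")
```

The hard part, of course, is not the health check but synchronizing state between regions so the survivor can actually serve traffic.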

8 Diversity, mother nature's means of robustness, suggests that using different providers is a good strategy. [sent-15, score-0.273]
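
Apache Libcloud (listed in the Related Articles below) exists for exactly this reason: one API across many providers. A minimal sketch with placeholder credentials, showing the same call running against two different clouds (driver details and region handling vary by Libcloud version):

```python
from libcloud.compute.providers import get_driver
from libcloud.compute.types import Provider

# Placeholder credentials; EC2 and Rackspace are two of Libcloud's
# standard compute drivers.
providers = [
    (Provider.EC2, ("EC2_ACCESS_KEY_ID", "EC2_SECRET_KEY")),
    (Provider.RACKSPACE, ("RACKSPACE_USER", "RACKSPACE_API_KEY")),
]

for provider, credentials in providers:
    driver = get_driver(provider)(*credentials)
    # The same list_nodes() call works against every provider, which is
    # what makes running on multiple clouds operationally manageable.
    for node in driver.list_nodes():
        print(provider, node.name, node.state)
```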

9 A lot of people have been saying this for a while, but with more evidence coming in, that conclusion is even stronger now. [sent-16, score-0.326]

10 For most projects this conclusion doesn't really matter all that much. [sent-18, score-0.145]

11 100% uptime is extremely expensive, and Amazon will usually keep your infrastructure up and working. [sent-19, score-0.087]

12 Most of the time multiple Availability Zones are all you need. [sent-20, score-0.101]
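
Using multiple Availability Zones simply means spreading instances so that no single zone's failure takes you out. A minimal sketch with the boto library of the era, using a placeholder AMI id (parameters are illustrative; a real deployment would also balance traffic across the zones):

```python
import boto.ec2

# Connect to the EU West region and spread instances across its zones.
conn = boto.ec2.connect_to_region("eu-west-1")

for zone in ("eu-west-1a", "eu-west-1b", "eu-west-1c"):
    conn.run_instances(
        "ami-00000000",            # placeholder AMI id
        instance_type="m1.small",
        placement=zone,            # pin this instance to one AZ
    )
```

As the 8/8/2011 event showed, this protects against a single-zone failure but not against a region-wide power or control-plane event.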

13 All this diversity, of course, is very expensive and very complicated. [sent-23, score-0.243]

14 Another option is a retreat into radical simplicity. [sent-30, score-0.135]

15 Related Articles: Amazon Discussion Forums; Apache Libcloud - a unified interface to the cloud. [sent-33, score-0.083]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('amazon', 0.261), ('double', 0.202), ('ebs', 0.198), ('failure', 0.167), ('diversity', 0.156), ('zones', 0.146), ('rds', 0.145), ('conclusion', 0.145), ('failed', 0.135), ('failovers', 0.135), ('incompetence', 0.135), ('retreat', 0.135), ('accordingly', 0.127), ('eu', 0.121), ('articlesamazon', 0.121), ('interrupted', 0.121), ('cake', 0.11), ('synchronizing', 0.11), ('carve', 0.11), ('heroic', 0.107), ('recovering', 0.107), ('earned', 0.102), ('multiple', 0.101), ('availability', 0.099), ('alternate', 0.099), ('overwhelmed', 0.097), ('robustness', 0.096), ('reasonably', 0.096), ('characteristic', 0.096), ('problem', 0.095), ('complexity', 0.095), ('mother', 0.094), ('generators', 0.093), ('occurred', 0.092), ('rightscale', 0.091), ('stronger', 0.091), ('mean', 0.09), ('evidence', 0.09), ('radically', 0.089), ('probability', 0.088), ('expensive', 0.087), ('treated', 0.086), ('cascading', 0.086), ('repair', 0.084), ('kick', 0.083), ('indicate', 0.083), ('eat', 0.083), ('unified', 0.083), ('west', 0.082), ('fixes', 0.08)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 1098 high scalability-2011-08-15-Should any cloud be considered one availability zone? The Amazon experience says yes.

Introduction: Amazon has a very well-written account of their 8/8/2011 downtime: Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region . Power failed, backup generators failed to kick in, there weren't enough resources for EBS volumes to recover, API servers were overwhelmed, a DNS failure caused failovers to alternate availability zones to fail, and a double fault occurred as the power event interrupted the repair of a different bug. All kinds of typical stuff that just seems to happen. Considering the previous outage, the big question for programmers is: what does this mean? What does it mean for how systems should be structured? Have we learned something that can't be unlearned? The Amazon post has lots of good insights into how EBS and RDS work, plus lessons learned. The short of the problem is large + complex = high probability of failure. The immediate fixes are to add more resources, more redundancy, and more isolation between components, increase automation, reduce recovery times, and build software that is more aware of large-scale failure modes.

2 0.23431148 1029 high scalability-2011-04-25-The Big List of Articles on the Amazon Outage

Introduction: Please see The Updated Big List Of Articles On The Amazon Outage  for a new, improved list. So many great articles have been written on the Amazon Outage. Some aim at being helpful, some chastise developers for being so stupid, some chastise Amazon for being so incompetent, some talk about the pain they and their companies have experienced, and some even predict the downfall of the cloud. Still others say we have seen a sea change in the future of the cloud, a prediction that's hard to disagree with, though the shape of the change remains...cloudy. I'll try to keep this list updated as more information comes out. There will be a lot for developers to consider going forward. If there's a resource you think should be added, just let me know. Amazon's Explanation of What Happened Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region Hacker News thread on AWS Service Disruption Post Mortem   Quite Funny Commentary on the Summary Experiences f

3 0.21025582 1033 high scalability-2011-05-02-The Updated Big List of Articles on the Amazon Outage

Introduction: Since The Big List Of Articles On The Amazon Outage  was published we've had a few updates that people might not have seen. Amazon of course released their  Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region . Netflix shared their Lessons Learned from the AWS Outage  as did Heroku ( How Heroku Survived the Amazon Outage ), SmugMug ( How SmugMug survived the Amazonpocalypse ), and SimpleGeo ( How SimpleGeo Stayed Up During the AWS Downtime ).  The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion. There's been almost none that I've seen. My guess is very few people understand what Amazon was talking about enough to comment, whereas almost everyone feels qualified to talk about the event itself. Lesson for crisis handlers: deep-dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention.

4 0.20709045 1631 high scalability-2014-04-14-How do you even do anything without using EBS?

Introduction: In a recent thread on Hacker News discussing  recent AWS price changes , seldo  mentioned that they use AWS for business; they just never use EBS. A good question was asked: How do you even do anything without using EBS? Amazon certainly makes using EBS the easiest path. And EBS has a better reliability record as of late, but it's still often recommended not to use EBS. This avoids a single point of failure at the cost of a lot of complexity, though as AWS uses EBS internally, not using EBS may not save you if you use other AWS services like RDS or ELB. If you don't want to use EBS, it's hard to know where to even start. A dilemma to which Kevin Nuckolls  gives a great answer : Well, you break your services out onto stateless and stateful machines. After that, you make sure that each of your stateful services is resilient to individual node failure. I prefer to believe that if you can't roll your entire infrastructure over to new nodes monthly then you're unprepared fo

5 0.16586497 816 high scalability-2010-04-28-Elasticity for the Enterprise -- Ensuring Continuous High Availability in a Disaster Failure Scenario

Introduction: Many enterprises' high-availability architecture is based on the assumption that you can prevent failure from happening by putting all your critical data in a centralized database, backing it up with expensive storage, and replicating it somehow between the sites. As I argued in one of my previous posts ( Why Existing Databases (RAC) are So Breakable! ) many of those assumptions are broken at their core, as storage is doomed to failure just like any other device, expensive hardware doesn't make things any better, and database replication is often not enough. One of the main lessons that we can take from the likes of Amazon and Google is that the right way to ensure continuous high availability is to design our system to cope with failure. We need to assume that what we tend to think of as unthinkable will probably happen, as that's the nature of failure. So rather than trying to prevent failures, we need to build a system that will tolerate them. As we can learn from a  recent outage

6 0.16428469 1604 high scalability-2014-03-03-The “Four Hamiltons” Framework for Mitigating Faults in the Cloud: Avoid it, Mask it, Bound it, Fix it Fast

7 0.16334963 38 high scalability-2007-07-30-Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services

8 0.15376636 289 high scalability-2008-03-27-Amazon Announces Static IP Addresses and Multiple Datacenter Operation

9 0.14475614 853 high scalability-2010-07-08-Cloud AWS Infrastructure vs. Physical Infrastructure

10 0.14168009 1348 high scalability-2012-10-26-Stuff The Internet Says On Scalability For October 26, 2012

11 0.14041287 881 high scalability-2010-08-16-Scaling an AWS infrastructure - Tools and Patterns

12 0.13988639 1398 high scalability-2013-02-04-Is Provisioned IOPS Better? Yes, it Delivers More Consistent and Higher Performance IO

13 0.13791652 96 high scalability-2007-09-18-Amazon Architecture

14 0.12979527 1133 high scalability-2011-10-27-Strategy: Survive a Comet Strike in the East With Reserved Instances in the West

15 0.12467467 789 high scalability-2010-03-05-Strategy: Planning for a Power Outage Google Style

16 0.1187285 1121 high scalability-2011-09-21-5 Scalability Poisons and 3 Cloud Scalability Antidotes

17 0.11252781 480 high scalability-2008-12-30-Scalability Perspectives #5: Werner Vogels – The Amazon Technology Platform

18 0.11127427 831 high scalability-2010-05-26-End-To-End Performance Study of Cloud Services

19 0.11035844 925 high scalability-2010-10-22-Paper: Netflix’s Transition to High-Availability Storage Systems

20 0.10753338 1331 high scalability-2012-10-02-An Epic TripAdvisor Update: Why Not Run on the Cloud? The Grand Experiment.


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.17), (1, 0.101), (2, -0.011), (3, 0.09), (4, -0.053), (5, -0.078), (6, 0.02), (7, -0.123), (8, 0.066), (9, -0.155), (10, -0.035), (11, -0.022), (12, 0.016), (13, -0.078), (14, -0.009), (15, -0.003), (16, 0.062), (17, 0.006), (18, -0.024), (19, 0.055), (20, 0.038), (21, 0.015), (22, 0.009), (23, 0.057), (24, -0.091), (25, -0.027), (26, 0.023), (27, 0.048), (28, 0.042), (29, 0.021), (30, -0.067), (31, -0.027), (32, 0.091), (33, -0.096), (34, 0.042), (35, -0.029), (36, 0.046), (37, 0.082), (38, -0.015), (39, 0.011), (40, 0.061), (41, -0.077), (42, -0.065), (43, -0.057), (44, 0.061), (45, -0.037), (46, -0.006), (47, 0.013), (48, 0.045), (49, -0.012)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.97647583 1098 high scalability-2011-08-15-Should any cloud be considered one availability zone? The Amazon experience says yes.

Introduction: Amazon has a very well-written account of their 8/8/2011 downtime: Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region . Power failed, backup generators failed to kick in, there weren't enough resources for EBS volumes to recover, API servers were overwhelmed, a DNS failure caused failovers to alternate availability zones to fail, and a double fault occurred as the power event interrupted the repair of a different bug. All kinds of typical stuff that just seems to happen. Considering the previous outage, the big question for programmers is: what does this mean? What does it mean for how systems should be structured? Have we learned something that can't be unlearned? The Amazon post has lots of good insights into how EBS and RDS work, plus lessons learned. The short of the problem is large + complex = high probability of failure. The immediate fixes are to add more resources, more redundancy, and more isolation between components, increase automation, reduce recovery times, and build software that is more aware of large-scale failure modes.

2 0.90491211 1029 high scalability-2011-04-25-The Big List of Articles on the Amazon Outage

Introduction: Please see The Updated Big List Of Articles On The Amazon Outage  for a new, improved list. So many great articles have been written on the Amazon Outage. Some aim at being helpful, some chastise developers for being so stupid, some chastise Amazon for being so incompetent, some talk about the pain they and their companies have experienced, and some even predict the downfall of the cloud. Still others say we have seen a sea change in the future of the cloud, a prediction that's hard to disagree with, though the shape of the change remains...cloudy. I'll try to keep this list updated as more information comes out. There will be a lot for developers to consider going forward. If there's a resource you think should be added, just let me know. Amazon's Explanation of What Happened Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region Hacker News thread on AWS Service Disruption Post Mortem   Quite Funny Commentary on the Summary Experiences f

3 0.88369668 1033 high scalability-2011-05-02-The Updated Big List of Articles on the Amazon Outage

Introduction: Since The Big List Of Articles On The Amazon Outage  was published we've had a few updates that people might not have seen. Amazon of course released their  Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region . Netflix shared their Lessons Learned from the AWS Outage  as did Heroku ( How Heroku Survived the Amazon Outage ), SmugMug ( How SmugMug survived the Amazonpocalypse ), and SimpleGeo ( How SimpleGeo Stayed Up During the AWS Downtime ).  The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion. There's been almost none that I've seen. My guess is very few people understand what Amazon was talking about enough to comment, whereas almost everyone feels qualified to talk about the event itself. Lesson for crisis handlers: deep-dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention.

4 0.83176845 1631 high scalability-2014-04-14-How do you even do anything without using EBS?

Introduction: In a recent thread on Hacker News discussing  recent AWS price changes , seldo  mentioned that they use AWS for business; they just never use EBS. A good question was asked: How do you even do anything without using EBS? Amazon certainly makes using EBS the easiest path. And EBS has a better reliability record as of late, but it's still often recommended not to use EBS. This avoids a single point of failure at the cost of a lot of complexity, though as AWS uses EBS internally, not using EBS may not save you if you use other AWS services like RDS or ELB. If you don't want to use EBS, it's hard to know where to even start. A dilemma to which Kevin Nuckolls  gives a great answer : Well, you break your services out onto stateless and stateful machines. After that, you make sure that each of your stateful services is resilient to individual node failure. I prefer to believe that if you can't roll your entire infrastructure over to new nodes monthly then you're unprepared fo

5 0.77842212 559 high scalability-2009-04-07-Six Lessons Learned Deploying a Large-scale Infrastructure in Amazon EC2

Introduction: Lessons learned from OpenX's large-scale deployment to Amazon EC2: Expect failures; what's more, embrace them Fully automate your infrastructure deployments Design your infrastructure so that it scales horizontally Establish clear measurable goals Be prepared to quickly identify and eliminate bottlenecks Play whack-a-mole for a while, until things get stable

6 0.71885335 1398 high scalability-2013-02-04-Is Provisioned IOPS Better? Yes, it Delivers More Consistent and Higher Performance IO

7 0.70374578 816 high scalability-2010-04-28-Elasticity for the Enterprise -- Ensuring Continuous High Availability in a Disaster Failure Scenario

8 0.70016158 1133 high scalability-2011-10-27-Strategy: Survive a Comet Strike in the East With Reserved Instances in the West

9 0.69487482 1348 high scalability-2012-10-26-Stuff The Internet Says On Scalability For October 26, 2012

10 0.68917137 1083 high scalability-2011-07-20-Netflix: Harden Systems Using a Barrel of Problem Causing Monkeys - Latency, Conformity, Doctor, Janitor, Security, Internationalization, Chaos

11 0.68821174 139 high scalability-2007-10-30-Paper: Dynamo: Amazon’s Highly Available Key-value Store

12 0.66543752 964 high scalability-2010-12-28-Netflix: Continually Test by Failing Servers with Chaos Monkey

13 0.64612263 1604 high scalability-2014-03-03-The “Four Hamiltons” Framework for Mitigating Faults in the Cloud: Avoid it, Mask it, Bound it, Fix it Fast

14 0.6437093 289 high scalability-2008-03-27-Amazon Announces Static IP Addresses and Multiple Datacenter Operation

15 0.64322186 1466 high scalability-2013-05-29-Amazon: Creating a Customer Utopia One Culture Hack at a Time

16 0.64095074 23 high scalability-2007-07-24-Major Websites Down: Or Why You Want to Run in Two or More Data Centers.

17 0.63887924 38 high scalability-2007-07-30-Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services

18 0.63788247 1470 high scalability-2013-06-05-A Simple 6 Step Transition Guide for Moving Away from X to AWS

19 0.63355035 477 high scalability-2008-12-29-100% on Amazon Web Services: Soocial.com - a lesson of porting your service to Amazon

20 0.63182062 853 high scalability-2010-07-08-Cloud AWS Infrastructure vs. Physical Infrastructure


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.097), (2, 0.184), (10, 0.103), (20, 0.185), (40, 0.02), (56, 0.034), (61, 0.078), (77, 0.03), (79, 0.14), (94, 0.041)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.93561471 370 high scalability-2008-08-18-Forum sort order

Introduction: G'day, I noticed the default sort order for the forum is to show the posts with the most replies first. That seems a bit odd for a forum. Would it not make sense to show the posts with the most recent replies first? It is possible to re-sort the forum threads that way by clicking on the "Last post" header (twice). It would seem like a more sensible default. I've checked and I see the same behaviour as both a registered (logged in) and anonymous user. Cheers - Callum.

2 0.92472285 23 high scalability-2007-07-24-Major Websites Down: Or Why You Want to Run in Two or More Data Centers.

Introduction: A lot of sites hosted in San Francisco are down because of at least 6 back-to-back power outages. More details at laughingsquid . Sites like SecondLife, Craigslist, Technorati, Yelp and all Six Apart properties, TypePad, LiveJournal and Vox are all down. The cause was an underground explosion in a transformer vault under a manhole at 560 Mission Street. Flames shot 6 feet out from the manhole cover. Over 30,000 PG&E customers are without power. What's perplexing is that the UPS backup and diesel generators didn't kick in to bring the datacenter back on line. I've never toured that datacenter, but they usually have massive backup systems. It's probably one of those multiple simultaneous failure situations that you hope never happen in real life, but too often do. Or maybe the infrastructure wasn't rolled out completely. Update: the cause was a cascade of failures in a tightly coupled system that could never happen :-) Details at Failure Happens: A summary of the power

same-blog 3 0.89229518 1098 high scalability-2011-08-15-Should any cloud be considered one availability zone? The Amazon experience says yes.

Introduction: Amazon has a very well-written account of their 8/8/2011 downtime: Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region . Power failed, backup generators failed to kick in, there weren't enough resources for EBS volumes to recover, API servers were overwhelmed, a DNS failure caused failovers to alternate availability zones to fail, and a double fault occurred as the power event interrupted the repair of a different bug. All kinds of typical stuff that just seems to happen. Considering the previous outage, the big question for programmers is: what does this mean? What does it mean for how systems should be structured? Have we learned something that can't be unlearned? The Amazon post has lots of good insights into how EBS and RDS work, plus lessons learned. The short of the problem is large + complex = high probability of failure. The immediate fixes are to add more resources, more redundancy, and more isolation between components, increase automation, reduce recovery times, and build software that is more aware of large-scale failure modes.

4 0.86028314 995 high scalability-2011-02-24-Strategy: Eliminate Unnecessary SQL

Introduction: MySQL Expert Ronald Bradford explains how one key way to improve the scalability of a MySQL server, and undoubtedly nearly every other server, is to eliminate unnecessary SQL , saying  the most efficient way to improve an SQL statement is to eliminate it : The MySQL kernel can only physically process a certain number of SQL statements for a given time period (e.g. per second). Regardless of the type of machine you have, there is a physical limit. If you eliminate SQL statements that are unwarranted and unnecessary, you automatically enable more important SQL statements to run. There are numerous other downstream affects, however this is the simple math. To run more SQL, reduce the number of SQL you need to run. Ronald shows how to use  mk-query-digest  to look at query execution times and determine which ones can be profitably whacked.  Related Articles Quora: What are the best methods for optimizing PHP/MySQL code for speed without caching?

5 0.83916342 1566 high scalability-2013-12-18-How to get started with sizing and capacity planning, assuming you don't know the software behavior?

Introduction: Here's a common situation and question from the mechanical-sympathy Google group by Avinash Agrawal on the black art of capacity planning: How to get started with sizing and capacity planning, assuming we don't know the software behavior and it's a completely new product to deal with? Gil Tene , Vice President of Technology and CTO & Co-Founder, wrote a very  understandable and useful answer  that is worth highlighting: Start with requirements. I see way too many "capacity planning" exercises that go off spending weeks measuring some irrelevant metrics about a system (like how many widgets per hour can this thing do) without knowing what they actually need it to do. There are two key sets of metrics to state here: the "how much" set and the "how bad" set: In the "How Much" part, you need to establish, based on expected business needs, Numbers for things (like connections, users, streams, transactions or messages per second) that you expect to interact with at the peak t

6 0.83472526 1615 high scalability-2014-03-19-Strategy: Three Techniques to Survive Traffic Surges by Quickly Scaling Your Site

7 0.82287896 825 high scalability-2010-05-10-Sify.com Architecture - A Portal at 3900 Requests Per Second

8 0.81991839 1596 high scalability-2014-02-14-Stuff The Internet Says On Scalability For February 14th, 2014

9 0.81187373 142 high scalability-2007-11-05-Strategy: Diagonal Scaling - Don't Forget to Scale Out AND Up

10 0.80914569 1248 high scalability-2012-05-21-Pinterest Architecture Update - 18 Million Visitors, 10x Growth,12 Employees, 410 TB of Data

11 0.80868798 1112 high scalability-2011-09-07-What Google App Engine Price Changes Say About the Future of Web Architecture

12 0.80868292 1186 high scalability-2012-02-02-The Data-Scope Project - 6PB storage, 500GBytes-sec sequential IO, 20M IOPS, 130TFlops

13 0.80812877 498 high scalability-2009-01-20-Product: Amazon's SimpleDB

14 0.80742908 289 high scalability-2008-03-27-Amazon Announces Static IP Addresses and Multiple Datacenter Operation

15 0.80731159 714 high scalability-2009-10-02-HighScalability has Moved to Squarespace.com!

16 0.80680358 1353 high scalability-2012-11-01-Cost Analysis: TripAdvisor and Pinterest costs on the AWS cloud

17 0.80621296 1041 high scalability-2011-05-15-Building a Database remote availability site

18 0.80598611 1371 high scalability-2012-12-12-Pinterest Cut Costs from $54 to $20 Per Hour by Automatically Shutting Down Systems

19 0.80492878 1626 high scalability-2014-04-04-Stuff The Internet Says On Scalability For April 4th, 2014

20 0.80479962 1159 high scalability-2011-12-19-How Twitter Stores 250 Million Tweets a Day Using MySQL