How Google Taught Me to Cache and Cash-In (High Scalability, September 12, 2009)
Introduction: A user named Apathy, in a discussion of how Reddit scales some of its features, shares advice he learned while working at Google and other major companies. To be fair, I [Apathy] was working at Google at the time, and every job I held between 1995 and 2005 involved at least one of the largest websites on the planet. I didn't come up with any of these ideas; I just watched other smart people I worked with who knew what they were doing, and found (or wrote) tools that did the same things. But the theme is always the same: cache everything you can and store the rest in some sort of database (not necessarily relational and not necessarily centralized). Cache everything that doesn't change rapidly. Most of the time you don't have to hit the database for anything other than checking whether a user's new message count has transitioned from 0 to 1 or more. Cache everything--templates, user message status, the front page components--and hit the database once a minute or so to update the front page, forums, etc.
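To make that concrete, here's a minimal sketch of the pattern; the names, the renderers, and the 60-second TTL are illustrative assumptions, not code from the post:

```python
import time

# Hypothetical component cache illustrating the advice above: serve every
# component from cache, and touch the database at most once per TTL window.
CACHE_TTL_SECONDS = 60
_cache = {}  # key -> (rendered_value, unix_timestamp)

def get_component(key, render_from_db):
    """Return a cached component, re-rendering from the DB at most once
    per TTL window."""
    entry = _cache.get(key)
    if entry is not None:
        value, stamp = entry
        if time.time() - stamp < CACHE_TTL_SECONDS:
            return value              # cache hit: no database work
    value = render_from_db()          # the only place the DB is touched
    _cache[key] = (value, time.time())
    return value

# Usage: the front page is just cached blocks glued together.
# front_page = "".join(get_component(k, RENDERERS[k])
#                      for k in ("header", "links", "sidebar"))
```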
The rest of the post's key points:

Combine the previous two steps to generate a menu from cached blocks. The golden rule of website engineering is that you don't try to enforce partial ordering simultaneously with your updates.
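One way to read that rule, sketched below under assumed names (not Reddit's actual code): the write path only records new data, and a separate job imposes the ordering on its own clock.

```python
import threading
import time

votes = {}           # story_id -> score (stand-in for the database)
cached_ranking = []  # what readers see; rebuilt out of band

def record_vote(story_id, delta):
    # Write path: append/update only. No sorting happens here.
    votes[story_id] = votes.get(story_id, 0) + delta

def rebuild_ranking(interval=60):
    # Ordering path: readers never pay for this sort; it runs on a timer.
    global cached_ranking
    while True:
        snapshot = dict(votes)  # cheap copy so writes can't race the sort
        cached_ranking = sorted(snapshot, key=snapshot.get, reverse=True)
        time.sleep(interval)

threading.Thread(target=rebuild_ranking, daemon=True).start()
```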
When running a search engine, operate the crawler separately from the indexer. Ranking scores are read from the index as needed, and are usually cached for popular queries. Re-rank popular subreddits or the front page once a minute. Then cache entries 100-200 when someone bothers to visit the 5th page of a subreddit, and so on. For less-popular subreddits, cache the results until an update comes in. With enough horsepower and common sense, almost any volume of data can be managed--just not in realtime.
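A rough sketch of that two-tier policy (the chunk size and function names are assumptions): hot listings get re-ranked on a timer elsewhere; deeper pages and cold subreddits are filled in lazily and kept until an update invalidates them.

```python
CHUNK_SIZE = 100
ranking_cache = {}  # (subreddit, chunk_index) -> ordered story ids

def get_chunk(subreddit, chunk_index, rank_from_index):
    """Entries 0-100 stay hot; entries 100-200 only get computed (and
    cached) when someone actually pages that deep."""
    key = (subreddit, chunk_index)
    if key not in ranking_cache:
        start = chunk_index * CHUNK_SIZE
        ranking_cache[key] = rank_from_index(subreddit, start, start + CHUNK_SIZE)
    return ranking_cache[key]

def on_update(subreddit):
    # For a less-popular subreddit, a new post or vote just drops the
    # cached chunks; they're rebuilt the next time anyone looks.
    for key in [k for k in ranking_cache if k[0] == subreddit]:
        del ranking_cache[key]
```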
Merge all the normalized rankings and cache the output every minute or so. It's a lot cheaper to merge cached lists than to build them from scratch, and this delays the crushing read/write bottleneck at the database.
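For example, a k-way merge over pre-sorted cached lists costs far less than re-sorting everything; a sketch with assumed data shapes:

```python
import heapq
from itertools import islice

def merge_rankings(cached_lists, top_n=100):
    # Each input is a cached list of (normalized_score, item) pairs,
    # already sorted descending; reverse=True makes heapq.merge respect
    # that order, so no from-scratch sort is needed.
    merged = heapq.merge(*cached_lists, key=lambda pair: pair[0], reverse=True)
    return [item for _, item in islice(merged, top_n)]

# hot = [(0.97, "story-1"), (0.61, "story-9")]   # cached, pre-sorted
# new = [(0.88, "story-4"), (0.42, "story-7")]
# front_page = merge_rankings([hot, new])
```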
On a read, look for an exact cached match first. If that's not found, look for the components and build an exact match. The majority of traffic on almost all websites comes from the default, un-logged-in front page or from random forum/comment/result pages. If one or more of the components aren't found, regenerate those from the DB (now they're cached!). You (almost) always have to hit the database on writes. The key is to avoid hitting it for reads until you're forced to do so.
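Putting the read path together, a hedged end-to-end sketch (all names are illustrative): exact page first, then components, and the database only for components that are missing--which caches them for the next reader.

```python
page_cache = {}       # page_key -> fully rendered page
component_cache = {}  # component_key -> rendered block

def read_page(page_key, component_keys, render_from_db, assemble):
    page = page_cache.get(page_key)
    if page is not None:
        return page                        # exact match: zero DB work
    parts = []
    for key in component_keys:
        block = component_cache.get(key)
        if block is None:
            block = render_from_db(key)    # the read we were forced into
            component_cache[key] = block   # now it's cached!
        parts.append(block)
    page = assemble(parts)
    page_cache[page_key] = page            # next visitor gets an exact match
    return page

def write(db, key, value):
    db.save(key, value)     # writes (almost) always go to the database
    page_cache.clear()      # crude invalidation, enough for a sketch
```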
Related posts on High Scalability:

- A Bunch of Great Strategies for Using Memcached and MySQL Better Together (2008-08-04)
- 7 Lessons Learned While Building Reddit to 270 Million Page Views a Month (2010-05-17)
- 6 Strategies for Scaling BBC iPlayer (2010-09-28)
- Strategy: Break Up the Memcache Dog Pile (2009-08-07)
- [ANN] New Open Source Cache System (2008-12-16)
- An Epic TripAdvisor Update: Why Not Run on the Cloud? The Grand Experiment (2012-10-02)
- Friends for Sale Architecture - A 300 Million Page View-Month Facebook RoR App (2008-05-02)
- YouTube Architecture (2008-03-12)
- Poppen.de Architecture (2010-04-12)
- MySpace Architecture (2009-02-12)
- Why are Facebook, Digg, and Twitter so hard to scale? (2009-10-13)
- PlentyOfFish Architecture (2009-06-26)
- Strategy: How to Manage Sessions Using Memcached (2008-11-02)
- Gone Fishin': PlentyOfFish Architecture (2012-11-22)
- Scaling Twitter: Making Twitter 10000 Percent Faster (2009-06-27)
- Strategy: Cache Larger Chunks - Cache Hit Rate is a Bad Indicator (2010-06-04)
- We want to cache a lot :) How do we go about it ? (2008-02-12)
- Strategy: Cache Stored Procedure Results (2014-03-27)
- Six Lessons Learned the Hard Way About Scaling a Million User System (2014-04-16)
- Saving Cash Using Less Cache - 90% Savings in the Caching Tier (2012-10-24)
- Strategy: Caching 404s Saved the Onion 66% on Server Time (2010-03-26)
- Ehcache - A Java Distributed Cache (2008-07-29)
- Product: Tugela Cache (2007-12-05)
- More Troubles with Caching (2010-09-30)
- Intro to Caching, Caching algorithms and caching frameworks part 1 (2009-01-17)
- Strategy: Understanding Your Data Leads to the Best Scalability Solutions (2009-01-02)
- A Practical Guide to Varnish - Why Varnish Matters (2011-02-28)