high_scalability high_scalability-2010 high_scalability-2010-900 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: As the Kings of scaling, when Google changes its search infrastructure over to do something completely different, it's news. In "Google search index splits with MapReduce," an exclusive interview by Cade Metz with Eisar Lipkovitz, a senior director of engineering at Google, we learn a bit more of the secret scaling sauce behind Google Instant, Google's new, faster, real-time search system. The challenge for Google has been how to support a real-time world when the core of their search technology, the famous MapReduce, is batch oriented. Simple: they got rid of MapReduce. At least, they got rid of MapReduce as the backbone for calculating search indexes. MapReduce still excels as a general query mechanism against masses of data, but real-time search requires a very specialized tool, and Google built one. Internally, the successor to Google's famed Google File System was code-named Colossus. Details are slowly coming out about their new goals and approach: Goal is to update the
3 " When a page is crawled, changes are made incrementally to the web map stored in BigTable, using something like database triggers. [sent-10, score-0.33]
The use of triggers is interesting because triggers are largely ignored in production systems. In a relational database, triggers are integrity checks that are executed on a record every time that record is written. The idea is that if the data is checked for problems before it is written into the database, then your data will always be correct and the database can be a pure source of validated facts.
For example, an account balance could be checked for a negative number. If the balance were negative, the write would fail, the transaction would be aborted, and the database would maintain a correct state.
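To make the trigger idea concrete, here is a minimal sketch using SQLite, which supports triggers out of the box; the table and trigger names are invented for illustration:

```python
import sqlite3

# Minimal sketch of the balance check described above, using SQLite.
# Table and trigger names are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);

    -- Integrity check: reject any write that would leave a negative balance.
    CREATE TRIGGER no_negative_balance
    BEFORE UPDATE OF balance ON accounts
    WHEN NEW.balance < 0
    BEGIN
        SELECT RAISE(ABORT, 'balance cannot go negative');
    END;
""")
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")

try:
    # This would drive the balance to -50, so the trigger aborts the write.
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
except sqlite3.IntegrityError as err:
    print("write rejected:", err)

# The database still holds the last validated state.
print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())  # (100,)
```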
Since triggers happen on every write, they slow down the write path, often to the point of killing database performance. Not all integrity checks can be made from data inside the database, so database-oriented checks end up in triggers while checks that, say, access a 3rd party system stay in application code. So now we have checks in multiple places, which leads to exactly the kind of update anomalies in code that, ironically, relational databases are meant to prevent in data. The database CPU becomes dedicated to running triggers instead of "real" database operations, which slows down the entire database.
Application code also tends to move into triggers over time: since triggers fire on writes (changes), in a single common place, they are a very natural place to do all the things that need to be done to data. It's easy to imagine building secondary indexes from triggers, or synchronizing to backup systems and other 3rd party systems.
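As a hypothetical sketch of the secondary-index case (again in SQLite; all names invented), a trigger can keep an index table in sync on every insert:

```python
import sqlite3

# Hypothetical sketch (names invented): a trigger that keeps a secondary
# index table in sync with the primary table on every insert.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE users_by_email (email TEXT PRIMARY KEY, user_id INTEGER);

    CREATE TRIGGER users_maintain_index AFTER INSERT ON users
    BEGIN
        INSERT INTO users_by_email (email, user_id) VALUES (NEW.email, NEW.id);
    END;
""")
conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")
print(conn.execute("SELECT * FROM users_by_email").fetchall())
# [('alice@example.com', 1)]
```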
Instead of application code being executed in parallel on horizontally scalable nodes, it's all centralized on the database server, and the database becomes the bottleneck. So triggers tend not to be used in OLTP scenarios. Yet triggers are ideal in an event-oriented world, because the database is the one central place where changes are recorded.
It sounds like Google may have made a specialized database where triggers are efficient by design.
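The article doesn't describe the mechanism, but here is one hedged guess at the shape of it: rather than running synchronously inside the write path, "triggers" become observers that are notified after a write and run asynchronously, possibly cascading further incremental updates. This is loosely the notification/observer design described in the Percolator paper referenced below; every name in this sketch is invented:

```python
from collections import defaultdict

# Purely illustrative sketch (all names invented) of a store where triggers
# are cheap by design: observers subscribe to columns and run asynchronously
# after a write, instead of synchronously inside the write path.
class TriggeredStore:
    def __init__(self):
        self.rows = defaultdict(dict)
        self.observers = defaultdict(list)  # column -> callbacks
        self.pending = []                   # work queue drained off the write path

    def observe(self, column, callback):
        self.observers[column].append(callback)

    def write(self, row, column, value):
        self.rows[row][column] = value
        # Only record a notification here; observers run later, not inside the write.
        for cb in self.observers[column]:
            self.pending.append((cb, row, value))

    def drain(self):
        while self.pending:
            cb, row, value = self.pending.pop(0)
            cb(self, row, value)  # an observer may itself write, cascading updates

store = TriggeredStore()
# When a crawled page changes, incrementally update a toy "index" column.
store.observe("content", lambda s, row, v: s.write(row, "index", v.lower().split()))
store.write("example.com/page", "content", "Colossus Makes Search Real-time")
store.drain()
print(store.rows["example.com/page"]["index"])
# ['colossus', 'makes', 'search', 'real-time']
```

The key design choice in this sketch is that the write itself only enqueues a notification, so the write path stays fast no matter how much downstream work the observers do.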
What we have is a sort of Internet DOM, analogous to the browser DOMs that have made it possible to build such incredibly powerful browser UIs, but for the web as a whole. Imagine if the Web were represented by an Internet DOM that you could program, that you could write to, and that you could add code to, just like JavaScript is added to the browser DOM.
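Nothing like this API exists; the sketch below is purely hypothetical, just the analogy made concrete: attach handlers to web resources the way JavaScript attaches event listeners to browser DOM nodes:

```python
# Purely hypothetical sketch of the "Internet DOM" idea. None of this API
# exists; every name is invented to make the analogy concrete.
class InternetDOM:
    def __init__(self):
        self.handlers = {}

    def add_event_listener(self, url_prefix, event, handler):
        self.handlers.setdefault((url_prefix, event), []).append(handler)

    def dispatch(self, url, event, payload):
        # Fire every handler whose prefix and event match the changed resource.
        for (prefix, ev), hs in self.handlers.items():
            if ev == event and url.startswith(prefix):
                for h in hs:
                    h(url, payload)

web = InternetDOM()
web.add_event_listener("https://example.com/", "crawled",
                       lambda url, page: print("re-index", url))
web.dispatch("https://example.com/news", "crawled", "<html>...</html>")
```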
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 900 high scalability-2010-09-11-Google's Colossus Makes Search Real-time by Dumping MapReduce
2 0.14932698 910 high scalability-2010-09-30-Facebook and Site Failures Caused by Complex, Weakly Interacting, Layered Systems
Introduction: Facebook has been so reliable that when a site outage does occur it's a definite learning opportunity. Fortunately for us we can learn something because in More Details on Today's Outage , Facebook's Robert Johnson gave a pretty candid explanation of what caused a rare 2.5 hour period of down time for Facebook. It wasn't a simple problem. The root causes were feedback loops and transient spikes caused ultimately by the complexity of weakly interacting layers in modern systems. You know, the kind everyone is building these days. Problems like this are notoriously hard to fix and finding a real solution may send Facebook back to the whiteboard. There's a technical debt that must be paid. The outline and my interpretation (reading between the lines) of what happened is: Remember that Facebook caches everything . They have 28 terabytes of memcached data on 800 servers. The database is the system of record, but memory is where the action is. So when a problem happens that i
3 0.14072196 151 high scalability-2007-11-12-a8cjdbc - Database Clustering via JDBC
Introduction: Practically any software project nowadays could not survive without a database (DBMS) backend storing all the business data that is vital to you and/or your customers. When projects grow larger, the amount of data usually grows larger exponentially. So you start moving the DBMS to a separate server to gain more speed and capacity. Which is all good and healthy but you do not gain any extra safety for this business data. You might be backing up your database once a day so in case the database server crashes you don't lose EVERYTHING, but how much can you really afford to lose? Well clearly this depends on what kind of data you are storing. In our case the users of our solutions use our software products to do their everyday (all day) work. They have "everything" they need for their business stored in the database we are providing. So is 24 hours of data loss acceptable? No, not really. One hour? Maybe. But what we really want is a second database running with the EXACT same data. We
4 0.13000134 899 high scalability-2010-09-09-How did Google Instant become Faster with 5-7X More Results Pages?
Introduction: We don't have a lot of details on how Google pulled off their technically very impressive Google Instant release, but in Google Instant behind the scenes , they did share some interesting facts: Google was serving more than a billion searches per day. With Google Instant they served 5-7X more results pages than previously. Typical search results were returned in less than a quarter of second. A team of 50+ worked on the project for an extended period of time. Although Google is associated with muscular data centers, they just didn't throw more server capacity at the problem, they worked smarter too. What were their general strategies? Increase backend server capacity. Add new caches to handle high request rates while keeping results fresh while the web is continuously crawled and re-indexed. Add User-state data to the back-ends to keep track of the results pages already shown to a given user, preventing the same results from being re-fetched repeatedly. Optim
5 0.12869038 65 high scalability-2007-08-16-Scaling Secret #2: Denormalizing Your Way to Speed and Profit
Introduction: Alan Watts once observed how after we accepted Descartes' separation of the mind and body we've been trying to smash them back together again ever since when really they were never separate to begin with. The database normalization-denormalization dualism has the same mobius shaped reverberations as Descartes' error. We separate data into a million jagged little pieces and then spend all our time stooping over, picking them and up, and joining them back together again. Normalization has been standard practice now for decades. But times are changing. Many mega-website architects are concluding Watts was right: the data was never separate to begin with. And even more radical, we may even need to store multiple copies of data. Information Sources Normalization Is for Sissies by Pat Helland Data normalization, is it really that good? by Arnon Rotem-Gal-Oz When Not to Normalize your SQL Database by Dare Obasanjo MegaData by Joe Gregorio Audio
6 0.1162307 448 high scalability-2008-11-22-Google Architecture
8 0.11086855 538 high scalability-2009-03-16-Are Cloud Based Memory Architectures the Next Big Thing?
11 0.10954894 936 high scalability-2010-11-09-Facebook Uses Non-Stored Procedures to Update Social Graphs
12 0.10900562 517 high scalability-2009-02-21-Google AppEngine - A Second Look
13 0.10702323 1094 high scalability-2011-08-08-Tagged Architecture - Scaling to 100 Million Users, 1000 Servers, and 5 Billion Page Views
14 0.10409804 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?
15 0.10350174 1240 high scalability-2012-05-07-Startups are Creating a New System of the World for IT
16 0.10290854 1565 high scalability-2013-12-16-22 Recommendations for Building Effective High Traffic Web Software
17 0.10255167 1508 high scalability-2013-08-28-Sean Hull's 20 Biggest Bottlenecks that Reduce and Slow Down Scalability
18 0.10131013 1501 high scalability-2013-08-13-In Memoriam: Lavabit Architecture - Creating a Scalable Email Service
19 0.100178 1107 high scalability-2011-08-29-The Three Ages of Google - Batch, Warehouse, Instant
20 0.099278964 933 high scalability-2010-11-01-Hot Trend: Move Behavior to Data for a New Interactive Application Architecture
simIndex simValue blogId blogTitle
same-blog 1 0.97510809 900 high scalability-2010-09-11-Google's Colossus Makes Search Real-time by Dumping MapReduce
2 0.7998662 899 high scalability-2010-09-09-How did Google Instant become Faster with 5-7X More Results Pages?
3 0.78756797 1395 high scalability-2013-01-28-DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing
Introduction: This is an interview with Gabriel Weinberg , founder of Duck Duck Go and general all around startup guru , on what DDG’s architecture looks like in 2012. Innovative search engine upstart DuckDuckGo had 30 million searches in February 2012 and averages over 1 million searches a day. It’s being positioned by super investor Fred Wilson as a clean, private, impartial and fast search engine. After talking with Gabriel I like what Fred Wilson said earlier, it seems closer to the heart of the matter: We invested in DuckDuckGo for the Reddit, Hacker News anarchists . Choosing DuckDuckGo can be thought of as not just a technical choice, but a vote for revolution. In an age when knowing your essence is not about about love or friendship, but about more effectively selling you to advertisers, DDG is positioning themselves as the do not track alternative , keepers of the privacy flame . You will still be monetized of course, but in a more civilized and an
4 0.76150995 216 high scalability-2008-01-17-Database People Hating on MapReduce
Introduction: Update: Typical Programmer tackles the technical issues in Relational Database Experts Jump The MapReduce Shark . The culture clash is still what fascinates me. David DeWitt writes in the Database Column that MapReduce is a major step backwards: A giant step backward in the programming paradigm for large-scale data intensive applications A sub-optimal implementation, in that it uses brute force instead of indexing Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago Missing most of the features that are routinely included in current DBMS Incompatible with all of the tools DBMS users have come to depend on Listening to databasers and map reducers talk is like eavesdropping on your average family holiday mashup. Every holiday people who have virtually nothing in common are thrown together because they incidentally share a little DNA or are married to the shared DNA. In desperation everyone gravitates to som
Introduction: This paper, Large-scale Incremental Processing Using Distributed Transactions and Notifications by Daniel Peng and Frank Dabek, is Google's much anticipated description of Percolator, their new real-time indexing system. The abstract: Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency. We have built Percolator, a system f
7 0.73323947 1601 high scalability-2014-02-25-Peter Norvig's 9 Master Steps to Improving a Program
8 0.73060161 871 high scalability-2010-08-04-Dremel: Interactive Analysis of Web-Scale Datasets - Data as a Programming Paradigm
9 0.72543091 1253 high scalability-2012-05-28-The Anatomy of Search Technology: Crawling using Combinators
10 0.72422761 1233 high scalability-2012-04-25-The Anatomy of Search Technology: blekko’s NoSQL database
11 0.71701241 1124 high scalability-2011-09-26-17 Techniques Used to Scale Turntable.fm and Labmeeting to Millions of Users
12 0.70607787 258 high scalability-2008-02-24-Yandex Architecture
13 0.7030673 946 high scalability-2010-11-22-Strategy: Google Sends Canary Requests into the Data Mine
14 0.68933409 1107 high scalability-2011-08-29-The Three Ages of Google - Batch, Warehouse, Instant
15 0.68504506 815 high scalability-2010-04-27-Paper: Dapper, Google's Large-Scale Distributed Systems Tracing Infrastructure
16 0.68441719 201 high scalability-2008-01-04-For $5 Million You Can Buy Enough Storage to Compete with Google
18 0.68185824 810 high scalability-2010-04-14-Parallel Information Retrieval and Other Search Engine Goodness
19 0.68027592 1032 high scalability-2011-05-02-Stack Overflow Makes Slow Pages 100x Faster by Simple SQL Tuning
20 0.67660344 1143 high scalability-2011-11-16-Google+ Infrastructure Update - the JavaScript Story
simIndex simValue blogId blogTitle
1 0.95963681 544 high scalability-2009-03-18-QCon London 2009: Upgrading Twitter without service disruptions
Introduction: Evan Weaver from Twitter presented a talk on Twitter software upgrades, titled Improving running components as part of the Systems that never stop track at QCon London 2009 conference last Friday. The talk focused on several upgrades performed since last May, while Twitter was experiencing serious performance problems.
2 0.9534291 705 high scalability-2009-09-16-Paper: A practical scalable distributed B-tree
Introduction: We've seen a lot of NoSQL action lately built around distributed hash tables. Btrees are getting jealous. Btrees, once the king of the database world, want their throne back. Paul Buchheit surfaced a paper: A practical scalable distributed B-tree by Marcos K. Aguilera and Wojciech Golab, that might help spark a revolution. From the Abstract: We propose a new algorithm for a practical, fault tolerant, and scalable B-tree distributed over a set of servers. Our algorithm supports practical features not present in prior work: transactions that allow atomic execution of multiple operations over multiple B-trees, online migration of B-tree nodes between servers, and dynamic addition and removal of servers. Moreover, our algorithm is conceptually simple: we use transactions to manipulate B-tree nodes so that clients need not use complicated concurrency and locking protocols used in prior work. To execute these transactions quickly, we rely on three techniques: (1) We use optimistic
same-blog 3 0.94854343 900 high scalability-2010-09-11-Google's Colossus Makes Search Real-time by Dumping MapReduce
Introduction: This is guest post by Michael DeHaan (@laserllama), a software developer and architect, on Ansible , a simple deployment, model-driven configuration management, and command execution framework. I owe High Scalability a great deal of credit for the idea behind my latest software project. I was reading about how an older tool I helped create, Func, was used at Tumblr , and it kicked some ideas into gear. This article is about what happened from that idea. My observation, which the article reinforced, was that many shops end up using a configuration management tool (Puppet, Chef, cfengine), a separate deployment tool (Capistrano, Fabric) and yet another separate ad-hoc task execution tool (Func, pssh, etc) because one class of tool historically hasn't been good at all three jobs. My other observation (not from the article) was that the whole "infrastructure as code" movement, while revolutionary, and definitely great for many, was probably secretly grating on a good number of
5 0.94457382 1141 high scalability-2011-11-11-Stuff The Internet Says On Scalability For November 11, 2011
Introduction: You got performance in my scalability! You got scalability in my performance! Two great tastes that taste great together: Quotable quotes: @jasoncbooth : Tired of the term #nosql. I would like to coin NRDS (pronounced "nerds"), standing for Non Relational Data Store. @zenfeed : One lesson I learn about scalability, is that it has a LOT to do with simplicity and consistency. Ray Walters : Quad-core chips in mobile phones is nothing but a marketing snow job Flickr: Real-time Updates on the Cheap for Fun and Profit . How Flickr added real-time push feed on the cheap. Events happen all over Flickr, uploads and updates (around 100/s depending on the time of day), all of them inserting tasks. Implemented with Cache, Tasks, & Queues: PubSubHubbub; Async task system Gearman; use async EVERYWHERE; use Redis Lists for queues; cron to consume events off the queue; Cloud Event Processing - Big Data, Low Latency Use Cases at LinkedIn by Colin Clark. It
6 0.94312656 835 high scalability-2010-06-03-Hot Scalability Links for June 3, 2010
7 0.94004983 883 high scalability-2010-08-20-Hot Scalability Links For Aug 20, 2010
8 0.93059862 1483 high scalability-2013-06-27-Paper: XORing Elephants: Novel Erasure Codes for Big Data
9 0.93028837 1097 high scalability-2011-08-12-Stuff The Internet Says On Scalability For August 12, 2011
10 0.92934394 666 high scalability-2009-07-30-Learn How to Think at Scale
11 0.91501772 28 high scalability-2007-07-25-Product: NetApp MetroCluster Software
12 0.89101887 1594 high scalability-2014-02-12-Paper: Network Stack Specialization for Performance
13 0.89024931 1240 high scalability-2012-05-07-Startups are Creating a New System of the World for IT
14 0.8892585 38 high scalability-2007-07-30-Build an Infinitely Scalable Infrastructure for $100 Using Amazon Services
15 0.88703603 865 high scalability-2010-07-27-A Metric A$$-Ton of Joe Stump: The Cloud is Cheaper than Bare Metal
16 0.8854475 327 high scalability-2008-05-27-How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale
17 0.88489407 750 high scalability-2009-12-16-Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud
19 0.88442057 717 high scalability-2009-10-07-How to Avoid the Top 5 Scale-Out Pitfalls
20 0.88430989 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)