high_scalability high_scalability-2014 high_scalability-2014-1650 knowledge-graph by maker-knowledge-mining

1650 high scalability-2014-05-19-A Short On How the Wayback Machine Stores More Pages than Stars in the Milky Way


meta infos for this blog

Source: html

Introduction: How does the Wayback Machine work? Now with over 400 billion webpages indexed, allowing the Internet to be browsed all the way back to 1996, it's an even more compelling question. I've looked several times but I've never found a really good answer. Here's some information from a thread on Hacker News. It starts with mmagin, a former Archive employee: I can't speak to their current infrastructure (though more of it is open source now - http://archive-access.sourceforge.net/projects/wayback/), but as far as the wayback machine, there was no SQL database anywhere in it. For the purposes of making the wayback machine go: Archived data was in ARC file format (predecessor to http://en.wikipedia.org/wiki/Web_ARChive), which is essentially a concatenation of separately gzipped records. That is, you can seek to a particular offset and start decompressing a record. Thus you could get at any archived web page with a triple (server, filename, file-offset). Thus it was spread...
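The ARC layout described above (separately gzipped records concatenated into one file) is what makes (filename, file-offset) addressing work: each record is an independent gzip member, so a reader can seek straight to the offset and inflate just that one member. Here is a minimal sketch of that read path in Python, assuming only the layout described in the text; read_record and its arguments are illustrative, not the Archive's actual code.

```python
import zlib

def read_record(path, offset, chunk_size=64 * 1024):
    """Decompress a single gzipped record from an ARC/WARC-style file,
    i.e. a concatenation of separately gzipped records.
    Illustrative sketch only; real ARC/WARC readers also parse record
    headers, lengths, etc."""
    out = []
    # wbits=16+MAX_WBITS tells zlib to expect a gzip wrapper and to stop
    # at the end of this one member instead of reading into the next record.
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    with open(path, "rb") as f:
        f.seek(offset)                      # jump straight to the record
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            out.append(decomp.decompress(chunk))
    return b"".join(out)                    # raw bytes of one archived record
```

Given the (server, filename, file-offset) triple from the index, playback amounts to asking that server to perform this kind of seek-and-inflate on the named file.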


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 net/projects/wayback/ ), but as far as the wayback machine, there was no SQL database anywhere in it. [sent-7, score-0.471]

2 For the purposes of making the wayback machine go: Archived data was in ARC file format (predecessor to http://en. [sent-8, score-0.632]

3 org/wiki/Web_ARChive) which is essentially a concatenation of separately gzipped records. [sent-10, score-0.158]

4 A sorted index of all the content was built that would let you look up (url) and get a list of times, or map (url, time) to (filename, file-offset). [sent-13, score-0.544]

5 It was implemented by building a sorted text file (first sorted on the url, second on the time) and sharding it across many machines by simply splitting it into N roughly equal-sized pieces. [sent-14, score-0.544]

6 Binary search across a sorted text file is surprisingly fast -- in part because the first few points you look at in the file remain cached in RAM, since you hit them frequently. [sent-15, score-0.395]

7 (Here's where I'm a little rusty) The web frontend would get a request, query the appropriate index machine. [sent-16, score-0.308]

8 ... to find out what server that (unique) filename was on, then it would request the particular record from that server. [sent-18, score-0.223]

9 I know they've done some things to keep the index more current than they did back then. [sent-20, score-0.19]

10 If you want to change the data for that row in the table, you have to write a new file in the filesystem and update the 'pointer'. [sent-28, score-0.146]

11 Playback is accomplished by binary searching a 2-level index of pointers into the WARC data. [sent-32, score-0.367]

12 The second level of this index is a 20TB compressed sorted list of (url, date, pointer) tuples called CDX records[2]. [sent-33, score-0.486]

13 The first level fits in core, and is a 13GB sorted list of every 3000th entry in the CDX index, with a pointer to the larger CDX block. [sent-34, score-0.411]

14 Index lookup works by binary searching the first level list stored in core, then HTTP range-request loading the appropriate second-level blocks from the CDX index. [sent-35, score-0.413]

15 ...org/2013/07/04/metadata-api/ sytelus asked: Why not use a hashtable instead of binary search? [sent-46, score-0.121]

16 Some Wayback Machine queries require sorted key traversal: listing all dates for which captures of a URL are available, discovering the nearest date for a URL, and listing all available URLs beginning with a certain URL prefix. [sent-49, score-0.583]

17 Maintaining the canonically-ordered master index of (URL, date, pointer) – that 20TB second-level index rajbot mentions – allows both kinds of queries to be satisfied. [sent-50, score-0.38]

18 Then loading nearest-date captures for the page's inline resources starts hitting similar ranges, as do followup clicks on outlinks or nearby dates. [sent-54, score-0.205]

19 So even though the master index is still on spinning disk (unless there was a recent big SSD upgrade that escaped my notice), the ranges-being-browsed wind up in main-memory caches quite often. [sent-55, score-0.242]

20 However, the new TV news and search capability requires substantially more space than even the archive IIRC, or certainly is heading that way. [sent-62, score-0.258]
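Sentences 4-6 above describe the original lookup path: a flat text file of (url, time, filename, offset) lines, sorted, split into N shards, and searched by binary search over raw byte offsets. A minimal sketch of that technique follows, assuming a whitespace-separated line layout; the real field order, URL canonicalization, and sharding scheme are not specified here.

```python
import os

def lookup(index_path, url):
    """Find all index lines for `url` in a sorted text index whose lines look
    like '<url> <timestamp> <filename> <offset>'.  Sketch of binary search
    over a sorted text file; the line layout is an assumption."""
    key = url.encode()
    size = os.path.getsize(index_path)
    with open(index_path, "rb") as f:
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                  # discard the partial line we landed in
            line = f.readline()           # first full line after the probe point
            if not line or line >= key:
                hi = mid                  # matches start at or before this probe
            else:
                lo = mid + 1
        f.seek(lo)
        if lo:
            f.readline()                  # realign to the next line boundary
        hits = []
        for raw in f:                     # forward scan while the URL matches
            if raw.startswith(key + b" "):
                hits.append(raw.decode().rstrip("\n"))
            elif raw > key:
                break
        return hits
```

The first few probes always land in the same handful of blocks near the middle of the file, so those blocks stay in the OS page cache, which is why binary searching a multi-gigabyte text file is "surprisingly fast" in practice.

Sentences 11-14 describe the newer two-level layout: a 13GB first level held in RAM (every 3000th CDX key plus the offset of its block) and the 20TB second-level CDX file fetched block by block with HTTP range requests. Below is a hedged sketch of that flow; index_keys, index_offsets, and cdx_url are illustrative assumptions, not the Wayback Machine's real interfaces, and the real second-level blocks are compressed rather than plain text.

```python
import bisect
import urllib.request

def cdx_lookup(index_keys, index_offsets, cdx_url, url_key):
    """Two-level CDX lookup sketch.
    index_keys    -- in-memory sorted list of every Nth CDX key (first level)
    index_offsets -- byte offset of the second-level block each key points to
    cdx_url       -- location of the big sorted CDX file (second level)
    All of these names are assumptions for illustration."""
    # First level: binary search in core to pick the block that can hold url_key.
    i = max(bisect.bisect_right(index_keys, url_key) - 1, 0)
    start = index_offsets[i]
    end = index_offsets[i + 1] - 1 if i + 1 < len(index_offsets) else ""

    # Second level: HTTP range-request just that block of the 20TB index.
    req = urllib.request.Request(cdx_url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        block = resp.read()

    # Scan the block for matching (url, date, pointer) lines.
    return [line for line in block.decode(errors="replace").splitlines()
            if line.startswith(url_key + " ")]
```

Because the CDX records are kept in canonical sorted order, the same positioning step also answers the range-style queries in sentences 16-17: all capture dates for a URL, the capture nearest to a requested date, and every URL under a given prefix are forward scans (or small bisects) from wherever the lookup lands. For example, nearest-date selection over an already-loaded sorted block reduces to a bisect; the entries list and integer timestamps below are illustrative assumptions.

```python
import bisect

def nearest_capture(entries, url, ts):
    """entries: sorted list of (url, timestamp) tuples from a CDX block,
    with timestamps as integers like 19961022154201 (an assumption).
    Return the capture of `url` closest in time to `ts`, or None."""
    i = bisect.bisect_left(entries, (url, ts))
    neighbors = entries[max(i - 1, 0):i + 1]          # at most the two nearest rows
    candidates = [e for e in neighbors if e[0] == url]
    return min(candidates, key=lambda e: abs(e[1] - ts), default=None)
```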


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('wayback', 0.471), ('cdx', 0.307), ('sorted', 0.231), ('url', 0.23), ('filename', 0.223), ('index', 0.19), ('warc', 0.184), ('archive', 0.156), ('arc', 0.123), ('binary', 0.121), ('https', 0.117), ('pointer', 0.115), ('captures', 0.093), ('archived', 0.088), ('listing', 0.087), ('dates', 0.085), ('playback', 0.085), ('file', 0.082), ('employee', 0.082), ('machine', 0.079), ('ranges', 0.077), ('former', 0.068), ('frontend', 0.065), ('list', 0.065), ('filesystem', 0.064), ('http', 0.063), ('loading', 0.06), ('date', 0.06), ('lookup', 0.058), ('essentially', 0.056), ('fyi', 0.056), ('browsed', 0.056), ('iirc', 0.056), ('brewster', 0.056), ('fathom', 0.056), ('raj', 0.056), ('searching', 0.056), ('appropriate', 0.053), ('followup', 0.052), ('gzipped', 0.052), ('escaped', 0.052), ('predecessor', 0.052), ('milky', 0.052), ('certainly', 0.052), ('thanks', 0.052), ('space', 0.05), ('crawlers', 0.05), ('amenable', 0.05), ('concatenation', 0.05), ('artifact', 0.05)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999976 1650 high scalability-2014-05-19-A Short On How the Wayback Machine Stores More Pages than Stars in the Milky Way


2 0.10487325 64 high scalability-2007-08-10-How do we make a large real-time search engine?

Introduction: We're implementing a website which should be oriented to content and with massive access by public and we would need a search engine to index and execute queries on the indexes of contents (stored in a database, most likely MySQL InnoDB or Oracle). The solution we found is to implement a separate service to make index constantly the contents of the database at regular intervals. Anyway, this is a complex and not optimal solution, since we would like it to index in real time and make it searchable. Could you point me to some examples or articles I could review to design a solution for such this context?

3 0.10124835 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day

Introduction: Facebook did it again. They've built another system capable of doing something useful with ginormous streams of realtime data. Last time we saw Facebook release their  New Real-Time Messaging System: HBase To Store 135+ Billion Messages A Month . This time it's a realtime analytics system handling over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds .  Alex Himel, Engineering Manager at Facebook,  explains what they've built  ( video ) and the scale required: Social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds.  Alex does a

4 0.096812405 1597 high scalability-2014-02-17-How the AOL.com Architecture Evolved to 99.999% Availability, 8 Million Visitors Per Day, and 200,000 Requests Per Second

Introduction: This is a guest post by Dave Hagler, Systems Architect at AOL. The AOL homepages receive more than 8 million visitors per day. That's more daily viewers than Good Morning America or the Today Show on television. Over a billion page views are served each month. AOL.com has been a major internet destination since 1996, and still has a strong following of loyal users. The architecture for AOL.com is in its 5th generation. It has essentially been rebuilt from scratch 5 times over two decades. The current architecture was designed 6 years ago. Pieces have been upgraded and new components have been added along the way, but the overall design remains largely intact. The code, tools, development and deployment processes are highly tuned over 6 years of continual improvement, making the AOL.com architecture battle tested and very stable. The engineering team is made up of developers, testers, and operations and totals around 25 people. The majority are in Dulles, Virginia

5 0.095810488 1253 high scalability-2012-05-28-The Anatomy of Search Technology: Crawling using Combinators

Introduction: This is the second guest post ( part 1 , part 3 ) of a series by Greg Lindahl, CTO of blekko, the spam free search engine. Previously, Greg was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters. What's so hard about crawling the web? Web crawlers have been around as long as the Web has -- and before the web, there were crawlers for gopher and ftp. You would think that 25 years of experience would render crawling a solved problem, but the vast growth of the web and new inventions in the technology of webspam and other unsavory content results in a constant supply of new challenges. The general difficulty of tightly-coupled parallel programming also rears its head, as the web has scaled from millions to 100s of billions of pages. Existing Open-Source Crawlers and Crawls This article is mainly going to discuss blekko's crawler and its use of combinat

6 0.095504574 889 high scalability-2010-08-30-Pomegranate - Storing Billions and Billions of Tiny Little Files

7 0.094955072 971 high scalability-2011-01-10-Riak's Bitcask - A Log-Structured Hash Table for Fast Key-Value Data

8 0.09418904 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?

9 0.094026379 1649 high scalability-2014-05-16-Stuff The Internet Says On Scalability For May 16th, 2014

10 0.090922594 307 high scalability-2008-04-21-Using Google AppEngine for a Little Micro-Scalability

11 0.088254511 1233 high scalability-2012-04-25-The Anatomy of Search Technology: blekko’s NoSQL database

12 0.087688386 658 high scalability-2009-07-17-Against all the odds

13 0.086750343 1385 high scalability-2013-01-11-Stuff The Internet Says On Scalability For January 11, 2013

14 0.086625889 705 high scalability-2009-09-16-Paper: A practical scalable distributed B-tree

15 0.085848838 1174 high scalability-2012-01-13-Stuff The Internet Says On Scalability For January 13, 2012

16 0.085011542 172 high scalability-2007-12-02-nginx: high performance smpt-pop-imap proxy

17 0.084840328 1440 high scalability-2013-04-15-Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years

18 0.08328861 1508 high scalability-2013-08-28-Sean Hull's 20 Biggest Bottlenecks that Reduce and Slow Down Scalability

19 0.082646802 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

20 0.082140259 1521 high scalability-2013-09-23-Salesforce Architecture - How they Handle 1.3 Billion Transactions a Day


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.152), (1, 0.084), (2, -0.03), (3, -0.047), (4, 0.026), (5, 0.027), (6, 0.022), (7, 0.005), (8, 0.013), (9, 0.022), (10, 0.005), (11, -0.028), (12, -0.031), (13, -0.019), (14, 0.033), (15, 0.024), (16, -0.071), (17, 0.011), (18, 0.006), (19, -0.047), (20, -0.005), (21, -0.05), (22, -0.022), (23, 0.064), (24, -0.009), (25, 0.005), (26, -0.008), (27, -0.017), (28, -0.029), (29, 0.028), (30, -0.037), (31, -0.008), (32, -0.052), (33, 0.03), (34, 0.013), (35, 0.027), (36, 0.032), (37, 0.019), (38, -0.013), (39, -0.04), (40, -0.007), (41, -0.03), (42, -0.011), (43, -0.008), (44, 0.0), (45, 0.04), (46, -0.01), (47, 0.001), (48, -0.008), (49, 0.054)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95927566 1650 high scalability-2014-05-19-A Short On How the Wayback Machine Stores More Pages than Stars in the Milky Way


2 0.78332084 825 high scalability-2010-05-10-Sify.com Architecture - A Portal at 3900 Requests Per Second

Introduction: Sify.com is one of the leading portals in India. Samachar.com is owned by the same company and is one of the top content aggregation sites in India, primarily targeting Non-resident Indians from around the world. Ramki Subramanian, an Architect at Sify, has been generous enough to describe the common back-end for both these sites. One of the most notable aspects of their architecture is that Sify does not use a traditional database. They query Solr and then retrieve records from a distributed file system. Over the years many people have argued for file systems over databases. Filesystems can work for key-value lookups, but they don't work for queries, using Solr is a good way around that problem. Another interesting aspect of their system is the use of Drools for intelligent cache invalidation. As we have more and more data duplicated in multiple specialized services, the problem of how to keep them synchronized  is a difficult one. A rules engine is a clever approach. Platfo

3 0.76584363 1268 high scalability-2012-06-20-Ask HighScalability: How do I organize millions of images?

Introduction: Does anyone have any advice or suggestions on how to store millions of images? Currently images are stored in a MS SQL database, which performance-wise isn't ideal. We'd like to migrate the images over to a file system structure but I'd assume we don't just want to dump millions of images into a single directory. Besides having to contend with naming collisions, the windows filesystem might not perform optimally with that many files. I'm assuming one approach may be to assign each user a unique CSLID, create a folder based on the CSLID and then place one user's files in that particular folder. Even so, this could result in hundreds of thousands of folders. What's the best organizational scheme/hierarchy for doing this?

4 0.75694084 50 high scalability-2007-07-31-BerkeleyDB & other distributed high performance key-value databases

Introduction: I currently use BerkeleyDB as an embedded database http://www.oracle.com/database/berkeley-db/ a decision which was initially brought on by learning that Google used BerkeleyDB for their universal sign-on feature. Lustre looks impressive, but their white paper shows speeds of 800 files created per second, as a good number. However, BerkeleyDB on my mac mini does 200,000 row creations per second, and can be used as a distributed file system. I'm having I/O scalability issues with BerkeleyDB on one machine, and about to implement their distributed replication feature (and go multi-machine), which in effect makes it work like a distributed file system, but with local access speeds. That's why I was looking at Lustre. The key feature difference between BerkeleyDB and Lustre is that BerkeleyDB has a complete copy of all the data on each computer, making it not a viable solution for massive sized database applications. However, if you have < 1TB (ie, one disk) of total pos

5 0.74341065 1233 high scalability-2012-04-25-The Anatomy of Search Technology: blekko’s NoSQL database

Introduction: This is a guest post ( part 2 , part 3 ) by Greg Lindahl, CTO of blekko, the spam free search engine that had over 3.5 million unique visitors in March. Greg Lindahl was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters. Imagine that you're crazy enough to think about building a search engine.  It's a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk -- that's several thousand 1 terabyte disks -- and produces an index that's about 100 terabytes in size. Serving query results quickly involves having most of the index in RAM or on solid state (flash) disk. If you can buy a server with 100 gigabytes of RAM for about $3,000, that's 1,000 servers at a capital cost of $3 million, plus about $1 million per year of serve

6 0.73316079 1395 high scalability-2013-01-28-DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing

7 0.73210049 1304 high scalability-2012-08-14-MemSQL Architecture - The Fast (MVCC, InMem, LockFree, CodeGen) and Familiar (SQL)

8 0.72598869 587 high scalability-2009-05-01-FastBit: An Efficient Compressed Bitmap Index Technology

9 0.71943158 1253 high scalability-2012-05-28-The Anatomy of Search Technology: Crawling using Combinators

10 0.71639341 828 high scalability-2010-05-17-7 Lessons Learned While Building Reddit to 270 Million Page Views a Month

11 0.70684642 281 high scalability-2008-03-18-Database Design 101

12 0.70108294 800 high scalability-2010-03-26-Strategy: Caching 404s Saved the Onion 66% on Server Time

13 0.69509166 609 high scalability-2009-05-28-Scaling PostgreSQL using CUDA

14 0.69210118 1096 high scalability-2011-08-10-LevelDB - Fast and Lightweight Key-Value Database From the Authors of MapReduce and BigTable

15 0.6884132 1509 high scalability-2013-08-30-Stuff The Internet Says On Scalability For August 30, 2013

16 0.68751061 435 high scalability-2008-10-30-The case for functional decomposition

17 0.68347037 986 high scalability-2011-02-10-Database Isolation Levels And Their Effects on Performance and Scalability

18 0.67845392 817 high scalability-2010-04-29-Product: SciDB - A Science-Oriented DBMS at 100 Petabytes

19 0.67220485 1333 high scalability-2012-10-04-LinkedIn Moved from Rails to Node: 27 Servers Cut and Up to 20x Faster

20 0.67138851 1508 high scalability-2013-08-28-Sean Hull's 20 Biggest Bottlenecks that Reduce and Slow Down Scalability


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.146), (2, 0.149), (10, 0.061), (26, 0.229), (40, 0.029), (61, 0.084), (76, 0.015), (77, 0.01), (79, 0.099), (85, 0.043), (91, 0.011), (94, 0.036)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.96792752 521 high scalability-2009-02-25-Enterprise Architecture Conference by - John Zachman. Johannesburg (25th March) , Cape Town (27Th March) Dubai (23rd March)

Introduction: Why You Need To Attend THIS CONFERENCE • Understand the multi-dimensional view of business-technology alignment • A sense of urgency for aggressively pursuing Enterprise Architecture • A "language" (ie., a Framework) for improving enterprise communications about architecture issues • An understanding of the cultural changes implied by process evolution. How to effectively use the framework to anchor processes and procedures for delivering service and support for applications • An understanding of basic Enterprise physics • Recommendations for the Sr. Managers to understand the political realities and organizational resistance in realizing EA vision and some excellent advices for overcoming these barriers • Number of practical examples of how to work with people who affect decisions on EA implementation • How to create value for your organization by systematically recording assets, processes, connectivity, people, timing and motivation, through a simple framework

2 0.93478578 73 high scalability-2007-08-23-Postgresql on high availability websites?

Introduction: I was looking at the pingdom infrastructure matrix (http://royal.pingdom.com/royalfiles/0702_infrastructure_matrix.pdf) and I saw that no sites are using Postgresql, and then I searched through highscalability.com and saw very few mentions of postgresql. Are there any examples of high-traffic sites that use postgresql? Does anyone have any experience with it? I'm having trouble finding good, recent studies of postgres (and postgres compared w/ mysql) online.

3 0.92005086 697 high scalability-2009-09-09-GridwiseTech revolutionizes data management

Introduction: GridwiseTech has developed AdHoc , an advanced framework for sharing geographically distributed data and compute resources. It simplifies the resource management and makes cooperation secure and effective. The premise of AdHoc is to enable each member of the associated institution to control access to his or her resources without an IT administrator’s help, and with high security level of any exposed data or applications assured. It takes 3 easy steps to establish cooperation within AdHoc: create a virtual organization, add resources and share them. The application can be implemented within any organization to exchange data and resources or between institutions to join forces for more efficient results. AdHoc was initially created for a consortium of hospitals and institutions to share medical data sets. As a technical partner in that project, GridwiseTech implemented the Security Framework to provide access to that data and designed a graphical tool to facilitate the administration

4 0.91764849 751 high scalability-2009-12-16-The most common flaw in software performance testing

Introduction: How many times have we all run across a situation where the performance tests on a piece of software pass with flying colors on the test systems only to see the software exhibit poor performance characteristics when the software is deployed in production? Read More Here...

5 0.91621679 1410 high scalability-2013-02-20-Smart Companies Fail Because they Do Everything Right - Staying Alive to Scale

Introduction: Wired has a wonderful interview  with  Clayton Christensen , author of the tech ninja's bible,  Innovator's Dilemma . Innovation is the name of the game in Silicon Valley and if you want to understand the rules of the game this article is a quick and clear way of learning. Everything is simply explained with compelling examples by the man himself. Just as every empire has fallen, every organization is open to disruption. It's the human condition to become comfortable and discount potential dangers. It takes a great deal of mindfulness to outwit and outlast the human condition. If you want to be the disruptor and avoid being the disruptee, this is good stuff. He also talks about his new book, The Capitalist's Dilemma , which addresses this puzzle: if corporations are doing so well why are individuals doing so bad? If someone can help you see a deep meaningful pattern in life then they haven't brought you a fish, they've taught you how to fish. That's what Christensen does. Here'

6 0.90966958 715 high scalability-2009-10-06-10 Ways to Take your Site from One to One Million Users by Kevin Rose

7 0.9000681 1356 high scalability-2012-11-07-Gone Fishin': 10 Ways to Take your Site from One to One Million Users by Kevin Rose

same-blog 8 0.86516398 1650 high scalability-2014-05-19-A Short On How the Wayback Machine Stores More Pages than Stars in the Milky Way

9 0.86352837 1570 high scalability-2014-01-01-Paper: Nanocubes: Nanocubes for Real-Time Exploration of Spatiotemporal Datasets

10 0.82639927 381 high scalability-2008-09-08-Guerrilla Capacity Planning and the Law of Universal Scalability

11 0.81017542 635 high scalability-2009-06-22-Improving performance and scalability with DDD

12 0.80728531 148 high scalability-2007-11-11-Linkedin architecture

13 0.80470616 425 high scalability-2008-10-22-Scalability Best Practices: Lessons from eBay

14 0.79581004 339 high scalability-2008-06-04-LinkedIn Architecture

15 0.78690261 92 high scalability-2007-09-15-The Role of Memory within Web 2.0 Architectures and Deployments

16 0.75859773 340 high scalability-2008-06-06-Economies of Non-Scale

17 0.75262111 790 high scalability-2010-03-09-Applications as Virtual States

18 0.74779469 1235 high scalability-2012-04-27-Stuff The Internet Says On Scalability For April 27, 2012

19 0.74717849 1645 high scalability-2014-05-09-Stuff The Internet Says On Scalability For May 9th, 2014

20 0.73944199 423 high scalability-2008-10-19-Alternatives to Google App Engine