Introduction: This is a guest post (part 2, part 3) by Greg Lindahl, CTO of blekko, the spam-free search engine that had over 3.5 million unique visitors in March. Greg Lindahl was Founder and Distinguished Engineer at PathScale, where he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters.

Imagine that you're crazy enough to think about building a search engine. It's a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk -- that's several thousand 1 terabyte disks -- and produces an index that's about 100 terabytes in size. Serving query results quickly involves having most of the index in RAM or on solid state (flash) disk. If you can buy a server with 100 gigabytes of RAM for about $3,000, that's 1,000 servers at a capital cost of $3 million, plus about $1 million per year of server costs.
Putting the index into RAM on Amazon is very expensive, and only makes sense for a search engine with several percent market share.
(See http://en.wikipedia.org/wiki/Paxos_algorithm.)

Combinators

In a conventional database system, updates to the database are done in transactions, in which the program locks one or more rows of the database, makes some changes, and then commits or aborts the entire transaction.
In blekko's datastore, we rely heavily on a construct called combinators to do processing at the database cell level. A combinator is an atomic operation on a cell of a database that is associative and preferably commutative. Take addition as an example: because addition is associative and commutative, we will (eventually) get the same answer in all 3 replicas of a cell, regardless of the order in which updates arrive. The hierarchy of combining also means that the total number of transactions is dramatically reduced compared to a naive implementation, where every process talks directly to the 3 replicas of the cell and every addition operation results in 3 immediate transactions.
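To make the idea concrete, here is a minimal sketch of an add combinator in Python. The class and method names are illustrative assumptions, not blekko's actual API; a real implementation would sit inside the datastore's write and replication path.

```python
class AddCombinator:
    # Hypothetical sketch, not blekko's API: an atomic cell update
    # that is associative and commutative (cell += delta).
    def __init__(self, delta=0):
        self.delta = delta

    def merge(self, other):
        # Associativity and commutativity: merging pending deltas in any
        # order or grouping yields the same final sum.
        return AddCombinator(self.delta + other.delta)

    def apply(self, cell_value):
        return (cell_value or 0) + self.delta

# Hierarchical combining: 1,000 writers' deltas are merged locally first,
# so each replica sees one combined transaction instead of 1,000.
combined = AddCombinator(0)
for update in (AddCombinator(1) for _ in range(1000)):
    combined = combined.merge(update)

replica_cells = [combined.apply(v) for v in [0, 0, 0]]
print(replica_cells)  # [1000, 1000, 1000] -- all 3 replicas agree
```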
The Logcount Combinator

Search engines frequently need to count unique items in a set. Examples include counting the number of inlinks to a website, the number of unique geographic areas linking to a website, the number of unique Class-C IP networks linking to a website, and so on. Keeping a perfect count would require keeping a lot of data, so we invented an approximate method, which can count up to a billion things with an accuracy of ±50% in only 16 bytes.
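The post doesn't describe logcount's internals, so what follows is a hedged sketch of one classic way to get this behavior: a Flajolet-Martin-style probabilistic counter that also fits in 16 bytes (four 32-bit bitmasks). Treat it as an illustration of the technique, not blekko's actual algorithm.

```python
import hashlib

M = 4  # four 32-bit registers -> 16 bytes total

def _hash(item, salt):
    # Derive an independent 32-bit hash per register.
    digest = hashlib.md5(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:4], "little")

def add(sketch, item):
    # Set the bit whose index is the number of trailing zero bits in
    # the hash; bit k gets set with probability 2**-(k+1).
    for i in range(M):
        h = _hash(item, i)
        bit = (h & -h).bit_length() - 1 if h else 31
        sketch[i] |= 1 << min(bit, 31)

def estimate(sketch):
    # The index of the lowest unset bit grows like log2 of the unique
    # count; average it over the registers and apply the standard
    # Flajolet-Martin correction factor 0.77351.
    total = 0
    for reg in sketch:
        r = 0
        while reg & (1 << r):
            r += 1
        total += r
    return int(2 ** (total / M) / 0.77351)

sketch = [0] * M
for n in range(100000):
    add(sketch, f"url-{n}")
print(estimate(sketch))  # roughly 100000, within coarse error bounds
```

Merging two sketches is just a bitwise OR of the registers, which is associative and commutative, so logcount behaves as a proper combinator.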
The TopN Combinator

Another common search engine operation is remembering the most important N items in a set. The TopN combinator represents the top N URLs in a finite-sized array that fits into a single cell of the database. It can be updated incrementally, as we crawl new webpages, and these updates are inexpensive. It can also be read back in a single disk operation, without any indexing, sorting, or reading of data about URLs outside the top N.
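Here is a minimal sketch of that idea: a bounded set of (score, URL) pairs that merges associatively. The class name and the heap representation are assumptions for illustration; the real combinator stores a finite-sized array in one database cell.

```python
import heapq

class TopN:
    # Hypothetical sketch of a TopN combinator: keeps at most n
    # (score, item) pairs in a min-heap so the worst survivor is O(1).
    def __init__(self, n):
        self.n = n
        self.items = []

    def add(self, score, item):
        # Incremental, inexpensive update as new pages are crawled.
        if len(self.items) < self.n:
            heapq.heappush(self.items, (score, item))
        elif score > self.items[0][0]:
            heapq.heapreplace(self.items, (score, item))

    def merge(self, other):
        # Associative and commutative: any merge order keeps the same top n.
        merged = TopN(self.n)
        for score, item in self.items + other.items:
            merged.add(score, item)
        return merged

    def best(self):
        # Read back in one step; no external index or sort pass needed.
        return sorted(self.items, reverse=True)

top = TopN(3)
for score, url in [(5, "a.com"), (9, "b.com"), (2, "c.com"), (7, "d.com")]:
    top.add(score, url)
print(top.best())  # [(9, 'b.com'), (7, 'd.com'), (5, 'a.com')]
```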
So far we've used combinators as single cells in our database tables.

Meta-Combinators: The Hash Combinator

If a cell in the database contains a hash of (key,value) pairs, the hash meta-combinator can be used to atomically update only some of the (key,value) pairs, leaving the rest unchanged. This gives us considerable freedom to make the columns in the database the ones that make sense to the programmer, instead of having to promote extra things to columns just so they can be changed atomically.
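A minimal sketch of the hash meta-combinator, assuming the inner per-key values are combined with addition; the function name and calling convention are illustrative, not blekko's API.

```python
def hash_combine(cell, update, inner=lambda a, b: a + b):
    # Merge only the keys named in the update, combining each value with
    # the inner combinator; all other (key, value) pairs stay unchanged.
    # In the real datastore this merge is applied atomically to the cell.
    merged = dict(cell)
    for key, value in update.items():
        merged[key] = inner(merged[key], value) if key in merged else value
    return merged

cell = {"inlinks": 42, "crawls": 7}
cell = hash_combine(cell, {"inlinks": 1})    # bump one counter only
cell = hash_combine(cell, {"geo_areas": 3})  # introduce a new pair
print(cell)  # {'inlinks': 43, 'crawls': 7, 'geo_areas': 3}
```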
Taking The Reduce Out Of Map/Reduce

Since we represent our database tables with combinators, why not use combinators to shuffle and reduce the output of our MapReduce jobs? Then we can write MapJobs that iterate over a table in the database and write their output back into the database using combinators. A second feature of this method of wordcount is that the same map function can also be used in a streaming context, to add the wordcounts of newly crawled documents to the existing counts in the table "/wordcount".
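The article's actual wordcount MapJob isn't reproduced in this extract, so here is a hedged sketch of the pattern in Python; the datastore, comb_add, and table handles are stand-ins for blekko's real API. The point is that map output lands directly in combinator cells, so no separate shuffle/reduce phase is needed, and the same function serves both batch and streaming input.

```python
from collections import Counter

datastore = {"/wordcount": {}}  # toy stand-in for the real table

def comb_add(table, key, delta):
    # Add combinator: associative and commutative, so updates from any
    # number of mappers (or freshly crawled documents) merge in any order.
    cells = datastore[table]
    cells[key] = cells.get(key, 0) + delta

def wordcount_mapjob(doc_text):
    # Map step: count words locally, then emit combinator updates
    # straight into the table -- the "reduce" happens in the datastore.
    for word, count in Counter(doc_text.split()).items():
        comb_add("/wordcount", word, count)

# Batch: iterate over a table of crawled documents...
for doc in ["the quick brown fox", "the lazy dog"]:
    wordcount_mapjob(doc)

# ...and streaming: the same function folds in a newly crawled document.
wordcount_mapjob("the quick red fox")
print(datastore["/wordcount"]["the"])  # 3
```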
Our datastore:

- Looks like a tabular database
- Supports real-time and batch processing in the same cluster
- Supports high programmer productivity

In the next installment of this series, we'll take a more detailed look at web crawling, and at using combinators to implement a crawler.