high_scalability high_scalability-2009 high_scalability-2009-587 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric. Our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times. FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing me
sentIndex sentText sentNum sentScore
1 Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. [sent-1, score-0.486]
2 Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. [sent-2, score-0.876]
3 In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size. [sent-3, score-0.391]
4 Our old standard b-trees can be two to four times larger than the original data. [sent-5, score-0.268]
5 In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times. [sent-6, score-0.167]
6 FastBit is a software tool for searching large read-only datasets. [sent-7, score-0.158]
7 It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. [sent-8, score-1.209]
8 Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. [sent-9, score-0.97]
9 Compared with other optimal indexing methods, bitmap indices are superior because they can be efficiently combined to answer multi-dimensional queries whereas other optimal methods can not. [sent-10, score-1.442]
10 An excellent description of how FastBit works, especially compared to b-trees. [sent-13, score-0.174]
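To make the point in sentences 7 to 9 concrete, here is a minimal sketch of how per-column bitmap indexes can be combined to answer a multi-dimensional query. It is an illustration under stated assumptions, not FastBit's implementation: the bitmaps are plain, uncompressed Python integers, and the table, column names, and helper functions (build_bitmap_index, rows_of) are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical toy table stored column by column (five rows).
sender = ["alice", "bob", "alice", "carol", "bob"]
year = [2000, 2001, 2001, 2001, 2000]

# One index per column, built independently of the other column.
sender_index = build_bitmap_index(sender)
year_index = build_bitmap_index(year)

# Two-dimensional query: sender == "alice" AND year == 2001.
# Combining the two single-column indexes is one bitwise AND.
hits = sender_index["alice"] & year_index[2001]
matching_rows = rows_of(hits)

# A multi-value condition on one column is a bitwise OR of its bitmaps.
bob_or_carol = rows_of(sender_index["bob"] | sender_index["carol"])
```

Each column is indexed once, and any combination of conditions across columns reduces to bitwise AND/OR over the stored bitmaps, which is the property the excerpt credits for multi-dimensional queries. A compressed index would store the same bitmaps in compressed form; that part is left out of this sketch.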
wordName wordTfidf (topN-words)
[('fastbit', 0.672), ('bitmap', 0.421), ('compressed', 0.213), ('indices', 0.182), ('optimal', 0.147), ('smarter', 0.146), ('enron', 0.122), ('methods', 0.122), ('searching', 0.122), ('bin', 0.115), ('organizes', 0.105), ('digging', 0.102), ('original', 0.099), ('indexes', 0.096), ('compared', 0.093), ('utilizes', 0.091), ('theoretically', 0.089), ('analyses', 0.088), ('yield', 0.083), ('superior', 0.082), ('bloom', 0.079), ('analytical', 0.078), ('olap', 0.077), ('mining', 0.075), ('filters', 0.074), ('whereas', 0.071), ('queries', 0.067), ('room', 0.063), ('combined', 0.063), ('factor', 0.058), ('indexing', 0.054), ('proven', 0.054), ('four', 0.05), ('index', 0.046), ('email', 0.045), ('efficiently', 0.044), ('articles', 0.044), ('description', 0.043), ('structure', 0.043), ('answer', 0.042), ('standard', 0.042), ('larger', 0.041), ('efficient', 0.039), ('require', 0.038), ('especially', 0.038), ('strategy', 0.037), ('data', 0.037), ('old', 0.036), ('tool', 0.036), ('size', 0.035)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 587 high scalability-2009-05-01-FastBit: An Efficient Compressed Bitmap Index Technology
Introduction: Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric. Our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times. FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing me
Introduction: This is a guest post by Matt Abrams (@abramsm), from Clearspring, discussing how they are able to accurately estimate the cardinality of sets with billions of distinct elements using surprisingly small data structures. Their servers receive well over 100 billion events per month. At Clearspring we like to count things. Counting the number of distinct elements (the cardinality) of a set is a challenge when the cardinality of the set is large. To better understand the challenge of determining the cardinality of large sets let's imagine that you have a 16 character ID and you'd like to count the number of distinct IDs that you've seen in your logs. Here is an example: 4f67bfc603106cb2 These 16 characters represent 128 bits. 65K IDs would require 1 megabyte of space. We receive over 3 billion events per day, and each event has an ID. Those IDs require 384,000,000,000 bits or 45 gigabytes of storage. And that is just the space that the ID field requires! To get the
3 0.077465586 589 high scalability-2009-05-05-Drop ACID and Think About Data
Introduction: The abstract for the talk given by Bob Ippolito, co-founder and CTO of Mochi Media, Inc: Building large systems on top of a traditional single-master RDBMS data storage layer is no longer good enough. This talk explores the landscape of new technologies available today to augment your data layer to improve performance and reliability. Is your application a good fit for caches, bloom filters, bitmap indexes, column stores, distributed key/value stores, or document databases? Learn how they work (in theory and practice) and decide for yourself. Bob does an excellent job highlighting different products and the key concepts to understand when pondering the wide variety of new database offerings. It's unlikely you'll be able to say oh, this is the database for me after watching the presentation, but you will be much better informed on your options. And I imagine slightly confused as to what to do :-) An interesting observation in the talk is that the more robust products are internal
4 0.074679092 1514 high scalability-2013-09-09-Need Help with Database Scalability? Understand I-O
Introduction: This is a guest post by Zardosht Kasheff, Software Developer at Tokutek, a storage engine company that delivers 21st-Century capabilities to the leading open source data management platforms. As software developers, we value abstraction. The simpler the API, the more attractive it becomes. Arguably, MongoDB’s greatest strengths are its elegant API and its agility, which let developers simply code. But when MongoDB runs into scalability problems on big data, developers need to peek underneath the covers to understand the underlying issues and how to fix them. Without understanding, one may end up with an inefficient solution that costs time and money. For example, one may shard prematurely, increasing hardware and management costs, when a simpler replication setup would do. Or, one may increase the size of a replica set when upgrading to SSDs would suffice. This article shows how to reason about some big data scalability problems in an effort to find efficient solut
5 0.073235795 572 high scalability-2009-04-16-Paper: The End of an Architectural Era (It’s Time for a Complete Rewrite)
Introduction: Update 3: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. Update 2: H-Store: A Next Generation OLTP DBMS is the project implementing the ideas in this paper: The goal of the H-Store project is to investigate how these architectural and application shifts affect the performance of OLTP databases, and to study what performance benefits would be possible with a complete redesign of OLTP systems in light of these trends. Our early results show that a simple prototype built from scratch using modern assumptions can outperform current commercial DBMS offerings by around a factor of 80 on OLTP workloads. Update: interesting related thread on Lambda the Ultimate. A really fascinating paper bolstering many of the anti-RDBMS threads that have popped up on the intert
6 0.069109276 64 high scalability-2007-08-10-How do we make a large real-time search engine?
7 0.062738262 1151 high scalability-2011-12-05-Stuff The Internet Says On Scalability For December 5, 2011
8 0.054703284 805 high scalability-2010-04-06-Strategy: Make it Really Fast vs Do the Work Up Front
9 0.053137708 801 high scalability-2010-03-30-Running Large Graph Algorithms - Evaluation of Current State-of-the-Art and Lessons Learned
10 0.049675532 946 high scalability-2010-11-22-Strategy: Google Sends Canary Requests into the Data Mine
11 0.048576009 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
12 0.047822159 958 high scalability-2010-12-16-7 Design Patterns for Almost-infinite Scalability
13 0.047366306 1320 high scalability-2012-09-11-How big is a Petabyte, Exabyte, Zettabyte, or a Yottabyte?
14 0.044682704 810 high scalability-2010-04-14-Parallel Information Retrieval and Other Search Engine Goodness
15 0.044542491 526 high scalability-2009-03-05-Strategy: In Cloud Computing Systematically Drive Load to the CPU
16 0.043839145 1596 high scalability-2014-02-14-Stuff The Internet Says On Scalability For February 14th, 2014
17 0.043417975 187 high scalability-2007-12-14-The Current Pros and Cons List for SimpleDB
18 0.042362913 351 high scalability-2008-07-16-The Mother of All Database Normalization Debates on Coding Horror
19 0.041597448 1396 high scalability-2013-01-30-Better Browser Caching is More Important than No Javascript or Fast Networks for HTTP Performance
20 0.040711485 1438 high scalability-2013-04-10-Check Yourself Before You Wreck Yourself - Avocado's 5 Early Stages of Architecture Evolution
topicId topicWeight
[(0, 0.066), (1, 0.032), (2, -0.017), (3, -0.006), (4, 0.013), (5, 0.047), (6, -0.001), (7, -0.004), (8, 0.009), (9, 0.003), (10, 0.013), (11, -0.014), (12, -0.013), (13, 0.007), (14, 0.024), (15, 0.005), (16, -0.017), (17, -0.005), (18, 0.013), (19, 0.003), (20, 0.001), (21, -0.023), (22, -0.011), (23, 0.023), (24, -0.001), (25, 0.009), (26, -0.028), (27, 0.0), (28, 0.001), (29, 0.033), (30, -0.028), (31, 0.036), (32, -0.029), (33, 0.014), (34, 0.005), (35, -0.004), (36, 0.021), (37, 0.009), (38, -0.013), (39, -0.016), (40, 0.027), (41, 0.002), (42, -0.002), (43, 0.0), (44, 0.006), (45, 0.008), (46, -0.026), (47, -0.01), (48, 0.011), (49, 0.0)]
simIndex simValue blogId blogTitle
same-blog 1 0.92015612 587 high scalability-2009-05-01-FastBit: An Efficient Compressed Bitmap Index Technology
Introduction: Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric. Our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times. FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing me
2 0.78909111 810 high scalability-2010-04-14-Parallel Information Retrieval and Other Search Engine Goodness
Introduction: Parallel Information Retrieval is a sample chapter in what appears to be a book-in-progress titled Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher, Google Inc and Charles L. A. Clarke, Gordon V. Cormack, both of the University of Waterloo. The full table of contents is on-line and looks to be really interesting: Information retrieval is the foundation for modern search engines. This text offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each chapter includes exercises and suggestions for student projects. Currently available is the full text of chapters: Introduction, Basic Techniques, Static Inverted Indices, Index Compression, and Parallel Information Retrieval. Parallel Information Retrieval is really meaty: Information retrieval systems often have to deal
3 0.7639851 1233 high scalability-2012-04-25-The Anatomy of Search Technology: blekko’s NoSQL database
Introduction: This is a guest post (part 2, part 3) by Greg Lindahl, CTO of blekko, the spam free search engine that had over 3.5 million unique visitors in March. Greg Lindahl was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters. Imagine that you're crazy enough to think about building a search engine. It's a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk -- that's several thousand 1 terabyte disks -- and produces an index that's about 100 terabytes in size. Serving query results quickly involves having most of the index in RAM or on solid state (flash) disk. If you can buy a server with 100 gigabytes of RAM for about $3,000, that's 1,000 servers at a capital cost of $3 million, plus about $1 million per year of serve
4 0.76054978 64 high scalability-2007-08-10-How do we make a large real-time search engine?
Introduction: We're implementing a content-oriented website that will see massive public traffic, and we need a search engine to index the content (stored in a database, most likely MySQL InnoDB or Oracle) and run queries against those indexes. The solution we found is to implement a separate service that re-indexes the contents of the database at regular intervals. However, this is a complex and suboptimal solution, since we would like content to be indexed in real time and become searchable immediately. Could you point me to some examples or articles I could review to design a solution for this context?
Introduction: How does the Wayback Machine work? Now with over 400 billion webpages indexed, allowing the Internet to be browsed all the way back to 1996, it's an even more compelling question. I've looked several times but I've never found a really good answer. Here's some information from a thread on Hacker News. It starts with mmagin, a former Archive employee: I can't speak to their current infrastructure (though more of it is open source now - http://archive-access.sourceforge.net/projects/wayback/), but as far as the wayback machine, there was no SQL database anywhere in it. For the purposes of making the wayback machine go: Archived data was in ARC file format (predecessor to http://en.wikipedia.org/wiki/Web_ARChive) which is essentially a concatenation of separately gzipped records. That is, you can seek to a particular offset and start decompressing a record. Thus you could get at any archived web page with a triple (server, filename, file-offset) Thus it was spread
6 0.74545062 342 high scalability-2008-06-08-Search fast in million rows
7 0.72574514 281 high scalability-2008-03-18-Database Design 101
8 0.72257197 1395 high scalability-2013-01-28-DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing
9 0.71837658 817 high scalability-2010-04-29-Product: SciDB - A Science-Oriented DBMS at 100 Petabytes
10 0.71713251 630 high scalability-2009-06-14-kngine 'Knowledge Engine' milestone 2
11 0.70432335 986 high scalability-2011-02-10-Database Isolation Levels And Their Effects on Performance and Scalability
12 0.70132583 1304 high scalability-2012-08-14-MemSQL Architecture - The Fast (MVCC, InMem, LockFree, CodeGen) and Familiar (SQL)
13 0.6961658 1281 high scalability-2012-07-11-FictionPress: Publishing 6 Million Works of Fiction on the Web
14 0.67543441 1253 high scalability-2012-05-28-The Anatomy of Search Technology: Crawling using Combinators
15 0.67055154 1601 high scalability-2014-02-25-Peter Norvig's 9 Master Steps to Improving a Program
16 0.66977853 1509 high scalability-2013-08-30-Stuff The Internet Says On Scalability For August 30, 2013
17 0.66809362 1065 high scalability-2011-06-21-Running TPC-C on MySQL-RDS
18 0.66380072 578 high scalability-2009-04-23-Which Key value pair database to be used
19 0.66040421 351 high scalability-2008-07-16-The Mother of All Database Normalization Debates on Coding Horror
20 0.65669149 990 high scalability-2011-02-15-Wordnik - 10 million API Requests a Day on MongoDB and Scala
topicId topicWeight
[(1, 0.112), (2, 0.163), (10, 0.117), (56, 0.023), (58, 0.344), (61, 0.095)]
simIndex simValue blogId blogTitle
same-blog 1 0.80725509 587 high scalability-2009-05-01-FastBit: An Efficient Compressed Bitmap Index Technology
Introduction: Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric. Our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times. FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing me
2 0.65409601 1551 high scalability-2013-11-20-How Twitter Improved JVM Performance by Reducing GC and Faster Memory Allocation
Introduction: Netty is a high-performance NIO (New IO) client server framework for Java that Twitter uses internally as a protocol agnostic RPC system. Twitter found some problems with Netty 3's memory management for buffer allocations because it generated a lot of garbage during operation. When you send as many messages as Twitter it creates a lot of GC pressure and the simple act of zero filling newly allocated buffers consumed 50% of memory bandwidth. Netty 4 fixes this situation with: Short-lived event objects, methods on long-lived channel objects are used to handle I/O events. Specialized buffer allocator that uses pool which implements buddy memory allocation and slab allocation. The result: 5 times less frequent GC pauses: 45.5 vs. 9.2 times/min 5 times less garbage production: 207.11 vs 41.81 MiB/s The buffer pool is much faster than JVM as the size of the buffer increases. Some problems with smaller buffers. Given how many services use the JVM in thei
3 0.57251006 1074 high scalability-2011-07-06-11 Common Web Use Cases Solved in Redis
Introduction: In How to take advantage of Redis just adding it to your stack, Salvatore 'antirez' Sanfilippo shows how to solve some common problems in Redis by taking advantage of its unique data structure handling capabilities. Common Redis primitives like LPUSH, LTRIM, and LREM are used to accomplish tasks programmers need to get done, but that can be hard or slow in more traditional stores. A very useful and practical article. How would you accomplish these tasks in your framework? Show latest items listings in your home page. This is a live in-memory cache and is very fast. LPUSH is used to insert a content ID at the head of the list stored at a key. LTRIM is used to limit the number of items in the list to 5000. If the user needs to page beyond this cache only then are they sent to the database. Deletion and filtering. If a cached article is deleted it can be removed from the cache using LREM. Leaderboards and related problems. A leader board is a set sorted by score.
4 0.56924957 1386 high scalability-2013-01-14-MongoDB and GridFS for Inter and Intra Datacenter Data Replication
Introduction: This is a guest post by Jeff Behl, VP Ops @ LogicMonitor. Jeff has been a bit herder for the last 20 years, architecting and overseeing the infrastructure for a number of SaaS based companies. Data Replication for Disaster Recovery An inevitable part of disaster recovery planning is making sure customer data exists in multiple locations. In the case of LogicMonitor, a SaaS-based monitoring solution for physical, virtual, and cloud environments, we wanted copies of customer data files both within a data center and outside of it. The former was to protect against the loss of individual servers within a facility, and the latter for recovery in the event of the complete loss of a data center. Where we were: Rsync Like most everyone who starts off in a Linux environment, we used our trusty friend rsync to copy data around. Rsync is tried, true and tested, and works well when the number of servers, the amount of data, and the number of files is not horrendous.
Introduction: Toy solutions solving Twitter’s “problems” are a favorite scalability trope. Everybody has this idea that Twitter is easy. With a little architectural hand waving we have a scalable Twitter, just that simple. Well, it’s not that simple as Raffi Krikorian, VP of Engineering at Twitter, describes in his superb and very detailed presentation on Timelines at Scale. If you want to know how Twitter works - then start here. It happened gradually so you may have missed it, but Twitter has grown up. It started as a struggling three-tierish Ruby on Rails website to become a beautifully service driven core that we actually go to now to see if other services are down. Quite a change. Twitter now has 150M world wide active users, handles 300K QPS to generate timelines, and a firehose that churns out 22 MB/sec. 400 million tweets a day flow through the system and it can take up to 5 minutes for a tweet to flow from Lady Gaga’s fingers to her 31 million followers. A couple o
6 0.56707525 584 high scalability-2009-04-27-Some Questions from a newbie
8 0.56057531 792 high scalability-2010-03-10-How FarmVille Scales - The Follow-up
9 0.55848867 689 high scalability-2009-08-28-Strategy: Solve Only 80 Percent of the Problem
10 0.55676037 1331 high scalability-2012-10-02-An Epic TripAdvisor Update: Why Not Run on the Cloud? The Grand Experiment.
11 0.55157912 407 high scalability-2008-10-10-The Art of Capacity Planning: Scaling Web Resources
12 0.54727334 64 high scalability-2007-08-10-How do we make a large real-time search engine?
13 0.54714537 949 high scalability-2010-11-29-Stuff the Internet Says on Scalability For November 29th, 2010
14 0.54690588 1585 high scalability-2014-01-24-Stuff The Internet Says On Scalability For January 24th, 2014
15 0.54638004 269 high scalability-2008-03-08-Audiogalaxy.com Architecture
16 0.54472011 302 high scalability-2008-04-10-Mysql scalability and failover...
17 0.54428875 1521 high scalability-2013-09-23-Salesforce Architecture - How they Handle 1.3 Billion Transactions a Day
18 0.5441733 950 high scalability-2010-11-30-NoCAP – Part III – GigaSpaces clustering explained..
19 0.54256439 928 high scalability-2010-10-26-Scaling DISQUS to 75 Million Comments and 17,000 RPS
20 0.54100436 1046 high scalability-2011-05-23-Evernote Architecture - 9 Million Users and 150 Million Requests a Day