high_scalability high_scalability-2011 high_scalability-2011-1035 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: With BigData comes BigStorage costs. One way to store less is simply not to store the same data twice. That's the radically simple and powerful notion behind data deduplication. If you are one of those who got a good laugh out of the idea of eliminating SQL queries as a rather obvious scalability strategy, you'll love this one, but it is a powerful feature and one I don't hear talked about outside the enterprise. A parallel idea in programming is the once-and-only-once principle of never duplicating code. Using deduplication technology, for some upfront CPU usage, which is a plentiful resource in many systems that are IO bound anyway, it's possible to reduce storage requirements by up to 20:1, depending on your data, which saves both money and disk write overhead. This comes up because of a really good article Robin Harris of StorageMojo wrote, All de-dup works, on the paper A Study of Practical Deduplication by Dutch Meyer and William Bolosky. For a great explanation o
sentIndex sentText sentNum sentScore
1 One way to store less is simply not to store the same data twice. [sent-2, score-0.217]
2 That's the radically simple and powerful notion behind data deduplication. [sent-3, score-0.847]
3 Using deduplication technology, for some upfront CPU usage, which is a plentiful resource in many systems that are IO bound anyway, it's possible to reduce storage requirements by up to 20:1, depending on your data, which saves both money and disk write overhead. [sent-6, score-0.857]
4 Chunks of data -- files, blocks, or byte ranges -- are checksummed using some hash function that uniquely identifies data with very high probability. [sent-9, score-0.354]
5 When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77 or, in more familiar notation, a decimal point followed by 76 zeros and a 1. [sent-10, score-0.236]
6 Chunks of data are remembered in a table of some sort that maps the data's checksum to its storage location and reference count. [sent-13, score-0.368]
7 When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data (see the sketch after this list). [sent-14, score-0.521]
8 When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples. [sent-15, score-1.024]
9 What the paper is saying, for their type of data at least, is that simply dealing with files is nearly as effective as more complex block deduplication schemes. [sent-16, score-1.035]
10 From the paper: We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. [sent-17, score-0.22]
11 We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of redundancy. [sent-18, score-0.231]
12 We found that whole-file deduplication achieves about three quarters of the space savings of the most aggressive block-level deduplication for storage of live file systems, and 87% of the savings for backup images. [sent-19, score-1.988]
13 We also studied file fragmentation, finding that it is not prevalent, and updated prior file system metadata studies, finding that the distribution of file sizes continues to skew toward very large unstructured files. [sent-20, score-0.659]
14 Indirection and hashing are still the greatest weapons of computer science, which leads to some telling caveats from Robin: De-duplication trades direct access to save data capacity. [sent-21, score-0.23]
15 If it's at the storage layer, in the cloud, that buys you nothing, because all the cost savings simply go to the cloud provider: you will be billed for the size of the data you want to store, not the data actually stored. [sent-25, score-0.479]
16 And all these calculations must be applied to non-encrypted data, so the implication is that deduplication should happen in user space somewhere. [sent-26, score-0.206]
17 Why shouldn't they just store entire files and let the underlying file system or storage device figure out what's common? [sent-29, score-0.347]
18 With the mass of unstructured data rising, it seems like there's an opportunity here. [sent-31, score-0.191]
19 Related Articles: ZFS Deduplication by Jeff Bonwick; Cloud Based Deduplication Quick Start Guide; Do any common OS file systems use hashes to avoid storing the same content data more than once? [sent-32, score-0.283]
20 Multi-level comparison of data deduplication in a backup scenario by Dirk Meister and André Brinkmann. [sent-34, score-0.923]
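To make the mechanism described in sentences 4-8 concrete, here is a minimal sketch of a content-addressed store with reference counting, in Python. It illustrates the general technique, not the implementation from the paper or from any particular product: the SHA-256 checksums, the in-memory dict used as the index, and the 4 KiB fixed block size are assumptions made for the example; real systems persist the index and often use content-defined chunking.

```python
# Minimal sketch of checksum-based deduplication with reference counting.
# Assumptions for illustration only: SHA-256 checksums, an in-memory dict as
# the chunk index, and 4 KiB fixed-size blocks.
import hashlib

class DedupStore:
    """Maps a chunk's checksum to its stored bytes and a reference count."""

    def __init__(self):
        self.index = {}  # checksum -> {"data": bytes, "refs": int}

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        entry = self.index.get(key)
        if entry is not None:
            entry["refs"] += 1       # duplicate: no new space, just bump the refcount
        else:
            self.index[key] = {"data": chunk, "refs": 1}
        return key

    def delete(self, key: str) -> None:
        entry = self.index[key]
        entry["refs"] -= 1
        if entry["refs"] == 0:       # last reference gone, reclaim the space
            del self.index[key]

def store_whole_file(store: DedupStore, content: bytes) -> list:
    """Whole-file dedup: one checksum per file."""
    return [store.put(content)]

def store_file_blocks(store: DedupStore, content: bytes, block: int = 4096) -> list:
    """Block-level dedup: checksum fixed-size blocks so files that are only
    partially identical can still share storage."""
    return [store.put(content[i:i + block]) for i in range(0, len(content), block)]

if __name__ == "__main__":
    store = DedupStore()
    image_a = b"A" * 8192 + b"unique tail A"
    image_b = b"A" * 8192 + b"unique tail B"  # shares its first two blocks with image_a
    store_file_blocks(store, image_a)
    store_file_blocks(store, image_b)
    stored = sum(len(e["data"]) for e in store.index.values())
    print("logical bytes:", len(image_a) + len(image_b), "stored bytes:", stored)
```

The whole-file versus block-level trade-off the paper measures falls directly out of the two helpers: store_whole_file only deduplicates exact copies, while store_file_blocks also shares the identical 4 KiB prefixes of the two images above, at the cost of a larger index.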
wordName wordTfidf (topN-words)
[('deduplication', 0.742), ('dedup', 0.147), ('file', 0.115), ('store', 0.112), ('savings', 0.107), ('data', 0.105), ('space', 0.101), ('reference', 0.098), ('hash', 0.091), ('eliminating', 0.09), ('unstructured', 0.086), ('robin', 0.085), ('backup', 0.076), ('jeff', 0.068), ('weapons', 0.067), ('bigstorage', 0.067), ('revisited', 0.067), ('files', 0.065), ('dutch', 0.063), ('meyer', 0.063), ('bonwick', 0.063), ('efficacy', 0.063), ('diffs', 0.063), ('elimination', 0.063), ('percentages', 0.063), ('undetected', 0.063), ('ecc', 0.063), ('hashes', 0.063), ('increments', 0.063), ('paper', 0.062), ('dealing', 0.061), ('prevalent', 0.06), ('studied', 0.06), ('upto', 0.06), ('caveats', 0.058), ('quarters', 0.058), ('finding', 0.057), ('checksum', 0.056), ('william', 0.056), ('rethinking', 0.056), ('laugh', 0.056), ('storage', 0.055), ('billed', 0.054), ('remembered', 0.054), ('collision', 0.054), ('skew', 0.054), ('duplicating', 0.053), ('identifies', 0.053), ('buys', 0.053), ('zfs', 0.052)]
simIndex simValue blogId blogTitle
same-blog 1 1.0000001 1035 high scalability-2011-05-05-Paper: A Study of Practical Deduplication
2 0.16025738 1596 high scalability-2014-02-14-Stuff The Internet Says On Scalability For February 14th, 2014
Introduction: Hey, it's HighScalability time: Climbing the World's Second Tallest Building 5 billion : Number of phone records NSA collects per day; Facebook : 1.23 billion users, 201.6 billion friend connections, 400 billion shared photos, and 7.8 trillion messages sent since the start of 2012. Quotable Quotes: @ShrikanthSS : people repeatedly underestimate the cost of busy waits @mcclure111 : Learning today java.net.URL.equals is a blocking operation that hits the network shook me badly. I don't know if I can trust the world now. @hui_kenneth : @randybias: “3 ways 2 be market leader - be 1st, be best, or be cheapest. #AWS was all 3. Now #googlecloud may be best & is the cheapest.” @thijs : The nice thing about Paper is that we can point out to clients that it took 18 experienced designers and developers two years to build. @neil_conway : My guess is that the split between Spanner and F1 is a great example of Conway's Law. How F
3 0.15682703 992 high scalability-2011-02-18-Stuff The Internet Says On Scalability For February 18, 2011
Introduction: Submitted for your reading pleasure on this cold and rainy Friday... Quotable Quotes: CarryMillsap : You can't hardware yourself out of a performance problem you softwared yourself into. @juokaz : schema-less databases doesn't mean data should have no structure Scalability Porn: 3 Months To The First Million Users, Just 6 Weeks To The Second Million For Instagram Study by the USC Annenberg School for Communication & Journalism estimates: in 2007, humankind was able to store 2.9 × 10^20 optimally compressed bytes, communicate almost 2 × 10^21 bytes, and carry out 6.4 × 10^18 instructions per second on general-purpose computers. Hadoop has hit a scalability limit at a whopping 4,000 machines and are looking to create the next generation architecture. Their target is clusters of 10,000 machines and 200,000 cores. The fundamental idea of the re-architecture is to divide the two major functions of the Job Tracker, resource management and job sc
4 0.11292757 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge
Introduction: Mobile developers have a huge scaling problem ahead: doing something useful with massive continuous streams of telemetry data from millions and millions of devices. This is a really good problem to have. It means smartphone sales are finally fulfilling their destiny: slaughtering PCs in the sales arena. And it also means mobile devices aren't just containers for simple standalone apps anymore, they are becoming the dominant interface to giant backend systems. While developers are now rocking mobile development on the client side, their next challenge is how to code those tricky backend bits. A company facing those same exact problems right now is Medialets , a mobile rich media ad platform. What they do is help publishers create high quality interactive ads, though for our purposes their ad stuff isn't that interesting. What I did find really interesting about their system is how they are tackling the problem of defeating the mobile device data deluge. Each day Medialets munc
5 0.11049142 589 high scalability-2009-05-05-Drop ACID and Think About Data
Introduction: The abstract for the talk given by Bob Ippolito, co-founder and CTO of Mochi Media, Inc: Building large systems on top of a traditional single-master RDBMS data storage layer is no longer good enough. This talk explores the landscape of new technologies available today to augment your data layer to improve performance and reliability. Is your application a good fit for caches, bloom filters, bitmap indexes, column stores, distributed key/value stores, or document databases? Learn how they work (in theory and practice) and decide for yourself. Bob does an excellent job highlighting different products and the key concepts to understand when pondering the wide variety of new database offerings. It's unlikely you'll be able to say oh, this is the database for me after watching the presentation, but you will be much better informed on your options. And I imagine slightly confused as to what to do :-) An interesting observation in the talk is that the more robust products are internal
6 0.11045098 889 high scalability-2010-08-30-Pomegranate - Storing Billions and Billions of Tiny Little Files
7 0.10841254 1105 high scalability-2011-08-25-The Cloud and The Consumer: The Impact on Bandwidth and Broadband
8 0.097911015 1318 high scalability-2012-09-07-Stuff The Internet Says On Scalability For September 7, 2012
9 0.09545695 971 high scalability-2011-01-10-Riak's Bitcask - A Log-Structured Hash Table for Fast Key-Value Data
10 0.092154726 1117 high scalability-2011-09-16-Stuff The Internet Says On Scalability For September 16, 2011
11 0.08784654 448 high scalability-2008-11-22-Google Architecture
12 0.082277849 920 high scalability-2010-10-15-Troubles with Sharding - What can we learn from the Foursquare Incident?
13 0.079569437 1279 high scalability-2012-07-09-Data Replication in NoSQL Databases
14 0.078099214 750 high scalability-2009-12-16-Building Super Scalable Systems: Blade Runner Meets Autonomic Computing in the Ambient Cloud
16 0.077046387 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?
17 0.076498725 538 high scalability-2009-03-16-Are Cloud Based Memory Architectures the Next Big Thing?
18 0.07561224 1010 high scalability-2011-03-24-Strategy: Disk Backup for Speed, Tape Backup to Save Your Bacon, Just Ask Google
19 0.074354365 1386 high scalability-2013-01-14-MongoDB and GridFS for Inter and Intra Datacenter Data Replication
topicId topicWeight
[(0, 0.139), (1, 0.079), (2, -0.014), (3, 0.028), (4, -0.026), (5, 0.05), (6, 0.04), (7, -0.005), (8, 0.003), (9, 0.027), (10, 0.012), (11, -0.047), (12, -0.02), (13, -0.007), (14, 0.026), (15, 0.03), (16, -0.007), (17, 0.025), (18, -0.029), (19, -0.011), (20, 0.017), (21, 0.009), (22, 0.002), (23, 0.046), (24, -0.007), (25, -0.033), (26, 0.046), (27, -0.024), (28, -0.041), (29, -0.018), (30, -0.021), (31, -0.002), (32, 0.032), (33, 0.011), (34, -0.059), (35, 0.031), (36, 0.043), (37, 0.029), (38, 0.025), (39, -0.032), (40, -0.041), (41, -0.033), (42, -0.007), (43, 0.032), (44, 0.014), (45, 0.011), (46, -0.016), (47, -0.013), (48, -0.001), (49, 0.012)]
simIndex simValue blogId blogTitle
same-blog 1 0.93475544 1035 high scalability-2011-05-05-Paper: A Study of Practical Deduplication
2 0.83720487 50 high scalability-2007-07-31-BerkeleyDB & other distributed high performance key-value databases
Introduction: I currently use BerkeleyDB as an embedded database http://www.oracle.com/database/berkeley-db/ a decision which was initially brought on by learning that Google used BerkeleyDB for their universal sign-on feature. Lustre looks impressive, but their white paper shows speeds of 800 files created per second, as a good number. However, BerkeleyDB on my mac mini does 200,000 row creations per second, and can be used as a distributed file system. I'm having I/O scalability issues with BerkeleyDB on one machine, and about to implement their distributed replication feature (and go multi-machine), which in effect makes it work like a distributed file system, but with local access speeds. That's why I was looking at Lustre. The key feature difference between BerkeleyDB and Lustre is that BerkeleyDB has a complete copy of all the data on each computer, making it not a viable solution for massive sized database applications. However, if you have < 1TB (ie, one disk) of total pos
3 0.82778114 971 high scalability-2011-01-10-Riak's Bitcask - A Log-Structured Hash Table for Fast Key-Value Data
Introduction: How would you implement a key-value storage system if you were starting from scratch? The approach Basho settled on with Bitcask, their new backend for Riak, is an interesting combination of using RAM to store a hash map of file pointers to values and a log-structured file system for efficient writes. In this excellent Changelog interview, some folks from Basho describe Bitcask in more detail. The essential Bitcask: Keys are stored in memory for fast lookups. All keys must fit in RAM. Writes are append-only, which means writes are strictly sequential and do not require seeking. Writes are write-through. Every time a value is updated, the data file on disk is appended to and the in-memory key index is updated with the new file pointer. Read queries are satisfied with O(1) random disk seeks. Latency is very predictable if all keys fit in memory because there's no random seeking around through a file. For reads, the file system cache in the kernel is used instead of writing a c
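As a companion to the description above, here is a minimal sketch of the Bitcask idea: an append-only data file plus an in-memory map from each key to the file offset and size of its latest value. This is an illustration of the design, not Basho's code; the record framing (length-prefixed key and value) and the startup scan are assumptions made for the example.

```python
# Minimal sketch of a Bitcask-style store: append-only log on disk, key
# directory in RAM. The record framing (4-byte key length, 4-byte value
# length) is an assumption for this example.
import os
import struct

class TinyBitcask:
    HEADER = struct.Struct(">II")  # key length, value length

    def __init__(self, path: str):
        self.f = open(path, "a+b")
        self.keydir = {}           # key -> (value offset, value size)
        self._rebuild()

    def _rebuild(self):
        """Rebuild the in-memory index with one sequential scan of the log."""
        self.f.seek(0)
        while True:
            header = self.f.read(self.HEADER.size)
            if len(header) < self.HEADER.size:
                break
            klen, vlen = self.HEADER.unpack(header)
            key = self.f.read(klen)
            offset = self.f.tell()
            self.f.seek(vlen, os.SEEK_CUR)
            self.keydir[key] = (offset, vlen)

    def put(self, key: bytes, value: bytes) -> None:
        # Append-only, strictly sequential write; no seeking on the write path.
        self.f.seek(0, os.SEEK_END)
        self.f.write(self.HEADER.pack(len(key), len(value)))
        self.f.write(key)
        offset = self.f.tell()
        self.f.write(value)
        self.f.flush()
        self.keydir[key] = (offset, len(value))  # index now points at the new value

    def get(self, key: bytes) -> bytes:
        # All keys live in RAM, so a read is one seek plus one read.
        offset, size = self.keydir[key]
        self.f.seek(offset)
        return self.f.read(size)
```

Usage would be along the lines of db = TinyBitcask("data.log"); db.put(b"user:1", b"alice"); db.get(b"user:1"). Compaction of stale values and the crash-recovery hint files the real Bitcask has are omitted here.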
Introduction: “ Data is everywhere, never be at a single location. Not scalable, not maintainable. ” –Alex Szalay While Galileo played life and death doctrinal games over the mysteries revealed by the telescope, another revolution went unnoticed, the microscope gave up mystery after mystery and nobody yet understood how subversive would be what it revealed. For the first time these new tools of perceptual augmentation allowed humans to peek behind the veil of appearance. A new new eye driving human invention and discovery for hundreds of years. Data is another material that hides, revealing itself only when we look at different scales and investigate its underlying patterns. If the universe is truly made of information , then we are looking into truly primal stuff. A new eye is needed for Data and an ambitious project called Data-scope aims to be the lens. A detailed paper on the Data-Scope tells more about what it is: The Data-Scope is a new scientific instrum
5 0.80968428 128 high scalability-2007-10-21-Paper: Standardizing Storage Clusters (with pNFS)
Introduction: pNFS (parallel NFS) is the next generation of NFS and its main claim to fame is that it's clustered, which "enables clients to directly access file data spread over multiple storage servers in parallel. As a result, each client can leverage the full aggregate bandwidth of a clustered storage service at the granularity of an individual file." About pNFS StorageMojo says: pNFS is going to commoditize parallel data access. In 5 years we won't know how we got along without it. Something to watch.
6 0.79984146 103 high scalability-2007-09-28-Kosmos File System (KFS) is a New High End Google File System Option
7 0.79864997 112 high scalability-2007-10-04-You Can Now Store All Your Stuff on Your Own Google Like File System
9 0.77862817 889 high scalability-2010-08-30-Pomegranate - Storing Billions and Billions of Tiny Little Files
10 0.77403331 1483 high scalability-2013-06-27-Paper: XORing Elephants: Novel Erasure Codes for Big Data
11 0.77110839 1096 high scalability-2011-08-10-LevelDB - Fast and Lightweight Key-Value Database From the Authors of MapReduce and BigTable
12 0.77015042 98 high scalability-2007-09-18-Sync data on all servers
13 0.75428921 1279 high scalability-2012-07-09-Data Replication in NoSQL Databases
14 0.7541014 1162 high scalability-2011-12-23-Funny: A Cautionary Tale About Storage and Backup
15 0.74965763 1163 high scalability-2011-12-23-Stuff The Internet Says On Scalability For December 23, 2011
16 0.74890929 852 high scalability-2010-07-07-Strategy: Recompute Instead of Remember Big Data
17 0.74872124 278 high scalability-2008-03-16-Product: GlusterFS
18 0.74151498 368 high scalability-2008-08-17-Wuala - P2P Online Storage Cloud
19 0.73945236 959 high scalability-2010-12-17-Stuff the Internet Says on Scalability For December 17th, 2010
20 0.73877156 104 high scalability-2007-10-01-SmugMug Found their Perfect Storage Array
topicId topicWeight
[(1, 0.089), (2, 0.196), (10, 0.039), (27, 0.014), (30, 0.054), (40, 0.017), (57, 0.013), (61, 0.074), (63, 0.03), (77, 0.01), (79, 0.098), (85, 0.036), (94, 0.054), (96, 0.188)]
simIndex simValue blogId blogTitle
1 0.93072236 281 high scalability-2008-03-18-Database Design 101
Introduction: I am working on the design for my database and can't seem to come up with a firm schema. I am torn between normalizing the data and dealing with the overhead of joins and denormalizing it for easy sharding. The data is essentially music information per user: UserID, Artist, Album, Song. This lends itself nicely to be normalized and have separate User, Artist, Album and Song databases with a table full of INTs to tie them together. This will be in a mostly read based environment and with about 80% being searches of data by artist album or song. By the time I begin the query for artist, album or song I will already have a list of UserID's to limit the search by. The problem is that the tables can get unmanageably large pretty quickly and my plan was to shard off users once it got too big. Given this simple data relationship what are the pros and cons of normalizing the data vs denormalizing it? Should I go with 4 separate, normalized tables or one 4 column table? Perhaps it might
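For readers weighing the same trade-off, here is a small sketch of the two schema options the question describes, using sqlite3 purely as an illustration; the table and column names are hypothetical, taken loosely from the post.

```python
# Hypothetical sketch of the two schema options weighed above, using sqlite3
# purely for illustration; table and column names are made up for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option 1: normalized. Artists, albums, and songs are stored once; a table of
# INTs ties songs to users. Compact, but searches need joins.
cur.executescript("""
CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE albums  (id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
CREATE TABLE songs   (id INTEGER PRIMARY KEY, album_id INTEGER, title TEXT);
CREATE TABLE listens (user_id INTEGER, song_id INTEGER);
""")

# Option 2: denormalized. One wide 4-column table that is trivial to shard by
# user_id, but the artist/album/song strings repeat on every row.
cur.execute("CREATE TABLE user_music (user_id INTEGER, artist TEXT, album TEXT, song TEXT)")

# A read-mostly search by artist, limited to a known set of users, needs three
# joins against the normalized schema...
normalized = cur.execute("""
    SELECT s.title FROM songs s
    JOIN albums a  ON s.album_id  = a.id
    JOIN artists r ON a.artist_id = r.id
    JOIN listens l ON l.song_id   = s.id
    WHERE r.name = ? AND l.user_id IN (?, ?)
""", ("Some Artist", 1, 2)).fetchall()

# ...and a single scan of one table against the denormalized schema.
denormalized = cur.execute(
    "SELECT song FROM user_music WHERE artist = ? AND user_id IN (?, ?)",
    ("Some Artist", 1, 2)).fetchall()
```

Neither choice is free: the normalized form keeps the tables small and the strings stored once, while the denormalized form makes the 80% search-by-artist/album/song case a single-table query and shards cleanly by user.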
2 0.90231192 868 high scalability-2010-07-30-Basho Lives up to their Name With Consistent Smashing
Introduction: For some Friday Fun nerd style, I thought this demonstration from Basho on the difference between single master, sharding, and consistent smashing was really clever. I love the use of safety glasses! And it's harder to crash a server with a hammer than you might think... Recommended reading: http://labs.google.com/papers/bigtable.html http://research.yahoo.com/project/212
3 0.88863629 1549 high scalability-2013-11-15-Stuff The Internet Says On Scalability For November 15th, 2013
Introduction: Hey, it's HighScalability time: Test your sense of scale. Is this image of something microscopic or macroscopic? Find out . Quotable Quotes: fidotron : It feels like we've gone in one big circle, where first we move the DB on to a separate machine for performance, yet now more computation will go back to being done nearer the data (like Hadoop) and we'll try to pretend it's all just one giant computer again. @pbailis : Building systems from the ground up with distribution, scale, and availability in mind is much easier than retrofitting single-node systems. @merv : #awsreinvent Jassy: Netflix has 10,000s of EC2 instances. They are the final deployment scenario: All In. And others are coming. Edward Capriolo : YARN... Either it is really complicated or I have brain damage @djspiewak : Eventually, Node.js will reinvent the “IO promise” and realize that flattening your callback effects is actually quite nice. @jimblomo : A Note on Dis
same-blog 4 0.8828612 1035 high scalability-2011-05-05-Paper: A Study of Practical Deduplication
5 0.88054812 162 high scalability-2007-11-20-what is j2ee stack
Introduction: I see everyone talk about the LAMP stack as less than the J2EE stack. I'm a newbie; can anyone please explain what the J2EE stack is?
6 0.86943614 117 high scalability-2007-10-08-Paper: Understanding and Building High Availability-Load Balanced Clusters
7 0.85973144 1052 high scalability-2011-06-03-Stuff The Internet Says On Scalability For June 3, 2011
8 0.84985608 1228 high scalability-2012-04-16-Instagram Architecture Update: What’s new with Instagram?
9 0.8492685 422 high scalability-2008-10-17-Scaling Spam Eradication Using Purposeful Games: Die Spammer Die!
10 0.84745604 348 high scalability-2008-07-09-Federation at Flickr: Doing Billions of Queries Per Day
11 0.84586138 828 high scalability-2010-05-17-7 Lessons Learned While Building Reddit to 270 Million Page Views a Month
12 0.84573746 1528 high scalability-2013-10-07-Ask HS: Is Microsoft the Right Technology for a Scalable Web-based System?
13 0.84301805 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
14 0.83978796 1212 high scalability-2012-03-21-The Conspecific Hybrid Cloud
15 0.83157456 435 high scalability-2008-10-30-The case for functional decomposition
16 0.82871914 1221 high scalability-2012-04-03-Hazelcast 2.0: Big Data In-Memory
17 0.80367821 1418 high scalability-2013-03-06-Low Level Scalability Solutions - The Aggregation Collection
18 0.80339563 703 high scalability-2009-09-12-How Google Taught Me to Cache and Cash-In
19 0.7997992 351 high scalability-2008-07-16-The Mother of All Database Normalization Debates on Coding Horror
20 0.79869878 1473 high scalability-2013-06-10-The 10 Deadly Sins Against Scalability