high_scalability high_scalability-2009 high_scalability-2009-716 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: There are many reasons to roll your own data storage solution on top of existing technologies. We've seen stories on HighScalability about custom databases for very large sets of individual data (like Twitter) and large amounts of binary data (like Facebook pictures). However, I recently ran into a unique type of problem. I was tasked with recording and storing bandwidth information for more than 20,000 servers and their associated networking equipment. This data needed to be accessed in real-time, with less than a 5 minute delay between the data being recorded and the data showing up on customer bandwidth graphs on our customer portal. After numerous false starts with off the shelf components and existing database clustering technology, we decided we must roll our own system. The real key to our problem (literally) was the ratio of the size of the key to the size of the actual data . Because the tracked metric was so small (a 64-bit counter) compared to the unique ide
sentIndex sentText sentNum sentScore
1 There are many reasons to roll your own data storage solution on top of existing technologies. [sent-1, score-0.652]
2 We've seen stories on HighScalability about custom databases for very large sets of individual data (like Twitter) and large amounts of binary data (like Facebook pictures). [sent-2, score-0.374]
3 However, I recently ran into a unique type of problem. [sent-3, score-0.302]
4 I was tasked with recording and storing bandwidth information for more than 20,000 servers and their associated networking equipment. [sent-4, score-0.485]
5 This data needed to be accessed in real-time, with less than a 5 minute delay between the data being recorded and the data showing up on customer bandwidth graphs on our customer portal. [sent-5, score-1.063]
6 After numerous false starts with off the shelf components and existing database clustering technology, we decided we must roll our own system. [sent-6, score-1.169]
7 The real key to our problem (literally) was the ratio of the size of the key to the size of the actual data . [sent-7, score-0.681]
8 Because the tracked metric was so small (a 64-bit counter) compared to the unique identifier (32-bit network component ID, 32-bit timestamp, 16-bit data type identifier) existing database technologies would choke on the key sizes. [sent-8, score-1.099]
9 Eventually it was decided that the best solution was to write our own wrapper for standard MySQL databases. [sent-9, score-0.434]
10 No fancy features, no clustering, no merge tables or partitioning, no extra indexes, just hundreds of thousands of flat tables on as many physical machines as was necessary. [sent-10, score-0.749]
11 I chronicled the whole decision making process in the full article, located here, on our developers' blog . [sent-11, score-0.087]
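The key-to-data ratio described in sentences 7 and 8 can be made concrete with a little arithmetic. The sketch below packs the three identifier fields and the counter at the widths the post states, then compares their sizes. The function names, the sample values, and the `bw_<component>_<type>` table naming scheme are hypothetical and chosen only for illustration; the summary does not say how the author's wrapper actually names or lays out its flat tables.

```python
import struct

# Big-endian, no padding. Field widths come from the post:
# 32-bit network component ID, 32-bit timestamp, 16-bit data type identifier.
KEY_FORMAT = ">IIH"
# The tracked metric is a single unsigned 64-bit counter.
VALUE_FORMAT = ">Q"


def pack_sample(component_id, timestamp, data_type, counter):
    """Pack one bandwidth sample into (key bytes, value bytes)."""
    key = struct.pack(KEY_FORMAT, component_id, timestamp, data_type)
    value = struct.pack(VALUE_FORMAT, counter)
    return key, value


def table_for(component_id, data_type):
    """Hypothetical naming scheme: one flat table per component and data type.

    This is an assumption, not the author's documented layout. Under it,
    the component ID and data type live in the table name, so only the
    timestamp would have to be stored as the row key.
    """
    return f"bw_{component_id}_{data_type}"


# Sample values are made up for illustration.
key, value = pack_sample(12345, 1254787200, 1, 9_876_543_210)
key_bytes = len(key)      # 4 + 4 + 2 bytes
value_bytes = len(value)  # 8 bytes
print(f"key: {key_bytes} bytes, value: {value_bytes} bytes")
print(f"table: {table_for(12345, 1)}")
```

With these widths the composite key (10 bytes) is larger than the value it identifies (8 bytes), which is the imbalance the post points to. Splitting the data across many flat tables is one plausible way to move part of that key out of every row, though the full article would need to be consulted for the real design.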
wordName wordTfidf (topN-words)
[('identifier', 0.343), ('roll', 0.212), ('clustering', 0.199), ('decided', 0.189), ('existing', 0.184), ('wrapper', 0.158), ('tables', 0.152), ('choke', 0.151), ('tasked', 0.148), ('tracked', 0.143), ('timestamp', 0.139), ('fancy', 0.137), ('recorded', 0.135), ('recording', 0.134), ('numerous', 0.134), ('customer', 0.132), ('metric', 0.128), ('shelf', 0.128), ('unique', 0.127), ('literally', 0.124), ('false', 0.123), ('key', 0.122), ('type', 0.119), ('flat', 0.118), ('ratio', 0.116), ('size', 0.116), ('pictures', 0.115), ('delay', 0.113), ('stories', 0.112), ('bandwidth', 0.109), ('binary', 0.108), ('merge', 0.106), ('counter', 0.106), ('ran', 0.1), ('associated', 0.094), ('located', 0.091), ('data', 0.089), ('accessed', 0.089), ('solution', 0.087), ('decision', 0.087), ('actual', 0.086), ('minute', 0.086), ('extra', 0.084), ('amounts', 0.084), ('reasons', 0.08), ('partitioning', 0.079), ('component', 0.079), ('indexes', 0.078), ('compared', 0.076), ('recently', 0.075)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999988 716 high scalability-2009-10-06-Building a Unique Data Warehouse
2 0.11116654 15 high scalability-2007-07-16-Blog: MySQL Performance Blog - Everything about MySQL Performance.
Introduction: Follow this blog and you'll learn a lot about MySQL and how to make it sing. A Quick Hit of What's Inside Working with large data sets in MySQL, PHP Large result sets and summary tables, Implementing efficient counters with MySQL. Site: http://www.mysqlperformanceblog.com/
Introduction: Pinterest has been riding an exponential growth curve, doubling every month and a half. They've gone from 0 to 10s of billions of page views a month in two years, from 2 founders and one engineer to over 40 engineers, from one little MySQL server to 180 Web Engines, 240 API Engines, 88 MySQL DBs (cc2.8xlarge) + 1 slave each, 110 Redis Instances, and 200 Memcache Instances. Stunning growth. So what's Pinterest's story? To tell their story we have our bards, Pinterest's Yashwanth Nelapati and Marty Weiner, who tell the dramatic story of Pinterest's architecture evolution in a talk titled Scaling Pinterest. This is the talk they would have liked to hear a year and a half ago when they were scaling fast and there were a lot of options to choose from. And they made a lot of incorrect choices. This is a great talk. It's full of amazing details. It's also very practical, down to earth, and it contains strategies adoptable by nearly anyone. Highly recommended. Two of my favorite lessons from the talk: Arc
4 0.10795932 292 high scalability-2008-03-30-Scaling Out MySQL
Introduction: This post covers two main options for scaling out MySQL and compares them. The first is based on database clustering and the second is based on in-memory clustering, a.k.a. a Data Grid. Special emphasis is given to a pattern that shows how to scale an existing database without changing it, through a combination of a Data Grid and the database as a background service. This pattern is referred to as Persistency as a Service (PaaS). The post also addresses many of the frequently asked questions about how performance, reliability and scalability are achieved with this pattern.
5 0.10591158 5 high scalability-2007-07-10-mixi.jp Architecture
Introduction: Mixi is a fast growing social networking site in Japan. They provide services like: diary, community, message, review, and photo album. Having a lot in common with LiveJournal they also developed many of the same approaches. Their write up on how they scaled their system is easily one of the best out there. Site: http://mixi.jp Information Sources mixi.jp - scaling out with open source Platform Linux Apache MySQL Perl Memcached Squid Shard What's Inside? They grew to approximately 4 million users in two years and add over 15,000 new users/day. Ranks 35th on Alexa and 3rd in Japan. More than 100 MySQL servers Add more than 10 servers/month Use non-persistent connections. Diary traffic is 85% read and 15% write. Message traffic is 75% read and 25% write. Ran into replication performance problems so they had to split the database. Considered splitting vertically by user or splitting horizontally by table type. The ende
6 0.10536753 1082 high scalability-2011-07-18-New Relic Architecture - Collecting 20+ Billion Metrics a Day
7 0.10390683 1508 high scalability-2013-08-28-Sean Hull's 20 Biggest Bottlenecks that Reduce and Slow Down Scalability
8 0.098446161 606 high scalability-2009-05-25-non-sequential, unique identifier, strategy question
9 0.097221434 448 high scalability-2008-11-22-Google Architecture
10 0.096629523 1345 high scalability-2012-10-22-Spanner - It's About Programmers Building Apps Using SQL Semantics at NoSQL Scale
11 0.095572896 954 high scalability-2010-12-06-What the heck are you actually using NoSQL for?
12 0.088663213 1251 high scalability-2012-05-24-Build your own twitter like real time analytics - a step by step guide
13 0.087586179 1316 high scalability-2012-09-04-Changing Architectures: New Datacenter Networks Will Set Your Code and Data Free
14 0.086234532 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
15 0.085196398 86 high scalability-2007-09-09-Clustering Solution
16 0.085019 1514 high scalability-2013-09-09-Need Help with Database Scalability? Understand I-O
18 0.083122514 855 high scalability-2010-07-11-So, Why is Twitter Really Not Using Cassandra to Store Tweets?
19 0.082950704 721 high scalability-2009-10-13-Why are Facebook, Digg, and Twitter so hard to scale?
20 0.082191356 671 high scalability-2009-08-05-Stack Overflow Architecture
topicId topicWeight
[(0, 0.152), (1, 0.069), (2, -0.001), (3, -0.028), (4, 0.017), (5, 0.07), (6, -0.003), (7, -0.054), (8, 0.035), (9, 0.021), (10, 0.039), (11, 0.032), (12, 0.011), (13, 0.029), (14, 0.034), (15, 0.056), (16, -0.017), (17, -0.002), (18, 0.005), (19, 0.022), (20, 0.008), (21, 0.025), (22, 0.014), (23, 0.052), (24, 0.034), (25, 0.015), (26, 0.007), (27, -0.034), (28, 0.037), (29, 0.03), (30, -0.032), (31, -0.019), (32, 0.01), (33, 0.07), (34, -0.091), (35, 0.025), (36, 0.052), (37, 0.022), (38, -0.019), (39, -0.02), (40, 0.046), (41, -0.011), (42, 0.03), (43, -0.005), (44, 0.028), (45, 0.009), (46, 0.017), (47, -0.004), (48, -0.039), (49, 0.04)]
simIndex simValue blogId blogTitle
same-blog 1 0.9628787 716 high scalability-2009-10-06-Building a Unique Data Warehouse
2 0.73776287 292 high scalability-2008-03-30-Scaling Out MySQL
Introduction: This post covers two main options for scaling-out MySql and compare between them. The first is based on data-base clustering and the second is based on In Memory clustering a.k.a Data Grid. A special emphasis is given to a pattern which shows how to scale our existing data base without changing it through a combination of Data Grid and data base as a background service. This pattern is referred to as Persistency as a Service (PaaS). It also address many of the fequently asked question related to how performance, reliability and scalability is achieved with this pattern.
3 0.72871238 1578 high scalability-2014-01-14-Ask HS: Design and Implementation of scalable services?
Introduction: We have written agents deployed/distributed across the network. Agents send data every 15 secs, maybe even every 5 secs. We are working on a service/system to which all agents can post data/tuples with a marginal payload. Up to a 5% drop rate is acceptable. Ultimately the data will be segregated and stored in a DBMS system (currently we are using MSQL). Question(s) I am looking for answers to: 1. Client/Server Communication: Agent(s) can post data. Status of sending data is not that important. But there is a remote where Agent(s) to be notified if the server side system generates an event based on the data sent. - A lot of advice from the internet suggests using a Message Bus (ActiveMQ) for async communication. Multicast and UDP are the alternatives. 2. Persistence: After some evaluation, data is to be stored in a DBMS system. - The end result of processing the data is an aggregated record, for which MySql looks scalable. But the volume of data is exponential. Considering HBase as an option. Looking if there are any alter
4 0.72402352 587 high scalability-2009-05-01-FastBit: An Efficient Compressed Bitmap Index Technology
Introduction: Data mining and fast queries are always in that bin of hard to do things where doing something smarter can yield big results. Bloom Filters are one such do it smarter strategy, compressed bitmap indexes are another. In one application "FastBit outruns other search indexes by a factor of 10 to 100 and doesn’t require much more room than the original data size." The data size is an interesting metric. Our old standard b-trees can be two to four times larger than the original data. In a test searching an Enron email database FastBit outran MySQL by 10 to 1,000 times. FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indices to further speed up query processing. Analyses have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing me
5 0.71702874 279 high scalability-2008-03-17-Microsoft's New Database Cloud Ready to Rumble with Amazon
Introduction: Update: Zdnet says Ozzie signals Microsoft’s surrender to the cloud . CD ROMs are to the internet as the internet is to the cloud and Microsoft aims to scratch and claw its way into this paradigm shift as well. The gloves are off. The tag line for Microsoft's new SQL Server Data Service is Your Data, Any Place, Any Time . Thems fighten' words. Microsoft is itch'n for a fight! Who will be Amazon's second? The service description: SQL Server Data Services (SSDS) are highly scalable, on-demand data storage and query processing utility services. Built on robust SQL Server database and Windows Server technologies, these services provide high availability, security and support standards-based web interfaces for easy programming and quick provisioning. Sounds like a fast uppercut aimed squarely at SimpleDB's jaw. As a developer what do you need to know? Highly available and highly scalable. Targeted at applications that can tolerate high internet latencies. Pr
6 0.71426719 907 high scalability-2010-09-23-Working With Large Data Sets
7 0.71358162 281 high scalability-2008-03-18-Database Design 101
8 0.70953417 119 high scalability-2007-10-10-WAN Accelerate Your Way to Lightening Fast Transfers Between Data Centers
9 0.70604831 1570 high scalability-2014-01-01-Paper: Nanocubes: Nanocubes for Real-Time Exploration of Spatiotemporal Datasets
10 0.70389843 310 high scalability-2008-04-29-High performance file server
11 0.70321375 65 high scalability-2007-08-16-Scaling Secret #2: Denormalizing Your Way to Speed and Profit
12 0.70247412 1179 high scalability-2012-01-23-Facebook Timeline: Brought to You by the Power of Denormalization
14 0.70008314 351 high scalability-2008-07-16-The Mother of All Database Normalization Debates on Coding Horror
15 0.69354045 817 high scalability-2010-04-29-Product: SciDB - A Science-Oriented DBMS at 100 Petabytes
16 0.69302595 468 high scalability-2008-12-17-Ringo - Distributed key-value storage for immutable data
17 0.69087625 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
18 0.69031399 151 high scalability-2007-11-12-a8cjdbc - Database Clustering via JDBC
19 0.68953097 829 high scalability-2010-05-20-Strategy: Scale Writes to 734 Million Records Per Day Using Time Partitioning
20 0.68781418 86 high scalability-2007-09-09-Clustering Solution
topicId topicWeight
[(1, 0.12), (2, 0.201), (10, 0.132), (21, 0.098), (56, 0.013), (61, 0.051), (79, 0.124), (85, 0.103), (94, 0.064)]
simIndex simValue blogId blogTitle
same-blog 1 0.93682951 716 high scalability-2009-10-06-Building a Unique Data Warehouse
2 0.90494299 257 high scalability-2008-02-22-Kevin's Great Adventures in SSDland
Introduction: Update: Final Thoughts on SSD and MySQL AKA Battleship Spinn3r . Tips on how to make your database 10x faster using solid state drives. Potential exists for 100x speedup. Solid-state drives (SSDs) are the holy grail of storage. The promise of RAM speeds and hard disk like persistence have for years driven us crazy with power user lust, but they've stayed tantalizingly just out of reach. Always too expensive, too small, and oddly too slow. Has that changed? Can you now miraculously have your cake and eat it too? Can you now have it both ways? Is balancing work with family life now as easy as tripping over a terabyte drive? In a pioneering series of blog articles Kevin Burton conducts original research on next generation SSD drives in real world configurations. For an experience report on his great adventure you can turn to: Could SSD Mean a Rise in MyISAM Usage? , Serverbeach, MySQL and Mtron SSDs , Prediction: SSD Blades in 2008 , Zeus IOPS - Another High
3 0.90488851 498 high scalability-2009-01-20-Product: Amazon's SimpleDB
Introduction: Update 35 : How and Why Glue is Using Amazon SimpleDB instead of a Relational Database . Discusses a key design decision that required duplicating data in order to mimic RDBMS joins: Given the trade off between potential inconsistencies and scalability, social services have to choose the latter. Update 34 : Apparently Amazon pulled this article. I'm not sure what that means. Maybe time went backwards or something? Amazon dramatically drops SimpleDB pricing to $0.25 per GB per month from $1.50 per GB . This puts SimpleDB on par with Google App Engine . They also announced a few new features: a SQL-like SELECT API as well as a Batch Put operation to streamline uploading of multiple items or attributes . One of the complaints against SimpleDB is that programmers end up writing too much code to do simple things. These features and a much cheaper price should help considerably. And you can store lots of data now. GAE is still capped. Update 33 : Amazon announces
4 0.90208495 307 high scalability-2008-04-21-Using Google AppEngine for a Little Micro-Scalability
Introduction: Over the years I've accumulated quite a ragtag collection of personal systems scattered wide across a galaxy of different servers. For the past month I've been on a quest to rationalize this conglomeration by moving everything to a managed service of one kind or another. The goal: lift a load of worry from my mind. I like to do my own stuff myself so I learn something and have control. Control always comes with headaches and it was time for a little aspirin. As part of the process GAE came in handy as a host for a few Twitter-related scripts I couldn't manage to run anywhere else. I recoded my simple little scripts into Python/GAE and learned a lot in the process. In the move I exported HighScalability from a VPS and imported it into a shared hosting service. I could never quite configure Apache and MySQL well enough that they wouldn't spike memory periodically and crash the VPS. And since a memory crash did not trigger an automatic restart, it was unacceptable. I also wrote a scrip
5 0.90169239 1592 high scalability-2014-02-07-Stuff The Internet Says On Scalability For February 7th, 2014
Introduction: Hey, it's HighScalability time: Google "Corkboard" Server, 1999 5 billion requests per day : Heroku serves 60,000 requests per second; 500 Petabytes : Backblaze's New Data Center; 25000 simultaneous connections : on a Percona Server How algorithms help determine the shape of our world. First we encode normative rules of an idealized world in algorithms. Second those algorithms help enforce those expectations by nudging humans into acting accordingly. A fun example is the story of Ed Bolian's Record-Breaking Drive . Ed raced from New York to L.A. at speeds of up to 158 mph, "breaking countless laws – and the previous record, by more than two hours." His approach is one that any nerd would love. He had three radar detectors, two laser jammers, two nav systems, a CB radio, a scanner, two iPhones and two iPads running applications like Waze, lookouts in the back seat scanning for cops, and someone scouting ahead. Awesome! For the moral of the story, they were going s
6 0.89891303 1585 high scalability-2014-01-24-Stuff The Internet Says On Scalability For January 24th, 2014
7 0.8958872 1041 high scalability-2011-05-15-Building a Database remote availability site
8 0.89579648 1454 high scalability-2013-05-08-Typesafe Interview: Scala + Akka is an IaaS for Your Process Architecture
9 0.89555919 1452 high scalability-2013-05-06-7 Not So Sexy Tips for Saving Money On Amazon
10 0.89448982 1331 high scalability-2012-10-02-An Epic TripAdvisor Update: Why Not Run on the Cloud? The Grand Experiment.
11 0.89206702 1553 high scalability-2013-11-25-How To Make an Infinitely Scalable Relational Database Management System (RDBMS)
12 0.89187086 1577 high scalability-2014-01-13-NYTimes Architecture: No Head, No Master, No Single Point of Failure
13 0.89120483 1371 high scalability-2012-12-12-Pinterest Cut Costs from $54 to $20 Per Hour by Automatically Shutting Down Systems
14 0.89107567 1353 high scalability-2012-11-01-Cost Analysis: TripAdvisor and Pinterest costs on the AWS cloud
15 0.88948727 1186 high scalability-2012-02-02-The Data-Scope Project - 6PB storage, 500GBytes-sec sequential IO, 20M IOPS, 130TFlops
16 0.88936347 1382 high scalability-2013-01-07-Analyzing billions of credit card transactions and serving low-latency insights in the cloud
17 0.88735002 1369 high scalability-2012-12-10-Switch your databases to Flash storage. Now. Or you're doing it wrong.
18 0.88702631 1080 high scalability-2011-07-15-Stuff The Internet Says On Scalability For July 15, 2011
19 0.88654143 1380 high scalability-2013-01-02-Why Pinterest Uses the Cloud Instead of Going Solo - To Be Or Not To Be
20 0.88650364 72 high scalability-2007-08-22-Wikimedia architecture