high_scalability high_scalability-2008 high_scalability-2008-281 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: I am working on the design for my database and can't seem to come up with a firm schema. I am torn between normalizing the data and dealing with the overhead of joins and denormalizing it for easy sharding. The data is essentially music information per user: UserID, Artist, Album, Song. This lends itself nicely to be normalized and have separate User, Artist, Album and Song databases with a table full of INTs to tie them together. This will be in a mostly read based environment and with about 80% being searches of data by artist album or song. By the time I begin the query for artist, album or song I will already have a list of UserID's to limit the search by. The problem is that the tables can get unmanageably large pretty quickly and my plan was to shard off users once it got too big. Given this simple data relationship what are the pros and cons of normalizing the data vs denormalizing it? Should I go with 4 separate, normalized tables or one 4 column table? Perhaps it might be best to write the data in both formats at first and see what query speed is like once the tables fill up.
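To make the two options concrete, here is a minimal sketch of both schemas and the 80%-case read query, using SQLite purely for illustration (all table and column names are assumptions, not from the original post):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Option 1 -- normalized: each artist/album/song string is stored once,
    # and a table full of INTs ties songs to users.
    # Option 2 -- denormalized: one 4-column table, trivially shardable on
    # user_id, at the cost of repeating the text in every row.
    conn.executescript("""
    CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE album  (album_id  INTEGER PRIMARY KEY, artist_id INTEGER REFERENCES artist, title TEXT);
    CREATE TABLE song   (song_id   INTEGER PRIMARY KEY, album_id  INTEGER REFERENCES album,  title TEXT);
    CREATE TABLE user_song  (user_id INTEGER, song_id INTEGER REFERENCES song);
    CREATE TABLE user_music (user_id INTEGER, artist TEXT, album TEXT, song TEXT);
    """)

    # The common read: search by artist, already holding a list of UserIDs.
    # The normalized form pays for three joins; the denormalized form is a
    # single indexable lookup on one table.
    params = (1, 2, 3, "Some Artist")
    normalized = conn.execute("""
        SELECT us.user_id, s.title
        FROM user_song us
        JOIN song s   ON s.song_id   = us.song_id
        JOIN album al ON al.album_id = s.album_id
        JOIN artist a ON a.artist_id = al.artist_id
        WHERE us.user_id IN (?, ?, ?) AND a.name = ?""", params).fetchall()
    denormalized = conn.execute("""
        SELECT user_id, song FROM user_music
        WHERE user_id IN (?, ?, ?) AND artist = ?""", params).fetchall()

Note the sharding consequence: the denormalized table splits cleanly by range or hash on user_id, while the normalized form forces a choice between duplicating the Artist/Album/Song lookup tables on every shard or paying for cross-shard joins.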
sentIndex sentText sentNum sentScore
1 I am working on the design for my database and can't seem to come up with a firm schema. [sent-1, score-0.13]
2 I am torn between normalizing the data and dealing with the overhead of joins and denormalizing it for easy sharding. [sent-2, score-0.757]
3 The data is essentially music information per user: UserID, Artist, Album, Song. [sent-3, score-0.183]
4 This lends itself nicely to be normalized and have separate User, Artist, Album and Song databases with a table full of INTs to tie them together. [sent-4, score-0.692]
5 This will be in a mostly read based environment and with about 80% being searches of data by artist album or song. [sent-5, score-1.123]
6 By the time I begin the query for artist, album or song I will already have a list of UserID's to limit the search by. [sent-6, score-1.02]
7 The problem is that the tables can get unmanageably large pretty quickly and my plan was to shard off users once it got too big. [sent-7, score-0.263]
8 Given this simple data relationship what are the pros and cons of normalizing the data vs denormalizing it? [sent-8, score-0.756]
9 Should I go with 4 separate, normalized tables or one 4 column table? [sent-9, score-0.444]
10 Perhaps it might be best to write the data in both formats at first and see what query speed is like once the tables fill up. [sent-10, score-0.405]
wordName wordTfidf (topN-words)
[('album', 0.516), ('artist', 0.438), ('song', 0.263), ('normalized', 0.241), ('userid', 0.239), ('normalizing', 0.225), ('denormalizing', 0.206), ('tables', 0.137), ('table', 0.124), ('ints', 0.112), ('torn', 0.112), ('lends', 0.103), ('tie', 0.088), ('selects', 0.084), ('firm', 0.084), ('fact', 0.084), ('formats', 0.081), ('batches', 0.079), ('cons', 0.079), ('pros', 0.079), ('pretty', 0.075), ('inserts', 0.075), ('separate', 0.074), ('music', 0.069), ('query', 0.068), ('searches', 0.066), ('user', 0.066), ('column', 0.066), ('fill', 0.065), ('insert', 0.064), ('begin', 0.064), ('nicely', 0.062), ('requiring', 0.061), ('already', 0.061), ('essentially', 0.06), ('relationship', 0.059), ('intensive', 0.057), ('joins', 0.056), ('dealing', 0.055), ('data', 0.054), ('shard', 0.051), ('issue', 0.049), ('overhead', 0.049), ('coming', 0.049), ('mostly', 0.049), ('limit', 0.048), ('pull', 0.048), ('perhaps', 0.048), ('seem', 0.046), ('potential', 0.046)]
simIndex simValue blogId blogTitle
same-blog 1 1.0 281 high scalability-2008-03-18-Database Design 101
2 0.13101944 351 high scalability-2008-07-16-The Mother of All Database Normalization Debates on Coding Horror
Introduction: Jeff Atwood started a barn burner of a conversation in Maybe Normalizing Isn't Normal on how to create a fast scalable tagging system. Jeff eventually asks that terrible question: which is better -- a normalized database, or a denormalized database? And all hell breaks loose. I know, it's hard to imagine database debates becoming contentious, but it does happen :-) It's lucky developers don't have temporal power or rivers of blood would flow. Here are a few of the pithier points (summarized): Normalization is not magical fairy dust you sprinkle over your database to cure all ills; it often creates as many problems as it solves. (Jeff) Normalize until it hurts, denormalize until it works. (Jeff) Use materialized views, which are tables created and maintained by your RDBMS. So a materialized view will act exactly like a de-normalized table would - except you keep your original normalized structure and any change to the original data will propagate to the view automatically. (A hand-rolled materialized-view sketch follows this list.)
Introduction: This is a guest post by Eric Czech, Chief Architect at Next Big Sound, about some unique approaches taken to solving scalability challenges in music analytics. Tracking online activity is hardly a new idea, but doing it for the entire music industry isn't easy. Half a billion music video streams, track downloads, and artist page likes occur each day, and measuring all of this activity across platforms such as Spotify, iTunes, YouTube, Facebook, and more poses some interesting scalability challenges. Next Big Sound collects this type of data from over a hundred sources, standardizes everything, and offers that information to record labels, band managers, and artists through a web-based analytics platform. While many of our applications use open-source systems like Hadoop, HBase, Cassandra, Mongo, RabbitMQ, and MySQL, our usage is fairly standard, but there is one aspect of what we do that is pretty unique. We collect or receive information from 100+ sources and we s
4 0.11034661 115 high scalability-2007-10-07-Using ThreadLocal to pass context information around in web applications
Introduction: Hi, in Java web servers each HTTP request is handled by a thread from a thread pool, so a thread is assigned to the Servlet handling the request. It is tempting (and very convenient) to keep context information in a ThreadLocal variable. I recently had a requirement where we needed to attach the logged-in user id and a timestamp to requests sent to web services. Because we already had the code in place, it was extremely difficult to change the method signatures to pass the user id everywhere. The solution I thought of is:

    class ReferenceIdGenerator {
        private static final ThreadLocal<String> threadLocal = new ThreadLocal<String>();

        public static void setReferenceId(String login) {
            threadLocal.set(login + System.currentTimeMillis());
        }

        public static String getReferenceId() {
            return threadLocal.get();
        }
    }

    class MyServlet {
        void service(HttpServletRequest request, HttpServletResponse response) {
            HttpSession session = request.getSession(false);
            String userId = (String) session.getAttribute("userId");
            ReferenceIdGenerator.setReferenceId(userId);
        }
    }
5 0.10993052 829 high scalability-2010-05-20-Strategy: Scale Writes to 734 Million Records Per Day Using Time Partitioning
Introduction: In Scaling writes in MySQL ( slides ) Philip Tellis, while working for Yahoo, describes how, using time-based partitions, they were able to increase their write capability from 2100 inserts per second (7 million a day) to a sustained 8500 inserts per second (734 million a day). This was capacity enough to handle the load during Michael Jackson's memorial service. In summary, the secrets to scalable writes are: bulk inserts push up the insert rate; partitioning lets you insert more records; partition based on incoming data for fast inserts. Partitioning is a standard approach for handling high write loads because it means data can be written to different hard disks in parallel. In this example Philip created a separate table for each day, with each table having its own database file. Each table is partitioned on time, 12 partitions per day, 2 hours of data per partition (a sketch of this routing scheme follows this list). Huge log streams are often handled this way. Other advantages of this approach: 1) fast drop table operations 2
7 0.092233211 222 high scalability-2008-01-25-Application Database and DAL Architecture
8 0.083475828 672 high scalability-2009-08-06-An Unorthodox Approach to Database Design : The Coming of the Shard
9 0.079910807 342 high scalability-2008-06-08-Search fast in million rows
10 0.076104291 578 high scalability-2009-04-23-Which Key value pair database to be used
11 0.074587025 327 high scalability-2008-05-27-How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale
12 0.071113221 936 high scalability-2010-11-09-Facebook Uses Non-Stored Procedures to Update Social Graphs
13 0.068943933 721 high scalability-2009-10-13-Why are Facebook, Digg, and Twitter so hard to scale?
14 0.068787038 315 high scalability-2008-05-05-HSCALE - Handling 200 Million Transactions Per Month Using Transparent Partitioning With MySQL Proxy
15 0.067729622 1440 high scalability-2013-04-15-Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years
16 0.064630575 1514 high scalability-2013-09-09-Need Help with Database Scalability? Understand I-O
17 0.063892178 788 high scalability-2010-03-04-How MySpace Tested Their Live Site with 1 Million Concurrent Users
18 0.062676556 1163 high scalability-2011-12-23-Stuff The Internet Says On Scalability For December 23, 2011
19 0.06148608 1529 high scalability-2013-10-08-F1 and Spanner Holistically Compared
20 0.060544487 1146 high scalability-2011-11-23-Paper: Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
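The materialized-view advice in entry 2 above deserves a concrete form. MySQL (the usual stack in these discussions) has no built-in materialized views, so the common substitute is a summary table kept current by triggers or application code. A minimal sketch, using SQLite and hypothetical tag tables:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Normalized source of truth.
    CREATE TABLE post_tag (post_id INTEGER, tag TEXT);
    -- Hand-rolled 'materialized view': a denormalized count kept in step on
    -- every write, so reads never pay for a GROUP BY over the big table.
    CREATE TABLE tag_count (tag TEXT PRIMARY KEY, n INTEGER NOT NULL);
    -- Requires SQLite >= 3.24 for the UPSERT syntax.
    CREATE TRIGGER post_tag_ins AFTER INSERT ON post_tag BEGIN
        INSERT INTO tag_count (tag, n) VALUES (NEW.tag, 1)
        ON CONFLICT (tag) DO UPDATE SET n = n + 1;
    END;
    """)

    conn.execute("INSERT INTO post_tag VALUES (1, 'scalability')")
    conn.execute("INSERT INTO post_tag VALUES (2, 'scalability')")
    print(conn.execute("SELECT n FROM tag_count WHERE tag = 'scalability'").fetchone())  # (2,)

An RDBMS with native materialized views automates exactly this maintenance; the trade-off is the same normalize/denormalize question from the original post, just managed by the database.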
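And the time-partitioning scheme in entry 5 can be sketched at the application level. SQLite has no native partitioning, so this only emulates the shape of the approach -- one table per day, rows bucketed into 2-hour slots, batched inserts -- with invented names throughout:

    import sqlite3
    from collections import defaultdict
    from datetime import datetime

    conn = sqlite3.connect(":memory:")

    def partition_for(ts: datetime) -> str:
        # One table (and hence one file, in one-file-per-table mode) per day.
        table = f"log_{ts:%Y%m%d}"
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                     "(bucket INTEGER, ts TEXT, payload TEXT)")
        return table

    def bulk_insert(rows):
        # Group rows per partition and write each group in one executemany
        # call: batched inserts are what push the insert rate up.
        batches = defaultdict(list)
        for ts, payload in rows:
            # 12 buckets per day = one per 2 hours, mirroring the post's layout.
            batches[partition_for(ts)].append((ts.hour // 2, ts.isoformat(), payload))
        for table, batch in batches.items():
            conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", batch)
        conn.commit()

    bulk_insert([(datetime(2010, 5, 20, 9, 30), "pageview"),
                 (datetime(2010, 5, 20, 11, 5), "pageview")])
    conn.execute("DROP TABLE log_20100520")  # expiring a day is a fast DROP, not a huge DELETE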
topicId topicWeight
[(0, 0.082), (1, 0.053), (2, -0.015), (3, -0.035), (4, 0.023), (5, 0.046), (6, -0.013), (7, -0.004), (8, 0.035), (9, -0.037), (10, 0.006), (11, 0.023), (12, -0.042), (13, 0.025), (14, 0.039), (15, 0.008), (16, -0.064), (17, -0.001), (18, 0.01), (19, 0.018), (20, 0.008), (21, -0.035), (22, -0.0), (23, 0.017), (24, 0.011), (25, -0.019), (26, -0.037), (27, -0.013), (28, 0.018), (29, 0.034), (30, 0.004), (31, -0.031), (32, -0.015), (33, 0.044), (34, 0.004), (35, 0.033), (36, 0.033), (37, 0.004), (38, -0.011), (39, -0.024), (40, -0.013), (41, 0.003), (42, -0.02), (43, -0.0), (44, 0.001), (45, 0.022), (46, -0.019), (47, 0.008), (48, 0.023), (49, -0.006)]
simIndex simValue blogId blogTitle
same-blog 1 0.92671329 281 high scalability-2008-03-18-Database Design 101
2 0.80056441 986 high scalability-2011-02-10-Database Isolation Levels And Their Effects on Performance and Scalability
Introduction: Some of us are not aware of the tremendous job databases perform, particularly their efforts to maintain the Isolation aspect of ACID. For example, some people believe that transactions are only related to data manipulation and not to queries, which is an incorrect assumption. Transaction Isolation is all about queries, and the consistency and completeness of the data retrieved by queries. This is how it works: Isolation gives the querying user the feeling that he owns the database. It does not matter that hundreds or thousands of concurrent users work with the same database and the same schema (or even the same data). These other users can generate new data, modify existing data or perform any other action. The querying user must be able to get a complete, consistent picture of the data, unaffected by other users’ actions. (A small snapshot-isolation demo follows this list.) Let’s take the following scenario, which is based on an Orders table that has 1,000,000 rows, with a disk size of 20 GB: 8:00: UserA started a query “SELECT
3 0.78272754 578 high scalability-2009-04-23-Which Key value pair database to be used
Introduction: My table has 2 columns. Column1 is an id; Column2 contains information given by users about the item in Column1. A user can give 3 types of information about an item. I separate the opinion of a single user by commas, and the opinions of different users by semicolons. Example: 23-34,us,56;78,in,78. I need to calculate the opinions of all users very fast. My idea is to have an index on the key so the searching would be very fast. Currently I'm using MySQL. My problem is that the maximum column size is below my requirement. If any overflow occurs I make a new row with the same id and insert the data into the new row. Practically I would have around a maximum of 5-10 rows for each id. I wonder if there is a database that removes the need for this application code. I just learned about key-value pair databases, which is exactly what I need. But which one doesn't put constraints on column size (I mean, is much better than an RDBMS in that respect)? This application is not in production. (A modeling sketch follows this list.)
4 0.74113649 1065 high scalability-2011-06-21-Running TPC-C on MySQL-RDS
Introduction: I recently came across TPC-C benchmark results for MySQL-based RDS databases. You can see them here . I think the results may bring light to many questions concerning MySQL scalability in general and RDS scalability in particular. (For disclosure, I'm working for ScaleBase, where we are running an internal scale-out TPC-C benchmark these days and will publish results soon.) TPC-C TPC-C is a standard database benchmark, used to measure databases. The database vendors invest big bucks in running this test, showing off which database is faster and can scale better. It is a write-intensive test, so it doesn’t necessarily reflect the behavior of the database in your application. But it does give some very important insights on what you can expect from your database under heavy load. The Benchmark Process First of all, I have some comments on the benchmark method itself. Generally - the benchmarks were held in an orderly fashion and in a rather methodological way – which i
5 0.74089521 1233 high scalability-2012-04-25-The Anatomy of Search Technology: blekko’s NoSQL database
Introduction: This is a guest post ( part 2 , part 3 ) by Greg Lindahl, CTO of blekko, the spam-free search engine that had over 3.5 million unique visitors in March. Greg Lindahl was Founder and Distinguished Engineer at PathScale, at which he was the architect of the InfiniPath low-latency InfiniBand HCA, used to build tightly-coupled supercomputing clusters. Imagine that you're crazy enough to think about building a search engine. It's a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk -- that's several thousand 1 terabyte disks -- and produces an index that's about 100 terabytes in size. Serving query results quickly involves having most of the index in RAM or on solid state (flash) disk. (The capacity arithmetic is worked in a short sketch after this list.) If you can buy a server with 100 gigabytes of RAM for about $3,000, that's 1,000 servers at a capital cost of $3 million, plus about $1 million per year of serve
6 0.73419839 561 high scalability-2009-04-08-N+1+caching is ok?
7 0.73403454 1514 high scalability-2013-09-09-Need Help with Database Scalability? Understand I-O
8 0.71877056 351 high scalability-2008-07-16-The Mother of All Database Normalization Debates on Coding Horror
9 0.71063358 514 high scalability-2009-02-18-Numbers Everyone Should Know
10 0.70882434 587 high scalability-2009-05-01-FastBit: An Efficient Compressed Bitmap Index Technology
11 0.69950598 1304 high scalability-2012-08-14-MemSQL Architecture - The Fast (MVCC, InMem, LockFree, CodeGen) and Familiar (SQL)
12 0.69547361 1650 high scalability-2014-05-19-A Short On How the Wayback Machine Stores More Pages than Stars in the Milky Way
13 0.69508904 342 high scalability-2008-06-08-Search fast in million rows
14 0.69402957 476 high scalability-2008-12-28-How to Organize a Database Table’s Keys for Scalability
15 0.69067544 327 high scalability-2008-05-27-How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale
16 0.68694293 1440 high scalability-2013-04-15-Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years
17 0.68060654 828 high scalability-2010-05-17-7 Lessons Learned While Building Reddit to 270 Million Page Views a Month
18 0.67528933 246 high scalability-2008-02-12-Search the tags across all post
19 0.67224646 1032 high scalability-2011-05-02-Stack Overflow Makes Slow Pages 100x Faster by Simple SQL Tuning
20 0.67046028 152 high scalability-2007-11-13-Flickr Architecture
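To make the isolation discussion in entry 2 above tangible, here is a small two-connection demo of snapshot-style read consistency. It uses SQLite in WAL mode as a stand-in for the article's database; the Orders schema and values are invented:

    import os, sqlite3, tempfile

    path = os.path.join(tempfile.mkdtemp(), "orders.db")
    writer = sqlite3.connect(path, isolation_level=None)
    writer.execute("PRAGMA journal_mode=WAL")  # WAL gives readers a stable snapshot
    writer.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    writer.execute("INSERT INTO orders VALUES (1, 100)")

    reader = sqlite3.connect(path, isolation_level=None)
    reader.execute("BEGIN")  # UserA's 8:00 transaction begins...
    before = reader.execute("SELECT SUM(amount) FROM orders").fetchone()

    # ...while another user modifies the same data concurrently.
    writer.execute("UPDATE orders SET amount = 999 WHERE id = 1")

    # Inside the still-open transaction, UserA sees the consistent 8:00 picture.
    during = reader.execute("SELECT SUM(amount) FROM orders").fetchone()
    assert before == during == (100,)
    reader.execute("COMMIT")

Raising or lowering the isolation level trades this consistency guarantee against the locking or versioning work the database must do, which is the performance and scalability effect the article examines.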
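For the delimited-column question in entry 3 above, the usual fix -- in an RDBMS or a key-value store alike -- is one record per (item, user) opinion rather than packing everything into a single overflowing column. A minimal sketch with a plain dict standing in for the key-value store (the three field meanings are guessed from the "23-34,us,56;78,in,78" example):

    from collections import defaultdict

    # Key: item id. Value: one structured record per user opinion --
    # no delimiter parsing, no column-size limit, no overflow rows.
    store = defaultdict(list)

    def add_opinion(item_id, opinion):
        store[item_id].append(opinion)

    # The packed value "23-34,us,56;78,in,78" becomes two records under key 23:
    add_opinion(23, ("34", "us", "56"))
    add_opinion(23, ("78", "in", "78"))

    # "Calculate the opinions of all users very fast": one O(1) key lookup.
    print(store[23])  # [('34', 'us', '56'), ('78', 'in', '78')]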
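And the capacity arithmetic in entry 5 above, worked explicitly (all figures are the post's own):

    # 100 TB index served from RAM, 100 GB of RAM per server, $3,000 per server.
    index_gb = 100 * 1000            # 100 TB expressed in GB
    ram_per_server_gb = 100
    cost_per_server = 3_000

    servers = index_gb // ram_per_server_gb   # = 1,000 servers
    capex = servers * cost_per_server         # = $3,000,000
    print(servers, capex)                     # 1000 3000000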
topicId topicWeight
[(2, 0.191), (30, 0.019), (61, 0.123), (77, 0.011), (79, 0.064), (85, 0.058), (96, 0.407)]
simIndex simValue blogId blogTitle
same-blog 1 0.81898737 281 high scalability-2008-03-18-Database Design 101
2 0.79705739 868 high scalability-2010-07-30-Basho Lives up to their Name With Consistent Smashing
Introduction: For some Friday Fun, nerd style, I thought this demonstration from Basho on the difference between single master, sharding, and consistent smashing was really clever. I love the use of safety glasses! And it's harder to crash a server with a hammer than you might think... Recommended reading: http://labs.google.com/papers/bigtable.html http://research.yahoo.com/project/212
3 0.63987041 162 high scalability-2007-11-20-what is j2ee stack
Introduction: I see everyone talking about how the LAMP stack is less than the J2EE stack. I'm a newbie; can anyone please explain what the J2EE stack is?
4 0.62270314 1549 high scalability-2013-11-15-Stuff The Internet Says On Scalability For November 15th, 2013
Introduction: Hey, it's HighScalability time: Test your sense of scale. Is this image of something microscopic or macroscopic? Find out . Quotable Quotes: fidotron : It feels like we've gone in one big circle, where first we move the DB on to a separate machine for performance, yet now more computation will go back to being done nearer the data (like Hadoop) and we'll try to pretend it's all just one giant computer again. @pbailis : Building systems from the ground up with distribution, scale, and availability in mind is much easier than retrofitting single-node systems. @merv : #awsreinvent Jassy: Netflix has 10,000s of EC2 instances. They are the final deployment scenario: All In. And others are coming. Edward Capriolo : YARN... Either it is really complicated or I have brain damage @djspiewak : Eventually, Node.js will reinvent the “IO promise” and realize that flattening your callback effects is actually quite nice. @jimblomo : A Note on Dis
5 0.61561143 117 high scalability-2007-10-08-Paper: Understanding and Building High Availability-Load Balanced Clusters
Introduction: A superb explanation by Theo Schlossnagle of how to deploy a high availability load balanced system using mod backhand and Wackamole . The idea is that you don't need to buy expensive redundant hardware load balancers; you can make use of the hosts you already have to the same effect. The discussion of using peer-based HA solutions versus a single front-end HA device is well worth the read. Another interesting perspective in the document is to view load balancing as a resource allocation problem. There's also a nice discussion of the negative effect of keep-alives on performance.
6 0.60191345 348 high scalability-2008-07-09-Federation at Flickr: Doing Billions of Queries Per Day
7 0.57642829 1528 high scalability-2013-10-07-Ask HS: Is Microsoft the Right Technology for a Scalable Web-based System?
8 0.5719763 1035 high scalability-2011-05-05-Paper: A Study of Practical Deduplication
9 0.55809689 435 high scalability-2008-10-30-The case for functional decomposition
10 0.55648404 828 high scalability-2010-05-17-7 Lessons Learned While Building Reddit to 270 Million Page Views a Month
11 0.54780519 1052 high scalability-2011-06-03-Stuff The Internet Says On Scalability For June 3, 2011
12 0.53253698 397 high scalability-2008-09-28-Product: Happy = Hadoop + Python
13 0.52938306 1228 high scalability-2012-04-16-Instagram Architecture Update: What’s new with Instagram?
14 0.52910721 1212 high scalability-2012-03-21-The Conspecific Hybrid Cloud
15 0.51834625 1473 high scalability-2013-06-10-The 10 Deadly Sins Against Scalability
16 0.51316994 422 high scalability-2008-10-17-Scaling Spam Eradication Using Purposeful Games: Die Spammer Die!
17 0.51293391 1221 high scalability-2012-04-03-Hazelcast 2.0: Big Data In-Memory
18 0.49681005 1418 high scalability-2013-03-06-Low Level Scalability Solutions - The Aggregation Collection
19 0.49361736 1074 high scalability-2011-07-06-11 Common Web Use Cases Solved in Redis
20 0.48602447 703 high scalability-2009-09-12-How Google Taught Me to Cache and Cash-In