high_scalability high_scalability-2008 high_scalability-2008-304 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: Hello everybody! I am a developer of a website with a lot of traffic. Right now we are managing the whole website using Perl + PostgreSQL + FastCGI + memcached + MogileFS + lighttpd + round-robin DNS distributed over 5 servers, and I must say it works like a charm: load is stable, everything runs very fast, and we are recording about 8 million pageviews per day. The only problem is the Postgres database, since it is installed on only one server; if that server goes down, the whole "cluster" goes down. That's why we have master-slave replication, so we still have a backup database; the catch is that when the master goes down, all inserts/updates are disabled and the whole website becomes read-only. But this is not a problem, since this configuration is working for us and we don't have any issues with it. Right now we are planning to build our own analytics service, customized to our needs. We tried various software packages but were not satisfied with any of them.
sentIndex sentText sentNum sentScore
1 I am a developer of a website with a lot of traffic. [sent-2, score-0.08]
2 Right now we are managing the whole website using Perl + PostgreSQL + FastCGI + memcached + MogileFS + lighttpd + round-robin DNS distributed over 5 servers, and I must say it works like a charm: load is stable, everything runs very fast, and we are recording about 8 million pageviews per day. [sent-3, score-0.576]
3 The only problem is the Postgres database, since it is installed on only one server; if that server goes down, the whole "cluster" goes down. [sent-4, score-0.272]
4 That's why we have master-slave replication, so we still have a backup database; the catch is that when the master goes down, all inserts/updates are disabled and the whole website becomes read-only. [sent-5, score-0.348]
5 Right now we are planning to build our own analytics service, customized to our needs. [sent-7, score-0.144]
6 We tried various software packages but were not satisfied with any of them. [sent-8, score-0.146]
7 We want to build something like Google Analytics: it should let us create reports in real time, with "drill-down" capability for interactive reports. [sent-9, score-0.471]
8 We don't need real-time data to be included in the reports; we just need the ability to generate different reports very fast. [sent-10, score-0.578]
9 For example, right now we are logging requests into plain-text log files in the following format: date | hour | user_id | site_id | action_id | some_other_attributes. [sent-12, score-0.769]
10 You can produce any type of report by combining different columns and counting all, or only distinct, occurrences of certain attributes. [sent-20, score-0.798]
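As a concrete illustration of such an ad-hoc report over the pipe-delimited format above, here is a minimal sketch in Python (the poster's stack is Perl; Python is used here only for readability, and the file name is hypothetical):

    from collections import Counter

    counts = Counter()
    with open("access-2008-04-10.log") as log:        # hypothetical daily log file
        for line in log:
            # columns per the stated format: date | hour | user_id | site_id | action_id | ...
            date, hour, user_id, site_id, action_id = \
                [f.strip() for f in line.split("|")[:5]]
            counts[(site_id, action_id)] += 1         # any column combination works as a key

    for key, n in counts.most_common(10):             # top 10 combinations
        print(key, n)

Swapping the tuple used as the Counter key is all it takes to group by a different set of columns.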
11 I know how to parse these log files and calculate any type of report I want, but it takes time. [sent-21, score-0.99]
12 There are about 9 million rows in each daily log file, and if I want to calculate monthly reports I need to parse all the daily log files for one month - meaning I have to parse almost 300 million lines, count what I want, and then display the summary. [sent-22, score-2.408]
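One detail worth making explicit: plain counts merge cheaply across days, so a monthly report only has to touch 30 stored daily summaries rather than 300 million raw lines; it is the distinct counts discussed below that resist this trick. A hedged sketch, with hypothetical file naming:

    from collections import Counter

    def daily_pageviews(path):
        counts = Counter()
        with open(path) as log:
            for line in log:
                counts[line.split("|")[3].strip()] += 1   # site_id is the 4th column
        return counts

    monthly = Counter()
    for day in range(1, 31):                              # merge 30 daily summaries
        monthly += daily_pageviews("access-2008-04-%02d.log" % day)

In practice each daily summary would be computed once and stored, making the monthly merge nearly free.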
13 For example, calculating the number of users that have been on site_id=1 but not on site_id=2: in this case I have to export the users on site 1, export the users on site 2, and then compare the results and count the differences. [sent-25, score-1.168]
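That report is just a set difference once the two user_id sets are in memory. A minimal sketch, assuming the same log format and hypothetical file paths:

    def users_of_site(path, wanted_site):
        users = set()
        with open(path) as log:
            for line in log:
                fields = [f.strip() for f in line.split("|")]
                if fields[3] == wanted_site:              # site_id column
                    users.add(fields[2])                  # user_id column
        return users

    on_site1 = users_of_site("access-2008-04-10.log", "1")
    on_site2 = users_of_site("access-2008-04-10.log", "2")
    print(len(on_site1 - on_site2))    # on site 1 but never on site 2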
14 If you take a look at Google Analytics, it calculates any similar report in real time. [sent-26, score-0.372]
15 If I put 300 million rows (one month of requests) into a Postgres/MySQL table, selects are even slower than parsing the plain-text log files with Perl. [sent-29, score-1.06]
16 I am aware that they have a huge number of servers, but I am also aware that they handle an even bigger number of hits per day. [sent-32, score-0.531]
17 We do have the option of storing and processing this kind of analytics on multiple servers at the same time, but I don't know enough about how to design the software and database that could do a job like this. [sent-33, score-0.336]
18 To calculate unique users during a certain period, you have to count all the distinct user_ids during that time period. [sent-37, score-1.111]
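Because distinct counts only combine as set unions, the multi-server idea above maps naturally onto scanning days in parallel and merging the partial user_id sets. The sketch below fakes the fan-out with local worker processes; real remote workers and the file naming are assumptions:

    from multiprocessing import Pool

    def users_for_day(day):
        users = set()
        with open("access-2008-04-%02d.log" % day) as log:
            for line in log:
                users.add(line.split("|")[2].strip())     # user_id column
        return users

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            partials = pool.map(users_for_day, range(10, 19))  # 2008-04-10 .. 2008-04-18
        print(len(set().union(*partials)))                # distinct users in the period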
19 where date >= '2008-04-10' and date <= '2008-04-18' - with 9 million rows per day, this statement takes about two minutes to complete, and we are not satisfied with that. [sent-43, score-0.664]
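For reference, that WHERE fragment presumably belongs to a query shaped like the one below; the table and column names are assumptions, and psycopg2 merely stands in for whatever Perl DBI code actually runs it:

    import psycopg2

    conn = psycopg2.connect("dbname=analytics")           # hypothetical DSN
    cur = conn.cursor()
    cur.execute(
        """
        SELECT count(DISTINCT user_id)
          FROM requests                                   -- hypothetical table name
         WHERE date >= %s AND date <= %s
        """,
        ("2008-04-10", "2008-04-18"),
    )
    print(cur.fetchone()[0])

Even with an index on date, Postgres of that era still has to visit every matching row to deduplicate user_id, so a two-minute runtime over ~80 million rows is unsurprising without pre-aggregation.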
wordName wordTfidf (topN-words)
[('distinct', 0.243), ('parse', 0.222), ('count', 0.201), ('reports', 0.195), ('possibility', 0.192), ('report', 0.191), ('hits', 0.183), ('calculate', 0.182), ('log', 0.17), ('date', 0.165), ('satisfied', 0.146), ('rows', 0.144), ('analytics', 0.144), ('export', 0.142), ('per', 0.14), ('period', 0.14), ('users', 0.139), ('files', 0.133), ('unique', 0.132), ('plain', 0.131), ('month', 0.113), ('million', 0.111), ('aware', 0.104), ('display', 0.102), ('occurrences', 0.096), ('number', 0.094), ('type', 0.092), ('charm', 0.092), ('text', 0.092), ('whole', 0.092), ('goes', 0.09), ('calculates', 0.089), ('disabled', 0.086), ('site', 0.086), ('want', 0.084), ('fastcgi', 0.084), ('mogilefs', 0.084), ('daily', 0.082), ('hint', 0.081), ('website', 0.08), ('requests', 0.078), ('certain', 0.074), ('suggestion', 0.074), ('hello', 0.073), ('selects', 0.072), ('simplest', 0.07), ('day', 0.069), ('recording', 0.069), ('specific', 0.068), ('parsing', 0.067)]
simIndex simValue blogId blogTitle
same-blog 1 0.99999982 304 high scalability-2008-04-19-How to build a real-time analytics system?
2 0.15885763 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
Introduction: Facebook did it again. They've built another system capable of doing something useful with ginormous streams of realtime data. Last time we saw Facebook release their New Real-Time Messaging System: HBase To Store 135+ Billion Messages A Month. This time it's a realtime analytics system handling over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. Alex Himel, Engineering Manager at Facebook, explains what they've built (video) and the scale required: Social plugins have become an important and growing source of traffic for millions of websites over the past year. We released a new version of Insights for Websites last week to give site owners better analytics on how people interact with their content and to help them optimize their websites in real time. To accomplish this, we had to engineer a system that could process over 20 billion events per day (200,000 events per second) with a lag of less than 30 seconds. Alex does a
3 0.15205415 30 high scalability-2007-07-26-Product: AWStats a Log Analyzer
Introduction: AWStats is a free, powerful and featureful tool that graphically generates advanced web, streaming, FTP or mail server statistics. This log analyzer works as a CGI or from the command line and shows you all the possible information your log contains, in a few graphical web pages. It uses a partial information file to be able to process large log files, often and quickly. It can analyze log files from all major server tools like Apache log files (NCSA combined/XLF/ELF log format or common/CLF log format), WebStar, IIS (W3C log format) and a lot of other web, proxy, WAP, streaming servers, mail servers and some FTP servers.
4 0.14884874 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
Introduction: How do you query hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers? If you think this sounds like the perfect battle ground for a head-to-head skirmish in the great MapReduce Versus Database War, you would be correct. Bill Boebel, CTO of Mailtrust (Rackspace's mail division), has generously provided a fascinating account of how they evolved their log processing system from an early amoeba'ic text-file-stored-on-each-machine approach, to a Neandertholic relational database solution that just couldn't compete, and finally to a Homo sapien'ic Hadoop-based solution that works wisely for them and has virtually unlimited scalability potential. Rackspace faced a now familiar problem. Lots and lots of data streaming in. Where do you store all that data? How do you do anything useful with it? In the first version of their system, logs were stored in flat text files and had to be manually searched by engineers logging into each individual machine. T
5 0.1479072 276 high scalability-2008-03-15-New Website Design Considerations
Introduction: I am in the design phase of getting a website up and running that will have scalability as a main concern. I am looking for opinions on architecture and the like for this endeavor. The site has a few unique characteristics that make scalability difficult. Users will all have a pretty large amount of data that other users will be able to search. The site will be entirely based around search. The catch is that other users will always be searching with a stipulation of 'n' miles from me. I imagine that fact will kill the possibility of query caching for most searches. I have extensive experience with PHP and MySQL, some experience with ASP.NET/C#, some experience with Perl, but can learn anything fast. The site will start out on a single server but I want to be 100% certain that I architect the code and databases such that scaling will be simple. What language should I code the site in? What DB would you use: Postgres, MySQL, MSSQL, BerkeleyDB? Should we shard the databa
6 0.14053091 77 high scalability-2007-08-30-Log Everything All the Time
7 0.13968515 229 high scalability-2008-01-29-Building scalable storage into application - Instead of MogileFS OpenAFS etc.
8 0.13924237 37 high scalability-2007-07-28-Product: Web Log Storming
10 0.12519681 1065 high scalability-2011-06-21-Running TPC-C on MySQL-RDS
11 0.12399381 1020 high scalability-2011-04-12-Caching and Processing 2TB Mozilla Crash Reports in memory with Hazelcast
12 0.12275488 70 high scalability-2007-08-22-How many machines do you need to run your site?
13 0.12080427 1361 high scalability-2012-11-22-Gone Fishin': PlentyOfFish Architecture
14 0.12060051 638 high scalability-2009-06-26-PlentyOfFish Architecture
15 0.1173685 808 high scalability-2010-04-12-Poppen.de Architecture
16 0.11689101 106 high scalability-2007-10-02-Secrets to Fotolog's Scaling Success
17 0.11326859 313 high scalability-2008-05-02-Friends for Sale Architecture - A 300 Million Page View-Month Facebook RoR App
18 0.11307633 513 high scalability-2009-02-16-Handle 1 Billion Events Per Day Using a Memory Grid
19 0.11210091 1501 high scalability-2013-08-13-In Memoriam: Lavabit Architecture - Creating a Scalable Email Service
20 0.10991116 152 high scalability-2007-11-13-Flickr Architecture
topicId topicWeight
[(0, 0.205), (1, 0.096), (2, -0.059), (3, -0.146), (4, 0.019), (5, -0.036), (6, 0.01), (7, 0.01), (8, 0.1), (9, 0.033), (10, 0.019), (11, -0.011), (12, 0.017), (13, -0.024), (14, 0.1), (15, 0.017), (16, -0.035), (17, -0.035), (18, -0.009), (19, 0.06), (20, 0.068), (21, -0.065), (22, -0.048), (23, 0.081), (24, 0.104), (25, -0.092), (26, -0.083), (27, 0.018), (28, 0.015), (29, -0.031), (30, 0.016), (31, -0.062), (32, -0.024), (33, 0.024), (34, -0.06), (35, 0.057), (36, -0.036), (37, -0.028), (38, 0.039), (39, -0.004), (40, -0.039), (41, 0.037), (42, 0.006), (43, -0.011), (44, -0.005), (45, -0.044), (46, 0.01), (47, 0.016), (48, -0.009), (49, -0.022)]
simIndex simValue blogId blogTitle
same-blog 1 0.96839321 304 high scalability-2008-04-19-How to build a real-time analytics system?
Introduction: This is a guest post by Gordon Worley, a Software Engineer at Korrelate, where they correlate (see what they did there) online purchases to offline purchases. Several weeks ago, we came into the office one morning to find every server alarm going off. Pixel log processing was behind by 8 hours and not making headway. Checking the logs, we discovered that a big client had come online during the night and was giving us 10 times more traffic than we were originally told to expect. I wouldn't say we panicked, but the office was certainly more jittery than usual. Over the next several hours, though, thanks both to foresight and quick thinking, we were able to scale up to handle the added load and clear the backlog to return log processing to a steady state. At Korrelate, we deploy tracking pixels, also known as beacons or web bugs, that our partners use to send us information about their users. These tiny web objects contain no visible content, but may include transparent 1 by 1 gif
3 0.77572286 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
4 0.7735849 36 high scalability-2007-07-28-Product: Web Log Expert
Introduction: WebLog Expert is a fast and powerful access log analyzer. It will give you information about your site's visitors: activity statistics, accessed files, paths through the site, information about referring pages, search engines, browsers, operating systems, and more. The program produces easy-to-read HTML reports that include both text information (tables) and charts. View the WebLog Expert sample report to get a general idea of the variety of information about your site's usage it can provide. WebLog Expert can analyze logs of Apache and IIS web servers. It can even read GZ- and ZIP-compressed logs, so you won't need to unpack them manually. The log analyzer features an intuitive interface. Built-in wizards will help you quickly and easily create a profile for your site and analyze it.
5 0.75798029 37 high scalability-2007-07-28-Product: Web Log Storming
Introduction: Web Log Storming is an interactive, desktop-based Web Log Analyzer for Windows. The whole new concept of log analysis makes it clearly different from any other web log analyzer. Browse through statistics to get into details - down to individual visitor's session. Check individual visitor behavior pattern and how it fits into your desired scenario. Web Log Storming does far more than just generate common reports - it displays detailed web site statistics with interactive graphs and reports. Very complete detailed log analysis of activity from every visitor to your web site is only a mouse-click away. In other words, analyze your web logs like never before! It's easy to track sessions, hits, page views, downloads, or whatever metric is most important to each user. You can look at referring pages and see which search engines and keywords were used to bring visitors to the site. Web site behavior, from the top entry and exit pages, to the paths that users follow, can be analyzed. You
6 0.75411814 937 high scalability-2010-11-09-Paper: Hyder - Scaling Out without Partitioning
7 0.7538169 35 high scalability-2007-07-28-Product: FastStats Log Analyzer
8 0.74553972 77 high scalability-2007-08-30-Log Everything All the Time
9 0.73648316 449 high scalability-2008-11-24-Product: Scribe - Facebook's Scalable Logging System
10 0.72428906 711 high scalability-2009-09-22-How Ravelry Scales to 10 Million Requests Using Rails
11 0.70912945 30 high scalability-2007-07-26-Product: AWStats a Log Analyzer
12 0.70721954 553 high scalability-2009-04-03-Collectl interface to Ganglia - any interest?
13 0.70647937 105 high scalability-2007-10-01-Statistics Logging Scalability
14 0.68700564 808 high scalability-2010-04-12-Poppen.de Architecture
15 0.68144083 541 high scalability-2009-03-16-Product: Smart Inspect
16 0.67909974 570 high scalability-2009-04-15-Implementing large scale web analytics
17 0.67233938 829 high scalability-2010-05-20-Strategy: Scale Writes to 734 Million Records Per Day Using Time Partitioning
18 0.66686338 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
19 0.66080666 965 high scalability-2010-12-29-Pinboard.in Architecture - Pay to Play to Keep a System Small
20 0.65653312 379 high scalability-2008-09-04-Database question for upcoming project
topicId topicWeight
[(1, 0.201), (2, 0.136), (10, 0.028), (28, 0.101), (30, 0.062), (40, 0.011), (47, 0.033), (61, 0.045), (79, 0.135), (85, 0.083), (94, 0.087)]
simIndex simValue blogId blogTitle
same-blog 1 0.93712091 304 high scalability-2008-04-19-How to build a real-time analytics system?
2 0.93704361 806 high scalability-2010-04-08-Hot Scalability Links for April 8, 2010
Introduction: Scalability porn (SFW). Real-time meter for the number of ads being served by DoubleClick. Amazing. A constant ~390,000 impressions a second are being served, and 25 trillion since 1996. Thanks to Mike Rhoads for the title idea. Scalability? Don't worry. Application complexity? Worry, by Joe McKendrick. The next challenge on enterprise agendas: application complexity. This is something that lots of hardware - whether from the cloud or internal data center - cannot fix. Leo Laporte and Steve Gibson talked about how the iPad was a denial of service attack on UPS delivery schedules. UPS trucks were filled with iPads. Cassandra: Fact vs fiction. Jonathan Ellis puts the beatdown on Cassandra misinformation. Don't you dare say Cassandra can't work across datacenters! JIT'd code calling conventions. Cliff Click Jr. shows how Java's calling convention can match compiled C code in speed, but allows for the flexibility of calling (code,slow) non-JIT'd code. Some assembly code re
3 0.90963387 1506 high scalability-2013-08-23-Stuff The Internet Says On Scalability For August 23, 2013
Introduction: Hey, it's HighScalability time: (Parkour is to terrain as programming is to frameworks) 5x: AWS vs the combined size of other cloud vendors; Every Second on The Internet: Why we need so many servers. Quotable Quotes: @chaliy: Today I learned that I do not understand how #azure scaling works, instance scale does not affect requests/sec I can load. @Lariar: Note how crazy this is. An international launch would have been a huge deal. Now it's just another thing you do. smacktoward: The problem with relying on donations is that people don't make donations. @toddhoffious: Programming is a tool built by logical positivists to solve the problems of idealists and pragmatists. We have a fundamental mismatch here. @etherealmind: Me: "Weird, my phone data isn't working" Them: "They turned the 3G off at the tower because it interferes with the particle accelerator" John Carmack: In computer science, just about t
4 0.90611076 706 high scalability-2009-09-16-The VeriScale Architecture - Elasticity and efficiency for private clouds
Introduction: The modern datacenter is evolving into the network centric datacenter model, which is applied to both public and private cloud computing. In this model, networking, platform, storage, and software infrastructure are provided as services that scale up or down on demand. The network centric model allows the datacenter to be viewed as a collection of automatically deployed and managed application services that utilize underlying virtualized services. Providing sufficient elasticity and scalability for the rapidly growing needs of the datacenter requires these collections of automatically-managed services to scale efficiently and with essentially no limits, letting services adapt easily to changing requirements and workloads. Sun’s VeriScale architecture provides the architectural platform that can deliver these capabilities. Sun Microsystems has been developing open and modular infrastructure architectures for more than a decade. The features of these architectures, such as elasticity, ar
5 0.89924544 720 high scalability-2009-10-12-High Performance at Massive Scale – Lessons learned at Facebook
Introduction: Jeff Rothschild, Vice President of Technology at Facebook, gave a great presentation at UC San Diego on our favorite subject: "High Performance at Massive Scale - Lessons learned at Facebook". The abstract for the talk is: Facebook has grown into one of the largest sites on the Internet today, serving over 200 billion pages per month. The nature of social data makes engineering a site for this level of scale a particularly challenging proposition. In this presentation, I will discuss the aspects of social data that present challenges for scalability and will describe the core architectural components and design principles that Facebook has used to address these challenges. In addition, I will discuss emerging technologies that offer new opportunities for building cost-effective high performance web architectures. There's a lot that's interesting about this talk that we'll get into later, but I thought you might want a head start on learning how Facebook handles 30K+ machines,
6 0.8989327 1294 high scalability-2012-08-01-Prismatic Update: Machine Learning on Documents and Users
7 0.89540738 114 high scalability-2007-10-07-Product: Wackamole
8 0.89348614 442 high scalability-2008-11-13-Plenty of Fish Says Scaling for Free Doesn't Pay
9 0.89294493 888 high scalability-2010-08-27-OpenStack - The Answer to: How do We Compete with Amazon?
10 0.89212221 903 high scalability-2010-09-17-Hot Scalability Links For Sep 17, 2010
11 0.89077353 355 high scalability-2008-07-21-Eucalyptus - Build Your Own Private EC2 Cloud
12 0.89019358 1557 high scalability-2013-12-02-Evolution of Bazaarvoice’s Architecture to 500M Unique Users Per Month
13 0.88722372 1435 high scalability-2013-04-04-Paper: A Web of Things Application Architecture - Integrating the Real-World into the Web
14 0.88694918 576 high scalability-2009-04-21-What CDN would you recommend?
15 0.88685364 1014 high scalability-2011-03-31-8 Lessons We Can Learn from the MySpace Incident - Balance, Vision, Fearlessness
16 0.88561016 1575 high scalability-2014-01-08-Under Snowden's Light Software Architecture Choices Become Murky
17 0.88552725 1275 high scalability-2012-07-02-C is for Compute - Google Compute Engine (GCE)
18 0.88362777 458 high scalability-2008-12-01-Web Consolidation on the Sun Fire T1000 using Solaris Containers
20 0.88195133 707 high scalability-2009-09-17-Hot Links for 2009-9-17