high_scalability high_scalability-2007 high_scalability-2007-105 knowledge-graph by maker-knowledge-mining
Source: html
Introduction: My company is developing a centralized web platform to service our clients. We currently use about 3 Mb/s on our uplink at our ISP serving web pages for about 100 clients. We'd like to offer them statistics that mean something to their businesses, and we have been contemplating writing our own statistics code to handle the task. All statistics would be gathered at the page-view level, and we're implementing an HttpModule in ASP.NET 2.0 to handle the gathering of the data. That said, I'm curious to hear comments on writing this data (~500 bytes of log data per page request). We need to write this data somewhere and then build a process to aggregate it into a warehouse application used in our reporting system. Google Analytics is out of the question because we do not want our hosting infrastructure dependent upon a remote server. WebTrends et al. are too expensive for our clients. I'm considering a few options.
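To make the collection side concrete, here is a minimal sketch of the kind of HttpModule described above, assuming ASP.NET 2.0. All names are hypothetical; PageViewSink stands in for whichever of the sinks sketched under the options below is chosen:

using System;
using System.Web;

// Minimal page-view logger, registered in web.config under <httpModules>.
public class PageViewLogModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        // Capture each request as it completes.
        app.EndRequest += new EventHandler(OnEndRequest);
    }

    private void OnEndRequest(object sender, EventArgs e)
    {
        HttpContext ctx = ((HttpApplication)sender).Context;

        // Roughly the ~500 bytes per page view mentioned above:
        // timestamp, path, referrer, user agent, client address.
        string record = string.Join("\t", new string[] {
            DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss.fff"),
            ctx.Request.Url.AbsolutePath,
            ctx.Request.UrlReferrer == null ? "" : ctx.Request.UrlReferrer.ToString(),
            ctx.Request.UserAgent == null ? "" : ctx.Request.UserAgent,
            ctx.Request.UserHostAddress
        });

        // Hypothetical dispatch to one of the sinks sketched below.
        PageViewSink.Write(record);
    }

    public void Dispose() { }
}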
1) Write log data directly to a SQL Server 2000 database and have a Windows Service come along periodically to summarize and aggregate the data to the reporting server. I'm not sure this will scale under higher load, and I worry that the aggregation process will time out because of the number of inserts being sent to the table.
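A minimal sketch of option 1, assuming a hypothetical PageViewLog table and connection string. Each page view costs one synchronous INSERT, which is exactly the part that may not keep up under load:

using System;
using System.Data.SqlClient;

// Sketch of option 1: one INSERT per page view, plus the aggregation
// query a Windows Service would run periodically. Schema is assumed.
public static class SqlLogSink
{
    const string ConnStr = "Server=dbserver;Database=WebStats;Integrated Security=SSPI";

    public static void Write(DateTime at, string url, string referrer,
                             string userAgent, string clientIp)
    {
        using (SqlConnection cn = new SqlConnection(ConnStr))
        {
            cn.Open();
            SqlCommand cmd = new SqlCommand(
                "INSERT INTO PageViewLog (LoggedAt, Url, Referrer, UserAgent, ClientIp) " +
                "VALUES (@at, @url, @ref, @ua, @ip)", cn);
            cmd.Parameters.AddWithValue("@at", at);
            cmd.Parameters.AddWithValue("@url", url);
            cmd.Parameters.AddWithValue("@ref", referrer);
            cmd.Parameters.AddWithValue("@ua", userAgent);
            cmd.Parameters.AddWithValue("@ip", clientIp);
            cmd.ExecuteNonQuery();
        }
    }

    // Run by the aggregation service (which supplies @cutoff): roll raw
    // rows up by day and URL into the reporting table, then purge them.
    public const string AggregateSql =
        "INSERT INTO DailyPageViews (Day, Url, Views) " +
        "SELECT CONVERT(char(10), LoggedAt, 120), Url, COUNT(*) " +
        "FROM PageViewLog WHERE LoggedAt < @cutoff " +
        "GROUP BY CONVERT(char(10), LoggedAt, 120), Url";
}

Batching several page views per round trip, or bulk-loading, would reduce the insert pressure that the timeout worry is about.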
2) Write the log data to a structure in memory on the web server and periodically flush it to the database. The fear here is that the web server goes down and we lose all the data in memory. A further fear is that the IIS processes and worker threads might mangle one another when contending for the shared memory structure.
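Option 2 is less fragile than it sounds if access to the buffer is serialized. A minimal sketch, all names hypothetical:

using System;
using System.Collections.Generic;
using System.Threading;

// Sketch of option 2: an in-memory buffer flushed on a timer. The lock
// serializes IIS worker threads, which addresses the contention fear.
public static class BufferedLogSink
{
    static readonly object Gate = new object();
    static List<string> buffer = new List<string>();
    static readonly Timer FlushTimer = new Timer(new TimerCallback(Flush),
        null, TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));

    public static void Write(string record)
    {
        lock (Gate) { buffer.Add(record); }
    }

    static void Flush(object state)
    {
        List<string> batch;
        lock (Gate)
        {
            if (buffer.Count == 0) return;
            batch = buffer;
            buffer = new List<string>();  // swap so the lock is held only briefly
        }
        // Write the batch to the database in one round trip
        // (e.g. a multi-row INSERT); details omitted.
    }
}

The crash-loss window is bounded by the flush interval, so a 30-second timer risks at most 30 seconds of statistics.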
3) Don't use memory; write to a file instead. Save the file handle as an application variable and use it for all accesses to the file. I'm not sure about the threading issues here either, and I'm reluctant to use anything that might corrupt a log file under load.
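For option 3, a safer pattern than parking one shared file handle in an application variable is to serialize appends behind a lock. A sketch, path hypothetical:

using System.IO;

// Sketch of option 3: open/append/close under a lock, so concurrent
// worker threads cannot interleave writes and corrupt the log.
public static class FileLogSink
{
    static readonly object Gate = new object();
    const string LogPath = @"D:\stats\pageviews.log";  // placeholder path

    public static void Write(string record)
    {
        lock (Gate)
        {
            using (StreamWriter w = File.AppendText(LogPath))
            {
                w.WriteLine(record);
            }
        }
    }
}

Note that a lock only protects threads inside one worker process; if IIS runs multiple worker processes (a web garden), each process needs its own file.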
4) Skip custom log writing altogether and rely on the IIS logs. This theoretically should remove the threading issues, but it leaves me thinking that the data would not be terribly useful once it's in the IIS logs.
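One mitigation for the "not terribly useful" concern with option 4: ASP.NET can enrich the standard IIS log entry for the current request via HttpResponse.AppendToLog, so client-specific fields ride along in the W3C log. A sketch (clientId is hypothetical):

using System.Web;

// Sketch of the option 4 variant: instead of a custom sink, tack
// client-specific data onto this request's IIS W3C log entry.
public static class IisLogSink
{
    public static void Write(HttpContext ctx, string clientId)
    {
        ctx.Response.AppendToLog("clientId=" + clientId);
    }
}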
The major driver here is that we do not want to use any of the web sites and canned reports built into 90% of all statistics platforms. Our users shouldn't have to "leave" the customer care portal we're creating just to see stats for their sites. I'm looking for a solution that's neither overly complex nor overly expensive, and that gives us access to the data we need to record on page views.
simIndex simValue blogId blogTitle
same-blog 1 0.99999994 105 high scalability-2007-10-01-Statistics Logging Scalability
2 0.21310468 30 high scalability-2007-07-26-Product: AWStats a Log Analyzer
Introduction: AWStats is a free, powerful, and feature-rich tool that graphically generates advanced web, streaming, FTP, or mail server statistics. This log analyzer works as a CGI or from the command line and shows you all the information your log contains, in a few graphical web pages. It uses a partial-information file so it can process large log files, often and quickly. It can analyze log files from all major server tools, including Apache log files (NCSA combined/XLF/ELF or common/CLF log format), WebStar, IIS (W3C log format), and a lot of other web, proxy, WAP, streaming, and mail servers, and some FTP servers.
3 0.18810976 175 high scalability-2007-12-05-how to: Load Balancing with iis
Introduction: hello world, can you tell me how I can implement load balancing of a web site running under IIS - Windows Server 2003/08
4 0.16137376 37 high scalability-2007-07-28-Product: Web Log Storming
Introduction: Web Log Storming is an interactive, desktop-based Web Log Analyzer for Windows. The whole new concept of log analysis makes it clearly different from any other web log analyzer. Browse through statistics to get into details - down to individual visitor's session. Check individual visitor behavior pattern and how it fits into your desired scenario. Web Log Storming does far more than just generate common reports - it displays detailed web site statistics with interactive graphs and reports. Very complete detailed log analysis of activity from every visitor to your web site is only a mouse-click away. In other words, analyze your web logs like never before! It's easy to track sessions, hits, page views, downloads, or whatever metric is most important to each user. You can look at referring pages and see which search engines and keywords were used to bring visitors to the site. Web site behavior, from the top entry and exit pages, to the paths that users follow, can be analyzed. You
5 0.15734568 77 high scalability-2007-08-30-Log Everything All the Time
Introduction: This JoelOnSoftware thread asks the age-old question of what and how to log. The usual trace/error/warning/info advice is totally useless in a large-scale distributed system. Instead, you need to log everything all the time so you can solve problems that have already happened across a potentially huge range of servers. Yes, it can be done. To see why the typical logging approach is broken, imagine this scenario: Your site has been up and running great for weeks. No problems. A foreshadowing beeper goes off at 2AM. It seems some users can no longer add comments to threads. Then you hear the debugging death knell: it's an intermittent problem and customers are pissed. Fix it. Now. So how are you going to debug this? The monitoring system doesn't show any obvious problems or errors. You quickly post a comment and it works fine. This won't be easy. So you think. Commenting involves a bunch of servers and networks. There's the load balancer, spam filter, web server, database server,
7 0.14071971 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
8 0.13180101 36 high scalability-2007-07-28-Product: Web Log Expert
9 0.12391867 35 high scalability-2007-07-28-Product: FastStats Log Analyzer
10 0.12095773 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
11 0.11870206 59 high scalability-2007-08-04-Try Squid as a Reverse Proxy
12 0.10720938 96 high scalability-2007-09-18-Amazon Architecture
13 0.10460067 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge
14 0.10456994 538 high scalability-2009-03-16-Are Cloud Based Memory Architectures the Next Big Thing?
15 0.10101241 1501 high scalability-2013-08-13-In Memoriam: Lavabit Architecture - Creating a Scalable Email Service
16 0.097858004 1386 high scalability-2013-01-14-MongoDB and GridFS for Inter and Intra Datacenter Data Replication
17 0.097607233 448 high scalability-2008-11-22-Google Architecture
18 0.096988499 259 high scalability-2008-02-25-Any Suggestions for the Architecture Template?
19 0.096988499 260 high scalability-2008-02-25-Architecture Template Advice Needed
20 0.096507072 58 high scalability-2007-08-04-Product: Cacti
simIndex simValue blogId blogTitle
same-blog 1 0.94179064 105 high scalability-2007-10-01-Statistics Logging Scalability
Introduction: This is a guest post by Gordon Worley, a Software Engineer at Korrelate, where they correlate (see what they did there) online purchases to offline purchases. Several weeks ago, we came into the office one morning to find every server alarm going off. Pixel log processing was behind by 8 hours and not making headway. Checking the logs, we discovered that a big client had come online during the night and was giving us 10 times more traffic than we were originally told to expect. I wouldn't say we panicked, but the office was certainly more jittery than usual. Over the next several hours, though, thanks both to foresight and quick thinking, we were able to scale up to handle the added load and clear the backlog, returning log processing to a steady state. At Korrelate, we deploy tracking pixels, also known as beacons or web bugs, that our partners use to send us information about their users. These tiny web objects contain no visible content, but may include transparent 1 by 1 gif
3 0.84574306 35 high scalability-2007-07-28-Product: FastStats Log Analyzer
Introduction: FastStats Log Analyzer enables you to: Determine whether your CPC advertising is profitable: Are you spending $0.75 per click on Google or Overture, but only receiving $0.56 per click in revenue? Tune site traffic patterns: FastStats's Hyperlink Tree View feature lets you visually see how traffic flows through your web site. High-performance solution for even the busiest web sites: Our software has been clocked at over 1000 MB/min. Other popular log file analysis tools (we won't name names), run at 1/40th the speed. We've been in the business for over 6 years, delivering value, quality, and good customer service to our clients. Our products are used for data mining at some of the world's busiest web sites -- why not give FastStats a try at your web site? FastStats log file analysis supports a wide variety of web server log files, including Apache logs and Microsoft IIS logs.
4 0.83967799 77 high scalability-2007-08-30-Log Everything All the Time
5 0.83717227 570 high scalability-2009-04-15-Implementing large scale web analytics
Introduction: Does anyone know of any articles or papers that discuss the nuts and bolts of how web analytics is implemented at organizations with large volumes of web traffic and a critical business need to analyze that data - e.g. places like Amazon.com, eBay, and Google? Just as a fun project I'm planning to build my own web log analysis app that can effectively index and query large volumes of web log data (i.e. TB range). But first I'd like to learn more about how it's done in the organizations whose lifeblood depends on this stuff. Even just a high-level architectural overview of their approaches would be nice to have.
6 0.82552153 30 high scalability-2007-07-26-Product: AWStats a Log Analyzer
7 0.80936396 541 high scalability-2009-03-16-Product: Smart Inspect
8 0.8070839 449 high scalability-2008-11-24-Product: Scribe - Facebook's Scalable Logging System
9 0.80096805 37 high scalability-2007-07-28-Product: Web Log Storming
10 0.79047489 937 high scalability-2010-11-09-Paper: Hyder - Scaling Out without Partitioning
11 0.7744925 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
12 0.73828208 304 high scalability-2008-04-19-How to build a real-time analytics system?
13 0.73731858 553 high scalability-2009-04-03-Collectl interface to Ganglia - any interest?
14 0.70123339 36 high scalability-2007-07-28-Product: Web Log Expert
15 0.69490147 1196 high scalability-2012-02-20-Berkeley DB Architecture - NoSQL Before NoSQL was Cool
16 0.69328225 1096 high scalability-2011-08-10-LevelDB - Fast and Lightweight Key-Value Database From the Authors of MapReduce and BigTable
17 0.64607948 45 high scalability-2007-07-30-Product: SmarterStats
18 0.63468355 663 high scalability-2009-07-28-37signals Architecture
19 0.63205361 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
20 0.63068569 168 high scalability-2007-11-30-Strategy: Efficiently Geo-referencing IPs
simIndex simValue blogId blogTitle
1 0.94041252 134 high scalability-2007-10-26-Paper: Wikipedia's Site Internals, Configuration, Code Examples and Management Issues
Introduction: Wikipedia and Wikimedia have some of the best, most complete real-world documentation on how to build highly scalable systems. This paper by Domas Mituzas covers a lot of details about how Wikipedia works, including: an overview of the different packages used (Linux, PowerDNS, LVS, Squid, lighttpd, Apache, PHP5, Lucene, Mono, Memcached), how they use their CDN, how caching works, how they profile their code, how they store their media, how they structure their database access, how they handle search, how they handle load balancing and administration. All with real code examples and examples of configuration files. This is a really useful resource. Related Articles Wikimedia Architecture Domas Mituzas' Blog
2 0.92486912 25 high scalability-2007-07-25-Paper: Designing Disaster Tolerant High Availability Clusters
Introduction: A very detailed (339-page) paper on how to use HP products to create a highly available cluster. It's somewhat dated and obviously concentrates on HP products, but it is still good information. Table of contents: 1. Disaster Tolerance and Recovery in a Serviceguard Cluster; 2. Building an Extended Distance Cluster Using ServiceGuard; 3. Designing a Metropolitan Cluster; 4. Designing a Continental Cluster; 5. Building Disaster-Tolerant Serviceguard Solutions Using Metrocluster with Continuous Access XP; 6. Building Disaster Tolerant Serviceguard Solutions Using Metrocluster with EMC SRDF; 7. Cascading Failover in a Continental Cluster. Sub-topics: Evaluating the Need for Disaster Tolerance; What is a Disaster Tolerant Architecture?; Types of Disaster Tolerant Clusters; Extended Distance Clusters; Metropolitan Cluster; Continental Cluster; Continental Cluster With Cascading Failover; Disaster Tolerant Architecture Guidelines; Protecting Nodes through Geographic Dispersion; Protecting Data th
3 0.92153889 668 high scalability-2009-08-01-15 Scalability and Performance Best Practices
Introduction: These are from Laura Thomson of OmniTI: Profile early, profile often. Pick a profiling tool and learn it inside and out. Dev-ops cooperation is essential. It is the most critical difference in organizations that handle crises well. Test on production data. Code behavior (especially performance) is often data driven. Track and trend. Understanding your historical performance characteristics is essential for spotting emerging problems. Assumptions will burn you. Systems are complex and often break in unexpected ways. Decouple. Isolate performance failures. Cache. Caching is the core of most optimizations. Federate. Data federation is taking a single data set and spreading it across multiple database/application servers. Replicate. Replication is making synchronized copies of data available in more than one place. Avoid straining hard-to-scale resources. Some resources are inherently hard to scale: "uncacheable" data, data with a very high read+write rate
Introduction: Snooze is an open-source, scalable, autonomic, and energy-efficient virtual machine (VM) management framework for private clouds. Like other VM management frameworks such as Nimbus, OpenNebula, Eucalyptus, and OpenStack, it allows compute infrastructures to be built from virtualized resources. In particular, once installed and configured, users can submit and control the life cycle of a large number of VMs. However, unlike existing frameworks, for scalability and fault tolerance Snooze employs a self-organizing and self-healing hierarchical architecture (based on Apache ZooKeeper). Moreover, it performs distributed VM management and is designed to be energy efficient. To that end, it implements features to monitor and estimate VM resource demands (CPU, memory, network Rx, network Tx), detect and resolve overload/underload situations, perform dynamic VM consolidation through live migration, and finally perform power management to save energy. Last but not least, it integrates a g
5 0.89812922 771 high scalability-2010-02-04-Hot Scalability Links for February 4, 2010
Introduction: Lots of cool stuff happening this week... Voldemort gets rebalancing. It's one thing to shard data to scale; it's a completely different level of functionality to manage those shards intelligently. Voldemort has stepped up by adding advanced rebalancing functionality: dynamic addition of new nodes to the cluster; deletion of nodes from the cluster; load balancing of data inside a cluster. Microsoft Finally Opens Azure for Business. Out of the blue, Microsoft opens up their platform-as-a-service offering. Good to have more competition, and we'll keep an eye out for experience reports. New details on LinkedIn architecture by Greg Linden. LinkedIn appears to use caching only minimally, preferring to spend their effort and machine resources on making sure they can recompute results quickly rather than on hiding poor performance behind caching layers. The end of SQL and relational databases? by David Intersimone. For new projects, I believe, we have genuine non-relational a
6 0.89033574 1055 high scalability-2011-06-08-Stuff to Watch from Google IO 2011
same-blog 7 0.88819963 105 high scalability-2007-10-01-Statistics Logging Scalability
9 0.87082136 908 high scalability-2010-09-28-6 Strategies for Scaling BBC iPlayer
10 0.86398345 136 high scalability-2007-10-28-Scaling Early Stage Startups
11 0.84702992 457 high scalability-2008-12-01-Sun FireTM X4540 Server as Backup Server for Zmanda's Amanda Enterprise 2.6 Software
12 0.83424938 765 high scalability-2010-01-25-Let's Welcome our Neo-Feudal Overlords
13 0.83279192 442 high scalability-2008-11-13-Plenty of Fish Says Scaling for Free Doesn't Pay
14 0.83024555 72 high scalability-2007-08-22-Wikimedia architecture
15 0.82450438 5 high scalability-2007-07-10-mixi.jp Architecture
16 0.82013893 1334 high scalability-2012-10-04-Stuff The Internet Says On Scalability For October 5, 2012
17 0.81956339 303 high scalability-2008-04-18-Scaling Mania at MySQL Conference 2008
19 0.81562269 1565 high scalability-2013-12-16-22 Recommendations for Building Effective High Traffic Web Software
20 0.81442583 1042 high scalability-2011-05-17-Facebook: An Example Canonical Architecture for Scaling Billions of Messages