high_scalability high_scalability-2010 high_scalability-2010-907 knowledge-graph by maker-knowledge-mining

907 high scalability-2010-09-23-Working With Large Data Sets


meta infos for this blog

Source: html

Introduction: This is an excerpt from my blog post Working With Large Data Sets ... For the past 18 months I’ve moved from working on the SMTP proxy to working on our other systems, all of which make use of the data we collect from each connection. It’s a fair amount of data and it can be up to 2Kb in size for each connection. Our servers receive approximately 1000 of these pieces of data per second, which is fairly sustained due to our global distribution of customers. If you compare that to Twitter’s peak of 3,283 tweets per second (maximum of 140 characters), you can see it’s not a small amount of data that we are dealing with here. I recently set out to scientifically prove the benefits of throttling, which is our technology for slowing down connections in order to detect spambots, who are kind enough to disconnect quite quickly when they see a slow connection. Due to the nature of the data we had, I needed to work with a long range of data to show evidence that an IP that appeared on Spam
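The throttling technique described above — slow the connection down and watch which clients hang up — can be sketched as a simple rule. This is an illustrative sketch only: the `Connection` record, `BANNER_DELAY` value, and threshold logic are all assumptions, as the excerpt does not describe the real implementation.

```python
from dataclasses import dataclass

# Hypothetical per-connection record; the field names are assumptions.
@dataclass
class Connection:
    connect_time: float     # seconds since epoch
    disconnect_time: float  # seconds since epoch

BANNER_DELAY = 10.0  # assumed artificial delay before the SMTP banner is sent

def likely_spambot(conn: Connection, banner_delay: float = BANNER_DELAY) -> bool:
    """Flag clients that hung up before the delayed banner could have arrived.
    Legitimate MTAs wait out the delay; spambots tend to disconnect quickly."""
    return (conn.disconnect_time - conn.connect_time) < banner_delay

print(likely_spambot(Connection(0.0, 2.5)))   # impatient client -> True
print(likely_spambot(Connection(0.0, 31.0)))  # patient MTA -> False
```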


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This is an excerpt from my blog post Working With Large Data Sets. [sent-1, score-0.144]

2 For the past 18 months I’ve moved from working on the SMTP proxy to working on our other systems, all of which make use of the data we collect from each connection. [sent-4, score-0.86]

3 It’s a fair amount of data and it can be up to 2Kb in size for each connection. [sent-5, score-0.42]

4 Our servers receive approximately 1000 of these pieces of data per second, which is fairly sustained due to our global distribution of customers. [sent-6, score-0.803]

5 If you compare that to Twitter’s peak of 3,283 tweets per second  (maximum of 140 characters), you can see it’s not a small amount of data that we are dealing with here. [sent-7, score-0.766]

6 I recently set out to scientifically prove the benefits of throttling, which is our technology for slowing down connections in order to detect spambots, who are kind enough to disconnect quite quickly when they see a slow connection. [sent-8, score-0.772]

7 Due to the nature of the data we had, I needed to work with a long range of data to show evidence that an IP that appeared on Spamhaus had previously been throttled and disconnected, and then measure the duration until it appeared on Spamhaus. [sent-9, score-1.513]

8 I set a job to pre-process a selected set of customers' data and arbitrarily decided 66 days would be a good amount to process, as this was 2 months plus a little breathing room. [sent-10, score-1.381]

9 I knew from my experience it was possible that it might take 2 months for a bad IP to be picked up by Spamhaus. [sent-11, score-0.488]

10 I extracted 28,204,693 distinct IPs, some of which were seen over a million times in this data set. [sent-12, score-0.392]
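Sentence 10's numbers (28,204,693 distinct IPs, some seen over a million times) describe a distinct-count plus frequency problem. Below is a minimal in-memory sketch using `collections.Counter`; at this scale the real job would stream from disk or use an approximate counter, and the excerpt does not say how the extraction was actually done.

```python
from collections import Counter

# Toy stand-in for the per-connection log; in production this would be
# streamed rather than held in a list.
events = ["1.2.3.4", "5.6.7.8", "1.2.3.4", "9.9.9.9", "1.2.3.4"]

counts = Counter(events)
print(len(counts))            # distinct IPs -> 3
print(counts.most_common(1))  # heaviest hitter -> [('1.2.3.4', 3)]
```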


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('appeared', 0.312), ('months', 0.245), ('spambots', 0.185), ('throttled', 0.176), ('amount', 0.172), ('breathing', 0.17), ('ip', 0.168), ('disconnect', 0.164), ('throttling', 0.16), ('disconnected', 0.149), ('characters', 0.149), ('due', 0.146), ('duration', 0.144), ('extracted', 0.144), ('excerpt', 0.144), ('arbitrarily', 0.139), ('smtp', 0.137), ('sustained', 0.137), ('picked', 0.136), ('data', 0.132), ('evidence', 0.131), ('ips', 0.125), ('prove', 0.119), ('approximately', 0.117), ('slowing', 0.117), ('distinct', 0.116), ('fair', 0.116), ('set', 0.116), ('selected', 0.113), ('collect', 0.108), ('knew', 0.107), ('detect', 0.102), ('tweets', 0.101), ('working', 0.1), ('previously', 0.096), ('proxy', 0.095), ('compare', 0.095), ('pieces', 0.095), ('decided', 0.093), ('receive', 0.092), ('second', 0.091), ('maximum', 0.09), ('dealing', 0.09), ('plus', 0.085), ('peak', 0.085), ('fairly', 0.084), ('moved', 0.08), ('benefits', 0.079), ('nature', 0.078), ('recently', 0.075)]
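The `wordTfidf` scores above come from a tf-idf weighting. Here is a minimal sketch of the textbook formula (raw term frequency times smoothed inverse document frequency); the exact normalization this pipeline uses is not stated, and the toy corpus is purely illustrative.

```python
import math
from collections import Counter

def tfidf(term: str, doc: list, corpus: list) -> float:
    """Term frequency in the document times smoothed inverse document frequency."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    idf = math.log(len(corpus) / (1 + df))    # smoothed to avoid division by zero
    return tf * idf

corpus = [
    ["throttling", "slows", "connections"],
    ["spambots", "disconnect", "quickly"],
    ["data", "per", "second"],
]
score = tfidf("throttling", corpus[0], corpus)
print(round(score, 3))  # -> 0.135
```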

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999994 907 high scalability-2010-09-23-Working With Large Data Sets


2 0.13090299 289 high scalability-2008-03-27-Amazon Announces Static IP Addresses and Multiple Datacenter Operation

Introduction: Amazon is fixing two of their major problems: no static IP addresses and single datacenter operation. By adding these two new features developers can finally build a no apology system on Amazon. Before you always had to throw in an apology or two. No, we don't have low failover times because of the silly DNS games and unacceptable DNS update and propagation times and no, we don't operate in more than one datacenter. No more. Now Amazon is adding Elastic IP Addresses and Availability Zones. Elastic IP addresses are far better than normal IP addresses because they are both in tight with Jessica Alba and they are: Static IP addresses designed for dynamic cloud computing. An Elastic IP address is associated with your account, not a particular instance, and you control that address until you choose to explicitly release it. Unlike traditional static IP addresses, however, Elastic IP addresses allow you to mask instance or availability zone failures by programmatica

3 0.12078713 168 high scalability-2007-11-30-Strategy: Efficiently Geo-referencing IPs

Introduction: A lot of apps need to map IP addresses to locations. Jeremy Cole in On efficiently geo-referencing IPs with MaxMind GeoIP and MySQL GIS succinctly explains the many uses for such a feature: Geo-referencing IPs is, in a nutshell, converting an IP address, perhaps from an incoming web visitor, a log file, a data file, or some other place, into the name of some entity owning that IP address. There are a lot of reasons you may want to geo-reference IP addresses to country, city, etc., such as in simple ad targeting systems, geographic load balancing, web analytics, and many more applications. This is difficult to do efficiently, at least it gives me a bit of brain freeze. In the same post Jeremy nicely explains where to get the geo-referencing data, how to load data, and the performance of different approaches for IP address searching. It's a great practical introduction to the subject.

4 0.082402393 920 high scalability-2010-10-15-Troubles with Sharding - What can we learn from the Foursquare Incident?

Introduction: For everything given something seems to be taken. Caching is a great scalability solution, but caching also comes with problems . Sharding is a great scalability solution, but as Foursquare recently revealed in a post-mortem about their 17 hours of downtime, sharding also has problems. MongoDB, the database Foursquare uses, also contributed their post-mortem of what went wrong too. Now that everyone has shared and resharded, what can we learn to help us skip these mistakes and quickly move on to a different set of mistakes? First, like for Facebook , huge props to Foursquare and MongoDB for being upfront and honest about their problems. This helps everyone get better and is a sign we work in a pretty cool industry. Second, overall, the fault didn't flow from evil hearts or gross negligence. As usual the cause was more mundane: a key system, that could be a little more robust, combined with a very popular application built by a small group of people, under immense pressure

5 0.081279784 1501 high scalability-2013-08-13-In Memoriam: Lavabit Architecture - Creating a Scalable Email Service

Introduction: With Lavabit shutting down  under murky circumstances , it seems fitting to repost an old (2009), yet still very good post by Ladar Levison on Lavabit's architecture. I don't know how much of this information is still current, but it should give you a general idea what Lavabit was all about. Getting to Know You What is the name of your system and where can we find out more about it? Note: these links are no longer valid... Lavabit http://lavabit.com http://lavabit.com/network.html http://lavabit.com/about.html What is your system for? Lavabit is a mid-sized email service provider. We currently have about 140,000 registered users with more than 260,000 email addresses. While most of our accounts belong to individual users, we also provide corporate email services to approximately 70 companies. Why did you decide to build this system? We built the system to compete against the other large free email providers, with an emphasis on serving the privacy c

6 0.07959684 1004 high scalability-2011-03-14-Twitter by the Numbers - 460,000 New Accounts and 140 Million Tweets Per Day

7 0.078330941 233 high scalability-2008-01-30-How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data

8 0.075628072 72 high scalability-2007-08-22-Wikimedia architecture

9 0.073112428 304 high scalability-2008-04-19-How to build a real-time analytics system?

10 0.071840852 691 high scalability-2009-08-31-Squarespace Architecture - A Grid Handles Hundreds of Millions of Requests a Month

11 0.071047477 796 high scalability-2010-03-16-Justin.tv's Live Video Broadcasting Architecture

12 0.070951179 477 high scalability-2008-12-29-100% on Amazon Web Services: Soocial.com - a lesson of porting your service to Amazon

13 0.070839837 1359 high scalability-2012-11-15-Gone Fishin': Justin.Tv's Live Video Broadcasting Architecture

14 0.069315612 934 high scalability-2010-11-04-Facebook at 13 Million Queries Per Second Recommends: Minimize Request Variance

15 0.068485349 870 high scalability-2010-08-02-7 Scaling Strategies Facebook Used to Grow to 500 Million Users

16 0.068033583 297 high scalability-2008-04-05-Skype Plans for PostgreSQL to Scale to 1 Billion Users

17 0.067730322 1586 high scalability-2014-01-28-How Next Big Sound Tracks Over a Trillion Song Plays, Likes, and More Using a Version Control System for Hadoop Data

18 0.067320809 1372 high scalability-2012-12-14-Stuff The Internet Says On Scalability For December 14, 2012

19 0.066861153 1618 high scalability-2014-03-24-Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square

20 0.066779368 221 high scalability-2008-01-24-Mailinator Architecture


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.124), (1, 0.067), (2, -0.019), (3, -0.026), (4, -0.01), (5, -0.002), (6, 0.019), (7, 0.019), (8, 0.018), (9, -0.009), (10, 0.015), (11, 0.021), (12, 0.029), (13, 0.012), (14, 0.041), (15, 0.051), (16, -0.005), (17, 0.013), (18, -0.031), (19, -0.001), (20, 0.002), (21, 0.017), (22, 0.021), (23, 0.009), (24, 0.004), (25, -0.006), (26, -0.041), (27, 0.022), (28, 0.01), (29, 0.003), (30, 0.016), (31, -0.011), (32, -0.032), (33, 0.048), (34, -0.003), (35, 0.055), (36, 0.018), (37, 0.06), (38, -0.004), (39, 0.014), (40, -0.004), (41, 0.026), (42, 0.067), (43, -0.032), (44, 0.022), (45, 0.052), (46, 0.013), (47, -0.023), (48, 0.006), (49, 0.0)]
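The `simValue` column is presumably a similarity between topic-weight vectors like the one above; cosine similarity is the conventional choice, though the pipeline does not say which measure it uses. A minimal sketch:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```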

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.86439419 907 high scalability-2010-09-23-Working With Large Data Sets


2 0.72191697 1245 high scalability-2012-05-14-DynamoDB Talk Notes and the SSD Hot S3 Cold Pattern

Introduction: My impression of DynamoDB before attending an Amazon DynamoDB for Developers talk is that it’s the usual quality service produced by Amazon: simple, fast, scalable, geographically redundant, expensive enough to make you think twice about using it, and delightfully NoOp. After the talk my impression has become more nuanced. The quality impression still stands. Look at the forums and you’ll see the typical issues every product has, but no real surprises. And as a SimpleDB++, DynamoDB seems to have avoided second system syndrome and produced a more elegant design. What was surprising is how un-cloudy DynamoDB appears to be. The cloud pillars of pay for what you use and quick elastic response to bursty traffic have been abandoned, for some understandable reasons, but the result is you really have to consider your use cases before making DynamoDB the default choice. Here are some of my impressions from the talk... DynamoDB is a clean well lighted place for key-va

3 0.71560788 119 high scalability-2007-10-10-WAN Accelerate Your Way to Lightening Fast Transfers Between Data Centers

Introduction: How do you keep in sync a crescendo of data between data centers over a slow WAN? That's the question Alberto posted a few weeks ago. Normally I'm not into all boy bands, but I was frustrated there wasn't a really good answer for his problem. It occurred to me later a WAN accelerator might help turn his slow WAN link into more of a LAN, so the overhead of copying files across the WAN wouldn't be so limiting. Many might not consider a WAN accelerator in this situation, but since my friend Damon Ennis works at the WAN accelerator vendor Silver Peak , I thought I would ask him if their product would help. Not surprisingly his answer is yes! Potentially a lot, depending on the nature of your data. Here's a no BS overview of their product: What is it? - Scalable WAN Accelerator from Silver Peak (http://www.silver-peak.com) What does it do? - You can send 5x-100x times more data across your expensive, low-bandwidth WAN link. Why should you care? - Your data centers becom

4 0.70871508 1320 high scalability-2012-09-11-How big is a Petabyte, Exabyte, Zettabyte, or a Yottabyte?

Introduction: This is an intuitive look at large data sizes by Julian Bunn in Globally Interconnected Object Databases. Bytes (8 bits) 0.1 bytes:  A binary decision 1 byte:  A single character 10 bytes:  A single word 100 bytes:  A telegram  OR  A punched card Kilobyte (1000 bytes) 1 Kilobyte:  A very short story 2 Kilobytes: A typewritten page 10 Kilobytes:  An encyclopaedic page  OR  A deck of punched cards 50 Kilobytes: A compressed document image page 100 Kilobytes:  A low-resolution photograph 200 Kilobytes: A box of punched cards 500 Kilobytes: A very heavy box of punched cards Megabyte (1 000 000 bytes) 1 Megabyte:  A small novel  OR  A 3.5 inch floppy disk 2 Megabytes: A high resolution photograph 5 Megabytes:  The complete works of Shakespeare  OR 30 seconds of TV-quality video 10 Megabytes: A minute of high-fidelity sound OR A digital chest X-ray 20 Megabytes:  A box of floppy disks 50 Megabytes: A digital mammogram 100 Megabyte

5 0.68668765 1335 high scalability-2012-10-08-How UltraDNS Handles Hundreds of Thousands of Zones and Tens of Millions of Records

Introduction: This is a guest post by Jeffrey Damick, Principal Software Engineer for Neustar. Jeffrey has overseen the software architecture for UltraDNS for the last two and a half years as it went through substantial revitalization. UltraDNS is one of the top DNS providers, serving many top-level domains (TLDs) as well as second-level domains (SLDs). This requires handling of several hundreds of thousands of zones with many containing millions of records each. Even with all of its success UltraDNS had fallen into a rut several years ago; its release schedule had become haphazard at best and the team was struggling to keep up with feature requests in a waterfall development style. Development Realizing that something had to be done the team came together and identified the most important areas to attack first. We began with the code base, stabilizing our flagship proprietary C++ DNS server, instituting common best practices and automation. Testing was very manual and not easily

6 0.66891664 934 high scalability-2010-11-04-Facebook at 13 Million Queries Per Second Recommends: Minimize Request Variance

7 0.66425169 1213 high scalability-2012-03-22-Paper: Revisiting Network I-O APIs: The netmap Framework

8 0.66323513 645 high scalability-2009-06-30-Hot New Trend: Linking Clouds Through Cheap IP VPNs Instead of Private Lines

9 0.66291374 1362 high scalability-2012-11-26-BigData using Erlang, C and Lisp to Fight the Tsunami of Mobile Data

10 0.65763736 249 high scalability-2008-02-16-S3 Failed Because of Authentication Overload

11 0.65530598 1566 high scalability-2013-12-18-How to get started with sizing and capacity planning, assuming you don't know the software behavior?

12 0.65481907 1222 high scalability-2012-04-05-Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory

13 0.65233094 1552 high scalability-2013-11-22-Stuff The Internet Says On Scalability For November 22th, 2013

14 0.65070057 558 high scalability-2009-04-06-How do you monitor the performance of your cluster?

15 0.65065801 719 high scalability-2009-10-09-Have you collectl'd yet? If not, maybe collectl-utils will make it easier to do so

16 0.65026301 1205 high scalability-2012-03-07-Scale Indefinitely on S3 With These Secrets of the S3 Masters

17 0.64964557 1585 high scalability-2014-01-24-Stuff The Internet Says On Scalability For January 24th, 2014

18 0.64477259 1148 high scalability-2011-11-29-DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second

19 0.644135 1564 high scalability-2013-12-13-Stuff The Internet Says On Scalability For December 13th, 2013

20 0.64364344 622 high scalability-2009-06-08-Distribution of queries per second


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.089), (2, 0.196), (10, 0.032), (61, 0.144), (77, 0.018), (94, 0.108), (99, 0.313)]
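The lda weights are printed as sparse `(topicId, topicWeight)` pairs. Before two blogs can be compared, such a list would typically be expanded into a dense vector over the full topic space; the topic count of 100 here is an assumption about this pipeline. A minimal sketch:

```python
# Sparse (topicId, weight) pairs as printed above, densified for comparison.
sparse = [(1, 0.089), (2, 0.196), (10, 0.032), (61, 0.144),
          (77, 0.018), (94, 0.108), (99, 0.313)]

NUM_TOPICS = 100  # assumed size of the LDA topic space

dense = [0.0] * NUM_TOPICS
for topic_id, weight in sparse:
    dense[topic_id] = weight

print(dense[2], dense[99], dense[0])  # -> 0.196 0.313 0.0
```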

similar blogs list:

simIndex simValue blogId blogTitle

1 0.83479607 374 high scalability-2008-08-30-Paper: GargantuanComputing—GRIDs and P2P

Introduction: I found the discussion of the available bandwidth of tree vs higher dimensional virtual network topologies quite, to quote Spock, fascinating: A mathematical analysis by Ritter (2002) (one of the original developers of Napster) presented a detailed numerical argument demonstrating that the Gnutella network could not scale to the capacity of its competitor, the Napster network. Essentially, that model showed that the Gnutella network is severely bandwidth-limited long before the P2P population reaches a million peers. In each of these previous studies, the conclusions have overlooked the intrinsic bandwidth limits of the underlying topology in the Gnutella network: a Cayley tree (Rains and Sloane 1999) (see Sect. 9.4 for the definition). Trees are known to have lower aggregate bandwidth than higher dimensional topologies, e.g., hypercubes and hypertori. Studies of interconnection topologies in the literature have tended to focus on hardware implementations (see, e.g., Culler et

2 0.82765925 1128 high scalability-2011-09-30-Gone Fishin'

Introduction: Well, not exactly Fishin', I'll be on vacation starting today and I'll be back in mid October. I won't be posting, so we'll all have a break. Disappointing, I know. If you've ever wanted to write an article for HighScalability, this would be a great time :-) I especially need help on writing Stuff the Internet Says on Scalability as I won't even be reading the Interwebs. Shock! Horror!   So if the spirit moves you, please write something. My connectivity in South Africa is unknown, but I will check in and approve articles when I can. See you on down the road...

3 0.80958152 1350 high scalability-2012-10-29-Gone Fishin' Two

Introduction: Well, not exactly Fishin', I'll be on vacation starting today and I'll be back late November. I won't be posting anything new, so we'll all have a break. Disappointing, I know, but fear not, I will be posting some oldies for your re-enjoyment. And if you've ever wanted to write an article for HighScalability, this would be a great time :-) I especially need help on writing Stuff the Internet Says on Scalability as I will be reading the Interwebs on a much reduced schedule. Shock! Horror!   So if the spirit moves you, please write something. My connectivity in Italy will probably be good, so I will check in and approve articles on a regular basis. Ciao...

4 0.80030882 478 high scalability-2008-12-29-Paper: Spamalytics: An Empirical Analysis of Spam Marketing Conversion

Introduction: Under the philosophy that the best method to analyse spam is to become a spammer, this absolutely fascinating paper recounts how a team of UC Berkeley researchers went undercover to infiltrate a spam network. Part CSI, part Mission Impossible, and part MacGyver, the team hijacked the botnet so that their code was actually part of the dark network itself. Once inside they figured out the architecture and protocols of the botnet and how many sales they were able to tally. Truly elegant work. Two different spam campaigns were run on a Storm botnet network of 75,800 zombie computers. Storm is a peer-to-peer botnet that uses spam to creep its tentacles through the world wide computer network. One of the campaigns distributed viruses in order to recruit new bots into the network. This is normally accomplished by enticing people to download email attachments. An astonishing one in ten people downloaded the executable and ran it, which means we won't run out of zombies soon. The downloade

same-blog 5 0.8001191 907 high scalability-2010-09-23-Working With Large Data Sets


6 0.73757982 1163 high scalability-2011-12-23-Stuff The Internet Says On Scalability For December 23, 2011

7 0.70489705 1367 high scalability-2012-12-05-5 Ways to Make Cloud Failure Not an Option

8 0.70261186 1653 high scalability-2014-05-23-Gone Fishin' 2014

9 0.69239712 315 high scalability-2008-05-05-HSCALE - Handling 200 Million Transactions Per Month Using Transparent Partitioning With MySQL Proxy

10 0.68134302 384 high scalability-2008-09-16-EE-Appserver Clustering OR Terracota OR Coherence OR something else?

11 0.67698056 1137 high scalability-2011-11-04-Stuff The Internet Says On Scalability For November 4, 2011

12 0.67461836 810 high scalability-2010-04-14-Parallel Information Retrieval and Other Search Engine Goodness

13 0.67332631 1301 high scalability-2012-08-08-3 Tips and Tools for Creating Reliable Billion Page View Web Services

14 0.65713656 1321 high scalability-2012-09-12-Using Varnish for Paywalls: Moving Logic to the Edge

15 0.63086259 72 high scalability-2007-08-22-Wikimedia architecture

16 0.62573606 120 high scalability-2007-10-11-How Flickr Handles Moving You to Another Shard

17 0.62362349 1172 high scalability-2012-01-10-A Perfect Fifth of Notes on Scalability

18 0.62333107 545 high scalability-2009-03-19-Product: Redis - Not Just Another Key-Value Store

19 0.6218673 437 high scalability-2008-11-03-How Sites are Scaling Up for the Election Night Crush

20 0.62154877 1538 high scalability-2013-10-28-Design Decisions for Scaling Your High Traffic Feeds