1294 high scalability-2012-08-01-Prismatic Update: Machine Learning on Documents and Users


meta infos for this blog

Source: html

Introduction: In an update to Prismatic Architecture - Using Machine Learning on Social Networks to Figure Out What You Should Read on the Web, Jason Wolfe, even in the face of deadening fatigue from long nights spent getting their iPhone app out, has gallantly agreed to talk a little more about Prismatic's approach to Machine Learning. Documents and users are two areas where Prismatic applies ML (machine learning). ML on Documents: Given an HTML document, (1) learn how to extract the main text of the page (rather than the sidebar, footer, comments, etc.), its title, author, best images, etc.; (2) determine features for relevance (e.g., what the article is about, topics, etc.). The setup for most of these tasks is pretty typical. Models are trained using big batch jobs on other machines that read data from s3, save the learned parameter files to s3, and then read (and periodically refresh) the models from s3 in the ingest pipeline. All of the data that flows out of the system can be
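The "train in batch, save parameters, read and periodically refresh in the ingest pipeline" setup described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `RefreshingModel`, the `fetch` callable (a stand-in for whatever reads a parameter file from s3), the `max_age_s` staleness limit, and the linear scoring rule are not taken from the post.

```python
import time
from typing import Callable, Dict

Params = Dict[str, float]


class RefreshingModel:
    """Holds learned parameters and re-reads them once they grow stale."""

    def __init__(self, fetch: Callable[[], Params], max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch            # hypothetical loader, e.g. reads a parameter file from s3
        self._max_age_s = max_age_s    # hypothetical refresh interval in seconds
        self._clock = clock
        self._params: Params = fetch()
        self._loaded_at = clock()

    def params(self) -> Params:
        # Reload lazily on access once the cached copy is at least max_age_s old.
        if self._clock() - self._loaded_at >= self._max_age_s:
            self._params = self._fetch()
            self._loaded_at = self._clock()
        return self._params

    def score(self, features: Dict[str, float]) -> float:
        # Linear score: a stand-in for whatever model is actually used.
        p = self.params()
        return sum(p.get(name, 0.0) * value for name, value in features.items())
```

Refreshing lazily on access keeps the sketch free of threads; a real pipeline might instead refresh on a background timer so that no request pays the reload cost.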


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Models are trained using big batch jobs on other machines that read data from s3, save the learned parameter files to s3, and then read (and periodically refresh) the models from s3 in the ingest pipeline. [sent-6, score-0.691]

2 All of the data that flows out of the system can be fed back into this pipeline, which helps learn more about what's interesting, and learn from mistakes over time. [sent-7, score-0.24]

3 The code can be an order of magnitude more compact and easy to read than the corresponding Java, and execute at basically the same speed. [sent-9, score-0.245]

4 ML on Users: Guess what users are interested in from social network data and refine these guesses using explicit signals within the app (+/remove). [sent-11, score-0.352]

5 The problem of using explicit signals is interesting, as user inputs should be reflected in their feeds very quickly. [sent-12, score-0.47]

6 If a user removes 5 articles from a given publisher in a row, then stop showing them articles from that publisher right now, not tomorrow. [sent-13, score-0.667]

7 This means there isn't time to run another batch job over all the users. The solution is online learning: immediately update the model of a user with each observation they provide us. [sent-14, score-0.282]

8 The raw stream of user interaction events is saved. [sent-15, score-0.305]

9 This allows rerunning the user interest ML later over the raw events, in case any data is lost through slightly loose write-back caches on this data when a machine goes down or something like that. [sent-16, score-0.534]

10 Drift in the online learning can be corrected and more accurate models can be computed. [sent-17, score-0.412]
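The online-learning and replay ideas in sentences 5 through 10 can be sketched roughly as below. This is an illustration under stated assumptions, not Prismatic's method: the exponential-moving-average update, the `rate` and `threshold` values, and the names `UserModel`, `InterestService`, and `replay` are all hypothetical. The sketch shows a per-user model that is updated immediately on each explicit signal, a raw event log that is appended on every observation, and a replay that rebuilds models from that log.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, Iterable, List, Tuple

# (user_id, publisher, signal): +1.0 for "+", -1.0 for "remove" (assumed encoding)
Event = Tuple[str, str, float]


@dataclass
class UserModel:
    rate: float = 0.5  # hypothetical step size
    weights: DefaultDict[str, float] = field(
        default_factory=lambda: defaultdict(float))

    def update(self, publisher: str, signal: float) -> None:
        # Online step: move the publisher weight part of the way toward the signal.
        w = self.weights[publisher]
        self.weights[publisher] = w + self.rate * (signal - w)

    def shows(self, publisher: str, threshold: float = -0.9) -> bool:
        # Keep showing a publisher until its weight falls to the threshold.
        return self.weights.get(publisher, 0.0) > threshold


class InterestService:
    def __init__(self) -> None:
        self.log: List[Event] = []  # raw event stream, saved on every observation
        self.models: DefaultDict[str, UserModel] = defaultdict(UserModel)

    def observe(self, event: Event) -> None:
        self.log.append(event)                       # save the raw event first
        user, publisher, signal = event
        self.models[user].update(publisher, signal)  # then update immediately


def replay(events: Iterable[Event]) -> DefaultDict[str, UserModel]:
    """Rebuild every user model from the saved raw events."""
    models: DefaultDict[str, UserModel] = defaultdict(UserModel)
    for user, publisher, signal in events:
        models[user].update(publisher, signal)
    return models
```

With these assumed values, five consecutive removals push a publisher's weight from 0 to -0.96875, below the -0.9 threshold, so the publisher stops appearing without waiting for a batch job. The replay here simply re-applies the same update rule; a batch recomputation could just as well fit a more accurate model over the same log, which is how drift would be corrected.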


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('ml', 0.411), ('publisher', 0.193), ('prismatic', 0.193), ('learning', 0.164), ('explicit', 0.148), ('models', 0.141), ('documentsgiven', 0.119), ('drift', 0.119), ('etcdetermine', 0.119), ('maniuplation', 0.119), ('rerruning', 0.119), ('sidebar', 0.119), ('singnals', 0.119), ('usersguess', 0.119), ('wolfe', 0.119), ('raw', 0.117), ('machine', 0.113), ('footer', 0.112), ('resorting', 0.112), ('user', 0.109), ('refine', 0.107), ('corrected', 0.107), ('trained', 0.103), ('macros', 0.099), ('inference', 0.099), ('guesses', 0.097), ('batch', 0.096), ('ordinary', 0.094), ('reflected', 0.094), ('ingest', 0.094), ('nights', 0.092), ('compiles', 0.092), ('relevance', 0.089), ('read', 0.088), ('agreed', 0.087), ('articles', 0.086), ('computed', 0.084), ('jason', 0.084), ('clojure', 0.084), ('learn', 0.083), ('parameter', 0.081), ('events', 0.079), ('compact', 0.079), ('corresponding', 0.078), ('metal', 0.077), ('observation', 0.077), ('loose', 0.076), ('loops', 0.076), ('refresh', 0.074), ('fed', 0.074)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99999988 1294 high scalability-2012-08-01-Prismatic Update: Machine Learning on Documents and Users

2 0.54815334 1293 high scalability-2012-07-30-Prismatic Architecture - Using Machine Learning on Social Networks to Figure Out What You Should Read on the Web

Introduction: This post on Prismatic's Architecture is adapted from an email conversation with Prismatic programmer Jason Wolfe. What should you read on the web today? Any thoroughly modern person must solve this dilemma every day, usually using some occult process to divine what’s important in their many feeds: Twitter, RSS, Facebook, Pinterest, G+, email, Techmeme, and an uncountable number of other information sources. Jason Wolfe from Prismatic has generously agreed to describe their thoroughly modern solution for answering the “what to read” question using lots of sexy words like Machine Learning, Social Graphs, BigData, functional programming, and in-memory real-time feed processing. The result is possibly even more occult, but this or something very much like it will be how we meet the challenge of finding interesting topics and stories hidden inside infinitely deep pools of information. A couple of things stand out about Prismatic. They want you to know that Prismatic is being built

3 0.23413685 850 high scalability-2010-06-30-Paper: GraphLab: A New Framework For Parallel Machine Learning

Introduction: In the never ending quest to figure out how to do something useful with never ending streams of data, GraphLab: A New Framework For Parallel Machine Learning wants to go beyond low-level programming, MapReduce, and dataflow languages with a new parallel framework for ML (machine learning) which exploits the sparse structure and common computational patterns of ML algorithms. GraphLab enables ML experts to easily design and implement efficient scalable parallel algorithms by composing problem specific computation, data-dependencies, and scheduling. Our main contributions include: a graph-based data model which simultaneously represents data and computational dependencies; a set of concurrent access models which provide a range of sequential-consistency guarantees; a sophisticated modular scheduling mechanism; and an aggregation framework to manage global state. From the abstract: Designing and implementing efficient, provably correct parallel machine lear

4 0.13981056 1406 high scalability-2013-02-14-When all the Program's a Graph - Prismatic's Plumbing Library

Introduction: At some point as a programmer you might have the insight/fear that all programming is just doing stuff to other stuff. Then you may observe after coding the same stuff over again that stuff in a program often takes the form of interacting patterns of flows. Then you may think hey, a program isn't only useful for coding datastructures, but a program is a kind of datastructure and that with a meta level jump you could program a program in terms of flows over data and flows over other flows. That's the kind of stuff Prismatic is making available in the Graph extension to their plumbing package (code examples), which is described in an excellent post: Graph: Abstractions for Structured Computation. You may remember Prismatic from a previous profile we did on HighScalability: Prismatic Architecture - Using Machine Learning On Social Networks To Figure Out What You Should Read On The Web. We learned how Prismatic, an interest driven content suggestion service, builds programs in

5 0.096535787 1559 high scalability-2013-12-06-Stuff The Internet Says On Scalability For December 6th, 2013

Introduction: Hey, it's HighScalability time: Test your sense of scale. Is this image of something microscopic or macroscopic? Find out. 72: Intel's 72 core x86 Processor; One Trillion: number of fonts served by Google. Quotable Quotes: West-Eberhard: The gene does not lead, it follows. @waldojaquith: To an ant, gravity is nothing, but surface tension is a powerful force. When you change scale, you play by different rules. Nicholas Christakis: The spread of germs is the price we pay for the spread of ideas. We assemble ourselves into networks to facilitate the flow information but we pay a price, the spread of disease. James Mickens: When you debug a distributed system or an OS kernel, you do it Texas-style. You gather some mean, stoic people, people who have seen things die, and you get some primitive tools, like a compass and a rucksack and a stick that’s pointed on one end, and you walk into the wilderness and you look for troub

6 0.09278512 459 high scalability-2008-12-03-Java World Interview on Scalability and Other Java Scalability Secrets

7 0.086290307 327 high scalability-2008-05-27-How I Learned to Stop Worrying and Love Using a Lot of Disk Space to Scale

8 0.078239456 1000 high scalability-2011-03-08-Medialets Architecture - Defeating the Daunting Mobile Device Data Deluge

9 0.075793989 1586 high scalability-2014-01-28-How Next Big Sound Tracks Over a Trillion Song Plays, Likes, and More Using a Version Control System for Hadoop Data

10 0.074964605 1592 high scalability-2014-02-07-Stuff The Internet Says On Scalability For February 7th, 2014

11 0.074150786 1618 high scalability-2014-03-24-Big, Small, Hot or Cold - Examples of Robust Data Pipelines from Stripe, Tapad, Etsy and Square

12 0.074029826 1021 high scalability-2011-04-12-Sponsored Post: Gazillion, Edmunds, OPOWER, ClearStone, deviantART, ScaleOut, aiCache, WAPT, Karmasphere, Kabam, Newrelic, Cloudkick, Membase, Joyent, CloudSigma, ManageEngine, Site24x7

13 0.073435597 1068 high scalability-2011-06-27-TripAdvisor Architecture - 40M Visitors, 200M Dynamic Page Views, 30TB Data

14 0.07162492 1008 high scalability-2011-03-22-Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day

15 0.070969462 1270 high scalability-2012-06-22-Stuff The Internet Says On Scalability For June 22, 2012

16 0.07086806 1440 high scalability-2013-04-15-Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years

17 0.070783719 1502 high scalability-2013-08-16-Stuff The Internet Says On Scalability For August 16, 2013

18 0.070659027 666 high scalability-2009-07-30-Learn How to Think at Scale

19 0.070494905 1 high scalability-2007-07-06-Start Here

20 0.070221968 1300 high scalability-2012-08-07-Sponsored Post: Palantir, Percona, ElasticHosts, Atlantic.Net, ScaleOut, ground(ctrl), New Relic, NetDNA, GigaSpaces, AiCache, Logic Monitor, AppDynamics, CloudSigma, ManageEngine, Site24x7


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.146), (1, 0.057), (2, 0.018), (3, -0.023), (4, 0.037), (5, 0.01), (6, -0.02), (7, 0.051), (8, 0.009), (9, 0.026), (10, 0.028), (11, 0.019), (12, 0.013), (13, -0.063), (14, 0.006), (15, -0.014), (16, -0.028), (17, -0.021), (18, 0.043), (19, 0.031), (20, -0.049), (21, -0.059), (22, 0.002), (23, 0.018), (24, -0.041), (25, 0.03), (26, 0.01), (27, -0.026), (28, -0.042), (29, -0.002), (30, 0.049), (31, -0.003), (32, -0.001), (33, 0.036), (34, 0.002), (35, -0.041), (36, 0.06), (37, -0.027), (38, 0.054), (39, 0.005), (40, -0.032), (41, -0.019), (42, -0.046), (43, 0.024), (44, 0.008), (45, 0.015), (46, 0.004), (47, -0.054), (48, -0.067), (49, -0.026)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.9556545 1294 high scalability-2012-08-01-Prismatic Update: Machine Learning on Documents and Users

2 0.83162302 1293 high scalability-2012-07-30-Prismatic Architecture - Using Machine Learning on Social Networks to Figure Out What You Should Read on the Web

3 0.66616088 1509 high scalability-2013-08-30-Stuff The Internet Says On Scalability For August 30, 2013

Introduction: Hey, it's HighScalability time: (Nerd Power: Paul Kasemir software engineer AND American Ninja Warrior) Two billion documents, 30 terabytes: Github source code indexed. Quotable Quotes: David Krakauer: We fail to make intelligent machines because engineering is about putting together stupid components to make smart objects. Evolution is about putting together smart components into intelligent aggregates. Your brain is like an ecosystem of organisms. It's not like a circuit of gates. @spyced: At this point if you depend on EBS for critical services you're living in denial and I can't help you. @skilpat: TIL Friedrich Engels, not Leslie Lamport, invented logical clocks in a 1844 letter to Karl Marx. Dan Geer: Risk is a necessary consequence of dependence. @postwait: OS Rule 1. The version of /usr/bin/X you want today will never be what your OS ships

4 0.63718855 1609 high scalability-2014-03-11-Building a Social Music Service Using AWS, Scala, Akka, Play, MongoDB, and Elasticsearch

Introduction: This is a guest repost by Rotem Hermon, former Chief Architect for serendip.me, on the architecture and scaling considerations behind making a startup music service. serendip.me is a social music service that helps people discover great music shared by their friends, and also introduces them to their “music soulmates” - people outside their immediate social circle who share a similar taste in music. Serendip is running on AWS and is built on the following stack: Scala (and some Java), Akka (for handling concurrency), Play framework (for the web and API front-ends), MongoDB and Elasticsearch. Choosing the stack: One of the challenges of building serendip was the need to handle a large amount of data from day one, since a main feature of serendip is that it collects every piece of music being shared on Twitter from public music services. So when we approached the question of choosing the language and technologies to use, an important consideration was the ab

5 0.6325779 1228 high scalability-2012-04-16-Instagram Architecture Update: What’s new with Instagram?

Introduction: The fascination over Instagram continues and fortunately we have several new streams of information to feed the insanity. So consider this article an update to The Instagram Architecture Facebook Bought For A Cool Billion Dollars, based primarily on Scaling Instagram, a slide deck for an AirBnB tech talk given by Instagram co-founder, Mike Krieger. Several other information sources, listed at the bottom of the article, were also used. Unfortunately we just have a slide deck, so the connective tissue of the talk is missing, but it’s still very interesting, in the same spirit of wisdom presentations we often see after developers come up for air after significant time spent in the trenches. If you expect to dive deep into the technological details and find a billion reasons why Instagram was acquired, you will be disappointed. That magic can be found in the emotional investment in the relationship between all of the users and the product, not in the bits about h

6 0.63024926 1490 high scalability-2013-07-12-Stuff The Internet Says On Scalability For July 12, 2013

7 0.62583911 183 high scalability-2007-12-12-Report from OpenSocial Meetup at Google

8 0.62492204 1538 high scalability-2013-10-28-Design Decisions for Scaling Your High Traffic Feeds

9 0.61913699 850 high scalability-2010-06-30-Paper: GraphLab: A New Framework For Parallel Machine Learning

10 0.61806452 1592 high scalability-2014-02-07-Stuff The Internet Says On Scalability For February 7th, 2014

11 0.61760932 210 high scalability-2008-01-13-A Note on How to Create Teasers When Posting

12 0.61683041 1567 high scalability-2013-12-20-Stuff The Internet Says On Scalability For December 20th, 2013

13 0.61673886 917 high scalability-2010-10-08-4 Scalability Themes from Surgecon

14 0.61598235 216 high scalability-2008-01-17-Database People Hating on MapReduce

15 0.6152159 1430 high scalability-2013-03-27-The Changing Face of Scale - The Downside of Scaling in the Contextual Age

16 0.61479813 1484 high scalability-2013-06-28-Stuff The Internet Says On Scalability For June 28, 2013

17 0.61422259 1074 high scalability-2011-07-06-11 Common Web Use Cases Solved in Redis

18 0.61079139 1476 high scalability-2013-06-14-Stuff The Internet Says On Scalability For June 14, 2013

19 0.60955739 992 high scalability-2011-02-18-Stuff The Internet Says On Scalability For February 18, 2011

20 0.60893089 1158 high scalability-2011-12-16-Stuff The Internet Says On Scalability For December 16, 2011


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(1, 0.118), (2, 0.166), (10, 0.018), (28, 0.285), (30, 0.037), (56, 0.024), (61, 0.075), (79, 0.093), (85, 0.038), (94, 0.057)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.89755815 562 high scalability-2009-04-10-Facebook's Aditya giving presentation on Facebook Architecture

Introduction: Facebook's engineering director Aditya talks about Facebook's architecture: how they use MySQL, PHP and memcache, and how they have modified them to suit their requirements.

same-blog 2 0.87115586 1294 high scalability-2012-08-01-Prismatic Update: Machine Learning on Documents and Users

3 0.84416699 606 high scalability-2009-05-25-non-sequential, unique identifier, strategy question

Introduction: (Please bear with me, I'm a new, passionate, confident and terrified programmer :D ) Background: I'm pre-launch and 1 year into the development of my application. My target is to be able to eventually handle millions of registered users with 5-10% of them concurrent. Up to this point I've used auto-increment to assign unique identifiers to rows. I am now considering switching to a non-sequential strategy. Oh, I'm using the LAMP configuration. My reasons for avoiding auto-increment: 1. Complicates replication when scaling horizontally. Risk of collision is significant (when running multiple masters). Note: I've read the other entries in this forum that relate to ID generation and there have been some great suggestions -- including a strategy that uses auto-increment in a way that avoids this pitfall... That said, I'm still nervous about it. 2. Potential bottleneck when retrieving/assigning IDs -- IDs assigned at the database. My reasons for being nervous about

4 0.79291999 630 high scalability-2009-06-14-kngine 'Knowledge Engine' milestone 2

Introduction: Kngine is a Knowledge Web search engine designed to provide meaningful search results, such as: semantic information about the keywords/concepts, answers to the user’s questions, the relations between the keywords/concepts, and links between different kinds of data, such as: Movies, Subtitles, Photos, Price at sale store, User reviews, and Influenced story. Goals: Kngine's long-term goal is to make all of humanity's systematic knowledge and experience accessible to everyone. I aim to collect and organize all objective data, and make it possible and easy to access. Our goal is to build on the advances of Web search engines, the semantic web, and data representation technologies to create a new form of Web search engine that will unleash a revolution of new possibilities. Kngine tries to combine the power of Web search engines with the power of Semantic search and the data representation to provide meaningful search results compromising user needs. Status: Kngine starts as a research project in O

5 0.77446163 1030 high scalability-2011-04-27-Heroku Emergency Strategy: Incident Command System and 8 Hour Ops Rotations for Fresh Minds

Introduction: In Resolved: Widespread Application Outage, Heroku tells their story of how they dealt with the Amazon outage. While taking 100% responsibility for the downtime, they also shared a number of the strategies they used to bring their service back to full working order. One of Heroku's most interesting strategies wasn't a technical hack at all, but how they consciously went about deploying their Ops personnel in response to the emergency. An outline of their strategy is: Monitoring systems immediately alerted Ops to the problem. An on-call engineer applied triage logic to the problem and classified it as serious, which caused the on-call Incident Commander to be woken out of restful slumber. The IC contacted AWS. They were in constant contact with their AWS representative and worked closely with AWS to solve problems. The IC alerted Heroku engineers. A full crew: support, data, and other engineering teams worked around the clock to bring every

6 0.76777464 1506 high scalability-2013-08-23-Stuff The Internet Says On Scalability For August 23, 2013

7 0.7660042 806 high scalability-2010-04-08-Hot Scalability Links for April 8, 2010

8 0.76592344 903 high scalability-2010-09-17-Hot Scalability Links For Sep 17, 2010

9 0.74676925 752 high scalability-2009-12-17-Oracle and IBM databases: Disk-based vs In-memory databases

10 0.73379648 1395 high scalability-2013-01-28-DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing

11 0.73021042 1611 high scalability-2014-03-12-Paper: Scalable Eventually Consistent Counters over Unreliable Networks

12 0.72125727 1261 high scalability-2012-06-08-Stuff The Internet Says On Scalability For June 8, 2012

13 0.69298792 1439 high scalability-2013-04-12-Stuff The Internet Says On Scalability For April 12, 2013

14 0.68024296 1293 high scalability-2012-07-30-Prismatic Architecture - Using Machine Learning on Social Networks to Figure Out What You Should Read on the Web

15 0.67501926 840 high scalability-2010-06-10-The Four Meta Secrets of Scaling at Facebook

16 0.67011952 304 high scalability-2008-04-19-How to build a real-time analytics system?

17 0.66991645 1385 high scalability-2013-01-11-Stuff The Internet Says On Scalability For January 11, 2013

18 0.65090024 1444 high scalability-2013-04-23-Facebook Secrets of Web Performance

19 0.6473068 1037 high scalability-2011-05-10-Viddler Architecture - 7 Million Embeds a Day and 1500 Req-Sec Peak

20 0.64601153 1102 high scalability-2011-08-22-Strategy: Run a Scalable, Available, and Cheap Static Site on S3 or GitHub