99 hilary mason data-2013-04-01-Data Engineering

meta info for this blog

Source: html

Introduction: Data Engineering Posted: April 1, 2013 | Author: Hilary Mason | Filed under: blog | Tags: bitly, data, engineering, infrastructure | 5 Comments » Data engineering is when the architecture of your system is dependent on characteristics of the data flowing through that system. It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. Most data engineering systems also transform the data as they process it. Developing these types of systems requires an initial research phase, where you do the necessary work to understand the characteristics of the data, before you design the system (and perhaps even requiring an active experimental process where you try multiple infrastructure options in the wild before making a final decision). I’ve seen numerous people run straight into walls when

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 It requires a different kind of engineering process than typical systems engineering, because you have to do some work upfront to understand the nature of the data before you can effectively begin to design the infrastructure. [sent-2, score-2.119]

2 Most data engineering systems also transform the data as they process it. [sent-3, score-1.426]

3 I’ve seen numerous people run straight into walls when they ignore this research requirement. [sent-5, score-0.414]

4 Forget Table is one example of a data engineering project from our work at bitly. [sent-6, score-0.901]

5 We often see streams of data and want to understand what the distributions in that data look like, knowing that they drift over time. [sent-8, score-0.889]

6 Forget Table is designed precisely for this use, allowing you to configure the rate of change in your particular dataset (check it out on github). [sent-9, score-0.425]
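
The idea in sentences 5 and 6 (tracking a distribution over a stream while letting old observations fade at a configurable rate) can be illustrated with a small sketch. This is not the Forget Table implementation; it is a simplified stand-in that uses deterministic exponential decay, and the class name `DecayingCounts`, the `rate` parameter, and the explicit `now` timestamps are all hypothetical names chosen for illustration.

```python
import math
from collections import defaultdict


class DecayingCounts:
    """Per-category counts that shrink exponentially as time passes.

    Illustrative only: `rate` is a hypothetical decay constant (per unit
    of time). A larger rate forgets old observations faster.
    """

    def __init__(self, rate):
        self.rate = rate
        self.counts = defaultdict(float)
        self.last_time = 0.0

    def _decay(self, now):
        # Scale every count by exp(-rate * elapsed); never decay backwards.
        elapsed = max(0.0, now - self.last_time)
        factor = math.exp(-self.rate * elapsed)
        for category in self.counts:
            self.counts[category] *= factor
        self.last_time = max(self.last_time, now)

    def observe(self, category, now):
        # Age the existing counts first, then record the new event.
        self._decay(now)
        self.counts[category] += 1.0

    def distribution(self, now):
        # Normalize the decayed counts into probabilities.
        self._decay(now)
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {c: v / total for c, v in self.counts.items()}
```

With `rate=math.log(2)` a count halves every time unit, so an event seen one unit ago carries half the weight of an event seen now. Choosing that rate is the kind of decision the post argues needs upfront research into how quickly the data actually drifts.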

similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('engineering', 0.59), ('understand', 0.253), ('data', 0.224), ('process', 0.194), ('systems', 0.194), ('characteristics', 0.187), ('table', 0.187), ('forget', 0.169), ('infrastructure', 0.145), ('system', 0.137), ('requires', 0.137), ('research', 0.118), ('design', 0.109), ('typical', 0.094), ('phase', 0.094), ('active', 0.094), ('wild', 0.094), ('requiring', 0.094), ('multiple', 0.094), ('experimental', 0.094), ('types', 0.094), ('perhaps', 0.094), ('effectively', 0.094), ('flowing', 0.094), ('ignore', 0.094), ('knowing', 0.094), ('streams', 0.094), ('work', 0.087), ('begin', 0.084), ('initial', 0.084), ('decision', 0.084), ('designed', 0.084), ('final', 0.084), ('numerous', 0.084), ('architecture', 0.084), ('precisely', 0.084), ('developing', 0.078), ('options', 0.073), ('configure', 0.073), ('dataset', 0.073), ('github', 0.073), ('necessary', 0.068), ('rate', 0.068), ('check', 0.068), ('database', 0.068), ('run', 0.059), ('seen', 0.059), ('kind', 0.059), ('bitly', 0.057), ('change', 0.054)]
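
This page does not say how the word weights or the similarity values below were computed. As a rough sketch only, one common approach is to weight each term by term frequency times inverse document frequency and compare documents by cosine similarity. The whitespace tokenizer, the unsmoothed `log(n / df)` weighting, and the function names `tfidf_vectors` and `cosine` are assumptions for illustration, not the pipeline behind these numbers.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (term count / document length) * log(n / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under this scheme a document compared with itself scores 1 up to floating-point rounding, and documents that share no terms score 0.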

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000001 99 hilary mason data-2013-04-01-Data Engineering

2 0.18579604 8 hilary mason data-2007-08-19-Curriculum Design as Software Engineering

Introduction: Curriculum Design as Software Engineering Posted: August 19, 2007 | Author: hilary | Filed under: blog | Tags: education | 1 Comment » This summer, I’ve been involved in the process of creating a new undergraduate curriculum essentially from scratch. I was reflecting back on this process, and I realized the development of a robust and relevant curriculum shares many attributes with the process of developing robust and functional software. Modern software development is a largely modular process. Each component of a system interacts with every other component through a defined interface. I see this same behavior in a degree program – each course has certain incoming requires and defined outcomes. Students navigate through a narrative of courses that must fit together to equal a bachelor’s degree. Unit testing is the practice of separating out each module in a software system and ensuring that it functions correctly. The final system will contain many

3 0.1398484 85 hilary mason data-2013-01-19-Startups: How to Share Data with Academics

Introduction: Startups: How to Share Data with Academics Posted: January 19, 2013 | Author: Hilary Mason | Filed under: blog | Tags: academics, data, research | 8 Comments » This post assumes that you want to share data. If you’re not convinced, don’t worry — that’s next on my list. You and your academic colleagues will benefit from having at least a quick chat about the research questions they want to address. I’ve read every paper I’ve been able to find that uses bitly data and all of the ones that acquired the data without our assistance had serious flaws, generally based on incorrect assumptions about the data they had acquired (this, unfortunately, makes me question the validity of most research done on commercial social data without cooperation from the subject company). The easiest way to share data is through your own API. Set generous rate limits where possible. Most projects are not realtime and they can gather the data (or, more likely, have a grad

4 0.12373066 42 hilary mason data-2010-04-18-Stop talking, start coding

Introduction: Stop talking, start coding Posted: April 18, 2010 | Author: hilary | Filed under: blog | 65 Comments » I read Out of the Loop in Silicon Valley in the NYTimes today, which explores how and why women are under-represented in tech startups. From the number of retweets I saw and the clicks through bit.ly links (12,579 at the time of this posting), it’s been getting a lot of attention. There are some very strong, compelling themes in this article. Computer science and engineering do have an “image problem”; the way we teach math to elementary school students is horrible and turns way too many away. I don’t want to nitpick the article, but there are a few statements that reinforce the very damaging stereotypes that the article sets out to dispel. “When women take on the challenges of an engineering or computer science education in college, some studies suggest that they struggle against a distinct set of personal, psycho-social issues… Even women who soldier

5 0.11273181 76 hilary mason data-2012-08-28-How do you prioritize research?

Introduction: How do you prioritize research? Posted: August 28, 2012 | Author: Hilary Mason | Filed under: blog | Tags: datascience, startups | 14 Comments » One of the most fun and challenging parts of my job is setting bitly’s research agenda. We’re a startup, so this means prioritizing the set of questions we look into in the context of what will be most beneficial for the rest of the business, for the short and long-term, by creating opportunity and opening up potential futures. We work on a wide variety of projects, from pure research to press collaborations to infrastructure and experimental products. We always have a list of research questions way longer than we have time and resources to pursue, so we developed a process for evaluating whether a given question is worth pursuing at a particular time. This is the kind of process that I’ve only discussed with several people over whisky (thanks!), but not seen written up. I initially had a much longer list o

6 0.10569495 80 hilary mason data-2012-12-28-Getting Started with Data Science

7 0.10395957 84 hilary mason data-2013-01-17-Need Data? Start Here

8 0.093754321 27 hilary mason data-2009-04-02-From the ACM: Learning More About Active Learning

9 0.086017333 82 hilary mason data-2013-01-08-Bitly Social Data APIs

10 0.08292193 32 hilary mason data-2009-08-29-Do you do human subject research?

11 0.078252539 31 hilary mason data-2009-08-12-My NYC Python Meetup Presentation: Practical Data Analysis in Python

12 0.078037776 87 hilary mason data-2013-01-28-Startups: Why to Share Data with Academics

13 0.069374517 110 hilary mason data-2013-10-06-What Mugshots Mean For Public Data

14 0.068830043 24 hilary mason data-2009-01-31-WordPress tip: Move comments from one post to another post

15 0.063741133 81 hilary mason data-2013-01-03-Interview Questions for Data Scientists

16 0.063221879 33 hilary mason data-2009-10-03-Hadoop World NYC

17 0.063216187 75 hilary mason data-2012-08-22-DataGotham: The Empire State of Data

18 0.058739241 49 hilary mason data-2010-11-10-Machine Learning: A Love Story

19 0.057000551 40 hilary mason data-2010-02-16-Conference: Search and Social Media 2010

20 0.054870948 105 hilary mason data-2013-07-05-Speaking: Spend at least 1-3 of the time practicing the talk

similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, -0.203), (1, -0.089), (2, -0.028), (3, -0.073), (4, -0.064), (5, 0.223), (6, 0.182), (7, -0.142), (8, 0.045), (9, -0.052), (10, -0.02), (11, -0.015), (12, 0.071), (13, -0.224), (14, -0.177), (15, 0.073), (16, 0.08), (17, 0.037), (18, 0.121), (19, -0.125), (20, 0.061), (21, -0.096), (22, 0.032), (23, -0.094), (24, -0.19), (25, 0.011), (26, 0.137), (27, 0.012), (28, 0.116), (29, 0.022), (30, -0.066), (31, 0.001), (32, -0.067), (33, -0.133), (34, 0.131), (35, -0.028), (36, 0.091), (37, 0.166), (38, -0.032), (39, 0.134), (40, -0.004), (41, 0.025), (42, 0.126), (43, 0.065), (44, 0.038), (45, -0.116), (46, 0.101), (47, 0.078), (48, -0.109), (49, 0.083)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.94658542 99 hilary mason data-2013-04-01-Data Engineering

2 0.66525519 8 hilary mason data-2007-08-19-Curriculum Design as Software Engineering

3 0.45550802 85 hilary mason data-2013-01-19-Startups: How to Share Data with Academics

4 0.4117465 84 hilary mason data-2013-01-17-Need Data? Start Here

Introduction: Need Data? Start Here Posted: January 17, 2013 | Author: Hilary Mason | Filed under: projects | Tags: data, dataset | 12 Comments » Data scientists need data, and good data is hard to find. I put together this bitly bundle of research quality data sets to collect as many useful data sets as possible in one place. The list includes such exciting and diverse things as spam, belly buttons, item pricing, social media, and face recognition, so you know there’s something that will intrigue anyone. Have one to add? Let me know! (I’ve shared the bundle before, but this post can act as unofficial homepage for it.)

5 0.37070325 42 hilary mason data-2010-04-18-Stop talking, start coding

6 0.33526003 80 hilary mason data-2012-12-28-Getting Started with Data Science

7 0.32712907 110 hilary mason data-2013-10-06-What Mugshots Mean For Public Data

8 0.32660383 82 hilary mason data-2013-01-08-Bitly Social Data APIs

9 0.31316856 27 hilary mason data-2009-04-02-From the ACM: Learning More About Active Learning

10 0.30935267 31 hilary mason data-2009-08-12-My NYC Python Meetup Presentation: Practical Data Analysis in Python

11 0.30492842 115 hilary mason data-2014-02-14-Play with your food!

12 0.29497054 76 hilary mason data-2012-08-28-How do you prioritize research?

13 0.25795317 21 hilary mason data-2008-09-26-What am I like? How about you?

14 0.25760397 87 hilary mason data-2013-01-28-Startups: Why to Share Data with Academics

15 0.25452712 75 hilary mason data-2012-08-22-DataGotham: The Empire State of Data

16 0.2375838 29 hilary mason data-2009-05-07-I’m on Jon Udell’s Interviews with Innovators!

17 0.22135426 33 hilary mason data-2009-10-03-Hadoop World NYC

18 0.20587324 81 hilary mason data-2013-01-03-Interview Questions for Data Scientists

19 0.20037901 32 hilary mason data-2009-08-29-Do you do human subject research?

20 0.19639295 49 hilary mason data-2010-11-10-Machine Learning: A Love Story

similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(2, 0.049), (56, 0.134), (87, 0.685)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.97295928 38 hilary mason data-2009-12-24-IgniteNYC: The video!

Introduction: IgniteNYC: The video! Posted: December 24, 2009 | Author: hilary | Filed under: academics, blog | Tags: presentation, python | 15 Comments » The video of my IgniteNYC presentation is up, and has gotten a great response! I’m working on removing the me-specific bits from the code and I’ll be posting it as open-source very soon!

same-blog 2 0.96630824 99 hilary mason data-2013-04-01-Data Engineering

3 0.25895056 42 hilary mason data-2010-04-18-Stop talking, start coding

4 0.21475375 8 hilary mason data-2007-08-19-Curriculum Design as Software Engineering

5 0.20942959 105 hilary mason data-2013-07-05-Speaking: Spend at least 1-3 of the time practicing the talk

Introduction: Speaking: Spend at least 1/3 of the time practicing the talk Posted: July 5, 2013 | Author: Hilary Mason | Filed under: speaking | 3 Comments » This week we welcome a guest contribution. Matthew Trentacoste is a recovering academic and a computer scientist at Adobe, where he writes software to make pretty pictures. He’s constantly curious, often about data, and cooks a lot. You can follow his exploits at @mattttrent. In Hilary’s last post, she made the point that your slides != your talk. In a well-crafted talk, your message — in the form of the words you say — needs to dominate while the slides need to play a supporting role. Speak the important parts, and use your slides as a backdrop for what you’re saying. Hilary has provided a valuable strategy in her post, but how should someone approach crafting such a clearly-organized presentation? If you’re just getting started speaking, it can be a real challenge to make a coherent talk and along with slid

6 0.2066033 46 hilary mason data-2010-08-15-Should you attend Hadoop World? Yes.

7 0.20443851 114 hilary mason data-2013-12-18-Using Twitter’s Lead-Gen Card to Recruit Beta Testers

8 0.20282924 87 hilary mason data-2013-01-28-Startups: Why to Share Data with Academics

9 0.2003192 109 hilary mason data-2013-09-30-Need actual random numbers? Meet the NIST randomness beacon.

10 0.20016088 58 hilary mason data-2011-06-22-My Head is Open Source!

11 0.19613238 7 hilary mason data-2007-07-30-Tip: How to Search Google for Ideas

12 0.19164452 76 hilary mason data-2012-08-28-How do you prioritize research?

13 0.18265811 43 hilary mason data-2010-05-27-E-mail automation, questions and answers

14 0.17638916 33 hilary mason data-2009-10-03-Hadoop World NYC

15 0.17366478 80 hilary mason data-2012-12-28-Getting Started with Data Science

16 0.16328406 82 hilary mason data-2013-01-08-Bitly Social Data APIs

17 0.16150127 106 hilary mason data-2013-08-12-DataGotham 2013 is coming!

18 0.15988171 74 hilary mason data-2012-08-19-Why I love New York City

19 0.15744741 81 hilary mason data-2013-01-03-Interview Questions for Data Scientists

20 0.15671365 21 hilary mason data-2008-09-26-What am I like? How about you?