217 hunch net-2006-11-06-Data Linkage Problems


meta info for this blog

Source: html

Introduction: Data linkage is a problem which seems to come up in various applied machine learning problems. I have heard it mentioned in various data mining contexts, but it seems relatively less studied for systemic reasons. A very simple version of the data linkage problem is a cross hospital patient record merge. Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. Later, John Doe is admitted to a second hospital (Health General), treated, and released. Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” These examples could be fed into a learning algorithm, and we could attempt to predict whether a return occurs.
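To make the easy case concrete, here is a minimal sketch of that merge, assuming a shared patient_id and admission timestamps; all field names and records are hypothetical stand-ins, not anything from the post.

```python
# Minimal sketch of the shared-identifier case: with a common patient_id and
# timestamps, building "did the patient return within a year?" examples is a
# plain merge. All field names and records here are hypothetical.
from datetime import datetime, timedelta

general_health = [
    {"patient_id": 17, "admitted": datetime(2006, 1, 10),
     "symptoms": "chest pain", "treatment": "stent"},
]
health_general = [
    {"patient_id": 17, "admitted": datetime(2006, 7, 2),
     "symptoms": "chest pain", "treatment": "medication"},
]

def make_examples(records_a, records_b, horizon=timedelta(days=365)):
    """Label each admission 1 if the same patient is admitted again at
    either hospital within the horizon, else 0."""
    merged = sorted(records_a + records_b, key=lambda r: r["admitted"])
    examples = []
    for i, rec in enumerate(merged):
        returned = any(
            later["patient_id"] == rec["patient_id"]
            and later["admitted"] - rec["admitted"] <= horizon
            for later in merged[i + 1:]
        )
        examples.append(((rec["symptoms"], rec["treatment"]), int(returned)))
    return examples

print(make_examples(general_health, health_general))
# [(('chest pain', 'stent'), 1), (('chest pain', 'medication'), 0)]
```

The whole difficulty of the post is that without the shared patient_id this merge is no longer well defined.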


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Data linkage is a problem which seems to come up in various applied machine learning problems. [sent-1, score-0.544]

2 A very simple version of the data linkage problem is a cross hospital patient record merge. [sent-3, score-1.24]

3 Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. [sent-4, score-0.574]

4 Later, John Doe is admitted to a second hospital (Health General), treated, and released. [sent-5, score-0.402]

5 Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. [sent-6, score-0.509]

6 This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. [sent-7, score-0.419]

7 We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” [sent-8, score-0.676]

8 These examples could be fed into a learning algorithm, and we could attempt to predict whether a return occurs. [sent-9, score-0.238]

9 The problem is that General Health and Health General don’t have any shared unique identifier for John Doe. [sent-10, score-0.419]

10 Although this is just one example, data linkage problems seem to be endemic to learning applications. [sent-13, score-0.567]

11 Sometimes minor changes to what information is recorded can strongly disambiguate. [sent-15, score-0.164]

12 For example, there is a big difference between recording the pages visited at a website versus tracking the sequence of pages visited. [sent-16, score-0.292]
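A tiny illustration of that difference, with a hypothetical browsing session:

```python
# The same hypothetical session recorded two ways. The set answers "which
# pages were visited?"; only the sequence can answer "what happened after
# page X?", which is what tracking the consequences of decisions requires.
session = ["home", "product", "checkout", "home"]
pages_visited = set(session)  # {'home', 'product', 'checkout'} -- order lost
page_sequence = session       # order and the return visit to 'home' kept
```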

13 The essential thing to think about when designing the information to record is: How will I track the consequences of decisions? [sent-17, score-0.243]

14 First predict which records should be linked, based upon a smaller dataset that is hand checked. [sent-19, score-0.401]
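A minimal sketch of that first stage, under some assumptions not in the post: hand-labeled record pairs, two crude agreement features (name similarity and birthday match), and scikit-learn's LogisticRegression standing in for whatever classifier one prefers.

```python
# Stage one of the two-stage approach: learn "should these records be
# linked?" from a small hand-checked set of pairs. Records, features, and
# the choice of classifier are all illustrative assumptions.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def pair_features(rec_a, rec_b):
    """Crude agreement features between two patient records."""
    name_sim = SequenceMatcher(None, rec_a["name"], rec_b["name"]).ratio()
    same_birthday = float(rec_a["birthday"] == rec_b["birthday"])
    return [name_sim, same_birthday]

# Hand-checked pairs: (record from hospital A, record from hospital B, linked?)
labeled_pairs = [
    ({"name": "John Doe", "birthday": "1970-01-01"},
     {"name": "Jon Doe", "birthday": "1970-01-01"}, 1),
    ({"name": "John Doe", "birthday": "1970-01-01"},
     {"name": "Jane Roe", "birthday": "1982-05-17"}, 0),
]
X = [pair_features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]
link_model = LogisticRegression().fit(X, y)

# The predicted link probability for a new candidate pair then feeds the
# second stage (predicting treatment outcomes over the merged records).
candidate = pair_features({"name": "J. Doe", "birthday": "1970-01-01"},
                          {"name": "John Doe", "birthday": "1970-01-01"})
print(link_model.predict_proba([candidate])[0, 1])
```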

15 A common approach to improving performance is turning a double approximation (given x predict y, given y predict z) into a single approximation (given x predict z). [sent-29, score-0.982]
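Schematically, for this problem (f, g, and h below are hypothetical learned models, not anything the post specifies):

```python
# Double approximation: two separately learned stages, so linkage errors
# made by f are inherited (and possibly amplified) by g.
def double_approximation(x, f, g):
    y_hat = f(x)      # given x (record pairs), predict y (the linkage)
    return g(y_hat)   # given y (merged records), predict z (the outcome)

# Single approximation: one map learned directly from (x, z) pairs, never
# committing to a hard intermediate linkage.
def single_approximation(x, h):
    return h(x)
```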

16 Devising a single-approximation method here is tricky because we have ancillary information about the intermediate prediction. [sent-30, score-0.455]

17 The Bayesian approach of “specify a prior, then use Bayes law to get a posterior, then predict with the posterior” is attractive here because we often have strong prior beliefs about at least the linkage portion of the problem. [sent-32, score-0.785]
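A toy instance of that recipe for the linkage portion, with made-up numbers: a strong prior that a random cross-hospital pair is not the same person, updated by how likely the observed field agreements are under each hypothesis.

```python
# Toy Bayes-law update for "do these two records refer to the same person?".
# Every number below is invented for illustration; real values might be
# estimated from admission statistics and the hand-checked pairs.
prior_match = 0.01  # prior belief: most cross-hospital pairs are not a match

# Likelihood of the evidence (names agree, birthdays agree) under each
# hypothesis, assuming independence across fields for simplicity.
p_evidence_given_match = 0.9 * 0.95       # agreements expected if a match
p_evidence_given_nonmatch = 0.01 * 0.003  # same agreements by coincidence

# Bayes law: P(match | evidence) is proportional to
# P(evidence | match) * P(match).
numerator = p_evidence_given_match * prior_match
denominator = numerator + p_evidence_given_nonmatch * (1 - prior_match)
posterior_match = numerator / denominator
print(f"P(match | evidence) = {posterior_match:.3f}")  # about 0.997
```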

18 The data linkage problem also makes very clear the tension between privacy and machine learning. [sent-34, score-0.677]

19 And yet, linking records can result in unexpectedly large pools of information on individuals. [sent-36, score-0.381]

20 Furthermore, explicitly sensitive information (like credit card numbers) might easily be the most useful bit of information for linkage. [sent-37, score-0.39]
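One standard partial mitigation, not something the post itself proposes: link on a keyed hash (HMAC) of the sensitive field, so equality is preserved across hospitals while the raw value is never stored. A plain unkeyed hash would not suffice, since the small space of valid card numbers invites a dictionary attack.

```python
# Keyed-hash linkage token: both hospitals derive the same token from the
# same card number without storing the number itself. The key is a
# hypothetical shared secret agreed upon out of band.
import hashlib
import hmac

SHARED_KEY = b"example-secret-key"

def link_token(card_number: str) -> str:
    return hmac.new(SHARED_KEY, card_number.encode(), hashlib.sha256).hexdigest()

# Records from the two hospitals can now be joined on the token.
assert link_token("4111111111111111") == link_token("4111111111111111")
```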


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('linkage', 0.488), ('hospital', 0.287), ('health', 0.263), ('records', 0.217), ('doe', 0.209), ('predict', 0.184), ('patient', 0.172), ('information', 0.164), ('link', 0.144), ('identifier', 0.139), ('john', 0.131), ('unique', 0.125), ('approximation', 0.118), ('admitted', 0.115), ('treated', 0.115), ('outcomes', 0.108), ('general', 0.099), ('shared', 0.099), ('posterior', 0.088), ('pages', 0.088), ('given', 0.086), ('data', 0.079), ('cross', 0.079), ('record', 0.079), ('improved', 0.07), ('intermediate', 0.062), ('card', 0.062), ('recording', 0.062), ('customized', 0.062), ('verified', 0.062), ('predictor', 0.059), ('systemic', 0.057), ('birthday', 0.057), ('representing', 0.057), ('stage', 0.057), ('attractive', 0.057), ('ancillary', 0.057), ('index', 0.057), ('linked', 0.057), ('treatments', 0.057), ('problem', 0.056), ('prior', 0.056), ('single', 0.054), ('suggestion', 0.054), ('tension', 0.054), ('versus', 0.054), ('jump', 0.054), ('fed', 0.054), ('contexts', 0.054), ('turning', 0.054)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0000002 217 hunch net-2006-11-06-Data Linkage Problems

2 0.18895254 430 hunch net-2011-04-11-The Heritage Health Prize

Introduction: The Heritage Health Prize is potentially the largest prediction prize yet at $3M, which is sure to get many people interested. Several elements of the competition may be worth discussing. The most straightforward way for HPN to deploy this predictor is in determining who to cover with insurance. This might easily cover the costs of running the contest itself, but the value to the health system as a whole is minimal, as people not covered still exist. While HPN itself is a provider network, they have active relationships with a number of insurance companies, and the right to resell any entrant. It’s worth keeping in mind that the research and development may nevertheless end up being useful in the longer term, especially as entrants also keep the right to their code. The judging metric is something I haven’t seen previously. If a patient has probability 0.5 of being in the hospital 0 days and probability 0.5 of being in the hospital ~53.6 days, the optimal prediction in e

3 0.1554908 165 hunch net-2006-03-23-The Approximation Argument

Introduction: An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply 2. The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior: P(D|x) = P(x|D)P(D)/P(x) After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss. This basic idea is reused thousands of times with various choices of P(D) and loss functions, which is unsurprising given the many nice properties: There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method.

4 0.11637949 237 hunch net-2007-04-02-Contextual Scaling

Introduction: Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. The standard development path for a machine learning application in practice seems to be the following: Marginal. In the beginning, there was “majority vote”. At this stage, it isn’t necessary to understand that you have a prediction problem. People just realize that one answer is right sometimes and another answer other times. In machine learning terms, this corresponds to making a prediction without side information. First context. A clever person realizes that some bit of information x_1 could be helpful. If x_1 is discrete, they condition on it and make a predictor h(x_1), typically by counting. If they are clever, then they also do some smoothing. If x_1 is some real valued parameter, it’s very common to make a threshold cutoff. Often, these tasks are simply done by hand. Second. Another clever person (or perhaps the s

5 0.11438876 212 hunch net-2006-10-04-Health of Conferences Wiki

Introduction: Aaron Hertzmann points out the health of conferences wiki, which has a great deal of information about how many different conferences function.

6 0.11222901 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

7 0.10172231 143 hunch net-2005-12-27-Automated Labeling

8 0.10124215 12 hunch net-2005-02-03-Learning Theory, by assumption

9 0.10109858 369 hunch net-2009-08-27-New York Area Machine Learning Events

10 0.096520014 235 hunch net-2007-03-03-All Models of Learning have Flaws

11 0.09627381 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

12 0.093401454 218 hunch net-2006-11-20-Context and the calculation misperception

13 0.085647956 34 hunch net-2005-03-02-Prior, “Prior” and Bias

14 0.085383162 406 hunch net-2010-08-22-KDD 2010

15 0.084387735 411 hunch net-2010-09-21-Regretting the dead

16 0.078903452 160 hunch net-2006-03-02-Why do people count for learning?

17 0.07659965 19 hunch net-2005-02-14-Clever Methods of Overfitting

18 0.076444283 260 hunch net-2007-08-25-The Privacy Problem

19 0.075362727 149 hunch net-2006-01-18-Is Multitask Learning Black-Boxable?

20 0.073674269 207 hunch net-2006-09-12-Incentive Compatible Reviewing


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, 0.19), (1, 0.065), (2, -0.026), (3, 0.013), (4, 0.004), (5, -0.021), (6, -0.032), (7, 0.048), (8, 0.081), (9, -0.08), (10, -0.086), (11, 0.055), (12, 0.017), (13, 0.024), (14, -0.008), (15, -0.08), (16, -0.034), (17, -0.028), (18, 0.044), (19, 0.014), (20, -0.027), (21, -0.04), (22, 0.109), (23, -0.009), (24, -0.049), (25, 0.04), (26, 0.056), (27, 0.119), (28, 0.002), (29, 0.007), (30, -0.02), (31, -0.063), (32, -0.005), (33, -0.03), (34, 0.026), (35, -0.004), (36, 0.005), (37, -0.001), (38, -0.105), (39, 0.036), (40, 0.03), (41, 0.05), (42, 0.043), (43, 0.027), (44, -0.074), (45, -0.018), (46, 0.001), (47, -0.077), (48, 0.01), (49, -0.119)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.95010155 217 hunch net-2006-11-06-Data Linkage Problems

2 0.68475592 165 hunch net-2006-03-23-The Approximation Argument

3 0.68275607 237 hunch net-2007-04-02-Contextual Scaling

4 0.65429246 430 hunch net-2011-04-11-The Heritage Health Prize

5 0.64588386 260 hunch net-2007-08-25-The Privacy Problem

Introduction: Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems. It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might be obvious: Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person then no one can”. That doesn’t work as demonstrated by the people who were found circumstantially from the AOL data. … then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings

6 0.64186931 314 hunch net-2008-08-24-Mass Customized Medicine in the Future?

7 0.58597577 150 hunch net-2006-01-23-On Coding via Mutual Information & Bayes Nets

8 0.57638747 253 hunch net-2007-07-06-Idempotent-capable Predictors

9 0.57581383 157 hunch net-2006-02-18-Multiplication of Learned Probabilities is Dangerous

10 0.56310284 263 hunch net-2007-09-18-It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

11 0.55586183 312 hunch net-2008-08-04-Electoralmarkets.com

12 0.55337304 160 hunch net-2006-03-02-Why do people count for learning?

13 0.54350775 102 hunch net-2005-08-11-Why Manifold-Based Dimension Reduction Techniques?

14 0.5324325 348 hunch net-2009-04-02-Asymmophobia

15 0.52775538 272 hunch net-2007-11-14-BellKor wins Netflix

16 0.52060568 298 hunch net-2008-04-26-Eliminating the Birthday Paradox for Universal Features

17 0.51470703 235 hunch net-2007-03-03-All Models of Learning have Flaws

18 0.51341087 191 hunch net-2006-07-08-MaxEnt contradicts Bayes Rule?

19 0.51301736 60 hunch net-2005-04-23-Advantages and Disadvantages of Bayesian Learning

20 0.51266605 277 hunch net-2007-12-12-Workshop Summary—Principles of Learning Problem Design


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(3, 0.022), (10, 0.024), (13, 0.024), (27, 0.21), (38, 0.055), (53, 0.021), (55, 0.063), (74, 0.309), (77, 0.034), (84, 0.027), (94, 0.091), (95, 0.039)]

similar blogs list:

simIndex simValue blogId blogTitle

1 0.92490023 278 hunch net-2007-12-17-New Machine Learning mailing list

Introduction: IMLS (which is the nonprofit running ICML) has set up a new mailing list for Machine Learning News. The list address is ML-news@googlegroups.com, and signup requires a Google account (which you can create). Only members can send messages.

2 0.89187032 404 hunch net-2010-08-20-The Workshop on Cores, Clusters, and Clouds

Introduction: Alekh , John , Ofer , and I are organizing a workshop at NIPS this year on learning in parallel and distributed environments. The general interest level in parallel learning seems to be growing rapidly, so I expect quite a bit of attendance. Please join us if you are parallel-interested. And, if you are working in the area of parallel learning, please consider submitting an abstract due Oct. 17 for presentation at the workshop.

same-blog 3 0.85976237 217 hunch net-2006-11-06-Data Linkage Problems

4 0.71359909 67 hunch net-2005-05-06-Don’t mix the solution into the problem

Introduction: A common defect of many pieces of research is defining the problem in terms of the solution. Here are some examples in learning: “The learning problem is finding a good separating hyperplane.” “The goal of learning is to minimize (y-p)^2 + C w^2 where y = the observation, p = the prediction and w = a parameter vector.” Defining the loss function to be the one that your algorithm optimizes rather than the one imposed by the world. The fundamental reason why this is a defect is that it creates artificial boundaries to problem solution. Artificial boundaries lead to the possibility of being blind-sided. For example, someone committing (1) or (2) above might find themselves surprised to find a decision tree working well on a problem. Example (3) might result in someone else solving a learning problem better for real world purposes, even if it’s worse with respect to the algorithm optimization. This defect should be avoided so as to not artificially l

5 0.70706713 370 hunch net-2009-09-18-Necessary and Sufficient Research

Introduction: Researchers are typically confronted with big problems that they have no idea how to solve. In trying to come up with a solution, a natural approach is to decompose the big problem into a set of subproblems whose solution yields a solution to the larger problem. This approach can go wrong in several ways. Decomposition failure. The solution to the decomposition does not in fact yield a solution to the overall problem. Artificial hardness. The subproblems created are sufficient, if solved, to solve the overall problem, but they are harder than necessary. As you can see, computational complexity forms a relatively new (in research-history) razor by which to judge an approach sufficient but not necessary. In my experience, the artificial hardness problem is very common. Many researchers abdicate the responsibility of choosing a problem to work on to other people. This process starts very naturally as a graduate student, when an incoming student might have relatively l

6 0.63337767 10 hunch net-2005-02-02-Kolmogorov Complexity and Googling

7 0.62561369 36 hunch net-2005-03-05-Funding Research

8 0.61312926 235 hunch net-2007-03-03-All Models of Learning have Flaws

9 0.61003393 95 hunch net-2005-07-14-What Learning Theory might do

10 0.60912848 259 hunch net-2007-08-19-Choice of Metrics

11 0.60907733 360 hunch net-2009-06-15-In Active Learning, the question changes

12 0.60842615 351 hunch net-2009-05-02-Wielding a New Abstraction

13 0.60730827 337 hunch net-2009-01-21-Nearly all natural problems require nonlinearity

14 0.60681772 422 hunch net-2011-01-16-2011 Summer Conference Deadline Season

15 0.60497177 343 hunch net-2009-02-18-Decision by Vetocracy

16 0.60429758 51 hunch net-2005-04-01-The Producer-Consumer Model of Research

17 0.6042707 406 hunch net-2010-08-22-KDD 2010

18 0.6034134 435 hunch net-2011-05-16-Research Directions for Machine Learning and Algorithms

19 0.60340869 286 hunch net-2008-01-25-Turing’s Club for Machine Learning

20 0.60323977 132 hunch net-2005-11-26-The Design of an Optimal Research Environment