nips nips2009 nips2009-233 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin Van Durme, Ashwin Lall
Abstract: Recent work has led to the ability to perform space efficient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real world data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. [sent-2, score-0.163]
2 We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. [sent-3, score-0.105]
3 1 Introduction Recent work has led to the ability to perform space efficient counting over large vocabularies [Talbot, 2009; Van Durme and Lall, 2009]. [sent-5, score-0.15]
4 We explore what a data structure of this type means for the computation of associativity scores, or pointwise mutual information, in a streaming context. [sent-8, score-0.163]
5 We show that approximate k-best PMI rank lists may be maintained online, with high accuracy, both in theory and in practice. [sent-9, score-0.19]
6 Further, we assume that the sets X and Y are so large that it is infeasible to explicitly maintain precise counts for every such pair on a single machine. [sent-12, score-0.058]
7 We define the pointwise mutual information (PMI) of a pair x and y to be PMI(x, y) ≡ lg [P(x, y) / (P(x)P(y))], where these (empirical) probabilities are computed over a particular data set of interest. [sent-15, score-0.405]
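As a concrete illustration (ours, not part of the paper), the empirical PMI can be computed directly from raw counts gathered over a stream of n pairs; the helper name below is our own, and the base-2 logarithm matches the paper's "lg":

    # Minimal sketch (our illustration): empirical PMI from raw counts.
    from math import log2

    def pmi(c_xy, c_x, c_y, n):
        # PMI(x, y) = lg [ P(x, y) / (P(x) P(y)) ], with probabilities
        # estimated from counts over a stream of n pairs.
        return log2((c_xy / n) / ((c_x / n) * (c_y / n)))

    # e.g. counts c(x,y)=3, c(x)=4, c(y)=3 over n=5 pairs give lg(5/4) ~ 0.32.
    print(pmi(3, 4, 3, 5))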
8 Our goal in this paper is to estimate these top-k sets in a streaming fashion, i.e., in a single pass over the data. [sent-18, score-0.062]
9 For example, the data is being accessed by crawling the web and it is infeasible to buffer all the crawled results. [sent-25, score-0.191]
10 As mentioned earlier, there has been considerable work in keeping track of the counts of a large number of items succinctly. [sent-26, score-0.081]
11 Trigger Models Rosenfeld [1994] was interested in collecting trigger pairs (A, B), such that the presence of A in a document is likely to “trigger” an occurrence of B. [sent-36, score-0.092]
12 There the concern was in finding the most useful triggers overall, and thus pairs were favored based on high average mutual information: I(A, B) = P(AB) lg [P(AB)/(P(A)P(B))] + P(AB̄) lg [P(AB̄)/(P(A)P(B̄))] + P(ĀB) lg [P(ĀB)/(P(Ā)P(B))] + P(ĀB̄) lg [P(ĀB̄)/(P(Ā)P(B̄))]. [sent-37, score-1.418]
13 If we take x to range over cues, and y to be a memory structure, then in our work here we are storing the identities of the top-k memory structures for a given cue x, according to strength of associativity. [sent-44, score-0.094]
14 An obvious first attempt at an algorithm for this problem is to use approximate counters to estimate the PMI for each pair in the stream and maintain the top-k for each x using a priority queue. [sent-46, score-0.09]
15 (Rosenfeld: “. . . unlike in a bigram model, where the number of different consecutive word pairs is much less than [the vocabulary] V^2, the number of word pairs where both words occurred in the same document is a significant fraction of V^2.”) [sent-49, score-0.143]
17 Example 1 (probability of y changes): Consider the stream xy xy xy xz wz | wy wy wy wy wy which we have divided in half. [sent-54, score-0.631]
18 After the first half, y is best for x since PMI(x, y) = lg [(3/5)/((4/5)(3/5))] = lg(5/4) and PMI(x, z) = lg [(1/5)/((4/5)(2/5))] = lg(5/8). [sent-55, score-1.34]
19 At the end of the second half of the stream, z is best for x since PMI(x, y) = lg [(3/10)/((4/10)(8/10))] ≈ lg(0.94), whereas PMI(x, z) = lg [(1/10)/((4/10)(2/10))] = lg(1.25). [sent-56, score-0.67]
20 However, during the second half of the stream we never encounter x and hence never update its value. [sent-59, score-0.197]
21 Example 2 (probability of x changes): Consider the stream pd py py xy xd in which we are interested in only the top PMI tuples for x. [sent-63, score-0.265]
22 When we see xy in the stream, PMI(x, y) = lg [(1/4)/((1/4)(3/4))] ≈ lg(1.33), and when we see xd in the stream, PMI(x, d) = lg [(1/5)/((2/5)(2/5))] = lg(1.25), so an algorithm holding on to xy’s earlier score would keep xy ranked above xd. [sent-64, score-0.738]
24 However, xy’s PMI is now PMI(x, y) = lg [(1/5)/((2/5)(3/5))] = lg(0.833), which means that we should replace xy with xd. [sent-68, score-0.67]
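To make the two examples concrete, the short script below (our own check, not from the paper) recomputes the PMI values from the given streams, treating each two-letter token as one pair and P(a) as the fraction of pairs containing a:

    from math import log2

    def pmi_from_stream(pairs, a, b):
        # Empirical PMI of (a, b) over a list of two-character pair tokens.
        n = len(pairs)
        c_ab = sum(1 for p in pairs if p == a + b)
        c_a = sum(1 for p in pairs if a in p)
        c_b = sum(1 for p in pairs if b in p)
        return log2((c_ab / n) / ((c_a / n) * (c_b / n)))

    # Example 1: y is best for x after the first half, z after the full stream.
    half = "xy xy xy xz wz".split()
    full = "xy xy xy xz wz wy wy wy wy wy".split()
    print(pmi_from_stream(half, "x", "y"), pmi_from_stream(half, "x", "z"))  # lg(5/4) > lg(5/8)
    print(pmi_from_stream(full, "x", "y"), pmi_from_stream(full, "x", "z"))  # lg(0.9375) < lg(1.25)

    # Example 2: xd's PMI of lg(1.25) exceeds xy's updated PMI of lg(0.833).
    ex2 = "pd py py xy xd".split()
    print(pmi_from_stream(ex2, "x", "y"), pmi_from_stream(ex2, "x", "d"))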
26 Theorem 1: Any algorithm that explicitly maintains the top-k PMIs for all x ∈ X in a stream of length at most n (where |X| ∈ o(n)) in a single pass requires Ω(n|X|) time. [sent-75, score-0.221]
27 We will prove this theorem using the following lemma: Lemma 1: Any algorithm that explicitly maintains the top-k PMIs of |X| = p + 1 items over a stream of length at most n = 2r + 2p + 1 in a single pass requires Ω(pr) time. [sent-76, score-0.244]
28 Proof of Lemma 1: Let us take the length of the stream to be n, where we assume without loss of generality that n is odd. [sent-77, score-0.221]
29 The stream is: x1 y1, x1 y2, . . . , xp y1, xp y2, xp+1 y1, followed by xp+1 y2, xp+1 y2, xp+1 y1, xp+1 y1, xp+1 y2, xp+1 y2, . . . , alternating in blocks of two, r times. [sent-87, score-0.141]
30 After xp+1 y1 appears in the stream for the first time, it should be evident that all the elements of Xp have a higher PMI with y2 than y1 . [sent-96, score-0.219]
31 Now, the current PMI for each element of Xp must be correct at any point in the stream since the stream may terminate at any time. [sent-99, score-0.417]
32 Hence, the top PMI entry for each of x1, . . . , xp will change at least r times in the course of this stream, for a total of at least pr operations. [sent-103, score-0.165]
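As a sanity check of the lower-bound intuition (our own simulation, following our reading of the construction above; the helper names and parameter values are ours), one can build the adversarial stream and count how often the best y for each of x1, . . . , xp changes:

    from collections import Counter
    from math import log2

    def best_y(pairs, x, candidates):
        # Empirically best y for x (highest PMI) over a prefix of the stream.
        n = len(pairs)
        cx = Counter(a for a, _ in pairs)
        cy = Counter(b for _, b in pairs)
        cxy = Counter(pairs)
        def pmi(y):
            return log2((cxy[(x, y)] / n) / ((cx[x] / n) * (cy[y] / n)))
        return max(candidates, key=pmi)

    p, r = 5, 20
    stream = [(f"x{i}", y) for i in range(1, p + 1) for y in ("y1", "y2")]
    stream.append((f"x{p+1}", "y1"))
    for j in range(r):                                 # r alternating blocks of two
        stream += [(f"x{p+1}", "y2" if j % 2 == 0 else "y1")] * 2

    flips, best = 0, {}
    for t in range(2 * p + 1, len(stream) + 1):        # examine each prefix
        for i in range(1, p + 1):
            b = best_y(stream[:t], f"x{i}", ["y1", "y2"])
            if f"x{i}" in best and b != best[f"x{i}"]:
                flips += 1
            best[f"x{i}"] = b
    print(flips)                                       # on the order of p * r

Any exact algorithm must reflect each of these changes, which is where the Ω(pr) work comes from.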
33 5 Algorithm The lower bound from the previous section shows that, when solving the PMI problem, the best one can do is effectively cross-check the PMI for every possible x ∈ X for each item in the stream. [sent-111, score-0.062]
34 Our algorithm uses approximate counting to retain the counts of all pairs of items x, y in a data structure Cxy . [sent-116, score-0.278]
35 We keep exact counts of all x and y since this takes considerably less space. [sent-117, score-0.058]
36 We assume Cxy to be based on recent work in space efficient counting methods for streamed text data [Talbot, 2009; Van Durme and Lall, 2009]. [sent-119, score-0.128]
37 For our implementation we used TOMB counters [Van Durme and Lall, 2009] which approximate counts by storing values in log-scale. [sent-120, score-0.18]
38 These log-scale counts are maintained in unary within layers of Bloom filters [Bloom, 1970] (Figure 1) that can be probabilistically updated using a small base (Figure 2); each occurrence of an item in the stream prompts a probabilistic update to its value, dependent on the base. [sent-121, score-0.31]
39 By tuning this base, one can trade off between the accuracy of the counts and the space savings of approximate counting. [sent-122, score-0.093]
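The following is a simplified, self-contained sketch (ours) of the probabilistic log-scale counting idea; the real TOMB structure additionally packs these unary, log-scale values into layers of Bloom filters, which we omit here, and the class name is our own:

    import random

    class LogScaleCounter:
        # Morris-style counter: stores a level v whose estimated count is
        # (base**v - 1) / (base - 1); a smaller base trades space for accuracy.
        def __init__(self, base=2.0):
            self.base = base
            self.v = 0

        def increment(self):
            # Each occurrence triggers an update with probability base**(-v).
            if random.random() < self.base ** (-self.v):
                self.v += 1

        def estimate(self):
            return (self.base ** self.v - 1) / (self.base - 1)

    # Example: counting 10,000 occurrences approximately.
    c = LogScaleCounter(base=1.5)
    for _ in range(10000):
        c.increment()
    print(c.estimate())   # unbiased estimate of 10,000; individual runs vary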
40 To deal with changing probabilities (i.e., the issue in Example 2 in the previous section), we divide the stream up into fixed-size buffers B and re-compute the PMIs for all pairs seen within each buffer (see Algorithm 1). [sent-130, score-0.422]
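A rough Python sketch (ours) of the buffered procedure: the pair_counts object with increment/estimate methods is a hypothetical stand-in for Cxy (e.g. backed by log-scale counters as sketched above), and the buffer here is measured in pairs rather than documents:

    import heapq
    from collections import defaultdict
    from math import log2

    def streaming_topk_pmi(stream, k, buffer_size, pair_counts):
        # c(x) and c(y) are kept exactly; c(x, y) comes from pair_counts.
        c_x, c_y = defaultdict(int), defaultdict(int)
        topk = defaultdict(list)          # x -> list of (pmi, y), at most k long
        n, buffer_pairs = 0, set()

        def recompute():
            # Refresh PMIs for pairs seen in this buffer, and reinsert entries
            # already held in the affected rank lists.
            per_x = defaultdict(set)
            for (x, y) in buffer_pairs:
                per_x[x].add(y)
            for x, ys in per_x.items():
                ys |= {y for (_, y) in topk[x]}
                scored = []
                for y in ys:
                    est = max(pair_counts.estimate(x, y), 1)   # guard against zero
                    scored.append((log2((est / n) / ((c_x[x] / n) * (c_y[y] / n))), y))
                topk[x] = heapq.nlargest(k, scored)
            buffer_pairs.clear()

        for (x, y) in stream:
            n += 1
            c_x[x] += 1
            c_y[y] += 1
            pair_counts.increment(x, y)
            buffer_pairs.add((x, y))
            if n % buffer_size == 0:
                recompute()
        recompute()                        # final buffer, cf. line 13 of Algorithm 1
        return topk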
41 Updating the counts for x, for y, and for the pair x, y is constant time per element in the stream. [sent-131, score-0.081]
42 Insertion into a k-best priority queue requires O(lg k) operations. [sent-132, score-0.063]
43 Per interval, we perform in the worst case one insertion per new element observed, along with one insertion for each element stored in the previous rank lists. [sent-133, score-0.183]
44 As long as |B| ≥ |X|k, updating rank lists costs O(|B|lg k) per interval. [sent-134, score-0.126]
45 The algorithm therefore requires O(n + n lg k) = O(n lg k) time, where n is the length of the stream. [sent-135, score-0.694]
46 A benefit of our algorithm is that this can be kept significantly smaller than |X| × |Y|, since in practice |Y| ≫ lg k. [sent-139, score-0.335]
47 That is, the extra cost for reinserting elements from the previous rank lists is amortized over the buffer length. [sent-142, score-0.317]
48 In giving a bound on this error, we will make two assumptions: (i) the PMI for a given x follows a Zipfian distribution (something that we observed in our data), and (ii) the items in the stream are drawn independently from some underlying distribution (i.e., the stream is i.i.d.). [sent-152, score-0.256]
49 Both these assumptions together help us to sidestep the lower bound proved earlier and demonstrate that our single-pass algorithm will perform well on real language data sets. [sent-158, score-0.064]
50 We first make the observation that, for any y in the set of top-k PMIs for x, if x, y appears in the final buffer then we are guaranteed that y is correctly placed in the top-k at the end. [sent-159, score-0.213]
51 This is because we recompute PMIs for all the pairs in the last buffer at the end of the algorithm (line 13 of Algorithm 1). [sent-160, score-0.225]
52 The probability that x, y does not appear in the last buffer can be bounded using the i.i.d. assumption. [sent-161, score-0.191]
53 We do so by studying the last buffer in which x, y appears. [sent-168, score-0.191]
54 The only way that y′ can displace y in the top-k for x in our algorithm is if at the end of this buffer the following holds true: ct(x, y′)/ct(y′) > ct(x, y)/ct(y), where the t subscript denotes the respective counts at the end of the buffer. [sent-169, score-0.594]
55 Only with small probability (at most 1/2^m) will the last buffer containing x, y appear before the midpoint of the stream. [sent-174, score-0.213]
56 So, let us assume that the buffer appears after the midpoint of the stream. [sent-175, score-0.235]
57 Then, the probability that x, y′ appears more than (1 + δ)c(x, y′)/2 times by this point can be bounded by the Chernoff bound to be at most exp(−c(x, y′)δ^2/8). [sent-176, score-0.058]
58 Putting all these together, we get that Pr[ ct(x, y′)/ct(y′) > (1 + δ)c(x, y′) / ((1 − δ)c(y′)) ] < 1/2^m + exp(−c(x, y′)δ^2/8) + exp(−c(y′)δ^2/4). [sent-178, score-0.152]
59 Let us take the rank of the PMI of y′ to be i (and recall that the rank of the PMI of y is at most k). [sent-180, score-0.13]
60 We can now put all these results together to bound the probability of the event that y′ displaces y: Pr[ ct(x, y′)/ct(y′) > ct(x, y)/ct(y) ] ≤ 1/2^m + exp(−c(x, y′)δ^2/8) + exp(−c(y′)δ^2/4), where we take δ = (2^(((i/k)^s − 1)·PMI(x,y′)) − 1) / (2^(((i/k)^s − 1)·PMI(x,y′)) + 1). [sent-183, score-0.34]
61 Taking a union bound across all possible y′ ∈ Y gives a bound of 1/2^m + |Y|(exp(−c(x, y′)δ^2/8) + exp(−c(y′)δ^2/4)). [sent-185, score-0.072]
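For intuition (our own numerical illustration; the parameter values are invented), the union bound can be evaluated directly, with δ left as an input since its exact value depends on the Zipf parameter s and the PMI gap:

    from math import exp

    def misclassification_bound(m, Y, c_xy_prime, c_y_prime, delta):
        # 1/2^m + |Y| ( exp(-c(x,y') d^2 / 8) + exp(-c(y') d^2 / 4) )
        return (1.0 / 2 ** m
                + Y * (exp(-c_xy_prime * delta ** 2 / 8)
                       + exp(-c_y_prime * delta ** 2 / 4)))

    # With counts of a few hundred and a sizable gap (delta ~ 0.7), the bound is
    # already tiny even for a vocabulary of 50,000 candidate y's.
    print(misclassification_bound(m=100, Y=50_000, c_xy_prime=500, c_y_prime=1_000, delta=0.7))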
62 6 Experiments We evaluated our algorithm for online, k-best PMI with a set of experiments on collecting verbal triggers in a document collection. [sent-186, score-0.105]
63 For each unique verb x observed in the stream, our goal was to recover the top-k verbs y with the highest PMI given x. [sent-190, score-0.107]
64 Tokens were tagged for part of speech (POS) using SVMTool [Giménez and Màrquez, 2004], a POS tagger based on SVMlight [Joachims, 1999]. [sent-193, score-0.089]
65 Our stream was constructed by considering all pairwise combinations of the roughly 82 (on average) verb tokens occurring in each document. [sent-194, score-0.288]
66 We will refer to such non-approximate counting as perfect counting. [sent-198, score-0.201]
67 We did not heavily tune our counting mechanism to this task, other than to experiment with a few different bases (settling on a base of 1. [sent-200, score-0.128]
68 As such, empirical results for approximate counting should be taken as a lower bound, while the perfect counting results are the upper bound on what an approximate counter might achieve. (For streams composed such as described in our experiments, the misclassification probability bound becomes powerful as m approaches 100 or beyond, recalling that both c(x, y′), c(y′) > m.) [sent-202, score-0.199]
69 This can be viewed as a restricted version of the experiments of Chambers and Jurafsky [2008], where we consider all verb pairs, regardless of whether they are assumed to possess a co-referent argument. [sent-207, score-0.068]
70 If fully enumerated as text, this stream would have required 12GB of uncompressed storage. [sent-209, score-0.197]
72 Figure 3(b): Accuracy of top-5 rank lists using the standard measurement, and when using an instrumented counter that had oracle access to which x, y pairs were above threshold. [sent-218, score-0.201]
73 Table 1: When using a perfect counter and a buffer of 50, 500 and 5,000 documents, for k = 1, 5, 10: the accuracy of the resultant k-best lists when compared to the first k, k + 1 and k + 2 true values. [sent-219, score-0.5]
75 We measured the accuracy of resultant k-best lists by first collecting the true top-50 elements for each x, offline, to be used as a key. [sent-248, score-0.12]
76 This allowed us to determine whether an element reported at, e.g., rank 9, should really be of rank 12, rather than rank 50. [sent-253, score-0.195]
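A small sketch (ours; the function and the example verb lists are hypothetical) of this measurement: compare each reported k-best list against the precomputed true top-50 key, counting how many reported items fall within the first k, k+1, or k+2 true entries:

    def rank_list_accuracy(reported, true_top50, depth):
        # Fraction of reported items found within the first `depth` entries
        # of the offline key (depth = k, k+1 or k+2, as in Table 1).
        key = set(true_top50[:depth])
        return sum(1 for y in reported if y in key) / len(reported)

    reported = ["detonate", "assassinate", "explode", "kill", "injure"]
    true_key = ["detonate", "explode", "assassinate", "kill", "wound", "injure"]
    print(rank_list_accuracy(reported, true_key, depth=5))   # 0.8
    print(rank_list_accuracy(reported, true_key, depth=6))   # 1.0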
77 1 Results In Table 1 we see that when using a perfect counter, our algorithm succeeds in recovering almost all top-k elements. [sent-257, score-0.073]
78 For example, when k = 5, reading 500 documents at a time, our rank lists are at least 97% accurate. [sent-258, score-0.126]
79 As there appears to be minimal impact based on buffer size, we fixed |B| = 500 documents for the remainder of our experiments. [sent-261, score-0.213]
80 This result supports the intuition behind our misclassification probability bound: while it is possible for an adversary to construct a stream that would mislead our online algorithm, this seems to rarely occur in practice. [sent-262, score-0.229]
81 Shown in Figure 3(b) are the accuracy results when using an approximate counter and a buffer size of 500 documents, to collect top-5 rank lists. [sent-263, score-0.437]
82 The standard result is based on comparing the rank lists to the key just as with the results when using a perfect counter. [sent-265, score-0.199]
83 A problem with this evaluation is that the hard threshold used for both generating the key, and the results for perfect counting, cannot be guaranteed to hold when using approximate counts. [sent-266, score-0.108]
84 It is possible that the approximate counter incorrectly dips below this threshold for terms that the gold standard would never see. (Strictly speaking, |B| is no larger than the maximum-length interval in the stream resulting from enumerating the contents of, e.g., 500 documents.) [sent-267, score-0.221]
85 Left columns are based on using a perfect counter, while right columns are based on an approximate counter. [sent-271, score-0.108]
86 Numeral prefixes denote rank of element in true top-k lists. [sent-272, score-0.088]
87 All results are with respect to a buffer of 500 documents. [sent-273, score-0.191]
88 For purposes of comparison, we instrumented the approximate solution to use a perfect counter in parallel. [sent-276, score-0.309]
89 All PMI values were computed as before, using approximate counts, but the perfect counter was used just in verifying whether a given pair exceeded the threshold. [sent-277, score-0.254]
90 In this way the approximate counting solution saw just those elements of the stream as observed in the perfect counting case, allowing us to evaluate the ranking error introduced by the counter, irrespective of issues in “dipping below” the threshold. [sent-278, score-0.561]
91 As seen in the instrumented curve, top-5 rank lists generated when using the approximate counter are composed primarily of elements truly ranked 10 or below. [sent-279, score-0.362]
92 2 Examples Figure 2 contains the top-5 most associated verbs according to our algorithm, both when using a perfect and an approximate counter. [sent-281, score-0.147]
93 As can be seen for the perfect counter, and as suggested by Table 1, in practice it is possible to track PMI scores over buffered intervals with a very high degree of accuracy. [sent-282, score-0.073]
94 For the examples shown (and more generally throughout the results), the resultant k-best lists are near perfect matches to those computed offline. [sent-283, score-0.163]
95 When using an approximate counter we continue to see reasonable results, with some error introduced due to the use of probabilistic counting. [sent-284, score-0.181]
96 The rank 1 entry reported for x = laughed exemplifies the earlier referenced issue of the approximate counter being able to incorrectly dip below the threshold for terms that the gold standard would never see. [sent-285, score-0.317]
97 We showed that while a precise solution comes at a high cost in the streaming model, there exists a simple algorithm that performs well on real data. [sent-287, score-0.062]
98 An avenue of future work is to drop the assumption that each of the top-k PMI values is maintained explicitly and see whether there is an algorithm that is feasible for the streaming version of the problem or if a similar lower bound still applies. [sent-288, score-0.127]
99 An experiment of Schooler and Anderson [1997] assumed words in NYTimes headlines operated as cues for the retrieval of memory structures associated with co-occurring terms. [sent-290, score-0.058]
100 [Giménez and Màrquez, 2004] Jesús Giménez and Lluís Màrquez. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of LREC, 2004. [sent-320, score-0.082]
wordName wordTfidf (topN-words)
[('pmi', 0.738), ('lg', 0.335), ('stream', 0.197), ('buffer', 0.191), ('counter', 0.146), ('xp', 0.141), ('counting', 0.128), ('pmis', 0.096), ('church', 0.084), ('durme', 0.082), ('ct', 0.076), ('perfect', 0.073), ('talbot', 0.072), ('lall', 0.068), ('xy', 0.068), ('verb', 0.068), ('rank', 0.065), ('streaming', 0.062), ('lists', 0.061), ('chambers', 0.06), ('counts', 0.058), ('rosenfeld', 0.055), ('zip', 0.055), ('bloom', 0.055), ('counters', 0.055), ('instrumented', 0.055), ('pmik', 0.055), ('ab', 0.051), ('rochester', 0.048), ('schooler', 0.048), ('cxy', 0.048), ('hanks', 0.048), ('wy', 0.046), ('triggers', 0.044), ('bomb', 0.041), ('chklovski', 0.041), ('displace', 0.041), ('gim', 0.041), ('laughed', 0.041), ('nez', 0.041), ('priority', 0.041), ('verbs', 0.039), ('pointwise', 0.038), ('jurafsky', 0.037), ('bound', 0.036), ('insertion', 0.036), ('approximate', 0.035), ('pairs', 0.034), ('mutual', 0.032), ('storing', 0.032), ('online', 0.032), ('document', 0.031), ('trigger', 0.031), ('associativity', 0.031), ('memory', 0.031), ('gold', 0.03), ('collecting', 0.03), ('resultant', 0.029), ('pos', 0.029), ('maintained', 0.029), ('language', 0.028), ('ashwin', 0.027), ('assassinate', 0.027), ('detonate', 0.027), ('latches', 0.027), ('narrative', 0.027), ('nytimes', 0.027), ('overridden', 0.027), ('override', 0.027), ('panang', 0.027), ('pantel', 0.027), ('rquez', 0.027), ('snickered', 0.027), ('svmtool', 0.027), ('tickle', 0.027), ('tickled', 0.027), ('tickling', 0.027), ('vetoed', 0.027), ('vetoing', 0.027), ('cues', 0.027), ('item', 0.026), ('linguistics', 0.024), ('length', 0.024), ('qq', 0.024), ('graff', 0.024), ('tagged', 0.024), ('hy', 0.024), ('tagger', 0.024), ('pr', 0.024), ('items', 0.023), ('element', 0.023), ('tokens', 0.023), ('appears', 0.022), ('list', 0.022), ('word', 0.022), ('queue', 0.022), ('vocabularies', 0.022), ('midpoint', 0.022), ('increment', 0.022), ('osborne', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 233 nips-2009-Streaming Pointwise Mutual Information
Author: Benjamin Van Durme, Ashwin Lall
Abstract: Recent work has led to the ability to perform space efficient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real world data. 1
2 0.065004788 234 nips-2009-Streaming k-means approximation
Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni
Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method. 1
3 0.047790773 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation
Author: Jie Luo, Barbara Caputo, Vittorio Ferrari
Abstract: Given a corpus of news items consisting of images accompanied by text captions, we want to find out “who’s doing what”, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint ‘face and pose’ model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images. 1
4 0.047041953 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning
Author: Anne Hsu, Thomas L. Griffiths
Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1
5 0.044787101 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank
Author: Wei Chen, Tie-yan Liu, Yanyan Lan, Zhi-ming Ma, Hang Li
Abstract: Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking functions by minimizing loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking functions. In this work, we reveal the relationship between ranking measures and loss functions in learningto-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that the loss functions of these methods are upper bounds of the measurebased ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss for ranking as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. Experimental results on benchmark datasets show that the modifications can lead to better ranking performances, demonstrating the correctness of our theoretical analysis. 1
6 0.043189321 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry
7 0.040744502 190 nips-2009-Polynomial Semantic Indexing
8 0.040424109 78 nips-2009-Efficient Moments-based Permutation Tests
9 0.035035793 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
10 0.03440946 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
11 0.033773214 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
12 0.031769656 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization
13 0.031163495 87 nips-2009-Exponential Family Graph Matching and Ranking
14 0.030406464 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
15 0.030214049 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
16 0.030101653 136 nips-2009-Learning to Rank by Optimizing NDCG Measure
17 0.027749918 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
18 0.027451647 14 nips-2009-A Parameter-free Hedging Algorithm
19 0.027154742 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
20 0.026468864 260 nips-2009-Zero-shot Learning with Semantic Output Codes
topicId topicWeight
[(0, -0.088), (1, 0.019), (2, -0.01), (3, -0.038), (4, 0.007), (5, -0.012), (6, -0.057), (7, -0.046), (8, 0.004), (9, 0.005), (10, 0.019), (11, -0.02), (12, 0.011), (13, -0.005), (14, 0.029), (15, 0.021), (16, 0.006), (17, -0.008), (18, -0.032), (19, 0.032), (20, -0.03), (21, 0.008), (22, 0.034), (23, 0.001), (24, -0.034), (25, 0.021), (26, 0.054), (27, -0.015), (28, 0.037), (29, -0.001), (30, -0.082), (31, -0.037), (32, -0.022), (33, -0.012), (34, 0.014), (35, 0.041), (36, 0.029), (37, -0.039), (38, -0.021), (39, 0.144), (40, 0.064), (41, 0.05), (42, -0.017), (43, 0.084), (44, 0.017), (45, -0.111), (46, -0.02), (47, 0.043), (48, 0.086), (49, 0.115)]
simIndex simValue paperId paperTitle
same-paper 1 0.87762737 233 nips-2009-Streaming Pointwise Mutual Information
Author: Benjamin V. Durme, Ashwin Lall
Abstract: Recent work has led to the ability to perform space efficient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real world data. 1
2 0.68402404 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning
Author: Anne Hsu, Thomas L. Griffiths
Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1
3 0.55948704 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation
Author: Jie Luo, Barbara Caputo, Vittorio Ferrari
Abstract: Given a corpus of news items consisting of images accompanied by text captions, we want to find out “who’s doing what”, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint ‘face and pose’ model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images. 1
4 0.45343366 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities
Author: Matthew Wilder, Matt Jones, Michael C. Mozer
Abstract: Across a wide range of cognitive tasks, recent experience influences behavior. For example, when individuals repeatedly perform a simple two-alternative forcedchoice task (2AFC), response latencies vary dramatically based on the immediately preceding trial sequence. These sequential effects have been interpreted as adaptation to the statistical structure of an uncertain, changing environment (e.g., Jones and Sieck, 2003; Mozer, Kinoshita, and Shettel, 2007; Yu and Cohen, 2008). The Dynamic Belief Model (DBM) (Yu and Cohen, 2008) explains sequential effects in 2AFC tasks as a rational consequence of a dynamic internal representation that tracks second-order statistics of the trial sequence (repetition rates) and predicts whether the upcoming trial will be a repetition or an alternation of the previous trial. Experimental results suggest that first-order statistics (base rates) also influence sequential effects. We propose a model that learns both first- and second-order sequence properties, each according to the basic principles of the DBM but under a unified inferential framework. This model, the Dynamic Belief Mixture Model (DBM2), obtains precise, parsimonious fits to data. Furthermore, the model predicts dissociations in behavioral (Maloney, Martello, Sahm, and Spillmann, 2005) and electrophysiological studies (Jentzsch and Sommer, 2002), supporting the psychological and neurobiological reality of its two components. 1
5 0.36811861 234 nips-2009-Streaming k-means approximation
Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni
Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method. 1
6 0.3466565 190 nips-2009-Polynomial Semantic Indexing
7 0.34125727 87 nips-2009-Exponential Family Graph Matching and Ranking
8 0.33515731 196 nips-2009-Quantification and the language of thought
9 0.33352888 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
10 0.32612318 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
11 0.32119346 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory
12 0.30922654 182 nips-2009-Optimal Scoring for Unsupervised Learning
13 0.30613145 244 nips-2009-The Wisdom of Crowds in the Recollection of Order Information
14 0.30355653 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
15 0.30191982 114 nips-2009-Indian Buffet Processes with Power-law Behavior
16 0.29697573 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
17 0.28599483 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models
18 0.28080198 24 nips-2009-Adapting to the Shifting Intent of Search Queries
19 0.2775054 112 nips-2009-Human Rademacher Complexity
20 0.27693021 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry
topicId topicWeight
[(1, 0.012), (7, 0.01), (24, 0.06), (25, 0.057), (35, 0.032), (36, 0.083), (37, 0.011), (39, 0.032), (58, 0.045), (61, 0.024), (71, 0.067), (80, 0.377), (81, 0.022), (86, 0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.75748932 233 nips-2009-Streaming Pointwise Mutual Information
Author: Benjamin V. Durme, Ashwin Lall
Abstract: Recent work has led to the ability to perform space efficient, approximate counting over large vocabularies in a streaming context. Motivated by the existence of data structures of this type, we explore the computation of associativity scores, otherwise known as pointwise mutual information (PMI), in a streaming context. We give theoretical bounds showing the impracticality of perfect online PMI computation, and detail an algorithm with high expected accuracy. Experiments on news articles show our approach gives high accuracy on real world data. 1
2 0.56750941 100 nips-2009-Gaussian process regression with Student-t likelihood
Author: Jarno Vanhatalo, Pasi Jylänki, Aki Vehtari
Abstract: In the Gaussian process regression the observation model is commonly assumed to be Gaussian, which is convenient in computational perspective. However, the drawback is that the predictive accuracy of the model can be significantly compromised if the observations are contaminated by outliers. A robust observation model, such as the Student-t distribution, reduces the influence of outlying observations and improves the predictions. The problem, however, is the analytically intractable inference. In this work, we discuss the properties of a Gaussian process regression model with the Student-t likelihood and utilize the Laplace approximation for approximate inference. We compare our approach to a variational approximation and a Markov chain Monte Carlo scheme, which utilize the commonly used scale mixture representation of the Student-t distribution. 1
3 0.47459656 211 nips-2009-Segmenting Scenes by Matching Image Composites
Author: Bryan Russell, Alyosha Efros, Josef Sivic, Bill Freeman, Andrew Zisserman
Abstract: In this paper, we investigate how, given an image, similar images sharing the same global description can help with unsupervised scene segmentation. In contrast to recent work in semantic alignment of scenes, we allow an input image to be explained by partial matches of similar scenes. This allows for a better explanation of the input scenes. We perform MRF-based segmentation that optimizes over matches, while respecting boundary information. The recovered segments are then used to re-query a large database of images to retrieve better matches for the target regions. We show improved performance in detecting the principal occluding and contact boundaries for the scene over previous methods on data gathered from the LabelMe database.
4 0.39117068 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black
Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1
5 0.3897 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
Author: Ruslan Salakhutdinov
Abstract: Markov random fields (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.
6 0.38880119 260 nips-2009-Zero-shot Learning with Semantic Output Codes
7 0.38668132 71 nips-2009-Distribution-Calibrated Hierarchical Classification
8 0.38544208 97 nips-2009-Free energy score space
9 0.38499263 129 nips-2009-Learning a Small Mixture of Trees
10 0.38377553 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
11 0.38318139 29 nips-2009-An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism
12 0.38294342 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
13 0.38206273 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
14 0.38201627 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
15 0.38147762 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
16 0.38146287 226 nips-2009-Spatial Normalized Gamma Processes
17 0.38122496 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
18 0.38080478 204 nips-2009-Replicated Softmax: an Undirected Topic Model
19 0.38037565 72 nips-2009-Distribution Matching for Transduction
20 0.37933418 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction