nips nips2004 nips2004-22 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ting Liu, Andrew W. Moore, Ke Yang, Alexander G. Gray
Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same randomprojection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. [sent-4, score-0.512]
2 Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). [sent-5, score-0.413]
3 In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? [sent-6, score-0.467]
4 We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. [sent-7, score-0.34]
5 We also introduce new approximate k-NN search algorithms on this structure. [sent-8, score-0.343]
6 1 Introduction The k-nearest-neighbor searching problem is to find, in a dataset X ⊂ R^D containing n points, the k points nearest to a query point q ∈ R^D, usually under the Euclidean distance. [sent-12, score-0.572]
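For concreteness, a brute-force reference implementation of this definition (a minimal sketch; the function name and the use of numpy are our own, not taken from the paper):

    import numpy as np

    def knn_brute_force(X, q, k=1):
        # Exact k-NN by scanning all n points: O(n * D) distance work per query.
        dists = np.linalg.norm(X - q, axis=1)   # Euclidean distances from q to every point
        idx = np.argsort(dists)[:k]             # indices of the k smallest distances
        return idx, dists[idx]

    # usage sketch: X = np.random.rand(100000, 60); q = np.random.rand(60)
    #               idx, d = knn_brute_force(X, q, k=10)

Every structure discussed below is trying to beat this linear scan while returning the same, or nearly the same, answer.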
7 Many efficient exact search structures work well when the dimensionality is modest (up to the 10’s), such as kd-trees [8] and metric trees. [sent-18, score-0.149]
8 However, many real-world problems are posed with very large dimensionalities, beyond the capability of such search structures to achieve sub-linear efficiency; in computer vision, for example, each pixel of an image represents a dimension. [sent-20, score-0.339]
9 One approach to dealing with this apparent intractability has been to define a different problem, the (1 + ε) approximate k-nearest-neighbor searching problem, which returns points whose distance from the query is no more than (1 + ε) times the distance of the true kth nearest-neighbor. [sent-23, score-0.433]
10 Further, the problem is often relaxed to require this only with high probability, with no certificate property telling the user when it has failed to do so, and no guarantee on the actual rank of the distance of the points returned, which may be arbitrarily far from k [4]. [sent-24, score-0.147]
11 Another commonly used modification to the problem is to perform the search under the L1 norm rather than L2 . [sent-25, score-0.263]
12 Roughly speaking, a locality sensitive hashing function has the property that if two points are “close,” then they hash to the same bucket with “high” probability; if they are “far apart,” then they hash to the same bucket with “low” probability. [sent-28, score-0.496]
13 By defining an LSH scheme, namely a (r, r(1 + ε), p1 , p2 )-sensitive hash family, the (1 + ε)-NN problem can be solved by performing a series of hashing operations and searching within the resulting buckets. [sent-32, score-0.229]
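As an illustration of the idea only (this is not the LSH implementation of [9]; the hash family, parameter values and class name below are our own), a sketch of an L2 locality sensitive hashing index built from random projections:

    import numpy as np

    class SimpleL2LSH:
        # Each table uses m concatenated hashes h(x) = floor((a.x + b) / w) with
        # a ~ N(0, I) and b ~ U[0, w); n_tables independent tables are probed at query time.
        def __init__(self, dim, n_tables=10, m=4, w=4.0, seed=0):
            rng = np.random.default_rng(seed)
            self.w = w
            self.A = rng.normal(size=(n_tables, m, dim))      # random projection directions
            self.B = rng.uniform(0.0, w, size=(n_tables, m))  # random offsets
            self.tables = [dict() for _ in range(n_tables)]
            self.X = None

        def _keys(self, x):
            # one integer tuple (bucket id) per table
            return [tuple(np.floor((A_t @ x + B_t) / self.w).astype(int))
                    for A_t, B_t in zip(self.A, self.B)]

        def index(self, X):
            self.X = np.asarray(X, dtype=float)
            for i, x in enumerate(self.X):
                for t, key in enumerate(self._keys(x)):
                    self.tables[t].setdefault(key, []).append(i)

        def query(self, q, k=1):
            # collect candidates from the buckets q hashes to, then scan only those
            cand = set()
            for t, key in enumerate(self._keys(q)):
                cand.update(self.tables[t].get(key, []))
            if not cand:
                return []
            cand = np.fromiter(cand, dtype=int)
            d = np.linalg.norm(self.X[cand] - q, axis=1)
            return cand[np.argsort(d)[:k]].tolist()

Concatenating m elementary hashes per table makes spurious collisions rare, and using several independent tables makes a truly close pair collide in at least one of them with good probability.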
14 The authors of [23, 28] have found (1 + ε) approximation to be useful, for example when the k-nearest-neighbor search is just one component in a large system with many parts, each of which can be highly inaccurate. [sent-36, score-0.263]
15 In this paper we explore the extent to which the most successful exact search structures can be adapted to perform (1 + ε) approximate high-dimensional searches. [sent-37, score-0.424]
16 A notable previous approach along this line is a simple modification of kd-trees [3] – ours takes the more powerful metric trees as a starting point. [sent-38, score-0.223]
17 We next review metric trees, then introduce a variant, known as spill trees. [sent-39, score-0.24]
18 1 Metric Trees The metric tree [29, 25, 5] is a data structure that supports efficient nearest neighbor search. [sent-41, score-0.538]
19 We briefly review its construction here: a metric tree organizes a set of points in a spatial hierarchical manner. [sent-42, score-0.302]
20 The root node represents all points, and the points represented by an internal node v are partitioned into two subsets, represented by its two children. [sent-44, score-0.379]
21 Formally, if we use N(v) to denote the set of points represented by node v, and use v.lc [sent-45, score-0.238]
22 and v.rc to denote the left child and the right child of node v, then we have N(v) = N(v.lc) ∪ N(v.rc) for every internal node. [sent-47, score-0.265]
23 At the lowest level, each leaf node contains very few points. [sent-52, score-0.169]
24 The key to building a metric-tree is how to partition a node v. [sent-54, score-0.177]
25 Once the two pivot points v.lpv and v.rpv are found, we can go ahead and partition node v. [sent-71, score-0.177]
26 It is known as the decision boundary, since all points to the left of L belong to v.lc and all points to the right of L belong to v.rc. [sent-80, score-0.238]
27 Then we search for the point that is farthest from p and set it to be v.lpv. [sent-82, score-0.302]
28 In our implementation we take A to be the mid point of v.lpv and v.rpv (i.e., A = (v.lpv + v.rpv)/2) instead, since it is more efficient to compute, and in practice, we can still have a metric tree of depth O(log n). [sent-96, score-0.287]
29 Each node v also has a hypersphere B, such that all points represented by v fall in the ball centered at v.center with radius v.r, i.e., N(v) ⊆ B(v.center, v.r). [sent-97, score-0.272]
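A sketch of the construction just described, under our own naming and with the centroid ball as the bounding hypersphere (the paper's exact choice of ball center may differ):

    import numpy as np

    def build_metric_tree(points, leaf_size=20):
        # Pivots come from the farthest-point heuristic; the split plane L passes
        # through the mid point of the pivots; each node stores a bounding ball.
        points = np.asarray(points, dtype=float)
        center = points.mean(axis=0)
        radius = np.linalg.norm(points - center, axis=1).max()
        if len(points) <= leaf_size:
            return {'kind': 'leaf', 'points': points, 'center': center, 'radius': radius}
        p = points[0]                                                     # arbitrary start point
        lpv = points[np.argmax(np.linalg.norm(points - p, axis=1))]       # farthest from p
        rpv = points[np.argmax(np.linalg.norm(points - lpv, axis=1))]     # farthest from lpv
        u = rpv - lpv                                                     # splitting direction
        A = 0.5 * (lpv + rpv)                                             # mid point, on plane L
        side = (points - A) @ u                                           # signed side of L
        left, right = points[side <= 0], points[side > 0]
        if len(left) == 0 or len(right) == 0:                             # degenerate split: stop
            return {'kind': 'leaf', 'points': points, 'center': center, 'radius': radius}
        return {'kind': 'internal', 'u': u, 'A': A, 'center': center, 'radius': radius,
                'left': build_metric_tree(left, leaf_size),
                'right': build_metric_tree(right, leaf_size)}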
30 A search on a metric-tree is simply a guided DFS (for simplicity, we assume that k = 1). [sent-106, score-0.263]
31 The decision boundary L is used to decide which child node to search first. [sent-107, score-0.562]
32 At all times, the algorithm maintains a “candidate NN”, which is the nearest neighbor it has found so far while traversing the tree. [sent-111, score-0.333]
33 If DFS is about to expand a node v, but discovers that no member of v can be within distance r of q (where r is the distance to the candidate NN), then it prunes this node, i.e., skips it together with all of its descendants. [sent-113, score-0.332]
34 We have found that in practice, metric tree search typically finds a very good NN candidate quickly, and then spends up to 95% of the time verifying that it is in fact the true NN. [sent-123, score-0.507]
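A sketch of MT-DFS for k = 1, using the node layout of the construction sketch above; the pruning test is exactly the "no member of v can be within distance r of q" check described earlier:

    import numpy as np

    def mtdfs(node, q, best_d=np.inf, best_x=None):
        # Guided depth-first 1-NN search with ball pruning; expects nodes from build_metric_tree.
        # prune: no point inside this node's ball can beat the current candidate
        if np.linalg.norm(node['center'] - q) - node['radius'] >= best_d:
            return best_d, best_x
        if node['kind'] == 'leaf':
            d = np.linalg.norm(node['points'] - q, axis=1)
            i = int(np.argmin(d))
            if d[i] < best_d:
                best_d, best_x = float(d[i]), node['points'][i]
            return best_d, best_x
        # visit the child on q's side of the decision boundary L first ...
        near, far = (('left', 'right') if (q - node['A']) @ node['u'] <= 0
                     else ('right', 'left'))
        best_d, best_x = mtdfs(node[near], q, best_d, best_x)
        # ... then backtrack into the other child, unless its ball gets pruned on entry
        best_d, best_x = mtdfs(node[far], q, best_d, best_x)
        return best_d, best_x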
35 2 Spill-Trees A spill-tree (sp-tree) is a variant of metric-trees in which the children of a node can “spill over” onto each other, and contain shared datapoints. [sent-126, score-0.197]
36 As in a metric tree, we first choose two pivots v.lpv and v.rpv, and find the decision boundary L that goes through the mid point A. Next, we define two new separating planes, LL and LR, both of which are parallel to L and at distance τ from L. [sent-138, score-0.191]
37 Then, all the points to the right of plane LL belong to the child v.rc, [sent-139, score-0.258]
38 and all the points to the left of plane LR belong to the child v.lc. [sent-140, score-0.258]
39 We call this region the overlapping buffer, and we call τ the overlapping size. [sent-147, score-0.292]
40 For both v.lc and v.rc, we can repeat the splitting procedure, until the number of points within a node is less than a specific threshold, at which point we stop. [sent-150, score-0.238]
41 The overlap obviously makes both the construction and MT-DFS less efficient than on regular metric-trees, since the points in the overlapping buffer may be searched twice. [sent-152, score-0.469]
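A sketch of the overlapping split (names ours; pivots are assumed distinct): points whose signed distance to L is within τ fall in the buffer and are handed to both children.

    import numpy as np

    def spill_split(points, tau):
        # Split with an overlapping buffer of half-width tau around the decision boundary L.
        points = np.asarray(points, dtype=float)
        p = points[0]
        lpv = points[np.argmax(np.linalg.norm(points - p, axis=1))]
        rpv = points[np.argmax(np.linalg.norm(points - lpv, axis=1))]
        u = rpv - lpv
        u = u / np.linalg.norm(u)              # unit splitting direction, so tau is a distance
        A = 0.5 * (lpv + rpv)
        s = (points - A) @ u                   # signed distance to plane L
        left = points[s <= tau]                # everything left of (or on) plane LR -> v.lc
        right = points[s > -tau]               # everything strictly right of plane LL -> v.rc
        return left, right                     # points with |s| < tau appear in both children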
42 Nonetheless, the advantage of sp-trees over metric-trees becomes clear when we perform the defeatist search, a (1 + ε)-NN search algorithm based on sp-trees. [sent-153, score-0.489]
43 Based on this observation, a quick revision would be to descend the metric tree using the decision boundaries at each level without backtracking, and then output the point x in the first leaf node it visits as the NN of query q. [sent-156, score-0.5]
44 We call this the defeatist search on a metric-tree. [sent-157, score-0.518]
45 Since the depth of a metric-tree is O(log n), the complexity of defeatist search is O(log n) per query. [sent-158, score-0.571]
46 Consider the case where q is very close to a decision boundary L: it is then almost equally likely that the NN of q is on the same side of L as on the opposite side, and the defeatist search can make a mistake with probability close to 1/2. [sent-160, score-0.585]
47 In practice, we observe that there exists a non-negligible fraction of the query points that are close to one of the decision boundaries. [sent-161, score-0.223]
48 Thus the average accuracy of the defeatist search algorithm is typically unacceptably low, even for approximate NN search. [sent-162, score-0.569]
49 This is precisely the place where sp-trees can help: the defeatist search on sp-trees has much higher accuracy and remains very fast. [sent-163, score-0.489]
50 As before, the decision boundary at node v is plane L. [sent-166, score-0.291]
51 If a query q is to the left of L, we decide that its nearest neighbor is in v.lc and search only that child. [sent-167, score-0.418]
52 Conversely, if q is to the right of L, we only search node v.rc. [sent-173, score-0.404]
53 Notice that in either case, points in the overlapping buffer are always searched. [sent-177, score-0.305]
54 To see this, suppose that q is to the left of L; then the only points eliminated are the ones to the right of plane LR, all of which are at least distance τ away from q. [sent-179, score-0.201]
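A sketch of the defeatist descent (names ours); on a plain metric tree this is the defeatist search criticized above, while on a spill-tree built with the overlapping split it becomes the sp-tree version, since buffer points travel down both branches and are therefore present in whichever leaf the query reaches.

    import numpy as np

    def defeatist_search(node, q):
        # Descend to a single leaf using the decision boundaries only, then scan that leaf.
        # Cost is O(depth + leaf size) per query; there is no backtracking.
        while node['kind'] == 'internal':
            side = (q - node['A']) @ node['u']       # which side of plane L is q on?
            node = node['left'] if side <= 0 else node['right']
        pts = node['points']
        d = np.linalg.norm(pts - q, axis=1)
        i = int(np.argmin(d))
        return pts[i], float(d[i])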
55 2 Hybrid Sp-Tree Search One problem with spill-trees is that their depth varies considerably depending on the overlapping size τ. [sent-181, score-0.24]
56 If τ = 0, an sp-tree reduces to a metric tree with depth O(log n). [sent-182, score-0.287]
57 In other words, both children of node v contain all points of v. [sent-188, score-0.294]
58 For each node v, we first split the points using the overlapping buffer. [sent-193, score-0.383]
59 However, if either of its children contains more than a ρ fraction of the total points in v, we undo the overlapping split. [sent-194, score-0.27]
60 In this way, we can ensure that each split reduces the number of points of a node by at least a constant factor and thus we can maintain the logarithmic depth of the tree. [sent-197, score-0.348]
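A sketch of the balance check layered on the spill_split sketch above; ρ = 0.7 here is only an illustrative value, not necessarily the threshold used in the paper.

    def hybrid_split(points, tau, rho=0.7):
        # Try the overlapping split; if either child holds more than a rho fraction of the
        # points, undo the overlap and fall back to a plain non-overlapping split.
        left, right = spill_split(points, tau)
        n = len(points)
        if max(len(left), len(right)) > rho * n:
            left, right = spill_split(points, 0.0)   # tau = 0: ordinary metric-tree split
            return left, right, False                # mark this node as non-overlapping
        return left, right, True                     # mark this node as overlapping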
61 The NN search on a hybrid sp-tree also becomes a hybrid of the MT-DFS search and the defeatist search. [sent-198, score-1.006]
62 We only do defeatist search on overlapping nodes; for non-overlapping nodes, we still do backtracking as in MT-DFS. [sent-199, score-0.674]
63 If τ = 0, we have a pure sp-tree with defeatist search — very efficient but not accurate enough; [sent-201, score-0.489]
64 if τ ≥ ||v.lpv − v.rpv||/2, then every node is a non-overlapping node (due to the balance threshold mechanism) — in this way we get back to the traditional metric-tree with MT-DFS, which is perfectly accurate but inefficient. [sent-203, score-0.312]
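A sketch of the resulting hybrid search, assuming each internal node records whether its split was left overlapping (for example, the flag returned by the hybrid_split sketch above) in addition to the fields used by the earlier MT-DFS sketch:

    import numpy as np

    def hybrid_search(node, q, best_d=np.inf, best_x=None):
        # Defeatist descent on overlapping nodes, MT-DFS-style backtracking elsewhere.
        if np.linalg.norm(node['center'] - q) - node['radius'] >= best_d:
            return best_d, best_x                        # ball pruning, as in MT-DFS
        if node['kind'] == 'leaf':
            d = np.linalg.norm(node['points'] - q, axis=1)
            i = int(np.argmin(d))
            return (float(d[i]), node['points'][i]) if d[i] < best_d else (best_d, best_x)
        side = (q - node['A']) @ node['u']
        near, far = ('left', 'right') if side <= 0 else ('right', 'left')
        best_d, best_x = hybrid_search(node[near], q, best_d, best_x)
        if not node['overlapping']:                      # non-overlapping node: backtrack
            best_d, best_x = hybrid_search(node[far], q, best_d, best_x)
        return best_d, best_x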
65 As a general rule, the greater τ is, the more accurate and the slower the search algorithm becomes. [sent-205, score-0.263]
66 3 Further Efficiency Improvement Using Random Projection The hybrid sp-tree search algorithm is much more efficient than the traditional MT-DFS algorithm. [sent-207, score-0.39]
67 In some sense, the hybrid sp-tree search algorithm also suffers from the curse of dimensionality, only much less severely than MT-DFS. [sent-209, score-0.441]
68 In particular, the Johnson-Lindenstrauss Lemma [15] states that one can embed a dataset of n points in a subspace of dimension O(log n) with little distortion on the pair-wise distances. [sent-211, score-0.219]
69 In our (1 + ε)-NN search algorithm, we use random projection as a pre-processing step: project the datapoints to a subspace of lower dimension, and then do the hybrid sp-tree search. [sent-213, score-0.54]
70 Both the construction of sp-tree and the search are conducted in the lowdimensional subspace. [sent-214, score-0.263]
71 But we can easily fix this problem by doing multiple rounds of random projections and doing one hybrid sp-tree search for each round. [sent-216, score-0.417]
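A sketch of the random-projection preprocessing and the multiple-round trick; the matrix construction, target dimension and round count are illustrative choices of ours, and search_fn stands in for whatever tree search is built on the projected data.

    import numpy as np

    def random_project(X, d_low, seed=0):
        # Project D-dimensional points to d_low dimensions with a random Gaussian matrix
        # (Johnson-Lindenstrauss style); return the matrix so queries can be projected too.
        rng = np.random.default_rng(seed)
        D = X.shape[1]
        R = rng.normal(size=(D, d_low)) / np.sqrt(d_low)
        return X @ R, R

    def multi_round_nn(X, q, search_fn, rounds=3, d_low=30):
        # One random projection per round; each round returns a candidate index via
        # search_fn(X_projected, q_projected), and we keep the candidate that is
        # closest to q in the ORIGINAL space.
        best_i, best_d = None, np.inf
        for r in range(rounds):
            Xp, R = random_project(X, d_low, seed=r)
            i = search_fn(Xp, q @ R)              # candidate from this round
            d = np.linalg.norm(X[i] - q)          # judge the candidate in the original space
            if d < best_d:
                best_i, best_d = i, d
        return best_i, best_d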
72 4 Experimental Results We report our experimental results based on hybrid sp-tree search on a variety of real-world datasets, with the number of datapoints ranging from 20,000 to 275,465, and dimensions from 60 to 3,838. [sent-221, score-0.509]
73 The first two datasets are the same as the ones used in [9], where it was demonstrated that LSH can have a significant speedup over SR-trees. [sent-222, score-0.178]
74 Aerial Texture feature data contain 275,465 feature vectors of 60 dimensions representing texture information of large aerial photographs [21, 20]. [sent-223, score-0.279]
75 Corel hist 20,000 histograms (64-dimensional) of color thumbnail-sized images taken from the COREL STOCK PHOTO library. [sent-224, score-0.181]
76 This dataset differs significantly from Corel hist and is available from the UCI repository [1]. [sent-230, score-0.229]
77 Besides the sp-tree search algorithm, we also run a number of other algorithms. LSH: The original LSH implementation used in [9] is not public and we were unable to obtain it from the authors. [sent-234, score-0.326]
78 Metric-Tree: This is a highly optimized k-NN search based on metric trees [29, 22], and the code is publicly available [2]. [sent-241, score-0.486]
79 To measure accuracy, we use the effective distance error [3, 9], which is defined as E = (1/|Q|) Σq∈Q (dalg/d∗ − 1), where dalg is the distance from a query q to the NN found by the algorithm, and d∗ is the distance from q to the true NN. [sent-247, score-0.325]
80 For the k-NN case (k > 1), we measure separately the distance ratio between the closest point found and the true nearest neighbor, between the 2nd closest point found and the 2nd nearest neighbor, and so on, and then take the average. [sent-249, score-0.654]
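The two error measures, as we read the definitions above (a sketch; array-based for clarity):

    import numpy as np

    def effective_distance_error(d_alg, d_true):
        # 1-NN case: E = (1/|Q|) * sum over queries of (d_alg / d_true - 1),
        # where both inputs are arrays of distances over the query set Q.
        d_alg, d_true = np.asarray(d_alg, float), np.asarray(d_true, float)
        return float(np.mean(d_alg / d_true - 1.0))

    def effective_distance_error_knn(D_alg, D_true):
        # k-NN case: D_alg[i, j] is the distance to the j-th closest point returned for
        # query i, D_true[i, j] the true j-th NN distance; ratios are taken rank by rank
        # and then averaged.
        D_alg, D_true = np.asarray(D_alg, float), np.asarray(D_true, float)
        return float(np.mean(D_alg / D_true - 1.0))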
81 Table 1: the CPU time (s) of exact SR-tree, Metric-tree, and Naïve search. [sent-255, score-0.481]
Algorithm     Aerial   Corel hist (k=1)   Corel hist (k=10)   Corel uci   Disk trace   Galaxy
Naïve         43620    462                465                 5460        27050        46760
SR-tree       23450    184                330                 3230        n/a          n/a
Metric-tree   3650     58.2               ?                   791         19860        6600
82 All the datasets are rather large, and the metric-tree is consistently the fastest. [sent-257, score-0.177]
83 On the other hand, the SR-tree implementation has only a limited speedup over the Naïve algorithm, and it fails to run on Disk trace and Galaxy, both of which have very high dimensions. [sent-258, score-0.187]
84 Since Metric-tree and SR-tree are both designed for exact NN search, we also run them on randomly chosen subsets of the whole dataset to produce approximate answers. [sent-261, score-0.193]
85 We show the comparison results of all algorithms for the Aerial and the Corel hist datasets, both for k = 1, in Figure 3. [sent-262, score-0.181]
86 In particular, the CPU time and the speedup of sp-tree search over LSH are summarized in Table 2. [sent-264, score-0.369]
87 We do so by examining the speedup of both implementations over SR-tree on the Aerial and Corel hist datasets, with both k = 1 and k = 10. [sent-268, score-0.393]
Table 2: the CPU time (s) of Sp-tree and its speedup (in parentheses) over LSH, at error levels E of 20%, 10%, 5%, 2% and 1% for each dataset (the table entries were not recovered in this extraction).
88 For the Aerial dataset, in the case where E varies from 10% to 20%, the speedup of LSH in [9] over SR-tree varies from 4 to 6, while for our implementation the speedup varies from 4. [sent-310, score-0.335]
89 Perhaps a little surprisingly, the Metric-tree search algorithm (MT-DFS) performs very well on Aerial and Corel hist datasets. [sent-316, score-0.444]
90 Furthermore, the approximate MT-DFS algorithm (conventional metric-tree based search using a random subset of the training data) consistently outperforms LSH across the entire error spectrum on Aerial. [sent-319, score-0.343]
91 But in all cases, sp-tree search remains the fastest among all algorithms, frequently achieving 2 or 3 orders of magnitude in speed-up. [sent-322, score-0.263]
92 Space does not permit a lengthy conclusion, but the summary of this paper is that there is empirical evidence that with appropriate redesign of the data structures and search algorithms, spatial data structures remain a useful tool in the realm of approximate k-NN search. [sent-323, score-0.431]
93 The latter also proposed a data structure similar to the spill-tree, where the decision boundary must be aligned with a coordinate axis and there is no hybrid version. [sent-328, score-0.223]
94 An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. [sent-345, score-0.484]
95 M-tree: An efficient access method for similarity search in metric spaces. [sent-354, score-0.439]
96 Approximate nearest neighbors: towards removing the curse of dimensionality. [sent-400, score-0.225]
97 The SR-tree: an index structure for high-dimensional nearest neighbor queries. [sent-421, score-0.333]
98 An investigation of practical approximate nearest neighbor algorithms (full version). [sent-439, score-0.413]
99 The earth mover’s distance as a metric for image retrieval. [sent-496, score-0.199]
100 Excluded middle vantage point forests for nearest neighbor search. [sent-508, score-0.333]
wordName wordTfidf (topN-words)
[('lsh', 0.453), ('search', 0.263), ('corel', 0.235), ('defeatist', 0.226), ('nn', 0.21), ('aerial', 0.204), ('hist', 0.181), ('nearest', 0.174), ('cpu', 0.162), ('neighbor', 0.159), ('metric', 0.149), ('node', 0.141), ('hybrid', 0.127), ('disk', 0.117), ('overlapping', 0.117), ('lr', 0.108), ('speedup', 0.106), ('points', 0.097), ('buffer', 0.091), ('indyk', 0.091), ('spill', 0.091), ('galaxy', 0.09), ('query', 0.085), ('na', 0.083), ('depth', 0.082), ('approximate', 0.08), ('datapoints', 0.079), ('hash', 0.079), ('hashing', 0.079), ('trees', 0.074), ('datasets', 0.072), ('searching', 0.071), ('backtracking', 0.068), ('child', 0.062), ('dfs', 0.059), ('vldb', 0.059), ('tree', 0.056), ('children', 0.056), ('boundary', 0.055), ('plane', 0.054), ('goldstein', 0.054), ('trace', 0.053), ('uci', 0.052), ('curse', 0.051), ('ll', 0.051), ('locality', 0.05), ('distance', 0.05), ('dataset', 0.048), ('searched', 0.047), ('ve', 0.047), ('dimension', 0.047), ('dalg', 0.045), ('katayama', 0.045), ('metrictree', 0.045), ('mid', 0.045), ('piotr', 0.045), ('prh', 0.045), ('ramakrishnan', 0.045), ('belong', 0.045), ('projection', 0.044), ('structures', 0.044), ('decision', 0.041), ('varies', 0.041), ('dimensions', 0.04), ('bucket', 0.039), ('farthest', 0.039), ('pivot', 0.039), ('spends', 0.039), ('acm', 0.038), ('exact', 0.037), ('partition', 0.036), ('implementation', 0.035), ('texture', 0.035), ('sensitive', 0.034), ('hypersphere', 0.034), ('vision', 0.032), ('ciency', 0.032), ('moore', 0.032), ('dimensionalities', 0.032), ('september', 0.032), ('ef', 0.031), ('balance', 0.03), ('nodes', 0.03), ('symposium', 0.03), ('call', 0.029), ('notice', 0.029), ('split', 0.028), ('dimensional', 0.028), ('run', 0.028), ('conventional', 0.028), ('leaf', 0.028), ('traces', 0.028), ('nds', 0.027), ('subspace', 0.027), ('access', 0.027), ('neighbors', 0.027), ('liu', 0.027), ('queries', 0.027), ('rounds', 0.027), ('comparable', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
Author: Ting Liu, Andrew W. Moore, Ke Yang, Alexander G. Gray
Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same randomprojection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1
2 0.12619428 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
Author: Giorgio Giacinto, Fabio Roli
Abstract: High retrieval precision in content-based image retrieval can be attained by adopting relevance feedback mechanisms. These mechanisms require that the user judges the quality of the results of the query by marking all the retrieved images as being either relevant or not. Then, the search engine exploits this information to adapt the search to better meet user’s needs. At present, the vast majority of proposed relevance feedback mechanisms are formulated in terms of search model that has to be optimized. Such an optimization involves the modification of some search parameters so that the nearest neighbor of the query vector contains the largest number of relevant images. In this paper, a different approach to relevance feedback is proposed. After the user provides the first feedback, following retrievals are not based on knn search, but on the computation of a relevance score for each image of the database. This score is computed as a function of two distances, namely the distance from the nearest non-relevant image and the distance from the nearest relevant one. Images are then ranked according to this score and the top k images are displayed. Reported results on three image data sets show that the proposed mechanism outperforms other state-of-the-art relevance feedback mechanisms. 1 In t rod u ct i on A large number of content-based image retrieval (CBIR) systems rely on the vector representation of images in a multidimensional feature space representing low-level image characteristics, e.g., color, texture, shape, etc. [1]. Content-based queries are often expressed by visual examples in order to retrieve from the database the images that are “similar” to the examples. This kind of retrieval is often referred to as K nearest-neighbor retrieval. It is easy to see that the effectiveness of content-based image retrieval systems (CBIR) strongly depends on the choice of the set of visual features, on the choice of the “metric” used to model the user’s perception of image similarity, and on the choice of the image used to query the database [1]. Typically, if we allow different users to mark the images retrieved with a given query as relevant or non-relevant, different subsets of images will be marked as relevant. Accordingly, the need for mechanisms to adapt the CBIR system response based on some feedback from the user is widely recognized. It is interesting to note that while relevance feedback mechanisms have been first introduced in the information retrieval field [2], they are receiving more attention in the CBIR field (Huang). The vast majority of relevance feedback techniques proposed in the literature is based on modifying the values of the search parameters as to better represent the concept the user bears in mind. To this end, search parameters are computed as a function of the relevance values assigned by the user to all the images retrieved so far. As an example, relevance feedback is often formulated in terms of the modification of the query vector, and/or in terms of adaptive similarity metrics. [3]-[7]. Recently, pattern classification paradigms such as SVMs have been proposed [8]. Feedback is thus used to model the concept of relevant images and adjust the search consequently. Concept modeling may be difficult on account of the distribution of relevant images in the selected feature space. “Narrow domain” image databases allows extracting good features, so that images bearing similar concepts belong to compact clusters. 
On the other hand, “broad domain” databases, such as image collection used by graphic professionals, or those made up of images from the Internet, are more difficult to subdivide in cluster because of the high variability of concepts [1]. In these cases, it is worth extracting only low level, non-specialized features, and image retrieval is better formulated in terms of a search problem rather then concept modeling. The present paper aims at offering an original contribution in this direction. Rather then modeling the concept of “relevance” the user bears in mind, feedback is used to assign each image of the database a relevance score. Such a score depends only from two dissimilarities (distances) computed against the images already marked by the user: the dissimilarity from the set of relevant images, and the dissimilarity from the set of non-relevant images. Despite its computational simplicity, this mechanism allows outperforming state-of-the-art relevance feedback mechanisms both on “narrow domain” databases, and on “broad domain” databases. This paper is organized as follows. Section 2 illustrates the idea behind the proposed mechanism and provides the basic assumptions. Section 3 details the proposed relevance feedback mechanism. Results on three image data sets are presented in Section 4, where performances of other relevance feedback mechanisms are compared. Conclusions are drawn in Section 5. 2 In st an ce- b ased rel evan ce est i m at i on The proposed mechanism has been inspired by classification techniques based on the “nearest case” [9]-[10]. Nearest-case theory provided the mechanism to compute the dissimilarity of each image from the sets of relevant and non–relevant images. The ratio between the nearest relevant image and the nearest non-relevant image has been used to compute the degree of relevance of each image of the database [11]. The present section illustrates the rationale behind the use of the nearest-case paradigm. Let us assume that each image of the database has been represented by a number of low-level features, and that a (dis)similarity measure has been defined so that the proximity between pairs of images represents some kind of “conceptual” similarity. In other words, the chosen feature space and similarity metric is meaningful at least for a restricted number of users. A search in image databases is usually performed by retrieving the k most similar images with respect to a given query. The dimension of k is usually small, to avoid displaying a large number of images at a time. Typical values for k are between 10 and 20. However, as the “relevant” images that the user wishes to retrieve may not fit perfectly with the similarity metric designed for the search engine, the user may be interested in exploring other regions of the feature space. To this end, the user marks the subset of “relevant” images out of the k retrieved. Usually, such relevance feedback is used to perform a new k-nn search by modifying some search parameters, i.e., the position of the query point, the similarity metric, and other tuning parameters [1]-[7]. Recent works proposed the use of support vector machine to learn the distribution of relevant images [8]. These techniques require some assumption about the general form of the distribution of relevant images in the feature space. 
As it is difficult to make any assumption about such a distribution for broad domain databases, we propose to exploit the information about the relevance of the images retrieved so far in a nearest-neighbor fashion. Nearest-neighbor techniques, as used in statistical pattern recognition, case-based reasoning, or instance-based learning, are effective in all applications where it is difficult to produce a high-level generalization of a “class” of objects [9]-[10],[12][13]. Relevance learning in content base image retrieval may well fit into this definition, as it is difficult to provide a general model that can be adapted to represent different concepts of similarity. In addition, the number of available cases may be too small to estimate the optimal set of parameters for such a general model. On the other hand, it can be more effective to use each “relevant” image as well as each “non-relevant” image, as “cases” or “instances” against which the images of the database should be compared. Consequently, we assume that an image is as much as relevant as much as its dissimilarity from the nearest relevant image is small. Analogously, an image is as much as non-relevant as much as its dissimilarity from the nearest non-relevant image is small. 3 Rel evan ce S core Com p u t ati on According to previous section, each image of the database can be thus characterized by a “degree of relevance” and a “degree of non-relevance” according to the dissimilarities from the nearest relevant image, and from the nearest non-relevant image, respectively. However, it should be noted that these degrees should be treated differently because only “relevant” images represent a “concept” in the user’s mind, while “non-relevant” images may represent a number of other concepts different from user’s interest. In other words, while it is meaningful to treat the degree of relevance as a degree of membership to the class of relevant images, the same does not apply to the degree of non-relevance. For this reason, we propose to use the “degree of non-relevance” to weight the “degree of relevance”. Let us denote with R the subset of indexes j ∈ {1,...,k} related to the set of relevant images retrieved so far and the original query (that is relevant by default), and with NR the subset of indexes j ∈ (1,...,k} related to the set of non-relevant images retrieved so far. For each image I of the database, according to the nearest neighbor rule, let us compute the dissimilarity from the nearest image in R and the dissimilarity from the nearest image in NR. Let us denote these dissimilarities as dR(I) and dNR(I), respectively. The value of dR(I) can be clearly used to measure the degree of relevance of image I, assuming that small values of dR(I) are related to very relevant images. On the other hand, the hypothesis that image I is relevant to the user’s query can be supported by a high value of dNR(I). Accordingly, we defined the relevance score ! dR ( I ) $ relevance ( I ) = # 1 + dN ( I ) &
3 0.11841071 23 nips-2004-Analysis of a greedy active learning strategy
Author: Sanjoy Dasgupta
Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1
4 0.11081397 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
Author: Michael Fink
Abstract: We describe a framework for learning an object classifier from a single example. This goal is achieved by emphasizing the relevant dimensions for classification using available examples of related classes. Learning to accurately classify objects from a single training example is often unfeasible due to overfitting effects. However, if the instance representation provides that the distance between each two instances of the same class is smaller than the distance between any two instances from different classes, then a nearest neighbor classifier could achieve perfect performance with a single training example. We therefore suggest a two stage strategy. First, learn a metric over the instances that achieves the distance criterion mentioned above, from available examples of other related classes. Then, using the single examples, define a nearest neighbor classifier where distance is evaluated by the learned class relevance metric. Finding a metric that emphasizes the relevant dimensions for classification might not be possible when restricted to linear projections. We therefore make use of a kernel based metric learning algorithm. Our setting encodes object instances as sets of locality based descriptors and adopts an appropriate image kernel for the class relevance metric learning. The proposed framework for learning from a single example is demonstrated in a synthetic setting and on a character classification task. 1
5 0.099320926 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
Author: Elizaveta Levina, Peter J. Bickel
Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1
6 0.087558657 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
7 0.082057089 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
8 0.077544957 19 nips-2004-An Application of Boosting to Graph Classification
9 0.076074168 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters
10 0.075253263 127 nips-2004-Neighbourhood Components Analysis
11 0.072383985 137 nips-2004-On the Adaptive Properties of Decision Trees
12 0.064887784 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
13 0.061035402 82 nips-2004-Incremental Algorithms for Hierarchical Classification
14 0.060829572 125 nips-2004-Multiple Relational Embedding
15 0.057887621 131 nips-2004-Non-Local Manifold Tangent Learning
16 0.056868769 40 nips-2004-Common-Frame Model for Object Recognition
17 0.056228999 34 nips-2004-Breaking SVM Complexity with Cross-Training
18 0.056010555 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
19 0.05541344 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
20 0.055185001 54 nips-2004-Distributed Information Regularization on Graphs
topicId topicWeight
[(0, -0.17), (1, 0.054), (2, 0.003), (3, -0.055), (4, 0.06), (5, 0.109), (6, 0.013), (7, 0.001), (8, 0.065), (9, -0.044), (10, -0.029), (11, -0.114), (12, -0.111), (13, -0.06), (14, -0.102), (15, -0.014), (16, 0.086), (17, -0.048), (18, 0.011), (19, 0.056), (20, 0.074), (21, 0.178), (22, 0.018), (23, 0.132), (24, 0.057), (25, 0.01), (26, -0.026), (27, 0.034), (28, 0.237), (29, -0.01), (30, -0.178), (31, -0.053), (32, -0.236), (33, -0.005), (34, -0.011), (35, -0.089), (36, -0.067), (37, 0.012), (38, 0.0), (39, -0.127), (40, -0.116), (41, -0.015), (42, -0.076), (43, -0.064), (44, 0.016), (45, 0.046), (46, 0.039), (47, 0.021), (48, -0.032), (49, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.96801174 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
Author: Ting Liu, Andrew W. Moore, Ke Yang, Alexander G. Gray
Abstract: This paper concerns approximate nearest neighbor searching algorithms, which have become increasingly important, especially in high dimensional perception areas such as computer vision, with dozens of publications in recent years. Much of this enthusiasm is due to a successful new approximate nearest neighbor approach called Locality Sensitive Hashing (LSH). In this paper we ask the question: can earlier spatial data structure approaches to exact nearest neighbor, such as metric trees, be altered to provide approximate answers to proximity queries and if so, how? We introduce a new kind of metric tree that allows overlap: certain datapoints may appear in both the children of a parent. We also introduce new approximate k-NN search algorithms on this structure. We show why these structures should be able to exploit the same randomprojection-based approximations that LSH enjoys, but with a simpler algorithm and perhaps with greater efficiency. We then provide a detailed empirical evaluation on five large, high dimensional datasets which show up to 31-fold accelerations over LSH. This result holds true throughout the spectrum of approximation levels. 1
2 0.65267038 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
Author: Giorgio Giacinto, Fabio Roli
Abstract: High retrieval precision in content-based image retrieval can be attained by adopting relevance feedback mechanisms. These mechanisms require that the user judges the quality of the results of the query by marking all the retrieved images as being either relevant or not. Then, the search engine exploits this information to adapt the search to better meet user’s needs. At present, the vast majority of proposed relevance feedback mechanisms are formulated in terms of search model that has to be optimized. Such an optimization involves the modification of some search parameters so that the nearest neighbor of the query vector contains the largest number of relevant images. In this paper, a different approach to relevance feedback is proposed. After the user provides the first feedback, following retrievals are not based on knn search, but on the computation of a relevance score for each image of the database. This score is computed as a function of two distances, namely the distance from the nearest non-relevant image and the distance from the nearest relevant one. Images are then ranked according to this score and the top k images are displayed. Reported results on three image data sets show that the proposed mechanism outperforms other state-of-the-art relevance feedback mechanisms. 1 In t rod u ct i on A large number of content-based image retrieval (CBIR) systems rely on the vector representation of images in a multidimensional feature space representing low-level image characteristics, e.g., color, texture, shape, etc. [1]. Content-based queries are often expressed by visual examples in order to retrieve from the database the images that are “similar” to the examples. This kind of retrieval is often referred to as K nearest-neighbor retrieval. It is easy to see that the effectiveness of content-based image retrieval systems (CBIR) strongly depends on the choice of the set of visual features, on the choice of the “metric” used to model the user’s perception of image similarity, and on the choice of the image used to query the database [1]. Typically, if we allow different users to mark the images retrieved with a given query as relevant or non-relevant, different subsets of images will be marked as relevant. Accordingly, the need for mechanisms to adapt the CBIR system response based on some feedback from the user is widely recognized. It is interesting to note that while relevance feedback mechanisms have been first introduced in the information retrieval field [2], they are receiving more attention in the CBIR field (Huang). The vast majority of relevance feedback techniques proposed in the literature is based on modifying the values of the search parameters as to better represent the concept the user bears in mind. To this end, search parameters are computed as a function of the relevance values assigned by the user to all the images retrieved so far. As an example, relevance feedback is often formulated in terms of the modification of the query vector, and/or in terms of adaptive similarity metrics. [3]-[7]. Recently, pattern classification paradigms such as SVMs have been proposed [8]. Feedback is thus used to model the concept of relevant images and adjust the search consequently. Concept modeling may be difficult on account of the distribution of relevant images in the selected feature space. “Narrow domain” image databases allows extracting good features, so that images bearing similar concepts belong to compact clusters. 
On the other hand, “broad domain” databases, such as image collection used by graphic professionals, or those made up of images from the Internet, are more difficult to subdivide in cluster because of the high variability of concepts [1]. In these cases, it is worth extracting only low level, non-specialized features, and image retrieval is better formulated in terms of a search problem rather then concept modeling. The present paper aims at offering an original contribution in this direction. Rather then modeling the concept of “relevance” the user bears in mind, feedback is used to assign each image of the database a relevance score. Such a score depends only from two dissimilarities (distances) computed against the images already marked by the user: the dissimilarity from the set of relevant images, and the dissimilarity from the set of non-relevant images. Despite its computational simplicity, this mechanism allows outperforming state-of-the-art relevance feedback mechanisms both on “narrow domain” databases, and on “broad domain” databases. This paper is organized as follows. Section 2 illustrates the idea behind the proposed mechanism and provides the basic assumptions. Section 3 details the proposed relevance feedback mechanism. Results on three image data sets are presented in Section 4, where performances of other relevance feedback mechanisms are compared. Conclusions are drawn in Section 5. 2 In st an ce- b ased rel evan ce est i m at i on The proposed mechanism has been inspired by classification techniques based on the “nearest case” [9]-[10]. Nearest-case theory provided the mechanism to compute the dissimilarity of each image from the sets of relevant and non–relevant images. The ratio between the nearest relevant image and the nearest non-relevant image has been used to compute the degree of relevance of each image of the database [11]. The present section illustrates the rationale behind the use of the nearest-case paradigm. Let us assume that each image of the database has been represented by a number of low-level features, and that a (dis)similarity measure has been defined so that the proximity between pairs of images represents some kind of “conceptual” similarity. In other words, the chosen feature space and similarity metric is meaningful at least for a restricted number of users. A search in image databases is usually performed by retrieving the k most similar images with respect to a given query. The dimension of k is usually small, to avoid displaying a large number of images at a time. Typical values for k are between 10 and 20. However, as the “relevant” images that the user wishes to retrieve may not fit perfectly with the similarity metric designed for the search engine, the user may be interested in exploring other regions of the feature space. To this end, the user marks the subset of “relevant” images out of the k retrieved. Usually, such relevance feedback is used to perform a new k-nn search by modifying some search parameters, i.e., the position of the query point, the similarity metric, and other tuning parameters [1]-[7]. Recent works proposed the use of support vector machine to learn the distribution of relevant images [8]. These techniques require some assumption about the general form of the distribution of relevant images in the feature space. 
As it is difficult to make any assumption about such a distribution for broad domain databases, we propose to exploit the information about the relevance of the images retrieved so far in a nearest-neighbor fashion. Nearest-neighbor techniques, as used in statistical pattern recognition, case-based reasoning, or instance-based learning, are effective in all applications where it is difficult to produce a high-level generalization of a “class” of objects [9]-[10],[12][13]. Relevance learning in content base image retrieval may well fit into this definition, as it is difficult to provide a general model that can be adapted to represent different concepts of similarity. In addition, the number of available cases may be too small to estimate the optimal set of parameters for such a general model. On the other hand, it can be more effective to use each “relevant” image as well as each “non-relevant” image, as “cases” or “instances” against which the images of the database should be compared. Consequently, we assume that an image is as much as relevant as much as its dissimilarity from the nearest relevant image is small. Analogously, an image is as much as non-relevant as much as its dissimilarity from the nearest non-relevant image is small. 3 Rel evan ce S core Com p u t ati on According to previous section, each image of the database can be thus characterized by a “degree of relevance” and a “degree of non-relevance” according to the dissimilarities from the nearest relevant image, and from the nearest non-relevant image, respectively. However, it should be noted that these degrees should be treated differently because only “relevant” images represent a “concept” in the user’s mind, while “non-relevant” images may represent a number of other concepts different from user’s interest. In other words, while it is meaningful to treat the degree of relevance as a degree of membership to the class of relevant images, the same does not apply to the degree of non-relevance. For this reason, we propose to use the “degree of non-relevance” to weight the “degree of relevance”. Let us denote with R the subset of indexes j ∈ {1,...,k} related to the set of relevant images retrieved so far and the original query (that is relevant by default), and with NR the subset of indexes j ∈ (1,...,k} related to the set of non-relevant images retrieved so far. For each image I of the database, according to the nearest neighbor rule, let us compute the dissimilarity from the nearest image in R and the dissimilarity from the nearest image in NR. Let us denote these dissimilarities as dR(I) and dNR(I), respectively. The value of dR(I) can be clearly used to measure the degree of relevance of image I, assuming that small values of dR(I) are related to very relevant images. On the other hand, the hypothesis that image I is relevant to the user’s query can be supported by a high value of dNR(I). Accordingly, we defined the relevance score ! dR ( I ) $ relevance ( I ) = # 1 + dN ( I ) &
3 0.53113097 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
Author: Michael Fink
Abstract: We describe a framework for learning an object classifier from a single example. This goal is achieved by emphasizing the relevant dimensions for classification using available examples of related classes. Learning to accurately classify objects from a single training example is often unfeasible due to overfitting effects. However, if the instance representation provides that the distance between each two instances of the same class is smaller than the distance between any two instances from different classes, then a nearest neighbor classifier could achieve perfect performance with a single training example. We therefore suggest a two stage strategy. First, learn a metric over the instances that achieves the distance criterion mentioned above, from available examples of other related classes. Then, using the single examples, define a nearest neighbor classifier where distance is evaluated by the learned class relevance metric. Finding a metric that emphasizes the relevant dimensions for classification might not be possible when restricted to linear projections. We therefore make use of a kernel based metric learning algorithm. Our setting encodes object instances as sets of locality based descriptors and adopts an appropriate image kernel for the class relevance metric learning. The proposed framework for learning from a single example is demonstrated in a synthetic setting and on a character classification task. 1
4 0.51576883 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
Author: Jaety Edwards, Yee W. Teh, Roger Bock, Michael Maire, Grace Vesom, David A. Forsyth
Abstract: We describe a method that can make a scanned, handwritten mediaeval latin manuscript accessible to full text search. A generalized HMM is fitted, using transcribed latin to obtain a transition model and one example each of 22 letters to obtain an emission model. We show results for unigram, bigram and trigram models. Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). Search results are very strong; we use examples of variant spellings to demonstrate that the search respects the ink of the document. Furthermore, our model produces fair searches on a document from which we obtained no training data. 1. Intoduction There are many large corpora of handwritten scanned documents, and their number is growing rapidly. Collections range from the complete works of Mark Twain to thousands of pages of zoological notes spanning two centuries. Large scale analyses of such corpora is currently very difficult, because handwriting recognition works poorly. Recently, Rath and Manmatha have demonstrated that one can use small bodies of aligned material as supervised data to train a word spotting mechanism [7]. The result can make scanned handwritten documents searchable. Current techniques assume a closed vocabulary — one can search only for words in the training set — and search for instances of whole words. This approach is particularly unattractive for an inflected language, because individual words can take so many forms that one is unlikely to see all in the training set. Furthermore, one would like the method used to require very little aligned training data, so that it is possible to process documents written by different scribes with little overhead. Mediaeval Latin manuscripts are a natural first corpus for studying this problem, because there are many scanned manuscripts and because the handwriting is relatively regular. We expect the primary user need to be search over a large body of documents — to allow comparisons between documents — rather than transcription of a particular document (which is usually relatively easy to do by hand). Desirable features for a system are: First, that it use little or no aligned training data (an ideal, which we believe may be attainable, is an unsupervised learning system). Second, that one can search the document for an arbitrary string (rather than, say, only complete words that appear in the training data). This would allow a user to determine whether a document contains curious or distinctive spellings, for example (figure 7). We show that, using a statistical model based on a generalized HMM, we can search a medieval manuscript with considerable accuracy, using only one instance each of each letter in the manuscript to train the method (22 instances in total; Latin has no j, k, w, or z). Furthermore, our method allows fairly accurate transcription of the manuscript. We train our system on 22 glyphs taken from a a 12th century latin manuscript of Terence’s Comedies (obtained from a repository of over 80 scanned medieval works maintained by Oxford University [1]). We evaluate searches using a considerable portion of this manuscript aligned by hand; we then show that fair search results are available on a different manuscript (MS. Auct. D. 2. 16, Latin Gospels with beast-headed evangelist portraits made at Landvennec, Brittany, late 9th or early 10th century, from [1]) without change of letter templates. 1.1. 
Previous Work Handwriting recognition is a traditional problem, too well studied to review in detail here (see [6]). Typically, online handwriting recognition (where strokes can be recorded) works better than offline handwriting recognition. Handwritten digits can now be recognized with high accuracy [2, 5]. Handwritten amounts can be read with fair accuracy, which is significantly improved if one segments the amount into digits at the same time as one recognizes it [4, 5]. Recently several authors have proposed new techniques for search and translation in this unrestricted setting. Manmatha et al [7] introduce the technique of “word spotting,” which segments text into word images, rectifies the word images, and then uses an aligned training set to learn correspondences between rectified word images and strings. The method is not suitable for a heavily inflected language, because words take so many forms. In an inflected language, the natural unit to match to is a subset of a word, rather than a whole word, implying that one should segment the text into blocks — which may be smaller than words — while recognizing. Vinciarelli et al [8] introduce a method for line by line recognition based around an HMM and quite similar to techniques used in the speech recognition community. Their method uses a window that slides along the text to obtain features; this has the difficulty that the same window is in some places too small (and so uninformative) and in others too big (and so spans more than one letter, and is confusing). Their method requires a substantial body of aligned training data, which makes it impractical for our applications. Close in spirit to our work is the approach to machine translation of Koehn and Knight [3]. They demonstrate that the statistics of unaligned corpora may provide as powerful constraints for training models as aligned bitexts. 2. The Model Our models for both search and transcription are based on the generalized HMM and differ only in their choice of transition model. In an HMM, each hidden node ct emits a single evidence node xt . In a generalized HMM, we allow each ct to emit a series of x’s whose length is itself a random variable. In our model, the hidden nodes correspond to letters and each xt is a single column of pixels. Allowing letters to emit sets of columns lets us accomodate letter templates of variable width. In particular, this means that we can unify segmenting ink into letters and recognizing blocks of ink; figure 3 shows an example of how useful this is. 2.1. Generating a line of text Our hidden state consists of a character label c, width w and vertical position y. The statespace of c contains the characters ‘a’-‘z’, a space ‘ ’, and a special end state Ω. Let T c be the template associated with character c, Tch , Tcw be respectively the height and width of that template, and m be the height of the image. Figure 1: Left, a full page of our manuscript, a 12’th century manuscript of Terence’s Comedies obtained from [1]. Top right, a set of lines from a page from that document and bottom right, some words in higher resolution. Note: (a) the richness of page layout; (b) the clear spacing of the lines; (c) the relatively regular handwriting. Figure 2: Left, the 22 instances, one per letter, used to train our emission model. These templates are extracted by hand from the Terence document. Right, the five image channels for a single letter. 
Beginning at image column 1 (and assuming a dummy space before the first character), • • • • choose character c ∼ p(c|c−1...−n ) (an n-gram letter model) choose length w ∼ Uniform(Tcw − k, Tcw + k) (for some small k) choose vertical position y ∼ Uniform(1, m − Tch ) z,y and Tch now define a bounding box b of pixels. Let i and j be indexed from the top left of that bounding box. – draw pixel (i, j) ∼ N (Tcij , σcij ) for each pixel in b – draw all pixels above and below b from background gaussian N (µ0 , σ0 ) (See 2.2 for greater detail on pixel emission model) • move to column w + 1 and repeat until we enter the end state Ω. Inference on a gHMM is a relatively straighforward business of dynamic programming. We have used unigram, bigram and trigram models, with each transition model fitted using an electronic version of Caesar’s Gallic Wars, obtained from http://www.thelatinlibrary.com. We do not believe that the choice of author should significantly affect the fitted transition model — which is at the level of characters — but have not experimented with this point. The important matter is the emission model. 2.2. The Emission Model Our emission model is as follows: Given the character c and width w, we generate a template of the required length. Each pixel in this template becomes the mean of a gaussian which generates the corresponding pixel in the image. This template has a separate mean image for each pixel channel. The channels are assumed independent given the means. We train the model by cutting out by hand a single instance of each letter from our corpus (figure 2). This forms the central portion of the template. Pixels above and below this Model Perfect transcription unigram bigram trigram matching chars 21019 14603 15572 15788 substitutions 0 5487 4597 4410 insertions 0 534 541 507 deletions 0 773 718 695 Table 1: Edit distance between our transcribed Terence and the editor’s version. Note the trigram model produces significantly fewer letter errors than the unigram model, but that the error rate is still a substantial 25%. central box are generated from a single gaussian used to model background pixels (basically white pixels). We add a third variable yt to our hidden state indicating the vertical position of the central box. However, since we are uninterested in actually recovering this variable, during inference we sum it out of the model. The width of a character is constrained to be close to the width (tw ) of our hand cut example by setting p(w|c) = 0 for w < tw − k and w > tw + k. Here k is a small, user defined integer. Within this range, p(w|c) is distributed uniformly, larger templates are created by appending pixels from the background model to the template and smaller ones by simply removing the right k-most columns of the hand cut example. For features, we generate five image representations, shown in figure 2. The first is a grayscale version of the original color image. The second and third are generated by convolving the grayscale image with a vertical derivative of gaussian filter, separating the positive and negative components of this response, and smoothing each of these gradient images separately. The fourth and fifth are generated similarly but with a horizontal derivative of gaussian filter. We have experimented with different weightings of these 5 channels. In practice we use the gray scale channel and the horizontal gradient channels. We emphasize the horizontal pieces since these seem the more discriminative. 2.3. 
2.3. Transcription

For transcription, we model letters as coming from an n-gram language model, with no dependencies between words. Thus, the probability of a letter depends on the k letters before it, where k = n unless this would cross a word boundary, in which case the history terminates at that boundary. We chose not to model word-to-word transition probabilities since, unlike in English, word order in Latin is highly variable. This transition model is fit from a corpus of ASCII-encoded Latin. We have experimented with unigram (i.e. uniform transition probabilities), bigram and trigram letter models. We perform transcription by fitting the maximum likelihood path through any given line; some results of this technique are shown in figure 3.

2.4. Search

For search, we rank lines by the probability that they contain our search word. We set up a finite state machine like that in figure 4, in which 'bg' represents our background model for the portion of the line not generated by the search word. We can use any of the n-gram letter models described for transcription as the transition model for 'bg'. The probability that the line contains the search word is the probability that this FSM takes path 1. We use this FSM as the transition model for our gHMM, and output the posterior probability of the two arrows leading into the end state. The weights ε1 and ε2 are user defined, but in practice the algorithm does not appear to be particularly sensitive to the choice of these parameters. The results presented here use the unigram model.

Figure 3: We transcribe the text by finding the maximum likelihood path through the gHMM. The top line shows the standard version of the line (obtained by consensus among editors who have consulted various manuscripts; we obtained this information in electronic form from http://www.thelatinlibrary.com). Below, we show the line as segmented and transcribed by the unigram, bigram and trigram models; the unigram and bigram models transcribe one word as "vemo", but the stronger trigram model forces the two letters to be segmented and correctly transcribes the word as "venio", illustrating the considerable benefit to be obtained by segmenting only at recognition time.

Editorial translation: Orator ad vos venio ornatu prologi:
unigram:  b u rt o r a d u o s u em o o r n a t u p r o l o g r
bigram:   b u rt o r a d v o s v em o o r u a t u p r o l o g r
trigram:  fo r a t o r a d v o s v en i o o r n a t u p r o l o g i

Figure 4: The finite state machine to search for the word 'ab.' 'bg' is a place holder for the larger finite state machine defined by our language model's transition matrix. Two paths lead into the end state Ω: path 1 passes through the templates for the search word before entering Ω with weight ε1, while path 2 passes only through the background model and enters Ω with weight ε2.

3. Results

Figure 1 shows a page from our collection: a scanned 12th century manuscript of Terence's Comedies, obtained from the collection at [1]. In preprocessing, we extract individual lines of text by rotating the image to various degrees and projecting the sum of the pixel values onto the y-axis. We choose the orientation whose projection vector has the lowest entropy, and then segment lines by cutting at minima of this projection.

Transcription is not our primary task, but methods that produce good transcriptions are going to support good searches. The gHMM can produce a surprisingly good transcription, given how little training data is used to train the emission model.
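A minimal sketch of that line-extraction step (rotate the page, project pixel mass onto the y-axis, keep the orientation with the lowest-entropy projection, and cut at projection minima). It assumes a grayscale numpy array with dark ink on a light background; the function name, angle range and thresholds are illustrative, not the authors' code.

```python
import numpy as np
from scipy.ndimage import rotate
from scipy.signal import argrelmin

def extract_lines(gray, angles=np.arange(-3.0, 3.01, 0.25)):
    """Return (best_angle, list of (top, bottom) row ranges) for one page image."""
    best_angle, best_entropy, best_profile = None, np.inf, None
    for a in angles:
        rot = rotate(gray, a, reshape=False, mode='nearest')
        profile = (255.0 - rot).sum(axis=1)           # ink mass projected onto the y-axis
        p = profile / profile.sum()
        entropy = -(p * np.log(p + 1e-12)).sum()      # low entropy = sharply peaked rows of ink
        if entropy < best_entropy:
            best_angle, best_entropy, best_profile = a, entropy, profile
    cuts = argrelmin(best_profile, order=5)[0]        # cut between lines at local minima
    bounds = [0, *cuts.tolist(), len(best_profile)]
    lines = [(top, bottom) for top, bottom in zip(bounds[:-1], bounds[1:]) if bottom - top > 3]
    return best_angle, lines
```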
We aligned an editor's version of Terence with 25 pages from the manuscript by hand, and computed the edit distance between the transcribed text and the aligned text; as Table 1 indicates, approximately 75% of letters are read correctly.

Search results are strong. We show results for two documents. The first set of results refers to the edition of Terence's Comedies from which we took the 22 letter instances. For any given search term, our process ranks the complete set of lines. We used a hand alignment of the manuscript to determine which lines contained each term; figure 5 shows an overview of searches performed using every word that appears in the document more than three times, in particular showing which of the ranked set of lines actually contained the search term. For almost every search, the term appears mainly in the lines with higher rank. Figure 6 contains more detailed information for a smaller set of words. We do not score the position of a word in a line (for practical reasons). Figure 7 demonstrates (a) that our search respects the ink of the document and (b) that, for the Terence document, word positions are accurately estimated.

Figure 5: Our search ranks 587 manuscript lines, with higher ranking lines more likely to contain the relevant term. This figure shows complete search results for each term that appears more than three times in the 587 lines. Each row represents the ranked search results for a term, and a black mark appears if the search term is actually in the line; a successful search will therefore appear as a row which is wholly dark to the left, and then wholly light. All 587 lines are represented. More common terms are represented by lower rows. More detailed results appear in figures 6 and 7; this summary figure suggests almost all searches are highly successful.

The spelling of mediaeval documents is typically cleaned up by editors; in our manuscript, the scribe reliably spells "michi" for the standard "mihi". A search on "michi" produces many instances; a search on "mihi" produces none, because the ink doesn't have any. Notice this phenomenon also in the bottom right line of figure 7, where the scribe writes "habet, ut consumat nunc cum nichil obsint doli" and the editor gives "habet, ut consumat nunc quom nil obsint doli." Figure 8 shows that searches on short strings produce many words containing that string, as one would wish.
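Figure 6 (below) summarises these searches as precision against recall, computed by taking the top 10, 20, ... ranked lines and checking them against the hand alignment. A minimal sketch of that computation, with illustrative names:

```python
def precision_recall_curve(ranked_lines, relevant, cutoffs=range(10, 301, 10)):
    """Precision and recall of a ranked line list at the given cutoffs.

    ranked_lines: line indices sorted by decreasing probability of containing the term.
    relevant: set of line indices that actually contain the term (from the hand alignment).
    """
    curve = []
    for k in cutoffs:
        top = ranked_lines[:k]
        hits = sum(1 for line in top if line in relevant)
        precision = hits / k
        recall = hits / len(relevant) if relevant else 0.0
        curve.append((k, precision, recall))
    return curve

# Example: lines 3, 17 and 41 contain the term; the search ranked them 1st, 2nd and 30th.
ranking = [3, 17] + [i for i in range(100) if i not in (3, 17, 41)][:27] + [41]
print(precision_recall_curve(ranking, {3, 17, 41}, cutoffs=[10, 30]))
```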
Figure 6: On the left, search results for selected words (indicated on the leftmost column): tu, arbitror, pater, etiam, nisi, factum, primum, siet, vero, illi, inter, hic, michi, ibi and qui. Each row represents the ranked search results for a term, and a black mark appears if the search term is actually in the line; a successful search will therefore appear as a row which is wholly dark to the left, and then wholly light. Note that only the top 300 results are represented, and that lines containing the search term are almost always at or close to the top of the search results (black marks to the left). On the right, we plot precision against recall for a set of different words by taking the top 10, 20, ... lines returned from the search, and checking them against the aligned manuscript. Note that, once all cases have been found, the precision falls as the size of the pool is increased while recall stays at 100%; many words work well, with most of the first 20 or so lines returned containing the search term.

4. Discussion

We have shown that it is possible to make at least some handwritten mediaeval manuscripts accessible to full text search, without requiring an aligned text or much supervisory data. Our documents have very regular letters, and letter frequencies (which can be obtained from transcribed Latin) appear to provide so powerful a cue that relatively little detailed information about letter shapes is required. Linking letter segmentation and recognition has thoroughly beneficial effects. This suggests that the pool of manuscripts that can be made accessible in this way is large. In particular, we have used our method, trained on 22 instances of letters from one document, to search another document. Figure 9 shows the results from two searches of our second document (MS. Auct. D. 2. 16, Latin Gospels with beast-headed evangelist portraits made at Landvennec, Brittany, late 9th or early 10th century, from [1]). No information from this document was used in training at all, but letter shapes are sufficiently well shared that the search is still useful. All this suggests that one might be able to use EM to link three processes: one that clusters to determine letter shapes; one that segments letters; and one that imposes a language model. Such a system might be able to make handwritten Latin searchable with no training data.

References
[1] Early Manuscripts at Oxford University. Bodleian Library MS. Auct. F. 2.13. http://image.ox.ac.uk/.
[2] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.
[3] Philipp Koehn and Kevin Knight. Estimating word translation probabilities from unrelated monolingual corpora. In Proc. of the 17th National Conf. on AI, pages 711–715. AAAI Press / The MIT Press, 2000.
[4] Y. LeCun, L. Bottou, and Y. Bengio. Reading checks with graph transformer networks. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 151–154, Munich, 1997. IEEE.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[6] R. Plamondon and S. N. Srihari. Online and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):63–84, 2000.
[7] T. M. Rath and R. Manmatha. Word image matching using dynamic time warping. In Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 521–527, 2003.
[8] Alessandro Vinciarelli, Samy Bengio, and Horst Bunke. Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):709–720, 2004.
5 0.48180988 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
Author: Suvrit Sra, Joel Tropp, Inderjit S. Dhillon
Abstract: Various problems in machine learning, databases, and statistics involve pairwise distances among a set of objects. It is often desirable for these distances to satisfy the properties of a metric, especially the triangle inequality. Applications where metric data is useful include clustering, classification, metric-based indexing, and approximation algorithms for various graph problems. This paper presents the Metric Nearness Problem: Given a dissimilarity matrix, find the “nearest” matrix of distances that satisfy the triangle inequalities. For ℓp nearness measures, this paper develops efficient triangle fixing algorithms that compute globally optimal solutions by exploiting the inherent structure of the problem. Empirically, the algorithms have time and storage costs that are linear in the number of triangle constraints. The methods can also be easily parallelized for additional speed. 1
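As a rough illustration of the triangle-fixing idea (not the paper's algorithm, which adds Dykstra-style correction terms to reach the globally optimal solution), the sketch below cyclically repairs violated triangle inequalities by spreading each violation equally over the three edges of the offending triangle:

```python
import numpy as np

def naive_triangle_fix(D, n_sweeps=50):
    """Push a symmetric dissimilarity matrix D towards satisfying all triangle inequalities.

    Each violation d_ij > d_ik + d_kj is repaired by splitting it equally across the three
    edges, which is the Euclidean projection onto that single constraint. This cyclic scheme
    only illustrates the flavour of triangle fixing; it omits the correction terms needed
    for provable optimality.
    """
    M = D.astype(float).copy()
    n = M.shape[0]
    for _ in range(n_sweeps):
        changed = False
        for i in range(n):
            for j in range(i + 1, n):
                for k in range(n):
                    if k in (i, j):
                        continue
                    violation = M[i, j] - (M[i, k] + M[k, j])
                    if violation > 1e-12:
                        delta = violation / 3.0
                        M[i, j] -= delta
                        M[i, k] += delta
                        M[k, j] += delta
                        M[j, i], M[k, i], M[j, k] = M[i, j], M[i, k], M[k, j]
                        changed = True
        if not changed:
            break
    return M

# Example: d(0, 2) badly violates the triangle inequality through point 1.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
print(naive_triangle_fix(D))
```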
6 0.47792804 23 nips-2004-Analysis of a greedy active learning strategy
7 0.41293308 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters
8 0.41277137 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
9 0.40650576 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
10 0.39261472 127 nips-2004-Neighbourhood Components Analysis
11 0.38852516 130 nips-2004-Newscast EM
12 0.38618824 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
13 0.38229606 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making
14 0.3464953 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
15 0.32012096 125 nips-2004-Multiple Relational Embedding
16 0.3141385 141 nips-2004-Optimal sub-graphical models
17 0.30756396 82 nips-2004-Incremental Algorithms for Hierarchical Classification
18 0.30190209 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
19 0.29927278 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
20 0.29788682 19 nips-2004-An Application of Boosting to Graph Classification
simIndex simValue paperId paperTitle
same-paper 1 0.79883772 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
2 0.65722996 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
Author: Jochen Triesch
Abstract: This paper explores the computational consequences of simultaneous intrinsic and synaptic plasticity in individual model neurons. It proposes a new intrinsic plasticity mechanism for a continuous activation model neuron based on low order moments of the neuron’s firing rate distribution. The goal of the intrinsic plasticity mechanism is to enforce a sparse distribution of the neuron’s activity level. In conjunction with Hebbian learning at the neuron’s synapses, the neuron is shown to discover sparse directions in the input. 1
3 0.65213549 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
4 0.64615679 28 nips-2004-Bayesian inference in spiking neurons
Author: Sophie Deneve
Abstract: We propose a new interpretation of spiking neurons as Bayesian integrators accumulating evidence over time about events in the external world or the body, and communicating to other neurons their certainties about these events. In this model, spikes signal the occurrence of new information, i.e. what cannot be predicted from the past activity. As a result, firing statistics are close to Poisson, albeit providing a deterministic representation of probabilities. We proceed to develop a theory of Bayesian inference in spiking neural networks, recurrent interactions implementing a variant of belief propagation.
5 0.64459813 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
6 0.6421532 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
7 0.64201051 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification
8 0.64116609 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
9 0.64113963 70 nips-2004-Following Curved Regularized Optimization Solution Paths
10 0.6397137 102 nips-2004-Learning first-order Markov models for control
11 0.63686401 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid
12 0.63506556 116 nips-2004-Message Errors in Belief Propagation
13 0.63460827 178 nips-2004-Support Vector Classification with Input Data Uncertainty
14 0.63323092 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
15 0.63236177 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
16 0.63165486 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
17 0.6313048 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
18 0.62944466 124 nips-2004-Multiple Alignment of Continuous Time Series
19 0.6289838 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
20 0.62893838 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification