nips nips2012 nips2012-140 knowledge-graph by maker-knowledge-mining

140 nips-2012-Fusion with Diffusion for Robust Visual Tracking


Source: pdf

Author: Yu Zhou, Xiang Bai, Wenyu Liu, Longin J. Latecki

Abstract: A weighted graph is used as an underlying structure of many algorithms like semisupervised learning and spectral clustering. If the edge weights are determined by a single similarity measure, then it is hard, if not impossible, to capture all relevant aspects of similarity. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First, pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means a significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as that of the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally, all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. The obtained method is tested on several challenging video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 If the edge weights are determined by a single similarity measure, then it is hard, if not impossible, to capture all relevant aspects of similarity. [sent-14, score-0.399]

2 In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. [sent-15, score-0.291]

3 In this paper, a novel approach to integrate multiple similarity measures is proposed. [sent-16, score-0.174]

4 First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). [sent-17, score-0.453]

5 Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. [sent-18, score-0.313]

6 A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. [sent-22, score-0.327]

7 Finally all diffused pairs of similarity measures are combined as a weighted sum. [sent-24, score-0.222]

8 We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. [sent-25, score-0.584]

9 The obtained method is tested on several challenging video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. [sent-26, score-0.15]

10 1 Introduction The considered problem has a simple formulation: given are multiple similarities between the same set of n data points, and each similarity can be represented as a weighted graph. [sent-27, score-0.192]

11 The goal is to combine them to a single similarity measure that best reflects the underlying data manifold. [sent-28, score-0.181]

12 Then our task can be stated as finding a mapping from the multigraph to a weighted simple graph whose edge weights best represent the similarity of the data points. [sent-30, score-0.231]

13 However, this solution does not consider the similarity dependencies of different data points. [sent-37, score-0.133]

14 ∗ Part of this work was done while the author was visiting Temple University. Given two different similarity measures, we first construct their Tensor Product Graph (TPG). [sent-39, score-0.133]

15 Then we jointly diffuse both similarities with a diffusion process on TPG. [sent-40, score-0.216]

16 However, while the original graphs representing the two measures have n nodes, their TPG has n2 nodes, which significantly increases the time complexity of the diffusion on TPG. [sent-41, score-0.212]

17 To address this problem, we introduce an iterative algorithm that operates on the original graphs and prove that it is equivalent to the diffusion on TPG. [sent-42, score-0.194]

18 FD is a generalization of the approach in [26], where only a single similarity measure is considered. [sent-44, score-0.153]

19 While the diffusion process on TPG in [26] is used to enhance a single similarity measure, our approach aims at combining two different similarity measures so that they enhance and constrain each other. [sent-45, score-0.483]

20 Although algorithmically very different, our motivation is similar to co-training style algorithms in [5, 23, 24] where multiple cues are fused in an iterative learning process. [sent-46, score-0.069]

21 For the online tracking task, we only have label information from the current frame, which can be regarded as labeled data; the label information in the next frame is unavailable, so those data can be regarded as unlabeled. [sent-48, score-0.304]

22 Hence from the point of view of visual tracking, but in the spirit of semi-supervised learning, our approach utilizes the unlabeled data from the next frame for improved visual similarity to the labeled data representing the tracked objects. [sent-51, score-0.379]

23 Visual tracking is an important problem in computer vision and has many practical applications. [sent-52, score-0.197]

24 The challenges in designing a tracking system are often caused by shape deformation, occlusion, viewpoint variations, and background clutter. [sent-53, score-0.167]

25 Different strategies have been proposed to obtain robust tracking systems. [sent-54, score-0.174]

26 A discriminative appearance model of the target is extracted from the current frame; the optimal target is then estimated based on the distance/similarity between the appearance model and the candidates in the hypothesis set. [sent-56, score-0.156]

27 In this paper, we focus on improving the distance/similarity measure in order to improve the matching-based tracking strategy. [sent-60, score-0.192]

28 However, different from [12], multiple cues are fused to improve the similarity in our approach. [sent-62, score-0.179]

29 Moreover, the information from the forthcoming frame is also used to improve the similarity. [sent-63, score-0.156]

30 This leads to more stable tracking performance than in [12]. [sent-64, score-0.15]

31 Fusion of multiple cues seems to be an effective way to improve the tracking performance. [sent-65, score-0.263]

32 In [13], multiple feature fusion is implemented based on sampling the state space. [sent-66, score-0.089]

33 In [20], the tracking task is formulated as a combination of different trackers: three different trackers are combined into a cascade. [sent-67, score-0.189]

34 Different from those methods, we combine different similarities into a single similarity measure, which makes our method more general for integrating various appearance models. [sent-68, score-0.235]

35 In summary, we propose a novel framework for integration of multiple similarity measures into a single consistent similarity measure, where the similarity of each pair of data points depends on their similarity to other data points. [sent-69, score-0.573]

36 We demonstrate its superior performance on a challenging task of tracking by visual matching. [sent-70, score-0.189]

37 2 Problem Formulation The problem of matching based visual tracking boils down to the following simple formulation. [sent-71, score-0.211]

38 Given the target in frame It−1 which can be represented as image patch I1 enclosing the target, and the set of candidate target patches in frame It , C = {In | n = 2, . [sent-72, score-0.487]

39 , N }, the goal is to determine which patch in C corresponds to the target in frame It−1 . [sent-75, score-0.209]

40 The size of all the image patches is fixed and the candidate set is composed of patches in frame It inside a search radius r, i. [sent-80, score-0.273]
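To make the candidate-generation step concrete, here is a minimal Python/NumPy sketch (an illustration only: the function and variable names such as candidate_patches, frame, prev_center, and stride are hypothetical, not from the paper). It enumerates fixed-size patches whose centers lie within a search radius r of the previous target center.

import numpy as np

def candidate_patches(frame, prev_center, r, patch_size, stride=2):
    # frame: H x W grayscale array; prev_center: (row, col) of the target center in frame t-1.
    h, w = patch_size
    H, W = frame.shape[:2]
    cr, cc = prev_center
    candidates, centers = [], []
    for i in range(max(h // 2, cr - r), min(H - h // 2, cr + r + 1), stride):
        for j in range(max(w // 2, cc - r), min(W - w // 2, cc + r + 1), stride):
            if (i - cr) ** 2 + (j - cc) ** 2 <= r ** 2:          # center lies inside the search radius
                candidates.append(frame[i - h // 2 : i + (h + 1) // 2,
                                        j - w // 2 : j + (w + 1) // 2])
                centers.append((i, j))
    return candidates, centers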

41 Let S be a similarity measure defined on the set of the image patches V = {I1 } ∪ C, i. [sent-83, score-0.223]

42 Then our tracking goal can be formally stated as $\hat{I} = \arg\max_{X \in C} S(I_1, X)$ (1), meaning that the patch in C with the most similar appearance to patch I1 is selected as the target location in frame t. [sent-86, score-0.47]
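A minimal sketch of the selection rule in Eq. (1): given any pairwise similarity function sim (a placeholder for whichever measure S is in use) and the candidate list produced above, the tracker simply picks the candidate most similar to the template patch I1.

import numpy as np

def select_target(template, candidates, sim):
    # Eq. (1): return the index of the candidate patch with maximal similarity to the template I1.
    scores = np.array([sim(template, c) for c in candidates])
    best = int(np.argmax(scores))
    return best, scores[best]

The rest of the paper is about constructing a better S than any single hand-picked measure; the selection rule itself stays this simple.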

43 Since the appearance of the target object changes, e. [sent-87, score-0.086]

44 , due to motion and lighting changes, a single similarity measure is often not sufficient to identify the target in the next frame. [sent-89, score-0.188]

45 Therefore, we consider a set of similarity measures S = {S1 , . [sent-90, score-0.174]

46 For example, in our experimental results, each image patch is represented with three histograms based on three different features, HOG [9], LBP [18], and Haar-like features [4], which lead to three different similarity measures. [sent-97, score-0.206]

47 In other words, each pair of patches can be compared with respect to three different appearance features. [sent-98, score-0.082]
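As an illustration of how several feature channels yield several similarity measures over the patch set V = {I1} ∪ C, the sketch below builds one affinity matrix per feature extractor. The Gaussian kernel and the bandwidth sigma are assumptions made here for concreteness, and hog_hist, lbp_hist, haar_hist are stand-ins for actual HOG, LBP, and Haar-like descriptors.

import numpy as np

def affinity_matrix(patches, feature_fn, sigma=1.0):
    # One similarity measure S_alpha over V, built from a single feature channel.
    feats = np.stack([feature_fn(p) for p in patches])            # N x d feature matrix
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))                       # N x N Gaussian affinity

# Example (hypothetical feature extractors):
# S_measures = [affinity_matrix(V, f) for f in (hog_hist, lbp_hist, haar_hist)]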

48 We can interpret each similarity measure Sα as the affinity matrix of a graph Gα whose vertex set is V , i. [sent-99, score-0.193]

49 Then we can combine the graphs Gα into a single multigraph whose edge weights correspond to the different similarity measures Sα . [sent-102, score-0.261]

50 Hence we face the question of how to combine the measures in S into a single similarity measure. [sent-105, score-0.202]

51 First, we combine pairs of similarity measures Sα and Sβ into a single measure P∗ , which is a matrix of size N × N . [sent-107, score-0.222]

52 $P^*_{\alpha,\beta}$ is defined in Section 3 and it is obtained with the proposed process called fusion with diffusion. [sent-108, score-0.114]

53 All diffused measures $P^*_{\alpha,\beta}$, $\alpha,\beta = 1,\ldots,Q$, are combined into a single similarity measure S defined as a weighted matrix sum $S = \sum_{\alpha,\beta} \omega_\alpha \omega_\beta P^*_{\alpha,\beta}$ (2), where $\omega_\alpha$ and $\omega_\beta$ are positive weights associated with measures Sα and Sβ defined in Section 5. [sent-112, score-0.213]
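Here is a sketch of the final combination in Eq. (2), assuming the diffused measures P∗_{α,β} have already been computed (for instance with the fuse_diffuse routine sketched further below) and that the weights ω_α are given; the dictionary layout is an assumption of this illustration.

import numpy as np

def combine_measures(P_star, weights):
    # P_star[(a, b)]: diffused N x N measure for the pair (S_a, S_b); weights[a] = omega_a.
    S = np.zeros_like(next(iter(P_star.values())))
    for (a, b), P in P_star.items():
        S += weights[a] * weights[b] * P     # Eq. (2): S = sum_{a,b} omega_a * omega_b * P*_{a,b}
    return S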

54 We also observe that in contrast to many tracking by matching methods, the combined measure S is not only a function of similarities between I1 and the candidate patches in C, but also of similarities of patches in C to each other. [sent-113, score-0.386]

55 The graph Gα is a fully connected graph in many applications. [sent-120, score-0.08]

56 This inequality is important in our framework, since it guarantees the convergence of the diffusion process on the tensor product graph presented in the next section. [sent-124, score-0.279]

57 3 Diffusion Process on Tensor Product Graph We utilize a diffusion process on TPG to combine the two similarity measures Pk,α and Pk,β . [sent-133, score-0.378]
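The summary above does not reproduce how P_{k,α} is derived from S_α. One plausible construction, given here purely as an assumption, keeps only the k largest similarities in each row and rescales so that every row sum is strictly below 1, which is the kind of inequality the previous paragraph needs for convergence.

import numpy as np

def knn_transition(S, k, eps=1e-3):
    # Assumed construction of P_{k,alpha}: sparsify S to k neighbors per row, then
    # scale globally so all row sums are < 1 (so the diffusion below converges).
    N = S.shape[0]
    P = np.zeros_like(S, dtype=float)
    for i in range(N):
        idx = np.argsort(S[i])[-k:]              # indices of the k most similar patches
        P[i, idx] = S[i, idx]
    return P / (P.sum(axis=1).max() + eps)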

58 The vec operator creates a column vector from a matrix M by stacking the column vectors of M below one another. [sent-135, score-0.24]

59 More formally, $\mathrm{vec}: \mathbb{R}^{N \times N} \to \mathbb{R}^{N^2}$ is defined as $\mathrm{vec}(M)_g = (M)_{ij}$, where $i = \lfloor (g-1)/N \rfloor + 1$ and $j = g \bmod N$. [sent-136, score-0.24]
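For concreteness, here are the vec operator and its inverse in NumPy. One caveat: the sketch uses the standard column-stacking convention (Fortran order), for which vec(A X Bᵀ) = (B ⊗ A) vec(X); the index formula quoted above suggests the paper may enumerate entries row by row, in which case the Kronecker factors swap, so treat the ordering here as an assumption.

import numpy as np

def vec(M):
    # Column-stacking vec: stacks the columns of M into one long vector of length N^2.
    return M.flatten(order='F')

def vec_inv(v, N):
    # Inverse of vec: reshape a length-N^2 vector back into an N x N matrix.
    return v.reshape((N, N), order='F')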

60 We define the q-th iteration of the diffusion process on this graph as $\sum_{e=0}^{q} \mathcal{P}^e \, \mathrm{vec}(\Delta)$. (7) [sent-140, score-0.216]

61 As shown in [26], this iterative process is guaranteed to converge to a nontrivial solution given by $\lim_{q \to \infty} \sum_{e=0}^{q} \mathcal{P}^e \, \mathrm{vec}(\Delta) = (I - \mathcal{P})^{-1} \mathrm{vec}(\Delta)$, (8) where I is the identity matrix. [sent-141, score-0.086]

62 We call the diffusion process to compute P∗ a Fusion with Diffusion (FD) process, since diffusion on TPG Gα ⊗ Gβ is used to fuse two similarity measures Sα and Sβ . [sent-143, score-0.501]

63 We show in Section 3.4 that the FD process on TPG is equivalent to an iterative process on N × N matrices only. [sent-147, score-0.073]

64 Then in Section 4 we further reduce it to an efficient algorithm with time complexity $O(n^2)$, which can be used in real-time tracking algorithms. [sent-149, score-0.15]

65 Eq. (10) is iterated until convergence, and, as we prove in Proposition 1, we obtain $P^* = \lim_{q \to \infty} P^{(q)} = \mathrm{vec}^{-1}\big((I - \mathcal{P})^{-1} \mathrm{vec}(\Delta)\big)$. (11) The iterative process in Eq. [sent-153, score-0.086]

66 In other words, we consider diffusion on TPG of two different graphs, while diffusion on TPG of a single graph with itself is considered in [26]. [sent-157, score-0.342]
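The exact form of Eq. (10) is not reproduced in this summary; the sketch below assumes the natural update P^(q+1) = P_{k,α} P^(q) P_{k,β}ᵀ + Δ with P^(1) = Δ, which by Lemma 2 unrolls to the partial sums of Eq. (7), and it checks numerically that the iteration converges to the closed form of Eq. (11) without ever building the N² × N² TPG matrix (the Kronecker product is formed here only to verify the claim).

import numpy as np

def fuse_diffuse(P_a, P_b, Delta, iters=200):
    # Fusion with Diffusion on N x N matrices (assumed form of Eq. (10)):
    # P^(q+1) = P_a @ P^(q) @ P_b.T + Delta, starting from P^(1) = Delta.
    P = Delta.copy()
    for _ in range(iters):
        P = P_a @ P @ P_b.T + Delta
    return P

rng = np.random.default_rng(0)
N = 5
P_a = rng.random((N, N)); P_a /= P_a.sum(axis=1).max() + 0.1   # row sums < 1 => convergence
P_b = rng.random((N, N)); P_b /= P_b.sum(axis=1).max() + 0.1
Delta = np.eye(N)                                              # Delta = I is an assumed choice
P_iter = fuse_diffuse(P_a, P_b, Delta)
# Closed form of Eq. (11): vec(P*) = (I - P_b kron P_a)^{-1} vec(Delta), column-stacking vec.
closed = np.linalg.solve(np.eye(N * N) - np.kron(P_b, P_a), Delta.flatten(order='F'))
print(np.allclose(P_iter, closed.reshape((N, N), order='F')))  # True

Each iteration here works only with N × N matrices; the paper's O(n^2) claim (Section 4) presumably additionally exploits the sparsity of the k-nearest-neighbor matrices, which this dense sketch does not.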

67 Proposition 1: $\lim_{q \to \infty} \mathrm{vec}\big(P^{(q+1)}\big) = \lim_{q \to \infty} \sum_{e=0}^{q-1} \mathcal{P}^e \, \mathrm{vec}(\Delta) = (I - \mathcal{P})^{-1} \mathrm{vec}(\Delta)$. (12) [sent-158, score-0.24]

68 Proof: Eq. [sent-159, score-0.076]

69 Lemma 2: $\mathrm{vec}\big((P_{k,\alpha})^e \, \Delta \, (P_{k,\beta}^T)^e\big) = \mathcal{P}^e \, \mathrm{vec}(\Delta)$ for $e = 1, 2, \ldots$ [sent-168, score-0.278]

70 Suppose $\mathcal{P}^{l}\,\mathrm{vec}(\Delta) = \mathrm{vec}\big((P_{k,\alpha})^{l}\,\Delta\,(P_{k,\beta}^{T})^{l}\big)$ is true for $e = l$; then for $e = l+1$ we have $\mathcal{P}^{l+1}\mathrm{vec}(\Delta) = \mathcal{P}\,\big(\mathcal{P}^{l}\mathrm{vec}(\Delta)\big) = \mathrm{vec}\big(P_{k,\alpha}\,\mathrm{vec}^{-1}(\mathcal{P}^{l}\mathrm{vec}(\Delta))\,P_{k,\beta}^{T}\big) = \mathrm{vec}\big(P_{k,\alpha}\,((P_{k,\alpha})^{l}\,\Delta\,(P_{k,\beta}^{T})^{l})\,P_{k,\beta}^{T}\big) = \mathrm{vec}\big((P_{k,\alpha})^{l+1}\,\Delta\,(P_{k,\beta}^{T})^{l+1}\big)$, and the proof of Lemma 2 is complete. [sent-173, score-0.72]

71 By Lemma 1 and Lemma 2, we obtain that $\mathrm{vec}\Big(\sum_{e=0}^{q-1}(P_{k,\alpha})^e \, \Delta \, (P_{k,\beta}^T)^e\Big) = \sum_{e=0}^{q-1} \mathcal{P}^e \, \mathrm{vec}(\Delta)$. [sent-174, score-0.24]
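A quick numerical spot-check of the vec/Kronecker identity behind Lemma 2 for e = 1, again under the column-stacking vec convention, in which the TPG transition matrix is the Kronecker product P_{k,β} ⊗ P_{k,α} (the factor order is tied to that convention and is an assumption here).

import numpy as np

rng = np.random.default_rng(1)
N = 4
P_a, P_b, Delta = rng.random((N, N)), rng.random((N, N)), rng.random((N, N))
lhs = (P_a @ Delta @ P_b.T).flatten(order='F')      # vec(P_a Delta P_b^T)
rhs = np.kron(P_b, P_a) @ Delta.flatten(order='F')  # (P_b kron P_a) vec(Delta)
print(np.allclose(lhs, rhs))                        # True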

72 We now show how FD could improve the original similarity measures. [sent-177, score-0.133]

73 Suppose we have two similarity measures Sα and Sβ . [sent-178, score-0.174]

74 I1 denotes the image patch enclosing the target in frame t−1. [sent-179, score-0.249]

75 According to Sα , there are many patches in frame t that have nearly equal similarity to I1 with patch In being most similar to I1 , while according to Sβ , I1 is clearly more similar to Im in frame t. [sent-180, score-0.492]

76 Then the proposed diffusion will enhance the similarity Sβ (I1 , Im ), since it will propagate the Sβ similarity of I1 to Im faster than to the other patches. [sent-181, score-0.417]

77 Consequently, the final joint similarity P∗ will have Im as the most similar to I1 . [sent-183, score-0.133]

78 5 Weight Estimation The weight ωα of measure Sα is proportional to how well Sα is able to distinguish the target I1 in frame It−1 from the background surrounding the target. [sent-198, score-0.222]

79 , H} be a set of background patches surrounding the target I1 in frame It−1 . [sent-202, score-0.25]

80 The weights of all similarity measures are normalized so that $\sum_{\alpha=1}^{Q} \omega_\alpha = 1$. [sent-204, score-0.174]

81 The weights are computed for every frame in order to accommodate appearance changes of the tracked object. [sent-205, score-0.185]
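The exact discriminability score behind ω_α is not reproduced in this summary. The sketch below uses one plausible proxy, stated here as an assumption: a measure receives more weight the less similar it makes the target I1 look to the surrounding background patches, and the weights are normalized to sum to one; in a tracker they would be recomputed at every frame, as stated above.

import numpy as np

def measure_weights(sims_target_bg, eps=1e-6):
    # sims_target_bg[a]: length-H vector of S_a(I1, b) over the background patches b (values in [0, 1]).
    # Assumed proxy: better separation of target from background => larger weight.
    raw = np.array([1.0 / (np.mean(s) + eps) for s in sims_target_bg])
    return raw / raw.sum()                          # normalize so that sum_a omega_a = 1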

82 6 Experimental Results We validate our tracking algorithm on eight challenging videos from [4] and [17]: Sylvester, Coke Can, Tiger1, Cliff Bar, Coupon Book, Surfer, Tiger2, and PETS01D1. [sent-206, score-0.186]

83 For Linear, we use three kinds of image features to obtain the affinities, then simply calculate the average affinity and apply the diffusion process of [26]. [sent-210, score-0.211]

84 Table caption: red indicates the best performance, blue the second best, and green the third best; the table reports per-video results (column headers: Video, Coke Can, Cliff Bar, Tiger 1, Tiger2, Coup. ...). [sent-223, score-0.069]

85 Red color indicates the best performance, blue the second best, and green the third best. [sent-305, score-0.069]

86 Comparison to matching-based methods: MS, IVT, Frag, and Linear are all matching-based tracking algorithms. [sent-386, score-0.194]

87 For Linear Combination, the average similarity is used, and the diffusion process in [26] is applied to improve the similarity measure. [sent-462, score-0.442]

88 Our tracking results achieve the best performance in all the testing videos, especially for the Precision Plots shown in Table 2. [sent-464, score-0.15]

89 In some videos, like Sylvester and PETS01D1, Frag achieves results comparable to our method, but it performs badly in other videos, which means that a specific distance measure can only work in some special cases, while our fusion framework is robust to all the challenges that appear in the videos. [sent-467, score-0.258]

90 Our method is always better than Linear Combination, which means that the fusion with diffusion can really improve the tracking performance. [sent-468, score-0.412]

91 The stability of our method can be clearly seen in the plots of location error as a function of frame number in Fig. [sent-469, score-0.177]

92 Our tracking results are always stable, which means that we do not lose the target during the whole tracking process. [sent-471, score-0.335]

93 Comparison to classification-based methods: MIL and OAB are both classification-based tracking algorithms. [sent-474, score-0.15]

94 7 Conclusions In this paper, a novel Fusion with Diffusion process is proposed for robust visual tracking. [sent-482, score-0.088]

95 Pairs of similarity measures are fused into a single similarity measure with a diffusion process on the tensor product of two graphs determined by the two similarity measures. [sent-483, score-0.741]

96 It is evaluated on several challenging videos, and it significantly outperforms a large number of state-of-the-art tracking algorithms. [sent-485, score-0.15]

97 A robust boosting tracker with minimum error bound in a co-training framework. [sent-579, score-0.077]

98 Robust visual tracking and vehicle classification via sparse representation. [sent-584, score-0.189]

99 Minimum error bounded efficient l1 tracker with occlusion detection. [sent-592, score-0.068]

100 Affinity learning on a tensor product graph with applications to shape and image retrieval. [sent-653, score-0.125]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('frag', 0.579), ('pk', 0.525), ('vec', 0.24), ('ivt', 0.23), ('tpg', 0.175), ('diffusion', 0.151), ('tracking', 0.15), ('emd', 0.142), ('frame', 0.137), ('similarity', 0.133), ('chi', 0.127), ('ks', 0.11), ('ms', 0.102), ('fusion', 0.089), ('fd', 0.067), ('mil', 0.066), ('oab', 0.066), ('cliff', 0.055), ('coke', 0.055), ('sylvester', 0.053), ('tracker', 0.053), ('patches', 0.048), ('precision', 0.046), ('surfer', 0.044), ('tensor', 0.044), ('coupon', 0.044), ('semiboost', 0.044), ('measures', 0.041), ('graph', 0.04), ('location', 0.04), ('similarities', 0.04), ('visual', 0.039), ('multigraph', 0.039), ('trackers', 0.039), ('lim', 0.038), ('patch', 0.037), ('videos', 0.036), ('target', 0.035), ('appearance', 0.034), ('book', 0.033), ('semi', 0.032), ('pixel', 0.032), ('center', 0.032), ('ut', 0.032), ('vision', 0.031), ('bar', 0.031), ('pattern', 0.03), ('threshold', 0.029), ('tiger', 0.029), ('diffused', 0.029), ('grabner', 0.029), ('cvpr', 0.028), ('combine', 0.028), ('process', 0.025), ('cues', 0.024), ('robust', 0.024), ('im', 0.023), ('iterative', 0.023), ('color', 0.023), ('fused', 0.022), ('acle', 0.022), ('batter', 0.022), ('cle', 0.022), ('matching', 0.022), ('image', 0.022), ('lbp', 0.021), ('nity', 0.021), ('measure', 0.02), ('graphs', 0.02), ('product', 0.019), ('forthcoming', 0.019), ('weighted', 0.019), ('af', 0.019), ('ieee', 0.018), ('enclosing', 0.018), ('temple', 0.018), ('hog', 0.018), ('candidate', 0.018), ('recognition', 0.017), ('society', 0.017), ('background', 0.017), ('unlabeled', 0.017), ('leistner', 0.017), ('object', 0.017), ('computer', 0.016), ('foreground', 0.016), ('mei', 0.016), ('frames', 0.016), ('famous', 0.015), ('bh', 0.015), ('transactions', 0.015), ('occlusion', 0.015), ('adaboost', 0.015), ('pq', 0.015), ('tracked', 0.014), ('bai', 0.014), ('pe', 0.014), ('histograms', 0.014), ('surrounding', 0.013), ('kinds', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking

Author: Yu Zhou, Xiang Bai, Wenyu Liu, Longin J. Latecki

Abstract: A weighted graph is used as an underlying structure of many algorithms like semisupervised learning and spectral clustering. If the edge weights are determined by a single similarity measure, then it hard if not impossible to capture all relevant aspects of similarity when using a single similarity measure. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. The obtained method is tested on several challenge video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. 1

2 0.2820864 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen

Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1

3 0.089787215 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

Author: Mingyuan Zhou, Lawrence Carin

Abstract: By developing data augmentation methods unique to the negative binomial (NB) distribution, we unite seemingly disjoint count and mixture models under the NB process framework. We develop fundamental properties of the models and derive efficient Gibbs sampling inference. We show that the gamma-NB process can be reduced to the hierarchical Dirichlet process with normalization, highlighting its unique theoretical, structural and computational advantages. A variety of NB processes with distinct sharing mechanisms are constructed and applied to topic modeling, with connections to existing algorithms, showing the importance of inferring both the NB dispersion and probability parameters. 1

4 0.086981125 256 nips-2012-On the connections between saliency and tracking

Author: Vijay Mahadevan, Nuno Vasconcelos

Abstract: A model connecting visual tracking and saliency has recently been proposed. This model is based on the saliency hypothesis for tracking which postulates that tracking is achieved by the top-down tuning, based on target features, of discriminant center-surround saliency mechanisms over time. In this work, we identify three main predictions that must hold if the hypothesis were true: 1) tracking reliability should be larger for salient than for non-salient targets, 2) tracking reliability should have a dependence on the defining variables of saliency, namely feature contrast and distractor heterogeneity, and must replicate the dependence of saliency on these variables, and 3) saliency and tracking can be implemented with common low level neural mechanisms. We confirm that the first two predictions hold by reporting results from a set of human behavior studies on the connection between saliency and tracking. We also show that the third prediction holds by constructing a common neurophysiologically plausible architecture that can computationally solve both saliency and tracking. This architecture is fully compliant with the standard physiological models of V1 and MT, and with what is known about attentional control in area LIP, while explaining the results of the human behavior experiments.

5 0.077640764 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

Author: Simon Lyons, Amos J. Storkey, Simo Särkkä

Abstract: Stochastic differential equations (SDE) are a natural tool for modelling systems that are inherently noisy or contain uncertainties that can be modelled as stochastic processes. Crucial to the process of using SDE to build mathematical models is the ability to estimate parameters of those models from observed data. Over the past few decades, significant progress has been made on this problem, but we are still far from having a definitive solution. We describe a novel method of approximating a diffusion process that we show to be useful in Markov chain Monte-Carlo (MCMC) inference algorithms. We take the ‘white’ noise that drives a diffusion process and decompose it into two terms. The first is a ‘coloured noise’ term that can be deterministically controlled by a set of auxilliary variables. The second term is small and enables us to form a linear Gaussian ‘small noise’ approximation. The decomposition allows us to take a diffusion process of interest and cast it in a form that is amenable to sampling by MCMC methods. We explain why many state-of-the-art inference methods fail on highly nonlinear inference problems, and we demonstrate experimentally that our method performs well in such situations. Our results show that this method is a promising new tool for use in inference and parameter estimation problems. 1

6 0.072148621 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

7 0.071332753 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

8 0.060441464 94 nips-2012-Delay Compensation with Dynamical Synapses

9 0.050630562 182 nips-2012-Learning Networks of Heterogeneous Influence

10 0.049855359 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

11 0.047293853 330 nips-2012-Supervised Learning with Similarity Functions

12 0.045864381 338 nips-2012-The Perturbed Variation

13 0.043997977 193 nips-2012-Learning to Align from Scratch

14 0.043367039 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

15 0.042798229 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

16 0.041758455 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

17 0.041301228 9 nips-2012-A Geometric take on Metric Learning

18 0.038356703 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

19 0.03779389 202 nips-2012-Locally Uniform Comparison Image Descriptor

20 0.037031494 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.096), (1, 0.028), (2, -0.045), (3, -0.023), (4, 0.031), (5, -0.005), (6, 0.005), (7, -0.022), (8, 0.037), (9, 0.022), (10, -0.003), (11, -0.057), (12, 0.012), (13, -0.05), (14, -0.031), (15, 0.066), (16, 0.008), (17, 0.031), (18, 0.024), (19, -0.009), (20, 0.078), (21, -0.033), (22, 0.009), (23, 0.07), (24, 0.037), (25, -0.12), (26, 0.064), (27, -0.01), (28, -0.049), (29, -0.105), (30, -0.065), (31, 0.09), (32, 0.069), (33, -0.106), (34, 0.155), (35, -0.065), (36, -0.053), (37, 0.021), (38, -0.035), (39, 0.303), (40, -0.128), (41, 0.023), (42, -0.019), (43, 0.044), (44, -0.025), (45, -0.019), (46, 0.009), (47, 0.052), (48, 0.028), (49, 0.101)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95422089 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking

Author: Yu Zhou, Xiang Bai, Wenyu Liu, Longin J. Latecki

Abstract: A weighted graph is used as an underlying structure of many algorithms like semisupervised learning and spectral clustering. If the edge weights are determined by a single similarity measure, then it hard if not impossible to capture all relevant aspects of similarity when using a single similarity measure. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. The obtained method is tested on several challenge video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. 1

2 0.61727035 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen

Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1

3 0.52597088 256 nips-2012-On the connections between saliency and tracking

Author: Vijay Mahadevan, Nuno Vasconcelos

Abstract: A model connecting visual tracking and saliency has recently been proposed. This model is based on the saliency hypothesis for tracking which postulates that tracking is achieved by the top-down tuning, based on target features, of discriminant center-surround saliency mechanisms over time. In this work, we identify three main predictions that must hold if the hypothesis were true: 1) tracking reliability should be larger for salient than for non-salient targets, 2) tracking reliability should have a dependence on the defining variables of saliency, namely feature contrast and distractor heterogeneity, and must replicate the dependence of saliency on these variables, and 3) saliency and tracking can be implemented with common low level neural mechanisms. We confirm that the first two predictions hold by reporting results from a set of human behavior studies on the connection between saliency and tracking. We also show that the third prediction holds by constructing a common neurophysiologically plausible architecture that can computationally solve both saliency and tracking. This architecture is fully compliant with the standard physiological models of V1 and MT, and with what is known about attentional control in area LIP, while explaining the results of the human behavior experiments.

4 0.51375443 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

Author: Yung-kyun Noh, Frank Park, Daniel D. Lee

Abstract: This paper sheds light on some fundamental connections of the diffusion decision making model of neuroscience and cognitive psychology with k-nearest neighbor classification. We show that conventional k-nearest neighbor classification can be viewed as a special problem of the diffusion decision model in the asymptotic situation. By applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in knearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectiveness of our classification criteria. 1

5 0.48840961 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

Author: Mingyuan Zhou, Lawrence Carin

Abstract: By developing data augmentation methods unique to the negative binomial (NB) distribution, we unite seemingly disjoint count and mixture models under the NB process framework. We develop fundamental properties of the models and derive efficient Gibbs sampling inference. We show that the gamma-NB process can be reduced to the hierarchical Dirichlet process with normalization, highlighting its unique theoretical, structural and computational advantages. A variety of NB processes with distinct sharing mechanisms are constructed and applied to topic modeling, with connections to existing algorithms, showing the importance of inferring both the NB dispersion and probability parameters. 1

6 0.40051922 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

7 0.3501347 182 nips-2012-Learning Networks of Heterogeneous Influence

8 0.32398093 94 nips-2012-Delay Compensation with Dynamical Synapses

9 0.31393462 338 nips-2012-The Perturbed Variation

10 0.30666852 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

11 0.28554234 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates

12 0.2835159 223 nips-2012-Multi-criteria Anomaly Detection using Pareto Depth Analysis

13 0.27801937 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

14 0.27474037 128 nips-2012-Fast Resampling Weighted v-Statistics

15 0.27221426 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

16 0.26604089 242 nips-2012-Non-linear Metric Learning

17 0.26273078 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

18 0.25992516 193 nips-2012-Learning to Align from Scratch

19 0.25946105 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

20 0.25918975 303 nips-2012-Searching for objects driven by context


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.019), (4, 0.321), (17, 0.014), (21, 0.021), (38, 0.107), (39, 0.023), (42, 0.02), (54, 0.014), (72, 0.01), (74, 0.09), (76, 0.149), (80, 0.058), (92, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73319334 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking

Author: Yu Zhou, Xiang Bai, Wenyu Liu, Longin J. Latecki

Abstract: A weighted graph is used as an underlying structure of many algorithms like semisupervised learning and spectral clustering. If the edge weights are determined by a single similarity measure, then it hard if not impossible to capture all relevant aspects of similarity when using a single similarity measure. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. The obtained method is tested on several challenge video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. 1

2 0.60967994 279 nips-2012-Projection Retrieval for Classification

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. 
The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. 
Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. 
ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . 
Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. 
First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). 
In Table 1, rows correspond to the number of distinct synthetic batches injected into the data, q, and subsequent columns correspond to increasing amounts of noise. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur in highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is, when the injected patterns do not interfere with each other. As the number of projections increases, the chance of overlap among the affected features also increases, which makes the data more confusing and results in a gradual drop in recall, until only about 3 or 4 of the 7 subspaces known to be informative can be recovered. We have also used lasso, as described in Section 2.4, in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned.

3.2 Classification accuracy
Table 2 shows the classification accuracy of RECIP obtained on the synthetic data. As expected, the observed performance is initially high when there are few informative projections in the data, and it decreases as the noise and the ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than a model selection scheme. For this reason, we have also compared the predictive accuracy of RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data that satisfies the subspace separability assumption. For reference, we have also classified the synthetic data with the K-Nearest-Neighbors (K-NN) algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when fewer synthetic patterns are injected into the data to form informative projections, because some of the features K-NN then uses are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to the feature space, using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances.

3.3 RECIP case studies using real-world data
Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods on the Cell dataset, which contains measurements such as the area and perimeter of biological cells, with separate labels marking cells subjected to treatment and control cells. On the Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative.
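For the voting and K-NN baselines summarized in Tables 4 and 5 below, the voting variant replaces RECIP's per-query selection with a majority vote among classifiers trained on the individual candidate subspaces. A minimal sketch, assuming binary 0/1 labels, scikit-learn's k-NN classifier, and a given list of candidate subspaces (none of which are prescribed above), might look like this:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def voting_baseline(X_train, y_train, X_test, subspaces, k=5):
    """Majority vote among k-NN classifiers, each restricted to one candidate subspace."""
    votes = []
    for feats in subspaces:
        cols = list(feats)
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_train[:, cols], y_train)
        votes.append(clf.predict(X_test[:, cols]))
    votes = np.stack(votes)                         # shape: (num_subspaces, num_test_points)
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote over 0/1 predictions
```

The plain K-NN baseline of Table 5 corresponds to passing a single subspace that contains all of the features.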
Table 1: Projection recovery for artificial datasets with 1...7 informative features and noise levels 0...0.2, in terms of the mean and variance of precision and recall. Mean and variance are obtained for each setting by repeating the experiment with datasets having different informative projections.

Mean precision (rows: noise level; columns: q = 1...7):
    noise 0.00:  1       1  1  1  1  1  1
    noise 0.02:  1       1  1  1  1  1  1
    noise 0.05:  1       1  1  1  1  1  1
    noise 0.10:  0.9286  1  1  1  1  1  1
    noise 0.20:  0.9286  1  1  1  1  1  1

Mean recall (rows: noise level; columns: q = 1...7):
    noise 0.00:  1  1  1       0.9643  0.7714  0.6429  0.6327
    noise 0.02:  1  1  1       0.9643  0.7429  0.6905  0.5918
    noise 0.05:  1  1  0.9524  0.9643  0.8286  0.6905  0.5918
    noise 0.10:  1  1  0.9524  0.9643  0.8571  0.6905  0.5714
    noise 0.20:  1  1  1       0.9286  0.7714  0.6905  0.5510

(Variances are small throughout: at most 0.0306 for precision and 0.0278 for recall.)

Table 2: RECIP classification accuracy on artificial data. [Mean and variance of accuracy for q = 1...7 and noise levels 0...0.2, both with projections recovered by RECIP and with the informative projections assumed known. Mean accuracy ranges from 0.9751 (q = 1, no noise) down to 0.7205 (q = 7, noise 0.2) when the projections are recovered, and remains above 0.79 when they are known.]

Table 3: Accuracy of K-NN and RECIP.
    Dataset              K-NN     RECIP
    Breast Cancer Wis    0.8415   0.8275
    Breast Tissue        1.0000   1.0000
    Cell                 0.7072   0.7640
    MiniBOONE*           0.7896   0.7396
    Spam                 0.7680   0.7680
    Vowel                0.9839   0.9839

On the Spam data, the two most informative projections are 'Capital Length Total' (CLT) / 'Capital Length Longest' (CLL) and CLT / 'Frequency of word your' (FWY). Figure 1 shows these two projections, with the dots representing training points. The red dots represent points labeled as spam, while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles, far fewer, are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user needed to attest to the validity of the classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can therefore be flagged for further investigation.
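The per-query assignment of projections illustrated above follows the selection rule (9). A minimal sketch is given below; it assumes that ν_k is the distance from the projected query to its k-th nearest neighbor in the projected class sample, which is one common choice in nearest-neighbor entropy estimators, and the exact estimator used in the paper may differ. All function names are illustrative.

```python
import numpy as np

def kth_nn_distance(x, pts, k=5):
    """nu_k: distance from x to its k-th nearest neighbor among the rows of pts."""
    d = np.sort(np.linalg.norm(pts - x, axis=1))
    return d[min(k, len(d)) - 1]

def select_projection_and_label(x, X, y, projections, k=5, alpha=0.99):
    """Pick the (projection, label) pair minimizing the nu_k ratio of rule (9)."""
    best, best_score = None, np.inf
    for i, feats in enumerate(projections):
        cols = list(feats)
        xp = x[cols]
        for theta in np.unique(y):
            same = X[y == theta][:, cols]
            other = X[y != theta][:, cols]
            ratio = kth_nn_distance(xp, same, k) / kth_nn_distance(xp, other, k)
            score = ratio ** (len(cols) * (1 - alpha))   # dim(pi_i) * (1 - alpha) exponent
            if score < best_score:
                best, best_score = (i, theta), score
    return best   # (index of the selected projection, predicted label)
```

In the Spam example, this is the step that decides whether a particular query is plotted, and classified, in the CLT/CLL projection or in the CLT/FWY projection.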
Figure 1: Spam dataset selected subspaces. [Two scatter plots over the training data: Capital Run Length Longest versus Capital Run Length Total, and Frequency of word 'your' versus Capital Run Length Total.]

Table 4: Classification accuracy using RECIP-learned projections (or known projections, in the lower section) within a voting model instead of a selection model. [Mean and variance of accuracy for q = 1...7 and noise levels 0...0.2. With learned projections, the voting ensemble reaches about 0.98 for q = 1 but drops to roughly 0.63-0.74 for q ≥ 2, well below the corresponding RECIP accuracies in Table 2; variances remain below 0.01.]

Table 5: Classification accuracy for the artificial data with the K-Nearest Neighbors method. [Mean accuracy lies between 0.7412 and 0.8054 across all settings, increasing with the number of informative batches q and decreasing slightly with noise; variances are at most 0.0002.]

4 Conclusion
This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process.

Acknowledgments
This material is based upon work supported by the NSF under Grant No. IIS-0911032.

3 0.58852458 319 nips-2012-Sparse Prediction with the $k$-Support Norm

Author: Andreas Argyriou, Rina Foygel, Nathan Srebro

Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an ℓ2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1

4 0.55314839 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

Author: Tomasz Trzcinski, Mario Christoudias, Vincent Lepetit, Pascal Fua

Abstract: In this paper we apply boosting to learn complex non-linear local visual feature representations, drawing inspiration from its successful application to visual object detection. The main goal of local feature descriptors is to distinctively represent a salient image region while remaining invariant to viewpoint and illumination changes. This representation can be improved using machine learning; however, past approaches have mostly been limited to learning linear feature mappings in either the original input or a kernelized input feature space. While kernelized methods have proven somewhat effective for learning non-linear local feature descriptors, they rely heavily on the choice of an appropriate kernel function, whose selection is often difficult and non-intuitive. We propose to use the boosting-trick to obtain a non-linear mapping of the input to a high-dimensional feature space. The non-linear feature mapping obtained with the boosting-trick is highly intuitive. We employ gradient-based weak learners, resulting in a learned descriptor that closely resembles the well-known SIFT. As demonstrated in our experiments, the resulting descriptor can be learned directly from intensity patches, achieving state-of-the-art performance. 1

5 0.55245847 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

Author: Du Tran, Junsong Yuan

Abstract: Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because we need to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods. 1

6 0.55063808 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

7 0.54881537 260 nips-2012-Online Sum-Product Computation Over Trees

8 0.54845119 210 nips-2012-Memorability of Image Regions

9 0.54831028 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

10 0.54722518 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

11 0.54707384 201 nips-2012-Localizing 3D cuboids in single-view images

12 0.54705852 303 nips-2012-Searching for objects driven by context

13 0.54667073 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

14 0.54520345 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

15 0.54466122 185 nips-2012-Learning about Canonical Views from Internet Image Collections

16 0.54455692 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

17 0.54455668 357 nips-2012-Unsupervised Template Learning for Fine-Grained Object Recognition

18 0.54310691 146 nips-2012-Graphical Gaussian Vector for Image Categorization

19 0.54282707 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

20 0.54168415 168 nips-2012-Kernel Latent SVM for Visual Recognition