nips nips2012 nips2012-145 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samory Kpotufe, Abdeslam Boularias
Abstract: In regression problems over $\mathbb{R}^d$, the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In regression problems over $\mathbb{R}^d$, the unknown function f often varies more in some coordinates than in others. [sent-5, score-0.21]
2 We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. [sent-6, score-0.126]
3 We propose a simple estimator of these derivative norms and prove its consistency. [sent-9, score-0.086]
4 1 Introduction In regression problems over Rd , the unknown function f might vary more in some coordinates than in others, even though all coordinates might be relevant. [sent-11, score-0.322]
5 How much f varies with coordinate i can be captured by the norm $\|f_i\|_{1,\mu} = \mathbb{E}_X |f_i(X)|$ of the ith derivative $f_i = e_i^\top \nabla f$ of f. [sent-12, score-1.143]
6 A simple way to take advantage of the information in $\|f_i\|_{1,\mu}$ is to weight each coordinate proportionally to an estimate of $\|f_i\|_{1,\mu}$. [sent-13, score-1.049]
7 The intuition, detailed in Section 2, is that the resulting data space behaves as a low-dimensional projection to coordinates with large norm $\|f_i\|_{1,\mu}$, while maintaining information about all coordinates. [sent-14, score-0.618]
8 We show that such weighting can be learned efficiently, both in batch-mode and online, and can significantly improve the performance of distance-based regressors in real-world applications. [sent-15, score-0.132]
9 For distance-based methods, the weights can be incorporated into a distance function of the form $\rho(x, x') = (x - x')^\top \mathbf{W} (x - x')$, where each element $W_i$ of the diagonal matrix $\mathbf{W}$ is an estimate of $\|f_i\|_{1,\mu}$. [sent-17, score-0.545]
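As a concrete illustration of how such a diagonal weighting enters a distance-based regressor, here is a minimal sketch (ours, not the authors' code); the array `W` stands in for the estimated derivative norms, and the k-NN rule is just one example of a distance-based regressor.

```python
import numpy as np

def weighted_sq_dist(x, X, W):
    """Squared distances (x - X_j)^T diag(W) (x - X_j) to every row of X."""
    diff = X - x                              # shape (n, d)
    return (diff * diff) @ W                  # shape (n,)

def knn_predict(x, X, Y, W, k=5):
    """k-NN regression under the gradient-weighted metric rho."""
    idx = np.argsort(weighted_sq_dist(x, X, W))[:k]
    return Y[idx].mean()

# Toy usage: f varies mostly along the first coordinate.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
Y = np.sin(4 * X[:, 0]) + 0.05 * X[:, 1] + 0.01 * rng.standard_normal(500)
W = np.array([1.0, 0.05])                     # stand-in for estimated norms ||f_i||_{1,mu}
print(knn_predict(np.array([0.3, 0.7]), X, Y, W, k=10))
```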
10 This is not metric learning [1, 2, 3, 4] where the best ρ is found by optimizing over a sufficiently large space of possible metrics. [sent-18, score-0.1]
11 Clearly metric learning can only yield better performance, but the optimization over a larger space will result in heavier preprocessing time, often $O(n^2)$ on datasets of size n. [sent-19, score-0.146]
12 Yet, preprocessing time is especially important in many modern applications where both training and prediction are done online (e. [sent-20, score-0.109]
13 Here we do not optimize over a space of metrics, but rather estimate a single metric ρ based on the norms $\|f_i\|_{1,\mu}$. [sent-23, score-0.621]
14 Our metric ρ is efficiently obtained, can be estimated online, and still significantly improves the performance of distance-based regressors. [sent-24, score-0.1]
15 To estimate $\|f_i\|_{1,\mu}$, one does not need to estimate $f_i$ well everywhere, just well on average. [sent-25, score-1.042]
16 [5]), we have to keep in mind our need for a fast but consistent estimator of $\|f_i\|_{1,\mu}$. [sent-28, score-0.548]
17 We propose a simple estimator $W_i$ which averages the differences along coordinate i of an estimator $f_{n,h}$ of f. [sent-29, score-0.118]
18 More precisely (see Section 3), $W_i$ has the form $\mathbb{E}_n \big| f_{n,h}(X + te_i) - f_{n,h}(X - te_i) \big| / 2t$, where $\mathbb{E}_n$ denotes the empirical expectation over a sample $\{X_i\}_1^n$. [sent-30, score-0.84]
19 In this paper $f_{n,h}$ is a kernel estimator, although any regression method might be used in estimating $\|f_i\|_{1,\mu}$. [sent-32, score-0.645]
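A minimal sketch (ours, under the stated form of $W_i$) of this estimator: average central differences of a pilot estimator over the sample; `f_hat` is a hypothetical stand-in for the pilot regressor $f_{n,h}$.

```python
import numpy as np

def gradient_weights(f_hat, X, t):
    """W_i = empirical average of |f_hat(x + t e_i) - f_hat(x - t e_i)| / (2 t)."""
    n, d = X.shape
    W = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = t
        diffs = [abs(f_hat(x + e) - f_hat(x - e)) for x in X]
        W[i] = np.mean(diffs) / (2.0 * t)
    return W

# Toy usage with a known function standing in for the pilot estimator.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
f_hat = lambda x: 3.0 * x[0] + 0.2 * np.sin(x[1])   # varies mostly along coordinate 0
print(gradient_weights(f_hat, X, t=0.05))            # roughly [3.0, 0.17, 0.0]
```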
20 Figure 1: Typical gradient weights $W_i \approx \|f_i\|_{1,\mu}$. (c) Telecom. [sent-36, score-0.546]
21 Most related work As we mentioned above, metric learning is closest in spirit to the gradient-weighting approach presented here, but our approach is different from metric learning in that we do not search a space of possible metrics, but rather estimate a single metric based on gradients. [sent-40, score-0.322]
22 There exist many metric learning approaches, mostly for classification and a few for regression (e.g. [1, 2]). [sent-42, score-0.181]
23 The approaches of [1, 2] for regression are meant for batch learning. [sent-45, score-0.128]
24 In the case of kernel regression and local polynomial regression, multiple bandwidths can be used, one for each coordinate [6]. [sent-48, score-0.157]
25 However, tuning d bandwidth parameters requires searching a d×d grid, which is impractical even in batch mode. [sent-49, score-0.073]
26 Here we do not need sparsity; instead, we only need the target function to vary in some coordinates more than in others. [sent-58, score-0.107]
27 2 Technical motivation In this section, we motivate the approach by considering the ideal situation where $W_i = \|f_i\|_{1,\mu}$. [sent-60, score-0.499]
28 Let’s consider regression on (X , ρ), where the input space X ⊂ Rd is connected. [sent-61, score-0.081]
29 kernel or k-NN) is well known to be the sum of its variance and its bias [7]. [sent-64, score-0.114]
30 Regression variance decreases on $(\mathcal{X}, \rho)$: The variance of a distance-based estimate $f_n(x)$ is inversely proportional to the number of samples (and hence the mass) in a neighborhood of x (see e.g. [sent-66, score-0.117]
31 Relative to a Euclidean ball, a ball $B_\rho$ of similar radius has more mass in the direction $e_1$ in which f varies least. [sent-72, score-0.11]
32 Essentially, let $R \subset [d]$ be the set of coordinates with larger weights $W_i$; then the mass of balls $B_\rho$ behaves like the mass of balls in $\mathbb{R}^{|R|}$. [sent-74, score-0.269]
33 Thus, effectively, regression in $(\mathcal{X}, \rho)$ has variance nearly as small as that for regression in the lower-dimensional space $\mathbb{R}^{|R|}$. [sent-75, score-0.194]
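To make the mass argument concrete, a small simulation (ours, not from the paper): with uniform data, a ρ-ball of the same radius as a Euclidean ball captures many more sample points when the weight on the less relevant coordinate $e_1$ is small.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(20000, 2))
x0 = np.array([0.5, 0.5])
r = 0.05

W = np.array([0.05, 1.0])   # small weight on e_1, along which f varies least
d_euclid = np.sum((X - x0) ** 2, axis=1)
d_rho = np.sum((X - x0) ** 2 * W, axis=1)

print("points in Euclidean ball:", np.sum(d_euclid <= r ** 2))
print("points in rho-ball:      ", np.sum(d_rho <= r ** 2))   # noticeably larger
```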
34 Note that the assumptions on the marginal µ in the lemma statement are verified for instance when µ has a continuous lower-bounded density on X . [sent-76, score-0.094]
35 Suppose $\mathcal{X} \equiv \frac{1}{\sqrt{d}}[0, 1]^d$, and the marginal µ satisfies on $(\mathcal{X}, \|\cdot\|)$, for some $C_1, C_2$: $\forall x \in \mathcal{X}, \forall r > 0$, $C_1 r^d \le \mu(B(x, r)) \le C_2 r^d$. [sent-80, score-0.079]
36 Regression bias remains bounded on $(\mathcal{X}, \rho)$: The bias of distance-based regressors is controlled by the smoothness of the unknown function f on $(\mathcal{X}, \rho)$, i.e. [sent-84, score-0.162]
37 Turning back to our earlier example in $\mathbb{R}^2$, some points $x'$ that were originally far from x along $e_1$ might now be included in the estimate $f_n(x)$ on $(\mathcal{X}, \rho)$. [sent-87, score-0.123]
38 Intuitively, this should not add bias to the estimate since f does not vary much along $e_1$. [sent-88, score-0.085]
39 Suppose each derivative $f_i$ is bounded on $\mathcal{X}$ by $|f_i|_{\sup}$. [sent-91, score-0.536]
40 Applying the above lemma with $W_i = 1$, we see that in the original Euclidean space, the variation in f relative to the distance between points $x, x'$ is of the order $\sum_{i \in R} |f_i|_{\sup}$. [sent-95, score-0.121]
41 This variation in f is now increased in $(\mathcal{X}, \rho)$ by a factor of $1/\inf_{i \in R} \|f_i\|_{1,\mu}$ in the worst case. [sent-96, score-0.499]
42 In contrast, information is lost under a projection of the data in the likely scenario that all or most coordinates are relevant. [sent-98, score-0.079]
43 However, we observed that in practice, even when all coordinates are relevant, the gradient weights vary sufficiently (Figure 1) to yield significant performance gains for distance-based regressors. [sent-100, score-0.131]
44 3 Estimating $\|f_i\|_{1,\mu}$ In all that follows we are given n i.i.d. samples. [sent-101, score-0.499]
45 The kernel estimate at x is defined using any kernel K(u), positive on [0, 1/2], and 0 for u > 1. [sent-105, score-0.116]
46 If $B(x, h) \cap \mathbf{X} = \emptyset$, $f_{n,h}(x) = \mathbb{E}_n Y$; otherwise $f_{n,\bar\rho,h}(x) = \sum_{i=1}^{n} \frac{K(\bar\rho(x, X_i)/h)}{\sum_{j=1}^{n} K(\bar\rho(x, X_j)/h)} \cdot Y_i = \sum_{i=1}^{n} w_i(x) Y_i$, (1) for some metric $\bar\rho$ and a bandwidth parameter h. [sent-106, score-0.403]
47 For the kernel regressor $f_{n,h}$ used to learn the metric ρ below, $\bar\rho$ is the Euclidean metric. [sent-107, score-0.147]
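A minimal sketch (ours) of the estimate in equation (1) with a box kernel, which is one kernel satisfying the stated conditions; when no sample falls in $B(x, h)$ the estimate falls back to the empirical mean of $Y$, and passing a weight vector switches the distance from Euclidean to the learned metric ρ.

```python
import numpy as np

def kernel_regress(x, X, Y, h, W=None):
    """Nadaraya-Watson estimate of equation (1) with a box kernel K(u) = 1{u <= 1}.

    W is None  -> Euclidean distances (as when learning the metric).
    W given    -> distances in the gradient-weighted metric rho.
    """
    if W is None:
        W = np.ones(X.shape[1])
    dist = np.sqrt(np.sum((X - x) ** 2 * W, axis=1))
    k = (dist <= h).astype(float)       # kernel weights K(dist / h)
    if k.sum() == 0:                    # B(x, h) contains no sample
        return Y.mean()                 # fall back to E_n Y
    return (k / k.sum()) @ Y
```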
48 The metric is defined as $W_i \triangleq \mathbb{E}_n \Big[ \frac{|f_{n,h}(X + te_i) - f_{n,h}(X - te_i)|}{2t} \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \Big] = \mathbb{E}_n \big[ \Delta_{t,i} f_{n,h}(X) \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \big]$, (2) where $A_{n,i}(X)$ is the event that enough samples contribute to the estimate $\Delta_{t,i} f_{n,h}(X)$. [sent-111, score-0.985]
49 For the consistency result, we assume the following setting: $A_{n,i}(X) \equiv \Big\{ \min_{s \in \{-t, t\}} \mu_n(B(X + se_i, h/2)) \ge \alpha_n \Big\}$ where $\alpha_n \triangleq \frac{2d \ln 2n + \ln(4/\delta)}{n}$. [sent-112, score-0.341]
50 4 Consistency of the estimator $W_i$ of $\|f_i\|_{1,\mu}$ 4. [sent-113, score-0.548]
51 The marginal is assumed to have a continuous density on $\mathcal{X}$ and has mass everywhere on $\mathcal{X}$: $\forall x \in \mathcal{X}, \forall h > 0, \mu(B(x, h)) \ge C_\mu h^d$. [sent-117, score-0.121]
52 Under this assumption, for samples X in dense regions, $X \pm te_i$ is also likely to be in a dense region. [sent-119, score-0.42]
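A small sketch (ours, not the authors' implementation) of how the event $A_{n,i}(X)$ from the setting above can be checked on the sample: the finite difference at $X_j$ is kept only when both balls $B(X_j \pm t e_i, h/2)$ hold empirical mass at least $\alpha_n$; the form of `alpha_n` below follows the threshold stated in the consistency setting.

```python
import numpy as np

def alpha_n(n, d, delta):
    """Mass threshold from the consistency setting (as reconstructed above)."""
    return (2 * d * np.log(2 * n) + np.log(4 / delta)) / n

def admissible_mask(X, i, t, h, alpha):
    """Boolean vector with entry j equal to 1{A_{n,i}(X_j)}."""
    n, d = X.shape
    e = np.zeros(d)
    e[i] = t
    mask = np.zeros(n, dtype=bool)
    for j in range(n):
        ok = True
        for sign in (1.0, -1.0):
            center = X[j] + sign * e
            mass = np.mean(np.linalg.norm(X - center, axis=1) <= h / 2.0)
            if mass < alpha:
                ok = False
                break
        mask[j] = ok
    return mask
```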
53 Furthermore, each derivative $f_i(x) = e_i^\top \nabla f(x)$ is upper bounded on $\mathcal{X} + B(0, \tau)$ by $|f_i|_{\sup}$ and is uniformly continuous on $\mathcal{X} + B(0, \tau)$ (this is automatically the case if the support $\mathcal{X}$ is compact). [sent-129, score-0.564]
54 For $i \in [d]$, define the (t, i)-boundary of $\mathcal{X}$ as $\partial_{t,i}(\mathcal{X}) \triangleq \{x : \{x + te_i, x - te_i\} \not\subset \mathcal{X}\}$. [sent-133, score-0.84]
55 The smaller the mass $\mu(\partial_{t,i}(\mathcal{X}))$ at the boundary, the better we approximate $\|f_i\|_{1,\mu}$. [sent-134, score-0.546]
56 The second type of quantity is $\epsilon_{t,i} \triangleq \sup_{x \in \mathcal{X},\, s \in [-t,t]} |f_i(x) - f_i(x + se_i)|$. [sent-135, score-0.772]
57 4.2 Main theorem Our main theorem bounds the error in estimating each norm $\|f_i\|_{1,\mu}$ with $W_i$. [sent-138, score-0.566]
58 The theorem bounds $\big| W_i - \|f_i\|_{1,\mu} \big|$ for all $i \in [d]$ in terms of d, t, n, and h; the bound suggests setting t in the order of h or larger. [sent-145, score-0.499]
59 We need t to be small in order for $\mu(\partial_{t,i}(\mathcal{X}))$ and $\epsilon_{t,i}$ to be small, but t needs to be sufficiently large (relative to h) for the estimates $f_{n,h}(X + te_i)$ and $f_{n,h}(X - te_i)$ to differ sufficiently so as to capture the variation in f along $e_i$. [sent-146, score-0.891]
60 4.3 Proof of Theorem 1 The main difficulty in bounding $|W_i - \|f_i\|_{1,\mu}|$ is in circumventing certain dependencies: both quantities $f_{n,h}(X)$ and $A_{n,i}(X)$ depend not just on $X \in \mathbf{X}$, but on other samples in $\mathbf{X}$, and thus introduce inter-dependencies between the estimates $\Delta_{t,i} f_{n,h}(X)$ for different $X \in \mathbf{X}$. [sent-150, score-0.522]
61 To handle these dependencies, we carefully decompose $|W_i - \|f_i\|_{1,\mu}|$, $i \in [d]$, starting with: $|W_i - \|f_i\|_{1,\mu}| \le \big| W_i - \mathbb{E}_n |f_i(X)| \big| + \big| \mathbb{E}_n |f_i(X)| - \|f_i\|_{1,\mu} \big|$. (3) [sent-151, score-1.497]
62 The following simple lemma bounds the second term of (3). [sent-152, score-0.069]
63 With probability at least $1 - \delta$, we have for all $i \in [d]$: $\big| \mathbb{E}_n |f_i(X)| - \|f_i\|_{1,\mu} \big| \le |f_i|_{\sup} \cdot \sqrt{\ln(2d/\delta)/2n}$. [sent-154, score-0.543]
64 The main technicality in this lemma is that, for any X in the sample $\mathbf{X}$, the event $A_{n,i}(X)$ depends on other samples in $\mathbf{X}$. [sent-161, score-0.111]
65 To this end we need to bring in f through the following quantities: $\widetilde{W}_i \triangleq \mathbb{E}_n \Big[ \frac{|f(X + te_i) - f(X - te_i)|}{2t} \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \Big] = \mathbb{E}_n \big[ \Delta_{t,i} f(X) \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \big]$, and for any $x \in \mathcal{X}$, define $\tilde{f}_{n,h}(x) \triangleq \mathbb{E}_{Y|X} f_{n,h}(x) = \sum_i w_i(x) f(X_i)$. [sent-168, score-1.09]
66 We have $f(x + te_i) - f(x - te_i) = \int_{-t}^{t} f_i(x + se_i)\,ds$, and therefore $2t\,(f_i(x) - \epsilon_{t,i}) \le f(x + te_i) - f(x - te_i) \le 2t\,(f_i(x) + \epsilon_{t,i})$. [sent-178, score-2.032]
67 It follows that $\big| \frac{1}{2t} |f(x + te_i) - f(x - te_i)| - |f_i(x)| \big| \le \epsilon_{t,i}$, and therefore $\big| \widetilde{W}_i - \mathbb{E}_n |f_i(X)| \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \big| \le \epsilon_{t,i}$. [sent-179, score-0.42]
68 We have $2t\,|W_i - \widetilde{W}_i| = 2t\,\big| \mathbb{E}_n \big( \Delta_{t,i} f_{n,h}(X) - \Delta_{t,i} f(X) \big) \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \big| \le 2 \max_{s \in \{-t,t\}} \mathbb{E}_n \big| f_{n,h}(X + se_i) - f(X + se_i) \big| \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \le 2 \max_{s \in \{-t,t\}} \mathbb{E}_n \big| f_{n,h}(X + se_i) - \tilde{f}_{n,h}(X + se_i) \big| \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \; (5) \; + \; 2 \max_{s \in \{-t,t\}} \mathbb{E}_n \big| \tilde{f}_{n,h}(X + se_i) - f(X + se_i) \big| \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \; (6). [sent-181, score-1.638]
69 We first handle the bias term (6) in the next lemma, which is given in the appendix. [sent-182, score-0.104]
70 We have for all $i \in [d]$ and all $s \in \{t, -t\}$: $\mathbb{E}_n \big| \tilde{f}_{n,h}(X + se_i) - f(X + se_i) \big| \cdot \mathbf{1}_{\{A_{n,i}(X)\}} \le h \cdot \max_{i \in [d]} |f_i|_{\sup}$. [sent-185, score-0.546]
71 The variance term in (5) is handled in the lemma below. [sent-186, score-0.101]
72 There exists $C = C(\mu, K(\cdot))$ such that, with probability at least $1 - 2\delta$, for all $i \in [d]$ and all $s \in \{-t, t\}$, the variance term $\mathbb{E}_n \big| f_{n,h}(X + se_i) - \tilde{f}_{n,h}(X + se_i) \big| \cdot \mathbf{1}_{\{A_{n,i}(X)\}}$ is bounded in terms of $C d \cdot \log(n/\delta)$, $C_Y^2(\delta/2n)$, and $\sigma_Y^2$. [sent-189, score-0.546]
73 5 Experiments 5.1 Data description We present experiments on several real-world regression datasets. [sent-199, score-0.081]
74 The input points are 21-dimensional and correspond to samples of the positions, velocities, and accelerations of the 7 joints. [sent-201, score-0.071]
75 The output points correspond to the torque of each joint. [sent-202, score-0.071]
[Table: average error and prediction time for KR vs. KR-ρ and k-NN vs. k-NN-ρ on each dataset.]
[Figure 2: error against the number of training points; panels include (a) SARCOS, joint 7, with KR and (b) Ailerons with KR; curves compare k-NN error and k-NN-ρ error.]
80 Results are shown for k-NN and kernel regression (KR), with and without the metric ρ. [sent-403, score-0.228]
81 correspond to different regression problems and are the only results reported. [sent-404, score-0.081]
82 The concrete strength dataset (Concrete Strength) contains 8-dimensional input points describing the age and ingredients of concrete; the output points are the compressive strength. [sent-407, score-0.187]
83 The wine quality dataset (Wine Quality) contains 11-dimensional input points corresponding to the physicochemistry of wine samples; the output points are the wine quality. [sent-408, score-0.47]
84 The ailerons dataset (Ailerons) is taken from the problem of flying an F16 aircraft. [sent-409, score-0.167]
85 The 5-dimensional input points describe the status of the aeroplane, while the goal is to predict the control action on the ailerons of the aircraft. [sent-410, score-0.219]
86 The housing dataset (Housing) concerns the task of predicting housing values in areas of Boston; the input points are 13-dimensional. [sent-411, score-0.236]
87 We also consider a telecommunication problem (Telecom), wherein the 47-dimensional input points and the output points describe the bandwidth usage in a network. [sent-413, score-0.157]
88 For all datasets we normalize each coordinate with its standard deviation from the training data. [sent-414, score-0.086]
89 The event $A_{n,i}(X)$ is set to reject the gradient estimate $\Delta_{t,i} f_{n,h}(X)$ at X if no sample contributed to one of the estimates $f_{n,h}(X \pm te_i)$. [sent-418, score-0.511]
90 In each experiment, we compare kernel regression in the Euclidean metric space (KR) and in the learned metric space (KR-ρ), where we use a box kernel for both. [sent-419, score-0.429]
91 The parameter k in k-NN/k-NN-ρ, and the bandwidth in KR/KR-ρ are learned by cross-validation on half of the training points. [sent-422, score-0.102]
92 We use 1000 training points in the robotic datasets, 2000 training points in the Telecom, Parkinson’s, Wine Quality, and Ailerons datasets, 730 training points in Concrete Strength, and 300 in Housing. [sent-428, score-0.274]
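A sketch (ours, under the stated protocol) of the preprocessing and tuning steps described above: normalize each coordinate by its training standard deviation and pick the bandwidth by validation on half of the training points; `predict` stands for any kernel regressor such as the box-kernel sketch given earlier, with or without the learned weights.

```python
import numpy as np

def normalize_by_train_std(X_train, X_test):
    """Scale each coordinate by its standard deviation on the training data."""
    sd = X_train.std(axis=0) + 1e-12
    return X_train / sd, X_test / sd

def select_bandwidth(X, Y, predict, candidates):
    """Validate on half of the training points and return the best bandwidth."""
    m = len(X) // 2
    Xtr, Ytr, Xva, Yva = X[:m], Y[:m], X[m:], Y[m:]
    errors = []
    for h in candidates:
        preds = np.array([predict(x, Xtr, Ytr, h) for x in Xva])
        errors.append(np.mean((preds - Yva) ** 2))
    return candidates[int(np.argmin(errors))]
```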
93 5.3 Discussion of results From the results in Table 5 we see that on virtually all datasets the metric helps improve the performance of the distance-based regressor, even though we did not tune t to the particular problem (remember t = h/2 for all experiments). [sent-433, score-0.158]
94 The rest of the results in Figure 2, where we vary n, are self-explanatory: gradient weighting clearly improves the performance of the distance-based regressors. [sent-438, score-0.091]
95 Last, remember that the metric can be learned online at the cost of only 2d times the average kernel estimation time reported. [sent-441, score-0.233]
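The online claim above can be sketched as a running average (ours, not the authors' code): each incoming point triggers the 2d pilot evaluations $f_{n,h}(X \pm t e_i)$ needed to refresh the weights, i.e., roughly 2d times the cost of one kernel estimate.

```python
import numpy as np

class OnlineGradientWeights:
    """Running-average estimate of the weights W_i from streaming points."""

    def __init__(self, d, t):
        self.W = np.zeros(d)
        self.t = t
        self.count = 0

    def update(self, f_hat, x):
        """Incorporate one point x using 2*d evaluations of the pilot estimator f_hat."""
        d = len(self.W)
        g = np.empty(d)
        for i in range(d):
            e = np.zeros(d)
            e[i] = self.t
            g[i] = abs(f_hat(x + e) - f_hat(x - e)) / (2.0 * self.t)
        self.count += 1
        self.W += (g - self.W) / self.count   # incremental mean update
        return self.W
```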
96 6 Final remarks Gradient weighting is simple to implement, computationally efficient in batch-mode and online, and most importantly improves the performance of distance-based regressors on real-world applications. [sent-442, score-0.113]
97 In our experiments, most or all coordinates of the data are relevant, yet some coordinates are more important than others. [sent-443, score-0.158]
98 This is sufficient for gradient weighting to yield gains in performance. [sent-444, score-0.087]
99 Learning distance metric for regression by semidefinite programming with application to human age estimation. [sent-451, score-0.217]
100 On robust kernel estimation of derivatives of regression functions. [sent-466, score-0.128]
wordName wordTfidf (topN-words)
[('fi', 0.499), ('tei', 0.42), ('en', 0.318), ('sei', 0.273), ('wi', 0.25), ('kr', 0.207), ('telecom', 0.189), ('ailerons', 0.167), ('sarcos', 0.167), ('wine', 0.113), ('parkinson', 0.111), ('metric', 0.1), ('housing', 0.092), ('sup', 0.085), ('regression', 0.081), ('coordinates', 0.079), ('barrett', 0.076), ('regressors', 0.073), ('lemma', 0.069), ('cy', 0.068), ('nn', 0.066), ('concrete', 0.064), ('bandwidth', 0.053), ('points', 0.052), ('estimator', 0.049), ('kernel', 0.047), ('error', 0.047), ('mass', 0.047), ('ln', 0.044), ('duy', 0.042), ('weighting', 0.04), ('online', 0.038), ('derivative', 0.037), ('planck', 0.036), ('age', 0.036), ('strength', 0.035), ('bias', 0.035), ('euclidean', 0.035), ('joints', 0.034), ('samory', 0.034), ('cd', 0.033), ('boularias', 0.032), ('variance', 0.032), ('joint', 0.032), ('ball', 0.032), ('varies', 0.031), ('tune', 0.031), ('fn', 0.031), ('training', 0.03), ('coordinate', 0.029), ('remember', 0.029), ('maxi', 0.029), ('ei', 0.028), ('nonparametric', 0.028), ('robotic', 0.028), ('supx', 0.028), ('vary', 0.028), ('meant', 0.027), ('rd', 0.027), ('datasets', 0.027), ('quality', 0.027), ('hd', 0.026), ('balls', 0.026), ('marginal', 0.025), ('weights', 0.024), ('gains', 0.024), ('consistency', 0.024), ('everywhere', 0.023), ('estimates', 0.023), ('gradient', 0.023), ('event', 0.023), ('diameter', 0.022), ('prediction', 0.022), ('estimate', 0.022), ('jan', 0.021), ('robot', 0.021), ('mini', 0.021), ('averages', 0.02), ('norm', 0.02), ('batch', 0.02), ('robotics', 0.02), ('behaves', 0.02), ('preprocessing', 0.019), ('grid', 0.019), ('learned', 0.019), ('unknown', 0.019), ('lemmas', 0.019), ('accelerations', 0.019), ('torque', 0.019), ('clinician', 0.019), ('expectedly', 0.019), ('liated', 0.019), ('depencies', 0.019), ('hurdles', 0.019), ('mosci', 0.019), ('symptom', 0.019), ('technicality', 0.019), ('villa', 0.019), ('uci', 0.018), ('might', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 145 nips-2012-Gradient Weights help Nonparametric Regressors
Author: Samory Kpotufe, Abdeslam Boularias
Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1
2 0.1158009 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
Author: Nicholas Ruozzi
Abstract: Sudderth, Wainwright, and Willsky conjectured that the Bethe approximation corresponding to any fixed point of the belief propagation algorithm over an attractive, pairwise binary graphical model provides a lower bound on the true partition function. In this work, we resolve this conjecture in the affirmative by demonstrating that, for any graphical model with binary variables whose potential functions (not necessarily pairwise) are all log-supermodular, the Bethe partition function always lower bounds the true partition function. The proof of this result follows from a new variant of the “four functions” theorem that may be of independent interest. 1
3 0.11284836 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)
Author: Gal Elidan, Cobi Cario
Abstract: The empirical success of the belief propagation approximate inference algorithm has inspired numerous theoretical and algorithmic advances. Yet, for continuous non-Gaussian domains performing belief propagation remains a challenging task: recent innovations such as nonparametric or kernel belief propagation, while useful, come with a substantial computational cost and offer little theoretical guarantees, even for tree structured models. In this work we present Nonparanormal BP for performing efficient inference on distributions parameterized by a Gaussian copulas network and any univariate marginals. For tree structured networks, our approach is guaranteed to be exact for this powerful class of non-Gaussian models. Importantly, the method is as efficient as standard Gaussian BP, and its convergence properties do not depend on the complexity of the univariate marginals, even when a nonparametric representation is used. 1
4 0.09540505 330 nips-2012-Supervised Learning with Similarity Functions
Author: Purushottam Kar, Prateek Jain
Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1
5 0.084514968 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data
Author: Michael C. Hughes, Erik B. Sudderth, Emily B. Fox
Abstract: Applications of Bayesian nonparametric methods require learning and inference algorithms which efficiently explore models of unbounded complexity. We develop new Markov chain Monte Carlo methods for the beta process hidden Markov model (BP-HMM), enabling discovery of shared activity patterns in large video and motion capture databases. By introducing split-merge moves based on sequential allocation, we allow large global changes in the shared feature structure. We also develop data-driven reversible jump moves which more reliably discover rare or unique behaviors. Our proposals apply to any choice of conjugate likelihood for observed data, and we show success with multinomial, Gaussian, and autoregressive emission models. Together, these innovations allow tractable analysis of hundreds of time series, where previous inference required clever initialization and lengthy burn-in periods for just six sequences. 1
6 0.084132344 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
7 0.080395021 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means
8 0.078790158 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
9 0.078516632 179 nips-2012-Learning Manifolds with K-Means and K-Flats
10 0.078471139 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
11 0.070992768 271 nips-2012-Pointwise Tracking the Optimal Regression Function
12 0.066654488 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
13 0.065459393 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
14 0.064785786 221 nips-2012-Multi-Stage Multi-Task Feature Learning
15 0.061612319 9 nips-2012-A Geometric take on Metric Learning
16 0.059321854 205 nips-2012-MCMC for continuous-time discrete-state systems
17 0.058354329 227 nips-2012-Multiclass Learning with Simplex Coding
18 0.057419945 16 nips-2012-A Polynomial-time Form of Robust Regression
19 0.056656897 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities
20 0.055461109 200 nips-2012-Local Supervised Learning through Space Partitioning
topicId topicWeight
[(0, 0.149), (1, 0.019), (2, 0.039), (3, -0.043), (4, 0.039), (5, 0.018), (6, 0.003), (7, 0.093), (8, 0.003), (9, -0.036), (10, -0.013), (11, -0.04), (12, 0.027), (13, -0.051), (14, -0.064), (15, -0.088), (16, 0.007), (17, 0.007), (18, 0.053), (19, 0.038), (20, 0.023), (21, -0.073), (22, -0.002), (23, -0.0), (24, -0.023), (25, 0.074), (26, 0.044), (27, 0.096), (28, 0.023), (29, 0.008), (30, -0.006), (31, -0.092), (32, 0.046), (33, 0.025), (34, -0.037), (35, 0.028), (36, -0.053), (37, -0.071), (38, -0.091), (39, 0.004), (40, -0.003), (41, 0.029), (42, -0.067), (43, -0.098), (44, -0.085), (45, 0.192), (46, 0.007), (47, 0.111), (48, 0.039), (49, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.93567324 145 nips-2012-Gradient Weights help Nonparametric Regressors
Author: Samory Kpotufe, Abdeslam Boularias
Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1
2 0.63800263 330 nips-2012-Supervised Learning with Similarity Functions
Author: Purushottam Kar, Prateek Jain
Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1
3 0.55198807 338 nips-2012-The Perturbed Variation
Author: Maayan Harel, Shie Mannor
Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of this procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1
4 0.54253751 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)
Author: Gal Elidan, Cobi Cario
Abstract: The empirical success of the belief propagation approximate inference algorithm has inspired numerous theoretical and algorithmic advances. Yet, for continuous non-Gaussian domains performing belief propagation remains a challenging task: recent innovations such as nonparametric or kernel belief propagation, while useful, come with a substantial computational cost and offer little theoretical guarantees, even for tree structured models. In this work we present Nonparanormal BP for performing efficient inference on distributions parameterized by a Gaussian copulas network and any univariate marginals. For tree structured networks, our approach is guaranteed to be exact for this powerful class of non-Gaussian models. Importantly, the method is as efficient as standard Gaussian BP, and its convergence properties do not depend on the complexity of the univariate marginals, even when a nonparametric representation is used. 1
5 0.51536173 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support
Author: Florian T. Pokorny, Hedvig Kjellström, Danica Kragic, Carl Ek
Abstract: We present a novel method for learning densities with bounded support which enables us to incorporate ‘hard’ topological constraints. In particular, we show how emerging techniques from computational algebraic topology and the notion of persistent homology can be combined with kernel-based methods from machine learning for the purpose of density estimation. The proposed formalism facilitates learning of models with bounded support in a principled way, and – by incorporating persistent homology techniques in our approach – we are able to encode algebraic-topological constraints which are not addressed in current state of the art probabilistic models. We study the behaviour of our method on two synthetic examples for various sample sizes and exemplify the benefits of the proposed approach on a real-world dataset by learning a motion model for a race car. We show how to learn a model which respects the underlying topological structure of the racetrack, constraining the trajectories of the car. 1
6 0.49952927 221 nips-2012-Multi-Stage Multi-Task Feature Learning
7 0.49824291 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
8 0.49795818 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
9 0.49094987 271 nips-2012-Pointwise Tracking the Optimal Regression Function
10 0.46965438 179 nips-2012-Learning Manifolds with K-Means and K-Flats
11 0.45835057 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
12 0.44926941 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
13 0.44867411 95 nips-2012-Density-Difference Estimation
14 0.44792399 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
15 0.43686906 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
16 0.43682078 79 nips-2012-Compressive neural representation of sparse, high-dimensional probabilities
17 0.42665279 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means
18 0.42226398 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
19 0.42050681 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests
20 0.41881213 16 nips-2012-A Polynomial-time Form of Robust Regression
topicId topicWeight
[(0, 0.029), (21, 0.034), (38, 0.14), (39, 0.012), (42, 0.03), (49, 0.293), (54, 0.014), (55, 0.019), (60, 0.012), (74, 0.035), (76, 0.157), (80, 0.088), (92, 0.037)]
simIndex simValue paperId paperTitle
1 0.78088444 231 nips-2012-Multiple Operator-valued Kernel Learning
Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach
Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1
same-paper 2 0.76215756 145 nips-2012-Gradient Weights help Nonparametric Regressors
Author: Samory Kpotufe, Abdeslam Boularias
Abstract: In regression problems over Rd , the unknown function f often varies more in some coordinates than in others. We show that weighting each coordinate i with the estimated norm of the ith derivative of f is an efficient way to significantly improve the performance of distance-based regressors, e.g. kernel and k-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online. 1
3 0.75156683 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery
Author: Ehsan Elhamifar, Guillermo Sapiro, René Vidal
Abstract: Given pairwise dissimilarities between data points, we consider the problem of finding a subset of data points, called representatives or exemplars, that can efficiently describe the data collection. We formulate the problem as a row-sparsity regularized trace minimization problem that can be solved efficiently using convex programming. The solution of the proposed optimization program finds the representatives and the probability that each data point is associated with each one of the representatives. We obtain the range of the regularization parameter for which the solution of the proposed optimization program changes from selecting one representative for all data points to selecting all data points as representatives. When data points are distributed around multiple clusters according to the dissimilarities, we show that the data points in each cluster select representatives only from that cluster. Unlike metric-based methods, our algorithm can be applied to dissimilarities that are asymmetric or violate the triangle inequality, i.e., it does not require that the pairwise dissimilarities come from a metric. We demonstrate the effectiveness of the proposed algorithm on synthetic data as well as real-world image and text data. 1
4 0.68841302 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz
Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1
5 0.62824988 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
Author: Ke Jiang, Brian Kulis, Michael I. Jordan
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1
6 0.6260643 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
7 0.62562144 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
8 0.62533629 179 nips-2012-Learning Manifolds with K-Means and K-Flats
9 0.62517059 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
10 0.62515008 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
11 0.6250217 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
12 0.6248486 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
13 0.62464142 147 nips-2012-Graphical Models via Generalized Linear Models
14 0.62449688 319 nips-2012-Sparse Prediction with the $k$-Support Norm
15 0.62418598 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
16 0.62414372 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
17 0.62367165 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
18 0.62363422 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
19 0.62357938 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
20 0.62329525 16 nips-2012-A Polynomial-time Form of Robust Regression