nips nips2006 nips2006-56 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: We1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
Reference: text
sentIndex sentText sentNum sentScore
1 CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. [sent-10, score-0.754]
2 This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. [sent-11, score-0.254]
3 For boolean (0/1) data, CRS is provably better than random projections. [sent-12, score-0.138]
4 We show using real-world data that CRS often outperforms random projections. [sent-13, score-0.109]
5 1 Introduction Conditional Random Sampling (CRS) is a sketch-based sampling technique that effectively exploits data sparsity. [sent-15, score-0.089]
6 Also, in heavy-tailed data, the estimation errors of random sampling could be very large. [sent-22, score-0.164]
7 As alternatives to random sampling, various sketching algorithms have become popular, e. [sent-23, score-0.189]
8 For a specific task, a sketching algorithm often outperforms random sampling. [sent-27, score-0.223]
9 On the other hand, random sampling is much more flexible. [sent-28, score-0.118]
10 For example, we can use the same set of random samples to estimate any lp pairwise distances and multi-way associations. [sent-29, score-0.278]
11 Conditional Random Sampling (CRS) combines the advantages of both sketching and random sampling. [sent-30, score-0.189]
12 For data at Web scale, computing pairwise distances exactly is often too time-consuming or even infeasible. [sent-36, score-0.142]
13 Computing all pairwise associations AA^T, also called the Gram matrix in machine learning, costs O(n^2 D), which could be daunting for large n and D. [sent-41, score-0.108]
14 Various sampling methods have been proposed for approximating the Gram matrix and kernels [2, 8]. [sent-42, score-0.144]
15 For example, using (normal) random projections [17], we approximate AA^T by (AR)(AR)^T, where the entries of R ∈ R^{D×k} are i.i.d. N(0, 1). [sent-43, score-0.257]
16 Conditional Random Sampling (CRS) can be applied to estimating pairwise distances (in any norm) as well as multi-way associations. [sent-53, score-0.176]
17 While this paper focuses on estimating pairwise l2 and l1 distances and inner products, we refer readers to the technical report [13] for estimating joint histograms. [sent-55, score-0.299]
18 Our early work [11, 12] concerned estimating two-way and multi-way associations in boolean (0/1) data. [sent-56, score-0.125]
19 We will compare CRS with normal random projections for approximating l2 distances and inner products, and with Cauchy random projections for approximating l1 distances. [sent-57, score-0.795]
20 In boolean data, CRS bears some similarity to Broder’s sketches [6] with some important distinctions. [sent-58, score-0.326]
21 [12] showed that in boolean data, CRS improves Broder’s sketches by roughly halving the estimation variances. [sent-59, score-0.346]
22 In the sketching stage, we scan the data matrix once and store a fraction of the non-zero elements in each data point, as “sketches.” [sent-61, score-0.228]
23 In the estimation stage, we generate conditional random samples online pairwise (for two-way) or groupwise (for multi-way); hence we name our algorithm Conditional Random Sampling (CRS). [sent-62, score-0.211]
24 1 The Sampling/Sketching Procedure. Figure 1: A global view of the sketching stage; panels show (a) Original, (b) Permuted, (c) Postings, (d) Sketches. [sent-64, score-0.137]
25 Figure 1 provides a global view of the sketching stage. [sent-65, score-0.137]
26 The columns of a sparse data matrix (a) are first randomly permuted (b). [sent-66, score-0.139]
27 Then only the non-zero entries are considered, called postings (c). [sent-67, score-0.135]
28 If the column IDs are random, the first Ds = 10 columns constitute a random sample. [sent-71, score-0.076]
29 (c): Sketches are the first ki entries of the postings, sorted ascending by IDs. [sent-74, score-0.184]
30 Excluding the posting 11(3) (ID 11, value 3) in K2, we obtain the same samples as if we directly sampled the first Ds = 10 columns of the data matrix. [sent-76, score-0.092]
31 Apparently sketches are not uniformly random samples, which may make the estimation task difficult. [sent-77, score-0.331]
32 We show, in Figure 2, that sketches are almost random samples pairwise (or group-wise). [sent-78, score-0.404]
33 Figure 2(a) constructs conventional random samples from a data matrix; and we show one can generate (retrospectively) the same random samples from sketches in Figure 2(b)(c). [sent-79, score-0.496]
34 In Figure 2(a), when the columns are randomly permuted, we can construct random samples by simply taking the first Ds columns from the data matrix of D columns (Ds ≪ D in real applications). [sent-80, score-0.186]
35 For sparse data, we only store the non-zero elements in the form of tuples “ID (Value),” a structure called postings. [sent-81, score-0.091]
36 Figure 2(b) shows the postings for the same data matrix in Figure 2(a). [sent-83, score-0.15]
37 A sketch, Ki, of the postings Pi is the first ki entries, i.e., the ki entries of Pi with the smallest permuted IDs. [sent-85, score-0.164]
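To make the sketching stage concrete, the following is a minimal Python/NumPy sketch of how such sketches could be built; the function name, the 1-based permuted IDs, and the per-point sketch size k are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_sketch(row, perm, k):
    """Build one CRS sketch: the first k postings of a data point
    after the column IDs have been randomly permuted.

    row  : 1-D array of length D (one data point)
    perm : a random permutation of the column IDs, shared by all data points
    k    : sketch size for this data point
    """
    nz = np.flatnonzero(row)                                # original IDs of the non-zeros
    postings = sorted((int(perm[j]), row[j]) for j in nz)   # postings sorted by permuted ID
    return postings[:k]                                     # keep the k smallest permuted IDs
```

For a data matrix with D columns one would draw a single shared permutation such as perm = np.random.permutation(np.arange(1, D + 1)) (1-based, so that an ID doubles as a column count) and call build_sketch once per row, with k chosen roughly in proportion to that row's number of non-zeros.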
38 The central observation is that if we exclude all elements of the sketches whose IDs are larger than Ds = min(max(ID(K1)), max(ID(K2))), (1) we obtain exactly the same samples as if we directly sampled the first Ds columns from the data matrix in Figure 2(a). [sent-88, score-0.369]
39 This way, we convert sketches into random samples by conditioning on Ds, which differs from pair to pair and is not known beforehand. [sent-89, score-0.404]
40 After we construct the conditional random samples from sketches K1 and K2 with the effective sample size Ds, we can compute any distances (l2, l1, or inner products) from the samples and scale them by D/Ds to estimate the corresponding quantities in the original space. [sent-92, score-0.736]
41 We use ũ1,j and ũ2,j (j = 1 to Ds) to denote the conditional random samples (of size Ds) obtained by CRS. [sent-94, score-0.185]
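Continuing the illustrative sketch above, the estimation stage can be written as follows: compute the conditional sample size Ds of Eq. (1), discard sketch entries whose permuted IDs exceed Ds, and scale the sample statistics by D/Ds. The boundary handling and the simple scaling are simplifying assumptions for illustration, not the authors' exact estimators.

```python
def crs_estimate(K1, K2, D):
    """Estimate the inner product and the l2/l1 distances of two data points
    from their CRS sketches (lists of (permuted ID, value) pairs, IDs 1-based)."""
    # Conditional sample size, Eq. (1): the smaller of the two largest sketch IDs.
    Ds = min(K1[-1][0], K2[-1][0])

    # Keep only entries whose permuted IDs fall within the first Ds columns;
    # these form the conditional random sample of size Ds.
    s1 = {j: v for j, v in K1 if j <= Ds}
    s2 = {j: v for j, v in K2 if j <= Ds}
    ids = set(s1) | set(s2)

    inner = sum(s1.get(j, 0.0) * s2.get(j, 0.0) for j in ids)
    d2 = sum((s1.get(j, 0.0) - s2.get(j, 0.0)) ** 2 for j in ids)
    d1 = sum(abs(s1.get(j, 0.0) - s2.get(j, 0.0)) for j in ids)

    scale = D / Ds        # extrapolate from the Ds sampled columns to all D columns
    return scale * inner, scale * d2, scale * d1
```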
42 3 The Computational Cost. The sketching stage requires generating a random permutation mapping of length D and a linear scan of all the non-zeros. [sent-98, score-0.254]
43 Therefore, generating sketches for A ∈ Rn×D costs O(Σ fi) (summing over the n rows), where fi is the number of non-zeros in the ith row. [sent-99, score-0.331]
44 While the conditional sample size Ds might be large, the cost for estimating the distance between one pair of data points would be only O(k1 + k2 ) instead of O(Ds ). [sent-103, score-0.285]
45 3 The Theoretical Variance Analysis of CRS We give some theoretical analysis on the variances of CRS. [sent-104, score-0.077]
46 We similarly derive the variances for the estimators d̂(2)MF and d̂(1)MF. [sent-119, score-0.077]
47 The sparsity term max(f1, f2)/D reduces the variances significantly. [sent-124, score-0.115]
48 For example, if max(f1, f2)/D = 0.01, the variances can be reduced by a factor of 100, compared to conventional random coordinate sampling. [sent-126, score-0.149]
49 (Normal) Random projections [17] are widely used in learning and data mining [2–4]. [sent-128, score-0.202]
50 Random projections multiply the data matrix A ∈ Rn×D with a random matrix R ∈ RD×k to generate a compact representation B = AR ∈ Rn×k . [sent-129, score-0.29]
51 The entries of R are typically i.i.d. samples from N(0, 1); hence we call it normal random projections. [sent-133, score-0.108]
52 However, the recent impossibility result [5] has ruled out estimators that could be metrics for dimension reduction in l1 . [sent-138, score-0.107]
53 It is easy to show that the following linear estimators of the inner product a and the squared l2 distance d(2) are unbiased: âNRP,MF = (1/k) v1^T v2 and d̂(2)NRP,MF = (1/k) ||v1 − v2||^2 (11), with variances [15, 17] Var(âNRP,MF) = (1/k)(m1 m2 + a^2) and Var(d̂(2)NRP,MF) = (2/k)[d(2)]^2 (12). [sent-146, score-0.296]
54 Assuming that the margins m1 = ||u1||^2 and m2 = ||u2||^2 are known, [15] provides a maximum likelihood estimator, denoted by âNRP,MLE, whose (asymptotic) variance is Var(âNRP,MLE) = (1/k)(m1 m2 − a^2)^2/(m1 m2 + a^2) + O(k^−2). [sent-147, score-0.166]
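For comparison, here is a minimal NumPy sketch of normal random projections and the margin-free linear estimators of Eq. (11); the function and variable names are illustrative, and R simply has i.i.d. N(0, 1) entries as described above.

```python
import numpy as np

def normal_random_projections(A, k, seed=0):
    """Project the n x D data matrix A to n x k via B = AR,
    where R has i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((A.shape[1], k))
    return A @ R

def rp_estimates(v1, v2, k):
    """Margin-free linear estimators of the inner product a and the
    squared l2 distance d(2), as in Eq. (11)."""
    a_hat = float(v1 @ v2) / k
    d2_hat = float(np.sum((v1 - v2) ** 2)) / k
    return a_hat, d2_hat
```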
55 [9] proposed an estimator based on the absolute sample median. [sent-153, score-0.12]
56 Recently, [14] proposed a variety of nonlinear estimators, including a bias-corrected sample median estimator, a bias-corrected geometric mean estimator, and a bias-corrected maximum likelihood estimator. [sent-154, score-0.11]
57 3 General Stable Random Projections for Dimension Reduction in lp (0 < p ≤ 2) [10] generalized the bias-corrected geometric mean estimator to general stable random projections for dimension reduction in lp (0 < p ≤ 2), and provided the theoretical variances and exponential tail bounds. [sent-160, score-0.506]
58 Of course, CRS can also be applied to approximating any lp distances. [sent-161, score-0.099]
59 5 Improving CRS Using Marginal Information It is often reasonable to assume that we know the marginal information such as marginal l2 norms, numbers of non-zeros, or even marginal histograms. [sent-162, score-0.138]
60 In the boolean data case, we can express the MLE solution explicitly and derive a closed-form (asymptotic) variance. [sent-164, score-0.09]
61 1 Boolean (0/1) Data In 0/1 data, estimating the inner product becomes estimating a two-way contingency table, which has four cells. [sent-167, score-0.184]
62 6 Theoretical Comparisons of CRS With Random Projections As reflected by their variances, for general data types, whether CRS is better than random projections depends on two competing factors: data sparsity and data heavy-tailedness. [sent-187, score-0.338]
63 However, in the following two important scenarios, CRS outperforms random projections. [sent-188, score-0.086]
64 1 Boolean (0/1) data. In this case, the marginal l2 norms are the same as the numbers of non-zeros, i.e., m1 = f1 and m2 = f2. [sent-190, score-0.099]
65 Only when the two data points are nearly identical (i.e., a ≈ f2 ≈ f1) do random projections appear more accurate. [sent-198, score-0.231]
66 However, when this does occur, the absolute variances are so small (even zero) that their ratio does not matter. [sent-199, score-0.146]
67 Figure 3: The variance ratios, Var(âMF)/Var(âNRP,MF), show that CRS has smaller variances than random projections when no marginal information is used. [sent-237, score-0.206]
68 Figure 4: The ratios, Var(â0/1,MLE)/Var(âNRP,MLE), show that CRS usually has smaller variances than random projections, except when f1 ≈ f2 ≈ a. [sent-283, score-0.129]
69 Therefore, CRS will be much better for estimating inner products in nearly independent data. [sent-286, score-0.18]
70 Once we have obtained the inner products, we can infer the l2 distances easily by d(2) = m1 + m2 − 2a, since the margins, m1 and m2 , are easy to obtain exactly. [sent-287, score-0.183]
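As a small illustration of this identity, assuming an estimated inner product from either CRS or random projections, the exact margins can be combined with it as follows; the helper name is hypothetical.

```python
import numpy as np

def l2_from_margins(u1, u2, a_hat):
    """Infer the squared l2 distance from exact margins and an estimated
    inner product, using d(2) = m1 + m2 - 2a."""
    m1 = float(np.sum(u1 ** 2))   # exact margin of u1, cheap to compute
    m2 = float(np.sum(u2 ** 2))   # exact margin of u2
    return m1 + m2 - 2.0 * a_hat
```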
71 3 Comparing the Computational Efficiency. As previously mentioned, the cost of constructing sketches for A ∈ Rn×D would be O(nD) (or, more precisely, O(Σ fi) summed over the n rows). [sent-290, score-0.305]
72 The cost of (normal) random projections would be O(nDk), which can be reduced to O(nDk/3) using sparse random projections [1]. [sent-291, score-0.526]
73 Therefore, it is possible that CRS is considerably more efficient than random projections in the sampling stage. [sent-292, score-0.32]
74 2 In the estimation stage, CRS costs O(2k) to compute the sample distance for each pair. [sent-293, score-0.159]
75 7 Empirical Evaluations We compare CRS with random projections (RP) using real data, including n = 100 randomly sampled documents from the NSF data [7] (sparsity ≈ 1%), n = 100 documents from the NEWSGROUP data [4] (sparsity ≈ 1%), and one class of the COREL image data (n = 80, sparsity ≈ 5%). [sent-296, score-0.338]
76 We estimate all pairwise inner products, l1 and l2 distances, using both CRS and RP. [sent-297, score-0.137]
77 We compare the median errors and the percentage in which CRS does better than random projections. [sent-299, score-0.155]
78 In each panel, the dashed curve indicates that we sample each data point with equal sample size (k). [sent-301, score-0.193]
79 For CRS, we can adjust the sample size according to the sparsity, reflected by the solid curves. [sent-302, score-0.128]
80 Data in different groups are assigned different sample sizes for CRS. [sent-305, score-0.09]
81 For random projections, we use the average sample size. [sent-306, score-0.116]
82 For both NSF and NEWSGROUP data, CRS overwhelmingly outperforms RP for estimating inner products and l2 distances (both using the marginal information). [sent-307, score-0.385]
83 CRS also outperforms RP for approximating l1 and l2 distances (without using the margins). [sent-308, score-0.188]
84 For the COREL data, CRS still outperforms RP for approximating inner products and l2 distances (using the margins). [sent-309, score-0.406]
85 However, RP considerably outperforms CRS for approximating l1 distances and l2 distances (without using the margins). [sent-310, score-0.305]
86 Note that the COREL image data are not too sparse and are considerably more heavy-tailed than the NSF and NEWSGROUP data [13]. [sent-311, score-0.114]
87 Figure 5: NSF data. [sent-342, score-0.084]
88 Upper four panels: ratios (CRS over RP (random projections)) of the median absolute errors; values < 1 indicate that CRS does better. [sent-343, score-0.142]
89 Dashed curves correspond to fixed sample sizes while solid curves indicate that we (crudely) adjust sketch sizes in CRS according to data sparsity. [sent-346, score-0.189]
90 In this case, CRS is overwhelmingly better than RP for approximating inner products and l2 distances (both using margins). [sent-347, score-0.331]
91 8 Conclusion There are many applications of l1 and l2 distances on large sparse datasets. [sent-348, score-0.139]
92 We propose a new sketch-based method, Conditional Random Sampling (CRS), which is provably better than random projections, at least for the important special cases of boolean data and nearly independent data. [sent-349, score-0.161]
93 In general non-boolean data, CRS compares favorably, both theoretically and empirically, especially when we take advantage of the margins (which are easier to compute than distances). [sent-350, score-0.135]
94 2 [16] proposed very sparse random projections to reduce the cost O(nDk) down to O(n√D k). [sent-351, score-0.295]
95 Figure 6: NEWSGROUP data. [sent-379, score-0.114]
97 On the Nyström method for approximating a Gram matrix for improved kernel-based learning. [sent-454, score-0.107]
98 Very sparse stable random projections, estimators and tail bounds for stable random projections. [sent-462, score-0.267]
99 Conditional random sampling: A sketch-based sampling technique for sparse data. [sent-480, score-0.108]
100 Nonlinear estimators and tail bounds for dimension reduction in l1 using Cauchy random projections. [sent-486, score-0.156]
wordName wordTfidf (topN-words)
[('crs', 0.671), ('ds', 0.407), ('sketches', 0.259), ('var', 0.251), ('projections', 0.179), ('sketching', 0.137), ('margins', 0.135), ('postings', 0.109), ('distances', 0.094), ('inner', 0.089), ('variances', 0.077), ('mle', 0.073), ('rp', 0.072), ('boolean', 0.067), ('sampling', 0.066), ('cauchy', 0.066), ('sample', 0.064), ('church', 0.061), ('dmf', 0.061), ('approximating', 0.06), ('products', 0.057), ('distance', 0.057), ('random', 0.052), ('ratio', 0.051), ('corel', 0.048), ('pairwise', 0.048), ('conditional', 0.046), ('marginal', 0.046), ('median', 0.046), ('estimators', 0.046), ('ndk', 0.046), ('tuples', 0.046), ('newsgroup', 0.045), ('samples', 0.045), ('sparse', 0.045), ('le', 0.044), ('size', 0.042), ('ids', 0.04), ('lp', 0.039), ('sparsity', 0.038), ('estimator', 0.038), ('stage', 0.038), ('id', 0.036), ('li', 0.035), ('ut', 0.035), ('estimating', 0.034), ('stanford', 0.034), ('bivariate', 0.034), ('outperforms', 0.034), ('reduction', 0.034), ('variance', 0.031), ('percentage', 0.031), ('amf', 0.031), ('bingham', 0.031), ('broder', 0.031), ('overwhelmingly', 0.031), ('norms', 0.03), ('normal', 0.03), ('ki', 0.029), ('gram', 0.029), ('permuted', 0.029), ('sketch', 0.028), ('fi', 0.027), ('hastie', 0.027), ('max', 0.027), ('scan', 0.027), ('product', 0.027), ('impossibility', 0.027), ('pi', 0.026), ('entries', 0.026), ('ratios', 0.026), ('errors', 0.026), ('sizes', 0.026), ('nsf', 0.025), ('ar', 0.024), ('aat', 0.024), ('dhillon', 0.024), ('associations', 0.024), ('columns', 0.024), ('stable', 0.024), ('tail', 0.024), ('considerably', 0.023), ('data', 0.023), ('adjust', 0.022), ('chris', 0.021), ('dd', 0.021), ('ui', 0.021), ('estimation', 0.02), ('kdd', 0.02), ('ascending', 0.02), ('dm', 0.02), ('mf', 0.02), ('conventional', 0.02), ('microsoft', 0.019), ('cost', 0.019), ('asymptotic', 0.019), ('provably', 0.019), ('costs', 0.018), ('absolute', 0.018), ('matrix', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: We1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
2 0.15340489 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
3 0.06258323 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
4 0.062405616 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
Author: Hariharan Narayanan, Mikhail Belkin, Partha Niyogi
Abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary. keywords: Clustering, Semi-Supervised Learning 1
5 0.056391202 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
Author: Rey Ramírez, Jason Palmer, Scott Makeig, Bhaskar D. Rao, David P. Wipf
Abstract: The ill-posed nature of the MEG/EEG source localization problem requires the incorporation of prior assumptions when choosing an appropriate solution out of an infinite set of candidates. Bayesian methods are useful in this capacity because they allow these assumptions to be explicitly quantified. Recently, a number of empirical Bayesian approaches have been proposed that attempt a form of model selection by using the data to guide the search for an appropriate prior. While seemingly quite different in many respects, we apply a unifying framework based on automatic relevance determination (ARD) that elucidates various attributes of these methods and suggests directions for improvement. We also derive theoretical properties of this methodology related to convergence, local minima, and localization bias and explore connections with established algorithms. 1
6 0.056028966 157 nips-2006-PAC-Bayes Bounds for the Risk of the Majority Vote and the Variance of the Gibbs Classifier
7 0.050145134 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
8 0.0479904 158 nips-2006-PG-means: learning the number of clusters in data
9 0.047649544 147 nips-2006-Non-rigid point set registration: Coherent Point Drift
10 0.047577478 65 nips-2006-Denoising and Dimension Reduction in Feature Space
11 0.038617779 20 nips-2006-Active learning for misspecified generalized linear models
12 0.037521709 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
13 0.036798771 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
14 0.03646617 50 nips-2006-Chained Boosting
15 0.035488192 121 nips-2006-Learning to be Bayesian without Supervision
16 0.034135439 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
17 0.033122137 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
18 0.032335363 57 nips-2006-Conditional mean field
19 0.031350203 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
20 0.030930864 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
topicId topicWeight
[(0, -0.114), (1, 0.036), (2, -0.018), (3, -0.008), (4, -0.011), (5, 0.047), (6, 0.003), (7, 0.001), (8, 0.021), (9, 0.037), (10, 0.047), (11, 0.013), (12, 0.001), (13, -0.015), (14, 0.021), (15, -0.097), (16, -0.007), (17, -0.099), (18, 0.032), (19, -0.043), (20, -0.005), (21, 0.008), (22, 0.048), (23, -0.087), (24, -0.082), (25, 0.083), (26, -0.034), (27, 0.146), (28, 0.038), (29, -0.028), (30, 0.013), (31, -0.143), (32, -0.224), (33, 0.051), (34, -0.183), (35, 0.008), (36, 0.064), (37, 0.014), (38, -0.05), (39, -0.023), (40, 0.012), (41, -0.03), (42, 0.158), (43, -0.029), (44, -0.08), (45, 0.014), (46, 0.015), (47, 0.119), (48, 0.156), (49, 0.1)]
simIndex simValue paperId paperTitle
same-paper 1 0.95012707 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: We1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
2 0.70769173 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
3 0.44678551 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
4 0.39321968 158 nips-2006-PG-means: learning the number of clusters in data
Author: Yu Feng, Greg Hamerly
Abstract: We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods. 1
5 0.37278551 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
Author: Jerónimo Arenas-garcía, Kaare B. Petersen, Lars K. Hansen
Abstract: In this paper we are presenting a novel multivariate analysis method. Our scheme is based on a novel kernel orthonormalized partial least squares (PLS) variant for feature extraction, imposing sparsity constrains in the solution to improve scalability. The algorithm is tested on a benchmark of UCI data sets, and on the analysis of integrated short-time music features for genre prediction. The upshot is that the method has strong expressive power even with rather few features, is clearly outperforming the ordinary kernel PLS, and therefore is an appealing method for feature extraction of labelled data. 1
6 0.36864343 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
7 0.36428764 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
8 0.35674134 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
9 0.35474819 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
10 0.33773807 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
11 0.33448666 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
12 0.32648289 109 nips-2006-Learnability and the doubling dimension
13 0.31150544 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
14 0.29576102 5 nips-2006-A Kernel Method for the Two-Sample-Problem
15 0.28759977 179 nips-2006-Sparse Representation for Signal Classification
16 0.28393289 121 nips-2006-Learning to be Bayesian without Supervision
17 0.26963362 75 nips-2006-Efficient sparse coding algorithms
18 0.26293239 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
19 0.25457719 147 nips-2006-Non-rigid point set registration: Coherent Point Drift
20 0.25040334 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
topicId topicWeight
[(1, 0.099), (3, 0.031), (7, 0.059), (9, 0.043), (20, 0.025), (22, 0.068), (44, 0.057), (57, 0.041), (65, 0.055), (69, 0.027), (71, 0.016), (82, 0.359)]
simIndex simValue paperId paperTitle
1 0.85841537 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
Author: Gediminas Lukšys, Jérémie Knüsel, Denis Sheynikhovich, Carmen Sandi, Wulfram Gerstner
Abstract: Stress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. In reinforcement learning (RL) models, meta-parameters such as learning rate, future reward discount factor, and exploitation-exploration factor, control learning dynamics and performance. They are hypothesized to be related to neuromodulatory levels in the brain. We found that many aspects of animal learning and performance can be described by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with C57BL/6 and DBA/2 mouse strains. The animals were exposed to different kinds of stress to evaluate its effects on immediate performance as well as on long-term memory. Then, we used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best fit between the model and the animal performance. The dynamics of several estimated meta-parameters were qualitatively similar for the two simulated experiments, and with statistically significant differences between different genetic strains and stress conditions. 1
same-paper 2 0.7142939 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: We1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
3 0.60349727 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
Author: Thomas Voegtlin
Abstract: In biological neurons, the timing of a spike depends on the timing of synaptic currents, in a way that is classically described by the Phase Response Curve. This has implications for temporal coding: an action potential that arrives on a synapse has an implicit meaning, that depends on the position of the postsynaptic neuron on the firing cycle. Here we show that this implicit code can be used to perform computations. Using theta neurons, we derive a spike-timing dependent learning rule from an error criterion. We demonstrate how to train an auto-encoder neural network using this rule. 1
4 0.42375153 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
5 0.42275226 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
6 0.41651398 20 nips-2006-Active learning for misspecified generalized linear models
7 0.41630638 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
8 0.41592863 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
9 0.4149645 61 nips-2006-Convex Repeated Games and Fenchel Duality
10 0.4149172 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
11 0.41414908 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
12 0.41310301 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
13 0.4123463 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
14 0.41172332 175 nips-2006-Simplifying Mixture Models through Function Approximation
15 0.41125563 127 nips-2006-MLLE: Modified Locally Linear Embedding Using Multiple Weights
16 0.41043526 79 nips-2006-Fast Iterative Kernel PCA
17 0.41028655 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
18 0.41017392 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
19 0.4099988 117 nips-2006-Learning on Graph with Laplacian Regularization
20 0.40879065 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy