nips nips2011 nips2011-4 knowledge-graph by maker-knowledge-mining

4 nips-2011-A Convergence Analysis of Log-Linear Training


Source: pdf

Author: Simon Wiesler, Hermann Ney

Abstract: Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. [sent-10, score-0.265]

2 By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. [sent-12, score-0.506]

3 Furthermore, the conventional training of log-linear models is a strictly convex optimization problem. [sent-20, score-0.265]

4 Thus, the global optimum of the training criterion is unique and no other local optima exist. [sent-21, score-0.285]

5 Steepest descent and other gradient-based optimization algorithms are guaranteed to converge to the unique global optimum from any initialization. [sent-22, score-0.191]

6 Bound optimization algorithms, such as generalized iterative scaling (GIS) [4] and variants of GIS, have been used in earlier works. [sent-28, score-0.203]

7 So far, a rigorous mathematical analysis of the optimization problem encountered in the training of log-linear models has been missing. [sent-35, score-0.322]

8 From optimization theory it is known that the convergence rate of first-order optimization algorithms depends on the condition number of the Hessian matrix at the optimum. [sent-36, score-0.425]

9 The dependence of the convergence behavior on the condition number is strongest for steepest descent. [sent-37, score-0.373]

10 For high condition numbers, steepest descent is useless in practice [3, Chapter 9. [sent-38, score-0.193]
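
To make this concrete, here is a small NumPy illustration (our own, not code from the paper): exact-line-search steepest descent on a toy quadratic with a prescribed condition number κ; the per-step contraction of the optimality gap approaches the classical worst-case factor ((κ − 1)/(κ + 1))², which is why large condition numbers make plain gradient descent impractically slow.

import numpy as np

def steepest_descent_gaps(H, b, x0, iters):
    # Exact-line-search steepest descent on f(x) = 0.5 x'Hx - b'x;
    # returns the optimality gap f(x_k) - f(x*) after every step.
    x_star = np.linalg.solve(H, b)
    f = lambda x: 0.5 * x @ H @ x - b @ x
    x, gaps = x0.copy(), []
    for _ in range(iters):
        g = H @ x - b                              # gradient
        x = x - ((g @ g) / (g @ H @ g)) * g        # exact line-search step
        gaps.append(f(x) - f(x_star))
    return np.array(gaps)

kappa = 1e3                                        # prescribed condition number (toy value)
H = np.diag(np.linspace(1.0, kappa, 20))           # eigenvalues spread from 1 to kappa
gaps = steepest_descent_gaps(H, np.ones(20), np.zeros(20), iters=300)
print("empirical contraction per step:", gaps[-1] / gaps[-2])
print("worst-case bound              :", ((kappa - 1) / (kappa + 1)) ** 2)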

11 It can be shown that more sophisticated gradient-based optimization algorithms such as CG and L-BFGS depend on the condition number as well [18, Chapter 5. [sent-40, score-0.196]

12 Aside from numerical issues, the convergence behavior of Newton’s method is completely independent of the condition number. [sent-43, score-0.275]

13 In this paper, we derive an estimate for the condition number of the objective function used for training of log-linear models. [sent-46, score-0.288]

14 One is a small digit recognition task, the other a large-scale continuous handwriting recognition task with real-life data. [sent-49, score-0.502]

15 The penalized maximum likelihood criterion is regarded as the natural training criterion for log-linear models. [sent-69, score-0.267]

16 Then the training criterion of log-linear models is an unconstrained optimization problem of the form Λ̂ = argminΛ∈Rd×C F(Λ), with F : Rd×C → R, Λ ↦ −(1/N) Σ_{n=1}^{N} log pΛ(cn|xn) + (α/2) ‖Λ‖₂²  (3). Here, F is the objective function, and α ≥ 0 the regularization constant. [sent-74, score-0.487]
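
A minimal NumPy sketch of this objective (our own illustration; the (d, C) parameter layout and the omission of a bias term are simplifying assumptions):

import numpy as np

def log_linear_objective(Lam, X, y, alpha):
    # F(Lam) = -1/N * sum_n log p_Lam(c_n | x_n) + alpha/2 * ||Lam||_2^2
    # Lam: (d, C) parameters, X: (N, d) features, y: (N,) class labels in {0, ..., C-1}.
    scores = X @ Lam                                   # class scores Lam_c^T x_n
    scores -= scores.max(axis=1, keepdims=True)        # numerical stabilization
    log_Z = np.log(np.exp(scores).sum(axis=1))         # log partition function per sample
    log_post = scores[np.arange(len(y)), y] - log_Z    # log p_Lam(c_n | x_n)
    return -log_post.mean() + 0.5 * alpha * np.sum(Lam ** 2)

rng = np.random.default_rng(0)                         # tiny synthetic example (hypothetical data)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 3, size=100)
print(log_linear_objective(np.zeros((5, 3)), X, y, alpha=0.1))   # equals log C = log 3 at Lam = 0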

17 Recall that the condition number of a positive definite matrix A is the ratio of its largest and its smallest eigenvalues: κ(A) = λmax(A)/λmin(A); here, δ denotes the Kronecker delta. [sent-77, score-0.204]

18 In earlier works, e.g. [16, 10], the optimization problem (3) has been solved with generalized iterative scaling (GIS) [4] or improved iterative scaling [10]. [sent-84, score-0.305]

19 Since then, it has been shown in several works that gradient-based optimization algorithms are far superior to iterative scaling methods. [sent-85, score-0.203]

20 Salakhutdinov [20] derived a convergence analysis for bound optimization algorithms such as GIS and showed that GIS converges extremely slowly when features are highly correlated and far from the origin. [sent-95, score-0.36]

21 Interestingly, we come to similar conclusions for the convergence behavior of log-linear training as LeCun et al. [sent-103, score-0.303]

22 We derive an estimate of the eigenvalues of the Hessian of log-linear training, which determine the convergence behavior of gradient-based optimization algorithms. [sent-107, score-0.544]

23 First, we calculate the eigenvalues of the Hessian in terms of the eigenvalues of the uncentered covariance matrix. [sent-108, score-0.662]

24 Our new Theorems 1 and 2 give lower and upper bounds for the condition number of the uncentered covariance matrix. [sent-109, score-0.231]

25 The analysis of the case with regularization is based on the analysis of the unregularized case. [sent-110, score-0.264]

26 The unregularized case: Let Λ∗ be the limit of the optimization algorithm applied to problem (3) without regularization (α = 0). [sent-112, score-0.181]

27 The Hessian matrix of the objective function at the optimum depends on the posterior probabilities pΛ∗ (c|x), which are of course unknown. [sent-113, score-0.2]

28 We derive the eigenvalues of the Hessian at Λ0 = 0. [sent-115, score-0.263]

29 If the quadratic approximation of F at Λ0 is good, the Hessian does not change strongly from Λ0 to Λ∗ , and the eigenvalues of HF (Λ0 ) are close to those of HF (Λ∗ ). [sent-116, score-0.263]

30 This enables us to draw conclusions about the convergence behavior of gradient-based optimization algorithms. [sent-117, score-0.281]

31 The matrix X ∈ Rd×d is the uncentered covariance matrix: X = (1/N) Σ_{n=1}^{N} xn xnᵀ. [sent-128, score-0.214]

32 The eigenvalues of S can be computed easily: µ1(S) = 0, µ2(S) = C⁻¹. [sent-129, score-0.263]

33 The eigenvalues of the Kronecker product S ⊗ X are of the form µi (S)µj (X) (see [8, Theorem 4. [sent-134, score-0.263]

34 Therefore, the spectrum of the Hessian is determined by the eigenvalues of X: σ(HF(Λ0)) = {0} ∪ {C⁻¹ µ1(X), . . . , C⁻¹ µd(X)}  (8). [sent-137, score-0.305]
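
A numerical check (our own): at Λ0 = 0 all class posteriors equal 1/C, so the Hessian of the unregularized loss factorizes as S ⊗ X; the explicit form S = C⁻¹I − C⁻²11ᵀ used below is our reconstruction, consistent with the eigenvalues µ1(S) = 0 and µ2(S) = C⁻¹ stated above.

import numpy as np

rng = np.random.default_rng(1)
N, d, C = 500, 4, 3
feats = rng.normal(size=(N, d)) + 0.5            # hypothetical feature vectors

X = feats.T @ feats / N                          # uncentered covariance matrix
S = np.eye(C) / C - np.ones((C, C)) / C ** 2     # diag(p) - p p^T at the uniform posterior p = 1/C
H0 = np.kron(S, X)                               # Hessian of the unregularized loss at Lam_0 = 0

mu_X = np.linalg.eigvalsh(X)
predicted = np.concatenate([np.zeros(d), np.repeat(mu_X / C, C - 1)])
print(np.allclose(np.sort(np.linalg.eigvalsh(H0)), np.sort(predicted)))   # True: spectrum is {0} u {mu_j(X)/C}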

35 A difficulty in the analysis of the unregularized case is that the objective function is only convex, but not strictly convex. [sent-141, score-0.254]

36 Thus, one of the eigenvalues of the Hessian at the optimum is zero and the condition number is not defined. [sent-145, score-0.491]

37 Intuitively, the convergence rate should not depend on the eigenvalue zero, since the objective function is constant in the direction of the corresponding eigenvectors. [sent-146, score-0.294]

38 The classic proof of the convergence rate of steepest descent for quadratic functions via the Kantorovich inequality (see [13, p. 218]) can be directly generalized to the singular case. [sent-147, score-0.226]

39 The convergence rate depends on the ratio of the largest and the smallest non-zero eigenvalue. [sent-148, score-0.237]

40 All results about the convergence behavior of conjugate gradient extend to the singular case, if instead of the complete spectrum only the non-zero eigenvalues are considered. [sent-151, score-0.529]

41 Therefore, Notay defines the condition number of a singular matrix as the ratio of its largest eigenvalue and its smallest non-zero eigenvalue. [sent-152, score-0.3]
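
A small helper (our own illustration) for this effective condition number of a singular positive semi-definite matrix, i.e. the ratio of its largest to its smallest non-zero eigenvalue:

import numpy as np

def effective_condition_number(A, tol=1e-10):
    # Ratio of the largest to the smallest non-zero eigenvalue of a symmetric PSD matrix A;
    # eigenvalues below a relative tolerance are treated as zero.
    eig = np.linalg.eigvalsh(A)
    nonzero = eig[eig > tol * eig.max()]
    return nonzero.max() / nonzero.min()

A = np.diag([0.0, 0.5, 2.0, 8.0])       # singular toy example
print(effective_condition_number(A))     # 16.0; for a positive definite matrix this is the usual kappa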

42 The eigenvalues of X: The dependence of the convergence behavior on the properties of X is in accordance with experimental observations. [sent-157, score-0.228]

43 Other researchers have noted before that the use of correlated features leads to slower convergence [21]. [sent-158, score-0.259]

44 Then the condition number of X = (1/N) Σ_{n=1}^{N} xn xnᵀ is bounded by max{σ1² + ‖µ‖₂², σd² + µd²} / min{σ2², σ1² + µ1²} ≤ κ(X) ≤ (σd² + ‖µ‖₂²) / σ1²  (11). [sent-173, score-0.251]
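
A quick NumPy check of the bound as reconstructed above (our own illustration; each piece of the bound follows from Weyl’s inequalities and from the diagonal-element bounds discussed in the following sentences):

import numpy as np

rng = np.random.default_rng(2)
d = 6
sigma = np.sort(rng.uniform(0.5, 3.0, size=d))      # sigma_1 <= ... <= sigma_d (feature standard deviations)
mu = rng.uniform(-2.0, 2.0, size=d)                 # feature means

# Uncorrelated features: the uncentered covariance matrix is diag(sigma^2) + mu mu^T.
X = np.diag(sigma ** 2) + np.outer(mu, mu)
eig = np.linalg.eigvalsh(X)
kappa = eig[-1] / eig[0]

m2 = mu @ mu                                        # ||mu||_2^2
lower = max(sigma[0]**2 + m2, sigma[-1]**2 + mu[-1]**2) / min(sigma[1]**2, sigma[0]**2 + mu[0]**2)
upper = (sigma[-1]**2 + m2) / sigma[0]**2
print(lower <= kappa <= upper)                      # True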

45 The eigenvalues of the sum of two Hermitian matrices can be estimated with Weyl’s inequalities (12) and (13). [sent-179, score-0.263]

46 The eigenvalues of A are the diagonal elements λj(A) = σj². [sent-182, score-0.263]

47 B is a rank-one matrix with the eigenvalues λd(B) = ‖µ‖₂² and λj(B) = 0 for 1 ≤ j ≤ d − 1. [sent-183, score-0.263]

48 For instance, the upper bound on the condition number follows from the application of (12) with j = k = d to the largest eigenvalue and (13) with j = k = 1 to the lowest eigenvalue. [sent-185, score-0.239]

49 The bound is sharpened by using the fact that every diagonal element of X is an upper bound for the smallest eigenvalue and a lower bound for the largest eigenvalue (see [9, p181]). [sent-187, score-0.301]

50 Analyzing the general case of correlated and unnormalized features is more difficult. [sent-188, score-0.182]

51 The proof uses Geršgorin’s circle theorem, which states that all eigenvalues lie in circles around the diagonal entries of the matrix. [sent-192, score-0.263]
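
A generic illustration of Geršgorin’s theorem for symmetric matrices (our own, not the paper’s specific bound): every eigenvalue lies in at least one interval centred at a diagonal entry with radius equal to the corresponding off-diagonal absolute row sum.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
A = (A + A.T) / 2                                   # symmetric test matrix (hypothetical)

diag = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(diag)    # Gershgorin radii R_j
eig = np.linalg.eigvalsh(A)

# Every eigenvalue lies in at least one interval [A_jj - R_j, A_jj + R_j].
print(all((np.abs(lam - diag) <= radii).any() for lam in eig))   # True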

52 Then, the largest and smallest eigenvalues of X = (1/N) Σ_{n=1}^{N} xn xnᵀ are bounded by σ1² − R1 ≤ λ1(X) ≤ min{σ1² + µ1², σd² + Rd}  (15)  and  max{σd² + µd², σ1² − R1 + ‖µ‖₂²} ≤ λd(X) ≤ σd² + Rd + ‖µ‖₂²  (16). [sent-199, score-0.45]

53 In contrast to Theorem 1, only the bounds for the eigenvalues of A obtained by Geršgorin’s theorem are known instead of the exact eigenvalues. [sent-201, score-0.308]

54 For strongly correlated features, the eigenvalues can be distributed almost arbitrarily according to the bounds (15) and (16). [sent-202, score-0.305]

55 Conversely, our analysis shows that log-linear training can be accelerated by decorrelating the features and normalizing their means and variances, i.e. by a whitening transformation. [sent-206, score-0.212]
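
A hedged sketch of such preprocessing (our own illustration; the exact transform used in the paper’s experiments may differ): estimate a mean shift and a whitening matrix on the training features and observe how the condition number of the uncentered covariance matrix drops toward one.

import numpy as np

def fit_whitening(feats, eps=1e-8):
    # Returns (mean, W) such that (x - mean) @ W has (approximately) identity covariance.
    mean = feats.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(feats - mean, rowvar=False))
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T   # ZCA-style whitening
    return mean, W

def cond_uncentered(f):
    # Condition number of X = (1/N) sum_n x_n x_n^T.
    e = np.linalg.eigvalsh(f.T @ f / len(f))
    return e[-1] / e[0]

rng = np.random.default_rng(4)
raw = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3)) + 5.0   # correlated, shifted features
mean, W = fit_whitening(raw)
white = (raw - mean) @ W
print(cond_uncentered(raw), ">>", cond_uncentered(white))          # whitening drives kappa(X) toward 1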

56 Since the Hessian of the ℓ2-regularization term is a multiple of the identity, the eigenvalues of the regularization term and the loss term can be added. [sent-213, score-0.343]

57 In the unregularized case, all non-zero eigenvalues depend on the eigenvalues of X. [sent-215, score-0.71]

58 In the regularized case, the eigenvalue zero changes to α, which is then the smallest non-zero eigenvalue of the Hessian. [sent-216, score-0.296]

59 Therefore, the condition number depends only on the largest eigenvalue of X: κ(HF(Λ0)) = (C⁻¹ µd(X) + α) / α  (18). [sent-217, score-0.239]

60 This shows that for large regularization parameters, the condition number is close to one and convergence is fast. [sent-218, score-0.303]
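
A quick numerical reading of (18) as reconstructed above (our own illustration): for fixed data, the condition number of the regularized Hessian at Λ0 falls toward one as α grows.

import numpy as np

rng = np.random.default_rng(5)
feats = rng.normal(size=(1000, 10)) + 1.0        # hypothetical features
C = 5                                            # number of classes
X = feats.T @ feats / len(feats)                 # uncentered covariance matrix
mu_max = np.linalg.eigvalsh(X)[-1]               # largest eigenvalue mu_d(X)

for alpha in (1e-4, 1e-2, 1.0):
    kappa = (mu_max / C + alpha) / alpha         # eq. (18) for the regularized objective at Lam_0
    print(f"alpha={alpha:g}  kappa={kappa:.1f}")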

61 At first glance, it seems paradoxical that a small modification of the objective function can change the convergence behavior completely. [sent-220, score-0.25]

62 But for a small regularization constant, the objective function has a very flat optimum instead of being constant in these directions. [sent-221, score-0.24]

63 On the other hand, the optimization is dominated by the unregularized part of the objective function. [sent-223, score-0.355]

64 Therefore, the iterates of the optimization algorithm will be close to an optimum of the unregularized objective function. [sent-224, score-0.445]

65 Since the regularization term is only small, the iterates already correspond to good models according to the objective function. [sent-225, score-0.191]

66 The first one is the well-known USPS task for handwritten digit recognition. [sent-227, score-0.197]

67 The second task, IAM, is a large-scale continuous handwriting recognition task with real-life data. [sent-228, score-0.378]

68 The columns “separation” and “termination” list the number of passes through the dataset until separation of the training data and until termination of the optimization algorithm, respectively. [sent-231, score-0.303]

69 Handwritten digit recognition: The USPS dataset consists of 7291 training and 2007 test images from ten classes of handwritten digits. [sent-249, score-0.233]

70 The optimization takes even longer when a very small non-zero regularization constant is used. [sent-260, score-0.181]

71 This is what we expected from the analysis of the condition number: the objective function has a very flat optimum, which slows down convergence. [sent-261, score-0.215]

72 On the other hand, for higher regularization parameters, the optimization is much faster. [sent-262, score-0.181]

73 We applied the normalizations only to the unregularized models, because the feature transformations affect the regularization term. [sent-263, score-0.316]

74 Often, the classification error on the training data reaches its minimum before the optimization algorithm terminates, so one might argue that it is not necessary to run the optimization until the termination criterion is reached. [sent-267, score-0.476]

75 The USPS training data is linearly separable, and for all unregularized trainings a zero classification error on the training set is reached. [sent-268, score-0.473]

76 Handwritten text recognition: Our second task is the IAM handwriting database [15]. [sent-271, score-0.354]

77 In contrast to USPS, where single images are classified into a small number of classes, IAM defines a continuous handwriting recognition task with unrestricted vocabulary, and is therefore much harder. [sent-272, score-0.378]

78 The training fold contains lines of handwritten text with 53k words in total. [sent-274, score-0.271]

79 The IAM database is a large-scale learning problem in the sense that it is not feasible to run the optimization until convergence [2] and the test error is strongly influenced by the optimization accuracy. [sent-277, score-0.384]

80 For our baseline model, we use the conventional generative approach of a statistical classifier based on hidden Markov models (HMMs) with Gaussian mixture models (GMMs) as emission probabilities. [sent-280, score-0.202]

81 The prior probability pθ(w1 . . . wN) is given by a smoothed trigram language model trained on the reference of the training data and the three additional text corpora Lancaster-Oslo-Bergen, Brown, and Wellington, as proposed in [1]. [sent-292, score-0.227]

82 We used only basic deslanting and size normalization for feature preprocessing, as is commonly done in handwriting recognition. [sent-299, score-0.307]

83 The recognition lexicon consists of the 50k most frequent words in the language model training data. [sent-302, score-0.267]

84 Hybrid LL/HMM recognition system: The main component of the visual model of our baseline system is the GMM for the emission probabilities pθ(x|s). [sent-308, score-0.211]

85 Analogous to the use of neural network outputs by [6], we build a hybrid LL/HMM recognition system by deriving the emission probabilities via pΛ (x|s) = pΛ (s|x)p(x)/p(s) . [sent-309, score-0.3]
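
A hedged sketch of this conversion (our own illustration; state priors p(s) are estimated from alignment counts, and p(x) is dropped because it does not affect HMM decoding; the paper’s exact scaling may differ):

import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-6):
    # Convert frame-wise log p(s|x) into scaled log-likelihoods log[p(x|s) / p(x)].
    # log_posteriors: (T, S) log-linear outputs per frame and HMM state,
    # state_counts:   (S,) state occupancies from the training alignment.
    log_prior = np.log(np.maximum(state_counts / state_counts.sum(), floor))
    return log_posteriors - log_prior            # log p(s|x) - log p(s) = log p(x|s) - log p(x)

rng = np.random.default_rng(6)                   # toy usage with hypothetical shapes
post = rng.dirichlet(np.ones(20), size=100)      # 100 frames, 20 HMM states
scores = scaled_log_likelihoods(np.log(post), state_counts=rng.integers(1, 500, size=20))
print(scores.shape)                              # (100, 20), fed to the HMM decoder in place of GMM scores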

86 Note that the training of the log-linear model is conceptually exactly the same as for USPS and our convergence analysis applies. [sent-317, score-0.294]

87 On large-scale tasks such as IAM, it is not practicable to run the optimization until convergence as on USPS. [sent-318, score-0.229]

88 In contrast to USPS, where the classification error on the training data without regularization was zero, on IAM, the state-classification error on the training data ranges from forty to sixty percent. [sent-323, score-0.326]

89 The first-order features are already decorrelated, but without mean and variance normalization, the convergence is slower, resulting in a worse WER on the development and test set. [sent-330, score-0.344]

90 This results in a huge degradation for the unnormalized features and – with exactly the same random initialization – has only a minor impact when normalized features are used. [sent-333, score-0.277]

91 This can be expected, since mean and variance take on more extreme values when the features are squared, and the features are correlated. [sent-335, score-0.261]

92 For the unnormalized features and random initialization, the optimization did not lead to a usable model (Table 2: results on the IAM database for polynomial feature spaces of degree m ∈ {1, 2, 3} with different initializations and preprocessings). [sent-337, score-0.352]

93 Fastest convergence and best results are obtained by the application of the whitening transformation to the features. [sent-372, score-0.29]

94 Discussion: In this paper, we presented a novel convergence analysis for the optimization of the parameters of log-linear models. [sent-391, score-0.229]

95 Our main results are, first, that the convergence of gradient-based optimization algorithms depends on the eigenvalues of the uncentered empirical covariance matrix. [sent-392, score-0.628]

96 Second, we analyzed the eigenvalues of the covariance matrix. [sent-394, score-0.307]

97 Best convergence behavior can be expected when, in addition, the features are decorrelated. [sent-396, score-0.269]

98 We verified our findings on two handwriting recognition tasks and found that the theoretical analysis predicted the observed convergence behavior very well. [sent-406, score-0.517]

99 On IAM, a real-life dataset for continuous handwriting recognition, our log-linear system outperforms other systems with comparable architecture and preprocessing. [sent-407, score-0.31]

100 This will be useful for very high-dimensional features for which the estimation of the whitening transformation is not feasible. [sent-412, score-0.251]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('iam', 0.314), ('wer', 0.285), ('eigenvalues', 0.263), ('handwriting', 0.259), ('unregularized', 0.184), ('usps', 0.164), ('hessian', 0.16), ('gis', 0.15), ('hf', 0.15), ('convergence', 0.128), ('training', 0.123), ('whitening', 0.123), ('aachen', 0.114), ('bunke', 0.114), ('cg', 0.113), ('handwritten', 0.11), ('optimization', 0.101), ('minka', 0.099), ('steepest', 0.098), ('eigenvalue', 0.096), ('condition', 0.095), ('uncentered', 0.092), ('optimum', 0.09), ('features', 0.089), ('bertolami', 0.086), ('gorin', 0.086), ('ney', 0.086), ('notay', 0.086), ('regularization', 0.08), ('termination', 0.079), ('emission', 0.078), ('recognition', 0.078), ('xn', 0.078), ('weyl', 0.075), ('criterion', 0.072), ('lecun', 0.071), ('objective', 0.07), ('ger', 0.069), ('hermitian', 0.069), ('language', 0.066), ('chapter', 0.066), ('pereira', 0.063), ('none', 0.062), ('xt', 0.062), ('gmms', 0.061), ('hmm', 0.061), ('smallest', 0.061), ('uncorrelated', 0.06), ('dreuw', 0.057), ('espa', 0.057), ('loglinear', 0.057), ('malouf', 0.057), ('preconditioned', 0.057), ('rwth', 0.057), ('usable', 0.057), ('wiesler', 0.057), ('rd', 0.056), ('iterative', 0.054), ('hmms', 0.054), ('database', 0.054), ('hybrid', 0.053), ('behavior', 0.052), ('transformations', 0.052), ('rc', 0.051), ('unnormalized', 0.051), ('system', 0.051), ('slows', 0.05), ('accordance', 0.048), ('initialization', 0.048), ('normalization', 0.048), ('scaling', 0.048), ('largest', 0.048), ('digit', 0.046), ('graves', 0.046), ('kronecker', 0.045), ('theorem', 0.045), ('covariance', 0.044), ('variance', 0.044), ('conjugate', 0.044), ('development', 0.044), ('ic', 0.043), ('word', 0.043), ('conceptually', 0.043), ('zero', 0.043), ('classi', 0.042), ('mccallum', 0.042), ('baseline', 0.042), ('correlated', 0.042), ('cn', 0.042), ('spectrum', 0.042), ('task', 0.041), ('models', 0.041), ('ine', 0.041), ('newton', 0.041), ('probabilities', 0.04), ('mean', 0.039), ('salakhutdinov', 0.039), ('transformation', 0.039), ('text', 0.038), ('multiclass', 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 4 nips-2011-A Convergence Analysis of Log-Linear Training

Author: Simon Wiesler, Hermann Ney

Abstract: Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. 1

2 0.095763393 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik

Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.

3 0.095641248 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

Author: Armen Allahverdyan, Aram Galstyan

Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1

4 0.087885141 217 nips-2011-Practical Variational Inference for Neural Networks

Author: Alex Graves

Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus. 1

5 0.086070865 244 nips-2011-Selecting Receptive Fields in Deep Networks

Author: Adam Coates, Andrew Y. Ng

Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive fields” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1

6 0.083146445 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

7 0.078875884 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

8 0.077509455 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time

9 0.076156639 261 nips-2011-Sparse Filtering

10 0.072875038 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

11 0.072149947 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

12 0.069872528 258 nips-2011-Sparse Bayesian Multi-Task Learning

13 0.067852519 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

14 0.066589713 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

15 0.066318922 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

16 0.06526047 271 nips-2011-Statistical Tests for Optimization Efficiency

17 0.0652496 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

18 0.0640193 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

19 0.062529095 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

20 0.062427439 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.229), (1, 0.014), (2, -0.039), (3, -0.048), (4, -0.04), (5, 0.021), (6, 0.053), (7, 0.007), (8, -0.051), (9, -0.035), (10, 0.016), (11, -0.062), (12, 0.071), (13, -0.033), (14, -0.029), (15, 0.009), (16, -0.01), (17, 0.048), (18, 0.049), (19, 0.006), (20, -0.053), (21, 0.064), (22, 0.026), (23, 0.03), (24, -0.004), (25, 0.076), (26, -0.029), (27, -0.05), (28, 0.044), (29, -0.029), (30, -0.01), (31, 0.013), (32, 0.042), (33, 0.017), (34, -0.034), (35, -0.023), (36, -0.056), (37, -0.096), (38, -0.083), (39, 0.004), (40, -0.003), (41, 0.118), (42, -0.028), (43, -0.047), (44, -0.001), (45, -0.011), (46, -0.002), (47, -0.051), (48, 0.025), (49, -0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93265945 4 nips-2011-A Convergence Analysis of Log-Linear Training

Author: Simon Wiesler, Hermann Ney

Abstract: Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. 1

2 0.65061557 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

Author: Shai Shalev-shwartz, Yonatan Wexler, Amnon Shashua

Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches. 1

3 0.62934101 33 nips-2011-An Exact Algorithm for F-Measure Maximization

Author: Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, Eyke Hüllermeier

Abstract: The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification. 1

4 0.62398136 176 nips-2011-Multi-View Learning of Word Embeddings via CCA

Author: Paramveer Dhillon, Dean P. Foster, Lyle H. Ungar

Abstract: Recently, there has been substantial interest in using large amounts of unlabeled data to learn word representations which can then be used as features in supervised classifiers for NLP tasks. However, most current approaches are slow to train, do not model the context of the word, and lack theoretical grounding. In this paper, we present a new learning method, Low Rank Multi-View Learning (LR-MVL) which uses a fast spectral method to estimate low dimensional context-specific word representations from unlabeled data. These representation features can then be used with any supervised learner. LR-MVL is extremely fast, gives guaranteed convergence to a global optimum, is theoretically elegant, and achieves state-ofthe-art performance on named entity recognition (NER) and chunking problems. 1 Introduction and Related Work Over the past decade there has been increased interest in using unlabeled data to supplement the labeled data in semi-supervised learning settings to overcome the inherent data sparsity and get improved generalization accuracies in high dimensional domains like NLP. Approaches like [1, 2] have been empirically very successful and have achieved excellent accuracies on a variety of NLP tasks. However, it is often difficult to adapt these approaches to use in conjunction with an existing supervised NLP system as these approaches enforce a particular choice of model. An increasingly popular alternative is to learn representational embeddings for words from a large collection of unlabeled data (typically using a generative model), and to use these embeddings to augment the feature set of a supervised learner. Embedding methods produce features in low dimensional spaces or over a small vocabulary size, unlike the traditional approach of working in the original high dimensional vocabulary space with only one dimension “on” at a given time. Broadly, these embedding methods fall into two categories: 1. Clustering based word representations: Clustering methods, often hierarchical, are used to group distributionally similar words based on their contexts. The two dominant approaches are Brown Clustering [3] and [4]. As recently shown, HMMs can also be used to induce a multinomial distribution over possible clusters [5]. 2. Dense representations: These representations are dense, low dimensional and real-valued. Each dimension of these representations captures latent information about a combination of syntactic and semantic word properties. They can either be induced using neural networks like C&W; embeddings [6] and Hierarchical log-linear (HLBL) embeddings [7] or by eigen-decomposition of the word co-occurrence matrix, e.g. Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) [8]. Unfortunately, most of these representations are 1). slow to train, 2). sensitive to the scaling of the embeddings (especially 2 based approaches like LSA/PCA), 3). can get stuck in local optima (like EM trained HMM) and 4). learn a single embedding for a given word type; i.e. all the occurrences 1 of the word “bank” will have the same embedding, irrespective of whether the context of the word suggests it means “a financial institution” or “a river bank”. In this paper, we propose a novel context-specific word embedding method called Low Rank MultiView Learning, LR-MVL, which is fast to train and is guaranteed to converge to the optimal solution. 
As presented here, our LR-MVL embeddings are context-specific, but context oblivious embeddings (like the ones used by [6, 7]) can be trivially gotten from our model. Furthermore, building on recent advances in spectral learning for sequence models like HMMs [9, 10, 11] we show that LR-MVL has strong theoretical grounding. Particularly, we show that LR-MVL estimates low dimensional context-specific word embeddings which preserve all the information in the data if the data were generated by an HMM. Moreover, LR-MVL being linear does not face the danger of getting stuck in local optima as is the case for an EM trained HMM. LR-MVL falls into category (2) mentioned above; it learns real-valued context-specific word embeddings by performing Canonical Correlation Analysis (CCA) [12] between the past and future views of low rank approximations of the data. However, LR-MVL is more general than those methods, which work on bigram or trigram co-occurrence matrices, in that it uses longer word sequence information to estimate context-specific embeddings and also for the reasons mentioned in the last paragraph. The remainder of the paper is organized as follows. In the next section we give a brief overview of CCA, which forms the core of our method. Section 3 describes our proposed LR-MVL algorithm in detail and gives theory supporting its performance. Section 4 demonstrates the effectiveness of LR-MVL on the NLP tasks of Named Entity Recognition and Chunking. We conclude with a brief summary in Section 5. 2 Brief Review: Canonical Correlation Analysis (CCA) CCA [12] is the analog to Principal Component Analysis (PCA) for pairs of matrices. PCA computes the directions of maximum covariance between elements in a single matrix, whereas CCA computes the directions of maximal correlation between a pair of matrices. Unlike PCA, CCA does not depend on how the observations are scaled. This invariance of CCA to linear data transformations allows proofs that keeping the dominant singular vectors (those with largest singular values) will faithfully capture any state information. More specifically, given a set of n paired observation vectors {(l1 , r1 ), ..., (ln , rn )}–in our case the two matrices are the left (L) and right (R) context matrices of a word–we would like to simultaneously find the directions Φl and Φr that maximize the correlation of the projections of L onto Φl with the projections of R onto Φr . This is expressed as max Φl ,Φr E[ L, Φl R, Φr ] E[ L, Φl 2 ]E[ R, Φr 2 ] (1) where E denotes the empirical expectation. We use the notation Clr (Cll ) to denote the cross (auto) covariance matrices between L and R (i.e. L’R and L’L respectively.). The left and right canonical correlates are the solutions Φl , Φr of the following equations: Cll −1 Clr Crr −1 Crl Φl = λΦl Crr −1 Crl Cll −1 Clr Φr = λΦr 3 (2) Low Rank Multi-View Learning (LR-MVL) In LR-MVL, we compute the CCA between the past and future views of the data on a large unlabeled corpus to find the common latent structure, i.e., the hidden state associated with each token. These induced representations of the tokens can then be used as features in a supervised classifier (typically discriminative). The context around a word, consisting of the h words to the right and left of it, sits in a high dimensional space, since for a vocabulary of size v, each of the h words in the context requires an indicator function of dimension v. The key move in LR-MVL is to project the v-dimensional word 2 space down to a k dimensional state space. 
Thus, all eigenvector computations are done in a space that is v/k times smaller than the original space. Since a typical vocabulary contains at least 50, 000 words, and we use state spaces of order k ≈ 50 dimensions, this gives a 1,000-fold reduction in the size of calculations that are needed. The core of our LR-MVL algorithm is a fast spectral method for learning a v × k matrix A which maps each of the v words in the vocabulary to a k-dimensional state vector. We call this matrix the “eigenfeature dictionary”. We now describe the LR-MVL method, give a theorem that provides intuition into how it works, and formally present the LR-MVL algorithm. The Experiments section then shows that this low rank approximation allows us to achieve state-of-the-art performance on NLP tasks. 3.1 The LR-MVL method Given an unlabeled token sequence w={w0 , w1 , . . ., wn } we want to learn a low (k)- dimensional state vector {z0 , z1 , . . . , zn } for each observed token. The key is to find a v ×k matrix A (Algorithm 1) that maps each of the v words in the vocabulary to a reduced rank k-dimensional state vector, which is later used to induce context specific embeddings for the tokens (Algorithm 2). For supervised learning, these context specific embeddings are supplemented with other information about each token wt , such as its identity, orthographic features such as prefixes and suffixes or membership in domain-specific lexicons, and used as features in a classifier. Section 3.4 gives the algorithm more formally, but the key steps in the algorithm are, in general terms: • Take the h words to the left and to the right of each target word wt (the “Left” and “Right” contexts), and project them each down to k dimensions using A. • Take the CCA between the reduced rank left and right contexts, and use the resulting model to estimate a k dimensional state vector (the “hidden state”) for each token. • Take the CCA between the hidden states and the tokens wt . The singular vectors associated with wt form a new estimate of the eigenfeature dictionary. LR-MVL can be viewed as a type of co-training [13]: The state of each token wt is similar to that of the tokens both before and after it, and it is also similar to the states of the other occurrences of the same word elsewhere in the document (used in the outer iteration). LR-MVL takes advantage of these two different types of similarity by alternately estimating word state using CCA on the smooths of the states of the words before and after each target token and using the average over the states associated with all other occurrences of that word. 3.2 Theoretical Properties of LR-MVL We now present the theory behind the LR-MVL algorithm; particularly we show that the reduced rank matrix A allows a significant data reduction while preserving the information in our data and the estimated state does the best possible job of capturing any label information that can be inferred by a linear model. Let L be an n × hv matrix giving the words in the left context of each of the n tokens, where the context is of length h, R be the corresponding n × hv matrix for the right context, and W be an n × v matrix of indicator functions for the words themselves. We will use the following assumptions at various points in our proof: Assumption 1. L, W, and R come from a rank k HMM i.e. it has a rank k observation matrix and rank k transition matrix both of which have the same domain. 
For example, if the dimension of the hidden state is k and the vocabulary size is v then the observation matrix, which is k × v, has rank k. This rank condition is similar to the one used by [10]. Assumption 1A. For the three views, L, W and R assume that there exists a “hidden state H” of dimension n × k, where each row Hi has the same non-singular variance-covariance matrix and 3 such that E(Li |Hi ) = Hi β T and E(Ri |Hi ) = Hi β T and E(Wi |Hi ) = Hi β T where all β’s are of L R W rank k, where Li , Ri and Wi are the rows of L, R and W respectively. Assumption 1A follows from Assumption 1. Assumption 2. ρ(L, W), ρ(L, R) and ρ(W, R) all have rank k, where ρ(X1 , X2 ) is the expected correlation between X1 and X2 . Assumption 2 is a rank condition similar to that in [9]. Assumption 3. ρ([L, R], W) has k distinct singular values. Assumption 3 just makes the proof a little cleaner, since if there are repeated singular values, then the singular vectors are not unique. Without it, we would have to phrase results in terms of subspaces with identical singular values. We also need to define the CCA function that computes the left and right singular vectors for a pair of matrices: Definition 1 (CCA). Compute the CCA between two matrices X1 and X2 . Let ΦX1 be a matrix containing the d largest singular vectors for X1 (sorted from the largest on down). Likewise for ΦX2 . Define the function CCAd (X1 , X2 ) = [ΦX1 , ΦX2 ]. When we want just one of these Φ’s, we will use CCAd (X1 , X2 )left = ΦX1 for the left singular vectors and CCAd (X1 , X2 )right = ΦX2 for the right singular vectors. Note that the resulting singular vectors, [ΦX1 , ΦX2 ] can be used to give two redundant estimates, X1 ΦX1 and X2 ΦX2 of the “hidden” state relating X1 and X2 , if such a hidden state exists. Definition 2. Define the symbol “≈” to mean X1 ≈ X2 ⇐⇒ lim X1 = lim X2 n→∞ n→∞ where n is the sample size. Lemma 1. Define A by the following limit of the right singular vectors: CCAk ([L, R], W)right ≈ A. Under assumptions 2, 3 and 1A, such that if CCAk (L, R) ≡ [ΦL , ΦR ] then CCAk ([LΦL , RΦR ], W)right ≈ A. Lemma 1 shows that instead of finding the CCA between the full context and the words, we can take the CCA between the Left and Right contexts, estimate a k dimensional state from them, and take the CCA of that state with the words and get the same result. See the supplementary material for the Proof. ˜ Let Ah denote a matrix formed by stacking h copies of A on top of each other. Right multiplying ˜ L or R by Ah projects each of the words in that context into the k-dimensional reduced rank space. The following theorem addresses the core of the LR-MVL algorithm, showing that there is an A which gives the desired dimensionality reduction. Specifically, it shows that the previous lemma also holds in the reduced rank space. Theorem 1. Under assumptions 1, 2 and 3 there exists a unique matrix A such that if ˜ ˜ ˜ ˜ CCAk (LAh , RAh ) ≡ [ΦL , ΦR ] then ˜ ˜ ˜ ˜ CCAk ([LAh ΦL , RAh ΦR ], W)right ≈ A ˜ where Ah is the stacked form of A. See the supplementary material for the Proof 1 . ˆ It is worth noting that our matrix A corresponds to the matrix U used by [9, 10]. They showed that U is sufficient to compute the probability of a sequence of words generated by an HMM; although we do not show ˆ it here (due to limited space), our A provides a more statistically efficient estimate of U than their U , and hence can also be used to estimate the sequence probabilities. 
1 4 Under the above assumptions, there is asymptotically (in the limit of infinite data) no benefit to first estimating state by finding the CCA between the left and right contexts and then finding the CCA between the estimated state and the words. One could instead just directly find the CCA between the combined left and rights contexts and the words. However, because of the Zipfian distribution of words, many words are rare or even unique, and hence one is not in the asymptotic limit. In this case, CCA between the rare words and context will not be informative, whereas finding the CCA between the left and right contexts gives a good state vector estimate even for unique words. One can then fruitfully find the CCA between the contexts and the estimated state vector for their associated words. 3.3 Using Exponential Smooths In practice, we replace the projected left and right contexts with exponential smooths (weighted average of the previous (or next) token’s state i.e. Zt−1 (or Zt+1 ) and previous (or next) token’s smoothed state i.e. St−1 (or St+1 ).), of them at a few different time scales, thus giving a further dimension reduction by a factor of context length h (say 100 words) divided by the number of smooths (often 5-7). We use a mixture of both very short and very long contexts which capture short and long range dependencies as required by NLP problems as NER, Chunking, WSD etc. Since exponential smooths are linear, we preserve the linearity of our method. 3.4 The LR-MVL Algorithm The LR-MVL algorithm (using exponential smooths) is given in Algorithm 1; it computes the pair of CCAs described above in Theorem 1. Algorithm 1 LR-MVL Algorithm - Learning from Large amounts of Unlabeled Data 1: Input: Token sequence Wn×v , state space size k, smoothing rates αj 2: Initialize the eigenfeature dictionary A to random values N (0, 1). 3: repeat 4: Set the state Zt (1 < t ≤ n) of each token wt to the eigenfeature vector of the corresponding word. Zt = (Aw : w = wt ) 5: Smooth the state estimates before and after each token to get a pair of views for each smoothing rate αj . (l,j) (l,j) = (1 − αj )St−1 + αj Zt−1 // left view L St (r,j) (r,j) j St = (1 − α )St+1 + αj Zt+1 // right view R. (l,j) (r,j) th where the t rows of L and R are, respectively, concatenations of the smooths St and St for (j) each of the α s. 6: Find the left and right canonical correlates, which are the eigenvectors Φl and Φr of (L L)−1 L R(R R)−1 R LΦl = λΦl . (R R)−1 R L(L L)−1 L RΦr = λΦr . 7: Project the left and right views on to the space spanned by the top k/2 left and right CCAs respectively (k/2) (k/2) Xl = LΦl and Xr = RΦr (k/2) (k/2) where Φl , Φr are matrices composed of the singular vectors of Φl , Φr with the k/2 largest magnitude singular values. Estimate the state for each word wt as the union of the left and right estimates: Z = [Xl , Xr ] 8: Estimate the eigenfeatures of each word type, w, as the average of the states estimated for that word. Aw = avg(Zt : wt = w) 9: Compute the change in A from the previous iteration 10: until |∆A| < 11: Output: Φk , Φk , A . r l A few iterations (∼ 5) of the above algorithm are sufficient to converge to the solution. (Since the problem is convex, there is a single solution, so there is no issue of local minima.) As [14] show for PCA, one can start with a random matrix that is only slightly larger than the true rank k of the correlation matrix, and with extremely high likelihood converge in a few iterations to within a small distance of the true principal components. 
In our case, if the assumptions detailed above (1, 1A, 2 and 3) are satisfied, our method converges equally rapidly to the true canonical variates. As mentioned earlier, we get further dimensionality reduction in Step 5, by replacing the Left and Right context matrices with a set of exponentially smoothed values of the reduced rank projections of the context words. Step 6 finds the CCA between the Left and Right contexts. Step 7 estimates 5 the state by combining the estimates from the left and right contexts, since we don’t know which will best estimate the state. Step 8 takes the CCA between the estimated state Z and the matrix of words W. Because W is a vector of indicator functions, this CCA takes the trivial form of a set of averages. Once we have estimated the CCA model, it is used to generate context specific embeddings for the tokens from training, development and test sets (as described in Algorithm 2). These embeddings are further supplemented with other baseline features and used in a supervised learner to predict the label of the token. Algorithm 2 LR-MVL Algorithm -Inducing Context Specific Embeddings for Train/Dev/Test Data 1: Input: Model (Φk , Φk , A) output from above algorithm and Token sequences Wtrain , (Wdev , Wtest ) r l 2: Project the left and right views L and R after smoothing onto the space spanned by the top k left and right CCAs respectively Xl = LΦk and Xr = RΦk r l and the words onto the eigenfeature dictionary Xw = W train A 3: Form the final embedding matrix Xtrain:embed by concatenating these three estimates of state Xtrain:embed = [Xl , Xw , Xr ] 4: Output: The embedding matrices Xtrain:embed , (Xdev:embed , Xtest:embed ) with context-specific representations for the tokens. These embeddings are augmented with baseline set of features mentioned in Sections 4.1.1 and 4.1.2 before learning the final classifier. Note that we can get context “oblivious” embeddings i.e. one embedding per word type, just by using the eigenfeature dictionary (Av×k ) output by Algorithm 1. 4 Experimental Results In this section we present the experimental results of LR-MVL on Named Entity Recognition (NER) and Syntactic Chunking tasks. We compare LR-MVL to state-of-the-art semi-supervised approaches like [1] (Alternating Structures Optimization (ASO)) and [2] (Semi-supervised extension of CRFs) as well as embeddings like C&W;, HLBL and Brown Clustering. 4.1 Datasets and Experimental Setup For the NER experiments we used the data from CoNLL 2003 shared task and for Chunking experiments we used the CoNLL 2000 shared task data2 with standard training, development and testing set splits. The CoNLL ’03 and the CoNLL ’00 datasets had ∼ 204K/51K/46K and ∼ 212K/ − /47K tokens respectively for Train/Dev./Test sets. 4.1.1 Named Entity Recognition (NER) We use the same set of baseline features as used by [15, 16] in their experiments. The detailed list of features is as below: • Current Word wi ; Its type information: all-capitalized, is-capitalized, all-digits and so on; Prefixes and suffixes of wi • Word tokens in window of 2 around the current word i.e. (wi−2 , wi−1 , wi , wi+1 , wi+2 ); and capitalization pattern in the window. d = • Previous two predictions yi−1 and yi−2 and conjunction of d and yi−1 • Embedding features (LR-MVL, C&W;, HLBL, Brown etc.) in a window of 2 around the current word (if applicable). Following [17] we use regularized averaged perceptron model with above set of baseline features for the NER task. 
We also used their BILOU text chunk representation and fast greedy inference as it was shown to give superior performance. 2 More details about the data and competition are available at http://www.cnts.ua.ac.be/ conll2003/ner/ and http://www.cnts.ua.ac.be/conll2000/chunking/ 6 We also augment the above set of baseline features with gazetteers, as is standard practice in NER experiments. We tuned our free parameter namely the size of LR-MVL embedding on the development and scaled our embedding features to have a 2 norm of 1 for each token and further multiplied them by a normalization constant (also chosen by cross validation), so that when they are used in conjunction with other categorical features in a linear classifier, they do not exert extra influence. The size of LR-MVL embeddings (state-space) that gave the best performance on the development set was k = 50 (50 each for Xl , Xw , Xr in Algorithm 2) i.e. the total size of embeddings was 50×3, and the best normalization constant was 0.5. We omit validation plots due to paucity of space. 4.1.2 Chunking For our chunking experiments we use a similar base set of features as above: • Current Word wi and word tokens in window of 2 around the current word i.e. d = (wi−2 , wi−1 , wi , wi+1 , wi+2 ); • POS tags ti in a window of 2 around the current word. • Word conjunction features wi ∩ wi+1 , i ∈ {−1, 0} and Tag conjunction features ti ∩ ti+1 , i ∈ {−2, −1, 0, 1} and ti ∩ ti+1 ∩ ti+2 , i ∈ {−2, −1, 0}. • Embedding features in a window of 2 around the current word (when applicable). Since CoNLL 00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So, we trained our chunking models on 7936 training sentences and evaluated their F1 score on the 1000 development sentences and used a CRF 3 as the supervised classifier. We tuned the size of embedding and the magnitude of 2 regularization penalty in CRF on the development set and took log (or -log of the magnitude) of the value of the features4 . The regularization penalty that gave best performance on development set was 2 and here again the best size of LR-MVL embeddings (state-space) was k = 50. Finally, we trained the CRF on the entire (“original”) training data i.e. 8936 sentences. 4.1.3 Unlabeled Data and Induction of embeddings For inducing the embeddings we used the RCV1 corpus containing Reuters newswire from Aug ’96 to Aug ’97 and containing about 63 million tokens in 3.3 million sentences5 . Case was left intact and we did not do the “cleaning” as done by [18, 16] i.e. remove all sentences which are less than 90% lowercase a-z, as our multi-view learning approach is robust to such noisy data, like news byline text (mostly all caps) which does not correlate strongly with the text of the article. We induced our LR-MVL embeddings over a period of 3 days (70 core hours on 3.0 GHz CPU) on the entire RCV1 data by performing 4 iterations, a vocabulary size of 300k and using a variety of smoothing rates (α in Algorithm 1) to capture correlations between shorter and longer contexts α = [0.005, 0.01, 0.05, 0.1, 0.5, 0.9]; theoretically we could tune the smoothing parameters on the development set but we found this mixture of long and short term dependencies to work well in practice. As far as the other embeddings are concerned i.e. C&W;, HLBL and Brown Clusters, we downloaded them from http://metaoptimize.com/projects/wordreprs. 
The details about their induction and parameter tuning can be found in [16]; we report their best numbers here. It is also worth noting that the unsupervised training of LR-MVL was (> 1.5 times)6 faster than other embeddings. 4.2 Results The results for NER and Chunking are shown in Tables 1 and 2, respectively, which show that LR-MVL performs significantly better than state-of-the-art competing methods on both NER and Chunking tasks. 3 http://www.chokkan.org/software/crfsuite/ Our embeddings are learnt using a linear model whereas CRF is a log-linear model, so to keep things on same scale we did this normalization. 5 We chose this particular dataset to make a fair comparison with [1, 16], who report results using RCV1 as unlabeled data. 6 As some of these embeddings were trained on GPGPU which makes our method even faster comparatively. 4 7 Embedding/Model Baseline C&W;, 200-dim HLBL, 100-dim Brown 1000 clusters Ando & Zhang ’05 Suzuki & Isozaki ’08 LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim HLBL, 100-dim C&W;, 200-dim Brown, 1000 clusters LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim No Gazetteers With Gazetteers F1-Score Dev. Set Test Set 90.03 84.39 92.46 87.46 92.00 88.13 92.32 88.52 93.15 89.31 93.66 89.36 93.11 89.55 93.61 89.91 92.91 89.35 92.98 88.88 93.25 89.41 93.91 89.89 94.41 90.06 Table 1: NER Results. Note: 1). LR-MVL (CO) are Context Oblivious embeddings which are gotten from (A) in Algorithm 1. 2). F1-score= Harmonic Mean of Precision and Recall. 3). The current state-of-the-art for this NER task is 90.90 (Test Set) but using 700 billion tokens of unlabeled data [19]. Embedding/Model Baseline HLBL, 50-dim C&W;, 50-dim Brown 3200 Clusters Ando & Zhang ’05 Suzuki & Isozaki ’08 LR-MVL (CO) 50 × 3-dim LR-MVL 50 × 3-dim Test Set F1-Score 93.79 94.00 94.10 94.11 94.39 94.67 95.02 95.44 Table 2: Chunking Results. It is important to note that in problems like NER, the final accuracy depends on performance on rare-words and since LR-MVL is robustly able to correlate past with future views, it is able to learn better representations for rare words resulting in overall better accuracy. On rare-words (occurring < 10 times in corpus), we got 11.7%, 10.7% and 9.6% relative reduction in error over C&W;, HLBL and Brown respectively for NER; on chunking the corresponding numbers were 6.7%, 7.1% and 8.7%. Also, it is worth mentioning that modeling the context in embeddings gives decent improvements in accuracies on both NER and Chunking problems. For the case of NER, the polysemous words were mostly like Chicago, Wales, Oakland etc., which could either be a location or organization (Sports teams, Banks etc.), so when we don’t use the gazetteer features, (which are known lists of cities, persons, organizations etc.) we got higher increase in F-score by modeling context, compared to the case when we already had gazetteer features which captured most of the information about polysemous words for NER dataset and modeling the context didn’t help as much. The polysemous words for Chunking dataset were like spot (VP/NP), never (VP/ADVP), more (NP/VP/ADVP/ADJP) etc. and in this case embeddings with context helped significantly, giving 3.1 − 6.5% relative improvement in accuracy over context oblivious embeddings. 5 Summary and Conclusion In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large scale sequence learning problems such as arise in NLP. 
5 Summary and Conclusion

In this paper, we presented a novel CCA-based multi-view learning method, LR-MVL, for large-scale sequence learning problems such as arise in NLP. LR-MVL is a spectral method that works in a low-dimensional state-space, so it is computationally efficient and can be trained on large amounts of unlabeled data; moreover, it does not get stuck in local optima like an EM-trained HMM. The embeddings learnt using LR-MVL can be used as features with any supervised learner. LR-MVL has strong theoretical grounding, is much simpler and faster than competing methods, and achieves state-of-the-art accuracies on the NER and Chunking problems.

Acknowledgements: The authors would like to thank Alexander Yates, Ted Sandler and the three anonymous reviewers for providing valuable feedback. We would also like to thank Lev Ratinov and Joseph Turian for answering our questions regarding their paper [16].

References
[1] Ando, R., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817–1853
[2] Suzuki, J., Isozaki, H.: Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: ACL (2008)
[3] Brown, P., deSouza, P., Mercer, R., Pietra, V.D., Lai, J.: Class-based n-gram models of natural language. Computational Linguistics 18 (December 1992) 467–479
[4] Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL (1993) 183–190
[5] Huang, F., Yates, A.: Distributional representations for handling sparsity in supervised sequence-labeling. In: ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 495–503
[6] Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: ICML '08, New York, NY, USA, ACM (2008) 160–167
[7] Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. In: ICML '07, New York, NY, USA, ACM (2007) 641–648
[8] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Harshman, R.: Using latent semantic analysis to improve access to textual information. In: SIGCHI Conference on Human Factors in Computing Systems, ACM (1988) 281–285
[9] Hsu, D., Kakade, S., Zhang, T.: A spectral algorithm for learning hidden Markov models. In: COLT (2009)
[10] Siddiqi, S., Boots, B., Gordon, G.J.: Reduced-rank hidden Markov models. In: AISTATS (2010)
[11] Song, L., Boots, B., Siddiqi, S.M., Gordon, G.J., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: ICML (2010)
[12] Hotelling, H.: Canonical correlation analysis (CCA). Journal of Educational Psychology (1935)
[13] Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: COLT '98 (1998) 92–100
[14] Halko, N., Martinsson, P.G., Tropp, J.: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions (Dec 2010)
[15] Zhang, T., Johnson, D.: A robust risk minimization based named entity recognition system. In: CONLL '03 (2003) 204–207
[16] Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: ACL '10, Stroudsburg, PA, USA, Association for Computational Linguistics (2010) 384–394
[17] Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL (2009) 147–155
[18] Liang, P.: Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology (2005)
[19] Lin, D., Wu, X.: Phrase clustering for discriminative learning. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. ACL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 1030–1038

5 0.6228596 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

Author: Tzu-kuo Huang, Jeff G. Schneider

Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
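One plausible reading of the penalization scheme sketched in this abstract, written as a rough NumPy/SciPy illustration: fit a VAR(1) transition matrix by least squares on the few available sequence pairs, and penalize violations of the stationary-covariance Lyapunov equation estimated from non-sequence samples. The function name, the use of a known noise covariance Q, and the simple squared-Frobenius penalty are assumptions; the paper's actual estimator may differ.

```python
# Rough sketch (assumptions, not the paper's estimator): VAR(1) least squares
# on sequence pairs plus a penalty on the Lyapunov-equation mismatch
# S = A S A^T + Q, where S is estimated from non-sequence samples.
import numpy as np
from scipy.optimize import minimize

def fit_var1(X_prev, X_next, S_nonseq, Q, lam=1.0):
    d = X_prev.shape[1]
    def objective(a_flat):
        A = a_flat.reshape(d, d)
        resid = X_next - X_prev @ A.T                # least-squares fit on sequence data
        lyap = S_nonseq - A @ S_nonseq @ A.T - Q     # Lyapunov-equation mismatch
        return np.sum(resid ** 2) + lam * np.sum(lyap ** 2)
    res = minimize(objective, np.zeros(d * d), method="L-BFGS-B")
    return res.x.reshape(d, d)

# toy usage with a short simulated sequence
rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.1], [0.0, 0.3]])
Q = 0.1 * np.eye(2)
x, seq = np.zeros(2), []
for _ in range(30):
    x = A_true @ x + rng.multivariate_normal(np.zeros(2), Q)
    seq.append(x)
seq = np.array(seq)
S_hat = np.cov(seq.T)                                # stand-in for the non-sequence covariance
print(fit_var1(seq[:-1], seq[1:], S_hat, Q).round(2))
```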

6 0.62158841 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

7 0.61244667 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

8 0.60754347 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

9 0.58669752 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization

10 0.58326513 277 nips-2011-Submodular Multi-Label Learning

11 0.56792235 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

12 0.56654263 28 nips-2011-Agnostic Selective Classification

13 0.56034303 143 nips-2011-Learning Anchor Planes for Classification

14 0.55774027 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation

15 0.55259824 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

16 0.54735017 165 nips-2011-Matrix Completion for Multi-label Image Classification

17 0.54214501 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

18 0.54076332 260 nips-2011-Sparse Features for PCA-Like Linear Regression

19 0.5344857 254 nips-2011-Similarity-based Learning via Data Driven Embeddings

20 0.53016376 125 nips-2011-Identifying Alzheimer's Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (4, 0.032), (20, 0.028), (26, 0.019), (31, 0.109), (33, 0.032), (41, 0.211), (43, 0.102), (45, 0.144), (57, 0.032), (65, 0.022), (74, 0.06), (83, 0.033), (99, 0.063)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82283044 4 nips-2011-A Convergence Analysis of Log-Linear Training

Author: Simon Wiesler, Hermann Ney

Abstract: Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. 1

2 0.80612737 251 nips-2011-Shaping Level Sets with Submodular Functions

Author: Francis R. Bach

Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lovász extensions. We show that the Lovász extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.
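For readers unfamiliar with the Lovász extension mentioned in this abstract, here is a small generic sketch of the standard greedy/sorting formula for evaluating it (not code from the paper; the graph-cut example is our own illustration).

```python
# Evaluate the Lovász extension of a set function F with F(empty set) = 0
# via the standard greedy/sorting formula.
import numpy as np

def lovasz_extension(F, w):
    order = np.argsort(-np.asarray(w, dtype=float))   # indices by decreasing weight
    value, prev, selected = 0.0, 0.0, set()
    for j in order:
        selected.add(int(j))
        cur = F(selected)
        value += w[j] * (cur - prev)                  # marginal gain of adding j
        prev = cur
    return value

# Example: the cut function of the chain graph 0-1-2 is symmetric submodular;
# its Lovász extension is the total variation |w0 - w1| + |w1 - w2|.
edges = [(0, 1), (1, 2)]
cut = lambda S: sum((u in S) != (v in S) for u, v in edges)
w = [0.3, -0.2, 0.5]
print(lovasz_extension(cut, w), abs(w[0] - w[1]) + abs(w[1] - w[2]))  # both 1.2
```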

3 0.79007745 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

Author: George Konidaris, Scott Niekum, Philip S. Thomas

Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ, the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ(C), with a capacity parameter C. TDγ(C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ(C) performs as well as or better than TDγ for intermediate settings of C. 1
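For context, a hedged sketch of the episodic λ-return target that this abstract builds on (standard TD(λ) definitions; the paper's γ-return replaces the (1 − λ)λ^(n−1) weights with weights derived from a variance model, which is not implemented here).

```python
# Standard episodic lambda-return from n-step returns (a sketch for context,
# not the paper's gamma-return).
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n-step discounted reward plus a bootstrapped value (episodic)."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                          # bootstrap only if the episode has not ended
        G += gamma ** (end - t) * values[end]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    T = len(rewards)
    N = T - t                            # number of available n-step returns
    G = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, N))
    G += lam ** (N - 1) * n_step_return(rewards, values, t, N, gamma)  # Monte-Carlo tail
    return G

# toy episode
rewards = [0.0, 1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.2]
print(lambda_return(rewards, values, t=0, gamma=0.9, lam=0.8))
```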

4 0.7403059 258 nips-2011-Sparse Bayesian Multi-Task Learning

Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau

Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1

5 0.73858982 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
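As background for this abstract, a generic two-way spectral partitioning sketch (the standard normalized-Laplacian recipe, not the specific hierarchical algorithm analyzed in the paper; the toy two-block similarity matrix is an illustrative assumption).

```python
# Generic spectral bipartition: split on the sign of the second eigenvector
# of the normalized Laplacian of a similarity matrix W.
import numpy as np

def spectral_bipartition(W):
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                        # eigenvalues in ascending order
    fiedler = vecs[:, 1]                               # second-smallest eigenvector
    return (fiedler > 0).astype(int)                   # sign gives the two clusters

# toy similarity matrix: two noisy blocks on the diagonal
rng = np.random.default_rng(0)
W = 0.05 * rng.random((20, 20))
W[:10, :10] += 0.9
W[10:, 10:] += 0.9
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
print(spectral_bipartition(W))   # first 10 labels differ from the last 10 (up to relabeling)
```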

6 0.72886264 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

7 0.72866577 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

8 0.72726542 276 nips-2011-Structured sparse coding via lateral inhibition

9 0.72628772 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

10 0.72547024 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

11 0.72510421 180 nips-2011-Multiple Instance Filtering

12 0.72431684 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

13 0.72293985 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

14 0.72278559 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

15 0.72163129 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

16 0.72073841 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning

17 0.72070736 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

18 0.72059071 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

19 0.72042543 156 nips-2011-Learning to Learn with Compound HD Models

20 0.72030801 271 nips-2011-Statistical Tests for Optimization Efficiency