nips nips2009 nips2009-27 knowledge-graph by maker-knowledge-mining

27 nips-2009-Adaptive Regularization of Weight Vectors


Source: pdf

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. [sent-9, score-0.112]

2 AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. [sent-10, score-0.131]

3 We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. [sent-11, score-0.129]

4 We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. [sent-12, score-0.045]

5 Recent work has shown that parameter confidence information can be effectively used to guide online learning [2]. [sent-14, score-0.045]

6 However, the strict update criterion used by CW learning is very aggressive and can over-fit [5]. [sent-17, score-0.078]

7 Approximate solutions can be used to regularize the update and improve results; however, current analyses of CW learning still assume that the data are separable. [sent-18, score-0.06]

8 In this paper we present a new online learning algorithm for binary classification that combines several attractive properties: large margin training, confidence weighting, and the capacity to handle non-separable data. [sent-20, score-0.136]

9 The key to our approach is the adaptive regularization of the prediction function upon seeing each new instance, so we call this algorithm Adaptive Regularization of Weights (AROW). [sent-21, score-0.091]

10 Because it adjusts its regularization for each example, AROW is robust to sudden changes in the classification function due to label noise. [sent-22, score-0.081]

11 We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. [sent-23, score-0.129]

12 We also provide empirical results demonstrating that AROW is competitive with state-of-the-art methods and improves upon them significantly in the presence of label noise. [sent-24, score-0.056]

13 In round t the algorithm receives an instance xt ∈ Rd and applies its current prediction rule to make a prediction ŷt ∈ Y. [sent-26, score-0.584]

14 It then receives the true label yt ∈ Y and suffers a loss ℓ(yt, ŷt). [sent-27, score-0.499]

15 For binary classification we have Y = {−1, +1} and use the zero-one loss ℓ01(yt, ŷt) = 0 if ŷt = yt and 1 otherwise. [sent-28, score-0.698]

16 Finally, the algorithm updates its prediction rule using (xt, yt) and proceeds to the next round. [sent-29, score-0.297]
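The protocol in sentences 13–16 can be written down directly. Below is a minimal sketch in Python; `predict` and `update` are hypothetical callables standing in for any concrete online learner (the names are ours, not the paper's):

```python
import numpy as np

def online_protocol(examples, predict, update):
    """Run one pass of the online protocol: predict, observe the true label,
    suffer zero-one loss, then update the prediction rule."""
    mistakes = 0
    for x_t, y_t in examples:          # round t: receive instance x_t in R^d
        y_hat = predict(x_t)           # apply the current prediction rule
        mistakes += int(y_hat != y_t)  # zero-one loss
        update(x_t, y_t)               # update the rule, proceed to next round
    return mistakes
```

Any learner exposing this interface (perceptron, CW, AROW) plugs into the same loop.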

17 In this work we consider linear prediction rules parameterized by a weight vector w: ŷ = hw(x) = sign(w · x). [sent-30, score-0.053]

18 Recently Dredze, Crammer and Pereira [6, 5] proposed an algorithmic framework for online learning of binary classification tasks called confidence weighted (CW) learning. [sent-31, score-0.069]

19 The values µp and Σp,p , respectively, encode the learner’s knowledge of and confidence in the weight for feature p: the smaller Σp,p , the more confidence the learner has in the mean weight value µp . [sent-33, score-0.078]

20 Conceptually, to classify an instance x, a CW classifier draws a parameter vector w ∼ N (µ, Σ) and predicts the label according to sign(w · x). [sent-35, score-0.063]

21 In practice, however, it can be easier to simply use the average weight vector E [w] = µ to make predictions. [sent-36, score-0.029]

22 This is similar to the approach taken by Bayes point machines [9], where a single weight vector is used to approximate a distribution. [sent-37, score-0.029]

23 Furthermore, for binary classification, the prediction given by the mean weight vector turns out to be Bayes optimal. [sent-38, score-0.077]

24 CW classifiers are trained according to a passive-aggressive rule [3] that adjusts the distribution at each round to ensure that the probability of a correct prediction is at least η ∈ (0.5, 1]. [sent-39, score-0.092]

25 This yields the update constraint Pr[yt (w · xt) ≥ 0] ≥ η. [sent-41, score-0.351]

26 Subject to this constraint, the algorithm makes the smallest possible change to the hypothesis weight distribution as measured using the KL divergence. [sent-42, score-0.029]

27 This implies the following optimization problem for each round t: (µt, Σt) = arg min_{µ,Σ} DKL(N(µ, Σ) ‖ N(µt−1, Σt−1)) s. [sent-43, score-0.032]

28 Pr_{w∼N(µ,Σ)}[yt (w · xt) ≥ 0] ≥ η. Confidence-weighted algorithms have been shown to perform well in practice [5, 6], but they suffer from several problems. [sent-45, score-0.284]

29 First, the update is quite aggressive, forcing the probability of predicting each example correctly to be at least η > 1/2 regardless of the cost to the objective. [sent-46, score-0.06]

30 This is in part because the constraint is written in discrete terms where the prediction is either correct or not. [sent-49, score-0.044]

31 We deal with both of these issues, coping more effectively with label noise and generalizing the advantages of CW learning in an extensible way. [sent-50, score-0.079]

32 3 Adaptive Regularization Of Weights We identify two important properties of the CW update rule that contribute to its good performance but also make it sensitive to label noise. [sent-51, score-0.118]

33 First, the mean parameters µ are guaranteed to correctly classify the current training example with margin following each update. [sent-52, score-0.077]

34 This is because the probability constraint Pr[yt (w · xt) ≥ 0] ≥ η can be written explicitly as yt (µ · xt) ≥ φ √(xtᵀ Σ xt), where φ > 0 is a positive constant related to η. [sent-53, score-1.048]
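The step from the probability constraint to this margin form can be spelled out; a sketch, using only the Gaussian model of w assumed above (Φ denotes the standard normal CDF):

```latex
% Since w ~ N(mu, Sigma), the signed margin is itself Gaussian:
y_t (w \cdot x_t) \sim \mathcal{N}\!\bigl( y_t (\mu \cdot x_t),\; x_t^\top \Sigma x_t \bigr).
% Standardizing, the constraint Pr[y_t (w \cdot x_t) \ge 0] \ge \eta becomes
y_t (\mu \cdot x_t) \;\ge\; \phi \sqrt{x_t^\top \Sigma x_t},
\qquad \phi = \Phi^{-1}(\eta),
% and \phi > 0 exactly because \eta > 1/2.
```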

35 This aggressiveness yields rapid learning, but given an incorrectly labeled example, it can also force the learner to make a drastic and incorrect change to its parameters. [sent-54, score-0.034]

36 Second, confidence, as measured by the inverse eigenvalues of Σ, increases monotonically with every update. [sent-55, score-0.028]

37 Second, the new mean parameters should predict the current example with low loss (second term). [sent-63, score-0.029]

38 Since the loss term depends on µ only via the inner product µ · xt, we are able to prove a representer theorem (Sec. 4). [sent-66, score-0.329]

39 While we use the squared-hinge loss for classification, different loss functions, as long as they are convex and differentiable in µ, yield algorithms for different settings. [sent-68, score-0.071]

40 The squared-hinge loss yields a conservative (or passive) update for µ in which the mean parameters change only when the margin is too small, and we follow CW learning by enforcing a correspondingly conservative update for the confidence parameter Σ, updating it only when µ changes. [sent-71, score-0.231]

41 If µt ≠ µt−1, update the confidence parameters: Σt = arg min_Σ C(µt, Σ). (4) We now develop the update equations for (3) and (4) explicitly, starting with the former. [sent-76, score-0.12]

42 Taking the derivative of C(µ, Σ) with respect to µ and setting it to zero, we get µt = µt−1 − (1/(2r)) [d/dz ℓh²(yt, z)]|_{z = µt · xt} Σt−1 xt, (5) assuming Σt−1 is non-singular. [sent-77, score-0.299]

43 Substituting the derivative of the squared-hinge loss in (5) and assuming 1 − yt (µt · xt) ≥ 0, we get µt = µt−1 + (yt (1 − yt (µt · xt)) / r) Σt−1 xt. (6) [sent-78, score-0.215]

44 We solve for µt by taking the dot product of each side of the equality with xt and substituting back in (6) to obtain the rule µt = µt−1 + (max(0, 1 − yt (xtᵀ µt−1)) / (xtᵀ Σt−1 xt + r)) Σt−1 yt xt. (7) [sent-79, score-1.29]

45 It can be easily verified that (7) satisfies our assumption that 1 − yt (µt · xt) ≥ 0. [sent-80, score-1.028]

46 For t = 1, . . . , T: • Receive a training example xt ∈ Rd • Compute margin and confidence: mt = µt−1 · xt, vt = xtᵀ Σt−1 xt • Receive true label yt, and suffer loss ℓt = 1 if sign(mt) ≠ yt • If mt yt < 1, update using eqs. [sent-85, score-1.966]

47 (7) & (9): βt = 1 / (xtᵀ Σt−1 xt + r), αt = max(0, 1 − yt xtᵀ µt−1) βt, µt = µt−1 + αt Σt−1 yt xt, Σt = Σt−1 − βt Σt−1 xt xtᵀ Σt−1. Output: weight vector µT and confidence ΣT. [sent-86, score-2.056]

48 Figure 1: The AROW algorithm for online binary classification. [sent-87, score-0.069]
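Putting the Figure 1 updates together, here is a minimal full-covariance sketch in Python. The class name and method interface are our own; r is the regularization hyperparameter, and µ0 = 0, Σ0 = I as in the analysis:

```python
import numpy as np

class AROW:
    """Sketch of the AROW updates from Figure 1 (full-covariance version)."""

    def __init__(self, d, r=1.0):
        self.r = r                  # regularization hyperparameter, r > 0
        self.mu = np.zeros(d)       # mean weight vector mu_0 = 0
        self.Sigma = np.eye(d)      # confidence matrix Sigma_0 = I

    def predict(self, x):
        # Predict with the mean weight vector: sign(mu . x).
        return 1 if self.mu @ x >= 0 else -1

    def update(self, x, y):
        m_t = self.mu @ x               # margin
        v_t = x @ self.Sigma @ x        # confidence
        if y * m_t < 1.0:               # update only on a margin violation
            beta = 1.0 / (v_t + self.r)                 # beta_t of eq. (9)
            alpha = max(0.0, 1.0 - y * m_t) * beta      # alpha_t of eq. (7)
            Sx = self.Sigma @ x
            self.mu = self.mu + alpha * y * Sx                 # eq. (7)
            self.Sigma = self.Sigma - beta * np.outer(Sx, Sx)  # eq. (9)
```

The diagonalized variant mentioned in the experiments would keep only diag(Σ); we do not sketch it here.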

49 The update for the confidence parameters is made only if µt ≠ µt−1, that is, if 1 > yt xtᵀ µt−1. [sent-88, score-0.546]

50 4 Analysis We first show that AROW can be kernelized by stating the following representer theorem. [sent-92, score-0.029]

51 The base case follows from the definitions of µ0 and Σ0 , and the induction step follows algebraically from the update rules (7) and (9). [sent-97, score-0.06]

52 Denote by M (M = |M|) the set of example indices for which the algorithm makes a mistake, yt (µt−1 · xt) ≤ 0, and by U (U = |U|) the set of example indices for which there is an update but not a mistake, 0 < yt (µt−1 · xt) ≤ 1. [sent-99, score-1.032]

53 Theorem 2 For any reference weight vector u ∈ Rd , the number of mistakes made by AROW (Fig. [sent-102, score-0.067]

54 1) is upper bounded by M ≤ √((r ‖u‖² + uᵀ XA u)(log det(I + (1/r) XA) + U)) + Σ_{t∈M∪U} gt − U, where gt = max(0, 1 − yt (uᵀ xt)). [sent-103, score-0.739]

55 The proof depends on two lemmas; we omit the proof of the first for lack of space. [sent-104, score-0.028]

56 Lemma 3 Let ℓt = max(0, 1 − yt (µt−1 · xt)) and χt = xtᵀ Σt−1 xt. [sent-105, score-1.028]

57 Then, for every t ∈ M∪U, uᵀ Σt⁻¹ µt = uᵀ Σt−1⁻¹ µt−1 + yt (uᵀ xt) / r and µtᵀ Σt⁻¹ µt = µt−1ᵀ Σt−1⁻¹ µt−1 + (χt + r − r ℓt²) / (r (χt + r)). Lemma 4 Let T be the number of rounds. [sent-106, score-0.271]

58 Proof: We compute the following quantity: xtᵀ Σt xt = xtᵀ (Σt−1 − βt Σt−1 xt xtᵀ Σt−1) xt = χt − χt² / (χt + r) = χt r / (χt + r). [sent-108, score-1.626]

59 xtᵀ Σt xt = r (1 − det Σt−1⁻¹ / det Σt⁻¹). (11) Combining, we get χt / (r (χt + r)) = (1/r)(1 − det Σt−1⁻¹ / det Σt⁻¹) ≤ −(1/r) log(det Σt−1⁻¹ / det Σt⁻¹), and summing over t telescopes to Σt χt / (r (χt + r)) ≤ (1/r) log det ΣT⁻¹. [sent-111, score-1.056]
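One way to fill in the telescoping step behind Lemma 4; a sketch under the paper's definitions, assuming Σ0 = I and the rank-one inverse update Σt⁻¹ = Σt−1⁻¹ + xt xtᵀ / r:

```latex
% Determinant recursion from the matrix determinant lemma:
\det \Sigma_t^{-1}
  = \det\!\Bigl( \Sigma_{t-1}^{-1} + \tfrac{1}{r}\, x_t x_t^\top \Bigr)
  = \det \Sigma_{t-1}^{-1} \Bigl( 1 + \tfrac{\chi_t}{r} \Bigr),
\qquad \chi_t = x_t^\top \Sigma_{t-1} x_t .
% Hence \chi_t/(\chi_t + r) = 1 - \det\Sigma_{t-1}^{-1}/\det\Sigma_t^{-1}
%   \le -\log\bigl( \det\Sigma_{t-1}^{-1}/\det\Sigma_t^{-1} \bigr)
% using 1 - x \le -\log x, and summing over update rounds telescopes:
\sum_{t \in \mathcal{M} \cup \mathcal{U}} \frac{\chi_t}{\chi_t + r}
  \;\le\; \log \det \Sigma_T^{-1} .
```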

60 Proof: We iterate the first equality of Lemma 3 to get uᵀ ΣT⁻¹ µT = (1/r) Σ_{t∈M∪U} yt (uᵀ xt) ≥ (1/r) Σ_{t∈M∪U} (1 − gt) = (M + U)/r − (1/r) Σ_{t∈M∪U} gt. (13) [sent-113, score-0.431]

61 We iterate the second equality to get µTᵀ ΣT⁻¹ µT = Σ_{t∈M∪U} (χt + r − r ℓt²) / (r (χt + r)) = Σ_{t∈M∪U} χt / (r (χt + r)) + Σ_{t∈M∪U} (1 − ℓt²) / (χt + r). Using Lemma 4 we have that the first term is upper bounded by (1/r) log det ΣT⁻¹. [sent-115, score-0.181]

62 First, if a mistake occurred on example t, then we have that yt (xt · µt−1) ≤ 0 and ℓt ≥ 1, so 1 − ℓt² ≤ 0. [sent-117, score-0.556]

63 Second, if the algorithm made an update (but no mistake) on example t, then 0 < yt (xt · µt−1) ≤ 1 and ℓt ≥ 0, thus 1 − ℓt² ≤ 1. [sent-118, score-0.546]

64 Thus µTᵀ ΣT⁻¹ µT ≤ (1/r) log det ΣT⁻¹ + Σ_{t∈U} 1/(χt + r). (14) Combining and plugging into the Cauchy-Schwarz inequality uᵀ ΣT⁻¹ µT ≤ √(uᵀ ΣT⁻¹ u) √(µTᵀ ΣT⁻¹ µT), we get (M + U)/r − (1/r) Σ_{t∈M∪U} gt ≤ √(uᵀ ΣT⁻¹ u) √((1/r) log det ΣT⁻¹ + Σ_{t∈U} 1/(χt + r)). Rearranging the terms and using the fact that χt ≥ 0 yields M ≤ √(r (uᵀ ΣT⁻¹ u)(log det ΣT⁻¹ + U)) + Σ_{t∈M∪U} gt − U. (15) [sent-120, score-0.347]

65 By definition, ΣT⁻¹ = I + (1/r) Σ_{t∈M∪U} xt xtᵀ = I + (1/r) XA, so substituting and simplifying completes the proof: M ≤ √(r uᵀ (I + (1/r) XA) u (log det(I + (1/r) XA) + U)) + Σ_{t∈M∪U} gt − U = √((r ‖u‖² + uᵀ XA u)(log det(I + (1/r) XA) + U)) + Σ_{t∈M∪U} gt − U. [sent-122, score-0.405]

66 First, the two square-root terms of the bound depend on r in opposite ways: the first is monotonically increasing, while the second is monotonically decreasing. [sent-124, score-0.078]

67 Second, if all the updates are associated with errors, that is, U = ∅, then the bound reduces to the bound of the second-order perceptron [2]. [sent-129, score-0.13]

68 5 Empirical Evaluation We evaluate AROW on both synthetic and real data, including several popular datasets for document classification and optical character recognition (OCR). [sent-131, score-0.07]

69 Fig. 2(a) shows the online learning curves for both full and diagonalized versions of the algorithms on these noisy data. [sent-136, score-0.058]

70 We selected a variety of document classification datasets popular in the NLP community, summarized as follows. [sent-139, score-0.052]

71 We created binary datasets by taking all pairs of the six domains (15 datasets). [sent-143, score-0.089]

72 We binarized the corpus following [6] and used binary bag-of-words features (3 datasets). [sent-146, score-0.045]

73 We created binary classification tasks using pairs of labels following [6] (3 datasets). [sent-149, score-0.051]

74 We used each Amazon product review domain as a sentiment classification task (6 datasets). [sent-152, score-0.063]

75 Spam: We selected three task A users from the ECML/PKDD Challenge, using bag-of-words to classify each email as spam or ham (3 datasets). [sent-153, score-0.069]

76 all datasets from the MNIST data (100 datasets total). [sent-156, score-0.076]

77 Each result for the text datasets was averaged over 10-fold cross-validation. [sent-157, score-0.038]

78 5 binary classification task for different amounts of label noise (left: 0 noise, right: 10%). [sent-173, score-0.103]

79 r for AROW) and the number of online iterations (up to 10) were optimized using a single randomized run. [sent-174, score-0.045]

80 In order to observe each algorithm's ability to handle non-separable data, we performed each experiment using various levels of artificial label noise, generated by independently flipping each binary label with fixed probability. [sent-176, score-0.104]

81 As label noise increases (moving across the rows in Fig. [sent-215, score-0.079]

82 In almost every high noise evaluation, AROW improves over CW (as well as the other baselines, not shown). [sent-217, score-0.055]

83 Though absolute performance suffers with noise, the gap between AROW and the baselines increases. [sent-223, score-0.03]

84 To help interpret the results, we classify the algorithms evaluated here according to four characteristics: the use of large margin updates, confidence weighting, a design that accommodates non-separable data, and an adaptive per-instance margin (Table 2). [sent-224, score-0.2]

85 Based on the results in Table 1, it is clear that the combination of confidence information and large margin learning is powerful when label noise is low. Table 2 (online algorithm properties overview) classifies each algorithm by Large Margin / Confidence / Non-Separable / Adaptive Margin: PA: Yes, No, Yes, No; SOP: No, No, Yes, Yes; CW: Yes, Yes, No, Yes; AROW: Yes, Yes, Yes, No. [sent-226, score-0.145]

86 CW easily outperforms the other baselines in such situations, as has been shown previously. [sent-227, score-0.03]

87 However, as noise increases, the separability assumption inherent in CW appears to reduce its performance considerably. [sent-229, score-0.039]

88 (Figure residue: per-dataset result panels for AROW on 20news, amazon, reuters, sentiment and spam.) [sent-251, score-0.204]

89 Label noise increases from left to right: 0%, 10% and 30%. [sent-311, score-0.039]

90 AROW, by combining the large margin and confidence weighting of CW with a soft update rule that accommodates non-separable data, matches CW's performance in general while avoiding degradation under noise. [sent-313, score-0.197]

91 AROW lacks the adaptive margin of CW, suggesting that this characteristic is not crucial to achieving strong performance. [sent-314, score-0.082]

92 6 Related and Future Work AROW is most similar to the second order perceptron [2]. [sent-316, score-0.059]

93 The SOP performs the same type of update as AROW, but only when it makes an error. [sent-317, score-0.06]

94 AROW, on the other hand, updates even when its prediction is correct if there is insufficient margin. [sent-318, score-0.051]

95 Confidence weighted (CW) [6, 5] algorithms, by which AROW was inspired, update the mean and confidence parameters simultaneously, while AROW makes a decoupled update and softens the hard constraint of CW. [sent-319, score-0.14]

96 Second, we bound the loss directly, not the cumulative sum of regularization and loss. [sent-324, score-0.074]

97 Third, the gradient algorithms perform a projection after making an update (not before) since the norm of the weight vector is kept bounded. [sent-325, score-0.102]

98 Finally, while we used the confidence term xtᵀ Σ xt in (1), we can replace this term with any differentiable, monotonically increasing function f(xtᵀ Σ xt). [sent-329, score-0.299]

99 Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. [sent-332, score-0.063]

100 Efficient algorithms for online convex optimization and their applications. [sent-355, score-0.058]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('arow', 0.673), ('cw', 0.503), ('xt', 0.271), ('yt', 0.215), ('det', 0.124), ('dence', 0.121), ('sop', 0.098), ('yes', 0.088), ('koby', 0.084), ('dredze', 0.074), ('mistake', 0.07), ('con', 0.07), ('sentiment', 0.063), ('update', 0.06), ('perceptron', 0.059), ('xa', 0.059), ('gt', 0.058), ('usps', 0.058), ('margin', 0.054), ('crammer', 0.053), ('amazon', 0.053), ('di', 0.05), ('spam', 0.046), ('online', 0.045), ('mnist', 0.044), ('erent', 0.044), ('reuters', 0.042), ('ocr', 0.042), ('label', 0.04), ('su', 0.04), ('pa', 0.039), ('noise', 0.039), ('mistakes', 0.038), ('datasets', 0.038), ('classi', 0.033), ('round', 0.032), ('mark', 0.032), ('baselines', 0.03), ('loss', 0.029), ('weight', 0.029), ('representer', 0.029), ('accomodates', 0.028), ('rls', 0.028), ('adaptive', 0.028), ('monotonically', 0.028), ('updates', 0.027), ('alex', 0.027), ('erentiable', 0.025), ('prediction', 0.024), ('binary', 0.024), ('ers', 0.024), ('fernando', 0.023), ('regularization', 0.023), ('classify', 0.023), ('kulesza', 0.023), ('bound', 0.022), ('ectively', 0.021), ('binarized', 0.021), ('weighting', 0.021), ('mt', 0.02), ('constraint', 0.02), ('learner', 0.02), ('lemma', 0.019), ('rd', 0.019), ('rule', 0.018), ('adjusts', 0.018), ('dkl', 0.018), ('aggressive', 0.018), ('synthetic', 0.018), ('hazan', 0.018), ('seeing', 0.016), ('improves', 0.016), ('avoiding', 0.016), ('reviews', 0.016), ('substituting', 0.015), ('instances', 0.015), ('iterate', 0.015), ('get', 0.015), ('er', 0.015), ('incorrectly', 0.014), ('conservative', 0.014), ('created', 0.014), ('equality', 0.014), ('document', 0.014), ('sign', 0.014), ('proof', 0.014), ('xm', 0.013), ('log', 0.013), ('proceeds', 0.013), ('capacity', 0.013), ('algorithms', 0.013), ('recursive', 0.013), ('grow', 0.013), ('derivative', 0.013), ('pairs', 0.013), ('bination', 0.012), ('biographies', 0.012), ('blenders', 0.012), ('bollywood', 0.012), ('compactness', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 27 nips-2009-Adaptive Regularization of Weight Vectors

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. 1

2 0.17239231 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

3 0.16406868 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

4 0.13792381 178 nips-2009-On Stochastic and Worst-case Models for Investing

Author: Elad Hazan, Satyen Kale

Abstract: In practice, most investing is done assuming a probabilistic model of stock price returns known as the Geometric Brownian Motion (GBM). While often an acceptable approximation, the GBM model is not always valid empirically. This motivates a worst-case approach to investing, called universal portfolio management, where the objective is to maximize wealth relative to the wealth earned by the best fixed portfolio in hindsight. In this paper we tie the two approaches, and design an investment strategy which is universal in the worst-case, and yet capable of exploiting the mostly valid GBM model. Our method is based on new and improved regret bounds for online convex optimization with exp-concave loss functions. 1

5 0.13280611 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

Author: Rong Jin, Shijun Wang, Yang Zhou

Abstract: In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.

6 0.12064764 154 nips-2009-Modeling the spacing effect in sequential category learning

7 0.10178815 177 nips-2009-On Learning Rotations

8 0.098143362 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

9 0.09381526 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

10 0.089133523 220 nips-2009-Slow Learners are Fast

11 0.069111452 45 nips-2009-Beyond Convexity: Online Submodular Minimization

12 0.064574331 11 nips-2009-A General Projection Property for Distribution Families

13 0.064103805 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

14 0.053009331 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

15 0.051446643 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

16 0.049963214 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

17 0.048289411 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

18 0.047312248 246 nips-2009-Time-Varying Dynamic Bayesian Networks

19 0.046757709 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

20 0.042208735 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.134), (1, 0.139), (2, 0.065), (3, -0.058), (4, 0.229), (5, 0.15), (6, 0.022), (7, 0.078), (8, -0.046), (9, 0.028), (10, 0.076), (11, 0.087), (12, -0.037), (13, -0.039), (14, 0.006), (15, -0.139), (16, -0.061), (17, -0.014), (18, 0.051), (19, -0.013), (20, 0.054), (21, -0.066), (22, -0.049), (23, 0.093), (24, 0.009), (25, 0.083), (26, 0.027), (27, -0.025), (28, -0.003), (29, -0.136), (30, -0.039), (31, -0.041), (32, -0.018), (33, 0.051), (34, 0.04), (35, -0.045), (36, -0.001), (37, 0.045), (38, 0.056), (39, 0.017), (40, 0.056), (41, 0.047), (42, -0.01), (43, 0.05), (44, -0.025), (45, -0.05), (46, 0.07), (47, 0.043), (48, -0.002), (49, -0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95389801 27 nips-2009-Adaptive Regularization of Weight Vectors

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. 1

2 0.78413868 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

3 0.77242786 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

4 0.74036831 178 nips-2009-On Stochastic and Worst-case Models for Investing

Author: Elad Hazan, Satyen Kale

Abstract: In practice, most investing is done assuming a probabilistic model of stock price returns known as the Geometric Brownian Motion (GBM). While often an acceptable approximation, the GBM model is not always valid empirically. This motivates a worst-case approach to investing, called universal portfolio management, where the objective is to maximize wealth relative to the wealth earned by the best fixed portfolio in hindsight. In this paper we tie the two approaches, and design an investment strategy which is universal in the worst-case, and yet capable of exploiting the mostly valid GBM model. Our method is based on new and improved regret bounds for online convex optimization with exp-concave loss functions. 1

5 0.65468985 177 nips-2009-On Learning Rotations

Author: Raman Arora

Abstract: An algorithm is presented for online learning of rotations. The proposed algorithm involves matrix exponentiated gradient updates and is motivated by the von Neumann divergence. The multiplicative updates are exponentiated skew-symmetric matrices which comprise the Lie algebra of the rotation group. The orthonormality and unit determinant of the matrix parameter are preserved using matrix logarithms and exponentials and the algorithm lends itself to intuitive interpretation in terms of the differential geometry of the manifold associated with the rotation group. A complexity reduction result is presented that exploits the eigenstructure of the matrix updates to simplify matrix exponentiation to a quadratic form. 1

6 0.61752707 154 nips-2009-Modeling the spacing effect in sequential category learning

7 0.54892749 220 nips-2009-Slow Learners are Fast

8 0.50625527 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

9 0.47769561 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

10 0.4633773 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

11 0.41182607 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

12 0.39961407 11 nips-2009-A General Projection Property for Distribution Families

13 0.35190603 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

14 0.3482013 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

15 0.3376067 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

16 0.31729686 72 nips-2009-Distribution Matching for Transduction

17 0.31647086 45 nips-2009-Beyond Convexity: Online Submodular Minimization

18 0.31008953 71 nips-2009-Distribution-Calibrated Hierarchical Classification

19 0.27991426 56 nips-2009-Conditional Neural Fields

20 0.26272035 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.012), (10, 0.023), (17, 0.195), (18, 0.021), (24, 0.043), (25, 0.058), (27, 0.043), (35, 0.048), (36, 0.13), (39, 0.031), (51, 0.025), (58, 0.065), (61, 0.016), (71, 0.075), (81, 0.012), (86, 0.073), (91, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8423596 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard ℓp-regularized objective functions currently used, such as ridge regression and ℓp-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost.
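The ℓ2 baseline this abstract relaxes its KL regularizer toward is ridge regression, which has a closed form; a minimal reference implementation (the regularization strength `lam` is an illustrative hyperparameter):

```python
import numpy as np

def ridge(X, y, lam=1.0):
    """Closed-form ridge regression, the l2-regularized baseline the
    abstract compares against: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```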

same-paper 2 0.82837152 27 nips-2009-Adaptive Regularization of Weight Vectors

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.
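A sketch of the AROW update as given in the paper — a Gaussian (mean `mu`, covariance `Sigma`) over weight vectors, adaptively regularized on each margin violation; the hyperparameter `r` and the identity initialization are conventional choices, not prescribed by the abstract:

```python
import numpy as np

def arow_fit(X, y, r=1.0):
    """Sketch of AROW (Crammer, Kulesza & Dredze, 2009): maintain mean
    mu and covariance Sigma over weight vectors; update only when the
    margin is violated, shrinking variance along the seen direction."""
    d = X.shape[1]
    mu = np.zeros(d)
    Sigma = np.eye(d)
    for x, yt in zip(X, y):          # labels yt in {-1, +1}
        margin = yt * (mu @ x)
        if margin < 1.0:             # update only on margin violations
            Sx = Sigma @ x
            v = x @ Sx               # variance (confidence) of the margin
            beta = 1.0 / (v + r)
            alpha = (1.0 - margin) * beta
            mu = mu + alpha * yt * Sx
            Sigma = Sigma - beta * np.outer(Sx, Sx)
    return mu, Sigma
```

Directions seen often get small variance and hence small future updates, which is the adaptive regularization that makes the algorithm robust to label noise.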

3 0.69617695 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution

Author: Cosmin Bejan, Matthew Titsworth, Andrew Hickl, Sanda Harabagiu

Abstract: We present a sequence of unsupervised, nonparametric Bayesian models for clustering complex linguistic objects. In this approach, we consider a potentially infinite number of features and categorical outcomes. We evaluated these models for the task of within- and cross-document event coreference on two corpora. All the models we investigated show significant improvements when compared against an existing baseline for this task.

4 0.69381618 129 nips-2009-Learning a Small Mixture of Trees

Author: M. P. Kumar, Daphne Koller

Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.
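The building block for any mixture of trees is learning a single tree distribution; the classical Chow-Liu construction (a maximum-weight spanning tree over pairwise mutual information — the standard baseline, not this paper's α-divergence initialization) can be sketched as:

```python
import numpy as np
from itertools import combinations

def mutual_info(xi, xj):
    """Empirical mutual information (in nats) between two binary columns."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(X):
    """Chow-Liu: maximum-weight spanning tree over pairwise mutual
    information (Kruskal with union-find); returns a list of edges."""
    d = X.shape[1]
    edges = sorted(((mutual_info(X[:, i], X[:, j]), i, j)
                    for i, j in combinations(range(d), 2)), reverse=True)
    parent = list(range(d))
    def find(u):                      # union-find with path compression
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # add edge only if it joins components
            parent[ri] = rj
            tree.append((i, j))
    return tree
```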

5 0.69177902 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

Author: Ruslan Salakhutdinov

Abstract: Markov random fields (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.
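A simpler relative of the tempered transitions used here is replica exchange (parallel tempering), which illustrates the same idea — high-temperature chains cross between modes easily and pass good states down to the target chain; this is a sketch of that swapped-in technique on a 1-D bimodal density, not the paper's exact MCMC operator, and the temperature ladder and step size are illustrative choices:

```python
import numpy as np

def parallel_tempering(log_p, betas, n_iter=2000, step=1.0, seed=0):
    """Replica-exchange sampler: one Metropolis chain per inverse
    temperature beta, plus neighbor swaps; returns the beta=1 samples."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(betas))              # one state per temperature
    samples = []
    for _ in range(n_iter):
        for k, beta in enumerate(betas):  # Metropolis step on each chain
            prop = x[k] + step * rng.normal()
            if np.log(rng.random()) < beta * (log_p(prop) - log_p(x[k])):
                x[k] = prop
        k = rng.integers(len(betas) - 1)  # propose swapping a neighbor pair
        d = (betas[k] - betas[k + 1]) * (log_p(x[k + 1]) - log_p(x[k]))
        if np.log(rng.random()) < d:
            x[k], x[k + 1] = x[k + 1], x[k]
        samples.append(x[0])              # record the target-chain state
    return np.array(samples)
```

A single Metropolis chain at beta=1 would stay trapped in one mode of a well-separated bimodal target; the swaps let the cold chain mix across modes, which is the behavior tempered transitions achieve with up-down temperature sweeps inside a single move.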

6 0.69075006 260 nips-2009-Zero-shot Learning with Semantic Output Codes

7 0.69053066 220 nips-2009-Slow Learners are Fast

8 0.68696445 72 nips-2009-Distribution Matching for Transduction

9 0.68583953 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

10 0.68441224 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

11 0.68316978 97 nips-2009-Free energy score space

12 0.68233836 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

13 0.68049383 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

14 0.67926359 128 nips-2009-Learning Non-Linear Combinations of Kernels

15 0.67910671 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

16 0.67899001 204 nips-2009-Replicated Softmax: an Undirected Topic Model

17 0.67807865 71 nips-2009-Distribution-Calibrated Hierarchical Classification

18 0.67589355 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

19 0.67353368 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

20 0.67349875 112 nips-2009-Human Rademacher Complexity