jmlr jmlr2012 jmlr2012-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Chiang
Abstract: In machine translation, discriminative models have almost entirely supplanted the classical noisy-channel model, but are standardly trained using a method that is reliable only in low-dimensional spaces. Two strands of research have tried to adapt more scalable discriminative training methods to machine translation: the first uses log-linear probability models and either maximum likelihood or minimum risk, and the other uses linear models and large-margin methods. Here, we provide an overview of the latter. We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured outputs, and slow inference. We present experimental results on a large-scale Arabic-English translation task, demonstrating large gains in translation accuracy. Keywords: machine translation, structured prediction, large-margin methods, online learning, distributed computing
Reference: text
sentIndex sentText sentNum sentScore
1 We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured outputs, and slow inference. [sent-5, score-0.265]
2 We present experimental results on a large-scale Arabic-English translation task, demonstrating large gains in translation accuracy. [sent-6, score-0.45]
3 Introduction Statistical machine translation (MT) aims to learn models that can predict, given some utterance in a source language, the best translation into some target language. [sent-8, score-0.45]
4 Och and Ney (2002) first proposed evolving this noisy-channel model into a discriminative log-linear model, which incorporated the language model and translation model as features. [sent-12, score-0.307]
5 This allowed the language model and translation model to be scaled by different factors, and allowed the addition of features beyond these two. [sent-13, score-0.252]
6 The loss function of choice is most often BLEU (rather, 1 − BLEU), which is the standard metric of translation quality used in current MT research (Papineni et al., 2002). [sent-15, score-0.253]
7 4 BLEU over MERT in a large-scale Arabic-English translation task. [sent-34, score-0.225]
8 We discuss three novel extensions of these algorithms that adapt them to particular properties of the translation task. [sent-35, score-0.225]
9 We find that training the model to generate the reference exactly can be too brittle; instead, we propose to update the model towards hope translations which compromise between the reference translation and translations that are easier for the model to generate (Section 4). [sent-37, score-0.572]
10 Second, translation involves a large space of structured outputs. [sent-38, score-0.265]
11 Third, inference in translation tends to be very slow. [sent-40, score-0.225]
12 Setting. In this paper, models are defined over derivations d, which are objects that encapsulate an input sentence f(d), an output sentence e(d), and possibly other information. [sent-47, score-0.418]
13 For any input sentence f, let D(f) be the set of all valid derivations d such that f(d) = f. [sent-48, score-0.305]
14 The 1-best or Viterbi derivation of f_i is d̂ = arg max_{d ∈ D(f_i)} w · h(d), and the 1-best or Viterbi translation is ê = e(d̂). [sent-51, score-0.755]
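For concreteness, here is a minimal sketch of this linear scoring setup over an explicit set of candidate derivations; the dictionary-based representation and field names are illustrative assumptions, not the system's actual data structures.

```python
# Minimal sketch: linear scoring of derivations and Viterbi selection.
# Each derivation is assumed to be a dict carrying a sparse feature map "h"
# and an output string "e"; these names are illustrative only.

def score(w, d):
    """Model score w . h(d) of a derivation d under sparse weights w."""
    return sum(w.get(f, 0.0) * v for f, v in d["h"].items())

def viterbi(w, derivations):
    """1-best (Viterbi) derivation among an explicit candidate set."""
    return max(derivations, key=lambda d: score(w, d))

# Toy usage: two candidate derivations of the same input sentence.
w = {"lm": 0.5, "tm": 1.0}
D_f = [{"e": "the house", "h": {"lm": -2.0, "tm": -1.0}},
       {"e": "house the", "h": {"lm": -5.0, "tm": -0.5}}]
d_hat = viterbi(w, D_f)        # Viterbi derivation
e_hat = d_hat["e"]             # Viterbi translation e(d_hat)
```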
15 Each e_i is not the only correct translation of f_i, but only one of many. [sent-59, score-0.512]
16 For this reason, multiple reference translations are often available for each f_i. [sent-60, score-0.31]
17 Note that although the model is defined over derivations, only sentence pairs (f_i, e_i) are observed. [sent-64, score-0.4]
18 Nevertheless, assume for the moment that we can choose a reference derivation d_i that derives e_i; we discuss various ways of choosing d_i in Section 4. [sent-66, score-0.684]
19 Derivation Forests. The methods described in this paper should work with a wide variety of translation models, but, for concreteness, we assume a model defined using a weighted synchronous context-free grammar or related formalism (Chiang, 2007). [sent-68, score-0.225]
20 In models of this type, derivations can be thought of as trees, and the set of derivations D(f) is called a forest. [sent-70, score-0.384]
21 Let c be the candidate translation to be evaluated and let r be the reference translation. [sent-83, score-0.282]
22 Loss Function. Our learning algorithms assume a loss function ℓ_i(e, e_i) that indicates how bad it is to guess e instead of the reference e_i. [sent-101, score-0.271]
23 Even barring such problems, a BLEU score for a single sentence may not accurately reflect the impact of that sentence on the whole test set (Chiang et al., 2008). [sent-105, score-0.275]
24 That is, after processing each training example (f_i, e_i), we update the oracle document using the 1-best translation ê: b ← 0.9 · (b + b(ê, e_i)). [sent-117, score-0.568]
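A rough sketch of the decayed pseudo-document idea: BLEU sufficient statistics are accumulated from each 1-best translation with an exponential decay (taken to be 0.9 here, matching the update above), and each sentence is then scored in the context of that pseudo-document. The statistics layout and class interface below are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_stats(hyp, ref, max_n=4):
    """Per-n (matches, guesses) plus hypothesis/reference lengths."""
    stats = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        stats.append((sum(min(c, r[g]) for g, c in h.items()), sum(h.values())))
    return stats, len(hyp), len(ref)

def bleu_from_stats(stats, hyp_len, ref_len):
    if any(m == 0 or g == 0 for m, g in stats):
        return 0.0
    log_prec = sum(math.log(m / g) for m, g in stats) / len(stats)
    bp = min(0.0, 1.0 - ref_len / hyp_len)      # brevity penalty (log space)
    return math.exp(log_prec + bp)

class OracleDocument:
    """Exponentially decayed pseudo-document of BLEU statistics (decay assumed 0.9)."""

    def __init__(self, max_n=4, decay=0.9):
        self.stats = [(0.0, 0.0)] * max_n
        self.hyp_len = 0.0
        self.ref_len = 0.0
        self.decay = decay

    def sentence_bleu(self, hyp, ref):
        """BLEU of one sentence evaluated in the context of the pseudo-document."""
        s, hl, rl = bleu_stats(hyp, ref)
        combined = [(a + m, b + g) for (a, b), (m, g) in zip(self.stats, s)]
        return bleu_from_stats(combined, self.hyp_len + hl, self.ref_len + rl)

    def update(self, best_hyp, ref):
        """Fold the 1-best translation into the pseudo-document, with decay."""
        s, hl, rl = bleu_stats(best_hyp, ref)
        self.stats = [(self.decay * (a + m), self.decay * (b + g))
                      for (a, b), (m, g) in zip(self.stats, s)]
        self.hyp_len = self.decay * (self.hyp_len + hl)
        self.ref_len = self.decay * (self.ref_len + rl)
```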
25 Finally, we can define the loss of a translation e relative to e′ as the difference between their B scores, following Watanabe et al. (2007). [sent-121, score-0.253]
26 That is, our learning problem is to minimize L(w) = (1/N) Σ_i L_i(w), (4) where L_i(w) = max_{d ∈ D(f_i)} v_i(w, d, d_i) and v_i(w, d, d_i) = ℓ_i(d, d_i) − w · (h(d_i) − h(d)). [sent-125, score-1.044]
27 Note that since d_i ∈ D(f_i) and v_i(w, d_i, d_i) = 0, L_i(w) is always nonnegative. [sent-126, score-0.937]
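A small sketch of evaluating L_i(w) over an explicit candidate list; in the actual system the max runs over a derivation forest, so the flat list and field names here are simplifying assumptions.

```python
def dot(w, h):
    return sum(w.get(f, 0.0) * v for f, v in h.items())

def hinge_loss(w, candidates, d_ref):
    """L_i(w) = max_d [ loss(d) - w . (h(d_ref) - h(d)) ] over an explicit list.

    Each candidate is a dict with a sparse feature map "h" and a precomputed
    "loss" relative to the reference translation; these field names are assumed."""
    def v(d):
        return d["loss"] - (dot(w, d_ref["h"]) - dot(w, d["h"]))
    fear = max(candidates, key=v)          # worst violator ("fear" derivation)
    return max(0.0, v(fear)), fear
```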
28 We now review the derivations of several existing algorithms for optimizing (4) for structured models. [sent-127, score-0.232]
29 In SGD, we consider one component L_i(w) of the objective function at a time and update w by the subgradient: w ← w − η ∇L_i(w), (5) where ∇L_i(w) = −(h(d_i) − h(d+)) and d+ = arg max_{d ∈ D(f_i)} v_i(w, d, d_i). [sent-132, score-0.38]
30 If, as an approximation, we restrict D(f_i) to just the 1-best derivation of f_i, then we get the structured perceptron algorithm (Rosenblatt, 1958; Freund and Schapire, 1999; Collins, 2002). [sent-133, score-0.767]
31 An easy way to approximate the fear derivation would be to generate an n-best list and select the derivation from it that maximizes v_i. [sent-140, score-0.523]
32 The terminology of fear derivations and hope derivations, to be defined below, is due to Kevin Knight. [sent-143, score-0.639]
33 Algorithm 1 (averaged stochastic subgradient descent):
    for i ∈ {1, …, N} in random order do
        UPDATEWEIGHTS(w, i)
        s ← s + w
        t ← t + 1
    w ← s/t
    procedure UPDATEWEIGHTS(w, i)
        d+ ← arg max_{d ∈ D(f_i)} v_i(w, d, d_i)
        w ← w + η (h(d_i) − h(d+)) [sent-150, score-0.545]
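A sketch of the averaged subgradient loop above, written over explicit candidate lists; in the real system the candidates come from decoding each source sentence under the current weights, so the data layout here is an assumption.

```python
import random

def dot(w, h):
    return sum(w.get(f, 0.0) * v for f, v in h.items())

def averaged_sgd(examples, w, eta=0.01, epochs=1):
    """Averaged stochastic subgradient descent on the hinge objective (sketch).

    examples: list of (candidates, d_ref) pairs, where each derivation is a dict
    with a sparse feature map "h" and a "loss" relative to the reference.  In the
    real system, candidates would be re-decoded under the current w each time."""
    s, t = {}, 0
    for _ in range(epochs):
        for candidates, d_ref in random.sample(examples, len(examples)):
            # Fear derivation: maximizes v_i(w, d, d_ref) = loss(d) - w.(h(d_ref) - h(d)).
            fear = max(candidates,
                       key=lambda d: d["loss"] - (dot(w, d_ref["h"]) - dot(w, d["h"])))
            # Subgradient step: w <- w + eta * (h(d_ref) - h(fear)).
            for f, v in d_ref["h"].items():
                w[f] = w.get(f, 0.0) + eta * v
            for f, v in fear["h"].items():
                w[f] = w.get(f, 0.0) - eta * v
            # Accumulate for averaging.
            for f, v in w.items():
                s[f] = s.get(f, 0.0) + v
            t += 1
    return {f: v / t for f, v in s.items()} if t else dict(w)
```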
34 It is more commonly presented as a quadratic program (QP): minimize (1/2η) ‖w′ − w‖² + ξ_i subject to v_i(w′, d, d_i) − ξ_i ≤ 0 for all d ∈ D(f_i), where ξ_i is a slack variable. [sent-157, score-0.513]
35 (Note that ξ_i ≥ 0 since d_i ∈ D(f_i) and v_i(w′, d_i, d_i) = 0.) [sent-158, score-0.937]
36 The Lagrangian is: (1/2η) ‖w′ − w‖² + ξ_i + Σ_{d ∈ D(f_i)} α_d (v_i(w′, d, d_i) − ξ_i). (7) [sent-159, score-0.212]
37 Substituting back into (7), we get the following dual problem: maximize −(η/2) ‖Σ_{d ∈ D(f_i)} α_d (h(d_i) − h(d))‖² + Σ_{d ∈ D(f_i)} α_d v_i(w, d, d_i), subject to Σ_{d ∈ D(f_i)} α_d = 1 and α_d ≥ 0 for all d ∈ D(f_i). [sent-164, score-1.289]
38 In machine translation, and in structured prediction in general, the number of hypotheses in D(f_i), and therefore the number of constraints in the QP, can be exponential or worse. [sent-165, score-0.234]
39 (2004), which repeatedly recomputes the fear derivation and adds it to a working set S_i of derivations on which the QP is optimized (Algorithm 2). [sent-171, score-0.498]
40 A new fear derivation is added to the working set only if it is a worse violator by a certain margin (ε); otherwise, the algorithm terminates. [sent-172, score-0.306]
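A sketch of this working-set loop: repeatedly add the current worst violator (the fear derivation) and re-solve the small QP, stopping once no candidate violates by more than ε. The QP re-optimization is passed in as a callback, standing in for the SMO-style pair updates described below.

```python
def working_set_mira(w, candidates, d_ref, qp_optimize, epsilon=0.01, max_iters=50):
    """Working-set (cutting-plane) loop for one example (sketch).

    qp_optimize(w, S, d_ref) -> new w re-solves the small QP over the working
    set S, e.g., by the SMO-style pair updates sketched later; it is supplied
    by the caller.  Derivations are dicts with "h" and "loss" fields (assumed)."""
    def v(d):
        feats = set(d_ref["h"]) | set(d["h"])
        return d["loss"] - sum(w.get(f, 0.0) * (d_ref["h"].get(f, 0.0) - d["h"].get(f, 0.0))
                               for f in feats)
    S = [d_ref]
    for _ in range(max_iters):
        fear = max(candidates, key=v)            # current worst violator
        slack = max(v(d) for d in S)             # current value of xi_i
        if v(fear) <= slack + epsilon:           # not a worse violator by margin eps
            break
        S.append(fear)
        w = qp_optimize(w, S, d_ref)             # re-optimize over the working set
    return w, S
```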
41 80) to select a pair of constraints: one must violate one of the KKT conditions (α_d (v_i(w′, d, d_i) − ξ_i) = 0), and the other must allow the objective to be improved. [sent-175, score-0.212]
42 More accurately, we took the union of the 10 best derivations, the top 10 fear derivations, and the top 10 hope derivations (to be defined below). [sent-182, score-0.447]
43 For example, even a small change in the language model weights could result in a large change in translation length and fluency, whereas large changes in features like those attached to number-translation rules have a relatively small effect. [sent-191, score-0.252]
44 This change has no effect on the scores assigned to derivations or the translations generated, so intuitively one would hope that it also has no effect on learning. [sent-193, score-0.348]
45 First, line 34 is replaced with: δ ← [v_i(w, d′, d_i) − v_i(w, d′′, d_i)] / [(h(d′) − h(d′′)) Σ (h(d′) − h(d′′))], and line 37 is replaced with: w ← w − Σ δ (h(d′) − h(d′′)). [sent-210, score-0.638]
46 Procedure OPTIMIZEPAIR(w, i, d′, d′′):
    δ ← [v_i(w, d′, d_i) − v_i(w, d′′, d_i)] / [(h(d′) − h(d′′)) Σ (h(d′) − h(d′′))]
    δ ← max(−α_{d′}, min(α_{d′′}, δ))
    α_{d′} ← α_{d′} + δ
    α_{d′′} ← α_{d′′} − δ
    w ← w − Σ δ (h(d′) − h(d′′)) [sent-226, score-0.638]
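A sketch of the OPTIMIZEPAIR step above, with Σ simplified to a per-feature diagonal stored as a dict; the derivation fields ("id", "h", "loss") are assumed names.

```python
def optimize_pair(w, alpha, d1, d2, d_ref, Sigma):
    """One SMO-style step on the constraint pair (d1, d2) -- a sketch of
    OPTIMIZEPAIR with Sigma simplified to a per-feature diagonal (dict)."""
    def v(d):  # v_i(w, d, d_ref)
        feats = set(d_ref["h"]) | set(d["h"])
        return d["loss"] - sum(w.get(f, 0.0) * (d_ref["h"].get(f, 0.0) - d["h"].get(f, 0.0))
                               for f in feats)

    diff_feats = set(d1["h"]) | set(d2["h"])
    diff = {f: d1["h"].get(f, 0.0) - d2["h"].get(f, 0.0) for f in diff_feats}
    denom = sum(Sigma.get(f, 1.0) * diff[f] ** 2 for f in diff_feats)
    if denom == 0.0:
        return
    delta = (v(d1) - v(d2)) / denom
    delta = max(-alpha[d1["id"]], min(alpha[d2["id"]], delta))   # keep both alphas >= 0
    alpha[d1["id"]] += delta
    alpha[d2["id"]] -= delta
    for f in diff_feats:                       # w <- w - Sigma * delta * (h(d1) - h(d2))
        w[f] = w.get(f, 0.0) - Sigma.get(f, 1.0) * delta * diff[f]
```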
47 The Reference Derivation. We have been assuming that d_i is the derivation of the reference translation e_i. [sent-228, score-0.697]
48 Bold/Max-BLEU Updating. It can happen that there does not exist any derivation of e_i, for example, if e_i contains a word never seen before in training. [sent-232, score-0.296]
49 find that even when it is possible to find a d_i that exactly generates e_i, it is not necessarily desirable to update the model towards it, because it may be a bad derivation of a good translation. [sent-237, score-0.444]
50 A very literal translation might be, A piece of a salted biscuit, a “pretzel,” blocked his throat. [sent-240, score-0.254]
51 But the reference translation is in fact: A pretzel, a salted biscuit, became lodged in his throat. [sent-241, score-0.311]
52 While accurate, this translation swaps grammatical roles in a way that is still difficult for statistical MT systems to model. [sent-242, score-0.225]
53 If the system happens to have some bad rules that translate sd qTEp mn as a pretzel and “ brytzl ” as became lodged in, then it can use these bad rules to obtain a perfect translation, but using this derivation as the reference derivation would only reinforce the use of these bad rules. [sent-243, score-0.315]
54 A derivation of the more literal translation would probably serve better as the reference translation. [sent-244, score-0.392]
55 Local Updating. The most common way to do this has been to generate the n-best derivations according to the model and to choose the one with the lowest loss (Och and Ney, 2002). [sent-247, score-0.22]
56 (2007) generate a 1000-best list and select either the derivation with lowest loss or the 10 derivations with lowest loss. [sent-251, score-0.33]
57 The idea is that restricting to derivations with a higher model score will filter out derivations that use bad, low-probability rules. [sent-252, score-0.433]
58 We suppose that for each f_i, the reference derivation d_i is unknown, and it doesn't necessarily derive the reference translation e_i, but we add a term to the objective function that says that we want d_i to have low loss relative to e_i. [sent-257, score-1.281]
59 w ← arg min_{w′} min_{d_i ∈ D(f_i)} [ (1/2η) ‖w′ − w‖² + max_{d ∈ D(f_i)} v_i(w′, d, d_i) + (1 − µ) ℓ_i(d_i, e_i) ]. [sent-258, score-0.656]
60 The parameter µ < 0 controls how strongly we want d_i to have low loss. [sent-259, score-0.406]
61 Then the optimization reduces to d_i = arg max_{d ∈ D(f_i)} (µ ℓ_i(d, e_i) + w · h(d)). (9) [sent-261, score-0.337]
62 Then, we optimize with respect to w′, holding d_i constant. [sent-262, score-0.406]
63 We call d_i chosen according to (9) the hope derivation. [sent-264, score-0.271]
64 If we let µ = −1, the definition of the hope derivation becomes conveniently symmetric with the fear derivation: d_i = arg max_{d ∈ D(f_i)} (−ℓ_i(d, e_i) + w · h(d)). [sent-266, score-0.702]
65 Both the hope and fear derivations try to maximize the model score, but the fear derivation maximizes the loss whereas the hope derivation minimizes the loss. [sent-267, score-1.144]
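A tiny sketch of this symmetric hope/fear selection over an explicit candidate list (in practice the search runs over the whole forest, as described next); the field names are assumptions.

```python
def dot(w, h):
    return sum(w.get(f, 0.0) * v for f, v in h.items())

def hope_and_fear(w, candidates):
    """Symmetric hope/fear selection over an explicit candidate list (sketch).

    Each candidate carries a sparse feature map "h" and a sentence-level "loss"
    (e.g., 1 - BLEU in context); the field names are assumptions."""
    hope = max(candidates, key=lambda d: dot(w, d["h"]) - d["loss"])
    fear = max(candidates, key=lambda d: dot(w, d["h"]) + d["loss"])
    return hope, fear
```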
66 Searching for Hope and Fear. As mentioned above, one simple way of approximating either the hope or fear derivation is to generate an n-best list and choose from it the derivation that maximizes (9) or v_i, respectively. [sent-269, score-0.582]
67 Since h_LMk is decomposable onto hyperedges by assumption, it is safe to assume that g_k is also decomposable onto hyperedges, and so is n_k, which is the cardinality of g_k. [sent-280, score-0.311]
68 Suppose our reference sentence is Australia is one of the few countries that have diplomatic relations with North Korea and we have two partial translations: the few … [sent-282, score-0.262]
69 Figure 1 (x-axis: model score w · h): Using loss-augmented inference to search for fear translations in the whole forest is better than searching in the n-best list. [sent-291, score-0.368]
70 The red square in the upper-right is the fear derivation obtained by loss-augmented inference, whereas the red square inside the box labeled “100-best” is the fear derivation selected from the 100-best list. [sent-293, score-0.612]
71 Then we can apportion ρ among hyperedges according to how much of the input sentence they consume: ρ(v → v̄) = (ρ / |f_i|) (|f(v)| − Σ_{v′ ∈ v̄} |f(v′)|), (10) where f(v) is the part of the input sentence covered by the subderivation rooted at v. [sent-302, score-0.487]
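A sketch of the apportionment in (10), assuming a hypergraph where each hyperedge records the source span of its head node and of its tail nodes; summed over the hyperedges of any single derivation, the shares telescope back to ρ.

```python
def apportion(rho, hyperedges, src_len):
    """Spread a sentence-level quantity rho over hyperedges as in (10).

    Each hyperedge is assumed to carry "head_span" (i, j) and "tail_spans"
    [(i1, j1), ...], the source spans of its head and tail nodes."""
    shares = []
    for edge in hyperedges:
        i, j = edge["head_span"]
        newly_covered = (j - i) - sum(j2 - i2 for i2, j2 in edge["tail_spans"])
        shares.append(rho * newly_covered / src_len)
    # Along any single derivation, newly_covered telescopes to the full
    # sentence length, so the shares on that derivation sum to rho.
    return shares
```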
72 To search for the hope or fear derivation, we use the following dynamic program: vderiv(v) = arg max_{d ∈ {vderiv(v → v̄)}} φ(d), vderiv(v → v̄) = {v → v̄} ∪ ⋃_{v′ ∈ v̄} vderiv(v′), where φ is one of the following: φ(d) = w · h(d) + B(b(d, e_i)) (hope), φ(d) = w · h(d) − B(b(d, e_i)) (fear). [sent-309, score-0.473]
73 Note that maximizing w · h(d) + B(b(d, e_i)) is equivalent to maximizing w · h(d) − ℓ_i(d, e_i), since they differ by only a constant; likewise, maximizing w · h(d) − B(b(d, e_i)) is equivalent to maximizing w · h(d) + ℓ_i(d, e_i). [sent-310, score-0.372]
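A sketch of the vderiv dynamic program as bottom-up Viterbi over a topologically ordered hypergraph, with the edge score taken to be the local model score plus (hope) or minus (fear) an edge-local approximation of the BLEU gain; the node/edge layout and the precomputed bleu_gain field are assumptions.

```python
def forest_search(nodes, w, sign=+1):
    """Bottom-up Viterbi with a loss-augmented edge score (sketch of vderiv).

    nodes: topologically ordered (tails before heads).  Each node is a dict with
    "edges"; each edge has "tails" (indices of earlier nodes), a sparse feature
    map "h", and "bleu_gain", an edge-local approximation of its contribution
    to B.  sign=+1 searches for the hope derivation, sign=-1 for the fear."""
    best = {}                                  # node index -> (score, best incoming edge)
    for idx, node in enumerate(nodes):
        if not node["edges"]:                  # leaf / axiom node
            best[idx] = (0.0, None)
            continue
        best[idx] = None
        for edge in node["edges"]:
            score = sum(w.get(f, 0.0) * v for f, v in edge["h"].items())
            score += sign * edge["bleu_gain"]
            score += sum(best[t][0] for t in edge["tails"])
            if best[idx] is None or score > best[idx][0]:
                best[idx] = (score, edge)
    return best                                # the root is the last node in the ordering
```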
74 The other partial translation is North Korea with relations diplomatic have that countries few the of one is. Translation #1 has 4 unigram matches and 3 bigram matches, for a BLEU-2 score of 12/156; translation #2 has 13 unigram matches and 1 bigram match, for a BLEU-2 score of 13/156. [sent-317, score-0.356]
75 If we extend both translations, however, with the word Australia, giving them each an extra unigram match, then translation #1 gets a BLEU-2 score of 15/156, and translation #2, 14/156. [sent-318, score-0.499]
76 After we find a hope or fear derivation, we recalculate its exact BLEU score, without any of the approximations described in this section. [sent-320, score-0.255]
77 Parallelization. Because inference is so slow for the translation task, and especially for the CKY-based decoder we are using, parallelization is critical. [sent-322, score-0.299]
78 Thus, when each learner works on a training example ( fi , ei ), it optimizes the QP on it along with all of the working sets it received from other nodes. [sent-346, score-0.314]
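The experiments compare two parallelization schemes: asynchronous sharing of working sets, described above, and iterative parameter mixing. Below is a sketch of the simpler scheme, iterative parameter mixing, with the per-shard training pass supplied by the caller (e.g., one epoch of the SGD sketch earlier); real workers would run in parallel rather than in this serial simulation.

```python
def iterative_parameter_mixing(shards, w0, train_one_epoch, n_epochs=5):
    """Iterative parameter mixing (sketch): each shard trains from the mixed
    weights for one epoch, and the resulting weight vectors are averaged.

    train_one_epoch(w, shard) -> new weight dict is supplied by the caller.
    The list comprehension below is a serial stand-in for parallel workers."""
    w = dict(w0)
    for _ in range(n_epochs):
        results = [train_one_epoch(dict(w), shard) for shard in shards]
        feats = set().union(*(r.keys() for r in results))
        w = {f: sum(r.get(f, 0.0) for r in results) / len(results) for f in feats}
    return w
```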
79 Experiments. We experimented with the methods described above on the hierarchical phrase-based translation system Hiero (Chiang, 2005, 2007), using two feature sets. [sent-351, score-0.225]
80 Default settings: • learning rate η = 0.01 • Hope derivations with µ = −1 • Forest reranking for hope/fear derivations • Iterative parameter mixing on 20 processors. A few probability features have to be initialized carefully: the two language models and the two phrase translation probability models. [sent-362, score-0.739]
81 (b) More negative values of the loss weight µ for hope derivations lead to higher initial performance, whereas less negative loss weights lead to higher final performance. [sent-398, score-0.35]
82 To control for this, we ran SGD (on the hinge loss) using both forest reranking and linear BLEU to search for hope/fear derivations (Figure 3a). [sent-412, score-0.371]
83 A weight of µ = −1 appears to be a good tradeoff, and is symmetrical with the weight of 1 used when computing fear derivations. [sent-417, score-0.282]
84 Figure 5 (development BLEU vs. epoch; curves: serial and asynchronous with p = 2, 5, 10, 20, 50): Taking a closer look at asynchronous sharing of working sets, we see that, at each epoch, greater parallelization generally gives better performance. [sent-423, score-0.789]
85 Results table (flattened by extraction): columns are model (small ×4, large ×4), objective (1 − BLEU, hinge, risk, then hinge), algorithm (MERT, SGD with η = 0.01, …), approximation (–, rerank, linear, then rerank), parallelization (–, then IPM and async), epoch (6, 6, 8, 4, 5, 7, 9, 4), and dev/test BLEU (values truncated). [sent-455, score-0.234] [sent-462, score-0.677]
87 We extended all of these methods in novel ways to cope with the large structured search space of the translation task, that is, to use as much of the translation forest as possible. [sent-484, score-0.554]
88 Acknowledgments. This work evolved over time to support several projects adding new features to the ISI machine translation systems, and would not have been possible without my collaborators on those projects: Steve DeNeefe, Kevin Knight, Yuval Marton, Michael Pust, Philip Resnik, and Wei Wang. [sent-499, score-0.225]
89 Objective Function. Define a probabilistic version of the model, P_T(d | f_i) ∝ exp((1/T) w · h(d)), where T is a temperature parameter, and for any random variable X over derivations, define E_T[X | f_i] = Σ_{d ∈ D(f_i)} P_T(d | f_i) X(d). [sent-510, score-0.582]
90 In minimum-risk training, we want to minimize Σ_i E_T[ℓ_i(d, d_i) | f_i] for T = 1. [sent-511, score-0.6]
91 The gradient for a single example is: ∇E_T[ℓ_i(d, d_i) | f_i] = (1/T) (E_T[ℓ_i h | f_i] − E_T[ℓ_i | f_i] E_T[h | f_i]), or, in terms of B: ∇E_T[ℓ_i(d, d_i) | f_i] = −∇E_T[B(b(d, e_i)) | f_i] = −(1/T) (E_T[B h | f_i] − E_T[B | f_i] E_T[h | f_i]). (11) [sent-514, score-2.263]
92 A major advantage that minimum-risk has over the large-margin methods explored in this paper is that it does not require a reference derivation, or a hope derivation as a proxy for the reference derivation. [sent-515, score-0.283]
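A sketch of the risk gradient in (11) computed over an explicit candidate list with a softmax at temperature T; the forest-level expectations used in practice require expectation semirings, so the flat list and field names here are simplifications.

```python
import math

def risk_gradient(w, candidates, T=1.0):
    """Gradient of E_T[loss] over an explicit candidate list, as in (11).

    Each candidate has a sparse feature map "h" and a "loss"; names are assumed."""
    scores = [sum(w.get(f, 0.0) * v for f, v in d["h"].items()) / T for d in candidates]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]            # numerically stable softmax
    Z = sum(exps)
    p = [e / Z for e in exps]

    feats = set().union(*(d["h"].keys() for d in candidates))
    E_h = {f: sum(pi * d["h"].get(f, 0.0) for pi, d in zip(p, candidates)) for f in feats}
    E_l = sum(pi * d["loss"] for pi, d in zip(p, candidates))
    E_lh = {f: sum(pi * d["loss"] * d["h"].get(f, 0.0) for pi, d in zip(p, candidates)) for f in feats}
    return {f: (E_lh[f] - E_l * E_h[f]) / T for f in feats}
```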
93 (2010) show that for applications where the input space is continuous (as in speech processing), a perceptron-like update using the hope and 1-best derivations, or the 1-best and fear derivations, approaches the gradient of the loss. [sent-521, score-0.284]
94 Consider a single training example ( fi , ei ), so that we can simply write ℓ for ℓi and ET [X] for ET [X | fi ]. [sent-523, score-0.508]
95 Define a loss-augmented model: 1 Pµ (d | fi ) ∝ exp (w · h(d) + µℓ(d, di )) µ and define ∑ Eµ [X] = Pµ (d | fi )X(d). [sent-524, score-0.6]
96 Having made this approximation, there is no harm in letting T = 0, so that the expectations of h become the value of h at the mode of the underlying distribution: w ← w − η (h(d_{+1}) − h(d_{−1})), where d_{+1} = arg max_d (w · h(d) + ℓ(d, d_i)) and d_{−1} = arg max_d (w · h(d) − ℓ(d, d_i)). [sent-528, score-0.488]
97 But this is exactly the SGD update on the generalized hinge loss (5), with d+ = d_{+1} being the fear derivation and d_i = d_{−1} being the hope derivation. [sent-529, score-0.673]
98 Decomposability of translation metrics for improved evaluation and efficient algorithms. [sent-577, score-0.225]
99 Online large-margin training of syntactic and structural translation features. [sent-580, score-0.252]
100 First- and second-order expectation semirings with applications to minimum-risk training on translation forests. [sent-635, score-0.252]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
2 0.10972196 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of textcategorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. Keywords: online learning, confidence prediction, text categorization
3 0.050492022 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
Author: Sunita Nayak, Kester Duncan, Sudeep Sarkar, Barbara Loeding
Abstract: We present a probabilistic framework to automatically learn models of recurring signs from multiple sign language video sequences containing the vocabulary of interest. We extract the parts of the signs that are present in most occurrences of the sign in context and are robust to the variations produced by adjacent signs. Each sentence video is first transformed into a multidimensional time series representation, capturing the motion and shape aspects of the sign. Skin color blobs are extracted from frames of color video sequences, and a probabilistic relational distribution is formed for each frame using the contour and edge pixels from the skin blobs. Each sentence is represented as a trajectory in a low dimensional space called the space of relational distributions. Given these time series trajectories, we extract signemes from multiple sentences concurrently using iterated conditional modes (ICM). We show results by learning single signs from a collection of sentences with one common pervading sign, multiple signs from a collection of sentences with more than one common sign, and single signs from a mixed collection of sentences. The extracted signemes demonstrate that our approach is robust to some extent to the variations produced within a sign due to different contexts. We also show results whereby these learned sign models are used for spotting signs in test sequences. Keywords: pattern extraction, sign language recognition, signeme extraction, sign modeling, iterated conditional modes
4 0.04522679 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
Author: Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert Müller
Abstract: We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof for an identifiability criterion. Keywords: computational algebraic geometry, approximate algebra, unsupervised Learning
5 0.041778177 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
Author: Nir Ailon
Abstract: Given a set V of n elements we wish to linearly order them given pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(ε−6 n log5 n) preference labels for a regret of ε times the optimal loss. As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem. Our main result takes us a step closer toward settling an open problem posed by learning-torank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? To further show the power and practicality of our solution, we analyze a typical test case in which a large margin linear relaxation is used for efficiently solving the simpler learning problems in our decomposition. Keywords: statistical learning theory, active learning, ranking, pairwise ranking, preferences
6 0.039368577 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
7 0.038679823 90 jmlr-2012-Pattern for Python
8 0.037894536 25 jmlr-2012-Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs
9 0.036635332 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
10 0.035820097 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
11 0.032291234 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
12 0.032261986 20 jmlr-2012-Analysis of a Random Forests Model
13 0.031612266 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
14 0.030106347 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
15 0.029092902 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
16 0.029010544 6 jmlr-2012-A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives
17 0.027816949 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
18 0.027755184 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
19 0.027586179 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
20 0.027135078 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
topicId topicWeight
[(0, -0.137), (1, 0.003), (2, 0.085), (3, -0.028), (4, 0.023), (5, -0.028), (6, 0.102), (7, 0.019), (8, 0.047), (9, -0.026), (10, 0.019), (11, 0.077), (12, -0.056), (13, -0.066), (14, 0.048), (15, 0.064), (16, 0.071), (17, -0.105), (18, -0.135), (19, 0.128), (20, -0.204), (21, 0.142), (22, -0.116), (23, 0.021), (24, 0.31), (25, -0.099), (26, 0.028), (27, 0.014), (28, -0.09), (29, -0.101), (30, -0.055), (31, 0.057), (32, -0.145), (33, 0.002), (34, 0.164), (35, -0.014), (36, -0.083), (37, 0.246), (38, 0.05), (39, 0.154), (40, -0.189), (41, 0.107), (42, -0.029), (43, 0.143), (44, 0.067), (45, 0.095), (46, 0.045), (47, 0.047), (48, -0.002), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.96423006 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
Author: David Chiang
Abstract: In machine translation, discriminative models have almost entirely supplanted the classical noisychannel model, but are standardly trained using a method that is reliable only in low-dimensional spaces. Two strands of research have tried to adapt more scalable discriminative training methods to machine translation: the first uses log-linear probability models and either maximum likelihood or minimum risk, and the other uses linear models and large-margin methods. Here, we provide an overview of the latter. We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured outputs, and slow inference. We present experimental results on a large-scale ArabicEnglish translation task, demonstrating large gains in translation accuracy. Keywords: machine translation, structured prediction, large-margin methods, online learning, distributed computing
2 0.65641707 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of textcategorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. Keywords: online learning, confidence prediction, text categorization
3 0.34217355 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
Author: Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert Müller
Abstract: We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof for an identifiability criterion. Keywords: computational algebraic geometry, approximate algebra, unsupervised Learning
4 0.3003332 90 jmlr-2012-Pattern for Python
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
5 0.28554782 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
Author: Marius Kloft, Pavel Laskov
Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution,1 we analyze the performance of a particular method—online centroid anomaly detection—in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection
6 0.27240989 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
9 0.23734573 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
10 0.20309852 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
11 0.19330725 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
12 0.17992032 6 jmlr-2012-A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives
13 0.17356463 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
14 0.17277539 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
15 0.1568756 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
16 0.15542285 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
17 0.15467094 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
18 0.14850897 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
19 0.14195576 20 jmlr-2012-Analysis of a Random Forests Model
20 0.14068009 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
topicId topicWeight
[(2, 0.422), (7, 0.012), (21, 0.027), (26, 0.034), (27, 0.022), (29, 0.031), (35, 0.018), (56, 0.02), (57, 0.012), (75, 0.048), (77, 0.018), (79, 0.012), (81, 0.012), (92, 0.073), (96, 0.15)]
simIndex simValue paperId paperTitle
same-paper 1 0.7677964 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
2 0.38838887 91 jmlr-2012-Plug-in Approach to Active Learning
Author: Stanislav Minsker
Abstract: We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight. Keywords: active learning, selective sampling, model selection, classification, confidence bands
3 0.38158482 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
Author: Karthik Mohan, Maryam Fazel
Abstract: The problem of minimizing the rank of a matrix subject to affine constraints has applications in several areas including machine learning, and is known to be NP-hard. A tractable relaxation for this problem is nuclear norm (or trace norm) minimization, which is guaranteed to find the minimum rank matrix under suitable assumptions. In this paper, we propose a family of Iterative Reweighted Least Squares algorithms IRLS-p (with 0 ≤ p ≤ 1), as a computationally efficient way to improve over the performance of nuclear norm minimization. The algorithms can be viewed as (locally) minimizing certain smooth approximations to the rank function. When p = 1, we give theoretical guarantees similar to those for nuclear norm minimization, that is, recovery of low-rank matrices under certain assumptions on the operator defining the constraints. For p < 1, IRLSp shows better empirical performance in terms of recovering low-rank matrices than nuclear norm minimization. We provide an efficient implementation for IRLS-p, and also present a related family of algorithms, sIRLS-p. These algorithms exhibit competitive run times and improved recovery when compared to existing algorithms for random instances of the matrix completion problem, as well as on the MovieLens movie recommendation data set. Keywords: matrix rank minimization, matrix completion, iterative algorithms, null-space property
4 0.38019642 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
5 0.37584767 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
6 0.37458324 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
7 0.37262532 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
8 0.37188411 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
9 0.36962152 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
10 0.36714911 73 jmlr-2012-Multi-task Regression using Minimal Penalties
11 0.36613008 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
12 0.36524433 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
13 0.36456156 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
14 0.36442864 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
15 0.3643592 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
16 0.36435279 103 jmlr-2012-Sampling Methods for the Nyström Method
17 0.36413211 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
18 0.36393511 106 jmlr-2012-Sign Language Recognition using Sub-Units
19 0.3636103 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
20 0.36360759 80 jmlr-2012-On Ranking and Generalization Bounds