jmlr jmlr2012 jmlr2012-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Chiang
Abstract: In machine translation, discriminative models have almost entirely supplanted the classical noisy-channel model, but are standardly trained using a method that is reliable only in low-dimensional spaces. Two strands of research have tried to adapt more scalable discriminative training methods to machine translation: the first uses log-linear probability models and either maximum likelihood or minimum risk, and the other uses linear models and large-margin methods. Here, we provide an overview of the latter. We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured outputs, and slow inference. We present experimental results on a large-scale Arabic-English translation task, demonstrating large gains in translation accuracy. Keywords: machine translation, structured prediction, large-margin methods, online learning, distributed computing
Reference: text
sentIndex sentText sentNum sentScore
1 We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured outputs, and slow inference. [sent-5, score-0.265]
2 We present experimental results on a large-scale Arabic-English translation task, demonstrating large gains in translation accuracy. [sent-6, score-0.45]
3 Introduction Statistical machine translation (MT) aims to learn models that can predict, given some utterance in a source language, the best translation into some target language. [sent-8, score-0.45]
4 Och and Ney (2002) first proposed evolving this noisy-channel model into a discriminative log-linear model, which incorporated the language model and translation model as features. [sent-12, score-0.307]
5 This allowed the language model and translation model to be scaled by different factors, and allowed the addition of features beyond these two. [sent-13, score-0.252]
6 The loss function of choice is most often BLEU (rather, 1 − BLEU), which is the standard metric of translation quality used in current MT research (Papineni et al., 2002). [sent-15, score-0.253]
7 4 BLEU over MERT in a large-scale Arabic-English translation task. [sent-34, score-0.225]
8 We discuss three novel extensions of these algorithms that adapt them to particular properties of the translation task. [sent-35, score-0.225]
9 We find that training the model to generate the reference exactly can be too brittle; instead, we propose to update the model towards hope translations which compromise between the reference translation and translations that are easier for the model to generate (Section 4). [sent-37, score-0.572]
10 Second, translation involves a large space of structured outputs. [sent-38, score-0.265]
11 Third, inference in translation tends to be very slow. [sent-40, score-0.225]
12 Setting. In this paper, models are defined over derivations d, which are objects that encapsulate an input sentence f(d), an output sentence e(d), and possibly other information. [sent-47, score-0.418]
13 For any input sentence f, let D(f) be the set of all valid derivations d such that f(d) = f. [sent-48, score-0.305]
14 The 1-best or Viterbi derivation of f_i is d̂ = arg max_{d ∈ D(f_i)} w · h(d), and the 1-best or Viterbi translation is ê = e(d̂). [sent-51, score-0.755]
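For concreteness, here is a minimal sketch of this linear scoring setup over an explicit set of candidate derivations; the dictionary-based representation and field names are illustrative assumptions, not the system's actual data structures.

```python
# Minimal sketch: linear scoring of derivations and Viterbi selection.
# Each derivation is assumed to be a dict carrying a sparse feature map "h"
# and an output string "e"; these names are illustrative only.

def score(w, d):
    """Model score w . h(d) of a derivation d under sparse weights w."""
    return sum(w.get(f, 0.0) * v for f, v in d["h"].items())

def viterbi(w, derivations):
    """1-best (Viterbi) derivation among an explicit candidate set."""
    return max(derivations, key=lambda d: score(w, d))

# Toy usage: two candidate derivations of the same input sentence.
w = {"lm": 0.5, "tm": 1.0}
D_f = [{"e": "the house", "h": {"lm": -2.0, "tm": -1.0}},
       {"e": "house the", "h": {"lm": -5.0, "tm": -0.5}}]
d_hat = viterbi(w, D_f)        # Viterbi derivation
e_hat = d_hat["e"]             # Viterbi translation e(d_hat)
```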
15 Each e_i is not the only correct translation of f_i, but only one of many. [sent-59, score-0.512]
16 For this reason, multiple reference translations are often available for each f_i. [sent-60, score-0.31]
17 Note that although the model is defined over derivations, only sentence pairs (f_i, e_i) are observed. [sent-64, score-0.4]
18 Nevertheless, assume for the moment that we can choose a reference derivation d_i that derives e_i; we discuss various ways of choosing d_i in Section 4. [sent-66, score-0.684]
19 Derivation Forests. The methods described in this paper should work with a wide variety of translation models, but, for concreteness, we assume a model defined using a weighted synchronous context-free grammar or related formalism (Chiang, 2007). [sent-68, score-0.225]
20 In models of this type, derivations can be thought of as trees, and the set of derivations D(f) is called a forest. [sent-70, score-0.384]
21 Let c be the candidate translation to be evaluated and let r be the reference translation. [sent-83, score-0.282]
22 Loss Function. Our learning algorithms assume a loss function ℓ_i(e, e_i) that indicates how bad it is to guess e instead of the reference e_i. [sent-101, score-0.271]
23 Even barring such problems, a BLEU score for a single sentence may not accurately reflect the impact of that sentence on the whole test set (Chiang et al., 2008). [sent-105, score-0.275]
24 That is, after processing each training example (f_i, e_i), we update the oracle document using the 1-best translation ê: b ← 0.9 · (b + b(ê, e_i)). [sent-117, score-0.568]
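A rough sketch of the decayed pseudo-document idea: BLEU sufficient statistics are accumulated from each 1-best translation with an exponential decay (taken to be 0.9 here, matching the update above), and each sentence is then scored in the context of that pseudo-document. The statistics layout and class interface below are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_stats(hyp, ref, max_n=4):
    """Per-n (matches, guesses) plus hypothesis/reference lengths."""
    stats = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        stats.append((sum(min(c, r[g]) for g, c in h.items()), sum(h.values())))
    return stats, len(hyp), len(ref)

def bleu_from_stats(stats, hyp_len, ref_len):
    if any(m == 0 or g == 0 for m, g in stats):
        return 0.0
    log_prec = sum(math.log(m / g) for m, g in stats) / len(stats)
    bp = min(0.0, 1.0 - ref_len / hyp_len)      # brevity penalty (log space)
    return math.exp(log_prec + bp)

class OracleDocument:
    """Exponentially decayed pseudo-document of BLEU statistics (decay assumed 0.9)."""

    def __init__(self, max_n=4, decay=0.9):
        self.stats = [(0.0, 0.0)] * max_n
        self.hyp_len = 0.0
        self.ref_len = 0.0
        self.decay = decay

    def sentence_bleu(self, hyp, ref):
        """BLEU of one sentence evaluated in the context of the pseudo-document."""
        s, hl, rl = bleu_stats(hyp, ref)
        combined = [(a + m, b + g) for (a, b), (m, g) in zip(self.stats, s)]
        return bleu_from_stats(combined, self.hyp_len + hl, self.ref_len + rl)

    def update(self, best_hyp, ref):
        """Fold the 1-best translation into the pseudo-document, with decay."""
        s, hl, rl = bleu_stats(best_hyp, ref)
        self.stats = [(self.decay * (a + m), self.decay * (b + g))
                      for (a, b), (m, g) in zip(self.stats, s)]
        self.hyp_len = self.decay * (self.hyp_len + hl)
        self.ref_len = self.decay * (self.ref_len + rl)
```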
25 Finally, we can define the loss of a translation e relative to e′ as the difference between their B scores, following Watanabe et al. (2007). [sent-121, score-0.253]
26 That is, our learning problem is to minimize L(w) = (1/N) Σ_i L_i(w), (4) where L_i(w) = max_{d ∈ D(f_i)} v_i(w, d, d_i) and v_i(w, d, d_i) = ℓ_i(d, d_i) − w · (h(d_i) − h(d)). [sent-125, score-1.044]
27 Note that since d_i ∈ D(f_i) and v_i(w, d_i, d_i) = 0, L_i(w) is always nonnegative. [sent-126, score-0.937]
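A small sketch of evaluating L_i(w) over an explicit candidate list; in the actual system the max runs over a derivation forest, so the flat list and field names here are simplifying assumptions.

```python
def dot(w, h):
    return sum(w.get(f, 0.0) * v for f, v in h.items())

def hinge_loss(w, candidates, d_ref):
    """L_i(w) = max_d [ loss(d) - w . (h(d_ref) - h(d)) ] over an explicit list.

    Each candidate is a dict with a sparse feature map "h" and a precomputed
    "loss" relative to the reference translation; these field names are assumed."""
    def v(d):
        return d["loss"] - (dot(w, d_ref["h"]) - dot(w, d["h"]))
    fear = max(candidates, key=v)          # worst violator ("fear" derivation)
    return max(0.0, v(fear)), fear
```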
28 We now review the derivations of several existing algorithms for optimizing (4) for structured models. [sent-127, score-0.232]
29 In SGD, we consider one component L_i(w) of the objective function at a time and update w by the subgradient: w ← w − η ∇L_i(w), (5) where ∇L_i(w) = −(h(d_i) − h(d+)) and d+ = arg max_{d ∈ D(f_i)} v_i(w, d, d_i). [sent-132, score-0.38]
30 If, as an approximation, we restrict D(f_i) to just the 1-best derivation of f_i, then we get the structured perceptron algorithm (Rosenblatt, 1958; Freund and Schapire, 1999; Collins, 2002). [sent-133, score-0.767]
31 An easy way to approximate the fear derivation would be to generate an n-best list and select the derivation from it that maximizes v_i. [sent-140, score-0.523]
32 The terminology of fear derivations and hope derivations, to be defined below, is due to Kevin Knight. [sent-143, score-0.639]
33 Algorithm 1 (averaged stochastic subgradient descent):
    for i ∈ {1, …, N} in random order do
        UPDATEWEIGHTS(w, i)
        s ← s + w
        t ← t + 1
    w ← s/t
    procedure UPDATEWEIGHTS(w, i)
        d+ ← arg max_{d ∈ D(f_i)} v_i(w, d, d_i)
        w ← w + η (h(d_i) − h(d+)) [sent-150, score-0.545]
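A sketch of the averaged subgradient loop above, written over explicit candidate lists; in the real system the candidates come from decoding each source sentence under the current weights, so the data layout here is an assumption.

```python
import random

def dot(w, h):
    return sum(w.get(f, 0.0) * v for f, v in h.items())

def averaged_sgd(examples, w, eta=0.01, epochs=1):
    """Averaged stochastic subgradient descent on the hinge objective (sketch).

    examples: list of (candidates, d_ref) pairs, where each derivation is a dict
    with a sparse feature map "h" and a "loss" relative to the reference.  In the
    real system, candidates would be re-decoded under the current w each time."""
    s, t = {}, 0
    for _ in range(epochs):
        for candidates, d_ref in random.sample(examples, len(examples)):
            # Fear derivation: maximizes v_i(w, d, d_ref) = loss(d) - w.(h(d_ref) - h(d)).
            fear = max(candidates,
                       key=lambda d: d["loss"] - (dot(w, d_ref["h"]) - dot(w, d["h"])))
            # Subgradient step: w <- w + eta * (h(d_ref) - h(fear)).
            for f, v in d_ref["h"].items():
                w[f] = w.get(f, 0.0) + eta * v
            for f, v in fear["h"].items():
                w[f] = w.get(f, 0.0) - eta * v
            # Accumulate for averaging.
            for f, v in w.items():
                s[f] = s.get(f, 0.0) + v
            t += 1
    return {f: v / t for f, v in s.items()} if t else dict(w)
```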
34 It is more commonly presented as a quadratic program (QP): minimize (1/2η) ‖w′ − w‖² + ξ_i subject to v_i(w′, d, d_i) − ξ_i ≤ 0 for all d ∈ D(f_i), where ξ_i is a slack variable. [sent-157, score-0.513]
35 (Note that ξ_i ≥ 0 since d_i ∈ D(f_i) and v_i(w′, d_i, d_i) = 0.) [sent-158, score-0.937]
36 The Lagrangian is: (1/2η) ‖w′ − w‖² + ξ_i + Σ_{d ∈ D(f_i)} α_d (v_i(w′, d, d_i) − ξ_i). (7) [sent-159, score-0.212]
37 Substituting back into (7), we get the following dual problem: maximize −(η/2) ‖Σ_{d ∈ D(f_i)} α_d (h(d_i) − h(d))‖² + Σ_{d ∈ D(f_i)} α_d v_i(w, d, d_i), subject to Σ_{d ∈ D(f_i)} α_d = 1 and α_d ≥ 0 for all d ∈ D(f_i). [sent-164, score-1.289]
38 In machine translation, and in structured prediction in general, the number of hypotheses in D(f_i), and therefore the number of constraints in the QP, can be exponential or worse. [sent-165, score-0.234]
39 (2004), which repeatedly recomputes the fear derivation and adds it to a working set S_i of derivations on which the QP is optimized (Algorithm 2). [sent-171, score-0.498]
40 A new fear derivation is added to the working set only if it is a worse violator by a certain margin (ε); otherwise, the algorithm terminates. [sent-172, score-0.306]
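A sketch of this working-set loop: repeatedly add the current worst violator (the fear derivation) and re-solve the small QP, stopping once no candidate violates by more than ε. The QP re-optimization is passed in as a callback, standing in for the SMO-style pair updates described below.

```python
def working_set_mira(w, candidates, d_ref, qp_optimize, epsilon=0.01, max_iters=50):
    """Working-set (cutting-plane) loop for one example (sketch).

    qp_optimize(w, S, d_ref) -> new w re-solves the small QP over the working
    set S, e.g., by the SMO-style pair updates sketched later; it is supplied
    by the caller.  Derivations are dicts with "h" and "loss" fields (assumed)."""
    def v(d):
        feats = set(d_ref["h"]) | set(d["h"])
        return d["loss"] - sum(w.get(f, 0.0) * (d_ref["h"].get(f, 0.0) - d["h"].get(f, 0.0))
                               for f in feats)
    S = [d_ref]
    for _ in range(max_iters):
        fear = max(candidates, key=v)            # current worst violator
        slack = max(v(d) for d in S)             # current value of xi_i
        if v(fear) <= slack + epsilon:           # not a worse violator by margin eps
            break
        S.append(fear)
        w = qp_optimize(w, S, d_ref)             # re-optimize over the working set
    return w, S
```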
41 80) to select a pair of constraints: one must violate one of the KKT conditions (α_d (v_i(w′, d, d_i) − ξ_i) = 0), and the other must allow the objective to be improved. [sent-175, score-0.212]
42 More accurately, we took the union of the 10 best derivations, the top 10 fear derivations, and the top 10 hope derivations (to be defined below). [sent-182, score-0.447]
43 For example, even a small change in the language model weights could result in a large change in translation length and fluency, whereas large changes in features like those attached to number-translation rules have a relatively small effect. [sent-191, score-0.252]
44 This change has no effect on the scores assigned to derivations or the translations generated, so intuitively one would hope that it also has no effect on learning. [sent-193, score-0.348]
45 First, line 34 is replaced with: δ ← [v_i(w, d′, d_i) − v_i(w, d′′, d_i)] / [(h(d′) − h(d′′)) Σ (h(d′) − h(d′′))], and line 37 is replaced with: w ← w − Σ δ (h(d′) − h(d′′)). [sent-210, score-0.638]
46 Procedure OPTIMIZEPAIR(w, i, d′, d′′):
    δ ← [v_i(w, d′, d_i) − v_i(w, d′′, d_i)] / [(h(d′) − h(d′′)) Σ (h(d′) − h(d′′))]
    δ ← max(−α_{d′}, min(α_{d′′}, δ))
    α_{d′} ← α_{d′} + δ
    α_{d′′} ← α_{d′′} − δ
    w ← w − Σ δ (h(d′) − h(d′′)) [sent-226, score-0.638]
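A sketch of the OPTIMIZEPAIR step above, with Σ simplified to a per-feature diagonal stored as a dict; the derivation fields ("id", "h", "loss") are assumed names.

```python
def optimize_pair(w, alpha, d1, d2, d_ref, Sigma):
    """One SMO-style step on the constraint pair (d1, d2) -- a sketch of
    OPTIMIZEPAIR with Sigma simplified to a per-feature diagonal (dict)."""
    def v(d):  # v_i(w, d, d_ref)
        feats = set(d_ref["h"]) | set(d["h"])
        return d["loss"] - sum(w.get(f, 0.0) * (d_ref["h"].get(f, 0.0) - d["h"].get(f, 0.0))
                               for f in feats)

    diff_feats = set(d1["h"]) | set(d2["h"])
    diff = {f: d1["h"].get(f, 0.0) - d2["h"].get(f, 0.0) for f in diff_feats}
    denom = sum(Sigma.get(f, 1.0) * diff[f] ** 2 for f in diff_feats)
    if denom == 0.0:
        return
    delta = (v(d1) - v(d2)) / denom
    delta = max(-alpha[d1["id"]], min(alpha[d2["id"]], delta))   # keep both alphas >= 0
    alpha[d1["id"]] += delta
    alpha[d2["id"]] -= delta
    for f in diff_feats:                       # w <- w - Sigma * delta * (h(d1) - h(d2))
        w[f] = w.get(f, 0.0) - Sigma.get(f, 1.0) * delta * diff[f]
```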
47 The Reference Derivation. We have been assuming that d_i is the derivation of the reference translation e_i. [sent-228, score-0.697]
48 Bold/Max-BLEU Updating. It can happen that there does not exist any derivation of e_i, for example, if e_i contains a word never seen before in training. [sent-232, score-0.296]
49 find that even when it is possible to find a d_i that exactly generates e_i, it is not necessarily desirable to update the model towards it, because it may be a bad derivation of a good translation. [sent-237, score-0.444]
50 A very literal translation might be, A piece of a salted biscuit, a “pretzel,” blocked his throat. [sent-240, score-0.254]
51 But the reference translation is in fact: A pretzel, a salted biscuit, became lodged in his throat. [sent-241, score-0.311]
52 While accurate, this translation swaps grammatical roles in a way that is still difficult for statistical MT systems to model. [sent-242, score-0.225]
53 If the system happens to have some bad rules that translate sd qTEp mn as a pretzel and “ brytzl ” as became lodged in, then it can use these bad rules to obtain a perfect translation, but using this derivation as the reference derivation would only reinforce the use of these bad rules. [sent-243, score-0.315]
54 A derivation of the more literal translation would probably serve better as the reference translation. [sent-244, score-0.392]
55 Local Updating. The most common way to do this has been to generate the n-best derivations according to the model and to choose the one with the lowest loss (Och and Ney, 2002). [sent-247, score-0.22]
56 (2007) generate a 1000-best list and select either the derivation with lowest loss or the 10 derivations with lowest loss. [sent-251, score-0.33]
57 The idea is that restricting to derivations with a higher model score will filter out derivations that use bad, low-probability rules. [sent-252, score-0.433]
58 We suppose that for each f_i, the reference derivation d_i is unknown, and it doesn't necessarily derive the reference translation e_i, but we add a term to the objective function that says that we want d_i to have low loss relative to e_i. [sent-257, score-1.281]
59 w ← arg min_{w′} min_{d_i ∈ D(f_i)} [ (1/2η) ‖w′ − w‖² + max_{d ∈ D(f_i)} v_i(w′, d, d_i) + (1 − µ) ℓ_i(d_i, e_i) ]. [sent-258, score-0.656]
60 The parameter µ < 0 controls how strongly we want d_i to have low loss. [sent-259, score-0.406]
61 Then the optimization reduces to d_i = arg max_{d ∈ D(f_i)} (µ ℓ_i(d, e_i) + w · h(d)). (9) [sent-261, score-0.337]
62 Then, we optimize with respect to w′, holding d_i constant. [sent-262, score-0.406]
63 We call d_i chosen according to (9) the hope derivation. [sent-264, score-0.271]
64 If we let µ = −1, the definition of the hope derivation becomes conveniently symmetric with the fear derivation: d_i = arg max_{d ∈ D(f_i)} (−ℓ_i(d, e_i) + w · h(d)). [sent-266, score-0.702]
65 Both the hope and fear derivations try to maximize the model score, but the fear derivation maximizes the loss whereas the hope derivation minimizes the loss. [sent-267, score-1.144]
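A tiny sketch of this symmetric hope/fear selection over an explicit candidate list (in practice the search runs over the whole forest, as described next); the field names are assumptions.

```python
def dot(w, h):
    return sum(w.get(f, 0.0) * v for f, v in h.items())

def hope_and_fear(w, candidates):
    """Symmetric hope/fear selection over an explicit candidate list (sketch).

    Each candidate carries a sparse feature map "h" and a sentence-level "loss"
    (e.g., 1 - BLEU in context); the field names are assumptions."""
    hope = max(candidates, key=lambda d: dot(w, d["h"]) - d["loss"])
    fear = max(candidates, key=lambda d: dot(w, d["h"]) + d["loss"])
    return hope, fear
```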
66 Searching for Hope and Fear. As mentioned above, one simple way of approximating either the hope or fear derivation is to generate an n-best list and choose from it the derivation that maximizes (9) or v_i, respectively. [sent-269, score-0.582]
67 Since h_LMk is decomposable onto hyperedges by assumption, it is safe to assume that g_k is also decomposable onto hyperedges, and so is n_k, which is the cardinality of g_k. [sent-280, score-0.311]
68 Suppose our reference sentence is Australia is one of the few countries that have diplomatic relations with North Korea and we have two partial translations: the few … [sent-282, score-0.262]
69 Figure 1 (x-axis: model score w · h): Using loss-augmented inference to search for fear translations in the whole forest is better than searching in the n-best list. [sent-291, score-0.368]
70 The red square in the upper-right is the fear derivation obtained by loss-augmented inference, whereas the red square inside the box labeled “100-best” is the fear derivation selected from the 100-best list. [sent-293, score-0.612]
71 Then we can apportion ρ among hyperedges according to how much of the input sentence they consume: ρ(v → v̄) = (ρ / |f_i|) (|f(v)| − Σ_{v′ ∈ v̄} |f(v′)|), (10) where f(v) is the part of the input sentence covered by the subderivation rooted at v. [sent-302, score-0.487]
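A sketch of the apportionment in (10), assuming a hypergraph where each hyperedge records the source span of its head node and of its tail nodes; summed over the hyperedges of any single derivation, the shares telescope back to ρ.

```python
def apportion(rho, hyperedges, src_len):
    """Spread a sentence-level quantity rho over hyperedges as in (10).

    Each hyperedge is assumed to carry "head_span" (i, j) and "tail_spans"
    [(i1, j1), ...], the source spans of its head and tail nodes."""
    shares = []
    for edge in hyperedges:
        i, j = edge["head_span"]
        newly_covered = (j - i) - sum(j2 - i2 for i2, j2 in edge["tail_spans"])
        shares.append(rho * newly_covered / src_len)
    # Along any single derivation, newly_covered telescopes to the full
    # sentence length, so the shares on that derivation sum to rho.
    return shares
```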
72 To search for the hope or fear derivation, we use the following dynamic program: vderiv(v) = arg max_{d ∈ {vderiv(v → v̄)}} φ(d), vderiv(v → v̄) = {v → v̄} ∪ ⋃_{v′ ∈ v̄} vderiv(v′), where φ is one of the following: φ(d) = w · h(d) + B(b(d, e_i)) (hope), φ(d) = w · h(d) − B(b(d, e_i)) (fear). [sent-309, score-0.473]
73 Note that maximizing w · h(d) + B(b(d, e_i)) is equivalent to maximizing w · h(d) − ℓ_i(d, e_i), since they differ by only a constant; likewise, maximizing w · h(d) − B(b(d, e_i)) is equivalent to maximizing w · h(d) + ℓ_i(d, e_i). [sent-310, score-0.372]
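A sketch of the vderiv dynamic program as bottom-up Viterbi over a topologically ordered hypergraph, with the edge score taken to be the local model score plus (hope) or minus (fear) an edge-local approximation of the BLEU gain; the node/edge layout and the precomputed bleu_gain field are assumptions.

```python
def forest_search(nodes, w, sign=+1):
    """Bottom-up Viterbi with a loss-augmented edge score (sketch of vderiv).

    nodes: topologically ordered (tails before heads).  Each node is a dict with
    "edges"; each edge has "tails" (indices of earlier nodes), a sparse feature
    map "h", and "bleu_gain", an edge-local approximation of its contribution
    to B.  sign=+1 searches for the hope derivation, sign=-1 for the fear."""
    best = {}                                  # node index -> (score, best incoming edge)
    for idx, node in enumerate(nodes):
        if not node["edges"]:                  # leaf / axiom node
            best[idx] = (0.0, None)
            continue
        best[idx] = None
        for edge in node["edges"]:
            score = sum(w.get(f, 0.0) * v for f, v in edge["h"].items())
            score += sign * edge["bleu_gain"]
            score += sum(best[t][0] for t in edge["tails"])
            if best[idx] is None or score > best[idx][0]:
                best[idx] = (score, edge)
    return best                                # the root is the last node in the ordering
```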
74 The other partial translation is North Korea with relations diplomatic have that countries few the of one is. Translation #1 has 4 unigram matches and 3 bigram matches, for a BLEU-2 score of 12/156; translation #2 has 13 unigram matches and 1 bigram match, for a BLEU-2 score of 13/156. [sent-317, score-0.356]
75 If we extend both translations, however, with the word Australia, giving them each an extra unigram match, then translation #1 gets a BLEU-2 score of 15/156, and translation #2, 14/156. [sent-318, score-0.499]
76 After we find a hope or fear derivation, we recalculate its exact BLEU score, without any of the approximations described in this section. [sent-320, score-0.255]
77 Parallelization. Because inference is so slow for the translation task, and especially for the CKY-based decoder we are using, parallelization is critical. [sent-322, score-0.299]
78 Thus, when each learner works on a training example ( fi , ei ), it optimizes the QP on it along with all of the working sets it received from other nodes. [sent-346, score-0.314]
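The experiments compare two parallelization schemes: asynchronous sharing of working sets, described above, and iterative parameter mixing. Below is a sketch of the simpler scheme, iterative parameter mixing, with the per-shard training pass supplied by the caller (e.g., one epoch of the SGD sketch earlier); real workers would run in parallel rather than in this serial simulation.

```python
def iterative_parameter_mixing(shards, w0, train_one_epoch, n_epochs=5):
    """Iterative parameter mixing (sketch): each shard trains from the mixed
    weights for one epoch, and the resulting weight vectors are averaged.

    train_one_epoch(w, shard) -> new weight dict is supplied by the caller.
    The list comprehension below is a serial stand-in for parallel workers."""
    w = dict(w0)
    for _ in range(n_epochs):
        results = [train_one_epoch(dict(w), shard) for shard in shards]
        feats = set().union(*(r.keys() for r in results))
        w = {f: sum(r.get(f, 0.0) for r in results) / len(results) for f in feats}
    return w
```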
79 Experiments. We experimented with the methods described above on the hierarchical phrase-based translation system Hiero (Chiang, 2005, 2007), using two feature sets. [sent-351, score-0.225]
80 Default settings: • learning rate η = 0.01 • Hope derivations with µ = −1 • Forest reranking for hope/fear derivations • Iterative parameter mixing on 20 processors. A few probability features have to be initialized carefully: the two language models and the two phrase translation probability models. [sent-362, score-0.739]
81 (b) More negative values of the loss weight µ for hope derivations lead to higher initial performance, whereas less negative loss weights lead to higher final performance. [sent-398, score-0.35]
82 To control for this, we ran SGD (on the hinge loss) using both forest reranking and linear BLEU to search for hope/fear derivations (Figure 3a). [sent-412, score-0.371]
83 A weight of µ = −1 appears to be a good tradeoff, and is symmetrical with the weight of 1 used when computing fear derivations. [sent-417, score-0.282]
84 Figure 5 (development BLEU vs. epoch; curves: serial and asynchronous with p = 2, 5, 10, 20, 50): Taking a closer look at asynchronous sharing of working sets, we see that, at each epoch, greater parallelization generally gives better performance. [sent-423, score-0.789]
85 Results table (flattened by extraction): columns are model (small ×4, large ×4), objective (1 − BLEU, hinge, risk, then hinge), algorithm (MERT, SGD with η = 0.01, …), approximation (–, rerank, linear, then rerank), parallelization (–, then IPM and async), epoch (6, 6, 8, 4, 5, 7, 9, 4), and dev/test BLEU (values truncated). [sent-455, score-0.234] [sent-462, score-0.677]
87 We extended all of these methods in novel ways to cope with the large structured search space of the translation task, that is, to use as much of the translation forest as possible. [sent-484, score-0.554]
88 Acknowledgments. This work evolved over time to support several projects adding new features to the ISI machine translation systems, and would not have been possible without my collaborators on those projects: Steve DeNeefe, Kevin Knight, Yuval Marton, Michael Pust, Philip Resnik, and Wei Wang. [sent-499, score-0.225]
89 Objective Function. Define a probabilistic version of the model, P_T(d | f_i) ∝ exp((1/T) w · h(d)), where T is a temperature parameter, and for any random variable X over derivations, define E_T[X | f_i] = Σ_{d ∈ D(f_i)} P_T(d | f_i) X(d). [sent-510, score-0.582]
90 In minimum-risk training, we want to minimize Σ_i E_T[ℓ_i(d, d_i) | f_i] for T = 1. [sent-511, score-0.6]
91 The gradient for a single example is: ∇E_T[ℓ_i(d, d_i) | f_i] = (1/T) (E_T[ℓ_i h | f_i] − E_T[ℓ_i | f_i] E_T[h | f_i]), or, in terms of B: ∇E_T[ℓ_i(d, d_i) | f_i] = −∇E_T[B(b(d, e_i)) | f_i] = −(1/T) (E_T[B h | f_i] − E_T[B | f_i] E_T[h | f_i]). (11) [sent-514, score-2.263]
92 A major advantage that minimum-risk has over the large-margin methods explored in this paper is that it does not require a reference derivation, or a hope derivation as a proxy for the reference derivation. [sent-515, score-0.283]
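A sketch of the risk gradient in (11) computed over an explicit candidate list with a softmax at temperature T; the forest-level expectations used in practice require expectation semirings, so the flat list and field names here are simplifications.

```python
import math

def risk_gradient(w, candidates, T=1.0):
    """Gradient of E_T[loss] over an explicit candidate list, as in (11).

    Each candidate has a sparse feature map "h" and a "loss"; names are assumed."""
    scores = [sum(w.get(f, 0.0) * v for f, v in d["h"].items()) / T for d in candidates]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]            # numerically stable softmax
    Z = sum(exps)
    p = [e / Z for e in exps]

    feats = set().union(*(d["h"].keys() for d in candidates))
    E_h = {f: sum(pi * d["h"].get(f, 0.0) for pi, d in zip(p, candidates)) for f in feats}
    E_l = sum(pi * d["loss"] for pi, d in zip(p, candidates))
    E_lh = {f: sum(pi * d["loss"] * d["h"].get(f, 0.0) for pi, d in zip(p, candidates)) for f in feats}
    return {f: (E_lh[f] - E_l * E_h[f]) / T for f in feats}
```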
93 (2010) show that for applications where the input space is continuous (as in speech processing), a perceptron-like update using the hope and 1-best derivations, or the 1-best and fear derivations, approaches the gradient of the loss. [sent-521, score-0.284]
94 Consider a single training example ( fi , ei ), so that we can simply write ℓ for ℓi and ET [X] for ET [X | fi ]. [sent-523, score-0.508]
95 Define a loss-augmented model: 1 Pµ (d | fi ) ∝ exp (w · h(d) + µℓ(d, di )) µ and define ∑ Eµ [X] = Pµ (d | fi )X(d). [sent-524, score-0.6]
96 Having made this approximation, there is no harm in letting T = 0, so that the expectations of h become the value of h at the mode of the underlying distribution: w ← w − η (h(d_{+1}) − h(d_{−1})), where d_{+1} = arg max_d (w · h(d) + ℓ(d, d_i)) and d_{−1} = arg max_d (w · h(d) − ℓ(d, d_i)). [sent-528, score-0.488]
97 But this is exactly the SGD update on the generalized hinge loss (5), with d+ = d_{+1} being the fear derivation and d_i = d_{−1} being the hope derivation. [sent-529, score-0.673]
98 Decomposability of translation metrics for improved evaluation and efficient algorithms. [sent-577, score-0.225]
99 Online large-margin training of syntactic and structural translation features. [sent-580, score-0.252]
100 First- and second-order expectation semirings with applications to minimum-risk training on translation forests. [sent-635, score-0.252]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
2 0.10972196 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of textcategorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. Keywords: online learning, confidence prediction, text categorization
3 0.050492022 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
Author: Sunita Nayak, Kester Duncan, Sudeep Sarkar, Barbara Loeding
Abstract: We present a probabilistic framework to automatically learn models of recurring signs from multiple sign language video sequences containing the vocabulary of interest. We extract the parts of the signs that are present in most occurrences of the sign in context and are robust to the variations produced by adjacent signs. Each sentence video is first transformed into a multidimensional time series representation, capturing the motion and shape aspects of the sign. Skin color blobs are extracted from frames of color video sequences, and a probabilistic relational distribution is formed for each frame using the contour and edge pixels from the skin blobs. Each sentence is represented as a trajectory in a low dimensional space called the space of relational distributions. Given these time series trajectories, we extract signemes from multiple sentences concurrently using iterated conditional modes (ICM). We show results by learning single signs from a collection of sentences with one common pervading sign, multiple signs from a collection of sentences with more than one common sign, and single signs from a mixed collection of sentences. The extracted signemes demonstrate that our approach is robust to some extent to the variations produced within a sign due to different contexts. We also show results whereby these learned sign models are used for spotting signs in test sequences. Keywords: pattern extraction, sign language recognition, signeme extraction, sign modeling, iterated conditional modes
4 0.04522679 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
Author: Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert Müller
Abstract: We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof for an identifiability criterion. Keywords: computational algebraic geometry, approximate algebra, unsupervised Learning
5 0.041778177 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
Author: Nir Ailon
Abstract: Given a set V of n elements we wish to linearly order them given pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: The number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(ε−6 n log5 n) preference labels for a regret of ε times the optimal loss. As a function of n, this is asymptotically better than standard (non-adaptive) learning bounds achievable for the same problem. Our main result takes us a step closer toward settling an open problem posed by learning-torank (from pairwise information) theoreticians and practitioners: What is a provably correct way to sample preference labels? To further show the power and practicality of our solution, we analyze a typical test case in which a large margin linear relaxation is used for efficiently solving the simpler learning problems in our decomposition. Keywords: statistical learning theory, active learning, ranking, pairwise ranking, preferences
6 0.039368577 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
7 0.038679823 90 jmlr-2012-Pattern for Python
8 0.037894536 25 jmlr-2012-Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs
9 0.036635332 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
10 0.035820097 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
11 0.032291234 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
12 0.032261986 20 jmlr-2012-Analysis of a Random Forests Model
13 0.031612266 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
14 0.030106347 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
15 0.029092902 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
16 0.029010544 6 jmlr-2012-A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives
17 0.027816949 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
18 0.027755184 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
19 0.027586179 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
20 0.027135078 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
topicId topicWeight
[(0, -0.137), (1, 0.003), (2, 0.085), (3, -0.028), (4, 0.023), (5, -0.028), (6, 0.102), (7, 0.019), (8, 0.047), (9, -0.026), (10, 0.019), (11, 0.077), (12, -0.056), (13, -0.066), (14, 0.048), (15, 0.064), (16, 0.071), (17, -0.105), (18, -0.135), (19, 0.128), (20, -0.204), (21, 0.142), (22, -0.116), (23, 0.021), (24, 0.31), (25, -0.099), (26, 0.028), (27, 0.014), (28, -0.09), (29, -0.101), (30, -0.055), (31, 0.057), (32, -0.145), (33, 0.002), (34, 0.164), (35, -0.014), (36, -0.083), (37, 0.246), (38, 0.05), (39, 0.154), (40, -0.189), (41, 0.107), (42, -0.029), (43, 0.143), (44, 0.067), (45, 0.095), (46, 0.045), (47, 0.047), (48, -0.002), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.96423006 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
Author: David Chiang
Abstract: In machine translation, discriminative models have almost entirely supplanted the classical noisychannel model, but are standardly trained using a method that is reliable only in low-dimensional spaces. Two strands of research have tried to adapt more scalable discriminative training methods to machine translation: the first uses log-linear probability models and either maximum likelihood or minimum risk, and the other uses linear models and large-margin methods. Here, we provide an overview of the latter. We compare several learning algorithms and describe in detail some novel extensions suited to properties of the translation task: no single correct output, a large space of structured outputs, and slow inference. We present experimental results on a large-scale ArabicEnglish translation task, demonstrating large gains in translation accuracy. Keywords: machine translation, structured prediction, large-margin methods, online learning, distributed computing
2 0.65641707 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of textcategorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing. Keywords: online learning, confidence prediction, text categorization
3 0.34217355 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
Author: Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert Müller
Abstract: We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof for an identifiability criterion. Keywords: computational algebraic geometry, approximate algebra, unsupervised Learning
4 0.3003332 90 jmlr-2012-Pattern for Python
Author: Tom De Smedt, Walter Daelemans
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. Keywords: Python, data mining, natural language processing, machine learning, graph networks
5 0.28554782 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
Author: Marius Kloft, Pavel Laskov
Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution,1 we analyze the performance of a particular method—online centroid anomaly detection—in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection
6 0.27240989 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
9 0.23734573 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
10 0.20309852 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
11 0.19330725 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
12 0.17992032 6 jmlr-2012-A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives
13 0.17356463 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs
14 0.17277539 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
15 0.1568756 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
16 0.15542285 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
17 0.15467094 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
18 0.14850897 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
19 0.14195576 20 jmlr-2012-Analysis of a Random Forests Model
20 0.14068009 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
topicId topicWeight
[(2, 0.422), (7, 0.012), (21, 0.027), (26, 0.034), (27, 0.022), (29, 0.031), (35, 0.018), (56, 0.02), (57, 0.012), (75, 0.048), (77, 0.018), (79, 0.012), (81, 0.012), (92, 0.073), (96, 0.15)]
simIndex simValue paperId paperTitle
same-paper 1 0.7677964 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
2 0.38838887 91 jmlr-2012-Plug-in Approach to Active Learning
Author: Stanislav Minsker
Abstract: We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight. Keywords: active learning, selective sampling, model selection, classification, confidence bands
3 0.38158482 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
Author: Karthik Mohan, Maryam Fazel
Abstract: The problem of minimizing the rank of a matrix subject to affine constraints has applications in several areas including machine learning, and is known to be NP-hard. A tractable relaxation for this problem is nuclear norm (or trace norm) minimization, which is guaranteed to find the minimum rank matrix under suitable assumptions. In this paper, we propose a family of Iterative Reweighted Least Squares algorithms IRLS-p (with 0 ≤ p ≤ 1), as a computationally efficient way to improve over the performance of nuclear norm minimization. The algorithms can be viewed as (locally) minimizing certain smooth approximations to the rank function. When p = 1, we give theoretical guarantees similar to those for nuclear norm minimization, that is, recovery of low-rank matrices under certain assumptions on the operator defining the constraints. For p < 1, IRLSp shows better empirical performance in terms of recovering low-rank matrices than nuclear norm minimization. We provide an efficient implementation for IRLS-p, and also present a related family of algorithms, sIRLS-p. These algorithms exhibit competitive run times and improved recovery when compared to existing algorithms for random instances of the matrix completion problem, as well as on the MovieLens movie recommendation data set. Keywords: matrix rank minimization, matrix completion, iterative algorithms, null-space property
4 0.38019642 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
5 0.37584767 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
6 0.37458324 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
7 0.37262532 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
8 0.37188411 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
9 0.36962152 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
10 0.36714911 73 jmlr-2012-Multi-task Regression using Minimal Penalties
11 0.36613008 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
12 0.36524433 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
13 0.36456156 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
14 0.36442864 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
15 0.3643592 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
16 0.36435279 103 jmlr-2012-Sampling Methods for the Nyström Method
17 0.36413211 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
18 0.36393511 106 jmlr-2012-Sign Language Recognition using Sub-Units
19 0.3636103 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
20 0.36360759 80 jmlr-2012-On Ranking and Generalization Bounds