nips2010-61 knowledge-graph by maker-knowledge-mining

61 nips-2010-Direct Loss Minimization for Structured Prediction


Source: pdf

Author: Tamir Hazan, Joseph Keshet, David A. McAllester

Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. [sent-6, score-0.186]

2 The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. [sent-7, score-0.504]

3 We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem. [sent-9, score-0.191]

4 For example, modern machine translation systems convert an input word string into an output word string in a different language by approximately optimizing a score defined on the input-output pair. [sent-11, score-0.253]

5 We assume an objective function s_w : X × Y → R parameterized by a vector w ∈ R^d such that for x ∈ X and y ∈ Y we have a score s_w(x, y). [sent-16, score-0.204]

6 The parameter setting w determines a mapping from input x to output y_w(x), defined as follows: y_w(x) = argmax_{y ∈ Y} s_w(x, y). (1) Our goal is to set the parameters w of the scoring function such that the mapping from input to output defined by (1) performs well. [sent-17, score-1.764]
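A minimal sketch of the mapping in (1), assuming a small explicit candidate set Y and a linear score s_w(x, y) = w·φ(x, y) of the kind the paper later adopts; the feature map `phi` and the toy data are hypothetical stand-ins, not anything prescribed by the paper:

```python
import numpy as np

def phi(x, y):
    """Hypothetical joint feature map standing in for the paper's phi(x, y)."""
    return np.array([x * y, float(y), 1.0 - y])

def predict(w, x, Y):
    """y_w(x) = argmax_{y in Y} s_w(x, y) with s_w(x, y) = w . phi(x, y), as in eq. (1)."""
    scores = [np.dot(w, phi(x, y)) for y in Y]
    return Y[int(np.argmax(scores))]

# toy usage: three candidate outputs and a fixed weight vector
w = np.array([0.5, -0.2, 0.1])
print(predict(w, x=2.0, Y=[0, 1, 2]))
```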

7 We assume a loss function L, such as the BLEU score, which gives a cost L(y, ŷ) ≥ 0 for producing output ŷ when the desired output (reference output) is y. [sent-19, score-0.285]

8 w* = argmin_w E[L(y, y_w(x))]. (2) In (2) the expectation is taken over a random draw of the pair (x, y) from the source data distribution ρ. [sent-21, score-0.762]

9 Unfortunately the training objective function (2) is typically non-convex and we are not aware of any polynomial algorithms (in time and sample complexity) with reasonable approximation guarantees to (2) for typical loss functions, say 0-1 loss, and an arbitrary distribution ρ. [sent-24, score-0.22]

10 In spite of the lack of approximation guarantees, it is common to replace the objective in (2) with a convex relaxation such as structural hinge loss [8, 10]. [sent-25, score-0.324]

11 It should be noted that replacing the objective in (2) with structural hinge loss leads to inconsistency — the optimum of the relaxation is different from the optimum of (2). [sent-26, score-0.361]

12 An alternative to a convex relaxation is to perform gradient descent directly on the objective in (2). [sent-27, score-0.168]

13 Unfortunately, direct gradient descent on (2) is conceptually puzzling in the case where the output space Y is discrete. [sent-29, score-0.145]

14 In this case the output y_w(x) is not a differentiable function of w. [sent-30, score-0.794]

15 As one smoothly changes w the output y_w(x) jumps discontinuously between discrete output values. [sent-31, score-0.851]

16 So one cannot write ∇_w E[L(y, y_w(x))] as E[∇_w L(y, y_w(x))]. [sent-32, score-1.474]

17 However, when the input space X is continuous the gradient ∇_w E[L(y, y_w(x))] can exist even when the output space Y is discrete. [sent-33, score-0.825]

18 The main result of this paper is a perceptron-like method of performing direct gradient descent on (2) in the case where the output space is discrete but the input space is continuous. [sent-34, score-0.145]

19 Although machine translation has discrete inputs as well as discrete outputs, the training method we propose can still be used, though without theoretical guarantees. [sent-36, score-0.154]

20 We also present empirical results on the use of this method in phoneme alignment on the TIMIT corpus, where it achieves the best known results on this problem. [sent-37, score-0.396]

21 s_w(x, y) = w^⊤ φ(x, y). Because the feature map φ can itself be nonlinear, and the feature vector φ(x, y) can be very high dimensional, objective functions of this form are highly expressive. [sent-40, score-0.138]

22 w^{t+1} = w^t + φ(x_t, y_t) − φ(x_t, y_{w^t}(x_t)). (3) Note that if y_{w^t}(x_t) = y_t then no update is made and we have w^{t+1} = w^t. [sent-49, score-0.84]

23 If y_{w^t}(x_t) ≠ y_t then the update changes the parameter vector in a way that favors y_t over y_{w^t}(x_t). [sent-50, score-0.502]
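A sketch of the basic structured perceptron update (3) in the same toy setting as the earlier snippet; `phi` and `predict` are the hypothetical feature map and argmax predictor from that sketch, passed in as arguments here so the definition is self-contained:

```python
import numpy as np

def perceptron_update(w, x_t, y_t, Y, phi, predict):
    """Eq. (3): w <- w + phi(x_t, y_t) - phi(x_t, y_w(x_t))."""
    y_hat = predict(w, x_t, Y)            # current prediction y_{w^t}(x_t)
    if y_hat == y_t:                      # prediction already correct: no update
        return w
    # otherwise move toward the reference label y_t and away from the prediction
    return np.asarray(w) + phi(x_t, y_t) - phi(x_t, y_hat)
```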

24 If the data is separable, i.e., there exists a weight vector w with the property that y_w(x) = y with probability 1 and y_w(x) is always γ-separated from all distractors, then the perceptron update rule will eventually lead to a parameter setting with zero loss. [sent-53, score-1.607]

25 Note, however, that the basic perceptron update does not involve the loss function L. [sent-54, score-0.248]

26 Hence it cannot be expected to optimize the training objective (2) in cases where zero loss is unachievable. [sent-55, score-0.22]

27 A loss-sensitive perceptron-like algorithm can be derived from the structured hinge loss of a margin-scaled structural SVM [10]. [sent-56, score-0.241]

28 The optimization problem for margin-scaled structured hinge loss can be defined as follows. [sent-57, score-0.211]

29 y^t_hinge = argmax_{ỹ ∈ Y} [ L(y_t, ỹ) − (w^t)^⊤ (φ(x_t, y_t) − φ(x_t, ỹ)) ] = argmax_{ỹ ∈ Y} [ (w^t)^⊤ φ(x_t, ỹ) + L(y_t, ỹ) ]. (4) This yields the following perceptron-like update rule where the update direction is the negative of the sub-gradient of the loss and η^t is a learning rate. [sent-61, score-0.59]

30 w^{t+1} = w^t + η^t ( φ(x_t, y_t) − φ(x_t, y^t_hinge) ). (5) Equation (4) is often referred to as loss-adjusted inference. [sent-62, score-0.292]

31 The use of loss-adjusted inference causes the update rule (5) to be at least influenced by the loss function. [sent-63, score-0.263]

32 w^{t+1} = w^t + η^t ( φ(x_t, y_{w^t}(x_t)) − φ(x_t, y^t_direct) ), (6) where y^t_direct = argmax_{ỹ ∈ Y} [ (w^t)^⊤ φ(x_t, ỹ) + ε^t L(y_t, ỹ) ]. (7) In the update (6) we view y^t_direct as being worse than y_{w^t}(x_t). [sent-65, score-1.569]
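A sketch of loss-adjusted inference and the direct update (6)-(7), again in the toy setting above; `loss` is a hypothetical task loss L(y, ỹ), `eta_t` and `eps_t` stand for the learning rate η^t and update weight ε^t, and `phi`/`predict` are the hypothetical pieces from the first sketch. The `sign` argument lets the same routine express the hinge-style adjustment in (4) (+L) or the "−ε^t L" variant used later in the experiments:

```python
import numpy as np

def loss_adjusted_argmax(w, x_t, y_t, Y, eps_t, loss, phi, sign=+1.0):
    """Eq. (7): argmax over y~ of w . phi(x_t, y~) + sign * eps_t * L(y_t, y~)."""
    vals = [np.dot(w, phi(x_t, y)) + sign * eps_t * loss(y_t, y) for y in Y]
    return Y[int(np.argmax(vals))]

def direct_update(w, x_t, y_t, Y, eta_t, eps_t, loss, phi, predict):
    """Eq. (6): step away from the loss-adjusted ('worse') output y_direct,
    toward the feature vector of the current prediction y_w(x_t)."""
    y_pred = predict(w, x_t, Y)
    y_dir = loss_adjusted_argmax(w, x_t, y_t, Y, eps_t, loss, phi, sign=+1.0)
    return np.asarray(w) + eta_t * (phi(x_t, y_pred) - phi(x_t, y_dir))
```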

33 Note that the reference label y_t in (5) has been replaced by the inferred label y_{w^t}(x_t) in (6). [sent-67, score-0.268]

34 The main result of this paper is that under mild conditions the expected update direction of (6) approaches the negative direction of ∇_w E[L(y, y_w(x))] in the limit as the update weight ε^t goes to zero. [sent-68, score-0.931]

35 For a finite set Y of possible output values, and for w in general position as defined below, we have the following, where y_direct is a function of w, x, y and ε. [sent-74, score-0.412]

36 ∇_w E[L(y, y_w(x))] = lim_{ε→0} (1/ε) E[φ(x, y_direct) − φ(x, y_w(x))], where y_direct = argmax_{ỹ ∈ Y} [ w^⊤ φ(x, ỹ) + ε L(y, ỹ) ]. We prove this theorem in the case of only two labels, where we have y ∈ {−1, 1}. [sent-75, score-2.36]
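The same statement written out as display math (LaTeX), following the reconstruction above rather than a verbatim quote of the paper:

```latex
\nabla_w \, \mathbb{E}\big[ L(y, y_w(x)) \big]
  \;=\; \lim_{\epsilon \to 0} \frac{1}{\epsilon}\,
        \mathbb{E}\big[ \phi(x, y_{\mathrm{direct}}) - \phi(x, y_w(x)) \big],
\qquad
y_{\mathrm{direct}} \;=\; \operatorname*{argmax}_{\tilde{y} \in \mathcal{Y}}
        \; w^\top \phi(x, \tilde{y}) + \epsilon\, L(y, \tilde{y}) .
```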

37 We assume an input set X and a probability distribution or a measure ρ on X × {−1, 1} and a loss function L(y, ỹ) for y, ỹ ∈ {−1, 1}. [sent-77, score-0.141]

38 Typically the loss L(y, ỹ) is zero if ỹ = y, but the loss of a false positive, namely L(−1, 1), may be different from the loss of a false negative, L(1, −1). [sent-78, score-0.423]

39 By definition the gradient of expected loss satisfies the following condition for any vector ∆w ∈ R^d. [sent-79, score-0.172]

40 Under this convention we have y_w(x) = sign(w^⊤ ∆φ(x)). [sent-81, score-0.737]

41 If the two labels y_{w+ε∆w}(x) and y_w(x) are the same then the quantity inside the expectation is zero. [sent-83, score-0.755]

42 In (b) we show the integration of ∆L(y)(∆w)^⊤∆φ(x) over the sets U^+ = {x : w^⊤∆φ(x) ∈ [0, ε]} and U^− = {x : w^⊤∆φ(x) ∈ [−ε(∆w)^⊤∆φ(x), 0)}. [sent-88, score-0.338]

43 and S^− = {x : y_w(x) = 1, y_{w+ε∆w}(x) = −1} = {x : w^⊤∆φ(x) ≥ 0, (w + ε∆w)^⊤∆φ(x) < 0} = {x : w^⊤∆φ(x) ∈ [0, −ε(∆w)^⊤∆φ(x))}. We define ∆L(y) = L(y, 1) − L(y, −1) and then write the left-hand side of (8) as follows. [sent-90, score-0.737]

44 E[L(y, y_{w+ε∆w}(x)) − L(y, y_w(x))] = E[∆L(y) 1{x ∈ S^+}] − E[∆L(y) 1{x ∈ S^−}]. (9) These expectations are shown as integrals in Figure 1(a), where the lines in the figure represent the decision boundaries defined by w and w + ε∆w. [sent-91, score-0.837]

45 lim_{ε→0+} (1/ε) ( E_ρ[U · 1{z ∈ [0, εV]}] − E_ρ[U · 1{z ∈ [εV, 0]}] ) = E_μ[U V · f(0|U, V)] = lim_{ε→0+} (1/ε) E_ρ[U V · 1{z ∈ [0, ε]}]. Proof. [sent-97, score-0.214]

46 lim_{ε→0+} (1/ε) E_ρ[U · 1{z ∈ [0, εV)}] = lim_{ε→0+} (1/ε) E_μ[ U ∫_0^{εV} f(z|U, V) dz ] = E_μ[U V^+ · f(0|U, V)]. Similarly we have the following, where V^− denotes min(0, V). [sent-99, score-0.214]

47 lim_{ε→0+} (1/ε) E_ρ[U · 1{z ∈ (εV, 0]}] = lim_{ε→0+} (1/ε) E_μ[ U ∫_{εV}^0 f(z|U, V) dz ] = −E_μ[U V^− · f(0|U, V)]. Subtracting these two expressions gives the following. [sent-100, score-0.214]

48 (∆w)^⊤ ∇_w E[L(y, y_w(x))] = lim_{ε→0+} (1/ε) ( E[∆L(y) · 1{x ∈ S^+}] − E[∆L(y) · 1{x ∈ S^−}] ) = lim_{ε→0+} (1/ε) E[∆L(y) · (∆w)^⊤∆φ(x) · 1{w^⊤∆φ(x) ∈ [0, ε]}]. (10) Of course we need to check that the conditions of Lemma 1 hold. [sent-102, score-0.951]

49 If the two labels y_direct and y_w(x) are the same then the quantity inside the expectation is zero. [sent-107, score-1.085]

50 y_direct = sign( w^⊤∆φ(x) + ε∆L(y) ). We now define the following two sets, which correspond to the set of pairs (x, y) for which y_w(x) and y_direct are different. [sent-109, score-1.397]

51 B^+ = {(x, y) : y_w(x) = −1, y_direct = 1} = {(x, y) : w^⊤∆φ(x) < 0, w^⊤∆φ(x) + ε∆L(y) ≥ 0} = {(x, y) : w^⊤∆φ(x) ∈ [−ε∆L(y), 0)}, and B^− = {(x, y) : y_w(x) = 1, y_direct = −1} = {(x, y) : w^⊤∆φ(x) ≥ 0, w^⊤∆φ(x) + ε∆L(y) < 0} = {(x, y) : w^⊤∆φ(x) ∈ [0, −ε∆L(y))}. We now have the following. [sent-110, score-2.134]

52 E[(∆w)^⊤ (φ(x, y_direct) − φ(x, y_w(x)))] = E[(∆w)^⊤∆φ(x) · 1{(x,y) ∈ B^+}] − E[(∆w)^⊤∆φ(x) · 1{(x,y) ∈ B^−}]. (11) These expectations are shown as integrals in Figure 1(b). [sent-111, score-1.124]

53 lim_{ε→0+} (1/ε) (∆w)^⊤ E[φ(x, y_direct) − φ(x, y_w(x))] = lim_{ε→0+} (1/ε) E[(∆w)^⊤∆φ(x) · ∆L(y) · 1{w^⊤∆φ(x) ∈ [0, ε]}]. (12) Theorem 1 now follows from (10) and (12). [sent-113, score-1.281]

54 r_w(x) = argmax_{r ∈ R} w^⊤ φ(x, r). (13) We assume that for y ∈ Y and r ∈ R we can assign a loss L(y, r). [sent-127, score-0.28]

55 For many loss functions, such as weighted Hamming loss, one can compute L(y, r) efficiently. [sent-129, score-0.141]

56 The parameter setting optimizing approximate inference might be significantly different from the parameter setting optimizing the loss under exact inference. [sent-132, score-0.211]

57 ∇_w E_{(x,y)∼ρ}[L(y, r_w(x))] = lim_{ε→0} (1/ε) E[φ(x, r_direct) − φ(x, r_w(x))], where r_direct = argmax_{r̃ ∈ R} [ w^⊤ φ(x, r̃) + ε L(y, r̃) ]. Another possible extension involves hidden structure. [sent-135, score-0.413]

58 y_w(x) = argmax_{y ∈ Y} max_{h ∈ H} w^⊤ φ(x, y, h). (15) In this case we can take the training problem to again be defined by (2), but where y_w(x) is defined by (15). [sent-139, score-1.596]
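A sketch of prediction with hidden structure as in (15), assuming a small finite set H of hidden configurations and a hypothetical joint feature map `phi3` over (x, y, h); none of these names come from the paper:

```python
import numpy as np

def predict_latent(w, x, Y, H, phi3):
    """Eq. (15): y_w(x) = argmax_{y in Y} max_{h in H} w . phi3(x, y, h)."""
    best_y, best_score = None, float("-inf")
    for y in Y:
        score = max(np.dot(w, phi3(x, y, h)) for h in H)  # maximize out the hidden variable
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```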

59 Phoneme-to-speech alignment is used as a tool in developing speech recognition and text-to-speech systems. [sent-143, score-0.231]

60 In the phoneme alignment problem each input x represents a speech utterance, and consists of a pair (s, p) of a sequence of acoustic feature vectors, s = (s_1, … [sent-144, score-0.608]

61 …, p_K), where p_k ∈ P, 1 ≤ k ≤ K, is a phoneme symbol and P is a finite set of phoneme symbols. [sent-150, score-0.546]

62 Sometimes this task is called forced-alignment because one is forced… [Table 1: Percentage of correctly positioned phoneme boundaries, given a predefined tolerance, on the TIMIT corpus.] [sent-153, score-0.301]

63 …to interpret the given acoustic signal as the given phoneme sequence. [sent-180, score-0.328]

64 …, y_K), where 1 ≤ y_k ≤ T is an integer giving the start frame in the acoustic sequence of the k-th phoneme in the phoneme sequence. [sent-184, score-0.73]

65 Hence the k-th phoneme starts at frame y_k and ends at frame y_{k+1} − 1. [sent-185, score-0.408]

66 Two types of loss functions are used to quantitatively assess alignments. [sent-186, score-0.141]

67 The first loss is called the τ-alignment loss and it is defined as L_{τ-alignment}(ȳ, ȳ′) = (1/|ȳ|) · |{k : |y_k − y′_k| > τ}|. (16) [sent-187, score-0.358]

68 In words, this loss measures the average number of times the absolute difference between the predicted alignment sequence and the manual alignment sequence is greater than τ. [sent-188, score-0.461]

69 This loss with different values of τ was used to measure the performance of the learned alignment function in [1, 9, 4]. [sent-189, score-0.273]

70 The second loss, called the τ-insensitive loss, was proposed in [5] and is defined as follows. [sent-190, score-0.141]

71 L_{τ-insensitive}(ȳ, ȳ′) = (1/|ȳ|) Σ_k max{|y_k − y′_k| − τ, 0}. (17) This loss measures the average disagreement between all the boundaries of the desired alignment sequence and the boundaries of the predicted alignment sequence, where a disagreement of less than τ is ignored. [sent-191, score-0.709]
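A sketch of the two losses (16) and (17) on integer boundary sequences; `y_ref` plays the role of the manual alignment ȳ and `y_pred` the predicted alignment, both assumed to have the same length (the variable names and toy values are illustrative only):

```python
def tau_alignment_loss(y_ref, y_pred, tau):
    """Eq. (16): fraction of boundaries whose absolute error exceeds tau."""
    return sum(abs(a - b) > tau for a, b in zip(y_ref, y_pred)) / len(y_ref)

def tau_insensitive_loss(y_ref, y_pred, tau):
    """Eq. (17): average boundary error, ignoring disagreements of less than tau."""
    return sum(max(abs(a - b) - tau, 0) for a, b in zip(y_ref, y_pred)) / len(y_ref)

# toy usage with boundaries given in frames; with 10 ms frames, tau=1 corresponds to 10 ms
print(tau_alignment_loss([3, 10, 25], [3, 13, 26], tau=1))    # 1/3: only the middle boundary is off by more than tau
print(tau_insensitive_loss([3, 10, 25], [3, 13, 26], tau=1))  # (0 + 2 + 0) / 3
```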

72 Note that the τ-insensitive loss is continuous and convex while the τ-alignment loss is discontinuous and non-convex. [sent-192, score-0.16]

73 w^{t+1} = w^t + η^t ( φ(x̄_t, ȳ^t_direct) − φ(x̄_t, y_{w^t}(x̄_t)) ), where ȳ^t_direct = argmax_{ỹ ∈ Y} [ (w^t)^⊤ φ(x̄_t, ỹ) − ε^t L(ȳ_t, ỹ) ]. Our experiments are on the TIMIT speech corpus, for which there are published benchmark results [1, 5, 4]. [sent-195, score-1.185]

74 The corpus contains aligned utterances each of which is a pair (x, y) where x is a pair of a phonetic sequence and an acoustic sequence and y is a desired alignment. [sent-196, score-0.325]

75 The first part of the training set was used to train a phoneme frame-based classifier, which, given a speech frame and a phoneme, outputs the confidence that the phoneme was uttered in that frame. [sent-198, score-0.738]

76 The phoneme frame-based classifier is then used as part of a seven-dimensional feature map φ(x, y) = φ((s̄, p̄), ȳ), as described in [5]. [sent-199, score-0.331]

77 The feature set used to train the phoneme classifier consisted of the Mel-Frequency Cepstral Coefficients (MFCC) and the log-energy, along with their first and second derivatives (∆+∆∆), as described in [5]. [sent-200, score-0.285]

78 The complete set of 61 TIMIT phoneme symbols was mapped onto 39 phoneme symbols, as proposed by [6], and this mapping was used throughout the training process. [sent-203, score-0.612]

79 We trained twice, once for the τ-alignment loss and once for the τ-insensitive loss, with τ = 10 ms in both cases. [sent-205, score-0.186]

80 It should be noted that if w^0 = 0 and ε^t and η^t are both held constant at ε and η respectively, then the direction of w^t is independent of the choice of η. [sent-207, score-0.213]

81 These updates are repeated until the performance of w^t on the third data set (the hold-out set) begins to degrade. [sent-208, score-0.186]

82 We scored the performance of our system on the whole TIMIT test set of 1344 utterances using τ-alignment accuracy (one minus the loss) with τ set to each of 10, 20, 30 and 40 ms, and with the τ-insensitive loss with τ set to 10 ms. [sent-213, score-0.254]

83 As should be expected, for τ equal to 10 ms the best performance is achieved when the loss used in training matches the loss used in test. [sent-214, score-0.352]

84 Larger values of τ correspond to a loss function that was not used in training. [sent-215, score-0.141]

85 Also, as might be expected, the τ-insensitive loss seems more robust to the use of a τ value at test time that is larger than the τ value used in training. [sent-220, score-0.141]

86 6 Open Problems and Discussion: The main result of this paper is the loss gradient theorem of Section 3. [sent-221, score-0.197]

87 This theorem provides a theoretical foundation for perceptron-like training methods with updates computed as a difference between the feature vectors of two different inferred outputs where at least one of those outputs is inferred with loss-adjusted inference. [sent-222, score-0.246]

88 Perceptron-like training methods using feature differences between two inferred outputs have already been shown to be successful for machine translation but theoretical justification has been lacking. [sent-223, score-0.21]

89 We also show the value of these training methods in a phonetic alignment problem. [sent-224, score-0.22]

90 Although we did not give an asymptotic convergence result, it should be straightforward to show that, under the update given by (6), w^t converges to a local optimum of the objective provided that both η^t and ε^t go to zero while Σ_t η^t ε^t goes to infinity. [sent-225, score-0.304]
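One concrete schedule satisfying these conditions, written out as a check (an illustrative example, not taken from the paper):

```latex
\eta^t = t^{-1/2}, \qquad \epsilon^t = t^{-1/4}
\;\Longrightarrow\;
\eta^t \to 0, \quad \epsilon^t \to 0, \quad
\sum_{t} \eta^t \epsilon^t \;=\; \sum_{t} t^{-3/4} \;=\; \infty .
```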

91 In our phoneme alignment experiments we trained only a seven-dimensional weight vector and early stopping was used as regularization. [sent-228, score-0.5]

92 It should be noted that naive regularization with a norm of w, such as regularizing with λ||w||², is nonsensical as the loss E[L(y, y_w(x))] is insensitive to the norm of w. [sent-229, score-0.918]

93 Regularization is typically done with a surrogate loss function such as hinge loss. [sent-230, score-0.198]

94 Regularization remains an open theoretical issue for direct gradient descent on a desired loss function on a finite training sample. [sent-231, score-0.345]

95 Many practical computational problems in areas such as computational linguistics, computer vision, speech recognition, robotics, genomics, and marketing seem best handled by some form of score optimization. [sent-233, score-0.171]

96 We have provided a theoretical foundation for a certain perceptron-like training algorithm by showing that it can be viewed as direct stochastic gradient descent on the loss of the inference system. [sent-237, score-0.33]

97 The main point of this training method is to incorporate domain-specific loss functions, such as the BLEU score in machine translation, directly into the training process with a clear theoretical foundation. [sent-238, score-0.299]

98 Automatic segmentation and labeling of speech based on hidden markov models. [sent-244, score-0.142]

99 Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. [sent-255, score-0.134]

100 A large margin algorithm for speech and audio segmentation. [sent-267, score-0.134]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('yw', 0.737), ('ydirect', 0.33), ('phoneme', 0.264), ('wt', 0.169), ('loss', 0.141), ('ywt', 0.136), ('alignment', 0.132), ('timit', 0.117), ('lim', 0.107), ('speech', 0.099), ('xt', 0.096), ('yt', 0.084), ('yk', 0.076), ('argmax', 0.076), ('utterances', 0.069), ('translation', 0.064), ('acoustic', 0.064), ('relaxation', 0.063), ('rw', 0.063), ('sw', 0.063), ('update', 0.062), ('bleu', 0.058), ('output', 0.057), ('training', 0.046), ('corpus', 0.045), ('perceptron', 0.045), ('score', 0.045), ('boundaries', 0.043), ('phonetic', 0.042), ('keshet', 0.04), ('linguistics', 0.04), ('brugnara', 0.039), ('elded', 0.039), ('rdirect', 0.039), ('yhinge', 0.039), ('hinge', 0.038), ('scoring', 0.037), ('dz', 0.037), ('direct', 0.035), ('audio', 0.035), ('inference', 0.034), ('frame', 0.034), ('objective', 0.033), ('integrals', 0.033), ('structured', 0.032), ('tamir', 0.031), ('submanifold', 0.031), ('gradient', 0.031), ('outputs', 0.031), ('desired', 0.03), ('structural', 0.03), ('seven', 0.029), ('sequence', 0.028), ('rd', 0.028), ('disagreement', 0.028), ('inferred', 0.027), ('handled', 0.027), ('hidden', 0.026), ('rule', 0.026), ('string', 0.025), ('position', 0.025), ('theorem', 0.025), ('software', 0.025), ('argmin', 0.025), ('polytope', 0.025), ('ms', 0.024), ('direction', 0.024), ('expectations', 0.024), ('inputs', 0.023), ('goes', 0.022), ('descent', 0.022), ('feature', 0.021), ('reference', 0.021), ('theoretical', 0.021), ('trained', 0.021), ('noted', 0.02), ('insensitive', 0.02), ('tolerance', 0.02), ('lemma', 0.02), ('aligned', 0.019), ('convex', 0.019), ('language', 0.019), ('mcallester', 0.019), ('surrogate', 0.019), ('open', 0.019), ('stopping', 0.019), ('symbols', 0.019), ('graphical', 0.019), ('labels', 0.018), ('optimum', 0.018), ('optimizing', 0.018), ('pk', 0.018), ('early', 0.018), ('dimensional', 0.017), ('updates', 0.017), ('inconsistencies', 0.017), ('positioned', 0.017), ('reinterpreted', 0.017), ('surpasses', 0.017), ('markov', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 61 nips-2010-Direct Loss Minimization for Structured Prediction

Author: Tamir Hazan, Joseph Keshet, David A. McAllester

Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem.

2 0.16098799 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs

Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens

Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant. 1

3 0.095041499 182 nips-2010-New Adaptive Algorithms for Online Classification

Author: Francesco Orabona, Koby Crammer

Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1

4 0.086979583 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton

Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonalcovariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. 1

5 0.077855617 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

Author: Min Yang, Linli Xu, Martha White, Dale Schuurmans, Yao-liang Yu

Abstract: Robust regression and classification are often thought to require non-convex loss functions that prevent scalable, global training. However, such a view neglects the possibility of reformulated training methods that can yield practically solvable alternatives. A natural way to make a loss function more robust to outliers is to truncate loss values that exceed a maximum threshold. We demonstrate that a relaxation of this form of “loss clipping” can be made globally solvable and applicable to any standard loss while guaranteeing robustness against outliers. We present a generic procedure that can be applied to standard loss functions and demonstrate improved robustness in regression and classification problems. 1

6 0.077465467 228 nips-2010-Reverse Multi-Label Learning

7 0.075496957 283 nips-2010-Variational Inference over Combinatorial Spaces

8 0.064421974 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction

9 0.063961148 235 nips-2010-Self-Paced Learning for Latent Variable Models

10 0.063810289 158 nips-2010-Learning via Gaussian Herding

11 0.063582242 188 nips-2010-On Herding and the Perceptron Cycling Theorem

12 0.059246097 265 nips-2010-The LASSO risk: asymptotic results and real world examples

13 0.057476096 243 nips-2010-Smoothness, Low Noise and Fast Rates

14 0.056870311 219 nips-2010-Random Conic Pursuit for Semidefinite Programming

15 0.056195814 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

16 0.054735351 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection

17 0.054113042 222 nips-2010-Random Walk Approach to Regret Minimization

18 0.051860418 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

19 0.051853728 282 nips-2010-Variable margin losses for classifier design

20 0.051690515 118 nips-2010-Implicit Differentiation by Perturbation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.144), (1, 0.008), (2, 0.033), (3, -0.037), (4, 0.004), (5, 0.03), (6, -0.032), (7, -0.019), (8, -0.073), (9, 0.044), (10, -0.023), (11, -0.03), (12, 0.115), (13, 0.031), (14, -0.102), (15, 0.05), (16, 0.029), (17, -0.042), (18, -0.018), (19, -0.105), (20, -0.029), (21, 0.121), (22, 0.18), (23, 0.051), (24, 0.023), (25, -0.079), (26, 0.14), (27, -0.046), (28, 0.04), (29, -0.011), (30, -0.034), (31, -0.026), (32, 0.012), (33, 0.099), (34, 0.001), (35, -0.004), (36, 0.007), (37, -0.049), (38, 0.004), (39, -0.101), (40, -0.054), (41, 0.01), (42, 0.002), (43, -0.157), (44, 0.026), (45, 0.037), (46, -0.055), (47, 0.011), (48, 0.026), (49, -0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89279044 61 nips-2010-Direct Loss Minimization for Structured Prediction

Author: Tamir Hazan, Joseph Keshet, David A. McAllester

Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem.

2 0.72683311 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs

Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens

Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant. 1

3 0.64762419 188 nips-2010-On Herding and the Perceptron Cycling Theorem

Author: Andrew Gelfand, Yutian Chen, Laurens Maaten, Max Welling

Abstract: The paper develops a connection between traditional perceptron algorithms and recently introduced herding algorithms. It is shown that both algorithms can be viewed as an application of the perceptron cycling theorem. This connection strengthens some herding results and suggests new (supervised) herding algorithms that, like CRFs or discriminative RBMs, make predictions by conditioning on the input attributes. We develop and investigate variants of conditional herding, and show that conditional herding leads to practical algorithms that perform better than or on par with related classifiers such as the voted perceptron and the discriminative RBM. 1

4 0.60398477 158 nips-2010-Learning via Gaussian Herding

Author: Koby Crammer, Daniel D. Lee

Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1

5 0.56493372 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine

Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton

Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonalcovariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. 1

6 0.5281263 182 nips-2010-New Adaptive Algorithms for Online Classification

7 0.47837391 118 nips-2010-Implicit Differentiation by Perturbation

8 0.459539 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction

9 0.44565615 202 nips-2010-Parallelized Stochastic Gradient Descent

10 0.4303661 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

11 0.42211962 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection

12 0.4206014 264 nips-2010-Synergies in learning words and their referents

13 0.41900209 269 nips-2010-Throttling Poisson Processes

14 0.41861638 125 nips-2010-Inference and communication in the game of Password

15 0.41348284 285 nips-2010-Why are some word orders more common than others? A uniform information density account

16 0.41347525 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction

17 0.41248259 228 nips-2010-Reverse Multi-Label Learning

18 0.41165078 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs

19 0.41106027 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting

20 0.40685502 235 nips-2010-Self-Paced Learning for Latent Variable Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.042), (17, 0.038), (27, 0.051), (30, 0.095), (35, 0.018), (40, 0.214), (45, 0.217), (50, 0.039), (52, 0.023), (59, 0.027), (60, 0.051), (77, 0.049), (78, 0.015), (90, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85928857 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication

Author: Guy Isely, Christopher Hillar, Fritz Sommer

Abstract: A new algorithm is proposed for a) unsupervised learning of sparse representations from subsampled measurements and b) estimating the parameters required for linearly reconstructing signals from the sparse codes. We verify that the new algorithm performs efficient data compression on par with the recent method of compressive sampling. Further, we demonstrate that the algorithm performs robustly when stacked in several stages or when applied in undercomplete or overcomplete situations. The new algorithm can explain how neural populations in the brain that receive subsampled input through fiber bottlenecks are able to form coherent response properties. 1

2 0.8546586 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

Author: Eiji Mizutani, Stuart Dreyfus

Abstract: In the neural-network parameter space, an attractive field is likely to be induced by singularities. In such a singularity region, first-order gradient learning typically causes a long plateau with very little change in the objective function value E (hence, a flat region). Therefore, it may be confused with “attractive” local minima. Our analysis shows that the Hessian matrix of E tends to be indefinite in the vicinity of (perturbed) singular points, suggesting a promising strategy that exploits negative curvature so as to escape from the singularity plateaus. For numerical evidence, we limit the scope to small examples (some of which are found in journal papers) that allow us to confirm singularities and the eigenvalues of the Hessian matrix, and for which computation using a descent direction of negative curvature encounters no plateau. Even for those small problems, no efficient methods have been previously developed that avoided plateaus. 1

same-paper 3 0.80846113 61 nips-2010-Direct Loss Minimization for Structured Prediction

Author: Tamir Hazan, Joseph Keshet, David A. McAllester

Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem.

4 0.76057631 63 nips-2010-Distributed Dual Averaging In Networks

Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi

Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1

5 0.75879574 155 nips-2010-Learning the context of a category

Author: Dan Navarro

Abstract: This paper outlines a hierarchical Bayesian model for human category learning that learns both the organization of objects into categories, and the context in which this knowledge should be applied. The model is fit to multiple data sets, and provides a parsimonious method for describing how humans learn context specific conceptual representations.

6 0.75385594 243 nips-2010-Smoothness, Low Noise and Fast Rates

7 0.75375873 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs

8 0.75354642 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

9 0.75078964 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

10 0.75029117 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

11 0.7495575 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars

12 0.74889576 148 nips-2010-Learning Networks of Stochastic Differential Equations

13 0.74864775 27 nips-2010-Agnostic Active Learning Without Constraints

14 0.74861431 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

15 0.74810779 290 nips-2010-t-logistic regression

16 0.74781418 280 nips-2010-Unsupervised Kernel Dimension Reduction

17 0.74778068 222 nips-2010-Random Walk Approach to Regret Minimization

18 0.74751318 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework

19 0.7462545 265 nips-2010-The LASSO risk: asymptotic results and real world examples

20 0.74614388 7 nips-2010-A Family of Penalty Functions for Structured Sparsity