jmlr jmlr2008 jmlr2008-59 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jörg Lücke, Maneesh Sahani
Abstract: We study a generative model in which hidden causes combine competitively to produce observations. Multiple active causes combine to determine the value of an observed variable through a max function, in the place where algorithms such as sparse coding, independent component analysis, or non-negative matrix factorization would use a sum. This max rule can represent a more realistic model of non-linear interaction between basic components in many settings, including acoustic and image data. While exact maximum-likelihood learning of the parameters of this model proves to be intractable, we show that efficient approximations to expectation-maximization (EM) can be found in the case of sparsely active hidden causes. One of these approximations can be formulated as a neural network model with a generalized softmax activation function and Hebbian learning. Thus, we show that learning in recent softmax-like neural networks may be interpreted as approximate maximization of a data likelihood. We use the bars benchmark test to numerically verify our analytical results and to demonstrate the competitiveness of the resulting algorithms. Finally, we show results of learning model parameters to fit acoustic and visual data sets in which max-like component combinations arise naturally. Keywords: component extraction, maximum likelihood, approximate EM, competitive learning, neural networks
Reference: text
sentIndex sentText sentNum sentScore
1 Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, UK. Editor: Yoshua Bengio. Abstract: We study a generative model in which hidden causes combine competitively to produce observations. [sent-7, score-0.474]
2 Alternative, more competitive generative models have also been proposed: for instance, Saund (1995) suggests a model in which hidden causes are combined by a noisy-or rule, while Dayan and Zemel (1995) suggest a yet more competitive scheme. [sent-22, score-0.532]
3 , Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Lücke and Bouecke, 2005; Spratling, 2006), although sometimes in natural images (e. [sent-32, score-0.522]
4 In Section 4 we derive computationally efficient approximations to these update rules, in the context of sparsely active hidden causes—that is, when a small number of hidden causes generally suffices to explain the data. [sent-41, score-0.572]
5 , D), in which H hidden binary causes s_h, (h = 1, . [sent-49, score-0.395]
6 , those taking the value 1), the distribution of y_d is determined by the largest of the weights connecting the active causes to y_d. [sent-56, score-0.566]
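To make the max combination concrete, here is a minimal sampling sketch (not taken from the paper; it assumes Bernoulli priors on the binary causes and Poisson observation noise, the noise model used in the bars experiments described later, and all function and variable names are illustrative only):

```python
import numpy as np

def sample_mca(W, pi, N, rng=None):
    """Draw N observations from a max-cause (MCA-style) generative model.

    W  : (H, D) non-negative weights, one row per hidden cause.
    pi : (H,) Bernoulli activation probabilities of the binary causes s_h.
    Each observed dimension d is driven by max_h s_h * W[h, d] (not a sum);
    Poisson noise on top of that max is an assumption of this sketch.
    """
    rng = np.random.default_rng(rng)
    H, D = W.shape
    S = rng.random((N, H)) < pi                               # binary causes s^(n)
    drive = np.max(S[:, :, None] * W[None, :, :], axis=1)     # (N, D) max combination
    Y = rng.poisson(drive)                                    # noisy observations y^(n)
    return S, Y

# tiny example: 3 causes on 4 observed dimensions
W = np.array([[9., 9., 0., 0.],
              [0., 9., 9., 0.],
              [0., 0., 9., 9.]])
S, Y = sample_mca(W, pi=np.full(3, 0.4), N=5, rng=0)
print(Y)
```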
7 Then the free-energy is defined as: $\mathcal{F}(\Theta, q) = \sum_{n=1}^{N} \sum_{s} q_n(s) \big[ \log p(y^{(n)} \mid s, \Theta) + \log p(s \mid \Theta) \big] + H(q) \le \mathcal{L}(\Theta)$, (4) where $H(q) = \sum_n H(q_n(s)) = -\sum_n \sum_s q_n(s) \log q_n(s)$ is the Shannon entropy of q. [sent-114, score-0.46]
8 The iterations of EM alternately increase F with respect to the distributions qn while holding Θ fixed (the E-step), and with respect to Θ while holding the qn fixed (the M-step). [sent-115, score-0.46]
9 Thus, if we consider a pair of steps beginning from parameters Θ′ , the E-step first finds new distributions qn that depend on Θ′ and the observations y (n) , which we write as qn (s ; Θ′ ). [sent-116, score-0.491]
10 Ideally, these distributions maximize F for fixed Θ′ , in which case it can be shown that qn (s ; Θ′ ) = p(s | y (n) , Θ′ ) and F (Θ′ , qn (s ; Θ′ )) = L (Θ′ ) (Neal and Hinton, 1998). [sent-117, score-0.46]
11 After choosing the qn ’s in the E-step, we maximize F with respect to Θ in the M-step while holding the qn distributions fixed. [sent-119, score-0.46]
12 Note that the denominator in (9) vanishes only if q_n(s; Θ′) A_id(s,W) = 0 for all s and n (assuming positive weights), in which case (6) is already satisfied, and no update of W is required. [sent-132, score-0.422]
13 For our chosen Bernoulli distribution (1), the M-step is obtained by setting the derivative of F with respect to π_i to zero, giving (after rearrangement): $\pi_i = \frac{1}{N} \sum_n \langle s_i \rangle_{q_n}$, with $\langle s_i \rangle_{q_n} = \sum_s q_n(s; \Theta')\, s_i$. (11) [sent-134, score-0.69]
14 Parameter values that satisfy Equations (9) and (11) maximize the free-energy given the distributions q_n = q_n(s; Θ′). [sent-135, score-0.46]
15 E-Step Approximations. The computational cost of finding the exact sufficient statistics ⟨A_id(s,W)⟩_{q_n}, with q_n equal to the posterior probability (12), is intractable in general. [sent-142, score-0.621]
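To see why the exact E-step scales badly, the following sketch (an illustration, not the paper's code) computes the exact posterior q_n(s) = p(s | y^(n), Θ) for one data vector by enumerating all 2^H binary cause configurations under the assumed Bernoulli prior and Poisson noise; the loop over 2^H states is exactly the cost that the approximations below avoid, and the resulting posterior mean of s is what the π update of Equation (11) consumes.

```python
import itertools
import numpy as np

def exact_posterior(y, W, pi, eps=1e-12):
    """Exact posterior q(s) = p(s | y, Theta) for one data vector y, obtained by
    enumerating all 2^H binary cause configurations (feasible only for small H).

    Assumes a Bernoulli prior on the causes and Poisson noise on top of the
    max-combination of active weights; the log y! term of the Poisson pmf is
    constant in s and therefore cancels under normalisation.
    """
    H, D = W.shape
    states = np.array(list(itertools.product([0, 1], repeat=H)))       # (2^H, H)
    drive = np.maximum(np.max(states[:, :, None] * W[None, :, :], axis=1), eps)
    log_joint = ((y[None, :] * np.log(drive) - drive).sum(axis=1)
                 + (states * np.log(pi) + (1 - states) * np.log(1 - pi)).sum(axis=1))
    log_joint -= log_joint.max()                                       # numerical stability
    q = np.exp(log_joint)
    return states, q / q.sum()

# the posterior mean of s is the <s_i>_{q_n} needed for the pi update of Eq. (11)
W = np.array([[9., 9., 0., 0.], [0., 9., 9., 0.], [0., 0., 9., 9.]])
states, q = exact_posterior(np.array([8., 10., 9., 0.]), W, np.full(3, 0.2))
print((q[:, None] * states).sum(axis=0))
```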
16 , 1999), would be to optimize the qn within a constrained class of distributions; for example, distributions that factor over the sources sh . [sent-146, score-0.375]
17 Instead, we base our approximations on an assumption of sparsity—that only a small number of active hidden sources is needed to explain any one observed data vector (note that sparsity here refers to the number of active hidden sources, rather than to their proportion). [sent-148, score-0.468]
18 , H} : π_i = π and Σ_d W_id = C ; (19) and (iii) on average, the influence of each hidden source is homogeneously covered by the other sources. [sent-171, score-0.357]
19 Figure 3A,B show weight patterns associated with hidden causes for which the condition is fulfilled; for instance in Figure 3B bi = 0 for all causes with horizontal weight patterns, while bi = 1 for the vertically oriented cause. [sent-178, score-0.645]
20 Using both the sum constraint of (19) and the assumption of homogeneous coverage of causes, we obtain the M-step update: $W_{id} = C\, \frac{\sum_n \langle A_{id}(s,W) \rangle_{q_n}\, y_d^{(n)}}{\sum_{d'} \sum_n \langle A_{id'}(s,W) \rangle_{q_n}\, y_{d'}^{(n)}}$. [sent-184, score-1.098]
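Given posterior expectations of A_id, this update reduces to a ratio of weighted sums. The sketch below is a structural illustration only: A_id is approximated here by an indicator that cause i is active and W_id attains the max (ties split evenly), which simplifies the paper's definition, and the expectations EA would have to come from an exact or approximate E-step such as the enumeration sketch above.

```python
import numpy as np

def A_matrix(s, W):
    """Approximate A_id(s, W): 1 where cause i is active and W[i, d] attains
    max_h s_h W[h, d]; ties are split evenly.  The paper's exact definition
    arises as the rho -> infinity limit of a smoothed max, so this is a
    simplification."""
    drive = np.max(s[:, None] * W, axis=0, keepdims=True)          # (1, D)
    hit = (s[:, None] * W >= drive - 1e-12) & (s[:, None] > 0)     # (H, D)
    counts = np.maximum(hit.sum(axis=0, keepdims=True), 1)
    return hit / counts

def m_step_W(EA, Y, C):
    """Constrained M-step: W_id = C * sum_n <A_id> y_d / sum_{d', n} <A_id'> y_d'.

    EA : (N, H, D) posterior expectations <A_id(s,W)>_{q_n}
    Y  : (N, D) observations; C is the fixed row sum of W from (19).
    """
    num = np.einsum('nhd,nd->hd', EA, Y)            # sum_n <A_id> y_d^(n)
    den = num.sum(axis=1, keepdims=True)            # sum over d' as well
    return C * num / np.maximum(den, 1e-12)
```

Combining this with an E-step (exact enumeration for small H, or the sparse approximations derived in Section 4) gives a complete, if inefficient, EM loop; each row of the returned W sums to C, matching the constraint in (19).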
21 , those with $\sum_d y_d^{(n)} = 0$) do not affect the update rule (21) and so can be neglected, yields the expression (see Appendix C): $\langle A_{id}(s,W) \rangle_{q_n} \approx \frac{\exp(I_i^{(n)})}{\sum_h \exp(I_h^{(n)}) + \frac{\bar{\pi}}{2} \sum_{a,b:\, a \neq b} \exp(I_{ab}^{(n)})}$, $\bar{\pi} := \frac{\pi}{1-\pi}$, (22) with abbreviations given in (18). [sent-188, score-0.58]
22 An observation y is represented by the values (or activities) of the input units, which act through connections parameterized by (Wid ) to determine the activities of the hidden units through an activation function gi = gi (y, W ). [sent-200, score-0.405]
23 If, instead, we consider the effect of presenting a group of patterns $\{y^{(n)}\}$, the net change is approximately (see Appendix D): $W_{id}^{\mathrm{new}} \approx C\, \frac{\sum_n g_i(y^{(n)}, W)\, y_d^{(n)}}{\sum_{d'} \sum_n g_i(y^{(n)}, W)\, y_{d'}^{(n)}}$. [sent-212, score-0.501]
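Written as code, this batch update is a single ratio of Hebbian-style correlation sums. In the sketch below the activation g_i is only a placeholder softmax over the total weighted input with an inverse temperature β = 1/T; the paper's generalized softmax is more specific, so treat this as a structural sketch under that assumption.

```python
import numpy as np

def softmax_activation(Y, W, beta=1.0):
    """Placeholder activation g_i(y, W): a softmax over the weighted inputs,
    with beta = 1/T as an inverse temperature (an assumption of this sketch,
    not the paper's exact generalized softmax)."""
    a = beta * Y @ W.T                          # (N, H) total input to each hidden unit
    a -= a.max(axis=1, keepdims=True)           # numerical stability
    g = np.exp(a)
    return g / g.sum(axis=1, keepdims=True)

def hebbian_batch_update(Y, W, C, beta=1.0):
    """Batch update W_id^new ~ C * sum_n g_i(y^(n), W) y_d^(n)
    / sum_{d', n} g_i(y^(n), W) y_d'^(n)."""
    G = softmax_activation(Y, W, beta)          # (N, H) hidden activities
    num = G.T @ Y                               # (H, D): Hebbian correlation sums
    return C * num / np.maximum(num.sum(axis=1, keepdims=True), 1e-12)
```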
24 Unfortunately, the expectation ⟨A_id(s,W)⟩_{q_n} depends on d, and thus exact optimization in the general case would require a modified Hebbian rule. [sent-214, score-0.391]
25 Many component-extraction algorithms have been applied to a version of the bars test, including some with probabilistic generative semantics (Saund, 1995; Dayan and Zemel, 1995; Hinton et al. [sent-241, score-0.451]
26 , 2002; Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Lücke and Bouecke, 2005; Spratling, 2006; Butko and Triesch, 2007). [sent-243, score-0.496]
27 [Figure 5, panels A and B: input patterns and the weights W_1d through W_10d shown over iterations 0, 1, 10, 20, 30, and 100.] Figure 5: Bars test data with b = 10 bars on D = 5 × 5 pixels and a bar appearance probability of π = 2/10. [sent-254, score-0.61]
28 A: 24 patterns from the set of N = 500 input patterns that were generated according to the generative model with Poisson noise. [sent-255, score-0.371]
29 The associated matrix of generating weights, W^gen, was 10 × 25, with each row representing a horizontal or vertical bar in a 5-by-5 pixel grid. [sent-266, score-0.593]
30 The weights W_id^gen that correspond to the pixels of the bar were set to 10, the others to 0, so that Σ_d W_id^gen = 50. [sent-267, score-0.578]
31 Each source was active with probability π_i = 2/10, leading to an average of two bars appearing in each image. [sent-268, score-0.361]
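This bars test configuration is straightforward to reproduce; the sketch below (illustrative code, not the authors') builds the 10 × 25 generating matrix with weights of 10 on bar pixels, draws each bar with probability 2/10, combines active bars with a max, and optionally adds Poisson noise.

```python
import numpy as np

def bars_weights(size=5, on_value=10.0):
    """W^gen: one horizontal and one vertical bar per row/column of a size x size grid."""
    W = np.zeros((2 * size, size * size))
    for i in range(size):
        row = np.zeros((size, size)); row[i, :] = on_value
        col = np.zeros((size, size)); col[:, i] = on_value
        W[i] = row.ravel()
        W[size + i] = col.ravel()
    return W

def bars_data(N=500, size=5, pi=2 / 10, noisy=True, rng=None):
    """Generate bars test patterns: Bernoulli bars, max combination, optional Poisson noise."""
    rng = np.random.default_rng(rng)
    W = bars_weights(size)
    S = rng.random((N, W.shape[0])) < pi
    drive = np.max(S[:, :, None] * W[None, :, :], axis=1)   # max of the active bars
    Y = rng.poisson(drive) if noisy else drive
    return W, S, Y

W_gen, S, Y = bars_data(rng=0)
print(W_gen.sum(axis=1))   # each row sums to 50, matching the sum constraint C = 50
```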
32 [Figure 6 axes and legend: likelihood L(Θ) (×10³) plotted against the number of patterns and against 10³ iterations, for W^gen/π^gen, MCA3, MCAex, R-MCA2, and R-MCANN, with added noise and with no noise.] Figure 6: Change of the MCA parameter likelihood under different MCA learning algorithms. [sent-270, score-0.365]
33 Because of the finite number of patterns (N = 500) we compared the estimates with the actual frequency of occurrence of each bar i: π_i′ = (number of bars i in input)/N. [sent-305, score-0.518]
34 5 Comparison to Other Algorithms—Noiseless Bars. To compare the component extraction results of MCA to those of other algorithms reported in the literature, we used a standard version of the bars benchmark test, in which the bars appear with no noise. [sent-335, score-0.62]
35 If the mapping from single-bar images to the most active internal variable is injective—that is, for each single bar a different hidden unit or source is the most active—then this instance of the model is said to have represented all of the bars. [sent-342, score-0.408]
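This representation criterion can be checked mechanically; the sketch below (a hypothetical helper, not from the paper) presents each single-bar image to the learned model and tests whether the map from bars to most-active hidden units is injective, for any activation function such as the softmax sketch above.

```python
import numpy as np

def all_bars_represented(W_learned, W_gen, activation):
    """True if the map {single-bar input -> argmax_i activation_i} is injective.

    W_gen      : (b, D) generating bars, each row one noiseless single-bar image.
    W_learned  : (H, D) learned weights.
    activation : callable (Y, W) -> (N, H) hidden activities, e.g. a softmax.
    """
    G = activation(W_gen, W_learned)        # respond to each single bar in turn
    winners = G.argmax(axis=1)              # most active hidden unit per bar
    return len(set(winners.tolist())) == W_gen.shape[0]

# e.g. all_bars_represented(W_learned, bars_weights(5),
#                           lambda Y, W: softmax_activation(Y, W))
# using the helper functions from the earlier sketches.
```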
36 Without noise, MCA3 with H = 10 hidden variables found all 10 bars in 81 of 100 experiments. [sent-356, score-0.439]
37 It took fewer than 1000 pattern presentations to find all bars in the majority of 100 experiments,1 although in a few trials learning did take much longer. [sent-361, score-0.356]
38 Many component-extraction algorithms—particularly those based on artificial neural networks— use models with more hidden elements than there are distinct causes in the input data (e. [sent-373, score-0.363]
39 If we use H = 12 hidden variables, then all the MCA algorithms (MCA3, R-MCA2, and R-MCANN) found all of the bars in all of 100 trials. [sent-377, score-0.439]
40 In A, bar overlap is increased by increasing the bar appearance probability to π^gen = 3/10 (an average of three bars per pattern). [sent-381, score-0.682]
41 In B, bar overlap is varied using different bar widths (two one-pixel-wide bars and one three-pixel-wide bar for each orientation). [sent-382, score-0.772]
42 In the bars test in C there are 8 (two-pixel-wide) horizontal bars and 8 (two-pixel-wide) vertical bars on a D = 9 × 9 pixel grid. [sent-383, score-0.922]
43 Each bar appears with probability π^gen = 2/16 (two bars per input pattern on average). [sent-384, score-0.46]
44 This is because it is the non-linear combination within the regions of overlap that most distinguishes the bars test images from linear superpositions of sources. [sent-399, score-0.354]
45 Following Spratling (2006), in all experiments the MCA model had twice as many possible sources as there were bars in the generative input. [sent-401, score-0.507]
46 The most straightforward way to increase the degree of bar overlap is to use the standard bars test with an average of three instead of two bars per image, that is, take π = 3/10 for an otherwise unchanged bars test with b = 10 bars on D = 5 × 5 pixels (see Figure 8A for some examples). [sent-405, score-1.359]
47 When using H = 20 hidden variables, MCA3 extracted all bars in 92% of 25 experiments. [sent-406, score-0.467]
48 For example, when half the bars appeared with probability π_h^gen = (1 + γ)·2/10 and the other half appeared with probability π_h^gen = (1 − γ)·2/10, MCA3 extracted all bars in all of 25 experiments for γ = 0. [sent-418, score-0.767]
49 6 (when half the bars appeared 4 times more often than the other half) all bars were extracted in 88% of 25 experiments. [sent-421, score-0.594]
50 As suggested by Lücke and von der Malsburg (2004), we also varied the bar overlap in a second experiment by choosing bars of different widths. [sent-426, score-0.773]
51 The bar appearance probability was π = 2/b, so that an input contained, as usual, two bars on average. [sent-429, score-0.518]
52 The bar heights represent the average numbers of extracted bars in 25 trials. [sent-440, score-0.459]
53 Error bars indicate the largest and the lowest number of bars found in a trial. [sent-441, score-0.566]
54 In the third experiment we changed the degree of bar overlap more substantially, using a bars test that included overlapping parallel bars as introduced by Lücke (2004). [sent-444, score-0.958]
55 To allow for a detailed comparison with other systems we adopted the exact settings used by Spratling (2006), that is, we considered 25 runs of a system with H = 32 hidden variables using bars test parameters D = 9 × 9, π^gen = 2/16, and N = 400. [sent-449, score-0.47]
56 Convergence to a representation that contains just the true hidden causes and leaves supernumerary units unspecialized can improve the interpretability of the result. [sent-461, score-0.364]
57 When using a higher fixed temperature for R-MCANN all the hidden units represented bars, with some bars represented by more than one unit. [sent-462, score-0.553]
58 [Figure 10 panels: A, generating causes; B, input patterns; C, W after learning (MCA3).] Figure 10: Experiments with more causes and hidden variables than observed variables. [sent-468, score-0.543]
59 We generated N = 500 patterns on a 3-by-3 grid (D = 9), using sparse combinations of 12 hidden causes corresponding to 6 horizontal and 6 vertical bars, each 1-by-2 pixels in size and thus extending across only a portion of the image (Figure 10A). [sent-480, score-0.495]
60 R-MCANN only extracted all causes when run at fixed temperatures that were lower than those used for the bars tests above (e. [sent-490, score-0.49]
61 To highlight this point we explicitly removed such sparse data vectors from a standard bars test, thereby violating the Bernoulli prior assumption of the generative model. [sent-499, score-0.451]
62 We used bars tests as described above, with b = 10 or b = 16 bars and π = 2/b, generating N = 500 (or more) patterns, in each case by first drawing causes from the Bernoulli distribution (1) and then rejecting patterns in which fewer than m causes were active. [sent-500, score-0.979]
63 However, when only patterns with fewer than 2 bars had been removed, MCA3 was still able to identify all the bars in many of the runs. [sent-502, score-0.653]
64 More precisely, using data generated as above with b = 10, m = 2 and N = 500, MCA3 with H = 10 hidden variables found all causes in 69 of 100 trials with noisy observations and in 37 of 100 trials without noise (the parameters for MCA3 and the associated annealing schedule were unchanged). [sent-503, score-0.503]
65 The relatively low reliability seen for noiseless bars in this experiment may be due to the combined violation of both the assumed prior and noise distributions. [sent-506, score-0.42]
66 The resulting values were thresholded and then rescaled linearly so that power-levels across all phonemes filled the interval [0, 10], as in the standard bars test. [sent-550, score-0.347]
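A sketch of this preprocessing step, under the description given here (the threshold value is not stated in this excerpt, so it is left as a free parameter):

```python
import numpy as np

def preprocess_log_spectrograms(spectrograms, threshold):
    """Threshold log-spectrogram power values, then rescale linearly so that
    the pooled values across all phonemes fill the interval [0, 10], as in the
    standard bars test.  `threshold` is a free parameter of this sketch."""
    X = np.maximum(np.asarray(spectrograms, dtype=float), threshold)
    X -= X.min()                      # shift so the smallest pooled value is 0
    peak = X.max()
    return 10.0 * X / peak if peak > 0 else X
```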
67 The MCA algorithms were used with the same parameter settings as in the bars tests above, except that annealing began at a lower initial temperature (see Appendix E). [sent-552, score-0.382]
68 As in the bars tests with increased overlap, we used twice as many hidden variables (H = 12) as there were causes in the input. [sent-553, score-0.589]
69 For intermediate fixed temperatures, results for R-MCANN were similar to those of the bars test in Figure 8D in that each cause was represented just once, with additional hidden units displaying little structure in their weights. [sent-564, score-0.497]
70 These were measured as described for the bars tests above, by checking whether, after learning, inference based on each individual phoneme log-spectrogram led to a different hidden cause being most probable. [sent-567, score-0.483]
71 MCA3 found all causes in 21 of 25 trials (84% reliability), R-MCA2 found all causes in all of 25 trials; as did R-MCANN (with fixed T = 70). [sent-568, score-0.347]
72 However, each blade of grass might appear at many different positions within the image patches, rather than at a fixed set of possible locations as in the bars test. [sent-594, score-0.355]
73 R-MCA2 was used with the same parameter setting as for the bars tests above, except for lower initial and final temperatures for annealing (see Appendix E). [sent-601, score-0.355]
74 5 (as in the bars test), the resulting weights were generally similar to the ones in Figure 12C, but with a larger proportion of weight vectors showing little structure. [sent-610, score-0.368]
75 In experiments applying the online algorithm R-MCANN to a set of 5000 10-by-10 patches as above, we found that it would converge to ‘grass’-like weight patterns provided the learning rate (ε in Equation 25) was set to a much lower value than had been used in the bars tests. [sent-613, score-0.456]
76 0 as above), and with noise on the weights (σ) scaled down by the same factor, R-MCANN converged to weights similar to those shown in Figure 12C (for R-MCA2 ), although a larger number of hidden units showed relatively uniform weight structure. [sent-617, score-0.411]
77 Numerical experiments on raw image data (Figure 12) demonstrate that plausible generative causes are extracted using the MCA approach. [sent-636, score-0.382]
78 The weight patterns associated with the extracted causes resemble images of the single object parts (blades and stems of grass in our example) that combine non-linearly to generate the image. [sent-637, score-0.362]
79 An alternative approach would be to constrain qn to place mass only on source configurations where a limited number of causes are active. [sent-659, score-0.408]
80 Revisiting the argument leading to equation (15), it is clear that such an approximation would correspond to truncating the sums in both numerator and denominator of (15), as well as the corresponding expression for ⟨s_h⟩_{q_n}, at the same point. [sent-661, score-0.359]
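Such a truncation is easy to sketch: only configurations with at most k active causes enter the sums, so the number of terms grows like Σ_{j≤k} C(H, j) rather than 2^H. The code below is an illustration under the same Bernoulli/Poisson assumptions as the earlier enumeration sketch; k is a free parameter.

```python
import itertools
import numpy as np

def truncated_posterior(y, W, pi, k=2, eps=1e-12):
    """Posterior over binary cause vectors restricted to configurations with at
    most k active causes; the number of terms grows like sum_{j<=k} C(H, j)
    instead of 2^H.  Same Bernoulli/Poisson assumptions as the full enumeration;
    the log y! term cancels under normalisation and is omitted."""
    H, D = W.shape
    states = [np.zeros(H, dtype=int)]
    for j in range(1, k + 1):
        for idx in itertools.combinations(range(H), j):
            s = np.zeros(H, dtype=int)
            s[list(idx)] = 1
            states.append(s)
    S = np.array(states)
    drive = np.maximum(np.max(S[:, :, None] * W[None, :, :], axis=1), eps)
    log_joint = ((y[None, :] * np.log(drive) - drive).sum(axis=1)
                 + (S * np.log(pi) + (1 - S) * np.log(1 - pi)).sum(axis=1))
    log_joint -= log_joint.max()
    q = np.exp(log_joint)
    return S, q / q.sum()
```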
81 In the bars test (b = 10, N = 500) the noisy-or algorithm (Saund, 1995) finds all bars in just 27% of trials. [sent-715, score-0.566]
82 The more competitive scheme (Dayan and Zemel, 1995) only extracts all bars if bar overlap is excluded for training, that is, if training patterns only contain parallel bars. [sent-716, score-0.592]
83 For comparison, the MCA learning algorithms MCA3 , R-MCA2 , and R-MCANN all show significantly higher values of reliability in the same bars test (see Table 1), at least when combined with an annealing procedure. [sent-718, score-0.434]
84 The reliability can be boosted further in two ways—either by adding Poisson noise to the input (Table 1), or by adding more hidden variables or units to the model (in which case all three MCA algorithms find all 10 bars in all of our experiments). [sent-719, score-0.663]
85 3 Neural Network Models. High reliability in component extraction in the bars test is not a feature exclusive to the new algorithms presented. [sent-722, score-0.445]
86 , 2002) as well as neural network models (Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Lücke, 2004; Lücke and Bouecke, 2005; Spratling, 2006; Butko and Triesch, 2007). [sent-726, score-0.556]
87 The extraction of input components, for example, in the bars test, fails if a simple max is used for inference instead. [sent-749, score-0.366]
88 $\frac{\partial}{\partial W_{id}} F(\Theta, \Theta') \overset{!}{=} 0 \;\Rightarrow\; \sum_n \sum_s q_n(s; \Theta') \sum_{d'} \frac{\partial}{\partial W_{id}} \log p\big(y_{d'}^{(n)} \mid \bar{W}_{d'}(s,W)\big) \overset{!}{=} 0 \;\Rightarrow\; \sum_n \sum_s q_n(s; \Theta') \cdots$ [sent-760, score-0.46]
89 Now, for any well-behaved function g, and large ρ: $A^{\rho}_{id}(s,W)\, g(\bar{W}_d(s,W)) \approx A^{\rho}_{id}(s,W)\, g(W_{id})$, where $A^{\rho}_{id}(s,W) := \frac{\partial}{\partial W_{id}} \bar{W}^{\rho}_d(s,W)$. [sent-763, score-0.483]
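This limiting behaviour can be checked numerically. The sketch below assumes the smoothed max takes the form W̄^ρ_d(s,W) = (Σ_h s_h W_hd^ρ)^{1/ρ} (that specific form is not shown in this excerpt, so it is an assumption); its derivative A^ρ_id then concentrates, as ρ grows, on the cause whose weight attains the max.

```python
import numpy as np

def W_bar_rho(s, W, rho):
    """Assumed smoothed max over the active causes: (sum_h s_h W_hd^rho)^(1/rho)."""
    return np.sum(s[:, None] * W ** rho, axis=0) ** (1.0 / rho)

def A_rho(s, W, rho):
    """A^rho_id = d W_bar^rho_d / d W_id, in closed form for the assumed smoothed max."""
    inner = np.sum(s[:, None] * W ** rho, axis=0)                   # (D,)
    return s[:, None] * W ** (rho - 1) * inner ** (1.0 / rho - 1.0)

W = np.array([[9., 2.],
              [3., 8.]])
s = np.array([1, 1])
for rho in (1, 5, 50):
    # columns approach one-hot indicators of the maximal weight as rho grows
    print(rho, np.round(A_rho(s, W, rho), 3))
```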
90 Hence it follows from (26) that: $\sum_n \sum_s q_n(s; \Theta')\, A^{\rho}_{id}(s,W)\, f(y_d^{(n)}, W_{id}) \approx 0$, (29) [sent-765, score-0.46]
91 $\Rightarrow\; \sum_n \sum_s q_n(s; \Theta')\, A^{\rho}_{id}(s,W)\, \big(y_d^{(n)} - W_{id}\big) \overset{!}{=} 0$. [sent-766, score-0.549]
92 To derive the numerator of (16) we have used the property: $A_{id}(s^h, W) = \delta_{ih}$, $A_{id}(s^{ab}, W) = \delta_{ia}\, \mathcal{H}(W_{id} - W_{bd}) + \delta_{ib}\, \mathcal{H}(W_{id} - W_{ad})$, (32) where $\mathcal{H}$ is the Heaviside function. [sent-807, score-0.362]
93 Note that instead of (32) we also could have used $A^{\rho}_{id}(s^h, W)$ and $A^{\rho}_{id}(s^{ab}, W)$ directly, or the corresponding expressions of $A^{\rho}_{id}(s,W)$ in (31) with high ρ. [sent-808, score-0.483]
94 2 R-MCA2. If we optimize (5) under the constraint Σ_d W_id = C in (19), we obtain: $\sum_n \langle A_{id}(s,W) \rangle_{q_n} \big(y_d^{(n)} - W_{id}\big) + \mu_i = 0$. [sent-821, score-0.549]
95 The elimination of the Lagrange multipliers µ_i results in: $W_{id} = \frac{\sum_n \langle A_{id}(s,W) \rangle_{q_n}\, y_d^{(n)}}{\frac{1}{C} \sum_{n,d'} \langle A_{id'}(s,W) \rangle_{q_n}\, y_{d'}^{(n)}} \;-\; \sum_n \Big( \frac{\sum_{d'} \langle A_{id'}(s,W) \rangle_{q_n}\, W_{id'}}{C} - \langle A_{id}(s,W) \rangle_{q_n} \Big)$. [sent-822, score-0.549]
96 If the model parameters W fulfill condition (20), which can be expected to hold at least close to the maximum likelihood solution, we obtain, after rearrangement of terms, the approximate M-step of Equation (21). [sent-823, score-0.469]
97 To approximate ⟨A_id(s,W)⟩_{q_n} we can therefore assume input statistics that follow from Equations (1) to (3), but in which all inputs exactly equal to zero are omitted. [sent-825, score-0.42]
98 The parameters of the prior distribution were initialized to be π_i = 1/H for MCA3, that is, half the value of the generating π_i^gen = 2/H in the standard bars test. [sent-845, score-0.513]
99 Experiments on different versions of the bars test showed a roughly linear dependence between the critical temperature Tc and the number of input dimensions D. [sent-880, score-0.368]
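A possible cooling schedule consistent with this observation is sketched below; the proportionality constant, the starting margin above T_c, and the linear shape are illustrative assumptions, with only the rough linearity of T_c in D taken from the text.

```python
import numpy as np

def cooling_schedule(D, n_steps, c=0.7, T_final=1.0):
    """Linear cooling from just above an assumed critical temperature T_c ~ c * D
    down to T_final; beta = 1/T would then scale the (generalized) softmax
    activation during learning.  c and the 1.2 margin are illustrative only."""
    T_c = c * D                  # illustrative: T_c grows roughly linearly with D
    T_init = 1.2 * T_c           # start safely above the critical temperature
    return np.linspace(T_init, T_final, n_steps)

print(cooling_schedule(D=25, n_steps=5))
```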
100 For R-MCANN we used a fixed temperature of T = 16 if not otherwise stated, and stopped after all single causes were represented by the same hidden variables for 4000 pattern presentations. [sent-888, score-0.362]
wordName wordTfidf (topN-words)
[('wid', 0.443), ('mca', 0.315), ('bars', 0.283), ('qn', 0.23), ('cke', 0.199), ('gen', 0.173), ('spratling', 0.173), ('generative', 0.168), ('id', 0.161), ('yd', 0.158), ('hidden', 0.156), ('causes', 0.15), ('bar', 0.148), ('auses', 0.129), ('aximal', 0.129), ('ahani', 0.122), ('nmf', 0.122), ('ucke', 0.122), ('reliability', 0.108), ('sh', 0.089), ('patterns', 0.087), ('acoustic', 0.084), ('gid', 0.084), ('malsburg', 0.084), ('mcaex', 0.077), ('poisson', 0.071), ('cooling', 0.064), ('phonemes', 0.064), ('appearance', 0.058), ('units', 0.058), ('waveforms', 0.058), ('whd', 0.058), ('temperature', 0.056), ('sources', 0.056), ('extraction', 0.054), ('der', 0.053), ('iab', 0.051), ('patches', 0.051), ('weights', 0.05), ('active', 0.05), ('superposition', 0.049), ('gi', 0.049), ('trials', 0.047), ('likelihood', 0.047), ('sahani', 0.045), ('saund', 0.045), ('dayan', 0.045), ('overlap', 0.045), ('von', 0.045), ('hochreiter', 0.044), ('phoneme', 0.044), ('annealing', 0.043), ('pixel', 0.041), ('numerator', 0.04), ('softmax', 0.039), ('schmidhuber', 0.038), ('hebbian', 0.038), ('activation', 0.037), ('grass', 0.036), ('image', 0.036), ('ica', 0.035), ('intensity', 0.035), ('weight', 0.035), ('pixels', 0.034), ('converged', 0.033), ('likelihoods', 0.033), ('zemel', 0.033), ('intensities', 0.033), ('hateren', 0.032), ('reilly', 0.032), ('sounds', 0.032), ('network', 0.032), ('horizontal', 0.032), ('visual', 0.032), ('parameters', 0.031), ('update', 0.031), ('sparseness', 0.03), ('em', 0.03), ('johnson', 0.029), ('sparsely', 0.029), ('temperatures', 0.029), ('input', 0.029), ('competitive', 0.029), ('noise', 0.029), ('neural', 0.028), ('wi', 0.028), ('source', 0.028), ('extracted', 0.028), ('competition', 0.027), ('seung', 0.027), ('heaviside', 0.027), ('activities', 0.027), ('generating', 0.026), ('images', 0.026), ('gray', 0.026), ('trial', 0.026), ('bouecke', 0.026), ('iabc', 0.026), ('presentations', 0.026), ('recordings', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
Author: Jörg Lücke, Maneesh Sahani
Abstract: We study a generative model in which hidden causes combine competitively to produce observations. Multiple active causes combine to determine the value of an observed variable through a max function, in the place where algorithms such as sparse coding, independent component analysis, or non-negative matrix factorization would use a sum. This max rule can represent a more realistic model of non-linear interaction between basic components in many settings, including acoustic and image data. While exact maximum-likelihood learning of the parameters of this model proves to be intractable, we show that efficient approximations to expectation-maximization (EM) can be found in the case of sparsely active hidden causes. One of these approximations can be formulated as a neural network model with a generalized softmax activation function and Hebbian learning. Thus, we show that learning in recent softmax-like neural networks may be interpreted as approximate maximization of a data likelihood. We use the bars benchmark test to numerically verify our analytical results and to demonstrate the competitiveness of the resulting algorithms. Finally, we show results of learning model parameters to fit acoustic and visual data sets in which max-like component combinations arise naturally. Keywords: component extraction, maximum likelihood, approximate EM, competitive learning, neural networks
2 0.065543093 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
Author: Ilya Shpitser, Judea Pearl
Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification
3 0.05362748 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
Author: Bo Jiang, Xuegong Zhang, Tianxi Cai
Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine
4 0.050381888 52 jmlr-2008-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning
5 0.049586419 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
6 0.048866708 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
7 0.0439304 58 jmlr-2008-Max-margin Classification of Data with Absent Features
8 0.043558214 60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis
9 0.041502934 54 jmlr-2008-Learning to Select Features using their Properties
10 0.03973911 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
11 0.039182134 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
12 0.038994554 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
13 0.036515545 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
14 0.035892442 56 jmlr-2008-Magic Moments for Structured Output Prediction
15 0.034468267 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
16 0.033418491 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
17 0.033302028 87 jmlr-2008-Stationary Features and Cat Detection
18 0.033222634 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
19 0.032176662 96 jmlr-2008-Visualizing Data using t-SNE
20 0.03165153 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
topicId topicWeight
[(0, 0.163), (1, -0.019), (2, -0.016), (3, -0.015), (4, -0.081), (5, 0.056), (6, 0.046), (7, -0.015), (8, -0.033), (9, -0.041), (10, -0.084), (11, 0.114), (12, -0.006), (13, -0.141), (14, 0.111), (15, -0.008), (16, 0.021), (17, 0.027), (18, 0.06), (19, 0.087), (20, 0.024), (21, 0.031), (22, 0.03), (23, -0.229), (24, -0.125), (25, 0.106), (26, -0.185), (27, -0.2), (28, 0.104), (29, 0.031), (30, -0.15), (31, 0.228), (32, -0.007), (33, -0.172), (34, -0.043), (35, 0.258), (36, -0.025), (37, -0.23), (38, 0.131), (39, -0.182), (40, 0.012), (41, 0.052), (42, 0.034), (43, 0.008), (44, -0.189), (45, -0.323), (46, 0.115), (47, 0.06), (48, -0.003), (49, -0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.9566626 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
Author: Jörg Lücke, Maneesh Sahani
Abstract: We study a generative model in which hidden causes combine competitively to produce observations. Multiple active causes combine to determine the value of an observed variable through a max function, in the place where algorithms such as sparse coding, independent component analysis, or non-negative matrix factorization would use a sum. This max rule can represent a more realistic model of non-linear interaction between basic components in many settings, including acoustic and image data. While exact maximum-likelihood learning of the parameters of this model proves to be intractable, we show that efficient approximations to expectation-maximization (EM) can be found in the case of sparsely active hidden causes. One of these approximations can be formulated as a neural network model with a generalized softmax activation function and Hebbian learning. Thus, we show that learning in recent softmax-like neural networks may be interpreted as approximate maximization of a data likelihood. We use the bars benchmark test to numerically verify our analytical results and to demonstrate the competitiveness of the resulting algorithms. Finally, we show results of learning model parameters to fit acoustic and visual data sets in which max-like component combinations arise naturally. Keywords: component extraction, maximum likelihood, approximate EM, competitive learning, neural networks
2 0.32646865 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen
Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue
3 0.32434133 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
Author: Ilya Shpitser, Judea Pearl
Abstract: We consider a hierarchy of queries about causal relationships in graphical models, where each level in the hierarchy requires more detailed information than the one below. The hierarchy consists of three levels: associative relationships, derived from a joint distribution over the observable variables; cause-effect relationships, derived from distributions resulting from external interventions; and counterfactuals, derived from distributions that span multiple “parallel worlds” and resulting from simultaneous, possibly conflicting observations and interventions. We completely characterize cases where a given causal query can be computed from information lower in the hierarchy, and provide algorithms that accomplish this computation. Specifically, we show when effects of interventions can be computed from observational studies, and when probabilities of counterfactuals can be computed from experimental studies. We also provide a graphical characterization of those queries which cannot be computed (by any method) from queries at a lower layer of the hierarchy. Keywords: causality, graphical causal models, identification
4 0.28458828 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
Author: Bo Jiang, Xuegong Zhang, Tianxi Cai
Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine
5 0.25193852 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
Author: Faustino Gomez, Jürgen Schmidhuber, Risto Miikkulainen
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning (RL) approaches have made progress by using direct interaction with the task environment, but have so far not scaled well to large state spaces and environments that are not fully observable. In recent years, neuroevolution, the artificial evolution of neural networks, has had remarkable success in tasks that exhibit these two properties. In this paper, we compare a neuroevolution method called Cooperative Synapse Neuroevolution (CoSyNE), that uses cooperative coevolution at the level of individual synaptic weights, to a broad range of reinforcement learning algorithms on very difficult versions of the pole balancing problem that involve large (continuous) state spaces and hidden state. CoSyNE is shown to be significantly more efficient and powerful than the other methods on these tasks. Keywords: coevolution, recurrent neural networks, non-linear control, genetic algorithms, experimental comparison
6 0.2217169 52 jmlr-2008-Learning from Multiple Sources
7 0.22044496 60 jmlr-2008-Minimal Nonlinear Distortion Principle for Nonlinear Independent Component Analysis
8 0.18867415 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
10 0.18352285 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
11 0.18297653 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
12 0.17425299 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
13 0.16876811 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
14 0.16665734 96 jmlr-2008-Visualizing Data using t-SNE
15 0.15956485 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
16 0.15836039 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
17 0.14339416 58 jmlr-2008-Max-margin Classification of Data with Absent Features
18 0.14301662 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.13607828 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
20 0.13479322 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
topicId topicWeight
[(0, 0.018), (5, 0.033), (9, 0.013), (40, 0.054), (54, 0.03), (58, 0.035), (60, 0.012), (66, 0.047), (72, 0.022), (76, 0.028), (79, 0.432), (88, 0.067), (92, 0.04), (94, 0.062), (99, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.75503725 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
Author: Jörg Lücke, Maneesh Sahani
Abstract: We study a generative model in which hidden causes combine competitively to produce observations. Multiple active causes combine to determine the value of an observed variable through a max function, in the place where algorithms such as sparse coding, independent component analysis, or non-negative matrix factorization would use a sum. This max rule can represent a more realistic model of non-linear interaction between basic components in many settings, including acoustic and image data. While exact maximum-likelihood learning of the parameters of this model proves to be intractable, we show that efficient approximations to expectation-maximization (EM) can be found in the case of sparsely active hidden causes. One of these approximations can be formulated as a neural network model with a generalized softmax activation function and Hebbian learning. Thus, we show that learning in recent softmax-like neural networks may be interpreted as approximate maximization of a data likelihood. We use the bars benchmark test to numerically verify our analytical results and to demonstrate the competitiveness of the resulting algorithms. Finally, we show results of learning model parameters to fit acoustic and visual data sets in which max-like component combinations arise naturally. Keywords: component extraction, maximum likelihood, approximate EM, competitive learning, neural networks
2 0.286358 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
3 0.28140703 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
4 0.28044873 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
Author: Andreas Krause, Ajit Singh, Carlos Guestrin
Abstract: When monitoring spatial phenomena, which can often be modeled as Gaussian processes (GPs), choosing sensor locations is a fundamental task. There are several common strategies to address this task, for example, geometry or disk models, placing sensors at the points of highest entropy (variance) in the GP model, and A-, D-, or E-optimal design. In this paper, we tackle the combinatorial optimization problem of maximizing the mutual information between the chosen locations and the locations which are not selected. We prove that the problem of finding the configuration that maximizes mutual information is NP-complete. To address this issue, we describe a polynomial-time approximation that is within (1 − 1/e) of the optimum by exploiting the submodularity of mutual information. We also show how submodularity can be used to obtain online bounds, and design branch and bound search procedures. We then extend our algorithm to exploit lazy evaluations and local structure in the GP, yielding significant speedups. We also extend our approach to find placements which are robust against node failures and uncertainties in the model. These extensions are again associated with rigorous theoretical approximation guarantees, exploiting the submodularity of the objective function. We demonstrate the advantages of our approach towards optimizing mutual information in a very extensive empirical study on two real-world data sets. Keywords: networks Gaussian processes, experimental design, active learning, spatial learning; sensor
5 0.28020039 83 jmlr-2008-Robust Submodular Observation Selection
Author: Andreas Krause, H. Brendan McMahan, Carlos Guestrin, Anupam Gupta
Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. For example, in environmental monitoring, we want to select locations to measure in order to most effectively predict spatial phenomena. Often, we want to select observations which are robust against a number of possible objective functions. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for cases where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We show how our algorithm can be extended to handle complex cost functions (incorporating non-unit observation cost or communication and path costs). We also show how the algorithm can be used to near-optimally trade off expected-case (e.g., the Mean Square Prediction Error in Gaussian Process regression) and worst-case (e.g., maximum predictive variance) performance. We show that many important machine learning problems fit our robust submodular observation selection formalism, and provide extensive empirical evaluation on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. c 2008 Andreas Krause, H. Brendan McMahan, Carlos Guestrin and Anupam Gupta. K RAUSE , M C M AHAN , G UESTRIN AND G UPTA Keywords: observation selection, experimental design, active learning, submodular functions, Gaussi
6 0.27914399 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.27860236 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
8 0.2778953 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
9 0.2771886 86 jmlr-2008-SimpleMKL
10 0.27672714 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
11 0.27637503 57 jmlr-2008-Manifold Learning: The Price of Normalization
12 0.27541819 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
13 0.2741006 56 jmlr-2008-Magic Moments for Structured Output Prediction
15 0.27275184 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
16 0.27215973 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
17 0.2697086 9 jmlr-2008-Active Learning by Spherical Subdivision
18 0.26701483 16 jmlr-2008-Approximations for Binary Gaussian Process Classification
19 0.26664174 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
20 0.26447847 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration