nips nips2010 nips2010-214 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Jones, Venkatesh Saligrama, Sanjoy Mitter
Abstract: Experts (human or computer) are often required to assess the probability of uncertain events. When a collection of experts independently assess events that are structurally interrelated, the resulting assessment may violate fundamental laws of probability. Such an assessment is termed incoherent. In this work we investigate how the problem of incoherence may be affected by allowing experts to specify likelihood models and then update their assessments based on the realization of a globally-observable random sequence. Keywords: Bayesian Methods, Information Theory, consistency
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Experts (human or computer) are often required to assess the probability of uncertain events. [sent-9, score-0.05]
2 When a collection of experts independently assess events that are structurally interrelated, the resulting assessment may violate fundamental laws of probability. [sent-10, score-0.313]
3 In this work we investigate how the problem of incoherence may be affected by allowing experts to specify likelihood models and then update their assessments based on the realization of a globally-observable random sequence. [sent-12, score-0.368]
4 Coherence will be formally defined later, but in essence a coherent probability assessment is one that exhibits logical consistency. [sent-14, score-0.484]
5 Incoherent assessments are those that cannot be correct, that are at odds with the underlying structure of the space, and so can’t be extended to a complete probability distribution [1, 2]. [sent-15, score-0.248]
6 From a decision-theoretic standpoint, when assessments are treated as betting odds, incoherent assessments result in guaranteed losses to the assessors. [sent-16, score-0.456]
7 They are dominated strategies, meaning that for every incoherent assessment there is a coherent assessment that uniformly improves the outcome for the assessors. [sent-17, score-0.773]
8 Despite this fact, expert assessments (human and machine) are vulnerable to incoherence [3]. [sent-18, score-0.265]
9 Previous authors have used coherence as a tool for fusing distributed expert assessments [4, 5, 6]. [sent-19, score-0.456]
10 The focus has been on static coherence in which experts are polled once about some set of events and the responses are then fused through a geometric projection. [sent-20, score-0.362]
11 Besides relying on arbitrary scoring functions to define the “right” projection, such analyses don’t address dynamically evolving assessments or forecasts. [sent-21, score-0.181]
12 This paper is, to our knowledge, the first attempt to analyze the problem of coherence under Bayesian belief dynamics. [sent-22, score-0.291]
13 The importance of dynamic coherence is demonstrated in the following example. [sent-23, score-0.261]
14 Consider two uncertain events A1 and A2 where A1 ⊆ A2 (e.g., A2 = {NASDAQ ↑ tomorrow} and A1 = {NASDAQ ↑ tomorrow by ≥ 10 points}). [sent-24, score-0.086]
16 To be coherent, a probability assessment must obey the relation P(A1) ≤ P(A2). [sent-27, score-0.241]
17 For the purposes of the example, suppose the initial beliefs are equal, P(A1) = P(A2). [sent-28, score-0.056]
18 (Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.) The experts then observe a random variable Z that is believed to correlate with the underlying events. [sent-33, score-0.136]
19 The believed dependence between Z and Ai is captured by a likelihood model P (Z|Ai ) that gives the probability of observing Z when event Ai does or doesn’t occur. [sent-36, score-0.27]
20 For the example, suppose Z = 0 is observed and the believed likelihoods satisfy P(Z = 0 | A1) = 1 > P(Z = 0 | Ā1), while P(Z = 0 | A2) = P(Z = 0 | Ā2), so that Z is informative about A1 but uninformative about A2. [sent-37, score-0.041]
21 There is nothing inherently irrational in this belief model, but when Bayes' rule is applied it gives P(A1 | Z = 0) > P(A1) = P(A2) = P(A2 | Z = 0), so the updated assessment violates P(A1) ≤ P(A2) and is incoherent. [sent-39, score-0.056]
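A minimal numerical sketch of this effect follows. The specific likelihood values below are illustrative assumptions (the exact numbers of the original example are not preserved in this excerpt); any likelihoods with this qualitative structure produce the same reversal.

import numpy as np

def bayes_update(prior, lik_A, lik_notA):
    # One Bayes-rule revision of P(A) after observing Z = 0.
    return lik_A * prior / (lik_A * prior + lik_notA * (1.0 - prior))

p0 = 0.5                           # illustrative equal initial beliefs P(A1) = P(A2)
p_A1 = bayes_update(p0, 1.0, 0.5)  # Z = 0 is certain under A1, less likely otherwise
p_A2 = bayes_update(p0, 0.5, 0.5)  # Z = 0 is uninformative about A2
print(p_A1, p_A2)                  # 0.667 > 0.5: the posterior violates P(A1) <= P(A2)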
22 IDPS1 detects both distributed denial of service (DDoS) attacks and port scan (PS) attacks, while IDPS2 detects only DDoS attacks. [sent-46, score-0.142]
23 While studying the NIST guide to IDPSs [7], BigCorp's CTO notes the recommendation that "organizations should consider using multiple types of IDPS technologies to achieve more comprehensive and accurate detection and prevention of malicious activity." [sent-47, score-0.067]
24 One morning while reading the output reports of the two detectors, an intrepid security analyst witnesses an interesting behavior. [sent-49, score-0.091]
25 Detector IDPS2 is reporting a higher attack probability than detector IDPS1. [sent-51, score-0.11]
26 Since the threats detected by IDPS1 are a superset of those detected by IDPS2, the probability assigned by IDPS1 should always be at least as large as that assigned by IDPS2. [sent-53, score-0.118]
27 The dilemma faced by our analyst is how to reconcile the logically incoherent outputs of the two detectors. [sent-54, score-0.187]
28 In particular, how should probabilities be ascribed in a way that is logically consistent but still retains, as much as possible, the expert assessments of the detectors? [sent-55, score-0.277]
29 We suggest two possible forms of dynamic coherence and analyze the relationship between them. [sent-58, score-0.261]
30 Previous Work. Previous authors have analyzed coherence with respect to contingent (or conditional) probability assessments [8, 9, 10]. [sent-61, score-0.501]
31 These developments attempt to determine conditions characterizing coherent subjective posteriors. [sent-62, score-0.23]
32 While likelihood models are a form of contingent probability assessment, this paper goes further in analyzing the impact of these assessments on coherent belief dynamics. [sent-63, score-0.607]
33 In [11, 12] a different form of conditional coherence is suggested which derives from coherence of a joint probability distribution over observations and states of nature. [sent-64, score-0.545]
34 It is shown that for this stronger form of conditional coherence, certain specially structured event sets and likelihood functions will produce coherent posterior assessments. [sent-65, score-0.423]
35 Logical consistency under non-Bayesian belief dynamics has been previously analyzed. [sent-66, score-0.056]
36 In [13], conditions are developed for invariance of Jeffrey's-rule updates under permutations of the observation sequence. [sent-67, score-0.072]
37 A comparison of Jeffrey’s rule and Pearl’s virtual evidence method is made in [14] which shows that the virtual evidence method implicitly assumes the conditions of Jeffrey’s update rule. [sent-68, score-0.132]
38 Let Θ = {θ1 , θ2 , . . .} be an event space and (Ω, F) a measurable space. [sent-72, score-0.117]
39 Also, let Zi : Ω → Z be a sequence of measurable random variables; consider Zi to be the sequence of observations, with Z = {z^1 , z^2 , . . . , z^K } the observation alphabet. [sent-77, score-0.069]
40 Since the random variables are assumed measurable, Ωθ and ΩZi are measurable sets (i.e., elements of F). [sent-84, score-0.069]
41 We call elements of A the events under assessment. [sent-91, score-0.063]
42 The characteristic matrix χ for the events under assessment is defined entrywise by χij = 1 if θj ∈ A^θ_i and χij = 0 otherwise. [sent-92, score-0.334]
43 An individual probability assessment P : A → [0, 1] maps each event under assessment to the unit interval; the assessment is collected into the vector [P (A1 ) P (A2 ) · · · P (AN )]^T. [sent-95, score-0.55]
44 A coherent assessment (i.e., one that is logically consistent) can be described geometrically as lying in the convex hull of the columns of χ, meaning ∃λ ∈ [0, 1]^J with Σj λj = 1 such that the assessment vector equals χλ. [sent-99, score-0.115]
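Coherence can therefore be checked by a small linear feasibility problem. The sketch below assumes SciPy is available and uses an assumed characteristic matrix for two nested events (A2 ⊆ A1) over three atoms; it is an illustration, not the paper's code.

import numpy as np
from scipy.optimize import linprog

def is_coherent(chi, p):
    # Feasibility of: lambda >= 0, sum(lambda) = 1, chi @ lambda = p.
    n_events, n_atoms = chi.shape
    A_eq = np.vstack([chi, np.ones((1, n_atoms))])
    b_eq = np.append(np.asarray(p, dtype=float), 1.0)
    res = linprog(c=np.zeros(n_atoms), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * n_atoms, method="highs")
    return res.status == 0

chi = np.array([[1, 1, 0],    # assumed encoding: atoms = {A1 and A2, A1 only, neither}
                [1, 0, 0]])
print(is_coherent(chi, [0.6, 0.4]))   # True:  P(A1) >= P(A2)
print(is_coherent(chi, [0.4, 0.6]))   # False: no convex combination of columns matches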
45 We now consider a sequence of probability assessments Pn defined as follows: Pn is the result of a belief revision process based on an initial probability assessment P0 , a likelihood model pn (z|A), and a sequence of observations Z1 , Z2 , . . . , Zn . [sent-102, score-1.328]
46 A likelihood model pn (z|A) is a pair of probability mass functions over the observations: one conditioned on A and the other conditioned on Ā (where Ā denotes the complement of A). [sent-106, score-0.74]
47 We will make the simplifying assumption that the likelihood model is static, i.e., pn (z|A) = p(z|A) and pn (z|Ā) = p(z|Ā) for all n. [sent-107, score-0.107]
49 In this paper we assume belief revision dynamics governed by Bayes' rule, i.e., Pn+1 (A) = p(Zn+1 |A) Pn (A) / [ p(Zn+1 |A) Pn (A) + p(Zn+1 |Ā) (1 − Pn (A)) ]. [sent-110, score-0.143]
50 Assume that for each event Aj there is at least one symbol z^i with αij ≠ βij (i.e., each event has at least one informative observation) and that αij ∈ (0, 1), βij ∈ (0, 1) for all i, j (i.e., every symbol has positive probability under both hypotheses), where αij = p(z^i |Aj ) and βij = p(z^i |Āj ). [sent-116, score-0.095]
51 Then by induction the posterior probability of event Aj after n observations is Pn (Aj ) = 1 / ( 1 + ((1 − P0 )/P0 ) ∏_{i=1}^{K} (βij /αij )^{n_i} ) (1), where n_i is the number of observations of symbol z^i. [sent-119, score-0.308]
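Equation (1) is straightforward to evaluate numerically. A sketch with hypothetical likelihood values (not taken from the paper):

import numpy as np

def posterior_from_counts(p0, alpha_j, beta_j, counts):
    # Equation (1): P_n(A_j) from prior p0, alpha_j[i] = p(z^i | A_j),
    # beta_j[i] = p(z^i | complement of A_j), and counts n_i of each symbol.
    log_term = np.sum(counts * (np.log(beta_j) - np.log(alpha_j)))
    return 1.0 / (1.0 + (1.0 - p0) / p0 * np.exp(log_term))

alpha = np.array([0.7, 0.2, 0.1])   # hypothetical likelihood model
beta  = np.array([0.3, 0.4, 0.3])
counts = np.array([60, 25, 15])     # n_i: observed frequencies of z^1, z^2, z^3
print(posterior_from_counts(0.5, alpha, beta, counts))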
3 Probability convergence for single assessors. For a single assessor revising his estimate of the likelihood of event A, let the probability model be given by p(z = z^i |A) = αi and p(z = z^i |Ā) = βi . [sent-120, score-0.413]
53 It is convenient to rewrite (1) in terms of the ratios ρi = n_i /n; this yields expression (2). [sent-121, score-0.034]
54 Two facts follow: 1) the empirical distribution ρ converges almost surely to the true generating distribution, and 2) the convergence properties of Pn are determined by the quantity between the square brackets in (2). [sent-125, score-0.038]
55 Specifically, let L∞ = lim_{n→∞} ∏_{i=1}^{K} (βi /αi )^{ρi}; L∞ is commonly referred to as the likelihood ratio, familiar from classical binary hypothesis testing. [sent-126, score-0.137]
56 Matched likelihood functions. Assume that the likelihood model is both infinitely precise and infinitely accurate, meaning that when A (resp. Ā) obtains, the observations are generated i.i.d. according to α (resp. β). [sent-133, score-0.236]
57 Assume that A obtains; then ρi → αi and L∞ = ∏_{i=1}^{K} (βi /αi )^{αi} a.s. [sent-139, score-0.029]
58 Taking logarithms, log L∞ = Σ_{i=1}^{K} αi log(βi /αi ) = −D(α||β) < 0, where all relations hold a.s. [sent-141, score-0.058]
59 Since log L∞ < 0 ⇔ L∞ < 1, this implies that when the true generating distribution is α, Pn → 1 a.s. [sent-144, score-0.038]
60 Similarly, when Ā obtains, we have log L∞ = Σ_{i=1}^{K} βi log(βi /αi ) = D(β||α) > 0 and Pn → 0 a.s. [sent-146, score-0.058]
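A short simulation illustrates the matched case: when the data are generated from α the posterior is driven to 1 (at a rate governed by D(α||β)), and to 0 when generated from β. The numerical values are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.7, 0.2, 0.1])   # p(z^i | A)
beta  = np.array([0.3, 0.4, 0.3])   # p(z^i | complement of A)

def posterior_after(gen, n=500, p0=0.5):
    z = rng.choice(len(alpha), size=n, p=gen)
    counts = np.bincount(z, minlength=len(alpha))
    log_term = counts @ (np.log(beta) - np.log(alpha))
    return 1.0 / (1.0 + (1.0 - p0) / p0 * np.exp(log_term))

print(posterior_after(alpha))   # close to 1: A obtains, log L_inf -> -D(alpha||beta) < 0
print(posterior_after(beta))    # close to 0: the complement obtains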
61 Mismatched likelihood functions. Now consider the situation where the expert-assessed likelihood model is incorrect. [sent-149, score-0.254]
62 Assume the observation generating distribution is γ = P(Zi = z), where γ ≠ α and γ ≠ β. [sent-150, score-0.074]
63 We define T (γ) = −log L∞ = Σ_i γi log(αi /βi ) (3). Then the probability simplex over the observation space Z can be partitioned into two sets: P0 = {γ | T (γ) < 0} and P1 = {γ | T (γ) > 0}. [sent-152, score-0.151]
64 (The boundary set {γ | T (γ) = 0} represents an unstable equilibrium in which Pn fails to converge a.s. to either 0 or 1.) [sent-156, score-0.033]
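For a given likelihood model the partition is easy to compute. A sketch with hypothetical numbers:

import numpy as np

def T(gamma, alpha, beta):
    # Equation (3): T(gamma) = sum_i gamma_i * log(alpha_i / beta_i).
    return float(np.sum(gamma * (np.log(alpha) - np.log(beta))))

alpha = np.array([0.7, 0.2, 0.1])
beta  = np.array([0.3, 0.4, 0.3])
for gamma in ([0.6, 0.3, 0.1], [0.2, 0.4, 0.4]):
    t = T(np.array(gamma), alpha, beta)
    region = "P1 (P_n -> 1)" if t > 0 else "P0 (P_n -> 0)" if t < 0 else "boundary"
    print(gamma, round(t, 3), region)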
65 The problem of mismatched likelihood functions is similar to composite hypothesis testing. [sent-159, score-0.226]
66 Composite hypothesis testing attempts to design tests to determine the truth or falsity of a hypothesis with some ambiguity in the underlying parameter space. [sent-162, score-0.085]
67 Because of this ambiguity, each hypothesis Hi corresponds not to a single distribution, but to a set of possible distributions. [sent-163, score-0.03]
68 In the mismatched likelihood function problem, composite spaces are formed due to the properties of Bayes’ rule for a specific likelihood model. [sent-164, score-0.344]
69 A corollary of the above result is that if Hi ⊆ Pi then Bayes’ rule (under the specific likelihood model) is an asymptotically perfect detector. [sent-165, score-0.148]
70 4 Multiple Assessors with Structural Constraints In Section 3 we analyzed convergence properties of a single event under assessment. [sent-166, score-0.116]
71 Considering multiple events introduces the challenge of defining a dynamic concept of coherence for the assessment revision process. [sent-167, score-0.625]
72 In this section we suggest two possible definitions of dynamic coherence and consider some of the implications of these definitions. [sent-168, score-0.261]
73 Step-wise Coherence. We first introduce a step-wise definition of coherence and derive equivalent conditions for the special class of two-expert likelihood models. [sent-170, score-0.138]
74 Definition 1. Under the Bayes' rule revision process, a likelihood model p(z|A) is step-wise coherent (SWC) if Pn ∈ convhull(χ) ⇒ Pn+1 ∈ convhull(χ) for all z ∈ Z. [sent-171, score-0.434]
75 Essentially this definition says that if the posterior assessment process is coherent at any time, it will remain coherent perpetually, independent of observation sequence. [sent-172, score-0.67]
76 We derive necessary and sufficient conditions for SWC for the characteristic matrix given by χ = [1 1 0; 1 0 0] (4), corresponding to a pair of nested events (A2 ⊆ A1 ). Generalizations of this development are possible for any χ ∈ {0, 1}^{2×|Θ|}. [sent-173, score-0.088]
77 Note that under the characteristic matrix given by (4) a model is SWC iff Pn (A1 ) ≥ Pn (A2 ) for all n and all coherent P0 . [sent-174, score-0.337]
78 Due to the continuity of the update rule, a model will be SWC iff it is coherent at the margins. [sent-178, score-0.28]
79 Checking coherence at these degenerate margins shows that, for χ given by (4), the model will be SWC iff the posterior ordering Pn+1 (A1 ) ≥ Pn+1 (A2 ) is preserved for every observation, i.e., [sent-182, score-0.081]
80 ∀i, αi1 /αi2 ≥ βi1 /βi2 (5). Asymptotic coherence. While it is relatively simple to characterize coherent models in the two-assessor case, in general SWC is difficult to check. [sent-183, score-0.503]
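In the two-assessor case, condition (5) can be checked directly. The sketch below uses hypothetical likelihood values αij = p(z^i | Aj) and βij = p(z^i | Āj) and checks the reconstructed form of (5) given above.

import numpy as np

def is_swc(alpha, beta):
    # Condition (5) for the nested two-event case: for every symbol i,
    # alpha[i,0]/alpha[i,1] >= beta[i,0]/beta[i,1], written without divisions.
    return bool(np.all(alpha[:, 0] * beta[:, 1] >= beta[:, 0] * alpha[:, 1]))

alpha = np.array([[0.6, 0.735],    # columns: event A1, event A2 (hypothetical values)
                  [0.2, 0.155],
                  [0.2, 0.110]])
beta  = np.array([[0.30, 0.49],
                  [0.35, 0.31],
                  [0.35, 0.20]])
print(is_swc(alpha, beta))                      # True for these values
print(is_swc(alpha[:, ::-1], beta[:, ::-1]))    # False when the event roles are swapped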
81 As such, we introduce a simpler condition: Definition 2. A likelihood model p(z|A) is weakly asymptotically coherent (WAC) if, for every observation generating distribution γ for which the limit exists, limn→∞ Pn ∈ convhull(χ). [sent-184, score-0.401]
82 Then, by a separating hyperplane argument, there must exist some n (and therefore some zn ) such that Pn ∉ convhull(χ). [sent-199, score-0.08]
83 This contradicts the assumption that the likelihood model is SWC. [sent-202, score-0.107]
84 WAC for static models. Analogous to (3), we define Tj (γ) = Σ_i γi log(αij /βij ) (6). [sent-209, score-0.057]
85 For a given γ, define the logical vector r(γ) componentwise by rj (γ) = 1 if Tj (γ) > 0, rj (γ) = 0 if Tj (γ) < 0, and rj (γ) = undet if Tj (γ) = 0. Lemma 2. A likelihood model is WAC if, for every γ such that r(γ) has no undetermined components, r(γ) is a column of χ. [sent-210, score-0.151]
86 Lemma 2 states that for a WAC likelihood model, {Pi } partitions the simplex (excluding unstable edge events) into sets of distributions such that γ ∈ Pi ⇒ Pn → χei . [sent-216, score-0.199]
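A sketch of the resulting check, with an assumed nested-event characteristic matrix and hypothetical likelihoods: compute r(γ) and flag any γ whose fully determined limit vector is not a column of χ.

import numpy as np

def r_vector(gamma, alpha, beta):
    # r_j(gamma): 1 if T_j(gamma) > 0, 0 if T_j(gamma) < 0, None if undetermined.
    T = gamma @ (np.log(alpha) - np.log(beta))
    return [1 if t > 0 else 0 if t < 0 else None for t in T]

def violates_wac(gamma, alpha, beta, chi):
    # True if r(gamma) is fully determined yet matches no column of chi.
    r = r_vector(gamma, alpha, beta)
    if None in r:
        return False
    return not any(np.array_equal(r, chi[:, k]) for k in range(chi.shape[1]))

chi = np.array([[1, 1, 0],                                     # assumed nested-event encoding
                [1, 0, 0]])
alpha = np.array([[0.6, 0.735], [0.2, 0.155], [0.2, 0.110]])   # p(z^i | A_j)
beta  = np.array([[0.30, 0.49], [0.35, 0.31], [0.35, 0.20]])   # p(z^i | complement of A_j)
gamma = np.array([0.5, 0.25, 0.25])
print(r_vector(gamma, alpha, beta), violates_wac(gamma, alpha, beta, chi))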
87 Motivating Example Revisited. Consider again the motivating example of the two IDPSs from Section 1. [sent-222, score-0.037]
88 Recall that IDPS1 detects a superset of the attacks detected by IDPS2 , and so this scenario conforms to the characteristic matrix analyzed in Section 4. [sent-224, score-0.238]
89 Therefore (5) gives necessary and sufficient conditions for SWC, while (7) gives necessary and sufficient conditions for WAC. [sent-226, score-0.062]
90 The likelihood models are parameterized by (xj , yj ) so that the likelihood ratios satisfy p(z|Aj )/p(z|Āj ) = (xj /yj )^z (1 − xj )/(1 − yj ) for z = 0, 1, 2, . . . Plugging the given likelihood model into (5) implies that the model is SWC iff, for every z = 0, 1, 2, . . . , (x1 /y1 )^z (1 − x1 )/(1 − y1 ) ≥ (x2 /y2 )^z (1 − x2 )/(1 − y2 ) (8). [sent-229, score-0.107]
91 Equation (8) will be satisfied iff x1 /y1 ≥ x2 /y2 and (1 − x1 )/(1 − y1 ) ≥ (1 − x2 )/(1 − y2 ), which gives a necessary and sufficient condition for SWC. [sent-232, score-0.081]
92 Forming T as defined in (6), we see that Tj (γ) = Σ_z γz [ z log(xj /yj ) + log((1 − xj )/(1 − yj )) ] = µ log(xj /yj ) + log((1 − xj )/(1 − yj )) (9), where µ = Eγ [z]. [sent-234, score-0.248]
93 By the structure of the characteristic matrix, the model will be WAC iff T2 (γ) > 0 ⇒ T1 (γ) > 0 for all µ ≥ 0. [sent-235, score-0.138]
94 Then {γ | Ti (γ) < 0} = {γ | µ < log((1 − xi )/(1 − yi )) / log(yi /xi )}, and therefore the model is WAC iff log(x2 /y2 ) / log((1 − x2 )/(1 − y2 )) ≥ log(x1 /y1 ) / log((1 − x1 )/(1 − y1 )) (10). Comparing the conditions for SWC (8) to those for WAC (10), we see that any parameters satisfying (8) also satisfy (10) but not vice versa. [sent-237, score-0.257]
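As a numerical sanity check of the reconstructed conditions (8) and (10) above, the sketch below samples random parameters with xj > yj and verifies that every parameter set satisfying (8) also satisfies (10), while the converse fails for some parameters. This is an illustration of the claim, not the paper's code.

import numpy as np

def swc_ok(x1, y1, x2, y2):
    # Reconstructed condition (8).
    return x1 / y1 >= x2 / y2 and (1 - x1) / (1 - y1) >= (1 - x2) / (1 - y2)

def wac_ok(x1, y1, x2, y2):
    # Reconstructed condition (10).
    return (np.log(x2 / y2) / np.log((1 - x2) / (1 - y2))
            >= np.log(x1 / y1) / np.log((1 - x1) / (1 - y1)))

rng = np.random.default_rng(1)
always_implied, wac_only = True, 0
for _ in range(10000):
    y = rng.uniform(0.05, 0.90, size=2)
    x = y + 0.01 + (0.95 - y - 0.01) * rng.random(2)   # ensures y_j < x_j < 0.95
    s, w = swc_ok(x[0], y[0], x[1], y[1]), wac_ok(x[0], y[0], x[1], y[1])
    always_implied = always_implied and ((not s) or w)
    wac_only += int(w and not s)
print(always_implied, wac_only)   # expected: True, and a strictly positive count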
95 5 Coherence with only finitely many observations As shown in Sections 3 and 4, a WAC likelihood model generates a partition {Pi } over the observation probability simplex such that γ ∈ Pi ⇒ Pn → χei . [sent-244, score-0.304]
96 The question we now address is, given a WAC likelihood model and finitely many observations (with empirical distribution γ̂n ), how to revise an incoherent posterior probability assessment Pn so that it is both coherent and consistent with the observed data. [sent-245, score-0.711]
97 Principle of Conserving Predictive Uncertainty: Given γ̂n , choose λ such that λi = Pr[limn→∞ γ̂n ∈ Pi ] for each i (where γ ∈ Pi iff Pn → χei ). [sent-246, score-0.09]
98 As will be shown in Section 6, the principle of conserving predictive uncertainty can even be effectively applied to non-WAC models. [sent-257, score-0.09]
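Equation (11) is not reproduced in this excerpt; the sketch below stands in for it with a Dirichlet posterior over γ fit to the empirical counts, estimating each λi by Monte Carlo and returning the coherent assessment χλ. The characteristic matrix and likelihood values are assumed examples.

import numpy as np

def conserve_predictive_uncertainty(chi, alpha, beta, counts, n_samples=20000, seed=0):
    # lambda_i ~ Pr[the limiting region is P_i | observed counts], estimated by
    # sampling gamma from a Dirichlet(1 + counts) posterior (a stand-in for (11)).
    rng = np.random.default_rng(seed)
    gammas = rng.dirichlet(1.0 + counts, size=n_samples)
    T = gammas @ (np.log(alpha) - np.log(beta))        # T_j(gamma) for each draw
    r = (T > 0).astype(int)                            # limiting assessment per draw
    lam = np.array([np.mean(np.all(r == chi[:, k], axis=1)) for k in range(chi.shape[1])])
    lam = lam / lam.sum()                              # draws matching no column are discarded
    return lam, chi @ lam                              # coherent revised assessment

chi = np.array([[1, 1, 0], [1, 0, 0]])
alpha = np.array([[0.6, 0.735], [0.2, 0.155], [0.2, 0.110]])
beta  = np.array([[0.30, 0.49], [0.35, 0.31], [0.35, 0.20]])
lam, revised = conserve_predictive_uncertainty(chi, alpha, beta, np.array([12, 5, 3]))
print(lam, revised)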
99 Sparse coherent approximation. In general |Θ| (the length of the vector λ) can be of order 2^N (where N is the number of assessors), so solving for λ directly using (11) may be computationally infeasible. [sent-259, score-0.199]
100 The neighborhood of Pi is the set of partition elements such that the limit of one (and only one) assessor’s probability assessment has changed. [sent-264, score-0.294]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
Author: Peter Jones, Venkatesh Saligrama, Sanjoy Mitter
Abstract: Experts (human or computer) are often required to assess the probability of uncertain events. When a collection of experts independently assess events that are structurally interrelated, the resulting assessment may violate fundamental laws of probability. Such an assessment is termed incoherent. In this work we investigate how the problem of incoherence may be affected by allowing experts to specify likelihood models and then update their assessments based on the realization of a globally-observable random sequence. Keywords: Bayesian Methods, Information Theory, consistency
2 0.10061869 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
3 0.078431748 107 nips-2010-Global seismic monitoring as probabilistic inference
Author: Nimar Arora, Stuart Russell, Paul Kidwell, Erik B. Sudderth
Abstract: The International Monitoring System (IMS) is a global network of sensors whose purpose is to identify potential violations of the Comprehensive Nuclear-Test-Ban Treaty (CTBT), primarily through detection and localization of seismic events. We report on the first stage of a project to improve on the current automated software system with a Bayesian inference system that computes the most likely global event history given the record of local sensor data. The new system, VISA (Vertically Integrated Seismological Analysis), is based on empirically calibrated, generative models of event occurrence, signal propagation, and signal detection. VISA exhibits significantly improved precision and recall compared to the current operational system and is able to detect events that are missed even by the human analysts who post-process the IMS output. 1
4 0.075869903 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
5 0.063108467 27 nips-2010-Agnostic Active Learning Without Constraints
Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu
Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification. 1
6 0.061164383 20 nips-2010-A unified model of short-range and long-range motion perception
7 0.052778237 98 nips-2010-Functional form of motion priors in human motion perception
8 0.050880305 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates
9 0.045768406 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
10 0.042515319 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.041444637 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts
12 0.039489407 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
13 0.039204884 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
14 0.038950972 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
16 0.036969338 285 nips-2010-Why are some word orders more common than others? A uniform information density account
17 0.036491178 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
18 0.036201522 118 nips-2010-Implicit Differentiation by Perturbation
19 0.03584915 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
20 0.03576915 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
topicId topicWeight
[(0, 0.116), (1, 0.009), (2, 0.013), (3, 0.032), (4, -0.03), (5, 0.02), (6, -0.038), (7, 0.0), (8, 0.034), (9, -0.002), (10, -0.027), (11, -0.08), (12, -0.049), (13, 0.044), (14, 0.0), (15, 0.024), (16, -0.002), (17, 0.054), (18, 0.034), (19, -0.039), (20, -0.005), (21, -0.013), (22, 0.015), (23, 0.04), (24, 0.098), (25, 0.006), (26, 0.027), (27, -0.024), (28, -0.015), (29, 0.033), (30, -0.073), (31, 0.04), (32, 0.066), (33, 0.027), (34, 0.043), (35, -0.016), (36, -0.021), (37, 0.04), (38, -0.052), (39, -0.084), (40, -0.043), (41, 0.046), (42, 0.058), (43, 0.113), (44, 0.117), (45, 0.015), (46, -0.005), (47, 0.089), (48, -0.009), (49, -0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.93153155 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
Author: Peter Jones, Venkatesh Saligrama, Sanjoy Mitter
Abstract: Experts (human or computer) are often required to assess the probability of uncertain events. When a collection of experts independently assess events that are structurally interrelated, the resulting assessment may violate fundamental laws of probability. Such an assessment is termed incoherent. In this work we investigate how the problem of incoherence may be affected by allowing experts to specify likelihood models and then update their assessments based on the realization of a globally-observable random sequence. Keywords: Bayesian Methods, Information Theory, consistency
2 0.60586184 107 nips-2010-Global seismic monitoring as probabilistic inference
Author: Nimar Arora, Stuart Russell, Paul Kidwell, Erik B. Sudderth
Abstract: The International Monitoring System (IMS) is a global network of sensors whose purpose is to identify potential violations of the Comprehensive Nuclear-Test-Ban Treaty (CTBT), primarily through detection and localization of seismic events. We report on the first stage of a project to improve on the current automated software system with a Bayesian inference system that computes the most likely global event history given the record of local sensor data. The new system, VISA (Vertically Integrated Seismological Analysis), is based on empirically calibrated, generative models of event occurrence, signal propagation, and signal detection. VISA exhibits significantly improved precision and recall compared to the current operational system and is able to detect events that are missed even by the human analysts who post-process the IMS output. 1
Author: Ken Takiyama, Masato Okada
Abstract: 019 020 We propose an algorithm for simultaneously estimating state transitions among neural states, the number of neural states, and nonstationary firing rates using a switching state space model (SSSM). This algorithm enables us to detect state transitions on the basis of not only the discontinuous changes of mean firing rates but also discontinuous changes in temporal profiles of firing rates, e.g., temporal correlation. We construct a variational Bayes algorithm for a non-Gaussian SSSM whose non-Gaussian property is caused by binary spike events. Synthetic data analysis reveals that our algorithm has the high performance for estimating state transitions, the number of neural states, and nonstationary firing rates compared to previous methods. We also analyze neural data that were recorded from the medial temporal area. The statistically detected neural states probably coincide with transient and sustained states that have been detected heuristically. Estimated parameters suggest that our algorithm detects the state transition on the basis of discontinuous changes in the temporal correlation of firing rates, which transitions previous methods cannot detect. This result suggests that our algorithm is advantageous in real-data analysis. 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053
4 0.46015885 3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation
Author: Vicky Froyen, Jacob Feldman, Manish Singh
Abstract: Figure/ground assignment, in which the visual image is divided into nearer (figural) and farther (ground) surfaces, is an essential step in visual processing, but its underlying computational mechanisms are poorly understood. Figural assignment (often referred to as border ownership) can vary along a contour, suggesting a spatially distributed process whereby local and global cues are combined to yield local estimates of border ownership. In this paper we model figure/ground estimation in a Bayesian belief network, attempting to capture the propagation of border ownership across the image as local cues (contour curvature and T-junctions) interact with more global cues to yield a figure/ground assignment. Our network includes as a nonlocal factor skeletal (medial axis) structure, under the hypothesis that medial structure “draws” border ownership so that borders are owned by the skeletal hypothesis that best explains them. We also briefly present a psychophysical experiment in which we measured local border ownership along a contour at various distances from an inducing cue (a T-junction). Both the human subjects and the network show similar patterns of performance, converging rapidly to a similar pattern of spatial variation in border ownership along contours. Figure/ground assignment (further referred to as f/g), in which the visual image is divided into nearer (figural) and farther (ground) surfaces, is an essential step in visual processing. A number of factors are known to affect f/g assignment, including region size [9], convexity [7, 16], and symmetry [1, 7, 11]. Figural assignment (often referred to as border ownership, under the assumption that the figural side “owns” the border) is usually studied globally, meaning that entire surfaces and their enclosing boundaries are assumed to receive a globally consistent figural status. But recent psychophysical findings [8] have suggested that border ownership can vary locally along a boundary, even leading to a globally inconsistent figure/ground assignment—broadly consistent with electrophysiological evidence showing local coding for border ownership in area V2 as early as 68 msec after image onset [20]. This suggests a spatially distributed and potentially competitive process of figural assignment [15], in which adjacent surfaces compete to own their common boundary, with figural status propagating across the image as this competition proceeds. But both the principles and computational mechanisms underlying this process are poorly understood. ∗ V.F. was supported by a Fullbright Honorary fellowship and by the Rutgers NSF IGERT program in Perceptual Science, NSF DGE 0549115, J.F. by NIH R01 EY15888, and M.S. by NSF CCF-0541185 1 In this paper we consider how border ownership might propagate over both space and time—that is, across the image as well as over the progression of computation. Following Weiss et al. [18] we adopt a Bayesian belief network architecture, with nodes along boundaries representing estimated border ownership, and connections arranged so that both neighboring nodes and nonlocal integrating nodes combine to influence local estimates of border ownership. Our model is novel in two particular respects: (a) we combine both local and global influences on border ownership in an integrated and principled way; and (b) we include as a nonlocal factor skeletal (medial axis) influences on f/g assignment. 
Skeletal structure has not been previously considered as a factor on border ownership, but its relevance follows from a model [4] in which shapes are conceived of as generated by or “grown” from an internal skeleton, with the consequence that their boundaries are perceptually “owned” by the skeletal side. We also briey present a psychophysical experiment in which we measured local border ownership along a contour, at several distances from a strong local f/g inducing cue, and at several time delays after the onset of the cue. The results show measurable spatial differences in judged border ownership, with judgments varying with distance from the inducer; but no temporal effect, with essentially asymptotic judgments even after very brief exposures. Both results are consistent with the behavior of the network, which converges quickly to an asymptotic but spatially nonuniform f/g assignment. 1 The Model The Network. For simplicity, we take an edge map as input for the model, assuming that edges and T-junctions have already been detected. From this edge map we then create a Bayesian belief network consisting of four hierarchical levels. At the input level the model receives evidence E from the image, consisting of local contour curvature and T-junctions. The nodes for this level are placed at equidistant locations along the contour. At the first level the model estimates local border ownership. The border ownership, or B-nodes at this level are at the same locations as the E-nodes, but are connected to their nearest neighbors, and are the parent of the E-node at their location. (As a simplifying assumption, such connections are broken at T-junctions in such a way that the occluded contour is disconnected from the occluder.) The highest level has skeletal nodes, S, whose positions are defined by the circumcenters of the Delaunay triangulation on all the E-nodes, creating a coarse medial axis skeleton [13]. Because of the structure of the Delaunay, each S-node is connected to exactly three E-nodes from which they receive information about the position and the local tangent of the contour. In the current state of the model the S-nodes are “passive”, meaning their posteriors are computed before the model is initiated. Between the S nodes and the B nodes are the grouping nodes G. They have the same positions as the S-nodes and the same Delaunay connections, but to B-nodes that have the same image positions as the E-nodes. They will integrate information from distant B-nodes, applying an interiority cue that is influenced by the local strength of skeletal axes as computed by the S-nodes (Fig. 1). Although this is a multiply connected network, we have found that given reasonable parameters the model converges to intuitive posteriors for a variety of shapes (see below). Updating. Our goal is to compute the posterior p(Bi |I), where I is the whole image. Bi is a binary variable coding for the local direction of border ownership, that is, the side that owns the border. In order for border ownership estimates to be influenced by image structure elsewhere in the image, information has to propagate throughout the network. To achieve this propagation, we use standard equations for node updating [14, 12]. However while to all other connections being directed, connections at the B-node level are undirected, causing each node to be child and parent node at the same time. Considering only the B-node level, a node Bi is only separated from the rest of the network by its two neighbors. 
Hence the Markovian property applies, in that Bi only needs to get iterative information from its neighbors to eventually compute p(Bi |I). So considering the whole network, at each iteration t, Bi receives information from both its child, Ei and from its parents—that is neigbouring nodes (Bi+1 and Bi−1 )—as well as all grouping nodes connected to it (Gj , ..., Gm ). The latter encode for interiority versus exteriority, interiority meaning that the B-node’s estimated gural direction points towards the G-node in question, exteriority meaning that it points away. Integrating all this information creates a multidimensional likelihood function: p(Bi |Bi−1 , Bi+1 , Gj , ..., Gm ). Because of its complexity we choose to approximate it (assuming all nodes are marginally independent of each other when conditioned on Bi ) by 2 Figure 1: Basic network structure of the model. Both skeletal (S-nodes) and border-ownerhsip nodes (B-nodes) get evidence from E-nodes, though different types. S-nodes receive mere positional information, while B-nodes receive information about local curvature and the presence of T-junctions. Because of the structure of the Delaunay triangulation S-nodes and G-nodes (grouping nodes) always get input from exactly three nodes, respectively E and B-nodes. The gray color depicts the fact that this part of the network is computed before the model is initiated and does not thereafter interact with the dynamics of the model. m p(Bi |Pj , ..., Pm ) ∝ p(Bi |Pj ) (1) j where the Pj ’s are the parents of Bi . Given this, at each iteration, each node Bi performs the following computation: Bel(Bi ) ← cλ(Bi )π(Bi )α(Bi )β(Bi ) (2) where conceptually λ stands for bottom-up information, π for top down information and α and β for information received from within the same level. More formally, λ(Bi ) ← p(E|Bi ) (3) m π(Bi ) ← p(Bi |Gj )πGj (Bi ) j (4) Gj and analogously to equation 4 for α(Bi ) and β(Bi ), which compute information coming from Bi−1 and Bi+1 respectively. For these πBi−1 (Bi ), πBi+1 (Bi ), and πGj (Bi ): πGj (Bi ) ← c π(G) λBk (Gj ) (5) k=i πBi−1 (Bi ) ← c β(Bi−1 )λ(Bi−1 )π(Bi−1 ) 3 (6) and πBi+1 (Bi ) is analogous to πBi−1 (Bi ), with c and c being normalization constants. Finally for the G-nodes: Bel(Gi ) ← cλ(Gi )π(Gi ) λ(Gi ) ← (7) λBj (Gi ) (8) j m λBj (Gi ) ← λ(Bj )p(Bi |Gj )[α(Bj )β(Bj ) Bj p(Bi |Gk )πGk (Bi )] (9) k=i Gk The posteriors of the S-nodes are used to compute the π(Gi ). This posterior computes how well the S-node at each position explains the contour—that is, how well it accounts for the cues flowing from the E-nodes it is connected to. Each Delaunay connection between S- and E-nodes can be seen as a rib that sprouts from the skeleton. More specifically each rib sprouts in a direction that is normal (perpendicular) to the tangent of the contour at the E-node plus a random error φi chosen independently for each rib from a von Mises distribution centered on zero, i.e. φi ∼ V (0, κS ) with spread parameter κS [4]. The rib lengths are drawn from an exponential decreasing density function p(ρi ) ∝ e−λS ρi [4]. We can now express how well this node “explains” the three E-nodes it is connected to via the probability that this S-node deserves to be a skeletal node or not, p(S = true|E1 , E2 , E3 ) ∝ p(ρi )p(φi ) (10) i with S = true depicting that this S-node deserves to be a skeletal node. 
From this we then compute the prior π(Gi ) in such a way that good (high posterior) skeletal nodes induce a high interiority bias, hence a stronger tendency to induce figural status. Conversely, bad (low posterior) skeletal nodes create a prior close to indifferent (uniform) and thus have less (or no) influence on figural status. Likelihood functions Finally we need to express the likelihood function necessary for the updating rules described above. The first two likelihood functions are part of p(Ei |Bi ), one for each of the local cues. The first one, reflecting local curvature, gives the probability of the orientations of the two vectors inherent to Ei (α1 and α2 ) given both direction of figure (θ) encoded in Bi as a von Mises density centered on θ, i.e. αi ∼ V (θ, κEB ). The second likelihood function, reflecting the presence of a T-junction, simply assumes a fixed likelihood when a T-junction is present—that is p(T-junction = true|Bi ) = θT , where Bi places the direction of figure in the direction of the occluder. This likelihood function is only in effect when a T-junction is present, replacing the curvature cue at that node. The third likelihood function serves to keep consistency between nodes of the first level. This function p(Bi |Bi−1 ) or p(Bi |Bi+1 ) is used to compute α(B) and β(B) and is defined 2x2 conditional probability matrix with a single free parameter, θBB (the probability that figural direction at both B-nodes are the same). A fourth and final likelihood function p(Bi |Gj ) serves to propagate information between level one and two. This likelihood function is 2x2 conditional probability matrix matrix with one free parameter, θBG . In this case θBG encodes the probability that the figural direction of the B-node is in the direction of the exterior or interior preference of the G-node. In total this brings us to six free parameters in the model: κS , λS , κEB , θT , θBB , and θBG . 2 Basic Simulations To evaluate the performance of the model, we first tested it on several basic stimulus configurations in which the desired outcome is intuitively clear: a convex shape, a concave shape, a pair of overlapping shapes, and a pair of non-overlapping shapes (Fig. 2,3). The convex shape is the simplest in that curvature never changes sign. The concave shape includes a region with oppositely signed curvature. (The shape is naturally described as predominantly positively curved with a region of negative curvature, i.e. a concavity. But note that it can also be interpreted as predominantly negatively curved “window” with a region of positive curvature, although this is not the intuitive interpretation.) 4 The overlapping pair of shapes consists of two convex shapes with one partly occluding the other, creating a competition between the two shapes for the ownership of the common borderline. Finally the non-overlapping shapes comprise two simple convex shapes that do not touch—again setting up a competition for ownership of the two inner boundaries (i.e. between each shape and the ground space between them). Fig. 2 shows the network structures for each of these four cases. Figure 2: Network structure for the four shape categories (left to right: convex, concave, overlapping, non-overlapping shapes). Blue depict the locations of the B-nodes (and also the E-nodes), the red connections are the connections between B-nodes, the green connections are connections between B-nodes and G-nodes, and the G-nodes (and also the S-nodes) go from orange to dark red. 
This colour code depicts low (orange) to high (dark red) probability that this is a skeletal node, and hence the strength of the interiority cue. Running our model with hand-estimated parameter values yields highly intuitive posteriors (Fig. 3), an essential “sanity check” to ensure that the network approximates human judgments in simple cases. For the convex shape the model assigns figure to the interior just as one would expect even based solely on local curvature (Fig. 3A). In the concave figure (Fig. 3B), estimated border ownership begins to reverse inside the deep concavity. This may seem surprising, but actually closely matches empirical results obtained when local border ownership is probed psychophysically inside a similarly deep concavity, i.e. a “negative part” in which f/g seems to partly reverse [8]. For the overlapping shapes posteriors were also intuitive, with the occluding shape interpreted as in front and owning the common border (Fig. 3C). Finally, for the two non-overlapping shapes the model computed border-ownership just as one would expect if each shape were run separately, with each shape treated as figural along its entire boundary (Fig. 3D). That is, even though there is skeletal structure in the ground-region between the two shapes (see Fig. 2D), its posterior is weak compared to the skeletal structure inside the shapes, which thus loses the competition to own the boundary between them. For all these configurations, the model not only converged to intuitive estimates but did so rapidly (Fig. 4), always in fewer cycles than would be expected by pure lateral propagation, niterations < Nnodes [18] (with these parameters, typically about five times faster). Figure 3: Posteriors after convergence for the four shape categories (left to right: convex, concave, overlapping, non-overlapping). Arrows indicate estimated border ownership, with direction pointing to the perceived figural side, and length proportional to the magnitude of the posterior. All four simulations used the same parameters. 5 Figure 4: Convergence of the model for the basic shape categories. The vertical lines represent the point of convergence for each of the three shape categories. The posterior change is calculated as |p(Bi = 1|I)t − p(Bi = 1|I)t−1 | at each iteration. 3 Comparison to human data Beyond the simple cases reviewed above, we wished to submit our network to a more fine-grained comparison with human data. To this end we compared its performance to that of human subjects in an experiment we conducted (to be presented in more detail in a future paper). Briefly, our experiment involved finding evidence for propagation of f/g signals across the image. Subjects were first shown a stimulus in which the f/g configuration was globally and locally unambiguous and consistent: a smaller rectangle partly occluding a larger one (Fig. 5A), meaning that the smaller (front) one owns the common border. Then this configuration was perturbed by adding two bars, of which one induced a local f/g reversal—making it now appear locally that the larger rectangle owned the border (Fig. 5B). (The other bar in the display does not alter f/g interpretation, but was included to control for the attentional affects of introducing a bar in the image.) The inducing bar creates T-junctions that serve as strong local f/g cues, in this case tending to reverse the prior global interpretation of the figure. 
We then measured subjective border ownership along the central contour at various distances from the inducing bar, and at different times after the onset of the bar (25ms, 100ms and 250ms). We measured border ownership locally using a method introduced in [8] in which a local motion probe is introduced at a point on the boundary between two color regions of different colors, and the subject is asked which color appeared to move. Because the figural side “owns” the border, the response reflects perceived figural status. The goal of the experiment was to actually measure the progression of the influence of the inducing T-junction as it (hypothetically) propagated along the boundary. Briefly, we found no evidence of temporal differences, meaning that f/g judgments were essentially constant over time, suggesting rapid convergence of local f/g assignment. (This is consistent with the very rapid convergence of our network, which would suggest a lack of measurable temporal differences except at much shorter time scales than we measured.) But we did find a progressive reduction of f/g reversal with increasing distance from the inducer—that is, the influence of the T-junction decayed with distance. Mean responses aggregated over subjects (shortest delay only) are shown in Fig. 6. In order to run our model on this stimulus (which has a much more complex structure than the simple figures tested above) we had to make some adjustments. We removed the bars from the edge map, leaving only the T-junctions as underlying cues. This was a necessary first step because our model is not yet able to cope with skeletons that are split up by occluders. (The larger rectangle’s skeleton has been split up by the lower bar.) In this way all contours except those created by the bars were used to create the network (Fig. 7). Given this network we ran the model using hand-picked parameters that 6 Figure 5: Stimuli used in the experiment. A. Initial stimulus with locally and globally consistent and unambiguous f/g. B. Subsequently bars were added of which one (the top bar in this case) created a local reversal of f/g. C. Positions at which local f/g judgments of subjects were probed. Figure 6: Results from our experiment aggregated for all 7 subjects (shortest delay only) are shown in red. The x-axis shows distance from the inducing bar at which f/g judgment was probed. The y-axis shows the proportion of trials on which subjects judged the smaller rectangle to own the boundary. As can be seen, the further from the T-junction, the lower the f/g reversal. The fitted model (green curve) shows very similar pattern. Horizontal black line indicates chance performance (ambiguous f/g). gave us the best possible qualitative similarity to the human data. The parameters used never entailed total elimination of the influence of any likelihood function (κS = 16, λS = .025, κEB = .5, θT = .9, θBB = .9, and θBG = .6). As can be seen in Fig. 6 the border-ownership estimates at the locations where we had data show compelling similarities to human judgments. Furthermore along the entire contour the model converged to intuitive border-ownership estimates (Fig. 7) very rapidly (within 36 iterations). The fact that our model yielded intuitive estimates for the current network in which not all contours were completed shows another strength of our model. Because our model included grouping nodes, it did not require contours to be amodally completed [6] in order for information to propagate. 
4 Conclusion In this paper we proposed a model rooted in Bayesian belief networks to compute figure/ground. The model uses both local and global cues, combined in a principled way, to achieve a stable and apparently psychologically reasonable estimate of border ownership. Local cues included local curvature and T-junctions, both well-established cues to f/g. Global cues included skeletal structure, 7 Figure 7: (left) Node structure for the experimental stimulus. (right) The model’s local borderownership estimates after convergence. a novel cue motivated by the idea that strongly axial shapes tend to be figural and thus own their boundaries. We successfully tested this model on both simple displays, in which it gave intuitive results, and on a more complex experimental stimulus, in which it gave a close match to the pattern of f/g propagation found in our subjects. Specifically, the model, like the human subjects rapidly converged to a stable local f/g interpretation. Our model’s structure shows several interesting parallels to properties of neural coding of border ownership in visual cortex. Some cortical cells (end-stopped cells) appear to code for local curvature [3] and T-junctions [5]. The B-nodes in our model could be seen as corresponding to cells that code for border ownership [20]. Furthermore, some authors [2] have suggested that recurrent feedback loops between border ownership cells in V2 and cells in V4 (corresponding to G-nodes in our model) play a role in the rapid computation of border ownership. The very rapid convergence we observed in our model likewise appears to be due to the connections between B-nodes and G-nodes. Finally scale-invariant shape representations (such as, speculatively, those based on skeletons) are thought to be present in higher cortical regions such as IT [17], which project down to earlier areas in ways that are not yet understood. A number of parallels to past models of f/g should be mentioned. Weiss [18] pioneered the application of belief networks to the f/g problem, though their network only considered a more restricted set of local cues and no global ones, such that information only propagated along the contour. Furthermore it has not been systematically compared to human judgments. Kogo et al. [10] proposed an exponential decay of f/g signals as they spread throughout the image. Our model has a similar decay for information going through the G-nodes, though it is also influenced by an angular factor defined by the position of the skeletal node. Like the model by Li Zhaoping [19], our model includes horizontal propagation between B-nodes, analogous to border-ownership cells in her model. A neurophysiological model by Craft et al. [2] defines grouping cells coding for an interiority preference that decays with the size of the receptive fields of these grouping cells. Our model takes this a step further by including shape (skeletal) structure as a factor in interiority estimates, rather than simply size of receptive fields (which is similar to the rib lengths in our model). Currently, our use of skeletons as shape representations is still limited to medial axis skeletons and surfaces that are not split up by occluders. Our future goals including integrating skeletons in a more robust way following the probabilistic account suggested by Feldman and Singh [4]. 
Eventually, we hope to fully integrate skeleton computation with f/g computation so that the more general problem of shape and surface estimation can be approached in a coherent and unified fashion. 8 References [1] P. Bahnsen. Eine untersuchung uber symmetrie und assymmetrie bei visuellen wahrnehmungen. Zeitschrift fur psychology, 108:129–154, 1928. [2] E. Craft, H. Sch¨ tze, E. Niebur, and R. von der Heydt. A neural model of figure-ground u organization. Journal of Neurophysiology, 97:4310–4326, 2007. [3] A. Dobbins, S. W. Zucker, and M. S. Cyander. Endstopping and curvature. Vision Research, 29:1371–1387, 1989. [4] J. Feldman and M. Singh. Bayesian estimation of the shape skeleton. Proceedings of the National Academy of Sciences, 103:18014–18019, 2006. [5] B. Heider, V. Meskenaite, and E. Peterhans. Anatomy and physiology of a neural mechanism defining depth order and contrast polarity at illusory contours. European Journal of Neuroscience, 12:4117–4130, 2000. [6] G. Kanizsa. Organization inVision. New York: Praeger, 1979. [7] G. Kanizsa and W. Gerbino. Vision and Artifact, chapter Convexity and symmetry in figureground organisation, pages 25–32. New York: Springer, 1976. [8] S. Kim and J. Feldman. Globally inconsistent figure/ground relations induced by a negative part. Journal of Vision, 9:1534–7362, 2009. [9] K. Koffka. Principles of Gestalt Psychology. Lund Humphries, London, 1935. [10] N. Kogo, C. Strecha, L. Van Gool, and J. Wagemans. Surface construction by a 2-d differentiation-integration process: a neurocomputational model for perceived border ownership, depth, and lightness in kanizsa figures. Psychological Review, 117:406–439, 2010. [11] B. Machielsen, M. Pauwels, and J. Wagemans. The role of vertical mirror-symmetry in visual shape detection. Journal of Vision, 9:1–11, 2009. [12] K. Murphy, Y. Weiss, and M.I. Jordan. Loopy belief propagation for approximate inference: an empirical study. Proceedings of Uncertainty in AI, pages 467–475, 1999. [13] R. L. Ogniewicz and O. K¨ bler. Hierarchic Voronoi skeletons. Pattern Recognition, 28:343– u 359, 1995. [14] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988. [15] M. A. Peterson and E. Skow. Inhibitory competition between shape properties in figureground perception. Journal of Experimental Psychology: Human Perception and Performance, 34:251–267, 2008. [16] K. A. Stevens and A. Brookes. The concave cusp as a determiner of figure-ground. Perception, 17:35–42, 1988. [17] K. Tanaka, H. Saito, Y. Fukada, and M. Moriya. Coding visual images of object in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66:170–189, 1991. [18] Y. Weiss. Interpreting images by propagating Bayesian beliefs. Adv. in Neural Information Processing Systems, 9:908915, 1997. [19] L. Zhaoping. Border ownership from intracortical interactions in visual area V2. Neuron, 47(1):143–153, Jul 2005. [20] H. Zhou, H. S. Friedman, and R. von der Heydt. Coding of border ownerschip in monkey visual cortex. The Journal of Neuroscience, 20:6594–6611, 2000. 9
5 0.43279278 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
6 0.42420322 20 nips-2010-A unified model of short-range and long-range motion perception
7 0.40552261 98 nips-2010-Functional form of motion priors in human motion perception
8 0.39646155 269 nips-2010-Throttling Poisson Processes
9 0.39146543 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
10 0.38700214 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
11 0.37412077 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
12 0.37397644 155 nips-2010-Learning the context of a category
13 0.37249863 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
14 0.36366186 285 nips-2010-Why are some word orders more common than others? A uniform information density account
15 0.36034015 129 nips-2010-Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
16 0.35656235 142 nips-2010-Learning Bounds for Importance Weighting
17 0.35101444 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
18 0.34688723 215 nips-2010-Probabilistic Deterministic Infinite Automata
19 0.34626955 224 nips-2010-Regularized estimation of image statistics by Score Matching
20 0.32154924 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
topicId topicWeight
[(7, 0.346), (13, 0.037), (17, 0.015), (27, 0.049), (30, 0.05), (35, 0.017), (45, 0.151), (50, 0.057), (52, 0.051), (60, 0.031), (77, 0.065), (78, 0.019), (90, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.75117964 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
Author: Peter Jones, Venkatesh Saligrama, Sanjoy Mitter
Abstract: Experts (human or computer) are often required to assess the probability of uncertain events. When a collection of experts independently assess events that are structurally interrelated, the resulting assessment may violate fundamental laws of probability. Such an assessment is termed incoherent. In this work we investigate how the problem of incoherence may be affected by allowing experts to specify likelihood models and then update their assessments based on the realization of a globally-observable random sequence. Keywords: Bayesian Methods, Information Theory, consistency
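As a toy illustration of the setting in this abstract (not the paper's actual model), the sketch below has two experts assess nested events A1 ⊆ A2, for which coherence requires P(A1) ≤ P(A2), and update their assessments by Bayes' rule against a globally observable binary sequence under assumed, mismatched likelihood models. Independent updating of this kind can drive an initially coherent pair of assessments incoherent. All numbers and likelihoods are hypothetical.

```python
import numpy as np

# Two nested events A1 ⊆ A2; coherence requires P(A1) <= P(A2).
p = {"A1": 0.3, "A2": 0.3}                  # initial (coherent) assessments
lik = {"A1": (0.8, 0.4), "A2": (0.6, 0.5)}  # assumed likelihoods P(Z=1 | A), P(Z=1 | not A)

def bayes_update(prior, p1, p0, z):
    """Posterior probability of the event after observing Z = z."""
    num = prior * (p1 if z else 1 - p1)
    den = num + (1 - prior) * (p0 if z else 1 - p0)
    return num / den

rng = np.random.default_rng(0)
for z in rng.integers(0, 2, size=10):       # the globally observable binary sequence
    p = {a: bayes_update(p[a], *lik[a], z) for a in p}

print(p, "coherent" if p["A1"] <= p["A2"] else "incoherent")
```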
2 0.52920079 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
Author: David Sontag, Ofer Meshi, Amir Globerson, Tommi S. Jaakkola
Abstract: The problem of learning to predict structured labels is of key importance in many applications. However, for general graph structure both learning and inference are intractable. Here we show that it is possible to circumvent this difficulty when the distribution of training examples is rich enough, via a method similar in spirit to pseudo-likelihood. We show that our new method achieves consistency, and illustrate empirically that it indeed approaches the performance of exact methods when sufficiently large training sets are used. Many prediction problems in machine learning applications are structured prediction tasks. For example, in protein folding we are given a protein sequence and the goal is to predict the protein’s native structure [14]. In parsing for natural language processing, we are given a sentence and the goal is to predict the most likely parse tree [2]. In these and many other applications, we can formalize the structured prediction problem as taking an input x (e.g., primary sequence, sentence) and predicting ˆ y (e.g., structure, parse) according to y = arg maxy∈Y θ · φ(x, y ), where φ(x, y) is a function that ˆ maps any input and a candidate assignment to a feature vector, Y denotes the space of all possible assignments to the vector y, and θ is a weight vector to be learned. This paper addresses the problem of learning structured prediction models from data. In particular, given a set of labeled examples {(xm , y m )}M , our goal is to find a vector θ such that for each m=1 example m, y m = arg maxy∈Y θ · φ(xm , y), i.e. one which separates the training data. For many structured prediction models, maximization over Y is computationally intractable. This makes it difficult to apply previous algorithms for learning structured prediction models, such as structured perceptron [2], stochastic subgradient [10], and cutting-plane algorithms [5], which require making a prediction at every iteration (equivalent to repeatedly solving an integer linear program). Given training data, we can consider the space of parameters Θ that separate the data. This space can be defined by the intersection of a large number of linear inequalities. A recent approach to getting around the hardness of prediction is to use linear programming (LP) relaxations to approximate the maximization over Y [4, 6, 9]. However, separation with respect to a relaxation places stronger constraints on the parameters. The target solution, an integral vertex in the LP, must now distinguish itself also from possible fractional vertexes that arise due to the relaxation. The relaxations can therefore be understood as optimizing over an inner bound of Θ. This set may be empty even if the training data is separable with exact inference [6]. Another obstacle to using LP relaxations for learning is that solving the LPs can be very slow. In this paper we ask whether it is possible to learn while avoiding inference altogether. We propose a new learning algorithm, inspired by pseudo-likelihood [1], that optimizes over an outer bound of Θ. Learning involves optimizing over only a small number of constraints per data point, and thus can be performed quickly, even for complex structured prediction models. We show that, if the data available for learning is “nice”, this algorithm is consistent, i.e. it will find some θ ∈ Θ. This is an example of how having the right data can circumvent the hardness of learning for structured prediction. 1 We also investigate the limitations of the proposed method. 
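To make Eqs. 1-2 concrete, here is a minimal sketch of the pseudo-max surrogate: for each training pair it compares y^m only against labelings that differ in a single coordinate, using a 0/1 per-coordinate loss. The generic discriminant f, the toy linear example, and the variable names are illustrative; regularization and the scaled loss used later in the experiments are omitted.

```python
def pseudo_max_loss(theta, data, f, k=2):
    """Pseudo-max surrogate (Eq. 2): sum over examples and coordinates of the
    hinge against the best single-coordinate perturbation of the true labeling."""
    total = 0.0
    for x, y in data:                      # y is a tuple of labels, each in {0, ..., k-1}
        fy = f(y, x, theta)
        for i in range(len(y)):
            total += max(
                f(y[:i] + (yi,) + y[i + 1:], x, theta) - fy + float(yi != y[i])
                for yi in range(k)
            )
    return total

# toy usage with a linear discriminant f(y; x, theta) = sum_i theta_i * y_i * x_i
f = lambda y, x, theta: sum(t * yi * xi for t, yi, xi in zip(theta, y, x))
data = [((1.0, -2.0, 0.5), (1, 0, 1))]     # a single (x, y^m) pair
print(pseudo_max_loss([1.0, 1.0, 1.0], data, f))
```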
We show that the problem of even deciding whether a given data set is separable is NP-hard, and thus learning in a strict sense is no easier than prediction. Thus, we should not expect for our algorithm, or any other polynomial time algorithm, to always succeed at learning from an arbitrary finite data set. To our knowledge, this is the first result characterizing the hardness of exact learning for structured prediction. Finally, we show empirically that our algorithm allows us to successfully learn the parameters for both multi-label prediction and protein side-chain placement. The performance of the algorithm is improved as more data becomes available, as our theoretical results anticipate. 1 Pseudo-Max method We consider the general structured prediction problem. The input space is denoted by X and the set of all possible assignments by Y. Each y ∈ Y corresponds to n variables y1 , . . . , yn , each with k possible states. The classifier uses a (given) function φ(x, y) : X , Y → Rd and (learned) weights θ ∈ Rd , and is defined as y(x; θ) = arg maxy∈Y f (ˆ ; x, θ) where f is the discriminant function y ˆ f (y; x, θ) = θ · φ(x, y). Our analysis will focus on functions φ whose scope is limited to small sets of the yi variables, but for now we keep the discussion general. Given a set of labeled examples {(xm , y m )}M , the goal of the typical learning problem is to find m=1 weights θ that correctly classify the training examples. Consider first the separable case. Define the set of separating weight vectors, Θ = θ | ∀m, y ∈ Y, f (y m ; xm , θ) ≥ f (y; xm , θ)+e(y, y m ) . e is a loss function (e.g., zero-one or Hamming) such that e(y m , y m ) = 0 and e(y, y m ) > 0 for y = y m , which serves to rule out the trivial solution θ = 0.1 The space Θ is defined by exponentially many constraints per example, one for each competing assignment. In this work we consider a much simpler set of constraints where, for each example, we only consider the competing assignments obtained by modifying a single label yi , while fixing the other labels to their value at y m . The pseudo-max set, which is an outer bound on Θ, is given by Here ym −i m Θps = θ | ∀m, i, yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) + e(yi , yi ) . −i denotes the label y m (1) without the assignment to yi . When the data is not separable, Θ will be the empty set. Instead, we may choose to minimize the hinge loss, (θ) = m maxy f (y; xm , θ) − f (y m ; xm , θ) + e(y, y m ) , which can be shown to be an upper bound on the training error [13]. When the data is separable, minθ (θ) = 0. Note that regularization may be added to this objective. The corresponding pseudo-max objective replaces the maximization over all of y with maximization over a single variable yi while fixing the other labels to their value at y m :2,3 M ps (θ) n = m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) . −i yi Analogous to before, we have minθ ps (θ) (2) = 0 if and only if θ ∈ Θps . The objective in Eq. 2 is similar in spirit to pseudo-likelihood objectives used for maximum likelihood estimation of parameters of Markov random fields (MRFs) [1]. The pseudo-likelihood estimate is provably consistent when the data generating distribution is a MRF of the same structure as used in the pseudo-likelihood objective. However, our setting is different since we only get to view the maximizing assignment of the MRF rather than samples from it. 
Thus, a particular x will always be paired with the same y rather than samples y drawn from the conditional distribution p(y|x; θ). The pseudo-max constraints in Eq. 1 are also related to cutting plane approaches to inference [4, 5]. In the latter, the learning problem is solved by repeatedly looking for assignments that violate the separability constraint (or its hinge version). Our constraints can be viewed as using a very small 1 An alternative formulation, which we use in the next section, is to break the symmetry by having part of the input not be multiplied by any weight. This will also rule out the trivial solution θ = 0. P 2 It is possible to use maxi instead of i , and some of our consistency results will still hold. 3 The pseudo-max approach is markedly different from a learning method which predicts each label yi independently, since the objective considers all i simultaneously (both at learning and test time). 2 x2 0.2 J ∗ + x1 = 0 y = (0, 1) y = (1, 1) g(J12) x2 = 0 x1 J ∗ + x1 + x2 = 0 y = (0, 0) c1=0 c1=1 c1= 1 0.15 0.1 J + x2 = 0 ∗ 0.05 y = (1, 0) x1 = 0 0 1 0.5 0 J 0.5 1 Figure 1: Illustrations for a model with two variables. Left: Partitioning of X induced by configurations y(x) for some J ∗ > 0. Blue lines carve out the exact regions. Red lines denote the pseudo-max constraints that hold with equality. Pseudo-max does not obtain the diagonal constraint coming from comparing configurations y = (1, 1) and (0, 0), since these differ by more than one coordinate. Right: One strictly-convex component of the ps (J ) function (see Eq. 9). The function is shown for different values of c1 , the mean of the x1 variable. subset of assignments for the set of candidate constraint violators. We also note that when exact maximization over the discriminant function f (y; x, θ) is hard, the standard cutting plane algorithm cannot be employed since it is infeasible to find a violated constraint. For the pseudo-max objective, finding a constraint violation is simple and linear in the number of variables.4 It is easy to see (as will be elaborated on next) that the pseudo-max method does not in general yield a consistent estimate of θ, even in the separable case. However, as we show, consistency can be shown to be achieved under particular assumptions on the data generating distribution p(x). 2 Consistency of the Pseudo-Max method In this section we show that if the feature generating distribution p(x) satisfies particular assumptions, then the pseudo-max approach yields a consistent estimate. In other words, if the training data is of the form {(xm , y(xm ; θ ∗ ))}M for some true parameter vector θ ∗ , then as M → ∞ the m=1 minimum of the pseudo-max objective will converge to θ ∗ (up to equivalence transformations). The section is organized as follows. First, we provide intuition for the consistency results by considering a model with only two variables. Then, in Sec. 2.1, we show that any parameter θ ∗ can be identified to within arbitrary accuracy by choosing a particular training set (i.e., choice of xm ). This in itself proves consistency, as long as there is a non-zero probability of sampling this set. In Sec. 2.2 we give a more direct proof of consistency by using strict convexity arguments. For ease of presentation, we shall work with a simplified instance of the structured learning setting. 
We focus on binary variables, yi ∈ {0, 1}, and consider discriminant functions corresponding to Ising models, a special case of pairwise MRFs (J denotes the vector of “interaction” parameters): f (y; x, J ) = ij∈E Jij yi yj + i yi xi (3) The singleton potential for variable yi is yi xi and is not dependent on the model parameters. We could have instead used Ji yi xi , which would be more standard. However, this would make the parameter vector J invariant to scaling, complicating the identifiability analysis. In the consistency analysis we will assume that the data is generated using a true parameter vector J ∗ . We will show that as the data size goes to infinity, minimization of ps (J ) yields J ∗ . We begin with an illustrative analysis of the pseudo-max constraints for a model with only two variables, i.e. f (y; x, J) = Jy1 y2 + y1 x1 + y2 x2 . The purpose of the analysis is to demonstrate general principles for when pseudo-max constraints may succeed or fail. Assume that training samples are generated via y(x) = argmaxy f (y; x, J ∗ ). We can partition the input space X into four regions, ˆ ˆ {x ∈ X : y(x) = y } for each of the four configurations y , shown in Fig. 1 (left). The blue lines outline the exact decision boundaries of f (y; x, J ∗ ), with the lines being given by the constraints 4 The methods differ substantially in the non-separable setting where we minimize ps (θ), using a slack variable for every node and example, rather than just one slack variable per example as in (θ). 3 in Θ that hold with equality. The red lines denote the pseudo-max constraints in Θps that hold with equality. For x such that y(x) = (1, 0) or (0, 1), the pseudo-max and exact constraints are identical. We can identify J ∗ by obtaining samples x = (x1 , x2 ) that explore both sides of one of the decision boundaries that depends on J ∗ . The pseudo-max constraints will fail to identify J ∗ if the samples do not sufficiently explore the transitions between y = (0, 1) and y = (1, 1) or between y = (1, 0) and y = (1, 1). This can happen, for example, when the input samples are dependent, giving only rise to the configurations y = (0, 0) and y = (1, 1). For points labeled (1, 1) around the decision line J ∗ + x1 + x2 = 0, pseudo-max can only tell that they respect J ∗ + x1 ≥ 0 and J ∗ + x2 ≥ 0 (dashed red lines), or x1 ≤ 0 and x2 ≤ 0 for points labeled (0, 0). Only constraints that depend on the parameter are effective for learning. For pseudo-max to be able to identify J ∗ , the input samples must be continuous, densely populating the two parameter dependent decision lines that pseudo-max can use. The two point sets in the figure illustrate good and bad input distributions for pseudo-max. The diagonal set would work well with the exact constraints but badly with pseudo-max, and the difference can be arbitrarily large. However, the input distribution on the right, populating the J ∗ + x2 = 0 decision line, would permit pseudo-max to identify J ∗ . 2.1 Identifiability of True Parameters In this section, we show that it is possible to approximately identify the true model parameters, up to model equivalence, using the pseudo-max constraints and a carefully chosen linear number of data points. Consider the learning problem for structured prediction defined on a fixed graph G = (V, E) where the parameters to be learned are pairwise potential functions θij (yi , yj ) for ij ∈ E and single node fields θi (yi ) for i ∈ V . 
We consider discriminant functions of the form f (y; x, θ) = ij∈E θij (yi , yj ) + i θi (yi ) + i xi (yi ), (4) where the input space X = R|V |k specifies the single node potentials. Without loss of generality, we remove the additional degrees of freedom in θ by restricting it to be in a canonical form: θ ∈ Θcan if for all edges θij (yi , yj ) = 0 whenever yi = 0 or yj = 0, and if for all nodes, θi (yi ) = 0 when yi = 0. As a result, assuming the training set comes from a model in this class, and the input fields xi (yi ) exercise the discriminant function appropriately, we can hope to identify θ ∗ ∈ Θcan . Indeed, we show that, for some data sets, the pseudo-max constraints are sufficient to identify θ ∗ . Let Θps ({y m , xm }) be the set of parameters that satisfy the pseudo-max classification constraints m Θps ({y m , xm }) = θ | ∀m, i, yi = yi , f (y m ; xm , θ) ≥ f (y m , yi ; xm , θ) . −i (5) m e(yi , yi ), For simplicity we omit the margin losses since the input fields xi (yi ) already suffice to rule out the trivial solution θ = 0. Proposition 2.1. For any θ ∗ ∈ Θcan , there is a set of 2|V |(k − 1) + 2|E|(k − 1)2 examples, {xm , y(xm ; θ ∗ )}, such that any pseudo-max consistent θ ∈ Θps ({y m , xm }) ∩ Θcan is arbitrarily close to θ ∗ . The proof is given in the supplementary material. To illustrate the key ideas, we consider the simpler binary discriminant function discussed in Eq. 3. Note that the binary model is already in the canonical form since Jij yi yj = 0 whenever yi = 0 or yj = 0. For any ij ∈ E, we show how to choose two input examples x1 and x2 such that any J consistent with the pseudo-max constraints for these ∗ ∗ two examples will have Jij ∈ [Jij − , Jij + ]. Repeating this for all of the edge parameters then gives the complete set of examples. The input examples we need for this will depend on J ∗ . For the first example, we set the input fields for all neighbors of i (except j) in such a way that ∗ we force the corresponding labels to be zero. More formally, we set x1 < −|N (k)| maxl |Jkl | for k 1 k ∈ N (i)\j, resulting in yk = 0, where y 1 = y(x1 ). In contrast, we set x1 to a large value, e.g. j ∗ 1 ∗ x1 > |N (j)| maxl |Jjl |, so that yj = 1. Finally, for node i, we set x1 = −Jij + so as to obtain a j i 1 slight preference for yi = 1. All other input fields can be set arbitrarily. As a result, the pseudo-max constraints pertaining to node i are f (y 1 ; x1 , J ) ≥ f (y 1 , yi ; x1 , J ) for yi = 0, 1. By taking into −i 1 account the label assignments for yi and its neighbors, and by removing terms that are the same on both sides of the equation, we get Jij + x1 + x1 ≥ Jij yi + yi x1 + x1 , which, for yi = 0, implies i j i j ∗ that Jij + x1 ≥ 0 or Jij − Jij + ≥ 0. The second example x2 differs only in terms of the input i ∗ 2 ∗ field for i. In particular, we set x2 = −Jij − so that yi = 0. This gives Jij ≤ Jij + , as desired. i 4 2.2 Consistency via Strict Convexity In this section we prove the consistency of the pseudo-max approach by showing that it corresponds to minimizing a strictly convex function. Our proof only requires that p(x) be non-zero for all x ∈ Rn (a simple example being a multi-variate Gaussian) and that J ∗ is finite. We use a discriminant function as in Eq. 3. Now, assume the input points xm are distributed according to p(x) and that y m are obtained via y m = arg maxy f (y; xm , J ∗ ). 
We can write the ps (J ) objective for finite data, and its limit when M → ∞, compactly as: 1 m m = max (yi − yi ) xm + Jki yk ps (J ) i M m i yi k∈N (i) p(x) max (yi − yi (x)) xi + → yi i Jki yk (x) dx (6) k∈N (i) ∗ where yi (x) is the label of i for input x when using parameters J . Starting from the above, consider the terms separately for each i. We partition the integral over x ∈ Rn into exclusive regions according to the predicted labels of the neighbors of i (given x). Define Sij = {x : yj (x) = 1 and yk (x) = 0 for k ∈ N (i)\j}. Eq. 6 can then be written as ps (J ) = gi ({Jik }k∈N (i) ) + ˆ i gik (Jik ) , (7) k∈N (i) where gik (Jik ) = x∈Sik p(x) maxyi [(yi −yi (x))(xi +Jik )]dx and gi ({Jik }k∈N (i) ) contains all of ˆ the remaining terms, i.e. where either zero or more than one neighbor is set to one. The function gi ˆ is convex in J since it is a sum of integrals over convex functions. We proceed to show that gik (Jik ) is strictly convex for all choices of i and k ∈ N (i). This will show that ps (J ) is strictly convex since it is a sum over functions strictly convex in each one of the variables in J . For all values xi ∈ (−∞, ∞) there is some x in Sij . This is because for any finite xi and finite J ∗ , the other xj ’s can be chosen so as to give the y configuration corresponding to Sij . Now, since p(x) has full support, we have P (Sij ) > 0 and p(x) > 0 for any x in Sij . As a result, this also holds for the marginal pi (xi |Sij ) over xi within Sij . After some algebra, we obtain: gij (Jij ) = P (Sij ) ∞ p(x)yi (x)(xi + Jij )dx pi (xi |Sij ) max [0, xi + Jij ] dxi − −∞ x∈Sij The integral over the yi (x)(xi + Jij ) expression just adds a linear term to gij (Jij ). The relevant remaining term is (for brevity we drop P (Sij ), a strictly positive constant, and the ij index): h(J) = ∞ pi (xi |Sij ) max [0, xi + J] dxi = −∞ ∞ ˆ pi (xi |Sij )h(xi , J)dxi (8) −∞ ˆ ˆ where we define h(xi , J) = max [0, xi + J]. Note that h(J) is convex since h(xi , J) is convex in J for all xi . We want to show that h(J) is strictly convex. Consider J < J and α ∈ (0, 1) and define ˆ ˆ the interval I = [−J, −αJ − (1 − α)J ]. For xi ∈ I it holds that: αh(xi , J) + (1 − α)h(xi , J ) > ˆ i , αJ + (1 − α)J ) (since the first term is strictly positive and the rest are zero). For all other x, h(x ˆ this inequality holds but is not necessarily strict (since h is always convex in J). We thus have after integrating over x that αh(J) + (1 − α)h(J ) > h(αJ + (1 − α)J ), implying h is strictly convex, as required. Note that we used the fact that p(x) has full support when integrating over I. The function ps (J ) is thus a sum of strictly convex functions in all its variables (namely g(Jik )) plus other convex functions of J , hence strictly convex. We can now proceed to show consistency. By strict convexity, the pseudo-max objective is minimized at a unique point J . Since we know that ps (J ∗ ) = 0 and zero is a lower bound on the value of ps (J ), it follows that J ∗ is the unique minimizer. Thus we have that as M → ∞, the minimizer of the pseudo-max objective is the true parameter vector, and thus we have consistency. As an example, consider the case of two variables y1 , y2 , with x1 and x2 distributed according to ∗ N (c1 , 1), N (0, 1) respectively. Furthermore assume J12 = 0. Then simple direct calculation yields: 2 2 2 c1 + J12 −c1 1 1 √ (9) e−x /2 dx − √ e−c1 /2 + √ e−(J12 +c1 ) /2 2π 2π 2π −J12 −c1 which is indeed a strictly convex function that is minimized at J = 0 (see Fig. 1 for an illustration). 
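The consistency argument above assumes training pairs generated as y^m = argmax_y f(y; x^m, J*) with x drawn from a full-support density. The following is a brute-force sketch of that data-generating process for the Ising discriminant of Eq. 3 (feasible for small n only; the sizes, parameter range, and Gaussian input density are illustrative choices matching the spirit of the synthetic experiments, not the paper's code).

```python
import numpy as np
from itertools import product

def ising_argmax(x, J):
    """Brute-force y(x; J) = argmax_y sum_{i<j} J_ij y_i y_j + sum_i y_i x_i  (Eq. 3)."""
    n = len(x)
    best, best_val = None, -np.inf
    for y in product((0, 1), repeat=n):
        y = np.array(y)
        val = y @ np.triu(J, 1) @ y + y @ x
        if val > best_val:
            best, best_val = y, val
    return best

# illustrative data generation for the consistency setting: x has a full-support density
rng = np.random.default_rng(1)
n = 5
J_star = np.triu(rng.uniform(-1, 1, (n, n)), 1)
data = [(x, ising_argmax(x, J_star)) for x in rng.normal(size=(200, n))]
```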
g(J12 ) = 5 3 Hardness of Structured Learning Most structured prediction learning algorithms use some form of inference as a subroutine. However, the corresponding prediction task is generally NP-hard. For example, maximizing the discriminant function defined in Eq. 3 is equivalent to solving Max-Cut, which is known to be NP-hard. This raises the question of whether it is possible to bypass prediction during learning. Although prediction may be intractable for arbitrary MRFs, what does this say about the difficulty of learning with a polynomial number of data points? In this section, we show that the problem of deciding whether there exists a parameter vector that separates the training data is NP-hard. Put in the context of the positive results in this paper, these hardness results show that, although in some cases the pseudo-max constraints yield a consistent estimate, we cannot hope for a certificate of optimality. Put differently, although the pseudo-max constraints in the separable case always give an outer bound on Θ (and may even be a single point), Θ could be the empty set – and we would never know the difference. Theorem 3.1. Given labeled examples {(xm , y m )}M for a fixed but arbitrary graph G, it is m=1 NP-hard to decide whether there exists parameters θ such that ∀m, y m = arg maxy f (y; xm , θ). Proof. Any parameters θ have an equivalent parameterization in canonical form (see section Sec. 2.1, also supplementary). Thus, the examples will be separable if and only if they are separable by some θ ∈ Θcan . We reduce from unweighted Max-Cut. The Max-Cut problem is to decide, given an undirected graph G, whether there exists a cut of at least K edges. Let G be the same graph as G, with k = 3 states per variable. We construct a small set of examples where a parameter vector will exist that separates the data if and only if there is no cut of K or more edges in G. Let θ be parameters in canonical form equivalent to θij (yi , yj ) = 1 if (yi , yj ) ∈ {(1, 2), (2, 1)}, 0 if yi = yj , and −n2 if (yi , yj ) ∈ {(1, 3), (2, 3), (3, 1), (3, 2)}. We first construct 4n + 8|E| examples, using the technique described in Sec. 2.1 (also supplementary material), which when restricted to the space Θcan , constrain the parameters to equal θ. We then use one more example (xm , y m ) where y m = 3 (every node is in state 3) and, for all i, xm (3) = K−1 and xm (1) = xm (2) = 0. The first i i i n two states encode the original Max-Cut instance, while the third state is used to construct a labeling y m that has value equal to K − 1, and is otherwise not used. Let K ∗ be the value of the maximum cut in G. If in any assignment to the last example there is a variable taking the state 3 and another variable taking the state 1 or 2, then the assignment’s value will be at most K ∗ − n2 , which is less than zero. By construction, the 3 assignment has value K − 1. Thus, the optimal assignment must either be 3 with value K − 1, or some combination of states 1 and 2, which has value at most K ∗ . If K ∗ > K − 1 then 3 is not optimal and the examples are not separable. If K ∗ ≤ K − 1, the examples are separable. This result illustrates the potential difficulty of learning in worst-case graphs. Nonetheless, many problems have a more restricted dependence on the input. For example, in computer vision, edge potentials may depend only on the difference in color between two adjacent pixels. Our results do not preclude positive results of learnability in such restricted settings. 
By establishing hardness of learning, we also close the open problem of relating hardness of inference and learning in structured prediction. If inference problems can be solved in polynomial time, then so can learning (using, e.g., structured perceptron). Thus, when learning is hard, inference must be hard as well. 4 Experiments To evaluate our learning algorithm, we test its performance on both synthetic and real-world datasets. We show that, as the number of training samples grows, the accuracy of the pseudo-max method improves and its speed-up gain over competing algorithms increases. Our learning algorithm corresponds to solving the following, where we add L2 regularization and use a scaled 0-1 loss, m m e(yi , yi ) = 1{yi = yi }/nm (nm is the number of labels in example m): min θ C m nm M nm m=1 i=1 m max f (y m , yi ; xm , θ) − f (y m ; xm , θ) + e(yi , yi ) + θ −i yi 2 . (10) We will compare the pseudo-max method with learning using structural SVMs, both with exact inference and LP relaxations [see, e.g., 4]. We use exact inference for prediction at test time. 6 (a) Synthetic (b) Reuters 0.4 exact LP−relaxation pseudo−max 0.15 Test error Test error 0.2 0.1 0.05 0 1 10 2 10 0.2 0.1 0 1 10 3 10 Train size exact LP−relaxation pseudo−max 0.3 2 10 3 10 4 10 Train size Figure 2: Test error as a function of train size for various algorithms. Subfigure (a) shows results for a synthetic setting, while (b) shows performance on the Reuters data. In the synthetic setting we use the discriminant function f (y; x, θ) = ij∈E θij (yi , yj ) + xi θi (yi ), which is similar to Eq. 4. We take a fully connected graph over n = 10 binary labels. i For a weight vector θ ∗ (sampled once, uniformly in the range [−1, 1], and used for all train/test sets) we generate train and test instances by sampling xm uniformly in the range [−5, 5] and then computing the optimal labels y m = arg maxy∈Y f (y; xm , θ ∗ ). We generate train sets of increasing size (M = {10, 50, 100, 500, 1000, 5000}), run the learning algorithms, and measure the test error for the learned weights (with 1000 test samples). For each train size we average the test error over 10 repeats of sampling and training. Fig. 2(a) shows a comparison of the test error for the three learning algorithms. For small numbers of training examples, the test error of pseudo-max is larger than that of the other algorithms. However, as the train size grows, the error converges to that of exact learning, as our consistency results predict. We also test the performance of our algorithm on a multi-label document classification task from the Reuters dataset [7]. The data consists of M = 23149 training samples, and we use a reduction of the dataset to the 5 most frequent labels. The 5 label variables form a fully connected pairwise graph structure (see [4] for a similar setting). We use random subsamples of increasing size from the train set to learn the parameters, and then measure the test error using 20000 additional samples. For each sample size and learning algorithm, we optimize the trade-off parameter C using 30% of the training data as a hold-out set. Fig. 2(b) shows that for the large data regime the performance of pseudo-max learning gets close to that of the other methods. However, unlike the synthetic setting there is still a small gap, even after seeing the entire train set. 
This could be because the full dataset is not yet large enough to be in the consistent regime (note that exact learning has not flattened either), or because the consistency conditions are not fully satisfied: the data might be non-separable or the support of the input distribution p(x) may be partial. We next apply our method to the problem of learning the energy function for protein side-chain placement, mirroring the learning setup of [14], where the authors train a conditional random field (CRF) using tree-reweighted belief propagation to maximize a lower bound on the likelihood.5 The prediction problem for side-chain placement corresponds to finding the most likely assignment in a pairwise MRF, and fits naturally into our learning framework. There are only 8 parameters to be learned, corresponding to a reweighting of known energy terms. The dataset consists of 275 proteins, where each MRF has several hundred variables (one per residue of the protein) and each variable has on average 20 states. For prediction we use CPLEX’s ILP solver. Fig. 3 shows a comparison of the pseudo-max method and a cutting-plane algorithm which uses an LP relaxation, solved with CPLEX, for finding violated constraints.6 We generate training sets of increasing size (M = {10, 50, 100, 274}), and measure the test error for the learned weights on the remaining examples.7 For M = 10, 50, 100 we average the test error over 3 random train/test splits, whereas for M = 274 we do 1-fold cross validation. We use C = 1 for both algorithms. 5 The authors’ data and results are available from: http://cyanover.fhcrc.org/recomb-2007/ We significantly optimized the cutting-plane algorithm, e.g. including a large number of initial cuttingplanes and restricting the weight vector to be positive (which we know to hold at optimality). 7 Specifically, for each protein we compute the fraction of correctly predicted χ1 and χ2 angles for all residues (except when trivial, e.g. just 1 state). Then, we compute the median of this value across all proteins. 6 7 Time to train (minutes) Test error (χ1 and χ2) 0.27 0.265 pseudo−max LP−relaxation Soft rep 0.26 0.255 0.25 0 50 100 150 200 Train size 250 250 200 pseudo−max LP−relaxation 150 100 50 0 0 50 100 150 200 Train size 250 Figure 3: Training time (for one train/test split) and test error as a function of train size for both the pseudomax method and a cutting-plane algorithm which uses a LP relaxation for inference, applied to the problem of learning the energy function for protein side-chain placement. The pseudo-max method obtains better accuracy than both the LP relaxation and HCRF (given roughly five times more data) for a fraction of the training time. The original weights (“Soft rep” [3]) used for this energy function have 26.7% error across all 275 proteins. The best previously reported parameters, learned in [14] using a Hidden CRF, obtain 25.6% error (their training set included 55 of these 275 proteins, so this is an optimistic estimate). To get a sense of the difficulty of this learning task, we also tried a random positive weight vector, uniformly sampled from the range [0, 1], obtaining an error of 34.9% (results would be much worse if we allowed the weights to be negative). Training using pseudo-max with 50 examples, we learn parameters in under a minute that give better accuracy than the HCRF. The speed-up of training with pseudo-max (using CPLEX’s QP solver) versus cutting-plane is striking. For example, for M = 10, pseudo-max takes only 3 seconds, a 1000-fold speedup. 
Unfortunately the cutting-plane algorithm took a prohibitive amount of time to be able to run on the larger training sets. Since the data used in learning for protein side-chain placement is both highly non-separable and relatively little, these positive results illustrate the potential wide-spread applicability of the pseudo-max method. 5 Discussion The key idea of our method is to find parameters that prefer the true assignment y m over assignments that differ from it in only one variable, in contrast to all other assignments. Perhaps surprisingly, this weak requirement is sufficient to achieve consistency given a rich enough input distribution. One extension of our approach is to add constraints for assignments that differ from y m in more than one variable. This would tighten the outer bound on Θ and possibly result in improved performance, but would also increase computational complexity. We could also add such competing assignments via a cutting-plane scheme so that optimization is performed only over a subset of these constraints. Our work raises a number of important open problems: It would be interesting to derive generalization bounds to understand the convergence rate of our method, as well as understanding the effect of the distribution p(x) on these rates. The distribution p(x) needs to have two key properties. On the one hand, it needs to explore the space Y in the sense that a sufficient number of labels need to be obtained as the correct label for the true parameters (this is indeed used in our consistency proofs). On the other hand, p(x) needs to be sufficiently sensitive close to the decision boundaries so that the true parameters can be inferred. We expect that generalization analysis will depend on these two properties of p(x). Note that [11] studied active learning schemes for structured data and may be relevant in the current context. How should one apply this learning algorithm to non-separable data sets? We suggested one approach, based on using a hinge loss for each of the pseudo constraints. One question in this context is, how resilient is this learning algorithm to label noise? Recent work has analyzed the sensitivity of pseudo-likelihood methods to model mis-specification [8], and it would be interesting to perform a similar analysis here. Also, is it possible to give any guarantees for the empirical and expected risks (with respect to exact inference) obtained by outer bound learning versus exact learning? Finally, our algorithm demonstrates a phenomenon where more data can make computation easier. Such a scenario was recently analyzed in the context of supervised learning [12], and it would be interesting to combine the approaches. Acknowledgments: We thank Chen Yanover for his assistance with the protein data. This work was supported by BSF grant 2008303 and a Google Research Grant. D.S. was supported by a Google PhD Fellowship. 8 References [1] J. Besag. The analysis of non-lattice data. The Statistician, 24:179–195, 1975. [2] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002. [3] G. Dantas, C. Corrent, S. L. Reichow, J. J. Havranek, Z. M. Eletr, N. G. Isern, B. Kuhlman, G. Varani, E. A. Merritt, and D. Baker. High-resolution structural and thermodynamic analysis of extreme stabilization of human procarboxypeptidase by computational protein design. Journal of Molecular Biology, 366(4):1209 – 1221, 2007. [4] T. Finley and T. Joachims. 
Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning 25, pages 304–311. ACM, 2008. [5] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009. [6] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 20, pages 785–792. 2008. [7] D. Lewis, , Y. Yang, T. Rose, and F. Li. RCV1: a new benchmark collection for text categorization research. JMLR, 5:361–397, 2004. [8] P. Liang and M. I. Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of the 25th international conference on Machine learning, pages 584–591, New York, NY, USA, 2008. ACM Press. [9] A. F. T. Martins, N. A. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In ICML 26, pages 713–720, 2009. [10] N. Ratliff, J. A. D. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007. [11] D. Roth and K. Small. Margin-based active learning for structured output spaces. In Proc. of the European Conference on Machine Learning (ECML). Springer, September 2006. [12] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th international conference on Machine learning, pages 928–935. ACM, 2008. [13] B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. In Advances in Neural Information Processing Systems 16, pages 25–32. 2004. [14] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008. 9
3 0.49391943 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. 1
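A toy numerical sketch of the claim in this abstract: an orthogonal linear recurrent network integrates a sparse input stream into its instantaneous state, and the history (longer than the number of neurons) can be read back by sparse ℓ1 reconstruction. The network size, sparsity level, regularization weight, and the plain ISTA recovery loop are illustrative assumptions, not the paper's construction or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, k = 40, 80, 5                      # neurons, sequence length (T > N), nonzero inputs

# random orthogonal recurrent matrix and feed-forward vector
W, _ = np.linalg.qr(rng.normal(size=(N, N)))
v = rng.normal(size=N) / np.sqrt(N)

# sparse input history s_0 .. s_{T-1}
s = np.zeros(T)
s[rng.choice(T, k, replace=False)] = rng.normal(size=k)

# state after T steps of x_{t+1} = W x_t + v s_t (x_0 = 0):  x_T = sum_t W^{T-1-t} v s_t = A s
A = np.column_stack([np.linalg.matrix_power(W, T - 1 - t) @ v for t in range(T)])
x_T = A @ s

# recover the input history from the instantaneous state via L1-regularized least squares (ISTA)
lam, L = 0.01, np.linalg.norm(A, 2) ** 2
s_hat = np.zeros(T)
for _ in range(2000):
    g = A.T @ (A @ s_hat - x_T)
    z = s_hat - g / L
    s_hat = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

print("relative recovery error:", np.linalg.norm(s_hat - s) / np.linalg.norm(s))
```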
4 0.48863918 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
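A toy sketch of the kind of estimator this abstract analyzes: project very noisy node-level measurements onto the lowest-frequency eigenvectors of the graph Laplacian to recover an activation pattern that varies smoothly over the network. The chain graph, noise level, and cutoff k are illustrative choices, not the paper's models or guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

# an illustrative network: a chain graph on n nodes
n = 100
A = np.zeros((n, n))
idx = np.arange(n - 1)
A[idx, idx + 1] = A[idx + 1, idx] = 1.0
L = np.diag(A.sum(1)) - A                      # combinatorial graph Laplacian

# a piecewise-constant activation pattern masked by heavy node-level noise
beta = np.where(np.arange(n) < 40, 1.0, 0.0)
y = beta + rng.normal(scale=1.0, size=n)

# estimator: keep only the k lowest-frequency Laplacian eigenvectors
w, U = np.linalg.eigh(L)                       # eigenvalues in ascending order
k = 10
beta_hat = U[:, :k] @ (U[:, :k].T @ y)

print("raw MSE:", np.mean((y - beta) ** 2), "eigenbasis MSE:", np.mean((beta_hat - beta) ** 2))
```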
5 0.48388952 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
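For readers unfamiliar with DP random measures, a hedged sketch: a truncated stick-breaking draw of a single DP, followed by a crude "dependent" chain that only jitters atom locations over time. The paper's Poisson-process construction additionally creates and removes atoms with rates chosen so that every marginal remains exactly DP distributed; that machinery is not reproduced here, and the concentration, truncation level, and random-walk scale below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_dp_draw(alpha=2.0, n_atoms=100):
    """Truncated stick-breaking draw of a DP random measure with a N(0, 1) base measure."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    locs = rng.normal(size=n_atoms)
    return w / w.sum(), locs

w, locs = truncated_dp_draw()
chain = []
for t in range(10):
    # location variation only: a random-walk perturbation of the component atoms
    locs = locs + 0.1 * rng.normal(size=locs.size)
    chain.append((w.copy(), locs.copy()))
```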
6 0.48364589 10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data
7 0.48304397 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
8 0.4819997 148 nips-2010-Learning Networks of Stochastic Differential Equations
9 0.48138803 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
10 0.48116025 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.47986573 63 nips-2010-Distributed Dual Averaging In Networks
12 0.47835916 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
13 0.47830874 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
14 0.47819853 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
15 0.47767618 158 nips-2010-Learning via Gaussian Herding
16 0.47762465 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
17 0.47755966 96 nips-2010-Fractionally Predictive Spiking Neurons
18 0.47719023 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
19 0.47653145 155 nips-2010-Learning the context of a category
20 0.47638947 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts