nips nips2003 nips2003-135 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gang Ji, Jeff A. Bilmes
Abstract: In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolatedword speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. [sent-4, score-0.187]
2 In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). [sent-5, score-0.5]
3 The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. [sent-6, score-0.154]
4 We test a number of these new schemes on an isolatedword speech recognition task as well as on the UCI machine learning data sets. [sent-7, score-0.126]
5 Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results. [sent-8, score-0.217]
6 1 Introduction Statistical pattern recognition is often based on Bayes decision theory [4], which aims to achieve minimum error rate classification. [sent-9, score-0.09]
7 In previous work [2], we observed that multiclass Bayes classification can be viewed as a tournament style game, where the winner between players is decided using log likelihood ratios. [sent-10, score-0.249]
8 ) A practical game strategy can be obtained by fixing a comparison order, {i1 , i2 , · · · , iM }, as a permutation of {1, 2, · · · , M }, where class ci1 plays with class ci2 , the winner plays with class ci3 , and so on until a final winner is ultimately found. [sent-12, score-0.433]
9 This yields a transitive game [8] — assuming no ties, the ultimate winner is identical regardless of the comparison order. [sent-13, score-0.277]
10 To perform these procedures optimally, correct likelihood ratios are needed, which requires correct probabilistic models and sufficient training data. [sent-14, score-0.15]
11 In previous work [2], we introduced a method to correct for the difference between the true and an approximate log likelihood ratio. [sent-16, score-0.109]
12 In this work, we improve upon the correction method by using an expression that can still lead to perfect correction, but is weaker than what we used before. [sent-17, score-0.318]
13 We show that this new condition can achieve a significant improvement over baseline results, both on a medium vocabulary isolated-word automatic speech recognition task and on the UCI machine learning data sets. [sent-18, score-0.379]
14 Section 3 discusses the weaker correction condition, and its approximations. [sent-20, score-0.276]
15 Section 4 provides various experimental results on an isolated-word speech recognition task. [sent-21, score-0.126]
16 In the likelihood ratio decision scheme described above, only ˆ ˆ ˆ an imperfect log likelihood ratio, Lij (x) = ln(P (x|ci )/P (x|cj )), is available for decision making rather than the true log likelihood ratio Lij (x). [sent-26, score-0.495]
17 The key idea is to compensate for the difference between Lij (x) and Lij (x) 2 using a bias term αij (x) computed from test data such that: ˆ Lij (x) − αij (x) = Lij (x). [sent-29, score-0.106]
18 Under the assumption (referred to as assumption A in Section 3. [sent-31, score-0.074]
19 , equal covariance matrices in the Gaussian case), the bias term can be solved explicitly as 1 ˆ ˆ αij = − D(i j) − D(j i) . [sent-34, score-0.106]
20 (3) 2 ˆ We saw how the augmented likelihood ratio Sij (x) = Lij (x) + αij can lead to an intransitive game [8, 13], since Sij (x) can specify intransitive preferences amongst the set {1, 2, · · · , M }. [sent-35, score-0.574]
21 We therefore investigated a number of intransitive game playing strategies. [sent-36, score-0.267]
22 Moreover, we observed that if the correction was optimal, the true likelihood ratios would be obtained which are clearly transitive. [sent-37, score-0.377]
23 We therefore hypothesized and experimentally verified that the existence of intransitivity was a good indicator of the occurrence of a classification error. [sent-38, score-0.102]
24 First, better intransitive strategies can be developed (for detecting, tolerating, and utilizing the intransitivity of a 1 2 In this paper, we use “hatted” letters to describe estimated quantities. [sent-40, score-0.205]
25 Note that by bias, we do not mean standard parameter bias in statistical parameter estimation. [sent-41, score-0.077]
26 classifier); second, the assumption of symmetric KL-divergence could be relaxed; and third, the above criterion is stricter than required to obtain perfect correction. [sent-42, score-0.11]
27 3 Necessary Intransitive Scheme An αij (x) that solves Equation 1 is a sufficient condition for a perfect correction of the estimated likelihood ratio since given such a quantity, the true likelihood ratio would be attainable. [sent-44, score-0.627]
28 This condition, however, is stricter than required because it is only the sign of the likelihood ratio that is needed to decide the winning class. [sent-45, score-0.344]
29 We therefore should ask for a condition that corrects only for the discrepancy in sign between the true and estimated ratio, i. [sent-46, score-0.26]
30 , we want to find a function αij (x) that minimizes ˆ sgn [Lij (x) − αij (x)] − sgnLij (x) J[αij ] = 2 · Pij (x) dx. [sent-48, score-0.173]
31 Rn Clearly the αij (x) that minimizes J[αij ] is the one such that ˆ sgn [Lij (x) − αij (x)] = sgnLij (x), ∀x ∈ suppPij = {x : Pij (x) = 0}. [sent-49, score-0.173]
32 (4) As can be seen, this condition is weaker than Equation 1, weaker in the sense that any solution to Equation 1 solves Equation 4 but not vice versa. [sent-50, score-0.208]
33 Note also that Equation 4 provides necessary conditions for an additive bias term to achieve perfect correction, since any such correction must achieve parity in the sign. [sent-51, score-0.422]
34 Therefore, it might make it simpler to find a better bias term since Equation 4 (and therefore, set of possible α values) is less constrained. [sent-52, score-0.13]
35 As will be seen, however, analysis of this weaker condition is more difficult. [sent-53, score-0.107]
36 In the following sections, therefore we introduce several approximations to this condition. [sent-54, score-0.053]
37 In this case, the equation providing the best αij values is: ˆ EPij {sgn [Lij (x) − αij ]} = EPij sgnLij (x) . [sent-56, score-0.055]
38 1 (5) The difficulty with the sign function The main problem in trying to solve for αij in Equation 5 is the existence of a discontinuous function. [sent-58, score-0.196]
39 The {−1, 0, 1}-valued sign function sgn(z) is defined as 2u(z) − 1, where u(z) is the Heaviside step function. [sent-60, score-0.149]
40 We obtain an approximation via a Taylor expansion as follows: sgn(z + ) = sgn(z) + sgn (z) + o( ) = sgn(z) + 2 δ(z) + o( ), (6) where δ(z) is the Dirac delta function [7]. [sent-61, score-0.217]
41 If, however, we find and use a suitable continuous and differentiable approximation rather than the discrete sign function, the above expansion becomes more appropriate. [sent-65, score-0.221]
42 The shifted sigmoid with free parameter β (defined and used below) allows us to easily explore this trade-off simply by varying β. [sent-68, score-0.216]
43 Retaining the first-order Taylor term, and applying this to the left side of Equation 5, EPij sgn [Lij (x) − αij ] ≈ EPij sgnLij (x) − 2EPij αij δ [Lij (x)] . [sent-69, score-0.173]
44 If it is known that the true class of x is always ci , the ci -conditional distribution (i) should be used, i. [sent-71, score-0.569]
45 , Pij (x) = P (x|ci ), yielding a class-conditional correction term αij , (i) (i) ˆ and a class-conditional likelihood-ratio correction Sij (x) = Lij (x)+αij . [sent-73, score-0.427]
46 The symmetric case arises when x is of class cj . [sent-74, score-0.321]
47 If, on the other hand, neither ci nor cj is the true classes (i. [sent-75, score-0.567]
48 It is therefore valid to consider only the case when either x is of class ci (we denote this event by Ci (x)) or when x is of class cj (event Cj (x)). [sent-78, score-0.639]
49 P (ci ) + P (cj ) ˆ This results in a single likelihood correction Sij (x) = Lij (x) + αij that is obtained simply by integrating in Equation 5 with respect to the average distribution over class ci and cj , i. [sent-83, score-0.854]
50 This occurs when Lij is positive and Lij is negative ˆ ij is positive, a situation improved upon by giving i more often than Lij is negative and L “help. [sent-92, score-0.28]
51 ” Similarly, when αij is negative, the correction biases towards j. [sent-93, score-0.199]
52 The maximum amount of absolute likelihood correction possible is determined by the (always positive) 1/Ψ(Pi , Pj ) factor. [sent-94, score-0.28]
53 This is affected by two quantities, the mass around and the log-likelihood ratio gradient at the decision boundary. [sent-95, score-0.158]
54 Low mass at the decision boundary increases the maximum possible correction because any errors in the integral factor are being de-weighted. [sent-96, score-0.305]
55 High gradient at the decision boundary also increases the maximum possible correction because any decision boundary deviation causes a higher change in likelihood ratio than if the gradient was low. [sent-97, score-0.517]
56 Since we are correcting the likelihood ratio directly, this needs to be reflected in αij . [sent-98, score-0.164]
57 As will be seen, this assumption yields a likelihood-ratio adjustment that is similar in form to our previous KLdivergence based adjustment. [sent-102, score-0.062]
58 More practically, the assumption significantly simplifies the derivation and still yields reasonable empirical results. [sent-103, score-0.062]
59 (8) 2 ij 2 The left term on the right of the equality is quite similar to the left difference on the right of the equality in the KL-divergence case (Equation 2). [sent-105, score-0.309]
60 Again, because we have no information about the true class conditional models, we assume the left term in Equation 8 to be zero (denote this as assumption B). [sent-106, score-0.129]
61 Comparing this with the corresponding assumption for the KL-divergence case (assumption A, Equations 2 and 3), it can be shown that 1) they are not identical in general, and 2) in the Gaussian case, A implies B but not vice versa, meaning B is weaker than A. [sent-107, score-0.138]
62 One can immediately see the similarity between this equation and the one using KLD [2]. [sent-109, score-0.055]
63 , using the original likelihood ratios) to get estimates and use these in Equation 9. [sent-112, score-0.081]
64 The third is the error due to the discontinuity of the sign function. [sent-116, score-0.149]
65 To address the third problem, rather than using the sign function in Equation 9, we can approximate it with a continuous differential function with the goal of balancing the trade-off mentioned above. [sent-117, score-0.149]
66 There are a number of possible sign-function approximations, including hyperbolic and arc tangent, and shifted sigmoid function, the latter of which is the most flexible because of its free parameter β. [sent-118, score-0.269]
67 3 1 Specifically, the sigmoid function has the form f (z) = 1+e−βz , where the free parameter β (an inverse temperature) determines how well the curve will approximate the discontinuous function. [sent-119, score-0.228]
68 Using the sigmoid function, we can approximate the sign function as sgnz ≈ 2 − 1. [sent-120, score-0.303]
69 (10) αij ≈ − 1− 1− ˆ ˆ 2(Ni + Nj ) x∈c 1 + eβ Lji (x) 1 + eβ Lij (x) x∈c j i 4 Speech Recognition Evaluation As in previous work [2], we implemented this technique on NYNEX PHONEBOOK [10, 1], a medium vocabulary isolated-word speech corpus. [sent-123, score-0.148]
70 Gaussian mixture hidden Markov ˆ models (HMMs) produced probability scores P (x|ci ) where here x is a matrix of feature values (one dimension as MFCC features and the other as time frames), and ci is a word identity. [sent-124, score-0.285]
71 Since the procedure is no longer transitive, we run 1000 random tournament-style games (as in [2]) and choose the most frequent winner as the ultimate winner. [sent-128, score-0.129]
72 Table 1: Word error rates % on speech data with various sign approximations. [sent-129, score-0.237]
73 91 The results are shown in Table 1, where the first column gives the test-set vocabulary size (number of different classes). [sent-175, score-0.074]
74 The second column shows the baseline word error rates ˆ (WERs) using only Lij . [sent-176, score-0.208]
75 The remaining columns are the bias-corrected results with various sign approximations, namely sign (Equation 9), hyperbolic and arc tangent, and the shifted sigmoid with various β values (thus allowing us to investigate the trade-off mentioned in Section 3. [sent-177, score-0.54]
76 From the results we can see that larger-β sigmoid is usually better, with overall performance increasing with β. [sent-179, score-0.154]
77 This is because with large β, the shifted sigmoid curve better approximates the sign function. [sent-180, score-0.362]
78 For β = 100, the results are even better than our previous KL-divergence (KLD) results reported in [2] (right-most column in the table). [sent-181, score-0.06]
79 3 Note that the other soft sign functions can also be defined to utilize a β smoothness parameter. [sent-184, score-0.149]
80 data australian breast chess cleve corral crx diabetes flare german glass glass2 heart hepatitis iris letter lymphography mofn-3-7-10 pima satimage segment shuttle-small soybean-large vehicle vote waveform-21 NN baseline 16. [sent-186, score-0.14]
81 40 In order to show that our methodology is general beyond isolated-word speech recognition, we also evaluated this technique on the entire UCI machine learning repository [9]. [sent-526, score-0.088]
82 , NN or NB), we augmented the resulting likelihood ratios with bias correction terms thereby evaluating our technique using quite different forms of baseline classifiers. [sent-530, score-0.594]
83 Unlike the above, with these data sets we have only tried one random tournament game to decide the winner so far. [sent-531, score-0.235]
84 While our bias correction described above is in terms of likelihoods ratios Lij (x), posteriors can be used as well if the posteriors are divided by the priors giving the relation P (c|x)/P (c) = P (x|c)/p(x) (i. [sent-533, score-0.429]
85 , scaled likelihoods) which produces the standard Lij (x) values when used in a likelihood ratio . [sent-535, score-0.164]
86 Again, first-pass error-full test-set hypothesized answers are used to compute the bias corrections. [sent-539, score-0.113]
87 Table 5 shows our results for both the NN (columns 2—5) and NB (columns 6—9) baseline classifiers. [sent-540, score-0.14]
88 Within each baseline group, the first column shows the baseline accuracy (with the 5-fold standard derivations when the data set is small). [sent-541, score-0.316]
89 The second column shows results using KL-divergence based bias corrections — these are the first published KLD results on the UCI data. [sent-542, score-0.168]
90 The third column shows results with sign-based correction (Equation 9), and the forth column shows the sigmoid (β = 10) case (Equation 10). [sent-543, score-0.425]
91 While not the point of this paper, one immediately sees that the NB baseline results are often better than the NN baseline results (15 out of 25 times). [sent-544, score-0.304]
92 Using the NN as a baseline, the table shows that the KLD results are almost always better than the baseline 24 times (out of 25). [sent-545, score-0.164]
93 Also, the sign correction is better than the baseline 23 out of 25 times, and the sigmoid(10) results are better 20 times. [sent-546, score-0.536]
94 These results therefore show that the NN KLD correction typically beats the sign and sigmoid correction, possibly owing to the error in the Taylor approximation. [sent-548, score-0.559]
95 Using the NB classifier as the baseline, however, shows not only improved baseline results in general but also that the sigmoid(10) improves more often. [sent-549, score-0.14]
96 Specifically, the KLD results are better than the baseline 16 times, sign is better than the baseline 18 times, and sigmoid(10) beats the baseline 19 times, suggesting that sigmoid(10) typically wins over the KLD case. [sent-550, score-0.671]
97 6 Discussion We have introduced a new necessary intransitive likelihood ratio classifier. [sent-551, score-0.338]
98 This was done by using sign-based corrections to likelihood ratios and by using continuous differentiable approximations of the sign function in order to be able to vary the inherent trade-off between sign-function approximation accuracy and Taylor error. [sent-552, score-0.427]
99 We have applied these techniques to both a speech recognition corpus and the UCI data sets, as well as applying previous KL-divergence based corrections to the latter data. [sent-553, score-0.181]
100 Results on the UCI data sets confirm that our techniques reasonably generalize to data sets other than speech recognition. [sent-554, score-0.088]
wordName wordTfidf (topN-words)
[('lij', 0.535), ('cj', 0.286), ('ij', 0.28), ('ci', 0.253), ('correction', 0.199), ('kld', 0.181), ('sgnlij', 0.181), ('sgn', 0.173), ('sigmoid', 0.154), ('sign', 0.149), ('intransitive', 0.145), ('sig', 0.145), ('baseline', 0.14), ('nn', 0.108), ('winner', 0.107), ('pij', 0.106), ('uci', 0.096), ('game', 0.092), ('epij', 0.091), ('speech', 0.088), ('ratio', 0.083), ('likelihood', 0.081), ('nb', 0.08), ('bias', 0.077), ('weaker', 0.077), ('pj', 0.075), ('ratios', 0.069), ('taylor', 0.063), ('sij', 0.06), ('rn', 0.057), ('classi', 0.056), ('equation', 0.055), ('corrections', 0.055), ('pi', 0.054), ('discontinuous', 0.047), ('ep', 0.046), ('ln', 0.043), ('jeff', 0.043), ('perfect', 0.042), ('likelihoods', 0.04), ('vocabulary', 0.038), ('recognition', 0.038), ('assumption', 0.037), ('column', 0.036), ('bilmes', 0.036), ('gang', 0.036), ('intransitivity', 0.036), ('phonebook', 0.036), ('supposing', 0.036), ('tournament', 0.036), ('wers', 0.036), ('zg', 0.036), ('zlij', 0.036), ('std', 0.036), ('hypothesized', 0.036), ('class', 0.035), ('shifted', 0.035), ('aij', 0.033), ('word', 0.032), ('transitive', 0.031), ('stricter', 0.031), ('therefore', 0.03), ('condition', 0.03), ('decision', 0.029), ('necessary', 0.029), ('hyperbolic', 0.029), ('term', 0.029), ('differentiable', 0.028), ('augmented', 0.028), ('true', 0.028), ('ni', 0.027), ('free', 0.027), ('integral', 0.027), ('wins', 0.027), ('beats', 0.027), ('boundary', 0.026), ('nj', 0.026), ('players', 0.025), ('toolbox', 0.025), ('lebesgue', 0.025), ('yields', 0.025), ('better', 0.024), ('mass', 0.024), ('arc', 0.024), ('vice', 0.024), ('dx', 0.024), ('sons', 0.024), ('achieve', 0.023), ('approximations', 0.023), ('acoustics', 0.023), ('discrepancy', 0.023), ('approximation', 0.022), ('ji', 0.022), ('ultimate', 0.022), ('posteriors', 0.022), ('medium', 0.022), ('ultimately', 0.022), ('hmms', 0.022), ('gradient', 0.022), ('expansion', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
Author: Gang Ji, Jeff A. Bilmes
Abstract: In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolatedword speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results. 1
2 0.14234437 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
Author: Quaid D. Morris, Brendan J. Frey
Abstract: This paper addresses the problem of untangling hidden graphs from a set of noisy detections of undirected edges. We present a model of the generation of the observed graph that includes degree-based structure priors on the hidden graphs. Exact inference in the model is intractable; we present an efficient approximate inference algorithm to compute edge appearance posteriors. We evaluate our model and algorithm on a biological graph inference problem. 1 Introduction and motivation The inference of hidden graphs from noisy edge appearance data is an important problem with obvious practical application. For example, biologists are currently building networks of all the physical protein-protein interactions (PPI) that occur in particular organisms. The importance of this enterprise is commensurate with its scale: a completed network would be as valuable as a completed genome sequence, and because each organism contains thousands of different types of proteins, there are millions of possible types of interactions. However, scalable experimental methods for detecting interactions are noisy, generating many false detections. Motivated by this application, we formulate the general problem of inferring hidden graphs as probabilistic inference in a graphical model, and we introduce an efficient algorithm that approximates the posterior probability that an edge is present. In our model, a set of hidden, constituent graphs are combined to generate the observed graph. Each hidden graph is independently sampled from a prior on graph structure. The combination mechanism acts independently on each edge but can be either stochastic or deterministic. Figure 1 shows an example of our generative model. Typically one of the hidden graphs represents the graph of interest (the true graph), the others represent different types of observation noise. Independent edge noise may also be added by the combination mechanism. We use probabilistic inference to compute a likely decomposition of the observed graph into its constituent parts. This process is deemed “untangling”. We use the term “denoising” to refer to the special case where the edge noise is independent. In denoising there is a single hidden graph, the true graph, and all edge noise in the observed graph is due E1 1 eij E2 i 2 eij j xij i j i j X Figure 1: Illustrative generative model example. Figure shows an example where an observed graph, X, is a noisy composition of two constituent graphs, E 1 and E 2 . All graphs share the same vertex set, so each can be represented by a symmetric matrix of random binary variables (i.e., an adjacency matrix). This generative model is designed to solve a toy counter-espionage problem. The vertices represent suspects and each edge in X represents an observed call between two suspects. The graph X reflects zero or more spy rings (represented by E 1 ), telemarketing calls (represented by E 2 ), social calls (independent edge noise), and lost call records (more independent edge noise). The task is to locate any spy rings hidden in X. We model the distribution of spy ring graphs using a prior, P (E 1 ), that has support only on graphs where all vertices have degree of either 2 (i.e., are in the ring) or 0 (i.e., are not). Graphs of telemarketing call patterns are represented using a prior, P (E 2 ), under which all nodes have degrees of > 3 (i.e., are telemarketers), 1 (i.e., are telemarketees), or 0 (i.e., are neither). The displayed hidden graphs are one likely untangling of X. to the combination mechanism. Prior distributions over graphs can be specified in various ways, but our choice is motivated by problems we want to solve, and by a view to deriving an efficient inference algorithm. One compact representation of a distribution over graphs consists of specifying a distribution over vertex degrees, and assuming that graphs that have the same vertex degrees are equiprobable. Such a prior can model quite rich distributions over graphs. These degree-based structure priors are natural representions of graph structure; many classes of real-world networks have a characteristic functional form associated with their degree distributions [1], and sometimes this form can be predicted using knowledge about the domain (see, e.g., [2]) or detected empirically (see, e.g., [3, 4]). As such, our model incorporates degree-based structure priors. Though exact inference in our model is intractable in general, we present an efficient algorithm for approximate inference for arbitrary degree distributions. We evaluate our model and algorithm using the real-world example of untangling yeast proteinprotein interaction networks. 2 A model of noisy and tangled graphs For degree-based structure priors, inference consists of searching over vertex degrees and edge instantiations, while comparing each edge with its noisy observation and enforcing the constraint that the number of edges connected to every vertex must equal the degree of the vertex. Our formulation of the problem in this way is inspired by the success of the sum-product algorithm (loopy belief propagation) for solving similar formulations of problems in error-correcting decoding [6, 7], phase unwrapping [8], and random satisfiability [9]. For example, in error-correcting decoding, inference consists of searching over configurations of codeword bits, while comparing each bit with its noisy observation and enforcing parity-check constraints on subsets of bits [10]. For a graph on a set of N vertices, eij is a variable that indicates the presence of an edge connecting vertices i and j: eij = 1 if there is an edge, and eij = 0 otherwise. We assume the vertex set is fixed, so each graph is specified by an adjacency matrix, E = {eij }N . The degree of vertex i is denoted by di and the i,j=1 degree set by D = {di }N . The observations are given by a noisy adjacency matrix, i=1 X = {xij }N . Generally, edges can be directed, but in this paper we focus on i,j=1 undirected graphs, so eij = eji and xij = xji . Assuming the observation noise is independent for different edges, the joint distribution is P (X, E, D) = P (X|E)P (E, D) = P (xij |eij ) P (E, D). j≥i P (xij |eij ) models the edge observation noise. We use an undirected model for the joint distribution over edges and degrees, P (E, D), where the prior distribution over di is determined by a non-negative potential fi (di ). Assuming graphs that have the same vertex degrees are equiprobable, we have N eij ) , fi (di )I(di , P (E, D) ∝ i j=1 where I(a, b) = 1 if a = b, and I(a, b) = 0 if a = b. The term I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . It is straightforward to show that the marginal distribution over di is P (di ) ∝ fi (di ) D\di nD j=i fj (dj ) , where nD is the number of graphs with degrees D and the sum is over all degree variables except di . The potentials, fi , can be estimated from a given degree prior using Markov chain Monte Carlo; or, as an approximation, they can be set to an empirical degree distribution obtained from noise-free graphs. Fig 2a shows the factor graph [11] for the above model. Each filled square corresponds to a term in the factorization of the joint distribution and the square is connected to all variables on which the term depends. Factor graphs are graphical models that unify the properties of Bayesian networks and Markov random fields [12]. Many inference algorithms, including the sum-product algorithm (a.k.a. loopy belief propagation), are more easily derived using factor graphs than Bayesian networks or Markov random fields. We describe the sum-product algorithm for our model in section 3. (a) I(d ,e + e +e +e 4 14 24 34 44 d1 e11 e12 e14 4 d3 d2 e13 f 4(d ) e22 e23 e24 (b) ) d1 d4 e33 e34 e1 e44 11 x11 x11 x12 x13 x14 x22 x23 x24 x33 d1 1 x34 x44 e2 11 e1 12 x12 e2 12 d1 2 e1 13 e1 e2 13 e1 14 x13 e1 22 x14 e2 14 d1 3 23 x22 e2 22 x23 e2 23 4 e1 e1 24 e2 24 e1 33 x24 34 x33 e2 33 x34 e2 34 e1 44 x44 e2 44 P( x44 | e44 ) (c) d4 s41 e14 e24 d2 1 d4 e34 e44 e14 s42 e24 s43 e34 d2 2 d2 3 d2 4 s44 P( x e44 44 1 ,e 2 ) 44 44 |e Figure 2: (a) A factor graph that describes a distribution over graphs with vertex degrees di , binary edge indicator variables eij , and noisy edge observations xij . The indicator function I(di , j eij ) enforces the constraint that the sum of the binary edge indicator variables for vertex i must equal the degree of vertex i. (b) A factor graph that explains noisy observed edges as a combination of two constituent graphs, with edge indicator variables e 1 and e2 . ij ij (c) The constraint I(di , j eij ) can be implemented using a chain with state variables, which leads to an exponentially faster message-passing algorithm. 2.1 Combining multiple graphs The above model is suitable when we want to infer a graph that matches a degree prior, assuming the edge observation noise is independent. A more challenging goal, with practical application, is to infer multiple hidden graphs that combine to explain the observed edge data. In section 4, we show how priors over multiple hidden graphs can be be used to infer protein-protein interactions. When there are H hidden graphs, each constituent graph is specified by a set of edges on the same set of N common vertices. For the degree variables and edge variables, we use a superscript to indicate which hidden graph the variable is used to describe. Assuming the graphs are independent, the joint distribution over the observed edge data X, and the edge variables and degree variables for the hidden graphs, E 1 , D1 , . . . , E H , DH , is H P (X, E 1 , D1 , . . . , E H , DH ) = P (xij |e1 , . . . , eH ) ij ij j≥i P (E h , Dh ), (1) h=1 where for each hidden graph, P (E h , Dh ) is modeled as described above. Here, the likelihood P (xij |e1 , . . . , eH ) describes how the edges in the hidden graphs combine ij ij to model the observed edge. Figure 2b shows the factor graph for this model. 3 Probabilistic inference of constituent graphs Exact probabilistic inference in the above models is intractable, here we introduce an approximate inference algorithm that consists of applying the sum-product algorithm, while ignoring cycles in the factor graph. Although the sum-product algorithm has been used to obtain excellent results on several problems [6, 7, 13, 14, 8, 9], we have found that the algorithm works best when the model consists of uncertain observations of variables that are subject to a large number of hard constraints. Thus the formulation of the model described above. Conceptually, our inference algorithm is a straight-forward application of the sumproduct algorithm, c.f. [15], where messages are passed along edges in the factor graph iteratively, and then combined at variables to obtain estimates of posterior probabilities. However, direct implementation of the message-passing updates will lead to an intractable algorithm. In particular, direct implementation of the update for the message sent from function I(di , j eij ) to edge variable eik takes a number of scalar operations that is exponential in the number of vertices. Fortunately there exists a more efficient way to compute these messages. 3.1 Efficiently summing over edge configurations The function I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . Passing messages through this function requires summing over all edge configurations that correspond to each possible degree, di , and summing over di . Specifically, the message, µIi →eik (eik ), sent from function I(di , j eij ) to edge variable eik is given by I(di , di {eij | j=1,...,N, j=k} eij ) j µeij →Ii (eij ) , j=k where µeij →Ii (eij ) is the message sent from eij to function I(di , j eij ). The sum over {eij | j = 1, . . . , N, j = k} contains 2N −1 terms, so direct computation is intractable. However, for a maximum degree of dmax , all messages departing from the function I(di , j eij ) can be computed using order dmax N binary scalar operations, by introducing integer state variables sij . We define sij = n≤j ein and note that, by recursion, sij = sij−1 + eij , where si0 = 0 and 0 ≤ sij ≤ dmax . This recursive expression enables us to write the high-complexity constraint as the sum of a product of low-complexity constraints, N I(di , eij ) = j I(si1 , ei1 ) {sij | j=1,...,N } I(sij , sij−1 + eij ) I(di , siN ). j=2 This summation can be performed using the forward-backward algorithm. In the factor graph, the summation can be implemented by replacing the function I(di , j eij ) with a chain of lower-complexity functions, connected as shown in Fig. 2c. The function vertex (filled square) on the far left corresponds to I(si1 , ei1 ) and the function vertex in the upper right corresponds to I(di , siN ). So, messages can be passed through each constraint function I(di , j eij ) in an efficient manner, by performing a single forward-backward pass in the corresponding chain. 4 Results We evaluate our model using yeast protein-protein interaction (PPI) data compiled by [16]. These data include eight sets of putative, but noisy, interactions derived from various sources, and one gold-standard set of interactions detected by reliable experiments. Using the ∼ 6300 yeast proteins as vertices, we represent the eight sets of putative m interactions using adjacency matrices {Y m }8 m=1 where yij = 1 if and only if putative interaction dataset m contains an interaction between proteins i and j. We similarly use Y gold to represent the gold-standard interactions. m We construct an observed graph, X, by setting xij = maxm yij for all i and j, thus the observed edge set is the union of all the putative edge sets. We test our model (a) (b) 40 0 untangling baseline random empirical potential posterior −2 30 log Pr true positives (%) 50 20 10 −4 −6 −8 0 0 5 10 −10 0 false positives (%) 10 20 30 degree (# of nodes) Figure 3: Protein-protein interaction network untangling results. (a) ROC curves measuring performance of predicting e1 when xij = 1. (b) Degree distributions. Compares the empirical ij degree distribution of the test set subgraph of E 1 to the degree potential f 1 estimated on the ˆ ij training set subgraph of E 1 and to the distribution of di = j pij where pij = P (e1 = 1|X) is estimated by untangling. on the task of discerning which of the edges in X are also in Y gold . We formalize this problem as that of decomposing X into two constituent graphs E 1 and E 2 , the gold true and the noise graphs respectively, such that e1 = xij yij and e2 = xij − e1 . ij ij ij We use a training set to fit our model parameters and then measure task performance on a test set. The training set contains a randomly selected half of the ∼ 6300 yeast proteins, and the subgraphs of E 1 , E 2 , and X restricted to those vertices. The test contains the other half of the proteins and the corresponding subgraphs. Note that interactions connecting test set proteins to training set proteins (and vice versa) are ignored. We fit three sets of parameters: a set of Naive Bayes parameters that define a set of edge-specific likelihood functions, Pij (xij |e1 , e2 ), one degree potential, f 1 , which ij ij is the same for every vertex in E1 and defines the prior P (E 1 ), and a second, f 2 , that similarly defines the prior P (E 2 ). The likelihood functions, Pij , are used to both assign likelihoods and enforce problem constraints. Given our problem definition, if xij = 0 then e1 = e2 = 0, ij ij otherwise xij = 1 and e1 = 1 − e2 . We enforce the former constraint by setij ij ting Pij (xij = 0|e1 , e2 ) = (1 − e1 )(1 − e2 ), and the latter by setting Pij (xij = ij ij ij ij 1|e1 , e2 ) = 0 whenever e1 = e2 . This construction of Pij simplifies the calculation ij ij ij ij of the µPij →eh messages and improves the computational efficiency of inference beij cause when xij = 0, we need never update messages to and from variables e1 and ij e2 . We complete the specification of Pij (xij = 1|e1 , e2 ) as follows: ij ij ij ym Pij (xij = 1|e1 , e2 ) ij ij = m ij θm (1 − θm )1−yij , if e1 = 1 and e2 = 0, ij ij ym m ij ψm (1 − ψm )1−yij , if e1 = 0 and e2 = 1. ij ij where {θm } and {ψm } are naive Bayes parameters, θm = i,j m yij e1 / ij i,j e1 and ij ψm = i,j m yij e2 / ij i,j e2 , respectively. ij The degree potentials f 1 (d) and f 2 (d) are kernel density estimates fit to the degree distribution of the training set subgraphs of E 1 and E 2 , respectively. We use Gaussian kernels and set the width parameter (standard deviation) σ using leaveone-out cross-validation to maximize the total log density of the held-out datapoints. Each datapoint is the degree of a single vertex. Both degree potentials closely followed the training set empirical degree distributions. Untangling was done on the test set subgraph of X. We initially set the µ Pij →e1 ij messages equal to the likelihood function Pij and we randomly initialized the 1 µIj →e1 messages with samples from a normal distribution with mean 0 and variij ance 0.01. We then performed 40 iterations of the following message update order: 2 2 1 1 µe1 →Ij , µIj →e1 , µe1 →Pij , µPij →e2 , µe2 →Ij , µIj →e2 , µe2 →Pij , µPij →e1 . ij ij ij ij ij ij ij ij We evaluated our untangling algorithm using an ROC curve by comparing the actual ˆ test set subgraph of E 1 to posterior marginal probabilities,P (e1 = 1|X), estimated ij by our sum-product algorithm. Note that because the true interaction network is sparse (less than 0.2% of the 1.8 × 107 possible interactions are likely present [16]) and, in this case, true positive predictions are of greater biological interest than true negative predictions, we focus on low false positive rate portions of the ROC curve. Figure 3a compares the performance of a classifier for e1 based on thresholding ij ˆ P (eij = 1|X) to a baseline method based on thresholding the likelihood functions, Pij (xij = 1|e1 = 1, e2 = 0). Note because e1 = 0 whenever xij = 0, we exclude ij ij ij the xij = 0 cases from our performance evaluation. The ROC curve shows that for the same low false positive rate, untangling produces 50% − 100% more true positives than the baseline method. Figure 3b shows that the degree potential, the true degree distribution, and the predicted degree distribution are all comparable. The slight overprediction of the true degree distribution may result because the degree potential f 1 that defines P (E 1 ) is not equal to the expected degree distribution of graphs sampled from the distribution P (E 1 ). 5 Summary and Related Work Related work includes other algorithms for structure-based graph denoising [17, 18]. These algorithms use structural properties of the observed graph to score edges and rely on the true graph having a surprisingly large number of three (or four) edge cycles compared to the noise graph. In contrast, we place graph generation in a probabilistic framework; our algorithm computes structural fit in the hidden graph, where this computation is not affected by the noise graph(s); and we allow for multiple sources of observation noise, each with its own structural properties. After submitting this paper to the NIPS conference, we discovered [19], in which a degree-based graph structure prior is used to denoise (but not untangle) observed graphs. This paper addresses denoising in directed graphs as well as undirected graphs, however, the prior that they use is not amenable to deriving an efficient sumproduct algorithm. Instead, they use Markov Chain Monte Carlo to do approximate inference in a hidden graph containing 40 vertices. It is not clear how well this approach scales to the ∼ 3000 vertex graphs that we are using. In summary, the contributions of the work described in this paper include: a general formulation of the problem of graph untangling as inference in a factor graph; an efficient approximate inference algorithm for a rich class of degree-based structure priors; and a set of reliability scores (i.e., edge posteriors) for interactions from a current version of the yeast protein-protein interaction network. References [1] A L Barabasi and R Albert. Emergence of scaling in random networks. Science, 286(5439), October 1999. [2] A Rzhetsky and S M Gomez. Birth of scale-free molecular networks and the number of distinct dna and protein domains per genome. Bioinformatics, pages 988–96, 2001. [3] M Faloutsos, P Faloutsos, and C Faloutsos. On power-law relationships of the Internet topology. Computer Communications Review, 29, 1999. [4] Hawoong Jeong, B Tombor, R´ka Albert, Z N Oltvai, and Albert-L´szl´ Barab´si. e a o a The large-scale organization of metabolic networks. Nature, 407, October 2000. [5] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo CA., 1988. [6] D. J. C. MacKay and R. M. Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 32(18):1645–1646, August 1996. Reprinted in Electronics Letters, vol. 33, March 1997, 457–458. [7] B. J. Frey and F. R. Kschischang. Probability propagation and iterative decoding. In Proceedings of the 1996 Allerton Conference on Communication, Control and Computing, 1996. [8] B. J. Frey, R. Koetter, and N. Petrovic. Very loopy belief propagation for unwrapping phase images. In 2001 Conference on Advances in Neural Information Processing Systems, Volume 14. MIT Press, 2002. [9] M. M´zard, G. Parisi, and R. Zecchina. Analytic and algorithmic solution of random e satisfiability problems. Science, 297:812–815, 2002. [10] B. J. Frey and D. J. C. MacKay. Trellis-constrained codes. In Proceedings of the 35 th Allerton Conference on Communication, Control and Computing 1997, 1998. [11] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, Special Issue on Codes on Graphs and Iterative Algorithms, 47(2):498–519, February 2001. [12] B. J. Frey. Factor graphs: A unification of directed and undirected graphical models. University of Toronto Technical Report PSI-2003-02, 2003. [13] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence 1999. Stockholm, Sweden, 1999. [14] W. Freeman and E. Pasztor. Learning low-level vision. In Proceedings of the International Conference on Computer Vision, pages 1182–1189, 1999. [15] M. I. Jordan. An Inroduction to Learning in Graphical Models. 2004. In preparation. [16] C von Mering et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 2002. [17] R Saito, H Suzuki, and Y Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, pages 756–63, 2003. [18] D S Goldberg and F P Roth. Assessing experimentally derived interactions in a small world. Proceedings of the National Academy of Science, 2003. [19] S M Gomez and A Rzhetsky. Towards the prediction of complete protein–protein interaction networks. In Pacific Symposium on Biocomputing, pages 413–24, 2002.
3 0.080613762 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
4 0.075637862 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
Author: Xuanlong Nguyen, Michael I. Jordan
Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1
5 0.074721664 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
6 0.070477828 46 nips-2003-Clustering with the Connectivity Kernel
7 0.067348436 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
8 0.065428153 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
9 0.063360199 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
10 0.058877837 126 nips-2003-Measure Based Regularization
11 0.057318382 151 nips-2003-PAC-Bayesian Generic Chaining
12 0.056576464 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
13 0.055574086 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms
14 0.053457826 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
15 0.052795924 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination
16 0.052508064 100 nips-2003-Laplace Propagation
17 0.051772475 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
18 0.051149305 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
19 0.050694462 117 nips-2003-Linear Response for Approximate Inference
20 0.050250087 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
topicId topicWeight
[(0, -0.163), (1, -0.025), (2, -0.02), (3, 0.016), (4, 0.042), (5, -0.022), (6, 0.04), (7, -0.042), (8, -0.117), (9, 0.018), (10, 0.021), (11, -0.067), (12, -0.055), (13, -0.006), (14, 0.049), (15, -0.099), (16, 0.03), (17, -0.032), (18, -0.019), (19, -0.009), (20, 0.086), (21, -0.091), (22, -0.043), (23, 0.005), (24, -0.215), (25, 0.058), (26, -0.036), (27, -0.099), (28, -0.099), (29, 0.009), (30, -0.151), (31, -0.001), (32, -0.043), (33, -0.02), (34, 0.138), (35, -0.129), (36, 0.092), (37, -0.017), (38, 0.047), (39, 0.094), (40, -0.228), (41, 0.039), (42, 0.045), (43, -0.159), (44, 0.124), (45, 0.013), (46, 0.029), (47, 0.063), (48, -0.004), (49, -0.14)]
simIndex simValue paperId paperTitle
same-paper 1 0.96266478 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
Author: Gang Ji, Jeff A. Bilmes
Abstract: In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolatedword speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results. 1
2 0.67711323 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
Author: Quaid D. Morris, Brendan J. Frey
Abstract: This paper addresses the problem of untangling hidden graphs from a set of noisy detections of undirected edges. We present a model of the generation of the observed graph that includes degree-based structure priors on the hidden graphs. Exact inference in the model is intractable; we present an efficient approximate inference algorithm to compute edge appearance posteriors. We evaluate our model and algorithm on a biological graph inference problem. 1 Introduction and motivation The inference of hidden graphs from noisy edge appearance data is an important problem with obvious practical application. For example, biologists are currently building networks of all the physical protein-protein interactions (PPI) that occur in particular organisms. The importance of this enterprise is commensurate with its scale: a completed network would be as valuable as a completed genome sequence, and because each organism contains thousands of different types of proteins, there are millions of possible types of interactions. However, scalable experimental methods for detecting interactions are noisy, generating many false detections. Motivated by this application, we formulate the general problem of inferring hidden graphs as probabilistic inference in a graphical model, and we introduce an efficient algorithm that approximates the posterior probability that an edge is present. In our model, a set of hidden, constituent graphs are combined to generate the observed graph. Each hidden graph is independently sampled from a prior on graph structure. The combination mechanism acts independently on each edge but can be either stochastic or deterministic. Figure 1 shows an example of our generative model. Typically one of the hidden graphs represents the graph of interest (the true graph), the others represent different types of observation noise. Independent edge noise may also be added by the combination mechanism. We use probabilistic inference to compute a likely decomposition of the observed graph into its constituent parts. This process is deemed “untangling”. We use the term “denoising” to refer to the special case where the edge noise is independent. In denoising there is a single hidden graph, the true graph, and all edge noise in the observed graph is due E1 1 eij E2 i 2 eij j xij i j i j X Figure 1: Illustrative generative model example. Figure shows an example where an observed graph, X, is a noisy composition of two constituent graphs, E 1 and E 2 . All graphs share the same vertex set, so each can be represented by a symmetric matrix of random binary variables (i.e., an adjacency matrix). This generative model is designed to solve a toy counter-espionage problem. The vertices represent suspects and each edge in X represents an observed call between two suspects. The graph X reflects zero or more spy rings (represented by E 1 ), telemarketing calls (represented by E 2 ), social calls (independent edge noise), and lost call records (more independent edge noise). The task is to locate any spy rings hidden in X. We model the distribution of spy ring graphs using a prior, P (E 1 ), that has support only on graphs where all vertices have degree of either 2 (i.e., are in the ring) or 0 (i.e., are not). Graphs of telemarketing call patterns are represented using a prior, P (E 2 ), under which all nodes have degrees of > 3 (i.e., are telemarketers), 1 (i.e., are telemarketees), or 0 (i.e., are neither). The displayed hidden graphs are one likely untangling of X. to the combination mechanism. Prior distributions over graphs can be specified in various ways, but our choice is motivated by problems we want to solve, and by a view to deriving an efficient inference algorithm. One compact representation of a distribution over graphs consists of specifying a distribution over vertex degrees, and assuming that graphs that have the same vertex degrees are equiprobable. Such a prior can model quite rich distributions over graphs. These degree-based structure priors are natural representions of graph structure; many classes of real-world networks have a characteristic functional form associated with their degree distributions [1], and sometimes this form can be predicted using knowledge about the domain (see, e.g., [2]) or detected empirically (see, e.g., [3, 4]). As such, our model incorporates degree-based structure priors. Though exact inference in our model is intractable in general, we present an efficient algorithm for approximate inference for arbitrary degree distributions. We evaluate our model and algorithm using the real-world example of untangling yeast proteinprotein interaction networks. 2 A model of noisy and tangled graphs For degree-based structure priors, inference consists of searching over vertex degrees and edge instantiations, while comparing each edge with its noisy observation and enforcing the constraint that the number of edges connected to every vertex must equal the degree of the vertex. Our formulation of the problem in this way is inspired by the success of the sum-product algorithm (loopy belief propagation) for solving similar formulations of problems in error-correcting decoding [6, 7], phase unwrapping [8], and random satisfiability [9]. For example, in error-correcting decoding, inference consists of searching over configurations of codeword bits, while comparing each bit with its noisy observation and enforcing parity-check constraints on subsets of bits [10]. For a graph on a set of N vertices, eij is a variable that indicates the presence of an edge connecting vertices i and j: eij = 1 if there is an edge, and eij = 0 otherwise. We assume the vertex set is fixed, so each graph is specified by an adjacency matrix, E = {eij }N . The degree of vertex i is denoted by di and the i,j=1 degree set by D = {di }N . The observations are given by a noisy adjacency matrix, i=1 X = {xij }N . Generally, edges can be directed, but in this paper we focus on i,j=1 undirected graphs, so eij = eji and xij = xji . Assuming the observation noise is independent for different edges, the joint distribution is P (X, E, D) = P (X|E)P (E, D) = P (xij |eij ) P (E, D). j≥i P (xij |eij ) models the edge observation noise. We use an undirected model for the joint distribution over edges and degrees, P (E, D), where the prior distribution over di is determined by a non-negative potential fi (di ). Assuming graphs that have the same vertex degrees are equiprobable, we have N eij ) , fi (di )I(di , P (E, D) ∝ i j=1 where I(a, b) = 1 if a = b, and I(a, b) = 0 if a = b. The term I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . It is straightforward to show that the marginal distribution over di is P (di ) ∝ fi (di ) D\di nD j=i fj (dj ) , where nD is the number of graphs with degrees D and the sum is over all degree variables except di . The potentials, fi , can be estimated from a given degree prior using Markov chain Monte Carlo; or, as an approximation, they can be set to an empirical degree distribution obtained from noise-free graphs. Fig 2a shows the factor graph [11] for the above model. Each filled square corresponds to a term in the factorization of the joint distribution and the square is connected to all variables on which the term depends. Factor graphs are graphical models that unify the properties of Bayesian networks and Markov random fields [12]. Many inference algorithms, including the sum-product algorithm (a.k.a. loopy belief propagation), are more easily derived using factor graphs than Bayesian networks or Markov random fields. We describe the sum-product algorithm for our model in section 3. (a) I(d ,e + e +e +e 4 14 24 34 44 d1 e11 e12 e14 4 d3 d2 e13 f 4(d ) e22 e23 e24 (b) ) d1 d4 e33 e34 e1 e44 11 x11 x11 x12 x13 x14 x22 x23 x24 x33 d1 1 x34 x44 e2 11 e1 12 x12 e2 12 d1 2 e1 13 e1 e2 13 e1 14 x13 e1 22 x14 e2 14 d1 3 23 x22 e2 22 x23 e2 23 4 e1 e1 24 e2 24 e1 33 x24 34 x33 e2 33 x34 e2 34 e1 44 x44 e2 44 P( x44 | e44 ) (c) d4 s41 e14 e24 d2 1 d4 e34 e44 e14 s42 e24 s43 e34 d2 2 d2 3 d2 4 s44 P( x e44 44 1 ,e 2 ) 44 44 |e Figure 2: (a) A factor graph that describes a distribution over graphs with vertex degrees di , binary edge indicator variables eij , and noisy edge observations xij . The indicator function I(di , j eij ) enforces the constraint that the sum of the binary edge indicator variables for vertex i must equal the degree of vertex i. (b) A factor graph that explains noisy observed edges as a combination of two constituent graphs, with edge indicator variables e 1 and e2 . ij ij (c) The constraint I(di , j eij ) can be implemented using a chain with state variables, which leads to an exponentially faster message-passing algorithm. 2.1 Combining multiple graphs The above model is suitable when we want to infer a graph that matches a degree prior, assuming the edge observation noise is independent. A more challenging goal, with practical application, is to infer multiple hidden graphs that combine to explain the observed edge data. In section 4, we show how priors over multiple hidden graphs can be be used to infer protein-protein interactions. When there are H hidden graphs, each constituent graph is specified by a set of edges on the same set of N common vertices. For the degree variables and edge variables, we use a superscript to indicate which hidden graph the variable is used to describe. Assuming the graphs are independent, the joint distribution over the observed edge data X, and the edge variables and degree variables for the hidden graphs, E 1 , D1 , . . . , E H , DH , is H P (X, E 1 , D1 , . . . , E H , DH ) = P (xij |e1 , . . . , eH ) ij ij j≥i P (E h , Dh ), (1) h=1 where for each hidden graph, P (E h , Dh ) is modeled as described above. Here, the likelihood P (xij |e1 , . . . , eH ) describes how the edges in the hidden graphs combine ij ij to model the observed edge. Figure 2b shows the factor graph for this model. 3 Probabilistic inference of constituent graphs Exact probabilistic inference in the above models is intractable, here we introduce an approximate inference algorithm that consists of applying the sum-product algorithm, while ignoring cycles in the factor graph. Although the sum-product algorithm has been used to obtain excellent results on several problems [6, 7, 13, 14, 8, 9], we have found that the algorithm works best when the model consists of uncertain observations of variables that are subject to a large number of hard constraints. Thus the formulation of the model described above. Conceptually, our inference algorithm is a straight-forward application of the sumproduct algorithm, c.f. [15], where messages are passed along edges in the factor graph iteratively, and then combined at variables to obtain estimates of posterior probabilities. However, direct implementation of the message-passing updates will lead to an intractable algorithm. In particular, direct implementation of the update for the message sent from function I(di , j eij ) to edge variable eik takes a number of scalar operations that is exponential in the number of vertices. Fortunately there exists a more efficient way to compute these messages. 3.1 Efficiently summing over edge configurations The function I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . Passing messages through this function requires summing over all edge configurations that correspond to each possible degree, di , and summing over di . Specifically, the message, µIi →eik (eik ), sent from function I(di , j eij ) to edge variable eik is given by I(di , di {eij | j=1,...,N, j=k} eij ) j µeij →Ii (eij ) , j=k where µeij →Ii (eij ) is the message sent from eij to function I(di , j eij ). The sum over {eij | j = 1, . . . , N, j = k} contains 2N −1 terms, so direct computation is intractable. However, for a maximum degree of dmax , all messages departing from the function I(di , j eij ) can be computed using order dmax N binary scalar operations, by introducing integer state variables sij . We define sij = n≤j ein and note that, by recursion, sij = sij−1 + eij , where si0 = 0 and 0 ≤ sij ≤ dmax . This recursive expression enables us to write the high-complexity constraint as the sum of a product of low-complexity constraints, N I(di , eij ) = j I(si1 , ei1 ) {sij | j=1,...,N } I(sij , sij−1 + eij ) I(di , siN ). j=2 This summation can be performed using the forward-backward algorithm. In the factor graph, the summation can be implemented by replacing the function I(di , j eij ) with a chain of lower-complexity functions, connected as shown in Fig. 2c. The function vertex (filled square) on the far left corresponds to I(si1 , ei1 ) and the function vertex in the upper right corresponds to I(di , siN ). So, messages can be passed through each constraint function I(di , j eij ) in an efficient manner, by performing a single forward-backward pass in the corresponding chain. 4 Results We evaluate our model using yeast protein-protein interaction (PPI) data compiled by [16]. These data include eight sets of putative, but noisy, interactions derived from various sources, and one gold-standard set of interactions detected by reliable experiments. Using the ∼ 6300 yeast proteins as vertices, we represent the eight sets of putative m interactions using adjacency matrices {Y m }8 m=1 where yij = 1 if and only if putative interaction dataset m contains an interaction between proteins i and j. We similarly use Y gold to represent the gold-standard interactions. m We construct an observed graph, X, by setting xij = maxm yij for all i and j, thus the observed edge set is the union of all the putative edge sets. We test our model (a) (b) 40 0 untangling baseline random empirical potential posterior −2 30 log Pr true positives (%) 50 20 10 −4 −6 −8 0 0 5 10 −10 0 false positives (%) 10 20 30 degree (# of nodes) Figure 3: Protein-protein interaction network untangling results. (a) ROC curves measuring performance of predicting e1 when xij = 1. (b) Degree distributions. Compares the empirical ij degree distribution of the test set subgraph of E 1 to the degree potential f 1 estimated on the ˆ ij training set subgraph of E 1 and to the distribution of di = j pij where pij = P (e1 = 1|X) is estimated by untangling. on the task of discerning which of the edges in X are also in Y gold . We formalize this problem as that of decomposing X into two constituent graphs E 1 and E 2 , the gold true and the noise graphs respectively, such that e1 = xij yij and e2 = xij − e1 . ij ij ij We use a training set to fit our model parameters and then measure task performance on a test set. The training set contains a randomly selected half of the ∼ 6300 yeast proteins, and the subgraphs of E 1 , E 2 , and X restricted to those vertices. The test contains the other half of the proteins and the corresponding subgraphs. Note that interactions connecting test set proteins to training set proteins (and vice versa) are ignored. We fit three sets of parameters: a set of Naive Bayes parameters that define a set of edge-specific likelihood functions, Pij (xij |e1 , e2 ), one degree potential, f 1 , which ij ij is the same for every vertex in E1 and defines the prior P (E 1 ), and a second, f 2 , that similarly defines the prior P (E 2 ). The likelihood functions, Pij , are used to both assign likelihoods and enforce problem constraints. Given our problem definition, if xij = 0 then e1 = e2 = 0, ij ij otherwise xij = 1 and e1 = 1 − e2 . We enforce the former constraint by setij ij ting Pij (xij = 0|e1 , e2 ) = (1 − e1 )(1 − e2 ), and the latter by setting Pij (xij = ij ij ij ij 1|e1 , e2 ) = 0 whenever e1 = e2 . This construction of Pij simplifies the calculation ij ij ij ij of the µPij →eh messages and improves the computational efficiency of inference beij cause when xij = 0, we need never update messages to and from variables e1 and ij e2 . We complete the specification of Pij (xij = 1|e1 , e2 ) as follows: ij ij ij ym Pij (xij = 1|e1 , e2 ) ij ij = m ij θm (1 − θm )1−yij , if e1 = 1 and e2 = 0, ij ij ym m ij ψm (1 − ψm )1−yij , if e1 = 0 and e2 = 1. ij ij where {θm } and {ψm } are naive Bayes parameters, θm = i,j m yij e1 / ij i,j e1 and ij ψm = i,j m yij e2 / ij i,j e2 , respectively. ij The degree potentials f 1 (d) and f 2 (d) are kernel density estimates fit to the degree distribution of the training set subgraphs of E 1 and E 2 , respectively. We use Gaussian kernels and set the width parameter (standard deviation) σ using leaveone-out cross-validation to maximize the total log density of the held-out datapoints. Each datapoint is the degree of a single vertex. Both degree potentials closely followed the training set empirical degree distributions. Untangling was done on the test set subgraph of X. We initially set the µ Pij →e1 ij messages equal to the likelihood function Pij and we randomly initialized the 1 µIj →e1 messages with samples from a normal distribution with mean 0 and variij ance 0.01. We then performed 40 iterations of the following message update order: 2 2 1 1 µe1 →Ij , µIj →e1 , µe1 →Pij , µPij →e2 , µe2 →Ij , µIj →e2 , µe2 →Pij , µPij →e1 . ij ij ij ij ij ij ij ij We evaluated our untangling algorithm using an ROC curve by comparing the actual ˆ test set subgraph of E 1 to posterior marginal probabilities,P (e1 = 1|X), estimated ij by our sum-product algorithm. Note that because the true interaction network is sparse (less than 0.2% of the 1.8 × 107 possible interactions are likely present [16]) and, in this case, true positive predictions are of greater biological interest than true negative predictions, we focus on low false positive rate portions of the ROC curve. Figure 3a compares the performance of a classifier for e1 based on thresholding ij ˆ P (eij = 1|X) to a baseline method based on thresholding the likelihood functions, Pij (xij = 1|e1 = 1, e2 = 0). Note because e1 = 0 whenever xij = 0, we exclude ij ij ij the xij = 0 cases from our performance evaluation. The ROC curve shows that for the same low false positive rate, untangling produces 50% − 100% more true positives than the baseline method. Figure 3b shows that the degree potential, the true degree distribution, and the predicted degree distribution are all comparable. The slight overprediction of the true degree distribution may result because the degree potential f 1 that defines P (E 1 ) is not equal to the expected degree distribution of graphs sampled from the distribution P (E 1 ). 5 Summary and Related Work Related work includes other algorithms for structure-based graph denoising [17, 18]. These algorithms use structural properties of the observed graph to score edges and rely on the true graph having a surprisingly large number of three (or four) edge cycles compared to the noise graph. In contrast, we place graph generation in a probabilistic framework; our algorithm computes structural fit in the hidden graph, where this computation is not affected by the noise graph(s); and we allow for multiple sources of observation noise, each with its own structural properties. After submitting this paper to the NIPS conference, we discovered [19], in which a degree-based graph structure prior is used to denoise (but not untangle) observed graphs. This paper addresses denoising in directed graphs as well as undirected graphs, however, the prior that they use is not amenable to deriving an efficient sumproduct algorithm. Instead, they use Markov Chain Monte Carlo to do approximate inference in a hidden graph containing 40 vertices. It is not clear how well this approach scales to the ∼ 3000 vertex graphs that we are using. In summary, the contributions of the work described in this paper include: a general formulation of the problem of graph untangling as inference in a factor graph; an efficient approximate inference algorithm for a rich class of degree-based structure priors; and a set of reliability scores (i.e., edge posteriors) for interactions from a current version of the yeast protein-protein interaction network. References [1] A L Barabasi and R Albert. Emergence of scaling in random networks. Science, 286(5439), October 1999. [2] A Rzhetsky and S M Gomez. Birth of scale-free molecular networks and the number of distinct dna and protein domains per genome. Bioinformatics, pages 988–96, 2001. [3] M Faloutsos, P Faloutsos, and C Faloutsos. On power-law relationships of the Internet topology. Computer Communications Review, 29, 1999. [4] Hawoong Jeong, B Tombor, R´ka Albert, Z N Oltvai, and Albert-L´szl´ Barab´si. e a o a The large-scale organization of metabolic networks. Nature, 407, October 2000. [5] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo CA., 1988. [6] D. J. C. MacKay and R. M. Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 32(18):1645–1646, August 1996. Reprinted in Electronics Letters, vol. 33, March 1997, 457–458. [7] B. J. Frey and F. R. Kschischang. Probability propagation and iterative decoding. In Proceedings of the 1996 Allerton Conference on Communication, Control and Computing, 1996. [8] B. J. Frey, R. Koetter, and N. Petrovic. Very loopy belief propagation for unwrapping phase images. In 2001 Conference on Advances in Neural Information Processing Systems, Volume 14. MIT Press, 2002. [9] M. M´zard, G. Parisi, and R. Zecchina. Analytic and algorithmic solution of random e satisfiability problems. Science, 297:812–815, 2002. [10] B. J. Frey and D. J. C. MacKay. Trellis-constrained codes. In Proceedings of the 35 th Allerton Conference on Communication, Control and Computing 1997, 1998. [11] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, Special Issue on Codes on Graphs and Iterative Algorithms, 47(2):498–519, February 2001. [12] B. J. Frey. Factor graphs: A unification of directed and undirected graphical models. University of Toronto Technical Report PSI-2003-02, 2003. [13] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence 1999. Stockholm, Sweden, 1999. [14] W. Freeman and E. Pasztor. Learning low-level vision. In Proceedings of the International Conference on Computer Vision, pages 1182–1189, 1999. [15] M. I. Jordan. An Inroduction to Learning in Graphical Models. 2004. In preparation. [16] C von Mering et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 2002. [17] R Saito, H Suzuki, and Y Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, pages 756–63, 2003. [18] D S Goldberg and F P Roth. Assessing experimentally derived interactions in a small world. Proceedings of the National Academy of Science, 2003. [19] S M Gomez and A Rzhetsky. Towards the prediction of complete protein–protein interaction networks. In Pacific Symposium on Biocomputing, pages 413–24, 2002.
3 0.50598788 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
4 0.41302711 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
Author: Sanjiv Kumar, Martial Hebert
Abstract: In this paper we present Discriminative Random Fields (DRF), a discriminative framework for the classification of natural image regions by incorporating neighborhood spatial dependencies in the labels as well as the observed data. The proposed model exploits local discriminative models and allows to relax the assumption of conditional independence of the observed data given the labels, commonly used in the Markov Random Field (MRF) framework. The parameters of the DRF model are learned using penalized maximum pseudo-likelihood method. Furthermore, the form of the DRF model allows the MAP inference for binary classification problems using the graph min-cut algorithms. The performance of the model was verified on the synthetic as well as the real-world images. The DRF model outperforms the MRF model in the experiments. 1
5 0.3852627 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
Author: Justin Werfel, Xiaohui Xie, H. S. Seung
Abstract: Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are frequently used to overcome these difficulties. We derive quantitative learning curves for three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. The maximum learning rate for the stochastic methods scales inversely with the first power of the dimensionality of the noise injected into the system; with sufficiently small learning rate, all three methods give identical learning curves. These results suggest guidelines for when these stochastic methods will be limited in their utility, and considerations for architectures in which they will be effective. 1
6 0.36967728 100 nips-2003-Laplace Propagation
7 0.36537582 168 nips-2003-Salient Boundary Detection using Ratio Contour
8 0.3608999 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
9 0.33986825 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
10 0.33732826 3 nips-2003-AUC Optimization vs. Error Rate Minimization
11 0.32954946 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
12 0.3238011 30 nips-2003-Approximability of Probability Distributions
13 0.31127387 19 nips-2003-Algorithms for Interdependent Security Games
14 0.29899943 187 nips-2003-Training a Quantum Neural Network
15 0.29728174 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
16 0.28621358 181 nips-2003-Statistical Debugging of Sampled Programs
17 0.2852484 75 nips-2003-From Algorithmic to Subjective Randomness
18 0.28118587 193 nips-2003-Variational Linear Response
19 0.28019643 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class
20 0.27364945 130 nips-2003-Model Uncertainty in Classical Conditioning
topicId topicWeight
[(0, 0.043), (5, 0.01), (11, 0.028), (29, 0.015), (30, 0.021), (35, 0.048), (53, 0.116), (69, 0.015), (71, 0.076), (76, 0.062), (78, 0.275), (85, 0.098), (91, 0.097)]
simIndex simValue paperId paperTitle
1 0.89803904 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
Author: Yoshua Bengio, Jean-françcois Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet
Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1
same-paper 2 0.80181402 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
Author: Gang Ji, Jeff A. Bilmes
Abstract: In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolatedword speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results. 1
3 0.79185003 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
Author: Tong Zhang
Abstract: The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification. 1 Motivation Consider a binary classification problem where we want to predict label y ∈ {±1} based on observation x. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations (X1 , Y1 ), . . . , (Xn , Yn ), ˆ a large margin classification algorithm produces a decision function fn by empirically minimizing a loss function that is often a convex upper bound of the binary classification error ˆ ˆ function. Given fn , the binary decision rule is to predict y = 1 if fn (x) ≥ 0, and to predict ˆ y = −1 otherwise (the decision rule at fn (x) = 0 is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function φ in a pre-chosen function class Cn : 1 ˆ fn = arg min f ∈Cn n n φ(f (Xi )Yi ). (1) i=1 Originally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem. The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates. For example, see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate. For example, see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit. Although statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well-understood due to many recent works such as those mentioned above, there are much fewer studies on risk minimization based multicategory problems which generalizes the binary large margin method (1). The complexity of possible generalizations may be one reason. Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score).1 However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation which generalizes (1) can be used to yield consistent classifiers in the large sample limit. An important step toward this direction has recently been taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit. We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently. For example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations. 2 Multi-category classification We consider the following K-class classification problem: we would like to predict the label y ∈ {1, . . . , K} of an input vector x. In this paper, we only consider the simplest scenario with 0 − 1 classification loss: we have a loss of 0 for correct prediction, and loss of 1 for incorrect prediction. In binary classification, the class label can be determined using the sign of a decision function. This can be generalized to K class classification problem as follows: we consider K decision functions fc (x) where c = 1, . . . , K and we predict the label y of x as: T (f (x)) = arg max c∈{1,...,K} fc (x), (2) where we denote by f (x) the vector function f (x) = [f1 (x), . . . , fK (x)]. Note that if two or more components of f achieve the same maximum value, then we may choose any of them as T (f ). In this framework, fc (x) is often regarded as a scoring function for category c that is correlated with how likely x belongs to category c (compared with the remaining k − 1 categories). The classification error is given by: (f ) = 1 − EX P (Y = T (X)|X). Note that only the relative strength of fc compared with the alternatives is important. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of f (x). This allows us to impose one constraint on the vector f (x) which decreases the degree of freedom K of the K-component vector f (x) to K − 1. 1 This approach is often called one-versus-all or ranking in machine learning. Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them. For example, in the binary classification case, we can enforce f1 (x)+f2 (x) = 0, and hence f (x) can be represented as [f1 (x), −f1 (x)]. The decision rule in (2), which compares f1 (x) ≥ f2 (x), is equivalent to f1 (x) ≥ 0. This leads to the binary classification rule mentioned in the introduction. In the multi-category case, one may also interpret the possible constraint on the vector function f , which reduces its degree of freedom from K to K − 1 based on the following reasoning. In many cases, we seek fc (x) as a function of p(Y = c|x). Since we have a K constraint c=1 p(Y = c|x) = 1 (implying that the degree of freedom for p(Y = c|x) is K − 1), the degree of freedom for f is also K − 1 (instead of K). However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of f , or we do not impose any constraint, which keeps the degree of freedom of f to be K. The advantage of the latter is that it allows the computation of each fc to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each x to multiple labels of y ∈ {1, . . . , K}. In this scenario, we do not have a constraint. In this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form: 1 ˆ fn = arg min f ∈Cn n n ΨYi (f (Xi )). (3) i=1 As we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with ΨY (f (X)) replaced by ΨY (f (X), X), which we don’t study in this paper. From the standard learning theory, one can expect that with appropriately chosen Cn , the ˆ ˆ solution fn of (3) approximately minimizes the true risk R(f ) with respect to the unknown underlying distribution within the function class Cn , R(f ) = EX,Y ΨY (f (X)) = EX L(P (·|X), f (X)), (4) where P (·|X) = [P (Y = 1|X), . . . , P (Y = K|X)] is the conditional probability, and K L(q, f ) = qc Ψc (f ). (5) c=1 In order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function f that approximately minimizes R(f ). We introduce the following definition (also referred to as classification calibrated in [1]): Definition 2.1 Consider Ψc (f ) in (4). We say that the formulation is admissible (classification calibrated) on a closed set Ω ⊆ [−∞, ∞]K if the following conditions hold: ∀c, Ψc (·) : Ω → (−∞, ∞] is bounded below and continuous; ∩c {f : Ψc (f ) < ∞} is ∗ ∗ non-empty and dense in Ω; ∀q, if L(q, f ∗ ) = inf f L(q, f ), then fc = supk fk implies qc = supk qk . Since we allow Ψc (f ) = ∞, we use the convention that qc Ψc (f ) = 0 when qc = 0 and Ψc (f ) = ∞. The following result relates the approximate minimization of the Ψ risk to the approximate minimization of classification error: Theorem 2.1 Let B be the set of all Borel measurable functions. For a closed set Ω ⊂ [−∞, ∞]K , let BΩ = {f ∈ B : ∀x, f (x) ∈ Ω}. If Ψc (·) is admissible on Ω, then for a Borel measurable distribution, R(f ) → inf g∈BΩ R(g) implies (f ) → inf g∈B (g). Proof Sketch. First we show that the admissibility implies that ∀ > 0, ∃δ > 0 such that ∀q and x: inf {L(q, f ) : fc = sup fk } ≥ inf L(q, g) + δ. (6) qc ≤supk qk − g∈Ω k m If (6) does not hold, then ∃ > 0, and a sequence of (c , f m , q m ) with f m ∈ Ω such that m m m m fcm = supk fk , qcm ≤ supk qk − , and L(q m , f m ) − inf g∈Ω L(q m , g) → 0. Taking a limit point of (cm , f m , q m ), and using the continuity of Ψc (·), we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid. Now we consider a vector function f (x) ∈ ΩB . Let q(x) = P (·|x). Given X, if P (Y = T (f (X))|X) ≥ P (Y = T (q(X))|X)+ , then equation (6) implies that L(q(X), f (X)) ≥ inf g∈Ω L(q(X), g) + δ. Therefore (f ) − inf (g) =EX [P (Y = T (q(X))|X) − P (Y = T (f (X))|X)] g∈B ≤ + EX I(P (Y = T (q(X))|X) − P (Y = T (f (X))|X) > ) LX (q(X), f (X)) − inf g∈BΩ LX (q(X), g) ≤ + EX δ R(f ) − inf g∈BΩ R(g) = + . δ In the above derivation we use I to denote the indicator function. Since and δ are arbitrary, we obtain the theorem by letting → 0. 2 Clearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that with appropriately chosen function class Cn , approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this forms have been very well-studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [4, 7, 8, 10, 11, 1], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes Cn (C1 ⊂ C2 ⊂ · · · ) such that ∪n Cn is dense in the set of Borel measurable functions. The set Cn has the effect of regularization, which ensures that ˆ ˆ P R(fn ) ≈ inf f ∈Cn R(f ). It follows that as n → ∞, R(fn ) → inf f ∈B R(f ). Theorem 2.1 ˆ P then implies that (fn ) → inf f ∈B (f ). The purpose of this paper is not to study similar learning bounds that relate approximate minimization of (3) to the approximate minimization of (4). See [2] for a recent investigation. We shall focus on the choices of Ψ that lead to admissible formulations. We pay special attention to the case that each Ψc (f ) is a convex function of f , so that the resulting formulation becomes computational more tractable. Instead of working with the general form of Ψc in (4), we focus on two specific choices listed in the next two sections. 3 Unconstrained formulations We consider unconstrained formulation with the following choice of Ψ: K Ψc (f ) = φ(fc ) + s t(fk ) , (7) k=1 where φ, s and t are appropriately chosen functions that are continuously differentiable. The first term, which has a relatively simple form, depends on the label c. The second term is independent of the label, and can be regarded as a normalization term. Note that this function is symmetric with respect to components of f . This choice treats all potential classes equally. It is also possible to treat different classes differently (e.g. replacing φ(fc ) by φc (fc )), which can be useful if we associate different classification loss to different kinds of errors. 3.1 Optimality equation and probability model Using (7), the conditional true risk (5) can be written as: K L(q, f ) = K qc φ(fc ) + s t(fc ) . c=1 c=1 In the following, we study the property of the optimal vector f ∗ that minimizes L(q, f ) for a fixed q. Given q, the optimal solution f ∗ of L(q, f ) satisfies the following first order condition: ∗ ∗ qc φ (fc ) + µf ∗ t (fc ) = 0 (c = 1, . . . , K). (8) where quantity µf ∗ = s ( K k=1 ∗ t(fk )) is independent of k. ∗ Clearly this equation relates qc to fc for each component c. The relationship of q and f ∗ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with Ψ given by (7). The following result presents a simple criterion to check admissibility. We skip the proof for simplicity. Most of our examples satisfy the condition. Proposition 3.1 Consider (7). Assume Φc (f ) is continuous on [−∞, ∞]K and bounded below. If s (u) ≥ 0 and ∀p > 0, pφ (f ) + t (f ) = 0 has a unique solution fp that is an increasing function of p, then the formulation is admissible. If s(u) = u, the condition ∀p > 0 in Proposition 3.1 can be replaced by ∀p ∈ (0, 1). 3.2 Decoupled formulations We let s(u) = u in (7). The optimality condition (8) becomes ∗ ∗ qc φ (fc ) + t (fc ) = 0 (c = 1, . . . , K). (9) This means that we have K decoupled equalities, one for each fc . This is the simplest and in the author’s opinion, the most interesting formulation. Since the estimation problem in ˆ (3) is also decoupled into K separate equations, one for each component of fn , this class of methods are computationally relatively simple and easy to parallelize. Although this method seems to be preferable for multi-category problems, it is not the most efficient way for two-class problem (if we want to treat the two classes in a symmetric manner) since we have to solve two separate equations. We only need to deal with one equation in (1) due to the fact that an effective constraint f1 + f2 = 0 can be used to reduce the number of equations. This variable elimination has little impact if there are many categories. In the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship of the optimal optimizer function f∗ (q) and the conditional probability q. For simplicity, we focus on the choice φ(u) = −u. 3.2.1 φ(u) = −u and t(u) = eu ∗ We obtain the following probability model: qc = efc . This formulation is closely related K to the maximum-likelihood estimate with conditional model qc = efc / k=1 efk (logistic regression). In particular, if we choose a function class such that the normalization condiK tion k=1 efk = 1 holds, then the two formulations are identical. However, they become different when we do not impose such a normalization condition. Another very important and closely related formulation is the choice of φ(u) = − ln u and t(u) = u. This is an extension of maximum-likelihood estimate with probability model qc = fc . The resulting method is identical to maximum-likelihood if we choose our function class such that k fk = 1. However, the formulation also allows us to use function classes that do not satisfy the normalization constraint k fk = 1. Therefore this method is more flexible. 3.2.2 φ(u) = −u and t(u) = ln(1 + eu ) This version uses binary logistic regression loss, and we have the following probability ∗ model: qc = (1 + e−fc )−1 . Again this is an unnormalized model. 1 3.2.3 φ(u) = −u and t(u) = p |u|p (p > 1) ∗ ∗ We obtain the following probability model: qc = sign(fc )|fc |p−1 . This means that at the ∗ ∗ solution, fc ≥ 0. One may modify it such that we allow fc ≤ 0 to model the condition probability qc = 0. 3.2.4 φ(u) = −u and t(u) = 1 p max(u, 0)p (p > 1) ∗ In this probability model, we have the following relationship: qc = max(fc , 0)p−1 . The ∗ equation implies that we allow fc ≤ 0 to model the conditional probability qc = 0. Therefore, with a fixed function class, this model is more powerful than the previous one. How∗ ever, at the optimal solution, fc ≤ 1. This requirement can be further alleviated with the following modification. 3.2.5 φ(u) = −u and t(u) = 1 p min(max(u, 0)p , p(u − 1) + 1) (p > 1) In this probability model, we have the following relationship at the exact solution: qc = c min(max(f∗ , 0), 1)p−1 . Clearly this model is more powerful than the previous model since ∗ the function value fc ≥ 1 can be used to model qc = 1. 3.3 Coupled formulations In the coupled formulation with s(u) = u, the probability model can be normalized in a certain way. We list a few examples. 3.3.1 φ(u) = −u, and t(u) = eu , and s(u) = ln(u) This is the standard logistic regression model. The probability model is: K ∗ qc (x) = exp(fc (x))( ∗ exp(fc (x)))−1 . c=1 The right hand side is always normalized (sum up to 1). Note that the model is not continuous at infinities, and thus not admissible in our definition. However, we may consider the region Ω = {f : supk fk = 0}, and it is easy to check that this model is admissible in Ω. Ω Let fc = fc − supk fk ∈ Ω, then f Ω has the same decision rule as f and R(f ) = R(f Ω ). Therefore Theorem 2.1 implies that R(f ) → inf g∈B R(g) implies (f ) → inf g∈B (g). 1 3.3.2 φ(u) = −u, and t(u) = |u|p , and s(u) = p |u|p/p (p, p > 1) The probability model is: K ∗ ∗ ∗ |fk (x)|p )(p−p )/p sign(fc (x))|fc (x)|p −1 . qc (x) = ( k=1 We may replace t(u) by t(u) = max(0, u)p , and the probability model becomes: K qc (x) = ( ∗ ∗ max(fk (x), 0)p )(p−p )/p max(fc (x), 0)p −1 . k=1 These formulations do not seem to have advantages over the decoupled counterparts. Note that if we let p → 1, then the sum of the p p -th power of the right hand side → 1. In a −1 way, this means that the model is normalized in the limit of p → 1. 4 Constrained formulations As pointed out, one may impose constraints on possible choices of f . We may impose such a condition when we specify the function class Cn . However, for clarity, we shall directly impose a condition into our formulation. If we impose a constraint into (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of binary large-margin method (1) to multi-category case. The choice given below is motivated by [5], where an extension of SVM was proposed. We use a risk formulation that is different from (7), and for simplicity, we will consider linear equality constraint only: K Ψc (f ) = φ(−fk ), s.t. f ∈ Ω, (10) k=1,k=c where we define Ω as: K Ω = {f : fk = 0} ∪ {f : sup fk = ∞}. k k=1 We may interpret the added constraint as a restriction on the function class Cn in (3) such that every f ∈ Cn satisfies the constraint. Note that with K = 2, this leads to the usually binary large margin method. Using (10), the conditional true risk (5) can be written as: K (1 − qc )φ(−fc ), L(q, f ) = s.t. f ∈ Ω. (11) c=1 The following result provides a simple way to check the admissibility of (10). Proposition 4.1 If φ is a convex function which is bounded below and φ (0) < 0, then (10) is admissible on Ω. Proof Sketch. The continuity condition is straight-forward to verify. We may also assume that φ(·) ≥ 0 without loss of generality. Now let f achieves the minimum of L(q, ·). If fc = ∞, then it is clear that qc = 1 and thus qk = 0 for k = c. This implies that for k = c, φ(−fk ) = inf f φ(−f ), and thus fk < 0. If fc = supk fk < ∞, then the constraint implies fc ≥ 0. It is easy to see that ∀k, qc ≥ qk since otherwise, we must have φ(−fk ) > φ(−fc ), and thus φ (−fk ) > 0 and φ (−fc ) < 0, implying that with sufficient small δ > 0, φ(−(fk + δ)) < φ(−fk ) and φ(−(fc − δ)) < φ(−fc ). A contradiction. 2 Using the above criterion, we can convert any admissible convex φ for the binary formulation (1) into an admissible multi-category classification formulation (10). In [5] the special case of SVM (with loss function φ(u) = max(0, 1 − u)) was studied. The authors demonstrated the admissibility by direct calculation, although no results similar to Theorem 2.1 were established. Such a result is needed to prove consistency. The treatment presented here generalizes their study. Note that for the constrained formulation, it is more difficult to relate fc at the optimal solution to a probability model, since such a model will have a much more complicated form compared with the unconstrained counterpart. 5 Conclusion In this paper we proposed a family of risk minimization methods for multi-category classification problems, which are natural extensions of binary large margin classification methods. We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit. Two specific forms of risk minimization were proposed and examples were given to study the induced probability models. As an implication of this work, we see that it is possible to obtain consistent (conditional) density estimation using various non-maximum likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near zero density may cause serious robustness problems at least in theory. References [1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003. [2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003. [3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000. With discussion. [4] W. Jiang. Process consistency for adaboost. The Annals of Statistics, 32, 2004. with discussion. [5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of American Statistical Association, 2002. accepted. [6] Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, pages 259–275, 2002. [7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. with discussion. [8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003. [9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002. [11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statitics, 32, 2004. with discussion.
4 0.64315641 120 nips-2003-Locality Preserving Projections
Author: Xiaofei He, Partha Niyogi
Abstract: Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) – a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is defined everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.
5 0.63046438 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis
Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1
6 0.62198627 126 nips-2003-Measure Based Regularization
7 0.61969411 128 nips-2003-Minimax Embeddings
8 0.60078901 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
9 0.59719568 113 nips-2003-Learning with Local and Global Consistency
10 0.59698474 3 nips-2003-AUC Optimization vs. Error Rate Minimization
11 0.59320986 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
12 0.5921725 78 nips-2003-Gaussian Processes in Reinforcement Learning
13 0.59203285 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
14 0.59190261 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
15 0.59188753 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
16 0.59145343 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
17 0.59088159 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
18 0.59040415 112 nips-2003-Learning to Find Pre-Images
19 0.59007502 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
20 0.58925986 107 nips-2003-Learning Spectral Clustering