nips nips2004 nips2004-57 knowledge-graph by maker-knowledge-mining

57 nips-2004-Economic Properties of Social Networks


Source: pdf

Author: Sham M. Kakade, Michael Kearns, Luis E. Ortiz, Robin Pemantle, Siddharth Suri

Abstract: We examine the marriage of recent probabilistic generative models for social networks with classical frameworks from mathematical economics. We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. Our findings are a mixture of formal analysis, simulation, and experiments on an international trade data set from the United Nations. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. [sent-4, score-0.638]

2 The work of Arrow and Debreu [1954], who established equilibrium existence in a very general commodities exchange model, was certainly one of the high points of this continuing line of inquiry. [sent-7, score-0.307]

3 While there has been relatively recent interest in network models for interaction in economics (see Jackson [2003] for a good review), it was only quite recently that a network or graph-theoretic model that generalizes the classical Arrow-Debreu and Fisher models was introduced (Kakade et al. [sent-9, score-0.288]

4 In this model, the edges in a network over individual consumers (for example) represent those pairs of consumers that can engage in direct trade. [sent-11, score-0.269]

5 In addition, variations in the price of a good can arise due to the topology of the network: certain individuals may be relatively favored or cursed by their position in the graph. [sent-13, score-0.526]

6 In this paper we examine classical economic exchange models in the modern light of social network theory. [sent-16, score-0.35]

7 We are particularly interested in the interaction between the statistical structure of the underlying network and the variation in prices at equilibrium. [sent-17, score-0.364]

8 Closely related work to ours is that of Kranton and Minehart [2001], which also considers networks of buyers and sellers, though they focus more on the economics of network formation. [sent-19, score-0.426]

9 Many of our results are based on a powerful new local approximation method for global equilibrium prices: we show that in the preferential attachment model, prices computed from only local regions of a network yield strikingly good estimates of the global prices. [sent-20, score-0.912]

10 We assume that there are gj units of good j in the market, and that each good j is to be sold at some price pj. [sent-24, score-0.553]

11 Each consumer i has a cash endowment ei, to be used to purchase goods in a manner that maximizes the consumer's utility. [sent-25, score-0.221]

12 A set of prices {pj} and consumption plans {xij} constitutes an equilibrium if the following two conditions hold: 1. [sent-29, score-0.501]

13 It turns out that such an equilibrium always exists if each good j has a consumer who derives nonzero utility from good j — that is, uij > 0 for some i (see Gale [1960]). [sent-37, score-0.408]
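
As a concrete illustration of the two equilibrium conditions in the linear-utility Fisher model described above, here is a minimal sketch of a checker; the function name, array layout, and tolerance are our own choices, not the paper's.

import numpy as np

def is_fisher_equilibrium(u, e, g, p, x, tol=1e-6):
    # u: (buyers, goods) utility coefficients u_ij
    # e: (buyers,) cash endowments e_i
    # g: (goods,) supplies g_j
    # p: (goods,) candidate prices p_j (assumed positive)
    # x: (buyers, goods) candidate consumption plan x_ij

    # Condition 1 (market clearance): every unit of every good is sold,
    # and every buyer spends their entire cash endowment.
    goods_clear = np.allclose(x.sum(axis=0), g, atol=tol)
    budgets_spent = np.allclose(x @ p, e, atol=tol)

    # Condition 2 (optimality): each buyer purchases only goods that
    # maximize "bang per buck" u_ij / p_j at the given prices.
    bang_per_buck = u / p
    best = bang_per_buck.max(axis=1, keepdims=True)
    only_best = np.all((x <= tol) | np.isclose(bang_per_buck, best, atol=tol))

    return bool(goods_clear and budgets_spent and only_best)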

14 In the graphical Fisher model, we desire to capture the fact that each good may have multiple vendors or sellers, and that individual buyers may have access only to some, but not all, of these sellers. [sent-41, score-0.289]

15 Examples include the fact that consumers generally purchase their groceries from local markets, that social connections play a major role in business transactions, and that securities regulations prevent certain pairs of parties from engaging in stock trades. [sent-43, score-0.241]

16 Without loss of generality, we assume that each seller j sells only one of the available goods. [sent-44, score-0.414]

17 Let G be a bipartite graph, where buyers and sellers are represented as vertices, and all edges are between a buyer-seller pair. [sent-46, score-0.695]

18 The semantics of the graph are as follows: if there is an edge from buyer i to seller j, then buyer i is permitted to purchase from seller j. [sent-47, score-1.491]

19 Note that if buyer i is connected to two sellers of the same good, he will always choose to purchase from the cheaper source, since his utility is identical for both sellers (they sell the same good). [sent-48, score-0.953]
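
A sketch of this cheapest-source demand rule in the single-good setting used later, assuming for illustration that each buyer has a unit cash endowment; the adjacency format and names are ours, not the paper's.

def buyer_demands(buyer_to_sellers, prices, endowment=1.0):
    # buyer_to_sellers: dict mapping each buyer to the sellers it can reach
    # prices: dict mapping each seller to its current price
    # Each buyer spends its endowment at the cheapest accessible seller
    # (at equilibrium ties may be split arbitrarily; we pick one here).
    plan = {}
    for buyer, sellers in buyer_to_sellers.items():
        cheapest = min(sellers, key=lambda s: prices[s])
        plan[buyer] = (cheapest, endowment / prices[cheapest])  # (seller, quantity)
    return plan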

20 One of the most interesting features of this model is the fact that at equilibrium, significant price variations can appear solely due to structural properties of the underlying network. [sent-51, score-0.478]

21 3 Generative Models for Social Networks: For simplicity, in the sequel we will consider economies in which the numbers of buyers and sellers are equal. [sent-53, score-0.557]

22 We will also restrict attention to the case in which all sellers sell the same good. [sent-54, score-0.286]

23 The simplest generative model for the bipartite graph G might be the random graph, in which each edge between a buyer i and a seller j is included independently with probability p. [sent-55, score-0.919]

24 Many researchers have sought more realistic models of social network formation, in order to explain observed phenomena such as heavy-tailed degree distributions. [sent-57, score-0.264]

25 We now describe a slight variant of the preferential attachment model (see Mitzenmacher [2003]) for the case of a bipartite graph. [sent-58, score-0.388]

26 We start with a graph in which one buyer is connected to one seller. [sent-59, score-0.363]

27 At each time step, we add one buyer and one seller as follows. [sent-60, score-0.699]

28 With probability α, the buyer is connected to a seller in the existing graph uniformly at random; and with probability 1 − α, the buyer is connected to a seller chosen in proportion to the degree of the seller (preferential attachment). [sent-61, score-2.036]

29 Simultaneously, a seller is attached in a symmetric manner: with probability α the seller is connected to a buyer chosen uniformly at random, and with probability 1 − α the seller is connected under preferential attachment. [sent-62, score-1.77]

30 The parameter α in this model thus allows us to move between a pure preferential attachment model (α = 0), and a model closer to classical random graph theory (α = 1), in which new parties are connected to random extant parties. [sent-63, score-0.418]

31 We thus will also consider a variant of this model in which at each time step, a new seller is still attached to exactly one extant buyer, while each new buyer is connected to ν > 1 extant sellers. [sent-65, score-0.85]

32 The procedure for edge selection is as outlined above, with the modification that the ν new edges of the buyer are added without replacement — meaning that we resample so that each buyer gets attached to exactly ν distinct sellers. [sent-66, score-0.65]
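
A sketch of this generative process (our own code, not the authors'); how ties are broken and whether the new seller may attach to the buyer added in the same step are assumptions left open by the description above.

import random

def bipartite_pref_attach(n, alpha, nu=1, seed=None):
    # Returns a set of (buyer, seller) edges; buyers and sellers are indexed
    # 0..n-1 by the time step at which they were introduced.
    rng = random.Random(seed)
    edges = {(0, 0)}                         # one buyer connected to one seller
    buyer_deg, seller_deg = {0: 1}, {0: 1}

    def pick(deg):
        # with prob. alpha: uniform over extant nodes;
        # with prob. 1 - alpha: proportional to degree (preferential attachment)
        nodes = list(deg)
        if rng.random() < alpha:
            return rng.choice(nodes)
        return rng.choices(nodes, weights=[deg[v] for v in nodes])[0]

    for t in range(1, n):
        # the new buyer t attaches to nu distinct extant sellers (resample until distinct)
        targets = set()
        while len(targets) < min(nu, len(seller_deg)):
            targets.add(pick(seller_deg))
        for s in targets:
            edges.add((t, s))
            seller_deg[s] += 1
        buyer_deg[t] = len(targets)

        # the new seller t attaches to exactly one extant buyer, symmetrically
        b = pick(buyer_deg)
        edges.add((b, t))
        buyer_deg[b] += 1
        seller_deg[t] = 1
    return edges

For example, bipartite_pref_attach(250, alpha=0.0, nu=2) generates a larger economy of the same (α = 0, ν = 2) type shown in Figure 1.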

33 The main purpose of the introduction of ν is to have a model capable of generating highly cyclical (non-tree) networks, while having just a single parameter that can “tune” the asymmetry between the (number of) opportunities for buyers and sellers. [sent-68, score-0.301]

34 There are also economic motivations: it is natural to imagine that new sellers of the good arise only upon obtaining their first customer, but that new buyers arrive already aware of several alternative sellers. [sent-69, score-0.63]

35 We will use n to denote the number of buyers and the number of sellers, so the network has 2n vertices. [sent-71, score-0.326]

36 Figure 1 and its caption provide an example of a network generated by this model, along with a discussion of its equilibrium properties. [sent-72, score-0.327]

37 We first present a rather intuitive “frontier” theorem, which implies a scheme in which we can find upper and lower bounds on the equilibrium prices using only local computations. [sent-75, score-0.544]

38 We note that α = 1 still does not exactly produce the Erdos-Renyi model due to the incremental nature of the network generation: early buyers and sellers are still more likely to have higher degree. [sent-78, score-0.583]

39 Figure 1: Sample network generated by the bipartite (α = 0, ν = 2)-model. [sent-99, score-0.228]

40 Buyers and sellers are labeled by ‘B’ or ‘S’ respectively, followed by an index indicating the time step at which they were introduced to the network. [sent-100, score-0.257]

41 The solid edges in the figure show the exchange subgraph — those pairs of buyers and sellers who actually exchange currency and goods at equilibrium. [sent-101, score-0.828]

42 Each seller is labeled with the price they charge at equilibrium. [sent-103, score-0.892]

43 Note that while there appears to be a correlation between seller degree and price, it is far from a deterministic relation, a topic we shall examine later. [sent-107, score-0.558]

44 The subgraph G' induced on a vertex set V consists of all edges between buyers and sellers in V that are also in G. [sent-108, score-0.55]

45 We say that G' has a buyer (respectively, seller) frontier if on every (simple) path in G from a node in V to a node outside of V, the last node in V on this path is a buyer (respectively, seller). [sent-109, score-0.644]

46 Theorem 1 (Frontier Bound) If the subgraph G' induced on V has a seller (respectively, buyer) frontier, then the equilibrium price of any good j in the induced economy on V is a lower bound (respectively, upper bound) on the equilibrium price of j in G. [sent-110, score-2.076]

47 Theorem 1 implies a simple price upper bound: the price commanded by any seller j is bounded by its degree d. [sent-111, score-1.544]

48 Let G' be the immediate neighborhood of j (which is j and its d buyers); then the equilibrium price in G' is just d, since all d buyers are forced to buy from seller j. [sent-113, score-1.379]

49 This provides an upper bound since G' has a buyer frontier. [sent-114, score-0.353]
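
The k = 1 case of this bound is trivial to compute; a sketch, assuming unit buyer endowments as in the argument above.

def degree_price_upper_bounds(edges):
    # edges: iterable of (buyer, seller) pairs
    # Each seller's equilibrium price is bounded by its degree, because its
    # immediate neighborhood (the seller plus its buyers) has a buyer frontier.
    bound = {}
    for _, seller in edges:
        bound[seller] = bound.get(seller, 0) + 1
    return bound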

50 Since it can be shown that the degree distribution obeys a power law in the bipartite (α, ν)-model, we have an upper bound on the cumulative price distribution. [sent-115, score-0.935]

51 Theorem 2 In the bipartite (α, ν)-model, the proportion of sellers with price greater than w is O(w^(−1/β)). [sent-117, score-0.88]

52 We do not yet have such a closed-form lower bound on the cumulative price distribution. [sent-119, score-0.572]

53 However, as we shall see in Section 5, the price distributions seen in large simulation results do indeed show power-law behavior. [sent-120, score-0.511]

54 Interestingly, this occurs despite the fact that degree is a poor predictor of individual seller price. [sent-121, score-0.525]

55 Another quantity of interest is what we might call price variation — the ratio of the price of the richest seller to that of the poorest seller. [sent-127, score-1.436]

56 Theorem 3 In the bipartite (α, ν)-model, if α(ν² + 1) < 1, then the ratio of the maximum price to the minimum price scales with the number of buyers n as Ω(n^((2−α(ν²+1))/(1+ν))). For the simplest case in which α = 0 and ν = 1, this lower bound is just Ω(n). [sent-129, score-1.401]
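
As a worked example of the bound: with α = 0 and ν = 2, the condition α(ν² + 1) < 1 holds trivially and the exponent is (2 − 0)/(1 + 2) = 2/3, so the maximum-to-minimum price ratio grows at least as fast as n^(2/3).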

57 We conclude our theoretical results with a remark on the price variation in the Erdos-Renyi (random graph) model. [sent-131, score-0.544]

58 First, let us present a condition for there to be no price variation. [sent-132, score-0.478]

59 Theorem 4 A necessary and sufficient condition for there to be no price variation, i.e. for all prices to be equal to 1, is that for all sets of vertices S, |N(S)| ≥ |S|, where N(S) is the set of vertices connected by an edge to some vertex in S. [sent-133, score-0.728]
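
Since the sequel assumes equally many buyers and sellers, this condition can be tested via Hall's marriage theorem: |N(S)| ≥ |S| for every vertex set S holds exactly when the bipartite graph has a perfect matching. A sketch using Kuhn's augmenting-path algorithm (our own reduction and code, not the paper's):

def has_no_price_variation(buyer_to_sellers, sellers):
    # buyer_to_sellers: dict buyer -> list of sellers it can purchase from
    # sellers: list of all sellers (assumed equal in number to the buyers)
    # Returns True iff a perfect matching exists, i.e. |N(S)| >= |S| for all S.
    match = {s: None for s in sellers}          # seller -> matched buyer

    def try_augment(buyer, seen):
        for s in buyer_to_sellers[buyer]:
            if s in seen:
                continue
            seen.add(s)
            if match[s] is None or try_augment(match[s], seen):
                match[s] = buyer
                return True
        return False

    return all(try_augment(b, set()) for b in buyer_to_sellers)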

60 In other words, in the Erdos-Renyi model, there is no variation in price — a stark contrast to the preferential attachment results. [sent-136, score-0.787]

61 We note that it was only the recent development of this algorithm and related ones that made possible the simulations described here (involving hundreds of buyers and sellers in highly cyclical graphs). [sent-140, score-0.529]

62 Price and Degree Distributions: The first (leftmost) panel of Figure 2 shows empirical cumulative price and degree distributions on a loglog scale, averaged over 25 networks drawn according to the bipartite (α = 0. [sent-143, score-0.898]

63 The cumulative degree distribution is shown as a dotted line, where the y-axis represents the fraction of the sellers with degree greater than or equal to d, and the degree d is plotted on the x-axis. [sent-145, score-0.627]

64 Similarly, the solid curve plots the fraction of sellers with price greater than some value w, where the price w is shown on the x-axis. [sent-146, score-1.233]
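
A sketch of how such an empirical tail curve can be computed (a helper of our own, not the authors' code):

import numpy as np

def fraction_above(values, thresholds):
    # Fraction of sellers whose price (or degree) exceeds each threshold w.
    values = np.sort(np.asarray(values, dtype=float))
    above = len(values) - np.searchsorted(values, thresholds, side="right")
    return above / len(values)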

65 Though a natural conjecture from the plots is that the price of a seller is essentially determined by its degree, below we will see that the degree is a rather poor predictor of an individual seller's price, while more complex (but still local) properties are extremely accurate predictors. [sent-149, score-1.437]

66 Perhaps the most interesting finding is that the tail of the price distribution looks linear on the loglog scale, i.e. it too obeys a power law. [sent-150, score-0.478]

67 As discussed in the Introduction, Pareto's original observation was that the distribution of wealth (which corresponds to seller price in our model) in societies obeys a power law, an observation that has been borne out in many studies of Western economies. [sent-156, score-0.975]

68 Here we have a power-law wealth distribution arising from the combination of certain natural statistical properties of the network and classical theories of economic equilibrium. [sent-159, score-0.263]

69 Bounds via Local Computations: Recall that Theorem 1 suggests a scheme by which we can do only local computations to approximate the global equilibrium price for any seller. [sent-160, score-0.8]

70 More precisely, for some seller j, consider the subgraph which contains all nodes that are within distance k of j. [sent-161, score-0.466]

71 In our bipartite setting, for k odd, this subgraph has a buyer frontier, and for k even, this subgraph has a seller frontier, since we start from a seller. [sent-162, score-0.948]

72 Hence, the equilibrium computation on the odd k (respectively, even k) subgraph will provide an upper (respectively, lower) bound. [sent-163, score-0.33]

73 This provides a heuristic by which one can examine the equilibrium properties of small regions of the graph without having to do expensive global equilibrium computations. [sent-164, score-0.511]
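
A sketch of the local step, collecting the distance-k ball around a seller with breadth-first search; the equilibrium restricted to this ball would still be computed with an equilibrium solver, which is not shown here.

from collections import deque

def distance_k_ball(adj, seller, k):
    # adj: dict mapping every node (buyer or seller) to its list of neighbors
    # Returns all nodes within graph distance k of `seller`.  Computing the
    # equilibrium restricted to these nodes upper-bounds the seller's true
    # price for odd k (buyer frontier) and lower-bounds it for even k.
    dist = {seller: 0}
    queue = deque([seller])
    while queue:
        v = queue.popleft()
        if dist[v] == k:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)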

74 The second panel in Figure 2 shows how rapidly the local equilibrium computations converge to the true global equilibrium prices as a function of k, and also how this convergence is influenced by n. [sent-168, score-0.833]

75 The value of n is given on the x-axis; the average errors (over 5 trials for each value of k and n) in the local equilibrium computations are given on the y-axis; and there is a separate plot for each of 4 values for k. [sent-170, score-0.353]

76 However, for the crudest approximation k = 1, which corresponds exactly to using seller degree as a proxy for price, we can see that this performs rather poorly. [sent-174, score-0.525]

77 Parameter Dependencies: We now provide a brief examination of how price variation depends on the parameters of the bipartite (α, ν)-model. [sent-176, score-0.689]

78 The third panel of Figure 2 shows the ratio of maximum to minimum price as a function of n (averaged over 25 trials) on a loglog scale. [sent-178, score-0.573]

79 The overall message is that for small values of ν, price variation increases rapidly with the economy size n in preferential attachment. [sent-190, score-0.737]

80 Also plotted is the ratio of maximum to minimum price in a graph (where n = 250). [sent-192, score-0.521]

81 Here, each point represents the maximum to minimum price ratio in a specific network generated by our model. [sent-193, score-0.561]

82 Here we see that in general, increasing α dramatically decreases price variation (note that the price ratio is plotted on a log scale). [sent-195, score-1.022]

83 Furthermore, the data for ν = 1 shows much larger variation, suggesting that a larger value of ν also has the effect of equalizing buyer opportunities and therefore prices. [sent-197, score-0.314]

84 The statistics division of the United Nations makes available extensive data sets detailing the amounts of trade between major sovereign nations (see http://unstats. [sent-199, score-0.224]

85 For each of the 70 largest nations (in terms of total trade), we include connections from that nation to each of its top k trading partners, for some integer k > 1. [sent-207, score-0.272]

86 Note that each nation will have degree at least k, but as we shall see, some nations will have much higher degree, since they frequently occur as a top k partner of other nations. [sent-209, score-0.391]

87 To further cast this extracted network into the bipartite setting we have been considering, we ran many trials in which each nation is randomly assigned a role as either a buyer or seller (which are symmetric roles), and then computed the equilibrium prices of the resulting network economy. [sent-210, score-1.58]
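
A sketch of this construction (the input format is hypothetical, and parsing the actual UN data set is not shown):

import random

def trade_network(trade, k=3, n_largest=70, seed=None):
    # trade: dict mapping (nation_a, nation_b) -> trade volume (hypothetical format)
    rng = random.Random(seed)

    # keep the n_largest nations by total trade
    totals = {}
    for (a, b), v in trade.items():
        totals[a] = totals.get(a, 0) + v
        totals[b] = totals.get(b, 0) + v
    nations = sorted(totals, key=totals.get, reverse=True)[:n_largest]

    # connect each nation to its top-k trading partners among the kept nations
    edges = set()
    for a in nations:
        partners = sorted((b for b in nations if b != a),
                          key=lambda b: trade.get((a, b), 0) + trade.get((b, a), 0),
                          reverse=True)
        edges.update(frozenset((a, b)) for b in partners[:k])

    # one trial of the bipartite cast: random buyer/seller roles
    roles = {nation: rng.choice(("buyer", "seller")) for nation in nations}
    return edges, roles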

88 The upper plot shows the average equilibrium price for each nation, where the nations have been sorted by this average price. [sent-213, score-0.946]

89 We can immediately see that there is dramatic price variation due to the network structure; while many nations suffer equilibrium prices well under $1, the most topologically favored nations command prices of $4. [sent-214, score-1.671]

90 The lower plot of the leftmost panel shows a scatterplot of a nation’s degree (x-axis) and its average equilibrium price (y-axis). [sent-223, score-1.001]

91 We see that while there is generally a monotonic relationship, at smaller degree values there can be significant price variation (on the order of $0. [sent-224, score-0.655]

92 As suggested by the theory and simulations, increasing the overall connectivity of each party radically reduces price variation, with the highest price being just $1. [sent-227, score-0.981]

93 Interestingly, the identities of the nations commanding the highest prices (in order, U. [sent-229, score-0.376]

94 The lower plot shows that the relationship between degree and price divides the population into “have” (degree above 10) and “have not” (degree below 10) components. [sent-232, score-0.641]

95 The preponderance of European nations among the top prices suggests our final experiment.

[Figure 3 panels: "UN data network, top 3 links, full set of nations"; "UN data network, top 10 links, full set of nations"; "UN data network, top 3 links, EU collapsed nation set". Axes: price rank, average degree.]

98 Figure 3: See text for descriptions. [sent-244, score-0.222]

99 This merged vertex has much higher degree than any of its original constituents and can be viewed as an (extremely) idealized experiment in the economic power that might be wielded by a truly unified Europe. [sent-248, score-0.245]

100 The rightmost panel of Figure 3 provides the results, where we show the relative prices and the degree-price scatterplot for the 35 largest nations. [sent-249, score-0.304]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('price', 0.478), ('seller', 0.414), ('buyer', 0.285), ('sellers', 0.257), ('equilibrium', 0.244), ('buyers', 0.243), ('prices', 0.215), ('nations', 0.161), ('bipartite', 0.145), ('preferential', 0.143), ('degree', 0.111), ('economic', 0.105), ('attachment', 0.1), ('goods', 0.1), ('nation', 0.086), ('network', 0.083), ('frontier', 0.074), ('social', 0.07), ('consumers', 0.068), ('economics', 0.068), ('variation', 0.066), ('exchange', 0.063), ('trade', 0.063), ('economies', 0.057), ('wealth', 0.054), ('fisher', 0.054), ('panel', 0.052), ('subgraph', 0.052), ('edges', 0.05), ('economy', 0.05), ('purchase', 0.05), ('market', 0.047), ('law', 0.046), ('graph', 0.043), ('devanur', 0.043), ('extant', 0.043), ('loglog', 0.043), ('consumer', 0.042), ('consumption', 0.042), ('kakade', 0.042), ('xij', 0.041), ('utility', 0.04), ('formation', 0.038), ('scatterplot', 0.037), ('cumulative', 0.037), ('connected', 0.035), ('bound', 0.034), ('upper', 0.034), ('theorem', 0.033), ('shall', 0.033), ('generative', 0.032), ('networks', 0.032), ('uij', 0.032), ('attached', 0.03), ('power', 0.029), ('plot', 0.029), ('asymmetries', 0.029), ('barabasi', 0.029), ('commanded', 0.029), ('cyclical', 0.029), ('endowment', 0.029), ('gale', 0.029), ('kranton', 0.029), ('markets', 0.029), ('opportunities', 0.029), ('pemantle', 0.029), ('sell', 0.029), ('siddharth', 0.029), ('vijay', 0.029), ('wellstudied', 0.029), ('classical', 0.029), ('united', 0.028), ('local', 0.028), ('leftmost', 0.027), ('computations', 0.027), ('trials', 0.025), ('un', 0.025), ('good', 0.025), ('trading', 0.025), ('suri', 0.025), ('sold', 0.025), ('robin', 0.025), ('command', 0.025), ('pareto', 0.025), ('parties', 0.025), ('party', 0.025), ('global', 0.023), ('forthcoming', 0.023), ('favored', 0.023), ('explanations', 0.023), ('acknowledges', 0.023), ('partners', 0.023), ('slopes', 0.023), ('lower', 0.023), ('respectively', 0.022), ('graphical', 0.021), ('unused', 0.021), ('netherlands', 0.021), ('obeys', 0.021), ('plots', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 57 nips-2004-Economic Properties of Social Networks

Author: Sham M. Kakade, Michael Kearns, Luis E. Ortiz, Robin Pemantle, Siddharth Suri

Abstract: We examine the marriage of recent probabilistic generative models for social networks with classical frameworks from mathematical economics. We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. Our findings are a mixture of formal analysis, simulation, and experiments on an international trade data set from the United Nations. 1

2 0.059233021 165 nips-2004-Semi-supervised Learning on Directed Graphs

Author: Dengyong Zhou, Thomas Hofmann, Bernhard Schölkopf

Abstract: Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach. 1

3 0.056987029 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

Author: Joris M. Mooij, Hilbert J. Kappen

Abstract: We introduce a computationally efficient method to estimate the validity of the BP method as a function of graph topology, the connectivity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network (“C. Elegans”). Although the method is restricted to pair-wise interactions, no local evidence (zero “biases”) and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphical models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF. 1

4 0.054245576 195 nips-2004-Trait Selection for Assessing Beef Meat Quality Using Non-linear SVM

Author: Juan Coz, Gustavo F. Bayón, Jorge Díez, Oscar Luaces, Antonio Bahamonde, Carlos Sañudo

Abstract: In this paper we show that it is possible to model sensory impressions of consumers about beef meat. This is not a straightforward task; the reason is that when we are aiming to induce a function that maps object descriptions into ratings, we must consider that consumers’ ratings are just a way to express their preferences about the products presented in the same testing session. Therefore, we had to use a special purpose SVM polynomial kernel. The training data set used collects the ratings of panels of experts and consumers; the meat was provided by 103 bovines of 7 Spanish breeds with different carcass weights and aging periods. Additionally, to gain insight into consumer preferences, we used feature subset selection tools. The result is that aging is the most important trait for improving consumers’ appreciation of beef meat. 1

5 0.051299755 24 nips-2004-Approximately Efficient Online Mechanism Design

Author: David C. Parkes, Dimah Yanovsky, Satinder P. Singh

Abstract: Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement efficient policies, and retain truth-revelation as an approximate Bayesian-Nash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. 1

6 0.041961815 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

7 0.041746963 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

8 0.038216736 177 nips-2004-Supervised Graph Inference

9 0.037794176 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

10 0.037440609 48 nips-2004-Convergence and No-Regret in Multiagent Learning

11 0.032980271 141 nips-2004-Optimal sub-graphical models

12 0.031549536 116 nips-2004-Message Errors in Belief Propagation

13 0.02987062 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

14 0.029537447 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

15 0.029129842 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

16 0.028905232 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

17 0.028707605 19 nips-2004-An Application of Boosting to Graph Classification

18 0.02807709 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning

19 0.027670728 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process

20 0.027067995 129 nips-2004-New Criteria and a New Algorithm for Learning in Multi-Agent Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.098), (1, -0.005), (2, 0.032), (3, -0.001), (4, -0.012), (5, 0.033), (6, -0.005), (7, 0.055), (8, -0.017), (9, -0.102), (10, -0.032), (11, -0.052), (12, 0.042), (13, -0.024), (14, -0.028), (15, 0.024), (16, -0.002), (17, -0.012), (18, 0.029), (19, -0.018), (20, 0.066), (21, -0.045), (22, 0.035), (23, -0.027), (24, 0.02), (25, -0.034), (26, 0.007), (27, 0.014), (28, -0.012), (29, -0.036), (30, -0.071), (31, 0.065), (32, 0.102), (33, 0.023), (34, -0.046), (35, 0.084), (36, 0.044), (37, -0.034), (38, 0.019), (39, -0.015), (40, 0.031), (41, -0.046), (42, -0.108), (43, 0.086), (44, -0.063), (45, 0.05), (46, -0.023), (47, 0.107), (48, 0.014), (49, -0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92811584 57 nips-2004-Economic Properties of Social Networks

Author: Sham M. Kakade, Michael Kearns, Luis E. Ortiz, Robin Pemantle, Siddharth Suri

Abstract: We examine the marriage of recent probabilistic generative models for social networks with classical frameworks from mathematical economics. We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. Our findings are a mixture of formal analysis, simulation, and experiments on an international trade data set from the United Nations. 1

2 0.5730477 141 nips-2004-Optimal sub-graphical models

Author: Mukund Narasimhan, Jeff A. Bilmes

Abstract: We investigate the problem of reducing the complexity of a graphical model (G, PG ) by finding a subgraph H of G, chosen from a class of subgraphs H, such that H is optimal with respect to KL-divergence. We do this by first defining a decomposition tree representation for G, which is closely related to the junction-tree representation for G. We then give an algorithm which uses this representation to compute the optimal H ∈ H. Gavril [2] and Tarjan [3] have used graph separation properties to solve several combinatorial optimization problems when the size of the minimal separators in the graph is bounded. We present an extension of this technique which applies to some important choices of H even when the size of the minimal separators of G are arbitrarily large. In particular, this applies to problems such as finding an optimal subgraphical model over a (k − 1)-tree of a graphical model over a k-tree (for arbitrary k) and selecting an optimal subgraphical model with (a constant) d fewer edges with respect to KL-divergence can be solved in time polynomial in |V (G)| using this formulation. 1 Introduction and Preliminaries The complexity of inference in graphical models is typically exponential in some parameter of the graph, such as the size of the largest clique. Therefore, it is often required to find a subgraphical model that has lower complexity (smaller clique size) without introducing a large error in inference results. The KL-divergence between the original probability distribution and the probability distribution on the simplified graphical model is often used to measure the impact on inference. Existing techniques for reducing the complexity of graphical models including annihilation and edge-removal [4] are greedy in nature and cannot make any guarantees regarding the optimality of the solution. This problem is NP-complete [9] and so, in general, one cannot expect a polynomial time algorithm to find the optimal solution. However, we show that when we restrict the problem to some sets of subgraphs, the optimal solution can be found quite quickly using a dynamic programming algorithm in time polynomial in the tree-width of the graph. 1.1 Notation and Terminology A graph G = (V, E) is said to be triangulated if every cycle of length greater than 3 has a chord. A clique of G is a non-empty set S ⊆ V such that {a, b} ∈ E for all ∗ This work was supported by NSF grant IIS-0093430 and an Intel Corporation Grant. {b, c, d} d {c, f, g} {b, c} {b, e, c} b c {f, c} {c, e} {e, c, f } g {b, e} a e f {a, b, e} Figure 1: A triangulated graph G and a junction-tree for G a, b ∈ S. A clique S is maximal if S is not properly contained in another clique. If α and β are non-adjacent vertices of G then a set of vertices S ⊆ V \ {α, β} is called an (α, β)-separator if α and β are in distinct components of G[V \ S]. S is a minimal (α, β)-separator if no proper subset of S is an (α, β)-separator. S is said to be a minimal separator if S is a minimal (α, β)-separator for some non adjacent a, b ∈ V . If T = (K, S) is a junction-tree for G (see [7]), then the nodes K of T correspond to the maximalcliques of G, while the links S correspond to minimal separators of G (We reserve the terms vertices/edges for elements of G, and nodes/links for the elements of T ). If G is triangulated, then the number of maximal cliques is at most |V |. For example, in the graph G shown in Figure 1, K = {{b, c, d} , {a, b, e} , {b, e, c} , {e, c, f } , {c, f, g}}. 
The links S of T correspond to minimal-separators of G in the following way. If Vi Vj ∈ S (where Vi , Vj ∈ K and hence are cliques of G), then Vi ∩ Vj = φ. We label each edge Vi Vj ∈ S with the set Vij = Vi ∩ Vj , which is a non-empty complete separator in G. The removal of any link Vi Vj ∈ S disconnects T into two subtrees which we denote T (i) and T (j) (chosen so that T (i) contains Vi ). We will let K(i) be the nodes of T (i) , and V (i) = ∪V ∈K (i) V be the set of vertices corresponding to the subtree T (i) . The junction tree property ensures that V (i) ∩ V (j) = Vi ∩ Vj = Vij . We will let G(i) be the subgraph induced by V (i) . A graphical model is a pair (G, P ) where P is the joint probability distribution for random variables X1 , X2 , . . . , Xn , and G is a graph with vertex set V (G) = {X1 , X2 , . . . , Xn } such that the separators in G imply conditional independencies in P (so P factors according to G). If G is triangulated, then the junction-tree algorithm can be used for exact inference in the probability distribution P . The complexity of this algorithm grows with the treewidth of G (which is one less than the size of the largest clique in G when G is triangulated). The growth is exponential when P is a discrete probability distribution, thus rendering exact inference for graphs with large treewidth impractical. Therefore, we seek another graphical model (H, PH ) which allows tractable inference (so H should have lower treewidth than G has). The general problem of finding a graphical model of tree-width at most k so as to minimize the KL-divergence from a specified probability distribution is NP complete for general k ([9]) However, it is known that this problem is solvable in polynomial time (in |V (G)|) for some special cases cases (such as when G has bounded treewidth or when k = 1 [1]). If (G, PG ) and (H, PH ) are graphical models, then we say that (H, PH ) is a subgraphical model of (G, PG ) if H is a spanning subgraph of G. Note in particular that separators in G are separators in H, and hence (G, PH ) is also a graphical model. 2 Graph Decompositions and Divide-and-Conquer Algorithms For the remainder of the paper, we will be assuming that G = (V, E) is some triangulated graph, with junction tree T = (K, S). As observed above, if Vi Vj ∈ S, then the removal {b, c, d} d {b, c} {b, e, c} b c c {c, f, g} {f, c} {e, c, f } g {b, e} a e e f {a, b, e} Figure 2: The graphs G(i) , G(j) and junction-trees T (i) and T (j) resulting from the removal of the link Vij = {c, e} of Vij = Vi ∩ Vj disconnects G into two (vertex-induced) subgraphs G(i) and G(j) which are both triangulated, with junction-trees T (i) and T (j) respectively. We can recursively decompose each of G(i) and G(j) into smaller and smaller subgraphs till the resulting subgraphs are cliques. When the size of all the minimal separators are bounded, we may use these decompositions to easily solve problems that are hard in general. For example, in [5] it is shown that NP-complete problems like vertex coloring, and finding maximum independent sets can be solved in polynomial time on graphs with bounded tree-width (which are equivalent to spanning graphs with bounded size separators). 
We will be interested in finding (triangulated) subgraphs of G that satisfy some conditions, such as a bound on the number of edges, or a bound on the tree-width and which optimize separable objective functions (described in Section 2) One reason why problems such as this can often be solved easily when the tree-width of G is bounded by some constant is this : If Vij is a separator decomposing G into G(i) and G(j) , then a divide-and-conquer approach would suggest that we try and find optimal subgraphs of G(i) and G(j) and then splice the two together to get an optimal subgraph of G. There are two issues with this approach. First, the optimal subgraphs of G (i) and G(j) need not necessarily match up on Vij , the set of common vertices. Second, even if the two subgraphs agree on the set of common vertices, the graph resulting from splicing the two subgraphs together need not be triangulated (which could happen even if the two subgraphs individually are triangulated). To rectify the situation, we can do the following. We partition the set of subgraphs of G(i) and G(j) into classes, so that any subgraph of G(i) and any subgraph G(j) corresponding to the same class are compatible in the sense that they match up on their intersection namely Vij , and so that by splicing the two subgraphs together, we get a subgraph of G which is acceptable (and in particular is triangulated). Then given optimal subgraphs of both G(i) and G(j) corresponding to each class, we can enumerate over all the classes and pick the best one. Of course, to ensure that we do not repeatedly solve the same problem, we need to work bottom-up (a.k.a dynamic programming) or memoize our solutions. This procedure can be carried out in polynomial (in |V |) time as long as we have only a polynomial number of classes. Now, if we have a polynomial number of classes, these classes need not actually be a partition of all the acceptable subgraphs, though the union of the classes must cover all acceptable subgraphs (so the same subgraph can be contained in more than one class). For our application, every class can be thought of to be the set of subgraphs that satisfy some constraint, and we need to pick a polynomial number of constraints that cover all possibilities. The bound on the tree-width helps us here. If k |Vij | = k, then in any subgraph H of G, H[Vij ] must be one of the 2(2) possible subgraphs k of G[Vij ]. So, if k is sufficiently small (so 2(2) is bounded by some polynomial in |V |), then this procedure results in a polynomial time algorithm. In this paper, we show that in some cases we can characterize the space H so that we still have a polynomial number of constraints even when the tree-width of G is not bounded by a small constant. 2.1 Separable objective functions For cases where exact inference in the graphical model (G, PG ) is intractable, it is natural to try to find a subgraphical model (H, PH ) such that D(PG PH ) is minimized, and inference using H is tractable. We will denote by H the set of subgraphs of G that are tractable for inference. For example, this set could be the set of subgraphs of G with treewidth one less than the treewidth of G, or perhaps the set of subgraphs of G with at d fewer edges. For a specified subgraph H of G, there is a unique probability distribution PH factoring over H that minimizes D(PG PH ). Hence, finding a optimal subgraphical model is equivalent to finding a subgraph H for which D(PG PH ) is minimized. 
If Vij is a separator of G, we will attempt to find optimal subgraphs of G by finding optimal subgraphs of G (i) and G(j) and splicing them together. However, to do this, we need to ensure that the objective criteria also decomposes along the separator Vij . Suppose that H is any triangulated subgraph of G. Let PG(i) and PG(j) be the (marginalized) distributions of PG on V (i) and V (j) respectively, and PH (i) and PH (j) be the (marginalized) distributions of the distribution PH on V (i) and V (j) where H (i) = H[V (i) ] and H (j) = H[V (j) ], The following result assures us that the KL-divergence also factors according to the separator Vij . Lemma 1. Suppose that (G, PG ) is a graphical model, H is a triangulated subgraph of G, and PH factors over H. Then D(PG PH ) = D(PG(i) PH (i) ) + D(PG(j) PH (j) ) − D(PG[Vij ] PH[Vij ] ). Proof. Since H is a subgraph of G, and Vij is a separator of G, Vij must also be a separator of H. Therefore, PH {Xv }v∈V = PH (i) ({Xv }v∈V (i) )·PH (j) ({Xv }v∈V (j) ) . PH[Vij ] ({Xv }v∈V ) The result ij follows immediately. Therefore, there is hope that we can reduce our our original problem of finding an optimal subgraph H ∈ H as one of finding subgraphs of H (i) ⊆ G(i) and H (j) ⊆ G(j) that are compatible, in the sense that they match up on the overlap Vij , and for which D(PG PH ) is minimized. Throughout this paper, for the sake of concreteness, we will assume that the objective criterion is to minimize the KL-divergence. However, all the results can be extended to other objective functions, as long as they “separate” in the sense that for any separator, the objective function is the sum of the objective functions of the two parts, possibly modulo some correction factor which is purely a function of the separator. Another example might be the complexity r(H) of representing the graphical model H. A very natural representation satisfies r(G) = r(G(i) ) + r(G(j) ) if G has a separator G(i) ∩ G(j) . Therefore, the representation cost reduction would satisfy r(G) − r(H) = (r(G (i) ) − r(H (i) )) + (r(G(j) ) − r(H (j) )), and so also factors according to the separators. Finally note that any linear combinations of such separable functions is also separable, and so this technique could also be used to determine tradeoffs (representation cost vs. KL-divergence loss for example). In Section 4 we discuss some issues regarding computing this function. 2.2 Decompositions and decomposition trees For the algorithms considered in this paper, we will be mostly interested in the decompositions that are specified by the junction tree, and we will represent these decompositions by a rooted tree called a decomposition tree. This representation was introduced in [2, 3], and is similar in spirit to Darwiche’s dtrees [6] which specify decompositions of directed acyclic graphs. In this section and the next, we show how a decomposition tree for a graph may be constructed, and show how it is used to solve a number of optimization problems. abd; ce; gf a; be; cd d; bc; e abe dbc ebc e; cf ; g cef cf g Figure 3: The separator tree corresponding to Figure 1 A decomposition tree for G is a rooted tree whose vertices correspond to separators and cliques of G. We describe the construction of the decomposition tree in terms of a junctiontree T = (K, S) for G. The interior nodes of the decomposition tree R(T ) correspond to S (the links of T and hence the minimal separators of G). The leaf or terminal nodes represent the elements of K (the nodes of T and hence the maximal cliques of G). 
R(T ) can be recursively constructed from T as follows : If T consists of just one node K, (and hence no edges), then R consists of just one node, which is given the label K as well. If however, T has more than one node, then T must contain at least one link. To begin, let Vi Vj ∈ S be any link in T . Then removal of the link Vi Vj results in two disjoint junctiontrees T (i) and T (j) . We label the root of R by the decomposition (V (i) ; Vij ; V (j) ). The rest of R is recursively built by successively picking links of T (i) and T (j) (decompositions of G(i) and G(j) ) to form the interior nodes of R. The effect of this procedure on the junction tree of Figure 1 is shown in Figure 3, where the decomposition associated with the interior nodes is shown inside the nodes. Let M be the set of all nodes of R(T ). For any interior node M induced by the the link Vi Vj ∈ S of T , then we will let M (i) and M (j) represent the left and right children of M , and R(i) and R(j) be the left and right trees below M . 3 3.1 Finding optimal subgraphical models Optimal sub (k − 1)-trees of k-trees Suppose that G is a k-tree. A sub (k − 1)-tree of G is a subgraph H of G that is (k − 1)tree. Now, if Vij is any minimal separator of G, then both G(i) and G(j) are k-trees on vertex sets V (i) and V (j) respectively. It is clear that the induced subgraphs H[V (i) ] and H[V (j) ] are subgraphs of G(i) and G(j) and are partial (k − 1)-trees. We will be interested in finding sub (k − 1)-trees of k trees and this problem is trivial by the result of [1] when k = 2. Therefore, we assume that k ≥ 3. The following result characterizes the various possibilities for H[Vij ] in this case. Lemma 2. Suppose that G is a k-tree, and S = Vij is a minimal separator of G corresponding to the link ij of the junction-tree T . In any (k − 1)-tree H ⊆ G either 1. There is a u ∈ S such that u is not connected to vertices in both V (i) \ S and V (j) \ S. Then S \ {u} is a minimal separator in H and hence is complete. 2. Every vertex in S is connected to vertices in both V (i) \S and V (j) \S. Then there are vertices {x, y} ⊆ S such that the edge H[S] is missing only the edge {x, y}. Further either H[V (i) ] or H[V (j) ] does not contain a unchorded x-y path. Proof. We consider two possibilities. In the first, there is some vertex u ∈ S such that u is not connected to vertices in both V (i) \S and V (j) \. Since the removal of S disconnects G, the removal of S must also disconnect H. Therefore, S must contain a minimal separator of H. Since H is a (k − 1)-tree, all minimal separators of H must contain k − 1 vertices which must therefore be S \{u}. This corresponds to case (1) above. Clearly this possiblity can occur. If there is no such u ∈ S, then every vertex in S is connected to vertices in both V (i) \ S and V (j) \ S. If x ∈ S is connected to some yi ∈ V (i) \ S and yj ∈ V (j) \ S, then x is contained in every minimal yi /yj separator (see [5]). Therefore, every vertex in S is part of a minimal separator. Since each minimal separator contains k − 1 vertices, there must be at least two distinct minimum separators contained in S. Let Sx = S \ {x} and Sy = S \ {y} be two distinct minimal separators. We claim that H[S] contains all edges except the edge {x, y}. To see this, note that if z, w ∈ S, with z = w and {z, w} = {x, y} (as sets), then either {z, w} ⊂ Sy or {z, w} ⊂ Sx . Since both Sx and Sy are complete in H, this edge must be present in H. 
The edge {x, y} is not present in H[S] because all minimal separators in H must be of size k − 1. Further, if both V (i) and V (j) contain an unchorded path between x and y, then by joining the two paths at x and y, we get a unchorded cycle in H which contradicts the fact that H is triangulated. Therefore, we may associate k · 2 + 2 · k constraints with each separator Vij of G as 2 follows. There are k possible constraints corresponding to case (1) above (one for each choice of x), and k · 2 choices corresponding to case (2) above. This is because for each 2 pair {x, y} corresponding to the missing edge, we have either V (i) contains no unchorded xy paths or V (j) contains no unchorded xy paths. More explicitly, we can encode the set of constraints CM associated with each separator S corresponding to an interior node M of the decomposition tree as follows: CM = { (x, y, s) : x ∈ S, y ∈ S, s ∈ {i, j}}. If y = x, then this corresponds to case (1) of the above lemma. If s = i, then x is connected only to H (i) and if s = j, then x is connected only to H (j) . If y = x, then this corresponds to case (2) in the above lemma. If s = i, then H (i) does not contain any unchorded path between x and y, and there is no constraint on H (j) . Similarly if s = j, then H (j) does not contain any unchorded path between x and y, and there is no constraint on H (i) . Now suppose that H (i) and H (j) are triangulated subgraphs of G(i) and G(j) respectively, then it is clear that if H (i) and H (j) both satisfy the same constraint they must match up on the common vertices Vij . Therefore to splice together two solutions corresponding to the same constraint, we only need to check that the graph obtained by splicing the graphs is triangulated. Lemma 3. Suppose that H (i) and H (j) are triangulated subgraphs of G(i) and G(j) respectively such that both of them satisfy the same constraint as described above. Then the graph H obtained by splicing H (i) and H (j) together is triangulated. Proof. Suppose that both H (i) and H (j) are both triangulated and both satisfy the same constraint. If both H (i) and H (j) satisfy the same constraint corresponding to case (1) in Lemma 2 and H has an unchorded cycle, then this cycle must involve elements of both H (i) and H (j) . Therefore, there must be two vertices of S \{u} on the cycle, and hence this cycle has a chord as S \ {u} is complete. This contradiction shows that H is triangulated. So assume that both of them satisfy the constraint corresponding to case (2) of Lemma 2. Then if H is not triangulated, there must be a t-cycle (for t ≥ 4) with no chord. Now, since {x, y} is the only missing edge of S in H, and because H (i) and H (j) are individually triangulated, the cycle must contain x, y and vertices of both V (i) \ S and V (j) \ S. We may split this unchorded cycle into two unchorded paths, one contained in V (i) and one in V (j) thus violating our assumption that both H (i) and H (j) satisfy the same constraint. If |S| = k, then there are 2k + 2 · k ∈ O(k 2 ) ∈ O(n2 ). We can use a divide and conquer 2 strategy to find the optimal sub (k − 1) tree once we have taken care of the base case, where G is just a single clique (of k + 1) elements. However, for this case, it is easily checked that any subgraph of G obtained by deleting exactly one edge results in a (k − 1) tree, and every sub (k−1)-tree results from this operation. Therefore, the optimal (k−1)-tree can be found using Algorithm 1, and in this case, the complexity of Algorithm 1 is O(n(k + 1) 2 ). 
This procedure can be generalized to find the optimal sub (k − d)- tree for any fixed d. However, the number of constraints grows exponentially with d (though it is still polynomial in n). Therefore for small, fixed values of d, we can compute the optimal sub (k − d)-tree of G. While we can compute (k − d)-trees of G by first going from a k tree to a (k − 1) tree, then from a (k − 1)-tree to a (k − 2)-tree, and so on in a greedy fashion, this will not be optimal in general. However, this might be a good algorithm to try when d is large. 3.2 Optimal triangulated subgraphs with |E(G)| − d edges Suppose that we are interested in a (triangulated) subgraph of G that contains d fewer edges that G does. That is, we want to find an optimal subgraph H ⊂ G such that |E(H)| = |E(G)| − d. Note that by the result of [4] there is always a triangulated subgraph with d fewer edges (if d < |E(G)|). Two possibilities for finding such an optimal subgraph are 1. Use the procedure described in [4]. This is a greedy procedure which works in d steps by deleting an edge at each step. At each state, the edge is picked from the set of edges whose deletion leaves a triangulated graph. Then the edge which causes the least increase in KL-divergence is picked at each stage. 2. For each possible subset A of E(G) of size d, whose deletion leaves a triangulated graph, compute the KL divergence using the formula above, and then pick the optimal one. Since there are |E(G)| such sets, this can be done in polynomial d time (in |V (G)|) when d is a constant. The first greedy algorithm is not guaranteed to yield the optimal solution. The second takes time that is O(n2d ). Now, let us solve this problem using the framework we’ve described. Let H be the set of subgraphs of G which may be obtained by deletion of d edges. For each M = ij ∈ M corresponding to the separator Vij , let CM = (l, r, c, s, A) : l + r − c = d, s a d bit string, A ∈ E(G[Vij ]) . The constraint reprec sented by (l, r, c, A) is this : A is a set of d edges of G[Vij ] that are missing in H, l edges are missing from the left subgraph, and r edges are missing from the right subgraph. c represents the double count, and so is subtracted from the total. If k is the size of the largest k clique, then the total number of such constraints is bounded by 2d · 2d · (2) ∈ O(k 2d ) d which could be better than O(n2d ) and is polynomial in |V | when d is constant. See [10] for additional details. 4 Conclusions Algorithm 1 will compute the optimal H ∈ H for the two examples discussed above and is polynomial (for fixed constant d) even if k is O(n). In [10] a generalization is presented which will allow finding the optimal solution for other classes of subgraphical models. Now, we assume an oracle model for computing KL-divergences of probability distributions on vertex sets of cliques. 
It is clear that these KL-divergences can be computed R ← separator-tree for G; for each vertex M of R in order of increasing height (bottom up) do for each constraint cM of M do if M is an interior vertex of R corresponding to edge ij of the junction tree then Let Ml and Mr be the left and right children of M ; Pick constraint cl ∈ CMl compatible with cM to minimize table[Ml , cl ]; Pick constraint cr ∈ CMr compatible with cM to minimize table[Mr , cr ]; loss ← D(PG [M ] PH [M ]); table[M, cM ] ← table[Ml , cl ] + table[Mr , cr ] − loss; else table[M, cM ] ← D(PG [M ] PH [M ]); end end end Algorithm 1: Finding optimal set of constraints efficiently for distributions like Gaussians, but for discrete distributions this may not be possible when k is large. However even in this case this algorithm will result in only polynomial calls to the oracle. The standard algorithm [3] which is exponential in the treewidth will make O(2k ) calls to this oracle. Therefore, when the cost of computing the KL-divergence is large, this algorithm becomes even more attractive as it results in exponential speedup over the standard algorithm. Alternatively, if we can compute approximate KL-divergences, or approximately optimal solutions, then we can compute an approximate solution by using the same algorithm. References [1] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees”, IEEE Transactions on Information Theory, v. 14, 1968, Pages 462–467. [2] F. Gavril, “Algorithms on clique separable graphs”, Discrete Mathematics v. 9 (1977), pp. 159–165. [3] R. E. Tarjan. “Decomposition by Clique Separators”, Discrete Mathematics, v. 55 (1985), pp. 221–232. [4] U. Kjaerulff. “Reduction of computational complexity in Bayesian networks through removal of weak dependencies”, Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence, pp. 374–382, 1994. [5] T. Kloks, “Treewidth: Computations and Approximations”, Springer-Verlag, 1994. [6] A. Darwiche and M. Hopkins. “Using recursive decomposition to construct elimination orders, jointrees and dtrees”, Technical Report D-122, Computer Science Dept., UCLA. [7] S. Lauritzen. “Graphical Models”, Oxford University Press, Oxford, 1996. [8] T. A. McKee and F. R. McMorris. “Topics in Intersection Graph Theory”, SIAM Monographs on Discrete Mathematics and Applications, 1999. [9] D. Karger and N. Srebro. “Learning Markov networks: Maximum bounded tree-width graphs.” In Symposium on Discrete Algorithms, 2001, Pages 391-401. [10] M. Narasimhan and J. Bilmes. “Optimization on separator-clique trees.”, Technical report UWEETR 2004-10, June 2004.

3 0.57137102 165 nips-2004-Semi-supervised Learning on Directed Graphs

Author: Dengyong Zhou, Thomas Hofmann, Bernhard Schölkopf

Abstract: Given a directed graph in which some of the nodes are labeled, we investigate the question of how to exploit the link structure of the graph to infer the labels of the remaining unlabeled nodes. To that extent we propose a regularization framework for functions defined over nodes of a directed graph that forces the classification function to change slowly on densely linked subgraphs. A powerful, yet computationally simple classification algorithm is derived within the proposed framework. The experimental evaluation on real-world Web classification problems demonstrates encouraging results that validate our approach. 1

4 0.45532569 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

Author: Jianlin Cheng, Alessandro Vullo, Pierre F. Baldi

Abstract: The formation of disulphide bridges among cysteines is an important feature of protein structures. Here we develop new methods for the prediction of disulphide bond connectivity. We first build a large curated data set of proteins containing disulphide bridges and then use 2-Dimensional Recursive Neural Networks to predict bonding probabilities between cysteine pairs. These probabilities in turn lead to a weighted graph matching problem that can be addressed efficiently. We show how the method consistently achieves better results than previous approaches on the same validation data. In addition, the method can easily cope with chains with arbitrary numbers of bonded cysteines. Therefore, it overcomes one of the major limitations of previous approaches restricting predictions to chains containing no more than 10 oxidized cysteines. The method can be applied both to situations where the bonded state of each cysteine is known or unknown, in which case bonded state can be predicted with 85% precision and 90% recall. The method also yields an estimate for the total number of disulphide bridges in each chain. 1

5 0.44611174 130 nips-2004-Newscast EM

Author: Wojtek Kowalczyk, Nikos A. Vlassis

Abstract: We propose a gossip-based distributed algorithm for Gaussian mixture learning, Newscast EM. The algorithm operates on network topologies where each node observes a local quantity and can communicate with other nodes in an arbitrary point-to-point fashion. The main difference between Newscast EM and the standard EM algorithm is that the M-step in our case is implemented in a decentralized manner: (random) pairs of nodes repeatedly exchange their local parameter estimates and combine them by (weighted) averaging. We provide theoretical evidence and demonstrate experimentally that, under this protocol, nodes converge exponentially fast to the correct estimates in each M-step of the EM algorithm. 1
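A toy sketch of the gossip averaging the abstract describes is given below. It averages scalars only; the real protocol propagates sufficient statistics and weights for each mixture component, and the round and pairing scheme here is an assumption of the sketch.

```python
import numpy as np

def gossip_average(values, rounds=30, rng=None):
    """Sketch of gossip-style averaging behind a decentralized M-step:
    random pairs of nodes repeatedly replace their local estimates by the
    pair's mean, so every node converges to the global average exponentially
    fast in the number of rounds."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(values, dtype=float).copy()
    n = len(x)
    for _ in range(rounds):
        for _ in range(n):                        # roughly one exchange per node
            i, j = rng.choice(n, size=2, replace=False)
            x[i] = x[j] = (x[i] + x[j]) / 2.0     # pairwise averaging step
    return x

# Example: gossip_average(np.random.default_rng(1).normal(size=100)) returns a
# vector whose entries are all close to the empirical mean of the inputs.
```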

6 0.44261909 177 nips-2004-Supervised Graph Inference

7 0.44133571 129 nips-2004-New Criteria and a New Algorithm for Learning in Multi-Agent Systems

8 0.42620447 171 nips-2004-Solitaire: Man Versus Machine

9 0.40797484 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography

10 0.39357653 122 nips-2004-Modelling Uncertainty in the Game of Go

11 0.39215732 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

12 0.38457766 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

13 0.37783551 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

14 0.36685604 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps

15 0.35900155 123 nips-2004-Multi-agent Cooperation in Diverse Population Games

16 0.35590184 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale

17 0.34814861 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

18 0.33880246 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

19 0.32589599 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

20 0.31929898 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.4), (11, 0.014), (13, 0.056), (15, 0.116), (26, 0.049), (31, 0.024), (33, 0.134), (35, 0.019), (39, 0.014), (50, 0.023), (77, 0.016), (82, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.79544306 135 nips-2004-On-Chip Compensation of Device-Mismatch Effects in Analog VLSI Neural Networks

Author: Miguel Figueroa, Seth Bridges, Chris Diorio

Abstract: Device mismatch in VLSI degrades the accuracy of analog arithmetic circuits and lowers the learning performance of large-scale neural networks implemented in this technology. We show compact, low-power on-chip calibration techniques that compensate for device mismatch. Our techniques enable large-scale analog VLSI neural networks with learning performance on the order of 10 bits. We demonstrate our techniques on a 64-synapse linear perceptron learning with the Least-Mean-Squares (LMS) algorithm, and fabricated in a 0.35µm CMOS process. 1

same-paper 2 0.76174599 57 nips-2004-Economic Properties of Social Networks

Author: Sham M. Kakade, Michael Kearns, Luis E. Ortiz, Robin Pemantle, Siddharth Suri

Abstract: We examine the marriage of recent probabilistic generative models for social networks with classical frameworks from mathematical economics. We are particularly interested in how the statistical structure of such networks influences global economic quantities such as price variation. Our findings are a mixture of formal analysis, simulation, and experiments on an international trade data set from the United Nations. 1

3 0.75660211 112 nips-2004-Maximising Sensitivity in a Spiking Network

Author: Anthony J. Bell, Lucas C. Parra

Abstract: We use unsupervised probabilistic machine learning ideas to try to explain the kinds of learning observed in real neurons, the goal being to connect abstract principles of self-organisation to known biophysical processes. For example, we would like to explain Spike TimingDependent Plasticity (see [5,6] and Figure 3A), in terms of information theory. Starting out, we explore the optimisation of a network sensitivity measure related to maximising the mutual information between input spike timings and output spike timings. Our derivations are analogous to those in ICA, except that the sensitivity of output timings to input timings is maximised, rather than the sensitivity of output ‘firing rates’ to inputs. ICA and related approaches have been successful in explaining the learning of many properties of early visual receptive fields in rate coding models, and we are hoping for similar gains in understanding of spike coding in networks, and how this is supported, in principled probabilistic ways, by cellular biophysical processes. For now, in our initial simulations, we show that our derived rule can learn synaptic weights which can unmix, or demultiplex, mixed spike trains. That is, it can recover independent point processes embedded in distributed correlated input spike trains, using an adaptive single-layer feedforward spiking network. 1 Maximising Sensitivity. In this section, we will follow the structure of the ICA derivation [4] in developing the spiking theory. We cannot claim, as before, that this gives us an information maximisation algorithm, for reasons that we will delay addressing until Section 3. But for now, to first develop our approach, we will explore an interim objective function called sensitivity which we define as the log Jacobian of how input spike timings affect output spike timings. 1.1 How to maximise the effect of one spike timing on another. Consider a spike in neuron j at time tl that has an effect on the timing of another spike in neuron i at time tk . The neurons are connected by a weight wij . We use i and j to index neurons, and k and l to index spikes, but sometimes for convenience we will use spike indices in place of neuron indices. For example, wkl , the weight between an input spike l and an output spike k, is naturally understood to be just the corresponding wij . dtk dtl threshold potential du u(t) R(t) resting potential tk output spikes tl input spikes Figure 1: Firing time tk is determined by the time of threshold crossing. A change of an input spike time dtl affects, via a change of the membrane potential du the time of the output spike by dtk . In the simplest version of the Spike Response Model [7], spike l has an effect on spike k that depends on the time-course of the evoked EPSP or IPSP, which we write as R kl (tk − tl ). In general, this Rkl models both synaptic and dendritic linear responses to an input spike, and thus models synapse type and location. For learning, we need only consider the value of this function when an output spike, k, occurs. In this model, depicted in Figure 1, a neuron adds up its spiking inputs until its membrane potential, ui (t), reaches threshold at time tk . This threshold we will often, again for convenience, write as uk ≡ ui (tk , {tl }), and it is given by a sum over spikes l: uk = wkl Rkl (tk − tl ) . (1) l To maximise timing sensitivity, we need to determine the effect of a small change in the input firing time tl on the output firing time tk . (A related problem is tackled in [2].) 
When tl is changed by a small amount dtl the membrane potential will change as a result. This change in the membrane potential leads to a change in the time of threshold crossing dt k . The contribution to the membrane potential, du, due to dtl is (∂uk /∂tl )dtl , and the change in du corresponding to a change dtk is (∂uk /∂tk )dtk . We can relate these two effects by noting that the total change of the membrane potential du has to vanish because u k is defined as the potential at threshold. ie: du = ∂uk ∂uk dtk + dtl = 0 . ∂tk ∂tl (2) This is the total differential of the function uk = u(tk , {tl }), and is a special case of the implicit function theorem. Rearranging this: dtk ∂uk =− dtl ∂tl ∂uk ˙ = −wkl Rkl /uk . ˙ ∂tk (3) Now, to connect with the standard ICA derivation [4], recall the ‘rate’ (or sigmoidal) neuron, for which yi = gi (ui ) and ui = j wij xj . For this neuron, the output dependence on input is ∂yi /∂xj = wij gi while the learning gradient is: ∂yi ∂ 1 log − fi (ui )xj = ∂wij ∂xj wij (4) where the ‘score functions’, fi , are defined in terms of a density estimate on the summed ∂ ∂ inputs: fi (ui ) = ∂ui log gi = ∂ui log p(ui ). ˆ The analogous learning gradient for the spiking case, from (3), is: ˙ j(a)Rka ∂ dtk 1 log − a . = ∂wij dtl wij uk ˙ (5) where j(a) = 1 if spike a came from neuron j, and 0 otherwise. Comparing the two cases in (4) and (5), we see that the input variable xj has become the temporal derivative of the sum of the EPSPs coming from synapse j, and the output variable (or score function) fi (ui ) has become u−1 , the inverse of the temporal derivative ˙k of the membrane potential at threshold. It is intriguing (A) to see this quantity appear as analogous to the score function in the ICA likelihood model, and, (B) to speculate that experiments could show that this‘ voltage slope at threshold’ is a hidden factor in STDP data, explaining some of the scatter in Figure 3A. In other words, an STDP datapoint should lie on a 2-surface in a 3D space of {∆w, ∆t, uk }. Incidentally, uk shows up in any ˙ ˙ learning rule optimising an objective function involving output spike timings. 1.2 How to maximise the effect of N spike timings on N other ones. Now we deal with the case of a ‘square’ single-layer feedforward mapping between spike timings. There can be several input and output neurons, but here we ignore which neurons are spiking, and just look at how the input timings affect the output timings. This is captured in a Jacobian matrix of all timing dependencies we call T. The entries of this matrix are Tkl ≡ ∂tk /∂tl . A multivariate version of the sensitivity measure introduced in the previous section is the log of the absolute determinant of the timing matrix, ie: log |T|. The full derivation for the gradient W log |T| is in the Appendix. Here, we again draw out the analogy between Square ICA [4] and this gradient, as follows. Square ICA with a network y = g(Wx) is: ∆W ∝ W log |J| = W−1 − f (u)xT (6) where the Jacobian J has entries ∂yi /∂xj and the score functions are now, fi (u) = ∂ − ∂ui log p(u) for the general likelihood case, with p(u) = i gi being the special case of ˆ ˆ ICA. We will now split the gradient in (6) according to the chain rule: W log |J| = [ J log |J|] ⊗ [ W J] j(l) − fk (u)xj wkl J−T ⊗ Jkl i(k) = (7) . (8) In this equation, i(k) = δik and j(l) = δjl . The righthand term is a 4-tensor with entries ∂Jkl /∂wij , and ⊗ is defined as A ⊗ Bij = kl Akl Bklij . 
We write the gradient this way to preserve, in the second term, the independent structure of the 1 → 1 gradient term in (4), and to separate a difficult derivation into two easy parts. The structure of (8) holds up when we move to the spiking case, giving: W log |T| = = [ T log |T|] ⊗ [ W T] T−T ⊗ Tkl i(k) j(l) − wkl (9) a ˙ j(a)Rka uk ˙ (10) where i(k) is now defined as being 1 if spike k occured in neuron i, and 0 otherwise. j(l) and j(a) are analogously defined. Because the T matrix is much bigger than the J matrix, and because it’s entries are more complex, here the similarity ends. When (10) is evaluated for a single weight influencing a single spike coupling (see the Appendix for the full derivation), it yields: ∆wkl ∝ ∂ log |T| Tkl = ∂wkl wkl T−1 lk −1 , (11) This is a non-local update involving a matrix inverse at each step. In the ICA case of (6), such an inverse was removed by the Natural Gradient transform (see [1]), but in the spike timing case, this has turned out not to be possible, because of the additional asymmetry ˙ introduced into the T matrix (as opposed to the J matrix) by the Rkl term in (3). 2 Results. Nonetheless, this learning rule can be simulated. It requires running the network for a while to generate spikes (and a corresponding T matrix), and then for each input/output spike coupling, the corresponding synapse is updated according to (11). When this is done, and the weights learn, it is clear that something has been sacrificed by ignoring the issue of which neurons are producing the spikes. Specifically, the network will often put all the output spikes on one output neuron, with the rates of the others falling to zero. It is happy to do this, if a large log |T| can thereby be achieved, because we have not included this ‘which neuron’ information in the objective. We will address these and other problems in Section 3, but now we report on our simulation results on demultiplexing. 2.1 Demultiplexing spike trains. An interesting possibility in the brain is that ‘patterns’ are embedded in spatially distributed spike timings that are input to neurons. Several patterns could be embedded in single input trains. This is called multiplexing. To extract and propagate these patterns, the neurons must demultiplex these inputs using its threshold nonlinearity. Demultiplexing is the ‘point process’ analog of the unmixing of independent inputs in ICA. We have been able to robustly achieve demultiplexing, as we now report. We simulated a feed-forward network with 3 integrate-and-fire neurons and inputs from 3 presynaptic neurons. Learning followed (11) where we replace the inverse by the pseudoinverse computed on the spikes generated during 0.5 s. The pseudo-inverse is necessary because even though on average, the learning matches number of output spikes to number of input spikes, the matrix T is still not usually square and so its actual inverse cannot be taken. In addition, in these simulations, an additional term is introduced in the learning to make sure all the output neurons fire with equal probability. This partially counters the ignoral of the ‘which neuron’ information, which we explained above. Assuming Poisson spike count ni for the ith output neuron with equal firing rate ni it is easy to derive in an approximate ¯ term that will control the spike count, i (¯ i − ni ). The target firing rates ni were set to n ¯ match the “source” spike train in this example. The network learns to demultiplex mixed spike trains, as shown in Figure 2. 
This demultiplexing is a robust property of learning using (11) with this new spike-controlling term. Finally, what about the spike-timing dependence of the observed learning? Does it match experimental results? The comparison is made in Figure 3, and the answer is no. There is a timing-dependent transition between depression and potentiation in our result in Figure 3B, but it is not a sharp transition like the experimental result in Figure 3A. [Figure 2 panels omitted. Figure 2: Unmixed spike trains. The input (top left) are 3 spike trains which are a mixture of three independent Poisson processes (bottom left). The network unmixes the spike trains to approximately recover the original (center left). In this example 19 spikes correspond to the original, with 4 deletions and 2 insertions. The two panels at the right show the mixing (top) and the synaptic weight matrix after training (bottom).] In addition, it does not transition at zero (ie: when tk − tl = 0), but at a time offset by the rise time of the EPSPs. In earlier experiments, in which we transformed the gradient in (11) by an approximate inverse Hessian to get an approximate Natural Gradient method, a sharp transition did emerge in simulations. However, the approximate inverse Hessian was singular, and we had to de-emphasise this result. It does suggest, however, that if the Natural Gradient transform can be usefully done on some variant of this learning rule, it may well be what accounts for the sharp transition effect of STDP. 3 Discussion Although these derivations started out smoothly, the reader possibly shares the authors' frustration at the approximations involved here. Why isn't this simple, like ICA? Why don't we just have a nice maximum spikelihood model, ie: a density estimation algorithm for multivariate point processes, as ICA was a model in continuous space? We are going to be explicit about the problems now, and will propose a direction where the solution may lie. The overriding problem is that we are unable to claim that in maximising log |T| we are maximising the mutual information between inputs and outputs, because: 1. The Invertibility Problem. Algorithms such as ICA which maximise log Jacobians can only be called Infomax algorithms if the network transformation is both deterministic and invertible. The Spike Response Model is deterministic, but it is not invertible in general. When not invertible, the key formula (considering here vectors of input and output timings, tin and tout) is transformed from simple to complex, ie: p(tout) = p(tin)/|T| becomes p(tout) = ∫_{solns tin} p(tin)/|T| d tin (12). Thus when not invertible, we need to know the Jacobians of all the inputs that could have caused an output (called here 'solns'), something we simply don't know. 2. The 'Which Neuron' Problem. Instead of maximising the mutual information I(tout, tin), we should be maximising I(tiout, tiin), where the vector ti is the timing [Figure 3 panels omitted: (A) STDP and (B) Gradient, synaptic change ∆w plotted against ∆t (ms).] Figure 3: Dependence of synaptic modification on pre/post inter-spike interval. Left (A): From Froemke & Dan, Nature (2002).
Dependence of synaptic modification on pre/post inter-spike interval in cat L2/3 visual cortical pyramidal cells in slice. Naturalistic spike trains. Each point represents one experiment. Right (B): According to Equation (11). Each point corresponds to an spike pair between approximately 100 input and 100 output spikes. vector, t, with the vector, i, of corresponding neuron indices, concatenated. Thus, ‘who spiked?’ should be included in the analysis as it is part of the information. 3. The Predictive Information Problem. In ICA, since there was no time involved, we did not have to worry about mutual informations over time between inputs and outputs. But in the spiking model, output spikes may well have (predictive) mutual information with future input spikes, as well as the usual (causal) mutual information with past input spikes. The former has been entirely missing from our analysis so far. These temporal and spatial infomation dependencies missing in our analysis so far, are thrown into a different light by a single empirical observation, which is that Spike TimingDependent Plasticity is not just a feedforward computation like the Spike Response Model. Specifically, there must be at least a statistical, if not a causal, relation between a real synapse’s plasticity and its neuron’s output spike timings, for Figure 3B to look like it does. It seems we have to confront the need for both a ‘memory’ (or reconstruction) model, such as the T we have thus far dealt with, in which output spikes talk about past inputs, and a ‘prediction’ model, in which they talk about future inputs. This is most easily understood from the point of view of Barber & Agakov’s variational Infomax algorithm [3]. They argue for optimising a lower bound on mutual information, which, for our neurons’, would be expressed using an inverse model p, as follows: ˆ I(tiin , tiout ) = H(tiin ) − log p(tiin |tiout ) ˆ p(tiin ,tiout ) ≤ I(tiin , tiout ) (13) In a feedforward model, H(tiin ) may be disregarded in taking gradients, leading us to the optimisation of a ‘memory-prediction’ model p(tiin |tiout ) related to something supposˆ edly happening in dendrites, somas and at synapses. In trying to guess what this might be, it would be nice if the math worked out. We need a square Jacobian matrix, T, so that |T| = p(tiin |tiout ) can be our memory/prediction model. Now let’s rename our feedforˆ → − ward timing Jacobian T (‘up the dendritic trees’), as T, and let’s fantasise that there is ← − some, as yet unspecified, feedback Jacobian T (‘down the dendritic trees’), which covers → − electrotonic influences as they spread from soma to synapse, and which T can be combined with by some operation ‘⊗’ to make things square. Imagine further, that doing this → ← − − yields a memory/prediction model on the inputs. Then the T we are looking for is T ⊗ T, → ← − − and the memory-prediction model is: p(tiin |tiout ) = T ⊗ T ˆ → − → − Ideally, the entries of T should be as before, ie: T kl = ∂tk /∂tl . What should the entries ← − ← − ← − of T be? Becoming just one step more concrete, suppose T had entries T lk = ∂cl /∂tk , where cl is some, as yet unspecified, value, or process, occuring at an input synapse when spike l comes in. What seems clear is that ⊗ should combine the correctly tensorised forms → − ← − → ← − − of T and T (giving them each 4 indices ijkl), so that T = T ⊗ T sums over the spikes k and l to give a I × J matrix, where I is the number of output neurons, and J the number of input neurons. 
Then our quantity, T, would represent all dependencies of input neuronal activity on output activity, summed over spikes. ← − Further, we imagine that T contains reverse (feedback) electrotonic transforms from soma ← − to synapse R lk that are somehow symmetrically related to the feedforward Spike Re→ − sponses from synapse to soma, which we now rename R kl . Thinking for a moment in terms of somatic k and synaptic l, voltages V , currents I and linear cable theory, the synapse to → − → − soma transform, R kl would be related to an impedance in Vk = Il Z kl , while the soma ← − ← − to synapse transform, R lk would be related to an admittance in Il = Vk Y lk [8]. The → − ← − symmetry in these equations is that Z kl is just the inverse conjugate of Y lk . Finally, then, what is cl ? And what is its relation to the calcium concentration, [Ca2+ ]l , at a synapse, when spike l comes in? These questions naturally follow from considering the experimental data, since it is known that the calcium level at synapses is the critical integrating factor in determining whether potentiation or depression occurs [5]. 4 Appendix: Gradient of log |T| for the full Spike Response Model. Here we give full details of the gradient for Gerstner’s Spike Response Model [7]. This is a general model for which Integrate-and-Fire is a special case. In this model the effect of a presynaptic spike at time tl on the membrane potential at time t is described by a post synaptic potential or spike response, which may also depend on the time that has passed since the last output spike tk−1 , hence the spike response is written as R(t − tk−1 , t − tl ). This response is weighted by the synaptic strength wl . Excitatory or inhibitory synapses are determined by the sign of wl . Refractoriness is incorporated by adding a hyper-polarizing contribution (spike-afterpotential) to the membrane potential in response to the last preceding spike η(t − tk−1 ). The membrane potential as a function of time is therefore given by u(t) = η(t − tk−1 ) + wl R(t − tk−1 , t − tl ) . (14) l We have ignored here potential contributions from external currents which can easily be included without modifying the following derivations. The output firing times t k are defined as the times for which u(t) reaches firing threshold from below. We consider a dynamic threshold, ϑ(t − tk−1 ), which may depend on the time since that last spike tk−1 , together then output spike times are defined implicitly by: t = tk : u(t) = ϑ(t − tk−1 ) and du(t) > 0. dt (15) For this more general model Tkl is given by Tkl = dtk =− dtl ∂u ∂ϑ − ∂tk ∂tk −1 ˙ ∂u wkl R(tk − tk−1 , tk − tl , ) = , ˙ ∂tl u(tk ) − ϑ(tk − tk−1 ) ˙ (16) ˙ ˙ where R(s, t), u(t), and ϑ(t) are derivatives with respect to t. The dependence of Tkl on ˙ tk−1 should be implicitly assumed. It has been omitted to simplify the notation. Now we compute the derivative of log |T| with respect to wkl . For any matrix T we have ∂ log |T|/∂Tab = [T−1 ]ba . Therefore: ∂ log |T| ∂Tab ∂ log |T| ∂Tab = [T−1 ]ba . (17) ∂wkl ∂Tab ∂wkl ∂wkl ab ab Utilising the Kronecker delta δab = (1 if a = b, else 0), the derivative of (16) with respect to wkl gives: ˙ ∂Tab ∂ wab R(ta − ta−1 , ta − tb ) = ˙ ˙ ∂wkl ∂wkl η(ta − ta−1 ) + wac R(ta − ta−1 , ta − tc ) − ϑ(ta − ta−1 ) c ˙ R(ta − ta−1 , ta − tb ) = δak δbl ˙ u(ta ) − ϑ(ta − ta−1 ) ˙ ˙ ˙ wab R(ta − ta−1 , ta − tb )δak R(ta − ta−1 , ta − tl ) − 2 ˙ u(ta ) − ϑ(ta − ta−1 ) ˙ = δak Tab Therefore: ∂ log |T| ∂wkl Tal δbl − wab wal . 
(18) δbl Tal − wab wal [T−1 ]ba δak Tab = ab = Tkl wkl [T−1 ]lk − [T−1 ]bk Tkl b (19) = Tkl [T−1 ]lk − 1 . wkl (20) Acknowledgments We are grateful for inspirational discussions with Nihat Ay, Michael Eisele, Hong Hui Yu, Jim Crutchfield, Jeff Beck, Surya Ganguli, Sophi` Deneve, David Barber, Fabian Theis, e Tony Zador and Arunava Banerjee. AJB thanks all RNI colleagues for many such discussions. References [1] Amari S-I. 1997. Natural gradient works efficiently in learning, Neural Computation, 10, 251-276 [2] Banerjee A. 2001. On the Phase-Space Dynamics of Systems of Spiking Neurons. Neural Computation, 13, 161-225 [3] Barber D. & Agakov F. 2003. The IM Algorithm: A Variational Approach to Information Maximization. Advances in Neural Information Processing Systems 16, MIT Press. [4] Bell A.J. & Sejnowski T.J. 1995. An information maximization approach to blind separation and blind deconvolution, Neural Computation, 7, 1129-1159 [5] Dan Y. & Poo M-m. 2004. Spike timing-dependent plasticity of neural circuits, Neuron, 44, 23-30 [6] Froemke R.C. & Dan Y. 2002. Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 28, 416: 433-8 [7] Gerstner W. & Kistner W.M. 2002. Spiking neuron models, Camb. Univ. Press [8] Zador A.M., Agmon-Snir H. & Segev I. 1995. The morphoelectrotonic transform: a graphical approach to dendritic function, J. Neurosci., 15(3): 1669-1682
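As a small, self-contained sanity check (not from the paper) of the matrix identity invoked at the start of the appendix above, ∂ log |T| / ∂T_ab = [T^{-1}]_ba, one can compare the analytic gradient with central finite differences; the matrix size and tolerances below are illustrative.

```python
import numpy as np

# Numerical check of d log|det T| / dT_ab = [T^{-1}]_ba for an invertible T.
rng = np.random.default_rng(0)
T = rng.normal(size=(5, 5))
analytic = np.linalg.inv(T).T            # gradient predicted by the identity

eps = 1e-6
numeric = np.zeros_like(T)
for a in range(5):
    for b in range(5):
        Tp, Tm = T.copy(), T.copy()
        Tp[a, b] += eps
        Tm[a, b] -= eps
        numeric[a, b] = (np.log(abs(np.linalg.det(Tp)))
                         - np.log(abs(np.linalg.det(Tm)))) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-4)
```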

4 0.43374091 178 nips-2004-Support Vector Classification with Input Data Uncertainty

Author: Jinbo Bi, Tong Zhang

Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1

5 0.43315095 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

6 0.43285942 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

7 0.43223193 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

8 0.43140626 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

9 0.4314031 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

10 0.43113121 131 nips-2004-Non-Local Manifold Tangent Learning

11 0.4310362 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

12 0.430397 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

13 0.43021542 161 nips-2004-Self-Tuning Spectral Clustering

14 0.4301002 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

15 0.42971027 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

16 0.42946279 68 nips-2004-Face Detection --- Efficient and Rank Deficient

17 0.4288955 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming

18 0.42889056 69 nips-2004-Fast Rates to Bayes for Kernel Machines

19 0.4286938 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

20 0.42830548 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods