nips nips2002 nips2002-83 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is relevant for the task at hand. Identifying the relevant structures can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.
Reference: text
sentIndex sentText sentNum sentScore
1 Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. [sent-5, score-0.582]
2 However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. [sent-6, score-0.606]
3 In many other cases it is in fact easier to specify what is irrelevant than what is relevant for the task at hand. [sent-7, score-0.208]
4 Identifying the relevant structures can thus be considerably improved by also minimizing the information about another, irrelevant, variable. [sent-8, score-0.217]
5 Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. [sent-10, score-0.186]
6 While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information. [sent-11, score-1.843]
7 1 Introduction A fundamental goal of machine learning is to find regular structures in given empirical data, and to use them to construct predictive or comprehensible models. [sent-12, score-0.213]
8 For example, documents may be classified either by subject or by writing style; spoken words can be labeled by their meaning or by the identity of the speaker; proteins can be classified by their structure or function - all are valid alternatives. [sent-14, score-0.221]
9 Which of these alternative structures is “relevant” is often implicit in the problem formulation. [sent-15, score-0.169]
10 The problem of identifying “the” relevant structures is commonly addressed in supervised learning tasks, by providing a “relevant” label to the data, and selecting features that are discriminative with respect to this label. [sent-16, score-0.41]
11 An information theoretic generalization of this supervised approach has been proposed in [9, 15] through the information bottleneck method (IB). [sent-17, score-0.362]
12 In this approach, relevance is introduced through another random variable (as is the label in supervised learning) and the goal is to compress one (the source) variable, while maintaining as much information as possible about the auxiliary (relevance) variable. [sent-18, score-0.341]
13 This framework has proven powerful for numerous applications, such as clustering the objects of sentences with respect to the verbs [9], documents with respect to their terms [1, 6, 14], genes with respect to tissues [8, 11], and stimuli with respect to spike patterns [10]. [sent-19, score-0.563]
14 An important condition for this approach to work is that the auxiliary variable indeed corresponds to the task. [sent-20, score-0.153]
15 In many situations, however, such a "pure" variable is not available. [sent-21, score-0.082]
16 The variable may in fact contain alternative and even conflicting structures. [sent-22, score-0.125]
17 information about "unimportant", or irrelevant, aspects of the data that can interfere with the desired structure during learning. [sent-25, score-0.145]
18 When given a sample of pairs with the goal of extracting the relevant dependence, the noise - which may carry structure of its own and thus interfere with extracting that dependence - is an irrelevant variable. [sent-28, score-0.79]
19 Such data, as generated by DNA-chip technology, can be considered as an empirical joint distribution of gene expression levels and different tissues, where the tissues are taken from different biological conditions and pathologies. [sent-32, score-0.188]
20 The search for expressed genes that testify to the existence of a pathology may be obscured by genetic correlations that also exist in other conditions. [sent-33, score-0.085]
21 Here again a sample of irrelevant expression data, taken for instance from a healthy population, can enable clustering analysis to focus on the pathological features only, and ignore spurious structures. [sent-34, score-0.421]
22 These two examples, and numerous others, are all instantiations of a common problem: in order to better extract the relevant structures, information about the irrelevant components of the data should be incorporated. [sent-35, score-0.703]
23 The current paper presents a general unified information theoretic framework for such problems, extending the original information bottleneck variational problem to deal with discriminative tasks of that nature, by observing its analogy with rate distortion theory with side information. [sent-39, score-0.907]
24 2 Information Theoretic Formulation To formalize the problem of extracting relevant structures, consider first three categorical variables - a source, a relevant variable, and an irrelevant one - whose pairwise co-occurrence distributions are known. [sent-40, score-0.499]
25 Our goal is to uncover structures in the source that do not exist in the irrelevant variable. [sent-41, score-0.288]
26 The joint distribution may contain several conflicting underlying structures, some of which may also exist in the irrelevant variable. [sent-42, score-0.075]
27 These variables stand, for example, for a set of terms, a set of documents whose structure we seek, and an additional set of documents; or for a set of genes and two sets of tissues with different biological conditions. [sent-43, score-0.627]
28 50 for a discussion of this type of diagram) and the intersection of two circles corresponds to their mutual information. [sent-50, score-0.141]
29 The mutual information of two random variables is the familiar symmetric functional of their joint distribution, $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$. [sent-51, score-0.264]
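As a concrete illustration (ours, not the paper's), the following short Python sketch computes this quantity directly from a discrete joint distribution:

import numpy as np

def mutual_information(p_xy):
    # I(X;Y) in bits for a joint distribution given as a 2-D array p_xy[x, y] summing to 1
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)        # marginal p(x), shape (|X|, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)        # marginal p(y), shape (1, |Y|)
    denom = p_x * p_y                            # product of marginals, broadcast to (|X|, |Y|)
    mask = p_xy > 0                              # 0 * log 0 is taken as 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / denom[mask])))

# Example: two weakly dependent binary variables
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
print(mutual_information(p))                     # about 0.278 bits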
31 A Venn diagram illustrating the relations between the entropy and mutual information of the three variables. [sent-55, score-0.224]
32 The area of each circle corresponds to the entropy of a variable, while the intersection of two circles corresponds to their mutual information. [sent-56, score-0.186]
33 As the relevant and irrelevant variables are conditionally independent given the source, their mutual information vanishes when the source is known; thus all their overlap is included in the source's circle. [sent-57, score-0.213]
34 A graphical model representation of IB with side information. [sent-59, score-0.157]
35 Given the three variables, we seek a compact stochastic representation of the source which preserves information about the relevant variable but removes information about the irrelevant one. [sent-60, score-0.236]
36 The goal of information bottleneck with side information (IBSI) is therefore to find a stochastic map of the source to a new variable, in a way that maximizes its mutual information with the relevant variable and minimizes its mutual information about the irrelevant one. [sent-63, score-0.987]
37 The information bottleneck variational problem, introduced in [15], is a special case of our current variational problem, obtained when the weight on the irrelevant information is set to zero, namely, when no side or irrelevant information is available. [sent-67, score-0.752]
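To make the trade-off concrete, here is a minimal Python sketch of how such an IBSI functional could be evaluated; the notation (a source X, a relevant variable Y+, an irrelevant variable Y-, a compressed representation T, and multipliers beta and gamma) is our assumption rather than a transcription from the paper:

import numpy as np

def mi(p_ab):
    # I(A;B) in bits from a joint distribution p_ab[a, b]
    pa = p_ab.sum(axis=1, keepdims=True)
    pb = p_ab.sum(axis=0, keepdims=True)
    m = p_ab > 0
    return float(np.sum(p_ab[m] * np.log2(p_ab[m] / (pa * pb)[m])))

def ibsi_lagrangian(p_t_given_x, p_x, p_yp_given_x, p_ym_given_x, beta, gamma):
    # L = I(T;X) - beta * ( I(T;Y+) - gamma * I(T;Y-) ), to be minimized over the map p(t|x).
    # Shapes: p_t_given_x is (|X|, |T|) with rows summing to 1; p_yp_given_x is (|X|, |Y+|); p_ym_given_x is (|X|, |Y-|).
    p_xt = p_t_given_x * p_x[:, None]            # joint p(x, t)
    p_typ = p_xt.T @ p_yp_given_x                # joint p(t, y+), using the Markov structure T - X - Y+
    p_tym = p_xt.T @ p_ym_given_x                # joint p(t, y-)
    return mi(p_xt) - beta * (mi(p_typ) - gamma * mi(p_tym))

# Setting gamma = 0 recovers the usual IB functional, matching the special case mentioned above.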
39 Measuring the discriminability between the two conditional distributions, IBSI thus operates to extract a compact representation at a fixed level of this trade-off. [sent-76, score-0.097]
40 The formal solutions of the above variational problem have an exponential form which is a natural generalization of the solution of the original IB problem. [sent-78, score-0.12]
41 As in the original IB, when the trade-off parameter goes to infinity the solution collapses to a hard clustering, where the exponents reduce to binary cluster membership probabilities. [sent-79, score-0.2]
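The exponential-form solution is only described verbally above, so the following iterative sketch is our reconstruction of a plausible self-consistent update in that spirit; the KL-based effective distortion and all names are our assumptions, not the paper's exact equations:

import numpy as np

def kl_matrix(p_rows, q_rows, eps=1e-12):
    # Pairwise KL divergences: out[i, j] = D( p_rows[i] || q_rows[j] )
    out = np.zeros((p_rows.shape[0], q_rows.shape[0]))
    for i in range(p_rows.shape[0]):
        for j in range(q_rows.shape[0]):
            m = p_rows[i] > 0
            out[i, j] = np.sum(p_rows[i, m] * np.log(p_rows[i, m] / np.maximum(q_rows[j, m], eps)))
    return out

def ibsi_iterate(p_t_given_x, p_x, p_yp_given_x, p_ym_given_x, beta, gamma, n_iter=100):
    # Self-consistent iterations of an assumed exponential-form update:
    # p(t|x) proportional to p(t) * exp( -beta * [ KL(p(y+|x)||p(y+|t)) - gamma * KL(p(y-|x)||p(y-|t)) ] )
    for _ in range(n_iter):
        p_xt = p_t_given_x * p_x[:, None]                        # joint p(x, t)
        p_t = p_xt.sum(axis=0) + 1e-12                           # cluster prior p(t)
        p_yp_given_t = (p_xt.T @ p_yp_given_x) / p_t[:, None]    # decoder p(y+|t)
        p_ym_given_t = (p_xt.T @ p_ym_given_x) / p_t[:, None]    # decoder p(y-|t)
        d = kl_matrix(p_yp_given_x, p_yp_given_t) - gamma * kl_matrix(p_ym_given_x, p_ym_given_t)
        logits = np.log(p_t)[None, :] - beta * d
        logits -= logits.max(axis=1, keepdims=True)              # numerical stabilization
        p_t_given_x = np.exp(logits)
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)
    return p_t_given_x                                           # large beta drives rows toward hard 0/1 memberships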
44 4 Relation to Rate Distortion Theory with Side Information The problem formulated above is related to the theory of rate distortion with side information ([17],[2] p. [sent-91, score-0.51]
45 In rate distortion theory (RDT) a source variable is stochastically encoded into a representation variable, which is decoded at the other side of the channel with some distortion. [sent-93, score-0.648]
46 The achievable code rate at a given distortion level is bounded by the optimal rate, also known as the rate distortion function. [sent-94, score-0.532]
47 The optimal encoding is determined by a stochastic map, where the representation quantization is found by minimizing the average distortion. [sent-95, score-0.063]
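For a concrete handle on the rate distortion function itself, the classical Blahut-Arimoto iterations compute one point on the curve for a chosen trade-off parameter; this sketch is standard textbook material rather than code from the paper, and all names are ours:

import numpy as np

def blahut_arimoto(p_x, distortion, beta, n_iter=500):
    # p_x: source distribution, shape (n,); distortion[x, xhat]: distortion matrix, shape (n, m);
    # beta: Lagrange multiplier trading rate against average distortion.
    n, m = distortion.shape
    q_xhat = np.full(m, 1.0 / m)                             # reproduction marginal q(xhat)
    for _ in range(n_iter):
        w = q_xhat[None, :] * np.exp(-beta * distortion)     # unnormalized encoder
        enc = w / w.sum(axis=1, keepdims=True)               # encoder q(xhat | x)
        q_xhat = p_x @ enc                                   # re-estimate the marginal
    rate = float(np.sum(p_x[:, None] * enc * np.log2(enc / q_xhat[None, :])))
    avg_dist = float(np.sum(p_x[:, None] * enc * distortion))
    return enc, rate, avg_dist

# Binary source with Hamming distortion: sweeping beta traces out the R(D) curve
p = np.array([0.5, 0.5])
d = 1.0 - np.eye(2)
enc, rate, dist = blahut_arimoto(p, d, beta=2.0)
print(rate, dist)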
48 This rate can be improved by utilizing side information in the form of another variable that is known at both ends of the channel. [sent-97, score-0.309]
49 In this case, an improved rate can be achieved by avoiding sending information that can already be extracted from the side information. [sent-98, score-0.115]
50 Indeed, in this case the rate distortion function with this side information has a lower bound, below that of the standard problem, determined by the optimal quantization of the source under the distortion constraint (see [17] for details). [sent-99, score-0.78]
51 In the information bottleneck framework the average distortion is replaced by the mutual information about the relevant variable, while the rate-distortion function is turned into a convex curve that characterizes the complexity of the relation between the variables (see [15, 13]). [sent-100, score-0.784]
52 Similarly, IBSI avoids differentiating instances of the source that are informative about the relevant variable if they also contain information about the irrelevant one. [sent-101, score-0.102]
53 The irrelevant variable is analogous to the side information variable, while the relevant one is simply the "informative" variable of the original IB. [sent-102, score-0.38]
54 Whereas RDT with side information is a specific communication problem with some given (often arbitrary) distortion function, our problem is a general statistical non-parametric analysis technique that depends solely on the choice of the variables. [sent-104, score-0.454]
55 Many different pattern recognition and discriminative learning problems can be cast into this general information theoretic framework - far beyond the original setting of RDT with side information. [sent-105, score-0.428]
56 Unlike the original IB equations, convergence of the algorithm is no longer always guaranteed, simply because the problem is not guaranteed to have feasible solutions for all parameter values. [sent-108, score-0.062]
57 However, there exists a non-empty set of parameter values for which this algorithm is guaranteed to converge. [sent-109, score-0.064]
58 As in the case of IB, various heuristics can be applied, such as deterministic annealing, in which increasing the parameter is used to obtain finer clusters; greedy agglomerative hard clustering [13]; or a sequential K-means-like algorithm [12]. [sent-110, score-0.258]
59 The latter provides a good compromise between top-down annealing and agglomerative greedy approaches and achieves excellent performance. [sent-111, score-0.088]
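The sequential algorithm of [12] is only named here, so the following loop is a generic sketch of a sequential, K-means-like reassignment scheme under that reading; the score callback (which could, for instance, evaluate the IBSI functional of the induced hard partition) and all function names are our assumptions:

import numpy as np

def sequential_cluster(score, n_items, n_clusters, n_sweeps=20, seed=0):
    # Draw each item out of its cluster in turn and put it into the cluster that maximizes
    # a user-supplied objective score(assignment); stop when a full sweep changes nothing.
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_clusters, size=n_items)
    for _ in range(n_sweeps):
        changed = False
        for i in rng.permutation(n_items):
            old = assign[i]
            best_t, best_val = old, -np.inf
            for t in range(n_clusters):
                assign[i] = t
                val = score(assign)
                if val > best_val:
                    best_t, best_val = t, val
            assign[i] = best_t
            changed = changed or (best_t != old)
        if not changed:
            break
    return assign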
60 6 Applications We describe two applications of our method: a simple synthetic example, and a "real world" problem of hierarchical text categorization. [sent-113, score-0.127]
61 We also used IBSI to extract relevant features in face images, but these results will be published elsewhere due to space considerations. [sent-114, score-0.258]
62 1 A synthetic example To demonstrate the ability of our approach to uncover weak but interesting hidden structures in data, we designed a co-occurrences matrix contains two competing sub-structures (see figure 2A). [sent-116, score-0.281]
63 For demonstration purposes, the matrix was created such that the stronger structure can be observed on the left and the weaker structure on the right. [sent-117, score-0.172]
64 Compressing the source into two clusters while preserving information on the relevant variable using IB yields the clustering of figure 2B, in which the upper half of the source values are all clustered together. [sent-118, score-0.367]
65 This clustering follows from the strong structure on the left of 2A. [sent-119, score-0.21]
66 We now created a second co-occurrence matrix, to be used for identifying the relevant structure, in which each half of the source yields similar distributions. [sent-120, score-0.201]
67 Applying sequential IBSI now successfully ignores the strong but irrelevant structure and retrieves the weak structure. [sent-121, score-0.277]
68 Importantly, this is done in an unsupervised manner, without explicitly pointing to the strong but irrelevant structure. [sent-122, score-0.243]
69 This example was designed for demonstration purposes, thus the irrelevant structure is strongly manifested in the side data. [sent-123, score-0.43]
70 The next example shows that our approach is also useful for real data, in which structures are much more covert. [sent-124, score-0.169]
71 Clustering into two clusters using the information bottleneck method separates upper and lower values of the source, according to the stronger structure. [sent-133, score-0.397]
72 A joint distribution that contains a single structure, similar in nature to the stronger structure . [sent-135, score-0.119]
73 Clustering into two clusters using IBSI successfully extracts the weaker structure. [sent-137, score-0.273]
74 Figure 3 (x-axis: n chosen clusters): A. [sent-142, score-0.138]
75 An illustration of the 20 newsgroups hierarchical data we used. [sent-143, score-0.128]
76 6.2 Hierarchical text categorization Text categorization is a fundamental task in information retrieval. [sent-152, score-0.257]
77 Typically, one has to group a large set of texts into groups of homogeneous subjects. [sent-153, score-0.086]
78 Recently, Slonim and colleagues showed that the IB method achieves categorization that predicts manually predefined categories with great accuracy, and largely outperforms competing methods [12]. [sent-154, score-0.151]
79 Clearly, this unsupervised task becomes more difficult when the texts have similar subjects, because alternative categories are extracted instead of the “correct” one. [sent-155, score-0.111]
80 This problem can be alleviated by using side information in the form of additional documents from other categories. [sent-156, score-0.455]
81 This is specifically useful in hierarchical document categorization, in which known categories are refined by grouping documents into sub-categories. [sent-157, score-0.318]
82 IBSI can be applied to this problem by operating on the term-document co-occurrence matrix while using the other top-level groups for focusing on the relevant structures. [sent-159, score-0.204]
83 This database consists of 20 equal-sized groups of documents, hierarchically organized into groups according to their content (figure 3A). [sent-163, score-0.092]
84 We aimed to cluster documents that belong to two newsgroups from the supergroup of computer documents and have very similar subjects comp. [sent-164, score-0.467]
85 As side information we used all documents from the super group of science ( sci. [sent-172, score-0.397]
86 To demonstrate the power of IBSI we used double clustering to separate documents into two groups. [sent-177, score-0.351]
87 The goal of the first clustering phase is to use IBSI to identify clusters of terms that extract the relevant structures of the data. [sent-178, score-0.745]
88 The goal of the second clustering phase is simply to provide a quantitative measure for the quality of the features extracted in the first phase. [sent-179, score-0.214]
89 We therefore performed the following procedure: First, the most frequent 2000 words in these documents were clustered into word clusters using IBSI. [sent-180, score-0.377]
90 Then, word clusters were sorted by a single-cluster score, and the clusters with the highest score were chosen. [sent-181, score-0.309]
91 The performance of this process is evaluated by measuring the overlap of the resulting clusters with the manually classified groups. [sent-183, score-0.138]
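As an illustration of how such an overlap can be scored (our sketch; the paper does not spell out the evaluation code), the accuracy of a two-way clustering against manual labels can be taken as the best match over relabelings of the clusters:

import numpy as np
from itertools import permutations

def clustering_accuracy(pred, truth, n_clusters=2):
    # Fraction of items whose cluster matches the manual label, maximized over cluster relabelings
    # (clusters carry no intrinsic names).
    pred, truth = np.asarray(pred), np.asarray(truth)
    best = 0.0
    for perm in permutations(range(n_clusters)):
        mapped = np.array([perm[c] for c in pred])
        best = max(best, float(np.mean(mapped == truth)))
    return best

# Toy check with two groups of 50 documents and 37 misassigned ones
truth = np.repeat([0, 1], 50)
pred = truth.copy()
pred[:37] = 1 - pred[:37]
print(clustering_accuracy(pred, truth))      # 0.63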
92 Using IBSI successfully improves mean clustering accuracy from about 55 percent to about 63 percent. [sent-188, score-0.232]
93 7 Discussion and Further Research We have presented an information theoretic approach for extracting relevant structures from data, by utilizing additional data known to share irrelevant structures with the relevant data. [sent-191, score-1.157]
94 Naturally, the choice of side data may considerably influence the solutions obtained with IBSI, simply because using different irrelevant variables is equivalent to asking different questions about the data analysed. [sent-192, score-0.395]
95 In practice, side data can be naturally defined in numerous applications, in particular in exploratory analysis of scientific experiments, e. [sent-193, score-0.234]
96 While the current work is based on clustering to compress the source, the notion of extracting relevance through side information can be extended to other forms of dimensionality reduction, such as non-linear embedding on low-dimensional manifolds. [sent-196, score-0.645]
97 In particular, side information can be naturally combined with information theoretic modeling approaches such as SDR [5]. [sent-197, score-0.392]
98 Clustering based on conditional distribution in an auxiliary space. [sent-275, score-0.071]
99 Document clustering using word clusters via the information bottleneck method. [sent-292, score-0.561]
100 The rate distortion function for source coding with side information at the decoder. [sent-314, score-0.543]
wordName wordTfidf (topN-words)
[('ibsi', 0.463), ('distortion', 0.238), ('irrelevant', 0.208), ('documents', 0.181), ('xdd', 0.174), ('clustering', 0.17), ('ib', 0.17), ('structures', 0.169), ('bottleneck', 0.161), ('relevant', 0.158), ('side', 0.157), ('da', 0.139), ('clusters', 0.138), ('slonim', 0.134), ('icting', 0.126), ('extracting', 0.116), ('tissues', 0.116), ('xwv', 0.115), ('mutual', 0.109), ('dd', 0.101), ('sigir', 0.096), ('dca', 0.092), ('rdt', 0.087), ('hg', 0.087), ('theoretic', 0.083), ('variable', 0.082), ('categorization', 0.078), ('newsgroups', 0.075), ('rt', 0.073), ('auxiliary', 0.071), ('extract', 0.066), ('tishby', 0.064), ('information', 0.059), ('alleviated', 0.058), ('dimentionality', 0.058), ('gca', 0.058), ('globerson', 0.058), ('therefor', 0.058), ('venn', 0.058), ('uw', 0.058), ('variables', 0.056), ('rate', 0.056), ('agglomerative', 0.055), ('variational', 0.054), ('hierarchical', 0.053), ('genes', 0.053), ('demonstration', 0.053), ('compress', 0.05), ('document', 0.048), ('interfere', 0.046), ('groups', 0.046), ('aa', 0.045), ('circle', 0.045), ('goal', 0.044), ('classi', 0.044), ('contain', 0.043), ('healthy', 0.043), ('uncover', 0.043), ('numerous', 0.043), ('identifying', 0.043), ('friedman', 0.043), ('text', 0.042), ('texts', 0.04), ('ga', 0.04), ('joint', 0.04), ('structure', 0.04), ('discriminative', 0.04), ('stronger', 0.039), ('competing', 0.037), ('utilizing', 0.037), ('distributional', 0.037), ('formal', 0.036), ('categories', 0.036), ('lagrangian', 0.035), ('unsupervised', 0.035), ('relevance', 0.035), ('pereira', 0.034), ('face', 0.034), ('naturally', 0.034), ('word', 0.033), ('source', 0.033), ('cast', 0.033), ('annealing', 0.033), ('tradeoff', 0.033), ('hebrew', 0.033), ('accuracy', 0.033), ('gure', 0.032), ('guaranteed', 0.032), ('synthetic', 0.032), ('conditionally', 0.032), ('gene', 0.032), ('quantization', 0.032), ('intersection', 0.032), ('exist', 0.032), ('compact', 0.031), ('stochastic', 0.031), ('cluster', 0.03), ('solutions', 0.03), ('israel', 0.03), ('successfully', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 83 nips-2002-Extracting Relevant Structures with Side Information
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.
2 0.20337978 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model
Author: Tatsuto Murayama, Masato Okada
Abstract: We applied statistical mechanics to an inverse problem of linear mapping to investigate the physics of optimal lossy compressions. We used the replica symmetry breaking technique with a toy model to demonstrate Shannon’s result. The rate distortion function, which is widely known as the theoretical limit of the compression with a fidelity criterion, is derived. Numerical study shows that sparse constructions of the model provide suboptimal compressions. 1
3 0.18769373 142 nips-2002-Maximum Likelihood and the Information Bottleneck
Author: Noam Slonim, Yair Weiss
Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution , this method constructs a new variable that defines partitions over the values of that are informative about . Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related ? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the or problems are strongly related. In fact, for uniform input distribution over for large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems, induces a solution for the other. ©§ ¥£ ¨¦¤¢
4 0.13145724 27 nips-2002-An Impossibility Theorem for Clustering
Author: Jon M. Kleinberg
Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median. 1
5 0.12527156 90 nips-2002-Feature Selection in Mixture-Based Clustering
Author: Martin H. Law, Anil K. Jain, Mário Figueiredo
Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-informationbased feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.
6 0.11397744 30 nips-2002-Annealing and the Rate Distortion Problem
7 0.11168272 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
8 0.10812385 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
10 0.1066402 115 nips-2002-Informed Projections
11 0.10349868 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
12 0.10020187 188 nips-2002-Stability-Based Model Selection
13 0.096893586 53 nips-2002-Clustering with the Fisher Score
14 0.091251671 125 nips-2002-Learning Semantic Similarity
15 0.091019936 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
16 0.088599212 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
17 0.086212985 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
18 0.083623424 98 nips-2002-Going Metric: Denoising Pairwise Data
19 0.081117131 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
20 0.079762302 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
topicId topicWeight
[(0, -0.223), (1, -0.073), (2, 0.025), (3, -0.003), (4, -0.139), (5, 0.095), (6, -0.166), (7, -0.245), (8, -0.111), (9, 0.034), (10, -0.047), (11, -0.033), (12, -0.059), (13, 0.019), (14, -0.039), (15, -0.049), (16, -0.007), (17, -0.176), (18, -0.187), (19, 0.031), (20, 0.076), (21, -0.055), (22, -0.044), (23, -0.17), (24, 0.085), (25, 0.101), (26, 0.098), (27, -0.077), (28, 0.003), (29, 0.01), (30, 0.132), (31, -0.116), (32, 0.099), (33, 0.08), (34, -0.036), (35, -0.121), (36, 0.026), (37, 0.067), (38, 0.057), (39, -0.046), (40, 0.035), (41, 0.063), (42, 0.013), (43, -0.01), (44, -0.009), (45, -0.025), (46, -0.026), (47, -0.027), (48, 0.052), (49, 0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.957151 83 nips-2002-Extracting Relevant Structures with Side Information
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.
2 0.72855878 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model
Author: Tatsuto Murayama, Masato Okada
Abstract: We applied statistical mechanics to an inverse problem of linear mapping to investigate the physics of optimal lossy compressions. We used the replica symmetry breaking technique with a toy model to demonstrate Shannon’s result. The rate distortion function, which is widely known as the theoretical limit of the compression with a fidelity criterion, is derived. Numerical study shows that sparse constructions of the model provide suboptimal compressions. 1
3 0.63611782 30 nips-2002-Annealing and the Rate Distortion Problem
Author: Albert E. Parker, Tomá\v S. Gedeon, Alexander G. Dimitrov
Abstract: In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. 1
4 0.62272871 142 nips-2002-Maximum Likelihood and the Information Bottleneck
Author: Noam Slonim, Yair Weiss
Abstract: The information bottleneck (IB) method is an information-theoretic formulation for clustering problems. Given a joint distribution , this method constructs a new variable that defines partitions over the values of that are informative about . Maximum likelihood (ML) of mixture models is a standard statistical approach to clustering problems. In this paper, we ask: how are the two methods related ? We define a simple mapping between the IB problem and the ML problem for the multinomial mixture model. We show that under this mapping the or problems are strongly related. In fact, for uniform input distribution over for large sample size, the problems are mathematically equivalent. Specifically, in these cases, every fixed point of the IB-functional defines a fixed point of the (log) likelihood and vice versa. Moreover, the values of the functionals at the fixed points are equal under simple transformations. As a result, in these cases, every algorithm that solves one of the problems, induces a solution for the other. ©§ ¥£ ¨¦¤¢
Author: Dmitry Y. Pavlov, David M. Pennock
Abstract: We develop a maximum entropy (maxent) approach to generating recommendations in the context of a user’s current navigation stream, suitable for environments where data is sparse, high-dimensional, and dynamic— conditions typical of many recommendation applications. We address sparsity and dimensionality reduction by first clustering items based on user access patterns so as to attempt to minimize the apriori probability that recommendations will cross cluster boundaries and then recommending only within clusters. We address the inherent dynamic nature of the problem by explicitly modeling the data as a time series; we show how this representational expressivity fits naturally into a maxent framework. We conduct experiments on data from ResearchIndex, a popular online repository of over 470,000 computer science documents. We show that our maxent formulation outperforms several competing algorithms in offline tests simulating the recommendation of documents to ResearchIndex users.
6 0.51887411 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
7 0.50027335 27 nips-2002-An Impossibility Theorem for Clustering
8 0.48848152 188 nips-2002-Stability-Based Model Selection
9 0.43312508 115 nips-2002-Informed Projections
10 0.41353238 90 nips-2002-Feature Selection in Mixture-Based Clustering
11 0.39768511 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval
12 0.39576134 1 nips-2002-"Name That Song!" A Probabilistic Approach to Querying on Music and Text
13 0.38012716 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
14 0.36382094 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
15 0.34487876 53 nips-2002-Clustering with the Fisher Score
16 0.34448594 107 nips-2002-Identity Uncertainty and Citation Matching
17 0.33450314 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
18 0.32519513 98 nips-2002-Going Metric: Denoising Pairwise Data
19 0.31947228 124 nips-2002-Learning Graphical Models with Mercer Kernels
20 0.31052536 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
topicId topicWeight
[(3, 0.017), (11, 0.015), (14, 0.318), (23, 0.019), (42, 0.077), (54, 0.119), (55, 0.044), (67, 0.035), (68, 0.019), (74, 0.123), (92, 0.035), (98, 0.105)]
simIndex simValue paperId paperTitle
1 0.84677702 150 nips-2002-Multiple Cause Vector Quantization
Author: David A. Ross, Richard S. Zemel
Abstract: We propose a model that can learn parts-based representations of highdimensional data. Our key assumption is that the dimensions of the data can be separated into several disjoint subsets, or factors, which take on values independently of each other. We assume each factor has a small number of discrete states, and model it using a vector quantizer. The selected states of each factor represent the multiple causes of the input. Given a set of training examples, our model learns the association of data dimensions with factors, as well as the states of each VQ. Inference and learning are carried out efficiently via variational algorithms. We present applications of this model to problems in image decomposition, collaborative filtering, and text classification.
same-paper 2 0.82662892 83 nips-2002-Extracting Relevant Structures with Side Information
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.
3 0.81886327 96 nips-2002-Generalized² Linear² Models
Author: Geoffrey J. Gordon
Abstract: We introduce the Generalized 2 Linear 2 Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)2M approximately decomposes a rectangular matrix X into a simpler representation j(g(A)h(B)). Here A and Bare low-rank matrices, while j, g, and h are link functions. (GL)2Ms include many useful models as special cases, including principal components analysis, exponential-family peA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)2M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new. 1
4 0.62330151 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
Author: Chakra Chennubhotla, Allan D. Jepson
Abstract: Using a Markov chain perspective of spectral clustering we present an algorithm to automatically find the number of stable clusters in a dataset. The Markov chain’s behaviour is characterized by the spectral properties of the matrix of transition probabilities, from which we derive eigenflows along with their halflives. An eigenflow describes the flow of probability mass due to the Markov chain, and it is characterized by its eigenvalue, or equivalently, by the halflife of its decay as the Markov chain is iterated. A ideal stable cluster is one with zero eigenflow and infinite half-life. The key insight in this paper is that bottlenecks between weakly coupled clusters can be identified by computing the sensitivity of the eigenflow’s halflife to variations in the edge weights. We propose a novel E IGEN C UTS algorithm to perform clustering that removes these identified bottlenecks in an iterative fashion.
5 0.60046983 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
Author: Nicholas Roy, Geoffrey J. Gordon
Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, highdimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.
6 0.59880185 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
7 0.59130627 2 nips-2002-A Bilinear Model for Sparse Coding
8 0.58782053 27 nips-2002-An Impossibility Theorem for Clustering
9 0.58781785 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
10 0.57850277 53 nips-2002-Clustering with the Fisher Score
11 0.57464969 8 nips-2002-A Maximum Entropy Approach to Collaborative Filtering in Dynamic, Sparse, High-Dimensional Domains
12 0.57403833 188 nips-2002-Stability-Based Model Selection
13 0.57237631 40 nips-2002-Bayesian Models of Inductive Generalization
14 0.57184082 169 nips-2002-Real-Time Particle Filters
15 0.56566244 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
16 0.56363636 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
17 0.56129462 190 nips-2002-Stochastic Neighbor Embedding
18 0.55718732 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
19 0.55685747 142 nips-2002-Maximum Likelihood and the Information Bottleneck
20 0.55683857 3 nips-2002-A Convergent Form of Approximate Policy Iteration