nips nips2013 nips2013-5 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhengdong Lu, Hang Li
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Many machine learning problems can be interpreted as learning for matching two types of objects (e. [sent-9, score-0.39]
2 The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. [sent-13, score-0.643]
3 This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. [sent-14, score-0.631]
4 In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. [sent-15, score-0.848]
5 More specifically, we apply this model to matching tasks in natural language, e. [sent-16, score-0.29]
6 This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. [sent-19, score-0.439]
7 1 Introduction Many machine learning problems can be interpreted as matching two objects, e. [sent-20, score-0.29]
8 , texts and images), and it is usually associated with a particular purpose. [sent-25, score-0.206]
9 The degree of matching is typically modeled as an inner product of the feature vectors representing objects x and y in a Hilbert space H, match(x, y) = ⟨Φ_X(x), Φ_Y(y)⟩_H (1), while the modeling effort boils down to finding the mapping from the original inputs to the feature vectors. [sent-26, score-0.507]
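To make equation (1) concrete, here is a minimal numerical sketch of the inner-product schema; the linear maps Wx and Wy standing in for Φ_X and Φ_Y, the dimensions, and the random inputs are purely illustrative assumptions, not taken from the paper.

```python
import numpy as np

def match_inner_product(phi_x, phi_y):
    # Eq. (1): match(x, y) = <Phi_X(x), Phi_Y(y)>_H, an inner product of the
    # two objects' feature vectors after mapping them into a shared space H.
    return float(np.dot(phi_x, phi_y))

# Toy linear feature maps into a 5-dimensional shared space (illustrative only).
rng = np.random.default_rng(0)
Wx, Wy = rng.normal(size=(5, 3)), rng.normal(size=(5, 4))  # stand-ins for Phi_X, Phi_Y
x_raw, y_raw = rng.random(3), rng.random(4)                # raw object representations
print(match_inner_product(Wx @ x_raw, Wy @ y_raw))
```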
10 In this paper, we focus on a rather difficult task of matching a given short text and candidate responses. [sent-29, score-0.452]
11 This inner-product based schema, although proven effective on tasks like information retrieval, is often incapable of modeling the matching between complicated objects. [sent-31, score-0.374]
12 First, representing structured objects like text as compact and meaningful vectors can be difficult; second, an inner product cannot sufficiently take into account the complicated, often rather nonlinear, interaction between components within the objects. [sent-32, score-0.365]
13 In this paper, we attack the problem of matching short texts from a brand new angle. [sent-33, score-0.568]
14 Instead of representing the text objects in each domain as semantically meaningful vectors, we directly model object-object interactions with a deep architecture. [sent-34, score-0.428]
15 This new architecture allows us to explicitly capture the natural nonlinearity and the hierarchical structure in matching two structured objects. [sent-35, score-0.603]
16 The bilinear matching model decides the score for any pair (x, y) as match(x, y) = xᵀA y = Σ_{m=1..Dx} Σ_{n=1..Dy} A_{nm} x_m y_n, (2) with a pre-determined A. [sent-38, score-0.393]
17 From a different angle, each element product x_m y_n in the above sum can be viewed as a micro, local decision about the matching level of x and y. [sent-39, score-0.482]
18 The final decision is made considering all the local decisions; in the bilinear case, match(x, y) = Σ_{n,m} A_{nm} M_{nm} simply sums all the local decisions with weights specified by A, as illustrated in Figure 1. [sent-42, score-0.443]
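A small sketch of this view of the bilinear case, showing that equation (2) is exactly a weighted sum of the local decisions M[m, n] = x[m]·y[n]; the dimensions and random values are illustrative assumptions.

```python
import numpy as np

def match_bilinear(x, y, A):
    # Local decisions: every pairwise product M[m, n] = x[m] * y[n].
    M = np.outer(x, y)            # shape (Dx, Dy)
    # Eq. (2): the bilinear model sums all local decisions, weighted by A.
    return float(np.sum(A * M))   # equals x^T A y

Dx, Dy = 4, 3
rng = np.random.default_rng(0)
x, y, A = rng.random(Dx), rng.random(Dy), rng.random((Dx, Dy))
assert np.isclose(match_bilinear(x, y, A), x @ A @ y)
```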
19 1 From Linear to Deep This simple summarization strategy can be extended to a deep architecture to explore the nonlinearity and hierarchy in matching short texts. [sent-44, score-0.875]
20 Unlike tasks like text classification, we need to work on a pair of text objects to be matched, which we refer to as parallel texts, a term borrowed from machine translation. [sent-45, score-0.333]
21 This new architecture is mainly based on the following two intuitions: Localness: there is a salient local structure in the semantic space of parallel text objects to be matched, which can be roughly captured via the co-occurrence pattern of words across the objects. [sent-46, score-0.663]
22 This localness, however, should not prevent two “distant” components from correlating with each other at a higher level, which calls for the hierarchical characteristic of our model; Hierarchy: the decision making for matching has different levels of abstraction. [sent-47, score-0.51]
23 The local decisions, capturing the interaction between semantically close words, will be combined later layer-bylayer to form the final and global decision on matching. [sent-48, score-0.268]
24 2 Localness The localness of the text matching problem can be best described using an analogy with patches in images, as illustrated in Figure 2. [sent-50, score-0.746]
25 Loosely speaking, a patch for parallel texts defines the set of interacting pairs of words from the two text objects. [sent-51, score-0.711]
26 Like the patches of images, the patches defined here are meant to capture meaningful segments of the interaction space. (Figure 2: Image patches vs. text patches.) [sent-53, score-0.756]
27 But unlike the naturally formed rectangular patches of images, the patches defined here do not come with a pre-given spatial continuity. [sent-56, score-0.504]
28 , fever—antibiotics), they are likely to have a strong interaction in determining the matching score, and 2) when the words co-occur frequently in the same domain (e. [sent-61, score-0.473]
29 , {Hawaii,vacation}), they are likely to collaborate in making the matching decision. [sent-63, score-0.29]
30 For example, modeling the matching between the word “Hawaii” in a question (likely to be a travel-related question) and the word “RAM” in an answer (likely an answer to a computer-related question) is probably useless, judging from their co-occurrence pattern in Question-Answer pairs. [sent-64, score-0.427]
31 In other words, our architecture models only “local” pairwise relations on a low level with patches, while describing the interaction between semantically distant terms on higher levels in the hierarchy. [sent-65, score-0.414]
32 3 Hierarchy Once the local decisions on patches are made (most of them are NULL for a particular short text pair), they will be sent to the next layer, where the lower-level decisions are further combined to form more composite decisions, which in turn will be sent to still higher levels. [sent-67, score-0.84]
33 For example, the “WEATHER” patch may belong to higher level patches “TRAVEL” and “AGRICULTURE” at the same time. [sent-72, score-0.59]
34 A more complicated strategy is often needed for, say, a decision on “TRAVELING”, which often takes more than one element, like “SIGHTSEEING”, “HOTEL”, and “TRANSPORTATION”. (Figure 3: An example of decision hierarchy.) [sent-75, score-0.263]
35 The particular strategy taken by a local decision composition unit is fully encoded in the weights of the corresponding neuron through s_p(x, y) = f(w_pᵀ Φ_p(x, y)), (3) where f is the activation function. [sent-77, score-0.403]
36 Here we decide the hierarchical architecture of the decision making, but leave the exact mechanism for combining decisions (encoded in the weights) to the learning algorithm described later. [sent-79, score-0.478]
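The sketch below shows one such composition unit in the sense of equation (3); the bias term b_p (not written explicitly in eq. (3), though the later weight notation includes b_p), the sigmoid choice for f, and the toy lower-level decisions are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def local_decision(phi_p, w_p, b_p=0.0, f=sigmoid):
    # Eq. (3): s_p(x, y) = f(w_p^T Phi_p(x, y) + b_p).  Phi_p collects the
    # lower-level decisions feeding patch p; the combination strategy is
    # encoded entirely in the learned weights w_p.
    return f(np.dot(w_p, phi_p) + b_p)

# e.g. a "TRAVELING" unit fusing three lower-level decisions (illustrative values).
lower = np.array([0.9, 0.1, 0.7])   # SIGHTSEEING, HOTEL, TRANSPORTATION
w = np.array([1.5, 0.5, 1.0])
print(local_decision(lower, w, b_p=-1.0))
```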
37 3 The Construction of Deep Architecture The process for constructing the deep architecture for matching consists of two steps. [sent-80, score-0.697]
38 First, we define parallel text patches with different resolutions using bilingual topic models. [sent-81, score-0.541]
39 For example, the word hotel appearing in domain X is treated as a different word from hotel in domain Y. [sent-86, score-0.398]
40 As we can see intuitively, in the same topic, a word in domain X co-occurs frequently not only with words in the same domain, but also with those in domain Y. [sent-92, score-0.22]
41 We fit the same corpus with L topic models of decreasing resolutions, with the series of learned topic sets denoted H = {T_1, · · · , T_ℓ, · · · , T_L}, where ℓ indexes the topic resolution. [sent-93, score-0.318]
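A hedged sketch of this step: the paper uses bilingual topic models, whereas the code below substitutes plain LDA (scikit-learn) run on the pooled (x, y) texts; the x_/y_ word prefixing, the resolution values, and all function names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def tag(text, domain):
    # Mark each word with its domain so "hotel" in X and "hotel" in Y are distinct terms.
    return " ".join(f"{domain}_{w}" for w in text.lower().split())

def fit_topic_hierarchy(pairs, resolutions=(400, 100, 25)):
    # Fit L topic models of decreasing resolution on the concatenated (x, y) texts.
    docs = [tag(x, "x") + " " + tag(y, "y") for x, y in pairs]
    vec = CountVectorizer(token_pattern=r"\S+")
    counts = vec.fit_transform(docs)
    models = [LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
              for k in resolutions]
    return models, vec.get_feature_names_out()

pairs = [("hawaii vacation", "hotel sightseeing"),
         ("fever symptoms", "antibiotics doctor")]
models, vocab = fit_topic_hierarchy(pairs, resolutions=(4, 2))
```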
42 2 Getting Matching Architecture With the set of topics H, the architecture of the deep matching model can then be obtained in the following three steps. [sent-100, score-0.75]
43 First, we trim the low-probability words (in both domains X and Y) for each topic T ∈ H, and the remaining words in each topic specify a patch p. [sent-101, score-0.669]
44 With a slight abuse of symbols, we still use H to denote the patch sets with different resolutions. [sent-102, score-0.302]
45 Second, based on the patches specified in H, we construct a layered DAG G by assigning each patch with resolution ℓ to a number of patches with resolution ℓ − 1, based on the word overlap between patches, as illustrated in Figure 4 (left panel). [sent-103, score-0.936]
46 If a patch p in layer ℓ − 1 is assigned to patch p′ in layer ℓ, we denote this relation as p ⊂ p′ (footnote 2).
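As a rough sketch of this second step, the code below trims each topic to its top words and links patches across adjacent resolutions by word overlap; the exact assignment rule used here (keep every fine patch with nonzero overlap, falling back to the top-m) is my assumption, since the text only states that the assignment is based on word overlap with at least m links.

```python
def trim_topic(word_probs, keep=50):
    # Keep only the highest-probability words of a topic; they define a patch.
    top = sorted(word_probs.items(), key=lambda kv: -kv[1])[:keep]
    return set(w for w, _ in top)

def assign_patches(coarse_patches, fine_patches, m=2):
    # One layer of the DAG G: each patch at resolution l is linked to the
    # patches at resolution l-1 it shares words with (at least m of them).
    dag = {}
    for i, P in enumerate(coarse_patches):
        ranked = sorted(range(len(fine_patches)),
                        key=lambda j: len(P & fine_patches[j]), reverse=True)
        with_overlap = [j for j in ranked if P & fine_patches[j]]
        dag[i] = with_overlap if len(with_overlap) >= m else ranked[:m]
    return dag

coarse = [{"hotel", "flight", "sightseeing", "weather"}]
fine = [{"hotel", "flight"}, {"weather", "rain"}, {"fever", "antibiotics"}]
print(assign_patches(coarse, fine, m=2))   # {0: [0, 1]}
```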
47 Third, based on G, we can construct the architecture of the patch-induced layers of the neural network. [sent-105, score-0.399]
48 More specifically, each patch p in layer ℓ will be transformed into K neurons in the (ℓ − 1)-th hidden layer of the neural network, and these K neurons are connected to the neurons in the ℓ-th layer corresponding to patch p′ iff p ⊂ p′. [sent-106, score-1.67]
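A sketch of the resulting sparse connectivity, continuing the toy DAG from the previous sketch; the boolean-mask representation and the way the K filters per patch are laid out are implementation assumptions, not details from the paper.

```python
import numpy as np

def build_connectivity(dag, n_coarse, n_fine, K=2):
    # Every patch is expanded into K filters; a filter of a coarse patch is
    # connected to the K filters of a fine patch only if the two patches are
    # linked in the DAG G.  All other weights stay at zero.
    mask = np.zeros((n_coarse * K, n_fine * K), dtype=bool)
    for coarse_idx, fine_indices in dag.items():
        for fine_idx in fine_indices:
            mask[coarse_idx*K:(coarse_idx+1)*K, fine_idx*K:(fine_idx+1)*K] = True
    return mask

print(build_connectivity({0: [0, 1]}, n_coarse=1, n_fine=3, K=2).astype(int))
```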
49 Using the image analogy, the neurons corresponding to patch p are referred to as filters. [sent-108, score-0.452]
50 Figure 4 illustrates the process of transforming patches in layer ℓ − 1 (specific topics) and layer ℓ (general topics) into two layers of the neural network with K = 2. [sent-109, score-0.765]
51 Figure 4: An illustration of constructing the deep architecture from hierarchical patches (left: patches; right: neural network). [sent-110, score-0.695]
52 The input layer is a two-dimensional interaction space, which connects to the first patch-induced layer p-layerI followed by the second patchinduced layer p-layerII. [sent-112, score-0.7]
53 Following p-layerII is a committee layer (c-layer), with full connections from p-layerII. [sent-114, score-0.282]
54 With an input (x, y), we first get the local matching decisions on p-layerI, associated with patches in the interaction space. [sent-115, score-0.829]
55 Those local decisions will be sent to the corresponding neurons in p-layerII to get the first round of fusion. [sent-116, score-0.422]
56 Finally the logistic regression unit in the output layer summarizes the decisions on c-layer to get the final matching score s(x, y). [sent-118, score-0.653]
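The following schematic forward pass summarizes this flow from local decisions to the final score; the shapes, the bias terms, and the all-sigmoid choice are simplifying assumptions (the paper also allows linear units), so this is a sketch of the data flow rather than the exact model.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def deep_match_forward(s_p1, W12, b2, mask12, Wc, bc, w_out, b_out):
    # p-layerI decisions s_p1 are fused in p-layerII through the sparse,
    # DAG-induced connections, combined in the committee layer, and summarized
    # by a logistic unit into the matching score s(x, y).
    h2 = sigmoid((W12 * mask12) @ s_p1 + b2)   # p-layerII (sparse connections)
    hc = sigmoid(Wc @ h2 + bc)                 # committee layer (fully connected)
    return sigmoid(w_out @ hc + b_out)         # output: matching score in (0, 1)

rng = np.random.default_rng(0)
n1, n2, nc = 8, 4, 3
mask12 = rng.random((n2, n1)) < 0.3            # toy sparse connectivity
score = deep_match_forward(rng.random(n1), rng.random((n2, n1)), rng.random(n2),
                           mask12, rng.random((nc, n2)), rng.random(nc),
                           rng.random(nc), 0.0)
print(float(score))
```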
57 This architecture is referred to as DeepMatch in the remainder of the paper. [sent-119, score-0.266]
58 Figure 5: An illustration of the deep architecture for matching decisions. [sent-120, score-0.697]
59 2 In the assignment, we make sure each patch in layer ℓ is assigned to at least m patches in layer ℓ − 1. [sent-121, score-0.944]
60 The first type of sparsity is enforced through the architecture, since most of the connections between neurons in adjacent layers are turned off by construction. [sent-124, score-0.277]
61 For most object pairs in our experiment, only a small percentage of neurons in the lower layers are active (see Section 5 for more details). [sent-127, score-0.237]
62 This is mainly due to two factors: 1) the input parallel texts are very short (usually < 100 words), and 2) the patches are well designed to give a compact and sparse representation of each of the texts, as described in Section 3. [sent-128, score-0.583]
63 An input pair (x, y) overlaps with patch p iff x ∩ p_x ≠ ∅ and y ∩ p_y ≠ ∅, where p_x and p_y are respectively the word indices of patch p in domains X and Y. [sent-132, score-1.152]
64 We also define the following indicator function: overlap((x, y), p) := 1[|p_x ∩ x| > 0] · 1[|p_y ∩ y| > 0]. [sent-133, score-0.221]
65 The proposed architecture only allows neurons associated with patches that overlap with the input to have nonzero output. [sent-134, score-0.668]
66 More specifically, the output of a neuron associated with patch p is s_p(x, y) = a_p(x, y) · overlap((x, y), p), (4) which ensures that s_p(x, y) can be nonzero only when there is non-empty cross-talk between x and y within patch p, where a_p(x, y) is the activation of the neuron before this rule is enforced. [sent-135, score-1.152]
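A minimal sketch of the overlap gating in equation (4), with a patch represented as a (p_x, p_y) pair of word sets and the activation a_p given as a number; the set representation is an assumption for illustration.

```python
def overlap(x_words, y_words, patch):
    # Indicator: 1.0 iff the input pair touches the patch on both sides.
    p_x, p_y = patch
    return float(bool(x_words & p_x) and bool(y_words & p_y))

def gated_output(a_p, x_words, y_words, patch):
    # Eq. (4): s_p(x, y) = a_p(x, y) * overlap((x, y), p)
    return a_p * overlap(x_words, y_words, patch)

patch = ({"hawaii", "vacation"}, {"hotel", "sightseeing"})
print(gated_output(0.8, {"hawaii", "beach"}, {"hotel"}, patch))   # 0.8
print(gated_output(0.8, {"ram", "laptop"}, {"hotel"}, patch))     # 0.0
```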
67 It is not hard to see that, for any input (x, y), when we track any upward path of decisions from the input to a higher level, there is no nonzero matching vote until we reach a patch that contains terms from both x and y. [sent-136, score-0.726]
68 This view is particularly useful in parameter tuning with back-propagation: the supervision signal can only get down to a patch p when it overlaps with input (x, y). [sent-137, score-0.336]
69 It is easy to show from the definition that once the supervision signal stops at a patch p, it will not get past p and propagate to p’s children, even if those children have other ancestors. [sent-138, score-0.336]
70 4 Local Decision Models In the hidden layers p-layerI, p-layerII, and c-layer, we allow two types of neurons, corresponding to two activation functions: 1) linear f_lin(t) = t, and 2) sigmoid f_sig(t) = (1 + e^{−t})^{−1}. [sent-141, score-0.256]
71 As indicated in Figure 5, the two-dimensional structure is lost after leaving the input layer, while the local structure is kept in the second patch-induced layer p-layerII. [sent-144, score-0.245]
72 The local decision models in the committee layer c-layer are the same as in p-layerII, except that they are fully connected to neurons in the previous layer. [sent-146, score-0.588]
73 , the weights between p-layerI and p-layerII, denoted (w_p^(k), b_p^(k)) for the associated patch p and filter index 1 ≤ k ≤ K_2, and 3) the weights for the committee layer (c-layer) and after, denoted w_c. [sent-149, score-0.712]
74 With this type of optimization, most of the patches in p-layerI and p-layerII get zero inputs, and therefore remain inactive by definition during the prediction as well as updating process. [sent-166, score-0.286]
75 Moreover, during stochastic gradient descent, only about 5% of neurons in p-layerI and p-layerII are active on average for each training instance, indicating that the designed architecture has greatly reduced the essential capacity of the model. [sent-168, score-0.416]
76 5 Experiments We compare our deep matching model to the inner-product based models, ranging from variants of bilinear models to nonlinear mappings for ΦX (·) and ΦY (·). [sent-169, score-0.623]
77 Please note that we omit the nonlinear models for shared representation [13, 18, 17], since they are essentially also inner-product based models (when used for matching) and are not designed to deal with short texts with a large vocabulary. [sent-175, score-0.333]
78 1 Data Sets We use the learned matching function for retrieving response texts y for a given query text x, which will be ranked purely based on the matching scores. [sent-177, score-0.928]
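A small sketch of this retrieval setup; the word-overlap scorer below merely stands in for the learned matching function and is not the paper's model.

```python
def retrieve(query, candidates, match_fn, top_k=6):
    # Rank candidate response texts y for a query text x purely by match score.
    return sorted(candidates, key=lambda y: match_fn(query, y), reverse=True)[:top_k]

toy_match = lambda x, y: len(set(x.split()) & set(y.split()))   # stand-in scorer
print(retrieve("going to hawaii for vacation",
               ["check your hotel and flight", "reinstall the RAM first",
                "hawaii has great sightseeing"], toy_match, top_k=2))
```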
79 The data are randomly split into training and testing sets, and the parameters of all models (including the learned patches for DeepMatch) are learned on the training data. [sent-192, score-0.252]
80 We use NDCG@1 and NDCG@6 [8] on a random pool of size 6 (one positive + five negatives) to measure the performance of the different matching models. [sent-196, score-0.29]
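For reference, a sketch of binary-relevance NDCG@k on such a pool (one positive among six candidates); the helper below is a generic NDCG computation, not code from the paper.

```python
import math

def ndcg_at_k(ranked_relevance, k):
    # DCG@k / ideal DCG@k, with binary relevance labels given in ranked order.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

ranked = [0, 1, 0, 0, 0, 0]   # the one positive response was ranked 2nd of 6
print(ndcg_at_k(ranked, 1), ndcg_at_k(ranked, 6))   # 0.0  ~0.63
```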
81 Among the two data sets, the Question-Answer data set is relatively easy, with all four matching models improving upon random guesses. [sent-199, score-0.29]
82 As another observation, we get a significant gain in performance by introducing nonlinearity in the mapping function, but all the inner-product based matching models are outperformed by the proposed DeepMatch by a large margin on this data set. [sent-200, score-0.459]
83 The story is slightly different on the Weibo-Response data set, which is significantly more challenging than the Q-A task in that it relies more on the content of the texts and is harder to capture with bag-of-words. [sent-201, score-0.243]
84 To further understand the performance of the different matching models, we also compare the generalization ability of the two nonlinear models. [sent-204, score-0.345]
85 We argue that our model has better generalization properties than a Siamese architecture of similar model complexity. [sent-207, score-0.266]
86 665 Table 2: The retrieval performance of matching models on the Q-A and Weibo data sets. [sent-228, score-0.335]
87 The number of neurons associated with each patch (Figure 6, right panel) follows a similar story: the performance flattens out after the number of neurons per patch reaches 3, again with significantly increased training time and memory. [sent-233, score-0.904]
88 As another observation about the architecture, DeepMatch with both linear and sigmoid activation functions in the hidden layers yields slightly but consistently better performance than that with only the sigmoid function. [sent-234, score-0.254]
89 Our conjecture is that linear neurons provide shortcuts from low-level matching decisions to high-level composition units, and therefore help the informative low-level units determine the final matching score. [sent-235, score-0.907]
90 Figure 6: Choices of architecture for DeepMatch (panels: size of patch-induced layers; size of committee layer(s); number of filters per patch). [sent-236, score-0.44]
91 As we discussed earlier, this kind of model cannot sufficiently capture the rich and nonlinear structure of matching complicated objects. [sent-239, score-0.396]
92 Those deep learning models, however, do not give a direct matching function, and cannot handle short texts with a large vocabulary. [sent-245, score-0.709]
93 Our work is in a sense related to the sum-product network (SPN)[4, 5, 15], especially the work in [4] that learns the deep architecture from clustering in the feature space for the image completion task. [sent-246, score-0.443]
94 However, it is difficult to determine a regular architecture like SPN for short texts, since the structure of the matching task for short texts is not as well-defined as that for images. [sent-247, score-0.906]
95 We therefore adopt a more traditional MLP-like architecture in this paper. [sent-248, score-0.266]
96 Our work is conceptually close to the dynamic pooling algorithm recently proposed by Socher et al [16] for paraphrase identification, which is essentially a special case of matching between two homogeneous domains. [sent-249, score-0.365]
97 Similar to our model, their proposed model also constructs a neural network on the interaction space of two objects (sentences in their case), and outputs the measure of semantic similarity between them. [sent-250, score-0.249]
98 7 Conclusion and Future Work We proposed a novel deep architecture for matching problems, inspired partially by the long thread of work on deep learning. [sent-252, score-0.838]
99 The proposed architecture can sufficiently explore the nonlinearity and hierarchy in the matching process, and has been empirically shown to be superior to various inner-product based matching models on real-world data sets. [sent-253, score-0.952]
100 Learning the architecture of sum-product networks using clustering on variables. [sent-285, score-0.266]
wordName wordTfidf (topN-words)
[('patch', 0.302), ('matching', 0.29), ('architecture', 0.266), ('patches', 0.252), ('eep', 0.228), ('texts', 0.206), ('layer', 0.195), ('atch', 0.182), ('neurons', 0.15), ('sp', 0.141), ('deep', 0.141), ('sightseeing', 0.137), ('decisions', 0.134), ('px', 0.118), ('localness', 0.114), ('weibo', 0.114), ('ndcg', 0.111), ('decision', 0.106), ('topic', 0.106), ('bilinear', 0.103), ('py', 0.103), ('objects', 0.1), ('hotel', 0.093), ('iamese', 0.091), ('text', 0.09), ('committee', 0.087), ('layers', 0.087), ('yi', 0.085), ('short', 0.072), ('interaction', 0.069), ('fp', 0.063), ('words', 0.06), ('pls', 0.06), ('hierarchy', 0.059), ('nonlinear', 0.055), ('domain', 0.054), ('sent', 0.054), ('parallel', 0.053), ('topics', 0.053), ('word', 0.052), ('retrieving', 0.052), ('yp', 0.052), ('complicated', 0.051), ('bp', 0.05), ('local', 0.05), ('ew', 0.05), ('images', 0.049), ('triples', 0.048), ('nonlinearity', 0.047), ('anm', 0.046), ('dtrn', 0.046), ('erlin', 0.046), ('flin', 0.046), ('fsig', 0.046), ('grangier', 0.046), ('orr', 0.046), ('patchinduced', 0.046), ('potp', 0.046), ('rmls', 0.046), ('siamese', 0.046), ('spn', 0.046), ('tin', 0.046), ('sigmoid', 0.046), ('retrieval', 0.045), ('margin', 0.044), ('mapping', 0.044), ('semantic', 0.044), ('activation', 0.044), ('matched', 0.043), ('semantically', 0.043), ('xi', 0.041), ('ark', 0.04), ('bilingual', 0.04), ('captions', 0.04), ('paraphrase', 0.04), ('sparsity', 0.04), ('effort', 0.04), ('resolution', 0.039), ('weights', 0.039), ('huawei', 0.037), ('schema', 0.037), ('story', 0.037), ('network', 0.036), ('ap', 0.036), ('level', 0.036), ('domains', 0.035), ('latent', 0.035), ('sha', 0.035), ('weather', 0.035), ('composition', 0.035), ('xp', 0.035), ('pooling', 0.035), ('overlap', 0.034), ('get', 0.034), ('mappings', 0.034), ('intuitions', 0.033), ('chinese', 0.033), ('modeling', 0.033), ('wp', 0.032), ('hidden', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 5 nips-2013-A Deep Architecture for Matching Short Texts
Author: Zhengdong Lu, Hang Li
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1
2 0.20471068 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
3 0.19447953 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
4 0.1803944 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
Author: Carl Doersch, Abhinav Gupta, Alexei A. Efros
Abstract: Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical “visual words”, but lower than full-blown semantic objects. Several approaches [5, 6, 12, 23] have been proposed to discover mid-level visual elements, that are both 1) representative, i.e., frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the the well-known and well-studied mean-shift algorithm [2, 1, 4, 8]. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset of [5]. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset. 1
5 0.15018533 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
6 0.14306514 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
7 0.14239295 331 nips-2013-Top-Down Regularization of Deep Belief Networks
8 0.12246006 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
9 0.11936332 64 nips-2013-Compete to Compute
11 0.11548235 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
12 0.1144105 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
13 0.11115441 300 nips-2013-Solving the multi-way matching problem by permutation synchronization
14 0.10400632 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
15 0.10330119 167 nips-2013-Learning the Local Statistics of Optical Flow
16 0.09937606 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
17 0.092833459 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
18 0.090638302 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
19 0.087888815 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
20 0.087373078 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
topicId topicWeight
[(0, 0.221), (1, 0.125), (2, -0.19), (3, -0.109), (4, 0.065), (5, -0.144), (6, -0.049), (7, 0.014), (8, 0.021), (9, -0.036), (10, 0.085), (11, 0.009), (12, 0.002), (13, 0.005), (14, 0.026), (15, -0.005), (16, 0.109), (17, -0.059), (18, -0.07), (19, -0.024), (20, -0.02), (21, -0.069), (22, 0.069), (23, 0.056), (24, 0.071), (25, -0.128), (26, -0.091), (27, 0.041), (28, -0.067), (29, 0.042), (30, 0.091), (31, 0.053), (32, 0.055), (33, 0.027), (34, -0.03), (35, 0.117), (36, 0.022), (37, -0.014), (38, -0.008), (39, 0.014), (40, -0.054), (41, -0.141), (42, -0.017), (43, 0.019), (44, 0.023), (45, -0.084), (46, -0.125), (47, 0.036), (48, -0.001), (49, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.96788824 5 nips-2013-A Deep Architecture for Matching Short Texts
Author: Zhengdong Lu, Hang Li
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1
2 0.66974467 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
3 0.64591205 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
4 0.64324272 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
5 0.62922657 331 nips-2013-Top-Down Regularization of Deep Belief Networks
Author: Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
Abstract: Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition results on the Caltech-101 dataset also yield competitive results. 1
6 0.6090185 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising
7 0.58919919 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
8 0.5426634 75 nips-2013-Convex Two-Layer Modeling
9 0.53191638 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs
10 0.5281902 85 nips-2013-Deep content-based music recommendation
12 0.52368098 167 nips-2013-Learning the Local Statistics of Optical Flow
13 0.5156517 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
14 0.5146938 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
15 0.51194417 160 nips-2013-Learning Stochastic Feedforward Neural Networks
16 0.50913918 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
17 0.50555146 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines
18 0.50184304 84 nips-2013-Deep Neural Networks for Object Detection
19 0.49540874 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
20 0.4933973 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition
topicId topicWeight
[(16, 0.046), (19, 0.021), (33, 0.137), (34, 0.127), (36, 0.012), (41, 0.028), (49, 0.055), (56, 0.089), (70, 0.046), (83, 0.206), (85, 0.042), (89, 0.027), (93, 0.091), (95, 0.014)]
simIndex simValue paperId paperTitle
1 0.85970426 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
Author: Ashesh Jain, Brian Wojcik, Thorsten Joachims, Ashutosh Saxena
Abstract: We consider the problem of learning good trajectories for manipulation tasks. This is challenging because the criterion defining a good trajectory varies with users, tasks and environments. In this paper, we propose a co-active online learning framework for teaching robots the preferences of its users for object manipulation tasks. The key novelty of our approach lies in the type of feedback expected from the user: the human user does not need to demonstrate optimal trajectories as training data, but merely needs to iteratively provide trajectories that slightly improve over the trajectory currently proposed by the system. We argue that this co-active preference feedback can be more easily elicited from the user than demonstrations of optimal trajectories, which are often challenging and non-intuitive to provide on high degrees of freedom manipulators. Nevertheless, theoretical regret bounds of our algorithm match the asymptotic rates of optimal trajectory algorithms. We demonstrate the generalizability of our algorithm on a variety of grocery checkout tasks, for whom, the preferences were not only influenced by the object being manipulated but also by the surrounding environment.1 1
2 0.85871923 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs
Author: Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, Josh Tenenbaum
Abstract: The idea of computer vision as the Bayesian inverse problem to computer graphics has a long history and an appealing elegance, but it has proved difficult to directly implement. Instead, most vision tasks are approached via complex bottom-up processing pipelines. Here we show that it is possible to write short, simple probabilistic graphics programs that define flexible generative models and to automatically invert them to interpret real-world images. Generative probabilistic graphics programs (GPGP) consist of a stochastic scene generator, a renderer based on graphics software, a stochastic likelihood model linking the renderer’s output and the data, and latent variables that adjust the fidelity of the renderer and the tolerance of the likelihood. Representations and algorithms from computer graphics are used as the deterministic backbone for highly approximate and stochastic generative models. This formulation combines probabilistic programming, computer graphics, and approximate Bayesian computation, and depends only on generalpurpose, automatic inference techniques. We describe two applications: reading sequences of degraded and adversarially obscured characters, and inferring 3D road models from vehicle-mounted camera images. Each of the probabilistic graphics programs we present relies on under 20 lines of probabilistic code, and yields accurate, approximately Bayesian inferences about real-world images. 1
same-paper 3 0.83895326 5 nips-2013-A Deep Architecture for Matching Short Texts
Author: Zhengdong Lu, Hang Li
Abstract: Many machine learning problems can be interpreted as learning for matching two types of objects (e.g., images and captions, users and products, queries and documents, etc.). The matching level of two objects is usually measured as the inner product in a certain feature space, while the modeling effort focuses on mapping of objects from the original space to the feature space. This schema, although proven successful on a range of matching tasks, is insufficient for capturing the rich structure in the matching process of more complicated objects. In this paper, we propose a new deep architecture to more effectively model the complicated matching relations between two objects from heterogeneous domains. More specifically, we apply this model to matching tasks in natural language, e.g., finding sensible responses for a tweet, or relevant answers to a given question. This new architecture naturally combines the localness and hierarchy intrinsic to the natural language problems, and therefore greatly improves upon the state-of-the-art models. 1
4 0.73793977 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
5 0.73559701 99 nips-2013-Dropout Training as Adaptive Regularization
Author: Stefan Wager, Sida Wang, Percy Liang
Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1
6 0.73459977 64 nips-2013-Compete to Compute
7 0.73255563 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
8 0.73143691 251 nips-2013-Predicting Parameters in Deep Learning
9 0.72800648 121 nips-2013-Firing rate predictions in optimal balanced networks
10 0.72774422 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
11 0.72717893 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
12 0.72565377 183 nips-2013-Mapping paradigm ontologies to and from the brain
13 0.72411454 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
14 0.72379524 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
15 0.72376001 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
16 0.72252524 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
17 0.72224206 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
18 0.72127247 69 nips-2013-Context-sensitive active sensing in humans
19 0.72042984 173 nips-2013-Least Informative Dimensions
20 0.71946621 301 nips-2013-Sparse Additive Text Models with Low Rank Background