acl acl2013 acl2013-349 knowledge-graph by maker-knowledge-mining

349 acl-2013-The mathematics of language learning


Source: pdf

Author: Andras Kornai ; Gerald Penn ; James Rogers ; Anssi Yli-Jyra

Abstract: unknown-abstract

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The mathematics of language learning András Kornai Gerald Penn Computer and Automation Research Institute Department of Computer Science Hungarian Academy of Sciences andras@kornai.com [sent-1, score-0.096]

2 James Rogers Computer Science Department Earlham College jrogers@cs.earlham.edu [sent-2, score-0.178]

3 The Mathematics of Language (MOL) SIG put together this tutorial, composed of three lectures, to highlight some alternative learning paradigms in speech, syntax, and semantics in the hopes of accelerating this trend. [sent-5, score-0.149]

4 Compounding the enormous variety of formal models one may consider is the bewildering range of ML techniques one may bring to bear. [sent-6, score-0.256]

5 Anssi Yli-Jyrä Department of Modern Languages University of Helsinki anssi.yli-jyra@helsinki.fi [sent-13, score-0.157]

6 The second lecture recasts these and similar problems in terms of learning weighted edges in a sparse graph, and presents learning techniques that seem to have some potential to better find sparse finite-state and near-FS models than EM. [sent-18, score-0.408]

7 We will provide a mathematical introduction to the Minimum Description Length (MDL) paradigm and [sent-19, score-0.064]

8 spectral learning, and relate these to the better-known techniques based on (convex) optimization and (data-oriented) memorization. [sent-21, score-0.36]
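For readers new to MDL, the two-part idea is: score a candidate model by the bits needed to describe the model plus the bits needed to encode the data given the model, and prefer the candidate that minimizes the sum. The sketch below is only an illustration of that trade-off on a toy character corpus; the corpus, the flat 8-bits-per-parameter cost, and both candidate models are invented here and are not material from the tutorial.

```python
import math
from collections import Counter

def data_bits(corpus, probs):
    """Code length of the corpus under a character model, in bits."""
    return -sum(math.log2(probs[c]) for c in corpus)

corpus = "abababababababacab" * 10
alphabet = sorted(set(corpus))

# Candidate 1: uniform over the alphabet; nothing to store, so the model itself is essentially free.
uniform = {c: 1.0 / len(alphabet) for c in alphabet}
uniform_model_bits = 0

# Candidate 2: maximum-likelihood character frequencies; pays a crude 8 bits per stored parameter.
counts = Counter(corpus)
ml = {c: counts[c] / len(corpus) for c in alphabet}
ml_model_bits = 8 * len(ml)

for name, probs, model_bits in [("uniform", uniform, uniform_model_bits),
                                ("ML unigram", ml, ml_model_bits)]:
    total = model_bits + data_bits(corpus, probs)
    print(f"{name}: {model_bits} model bits + {data_bits(corpus, probs):.1f} data bits = {total:.1f}")
```

On this corpus the richer model pays for its parameters but more than recoups the cost on the data; on a much shorter corpus the uniform model would win, which is the over-fitting control MDL is meant to provide.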

9 • MDL for weighted languages • Ambiguity • Discarding data – yes, you can! [sent-30, score-0.05]

10 A particularly significant case in point is provided by PCFGs, which have not proved competitive with straight trigram models. [sent-32, score-0.046]

11 A natural response to this outcome is to retrench and use less powerful formal models, and the last lecture will be spent in the subregular space of formal models even less powerful than finite state automata. [sent-34, score-0.823]

12 Lecture 3: Subregular Languages and Their Linguistic Relevance (Rogers and Yli-Jyrä) The difficulty of learning a regular or context-free language in the limit from positive data gives a motivation for studying non-Chomskyan language classes. [sent-35, score-0.057]

13 The lecture gives an overview of the taxonomy of the most important subregular classes of languages and motivates their linguistic relevance in phonology and syntax. [sent-36, score-0.746]
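As a concrete illustration of what sits near the bottom of that taxonomy, a strictly 2-local language is fully determined by a finite set of permitted bigrams over boundary-padded strings, so membership reduces to checking factors. The grammar below is a made-up toy, not an example from the lecture.

```python
def sl2_factors(word):
    """Bigram factors of a word with boundary markers, as used for strictly 2-local languages."""
    padded = "#" + word + "#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

# A toy strictly 2-local grammar: the set of permitted bigrams (invented for illustration).
permitted = {"#a", "ab", "ba", "a#", "b#"}

def in_sl2_language(word):
    """A word is in the language iff every one of its bigram factors is permitted."""
    return sl2_factors(word) <= permitted

for w in ["abab", "abba", "a", ""]:
    print(w or "(empty)", in_sl2_language(w))
```

Local phonotactic restrictions are the classic linguistic motivation for such classes, which is the kind of relevance argument the lecture develops.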

14 2011) relate language types to the theory of subregular language classes. [sent-43, score-0.176]

15 There are finite-state approaches to syntax showing subregular properties. [sent-44, score-0.424]

16 Although structure-assigning syntax differs from phonotactical constraints, the inadequacy of right-linear grammars does not generalize to all finite-state representations of syntax. [sent-45, score-0.12]

17 The linguistic relevance and descriptive adequacy of these approaches are discussed, in particular, in the context of intersection parsing and conjunctive representations of syntax. [sent-46, score-0.151]
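Conjunction of finite-state constraints, the formal core behind intersection parsing and conjunctive representations, is computed by the standard product construction. The two automata below are invented toys (one tracks the parity of a's, the other whether the string ends in b); they show only the mechanics, not a linguistic analysis from the lecture.

```python
from itertools import product

# Each automaton is (states, initial state, final states, transition function).
# A1 accepts strings with an even number of a's; A2 accepts strings ending in b.
A1 = ({0, 1}, 0, {0}, {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1})
A2 = ({0, 1}, 0, {1}, {(0, 'a'): 0, (1, 'a'): 0, (0, 'b'): 1, (1, 'b'): 1})

def intersect(m1, m2):
    """Product construction: the conjunction of two finite-state constraints."""
    (q1, i1, f1, d1), (q2, i2, f2, d2) = m1, m2
    states = set(product(q1, q2))
    finals = {(s1, s2) for s1, s2 in states if s1 in f1 and s2 in f2}
    delta = {((s1, s2), c): (d1[(s1, c)], d2[(s2, c)])
             for (s1, s2) in states for c in 'ab'}
    return states, (i1, i2), finals, delta

def accepts(machine, word):
    """Run the (deterministic, complete) automaton over the word."""
    _, state, finals, delta = machine
    for c in word:
        state = delta[(state, c)]
    return state in finals

both = intersect(A1, A2)
for w in ["aab", "ab", "b", "aabb"]:
    print(w, accepts(both, w))
```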

18 His research focuses on finite-state technology in phonology, morphology and syntax. [sent-48, score-0.12]

19 He is interested in weighted logic, dependency complexity and machine learning. [sent-49, score-0.05]

20 His primary research interests are in formal models of language and formal language theory, particularly model-theoretic approaches to these, and in cognitive science. [sent-51, score-0.286]

21 Gerald Penn teaches computer science at the University of Toronto, and is a Senior Member of the IEEE. [sent-52, score-0.105]

22 His research interests are in spoken language processing, human-computer interaction, and mathematical linguistics. [sent-53, score-0.13]

23 András Kornai teaches at the Algebra Department of the Budapest Institute of Technology, and leads the HLT group at the Computer and Automation Research Institute of the Hungarian Academy of Sciences. [sent-54, score-0.105]

24 He is interested in everything in the intersection of mathematics and linguistics. [sent-55, score-0.149]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('subregular', 0.356), ('kornai', 0.242), ('pcfgs', 0.2), ('earlham', 0.178), ('rogers', 0.173), ('lecture', 0.172), ('anssi', 0.157), ('mdl', 0.13), ('phonology', 0.12), ('gmms', 0.119), ('imotivation', 0.119), ('spectral', 0.116), ('formal', 0.11), ('piecewise', 0.105), ('teaches', 0.105), ('variational', 0.099), ('relevance', 0.098), ('mathematics', 0.096), ('academy', 0.092), ('helsinki', 0.091), ('bulk', 0.091), ('fellow', 0.091), ('ocr', 0.091), ('neural', 0.088), ('ml', 0.086), ('enormous', 0.083), ('iv', 0.082), ('toronto', 0.08), ('hmms', 0.077), ('gerald', 0.075), ('hungarian', 0.075), ('finite', 0.075), ('morphology', 0.072), ('acoustic', 0.071), ('syntax', 0.068), ('optimization', 0.068), ('approximations', 0.066), ('interests', 0.066), ('automation', 0.065), ('mathematical', 0.064), ('techniques', 0.063), ('department', 0.062), ('relate', 0.061), ('contrastive', 0.059), ('theory', 0.058), ('andr', 0.058), ('regular', 0.057), ('iii', 0.057), ('learners', 0.056), ('vi', 0.056), ('intersection', 0.053), ('penn', 0.053), ('inadequacy', 0.052), ('resilient', 0.052), ('optionality', 0.052), ('fari', 0.052), ('mso', 0.052), ('senior', 0.052), ('aanld', 0.052), ('ism', 0.052), ('intersecting', 0.052), ('ainndg', 0.052), ('balle', 0.052), ('atsn', 0.052), ('accelerating', 0.052), ('betterknown', 0.052), ('creutz', 0.052), ('budapest', 0.052), ('corners', 0.052), ('fmoro', 0.052), ('ructure', 0.052), ('frames', 0.052), ('divergence', 0.052), ('paraphrase', 0.051), ('weighted', 0.05), ('college', 0.049), ('semantics', 0.049), ('finland', 0.048), ('mohri', 0.048), ('mcs', 0.048), ('mof', 0.048), ('instructors', 0.048), ('zeros', 0.048), ('finitestate', 0.048), ('goldsmith', 0.048), ('heinz', 0.048), ('spare', 0.048), ('hopes', 0.048), ('clare', 0.048), ('yra', 0.048), ('gpenn', 0.048), ('cvs', 0.048), ('broad', 0.047), ('pca', 0.046), ('cant', 0.046), ('lasso', 0.046), ('tine', 0.046), ('conveyed', 0.046), ('beneath', 0.046), ('andras', 0.046)]
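The similar-paper rankings that follow rest on a standard construction: each paper becomes a tf-idf-weighted bag-of-words vector, and papers are ordered by cosine similarity to this one; the word list above is simply the top-weighted coordinates of this paper's vector. A minimal sketch of that pipeline, with invented stand-in texts (the anthology text and the exact weighting used by the mining tool are not available on this page):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the full paper texts.
papers = {
    "349": "mathematics of language learning subregular classes mdl spectral learning",
    "348": "bayesian estimation of pcfgs tight distributions grammars",
    "382": "variational inference for structured nlp models",
}

ids = list(papers)
tfidf = TfidfVectorizer().fit_transform([papers[i] for i in ids])
sims = cosine_similarity(tfidf[0], tfidf)  # similarity of paper 349 to every paper, itself included

for pid, score in sorted(zip(ids, sims[0]), key=lambda x: -x[1]):
    print(pid, round(score, 3))
```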

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 349 acl-2013-The mathematics of language learning

Author: Andras Kornai ; Gerald Penn ; James Rogers ; Anssi Yli-Jyra

Abstract: unknown-abstract

2 0.11720657 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

Author: Shay B. Cohen ; Mark Johnson

Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.
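To see how non-tightness can arise, consider the textbook one-rule grammar S -> S S with probability p and S -> a with probability 1 - p (our example, not one from the paper). The total probability q of generating a finite tree satisfies q = (1 - p) + p * q^2, whose smallest solution is min(1, (1 - p) / p), so probability mass leaks into non-terminating derivations whenever p > 1/2. A small sketch that finds the fixed point by iteration:

```python
def termination_prob(p, iters=10000):
    """Total probability of finite trees for the PCFG  S -> S S (p) | a (1 - p),
    obtained by iterating q <- (1 - p) + p * q**2 from q = 0 (the smallest fixed point)."""
    q = 0.0
    for _ in range(iters):
        q = (1 - p) + p * q * q
    return q

for p in [0.3, 0.5, 0.6, 0.8]:
    print(f"p = {p}: total tree probability ~= {termination_prob(p):.4f}")
```

For p = 0.6 the finite trees carry only about two thirds of the mass, which is the situation the paper's reinterpretations are designed to repair.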

3 0.093292363 382 acl-2013-Variational Inference for Structured NLP Models

Author: David Burkett ; Dan Klein

Abstract: unknown-abstract

4 0.066281423 275 acl-2013-Parsing with Compositional Vector Grammars

Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Ng Andrew Y.

Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and implemented approximately as an efficient reranker it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments.
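The composition step the abstract describes can be pictured in a few lines of numpy: a binary constituent combines its children's vectors through a rule-specific matrix and a nonlinearity, and a separate vector scores the result for the parser. Everything below (dimensions, matrices, category pairs) is invented for illustration and is not the authors' implementation, which additionally ties the network to a PCFG and trains it discriminatively.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding size

# One composition matrix per (left category, right category) pair: the "syntactically untied" idea.
W = {("DT", "NN"): rng.standard_normal((d, 2 * d)) * 0.1,
     ("NP", "VP"): rng.standard_normal((d, 2 * d)) * 0.1}
scorer = rng.standard_normal(d) * 0.1  # scores a constituent for pruning/reranking

def compose(left_vec, right_vec, left_cat, right_cat):
    """Combine two child vectors into a parent vector and a plausibility score."""
    parent = np.tanh(W[(left_cat, right_cat)] @ np.concatenate([left_vec, right_vec]))
    return parent, float(scorer @ parent)

the_vec, cat_vec = rng.standard_normal(d), rng.standard_normal(d)
np_vec, np_score = compose(the_vec, cat_vec, "DT", "NN")
print(np_vec.round(3), round(np_score, 3))
```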

5 0.064748079 318 acl-2013-Sentiment Relevance

Author: Christian Scheible ; Hinrich Schutze

Abstract: A number of different notions, including subjectivity, have been proposed for distinguishing parts of documents that convey sentiment from those that do not. We propose a new concept, sentiment relevance, to make this distinction and argue that it better reflects the requirements of sentiment analysis systems. We demonstrate experimentally that sentiment relevance and subjectivity are related, but different. Since no large amount of labeled training data for our new notion of sentiment relevance is available, we investigate two semi-supervised methods for creating sentiment relevance classifiers: a distant supervision approach that leverages structured information about the domain of the reviews; and transfer learning on feature representations based on lexical taxonomies that enables knowledge transfer. We show that both methods learn sentiment relevance classifiers that perform well.

6 0.063949846 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network

7 0.057148613 379 acl-2013-Utterance-Level Multimodal Sentiment Analysis

8 0.05423427 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

9 0.053988189 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

10 0.053217109 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars

11 0.052591018 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics

12 0.051055338 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation

13 0.050920207 238 acl-2013-Measuring semantic content in distributional vectors

14 0.050388586 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

15 0.04980148 50 acl-2013-An improved MDL-based compression algorithm for unsupervised word segmentation

16 0.044650361 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals

17 0.043739013 121 acl-2013-Discovering User Interactions in Ideological Discussions

18 0.043364089 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

19 0.042802524 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

20 0.042501125 175 acl-2013-Grounded Language Learning from Video Described with Sentences


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.13), (1, 0.019), (2, -0.018), (3, -0.015), (4, -0.07), (5, -0.048), (6, 0.031), (7, -0.02), (8, -0.033), (9, 0.041), (10, -0.01), (11, -0.033), (12, 0.046), (13, -0.079), (14, -0.055), (15, -0.043), (16, -0.017), (17, 0.043), (18, 0.043), (19, -0.079), (20, 0.028), (21, -0.002), (22, 0.003), (23, 0.004), (24, 0.012), (25, -0.012), (26, -0.017), (27, -0.006), (28, 0.04), (29, 0.039), (30, 0.034), (31, -0.003), (32, 0.003), (33, -0.045), (34, 0.012), (35, -0.033), (36, 0.006), (37, -0.0), (38, 0.035), (39, -0.0), (40, 0.011), (41, -0.004), (42, 0.029), (43, 0.007), (44, -0.061), (45, 0.048), (46, -0.043), (47, 0.001), (48, -0.081), (49, -0.029)]
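These weights come from latent semantic indexing: build the tf-idf term-document matrix, factor it with a truncated SVD, and represent each paper by its coordinates in the resulting low-rank topic space. A small sketch with invented documents (the real corpus and the number of components used by the mining tool are not recorded here):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for anthology papers.
docs = [
    "mdl and spectral learning for language models",
    "bayesian estimation of probabilistic grammars",
    "subregular languages in phonology and syntax",
    "neural networks for machine translation",
]

tfidf = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)  # two latent dimensions for the toy corpus
doc_coords = lsi.fit_transform(tfidf)               # low-rank coordinates per document

for doc, coords in zip(docs, doc_coords):
    print(coords.round(3), doc)
```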

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93387747 349 acl-2013-The mathematics of language learning

Author: Andras Kornai ; Gerald Penn ; James Rogers ; Anssi Yli-Jyra

Abstract: unknown-abstract

2 0.64521372 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

Author: Matthew R. Gormley ; Jason Eisner

Abstract: Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for local optimization methods such as EM or gradient ascent. We explore how one might instead search for a global optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ε) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum. As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.
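The anytime behaviour described here (return the best solution found so far together with a certified upper bound on the global optimum) is easiest to see on a toy problem. The sketch below runs Lipschitz-bound branch-and-bound on a one-dimensional nonconvex function; it is only an analogy for the paper's much harder mixed-integer parsing objective, and the function, bound, and tolerance are all made up.

```python
import heapq
import math

def f(x):
    return x * math.sin(x)  # a toy nonconvex objective, not the paper's model

LIP = 11.0  # |f'(x)| = |sin x + x cos x| <= 1 + 10 = 11 on [0, 10]

def upper_bound(lo, hi):
    """Lipschitz bound on the maximum of f over [lo, hi]."""
    mid = (lo + hi) / 2
    return f(mid) + LIP * (hi - lo) / 2

def branch_and_bound(lo=0.0, hi=10.0, eps=1e-3, max_iter=10000):
    best_x, best_val = lo, f(lo)
    heap = [(-upper_bound(lo, hi), lo, hi)]  # max-heap on the upper bound
    ub = -heap[0][0]
    for _ in range(max_iter):
        neg_ub, a, b = heapq.heappop(heap)
        ub = -neg_ub
        if ub <= best_val + eps:             # global maximum bracketed to within eps
            break
        mid = (a + b) / 2
        if f(mid) > best_val:
            best_x, best_val = mid, f(mid)
        for a2, b2 in ((a, mid), (mid, b)):  # branch: split the interval
            heapq.heappush(heap, (-upper_bound(a2, b2), a2, b2))
    return best_x, best_val, ub              # anytime: incumbent plus a bound on the optimum

x, val, ub = branch_and_bound()
print(f"argmax ~= {x:.4f}, value ~= {val:.4f}, certified upper bound {ub:.4f}")
```

Intervals whose bound falls below the incumbent are simply never expanded; that pruning is what lets branch-and-bound eventually certify the global maximum rather than a local one.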

3 0.64500189 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

Author: Shay B. Cohen ; Mark Johnson

Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.

4 0.63872713 275 acl-2013-Parsing with Compositional Vector Grammars

Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Ng Andrew Y.

Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and implemented approximately as an efficient reranker it is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments.

5 0.61023307 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics

Author: Pieter Wellens ; Remi van Trijp ; Katrien Beuls ; Luc Steels

Abstract: Fluid Construction Grammar (FCG) is an open-source computational grammar formalism that is becoming increasingly popular for studying the history and evolution of language. This demonstration shows how FCG can be used to operationalise the cultural processes and cognitive mechanisms that underly language evolution and change.

6 0.60938805 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

7 0.60250247 175 acl-2013-Grounded Language Learning from Video Described with Sentences

8 0.57633233 213 acl-2013-Language Acquisition and Probabilistic Models: keeping it simple

9 0.57380736 321 acl-2013-Sign Language Lexical Recognition With Propositional Dynamic Logic

10 0.5627023 311 acl-2013-Semantic Neighborhoods as Hypergraphs

11 0.54892653 1 acl-2013-"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints

12 0.51907128 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics

13 0.51904666 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

14 0.51437533 224 acl-2013-Learning to Extract International Relations from Political Context

15 0.51139069 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

16 0.50224602 310 acl-2013-Semantic Frames to Predict Stock Price Movement

17 0.50076234 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

18 0.49931449 390 acl-2013-Word surprisal predicts N400 amplitude during reading

19 0.4991731 382 acl-2013-Variational Inference for Structured NLP Models

20 0.49318352 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (6, 0.031), (11, 0.057), (24, 0.045), (26, 0.032), (28, 0.469), (35, 0.07), (42, 0.04), (48, 0.043), (70, 0.05), (88, 0.012), (90, 0.018), (95, 0.042)]
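Likewise, the LDA weights are per-document topic proportions from latent Dirichlet allocation fitted over the anthology. A minimal sketch with invented documents and two topics (the actual corpus, vocabulary, and topic count behind the numbers above are unknown):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for anthology papers.
docs = [
    "grammar learning mdl description length grammar",
    "neural network translation neural network model",
    "subregular phonology finite state phonology",
    "bayesian grammar estimation sampler grammar",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic weights; each row sums to 1

for doc, weights in zip(docs, doc_topics):
    print([round(w, 2) for w in weights], doc)
```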

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89472699 349 acl-2013-The mathematics of language learning

Author: Andras Kornai ; Gerald Penn ; James Rogers ; Anssi Yli-Jyra

Abstract: unknown-abstract

2 0.64369017 124 acl-2013-Discriminative state tracking for spoken dialog systems

Author: Angeliki Metallinou ; Dan Bohus ; Jason Williams

Abstract: In spoken dialog systems, statistical state tracking aims to improve robustness to speech recognition errors by tracking a posterior distribution over hidden dialog states. Current approaches based on generative or discriminative models have different but important shortcomings that limit their accuracy. In this paper we discuss these limitations and introduce a new approach for discriminative state tracking that overcomes them by leveraging the problem structure. An offline evaluation with dialog data collected from real users shows improvements in both state tracking accuracy and the quality of the posterior probabilities. Features that encode speech recognition error patterns are particularly helpful, and training requires relatively few dialogs.

3 0.62564796 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts

Author: Hendra Setiawan ; Bowen Zhou ; Bing Xiang ; Libin Shen

Abstract: Long distance reordering remains one of the greatest challenges in statistical machine translation research as the key contextual information may well be beyond the confine of translation units. In this paper, we propose Two-Neighbor Orientation (TNO) model that jointly models the orientation decisions between anchors and two neighboring multi-unit chunks which may cross phrase or rule boundaries. We explicitly model the longest span of such chunks, referred to as Maximal Orientation Span, to serve as a global parameter that constrains underlying local decisions. We integrate our proposed model into a state-of-the-art string-to-dependency translation system and demonstrate the efficacy of our proposal in a large-scale Chinese-to-English translation task. On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement.

4 0.623941 107 acl-2013-Deceptive Answer Prediction with User Preference Graph

Author: Fangtao Li ; Yang Gao ; Shuchang Zhou ; Xiance Si ; Decheng Dai

Abstract: In Community question answering (QA) sites, malicious users may provide deceptive answers to promote their products or services. It is important to identify and filter out these deceptive answers. In this paper, we first solve this problem with the traditional supervised learning methods. Two kinds of features, including textual and contextual features, are investigated for this task. We further propose to exploit the user relationships to identify the deceptive answers, based on the hypothesis that similar users will have similar behaviors to post deceptive or authentic answers. To measure the user similarity, we propose a new user preference graph based on the answer preference expressed by users, such as “helpful” voting and “best answer” selection. The user preference graph is incorporated into traditional supervised learning framework with the graph regularization technique. The experiment results demonstrate that the user preference graph can indeed help improve the performance of deceptive answer prediction.
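Graph regularization of the kind mentioned here is commonly implemented as a Laplacian smoothness penalty that pulls connected users toward similar scores. The closed-form sketch below is a generic illustration over an invented four-user graph with fixed base scores; the paper's actual model couples the penalty with a trained supervised classifier.

```python
import numpy as np

# Hypothetical user preference graph: w[i, j] > 0 means users i and j expressed similar preferences.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, -1.0])  # base "deceptiveness" scores from some upstream classifier

L = np.diag(W.sum(axis=1)) - W       # graph Laplacian
lam = 0.5                            # strength of the smoothness term

# Minimise ||f - y||^2 + lam * f^T L f, whose closed form is (I + lam * L) f = y.
f = np.linalg.solve(np.eye(len(y)) + lam * L, y)
print(f.round(3))                    # scores are pulled toward those of graph neighbours
```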

5 0.60753334 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

Author: Tony Veale ; Guofu Li

Abstract: Just as observing is more than just seeing, comparing is far more than mere matching. It takes understanding, and even inventiveness, to discern a useful basis for judging two ideas as similar in a particular context, especially when our perspective is shaped by an act of linguistic creativity such as metaphor, simile or analogy. Structured resources such as WordNet offer a convenient hierarchical means for converging on a common ground for comparison, but offer little support for the divergent thinking that is needed to creatively view one concept as another. We describe such a means here, by showing how the web can be used to harvest many divergent views for many familiar ideas. These lateral views complement the vertical views of WordNet, and support a system for idea exploration called Thesaurus Rex. We show also how Thesaurus Rex supports a novel, generative similarity measure for WordNet.

6 0.57570934 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data

7 0.47687128 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

8 0.33659956 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

9 0.33302006 332 acl-2013-Subtree Extractive Summarization via Submodular Maximization

10 0.3253701 168 acl-2013-Generating Recommendation Dialogs by Extracting Information from User Reviews

11 0.32351214 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

12 0.32206184 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

13 0.3203038 16 acl-2013-A Novel Translation Framework Based on Rhetorical Structure Theory

14 0.31970888 367 acl-2013-Universal Conceptual Cognitive Annotation (UCCA)

15 0.31870088 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

16 0.31846511 280 acl-2013-Plurality, Negation, and Quantification:Towards Comprehensive Quantifier Scope Disambiguation

17 0.31830266 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

18 0.31813174 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

19 0.3168681 126 acl-2013-Diverse Keyword Extraction from Conversations

20 0.31685758 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context