emnlp emnlp2013 emnlp2013-10 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Canny ; David Hall ; Dan Klein
Abstract: Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length-30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.
Reference: text
sentIndex sentText sentNum sentScore
1 The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing. [sent-6, score-0.395]
2 Symbol/rule blocking of the grammar to respect register, constant and instruction cache limits. [sent-32, score-0.427]
3 Sub-block partitioning to distribute rules across the stream processors of the GPU and allow L2 cache acceleration. [sent-35, score-0.412]
4 An NVIDIA® KeplerTM series GPU contains between 2 and 16 “stream processors” or SMXs, which share an L2 cache interfacing to the GPU’s main memory Anonymous (2013). [sent-47, score-0.423]
5 Shared memory supports relatively fast communication between threads in an SMX. [sent-49, score-0.362]
6 Register storage is not attached to cores; instead, registers are associated in blocks of 63 or 255 (depending on KeplerTM subarchitecture) with running threads. [sent-54, score-0.266]
7 These 1024 threads are grouped into 32 groups of 32 threads called warps. [sent-56, score-0.282]
8 When the warp encounters a branching instruction, all branches that are satisfied by some thread will be executed in sequence. [sent-59, score-0.338]
9 Each thread only executes the instructions for its own branch, and idles for the others. [sent-60, score-0.301]
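A minimal CUDA sketch (not from the paper; the kernel and array names are hypothetical) of the divergence behavior just described: when a warp hits a data-dependent branch, the two sides execute in sequence and each lane idles during the other's side.

    __global__ void divergent(float *out, const float *in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;   // even lanes execute; odd lanes of the warp idle
        else
            out[i] = in[i] + 1.0f;   // then odd lanes execute; even lanes idle
    }

Both branch bodies cost issue slots for the whole warp, which is one reason to keep per-rule code straight-line.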
10 Table 1 shows instruction throughput (number of instructions that are executed per cycle on each SMX). [sent-66, score-0.319]
11 These transcendental instructions are as fast as shared memory accesses or integer multiplication, and amount to a quarter-trillion transcendental evaluations per second on a GTX-680/K10. [sent-79, score-0.376]
12 If we worked through shared memory (first line of the table), we would be limited to about 80 billion evaluations/sec, 20 times slower. [sent-88, score-0.271]
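As a hedged illustration of that throughput point (this kernel is hypothetical, not the paper's code): CUDA exposes the hardware special function units through intrinsics such as __expf, so transcendental evaluation runs at register speed rather than through shared memory.

    __global__ void exp_inplace(float *score, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            score[i] = __expf(score[i]);  // evaluated on the special function units
    }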
13 Shared memory/L1 cache is shared by all threads in an SMX. [sent-100, score-0.362]
14 It is 64kB per SMX, partitioned between shared memory and cache functions. [sent-101, score-0.501]
15 Constant memory: Each SMX has a 48kB read-only cache separate from the L1 cache. [sent-103, score-0.392]
16 It can store grammar constants and has much higher bandwidth than shared memory. [sent-104, score-0.293]
17 Global memory is 2-6GB typically, and is shared by all SMXs. [sent-112, score-0.271]
18 GPUs use a particularly fast form of SDRAM (compared to CPUs) but it is still much slower than the other memory types above. [sent-113, score-0.268]
19 From the above it can be seen that main memory access is much slower than other memories and can easily bottleneck the calculations. [sent-116, score-0.354]
20 The figure above (160 GB/s) for main memory access assumes such access is coalesced. [sent-117, score-0.362]
21 Each thread in the GPU has a thread number and a block number, which determine where it runs on the hardware. [sent-118, score-0.517]
22 Consecutively-numbered threads should access consecutive main memory locations for fast memory access. [sent-119, score-0.669]
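A small sketch of the coalescing rule just stated (hypothetical kernel, not from the paper): thread t touching address t lets a warp's 32 accesses fuse into one wide transaction, while a strided pattern does not.

    __global__ void copy_patterns(float *dst, const float *src, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            dst[tid] = src[tid];           // coalesced: consecutive threads, consecutive addresses
        // dst[32 * tid] = src[32 * tid];  // anti-pattern: each lane needs its own transaction
    }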
23 Maximize use of registers for symbol scores, and minimize use of shared memory (in fact we will not use it at all). [sent-124, score-0.694]
24 Maximize use of constant memory for rule weights, and minimize use of shared memory. [sent-125, score-0.36]
25 Partition the rule set into blocks that respect the limits on the number of registers, constant memory (needed for grammar rule probabilities) and the instruction cache. [sent-126, score-0.831]
26 Minimize main memory access and use L2 cache to speed it up. [sent-127, score-0.513]
27 This suggests that we do not use a lexicalized grammar or a grammar that is sensitive to the position of a span within the sentence. [sent-130, score-0.268]
28 These kinds of grammars—while highly accurate— have irregular memory access patterns that conflict with SIMD execution. [sent-131, score-0.276]
29 The Inside Updates: The high-level goal of our parser is to use SIMD parallelism to evaluate the same rule across many spans (1024 threads are currently used to process 8192 spans in each kernel). [sent-155, score-0.261]
30 As discussed in section 5 each GPU kernel actually processes a small subset of the symbols and rules for the grammar, and kernels are executed in sequence until the entire grammar has been processed. [sent-157, score-0.499]
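A hedged host-side sketch of that execution structure (the kernel names, launch shape, and signatures are assumptions, not the authors' code): one generated kernel per grammar subset, launched back-to-back over the same workspace arrays.

    #include <cuda_runtime.h>

    // stand-ins for the rule-subset kernels emitted by the grammar compiler
    __global__ void ruleSubset0(const float *L, const float *R, float *P) { /* rules of subset 0 */ }
    __global__ void ruleSubset1(const float *L, const float *R, float *P) { /* rules of subset 1 */ }

    void insidePass(const float *L, const float *R, float *P) {
        ruleSubset0<<<8, 128>>>(L, R, P);   // 8 thread blocks x 128 threads = 1024 threads
        ruleSubset1<<<8, 128>>>(L, R, P);   // one launch per subset, in sequence
        cudaDeviceSynchronize();            // all launches share the same workspace arrays
    }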
31 Each thread iterates over the rules in the same order, reading in symbols from the left child and right child arrays in main memory as necessary. [sent-158, score-1.011]
32 That is, the parent VP for one work item is stored next to the parent VP for the next work item, while the VP symbol for the first work item is stored on the next “row” of the work array. [sent-160, score-0.629]
33 The reason the workspace cells are stored in “symbol-major” order is to maximize coalesced access: each thread in the SMX accesses the same symbol for a different work item in parallel, and those work items are in consecutive memory locations. [sent-161, score-1.1]
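A minimal sketch of the symbol-major indexing described above (array names, symbol indices, and the rule body are hypothetical): with the work item as the fast-moving index, a warp reading one symbol for 32 consecutive work items reads 32 consecutive floats.

    #define STRIDE 8192   // row dimension of the workspace: work items per kernel call

    __global__ void readSymbolMajor(const float *left, const float *right, float *parent) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;   // work-item index
        float L4  = left [tid + 4  * STRIDE];   // symbol 4 of the left child, coalesced
        float R17 = right[tid + 17 * STRIDE];   // symbol 17 of the right child, coalesced
        parent[tid + 6 * STRIDE] = L4 * R17;    // illustrative store; real rules weight by the grammar
    }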
34 The Copy-transpose Operations: Unlike the workspace arrays, the arrays for the parse charts are stored in “span-major” order, transposed from how they are stored in the workspace arrays. [sent-163, score-0.591]
35 That is, for a given span, the NP symbol is next to the same span’s VP symbol (for example). [sent-164, score-0.578]
36 This order accelerates both symbol loading and Viterbi search later on. [sent-165, score-0.289]
37 It requires a transpose-copy instead of a “non-transposed” copy to move from chart to workspace arrays and back again, but note that a non-transposed copy (or direct access to the chart by the GPU compute kernel) would probably be slower. [sent-166, score-0.43]
38 The reason is that any linear ordering of cells in the triangle table will produce short segments (less than 32 words and often less than 16) of consecutive memory locations. [sent-167, score-0.286]
39 The copy-transpose operations are quite efficient (the transpose itself is much faster than the I/O), and come close to the 160 GB/s GPU main memory limit. [sent-170, score-0.299]
40 The reverse copy-transpose (from parent workspace cells to chart) is typically many-to-one, since parent scores derive from multiple splits. [sent-171, score-0.359]
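A hedged sketch of such a copy-transpose using the standard shared-memory tiling pattern (the paper does not show its kernel; layout constants and names below are assumptions). Here chart is span-major (symbols adjacent within a span) and work is symbol-major.

    #define TILE 32

    __global__ void copyTranspose(float *work, const float *chart,
                                  int nSyms, int nSpans) {
        __shared__ float tile[TILE][TILE + 1];        // +1 pad avoids bank conflicts
        int sym  = blockIdx.y * TILE + threadIdx.x;   // symbols are the fast index in chart
        int span = blockIdx.x * TILE + threadIdx.y;
        if (span < nSpans && sym < nSyms)
            tile[threadIdx.y][threadIdx.x] = chart[span * nSyms + sym];   // coalesced read
        __syncthreads();
        int outSpan = blockIdx.x * TILE + threadIdx.x; // spans are the fast index in work
        int outSym  = blockIdx.y * TILE + threadIdx.y;
        if (outSpan < nSpans && outSym < nSyms)
            work[outSym * nSpans + outSpan] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }

Launched with a 32x32 thread block per tile, both the read and the write touch consecutive addresses; only the shared-memory tile is accessed transposed.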
41 Sij,m = Σ_{i<k<j} Σ_{n,p∈Q} cmnp Sik,n Skj,p, where Sij,m is the score for symbol m as a generator of the span of words from position i to j in the input sentence, cmnp is the probability that symbol m generates the binary symbol pair n, p, and Q is the set of symbols. [sent-177, score-0.968]
42 The scores will be stored in a CKY chart indexed by the span ij and the symbol m. [sent-178, score-0.51]
43 We use symbols P, L and R for the scores of the parent, left child and right child, respectively, in the CKY chart. [sent-181, score-0.347]
44 registers cannot be indexed dynamically: one cannot access, say, register 3 as an array element R[i] with a runtime index i. [sent-184, score-0.329]
45 Note that we show here the sum-product code for computing inner/outer symbol probabilities. [sent-190, score-0.397]
46 3.38202e-001f; (the tail of a generated code listing), where tid is the thread ID plus an offset, stride is the row dimension of the workspace (typically 8192), and right is the main memory array of right symbol scores. [sent-194, score-1.049]
47 2.02160e-001f; atomicAdd(&P[tid + 6*stride], G020); These load/store strategies minimize the active life of each variable and allow reuse of register variables for symbols whose lifetimes do not overlap. [sent-196, score-0.429]
48 In hindsight, this is obvious because the symbols in this grammar are splits of base symbols, and so splits of the parent symbol will be involved in rules with each pair of L,R splits. [sent-200, score-0.715]
49 2.02160e-001f; and inspection of the resulting assembly code shows that each rule is compiled into a single fused multiply-add of LRtmp and a value from constant memory into the P symbol register. [sent-210, score-0.75]
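Putting sentences 46-49 together, a hedged reconstruction of the generated-code pattern (symbol indices, the weights, and every name other than LRtmp are made up): coalesced loads into registers, one fused multiply-add per rule against a constant-memory weight, and an atomicAdd store of the parent score.

    __constant__ float w[4096];   // grammar rule weights resident in constant memory

    __global__ void miniRuleBlock(const float *left, const float *right,
                                  float *parent, int stride) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float L4  = left [tid + 4  * stride];
        float R17 = right[tid + 17 * stride];
        float P6  = 0.0f;
        float LRtmp = L4 * R17;              // shared by every rule over this (L, R) pair
        P6 = fmaf(LRtmp, w[0], P6);          // one FMA per rule: P6 += LRtmp * weight
        P6 = fmaf(LRtmp, w[1], P6);
        atomicAdd(&parent[tid + 6 * stride], P6);   // aggregate scores across split points
    }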
50 Exploiting L2 Cache: Finally, we have to generate code to evaluate distinct minor-cube rulesets on each of the 8 SMXs concurrently in order to benefit from the L2 cache, as described in the next section. [sent-215, score-0.491]
51 CUDATM (NVIDIA’s GPU programming platform) does not allow direct control of the SMX target, but we can achieve this by running the kernel as 8 thread blocks, then testing the block ID within the kernel and dispatching to one of 8 blocks of rules. [sent-216, score-0.597]
52 The CUDATM scheduler will execute each thread block on a different SMX which gives the desired distribution of code. [sent-217, score-0.305]
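A hedged sketch of the block-ID dispatch just described (the ruleBlock bodies are placeholders): the kernel is launched as 8 thread blocks, and each block branches to its own minor-cube ruleset.

    __device__ void ruleBlock0(int tid) { /* rules of minor cube 0 */ }
    __device__ void ruleBlock1(int tid) { /* rules of minor cube 1 */ }
    // ... ruleBlock2 through ruleBlock7

    __global__ void majorCubeKernel() {
        int tid = threadIdx.x;
        switch (blockIdx.x) {        // each thread block is scheduled onto its own SMX
            case 0: ruleBlock0(tid); break;
            case 1: ruleBlock1(tid); break;
            // ... cases 2 through 7
        }
    }

Because the branch is on blockIdx.x, all warps within a block take the same path, so the dispatch itself causes no warp divergence.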
53 The re-use rate is determined by the number of rules in which a particular symbol occurs, which is actually very high (more than 1000 on average). [sent-226, score-0.364]
54 The number of symbols is about 1100 in this grammar, and only a fraction can be stored in a thread’s register set at one time (which is either 63 or 255 registers). [sent-227, score-0.384]
55 The cube will be partitioned into smaller subcubes indexed by subsets of those symbols, and containing all the rules that apply between those symbols. [sent-231, score-0.256]
56 The partitioning is chosen so that the symbols in that subset can fit into available register storage. [sent-232, score-0.43]
57 In addition, the partitioning is chosen to induce the same number of rules in each cube; otherwise some code paths in the kernel will run longer than others and reduce overall performance. [sent-233, score-0.504]
58 This figure is a simplification: in order to balance the number of rules in each subcube, the partitioning is not uniform in the number of symbols, as the figure might suggest. [sent-234, score-0.364]
59 The original P-L-R cube is first partitioned into major subcubes. (Figure 2: Partition of the cube of symbol combinations into major subcubes (left) and minor subcubes (right).) [sent-236, score-0.703]
60 A major cube holds the symbols and rules that are executed in a single GPU kernel. [sent-238, score-0.404]
61 The minor cubes in a major cube hold the symbols and rules that are executed in a particular SMX. [sent-239, score-0.512]
62 This arrangement substantially reduces main memory bandwidth demand by routing symbol re-use through the L2 cache (which is shared between SMXs). [sent-241, score-0.61]
63 Each symbol in a major cube will be loaded just once from main memory, but loaded into (up to) 4 different SMXes through the L2 cache. [sent-242, score-0.546]
64 Each symbol would need to be loaded into ≈ 3025 cells. [sent-248, score-0.289]
65 With that many loads, there would be almost no symbol re-use. [sent-249, score-0.289]
66 Throughput would be limited by main memory speed to about 100 Gflops, an order of magnitude slower than our target. [sent-250, score-0.371]
67 Instead, we use a rule partitioning scheme that creates as small a symbol footprint as possible in each cube. [sent-251, score-0.475]
68 Before describing the spectral method we mention an optimization that drops the symbol count by a factor of 2. [sent-253, score-0.335]
69 L and R symbols may be either terminal or non-terminal, so there are 4 distinct types of rules depending on the L, R types. [sent-256, score-0.267]
70 Spectral Partitioning: We explored a number of partitioning schemes for both symbol and rule partitioning. [sent-266, score-0.475]
71 In the end we settled on a spectral symbol partitioning scheme. [sent-267, score-0.465]
72 Each symbol is a node in the graph to be partitioned. [sent-268, score-0.356]
73 In the end the vector for a particular P symbol is a = (a1, a2, a3). [sent-271, score-0.289]
74 a1 is indexed by L, R pairs and its values represent the number of rules involving both those symbols (and the parent symbol P); a2 encodes L symbols and counts the number of rules containing that L symbol and P; and a3 encodes the R symbols and counts rules containing that R symbol and P. [sent-275, score-1.655]
75 Since many symbols are involved, this typically does not happen to an individual symbol, but this heuristic is successful at making the individual symbol or LR pair distributions across the cuts as unbalanced as possible. [sent-281, score-0.448]
76 The number of instances of a symbol is an upper bound on the number of subcells in which that symbol occurs, and therefore on the number of times it needs to be loaded from memory. [sent-285, score-0.631]
77 Repeating this operation recursively to produce a 3d cell decomposition also concentrates each symbol in relatively few cells, and so tends to reduce the total register count per cell. [sent-286, score-0.489]
78 The XX ruleset for our Berkeley grammar has about 343,000 rules over a 503³ cube of non-terminal symbols. [sent-295, score-0.301]
79 This requires 6x2x2=24 GPU kernels, each of which encodes 2x2x2=8 code blocks (recall that each of the 8 SMXs executes a different code block from the same kernel). [sent-297, score-0.432]
80 Most importantly, the reload rate (the mean number of major cells containing a given symbol, or the mean number of times a symbol needs to be reloaded from main memory) drops to about 6 (vs. the much larger naive figure above). [sent-298, score-0.385]
81 Each symbol is used on average 343,000/501 ≈ 6000 times overall by the XX kernel (footnote: this cube decomposition also respects the constant cache and instruction cache limits). [sent-301, score-0.878]
82 Dropping the reload factor to 6 means that for every 1 main memory load of a symbol, there are approximately 1000 register or L2 cache reuses. [sent-302, score-0.564]
83 A little further calculation shows that L2 cache items are used a little more than twice, so the register reuse rate within kernel code blocks is close to 500 on average. [sent-303, score-0.6]
84 Note that while the maximum number of registers per thread in the GTX-680 or K10 is 63, the average number of variables per minor cube is over 80 for our best-performing kernel, showing that a number of variables have non-overlapping lifetimes. [sent-305, score-0.665]
85 However, the CUDATM compiler reorders variables anyway, with slightly worse performance on average (there seems to be no way around this other than generating assembly code directly). [sent-307, score-0.321]
86 A recursive program must create software stack space in either shared memory or main memory, which has a serious performance impact on small function calls. [sent-315, score-0.523]
87 Instead, we use an iterative version of Viterbi parse extraction which uses pre-allocated array storage for its output, such that the partially-complete output array encodes all the information the algorithm needs to proceed. [sent-316, score-0.334]
88 The symbol for each node can obviously be stored with the height. [sent-326, score-0.44]
89 Otherwise, if the node is a non-terminal, we then find its left and right children, entering their respective symbols and widths into the array representing the tree. [sent-348, score-0.396]
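A simplified, hedged sketch of that iterative extraction (single-threaded control flow for clarity; the real kernel uses a whole thread block per tree, and bestSplit below is a placeholder for the BestBinary/BestUnary search): the pre-allocated array of (symbol, width) entries doubles as the work queue.

    struct Node { int sym; int width; };   // width == 1 marks a terminal

    // placeholder for the real best-split search over the grammar's rules
    __device__ int bestSplit(int sym, int width, int *lSym, int *rSym) {
        *lSym = sym; *rSym = sym;          // dummy child symbols
        return width / 2;                  // dummy split point
    }

    __global__ void extractTree(Node *tree, int sentLen) {
        // tree must hold 2*sentLen - 1 nodes; shown single-threaded (<<<1,1>>>)
        tree[0].sym = 0;                   // root symbol (placeholder)
        tree[0].width = sentLen;
        int pos = 0, filled = 1;
        while (pos < filled) {             // no recursion, no software stack
            Node n = tree[pos++];
            if (n.width > 1) {             // non-terminal: emit its two children
                int l, r;
                int lw = bestSplit(n.sym, n.width, &l, &r);
                tree[filled++] = Node{ l, lw };
                tree[filled++] = Node{ r, n.width - lw };
            }
        }
    }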
90 Each parse tree is handled by a separate thread block (thread blocks are groups of threads that can communicate through shared memory, and run on a single SMX). [sent-350, score-0.662]
91 Each thread block includes a number of threads which are used to rapidly (in partly parallel fashion) iterate through rulesets and symbol vectors for the BestBinary and BestUnary operations using coalesced memory accesses. [sent-351, score-1.064]
92 Each thread block first loads the complete set of L and R scores for the current split being explored. [sent-352, score-0.351]
93 Recall that these are in consecutive memory locations using the “spanmajor” ordering, so these loads are coalesced. [sent-353, score-0.267]
94 Then the thread block parallel-iterates through the rules for the current parent symbol, which will be in a contiguous block of memory since the rules are sorted by parent symbol, so these accesses are again coalesced. [sent-354, score-0.941]
95 The thread block therefore needs storage for all the L, R symbol scores and in addition working storage proportional to the number of threads (to hold the best child symbol and its score from each thread). [sent-355, score-1.264]
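A hedged sketch of that per-thread-block pattern (rule storage layout and names are assumptions): each thread scans a strided slice of the current parent's rules (coalesced), then a shared-memory tree reduction yields the best-scoring child, using working storage proportional to the thread count exactly as described.

    #define NTHREADS 128

    __global__ void bestBinary(const int *ruleChild, const float *ruleScore,
                               int nRules, int *bestChild, float *bestScore) {
        __shared__ float sScore[NTHREADS];
        __shared__ int   sChild[NTHREADS];
        int t = threadIdx.x;
        float best = -1e30f;
        int   arg  = -1;
        for (int r = t; r < nRules; r += NTHREADS)       // coalesced sweep over the rules
            if (ruleScore[r] > best) { best = ruleScore[r]; arg = ruleChild[r]; }
        sScore[t] = best; sChild[t] = arg;
        __syncthreads();
        for (int s = NTHREADS / 2; s > 0; s >>= 1) {     // shared-memory tree reduction
            if (t < s && sScore[t + s] > sScore[t]) {
                sScore[t] = sScore[t + s]; sChild[t] = sChild[t + s];
            }
            __syncthreads();
        }
        if (t == 0) { *bestChild = sChild[0]; *bestScore = sScore[0]; }
    }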
96 The number of threads is chosen to maximize speed: too few will cause each thread to do more work and to run more slowly. [sent-356, score-0.353]
97 Too many will limit the number of thread blocks that can run concurrently (since the total number of threads concurrently running on an SMX is 1024). [sent-357, score-0.468]
98 The high-level parser code is written in a matrix library in the Scala language, which accesses GPU code through JNI using the JCUDA wrapper library for CUDATM. [sent-362, score-0.335]
99 However, since symbol variables correspond almost one-to-one with registers (modulo lifetime overlap and reuse, which our code generator is slightly better at than the compiler), there is no reason for our code generator not to generate assembly code directly. [sent-381, score-0.911]
100 Presumably assembly code is much faster to translate into kernel modules than C source, and hopefully this will lead to much faster kernel generation. [sent-382, score-0.326]
wordName wordTfidf (topN-words)
[('gpu', 0.405), ('symbol', 0.289), ('memory', 0.221), ('thread', 0.212), ('gpus', 0.198), ('cache', 0.171), ('symbols', 0.159), ('register', 0.141), ('threads', 0.141), ('bandwidth', 0.137), ('smx', 0.137), ('partitioning', 0.13), ('workspace', 0.122), ('cube', 0.12), ('code', 0.108), ('keplertm', 0.106), ('compiler', 0.106), ('grammar', 0.106), ('array', 0.102), ('registers', 0.101), ('block', 0.093), ('arrays', 0.091), ('smxs', 0.091), ('instruction', 0.09), ('storage', 0.09), ('cky', 0.089), ('parent', 0.086), ('stored', 0.084), ('cpu', 0.083), ('chart', 0.081), ('nvidia', 0.079), ('throughput', 0.079), ('assembly', 0.076), ('cudatm', 0.076), ('lrtmp', 0.076), ('warp', 0.076), ('blocks', 0.075), ('rules', 0.075), ('viterbi', 0.073), ('kernel', 0.071), ('node', 0.067), ('berkeley', 0.066), ('cells', 0.065), ('parser', 0.064), ('coalesced', 0.061), ('gflops', 0.061), ('smxes', 0.061), ('subcubes', 0.061), ('blocking', 0.06), ('child', 0.06), ('per', 0.059), ('span', 0.056), ('rule', 0.056), ('cubes', 0.056), ('access', 0.055), ('compilation', 0.053), ('floating', 0.053), ('loaded', 0.053), ('minor', 0.052), ('tree', 0.051), ('executed', 0.05), ('shared', 0.05), ('charts', 0.048), ('executes', 0.048), ('slower', 0.047), ('operations', 0.047), ('spectral', 0.046), ('accesses', 0.046), ('canny', 0.046), ('fma', 0.046), ('loads', 0.046), ('simd', 0.046), ('teraflop', 0.046), ('generator', 0.045), ('queue', 0.045), ('aggregate', 0.043), ('atomic', 0.042), ('constituency', 0.042), ('instructions', 0.041), ('parse', 0.04), ('pruning', 0.04), ('concurrently', 0.04), ('kernels', 0.038), ('magnitude', 0.037), ('limits', 0.037), ('right', 0.036), ('processors', 0.036), ('partition', 0.036), ('speed', 0.035), ('petrov', 0.035), ('updates', 0.035), ('iterates', 0.034), ('reuse', 0.034), ('minimize', 0.033), ('terminal', 0.033), ('left', 0.032), ('variables', 0.031), ('variable', 0.031), ('inside', 0.031), ('main', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999857 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
Author: John Canny ; David Hall ; Dan Klein
Abstract: Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length-30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.
2 0.11987767 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering
Author: Maryam Siahbani ; Baskaran Sankaran ; Anoop Sarkar
Abstract: Left-to-right (LR) decoding (Watanabe et al., 2006b) is a promising decoding algorithm for hierarchical phrase-based translation (Hiero). It generates the target sentence by extending the hypotheses only on the right edge. LR decoding has complexity O(n^2 b) for input of n words and beam size b, compared to O(n^3) for the CKY algorithm. It requires a single language model (LM) history for each target hypothesis rather than two LM histories per hypothesis as in CKY. In this paper we present an augmented LR decoding algorithm that builds on the original algorithm in (Watanabe et al., 2006b). Unlike that algorithm, using experiments over multiple language pairs we show two new results: our LR decoding algorithm provides demonstrably more efficient decoding than CKY Hiero, four times faster; and by introducing new distortion and reordering features for LR decoding, it maintains the same translation quality (as in BLEU scores) obtained by phrase-based and CKY Hiero with the same translation model.
3 0.091510795 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
Author: Xinyan Xiao ; Deyi Xiong
Abstract: Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional twostep pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than previous max-likelihood estimation method.
4 0.084231175 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
Author: Zhiyuan Chen ; Arjun Mukherjee ; Bing Liu ; Meichun Hsu ; Malu Castellanos ; Riddhiman Ghosh
Abstract: Aspect extraction is one of the key tasks in sentiment analysis. In recent years, statistical models have been used for the task. However, such models without any domain knowledge often produce aspects that are not interpretable in applications. To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. However, existing knowledge-based topic models have several major shortcomings, e.g., little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). Experiments on real-life product reviews from a variety of domains show that MCLDA outperforms the existing state-of-the-art models markedly.
5 0.082060866 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations
Author: Shashank Srivastava ; Dirk Hovy ; Eduard Hovy
Abstract: In this paper, we propose a walk-based graph kernel that generalizes the notion of treekernels to continuous spaces. Our proposed approach subsumes a general framework for word-similarity, and in particular, provides a flexible way to incorporate distributed representations. Using vector representations, such an approach captures both distributional semantic similarities among words as well as the structural relations between them (encoded as the structure of the parse tree). We show an efficient formulation to compute this kernel using simple matrix operations. We present our results on three diverse NLP tasks, showing state-of-the-art results.
6 0.077242143 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology
7 0.073794574 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
8 0.073755138 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
9 0.067032129 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
10 0.066406444 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation
11 0.065192938 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
12 0.061274823 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation
13 0.058557145 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
14 0.056233969 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation
15 0.053913184 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
16 0.052878704 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
17 0.050010961 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
18 0.048836254 41 emnlp-2013-Building Event Threads out of Multiple News Articles
19 0.048460502 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation
20 0.047637951 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
topicId topicWeight
[(0, -0.158), (1, -0.058), (2, 0.017), (3, 0.045), (4, -0.045), (5, 0.032), (6, 0.007), (7, -0.065), (8, 0.001), (9, 0.116), (10, -0.037), (11, -0.017), (12, -0.155), (13, 0.101), (14, 0.021), (15, 0.069), (16, -0.053), (17, 0.023), (18, -0.042), (19, -0.044), (20, -0.004), (21, -0.142), (22, 0.126), (23, 0.015), (24, -0.096), (25, -0.034), (26, -0.046), (27, -0.045), (28, 0.089), (29, -0.115), (30, -0.071), (31, -0.039), (32, -0.031), (33, -0.115), (34, -0.003), (35, 0.083), (36, -0.097), (37, -0.014), (38, -0.079), (39, -0.025), (40, -0.074), (41, -0.021), (42, -0.053), (43, -0.056), (44, 0.012), (45, 0.019), (46, 0.113), (47, -0.012), (48, -0.12), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.96660334 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
Author: John Canny ; David Hall ; Dan Klein
Abstract: Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length-30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.
2 0.63476491 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
Author: Joseph Le Roux ; Antoine Rozenknop ; Jennifer Foster
Abstract: It has recently been shown that different NLP models can be effectively combined using dual decomposition. In this paper we demonstrate that PCFG-LA parsing models are suitable for combination in this way. We experiment with the different models which result from alternative methods of extracting a grammar from a treebank (retaining or discarding function labels, left binarization versus right binarization) and achieve a labeled Parseval F-score of 92.4 on Wall Street Journal Section 23 this represents an absolute improvement of 0.7 and an error reduction rate of 7% over a strong PCFG-LA product-model baseline. Although we experiment only with binarization and function labels in this study, there is much scope for applying this approach to – other grammar extraction strategies.
3 0.58099872 14 emnlp-2013-A Synchronous Context Free Grammar for Time Normalization
Author: Steven Bethard
Abstract: We present an approach to time normalization (e.g. the day before yesterday ⇒ 2013-04-12) based on a synchronous context free grammar. Synchronous rules map the source language to formally defined operators for manipulating times (FINDENCLOSED, STARTATENDOF, etc.). Time expressions are then parsed using an extended CYK+ algorithm, and converted to a normalized form by applying the operators recursively. For evaluation, a small set of synchronous rules for English time expressions were developed. Our model outperforms HeidelTime, the best time normalization system in TempEval 2013, on four different time normalization corpora.
4 0.50648475 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation
Author: Ioannis Konstas ; Mirella Lapata
Abstract: In a language generation system, a content planner selects which elements must be included in the output text and the ordering between them. Recent empirical approaches perform content selection without any ordering and have thus no means to ensure that the output is coherent. In this paper we focus on the problem of generating text from a database and present a trainable end-to-end generation system that includes both content selection and ordering. Content plans are represented intuitively by a set of grammar rules that operate on the document level and are acquired automatically from training data. We develop two approaches: the first one is inspired from Rhetorical Structure Theory and represents the document as a tree of discourse relations between database records; the second one requires little linguistic sophistication and uses tree structures to represent global patterns of database record sequences within a document. Experimental evaluation on two domains yields considerable improvements over the state of the art for both approaches.
5 0.50127411 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
Author: Raphael Bailly ; Xavier Carreras ; Franco M. Luque ; Ariadna Quattoni
Abstract: We derive a spectral method for unsupervised learning of Weighted Context Free Grammars. We frame WCFG induction as finding a Hankel matrix that has low rank and is linearly constrained to represent a function computed by inside-outside recursions. The proposed algorithm picks the grammar that agrees with a sample and is the simplest with respect to the nuclear norm of the Hankel matrix.
7 0.46129164 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
8 0.45666471 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation
9 0.44962338 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming
10 0.44509697 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology
11 0.43308029 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering
12 0.43262213 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
13 0.40290293 188 emnlp-2013-Tree Kernel-based Negation and Speculation Scope Detection with Structured Syntactic Parse Features
14 0.38926306 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
15 0.38555902 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
16 0.38477874 58 emnlp-2013-Dependency Language Models for Sentence Completion
17 0.36911109 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations
18 0.36511654 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time
19 0.35689473 161 emnlp-2013-Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!
20 0.35410532 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations
topicId topicWeight
[(3, 0.02), (18, 0.062), (22, 0.027), (30, 0.122), (50, 0.019), (51, 0.127), (64, 0.012), (66, 0.019), (71, 0.015), (75, 0.023), (90, 0.013), (96, 0.027), (97, 0.411)]
simIndex simValue paperId paperTitle
same-paper 1 0.81272531 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
Author: John Canny ; David Hall ; Dan Klein
Abstract: Constituency parsing with rich grammars remains a computational challenge. Graphics Processing Units (GPUs) have previously been used to accelerate CKY chart evaluation, but gains over CPU parsers were modest. In this paper, we describe a collection of new techniques that enable chart evaluation at close to the GPU’s practical maximum speed (a Teraflop), or around a half-trillion rule evaluations per second. Net parser performance on a 4-GPU system is over 1 thousand length-30 sentences/second (1 trillion rules/sec), and 400 general sentences/second for the Berkeley Parser Grammar. The techniques we introduce include grammar compilation, recursive symbol blocking, and cache-sharing.
2 0.68974996 23 emnlp-2013-Animacy Detection with Voting Models
Author: Joshua Moore ; Christopher J.C. Burges ; Erin Renshaw ; Wen-tau Yih
Abstract: Animacy detection is a problem whose solution has been shown to be beneficial for a number of syntactic and semantic tasks. We present a state-of-the-art system for this task which uses a number of simple classifiers with heterogeneous data sources in a voting scheme. We show how this framework can give us direct insight into the behavior of the system, allowing us to more easily diagnose sources of error.
3 0.67258036 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation
Author: Nicholas FitzGerald ; Yoav Artzi ; Luke Zettlemoyer
Abstract: We present a new approach to referring expression generation, casting it as a density estimation problem where the goal is to learn distributions over logical expressions identifying sets of objects in the world. Despite an extremely large space of possible expressions, we demonstrate effective learning of a globally normalized log-linear distribution. This learning is enabled by a new, multi-stage approximate inference technique that uses a pruning model to construct only the most likely logical forms. We train and evaluate the approach on a new corpus of references to sets of visual objects. Experiments show the approach is able to learn accurate models, which generate over 87% of the expressions people used. Additionally, on the previously studied special case of single object reference, we show a 35% relative error reduction over previous state of the art.
4 0.45640862 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation
Author: Rui Fang ; Changsong Liu ; Lanbo She ; Joyce Y. Chai
Abstract: In situated dialogue, humans and agents have mismatched capabilities of perceiving the shared environment. Their representations of the shared world are misaligned. Thus referring expression generation (REG) will need to take this discrepancy into consideration. To address this issue, we developed a hypergraph-based approach to account for group-based spatial relations and uncertainties in perceiving the environment. Our empirical results have shown that this approach outperforms a previous graph-based approach with an absolute gain of 9%. However, while these graph-based approaches perform effectively when the agent has perfect knowledge or perception of the environment (e.g., 84%), they perform rather poorly when the agent has imperfect perception of the environment (e.g., 45%). This big performance gap calls for new solutions to REG that can mediate a shared perceptual basis in situated dialogue.
5 0.43176132 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching
Author: Tom Kwiatkowski ; Eunsol Choi ; Yoav Artzi ; Luke Zettlemoyer
Abstract: We consider the challenge of learning semantic parsers that scale to large, open-domain problems, such as question answering with Freebase. In such settings, the sentences cover a wide variety of topics and include many phrases whose meaning is difficult to represent in a fixed target ontology. For example, even simple phrases such as ‘daughter’ and ‘number of people living in’ cannot be directly represented in Freebase, whose ontology instead encodes facts about gender, parenthood, and population. In this paper, we introduce a new semantic parsing approach that learns to resolve such ontological mismatches. The parser is learned from question-answer pairs, uses a probabilistic CCG to build linguistically motivated logicalform meaning representations, and includes an ontology matching model that adapts the output logical forms for each target ontology. Experiments demonstrate state-of-the-art performance on two benchmark semantic parsing datasets, including a nine point accuracy improvement on a recent Freebase QA corpus.
6 0.42083448 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation
7 0.41235667 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
8 0.40894526 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
9 0.40612558 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation
10 0.40069923 172 emnlp-2013-Simple Customization of Recursive Neural Networks for Semantic Relation Classification
11 0.40060791 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
12 0.39916262 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
13 0.3965359 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
14 0.39427111 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
15 0.39344114 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
16 0.39280263 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
17 0.39269942 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
18 0.39245826 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging
19 0.39208835 143 emnlp-2013-Open Domain Targeted Sentiment
20 0.39204907 139 emnlp-2013-Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora