nips nips2010 nips2010-255 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nikos Karampatziakis
Abstract: We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands of bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. [sent-3, score-0.961]
2 By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands of bytes long. [sent-5, score-0.344]
3 1 Introduction In this work, we are interested in the problem of extracting the CPU instructions that comprise a binary executable file. [sent-8, score-0.676]
4 In particular we are motivated by a computer security application, in which we want to detect whether a previously unseen executable contains malicious code. [sent-10, score-0.355]
5 This is a task that computer security experts have to solve many times every day because, in the last few years, the volume of malicious software has increased exponentially (estimated at 50,000 new malicious code samples per day). [sent-11, score-0.366]
6 This happens because the tools themselves are based on heuristics and make many assumptions about the way a binary executable is structured. [sent-13, score-0.321]
7 But why is it hard to find the instructions inside a binary executable? [sent-14, score-0.529]
8 After all, when running a program the CPU always knows which instructions it is executing. [sent-15, score-0.474]
9 The caveat here is that we want to extract the instructions from the executable without running it. [sent-16, score-0.672]
10 On one hand, running the executable will in general reveal little information about all possible instructions in the program, and on the other hand it may be dangerous or even misleading. [sent-17, score-0.647]
11 Another issue that makes this task challenging is that binary executables contain many other things besides the instructions they will execute. [sent-18, score-0.629]
12 Furthermore, the executable does not contain any demarcations of the locations of instructions in the file. [sent-19, score-0.647]
13 Nevertheless, an executable file is organized into sections such as a code section, a section with constants, a section containing global variables, etc. [sent-20, score-0.358]
14 2 Here, we are focusing on Windows executables for the Intel x86 architecture, though everything carries over to any other modern operating system and any other architecture with a complex instruction set. [sent-26, score-0.644]
15 3 Executables that contain debugging information are an exception, but most software is released without it. We refer to all instructions as code and to everything else as data. [sent-27, score-0.648]
16 For example, the compiler may, for performance reasons, prefer to pad a function with up to 3 data bytes so that the next function starts at an address that is a multiple of 4. [sent-28, score-0.33]
17 This table does not contain any instructions, yet it can be stored together with the instructions that make up the function in which the “switch” statement appears. [sent-31, score-0.443]
18 Apart from the compiler, the author of a malicious program can also insert data bytes in the code section of her program. [sent-32, score-0.492]
19 In this sense, words play the role of instructions and letters play the role of bytes. [sent-44, score-0.443]
20 For complex instruction sets like the Intel x86, instructions are composed of a variable number of bytes, as words are composed of a variable number of letters. [sent-45, score-0.93]
21 This analogy motivates tackling our problem as predicting a segmentation of the input sequence into blocks of code and blocks of data. [sent-51, score-0.671]
22 An obvious first approach for this task would be to treat it as a sequence labeling problem and train, for example, a linear chain conditional random field (CRF) [2] to tag each byte in the sequence as being the beginning, inside, or outside of a data block. [sent-52, score-0.4]
23 To learn the parameters of the model, we will use structural SVMs to optimize performance according to loss functions that are appropriate for our task, such as the number of incorrect plus missed CPU instructions induced by the segmentation. [sent-57, score-0.628]
24 2 Heuristic tools for analyzing binary executables Tools for statically analyzing binary executables differ in the details of their workings but they all share the same high level logic, which is called recursive disassembly. [sent-61, score-0.512]
25 5 The tool starts by obtaining the address of the first instruction from a specific location inside the executable. [sent-62, score-0.612]
26 All the disassembled instructions would execute one after the other until we reach an instruction that changes the flow of execution. [sent-73, score-0.961]
27 After the execution of an unconditional jump the next instruction to be executed is at the address specified by the jump’s argument. [sent-75, score-0.627]
28 Other control flow instructions are similar to the unconditional jump. [sent-76, score-0.507]
29 A call saves the address of the next instruction and then jumps. [sent-78, score-0.562]
30 The tool places the arguments of the control flow instructions it encounters on a stack of addresses to visit. [sent-80, score-0.469]
31 If the control flow instruction is a conditional jump or a call, it continues disassembling; otherwise it takes the next address that has not yet been disassembled from the stack and repeats. [sent-81, score-0.622]
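As a rough illustration of this high-level loop, here is a minimal Python sketch of recursive disassembly; the decoder decode_instruction(data, pos), its return values, and the instruction kinds are hypothetical stand-ins for a real x86 decoder, not code from the paper or from any of the tools mentioned above.

def recursive_disassembly(data, entry_point, decode_instruction):
    # decode_instruction(data, pos) is assumed to return (length, kind, target):
    # the instruction length in bytes, its kind ('plain', 'conditional', 'call',
    # 'jump' or 'return'), and the statically known target address, or None when
    # the argument is a register whose value is unknown at analysis time.
    stack = [entry_point]        # saved addresses that still need disassembling
    visited = set()              # positions already decoded as instruction starts
    while stack:
        pos = stack.pop()
        while 0 <= pos < len(data) and pos not in visited:
            visited.add(pos)
            length, kind, target = decode_instruction(data, pos)
            if kind in ('jump', 'conditional', 'call') and target is not None:
                stack.append(target)            # remember the argument for later
            if kind in ('plain', 'conditional', 'call'):
                pos += length                   # keep decoding the fall-through path
            else:                               # unconditional jump or return
                break                           # take the next saved address
    return visited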
32 Even though recursive disassembly seems like a robust way of extracting the instructions from a program, there are many reasons why it can fail [6]. [sent-82, score-0.676]
33 Most importantly, the arguments of the control flow instructions do not have to be constants; they can be registers whose values are generally not available during static analysis. [sent-83, score-0.507]
34 Hence, recursive disassembly can run out of addresses long before all the instructions have been extracted. [sent-84, score-0.676]
35 This means that if the call instruction was spanning positions a, . . . , a + ℓ − 1 of the sequence, [sent-94, score-0.608]
36 upon the function’s return the next instruction will be at position a + ℓ + 3, not at a + ℓ. [sent-97, score-0.595]
37 This will give a completely different decoding of the sequence and is called disassembly desynchronization. [sent-98, score-0.322]
38 To return to a text analogy, recursive disassembly parses the sequence “driverballetterrace” as [driver, ballet, terrace] while the actual parsing, obtained by starting three positions down, is [xxx, verbal, letter, race], where x denotes junk data. [sent-99, score-0.395]
39 3 A structured prediction model In this section we will combine ideas from recursive disassembly and structured prediction to derive an expressive and efficient model for predicting the instructions inside a binary executable. [sent-100, score-0.802]
40 As in recursive disassembly, if we are certain that code begins at position i we can unambiguously disassemble the byte sequence starting from position i until we reach a control flow instruction. [sent-101, score-0.697]
41 The trellis graph has vertices bi that denote the possibility that a code block starts at position i. [sent-103, score-0.637]
42 It also has vertices ej and edges (bi , ej ) which denote that disassembling from position i yields a possible code block that spans positions i, . . . , j. [sent-104, score-0.838]
43 Edges (ej , bj+1 ) and (ej , dj+1 ) encode that the next byte after a code block can either be the beginning of another code block, or a data byte respectively. [sent-109, score-0.977]
44 For data blocks no particular structure is assumed and we just use edges (dj , dj+1 ) and (dj , bj+1 ) to denote that a data byte can be followed either by another data byte or by the beginning of a code block respectively. [sent-110, score-0.996]
45 For example, the sequence in Figure 1 contains three code blocks that span positions 1–7, 8–8, and 10–12. [sent-115, score-0.437]
46 6 Some subsequences will produce errors while decoding to assembly because some bytes may not correspond to any instructions. [sent-117, score-0.351]
47 These could never be valid code blocks because they would crash the program. [sent-118, score-0.332]
48 Below this, we show the actual x86 instructions with position 9 being a data byte. [sent-120, score-0.525]
49 We show both the mnemonic instructions and the bytes they are composed of. [sent-121, score-0.664]
50 Disassembling from position i in the sequence yields an interval [i, j] where j is the position where the first encountered control flow instruction ends. [sent-132, score-0.766]
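To make the trellis concrete, the following is a minimal sketch of a left-to-right dynamic program over it; candidate_block(i), score_block(i, j), and score_data(i) are hypothetical helpers standing in for the unambiguous disassembly from position i and for the learned edge scores, and this is only one way such linear-time inference could be organized, not the paper's implementation.

def best_segmentation(n, candidate_block, score_block, score_data):
    # best[i] is the score of the best segmentation of bytes 0..i-1 and
    # back[i] records how position i was reached, so the path can be recovered.
    NEG = float('-inf')
    best = [NEG] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == NEG:
            continue
        # edge into d_i: treat byte i as data
        s = best[i] + score_data(i)
        if s > best[i + 1]:
            best[i + 1], back[i + 1] = s, ('data', i)
        # edge (b_i, e_j): treat position i as the start of a code block
        j = candidate_block(i)          # end of the block decoded from i, or None
        if j is not None and j < n:
            s = best[i] + score_block(i, j)
            if s > best[j + 1]:
                best[j + 1], back[j + 1] = s, ('block', i)
    blocks, i = [], n                   # follow back-pointers to recover the blocks
    while i > 0:
        kind, start = back[i]
        if kind == 'block':
            blocks.append((start, i - 1))
        i = start
    return best[n], list(reversed(blocks))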
51 The first loss function, which we call block loss, comes from the observation that the beginnings of the code blocks are necessary and sufficient to describe the segmentation. [sent-141, score-0.587]
52 In the case where the inferred ŷ identifies, say, the second instruction in a block as its start, we would like to penalize this less, since the disassembly is still synchronized, and only missed one instruction. [sent-143, score-0.844]
53 Formally, we let y and ŷ be the sets of positions where the instructions start in the two segmentations and we define the instruction loss ∆I(y, ŷ) to be the cardinality of their symmetric difference. [sent-144, score-1.175]
54 For the instruction loss, the positions of the real instructions are y = {1, 4, 6, 8, 10, 11} while the proposed segmentation has ŷ = {2, 9, 10, 11}. [sent-150, score-1.143]
55 Taking the symmetric difference of these sets, we see that the instruction loss has value 6. [sent-151, score-0.578]
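The computation in this example is just the cardinality of a symmetric difference; a two-line check, using the sets quoted above, is:

y_true = {1, 4, 6, 8, 10, 11}   # positions of the real instructions
y_pred = {2, 9, 10, 11}         # positions proposed by the segmentation
print(len(y_true ^ y_pred))     # symmetric difference -> prints 6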
56 If we simply sum the losses over all sequences, then the losses on longer executables may overshadow the losses on shorter ones. [sent-153, score-0.511]
57 To represent each executable equally in the final measure we can normalize our loss functions; for example, we can define the normalized instruction loss to be ∆NI(y, ŷ) = (|y| + |ŷ| − 2|y ∩ ŷ|) / |y|, and we similarly define a normalized block loss ∆NB. [sent-154, score-1.117]
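As a small sketch, both normalized losses can be written as one function of the two sets of start positions (instruction starts for ∆NI, block beginnings for ∆NB):

def normalized_loss(y_true, y_pred):
    # (|y| + |yhat| - 2 |y & yhat|) / |y|: the symmetric difference between the
    # two sets of start positions, normalized by the size of the correct one.
    return (len(y_true) + len(y_pred) - 2 * len(y_true & y_pred)) / len(y_true)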
58 Given a training set (xi, yi), i = 1, . . . , n, of sequences and segmentations we can learn a vector of parameters w that assigns a high score to segmentation yi and a low score to all other possible segmentations of xi. [sent-160, score-0.334]
59 ∀i ∀ȳ ∈ Yi : w⊤Ψ(xi, yi) − w⊤Ψ(xi, ȳ) ≥ ∆(yi, ȳ) − ξi. The constraints of this optimization problem enforce that the difference in score between the correct segmentation yi and any incorrect segmentation ȳ is at least as large as the loss ∆(yi, ȳ). [sent-163, score-0.448]
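How such constraints are typically turned into parameter updates is sketched below with a stochastic subgradient step on the margin-rescaled hinge; the paper trains a structural SVM, but the specific optimizer, the helper names loss_aug_inference and psi, and the step size are assumptions made for this illustration only.

import numpy as np

def structural_svm_step(w, x, y_true, loss_aug_inference, psi, loss, lr=0.1, reg=1e-4):
    # Subgradient of  reg/2 ||w||^2 + max_ybar [ Delta(y, ybar) + w.psi(x, ybar) ] - w.psi(x, y).
    # With the block and instruction losses, Delta decomposes over trellis edges,
    # so loss-augmented inference can reuse the same linear-time dynamic program.
    y_hat = loss_aug_inference(w, x, y_true)        # most violated segmentation
    margin = np.dot(w, psi(x, y_true)) - np.dot(w, psi(x, y_hat))
    g = reg * w                                     # gradient of the regularizer
    if loss(y_true, y_hat) > margin:                # the constraint is violated
        g = g + psi(x, y_hat) - psi(x, y_true)      # subgradient of the hinge term
    return w - lr * g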
60 If bi ∉ y then bi defines an incorrect code block which spans bytes i, . . . , j [sent-180, score-0.733]
61 and c(bi) = 1 + |{k | bk ∈ y ∧ i < k ≤ j}|, capturing that we will introduce one incorrect block and we will skip all the blocks that begin between positions i and j. [sent-183, score-0.418]
62 For the instruction loss, y is a set of instruction positions. [sent-186, score-0.974]
63 If i ∉ y then bi is the beginning of an incorrect block that spans bytes i, . . . , j [sent-188, score-0.56]
64 and produces instructions in a set of positions ỹi. [sent-191, score-0.572]
65 Let s be the first position in this block that gets synchronized with the correct decoding. [sent-192, score-0.327]
66 The first term captures the number of incorrect instructions produced by treating bi as the start of a code block, while the second term captures the number of missed real instructions. [sent-196, score-0.73]
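A simplified sketch of these per-block loss terms is given below; it follows the definitions above for the block loss, while for the instruction loss it simply restricts the symmetric difference to the candidate block, glossing over the exact synchronization bookkeeping described in the text (the helper names and arguments are illustrative, not the paper's).

def block_loss_term(i, j, true_block_starts):
    # c(b_i): choosing an incorrect block [i, j] costs one wrong block plus
    # every true block that begins strictly after i and no later than j.
    if i in true_block_starts:
        return 0
    skipped = sum(1 for k in true_block_starts if i < k <= j)
    return 1 + skipped

def instruction_loss_term(i, j, decoded_starts, true_starts):
    # Incorrect instructions produced by decoding from i, plus real instructions
    # inside [i, j] that this decoding misses.
    produced = set(decoded_starts)
    real = {p for p in true_starts if i <= p <= j}
    return len(produced - real) + len(real - produced)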
67 This approach could not discover all code blocks, especially when the compiler was automatically inserting code that did not exist in the source, such as the calls to destructors generated by C++ compilers. [sent-204, score-0.38]
68 Since the executables we used were 200 common programs from a typical installation of Windows XP, we believe that the outputs of heuristic tools should have little noise. [sent-206, score-0.317]
69 For each edge in the graph, the byte-level features are extracted from an 11-byte window around the source of the edge (so if the source vertex is at position i, the window spans positions i − 5, . . . , i + 5). [sent-216, score-0.469]
70 The features indicate which bytes and byte pairs appear at each position inside the window. [sent-220, score-0.628]
71 An example feature is “does byte c3 appear in position i − 1?” [sent-221, score-0.318]
72 In x86 architectures, when the previous instruction is a return instruction this feature fires. [sent-223, score-1.0]
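A minimal sketch of such byte-level indicator features; only the 11-byte window and the byte/byte-pair indicators come from the text, while the feature-name scheme is made up for the example.

def byte_window_features(data, i, radius=5):
    # Indicators for which byte value and which pair of consecutive byte values
    # appear at each offset of the window i-5 .. i+5 around the source vertex.
    feats = {}
    for off in range(-radius, radius + 1):
        p = i + off
        if 0 <= p < len(data):
            feats['byte_%02x_at_%+d' % (data[p], off)] = 1.0
            if off < radius and p + 1 < len(data):
                feats['pair_%02x%02x_at_%+d' % (data[p], data[p + 1], off)] = 1.0
    return feats

# e.g. byte_window_features(bytes.fromhex('c389d83bc7'), 1)['byte_c3_at_-1'] == 1.0
# corresponds to the "does byte c3 appear in position i - 1?" feature above.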
73 The block-level features are obtained from histograms of instructions that occur in candidate code blocks. [sent-225, score-0.744]
74 We use two kinds of histograms, one where we abstract the values of the arguments of the instructions but keep their type (register, memory location or constant), and one where we completely discard all information about the arguments. [sent-228, score-0.443]
75 An example of the former type of feature would be “number of times the instruction [add register, register] appears in this block”. [sent-229, score-0.487]
76 An example of the latter type of feature would be “number of times the instruction [mov] appears in this block”. [sent-230, score-0.487]
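A sketch of the two histogram variants, assuming each decoded instruction is available as a mnemonic plus a list of operand types; this representation is an assumption about the disassembler output, not the paper's feature code.

from collections import Counter

def block_histogram_features(instructions):
    # instructions: e.g. [('add', ['register', 'register']), ('mov', ['register', 'memory'])]
    feats = Counter()
    for mnemonic, operand_types in instructions:
        feats['op_%s(%s)' % (mnemonic, ','.join(operand_types))] += 1  # keep operand types
        feats['op_%s' % mnemonic] += 1                                 # discard operands
    return feats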
77 Normalized losses (∆NX) are multiplied by the average number of bytes (L̄), instructions (Ī), or blocks (B̄) to bring all numbers to a similar scale. [sent-270, score-0.91]
78 The SVMhmm baseline tags each byte as being the beginning, inside, or outside of a code block using Viterbi and optimizes the Hamming loss. [sent-273, score-0.6]
79 Running a general segmentation model [4] was impractical since inference depends quadratically on the maximum length of the code blocks, which was 2800 in our data. [sent-274, score-0.364]
80 Table 2 shows the results of our comparison for different loss functions (columns): Hamming loss, instruction loss, block loss, and their normalized counterparts. [sent-277, score-0.731]
81 Results for normalized losses have been multiplied by the average number of bytes (L̄), instructions (Ī), or blocks (B̄) to bring all numbers to a similar scale. [sent-278, score-0.938]
82 Greedy starts decoding from the beginning of the sequence and after decoding a block (bi , ej ) it repeats at position j + 1. [sent-280, score-0.594]
83 It only marks a byte as data if the decoding fails, in which case it starts decoding from the next byte in the sequence. [sent-281, score-0.691]
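A sketch of this greedy baseline, reusing the hypothetical candidate_block helper from the inference sketch above:

def greedy_decode(n, candidate_block):
    # Decode a block, continue right after it, and mark a byte as data only
    # when decoding from it fails.
    blocks, data_bytes, i = [], [], 0
    while i < n:
        j = candidate_block(i)            # end of the block decoded from i, or None
        if j is None:
            data_bytes.append(i)          # decoding failed: byte i is data
            i += 1
        else:
            blocks.append((i, min(j, n - 1)))
            i = j + 1
    return blocks, data_bytes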
84 The results suggest that just treating our task as a simple sequence labeling problem at the level of bytes already goes a long way in terms of Hamming loss and block loss. [sent-282, score-0.572]
85 SVMhmm sometimes predicts as the beginning of a code block a position that leads to a decoding error. [sent-283, score-0.525]
86 Since it is not clear how to compute the instruction loss in this case, we do not report instruction losses for this model. [sent-284, score-1.164]
87 For the Hamming loss, this is expected since the SVMwis models are more expressive and a small block loss or instruction loss implies a small Hamming loss, but not vice versa. [sent-291, score-0.862]
88 For the instruction loss, we believe this occurs for two reasons. [sent-292, score-0.487]
89 First, our data consists of benign programs, and for them learning to identify the code blocks may be enough. [sent-293, score-0.375]
90 Second, it may be harder to learn with the instruction loss since its value depends on how quickly each decoding synchronizes with another (the correct) decoding of the stream, something that is not modeled in the feature map we are using. [sent-294, score-0.762]
91 The end result is that the models trained for block loss also attain the smallest losses for all other loss functions. [sent-295, score-0.458]
92 5 Related work and other applications There are two lines of research which are relevant to this work: one is structured prediction approaches for segmenting sequences and the other is research on static analysis techniques for finding code and data blocks in executables. [sent-296, score-0.339]
93 Previous techniques for identifying code blocks in executables have used no or very little statistical learning. [sent-305, score-0.458]
94 For example, [10] and [11] use recursive disassembly and pattern heuristics similarly to currently used tools such as OllyDbg and IdaPro. [sent-306, score-0.321]
95 In that work, the authors use simple statistical models based on unigram and bigram instruction models in addition to the pattern heuristics. [sent-308, score-0.487]
96 However, these approaches make independent decisions for every candidate code block and they have a less principled way of dealing with equally plausible but overlapping code blocks. [sent-309, score-0.461]
97 One way to approximate this feature would be to count how many candidate code blocks have instructions that jump to or call the current position in the sequence. [sent-315, score-0.937]
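A sketch of this approximation; block_targets, holding the statically known jump/call targets found inside each candidate block, is assumed to be available from the disassembly step and is not something the paper defines explicitly.

def call_target_counts(n, block_targets):
    # counts[p] approximates how many candidate code blocks contain a jump or
    # call whose target is position p.
    counts = [0] * n
    for targets in block_targets:
        for t in set(targets):
            if 0 <= t < n:
                counts[t] += 1
    return counts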
98 6 Conclusions In this work we proposed a code segmentation model, SVMwis, that can help security experts in the static analysis of binary executables. [sent-327, score-0.395]
99 We also want to look into richer features, such as some approximation of call consistency (since the actual constraints give rise to NP-hard inference), so that addresses which are targets of call or jump instructions from a code block do not lie inside data blocks. [sent-332, score-1.017]
100 Finally, we plan to extend our model to allow for joint segmentation and classification of the executable as malicious or not. [sent-333, score-0.424]
wordName wordTfidf (topN-words)
[('instruction', 0.487), ('instructions', 0.443), ('byte', 0.236), ('bytes', 0.221), ('executable', 0.204), ('disassembly', 0.173), ('executables', 0.157), ('code', 0.154), ('block', 0.153), ('blocks', 0.147), ('segmentation', 0.134), ('svmwis', 0.11), ('losses', 0.099), ('decoding', 0.092), ('loss', 0.091), ('malicious', 0.086), ('ej', 0.083), ('position', 0.082), ('positions', 0.079), ('disassembling', 0.079), ('svmhmm', 0.079), ('segmentations', 0.075), ('programs', 0.074), ('jump', 0.069), ('hamming', 0.065), ('eax', 0.063), ('ollydbg', 0.063), ('trellis', 0.063), ('bi', 0.063), ('ow', 0.062), ('recursive', 0.06), ('vertices', 0.059), ('sequence', 0.057), ('inside', 0.057), ('tools', 0.052), ('scheduling', 0.052), ('debugging', 0.051), ('labeling', 0.05), ('yi', 0.05), ('inference', 0.049), ('decodings', 0.047), ('edi', 0.047), ('jnz', 0.047), ('ad', 0.044), ('beginning', 0.044), ('dj', 0.043), ('call', 0.042), ('register', 0.041), ('compiler', 0.041), ('sir', 0.041), ('expressive', 0.04), ('stack', 0.04), ('spans', 0.04), ('security', 0.04), ('incorrect', 0.039), ('static', 0.038), ('unconditional', 0.038), ('assembly', 0.038), ('heuristics', 0.036), ('starts', 0.035), ('heuristic', 0.034), ('costs', 0.033), ('address', 0.033), ('interval', 0.032), ('analogy', 0.032), ('features', 0.032), ('crash', 0.031), ('disassembled', 0.031), ('ebx', 0.031), ('idapro', 0.031), ('mov', 0.031), ('nop', 0.031), ('rebels', 0.031), ('usenix', 0.031), ('calls', 0.031), ('missed', 0.031), ('program', 0.031), ('path', 0.031), ('binary', 0.029), ('inc', 0.029), ('graph', 0.028), ('hmm', 0.028), ('svms', 0.028), ('workings', 0.028), ('crf', 0.027), ('fold', 0.027), ('length', 0.027), ('return', 0.026), ('edges', 0.026), ('control', 0.026), ('joachims', 0.026), ('misses', 0.025), ('saved', 0.025), ('confuse', 0.025), ('cpu', 0.025), ('news', 0.025), ('want', 0.025), ('structural', 0.024), ('ads', 0.024), ('trained', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
Author: Nikos Karampatziakis
Abstract: We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline. 1
2 0.067737564 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures
Author: Kamiya Motwani, Nagesh Adluru, Chris Hinrichs, Andrew Alexander, Vikas Singh
Abstract: We study the problem of segmenting specific white matter structures of interest from Diffusion Tensor (DT-MR) images of the human brain. This is an important requirement in many Neuroimaging studies: for instance, to evaluate whether a brain structure exhibits group level differences as a function of disease in a set of images. Typically, interactive expert guided segmentation has been the method of choice for such applications, but this is tedious for large datasets common today. To address this problem, we endow an image segmentation algorithm with “advice” encoding some global characteristics of the region(s) we want to extract. This is accomplished by constructing (using expert-segmented images) an epitome of a specific region – as a histogram over a bag of ‘words’ (e.g., suitable feature descriptors). Now, given such a representation, the problem reduces to segmenting a new brain image with additional constraints that enforce consistency between the segmented foreground and the pre-specified histogram over features. We present combinatorial approximation algorithms to incorporate such domain specific constraints for Markov Random Field (MRF) segmentation. Making use of recent results on image co-segmentation, we derive effective solution strategies for our problem. We provide an analysis of solution quality, and present promising experimental evidence showing that many structures of interest in Neuroscience can be extracted reliably from 3-D brain image volumes using our algorithm. 1
3 0.063088134 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
Author: Min Yang, Linli Xu, Martha White, Dale Schuurmans, Yao-liang Yu
Abstract: Robust regression and classification are often thought to require non-convex loss functions that prevent scalable, global training. However, such a view neglects the possibility of reformulated training methods that can yield practically solvable alternatives. A natural way to make a loss function more robust to outliers is to truncate loss values that exceed a maximum threshold. We demonstrate that a relaxation of this form of “loss clipping” can be made globally solvable and applicable to any standard loss while guaranteeing robustness against outliers. We present a generic procedure that can be applied to standard loss functions and demonstrate improved robustness in regression and classification problems. 1
4 0.057908297 181 nips-2010-Network Flow Algorithms for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Francis R. Bach, Guillaume R. Obozinski
Abstract: We consider a class of learning problems that involve a structured sparsityinducing norm defined as the sum of ℓ∞ -norms over groups of variables. Whereas a lot of effort has been put in developing fast optimization methods when the groups are disjoint or embedded in a specific hierarchical structure, we address here the case of general overlapping groups. To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure which computes its solution exactly in polynomial time. Our algorithm scales up to millions of variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems. 1
5 0.055867326 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
Author: Tamir Hazan, Raquel Urtasun
Abstract: In this paper we propose an approximated structured prediction framework for large scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient messagepassing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to learn efficiently graphical models with cycles and very large number of parameters. 1
6 0.055803291 228 nips-2010-Reverse Multi-Label Learning
7 0.055621333 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
8 0.053011864 234 nips-2010-Segmentation as Maximum-Weight Independent Set
9 0.052865218 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
10 0.052295327 91 nips-2010-Fast detection of multiple change-points shared by many signals using group LARS
11 0.052080721 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
12 0.050334755 282 nips-2010-Variable margin losses for classifier design
13 0.050101008 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
14 0.04894729 257 nips-2010-Structured Determinantal Point Processes
15 0.048237376 118 nips-2010-Implicit Differentiation by Perturbation
16 0.047573622 162 nips-2010-Link Discovery using Graph Feature Tracking
17 0.047532018 167 nips-2010-Mixture of time-warped trajectory models for movement decoding
18 0.046784945 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
19 0.046106633 151 nips-2010-Learning from Candidate Labeling Sets
20 0.04565284 235 nips-2010-Self-Paced Learning for Latent Variable Models
topicId topicWeight
[(0, 0.135), (1, 0.03), (2, -0.003), (3, -0.012), (4, -0.04), (5, -0.037), (6, -0.036), (7, 0.013), (8, 0.039), (9, -0.005), (10, -0.067), (11, -0.04), (12, 0.038), (13, 0.074), (14, 0.006), (15, 0.032), (16, -0.012), (17, 0.001), (18, -0.009), (19, -0.068), (20, 0.023), (21, 0.031), (22, 0.026), (23, 0.012), (24, 0.06), (25, -0.064), (26, -0.001), (27, 0.038), (28, 0.015), (29, 0.071), (30, 0.049), (31, 0.051), (32, -0.01), (33, -0.097), (34, -0.07), (35, 0.026), (36, -0.04), (37, -0.054), (38, 0.047), (39, 0.046), (40, 0.048), (41, 0.034), (42, -0.044), (43, -0.074), (44, -0.007), (45, 0.06), (46, 0.032), (47, -0.078), (48, -0.04), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9053039 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
Author: Nikos Karampatziakis
Abstract: We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands of bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline. 1
2 0.62242162 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
Author: Tamir Hazan, Raquel Urtasun
Abstract: In this paper we propose an approximated structured prediction framework for large scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient messagepassing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to learn efficiently graphical models with cycles and very large number of parameters. 1
3 0.59796387 181 nips-2010-Network Flow Algorithms for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Francis R. Bach, Guillaume R. Obozinski
Abstract: We consider a class of learning problems that involve a structured sparsityinducing norm defined as the sum of ℓ∞ -norms over groups of variables. Whereas a lot of effort has been put in developing fast optimization methods when the groups are disjoint or embedded in a specific hierarchical structure, we address here the case of general overlapping groups. To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure which computes its solution exactly in polynomial time. Our algorithm scales up to millions of variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems. 1
4 0.58551466 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
Author: Ronny Luss, Saharon Rosset, Moni Shahar
Abstract: A new algorithm for isotonic regression is presented based on recursively partitioning the solution space. We develop efficient methods for each partitioning subproblem through an equivalent representation as a network flow problem, and prove that this sequence of partitions converges to the global solution. These network flow problems can further be decomposed in order to solve very large problems. Success of isotonic regression in prediction and our algorithm’s favorable computational properties are demonstrated through simulated examples as large as 2 × 105 variables and 107 constraints.
5 0.57285756 234 nips-2010-Segmentation as Maximum-Weight Independent Set
Author: William Brendel, Sinisa Todorovic
Abstract: Given an ensemble of distinct, low-level segmentations of an image, our goal is to identify visually “meaningful” segments in the ensemble. Knowledge about any specific objects and surfaces present in the image is not available. The selection of image regions occupied by objects is formalized as the maximum-weight independent set (MWIS) problem. MWIS is the heaviest subset of mutually non-adjacent nodes of an attributed graph. We construct such a graph from all segments in the ensemble. Then, MWIS selects maximally distinctive segments that together partition the image. A new MWIS algorithm is presented. The algorithm seeks a solution directly in the discrete domain, instead of relaxing MWIS to a continuous problem, as common in previous work. It iteratively finds a candidate discrete solution of the Taylor series expansion of the original MWIS objective function around the previous solution. The algorithm is shown to converge to an optimum. Our empirical evaluation on the benchmark Berkeley segmentation dataset shows that the new algorithm eliminates the need for hand-picking optimal input parameters of the state-of-the-art segmenters, and outperforms their best, manually optimized results.
6 0.52067792 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
7 0.4801861 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
8 0.47491866 77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures
9 0.46454114 118 nips-2010-Implicit Differentiation by Perturbation
10 0.46051651 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
11 0.45207757 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
12 0.45143706 269 nips-2010-Throttling Poisson Processes
13 0.45045918 187 nips-2010-Occlusion Detection and Motion Estimation with Convex Optimization
14 0.44985121 290 nips-2010-t-logistic regression
15 0.44727162 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
16 0.43528566 1 nips-2010-(RF)^2 -- Random Forest Random Field
17 0.43212384 228 nips-2010-Reverse Multi-Label Learning
18 0.42975017 2 nips-2010-A Bayesian Approach to Concept Drift
19 0.42654946 171 nips-2010-Movement extraction by detecting dynamics switches and repetitions
20 0.42623222 259 nips-2010-Subgraph Detection Using Eigenvector L1 Norms
topicId topicWeight
[(13, 0.019), (17, 0.018), (27, 0.036), (30, 0.046), (35, 0.011), (45, 0.671), (50, 0.028), (52, 0.012), (60, 0.021), (77, 0.034), (78, 0.014), (90, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99917281 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
Author: Nikos Karampatziakis
Abstract: We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands of bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline. 1
2 0.99850756 235 nips-2010-Self-Paced Learning for Latent Variable Models
Author: M. P. Kumar, Benjamin Packer, Daphne Koller
Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1
3 0.99846035 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, Françcois Charpillet
Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, what makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1
4 0.99831086 187 nips-2010-Occlusion Detection and Motion Estimation with Convex Optimization
Author: Alper Ayvaci, Michalis Raptis, Stefano Soatto
Abstract: We tackle the problem of simultaneously detecting occlusions and estimating optical flow. We show that, under standard assumptions of Lambertian reflection and static illumination, the task can be posed as a convex minimization problem. Therefore, the solution, computed using efficient algorithms, is guaranteed to be globally optimal, for any number of independently moving objects, and any number of occlusion layers. We test the proposed algorithm on benchmark datasets, expanded to enable evaluation of occlusion detection performance. 1
5 0.99791217 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1
6 0.99745899 167 nips-2010-Mixture of time-warped trajectory models for movement decoding
7 0.99597913 100 nips-2010-Gaussian Process Preference Elicitation
8 0.99563926 133 nips-2010-Kernel Descriptors for Visual Recognition
9 0.99548185 108 nips-2010-Graph-Valued Regression
10 0.98184025 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
11 0.97755772 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
12 0.97623962 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
13 0.97317111 144 nips-2010-Learning Efficient Markov Networks
14 0.97198808 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
15 0.97124714 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
16 0.96944046 118 nips-2010-Implicit Differentiation by Perturbation
17 0.96884298 61 nips-2010-Direct Loss Minimization for Structured Prediction
18 0.96880662 212 nips-2010-Predictive State Temporal Difference Learning
19 0.96711659 94 nips-2010-Feature Set Embedding for Incomplete Data
20 0.96605021 25 nips-2010-Active Learning by Querying Informative and Representative Examples