63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features


Source: pdf

Author: Gil Tahan, Lior Rokach, Yuval Shahar

Abstract: This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. In addition, Mal-ID uses a new kind of feature, termed meta-feature, to better capture the properties of the analyzed segments. Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the segment level. This study also introduces two Mal-ID extensions that improve the Mal-ID basic method in various aspects. We rigorously evaluated Mal-ID and its two extensions with more than ten performance measures, and compared them to the highly rated boosted decision tree method under identical settings. The evaluation demonstrated that Mal-ID and the two Mal-ID extensions outperformed the boosted decision tree method in almost all respects. In addition, the results indicated that by extracting meaningful features, it is sufficient to employ one simple detection rule for classifying executable files. Keywords: computer security, malware detection, common segment analysis, supervised learning

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 653 Beer-Sheva, Israel 84105. Editor: Charles Elkan. Abstract: This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. [sent-10, score-0.829]

2 The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. [sent-11, score-0.998]

3 By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. [sent-12, score-1.148]

4 Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the segment level. [sent-14, score-0.869]

5 Keywords: computer security, malware detection, common segment analysis, supervised learning [sent-19, score-0.869]

6 The rate of malware attacks and infections is not yet leveling off. [sent-28, score-0.696]

7 There are many ways to mitigate malware infection and spread. [sent-30, score-0.696]

8 Tools such as anti-virus and anti-spyware are able to identify and block malware based on its behavior (Franc and Sonnenburg, 2009) or static features (see Table 1 below). [sent-31, score-0.751]

9 A static feature may be a rule or a signature that uniquely identifies a malware or malware group. [sent-32, score-1.469]

10 While the tools mitigating malware may vary, at their core there must be some classification method to distinguish malware files from benign files. [sent-33, score-1.624]

11 In addition, in recent years many researchers have been using machine learning (ML) techniques to produce a binary classifier that is able to distinguish malware from benign files. [sent-41, score-0.928]

12 To test the effectiveness of ML techniques in malware detection, the researchers listed in Table 1 conducted experiments combining various feature extraction methods along with several feature selection and classification algorithms. [sent-69, score-0.696]

13 (2010) presented a variation of the method presented above that uses a Hierarchical Associative Classifier (HAC) to detect malware from a large imbalanced list of applications. [sent-80, score-0.729]

14 The malware in the imbalanced list were the minority class. [sent-81, score-0.696]

15 First, both malware and benign programs are executed inside the virtual machine and the instruction sequences are collected during runtime. [sent-88, score-0.981]

16 (2011) presented a simple method to detect malware variants. [sent-96, score-0.729]

17 (2011) showed that when the similarity is high, there is a high probability that the suspected file is a malware variant. [sent-101, score-0.696]

18 The experiments showed that it is possible to use ML techniques for malware detection. [sent-102, score-0.696]

19 Since most malware are also made of the same common building blocks, we believe it would be reasonable to discard the parts of a malware that are common to all kinds of software, leaving only the parts that are unique to the malware. [sent-111, score-1.418]

20 Doing so should increase the difference between malware files and benign files and therefore should result in a lower misclassification rate. [sent-112, score-0.928]

21 As a result, current techniques using short n-grams rely on complex conditions and involve many features for detecting malware files. [sent-121, score-0.767]

22 The goal of this paper is to develop and evaluate a novel methodology and supporting algorithms for detecting malware files by using common segment analysis. [sent-122, score-0.912]

23 In the proposed methodology we initially detect and nullify, by zero patching, benign segments and therefore resolve the deficiency of analyzing files with segments that may not contribute or even hinder classification. [sent-123, score-0.547]

24 As a result, malware that has been developed with these tools generally resembles benign applications. [sent-179, score-0.928]

25 Therefore it may be reasonable to assume that there will be resemblances among various types of malware due to shared malware library code or even a similar specific method of performing a malicious action. [sent-181, score-1.477]

26 Of course, such malware commonalities cannot always be guaranteed. [sent-182, score-0.696]

27 Since many modern malware files are in fact much larger than 1 MB, analysis of newer applications is much more complex than it was when both the applications and the malware attacking them were smaller. [sent-185, score-1.392]

28 As noted above, many applications and malware are developed using development platforms that include large programming language libraries. [sent-187, score-0.722]

29 For example, a worm that distributes itself via email may contain benign code for sending emails. [sent-189, score-0.977]

30 The TFL contains data structures constructed from malware without segments identified as benign (i. [sent-199, score-1.069]

31 As can be seen in this figure, our Mal-ID methodology uses two distinct stages to accomplish the malware detection task: setup and detection. [sent-203, score-0.8]

32 The detection phase classifies a previously unseen application as either malware or benign. [sent-205, score-0.836]

33 The Setup Phase. The setup phase involves collecting two kinds of files: benign and malware files. [sent-209, score-0.964]

34 The malware files can, for example, be downloaded from trusted dedicated Internet sites, or by collaborating with an anti-virus company. [sent-211, score-0.696]

35 In this study the malware collection was obtained from trusted sources. [sent-212, score-0.696]

36 In particular, the Ben-Gurion University Computational Center provided us with malware that it had detected over time. [sent-213, score-0.696]

37 The CFL repository is constructed from benign files and the TFL repository is constructed from malware files. [sent-215, score-0.928]

38 Note that in the proposed algorithm, we are calculating the distribution of 3-grams within each file and across files, to make sure that a 3-gram belongs to the examined segment and thus associate the segment to either benign (CFL) or malware (TFL). [sent-218, score-1.306]
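The 3-gram bookkeeping described in the row above can be sketched as follows. This is only an illustration under stated assumptions: 3-grams are taken to be overlapping 3-byte windows over the raw file contents, and the names `byte_ngrams` and `files_containing` are hypothetical, not taken from the paper.

```python
from collections import Counter
from typing import Iterable


def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count every overlapping n-byte window in one file (within-file distribution)."""
    # range() is empty when the file is shorter than n, so short inputs yield no n-grams.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def files_containing(files: Iterable[bytes], n: int = 3) -> Counter:
    """For each n-gram, count how many files contain it at least once (across-file distribution)."""
    doc_freq: Counter = Counter()
    for data in files:
        # set() collapses repeats so each file contributes at most 1 per n-gram.
        doc_freq.update(set(byte_ngrams(data, n)))
    return doc_freq
```

Keeping the two counters separate mirrors the distinction the row draws between the distribution within each file and the distribution across files.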

39 Moreover, 3-grams that seem to appear approximately within the same offset in all malware can be used to characterize the malware. [sent-219, score-0.696]

40 Figure 1: The Mal-ID method for detecting new malware applications. [sent-224, score-0.739]

41 Each file from the malware collection is broken into segments. [sent-236, score-0.696]

42 It is important to note that the end result is the TFL, a repository made of segments found only in malware and not in benign files. [sent-241, score-1.069]

43 Once the setup phase has constructed the CFL and the TFL, it is possible to classify a file F as benign or as malware using the algorithm presented in Figure 2. [sent-268, score-0.964]

44 The algorithm takes the ThreatThreshold parameter, which indicates the minimum number of times a segment must occur in the TFL in order to qualify as a malware indicator. [sent-296, score-0.869]

45 Obviously a segment that does not appear in any malware cannot be used to indicate that the file is a malware. [sent-300, score-0.869]

46 (j) Lines 21-25 (optional stage, aimed at reducing false malware detection). [sent-308, score-0.696]

47 A segment that meets all of the above conditions is tested against the malware file groups that contain all 3-gram segments. [sent-309, score-0.869]

48 As a result, only segments that actually reside in the malware are left in the segment collection. [sent-310, score-1.01]

49 Second level index aggregation—Count all segments that are found in malware and not in the CFL. [sent-314, score-0.837]

50 Classify—If there are at least X segments found in the malware train set (TFL) and not in the CFL then the file is malware; otherwise consider the file as benign. [sent-317, score-0.837]
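The single detection rule in the row above can be sketched as below. This is a hedged illustration, not the algorithm of Figure 2: segments are assumed to be hashable identifiers, `tfl_counts` is a hypothetical mapping from segment to its number of occurrences in the TFL, `cfl` is a hypothetical set of benign segments, `min_segments` stands in for the X of the rule, and counting distinct segments rather than repeated occurrences is an assumption.

```python
from typing import AbstractSet, Hashable, Iterable, Mapping


def classify(
    segments: Iterable[Hashable],
    tfl_counts: Mapping[Hashable, int],
    cfl: AbstractSet[Hashable],
    threat_threshold: int = 1,
    min_segments: int = 1,
) -> str:
    """Label a file from its segments using one threshold rule."""
    # Keep distinct segments that are absent from the CFL and occur in the
    # TFL at least threat_threshold times (the ThreatThreshold parameter).
    suspicious = {
        seg
        for seg in segments
        if seg not in cfl and tfl_counts.get(seg, 0) >= threat_threshold
    }
    return "malware" if len(suspicious) >= min_segments else "benign"
```

For example, `classify([b"s1", b"s2"], {b"s1": 3, b"s2": 5}, {b"s2"}, threat_threshold=2)` flags the file because `s1` is in the TFL often enough and is not in the CFL, while `s2` is ignored as benign.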

51 Finally, the model is used to detect the malware among the files in the test set. [sent-360, score-0.729]

52 Next, zero patch each malware in the training set as follows: Iterate over all of the file segments and perform common segment analysis to detect the segments that appear in the CFL. [sent-366, score-1.215]

53 The benign segments (the segments that appear in the CFL) are zero patched in an attempt to reduce the number of n-grams that are clearly not relevant for detecting segments that appear only in malware. [sent-367, score-0.732]
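Zero patching, as described in the two rows above, can be sketched like this. The sketch assumes fixed-size, non-overlapping segments and a caller-supplied predicate `is_benign_segment` that stands in for the CFL lookup; both are illustrative assumptions, since this excerpt does not specify how segments are delimited or matched.

```python
from typing import Callable


def zero_patch(
    data: bytes,
    is_benign_segment: Callable[[bytes], bool],
    segment_size: int = 64,
) -> bytes:
    """Overwrite every segment judged benign with zero bytes, keeping file length unchanged."""
    patched = bytearray(data)
    for start in range(0, len(data), segment_size):
        segment = data[start:start + segment_size]
        if is_benign_segment(segment):
            # bytes(k) is k zero bytes; the last segment may be shorter than segment_size.
            patched[start:start + len(segment)] = bytes(len(segment))
    return bytes(patched)
```

Preserving the file length keeps the offsets of the remaining segments unchanged, which matters if offsets are later used as features.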

54 The patched malware collection and the unchanged benign file collection are used for training. [sent-371, score-0.962]

55 To examine whether the proposed basic methods could detect malware while keeping the false alarm rate as small as possible. [sent-393, score-0.798]

56 An additional 849 malware files were gathered from the Internet with lengths ranging from 6Kb to 4. [sent-407, score-0.696]

57 The malware and benign file sets were used without any decryption, decompression or any other preprocessing. [sent-415, score-0.928]

58 The malware types and frequencies are presented in Figure 3. [sent-416, score-0.696]

59 Figure 3: Distribution of malware types in data set. [sent-422, score-0.696]

60 Given the low rate of malware versus benign code, accuracy might be a misleading measure. [sent-424, score-0.928]
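To illustrate why accuracy can mislead on an imbalanced corpus, the sketch below computes the measures named elsewhere on this page (TPR, FPR, PPV, NPV, BCR, BER) from a confusion matrix, using their standard definitions. The function name `rates` and the convention of returning 0.0 for an undefined ratio are assumptions made for this example. With the corpus sizes quoted in row 83 (2,627 benign and 849 malware files), a classifier that labels every file as benign reaches an accuracy of about 0.756 while detecting no malware at all.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard confusion-matrix measures; undefined ratios are reported as 0.0 (a convention)."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)          # true positive rate (malware detected)
    tnr = ratio(tn, tn + fp)          # true negative rate (benign passed)
    bcr = (tpr + tnr) / 2             # balanced classification rate
    return {
        "acc": ratio(tp + tn, tp + fp + tn + fn),
        "tpr": tpr,
        "fpr": 1.0 - tnr if (tn + fp) else 0.0,
        "ppv": ratio(tp, tp + fp),    # positive predictive value
        "npv": ratio(tn, tn + fn),    # negative predictive value
        "bcr": bcr,
        "ber": 1.0 - bcr,             # balanced error rate
    }


# "Everything is benign" on a corpus of 2,627 benign and 849 malware files.
all_benign = rates(tp=0, fp=0, tn=2627, fn=849)
```

Here `all_benign["acc"]` is high only because benign files dominate, whereas `all_benign["tpr"]` is 0 and `all_benign["bcr"]` is 0.5, which is what a balanced measure reports for a classifier with no discriminative power.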

61 Let p(x_i) represent the posterior probability that the instance x_i is associated with the malware class according to the classifier. [sent-433, score-0.696]

62 Results of Mal-ID Basic Model. Table 3 presents the detection performance of the proposed method when 70% of the benign files and 90% of the malware files are used for training. [sent-445, score-1.032]

63 The ratio between malware and benign was kept fixed for all cases. [sent-480, score-0.956]

64 On the other hand, because we also increase the imbalance ratio between benign and malware files, we should have expected a decrease in the predictive performance. [sent-536, score-0.956]

65 Table 7 reports the mean TPR of Mal-ID basic for small malware (size<=350K) and large malware (size>350K) using the largest training set. [sent-597, score-0.851]

66 In order to estimate the effect of obfuscation on detection rate, we have divided the tested malware into two groups—obfuscated and non-obfuscated. [sent-603, score-0.823]

67 in data set content, training size, the benign and malware ratio and possibly other [sent-651, score-0.987]

68 Table 8: A comparison of the TPR (True Positive Rate) of Mal-ID basic for obfuscated and non-obfuscated malware when using the maximum training size. [sent-724, score-0.912]

69 The comparison suggests that Mal-ID meta-features are useful in contributing to malware detection and probably more meaningful than simple n-grams in capturing a file’s essence. [sent-742, score-0.8]

70 Considering detection performance only when choosing a malware detection method may not be enough; it is important to consider other aspects as well. [sent-746, score-0.904]

71 The reason is that each detected segment that passed the Mal-ID filter stage, as explained in Section 2, can be traced back to a specific malware or malware group. [sent-750, score-1.441]

72 Disassembly or reverse engineering of the whole malware is no longer required. [sent-752, score-0.696]

73 Default Signature for Real-time Malware Detection Hardware. The end result of applying the Mal-ID basic method is a file segment or segments that appear in malware files only and thus may be used as a signature for anti-virus tools. [sent-765, score-1.129]

74 The detected malware segments can be used, as described by Filiol (2006), to generate signatures resistant against black-box analysis. [sent-766, score-0.837]

75 IPSs require the anytime detection trait to act as real-time malware filtering devices and thus promote and provide users with default protection. [sent-768, score-0.823]

76 Having both malware detection and signature generation could help shorten the window of vulnerability. [sent-769, score-0.85]

77 It is estimated that the mean malware size has increased from 150K (in 2005) to 350K (in 2010). [sent-777, score-0.696]

78 In this sense, we referred to malware as they are found “in the Wild”. [sent-789, score-0.696]

79 For example, malware developers are sharing tools for facilitating the generation of new malwares. [sent-792, score-0.696]

80 org/, one can find many tools (such as Falckon Encrypter that is used for obfuscation) that can be used by the malware developers but are not used by benign software developers. [sent-795, score-0.928]

81 All malware that use the Falckon Encrypter share the same decryption segment. [sent-796, score-0.696]

82 The results of Table 8 agree with the previously-made observation that ML techniques can classify malware that are obfuscated (compressed or encrypted or both). [sent-797, score-0.812]

83 006), however it should be noted that this value was obtained when our corpus contained 2,627 benign files and 849 malware files (i. [sent-801, score-0.928]

84 According to Kolter and Maloof (2006), the success in detecting obfuscated malware relies on learning certain forms of obfuscation such as run-time decompression. [sent-806, score-0.878]

85 (2005) noticed that in many cases malware requires fixed sequences to be used in the body of the malware (which must exist before self-decryption or self-decompression) in order to exploit a specific vulnerability and self-propagate. [sent-813, score-1.392]

86 Theoretically, an attacker can specifically design malware that is hard for Mal-ID to detect. [sent-818, score-0.729]

87 In particular, if a malware is designed such that the entropy measure will be high for all segments, it will be undiscovered by the Mal-ID basic method. [sent-819, score-0.804]
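The row above refers to an entropy measure computed per segment. The exact measure and its thresholds are not given in this excerpt, so the following is only a sketch that assumes Shannon entropy over byte values, in bits per byte; the function name `byte_entropy` is hypothetical.

```python
import math
from collections import Counter


def byte_entropy(segment: bytes) -> float:
    """Shannon entropy of the byte-value distribution, between 0.0 and 8.0 bits per byte."""
    if not segment:
        return 0.0
    n = len(segment)
    # Sum p * log2(p) over the byte values that actually occur, then negate;
    # abs() only normalizes the -0.0 that arises for constant segments.
    return abs(-sum((c / n) * math.log2(c / n) for c in Counter(segment).values()))
```

Under this assumption, compressed or encrypted segments score close to 8 bits per byte, while zero-filled or highly repetitive segments score close to 0, which is why uniformly high-entropy content could evade a filter that relies on this measure.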

88 Summary and Future Work. In this paper we have described novel methods based on machine learning to detect malware in executable files without any need for preprocessing the executables. [sent-823, score-0.829]

89 The basic method that we presented works on the segment level for detecting new malware instead of using the entire file as usually done in machine learning based techniques. [sent-824, score-0.981]

90 We believe this study has made several contributions to malware detection research, including the introduction of: 1. [sent-829, score-0.8]

91 a new and effective method for malware detection based on common segment analysis and supporting algorithms. [sent-830, score-0.973]

92 The importance of common segment analysis to the process of malware detection was identified and demonstrated. [sent-831, score-0.973]

93 BCR, BER, PPV, NPV and entropy decrease for measuring the performance of malware detection methods. [sent-838, score-0.839]

94 It is our assumption that systematically collecting and choosing common segments will provide a better representation of benign common segments and a more robust and lower FPR. [sent-844, score-0.514]

95 A robust and low FPR will enable the use of more sensitive malware detection methods (or parameters that affect malware detection) without increasing the FPR too much. [sent-845, score-1.496]

96 In addition, it will be interesting to test the proposed method on live network data and on an institutional network and determine if it detects malware that is not detected by other means. [sent-850, score-0.696]

97 Finally, future work may repeat the evaluation of Mal-ID on a larger scale with thousands of malware samples and tens of thousands of non-malware samples. [sent-851, score-0.696]

98 Auto-sign: an automatic signature generator for high-speed malware filtering devices. [sent-1091, score-0.78]

99 Sbmds: an interpretable string based malware detection system using svm ensemble with bagging. [sent-1122, score-0.8]

100 Hierarchical associative classifier (hac) for malware detection from the large and imbalanced gray list. [sent-1130, score-0.8]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('malware', 0.696), ('benign', 0.232), ('fpr', 0.187), ('rf', 0.184), ('segment', 0.173), ('tahan', 0.158), ('rokach', 0.144), ('segments', 0.141), ('les', 0.135), ('cfl', 0.13), ('obfuscated', 0.116), ('tfl', 0.116), ('detection', 0.104), ('alware', 0.103), ('hahar', 0.103), ('maloof', 0.103), ('tpr', 0.1), ('kolter', 0.1), ('executable', 0.1), ('le', 0.099), ('bytes', 0.094), ('etection', 0.079), ('basic', 0.069), ('boosted', 0.063), ('malicious', 0.059), ('nn', 0.057), ('elovici', 0.055), ('executables', 0.055), ('malwares', 0.055), ('npv', 0.055), ('ppv', 0.055), ('auc', 0.052), ('signature', 0.05), ('pe', 0.045), ('ber', 0.044), ('detecting', 0.043), ('ye', 0.043), ('ml', 0.043), ('nave', 0.041), ('tree', 0.039), ('strings', 0.039), ('entropy', 0.039), ('phase', 0.036), ('patched', 0.034), ('spread', 0.034), ('automatic', 0.034), ('detect', 0.033), ('examined', 0.032), ('threat', 0.032), ('training', 0.031), ('tp', 0.03), ('security', 0.03), ('hac', 0.029), ('ratio', 0.028), ('gain', 0.028), ('decision', 0.028), ('features', 0.028), ('entropylow', 0.027), ('flooder', 0.027), ('instruction', 0.027), ('maxmalsize', 0.027), ('mfg', 0.027), ('segmentsinmalwareonly', 0.027), ('static', 0.027), ('classi', 0.027), ('platforms', 0.026), ('file', 0.026), ('forest', 0.026), ('programs', 0.026), ('code', 0.026), ('discard', 0.026), ('explained', 0.025), ('api', 0.024), ('continue', 0.024), ('stage', 0.024), ('bcr', 0.023), ('anytime', 0.023), ('newsome', 0.023), ('obfuscation', 0.023), ('virology', 0.023), ('virus', 0.023), ('email', 0.023), ('acc', 0.023), ('sr', 0.023), ('rotation', 0.021), ('internet', 0.021), ('operating', 0.021), ('trees', 0.021), ('installed', 0.021), ('originate', 0.021), ('bgu', 0.021), ('datashort', 0.021), ('disassembly', 0.021), ('henchiri', 0.021), ('mpc', 0.021), ('opcode', 0.021), ('patching', 0.021), ('repositories', 0.021), ('segmentcheck', 0.021), ('segmentcoll', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features

2 0.048640385 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

Author: Marius Kloft, Pavel Laskov

Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution, we analyze the performance of a particular method, online centroid anomaly detection, in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection

3 0.048461426 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel

Author: Stephen R. Piccolo, Lewis J. Frey

Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. ML-Flex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research

4 0.04703616 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces

Author: Konrad Rieck, Christian Wressnegger, Alexander Bikadorov

Abstract: Strings and sequences are ubiquitous in many areas of data analysis. However, only few learning methods can be directly applied to this form of data. We present Sally, a tool for embedding strings in vector spaces that allows for applying a wide range of learning methods to string data. Sally implements a generalized form of the bag-of-words model, where strings are mapped to a vector space that is spanned by a set of string features, such as words or n-grams of words. The implementation of Sally builds on efficient string algorithms and enables processing millions of strings and features. The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. Sally has been successfully applied for learning with natural language text, DNA sequences and monitored program behavior. Keywords: string embedding, bag-of-words models, learning with sequential data

5 0.038720466 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss

Author: José Hernández-Orallo, Peter Flach, Cèsar Ferri

Abstract: Many performance metrics have been introduced in the literature for the evaluation of classification performance, each of them with different origins and areas of application. These metrics include accuracy, unweighted accuracy, the area under the ROC curve or the ROC convex hull, the mean absolute error and the Brier score or mean squared error (with its decomposition into refinement and calibration). One way of understanding the relations among these metrics is by means of variable operating conditions (in the form of misclassification costs and/or class distributions). Thus, a metric may correspond to some expected loss over different operating conditions. One dimension for the analysis has been the distribution for this range of operating conditions, leading to some important connections in the area of proper scoring rules. We demonstrate in this paper that there is an equally important dimension which has so far received much less attention in the analysis of performance metrics. This dimension is given by the decision rule, which is typically implemented as a threshold choice method when using scoring models. In this paper, we explore many old and new threshold choice methods: fixed, score-uniform, score-driven, rate-driven and optimal, among others. By calculating the expected loss obtained with these threshold choice methods for a uniform range of operating conditions we give clear interpretations of the 0-1 loss, the absolute error, the Brier score, the AUC and the refinement loss respectively. Our analysis provides a comprehensive view of performance metrics as well as a systematic approach to loss minimisation which can be summarised as follows: given a model, apply the threshold choice methods that correspond with the available information about the operating condition, and compare their expected losses. 
In order to assist in this procedure we also derive several connections between the aforementioned performance metrics, and we highlight the role of calibra

6 0.031784203 20 jmlr-2012-Analysis of a Random Forests Model

7 0.026400086 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

8 0.026031828 106 jmlr-2012-Sign Language Recognition using Sub-Units

9 0.025914581 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification

10 0.023615265 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

11 0.022609839 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition

12 0.022402655 90 jmlr-2012-Pattern for Python

13 0.0220543 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

14 0.019697582 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

15 0.01947937 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

16 0.019288141 44 jmlr-2012-Feature Selection via Dependence Maximization

17 0.018920342 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

18 0.018534627 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

19 0.018524971 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development

20 0.018495163 82 jmlr-2012-On the Necessity of Irrelevant Variables


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.093), (1, 0.04), (2, 0.096), (3, -0.016), (4, 0.018), (5, 0.018), (6, 0.048), (7, 0.006), (8, 0.018), (9, 0.035), (10, 0.021), (11, -0.037), (12, 0.083), (13, -0.054), (14, -0.04), (15, 0.128), (16, 0.081), (17, -0.049), (18, -0.076), (19, -0.093), (20, 0.024), (21, 0.022), (22, -0.056), (23, -0.065), (24, 0.093), (25, -0.06), (26, -0.064), (27, 0.026), (28, 0.114), (29, -0.106), (30, 0.256), (31, -0.353), (32, 0.119), (33, 0.032), (34, 0.077), (35, 0.13), (36, 0.017), (37, -0.158), (38, -0.077), (39, -0.037), (40, -0.086), (41, 0.055), (42, 0.024), (43, -0.107), (44, -0.165), (45, -0.105), (46, -0.017), (47, -0.021), (48, 0.035), (49, -0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95157522 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features

2 0.64348131 102 jmlr-2012-Sally: A Tool for Embedding Strings in Vector Spaces

Author: Konrad Rieck, Christian Wressnegger, Alexander Bikadorov

Abstract: Strings and sequences are ubiquitous in many areas of data analysis. However, only few learning methods can be directly applied to this form of data. We present Sally, a tool for embedding strings in vector spaces that allows for applying a wide range of learning methods to string data. Sally implements a generalized form of the bag-of-words model, where strings are mapped to a vector space that is spanned by a set of string features, such as words or n-grams of words. The implementation of Sally builds on efficient string algorithms and enables processing millions of strings and features. The tool supports several data formats and is capable of interfacing with common learning environments, such as Weka, Shogun, Matlab, or Pylab. Sally has been successfully applied for learning with natural language text, DNA sequences and monitored program behavior. Keywords: string embedding, bag-of-words models, learning with sequential data

3 0.54490149 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

Author: Marius Kloft, Pavel Laskov

Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution, we analyze the performance of a particular method, online centroid anomaly detection, in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection

4 0.43960705 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel

Author: Stephen R. Piccolo, Lewis J. Frey

Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. ML-Flex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research

5 0.294184 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss

Author: José Hernández-Orallo, Peter Flach, Cèsar Ferri

Abstract: Many performance metrics have been introduced in the literature for the evaluation of classification performance, each of them with different origins and areas of application. These metrics include accuracy, unweighted accuracy, the area under the ROC curve or the ROC convex hull, the mean absolute error and the Brier score or mean squared error (with its decomposition into refinement and calibration). One way of understanding the relations among these metrics is by means of variable operating conditions (in the form of misclassification costs and/or class distributions). Thus, a metric may correspond to some expected loss over different operating conditions. One dimension for the analysis has been the distribution for this range of operating conditions, leading to some important connections in the area of proper scoring rules. We demonstrate in this paper that there is an equally important dimension which has so far received much less attention in the analysis of performance metrics. This dimension is given by the decision rule, which is typically implemented as a threshold choice method when using scoring models. In this paper, we explore many old and new threshold choice methods: fixed, score-uniform, score-driven, rate-driven and optimal, among others. By calculating the expected loss obtained with these threshold choice methods for a uniform range of operating conditions we give clear interpretations of the 0-1 loss, the absolute error, the Brier score, the AUC and the refinement loss respectively. Our analysis provides a comprehensive view of performance metrics as well as a systematic approach to loss minimisation which can be summarised as follows: given a model, apply the threshold choice methods that correspond with the available information about the operating condition, and compare their expected losses. 
In order to assist in this procedure we also derive several connections between the aforementioned performance metrics, and we highlight the role of calibra

6 0.25008735 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification

7 0.21436745 20 jmlr-2012-Analysis of a Random Forests Model

8 0.20105891 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies

9 0.19779587 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection

10 0.19758163 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

11 0.19391365 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

12 0.18390626 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

13 0.17907982 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

14 0.16633324 6 jmlr-2012-A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives

15 0.16191781 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition

16 0.16012608 106 jmlr-2012-Sign Language Recognition using Sub-Units

17 0.15562671 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

18 0.15334482 95 jmlr-2012-Random Search for Hyper-Parameter Optimization

19 0.14774328 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory

20 0.1399639 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.016), (21, 0.043), (26, 0.026), (27, 0.025), (29, 0.04), (35, 0.018), (49, 0.014), (56, 0.028), (57, 0.013), (69, 0.022), (75, 0.042), (77, 0.016), (79, 0.015), (81, 0.469), (92, 0.038), (96, 0.071)]
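
The topic vector above is sparse: only topics with non-negligible weight appear, as (topicId, topicWeight) pairs. The listing does not say how simValue is derived from such vectors. As a minimal sketch, assuming cosine similarity over sparse topic vectors (an assumption, not something the page confirms), two papers could be compared as follows. The helper name `cosine_sparse` and the second vector are hypothetical and chosen only for illustration.

```python
import math

def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    da, db = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * db[t] for t, w in da.items() if t in db)
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Topic weights for this paper, copied from the listing above.
this_paper = [(7, 0.016), (21, 0.043), (26, 0.026), (27, 0.025), (29, 0.04),
              (35, 0.018), (49, 0.014), (56, 0.028), (57, 0.013), (69, 0.022),
              (75, 0.042), (77, 0.016), (79, 0.015), (81, 0.469), (92, 0.038),
              (96, 0.071)]

# Hypothetical topic vector for another paper, invented for illustration.
other_paper = [(21, 0.05), (81, 0.40), (96, 0.10)]

print(round(cosine_sparse(this_paper, other_paper), 3))
```

Because topic 81 dominates this paper's vector, any paper that also loads heavily on topic 81 would score high under this measure, whatever its subject.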

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.766761 63 jmlr-2012-Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features

Author: Gil Tahan, Lior Rokach, Yuval Shahar

Abstract: This paper proposes several novel methods, based on machine learning, to detect malware in executable files without any need for preprocessing, such as unpacking or disassembling. The basic method (Mal-ID) is a new static (form-based) analysis methodology that uses common segment analysis in order to detect malware files. By using common segment analysis, Mal-ID is able to discard malware parts that originate from benign code. In addition, Mal-ID uses a new kind of feature, termed meta-feature, to better capture the properties of the analyzed segments. Rather than using the entire file, as is usually the case with machine learning based techniques, the new approach detects malware on the segment level. This study also introduces two Mal-ID extensions that improve the Mal-ID basic method in various aspects. We rigorously evaluated Mal-ID and its two extensions with more than ten performance measures, and compared them to the highly rated boosted decision tree method under identical settings. The evaluation demonstrated that Mal-ID and the two Mal-ID extensions outperformed the boosted decision tree method in almost all respects. In addition, the results indicated that by extracting meaningful features, it is sufficient to employ one simple detection rule for classifying executable files. Keywords: computer security, malware detection, common segment analysis, supervised learning

2 0.75827628 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information

Author: Jasper Snoek, Ryan P. Adams, Hugo Larochelle

Abstract: While unsupervised learning has long been useful for density modeling, exploratory data analysis and visualization, it has become increasingly important for discovering features that will later be used for discriminative tasks. Discriminative algorithms often work best with highly-informative features; remarkably, such features can often be learned without the labels. One particularly effective way to perform such unsupervised learning has been to use autoencoder neural networks, which find latent representations that are constrained but nevertheless informative for reconstruction. However, pure unsupervised learning with autoencoders can find representations that may or may not be useful for the ultimate discriminative task. It is a continuing challenge to guide the training of an autoencoder so that it finds features which will be useful for predicting labels. Similarly, we often have a priori information regarding what statistical variation will be irrelevant to the ultimate discriminative task, and we would like to be able to use this for guidance as well. Although a typical strategy would be to include a parametric discriminative model as part of the autoencoder training, here we propose a nonparametric approach that uses a Gaussian process to guide the representation. By using a nonparametric model, we can ensure that a useful discriminative function exists for a given set of features, without explicitly instantiating it. We demonstrate the superiority of this guidance mechanism on four data sets, including a real-world application to rehabilitation research. We also show how our proposed approach can learn to explicitly ignore statistically significant covariate information that is label-irrelevant, by evaluating on the small NORB image recognition problem in which pose and lighting labels are available. Keywords: autoencoder, gaussian process, gaussian process latent variable model, representation learning, unsupervised learning

3 0.63986254 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices

Author: Uri Shalit, Daphna Weinshall, Gal Chechik

Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt LORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. LORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning

4 0.31821424 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

5 0.28379059 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

Author: Marius Kloft, Pavel Laskov

Abstract: Security issues are crucial in a number of machine learning applications, especially in scenarios dealing with human activity rather than natural phenomena (e.g., information ranking, spam detection, malware detection, etc.). In such cases, learning algorithms may have to cope with manipulated data aimed at hampering decision making. Although some previous work addressed the issue of handling malicious data in the context of supervised learning, very little is known about the behavior of anomaly detection methods in such scenarios. In this contribution, we analyze the performance of a particular method, online centroid anomaly detection, in the presence of adversarial noise. Our analysis addresses the following security-related issues: formalization of learning and attack processes, derivation of an optimal attack, and analysis of attack efficiency and limitations. We derive bounds on the effectiveness of a poisoning attack against centroid anomaly detection under different conditions: attacker’s full or limited control over the traffic and bounded false positive rate. Our bounds show that whereas a poisoning attack can be effectively staged in the unconstrained case, it can be made arbitrarily difficult (a strict upper bound on the attacker’s gain) if external constraints are properly used. Our experimental evaluation, carried out on real traces of HTTP and exploit traffic, confirms the tightness of our theoretical bounds and the practicality of our protection mechanisms. Keywords: anomaly detection, adversarial, security analysis, support vector data description, computer security, network intrusion detection

6 0.27680993 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine

7 0.25912467 106 jmlr-2012-Sign Language Recognition using Sub-Units

8 0.25613463 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems

9 0.25369599 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

10 0.25022277 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

11 0.24582741 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

12 0.24554493 45 jmlr-2012-Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs

13 0.24251983 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

14 0.2401278 100 jmlr-2012-Robust Kernel Density Estimation

15 0.23967457 19 jmlr-2012-An Introduction to Artificial Prediction Markets for Classification

16 0.23174278 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

17 0.22939353 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space

18 0.21888706 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

19 0.21759021 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis

20 0.21744579 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms