nips nips2004 nips2004-111 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, Eisaku Maeda
Abstract: In this paper, we address the problem of statistical learning for multitopic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class categorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines. 1 Multi-topic Text Categorization (MTC) This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We thus call a set of topics label, and say that a text is assigned a label, not a topic. In almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by judging each topic’s relevance to the text. In this decomposition approach, the features specific to a topic, not a label, are regarded as important features. However, the approach may result in inefficient learning as we will explain in the following example. Imagine an MTC problem of scientific papers where quantum computing papers are assigned multi-topic label “quantum physics (QP) & computer science (CS)”. (QP and CS are topics in this example.) Since there are some words specific to quantum computing such as “qbit1 ”, one can say that efficient MTC learners should use such words to assign label QP & CS. However, the decomposition approach is likely to ignore these words since they are only specific to a small portion of the whole QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or CS. 1 Qbit is a unit of quantum information, and frequently appears in real quantum computing literatures, but rarely seen in other literatures. Symbol x(∈ Rd ) t1 , t2 , . . . , tl T L, λ(⊂ T ) L[j] Λ(= 2T ) {(xi , Li )}m i=1 Meaning A document vector Topics The set of all topics A label The binary representation of L. 1 if tj ∈ L and 0 otherwise. The set of all possible labels Training samples Table 1: Notation Parametric Mixture Model (PMM) [3] adopts another approach to MTC. It is assumed in PMM that multi-topic texts are generated from a mixture of topic-specific word distributions. Its decision on labeling is done at once, not separately for each topic. However, PMM also has a problem with multi-topic specific features such as “qbit” since it is impossible for texts to have such features given PMM’s mixture process. These problems with multi-topic specific features are caused by dependency assumptions between labels, which are explicitly or implicitly made in existing methods. To solve these problems, we propose Maximal Margin Labeling, which treats labels as independent classes and learns a multi-class classifier on the induced multi-class problem. In this paper, we first discuss why multi-class classifiers cannot be directly applied to MTC in Section 2. We then propose MML in Section 3, and address implementation issues in Section 4. In Section 5, MML is experimentally compared with existing methods using a collection of multi-topic Web pages. We summarize this paper in Section 6. 2 Solving MTC as a Multi-Class Categorization To discuss why existing multi-class classifiers do not work in MTC, we start from the multi-class classifier proposed in [4]. Hereafter we use the notation given in Table 1. The multi-class classifier in [4] categorizes an object into the class whose prototype vector is the closest to the object’s feature vector. By substituting label for class, the classifier can be written as follows. f (x) = arg max x, mλ (1) X λ∈Λ where , X is the the inner product of Rd , and mλ ∈ Rd is the prototype vector of label λ. Following the similar argument as in [4], the prototype vectors are learned by solving the following maximal margin problem2 . min M s.t. 1 M 2 2 λ ξi +C xi , mLi 1≤i≤m λ∈Λ,λ=Li X − xi , m λ X λ ≥ 1 − ξi for 1 ≤ i ≤ m, ∀λ = Li , (2) where M is the prototype matrix whose columns are the prototype vectors, and M is the Frobenius matrix norm of M . Note that Eq. (1) and Eq. (2) cover only training samples’ labels, but also all possible labels. This is because the labels unseen in training samples may be relevant to test samples. In 2 In Eq.(2), we penalize all violation of the margin constraints. On the other hand, Crammer and Singer penalize only the largest violation of the margin constraint for each training sample [4]. We chose the “penalize-all” approach since it leads to an optimization problem without equality constraints (see Eq.(7)), which is much easier to solve than the one in [4]. usual multi-class problems, such unseen labels seldom exist. In MTC, however, the number of labels is generally very large (e.g. one of our datasets has 1,054 labels (Table 2)), and unseen labels often exist. Thus it is necessary to consider all possible labels in Eq. (1) and Eq. (2) since it is impossible to know which unseen labels are present in the test samples. There are two problems with Eq. (1) and Eq. (2). The first problem is that they involve the prototype vectors of seldom or never seen labels. Without the help of prior knowledge about where the prototype vectors should be, it is impossible to obtain appropriate prototype vectors for such labels. The second problem is that these equations are computationally too demanding since they involve combinatorial maximization and summation over all possible labels, whose number can be quite large. (For example, the number is around 230 in the datasets used in our experiments.) We will address the first problem in Section 3 and the second problem in Section 4. 3 Maximal Margin Labeling In this section, we incorporate some prior knowledge about the location of prototype vectors into Eq. (1) and Eq. (2), and propose a novel MTC learning algorithm, Maximal Margin Labeling (MML). As prior knowledge, we simply assume that the prototype vectors of similar labels should be placed close to each other. Based on this assumption, we first rewrite Eq. (1) to yield f (x) = arg max M T x, eλ L, (3) λ∈Λ where , L is the inner product of R|Λ| and {eλ }λ∈Λ is the orthonormal basis of R|Λ| . The classifier of Eq. (3) can be interpreted as a two-step process: the first step is to map the vector x into R|Λ| by M T , and the second step is to find the closest eλ to image M T x. Then we replace {eλ }λ∈Λ with (generally) non-orthogonal vectors {φ(λ)}λ∈Λ whose geometrical configuration reflects label similarity. More formally speaking, we use vectors {φ(λ)}λ∈Λ that satisfy the condition φ(λ1 ), φ(λ2 ) S = S(λ1 , λ2 ) for ∀λ1 , λ2 ∈ Λ, (4) where , S is an inner product of the vector space spanned by {φ(λ)}λ∈Λ , and S is a Mercer kernel [5] on Λ × Λ and is a similarity measure between labels. We call the vector space spanned by {φ(λ)} VS . With this replacement, MML’s classifier is written as follows. f (x) = arg max W x, φ(λ) S , (5) λ∈Λ where W is a linear map from Rd to VS . W is the solution of the following problem. min W 1 W 2 m 2 λ ξi +C i=1 λ∈Λ,λ=Li φ(Li )−φ(λ) λ λ s.t. W xi , ≥ 1−ξi , ξi ≥ 0 for 1 ≤ i ≤ m, ∀λ = Li . (6) φ(Li )−φ(λ) Note that if φ(λ) is replaced by eλ , Eq. (6) becomes identical to Eq. (2) except for a scale factor. Thus Eq. (5) and Eq. (6) are natural extensions of the multi-class classifier in [4]. We call the MTC classifier of Eq. (5) and Eq. (6) “Maximal Margin Labeling (MML)”. Figure 1 explains the margin (the inner product in Eq. (6)) in MML. The margin represents the distance from the image of the training sample xi to the boundary between the correct label Li and wrong label λ. MML optimizes the linear map W so that the smallest margin between all training samples and all possible labels becomes maximal, along with a penalty C for the case that samples penetrate into the margin. Figure 1: Maximal Margin Labeling Dual Form For numerical computation, the following Wolfe dual form of Eq. (6) is more convenient. (We omit its derivation due to space limits.) S(Li , Li′ )−S(Li , λ′ )−S(λ, Li′ )+S(λ, λ′ ) 1 λ λ′ λ αi αi′ (xi ·xi′ ) max αi − 2 αλ 2 (1−S(Li , λ))(1−S(Li′ , λ′ )) i ′ ′ i,λ i,λ i ,λ λ s.t. 0 ≤ αi ≤ C for 1 ≤ i ≤ m, ∀λ = Li , (7) m λ where we denote i=1 λ∈Λ,∀λ=Li by i,λ , and αi are the dual variables corresponding to the first inequality constraints in Eq. (6). Note that Eq. (7) does not contain φ(λ): all the computations involving φ can be done through the label similarity S. Additionally xi only appears in the inner products, and therefore can be replaced by any kernel of x. λ Using the solution αi of Eq. (7), the MML’s classifier in Eq. (5) can be written as follows. S(Li , L)−S(λ, L) λ . (8) f (x) = arg max αi (x·xi ) 2(1−S(Li , λ)) L∈Λ i,λ Label Similarity3 As examples of label similarity, we use two similarity measures: Dice measure and cosine measure. Dice measure4 4.1 SC (λ1 , λ2 ) = |λ1 ∩ λ2 | 2 |λ1 | |λ2 | = l j=1 l j=1 2|λ1 ∩λ2 | = |λ1 |+|λ2 | Cosine measure 4 SD (λ1 , λ2 ) = λ1 [j] + λ1 [j]λ2 [j] l j=1 l j=1 l j=1 λ2 [j] . λ1 [j]λ2 [j] λ1 [j] l j=1 (9) . 10) ( λ2 [j] Efficient Implementation Approximation in Learning Eq. (7) contains the sum over all possible labels. As the number of topics (l) increases, this summation rapidly becomes intractable since |Λ| grows exponentially as 2l . To circumvent 3 The following discussion is easily extended to include the case that both λ1 and λ2 are empty although we do not discuss the case due to space limits. this problem, we approximate the sum over all possible labels in Eq. (7) by the partial sum λ λ over αi of |(A ∩ B c ) ∪ (Ac ∩ B)| = 1 and set all the other αi to zero. This approximation reduces the burden of the summation quite a lot: the number of summands is reduced from 2l to l, which is a huge reduction especially when many topics exist. λ To understand the rationale behind the approximation, first note that αi is the dual variable λ corresponding to the first inequality constraint (the margin constraint) in Eq. (7). Thus αi is non-zero if and only if W xi falls in the margin between φ(Li ) and φ(λ). We assume that this margin violation mainly occurs when φ(λ) is “close” to φ(Li ), i.e. |(A ∩ B c ) ∪ (Ac ∩ B)| = 1. If this assumption holds well, the proposed approximation of the sum will lead to a good approximation of the exact solution. 4.2 Polynomial Time Algorithms for Classification The classification of MML (Eq. (8)) involves the combinatorial maximization over all possible labels, so it can be a computationally demanding process. However, efficient classification algorithms are available when either the cosine measure or dice measure is used as label similarity. Eq. (8) can be divided into the subproblems by the number of topics in a label. f (x) = arg max g(x, L), (11) ˆ ˆ ˆ L∈{L1 ,L2 ,...,Ll } ˆ Ln = arg max g(x, L). (12) L∈Λ,|L|=n where g(x) is l g(x, L) cn [j]L[j], = j=1 cn [j] = i,λ √ i,λ √ αλ (x·xi ) i 2(1−SD (Li ,λ)) αλ (x·xi ) i 2(1−SC (Li ,λ)) . 2Li [j] |Li |+n − 2λ[j] |λ|+n √Li [j] − √λ[j] √ √ |Li | n |λ| n if SD is used. (13) if SC is used. Here n = |L|. The computational cost of Eq. (13) for all j is O(nα l) (nα is the number of non-zero α), and that of Eq. (12) is O(l log l). Thus the total cost of the classification by Eq. (11) is O(nα l2 + l2 log l). On the other hand, nα is O(ml) under the approximation described above. Therefore, the classification can be done within O(ml3 ) computational steps, which is a significant reduction from the case that the brute force search is used in Eq. (8). 5 Experiments In this section, we report experiments that compared MML to PMM [3], SVM5 [6], and BoosTexter [2] using a collection of Web pages. We used a normalized linear kernel k(x, x′ ) = x · x′ / x x′ in MML and SVM. As for BoosTexter, “real abstaining AdaBoost.MH” was used as the weak learner. 5.1 Experimental Setup The datasets used in our experiment represent the Web page collection used in [3] (Table 2). The Web pages were collected through the hyperlinks from Yahoo!’s top directory 5 For each topic, an SVM classifier is trained to predict whether the topic is relevant (positive) or irrelevant (negative) to input doucments. Dataset Name (Abbrev.) Arts & Humanities (Ar) Business & Economy (Bu) Computers & Internet (Co) Education (Ed) Entertainment (En) Health (He) Recreation (Rc) Reference (Rf) Science (Si) Social Science (SS) Society & Culture (SC) #Text #Voc 7,484 11,214 12,444 12,030 12,730 9,205 12,828 8,027 6,428 12,111 14,512 #Tpc #Lbl 23,146 21,924 34,096 27,534 32,001 30,605 30,324 39,679 37,187 52,350 31,802 26 30 33 33 21 32 22 33 40 39 27 Label Size Frequency (%) 1 2 3 4 ≥5 599 55.6 30.5 9.7 2.8 1.4 233 57.6 28.8 11.0 1.7 0.8 428 69.8 18.2 7.8 3.0 1.1 511 66.9 23.4 7.3 1.9 0.6 337 72.3 21.1 4.5 1.0 1.1 335 53.2 34.0 9.5 2.4 0.9 530 69.2 23.1 5.6 1.4 0.6 275 85.5 12.6 1.5 0.3 0.1 457 68.0 22.3 7.3 1.9 0.5 361 78.4 17.0 3.7 0.7 0.3 1054 59.6 26.1 9.2 2.9 2.2 Table 2: A summary of the web page datasets. “#Text” is the number of texts in the dataset, “#Voc” the number of vocabularies (i.e. features), “#Tpc” the number of topics, “#Lbl” the number of labels, and “Label Size Frequency” is the relative frequency of each label size. (Label size is the number of topics in a label.) Method MML PMM SVM Boost Feature Type TF, TF×IDF TF TF, TF×IDF Binary Parameter C = 0.1, 1, 10 Model1, Model2 C = 0.1, 1, 10 R = {2, 4, 6, 8, 10}×103 Table 3: Candidate feature types and learning parameters. (R is the number of weak hypotheses.) The underlined fetures and parameters were selected for the evaluation with the test data. (www.yahoo.com), and then divided into 11 datasets by Yahoo’s top category. Each page is labeled with the Yahoo’s second level sub-categories from which the page is hyperlinked. (Thus, the sub-categories are topics in our term.) See [3] for more details about the collection. Then the Web pages were converted into three types of feature vectors: (a) Binary vectors, where each feature indicates the presence (absence) of a term by 1 (0); (b) TF vectors, where each feature is the number of appearances of a term (term frequency); and (c) TF×IDF vectors, where each feature is the product of term frequency and inverse document frequency [7]. To select the best combinations of feature types and learning parameters such as the penalty C for MML, the learners were trained on 2,000 Web pages with all combinations of feature and parameter listed in Table 3, and then were evaluated by labeling F-measure on independently drawn development data. The combinations which achieve the best labeling F-measures (underlined in Table 3) were used in the following experiments. 5.2 Evaluation Measures We used three measures to evaluate labeling performance: labeling F-measure, exact match ratio, and retrieval F-measure. In the following definitions, {Lpred }n and {Ltrue }n i=1 i i=1 i mean the predicted labels and the true labels, respectively. Labeling F-measure Labeling F-measure FL evaluates the average labeling performance while taking partial match into account. FL = 1 n n i=1 2|Lpred ∩ Ltrue | i i |Lpred | + |Ltrue | i i = 1 n n i=1 l pred [j]Ltrue [j] i j=1 Li . l (Lpred [j] + Ltrue [j]) i i j=1 2 (14) Dataset Ar Bu Co Ed En He Rc Rf Si SS SC Avg MD 0.55 0.80 0.62 0.56 0.64 0.74 0.63 0.67 0.61 0.73 0.60 0.65 Labeling F-measure MC PM SV BO 0.44 0.50 0.46 0.38 0.81 0.75 0.76 0.75 0.59 0.61 0.55 0.47 0.43 0.51 0.48 0.37 0.52 0.61 0.54 0.49 0.74 0.66 0.67 0.60 0.46 0.55 0.49 0.44 0.58 0.63 0.56 0.50 0.54 0.52 0.47 0.39 0.71 0.66 0.64 0.59 0.55 0.54 0.49 0.44 0.58 0.59 0.56 0.49 Exact Match Ratio MC PM SV 0.32 0.21 0.29 0.62 0.48 0.57 0.46 0.35 0.41 0.34 0.19 0.30 0.44 0.31 0.42 0.53 0.34 0.47 0.38 0.25 0.37 0.51 0.39 0.49 0.43 0.22 0.36 0.60 0.45 0.55 0.40 0.21 0.32 0.46 0.31 0.41 MD 0.44 0.63 0.51 0.45 0.55 0.58 0.54 0.60 0.52 0.65 0.44 0.54 BO 0.22 0.53 0.34 0.23 0.36 0.39 0.33 0.41 0.28 0.49 0.27 0.35 MD 0.30 0.25 0.27 0.25 0.37 0.35 0.47 0.29 0.37 0.36 0.29 0.32 Retrieval F-measure MC PM SV BO 0.26 0.24 0.29 0.22 0.27 0.20 0.29 0.20 0.25 0.19 0.30 0.17 0.23 0.21 0.25 0.16 0.33 0.30 0.35 0.29 0.35 0.23 0.35 0.26 0.39 0.36 0.40 0.33 0.25 0.24 0.25 0.16 0.35 0.28 0.31 0.19 0.35 0.18 0.31 0.15 0.28 0.25 0.26 0.20 0.30 0.24 0.31 0.21 Table 4: The performance comparison by labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). The bold figures are the best ones among the five methods, and the underlined figures the second best ones. MD, MC, PM, SV, and BO represent MML with SD , MML with SC , PMM, SVM and BoosTexter, respectively. Exact Match Ratio Exact match ratio EX counts only exact matches between the predicted label and the true label. EX = 1 n n I[Lpred = Ltrue ], i i (15) i=1 where I[S] is 1 if the statement S is true and 0 otherwise. Retrieval F-measure6 For real tasks, it is also important to evaluate retrieval performance, i.e. how accurately classifiers can find relevant texts for a given topic. Retrieval F-measure FR measures the average retrieval performance over all topics. FR = 5.3 1 l l j=1 n pred [j]Ltrue [j] i i=1 Li . n (Lpred [j] + Ltrue [j]) i i i=1 2 (16) Results First we trained the classifiers with randomly chosen 2,000 samples. We then calculated the three evaluation measures for 3,000 other randomly chosen samples. This process was repeated five times, and the resulting averaged values are shown in Table 4. Table 4 shows that the MMLs with Dice measure outperform other methods in labeling F-measure and exact match ratio. The MMLs also show the best performance with regard to retrieval Fmeasure although the margins to the other methods are not as large as observed in labeling F-measure and exact match ratio. Note that no classifier except MML with Dice measure achieves good labeling on all the three measures. For example, PMM shows high labeling F-measures, but its performance is rather poor when evaluated in retrieval F-measure. As the second experiment, we evaluated the classifiers trained with 250–2000 training samples on the same test samples. Figure 2 shows each measure averaged over all datasets. It is observed that the MMLs show high generalization even when training data is small. An interesting point is that MML with cosine measure achieves rather high labeling F-measures and retrieval F-measure with training data of smaller size. Such high-performace, however, does not continue when trained on larger data. 6 FR is called “the macro average of F-measures” in the text categorization community. Figure 2: The learning curve of labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). MD, MC, PM, SV, BO mean the same as in Table 4. 6 Conclusion In this paper, we proposed a novel learning algorithm for multi-topic text categorization. The algorithm, Maximal Margin Labeling, embeds labels (sets of topics) into a similarityinduced vector space, and learns a large margin classifier in the space. To overcome the demanding computational cost of MML, we provide an approximation method in learning and efficient classification algorithms. In experiments on a collection of Web pages, MML outperformed other methods including SVM and showed better generalization. Acknowledgement The authors would like to thank Naonori Ueda, Kazumi Saito and Yuji Kaneda of Nippon Telegraph and Telephone Corporation for providing PMM’s codes and the datasets. References [1] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire N´ dellec and C´ line Rouveirol, editors, Proc. of the e e 10th European Conference on Machine Learning, number 1398, pages 137–142, 1998. [2] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000. [3] Naonori Ueda and Kazumi Saito. Parametoric mixture models for multi-topic text. In Advances in Neural Information Processing Systems 15, pages 1261–1268, 2003. [4] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. [5] Klaus-Robert M¨ ller, Sebastian Mika, Gunnar R¨ tsch, Koji Tsuda, and Bernhard u a Sch¨ lkopf. An introduction to kernel-based learning algorithms. IEEE Transactions o on Neural Networks, 12(2):181–201, 2001. [6] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998. [7] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wealy, 1999.
Reference: text
sentIndex sentText sentNum sentScore
1 jp Abstract In this paper, we address the problem of statistical learning for multitopic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. [sent-5, score-0.403]
2 The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class categorization problem. [sent-6, score-0.309]
3 To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. [sent-7, score-0.372]
4 Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines. [sent-8, score-0.031]
5 1 Multi-topic Text Categorization (MTC) This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. [sent-9, score-0.481]
6 In MTC, multiple topics may be relevant to a single text. [sent-10, score-0.203]
7 We thus call a set of topics label, and say that a text is assigned a label, not a topic. [sent-11, score-0.243]
8 [1, 2]), the label is predicted by judging each topic’s relevance to the text. [sent-14, score-0.137]
9 Imagine an MTC problem of scientific papers where quantum computing papers are assigned multi-topic label “quantum physics (QP) & computer science (CS)”. [sent-17, score-0.349]
10 ) Since there are some words specific to quantum computing such as “qbit1 ”, one can say that efficient MTC learners should use such words to assign label QP & CS. [sent-19, score-0.277]
11 However, the decomposition approach is likely to ignore these words since they are only specific to a small portion of the whole QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or CS. [sent-20, score-0.269]
12 1 Qbit is a unit of quantum information, and frequently appears in real quantum computing literatures, but rarely seen in other literatures. [sent-21, score-0.228]
13 , tl T L, λ(⊂ T ) L[j] Λ(= 2T ) {(xi , Li )}m i=1 Meaning A document vector Topics The set of all topics A label The binary representation of L. [sent-25, score-0.302]
14 The set of all possible labels Training samples Table 1: Notation Parametric Mixture Model (PMM) [3] adopts another approach to MTC. [sent-27, score-0.16]
15 It is assumed in PMM that multi-topic texts are generated from a mixture of topic-specific word distributions. [sent-28, score-0.064]
16 Its decision on labeling is done at once, not separately for each topic. [sent-29, score-0.228]
17 However, PMM also has a problem with multi-topic specific features such as “qbit” since it is impossible for texts to have such features given PMM’s mixture process. [sent-30, score-0.097]
18 These problems with multi-topic specific features are caused by dependency assumptions between labels, which are explicitly or implicitly made in existing methods. [sent-31, score-0.031]
19 To solve these problems, we propose Maximal Margin Labeling, which treats labels as independent classes and learns a multi-class classifier on the induced multi-class problem. [sent-32, score-0.187]
20 In Section 5, MML is experimentally compared with existing methods using a collection of multi-topic Web pages. [sent-35, score-0.058]
21 2 Solving MTC as a Multi-Class Categorization To discuss why existing multi-class classifiers do not work in MTC, we start from the multi-class classifier proposed in [4]. [sent-37, score-0.031]
22 The multi-class classifier in [4] categorizes an object into the class whose prototype vector is the closest to the object’s feature vector. [sent-39, score-0.225]
23 By substituting label for class, the classifier can be written as follows. [sent-40, score-0.137]
24 f (x) = arg max x, mλ (1) X λ∈Λ where , X is the the inner product of Rd , and mλ ∈ Rd is the prototype vector of label λ. [sent-41, score-0.447]
25 Following the similar argument as in [4], the prototype vectors are learned by solving the following maximal margin problem2 . [sent-42, score-0.478]
26 1 M 2 2 λ ξi +C xi , mLi 1≤i≤m λ∈Λ,λ=Li X − xi , m λ X λ ≥ 1 − ξi for 1 ≤ i ≤ m, ∀λ = Li , (2) where M is the prototype matrix whose columns are the prototype vectors, and M is the Frobenius matrix norm of M . [sent-45, score-0.458]
27 This is because the labels unseen in training samples may be relevant to test samples. [sent-49, score-0.242]
28 On the other hand, Crammer and Singer penalize only the largest violation of the margin constraint for each training sample [4]. [sent-52, score-0.219]
29 usual multi-class problems, such unseen labels seldom exist. [sent-55, score-0.221]
30 In MTC, however, the number of labels is generally very large (e. [sent-56, score-0.129]
31 one of our datasets has 1,054 labels (Table 2)), and unseen labels often exist. [sent-58, score-0.332]
32 Thus it is necessary to consider all possible labels in Eq. [sent-59, score-0.129]
33 (2) since it is impossible to know which unseen labels are present in the test samples. [sent-61, score-0.206]
34 The first problem is that they involve the prototype vectors of seldom or never seen labels. [sent-65, score-0.291]
35 Without the help of prior knowledge about where the prototype vectors should be, it is impossible to obtain appropriate prototype vectors for such labels. [sent-66, score-0.519]
36 The second problem is that these equations are computationally too demanding since they involve combinatorial maximization and summation over all possible labels, whose number can be quite large. [sent-67, score-0.082]
37 (For example, the number is around 230 in the datasets used in our experiments. [sent-68, score-0.03]
38 3 Maximal Margin Labeling In this section, we incorporate some prior knowledge about the location of prototype vectors into Eq. [sent-70, score-0.243]
39 As prior knowledge, we simply assume that the prototype vectors of similar labels should be placed close to each other. [sent-73, score-0.372]
40 (1) to yield f (x) = arg max M T x, eλ L, (3) λ∈Λ where , L is the inner product of R|Λ| and {eλ }λ∈Λ is the orthonormal basis of R|Λ| . [sent-75, score-0.12]
41 Then we replace {eλ }λ∈Λ with (generally) non-orthogonal vectors {φ(λ)}λ∈Λ whose geometrical configuration reflects label similarity. [sent-78, score-0.19]
42 More formally speaking, we use vectors {φ(λ)}λ∈Λ that satisfy the condition φ(λ1 ), φ(λ2 ) S = S(λ1 , λ2 ) for ∀λ1 , λ2 ∈ Λ, (4) where , S is an inner product of the vector space spanned by {φ(λ)}λ∈Λ , and S is a Mercer kernel [5] on Λ × Λ and is a similarity measure between labels. [sent-79, score-0.13]
43 f (x) = arg max W x, φ(λ) S , (5) λ∈Λ where W is a linear map from Rd to VS . [sent-82, score-0.071]
44 Figure 1 explains the margin (the inner product in Eq. [sent-96, score-0.187]
45 The margin represents the distance from the image of the training sample xi to the boundary between the correct label Li and wrong label λ. [sent-98, score-0.451]
46 MML optimizes the linear map W so that the smallest margin between all training samples and all possible labels becomes maximal, along with a penalty C for the case that samples penetrate into the margin. [sent-99, score-0.329]
47 Figure 1: Maximal Margin Labeling Dual Form For numerical computation, the following Wolfe dual form of Eq. [sent-100, score-0.037]
48 ) S(Li , Li′ )−S(Li , λ′ )−S(λ, Li′ )+S(λ, λ′ ) 1 λ λ′ λ αi αi′ (xi ·xi′ ) max αi − 2 αλ 2 (1−S(Li , λ))(1−S(Li′ , λ′ )) i ′ ′ i,λ i,λ i ,λ λ s. [sent-103, score-0.028]
49 0 ≤ αi ≤ C for 1 ≤ i ≤ m, ∀λ = Li , (7) m λ where we denote i=1 λ∈Λ,∀λ=Li by i,λ , and αi are the dual variables corresponding to the first inequality constraints in Eq. [sent-105, score-0.037]
50 (7) does not contain φ(λ): all the computations involving φ can be done through the label similarity S. [sent-108, score-0.137]
51 Additionally xi only appears in the inner products, and therefore can be replaced by any kernel of x. [sent-109, score-0.088]
52 (8) f (x) = arg max αi (x·xi ) 2(1−S(Li , λ)) L∈Λ i,λ Label Similarity3 As examples of label similarity, we use two similarity measures: Dice measure and cosine measure. [sent-114, score-0.289]
53 1 SC (λ1 , λ2 ) = |λ1 ∩ λ2 | 2 |λ1 | |λ2 | = l j=1 l j=1 2|λ1 ∩λ2 | = |λ1 |+|λ2 | Cosine measure 4 SD (λ1 , λ2 ) = λ1 [j] + λ1 [j]λ2 [j] l j=1 l j=1 l j=1 λ2 [j] . [sent-116, score-0.028]
54 As the number of topics (l) increases, this summation rapidly becomes intractable since |Λ| grows exponentially as 2l . [sent-120, score-0.199]
55 this problem, we approximate the sum over all possible labels in Eq. [sent-122, score-0.129]
56 This approximation reduces the burden of the summation quite a lot: the number of summands is reduced from 2l to l, which is a huge reduction especially when many topics exist. [sent-124, score-0.199]
57 λ To understand the rationale behind the approximation, first note that αi is the dual variable λ corresponding to the first inequality constraint (the margin constraint) in Eq. [sent-125, score-0.175]
58 Thus αi is non-zero if and only if W xi falls in the margin between φ(Li ) and φ(λ). [sent-127, score-0.177]
59 We assume that this margin violation mainly occurs when φ(λ) is “close” to φ(Li ), i. [sent-128, score-0.191]
60 If this assumption holds well, the proposed approximation of the sum will lead to a good approximation of the exact solution. [sent-131, score-0.049]
61 (8)) involves the combinatorial maximization over all possible labels, so it can be a computationally demanding process. [sent-134, score-0.048]
62 However, efficient classification algorithms are available when either the cosine measure or dice measure is used as label similarity. [sent-135, score-0.366]
63 (8) can be divided into the subproblems by the number of topics in a label. [sent-137, score-0.165]
64 f (x) = arg max g(x, L), (11) ˆ ˆ ˆ L∈{L1 ,L2 ,. [sent-138, score-0.071]
65 5 Experiments In this section, we report experiments that compared MML to PMM [3], SVM5 [6], and BoosTexter [2] using a collection of Web pages. [sent-154, score-0.027]
66 1 Experimental Setup The datasets used in our experiment represent the Web page collection used in [3] (Table 2). [sent-159, score-0.107]
67 ’s top directory 5 For each topic, an SVM classifier is trained to predict whether the topic is relevant (positive) or irrelevant (negative) to input doucments. [sent-161, score-0.095]
68 “#Text” is the number of texts in the dataset, “#Voc” the number of vocabularies (i. [sent-219, score-0.064]
69 features), “#Tpc” the number of topics, “#Lbl” the number of labels, and “Label Size Frequency” is the relative frequency of each label size. [sent-221, score-0.179]
70 1, 1, 10 R = {2, 4, 6, 8, 10}×103 Table 3: Candidate feature types and learning parameters. [sent-225, score-0.035]
71 ) The underlined fetures and parameters were selected for the evaluation with the test data. [sent-227, score-0.072]
72 com), and then divided into 11 datasets by Yahoo’s top category. [sent-230, score-0.03]
73 Each page is labeled with the Yahoo’s second level sub-categories from which the page is hyperlinked. [sent-231, score-0.1]
74 The combinations which achieve the best labeling F-measures (underlined in Table 3) were used in the following experiments. [sent-236, score-0.254]
75 2 Evaluation Measures We used three measures to evaluate labeling performance: labeling F-measure, exact match ratio, and retrieval F-measure. [sent-238, score-0.715]
76 In the following definitions, {Lpred }n and {Ltrue }n i=1 i i=1 i mean the predicted labels and the true labels, respectively. [sent-239, score-0.129]
77 Labeling F-measure Labeling F-measure FL evaluates the average labeling performance while taking partial match into account. [sent-240, score-0.299]
78 FL = 1 n n i=1 2|Lpred ∩ Ltrue | i i |Lpred | + |Ltrue | i i = 1 n n i=1 l pred [j]Ltrue [j] i j=1 Li . [sent-241, score-0.038]
79 21 Table 4: The performance comparison by labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). [sent-422, score-0.49]
80 The bold figures are the best ones among the five methods, and the underlined figures the second best ones. [sent-423, score-0.072]
81 Exact Match Ratio Exact match ratio EX counts only exact matches between the predicted label and the true label. [sent-425, score-0.291]
82 Retrieval F-measure6 For real tasks, it is also important to evaluate retrieval performance, i. [sent-427, score-0.108]
83 how accurately classifiers can find relevant texts for a given topic. [sent-429, score-0.102]
84 Retrieval F-measure FR measures the average retrieval performance over all topics. [sent-430, score-0.139]
85 We then calculated the three evaluation measures for 3,000 other randomly chosen samples. [sent-434, score-0.031]
86 Table 4 shows that the MMLs with Dice measure outperform other methods in labeling F-measure and exact match ratio. [sent-436, score-0.376]
87 The MMLs also show the best performance with regard to retrieval Fmeasure although the margins to the other methods are not as large as observed in labeling F-measure and exact match ratio. [sent-437, score-0.456]
88 Note that no classifier except MML with Dice measure achieves good labeling on all the three measures. [sent-438, score-0.256]
89 For example, PMM shows high labeling F-measures, but its performance is rather poor when evaluated in retrieval F-measure. [sent-439, score-0.336]
90 As the second experiment, we evaluated the classifiers trained with 250–2000 training samples on the same test samples. [sent-440, score-0.031]
91 Figure 2 shows each measure averaged over all datasets. [sent-441, score-0.028]
92 An interesting point is that MML with cosine measure achieves rather high labeling F-measures and retrieval F-measure with training data of smaller size. [sent-443, score-0.417]
93 6 FR is called “the macro average of F-measures” in the text categorization community. [sent-445, score-0.2]
94 Figure 2: The learning curve of labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). [sent-446, score-0.49]
95 6 Conclusion In this paper, we proposed a novel learning algorithm for multi-topic text categorization. [sent-448, score-0.078]
96 The algorithm, Maximal Margin Labeling, embeds labels (sets of topics) into a similarityinduced vector space, and learns a large margin classifier in the space. [sent-449, score-0.293]
97 To overcome the demanding computational cost of MML, we provide an approximation method in learning and efficient classification algorithms. [sent-450, score-0.048]
98 In experiments on a collection of Web pages, MML outperformed other methods including SVM and showed better generalization. [sent-451, score-0.027]
99 Text categorization with support vector machines: learning with many relevant features. [sent-454, score-0.16]
wordName wordTfidf (topN-words)
[('mml', 0.502), ('mtc', 0.287), ('li', 0.23), ('labeling', 0.228), ('pmm', 0.215), ('ltrue', 0.191), ('prototype', 0.19), ('topics', 0.165), ('lpred', 0.144), ('margin', 0.138), ('label', 0.137), ('labels', 0.129), ('categorization', 0.122), ('bo', 0.12), ('dice', 0.12), ('tf', 0.117), ('quantum', 0.114), ('retrieval', 0.108), ('web', 0.101), ('maximal', 0.097), ('qp', 0.091), ('sc', 0.088), ('boostexter', 0.083), ('pm', 0.079), ('mc', 0.079), ('text', 0.078), ('mmls', 0.072), ('underlined', 0.072), ('match', 0.071), ('sv', 0.068), ('classi', 0.067), ('er', 0.067), ('texts', 0.064), ('sd', 0.061), ('topic', 0.057), ('md', 0.057), ('idf', 0.057), ('violation', 0.053), ('cosine', 0.053), ('vectors', 0.053), ('cs', 0.051), ('page', 0.05), ('papers', 0.049), ('inner', 0.049), ('exact', 0.049), ('table', 0.049), ('lbl', 0.048), ('qbit', 0.048), ('seldom', 0.048), ('tpc', 0.048), ('voc', 0.048), ('demanding', 0.048), ('fr', 0.048), ('yahoo', 0.048), ('unseen', 0.044), ('arg', 0.043), ('frequency', 0.042), ('nippon', 0.042), ('telegraph', 0.042), ('bu', 0.042), ('kazumi', 0.042), ('naonori', 0.042), ('xi', 0.039), ('relevant', 0.038), ('telephone', 0.038), ('pred', 0.038), ('dual', 0.037), ('ueda', 0.035), ('feature', 0.035), ('ratio', 0.034), ('summation', 0.034), ('ac', 0.033), ('fl', 0.033), ('impossible', 0.033), ('rf', 0.032), ('treats', 0.032), ('measures', 0.031), ('existing', 0.031), ('samples', 0.031), ('datasets', 0.03), ('ss', 0.03), ('rd', 0.029), ('yoram', 0.029), ('corporation', 0.029), ('ers', 0.029), ('penalize', 0.028), ('measure', 0.028), ('max', 0.028), ('collection', 0.027), ('co', 0.027), ('combinations', 0.026), ('ex', 0.026), ('en', 0.026), ('ar', 0.026), ('crammer', 0.026), ('learners', 0.026), ('rc', 0.026), ('svm', 0.026), ('learns', 0.026), ('gures', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
Author: Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, Eisaku Maeda
Abstract: In this paper, we address the problem of statistical learning for multitopic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class categorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines. 1 Multi-topic Text Categorization (MTC) This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We thus call a set of topics label, and say that a text is assigned a label, not a topic. In almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by judging each topic’s relevance to the text. In this decomposition approach, the features specific to a topic, not a label, are regarded as important features. However, the approach may result in inefficient learning as we will explain in the following example. Imagine an MTC problem of scientific papers where quantum computing papers are assigned multi-topic label “quantum physics (QP) & computer science (CS)”. (QP and CS are topics in this example.) Since there are some words specific to quantum computing such as “qbit1 ”, one can say that efficient MTC learners should use such words to assign label QP & CS. However, the decomposition approach is likely to ignore these words since they are only specific to a small portion of the whole QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or CS. 1 Qbit is a unit of quantum information, and frequently appears in real quantum computing literatures, but rarely seen in other literatures. Symbol x(∈ Rd ) t1 , t2 , . . . , tl T L, λ(⊂ T ) L[j] Λ(= 2T ) {(xi , Li )}m i=1 Meaning A document vector Topics The set of all topics A label The binary representation of L. 1 if tj ∈ L and 0 otherwise. The set of all possible labels Training samples Table 1: Notation Parametric Mixture Model (PMM) [3] adopts another approach to MTC. It is assumed in PMM that multi-topic texts are generated from a mixture of topic-specific word distributions. Its decision on labeling is done at once, not separately for each topic. However, PMM also has a problem with multi-topic specific features such as “qbit” since it is impossible for texts to have such features given PMM’s mixture process. These problems with multi-topic specific features are caused by dependency assumptions between labels, which are explicitly or implicitly made in existing methods. To solve these problems, we propose Maximal Margin Labeling, which treats labels as independent classes and learns a multi-class classifier on the induced multi-class problem. In this paper, we first discuss why multi-class classifiers cannot be directly applied to MTC in Section 2. We then propose MML in Section 3, and address implementation issues in Section 4. In Section 5, MML is experimentally compared with existing methods using a collection of multi-topic Web pages. We summarize this paper in Section 6. 2 Solving MTC as a Multi-Class Categorization To discuss why existing multi-class classifiers do not work in MTC, we start from the multi-class classifier proposed in [4]. Hereafter we use the notation given in Table 1. The multi-class classifier in [4] categorizes an object into the class whose prototype vector is the closest to the object’s feature vector. By substituting label for class, the classifier can be written as follows. f (x) = arg max x, mλ (1) X λ∈Λ where , X is the the inner product of Rd , and mλ ∈ Rd is the prototype vector of label λ. Following the similar argument as in [4], the prototype vectors are learned by solving the following maximal margin problem2 . min M s.t. 1 M 2 2 λ ξi +C xi , mLi 1≤i≤m λ∈Λ,λ=Li X − xi , m λ X λ ≥ 1 − ξi for 1 ≤ i ≤ m, ∀λ = Li , (2) where M is the prototype matrix whose columns are the prototype vectors, and M is the Frobenius matrix norm of M . Note that Eq. (1) and Eq. (2) cover only training samples’ labels, but also all possible labels. This is because the labels unseen in training samples may be relevant to test samples. In 2 In Eq.(2), we penalize all violation of the margin constraints. On the other hand, Crammer and Singer penalize only the largest violation of the margin constraint for each training sample [4]. We chose the “penalize-all” approach since it leads to an optimization problem without equality constraints (see Eq.(7)), which is much easier to solve than the one in [4]. usual multi-class problems, such unseen labels seldom exist. In MTC, however, the number of labels is generally very large (e.g. one of our datasets has 1,054 labels (Table 2)), and unseen labels often exist. Thus it is necessary to consider all possible labels in Eq. (1) and Eq. (2) since it is impossible to know which unseen labels are present in the test samples. There are two problems with Eq. (1) and Eq. (2). The first problem is that they involve the prototype vectors of seldom or never seen labels. Without the help of prior knowledge about where the prototype vectors should be, it is impossible to obtain appropriate prototype vectors for such labels. The second problem is that these equations are computationally too demanding since they involve combinatorial maximization and summation over all possible labels, whose number can be quite large. (For example, the number is around 230 in the datasets used in our experiments.) We will address the first problem in Section 3 and the second problem in Section 4. 3 Maximal Margin Labeling In this section, we incorporate some prior knowledge about the location of prototype vectors into Eq. (1) and Eq. (2), and propose a novel MTC learning algorithm, Maximal Margin Labeling (MML). As prior knowledge, we simply assume that the prototype vectors of similar labels should be placed close to each other. Based on this assumption, we first rewrite Eq. (1) to yield f (x) = arg max M T x, eλ L, (3) λ∈Λ where , L is the inner product of R|Λ| and {eλ }λ∈Λ is the orthonormal basis of R|Λ| . The classifier of Eq. (3) can be interpreted as a two-step process: the first step is to map the vector x into R|Λ| by M T , and the second step is to find the closest eλ to image M T x. Then we replace {eλ }λ∈Λ with (generally) non-orthogonal vectors {φ(λ)}λ∈Λ whose geometrical configuration reflects label similarity. More formally speaking, we use vectors {φ(λ)}λ∈Λ that satisfy the condition φ(λ1 ), φ(λ2 ) S = S(λ1 , λ2 ) for ∀λ1 , λ2 ∈ Λ, (4) where , S is an inner product of the vector space spanned by {φ(λ)}λ∈Λ , and S is a Mercer kernel [5] on Λ × Λ and is a similarity measure between labels. We call the vector space spanned by {φ(λ)} VS . With this replacement, MML’s classifier is written as follows. f (x) = arg max W x, φ(λ) S , (5) λ∈Λ where W is a linear map from Rd to VS . W is the solution of the following problem. min W 1 W 2 m 2 λ ξi +C i=1 λ∈Λ,λ=Li φ(Li )−φ(λ) λ λ s.t. W xi , ≥ 1−ξi , ξi ≥ 0 for 1 ≤ i ≤ m, ∀λ = Li . (6) φ(Li )−φ(λ) Note that if φ(λ) is replaced by eλ , Eq. (6) becomes identical to Eq. (2) except for a scale factor. Thus Eq. (5) and Eq. (6) are natural extensions of the multi-class classifier in [4]. We call the MTC classifier of Eq. (5) and Eq. (6) “Maximal Margin Labeling (MML)”. Figure 1 explains the margin (the inner product in Eq. (6)) in MML. The margin represents the distance from the image of the training sample xi to the boundary between the correct label Li and wrong label λ. MML optimizes the linear map W so that the smallest margin between all training samples and all possible labels becomes maximal, along with a penalty C for the case that samples penetrate into the margin. Figure 1: Maximal Margin Labeling Dual Form For numerical computation, the following Wolfe dual form of Eq. (6) is more convenient. (We omit its derivation due to space limits.) S(Li , Li′ )−S(Li , λ′ )−S(λ, Li′ )+S(λ, λ′ ) 1 λ λ′ λ αi αi′ (xi ·xi′ ) max αi − 2 αλ 2 (1−S(Li , λ))(1−S(Li′ , λ′ )) i ′ ′ i,λ i,λ i ,λ λ s.t. 0 ≤ αi ≤ C for 1 ≤ i ≤ m, ∀λ = Li , (7) m λ where we denote i=1 λ∈Λ,∀λ=Li by i,λ , and αi are the dual variables corresponding to the first inequality constraints in Eq. (6). Note that Eq. (7) does not contain φ(λ): all the computations involving φ can be done through the label similarity S. Additionally xi only appears in the inner products, and therefore can be replaced by any kernel of x. λ Using the solution αi of Eq. (7), the MML’s classifier in Eq. (5) can be written as follows. S(Li , L)−S(λ, L) λ . (8) f (x) = arg max αi (x·xi ) 2(1−S(Li , λ)) L∈Λ i,λ Label Similarity3 As examples of label similarity, we use two similarity measures: Dice measure and cosine measure. Dice measure4 4.1 SC (λ1 , λ2 ) = |λ1 ∩ λ2 | 2 |λ1 | |λ2 | = l j=1 l j=1 2|λ1 ∩λ2 | = |λ1 |+|λ2 | Cosine measure 4 SD (λ1 , λ2 ) = λ1 [j] + λ1 [j]λ2 [j] l j=1 l j=1 l j=1 λ2 [j] . λ1 [j]λ2 [j] λ1 [j] l j=1 (9) . 10) ( λ2 [j] Efficient Implementation Approximation in Learning Eq. (7) contains the sum over all possible labels. As the number of topics (l) increases, this summation rapidly becomes intractable since |Λ| grows exponentially as 2l . To circumvent 3 The following discussion is easily extended to include the case that both λ1 and λ2 are empty although we do not discuss the case due to space limits. this problem, we approximate the sum over all possible labels in Eq. (7) by the partial sum λ λ over αi of |(A ∩ B c ) ∪ (Ac ∩ B)| = 1 and set all the other αi to zero. This approximation reduces the burden of the summation quite a lot: the number of summands is reduced from 2l to l, which is a huge reduction especially when many topics exist. λ To understand the rationale behind the approximation, first note that αi is the dual variable λ corresponding to the first inequality constraint (the margin constraint) in Eq. (7). Thus αi is non-zero if and only if W xi falls in the margin between φ(Li ) and φ(λ). We assume that this margin violation mainly occurs when φ(λ) is “close” to φ(Li ), i.e. |(A ∩ B c ) ∪ (Ac ∩ B)| = 1. If this assumption holds well, the proposed approximation of the sum will lead to a good approximation of the exact solution. 4.2 Polynomial Time Algorithms for Classification The classification of MML (Eq. (8)) involves the combinatorial maximization over all possible labels, so it can be a computationally demanding process. However, efficient classification algorithms are available when either the cosine measure or dice measure is used as label similarity. Eq. (8) can be divided into the subproblems by the number of topics in a label. f (x) = arg max g(x, L), (11) ˆ ˆ ˆ L∈{L1 ,L2 ,...,Ll } ˆ Ln = arg max g(x, L). (12) L∈Λ,|L|=n where g(x) is l g(x, L) cn [j]L[j], = j=1 cn [j] = i,λ √ i,λ √ αλ (x·xi ) i 2(1−SD (Li ,λ)) αλ (x·xi ) i 2(1−SC (Li ,λ)) . 2Li [j] |Li |+n − 2λ[j] |λ|+n √Li [j] − √λ[j] √ √ |Li | n |λ| n if SD is used. (13) if SC is used. Here n = |L|. The computational cost of Eq. (13) for all j is O(nα l) (nα is the number of non-zero α), and that of Eq. (12) is O(l log l). Thus the total cost of the classification by Eq. (11) is O(nα l2 + l2 log l). On the other hand, nα is O(ml) under the approximation described above. Therefore, the classification can be done within O(ml3 ) computational steps, which is a significant reduction from the case that the brute force search is used in Eq. (8). 5 Experiments In this section, we report experiments that compared MML to PMM [3], SVM5 [6], and BoosTexter [2] using a collection of Web pages. We used a normalized linear kernel k(x, x′ ) = x · x′ / x x′ in MML and SVM. As for BoosTexter, “real abstaining AdaBoost.MH” was used as the weak learner. 5.1 Experimental Setup The datasets used in our experiment represent the Web page collection used in [3] (Table 2). The Web pages were collected through the hyperlinks from Yahoo!’s top directory 5 For each topic, an SVM classifier is trained to predict whether the topic is relevant (positive) or irrelevant (negative) to input doucments. Dataset Name (Abbrev.) Arts & Humanities (Ar) Business & Economy (Bu) Computers & Internet (Co) Education (Ed) Entertainment (En) Health (He) Recreation (Rc) Reference (Rf) Science (Si) Social Science (SS) Society & Culture (SC) #Text #Voc 7,484 11,214 12,444 12,030 12,730 9,205 12,828 8,027 6,428 12,111 14,512 #Tpc #Lbl 23,146 21,924 34,096 27,534 32,001 30,605 30,324 39,679 37,187 52,350 31,802 26 30 33 33 21 32 22 33 40 39 27 Label Size Frequency (%) 1 2 3 4 ≥5 599 55.6 30.5 9.7 2.8 1.4 233 57.6 28.8 11.0 1.7 0.8 428 69.8 18.2 7.8 3.0 1.1 511 66.9 23.4 7.3 1.9 0.6 337 72.3 21.1 4.5 1.0 1.1 335 53.2 34.0 9.5 2.4 0.9 530 69.2 23.1 5.6 1.4 0.6 275 85.5 12.6 1.5 0.3 0.1 457 68.0 22.3 7.3 1.9 0.5 361 78.4 17.0 3.7 0.7 0.3 1054 59.6 26.1 9.2 2.9 2.2 Table 2: A summary of the web page datasets. “#Text” is the number of texts in the dataset, “#Voc” the number of vocabularies (i.e. features), “#Tpc” the number of topics, “#Lbl” the number of labels, and “Label Size Frequency” is the relative frequency of each label size. (Label size is the number of topics in a label.) Method MML PMM SVM Boost Feature Type TF, TF×IDF TF TF, TF×IDF Binary Parameter C = 0.1, 1, 10 Model1, Model2 C = 0.1, 1, 10 R = {2, 4, 6, 8, 10}×103 Table 3: Candidate feature types and learning parameters. (R is the number of weak hypotheses.) The underlined fetures and parameters were selected for the evaluation with the test data. (www.yahoo.com), and then divided into 11 datasets by Yahoo’s top category. Each page is labeled with the Yahoo’s second level sub-categories from which the page is hyperlinked. (Thus, the sub-categories are topics in our term.) See [3] for more details about the collection. Then the Web pages were converted into three types of feature vectors: (a) Binary vectors, where each feature indicates the presence (absence) of a term by 1 (0); (b) TF vectors, where each feature is the number of appearances of a term (term frequency); and (c) TF×IDF vectors, where each feature is the product of term frequency and inverse document frequency [7]. To select the best combinations of feature types and learning parameters such as the penalty C for MML, the learners were trained on 2,000 Web pages with all combinations of feature and parameter listed in Table 3, and then were evaluated by labeling F-measure on independently drawn development data. The combinations which achieve the best labeling F-measures (underlined in Table 3) were used in the following experiments. 5.2 Evaluation Measures We used three measures to evaluate labeling performance: labeling F-measure, exact match ratio, and retrieval F-measure. In the following definitions, {Lpred }n and {Ltrue }n i=1 i i=1 i mean the predicted labels and the true labels, respectively. Labeling F-measure Labeling F-measure FL evaluates the average labeling performance while taking partial match into account. FL = 1 n n i=1 2|Lpred ∩ Ltrue | i i |Lpred | + |Ltrue | i i = 1 n n i=1 l pred [j]Ltrue [j] i j=1 Li . l (Lpred [j] + Ltrue [j]) i i j=1 2 (14) Dataset Ar Bu Co Ed En He Rc Rf Si SS SC Avg MD 0.55 0.80 0.62 0.56 0.64 0.74 0.63 0.67 0.61 0.73 0.60 0.65 Labeling F-measure MC PM SV BO 0.44 0.50 0.46 0.38 0.81 0.75 0.76 0.75 0.59 0.61 0.55 0.47 0.43 0.51 0.48 0.37 0.52 0.61 0.54 0.49 0.74 0.66 0.67 0.60 0.46 0.55 0.49 0.44 0.58 0.63 0.56 0.50 0.54 0.52 0.47 0.39 0.71 0.66 0.64 0.59 0.55 0.54 0.49 0.44 0.58 0.59 0.56 0.49 Exact Match Ratio MC PM SV 0.32 0.21 0.29 0.62 0.48 0.57 0.46 0.35 0.41 0.34 0.19 0.30 0.44 0.31 0.42 0.53 0.34 0.47 0.38 0.25 0.37 0.51 0.39 0.49 0.43 0.22 0.36 0.60 0.45 0.55 0.40 0.21 0.32 0.46 0.31 0.41 MD 0.44 0.63 0.51 0.45 0.55 0.58 0.54 0.60 0.52 0.65 0.44 0.54 BO 0.22 0.53 0.34 0.23 0.36 0.39 0.33 0.41 0.28 0.49 0.27 0.35 MD 0.30 0.25 0.27 0.25 0.37 0.35 0.47 0.29 0.37 0.36 0.29 0.32 Retrieval F-measure MC PM SV BO 0.26 0.24 0.29 0.22 0.27 0.20 0.29 0.20 0.25 0.19 0.30 0.17 0.23 0.21 0.25 0.16 0.33 0.30 0.35 0.29 0.35 0.23 0.35 0.26 0.39 0.36 0.40 0.33 0.25 0.24 0.25 0.16 0.35 0.28 0.31 0.19 0.35 0.18 0.31 0.15 0.28 0.25 0.26 0.20 0.30 0.24 0.31 0.21 Table 4: The performance comparison by labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). The bold figures are the best ones among the five methods, and the underlined figures the second best ones. MD, MC, PM, SV, and BO represent MML with SD , MML with SC , PMM, SVM and BoosTexter, respectively. Exact Match Ratio Exact match ratio EX counts only exact matches between the predicted label and the true label. EX = 1 n n I[Lpred = Ltrue ], i i (15) i=1 where I[S] is 1 if the statement S is true and 0 otherwise. Retrieval F-measure6 For real tasks, it is also important to evaluate retrieval performance, i.e. how accurately classifiers can find relevant texts for a given topic. Retrieval F-measure FR measures the average retrieval performance over all topics. FR = 5.3 1 l l j=1 n pred [j]Ltrue [j] i i=1 Li . n (Lpred [j] + Ltrue [j]) i i i=1 2 (16) Results First we trained the classifiers with randomly chosen 2,000 samples. We then calculated the three evaluation measures for 3,000 other randomly chosen samples. This process was repeated five times, and the resulting averaged values are shown in Table 4. Table 4 shows that the MMLs with Dice measure outperform other methods in labeling F-measure and exact match ratio. The MMLs also show the best performance with regard to retrieval Fmeasure although the margins to the other methods are not as large as observed in labeling F-measure and exact match ratio. Note that no classifier except MML with Dice measure achieves good labeling on all the three measures. For example, PMM shows high labeling F-measures, but its performance is rather poor when evaluated in retrieval F-measure. As the second experiment, we evaluated the classifiers trained with 250–2000 training samples on the same test samples. Figure 2 shows each measure averaged over all datasets. It is observed that the MMLs show high generalization even when training data is small. An interesting point is that MML with cosine measure achieves rather high labeling F-measures and retrieval F-measure with training data of smaller size. Such high-performace, however, does not continue when trained on larger data. 6 FR is called “the macro average of F-measures” in the text categorization community. Figure 2: The learning curve of labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). MD, MC, PM, SV, BO mean the same as in Table 4. 6 Conclusion In this paper, we proposed a novel learning algorithm for multi-topic text categorization. The algorithm, Maximal Margin Labeling, embeds labels (sets of topics) into a similarityinduced vector space, and learns a large margin classifier in the space. To overcome the demanding computational cost of MML, we provide an approximation method in learning and efficient classification algorithms. In experiments on a collection of Web pages, MML outperformed other methods including SVM and showed better generalization. Acknowledgement The authors would like to thank Naonori Ueda, Kazumi Saito and Yuji Kaneda of Nippon Telegraph and Telephone Corporation for providing PMM’s codes and the datasets. References [1] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire N´ dellec and C´ line Rouveirol, editors, Proc. of the e e 10th European Conference on Machine Learning, number 1398, pages 137–142, 1998. [2] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000. [3] Naonori Ueda and Kazumi Saito. Parametoric mixture models for multi-topic text. In Advances in Neural Information Processing Systems 15, pages 1261–1268, 2003. [4] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. [5] Klaus-Robert M¨ ller, Sebastian Mika, Gunnar R¨ tsch, Koji Tsuda, and Bernhard u a Sch¨ lkopf. An introduction to kernel-based learning algorithms. IEEE Transactions o on Neural Networks, 12(2):181–201, 2001. [6] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998. [7] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wealy, 1999.
2 0.096394725 115 nips-2004-Maximum Margin Clustering
Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans
Abstract: We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1
3 0.088233747 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester
Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1
4 0.083744831 145 nips-2004-Parametric Embedding for Class Visualization
Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1
5 0.078611553 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
6 0.078055598 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
7 0.075646907 165 nips-2004-Semi-supervised Learning on Directed Graphs
8 0.073181018 100 nips-2004-Learning Preferences for Multiclass Problems
9 0.072661757 23 nips-2004-Analysis of a greedy active learning strategy
10 0.070653565 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
11 0.06947846 136 nips-2004-On Semi-Supervised Classification
12 0.068099722 44 nips-2004-Conditional Random Fields for Object Recognition
13 0.066838026 54 nips-2004-Distributed Information Regularization on Graphs
14 0.064300314 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
15 0.063030072 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
16 0.061990108 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
17 0.061338287 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
18 0.061034806 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
19 0.058485135 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
20 0.058239616 19 nips-2004-An Application of Boosting to Graph Classification
topicId topicWeight
[(0, -0.173), (1, 0.078), (2, -0.031), (3, 0.059), (4, 0.053), (5, 0.085), (6, 0.037), (7, 0.041), (8, 0.031), (9, 0.023), (10, -0.03), (11, 0.115), (12, -0.127), (13, 0.045), (14, -0.031), (15, 0.034), (16, 0.046), (17, -0.004), (18, 0.025), (19, 0.003), (20, 0.006), (21, 0.017), (22, 0.049), (23, 0.042), (24, -0.085), (25, -0.042), (26, 0.004), (27, 0.016), (28, -0.032), (29, -0.135), (30, 0.043), (31, -0.046), (32, 0.059), (33, -0.066), (34, -0.019), (35, -0.074), (36, 0.035), (37, 0.013), (38, -0.0), (39, 0.022), (40, 0.037), (41, 0.063), (42, 0.039), (43, 0.007), (44, 0.016), (45, -0.023), (46, -0.194), (47, 0.232), (48, -0.046), (49, -0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.92690587 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
Author: Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, Eisaku Maeda
Abstract: In this paper, we address the problem of statistical learning for multitopic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class categorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines. 1 Multi-topic Text Categorization (MTC) This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We thus call a set of topics label, and say that a text is assigned a label, not a topic. In almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by judging each topic’s relevance to the text. In this decomposition approach, the features specific to a topic, not a label, are regarded as important features. However, the approach may result in inefficient learning as we will explain in the following example. Imagine an MTC problem of scientific papers where quantum computing papers are assigned multi-topic label “quantum physics (QP) & computer science (CS)”. (QP and CS are topics in this example.) Since there are some words specific to quantum computing such as “qbit1 ”, one can say that efficient MTC learners should use such words to assign label QP & CS. However, the decomposition approach is likely to ignore these words since they are only specific to a small portion of the whole QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or CS. 1 Qbit is a unit of quantum information, and frequently appears in real quantum computing literatures, but rarely seen in other literatures. Symbol x(∈ Rd ) t1 , t2 , . . . , tl T L, λ(⊂ T ) L[j] Λ(= 2T ) {(xi , Li )}m i=1 Meaning A document vector Topics The set of all topics A label The binary representation of L. 1 if tj ∈ L and 0 otherwise. The set of all possible labels Training samples Table 1: Notation Parametric Mixture Model (PMM) [3] adopts another approach to MTC. It is assumed in PMM that multi-topic texts are generated from a mixture of topic-specific word distributions. Its decision on labeling is done at once, not separately for each topic. However, PMM also has a problem with multi-topic specific features such as “qbit” since it is impossible for texts to have such features given PMM’s mixture process. These problems with multi-topic specific features are caused by dependency assumptions between labels, which are explicitly or implicitly made in existing methods. To solve these problems, we propose Maximal Margin Labeling, which treats labels as independent classes and learns a multi-class classifier on the induced multi-class problem. In this paper, we first discuss why multi-class classifiers cannot be directly applied to MTC in Section 2. We then propose MML in Section 3, and address implementation issues in Section 4. In Section 5, MML is experimentally compared with existing methods using a collection of multi-topic Web pages. We summarize this paper in Section 6. 2 Solving MTC as a Multi-Class Categorization To discuss why existing multi-class classifiers do not work in MTC, we start from the multi-class classifier proposed in [4]. Hereafter we use the notation given in Table 1. The multi-class classifier in [4] categorizes an object into the class whose prototype vector is the closest to the object’s feature vector. By substituting label for class, the classifier can be written as follows. f (x) = arg max x, mλ (1) X λ∈Λ where , X is the the inner product of Rd , and mλ ∈ Rd is the prototype vector of label λ. Following the similar argument as in [4], the prototype vectors are learned by solving the following maximal margin problem2 . min M s.t. 1 M 2 2 λ ξi +C xi , mLi 1≤i≤m λ∈Λ,λ=Li X − xi , m λ X λ ≥ 1 − ξi for 1 ≤ i ≤ m, ∀λ = Li , (2) where M is the prototype matrix whose columns are the prototype vectors, and M is the Frobenius matrix norm of M . Note that Eq. (1) and Eq. (2) cover only training samples’ labels, but also all possible labels. This is because the labels unseen in training samples may be relevant to test samples. In 2 In Eq.(2), we penalize all violation of the margin constraints. On the other hand, Crammer and Singer penalize only the largest violation of the margin constraint for each training sample [4]. We chose the “penalize-all” approach since it leads to an optimization problem without equality constraints (see Eq.(7)), which is much easier to solve than the one in [4]. usual multi-class problems, such unseen labels seldom exist. In MTC, however, the number of labels is generally very large (e.g. one of our datasets has 1,054 labels (Table 2)), and unseen labels often exist. Thus it is necessary to consider all possible labels in Eq. (1) and Eq. (2) since it is impossible to know which unseen labels are present in the test samples. There are two problems with Eq. (1) and Eq. (2). The first problem is that they involve the prototype vectors of seldom or never seen labels. Without the help of prior knowledge about where the prototype vectors should be, it is impossible to obtain appropriate prototype vectors for such labels. The second problem is that these equations are computationally too demanding since they involve combinatorial maximization and summation over all possible labels, whose number can be quite large. (For example, the number is around 230 in the datasets used in our experiments.) We will address the first problem in Section 3 and the second problem in Section 4. 3 Maximal Margin Labeling In this section, we incorporate some prior knowledge about the location of prototype vectors into Eq. (1) and Eq. (2), and propose a novel MTC learning algorithm, Maximal Margin Labeling (MML). As prior knowledge, we simply assume that the prototype vectors of similar labels should be placed close to each other. Based on this assumption, we first rewrite Eq. (1) to yield f (x) = arg max M T x, eλ L, (3) λ∈Λ where , L is the inner product of R|Λ| and {eλ }λ∈Λ is the orthonormal basis of R|Λ| . The classifier of Eq. (3) can be interpreted as a two-step process: the first step is to map the vector x into R|Λ| by M T , and the second step is to find the closest eλ to image M T x. Then we replace {eλ }λ∈Λ with (generally) non-orthogonal vectors {φ(λ)}λ∈Λ whose geometrical configuration reflects label similarity. More formally speaking, we use vectors {φ(λ)}λ∈Λ that satisfy the condition φ(λ1 ), φ(λ2 ) S = S(λ1 , λ2 ) for ∀λ1 , λ2 ∈ Λ, (4) where , S is an inner product of the vector space spanned by {φ(λ)}λ∈Λ , and S is a Mercer kernel [5] on Λ × Λ and is a similarity measure between labels. We call the vector space spanned by {φ(λ)} VS . With this replacement, MML’s classifier is written as follows. f (x) = arg max W x, φ(λ) S , (5) λ∈Λ where W is a linear map from Rd to VS . W is the solution of the following problem. min W 1 W 2 m 2 λ ξi +C i=1 λ∈Λ,λ=Li φ(Li )−φ(λ) λ λ s.t. W xi , ≥ 1−ξi , ξi ≥ 0 for 1 ≤ i ≤ m, ∀λ = Li . (6) φ(Li )−φ(λ) Note that if φ(λ) is replaced by eλ , Eq. (6) becomes identical to Eq. (2) except for a scale factor. Thus Eq. (5) and Eq. (6) are natural extensions of the multi-class classifier in [4]. We call the MTC classifier of Eq. (5) and Eq. (6) “Maximal Margin Labeling (MML)”. Figure 1 explains the margin (the inner product in Eq. (6)) in MML. The margin represents the distance from the image of the training sample xi to the boundary between the correct label Li and wrong label λ. MML optimizes the linear map W so that the smallest margin between all training samples and all possible labels becomes maximal, along with a penalty C for the case that samples penetrate into the margin. Figure 1: Maximal Margin Labeling Dual Form For numerical computation, the following Wolfe dual form of Eq. (6) is more convenient. (We omit its derivation due to space limits.) S(Li , Li′ )−S(Li , λ′ )−S(λ, Li′ )+S(λ, λ′ ) 1 λ λ′ λ αi αi′ (xi ·xi′ ) max αi − 2 αλ 2 (1−S(Li , λ))(1−S(Li′ , λ′ )) i ′ ′ i,λ i,λ i ,λ λ s.t. 0 ≤ αi ≤ C for 1 ≤ i ≤ m, ∀λ = Li , (7) m λ where we denote i=1 λ∈Λ,∀λ=Li by i,λ , and αi are the dual variables corresponding to the first inequality constraints in Eq. (6). Note that Eq. (7) does not contain φ(λ): all the computations involving φ can be done through the label similarity S. Additionally xi only appears in the inner products, and therefore can be replaced by any kernel of x. λ Using the solution αi of Eq. (7), the MML’s classifier in Eq. (5) can be written as follows. S(Li , L)−S(λ, L) λ . (8) f (x) = arg max αi (x·xi ) 2(1−S(Li , λ)) L∈Λ i,λ Label Similarity3 As examples of label similarity, we use two similarity measures: Dice measure and cosine measure. Dice measure4 4.1 SC (λ1 , λ2 ) = |λ1 ∩ λ2 | 2 |λ1 | |λ2 | = l j=1 l j=1 2|λ1 ∩λ2 | = |λ1 |+|λ2 | Cosine measure 4 SD (λ1 , λ2 ) = λ1 [j] + λ1 [j]λ2 [j] l j=1 l j=1 l j=1 λ2 [j] . λ1 [j]λ2 [j] λ1 [j] l j=1 (9) . 10) ( λ2 [j] Efficient Implementation Approximation in Learning Eq. (7) contains the sum over all possible labels. As the number of topics (l) increases, this summation rapidly becomes intractable since |Λ| grows exponentially as 2l . To circumvent 3 The following discussion is easily extended to include the case that both λ1 and λ2 are empty although we do not discuss the case due to space limits. this problem, we approximate the sum over all possible labels in Eq. (7) by the partial sum λ λ over αi of |(A ∩ B c ) ∪ (Ac ∩ B)| = 1 and set all the other αi to zero. This approximation reduces the burden of the summation quite a lot: the number of summands is reduced from 2l to l, which is a huge reduction especially when many topics exist. λ To understand the rationale behind the approximation, first note that αi is the dual variable λ corresponding to the first inequality constraint (the margin constraint) in Eq. (7). Thus αi is non-zero if and only if W xi falls in the margin between φ(Li ) and φ(λ). We assume that this margin violation mainly occurs when φ(λ) is “close” to φ(Li ), i.e. |(A ∩ B c ) ∪ (Ac ∩ B)| = 1. If this assumption holds well, the proposed approximation of the sum will lead to a good approximation of the exact solution. 4.2 Polynomial Time Algorithms for Classification The classification of MML (Eq. (8)) involves the combinatorial maximization over all possible labels, so it can be a computationally demanding process. However, efficient classification algorithms are available when either the cosine measure or dice measure is used as label similarity. Eq. (8) can be divided into the subproblems by the number of topics in a label. f (x) = arg max g(x, L), (11) ˆ ˆ ˆ L∈{L1 ,L2 ,...,Ll } ˆ Ln = arg max g(x, L). (12) L∈Λ,|L|=n where g(x) is l g(x, L) cn [j]L[j], = j=1 cn [j] = i,λ √ i,λ √ αλ (x·xi ) i 2(1−SD (Li ,λ)) αλ (x·xi ) i 2(1−SC (Li ,λ)) . 2Li [j] |Li |+n − 2λ[j] |λ|+n √Li [j] − √λ[j] √ √ |Li | n |λ| n if SD is used. (13) if SC is used. Here n = |L|. The computational cost of Eq. (13) for all j is O(nα l) (nα is the number of non-zero α), and that of Eq. (12) is O(l log l). Thus the total cost of the classification by Eq. (11) is O(nα l2 + l2 log l). On the other hand, nα is O(ml) under the approximation described above. Therefore, the classification can be done within O(ml3 ) computational steps, which is a significant reduction from the case that the brute force search is used in Eq. (8). 5 Experiments In this section, we report experiments that compared MML to PMM [3], SVM5 [6], and BoosTexter [2] using a collection of Web pages. We used a normalized linear kernel k(x, x′ ) = x · x′ / x x′ in MML and SVM. As for BoosTexter, “real abstaining AdaBoost.MH” was used as the weak learner. 5.1 Experimental Setup The datasets used in our experiment represent the Web page collection used in [3] (Table 2). The Web pages were collected through the hyperlinks from Yahoo!’s top directory 5 For each topic, an SVM classifier is trained to predict whether the topic is relevant (positive) or irrelevant (negative) to input doucments. Dataset Name (Abbrev.) Arts & Humanities (Ar) Business & Economy (Bu) Computers & Internet (Co) Education (Ed) Entertainment (En) Health (He) Recreation (Rc) Reference (Rf) Science (Si) Social Science (SS) Society & Culture (SC) #Text #Voc 7,484 11,214 12,444 12,030 12,730 9,205 12,828 8,027 6,428 12,111 14,512 #Tpc #Lbl 23,146 21,924 34,096 27,534 32,001 30,605 30,324 39,679 37,187 52,350 31,802 26 30 33 33 21 32 22 33 40 39 27 Label Size Frequency (%) 1 2 3 4 ≥5 599 55.6 30.5 9.7 2.8 1.4 233 57.6 28.8 11.0 1.7 0.8 428 69.8 18.2 7.8 3.0 1.1 511 66.9 23.4 7.3 1.9 0.6 337 72.3 21.1 4.5 1.0 1.1 335 53.2 34.0 9.5 2.4 0.9 530 69.2 23.1 5.6 1.4 0.6 275 85.5 12.6 1.5 0.3 0.1 457 68.0 22.3 7.3 1.9 0.5 361 78.4 17.0 3.7 0.7 0.3 1054 59.6 26.1 9.2 2.9 2.2 Table 2: A summary of the web page datasets. “#Text” is the number of texts in the dataset, “#Voc” the number of vocabularies (i.e. features), “#Tpc” the number of topics, “#Lbl” the number of labels, and “Label Size Frequency” is the relative frequency of each label size. (Label size is the number of topics in a label.) Method MML PMM SVM Boost Feature Type TF, TF×IDF TF TF, TF×IDF Binary Parameter C = 0.1, 1, 10 Model1, Model2 C = 0.1, 1, 10 R = {2, 4, 6, 8, 10}×103 Table 3: Candidate feature types and learning parameters. (R is the number of weak hypotheses.) The underlined fetures and parameters were selected for the evaluation with the test data. (www.yahoo.com), and then divided into 11 datasets by Yahoo’s top category. Each page is labeled with the Yahoo’s second level sub-categories from which the page is hyperlinked. (Thus, the sub-categories are topics in our term.) See [3] for more details about the collection. Then the Web pages were converted into three types of feature vectors: (a) Binary vectors, where each feature indicates the presence (absence) of a term by 1 (0); (b) TF vectors, where each feature is the number of appearances of a term (term frequency); and (c) TF×IDF vectors, where each feature is the product of term frequency and inverse document frequency [7]. To select the best combinations of feature types and learning parameters such as the penalty C for MML, the learners were trained on 2,000 Web pages with all combinations of feature and parameter listed in Table 3, and then were evaluated by labeling F-measure on independently drawn development data. The combinations which achieve the best labeling F-measures (underlined in Table 3) were used in the following experiments. 5.2 Evaluation Measures We used three measures to evaluate labeling performance: labeling F-measure, exact match ratio, and retrieval F-measure. In the following definitions, {Lpred }n and {Ltrue }n i=1 i i=1 i mean the predicted labels and the true labels, respectively. Labeling F-measure Labeling F-measure FL evaluates the average labeling performance while taking partial match into account. FL = 1 n n i=1 2|Lpred ∩ Ltrue | i i |Lpred | + |Ltrue | i i = 1 n n i=1 l pred [j]Ltrue [j] i j=1 Li . l (Lpred [j] + Ltrue [j]) i i j=1 2 (14) Dataset Ar Bu Co Ed En He Rc Rf Si SS SC Avg MD 0.55 0.80 0.62 0.56 0.64 0.74 0.63 0.67 0.61 0.73 0.60 0.65 Labeling F-measure MC PM SV BO 0.44 0.50 0.46 0.38 0.81 0.75 0.76 0.75 0.59 0.61 0.55 0.47 0.43 0.51 0.48 0.37 0.52 0.61 0.54 0.49 0.74 0.66 0.67 0.60 0.46 0.55 0.49 0.44 0.58 0.63 0.56 0.50 0.54 0.52 0.47 0.39 0.71 0.66 0.64 0.59 0.55 0.54 0.49 0.44 0.58 0.59 0.56 0.49 Exact Match Ratio MC PM SV 0.32 0.21 0.29 0.62 0.48 0.57 0.46 0.35 0.41 0.34 0.19 0.30 0.44 0.31 0.42 0.53 0.34 0.47 0.38 0.25 0.37 0.51 0.39 0.49 0.43 0.22 0.36 0.60 0.45 0.55 0.40 0.21 0.32 0.46 0.31 0.41 MD 0.44 0.63 0.51 0.45 0.55 0.58 0.54 0.60 0.52 0.65 0.44 0.54 BO 0.22 0.53 0.34 0.23 0.36 0.39 0.33 0.41 0.28 0.49 0.27 0.35 MD 0.30 0.25 0.27 0.25 0.37 0.35 0.47 0.29 0.37 0.36 0.29 0.32 Retrieval F-measure MC PM SV BO 0.26 0.24 0.29 0.22 0.27 0.20 0.29 0.20 0.25 0.19 0.30 0.17 0.23 0.21 0.25 0.16 0.33 0.30 0.35 0.29 0.35 0.23 0.35 0.26 0.39 0.36 0.40 0.33 0.25 0.24 0.25 0.16 0.35 0.28 0.31 0.19 0.35 0.18 0.31 0.15 0.28 0.25 0.26 0.20 0.30 0.24 0.31 0.21 Table 4: The performance comparison by labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). The bold figures are the best ones among the five methods, and the underlined figures the second best ones. MD, MC, PM, SV, and BO represent MML with SD , MML with SC , PMM, SVM and BoosTexter, respectively. Exact Match Ratio Exact match ratio EX counts only exact matches between the predicted label and the true label. EX = 1 n n I[Lpred = Ltrue ], i i (15) i=1 where I[S] is 1 if the statement S is true and 0 otherwise. Retrieval F-measure6 For real tasks, it is also important to evaluate retrieval performance, i.e. how accurately classifiers can find relevant texts for a given topic. Retrieval F-measure FR measures the average retrieval performance over all topics. FR = 5.3 1 l l j=1 n pred [j]Ltrue [j] i i=1 Li . n (Lpred [j] + Ltrue [j]) i i i=1 2 (16) Results First we trained the classifiers with randomly chosen 2,000 samples. We then calculated the three evaluation measures for 3,000 other randomly chosen samples. This process was repeated five times, and the resulting averaged values are shown in Table 4. Table 4 shows that the MMLs with Dice measure outperform other methods in labeling F-measure and exact match ratio. The MMLs also show the best performance with regard to retrieval Fmeasure although the margins to the other methods are not as large as observed in labeling F-measure and exact match ratio. Note that no classifier except MML with Dice measure achieves good labeling on all the three measures. For example, PMM shows high labeling F-measures, but its performance is rather poor when evaluated in retrieval F-measure. As the second experiment, we evaluated the classifiers trained with 250–2000 training samples on the same test samples. Figure 2 shows each measure averaged over all datasets. It is observed that the MMLs show high generalization even when training data is small. An interesting point is that MML with cosine measure achieves rather high labeling F-measures and retrieval F-measure with training data of smaller size. Such high-performace, however, does not continue when trained on larger data. 6 FR is called “the macro average of F-measures” in the text categorization community. Figure 2: The learning curve of labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). MD, MC, PM, SV, BO mean the same as in Table 4. 6 Conclusion In this paper, we proposed a novel learning algorithm for multi-topic text categorization. The algorithm, Maximal Margin Labeling, embeds labels (sets of topics) into a similarityinduced vector space, and learns a large margin classifier in the space. To overcome the demanding computational cost of MML, we provide an approximation method in learning and efficient classification algorithms. In experiments on a collection of Web pages, MML outperformed other methods including SVM and showed better generalization. Acknowledgement The authors would like to thank Naonori Ueda, Kazumi Saito and Yuji Kaneda of Nippon Telegraph and Telephone Corporation for providing PMM’s codes and the datasets. References [1] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire N´ dellec and C´ line Rouveirol, editors, Proc. of the e e 10th European Conference on Machine Learning, number 1398, pages 137–142, 1998. [2] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000. [3] Naonori Ueda and Kazumi Saito. Parametoric mixture models for multi-topic text. In Advances in Neural Information Processing Systems 15, pages 1261–1268, 2003. [4] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. [5] Klaus-Robert M¨ ller, Sebastian Mika, Gunnar R¨ tsch, Koji Tsuda, and Bernhard u a Sch¨ lkopf. An introduction to kernel-based learning algorithms. IEEE Transactions o on Neural Networks, 12(2):181–201, 2001. [6] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998. [7] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wealy, 1999.
2 0.64517063 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
Author: Johannes Mohr, Klaus Obermayer
Abstract: The standard approach to the classification of objects is to consider the examples as independent and identically distributed (iid). In many real world settings, however, this assumption is not valid, because a topographical relationship exists between the objects. In this contribution we consider the special case of image segmentation, where the objects are pixels and where the underlying topography is a 2D regular rectangular grid. We introduce a classification method which not only uses measured vectorial feature information but also the label configuration within a topographic neighborhood. Due to the resulting dependence between the labels of neighboring pixels, a collective classification of a set of pixels becomes necessary. We propose a new method called ’Topographic Support Vector Machine’ (TSVM), which is based on a topographic kernel and a self-consistent solution to the label assignment shown to be equivalent to a recurrent neural network. The performance of the algorithm is compared to a conventional SVM on a cell image segmentation task. 1
3 0.5150159 100 nips-2004-Learning Preferences for Multiclass Problems
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Many interesting multiclass problems can be cast in the general framework of label ranking defined on a given set of classes. The evaluation for such a ranking is generally given in terms of the number of violated order constraints between classes. In this paper, we propose the Preference Learning Model as a unifying framework to model and solve a large class of multiclass problems in a large margin perspective. In addition, an original kernel-based method is proposed and evaluated on a ranking dataset with state-of-the-art results. 1
4 0.49913171 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery
Author: Rion Snow, Daniel Jurafsky, Andrew Y. Ng
Abstract: Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using “dependency path” features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs. On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet. 1
5 0.45958489 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
Author: Felix A. Wichmann, Arnulf B. Graf, Heinrich H. Bülthoff, Eero P. Simoncelli, Bernhard Schölkopf
Abstract: We study gender discrimination of human faces using a combination of psychophysical classification and discrimination experiments together with methods from machine learning. We reduce the dimensionality of a set of face images using principal component analysis, and then train a set of linear classifiers on this reduced representation (linear support vector machines (SVMs), relevance vector machines (RVMs), Fisher linear discriminant (FLD), and prototype (prot) classifiers) using human classification data. Because we combine a linear preprocessor with linear classifiers, the entire system acts as a linear classifier, allowing us to visualise the decision-image corresponding to the normal vector of the separating hyperplanes (SH) of each classifier. We predict that the female-tomaleness transition along the normal vector for classifiers closely mimicking human classification (SVM and RVM [1]) should be faster than the transition along any other direction. A psychophysical discrimination experiment using the decision images as stimuli is consistent with this prediction. 1
6 0.45440054 54 nips-2004-Distributed Information Regularization on Graphs
7 0.44950476 109 nips-2004-Mass Meta-analysis in Talairach Space
8 0.43947208 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction
9 0.42762506 115 nips-2004-Maximum Margin Clustering
10 0.40890044 145 nips-2004-Parametric Embedding for Class Visualization
11 0.4066354 165 nips-2004-Semi-supervised Learning on Directed Graphs
12 0.40281835 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference
13 0.39169902 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
14 0.3831926 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
15 0.37943539 23 nips-2004-Analysis of a greedy active learning strategy
16 0.37855551 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
17 0.37575373 44 nips-2004-Conditional Random Fields for Object Recognition
18 0.37512147 34 nips-2004-Breaking SVM Complexity with Cross-Training
19 0.36366931 164 nips-2004-Semi-supervised Learning by Entropy Minimization
20 0.35988662 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
topicId topicWeight
[(13, 0.087), (15, 0.128), (21, 0.348), (26, 0.064), (31, 0.038), (33, 0.134), (35, 0.019), (39, 0.015), (50, 0.044), (76, 0.011), (87, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.7696169 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
Author: Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, Eisaku Maeda
Abstract: In this paper, we address the problem of statistical learning for multitopic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class categorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines. 1 Multi-topic Text Categorization (MTC) This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We thus call a set of topics label, and say that a text is assigned a label, not a topic. In almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by judging each topic’s relevance to the text. In this decomposition approach, the features specific to a topic, not a label, are regarded as important features. However, the approach may result in inefficient learning as we will explain in the following example. Imagine an MTC problem of scientific papers where quantum computing papers are assigned multi-topic label “quantum physics (QP) & computer science (CS)”. (QP and CS are topics in this example.) Since there are some words specific to quantum computing such as “qbit1 ”, one can say that efficient MTC learners should use such words to assign label QP & CS. However, the decomposition approach is likely to ignore these words since they are only specific to a small portion of the whole QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or CS. 1 Qbit is a unit of quantum information, and frequently appears in real quantum computing literatures, but rarely seen in other literatures. Symbol x(∈ Rd ) t1 , t2 , . . . , tl T L, λ(⊂ T ) L[j] Λ(= 2T ) {(xi , Li )}m i=1 Meaning A document vector Topics The set of all topics A label The binary representation of L. 1 if tj ∈ L and 0 otherwise. The set of all possible labels Training samples Table 1: Notation Parametric Mixture Model (PMM) [3] adopts another approach to MTC. It is assumed in PMM that multi-topic texts are generated from a mixture of topic-specific word distributions. Its decision on labeling is done at once, not separately for each topic. However, PMM also has a problem with multi-topic specific features such as “qbit” since it is impossible for texts to have such features given PMM’s mixture process. These problems with multi-topic specific features are caused by dependency assumptions between labels, which are explicitly or implicitly made in existing methods. To solve these problems, we propose Maximal Margin Labeling, which treats labels as independent classes and learns a multi-class classifier on the induced multi-class problem. In this paper, we first discuss why multi-class classifiers cannot be directly applied to MTC in Section 2. We then propose MML in Section 3, and address implementation issues in Section 4. In Section 5, MML is experimentally compared with existing methods using a collection of multi-topic Web pages. We summarize this paper in Section 6. 2 Solving MTC as a Multi-Class Categorization To discuss why existing multi-class classifiers do not work in MTC, we start from the multi-class classifier proposed in [4]. Hereafter we use the notation given in Table 1. The multi-class classifier in [4] categorizes an object into the class whose prototype vector is the closest to the object’s feature vector. By substituting label for class, the classifier can be written as follows. f (x) = arg max x, mλ (1) X λ∈Λ where , X is the the inner product of Rd , and mλ ∈ Rd is the prototype vector of label λ. Following the similar argument as in [4], the prototype vectors are learned by solving the following maximal margin problem2 . min M s.t. 1 M 2 2 λ ξi +C xi , mLi 1≤i≤m λ∈Λ,λ=Li X − xi , m λ X λ ≥ 1 − ξi for 1 ≤ i ≤ m, ∀λ = Li , (2) where M is the prototype matrix whose columns are the prototype vectors, and M is the Frobenius matrix norm of M . Note that Eq. (1) and Eq. (2) cover only training samples’ labels, but also all possible labels. This is because the labels unseen in training samples may be relevant to test samples. In 2 In Eq.(2), we penalize all violation of the margin constraints. On the other hand, Crammer and Singer penalize only the largest violation of the margin constraint for each training sample [4]. We chose the “penalize-all” approach since it leads to an optimization problem without equality constraints (see Eq.(7)), which is much easier to solve than the one in [4]. usual multi-class problems, such unseen labels seldom exist. In MTC, however, the number of labels is generally very large (e.g. one of our datasets has 1,054 labels (Table 2)), and unseen labels often exist. Thus it is necessary to consider all possible labels in Eq. (1) and Eq. (2) since it is impossible to know which unseen labels are present in the test samples. There are two problems with Eq. (1) and Eq. (2). The first problem is that they involve the prototype vectors of seldom or never seen labels. Without the help of prior knowledge about where the prototype vectors should be, it is impossible to obtain appropriate prototype vectors for such labels. The second problem is that these equations are computationally too demanding since they involve combinatorial maximization and summation over all possible labels, whose number can be quite large. (For example, the number is around 230 in the datasets used in our experiments.) We will address the first problem in Section 3 and the second problem in Section 4. 3 Maximal Margin Labeling In this section, we incorporate some prior knowledge about the location of prototype vectors into Eq. (1) and Eq. (2), and propose a novel MTC learning algorithm, Maximal Margin Labeling (MML). As prior knowledge, we simply assume that the prototype vectors of similar labels should be placed close to each other. Based on this assumption, we first rewrite Eq. (1) to yield f (x) = arg max M T x, eλ L, (3) λ∈Λ where , L is the inner product of R|Λ| and {eλ }λ∈Λ is the orthonormal basis of R|Λ| . The classifier of Eq. (3) can be interpreted as a two-step process: the first step is to map the vector x into R|Λ| by M T , and the second step is to find the closest eλ to image M T x. Then we replace {eλ }λ∈Λ with (generally) non-orthogonal vectors {φ(λ)}λ∈Λ whose geometrical configuration reflects label similarity. More formally speaking, we use vectors {φ(λ)}λ∈Λ that satisfy the condition φ(λ1 ), φ(λ2 ) S = S(λ1 , λ2 ) for ∀λ1 , λ2 ∈ Λ, (4) where , S is an inner product of the vector space spanned by {φ(λ)}λ∈Λ , and S is a Mercer kernel [5] on Λ × Λ and is a similarity measure between labels. We call the vector space spanned by {φ(λ)} VS . With this replacement, MML’s classifier is written as follows. f (x) = arg max W x, φ(λ) S , (5) λ∈Λ where W is a linear map from Rd to VS . W is the solution of the following problem. min W 1 W 2 m 2 λ ξi +C i=1 λ∈Λ,λ=Li φ(Li )−φ(λ) λ λ s.t. W xi , ≥ 1−ξi , ξi ≥ 0 for 1 ≤ i ≤ m, ∀λ = Li . (6) φ(Li )−φ(λ) Note that if φ(λ) is replaced by eλ , Eq. (6) becomes identical to Eq. (2) except for a scale factor. Thus Eq. (5) and Eq. (6) are natural extensions of the multi-class classifier in [4]. We call the MTC classifier of Eq. (5) and Eq. (6) “Maximal Margin Labeling (MML)”. Figure 1 explains the margin (the inner product in Eq. (6)) in MML. The margin represents the distance from the image of the training sample xi to the boundary between the correct label Li and wrong label λ. MML optimizes the linear map W so that the smallest margin between all training samples and all possible labels becomes maximal, along with a penalty C for the case that samples penetrate into the margin. Figure 1: Maximal Margin Labeling Dual Form For numerical computation, the following Wolfe dual form of Eq. (6) is more convenient. (We omit its derivation due to space limits.) S(Li , Li′ )−S(Li , λ′ )−S(λ, Li′ )+S(λ, λ′ ) 1 λ λ′ λ αi αi′ (xi ·xi′ ) max αi − 2 αλ 2 (1−S(Li , λ))(1−S(Li′ , λ′ )) i ′ ′ i,λ i,λ i ,λ λ s.t. 0 ≤ αi ≤ C for 1 ≤ i ≤ m, ∀λ = Li , (7) m λ where we denote i=1 λ∈Λ,∀λ=Li by i,λ , and αi are the dual variables corresponding to the first inequality constraints in Eq. (6). Note that Eq. (7) does not contain φ(λ): all the computations involving φ can be done through the label similarity S. Additionally xi only appears in the inner products, and therefore can be replaced by any kernel of x. λ Using the solution αi of Eq. (7), the MML’s classifier in Eq. (5) can be written as follows. S(Li , L)−S(λ, L) λ . (8) f (x) = arg max αi (x·xi ) 2(1−S(Li , λ)) L∈Λ i,λ Label Similarity3 As examples of label similarity, we use two similarity measures: Dice measure and cosine measure. Dice measure4 4.1 SC (λ1 , λ2 ) = |λ1 ∩ λ2 | 2 |λ1 | |λ2 | = l j=1 l j=1 2|λ1 ∩λ2 | = |λ1 |+|λ2 | Cosine measure 4 SD (λ1 , λ2 ) = λ1 [j] + λ1 [j]λ2 [j] l j=1 l j=1 l j=1 λ2 [j] . λ1 [j]λ2 [j] λ1 [j] l j=1 (9) . 10) ( λ2 [j] Efficient Implementation Approximation in Learning Eq. (7) contains the sum over all possible labels. As the number of topics (l) increases, this summation rapidly becomes intractable since |Λ| grows exponentially as 2l . To circumvent 3 The following discussion is easily extended to include the case that both λ1 and λ2 are empty although we do not discuss the case due to space limits. this problem, we approximate the sum over all possible labels in Eq. (7) by the partial sum λ λ over αi of |(A ∩ B c ) ∪ (Ac ∩ B)| = 1 and set all the other αi to zero. This approximation reduces the burden of the summation quite a lot: the number of summands is reduced from 2l to l, which is a huge reduction especially when many topics exist. λ To understand the rationale behind the approximation, first note that αi is the dual variable λ corresponding to the first inequality constraint (the margin constraint) in Eq. (7). Thus αi is non-zero if and only if W xi falls in the margin between φ(Li ) and φ(λ). We assume that this margin violation mainly occurs when φ(λ) is “close” to φ(Li ), i.e. |(A ∩ B c ) ∪ (Ac ∩ B)| = 1. If this assumption holds well, the proposed approximation of the sum will lead to a good approximation of the exact solution. 4.2 Polynomial Time Algorithms for Classification The classification of MML (Eq. (8)) involves the combinatorial maximization over all possible labels, so it can be a computationally demanding process. However, efficient classification algorithms are available when either the cosine measure or dice measure is used as label similarity. Eq. (8) can be divided into the subproblems by the number of topics in a label. f (x) = arg max g(x, L), (11) ˆ ˆ ˆ L∈{L1 ,L2 ,...,Ll } ˆ Ln = arg max g(x, L). (12) L∈Λ,|L|=n where g(x) is l g(x, L) cn [j]L[j], = j=1 cn [j] = i,λ √ i,λ √ αλ (x·xi ) i 2(1−SD (Li ,λ)) αλ (x·xi ) i 2(1−SC (Li ,λ)) . 2Li [j] |Li |+n − 2λ[j] |λ|+n √Li [j] − √λ[j] √ √ |Li | n |λ| n if SD is used. (13) if SC is used. Here n = |L|. The computational cost of Eq. (13) for all j is O(nα l) (nα is the number of non-zero α), and that of Eq. (12) is O(l log l). Thus the total cost of the classification by Eq. (11) is O(nα l2 + l2 log l). On the other hand, nα is O(ml) under the approximation described above. Therefore, the classification can be done within O(ml3 ) computational steps, which is a significant reduction from the case that the brute force search is used in Eq. (8). 5 Experiments In this section, we report experiments that compared MML to PMM [3], SVM5 [6], and BoosTexter [2] using a collection of Web pages. We used a normalized linear kernel k(x, x′ ) = x · x′ / x x′ in MML and SVM. As for BoosTexter, “real abstaining AdaBoost.MH” was used as the weak learner. 5.1 Experimental Setup The datasets used in our experiment represent the Web page collection used in [3] (Table 2). The Web pages were collected through the hyperlinks from Yahoo!’s top directory 5 For each topic, an SVM classifier is trained to predict whether the topic is relevant (positive) or irrelevant (negative) to input doucments. Dataset Name (Abbrev.) Arts & Humanities (Ar) Business & Economy (Bu) Computers & Internet (Co) Education (Ed) Entertainment (En) Health (He) Recreation (Rc) Reference (Rf) Science (Si) Social Science (SS) Society & Culture (SC) #Text #Voc 7,484 11,214 12,444 12,030 12,730 9,205 12,828 8,027 6,428 12,111 14,512 #Tpc #Lbl 23,146 21,924 34,096 27,534 32,001 30,605 30,324 39,679 37,187 52,350 31,802 26 30 33 33 21 32 22 33 40 39 27 Label Size Frequency (%) 1 2 3 4 ≥5 599 55.6 30.5 9.7 2.8 1.4 233 57.6 28.8 11.0 1.7 0.8 428 69.8 18.2 7.8 3.0 1.1 511 66.9 23.4 7.3 1.9 0.6 337 72.3 21.1 4.5 1.0 1.1 335 53.2 34.0 9.5 2.4 0.9 530 69.2 23.1 5.6 1.4 0.6 275 85.5 12.6 1.5 0.3 0.1 457 68.0 22.3 7.3 1.9 0.5 361 78.4 17.0 3.7 0.7 0.3 1054 59.6 26.1 9.2 2.9 2.2 Table 2: A summary of the web page datasets. “#Text” is the number of texts in the dataset, “#Voc” the number of vocabularies (i.e. features), “#Tpc” the number of topics, “#Lbl” the number of labels, and “Label Size Frequency” is the relative frequency of each label size. (Label size is the number of topics in a label.) Method MML PMM SVM Boost Feature Type TF, TF×IDF TF TF, TF×IDF Binary Parameter C = 0.1, 1, 10 Model1, Model2 C = 0.1, 1, 10 R = {2, 4, 6, 8, 10}×103 Table 3: Candidate feature types and learning parameters. (R is the number of weak hypotheses.) The underlined fetures and parameters were selected for the evaluation with the test data. (www.yahoo.com), and then divided into 11 datasets by Yahoo’s top category. Each page is labeled with the Yahoo’s second level sub-categories from which the page is hyperlinked. (Thus, the sub-categories are topics in our term.) See [3] for more details about the collection. Then the Web pages were converted into three types of feature vectors: (a) Binary vectors, where each feature indicates the presence (absence) of a term by 1 (0); (b) TF vectors, where each feature is the number of appearances of a term (term frequency); and (c) TF×IDF vectors, where each feature is the product of term frequency and inverse document frequency [7]. To select the best combinations of feature types and learning parameters such as the penalty C for MML, the learners were trained on 2,000 Web pages with all combinations of feature and parameter listed in Table 3, and then were evaluated by labeling F-measure on independently drawn development data. The combinations which achieve the best labeling F-measures (underlined in Table 3) were used in the following experiments. 5.2 Evaluation Measures We used three measures to evaluate labeling performance: labeling F-measure, exact match ratio, and retrieval F-measure. In the following definitions, {Lpred }n and {Ltrue }n i=1 i i=1 i mean the predicted labels and the true labels, respectively. Labeling F-measure Labeling F-measure FL evaluates the average labeling performance while taking partial match into account. FL = 1 n n i=1 2|Lpred ∩ Ltrue | i i |Lpred | + |Ltrue | i i = 1 n n i=1 l pred [j]Ltrue [j] i j=1 Li . l (Lpred [j] + Ltrue [j]) i i j=1 2 (14) Dataset Ar Bu Co Ed En He Rc Rf Si SS SC Avg MD 0.55 0.80 0.62 0.56 0.64 0.74 0.63 0.67 0.61 0.73 0.60 0.65 Labeling F-measure MC PM SV BO 0.44 0.50 0.46 0.38 0.81 0.75 0.76 0.75 0.59 0.61 0.55 0.47 0.43 0.51 0.48 0.37 0.52 0.61 0.54 0.49 0.74 0.66 0.67 0.60 0.46 0.55 0.49 0.44 0.58 0.63 0.56 0.50 0.54 0.52 0.47 0.39 0.71 0.66 0.64 0.59 0.55 0.54 0.49 0.44 0.58 0.59 0.56 0.49 Exact Match Ratio MC PM SV 0.32 0.21 0.29 0.62 0.48 0.57 0.46 0.35 0.41 0.34 0.19 0.30 0.44 0.31 0.42 0.53 0.34 0.47 0.38 0.25 0.37 0.51 0.39 0.49 0.43 0.22 0.36 0.60 0.45 0.55 0.40 0.21 0.32 0.46 0.31 0.41 MD 0.44 0.63 0.51 0.45 0.55 0.58 0.54 0.60 0.52 0.65 0.44 0.54 BO 0.22 0.53 0.34 0.23 0.36 0.39 0.33 0.41 0.28 0.49 0.27 0.35 MD 0.30 0.25 0.27 0.25 0.37 0.35 0.47 0.29 0.37 0.36 0.29 0.32 Retrieval F-measure MC PM SV BO 0.26 0.24 0.29 0.22 0.27 0.20 0.29 0.20 0.25 0.19 0.30 0.17 0.23 0.21 0.25 0.16 0.33 0.30 0.35 0.29 0.35 0.23 0.35 0.26 0.39 0.36 0.40 0.33 0.25 0.24 0.25 0.16 0.35 0.28 0.31 0.19 0.35 0.18 0.31 0.15 0.28 0.25 0.26 0.20 0.30 0.24 0.31 0.21 Table 4: The performance comparison by labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). The bold figures are the best ones among the five methods, and the underlined figures the second best ones. MD, MC, PM, SV, and BO represent MML with SD , MML with SC , PMM, SVM and BoosTexter, respectively. Exact Match Ratio Exact match ratio EX counts only exact matches between the predicted label and the true label. EX = 1 n n I[Lpred = Ltrue ], i i (15) i=1 where I[S] is 1 if the statement S is true and 0 otherwise. Retrieval F-measure6 For real tasks, it is also important to evaluate retrieval performance, i.e. how accurately classifiers can find relevant texts for a given topic. Retrieval F-measure FR measures the average retrieval performance over all topics. FR = 5.3 1 l l j=1 n pred [j]Ltrue [j] i i=1 Li . n (Lpred [j] + Ltrue [j]) i i i=1 2 (16) Results First we trained the classifiers with randomly chosen 2,000 samples. We then calculated the three evaluation measures for 3,000 other randomly chosen samples. This process was repeated five times, and the resulting averaged values are shown in Table 4. Table 4 shows that the MMLs with Dice measure outperform other methods in labeling F-measure and exact match ratio. The MMLs also show the best performance with regard to retrieval Fmeasure although the margins to the other methods are not as large as observed in labeling F-measure and exact match ratio. Note that no classifier except MML with Dice measure achieves good labeling on all the three measures. For example, PMM shows high labeling F-measures, but its performance is rather poor when evaluated in retrieval F-measure. As the second experiment, we evaluated the classifiers trained with 250–2000 training samples on the same test samples. Figure 2 shows each measure averaged over all datasets. It is observed that the MMLs show high generalization even when training data is small. An interesting point is that MML with cosine measure achieves rather high labeling F-measures and retrieval F-measure with training data of smaller size. Such high-performace, however, does not continue when trained on larger data. 6 FR is called “the macro average of F-measures” in the text categorization community. Figure 2: The learning curve of labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). MD, MC, PM, SV, BO mean the same as in Table 4. 6 Conclusion In this paper, we proposed a novel learning algorithm for multi-topic text categorization. The algorithm, Maximal Margin Labeling, embeds labels (sets of topics) into a similarityinduced vector space, and learns a large margin classifier in the space. To overcome the demanding computational cost of MML, we provide an approximation method in learning and efficient classification algorithms. In experiments on a collection of Web pages, MML outperformed other methods including SVM and showed better generalization. Acknowledgement The authors would like to thank Naonori Ueda, Kazumi Saito and Yuji Kaneda of Nippon Telegraph and Telephone Corporation for providing PMM’s codes and the datasets. References [1] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire N´ dellec and C´ line Rouveirol, editors, Proc. of the e e 10th European Conference on Machine Learning, number 1398, pages 137–142, 1998. [2] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000. [3] Naonori Ueda and Kazumi Saito. Parametoric mixture models for multi-topic text. In Advances in Neural Information Processing Systems 15, pages 1261–1268, 2003. [4] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. [5] Klaus-Robert M¨ ller, Sebastian Mika, Gunnar R¨ tsch, Koji Tsuda, and Bernhard u a Sch¨ lkopf. An introduction to kernel-based learning algorithms. IEEE Transactions o on Neural Networks, 12(2):181–201, 2001. [6] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998. [7] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wealy, 1999.
2 0.66361433 99 nips-2004-Learning Hyper-Features for Visual Identification
Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik
Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1
3 0.53102183 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
4 0.52702981 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
5 0.52558768 69 nips-2004-Fast Rates to Bayes for Kernel Machines
Author: Ingo Steinwart, Clint Scovel
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1
6 0.52407157 192 nips-2004-The power of feature clustering: An application to object detection
7 0.52302355 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.52276152 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
9 0.52247381 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
10 0.52076489 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
11 0.52028936 28 nips-2004-Bayesian inference in spiking neurons
12 0.51927269 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
13 0.51920933 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
14 0.51889163 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
15 0.51872885 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
16 0.51866752 160 nips-2004-Seeing through water
17 0.51856649 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
18 0.51745176 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
19 0.51734954 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
20 0.51639646 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications