nips nips2004 nips2004-111 nips2004-111-reference knowledge-graph by maker-knowledge-mining

111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

Source: pdf

Author: Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, Eisaku Maeda

Abstract: In this paper, we address the problem of statistical learning for multitopic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classiﬁer on the induced multi-class categorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines. 1 Multi-topic Text Categorization (MTC) This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We thus call a set of topics label, and say that a text is assigned a label, not a topic. In almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by judging each topic’s relevance to the text. In this decomposition approach, the features speciﬁc to a topic, not a label, are regarded as important features. However, the approach may result in inefﬁcient learning as we will explain in the following example. Imagine an MTC problem of scientiﬁc papers where quantum computing papers are assigned multi-topic label “quantum physics (QP) & computer science (CS)”. (QP and CS are topics in this example.) Since there are some words speciﬁc to quantum computing such as “qbit1 ”, one can say that efﬁcient MTC learners should use such words to assign label QP & CS. However, the decomposition approach is likely to ignore these words since they are only speciﬁc to a small portion of the whole QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or CS. 1 Qbit is a unit of quantum information, and frequently appears in real quantum computing literatures, but rarely seen in other literatures. Symbol x(∈ Rd ) t1 , t2 , . . . , tl T L, λ(⊂ T ) L[j] Λ(= 2T ) {(xi , Li )}m i=1 Meaning A document vector Topics The set of all topics A label The binary representation of L. 1 if tj ∈ L and 0 otherwise. The set of all possible labels Training samples Table 1: Notation Parametric Mixture Model (PMM) [3] adopts another approach to MTC. It is assumed in PMM that multi-topic texts are generated from a mixture of topic-speciﬁc word distributions. Its decision on labeling is done at once, not separately for each topic. However, PMM also has a problem with multi-topic speciﬁc features such as “qbit” since it is impossible for texts to have such features given PMM’s mixture process. These problems with multi-topic speciﬁc features are caused by dependency assumptions between labels, which are explicitly or implicitly made in existing methods. To solve these problems, we propose Maximal Margin Labeling, which treats labels as independent classes and learns a multi-class classiﬁer on the induced multi-class problem. In this paper, we ﬁrst discuss why multi-class classiﬁers cannot be directly applied to MTC in Section 2. We then propose MML in Section 3, and address implementation issues in Section 4. In Section 5, MML is experimentally compared with existing methods using a collection of multi-topic Web pages. We summarize this paper in Section 6. 2 Solving MTC as a Multi-Class Categorization To discuss why existing multi-class classiﬁers do not work in MTC, we start from the multi-class classiﬁer proposed in [4]. Hereafter we use the notation given in Table 1. The multi-class classiﬁer in [4] categorizes an object into the class whose prototype vector is the closest to the object’s feature vector. By substituting label for class, the classiﬁer can be written as follows. f (x) = arg max x, mλ (1) X λ∈Λ where , X is the the inner product of Rd , and mλ ∈ Rd is the prototype vector of label λ. Following the similar argument as in [4], the prototype vectors are learned by solving the following maximal margin problem2 . min M s.t. 1 M 2 2 λ ξi +C xi , mLi 1≤i≤m λ∈Λ,λ=Li X − xi , m λ X λ ≥ 1 − ξi for 1 ≤ i ≤ m, ∀λ = Li , (2) where M is the prototype matrix whose columns are the prototype vectors, and M is the Frobenius matrix norm of M . Note that Eq. (1) and Eq. (2) cover only training samples’ labels, but also all possible labels. This is because the labels unseen in training samples may be relevant to test samples. In 2 In Eq.(2), we penalize all violation of the margin constraints. On the other hand, Crammer and Singer penalize only the largest violation of the margin constraint for each training sample [4]. We chose the “penalize-all” approach since it leads to an optimization problem without equality constraints (see Eq.(7)), which is much easier to solve than the one in [4]. usual multi-class problems, such unseen labels seldom exist. In MTC, however, the number of labels is generally very large (e.g. one of our datasets has 1,054 labels (Table 2)), and unseen labels often exist. Thus it is necessary to consider all possible labels in Eq. (1) and Eq. (2) since it is impossible to know which unseen labels are present in the test samples. There are two problems with Eq. (1) and Eq. (2). The ﬁrst problem is that they involve the prototype vectors of seldom or never seen labels. Without the help of prior knowledge about where the prototype vectors should be, it is impossible to obtain appropriate prototype vectors for such labels. The second problem is that these equations are computationally too demanding since they involve combinatorial maximization and summation over all possible labels, whose number can be quite large. (For example, the number is around 230 in the datasets used in our experiments.) We will address the ﬁrst problem in Section 3 and the second problem in Section 4. 3 Maximal Margin Labeling In this section, we incorporate some prior knowledge about the location of prototype vectors into Eq. (1) and Eq. (2), and propose a novel MTC learning algorithm, Maximal Margin Labeling (MML). As prior knowledge, we simply assume that the prototype vectors of similar labels should be placed close to each other. Based on this assumption, we ﬁrst rewrite Eq. (1) to yield f (x) = arg max M T x, eλ L, (3) λ∈Λ where , L is the inner product of R|Λ| and {eλ }λ∈Λ is the orthonormal basis of R|Λ| . The classiﬁer of Eq. (3) can be interpreted as a two-step process: the ﬁrst step is to map the vector x into R|Λ| by M T , and the second step is to ﬁnd the closest eλ to image M T x. Then we replace {eλ }λ∈Λ with (generally) non-orthogonal vectors {φ(λ)}λ∈Λ whose geometrical conﬁguration reﬂects label similarity. More formally speaking, we use vectors {φ(λ)}λ∈Λ that satisfy the condition φ(λ1 ), φ(λ2 ) S = S(λ1 , λ2 ) for ∀λ1 , λ2 ∈ Λ, (4) where , S is an inner product of the vector space spanned by {φ(λ)}λ∈Λ , and S is a Mercer kernel [5] on Λ × Λ and is a similarity measure between labels. We call the vector space spanned by {φ(λ)} VS . With this replacement, MML’s classiﬁer is written as follows. f (x) = arg max W x, φ(λ) S , (5) λ∈Λ where W is a linear map from Rd to VS . W is the solution of the following problem. min W 1 W 2 m 2 λ ξi +C i=1 λ∈Λ,λ=Li φ(Li )−φ(λ) λ λ s.t. W xi , ≥ 1−ξi , ξi ≥ 0 for 1 ≤ i ≤ m, ∀λ = Li . (6) φ(Li )−φ(λ) Note that if φ(λ) is replaced by eλ , Eq. (6) becomes identical to Eq. (2) except for a scale factor. Thus Eq. (5) and Eq. (6) are natural extensions of the multi-class classiﬁer in [4]. We call the MTC classiﬁer of Eq. (5) and Eq. (6) “Maximal Margin Labeling (MML)”. Figure 1 explains the margin (the inner product in Eq. (6)) in MML. The margin represents the distance from the image of the training sample xi to the boundary between the correct label Li and wrong label λ. MML optimizes the linear map W so that the smallest margin between all training samples and all possible labels becomes maximal, along with a penalty C for the case that samples penetrate into the margin. Figure 1: Maximal Margin Labeling Dual Form For numerical computation, the following Wolfe dual form of Eq. (6) is more convenient. (We omit its derivation due to space limits.) S(Li , Li′ )−S(Li , λ′ )−S(λ, Li′ )+S(λ, λ′ ) 1 λ λ′ λ αi αi′ (xi ·xi′ ) max αi − 2 αλ 2 (1−S(Li , λ))(1−S(Li′ , λ′ )) i ′ ′ i,λ i,λ i ,λ λ s.t. 0 ≤ αi ≤ C for 1 ≤ i ≤ m, ∀λ = Li , (7) m λ where we denote i=1 λ∈Λ,∀λ=Li by i,λ , and αi are the dual variables corresponding to the ﬁrst inequality constraints in Eq. (6). Note that Eq. (7) does not contain φ(λ): all the computations involving φ can be done through the label similarity S. Additionally xi only appears in the inner products, and therefore can be replaced by any kernel of x. λ Using the solution αi of Eq. (7), the MML’s classiﬁer in Eq. (5) can be written as follows. S(Li , L)−S(λ, L) λ . (8) f (x) = arg max αi (x·xi ) 2(1−S(Li , λ)) L∈Λ i,λ Label Similarity3 As examples of label similarity, we use two similarity measures: Dice measure and cosine measure. Dice measure4 4.1 SC (λ1 , λ2 ) = |λ1 ∩ λ2 | 2 |λ1 | |λ2 | = l j=1 l j=1 2|λ1 ∩λ2 | = |λ1 |+|λ2 | Cosine measure 4 SD (λ1 , λ2 ) = λ1 [j] + λ1 [j]λ2 [j] l j=1 l j=1 l j=1 λ2 [j] . λ1 [j]λ2 [j] λ1 [j] l j=1 (9) . 10) ( λ2 [j] Efﬁcient Implementation Approximation in Learning Eq. (7) contains the sum over all possible labels. As the number of topics (l) increases, this summation rapidly becomes intractable since |Λ| grows exponentially as 2l . To circumvent 3 The following discussion is easily extended to include the case that both λ1 and λ2 are empty although we do not discuss the case due to space limits. this problem, we approximate the sum over all possible labels in Eq. (7) by the partial sum λ λ over αi of |(A ∩ B c ) ∪ (Ac ∩ B)| = 1 and set all the other αi to zero. This approximation reduces the burden of the summation quite a lot: the number of summands is reduced from 2l to l, which is a huge reduction especially when many topics exist. λ To understand the rationale behind the approximation, ﬁrst note that αi is the dual variable λ corresponding to the ﬁrst inequality constraint (the margin constraint) in Eq. (7). Thus αi is non-zero if and only if W xi falls in the margin between φ(Li ) and φ(λ). We assume that this margin violation mainly occurs when φ(λ) is “close” to φ(Li ), i.e. |(A ∩ B c ) ∪ (Ac ∩ B)| = 1. If this assumption holds well, the proposed approximation of the sum will lead to a good approximation of the exact solution. 4.2 Polynomial Time Algorithms for Classiﬁcation The classiﬁcation of MML (Eq. (8)) involves the combinatorial maximization over all possible labels, so it can be a computationally demanding process. However, efﬁcient classiﬁcation algorithms are available when either the cosine measure or dice measure is used as label similarity. Eq. (8) can be divided into the subproblems by the number of topics in a label. f (x) = arg max g(x, L), (11) ˆ ˆ ˆ L∈{L1 ,L2 ,...,Ll } ˆ Ln = arg max g(x, L). (12) L∈Λ,|L|=n where g(x) is l g(x, L) cn [j]L[j], = j=1 cn [j] =      i,λ √ i,λ √ αλ (x·xi ) i 2(1−SD (Li ,λ)) αλ (x·xi ) i 2(1−SC (Li ,λ)) . 2Li [j] |Li |+n − 2λ[j] |λ|+n √Li [j] − √λ[j] √ √ |Li | n |λ| n if SD is used. (13) if SC is used. Here n = |L|. The computational cost of Eq. (13) for all j is O(nα l) (nα is the number of non-zero α), and that of Eq. (12) is O(l log l). Thus the total cost of the classiﬁcation by Eq. (11) is O(nα l2 + l2 log l). On the other hand, nα is O(ml) under the approximation described above. Therefore, the classiﬁcation can be done within O(ml3 ) computational steps, which is a signiﬁcant reduction from the case that the brute force search is used in Eq. (8). 5 Experiments In this section, we report experiments that compared MML to PMM [3], SVM5 [6], and BoosTexter [2] using a collection of Web pages. We used a normalized linear kernel k(x, x′ ) = x · x′ / x x′ in MML and SVM. As for BoosTexter, “real abstaining AdaBoost.MH” was used as the weak learner. 5.1 Experimental Setup The datasets used in our experiment represent the Web page collection used in [3] (Table 2). The Web pages were collected through the hyperlinks from Yahoo!’s top directory 5 For each topic, an SVM classiﬁer is trained to predict whether the topic is relevant (positive) or irrelevant (negative) to input doucments. Dataset Name (Abbrev.) Arts & Humanities (Ar) Business & Economy (Bu) Computers & Internet (Co) Education (Ed) Entertainment (En) Health (He) Recreation (Rc) Reference (Rf) Science (Si) Social Science (SS) Society & Culture (SC) #Text #Voc 7,484 11,214 12,444 12,030 12,730 9,205 12,828 8,027 6,428 12,111 14,512 #Tpc #Lbl 23,146 21,924 34,096 27,534 32,001 30,605 30,324 39,679 37,187 52,350 31,802 26 30 33 33 21 32 22 33 40 39 27 Label Size Frequency (%) 1 2 3 4 ≥5 599 55.6 30.5 9.7 2.8 1.4 233 57.6 28.8 11.0 1.7 0.8 428 69.8 18.2 7.8 3.0 1.1 511 66.9 23.4 7.3 1.9 0.6 337 72.3 21.1 4.5 1.0 1.1 335 53.2 34.0 9.5 2.4 0.9 530 69.2 23.1 5.6 1.4 0.6 275 85.5 12.6 1.5 0.3 0.1 457 68.0 22.3 7.3 1.9 0.5 361 78.4 17.0 3.7 0.7 0.3 1054 59.6 26.1 9.2 2.9 2.2 Table 2: A summary of the web page datasets. “#Text” is the number of texts in the dataset, “#Voc” the number of vocabularies (i.e. features), “#Tpc” the number of topics, “#Lbl” the number of labels, and “Label Size Frequency” is the relative frequency of each label size. (Label size is the number of topics in a label.) Method MML PMM SVM Boost Feature Type TF, TF×IDF TF TF, TF×IDF Binary Parameter C = 0.1, 1, 10 Model1, Model2 C = 0.1, 1, 10 R = {2, 4, 6, 8, 10}×103 Table 3: Candidate feature types and learning parameters. (R is the number of weak hypotheses.) The underlined fetures and parameters were selected for the evaluation with the test data. (www.yahoo.com), and then divided into 11 datasets by Yahoo’s top category. Each page is labeled with the Yahoo’s second level sub-categories from which the page is hyperlinked. (Thus, the sub-categories are topics in our term.) See [3] for more details about the collection. Then the Web pages were converted into three types of feature vectors: (a) Binary vectors, where each feature indicates the presence (absence) of a term by 1 (0); (b) TF vectors, where each feature is the number of appearances of a term (term frequency); and (c) TF×IDF vectors, where each feature is the product of term frequency and inverse document frequency [7]. To select the best combinations of feature types and learning parameters such as the penalty C for MML, the learners were trained on 2,000 Web pages with all combinations of feature and parameter listed in Table 3, and then were evaluated by labeling F-measure on independently drawn development data. The combinations which achieve the best labeling F-measures (underlined in Table 3) were used in the following experiments. 5.2 Evaluation Measures We used three measures to evaluate labeling performance: labeling F-measure, exact match ratio, and retrieval F-measure. In the following deﬁnitions, {Lpred }n and {Ltrue }n i=1 i i=1 i mean the predicted labels and the true labels, respectively. Labeling F-measure Labeling F-measure FL evaluates the average labeling performance while taking partial match into account. FL = 1 n n i=1 2|Lpred ∩ Ltrue | i i |Lpred | + |Ltrue | i i = 1 n n i=1 l pred [j]Ltrue [j] i j=1 Li . l (Lpred [j] + Ltrue [j]) i i j=1 2 (14) Dataset Ar Bu Co Ed En He Rc Rf Si SS SC Avg MD 0.55 0.80 0.62 0.56 0.64 0.74 0.63 0.67 0.61 0.73 0.60 0.65 Labeling F-measure MC PM SV BO 0.44 0.50 0.46 0.38 0.81 0.75 0.76 0.75 0.59 0.61 0.55 0.47 0.43 0.51 0.48 0.37 0.52 0.61 0.54 0.49 0.74 0.66 0.67 0.60 0.46 0.55 0.49 0.44 0.58 0.63 0.56 0.50 0.54 0.52 0.47 0.39 0.71 0.66 0.64 0.59 0.55 0.54 0.49 0.44 0.58 0.59 0.56 0.49 Exact Match Ratio MC PM SV 0.32 0.21 0.29 0.62 0.48 0.57 0.46 0.35 0.41 0.34 0.19 0.30 0.44 0.31 0.42 0.53 0.34 0.47 0.38 0.25 0.37 0.51 0.39 0.49 0.43 0.22 0.36 0.60 0.45 0.55 0.40 0.21 0.32 0.46 0.31 0.41 MD 0.44 0.63 0.51 0.45 0.55 0.58 0.54 0.60 0.52 0.65 0.44 0.54 BO 0.22 0.53 0.34 0.23 0.36 0.39 0.33 0.41 0.28 0.49 0.27 0.35 MD 0.30 0.25 0.27 0.25 0.37 0.35 0.47 0.29 0.37 0.36 0.29 0.32 Retrieval F-measure MC PM SV BO 0.26 0.24 0.29 0.22 0.27 0.20 0.29 0.20 0.25 0.19 0.30 0.17 0.23 0.21 0.25 0.16 0.33 0.30 0.35 0.29 0.35 0.23 0.35 0.26 0.39 0.36 0.40 0.33 0.25 0.24 0.25 0.16 0.35 0.28 0.31 0.19 0.35 0.18 0.31 0.15 0.28 0.25 0.26 0.20 0.30 0.24 0.31 0.21 Table 4: The performance comparison by labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). The bold ﬁgures are the best ones among the ﬁve methods, and the underlined ﬁgures the second best ones. MD, MC, PM, SV, and BO represent MML with SD , MML with SC , PMM, SVM and BoosTexter, respectively. Exact Match Ratio Exact match ratio EX counts only exact matches between the predicted label and the true label. EX = 1 n n I[Lpred = Ltrue ], i i (15) i=1 where I[S] is 1 if the statement S is true and 0 otherwise. Retrieval F-measure6 For real tasks, it is also important to evaluate retrieval performance, i.e. how accurately classiﬁers can ﬁnd relevant texts for a given topic. Retrieval F-measure FR measures the average retrieval performance over all topics. FR = 5.3 1 l l j=1 n pred [j]Ltrue [j] i i=1 Li . n (Lpred [j] + Ltrue [j]) i i i=1 2 (16) Results First we trained the classiﬁers with randomly chosen 2,000 samples. We then calculated the three evaluation measures for 3,000 other randomly chosen samples. This process was repeated ﬁve times, and the resulting averaged values are shown in Table 4. Table 4 shows that the MMLs with Dice measure outperform other methods in labeling F-measure and exact match ratio. The MMLs also show the best performance with regard to retrieval Fmeasure although the margins to the other methods are not as large as observed in labeling F-measure and exact match ratio. Note that no classiﬁer except MML with Dice measure achieves good labeling on all the three measures. For example, PMM shows high labeling F-measures, but its performance is rather poor when evaluated in retrieval F-measure. As the second experiment, we evaluated the classiﬁers trained with 250–2000 training samples on the same test samples. Figure 2 shows each measure averaged over all datasets. It is observed that the MMLs show high generalization even when training data is small. An interesting point is that MML with cosine measure achieves rather high labeling F-measures and retrieval F-measure with training data of smaller size. Such high-performace, however, does not continue when trained on larger data. 6 FR is called “the macro average of F-measures” in the text categorization community. Figure 2: The learning curve of labeling F-measure (left), exact match ratio (middle) and retrieval F-measure (right). MD, MC, PM, SV, BO mean the same as in Table 4. 6 Conclusion In this paper, we proposed a novel learning algorithm for multi-topic text categorization. The algorithm, Maximal Margin Labeling, embeds labels (sets of topics) into a similarityinduced vector space, and learns a large margin classiﬁer in the space. To overcome the demanding computational cost of MML, we provide an approximation method in learning and efﬁcient classiﬁcation algorithms. In experiments on a collection of Web pages, MML outperformed other methods including SVM and showed better generalization. Acknowledgement The authors would like to thank Naonori Ueda, Kazumi Saito and Yuji Kaneda of Nippon Telegraph and Telephone Corporation for providing PMM’s codes and the datasets. References [1] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire N´ dellec and C´ line Rouveirol, editors, Proc. of the e e 10th European Conference on Machine Learning, number 1398, pages 137–142, 1998. [2] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000. [3] Naonori Ueda and Kazumi Saito. Parametoric mixture models for multi-topic text. In Advances in Neural Information Processing Systems 15, pages 1261–1268, 2003. [4] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001. [5] Klaus-Robert M¨ ller, Sebastian Mika, Gunnar R¨ tsch, Koji Tsuda, and Bernhard u a Sch¨ lkopf. An introduction to kernel-based learning algorithms. IEEE Transactions o on Neural Networks, 12(2):181–201, 2001. [6] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998. [7] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wealy, 1999.

reference text

[1] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire N´ dellec and C´ line Rouveirol, editors, Proc. of the e e 10th European Conference on Machine Learning, number 1398, pages 137–142, 1998.

[2] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.

[3] Naonori Ueda and Kazumi Saito. Parametoric mixture models for multi-topic text. In Advances in Neural Information Processing Systems 15, pages 1261–1268, 2003.

[4] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

[5] Klaus-Robert M¨ ller, Sebastian Mika, Gunnar R¨ tsch, Koji Tsuda, and Bernhard u a Sch¨ lkopf. An introduction to kernel-based learning algorithms. IEEE Transactions o on Neural Networks, 12(2):181–201, 2001.

[6] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.

[7] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wealy, 1999.