jmlr jmlr2007 jmlr2007-50 knowledge-graph by maker-knowledge-mining

50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition


Source: pdf

Author: Chao-Chun Liu, Dao-Qing Dai, Hong Yan

Abstract: Face recognition is a challenging problem due to variations in pose, illumination, and expression. Techniques that can provide effective feature representation with enhanced discriminability are crucial. Wavelets have played an important role in image processing for their ability to capture localized spatial-frequency information of images. In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posteriori logistic model are put forward. Moreover, a new triangle square ratio criterion is used to improve classification using the Euclidean distance and the cosine criterion. Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. [sent-10, score-0.979]

2 Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. [sent-11, score-0.542]

3 The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. [sent-12, score-0.59]

4 To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posteriori logistic model are put forward. [sent-13, score-0.333]

5 Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. [sent-15, score-0.316]

6 Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations 1. [sent-16, score-0.757]

7 In this study, we investigate a new approach by extracting the features not sensitive to environmental changes from a wavelet packet dictionary. [sent-32, score-0.534]

8 Generally, feature extraction, discriminant analysis and the classifying criterion are the three basic elements of a face recognition system. [sent-33, score-0.417]

9 A well-known feature extraction method is called FisherFace, based on linear discriminant analysis (LDA), which linearly projects the image space to a low-dimensional subspace so as to discount environmental variations (Belhumeur et al. [sent-42, score-0.335]

10 Our previous study (Dai and Yuen, 2006) uses a wavelet-enhanced regularized discriminant analysis after dimensionality reduction with a low-pass filter to solve the small sample size problem, which is also a method based on the low frequency subband. [sent-50, score-0.471]

11 , ICA, PCA, Neural Networks) to enhance the discriminant power in one or several special subbands, the latter always fuse the discriminant power in these different subbands for final classification (Ekenel and Sanker, 2005; Zhang et al. [sent-55, score-0.562]

12 Moreover, as a generalization of the wavelet transform, the wavelet packet not only offers an attractive tool for reducing the dimensionality by feature extraction, but also allows us to create localized subbands of the data in both space and frequency domains. [sent-57, score-1.017]

13 (1997), the best-basis algorithm of Coifman and Wickerhauser (1992) is used to search for the wavelet packet basis for face representation. [sent-63, score-0.644]
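
The best-basis search referenced here proceeds bottom-up over the wavelet packet tree, keeping a node whenever its cost is not beaten by the combined cost of its children. The sketch below is a minimal Python illustration under my own assumptions (an additive cost such as an entropy, and a dict keyed by node paths using the 2-D path letters a/h/v/d); it is not the authors' implementation.

```python
def best_basis(costs, node="", max_depth=3):
    """Recursive Coifman-Wickerhauser style best-basis selection.

    costs: dict mapping a node path (e.g. "", "a", "ad", ...) to an
           additive cost value (e.g. an entropy of its coefficients).
    Returns (selected_leaf_paths, total_cost) for the subtree rooted at `node`.
    """
    children = [node + c for c in "ahvd"]           # 2-D quad split: approx/horiz/vert/diag
    if len(node) >= max_depth or any(c not in costs for c in children):
        return [node], costs[node]                   # cannot split further: keep this node

    paths, child_cost = [], 0.0
    for c in children:                               # best basis of each child subtree
        p, cc = best_basis(costs, c, max_depth)
        paths += p
        child_cost += cc

    if child_cost < costs[node]:                     # children jointly cheaper: keep their bases
        return paths, child_cost
    return [node], costs[node]                       # otherwise keep the parent node itself
```

Calling best_basis(costs) with a table of per-subband entropies returns the leaf paths of the selected basis together with its total cost.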

14 In Bhagavatula and Savvides (2005), PCA is performed in wavelet packet subbands and the subbands which generalize better across illumination variations for face recognition are sought. [sent-64, score-1.304]

15 It is known that a good feature extractor for a face recognition system should select as many of the best discriminant features as possible, features that are not sensitive to arbitrary environmental variations. [sent-66, score-0.392]

16 Moreover, changes in pose or scale of a face and most illumination variations affect the intensity manifold globally, so that only their low-frequency spectrum is affected. [sent-70, score-0.382]

17 So there is no special subband in which all coordinates are insensitive to these variations. [sent-73, score-0.315]

18 In each subband, only some coordinates may have enough discriminant power to distinguish different persons; the remainder may be sensitive to environmental changes, but methods based on the whole subband will extract these sensitive features as well. [sent-74, score-0.465]

19 Moreover, there may be no special subband containing all the best discriminant features, because the features that are not sensitive to environmental variations are distributed locally across coordinates of different subbands. [sent-75, score-0.783]

20 The methods based on the segmental subbands will lose some good discriminant features. [sent-76, score-0.387]

21 Many weakly discriminant coordinates in one subband may add up to a larger discriminability than that of another subband containing a few highly discriminant coordinates together with many weak ones (Saito et al. [sent-78, score-1.338]

22 , 2002); then the few best discriminant coordinates will be discarded by methods that search for the most discriminant subbands, even though only these few best coordinates are needed. [sent-79, score-0.556]

23 So the selection of the best discriminant information should be independent of the subbands in which the coordinates reside, and depend only on their discriminability for face recognition. [sent-80, score-0.4]

24 However, the methods based on the whole subband neglect the distribution of features, and may therefore fail to select the best discriminant features. [sent-81, score-0.337]

25 Based on the new dilation invariant entropy and its derived separability measure function, any two coordinates can be compared by their discriminability, whether they are located in the same subband or in different subbands. [sent-85, score-0.565]

26 As for the classifying criterion, the traditional Euclidean distance cannot measure similarity well when there are illumination variations in facial images, and the cosine criterion is unsatisfactory when there are pose and expression changes. [sent-88, score-0.492]
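
This complementary behavior is easy to see with a toy example: a global illumination change acts roughly as a scaling of the feature vector, which inflates the Euclidean distance to the correct identity but leaves the cosine criterion untouched. A minimal numpy illustration with my own toy data, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "gallery" of 5 identities: a shared mean face plus small per-identity deltas.
mean_face = rng.normal(size=64)
gallery = mean_face + 0.1 * rng.normal(size=(5, 64))

# Probe = identity 2 under a global intensity change, modeled as a scaling.
probe = 1.5 * gallery[2]

euclid = np.linalg.norm(gallery - probe, axis=1)
cosine = 1 - gallery @ probe / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(probe))

print("Euclidean distances: ", np.round(euclid, 2))   # distance to the true identity is inflated by the scaling
print("Cosine dissimilarity:", np.round(cosine, 4))   # exactly 0 for the true identity: scaling-invariant
```

Conversely, an additive pose- or expression-like perturbation changes the direction of the vector and therefore disturbs the cosine criterion more than the Euclidean distance.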

27 Experimental results show that our LDC feature extraction largely overcomes the shortcomings of subband-based methods and improves feature extraction for face recognition under different environmental variations. [sent-92, score-0.485]

28 The contribution of this paper consists of the following: • Further extension of wavelets to face recognition to deal with illumination, pose and expression problems. [sent-93, score-0.323]

29 • Introduction of a dilation invariant entropy and a maximum a posteriori logistic model for selection of wavelet packet coordinates. [sent-94, score-0.752]

30 • Design of a face recognition system, which solves the small sample size problem and is robust to variations in illumination, pose and expression. [sent-96, score-0.316]

31 In Section 2, the wavelet packet decomposition and the local discriminant basis algorithm will be introduced. [sent-98, score-0.712]

32 Feature Extraction by Local Discriminant Basis: In this section, we first review the wavelet packet decomposition, and then introduce the local discriminant basis (LDB) algorithm and the modified LDB algorithm. [sent-102, score-0.684]

33 For the wavelet transform, only the lowpass filtered subband is further iterated. [sent-108, score-0.411]

34 As a generalization of the wavelet transform, the two-channel filter banks are iterated over the lowpass and the highpass subbands in the wavelet packet decomposition. [sent-109, score-0.97]

35 This generates a tree structure which provides many possible wavelet packet bases; accordingly, signals are decomposed into a time-frequency dictionary. [sent-110, score-0.509]

36 When dealing with images, the wavelet decomposition or the wavelet packet decomposition is first applied along the rows of the images, then their results are further decomposed along the columns. [sent-111, score-0.814]

37 As the process is iterated, only L1 is further decomposed in the wavelet decomposition, whereas all of L1, H1, V1 and D1 are further decomposed in the wavelet packet decomposition. [sent-115, score-0.758]

38 Figure 1 shows a two-dimensional example of a facial image for the wavelet decomposition and the wavelet packet decomposition with depth 2. [sent-116, score-0.912]
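
To reproduce a depth-2 decomposition like the one in Figure 1, one option is the PyWavelets package (an assumption of this sketch, not the authors' code): wavedec2 gives the wavelet decomposition, in which only the lowpass band is re-decomposed, and WaveletPacket2D gives the full packet tree.

```python
import numpy as np
import pywt

image = np.random.rand(128, 128)   # stand-in for a facial image

# Wavelet decomposition (only the lowpass band is re-decomposed), depth 2.
coeffs = pywt.wavedec2(image, wavelet="db4", level=2)
# coeffs = [approximation at level 2, details at level 2, details at level 1]

# Wavelet packet decomposition (all four bands re-decomposed), depth 2.
wp = pywt.WaveletPacket2D(data=image, wavelet="db4", maxlevel=2)
level2 = wp.get_level(2)           # 16 subbands at depth 2
print([node.path for node in level2], level2[0].data.shape)
```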

39 Figure 1: (Top) The two-dimensional wavelet decomposition of a facial image with depth 2. [sent-121, score-0.375]

40 (Bottom) The two-dimensional wavelet packet decomposition of a facial image with depth 2. [sent-122, score-0.635]

41 Then φ1(Bj) will be a measure of the efficacy of the subband Bj for classification, and the local discriminant basis is selected by the best-basis algorithm (Coifman and Wickerhauser, 1992) using the following criterion: Ψ = arg max φ1(Bj). [sent-127, score-0.337]

42 (2002), a modified version of the LDB algorithm is introduced using the empirical probability distributions instead of the space-frequency energy distribution as their selection strategy to eliminate some less discriminant coordinates in each subband locally. [sent-133, score-0.457]

43 Let $\delta_{jt} = \phi_2\big(\Gamma^{(1)}(b_{jt}), \Gamma^{(2)}(b_{jt}), \cdots, \Gamma^{(K)}(b_{jt})\big) = \sum_{m,n=1,\ m \neq n}^{K} d^*\big(\Gamma^{(m)}(b_{jt}), \Gamma^{(n)}(b_{jt})\big)$ (5), that is, the discriminability of coordinate $b_{jt}$ ($t = 1, 2, \cdots, n_j$). [sent-134, score-0.547]
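
As a concrete reading of Equation (5), the sketch below computes the discriminability of every coordinate in one subband from per-class normalized energy maps Γ^(k). The squared difference stands in for the pairwise measure d*; the paper's own choices (and the dilation invariant entropy introduced later) would be passed in instead.

```python
import numpy as np

def coordinate_discriminability(gammas, d_star=None):
    """delta_jt of Eq. (5): sum of pairwise d* over class pairs, per coordinate.

    gammas: array of shape (K, n_j) holding Gamma^(k)(b_jt) for one subband.
    d_star: pairwise per-coordinate measure; squared difference by default
            (a placeholder, not the paper's exact choice).
    """
    if d_star is None:
        d_star = lambda g_m, g_n: (g_m - g_n) ** 2
    K, n_j = gammas.shape
    delta = np.zeros(n_j)
    for m in range(K):
        for n in range(K):
            if m != n:
                delta += d_star(gammas[m], gammas[n])
    return delta

# Example: 3 classes, 8 coordinates in a subband.
gammas = np.abs(np.random.default_rng(1).normal(size=(3, 8)))
gammas /= gammas.sum(axis=1, keepdims=True)      # normalized energy per class
print(coordinate_discriminability(gammas))
```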

44 Although the MLDB algorithm may overcome some limitations of LDB, the selection of coordinates is only limited to each subband so that coordinates in different subbands are still incomparable. [sent-137, score-0.58]

45 We use the wavelet packet feature extraction at the first step. [sent-140, score-0.562]

46 Thus, our selection can be based on all coordinates of the dictionary, but not the subbands themselves. [sent-144, score-0.315]
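
In contrast to LDB/MLDB, the candidate pool here is the union of all coordinates across all subbands of the dictionary, ranked by a single discriminability score. A hedged sketch, assuming each subband already carries a per-coordinate score (for example from Eq. (5) or from the MAP logistic model) and that the N0 most discriminant coordinates are kept:

```python
import numpy as np

def select_ldc_coordinates(scores_per_subband, n_keep):
    """Pool coordinates from every subband and keep the n_keep most discriminant ones.

    scores_per_subband: dict {subband_name: 1-D array of per-coordinate scores}.
    Returns a list of (subband_name, coordinate_index) pairs.
    """
    pooled = [(name, t, s)
              for name, scores in scores_per_subband.items()
              for t, s in enumerate(scores)]
    pooled.sort(key=lambda item: item[2], reverse=True)     # most discriminant first
    return [(name, t) for name, t, _ in pooled[:n_keep]]

# Toy example: 16 subbands of 256 coordinates each; keep the 1300 best overall
# (1300 matching the "five subbands with 260 coordinates each" scheme mentioned
# later; the scores here are random stand-ins).
rng = np.random.default_rng(2)
scores = {f"B{j}": rng.random(256) for j in range(16)}
selected = select_ldc_coordinates(scores, n_keep=1300)
print(len(selected), selected[:3])
```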

47 The Wavelet Packet Decomposition in our LDC Algorithm: In the LDC algorithm, the wavelet packet technique is used to decompose an image into subbands that are localized in both space and frequency domains, and offers a choice of optimal coordinates for the representation of a human face. [sent-153, score-0.897]

48 Because each child subband is derived from its parent subband at the above level, the coordinates in the two levels are linearly dependent. [sent-155, score-0.427]

49 The dashed part is the spatial-frequency dictionary D in the LDC algorithm (Figure 2). In Section 4, we will search for the most discriminant level by the best performance of its selected coordinates. [sent-157, score-0.33]

50 In fact, with the further wavelet packet decomposition, more fine-scale information which may have good discriminant power is generated; however, the resolution of the subband images becomes lower, so that less information exists for the purpose of object localization (Grewe and Brooks, 1997). [sent-163, score-0.933]

51 We also use Level 3 in the LDC algorithm, and our spatial-frequency dictionary D consists of 16 subbands in Level 3 (a subset of the dictionary in the LDB algorithm) (see Figure 2). [sent-166, score-0.382]

52 The Daubechies db4 wavelet will be used for image decomposition (Daubechies, 1990); if the sizes of the facial images are not dyadic numbers, we apply a zero-padding extension to create the smallest dyadic images for the wavelet packet decomposition. [sent-167, score-1.012]
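
The zero-padding step can be sketched as follows, again assuming PyWavelets; each image side is padded with zeros up to the next power of two before the db4 packet decomposition.

```python
import numpy as np
import pywt

def pad_to_dyadic(image):
    """Zero-pad an image so that each side becomes the next power of two."""
    rows, cols = image.shape
    r = 2 ** int(np.ceil(np.log2(rows)))
    c = 2 ** int(np.ceil(np.log2(cols)))
    padded = np.zeros((r, c), dtype=image.dtype)
    padded[:rows, :cols] = image
    return padded

face = np.random.rand(112, 92)                 # e.g., an ORL-sized image
face = pad_to_dyadic(face)                     # -> 128 x 128
wp = pywt.WaveletPacket2D(face, wavelet="db4", maxlevel=3)
print(face.shape, len(wp.get_level(3)))        # 64 subbands at level 3; the LDC
                                               # dictionary uses a 16-subband subset (Figure 2)
```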

53 Deficiency of Absolute "Distance" Measures in Wavelet-Based Methods: To introduce our new dilation invariant entropy, we first list several traditional discriminant measures. [sent-172, score-0.4]

54 i=1 From Equations (3) and (10), we know that the discriminant measure of subband B j in the LDB algorithm can be written as K nj m,n=1 m=n φ1 (B j ) = t=1 ∑ ∑ d ∗ (Γ (m) (n) (b jt ), Γ (b jt )). [sent-177, score-0.461]

55 Unfortunately, we find that the OM of the coordinate loadings differs greatly between lower spatial frequency subbands and higher spatial frequency subbands. [sent-181, score-0.453]

56 For example, the coordinate loadings in the first spatial frequency subband B6 (=LL2 in Figure 2) of the second level may vary from 0. [sent-182, score-0.345]

57 1 to 10, and the coordinate loadings in the second spatial frequency subband B7 (=LH2 in Figure 2) [sent-183, score-0.322]

58 Similarly, we can define the dilation invariant l2 norm as $d_2^{DI}(w, z) = \Big[\sum_{i=1}^{n} \Big(\frac{w_i}{w_i + z_i} - \frac{z_i}{w_i + z_i}\Big)^2\Big]^{1/2} = \Big[\sum_{i=1}^{n} (\tilde{w}_i - \tilde{z}_i)^2\Big]^{1/2} = d_2(\tilde{w}, \tilde{z})$. (15) [sent-227, score-0.448]

59 In Section 4, we will conduct an experiment to test the performance of the LDC algorithm using both the dilation invariant entropy and the dilation invariant l2 norm. [sent-228, score-0.449]
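
Under my reading of Equation (15) above, the dilation invariant l2 norm first normalizes each coordinate pair by its sum and then takes an ordinary Euclidean distance, so a common scaling (dilation) of w and z leaves the value unchanged. A small numpy sketch, not the authors' code:

```python
import numpy as np

def dilation_invariant_l2(w, z, eps=1e-12):
    """d_2^DI(w, z) from Eq. (15): l2 distance of the sum-normalized coordinates."""
    w, z = np.asarray(w, dtype=float), np.asarray(z, dtype=float)
    s = w + z + eps                      # eps guards against 0/0 on empty coordinates
    return np.linalg.norm(w / s - z / s)

w = np.array([1.0, 4.0, 2.0])
z = np.array([2.0, 1.0, 2.0])
print(dilation_invariant_l2(w, z))
print(dilation_invariant_l2(10 * w, 10 * z))   # same value (up to the small eps guard)
```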

60 (2005) proposed a complete kernel Fisher discriminant analysis algorithm which makes full use of two kinds of discriminant information, regular and irregular in kernel feature space. [sent-278, score-0.35]

61 CLDA uses the between-class scatter criterion defined in Equation (21) to derive the irregular discriminant vectors from null(Sw), while using the standard Fisher criterion defined in Equation (20) to derive the regular discriminant vectors from range(Sw). [sent-286, score-0.469]

62 In our experiments, we capture all the regular discriminant vectors which satisfy J1(ϕ1) > 0 from the range space of Sw; simultaneously, we capture all the irregular discriminant vectors which satisfy J2(ϕ2) > 0 from the null space of Sw. [sent-287, score-0.35]
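
As a concrete reading of this regular/irregular split, the sketch below first restricts to range(St), then takes irregular directions from null(Sw) by maximizing between-class scatter (cf. Eq. (21)) and regular directions from range(Sw) by the standard Fisher criterion (cf. Eq. (20)). It is a simplified CLDA-style sketch under my own numerical choices, not the authors' exact procedure.

```python
import numpy as np

def clda_directions(X, y, tol=1e-10):
    """Sketch of complete LDA: regular vectors from range(Sw), irregular from null(Sw).

    X: (N, d) feature matrix (d may exceed N, the small-sample-size setting).
    y: (N,) integer class labels.
    Returns (regular, irregular) with discriminant directions as columns.
    """
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = Xc.mean(axis=0) - mean
        Sb += Xc.shape[0] * np.outer(diff, diff)       # between-class scatter
        Xc0 = Xc - Xc.mean(axis=0)
        Sw += Xc0.T @ Xc0                              # within-class scatter

    # Restrict to range(St) so both scatters are represented compactly.
    evals, evecs = np.linalg.eigh(Sb + Sw)
    P = evecs[:, evals > tol]
    Sb_p, Sw_p = P.T @ Sb @ P, P.T @ Sw @ P

    # Split range(St) into null(Sw) (irregular part) and range(Sw) (regular part).
    w_evals, w_evecs = np.linalg.eigh(Sw_p)
    null_w, rng_w = w_evecs[:, w_evals <= tol], w_evecs[:, w_evals > tol]

    # Irregular: maximize between-class scatter inside null(Sw) (cf. Eq. (21)).
    irregular = np.zeros((d, 0))
    if null_w.shape[1]:
        e1, v1 = np.linalg.eigh(null_w.T @ Sb_p @ null_w)
        irregular = P @ null_w @ v1[:, e1 > tol]

    # Regular: Fisher criterion inside range(Sw) (cf. Eq. (20)), via whitening of Sw.
    regular = np.zeros((d, 0))
    if rng_w.shape[1]:
        dw, Vw = np.linalg.eigh(rng_w.T @ Sw_p @ rng_w)
        Wh = Vw @ np.diag(1.0 / np.sqrt(dw)) @ Vw.T    # (restricted Sw)^(-1/2)
        e2, v2 = np.linalg.eigh(Wh @ (rng_w.T @ Sb_p @ rng_w) @ Wh)
        regular = P @ rng_w @ (Wh @ v2[:, e2 > tol])

    return regular, irregular
```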

63 A New Criterion for the Nearest Neighbor (NN) Classifier: After the discriminant features are extracted, a remaining key element of face recognition is to design a robust classifier. [sent-289, score-0.367]

64 Experimental results in Section 4 show that the triangular square ratio is more robust against illumination variations than the Euclidean distance, whilst retaining the same robustness against pose and expression changes as the Euclidean distance. [sent-306, score-0.325]

65 On the other hand, although the triangular square ratio marginally underperforms the cosine criterion when there are variations of illumination, it clearly outperforms the cosine criterion when there are changes of pose and expression. [sent-307, score-0.46]

66 Step 4: Testing a new probe sample. A new probe sample is expanded into the spatial-frequency dictionary D by the wavelet packet decomposition, and the loadings of the corresponding coordinates selected in Step 2 are extracted to form a new feature vector vnew. [sent-317, score-0.787]

67 Simply, the LDC algorithm can be represented as: Output = T3 · T2 · T1 · Input, where T1 is the wavelet packet transform and T2 is the LDC selection transform. [sent-319, score-0.509]
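
Putting the pieces together, testing a probe amounts to applying T1 and T2 and then handing the feature vector to the classification stage T3. A hedged sketch assuming PyWavelets and a list of (subband path, coordinate index) pairs chosen in Step 2; the function names are mine, not the paper's.

```python
import numpy as np
import pywt

def extract_feature_vector(image, selected, wavelet="db4", level=3):
    """T2 . T1: packet-expand an image and keep the selected coordinate loadings.

    selected: list of (subband_path, flat_index) pairs chosen at training time.
    """
    wp = pywt.WaveletPacket2D(image, wavelet=wavelet, maxlevel=level)      # T1
    bands = {node.path: node.data.ravel() for node in wp.get_level(level)}
    return np.array([bands[path][idx] for path, idx in selected])          # T2

# T3 would then be the discriminant projection followed by the nearest-neighbor
# decision on the projected feature vector, e.g. label = classify(project(v_new)).
```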

68 We compare their computational complexity step by step. Step 1: The wavelet packet transform. This step is the same for the LDC and LDB algorithms and costs O(N · nr nc · L), where nr × nc is the size of the facial images and L is the level of the wavelet packet decomposition. [sent-323, score-1.241]

69 a) For each coordinate in DLDC , the LDC algorithm needs to compute the prior distribution (16), the conditional probability (17) and its discriminability (19), so the costs of all the coordinates in DLDC are O(N · nr nc ) + O(K · nr nc ) + O(N · nr nc ). [sent-325, score-0.438]

70 So the costs of all the subbands in DLDB are O(N · nr nc · L ) + O(K 2 · nr nc · L ). [sent-327, score-0.36]

71 So we blend the small FERET data set and the ORL database together, in order to test the performance of the LDC algorithm when facial images have larger illumination variations together with pose and expression changes (Loog and Duin, 2004). [sent-356, score-0.416]

72 In this paper, we use a subset that focuses on illumination variations with pose and expression variations in frontal view (Figure 3: facial images of the FERET database). [sent-361, score-0.386]

73 Table 2: Statistics for face images. ORL: 400 images, 10 per person, 40 classes; SmallFERET: 432 images, 6 per person, 72 classes; LargeFERET: 1020 images, 4 per person, 255 classes; Hybrid: 832 images, 6 or 10 per person, 112 classes; CMU-lights: 2924 images, 43 per person, 68 classes. [sent-367, score-0.366]

74 For the MLDB algorithm, we choose N0 coordinates as in LDC, following the scheme of five subbands with 260 coordinates each. [sent-383, score-0.315]

75 Table 4 shows that Scheme 1 marginally underperforms Scheme 2, and that Scheme 1, which uses all the subbands in the wavelet packet tree, is more time-consuming than Scheme 2, which only uses the left subtree whose root node is L1. [sent-402, score-0.745]

76 So Level 3 is used in the LDC algorithm, and our spatial-frequency dictionary D consists of the first 16 subbands in the third level (see Figure 2). [sent-503, score-0.32]

77 Efficacy of the LDC Based Feature Extraction: It is of paramount importance for face recognition to extract the most discriminant features that are less sensitive to environmental variations. [sent-557, score-0.392]

78 Generally, a distance-based criterion is more robust than a correlation-based criterion with respect to pose and expression changes, while the opposite holds with respect to illumination variations. [sent-759, score-0.31]

79 In fact, the FERET and CMU-lights databases concern illumination variations (light intensity and direction, respectively), while the ORL database concerns expression and pose changes. [sent-766, score-0.43]

80 Comparison results show that the triangle square ratio criterion is more robust against illumination variations than the Euclidean distance, whilst retaining the same robustness against pose and expression changes as the Euclidean distance. [sent-767, score-0.428]

81 Furthermore, we replace the triangle square ratio criterion with the cosine criterion, whilst keeping the other settings unchanged, on the ORL database and the small FERET data set. [sent-768, score-0.311]

82 The comparison between Table 6 and Table 7 shows that the triangle square ratio criterion performs considerably better than the cosine criterion on the ORL database, although it marginally underperforms the cosine criterion on the small FERET data set. [sent-770, score-0.439]

83 Effects of Different Wavelet Basis Functions: We use different wavelet basis functions for the wavelet packet decomposition on the ORL database and the small FERET data set, including: the Haar wavelet, Daubechies db6 wavelet, Symlets sym2 wavelet, Coiflets coif2 wavelet, Biorthogonal spline bior2. [sent-1153, score-0.89]
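
Repeating the basis-function comparison only requires swapping the wavelet name passed to the decomposition. A sketch assuming PyWavelets; only the bases that are fully named in the sentence above are listed, since the order of the biorthogonal spline wavelet is truncated in this extract.

```python
import numpy as np
import pywt

image = np.random.rand(128, 128)                 # stand-in for a facial image
for name in ["haar", "db6", "sym2", "coif2"]:    # subset of the bases compared in the paper
    wp = pywt.WaveletPacket2D(image, wavelet=name, maxlevel=3)
    subbands = wp.get_level(3)
    print(name, len(subbands), subbands[0].data.shape)
```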

84 Effects of Different Relative "Distance" Measures: In order to show the effect of different relative "distance" measures on the performance of classification, we compare the dilation invariant entropy with the dilation invariant l2 norm (15). [sent-1163, score-0.449]

85 Figure 8: Performance of the LDC algorithm using the dilation invariant entropy and the dilation invariant l2 norm (performance versus number of training samples). [sent-1175, score-0.449]

86 It shows that the dilation invariant l2 norm marginally underperforms the dilation invariant entropy, which implies that switching between the aforementioned relative "distance" measures has only a slight effect on the performance of the LDC algorithm. [sent-1180, score-0.455]

87 It should be pointed out that in our experiments, LDB selects the four best discriminant subbands from the local discriminant basis (see Subsection 4. [sent-1189, score-0.562]

88 Discussions and Conclusion In this paper, we have presented a novel local discriminant coordinates (LDC) method based on wavelet packet for face recognition to compensate for illumination, pose and expression variations. [sent-1194, score-1.066]

89 The method searches for the most discriminant coordinates from a wavelet packet dictionary, instead of the most discriminant basis as in the LDB algorithm. [sent-1195, score-0.962]

90 We have used the dilation invariant entropy and a MAP logistic model to measure the separability of the coordinate-loading ensemble accurately. [sent-1202, score-0.317]

91 A discriminant coordinate may be located in either low spatial frequency subbands or high spatial frequency subbands. [sent-1203, score-0.374]

92 So any two coordinates in the wavelet packet dictionary are comparable for their discriminability. [sent-1204, score-0.697]

93 We have modified the Euclidean distance and the cosine criterion in the nearest neighbor classifier, and proposed a new triangle square ratio criterion which takes into account two similarity measures, distance and correlation. [sent-1208, score-0.315]
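
A hedged sketch of the nearest-neighbor stage with a pluggable dissimilarity criterion. The Euclidean and cosine criteria below follow their standard definitions; the triangle square ratio itself is not reproduced in these extracted sentences, so it is left as a user-supplied callable rather than guessed at here.

```python
import numpy as np

def nn_classify(probe, gallery, labels, criterion):
    """Nearest-neighbor decision under an arbitrary dissimilarity criterion."""
    scores = np.array([criterion(probe, g) for g in gallery])
    return labels[int(np.argmin(scores))]

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_dissim(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The triangle square ratio criterion would be passed in the same way, e.g.
# nn_classify(v_new, gallery, labels, criterion=triangle_square_ratio),
# with triangle_square_ratio implemented from the paper's own definition.
gallery = np.random.default_rng(3).normal(size=(10, 50))
labels = np.arange(10)
probe = gallery[4] + 0.05 * np.random.default_rng(4).normal(size=50)
print(nn_classify(probe, gallery, labels, euclidean),
      nn_classify(probe, gallery, labels, cosine_dissim))
```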

94 Experimental results show that the triangle square ratio criterion is more robust against illumination variations than the Euclidean distance, while retaining the same robustness against pose and expression changes as the Euclidean distance. [sent-1209, score-0.428]

95 Also, it can obviously outperform the cosine criterion when there are changes of pose and expression, although it marginally underperforms the cosine criterion when there are variations of illumination. [sent-1210, score-0.401]

96 We have used the ORL database to test moderate variations in pose and expression, the CMU database to test illumination variations, the FERET database for a more generic situation, and the hybrid database to test heteroscedastic class covariance distributions. [sent-1212, score-0.512]

97 In conclusion, experimental results show that our LDC algorithm preserves the most discriminant information of facial images well and improves face recognition performance under different variations. [sent-1213, score-0.465]

98 Solving the small sample size problem in face recognition using generalized discriminant analysis. [sent-1414, score-0.367]

99 Face recognition by applying wavelet subband representation and kernel associative memory. [sent-1531, score-0.468]

100 The application of neural network and wavelet in human face illumination compensation. [sent-1546, score-0.507]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ldc', 0.498), ('ldb', 0.291), ('packet', 0.26), ('wavelet', 0.249), ('subbands', 0.212), ('feret', 0.189), ('orl', 0.179), ('discriminant', 0.175), ('mldb', 0.168), ('subband', 0.162), ('dilation', 0.151), ('clda', 0.151), ('face', 0.135), ('waveletface', 0.129), ('xy', 0.124), ('illumination', 0.123), ('dai', 0.106), ('coordinates', 0.103), ('fisherface', 0.095), ('discriminability', 0.09), ('xiy', 0.089), ('dictionary', 0.085), ('oordinates', 0.084), ('database', 0.082), ('pc', 0.076), ('dlda', 0.073), ('facial', 0.072), ('yan', 0.072), ('ecognition', 0.071), ('pose', 0.068), ('cosine', 0.067), ('images', 0.064), ('iu', 0.063), ('jt', 0.062), ('sw', 0.058), ('iscriminant', 0.058), ('recognition', 0.057), ('separability', 0.057), ('variations', 0.056), ('loadings', 0.056), ('pca', 0.055), ('coifman', 0.055), ('invariant', 0.055), ('extraction', 0.053), ('triangle', 0.053), ('ocal', 0.051), ('criterion', 0.05), ('zi', 0.049), ('wi', 0.048), ('die', 0.047), ('saito', 0.047), ('frequency', 0.047), ('wavelets', 0.044), ('sr', 0.043), ('nr', 0.043), ('person', 0.039), ('cacy', 0.038), ('euclidean', 0.038), ('entropy', 0.037), ('yang', 0.037), ('nn', 0.037), ('lda', 0.035), ('spatial', 0.034), ('om', 0.034), ('performances', 0.032), ('nc', 0.031), ('square', 0.031), ('ratio', 0.028), ('chien', 0.028), ('kld', 0.028), ('decomposition', 0.028), ('image', 0.026), ('subsection', 0.026), ('environmental', 0.025), ('samples', 0.025), ('hong', 0.025), ('marginally', 0.024), ('transform', 0.023), ('localization', 0.023), ('level', 0.023), ('coordinate', 0.023), ('ekenel', 0.022), ('harr', 0.022), ('kong', 0.021), ('underperforms', 0.019), ('expression', 0.019), ('scatter', 0.019), ('sss', 0.019), ('traditional', 0.019), ('hybrid', 0.019), ('distance', 0.018), ('mika', 0.018), ('belhumeur', 0.017), ('probe', 0.017), ('energy', 0.017), ('ensemble', 0.017), ('cmu', 0.017), ('azi', 0.017), ('cmulights', 0.017), ('dldb', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition

Author: Chao-Chun Liu, Dao-Qing Dai, Hong Yan

Abstract: Face recognition is a challenging problem due to variations in pose, illumination, and expression. Techniques that can provide effective feature representation with enhanced discriminability are crucial. Wavelets have played an important role in image processing for their ability to capture localized spatial-frequency information of images. In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posteriori logistic model are put forward. Moreover, a new triangle square ratio criterion is used to improve classification using the Euclidean distance and the cosine criterion. Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations

2 0.10579327 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models

Author: Margarita Osadchy, Yann Le Cun, Matthew L. Miller

Abstract: We describe a novel method for simultaneously detecting faces and estimating their pose in real time. The method employs a convolutional network to map images of faces to points on a lowdimensional manifold parametrized by pose, and images of non-faces to points far away from that manifold. Given an image, detecting a face and estimating its pose is viewed as minimizing an energy function with respect to the face/non-face binary variable and the continuous pose parameters. The system is trained to minimize a loss function that drives correct combinations of labels and pose to be associated with lower energy values than incorrect ones. The system is designed to handle very large range of poses without retraining. The performance of the system was tested on three standard data sets—for frontal views, rotated faces, and profiles— is comparable to previous systems that are designed to handle a single one of these data sets. We show that a system trained simuiltaneously for detection and pose estimation is more accurate on both tasks than similar systems trained for each task separately.1 Keywords: face detection, pose estimation, convolutional networks, energy based models, object recognition

3 0.057162978 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variable randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

4 0.050010182 15 jmlr-2007-Bilinear Discriminant Component Analysis

Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra

Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization

5 0.049913086 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Marco Loog

Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization 1. Classical and New Criteria Given N feature vectors of dimensionality n, a linear reduction of dimensionality, based on classical Fisher LDA, determines an n × d transformation matrix L that, for a given d < K, K the number of classes, maximizes the so-called Fisher criterion: F(A) = tr((A t SW A)−1 (At SB A)) or, equivalently, F0 (A) = tr((At ST A)−1 (At SB A)). Here, SB := ∑K pi (mi − m)(mi − m)t , SW := ∑K pi Si , and i=1 i=1 ST = SB + SW . The matrices SB , SW , and ST are the so-called between-class, pooled within-class, and total covariance matrices. In addition, mi is the mean vector of class i, pi is the prior of class i, and the overall mean m equals ∑k pi mi . Finally, Si is the covariance matrix of class i. i=1 A solution to these optimization problems can be obtained by means of a generalized eigenvalue decomposition, which Fukunaga (1990) relates to a simultaneous diagonalization of the two matrices involved (see also Campbell and Atchley, 1981). More common is it to apply a standard −1 eigenvalue decomposition to S−1 SB (or SW SB ), resulting in an equivalent set of eigenvectors. The d T columns of the optimal solution L are simply taken to equal the d eigenvectors corresponding to the d largest eigenvalues. It is known that this solution is not unique and the full class can be obtained by multiplying L to the right with nonsingular d × d matrices (see Fukunaga, 1990). Clearly, if the total covariance ST is singular, neither the generalized nor the standard eigenvalue decomposition can be readily employed. Directly or indirectly, the problem is that the matrix inverse S−1 does not exist, which is the typical situation when dealing with small samples. In an attempt to T overcome this problem, Ye (2005) introduced a different criterion that is defined as F1 (A) = tr((At ST A)+ (At SB A)) , ∗. Also at Nordic Bioscience Imaging, Hovegade 207, DK-2730 Herlev, Denmark. c 2007 Marco Loog. (1) L OOG where + denotes taking the Moore-Penrose generalized inverse of a matrix. Like for F0 , an optimal transform L is one that maximizes the objective F1 . Again, this solution is not unique. 2. Correct Characterization For the full characterization of the set of solutions to Equation (1), initially the problem is looked at from a geometrical point of view (cf., Campbell and Atchley, 1981). It is assumed that the number of samples N is smaller than or equal to the feature dimensionality n. In the undersampled case, it is clear that all data variation occurs in an N − 1-dimensional subspace of the original space. To start with, a PCA of the data is carried out and the first N − 1 principal components are rotated to the first N − 1 axes of the n-dimensional space by means of a rotation matrix R. This matrix consists of all (normalized) eigenvectors of ST taken as its columns. After this rotation, new total and between-class covariance matrices, ST = Rt ST R and SB = Rt SB R, are obtained. 
These 0 0 matrices can be partitioned as follows: ST = Σ0T 0 and SB = ΣB 0 , where ΣT and ΣB are N − 1 × 0 N − 1 covariance matrices and ΣT is nonsingular and diagonal by construction. The between-class variation is obviously restricted to the N − 1-dimensional subspace in which the total data variation takes place, therefore a same partitioning of SB is possible. However, the covariance submatrix ΣB is not necessarily diagonal, neither does it have to be nonsingular. Basically, the PCA-based rotation R converts the initial problem into a more convenient one, splitting up the original space in an N − 1-dimensional one in which “everything interesting” takes place and a remaining n − N + 1dimensional subspace in which “nothing happens at all”. Now consider F1 in this transformed space and take a general n × d transformation matrix A, which is partitioned in a way similar to the covariance matrices, that is, X . Y A= (2) Here, X is an N − 1 × d-matrix and Y is of size n − N + 1 × d. Taking this definition, the following holds (cf., Ye, 2005): t + t F1 (A) = tr((A ST A) (A SB A)) = tr =tr X t ΣT X 0 0 0 + X Y X t ΣB X 0 0 0 t ΣT 0 = tr 0 0 X Y + (Xt ΣT X)−1 0 0 0 X Y t ΣB 0 0 0 X Y X t ΣB X 0 0 0 = tr((Xt ΣT X)−1 (Xt ΣB X)) = F0 (X) . From this it is immediate that a matrix A maximizes F1 if and only if the submatrix X maximizes the original Fisher criterion in the lower-dimensional subspace. Moreover, if L is such a matrix maximizing F1 in the PCA-transformed space, it is easy to check that R−1 L = Rt L provides a solution to the original, general problem that has not been preprocessed by means of a PCA (see also Fukunaga, 1990). A characterization of the complete family of solutions can now be given. Let Λ ∈ RN−1×d be an optimal solution of F0 (X) = tr((Xt ΣT X)−1 (Xt ΣB X)). As already noted in Section 1, the full set of solutions is given by F = {ΛZ ∈ RN−1×d | Z ∈ GLd (R)}, where GLd (R) denotes the general linear group of d × d invertible matrices. The previous paragraph essentially demonstrates that if X ∈ F , A in Equation (2) maximizes F1 . The matrix Y can be chosen ad 2122 C OMPLETE C HARACTERIZATION OF A FAMILY OF S OLUTIONS libitum. Now, the latter provides the solution in the PCA-transformed space and to solve the general problem we need to take the rotation back to the original space into account. All in all, this leads to the following complete family of solutions L , maximizing F1 in the original space: L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R), Y ∈ Rn−N+1×d Y , (3) where Λ = argmaxX tr((Xt ΣT X)−1 (Xt ΣB X)) and Rt takes care of the rotation back. 3. Original Characterization Though not noted by Ye (2005), his attempt to characterize the full set of solutions of Equation (1) is based on a simultaneous diagonalization of the three covariance matrices S B , SW , and ST that is similar to the ideas described by Campbell and Atchley (1981) and Fukunaga (1990). Moreover, Golub and Van Loan (Theorem 8.7.1. 1996) can be readily applied to demonstrate that such simultaneous diagonalization is possible in the small sample setting. After the diagonalization step, partitioned between-class, pooled within-class, and total covariance matrices are considered. This partitioning is similar to the one employed in the previous section, which does not enforce matrices to be diagonal however. In the subsequent optimization step, the classical Fisher criterion is maximized basically in the appropriate subspace, comparable to the approach described above, but in a mildly more involved and concealed way. 
For this, matrices of the form Rt X are considered, consider Equations (2) and Y (3). However, Y is simply the null matrix and the family of solutions L provided is limited to L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R) . 0 Obviously, this is far from a complete characterization, especially when N − 1 n which is, for instance, typically the case for the data sets considered by Ye (2005). Generally, the utility of a dimensionality reduction criterion, without additional constrains, depends on the efficiency over the full set of solutions. As Ye (2005) only considers two very specific instances from the large class of possibilities, it is unclear to what extent the new criterion really provides an efficient way of performing a reduction of dimensionality. References N. A. Campbell and W. R. Atchley. The geometry of canonical variate analysis. Systematic Zoology, 30(3):268–280, 1981. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 1990. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996. J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483–502, 2005. 2123

6 0.042113937 23 jmlr-2007-Concave Learners for Rankboost

7 0.035186902 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

8 0.033442993 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

9 0.032585699 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

10 0.029854739 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

11 0.025483744 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

12 0.024533145 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

13 0.023231709 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis

14 0.023092749 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

15 0.02242581 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

16 0.022053754 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

17 0.021080129 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

18 0.020826157 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

19 0.018686241 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

20 0.018353049 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.125), (1, 0.043), (2, 0.023), (3, 0.076), (4, -0.117), (5, -0.03), (6, -0.097), (7, -0.095), (8, 0.051), (9, 0.125), (10, -0.173), (11, 0.042), (12, -0.015), (13, -0.208), (14, -0.151), (15, -0.128), (16, 0.036), (17, -0.258), (18, -0.062), (19, -0.063), (20, -0.1), (21, 0.145), (22, 0.207), (23, -0.057), (24, -0.15), (25, 0.072), (26, 0.047), (27, -0.001), (28, 0.174), (29, 0.169), (30, 0.181), (31, -0.072), (32, -0.078), (33, 0.084), (34, 0.005), (35, 0.092), (36, 0.08), (37, -0.053), (38, -0.116), (39, -0.021), (40, 0.076), (41, -0.013), (42, -0.018), (43, 0.114), (44, 0.095), (45, -0.054), (46, 0.005), (47, -0.132), (48, -0.111), (49, -0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96408284 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition

Author: Chao-Chun Liu, Dao-Qing Dai, Hong Yan

Abstract: Face recognition is a challenging problem due to variations in pose, illumination, and expression. Techniques that can provide effective feature representation with enhanced discriminability are crucial. Wavelets have played an important role in image processing for its ability to capture localized spatial-frequency information of images. In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posterior logistic model are put forward. Moreover, a new triangle square ratio criterion is used to improve classification using the Euclidean distance and the cosine criterion. Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations

2 0.69584298 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models

Author: Margarita Osadchy, Yann Le Cun, Matthew L. Miller

Abstract: We describe a novel method for simultaneously detecting faces and estimating their pose in real time. The method employs a convolutional network to map images of faces to points on a lowdimensional manifold parametrized by pose, and images of non-faces to points far away from that manifold. Given an image, detecting a face and estimating its pose is viewed as minimizing an energy function with respect to the face/non-face binary variable and the continuous pose parameters. The system is trained to minimize a loss function that drives correct combinations of labels and pose to be associated with lower energy values than incorrect ones. The system is designed to handle very large range of poses without retraining. The performance of the system was tested on three standard data sets—for frontal views, rotated faces, and profiles— is comparable to previous systems that are designed to handle a single one of these data sets. We show that a system trained simuiltaneously for detection and pose estimation is more accurate on both tasks than similar systems trained for each task separately.1 Keywords: face detection, pose estimation, convolutional networks, energy based models, object recognition

3 0.26034552 23 jmlr-2007-Concave Learners for Rankboost

Author: Ofer Melnik, Yehuda Vardi, Cun-Hui Zhang

Abstract: Rankboost has been shown to be an effective algorithm for combining ranks. However, its ability to generalize well and not overfit is directly related to the choice of weak learner, in the sense that regularization of the rank function is due to the regularization properties of its weak learners. We present a regularization property called consistency in preference and confidence that mathematically translates into monotonic concavity, and describe a new weak ranking learner (MWGR) that generates ranking functions with this property. In experiments combining ranks from multiple face recognition algorithms and an experiment combining text information retrieval systems, rank functions using MWGR proved superior to binary weak learners. Keywords: rankboost, ranking, convex/concave, regularization 1. Ranking Problems A ranking problem is a learning problem that involves ranks as the inputs, the outputs or both. An example where ranks are used as inputs is a collaborative filtering application where people are asked to rank movies according to their preferences. In such an application the ranks assigned by different people are combined to generate recommendations. Another type of problem in which ranks are used as inputs are meta-search problems, where the ranks of multiple search engines are combined (Dwork et al., 2001). However, the inputs to a ranking problem are not always ranks. An object recognition ranking problem (Kittler and Roli, 2000) may receive as inputs a graphical representation and output a ranking of the possible objects, sorted by likelihood. The outputs of a ranking problem may also be ranks. For example, in combining multiple search engines the output are ranks which are a synthesis of the ranks from the individual search engines. A similar meta-recognition problem is the combination of the outputs of multiple face- recognition systems to improve the accuracy of detecting the correct face. While the inputs and outputs of this problem are ranks, the outputs can be simplified to only return the most likely candidate. Another example where the outputs do not need to be complete ranks could be an information retrieval combination task. In such a task the inputs might be ranks of sets of documents with respect to c 2007 Ofer Melnik, Yehuda Vardi and Cun-Hui Zhang. M ELNIK , VARDI AND Z HANG a particular query by different experts. Again, the outputs could be the complete ranks, or more simply a designation for each document of whether it is relevant or not to the particular query. 2. Rankboost The rankboost algorithm (Freund et al., 2003) tries to address this variety in ranking problems by maintaining generality in how it regards its inputs and how it applies different loss functions to outputs. The rankboost algorithm works on instances which are the discrete items (e.g., movies or faces) that either are ranked as input or are to be ranked as output. To allow a general input mechanism, the inputs to rankboost are called the ranking features of an instance, specified as f j (x) which is the j-th ranking feature of instance x. While this generality in how inputs are handled is potentially powerful, in the original rankboost paper as well as in this paper, only the case where the inputs are different rankings of the instances is considered. Thus, in this paper, the inputs to rankboost are constituent ranks, denoted as f 1 . . . fn , where f j (x ) < f j (x ) implies that instance x has a better rank than instance x under ranking f j . 
For example, in some of the experiments we present, each f j is the ranking result of a particular face recognition algorithm. The output of rankboost is a new ranking function, H(x), which defines a linear ordering on the instances, that is, H (x ) < H (x ) iff x has a better rank than x . In rankboost T H(x) = ∑ wt ht (x), (1) t=1 a weighted sum of weak ranking learners, where the ht (x)’s are relatively simple learned functions of the constituent rankings. To address the variety in possible loss functions of the outputs, in rankboost the desirable properties for the output loss function are specified with a favor function, Φ : X × X → R, where X is the space of instances (note that this function has been renamed from “preference function” to avoid confusion with the use of preference in this paper in Section 4). Here Φ (x , x ) > 0 means that x should be better ranked than x for a given query. It is an anti-symmetric function, that is, Φ (x , x ) = −Φ (x , x ) and Φ (x, x) = 0, which avoids loops where two instances should both be ranked better than the other. Also Φ (x , x ) = 0 when there is no favor in how two instances should be relatively ranked. For example, as described in Section 6.1, for the face recognition combination problem described above the favor function can be used to specify that the correct identity should be given a better rank than all other identities, while zeroing all other entries in the favor function, giving no favor in how incorrect identities are ranked between them. In a similar fashion for the information retrieval combination task mentioned above, the favor function can be specified such that all relevant documents should be better ranked than irrelevant documents, without specifying favor for the ordering between relevant documents and the ordering between irrelevant documents (Section 7.1). Rankboost is shown in Algorithm 1 as described in Freund et al. (2003). It uses the favor function to specify an initial weight function over instance pair orderings: D x ,x = c · max 0, Φ x , x −1 , (2) where c = ∑x ,x max (0, Φ (x , x )) . The algorithm itself is very similar to adaboost (Freund and Schapire, 1996). At each iteration the algorithm selects the weak learner that best maximizes a 792 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 1 rankboost algorithm for generating a ranking function. Input: constituent ranks, a favor function, and a class of weak learners with outputs between 0 and 1 and an appropriate training algorithm. Output: ranking function H(x) (Eq. 1) Initialize D(1) = D (Eq. 2) for t = 1 . . . T do Find weak learner, ht , that maximizes r(h) = ∑x ,x D(t) (x , x )(h(x ) − h(x )) (Eq. 4). Choose wt = 0.5 ln ((1 + r(ht )) / (1 − r(ht ))) . (Eq. 5) Update: D(t+1) (x , x ) = D(t) (x , x ) exp (wt (ht (x ) − ht (x ))) Zt where Zt is chosen to make ∑x ,x D(t+1) (x , x ) = 1 end for rank function’s utility, r(h), (Eq. 4). It then assigns it a weight in proportion to its performance and adds the weighted weak learner to the ranking function, H(x). After which the weight function D is adjusted to reflect the impact of the weak learner. Freund et al. (2003) prove that the rankboost algorithm iteratively minimizes the ranking loss function, a measure of the quantity of misranked pairs: ∑ D x ,x I H x ≤ H x x ,x where I is the indicator function (1 if true, 0 otherwise) and H(x) is a ranking function output by rankboost. 
The rankboost paper (Freund et al., 2003) uses binary weak learners of the following functional form:  i f f j (x) > θ,  1 0 i f f j (x) ≤ θ, (3) h(x) =  qde f i f f j (x) unde f ined. Each binary weak learner, h, operates on a single f j (ranking feature), giving a binary output of 0 or 1 depending on whether the instance’s rank is smaller than a learned parameter, θ. Note that function h(x) in (3), which is dependent on only one f j (x) with a fixed j, is monotonic increasing but not convex or concave. If there is no rank for the instance then it returns a prespecified q de f value. 3. Rankboost, the Weak Learner and Regularization While rankboost inherits the accuracy of adaboost and has been shown to be very successful in practice, in two important ways it is very different from adaboost and similar classifier leveraging algorithms (Meir and Ratsch, 2003). The first is the choice of the weak learner, h. In adaboost the weak learner is expected to minimize the weighted empirical classification error: N ∑ d (t) (i)I [yi = h (zi )] , i=1 where yi is the class label, I is the indicator function and d (t) is a weighting over training samples. This is a standard function to minimize in classification with many possible types of algorithms to 793 M ELNIK , VARDI AND Z HANG choose from as possible weak learners. In contrast the weak ranking learner for rankboost (with outputs in [0, 1]) needs to maximize the following rank utility function: r = r(h) = ∑ D(t) (x , x )(h(x ) − h(x )), (4) x ,x where D(t) is the weight function over pairs of instances. As Eq. 4 contains the term h(x ) − h(x ), short of linear learners this equation can not be concave in h or easily approximated by a concave function. Therefore the weak learner needs to be optimized either by heuristics or by brute force, which limits the choice of h. It is not surprising that the original rankboost paper only included one type of weak learner, a binary threshold function (Eq. 3) that was tuned using brute force. The second difference between rankboost and adaboost also concerns the weak ranking learner. One feature of boosting that has sparked a great deal of interesting research is the algorithm’s ability to avoid overfitting for low noise classification problems (with modifications to higher noise problems), see Meir and Ratsch (2003) for a survey. In contrast for rankboost it is only by limiting the type of underlying weak learners that overfitting is avoided. In their paper, Freund et al. (2003) show that not using weak ranking learners with cumulative positive coefficients leads to overfitting and poor performance quite quickly. Therefore, choosing a weak learner that regularizes the ranking function, the output of rankboost, is very important for achieving accuracy and avoiding overfitting. It is clear from the above discussion that the choice of a weak ranking learner for rankboost is important and non trivial. First, the learner must be efficiently tunable with respect to Eq. 4, typically limiting its complexity. Second, the learner itself must demonstrate regularization properties that are appropriate for ranking functions. In this paper we present a new weak ranking learner that enforces consistency in preference and confidence for the ranking function by being monotonic and concave. We start with a discussion of these regularization properties, theoretically justify them, and show what they mean in terms of the final ranking function. 
Then we present a new learner, Minimum Weighted Group Ranks (MWGR), that satisfies these properties and can be readily learned. This learner is tested and compared with the binary learner of rankboost on combining multiple face recognition systems from the FERET study (Phillips et al., 2000) and on an information retrieval combination task from TREC (Voorhees and Harman, 2001). 4. Regularizing Ranking Functions with Consistency in Preference and Confidence In this paper, as Freund et al. (2003), we consider ranking functions H(x) which depend on x only through the values of the ranking features, y j = f j (x), for that instance, so that H(x) = G ( f1 (x), . . . , fn (x)), for certain functions G (y1 , . . . , yn ) = G(y). Here, we assume that the f j (x) is an actual rank assigned to an instance, x, by the j-th ranker. Note that if the original data are numerical scores then they can easily be converted to rankings. Freund et al. (2003) make a strong case for conversion of raw measures to relative orderings (rankings) over combining measures directly, arguing that it is more general and avoids the semantics of particular measures. As the y j ’s are ranks instead of points in a general space, care should be taken as to the functional form of G. A great deal of literature in social choice theory (Arrow, 1951) revolves around the properties of various rank combination strategies that try to achieve fair rankings. In this machine learning case our motivations are different. Fairness is not the goal; the goal is to improve the 794 C ONCAVE L EARNERS FOR R ANKBOOST accuracy or performance of the ranking function. Thus, regularization, by functionally constraining G, is used to confer information on how to interpret ranks in order to ultimately improve accuracy. Freund et al. (2003) imposed the regularization principle of monotonicity on the ranking function. Monotonicity encompasses a belief that for individual features a smaller rank is always considered better than a bigger rank. It means that for every two rank vectors, a and b, if a j ≤ b j , j = 1, . . . , n then G(a) ≤ G(b). Monotonicity was enforced by requiring that the coefficients, wt in Eq. 1, combining the binary weak learners (Eq. 3) were cumulatively positive. As shown by Freund et al. (2003), without enforcing monotonicity the rankboost algorithm quickly overfits. In this section we present another regularization principle, consistency in preference and confidence (which includes monotonicity). A ranking function with this regularization property should be consistent in its preference for the individual rankers and also in how it captures their confidence. The following example illustrates these two concepts. 4.1 Grocer Example A grocer needs to make 2 decisions, to decide between stocking oat bran vs. granola and to decide between stocking turnips vs. radishes. The grocer asks her consultants to express their opinion about stocking each item, and based on their responses makes her 2 decisions. First of all, in either problem, the grocer will adopt the opinion of her consultants if they all agree with each other, that is, they all prefer granola over oat bran in the first problem. Lets assume the grocer considered the first problem and chose granola over oat bran. What this implies is that the grocer adopted the opinions of the consultants that preferred granola over oat bran. Now consider the turnips vs. radishes decision. 
Lets say that the same consultants that liked granola more also liked radishes more (and the same ones that like oat bran more like turnips more). Also, for this decision the radish lovers actually feel more confident in their choice than they did for granola, while the turnip lovers are more unsure than they were for oat bran. Then for this second choice, if the grocer is consistent in her method of combining the opinions of her consultants, we would expect her to pick radishes over the turnips. In addition, we would expect her to be more confident in this second decision as a reflection of the consultants relative confidence. The above properties of preference and confidence imply that preference should be applied consistently across different problems and confidence should reflect the relative confidences of the individual consultants. 4.2 The General Principle of Consistency in Preference and Confidence To generalize the above grocer’s problem consider the problem of combining the opinions of a panel of consultants on several multiple-choice decision problems. Suppose for each decision problem each consultant provides his preference as one of the possible choices and a qualitative measure of the confidence level for his preference. The consistency principle in preference and confidence holds if the combiner always agrees with at least one of the consultants in each decision problem and that the following condition holds for any pair of decision problems. Definition 1. Consistency principle for a pair of decision problems: Suppose that for the first decision problem, the combiner adopts the preference of a subset A of consultants in the sense that A is the (non empty) set of all consultants whose preference is identical to that of the combiner. 795 M ELNIK , VARDI AND Z HANG Suppose that for the second decision problem, the preference of the consultants in A is again identical. Suppose further that compared with the confidence level for the first decision problem, each consultant in A has higher or the same confidence level for his preference in the second problem, while each consultant with a different preference than A for the second problem has lower or the same confidence level. Then, the preference of the combiner for the second problem is identical to that of the consultants in A, and the confidence level of the combiner for the second problem is at least as high as his confidence level for the first problem. Let B be the set of consultants whose preferences are different from that of the combiner in the first decision problem, that is, B = Ac . If some members of B switch sides or have no preference in the second problem, the combiner would be even more confident about the adoption of the preference of A in the second problem regardless of the confidence of those who switch sides. Thus, Definition 1 requires only those against the preference of A in the second problem (necessarily a subset of B since members of A act in unison in both problems) to have lower or equal confidence in the second problem, instead of all members of B. This is taken into consideration in Theorem 1 below. Note that while preference for individual consultants is specified within each decision problem, two decision problems are needed to compare the qualitative measure of confidence. However, comparison of confidence is not always possible (it is a partial ordering). 
In particular, the level of confidence between different experts may not be comparable, and the levels of confidence of a given expert (or the combiner) for different decision problems are not always comparable. 4.3 Confidence for Ranks and Ranking Functions In order to apply consistency to rank combination we need to specify what we mean by more or less confidence. For ranks we make the assumption that a constant change of ranks requires more confidence for low ranks than high ranks. For example, we would expect the difference between ranks of 1 and 3 to represent a more significant distinction on the part of a ranker than would for example the difference between ranks 34 and 36 which may be almost arbitrarily assigned. Specifically we make the following definition: Definition 2. Preference and confidence for univariate rank values For any pair of rank values {r, r } ⊂ R with r < r , by monotonicity r is preferred. For any two pairs of rank values {r, r } and {r , r } with r < r and r ≤ r , the confidence level for {r, r }is higher than that of {r , r } if r ≤ r , r −r ≤ r −r with at least one inequality. The pair{r, r }provides no preference if r = r . Likewise, if either r − r = r − r = 0 or r − r = r − r = 0, the two pairs {r, r } and {r , r }are defined to have the same level of confidence. In this definition confidence between pairs of ranks can only be compared if the pair with the lowest rank has a gap at least as large as the other pair. Thus, we are actually comparing two numbers, that is, we are doing a vector comparison. As such, this comparison does not cover all pairs of ranks and applies only a partial ordering on rank pairs. As a regularization principle, a partial ordering is desirable since it does not overly constrain the ranking function and allows for flexibility. 796 C ONCAVE L EARNERS FOR R ANKBOOST As the rank function, G, is a univariate function, we apply the same definitions of preference and confidence to it. That is, for a ranking function G and for any fixed rank vectors, y and y , the preferred rank vector is the one with smaller G, that is, y is preferred iff G(y) < G(y ). For confidence again we need to consider pairs of rank vectors, {y, y } and {y , y }, with G(y) ≤ G(y ) and G(y ) ≤ G(y ). If G(y) ≤ G(y ) and G(y ) − G(y) ≥ G(y ) − G(y ) then we say that the first decision, between yand y , shows equal or more confidence than the second decision between y and y . This numerical method of capturing confidence and preference in ranks, y, and ranking functions, G, allows us to apply Definition 1. Specifically, for a pair of binary decisions the confidence of a consistent ranking function increases for the second decision if in the second decision the confidence of the agreeing rank values are comparable and increase and the confidence of the disagreeing rank values are comparable and decrease, with the exception of those that switched sides. 4.4 Three-Point Consistency in Preference and Confidence for Combining Ranks As described, the consistency property is over two binary decision problems. In this section we consider ranking functions, G, that have consistency in preference and confidence for all pairs of binary decision problems involving three rank vectors y, y and y . We show that such functions have specific mathematical properties under this three-point consistency principle in preference and confidence for ranks.. 
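As a concrete reading of Definition 2, the following sketch compares the confidence carried by two rank pairs and reports when the partial order does not relate them (an illustrative sketch; the handling of the equal-confidence clause follows the interpretation given above, and rank 1 is taken as best):

```python
def pair_confidence_compare(p, q):
    """Compare the confidence of two rank pairs, each given as (r_low, r_high).

    Returns 'higher' if p carries more confidence than q, 'lower' for the reverse,
    'equal' when the pairs coincide or both have zero gap, and 'incomparable'
    when the partial order of Definition 2 does not relate them."""
    (a, b), (c, d) = sorted(p), sorted(q)
    gap_p, gap_q = b - a, d - c
    if (a, b) == (c, d) or gap_p == gap_q == 0:
        return "equal"
    if a <= c and gap_p >= gap_q:      # lower best rank and at least as large a gap
        return "higher"
    if c <= a and gap_q >= gap_p:
        return "lower"
    return "incomparable"              # neither pair dominates in both coordinates

# Distinguishing ranks 1 vs. 3 takes more confidence than distinguishing 34 vs. 36.
assert pair_confidence_compare((1, 3), (34, 36)) == "higher"
```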
Theorem 1: For a ranking function G whose domain is convex, three-point consistency in preference and confidence holds for G iff G(y) is nondecreasing in individual components of y and is jointly concave in y. Proof of Necessity: Assume that the ranking function G exhibits three-point consistency in preference and confidence for any three rank vectors. If y ≤ y component wise, then the individual components of y all agree in their preference to y, so that G(y) ≤ G(y ) by the consistency of G in preference. This implies the monotonicity of G in y. We pick three points y, y − a and y + a, such that all points are rank vectors. Consider the pair of binary comparison problems, y vs. y + a and y − a vs. y. Assume that G(y) < G(y + a). These three points are comparable by Definition 2 (r − r = r − r for all components in the rank vectors). Since the agreeing rankers, A = j y j < y j + a j , in the y vs. y + a comparison are greater than in the y − a vs. y comparison, and the disagreeing ranks, outside A, are smaller, then by the consistency of the combiner in preference and confidence (definition 1), G(y − a) ≤ G(y) and G(y) − G(y − a) ≥ G(y + a) − G(y). A function G is concave in a convex domain iff 2G(y) ≥ G(y − a) + G(y + a), for every y and a with y ± a in its domain. To verify this inequality for a particular y and a, we must have G(y + a) > G(y), G(y − a) > G(y) or the third case of G(y) ≥ max{G(y + a), G(y − a)}. We have already proven that the consistency properties of preference and confidence imply G(y) − G(y − a) ≥ G(y + a) − G(y) in the first case. By symmetry this requirement also must hold for −a, so that G(y) − G(y + a) ≥ G(y − a) − G(y) in the second case. This completes the necessity part of the proof since 2G(y) ≥ G(y − a) + G(y + a) automatically holds in the third case. Proof of Sufficiency: Assume the ranking function G is nondecreasing in individual components of y and jointly concave in y. We need to prove consistency in preference and confidence for a pair 797 M ELNIK , VARDI AND Z HANG 1) G(y) < G(y’) y ≤ y’’ A A y’’ A 4 A A 10 y’ 8 y’ 6 4) G(y) > G(y’) y ≥ y’’ A 10 8 A 2) G(y) < G(y’) y ≥ y’’ A 10 8 6 y A y 6 4 4 2 2 2 0 0 y y’ y’’ y’’ 0 2 4 6 8 10 0 2 4 B 6 8 10 0 0 B 2 4 6 8 10 B Figure 1: Consider two decision problems in two dimensions involving three points, y, y and y , where the first decision problem is to choose between y and y and the second problem is to choose between y and y . The cases where we expect consistency in preference and confidence (Definitions 1 and 2) can be enumerated by the result of the first decision problem and relative location of y . The darker gray box is the location where B has lesser or equal confidence in the second problem, and the lighter gray box is where B switches sides. of decision problems involving three rank vectors y, y and y . Without loss of generality, suppose that the first problem is to choose between y and y and the second problem is to choose between y and y . We break the proof up by the results of the comparison between y vs. y and the relative location of the third ranking vector y that satisfies the preference and confidence requirements. If G(y) = G(y ) = G(y ), the combiner has equal confidence in the two decision problems by definition 2, so that the consistency principle holds automatically. Thus, we only need to consider the cases where G(y) = G(y ). 
Figure 1 illustrates three of these cases two dimensionally, where there is a single agreeing component A, the y-axis, and a single disagreeing component B, the x-axis. Let A be the set of agreeing indices, A = j : sgn(y j − y j ) = sgn (G(y ) − G(y)) = 0 , and B = c the set of disagreeing indices. We use the notation y = (y , j ∈ A) to describe corresponding A j A subvectors. Also, when used, vector inequalities are component wise. Case 1: G(y) < G(y ) and yA ≤ yA In the first decision problem, as G agrees with A and disagrees with B yA < y A , yB ≥ y B . / We also know that A = 0 since otherwise y ≥ y and therefore by monotonicity G(y) ≥ G(y ), which is a contradiction. For the second decision problem we consider the values of y which are consistent with the confidence assumption (Definitions 1 and 2), that is, agreeing with more confidence along the A indices and disagreeing with less confidence or switching sides along the B indices. As y A ≤ yA , either by the confidence relationship or by switching sides (see Figure 1) these y values satisfy yA − y A ≥ y A − y A , 798 yB − y B ≤ y B − y B C ONCAVE L EARNERS FOR R ANKBOOST which implies y ≤ y , and therefore G(y ) ≤ G(y ) by monotonicity. Thus, G(y) < G(y ) ≤ G(y ), which implies that in the second decision problem of choosing between y and y , the preference of G is the same as that of A (i.e., y since yA ≤ yA ) and the confidence of G is at least as high as the first decision problem. Case 2: G(y) < G(y ) and yA ≥ yA Since in this case yA ≥ yA , then either by the confidence relationship or by switching sides y satisfies yA − y A ≥ y A − y A , yB − y B ≤ y B − y B . This implies that y + y ≤ 2y, which means that G(y ) + G(y ) ≤ 2G y +y /2 ≤ 2G(y) by the concavity and monotonicity of G. Thus, G(y) − G(y ) ≥ G(y ) − G(y) and since G(y ) > G(y), we have that G(y) − G(y ) > 0 and thus G(y) > G(y ). Therefore we see that the preference in G for the two decision problems is the same as A (with y in the first problem and with y in the second problem) and the confidence is no smaller for the second comparison. Case 3: G(y) > G(y ) and yA ≤ yA Since y is preferred in the first decision problem we have yA > y A , yB ≤ y B . For the confidence assumption to hold, that is, greater confidence for A in the second problem (Definition 2), the smaller ranks in the second problem have to be smaller than the smaller ranks in the first problem. But with yA ≥ yA and yA ≤ yA that can only be if yA = yA , which leads to y ≤ y , and by monotonicity implies G(y) ≤ G(y ). However that is a contradiction to G(y) > G(y ), and therefore this is not a viable case for comparing confidence. Case 4: G(y) > G(y ) and yA ≥ yA Since in this case yA ≥ yA , then either by the confidence relationship or by switching sides y satisfies yA − y A ≥ y A − y A , yB − y B ≤ y B − y B , which means that y ≤ y . Thus, by the monotonicity of G, G(y ) ≤ G(y ). Since G(y ) < G(y) the preference in the second decision problem is also with A (i.e., y ) and the confidence is at least as high as the first decision problem. 4.5 Applying Regularization to Rankboost It follows from the above theorem that to have consistency in preference and confidence we desire ranking functions that are monotonic and concave. In rankboost H(x) = ∑ wt ht (x) (Eq. 1). To make H an increasing and concave function of constituent rankings y j = f j (x) we need to constrain the weak ranking learners. 
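The constraint used next can be checked numerically: a positive combination of monotone, concave learners is itself monotone and satisfies the midpoint-concavity inequality 2G(y) ≥ G(y − a) + G(y + a). A toy check (illustrative only, with arbitrary coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)

def concave_monotone_learner(gamma):
    """A monotone, concave weak learner of the form min(gamma_1*y_1, ..., gamma_n*y_n, 1)."""
    return lambda y: min(float(np.min(gamma * y)), 1.0)

learners = [concave_monotone_learner(rng.uniform(0.01, 0.2, size=3)) for _ in range(5)]
weights = rng.uniform(0.1, 1.0, size=5)          # positive combination weights
H = lambda y: float(np.dot(weights, [h(y) for h in learners]))

for _ in range(1000):
    y = rng.uniform(2, 50, size=3)
    a = rng.uniform(-1, 1, size=3)
    # midpoint concavity: 2*H(y) >= H(y - a) + H(y + a)
    assert 2 * H(y) + 1e-9 >= H(y - a) + H(y + a)
    # monotonicity: increasing any component cannot decrease H
    assert H(y + np.abs(a)) + 1e-9 >= H(y)
```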
If the learners themselves are monotonically increasing and concave functions of y_j, then linearly combining them with positive weights will give an H that is also an increasing and concave function of y_j.

In this paper, we apply the "third method" (Freund et al., 2003) for setting a weight value w_t. That is, weak learners are selected on their ability to maximize r from Eq. 4, and then

w_t = 0.5 ln((1 + r_max) / (1 − r_max)).   (5)

Therefore, using monotonic and concave weak learners, we select only ones to which rankboost assigns a positive r value, which renders a positive weight w_t. If no r values are positive the rankboost algorithm stops. We mention that a ranking function can be construed as the negative of a score function, that is, G(y) = −S(y). For score functions these regularization properties become monotonically decreasing and convex (Melnik et al., 2004).

5. Minimum Weighted Group Ranks

The functional structure of the new learner we propose is

h(y) = min{γ_1 y_1, . . . , γ_n y_n, 1},   (6)

where y = (y_1, . . . , y_n) = (f_1(x), . . . , f_n(x)) is the vector of ranking features, and the γ_j are learned positive coefficients. Note that the function's range is in [0, 1] due to the 1 term in the min function. Using rankings as our features, the learner function (Eq. 6) is monotonically increasing. It is also a concave function in y. Thus, if these learners are linearly combined with positive weights, the resulting ranking function, H, will have three-point consistency in confidence and preference.

To gain some intuition, the functional form of the learner is related to the Highest Rank combination method (Ho, 1992), which assigns combined ranks as the best rank each class receives from any ranker. This is a confidence-based approach, as Highest Rank bets on the classifier that is most confident, the one giving the best rank. As such, a single learner can potentially be error prone. But as we combine many of these learners during boosting, the combination becomes more powerful, allowing the confidence of different classifiers to be weighed with preference for their potential accuracy.

5.1 Learning

At each boosting iteration, rather than selecting from all possible weak learners of form (Eq. 6), we limit our choices by building new weak learners out of the ones that have been previously trained. Let F = {f_1(x), . . . , f_n(x)} be the set of ranking features. Recall that in rankboost H(x) = sum_t w_t h_t(x), where h_t(x) = min(γ_1^(t) f_1(x), . . . , γ_n^(t) f_n(x), 1) and the γ_j^(t) are learned coefficients. We set

H_t = { h(x) : h(x) = min(γ_1^(s) f_1(x), . . . , γ_n^(s) f_n(x)), s ≤ t }

and select h_{t+1}(x) from weak learners of the form

h_new(x) = min(α h(x), β f(x), 1)   (7)

with h(x) ∈ H_t and f(x) ∈ F. This learner can be rewritten in the form of Eq. 6, with the γ_j^(t+1) derived from the learned α, β, h(x) and f(x). Thus, at each iteration we look at combinations of the features and existing learners. As discussed in the next section, we can either consider all such combinations, or use a heuristic selection strategy.

We propose to optimize α and β separately, in stages. That is, given a value for one of the variables we optimize the other. As α and β are symmetric, we show how to optimize β given α. We need to find a value of β that maximizes r in Eq. 4. Freund et al. (2003) pointed out that this equation can be rewritten as

r = sum_{x0, x1} D^(t)(x0, x1)(h(x1) − h(x0)) = sum_x π(x) h(x),

where π(x) = sum_{x'} [D^(t)(x', x) − D^(t)(x, x')]. Given the form of Eq.
7 we can write r as a function of β, ∑ (β f (x) − 1) π(x) + ∑ r (β) = αh(x) − 1 π(x). β f j (x)≤min(αh(x),1) (8) αh(x) < 1 , then O(βl+1 ) = O(βl ) − π(xl ), P(βl+1 ) = P(βl ) − f (xl )π(xl ), Q(βl+1 ) = Q(βl ) +W π(xl ), R(βl+1 ) = R(βl ) +W αh(xl )π(xl ). Combining these formulas gives algorithm 2 for optimizing β. Algorithm 2 Algorithm for optimizing β Given α, h(x) ∈ Ht , f (x) ∈ F and the training instances. For all x’s generate and sort the set of candidate βs, B, such that β 1 ≤ β2 ≤ · · · ≤ β|B| and βl = min(αh(xl ),1) . f (xl ) O = ∑xl π(xl ) P = ∑xl f (xl )π(xl ) Q=0 R=0 rbest = 0 for j = 1 . . . |B| do r = βl P − O + R − Q if r > rbest then rbest = r end if O = O − π(xl ), P = P − f (xl )π(xl ) if αh(xl ) < 1 then Q = Q + π(xl ), R = R + αh(xl )π(xl ) end if end for 5.2 Heuristics for Learner Selection If at each boosting iteration we select from all combinations of h(x) and f (x) we end up with an O(T 2 ) algorithm, where T is the number of iterations. However, we can sacrifice accuracy for speed by only evaluating and selecting from a fixed sized pool of previously trained learners h(x) and features f (x), where at each iteration the pool is randomly chosen from the full Ht and F . To improve performance, instead of using a uniform sampling distribution we can apply a heuristic to focus the distribution on combinations with better potential. As Eq. 8 is composed of two sums, for r to be large the terms f (x)π(x) and h(x)π(x) need to be large. We can consider s f = ∑x f (x)π(x) and sh = ∑x h(x)π(x) as indicators of how well these components work with π(x). Thus, we might expect larger r values to occur when these two score values are larger. Of course, we are discounting interactions, which is the reason for the combination. Using these score values, we can order all h(x) and f (x) separately, and sample such that learners and features with better scores are more likely to be selected. We opted for a polynomial weighting 802 C ONCAVE L EARNERS FOR R ANKBOOST Training error on Dup II 9 P=1 P=1/2 P=1/4 Avg rank of correct class 8.8 8.6 8.4 8.2 8 7.8 7.6 7.4 7.2 1 2 3 4 5 6 7 8 9 Iteration number Figure 2: These plots show convergence of training error for 3 values of the heuristic selection pressure value p. These plots are averages over 10 runs and are typical of the other training data sets as well. As can be seen the heuristic improves the convergence rate of the training error. method. Thus, for example, all f ∈ F are sorted by their score and are assigned a number based on the rank of these scores,(maxrank − rank)/maxrank, that gives each f an equally sized bin in the range 0 and 1. Given a random number ξ ∼ U(0, 1), we calculate ξ p and select the f that corresponds to the bin this number falls into. Here p < 1 can be construed as a selection pressure, where bins corresponding to higher scores are more likely to be selected. Figure 2 demonstrates the effect of different values of the p parameter in one of our experiments. 6. Face Recognition Experiments We present experiments on the combination of face recognizers, comparing the binary learner with the MWGR learner. Given an image of a face, a face recognizer assigns a similarity score to each of the faces it is trained to recognize. These scores give a linear order or ranking to the gallery of faces. Different face recognition algorithms have different performance characteristics. Thus, we explore how combining the outputs of multiple face recognizers can improve recognition accuracy. 
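Before turning to the experiments, the pieces of Section 5 can be summarized in a short sketch: the MWGR learner of Eq. 6, the candidate construction of Eq. 7, the weight rule of Eq. 5, and the selection-pressure sampling of Section 5.2 (an illustrative sketch with assumed names, not the implementation used for the reported results):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def mwgr(gammas):
    """MWGR weak learner of Eq. 6: h(y) = min(gamma_1*y_1, ..., gamma_n*y_n, 1)."""
    g = np.asarray(gammas, dtype=float)
    return lambda y: float(min(np.min(g * np.asarray(y, dtype=float)), 1.0))

def combine(h, f, alpha, beta):
    """Candidate learner of Eq. 7: h_new(x) = min(alpha*h(x), beta*f(x), 1)."""
    return lambda x: min(alpha * h(x), beta * f(x), 1.0)

def boosting_weight(r_max):
    """Weight rule of Eq. 5; requires -1 < r_max < 1, and only r_max > 0 learners are kept."""
    return 0.5 * math.log((1.0 + r_max) / (1.0 - r_max))

def sample_by_score(items, scores, pressure=0.5):
    """Selection-pressure sampling of Section 5.2: items are sorted by score and
    assigned equal bins in [0, 1]; drawing xi ~ U(0, 1) and using xi**pressure
    (pressure < 1) biases the draw toward the bins of higher-scoring items."""
    order = np.argsort(scores)                      # ascending: worst score first, best last
    xi = rng.uniform()
    idx = min(int((xi ** pressure) * len(items)), len(items) - 1)
    return items[order[idx]]                        # xi**pressure is biased toward the best-score end
```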
6.1 Algorithm Methods

We consider a data set I of face images (queries) to train on. For each query image, i in I, we need to rank all u ∈ U faces in the gallery. In rankboost the favor function, Φ, that specifies output loss, is a function of the query and the item to be ranked. Therefore, the notational convention is to combine the query, i, and the item to be ranked, u, as an instance, x ≡ (i, u). As such, f_j(x) = f_j((i, u)) is the rank assigned to identity u for query image i by recognition algorithm j.

[Figure 3: This plot is typical of the convergence behavior of rankboost with MWGR on the FERET data. Both training and test errors tended to converge within 10-30 iterations of boosting with no significant post-convergence divergence.]

As there is only one correct identity, we only care about the ranking of the one correct identity for each query. We set the favor function as stated by Freund et al. (2003) for this type of output loss. Let u* be the correct identity for training image i; then Φ((i, u), (i, u*)) = +1 and Φ((i, u*), (i, u)) = −1 for all u ≠ u*, setting all remaining elements of Φ(x0, x1) = 0. That is, the correct identity of a query image is given positive favor compared to all other identities for that image, while all other rankings, including interactions between training images, are given zero favor. Note that since there is no favor interaction between queries (different i's), Φ(x0, x1) is effectively a function of three variables, (i, u0, u1).

Both weak learners were trained for 100 iterations, giving them ample time to converge. See Figure 3 for an illustration of convergence times. The binary learner was trained as specified by Freund et al. (2003). At each iteration the MWGR learner was selected from a pool of candidate combinations of f(x) and h(x), with a selection pressure of p = 0.5. For each candidate, first β was optimized with α = 1, then α was optimized using the optimized β. The candidate with the most positive r value was always selected. This is summarized in Algorithm 3.

6.2 Experimental Setup

FERET (Phillips et al., 2000) was a government-sponsored program for the evaluation of face recognition algorithms. In this program commercial and academic algorithms were evaluated on their ability to differentiate between 1,196 individuals. The test consisted of different data sets of varying difficulty, for a total of 3,816 different images. The data sets, in order of perceived difficulty, are: the fafb data set of 1,195 images, which consists of pictures taken the same day with different facial

Algorithm 3 Algorithm for selecting a learner from the pool.
Given poolsize and selection pressure.
Calculate s_{f_j} and s_h for all rank features and existing learners.
for p = 1 . . . poolsize do
  Generate d_f ∼ U(0,1) and d_h ∼ U(0,1)
  Select which h = min(αh(x), βf(x), 1) to try, using s_{f_j}, s_h, d_f, d_h.
Setting α = 1, optimize β (algorithm 2) Keeping the optimized β, optimize α (algorithm 2) Get r of learner with this α, β if r > 0 and r > rbest then rbest = r, hbest = h, hbest = min αh(x), β f (x) end if end for ht = hbest, Ht+1 = Ht ∪ hbest expressions; the fafc data set of 194 images that contains pictures taken with different cameras and lighting conditions; the dup I data set of 488 images that has duplicate pictures taken within a year of the initial photo; and the most difficult, the dup II data set of 234 images which contains duplicate pictures taken more than a year later. Note that in our experiments we separate the images of dup II from the dup I data set, unlike the FERET study where dup II was also a subset of dup I. The FERET study evaluated 10 baseline and proprietary face recognition algorithms. The baseline algorithms consisted of a correlation-based method and a number of eigenfaces (principle components) methods that differ in the internal metric they use. Of the proprietary algorithms, most were from different academic institutions and one was commercial. Of the 10 algorithms we selected three dominant algorithms. From the baseline algorithms we chose to use the ANM algorithm which uses a Mahalanobis distance variation on angular distances for eigenfaces (Moon and Phillips, 2001). While this algorithm’s performance is not distinctive, within the class of baseline algorithms it was strong. Moreover, in accuracy with respect to average rank of the correct class on the dup I data set it demonstrated superior performance to all other algorithms. The other two algorithms we used were the University of Maryland’s 1997 test submission (UMD) and the University of Southern California’s 1997 test submission (USC). These algorithms clearly outperformed the other algorithms. UMD is based on a discriminant analysis of eigenfaces (Zhao et al., 1998), and USC is an elastic bunch graph matching approach (Wiskott et al., 1997). The outputs of the 10 face recognizers on the four FERET data sets, fafb, fafc, dup I and dup II were the data for the experiments. Thus, we never had access to the actual classifiers, only to data on how they ranked the different faces in these data sets. We conducted experiments based on homogeneous and heterogeneous data sets, testing the efficiency and robustness (adaptivity) of the MWGR procedure. For the homogeneous case we took all 4 FERET data sets and randomly shuffled them together. We call this the homogeneous data set as both the training and testing data are selected from the same combined pool. On this combined data set we did 4-fold cross validation. For each fold 75% of the data was used for training and the rest for testing. We combined the results of all four runs together for evaluation purposes. 805 M ELNIK , VARDI AND Z HANG For the heterogeneous case, in each experiment one of the FERET data sets was selected as a training set and another data set was selected for testing. This gave 12 experiments (not including training and testing on the same data set) per group of face recognizers, where we get combinations of training on easy data sets and testing on hard data sets, training on hard and testing on easy data sets, and training and testing on hard data sets. To reduce noise in our experiments the training ranks were truncated at 150, and outliers were removed. In face recognition and other classification applications usually only the top ranks are important. Thus, in evaluating the results we focused on the top 30 ranks. 
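The evaluation described next rests on a truncated average rank of the correct class and a paired one-sided t-test; a minimal sketch of such an evaluation (illustrative only, using a normal approximation for the one-sided p-value rather than the exact t distribution) is:

```python
from math import erf, sqrt
import numpy as np

def truncated_avg_rank(correct_ranks, cutoff=30):
    """Average rank of the correct class, truncated at `cutoff` (the measure A below)."""
    r = np.minimum(np.asarray(correct_ranks, dtype=float), cutoff)
    return float(r.mean())

def paired_one_sided_t(a, b):
    """Paired one-sided test of whether ranks in `a` (e.g., MWGR) tend to be
    smaller than in `b` (e.g., the binary learner); returns (t, approximate p)."""
    d = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)   # positive = improvement
    t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    p = 1.0 - 0.5 * (1.0 + erf(t / sqrt(2.0)))                    # normal approximation
    return t, p
```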
All ranks returned by a boosted combiner for the correct class above 30 were truncated to 30. In evaluating the performance of the combiners, not all the test data are equally useful. We consider the following two cases as non-informative. When the two best face recognizers, UMD and USC, both give the correct class a rank of 1, there is very little reason for the combined rank to be different. Also, when both the binary-learner-based combiner and the MWGR-based combiner give the correct class a rank greater than the truncation value (30), it makes little sense to compare the combiners. The testing data was filtered to remove these cases, and the results are presented without them.

Before presenting the results, it should be said that rankboost with both learner types gives ranking functions that significantly outperform all the individual face recognition algorithms. In addition, in our tests both learners also clearly outperformed other standard rank combination methods, such as the Borda count, highest rank and logistic regression (Ho et al., 1992).

We present two sets of experiments—the combination of the 3 selected classifiers (ANM, UMD and USC) and the combination of all 10 classifiers. These are qualitatively different tasks. In combining 3, we seek to capitalize on the unique strengths of each classifier. In combining 10, we are introducing classifiers which may be correlated and classifiers which are comparably much noisier. The size of the pool was 6 when we combined 3 classifiers and 20 when combining all 10 classifiers.

For all experiments we measure the average rank of the correct class for both learners:

A = (1/N) sum_{i=1}^{N} min{Rank(x_i*), 30},

where Rank(x_i*) is the rank of the correct class for query i, and the sum is over all useful test queries, as described above. The average rank difference between the learners is calculated to show the improvement in performance. To evaluate the significance of the improvement we ran paired one-sided t-tests and evaluated the significance of the p-value (a value less than 0.05 is significant). In addition we show the standard deviation of the rank difference.

6.3 Results

In the experiments with the homogeneous data sets, combining all classifiers gives an improvement in the average rank of the correct class of 0.296 for MWGR, with a standard deviation of 3.9, a paired t-test statistic of 1.76 and a p-value of 0.039, whereas combining the ANM, UMD and USC classifiers gives an average rank improvement of 0.1 for MWGR, with a standard deviation of 3.6, a paired t-test statistic of 0.68 and a p-value of 0.24.

Table 2 contains the results for combining the 3 classifiers in the experiments with heterogeneous data sets (compare the combiner results in columns bin mean and mwgr mean with the average rank of the correct class for all constituent classifiers in Table 1).

         ANM    ARL    EFAVG  EFML1  EFML2  EXCA   MSU    RUT    UMD    USC
dup i    9.52   16.85  15.02  18.23  15.85  11.9   17.64  17.87  13.21  12.44
dup ii   18.14  16.96  20.61  22.21  20.33  16.28  23.44  16.48  13.74  6.85
fafb     10.88  8.94   14.43  16.23  12.21  11.67  6.81   12.93  4.57   5.84
fafc     19.71  28.02  26.17  17.39  19.78  20.30  14.27  22.79  8.96   5.17

Table 1: The best average rank of the correct class on the different data sets for all constituent face recognition systems.

The diff mean column contains the improvement of MWGR over the binary learner in terms of average rank of the correct class. Of the 12 experiments, we see an improvement in 10 cases. Six of those 10 have significant p-values.
The two no improvement experiments do not have significant p-values. Table 3 contains the results for combining all 10 classifiers. Of the 12 experiments, we see an improvement for MWGR in 11 cases. Eight or nine of those 11 have significant p-values. The one no improvement experiment does not have a significant p-value. It is interesting to note that we do not seem to see overfitting when increasing the number of constituents. In some case we see improvement, in others we see slight degradation, but all in all the combiner seems resilient to the noise of adding less informative constituents. All sets of experiments, homogeneous data, heterogeneous data sets, combining 3 select recognizers and combining all 10 recognizers at once yielded significant improvements in accuracy, as is visible in the change in the average rank of the correct class and the significance of the statistical tests. 7. Information Retrieval Experiments The annual Text REtrieval Conference (TREC) generates high-quality retrieval results of different systems on different retrieval tasks (Voorhees and Harman, 2001). We use the result data sets of the TREC-2001 web ad hoc task that uses actual web queries taken from web logs. This task has been used in other rank fusion experiments (Renda and Straccia, 2003). As did Renda and Straccia (2003) we combine the results of the following top 12 systems: iit01m, ok10wtnd1, csiro0mwal, flabxtd, UniNEn7d, fub01be2, hum01tdlx, JuruFull, kuadhoc, ricMM, jscbtawtl4, apl10wd. Similar to other TREC information retrieval tasks, the TREC-2001 ad hoc task consists of 50 queries. For each query a list of relevant documents is supplied. Each system returns an ordered list of 1,000 documents for each query. The fusion goal is to combine these 12 individual lists into one list of 1,000 documents, which hopefully has greater precision than the individual systems. For the 807 M ELNIK , VARDI AND Z HANG test set dup i dup i dup i dup ii dup ii dup ii fafb fafb fafb fafc fafc fafc train set dup ii fafb fafc dup i fafb fafc dup i dup ii fafc dup i dup ii fafb bin mean 7.19 7.36 8.31 5.25 5.15 5.19 2.62 3.38 2.60 2.85 2.40 2.56 mwgr mean 5.96 6.21 6.27 5.38 4.4 4.95 2.22 2.60 2.28 2.78 2.43 2.03 diff mean 1.22 1.14 2.04 -.12 0.74 0.23 0.39 0.78 0.32 0.06 -.03 0.52 diff std 5.22 5.15 5.51 4.0 4.61 5.1 3.09 3.79 2.81 3.33 2.27 2.57 pval .7e-4 .001 .1e-6 .658 .018 .275 .115 .029 .142 .424 .537 .028 Table 2: Results of combining the ANM, UMD, and USC classifiers using individual FERET data sets. test set dup i dup i dup i dup ii dup ii dup ii fafb fafb fafb fafc fafc fafc train set dup ii fafb fafc dup i fafb fafc dup i dup ii fafc dup i dup ii fafb bin mean 6.73 8.06 6.68 5.75 6.24 5.86 2.67 3.56 2.68 3.36 3.17 2.22 mwgr mean 5.89 7.09 4.87 5.67 5.47 5.31 2.07 2.45 2.17 2.51 2.95 2.23 diff mean 0.84 0.96 1.81 0.08 0.76 0.55 0.59 1.11 0.51 0.85 0.22 -0.01 diff std 4.79 6.01 5.24 4.61 4.22 4.87 2.97 4.51 2.03 3.23 3.95 2.08 pval .009 .014 .5e-6 .408 .009 .074 .033 .012 .01 .007 .3 .52 Table 3: Results of combining all 10 classifiers using individual FERET data sets. 50 queries of the TREC-2001 ad hoc task, the number of relevant documents that intersect with the union of system results range between 662 and 2664. 7.1 Methods In the information retrieval task the favor function needs to show favor for relevant documents while disregarding other documents. 
Thus, we can set the favor function similarly to the way it was set in the face recognition classification task. Consider a data set with I queries to train on. For each query, i in I, we need to rank all u ∈ U documents. An instance in this case is a pair, x ≡ (i, u). For each relevant document u* and each irrelevant document u for query i, we set Φ((i, u), (i, u*)) = +1 and Φ((i, u*), (i, u)) = −1, setting all remaining elements of Φ(x0, x1) = 0. Thus, all relevant documents are given positive favor with respect to irrelevant documents, while all other rankings, including interactions between relevant documents and interactions between irrelevant documents, are given zero favor.

JuruFull   0.759    hum01tdlx   0.760    UniNEn7d    0.763    iit01m        0.762
apl10wd    0.735    jscbtawt14  0.761    csiro0mwa1  0.721    kuadhoc2001   0.727
flabxtd    0.741    ok10wtnd1   0.745    fub01be2    0.734    ricMM         0.765

Table 4: Normalized mean average precision for each constituent.

As the task has only 50 queries, rather than separating the data into train and test sets, we opted for a cross-validation performance evaluation. We use 5-fold cross validation on the normalized mean average precision measure (Salton and McGill, 1983), a standard IR measure which accounts for the precision (quantity of correct documents retrieved) and their rank positions in the list. It is described by the following equation:

AveP = [ sum_{r=1}^{N} Prec(r) × rel(r) ] / (number of relevant documents),

where r is the rank, N is the number of documents retrieved, rel() is a binary function of the relevance of the document at rank r, and Prec(r) is the precision ([number of relevant documents retrieved] / [number of documents retrieved]) at a given cut-off rank. It is clear that higher values of AveP are better. Note that for purposes of normalization we assigned unretrieved relevant documents a constant rank.

As the binary and MWGR weak learners had significantly different convergence properties on this task (see Figure 4), MWGR was trained for 100 iterations and the binary learner for 300 iterations. As in the FERET experiments, MWGR was optimized using Algorithm 3, with a selection pressure of p = 0.5.

7.2 Results

As seen in Figure 4, the MWGR learner converges in significantly fewer iterations than the binary learner. This could possibly be attributed to the fact that MWGR is a more complex function that can incorporate the rankings of multiple classifiers in each learner. Also, the MWGR function is not tuned to particular rank cutoffs, whereas the binary learner is, so MWGR can better accommodate the variety in the 1000 ranks being considered. The normalized mean average precision for MWGR after 100 iterations was 0.8537, and it was 0.8508 for the binary learner after 300 iterations. Compare these results with the precision of the constituents in Table 4. Both weak learners had a performance rate of change of approximately 3 × 10^-5 on their final iteration (better for MWGR). A paired t-test on the cross-validation results of the two learners gives a statistically significant p-value of 0.007 in favor of MWGR.

[Figure 4: The cross validation mean average precision score of the two weak learners, MWGR and binary, as a function of boosting iteration number.]

8.
Discussion The question of how to combine ordinal data has become an active focus of research in machine learning, as applications in pattern recognition, information retrieval and other domains have come to the forefront. A particular question of importance is how can the structure of ranks be correctly exploited to maximize performance. The semi parametric nature of rankboost offers the possibility to generate arbitrarily flexible ranking functions. But as observed this flexibility comes at a cost of significant overfitting without further regularization. Freund et al. (2003) demonstrate that successful generalization only occurs when the resulting ranking functions are constrained to be monotonic. This constraint can be thought of as a regularization that incorporates prior knowledge on the interpretation of ranks and as such, how they can be combined. We present a regularization framework based on the concept of consistency in confidence and preference. Ranking functions with this property show a consistency in how they treat the preference and relative confidence exhibited by constituents. We prove that under a natural interpretation of preference and confidence for ranks, this consistency property of the combiner is equivalent to monotonicity and concavity of its ranking function. We enhance rankboost by designing a weak ranking learner that exhibits consistency in preference and confidence. A computational advantage of this weak learner, called minimum weighted group ranks (MWGR) is that its parameters can be individually optimized readily with respect to the rankboost criteria, allowing it to be tested on real-world data. In our first experiments we compare the original rankboost binary weak learner with MWGR on a task of combining the output of multiple face recognition algorithms from the FERET study. We conducted experiments on homogeneous data, testing the intrinsic efficiency of the MWGR proce810 C ONCAVE L EARNERS FOR R ANKBOOST dure. We also conducted experiments on heterogeneous data, testing the robustness or adaptivity of the procedure. In almost all cases we see that MWGR shows improved performance compared with the binary weak learner, whether combining three or all of the face recognizers, confirming the utility of this monotonic and concave learner. Our second experiment was on an Information Retrieval task taken from the TREC conference. In this task we see MWGR converges in significantly less iterations and generates statistically significant improved performance. Final Words Ofer Melnik and Cun-Hui Zhang are very saddened that our colleague and friend Yehuda Vardi passed away before he could give this paper his final stamp of approval. He was very enthusiastic about this research and would have been very pleased to see it come to fruition. Acknowledgments This research is partially supported by NSF Grants DMS 04-05202, DMS 05-04387 and, DMS 0604571, ONR Grant N00014-02-1-056 and NSA Grant H98230-04-1-0041. Ofer Melnik would also like to thank DIMACS for their support in this research. The authors are also grateful to the editor and reviewers for their constructive suggestions which improved the presentation of the results. Appendix A. In this paper we showed how three-point consistency in preference and confidence implies concave and monotonic ranking functions. 
For two decision problems involving two pairs of rank vectors, the four-point consistency property implies the following constraints for a ranking function   G(y) < G(y )   yB − y B ≤ 0 < y A − y A  zA ≤ yA , yA − yA ≤ zA − zA ⇒ G(z ) − G(z) > G(y ) − G(y)   yB ≤ z B , zB − z B ≤ y B − y B (11) where y and y are the rank vectors from the first decision problem and z and z are the rank vectors from the second decision problem. Unlike the three-point case, the four-point consistency property does not imply a clearly recognizable functional form for the ranking function. What we can say about it though is that as the constraints are linear, in the same way that concavity and monotonicity in the weak learner conferred the same properties to the ranking function, a weak learner that satisfies Eq. 11 will also confer those properties to the ranking function that uses it with positive weights. References K. Arrow. Social Choice and Individual Values. Wiley, 1951. C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proc. 10th Intl. World Wide Web Conf., pages 613–622, 2001. 811 M ELNIK , VARDI AND Z HANG Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996. T. K. Ho. A Theory of Multiple Classifier Systems and Its Application to Visual Word Recognition. PhD thesis, State University of New York at Buffalo, May 1992. T. K. Ho, J. J. Hull, and S. N. Srihari. Combination of decisions by multiple classifiers. In H. S. Baird, H. Bunke, and K. Yamamoto (Eds.), editors, Structured Document Image Analysis, pages 188–202. Springer-Verlag, Heidelberg, 1992. J. Kittler and F. Roli, editors. Multiple Classifier Systems, Lecture Notes in Computer Science 1857, 2000. Springer. R. Meir and G. Ratsch. Advanced Lectures in Machine Learning, Lecture Notes in Computer Science 2600, chapter An introduction to boosting and leveraging, pages 119–184. Springer, 2003. O. Melnik, Y. Vardi, and C-H. Zhang. Mixed group ranks: Preference and confidence in classifier combination. IEEE Pattern Analysis and Machine Intelligence, 26(8):973–981, 2004. H. Moon and P.J. Phillips. Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30:303–321, 2001. P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET evaluation methodology for facerecognition algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:1090– 1104, 2000. M.E. Renda and U. Straccia. Web metasearch: Rank vs. score based rank aggregation methods. In 18th Annual ACM Symposium on Applied Computing (SAC-03), pages 841–846, Melbourne, Florida, USA, 2003. ACM. G. Salton and J.M. McGill. Introduction to Modern Information Retrieval. Addison Wesley Publ. Co., 1983. E.M. Voorhees and D.K. Harman, editors. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), number SN003-003-03750-8, 2001. Department of Commerce, National Institute of Standards and Technology, Government Printing Office. URL http://trec.nist.gov. L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):775– 779, 1997. W. Zhao, A. Krishnaswamy, R. Chellappa, D. Swets, and J. Weng. 
Face Recognition: From Theory to Applications, chapter Discriminant Analysis of Principal Components, pages 73–86. Springer-Verlag, Berlin, 1998.

4 0.2564615 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variable randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

5 0.24411744 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Marco Loog

Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization 1. Classical and New Criteria Given N feature vectors of dimensionality n, a linear reduction of dimensionality, based on classical Fisher LDA, determines an n × d transformation matrix L that, for a given d < K, K the number of classes, maximizes the so-called Fisher criterion: F(A) = tr((A t SW A)−1 (At SB A)) or, equivalently, F0 (A) = tr((At ST A)−1 (At SB A)). Here, SB := ∑K pi (mi − m)(mi − m)t , SW := ∑K pi Si , and i=1 i=1 ST = SB + SW . The matrices SB , SW , and ST are the so-called between-class, pooled within-class, and total covariance matrices. In addition, mi is the mean vector of class i, pi is the prior of class i, and the overall mean m equals ∑k pi mi . Finally, Si is the covariance matrix of class i. i=1 A solution to these optimization problems can be obtained by means of a generalized eigenvalue decomposition, which Fukunaga (1990) relates to a simultaneous diagonalization of the two matrices involved (see also Campbell and Atchley, 1981). More common is it to apply a standard −1 eigenvalue decomposition to S−1 SB (or SW SB ), resulting in an equivalent set of eigenvectors. The d T columns of the optimal solution L are simply taken to equal the d eigenvectors corresponding to the d largest eigenvalues. It is known that this solution is not unique and the full class can be obtained by multiplying L to the right with nonsingular d × d matrices (see Fukunaga, 1990). Clearly, if the total covariance ST is singular, neither the generalized nor the standard eigenvalue decomposition can be readily employed. Directly or indirectly, the problem is that the matrix inverse S−1 does not exist, which is the typical situation when dealing with small samples. In an attempt to T overcome this problem, Ye (2005) introduced a different criterion that is defined as F1 (A) = tr((At ST A)+ (At SB A)) , ∗. Also at Nordic Bioscience Imaging, Hovegade 207, DK-2730 Herlev, Denmark. c 2007 Marco Loog. (1) L OOG where + denotes taking the Moore-Penrose generalized inverse of a matrix. Like for F0 , an optimal transform L is one that maximizes the objective F1 . Again, this solution is not unique. 2. Correct Characterization For the full characterization of the set of solutions to Equation (1), initially the problem is looked at from a geometrical point of view (cf., Campbell and Atchley, 1981). It is assumed that the number of samples N is smaller than or equal to the feature dimensionality n. In the undersampled case, it is clear that all data variation occurs in an N − 1-dimensional subspace of the original space. To start with, a PCA of the data is carried out and the first N − 1 principal components are rotated to the first N − 1 axes of the n-dimensional space by means of a rotation matrix R. This matrix consists of all (normalized) eigenvectors of ST taken as its columns. After this rotation, new total and between-class covariance matrices, ST = Rt ST R and SB = Rt SB R, are obtained. 
These 0 0 matrices can be partitioned as follows: ST = Σ0T 0 and SB = ΣB 0 , where ΣT and ΣB are N − 1 × 0 N − 1 covariance matrices and ΣT is nonsingular and diagonal by construction. The between-class variation is obviously restricted to the N − 1-dimensional subspace in which the total data variation takes place, therefore a same partitioning of SB is possible. However, the covariance submatrix ΣB is not necessarily diagonal, neither does it have to be nonsingular. Basically, the PCA-based rotation R converts the initial problem into a more convenient one, splitting up the original space in an N − 1-dimensional one in which “everything interesting” takes place and a remaining n − N + 1dimensional subspace in which “nothing happens at all”. Now consider F1 in this transformed space and take a general n × d transformation matrix A, which is partitioned in a way similar to the covariance matrices, that is, X . Y A= (2) Here, X is an N − 1 × d-matrix and Y is of size n − N + 1 × d. Taking this definition, the following holds (cf., Ye, 2005): t + t F1 (A) = tr((A ST A) (A SB A)) = tr =tr X t ΣT X 0 0 0 + X Y X t ΣB X 0 0 0 t ΣT 0 = tr 0 0 X Y + (Xt ΣT X)−1 0 0 0 X Y t ΣB 0 0 0 X Y X t ΣB X 0 0 0 = tr((Xt ΣT X)−1 (Xt ΣB X)) = F0 (X) . From this it is immediate that a matrix A maximizes F1 if and only if the submatrix X maximizes the original Fisher criterion in the lower-dimensional subspace. Moreover, if L is such a matrix maximizing F1 in the PCA-transformed space, it is easy to check that R−1 L = Rt L provides a solution to the original, general problem that has not been preprocessed by means of a PCA (see also Fukunaga, 1990). A characterization of the complete family of solutions can now be given. Let Λ ∈ RN−1×d be an optimal solution of F0 (X) = tr((Xt ΣT X)−1 (Xt ΣB X)). As already noted in Section 1, the full set of solutions is given by F = {ΛZ ∈ RN−1×d | Z ∈ GLd (R)}, where GLd (R) denotes the general linear group of d × d invertible matrices. The previous paragraph essentially demonstrates that if X ∈ F , A in Equation (2) maximizes F1 . The matrix Y can be chosen ad 2122 C OMPLETE C HARACTERIZATION OF A FAMILY OF S OLUTIONS libitum. Now, the latter provides the solution in the PCA-transformed space and to solve the general problem we need to take the rotation back to the original space into account. All in all, this leads to the following complete family of solutions L , maximizing F1 in the original space: L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R), Y ∈ Rn−N+1×d Y , (3) where Λ = argmaxX tr((Xt ΣT X)−1 (Xt ΣB X)) and Rt takes care of the rotation back. 3. Original Characterization Though not noted by Ye (2005), his attempt to characterize the full set of solutions of Equation (1) is based on a simultaneous diagonalization of the three covariance matrices S B , SW , and ST that is similar to the ideas described by Campbell and Atchley (1981) and Fukunaga (1990). Moreover, Golub and Van Loan (Theorem 8.7.1. 1996) can be readily applied to demonstrate that such simultaneous diagonalization is possible in the small sample setting. After the diagonalization step, partitioned between-class, pooled within-class, and total covariance matrices are considered. This partitioning is similar to the one employed in the previous section, which does not enforce matrices to be diagonal however. In the subsequent optimization step, the classical Fisher criterion is maximized basically in the appropriate subspace, comparable to the approach described above, but in a mildly more involved and concealed way. 
For this, matrices of the form Rt X are considered, consider Equations (2) and Y (3). However, Y is simply the null matrix and the family of solutions L provided is limited to L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R) . 0 Obviously, this is far from a complete characterization, especially when N − 1 n which is, for instance, typically the case for the data sets considered by Ye (2005). Generally, the utility of a dimensionality reduction criterion, without additional constrains, depends on the efficiency over the full set of solutions. As Ye (2005) only considers two very specific instances from the large class of possibilities, it is unclear to what extent the new criterion really provides an efficient way of performing a reduction of dimensionality. References N. A. Campbell and W. R. Atchley. The geometry of canonical variate analysis. Systematic Zoology, 30(3):268–280, 1981. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 1990. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996. J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483–502, 2005. 2123

6 0.23921461 15 jmlr-2007-Bilinear Discriminant Component Analysis

7 0.22168322 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

8 0.21218072 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

9 0.15943959 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

10 0.1514429 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis

11 0.12821448 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

12 0.12796877 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

13 0.12334104 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

14 0.11576443 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

15 0.11318444 14 jmlr-2007-Behavioral Shaping for Geometric Concepts

16 0.10868219 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

17 0.10755546 27 jmlr-2007-Distances between Data Sets Based on Summary Statistics

18 0.096178867 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

19 0.093332663 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

20 0.089835986 11 jmlr-2007-Anytime Learning of Decision Trees


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.038), (8, 0.02), (10, 0.02), (12, 0.012), (15, 0.021), (28, 0.041), (33, 0.468), (34, 0.016), (40, 0.027), (45, 0.012), (48, 0.081), (60, 0.016), (85, 0.043), (98, 0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72230655 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition

Author: Chao-Chun Liu, Dao-Qing Dai, Hong Yan

Abstract: Face recognition is a challenging problem due to variations in pose, illumination, and expression. Techniques that can provide effective feature representation with enhanced discriminability are crucial. Wavelets have played an important role in image processing for its ability to capture localized spatial-frequency information of images. In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posterior logistic model are put forward. Moreover, a new triangle square ratio criterion is used to improve classification using the Euclidean distance and the cosine criterion. Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations

2 0.27644524 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

Author: Masashi Sugiyama

Abstract: Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick. Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality preserving projection, affinity matrix
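A compact sketch of the LFDA construction summarized above: locality-weighted within- and between-class scatter matrices are built from an affinity matrix, and the embedding is obtained from a generalized eigenvalue problem. The local-scaling affinity, the number of neighbours, and the small ridge added for numerical stability are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lfda(X, y, d=2, knn=7):
    """Sketch of local Fisher discriminant analysis (X: samples x features)."""
    n = X.shape[0]
    D = cdist(X, X)
    sigma = np.sort(D, axis=1)[:, knn]              # local-scaling bandwidths
    A = np.exp(-D ** 2 / (sigma[:, None] * sigma[None, :] + 1e-12))

    same = (y[:, None] == y[None, :]).astype(float)
    nc = np.array([np.sum(y == yi) for yi in y], dtype=float)

    # Locality-weighted within- and between-class pair weights.
    Ww = same * A / nc[:, None]
    Wb = same * A * (1.0 / n - 1.0 / nc[:, None]) + (1.0 - same) / n

    def scatter(W):
        # (1/2) * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T via the graph Laplacian.
        Dg = np.diag(W.sum(axis=1))
        return X.T @ (Dg - W) @ X

    Sw, Sb = scatter(Ww), scatter(Wb)
    # Generalized eigenproblem Sb t = lambda Sw t; keep the top-d eigenvectors.
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(X.shape[1]))
    T = evecs[:, np.argsort(evals)[::-1][:d]]
    return X @ T                                    # embedded data
```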

3 0.25335425 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models

Author: Margarita Osadchy, Yann Le Cun, Matthew L. Miller

Abstract: We describe a novel method for simultaneously detecting faces and estimating their pose in real time. The method employs a convolutional network to map images of faces to points on a low-dimensional manifold parametrized by pose, and images of non-faces to points far away from that manifold. Given an image, detecting a face and estimating its pose is viewed as minimizing an energy function with respect to the face/non-face binary variable and the continuous pose parameters. The system is trained to minimize a loss function that drives correct combinations of labels and pose to be associated with lower energy values than incorrect ones. The system is designed to handle a very large range of poses without retraining. The performance of the system, tested on three standard data sets (for frontal views, rotated faces, and profiles), is comparable to previous systems that are designed to handle a single one of these data sets. We show that a system trained simultaneously for detection and pose estimation is more accurate on both tasks than similar systems trained for each task separately. Keywords: face detection, pose estimation, convolutional networks, energy based models, object recognition
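The inference step described above (minimizing an energy over the face/non-face variable and the pose parameters) can be sketched as follows; the mapping G, the manifold parametrization F, the pose grid, and the threshold are all placeholders, not the paper's trained convolutional network.

```python
import numpy as np

def detect_and_estimate_pose(x, G, F, poses, threshold):
    """Energy-based inference sketch.

    G(x): trained mapping from an image to a low-dimensional point (placeholder).
    F(theta): point on the pose manifold for pose parameters theta (placeholder).
    poses: a discretized grid over the continuous pose parameters.
    The energy is the distance between G(x) and the manifold; the image is
    declared a face if the minimal energy falls below `threshold`."""
    z = G(x)
    energies = np.array([np.linalg.norm(z - F(theta)) for theta in poses])
    i = int(np.argmin(energies))
    is_face = energies[i] < threshold
    return is_face, (poses[i] if is_face else None), energies[i]

# Toy usage with placeholder G and F (assumptions, not the paper's network):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = lambda img: img.reshape(-1)[:3] * 0.01            # stand-in "network"
    F = lambda theta: np.array([np.cos(theta), np.sin(theta), 0.0])
    poses = np.linspace(-np.pi / 2, np.pi / 2, 37)         # yaw grid
    img = rng.normal(size=(32, 32))
    print(detect_and_estimate_pose(img, G, F, poses, threshold=1.5))
```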

4 0.24700612 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

Author: Carine Hue, Marc Boullé

Abstract: In this paper, we consider the supervised learning task of predicting the normalized rank of a numerical variable. We introduce a novel probabilistic approach to estimate the posterior distribution of the target rank conditionally on the predictors. We turn this learning task into a model selection problem. For that, we define a 2D partitioning family obtained by discretizing numerical variables and grouping categorical ones, and we derive an analytical criterion to select the partition with the highest posterior probability. We show how these partitions can be used to build univariate predictors and multivariate ones under a naive Bayes assumption. We also propose a new evaluation criterion for probabilistic rank estimators. Based on the logarithmic score, we show that such a criterion has the advantage of being bounded from below, which is not the case for the logarithmic score computed for a probabilistic value estimator. A first set of experiments on synthetic data shows the good properties of the proposed criterion and of our partitioning approach. A second set of experiments on real data shows competitive performance of the univariate and selective naive Bayes rank estimators, projected on the value range, compared to methods submitted to a recent challenge on probabilistic metric regression tasks. Our approach is applicable to all regression problems with categorical or numerical predictors. It is particularly interesting for those with a high number of predictors, as it automatically detects the variables that contain predictive information. It builds pertinent predictors of the normalized rank of the numerical target from one or several predictors. As the selection criterion is regularized by the presence of a prior and a posterior term, it does not suffer from overfitting. Keywords: rank regression, probabilistic approach, 2D partitioning, non parametric estimation, Bayesian model selection
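A toy illustration of two ingredients mentioned above, the normalized-rank target and the logarithmic score: here a simple unconditional histogram over rank intervals stands in for the 2D-partition-based conditional estimator, and all numbers are hypothetical.

```python
import numpy as np

def normalized_rank(y_train, v):
    """Normalized rank of a value v relative to the training targets (in [0, 1])."""
    return float(np.mean(y_train <= v))

def log_score(edges, densities, r):
    """Mean logarithmic score of a piecewise-constant density on [0, 1] at ranks r."""
    idx = np.clip(np.searchsorted(edges, r, side="right") - 1, 0, len(densities) - 1)
    return float(np.mean(np.log(densities[idx] + 1e-12)))

# Toy usage with hypothetical numbers.
y_train = np.array([3.2, 1.0, 7.5, 2.2, 9.1, 4.4, 0.3, 6.6])
r_train = np.array([normalized_rank(y_train, v) for v in y_train])

# A histogram over rank intervals stands in for the partition-based estimator.
densities, edges = np.histogram(r_train, bins=4, range=(0.0, 1.0), density=True)

r_test = np.array([normalized_rank(y_train, v) for v in [5.0, 8.0, 1.5]])
print(log_score(edges, densities, r_test))
```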

5 0.23692828 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variables randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection
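A simplified sketch of the idea described above: a probability distribution P over the feature dictionary is updated stochastically so that mass concentrates on features that appear in well-scoring subsets. The nearest-centroid scorer, the moving-average baseline, and the additive update with simplex projection are assumptions of this sketch, not the paper's exact stochastic gradient / Robbins-Monro scheme.

```python
import numpy as np

def select_features(X_tr, y_tr, X_va, y_va, n_feat=5, iters=300, lr=0.05, seed=0):
    """Learn a distribution P over features; reward features in good subsets."""
    rng = np.random.default_rng(seed)
    D = X_tr.shape[1]
    P = np.full(D, 1.0 / D)
    baseline = 0.0
    for _ in range(iters):
        S = rng.choice(D, size=n_feat, replace=False, p=P)   # sample a feature subset
        acc = _nearest_centroid_acc(X_tr[:, S], y_tr, X_va[:, S], y_va)
        baseline = 0.9 * baseline + 0.1 * acc                # moving-average baseline
        P[S] += lr * (acc - baseline)                        # nudge sampled features
        P = np.clip(P, 1e-6, None)
        P /= P.sum()                                         # project back to the simplex
    return P

def _nearest_centroid_acc(X_tr, y_tr, X_va, y_va):
    """Cheap validation accuracy of a nearest-centroid classifier."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(X_va[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[np.argmin(d, axis=1)]
    return float(np.mean(pred == y_va))
```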

6 0.23344411 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

7 0.23338747 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

8 0.23269282 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

9 0.2326577 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

10 0.22992529 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

11 0.22911753 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

12 0.22695085 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

13 0.22305115 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

14 0.22272272 39 jmlr-2007-Handling Missing Values when Applying Classification Models

15 0.22136751 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

16 0.22047025 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

17 0.21997112 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

18 0.21933348 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

19 0.21892852 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

20 0.21862648 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method