jmlr jmlr2007 jmlr2007-14 knowledge-graph by maker-knowledge-mining

14 jmlr-2007-Behavioral Shaping for Geometric Concepts


Source: pdf

Author: Manu Chhabra, Robert A. Jacobs, Daniel Štefankovič

Abstract: In a search problem, an agent uses the membership oracle of a target concept to find a positive example of the concept. In a shaped search problem the agent is aided by a sequence of increasingly restrictive concepts leading to the target concept (analogous to behavioral shaping). The concepts are given by membership oracles, and the agent has to find a positive example of the target concept while minimizing the total number of oracle queries. We show that for the concept class of intervals on the real line an agent using a bounded number of queries per oracle exists. In contrast, for the concept class of unions of two intervals on the real line no agent with a bounded number of queries per oracle exists. We then investigate the (amortized) number of queries per oracle needed for the shaped search problem over other concept classes. We explore the following methods to obtain efficient agents. For axis-parallel rectangles we use a bootstrapping technique to obtain gradually better approximations of the target concept. We show that given rectangles R ⊆ A ⊆ R^d one can obtain a rectangle A′ ⊇ R with vol(A′)/vol(R) ≤ 2, using only O(d · vol(A)/vol(R)) random samples from A. For ellipsoids of bounded eccentricity in R^d we analyze a deterministic ray-shooting process which uses a sequence of rays to get close to the centroid. Finally, we use algorithms for generating random points in convex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for the concept class of convex bodies. In the final section, we explore connections between our bootstrapping method and active learning. Specifically, we use the bootstrapping technique for axis-parallel rectangles to active learn axis-parallel rectangles under the uniform distribution in O(d ln(1/ε)) labeled samples. Keywords: computational learning theory, behavioral shaping, active learning

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In a search problem, an agent uses the membership oracle of a target concept to find a positive example of the concept. [sent-9, score-0.72]

2 In a shaped search problem the agent is aided by a sequence of increasingly restrictive concepts leading to the target concept (analogous to behavioral shaping). [sent-10, score-0.705]

3 The concepts are given by membership oracles, and the agent has to find a positive example of the target concept while minimizing the total number of oracle queries. [sent-11, score-0.69]

4 We show that for the concept class of intervals on the real line an agent using a bounded number of queries per oracle exists. [sent-12, score-0.897]

5 In contrast, for the concept class of unions of two intervals on the real line no agent with a bounded number of queries per oracle exists. [sent-13, score-0.928]

6 We then investigate the (amortized) number of queries per oracle needed for the shaped search problem over other concept classes. [sent-14, score-0.795]

7 Finally, we use algorithms for generating random points in convex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for the concept class of convex bodies. [sent-21, score-0.425]

8 Specifically, we use the bootstrapping technique for axis-parallel rectangles to active learn axis-parallel rectangles under the uniform distribution in O(d ln(1/ε)) labeled samples. [sent-23, score-0.431]

9 In a search problem, the task is to find one positive example of a concept given a membership oracle of the concept using as few queries to the oracle as possible. [sent-44, score-1.009]

10 If a shaping sequence, which is a sequence of nested concepts, is available, it might be possible to solve the search problem with a smaller number of oracle queries. [sent-45, score-0.47]

11 When concepts are standard geometrical concepts like rectangles, balls, ellipsoids, and convex bodies in high dimensions, we show efficient algorithms to solve the search problem using a shaping sequence. [sent-46, score-0.417]

12 A weaker version of the shaped search problem in which the concepts are convex and all the oracles are available to the agent simultaneously can be viewed as an instance of a quasi-convex optimization problem. [sent-56, score-0.636]

13 Also, we cannot apply the algorithms from this area since they usually rely on an oracle (a so-called separation oracle) stronger than the membership oracle that we use. [sent-57, score-0.526]

14 In Section 3 we illustrate the shaped search problem with algorithms for the concept classes of intervals and axis-parallel rectangles. [sent-68, score-0.374]

15 In Section 8 we define the problem of one-sided active learning, and show that the bootstrapping algorithm given in Section 4 can be used to active-learn the concept class of axis-parallel rectangles with O(d ln(1/ε)) labeled samples. [sent-72, score-0.395]

16 The goal of the agent is to minimize the number of queries to the oracle. [sent-78, score-0.458]

17 Let (X, B ) be a measurable space and let vol be a measure on (X, B ). [sent-88, score-0.614]

18 Examples of concept classes that we study include intervals in R, axis-parallel rectangles in R^d, ellipsoids in R^d, and convex bodies in R^d. [sent-92, score-0.501]

19 The agent has a representation of S and has to find a point in T using a membership oracle for T . [sent-97, score-0.54]

20 For some concept classes (e.g., finite unions of intervals) the use of randomness might be inevitable—a deterministic algorithm with a bounded number of queries need not exist (the question of deterministic search is related to the concept of hitting sets, see, e. [sent-102, score-0.497]

21 The expected (over the choice of i) number of queries made by any (randomized) agent on (S, Ti ) is Ω(n). [sent-114, score-0.458]

22 In a shaped search problem the agent’s search task will be aided by a shaping sequence, which is a sequence of nested sets between S and T. [sent-115, score-0.466]

23 The sets in the shaping sequence will be gradually shrinking concepts from the underlying concept class C. [sent-116, score-0.317]

24 A search agent in a shaped search problem only has access to the membership oracles of S1 , . [sent-125, score-0.698]

25 In other words, the oracles St are presented to the agent in k iterations, with the agent making (zero or more) queries to the oracle St at iteration t. [sent-130, score-1.012]

26 The agent successfully solves the shaped search problem if at the end of the process it outputs x ∈ T . [sent-131, score-0.452]
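
To make the interaction model concrete, here is a minimal protocol sketch (not from the paper): an oracle is just a membership predicate, and a hypothetical agent object with step/answer methods is driven through the oracles one at a time while its queries are counted. All names (Oracle, run_shaped_search, step, answer) are illustrative assumptions.

```python
from typing import Callable, List, Tuple

# A membership oracle is a predicate on points; points are modeled as tuples of floats.
Oracle = Callable[[Tuple[float, ...]], bool]

def run_shaped_search(oracles: List[Oracle], agent) -> Tuple[float, ...]:
    """Drive a (hypothetical) agent through the oracles of S_1, ..., S_k in order.

    The agent is assumed to expose:
      - step(t, oracle): may call `oracle` zero or more times at iteration t,
      - answer(): the agent's final point, which should lie in T = S_k.
    The amortized number of queries per oracle (total calls / k) is the
    quantity whose boundedness the paper studies.
    """
    calls = 0

    def counted(oracle: Oracle) -> Oracle:
        def wrapped(x):
            nonlocal calls
            calls += 1
            return oracle(x)
        return wrapped

    for t, oracle in enumerate(oracles, start=1):
        agent.step(t, counted(oracle))
    x = agent.answer()
    assert oracles[-1](x), "the agent must output a positive example of the target concept"
    print(f"amortized queries per oracle: {calls / len(oracles):.2f}")
    return x
```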

27 We will evaluate the performance of the agent by the amortized number of membership queries per oracle (i. [sent-134, score-0.831]

28 For a randomized agent the performance is measured by the expected number of membership queries per oracle, where the expectation is taken over the coin flips of the agent. [sent-140, score-0.618]

29 We say that the agent A solves a γ-shaped search problem using q queries per oracle, if for every S, T ∈ C , every k, and every γ-shaping sequence S0 , . [sent-145, score-0.578]

30 , Sk ∈ C the total number of queries made by the agent is bounded by kq. [sent-148, score-0.483]

31 If the agent is randomized we require the expected total number of queries to be bounded by kq. [sent-149, score-0.523]

32 Our Results. In order to introduce the shaped search problem, we start with a positive and a negative result for two simple concept classes (the proofs are in Section 3). [sent-154, score-0.315]

33 First, we show that O(1/γ) queries per oracle suffice to solve the γ-shaped search problem for the concept class of closed intervals in R. [sent-155, score-0.701]

34 There exists a deterministic agent which for any γ uses O(1/γ) queries per oracle to solve the γ-shaped search problem. [sent-158, score-0.807]

35 We complement this by showing that for the class of “unions of two closed intervals in R” there exists no agent that solves the γ-shaped search problem using a bounded number of queries per oracle. [sent-160, score-0.642]

36 For every (randomized) agent A and every number q there exists a search problem (S, T ), k, and a γ-shaping sequence S1 , . [sent-164, score-0.319]

37 , Sk such that A makes more than q queries per oracle (in expectation). [sent-167, score-0.48]

38 There exists a deterministic agent which for any γ ∈ (0, 1) uses O(f(γ)) queries per oracle to solve the γ-shaped search problem. [sent-173, score-0.807]

39 On the other hand, for any γ ∈ (0, 1), any (randomized) agent makes at least Ω( f (γ)) queries per oracle. [sent-174, score-0.494]

40 The shaped search problem for axis-parallel rectangles in R^d turns out to be more complicated. [sent-175, score-0.351]

41 We say that a randomized agent is oblivious if for every oracle St the queries to St which lie in St are independent and uniformly random in St . [sent-178, score-0.719]

42 For any γ there exists a randomized agent using O(1/γ + (d + ln(1/γ)) ln d) queries per oracle. [sent-182, score-0.912]

43 For any γ there exists an (oblivious) randomized agent using O(d/γ) queries per oracle. [sent-184, score-0.534]

44 For any constant γ > 1/2 there exists a deterministic agent using O(ln d) queries per oracle. [sent-186, score-0.522]

45 Bounds for several values of γ (columns correspond to parts 1–3 above):

              part 1 (randomized)   part 2 (oblivious)   part 3 (deterministic)
γ = 3/4       O(d ln d)             O(d)                 O(ln d)
γ = 1/4       O(d ln d)             O(d)                 N/A
γ = 1/ln d    O(d ln d)             O(d ln d)            N/A
γ = 1/d       O(d ln d)             O(d^2)               N/A

Note that the deterministic algorithm for part 3. [sent-194, score-1.162]

46 Question 1 Are Ω(d) queries per oracle needed for γ < 1/2? [sent-199, score-0.48]

47 A bootstrap-learning algorithm, given an approximation A1 of an unknown concept C ∈ C and a membership oracle for C, outputs a better approximation A2 of C. [sent-203, score-0.403]

48 We show how an approximate center of an axis-parallel rectangle can be maintained using only O(ln d) (amortized) queries per oracle. [sent-208, score-0.345]

49 The γ-shaped search problem can be solved by a deterministic agent using O(L^2 · d^{3/2} ln d) queries per ray-shooting oracle (a ray-shooting oracle returns the points of intersection of K with a line through a point x ∈ K). The requirement γ > 1/2 in Theorem 2. [sent-225, score-1.217]
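
The flavor of the ray-shooting process can be sketched as follows. The closed-form ray_shoot_ellipsoid below plays the role of the ray-shooting oracle (the paper only assumes oracle access), and approach_centroid repeatedly moves to the midpoint of the chord through the current point; for an ellipsoid this never increases the distance to the center in the norm induced by the ellipsoid, so the iterates drift toward the centroid. The paper's process is deterministic and carefully analyzed; the random directions and the step count used here are illustrative assumptions only.

```python
import numpy as np

def ray_shoot_ellipsoid(A, c, x, u):
    """Ray-shooting oracle for the ellipsoid {y : (y - c)^T A (y - c) <= 1}:
    return the two points where the line {x + t*u} meets the boundary."""
    d = x - c
    a = u @ A @ u
    b = 2.0 * (u @ A @ d)
    cc = d @ A @ d - 1.0
    disc = np.sqrt(b * b - 4.0 * a * cc)     # x inside => real roots
    t1, t2 = (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)
    return x + t1 * u, x + t2 * u

def approach_centroid(A, c, x, steps=50, seed=0):
    """Move to the midpoint of the chord through x in a (here: random) direction,
    repeatedly. Chord midpoints are projections toward the center, so the
    iterates approach the centroid of the ellipsoid."""
    rng = np.random.default_rng(seed)
    x = np.array(x, dtype=float)
    for _ in range(steps):
        u = rng.normal(size=len(x))
        u /= np.linalg.norm(u)
        p, q = ray_shoot_ellipsoid(A, c, x, u)
        x = (p + q) / 2.0
    return x

# Example: an axis-aligned ellipsoid in R^3 centered at (1, 2, 3).
A = np.diag([1.0, 4.0, 0.25])
print(approach_centroid(A, np.array([1.0, 2.0, 3.0]), x=np.array([1.4, 2.1, 2.0])))
```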

50 For any γ ∈ (0, 1) there exists a randomized agent for the γ-shaped search problem using O(poly(d, 1/γ)) queries per oracle. [sent-245, score-0.598]
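
A sketch of how such a randomized agent can be organized: use a random-walk sampler driven by the membership oracle of the current concept to generate (approximately) uniform points, and keep sampling until one of them passes the oracle of the next concept; when vol(S_{t+1})/vol(S_t) ≥ γ, the expected number of samples per stage is at most 1/γ. The hit_and_run routine below is only a rough stand-in for the samplers with proven guarantees that the paper relies on (Dyer et al., 1991; Kannan et al., 1997), and the parameters n_steps and radius are assumptions.

```python
import numpy as np

def hit_and_run(oracle, x, n_steps, radius, rng):
    """Very rough hit-and-run walk inside a convex body given by a membership oracle,
    started from an interior point x; `radius` is assumed to bound the body's diameter."""
    x = np.array(x, dtype=float)
    for _ in range(n_steps):
        u = rng.normal(size=len(x))
        u /= np.linalg.norm(u)
        ends = []
        for s in (+1.0, -1.0):                 # locate the chord endpoints in both directions
            lo, hi = 0.0, radius
            for _ in range(40):                # binary search using the membership oracle
                mid = (lo + hi) / 2.0
                lo, hi = (mid, hi) if oracle(x + s * mid * u) else (lo, mid)
            ends.append(s * lo)
        x = x + rng.uniform(min(ends), max(ends)) * u
    return x

def advance_one_stage(x, current_oracle, next_oracle, rng, radius=2.0, n_steps=200):
    """Sample from the current concept until a sample lies in the next concept."""
    while True:
        y = hit_and_run(current_oracle, x, n_steps, radius, rng)
        if next_oracle(y):
            return y
```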

51 Basic Results for Intervals and Axis-parallel Rectangles. Now we show that for intervals O(1/γ) queries per oracle are sufficient to solve the γ-shaped search problem. [sent-247, score-0.603]
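
The single-step idea behind the O(1/γ) bound can be sketched as follows (this is not the paper's full agent, which must also cope with not knowing the endpoints of the intermediate intervals; the function name and the assumption that the enclosing interval [a, b] is known exactly are simplifications): any subinterval of length at least γ(b − a) must contain a point of a grid with spacing γ(b − a).

```python
def find_point_in_subinterval(a, b, gamma, oracle):
    """Find a point of an unknown subinterval T ⊆ [a, b] with length(T) >= gamma*(b - a),
    using the membership oracle of T on a grid of spacing gamma*(b - a).
    At most about 1/gamma + 2 queries are made."""
    step = gamma * (b - a)
    x = a
    while x <= b:
        if oracle(x):
            return x
        x += step
    if oracle(b):           # cover the right endpoint despite floating-point drift
        return b
    raise ValueError("no hit: T appears to be shorter than gamma*(b - a)")

# Tiny usage example: T = [0.30, 0.42] inside [0, 1] with gamma = 0.1.
print(find_point_in_subinterval(0.0, 1.0, 0.1, lambda x: 0.30 <= x <= 0.42))
```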

52 By the minimality of j we have 2^j ≤ 12/γ and hence the total number of queries per oracle is O(1/γ). [sent-281, score-0.502]

53 The number of queries per oracle is O(1/ ) = O(ln(1/γ)). [sent-291, score-0.48]

54 The agent needs to make Ω(1/γ) queries (per oracle). [sent-295, score-0.458]

55 This implies E[Q] = Ω(k ln(1/γ)), and hence the number of queries per oracle is Ω(ln(1/γ)). [sent-303, score-0.502]

56 The lower bound for a randomized agent follows by Yao’s min-max lemma (see, e. [sent-304, score-0.33]

57 For unions of two intervals the agent can be forced to make many queries per oracle. [sent-307, score-0.584]

58 For the sake of the lower bound argument, we will help the agent by telling it the general shape of the shaping sequence, but we will keep the location of T secret. [sent-329, score-0.4]

59 W.l.o.g., we can assume that the agent only queries points in [0, γn] on the oracle for T (because for all the other queries the agent can figure out the answer itself). [sent-333, score-1.137]

60 The agent needs Ω(1/γ^ℓ) queries to find a point in T. [sent-335, score-0.458]

61 Hence the number of queries per oracle is Ω((1/γ^ℓ)/(3n + ℓ + 2)). [sent-336, score-0.48]

62 Letting ℓ → ∞ we obtain that the number of queries per oracle is unbounded. [sent-337, score-0.48]

63 Only O(− ln w1 (St+1 )) queries are needed for this step (where w1 (St+1 ) is the width of St+1 in the 1-st dimension). [sent-364, score-0.433]

64 The total number of queries is O(d ln d) + O(−∑_{i=1}^{d} ln w_i(S_{t+1})) = O(d ln d + ln(1/vol(S_{t+1}))). [sent-373, score-0.979]

65 An (α, β)-bootstrap learning algorithm for C using CA takes as an input a representation of a concept A1 ∈ CA and an oracle for a concept R ∈ C . [sent-383, score-0.417]

66 The efficiency of the algorithm is measured by the worst-case (expected) number T of oracle queries to R (i. [sent-386, score-0.444]

67 If an efficient (α, β)-bootstrap learning algorithm exists for a concept class C then it can be used for the shaped search problem as follows. [sent-389, score-0.315]

68 Then there exists an agent which for any γ ≥ β/α solves the γ-shaped search problem (over C ) using T queries per oracle. [sent-395, score-0.558]

69 (The starting concept S in a shaped search problem is known in advance and hence the agent can pre-compute A0.) [sent-401, score-0.319]

70 2 to obtain an agent for the shaped search problem for C , one should choose CA which allows for an efficient (α, β)-bootstrap algorithm. [sent-406, score-0.452]

71 The expected number of oracle calls made by the Inner-Outer algorithm is bounded by 8 + 320dα/ ln β. [sent-445, score-0.435]
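
A much simpler (and weaker) estimator than the paper's Inner-Outer algorithm illustrates the bootstrapping idea: sample uniformly from the known outer rectangle A1, query the membership oracle of R, and return the bounding box of the positive samples as an inner approximation. The bounding box is always contained in R (since R is an axis-parallel box) and it approaches R as the sample budget grows; the function name and the fixed budget are illustrative assumptions, and this sketch does not attain the (α, β) guarantees of the Inner-Outer algorithm.

```python
import numpy as np

def naive_bootstrap(lo1, hi1, oracle, n_samples, seed=0):
    """Crude inner approximation of an unknown axis-parallel rectangle R ⊆ A1.

    lo1, hi1  : per-coordinate bounds of the known outer rectangle A1.
    oracle    : membership oracle of R.
    Returns (lo, hi), the bounding box of the positive samples, or None if no
    sample hit R (vol(R)/vol(A1) too small for this budget)."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(lo1, hi1, size=(n_samples, len(lo1)))
    pos = np.array([p for p in pts if oracle(p)])
    if len(pos) == 0:
        return None
    return pos.min(axis=0), pos.max(axis=0)

# Example in d = 2: R = [0.2, 0.8] x [0.1, 0.6] inside A1 = [0, 1]^2.
lo_r, hi_r = np.array([0.2, 0.1]), np.array([0.8, 0.6])
print(naive_bootstrap(np.zeros(2), np.ones(2),
                      lambda p: bool(np.all((lo_r <= p) & (p <= hi_r))), 500))
```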

72 By Jensen’s inequality, E[ln(wi(A2)/wi(R))] ≤ ln(1 + 2/n1 + 2/n2). [sent-498, score-0.378]

73 For n1 = 64dα/ln β and n2 = 16dα/ln β the right hand side is bounded by β and hence the algorithm will terminate. [sent-504, score-0.425]

74 Hence in expectation at most 4(n1 + n2) oracle queries suffice. [sent-507, score-0.444]

75 Center-point Algorithm. In this section we show how an approximate center-point of a rectangle can be maintained using only O(ln d) queries per oracle. [sent-509, score-0.345]

76 Thus it is fine if the algorithm makes Θ(d) queries going from St to St+1 , as long as the average number of queries per oracle is O(ln d). [sent-576, score-0.703]
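
For contrast with the amortized scheme, here is the naive way to recenter, which binary-searches every coordinate and therefore spends Θ(d log(radius/eps)) queries on a single oracle; the point of the center-point algorithm is to avoid paying this every time and keep the average at O(ln d). The function and its parameters (radius as an assumed upper bound on the rectangle's side lengths, the tolerance eps) are illustrative.

```python
def recenter(x, oracle, radius, eps=1e-3):
    """Approximately recenter a point x known to lie in an axis-parallel rectangle
    given only by its membership oracle. For each coordinate, binary-search the
    left and right boundary along the axis-parallel line through x, then move to
    the midpoint. Cost: O(d * log(radius / eps)) oracle queries."""
    x = list(x)
    for i in range(len(x)):
        def inside(v):
            y = list(x)
            y[i] = v
            return oracle(tuple(y))
        lo_out, hi_in = x[i] - radius, x[i]        # left boundary lies in (lo_out, hi_in]
        while hi_in - lo_out > eps:
            mid = (lo_out + hi_in) / 2.0
            lo_out, hi_in = (lo_out, mid) if inside(mid) else (mid, hi_in)
        lo_in, hi_out = x[i], x[i] + radius        # right boundary lies in [lo_in, hi_out)
        while hi_out - lo_in > eps:
            mid = (lo_in + hi_out) / 2.0
            lo_in, hi_out = (mid, hi_out) if inside(mid) else (lo_in, mid)
        x[i] = (hi_in + lo_in) / 2.0               # midpoint of the two boundary estimates
    return tuple(x)

# Example: rectangle [0, 4] x [1, 2], starting point (0.5, 1.1), side lengths <= 4.
print(recenter((0.5, 1.1), lambda p: 0 <= p[0] <= 4 and 1 <= p[1] <= 2, radius=4.0))
```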

77 Thus the total number of queries is O( ln(1/γ)/ln(1/(1 − ε/4)) + k(1/ε + ln d) ) = O(k ln d), since γ < 1/2 is a constant, and we can take ε = (1/2 − γ)/4. [sent-593, score-0.979]

78 Algorithms which deal with convex bodies given by membership oracles often require the body to be sandwiched between balls, in the following precise sense. [sent-653, score-0.351]

79 In this work, the learner does not have access to unlabeled samples, but is allowed to query the membership oracle with any instance of its choice. [sent-756, score-0.401]

80 They showed a negative result stating that there are certain concept classes which are "dense in themselves", meaning that a small number of queries (even if chosen by the learner) are not enough to determine the target concept well. [sent-758, score-0.437]

81 Much modern active learning work uses this framework of having an unlabeled oracle and a membership oracle that can answer queries from examples generated from the unlabeled oracle. [sent-763, score-0.871]

82 The concept class of rectangles has been popular in the machine learning literature as rectangles are geometrically simple objects and also yield excellent results experimentally (Dietterich et al. [sent-773, score-0.366]

83 We adopt the standard active learning framework of having an oracle that generates random samples and another oracle that can label these samples on request. [sent-781, score-0.56]

84 However, it clearly gives a (weak) upper bound to active learning the concept class of rectangles in O(d/ε) labeled samples under the uniform distribution. [sent-783, score-0.352]
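
The weak O(d/ε) baseline can be sketched very simply; the following one-sided learner takes the bounding box of the positively labeled uniform samples, and skips the label query for any point that is already inside the current estimate (such points are implied positives). It does not achieve the O(d ln(1/ε)) bound of the bootstrapping approach described next; the function name, sample budget, and interface are assumptions for illustration.

```python
import numpy as np

def one_sided_learn_rectangle(label_oracle, d, n_unlabeled, seed=0):
    """Naive one-sided active learner for an axis-parallel rectangle R ⊆ [0, 1]^d.

    Draws n_unlabeled uniform points; a point inside the current inner estimate
    (the bounding box of positives seen so far) is known positive and costs no
    label, every other point is sent to label_oracle. Returns the estimate and
    the number of labels actually requested."""
    rng = np.random.default_rng(seed)
    lo, hi = None, None
    labels_used = 0
    for _ in range(n_unlabeled):
        x = rng.uniform(size=d)
        if lo is not None and np.all((lo <= x) & (x <= hi)):
            continue                       # implied positive, label query saved
        labels_used += 1
        if label_oracle(x):
            lo = x.copy() if lo is None else np.minimum(lo, x)
            hi = x.copy() if hi is None else np.maximum(hi, x)
    return (lo, hi), labels_used

# Example: R = [0.1, 0.9]^3 (a 'big' rectangle, volume 0.512 > 1/2).
box, used = one_sided_learn_rectangle(
    lambda x: bool(np.all((0.1 <= x) & (x <= 0.9))), d=3, n_unlabeled=2000)
print(box, used)
```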

85 In this section, we give a variant of the bootstrapping algorithm and show how it can be repeatedly used to active learn rectangles using both labeled and unlabeled oracles with only O(d ln(1/ε)) labeled queries. [sent-784, score-0.439]

86 The concept class of axis-parallel rectangles inside the d-dimensional cube [0, 1]^d is not one-sided active learnable. [sent-790, score-0.355]

87 Thus we are making exponentially more queries to the membership oracle than desired. [sent-793, score-0.528]

88 In the same spirit, we assume that our concept class has rectangles which are larger than some constant value and show that active learning is possible in this case. [sent-797, score-0.31]

89 The concept class C of big rectangles is a set of axis-parallel rectangles R such that R ⊂ [0, 1]^d and vol(R) > 1/2. [sent-799, score-0.366]

90 The algorithm samples O(dα/ ln β) points from the membership oracle of R. [sent-803, score-0.514]

91 Hence the expected number of queries to the membership oracle is (1 − 1/α)(8 + dα/ln β), which is O(1 + d(α − 1)/ln β) = O(d(α − 1)/ln β) (in the last containment we used d(α − 1)/ln β ≥ d(α − 1)/ln α ≥ d = Ω(1)). [sent-825, score-1.473]

92 Analogous to behavioral shaping, in a shaped search problem, the learner is given access to a sequence of membership oracles of increasingly restrictive concepts and the learner is required to output one sample from the target concept. [sent-859, score-0.632]

93 We gave efficient algorithms to solve the shaped search problem for the concept classes of intervals, axis-parallel rectangles, bounded eccentricity ellipsoids, and general convex bodies. [sent-860, score-0.445]

94 While we have matching lower and upper bounds for the concept class of intervals, for other concept classes we do not understand the complexity of the shaped search problem (i. [sent-861, score-0.413]

95 The bootstrapping technique of Section 4 was useful for both the shaped search problem and active learning for axis-parallel rectangles. [sent-867, score-0.358]

96 By keeping track of the centroid we were able to solve the shaped search problem for bounded eccentricity ellipsoids. [sent-871, score-0.352]

97 Our solution of the shaped search problem for general convex bodies samples random points from convex bodies and hence heavily relies on randomization. [sent-875, score-0.499]

98 Does there exist a deterministic agent for the γ-shaped search problem, using O(poly(d, 1/γ)) queries per oracle? [sent-878, score-0.586]

99 For example, it was observed in Dasgupta (2005) that when the instance space is a unit ball, the concept class of non-homogeneous perceptrons is not active learnable (in the sense that any active learning scheme requires a large number of labeled samples). [sent-881, score-0.34]

100 One could consider a framework in which all the oracles are present simultaneously and the agent can query any oracle at any time. [sent-887, score-0.589]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vol', 0.614), ('st', 0.307), ('agent', 0.235), ('queries', 0.223), ('oracle', 0.221), ('ln', 0.189), ('shaping', 0.165), ('shaped', 0.153), ('rectangles', 0.134), ('ehavioral', 0.113), ('eometric', 0.113), ('haping', 0.113), ('hhabra', 0.113), ('tefankovi', 0.113), ('concept', 0.098), ('oracles', 0.098), ('oncepts', 0.095), ('ellipsoids', 0.09), ('jacobs', 0.088), ('rectangle', 0.086), ('membership', 0.084), ('behavioral', 0.083), ('active', 0.078), ('vt', 0.077), ('kk', 0.068), ('bodies', 0.068), ('search', 0.064), ('goldman', 0.063), ('bootstrapping', 0.063), ('intervals', 0.059), ('rochester', 0.057), ('centroid', 0.057), ('lemma', 0.055), ('bt', 0.053), ('eccentricity', 0.053), ('convex', 0.052), ('body', 0.049), ('reward', 0.048), ('ellipsoid', 0.045), ('centrally', 0.045), ('lov', 0.045), ('possibler', 0.045), ('inside', 0.045), ('dx', 0.042), ('tv', 0.041), ('randomized', 0.04), ('learner', 0.039), ('pj', 0.038), ('sz', 0.038), ('per', 0.036), ('query', 0.035), ('concepts', 0.034), ('learnable', 0.034), ('rd', 0.032), ('amortized', 0.032), ('kannan', 0.032), ('unions', 0.031), ('ci', 0.03), ('vj', 0.03), ('perceptrons', 0.03), ('isotropic', 0.029), ('teacher', 0.028), ('yk', 0.028), ('deterministic', 0.028), ('kearns', 0.027), ('bi', 0.026), ('vd', 0.025), ('bounded', 0.025), ('dasgupta', 0.024), ('bertsimas', 0.023), ('dvk', 0.023), ('af', 0.022), ('hence', 0.022), ('interval', 0.022), ('mk', 0.022), ('vk', 0.022), ('labeled', 0.022), ('unlabeled', 0.022), ('proposition', 0.022), ('sk', 0.021), ('width', 0.021), ('sally', 0.021), ('vi', 0.021), ('samples', 0.02), ('theorem', 0.02), ('sequence', 0.02), ('sampler', 0.019), ('centering', 0.019), ('ik', 0.019), ('szl', 0.019), ('dorigo', 0.019), ('dvi', 0.019), ('fine', 0.019), ('rays', 0.019), ('skinner', 0.019), ('ball', 0.019), ('isbn', 0.018), ('target', 0.018), ('xt', 0.017), ('ng', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 14 jmlr-2007-Behavioral Shaping for Geometric Concepts

Author: Manu Chhabra, Robert A. Jacobs, Daniel Štefankovič

Abstract: In a search problem, an agent uses the membership oracle of a target concept to find a positive example of the concept. In a shaped search problem the agent is aided by a sequence of increasingly restrictive concepts leading to the target concept (analogous to behavioral shaping). The concepts are given by membership oracles, and the agent has to find a positive example of the target concept while minimizing the total number of oracle queries. We show that for the concept class of intervals on the real line an agent using a bounded number of queries per oracle exists. In contrast, for the concept class of unions of two intervals on the real line no agent with a bounded number of queries per oracle exists. We then investigate the (amortized) number of queries per oracle needed for the shaped search problem over other concept classes. We explore the following methods to obtain efficient agents. For axis-parallel rectangles we use a bootstrapping technique to obtain gradually better approximations of the target concept. We show that given rectangles R ⊆ A ⊆ R d one can obtain a rectangle A ⊇ R with vol(A )/vol(R) ≤ 2, using only O(d · vol(A)/vol(R)) random samples from A. For ellipsoids of bounded eccentricity in Rd we analyze a deterministic ray-shooting process which uses a sequence of rays to get close to the centroid. Finally, we use algorithms for generating random points in convex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for the concept class of convex bodies. In the final section, we explore connections between our bootstrapping method and active learning. Specifically, we use the bootstrapping technique for axis-parallel rectangles to active learn axis-parallel rectangles under the uniform distribution in O(d ln(1/ε)) labeled samples. Keywords: computational learning theory, behavioral shaping, active learning

2 0.15716054 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring

Author: András György, Tamás Linder, Gábor Lugosi, György Ottucsák

Abstract: The on-line shortest path problem is considered under various models of partial monitoring. Given a weighted directed acyclic graph whose edge weights can change in an arbitrary (adversarial) way, a decision maker has to choose in each round of a game a path between two distinguished vertices such that the loss of the chosen path (defined as the sum of the weights of its composing edges) be as small as possible. In a setting generalizing the multi-armed bandit problem, after choosing a path, the decision maker learns only the weights of those edges that belong to the chosen path. For this problem, an algorithm is given whose average cumulative loss in n rounds exceeds that of the best path, matched off-line to the entire sequence of the edge weights, by a quantity that is proportional to 1/√n and depends only polynomially on the number of edges of the graph. The algorithm can be implemented with complexity that is linear in the number of rounds n (i.e., the average complexity per round is constant) and in the number of edges. An extension to the so-called label efficient setting is also given, in which the decision maker is informed about the weights of the edges corresponding to the chosen path at a total of m ≪ n time instances. Another extension is shown where the decision maker competes against a time-varying path, a generalization of the problem of tracking the best expert. A version of the multi-armed bandit setting for shortest path is also discussed where the decision maker learns only the total weight of the chosen path but not the weights of the individual edges on the path. Applications to routing in packet switched networks along with simulation results are also presented. Keywords: on-line learning, shortest path problem, multi-armed bandit problem

3 0.11474141 74 jmlr-2007-Separating Models of Learning from Correlated and Uncorrelated Data     (Special Topic on the Conference on Learning Theory 2005)

Author: Ariel Elbaz, Homin K. Lee, Rocco A. Servedio, Andrew Wan

Abstract: We consider a natural framework of learning from correlated data, in which successive examples used for learning are generated according to a random walk over the space of possible examples. A recent paper by Bshouty et al. (2003) shows that the class of polynomial-size DNF formulas is efficiently learnable in this random walk model; this result suggests that the Random Walk model is more powerful than comparable standard models of learning from independent examples, in which similarly efficient DNF learning algorithms are not known. We give strong evidence that the Random Walk model is indeed more powerful than the standard model, by showing that if any cryptographic one-way function exists (a universally held belief in cryptography), then there is a class of functions that can be learned efficiently in the Random Walk setting but not in the standard setting where all examples are independent. Keywords: random walks, uniform distribution learning, cryptographic hardness, correlated data, PAC learning

4 0.10265323 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

Author: Vitaly Feldman

Abstract: We consider the problems of attribute-efficient PAC learning of two well-studied concept classes: parity functions and DNF expressions over {0, 1}n . We show that attribute-efficient learning of parities with respect to the uniform distribution is equivalent to decoding high-rate random linear codes from low number of errors, a long-standing open problem in coding theory. This is the first evidence that attribute-efficient learning of a natural PAC learnable concept class can be computationally hard. An algorithm is said to use membership queries (MQs) non-adaptively if the points at which the algorithm asks MQs do not depend on the target concept. Using a simple non-adaptive parity learning algorithm and a modification of Levin’s algorithm for locating a weakly-correlated parity due to Bshouty et al. (1999), we give the first non-adaptive and attribute-efficient algorithm for ˜ learning DNF with respect to the uniform distribution. Our algorithm runs in time O(ns4 /ε) and ˜ uses O(s4 · log2 n/ε) non-adaptive MQs, where s is the number of terms in the shortest DNF representation of the target concept. The algorithm improves on the best previous algorithm for learning DNF (of Bshouty et al., 1999) and can also be easily modified to tolerate random persistent classification noise in MQs. Keywords: attribute-efficient, non-adaptive, membership query, DNF, parity function, random linear code

5 0.086547874 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Marco Loog

Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization

6 0.073179118 9 jmlr-2007-AdaBoost is Consistent

7 0.057647649 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

8 0.053149424 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

9 0.051870819 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

10 0.047896266 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

11 0.043185785 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning

12 0.038114693 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes

13 0.035920423 70 jmlr-2007-Ranking the Best Instances

14 0.034595061 47 jmlr-2007-Learning Horn Expressions with LOGAN-H

15 0.033447195 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

16 0.032584164 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers

17 0.031244662 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data

18 0.028505331 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

19 0.027301777 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

20 0.0267908 42 jmlr-2007-Infinitely Imbalanced Logistic Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.192), (1, -0.077), (2, -0.221), (3, -0.184), (4, -0.194), (5, 0.088), (6, 0.079), (7, 0.297), (8, -0.081), (9, -0.158), (10, -0.117), (11, -0.01), (12, 0.079), (13, -0.015), (14, -0.023), (15, -0.047), (16, 0.051), (17, -0.03), (18, -0.091), (19, -0.127), (20, -0.035), (21, 0.029), (22, 0.058), (23, -0.153), (24, 0.046), (25, -0.016), (26, -0.1), (27, -0.019), (28, 0.031), (29, 0.032), (30, -0.063), (31, -0.086), (32, 0.056), (33, 0.005), (34, -0.003), (35, -0.036), (36, -0.122), (37, -0.029), (38, -0.027), (39, 0.067), (40, 0.031), (41, -0.046), (42, 0.136), (43, -0.027), (44, -0.015), (45, 0.025), (46, 0.139), (47, -0.124), (48, -0.174), (49, 0.088)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97203362 14 jmlr-2007-Behavioral Shaping for Geometric Concepts

Author: Manu Chhabra, Robert A. Jacobs, Daniel Štefankovič

Abstract: In a search problem, an agent uses the membership oracle of a target concept to find a positive example of the concept. In a shaped search problem the agent is aided by a sequence of increasingly restrictive concepts leading to the target concept (analogous to behavioral shaping). The concepts are given by membership oracles, and the agent has to find a positive example of the target concept while minimizing the total number of oracle queries. We show that for the concept class of intervals on the real line an agent using a bounded number of queries per oracle exists. In contrast, for the concept class of unions of two intervals on the real line no agent with a bounded number of queries per oracle exists. We then investigate the (amortized) number of queries per oracle needed for the shaped search problem over other concept classes. We explore the following methods to obtain efficient agents. For axis-parallel rectangles we use a bootstrapping technique to obtain gradually better approximations of the target concept. We show that given rectangles R ⊆ A ⊆ R d one can obtain a rectangle A ⊇ R with vol(A )/vol(R) ≤ 2, using only O(d · vol(A)/vol(R)) random samples from A. For ellipsoids of bounded eccentricity in Rd we analyze a deterministic ray-shooting process which uses a sequence of rays to get close to the centroid. Finally, we use algorithms for generating random points in convex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for the concept class of convex bodies. In the final section, we explore connections between our bootstrapping method and active learning. Specifically, we use the bootstrapping technique for axis-parallel rectangles to active learn axis-parallel rectangles under the uniform distribution in O(d ln(1/ε)) labeled samples. Keywords: computational learning theory, behavioral shaping, active learning

2 0.6200462 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring

Author: András György, Tamás Linder, Gábor Lugosi, György Ottucsák

Abstract: The on-line shortest path problem is considered under various models of partial monitoring. Given a weighted directed acyclic graph whose edge weights can change in an arbitrary (adversarial) way, a decision maker has to choose in each round of a game a path between two distinguished vertices such that the loss of the chosen path (defined as the sum of the weights of its composing edges) be as small as possible. In a setting generalizing the multi-armed bandit problem, after choosing a path, the decision maker learns only the weights of those edges that belong to the chosen path. For this problem, an algorithm is given whose average cumulative loss in n rounds exceeds that of the best path, matched off-line to the entire sequence of the edge weights, by a quantity that is proportional to 1/√n and depends only polynomially on the number of edges of the graph. The algorithm can be implemented with complexity that is linear in the number of rounds n (i.e., the average complexity per round is constant) and in the number of edges. An extension to the so-called label efficient setting is also given, in which the decision maker is informed about the weights of the edges corresponding to the chosen path at a total of m ≪ n time instances. Another extension is shown where the decision maker competes against a time-varying path, a generalization of the problem of tracking the best expert. A version of the multi-armed bandit setting for shortest path is also discussed where the decision maker learns only the total weight of the chosen path but not the weights of the individual edges on the path. Applications to routing in packet switched networks along with simulation results are also presented. Keywords: on-line learning, shortest path problem, multi-armed bandit problem

3 0.52240473 74 jmlr-2007-Separating Models of Learning from Correlated and Uncorrelated Data     (Special Topic on the Conference on Learning Theory 2005)

Author: Ariel Elbaz, Homin K. Lee, Rocco A. Servedio, Andrew Wan

Abstract: We consider a natural framework of learning from correlated data, in which successive examples used for learning are generated according to a random walk over the space of possible examples. A recent paper by Bshouty et al. (2003) shows that the class of polynomial-size DNF formulas is efficiently learnable in this random walk model; this result suggests that the Random Walk model is more powerful than comparable standard models of learning from independent examples, in which similarly efficient DNF learning algorithms are not known. We give strong evidence that the Random Walk model is indeed more powerful than the standard model, by showing that if any cryptographic one-way function exists (a universally held belief in cryptography), then there is a class of functions that can be learned efficiently in the Random Walk setting but not in the standard setting where all examples are independent. Keywords: random walks, uniform distribution learning, cryptographic hardness, correlated data, PAC learning

4 0.43417904 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

Author: Vitaly Feldman

Abstract: We consider the problems of attribute-efficient PAC learning of two well-studied concept classes: parity functions and DNF expressions over {0, 1}n . We show that attribute-efficient learning of parities with respect to the uniform distribution is equivalent to decoding high-rate random linear codes from low number of errors, a long-standing open problem in coding theory. This is the first evidence that attribute-efficient learning of a natural PAC learnable concept class can be computationally hard. An algorithm is said to use membership queries (MQs) non-adaptively if the points at which the algorithm asks MQs do not depend on the target concept. Using a simple non-adaptive parity learning algorithm and a modification of Levin’s algorithm for locating a weakly-correlated parity due to Bshouty et al. (1999), we give the first non-adaptive and attribute-efficient algorithm for ˜ learning DNF with respect to the uniform distribution. Our algorithm runs in time O(ns4 /ε) and ˜ uses O(s4 · log2 n/ε) non-adaptive MQs, where s is the number of terms in the shortest DNF representation of the target concept. The algorithm improves on the best previous algorithm for learning DNF (of Bshouty et al., 1999) and can also be easily modified to tolerate random persistent classification noise in MQs. Keywords: attribute-efficient, non-adaptive, membership query, DNF, parity function, random linear code

5 0.32761905 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Marco Loog

Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization

6 0.28692836 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

7 0.2710312 9 jmlr-2007-AdaBoost is Consistent

8 0.23105942 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

9 0.21431598 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes

10 0.2046756 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

11 0.1875308 47 jmlr-2007-Learning Horn Expressions with LOGAN-H

12 0.18724579 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

13 0.18500178 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning

14 0.17934772 70 jmlr-2007-Ranking the Best Instances

15 0.16020526 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

16 0.14289142 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

17 0.13984133 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results

18 0.13769518 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

19 0.13608694 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

20 0.13488393 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.014), (8, 0.031), (10, 0.023), (12, 0.025), (15, 0.019), (28, 0.044), (40, 0.048), (45, 0.015), (48, 0.03), (60, 0.05), (61, 0.446), (80, 0.014), (85, 0.068), (98, 0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71686989 14 jmlr-2007-Behavioral Shaping for Geometric Concepts

Author: Manu Chhabra, Robert A. Jacobs, Daniel Štefankovič

Abstract: In a search problem, an agent uses the membership oracle of a target concept to find a positive example of the concept. In a shaped search problem the agent is aided by a sequence of increasingly restrictive concepts leading to the target concept (analogous to behavioral shaping). The concepts are given by membership oracles, and the agent has to find a positive example of the target concept while minimizing the total number of oracle queries. We show that for the concept class of intervals on the real line an agent using a bounded number of queries per oracle exists. In contrast, for the concept class of unions of two intervals on the real line no agent with a bounded number of queries per oracle exists. We then investigate the (amortized) number of queries per oracle needed for the shaped search problem over other concept classes. We explore the following methods to obtain efficient agents. For axis-parallel rectangles we use a bootstrapping technique to obtain gradually better approximations of the target concept. We show that given rectangles R ⊆ A ⊆ R d one can obtain a rectangle A ⊇ R with vol(A )/vol(R) ≤ 2, using only O(d · vol(A)/vol(R)) random samples from A. For ellipsoids of bounded eccentricity in Rd we analyze a deterministic ray-shooting process which uses a sequence of rays to get close to the centroid. Finally, we use algorithms for generating random points in convex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for the concept class of convex bodies. In the final section, we explore connections between our bootstrapping method and active learning. Specifically, we use the bootstrapping technique for axis-parallel rectangles to active learn axis-parallel rectangles under the uniform distribution in O(d ln(1/ε)) labeled samples. Keywords: computational learning theory, behavioral shaping, active learning

2 0.27731478 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire

Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1 + ℓ2² style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling

3 0.27581358 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

4 0.27510694 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

5 0.27173775 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen

Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation

6 0.27076423 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

7 0.27039555 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

8 0.27033725 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

9 0.267501 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

10 0.26680857 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

11 0.26677662 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

12 0.26655245 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

13 0.26561955 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

14 0.2655881 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

15 0.26529604 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

16 0.26516789 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions (Special Topic on the Conference on Learning Theory 2005)

17 0.26467049 77 jmlr-2007-Stagewise Lasso

18 0.26376998 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters (Special Topic on Model Selection)

19 0.2636835 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

20 0.26230139 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
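
For completeness, a small sketch of how a simValue could be derived from two papers' topic distributions is given below; the cosine measure and the second paper's weights are hypothetical, made up purely for illustration, since the document does not state how simValue is actually defined.

# Minimal sketch: cosine similarity between sparse {topicId: weight} vectors
# (illustrative only; the measure actually used for simValue is not stated).
import math

def cosine(u, v):
    """Cosine similarity of two sparse topic-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Truncated weights for paper 14, taken from the topicWeight list above.
paper_14 = {28: 0.044, 40: 0.048, 60: 0.05, 61: 0.446, 85: 0.068, 98: 0.076}
# Hypothetical weights for a second paper, for illustration only.
paper_53 = {40: 0.03, 61: 0.31, 85: 0.05, 98: 0.12}

print(round(cosine(paper_14, paper_53), 8))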