nips nips2013 nips2013-135 knowledge-graph by maker-knowledge-mining

135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms


Source: pdf

Author: Yu Zhang

Abstract: All the existing multi-task local learning methods are defined on a homogeneous neighborhood, which consists of data points from only one task. In this paper, different from existing methods, we propose local learning methods for multi-task classification and regression problems based on a heterogeneous neighborhood, which is defined on data points from all tasks. Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks, where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend kernel regression to the multi-task setting in a similar way to the classification case. Experiments on toy data and real-world datasets demonstrate the effectiveness of our proposed methods.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 All the existing multi-task local learning methods are defined on a homogeneous neighborhood, which consists of data points from only one task. [sent-4, score-0.211]

2 In this paper, different from existing methods, we propose local learning methods for multi-task classification and regression problems based on a heterogeneous neighborhood, which is defined on data points from all tasks. [sent-5, score-0.298]

3 Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks, where the weights are task-specific. [sent-6, score-0.134]

4 For regression problems, we extend kernel regression to the multi-task setting in a similar way to the classification case. [sent-8, score-0.129]

5 Multi-task learning utilizes supervised information from some related tasks to improve the performance of one task at hand, and during the past decades many advanced methods have been proposed for multi-task learning, e. [sent-20, score-0.14]

6 The KNN classifiers used in both methods are defined on a homogeneous neighborhood, which is the set of nearest data points from the same task the query point belongs to. [sent-25, score-0.287]

7 In some situations, it is better to use a heterogeneous neighborhood, which is defined as the set of nearest data points from all tasks. [sent-26, score-0.231]

8 However, if we can use the data points from both tasks to define the neighborhood (i. [sent-30, score-0.186]
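
For concreteness (a minimal sketch, not the paper's code), the helper below builds both kinds of neighborhood for a query point; the arrays X and t and the Euclidean distance are assumptions.

```python
import numpy as np

def neighborhoods(X, t, i, k):
    """Return homogeneous and heterogeneous k-nearest-neighbor index sets for
    query point i. X: (n, D) data matrix, t: (n,) task indicators.
    Euclidean distance is assumed; the paper's similarity is not specified here."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                              # exclude the query point itself
    order = np.argsort(d)
    heterogeneous = order[:k]                  # k nearest points over all tasks
    same_task = order[t[order] == t[i]]        # restrict to the query's own task
    homogeneous = same_task[:k]
    return homogeneous, heterogeneous
```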

9 For multi-task classification problems, we extend the KNN classifier by formulating the decision function on each data point as weighted voting of its neighbors from all tasks where the weights are task-specific. [sent-34, score-0.134]

10 Since multi-task learning usually considers that the contribution of one task to another one equals that in the reverse direction (Figure 1), we define a regularizer to enforce the task-specific (i. [sent-35, score-0.144]

11 The training set consists of n triples {(xi, yi, ti)}_{i=1}^{n}, with the ith data point xi ∈ R^D, its label yi ∈ {−1, 1} and its task indicator ti ∈ {1, . [sent-49, score-0.307]

12 So each task is a binary classification problem with ni = |{j|tj = i}| data points belonging to the ith task Ti . [sent-53, score-0.297]

13 For the ith data point xi , we use Nk (i) to denote the set of the indices of its k nearest neighbors. [sent-54, score-0.133]

14 If Nk(i) is a homogeneous neighborhood which only contains data points from the task that xi belongs to, we can use d(xi) = sgn(Σ_{j∈Nk(i)} s(i, j) yj) to make a decision for xi, where sgn(·) denotes the sign function and s(i, j) denotes a similarity function between xi and xj. [sent-55, score-0.398]

15 When wqr (q ≠ r) approaches wqq, it means Tr is very similar to Tq in local regions. [sent-60, score-0.417]

16 At another extreme where wqr (q ≠ r) approaches −wqq, if we flip the labels of data points in Tr, Tr can have a positive contribution −wqr to Tq, which indicates that Tr is negatively correlated to Tq. [sent-61, score-0.315]

17 Moreover, when wqr/wqq (q ≠ r) is close to 0, which implies there is no contribution from Tr to Tq, Tr is likely to be unrelated to Tq. [sent-62, score-0.3]

18 So the utilization of {wqr } can model three task relationships: positive task correlation, negative task correlation and task unrelatedness as in [6, 20]. [sent-63, score-0.318]

19 We use f(xi) to define the estimation function as f(xi) = Σ_{j∈Nk(i)} w_{ti,tj} s(i, j) yj. [sent-64, score-0.213]
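
A small sketch of this estimation function (not the paper's code): each neighbor's vote is scaled by the similarity s(i, j) and the task-specific weight w_{ti, tj}. The Gaussian similarity and the array names are assumptions.

```python
import numpy as np

def similarity(xi, xj, sigma=1.0):
    # Gaussian similarity; the concrete choice of s(i, j) is an assumption
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

def mt_knn_decision(i, neighbors, X, y, t, W, sigma=1.0):
    """Sketch of f(x_i) = sum_{j in N_k(i)} w_{t_i, t_j} s(i, j) y_j."""
    score = sum(W[t[i], t[j]] * similarity(X[i], X[j], sigma) * y[j]
                for j in neighbors)
    return int(np.sign(score))                 # predicted label in {-1, +1}
```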

20 Moreover, recall that wqr represents the contribution of Tr to Tq and wrq is the contribution of Tq to Tr . [sent-66, score-0.396]

21 Since multi-task learning usually considers that the contribution of Tr to Tq almost equals that of Tq to Tr , we expect wqr to be close to wrq . [sent-67, score-0.365]

22 To encode this prior information into our model, we can either formulate it as wqr = wrq, a hard constraint, or as a soft regularizer, i. [sent-68, score-0.334]

23 , minimizing (wqr − wrq)^2 to enforce wqr ≈ wrq, which is preferred. [sent-70, score-0.425]

24 wqq ≥ 0, wqq ≥ wqr ≥ −wqq (q ≠ r),   (1) where W is an m × m matrix with wqr as its (q, r)th element and ‖·‖_F denotes the Frobenius norm of a matrix. [sent-73, score-0.756]

25 The first term in the objective function of problem (1) measures the training loss, the second one enforces W to be a symmetric matrix which implies wqr ≈ wrq , and the last one penalizes the complexity of W. [sent-74, score-0.334]

26 We first rewrite f(xi) as f(xi) = Σ_{j=1}^{m} w_{ti,j} Σ_{l∈N_k^j(i)} s(i, l) yl = w_{ti} x̂_i, where N_k^j(i) denotes the set of the indices of xi's nearest neighbors from the jth task in Nk(i). [sent-78, score-0.967]

27 Here w_{ti} = (w_{ti,1}, . . . , w_{ti,m}) is the ti-th row of W, and x̂_i is an m × 1 vector with the jth element equal to Σ_{l∈N_k^j(i)} s(i, l) yl. [sent-81, score-0.406]

28 Then we can reformulate problem (1) as min_W Σ_{i=1}^{m} Σ_{j∈Ti} l(yj, wi x̂_j) + (λ1/4) ‖W − W^T‖_F^2 + (λ2/2) ‖W‖_F^2 s.t. [sent-82, score-0.113]

29 By adopting the hinge loss in problem (2), the optimization problem for wik (k ≠ i) is formulated as min_{wik} (λ/2) wik^2 − βik wik + Σ_{j∈Ti} max(0, a_j^{ik} wik + b_j^{ik}) s.t. [sent-86, score-2.011]

30 cik ≤ wik ≤ eik,   (3) where λ = λ1 + λ2, βik = λ1 wki, x̂_{jk} is the kth element of x̂_j, a_j^{ik} = −yj x̂_{jk}, b_j^{ik} = 1 − yj Σ_{t≠k} wit x̂_{jt}, cik = −wii, and eik = wii. [sent-88, score-2.602]

31 We assume all a_j^{ik} are non-zero; otherwise we can discard them without affecting the solution, since the corresponding losses are constants. [sent-94, score-0.22]

32 We define six index sets as C1 = {j | a_j^{ik} > 0, −b_j^{ik}/a_j^{ik} < cik}, C2 = {j | a_j^{ik} > 0, cik ≤ −b_j^{ik}/a_j^{ik} ≤ eik}, C3 = {j | a_j^{ik} > 0, −b_j^{ik}/a_j^{ik} > eik}, C4 = {j | a_j^{ik} < 0, −b_j^{ik}/a_j^{ik} < cik}, C5 = {j | a_j^{ik} < 0, cik ≤ −b_j^{ik}/a_j^{ik} ≤ eik}, and C6 = {j | a_j^{ik} < 0, −b_j^{ik}/a_j^{ik} > eik}. [sent-95, score-5.308]

33 It is easy to show that when j ∈ C1 ∪ C6, where the operator ∪ denotes the union of sets, a_j^{ik} w + b_j^{ik} > 0 holds for w ∈ [cik, eik], corresponding to the set of data points with non-zero loss. [sent-96, score-2.671]

34 Oppositely, when j ∈ C3 ∪ C4, the values of the corresponding losses become zero, since a_j^{ik} w + b_j^{ik} ≤ 0 holds for w ∈ [cik, eik]. [sent-97, score-1.02]

35 Moreover, we also keep an index mapping q^u, with its rth element q^u_r defined as q^u_r = j if u_r = −b_j^{ik}/a_j^{ik}. [sent-103, score-0.543]

36 Similarly, for the sequence {−b_j^{ik}/a_j^{ik} | j ∈ C5}, we define a sorted vector v of length dv and the corresponding index mapping q^v. [sent-104, score-0.569]

37 By using the merge-sort algorithm, we merge u and v into a sorted vector s and then we add cik and eik into s as the minimum and maximum elements if they are not contained in s. [sent-105, score-0.364]

38 Obviously, in the range [sl, sl+1], where sl is the lth element of s and ds is the length of s, problem (3) becomes a univariate QP problem which has an analytical solution. [sent-106, score-0.178]

39 We then solve this univariate QP on each interval [sl, sl+1] (l = 1, . . . , ds − 1) and get the global minimum over the region [cik, eik] by comparing all local optima. [sent-110, score-0.258]
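
The interval-search idea described above can be sketched as follows; this is a simplified illustration rather than the algorithm of Table 1, the variable names are assumptions, and the C1-C6 bookkeeping needed for the stated complexity is omitted.

```python
import numpy as np

def solve_univariate_hinge_qp(lam, beta, a, b, c, e):
    """Minimize (lam/2)*w^2 - beta*w + sum_j max(0, a[j]*w + b[j]) over [c, e].
    Simplified sketch of the breakpoint/interval search described in the text."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a, b = a[a != 0], b[a != 0]                      # zero-slope losses are constants
    obj = lambda w: 0.5 * lam * w**2 - beta * w + np.maximum(0.0, a * w + b).sum()
    pts = -b / a                                     # hinge breakpoints
    s = np.unique(np.concatenate(([c, e], pts[(pts > c) & (pts < e)])))
    best_w, best_f = c, obj(c)
    for lo, hi in zip(s[:-1], s[1:]):
        mid = 0.5 * (lo + hi)
        slope = a[a * mid + b > 0].sum()             # slope added by the active hinges
        w = float(np.clip((beta - slope) / lam, lo, hi))
        for cand in (w, hi):                         # quadratic optimum and endpoint
            if obj(cand) < best_f:
                best_w, best_f = cand, obj(cand)
    return best_w
```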

40 In step 2, we need to sort two sequences to obtain u and v in O(du ln du + dv ln dv) time and merge the two sequences to get s in O(du + dv) time. [sent-113, score-0.332]

41 The overall complexity of the algorithm in Table 1 is O(du ln du + dv ln dv + ni ) which is at most O(ni ln ni ) due to du + dv ≤ ni . [sent-116, score-0.675]

42 The main difference between problems (3) and (4) is that there exists a box constraint on wik in problem (3), whereas in problem (4) wii is only lower-bounded. [sent-120, score-0.933]

43 For wii ∈ [ei, +∞), the objective function of problem (4) can be reformulated as (λ2/2) wii^2 + Σ_{j∈S} (a_j^i wii + b_j^i), where S = {j | a_j^i > 0}, and the minimum value over [ei, +∞) is attained at wii^(1) = max{ei, −Σ_{j∈S} a_j^i / λ2}. [sent-122, score-2.461]

44 Then we can use the algorithm in Table 1 to find the minimizer wii^(2) in the interval [ci, ei] for problem (4). [sent-123, score-0.589]

45 Finally, we can choose the optimal solution to problem (4) from {wii^(1), wii^(2)} by comparing the corresponding values of the objective function. [sent-124, score-0.552]

46 Since the complexity of solving both problems (3) and (4) is O(ni ln ni), the complexity of one update of the whole matrix W is O(m Σ_{i=1}^{m} ni ln ni). [sent-125, score-0.331]

47 It is easy to show that problem (3) has an analytical solution wik = min{ max{ cik, (βik − 2 Σ_{j∈Ti} a_j^{ik} b_j^{ik}) / (λ + 2 Σ_{j∈Ti} (a_j^{ik})^2) }, eik }, and the solution to problem (4) can be computed as wii = max{ ci, −2 Σ_{j∈Ti} a_j^i b_j^i / (λ2 + 2 Σ_{j∈Ti} (a_j^i)^2) }. [sent-130, score-1.391]
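
For the square-loss case, the closed forms just quoted amount to clipped least-squares updates; below is a minimal sketch following the reconstructed formulas above (the coefficients should be checked against the paper, and all names are assumptions).

```python
import numpy as np

def closed_form_w_ik(lam, beta_ik, a, b, c_ik, e_ik):
    # Square-loss solution of problem (3): a ratio of sums clipped to [c_ik, e_ik]
    a, b = np.asarray(a, float), np.asarray(b, float)
    w = (beta_ik - 2.0 * np.dot(a, b)) / (lam + 2.0 * np.dot(a, a))
    return float(np.clip(w, c_ik, e_ik))

def closed_form_w_ii(lam2, a, b, c_i):
    # Square-loss solution of problem (4): only lower-bounded by c_i
    a, b = np.asarray(a, float), np.asarray(b, float)
    return max(c_i, -2.0 * np.dot(a, b) / (lam2 + 2.0 * np.dot(a, a)))
```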

48 3 A Multi-Task Local Regressor based on Heterogeneous Neighborhood. In this section, we consider the situation where each task is a regression problem with each label yi ∈ R. [sent-132, score-0.119]

49 We can see that the expression in the brackets on the right-hand side can be viewed as a prediction for xi based on its neighbors in the jth task. [sent-137, score-0.367]

50 Inspired by this observation, we can construct a prediction ŷ_i^j for xi based on its neighbors from the jth task by utilizing any regressor, e. [sent-138, score-0.365]

51 Here, due to the local nature of our proposed method, we choose the kernel regression method, which is a local regression method, as a good candidate, and hence ŷ_i^j is formulated as ŷ_i^j = (Σ_{l∈N_k^j(i)} s(i, l) yl) / (Σ_{l∈N_k^j(i)} s(i, l)). [sent-141, score-0.55]

52 When j equals ti, which means we use neighboring data points from the task that xi belongs to, we can use this prediction with confidence. [sent-142, score-0.238]

53 However, if j ≠ ti, we cannot totally trust the prediction and need to add some weight w_{ti,j} as a confidence. [sent-143, score-0.292]

54 Then, by using the square loss, we formulate an optimization problem to get the estimation function f(xi) based on {ŷ_i^j} as f(xi) = argmin_y Σ_{j=1}^{m} w_{ti,j} (y − ŷ_i^j)^2 = (Σ_{j=1}^{m} w_{ti,j} ŷ_i^j) / (Σ_{j=1}^{m} w_{ti,j}).   (6) [sent-144, score-0.957]
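
A sketch of the resulting estimator in Eq. (6), not the paper's code: per-task kernel-regression predictions ŷ_i^j over the task-j neighbors in Nk(i) are combined with the task-specific weights; the Gaussian similarity and the handling of tasks without neighbors are assumptions.

```python
import numpy as np

def mt_kernel_regression(i, neighbors, X, y, t, W, sigma=1.0):
    """Sketch of Eq. (6): f(x_i) = sum_j w_{t_i,j} yhat_i^j / sum_j w_{t_i,j}."""
    num, den = 0.0, 0.0
    for j in range(W.shape[0]):                       # loop over the m tasks
        nbrs = [l for l in neighbors if t[l] == j]    # N_k^j(i)
        if not nbrs:                                  # skipping empty tasks is an assumption
            continue
        s = np.array([np.exp(-np.linalg.norm(X[i] - X[l]) ** 2 / (2.0 * sigma ** 2))
                      for l in nbrs])
        y_hat_j = float(np.dot(s, y[nbrs]) / s.sum()) # kernel regression on task j
        num += W[t[i], j] * y_hat_j
        den += W[t[i], j]
    return num / den
```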

55 Compared with the regression function of the direct extension of kernel regression to multi-task learning in Eq. [sent-145, score-0.129]

56 Since the scale of wij does not affect the value of the estimation function in Eq. [sent-148, score-0.227]

57 Moreover, the estimation ŷ_i^{ti} obtained using data from the same task as xi is more trustworthy than the estimations based on other tasks, which suggests that wii should be the largest among the elements in the ith row. [sent-155, score-0.701]

58 Then this constraint implies that wii ≥ (1/m) Σ_k wik = 1/m > 0. [sent-156, score-0.763]

59 To capture the negative task correlations, wij (i ≠ j) is only required to be a real scalar with wij ≥ −wii. [sent-157, score-0.526]

60 Combining the above considerations, we formulate an optimization problem as min_W Σ_{i=1}^{m} Σ_{j∈Ti} (wi ŷ_j − yj)^2 + (λ1/4) ‖W − W^T‖_F^2 + (λ2/2) ‖W‖_F^2 s.t. [sent-158, score-0.278]

61 W1 = 1, wii ≥ wij ≥ −wii , (7) where 1 denotes a vector of all ones with appropriate size and yj = (ˆ1 , . [sent-160, score-0.918]

62 In the following section, we discuss how to optimize problem (7). [sent-164, score-0.139]

63 We update each row one by one, and the optimization problem with respect to wi is formulated as min_{wi} (1/2) wi A wi^T + wi b^T s.t. Σ_{j=1}^{m} wij = 1, −wii ≤ wij ≤ wii ∀j ≠ i, [sent-168, score-1.497]

64 (8), where A = 2 Σ_{j∈Ti} ŷ_j ŷ_j^T + λ1 Ii + λ2 Im, Im is an m × m identity matrix, Ii is a copy of Im with the (i, i)th element set to 0, b = −2 Σ_{j∈Ti} yj ŷ_j^T − λ1 c_i^T, and c_i is the ith column of W with its ith element set to 0. [sent-170, score-0.699]

65 We define the Lagrangian as J = (1/2) wi A wi^T + wi b^T − α(Σ_{j=1}^{m} wij − 1) − Σ_{j≠i} (wii − wij) βj − Σ_{j≠i} (wii + wij) γj. [sent-171, score-0.907]

66 (12) we have γj = 0 and further wi aj + bj = α − βj ≤ α according to Eq. [sent-175, score-0.503]

67 (11) we can get βj = 0 and then wi aj + bj = α + γj ≥ α. [sent-178, score-0.503]

68 , −wii < wij < wii ), γj = βj = 0 according to Eqs. [sent-181, score-0.779]

69 (11) and (12), which implies that wi aj + bj = α. [sent-182, score-0.503]

70 Eq. (10) implies that wi ai + bi = α + Σ_{k≠i} (βk + γk) ≥ α. [sent-184, score-0.181]

71 We define the sets S1 = {j | wij = wii, j ≠ i}, S2 = {j | −wii < wij < wii}, S3 = {j | wij = −wii}, and S4 = {i}. [sent-185, score-1.883]

72 Then a feasible wi is a stationary point of problem (8) if and only if max_{j∈S1∪S2} {wi aj + bj} ≤ min_{k∈S2∪S3∪S4} {wi ak + bk}. [sent-186, score-0.503]

73 If there exists a pair of indices (j, k), where j ∈ S1 ∪ S2 and k ∈ S2 ∪ S3 ∪ S4, satisfying wi aj + bj > wi ak + bk, then {j, k} is called a violating pair. [sent-187, score-0.658]

74 If the current estimation wi is not an optimal solution, there should exist some violating pairs. [sent-188, score-0.155]

75 We define the update rule for wij and wik as wij = wij + t and wik = wik − t. [sent-190, score-1.314]

76 By noting that j cannot be i, t should satisfy the following constraints to make the updated solution feasible: when k = i, t − wik ≤ wij + t ≤ wik − t and t − wik ≤ wil ≤ wik − t for all l ≠ j, l ≠ k; when k ≠ i, −wii ≤ wij + t ≤ wii and −wii ≤ wik − t ≤ wii. [sent-191, score-2.643]

77 When k = i, there will be a constraint on t as t ≤ e ≡ min{ (wik − wij)/2, min_{l≠j, l≠k} (wik − |wil|) }; otherwise t will satisfy c ≤ t ≤ e, where c = max(wik − wii, −wij − wii) and e = min(wii − wij, wii + wik). [sent-192, score-2.318]

78 Then the optimization problem for t can be unified as min_t ((ajj + aii − 2aji)/2) t^2 + (wi aj + bj − wi ai − bi) t s.t. [sent-193, score-0.626]

79 This problem has an analytical solution as t = (wi ai + bi − wi aj − bj) / (ajj + aii − 2aji), clipped to the feasible range of t. [sent-196, score-0.394]
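
Putting the pieces together, one such update of a row wi of problem (8) might look like the sketch below; the index naming, the gradient form, and the lower bound t ≥ 0 in the k = i case are assumptions based on the reconstructed formulas above, not a definitive implementation.

```python
import numpy as np

def pairwise_update(w, A, b, i, j, k):
    """One violating-pair step for row w (= w_i): w[j] += t, w[k] -= t, keeping
    sum(w) = 1 and the box constraints of problem (8). A sketch only."""
    grad = w @ A + b                        # gradient of 0.5*w A w^T + w b^T (A symmetric)
    if k == i:                              # the diagonal entry w_ii itself changes
        bounds = [(w[k] - w[j]) / 2.0] + \
                 [w[k] - abs(w[l]) for l in range(len(w)) if l not in (j, k)]
        c, e = 0.0, min(bounds)             # the lower bound 0 is an assumption
    else:
        c = max(w[k] - w[i], -w[j] - w[i])
        e = min(w[i] - w[j], w[i] + w[k])
    curv = A[j, j] + A[k, k] - 2.0 * A[j, k]
    t = (grad[k] - grad[j]) / curv if curv > 0 else e
    t = float(np.clip(t, c, e))
    w[j] += t
    w[k] -= t
    return w
```

Repeating such steps over violating pairs, and cycling through the rows, gives one coordinate-descent pass over W.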

80 We randomly select p percent of data points to form the training set of two learning tasks respectively. [sent-204, score-0.134]

81 The regularization parameters λ1 and λ2 are fixed as 1 and the number of nearest neighbors is set to 5. [sent-205, score-0.122]

82 wij (j ≠ i) is very close to wii for i = 1, 2. [sent-215, score-0.779]

83 This observation implies that our method can find that these two tasks are positively correlated, which matches our expectation since those two tasks are from the same distribution. [sent-216, score-0.136]

84 For the second experiment, we randomly select p percent of the data points to form the training sets of the two learning tasks, respectively, but this time we flip the labels of one task so that the two tasks should be negatively correlated. [sent-217, score-0.274]

85 Here those two tasks can be viewed as unrelated tasks since the label assignment is random. [sent-229, score-0.162]

86 , where wij (i ≠ j) is much smaller than wii. [sent-238, score-0.779]

87 In summary, our method can learn the positive correlations, negative correlations and task unrelatedness for those toy problems. [sent-240, score-0.137]

88 The first problem we investigate is a handwritten letter classification application consisting of seven tasks. (Table 2: Comparison of classification errors of different methods on the two classification problems in the form of mean±std.) [sent-243, score-0.157]

89 Each task contains about 1000 data points with 255 features for each class. [sent-268, score-0.113]

90 From the results, we can see that our method MT-KNN is better than KNN, mtLMNN and MTFL methods, which demonstrates that the introduction of the heterogeneous neighborhood is helpful to improve the performance. [sent-288, score-0.134]

91 For the different loss functions utilized by our method, MT-KNN with the hinge loss is better than that with the square loss, because the hinge loss is more robust than the square loss. [sent-289, score-0.288]

92 For those two problems, we also compare our proposed coordinate descent method described in Table 1 with some off-the-shelf solvers such as the CVX solver [11] with respect to the running time. [sent-290, score-0.116]

93 We record the average running time over 100 trials in Figure 2 and from the results we can see that on the classification problems above, our proposed coordinate descent method is much faster than the CVX solver which demonstrates the efficiency of our proposed method. [sent-293, score-0.148]

94 3 Experiments on Regression Problems. Here we study a multi-task regression problem to learn the inverse dynamics of a seven degree-of-freedom SARCOS anthropomorphic robot arm. [sent-295, score-0.125]

95 org/gpml/data/ on 21 input features, corresponding to seven joint positions, seven joint velocities and seven joint accelerations. [sent-302, score-0.144]

96 So each task corresponds to the prediction of one torque and can be formulated as a regression problem. [sent-303, score-0.158]

97 We find the running time of our proposed method is much smaller than that of the CVX solver which demonstrates that the proposed coordinate descent method can speed up the computation of our MT-KR method. [sent-313, score-0.116]

98 By changing the number of nearest neighbors from 5 to 40 at an interval of 5, we record the mean of the performance of our method over 10 trials in Figure 4. [sent-324, score-0.154]

99 5 Conclusion. Figure 4: Sensitivity analysis of the performance of our method with respect to the number of nearest neighbors on different data sets (Letter, USPS, Robot). [sent-332, score-0.122]

100 Based on the assumption that the two tasks in each task pair contribute to each other almost equally, we propose regularized objective functions and develop efficient coordinate descent methods to solve them. [sent-334, score-0.18]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wii', 0.552), ('wqr', 0.243), ('wij', 0.227), ('ik', 0.224), ('aj', 0.22), ('wti', 0.213), ('wik', 0.211), ('cik', 0.182), ('eik', 0.182), ('bj', 0.17), ('yj', 0.139), ('tq', 0.127), ('wqq', 0.122), ('wi', 0.113), ('sl', 0.102), ('mtfl', 0.091), ('wrq', 0.091), ('cvx', 0.086), ('ni', 0.081), ('ti', 0.079), ('neighborhood', 0.077), ('knn', 0.076), ('aik', 0.074), ('nk', 0.073), ('task', 0.072), ('du', 0.069), ('tasks', 0.068), ('dv', 0.067), ('bik', 0.067), ('neighbors', 0.066), ('tr', 0.06), ('classi', 0.059), ('heterogeneous', 0.057), ('nearest', 0.056), ('hinge', 0.054), ('qv', 0.054), ('local', 0.052), ('seven', 0.048), ('regression', 0.047), ('qu', 0.046), ('xi', 0.046), ('vancouver', 0.044), ('coordinate', 0.043), ('yl', 0.042), ('violating', 0.042), ('jth', 0.042), ('homogeneous', 0.041), ('letter', 0.041), ('points', 0.041), ('regressor', 0.041), ('british', 0.041), ('square', 0.04), ('qp', 0.039), ('formulated', 0.039), ('descent', 0.039), ('kr', 0.038), ('usps', 0.038), ('ei', 0.037), ('editors', 0.036), ('toy', 0.035), ('kernel', 0.035), ('ai', 0.035), ('solver', 0.034), ('bi', 0.033), ('columbia', 0.033), ('record', 0.032), ('contribution', 0.031), ('canada', 0.031), ('ith', 0.031), ('ln', 0.031), ('ajj', 0.03), ('awi', 0.03), ('minl', 0.03), ('mtlmnn', 0.03), ('unrelatedness', 0.03), ('wil', 0.03), ('robot', 0.03), ('ci', 0.029), ('sgn', 0.029), ('er', 0.028), ('sch', 0.027), ('xjk', 0.027), ('wit', 0.027), ('xjt', 0.027), ('parameswaran', 0.027), ('element', 0.026), ('maxj', 0.026), ('unrelated', 0.026), ('analytical', 0.026), ('solve', 0.026), ('percent', 0.025), ('loss', 0.025), ('aii', 0.025), ('smo', 0.025), ('multitask', 0.024), ('ds', 0.024), ('platt', 0.024), ('nmse', 0.023), ('rth', 0.023), ('thrun', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

Author: Yu Zhang

Abstract: All the existing multi-task local learning methods are defined on a homogeneous neighborhood, which consists of data points from only one task. In this paper, different from existing methods, we propose local learning methods for multi-task classification and regression problems based on a heterogeneous neighborhood, which is defined on data points from all tasks. Specifically, we extend the k-nearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks, where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend kernel regression to the multi-task setting in a similar way to the classification case. Experiments on toy data and real-world datasets demonstrate the effectiveness of our proposed methods.

2 0.099849164 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

Author: Haim Avron, Vikas Sindhwani, David Woodruff

Abstract: Motivated by the desire to extend fast randomized techniques to nonlinear lp regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems, additive models and approximations to recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than “input sparsity”. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. 1

3 0.070162535 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

4 0.067630716 75 nips-2013-Convex Two-Layer Modeling

Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans

Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1

5 0.064961366 274 nips-2013-Relevance Topic Model for Unstructured Social Group Activity Recognition

Author: Fang Zhao, Yongzhen Huang, Liang Wang, Tieniu Tan

Abstract: Unstructured social group activity recognition in web videos is a challenging task due to 1) the semantic gap between class labels and low-level visual features and 2) the lack of labeled training data. To tackle this problem, we propose a “relevance topic model” for jointly learning meaningful mid-level representations upon bagof-words (BoW) video representations and a classifier with sparse weights. In our approach, sparse Bayesian learning is incorporated into an undirected topic model (i.e., Replicated Softmax) to discover topics which are relevant to video classes and suitable for prediction. Rectified linear units are utilized to increase the expressive power of topics so as to explain better video data containing complex contents and make variational inference tractable for the proposed model. An efficient variational EM algorithm is presented for model parameter estimation and inference. Experimental results on the Unstructured Social Activity Attribute dataset show that our model achieves state of the art performance and outperforms other supervised topic model in terms of classification accuracy, particularly in the case of a very small number of labeled training videos. 1

6 0.064116955 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

7 0.063781656 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

8 0.058502283 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

9 0.057622217 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap

10 0.057228137 244 nips-2013-Parametric Task Learning

11 0.055513281 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

12 0.053268231 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

13 0.051757384 158 nips-2013-Learning Multiple Models via Regularized Weighting

14 0.051203523 269 nips-2013-Regression-tree Tuning in a Streaming Setting

15 0.051089246 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

16 0.047539402 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking

17 0.047127549 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

18 0.046534773 318 nips-2013-Structured Learning via Logistic Regression

19 0.046314381 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models

20 0.045504384 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.138), (1, 0.044), (2, 0.001), (3, -0.02), (4, 0.054), (5, 0.024), (6, -0.002), (7, -0.001), (8, -0.002), (9, 0.013), (10, 0.021), (11, -0.012), (12, -0.002), (13, -0.035), (14, 0.055), (15, 0.008), (16, -0.037), (17, 0.078), (18, 0.014), (19, 0.002), (20, -0.018), (21, 0.026), (22, 0.046), (23, 0.085), (24, 0.001), (25, -0.001), (26, 0.054), (27, 0.006), (28, -0.046), (29, -0.071), (30, -0.098), (31, 0.063), (32, 0.004), (33, -0.034), (34, 0.008), (35, 0.042), (36, -0.091), (37, -0.09), (38, -0.042), (39, 0.056), (40, -0.021), (41, 0.036), (42, -0.004), (43, 0.043), (44, -0.125), (45, -0.139), (46, 0.015), (47, 0.024), (48, 0.003), (49, -0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92576295 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

Author: Yu Zhang

Abstract: All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multitask classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the knearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods. 1

2 0.79772645 244 nips-2013-Parametric Task Learning

Author: Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, Shinichi Nakajima

Abstract: We introduce an extended formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of the optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.

3 0.62018049 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang

Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1

4 0.58891928 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

Author: Samory Kpotufe, Vikas Garg

Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown H¨ lder-continuity of the regression function at x. The result holds with o high probability simultaneously at all points x in a general metric space X of unknown structure. 1

5 0.5827378 76 nips-2013-Correlated random features for fast semi-supervised learning

Author: Brian McWilliams, David Balduzzi, Joachim Buhmann

Abstract: This paper presents Correlated Nystr¨ m Views (XNV), a fast semi-supervised alo gorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contains accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude. 1

6 0.57880211 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

7 0.57704675 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

8 0.56069219 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

9 0.54080343 158 nips-2013-Learning Multiple Models via Regularized Weighting

10 0.53111607 211 nips-2013-Non-Linear Domain Adaptation with Boosting

11 0.52717865 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

12 0.50758511 10 nips-2013-A Latent Source Model for Nonparametric Time Series Classification

13 0.50254881 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

14 0.49689773 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics

15 0.48901433 202 nips-2013-Multiclass Total Variation Clustering

16 0.48492518 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

17 0.48348942 75 nips-2013-Convex Two-Layer Modeling

18 0.48181915 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

19 0.47574818 65 nips-2013-Compressive Feature Learning

20 0.46688861 171 nips-2013-Learning with Noisy Labels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.011), (16, 0.027), (33, 0.112), (34, 0.114), (41, 0.047), (49, 0.059), (56, 0.099), (70, 0.027), (85, 0.049), (89, 0.036), (93, 0.043), (95, 0.011), (98, 0.286)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73531896 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

Author: Yu Zhang

Abstract: All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multitask classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the knearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods. 1

2 0.6835795 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic

Author: James L. Sharpnack, Akshay Krishnamurthy, Aarti Singh

Abstract: The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lov´ sz extended scan statistic (LESS) that uses submodularity to approximate the a intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a-posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control type 1 error for LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models, the torus, knearest neighbor graphs, and ǫ-random graphs. We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds. 1

3 0.65001011 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

Author: Dahua Lin

Abstract: Reliance on computationally expensive algorithms for inference has been limiting the use of Bayesian nonparametric models in large scale applications. To tackle this problem, we propose a Bayesian learning algorithm for DP mixture models. Instead of following the conventional paradigm – random initialization plus iterative update, we take an progressive approach. Starting with a given prior, our method recursively transforms it into an approximate posterior through sequential variational approximation. In this process, new components will be incorporated on the fly when needed. The algorithm can reliably estimate a DP mixture model in one pass, making it particularly suited for applications with massive data. Experiments on both synthetic data and real datasets demonstrate remarkable improvement on efficiency – orders of magnitude speed-up compared to the state-of-the-art. 1

4 0.60282898 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer

Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng

Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1

5 0.58368599 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

Author: David Carlson, Vinayak Rao, Joshua T. Vogelstein, Lawrence Carin

Abstract: With simultaneous measurements from ever increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-theart. Via exploratory data analysis—using data with partial ground truth as well as two novel data sets—we find several features of our model collectively contribute to our improved performance including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain. 1

6 0.57809359 173 nips-2013-Least Informative Dimensions

7 0.57801145 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

8 0.57760251 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis

9 0.57718509 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

10 0.57650578 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

11 0.57625186 5 nips-2013-A Deep Architecture for Matching Short Texts

12 0.57624829 121 nips-2013-Firing rate predictions in optimal balanced networks

13 0.5760963 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

14 0.57569706 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

15 0.57543015 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

16 0.57503295 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

17 0.57501644 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

18 0.57459295 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

19 0.57443398 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

20 0.57389867 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models