nips nips2010 knowledge-graph by maker-knowledge-mining
1 nips-2010-(RF)^2 -- Random Forest Random Field
Author: Nadia Payet, Sinisa Todorovic
Abstract: We combine random forest (RF) and conditional random field (CRF) into a new computational framework, called random forest random field (RF)2. Inference of (RF)2 uses the Swendsen-Wang cut algorithm, characterized by Metropolis-Hastings jumps. A jump from one state to another depends on the ratio of the proposal distributions, and on the ratio of the posterior distributions of the two states. Prior work typically resorts to a parametric estimation of these four distributions, and then computes their ratio. Our key idea is to instead directly estimate these ratios using RF. RF collects in leaf nodes of each decision tree the class histograms of training examples. We use these class histograms for a nonparametric estimation of the distribution ratios. We derive the theoretical error bounds of a two-class (RF)2. (RF)2 is applied to a challenging task of multiclass object recognition and segmentation over a random field of input image regions. In our empirical evaluation, we use only the visual information provided by image regions (e.g., color, texture, spatial layout), whereas the competing methods additionally use higher-level cues about the horizon location and 3D layout of surfaces in the scene. Nevertheless, (RF)2 outperforms the state of the art on benchmark datasets, in terms of accuracy and computation time.
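The abstract describes replacing parametric ratio estimates with ratios read off the class histograms stored in RF leaf nodes. The snippet below is only a hedged illustration of that general idea, not the authors' implementation: it pools the leaf histograms a trained scikit-learn forest stores for a query, smooths them, and forms the kind of ratio a Metropolis-Hastings acceptance test would use. The smoothing constant, the pooling scheme, and the `ratio_estimate` interface are assumptions, and `forest` is assumed to be already fitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def leaf_histogram_proba(forest, X, alpha=1.0):
    """Pool the class histograms of the leaves each sample reaches (one leaf per
    tree) into a smoothed class-probability estimate."""
    leaves = forest.apply(X)                      # (n_samples, n_trees) leaf ids
    counts = np.zeros((X.shape[0], forest.n_classes_))
    for t, est in enumerate(forest.estimators_):
        # tree_.value[node] holds the class histogram stored at that node
        # (raw counts or normalized proportions depending on scikit-learn version)
        counts += est.tree_.value[leaves[:, t], 0, :]
    counts += alpha                               # additive smoothing (assumed)
    return counts / counts.sum(axis=1, keepdims=True)

def ratio_estimate(forest, x_state_a, x_state_b, label, alpha=1.0):
    """Nonparametric estimate of p(label | state A) / p(label | state B),
    the kind of ratio needed to score a Metropolis-Hastings jump."""
    pa = leaf_histogram_proba(forest, x_state_a.reshape(1, -1), alpha)[0, label]
    pb = leaf_histogram_proba(forest, x_state_b.reshape(1, -1), alpha)[0, label]
    return pa / pb
```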
2 nips-2010-A Bayesian Approach to Concept Drift
Author: Stephen Bach, Mark Maloof
Abstract: To cope with concept drift, we placed a probability distribution over the location of the most-recent drift point. We used Bayesian model comparison to update this distribution from the predictions of models trained on blocks of consecutive observations and pruned potential drift points with low probability. We compare our approach to a non-probabilistic method for drift and a probabilistic method for change-point detection. In our experiments, our approach generally yielded improved accuracy and/or speed over these other methods.
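At a high level the abstract's method keeps a set of candidate drift points, scores each by how well a model trained on the observations after it predicts new data, and prunes candidates whose posterior probability becomes small. The class below is a generic sketch of that bookkeeping under an assumed `log_pred(history, x)` interface (e.g. a conjugate model's posterior predictive); it is not the authors' code, and the prior on new drift points and the pruning threshold are made-up defaults.

```python
import numpy as np

class DriftPosterior:
    """Posterior over the location of the most recent drift point (sketch)."""

    def __init__(self, log_pred, new_drift_prior=0.01, prune_below=1e-4):
        self.log_pred = log_pred                  # log_pred(history, x) -> log p(x | history)
        self.log_new_prior = np.log(new_drift_prior)
        self.prune_below = prune_below
        self.candidates = {0: {"history": [], "loglik": 0.0}}
        self.t = 0

    def update(self, x):
        self.t += 1
        # every existing candidate scores x with a model trained on its own block
        for c in self.candidates.values():
            c["loglik"] += self.log_pred(c["history"], x)
            c["history"].append(x)
        # new hypothesis: the most recent drift happened just before x
        self.candidates[self.t] = {"history": [x], "loglik": self.log_new_prior}
        return self._posterior_and_prune()

    def _posterior_and_prune(self):
        keys = list(self.candidates)
        logs = np.array([self.candidates[k]["loglik"] for k in keys])
        probs = np.exp(logs - logs.max()); probs /= probs.sum()
        for k, pr in zip(keys, probs):            # prune low-probability drift points
            if pr < self.prune_below and len(self.candidates) > 1:
                self.candidates.pop(k)
        keys = list(self.candidates)
        logs = np.array([self.candidates[k]["loglik"] for k in keys])
        probs = np.exp(logs - logs.max()); probs /= probs.sum()
        return dict(zip(keys, probs))
```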
3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation
Author: Vicky Froyen, Jacob Feldman, Manish Singh
Abstract: Figure/ground assignment, in which the visual image is divided into nearer (figural) and farther (ground) surfaces, is an essential step in visual processing, but its underlying computational mechanisms are poorly understood. Figural assignment (often referred to as border ownership) can vary along a contour, suggesting a spatially distributed process whereby local and global cues are combined to yield local estimates of border ownership. In this paper we model figure/ground estimation in a Bayesian belief network, attempting to capture the propagation of border ownership across the image as local cues (contour curvature and T-junctions) interact with more global cues to yield a figure/ground assignment. Our network includes as a nonlocal factor skeletal (medial axis) structure, under the hypothesis that medial structure “draws” border ownership so that borders are owned by the skeletal hypothesis that best explains them. We also briefly present a psychophysical experiment in which we measured local border ownership along a contour at various distances from an inducing cue (a T-junction). Both the human subjects and the network show similar patterns of performance, converging rapidly to a similar pattern of spatial variation in border ownership along contours. Figure/ground assignment (further referred to as f/g), in which the visual image is divided into nearer (figural) and farther (ground) surfaces, is an essential step in visual processing. A number of factors are known to affect f/g assignment, including region size [9], convexity [7, 16], and symmetry [1, 7, 11]. Figural assignment (often referred to as border ownership, under the assumption that the figural side “owns” the border) is usually studied globally, meaning that entire surfaces and their enclosing boundaries are assumed to receive a globally consistent figural status. But recent psychophysical findings [8] have suggested that border ownership can vary locally along a boundary, even leading to a globally inconsistent figure/ground assignment—broadly consistent with electrophysiological evidence showing local coding for border ownership in area V2 as early as 68 msec after image onset [20]. This suggests a spatially distributed and potentially competitive process of figural assignment [15], in which adjacent surfaces compete to own their common boundary, with figural status propagating across the image as this competition proceeds. But both the principles and computational mechanisms underlying this process are poorly understood. ∗ V.F. was supported by a Fullbright Honorary fellowship and by the Rutgers NSF IGERT program in Perceptual Science, NSF DGE 0549115, J.F. by NIH R01 EY15888, and M.S. by NSF CCF-0541185 1 In this paper we consider how border ownership might propagate over both space and time—that is, across the image as well as over the progression of computation. Following Weiss et al. [18] we adopt a Bayesian belief network architecture, with nodes along boundaries representing estimated border ownership, and connections arranged so that both neighboring nodes and nonlocal integrating nodes combine to influence local estimates of border ownership. Our model is novel in two particular respects: (a) we combine both local and global influences on border ownership in an integrated and principled way; and (b) we include as a nonlocal factor skeletal (medial axis) influences on f/g assignment. 
Skeletal structure has not been previously considered as a factor on border ownership, but its relevance follows from a model [4] in which shapes are conceived of as generated by or “grown” from an internal skeleton, with the consequence that their boundaries are perceptually “owned” by the skeletal side. We also briey present a psychophysical experiment in which we measured local border ownership along a contour, at several distances from a strong local f/g inducing cue, and at several time delays after the onset of the cue. The results show measurable spatial differences in judged border ownership, with judgments varying with distance from the inducer; but no temporal effect, with essentially asymptotic judgments even after very brief exposures. Both results are consistent with the behavior of the network, which converges quickly to an asymptotic but spatially nonuniform f/g assignment. 1 The Model The Network. For simplicity, we take an edge map as input for the model, assuming that edges and T-junctions have already been detected. From this edge map we then create a Bayesian belief network consisting of four hierarchical levels. At the input level the model receives evidence E from the image, consisting of local contour curvature and T-junctions. The nodes for this level are placed at equidistant locations along the contour. At the first level the model estimates local border ownership. The border ownership, or B-nodes at this level are at the same locations as the E-nodes, but are connected to their nearest neighbors, and are the parent of the E-node at their location. (As a simplifying assumption, such connections are broken at T-junctions in such a way that the occluded contour is disconnected from the occluder.) The highest level has skeletal nodes, S, whose positions are defined by the circumcenters of the Delaunay triangulation on all the E-nodes, creating a coarse medial axis skeleton [13]. Because of the structure of the Delaunay, each S-node is connected to exactly three E-nodes from which they receive information about the position and the local tangent of the contour. In the current state of the model the S-nodes are “passive”, meaning their posteriors are computed before the model is initiated. Between the S nodes and the B nodes are the grouping nodes G. They have the same positions as the S-nodes and the same Delaunay connections, but to B-nodes that have the same image positions as the E-nodes. They will integrate information from distant B-nodes, applying an interiority cue that is influenced by the local strength of skeletal axes as computed by the S-nodes (Fig. 1). Although this is a multiply connected network, we have found that given reasonable parameters the model converges to intuitive posteriors for a variety of shapes (see below). Updating. Our goal is to compute the posterior p(Bi |I), where I is the whole image. Bi is a binary variable coding for the local direction of border ownership, that is, the side that owns the border. In order for border ownership estimates to be influenced by image structure elsewhere in the image, information has to propagate throughout the network. To achieve this propagation, we use standard equations for node updating [14, 12]. However while to all other connections being directed, connections at the B-node level are undirected, causing each node to be child and parent node at the same time. Considering only the B-node level, a node Bi is only separated from the rest of the network by its two neighbors. 
Hence the Markovian property applies, in that $B_i$ only needs to receive iterative information from its neighbors to eventually compute $p(B_i \mid I)$. So, considering the whole network, at each iteration $t$, $B_i$ receives information from its child $E_i$ and from its parents, that is, the neighbouring nodes ($B_{i+1}$ and $B_{i-1}$), as well as from all grouping nodes connected to it ($G_j, \ldots, G_m$). The latter encode interiority versus exteriority: interiority means that the B-node's estimated figural direction points towards the G-node in question, exteriority that it points away. Integrating all this information creates a multidimensional likelihood function $p(B_i \mid B_{i-1}, B_{i+1}, G_j, \ldots, G_m)$. Because of its complexity we choose to approximate it (assuming all nodes are marginally independent of each other when conditioned on $B_i$) by

$$p(B_i \mid P_j, \ldots, P_m) \propto \prod_{j}^{m} p(B_i \mid P_j) \qquad (1)$$

where the $P_j$'s are the parents of $B_i$.

Figure 1: Basic network structure of the model. Both skeletal (S-nodes) and border-ownership nodes (B-nodes) get evidence from E-nodes, though of different types. S-nodes receive mere positional information, while B-nodes receive information about local curvature and the presence of T-junctions. Because of the structure of the Delaunay triangulation, S-nodes and G-nodes (grouping nodes) always get input from exactly three nodes, respectively E- and B-nodes. The gray color depicts the fact that this part of the network is computed before the model is initiated and does not thereafter interact with the dynamics of the model.

Given this, at each iteration, each node $B_i$ performs the following computation:

$$\mathrm{Bel}(B_i) \leftarrow c\,\lambda(B_i)\,\pi(B_i)\,\alpha(B_i)\,\beta(B_i) \qquad (2)$$

where conceptually $\lambda$ stands for bottom-up information, $\pi$ for top-down information, and $\alpha$ and $\beta$ for information received from within the same level. More formally,

$$\lambda(B_i) \leftarrow p(E \mid B_i) \qquad (3)$$

$$\pi(B_i) \leftarrow \prod_{j}^{m} \sum_{G_j} p(B_i \mid G_j)\,\pi_{G_j}(B_i) \qquad (4)$$

and analogously to equation 4 for $\alpha(B_i)$ and $\beta(B_i)$, which compute information coming from $B_{i-1}$ and $B_{i+1}$ respectively. For these, $\pi_{B_{i-1}}(B_i)$, $\pi_{B_{i+1}}(B_i)$, and $\pi_{G_j}(B_i)$ are given by

$$\pi_{G_j}(B_i) \leftarrow c'\,\pi(G_j) \prod_{k \neq i} \lambda_{B_k}(G_j) \qquad (5)$$

$$\pi_{B_{i-1}}(B_i) \leftarrow c''\,\beta(B_{i-1})\,\lambda(B_{i-1})\,\pi(B_{i-1}) \qquad (6)$$

and $\pi_{B_{i+1}}(B_i)$ is analogous to $\pi_{B_{i-1}}(B_i)$, with $c'$ and $c''$ being normalization constants. Finally, for the G-nodes:

$$\mathrm{Bel}(G_i) \leftarrow c\,\lambda(G_i)\,\pi(G_i) \qquad (7)$$

$$\lambda(G_i) \leftarrow \prod_{j} \lambda_{B_j}(G_i) \qquad (8)$$

$$\lambda_{B_j}(G_i) \leftarrow \sum_{B_j} \lambda(B_j)\,p(B_i \mid G_j)\Big[\alpha(B_j)\,\beta(B_j)\prod_{k \neq i}^{m}\sum_{G_k} p(B_i \mid G_k)\,\pi_{G_k}(B_i)\Big] \qquad (9)$$

The posteriors of the S-nodes are used to compute $\pi(G_i)$. This posterior computes how well the S-node at each position explains the contour, that is, how well it accounts for the cues flowing from the E-nodes it is connected to. Each Delaunay connection between S- and E-nodes can be seen as a rib that sprouts from the skeleton. More specifically, each rib sprouts in a direction that is normal (perpendicular) to the tangent of the contour at the E-node, plus a random error $\phi_i$ chosen independently for each rib from a von Mises distribution centered on zero, i.e. $\phi_i \sim V(0, \kappa_S)$ with spread parameter $\kappa_S$ [4]. The rib lengths are drawn from an exponentially decreasing density $p(\rho_i) \propto e^{-\lambda_S \rho_i}$ [4]. We can now express how well this node "explains" the three E-nodes it is connected to via the probability that this S-node deserves to be a skeletal node or not,

$$p(S = \mathrm{true} \mid E_1, E_2, E_3) \propto \prod_i p(\rho_i)\,p(\phi_i) \qquad (10)$$

with $S = \mathrm{true}$ denoting that this S-node deserves to be a skeletal node.
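As a concrete illustration of the update rules above, the sketch below evaluates the skeletal-node score of equation (10), using the exponential rib-length density and von Mises rib-angle density with the hand-picked $\kappa_S$ and $\lambda_S$ reported later in the paper, and the multiplicative message combination of equation (2) for a binary B-node. The placeholder message values are invented for the example; in the model they would come from equations (3)-(6).

```python
import numpy as np
from scipy.stats import vonmises, expon

def skeletal_node_score(rib_lengths, rib_angle_errors, kappa_S=16.0, lam_S=0.025):
    """Eq. (10): p(S = true | E1, E2, E3) up to normalization.

    rib_lengths      -- lengths rho_i of the ribs to the connected E-nodes
    rib_angle_errors -- deviations phi_i of each rib from the contour normal (radians)
    Default parameter values are the hand-picked ones reported in the experiments.
    """
    p_rho = expon(scale=1.0 / lam_S).pdf(rib_lengths)   # p(rho_i) proportional to exp(-lam_S * rho_i)
    p_phi = vonmises(kappa_S).pdf(rib_angle_errors)     # phi_i ~ V(0, kappa_S)
    return np.prod(p_rho * p_phi)

def belief_update(lam, pi, alpha, beta):
    """Eq. (2): Bel(B_i) proportional to lambda * pi * alpha * beta for a binary node.

    Each argument is a length-2 vector over the two border-ownership directions.
    """
    bel = lam * pi * alpha * beta
    return bel / bel.sum()                               # c is the normalizing constant

# toy usage with placeholder messages
bel = belief_update(np.array([0.7, 0.3]),    # lambda(B_i): evidence from E_i
                    np.array([0.6, 0.4]),    # pi(B_i): from the G-nodes
                    np.array([0.55, 0.45]),  # alpha(B_i): from B_{i-1}
                    np.array([0.5, 0.5]))    # beta(B_i): from B_{i+1}
print(bel, skeletal_node_score([4.0, 5.2, 3.1], [0.05, -0.1, 0.2]))
```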
From this we then compute the prior π(Gi ) in such a way that good (high posterior) skeletal nodes induce a high interiority bias, hence a stronger tendency to induce figural status. Conversely, bad (low posterior) skeletal nodes create a prior close to indifferent (uniform) and thus have less (or no) influence on figural status. Likelihood functions Finally we need to express the likelihood function necessary for the updating rules described above. The first two likelihood functions are part of p(Ei |Bi ), one for each of the local cues. The first one, reflecting local curvature, gives the probability of the orientations of the two vectors inherent to Ei (α1 and α2 ) given both direction of figure (θ) encoded in Bi as a von Mises density centered on θ, i.e. αi ∼ V (θ, κEB ). The second likelihood function, reflecting the presence of a T-junction, simply assumes a fixed likelihood when a T-junction is present—that is p(T-junction = true|Bi ) = θT , where Bi places the direction of figure in the direction of the occluder. This likelihood function is only in effect when a T-junction is present, replacing the curvature cue at that node. The third likelihood function serves to keep consistency between nodes of the first level. This function p(Bi |Bi−1 ) or p(Bi |Bi+1 ) is used to compute α(B) and β(B) and is defined 2x2 conditional probability matrix with a single free parameter, θBB (the probability that figural direction at both B-nodes are the same). A fourth and final likelihood function p(Bi |Gj ) serves to propagate information between level one and two. This likelihood function is 2x2 conditional probability matrix matrix with one free parameter, θBG . In this case θBG encodes the probability that the figural direction of the B-node is in the direction of the exterior or interior preference of the G-node. In total this brings us to six free parameters in the model: κS , λS , κEB , θT , θBB , and θBG . 2 Basic Simulations To evaluate the performance of the model, we first tested it on several basic stimulus configurations in which the desired outcome is intuitively clear: a convex shape, a concave shape, a pair of overlapping shapes, and a pair of non-overlapping shapes (Fig. 2,3). The convex shape is the simplest in that curvature never changes sign. The concave shape includes a region with oppositely signed curvature. (The shape is naturally described as predominantly positively curved with a region of negative curvature, i.e. a concavity. But note that it can also be interpreted as predominantly negatively curved “window” with a region of positive curvature, although this is not the intuitive interpretation.) 4 The overlapping pair of shapes consists of two convex shapes with one partly occluding the other, creating a competition between the two shapes for the ownership of the common borderline. Finally the non-overlapping shapes comprise two simple convex shapes that do not touch—again setting up a competition for ownership of the two inner boundaries (i.e. between each shape and the ground space between them). Fig. 2 shows the network structures for each of these four cases. Figure 2: Network structure for the four shape categories (left to right: convex, concave, overlapping, non-overlapping shapes). Blue depict the locations of the B-nodes (and also the E-nodes), the red connections are the connections between B-nodes, the green connections are connections between B-nodes and G-nodes, and the G-nodes (and also the S-nodes) go from orange to dark red. 
This colour code depicts low (orange) to high (dark red) probability that this is a skeletal node, and hence the strength of the interiority cue. Running our model with hand-estimated parameter values yields highly intuitive posteriors (Fig. 3), an essential “sanity check” to ensure that the network approximates human judgments in simple cases. For the convex shape the model assigns figure to the interior just as one would expect even based solely on local curvature (Fig. 3A). In the concave figure (Fig. 3B), estimated border ownership begins to reverse inside the deep concavity. This may seem surprising, but actually closely matches empirical results obtained when local border ownership is probed psychophysically inside a similarly deep concavity, i.e. a “negative part” in which f/g seems to partly reverse [8]. For the overlapping shapes posteriors were also intuitive, with the occluding shape interpreted as in front and owning the common border (Fig. 3C). Finally, for the two non-overlapping shapes the model computed border-ownership just as one would expect if each shape were run separately, with each shape treated as figural along its entire boundary (Fig. 3D). That is, even though there is skeletal structure in the ground-region between the two shapes (see Fig. 2D), its posterior is weak compared to the skeletal structure inside the shapes, which thus loses the competition to own the boundary between them. For all these configurations, the model not only converged to intuitive estimates but did so rapidly (Fig. 4), always in fewer cycles than would be expected by pure lateral propagation, niterations < Nnodes [18] (with these parameters, typically about five times faster). Figure 3: Posteriors after convergence for the four shape categories (left to right: convex, concave, overlapping, non-overlapping). Arrows indicate estimated border ownership, with direction pointing to the perceived figural side, and length proportional to the magnitude of the posterior. All four simulations used the same parameters. 5 Figure 4: Convergence of the model for the basic shape categories. The vertical lines represent the point of convergence for each of the three shape categories. The posterior change is calculated as |p(Bi = 1|I)t − p(Bi = 1|I)t−1 | at each iteration. 3 Comparison to human data Beyond the simple cases reviewed above, we wished to submit our network to a more fine-grained comparison with human data. To this end we compared its performance to that of human subjects in an experiment we conducted (to be presented in more detail in a future paper). Briefly, our experiment involved finding evidence for propagation of f/g signals across the image. Subjects were first shown a stimulus in which the f/g configuration was globally and locally unambiguous and consistent: a smaller rectangle partly occluding a larger one (Fig. 5A), meaning that the smaller (front) one owns the common border. Then this configuration was perturbed by adding two bars, of which one induced a local f/g reversal—making it now appear locally that the larger rectangle owned the border (Fig. 5B). (The other bar in the display does not alter f/g interpretation, but was included to control for the attentional affects of introducing a bar in the image.) The inducing bar creates T-junctions that serve as strong local f/g cues, in this case tending to reverse the prior global interpretation of the figure. 
We then measured subjective border ownership along the central contour at various distances from the inducing bar, and at different times after the onset of the bar (25ms, 100ms and 250ms). We measured border ownership locally using a method introduced in [8] in which a local motion probe is introduced at a point on the boundary between two color regions of different colors, and the subject is asked which color appeared to move. Because the figural side “owns” the border, the response reflects perceived figural status. The goal of the experiment was to actually measure the progression of the influence of the inducing T-junction as it (hypothetically) propagated along the boundary. Briefly, we found no evidence of temporal differences, meaning that f/g judgments were essentially constant over time, suggesting rapid convergence of local f/g assignment. (This is consistent with the very rapid convergence of our network, which would suggest a lack of measurable temporal differences except at much shorter time scales than we measured.) But we did find a progressive reduction of f/g reversal with increasing distance from the inducer—that is, the influence of the T-junction decayed with distance. Mean responses aggregated over subjects (shortest delay only) are shown in Fig. 6. In order to run our model on this stimulus (which has a much more complex structure than the simple figures tested above) we had to make some adjustments. We removed the bars from the edge map, leaving only the T-junctions as underlying cues. This was a necessary first step because our model is not yet able to cope with skeletons that are split up by occluders. (The larger rectangle’s skeleton has been split up by the lower bar.) In this way all contours except those created by the bars were used to create the network (Fig. 7). Given this network we ran the model using hand-picked parameters that 6 Figure 5: Stimuli used in the experiment. A. Initial stimulus with locally and globally consistent and unambiguous f/g. B. Subsequently bars were added of which one (the top bar in this case) created a local reversal of f/g. C. Positions at which local f/g judgments of subjects were probed. Figure 6: Results from our experiment aggregated for all 7 subjects (shortest delay only) are shown in red. The x-axis shows distance from the inducing bar at which f/g judgment was probed. The y-axis shows the proportion of trials on which subjects judged the smaller rectangle to own the boundary. As can be seen, the further from the T-junction, the lower the f/g reversal. The fitted model (green curve) shows very similar pattern. Horizontal black line indicates chance performance (ambiguous f/g). gave us the best possible qualitative similarity to the human data. The parameters used never entailed total elimination of the influence of any likelihood function (κS = 16, λS = .025, κEB = .5, θT = .9, θBB = .9, and θBG = .6). As can be seen in Fig. 6 the border-ownership estimates at the locations where we had data show compelling similarities to human judgments. Furthermore along the entire contour the model converged to intuitive border-ownership estimates (Fig. 7) very rapidly (within 36 iterations). The fact that our model yielded intuitive estimates for the current network in which not all contours were completed shows another strength of our model. Because our model included grouping nodes, it did not require contours to be amodally completed [6] in order for information to propagate. 
4 Conclusion In this paper we proposed a model rooted in Bayesian belief networks to compute figure/ground. The model uses both local and global cues, combined in a principled way, to achieve a stable and apparently psychologically reasonable estimate of border ownership. Local cues included local curvature and T-junctions, both well-established cues to f/g. Global cues included skeletal structure, 7 Figure 7: (left) Node structure for the experimental stimulus. (right) The model’s local borderownership estimates after convergence. a novel cue motivated by the idea that strongly axial shapes tend to be figural and thus own their boundaries. We successfully tested this model on both simple displays, in which it gave intuitive results, and on a more complex experimental stimulus, in which it gave a close match to the pattern of f/g propagation found in our subjects. Specifically, the model, like the human subjects rapidly converged to a stable local f/g interpretation. Our model’s structure shows several interesting parallels to properties of neural coding of border ownership in visual cortex. Some cortical cells (end-stopped cells) appear to code for local curvature [3] and T-junctions [5]. The B-nodes in our model could be seen as corresponding to cells that code for border ownership [20]. Furthermore, some authors [2] have suggested that recurrent feedback loops between border ownership cells in V2 and cells in V4 (corresponding to G-nodes in our model) play a role in the rapid computation of border ownership. The very rapid convergence we observed in our model likewise appears to be due to the connections between B-nodes and G-nodes. Finally scale-invariant shape representations (such as, speculatively, those based on skeletons) are thought to be present in higher cortical regions such as IT [17], which project down to earlier areas in ways that are not yet understood. A number of parallels to past models of f/g should be mentioned. Weiss [18] pioneered the application of belief networks to the f/g problem, though their network only considered a more restricted set of local cues and no global ones, such that information only propagated along the contour. Furthermore it has not been systematically compared to human judgments. Kogo et al. [10] proposed an exponential decay of f/g signals as they spread throughout the image. Our model has a similar decay for information going through the G-nodes, though it is also influenced by an angular factor defined by the position of the skeletal node. Like the model by Li Zhaoping [19], our model includes horizontal propagation between B-nodes, analogous to border-ownership cells in her model. A neurophysiological model by Craft et al. [2] defines grouping cells coding for an interiority preference that decays with the size of the receptive fields of these grouping cells. Our model takes this a step further by including shape (skeletal) structure as a factor in interiority estimates, rather than simply size of receptive fields (which is similar to the rib lengths in our model). Currently, our use of skeletons as shape representations is still limited to medial axis skeletons and surfaces that are not split up by occluders. Our future goals including integrating skeletons in a more robust way following the probabilistic account suggested by Feldman and Singh [4]. 
Eventually, we hope to fully integrate skeleton computation with f/g computation so that the more general problem of shape and surface estimation can be approached in a coherent and unified fashion.

References
[1] P. Bahnsen. Eine Untersuchung über Symmetrie und Asymmetrie bei visuellen Wahrnehmungen. Zeitschrift für Psychologie, 108:129–154, 1928.
[2] E. Craft, H. Schütze, E. Niebur, and R. von der Heydt. A neural model of figure-ground organization. Journal of Neurophysiology, 97:4310–4326, 2007.
[3] A. Dobbins, S. W. Zucker, and M. S. Cynader. Endstopping and curvature. Vision Research, 29:1371–1387, 1989.
[4] J. Feldman and M. Singh. Bayesian estimation of the shape skeleton. Proceedings of the National Academy of Sciences, 103:18014–18019, 2006.
[5] B. Heider, V. Meskenaite, and E. Peterhans. Anatomy and physiology of a neural mechanism defining depth order and contrast polarity at illusory contours. European Journal of Neuroscience, 12:4117–4130, 2000.
[6] G. Kanizsa. Organization in Vision. New York: Praeger, 1979.
[7] G. Kanizsa and W. Gerbino. Vision and Artifact, chapter Convexity and symmetry in figure-ground organisation, pages 25–32. New York: Springer, 1976.
[8] S. Kim and J. Feldman. Globally inconsistent figure/ground relations induced by a negative part. Journal of Vision, 9:1534–7362, 2009.
[9] K. Koffka. Principles of Gestalt Psychology. Lund Humphries, London, 1935.
[10] N. Kogo, C. Strecha, L. Van Gool, and J. Wagemans. Surface construction by a 2-D differentiation-integration process: a neurocomputational model for perceived border ownership, depth, and lightness in Kanizsa figures. Psychological Review, 117:406–439, 2010.
[11] B. Machielsen, M. Pauwels, and J. Wagemans. The role of vertical mirror-symmetry in visual shape detection. Journal of Vision, 9:1–11, 2009.
[12] K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: an empirical study. Proceedings of Uncertainty in AI, pages 467–475, 1999.
[13] R. L. Ogniewicz and O. Kübler. Hierarchic Voronoi skeletons. Pattern Recognition, 28:343–359, 1995.
[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[15] M. A. Peterson and E. Skow. Inhibitory competition between shape properties in figure-ground perception. Journal of Experimental Psychology: Human Perception and Performance, 34:251–267, 2008.
[16] K. A. Stevens and A. Brookes. The concave cusp as a determiner of figure-ground. Perception, 17:35–42, 1988.
[17] K. Tanaka, H. Saito, Y. Fukada, and M. Moriya. Coding visual images of objects in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66:170–189, 1991.
[18] Y. Weiss. Interpreting images by propagating Bayesian beliefs. Advances in Neural Information Processing Systems, 9:908–915, 1997.
[19] L. Zhaoping. Border ownership from intracortical interactions in visual area V2. Neuron, 47(1):143–153, Jul 2005.
[20] H. Zhou, H. S. Friedman, and R. von der Heydt. Coding of border ownership in monkey visual cortex. The Journal of Neuroscience, 20:6594–6611, 2000.
4 nips-2010-A Computational Decision Theory for Interactive Assistants
Author: Alan Fern, Prasad Tadepalli
Abstract: We study several classes of interactive assistants from the points of view of decision theory and computational complexity. We first introduce a class of POMDPs called hidden-goal MDPs (HGMDPs), which formalize the problem of interactively assisting an agent whose goal is hidden and whose actions are observable. In spite of its restricted nature, we show that optimal action selection in finite horizon HGMDPs is PSPACE-complete even in domains with deterministic dynamics. We then introduce a more restricted model called helper action MDPs (HAMDPs), where the assistant's action is accepted by the agent when it is helpful, and can be easily ignored by the agent otherwise. We show classes of HAMDPs that are complete for PSPACE and NP along with a polynomial time class. Furthermore, we show that for general HAMDPs a simple myopic policy achieves a regret, compared to an omniscient assistant, that is bounded by the entropy of the initial goal distribution. A variation of this policy is shown to achieve worst-case regret that is logarithmic in the number of goals for any goal distribution.
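One way to picture the myopic policy analyzed in the abstract is: track a posterior over the agent's hidden goal from its observed actions, then pick the helper action with the best expected one-step usefulness under that belief. The snippet below is a schematic sketch of that loop under assumed interfaces (`goal_likelihood`, `usefulness`); it is not the authors' algorithm and makes no attempt to reproduce the regret analysis.

```python
import numpy as np

def update_goal_posterior(prior, observed_action, goal_likelihood):
    """Bayes update of the belief over hidden goals after one observed agent action.

    goal_likelihood(g, a) = p(a | goal g, current state)  -- assumed interface
    """
    post = prior * np.array([goal_likelihood(g, observed_action) for g in range(len(prior))])
    return post / post.sum()

def myopic_helper_action(belief, helper_actions, usefulness):
    """Pick the helper action with the highest expected one-step usefulness.

    usefulness(a, g) = benefit of helper action a if the true goal is g (assumed);
    in a HAMDP an unhelpful action can simply be ignored by the agent.
    """
    scores = [sum(belief[g] * usefulness(a, g) for g in range(len(belief)))
              for a in helper_actions]
    return helper_actions[int(np.argmax(scores))]
```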
5 nips-2010-A Dirty Model for Multi-task Learning
Author: Ali Jalali, Sujay Sanghavi, Chao Ruan, Pradeep K. Ravikumar
Abstract: We consider multi-task learning in the setting of multiple linear regression, and where some relevant features could be shared across the tasks. Recent research has studied the use of ℓ1 /ℓq norm block-regularizations with q > 1 for such blocksparse structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scale with the number of observations. However, these papers also caution that the performance of such block-regularized methods are very dependent on the extent to which the features are shared across tasks. Indeed they show [8] that if the extent of overlap is less than a threshold, or even if parameter values in the shared features are highly uneven, then block ℓ1 /ℓq regularization could actually perform worse than simple separate elementwise ℓ1 regularization. Since these caveats depend on the unknown true parameters, we might not know when and which method to apply. Even otherwise, we are far away from a realistic multi-task setting: not only do the set of relevant features have to be exactly the same across tasks, but their values have to as well. Here, we ask the question: can we leverage parameter overlap when it exists, but not pay a penalty when it does not ? Indeed, this falls under a more general question of whether we can model such dirty data which may not fall into a single neat structural bracket (all block-sparse, or all low-rank and so on). With the explosion of such dirty high-dimensional data in modern settings, it is vital to develop tools – dirty models – to perform biased statistical estimation tailored to such data. Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we estimate a superposition of two sets of parameters and regularize them differently. We show both theoretically and empirically, our method strictly and noticeably outperforms both ℓ1 or ℓ1 /ℓq methods, under high-dimensional scaling and over the entire range of possible overlaps (except at boundary cases, where we match the best method). 1 Introduction: Motivation and Setup High-dimensional scaling. In fields across science and engineering, we are increasingly faced with problems where the number of variables or features p is larger than the number of observations n. Under such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes vital to leverage any potential structure in the problem such as sparsity (e.g. in compressed sensing [3] and LASSO [14]), low-rank structure [13, 9], or sparse graphical model structure [12]. It is in such high-dimensional contexts in particular that multi-task learning [4] could be most useful. Here, 1 multiple tasks share some common structure such as sparsity, and estimating these tasks jointly by leveraging this common structure could be more statistically efficient. Block-sparse Multiple Regression. A common multiple task learning setting, and which is the focus of this paper, is that of multiple regression, where we have r > 1 response variables, and a common set of p features or covariates. The r tasks could share certain aspects of their underlying distributions, such as common variance, but the setting we focus on in this paper is where the response variables have simultaneously sparse structure: the index set of relevant features for each task is sparse; and there is a large overlap of these relevant features across the different regression problems. 
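To make the block-sparse multiple-regression setting concrete, the sketch below generates a toy instance in which each task has s relevant features and a fraction alpha of them is shared across tasks, following the simulation protocol described later in the paper (standard Gaussian designs, i.i.d. N(0, 1) non-zero entries). The default sizes are arbitrary choices for illustration.

```python
import numpy as np

def make_dirty_regression(n=200, p=512, s=50, alpha=0.7, r=2, sigma=1.0, seed=0):
    """Toy instance of multiple regression with partially shared sparsity.

    Each of the r tasks has s relevant features; a fraction alpha of them is
    common to all tasks and the rest are task-specific.
    """
    rng = np.random.default_rng(seed)
    n_shared = int(round(alpha * s))
    shared = rng.choice(p, size=n_shared, replace=False)
    Theta = np.zeros((p, r))
    X, Y = [], []
    for k in range(r):
        remaining = np.setdiff1d(np.arange(p), shared)
        own = rng.choice(remaining, size=s - n_shared, replace=False)
        support = np.concatenate([shared, own])
        Theta[support, k] = rng.standard_normal(s)      # i.i.d. N(0, 1) non-zero entries
        Xk = rng.standard_normal((n, p))                # standard Gaussian design
        X.append(Xk)
        Y.append(Xk @ Theta[:, k] + sigma * rng.standard_normal(n))
    return X, Y, Theta
```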
Such “simultaneous sparsity” arises in a variety of contexts [15]; indeed, most applications of sparse signal recovery in contexts ranging from graphical model learning, kernel learning, and function estimation have natural extensions to the simultaneous-sparse setting [12, 2, 11]. It is useful to represent the multiple regression parameters via a matrix, where each column corresponds to a task, and each row to a feature. Having simultaneous sparse structure then corresponds to the matrix being largely “block-sparse” – where each row is either all zero or mostly non-zero, and the number of non-zero rows is small. A lot of recent research in this setting has focused on ℓ1 /ℓq norm regularizations, for q > 1, that encourage the parameter matrix to have such blocksparse structure. Particular examples include results using the ℓ1 /ℓ∞ norm [16, 5, 8], and the ℓ1 /ℓ2 norm [7, 10]. Dirty Models. Block-regularization is “heavy-handed” in two ways. By strictly encouraging sharedsparsity, it assumes that all relevant features are shared, and hence suffers under settings, arguably more realistic, where each task depends on features specific to itself in addition to the ones that are common. The second concern with such block-sparse regularizers is that the ℓ1 /ℓq norms can be shown to encourage the entries in the non-sparse rows taking nearly identical values. Thus we are far away from the original goal of multitask learning: not only do the set of relevant features have to be exactly the same, but their values have to as well. Indeed recent research into such regularized methods [8, 10] caution against the use of block-regularization in regimes where the supports and values of the parameters for each task can vary widely. Since the true parameter values are unknown, that would be a worrisome caveat. We thus ask the question: can we learn multiple regression models by leveraging whatever overlap of features there exist, and without requiring the parameter values to be near identical? Indeed this is an instance of a more general question on whether we can estimate statistical models where the data may not fall cleanly into any one structural bracket (sparse, block-sparse and so on). With the explosion of dirty high-dimensional data in modern settings, it is vital to investigate estimation of corresponding dirty models, which might require new approaches to biased high-dimensional estimation. In this paper we take a first step, focusing on such dirty models for a specific problem: simultaneously sparse multiple regression. Our approach uses a simple idea: while any one structure might not capture the data, a superposition of structural classes might. Our method thus searches for a parameter matrix that can be decomposed into a row-sparse matrix (corresponding to the overlapping or shared features) and an elementwise sparse matrix (corresponding to the non-shared features). As we show both theoretically and empirically, with this simple fix we are able to leverage any extent of shared features, while allowing disparities in support and values of the parameters, so that we are always better than both the Lasso or block-sparse regularizers (at times remarkably so). The rest of the paper is organized as follows: In Sec 2. basic definitions and setup of the problem are presented. Main results of the paper is discussed in sec 3. Experimental results and simulations are demonstrated in Sec 4. Notation: For any matrix M , we denote its j th row as Mj , and its k-th column as M (k) . 
The set of all non-zero rows (i.e. all rows with at least one non-zero element) is denoted by $\mathrm{RowSupp}(M)$ and its support by $\mathrm{Supp}(M)$. Also, for any matrix $M$, let $\|M\|_{1,1} := \sum_{j,k} |M_j^{(k)}|$, i.e. the sum of the absolute values of the elements, and $\|M\|_{1,\infty} := \sum_j \|M_j\|_\infty$, where $\|M_j\|_\infty := \max_k |M_j^{(k)}|$.

2 Problem Set-up and Our Method

Multiple regression. We consider the following standard multiple linear regression model:
$$y^{(k)} = X^{(k)}\bar\theta^{(k)} + w^{(k)}, \qquad k = 1, \ldots, r,$$
where $y^{(k)} \in \mathbb{R}^n$ is the response for the $k$-th task, regressed on the design matrix $X^{(k)} \in \mathbb{R}^{n \times p}$ (possibly different across tasks), while $w^{(k)} \in \mathbb{R}^n$ is the noise vector. We assume each $w^{(k)}$ is drawn independently from $N(0, \sigma^2)$. The total number of tasks or target variables is $r$, the number of features is $p$, while the number of samples we have for each task is $n$. For notational convenience, we collate these quantities into matrices $Y \in \mathbb{R}^{n \times r}$ for the responses, $\bar\Theta \in \mathbb{R}^{p \times r}$ for the regression parameters, and $W \in \mathbb{R}^{n \times r}$ for the noise.

Dirty Model. In this paper we are interested in estimating the true parameter $\bar\Theta$ from data by leveraging any (unknown) extent of simultaneous-sparsity. In particular, certain rows of $\bar\Theta$ would have many non-zero entries, corresponding to features shared by several tasks ("shared" rows); certain rows would be elementwise sparse, corresponding to those features which are relevant for some tasks but not all ("non-shared" rows); while certain rows would have all zero entries, corresponding to those features that are not relevant to any task. We are interested in estimators $\hat\Theta$ that automatically adapt to different levels of sharedness, and yet enjoy the following guarantees:

Support recovery: We say an estimator $\hat\Theta$ successfully recovers the true signed support if $\mathrm{sign}(\mathrm{Supp}(\hat\Theta)) = \mathrm{sign}(\mathrm{Supp}(\bar\Theta))$. We are interested in deriving sufficient conditions under which the estimator succeeds. We note that this is stronger than merely recovering the row-support of $\bar\Theta$, which is the union of its supports for the different tasks. In particular, we denote by $U_k$ the support of the $k$-th column of $\bar\Theta$, and $U = \cup_k U_k$.

Error bounds: We are also interested in providing bounds on the elementwise $\ell_\infty$ norm error of the estimator $\hat\Theta$,
$$\|\hat\Theta - \bar\Theta\|_{\infty} = \max_{j=1,\ldots,p}\;\max_{k=1,\ldots,r} \big|\hat\Theta_j^{(k)} - \bar\Theta_j^{(k)}\big|.$$

2.1 Our Method

Our method explicitly models the dirty block-sparse structure. We estimate a sum of two parameter matrices $B$ and $S$ with different regularizations for each: encouraging block-structured row-sparsity in $B$ and elementwise sparsity in $S$. The corresponding "clean" models would either just use block-sparse regularizations [8, 10] or just elementwise sparsity regularizations [14, 18], so that either method would perform better in certain suited regimes. Interestingly, as we will see in the main results, by explicitly allowing both a block-sparse and an elementwise sparse component, we are able to outperform both classes of these "clean" models, for all regimes of $\bar\Theta$.

Algorithm 1 (Dirty Block Sparse). Solve the following convex optimization problem:
$$(\hat S, \hat B) \in \arg\min_{S,B}\; \frac{1}{2n}\sum_{k=1}^{r} \big\|y^{(k)} - X^{(k)}\big(S^{(k)} + B^{(k)}\big)\big\|_2^2 + \lambda_s \|S\|_{1,1} + \lambda_b \|B\|_{1,\infty}. \qquad (1)$$
Then output $\hat\Theta = \hat B + \hat S$.

3 Main Results and Their Consequences

We now provide precise statements of our main results. A number of recent results have shown that the Lasso [14, 18] and ℓ1/ℓ∞ block-regularization [8] methods succeed in recovering signed supports with controlled error bounds under high-dimensional scaling regimes. Our first two theorems extend these results to our dirty model setting.
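Program (1) above is a convex problem whose two penalties have closed-form proximal operators: elementwise soft-thresholding for the $\|S\|_{1,1}$ term, and a row-wise $\ell_\infty$ prox (via projection onto an $\ell_1$ ball, by Moreau decomposition) for the $\|B\|_{1,\infty}$ term. The solver below is a minimal proximal-gradient sketch written from the problem statement, not the authors' code; the step-size rule, fixed iteration count, and absence of a stopping test are simplifications.

```python
import numpy as np

def soft_threshold(A, t):
    """Prox of t * ||.||_{1,1}: elementwise soft-thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def project_l1_ball(v, radius):
    """Euclidean projection of a vector onto the l1 ball of the given radius."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_row_linf(A, t):
    """Prox of t * ||.||_{1,inf} (sum over rows of each row's max-abs entry).
    By Moreau decomposition, the prox of each row v is v - Proj_{l1 ball, radius t}(v)."""
    return np.vstack([row - project_l1_ball(row, t) for row in A])

def dirty_block_sparse(X, y, lam_s, lam_b, n_iter=2000):
    """Proximal-gradient solver for program (1); X and y are lists over the r tasks."""
    r = len(X)
    n, p = X[0].shape
    S = np.zeros((p, r))
    B = np.zeros((p, r))
    # Lipschitz constant of the smooth part's gradient w.r.t. the joint variable (S, B)
    L = 2.0 * max(np.linalg.norm(Xk, 2) ** 2 for Xk in X) / n
    step = 1.0 / L
    for _ in range(n_iter):
        G = np.zeros((p, r))
        for k in range(r):
            resid = y[k] - X[k] @ (S[:, k] + B[:, k])
            G[:, k] = -(X[k].T @ resid) / n          # same gradient for the S and B blocks
        S = soft_threshold(S - step * G, step * lam_s)
        B = prox_row_linf(B - step * G, step * lam_b)
    return S, B, S + B                               # Theta_hat = B + S
```

An accelerated (FISTA-style) variant or a duality-gap stopping rule would be the natural refinements; the plain loop keeps the sketch short.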
In Theorem 1, we consider the case of deterministic design matrices $X^{(k)}$, and provide sufficient conditions guaranteeing signed support recovery and elementwise $\ell_\infty$ norm error bounds. In Theorem 2, we specialize this theorem to the case where the rows of the design matrices are random from a general zero-mean Gaussian distribution: this allows us to provide scaling on the number of observations required in order to guarantee signed support recovery and bounded elementwise $\ell_\infty$ norm error. Our third result is the most interesting in that it explicitly quantifies the performance gains of our method vis-a-vis Lasso and the ℓ1/ℓ∞ block-regularization method. Since this entailed finding the precise constants underlying earlier theorems, and a correspondingly more delicate analysis, we follow Negahban and Wainwright [8] and focus on the case where there are two tasks (i.e. $r = 2$), and where we have standard Gaussian design matrices as in Theorem 2. Further, while each of the two tasks depends on $s$ features, only a fraction $\alpha$ of these are common. It is then interesting to see how the behaviors of the different regularization methods vary with the extent of overlap $\alpha$.

Comparisons. Negahban and Wainwright [8] show that there is actually a "phase transition" in the scaling of the probability of successful signed support-recovery with the number of observations. Denote a particular rescaling of the sample size $\theta_{\mathrm{Lasso}}(n, p, \alpha) = \frac{n}{s\log(p - s)}$. Then, as Wainwright [18] shows, when the rescaled number of samples scales as $\theta_{\mathrm{Lasso}} > 2 + \delta$ for any $\delta > 0$, Lasso succeeds in recovering the signed support of all columns with probability converging to one. But when the sample size scales as $\theta_{\mathrm{Lasso}} < 2 - \delta$ for any $\delta > 0$, Lasso fails with probability converging to one. For ℓ1/ℓ∞-regularized multiple linear regression, define a similar rescaled sample size $\theta_{1,\infty}(n, p, \alpha) = \frac{n}{s\log(p - (2-\alpha)s)}$. Then, as Negahban and Wainwright [8] show, there is again a transition in probability of success from near zero to near one, at the rescaled sample size of $\theta_{1,\infty} = (4 - 3\alpha)$. Thus, for $\alpha < 2/3$ ("less sharing") Lasso would perform better since its transition is at a smaller sample size, while for $\alpha > 2/3$ ("more sharing") the ℓ1/ℓ∞ regularized method would perform better. As we show in our third theorem, the phase transition for our method occurs at the rescaled sample size of $\theta_{1,\infty} = (2 - \alpha)$, which is strictly before either the Lasso or the ℓ1/ℓ∞ regularized method except for the boundary cases: $\alpha = 0$, i.e. the case of no sharing, where we match Lasso, and $\alpha = 1$, i.e. full sharing, where we match ℓ1/ℓ∞. Everywhere else, we strictly outperform both methods. Figure 3 shows the empirical performance of each of the three methods; as can be seen, they agree very well with the theoretical analysis. (Further details are in the experiments, Section 4.)

3.1 Sufficient Conditions for Deterministic Designs

We first consider the case where the design matrices $X^{(k)}$ for $k = 1, \cdots, r$ are deterministic, and start by specifying the assumptions we impose on the model. We note that similar sufficient conditions for the deterministic $X^{(k)}$'s case were imposed in papers analyzing Lasso [18] and block-regularization methods [8, 10].

A0 Column Normalization: $\|X_j^{(k)}\|_2 \le \sqrt{2n}$ for all $j = 1, \ldots, p$, $k = 1, \ldots, r$.

Let $U_k$ denote the support of the $k$-th column of $\bar\Theta$, and $U = \cup_k U_k$ denote the union of supports for each task.
Then we require that:

A1 Incoherence Condition:
$$\gamma_b := 1 - \max_{j \in U^c} \sum_{k=1}^{r} \Big\| \big\langle X_j^{(k)}, X_{U_k}^{(k)} \big\rangle \Big( \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Big)^{-1} \Big\|_1 > 0.$$
We will also find it useful to define
$$\gamma_s := 1 - \max_{1 \le k \le r}\;\max_{j \in U_k^c} \Big\| \big\langle X_j^{(k)}, X_{U_k}^{(k)} \big\rangle \Big( \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Big)^{-1} \Big\|_1.$$
Note that by the incoherence condition A1, we have $\gamma_s > 0$.

A2 Eigenvalue Condition: $C_{\min} := \min_{1 \le k \le r} \lambda_{\min}\Big(\frac{1}{n}\big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)}\big\rangle\Big) > 0$.

A3 Boundedness Condition: $D_{\max} := \max_{1 \le k \le r} \Big\|\Big(\frac{1}{n}\big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)}\big\rangle\Big)^{-1}\Big\|_{\infty,1} < \infty$.

Further, we require that the regularization penalties be set as
$$\lambda_s > \frac{2(2-\gamma_s)\,\sigma\sqrt{\log(pr)}}{\gamma_s\sqrt{n}} \quad\text{and}\quad \lambda_b > \frac{2(2-\gamma_b)\,\sigma\sqrt{\log(pr)}}{\gamma_b\sqrt{n}}. \qquad (2)$$

Figure 1 (three panels, (a) $\alpha = 0.3$, (b) $\alpha = 2/3$, (c) $\alpha = 0.8$; x-axis: control parameter $\theta$, y-axis: probability of success; curves for $p = 128, 256, 512$): Probability of success in recovering the true signed support using the dirty model, Lasso and the ℓ1/ℓ∞ regularizer. For a 2-task problem, the probability of success is plotted for different values of the feature-overlap fraction $\alpha$. In the regimes where Lasso is better than, as good as, and worse than the ℓ1/ℓ∞ regularizer ((a), (b) and (c) respectively), the dirty model outperforms both methods, i.e., it requires fewer observations for successful recovery of the true signed support than either Lasso or the ℓ1/ℓ∞ regularizer. Here $s = \lfloor p/10 \rfloor$ always.

Theorem 1. Suppose A0-A3 hold, and that we obtain the estimate $\hat\Theta$ from our algorithm with regularization parameters chosen according to (2). Then, with probability at least $1 - c_1\exp(-c_2 n) \to 1$, we are guaranteed that the convex program (1) has a unique optimum and
(a) the estimate $\hat\Theta$ has no false inclusions, and has bounded $\ell_\infty$ norm error, so that
$$\mathrm{Supp}(\hat\Theta) \subseteq \mathrm{Supp}(\bar\Theta), \quad\text{and}\quad \|\hat\Theta - \bar\Theta\|_{\infty,\infty} \le \sqrt{\frac{4\sigma^2\log(pr)}{n\,C_{\min}}} + \lambda_s D_{\max} =: b_{\min};$$
(b) $\mathrm{sign}(\mathrm{Supp}(\hat\Theta)) = \mathrm{sign}(\mathrm{Supp}(\bar\Theta))$ provided that $\min_{(j,k)\in\mathrm{Supp}(\bar\Theta)} |\bar\theta_j^{(k)}| > b_{\min}$.
Here the positive constants $c_1, c_2$ depend only on $\gamma_s, \gamma_b, \lambda_s, \lambda_b$ and $\sigma$, but are otherwise independent of $n, p, r$, the problem dimensions of interest.

Remark: Condition (a) guarantees that the estimate will have no false inclusions, i.e. all included features will be relevant. If, in addition, we require that it have no false exclusions and that it recover the support exactly, we need to impose the assumption in (b) that the non-zero elements are large enough to be detectable above the noise.

3.2 General Gaussian Designs

Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task $k = 1, \ldots, r$ the design matrix $X^{(k)} \in \mathbb{R}^{n \times p}$ is such that each row $X_i^{(k)} \in \mathbb{R}^p$ is a zero-mean Gaussian random vector with covariance matrix $\Sigma^{(k)} \in \mathbb{R}^{p \times p}$, and is independent of every other row. Let $\Sigma_{V,U}^{(k)} \in \mathbb{R}^{|V| \times |U|}$ be the submatrix of $\Sigma^{(k)}$ with rows corresponding to $V$ and columns to $U$. We require these covariance matrices to satisfy the following conditions (the corresponding quantities can also be checked numerically; see the sketch below):

C1 Incoherence Condition: $\gamma_b := 1 - \max_{j \in U^c} \sum_{k=1}^{r} \big\| \Sigma_{j,U_k}^{(k)} \big(\Sigma_{U_k,U_k}^{(k)}\big)^{-1} \big\|_1 > 0$.

C2 Eigenvalue Condition: $C_{\min} := \min_{1 \le k \le r} \lambda_{\min}\big(\Sigma_{U_k,U_k}^{(k)}\big) > 0$, so that the minimum eigenvalue is bounded away from zero.

C3 Boundedness Condition: $D_{\max} := \max_{1 \le k \le r} \big\| \big(\Sigma_{U_k,U_k}^{(k)}\big)^{-1} \big\|_{\infty,1} < \infty$.
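For a concrete design, conditions of this form can be checked numerically from the task design matrices and the true supports. The function below computes the incoherence, eigenvalue and boundedness quantities following the deterministic-design conditions A1-A3 as reconstructed above (the Gaussian-design versions C1-C3 are the same computations applied to covariance submatrices). It is an illustrative check, not code from the paper, and it reads the $\|\cdot\|_{\infty,1}$ norm as the maximum absolute row sum, by analogy with the paper's $\|\cdot\|_{1,\infty}$ notation.

```python
import numpy as np

def dirty_model_conditions(X, supports):
    """Compute gamma_b (A1), gamma_s, C_min (A2) and D_max (A3).

    X        -- list of r design matrices, each of shape (n, p)
    supports -- list of r index arrays U_k (true support of each task)
    """
    r = len(X)
    n, p = X[0].shape
    U = np.unique(np.concatenate(supports))
    U_c = np.setdiff1d(np.arange(p), U)

    def coef_row(Xk, Uk, j):
        # <X_j, X_Uk> (<X_Uk, X_Uk>)^{-1}, a row vector of length |Uk|
        G = Xk[:, Uk].T @ Xk[:, Uk]
        return np.linalg.solve(G, Xk[:, Uk].T @ Xk[:, j])

    if len(U_c):
        gamma_b = 1.0 - max(
            sum(np.abs(coef_row(X[k], supports[k], j)).sum() for k in range(r))
            for j in U_c)
    else:
        gamma_b = 1.0

    gamma_s = 1.0 - max(
        np.abs(coef_row(X[k], supports[k], j)).sum()
        for k in range(r)
        for j in np.setdiff1d(np.arange(p), supports[k]))

    grams = [X[k][:, supports[k]].T @ X[k][:, supports[k]] / n for k in range(r)]
    C_min = min(np.linalg.eigvalsh(G)[0] for G in grams)       # smallest eigenvalue
    D_max = max(np.abs(np.linalg.inv(G)).sum(axis=1).max() for G in grams)
    return gamma_b, gamma_s, C_min, D_max
```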
These conditions are analogues of the conditions for deterministic designs; they are now imposed on the covariance matrix of the (randomly generated) rows of the design matrix. Further, defining $s := \max_k |U_k|$, we require that the regularization penalties be set as
$$\lambda_s > \frac{\sqrt{4\sigma^2 C_{\min}\log(pr)}}{\gamma_s\big(\sqrt{n C_{\min}} - \sqrt{2s\log(pr)}\big)} \quad\text{and}\quad \lambda_b > \frac{\sqrt{4\sigma^2 C_{\min}\,r(r\log 2 + \log p)}}{\gamma_b\big(\sqrt{n C_{\min}} - \sqrt{2sr(r\log 2 + \log p)}\big)}. \qquad (3)$$

Theorem 2. Suppose assumptions C1-C3 hold, and that the number of samples scales as
$$n > \max\Big\{\frac{2s\log(pr)}{C_{\min}\gamma_s^2},\; \frac{2sr(r\log 2 + \log p)}{C_{\min}\gamma_b^2}\Big\}.$$
Suppose we obtain the estimate $\hat\Theta$ from our algorithm with regularization penalties set as in (3). Then, with probability at least $1 - c_1\exp(-c_2(r\log 2 + \log p)) - c_3\exp(-c_4\log(rs)) \to 1$ for some positive numbers $c_1$-$c_4$, we are guaranteed that the algorithm estimate $\hat\Theta$ is unique and satisfies the following conditions:
(a) the estimate $\hat\Theta$ has no false inclusions, and has bounded $\ell_\infty$ norm error, so that
$$\mathrm{Supp}(\hat\Theta) \subseteq \mathrm{Supp}(\bar\Theta), \quad\text{and}\quad \|\hat\Theta - \bar\Theta\|_{\infty,\infty} \le \sqrt{\frac{50\sigma^2\log(rs)}{n\,C_{\min}}} + \lambda_s\Big(\frac{4\sqrt{s}}{C_{\min}\sqrt{n}} + D_{\max}\Big) =: g_{\min};$$
(b) $\mathrm{sign}(\mathrm{Supp}(\hat\Theta)) = \mathrm{sign}(\mathrm{Supp}(\bar\Theta))$ provided that $\min_{(j,k)\in\mathrm{Supp}(\bar\Theta)} |\bar\theta_j^{(k)}| > g_{\min}$.

3.3 Sharp Transition for 2-Task Gaussian Designs

This is one of the most important results of this paper. Here, we perform a more delicate and finer analysis to establish precise quantitative gains of our method. We focus on the special case where $r = 2$ and the design matrix has rows generated from the standard Gaussian distribution $N(0, I_{p\times p})$, so that C1-C3 hold with $C_{\min} = D_{\max} = 1$. As we will see both analytically and experimentally, our method strictly outperforms both Lasso and ℓ1/ℓ∞ block-regularization in all cases, except at the extreme endpoints of no support sharing (where it matches that of Lasso) and full support sharing (where it matches that of ℓ1/ℓ∞). We now present our analytical results; the empirical comparisons are presented next in Section 4.

The results will be in terms of a particular rescaling of the sample size $n$ as
$$\theta(n, p, s, \alpha) := \frac{n}{(2-\alpha)\,s\,\log\big(p - (2-\alpha)s\big)}.$$
We will also require the assumptions that
$$\mathrm{F1:}\quad \lambda_s > \frac{\big(4\sigma^2\,(1 - s/n)\,(\log r + \log(p - (2-\alpha)s))\big)^{1/2}}{n^{1/2} - s^{1/2} - \big((2-\alpha)\,s\,(\log r + \log(p - (2-\alpha)s))\big)^{1/2}},$$
$$\mathrm{F2:}\quad \lambda_b > \frac{\big(4\sigma^2\,(1 - s/n)\,r\,(r\log 2 + \log(p - (2-\alpha)s))\big)^{1/2}}{n^{1/2} - s^{1/2} - \big((1-\alpha/2)\,s\,r\,(r\log 2 + \log(p - (2-\alpha)s))\big)^{1/2}}.$$

Theorem 3. Consider a 2-task regression problem $(n, p, s, \alpha)$, where the design matrix has rows generated from the standard Gaussian distribution $N(0, I_{p\times p})$. Suppose $\max_{j\in B^*}\big|\Theta_j^{*(1)} - \Theta_j^{*(2)}\big| = o(\lambda_s)$, where $B^*$ is the submatrix of $\Theta^*$ with rows where both entries are non-zero. Then the estimate $\hat\Theta$ of the problem (1) satisfies the following:
(Success) Suppose the regularization coefficients satisfy F1-F2. Further, assume that the number of samples scales as $\theta(n, p, s, \alpha) > 1$. Then, with probability at least $1 - c_1\exp(-c_2 n)$ for some positive numbers $c_1$ and $c_2$, we are guaranteed that $\hat\Theta$ satisfies the support-recovery and $\ell_\infty$ error bound conditions (a)-(b) in Theorem 2.
(Failure) If $\theta(n, p, s, \alpha) < 1$, there is no solution $(\hat B, \hat S)$, for any choices of $\lambda_s$ and $\lambda_b$, such that $\mathrm{sign}(\mathrm{Supp}(\hat\Theta)) = \mathrm{sign}(\mathrm{Supp}(\bar\Theta))$.

We note that we require the gap $|\Theta_j^{*(1)} - \Theta_j^{*(2)}|$ to be small only on rows where both entries are non-zero. As we show in a more general theorem in the appendix, even in the case where the gap is large, the dependence of the sample scaling on the gap is quite weak.

4 Empirical Results

In this section, we investigate the performance of our dirty block-sparse estimator on synthetic and real-world data.
The synthetic experiments explore the accuracy of Theorem 3, and compare our estimator with LASSO and the ℓ1/ℓ∞ regularizer. We see that Theorem 3 is very accurate indeed. Next, we apply our method to a real-world dataset containing hand-written digits for classification. Again we compare against LASSO and the ℓ1/ℓ∞ regularizer, and we show that the dirty model outperforms both in practice. For each method, the parameters are chosen via cross-validation; see the supplemental material for more details.

4.1 Synthetic Data Simulation

We consider an $r = 2$ task regression problem as discussed in Theorem 3, for a range of parameters $(n, p, s, \alpha)$. The design matrices $X$ have each entry i.i.d. Gaussian with mean 0 and variance 1. For each fixed set of $(n, s, p, \alpha)$, we generate 100 instances of the problem. In each instance, given $p, s, \alpha$, the locations of the non-zero entries of the true $\bar\Theta$ are chosen at random; each non-zero entry is then chosen to be i.i.d. Gaussian with mean 0 and variance 1. $n$ samples are then generated from this. We then attempt to estimate using three methods: our dirty model, the ℓ1/ℓ∞ regularizer, and LASSO. In each case, and for each instance, the penalty regularizer coefficients are found by cross-validation. After solving the three problems, we compare the signed support of the solution with the true signed support and decide whether or not the program was successful in signed support recovery. We describe this process in more detail in this section.

Performance Analysis: We ran the algorithm for three different values of the overlap ratio $\alpha \in \{0.3, 2/3, 0.8\}$ with three different numbers of features $p \in \{128, 256, 512\}$. For any instance of the problem $(n, p, s, \alpha)$, if the recovered matrix $\hat\Theta$ has the same sign support as the true $\bar\Theta$, then we count it as a success, otherwise a failure (even if one element has a different sign, we count it as a failure). As Theorem 3 predicts and Fig 3 shows, the right scaling for the number of observations is $\frac{n}{s\log(p-(2-\alpha)s)}$, where all curves stack on top of each other at $2-\alpha$. Also, the number of observations required by the dirty model for true signed support recovery is always less than for both LASSO and the ℓ1/ℓ∞ regularizer. Fig 1(a) shows the probability of success for the case $\alpha = 0.3$ (when LASSO is better than the ℓ1/ℓ∞ regularizer) and that the dirty model outperforms both methods. When $\alpha = 2/3$ (see Fig 1(b)), LASSO and the ℓ1/ℓ∞ regularizer perform the same; but the dirty model requires almost 33% fewer observations for the same performance. As $\alpha$ grows toward 1, e.g. $\alpha = 0.8$ as shown in Fig 1(c), ℓ1/ℓ∞ performs better than LASSO. Still, the dirty model performs better than both methods in this case as well.

Figure 2 (x-axis: shared support parameter $\alpha$ from 0 to 1; y-axis: phase transition threshold; curves for $p = 128, 256, 512$ for each of the dirty model, LASSO, and the ℓ1/ℓ∞ regularizer): Verification of the result of Theorem 3 on the behavior of the phase transition threshold as the parameter $\alpha$ is changed in a 2-task $(n, p, s, \alpha)$ problem, for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. The y-axis is $\frac{n}{s\log(p-(2-\alpha)s)}$, where $n$ is the number of samples at which the threshold was observed. Here $s = \lfloor p/10 \rfloor$. Our dirty model method shows a gain in sample complexity over the entire range of sharing $\alpha$. The pre-constant in Theorem 3 is also validated.
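The phase-transition picture in Figure 2 can be sketched from the stated theory alone: on the common rescaling $n/(s\log(p-(2-\alpha)s))$, the Lasso threshold sits near 2 (on its own rescaling $n/(s\log(p-s))$), the ℓ1/ℓ∞ threshold at $4-3\alpha$, and the dirty-model threshold at $2-\alpha$. The snippet below just evaluates these expressions at the experiment's settings; it is an illustration of the stated results, not a re-derivation, and the choice $s = \lfloor p/10 \rfloor$ mirrors the experiments.

```python
import numpy as np

def predicted_thresholds(p, alpha, s=None):
    """Theoretical phase-transition sample sizes for the three methods,
    expressed on the common y-axis of Figure 2: n / (s * log(p - (2 - alpha) * s))."""
    s = p // 10 if s is None else s              # s = floor(p / 10), as in the experiments
    common = s * np.log(p - (2 - alpha) * s)
    n_lasso = 2.0 * s * np.log(p - s)            # Lasso transition at theta_Lasso = 2
    n_linf = (4.0 - 3.0 * alpha) * common        # l1/linf transition at 4 - 3*alpha
    n_dirty = (2.0 - alpha) * common             # dirty-model transition at 2 - alpha
    return {"lasso": n_lasso / common,
            "l1_linf": n_linf / common,
            "dirty": n_dirty / common}

for a in (0.05, 0.3, 2.0 / 3.0, 0.8, 0.95):
    print(a, predicted_thresholds(p=512, alpha=a))
```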
n    Metric                         Our Model             ℓ1/ℓ∞    LASSO
10   Average Classification Error   8.6%                  9.9%     10.8%
10   Variance of Error              0.53%                 0.64%    0.51%
10   Average Row Support Size       B: 165; B + S: 171    170      123
10   Average Support Size           S: 18; B + S: 1651    1700     539
20   Average Classification Error   3.0%                  3.5%     4.1%
20   Variance of Error              0.56%                 0.62%    0.68%
20   Average Row Support Size       B: 211; B + S: 226    217      173
20   Average Support Size           S: 34; B + S: 2118    2165     821
40   Average Classification Error   2.2%                  3.2%     2.8%
40   Variance of Error              0.57%                 0.68%    0.85%
40   Average Row Support Size       B: 270; B + S: 299    368      354
40   Average Support Size           S: 67; B + S: 2761    3669     2053

Table 1: Handwriting Classification Results for our model, ℓ1/ℓ∞ and LASSO.

Scaling Verification: To verify that the phase transition threshold changes linearly with $\alpha$, as predicted by Theorem 3, we plot the phase transition threshold versus $\alpha$. For five different values of $\alpha \in \{0.05, 0.3, 2/3, 0.8, 0.95\}$ and three different values of $p \in \{128, 256, 512\}$, we find the phase transition threshold for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. We consider the point where the probability of success in recovery of the signed support exceeds 50% as the phase transition threshold. We find this point by interpolation on the closest two points. Fig 2 shows that the phase transition threshold for the dirty model is always lower than the phase transitions for LASSO and the ℓ1/ℓ∞ regularizer.

4.2 Handwritten Digits Dataset

We use the handwritten digit dataset [1], containing features of handwritten numerals (0-9) extracted from a collection of Dutch utility maps. This dataset has been used by a number of papers [17, 6] as a reliable dataset for handwritten recognition algorithms. There are thus $r = 10$ tasks, and each handwritten sample consists of $p = 649$ features. Table 1 shows the results of our analysis for different sizes $n$ of the training set. We measure the classification error for each digit to get the 10-vector of errors. Then, we find the average error and the variance of the error vector to show how the error is distributed over all tasks. We compare our method with the ℓ1/ℓ∞ regularizer method and LASSO. Again, in all methods, parameters are chosen via cross-validation. For our method we separate out the $B$ and $S$ matrices that our method finds, so as to illustrate how many features it identifies as "shared" and how many as "non-shared". For the other methods we just report the straight row and support numbers, since they do not make such a separation.

Acknowledgements: We acknowledge support from NSF grant IIS-101842 and the NSF CAREER program, Grant 0954059.

References
[1] A. Asuncion and D. J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007.
[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118–121, 2007.
[4] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[5] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.
[6] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.
[7] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference on Learning Theory (COLT), 2009.
[8] S. Negahban and M. J. Wainwright.
Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.
[9] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.
[10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.
[11] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B.
[12] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2009.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007.
[14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[15] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, Special issue on "Sparse approximations in signal and image processing", 86:572–602, 2006.
[16] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349–363, 2005.
[17] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.
[18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
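As a companion to the Scaling Verification paragraph above, here is a minimal sketch of the threshold-finding step: given empirical success probabilities over a grid of sample sizes, it locates the 50% crossing by linear interpolation between the two closest grid points and then rescales by s·log(p − (2 − α)s) as in the Figure 2 caption. The grid of sample sizes and the success probabilities in the example are illustrative assumptions; only the crossing rule and the rescaling come from the text.

```python
import numpy as np

def phase_transition_threshold(n_grid, success_prob, p, s, alpha, level=0.5):
    """Find the sample size at which the empirical probability of signed support
    recovery first exceeds `level`, interpolating linearly between the two
    closest grid points, then rescale as on the y-axis of Figure 2."""
    n_grid = np.asarray(n_grid, dtype=float)
    success_prob = np.asarray(success_prob, dtype=float)
    above = np.flatnonzero(success_prob >= level)
    if len(above) == 0:
        return np.nan                     # threshold not reached on this grid
    j = above[0]
    if j == 0:
        n_star = n_grid[0]                # already above the level at the smallest n
    else:
        # linear interpolation between the last point below and first point above
        n_lo, n_hi = n_grid[j - 1], n_grid[j]
        p_lo, p_hi = success_prob[j - 1], success_prob[j]
        n_star = n_lo + (level - p_lo) * (n_hi - n_lo) / (p_hi - p_lo)
    # y-axis of Figure 2: n / (s * log(p - (2 - alpha) * s))
    return n_star / (s * np.log(p - (2 - alpha) * s))

# Illustrative numbers only: success probabilities estimated from 100 instances per n.
p, alpha = 128, 0.3
s = p // 10
n_grid = [40, 60, 80, 100, 120]
success_prob = [0.05, 0.21, 0.48, 0.73, 0.94]
print(phase_transition_threshold(n_grid, success_prob, p, s, alpha))
```

Repeating this for each method and each α gives one point per curve in Figure 2; plotting those rescaled thresholds against α is what reveals the (approximately linear) dependence on the sharing parameter.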
6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
Author: Yang Wang, Greg Mori
Abstract: We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of using the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures more intricate relationships underlying visual and textual information. In particular, we model the mapping that translates image regions to annotations. This mapping allows us to relate image regions to their corresponding annotation terms. We also model the overall scene label as latent information. This allows us to cluster test images. Our training data consist of images and their associated annotations. But we do not have access to the ground-truth region-to-annotation mapping or the overall scene label. We develop a novel variant of the latent SVM framework to model them as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.
7 nips-2010-A Family of Penalty Functions for Structured Sparsity
Author: Jean Morales, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning a sparse linear regression vector under additional conditions on the structure of its sparsity pattern. We present a family of convex penalty functions, which encode this prior knowledge by means of a set of constraints on the absolute values of the regression coefficients. This family subsumes the ℓ1 norm and is flexible enough to include different models of sparsity patterns, which are of practical and theoretical importance. We establish some important properties of these functions and discuss some examples where they can be computed explicitly. Moreover, we present a convergent optimization algorithm for solving regularized least squares with these penalty functions. Numerical simulations highlight the benefit of structured sparsity and the advantage offered by our approach over the Lasso and other related methods.
8 nips-2010-A Log-Domain Implementation of the Diffusion Network in Very Large Scale Integration
Author: Yi-da Wu, Shi-jie Lin, Hsin Chen
Abstract: The Diffusion Network (DN) is a stochastic recurrent network which has been shown capable of modeling the distributions of continuous-valued, continuous-time paths. However, the dynamics of the DN are governed by stochastic differential equations, making the DN unfavourable for simulation in a digital computer. This paper presents the implementation of the DN in analogue Very Large Scale Integration, enabling the DN to be simulated in real time. Moreover, the log-domain representation is applied to the DN, allowing the supply voltage and thus the power consumption to be reduced without limiting the dynamic ranges for diffusion processes. A VLSI chip containing a DN with two stochastic units has been designed and fabricated. The design of the component circuits is described, as is the simulation of the full system. The simulation results demonstrate that the DN in VLSI is able to regenerate various types of continuous paths in real-time. 1
9 nips-2010-A New Probabilistic Model for Rank Aggregation
Author: Tao Qin, Xiubo Geng, Tie-yan Liu
Abstract: This paper is concerned with rank aggregation, which aims to combine multiple input rankings to get a better ranking. A popular approach to rank aggregation is based on probabilistic models on permutations, e.g., the Luce model and the Mallows model. However, these models have their limitations in either poor expressiveness or high computational complexity. To avoid these limitations, in this paper, we propose a new model, which is defined with a coset-permutation distance, and models the generation of a permutation as a stagewise process. We refer to the new model as coset-permutation distance based stagewise (CPS) model. The CPS model has rich expressiveness and can therefore be used in versatile applications, because many different permutation distances can be used to induce the coset-permutation distance. The complexity of the CPS model is low because of the stagewise decomposition of the permutation probability and the efficient computation of most coset-permutation distances. We apply the CPS model to supervised rank aggregation, derive the learning and inference algorithms, and empirically study their effectiveness and efficiency. Experiments on public datasets show that the derived algorithms based on the CPS model can achieve state-of-the-art ranking accuracy, and are much more efficient than previous algorithms.
10 nips-2010-A Novel Kernel for Learning a Neuron Model from Spike Train Data
Author: Nicholas Fisher, Arunava Banerjee
Abstract: From a functional viewpoint, a spiking neuron is a device that transforms input spike trains on its various synapses into an output spike train on its axon. We demonstrate in this paper that the function mapping underlying the device can be tractably learned based on input and output spike train data alone. We begin by posing the problem in a classification-based framework. We then derive a novel kernel for an SRM0 model that is based on PSP- and AHP-like functions. With the kernel we demonstrate how the learning problem can be posed as a Quadratic Program. Experimental results demonstrate the strength of our approach. 1
11 nips-2010-A POMDP Extension with Belief-dependent Rewards
Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet
Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1
12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
Author: Sofia Mosci, Silvia Villa, Alessandro Verri, Lorenzo Rosasco
Abstract: We deal with the problem of variable selection when variables must be selected group-wise, with possibly overlapping groups defined a priori. In particular we propose a new optimization procedure for solving the regularized algorithm presented in [12], where the group lasso penalty is generalized to overlapping groups of variables. While in [12] the proposed implementation requires explicit replication of the variables belonging to more than one group, our iterative procedure is based on a combination of proximal methods in the primal space and a projected Newton method in a reduced dual space, corresponding to the active groups. This procedure provides a scalable alternative with no need for data duplication, and allows us to deal with high-dimensional problems without pre-processing for dimensionality reduction. The computational advantages of our scheme with respect to state-of-the-art algorithms using data duplication are shown empirically with numerical simulations. 1
13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
Author: Tamir Hazan, Raquel Urtasun
Abstract: In this paper we propose an approximated structured prediction framework for large-scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient message-passing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to efficiently learn graphical models with cycles and a very large number of parameters. 1
14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
15 nips-2010-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. 1
16 nips-2010-A VLSI Implementation of the Adaptive Exponential Integrate-and-Fire Neuron Model
Author: Sebastian Millner, Andreas Grübl, Karlheinz Meier, Johannes Schemmel, Marc-olivier Schwartz
Abstract: We describe an accelerated hardware neuron capable of emulating the adaptive exponential integrate-and-fire neuron model. Firing patterns of the membrane stimulated by a step current are analyzed in transistor-level simulations and in silicon on a prototype chip. The neuron is destined to be the hardware neuron of a highly integrated wafer-scale system reaching out for new computational paradigms and opening new experimentation possibilities. As the neuron is dedicated as a universal device for neuroscientific experiments, the focus lies on parameterizability and reproduction of the analytical model. 1
17 nips-2010-A biologically plausible network for the computation of orientation dominance
Author: Kritika Muralidharan, Nuno Vasconcelos
Abstract: The determination of dominant orientation at a given image location is formulated as a decision-theoretic question. This leads to a novel measure for the dominance of a given orientation θ, which is similar to that used by SIFT. It is then shown that the new measure can be computed with a network that implements the sequence of operations of the standard neurophysiological model of V1. The measure can thus be seen as a biologically plausible version of SIFT, and is denoted as bioSIFT. The network units are shown to exhibit trademark properties of V1 neurons, such as cross-orientation suppression, sparseness and independence. The connection between SIFT and biological vision provides a justification for the success of SIFT-like features and reinforces the importance of contrast normalization in computer vision. We illustrate this by replacing the Gabor units of an HMAX network with the new bioSIFT units. This is shown to lead to significant gains for classification tasks, leading to state-of-the-art performance among biologically inspired network models and performance competitive with the best non-biological object recognition systems. 1
18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
Author: Sohan Seth, Park Il, Austin Brockmeier, Mulugeta Semework, John Choi, Joseph Francis, Jose Principe
Abstract: Hypothesis testing on point processes has several applications such as model fitting, plasticity detection, and non-stationarity detection. Standard tools for hypothesis testing include tests on mean firing rate and time-varying rate function. However, these statistics do not fully describe a point process, and therefore, the conclusions drawn by these tests can be misleading. In this paper, we introduce a family of non-parametric divergence measures for hypothesis testing. A divergence measure compares the full probability structure and, therefore, leads to a more robust test of hypothesis. We extend the traditional Kolmogorov–Smirnov and Cramér–von-Mises tests to the space of spike trains via stratification, and show that these statistics can be consistently estimated from data without any free parameter. We demonstrate an application of the proposed divergences as a cost function to find optimally matched point processes. 1
19 nips-2010-A rational decision making framework for inhibitory control
Author: Pradeep Shenoy, Angela J. Yu, Rajesh P. Rao
Abstract: Intelligent agents are often faced with the need to choose actions with uncertain consequences, and to modify those actions according to ongoing sensory processing and changing task demands. The requisite ability to dynamically modify or cancel planned actions is known as inhibitory control in psychology. We formalize inhibitory control as a rational decision-making problem, and apply it to the classical stop-signal task. Using Bayesian inference and stochastic control tools, we show that the optimal policy systematically depends on various parameters of the problem, such as the relative costs of different action choices, the noise level of sensory inputs, and the dynamics of changing environmental demands. Our normative model accounts for a range of behavioral data in humans and animals in the stop-signal task, suggesting that the brain implements statistically optimal, dynamically adaptive, and reward-sensitive decision-making in the context of inhibitory control problems. 1
20 nips-2010-A unified model of short-range and long-range motion perception
Author: Shuang Wu, Xuming He, Hongjing Lu, Alan L. Yuille
Abstract: The human vision system is able to effortlessly perceive both short-range and long-range motion patterns in complex dynamic scenes. Previous work has assumed that two different mechanisms are involved in processing these two types of motion. In this paper, we propose a hierarchical model as a unified framework for modeling both short-range and long-range motion perception. Our model consists of two key components: a data likelihood that proposes multiple motion hypotheses using nonlinear matching, and a hierarchical prior that imposes slowness and spatial smoothness constraints on the motion field at multiple scales. We tested our model on two types of stimuli, random dot kinematograms and multiple-aperture stimuli, both commonly used in human vision research. We demonstrate that the hierarchical model adequately accounts for human performance in psychophysical experiments.
22 nips-2010-Active Estimation of F-Measures
23 nips-2010-Active Instance Sampling via Matrix Partition
24 nips-2010-Active Learning Applied to Patient-Adaptive Heartbeat Classification
25 nips-2010-Active Learning by Querying Informative and Representative Examples
26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
27 nips-2010-Agnostic Active Learning Without Constraints
28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
34 nips-2010-Attractor Dynamics with Synaptic Depression
35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting
36 nips-2010-Avoiding False Positive in Multi-Instance Learning
37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
38 nips-2010-Batch Bayesian Optimization via Simulation Matching
39 nips-2010-Bayesian Action-Graph Games
40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
42 nips-2010-Boosting Classifier Cascades
43 nips-2010-Bootstrapping Apprenticeship Learning
45 nips-2010-CUR from a Sparse Optimization Viewpoint
46 nips-2010-Causal discovery in multiple models from different experiments
47 nips-2010-Co-regularization Based Semi-supervised Domain Adaptation
48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
53 nips-2010-Copula Bayesian Networks
57 nips-2010-Decoding Ipsilateral Finger Movements from ECoG Signals in Humans
58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
59 nips-2010-Deep Coding Network
60 nips-2010-Deterministic Single-Pass Algorithm for LDA
61 nips-2010-Direct Loss Minimization for Structured Prediction
62 nips-2010-Discriminative Clustering by Regularized Information Maximization
63 nips-2010-Distributed Dual Averaging In Networks
64 nips-2010-Distributionally Robust Markov Decision Processes
65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform
66 nips-2010-Double Q-learning
67 nips-2010-Dynamic Infinite Relational Model for Time-varying Relational Data Analysis
68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
69 nips-2010-Efficient Minimization of Decomposable Submodular Functions
70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
71 nips-2010-Efficient Relational Learning with Hidden Variable Detection
73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization
74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
76 nips-2010-Energy Disaggregation via Discriminative Sparse Coding
77 nips-2010-Epitome driven 3-D Diffusion Tensor image segmentation: on extracting specific structures
78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
79 nips-2010-Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces
81 nips-2010-Evaluating neuronal codes for inference using Fisher information
82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
89 nips-2010-Factorized Latent Spaces with Structured Sparsity
90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
91 nips-2010-Fast detection of multiple change-points shared by many signals using group LARS
93 nips-2010-Feature Construction for Inverse Reinforcement Learning
94 nips-2010-Feature Set Embedding for Incomplete Data
95 nips-2010-Feature Transitions with Saccadic Search: Size, Color, and Orientation Are Not Alike
96 nips-2010-Fractionally Predictive Spiking Neurons
97 nips-2010-Functional Geometry Alignment and Localization of Brain Areas
98 nips-2010-Functional form of motion priors in human motion perception
99 nips-2010-Gated Softmax Classification
100 nips-2010-Gaussian Process Preference Elicitation
101 nips-2010-Gaussian sampling by local perturbations
102 nips-2010-Generalized roof duality and bisubmodular functions
103 nips-2010-Generating more realistic images using gated MRF's
104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification
105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
107 nips-2010-Global seismic monitoring as probabilistic inference
108 nips-2010-Graph-Valued Regression
109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection
112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
114 nips-2010-Humans Learn Using Manifolds, Reluctantly
115 nips-2010-Identifying Dendritic Processing
117 nips-2010-Identifying graph-structured activation patterns in networks
118 nips-2010-Implicit Differentiation by Perturbation
119 nips-2010-Implicit encoding of prior probabilities in optimal neural populations
120 nips-2010-Improvements to the Sequence Memoizer
121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies
122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
124 nips-2010-Inductive Regularized Learning of Kernel Functions
125 nips-2010-Inference and communication in the game of Password
126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
127 nips-2010-Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics
128 nips-2010-Infinite Relational Modeling of Functional Connectivity in Resting State fMRI
129 nips-2010-Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
133 nips-2010-Kernel Descriptors for Visual Recognition
134 nips-2010-LSTD with Random Projections
135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints
137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
138 nips-2010-Large Margin Multi-Task Metric Learning
140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
142 nips-2010-Learning Bounds for Importance Weighting
143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
144 nips-2010-Learning Efficient Markov Networks
145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls
146 nips-2010-Learning Multiple Tasks using Manifold Regularization
147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
148 nips-2010-Learning Networks of Stochastic Differential Equations
149 nips-2010-Learning To Count Objects in Images
150 nips-2010-Learning concept graphs from text with stick-breaking priors
151 nips-2010-Learning from Candidate Labeling Sets
152 nips-2010-Learning from Logged Implicit Exploration Data
153 nips-2010-Learning invariant features using the Transformed Indian Buffet Process
155 nips-2010-Learning the context of a category
156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
157 nips-2010-Learning to localise sounds with spiking neural networks
158 nips-2010-Learning via Gaussian Herding
159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features
160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
161 nips-2010-Linear readout from a neural population with partial correlation data
162 nips-2010-Link Discovery using Graph Feature Tracking
163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts
166 nips-2010-Minimum Average Cost Clustering
167 nips-2010-Mixture of time-warped trajectory models for movement decoding
168 nips-2010-Monte-Carlo Planning in Large POMDPs
169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
171 nips-2010-Movement extraction by detecting dynamics switches and repetitions
172 nips-2010-Multi-Stage Dantzig Selector
173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
175 nips-2010-Multiparty Differential Privacy via Aggregation of Locally Trained Classifiers
176 nips-2010-Multiple Kernel Learning and the SMO Algorithm
177 nips-2010-Multitask Learning without Label Correspondences
178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
181 nips-2010-Network Flow Algorithms for Structured Sparsity
182 nips-2010-New Adaptive Algorithms for Online Classification
183 nips-2010-Non-Stochastic Bandit Slate Problems
184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
187 nips-2010-Occlusion Detection and Motion Estimation with Convex Optimization
188 nips-2010-On Herding and the Perceptron Cycling Theorem
189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
190 nips-2010-On the Convexity of Latent Social Network Inference
191 nips-2010-On the Theory of Learnining with Privileged Information
192 nips-2010-Online Classification with Specificity Constraints
193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
194 nips-2010-Online Learning for Latent Dirichlet Allocation
195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
196 nips-2010-Online Markov Decision Processes under Bandit Feedback
197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
202 nips-2010-Parallelized Stochastic Gradient Descent
203 nips-2010-Parametric Bandits: The Generalized Linear Case
204 nips-2010-Penalized Principal Component Regression on Graphs for Analysis of Subnetworks
205 nips-2010-Permutation Complexity Bound on Out-Sample Error
206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
208 nips-2010-Policy gradients in linearly-solvable MDPs
209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
212 nips-2010-Predictive State Temporal Difference Learning
213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
214 nips-2010-Probabilistic Belief Revision with Structural Constraints
215 nips-2010-Probabilistic Deterministic Infinite Automata
216 nips-2010-Probabilistic Inference and Differential Privacy
217 nips-2010-Probabilistic Multi-Task Feature Selection
218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
219 nips-2010-Random Conic Pursuit for Semidefinite Programming
220 nips-2010-Random Projection Trees Revisited
221 nips-2010-Random Projections for $k$-means Clustering
222 nips-2010-Random Walk Approach to Regret Minimization
223 nips-2010-Rates of convergence for the cluster tree
224 nips-2010-Regularized estimation of image statistics by Score Matching
225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
226 nips-2010-Repeated Games against Budgeted Adversaries
228 nips-2010-Reverse Multi-Label Learning
229 nips-2010-Reward Design via Online Gradient Ascent
230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
231 nips-2010-Robust PCA via Outlier Pursuit
232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
233 nips-2010-Scrambled Objects for Least-Squares Regression
234 nips-2010-Segmentation as Maximum-Weight Independent Set
235 nips-2010-Self-Paced Learning for Latent Variable Models
236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
237 nips-2010-Shadow Dirichlet for Restricted Probability Modeling
238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
243 nips-2010-Smoothness, Low Noise and Fast Rates
245 nips-2010-Space-Variant Single-Image Blind Deconvolution for Removing Camera Shake
246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
247 nips-2010-Sparse Instrumental Variables (SPIV) for Genome-Wide Studies
248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis
250 nips-2010-Spectral Regularization for Support Estimation
251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
253 nips-2010-Spike timing-dependent plasticity as dynamic filter
255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
256 nips-2010-Structural epitome: a way to summarize one’s visual experience
257 nips-2010-Structured Determinantal Point Processes
258 nips-2010-Structured sparsity-inducing norms through submodular functions
259 nips-2010-Subgraph Detection Using Eigenvector L1 Norms
260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
261 nips-2010-Supervised Clustering
262 nips-2010-Switched Latent Force Models for Movement Segmentation
264 nips-2010-Synergies in learning words and their referents
265 nips-2010-The LASSO risk: asymptotic results and real world examples
266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
267 nips-2010-The Multidimensional Wisdom of Crowds
268 nips-2010-The Neural Costs of Optimal Control
269 nips-2010-Throttling Poisson Processes
270 nips-2010-Tight Sample Complexity of Large-Margin Learning
271 nips-2010-Tiled convolutional neural networks
272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
274 nips-2010-Trading off Mistakes and Don't-Know Predictions
275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
279 nips-2010-Universal Kernels on Non-Standard Input Spaces
280 nips-2010-Unsupervised Kernel Dimension Reduction
281 nips-2010-Using body-anchored priors for identifying actions in single images
282 nips-2010-Variable margin losses for classifier design
283 nips-2010-Variational Inference over Combinatorial Spaces
284 nips-2010-Variational bounds for mixed-data factor analysis
285 nips-2010-Why are some word orders more common than others? A uniform information density account
286 nips-2010-Word Features for Latent Dirichlet Allocation
287 nips-2010-Worst-Case Linear Discriminant Analysis
288 nips-2010-Worst-case bounds on the quality of max-product fixed-points
289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities
290 nips-2010-t-logistic regression