nips nips2013 nips2013-59 knowledge-graph by maker-knowledge-mining

59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms


Source: pdf

Author: Christophe Schulke, Francesco Caltagirone, Florent Krzakala, Lenka Zdeborova

Abstract: Compressed sensing (CS) is a concept that makes it possible to acquire compressible signals with a small number of measurements. As such it is very attractive for hardware implementations. Therefore, correct calibration of the hardware is a central issue. In this paper we study the so-called blind calibration, i.e. when the training signals that are available to perform the calibration are sparse but unknown. We extend the approximate message passing (AMP) algorithm used in CS to the case of blind calibration. In the calibration-AMP, both the gains on the sensors and the elements of the signals are treated as unknowns. Our algorithm is also applicable to settings in which the sensors distort the measurements in other ways than multiplication by a gain, unlike previously suggested blind calibration algorithms based on convex relaxations. We study numerically the phase diagram of the blind calibration problem, and show that even in cases where convex relaxation is possible, our algorithm requires a smaller number of measurements and/or signals in order to perform well. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Therefore, correct calibration of the hardware is a central issue. [sent-3, score-0.546]

2 In this paper we study the so-called blind calibration, i.e. [sent-4, score-0.238]

3 when the training signals that are available to perform the calibration are sparse but unknown. [sent-6, score-0.636]

4 We extend the approximate message passing (AMP) algorithm used in CS to the case of blind calibration. [sent-7, score-0.368]

5 In the calibration-AMP, both the gains on the sensors and the elements of the signals are treated as unknowns. [sent-8, score-0.26]

6 Our algorithm is also applicable to settings in which the sensors distort the measurements in other ways than multiplication by a gain, unlike previously suggested blind calibration algorithms based on convex relaxations. [sent-9, score-0.913]

7 We study numerically the phase diagram of the blind calibration problem, and show that even in cases where convex relaxation is possible, our algorithm requires a smaller number of measurements and/or signals in order to perform well. [sent-10, score-1.184]

8 Compressed sensing theory shows that a K-sparse N-dimensional signal can be reconstructed from far fewer than N linear measurements [1, 2], thus saving acquisition time and cost or increasing the resolution. [sent-13, score-0.375]

9 Nowadays, the concept of compressed sensing is very attractive for hardware implementations. [sent-15, score-0.457]

10 Usually the sensors introduce a distortion (or decalibration) to the measurements in the form of some unknown gains. [sent-17, score-0.477]

11 Calibration is about how to determine the transfer function between the measurements and the readings from the sensor. [sent-18, score-0.262]

12 Similar distortion can be found in applications with microphone arrays [5]. [sent-20, score-0.319]

13 The need for calibration has been emphasized in a number of other works, see e.g. [sent-21, score-0.491]

14 One common way of dealing with calibration (apart from ignoring it or treating it as measurement noise) is supervised calibration, when some known training signals xl, l = 1, … [sent-24, score-1.287]

15 , P and the corresponding observations yl are used to estimate the distortion parameters. [sent-27, score-0.319]
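For concreteness, here is a minimal sketch of such supervised calibration, assuming (purely for illustration) a multiplicative model y = z / d with z = F x; the paper allows a general transfer function h, and all names below are hypothetical.

```python
import numpy as np

# Minimal sketch of supervised calibration under the assumed model
# y_{mu,l} = z_{mu,l} / d_mu with z_l = F x_l (an illustrative convention,
# not the paper's general transfer function h).
def supervised_gains(F, X_train, Y):
    """F: (M, N) matrix, X_train: (N, P) known training signals, Y: (M, P) readings."""
    Z = F @ X_train                                        # undistorted measurements z
    g_hat = np.sum(Z * Y, axis=1) / np.sum(Z * Z, axis=1)  # least-squares fit of y ~ (1/d) * z
    return 1.0 / g_hat                                     # estimated gains d_mu
```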

16 Given a sparse signal recovery problem, if we are not able to first estimate the distortion parameters via supervised calibration, we need to estimate the unknown signal and the unknown distortion parameters simultaneously - this is known as blind (unsupervised) calibration. [sent-28, score-1.078]

17 If such blind calibration is computationally possible, then in practice it might be simpler to carry out than supervised calibration. [sent-29, score-1.24]

18 The main contribution of this paper is a computationally efficient message passing algorithm for blind calibration. [sent-30, score-0.368]

19 1 Setting We state the problem of blind calibration in the following way. [sent-32, score-0.729]

20 First we introduce an unknown distortion parameter (equivalently called a decalibration parameter or gain) dµ for each of the sensors, µ = 1, … [sent-33, score-0.354]

21 We consider that the signal is linearly projected by a known M × N measurement matrix F and only then distorted according to some known transfer function h. [sent-38, score-0.371]

22 In order to perform the blind calibration, we need to measure several statistically diverse signals. [sent-43, score-0.238]

23 To illustrate a situation in which one has sample dependent noise wµl and sample independent distortion dµ , consider for instance sound sensors placed in space at positions dµ that are not exactly known. [sent-45, score-0.426]

24 The final inference problem is hence as follows: given the M × P measurements yµl and perfect knowledge of the matrix F, we want to infer both the P different signals {x1, …, xP} and the M distortion parameters dµ, µ = 1, …, M. [sent-48, score-0.566]
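A small sketch of how such a synthetic inference instance can be generated is given below; the division-by-gain transfer function and all parameter values are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

# Synthetic blind-calibration instance matching the setting above: P sparse
# signals measured through a known Gaussian matrix F, distorted by unknown gains.
rng = np.random.default_rng(0)
N, alpha, rho, P = 1000, 0.75, 0.2, 2
M, K = int(alpha * N), int(rho * N)

F = rng.standard_normal((M, N)) / np.sqrt(N)       # known measurement matrix
X = np.zeros((N, P))                               # P unknown K-sparse signals
for l in range(P):
    support = rng.choice(N, size=K, replace=False)
    X[support, l] = rng.standard_normal(K)

d = rng.uniform(0.8, 1.2, size=M)                  # unknown gains, one per sensor
Y = (F @ X) / d[:, None]                           # observed readings y_{mu,l} (assumed form)
# Goal of blind calibration: recover both X and d from Y and F alone.
```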

25 In this work we place ourselves in the Bayesian setting where we assume the distribution of the signal elements, PX , and the distortion coefficients, PD , to be known. [sent-49, score-0.41]

26 2 Relation to previous work As far as we know, the problem of blind calibration was first studied in the context of compressed sensing in [9], where the distortions were considered as multiplicative, i.e. [sent-51, score-1.192]

27 dµ (2). A subsequent work [10] considers a more general case in which the distortion parameters are dµ = (gµ, θµ) and the transfer function is h(zµl, dµ, wµl) = e^{iθµ}(zµl + wµl)/gµ. [sent-54, score-0.448]

28 Both [9] and [10] applied convex optimization based algorithms to the blind calibration problem and their approach seems to be limited to the above special cases of transfer functions. [sent-55, score-0.884]

29 The most commonly used algorithm for signal reconstruction in CS is the ℓ1 minimization of [1]. [sent-57, score-0.202]

30 In CS without noise and for measurement matrices with i.i.d. Gaussian elements, the ℓ1-minimization algorithm leads to exact reconstruction as long as the measurement rate α = M/N > αDT in the limit of large signal dimension, where αDT is the well-known phase transition of Donoho and Tanner [11]. [sent-58, score-0.73]
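As a point of reference for this ℓ1 baseline, below is a small sketch of noiseless basis pursuit cast as a linear program with SciPy; it illustrates plain CS reconstruction only and is not the calibration algorithm of [9, 10].

```python
import numpy as np
from scipy.optimize import linprog

# Noiseless l1-minimization (basis pursuit): min ||x||_1 subject to F x = y,
# written as an LP over the split x = u - v with u, v >= 0.
def basis_pursuit(F, y):
    M, N = F.shape
    c = np.ones(2 * N)                              # objective: sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([F, -F])                       # equality constraint F @ (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * N), method="highs")
    u, v = res.x[:N], res.x[N:]
    return u - v
```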

31 The blind calibration algorithm of [9, 10] also directly uses ℓ1 minimization for reconstruction. [sent-59, score-0.729]

32 In the context of compressed sensing, the canonical loopy BP is difficult to implement because its messages would be probability distributions over a continuous support. [sent-61, score-0.298]

33 At the same time in problems such as compressed sensing, Gaussian or quadratic approximation of BP still contains the information necessary for a successful reconstruction of the signal. [sent-62, score-0.332]

34 The following two works have considered problems related to blind calibration with the use of AMP-like algorithms. [sent-68, score-0.748]

35 In [19] the authors use AMP combined with expectation maximization to calibrate gains that act on the signal components rather than on the measurement components as we consider here. [sent-69, score-0.277]

36 3 Contributions In this work we extend the generalized approximate message passing (GAMP) algorithm of [14] to the problem of blind calibration with a general transfer function h, eq. [sent-73, score-0.988]

37 The Cal-AMP uses P > 1 unknown sparse signals to learn both the different signals xl , l = 1, . [sent-76, score-0.319]

38 , P , and the distortion parameters dµ , µ = 1, . [sent-79, score-0.319]

39 We hence overcome the limitation of the blind calibration algorithms presented in [9, 10], which are restricted to the class of settings for which the calibration can be written as a convex optimization problem. [sent-83, score-1.246]
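The Cal-AMP update equations themselves are not reproduced in this summary. As a structural reference only, here is a minimal sketch of the basic soft-thresholding AMP iteration for plain compressed sensing, which GAMP and Cal-AMP generalize; the threshold rule and scaling conventions are simplifying assumptions.

```python
import numpy as np

# Basic AMP iteration for plain CS (soft-thresholding variant), shown only to
# convey the message-passing structure that GAMP/Cal-AMP build on. Assumes F has
# i.i.d. N(0, 1/M) entries; kappa * std(z) is a common heuristic threshold choice.
def soft(r, t):
    return np.sign(r) * np.maximum(np.abs(r) - t, 0.0)

def amp_reconstruct(F, y, n_iter=50, kappa=1.5):
    M, N = F.shape
    x, z = np.zeros(N), y.copy()
    for _ in range(n_iter):
        theta = kappa * np.sqrt(np.mean(z ** 2))        # empirical threshold
        x_new = soft(x + F.T @ z, theta)                # denoise the pseudo-data
        onsager = z * (np.count_nonzero(x_new) / M)     # Onsager correction term
        z = y - F @ x_new + onsager                     # corrected residual
        x = x_new
    return x
```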

40 In the second part of this paper we analyze the performance of Cal-AMP for the product transfer function (2) used in [9] and demonstrate its scalability and its better performance with respect to their ℓ1-based calibration approach. [sent-84, score-0.62]

41 In the numerical study we observe a sharp phase transition generalizing the phase transition seen for AMP in compressed sensing [21]. [sent-85, score-0.951]

42 Note that for the blind calibration problem to be solvable, the amount of information contained in the sensor readings, PM, must be at least as large as the size of the vector of distortion parameters, M, plus the number of non-zero components of all the signals, KP. [sent-86, score-1.123]

43 If we fix the number of signals P, we have a well-defined line in the (ρ, α)-plane, given by α ≥ P ρ/(P − 1) ≡ αmin (3), below which exact calibration is not possible. [sent-88, score-0.636]
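In code this counting bound is a one-liner; the snippet below is only to make the inequality concrete.

```python
# Counting bound (3): exact blind calibration requires P*M >= M + K*P,
# i.e. alpha = M/N >= P/(P-1) * rho with rho = K/N, valid for P > 1.
def alpha_min(rho, P):
    assert P > 1, "with a single signal the counting bound can never be met"
    return P / (P - 1) * rho

print(alpha_min(0.2, 2))   # 0.4 -> need at least M = 0.4*N measurements per signal
```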

44 We will compare the empirically observed phase transition for blind calibration to this theoretical bound, as well as to the phase transition that would have been observed in pure CS, i.e. [sent-89, score-1.249]

45 We denote the marginals of the signal components by νil(xil) = ∫ ∏µ ddµ ∏(jn)≠(il) dxjn P(x, d|F, y), and those of the distortion parameters by νµ(dµ) = ∫ ∏γ≠µ ddγ ∏il dxil P(x, d|F, y). [sent-94, score-0.68]

46 The estimators x∗il that minimize the expected mean-squared error (MSE) of the signals, and the estimators d∗µ of the distortion parameters, are the averages w.r.t. [sent-95, score-0.764]

47 the marginal distributions, namely x∗il = ∫ dxil xil νil(xil) and d∗µ = ∫ ddµ dµ νµ(dµ). [sent-98, score-0.371]

48 Figure 1: Graphical model representing the blind calibration problem. [sent-101, score-0.729]

49 Here the dimensionality of the signal is N = 8, the number of sensors is M = 3, and the number of signals used for calibration is P = 2. [sent-102, score-0.809]

50 The variable nodes xil and dµ are depicted as circles, the factor nodes as squares. [sent-103, score-0.3]

51 Given the factor graph representation of the calibration problem in Fig. [sent-104, score-0.491]

52 At every time-step the quantity ail is the estimate for the signal element xil , and vil is the approximate error of this estimate. [sent-123, score-0.575]

53 The estimate and its error for the distortion parameter dµ can be computed as kµ^{t+1} = ∂G(yµ·, ωµ·^{t+1}, Vµ·^{t+1}, θ)/∂θ |θ=0 and lµ^{t+1} = ∂²G(yµ·, ωµ·^{t+1}, Vµ·^{t+1}, θ)/∂θ² |θ=0. [sent-124, score-0.319]

54 1 Cal-AMP for the product transfer function In the numerical section of this paper we will focus on a specific case of the transfer function h(zµl , dµ , wµl ), defined in eq. [sent-131, score-0.287]

55 The situation when PX mismatches the true signal distribution was discussed for AMP for compressed sensing in [21]. [sent-146, score-0.493]

56 The distortion parameters dµ were generated from a uniform distribution centered at d = 1, with variance σ² and width 2√3 σ. [sent-147, score-0.319]
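A sketch of this gain generation (the numerical values are illustrative):

```python
import numpy as np

# Gains d_mu drawn uniformly, centered at 1, with variance sigma^2,
# i.e. support width 2*sqrt(3)*sigma. As sigma -> 0 all gains tend to 1
# and plain compressed sensing is recovered.
rng = np.random.default_rng(0)
sigma, M = 0.1, 600
half_width = np.sqrt(3) * sigma
d = rng.uniform(1.0 - half_width, 1.0 + half_width, size=M)
print(d.mean(), d.var())   # close to 1.0 and sigma**2 respectively
```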

57 This ensures that, as σ 2 → 0, the results of standard compressed sensing are recovered, while the distortions are growing with σ 2 . [sent-148, score-0.463]

58 For numerical stability purposes, the parameter σ 2 used in the update functions of Cal-AMP was taken to be slightly larger than the variance used to create the actual distortion parameters. [sent-149, score-0.348]

59 Success or failure of the reconstruction is usually determined by looking at the mean squared error (MSE) between the true signal x0 and the reconstructed one al . [sent-152, score-0.249]
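A per-signal MSE for this success criterion can be computed as below; the MSEcorr quantity appearing in the figures is a corrected variant whose exact definition is not reproduced in this summary.

```python
import numpy as np

# Mean squared error between the true signal x0 and its estimate a, used to
# declare success or failure of the reconstruction.
def mse(x0, a):
    x0, a = np.asarray(x0), np.asarray(a)
    return np.mean((x0 - a) ** 2)
```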

60 Figure 2: Phase diagrams for different numbers P of calibrating signals: the measurement rate α = M/N is plotted against the density of the signal ρ = K/N. [sent-164, score-0.279]

61 In this figure the distortion variance is σ 2 = 0. [sent-169, score-0.319]

62 While for P = 1 reconstruction is never possible, for P > 1, there is a phase transition very close to the lower bound defined by αmin in equation (3) or to the phase transition line of the pure compressed sensing problem αCS . [sent-171, score-1.033]

63 Note, however, that in the large-N limit we expect the calibration phase transition to be strictly larger than both αmin and αCS. [sent-172, score-0.751]

64 Note also that while this diagram is usually plotted only for α ≤ 1 for compressed sensing, the part α > 1 displays pertinent information in blind calibration. [sent-173, score-0.539]

65 Figure 2 shows the empirical phase diagrams in the α-ρ plane that we obtained from the Cal-AMP algorithm for different numbers of signals P. [sent-175, score-0.35]

66 For P = 1 the reconstruction is never exact, and effectively this case corresponds to reconstruction without any attempt to calibrate. [sent-176, score-0.222]

67 For any P > 1, there is a sharp phase transition taking place with a jump in MSEcorr of ten orders of magnitude. [sent-177, score-0.279]

68 As P increases, the phase of exact reconstruction gets bigger and tends to the one observed in Bayesian compressed sensing [15]. [sent-178, score-0.669]

69 Remarkably, for small values of the density ρ, the position of the Cal-AMP phase transition is very close to the CS one already for P = 2, and Cal-AMP performs almost as well as in the total absence of distortion. [sent-179, score-0.26]

70 Figure 3: Left: Cal-AMP phase transition as the system size N grows. [sent-187, score-0.26]

71 The curves are obtained by averaging log10 (MSEcorr ) over 100 samples, reflecting the probability of correct reconstruction in the region close to the phase transition, where it is not guaranteed. [sent-188, score-0.292]

72 For higher values of N , the phase transition becomes sharper. [sent-192, score-0.26]

73 Figure 4: Left: Position of the phase transition in α for different distortion variances σ². [sent-204, score-0.579]

74 With growing distortion, larger measurement rates become necessary for perfect calibration and reconstruction. [sent-207, score-0.628]

75 Intermediate values of MSEcorr are obtained in a region where perfect calibration is not possible, but the distortions are small enough for the uncalibrated AMP to make only small mistakes. [sent-208, score-0.603]

76 Right: Phase diagram as the variance of the distortions σ 2 and the number of signals P vary, for ρ = 0. [sent-211, score-0.258]

77 Figure 3 shows the behavior near the phase transition, giving insight into the influence of the system size and the number of iterations needed for precise calibration and reconstruction. [sent-215, score-0.666]

78 The right part is the phase diagram in the σ 2 -P plane. [sent-218, score-0.208]

79 In [9, 10], a calibration algorithm using ℓ1-minimization has been proposed. [sent-219, score-0.491]

80 We also recall at this point that whereas the Cal-AMP algorithm works for a generic transfer function (1), the ℓ1-minimization-based calibration is restricted to the transfer functions considered in [9, 10]. [sent-223, score-0.795]

81 Cal-AMP clearly outperforms the ℓ1-minimization approach in the sense that the region in which calibration is possible is much larger. [sent-226, score-0.516]

82 Figure 5: Comparison of the empirical phase diagrams obtained with the Cal-AMP algorithm proposed here (top) and the ℓ1-minimization calibration algorithm of [9] (bottom), averaged over several random samples; black indicates failure, white indicates success. [sent-251, score-0.696]

83 The plotted lines are the phase transitions for CS without unknown distortions with the AMP algorithm (αCS, in red, from [21]) and with ℓ1-minimization (the Donoho-Tanner transition αDT, in blue, from [11]). [sent-253, score-0.349]

84 The advantage of Cal-AMP over ℓ1-minimization calibration is clear. [sent-256, score-0.491]

85 4 Conclusion We have presented the Cal-AMP algorithm for blind calibration in compressed sensing, a problem where the outputs of the measurements are distorted by some unknown gains on the sensors, eq. [sent-258, score-1.099]

86 The Cal-AMP algorithm makes it possible to jointly infer sparse signals and the distortion parameters of each sensor even with a very small number of signals, and it is computationally as efficient as the GAMP algorithm for compressed sensing [14]. [sent-260, score-1.065]

87 previous works is that the Cal-AMP algorithm works for a generic transfer function between the measurements and the readings from the sensor, not only for those that permit a convex formulation of the inference problem as in [9, 10]. [sent-264, score-0.326]

88 Our results show that, for the chosen parameters, calibration is possible with a very small number of different sparse signals P (i.e. [sent-266, score-0.636]

89 Comparison with the ℓ1-minimizing calibration algorithm clearly shows lower requirements on the measurement rate α and on the number of signals P for Cal-AMP. [sent-269, score-0.747]

90 The Cal-AMP algorithm for blind calibration is scalable and simple to implement. [sent-270, score-0.729]

91 We expect Cal-AMP to become useful in practical compressed sensing implementations. [sent-272, score-0.402]

92 Future work also includes the study of the robustness to the mismatch between assumed and true distribution of signal elements and distortion parameters, as well as the expectation-maximization based learning of the various parameters. [sent-275, score-0.41]

93 Finally, the use of spatially coupled measurement matrices [15, 16] could further improve the performance of the algorithm and make the phase transition asymptotically coincide with the information-theoretical counting bound (3). [sent-276, score-0.399]

94 Robustly stable signal recovery in compressed sensing with structured matrix perturbation. [sent-304, score-0.493]

95 Model-based calibration of filter imperfections in the random demodulator for compressive sensing. [sent-340, score-0.491]

96 Blind sensor calibration in sparse recovery using convex optimization. [sent-354, score-0.571]

97 Information-theoretically optimal compressed sensing via spatial coupling and approximate message passing. [sent-411, score-0.471]

98 Autocalibrated signal reconstruction from linear measurements using adaptive gamp. [sent-436, score-0.278]

99 Phase diagram and approximate message passing for blind calibration and dictionary learning. [sent-442, score-0.911]

100 Probabilistic reconstruction in compressed sensing: Algorithms, phase diagrams, and threshold achieving matrices. [sent-452, score-0.267]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('calibration', 0.491), ('distortion', 0.319), ('xil', 0.3), ('blind', 0.238), ('compressed', 0.221), ('msecorr', 0.194), ('sensing', 0.181), ('il', 0.178), ('amp', 0.175), ('phase', 0.156), ('cs', 0.153), ('signals', 0.145), ('transfer', 0.129), ('reconstruction', 0.111), ('measurement', 0.111), ('vil', 0.106), ('transition', 0.104), ('signal', 0.091), ('sensors', 0.082), ('ail', 0.078), ('measurements', 0.076), ('dxil', 0.071), ('gamp', 0.071), ('message', 0.069), ('dd', 0.068), ('krzakala', 0.062), ('mil', 0.062), ('zdeborov', 0.062), ('cnrs', 0.061), ('distortions', 0.061), ('passing', 0.061), ('pd', 0.06), ('bp', 0.058), ('readings', 0.057), ('hardware', 0.055), ('px', 0.055), ('sensor', 0.054), ('multiuser', 0.053), ('ril', 0.053), ('diagram', 0.052), ('mse', 0.051), ('donoho', 0.051), ('diagrams', 0.049), ('fc', 0.047), ('zard', 0.047), ('messages', 0.047), ('fa', 0.046), ('isit', 0.04), ('distorted', 0.04), ('pw', 0.04), ('umr', 0.038), ('paris', 0.037), ('france', 0.036), ('calamp', 0.035), ('cdma', 0.035), ('cea', 0.035), ('decalibration', 0.035), ('espci', 0.035), ('gout', 0.035), ('orique', 0.035), ('physique', 0.035), ('sausset', 0.035), ('ura', 0.035), ('equations', 0.035), ('gains', 0.033), ('gribonval', 0.031), ('neglecting', 0.031), ('loopy', 0.03), ('xl', 0.029), ('saclay', 0.029), ('cvx', 0.029), ('laska', 0.029), ('rangan', 0.029), ('numerical', 0.029), ('plotted', 0.028), ('counting', 0.028), ('reconstructed', 0.027), ('maleki', 0.027), ('remind', 0.027), ('dt', 0.027), ('noiseless', 0.026), ('perfect', 0.026), ('institut', 0.026), ('convex', 0.026), ('region', 0.025), ('noise', 0.025), ('recovered', 0.024), ('propagation', 0.024), ('dw', 0.023), ('ht', 0.023), ('acoustics', 0.022), ('iid', 0.021), ('components', 0.021), ('matlab', 0.021), ('belief', 0.021), ('supervised', 0.02), ('failure', 0.02), ('jump', 0.019), ('iterations', 0.019), ('works', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms

Author: Christophe Schulke, Francesco Caltagirone, Florent Krzakala, Lenka Zdeborova

Abstract: Compressed sensing (CS) is a concept that makes it possible to acquire compressible signals with a small number of measurements. As such it is very attractive for hardware implementations. Therefore, correct calibration of the hardware is a central issue. In this paper we study the so-called blind calibration, i.e. when the training signals that are available to perform the calibration are sparse but unknown. We extend the approximate message passing (AMP) algorithm used in CS to the case of blind calibration. In the calibration-AMP, both the gains on the sensors and the elements of the signals are treated as unknowns. Our algorithm is also applicable to settings in which the sensors distort the measurements in other ways than multiplication by a gain, unlike previously suggested blind calibration algorithms based on convex relaxations. We study numerically the phase diagram of the blind calibration problem, and show that even in cases where convex relaxation is possible, our algorithm requires a smaller number of measurements and/or signals in order to perform well. 1

2 0.19072981 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

Author: Ryosuke Matsushita, Toshiyuki Tanaka

Abstract: We study the problem of reconstructing low-rank matrices from their noisy observations. We formulate the problem in the Bayesian framework, which allows us to exploit structural properties of matrices in addition to low-rankedness, such as sparsity. We propose an efficient approximate message passing algorithm, derived from the belief propagation algorithm, to perform the Bayesian inference for matrix reconstruction. We have also successfully applied the proposed algorithm to a clustering problem, by reformulating it as a low-rank matrix reconstruction problem with an additional structural property. Numerical experiments show that the proposed algorithm outperforms Lloyd’s K-means algorithm. 1

3 0.15834895 109 nips-2013-Estimating LASSO Risk and Noise Level

Author: Mohsen Bayati, Murat A. Erdogdu, Andrea Montanari

Abstract: We study the fundamental problems of variance and risk estimation in high dimensional statistical modeling. In particular, we consider the problem of learning a coefficient vector θ0 ∈ Rp from noisy linear observations y = Xθ0 + w ∈ Rn (p > n) and the popular estimation procedure of solving the ℓ1-penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the ℓ2 estimation risk ‖θ − θ0‖2 and the variance of the noise when distributions of θ0 and w are unknown. These can be used to select the regularization parameter optimally. Our approach combines Stein’s unbiased risk estimate [Ste81] and the recent results of [BM12a][BM12b] on the analysis of approximate message passing and the risk of LASSO. We establish high-dimensional consistency of our estimators for sequences of matrices X of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on a certain conjecture from statistical physics. To the best of our knowledge, this result is the first that provides an asymptotically consistent risk estimator for the LASSO solely based on data. In addition, we demonstrate through simulations that our variance estimation outperforms several existing methods in the literature. 1

4 0.14471577 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

Author: Eftychios A. Pnevmatikakis, Liam Paninski

Abstract: We propose a compressed sensing (CS) calcium imaging framework for monitoring large neuronal populations, where we image randomized projections of the spatial calcium concentration at each timestep, instead of measuring the concentration at individual locations. We develop scalable nonnegative deconvolution methods for extracting the neuronal spike time series from such observations. We also address the problem of demixing the spatial locations of the neurons using rank-penalized matrix factorization methods. By exploiting the sparsity of neural spiking we demonstrate that the number of measurements needed per timestep is significantly smaller than the total number of neurons, a result that can potentially enable imaging of larger populations at considerably faster rates compared to traditional raster-scanning techniques. Unlike traditional CS setups, our problem involves a block-diagonal sensing matrix and a non-orthogonal sparse basis that spans multiple timesteps. We provide tight approximations to the number of measurements needed for perfect deconvolution for certain classes of spiking processes, and show that this number undergoes a “phase transition,” which we characterize using modern tools relating conic geometry to compressed sensing. 1

5 0.10451426 247 nips-2013-Phase Retrieval using Alternating Minimization

Author: Praneeth Netrapalli, Prateek Jain, Sujay Sanghavi

Abstract: Phase retrieval problems involve solving linear equations, but with missing sign (or phase, for complex numbers). Over the last two decades, a popular generic empirical approach to the many variants of this problem has been one of alternating minimization; i.e. alternating between estimating the missing phase information, and the candidate solution. In this paper, we show that a simple alternating minimization algorithm geometrically converges to the solution of one such problem – finding a vector x from y, A, where y = |AT x| and |z| denotes a vector of element-wise magnitudes of z – under the assumption that A is Gaussian. Empirically, our algorithm performs similar to recently proposed convex techniques for this variant (which are based on “lifting” to a convex matrix problem) in sample complexity and robustness to noise. However, our algorithm is much more efficient and can scale to large problems. Analytically, we show geometric convergence to the solution, and sample complexity that is off by log factors from obvious lower bounds. We also establish close to optimal scaling for the case when the unknown vector is sparse. Our work represents the only known theoretical guarantee for alternating minimization for any variant of phase retrieval problems in the non-convex setting. 1

6 0.086409599 88 nips-2013-Designed Measurements for Vector Count Data

7 0.08301913 65 nips-2013-Compressive Feature Learning

8 0.081003428 69 nips-2013-Context-sensitive active sensing in humans

9 0.080043405 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty

10 0.074998423 118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering

11 0.073502161 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

12 0.069866888 196 nips-2013-Modeling Overlapping Communities with Node Popularities

13 0.062553301 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

14 0.062374469 168 nips-2013-Learning to Pass Expectation Propagation Messages

15 0.055327658 335 nips-2013-Transfer Learning in a Transductive Setting

16 0.05390957 331 nips-2013-Top-Down Regularization of Deep Belief Networks

17 0.052423082 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

18 0.045290668 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

19 0.04454302 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses

20 0.042098045 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.126), (1, 0.037), (2, -0.004), (3, 0.013), (4, -0.025), (5, 0.035), (6, -0.001), (7, 0.001), (8, -0.052), (9, -0.01), (10, 0.063), (11, 0.009), (12, -0.025), (13, -0.056), (14, -0.081), (15, -0.042), (16, 0.023), (17, -0.037), (18, 0.009), (19, 0.005), (20, -0.014), (21, -0.021), (22, -0.012), (23, -0.048), (24, 0.025), (25, 0.043), (26, -0.081), (27, 0.074), (28, 0.16), (29, -0.163), (30, -0.042), (31, 0.114), (32, 0.039), (33, 0.231), (34, 0.187), (35, -0.25), (36, 0.103), (37, -0.022), (38, 0.028), (39, 0.053), (40, 0.164), (41, 0.077), (42, -0.076), (43, -0.11), (44, 0.102), (45, 0.014), (46, 0.075), (47, -0.053), (48, -0.077), (49, -0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.957362 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms

Author: Christophe Schulke, Francesco Caltagirone, Florent Krzakala, Lenka Zdeborova

Abstract: Compressed sensing (CS) is a concept that makes it possible to acquire compressible signals with a small number of measurements. As such it is very attractive for hardware implementations. Therefore, correct calibration of the hardware is a central issue. In this paper we study the so-called blind calibration, i.e. when the training signals that are available to perform the calibration are sparse but unknown. We extend the approximate message passing (AMP) algorithm used in CS to the case of blind calibration. In the calibration-AMP, both the gains on the sensors and the elements of the signals are treated as unknowns. Our algorithm is also applicable to settings in which the sensors distort the measurements in other ways than multiplication by a gain, unlike previously suggested blind calibration algorithms based on convex relaxations. We study numerically the phase diagram of the blind calibration problem, and show that even in cases where convex relaxation is possible, our algorithm requires a smaller number of measurements and/or signals in order to perform well. 1

2 0.72096479 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

Author: Ryosuke Matsushita, Toshiyuki Tanaka

Abstract: We study the problem of reconstructing low-rank matrices from their noisy observations. We formulate the problem in the Bayesian framework, which allows us to exploit structural properties of matrices in addition to low-rankedness, such as sparsity. We propose an efficient approximate message passing algorithm, derived from the belief propagation algorithm, to perform the Bayesian inference for matrix reconstruction. We have also successfully applied the proposed algorithm to a clustering problem, by reformulating it as a low-rank matrix reconstruction problem with an additional structural property. Numerical experiments show that the proposed algorithm outperforms Lloyd’s K-means algorithm. 1

3 0.52638328 247 nips-2013-Phase Retrieval using Alternating Minimization

Author: Praneeth Netrapalli, Prateek Jain, Sujay Sanghavi

Abstract: Phase retrieval problems involve solving linear equations, but with missing sign (or phase, for complex numbers). Over the last two decades, a popular generic empirical approach to the many variants of this problem has been one of alternating minimization; i.e. alternating between estimating the missing phase information, and the candidate solution. In this paper, we show that a simple alternating minimization algorithm geometrically converges to the solution of one such problem – finding a vector x from y, A, where y = |AT x| and |z| denotes a vector of element-wise magnitudes of z – under the assumption that A is Gaussian. Empirically, our algorithm performs similar to recently proposed convex techniques for this variant (which are based on “lifting” to a convex matrix problem) in sample complexity and robustness to noise. However, our algorithm is much more efficient and can scale to large problems. Analytically, we show geometric convergence to the solution, and sample complexity that is off by log factors from obvious lower bounds. We also establish close to optimal scaling for the case when the unknown vector is sparse. Our work represents the only known theoretical guarantee for alternating minimization for any variant of phase retrieval problems in the non-convex setting. 1

4 0.52480197 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

Author: Eftychios A. Pnevmatikakis, Liam Paninski

Abstract: We propose a compressed sensing (CS) calcium imaging framework for monitoring large neuronal populations, where we image randomized projections of the spatial calcium concentration at each timestep, instead of measuring the concentration at individual locations. We develop scalable nonnegative deconvolution methods for extracting the neuronal spike time series from such observations. We also address the problem of demixing the spatial locations of the neurons using rank-penalized matrix factorization methods. By exploiting the sparsity of neural spiking we demonstrate that the number of measurements needed per timestep is significantly smaller than the total number of neurons, a result that can potentially enable imaging of larger populations at considerably faster rates compared to traditional raster-scanning techniques. Unlike traditional CS setups, our problem involves a block-diagonal sensing matrix and a non-orthogonal sparse basis that spans multiple timesteps. We provide tight approximations to the number of measurements needed for perfect deconvolution for certain classes of spiking processes, and show that this number undergoes a “phase transition,” which we characterize using modern tools relating conic geometry to compressed sensing. 1

5 0.45366317 88 nips-2013-Designed Measurements for Vector Count Data

Author: Liming Wang, David Carlson, Miguel Rodrigues, David Wilcox, Robert Calderbank, Lawrence Carin

Abstract: We consider design of linear projection measurements for a vector Poisson signal model. The projections are performed on the vector Poisson rate, X ∈ R^n_+, and the observed data are a vector of counts, Y ∈ Z^m_+. The projection matrix is designed by maximizing mutual information between Y and X, I(Y ; X). When there is a latent class label C ∈ {1, . . . , L} associated with X, we consider the mutual information with respect to Y and C, I(Y ; C). New analytic expressions for the gradient of I(Y ; X) and I(Y ; C) are presented, with gradient performed with respect to the measurement matrix. Connections are made to the more widely studied Gaussian measurement model. Example results are presented for compressive topic modeling of a document corpora (word counting), and hyperspectral compressive sensing for chemical classification (photon counting). 1

6 0.41681674 168 nips-2013-Learning to Pass Expectation Propagation Messages

7 0.40702647 109 nips-2013-Estimating LASSO Risk and Noise Level

8 0.4056941 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

9 0.40209109 65 nips-2013-Compressive Feature Learning

10 0.39211774 189 nips-2013-Message Passing Inference with Chemical Reaction Networks

11 0.37004861 284 nips-2013-Robust Spatial Filtering with Beta Divergence

12 0.34797266 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

13 0.34284183 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

14 0.34047312 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

15 0.32822058 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty

16 0.32457617 69 nips-2013-Context-sensitive active sensing in humans

17 0.31620979 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses

18 0.29136911 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements

19 0.287164 196 nips-2013-Modeling Overlapping Communities with Node Popularities

20 0.28464139 214 nips-2013-On Algorithms for Sparse Multi-factor NMF


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.026), (33, 0.135), (34, 0.08), (41, 0.065), (49, 0.028), (55, 0.295), (56, 0.079), (70, 0.037), (85, 0.056), (89, 0.053), (93, 0.041), (95, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8712464 174 nips-2013-Lexical and Hierarchical Topic Regression

Author: Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik

Abstract: Inspired by a two-level theory from political science that unifies agenda setting and ideological framing, we propose supervised hierarchical latent Dirichlet allocation (S H L DA), which jointly captures documents’ multi-level topic structure and their polar response variables. Our model extends the nested Chinese restaurant processes to discover tree-structured topic hierarchies and uses both per-topic hierarchical and per-word lexical regression parameters to model response variables. S H L DA improves prediction on political affiliation and sentiment tasks in addition to providing insight into how topics under discussion are framed. 1 Introduction: Agenda Setting and Framing in Hierarchical Models How do liberal-leaning bloggers talk about immigration in the US? What do conservative politicians have to say about education? How do Fox News and MSNBC differ in their language about the gun debate? Such questions concern not only what, but how things are talked about. In political communication, the question of “what” falls under the heading of agenda setting theory, which concerns the issues introduced into political discourse (e.g., by the mass media) and their influence over public priorities [1]. The question of “how” concerns framing: the way the presentation of an issue reflects or encourages a particular perspective or interpretation [2]. For example, the rise of the “innocence frame” in the death penalty debate, emphasizing the irreversible consequence of mistaken convictions, has led to a sharp decline in the use of capital punishment in the US [3]. In its concern with the subjects or issues under discussion in political discourse, agenda setting maps neatly to topic modeling [4] as a means of discovering and characterizing those issues [5]. Interestingly, one line of communication theory seeks to unify agenda setting and framing by viewing frames as a second-level kind of agenda [1]: just as agenda setting is about which objects of discussion are salient, framing is about the salience of attributes of those objects. The key is that what communications theorists consider an attribute in a discussion can itself be an object, as well. For example, “mistaken convictions” is one attribute of the death penalty discussion, but it can also be viewed as an object of discussion in its own right. This two-level view leads naturally to the idea of using a hierarchical topic model to formalize both agendas and frames within a uniform setting. In this paper, we introduce a new model to do exactly that. The model is predictive: it represents the idea of alternative or competing perspectives via a continuous-valued response variable. Although inspired by the study of political discourse, associating texts with “perspectives” is more general and has been studied in sentiment analysis, discovery of regional variation, and value-sensitive design. We show experimentally that the model’s hierarchical structure improves prediction of perspective in both a political domain and on sentiment analysis tasks, and we argue that the topic hierarchies exposed by the model are indeed capturing structure in line with the theory that motivated the work. 1 ߨ ݉ ߠௗ ߙ ߰ௗ ߛ ‫ݐ‬ௗ௦ ‫ݖ‬ௗ௦௡ ‫ݓ‬ௗ௦௡ ܿௗ௧ ܰௗ௦ ∞ ߩ ܵௗ ‫ݕ‬ௗ ‫ܦ‬ ߱ ߟ௞ ߬௩ ܸ 1. For each node k ∈ [1, ∞) in the tree (a) Draw topic φk ∼ Dir(βk ) (b) Draw regression parameter ηk ∼ N (µ, σ) 2. For each word v ∈ [1, V ], draw τv ∼ Laplace(0, ω) 3. 
For each document d ∈ [1, D] (a) Draw level distribution θd ∼ GEM(m, π) (b) Draw table distribution ψd ∼ GEM(α) (c) For each table t ∈ [1, ∞), draw a path cd,t ∼ nCRP(γ) (d) For each sentence s ∈ [1, Sd ], draw a table indicator td,s ∼ Mult(ψd ) i. For each token n ∈ [1, Nd,s ] A. Draw level zd,s,n ∼ Mult(θd ) B. Draw word wd,s,n ∼ Mult(φcd,td,s ,zd,s,n ) ¯ ¯ (e) Draw response yd ∼ N (η T zd + τ T wd , ρ): ߶௞ ∞ ߤ i. zd,k = ¯ ߪ ߚ ii. wd,v = ¯ 1 Nd,· 1 Nd,· Sd s=1 Sd s=1 Nd,s n=1 I [kd,s,n = k] Nd,s n=1 I [wd,s,n = v] Figure 1: S H L DA’s generative process and plate diagram. Words w are explained by topic hierarchy φ, and response variables y are explained by per-topic regression coefficients η and global lexical coefficients τ . 2 S H L DA: Combining Supervision and Hierarchical Topic Structure Jointly capturing supervision and hierarchical topic structure falls under a class of models called supervised hierarchical latent Dirichlet allocation. These models take as input a set of D documents, each of which is associated with a response variable yd , and output a hierarchy of topics which is informed by yd . Zhang et al. [6] introduce the S H L DA family, focusing on a categorical response. In contrast, our novel model (which we call S H L DA for brevity), uses continuous responses. At its core, S H L DA’s document generative process resembles a combination of hierarchical latent Dirichlet allocation [7, HLDA] and the hierarchical Dirichlet process [8, HDP]. HLDA uses the nested Chinese restaurant process (nCRP(γ)), combined with an appropriate base distribution, to induce an unbounded tree-structured hierarchy of topics: general topics at the top, specific at the bottom. A document is generated by traversing this tree, at each level creating a new child (hence a new path) with probability proportional to γ or otherwise respecting the “rich-get-richer” property of a CRP. A drawback of HLDA, however, is that each document is restricted to only a single path in the tree. Recent work relaxes this restriction through different priors: nested HDP [9], nested Chinese franchises [10] or recursive CRPs [11]. In this paper, we address this problem by allowing documents to have multiple paths through the tree by leveraging information at the sentence level using the twolevel structure used in HDP. More specifically, in the HDP’s Chinese restaurant franchise metaphor, customers (i.e., tokens) are grouped by sitting at tables and each table takes a dish (i.e., topic) from a flat global menu. In our S H L DA, dishes are organized in a tree-structured global menu by using the nCRP as prior. Each path in the tree is a collection of L dishes (one for each level) and is called a combo. S H L DA groups sentences of a document by assigning them to tables and associates each table with a combo, and thus, models each document as a distribution over combos.1 In S H L DA’s metaphor, customers come in a restaurant and sit at a table in groups, where each group is a sentence. A sentence wd,s enters restaurant d and selects a table t (and its associated combo) with probability proportional to the number of sentences Sd,t at that table; or, it sits at a new table with probability proportional to α. After choosing the table (indexed by td,s ), if the table is new, the group will select a combo of dishes (i.e., a path, indexed by cd,t ) from the tree menu. 
Once a combo is in place, each token in the sentence chooses a “level” (indexed by zd,s,n ) in the combo, which specifies the topic (φkd,s,n ≡ φcd,td,s ,zd,s,n ) producing the associated observation (Figure 2). S H L DA also draws on supervised LDA [12, SLDA] associating each document d with an observable continuous response variable yd that represents the author’s perspective toward a topic, e.g., positive vs. negative sentiment, conservative vs. liberal ideology, etc. This lets us infer a multi-level topic structure informed by how topics are “framed” with respect to positions along the yd continuum. 1 We emphasize that, unlike in HDP where each table is assigned to a single dish, each table in our metaphor is associated with a combo–a collection of L dishes. We also use combo and path interchangeably. 2 Sd Sd,t ߶ଵ ߟଵ dish ߶ଵଵ ߟଵଵ ߶ଵଶ ߟଵଶ ߶ଵଵଵ ߟଵଵଵ ߶ଵଵଶ ߟଵଵଶ ߶ଵଶଵ ߟଵଶଵ ߶ଵଶଶ ߟଵଶଶ table ܿௗ௧ ‫1=ݐ‬ ‫2=ݐ‬ ‫1=ݐ‬ ‫2=ݐ‬ ‫3=ݐ‬ ‫1=ݐ‬ ‫2=ݐ‬ ‫ݐ‬ௗ௦ ‫2=ݏ 1=ݏ‬ ‫ܵ = ݏ‬ଵ ‫3=ݏ 2=ݏ 1=ݏ‬ ݀=1 ݇ௗ௦௡ ‫ܵ = ݏ‬ଶ ‫ܵ = ݏ‬஽ ݀=2 ߶ଵ ߟଵ ݀=‫ܦ‬ customer group (token) (sentence) restaurant (document) ߶ଵଵ ߟଵଵ ݀=1 ‫1=ݏ‬ ߶ଵଵଵ ߟଵଵଵ combo (path) Nd,s Nd,·,l Nd,·,>l Nd,·,≥l Mc,l Cc,l,v Cd,x,l,v φk ηk τv cd,t td,s zd,s,n kd,s,n L C+ Figure 2: S H L DA’s restaurant franchise metaphor. # sentences in document d # groups (i.e. sentences) sitting at table t in restaurant d # tokens wd,s # tokens in wd assigned to level l # tokens in wd assigned to level > l ≡ Nd,·,l + Nd,·,>l # tables at level l on path c # word type v assigned to level l on path c # word type v in vd,x assigned to level l Topic at node k Regression parameter at node k Regression parameter of word type v Path assignment for table t in restaurant d Table assignment for group wd,s Level assignment for wd,s,n Node assignment for wd,s,n (i.e., node at level zd,s,n on path cd,td,s ) Height of the tree Set of all possible paths (including new ones) of the tree Table 1: Notation used in this paper Unlike SLDA, we model the response variables using a normal linear regression that contains both pertopic hierarchical and per-word lexical regression parameters. The hierarchical regression parameters are just like topics’ regression parameters in SLDA: each topic k (here, a tree node) has a parameter ηk , and the model uses the empirical distribution over the nodes that generated a document as the regressors. However, the hierarchy in S H L DA makes it possible to discover relationships between topics and the response variable that SLDA’s simple latent space obscures. Consider, for example, a topic model trained on Congressional debates. Vanilla LDA would likely discover a healthcare category. SLDA [12] could discover a pro-Obamacare topic and an anti-Obamacare topic. S H L DA could do that and capture the fact that there are alternative perspectives, i.e., that the healthcare issue is being discussed from two ideological perspectives, along with characterizing how the higher level topic is discussed by those on both sides of that ideological debate. Sometimes, of course, words are strongly associated with extremes on the response variable continuum regardless of underlying topic structure. Therefore, in addition to hierarchical regression parameters, we include global lexical regression parameters to model the interaction between specific words and response variables. We denote the regression parameter associated with a word type v in the vocabulary as τv , and use the normalized frequency of v in the documents to be its regressor. 
Including both hierarchical and lexical parameters is important. For detecting ideology in the US, “liberty” is an effective indicator of conservative speakers regardless of context; however, “cost” is a conservative-leaning indicator in discussions about environmental policy but liberal-leaning in debates about foreign policy. For sentiment, “wonderful” is globally a positive word; however, “unexpected” is a positive descriptor of books but a negative one of a car’s steering. S H L DA captures these properties in a single model. 3 Posterior Inference and Optimization Given documents with observed words w = {wd,s,n } and response variables y = {yd }, the inference task is to find the posterior distribution over: the tree structure including topic φk and regression parameter ηk for each node k, combo assignment cd,t for each table t in document d, table assignment td,s for each sentence s in a document d, and level assignment zd,s,n for each token wd,s,n . We approximate S H L DA’s posterior using stochastic EM, which alternates between a Gibbs sampling E-step and an optimization M-step. More specifically, in the E-step, we integrate out ψ, θ and φ to construct a Markov chain over (t, c, z) and alternate sampling each of them from their conditional distributions. In the M-step, we optimize the regression parameters η and τ using L-BFGS [13]. Before describing each step in detail, let us define the following probabilities. For more thorough derivations, please see the supplement. 3 • First, define vd,x as a set of tokens (e.g., a token, a sentence or a set of sentences) in document d. The conditional density of vd,x being assigned to path c given all other assignments is −d,x Γ(Cc,l,· + V βl ) L −d,x fc (vd,x ) = l=1 −d,x Γ(Cc,l,v + Cd,x,l,v + βl ) V −d,x Γ(Cc,l,· + Cd,x,l,· + V βl ) (1) −d,x Γ(Cc,l,v + βl ) v=1 where superscript −d,x denotes the same count excluding assignments of vd,x ; marginal counts −d,x are represented by ·’s. For a new path cnew , if the node does not exist, Ccnew ,l,v = 0 for all word types v. • Second, define the conditional density of the response variable yd of document d given vd,x being −d,x assigned to path c and all other assignments as gc (yd ) =  1 N Nd,· ηc,l · Cd,x,l,· + ηcd,td,s ,zd,s,n + wd,s,n ∈{wd \vd,x }  Sd Nd,s L τwd,s,n , ρ (2) s=1 n=1 l=1 where Nd,· is the total number of tokens in document d. For a new node at level l on a new path cnew , we integrate over all possible values of ηcnew ,l . Sampling t: For each group wd,s we need to sample a table td,s . The conditional distribution of a table t given wd,s and other assignments is proportional to the number of sentences sitting at t times the probability of wd,s and yd being observed under this assignment. This is P (td,s = t | rest) ∝ P (td,s = t | t−s ) · P (wd,s , yd | td,s = t, w−d,s , t−d,s , z, c, η) d ∝ −d,s −d,s −d,s Sd,t · fcd,t (wd,s ) · gcd,t (yd ), for existing table t; (3) −d,s −d,s α · c∈C + P (cd,tnew = c | c−d,s ) · fc (wd,s ) · gc (yd ), for new table tnew . For a new table tnew , we need to sum over all possible paths C + of the tree, including new ones. For example, the set C + for the tree shown in Figure 2 consists of four existing paths (ending at one of the four leaf nodes) and three possible new paths (a new leaf off of one of the three internal nodes). 
The prior probability of path c is: P (cd,tnew = c | c−d,s ) ∝       L l=2 −d,s Mc,l −d,s Mc,l−1 + γl−1  γl∗    −d,s M ∗ cnew ,l∗ + γl , l∗ l=2 for an existing path c; (4) −d,s Mcnew ,l , for a new path cnew which consists of an existing path −d,s Mcnew ,l−1 + γl−1 from the root to a node at level l∗ and a new node. Sampling z: After assigning a sentence wd,s to a table, we assign each token wd,s,n to a level to choose a dish from the combo. The probability of assigning wd,s,n to level l is −s,n P (zd,s,n = l | rest) ∝ P (zd,s,n = l | zd )P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) (5) The first factor captures the probability that a customer in restaurant d is assigned to level l, conditioned on the level assignments of all other customers in restaurant d, and is equal to P (zd,s,n = −s,n l | zd ) = −d,s,n mπ + Nd,·,l −d,s,n π + Nd,·,≥l l−1 −d,s,n (1 − m)π + Nd,·,>j −d,s,n π + Nd,·,≥j j=1 , The second factor is the probability of observing wd,s,n and yd , given that wd,s,n is assigned to level −d,s,n −d,s,n l: P (wd,s,n , yd | zd,s,n = l, w−d,s,n , z −d,s,n , t, c, η) = fcd,t (wd,s,n ) · gcd,t (yd ). d,s d,s Sampling c: After assigning customers to tables and levels, we also sample path assignments for all tables. This is important since it can change the assignments of all customers sitting at a table, which leads to a well-mixed Markov chain and faster convergence. The probability of assigning table t in restaurant d to a path c is P (cd,t = c | rest) ∝ P (cd,t = c | c−d,t ) · P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) (6) where we slightly abuse the notation by using wd,t ≡ ∪{s|td,s =t} wd,s to denote the set of customers in all the groups sitting at table t in restaurant d. The first factor is the prior probability of a path given all tables’ path assignments c−d,t , excluding table t in restaurant d and is given in Equation 4. The second factor in Equation 6 is the probability of observing wd,t and yd given the new path −d,t −d,t assignments, P (wd,t , yd | cd,t = c, w−d,t , c−d,t , t, z, η) = fc (wd,t ) · gc (yd ). 4 Optimizing η and τ : We optimize the regression parameters η and τ via the likelihood, 1 L(η, τ ) = − 2ρ D 1 ¯ ¯ (yd − η zd − τ wd ) − 2σ T d=1 T K+ 2 (ηk − µ)2 − k=1 1 ω V |τv |, (7) v=1 where K + is the number of nodes in the tree.2 This maximization is performed using L-BFGS [13]. 4 Data: Congress, Products, Films We conduct our experiments using three datasets: Congressional floor debates, Amazon product reviews, and movie reviews. For all datasets, we remove stopwords, add bigrams to the vocabulary, and filter the vocabulary using tf-idf.3 • U.S Congressional floor debates: We downloaded debates of the 109th US Congress from GovTrack4 and preprocessed them as in Thomas et al. [14]. To remove uninterestingly non-polarized debates, we ignore bills with less than 20% “Yea” votes or less than 20% “Nay” votes. Each document d is a turn (a continuous utterance by a single speaker, i.e. speech segment [14]), and its response variable yd is the first dimension of the speaker’s DW- NOMINATE score [15], which captures the traditional left-right political distinction.5 After processing, our corpus contains 5,201 turns in the House, 3,060 turns in the Senate, and 5,000 words in the vocabulary.6 • Amazon product reviews: From a set of Amazon reviews of manufactured products such as computers, MP 3 players, GPS devices, etc. [16], we focused on the 50 most frequently reviewed products. 
After filtering, this corpus contains 37,191 reviews with a vocabulary of 5,000 words. We use the rating associated with each review as the response variable yd .7 • Movie reviews: Our third corpus is a set of 5,006 reviews of movies [17], again using review ratings as the response variable yd , although in this corpus the ratings are normalized to the range from 0 to 1. After preprocessing, the vocabulary contains 5,000 words. 5 Evaluating Prediction S H L DA’s response variable predictions provide a formally rigorous way to assess whether it is an improvement over prior methods. We evaluate effectiveness in predicting values of the response variables for unseen documents in the three datasets. For comparison we consider these baselines: • Multiple linear regression (MLR) models the response variable as a linear function of multiple features (or regressors). Here, we consider two types of features: topic-based features and lexicallybased features. Topic-based MLR, denoted by MLR - LDA, uses the topic distributions learned by vanilla LDA as features [12], while lexically-based MLR, denoted by MLR - VOC, uses the frequencies of words in the vocabulary as features. MLR - LDA - VOC uses both features. • Support vector regression (SVM) is a discriminative method [18] that uses LDA topic distributions (SVM - LDA), word frequencies (SVM - VOC), and both (SVM - LDA - VOC) as features.8 • Supervised topic model (SLDA): we implemented SLDA using Gibbs sampling. The version of SLDA we use is slightly different from the original SLDA described in [12], in that we place a Gaussian prior N (0, 1) over the regression parameters to perform L2-norm regularization.9 For parametric models (LDA and SLDA), which require the number of topics K to be specified beforehand, we use K ∈ {10, 30, 50}. We use symmetric Dirichlet priors in both LDA and SLDA, initialize The superscript + is to denote that this number is unbounded and varies during the sampling process. To find bigrams, we begin with bigram candidates that occur at least 10 times in the corpus and use Pearson’s χ2 -test to filter out those that have χ2 -value less than 5, which corresponds to a significance level of 0.025. We then treat selected bigrams as single word types and add them to the vocabulary. 2 3 4 http://www.govtrack.us/data/us/109/ 5 Scores were downloaded from http://voteview.com/dwnomin_joint_house_and_senate.htm 6 Data will be available after blind review. 7 The ratings can range from 1 to 5, but skew positive. 8 9 http://svmlight.joachims.org/ This performs better than unregularized SLDA in our experiments. 
We use symmetric Dirichlet priors in both LDA and SLDA, initialize the Dirichlet hyperparameters to 0.5, and use slice sampling [19] for updating hyperparameters. For SLDA, the variance of the regression is set to 0.5. For SHLDA, we use trees with a maximum depth of three. We slice sample m, π, β and γ, and fix µ = 0, σ = 0.5, ω = 0.5 and ρ = 0.5. We found that the following set of initial hyperparameters works reasonably well for all the datasets in our experiments: m = 0.5, π = 100, β = (1.0, 0.5, 0.25), γ = (1, 1), α = 1. We also set the regression parameter of the root node to zero, which speeds inference (since it is associated with every document) and because it is reasonable to assume that it would not change the response variable.

To compare the performance of different methods, we compute Pearson's correlation coefficient (PCC) and mean squared error (MSE) between the true and predicted values of the response variables and average over 5 folds. For the Congressional debate corpus, following Yu et al. [20], we use documents in the House to train and test on documents in the Senate, and vice versa.

                      Floor Debates                          Amazon Reviews      Movie Reviews
                      House-Senate        Senate-House
Model                 PCC      MSE        PCC      MSE       PCC      MSE        PCC      MSE
SVM-LDA 10            0.173    0.861      0.08     1.247     0.157    1.241      0.327    0.970
SVM-LDA 30            0.172    0.840      0.155    1.183     0.277    1.091      0.365    0.938
SVM-LDA 50            0.169    0.832      0.215    1.135     0.245    1.130      0.395    0.906
SVM-VOC               0.336    1.549      0.131    1.467     0.373    0.972      0.584    0.681
SVM-LDA-VOC           0.256    0.784      0.246    1.101     0.371    0.965      0.585    0.678
MLR-LDA 10            0.163    0.735      0.068    1.151     0.143    1.034      0.328    0.957
MLR-LDA 30            0.160    0.737      0.162    1.125     0.258    1.065      0.367    0.936
MLR-LDA 50            0.150    0.741      0.248    1.081     0.234    1.114      0.389    0.914
MLR-VOC               0.322    0.889      0.191    1.124     0.408    0.869      0.568    0.721
MLR-LDA-VOC           0.319    0.873      0.194    1.120     0.410    0.860      0.581    0.702
SLDA 10               0.154    0.729      0.090    1.145     0.270    1.113      0.383    0.953
SLDA 30               0.174    0.793      0.128    1.188     0.357    1.146      0.433    0.852
SLDA 50               0.254    0.897      0.245    1.184     0.241    1.939      0.503    0.772
SHLDA                 0.356    0.753      0.303    1.076     0.413    0.891      0.597    0.673

Table 2: Regression results for Pearson's correlation coefficient (PCC, higher is better (↑)) and mean squared error (MSE, lower is better (↓)). Results on Amazon product reviews and movie reviews are averaged over 5 folds. The number following each parametric model denotes the number of topics. For SVM-LDA-VOC and MLR-LDA-VOC, only the best results across K ∈ {10, 30, 50} are reported.

Results and analysis. Table 2 shows the performance of all models on our three datasets. Methods that use only topic-based features, such as SVM-LDA and MLR-LDA, do poorly. Methods based only on lexical features, like SVM-VOC and MLR-VOC, significantly outperform the topic-only methods on the two review datasets, but are comparable or worse on congressional debates. This suggests that reviews have more highly discriminative words than political speeches (Table 3). Combining topic-based and lexically-based features improves performance, which supports our choice of incorporating both per-topic and per-word regression parameters in SHLDA. In all cases, SHLDA achieves strong performance. For the two cases where SHLDA was second best in MSE (Amazon reviews and House-Senate), it outperforms the other methods in PCC. Doing well in PCC for these two datasets is important, since achieving low MSE is relatively easier due to the response variables' bimodal distribution in the floor debates and positively-skewed distribution in Amazon reviews.
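For concreteness, the two reported metrics and the averaging over folds can be computed as in the sketch below; fold_data and model are hypothetical placeholders for the actual cross-validation splits and regression model.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(y_true, y_pred):
    """Pearson's correlation coefficient (higher is better) and MSE (lower is better)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pcc, _ = pearsonr(y_true, y_pred)
    mse = np.mean((y_true - y_pred) ** 2)
    return pcc, mse

# Review corpora: average PCC and MSE over 5 folds.
# scores = [evaluate(y_te, model.fit(X_tr, y_tr).predict(X_te))
#           for X_tr, y_tr, X_te, y_te in fold_data]
# pcc_avg, mse_avg = np.mean(scores, axis=0)

# Floor debates: train on House turns and test on Senate turns (and vice versa)
# instead of cross-validating within a single chamber.
```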
For the floor debate dataset, the results of the House-Senate experiment are generally better than those of the Senate-House experiment, which is consistent with previous results [20] and is explained by the greater number of debates in the House.

6 Qualitative Analysis: Agendas and Framing/Perspective

Although a formal coherence evaluation [21] remains a goal for future work, a qualitative look at the topic hierarchy uncovered by the model suggests that it is indeed capturing agenda/framing structure as discussed in Section 1. Figure 3 shows a portion of the topic hierarchy induced from the Congressional debate corpus. Nodes A and E illustrate agendas (issues introduced into political discourse) associated with a particular ideology: Node A focuses on the hardships of the poorer victims of Hurricane Katrina and is associated with Democrats, while the text associated with Node E discusses a proposed constitutional amendment to ban flag burning and is associated with Republicans. Nodes C and D, children of a neutral "tax" topic, reveal how the parties frame taxes as gains in terms of new social services (Democrats) or as losses for job creators (Republicans).

[Figure 3 appears here; its per-node word lists and regression values are omitted from this text version.]

Figure 3: Topics discovered from Congressional floor debates. Many first-level topics are bipartisan (purple), while lower-level topics are associated with specific ideologies (Democrats blue, Republicans red). For example, the "tax" topic (B) is bipartisan, but its Democratic-leaning child (D) focuses on social goals supported by taxes ("children", "education", "health care"), while its Republican-leaning child (C) focuses on business implications ("death tax", "jobs", "businesses"). Colors and the numbers beneath each topic show the regression parameter η associated with that topic.

Figure 4 shows the topic structure discovered by SHLDA in the review corpus. Nodes at higher levels are relatively neutral, with relatively small regression parameters.¹⁰ These nodes have general topics with no specific polarity. However, the bottom level clearly illustrates polarized positive/negative perspectives. For example, Node A concerns washbasins for infants and has two polarized child nodes: reviewers take a positive perspective when their children enjoy the product (Node B: "loves", "splash", "play") but have negative reactions when it leaks (Node C: "leak(s/ed/ing)").
[Figure 4 appears here; its per-node word lists and regression values are omitted from this text version.]

Figure 4: Topics discovered from Amazon reviews. Higher-level topics are general, while lower-level topics are more specific. The polarity of the review is encoded in the color: red (negative) to blue (positive). Many of the first-level topics have no specific polarity and are associated with a broad class of products, such as "routers" (Node D). However, the lowest topics in the hierarchy are often polarized; one child topic of "router" focuses on upgradable firmware such as "tomato" and "ddwrt" (Node E, positive), while another focuses on poor "tech support" and "customer service" (Node F, negative). The number below each topic is the regression parameter learned with that topic.

In addition to the per-topic regression parameters, SHLDA also associates each word with a lexical regression parameter τ. Table 3 shows the top ten words with the highest and lowest τ.
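Producing the word lists in Table 3 amounts to sorting the vocabulary by τ. A minimal sketch, assuming a fitted τ vector aligned with a vocabulary list:

```python
import numpy as np

def top_lexical_words(tau, vocab, k=10):
    """Words with the largest positive and most negative lexical weights."""
    order = np.argsort(tau)                         # ascending by tau
    negative = [vocab[i] for i in order[:k]]        # lowest (most negative) weights
    positive = [vocab[i] for i in order[::-1][:k]]  # highest (most positive) weights
    return positive, negative
```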
The results are unsurprising, although the lexical regression for the Congressional debates is less clear-cut than for the other datasets.¹⁰ As we saw in Section 5, for such datasets SHLDA's context-specific regression is more useful when global lexical weights do not readily differentiate documents.

Table 3: Top words based on the global lexical regression coefficient τ. For the floor debates, positive τ's are Republican-leaning while negative τ's are Democrat-leaning.
  Floor Debates
    Positive: bringing, private property, illegally, tax relief, regulation, mandates, constitutional, committee report, illegal alien
    Negative: bush administration, strong opposition, ranking, republicans, republican leadership, secret, discriminate, majority, undermine
  Amazon Reviews
    Positive: highly recommend, pleased, love, loves, perfect, easy, excellent, amazing, glad, happy
    Negative: waste, returned, return, stopped, leak, junk, useless, returning, refund, terrible
  Movie Reviews
    Positive: hilarious, fast, schindler, excellent, motion pictures, academy award, perfect, journey, fortunately, ability
    Negative: bad, unfortunately, supposed, waste, mess, worst, acceptable, awful, suppose, boring

¹⁰ All of the nodes at the second level have slightly negative values for the regression parameters, mainly due to the very skewed distribution of the review ratings in Amazon.

7 Related Work

SHLDA joins a family of LDA extensions that introduce hierarchical topics, supervision, or both. Owing to limited space, we focus here on related work that combines the two. Petinot et al. [22] propose hierarchical Labeled LDA (hLLDA), which leverages an observed document ontology to learn topics in a tree structure; however, hLLDA assumes that the underlying tree structure is known a priori. SSHLDA [23] generalizes hLLDA by allowing the document hierarchy labels to be partially observed, with unobserved labels and the topic tree structure then inferred from the data. Boyd-Graber and Resnik [24] used hierarchical distributions within topics to learn topics across languages. In addition to these "upstream" models [25], Perotte et al. [26] propose a "downstream" model called HSLDA, which jointly models documents' hierarchy of labels and topics. HSLDA's topic structure is flat, however, and its response variable is a hierarchy of labels associated with each document, unlike SHLDA's continuous response variable. Finally, another related body of work includes models that jointly capture topics and other facets such as ideologies/perspectives [27, 28] and sentiments/opinions [29], albeit with discrete rather than continuously valued responses. Computational modeling of sentiment polarity is a voluminous field [30], and many computational political science models describe agendas [5] and ideology [31]. Looking at framing or bias at the sentence level, Greene and Resnik [32] investigate the role of syntactic structure in framing, Yano et al. [33] look at lexical indications of sentence-level bias, and Recasens et al. [34] develop linguistically informed sentence-level features for identifying bias-inducing words.

8 Conclusion

We have introduced SHLDA, a model that associates a continuously valued response variable with hierarchical topics to capture both the issues under discussion and alternative perspectives on those issues. The two-level structure improves predictive performance over existing models on multiple datasets, while also adding potentially insightful hierarchical structure to the topic analysis.
Based on a preliminary qualitative analysis, the topic hierarchy exposed by the model plausibly captures the idea of agenda setting, which is related to the issues that get discussed, and framing, which is related to authors' perspectives on those issues. We plan to analyze the topic structure produced by SHLDA with political science collaborators and, more generally, to study how SHLDA and related models can help analyze and discover useful insights from political discourse.

Acknowledgments

This research was supported in part by NSF under grants #1211153 (Resnik) and #1018625 (Boyd-Graber and Resnik). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

[1] McCombs, M. The agenda-setting role of the mass media in the shaping of public opinion. North, 2009(05-12):21, 2002.
[2] McCombs, M., S. Ghanem. The convergence of agenda setting and framing. In Framing Public Life. 2001.
[3] Baumgartner, F. R., S. L. De Boef, A. E. Boydstun. The decline of the death penalty and the discovery of innocence. Cambridge University Press, 2008.
[4] Blei, D. M., A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.
[5] Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010.
[6] Zhang, J. Explore objects and categories in unexplored environments based on multimodal data. Ph.D. thesis, University of Hamburg, 2012.
[7] Blei, D. M., T. L. Griffiths, M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM, 57(2), 2010.
[8] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. JASA, 101(476), 2006.
[9] Paisley, J. W., C. Wang, D. M. Blei, et al. Nested hierarchical Dirichlet processes. arXiv:1210.6738, 2012.
[10] Ahmed, A., L. Hong, A. Smola. The nested Chinese restaurant franchise process: User tracking and document modeling. In ICML. 2013.
[11] Kim, J. H., D. Kim, S. Kim, et al. Modeling topic hierarchies with the recursive Chinese restaurant process. In CIKM, pages 783–792. 2012.
[12] Blei, D. M., J. D. McAuliffe. Supervised topic models. In NIPS. 2007.
[13] Liu, D., J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Prog., 1989.
[14] Thomas, M., B. Pang, L. Lee. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In EMNLP. 2006.
[15] Lewis, J. B., K. T. Poole. Measuring bias and uncertainty in ideal point estimates via the parametric bootstrap. Political Analysis, 12(2), 2004.
[16] Jindal, N., B. Liu. Opinion spam and analysis. In WSDM. 2008.
[17] Pang, B., L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL. 2005.
[18] Joachims, T. Making large-scale SVM learning practical. In Adv. in Kernel Methods - SVM. 1999.
[19] Neal, R. M. Slice sampling. Annals of Statistics, 31:705–767, 2003.
[20] Yu, B., D. Diermeier, S. Kaufmann. Classifying party affiliation from political speech. JITP, 2008.
[21] Chang, J., J. Boyd-Graber, C. Wang, et al. Reading tea leaves: How humans interpret topic models. In NIPS. 2009.
[22] Petinot, Y., K. McKeown, K. Thadani. A hierarchical model of web summaries. In HLT. 2011.
[23] Mao, X., Z. Ming, T.-S. Chua, et al. SSHLDA: A semi-supervised hierarchical topic model. In EMNLP. 2012.
[24] Boyd-Graber, J., P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP. 2010.
[25] Mimno, D. M., A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI. 2008.
[26] Perotte, A. J., F. Wood, N. Elhadad, et al. Hierarchically supervised latent Dirichlet allocation. In NIPS. 2011.
[27] Ahmed, A., E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP. 2010.
[28] Eisenstein, J., A. Ahmed, E. P. Xing. Sparse additive generative models of text. In ICML. 2011.
[29] Jo, Y., A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM. 2011.
[30] Pang, B., L. Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, 2008.
[31] Monroe, B. L., M. P. Colaresi, K. M. Quinn. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008.
[32] Greene, S., P. Resnik. More than words: Syntactic packaging and implicit sentiment. In NAACL. 2009.
[33] Yano, T., P. Resnik, N. A. Smith. Shedding (a thousand points of) light on biased language. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. 2010.
[34] Recasens, M., C. Danescu-Niculescu-Mizil, D. Jurafsky. Linguistic models for analyzing and detecting biased language. In ACL. 2013.

3 0.67332202 301 nips-2013-Sparse Additive Text Models with Low Rank Background

Author: Lei Shi

Abstract: The sparse additive model for text modeling involves the sum-of-exp computing, whose cost is consuming for large scales. Moreover, the assumption of equal background across all classes/topics may be too strong. This paper extends to propose sparse additive model with low rank background (SAM-LRB) and obtains simple yet efficient estimation. Particularly, employing a double majorization bound, we approximate log-likelihood into a quadratic lower-bound without the log-sumexp terms. The constraints of low rank and sparsity are then simply embodied by nuclear norm and ℓ1 -norm regularizers. Interestingly, we find that the optimization task of SAM-LRB can be transformed into the same form as in Robust PCA. Consequently, parameters of supervised SAM-LRB can be efficiently learned using an existing algorithm for Robust PCA based on accelerated proximal gradient. Besides the supervised case, we extend SAM-LRB to favor unsupervised and multifaceted scenarios. Experiments on three real data demonstrate the effectiveness and efficiency of SAM-LRB, compared with a few state-of-the-art models. 1

4 0.63503283 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

Author: Anirban Roychowdhury, Ke Jiang, Brian Kulis

Abstract: Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a smallvariance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that— particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. 1

5 0.55441493 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

Author: Yuhong Guo

Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1

6 0.55378807 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

7 0.55318713 11 nips-2013-A New Convex Relaxation for Tensor Completion

8 0.55281651 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

9 0.55153847 350 nips-2013-Wavelets on Graphs via Deep Learning

10 0.55136156 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

11 0.55054301 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

12 0.54862976 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

13 0.54821366 173 nips-2013-Least Informative Dimensions

14 0.54802603 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

15 0.54735368 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

16 0.54678273 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

17 0.5461967 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

18 0.5461458 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic

19 0.54597837 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

20 0.54564375 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding