nips nips2012 nips2012-87 nips2012-87-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Richard Socher, Brody Huval, Bharath Bhat, Christopher D. Manning, Andrew Y. Ng
Abstract: Recent advances in 3D sensing technologies make it possible to easily record color and depth images which together can improve object recognition. Most current methods rely on very well-designed features for this new 3D modality. We introduce a model based on a combination of convolutional and recursive neural networks (CNN and RNN) for learning features and classifying RGB-D images. The CNN layer learns low-level translationally invariant features which are then given as inputs to multiple, fixed-tree RNNs in order to compose higher-order features. RNNs can be seen as combining convolution and pooling into one efficient, hierarchical operation. Our main result is that even RNNs with random weights compose powerful features. Our model obtains state-of-the-art performance on a standard RGB-D object dataset while being more accurate and faster during training and testing than comparable architectures such as two-layer CNNs.
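The composition step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a CNN has already produced a feature map of shape (K, r, r), and it merges neighbouring feature vectors with one shared, randomly initialised weight matrix until a single vector remains. The block size, the tanh nonlinearity, the weight range, the number of RNNs, and the function names `fixed_tree_rnn` and `random_rnn_features` are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np


def fixed_tree_rnn(features, weights, block=2):
    """Merge block x block neighbourhoods recursively until one vector remains.

    features: array of shape (K, r, r); r is assumed to be a power of `block`.
    weights:  array of shape (K, block * block * K), shared by every tree node.
    """
    x = features
    while x.shape[1] > 1:
        k, r, _ = x.shape
        n = r // block
        # Split both spatial axes into (block index, offset inside the block).
        x = x.reshape(k, n, block, n, block)
        # Stack the children of each block into one column: (k * block^2, n * n).
        children = x.transpose(0, 2, 4, 1, 3).reshape(k * block * block, n * n)
        # The same weight matrix composes every group of children into a parent.
        x = np.tanh(weights @ children).reshape(k, n, n)
    return x.reshape(-1)


def random_rnn_features(features, num_rnns=4, block=2, seed=0):
    """Concatenate the outputs of several fixed-tree RNNs with random weights."""
    rng = np.random.default_rng(seed)
    k = features.shape[0]
    outputs = []
    for _ in range(num_rnns):
        # Assumed initialisation: small uniform weights, never trained.
        w = rng.uniform(-0.1, 0.1, size=(k, block * block * k))
        outputs.append(fixed_tree_rnn(features, w, block))
    return np.concatenate(outputs)


# Stand-in for a CNN output: K = 8 filters on a 4 x 4 grid.
cnn_features = np.random.default_rng(1).standard_normal((8, 4, 4))
descriptor = random_rnn_features(cnn_features)  # shape (32,) = 4 RNNs x 8 dims
```

In this sketch the tree structure is fixed by the grid layout rather than learned, and the concatenated vector would then be passed to a separate classifier.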
[1] M. Quigley, S. Batra, S. Gould, E. Klingbeil, Q. Le, A. Wellman, and A.Y. Ng. High-accuracy 3D sensing for mobile manipulation: improving object detection and door opening. In ICRA, 2009.
Figure 5: Confusion matrix of our CNN-RNN model. The ground truth labels are on the y-axis and the predicted labels on the x-axis. Both axes list the same 51 classes: apple, ball, banana, bell pepper, binder, bowl, calculator, camera, cap, cellphone, cereal box, coffee mug, comb, dry battery, flashlight, food bag, food box, food can, food cup, food jar, garlic, glue stick, greens, hand towel, instant noodles, keyboard, kleenex, lemon, lightbulb, lime, marker, mushroom, notebook, onion, orange, peach, pear, pitcher, plate, pliers, potato, rubber eraser, scissors, shampoo, soda can, sponge, stapler, tomato, toothbrush, toothpaste, water bottle. Many misclassifications are between (a) garlic and mushroom and (b) food-box and kleenex.
Figure 6: Examples of confused classes: shampoo bottle and water bottle, mushrooms labeled as garlic, pitchers classified as caps due to shape and color similarity, white caps classified as kleenex boxes at certain angles.
[2] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. In ICRA, 2011.
[3] A. Johnson. Spin-Images: A Representation for 3-D Surface Matching. PhD thesis, Robotics Institute, Carnegie Mellon University, 1997.
[4] H.S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3d point clouds for indoor scenes. In NIPS, 2011.
[5] L. Bo, X. Ren, and D. Fox. Depth kernel descriptors for object recognition. In IROS, 2011.
[6] M. Blum, J. T. Springenberg, J. Wülfing, and M. Riedmiller. A Learned Feature Descriptor for Object Recognition in RGB-D Data. In ICRA, 2012.
[7] L. Bo, X. Ren, and D. Fox. Unsupervised Feature Learning for RGB-D Based Object Recognition. In ISER, June 2012.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), November 1998.
[9] R. Socher, C. Lin, A. Y. Ng, and C.D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML, 2011.
[10] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In EMNLP, 2011.
[11] C. Goller and A. Küchler. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of the International Conference on Neural Networks (ICNN-96), 1996.
[12] R. Socher, C. D. Manning, and A. Y. Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop, 2010.
[13] A. Coates, A. Y. Ng, and H. Lee. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. Journal of Machine Learning Research - Proceedings Track: AISTATS, 2011.
[14] Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, and A.Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[15] Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
[16] A. Saxe, P.W. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng. On random weights and unsupervised feature learning. In ICML, 2011.
[17] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the Best Multi-Stage Architecture for Object Recognition? In ICCV. IEEE, 2009.
[18] N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is real-world visual object recognition hard? PLoS Comput Biol, 2008.
[19] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46, 1990.
[20] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS. MIT Press, 2011.
[21] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In ICCV Workshop on 3D Representation and Recognition, 2011.
[22] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 110(3), 2008.
[23] A. E. Abdel-Hakim and A. A. Farag. CSIFT: A SIFT descriptor with color invariant characteristics. In CVPR, 2006.
[24] K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. In ICCV, 2005.
[25] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786), 2006.
[26] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 2009.
[27] M. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
[28] A. Coates and A. Ng. The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization. In ICML, 2011.
[29] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In ICML, 2012.
[30] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Netw., 13, 2000.
[31] J. Ngiam, P. Koh, Z. Chen, S. Bhaskar, and A.Y. Ng. Sparse filtering. In NIPS, 2011.
[32] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In IJCAI, 2011.
[33] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proc. International Symposium on Circuits and Systems (ISCAS’10), 2010.