Last updated on August 12, 2018. This conference program is tentative and subject to change.
Technical Program for Thursday August 23, 2018
|
ThAMOT1 Semi-Supervised Learning (Ballroom C, 1st Floor)
Oral Session
|
11:10-12:30, Paper ThAMOT1.1
Deep Semi-Supervised Learning |
Hailat, Zeyad | Wayne State Univ |
Komarichev, Artem | Wayne State Univ |
Chen, Xuewen | Wayne State Univ |
Keywords: Semi-supervised learning, Neural networks, Classification
Abstract: Convolutional neural networks (CNNs) attain state-of-the-art performance on various classification tasks assuming a sufficiently large number of labeled training examples. Unfortunately, curating a sufficiently large labeled training dataset requires human involvement, which is expensive and time-consuming. Semi-supervised methods can alleviate this problem by utilizing a limited amount of labeled data in conjunction with sufficiently large unlabeled data to construct a classification model. Self-training techniques are among the earliest semi-supervised methods proposed to enhance learning by utilizing unlabeled data. In this paper, we propose a deep semi-supervised learning (DSSL) self-training method that utilizes the strengths of both supervised and unsupervised learning within a single model. We measure the efficacy of the proposed method on semi-supervised visual object classification tasks using the datasets CIFAR-10, CIFAR-100, STL-10, MNIST, and SVHN. The experiments show that DSSL surpasses semi-supervised state-of-the-art methods on most of the aforementioned datasets.
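For readers unfamiliar with self-training, the sketch below shows the generic confidence-thresholded pseudo-labeling step that this family of methods builds on. It is a minimal PyTorch illustration, not the authors' DSSL objective; the model, optimizer, data tensors and the 0.95 threshold are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def self_training_step(model, optimizer, x_l, y_l, x_u, threshold=0.95):
    """One generic self-training step: supervised loss on the labeled batch
    plus a pseudo-label loss on confidently predicted unlabeled examples."""
    loss = F.cross_entropy(model(x_l), y_l)
    with torch.no_grad():
        probs = F.softmax(model(x_u), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf > threshold                 # trust only confident predictions
    if keep.any():
        loss = loss + F.cross_entropy(model(x_u[keep]), pseudo[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```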
|
|
11:10-12:30, Paper ThAMOT1.2
Robust Adaptive Label Propagation by Double Matrix Decomposition |
Zhang, Huan | Soochow Univ |
Zhang, Zhao | Soochow Univ |
Li, Sheng | Nanjing Univ. of Posts and Telecommunications |
Ye, Qiaolin | Nanjing Univ. of Science and Tech |
Zhao, Mingbo | City Univ. of Hong Kong |
Wang, Meng | Microsoft Res. Asia |
Keywords: Classification, Semi-supervised learning
Abstract: In this paper, we investigate the robust transductive label prediction problem. Technically, a Robust Adaptive Label Propagation framework by Double Matrix Decomposition, called ALP-MD, is proposed for semi-supervised data classification. Compared with existing transductive label propagation models, our ALP-MD improves the classification power by performing label prediction in the clean data space and the clean label space at the same time. More specifically, our ALP-MD clearly integrates the idea of double matrix decomposition into the process of label prediction for noise removal. Since the predicted soft labels usually contain noise and mixed signs, our ALP-MD explicitly decomposes the predicted soft label matrix into a clean soft label matrix and a noise term, and then estimates the hard labels based on the clean soft label matrix for more accurate classification. In addition, ALP-MD also involves a regularization term to model the noise in data, integrates adaptive weights learning into the process of robust label prediction, and moreover performs the weights learning in the clean data space. Thus, our ALP-MD can explicitly ensure the learned weights to be as informative as possible and jointly optimal for both representation and classification, and potentially enhance the label prediction ability. Extensive comparisons demonstrate its effectiveness.
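To make the decomposition idea concrete, here is a toy numpy sketch that splits a predicted soft-label matrix into a clean part C and a sparse noise part E by alternating soft-thresholding and low-rank projection. This is only an illustrative solver assumed for exposition; the paper's actual optimization (and its parallel decomposition of the data space) differs.

```python
import numpy as np

def decompose_soft_labels(F_soft, lam=0.1, iters=50):
    """Toy alternating scheme splitting a predicted soft-label matrix into a
    clean low-rank part C and a sparse noise part E, with F_soft ~ C + E.
    Hard labels are then read off the clean part."""
    C = F_soft.copy()
    for _ in range(iters):
        R = F_soft - C
        E = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)   # soft-threshold noise
        U, s, Vt = np.linalg.svd(F_soft - E, full_matrices=False)
        k = max(1, int((s > 0.1 * s[0]).sum()))             # keep dominant spectrum
        C = (U[:, :k] * s[:k]) @ Vt[:k]                     # low-rank clean labels
    return C, E, C.argmax(axis=1)
```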
|
|
11:10-12:30, Paper ThAMOT1.3
Efficient Object Region Discovery for Weakly-Supervised Semantic Segmentation |
Zhong, Min | Peking Univ |
Zeng, Gang | Peking Univ |
Keywords: Semi-supervised learning, Deep learning, Mid-level vision
Abstract: Deep Convolutional Neural Networks (DCNNs) provide the leading performance in the semantic segmentation task. However, collecting large-scale pixel-level annotations for training such a DCNN is labor intensive and not cost-effective. In this paper, we propose a small-to-large (STL) Field-Of-View framework to train semantic segmentation networks from image-level annotations. Specifically, we first train a small Field-Of-View segmentation network (SFN) with the image-level annotations to discover initial object regions effectively. These localized regions are then combined with saliency maps to construct hypotheses on pixel-level annotations, using which a large Field-Of-View segmentation network (LFN) is learned. To further enhance the segmentation quality, the object regions generated by the LFN are verified with saliency maps, and thus the hypotheses are refined in an iterative manner. The converged hypotheses serve as the supervision information to learn a more powerful LFN for semantic segmentation. Extensive experimental results on the PASCAL VOC 2012 segmentation benchmark demonstrate the superiority of our proposed method compared with the state-of-the-art.
|
|
11:10-12:30, Paper ThAMOT1.4
Semi-Supervised Graph Rewiring with the Dirichlet Principle |
Curado, Manuel | Univ. of Alicante |
Escolano, Francisco | Univ. of Alicante |
Lozano, Miguel Angel | Univ. of Alicante |
Hancock, Edwin | Univ. of York |
Keywords: Semi-supervised learning, Transfer learning, Clustering
Abstract: In this paper, we propose the concept of graph rewiring and show how to exploit it in a semi-supervised setting so that commute times can be better estimated by state-of-the-art methods. Our experiments show a significant improvement with respect to unsupervised graph rewiring.
|
|
ThAMOT2 Object Detection (309B, 3rd Floor)
Oral Session
|
11:10-12:30, Paper ThAMOT2.1
Densely Connected Single-Shot Detector |
Xu, Pei | CASIA |
Zhao, Xin | Inst. of Automation, Chinese Acad. of Sciences |
Huang, Kaiqi | NLPR |
Keywords: Object detection, Deep learning
Abstract: The one-stage object detection approach, which utilizes multi-scale feature maps to predict objects, currently yields the best real-time detectors. However, in this approach, the high-resolution feature maps that are responsible for detecting small objects find it harder to learn a proper abstraction of objects than the low-resolution feature maps. The problem is that these feature maps have to pass sufficient low-level information to the next layer while learning high-level abstractions. In this paper, we develop a transformation module that adopts dense structures to simplify the learning problem of the high-resolution feature maps. In addition, we utilize the inception module to enrich the representational power of the high-resolution feature maps. Extensive experiments on most object detection datasets clearly demonstrate the effectiveness of our method. In particular, on PASCAL VOC 2007/2012, our method outperforms all existing one-stage methods. Our model based on the VGG-16 network also achieves competitive results on MS COCO.
|
|
11:10-12:30, Paper ThAMOT2.2
Video Salient Object Detection Via Multiple Time-Scale Analysis |
Chen, Yuhuan | Shenzhen Univ |
Huang, Limin | Shenzhen People's Hospital |
Zou, Wenbin | Shenzhen Univ |
Li, Xia | Shenzhen Univ |
Qiu, Guoping | Univ. of Nottingham |
Keywords: Object detection, Video processing and analysis, Segmentation, features and descriptors
Abstract: This paper focuses on salient object detection in video by multiple time-scale analysis, which exploits the temporally consistent information under three different scales. In the first time-scale, we define an effective measure called motion contrast from both low-level cues and the optical flow fields. In the second time-scale, we propose a novel approach to repair the inaccurate motion contrast caused by mistakes in the optical flow. In the third time-scale, considering that low-contrast objects which stop moving for a certain amount of time cannot remain prominent, we present a robust motion detection method based on point tracking and trajectory clustering. Finally, the outcomes from the three time-scales jointly formulate the saliency detection by Bayesian inference. The proposed model is evaluated on the widely-used DAVIS and FBMS benchmarks. Experiments demonstrate that our proposed model substantially outperforms the state-of-the-art saliency detection models.
|
|
11:10-12:30, Paper ThAMOT2.3
Object Detection in Equirectangular Panorama |
Yang, Wenyan | Lab. of Signal Processing, Tampere Univ. of Tech |
Qian, Yanlin | Tampere Univ. of Tech |
Cricri, Francesco | Nokia Tech |
Fan, Lixin | Nokia Tech |
Kamarainen, Joni-Kristian | Tampere Univ. of Tech |
Keywords: Object detection, Applications of pattern recognition and machine learning, Deep learning
Abstract: We introduce a high-resolution equirectangular panorama (aka 360-degree, virtual reality, VR) dataset for object detection and propose a multi-projection variant of the YOLO detector. The main challenges with equirectangular panorama images are i) the lack of annotated training data, ii) high-resolution imagery and iii) severe geometric distortions of objects near the panorama projection poles. In this work, we solve the challenges by I) using training examples available in the “conventional datasets” (ImageNet and COCO), II) employing only low-resolution images that require only moderate GPU computing power and memory, and III) introducing a multi-projection YOLO that handles projection distortions by making multiple stereographic sub-projections. In our experiments, YOLO outperforms the other state-of-the-art detector, Faster R-CNN, and our multi-projection YOLO achieves the best accuracy with low-resolution input.
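The stereographic sub-projection at the heart of the multi-projection detector can be sampled with the standard inverse stereographic map. The numpy sketch below renders one such sub-view from an equirectangular panorama; the field-of-view handling and nearest-neighbour sampling are simplifications assumed here, not the paper's exact configuration.

```python
import numpy as np

def stereographic_patch(equi, lon0, lat0, extent=1.0, out_size=416):
    """Render one stereographic sub-view from an equirectangular panorama
    equi (H x W x 3), centred at longitude/latitude (lon0, lat0) in radians.
    Uses the standard inverse stereographic equations on a unit sphere."""
    H, W = equi.shape[:2]
    xs = np.linspace(-extent, extent, out_size)
    x, y = np.meshgrid(xs, xs)
    rho = np.sqrt(x**2 + y**2) + 1e-12
    c = 2.0 * np.arctan(rho / 2.0)
    lat = np.arcsin(np.cos(c) * np.sin(lat0) + y * np.sin(c) * np.cos(lat0) / rho)
    lon = lon0 + np.arctan2(
        x * np.sin(c), rho * np.cos(lat0) * np.cos(c) - y * np.sin(lat0) * np.sin(c))
    u = ((lon / (2 * np.pi) + 0.5) % 1.0) * (W - 1)   # wrap longitude to [0, W)
    v = (0.5 - lat / np.pi) * (H - 1)                 # latitude to row index
    return equi[np.clip(v.round().astype(int), 0, H - 1),
                np.clip(u.round().astype(int), 0, W - 1)]
```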
|
|
11:10-12:30, Paper ThAMOT2.4
Multi-Scale Semantic Segmentation Enriched Features for Pedestrian Detection |
Xie, Xiaolu | Inst. of Intelligent Machines, Chinese Acad. of Sciences
Wang, Zengfu | Univ. of Science and Tech. of China |
Keywords: Object detection, Segmentation, features and descriptors, Deep learning
Abstract: Pedestrian detection, as a branch of computer vision, has many significant real-world applications such as autonomous driving and human behavior analysis. In this paper, we propose a convolutional neural network (CNN) based pedestrian detection framework which can be trained end-to-end. We design a feature enrichment unit to produce more representative features and thereby improve detection performance. The feature enrichment units receive feature maps from the body network layer by layer and convey features in a backward manner. Together they produce multi-scale semantic segmentation results as extra features and merge them with the feature maps of the body network. The merged feature maps are then fed into the detector to produce final predictions. The feature enrichment unit is easy to embed into existing convolutional neural network based detection frameworks since it receives and produces feature maps. We use an alternating training strategy to train the network for detection and segmentation respectively and achieve considerable accuracy. The multi-scale feature enrichment units improve detection accuracy significantly, as demonstrated by our experiments.
|
|
ThAMOT3 Face Biometrics (310, 3rd Floor)
Oral Session
|
11:10-12:30, Paper ThAMOT3.1
Joint Voxel and Coordinate Regression for Accurate 3D Facial Landmark Localization |
Zhang, Hongwen | Inst. of Automation, Chinese Acad. of Sciences |
Li, Qi | Inst. of Automation, Chinese Acad. of Sciences |
Sun, Zhenan | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Biometric systems and applications, Facial expression recognition, Pattern recognition for human computer interaction
Abstract: 3D face shape is more expressive and viewpoint-consistent than its 2D counterpart. However, 3D facial landmark localization in a single image is challenging due to the ambiguous nature of landmarks under 3D perspective. Existing approaches typically adopt a suboptimal two-step strategy, performing 2D landmark localization followed by depth estimation. In this paper, we propose the Joint Voxel and Coordinate Regression (JVCR) method for 3D facial landmark localization, addressing it more effectively in an end-to-end fashion. First, a compact volumetric representation is proposed to encode the per-voxel likelihood of positions being the 3D landmarks. The dimensionality of such a representation is fixed regardless of the number of target landmarks, so that the curse of dimensionality could be avoided. Then, a stacked hourglass network is adopted to estimate the volumetric representation from coarse to fine, followed by a 3D convolution network that takes the estimated volume as input and regresses 3D coordinates of the face shape. In this way, the 3D structural constraints between landmarks could be learned by the neural network in a more efficient manner. Moreover, the proposed pipeline enables end-to-end training and improves the robustness and accuracy of 3D facial landmark localization. The effectiveness of our approach is validated on the 3DFAW and AFLW2000-3D datasets. Experimental results show that the proposed method achieves state-of-the-art performance in comparison with existing methods.
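As a simplified stand-in for the coordinate-regression stage, the snippet below shows the common soft-argmax way of turning a per-landmark voxel likelihood volume into 3D coordinates. The paper instead regresses coordinates with a learned 3D CNN, so treat this PyTorch sketch as an assumption-laden illustration rather than the authors' network.

```python
import torch

def soft_argmax_3d(volume):
    """Differentiable 3D coordinate regression from per-landmark voxel
    likelihoods of shape (B, L, D, H, W): softmax the volume, then take the
    expected (x, y, z) index."""
    B, L, D, H, W = volume.shape
    probs = torch.softmax(volume.reshape(B, L, -1), dim=-1).reshape(B, L, D, H, W)
    dev, dt = volume.device, volume.dtype
    z = (probs.sum(dim=(3, 4)) * torch.arange(D, device=dev, dtype=dt)).sum(-1)
    y = (probs.sum(dim=(2, 4)) * torch.arange(H, device=dev, dtype=dt)).sum(-1)
    x = (probs.sum(dim=(2, 3)) * torch.arange(W, device=dev, dtype=dt)).sum(-1)
    return torch.stack([x, y, z], dim=-1)   # (B, L, 3) voxel-space coordinates
```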
|
|
11:10-12:30, Paper ThAMOT3.2
Patch-Gated CNN for Occlusion-Aware Facial Expression Recognition |
Li, Yong | Inst. of Computing Tech., Chinese Acad. of Sciences
Zeng, Jiabei | Inst. of Computing Tech., Chinese Acad. of Sciences
Shan, Shiguang | Inst. of Computing Tech., Chinese Acad. of Sciences
Chen, Xilin | Inst. of Computing Tech |
Keywords: Facial expression recognition, Affective computing
Abstract: Facial expression recognition in the wild is challenging due to various unconstrained conditions. Although existing facial expression classifiers have been almost perfect at analyzing constrained frontal faces, they fail to perform well on partially occluded faces that are common in the wild. In this paper, we propose an end-to-end trainable Patch-Gated Convolution Neural Network (PG-CNN) that can automatically perceive the occluded regions of the face and focus on the most discriminative un-occluded regions. To determine the possible regions of interest on the face, PG-CNN decomposes an intermediate feature map into several patches according to the positions of related facial landmarks. Then, via a proposed Patch-Gated Unit, PG-CNN reweighs each patch by the unobstructed-ness or importance that is computed from the patch itself. The proposed PG-CNN is evaluated on the two largest in-the-wild facial expression datasets (RAF-DB and AffectNet) and their modifications with synthesized facial occlusions. Experimental results show that PG-CNN improves the recognition accuracy on both the original faces and the faces with synthesized occlusions. Visualization results demonstrate that, compared with a CNN without the Patch-Gated Unit, PG-CNN is capable of shifting the attention from an occluded patch to other related but unobstructed ones. Experiments also show that PG-CNN outperforms other state-of-the-art methods on several widely used in-the-lab facial expression datasets under the cross-dataset evaluation protocol.
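A minimal PyTorch sketch of the gating idea, assuming a scalar gate computed from the patch features themselves; the layer sizes and pooling choice are illustrative and not the paper's exact Patch-Gated Unit.

```python
import torch
import torch.nn as nn

class PatchGate(nn.Module):
    """Sketch of a patch gate: estimate a scalar 'unobstructed-ness' score
    from a cropped feature patch and reweigh the patch by it."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid())    # gate value in [0, 1]

    def forward(self, patch_feat):                   # patch_feat: (B, C, h, w)
        g = self.score(patch_feat).view(-1, 1, 1, 1)
        return patch_feat * g                        # attenuate occluded patches
```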
|
|
11:10-12:30, Paper ThAMOT3.3
Scattering Transform for Matching Surgically Altered Face Images |
Gupta, Ishita | Google Inc |
Bhalla, Ikshu | Google Inc |
Singh, Richa | IIIT Delhi |
Vatsa, Mayank | IIIT Delhi |
Keywords: Face recognition
Abstract: The use of the face as a biometric feature has been widely accepted and used in security and surveillance systems. Recent studies have made significant advancements in addressing various challenges in face recognition such as illumination, age, pose and disguise. Another important covariate is recognizing faces before and after facial plastic surgery. Facial plastic surgeries alter the geometry and texture of facial regions, the extent of which depends on both the number and the type of surgeries performed. The increasing reach of plastic surgery and its expanding user base present a challenge that must be dealt with when devising robust face recognition systems. In this paper, we present Invariant Scattering Transform based feature extraction to compute translation-invariant representations at local and global levels that are stable against plastic surgery variations. The identification accuracy achieved by the proposed algorithm is over 97% at rank-10 on the IIITD plastic surgery face database.
|
|
11:10-12:30, Paper ThAMOT3.4
Multimodal Face Spoofing Detection Via RGB-D Images
Sun, Xudong | Inst. of Automation, Chinese Acad. of Sciences |
Huang, Lei | Inst. of Automation, Chinese Acad. of Sciences |
Liu, Changping | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Biometric anti-spoofing, Security and privacy in biometrics
Abstract: While it has been shown that using 3D information can significantly benefit face anti-spoofing systems, traditional color images are still generally used, due to several issues such as expensive hardware requirements, high time cost, or poor accessibility when obtaining and using true 3D images. Thus, we can instead use RGB-D images captured by relatively low-cost sensors, e.g., Kinect cameras, to achieve better performance without consuming huge amounts of time or money. This research presents a novel multimodal face anti-spoofing method, which makes full use of the available information in RGB-D images and requires no manually chosen regions. For every pair of RGB-D images, we first calculate the correlation between the color and depth images to detect multimodal properties; then, by analyzing the consistency of sub-regions extracted from the depth image, we are able to distinguish flat spoofing faces from genuine human beings. Both anti-spoofing features are fused to make the final anti-spoofing decision. Experiments on both self-collected and public 3DMAD datasets show that our proposed approach is effective in intra-dataset and cross-dataset testing scenarios, and that our method can deal with different presentation attacks carried out with photos, tablet screens, and face masks.
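One plausible instantiation of the color/depth correlation cue is a plain Pearson correlation over valid pixels, sketched below in numpy; the authors' exact feature may differ, and the luminance computation here is a deliberate simplification.

```python
import numpy as np

def rgbd_correlation(rgb, depth):
    """Pearson correlation between image intensity and depth over valid
    pixels, as a simple stand-in for a color/depth correlation feature."""
    gray = rgb.mean(axis=2)            # crude luminance from the RGB channels
    valid = depth > 0                  # ignore missing depth readings
    g = gray[valid].astype(float)
    d = depth[valid].astype(float)
    g = (g - g.mean()) / (g.std() + 1e-8)
    d = (d - d.mean()) / (d.std() + 1e-8)
    return float((g * d).mean())
```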
|
|
ThAMOT4 Text Detection and Recognition (311B, 3rd Floor)
Oral Session
|
11:10-12:30, Paper ThAMOT4.1
Scene Text Detection with Recurrent Instance Segmentation |
Feng, Wei | Inst. of Automation of Chinese Acad. of Sciences |
He, Wenhao | Chinese Acad. of Science |
Yin, Fei | Inst. of Automation of CAS |
Liu, Cheng-Lin | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Scene text detection and recognition, Neural networks, Deep learning
Abstract: Convolutional Neural Network (CNN) based scene text detection methods mostly employ the semantic segmentation (text/non-text classification) task to localize text regions. However, they cannot distinguish different text-lines the way instance segmentation does. In this paper, we propose a novel framework based on Fully Convolutional Networks (FCN) and a Recurrent Neural Network (RNN) to achieve both scene text detection and instance segmentation. The FCN is used to classify text and non-text regions, and the RNN utilizes the features extracted by the FCN to simultaneously detect and segment one text instance at each time step. Meanwhile, it also extracts bounding boxes in a much simpler way than the non-maximum suppression (NMS) method. The proposed method achieves competitive results on two public benchmarks, the ICDAR 2015 Incidental Scene Text Dataset and the ICDAR 2013 Focused Scene Text Dataset. Moreover, the benefits of adding a regression task to the RNN module are demonstrated.
|
|
11:10-12:30, Paper ThAMOT4.2
A Novel Integrated Framework for Learning Both Text Detection and Recognition |
Sui, Wanchen | Alibaba |
Zhang, Qing | Alibaba |
Yang, Jun | Alibaba |
Chu, Wei | Ant Financial, Alibaba Group |
Keywords: Character and text recognition, Deep learning, Sequence modeling
Abstract: In this paper, we propose a novel integrated framework for learning both text detection and recognition. In most existing methods, detection and recognition are treated as two isolated tasks and trained separately, since the parameters of the detection and recognition models are different and the two models optimize their own loss functions during individual training processes. In contrast to those methods, by sharing model parameters, we merge the detection model and the recognition model into a single end-to-end trainable model and train the joint model for the two tasks simultaneously. The shared parameters not only help effectively reduce the computational load during inference, but also improve the end-to-end text detection-recognition accuracy. In addition, we design a simpler and faster sequence learning method for the recognition network based on a succession of stacked convolutional layers without any recurrent structure; this proves feasible and dramatically improves inference speed. Extensive experiments on different datasets demonstrate that the proposed method achieves very promising results.
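A minimal PyTorch sketch of the parameter-sharing idea: one convolutional backbone feeding a detection head and a purely convolutional, recurrence-free recognition head. All channel sizes, head shapes and the 37-class alphabet below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SharedDetRec(nn.Module):
    """Minimal shared-backbone model: one conv trunk, two task heads."""
    def __init__(self, num_classes=37):   # e.g. 26 letters + 10 digits + blank
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.det_head = nn.Conv2d(128, 5, 1)        # per-cell box offsets + score
        self.rec_head = nn.Sequential(               # stacked convs as sequence model
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_classes, 1))

    def forward(self, x):
        feats = self.backbone(x)                     # shared features for both tasks
        return self.det_head(feats), self.rec_head(feats)
```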
|
|
11:10-12:30, Paper ThAMOT4.3
Learning Graph Distances with Message Passing Neural Networks |
Riba, Pau | Computer Vision Center |
Fischer, Andreas | Univ. of Fribourg |
Llados, Josep | Computer Vision Center |
Fornés, Alicia | Computer Vision Center |
Keywords: Document retrieval, Graphics recognition, Graph matching
Abstract: Graph representations have been widely used in pattern recognition thanks to their powerful representation formalism and rich theoretical background. A number of error-tolerant graph matching algorithms such as graph edit distance have been proposed for computing a distance between two labelled graphs. However, they typically suffer from a high computational complexity, which makes it difficult to apply these matching algorithms in a real scenario. In this paper, we propose an efficient graph distance based on the emerging field of geometric deep learning. Our method employs a message passing neural network to capture the graph structure and learns a metric with a siamese network approach. The performance of the proposed graph distance is validated in two application cases, graph classification and graph retrieval of handwritten words, and shows a promising performance when compared with (approximate) graph edit distance benchmarks.
|
|
11:10-12:30, Paper ThAMOT4.4
Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition |
Zhang, Jianshu | Univ. of Science and Tech. of China |
Du, Jun | Univ. of Science and Tech. of China |
Dai, Li-Rong | Univ. of Science and Tech. of China |
Keywords: Character and text recognition, Document image processing, Pen-based document analysis
Abstract: Handwritten mathematical expression recognition is a challenging problem due to the complicated two-dimensional structures, ambiguous handwriting input and varying scales of handwritten math symbols. To address this problem, we recently proposed an attention based encoder-decoder model that recognizes mathematical expression images from two-dimensional layouts into one-dimensional LaTeX strings. In this study, we improve the encoder by employing densely connected convolutional networks, as they can strengthen feature extraction and facilitate gradient propagation, especially on a small training set. We also present a novel multi-scale attention model which is employed to deal with the recognition of math symbols at different scales and to restore the fine-grained details dropped by pooling operations. Validated on the CROHME competition task, the proposed method significantly outperforms the state-of-the-art methods with an expression recognition accuracy of 52.8% on CROHME 2014 and 50.1% on CROHME 2016, using only the official training dataset.
|
|
ThMMOT2 Online and Active Learning (309A, 3rd Floor)
Oral Session
|
14:00-14:20, Paper ThMMOT2.1
An Incremental Multi-View Active Learning Algorithm for PolSAR Data Classification |
Nie, Xiangli | Chinese Acad. of Sciences |
Luo, Yongkang | Inst. of Automation, Chinese Acad. of Sciences |
Qiao, Hong | Inst. of Automation, Chinese Acad. of Sciences |
Zhang, Bo | AMSS, Chinese Acad. of Sciences |
Jiang, Zhong-Ping | New York Univ |
Keywords: Online learning, Multiview learning, Active learning
Abstract: The fast and accurate classification of polarimetric synthetic aperture radar (PolSAR) data in dynamically changing environments is an important and challenging task. In this paper, we propose an Incremental Multi-view Passive-Aggressive Active learning algorithm, named IMPAA, for PolSAR data classification. This algorithm can deal with the online two-view multi-class categorization problem by exploiting the relationship between the polarimetric-color and texture feature sets of PolSAR data. In addition, the IMPAA algorithm can handle dynamic large-scale datasets where not only the amount of data but also the number of classes gradually increases. Moreover, the algorithm only queries the class labels of some informative incoming samples to update the classifier, based on the disagreement of the different views' predictors and a randomized rule. Experiments on real PolSAR data demonstrate that the proposed method can use a smaller fraction of queried labels to achieve low online classification errors compared with previously known methods.
|
|
14:20-14:40, Paper ThMMOT2.2
A Linear Incremental Nystrom Method for Online Kernel Learning |
Xu, Shan | Tianjin Univ |
Zhang, Xiao | Tianjin Univ |
Liao, Shizhong | Tianjin Univ |
Keywords: Online learning, Support vector machine and kernel methods, Classification
Abstract: Although the incremental Nystrom method has been used in kernel approximation, it is not suitable for online kernel learning due to its cubic time complexity and the lack of theoretical guarantees. In this paper, we propose a novel incremental Nystrom method, which has linear time complexity with respect to the sampling size at each round, and enjoys a sublinear regret bound for online kernel learning. We construct the intersection matrix using the ridge leverage score estimator, compute the rank-k approximation of the intersection matrix incrementally via incremental singular value decomposition, and recalculate the generalized inverse matrix periodically. When applying the proposed incremental Nystrom method to online kernel learning, we approximate the kernel matrix using the updated generalized inverse matrix at each round, and formulate the explicit feature mapping by the singular value decomposition of the approximated kernel matrix, yielding the linear classifier for online kernel learning at each round. Theoretically, we prove that our incremental Nystrom method has a (1+epsilon) relative-error bound for kernel matrix approximation, enjoys a sublinear regret bound using the online gradient descent method for online kernel learning, and reduces the time complexity of generalized inverse computation from O(m^3) to O(mk) at each round, where m is the sampling size and k is the truncated rank. Experimental results show that the proposed incremental Nystrom method is accurate and efficient in kernel matrix approximation and is suitable for online kernel learning.
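For reference, the standard rank-k Nystrom feature map that such methods build on can be written in a few lines of numpy; the paper's contributions (leverage-score sampling, incremental SVD updates, periodic recomputation of the generalized inverse) sit on top of this core and are omitted here.

```python
import numpy as np

def nystrom_features(K_nm, K_mm, k):
    """Rank-k Nystrom feature map: given kernel blocks K_nm (n x m, data vs.
    landmarks) and K_mm (m x m, landmarks vs. landmarks), return explicit
    features phi with phi @ phi.T approximating the full kernel matrix."""
    U, s, _ = np.linalg.svd(K_mm)
    U_k, s_k = U[:, :k], s[:k]
    return K_nm @ U_k / np.sqrt(s_k)   # phi = K_nm U_k S_k^{-1/2}
```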
|
|
14:40-15:00, Paper ThMMOT2.3
Rotate Your Networks: Better Weight Consolidation and Less Catastrophic Forgetting |
Liu, Xialei | Computer Vision Center of UAB |
Masana, Marc | Computer Vision Center UAB |
Herranz, Luis | Computer Vision Center |
van de Weijer, Joost | Computer Vision Center Barcelona |
López Peña, Antonio M. | CVC-UAB |
Bagdanov, Andrew D. | Univ. of Florence |
Keywords: Neural networks, Image classification, Deep learning
Abstract: In this paper we propose an approach to avoiding catastrophic forgetting in sequential task learning scenarios. Our technique is based on a network reparameterization that approximately diagonalizes the Fisher Information Matrix of the network parameters. This reparameterization takes the form of a factorized rotation of parameter space which, when used in conjunction with Elastic Weight Consolidation (which assumes a diagonal Fisher Information Matrix), leads to significantly better performance on lifelong learning of sequential tasks. Experimental results on the MNIST, CIFAR-100, CUB-200 and Stanford-40 datasets demonstrate that we significantly improve the results of standard elastic weight consolidation, and that we obtain competitive results when compared to the state-of-the-art in lifelong learning without forgetting.
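For context, the diagonal-Fisher EWC penalty that the paper's rotation is designed to serve looks as follows in PyTorch; the rotation itself is the paper's contribution and is omitted from this sketch. The fisher_diag and old_params dictionaries keyed by parameter name are assumptions of this illustration.

```python
import torch

def ewc_penalty(model, fisher_diag, old_params, lam=100.0):
    """Standard diagonal-Fisher EWC penalty: lam * sum_i F_i (theta_i - theta*_i)^2.
    fisher_diag and old_params hold, per parameter name, the diagonal Fisher
    estimate and the parameter values after the previous task."""
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        loss = loss + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return lam * loss
```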
|
|
15:00-15:20, Paper ThMMOT2.4
Dynamic Ensemble Active Learning: A Non-Stationary Bandit with Expert Advice |
Pang, Kunkun | Univ. of Edinburgh |
Dong, Mingzhi | Beijing Univ. of Posts and Telecommunications |
Wu, Yang | Nara Inst. of Science and Tech |
Hospedales, Timothy | Queen Mary Univ. of London |
Keywords: Active learning
Abstract: Active learning aims to reduce annotation cost by predicting which samples are useful for a human teacher to label. However, it has become clear that there is no single best active learning algorithm. Inspired by various philosophies about what constitutes a good criterion, different algorithms perform well on different datasets. This has motivated research into ensembles of active learners that learn what constitutes a good criterion in a given scenario, typically via multi-armed bandit algorithms. Though algorithm ensembles can lead to better results, they overlook the fact that algorithm efficacy varies not only across datasets, but also during a single active learning session. That is, the best criterion is non-stationary. This breaks existing algorithms' guarantees and hampers their performance in practice. In this paper, we propose dynamic ensemble active learning as a more general and promising research direction. We develop a dynamic ensemble active learner based on a non-stationary multi-armed bandit with expert advice. Our dynamic ensemble selects the right criterion at each step of active learning. It has theoretical guarantees, and shows encouraging results on 13 popular datasets.
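A minimal numpy sketch of the non-stationary bandit ingredient: an EXP3-style learner with a fixed-share mixing step so the weights can track a drifting best arm. Treating each arm as one active-learning criterion, and the eta/alpha values, are illustrative assumptions; the paper's algorithm additionally incorporates expert advice.

```python
import numpy as np

def fixed_share_bandit(n_arms, T, reward_fn, eta=0.1, alpha=0.05, seed=0):
    """EXP3-style bandit with a fixed-share step toward uniform weights.
    reward_fn(arm, t) returns an observed reward in [0, 1]."""
    rng = np.random.default_rng(seed)
    w = np.ones(n_arms)
    for t in range(T):
        p = (1 - eta) * w / w.sum() + eta / n_arms         # exploration mix
        arm = rng.choice(n_arms, p=p)
        r = reward_fn(arm, t)
        w[arm] *= np.exp(eta * r / (p[arm] * n_arms))      # importance-weighted boost
        w = (1 - alpha) * w + alpha * w.mean()             # fixed share: track drift
    return w / w.sum()
```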
|
|
ThMMOT3 Vision Applications (309B, 3rd Floor)
Oral Session
|
14:00-14:20, Paper ThMMOT3.1
Egocentric Shopping Cart Localization |
Spera, Emiliano | Univ. of Catania - Centro Studi S.r.l |
Furnari, Antonino | Univ. of Catania |
Battiato, Sebastiano | Univ. of Catania |
Farinella, Giovanni Maria | Univ. of Catania |
Keywords: Applications of computer vision, Scene understanding, Vision for robotics
Abstract: This work investigates the new problem of image-based egocentric shopping cart localization in retail stores. The contribution of our work is two-fold. First, we propose a novel large-scale dataset for image-based egocentric shopping cart localization. The dataset has been collected using cameras placed on shopping carts in a large retail store. It contains a total of 19,531 image frames, each labelled with its six Degrees Of Freedom pose. We study the localization problem by analysing how cart locations should be represented and estimated, and how to assess the localization results. Second, we benchmark different image-based techniques to address the task. Specifically, we investigate two families of algorithms: classic methods based on image retrieval and emerging methods based on regression. Experimental results show that methods based on image retrieval largely outperform regression-based approaches. We also point out that deep metric learning techniques make it possible to learn better visual representations than other architectures, and are useful for improving the localization results of both retrieval-based and regression-based approaches. Our findings suggest that deep metric learning techniques can help bridge the gap between retrieval-based and regression-based methods.
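The retrieval-based family the paper benchmarks can be reduced to a few lines: embed the query, find the nearest database images in feature space, and aggregate their stored poses. The naive pose averaging below (which ignores proper rotation averaging on the manifold) is a deliberate simplification for illustration.

```python
import numpy as np

def retrieve_pose(query_feat, db_feats, db_poses, k=5):
    """Retrieval-based localization: k-nearest neighbours in embedding space,
    followed by crude averaging of their 6-DOF pose vectors."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    return db_poses[nearest].mean(axis=0)
```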
|
|
14:20-14:40, Paper ThMMOT3.2
Scalable Monocular SLAM by Fusing and Connecting Line Segments with Inverse Depth Filter |
Zhang, Jiyuan | Peking Univ |
Zeng, Gang | Peking Univ |
Zha, Hongbin | Peking Univ |
Keywords: Applications of computer vision, 3D vision, Multiple view geometry
Abstract: In this paper we propose a fast and robust line-based approach to monocular SLAM. It relies on a novel inverse depth representation of lines capable of tracking line segments over long image sequences. Lines tracked through frames provide crucial directional and positional knowledge for boosting localization performance, as they are more informative in characterizing environments than points, especially in urban outdoor and indoor scenes. The developed two-parameter inverse depth representation of lines is applicable to the Kalman filter and, owing to its linearity, yields an efficient solver with lower computational cost than binary descriptors. This filter is also harmonious with the inverse depth filter of points, and both are incorporated under a unified minimization framework to enhance the performance of monocular SLAM. Real-world monocular sequences demonstrate that the proposed SLAM system outperforms the state-of-the-art and produces accurate results in both indoor and outdoor scenes.
|
|
14:40-15:00, Paper ThMMOT3.3
End-To-End Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perceptions |
Yang, Zhengyuan | Univ. of Rochester |
Zhang, Yixuan | Univ. of Rochester |
Yu, Jerry | SAIC USA Innovation Center |
Cai, Junjie | SAIC USA Innovation Center |
Luo, Jiebo | Univ. of Rochester
Keywords: Video analysis, Scene understanding, Applications of computer vision
Abstract: Convolutional Neural Networks (CNN) have been successfully applied to autonomous driving tasks, many in an end-to-end manner. Previous end-to-end steering control methods take an image or an image sequence as the input and directly predict the steering angle with a CNN. Although single-task learning on steering angles has reported good performance, the steering angle alone is not sufficient for vehicle control. In this work, we propose a multi-task learning framework to predict the steering angle and speed control simultaneously in an end-to-end manner. Since it is nontrivial to predict accurate speed values with only visual inputs, we first propose a network to predict discrete speed commands and steering angles from image sequences. Moreover, we propose a multi-modal multi-task network to predict speed values and steering angles by taking previous feedback speeds and visual recordings as inputs. Experiments are conducted on the public Udacity dataset and a newly collected SAIC dataset. Results show that the proposed model predicts steering angles and speed values accurately. Furthermore, we improve the failure data synthesis methods to solve the problem of error accumulation in real road tests.
|
|
15:00-15:20, Paper ThMMOT3.4
Spatial Calibration for Thermal-RGB Cameras and Inertial Sensor System |
Li, Yan | Univ. of Tennessee, Knoxville |
Zhang, Yinlong | Shenyang Inst. of Automation Chinese Acad. of Sciences |
He, Hongsheng | Wichita State Univ |
Tan, Jindong | Univ. of Tennessee, Knoxville |
Keywords: Vision for robotics, Vision sensors
Abstract: Light-weight thermal-RGB-inertial sensing units are now gaining increasing research attention, due to their heterogeneous and complementary properties. A robust and accurate registration between a thermal-RGB camera and an inertial sensor is a necessity for effective thermal-RGB-inertial fusion, which is an indispensable procedure for reliable tracking and mapping tasks. This paper presents an accurate calibration method to geometrically correlate the spatial relationships between an RGB camera, a thermal camera and an inertial measurement unit (IMU). The calibration proceeds within a unified calibration framework (thermal-to-RGB, RGB-to-IMU). The extrinsic parameters are estimated by jointly optimizing the chessboard corner reprojection errors and the acceleration and angular velocity error terms. Extensive evaluations have been performed on the collected thermal-RGB-inertial measurements. In this experimental study, the average RMS translation and Euler angle errors are less than 6 mm and 0.04 rad, respectively, under 20% artificial noise.
|
|
ThMMOT4 Medical Signal Analysis and Recognition (311A, 3rd Floor)
Oral Session
|
14:00-14:20, Paper ThMMOT4.1
Automatically Detecting Arrhythmia-Related Irregular Patterns Using the Temporal and Spectro-Temporal Textures of ECG Signals |
Abdeldayem, Sara | West Virginia Univ |
Bourlai, Thirimachos | WVU |
Keywords: Medical image and signal analysis, Classification, Signal analysis
Abstract: Arrhythmia is an abnormal heart rhythm that occurs due to the improper operation of the electrical impulses that coordinate the heartbeats. It is one of the most well-known heart conditions (alongside coronary artery disease, heart failure, etc.) and is experienced by millions of people around the world. While there are several types of arrhythmias, not all of them are dangerous or harmful. However, there are arrhythmias that can lead to death in minutes (e.g. ventricular fibrillation and ventricular tachycardia), even in young people. Thus, the detection of arrhythmia is critical for stopping and reversing its progression and for increasing longevity and life quality. While a doctor can perform different heart-monitoring tests specific to arrhythmias, the electrocardiogram (ECG) is one of the most common ones, used either independently or in combination with other tests (either only to detect, e.g. an echocardiogram, or to trigger and then detect arrhythmia, e.g. a stress test). We propose a machine learning approach that augments traditional arrhythmia detection approaches via our automatic arrhythmia classification system. It utilizes the texture of the ECG signal in both the temporal and spectro-temporal domains to detect and classify four types of heartbeats. The original ECG signal is first preprocessed, and then the R-peaks associated with heartbeat estimation are identified. Next, 1D local binary patterns (LBP) are utilized in the temporal domain, while 2D LBPs and texture-based features extracted by a gray-level co-occurrence matrix (GLCM) are utilized in the spectro-temporal domain using the short-time Fourier transform (STFT) and Morse wavelets. Finally, different classifiers, as well as different ECG lead configurations, are examined before we determine our proposed time-frequency SVM model, which obtains a maximum accuracy of 99.81%, sensitivity of 98.17%, and specificity of 99.98% under 10-fold cross-validation on the MIT-BIH database. Our approach yields competitive accuracy when compared to other methods discussed in the literature.
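The 1D LBP feature used in the temporal domain is easy to reproduce. The numpy sketch below computes a histogram of local binary codes over an ECG segment; the neighbourhood radius is chosen arbitrarily here rather than taken from the paper.

```python
import numpy as np

def lbp_1d_histogram(signal, radius=4):
    """Histogram of 1D local binary pattern codes: each sample is compared
    with its `radius` neighbours on both sides, giving a 2*radius-bit code."""
    n = len(signal)
    center = signal[radius:n - radius]
    codes = np.zeros(n - 2 * radius, dtype=np.int64)
    for bit, off in enumerate(range(-radius, radius + 1)):
        if off == 0:
            continue                                  # the centre has no bit
        neighbor = signal[radius + off:n - radius + off]
        shift = bit if off < 0 else bit - 1           # skip the centre's slot
        codes |= (neighbor >= center).astype(np.int64) << shift
    return np.bincount(codes, minlength=2 ** (2 * radius))
```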
|
|
14:20-14:40, Paper ThMMOT4.2
Impact of Lossy Data Compression Techniques on EEG-Based Pattern Recognition Systems |
Nguyen, Binh | Univ. of Canberra |
Ma, Wanli | Univ. of Canberra |
Tran, Dat | Univ. of Canberra |
Keywords: Brain-computer interface, Applications of pattern recognition and machine learning, Pattern recognition for human computer interaction
Abstract: Electroencephalogram (EEG) data compression has been used to reduce storage space and speed up data circulation. Although lossy compression techniques achieve a much higher compression ratio than lossless ones, they introduce a loss of information in the reconstructed data, which may affect the performance of EEG-based pattern recognition systems. In this paper, we investigate the impact of lossy compression techniques on the performance of EEG-based pattern recognition systems, including seizure recognition and person recognition. Our experiments are conducted on two public databases using two different EEG lossy compression techniques. Experimental results show that the recognition performance is not significantly reduced when using lossy techniques at high compression ratios.
|
|
14:40-15:00, Paper ThMMOT4.3
SlideNet: Fast and Accurate Slide Quality Assessment Based on Deep Neural Networks |
Zhang, Teng | The Univ. of Queensland |
Carvajal, Johanna | The Univ. of Queensland |
Smith, Daniel F. | The Univ. of Queensland |
Zhao, Kun | The Univ. of Queensland |
Wiliem, Arnold | The Univ. of Queensland |
Hobson, Peter | Sullivan Nicolaides Pathology |
Jennings, Anthony | Sullivan Nicolaides Pathology |
Lovell, Brian Carrington | The Univ. of Queensland |
Keywords: Computer-aided detection and diagnosis, Medical image and signal analysis, Neural networks
Abstract: This work tackles the automatic fine-grained slide quality assessment problem for digitized direct smear tests using the Gram staining protocol. Automatic quality assessment can provide useful information for pathologists and the whole digital pathology workflow. For instance, if the system finds a slide to have low staining quality, it could send a request to the automatic slide preparation system to remake the slide. If the system detects severe damage in a slide, it could notify the experts that manual microscope reading may be required. In order to address the quality assessment problem, we propose a deep neural network based framework to automatically assess the slide quality in a semantic way. Specifically, the first step of our framework is to perform dense fine-grained region classification on the whole slide and calculate the region histogram of the label distributions. Next, our framework generates assessments of the slide quality from various perspectives: staining quality, information density, damage level, and which regions are more valuable for subsequent high-magnification analysis. To make the information more accessible, we present our results in the form of a heat map and text summaries. Additionally, in order to stimulate research in this direction, we propose a novel dataset for slide quality assessment. Experiments show that the proposed framework outperforms recent related works.
|
|
15:00-15:20, Paper ThMMOT4.4
Multi-Task Multiple Kernel Machines for Personalized Pain Recognition from Functional Near-Infrared Spectroscopy Brain Signals |
Lopez-Martinez, Daniel | Massachusetts Inst. of Tech |
Peng, Ke | Harvard Medical School |
Steele, Sarah | The Univ. of Tennessee Health and Science Center |
Lee, Arielle | Boston Children's Hospital
Borsook, David | Harvard Univ |
Picard, Rosalind | MIT Affective Computing |
Keywords: Computer-aided detection and diagnosis, Brain and cognitive engineering, Medical image and signal analysis
Abstract: Currently there is no validated objective measure of pain. Recent neuroimaging studies have explored the feasibility of using functional near-infrared spectroscopy (fNIRS) to measure alterations in brain function in evoked and ongoing pain. In this study, we applied multi-task machine learning methods to derive a practical algorithm for pain detection from fNIRS signals in healthy volunteers exposed to a painful stimulus. In particular, we employed multi-task multiple kernel learning to account for the inter-subject variability in pain response. Our results support the use of fNIRS and machine learning techniques in developing objective pain detection, and also highlight the importance of adopting personalized analysis in the process.
|
|
Poster Session ThPMP, Coffee Break (North Foyer & Park View Foyer, 3rd Floor)
Poster Session
|
15:20-17:20, Paper ThPMP.1
3D Human Pose Estimation from Deep Multi-View 2D Pose |
Schwarcz, Steven | Univ. of Maryland, Coll. Park |
Pollard, Thomas | Systems & Tech. Res |
Keywords: Motion and tracking, 3D vision, Probabilistic graphical model
Abstract: Human pose estimation - the process of recognizing a human's limb positions and orientations in a video - has many important applications including surveillance, diagnosis of movement disorders, and computer animation. While deep learning has led to great advances in 2D and 3D pose estimation from single video sources, the problem of estimating 3D human pose from multiple video sensors with overlapping fields of view has received less attention. When the application allows the use of multiple cameras, 3D human pose estimates may be greatly improved through fusion of multi-view pose estimates and observation of limbs that are fully or partially occluded in some views. Past approaches to multi-view 3D pose estimation have used probabilistic graphical models to reason over constraints, including per-image pose estimates, temporal smoothness, and limb length. In this paper, we present a pipeline for multi-view 3D pose estimation of multiple individuals which combines a state-of-the-art 2D pose detector with a factor graph of 3D limb constraints optimized with belief propagation. We evaluate our results on the TUM-Campus and Shelf datasets for multi-person 3D pose estimation and show that our system significantly outperforms the previous state-of-the-art with a simpler model of limb dependency.
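The multi-view fusion core of such pipelines is linear (DLT) triangulation of each joint from the per-view 2D detections, sketched below in numpy; the paper's factor-graph refinement over limb constraints is not shown.

```python
import numpy as np

def triangulate_joint(projections, points_2d):
    """Linear (DLT) triangulation of one joint: every calibrated view
    contributes two rows to a homogeneous system built from its 3x4
    projection matrix P and the 2D detection (u, v)."""
    A = []
    for P, (u, v) in zip(projections, points_2d):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]          # inhomogeneous 3D joint position
```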
|
|
15:20-17:20, Paper ThPMP.2
Enhancing Pix2Pix for Remote Sensing Image Classification |
Wang, Xiaoye | China Univ. of Geosciences, Beijing |
Yan, Hongping | China Univ. of Geosciences, Beijing |
Huo, Chunlei | Inst. of Automation, CAS |
Yu, Jiayuan | Beijing Univ |
Pan, Chunhong | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Image classification, Deep learning, Neural networks
Abstract: Remote sensing image classification is challenging due to the low separation between different classes and the difficulty of learning discriminative features. GAN (Generative Adversarial Network) models are promising for this task thanks to the generator's ability to reproduce samples and the discriminator's role in improving the generator. Among GAN variants for image translation and image classification tasks, Pix2Pix performs best. However, Pix2Pix is limited in explicitly capturing the relationship between the source domain and the samples reconstructed from the target domain. To address this problem, an improved Pix2Pix is proposed in this paper, in which a controller is added to Pix2Pix to improve classification performance and enhance training stability. Experiments demonstrate the effectiveness and advantages of the proposed approach.
|
|
15:20-17:20, Paper ThPMP.3
Global Context Encoding for Salient Objects Detection |
Wang, Jingbo | Peking Univ |
Xing, Yajie | Peking Univ |
Zeng, Gang | Peking Univ |
Keywords: Low-level vision, Deep learning, Learning-based vision
Abstract: Deep convolutional neural networks (CNNs) have gained a reputation for success in various computer vision tasks, including salient object detection. However, it remains a challenge that CNNs contain repeated downsampling operators and thus produce low-resolution predictions, which tend to lose the details and finer structures of images. To detect and segment salient objects well, it is necessary to merge high-level semantic information and low-level fine details simultaneously. Thus, we propose a novel network structure with stage-wise refinement sub-structures. In addition, we exploit the essence of salient object detection by encoding the global image context in a specifically designed module, which is applied to every stage of the refinement structure. The coarse saliency map generated by the base CNN can therefore be refined with low-level features and global context information step by step. Experimental results have demonstrated that the proposed method outperforms state-of-the-art approaches on four benchmark datasets.
|
|
15:20-17:20, Paper ThPMP.4
3D Geometry-Aware Semantic Labeling of Outdoor Street Scenes |
Zhong, Yiran | Australian National Univ |
Dai, Yuchao | Northwestern Pol. Univ |
Li, Hongdong | Australian National Univ |
Keywords: 3D vision, Segmentation, features and descriptors, Deep learning
Abstract: This paper is concerned with the problem of how to better exploit 3D geometric information for dense semantic image labeling. Existing methods often treat the available 3D geometry information (e.g., a 3D depth-map) simply as an additional image channel besides the R-G-B color channels, and apply the same technique as for RGB image labeling. In this paper, we demonstrate that directly performing 3D convolution in the framework of a residual connected 3D voxel top-down modulation network can lead to superior results. Specifically, we propose a 3D semantic labeling method to label outdoor street scenes whenever a dense depth map is available. Experiments on the "Synthia" and "Cityscape" datasets show our method outperforms the state-of-the-art methods, suggesting such a simple 3D representation is effective in incorporating 3D geometric information.
|
|
15:20-17:20, Paper ThPMP.5
Real-Time Vehicle Localization and Tracking Using Monocular Panomorph Panoramic Vision |
Belbachir, Ahmed Nabil | Teknova AS |
Svendsen, Lisa Maria | Teknova AS |
Akdemir, Benyamin | Teknova AS |
Keywords: Motion and tracking, Video analysis, Vision sensors
Abstract: This paper presents a feasibility analysis of ORB-SLAM [1] for real-time vehicle localization and tracking using a monocular camera providing 360˚ panoramic views. The method described in [1] was initially designed and developed for conventional cameras, making use of a method for detecting and tracking visual features and estimating the camera trajectory while reconstructing the environment. The accuracy of the tracking depends on the ability of this method to robustly detect and match sufficient visual features. This work aims to extend the method to large monocular round views from fish-eye cameras, allowing an increase in visual features with the aim of improving localization robustness. The main challenge in using a fish-eye camera for generating panoramic views is the reduction of visual performance due to potentially higher distortion and lower spatial resolution compared to a standard camera lens. The objective of this research is to perform a feasibility analysis of a method combining a camera equipped with a panomorph lens, to generate real-time panoramic views at minimal distortion, with ORB-SLAM, to robustly detect and track visual features for real-time camera localization and tracking. A quantitative evaluation is performed on a vehicle driving in an outdoor natural scene with the panomorph camera mounted on-board and without any additional sensors. The results are included, together with an analysis and a concluding summary.
|
|
15:20-17:20, Paper ThPMP.6
An Automated Classification Framework for Pressure Ulcer Tissues Based on 3D Convolutional Neural Network
Elmogy, Mohammed | Faculty of Computers and Information, Mansoura Univ |
Garcia-Zapirain, Begona | Facultad Ingenieria, Univ. De Deusto, Avda/Univ. 2 |
Elmaghraby, Adel | Univ. of Louisville |
El-Baz, Ayman | Univ. of Louisville |
Keywords: Image classification, Medical image and signal analysis, Deep learning
Abstract: A pressure ulcer (PU) is a clinical pathology of localized deterioration of the underlying tissues as well as the skin, generated by friction and pressure. A trustworthy diagnosis of PU, supported by accurate assessment, is critical for effective therapy and for saving the patient's life. In this paper, we propose an automatic classification framework to segment and classify the various tissues to help in the diagnosis and treatment of PU. The proposed framework consists of two main stages: a region of interest (ROI) extraction stage and a tissue segmentation stage. The main idea is to extract various models and features from PU RGB images and supply them to a multi-path 3D convolutional neural network (CNN) to segment slough, necrotic eschar, and granulation tissues, helping to assess the status of the PU. The ROI is extracted by supplying three different color models to the CNN: RGB, HSV, and YCbCr. Then, the PU tissues are classified by providing four different models to the 3D CNN. These models are the original RGB image, the image smoothed with a pre-selected Gaussian kernel, and the 1st-order models of prior and current visual appearance. The framework was trained and tested on 100 color RGB PU images. The classification accuracy was evaluated using the area under the curve (AUC), the percentage area distance (PAD), and the Dice similarity coefficient (DSC). The obtained preliminary results show an AUC of 96%, a PAD of 10%, and a DSC of 93%. These experimental results are promising and can lead to an accurate assessment of PU status.
|
|
15:20-17:20, Paper ThPMP.7
Scale and Orientation Aware EPI-Patch Learning for Light Field Depth Estimation |
Zhou, Wenhui | Hangzhou Dianzi Univ |
Liang, Linkai | Hangzhou Dianzi Univ |
Lin, Lili | Zhejiang Gongshang Univ |
Lumsdaine, Andrew | Pacific Northwest Lab |
Zhang, Hua | Hangzhou Dianzi Univ |
Keywords: Computational photography, Deep learning
Abstract: Epipolar Plane Image (EPI) implies some important depth cues for light field depth estimation. Intuitively, the EPI patches with different spatial scales and orientations may exhibit different features and result in different estimation precision. In this paper, we discuss this issue and present a scale and orientation aware EPI-Patch learning model for depth estimation. We take the multi-orientation EPI patches of each pixel as input, and design two types of network structures for adaptive scale selection and orientation fusion. One type is a scale-aware structure, which feeds one orientation patch into a multi-layer feed-forward network with long and short skip connections. The other type is a shared-weight network for fusing the multi-orientation features. We demonstrate the effectiveness of our model by experiments on 4D Light Field Benchmark.
|
|
15:20-17:20, Paper ThPMP.8
An Image Rain Removal Algorithm Based on the Depth of Field and Sparse Coding |
Lei, Junfeng | School of Electronic Information, Wuhan Univ |
Zhang, Shangyue | Wuhan Univ |
Zou, Wentao | Wuhan Univ |
Xiao, Jinsheng | Wuhan Univ |
Chen, Yunhua | Guangdong Univ. of Tech |
Sui, HaiGang | Wuhan Univ |
Keywords: Computational photography, Enhancement, restoration and filtering
Abstract: Rainy weather can seriously deteriorate the image quality of outdoor monitoring systems. Since decomposition-based methods do not need to impose any restrictions on the types of rain, they have wider application in removing rain streaks. However, they still have the problems of rain residues in the low-frequency component, and of mismatching the background and rain streaks that share the same gradient in the high-frequency component. To address these problems, we propose an image rain removal algorithm based on the depth of field and sparse coding. The algorithm includes four steps: image decomposition, dictionary learning, atomic clustering based on Principal Component Analysis and Support Vector Machine, and image revising based on the depth-of-field saliency map. First, the image is decomposed using a combination of bilateral filtering and the short-time Fourier transform, so that the contours in the low-frequency part of the image can be better preserved. The depth-of-field saliency map of the image is utilized to eliminate the rain residues in the low-frequency component, and also to solve the problem of mismatching the background and rain streaks with the same gradient in the high-frequency component. The experimental results demonstrate that the proposed algorithm performs better, both in removing rain and in preserving the detailed information of the image, than current methods.
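The first step, decomposition via bilateral filtering, can be illustrated with OpenCV in a few lines; the filter parameters below are generic defaults, not the paper's tuned values, and the short-time Fourier transform part of the decomposition is omitted.

```python
import cv2
import numpy as np

def split_frequency(img, d=9, sigma_color=75, sigma_space=75):
    """Decompose an 8-bit image into a low-frequency layer (bilateral filter
    output) and a high-frequency residual, where rain streaks mostly live."""
    low = cv2.bilateralFilter(img, d, sigma_color, sigma_space)
    high = img.astype(np.float32) - low.astype(np.float32)
    return low, high
```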
|
|
15:20-17:20, Paper ThPMP.9
An Efficient Line Segment Matching Algorithm for 3D Face Recognition |
Rong, Shenghui | Xidian Univ |
Gao, Yongsheng | Griffith Univ |
Yu, Xun | Griffith Univ |
Zhou, Huixin | Xidian Univ |
Zhou, Jun | Griffith Univ |
Keywords: Object recognition, 3D vision
Abstract: Much progress has been made in developing new techniques for 3D face recognition, achieving a very high level of accuracy. However, their high computational costs make them difficult to use in real-world applications due to the relatively large size of 3D data. Though line segments on 3D surfaces have been used as an effective tool for 3D face recognition, developing an effective matching algorithm between two line segment sets remains an unsolved problem. In this paper, we propose a coarse-to-fine 3D line segment matching approach for 3D face recognition. A coarse matching step is first utilized to detect the most likely curve pairs between two faces. Then a fine matching step is adopted to calculate the similarity between two paired curves to recognize faces. Experimental results show that the proposed method can recognize a face at a lower computational cost compared with previous research.
|
|
15:20-17:20, Paper ThPMP.10
Large Margin Structured Convolution Operator for Thermal Infrared Object Tracking |
Gao, Peng | Harbin Inst. of Tech. Shenzhen |
Ma, Yipeng | Harbin Inst. of Tech. Shenzhen |
Song, Ke | Harbin Inst. of Tech. Shenzhen |
Li, Chao | Harbin Inst. of Tech. Shenzhen |
Wang, Fei | Harbin Inst. of Tech. (Shenzhen) |
Xiao, Liyi | Harbin Inst. of Tech. Shenzhen |
Keywords: Motion and tracking, Object detection, Structured prediction
Abstract: Compared with visible object tracking, thermal infrared (TIR) object tracking can track an arbitrary target in total darkness, since it is not influenced by illumination variations. However, there are many unwanted attributes that constrain the potential of TIR tracking, such as the absence of visual color patterns and low resolution. Recently, the structured output support vector machine (SOSVM) and the discriminative correlation filter (DCF) have each been successfully applied to visible object tracking. Motivated by these, in this paper we propose a large margin structured convolution operator (LMSCO) to achieve efficient TIR object tracking. To improve tracking performance, we employ spatial regularization and implicit interpolation to obtain continuous deep feature maps, including deep appearance features and deep motion features, of the TIR targets. Finally, a collaborative optimization strategy is exploited to efficiently update the operators. Our approach not only inherits the strong discriminative capability of SOSVM but also achieves accurate and robust tracking with higher-dimensional features and denser samples. To the best of our knowledge, we are the first to combine the advantages of DCF and SOSVM for TIR object tracking. Comprehensive evaluations on two thermal infrared tracking benchmarks, i.e. VOT-TIR2015 and VOT-TIR2016, clearly demonstrate that our LMSCO tracker achieves impressive results and outperforms most state-of-the-art trackers in terms of accuracy and robustness at a sufficient frame rate.
|
|
15:20-17:20, Paper ThPMP.11 | |
Cross Modal Multiscale Fusion Net for Real-Time RGB-D Detection |
Yin, Kejie | Zhejiang Univ. of Tech |
Liu, Sheng | Zhejiang Univ. of Tech |
Liu, Ruyu | Zhejiang Univ. of Tech |
Chen, Yibin | Zhejiang Univ. of Tech |
Shen, Kang | Zhejiang Univ. of Tech |
Keywords: Object detection, Neural networks, Transfer learning
Abstract: This paper presents a novel multi-modal CNN architecture for object detection that exploits complementary input cues in addition to color information alone. Our one-stage architecture fuses multiscale mid-level features from two individual feature extractors, so that our end-to-end network can accept cross-modal streams and obtain high-precision detection results. In comparison to other cross-modal fusion neural networks, our solution successfully reduces runtime to meet the real-time requirement while retaining high accuracy. Experimental evaluation on the challenging NYUD2 dataset shows that our network achieves 49.1% mAP and processes images in real time at 35.3 frames per second on a single Nvidia GTX1080 GPU. Compared with the baseline one-stage SSD network on RGB images, which achieves 39.2% mAP, our method delivers a substantial accuracy improvement.
|
|
15:20-17:20, Paper ThPMP.12 | |
Pentuplet Loss for Simultaneous Shots and Critical Points Detection in a Video |
Gupta, Nitin | IBM Res |
Jain, Abhinav | IBM INDIAN Res. LABS |
Agarwal, Prerna | IBM Res |
Mujumdar, Shashank | IBM Res. India |
Mehta, Sameep | IBM Res |
Keywords: Video analysis, Deep learning, Deep learning for multimedia analysis
Abstract: Critical events in videos amount to the set of frames where the user attention is heightened. Such events are usually fine-grained activities and do not necessarily have defined shot boundaries. Traditional approaches to the task of Shot Boundary Detection (SBD) in videos perform frame-level classification to obtain shot boundaries and fail to identify the critical shots in the video. We model the problem of identifying critical frames and shot boundaries in a video as learning an image frame similarity metric where the distance relationships between different types of video frames are modeled. We propose a novel pentuplet loss to learn the frame image similarity metric through a pentuplet based deep learning framework. We showcase the results of our proposed framework on soccer highlight videos against state-of-the-art baselines and significantly outperform them for the task of shot boundary detection. The proposed framework shows promising results for the task of critical frame detection against human annotations on soccer highlight videos.
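The abstract does not give the exact form of the pentuplet loss; as a hedged illustration, the following PyTorch sketch generalises the familiar triplet loss to an anchor, one positive, and three negatives drawn from different frame types (an assumption, not the paper's formula):

```python
import torch
import torch.nn.functional as F

def pentuplet_loss(anchor, positive, neg1, neg2, neg3, margin=1.0):
    """Hypothetical margin-based pentuplet loss: pull the positive
    toward the anchor and push each of the three negatives at least
    `margin` farther away. Inputs are (batch, dim) embeddings."""
    d_pos = F.pairwise_distance(anchor, positive)
    loss = torch.zeros_like(d_pos)
    for neg in (neg1, neg2, neg3):
        loss = loss + F.relu(d_pos - F.pairwise_distance(anchor, neg) + margin)
    return loss.mean()
```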
|
|
15:20-17:20, Paper ThPMP.13 | |
Aggregated Sparse Attention for Steering Angle Prediction |
He, Sen | Univ. of Exeter |
Kangin, Dmitry | Univ. of Exeter |
Yang, Mi | Univ. of Exeter |
Pugeault, Nicolas | Univ. of Exeter |
Keywords: Cognitive and embodied vision, Perceptual organization, Applications of computer vision
Abstract: In this paper, we apply the attention mechanism to autonomous driving for steering angle prediction. We propose the first model applying the recently introduced sparse attention mechanism to the visual domain, as well as an aggregated extension of this model. We show that the proposed method improves over both no attention and other types of attention.
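The abstract does not name the specific sparse attention used; sparsemax (Martins and Astudillo, 2016) is a standard choice and is shown here as one plausible instantiation (a sketch, not the authors' implementation):

```python
import torch

def sparsemax(scores):
    """Sparsemax: Euclidean projection of the score vector onto the
    probability simplex. Unlike softmax, it returns exactly zero
    weight for irrelevant inputs, i.e. sparse attention."""
    z, _ = torch.sort(scores, descending=True, dim=-1)
    k = torch.arange(1, scores.size(-1) + 1,
                     device=scores.device, dtype=scores.dtype)
    cssv = z.cumsum(dim=-1) - 1                   # cumulative sums minus 1
    support = (z - cssv / k > 0).sum(dim=-1, keepdim=True).clamp(min=1)
    tau = cssv.gather(-1, support - 1) / support.to(scores.dtype)
    return torch.clamp(scores - tau, min=0)

print(sparsemax(torch.tensor([2.0, 1.0, -1.0])))  # -> tensor([1., 0., 0.])
```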
|
|
15:20-17:20, Paper ThPMP.14 | |
Multi-Layer CNN Features Aggregation for Real-Time Visual Tracking |
Zhang, Lijia | Beijing Inst. of Tech |
Dong, Yanmei | BIT |
Wu, Yuwei | Beijing Inst. of Tech. School Ofcomputerscience, Media |
Keywords: Motion and tracking, Deep learning, Neural networks
Abstract: In this paper, we propose a novel convolutional neural network (CNN) based tracking framework that aggregates multiple CNN features from different layers into a robust representation and realizes real-time tracking. We find that some feature maps interfere with effective object representation. Instead of using the original features, we build an end-to-end feature aggregation network (FAN) that suppresses the noisy feature maps of CNN layers, and thus significantly benefits object representation with both coarse semantic information and fine details. The FAN is a lightweight network with few parameters and runs in real time. The highlighted region of the feature maps obtained by the FAN is used as the tracking result. Our method performs at a real-time speed of 24 fps while maintaining promising accuracy compared with state-of-the-art methods on existing tracking benchmarks.
|
|
15:20-17:20, Paper ThPMP.15 | |
Localization Based on Semantic Map and Visual Inertial Odometry |
Jin, Jie | Univ. of Chinese Acad. of Sciences |
Zhu, Xiaoyang | Inst. of Automation, Chinese Acad. of Sciences |
Jiang, Yongshi | Inst. of Automation, Chinese Acad. of Sciences |
Du, Zhiying | Momenta |
Keywords: Applications of computer vision, Applications of pattern recognition and machine learning, Vision for robotics
Abstract: Autonomous vehicles require precise localization for safe control. This paper presents a localization approach based on a semantic map and visual inertial odometry for autonomous vehicles. Our approach uses consumer-grade parts and relies only on a single front camera, a consumer-grade IMU, and a GPS. Using real-time semantic landmark detection and real-time visual inertial odometry, we localize the full 6-DOF pose of the vehicle in the semantic map with a mean absolute error of less than 20 cm. With this accuracy, we can achieve high levels of autonomy and speed up the evolution of autonomous driving. The main contributions of our approach are: (i) 2D-3D semantic landmark matching in continuous frames; (ii) full 6-DOF pose optimization with semantic constraints in a sliding time window.
|
|
15:20-17:20, Paper ThPMP.16 | |
Weakly and Semi-Supervised Faster RCNN with Curriculum Learning |
Wang, Jiasi | Huazhong Univ. of Science and Tech |
Wang, Xinggang | Huazhong Univ. of Science and Tech |
Liu, Wenyu | Huazhong Univ. of Science and Tech |
Keywords: Object detection, Deep learning, Semi-supervised learning
Abstract: Object detection is a core problem in computer vision and pattern recognition. In this paper, we study the problem of learning an effective object detector using weakly-annotated images (i.e., only image-level annotation is given) and a small proportion of fully-annotated images (i.e., bounding-box-level annotation is given) with curriculum learning. Our method is built upon Faster RCNN. Unlike previous weakly-supervised object detectors, which rely on hand-crafted object proposals, the proposed method learns a region proposal network using weakly- and semi-supervised training data. The weakly-labeled images are fed into the deep network in a meaningful order, progressing from easy to gradually more complex examples, following curriculum learning. We name the Faster RCNN trained using Weakly- And Semi-Supervised data using Curriculum Learning as WASSCL RCNN. The WASSCL RCNN is validated on the PASCAL VOC 2007 benchmark, and obtains 90% of a fully-supervised Faster RCNN's performance (measured using mAP) with only 15% of fully-supervised annotations together with weak supervision. The results show that the proposed learning framework can significantly reduce the labeling effort needed to obtain reliable object detectors.
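A minimal sketch of the curriculum ordering (an illustration only: the difficulty score here is assumed given, whereas the paper derives the easy-to-hard ordering from the model itself):

```python
import numpy as np

def curriculum_stages(sample_ids, difficulty, n_stages=3):
    """Sort weakly-labeled samples from easy to hard and split them
    into training stages that are consumed in order."""
    order = np.argsort(difficulty)            # ascending difficulty
    return np.array_split(np.asarray(sample_ids)[order], n_stages)

stages = curriculum_stages(["a", "b", "c", "d"], [0.9, 0.1, 0.5, 0.3])
# -> [['b', 'd'], ['c'], ['a']]: easiest samples are trained on first
```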
|
|
15:20-17:20, Paper ThPMP.17 | |
Deep Context Networks for Image Annotation |
Jiu, Mingyuan | Zhengzhou Univ |
Sahbi, Hichem | CNRS LIP6, UPMC Sorbonne Univ |
Qi, Lin | School of Information Engineering, Zhengzhou Univ |
Keywords: Image classification, Support vector machine and kernel methods, Deep learning
Abstract: Context plays an important role in visual pattern recognition as it provides complementary clues for different learning tasks including image classification and annotation. In the particular scenario of kernel learning, the general recipe of context-based kernel design consists in learning positive semi-definite similarity functions that return high values not only when data share similar content but also similar context. However, in spite of having a positive impact on performance, the use of context in these kernel design methods has not been fully explored; indeed, context has been handcrafted instead of being learned. In this paper, we introduce a novel context-aware kernel design framework based on deep learning. Our method discriminatively learns spatial geometric context as the weights of a deep network (DN). The architecture of this network is fully determined by the solution of an objective function that mixes content, context and regularization, while the parameters of this network determine the most relevant (discriminant) parts of the learned context. We apply this context and kernel learning framework to image classification using the challenging ImageCLEF Photo Annotation benchmark; the latter shows that our deep context learning provides highly effective kernels for image classification as corroborated through extensive experiments.
|
|
15:20-17:20, Paper ThPMP.18 | |
Non-Iterative Multiple Data Registration Method Based on the Motion Screw Theory and Trackable Features |
Gu, Feifei | Shenzhen Inst. of Advanced Tech. Chinese Acad. of S |
Keywords: 3D reconstruction, Image based modeling, Multiple view geometry
Abstract: Registration of 3D point clouds is an important issue in the field of 3D reconstruction. In this work, we propose a non-iterative registration method based on the motion screw theory and trackable features. Screw theory is derived from rigid body mechanics and holds that the motion of a rigid body can be regarded as a kind of spiral motion, effectively represented by an angular velocity vector and a linear velocity vector. To the best of our knowledge, it has not previously been utilized in 3D data registration. In this paper, 3D data registration based on the motion screw theory is introduced in detail, and a search strategy based on trackable features in image sequences is presented to improve the accuracy of 3D registration. The proposed method has been successfully tested on real multi-view data. Experimental results show that it simplifies the computational process, accelerates registration, and achieves higher precision than other methods.
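The screw representation referred to above has a standard closed form: the exponential map takes a unit rotation axis, a linear velocity vector and a magnitude to a 4x4 rigid transform. A textbook NumPy sketch (not the paper's code):

```python
import numpy as np

def hat(w):
    """Skew-symmetric cross-product matrix [w]_x of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def screw_to_transform(omega, v, theta):
    """Exponential map of a twist (unit axis `omega`, linear velocity
    `v`, magnitude `theta`) to a homogeneous rigid transform, the
    classical screw-theory formula from rigid body mechanics."""
    W = hat(np.asarray(omega, float))
    R = np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)
    G = np.eye(3) * theta + (1 - np.cos(theta)) * W \
        + (theta - np.sin(theta)) * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, G @ np.asarray(v, float)
    return T
```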
|
|
15:20-17:20, Paper ThPMP.19 | |
Multiple Mice Tracking: Occlusions Disentanglement Using a Gaussian Mixture Model |
Sadafi, Ario | Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano |
Katsageorgiou, Vasiliki-Maria | Istituto Italiano Di Tecnologia |
Huang, Huiping | Istituto Italiano Di Tecnologia |
Papaleo, Francesco | Istituto Italiano Di Tecnologia (IIT) |
Murino, Vittorio | Istituto Italiano Di Tecnologia |
Sona, Diego | Istituto Italiano Di Tecnologia (IIT) |
Keywords: Motion and tracking, Occlusion and shadow detection, Video analysis
Abstract: Mouse models play an important role in preclinical research and drug discovery for human diseases. The fact that mice are a social species engaging in a high degree of social interaction facilitates the study of diseases characterized by social alterations. Hence, robust animal tracking is of great importance for building tools capable of automatically analyzing the social behavioral interactions of multiple mice. However, the presence of occlusions is a major problem in multiple mice tracking. To deal with this problem, we present a tracking algorithm based on Kalman filtering and Gaussian mixture modeling. Specifically, Kalman tracking is used to track the mice, and when occlusions happen, we fit 2D Gaussian distributions to separate the mouse blobs. This helps prevent identity swaps between mice, which is important for accurate behavior analysis. As the results of our experiments show, the proposed algorithm produces far fewer identity swaps than other state-of-the-art algorithms.
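A minimal sketch of the occlusion-splitting step, assuming scikit-learn and a binary foreground mask (the coupling with per-mouse Kalman filters is omitted):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_occluded_blob(blob_mask, n_mice=2):
    """Fit one 2D Gaussian per animal to the pixel coordinates of a
    merged foreground blob and assign each pixel to a component,
    giving one centroid per mouse."""
    ys, xs = np.nonzero(blob_mask)
    pts = np.column_stack([xs, ys]).astype(float)
    gmm = GaussianMixture(n_components=n_mice, covariance_type="full",
                          random_state=0).fit(pts)
    labels = gmm.predict(pts)         # per-pixel mouse assignment
    return pts, labels, gmm.means_    # means_ are per-mouse centroids
```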
|
|
15:20-17:20, Paper ThPMP.20 | |
Weather Recognition Based on Edge Deterioration and Convolutional Neural Networks |
Shi, Yuzhou | Shanghai Jiao Tong Univ |
Li, Yuanxiang | Shanghai Jiao Tong Univ |
Liu, Jiawei | Shanghai Jiao Tong Univ. of Aeronautics and Astronau |
Liu, Xingang | AVIC Leihua Electronic Tech. Res. Inst |
Murphey, Yi | Univ. of Michigan-Dearborn |
Keywords: Object recognition, Classification, Image classification
Abstract: Weather recognition is of great significance in traffic safety, the environment, and meteorology. However, the visual image features of weather are highly abstract, and traditional weather recognition methods have high computational complexity and low accuracy. In this paper, the edge deterioration phenomenon is introduced into a convolutional neural network (CNN) to address the inability of common CNNs to distinguish specific weather conditions. The proposed method uses Mask R-CNN to extract the regions of interest, including the foreground and foreground edges, superimposes them into a same-scale matrix, and then inputs them into the network for classification. Experiments on outdoor traffic images show that this method can effectively improve the classification accuracy for the four weather conditions considered (sunny, foggy, rainy, and snowy).
|
|
15:20-17:20, Paper ThPMP.21 | |
Teaching Squeeze-And-Excitation PyramidNet for Imbalanced Image Classification with GAN-Based Curriculum Learning |
Liu, Jing | Ocean Univ. of China |
Du, Angang | Ocean Univ. of China |
Wang, Chao | Ocean Univ. of China |
Zheng, Haiyong | Ocean Univ. of China |
Wang, Nan | Ocean Univ. of China |
Zheng, Bing | Ocean Univ. of China |
Keywords: Image classification, Applications of pattern recognition and machine learning
Abstract: Image classification on datasets with highly imbalanced class distributions is a challenging task in the computer vision field. In many real-world problems, datasets are typically imbalanced, which has a serious impact on classifier performance. Although deep convolutional neural networks (DCNNs) have shown remarkable performance on image classification tasks in recent years, there are still few effective deep learning algorithms designed specifically for imbalanced image classification. To solve this problem, in this paper we explore a new deep learning algorithm called the Squeeze-and-Excitation Deep Pyramidal Residual Network (SE-PyramidNet) combined with Generative Adversarial Network (GAN)-based curriculum learning. Firstly, we construct the refined Deep Pyramidal Residual Network by embedding “Squeeze-and-Excitation” (SE) blocks. Secondly, to address the class imbalance problem, we adopt a GAN to generate samples of minority classes. Finally, we draw on the curriculum learning strategy by training our classifier from original easy samples to generated complex samples, which improves its classification ability. Experimental results show that our method achieves gains of around 0.5% in accuracy and 0.02 in F1 score, outperforming state-of-the-art DCNNs.
|
|
15:20-17:20, Paper ThPMP.22 | |
Adaptive Albedo Compensation for Accurate Phase-Shift Coding |
Pistellato, Mara | Univ. Ca' Foscari Venezia |
Cosmo, Luca | Univ. Ca' Foscari Venezia |
Bergamasco, Filippo | Univ. Ca' Foscari Venezia |
Gasparetto, Andrea | Ca' Foscari |
Albarelli, Andrea | Univ. Ca' Foscari Di Venezia |
Keywords: 3D reconstruction, Multiple view geometry, Vision sensors
Abstract: Among structured light strategies, those based on phase shift are considered the most adaptive with respect to the features of the objects to be captured. Inter alia, the theoretical invariance to signal strength and the absence of discontinuities in intensity make phase shift an ideal candidate for dealing with complex surfaces of unknown geometry, color and texture. However, in practical scenarios, unexpected artifacts can still arise from the characteristics of real cameras. This is the case, for instance, with high-contrast areas resulting from abrupt changes in the albedo of the captured objects. In fact, the non-negligible size of pixels and the presence of blur can mix signal contributions from adjacent areas with different albedo. This, in turn, results in a bias in the phase recovery and, consequently, in an inaccurate 3D reconstruction of the surface. While this problem affects most structured light methods based on phase shift or derived techniques, little effort has been put into addressing it. With this paper we propose a model for the phase corruption and a theoretically sound correction step to compensate for the bias. The practical effectiveness of our approach is demonstrated by a complete set of experimental evaluations.
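For reference, the phase recovery that this correction sits on top of is the standard N-step decoding; a NumPy sketch (the paper's albedo compensation itself is not reproduced here):

```python
import numpy as np

def recover_phase(frames):
    """Standard N-step phase-shift decoding: given N images
    I_k = A + B*cos(phi - 2*pi*k/N), recover the wrapped phase phi
    per pixel via the arctangent of the sine/cosine projections."""
    frames = np.asarray(frames, dtype=float)
    n = frames.shape[0]
    k = np.arange(n).reshape(-1, 1, 1)
    num = (frames * np.sin(2 * np.pi * k / n)).sum(axis=0)
    den = (frames * np.cos(2 * np.pi * k / n)).sum(axis=0)
    return np.arctan2(num, den)  # wrapped phase in (-pi, pi]
```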
|
|
15:20-17:20, Paper ThPMP.23 | |
Generating Image Sequence from Description with LSTM Conditional GAN |
Ouyang, Xu | Illinois Inst. of Tech |
Zhang, Xi | Illinois Insititute of Tech |
Ma, Di | Illinois Inst. of Tech |
Agam, Gady | Illinois Inst. of Tech |
Keywords: Vision and language, Deep learning, Applications of pattern recognition and machine learning
Abstract: Generating images from word descriptions is a challenging task. Generative adversarial networks (GANs) have been shown to be able to generate realistic images of real-life objects. In this paper, we propose a new neural network architecture, LSTM Conditional Generative Adversarial Networks, to generate images of real-life objects. Our proposed model is trained on the Oxford-102 Flowers and Caltech-UCSD Birds-200-2011 datasets. We demonstrate that our proposed model produces results surpassing other state-of-the-art approaches.
|
|
15:20-17:20, Paper ThPMP.24 | |
Neighborhood-Based Recovery of Phase Unwrapping Faults |
Pistellato, Mara | Univ. Ca' Foscari Venezia |
Bergamasco, Filippo | Univ. Ca' Foscari Venezia |
Cosmo, Luca | Univ. Ca' Foscari Venezia |
Gasparetto, Andrea | Ca' Foscari |
Ressi, Dalila | Univ. Ca' Foscari Venezia |
Albarelli, Andrea | Univ. Ca' Foscari Di Venezia |
Keywords: 3D reconstruction, Multiple view geometry, Applications of computer vision
Abstract: Among several structured light approaches, phase shift is the most widely adopted in real-world 3D reconstruction devices, mainly due to its high accuracy, strong resilience to noise and straightforward implementation. However, phase shift also exhibits an inherent weakness, namely the spatial ambiguity resulting from the periodicity of the adopted sinusoidal wave. Many phase unwrapping methods have been proposed to resolve this ambiguity. One of the most promising exploits additional signals of mutually prime periods, so that each spatial point exhibits a distinct combination of phases. Unfortunately, for such a combination to be properly recognized, very high accuracy in phase recovery must be attained for each signal. In fact, even modest errors can lead to unwrapping faults, making the overall approach much less resilient to noise than plain phase shift. With this paper we introduce a feasible and effective fault recovery method that can be directly applied to multi-period phase shift. The combined pipeline offers optimal accuracy and coverage even under high-noise conditions, overcoming the drawbacks of the original method. The performance of the pipeline is established by means of an in-depth set of experimental evaluations and comparisons, on both real and synthetically generated data.
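A toy single-pixel sketch of the mutually-prime-period idea (brute force over wrap counts; the paper's neighbourhood-based fault recovery is not shown):

```python
import numpy as np

def unwrap_two_periods(phase1, phase2, p1=7, p2=9):
    """Two wrapped phases with mutually prime periods p1, p2 identify
    a unique position in [0, p1*p2): pick the candidate wrap count of
    signal 1 that best agrees with the observation of signal 2."""
    pos1 = phase1 / (2 * np.pi) * p1       # fractional position mod p1
    pos2 = phase2 / (2 * np.pi) * p2       # fractional position mod p2
    best, best_err = None, np.inf
    for k1 in range(p2):
        cand = pos1 + k1 * p1
        err = abs((cand - pos2 + p2 / 2) % p2 - p2 / 2)  # circular distance
        if err < best_err:
            best, best_err = cand, err
    return best

true = 23.4
ph1, ph2 = (true % 7) / 7 * 2 * np.pi, (true % 9) / 9 * 2 * np.pi
print(unwrap_two_periods(ph1, ph2))  # ~23.4
```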
|
|
15:20-17:20, Paper ThPMP.25 | |
CANDID: Robust Change Dynamics and Deterministic Update Policy for Dynamic Background Subtraction |
Mandal, Murari | Malaviya National Inst. of Tech. Jaipur |
Saxena, Prafulla | Malaviya National Inst. of Tech. Jaipur |
Vipparthi, Santosh | MALAVIYA NATIONAL Inst. OF Tech. JAIPUR |
Murala, Subrahmanyam | IIT Ropar |
Keywords: Motion and tracking, Video analysis, Video processing and analysis
Abstract: Background subtraction in video provides preliminary information that is essential for many computer vision applications. In this paper, we propose a sequence of approaches named CANDID to handle the change detection problem in challenging video scenarios. CANDID adaptively initializes the pixel-level distance threshold and update rate. These parameters are updated by computing the change dynamics at each location. Further, the background model is maintained by formulating a deterministic update policy. The performance of the proposed method is evaluated in various challenging scenarios such as dynamic backgrounds and extreme weather conditions. In both qualitative and quantitative measures, the proposed method outperforms existing state-of-the-art approaches.
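A heavily hedged sketch of a single per-pixel update in the spirit of CANDID (the actual initialization and deterministic update policy follow the paper; all constants here are illustrative):

```python
import numpy as np

def candid_like_step(frame, model, thresh, update_rate=0.05, lam=0.05):
    """Classify pixels against an adaptive distance threshold, adapt
    the thresholds to the observed change dynamics, and blend the
    background pixels into the model. `frame`, `model` and `thresh`
    are float arrays of the same (grayscale) shape."""
    dist = np.abs(frame.astype(float) - model)
    fg = dist > thresh                                # foreground mask
    thresh = thresh + lam * np.where(fg, dist - thresh, -1.0)
    thresh = np.clip(thresh, 10.0, 80.0)              # keep thresholds sane
    model = np.where(fg, model,
                     (1 - update_rate) * model + update_rate * frame)
    return fg, model, thresh
```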
|
|
15:20-17:20, Paper ThPMP.26 | |
Automatic Eye Gaze Estimation Using Geometric & Texture-Based Networks |
Jyoti, Shreyank | Indian Inst. of Tech. Ropar |
Dhall, Abhinav | Indian Inst. of Tech. Ropar |
Keywords: Applications of computer vision, Human behavior analysis, Affective computing
Abstract: Eye gaze estimation is an important problem in automatic human behavior understanding. This paper proposes a deep learning based method for inferring eye gaze direction. The method is based on an ensemble of networks that captures both geometric and texture information. Firstly, a Deep Neural Network (DNN) is trained using geometric features extracted from the facial landmark locations. Secondly, for the texture-based features, three Convolutional Neural Networks (CNNs) are trained, one each for the patch around the left eye, the right eye, and the combined eyes. Finally, the information from the four channels is fused by concatenation, and dense layers are trained to predict the final eye gaze. The experiments are performed on two publicly available datasets: Columbia eye gaze and TabletGaze. The extensive evaluation shows the superior performance of the proposed framework. We also evaluate the performance of the recently proposed swish activation function compared to the Rectified Linear Unit (ReLU) for eye gaze estimation.
|
|
15:20-17:20, Paper ThPMP.27 | |
A New Bag of Visual Words Encoding Method for Human Action Recognition |
Cortés, Xavier | Univ. François Rabelais De Tours |
Conte, Donatello | Univ. of Tours |
Cardot, Hubert | Univ. François Rabelais De Tours |
Keywords: Behavior recognition, Video analysis
Abstract: Human action recognition in videos is one of the key problems in computer vision. Inspired by image classification models, techniques based on bags of visual words have become one of the most effective approaches to this problem. The most usual way to engage an interest point in a bag of words is by means of the closest word found in a previously trained codebook. However, the quality of the representation decreases when interest points have different visual words at similar distances or when we map noisy interest points. The aim of this paper is to present a new encoding procedure for engaging interest points in a bag of visual words that improves the quality of the representation. The encoding that we propose tries to map only the relevant interest points detected in the scene. We show experimentally that the new encoding method significantly improves the classification rate.
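For contrast with the proposed encoding, here is the baseline hard-assignment step it improves upon, as a NumPy sketch:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Classic bag-of-visual-words encoding: assign each interest
    point descriptor to its nearest codeword and build a normalized
    word histogram. The paper's contribution replaces this step with
    an encoding that discards ambiguous or noisy points."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                 # nearest codeword per point
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)
```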
|
|
15:20-17:20, Paper ThPMP.28 | |
Multi-Scale Fusion with Context-Aware Network for Object Detection |
Wang, Hanyuan | Univ. of Electronic Science and Tech. of China |
Xu, Jie | Univ. of Electronic Science and Tech. of China |
Li, Linke | Univ. of Electronic Science and Tech. of China |
Tian, Ye | Univ. of Electronic Science and Tech. of China |
Xu, Du | Univ. of Electronic Science and Tech. of China |
Xu, Shizhong | Univ. of Electronic Science and Tech. of China |
Keywords: Object detection
Abstract: Almost all state-of-the-art object detectors employ convolutional neural networks (CNNs) to extract features. However, how to fully utilize spatial information remains a challenge. In this paper, we propose an effective framework for object detection. Our motivation is that multi-scale representation and context are extremely important for object detection. For multi-scale representation, our method combines hierarchical feature maps into a fusion map that has abundant spatial information and high-level semantics. For context, we exploit spatial information by stacking multi-region feature maps. The network is learned end-to-end by minimizing an objective function. Our network achieves competitive results on the PASCAL VOC datasets: 75.9% mAP on PASCAL VOC 2007 and 72.0% mAP on PASCAL VOC 2012. Our studies demonstrate that multi-scale representation and context can further improve the performance of object detection.
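A minimal PyTorch sketch of the multi-scale fusion idea, shown under assumptions since the abstract does not specify the exact fusion layers: resize every hierarchical map to the finest resolution and concatenate along channels.

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(maps):
    """Upsample every feature map to the spatial size of the first
    (finest) one and concatenate along the channel dimension."""
    target = maps[0].shape[-2:]
    resized = [F.interpolate(m, size=target, mode="bilinear",
                             align_corners=False) for m in maps]
    return torch.cat(resized, dim=1)

# toy usage: three pyramid levels -> one (1, 448, 64, 64) fusion map
maps = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
fused = fuse_feature_maps(maps)
```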
|
|
15:20-17:20, Paper ThPMP.29 | |
Inception Donut Convolution for Top-Down Semantic Segmentation |
Guan, He | Inst. of Automation, Chinese Acad. of Sciences |
Zhang, Zhaoxiang | Inst. of Automation, Chinese Acad. of Sciences |
Tan, Tieniu | Casia |
Keywords: Applications of computer vision, Structured prediction
Abstract: A recent trend in network architecture design confirms that the inception-block convolutional group is efficient, since it can aggregate spatial context information in lower dimensions without causing significant loss in representational capability. We believe that not only the strong correlation between adjacent cells but also multi-scale feature extraction plays a vital role in this novel module. In this paper, we extend the benefits of the block to a top-down donut convolutional network for the semantic segmentation task. Our network automatically learns rich convolution kernels to capture more structural priors. With the inception-block design, it overcomes the limitations of larger kernel sizes and adaptively captures contexts at different object scales without chain sampling. Our experiments demonstrate that the proposed inception-block donut convolutional network is orthogonal to and can further improve the performance of most off-the-shelf bottom-up methods.
|
|
15:20-17:20, Paper ThPMP.30 | |
Joint Image Restoration and Matching Based on Distance-Weighted Sparse Representation |
Shao, Yuanjie | Huazhong Univ. of Science and Tech |
Sang, Nong | Huazhong Univ. of Science and Tech |
Gao, Changxin | Huazhong Univ. of Science and Tech |
Lin, Wei | Huazhong Univ. of Science and Tech |
Keywords: Object detection, Sparse learning
Abstract: Image matching is widely used in vision-based navigation systems, most of which simply assume ideal inputs without considering real-world degradation such as image blur. In such situations, traditional matching methods first resort to image restoration and then perform image matching on the restored image. However, by treating restoration and matching separately, the accuracy of image matching is reduced by the defective output of the restoration step. In this paper, we propose a joint image restoration and matching method based on distance-weighted sparse representation (JRL-DSR), which utilizes the sparse representation prior to exploit the correlation between restoration and matching. This prior assumes that the blurry image, if correctly restored, can be well represented as a sparse linear combination of atoms from a dictionary constructed from the reference image. In order to achieve more accurate matching results to help restoration, we consider both local and sparse information and adopt distance-weighted sparse representation to obtain better representation coefficients. By iteratively restoring the input image in pursuit of the sparsest representation, our approach achieves restoration and matching simultaneously, and the two tasks benefit greatly from each other. For matching, we adopt a coarse-to-fine strategy to further improve the matching accuracy. Experiments demonstrate the effectiveness of our method compared with conventional methods.
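The distance-weighted sparse coding step can be sketched as a weighted l1 problem solved by rescaling dictionary columns (an illustration under assumptions; the iterative deblurring loop and the coarse-to-fine matching are omitted):

```python
import numpy as np
from sklearn.linear_model import Lasso

def distance_weighted_sparse_code(y, D, weights, alpha=0.01):
    """Solve min_x ||y - D x||^2 + alpha * sum_j w_j |x_j| via the
    substitution x'_j = w_j * x_j, so atoms far from the query
    (large w_j) are penalised more strongly. `D` is (dim, n_atoms),
    `y` is (dim,), `weights` is a positive (n_atoms,) array."""
    D_scaled = D / weights[np.newaxis, :]
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    lasso.fit(D_scaled, y)
    return lasso.coef_ / weights      # undo the substitution
```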
|
|
15:20-17:20, Paper ThPMP.31 | |
Spindle-Net: CNNs for Monocular Depth Inference with Dilation Kernel Method |
He, Lei | Inst. of Automation, Chinese Acad. of Sciences (CASIA) |
Yu, Miao | Zhongyuan Univ. of Tech |
Wang, Guanghui | Univ. of Kansas |
Keywords: 3D vision, Scene understanding, Super-resolution
Abstract: Learning depth from a single image is an important issue in computer vision. To solve this problem, an encoder-decoder architecture is usually employed as a powerful tool for learning the dense correspondence function. In this work, we propose a symmetrical Spindle encoder-decoder network to learn fine-grained depth. Unlike traditional convolutional neural networks, we first boost the feature maps from a low-dimensional space to a high-dimensional space, then extract the features for monocular depth learning. In order to overcome computer memory limitations, a single-image super-resolution technique is proposed to replace the boosting process by fusing local cues along edge directions. Given the super-resolved images, monocular depth learning needs more global information than most pixel-wise prediction architectures provide. To address this issue, a dilation kernel method is proposed to enlarge the receptive field in each layer. For the super-resolution task, the proposed method achieves better performance than the state-of-the-art methods. Extensive experiments on monocular depth inference demonstrate that the Spindle network achieves performance comparable with state-of-the-art algorithms on the NYU and Make3D datasets. The proposed method offers a new perspective on learning depth from a single image and shows promising generality to other pixel-wise prediction problems.
|
|
15:20-17:20, Paper ThPMP.32 | |
MDCN: Multi-Scale, Deep Inception Convolutional Neural Networks for Efficient Object Detection |
Ma, Wenchi | Univ. of Kansas |
Wu, Yuanwei | Univ. of Kansas |
Wang, Zongbo | Ainstein Inc |
Wang, Guanghui | Univ. of Kansas |
Keywords: Object detection, Scene understanding, Applications of computer vision
Abstract: Object detection in challenging situations such as scale variation, occlusion, and truncation depends not only on feature details but also on contextual information. Most previous networks overemphasize detailed feature extraction through deeper and wider networks, which may enhance the accuracy of object detection to a certain extent. However, the feature details are easily changed or washed out after passing through complicated filtering structures. To better handle these challenges, this paper proposes a novel framework, the multi-scale, deep inception convolutional neural network (MDCN), which focuses on wider and broader object regions by activating feature maps produced in the deep part of the network. Instead of placing inception modules in the shallow part of the network, multi-scale inceptions are introduced in the deep layers. The proposed framework integrates contextual information into the learning process through a single-shot network structure. It is computationally efficient and avoids the hard training problem of previous macro feature extraction networks designed for shallow layers. Extensive experiments demonstrate the effectiveness and superior performance of MDCN over state-of-the-art models.
|
|
15:20-17:20, Paper ThPMP.33 | |
Towards Good Practice for Action Recognition with Spatiotemporal 3D Convolutions |
Hara, Kensho | National Inst. of Advanced Industrial Science and Tech |
Kataoka, Hirokatsu | National Inst. of Advanced Industrial Science and Tech |
Satoh, Yutaka | National Inst. of Advanced Industrial Science and Tech |
Keywords: Video analysis, Behavior recognition, Deep learning for multimedia analysis
Abstract: The purpose of this study is to explore good practice for training convolutional neural networks (CNNs) with spatiotemporal three-dimensional (3D) kernels. Recently, 3D CNNs for action recognition have developed rapidly, and their performance levels have improved significantly. However, to date, conventional research has mainly focused on architecture and has not sufficiently explored training configurations. We conduct various experiments with different training configurations on the Kinetics, UCF-101, and HMDB-51 datasets to share knowledge of 3D CNNs with the research community. From the results of those experiments, the following conclusions can be drawn. (i) Data augmentation by spatiotemporal random cropping improved the performance levels. (ii) Data augmentation by multi-scale spatial cropping increased the accuracies in most cases, whereas multi-scale temporal cropping decreased them. (iii) A corner cropping strategy, previously shown to be a good method for two-stream 2D CNNs, resulted in lower accuracies for 3D CNNs compared with simple random cropping. (iv) Freezing the early layers of 3D CNNs improved the performance levels when fine-tuning 3D CNNs on a relatively small dataset.
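Conclusion (i) above is easy to state concretely; a minimal sketch of spatiotemporal random cropping for a (T, H, W, C) clip array:

```python
import random

def spatiotemporal_random_crop(clip, t_len=16, size=112):
    """Pick a random temporal window and a random spatial window from
    a video clip (assumes the clip is at least t_len x size x size)."""
    t, h, w, _ = clip.shape
    t0 = random.randint(0, t - t_len)
    y0 = random.randint(0, h - size)
    x0 = random.randint(0, w - size)
    return clip[t0:t0 + t_len, y0:y0 + size, x0:x0 + size]
```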
|
|
15:20-17:20, Paper ThPMP.34 | |
3D Convolutional Generative Adversarial Networks for Detecting Temporal Irregularities in Videos |
Yan, Mengjia | Nanyang Tech. Univ |
Jiang, Xudong | -Nanyang Tech. Univ |
Yuan, Junsong | State Univ. of New York at Buffalo |
Keywords: Video analysis
Abstract: In this work, we introduce a novel method for video temporal irregularity detection using the discriminative framework of 3D convolutional generative adversarial networks (3D-GANs). Temporal irregularities indicate unusual video segments. Detecting such irregularities is essential to video analysis applications like video anomaly detection and video summarization. To detect temporal irregularities in videos, we need to address two problems: 1) temporal irregularities are difficult to define, since different situations have different irregularities, and 2) irregularities are scarce in videos. Therefore, we formulate video temporal irregularity detection as fake data detection via the discriminative framework of a designed 3D-GAN. This formulation employs only regular videos during the training phase and detects irregularities according to the deviation estimated by the discriminator of the 3D-GAN. We take regular videos as real data and construct a 3D-GAN to learn the distribution of regular videos during the training phase. Since the testing data contain irregular videos, i.e. fake data whose distribution differs from that of regular videos, the trained discriminator of our networks is able to detect temporal regularities and irregularities. Experiments show that 3D-GANs outperform 2D-GANs in temporal irregularity detection, and demonstrate the effectiveness and competitive performance of our approach on anomaly detection datasets.
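The test-time rule of this formulation fits in a few lines; a PyTorch sketch, assuming a trained discriminator that maps a batch of clips to real/fake logits (a hypothetical interface, not the authors' code):

```python
import torch

def irregularity_scores(discriminator, clips, threshold=0.5):
    """Clips that the trained discriminator rates as unlikely to be
    real (i.e. unlike the regular training videos) are flagged as
    temporally irregular."""
    with torch.no_grad():
        p_real = torch.sigmoid(discriminator(clips)).squeeze(-1)
    return 1.0 - p_real, p_real < threshold  # deviation score, flags
```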
|
|
15:20-17:20, Paper ThPMP.35 | |
Detecting Heads Using Feature Refine Net and Cascaded Multi-Scale Architecture |
Peng, Dezhi | South China Univ. of Tech |
Sun, Zikai | South China Univ. of Tech |
Chen, Zirong | South China Univ. of Tech |
Cai, Zirui | South China Univ. of Tech |
Xie, Lele | School of Electronic and Information Engineering, South China Un |
Jin, Lianwen | South China Univ. of Tech |
Keywords: Object detection, Deep learning
Abstract: This paper presents a method that can accurately detect heads, especially small heads, in indoor scenes. To achieve this, we propose a novel Feature Refine Net (FRN) and a cascaded multi-scale architecture. The FRN exploits the multi-scale hierarchical features created by deep convolutional neural networks. The proposed channel weighting method enables the FRN to make selective and effective use of these features. To improve the performance of small head detection, we propose a cascaded multi-scale architecture with two detectors. One, the global detector, is responsible for detecting large objects and acquiring global distribution information. The other, the local detector, specializes in small object detection and makes use of the information provided by the global detector. Due to the lack of head detection datasets, we have collected and labeled a new large dataset named SCUT-HEAD, which includes 4405 images with 111251 heads annotated. Experiments show that our method achieves state-of-the-art performance on SCUT-HEAD.
|
|
15:20-17:20, Paper ThPMP.36 | |
Continuous Action Recognition and Segmentation in Untrimmed Videos |
Bai, Ruibin | Xi'an Jiaotong Univ |
Zhao, Qing | Canglong Island, Jiangxia District, Wuhan |
Zhou, Sanping | Xi'an Jiaotong Univ |
Li, Yubing | Xi'an Jiaotong Univ |
Zhao, Xueji | Southwest Univ |
Wang, Jinjun | Xi'an Jiaotong Univ |
Keywords: Behavior recognition, Video analysis, Deep learning
Abstract: Recognizing continuous human actions is a fundamental task in many real-world computer vision applications, including video surveillance, video retrieval, and human-computer interaction. It requires recognizing each action performed as well as its segmentation boundaries in a continuous sequence. In previous work, great progress has been reported for single-action recognition using deep convolutional networks. To further improve performance for continuous action recognition, in this paper we introduce a discriminative approach consisting of three modules. The first, a feature extraction module, uses a two-stream Convolutional Neural Network to capture appearance and short-term motion information from the raw video input. Based on the obtained features, the second, a classification module, performs spatial and temporal recognition and then fuses the two scores from the respective feature streams. In the final segmentation module, a semi-Markov Conditional Field model, capable of handling long-term action interactions, is built to partition the action sequence. As the experimental results show, our approach obtains state-of-the-art performance on public datasets including 50Salads, Breakfast, and MERL Shopping. We have also visualized the continuous action segmentation results for more insightful discussion in the paper.
|
|
15:20-17:20, Paper ThPMP.37 | |
Traffic Sign Image Synthesis with Generative Adversarial Networks |
Luo, Hengliang | Inst. of Automation, Chinese Acad. of Sciences |
Kong, Qingqun | Inst. of Automation, Chinese Acad. of Sciences |
Wu, Fuchao | Inst. of Automation, Chinese Acad. of Science |
Keywords: Image classification, Deep learning, Applications of computer vision
Abstract: Deep convolutional neural networks (CNNs) have achieved state-of-the-art results on traffic sign classification, which plays a key role in intelligent transportation systems. However, they usually require a large amount of labeled training data, which is not always available, to guarantee good performance. In this paper, we propose to synthesize traffic sign images with generative adversarial networks (GANs). Our approach takes a standard traffic sign template and a background image as input to the generative network, where the template defines which class of traffic sign to render and the background image controls the visual appearance of the synthetic images. Experiments show that our method generates more realistic traffic sign images than conventional image synthesis methods. Moreover, by adding the synthesized images when training a typical CNN for traffic sign classification, we obtain better accuracy.
|
|
15:20-17:20, Paper ThPMP.38 | |
DeepDriver: Automated System for Measuring Valence and Arousal in Car Driver Videos |
Theagarajan, Rajkumar | Univ. of California, Riverside |
Bhanu, Bir | Univ. of California |
Cruz, Alberto | CSU Bakersfield |
Keywords: Applications of computer vision
Abstract: We develop an automated system for analyzing facial expressions using valence and arousal measurements of a car driver. This information is used by Motor Trend magazine to provide car manufacturers a report on how the drivers felt at each moment on the race track. The reason for this is that drivers remember only a brief impression of the emotions they felt after test-driving a car. Our approach is data-driven and does not require any pre-processing of the drivers' faces. The motivation of this paper is to show that, with a large amount of data, deep learning networks can extract better and more robust facial features than state-of-the-art hand-crafted features. The network was trained on just the raw facial images and achieves better results than state-of-the-art methods. Our system incorporates Convolutional Neural Networks (CNNs) for detecting the face and extracting facial features, and a Long Short-Term Memory (LSTM) network for modeling the changes in CNN features over time. The system was evaluated on videos from Motor Trend magazine's Best Driver Car of the Year 2014-16 and the AFEW-VA dataset. We compared our approach with state-of-the-art methods and show that it achieves better results than seven other methods.
|
|
15:20-17:20, Paper ThPMP.39 | |
Discriminative Latent Visual Space for Zero-Shot Object Classification |
Roy, Abhinaba | IIT |
Banerjee, Biplab | IIT, Roorkee |
Murino, Vittorio | Istituto Italiano Di Tecnologia |
Keywords: Image classification, Classification, Applications of computer vision
Abstract: In this paper, we deal with the problem of zero-shot visual recognition. The standard zero-shot learning (ZSL) pipeline is based on the idea of learning a functional mapping from a visual embedding space to an auxiliary semantic space for a set of seen categories. In the testing phase, the task is to recognize a set of novel categories that are semantically linked to the already known ones. Although such a pipeline is inherently supervised, very few endeavours in the context of ZSL enforce discrimination when learning this mapping. In this work, we propose a novel encoder-decoder network to explore the possibility of learning an intermediate latent space for the visual features that is simultaneously reconstructive and discriminative. By reaching a trade-off between the joint (re)construction of the visual and semantic embedding spaces, while ensuring separability among the known classes, the proposed model generalizes better to the unknown categories. Experimental results obtained on challenging datasets, such as AwA, CUB, and ImageNet-2, establish the efficacy of such a discriminative latent space for the standard ZSL setup.
|
|
15:20-17:20, Paper ThPMP.40 | |
Face Image Illumination Processing Based on Generative Adversarial Nets |
Ma, Wei | Sun Yat-Sen Univ |
Xie, Xiaohua | Sun Yat-Sen Univ |
Yin, Chong | SUN YAT-SEN Univ |
Lai, Jian-huang | Sun Yat-Sen Univ |
Keywords: Learning-based vision, Low-level vision, Image processing and analysis
Abstract: It is a well-known fact that variations in illumination can seriously affect the performance of 2D face analysis algorithms, such as face landmarking and face recognition. Unfortunately, the illumination condition is usually uncontrolled and unpredictable in most practical applications. Numerous methods have been developed to tackle this problem, but the results are poor, especially for images with extreme lighting conditions. Furthermore, most traditional illumination processing methods are only demonstrated on grayscale images and require strict alignment of face images, limiting their real-world applicability. In this paper, we propose to reformulate the face image illumination processing problem as a style translation task with a Generative Adversarial Network (GAN). The key insight is to use the powerful mapping ability of a GAN between two domains without knowing their true distributions. In this light, we develop new multi-scale dual discriminator nets and employ multi-scale adversarial learning for visually realistic illumination processing. Drawing on insights from traditional methods, we also use reconstruction learning and add two new image quality assessment loss terms to enforce the preservation of all details other than illumination in the generated image. Experiments on the CMU Multi-PIE and FRGC datasets show that our method obtains promising illumination normalization results and preserves superior visual quality.
|
|
15:20-17:20, Paper ThPMP.41 | |
Towards Automatic Detection of Monkey Faces |
Zhang, Manning | Sun Yat-Sen Univ |
Guo, Susu | Sun Yet-Sen Univ |
Xie, Xiaohua | Sun Yat-Sen Univ |
Keywords: Object detection, Face recognition, Transfer learning
Abstract: An automated monkey face detection system confers distinct advantages for the protection of wild monkeys, sociological studies, monkey feeding and management, and so on. Monkey and human faces have similar structures but differ in important ways in appearance. Therefore, whether mainstream human face detection algorithms can be adapted to monkey face detection is still unknown. To investigate this problem, we collected a database of monkey faces (with more than 20,000 macaque faces) and conducted several experiments on it. The experiments reveal some interesting findings. Firstly, the classical Viola-Jones Adaboost algorithm does not work as well on monkey faces as it does on human faces. An in-depth study of this result is given by examining the features selected by Adaboost. In particular, lips and eyebrows are very important to human face recognition; however, the lack of these prominent features in the monkey's face causes the Viola-Jones algorithm to choose more local Haar-like features, resulting in a higher false positive rate. Secondly, Faster R-CNN works effectively for monkey face detection but requires a large number of training samples. Pre-training with human faces helps tackle the shortage of monkey faces for training. The above conclusions indicate that an automatic monkey face detector can be learnt from a human face detector, yet a model with complex features should be employed.
|
|
15:20-17:20, Paper ThPMP.42 | |
Visual Tracking with Breeding Fireflies Using Brightness from Background-Foreground Information |
Kate, Pranay | Indian Inst. of Tech. Guwahati |
Francis, Mathew | Indian Inst. of Tech. Guwahati |
Guha, Prithwijit | Department of EEE, IIT Guwahati |
Keywords: Motion and tracking
Abstract: Visual target tracking involves object localization in image sequences. This is achieved by optimizing image feature similarity based objective functions in the object state space. Metaheuristic algorithms have shown promising results in solving hard optimization problems where gradients are not available, which motivates us to use the firefly algorithm for visual object tracking. The object state is represented by its bounding box parameters and the target is modeled by its color distribution. This work makes two significant contributions. First, we propose a hybrid firefly algorithm in which genetic operations are performed using a Real-coded Genetic Algorithm (RGA); here, the crossover operation is modified by incorporating parent velocity information. Second, the firefly brightness is computed from both foreground and background information (as opposed to foreground only), which helps in handling scale implosion and explosion problems. The proposed approach is benchmarked on challenging sequences from the VOT2014 dataset and compared against other baseline trackers and metaheuristic algorithms.
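For reference, one iteration of the canonical firefly update on bounding-box state vectors (a sketch only: the paper replaces plain brightness with a foreground-plus-background similarity score and adds RGA-based breeding, both omitted here):

```python
import numpy as np

def firefly_step(states, brightness, beta0=1.0, gamma=0.01, alpha=0.1):
    """Each firefly (candidate bounding box, one row of `states`)
    moves toward every brighter firefly, with attractiveness decaying
    with squared distance, plus a small random walk."""
    new = states.copy()
    n, d = states.shape
    for i in range(n):
        for j in range(n):
            if brightness[j] > brightness[i]:
                r2 = np.sum((states[j] - states[i]) ** 2)
                new[i] += beta0 * np.exp(-gamma * r2) * (states[j] - states[i])
        new[i] += alpha * (np.random.rand(d) - 0.5)
    return new
```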
|
|
15:20-17:20, Paper ThPMP.43 | |
UAV Target Tracking with a Boundary-Decision Network |
Song, Ke | Shandong Univ |
Zhang, Wei | Shandong Univ |
Rong, Xuewen | Shandong Univ |
Keywords: Motion and tracking, Deep learning, Applications of computer vision
Abstract: The aspect ratio of a target changes frequently during UAV tracking tasks, which makes aerial tracking very challenging. Traditional trackers struggle with this problem, as they mainly address scale variation while maintaining a fixed aspect ratio. In this paper, we propose a novel tracker, named boundary-decision network (BDNet), to address aspect ratio variation in UAV tracking. Unlike previous work, the proposed method operates on each boundary separately with a policy network. Given an initial estimate of the bounding box, a sequence of actions is generated to tune the four boundaries, with an optimization strategy that includes boundary proposal rejection and offline and online learning. Experimental results on the benchmark aerial dataset show that the proposed approach outperforms existing trackers and produces significant accuracy gains in dealing with aspect ratio variation in UAV tracking.
|
|
15:20-17:20, Paper ThPMP.44 | |
Learning Collaborative Model for Visual Tracking |
Ma, Ding | Harbin Inst. of Tech |
Wu, Xiangqian | Harbin Inst. of Tech |
Bu, Wei | Harbin Inst. of Tech |
Cui, YueHua | Michigan State Univ. Department of Statistics and Probabil |
Xie, Yuying | Michigan State Univ. Department of Computational Mathemati |
Keywords: Motion and tracking, Video analysis, Applications of computer vision
Abstract: This paper proposes a robust visual tracking method based on a collaborative model. The collaborative model employs a two-stage tracker and a HOG-based detector, exploiting both holistic and local information of the target. The two-stage tracker learns a linear classifier from patches of the original images, and the HOG-based detector trains a linear discriminant analysis classifier with the object exemplar. Finally, a result decision-making strategy is developed that considers both the original template and appearance variations, making the tracker and the detector collaborate with each other. The proposed method has been evaluated on the OTB-50, OTB-100 and Temple-Color datasets, and the results demonstrate that it effectively addresses challenging cases such as scale variation and out-of-view targets, and achieves better performance than state-of-the-art trackers.
|
|
15:20-17:20, Paper ThPMP.45 | |
An Efficient System for Hazy Scene Text Detection Using a Deep CNN and Patch-NMS |
Mohanty, Sabyasachi | IIT (BHU) Varanasi |
Dutta, Tanima | IIT (BHU) Varanasi |
Gupta, Hari Prabhat | IIT (BHU) Varanasi |
Keywords: Applications of computer vision, Scene understanding, Deep learning
Abstract: Scene text detection systems detect text in natural scene images. Hazy scene text detection is a specific case in which detection is done under hazy weather conditions; haze reduces the contrast of the image. In this paper, we reframe the traditional two-class hazy scene text detection problem as a four-class problem. We develop a deep learning based model that combines features from all layers for accurate and fast text detection in hazy images. In addition, we develop a novel training approach for the four-class problem. Merging and patch-NMS are used as post-processing steps for fast word detection. We also create a new dataset of hazy scene images and obtain significant improvements on an existing hazy scene text dataset.
|
|
15:20-17:20, Paper ThPMP.46 | |
ALFA: Agglomerative Late Fusion Algorithm for Object Detection |
Razinkov, Evgenii | Kazan Federal Univ |
Saveleva, Iuliia | Kazan Federal Univ |
Matas, Jiri | CTU Prague |
Keywords: Object detection
Abstract: We propose ALFA, a novel late fusion algorithm for object detection. ALFA is based on agglomerative clustering of object detector predictions, taking into consideration both the bounding box locations and the class scores. Each cluster represents a single object hypothesis whose location is a weighted combination of the clustered bounding boxes. ALFA was evaluated using combinations of a pair (SSD and DeNet) and a triplet (SSD, DeNet and Faster R-CNN) of recent object detectors that are close to the state-of-the-art. ALFA achieves state-of-the-art results on PASCAL VOC 2007 and PASCAL VOC 2012, outperforming the individual detectors as well as baseline combination strategies, with up to 32% lower error than the best individual detectors and up to 6% lower error than the reference fusion algorithm DBF (Dynamic Belief Fusion).
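A greedy, heavily simplified sketch of the agglomerative late fusion idea (ALFA's real clustering also uses the class-score distributions and a proper agglomerative scheme, both omitted):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def late_fuse(boxes, scores, thr=0.5):
    """Group overlapping detections pooled from several detectors and
    emit one score-weighted box per group. `boxes` is an (N, 4) float
    array; `scores` is an (N,) float array."""
    used, fused = set(), []
    for i in range(len(boxes)):
        if i in used:
            continue
        group = [j for j in range(len(boxes))
                 if j not in used and iou(boxes[i], boxes[j]) > thr]
        used.update(group)
        w = scores[group] / scores[group].sum()
        fused.append((w[:, None] * boxes[group]).sum(axis=0))
    return np.array(fused)
```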
|
|
15:20-17:20, Paper ThPMP.47 | |
Low-Rank Tensor Completion by Truncated Nuclear Norm Regularization |
Xue, Shengke | Zhejiang Univ |
Qiu, Wenyuan | Zhejiang Univ |
Liu, Fan | Zhejiang Univ |
Jin, Xinyu | Zhejiang Univ |
Keywords: Inpainting, Image processing and analysis, Video processing and analysis
Abstract: Low-rank tensor completion has gained increasing attention for recovering incomplete visual data in which some elements are missing. By treating a color image or video as a three-dimensional (3D) tensor, previous studies have suggested several definitions of the tensor nuclear norm. However, these have limitations and may not properly approximate the real rank of a tensor. Moreover, they do not explicitly use the low-rank property in optimization. It has been proved that the recently proposed truncated nuclear norm (TNN) can replace the traditional nuclear norm as a better estimate of the rank of a matrix. Thus, this paper presents a new method, called the tensor truncated nuclear norm (T-TNN), which proposes a new definition of the tensor nuclear norm and extends the truncated nuclear norm from the matrix case to the tensor case. Benefiting from the low-rankness promoted by TNN, our approach improves the efficacy of tensor completion. We exploit the previously proposed tensor singular value decomposition and the alternating direction method of multipliers in optimization. Extensive experiments on real-world videos and images demonstrate that the performance of our approach is superior to that of existing methods.
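The matrix-case quantity at the heart of the method is easy to state; a NumPy sketch with a worked rank-1 example (the paper's tensor extension via t-SVD and the ADMM solver are not shown):

```python
import numpy as np

def truncated_nuclear_norm(X, r):
    """TNN of a matrix: the sum of its singular values excluding the
    r largest, i.e. ||X||_* minus the top-r part. Minimising it
    penalises only the residual spectrum, a better rank surrogate
    than the plain nuclear norm."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[r:].sum()

X = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])   # rank-1 matrix
print(truncated_nuclear_norm(X, 1))          # ~0: only one nonzero s.v.
```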
|
|
15:20-17:20, Paper ThPMP.48 | |
A Generic Axiomatic Characterization for Measuring Influence in Social Networks |
Bandyopadhyay, Sambaran | IBM Res |
Narayanam, Ramasuri | IBM Res. India |
Musti, Narasimha Murty | Indian Inst. of Science |
Keywords: Social and web multimedia
Abstract: Measuring influence, also popularly known as computing centrality measures, has been a centerpiece of research in the analysis of complex social networks, for example to find coherent communities and to identify trend setters in viral marketing. Even though a few axiomatic frameworks exist in the literature for specific forms of influence measures, these formal frameworks are not generic in terms of characterizing the space of influence measures in complex networks. To address this research gap, in this paper we propose a generic axiomatic framework to capture most of the key intrinsic properties of any influence measure in networks. We further analyze certain popular centrality measures using this framework. Interestingly, our analysis reveals that none of the centrality measures considered satisfies all the desirable axioms. We conclude by stating an appealing conjecture on a potential impossibility theorem associated with the proposed axiomatic framework.
|
|
15:20-17:20, Paper ThPMP.49 | |
Temporal Filter Parameters for Motion Pattern Maps |
O'Gorman, Lawrence | Nokia Bell Labs |
Keywords: Video processing and analysis, Video analysis, Scene understanding
Abstract: There are many video analysis applications in which the flow of people or traffic over time is of more interest than tracking the discrete objects they contain. Motion pattern maps, loosely called heat maps, can show accumulated activity over time via temporal filtering. In this paper, we examine three parameters that control this filtering: event resolution, time-merging of events, and dynamic background. The design of these parameters is detailed and their effects are shown on simulated videos. Design considerations for real video events are discussed and shown by example.
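A minimal sketch of one such temporal filter, assuming a first-order recursive (exponential-decay) accumulator over per-pixel frame differencing; the paper's actual filters and parameter values differ.

    import numpy as np

    def update_heat_map(heat, frame, prev_frame, alpha=0.98, thr=15):
        # Per-pixel motion evidence from simple frame differencing...
        motion = np.abs(frame.astype(float) - prev_frame.astype(float)) > thr
        # ...accumulated with decay alpha; alpha trades event resolution
        # against time-merging of events.
        return alpha * heat + (1.0 - alpha) * motion.astype(float)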
|
|
15:20-17:20, Paper ThPMP.50 | |
Balancing Video Analytics Processing and Bandwidth for Edge-Cloud Networks |
O'Gorman, Lawrence | Nokia Bell Labs |
Wang, Xiaoyang | Nokia Bell Labs |
Keywords: Video processing and analysis, Sensor array & multichannel signal processing, Visual surveillance
Abstract: In IoT networks, signals are captured by edge sensors, system-wide decisions are made at a central host, and processing may be performed at either or both sides of the network. The decision on edge or cloud placement of methods depends in part on processing and bandwidth efficiencies. Video analytics offers a good application to investigate this, because signals from edge cameras are large both spatially and temporally, and processing usually involves a sequence of methods. For public camera analytics, where processing and bandwidth need only be expended during active periods, we find by experiment a data range for crowd, traffic, and building scenes. For a typical sequence of video analytics methods applied to surveillance, we find upper bounds for processing and bandwidth, and experimental measurements for real, much sparser, video. Explicit description of algorithmic requirements and knowledge of experimentally determined loads provide the information needed to balance methods between edge and cloud.
|
|
15:20-17:20, Paper ThPMP.51 | |
Focusing on What Is Relevant: Time-Series Learning and Understanding Using Attention |
Vinayavekhin, Phongtharin | IBM Res |
Chaudhury, Subhajit | IBM Res |
Munawar, Asim | IBM Res |
Agravante, Don Joven | IBM Res |
De Magistris, Giovanni | IBM Res |
Kimura, Daiki | IBM Res |
Tachibana, Ryuki | IBM Res |
Keywords: Deep learning for multimedia analysis, Sequence modeling, Deep learning
Abstract: This paper is a contribution towards the interpretability of deep learning models in different time-series applications. We propose a temporal attention layer that is capable of selecting the relevant information to perform various tasks, including data completion, key-frame detection and classification. The method uses the whole input sequence to calculate an attention value for each time step. This results in more focused attention values and more plausible visualisation than previous methods. We apply the proposed method to three different tasks. Experimental results show that the proposed network produces results comparable to the state of the art. In addition, the network provides better interpretability of its decisions, that is, it assigns larger attention weights to relevant frames than similar techniques attempted in the past.
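A minimal PyTorch sketch of such a layer, assuming a single linear scoring function over the whole sequence; the paper's architecture and tasks are richer than this.

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.score = nn.Linear(dim, 1)   # one scalar score per time step

        def forward(self, x):                # x: (batch, time, dim)
            a = torch.softmax(self.score(x).squeeze(-1), dim=1)   # (batch, time)
            context = (a.unsqueeze(-1) * x).sum(dim=1)            # (batch, dim)
            return context, a   # the weights themselves aid interpretation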
|
|
15:20-17:20, Paper ThPMP.52 | |
Foreground Enlargement of Omnidirectional Images by Spherical Trigonometry |
Yu, An-shui | Kyushu Univ |
Hara, Kenji | Kyushu Univ |
Inoue, Kohei | Kyushu Univ |
Urahama, Kiichi | Kyushu Univ |
Keywords: Image processing and analysis
Abstract: Omnidirectional images with a 360-degree field of view have much smaller foreground regions compared to background regions and are thus difficult to recognize visually, especially when displayed on a small display. In this paper, we propose a method for enhancing the visibility of omnidirectional images by enlarging user-specified foreground regions while compressing the remaining background regions without introducing a sense of visual incompatibility. The proposed method first transforms the omnidirectional image into a 3D spherical image, and then relocates the spherical nodes, to which pixel values are assigned, from their initial positions to preferable positions based on point correspondences between spherical polygons. The experiments demonstrate that our approach produces satisfactory omnidirectional images in which the foreground regions are enlarged analogously.
|
|
15:20-17:20, Paper ThPMP.53 | |
Feature Extraction and Grain Segmentation of Sandstone Images Based on Convolutional Neural Networks |
Feng, Jiang | Taizhou Inst. of Sci. & Tech. NUST |
Gu, Qing | Nanjing Univ |
Hao, Huizhen | Nanjing Univ |
Li, Na | Nanjing Univ |
Keywords: Segmentation, features and descriptors, Neural networks, Clustering
Abstract: Grain segmentation of sandstone images segments mineral grains into separate regions, which is the first step for mineral identification and sandstone classification. A sandstone image contains a large number of mineral grains which, owing to the complexity of the micro-structures and the variability of the optical properties, are difficult to segment automatically. In this paper, we propose a three-stage method for grain segmentation that takes multi-angle sandstone images as input. In the first stage, the pixels of the images are clustered into superpixels by both color and spatial properties. In the second stage, a convolutional neural network (CNN) is trained on replicated images of sandstone mineral grains and then used to extract the convolutional features of the superpixels. In the third stage, a fuzzy clustering algorithm merges the over-segmented superpixels into mineral grains. We collect sandstone images from Tibet, on which the experimental results demonstrate the following: (1) the convolutional features extracted by the designed CNN are more suitable for characterizing the mineral grains of sandstone images than hand-crafted features; (2) the proposed three-stage method is more effective than state-of-the-art segmentation methods.
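A hedged single-image sketch of the three stages, with SLIC superpixels, mean color standing in for the paper's CNN features, and k-means standing in for its fuzzy clustering:

    import numpy as np
    from skimage.segmentation import slic
    from sklearn.cluster import KMeans

    def segment_grains(image, n_segments=400, n_grains=20):
        sp = slic(image, n_segments=n_segments, start_label=0)  # stage 1
        feats = np.array([image[sp == s].mean(axis=0)           # stage 2:
                          for s in np.unique(sp)])              # per-superpixel feature
        labels = KMeans(n_clusters=n_grains, n_init=10).fit_predict(feats)
        return labels[sp]                                       # stage 3: merge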
|
|
15:20-17:20, Paper ThPMP.54 | |
Completed Grayscale-Inversion and Rotation Invariant Local Binary Pattern for Texture Classification |
Song, Tiecheng | Chongqing Univ. of Posts and Telecommunications |
Xin, Liangliang | Chongqing Univ. of Posts and Telecommunications |
Luo, Lin | Chongqing Univ. of Posts and Telecommunications |
Gao, Chenqiang | School of Communication and Information Engineering, Chongqing U |
Keywords: Texture analysis, Image classification, Classification
Abstract: Local binary pattern (LBP) and its variants (e.g., LTP and CLBP) are powerful descriptors for texture analysis. However, most of these LBP-based methods are sensitive to grayscale-inversion changes. To overcome this problem, we present a novel texture descriptor named Completed Grayscale-Inversion and Rotation Invariant Local Binary Pattern (CGRI-LBP). CGRI-LBP is based on the CLBP framework, which jointly encodes three components (i.e., the signs and magnitudes of local differences as well as central pixels), but with two significant improvements: 1) the sign information of local differences is encoded by a rotation-invariant complementary coding scheme, and 2) the intensity information of central pixels is encoded via a dominant intensity-order measure. Extensive experiments on three texture databases (Outex, CUReT and KTH-TIPS) demonstrate that the proposed descriptor achieves state-of-the-art classification performance in the presence of linear and even nonlinear grayscale-inversion changes.
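For context, here is a sketch of the rotation-invariant LBP sign coding this descriptor builds on, via scikit-image; the complementary coding scheme and dominant intensity-order measure of CGRI-LBP are not part of this baseline.

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_histogram(gray, P=8, R=1.0):
        # 'ror' = rotation-invariant codes from circularly shifted sign bits.
        codes = local_binary_pattern(gray, P, R, method="ror")
        hist, _ = np.histogram(codes, bins=int(codes.max()) + 1, density=True)
        return hist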
|
|
15:20-17:20, Paper ThPMP.55 | |
Local Perceptual Loss Function for Arbitrary Region Style Transfer |
Zhang, WenXiang | Southeast Univ |
Liu, Qingshan | Southeast Univ |
Keywords: Image processing and analysis, Transfer learning, Deep learning
Abstract: Style transfer is an important and interesting task in computer science. In this paper, we propose local perceptual loss functions for local image style transfer. The conditional fast style transfer network can handle a variety of styles via its conditional input. We extend this framework with local perceptual loss functions to achieve spatial control: a trained transfer network applies style transfer only to the target region while keeping the background fixed. The target region is controlled by a mask matrix, which serves as the conditional input. For example, we can change the style of a car or a cat in an input image, and we can even make a person look like a cartoon character while keeping the background realistic. In this paper, we describe how to use the local perceptual loss functions and combine them with a conditional transfer network for arbitrary-region style transfer. The experiments show that the generated images are visually convincing. Moreover, our method is very efficient at test time, since it requires only a single forward pass.
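A minimal PyTorch sketch of what a "local" perceptual loss could look like, assuming fixed-network features and a binary region mask; the layer choice and normalization are illustrative, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def local_perceptual_loss(feat_out, feat_target, mask):
        # feat_*: (B, C, H, W) features from a fixed network; mask: (B, 1, h, w).
        m = F.interpolate(mask, size=feat_out.shape[-2:], mode="nearest")
        # Feature-space distance is accumulated only inside the masked region,
        # so style gradients never touch the background.
        diff = (feat_out - feat_target) ** 2 * m
        return diff.sum() / (m.sum() * feat_out.shape[1] + 1e-8)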
|
|
15:20-17:20, Paper ThPMP.56 | |
A Dravidian Language Identification System |
Mukherjee, Himadri | West Bengal State Univ |
Obaidullah, Sk Md | Aliah Univ |
Phadikar, Santanu | Maulana Abul Kalam Azad Univ. of Tech |
Roy, Kaushik | West Bengal State Univ |
Keywords: Spoken language analysis, Speech recognition, Segmentation, features and descriptors
Abstract: Speech recognition has brought various technological benefits to daily life across the continents. Such advances have not yet propagated to the grassroots level of India, one reason being the multilingual nature of the country. Speakers habitually mix multiple languages while talking, which makes speech recognition challenging and thereby makes language identification an important task. The technique of automatically identifying the language from spoken phrases is termed Automatic Language Identification. The problem of multilingual speech is aggravated for South Indian languages, which can at times be very difficult to distinguish with negligible prior knowledge. In this paper, an Automatic Language Identification System is proposed to distinguish the four Dravidian languages, also known as South Indian languages due to their predominant use in the southern part of the Indian subcontinent. The dataset comprised up to 12,224 clips, and a highest accuracy of 96.46% was obtained using a newly proposed Line Spectral Pair-Grade (LSP-G) feature along with FURIA-based classification.
|
|
15:20-17:20, Paper ThPMP.57 | |
Boundary-Based Image Forgery Detection by Fast Shallow CNN |
Zhang, Zhongping | Univ. of Rochester |
Zhang, Yixuan | Univ. of Rochester |
Zhou, Zheng | Univ. of Rochester |
Luo, Jiebo | Univ. of Rochester
Keywords: Image processing and analysis, Deep learning for multimedia analysis
Abstract: Image forgery detection is the task of detecting and localizing forged parts in tampered images. Previous works mostly focus on high-resolution images, using traces of resampling features, demosaicing features or edge sharpness. However, a good detection method should also be applicable to low-resolution images, because compressed or resized images are common these days. To this end, we propose a Shallow Convolutional Neural Network (SCNN) capable of distinguishing the boundaries of forged regions from original edges in low-resolution images. SCNN is designed to utilize chroma and saturation information. Based on SCNN, two approaches, named Sliding Windows Detection (SWD) and Fast SCNN, are developed to detect and localize forged image regions. Our model is evaluated on the CASIA 2.0 dataset. The results show that Fast SCNN performs well on low-resolution images and achieves significant improvements over the state of the art.
|
|
15:20-17:20, Paper ThPMP.58 | |
Residual HSRCNN: Residual Hyper-Spectral Reconstruction CNN from a RGB Image |
Han, Xian-Hua | Yamaguchi Univ |
Shi, Boxin | National Inst. of Advanced Industrial Science and Tech |
Zheng, Yinqiang | National Inst. of Informatics |
Keywords: Enhancement, restoration and filtering, Image processing and analysis, Applications of pattern recognition and machine learning
Abstract: Hyperspectral imaging has great potential for understanding the characteristics of different materials in applications ranging from remote sensing to medical imaging. However, due to various hardware limitations, only low-resolution hyperspectral and high-resolution multi-spectral or RGB images can be obtained with existing imaging techniques. This study aims to generate a hyperspectral image by enhancing the spectral resolution of an RGB image obtained with a common camera. Motivated by the success of deep convolutional neural networks (DCNNs) for spatial resolution enhancement of natural images, we explore a spectral reconstruction CNN for spectral super-resolution from an available RGB image, which predicts the high-frequency content at fine spectral wavelengths in narrow band intervals. Since the lost high-frequency content cannot be perfectly recovered, we stack the same CNN architecture on top of the baseline CNN to further estimate the unrecovered high-frequency content (the residual) from the output of the baseline CNN, and thus propose a novel residual hyperspectral reconstruction CNN framework. Experiments on benchmark hyperspectral datasets validate that the proposed method achieves promising performance compared with existing state-of-the-art methods.
|
|
15:20-17:20, Paper ThPMP.59 | |
Two-Stage Convolutional Network for Image Super-Resolution |
Hui, Zheng | Xidian Univ |
Wang, Xiumei | Xidian Univ |
Gao, Xinbo | Xidian Univ |
Keywords: Super-resolution, Low-level vision
Abstract: Deep convolutional neural networks (DCNNs) have recently advanced the state of the art in single image super-resolution (SR). In this work, we propose a two-stage convolutional network (TSCN) to estimate the desired high-resolution (HR) image from the corresponding low-resolution (LR) image. Specifically, we propose a multi-path information fusion (MIF) module that collects abundant information from the input, output and intermediate feature maps of a module and distills the primary information therein. Several cascaded MIF modules progressively extract the features needed for reconstruction, and the output of each module is gathered for rebuilding the HR image. In addition, we introduce a refinement network with a local residual topology as the second stage, to further restore the high-frequency details of the HR image produced by the first stage. Thanks to its small number of filters, the compact model achieves fast inference while simultaneously delivering state-of-the-art SR results on four benchmark datasets.
|
|
15:20-17:20, Paper ThPMP.60 | |
Ensemble Reversible Data Hiding |
Wu, Hanzhou | Inst. of Automation, Chinese Acad. of Science |
Wang, Wei | Inst. of Automation, Chinese Acad. of Sciences |
Dong, Jing | Inst. of Automation, Chinese Acad. of Sciences |
Wang, Hongxia | Southwest Jiaotong Univ |
Keywords: Information forensics and security, Image processing and analysis, Image and video coding
Abstract: Conventional reversible data hiding (RDH) algorithms often treat the host as a whole when embedding a secret payload. To achieve satisfactory rate-distortion performance, the secret bits are embedded into the noise-like component of the host, such as prediction errors. From the rate-distortion optimization view, this may not be optimal, since all data embedding units use identical parameters. This motivates us to present a segmented data embedding strategy for efficient RDH, in which the raw host is partitioned into multiple subhosts so that each can freely optimize and use its own data embedding parameters. Moreover, this enables us to apply different RDH algorithms to different subhosts, which we define as ensemble. Note that the ensemble defined here differs from that in machine learning. Accordingly, the conventional operation corresponds to a special case of the proposed work. Since it is a general strategy, we combine several state-of-the-art algorithms into a new system using the proposed embedding strategy to evaluate the rate-distortion performance. Experimental results show that the ensemble RDH system outperforms the original versions in most cases, demonstrating its superiority and applicability.
|
|
15:20-17:20, Paper ThPMP.61 | |
Low-Rank and Sparse Decomposition on Contrast Map for Small Infrared Target Detection |
Deng, Xiaoya | Beijing Univ. of Chemical Tech |
Li, Wei | Beijing Univ. of Chemical Tech |
Li, Liwei | Chinese Acad. of Science |
Zhang, Wenjuan | Chinese Acad. of Science |
Li, Xia | Science and Tech. on Optical Radiation Lab |
Keywords: Image processing and analysis, Applications of pattern recognition and machine learning
Abstract: Small infrared target detection is a key and challenging issue in object detection and tracking systems. Existing algorithms can be mainly categorized into nonlocal-based or local-based methods. However, their detection performance degrades rapidly when facing highly heterogeneous backgrounds. This is mainly because they exploit only one kind of information (e.g., local or nonlocal) while sacrificing the other. Thus, an effective small target detection method is proposed that combines local and nonlocal priors. The former is obtained by a sliding dual window, while the latter is realized by low-rank and sparse decomposition. Experimental results on three real datasets validate the effectiveness of the proposed framework, which is more stable and robust than several state-of-the-art methods, especially for image scenes with heavy background clutter.
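A minimal sketch of the low-rank plus sparse split behind the nonlocal prior, assuming a simple alternating proximal scheme (singular-value and soft thresholding); the paper's formulation and parameters may differ.

    import numpy as np

    def lowrank_sparse(D, lam=None, tau=1.0, n_iter=50):
        # Minimizes 0.5*||D - L - S||^2 + tau*||L||_* + lam*tau*||S||_1.
        lam = lam or 1.0 / np.sqrt(max(D.shape))
        L, S = np.zeros_like(D), np.zeros_like(D)
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
            L = (U * np.maximum(s - tau, 0.0)) @ Vt          # low-rank background
            R = D - L
            S = np.sign(R) * np.maximum(np.abs(R) - lam * tau, 0.0)  # sparse targets
        return L, S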
|
|
15:20-17:20, Paper ThPMP.62 | |
Revised Spatial Transformer Network towards Improved Image Super-Resolutions |
Kasem, Hossam | Shenzhen Univ |
Hung, Kwok-Wai | Shenzhen Univ |
Jiang, Jianmin | Shenzhen Univ |
Keywords: Super-resolution
Abstract: In this paper, we propose two approaches to generate a high-resolution (HR) image from a low-resolution (LR) version, a problem commonly referred to as single-image super-resolution (SISR). Our approaches are inspired by the spatial transformer (ST) module and the Very Deep Convolutional Network (VDSR). The spatial transformer module is a neural network component originally used for geometric transformations of images, while VDSR is designed for image super-resolution. In the first approach, we add the ST module to VDSR to generate HR images; the use of the spatial transformer with VDSR makes the network more robust to geometric transformations. In the second approach, we replace the convolutional neural network (CNN) used inside the ST module with the VDSR network, which improves the performance of the ST module. The simulation results confirm the feasibility of combining the ST module and VDSR for super-resolution reconstruction: the performance of the combination is comparable to that of VDSR alone in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). Hence, the revised spatial transformer network can be used in the future for simultaneous geometric transformation and image super-resolution, addressing practical applications of image super-resolution in real life.
|
|
15:20-17:20, Paper ThPMP.63 | |
Deep High-Order Supervised Hashing for Image Retrieval |
Cheng, Jingdong | Dalian Univ |
Sun, Qiule | Dalian Univ |
Zhang, Jianxin | Dalian Univ |
Wei, Xiaopeng | Dalian Univ. of Tech |
Zhang, Qiang | Dalian Univ. of Tech |
Keywords: Multimedia analysis, indexing and retrieval, Deep learning
Abstract: Recently, deep hashing has achieved excellent performance in large-scale image retrieval by simultaneously learning deep features and a hashing function. However, state-of-the-art works have so far failed to explore feature statistics beyond first order. In this paper, to take a step towards addressing this problem, we propose two novel Deep High-order Supervised Hashing architectures (DHoSH): point-wise-label-based DHoSH (DHoSH-PO) and pair-wise-label-based DHoSH (DHoSH-PA). The core of DHoSH is a trainable bilinear pooling layer incorporated into deep convolutional neural networks (CNNs) for end-to-end learning. This layer captures the local feature interactions of the image by outer product, employing the autocorrelation and cross-correlation information of deep features. Furthermore, our DHoSH method systematically exploits the high-order statistics of features from multiple layers. Extensive experiments on commonly used benchmarks show that both DHoSH-PO and DHoSH-PA obtain competitive improvements over their first-order counterparts and achieve state-of-the-art performance on the image retrieval task.
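A PyTorch sketch of the bilinear pooling operation at the core of DHoSH, assuming the common signed-square-root and L2 normalization; the hashing layers on top are omitted.

    import torch
    import torch.nn.functional as F

    def bilinear_pool(feat):
        # feat: (B, C, H, W) -> (B, C*C) second-order descriptor.
        B, C, H, W = feat.shape
        x = feat.reshape(B, C, H * W)
        g = torch.bmm(x, x.transpose(1, 2)) / (H * W)  # outer products over positions
        g = g.reshape(B, C * C)
        g = torch.sign(g) * torch.sqrt(torch.abs(g) + 1e-8)  # signed sqrt
        return F.normalize(g, dim=1)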
|
|
15:20-17:20, Paper ThPMP.64 | |
Mutual-Optimization towards Generative Adversarial Networks for Robust Speech Recognition |
Ding, Ke | Beijing Forestry Univ |
Luo, Ne | Beijing Forestry Univ |
Ke, Dengfeng | Chinese Acad. of Sciences |
Xu, Yanyan | Beijing Forestry Univ |
Su, Kaile | Griffith Univ |
Keywords: Speech recognition, Deep learning, Neural networks
Abstract: In the context of Automatic Speech Recognition (ASR), improving noise robustness remains an intractable task. Speech enhancement combined with Generative Adversarial Networks (GANs), such as SEGAN, performs well in denoising raw-waveform speech signals. Using Mel filterbank spectra instead of waveforms in the GAN has been proposed and performs better for ASR. However, these techniques still discard information useful for recognition when the GAN is applied. In this paper, we investigate how to preserve this useful information and propose a novel model, called Discriminator Generator Classifier-GAN (DGC-GAN). Whereas a normal GAN combining just two networks steers the model towards denoising rather than recognition, DGC-GAN adds a third network, a classifier, which is an ASR system that tunes the GAN to produce output that is easier to recognize. By adding a classifier to the previous GAN to obtain DGC-GAN, we achieve a 29.1% relative Phone Error Rate (PER) improvement on a tiny dataset and a 47.4% relative PER improvement on a large dataset.
|
|
15:20-17:20, Paper ThPMP.65 | |
Infrared and Visible Image Fusion Using a Deep Learning Framework |
Li, Hui | Jiangnan Univ |
Wu, Xiaojun | Jiangnan Univ |
Kittler, Josef | Univ. of Surrey |
Keywords: Image processing and analysis, Deep learning, Neural networks
Abstract: In recent years, deep learning has become a very active research tool used in many image processing fields. In this paper, we propose an effective image fusion method using a deep learning framework to generate a single image that contains all the features of the infrared and visible images. First, the source images are decomposed into base parts and detail content. The base parts are then fused by weighted averaging. For the detail content, we use a deep learning network to extract multi-layer features and, using these features, an l1-norm and weighted-average strategy to generate several candidates of the fused detail content. Once we have these candidates, a max-selection strategy produces the final fused detail content. Finally, the fused image is reconstructed by combining the fused base part and detail content. The experimental results demonstrate that our proposed method achieves state-of-the-art performance in both objective assessment and visual quality. The code of our fusion method is available at https://github.com/exceptionLi/imagefusion_deeplearning
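A hedged sketch of the l1-norm weighted-average rule for one feature layer, assuming pre-extracted (C, H, W) feature maps of each detail image; the decomposition step and the multi-layer max-selection step are simplified away.

    import numpy as np

    def fuse_details(detail_ir, detail_vis, feat_ir, feat_vis):
        # l1-norm of deep features as a per-pixel activity measure.
        a_ir = np.abs(feat_ir).sum(axis=0)
        a_vis = np.abs(feat_vis).sum(axis=0)
        w_ir = a_ir / (a_ir + a_vis + 1e-8)     # soft fusion weights
        return w_ir * detail_ir + (1.0 - w_ir) * detail_vis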
|
|
15:20-17:20, Paper ThPMP.66 | |
Non-Uniform Illumination Video Enhancement Based on Zone System and Fusion |
Liu, Shiguang | Tianjin Univ |
Zhang, Yu | Tianjin Univ |
Keywords: Enhancement, restoration and filtering, Video processing and analysis, Signal analysis
Abstract: In daily life, the acquisition of digital video can introduce non-uniform exposure due to the settings of the shooting equipment or the scene illumination. Because of the low visibility, details are hidden and such videos usually fail to provide a visually pleasing viewing experience. Previous work typically relies on a single heuristic tone-mapping curve to expand the dynamic range, which inevitably leads to uneven exposure. To solve this problem, we present a new video enhancement method based on the zone system and image fusion. Given an input non-uniform-illumination video, we first apply the zone system for exposure evaluation. We then remap each region using a series of tone-mapping curves to generate multi-exposure regions containing differently exposed versions. Guided by visual perception quality measures, we locate all the best-exposed regions and integrate them into a well-exposed video frame. Finally, to keep temporal consistency, we temporally propagate the zone regions from the previous frame to the current frame. Experimental results show that the enhanced video exhibits uniform exposure and preserves temporal consistency.
|
|
15:20-17:20, Paper ThPMP.67 | |
Multi-Kernel Supervised Hashing with Graph Regularization for Cross-Modal Retrieval |
Zhu, Ming | Anhui Univ |
Miao, Huanghui | Anhui Univ |
Tang, Jun | Anhui Univ |
Keywords: Multimedia analysis, indexing and retrieval
Abstract: Hashing-based approximate nearest neighbor search has received considerable attention due to the demand for fast queries over big multimedia data. Cross-modal hashing focuses on retrieval tasks across different modalities, which is more useful in practical applications. In this paper, we propose a novel two-stage cross-modal hashing method, referred to as Multi-Kernel Supervised Hashing with Graph Regularization (MKSRH). To better capture the essential attributes of the original data, MKSRH first maps the original data into a kernel space constructed as a linear combination of multiple kernel functions. Preliminary hash functions are then learned in this kernel space using the AdaBoost framework. To produce more accurate hash codes, the obtained hash functions are then refined using a graph-regularization-based strategy. Experimental results on two canonical datasets show that MKSRH significantly outperforms several typical cross-modal hashing methods, demonstrating the effectiveness and superiority of the proposed approach.
|
|
15:20-17:20, Paper ThPMP.68 | |
Radiometric Confidence Criterion for Patch-Based Inpainting |
Fayer, Julien | Univ. of Toulouse, Toulouse INP - IRIT |
Morin, Geraldine | Univ. of Toulouse |
Gasparini, Simone | Univ. of Toulouse - Toulouse INP - IRIT |
Daisy, Maxime | INNERSENSE |
Coudrin, Benjamin | Innersense |
Keywords: Inpainting, Image processing and analysis, Physics-based vision
Abstract: Diminished Reality (DR) consists in virtually removing objects from a captured scene, thus requiring a coherent filling of the areas originally hidden behind these objects. Indoor DR applications often exploit the planar geometry of the scene to apply an inpainting process on a perspectively undistorted view of the plane. In this paper we propose to integrate a novel, physics-based criterion into classical state-of-the-art inpainting algorithms in order to take into account the variations in image resolution of the undistorted view. The proposed inpainting process selects the patches and avoids the propagation of low-resolution data, i.e., patches corresponding to parts of the plane that are far from the camera or seen at a very skewed angle. We illustrate the improvements in DR results on synthetic and real images.
|
|
15:20-17:20, Paper ThPMP.69 | |
Are French Really That Different? Recognizing Europeans from Faces Using Data-Driven Learning |
Nguyen, Viet-Duy | Univ. of Rochester |
Tran, Minh | Univ. of Rochester |
Luo, Jiebo | Univ. of Rochester
Keywords: Social and web multimedia, Deep learning for multimedia analysis, Multimedia analysis, indexing and retrieval
Abstract: Travel agents and retailers are curious about where their customers come from, as this knowledge can help them increase sales and optimize marketing strategies. In this study, we present a system to predict which European country people come from using only their faces. The countries chosen for the study are Russia, Italy, Germany, Spain, and France, based on diversity and representativeness: these countries are well known for their economies, populations, and political impact. First, we implement different neural network classifiers on a dataset of people's faces that we have collected from Twitter. Next, we investigate in more detail 11 facial features that may help differentiate the ethnic groups representative of those five countries. Our system achieves an accuracy of over 50%, more than twice as good as that of humans. We uncover, and interpret using genetic and anthropological evidence, the various differences and similarities between people's faces across geographical distances.
|
|
15:20-17:20, Paper ThPMP.70 | |
Texture Segmentation Using Siamese Network and Hierarchical Region Merging |
Yamada, Ryusuke | Hiroshima Univ |
Ide, Hidenori | Hiroshima Univ |
Yudistira, Novanto | Hiroshima Univ |
Kurita, Takio | Hiroshima Univ |
Keywords: Segmentation, features and descriptors, Texture analysis, Deep learning
Abstract: This paper proposes a texture segmentation algorithm. In the proposed algorithm, the feature vectors at each pixel of an input image are extracted using deep neural networks such as a deep convolutional neural network (CNN) or a Siamese network. They are then used as input to hierarchical region merging. Unlike semantic segmentation approaches such as the fully convolutional network (FCN) or U-Net, which are based on supervised learning, the proposed algorithm can correctly segment regions whose textures are taken from other types of texture. The effectiveness of the proposed texture segmentation algorithm is experimentally confirmed using the well-known texture images taken from the book by P. Brodatz.
|
|
15:20-17:20, Paper ThPMP.71 | |
Screen-Rendered Text Images Recognition Using a Deep Residual Network Based Segmentation-Free Method |
Xu, Xin | Wuhan Univ. of Science and Tech |
Zhou, Jun | Wuhan Univ. of Science and Tech |
Zhang, Hong | Wuhan Univ. of Science & Tech |
Keywords: Deep learning for multimedia analysis, Image processing and analysis
Abstract: Text image recognition has long been a research hotspot in computer vision. However, screen-rendered text images pose great challenges to current character and text recognition methods due to their low resolution and low signal-to-noise ratio. In this paper, a segmentation-free method utilizing a Residual Network (ResNet) and a Recurrent Neural Network (RNN) with Connectionist Temporal Classification (CTC) is proposed to recognize Chinese and English text in screen-rendered images. Text lines are first extracted from screen-rendered images to obtain feature sequences. Then, a bidirectional RNN layer models the contextual information within the feature sequences and predicts identification results. Finally, a CTC method is employed to calculate the loss and yield the results. The proposed method achieves the best performance on the ORAND-CAR-A dataset, the ORAND-CAR-B dataset and a generated dataset, with recognition accuracies of 91.89%, 93.79% and 95.67%, respectively. Moreover, experiments on several real screen-rendered text images also demonstrate the effectiveness of the proposed method.
|
|
15:20-17:20, Paper ThPMP.72 | |
Deep Pixel Probabilistic Model for Super Resolution Based on Human Visual Saliency Mechanism |
Gao, Hongxia | South China Univ. of Tech |
Chen, Zhanhong | SCUT |
Ma, Ge | Guangzhou Univ |
Xie, Wang | South China Univ. of Tech |
Li, Zhifu | Guangzhou Univ |
Keywords: Super-resolution, Deep learning, Image quality assessment
Abstract: This work explores super-resolution (SR) with a deep network based on a pixel probabilistic model, where particularly small inputs and large magnification factors make the problem highly underspecified, since fairly large amounts of high-frequency detail are missing from the low-resolution (LR) source. In this paper, we develop a deep architecture comprising a PixelCNN and a residual network (ResNet), in which the PixelCNN predicts the serial dependencies of the pixel sequence and the ResNet captures the global structure of the LR input. A human visual saliency mechanism (HVSM), which applies accurate SR in salient regions and fast interpolation in non-salient regions, is integrated with the pixel probabilistic model to efficiently reduce the computational complexity while maintaining the desired visual quality. Additionally, we present a Bayesian optimization technique to automatically determine the optimal weights of the loss function. Furthermore, a modified image quality assessment that takes the HVSM into account is introduced, aiming to align with human visual perception. Experiments demonstrate that the proposed algorithm generates more plausible facial features than previous deep learning methods, offering finer details and significant improvement in visual quality.
|
|
15:20-17:20, Paper ThPMP.73 | |
Joint Denoising and Super-Resolution Via Generative Adversarial Training |
Chen, Li | Xiamen Univ |
Dan, Wen | Department of Computer Science, Xiamen Univ |
Cao, Liujuan | Xiamen Univ |
Keywords: Super-resolution, Enhancement, restoration and filtering
Abstract: Single image denoising and super-resolution sit at the core of various image processing and pattern recognition applications. Typically, these two tasks are handled separately, without regard to joint reinforcement and learning. The former deals with equal-size pixel-to-pixel translation, while the latter scales up the number of input pixels. In this paper, we propose a Generative Adversarial Network (GAN) for the joint learning of single image denoising and super-resolution. In principle, our design allows both tasks to share several common building blocks, with the linking between both outputs reinforcing each other. Such reinforcement is accomplished by designing a novel generative network and optimizing a novel loss function to achieve both denoising and super-resolution. Quantitative comparisons with a set of alternative approaches and baselines demonstrate the superior performance of our method in denoising and super-resolution with high upscaling factors.
|
|
15:20-17:20, Paper ThPMP.74 | |
Recursive Inception Network for Super-Resolution |
Jiang, Tao | Shaanxi Normal Univ |
Zhang, Yu | Shaanxi Normal Univ |
Shui, Wuyang | Beijing Normal Univ |
Lu, Gang | Shaanxi Normal Univ |
Wu, Xiaojun | Shaanxi Normal Univ |
Guo, Shiqi | George Washington Univ |
Fei, Hao | Shaanxi Normal Univ |
Zhang, Qieshi | Chinese Acad. of Sciences (CAS) |
Keywords: Super-resolution, Deep learning, Image processing and analysis
Abstract: In this paper, we propose a novel network for super-resolution that achieves state-of-the-art performance with limited parameters. Inspired by previous methods, we use a ResNet architecture for residual learning. In addition, an inception-like structure is utilized for feature extraction and for decreasing the model's parameter count. By cascading multi-scale filters with separate paths many times in a deep architecture, the proposed method can fully exploit contextual information over large image regions. Meanwhile, residual learning enables the training phase to converge easily. Extensive experiments demonstrate that our proposed method matches state-of-the-art methods in PSNR, SSIM and IFC with fewer parameters.
|
|
15:20-17:20, Paper ThPMP.75 | |
Weakly Supervised Vehicle Detection in Satellite Images Via Multiple Instance Ranking |
Sheng, Yihan | Xiamen Univ |
Cao, Liujuan | Xiamen Univ |
Keywords: Image processing and analysis, Object recognition
Abstract: Given the difficulty of labeling a sufficient number of instances across the different resolutions and imaging environments of satellite images, weakly supervised vehicle detection is of great importance for satellite image analysis and processing. To avoid such cumbersome and meticulous manual annotation, we adopt weakly supervised detection, which has recently become prevalent for ordinary-view images. Our approach requires only region-level group annotation, i.e., whether a region contains vehicle(s), without explicitly specifying vehicle coordinates. Two major problems are often encountered in weakly supervised object detection. One is that selection often favors the most expressive instance, which frequently contains multiple target objects, since such patches have a higher probability of being chosen as the target block. For this problem, the number of vehicles can be estimated by object counting, and a combinatorial selection algorithm can be used to select patches that contain at most one vehicle instance. The other problem is that precise object positioning becomes more difficult due to the lack of instance-level supervision; this can be alleviated by a progressive learning strategy. Experiments were carried out on a wide-ranging remote sensing dataset and achieved better results than state-of-the-art weakly supervised vehicle detection schemes.
|
|
15:20-17:20, Paper ThPMP.76 | |
Restoration of Sea Surface Temperature Satellite Images Using a Partially Occluded Training Set |
Shibata, Satoki | Kyoto Univ |
Iiyama, Masaaki | Kyoto Univ |
Hashimoto, Atsushi | Kyoto Univ |
Minoh, Michihiko | Kyoto Univ |
Keywords: Inpainting, Deep learning, Applications of pattern recognition and machine learning
Abstract: Sea surface temperature (SST) satellite images are often partially occluded by clouds. Image inpainting is one approach to restoring the occluded regions. Considering the sparseness of SST images, they can be restored via learning-based inpainting. However, state-of-the-art learning-based inpainting methods using deep neural networks require a large number of non-occluded images as a training set. Since most SST images contain occluded regions, it is hard to collect sufficient non-occluded images. In this paper, we propose a novel method that uses occluded images as training images, thereby enlarging the number of training images available from a given SST image set. This is realized by combining a novel reconstruction loss with an adversarial loss. Experimental results confirm the effectiveness of our method.
|
|
15:20-17:20, Paper ThPMP.77 | |
An Attention-Based Approach for Single Image Super Resolution |
Liu, Yuan | Southeast Univ |
Wang, Yuancheng | Southeast Univ. of China |
Li, Nan | Southeast Univ |
Cheng, Xu | Southeast Univ. P.R.C |
Zhang, Yifeng | Southeast Univ |
Huang, Yongming | Southeast Univ |
Lu, Guojun | Monash Univ |
Keywords: Super-resolution, Deep learning, Enhancement, restoration and filtering
Abstract: The main challenge of single image super-resolution (SISR) is the recovery of high-frequency details such as tiny textures. However, most state-of-the-art methods lack specific modules to identify high-frequency areas, causing the output image to be blurred. We propose an attention-based approach that discriminates between texture areas and smooth areas. After the positions of high-frequency details are located, high-frequency compensation is carried out. This approach can be incorporated into previously proposed SISR networks; by providing high-frequency enhancement, better performance and visual quality are achieved. We also propose our own SISR network composed of DenseRes blocks, which provide an effective way to combine low-level and high-level features. Extensive benchmark evaluation shows that our proposed method achieves significant improvement over state-of-the-art SISR works.
|
|
15:20-17:20, Paper ThPMP.78 | |
Compression of Acoustic Model Via Knowledge Distillation and Pruning |
Li, Chenxing | Inst. of Automation, Chinese Acad. of Sciences |
Zhu, Lei | AI Lab, Rokid Inc |
Xu, Shuang | Inst. of Automation, Chinese Acad. of Sciences |
Gao, Peng | AI Lab, Rokid Inc |
Xu, Bo | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Speech recognition, Audio and acoustic processing and analysis, Applications of pattern recognition and machine learning
Abstract: Recently, the performance of neural-network-based speech recognition systems has greatly improved. Arguably, this improvement can mainly be attributed to deeper and wider layers. However, such systems are difficult to deploy on embedded devices due to their large size and high computational complexity. To address these issues, we propose a method to compress deep feed-forward neural network (DNN) based acoustic models. In detail, a state-of-the-art acoustic model is first trained as the baseline; in this step, layer normalization is applied to accelerate model convergence and improve generalization performance. Knowledge distillation and pruning are then conducted to compress the model. Our final model achieves a 14.59× reduction in parameters and a 5× reduction in storage size, with performance comparable to the baseline model.
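For reference, a PyTorch sketch of the standard knowledge-distillation loss such a pipeline relies on; the temperature and mixing weight below are typical values, not the paper's.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # The student matches softened teacher posteriors plus the hard labels.
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard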
|
|
15:20-17:20, Paper ThPMP.79 | |
Super-Resolution Ultrasound Imaging Based on the Phase of the Carrier Wave without Deterioration by Grating Lobes |
Tagawa, Norio | Tokyo Metropolitan Univ |
Zhu, Jing | Tokyo Metropolitan Univ |
Keywords: Sensor array & multichannel signal processing, Super-resolution, Medical image and signal analysis
Abstract: Ultrasound imaging is used in medical applications as well as various other fields, such as nondestructive inspection and environmental measurement. We have previously proposed a super-resolution ultrasound imaging method based on multiple transmissions and receptions with different carrier frequencies, which assumes array transducers with a sufficient number of elements. From the viewpoint of manufacturing cost, a small number of elements is desirable; however, conventional methods then tend to generate grating lobes, which negatively affect performance. In the current study, we develop an approach for avoiding grating lobes by improving on our previous method, yielding a super-resolution method applicable to transducers with a small number of elements.
|
|
15:20-17:20, Paper ThPMP.80 | |
Solar Atmosphere Data Analysis |
Simberova, Stanislava | Astronomical Inst., Acad. of Sciences of the Czech Republic
Suk, Tomáš | Inst. of Information Theory and Automation, Czech Acad. Of |
Keywords: Signal analysis, Applications of pattern recognition and machine learning
Abstract: We present a new approach to studying the behavior of the solar atmosphere, based on processing plasma parameters obtained from remote satellite observations. In this study we focus on physical data characterizing the so-called solar wind. The ever-present dynamic magnetic field affects other physical variables in the solar atmosphere, such as speed, temperature and density. We compute the turning points of the time curves of the magnetic induction magnitude and direction. These serve as estimates of the beginning and end of a shockwave in the solar wind, and also as boundaries of segments for statistical analysis of the magnetic field variables by Kendall rank correlation.
|
|
15:20-17:20, Paper ThPMP.81 | |
Self-Attention Based Network for Punctuation Restoration |
Wang, Feng | Inst. of Automation, Chinese Acad. of Sciences |
Chen, Wei | Inst. of Automation, Chinese Acad. of Sciences |
Yang, Zhen | Chinese Acad. of Science, Inst. of Automation |
Xu, Bo | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Spoken language analysis, Deep learning, Sequence modeling
Abstract: Inserting proper punctuation into Automatic Speech Recognizer (ASR) transcriptions is a challenging and promising task in real-time Spoken Language Translation (SLT). Traditional methods built on the sequence labelling framework are weak at handling joint punctuation. To tackle this problem, we propose a novel self-attention based network that solves the aforementioned problem very well. In this work, a light-weight neural net is proposed to extract the hidden features based solely on self-attention, without any Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs). We conduct extensive experiments on complex punctuation tasks. The experimental results show that the proposed model achieves significant improvements on the joint punctuation task while also being superior to traditional methods on the simple punctuation task.
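A minimal PyTorch sketch of the scaled dot-product self-attention the model relies on instead of RNNs and CNNs; multi-head projections, masking and the punctuation classifier head are omitted.

    import torch

    def self_attention(x):
        # x: (batch, seq_len, dim); x serves as queries, keys and values here.
        d = x.shape[-1]
        scores = torch.bmm(x, x.transpose(1, 2)) / d ** 0.5
        return torch.bmm(torch.softmax(scores, dim=-1), x)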
|
|
15:20-17:20, Paper ThPMP.82 | |
Face Image Super-Resolution Via K-NN Regularized Collaborative Representation with Importance Reweighting |
Liu, Licheng | Hunan Univ |
Li, Shutao | Hunan Univ |
Keywords: Super-resolution, Enhancement, restoration and filtering
Abstract: In visual recognition and surveillance systems, the human face is one of the most important factors. Unfortunately, due to low-cost imaging sensors and complex imaging environments, captured face images are often low-resolution (LR) and corrupted by noise. Noisy LR face images carry limited useful information, which severely degrades the performance of face recognition systems. To address this issue, in this paper we present a K-nearest neighbor (K-NN) Regularized Collaborative Representation (K-RCR) method to simultaneously enhance the resolution of face images and suppress the noise. The proposed K-RCR breaks the bottlenecks of patch-based face super-resolution methods, making it possible to achieve denoising and super-resolution in a unified framework. Specifically, the K-NN selection strategy employs the K most relevant nearest neighbors in the training dataset to collaboratively represent the test patch, leading to a unique and stable solution of the least-squares problem. Moreover, a diagonal weight matrix is incorporated into the objective function to make it more robust to noise. Experimental results on the standard FEI test face dataset demonstrate the superiority of our proposed method over several state-of-the-art face image super-resolution methods.
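A hedged sketch of collaborative representation over K nearest neighbors: the test patch is coded on its K closest training patches under a ridge penalty. The paper's importance-reweighting diagonal matrix is omitted here.

    import numpy as np

    def knn_collaborative_code(y, D, K=50, lam=0.01):
        # y: (d,) test patch; D: (n, d) training patches (one per row).
        idx = np.argsort(((D - y) ** 2).sum(axis=1))[:K]   # K nearest neighbors
        Dk = D[idx].T                                      # (d, K) local dictionary
        alpha = np.linalg.solve(Dk.T @ Dk + lam * np.eye(K), Dk.T @ y)
        return idx, alpha       # a patch is then synthesized from Dk @ alpha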
|
|
15:20-17:20, Paper ThPMP.83 | |
Global Contrast Enhancement Detection Via Deep Multi-Path Network |
Zhang, Cong | Univ. of Chinese Acad. of Sciences |
Du, Dawei | Univ. of Chinese Acad. of Sciences |
Ke, Lipeng | Univ. of Chinese Acad. of Sciences |
Qi, Honggang | Univ. of Chinese Acad. of Sciences |
Lyu, Siwei | SUNY Albany |
Keywords: Information forensics and security
Abstract: Identifying global contrast enhancement in an image is an important task in image forensics. Several previous methods analyze the “peak-gap” fingerprints in gray-level histograms. However, images in real scenarios are often stored in the JPEG format with middle/low compression quality, resulting in a less obvious “peak-gap” effect and thus unsatisfactory performance. In this paper, we propose a novel deep Multi-Path Network (MPNet) based approach to learn discriminative features from gray-level histograms. Specifically, given the histograms, their high-level peak and gap information can be exploited effectively after several shared convolutional layers in the network, even in middle/low-quality compressed images. Moreover, the proposed multi-path module is able to focus on specific forensic operations, providing more robustness to image compression. Experiments on three challenging datasets (i.e., Dresden, RAISE and UCID) demonstrate the effectiveness of the proposed method compared to existing methods.
|
|
15:20-17:20, Paper ThPMP.84 | |
Deep Joint Rain and Haze Removal from a Single Image |
Shen, Liang | Huazhong Univ. of Science and Tech |
Yue, Zihan | Huazhong Univ. of Science and Tech |
Chen, Quan | School of Automation, Huazhong Univ. of Science and T |
Feng, Fan | Huazhong Univ. of Science and Tech |
Ma, Jie | Huazhong Univ. of Science and Tech |
Keywords: Enhancement, restoration and filtering, Deep learning, Multitask learning
Abstract: Rain removal from a single image is a challenge that has been studied for a long time. In this paper, a novel convolutional neural network based on the wavelet transform and the dark channel is proposed. On one hand, rain streaks correspond to the high-frequency components of the image, so the Haar wavelet transform is a good choice for separating the rain streaks from the background to some extent. More specifically, the LL subband of a rain image tends to express the background information, while the HL and LH subbands tend to represent the rain streaks and the edges, respectively. On the other hand, the accumulation of rain streaks over long distances makes the rain image look veiled by haze. We therefore extract the dark channel of the rain image as a feature map in the network; by introducing this mapping between the dark channels of the input and output images, we achieve haze removal in an indirect way. All parameters are optimized by back-propagation. Experiments on both synthetic and real-world datasets show that our method outperforms other state-of-the-art methods both qualitatively and quantitatively.
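For reference, the dark channel used as an input feature map is the per-pixel channel minimum followed by a local minimum filter (in the sense of He et al.); the window size below is illustrative.

    import numpy as np
    from scipy.ndimage import minimum_filter

    def dark_channel(img, size=15):
        # img: (H, W, 3) array in [0, 1]; returns the (H, W) dark channel.
        return minimum_filter(img.min(axis=2), size=size)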
|
|
15:20-17:20, Paper ThPMP.85 | |
Getting Rid of Night: Thermal Image Classification Based on Feature Fusion |
Lu, Guoyu | Rochester Inst. of Tech |
Yu, Huili | Delphi Automotive Systems, LLC |
Yuan, Chun | Tsinghua Univ |
Keywords: Multimedia analysis, indexing and retrieval, Image processing and analysis, Sensor array & multichannel signal processing
Abstract: Thermal images are essential for dealing with dark environments, as they capture objects' temperatures. While objects can still be seen in thermal images, their texture is extremely blurred or even unobservable. We propose to extract different features that capture various characteristics of the images. As each feature emphasizes one distinguishing aspect, we can gather multiple pieces of evidence from the images and take advantage of each to improve thermal image classification accuracy. In particular, in addition to the corner features usually used in color images, we extract features from the edges and shapes of objects that emphasize the integral image appearance, as well as temperature characteristics obtained from the image intensity. In this way, even if one feature is not evident in an image, the others can still play a critical role in reaching the correct classification result. By optimizing the objective function, we maximize the fusion performance of the multiple features and thereby make maximal use of the information exhibited in thermal images to classify a query image into the correct group. Experiments demonstrate promising thermal image classification results.
|
|
15:20-17:20, Paper ThPMP.86 | |
Video-Based Emotion Recognition Using Aggregated Features and Spatio-Temporal Information |
Xu, Jinchang | Beijing Univ. of Posts and Telecommunications |
Dong, Yuan | Beijing Univ. of Posts and Telecommunications |
Ma, Lilei | Beijing Univ. of Posts and Telecommunications |
Bai, Hongliang | Beijing Faceall Co.Ltd |
Keywords: Emotion recognition, Deep learning
Abstract: In this paper, we present a video-based emotion recognition system for the wild which consists of four pipeline modules: image processing, deep feature extraction, feature aggregation and emotion classification. Our method focuses on different feature descriptors. To obtain high-level features that are more discriminative for emotion recognition, we employ an aggregation of features extracted from different deep convolutional neural networks (CNNs). Furthermore, a long short-term memory network (LSTM) and 3D convolutional networks (C3D) are utilized to extract spatio-temporal features from videos, combining spatial and temporal information. Additionally, we evaluate our method on the 5th Emotion Recognition in the Wild Challenge in the category of video-based emotion recognition, and the results show that our proposed system achieves better performance.
|
|
15:20-17:20, Paper ThPMP.87 | |
Self-Talk Responses to Users' Opinions and Challenge in Human Computer Dialog |
Yang, Minghao | National Lab. of Pattern Recognition (NLPR) Inst. of A |
Zhang, Ke | Guilin Univ. of Electronic Tech |
Na, ShengRuoYang | Inst. of Automation, Chinese Acad. of Sciences |
Tao, Jianhua | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Spoken language analysis, Speech and natural language based interaction
Abstract: People like to be encouraged, at least partly, when their opinions or challenges are supported by listeners, even when the listeners are robots. Encouraging responses from a robot that seems to get the user's point can potentially improve the user's experience in human-computer dialog. Based on this hypothesis, this paper proposes a method to generate supporting responses to users' opinions or challenges. The core ideas and contributions of the proposed method are: (1) multiple search engines cooperate; (2) each engine randomly queries itself or another engine over multiple turns to obtain more related information from the internet; and (3) final responses are abstracted from the answers. We call these three steps Self-Talk. Comparisons between Self-Talk and several commercial speech assistants show that the proposed method generates suitable answers when users present their opinions or challenges in dialog. The hypothesis that encouraging responses improve users' chat experience is positively evaluated.
|
|
15:20-17:20, Paper ThPMP.88 | |
Deep Conditional Color Harmony Model for Image Aesthetic Assessment |
Lu, Peng | Beijing Univ. of Posts and Telecommunications |
Yu, Jinbei | Beijing Univ. of Posts and Telecommunications |
Peng, Xujun | Univ. of Southern California |
Keywords: Image quality assessment, Image processing and analysis, Image classification
Abstract: As one of the most important features, color provides plenty of useful information for representing images. Thus color harmony, defined as "two or more colors sensed together as a single, pleasing, collective impression", can also serve as a fundamental feature and plays a key role in determining the aesthetic quality of images. To reveal the inherent color harmony attribute within a patch and the harmonious relations between the image patches that construct pleasing colorful images, we design a conditional random field (CRF) based color harmony model to accomplish image aesthetic assessment tasks. Unlike previous learning-based color harmony models, we use deep neural networks to obtain the coherence properties between original image patch pairs, and embed these relations, along with each patch's own color harmony characteristics, into a CRF to measure the harmony score of the entire image. Experimental results on a public dataset show that the proposed deep conditional color harmony model is superior to existing color harmony models for image aesthetic assessment.
|
|
15:20-17:20, Paper ThPMP.89 | |
Visualization of Hyperspectral Images Using Moving Least Squares |
Liao, Danping | Zhejiang Univ |
Chen, Siyu | Zhejiang Univ |
Qian, Yuntao | Zhejiang Univ |
Keywords: Image processing and analysis, Sensor array & multichannel signal processing, Color analysis
Abstract: Displaying the large number of bands in a hyperspectral image (HSI) on a trichromatic monitor has been an active research topic. The visualized image should convey as much information as possible from the original data and facilitate image interpretation. Most existing methods display HSIs in false colors, which contradict human experience and expectation. In this paper, we propose a nonlinear approach to visualize an input HSI with natural colors by taking advantage of a corresponding RGB image. Our approach is based on Moving Least Squares (MLS), an interpolation scheme for reconstructing a surface from a set of control points, which in our case is a set of matching pixels between the HSI and the corresponding RGB image. Based on MLS, the proposed method solves for each spectral signature a unique transformation so that the nonlinear structure of the HSI can be preserved. The matching pixels between a pair of HSI and RGB images can be reused to display other HSIs captured by the same imaging sensor with natural colors. Experiments show that the output images of the proposed method not only have natural colors but also maintain the visual information necessary for human analysis.
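A hedged sketch of an MLS-style mapping for one query spectrum, assuming an affine spectrum-to-RGB model fitted by weighted least squares with weights that fall off with spectral distance to the control pairs:

    import numpy as np

    def mls_color(query, ctrl_spectra, ctrl_rgb, sigma=1.0):
        # query: (b,) spectrum; ctrl_spectra: (n, b); ctrl_rgb: (n, 3).
        d2 = ((ctrl_spectra - query) ** 2).sum(axis=1)
        w = np.sqrt(np.exp(-d2 / (2.0 * sigma ** 2)))[:, None]     # WLS weights
        A = np.hstack([ctrl_spectra, np.ones((len(ctrl_rgb), 1))])  # affine design
        theta, *_ = np.linalg.lstsq(w * A, w * ctrl_rgb, rcond=None)
        return np.append(query, 1.0) @ theta   # natural-color RGB for this pixel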
|
|
15:20-17:20, Paper ThPMP.90 | |
An Automated Point Set Registration Framework for Multimodal Retinal Image |
Zhang, Haotian | Tongji Univ |
Liu, Xianhui | Tongji Univ |
Wang, Gang | Shanghai Univ. of Finance and Ec |
Chen, Yufei | Tongji Univ |
Zhao, Weidong | Tongji Univ |
Keywords: Image processing and analysis, Medical image and signal analysis, Segmentation, features and descriptors
Abstract: Multimodal retinal image registration plays an important role in medical image analysis. In this field, retinal images from different modalities are aligned to produce a fused image that is easier to evaluate for diagnosis. One of the challenging problems addressed in this paper is the low success rate in multimodal retinal image registration. An automated point set registration framework is proposed to solve this problem. The framework includes three parts: feature point extraction with robust initial point matching, matching postprocessing with adaptive mismatch removal, and transformation estimation. The experimental results show that our proposed framework is robust to outliers and repeated patterns, and that it obtains more stable and accurate results than state-of-the-art methods.
|
|
15:20-17:20, Paper ThPMP.91 | |
Quality Classified Image Analysis with Application to Face Detection and Recognition |
Yang, Fei | Univ. of Nottingham Ningbo China |
Zhang, Qian | Univ. of Nottingham Ningbo China |
Wang, Miaohui | Coll. of Information Engineering, Shenzhen Univ |
Qiu, Guoping | Univ. of Nottingham |
Keywords: Image quality assessment, Image processing and analysis, Signal analysis
Abstract: Motion blur, defocus, insufficient spatial resolution, lossy compression and many other factors can all cause an image to have poor quality. However, image quality is a largely ignored issue in the traditional pattern recognition literature. In this paper, we use face detection and recognition as case studies to show that image quality is an essential factor affecting the performance of traditional algorithms. We demonstrate that it is not the image quality itself that matters most, but rather that the images in the training set should have similar quality to those in the testing set. To handle real-world application scenarios, where images with different kinds and severities of degradation can be presented to the system, we have developed a quality classified image analysis framework to deal with images of mixed qualities adaptively. We first use deep neural networks to classify images based on their quality classes, and then design a separate face detector and recognizer for the images in each quality class. We present experimental results showing that our quality classified framework can accurately classify images based on the type and severity of image degradation, and can significantly boost the performance of state-of-the-art face detectors and recognizers on image datasets containing mixed-quality images.
|
|
15:20-17:20, Paper ThPMP.92 | |
Vessel Enhancement Based on Length-Constrained Hessian Information |
Shi, Zhenhui | Shanghai Jiao Tong Univ |
Xie, Hongzhi | Department of Cardiology, Peking Union Medical Coll. Hospital |
Zhang, Jingyang | Shanghai Jiao Tong Univ |
Liu, Jie | SJTU |
Gu, Lixu | Shanghai Jiao Tong Univ |
Keywords: Enhancement, restoration and filtering, Medical image and signal analysis
Abstract: Vessel enhancement is an important pre-processing step in vessel image analysis applications. However, most current methods are developed merely based on the intensity variation inside and outside vessels, without considering the vessel path, which emphasizes vascular structures by characterizing additional connectivity and length information. Aiming to further utilize the beneficial length information of vessels, we propose a novel method that imposes a length constraint on Hessian information for vessel enhancement. Specifically, an eigen-analysis of the multiscale Hessian matrix is performed at each pixel to obtain the local vesselness response and direction information. Then, the vessel path is searched along each pixel’s direction while maintaining curvilinear smoothness. The proposed method is compared with three conventional vessel enhancement methods. The experimental results show that our approach has the advantages of a fine response in low-contrast vessel regions and less background noise. In addition, the quantitative evaluation indicates that state-of-the-art vessel enhancement performance can be achieved compared with other methods.
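As background on the Hessian eigen-analysis step, the sketch below computes a basic 2D Frangi-style vesselness response at a single scale. It illustrates the standard technique the abstract builds on, not the paper's length-constrained model, and the parameter values are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def vesselness_2d(image, sigma=2.0, beta=0.5, c=15.0):
    """Single-scale Frangi-style vesselness from Hessian eigenvalues."""
    # Second-order Gaussian derivatives approximate the Hessian at scale sigma.
    Hxx = gaussian_filter(image, sigma, order=(0, 2)) * sigma**2
    Hyy = gaussian_filter(image, sigma, order=(2, 0)) * sigma**2
    Hxy = gaussian_filter(image, sigma, order=(1, 1)) * sigma**2
    # Closed-form eigenvalues of the symmetric 2x2 Hessian.
    tmp = np.sqrt((Hxx - Hyy) ** 2 + 4 * Hxy**2)
    l1, l2 = (Hxx + Hyy + tmp) / 2, (Hxx + Hyy - tmp) / 2
    # Order so |l1| <= |l2|; for bright tubes on dark background, l2 << 0.
    swap = np.abs(l1) > np.abs(l2)
    l1, l2 = np.where(swap, l2, l1), np.where(swap, l1, l2)
    Rb = np.abs(l1) / (np.abs(l2) + 1e-10)   # blob-vs-line measure
    S = np.sqrt(l1**2 + l2**2)               # second-order "structureness"
    v = np.exp(-Rb**2 / (2 * beta**2)) * (1 - np.exp(-S**2 / (2 * c**2)))
    return np.where(l2 < 0, v, 0.0)          # suppress non-vessel responses
```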
|
|
15:20-17:20, Paper ThPMP.93 | |
Reducing Tongue Shape Dimensionality from Hundreds of Available Resources Using Autoencoder |
Yang, Minghao | National Lab. of Pattern Recognition (NLPR) Inst. of A |
Tao, Jianhua | Inst. of Automation, Chinese Acad. of Sciences |
Dawei, Zhang | National Lab. of Pattern Recognition (NLPR), Inst. Of |
Keywords: Spoken language analysis, Neural networks
Abstract: In spite of various observation tools, tongue shapes remain a scarce resource in reality. The autoencoder, a kind of deep neural network (DNN), performs well on data reduction and pattern discovery. However, since autoencoders usually need large-scale training data, it is challenging for a traditional autoencoder to learn tongue motion patterns from only tens or hundreds of available tongue shapes. To overcome this problem, we propose a two-step autoencoder: we first construct a stacked denoising autoencoder (dAE) to learn the essential representation of the tongue shapes from their possible deformations; then an additional autoencoder with a small number of hidden units is added on top of the stacked autoencoder and used for dimensionality reduction. Experiments were run on 240 vowel tongue shapes obtained from X-ray films of Chinese speakers' pronunciations, and the proposed model is compared in detail with the traditional dAE and classical principal component analysis (PCA) on dimensionality reduction and reconstruction. The results validate the performance of the proposed tongue model.
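To make the two-step idea concrete, here is a minimal PyTorch sketch, an illustrative reading of the abstract rather than the authors' code: a denoising autoencoder is trained on corrupted shape vectors first, then a small bottleneck autoencoder is trained on its hidden codes. The layer sizes and noise level are assumptions.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Step 1: learn a robust representation from corrupted tongue shapes."""
    def __init__(self, dim_in, dim_hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU())
        self.dec = nn.Linear(dim_hidden, dim_in)

    def forward(self, x, noise_std=0.1):
        x_noisy = x + noise_std * torch.randn_like(x)  # input corruption
        h = self.enc(x_noisy)
        return self.dec(h), h

class BottleneckAE(nn.Module):
    """Step 2: reduce the learned codes to a few dimensions."""
    def __init__(self, dim_hidden=128, dim_low=3):
        super().__init__()
        self.enc = nn.Linear(dim_hidden, dim_low)
        self.dec = nn.Linear(dim_low, dim_hidden)

    def forward(self, h):
        z = self.enc(h)
        return self.dec(z), z

def train(model, data, epochs=200, lr=1e-3):
    """Train an autoencoder to reconstruct the clean `data`."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = model(data)
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# usage sketch: X is an (n, d) float tensor of flattened tongue shapes
# dae = train(DenoisingAE(dim_in=X.shape[1]), X)
# H = dae.enc(X).detach(); bae = train(BottleneckAE(), H)
```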
|
|
15:20-17:20, Paper ThPMP.94 | |
One-Class SVMs Based Pronunciation Verification Approach |
Mostafa Shahin, Mostafa Shahin | Texas A&M Univ |
Ji, Jim Xiuquan | Electrical and Computer Engineering Program, Texas A&M Univ |
Ahmed, Beena | Electrical and Computer Engineering Program, Texas A&M Univ |
Keywords: Speech recognition, Deep learning
Abstract: The automatic assessment of speech plays an important role in Computer Aided Pronunciation Learning systems. However, modeling both the correct and incorrect pronunciation of each phoneme to achieve accurate pronunciation verification is infeasible due to the lack of sufficient mispronounced samples in training datasets. In this paper, we propose a novel approach that handles this unbalanced data distribution by building multiple one-class SVMs to evaluate each phoneme as correct or incorrect. We model the correct pronunciation of each individual phoneme with a one-class SVM trained on a set of speech attribute features, namely the manner and place of articulation. These features are extracted from a bank of pre-trained DNN speech attribute classifiers. The one-class SVM model measures the similarity between new data and the training set and then classifies it as normal (correct) or an anomaly (incorrect). We evaluated the system on a native speech corpus and a disordered speech corpus, and compared it with the conventional Goodness of Pronunciation (GOP) algorithm. The results show that our approach reduces the false-acceptance and false-rejection rates by around 26% and 39%, respectively.
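The per-phoneme anomaly-detection setup can be sketched with scikit-learn's OneClassSVM as below. This illustrates the general technique only; the feature extraction and parameter values are assumptions, with random stand-in features replacing the DNN speech-attribute posteriors.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def train_phoneme_models(features_by_phoneme, nu=0.1, gamma="scale"):
    """Fit one one-class SVM per phoneme on correctly pronounced samples.

    features_by_phoneme: dict mapping phoneme -> (n_samples, n_features)
    array of speech-attribute features (e.g., manner/place posteriors).
    """
    return {ph: OneClassSVM(nu=nu, gamma=gamma).fit(X)
            for ph, X in features_by_phoneme.items()}

def verify(models, phoneme, x):
    """Return True if the sample looks like a correct pronunciation."""
    # predict() yields +1 for inliers (correct) and -1 for anomalies.
    return models[phoneme].predict(x.reshape(1, -1))[0] == 1

# usage sketch with random stand-in features for two phonemes
rng = np.random.default_rng(0)
models = train_phoneme_models({"AA": rng.normal(size=(50, 8)),
                               "IY": rng.normal(size=(50, 8))})
print(verify(models, "AA", rng.normal(size=8)))
```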
|
|
15:20-17:20, Paper ThPMP.95 | |
Shot Level Egocentric Video Co-Summarization |
Sahu, Abhimanyu | Jadavpur Univ |
Chowdhury, Ananda | Jadavpur Univ |
Keywords: Video processing and analysis, Graph matching, Applications of pattern recognition and machine learning
Abstract: Video co-summarization has emerged as an important problem in the computer vision and multimedia communities. In this paper, we present a novel approach to co-summarizing egocentric videos at the shot level. Our solution pipeline consists of three major components. First, we develop a new way of characterizing egocentric video frames by computing the differences in contrast, entropy and optic flow values between a central region and the surrounding region of a frame; this is termed the center-surround model. Next, visual similarity between a test video shot and a database video shot is derived using a game-theoretic framework: each video shot is modelled as a player, and the expected pay-off difference between any two such players at mixed Nash equilibrium is taken as the similarity between them. A weighted bipartite graph is then constructed between the shots in a test video and those in a database video, with the game-theoretic similarity values as the weights. Maximum Cardinality Minimum Weight matching in the bipartite graph yields non-greedy shot correspondences, and the best-matched shots from the test video form the summary. Experimental comparisons on standard datasets clearly indicate the advantage of our solution.
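The final matching step can be illustrated with SciPy's assignment solver, which computes a minimum-weight matching on a bipartite cost matrix. This is a generic sketch of the matching idea rather than the paper's exact Maximum Cardinality Minimum Weight formulation, and the cost construction is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j]: dissimilarity between test shot i and database shot j,
# e.g. 1 - game_theoretic_similarity(i, j); random stand-in values here.
rng = np.random.default_rng(0)
cost = rng.random((5, 8))

# Minimum-weight matching over the bipartite graph (rectangular is fine:
# every test shot is matched to a distinct database shot).
rows, cols = linear_sum_assignment(cost)
for i, j in zip(rows, cols):
    print(f"test shot {i} <-> database shot {j} (cost {cost[i, j]:.3f})")
```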
|
|
15:20-17:20, Paper ThPMP.96 | |
Visual Localization of Key Positions for Visually Impaired People |
Cheng, Ruiqi | Zhejiang Univ |
Wang, Kaiwei | Zhejiang Univ |
Lin, Longqing | Kr Vision Tech. Company Limited |
Yang, Kailun | Zhejiang Univ |
Keywords: Sensor array & multichannel signal processing, Applications of computer vision, Image classification
Abstract: On off-the-shelf navigational assistance devices, localization precision is limited by the signal error of the global navigation satellite system (GNSS). When travelling outdoors, inaccurate localization perplexes visually impaired people, especially at key positions such as gates, bus stations or intersections. Visual localization is a feasible approach to improving the positioning precision of assistive devices. Using multiple image descriptors, this paper proposes a robust and efficient visual localization algorithm, which takes advantage of prior GNSS signals and multi-modal images to achieve accurate localization of key positions. In the experiments, we implement the approach on a wearable system and test the performance of visual localization under practical scenarios.
|
|
15:20-17:20, Paper ThPMP.97 | |
A New Foreground Segmentation Method for Video Analysis in Different Color Spaces |
Shi, Hang | New Jersey Inst. of Tech |
Liu, Chengjun | New Jersey Inst. of Tech |
Keywords: Segmentation, features and descriptors, Density estimation
Abstract: A new foreground segmentation method is presented in this paper for video analysis. Specifically, a new feature representation scheme is first proposed in different color spaces, namely, the RGB, the YIQ, and the YCbCr color spaces. The new feature vector, which integrates the color values in a particular color space, the horizontal and vertical Haar wavelet features, and the temporal difference features, enhances the discriminatory power. A new Global Foreground Modeling (GFM) method is then presented to improve upon the popular video analysis approaches. The Bayes classifier is finally applied for foreground segmentation in video. Experimental results using the New Jersey Department of Transportation (NJDOT) traffic video sequences show that the new foreground segmentation method achieves better performance than the popular video analysis methods.
|
|
15:20-17:20, Paper ThPMP.98 | |
Classification Guided Deep Convolutional Network for Compressed Sensing |
Cui, Wenxue | Harbin Inst. of Tech |
Zhang, Shengping | Harbin Inst. of Tech |
Liu, Yashu | Harbin Inst. of Tech |
Xu, Heyao | Harbin Inst. of Tech |
Gao, Xinwei | Wechat Business Group |
Jiang, Feng | Harbin Inst. of Tech |
Zhao, Debin | Harbin Inst. of Tech |
Liu, Shaohui | Harbin Inst. of Tech |
Keywords: Image and video coding, Deep learning, Image processing and analysis
Abstract: Compressed Sensing (CS) has been successfully applied to image compression in the past few years. However, several challenges still restrict its application in practice, including the large memory requirement and unsatisfactory reconstruction performance. To address these challenges, in this paper we propose a classification guided deep convolutional network for image compressed sensing (CCSNet), which includes a sampling sub-network and a reconstruction sub-network. In the sampling sub-network, multiple convolutional layers are used to sample the original image, which significantly reduces the parameters of the sampling matrix while only moderately degrading performance compared with existing convolution-based sampling methods. In the reconstruction sub-network, a novel two-branch architecture is proposed to improve the adaptability of the model to the various textures in natural images. The first branch, the classification branch, classifies the sampled measurements of the original image into one of the predefined textural classes. The second branch, the reconstruction branch, consists of multiple sub-branches that are responsible for reconstructing the original images belonging to the corresponding textural classes. By jointly utilizing the two sub-networks, the entire network can be trained end-to-end with a joint loss function. Experimental results demonstrate that the proposed method provides a significant quality improvement in terms of PSNR compared with state-of-the-art methods.
|
|
15:20-17:20, Paper ThPMP.99 | |
Spatio-Temporal Laban Features for Dance Style Recognition |
Dewan, Swati | International Inst. of Information Tech |
Agarwal, Shubham | IIIT |
Singh, Navjyoti | IIIT |
Keywords: Video processing and analysis, Segmentation, features and descriptors, Applications of pattern recognition and machine learning
Abstract: This work targets Dance Style Recognition in videos as an application of Human Action Recognition. We propose a novel Spatio-Temporal Laban Feature descriptor (STLF) for dance style recognition based on Laban theory. Laban Movement Analysis has become increasingly popular as a language to describe, index and record human motion. We exploit only motion features and body-pose information, without encoding appearance. The model is tested on action recognition benchmarks and on ICD, a challenging dataset of YouTube dance videos. Unlike other works, where Laban-based features have been used in constrained environments with static cameras, sensors and no background noise, we employ STLF on videos in unconstrained and natural settings. It is robust to camera jitter, zoom variations and other acquisition conditions, and it is computationally cheap. It performs comparably to or better than the state of the art.
|
|
15:20-17:20, Paper ThPMP.100 | |
Heterogeneous Image Change Detection Using Deep Canonical Correlation Analysis |
Yang, Jing | Tianjin Univ |
Zhou, Yuan | Tianjin Univ |
Cao, Ying | China Mobile Communications Corp. Tianjin Branch |
Feng, Liyang | Tianjin Univ |
Keywords: Sensor array & multichannel signal processing, Multiview learning, Applications of pattern recognition and machine learning
Abstract: Cross-sensor change detection is nowadays of paramount importance for earth observation applications. Most current change detection techniques are based on homogeneous input images. Owing to the detailed and complementary spatial and spectral information they provide, heterogeneous image change detection has become an active research topic. Change detection models need effective feature representations to estimate changes of interest. Although great progress has been made, existing approaches mainly rely on shallow models, which only extract hand-crafted low-level features. To this end, this paper proposes a novel heterogeneous change detection method using deep canonical correlation analysis (DCCA). Specifically, the two heterogeneous images are transformed via a deep neural network and projected into a common latent space at the output layer. Experiments on commonly used homogeneous and heterogeneous image datasets demonstrate the superiority of the proposed method over traditional approaches.
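For background, deep CCA generalizes classical (linear) CCA, whose objective the following NumPy sketch computes directly: it finds projections of two views that are maximally correlated, which is the "common latent space" idea the abstract relies on. This is the standard linear formulation, not the paper's deep network, and the regularization value is an assumption.

```python
import numpy as np

def linear_cca(X, Y, k=2, reg=1e-4):
    """Classical (linear) CCA: find projections maximizing correlation.

    X: (n, dx), Y: (n, dy) -- two views (e.g., two sensors) of n sites.
    Returns projection matrices Wx (dx, k) and Wy (dy, k).
    """
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whitening transforms from Cholesky factors: Sx Cxx Sx^T = I.
    Sx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Sy = np.linalg.inv(np.linalg.cholesky(Cyy))
    # SVD of the whitened cross-covariance gives the canonical directions.
    U, s, Vt = np.linalg.svd(Sx @ Cxy @ Sy.T)
    return Sx.T @ U[:, :k], Sy.T @ Vt[:k].T
```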
|
|
15:20-17:20, Paper ThPMP.101 | |
Integrating Local and Non-Local Denoiser Priors for Image Restoration |
Gu, Shuhang | Huazhong Univ. of Science&Tech. Wuhan , PR China |
Timofte, Radu | ETH Zurich |
Van Gool, Luc | ETH Zurich and Univ. of Leuven |
Keywords: Enhancement, restoration and filtering, Super-resolution, Low-level vision
Abstract: Image local structural priors and non-local self-similarity (NSS) priors are two categories of priors that have been commonly used for solving the ill-posed image restoration problem. As they exploit different properties of natural images, it is interesting to investigate whether the two categories of priors can be integrated to achieve better restoration performance. Inspired by the recently proposed Regularization by Denoising idea, we propose LNIR, which implicitly incorporates a Local CNN denoiser prior and an NSS-based denoiser prior for Image Restoration. Our experimental results on the image deblurring and super-resolution tasks demonstrate the effectiveness of the proposed method. The proposed LNIR algorithm not only flexibly adapts to different restoration tasks, but also delivers state-of-the-art restoration results.
|
|
15:20-17:20, Paper ThPMP.102 | |
Dynamic Facial Expression Synthesis Driven by Deformable Semantic Parts |
Gong, Nanxue | Xi'an Jiaotong Univ |
Yang, Yang | Xi'an Jiaotong Univ |
Liu, Yuehu | Xi'an Jiaotong Univ |
Liu, Dingdong | Xi'an Jiaotong Univ |
Keywords: Image processing and analysis, Structured prediction, Applications of pattern recognition and machine learning
Abstract: Dynamic facial expression synthesis has wide applications in human-computer interaction and virtual reality. Popular data-driven synthesis methods such as the generative adversarial network (GAN) have made great progress in generating single face images, but have not performed well on expression sequences. To solve this problem, we design a series of deformable semantic parts to represent facial geometric movement, and we synthesize the facial appearance driven by this geometry under the state-of-the-art pix2pixHD framework. In order to maintain the person's identity across the image sequence, we utilize an encoder to constrain the attributes of the target face. With the above efforts, our method is capable of synthesizing satisfactory dynamic facial expression sequences.
|
|
15:20-17:20, Paper ThPMP.103 | |
Skip-Connected Deep Convolutional Autoencoder for Restoration of Document Images |
Zhao, GuoPing | Renmin Univ. of China |
Liu, Jiajun | Renmin Univ. of China |
Jiang, Jiacheng | Renmin Univ. of China |
Guan, Hua | Renmin Univ. of China |
Wen, Ji-Rong | Renmin Univ. of China |
Keywords: Enhancement, restoration and filtering, Document image processing, Deep learning
Abstract: Denoising and deblurring are the two essential restoration tasks in document image processing. As preprocessing stages of the pipeline, the quality of denoising and deblurring heavily influences the results of subsequent tasks, such as character detection and recognition. In this paper, we propose a novel neural method for restoring document images. We name our network the Skip-Connected Deep Convolutional Autoencoder (SCDCA); it is composed of multiple convolution layers, each followed by batch normalization and the leaky rectified linear unit (Leaky ReLU) activation function. Inspired by the idea of residual learning, we use two types of skip connections in the network: identity mappings between convolution layers, and a connection between the input and output. Through these connections, the network learns the residual between the noisy and clean images instead of learning an ordinary transformation function. We empirically evaluate our algorithm on an open and challenging document image dataset, and also assess the restoration results using an optical character recognition (OCR) test. Experimental results demonstrate the effectiveness and efficiency of the proposed algorithm in comparison with several state-of-the-art methods.
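A minimal PyTorch sketch of the two skip-connection types described above: layer-to-layer identity mappings plus a global input-to-output connection, so the network learns a residual. The depth and channel counts are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> Leaky ReLU, as described in the abstract."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)  # identity skip between convolution layers

class SkipConnectedAE(nn.Module):
    def __init__(self, ch=64, depth=4):
        super().__init__()
        self.head = nn.Conv2d(1, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ConvBlock(ch) for _ in range(depth)])
        self.tail = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):
        # Global skip: the network predicts the residual (noise/blur),
        # which is added back onto the degraded input.
        return x + self.tail(self.blocks(self.head(x)))

# usage sketch: restored = SkipConnectedAE()(noisy)  # (N, 1, H, W) tensors
```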
|
|
15:20-17:20, Paper ThPMP.104 | |
Semantic Music Annotation by Label-Specific Conditional Random Fields |
Wang, Qianqian | Nanjing Univ |
Xiong, Yu | Nanjing Univ |
Su, Feng | Nanjing Univ |
Keywords: Multimedia analysis, indexing and retrieval, Audio and acoustic processing and analysis
Abstract: Music annotation is the task of automatically assigning a set of semantically meaningful text labels to a music piece, which is of great value to many music applications such as music search, indexing, recommendation and management. In this paper, we propose a novel music annotation method that integrates feature-to-label correspondence, label smoothness and local-to-global annotation consistency in a conditional random field (CRF) model with label-specific feature learning. For a music piece to be annotated, we first divide the music into a set of acoustically homogeneous segments and infer the relevant labels of every segment using the CRF models corresponding to the respective labels. These local annotations are then aggregated to obtain the holistic annotation of the music. Experiments on the public CAL500 music annotation dataset demonstrate the effectiveness of the proposed method.
|
|
15:20-17:20, Paper ThPMP.105 | |
Automatic Feature Extraction for Wide-Angle and Fish-Eye Camera Calibration |
Fasogbon, Peter | Nokia Tech |
Fan, Lixin | Nokia Tech |
Keywords: Image processing and analysis
Abstract: The increasing need for 360-degree perception of the environment has led to combinations of various cameras with narrow and wide-angle fields of view. Current state-of-the-art methods do not provide an automatic means of fish-eye calibration, which is indispensable in industrial environments where many lenses must be calibrated in a relatively short time. As automatic feature extraction is the key issue for a fully automatic calibration framework, we address this issue to remove any form of human intervention from the calibration process. Unlike state-of-the-art methods, the proposed framework is completely automatic and not prone to detection errors. We evaluate the proposed feature extraction using a state-of-the-art generic calibration model with both real and synthetic images.
|
|
15:20-17:20, Paper ThPMP.106 | |
Delving into the Synthesizability of Dynamic Texture Samples |
Yang, Feng | Wuhan Univ |
Xia, Gui-Song | Wuhan Univ |
Dai, Dengxin | Signal Processing Lab, Wuhan Univ. Wuhan, China |
Zhang, Liangpei | State Key Lab. LIESMARS, Wuhan Univ |
Keywords: Texture analysis, Low-level vision
Abstract: Example-based dynamic texture synthesis (EDTS) methods have emerged in multitude, dedicated to generating new dynamic textures (DTs) of high quality from an input exemplar. The problem of EDTS has been studied for several decades, but none of the existing synthesis methods can tackle all kinds of dynamic textures equally well. Rather than focusing on new synthesis methods, we take another route to help EDTS by investigating dynamic texture synthesizability: how synthesizable a specific dynamic texture sample is by EDTS. We propose to predict the synthesizability score of a given dynamic texture sample and to suggest which EDTS method is best suited to synthesize it. To this end, we compiled a dynamic texture dataset and annotated each DT in terms of synthesizability. We address the problem of learning dynamic texture synthesizability by training a regression-based predictor on the collected data. More precisely, we first characterize DT samples by a set of spatiotemporal features. Then, based on these dynamic texture descriptors, we train regression models to estimate synthesizability scores and use an additional classifier to choose the optimal EDTS method. The experiments demonstrate that our method can predict the synthesizability of DT samples effectively.
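The learning setup, a regressor for the synthesizability score plus a classifier choosing the best method, can be sketched generically with scikit-learn. The features and labels below are random stand-ins, and the estimator choices are assumptions rather than the paper's models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))          # spatiotemporal DT descriptors
score = rng.random(200)                 # annotated synthesizability in [0, 1]
best_method = rng.integers(0, 3, 200)   # index of the best EDTS method

# A regressor predicts how synthesizable a new sample is ...
reg = RandomForestRegressor(n_estimators=100).fit(X, score)
# ... and a separate classifier suggests which EDTS method to use.
clf = RandomForestClassifier(n_estimators=100).fit(X, best_method)

x_new = rng.normal(size=(1, 32))
print(reg.predict(x_new), clf.predict(x_new))
```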
|
|
15:20-17:20, Paper ThPMP.107 | |
Blurred Image Region Detection Based on Stacked Auto-Encoder |
Zhou, Yuan | Tianjin Univ |
Yang, Jianxing | Tianjin Univ |
Chen, Yang | Tianjin Univ |
Kung, Sun-Yuan | Princeton Univ |
Keywords: Image processing and analysis, Enhancement, restoration and filtering
Abstract: In this study, we address the fundamental yet challenging problem of detecting and classifying blurred regions in partially blurred images. We propose to learn a latent feature representation with a stacked auto-encoder (SAE) network to perform blur region detection. Most previous approaches focus on extracting a few blur features in the image gradient domain, the Fourier domain, and from data-driven local filters. We extract a latent high-level feature representation from such low-level features using the stacked auto-encoder network, thereby improving the accuracy of blur region classification. This high accuracy enables us to successfully separate the clear and blurred regions. Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art methods in detecting and classifying blur regions in partially blurred images.
|
|
15:20-17:20, Paper ThPMP.108 | |
Accumulated Aggregation Shifting Based on Feature Enhancement for Defect Detection on 3D Textured Low-Contrast Surfaces |
Yan, Yaping | Hokkaido Univ |
Xiang, Sheng | Hokkaido Univ |
Asano, Hirokazu | HUAWEI Tech. JAPAN K.K |
Kaneko, Shun'ichi | Hokkaido Univ |
Keywords: Image processing and analysis, Segmentation, features and descriptors, Classification
Abstract: Detecting defects on 3D textured low-contrast surfaces plays an important role in product quality control. However, because of the effects of unevenly distributed materials, irregular textures, and unclear boundaries between defects and background, this is still a challenging problem. In this paper, a saliency-guided defect detection method, named the accumulated aggregation shifting (AAS) model, is proposed to iteratively shift the brightness of pixels based on their defect probability. The output sequences of AAS at different iterations can then be formalized as linear or exponential distributions through statistical analysis. Finally, by utilizing a risk minimization method, we theoretically determine a reasonable threshold to classify all pixels as defective or defect-free. This method models the defect detection problem under a probabilistic framework, and only a handful of samples are needed for parameter optimization. Experiments on a real-world image dataset for an industrial surface defect detection task demonstrate the effectiveness of our approach.
|
|
15:20-17:20, Paper ThPMP.109 | |
Unsupervised Video Highlight Extraction Via Query-Related Deep Transfer |
Wang, Han | Beijing Forestry Univ |
Yu, Huangyue | Beijing Forestry Univ |
Chen, Pei | Beijing Forestry Univ |
Hua, Rui | Beijing Forestry Univ |
Zou, Ling | Beijing Film Acad |
Keywords: Multimedia analysis, indexing and retrieval, Deep learning for multimedia analysis, Video analysis
Abstract: The emergence of user-operated media has motivated the explosive growth of online videos. Browsing these large amounts of videos is time-consuming and tedious, which makes finding the moments of major or special user preference (i.e., highlight extraction) an urgent problem. Moreover, user subjectivity means that no fixed extraction meets all user preferences. This paper addresses these problems by posing a query-related highlight extraction framework that optimizes the selected frames to be both semantically query-related and visually representative of the entire video. Under this framework, the relevance between the query text and the video frames is first computed in a visual-semantic feature embedding space induced by a convolutional neural network. Then we enforce diversity among the video frames with the determinantal point process (DPP), a recently introduced probabilistic model for diverse subset selection. The experimental results show that our query-related highlight extraction method is particularly useful for news video content fetching, e.g., abstracting the entire video while focusing on the parts that match the user's queries.
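For background, diverse subset selection with a DPP is often approximated by greedily maximizing the determinant of the kernel submatrix. The sketch below shows that generic greedy MAP procedure on a similarity kernel; it is not the paper's specific model, and the kernel construction is an assumption.

```python
import numpy as np

def greedy_dpp(L, k):
    """Greedy MAP inference for a DPP with kernel L: pick k diverse items.

    L: (n, n) positive semi-definite similarity kernel.
    Returns indices whose kernel submatrix has (greedily) large determinant.
    """
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            # log-determinant of the submatrix; slogdet is stable
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# usage sketch: frame features F (n, d); similar frames -> similar rows,
# so selecting a high-determinant subset favors diverse frames.
F = np.random.default_rng(0).normal(size=(20, 16))
L = F @ F.T + 1e-6 * np.eye(20)  # PSD kernel over frames
print(greedy_dpp(L, k=5))
```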
|
|
15:20-17:20, Paper ThPMP.110 | |
Super-Resolution Imaging Based on Global Interpolation and Structural Similarities |
Zhou, Yuan | Tianjin Univ |
Huo, Shuwei | Tianjin Univ |
Chen, Ying | Tianjin Univ |
Kung, Sun-Yuan | Princeton Univ |
Keywords: Super-resolution, Sparse learning, Image processing and analysis
Abstract: In this paper, we propose a double dictionary learning method for image super-resolution (SR) reconstruction. Unlike existing dictionary-learning-based super-resolution, we combine both self-similarity and external images to construct the double dictionary. A new optimization model is established using self-similarities and external similarities as regularization terms. Furthermore, we propose a global interpolation method to reconstruct an accurate initial estimate at the edges. Experimental results show that the proposed algorithm produces high-quality reconstructions both perceptually and quantitatively, in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), compared with existing algorithms.
|
|
15:20-17:20, Paper ThPMP.111 | |
Augment and Adapt: A Simple Approach to Image Tampering Detection |
Annadani, Yashas | International Inst. of Information Tech. (IIIT), Hydera |
Jawahar, C. V. | IIIT |
Keywords: Information forensics and security, Applications of computer vision
Abstract: Convolutional Neural Networks have been shown to be promising for image tampering detection in recent years. However, the number of tampered images available to train a network is still small, mainly due to the effort involved in creating large numbers of tampered images. As a result, the potential offered by these networks is not yet completely exploited. In this work, we propose a simple method to address this problem by augmenting data using inpainting and compositing schemes. We consider different forms of inpainting, such as simple and semantic inpainting, as well as compositing schemes such as feathering. A domain adaptation technique is employed to reduce the domain shift between the augmented data and data produced with proprietary software. We demonstrate that this method of augmentation is effective in improving detection accuracy, and present experimental evaluations on two popular image tampering detection datasets to demonstrate the effectiveness of the proposed approach.
|
|
15:20-17:20, Paper ThPMP.112 | |
Significant Region-Based Framework for Early Diagnosis of Alzheimer's Disease Using 11c Pib-Pet Scans |
El-Gamal, Fatma El-Zahraa A. | 1 Faculty of Computers and Information, IT Dept., Mansoura Univ |
Elmogy, Mohammed | Faculty of Computers and Information, Mansoura Univ |
Atwan, Ahmed | Information Tech. Dept., Faculty of Computers and Informati |
Ghazal, Mohammed | Electrical and Computer Engineering Department, Abu Dhabi Univ |
Barnes, Gregory | Univ. of Louisville |
Hajjdiab, Hassan | Abu Dhabi Univ |
Keyntone, Robert | Univ. of Louisville |
El-Baz, Ayman | Univ. of Louisville |
Keywords: Image processing and analysis, Computer-aided detection and diagnosis, Classification
Abstract: Alzheimer's disease (AD) is a behavioral and cognitive neurodegenerative disorder that affects more than 5.5 million Americans. Among its stages, the early diagnosis of AD is considered the main research issue due to many factors, including the variable effects of the disease across patients. This paper targets the personalized diagnosis of AD by presenting a local/regional analysis system that represents the degree of regional abnormality using a detailed parcellation of the brain. For more detailed results, statistical analysis was applied to restrict the diagnosis to the statistically significant brain regions. The system’s evaluation shows promising results, with average accuracy, specificity, and sensitivity across the three tested groups of 98%, 99.09%, and 96.48%, respectively.
|
|
15:20-17:20, Paper ThPMP.113 | |
Learning a Hierarchical Latent Semantic Model for Multimedia Data |
Wu, Shao-Hui | National Central Univ |
Lee, Yuan-Shan | National Central Univ |
Chen, Sih-Huei | National Central Univ |
Wang, Jia-Ching | National Central Univ |
Keywords: Affective multimedia processing/analysis, Image classification, Probabilistic graphical model
Abstract: This paper develops a hierarchical feature representation based on a Bayesian non-parametric method. Feature learning is an important issue in classification and data analysis: it can improve classification performance and increase the convenience of data processing and analysis. Popular representation learning methods include those based on mixture models or dictionary learning. However, current methods have some disadvantages. The use of a traditional mixture model, such as the Gaussian mixture model (GMM), involves the model selection problem and suffers from a lack of hierarchy between components. Inspired by hLDA, distance-based Gaussian hierarchical Dirichlet allocation (distance-based GhLDA) is proposed herein. This method can automatically determine the number of components and construct a hierarchical representation. A distance function between data points is used in the prior distribution. The representation learnt by the proposed model has the advantage of hLDA, which can handle shared and distinct components. The quantization loss problem, which commonly arises when a topic model is used on continuous data, is solved by assuming that the distribution of words follows a Gaussian rather than a Dirichlet distribution. The performance of the proposed model on audio and image classification problems is evaluated. Experimental results indicate that distance-based GhLDA outperforms baseline methods.
|
|
15:20-17:20, Paper ThPMP.114 | |
A New Dynamic Minimal Path Model for Tubular Structure Centerline Delineation |
Chen, Da | Univ. Paris Dauphine |
Cohen, Laurent | CNRS |
Keywords: Segmentation, features and descriptors, Biological image and signal analysis
Abstract: We propose a new dynamic Riemannian metric with adaptive anisotropy enhancement and appearance feature coherence penalization, where the appearance features are characterized by orientation score maps. Unlike static geodesic metrics, which depend on local pointwise information, the dynamic metric can take into account a nonlocal feature coherence penalty in order to extract a desired structure from a complicated background or from a vessel tree. We construct the metric using information from two external reference points, which are identified during the geodesic distance computation. Numerical experiments are performed on retinal vessels, including independent results from the proposed dynamic metric itself and a comparison against existing minimal path models. The results show that the proposed metric indeed achieves better performance than state-of-the-art geodesic metrics.
|
|
15:20-17:20, Paper ThPMP.115 | |
Video Compression for Object Detection Algorithms |
Galteri, Leonardo | Univ. Degli Studi Di Firenze - MICC |
Bertini, Marco | Univ. of Florence |
Seidenari, Lorenzo | Media Integration and Communication Center - Univ |
Del Bimbo, Alberto | Univ. of Florence |
Keywords: Image and video coding, Image quality assessment, Video processing and analysis
Abstract: Video compression algorithms have been designed with the aim of pleasing human viewers, and are driven by video quality metrics designed to account for the capabilities of the human visual system. However, thanks to advances in computer vision systems, more and more videos are going to be watched by algorithms, e.g. those implementing video surveillance systems or performing automatic video tagging. This paper describes an adaptive video coding approach for computer vision-based systems. We show how to control the quality of video compression so that automatic object detectors can still process the resulting video, improving their detection performance by preserving the elements of the scene that are more likely to contain meaningful content. Our approach is based on the computation of saliency maps exploiting a fast objectness measure. The computational efficiency of this approach makes it usable in a real-time video coding pipeline. Experiments show that our technique outperforms standard H.265 in speed and coding efficiency, and can be applied to different types of video domains, from surveillance to web videos.
|
|
15:20-17:20, Paper ThPMP.116 | |
Kinematics-Based Extraction of Salient 3D Human Motion Data for Summarization of Choreographic Sequences |
Voulodimos, Athanasios | National Tech. Univ. of Athens |
Doulamis, Nikolaos | National Tech. Univ. of Athens |
Doulamis, Anastasios | National Tech. Univ. of Athens |
Rallis, Ioannis | National Tech. Univ. of Athens |
Keywords: Multimedia analysis, indexing and retrieval, Applications of computer vision
Abstract: Capturing, documenting and storing Intangible Cultural Heritage content has recently been enabled at unprecedented volume and quality levels through a variety of sensors and devices. When it comes to the performing arts, mainly dance and kinesiology, the massive amounts of RGB-D and 3D skeleton data produced by video and motion capture devices, and the huge number of different types of existing dances and variations thereof, dictate the need for organizing, indexing, archiving, retrieving and analyzing dance-related cultural content in a tractable fashion and with lower computational and storage resource requirements. In this context, we present a novel framework based on kinematics modeling for the extraction of salient 3D human motion data from real-world choreographic sequences. Two approaches are proposed: a clustering-based method for the selection of the basic primitives of a choreography, and a kinematics-based method that generates meaningful summaries at hierarchical levels of granularity. The dance summarization framework has been successfully validated and evaluated on two real-world datasets and with the participation of dance professionals and domain experts.
|
|
15:20-17:20, Paper ThPMP.117 | |
Weakly Supervised Domain-Specific Color Naming Based on Attention |
Yu, Lu | Computer Vision Center UAB |
Cheng, Yongmei | Northwestern Pol. Univ |
van de Weijer, Joost | Computer Vision Center Barcelona |
Keywords: Color analysis, Applications of computer vision, Image classification
Abstract: The majority of existing color naming methods focus on the eleven basic color terms of the English language. However, in many applications different sets of color names are used for the accurate description of objects, and labeling data to learn these domain-specific color names is an expensive and laborious task. Therefore, in this article we aim to learn color names from weakly labeled data. For this purpose, we add an attention branch to the color naming network, which is used to modulate the network's pixel-wise color naming predictions. In experiments, we illustrate that the attention network correctly identifies the relevant regions, show that our method obtains state-of-the-art results for image-wise classification on the EBAY dataset, and demonstrate that it is able to learn color names for various domains.
|
|
15:20-17:20, Paper ThPMP.118 | |
Confocal Ellipse-Based Distance and Confocal Elliptical Field for Polygonal Shapes |
Gabdulkhakova, Aysylu | Tech. Univ. Wien (TU Wien) |
Kropatsch, Walter | TU Vienna |
Keywords: Image processing and analysis, Segmentation, features and descriptors, Object recognition
Abstract: The paper introduces a novel confocal ellipse-based distance (CED) based on the properties of confocal ellipses. This distance is used to produce a confocal elliptical field (CEF). The Euclidean Distance Transform (EDT) of a single point (called a seed) generates a distance field of concentric circles, and the sum of two such distance fields for two distinct seed points produces a distance field of confocal ellipses. This fact enables us to adapt CED and CEF to the discrete case, referred to as CED-DT and CEF-DT. The properties of the CEF and CEF-DT make them useful for skeletonization, in particular for the efficient removal of spurious branches.
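The construction lends itself to a very short sketch with SciPy's Euclidean distance transform: summing the distance fields of two seed points yields level sets that are confocal ellipses with those seeds as foci. The grid size and seed positions below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def confocal_elliptical_field(shape, focus_a, focus_b):
    """Sum of two single-seed EDTs: level sets are confocal ellipses."""
    def seed_edt(focus):
        grid = np.ones(shape, dtype=bool)
        grid[focus] = False          # EDT measures distance to the seed
        return distance_transform_edt(grid)
    return seed_edt(focus_a) + seed_edt(focus_b)

field = confocal_elliptical_field((200, 200), (100, 60), (100, 140))
# Pixels with field <= c lie inside the ellipse with "string length" c.
print(field.min(), field.max())
```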
|
|
15:20-17:20, Paper ThPMP.119 | |
Multiphase Local Mean Geodesic Active Regions |
Daniel Kirstejn Hansen, Jacob | Univ. of Copenhagen |
Lauze, Francois | Univ. of Copenhagen |
Keywords: Segmentation, features and descriptors, Clustering
Abstract: This paper presents two variational multiphase segmentation methods for the recovery of segments in weakly structured images presenting local and global intensity bias fields, as is often the case in micro-tomography. The proposed methods assume a fixed number of classes. They use local image averages as discriminative features, binary labelling for class membership, and its relaxation to per-pixel/voxel posterior probabilities, i.e., Hidden Markov Measure Field Models (HMMFM). The first model uses a Total Variation weighted semi-norm (wTV) for label field regularization, similar to Geodesic Active Contours but with a different and possibly richer representation. The second model uses a weighted Dirichlet (squared gradient) regularization. Both problems are solved by alternating minimization over local class averages and label fields. The quadratic problem is essentially smooth, except for the HMMFM constraints; the wTV problem uses a Chambolle-Pock scheme for label field updates. We demonstrate the capabilities of the approaches on synthetic examples, and illustrate them on real examples.
|
|
15:20-17:20, Paper ThPMP.120 | |
A Fast Cascade Shape Regression Method Based on CNN-Based Initialization |
Gao, Pengcheng | Univ. of Chinese Acad. of Sciences |
Xue, Jian | Univ. of Chinese Acad. of Sciences |
Lv, Ke | Univ. of Chinese Acad. of Sciences |
Yan, Yanfu | Univ. of Chinese Acad. of Sciences |
Keywords: Image processing and analysis, Regression, Neural networks
Abstract: Cascade shape regression (CSR) methods, which predict facial landmarks by iteratively updating an initial shape, are state-of-the-art. The initial shape always limits the result and can cause local optima; it is usually obtained from the average face or by randomly picking a face from the training set. In this paper, we propose a CNN-based initialization method for CSR. A convolutional neural network provides a highly robust initial shape estimation, while the subsequent CSR algorithm rapidly fine-tunes the initialization to achieve higher accuracy. Furthermore, the CNN-based initialization produces a 68-point initial shape, calculated from the network's 5-point result by radial basis function interpolation with thin-plate splines (RBF-TPS). Extensive experiments demonstrate that CSR methods are sensitive to initialization, and that the proposed approach obtains favorable results compared with state-of-the-art algorithms while achieving real-time performance.
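To illustrate the RBF-TPS step, the sketch below fits a thin-plate-spline warp from 5 reference landmarks on a mean face to the 5 points detected by a CNN, and applies it to a full 68-point mean shape. The shape data are random stand-ins, and this is a generic reading of the interpolation idea, not the paper's implementation.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
mean5 = rng.uniform(0, 100, (5, 2))            # 5 reference landmarks (mean face)
detected5 = mean5 + rng.normal(0, 3, (5, 2))   # CNN's 5-point detection
mean68 = rng.uniform(0, 100, (68, 2))          # full 68-point mean shape

# Thin-plate-spline warp fitted on the 5 correspondences (vector-valued,
# one output per coordinate), then evaluated at all 68 mean positions.
warp = RBFInterpolator(mean5, detected5, kernel="thin_plate_spline")
init68 = warp(mean68)                          # 68-point initial shape for CSR
print(init68.shape)                            # (68, 2)
```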
|
|
15:20-17:20, Paper ThPMP.121 | |
A New Single Image Super-Resolution Method Using SIMK-Based Classification and ISRM Technique |
Duan, Peiqi | Beijing Univ. of Posts and Telecommunications |
Ming, Anlong | Beijing Univ. of Posts and Telecommunications |
Kang, Xuejing | Beijing Univ. of Posts and Telecommunications |
Yao, Chao | Beijing Univ. of Posts and Telecommunications |
Keywords: Super-resolution, Classification, Regression
Abstract: Single image super-resolution (SR) techniques are widely used to estimate high-resolution (HR) images from low-resolution (LR) ones. As a research hotspot, many example-based SR methods achieve superior results by learning class-mapping-kernels from classified external LR-HR patch-pair samples. However, in these methods the classification of samples is generally based on features of the LR patch alone, and the interference of ill-samples in learning the class-mapping-kernels is ignored. In this paper, we propose a new SR method with Sample Individual Mapping-Kernel (SIMK) based classification and an Ill-Sample Removal Mechanism (ISRM) for learning the LR-HR mapping. In the proposed sample classification, we use the SIMK feature, the LR-to-HR mapping kernel of each sample, to classify samples and obtain more reasonable sample sets for mapping-learning. To prevent overfitting and reduce the complexity of SIMK-based classification, samples are pre-categorized by the relative pixel values of the LR patch. In the mapping-learning process, ill-samples that are far from the classification center are removed to improve the validity of the class-mapping-kernels. In addition, for each testing LR patch, the optimal class is assigned based on a probabilistic decision model learned with a Naive Bayes classifier. Compared with state-of-the-art methods, our SR method achieves both visual and quantitative improvements.
|
|
15:20-17:20, Paper ThPMP.122 | |
Video Stitching with Extended-MeshFlow |
Chen, Kai | Wuhan Univ |
Yao, Jian | Wuhan Univ |
Xiang, Binbin | Wuhan Univ |
Tu, Jingmin | Wuhan Univ |
Keywords: Video processing and analysis, Applications of computer vision, Video analysis
Abstract: In this paper, we present a method that stitches multiple videos captured with a fixed camera rig, proposing an Extended-MeshFlow motion model for video stitching. First, uniform features are detected and matched in the overlapping region, from which the Extended-MeshFlow model is estimated. The model then warps the adjacent views to the common central view to eliminate spatial misalignment. The motions located at the feature positions are interpolated to the mesh vertexes by Multilevel B-Spline Approximation (MBA). Collecting the motions at the vertexes forms the vertex profiles, which are smoothed for temporal consistency. During the smoothing, only previous frames are required, so the proposed method can stitch videos in an online mode. Experimental results on a variety of videos demonstrate that the proposed method produces comparable stitching results in terms of spatial alignment and temporal coherence.
|
|
15:20-17:20, Paper ThPMP.123 | |
A Fast Local Analysis by Thresholding Applied to Image Matching |
Faula, Yannick | LIRIS, INSA Lyon |
Bres, Stephane | LIRIS Lab. National Inst. of Applied Sciences in Lyon |
Eglin, Veronique | LIRIS |
Keywords: Segmentation, features and descriptors, Image processing and analysis, Object detection
Abstract: Key structure extraction and matching are key steps in computer vision, and many fields of application need large image acquisition and fast extraction of fine structures. In this study, we focus on situations where existing local feature extractors do not give sufficiently satisfying results in terms of accuracy and processing time; a good illustration is short-line extraction in locally weakly-contrasted images. We propose a new Fast Local Analysis by threSHolding (FLASH) designed to process large images under hard time constraints. We use "micro-line" points as the key feature; these are used for shape reconstruction (e.g., lines) and local signature design. We apply FLASH to the field of concrete infrastructure monitoring, where robots and UAVs are increasingly used for automated defect detection (e.g., cracks). For large concrete surfaces, there are several hard constraints, such as computational time and reliability. Results show that the computations are faster than several existing image matching algorithms, and that FLASH is invariant to rotation and partial occlusion and covers a scale range from 0.7 to 1.4 without scale-space exploration.
|
|
15:20-17:20, Paper ThPMP.124 | |
CANDY: Conditional Adversarial Networks Based End-To-End System for Single Image Haze Removal |
Swami, Kunal | Samsung Res. Inst. Bangalore |
Das, Saikat Kumar | Samsung R&D Inst. India, Bangalore |
Keywords: Enhancement, restoration and filtering, Deep learning, Applications of pattern recognition and machine learning
Abstract: Single image haze removal is a challenging and ill-posed problem. Existing haze removal methods in the literature, including the recently introduced deep learning methods, model the problem as one of estimating intermediate parameters, viz., the scene transmission map and atmospheric light, which are used to compute the haze-free image from the hazy input image. Such an approach focuses only on the accurate estimation of the intermediate parameters, while the aesthetic quality of the haze-free image is unaccounted for in the optimization framework. Thus, errors in the estimation of the intermediate parameters often lead to inferior-quality haze-free images. In this paper, we present CANDY (Conditional Adversarial Networks based Dehazing of hazY images), a fully end-to-end model which directly generates a clean haze-free image from a hazy input image. CANDY also incorporates the visual quality of the haze-free image into the optimization function, thus generating superior-quality haze-free images. This is one of the first works in the literature to propose a fully end-to-end model for single image haze removal, and the first to explore generative adversarial networks for this problem. CANDY was trained on a synthetically created haze image dataset, while evaluation was performed on challenging synthetic as well as real haze image datasets. The extensive evaluation and comparison results reveal that CANDY significantly outperforms existing state-of-the-art haze removal methods, both quantitatively and qualitatively.
|
|
15:20-17:20, Paper ThPMP.125 | |
Fast Motion Deblurring for Feature Detection and Matching Using Inertial Measurements |
Mustaniemi, Janne | Univ. of Oulu |
Kannala, Juho | Aalto Univ |
Särkkä, Simo | Aalto Univ |
Matas, Jiri | CTU Prague |
Heikkilä, Janne | Univ. of Oulu |
Keywords: Segmentation, features and descriptors, Enhancement, restoration and filtering, 3D reconstruction
Abstract: Many computer vision and image processing applications rely on local features. It is well known that motion blur decreases the performance of traditional feature detectors and descriptors. We propose an inertial-based deblurring method for improving the robustness of existing feature detectors and descriptors against motion blur. Unlike most deblurring algorithms, the method can handle spatially-variant blur and rolling shutter distortion. Furthermore, it is capable of running in real time, contrary to state-of-the-art algorithms. The limitations of inertial-based blur estimation are taken into account by validating the blur estimates using image data. The evaluation shows that when the method is used with a traditional feature detector and descriptor, it increases the number of detected keypoints, provides higher repeatability and improves localization accuracy. We also demonstrate that such features lead to more accurate and complete reconstructions when used in 3D visual reconstruction applications.
|
|
15:20-17:20, Paper ThPMP.126 | |
Layered Surface Detection for Virtual Unrolling |
Dahl, Vedrana Andersen | Tech. Univ. of Denmark |
Dahl, Anders | Tech. Univ. of Denmark |
Trinderup, Camilla Himmelstrup | Tech. Univ. of Denmark |
Gundlach, Carsten | Tech. Univ. of Denmark |
Keywords: Segmentation, features and descriptors, Image processing and analysis
Abstract: We present a method for the virtual unrolling of a thin rolled object. From a volumetric image of the rolled object we obtain a flat image of the object's surface, which allows visual inspection of the object and has a number of applications. Our method exploits the geometric constraints of the problem and detects a single rolled surface. For surface detection we adapt a solution to the optimal net surface problem, previously used for terrain-like and tubular surfaces. We present our approach on an example of rolled sheet microelectronics, which has a layer of flexible polymer substrate and a thin metal layer lithographically coated onto the polymer. Our approach is automatic and robust; the unrolled image is undistorted, and the surface structures can be accurately quantified, making our approach a good candidate for industrial applications of virtual unrolling.
|
|
15:20-17:20, Paper ThPMP.127 | |
Fully Convolutional Network and Graph-Based Method for Co-Segmentation of Retinal Layer on Macular OCT Images |
Liu, Yun | Shandong Univ |
Ren, Gang | Shandong Univ |
Yang, Gongping | Shandong Univ |
Xi, Xiaoming | Shandong Univ. of Finance and Ec |
Chen, Xinjian | Soochow Univ |
Yin, Yilong | Shandong Univ |
Keywords: Segmentation, features and descriptors, Deep learning
Abstract: Retinal layer segmentation in optical coherence tomography (OCT) images is crucial for the diagnosis and study of retinal diseases. Graph-based methods are commonly used in layer segmentation. However, most of these methods require considerable human effort to determine an appropriate model for computing good edge weights. In this paper, we propose a novel automatic method for segmenting retinal layers in macular OCT images. Specifically, we propose a new fully convolutional deep learning architecture with a side output layer to learn optimal graph-edge weights directly from raw pixels. The architecture automatically learns multi-scale and multi-level features to generate accurate boundary probabilities as good edge weights, without hand-crafting appropriate models. The boundaries are then finalized using a graph segmentation method. The proposed method is evaluated on a dataset of 130 OCT B-scans. The experimental results show a mean absolute boundary positioning difference of 1.48±0.34 pixels.
|
|
15:20-17:20, Paper ThPMP.128 | |
Applying Hand Gesture Recognition and Joint Tracking to a TV Controller Using CNN and Convolutional Pose Machine |
Yueh, Wu | Acad. Sinica |
Chien-Ming, Wang | Acad. Sinica |
Keywords: Gesture recognition, Motion and tracking, Deep learning
Abstract: This paper introduces a novel TV control simulation system that recognizes hand gestures and tracks hand joints based on Convolutional Neural Networks (CNN) and Convolutional Pose Machines (CPM). The system provides users with an intuitive means of controlling television functions through hand gestures. Moreover, based on the relative position and angle of the fingers, users can manipulate an onscreen cursor and continuously adjust the volume and channel at varying speeds. We achieved 95.8 percent testing accuracy on 19 gestures with 4 subjects, and an average of 11 and 45 fps when running CPM and CNN, respectively.
|
|
15:20-17:20, Paper ThPMP.129 | |
Deep Emotion Transfer Network for Cross-Database Facial Expression Recognition |
Li, Shan | Beijing Univ. of Posts and Telecommunications |
Deng, Weihong | Beijing Univ. of Posts and Telecommunications |
Keywords: Facial expression recognition, Deep learning
Abstract: Due to the large domain discrepancy between training and testing data and the inaccessibility of sufficient annotated training samples, cross-database facial expression recognition, which has greater application value, remains challenging in the literature. Previous research on this problem has been based on shallow features with limited discriminative ability. In this paper, we propose to address the problem with a Deep Emo-transfer Network (DETN). Specifically, maximum mean discrepancy is embedded in the deep architecture to reduce dataset bias. Furthermore, a very common but widely ignored bottleneck in facial expression recognition, the imbalanced class distribution, is taken into account: a learnable class-wise weighting parameter is introduced into our network by exploring the class prior distribution on unlabeled data, so that the training and testing domains share a similar class distribution. Extensive empirical evidence involving both lab-controlled vs. real-world and small-scale vs. large-scale facial expression databases shows that our DETN yields competitive performance across various facial expression transfer tasks.
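As background on the domain-alignment term, the sketch below computes a simple (biased) estimate of the squared maximum mean discrepancy between two feature batches with a Gaussian kernel. This shows the standard MMD statistic, not the DETN architecture, and the bandwidth and batch data are assumptions.

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Squared maximum mean discrepancy with a Gaussian (RBF) kernel.

    X: (n, d) source-domain features, Y: (m, d) target-domain features.
    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    """
    def kernel(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(64, 16))   # e.g. lab-controlled features
tgt = rng.normal(0.5, 1.0, size=(64, 16))   # e.g. real-world features
print(mmd_rbf(src, tgt))                     # larger => bigger domain gap
```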
|
|
15:20-17:20, Paper ThPMP.130 | |
WiTT: Modeling and the Evaluation of Table Tennis Actions Based on WIFI Signals |
Chen, Chong | Southwest Univ |
Shu, Yao | Coll. of Computer & Information Science Southwest Univ |
Zhang, Heng | Coll. of Computer & Information Science Southwest Univ |
Shu, Kuang-I | Coll. of Computer & Information Science Southwest Univ |
Keywords: Pattern recognition for human computer interaction, Signal analysis, Classification
Abstract: In recent years, with the rapid development of science and technology, wireless signals have evolved from a simple communication medium into a tool for environmental awareness. Many researchers have applied wireless signals to human perception and human behavior recognition. At present, vision-based and sensor-based action recognition are the two mainstream approaches. However, the former is sensitive to lighting levels and the environment, and the latter requires users to wear (or deploy) devices, which may be inconvenient. Action recognition systems based on wireless signals can avoid these difficulties at low cost. Wi-Fi signals have been used to realize keystroke recognition, gesture recognition, and simple human behavior recognition. Inspired by this, and by the urgent need for indoor somatosensory games and other applications that require human motion recognition, this paper presents a table tennis action recognition system based on Wi-Fi signals, termed WiTT. In home environments, Wi-Fi signals are often seriously affected by other wireless signals and environmental factors. WiTT uses discrete wavelet decomposition, support vector machines, and other techniques, and achieves a detection rate above 96.34% in detecting table tennis actions and 90.33% recognition accuracy in classifying 6 different table tennis actions.
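A hedged sketch of the signal-processing backbone the abstract names, discrete wavelet decomposition feeding an SVM; the db4 wavelet, the sub-band-energy features, and the synthetic "actions" below are illustrative stand-ins for the real CSI pipeline.

```python
import numpy as np
import pywt
from sklearn.svm import SVC

def dwt_features(window, wavelet="db4", level=3):
    """Summarize one window of a Wi-Fi amplitude stream by the energy of
    each wavelet sub-band (the paper does not name its wavelet; 'db4'
    is an assumption)."""
    coeffs = pywt.wavedec(window, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])

# toy data: two synthetic "actions" with different dominant frequencies
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 256)
X = np.array([dwt_features(np.sin(2 * np.pi * f * t)
                           + 0.1 * rng.standard_normal(256))
              for f in ([5] * 20 + [40] * 20)])
y = np.array([0] * 20 + [1] * 20)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))     # the two toy actions separate cleanly
```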
|
|
15:20-17:20, Paper ThPMP.131 | |
One-Factor Cancellable Biometrics Based on Indexing-First-Order Hashing for Fingerprint Authentication |
Kim, Jihyeon | Coll. of Engineering, Yonsei Univ |
Teoh, Andrew | Yonsei Univ |
Keywords: Security and privacy in biometrics, Forensic applications of biometrics, Fingerprint recognition
Abstract: Although biometrics is deemed a more secure and user-friendly solution than password-based or token-based approaches for identity management, biometric templates are vulnerable to adversarial attacks that may lead to privacy invasion and irreversible identity theft. Cancelable biometrics is a template protection method that generates a noninvertible identifier from the original biometric template by means of a parameterized transformation function and user/application-specific parameters. However, the need to supply the parameter, either in possession (token) or in memory (password) form along with the biometrics, hence two factors, jeopardizes the usability of biometrics. In this paper, we propose a one-factor cancellable biometric authentication scheme empowered by Indexing-First-Order hashing, a tailor-made locality-sensitive hashing function for template protection. We evaluate the proposed scheme with respect to four template protection design criteria, namely noninvertibility, renewability, unlinkability, and accuracy performance. We also analyze the threat model of the proposed scheme, which covers five major security attacks. Although the scheme can be applied to any binary biometric features, we adopt binary fingerprint vectors as a case study in this paper. The evaluations were carried out on six datasets taken from the FVC 2002 and FVC 2004 benchmark databases.
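The exact Indexing-First-Order construction is not given in the abstract; the sketch below only conveys the index-based, locality-sensitive hashing idea it builds on (record where the first set bit of a secretly permuted template lands) and is explicitly not the paper's scheme.

```python
import numpy as np

def first_one_hash(bits, n_perms=64, seed=7):
    """Hedged sketch of an indexing-first-one style hash: for each of
    n_perms fixed random permutations, record the position of the first
    set bit of the permuted binary template. The real Indexing-First-Order
    construction has more machinery (e.g. products of several permuted
    copies); this only illustrates an index-based, hard-to-invert code."""
    rng = np.random.default_rng(seed)
    code = []
    for _ in range(n_perms):
        perm = rng.permutation(bits.size)
        ones = np.flatnonzero(bits[perm])
        code.append(int(ones[0]) if ones.size else -1)
    return np.array(code)

a = (np.random.default_rng(0).random(512) < 0.3).astype(int)
b = a.copy(); b[:25] ^= 1                  # noisy re-capture of the same finger
ha, hb = first_one_hash(a), first_one_hash(b)
print("matching positions:", np.mean(ha == hb))   # stays high for same finger
```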
|
|
15:20-17:20, Paper ThPMP.132 | |
Identification of Hypertension by Mining Class Association Rules from Multi-Dimensional Features |
Liu, Fan | Northwestern Pol. Univ |
Zhou, Xingshe | Northwestern Pol. Univ |
Wang, Zhu | Northwestern Pol. Univ |
Wang, Tianben | Northwestern Pol. Univ |
Zhang, Yanchun | Victoria Univ |
Keywords: Biometric systems and applications, Applications of pattern recognition and machine learning, Pattern recognition for human computer interaction
Abstract: Hypertension is a common cardiovascular disease, which leads to severe complications without timely treatment. Identifying hypertension accurately is essential to prevent the condition from deteriorating. However, state-of-the-art hypertension identification methods extract features from very few aspects and hence have limited identification accuracy. Furthermore, they can only judge whether subjects are hypertensive or not; more meaningful information (such as why a subject suffers from hypertension) that can help doctors improve their diagnosis is absent. In this paper, we propose a class association rules-based method for identifying hypertension. In particular, the key idea of our method is to mine the relationships among multi-dimensional features to characterize the hypertension pattern more effectively, aiming to improve identification performance. In addition, the approach generates a set of class association rules (CARs), which reflect the subjects' physiological status and are demonstrated to be useful for doctors in analyzing a subject's condition in depth. Extensive experiments with 128 subjects (61 hypertension patients and 67 healthy subjects) show that our method outperforms the baseline methods, with accuracy, precision, and recall reaching 85.2%, 85.0%, and 83.6%, respectively. In addition, a user study with five clinicians demonstrates the usefulness of the generated CARs.
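A minimal sketch of class association rule mining as described, using mlxtend's apriori; the one-hot feature names and thresholds are invented for illustration, and CARs are simply the rules whose consequent is the class label.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# toy one-hot table: discretized multi-dimensional features plus the class
# label as one extra column (feature names are illustrative, not the paper's)
df = pd.DataFrame({
    "high_BMI":     [1, 1, 0, 1, 0, 1, 0, 0],
    "low_HRV":      [1, 1, 1, 1, 0, 0, 0, 1],
    "poor_sleep":   [1, 0, 1, 1, 0, 1, 0, 0],
    "hypertension": [1, 1, 1, 1, 0, 0, 0, 0],
}).astype(bool)

itemsets = apriori(df, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
# keep only class association rules: the consequent must be the class label
cars = rules[rules["consequents"] == {"hypertension"}]
print(cars[["antecedents", "consequents", "support", "confidence"]])
```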
|
|
15:20-17:20, Paper ThPMP.133 | |
What Are You Doing While Answering Your Smartphone? |
Abate, Andrea F. | Univ. of Salerno |
Nappi, Michele | Univ. of Salerno |
Barra, Silvio | Univ. of Cagliari |
De Marsico, Maria | Sapienza Univ. of Rome |
Keywords: Biometric sensors, Human behavior analysis, Soft biometrics
Abstract: Context awareness is a major component of Ambient Intelligence. In fact, Ambient Intelligence environments are designed to combine ubiquity, awareness, intelligence, and natural interaction. Awareness is defined as the ability of the system to locate and recognize people and objects, and their intentions. Intelligence, in turn, is the ability of the system to analyze the detected context, to adapt its behavior to people and situations, and to learn over time, in order to provide users with personalized services. These concepts date back to the late 1990s, but nowadays the widespread and ubiquitous availability of mobile devices, equipped with several different sensors, makes it possible to put them into practice in a number of unexpected ways. This work presents a preliminary investigation of the possibility of using some of the smartphone sensors, namely the accelerometer and the gyroscope, to identify the bodily context when the user lifts the device to answer a call. The arm gesture, i.e., the way it is performed, is classified into 4 different states: standing, sitting, walking, or running. This information can be used to trigger context-sensitive system actions.
|
|
15:20-17:20, Paper ThPMP.134 | |
On Mugshot-Based Arbitrary View Face Recognition |
Liang, Jie | SiChuan Univ |
Liu, Feng | Sichuan Univ |
Tu, Huan | Sichuan Univ |
Zhao, Qijun | Sichuan Univ |
Jain, Anil | Michigan State Univ |
Keywords: Forensic applications of biometrics, Face recognition, 3D reconstruction
Abstract: Despite the wide usage of mugshot images in forensic applications, they are underutilized in existing automated face recognition systems. In this paper, we propose a novel mugshot-based arbitrary view face recognition method. Our approach reconstructs full 3D faces via cascaded regression in shape space with efficient seamless texture recovery. Unlike existing methods, it makes full use of the frontal and profile views available in mugshot images, and thus generates accurate and realistic 3D faces. Multi-view face images are synthesized from the reconstructed 3D faces to enlarge the gallery so that arbitrary view faces can be better recognized. Evaluation experiments were conducted on BFM and Multi-PIE databases by using state-of-the-art deep learning (DL) based face matchers. The results demonstrate the effectiveness of our proposed method and show that DL-based face matchers can benefit from mugshot images and the reconstructed 3D faces, especially for recognizing large off-angle faces.
|
|
15:20-17:20, Paper ThPMP.135 | |
Local Subclass Constraint for Facial Expression Recognition in the Wild |
Luo, Zimeng | Beijing Univ. of Posts and Telecommunications |
Hu, Jiani | Beijing Univ. of Posts and Telecommunications |
Deng, Weihong | Beijing Univ. of Posts and Telecommunications |
Keywords: Facial expression recognition, Deep learning
Abstract: Automated Facial Expression Recognition (FER) in the wild is still a challenging problem. Currently, most Deep Convolutional Neural Network (DCNN) based FER methods adopt the softmax cross-entropy loss to encourage the separability of inter-class features. Many deep embedding approaches (e.g., contrastive loss, triplet loss, center loss) have been extended to FER to enhance the discriminative ability of deep expression features and improve prediction. In this work, we present a novel deep embedding approach explicitly designed to respect the huge intra-class variation of expression features while learning discriminative expression features. We aim at forming a locally compact representation space structure by minimizing the distance between samples and their nearest subclass center. We demonstrate the effectiveness of this idea on the RAF (Real-world Affective Faces) database. The experimental results show that our approach can not only improve classification performance but also adaptively learn a locally compact and expression-intensity-aware feature space structure. We further extend our models to the Static Facial Expressions in the Wild (SFEW) dataset, and the results show the generalization ability of our approach.
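A hedged PyTorch sketch of a nearest-subclass-center penalty matching the idea stated above; the number of subclasses k and all sizes are illustrative, and the paper's exact loss formulation may differ.

```python
import torch

def local_subclass_loss(feats, labels, centers):
    """Each class owns k learnable subclass centers
    (centers: (num_classes, k, d)); every sample is pulled toward the
    *nearest* center of its own class, tolerating large intra-class
    variation across subclasses while keeping each subclass compact."""
    picked = centers[labels]                            # (B, k, d)
    d2 = ((feats.unsqueeze(1) - picked) ** 2).sum(-1)   # (B, k)
    return d2.min(dim=1).values.mean()

B, d, C, k = 32, 128, 7, 4            # 7 basic expressions, 4 subclasses each
feats = torch.randn(B, d, requires_grad=True)
labels = torch.randint(0, C, (B,))
centers = torch.nn.Parameter(torch.randn(C, k, d))
loss = local_subclass_loss(feats, labels, centers)
loss.backward()                        # added to softmax CE during training
print(float(loss))
```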
|
|
15:20-17:20, Paper ThPMP.136 | |
Minutia Matching Using 3D Pore Clouds |
Ksiaskiewcz Czovny, Raphael | IMAGO Res. Group - Univ. Federal Do Paraná |
Bellon, Olga Regina Pereira | IMAGO Res. Group - Univ. Federal Do Parana |
Silva, Luciano | Univ. Federal Do Parana, IMAGO Res. Group |
Gutierrez da Costa, Henrique Sergio | Univ. Federal Do Paraná |
Keywords: Fingerprint recognition
Abstract: This paper proposes a novel methodology for biometric identification of individuals using level-3 features (pores) extracted from 3D fingerprint images obtained through Optical Coherence Tomography (OCT). OCT fingerprint images contain detailed 3D information from both the dermis and the epidermis skin layers of the fingertips. Our approach first fetches and extracts pores around minutiae from the 3D fingerprint data, creating small structures called pore clouds. The correspondence of the pore clouds is then verified for all three possible fingerprint matchings: dermis-dermis, epidermis-epidermis, and dermis-epidermis. To this end, three different measures are extracted and compared: the Hausdorff distance, the Surface Interpenetration Measure, and the Root Mean Square Error. Experiments using 518 pore clouds achieved a Rank-1 recognition rate of 99.19% with an EER (Equal Error Rate) of 0.72%. To the best of our knowledge, this is the first time the identification of individuals using only 3D information from pores has been explored.
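Two of the three comparison measures are standard and easy to sketch; below, the symmetric Hausdorff distance (via scipy) and a nearest-neighbour RMSE between toy pore clouds. The Surface Interpenetration Measure is omitted, and the nearest-neighbour pairing is a simplification of the paper's registered comparison.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two pore clouds
    (A, B: (n, 3) arrays of 3D pore coordinates around a minutia)."""
    return max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])

def rmse_nearest(A, B):
    """RMSE over nearest-neighbour pairs, a simple stand-in for the
    registered point-to-point comparison."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.sqrt((d.min(axis=1) ** 2).mean())

rng = np.random.default_rng(0)
dermis = rng.random((80, 3))
epidermis = dermis + rng.normal(0, 0.01, dermis.shape)  # same finger + noise
print(hausdorff(dermis, epidermis), rmse_nearest(dermis, epidermis))
```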
|
|
15:20-17:20, Paper ThPMP.137 | |
Dual-Modality Talking-Metrics: 3D Visual-Audio Integrated Behaviometric Cues from Speakers |
Zhang, Jie | Beihang Univ |
Richmond, Korin | Univ. of Edinburgh |
Fisher, Robert | Univ. of Edinburgh |
Keywords: Multi-biometrics, Behavior recognition, 3D vision
Abstract: Face-based behaviometrics focus on dynamic biological signatures generated from face behaviors, which are informative and subject-specific for identity recognition. Most existing face behaviometrics rely on 2D visual features and are thus sensitive to pose or intensity variations. This paper presents a dual-modality behaviometrics algorithm (talking-metrics) that integrates 3D video and audio cues from a human face speaking a passphrase. Static and dynamic 3D face features are extracted algorithmically, and audio features are transformed through several learning models. We concatenate the top 18 discriminative 3D visual-audio features to represent the two modalities and use a linear discriminant analysis (LDA) classifier for identity recognition. The experiments were conducted on a newly released public dataset (S3DFM). Both qualitative feature distributions and quantitative comparison results show the feasibility of the proposed pipeline and its superiority over using each modality independently. A 98.5% cross-validation recognition rate over 60 subjects and 10 trials was achieved. An anti-spoofing test also demonstrates the robustness of the proposed method.
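The final classification stage is plain feature-level fusion into LDA; a minimal sklearn sketch with synthetic stand-ins for the 18 visual-audio features (all dimensions here are illustrative, scaled down from the paper's 60 subjects x 10 trials).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_subj, trials, d_vis, d_aud = 6, 10, 12, 6     # small stand-in setup
y = np.repeat(np.arange(n_subj), trials)
# subject-specific means mimic discriminative 3D-visual / audio features
vis = rng.normal(y[:, None], 1.0, (y.size, d_vis))
aud = rng.normal(y[:, None], 1.0, (y.size, d_aud))
X = np.hstack([vis, aud])                        # feature-level fusion
clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.score(X, y))
```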
|
|
15:20-17:20, Paper ThPMP.138 | |
SegDenseNet: Iris Segmentation for Pre and Post Cataract Surgery |
Lakra, Aditya | IIIT-Delhi |
Tripathi, Pavani | Indraprastha Inst. of Information Tech. Delhi |
Keshari, Rohit | IIIT Delhi |
Vatsa, Mayank | IIIT Delhi |
Singh, Richa | IIIT Delhi |
Keywords: Iris and ocular recognition
Abstract: Cataract is one of the major ophthalmic diseases worldwide and can potentially affect the performance of iris-based biometric systems. While existing research has shown that cataract does not have a major impact on iris recognition, our observations suggest that iris segmentation algorithms are not well equipped to handle cataract or post-cataract surgery cases, thereby affecting overall iris recognition performance. This paper presents an efficient iris segmentation algorithm that handles variations due to cataract and post-cataract surgery. The proposed algorithm, termed SegDenseNet, is a deep learning algorithm based on DenseNet. Experiments on the IIITD Cataract Surgery Database show that improving iris segmentation enhances recognition performance by up to 25% across different sensors and matchers.
|
|
15:20-17:20, Paper ThPMP.139 | |
Face Recognition for Newborns, Toddlers and Pre-School Children: A Deep Learning Approach |
Siddiqui, Sahar | Indraprastha Inst. of Information Tech. Delhi |
Vatsa, Mayank | IIIT Delhi |
Singh, Richa | IIIT Delhi |
Keywords: Face recognition
Abstract: Biometric recognition of newborns, toddlers, and pre-school children is an important research challenge, with applications in identifying newborn swapping, finding missing children, and disbursing benefits. In this research, we propose a representation learning algorithm that extracts unique and invariant features from face images of newborns and toddlers in order to design an efficient face recognition algorithm. Specifically, we propose a deep learning model that applies class-based penalties while learning the filters of a convolutional neural network. The proposed CNN architecture achieves a rank-1 identification accuracy of 62.7% for single-gallery newborn face recognition and 85.1% for single-gallery toddler face recognition, establishing state-of-the-art results for both databases. Comparison with several existing algorithms also showcases the effectiveness of the proposed algorithm on both databases.
|
|
15:20-17:20, Paper ThPMP.140 | |
Enhancing OCR Accuracy with Super Resolution |
Lat, Ankit | IIIT Hyderabad |
Jawahar, C. V. | IIIT |
Keywords: Document image processing, Character and text recognition, Performance evaluation
Abstract: The accuracy of OCR is often marred by the poor quality of the input document images. Generally this performance degradation is attributed to the resolution and quality of scanning. This calls for special efforts to improve the quality of document images before passing them to the OCR engine. One compelling option is to super-resolve these low-resolution document images before passing them to the OCR engine. In this work we address this problem by super-resolving document images using a Generative Adversarial Network (GAN). We propose a super-resolution-based preprocessing step that can enhance the accuracy of OCRs (including commercial ones). Our method is especially suited for printed document images. We validate its utility on a wide variety of document images (where fonts, styles, and languages vary) without any pre-processing step to adapt across situations. Our experiments show an improvement of up to 21% in OCR accuracy on test images scanned at low resolution. One immediate application is enhancing the recognition of historic documents that have been scanned at low resolutions.
|
|
15:20-17:20, Paper ThPMP.141 | |
Myocardial Scar Segmentation in LGE-MRI Using Fractal Analysis and Random Forest Classification |
Kurzendorfer, Tanja | Pattern Recognition Lab, Friedrich-Alexander-Univ. Erlangen |
Breininger, Katharina | Pattern Recognition Lab, Friedrich-Alexander-Univ. Erlangen |
Steidl, Stefan | Friedrich-Alexander-Univ. Erlangen-Nürnberg |
Brost, Alexander | Siemens Healthcare GmbH |
Forman, Christoph | Siemens Healthcare GmbH |
Maier, Andreas | Friedrich-Alexander-Univ. Erlangen-Nürnberg |
Keywords: Medical image and signal analysis, Computer-aided detection and diagnosis
Abstract: Late gadolinium enhanced magnetic resonance imaging (LGE-MRI) is the clinical gold standard for visualizing myocardial scarring. The gadolinium-based contrast agent accumulates in the damaged cells and leads to various enhancements in the LGE-MRI scan. The quantification of the scar tissue is very important for diagnosis, treatment planning, and guidance during the procedure. In clinical routine, the scar is often segmented manually. However, manual segmentation is prone to inter- and intra-observer variability and very time consuming. In this work a new texture-based scar quantification is proposed. For texture characterization, segmentation-based fractal analysis is used. First, the image is decomposed into a set of binary images by applying a two-threshold binary decomposition. Second, a set of features is extracted for each of the binary images, namely the fractal dimension, the mean gray value, and the size of the binary object. In addition, the local and global intensity of each patch is added to the feature vector. In the next step, the features are classified using a random forest classifier. The scar quantification is evaluated on 30 clinical LGE-MRI data sets. In addition, the results are compared to the x-fold standard deviation approach and the full-width-at-half-max method, both implemented in a fully automatic manner. The proposed scar quantification achieves a mean Dice coefficient of 0.64 ± 0.17 and outperforms the x-fold standard deviation approach.
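A hedged sketch of the feature extraction just described: a two-threshold binary decomposition of a patch, then per-slice fractal dimension (box counting), mean gray value, and object size; the number of levels and the box-counting details are our assumptions.

```python
import numpy as np

def box_counting_dim(mask):
    """Box-counting estimate of the fractal dimension of a non-empty
    binary mask (one slice of the two-threshold decomposition)."""
    n = mask.shape[0]
    sizes, counts = [], []
    s = n
    while s >= 2:
        # count boxes of side s containing at least one foreground pixel
        view = mask[:n - n % s, :n - n % s].reshape(n // s, s, n // s, s)
        counts.append(view.any(axis=(1, 3)).sum())
        sizes.append(s)
        s //= 2
    slope = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)[0]
    return slope

def two_threshold_stack(patch, n_levels=4):
    """Decompose a gray patch into binary slices t_low <= I < t_high."""
    ts = np.linspace(patch.min(), patch.max(), n_levels + 1)
    return [(patch >= lo) & (patch < hi) for lo, hi in zip(ts[:-1], ts[1:])]

rng = np.random.default_rng(0)
patch = rng.random((64, 64))
feats = []
for m in two_threshold_stack(patch):
    # fractal dimension, mean gray value inside the object, object size
    feats += [box_counting_dim(m), patch[m].mean(), int(m.sum())]
print(np.round(feats, 2))   # such vectors feed the random forest
```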
|
|
15:20-17:20, Paper ThPMP.142 | |
A Hybrid Deep Architecture for Robust Recognition of Text Lines of Degraded Printed Documents |
Biswas, Chandan | Indian Statistical Inst |
Mukherjee, Partha Sarathi | Indian Statistical Inst. Kolkata |
Ghosh, Koyel | Nopany Inst. of Management Studies, Kolkata |
Bhattacharya, Ujjwal | Indian Statistical Inst |
Parui, Swapan Kumar | Indian Statistical Inst |
Keywords: Document analysis systems, Document image processing, Character and text recognition
Abstract: During the last 20 years, significant research has been undertaken on the automatic recognition of printed documents. The same is true for Bangla, a major Indian script. These studies were mainly centered on comparatively well-behaved, good-quality printed documents. However, many large archives include significant volumes of older documents that are so degraded in their present form that they cannot be reasonably transcribed using existing OCR (Optical Character Recognition) approaches. On the other hand, automatic recognition of the printed contents of these documents has significant application potential, such as generation of descriptive metadata, full-text searching, and information extraction. The contributions of the present study are (i) the creation of a moderately large annotated database of degraded Bangla documents for recognition studies, (ii) the development of a Gaussian mixture model based strategy for extracting text components from the complex noisy backgrounds of such documents, and (iii) the development of a line-level recognition scheme for degraded Bangla documents. We study two different CNN-BLSTM-CTC hybrid architectures for this recognition problem. The winning architecture uses the first convolution layer of the CNN in a fashion similar to the inception model of deep learning methodologies.
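As context for the hybrid architecture named above, a minimal PyTorch CNN-BLSTM-CTC line recognizer; every layer size is illustrative, and the paper's winning variant additionally replaces the first convolution with an inception-style layer.

```python
import torch
import torch.nn as nn

class CnnBlstmCtc(nn.Module):
    """Generic CNN-BLSTM-CTC text-line recognizer (sizes illustrative)."""
    def __init__(self, n_classes, h=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.blstm = nn.LSTM(64 * (h // 4), 128, bidirectional=True,
                             batch_first=True)
        self.fc = nn.Linear(256, n_classes + 1)   # +1 for the CTC blank

    def forward(self, x):                 # x: (B, 1, H, W) line image
        f = self.cnn(x)                   # (B, C, H/4, W/4)
        B, C, H, W = f.shape
        f = f.permute(0, 3, 1, 2).reshape(B, W, C * H)  # one step per column
        out, _ = self.blstm(f)
        return self.fc(out).log_softmax(-1)             # (B, W, classes+1)

model = CnnBlstmCtc(n_classes=100, h=32)      # ~100 symbol classes, e.g. Bangla
x = torch.randn(2, 1, 32, 256)
logp = model(x).permute(1, 0, 2)              # (T, B, C) as nn.CTCLoss expects
targets = torch.randint(1, 101, (2, 20))
loss = nn.CTCLoss()(logp, targets,
                    torch.full((2,), logp.size(0), dtype=torch.long),
                    torch.full((2,), 20, dtype=torch.long))
print(float(loss))
```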
|
|
15:20-17:20, Paper ThPMP.143 | |
Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks |
Das, Arindam | Valeo |
Roy, Saikat | Univ. of Bonn |
Bhattacharya, Ujjwal | Indian Statistical Inst |
Parui, Swapan Kumar | Indian Statistical Inst |
Keywords: Applications of deep learning to document analysis, Deep learning, Neural networks
Abstract: In this article, a region-based Deep Convolutional Neural Network framework is proposed for document structure learning. The contributions of this work are the efficient training of region-based classifiers and effective ensembling for document image classification. A primary level of `inter-domain' transfer learning is applied by exporting weights from a VGG16 architecture pre-trained on the ImageNet dataset to train a document classifier on whole document images. Exploiting the nature of region-based influence modelling, a secondary level of `intra-domain' transfer learning is used for rapid training of deep learning models on image segments. Finally, stacked-generalization-based ensembling is used to combine the predictions of the base deep neural network models. The proposed method achieves a state-of-the-art accuracy of 92.21% on the popular RVL-CDIP document image dataset, exceeding benchmarks set by existing algorithms.
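A scaled-down sketch of the stacked-generalization step using sklearn's StackingClassifier; small MLPs stand in for the fine-tuned per-region VGG16 models, so this shows only the ensembling mechanics, not the paper's networks.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Each "region classifier" is a small MLP here; in the paper each base
# model is a fine-tuned VGG16 on one document region (header, body, ...).
base = [("region_%d" % i,
         MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                       random_state=i))
        for i in range(3)]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=500))
X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
print(stack.fit(Xtr, ytr).score(Xte, yte))
```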
|
|
ThPMOT2 |
309A, 3rd Floor |
ThPMOT2 Manifold and Feature Learning (309A, 3rd Floor) |
Oral Session |
|
17:20-17:40, Paper ThPMOT2.1 | |
A Unified Neighbor Reconstruction Method for Embeddings |
Zhang, Zhihong | Xiamen Univ |
Ye, Zhiling | Xiamen Univ |
Bai, Zhengjian | Xiamen Univ |
Hu, Guosheng | Anyvision Company |
Hu, Yiqun | Zhongshan Hospital Affiliated with Xiamen Univ |
Hancock, Edwin | Univ. of York |
Bai, Lu | Central Univ. of Finance and Ec |
Keywords: Manifold learning, Dimensionality reduction, Applications of pattern recognition and machine learning
Abstract: In this work we propose a novel and compact Neighbor Reconstruction Method (NRM), a unified pre-processing method for graph-based sparse spectral algorithms. The method performs vector operations on a central point and its corresponding neighbor points. NRM generates new neighbor points that capture the local space structure of the central point more appropriately than the original neighbor points. With NRM, a large number of sparse-spectral-based nonlinear feature extraction and selection algorithms gain significant improvement. Specifically, we embedded NRM in several classical algorithms, Local Linear Embedding (LLE), Laplacian Eigenmaps (LE), and Unsupervised Feature Selection for Multi-cluster Data (MCFS), with accuracy improvements of up to 7%, 2.6%, and 2.4% on the ORL, CIFAR-10, and MNIST data sets, respectively. We also apply NRM to a super-resolution algorithm, A+, and obtain a 0.12 dB improvement over the original method.
|
|
17:40-18:00, Paper ThPMOT2.2 | |
Flexible and Discriminative Non-Linear Embedding with Feature Selection for Image Classification |
Zhu, Ruifeng | Univ. Bourgogne Franche-Comté |
Dornaika, Fadi | Univ. of the Basque Country |
Ruichek, Yassine | Univ. De Tech. De Belfort-Montbéliard |
Keywords: Manifold learning, Dimensionality reduction, Sparse learning
Abstract: In the past years, various graph-based data embedding algorithms have been proposed and used in machine learning and pattern recognition. This paper introduces a graph-based non-linear embedding learning algorithm for image classification and recognition. The proposed embedding method can be used in both supervised and semi-supervised learning settings. The proposed criterion allows the simultaneous estimation of a linear and a non-linear embedding. It integrates manifold smoothness, sparse regression, and margin discriminant embedding. The deployed sparse regression implicitly performs feature selection on the original features of the data matrix and of the linear transform. The proposed method is applied to four image datasets: the 8 Sports Event Categories dataset, the Scene 15 dataset, the ORL Face dataset, and the COIL-20 Object dataset. The experiments demonstrate the effectiveness of the proposed embedding method.
|
|
18:00-18:20, Paper ThPMOT2.3 | |
Kernel Discriminant Correlation Analysis: Feature Level Fusion for Nonlinear Biometric Recognition |
Bai, Yang | Univ. of Miami |
Haghighat, Mohammad | Univ. of Miami |
Abdel-Mottaleb, Mohamed | Univ. of Miami |
Keywords: Support vector machine and kernel methods, Multi-biometrics, Dimensionality reduction
Abstract: In biometric recognition, feature fusion is an important area of research because multiple types of features contain richer and complementary information. Discriminant Correlation Analysis (DCA) is a recently proposed feature fusion method that incorporates class associations into correlation analysis, so that the features not only have maximal intrinsic correlation between feature sets but also carry class structure information. However, DCA is a linear technique that finds a linear transformation of the original space. For highly nonlinearly distributed data, classification with nonlinear techniques works better than with linear ones. In this paper, we propose Kernel-DCA, which generalizes DCA to handle nonlinear problems. Similar to Kernel-SVM, Kernel-DCA utilizes a kernel method to map feature sets to a high-dimensional space in which the features are linearly separable. Experimental results for the fusion of ear and face features, using the WVU database with large variations in pose, show that Kernel-DCA achieves better results on nonlinearly distributed data than DCA and other feature fusion methods.
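The kernel step that any Kernel-DCA-style method starts from can be sketched directly; the RBF kernel, gamma, and double-centering below are standard, while DCA's class-separability transform itself is omitted.

```python
import numpy as np

def centered_rbf_gram(X, gamma=0.5):
    """RBF Gram matrix over one feature set, double-centered so that
    subsequent correlation analysis operates on implicitly mean-centered
    mapped features (the common first step of kernelized CCA/DCA)."""
    sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq)
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return J @ K @ J

rng = np.random.default_rng(0)
ear_feats = rng.random((50, 24))           # one modality's feature set
K = centered_rbf_gram(ear_feats)
print(K.shape, abs(K.sum()) < 1e-6)        # rows/cols sum to ~0 after centering
```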
|
|
ThPMOT3 |
309B, 3rd Floor |
ThPMOT3 Behavior Recognition (309B, 3rd Floor) |
Oral Session |
|
17:20-17:40, Paper ThPMOT3.1 | |
Recognition of Infants' Gaze Behaviors and Emotions |
Yang, Bikun | Peking Univ |
Tong, Yuqiang | Peking Univ |
Cui, Jinshi | Key Lab of Machine Perception (MOE), Peking Univ |
Wang, Li | Peking Univ |
Zha, Hongbin | Peking Univ |
Keywords: Behavior recognition, Emotion recognition, Video processing and analysis
Abstract: This paper proposes a system for recognizing infants' gaze behaviors and emotions from video. Previous work held that the information in the eye region is crucial for gaze behavior recognition and that emotion recognition depends mostly on facial appearance. However, we find that infants express their intentions and emotions with their whole body, especially their head movements, because the differentiation of the body's parts is not yet complete. Therefore, we incorporate head pose information as features in gaze behavior recognition, and we extract gaze features to improve the recognition of infants' emotions. In addition, we combine several deep neural networks, which not only capture the details of the images very well but also make full use of temporal features. To recognize infants' gaze behaviors, we design a feature-extraction convolutional neural network (FE-CNN) that obtains features of the infants' gaze direction; we then feed these features, together with head pose, into a gaze behavior recurrent neural network (G-RNN). Moreover, we combine facial expression and gaze behavior features to characterize infants' emotions, and extend the system with an emotion recurrent neural network (E-RNN). In the end, we achieve recognition accuracies of 98.31% and 94.71%, respectively, on our dataset.
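A hedged PyTorch sketch of the FE-CNN + G-RNN pairing: frame embeddings concatenated with head pose and aggregated by a recurrent layer. All layer sizes, the GRU choice, and the 3-DoF pose format are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GazeBehaviorNet(nn.Module):
    """Small CNN embeds each frame; the embedding is concatenated with
    head pose (yaw, pitch, roll) and a GRU maps the sequence to one of
    the gaze-behavior classes."""
    def __init__(self, n_classes=4, emb=64):
        super().__init__()
        self.fe_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, emb))
        self.g_rnn = nn.GRU(emb + 3, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, frames, pose):     # frames: (B,T,3,H,W), pose: (B,T,3)
        B, T = frames.shape[:2]
        f = self.fe_cnn(frames.flatten(0, 1)).view(B, T, -1)
        out, _ = self.g_rnn(torch.cat([f, pose], dim=-1))
        return self.head(out[:, -1])

net = GazeBehaviorNet()
print(net(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 3)).shape)  # (2, 4)
```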
|
|
17:40-18:00, Paper ThPMOT3.2 | |
Multi-Modal Three Stream Network for Action Recognition |
Khalid, Muhammad Usman | TU Dortmund |
Yu, Jie | Computer Vision Res. Lab, Robert Bosch GmbH |
Keywords: Behavior recognition, Deep learning, Video analysis
Abstract: Human action recognition in video is an active yet challenging research topic due to the high variation and complexity of the data. In this paper, a novel video-based action recognition framework utilizing complementary cues is proposed to handle this complex problem. Inspired by the success of two-stream networks for action classification, additional pose features are studied and fused to enhance understanding of human action in a more abstract and semantic way. Toward practical use, not only ground-truth poses but also noisy estimated poses are incorporated in the framework with our proposed pre-processing module. The whole framework and each cue are evaluated on varied benchmark datasets such as JHMDB, sub-JHMDB, and Penn Action. Our results surpass the state-of-the-art performance on these datasets and show the strength of complementary cues.
|
|
18:00-18:20, Paper ThPMOT3.3 | |
Temporal Inception Architecture for Action Recognition with Convolutional Neural Networks |
Zhang, Wei | Sun Yat-Sen Univ |
Cen, Jiepeng | Sun Yat-Sen Univ |
Zheng, Huicheng | Sun Yat-Sen Univ |
Keywords: Behavior recognition, Deep learning, Neural networks
Abstract: Modeling appearance and short-term dynamic information is the mainstream strategy for action recognition based on deep learning. We consider it important to model the multi-scale temporal information, including both short-term information and long-term information, for action representation. In this paper, a novel temporal inception architecture (TIA) is proposed to solve this problem, which is a general structure that can be combined with multi-segment-based frameworks for action recognition. The TIA is composed of multiple spatial-temporal convolutional branches, in which the temporal information of different scales is extracted. Then feature maps of all branches are concatenated as the output of TIA. In our experiments, the TIA is embedded into temporal segment networks (TSN) to construct our temporal segment inception networks (TSIN) for action recognition tasks. Extensive experiments demonstrate that TSIN outperforms TSN and achieves the state-of-the-art performance on HMDB51 and UCF101.
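A minimal PyTorch sketch of the temporal-inception idea: parallel spatial-temporal convolution branches that differ only in temporal kernel size, concatenated along channels; branch widths and the kernel set are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TemporalInception(nn.Module):
    """Parallel spatial-temporal conv branches with different temporal
    kernel sizes, so short- and longer-term dynamics are extracted side
    by side and concatenated, inception-style but along the time axis."""
    def __init__(self, c_in, c_branch=16, t_kernels=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(c_in, c_branch, kernel_size=(t, 3, 3),
                      padding=(t // 2, 1, 1))
            for t in t_kernels)

    def forward(self, x):        # x: (B, C, T, H, W) segment features
        return torch.cat([b(x) for b in self.branches], dim=1)

tia = TemporalInception(c_in=8)
x = torch.randn(2, 8, 12, 28, 28)
print(tia(x).shape)              # (2, 48, 12, 28, 28): multi-scale in time
```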
|
|
ThPMOT4 |
311A, 3rd Floor |
ThPMOT4 Speech and Signal (311A, 3rd Floor) |
Oral Session |
|
17:20-17:40, Paper ThPMOT4.1 | |
Recurrent Neural Network Based Small-Footprint Wake-Up-Word Speech Recognition System with a Score Calibration Method |
Li, Chenxing | Inst. of Automation, Chinese Acad. of Sciences |
Zhu, Lei | AI Lab, Rokid Inc |
Xu, Shuang | Inst. of Automation, Chinese Acad. of Sciences |
Gao, Peng | AI Lab, Rokid Inc |
Xu, Bo | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Speech recognition, Audio and acoustic processing and analysis, Pattern recognition for human computer interaction
Abstract: In this paper, we propose a small-footprint wake-up-word speech recognition (WUWSR) system based on a long short-term memory (LSTM) recurrent neural network, and we design a novel back-end calibration scoring method named modified zero normalization (MZN). First, the LSTM is trained to predict the posterior probabilities of context-dependent states. Next, MZN is adopted to transform the posterior probabilities into normalized scores, which are then converted to a confidence score by dynamic programming. Finally, a wake-up word is recognized according to the confidence score. The WUWSR system can recognize multiple wake-up words and change them flexibly, and it guarantees low latency by omitting the decoding network. Equal error rate (EER) is adopted as the evaluation metric. Experimental results show that the proposed LSTM-based system achieves a 33.33% relative improvement over a baseline system based on a deep feed-forward neural network. Combining the front-end LSTM acoustic model with the back-end MZN method, our WUWSR system achieves a 51.92% relative improvement.
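The abstract does not define the "modification" in MZN, so the sketch below shows only plain zero (z-) normalization, the baseline MZN builds on, applied to a toy posterior stream; treat it as the calibration idea, not the paper's method.

```python
import numpy as np

def zero_norm(scores):
    """Plain zero normalization: map a per-keyword posterior stream onto
    a comparable scale before a dynamic-programming confidence is
    computed over it."""
    mu, sigma = scores.mean(), scores.std() + 1e-8
    return (scores - mu) / sigma

rng = np.random.default_rng(0)
posteriors = rng.random(200) * 0.3     # background frames
posteriors[90:110] += 0.6              # the wake-up word being spoken
z = zero_norm(posteriors)
print(z.max() > 3.0)                   # the peak stands out after normalization
```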
|
|
17:40-18:00, Paper ThPMOT4.2 | |
SSSD: Speech Scene Database by Smart Device for Visual Speech Recognition |
Saitoh, Takeshi | Kyushu Inst. of Tech |
Kubokawa, Michiko | Kyushu Inst. of Tech |
Keywords: Speech recognition, Pattern recognition for human computer interaction
Abstract: The speech scenes in conventional databases available for lip reading or visual speech recognition (VSR) were recorded with a video camera fixed on a tripod in a well-controlled environment. On the other hand, VSR is expected to be used as an interface on smart devices such as smartphones and tablets. Therefore, collecting speech scenes recorded with these devices is an important task for the practical use of VSR. In this paper, we collect Japanese word utterance scenes captured with smart devices and build a new publicly available database for VSR named the speech scene database by smart device (SSSD). Moreover, we apply an existing method to our database and report baseline recognition accuracy.
|
|
18:00-18:20, Paper ThPMOT4.3 | |
New Singular Value Decomposition Algorithm for Octonion Signals |
Shen, Miaomiao | Shanghai Univ |
Rui, Wang | Shanghai Univ |
Keywords: Image processing and analysis, Signal analysis, Image quality assessment
Abstract: The singular value decomposition (SVD) is considered one of the most powerful tools in numerical algebra and has seen great success in a wide range of image processing tasks, such as principal component analysis, linear discriminant analysis, and sparse representation. However, existing SVD algorithms cannot be directly applied to octonion signals. In this paper, we propose a novel singular value decomposition algorithm for octonion signals, named OSVD. First, a new real representation is formed from the components of the original octonion signal, and the real SVD of the resulting real matrix is computed. Then, by selecting the several largest singular values and the corresponding vectors in both the left and right unitary matrices, the octonion signal can be reconstructed successfully. Denoising experiments on a multispectral image with seven spectral channels demonstrate that our proposed algorithm significantly outperforms existing state-of-the-art algorithms in both quantitative and visual performance.
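A hedged numpy sketch of the OSVD workflow as summarized above: stack the 8 real component matrices, take a real SVD, truncate, and read the components back. The paper's actual real representation is a specific block construction; plain stacking is our simplification.

```python
import numpy as np

def osvd_denoise(channels, rank):
    """Truncated real SVD over a simple real representation of an
    octonion-valued image (channels: list of 8 (m, n) component arrays)."""
    O = np.vstack(channels)                 # (8m, n) real representation
    U, s, Vt = np.linalg.svd(O, full_matrices=False)
    O_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    m = channels[0].shape[0]
    return [O_hat[i * m:(i + 1) * m] for i in range(8)]

rng = np.random.default_rng(0)
base = rng.random((32, 32))
noisy = [base * (i + 1) + 0.05 * rng.standard_normal((32, 32))
         for i in range(8)]                 # 8 correlated "components" + noise
den = osvd_denoise(noisy, rank=4)
# truncation suppresses the noise orthogonal to the shared low-rank structure
print(np.linalg.norm(den[0] - base) < np.linalg.norm(noisy[0] - base))
```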
|
| |