Last updated on August 12, 2018. This conference program is tentative and subject to change
Technical Program for Wednesday August 22, 2018

WeAMOT1 Deep Learning 2 (Ballroom C, 1st Floor)
Oral Session

11:10-11:30, Paper WeAMOT1.1
Deep Temporal Feature Encoding for Action Recognition
Li, Lin | CASIA |
Zhang, Zhaoxiang | Inst. of Automation, Chinese Acad. of Sciences |
Huang, Yan | Inst. of Automation, Chinese Acad. of Sciences |
Wang, Liang | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Deep learning, Video processing and analysis, Human behavior analysis
Abstract: Human action recognition is an important task in computer vision. Recently, deep learning methods for video action recognition have developed rapidly. A popular way to tackle this problem is the two-stream approach, which takes both spatial and temporal modalities into consideration. These methods often treat sparsely-sampled frames as input and video labels as supervision. Because of this sampling strategy, they are typically limited to processing short sequences, which can cause problems such as confusion from partial observation. In this paper we propose a novel video feature representation method, called Deep Temporal Feature Encoding (DTE), which aggregates frame-level features into a robust and global video-level representation. Firstly, we sample sufficient RGB frames and optical flow stacks across the whole video. Then we use a deep temporal feature encoding layer to construct a strong video feature. Lastly, end-to-end training is applied so that our video representation is global and sequence-aware. Comprehensive experiments are conducted on two public datasets: HMDB51 and UCF101. Experimental results demonstrate that DTE achieves competitive state-of-the-art performance on both datasets.
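
The abstract does not give the exact form of the deep temporal feature encoding layer, so the following PyTorch sketch uses a learned attention-weighted pooling as a generic stand-in for aggregating frame-level features, sampled across the whole video, into one video-level vector.

```python
import torch
import torch.nn as nn

class TemporalFeatureEncoder(nn.Module):
    """Aggregates frame-level features into a single video-level feature.

    A generic stand-in for the paper's encoding layer: learned
    attention-weighted pooling over densely sampled frames.
    """
    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one relevance score per frame

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, feat_dim), sampled across the whole video.
        weights = torch.softmax(self.score(frame_feats), dim=0)
        # The weighted sum yields a global video-level representation that can
        # be trained end-to-end together with the frame feature extractor.
        return (weights * frame_feats).sum(dim=0)
```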

11:30-11:50, Paper WeAMOT1.2
Learning an Order Preserving Image Similarity through Deep Ranking
Gupta, Nitin | IBM Res |
Mujumdar, Shashank | IBM Res. India |
Samanta, Suranjana | IBM Res |
Mehta, Sameep | IBM Res |
Keywords: Deep learning, Applications of computer vision, Multimedia analysis, indexing and retrieval
Abstract: Recently, deep learning frameworks have been shown to learn a feature embedding that captures fine-grained image similarity using image triplets or quadruplets that consider pairwise relationships between image pairs. In real-world datasets, a class contains fine-grained categories that exhibit within-class variability. In such a scenario, these frameworks fail to learn the relative ordering among (i) samples belonging to the same category, (ii) samples from a different category within a class, and (iii) samples belonging to a different class. In this paper, we propose the quadlet loss function, which learns an order-preserving fine-grained image similarity through quadlets (query: q, positive: p, intermediate: i, negative: n), where p is sampled from the same category as q, i belongs to a fine-grained category within the class of q, and n is sampled from a different class than that of q. We propose a deep quadlet network to learn the feature embedding using the quadlet loss function. We present an extensive evaluation of our proposed ranking model against state-of-the-art baselines on three datasets with fine-grained categorization. The results show significant improvement over the baselines for both the order-preserving fine-grained ranking task and the general image ranking task.
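
The ordering constraint implied by the abstract, d(q,p) < d(q,i) < d(q,n), can be written as two hinge terms. The sketch below is one plausible PyTorch rendering with illustrative margins; the paper's exact loss may differ.

```python
import torch.nn.functional as F

def quadlet_loss(q, p, i, n, margin1=0.2, margin2=0.2):
    """Order-preserving ranking loss over embedded quadlets (q, p, i, n),
    each a batch of embedding vectors of shape (N, D)."""
    d_qp = F.pairwise_distance(q, p)  # same category as q
    d_qi = F.pairwise_distance(q, i)  # same class, different fine-grained category
    d_qn = F.pairwise_distance(q, n)  # different class
    # Enforce d(q,p) + margin1 < d(q,i) and d(q,i) + margin2 < d(q,n).
    return (F.relu(d_qp - d_qi + margin1) + F.relu(d_qi - d_qn + margin2)).mean()
```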

11:50-12:10, Paper WeAMOT1.3
Anomaly Detection Via Minimum Likelihood Generative Adversarial Networks
Wang, Chu | Inst. of Automation, Chinese Acad. of Sciences |
Zhang, Yan-Ming | Inst. of Automation, Chinese Acad. of Sciences |
Liu, Cheng-Lin | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Deep learning, Data mining
Abstract: Anomaly detection aims to detect abnormal events using a model of normality. It plays an important role in many domains such as network intrusion detection and criminal activity identification. With the rapidly growing size of accessible training data and high computation capacities, deep learning based anomaly detection has become more and more popular. In this paper, a new domain-based anomaly detection method based on generative adversarial networks (GAN) is proposed. Minimum likelihood regularization is introduced to make the generator produce more anomalies and prevent it from converging to the normal data distribution. A proper ensemble of anomaly scores is shown to effectively improve the stability of the discriminator. The proposed method achieves significant improvement over other anomaly detection methods on the CIFAR-10 and UCI datasets.

12:10-12:30, Paper WeAMOT1.4
Deep Generative Adversarial Networks for the Sparse Signal Denoising
Wu, Kailun | Tsinghua Univ |
Zhang, Changshui | Tsinghua Univ |
Keywords: Sparse learning, Neural networks, Deep learning
Abstract: In many practical denoising problems, the noisy signal contains a great deal of sparse information, which is helpful for denoising. However, common denoising methods such as low-pass filtering and wavelet denoising have a number of limitations, such as a rigid projection space or the loss of high-frequency components. In this paper, we propose a deep learning framework based on Generative Adversarial Networks (GANs) to deal with sparse denoising tasks. We design the Generative Network (G-net) as the denoising model with three parts: a decoder part, a denoising part, and a linear recovery part. To maintain the original features of the data, we utilize the Discriminator Network (D-net) to help the denoising model G-net learn. The experimental results show that our framework is more effective than some traditional methods and state-of-the-art deep learning methods. In particular, the sparse denoising GAN recovers picture details better on the MNIST image tasks.

WeAMOT2 Low Level Vision (309B, 3rd Floor)
Oral Session

11:10-11:30, Paper WeAMOT2.1
Spatially Coherent Matching for Robust Registration
Wang, Gang | Shanghai Univ. of Finance and Economics
Chen, Yufei | Tongji Univ |
Zhang, Haotian | Tongji Univ |
Keywords: Multiple view geometry, Graph matching, Image processing and analysis
Abstract: To solve the registration problem, we propose a robust method called Spatially Coherent Matching (SCM), which recovers the underlying correspondences from given putative sets of feature points for robust matching and estimates the transformation for robust registration. Recovering correct matches and fitting transformations between image pairs are key components in the field of pattern recognition. The proposed SCM starts with a putative correspondence set which is contaminated by degradations (e.g., occlusion, deformation, rotation, and outliers), and the main goal is to identify the true correspondences and estimate the underlying transformation. We then formulate this challenging problem with a spatially coherent matching model comprising a robust exponential distance loss and a spatial constraint. Based on regularization theory, SCM preserves the topological structure of adjacent features. Moreover, a sparse approximation strategy is used to improve efficiency. Finally, the experimental results reveal that the proposed method outperforms current state-of-the-art methods in most test scenarios on several real image datasets and synthesized datasets.

11:30-11:50, Paper WeAMOT2.2
Masked Label Learning for Optical Flow Regression
Yang, Guorun | Tsinghua Univ |
Deng, Zhidong | Tsinghua Univ |
Wang, Shiyao | School |
Li, Zeping | Tsinghua Univ |
Keywords: Motion and tracking, Low-level vision, Multiple view geometry
Abstract: Optical flow estimation is a challenging task in computer vision. Recent methods formulate this task as a supervised learning problem, but they often suffer from limited realistic ground truth. In this paper, a compact network, embedded with a cost volume, a residual encoder and a deconvolutional decoder, is presented to regress optical flow in an end-to-end manner. To overcome the lack of flow labels, we propose a novel data-driven strategy called masked label learning, in which a large number of masked labels are generated from the FlowNet 2.0 model and filtered by warping calibration for model training. We also present an extended-Huber loss to handle large displacements. With pretraining on massive masked flow data, followed by finetuning on a small number of sparse labels, our method achieves state-of-the-art accuracy on the KITTI flow benchmark.
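
The abstract names an "extended-Huber loss" for large displacements without defining it. As a plausible stand-in, the sketch below applies the classic Huber loss to the per-pixel flow end-point error: quadratic near zero, linear for large errors.

```python
import torch

def huber_flow_loss(pred_flow, gt_flow, delta=1.0):
    """Huber-style robust loss on the flow end-point error.

    Sketch only: the paper's 'extended' variant is not specified in the
    abstract, so this is the standard Huber loss on per-pixel EPE.
    Flow tensors have shape (N, 2, H, W).
    """
    epe = torch.norm(pred_flow - gt_flow, dim=1)   # per-pixel end-point error
    quad = 0.5 * epe ** 2                          # used where error <= delta
    lin = delta * (epe - 0.5 * delta)              # linear tail for large errors
    return torch.where(epe <= delta, quad, lin).mean()
```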

11:50-12:10, Paper WeAMOT2.3
Saliency Guided Fast Interpolation for Large Displacement Optical Flow
Zu, Yueran | Beihang Univ |
Gao, Ke | Inst. of Computing Tech. of Chinese Acad. of Sciences |
Bao, Xiuguo | CNCERT |
Tang, Wenzhong | Beihang Univ |
Keywords: Motion and tracking, Low-level vision, Video processing and analysis
Abstract: Optical flow estimation is still an open problem, and one of its bottlenecks is interpolation speed. In this paper, a saliency-guided fast interpolation method is proposed that is about two times faster than the traditional one. The method runs on a CPU without any supervision or semantic segmentation information. To make it faster, a fast saliency detection method is introduced to separate the image into two parts: non-salient superpixels are interpolated faster with random search only, while salient superpixels are interpolated by propagation and random search. To keep it accurate, the relative initial movement is used to guide the search area when computing the affine model, and a soft affine model evaluation is introduced to make the optical flow result more robust. Extensive experiments on the challenging MPI-Sintel and KITTI-15 datasets show that our method is efficient and effective.

12:10-12:30, Paper WeAMOT2.4
BTF Compound Texture Model with Non-Parametric Control Field
Haindl, Michael | Inst. of Information Theory and Automation |
Havlicek, Vojtech | Inst. of Information Theory and Automation |
Keywords: Illumination and reflectance modeling, Image based modeling, Physics-based vision
Abstract: This paper introduces a novel multidimensional statistical model for realistic modeling, enlargement, editing, and compression of the state-of-the-art bidirectional texture function (BTF) textural representation. The presented multispectral compound Markov random field model (CMRF) efficiently fuses a non-parametric random field model with several parametric random field models. The primary purpose of our texture modeling approach is to reproduce, compress, and enlarge a given measured natural or artificial texture image so that, ideally, the natural and synthetic textures are visually indiscernible for any observation or illumination direction. The model can also easily be applied to BTF material texture editing. The CMRF model consists of several parametric sub-models, each having different characteristics, along with an underlying switching structure model which controls transitions between these sub-models. The proposed model uses the non-parametric random field to distribute local texture models, in the form of an analytically solvable wide-sense BTF Markov representation for single regions, among the fields of a mosaic approximated by the random field structure model. The non-parametric control field of BTF-CMRF is reiteratively generated to guarantee identical region-size histograms for all material sub-classes present in the target example texture. The local texture regions (not necessarily continuous) are represented by analytical BTF models based on the adaptive 3D causal auto-regressive (3DCAR) random field model, which can be analytically estimated as well as synthesized. The visual quality of the resulting complex synthetic textures generally surpasses the outputs of previously published simpler non-compound BTF-MRF models. The model reaches huge compression ratios incomparable with any standard image compression method.

WeAMOT3 Image Analysis and Segmentation (310, 3rd Floor)
Oral Session

11:10-11:30, Paper WeAMOT3.1
Convexity Invariance of Voxel Objects under Rigid Motions
Ngo, Phuc | LORIA - Lorraine Univ |
Passat, Nicolas | Univ. De Reims Champagne-Ardenne |
Kenmochi, Yukiko | Univ. Paris-Est |
Debled-Rennesson, Isabelle | LORIA - Nancy Univ |
Keywords: Image processing and analysis, Shape modeling and encoding, Vision for graphics
Abstract: Volume data can be represented by voxels. In many applications of computer graphics (e.g., animation, simulation) and image processing (e.g., shape registration), such voxel data require manipulation. Among the simplest manipulations, we are interested in rigid motions, namely motions that do not change the shape of voxel objects but do change their position and orientation. Such motions are well known as isometric transformations in continuous spaces. However, when they are applied to voxel data, some important geometric and topological properties are generally lost. In this article, we discuss this issue, and we provide a method for rigid motions of voxel objects that preserves the global convexity properties of objects, with digital topology guarantees. This method is based on the standard notion of H-convexity and a new notion of quasi-regularity.

11:30-11:50, Paper WeAMOT3.2
Multi-Scale Cross-Band Encoding of Sectored Local Binary Pattern for Robust Texture Classification
Song, Tiecheng | Chongqing Univ. of Posts and Telecommunications |
Luo, Lin | Chongqing Univ. of Posts and Telecommunications |
Xin, Liangliang | Chongqing Univ. of Posts and Telecommunications |
Gao, Chenqiang | School of Communication and Information Engineering, Chongqing Univ. of Posts and Telecommunications
Keywords: Texture analysis, Image classification, Classification
Abstract: The original Local Binary Pattern (LBP) has limited discriminative power and is sensitive to noise. In view of this, this paper proposes a novel image descriptor called Multi-Scale Cross-Band Encoding of Sectored Local Binary Pattern (MCE-SLBP) for robust texture classification. First, pyramid decomposition is used to obtain multi-scale low-frequency and high-frequency (difference) images. To encode more discriminative features, the high-frequency images are further decomposed into positive and negative high-frequency images via polarity splitting. Then, a robust Sectored Local Binary Pattern (SLBP) is proposed to compute texture feature codes on the decomposed images via cross-band joint coding. Finally, a multi-scale histogram representation is obtained by concatenating the histograms of texture codes computed at all decomposition levels. Experiments on three benchmark texture databases (i.e., Outex, Brodatz and CUReT) demonstrate that the proposed method achieves state-of-the-art classification accuracy both under noise-free conditions and in the presence of different levels of Gaussian noise.

11:50-12:10, Paper WeAMOT3.3
Locality Preserving Discriminative Complex-Valued Latent Variable Model
Chen, Sih-Huei | National Central Univ |
Lee, Yuan-Shan | National Central Univ |
Wang, Jia-Ching | National Central Univ |
Keywords: Emotion recognition, Image classification, Dimensionality reduction
Abstract: Techniques for analyzing complex-valued data are required in numerous fields, such as signal processing. This work develops a novel complex-valued latent variable model, named the locality-preserving discriminative complex-valued Gaussian process latent variable model (LPD-CGPLVM), for discovering a compressed complex-valued representation of data. The developed LPD-CGPLVM operates in the complex-valued domain. Additionally, we attempt to preserve both global and local data structures while promoting discrimination. A new objective function that imposes a locality-preserving and a discriminative term for complex-valued data is presented. Complex-valued gradient descent is then utilized to obtain a complex-valued representation of high-dimensional data and the hyperparameters of the LPD-CGPLVM. The proposed method was evaluated on two pattern recognition applications: face recognition with occlusion and music emotion recognition. The experimental results demonstrate the superior accuracy of the proposed method, especially in situations with only a small amount of training data.

12:10-12:30, Paper WeAMOT3.4
Segmentation Edit Distance
Pucher, Daniel | TU Wien |
Kropatsch, Walter | TU Vienna |
Keywords: Image processing and analysis, Segmentation, features and descriptors
Abstract: In this paper, we present a novel distance metric called Segmentation Edit Distance (SED) and its use as a segmentation evaluation metric. In segmentation evaluation, the difference or distance between a test segmentation and the associated ground truth segmentation is measured in order to compare different algorithms. Our proposed edit distance extends the idea of other edit distances, such as the string edit distance or the graph edit distance, to the domain of image segmentations. The distance is based on the cost of edit operations that are needed to transform one segmentation into another. Only one edit operation, the deletion of an error region, is considered. Unlike other edit distances, the costs assigned to this operation are based on properties of the error regions and the image processing method used to delete a region. As a segmentation evaluation metric, it combines the assessment of accuracy and efficiency into a single metric. Evaluations on synthetic and real-world data show promising results compared to other state-of-the-art segmentation evaluation metrics.

WeAMOT4 Medical Image Analysis (311A, 3rd Floor)
Oral Session

11:10-11:30, Paper WeAMOT4.1
An Automated Airway Segmentation Algorithm for CT Images Using Topological Leakage Detection and Volume Freezing
Nadeem, Syed Ahmed | Univ. of Iowa |
Hoffman, Eric A | Univ. of Iowa Carver Coll. of Medicine |
Saha, Punam Kumar | Univ. of Iowa |
Keywords: Medical image and signal analysis, Segmentation, features and descriptors
Abstract: Numerous multi-center studies related to chronic obstructive pulmonary disease use computed tomography (CT) based characterization of the lung parenchyma and bronchial tree to understand the disease's status and progression. To our knowledge, there are no fully automated methods for airway tree segmentation that do not require post-segmentation manual review and intervention. In this paper, we present a novel CT-based airway tree segmentation algorithm using topological leakage detection and volume freezing. The method is fully automated, requiring no manual inputs or post-segmentation editing. It uses intensity-based connectivity and novel approaches of leakage detection and volume freezing to iteratively grow an airway tree starting from an initial seed inside the trachea. It begins with a conservative threshold and then iteratively shifts towards more generous threshold values. The method was applied to chest CT scans of ten non-smoking healthy subjects at total lung capacity, and the results were highly promising, with no visual segmentation leakages.

11:30-11:50, Paper WeAMOT4.2
A Method for PET-CT Lung Cancer Segmentation Based on Improved Random Walk
Liu, Zhe | School of Computer Science and Communication Engineering, Jiangsu Univ
Song, Yuqing | School of Computer Science and Communication Engineering, Jiangsu Univ
Maere, Charlie | Jiangsu Univ |
Liu, Qingfeng | School of Computer Science and Communication Engineering, Jiangsu Univ
Zhu, Yan | Affiliated Hospital of Jiangsu Univ |
Lu, Hu | Fudan Univ |
Yuan, Deqi | Zhenjiang First People's Hospital Branch |
Keywords: Medical image and signal analysis, Image processing and analysis, Computer-aided detection and diagnosis
Abstract: Segmentation methods that work on a single imaging modality usually suffer from the low spatial resolution of positron emission tomography (PET) or the low contrast of computed tomography (CT) when the tumor region is inhomogeneous or not obvious. To address this problem, we develop a segmentation method that combines the complementary strengths of PET and CT. Firstly, initial contours are obtained by pre-segmentation of the PET images using region growing and mathematical morphology. The initial contours can be used to automatically obtain the seed points required for random walk on the PET and CT images; at the same time, they can also be used as a constraint in the random walk on the CT images to overcome the shortcoming that tumor areas are not obvious if the CT images have not been enhanced. Because CT provides essential details on anatomic structures, these structures can be used to improve the weights of the random walk on the PET images. Finally, the similarity matrices obtained by random walk on the PET and CT images are weighted to obtain identical results on both modalities. Our method achieves an average DSC of 0.8456±0.0703 on 14 patients with lung cancer, and performs much better when the tumors are inhomogeneous in the PET images and not obvious in the CT images.

11:50-12:10, Paper WeAMOT4.3
Medical Knowledge Constrained Semantic Breast Ultrasound Image Segmentation
Huang, Kuan | Utah State Univ |
Cheng, Heng-Da | Utah State Univ |
Zhang, Yingtao | Harbin Inst. of Tech |
Zhang, Boyu | Utah State Univ |
Xing, Ping | The First Affiliated Hospital of Harbin Medical Univ |
Ning, Chunping | The Affiliated Hospital of Qingdao Univ |
Keywords: Medical image and signal analysis, Segmentation, features and descriptors, Computer-aided detection and diagnosis
Abstract: Computer-aided diagnosis (CAD) can help doctors diagnose breast cancer. Breast ultrasound (BUS) imaging is harmless, effective, portable, and the most popular modality for breast cancer detection/diagnosis. Many researchers work on improving the performance of CAD systems. However, there are two main shortcomings: (1) most of the existing methods are based on the prerequisite that there is one and only one tumor in the image; and (2) the results depend on the datasets, i.e., an algorithm may obtain different performance on different datasets, implying that the performance of traditional methods is dataset-dependent. In this paper, we propose an effective approach: (1) using information-extended images to train a fully convolutional network (FCN) to semantically segment a BUS image into 3 categories: mammary layer, tumor, and background; and (2) applying layer structure information - breast cancers are located inside the mammary layer - to the conditional random field (CRF) for conducting breast cancer segmentation and making the segmentation result more accurate. The proposed method is evaluated on BUS images of 325 cases, and the results are the best compared with those of existing methods, achieving a true positive rate of 92.80%, a false positive rate of 9%, and an Intersection over Union of 82.11%. The proposed approach solves the two above-mentioned shortcomings of existing methods.

12:10-12:30, Paper WeAMOT4.4
Interactive Segmentation of Glioblastoma for Post-Surgical Treatment Follow-Up
Dhara, Ashis Kumar | Centre for Image Analysis, Department of Information Tech., Uppsala Univ
Arvids, Erik | Uppsala Univ |
Fahlström, Markus | Department of Surgical Sciences, Radiology, Uppsala Univ |
Wikström, Johan | Department of Surgical Sciences, Radiology, Uppsala Univ |
Larsson, Elna-Marie | Department of Surgical Sciences, Radiology, Uppsala Univ |
Strand, Robin | Uppsala Univ |
Keywords: Medical image and signal analysis, Computer-aided detection and diagnosis, Deep learning
Abstract: In this paper, we present a novel framework for interactive segmentation of glioblastoma in contrast-enhanced T1-weighted magnetic resonance images. A U-net-based fully convolutional network is combined with an interactive refinement technique. Initial segmentation of the brain tumor is performed using U-net, and the result is further improved by including complex foreground regions or removing background regions in an iterative manner. The method is evaluated on a research database containing post-operative glioblastoma of 15 patients. Radiologists can refine initial segmentation results in about 90 seconds, which is well below the time of interactive segmentation from scratch using state-of-the-art interactive segmentation tools. The experiments revealed that the segmentation results (Dice score) before and after the interaction step (performed by expert users) are similar. This is most likely due to the limited information in the contrast-enhanced T1-weighted magnetic resonance images used for evaluation. The proposed method is computationally fast and efficient, and could be useful for post-surgical treatment follow-up.

WePMP Poster Session, Coffee Break (North Foyer & Park View Foyer, 3rd Floor)
Poster Session

15:00-17:00, Paper WePMP.1
Common Random Subgraph Modeling Using Multiple Instance Learning
Xu, Tao | Univ. of Guelph |
Chiu, David K.Y. | Univ. of Guelph |
Gondra, Iker | St. Francis Xavier Univ |
Keywords: Probabilistic graphical model, Data mining, Graph matching
Abstract: In balancing information that is typical of a class and discriminatory between classes, we aim at synthesizing a common random subgraph (CRSG) model from an ensemble of attributed graph data. The common random subgraph model incorporates both structural and probabilistic information of the data that is common to a class in the ensemble, while multiple instance learning provides an effective process for handling a large number of samples and is tolerant of substantial irrelevant graph elements. The proposed two-level multiple instance learning compares the data between graphs at one level (as bags of instances) and takes into account structural relationships between graph elements (as instances) at the other level. The method is evaluated using benchmark structural datasets taken from the IAM graph repository. The experimental results show that the method can generate a meaningful and informative common random subgraph model of a class and is also effective when applied to classification tasks discriminating between classes.

15:00-17:00, Paper WePMP.2
Fast Descriptor Extraction for Contextless 3D Registration Using a Fully Convolutional Network
Garrett, Timothy | Iowa State Univ |
Radkowski, Rafael | Iowa State Univ |
Keywords: Deep learning, Segmentation, features and descriptors, 3D vision
Abstract: In recent years, numerous consumer devices have emerged that are capable of capturing 3D point data originating from depth images. Many computer vision tasks, such as object recognition, environment mapping, augmented reality, and more, rely on accurately registering 3D point sets. One method to compute this registration is to use 3D local feature descriptors for a coarse alignment, and to further refine the alignment with a variant of the Iterative Closest Point algorithm. While robust feature descriptors work well for this approach, online computation for all points in a single depth image is typically intractable. In this work, a method to facilitate real-time 3D registration by performing descriptor extraction on depth images using a Fully Convolutional Network (FCN) is presented. The method takes a raw depth image as input and produces a 33-bin descriptor for each pixel, enabling a general-purpose 3D registration process that does not require future network retraining and refinement. Experimental results on consumer hardware demonstrate that the proposed method significantly outperforms the state-of-the-art in terms of computation time and approaches depth sensor frame capture times, with only a slight reduction in descriptiveness.

15:00-17:00, Paper WePMP.3
Rotational Invariant Discriminant Subspace Learning for Image Classification
Ye, Qiaolin | Nanjing Univ. of Science and Tech |
Zhang, Zhao | City Univ. of Hong Kong |
Keywords: Dimensionality reduction
Abstract: A novel discriminant analysis technique for feature extraction, referred to as Robust Discriminant Subspace (RDS) learning with L2,p-/L2,s-norm distance maximization-minimization (max-min), is proposed. In its objective, the within-class and between-class distances are measured by the L2,p-norm and L2,s-norm, respectively, such that it is robust and rotationally invariant. An efficient, non-greedy iterative algorithm is designed to solve the resulting objective. These characteristics make RDS more intuitive and powerful than previous efforts. We also conduct insightful analysis on the convergence of the proposed algorithm. Theoretical insights and the effectiveness of RDS are further supported by promising experimental results on several image databases.

15:00-17:00, Paper WePMP.4
FReLU: Flexible Rectified Linear Units for Improving Convolutional Neural Networks
Qiu, Suo | South China Univ. of Tech |
Xu, Xiangmin | South China Univ. of Tech |
Cai, Bolun | South China Univ. of Tech |
Keywords: Deep learning, Image classification
Abstract: The rectified linear unit (ReLU) is a widely used activation function for deep convolutional neural networks. However, because of its zero-hard rectification, ReLU networks lose the benefits of negative values. In this paper, we propose a novel activation function called the flexible rectified linear unit (FReLU) to further explore the effects of negative values. By redesigning the rectified point of ReLU as a learnable parameter, FReLU expands the states of the activation output. When a network is successfully trained, FReLU tends to converge to a negative value, which improves expressiveness and thus performance. Furthermore, FReLU is designed to be simple and effective, without exponential functions, to maintain low-cost computation. FReLU is self-adaptive and does not rely on strict assumptions, so it can easily be used in various network architectures. We evaluate FReLU on three standard image classification datasets: CIFAR-10, CIFAR-100, and ImageNet. Experimental results show that FReLU achieves fast convergence and competitive performance on both plain and residual networks.
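
The activation itself is simple to state: FReLU shifts the rectified output by a learnable offset, so the unit can produce negative values once the offset converges below zero. A minimal PyTorch sketch follows; the per-channel parameterization and zero initialization are assumptions, as the abstract does not fix them.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """Flexible ReLU: relu(x) + b, with a learnable offset b.

    Sketch of the idea in the abstract, assuming one offset per channel
    of a 4-D (N, C, H, W) input, initialized at zero.
    """
    def __init__(self, num_channels: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shifting the rectified output lets activations go negative
        # whenever the learned offset converges below zero.
        return torch.relu(x) + self.bias
```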

15:00-17:00, Paper WePMP.5
Adaptive Latent Representation for Multi-View Subspace Learning
Zhang, Yuemei | Xidian Univ |
Wang, Xiumei | Xidian Univ |
Gao, Xinbo | Xidian Univ |
Keywords: Multiview learning, Clustering, Dimensionality reduction
Abstract: Multi-view datasets contain rich information because different views describe different aspects of the data. In fact, the structure of such a dataset is embedded in a union of certain low-dimensional subspaces, so multi-view subspace learning is a powerful technique for finding the underlying structure and clustering data points correctly. In practice, the information contained in different views differs, and the original data may be noisy. However, most existing multi-view subspace clustering methods treat each view equally and learn the self-expressiveness coefficient matrix of each view on the original data, which can decrease clustering performance. To solve these problems, we propose a new method that learns a latent representation in an adaptive way. Meanwhile, the local geometrical structure is maintained on the latent representation by graph regularization. At the same time, a basic subspace clustering method is performed on the latent representation to obtain the self-expressiveness coefficient matrix. We formulate the above problems in a unified optimization framework. Experimental results on several real-world datasets show the effectiveness of the proposed method.

15:00-17:00, Paper WePMP.6
Appearance-Based Data Augmentation for Image Datasets Using Contrast Preserving Sampling
Merchant, Alishan | NUCES |
Syed, Tahir | NUCES |
Khan, Behraj | NUCES |
Keywords: Deep learning, Learning-based vision, Object detection
Abstract: Data augmentation techniques have been employed to overcome the problem of model over-fitting in deep convolutional neural networks and have consistently shown improvements in classification. Most data augmentation techniques perform affine transformations on the image domain. However, these techniques cannot be used when object position is significant in the image. In this work we propose a data augmentation technique based on sampling a representation built from inequality constraints propagated from local binary patterns. We sample nine distinct variations of an image in a manner meant to preserve local structure, differing only in the amount of contrast between pixels. These contrast invariants are then used to augment the original images. We present evaluations on CIFAR-10 and validate our gains in performance across four criteria (accuracy, precision, recall and F1-score) using a 2-layer convolutional neural network with different filter configurations, and report improvements of about 13%, 9%, 12%, and 22%, respectively, over the baseline.

15:00-17:00, Paper WePMP.7
Local Binary Patterns for Graph Characterization
Jawad, Muhammad | Inst. of Management Sciences |
Aziz, Furqan | Inst. of Management Sciences |
Hancock, Edwin | Univ. of York |
Keywords: Graph matching, Clustering, Object recognition
Abstract: In this paper we propose a novel approach for defining Local Binary Patterns (LBP) that directly encode graph structure. LBP is a simple and widely used technique for texture analysis in static 2D images, and there is no work in the literature describing its generalisation to graphs. The proposed method (GraphLBP) is efficient and yet effective as a noise-tolerant graph-based representation. We compute the new feature representation for graphs by combining LBP with Galois fields, using irreducible polynomials. The proposed method is scalable as it preserves the local and global properties of the graph. Experimental results show that GraphLBP increases recognition accuracy while being both simpler and more computationally efficient than state-of-the-art techniques.

15:00-17:00, Paper WePMP.8
Nonnegative and Adaptive Multi-View Clustering
Zou, Peng | Soochow Univ |
Li, Fan-Zhang | Soochow Univ |
Zhang, Li | Soochow Univ |
Keywords: Multiview learning, Clustering, Data mining
Abstract: This paper proposes a novel Nonnegative and Adaptive Multi-view Clustering (NAMC) method. NAMC integrates nonnegative matrix factorization (NMF), adaptive neighborhood learning, and consensus adaptive similarity matrix fusion. More specifically, NAMC performs nonnegative weight learning over the original data and the parts-based representations of NMF for more accurate measurement and representation. For nonnegative adaptive feature extraction, our model first utilizes NMF to obtain a local parts-based representation of the original high-dimensional data. To keep the local structure of the parts-based representations, we minimize the adaptive neighborhood reconstruction error. The optimal consensus similarity matrix is then obtained iteratively from the nonnegative adaptive similarity matrix of each view. Moreover, the proposed NAMC is entirely self-weighted. Once the target graph is obtained, it can be partitioned into specific clusters directly. Extensive simulations show that NAMC achieves good performance on several public databases for multi-view clustering compared with other related methods.

15:00-17:00, Paper WePMP.9
Stabilizing Actor Policies by Approximating Advantage Distributions from K Critics
Labao, Alfonso | Univ. of the Philippines Computer Science Department |
Naval, Prospero | Univ. of the Philippines |
Keywords: Reinforcement learning, Deep learning
Abstract: Reinforcement learning algorithms that use policy gradient methods approach an optimal policy faster than Q-learning, but at the cost of incurring high variance in gradients. Among variance reduction techniques are actor-critic methods that use value and advantage functions to train a policy actor. We propose an algorithm in the actor-critic family that further reduces gradient variance through estimation of advantage distributions from K deep network critics. We combine the outputs of the K critics into an advantage distribution using a histogram approach followed by kernel convolution. We show in our analysis that using the K-critic advantage distribution provides variance reduction properties that result in more stable performance, even over long training runs. We test our algorithm on a set of high-dimensional VizDoom experiments. Our experimental results show that our proposed algorithm attains the highest average rewards compared to other methods, with less noise than the 1-critic method.
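
A minimal NumPy sketch of the histogram-plus-kernel-convolution step described above; the bin count, value range, and smoothing kernel here are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def advantage_distribution(critic_values, returns, bins=51, vmin=-10.0, vmax=10.0):
    """Combine advantage estimates from K critics into one distribution.

    critic_values: per-critic value estimates V_k(s) for one state, shape (K,).
    returns: the observed return R for that state (scalar).
    """
    advantages = returns - np.asarray(critic_values)        # A_k = R - V_k(s)
    hist, edges = np.histogram(advantages, bins=bins, range=(vmin, vmax))
    # Smooth the empirical histogram with a small kernel convolution.
    kernel = np.array([0.25, 0.5, 0.25])
    smoothed = np.convolve(hist.astype(float), kernel, mode="same")
    probs = smoothed / (smoothed.sum() + 1e-12)
    # The distribution's mean serves as a variance-reduced advantage signal.
    centers = (edges[:-1] + edges[1:]) / 2.0
    return probs, float(np.sum(probs * centers))
```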

15:00-17:00, Paper WePMP.10
Single-Image Dehazing Algorithm Based on Convolutional Neural Networks
Xiao, Jinsheng | Wuhan Univ |
Luo, Li | Wuhan Univ |
Liu, Enyu | Wuhan Univ |
Lei, Junfeng | School of Electronic Information, Wuhan Univ |
Klette, Reinhard | Auckland Univ. of Tech |
Keywords: Deep learning, Low-level vision
Abstract: This paper proposes a new single-image dehazing method based on a convolutional neural network. Our method directly learns an end-to-end mapping between haze images and their corresponding haze layers (i.e., residual images between haze images and haze-free images). A convolutional neural network takes the haze image as input and the residual image as output; a recovered dehazed image is then obtained by removing the residual from the haze image. Residual learning allows the network to directly estimate the initial haze layer with relatively high learning rates, which reduces computational complexity and speeds up convergence. Since the initial haze layer is only approximate, we use a guided filter to refine this layer to avoid halos and block artefacts, which makes the recovered image more similar to a real scene. The algorithm is tested on fog images with different fog densities and compared with other dehazing algorithms. Experiments demonstrate that the proposed algorithm outperforms state-of-the-art methods on both synthetic and real-world images, qualitatively and quantitatively.
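
The pipeline reduces to three steps: predict the haze layer, refine it with a guided filter, and subtract. The sketch below is written under assumptions: residual_net is any CNN that predicts the residual from a hazy image, and guided_filter is an off-the-shelf edge-preserving filter with the keyword signature shown (both hypothetical stand-ins, not the paper's exact components).

```python
import torch

def dehaze(hazy: torch.Tensor, residual_net: torch.nn.Module, guided_filter) -> torch.Tensor:
    """Residual-learning dehazing sketch. hazy: (N, 3, H, W) in [0, 1]."""
    # The network predicts the haze layer (residual), not the clean image.
    raw_residual = residual_net(hazy)
    # Refine the approximate haze layer to avoid halos and block artefacts,
    # using the hazy input itself as the guidance image.
    refined_residual = guided_filter(guide=hazy, src=raw_residual)
    # Subtracting the refined residual recovers the dehazed image.
    return (hazy - refined_residual).clamp(0.0, 1.0)
```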

15:00-17:00, Paper WePMP.11
Driving Maneuver Detection Via Sequence Learning from Vehicle Signals and Video Images
Peng, Xishuai | Univ. of Michigan-Dearborn |
Liu, Ruirui | Univ. of Michgan-Dearborn |
Murphey, Yi | Univ. of Michigan-Dearborn |
Stent, Simon | Univ. of Cambridge |
Li, Yuanxiang | Shanghai Jiao Tong Univ |
Keywords: Applications of pattern recognition and machine learning, Video processing and analysis, Sensor array & multichannel signal processing
Abstract: Driving maneuver detection is one of the most challenging tasks in Advanced Driver Assistance Systems (ADAS). Research has shown that early notification of improper driving maneuvers helps avoid fatalities and serious accidents. In this paper, we introduce a driving maneuver detection (DMD) system. The DMD system contains three major computational components: a distance-based representation of the driving context; combined features of the vehicle trajectory and VGG-19 network features extracted from video images of the vehicle front view; and a Long Short-Term Memory (LSTM) based neural network model that learns sequence knowledge of driving maneuvering events. We show through experiments that the DMD system is capable of learning the latent features of five different classes of driving maneuvers and achieves significantly better performance than traditional classification methods on real-world driving trips.

15:00-17:00, Paper WePMP.12
Background Subtraction Via 3D Convolutional Neural Networks
Gao, Yongqiang | National Univ. of Defense Tech
Cai, Huayue | National Univ. of Defense Tech
Zhang, Xiang | National Univ. of Defense Tech
Lan, Long | National Univ. of Defense Tech
Luo, Zhigang | National Univ. of Defense Tech
Keywords: Applications of pattern recognition and machine learning, Deep learning, Neural networks
Abstract: Background subtraction can be treated as the binary classification problem of highlighting the foreground region in a video whilst masking the background region, and has been broadly applied in various vision tasks such as video surveillance and traffic monitoring. However, it remains a challenging task due to complex scenes and a lack of prior knowledge about temporal information. In this paper, we propose a novel background subtraction model based on 3D convolutional neural networks (3D CNNs) which combines temporal and spatial information to effectively separate the foreground from all the sequences in an end-to-end manner. Different from conventional models, we view background subtraction as a three-class classification problem, i.e., the foreground, the background, and the boundary. This design obtains more reasonable results than existing baseline models. Experiments on the Change Detection 2012 dataset verify the potential of our model both quantitatively and qualitatively.

15:00-17:00, Paper WePMP.13
Universal Perturbation Generation for Black-Box Attack Using Evolutionary Algorithms
Wang, Siyu | Tianjin Univ. School of Computer Science and Tech |
Shi, Yucheng | Tianjin Univ |
Han, Yahong | Tianjin Univ |
Keywords: Neural networks, Image classification, Information forensics and security
Abstract: Image classifiers based on deep neural networks (DNNs) are vulnerable to tiny, imperceptible perturbations. Maliciously generated adversarial examples can exploit the instability of DNNs and mislead them into outputting wrong classification results. Prior work showed the transferability of adversarial perturbations between models and between images. In this work, we shed light on the combination of source/target misclassification, black-box attack, and universal perturbation by employing improved evolutionary algorithms. We additionally find that the use of 'adversarial initialization' enhances the efficiency with which evolutionary algorithms find universal perturbations. Experiments demonstrate impressive misclassification rates and surprising transferability for the proposed attack method using different models trained on the CIFAR-10 and CIFAR-100 datasets. Our attack method also shows robustness against defensive measures such as adversarial training.
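
As a simplified illustration of black-box universal-perturbation search with an evolutionary algorithm (the paper's operators, fitness function, and 'adversarial initialization' are not reproduced here), consider a hill-climbing evolutionary loop where fitness is the fraction of images a single shared perturbation misclassifies:

```python
import numpy as np

def evolve_universal_perturbation(images, predict, true_labels, eps=8 / 255,
                                  pop=20, iters=200, sigma=0.02,
                                  rng=np.random.default_rng(0)):
    """Evolutionary search for one perturbation applied to all images.

    `predict` is any black-box function returning predicted labels for a
    batch of images in [0, 1]; no gradients are used.
    """
    best = rng.uniform(-eps, eps, size=images.shape[1:])
    best_fit = np.mean(predict(np.clip(images + best, 0, 1)) != true_labels)
    for _ in range(iters):
        for _ in range(pop):
            # Mutate the current best, keeping it inside the L-inf ball.
            cand = np.clip(best + sigma * rng.standard_normal(best.shape), -eps, eps)
            fit = np.mean(predict(np.clip(images + cand, 0, 1)) != true_labels)
            if fit > best_fit:
                best, best_fit = cand, fit
    return best, best_fit
```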

15:00-17:00, Paper WePMP.14
Probabilistic Graph Embedding for Unsupervised Domain Adaptation
Xiao, Pan | Wuhan Univ. Wuhan 430072 |
Du, Bo | School of Computer, Wuhan Univ. Wuhan 430079, China |
Yun, Shuang | Wuhan Univ |
Li, Xue | Wuhan Univ |
Zhang, YiPeng | Department of Electrical Engineering & Computer Science, Syracuse Univ
Wu, Jia | Department of Computing, Macquarie Univ. Sydney |
Keywords: Domain adaptation, Classification, Probabilistic graphical model
Abstract: Unsupervised domain adaptation aims to predict unlabeled target domain data by taking advantage of labeled source domain data. This problem is hard to solve mainly because of the lack of target domain labels and the difference between the two domains. To overcome these two difficulties, many researchers have proposed to assign pseudo labels to the unlabeled target domain data and then project the data of both domains into a common subspace. The use of pseudo labels, however, lacks a theoretical basis and thus may not yield ideal classification results. Therefore, in this work we propose a novel method called Probabilistic Graph Embedding (PGE). PGE first derives the probabilities that the target domain instances belong to each category, which is believed to explore the target domain better. We then obtain a projection matrix by constructing a within-class probabilistic graph. This projection matrix can embed both domains into a shared subspace where domain shift is largely diminished. Experiments on cross-domain object recognition datasets show that PGE is more effective and robust than state-of-the-art unsupervised domain adaptation methods.

15:00-17:00, Paper WePMP.15
Reservoir Computing with Untrained Convolutional Neural Networks for Image Recognition
Tong, Zhiqiang | The Univ. of Tokyo
Tanaka, Gouhei | The Univ. of Tokyo |
Keywords: Neural networks, Classification, Image processing and analysis
Abstract: Reservoir computing has attracted much attention for its easy training process as well as its ability to deal with temporal data. A reservoir computing system consists of a reservoir part, represented as a sparsely connected recurrent neural network, and a readout part, represented as a simple regression model. In machine learning tasks, the reservoir part is fixed and only the readout part is trained. Although reservoir computing has been mainly applied to time series prediction and recognition, it can be applied to image recognition as well by considering an image as a sequence of pixel values. However, to achieve high performance in image recognition with raw image data, a large-scale reservoir containing a large number of neurons is required. This is a bottleneck in terms of computer memory and computational cost. To overcome this bottleneck, we propose a new method which combines reservoir computing with untrained convolutional neural networks. We use an untrained convolutional neural network to transform raw image data into a set of smaller feature maps in a preprocessing step of the reservoir computing. We demonstrate that our method achieves high classification accuracy in an image recognition task with a much smaller number of trainable parameters compared with a previous study.
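
The reservoir part is a fixed random recurrent network; only a linear readout on its states is trained. The sketch below is a generic echo-state-network rendering of that idea (the untrained CNN preprocessing is omitted); sizes and scalings are illustrative assumptions.

```python
import numpy as np

def reservoir_states(pixel_seq, n_reservoir=200, spectral_radius=0.9,
                     rng=np.random.default_rng(0)):
    """Drive a fixed random reservoir with a scalar input sequence.

    pixel_seq: iterable of scalar pixel values (one image flattened to a
    sequence). Returns the final reservoir state; in reservoir computing
    only a simple regression readout on such states is trained.
    """
    W_in = rng.uniform(-0.5, 0.5, (n_reservoir, 1))     # fixed input weights
    W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    # Rescale recurrent weights for the echo-state property.
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    x = np.zeros(n_reservoir)
    for u in pixel_seq:                                 # feed pixels one by one
        x = np.tanh(W_in @ np.array([u]) + W @ x)
    return x
```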

15:00-17:00, Paper WePMP.16
EMD-Based Entropy Features for Micro-Doppler Mini-UAV Classification
Ma, Xinyue | Nanyang Tech. Univ |
Oh, Beomseok | Nanyang Tech. Univ |
Sun, Lei | Beijing Inst. of Tech |
Toh, Kar-Ann | Yonsei Univ |
Lin, Zhiping | Nanyang Tech. Univ |
Keywords: Classification, Signal analysis
Abstract: In this paper, we first investigate six popular entropies extracted from a set of intrinsic mode functions (IMFs) as feature patterns for radar-based mini-size unmanned aerial vehicle (mini-UAV) classification. The six entropies are Shannon entropy, spectral entropy, log energy entropy, approximate entropy, fuzzy entropy and permutation entropy. Via an empirical comparison of the six entropies on real measured radar data, the first three are selected as representatives due to their high efficiency and accuracy. To enhance classification accuracy, the three selected entropies are then extracted from eight different sets of IMFs obtained by signal downsampling, and fused at the feature level. A nonlinear support vector machine classifier is adopted to predict the class labels of unseen test radar signals. Our empirical results on a set of real-world continuous-wave radar data show that the proposed method outperforms the state-of-the-art method in terms of mini-UAV classification accuracy.
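
The three selected entropies follow common definitions. In the paper they are computed per IMF after empirical mode decomposition; the sketch below applies them directly to a 1-D signal, with an illustrative histogram binning for the Shannon entropy.

```python
import numpy as np

def shannon_entropy(x, bins=64):
    """Shannon entropy of the signal's amplitude histogram."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum."""
    psd = np.abs(np.fft.rfft(x)) ** 2
    p = psd / psd.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def log_energy_entropy(x, eps=1e-12):
    """Sum of the log energies of the samples."""
    return np.sum(np.log(x ** 2 + eps))
```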

15:00-17:00, Paper WePMP.17
Context-Aware Attention LSTM Network for Flood Prediction
Wu, Yirui | Hohai Univ |
Liu, Zhaoyang | Nanjing Univ |
Xu, Weigang | Hohai Univ |
Feng, Jun | Hohai Univ |
Palaiahnakote, Shivakumara | National Univ. of Singapore |
Lu, Tong | State Key Lab. for Software Tech. Nanjing Univ |
Keywords: Applications of pattern recognition and machine learning, Neural networks, Deep learning
Abstract: To minimize the negative impacts brought by floods, researchers from the pattern recognition community utilize artificial intelligence based methods to solve the problem of flood prediction. Inspired by the significant power of Long Short-Term Memory (LSTM) networks in modeling the dynamics and dependencies of sequential data, we utilize LSTM networks to predict sequential flow rate values based on a set of collected flood factors. Since not all factors are informative for flood prediction, and irrelevant factors often bring a lot of noise, we need to pay more attention to the informative ones. However, the original LSTM does not have a strong attention capability. Hence we propose a context-aware attention LSTM (CA-LSTM) network for flood prediction, which is capable of selectively focusing on informative factors. During training, the local context-aware attention model is constructed by learning probability distributions between the flow rate and the hidden output of each LSTM cell. During testing, the learned local attention model assigns weights to adjust the relations between input factors and predictions at all steps of the LSTM network. We conduct experiments on a flood dataset with several comparative methods, demonstrating the high accuracy of the proposed method and the effectiveness of the proposed context-aware attention model.

15:00-17:00, Paper WePMP.18
An Online Kernel Selection Wrapper Via Multi-Armed Bandit Model
Li, Junfan | Tianjin Univ |
Liao, Shizhong | Tianjin Univ |
Keywords: Model selection, Online learning, Classification
Abstract: Online kernel selection is critical to online kernel learning, but most existing online kernel learning methods ignore the online kernel selection process; instead, they empirically preset and fix a kernel, or adjust kernel parameters by gradient descent, which is sensitive to the initial setting and has no theoretical guarantee. In this work, we propose an online kernel selection wrapper based on the multi-armed bandit model, which can select a kernel at each round from a set of candidate kernels with a theoretical guarantee and can be applied to any online kernel learning model. Specifically, the wrapper consists of two layers. In the outer layer, the wrapper associates each candidate kernel with an arm of the multi-armed bandit model and chooses an arm at each round according to the probability distribution maintained by the model. In the inner layer, the wrapper updates the probability distribution according to the loss of the selected arm, which is incurred by the prediction of the online kernel learning algorithm. We propose a new online kernel selection regret to measure the performance of the proposed wrapper, and prove that the wrapper enjoys a sub-linear expected online kernel selection regret with respect to the cumulative loss of the optimal kernel among the candidate kernels. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed wrapper.
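
The two-layer loop maps naturally onto an EXP3-style adversarial bandit. The sketch below is a generic rendering, not the paper's exact update or regret-optimal parameters; losses_fn(k, t) is an assumed hook that runs one step of the online kernel learner with candidate kernel k at round t and returns a loss in [0, 1].

```python
import numpy as np

def exp3_kernel_selection(losses_fn, n_kernels, rounds, eta=0.1,
                          rng=np.random.default_rng(0)):
    """EXP3-style wrapper that picks one candidate kernel per round."""
    weights = np.ones(n_kernels)
    for t in range(rounds):
        # Outer layer: sample an arm (kernel) from the maintained distribution,
        # mixed with uniform exploration.
        probs = (1 - eta) * weights / weights.sum() + eta / n_kernels
        k = rng.choice(n_kernels, p=probs)
        # Inner layer: the online kernel learner predicts and incurs a loss.
        loss = losses_fn(k, t)
        # Importance-weighted exponential update of the chosen arm only.
        weights[k] *= np.exp(-eta * loss / probs[k])
    return weights / weights.sum()
```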

15:00-17:00, Paper WePMP.19
Variational Bayes Block Sparse Modeling with Correlated Entries
Sharma, Shruti | Indian Inst. of Tech. Delhi |
Chaudhury, Santanu | Indian Inst. of Tech. Delhi |
Jayadeva, Dr. | IIT Delhi |
Keywords: Sparse learning, Signal analysis
Abstract: This paper addresses the problem of Bayesian block sparse modeling when coefficients within the blocks are correlated. In contrast to current hierarchical methods, which do not exploit the correlation structure within blocks, we propose a three-level hierarchical estimation framework. It employs heavy-tailed priors for block sparse modeling and variational inference for Bayesian estimation. This paper also describes the relationship between the proposed framework and some existing Block Sparse Bayesian Learning (SBL) methods, and shows that these SBL methods can be viewed as its special cases. Extensive experimental results for synthetic signals are provided, demonstrating the superior performance of the proposed framework in terms of failure rate and relative reconstruction error, among other measures. We also demonstrate the applicability of this framework to telemonitoring of fetal electrocardiograms.

15:00-17:00, Paper WePMP.20
Semi-Supervised Convolutional Neural Networks with Label Propagation for Image Classification
Chen, Lin | ShenZhen Univ |
Yu, Shiqi | Shenzhen Univ |
Yang, Meng | Sun Yat-Sen Univ |
Keywords: Semi-supervised learning, Image classification, Neural networks
Abstract: Over the past several years, deep learning has achieved promising performance in many visual tasks, e.g., face verification and object classification. However, the limited number of labeled training samples available in practical applications is still a huge bottleneck to achieving satisfactory performance. In this paper, we integrate class estimation of unlabeled training data with a deep learning model, which yields a novel semi-supervised convolutional neural network (SSCNN) trained on both labeled training data and unlabeled data. In the SSCNN framework, deep convolutional feature extraction and class estimation of the unlabeled data are jointly learned. Specifically, deep convolutional features are learned from the labeled training data and the unlabeled data with confident class estimates. After the deep features are obtained, the label propagation algorithm is utilized to estimate the identities of unlabeled training samples. The alternating optimization of SSCNN makes the class estimation of unlabeled data more and more accurate as the learned CNN features become more and more discriminative. We compared the proposed SSCNN with some representative semi-supervised learning approaches on the MNIST and CIFAR-10 databases. Extensive experiments on benchmark databases show the effectiveness of our semi-supervised deep learning framework.
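
For the label-estimation step, a classic graph-based label propagation (in the style of Zhou et al.) can serve as a sketch; the paper's exact variant may differ. Here labels holds class ids for labeled samples and -1 for unlabeled ones.

```python
import numpy as np

def propagate_labels(features, labels, n_iter=20, sigma=1.0, alpha=0.99):
    """Propagate labels over a Gaussian-affinity graph built on deep features.

    features: (n, d) feature matrix; labels: (n,) int array, -1 = unlabeled.
    Returns an estimated class id for every sample.
    """
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1) + 1e-12)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # normalized affinity
    n_classes = labels.max() + 1
    Y = np.zeros((len(labels), n_classes))
    Y[labels >= 0, labels[labels >= 0]] = 1.0           # one-hot seed labels
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y             # diffuse labels over the graph
    return F.argmax(1)
```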

15:00-17:00, Paper WePMP.21
Occluded Joints Recovery in 3D Human Pose Estimation Based on Distance Matrix
Guo, Xiang | Australian National Univ |
Dai, Yuchao | Northwestern Pol. Univ |
Keywords: Applications of pattern recognition and machine learning, 3D reconstruction, Sparse learning
Abstract: Despite the recent progress in single-image 3D human pose estimation driven by convolutional neural networks, it is still challenging to handle real scenarios such as highly occluded scenes. In this paper, we propose to address the problem of single-image 3D human pose estimation with occluded measurements by exploiting the Euclidean distance matrix (EDM). Specifically, we present two approaches based on EDM which can effectively handle occluded joints in 2D images. The first approach is based on 2D-to-2D distance matrix regression achieved by a simple CNN architecture. The second approach is based on sparse coding along with a learned over-complete dictionary. Experiments on the Human3.6M dataset show the excellent performance of these two approaches in recovering occluded observations and demonstrate improvements in accuracy for 3D human pose estimation with occluded joints.

15:00-17:00, Paper WePMP.22
Scalable Semi-Supervised Learning by Graph Construction with Approximate Anchors Embedding
Zhu, Hao | JD Finance |
Xia, Minxue | JD Finance |
Keywords: Semi-supervised learning, Image classification
Abstract: Semi-supervised learning (SSL) generalizes and improves supervised learning using labeled and unlabeled data. With the rapid development of the Internet and the increasing availability of data on the open web, collecting tremendous amounts of unlabeled data has become more feasible. As the central notion in SSL, smoothness is often defined on a graph representation of the data. However, only a few studies to date adapt graph-based SSL approaches to the large-scale setting, and even though approximation approaches are commonly used in large-scale SSL methods, it is difficult for them to outperform the original methods. In this paper, we propose an efficient method to construct a graph based on anchors. There are two major concerns with the anchor graph: how to learn better anchors, and how to represent the input both more efficiently and more effectively. Intuitively, compared with sparse representation (e.g., local anchor embedding), a more straightforward approach is marginal regression with a non-negativity constraint. Rather than using clustering algorithms, the anchors are trained using dictionary learning with sparsity constraints; thus, our approach takes into consideration not only the relation between anchors and data but also the relation between anchors. In the evaluation section, we demonstrate that our method outperforms other large-scale SSL methods as well as traditional ones in classification performance on several classical datasets. Furthermore, the proposed method solves the large-scale SSL problem more efficiently than current methods, making it an efficient and effective alternative for handling large-scale SSL problems.
|
|
15:00-17:00, Paper WePMP.23 | |
Facial Attribute Editing by Latent Space Adversarial Variational Autoencoders |
Li, Defang | Sun Yat-Sen Univ |
Zhang, Min | Sun Yat-Sen Univ |
Chen, Weifu | Sun Yat-Sen Univ. Guangzhou, China |
Feng, Guocan | Sun Yat-Sen Univ |
Keywords: Neural networks
Abstract: In this work, we focus on the problem of editing facial images by manipulating specified attributes of interest. To learn latent representations disentangled with respect to specified face attributes, we propose a novel attribute-disentangled generative model that combines the Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) frameworks. The proposed model includes only two deep mappings, an encoder and a decoder, similar to their counterparts in VAEs. The latent space mapped by the encoder is split into two parts: a style space and an attribute space. The former represents attribute-irrelevant factors such as identity, position, illumination, and background; the latter represents the attributes, such as hair color, gender, and the presence of glasses, with each dimension representing one single attribute. By treating constraints on the output of the encoder as discriminative objectives, the encoder acts not only as a discriminator that distinguishes real samples from generated ones, but also as an attribute classifier that determines whether a sample has the specified attributes. Combining the reconstruction and Kullback-Leibler (KL) divergence regularization errors used in VAEs with adversarial training losses defined for the style and attribute parts of the latent space, the proposed model generates images whose distribution is close to the real data distribution in the latent space. The model was evaluated on the CelebA dataset, and experimental results showed its effectiveness in disentangling face attributes and generating high-quality face images.
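The latent-space split can be sketched as follows; fully connected layers stand in for the paper's convolutional encoder, and CelebA's 40 binary attributes are assumed for the attribute part.

    import torch
    import torch.nn as nn

    class SplitEncoder(nn.Module):
        """Encoder whose latent code is split into a style part
        (attribute-irrelevant factors) and an attribute part with one
        unit per binary face attribute."""
        def __init__(self, in_dim=64 * 64 * 3, style_dim=128, n_attrs=40):
            super().__init__()
            self.backbone = nn.Sequential(nn.Flatten(),
                                          nn.Linear(in_dim, 512), nn.ReLU())
            self.to_style = nn.Linear(512, style_dim)
            self.to_attr = nn.Linear(512, n_attrs)

        def forward(self, x):
            h = self.backbone(x)
            return self.to_style(h), torch.sigmoid(self.to_attr(h))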
|
|
15:00-17:00, Paper WePMP.24 | |
MMGAN: Manifold-Matching Generative Adversarial Network |
Park, Noseong | Univ. of North Carolina, Charlotte |
Anand, Ankesh | Montreal Inst. for Learning Algorithms |
Moniz, Joel Ruben Antony | Carnegie Mellon Univ |
Lee, Kookjin | Univ. of Maryland Coll. Park |
Choo, Jaegul | Korea Univ |
Park, David Keetae | Korea Univ |
Chakraborty, Tanmoy | IIIT Delhi, India |
Keywords: Deep learning, Neural networks, Image processing and analysis
Abstract: It is well known that GANs are difficult to train, and several techniques have been proposed to stabilize their training. In this paper, we propose a novel training method called manifold-matching and a new GAN model called the manifold-matching GAN (MMGAN). MMGAN finds two manifolds representing the vector representations of real and fake images; if these two manifolds match, the real and fake images are statistically identical. To assist the manifold-matching task, we also use i) kernel tricks to find better manifold structures, ii) manifolds moving-averaged across mini-batches, and iii) a regularizer based on the correlation matrix to suppress mode collapse. We conduct in-depth experiments with three image datasets and compare against several state-of-the-art GAN models. 32.4% of the images generated by the proposed MMGAN were recognized as fake during our user study (a 16% improvement over another state-of-the-art model), and MMGAN achieved an inception score of 7.8 on CIFAR-10.
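Ingredient iii) can be illustrated in a few lines; this is a generic correlation-matrix penalty in the spirit the abstract describes, not the authors' exact regularizer.

    import torch

    def correlation_penalty(feats, eps=1e-8):
        """Penalize off-diagonal correlations between feature dimensions of a
        mini-batch of generated samples, discouraging mode collapse."""
        f = feats - feats.mean(dim=0, keepdim=True)
        f = f / (f.norm(dim=0, keepdim=True) + eps)
        corr = f.t() @ f                              # (d, d) correlation matrix
        off = corr - torch.diag(torch.diag(corr))
        return off.pow(2).sum()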
|
|
15:00-17:00, Paper WePMP.25 | |
Multi-Source Domain Adaptation for Face Recognition |
Yi, Haiyang | Guilin Univ. of Electronic Tech |
Xu, Zhi | Guilin Univ. of Electronic Tech |
Wen, Yimin | Guilin Univ. of Electronic Tech |
Fan, Zhigang | Zebra Tech. Corp |
Keywords: Domain adaptation, Transfer learning, Face recognition
Abstract: For transfer learning, many studies have demonstrated that effective use of information from multiple source domains improves classification performance. In this paper, we propose a method called Targetized Multi-Source Domain Bridged by a Common Subspace (TMSD) for face recognition, which transfers rich supervision knowledge from more than one labeled source domain to the unlabeled target domain. Specifically, a common subspace is learned for several domains by maximizing the total correlation. In this way, the discrepancy of each domain is reduced, and the structures of both the source and target domains are well preserved for classification. In the common subspace, each sample projected from the source domains is sparsely represented as a linear combination of several samples projected from the target domain, so that samples projected from different domains can be well interlaced. Then, in the original image space, each source-domain image can be represented as a linear combination of its neighbors in the target domain. Finally, the discriminant subspace is obtained from the targetized multi-source domain images using a supervised learning algorithm. The experimental results illustrate the superiority of TMSD over competing methods.
|
|
15:00-17:00, Paper WePMP.26 | |
Domain Translation with Conditional GANs: From Depth to RGB Face-To-Face |
Fabbri, Matteo | Univ. Degli Studi Di Modena E Reggio Emilia |
Borghi, Guido | Univ. of Modena and Reggio Emilia |
Lanzi, Fabio | Univ. Degli Studi Di Modena E Reggio Emilia |
Vezzani, Roberto | Univ. of Modena and Reggio Emilia |
Calderara, Simone | Univ. of Modena and Reggio Emilia |
Cucchiara, Rita | Univ. Degli Studi Di Modena E Reggio Emilia |
Keywords: Deep learning, Applications of pattern recognition and machine learning, Applications of computer vision
Abstract: Can faces acquired by low-cost depth sensors be used to capture characteristic details of the face? Typically the answer is no. However, new deep architectures can generate RGB images from data acquired in a different modality, such as depth data. In this paper, we propose a new deterministic conditional GAN, trained on annotated RGB-D face datasets, that is effective for face-to-face translation from depth to RGB. Although the network cannot reconstruct the exact somatic features of unknown individual faces, it is capable of reconstructing plausible faces whose appearance is accurate enough to be used in many pattern recognition tasks. In fact, we test the network's capability to hallucinate with several perceptual probes, for instance face aspect classification and landmark detection. Depth faces can thus be used in place of the corresponding RGB images, which are often unavailable due to difficult lighting conditions. Experimental results are very promising and far better than previously proposed approaches: this domain translation can constitute a new way to exploit depth data in future applications.
|
|
15:00-17:00, Paper WePMP.27 | |
Temporal Pattern Localization Using Mixed Integer Linear Programming |
Zhao, Rui | RPI |
Schalk, Gerwin | BCI R&D Progr, Wadsworth Ctr, NYS Dept of Health |
Ji, Qiang | RPI |
Keywords: Applications of pattern recognition and machine learning, Brain-computer interface, Sequence modeling
Abstract: In this paper, we consider the problem of localizing the subsequence of a time series that contains the dynamic pattern of interest. This is motivated by brain-computer interface applications, where we need to analyze the dynamic pattern of brain signals in response to an external stimulus. We treat the localization as a binary label assignment problem and formulate it as a mixed integer linear programming (MILP) problem. The optimal solution is obtained by minimizing a cost function associated with label assignment, subject to empirical constraints induced by the data acquisition process. We experiment with synthetic data to evaluate the effectiveness of the proposed MILP formulation and achieve a 10.8% improvement in F1-score. We then experiment with electrocorticographic (ECoG) data for a classification task and achieve an 8.8% improvement in accuracy using subsequences localized by our method compared to other methods.
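A toy version of such a formulation, selecting one contiguous run of frames that minimizes a per-frame cost, can be written with the PuLP modeler; the costs and the contiguity encoding below are illustrative, not the paper's exact constraint set.

    import pulp  # pip install pulp

    costs = [3.0, 1.2, -2.0, -1.5, -0.3, 2.2]       # illustrative per-frame costs
    T = len(costs)
    prob = pulp.LpProblem("subsequence", pulp.LpMinimize)
    y = [pulp.LpVariable(f"y{t}", cat="Binary") for t in range(T)]  # frame labels
    s = [pulp.LpVariable(f"s{t}", cat="Binary") for t in range(T)]  # run starts
    prob += pulp.lpSum(costs[t] * y[t] for t in range(T))
    prob += pulp.lpSum(s) == 1                      # exactly one start marker
    prob += y[0] <= s[0]
    for t in range(1, T):
        prob += y[t] - y[t - 1] <= s[t]             # each 0->1 jump needs a start
    prob += pulp.lpSum(y) >= 1                      # select a non-empty run
    prob.solve()
    print([int(v.value()) for v in y])              # -> [0, 0, 1, 1, 1, 0]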
|
|
15:00-17:00, Paper WePMP.28 | |
Net4lap: Neural Laplacian Regularization for Ranking and Re-Ranking |
Curado, Manuel | Univ. of Alicante |
Escolano, Francisco | Univ. of Alicante |
Lozano, Miguel Angel | Univ. of Alicante |
Hancock, Edwin | Univ. of York |
Keywords: Neural networks, Manifold learning, Learning-based vision
Abstract: In this paper, we propose net4Lap, a novel architecture for Laplacian-based ranking. The two main ingredients of the approach are: a) pre-processing graphs with neural embeddings before performing Laplacian ranking, and b) introducing a global measure of centrality to modulate the diffusion process. We explicitly formulate ranking as an optimization problem in which regularization is emphasized; this formulation serves as a theoretical tool to validate our approach. Finally, our experiments show that the proposed architecture significantly outperforms state-of-the-art rankers and is also well suited to re-ranking.
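The Laplacian ranking step that the architecture builds on has the standard closed form below (plain manifold ranking, without the paper's neural embedding or centrality modulation).

    import numpy as np

    def laplacian_rank(S, query_idx, alpha=0.9):
        """Diffusion ranking f = (I - alpha * D^{-1/2} S D^{-1/2})^{-1} y,
        where S is a symmetric affinity matrix and y indicates the query."""
        d = S.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
        S_norm = D_inv_sqrt @ S @ D_inv_sqrt
        y = np.zeros(S.shape[0])
        y[query_idx] = 1.0
        f = np.linalg.solve(np.eye(S.shape[0]) - alpha * S_norm, y)
        return np.argsort(-f)        # node indices ranked by diffusion score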
|
|
15:00-17:00, Paper WePMP.29 | |
Part-Based Multi-Stream Model for Vehicle Searching |
Sun, Ya | Nanjing Univ. of Science and Tech |
Li, Minxian | Nanjing Univ. of Science and Tech |
Lu, Jianfeng | Nanjing Univ. of Science & Tech |
Keywords: Neural networks, Multilabel learning, Image classification
Abstract: Due to the tremendous requirements of public security and intelligent transportation systems, searching for an identical vehicle has become more and more important. Current studies usually treat the vehicle as a single whole and then train a distance metric to measure the similarity among vehicles. However, raw images of vehicles with different identities may look nearly identical, and background pixels may disturb the deep distance metric learning. In this paper, we propose a novel method to segment an original vehicle image into several discriminative foreground parts, where these parts consist of fine-grained regions called discriminative patches. These parts, combined with the raw image, are then fed into a new deep learning network, and the similarity of two vehicle images can easily be measured by computing the Euclidean distance of the features from the FC layer. The two main contributions of this paper are as follows. First, a method is proposed to estimate whether a patch in a raw vehicle image is discriminative. Second, a new Part-based Multi-Stream Model (PMSM) is designed and optimized for vehicle retrieval and re-identification tasks. We evaluate our method on the VehicleID dataset, and the experimental results show that it outperforms the baseline.
|
|
15:00-17:00, Paper WePMP.30 | |
Dynamic Learning Rate for Neural Networks: A Fixed-Time Stability Approach |
Aldana Lopez, Rodrigo | Intel Corp |
Campos Macias, Leobardo Emmanuel | Intel Corp |
Zamora Esquivel, Julio | Intel Corp |
Gomez-Gutierrez, David | Intel Labs |
Cruz Vargas, Jesus Adan | Intel |
Keywords: Online learning, Deep learning, Neural networks
Abstract: Neural networks (NNs) have become important tools that have demonstrated their value in solving complex problems in pattern recognition, natural language processing, and automatic speech recognition, among others. Recently, the number of applications that require running the training process at the front end in an online manner has increased dramatically. Unfortunately, in state-of-the-art (SoA) methods, the training process is an unbounded function of the initial conditions, so there is no insight into the number of epochs required, making online training a difficult problem. Speeding up the training process therefore plays a key role in machine learning. In this work, an algorithm for a dynamic learning rate is proposed based on recent results on the fixed-time stability of continuous-time nonlinear systems, which ensures a bound on the convergence time to the equilibrium point independently of the initial conditions. We show experimentally that our discrete-time implementation presents promising results, demonstrating that the number of epochs required for training remains bounded, independently of the initial weights. This constitutes an important feature for learning systems with real-time constraints. The efficiency of the proposed method is illustrated under different scenarios, including the public MNIST database, showing that our algorithm outperforms SoA methods in terms of the number of epochs required for training.
|
|
15:00-17:00, Paper WePMP.31 | |
Convolutional Networks for Semantic Heads Segmentation Using Top-View Depth Data in Crowded Environment |
Liciotti, Daniele | Univ. Pol. Delle Marche |
Paolanti, Marina | Univ. Pol. Delle Marche |
Pietrini, Rocco | Univ. Pol. Delle Marche |
Frontoni, Emanuele | Univ. Pol. Delle Marche |
Zingaretti, Primo | Univ. Pol. Delle Marche |
Keywords: Applications of pattern recognition and machine learning
Abstract: Detecting and tracking people is a challenging task in persistently crowded environments (e.g., retail stores, airports, stations) for human behaviour analysis and security purposes. This paper introduces an approach to detect and track people under heavy occlusion, based on CNNs for semantic segmentation using top-view depth visual data. The contribution is the design of a novel U-Net architecture, U-Net3, modified from previous variants at the end of each layer: a batch normalization is added after the first ReLU activation function and after each max-pooling and up-sampling function. The approach was applied and tested on a new, publicly available dataset, the TVHeads Dataset, consisting of depth images of people recorded by an RGB-D camera installed in a top-view configuration. Our variant outperforms baseline architectures while remaining computationally efficient at inference time. Results show high accuracy, demonstrating the effectiveness and suitability of our approach.
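The stated modification pins down where the normalization sits; a sketch of one encoder block, in PyTorch with illustrative channel sizes, is given below.

    import torch.nn as nn

    def unet3_down_block(in_ch, out_ch):
        """Encoder block with batch normalization after the first ReLU and
        after the max-pooling step, as described above."""
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),                  # BN after the first ReLU
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.BatchNorm2d(out_ch),                  # BN after max-pooling
        )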
|
|
15:00-17:00, Paper WePMP.32 | |
Conditional Information Gain Networks |
Biçici, Ufuk Can | Boğaziçi Univ. / Idea Teknoloji |
Keskin, Cem | PerceptiveIO |
Akarun, Lale | Bogazici Univ |
Keywords: Deep learning, Image classification, Neural networks
Abstract: Deep neural network models owe their representational power to their large number of learnable parameters. It is often infeasible to run these heavily parameterized deep models in limited-resource environments, such as mobile phones. Network models employing conditional computing can reduce computational requirements while achieving high representational power, thanks to their ability to model hierarchies. We propose Conditional Information Gain Networks, which allow feed-forward deep neural networks to execute conditionally, skipping parts of the model based on the sample and on decision mechanisms inserted into the architecture. These decision mechanisms are trained using cost functions based on differentiable information gain, inspired by the training procedures of decision trees. Because the information-gain-based decision mechanisms are differentiable, they can be trained end-to-end in a unified framework with a general cost function covering both classification and decision losses. We test the effectiveness of the proposed method on the MNIST and recently introduced Fashion-MNIST datasets and show that our information-gain-based conditional execution approach achieves better or comparable classification results using significantly fewer parameters than standard convolutional neural network baselines.
|
|
15:00-17:00, Paper WePMP.33 | |
Face Aging with Improved Invertible Conditional GANs |
Li, Jia | Xi’an Jiaotong Univ |
Song, Yonghong | Xi'an Jiaotong Univ |
Zhang, Yuanlin | Xian JiaoTong Univ |
Keywords: Deep learning
Abstract: Due to the continuous development of GANs, vivid faces can now be generated, and using GANs for face aging has become a new trend. However, many existing face aging methods require tedious pre-processing of datasets, which brings a large computational burden and limits the application of face aging. To solve these problems, a face aging network is constructed using an IcGAN without any data pre-processing, which maps a face image into personality and age vector spaces through encoders Z and Y. Unlike previous work, we emphasize the preservation of both personalized and aging features. A minimal absolute reconstruction loss is proposed to optimize the vector z, which retains personal characteristics while preserving the pose, hairstyle, and background of the input face. Additionally, we introduce a novel age-vector optimization approach based on a classification reconstruction loss, together with a parameter that balances large-scale aging features against subtle texture features. The experimental results demonstrate that our proposed AIGAN produces better aged faces than other state-of-the-art age progression methods.
|
|
15:00-17:00, Paper WePMP.34 | |
IMU-Based Robust Human Activity Recognition Using Feature Analysis, Extraction, and Reduction |
Dehzangi, Omid | Univ. of Michigan-Dearborn |
Sahu, Vaishali | Univ. of Michigan Dearborn |
Keywords: Classification, Dimensionality reduction, Performance evaluation
Abstract: In recent years, research on recognizing human activities to assess the physical and cognitive capabilities of humans has gained importance. This paper presents the development of a robust human activity recognition system for real-world conditions. The activities considered are walking, walking upstairs, walking downstairs, sitting, standing, and sleeping. The proposed system consists of three main elements: feature extraction from an IMU (Inertial Measurement Unit) based on spectral and temporal analysis; feature dimensionality reduction techniques to reduce the high-dimensional feature representation; and various model training algorithms to recognize the activities. Different feature extraction methods based on time- and frequency-domain signal properties are evaluated. The high dimensionality of the extracted features results in complex model training and suffers from the curse of dimensionality; we therefore evaluate feature selection and transformation algorithms to improve robustness without decreasing prediction accuracy. Our results show that the random forest feature selection method, when used with an ensemble bagged classifier, provides an accuracy of 96.9% with 15 features, compared to the current benchmark system that employs 561 features. We further obtain a less complex recognition system via neighborhood component analysis with an ensemble bagged classifier, which yields a classification accuracy of 96.3% with only 9 features.
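The best-performing configuration reads naturally as a scikit-learn pipeline; the sketch below assumes illustrative hyperparameters, not the authors' exact settings.

    import numpy as np
    from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # Random-forest importances pick the top 15 features, which then feed
    # an ensemble of bagged decision trees.
    model = make_pipeline(
        SelectFromModel(RandomForestClassifier(n_estimators=100),
                        threshold=-np.inf, max_features=15),
        BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    )
    # model.fit(X_train, y_train); model.score(X_test, y_test)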
|
|
15:00-17:00, Paper WePMP.35 | |
Multi-Label Classification of Stem Cell Microscopy Images Using Deep Learning |
Witmer, Adam | Univ. of California, Riverside |
Bhanu, Bir | Univ. of California |
Keywords: Applications of pattern recognition and machine learning
Abstract: This paper develops a pattern recognition and machine learning system to localize cell colony subtypes in multi-label, phase-contrast microscopy images. A convolutional neural network is trained to recognize homogeneous cell colonies and is used in a sliding-window, patch-based testing method to localize these homogeneous cell types within heterogeneous, multi-label images. The method is used to determine the effects of nicotine on induced pluripotent stem cells expressing the Huntington's disease phenotype. The results of the network are compared to those of an ECOC classifier trained on texture features. The ability of the network to localize cell phenotypes within heterogeneous colonies is visualized, and the temporal behavior of the stem cells is analyzed.
|
|
15:00-17:00, Paper WePMP.36 | |
D-NND: A Density-Based Hierarchical Clustering Method Via Nearest Neighbor Descent |
Qiu, Teng | Univ. of Electronic Science and Tech. of China |
Li, Chaoyi | Univ. of Electronic Science and Tech. of China |
Li, Yongjie | Univ. of Electronic Science and Tech. of China |
Keywords: Clustering
Abstract: Most density-based clustering methods rely heavily on how well the underlying density is estimated. However, like clustering, density estimation is itself a challenging unsupervised learning problem, especially the determination of the kernel bandwidth. In this paper, we propose a density-based multi-layer hierarchical clustering method, called Deep Nearest Neighbor Descent (D-NND), which largely alleviates the impact of density estimation. Unlike previous density-based methods, D-NND learns the underlying density distribution layer by layer while organizing the dataset sparsely and effectively into a directed tree. Experiments on three real-world datasets and several challenging synthetic datasets demonstrate that the proposed method has a strong ability to discover the underlying cluster structures and is not very sensitive to the density estimation method, the parameter settings, or clusters of multiple scales.
|
|
15:00-17:00, Paper WePMP.37 | |
Deep Age Estimation Model Stabilization from Images to Videos |
Ji, Zhipeng | Beijing Jiaotong Univ. School of Computer and Information |
Lang, Congyan | BJTU School of Computer and Information Tech |
Li, Kai | Chinese Acad. of Sciences |
Xing, Junliang | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Deep learning, Applications of pattern recognition and machine learning, Video processing and analysis
Abstract: Deep learning models for age estimation from a single image have significantly improved the state of the art. However, when a deep age estimation model trained on images is deployed directly on videos, it often suffers from fluctuation: the estimated age varies considerably across face frames of the same person. To deal with this problem, this work presents a new deep age estimation model specifically designed for video facial age estimation, which produces very stable and accurate results. The proposed architecture combines a convolutional neural network with an attention mechanism: the convolutional neural network extracts facial features, and an attention block aggregates the per-frame feature vectors into a single feature representation for final age estimation. The whole model is trained with a novel loss function that guarantees both the accuracy of each frame and the stability of the age estimates across all frames. To evaluate the proposed model, a new dataset is collected and annotated. Extensive experimental analyses and comparisons demonstrate the effectiveness of the proposed model and its state-of-the-art performance compared to many competing methods.
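The aggregation step can be sketched compactly; the scoring network below is a minimal stand-in for the paper's attention block.

    import torch
    import torch.nn as nn

    class AttentionPool(nn.Module):
        """Aggregate per-frame feature vectors into a single video-level
        feature using learned attention weights."""
        def __init__(self, feat_dim):
            super().__init__()
            self.score = nn.Linear(feat_dim, 1)

        def forward(self, frame_feats):                        # (T, feat_dim)
            w = torch.softmax(self.score(frame_feats), dim=0)  # (T, 1) weights
            return (w * frame_feats).sum(dim=0)                # (feat_dim,)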
|
|
15:00-17:00, Paper WePMP.38 | |
Person Re-Identification with Weighted Spatio-Temporal Features |
Zhang, Dongyu | Sun Yat-Sen Univ |
Chen, Rongcong | Sun Yat-Sen Univ |
Qiu, Zhilin | Sun Yat-Sen Univ |
Zhang, Wei | Sun Yat-Sen Univ |
Wang, Qing | Sun Yat-Sen Univ |
Keywords: Classification, Deep learning, Object recognition
Abstract: Person re-identification (re-id), which aims to recognize a person across non-overlapping cameras, has received increasing research attention. In this paper, we address a new person re-id problem, image-to-video (ImtoV) person re-id, in which the probe is an image and the gallery consists of videos from non-overlapping cameras with different views of the probe. This differs from traditional image-based person re-id, where the probe and gallery are all images. Although video brings more information into ImtoV, it remains a challenging problem because of large variations in lighting conditions, viewing angles, body pose, and occlusion across videos. Moreover, most current models ignore the fact that different frames carry different importance in matching and assign equal weights to the feature vectors of all frames, even though frames with serious occlusion or dramatic illumination changes hurt re-id performance. To overcome this problem, we propose a novel framework for this task. We adopt CNNs for the feature extraction of images and videos and further employ an LSTM network for the spatio-temporal feature representation of videos, adding a weighting module that adaptively learns the weights of different video frames. We evaluated the proposed framework on three public person re-id datasets, and the experimental results show that the proposed approach is effective for ImtoV person re-id.
|
|
15:00-17:00, Paper WePMP.39 | |
Transductive Label Augmentation for Improved Deep Network Learning |
Elezi, Ismail | Ca' Foscari Univ. of Venice; Zurich Univ. of Applied S |
Torcinovich, Alessandro | Ca' Foscari Univ. of Venice |
Vascon, Sebastiano | Univ. Ca' Foscari of Venice |
Pelillo, Marcello | Ca' Foscari Univ |
Keywords: Semi-supervised learning, Deep learning, Classification
Abstract: A major impediment to the application of deep learning to real-world problems is the scarcity of labeled data. Small training sets are of little use to deep networks because, due to their large number of trainable parameters, they will very likely overfit. On the other hand, increasing the training set size through further manual or semi-automatic labelling can be costly, if not impossible at times. The standard techniques to address this issue are transfer learning and data augmentation, the latter consisting of applying some sort of "transformation" to existing labeled instances to let the training set grow in size. Although this approach works well in applications such as image classification, where it is relatively simple to design suitable transformation operators, it is not obvious how to apply it in more structured scenarios. Motivated by the observation that in virtually all application domains it is easy to obtain unlabeled data, in this paper we take a different perspective and propose a label augmentation approach. We start from a small, curated labeled dataset and let the labels propagate through a larger set of unlabeled data using graph transduction techniques. This allows us to naturally exploit the (second-order) similarity information residing in the data, a source of information typically neglected by standard augmentation techniques. In particular, we show that by using known game-theoretic transductive processes we can create larger and sufficiently accurate labeled datasets, which in turn yield better-trained neural networks. Preliminary experiments demonstrate a consistent improvement on standard image classification datasets.
|
|
15:00-17:00, Paper WePMP.40 | |
Fast Skin Lesion Segmentation Via Fully Convolutional Network with Residual Architecture and CRF |
Luo, Wenfeng | South China Univ. of Tech |
Yang, Meng | Sun Yat-Sen Univ |
Keywords: Deep learning, Segmentation, features and descriptors, Medical image and signal analysis
Abstract: Melanoma is known to be the most fatal form of skin cancer. To achieve automated diagnosis of this disease, a system is needed that accurately locates suspicious skin lesions in images captured by standard digital cameras. Recently, there has been a trend toward using fully convolutional networks (FCNs) for image segmentation tasks. In this paper, we propose an FCN-based processing pipeline that incorporates a deep neural network and a graphical model to obtain a segmentation mask separating the lesion region from normal skin. Our method extends the residual network by adding a transposed convolution layer, yielding an FCN architecture. We demonstrate that the noisy output of the FCN can be refined by a fully connected conditional random field (CRF). Our model enjoys three major advantages over existing algorithms: a simpler processing pipeline, state-of-the-art accuracy in terms of segmentation sensitivity (95.6%), and fast inference time.
|
|
15:00-17:00, Paper WePMP.41 | |
Improved Robust Discriminant Analysis for Feature Extraction |
Chen, Xiaobo | Jiangsu Univ |
Keywords: Dimensionality reduction, Classification
Abstract: Dimensionality reduction (DR) has emerged as a crucial step in developing effective pattern recognition approaches. However, the performance of many DR algorithms degrades in noisy environments. To address this problem, we propose a novel algorithm, termed robust large margin discriminant analysis (RLMDA), to enhance the robustness of features. In the spirit of the large margin principle applied in support vector machines, RLMDA maximizes the minimum between-class dispersion while simultaneously minimizing the within-class dispersion in the reduced subspace. Moreover, the l1-norm rather than the traditional squared l2-norm is used to describe these dispersions, making the resulting algorithm robust to noisy features. The solution of RLMDA boils down to a nonconvex and nonsmooth optimization problem; we therefore take advantage of both the constrained concave-convex procedure (CCCP) and the Lagrangian dual method and develop an efficient iterative algorithm. Experimental results show that RLMDA achieves better performance than other related methods.
|
|
15:00-17:00, Paper WePMP.42 | |
An Improved Self-Representation Approach for Missing Value Imputation |
Chen, Xiaobo | Jiangsu Univ |
Keywords: Applications of pattern recognition and machine learning
Abstract: Recovering missing values (MVs) from incomplete data is an important problem for many real-world applications. In this work, we propose a novel MV imputation method that combines a sample self-representation strategy with the underlying local structure of the data in a unified framework. Specifically, the proposed method first obtains a first-round estimate of the MVs using an existing method. Then, a graph characterizing the local proximity structure of the data is constructed. Next, a novel model, coined graph-regularized local self-representation (GRLSR), is proposed by integrating two crucial elements: local self-representation and graph regularization. The former assumes each sample can be well represented (reconstructed) by linearly combining its neighboring samples, while the latter further requires that neighboring samples not deviate too much from each other after reconstruction. In this way, MVs can be more accurately restored through joint imputation and local linear reconstruction. We also develop an effective alternating optimization algorithm to solve the GRLSR model and thereby achieve the final imputation. Experimental results on real-world road network traffic flow data and several UCI benchmark datasets demonstrate the effectiveness of the proposed method.
|
|
15:00-17:00, Paper WePMP.43 | |
Convolutional Discriminant Analysis |
Zhong, Guoqiang | Ocean Univ. of China |
Zheng, Yan | Ocean Univ. of China |
Zhang, Xu-Yao | Inst. of Automation, Chinese Acad. of Sciences |
Wei, Hongxu | Ocean Univ. of China |
Ling, Xiao | Ocean Univ. of China |
Keywords: Deep learning, Image classification
Abstract: The softmax regressor is arguably the most commonly used classifier in convolutional neural networks (CNNs). However, the cross-entropy-based softmax loss only supervises the network to learn effective representations of the data; it does not explicitly enforce separability between the classes. In this paper, we propose a novel convolutional neural network model, called convolutional discriminant analysis (CDA). Beyond the softmax loss, CDA employs a convolutional discriminant loss (CD-Loss), which minimizes the distance between a sample and its class center while maximizing the distance between the sample and its adversarial class center in the space of the learned deep representations. Extensive experiments on two benchmark datasets, Fashion-MNIST and CIFAR-10, demonstrate the superiority of CDA over traditional deep CNNs on image classification tasks.
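A sketch of a loss with the stated pull/push structure is given below; the margin formulation and the choice of the nearest wrong-class center as the "adversarial" one are our assumptions, not necessarily the paper's exact CD-Loss.

    import torch

    def cd_style_loss(feats, labels, centers, margin=1.0):
        """Pull each sample toward its own class center and push it away
        from the nearest wrong-class center."""
        own = centers[labels]                                  # (B, d)
        d_own = (feats - own).pow(2).sum(dim=1)
        d_all = torch.cdist(feats, centers)                    # (B, C)
        d_all.scatter_(1, labels.unsqueeze(1), float("inf"))   # mask own class
        d_adv = d_all.min(dim=1).values.pow(2)
        return torch.relu(d_own - d_adv + margin).mean()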
|
|
15:00-17:00, Paper WePMP.44 | |
Merging Neurons for Structure Compression of Deep Networks |
Zhong, Guoqiang | Ocean Univ. of China |
Yao, Hui | Ocean Univ. of China |
Zhou, Huiyu | Univ. of Leicester |
Keywords: Deep learning, Image classification
Abstract: Deep neural networks are increasingly used in many fields, such as pattern recognition, computer vision, and natural language processing. However, how to deploy deep neural networks in mobile settings has become an urgent issue as mobile devices gain more and more popularity. This is mainly because mobile devices usually have very limited computation and storage resources, which prevents them from running large-scale deep networks. This paper proposes a novel method for structure compression of deep neural networks. The main idea is to merge the neurons and connections of the original network using clustering methods. As a result, the compressed network possesses far fewer parameters, reducing the computation and storage requirements. Experiments on benchmark datasets demonstrate that the proposed method can greatly improve the efficiency of deep neural networks while retaining their learning capability.
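For one fully connected layer, the merging idea can be sketched as below; k-means is one possible clustering choice, and the merge rule (average incoming weights, sum outgoing weights) is a common convention we assume here.

    import numpy as np
    from sklearn.cluster import KMeans

    def merge_neurons(W_in, W_out, n_merged):
        """Cluster neurons by their incoming-weight vectors and merge each
        cluster into one unit. W_in: (n_neurons, d_in); W_out: (d_out, n_neurons)."""
        km = KMeans(n_clusters=n_merged).fit(W_in)
        W_in_merged = km.cluster_centers_                   # averaged inputs
        W_out_merged = np.zeros((W_out.shape[0], n_merged))
        for j, c in enumerate(km.labels_):
            W_out_merged[:, c] += W_out[:, j]               # summed outputs
        return W_in_merged, W_out_merged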
|
|
15:00-17:00, Paper WePMP.45 | |
Effects of Sampling Skewness of the Importance-Weighted Risk Estimator on Model Selection |
Kouw, Wouter Marco | Delft Univ. of Tech |
Loog, Marco | Delft Univ. of Tech. / Univ. of Copenhagen |
Keywords: Domain adaptation, Model selection, Performance evaluation
Abstract: Importance-weighting is a popular and well-researched technique for dealing with sample selection bias and covariate shift. It has desirable characteristics such as unbiasedness, consistency and low computational complexity. However, weighting can have a detrimental effect on an estimator as well. In this work, we empirically show that the sampling distribution of an importance-weighted estimator can be skewed. For sample selection bias settings, and for small sample sizes, the importance-weighted risk estimator produces overestimates for data sets in the body of the sampling distribution, i.e. the majority of cases, and large underestimates for data sets in the tail of the sampling distribution. These over- and underestimates of the risk lead to sub-optimal regularization parameters when used for importance-weighted validation.
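The estimator under study is the standard importance-weighted empirical risk, sketched below; the densities are assumed known here, whereas in practice they are themselves estimated.

    import numpy as np

    def importance_weighted_risk(losses, p_target, p_source):
        """Reweight source-domain losses by w(x) = p_target(x) / p_source(x);
        unbiased in expectation, yet skewed in small samples as the paper shows."""
        w = p_target / np.maximum(p_source, 1e-12)
        return float(np.mean(w * losses))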
|
|
15:00-17:00, Paper WePMP.46 | |
A Convolutional Neural Network Approach for Estimating Tropical Cyclone Intensity Using Satellite-Based Infrared Images |
Combinido, Jay Samuel | Advanced Science and Tech. Inst |
Mendoza, John Robert | Electrical and Electronics Engineering Inst. Univ. Of |
Aborot, Jeffrey | Advanced Science and Tech. Inst |
Keywords: Applications of pattern recognition and machine learning, Transfer learning, Domain adaptation
Abstract: Existing techniques for satellite-based tropical cyclone (TC) intensity estimation involve an explicit feature extraction step to model TC intensity on a set of relevant TC features or patterns, such as eye formation and cloud organization. However, crafting such a feature set is often time-consuming and requires expert knowledge. In this paper, a convolutional neural network (CNN) approach that eliminates explicit feature extraction is proposed for estimating the intensity of tropical cyclones. Utilizing a Visual Geometry Group 19-layer CNN (VGG19) model pre-trained on ImageNet, transfer learning experiments were performed using grayscale IR images of TCs obtained from various geostationary satellites in the Western North Pacific region (1996-2016). The model re-trained on TC images achieved a root-mean-square error (RMSE) of 13.23 knots, a performance comparable to existing feature-based approaches (RMSE ranging from 12 to 20 knots). Moreover, the model was able to learn generic TC features that were previously identified in feature-based approaches as important indicators of TC intensity.
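The transfer-learning setup translates directly into a few lines of Keras; the regression head and the replication of grayscale IR images to three channels are our assumptions about unstated details.

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG19

    # Grayscale IR frames are replicated to 3 channels to match VGG19's input.
    base = VGG19(weights="imagenet", include_top=False,
                 input_shape=(224, 224, 3))
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(256, activation="relu")(x)
    intensity = layers.Dense(1)(x)                 # regression output, in knots
    model = models.Model(base.input, intensity)
    model.compile(optimizer="adam", loss="mse")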
|
|
15:00-17:00, Paper WePMP.47 | |
Driver Distraction Detection Using MEL Cepstrum Representation of Galvanic Skin Responses and Convolutional Neural Networks |
Dehzangi, Omid | West Virginia Univ |
Taherisadr, Mojtaba | Univ. of Michigan |
Keywords: Applications of pattern recognition and machine learning, Human behavior analysis, Biological image and signal analysis
Abstract: Driver distraction is one of the major causes of road accidents, which can lead to severe physical injuries and deaths. Statistics indicate the need for a reliable system that can continuously and ubiquitously monitor driver distraction and alert the driver before a roadway disaster has a chance to occur; early detection of driver distraction can thus help decrease the cost of roadway disasters. Physiological signals such as the galvanic skin response (GSR) have been used extensively to monitor distraction at the physiological level and to develop systems that alert drivers well in advance. In this paper, we introduce a driver distraction detection system based on Mel cepstrum analysis of GSR signals using convolutional neural networks (CNNs). The proposed model operates by calculating a two-dimensional (2D) representation of the GSR data and feeding it as input to a deep CNN. We present a recipe to extract Mel frequency filter bank coefficients in the time and frequency domains. The deep CNN is structured to automatically learn reliable discriminative patterns in the 2D Mel cepstrum space as features, thus replacing the traditional ad hoc hand-crafted features used with high-dimensional time-series data. The classification accuracy of the proposed algorithm is evaluated on GSR signals recorded from 7 subjects, aged 24 to 45, who actively participated in a naturalistic driving experiment during the recordings. The experimental results demonstrate that the proposed algorithm achieves 93.28% accuracy.
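The 2D input construction can be sketched with librosa's Mel-cepstrum routine; the GSR sampling rate, frame sizes, and filter count below are illustrative, not the authors' values.

    import numpy as np
    import librosa

    sr = 16                                          # assumed 16 Hz GSR stream
    gsr = np.random.randn(sr * 60).astype(float)     # one minute of signal
    mel_map = librosa.feature.mfcc(y=gsr, sr=sr, n_mfcc=13,
                                   n_fft=64, hop_length=16, n_mels=20)
    print(mel_map.shape)                             # (13, frames) 2D map for the CNN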
|
|
15:00-17:00, Paper WePMP.48 | |
Exploring Spatio-Temporal Correlations Via Deep Convolutional Neural Networks for Short-Term Traffic Flow Prediction with Incomplete Data |
Hou, Jiaxin | School of Software Engineering, Chongqing Univ |
Chen, Jing | School of Big Data & Software Engineering, Chongqing Univ |
Liao, Shijie | Chongqing Univ |
Xiong, Qingyu | Chongqing Univ |
Wen, Junhao | Chongqing Univ |
Keywords: Applications of pattern recognition and machine learning, Sequence modeling, Deep learning
Abstract: Traffic flow prediction is a crucial task for intelligent traffic management and control. Various machine learning methods have been applied in this field, and most encounter three fundamental issues: feature representation of traffic patterns, learning from a single location versus the whole network, and data quality. To address these issues, we present a deep architecture for traffic flow prediction that learns deep hierarchical feature representations with spatio-temporal relations over the traffic network. Furthermore, we design an ensemble learning strategy based on random subspace learning that makes the model tolerant of incomplete data. The experimental results corroborate the effectiveness of the proposed approach compared with state-of-the-art methods.
|
|
15:00-17:00, Paper WePMP.49 | |
A Joint Optimization Framework of Low-Dimensional Projection and Collaborative Representation for Discriminative Classification |
Liu, Xiaofeng | Carnegie Mellon Univ |
Li, Zhaofeng | Univ. of Chinese Acad. of Sciences |
Kong, Lingsheng | Chinese Acad. of Sciences |
Diao, Zhihui | Chinese Acad. of Sciences |
Yan, Junliang | Chinese Acad. of Sciences |
Zou, Yang | Carnegie Mellon Univ |
Yang, Chao | Univ. of Southern California |
Jia, Ping | Changchun Inst. of Optics, Fine Mechanies and Physics, CAS |
You, Jane | The Hong Kong Pol. Univ |
Keywords: Image classification, Object recognition, Dimensionality reduction
Abstract: Various representation-based methods have been developed and have shown great potential for pattern classification. To further improve their discriminability, we propose a bi-level optimization framework covering both low-dimensional projection and collaborative representation. Specifically, during the projection phase, we try to minimize the intra-class similarity and inter-class dissimilarity, while in the representation phase, our goal is to achieve the lowest correlation among the representation results. Solving this joint optimization mutually reinforces both feature projection and representation. Experiments on face recognition, object categorization, and scene classification datasets demonstrate the remarkable performance improvements brought by the proposed framework.
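The representation phase typically reduces to ridge-regularized coding, which has the closed form below; this is the generic collaborative-representation step, not the authors' full bi-level solver.

    import numpy as np

    def collaborative_code(D, y, lam=0.01):
        """Code a (projected) test sample y over the training matrix D
        (d x n) with an l2 penalty: alpha = (D^T D + lam*I)^{-1} D^T y."""
        n = D.shape[1]
        return np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)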
|
|
15:00-17:00, Paper WePMP.50 | |
Action Classification Via Concepts and Attributes |
Rosenfeld, Amir | York Univ |
Ullman, Shimon | Weizmann Inst |
Keywords: Vision and language, Classification, Deep learning
Abstract: Classes in natural images tend to follow long-tail distributions. This is problematic when there are insufficient training examples for rare classes, and the effect is amplified in compound classes involving the conjunction of several concepts, such as those appearing in action-recognition datasets. In this paper, we propose to address this issue by learning how to utilize common visual concepts that are readily available. We detect the presence of prominent concepts in images and use them to infer the target labels instead of using visual features directly, combining tools from vision and natural language processing. We validate our method on the recently introduced HICO dataset, reaching a mAP of 31.54%, and on the Stanford 40 Actions dataset, where the proposed method outperforms direct visual features, obtaining an accuracy of 83.12%. Moreover, the method provides for each class a semantically meaningful list of keywords and relevant image regions relating it to its constituent concepts.
|
|
15:00-17:00, Paper WePMP.51 | |
RelationNet: Learning Deep-Aligned Representation for Semantic Image Segmentation |
Zhuang, Yueqing | Peking Univ |
Tao, Li | Peking Univ |
Yang, Fan | Peking Univ |
Ma, Cong | Peking Univ |
Zhang, Ziwei | Peking Univ |
Jia, Huizhu | Peking Univ |
Xie, Xiaodong | Peking Univ |
Keywords: Scene understanding, Deep learning, Mid-level vision
Abstract: Semantic image segmentation, which assigns labels at the pixel level, plays a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning; however, one central problem of these methods is that deep convolutional neural networks give little consideration to the correlation among pixels. To handle this issue, we propose a novel deep neural network named RelationNet, which utilizes a CNN and an RNN to aggregate context information. In addition, a spatial correlation loss is applied to train RelationNet to align the features of spatial pixels belonging to the same category. Importantly, since pixel-wise annotations are expensive to obtain, we exploit a new training method that combines coarsely and finely labeled data. Experiments show the detailed improvements of each proposal, and the results demonstrate the effectiveness of the proposed method for semantic image segmentation, obtaining state-of-the-art performance on the Cityscapes benchmark and the Pascal Context dataset.
|
|
15:00-17:00, Paper WePMP.52 | |
Color Image Reconstruction with Perceptual Compressive Sensing |
Du, Jiang | Xidian Univ |
Xie, Xuemei | Xidian Univ |
Wang, Chenye | Xidian Univ |
Shi, Guangming | Xidian Univ |
Keywords: Learning-based vision, Deep learning, Perceptual organization
Abstract: We propose a novel compressive sensing framework for color images. Compressive sensing (CS) has recently gained popularity with the development of deep learning. To the best of our knowledge, existing methods all process RGB images channel by channel, which introduces redundancy among the measurements. In this paper, instead of recovering RGB images channel by channel at a uniform rate, we adopt non-uniform sampling across the channels of the YCbCr color space: the luminance component receives more measurements while the chrominance channels receive fewer. This greatly enhances CS performance on color images. Moreover, a perceptual loss provides a powerful means of capturing structural information. We use a measurement rate of 2% as an example in the experiments, and the results show that the proposed method outperforms all existing methods with better image structure.
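The non-uniform sampling idea can be sketched as below; the per-channel rates (chosen so the three channels average to the 2% example) and the Gaussian sensing matrices are illustrative assumptions.

    import numpy as np

    def sample_ycbcr(img_ycbcr, rates=(0.045, 0.0075, 0.0075), seed=0):
        """Compressively sample each YCbCr channel at its own rate; the
        luminance channel receives most of the measurement budget."""
        rng = np.random.default_rng(seed)
        measurements = []
        for ch, rate in enumerate(rates):
            x = img_ycbcr[..., ch].ravel()
            m = max(1, int(rate * x.size))
            phi = rng.standard_normal((m, x.size)) / np.sqrt(m)
            measurements.append(phi @ x)
        return measurements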
|
|
15:00-17:00, Paper WePMP.53 | |
Local Compact Binary Patterns for Background Subtraction in Complex Scenes |
He, Wei | Hunan Inst. of Science and Tech |
Kim, Yongkwan | Hoseo Univ |
Qi, Qi | Hunan Inst. of Science and Tech |
Wu, Jianhui | Hunan Inst. of Science and Tech |
Zhang, Guoyun | Hunan Inst. of Science and Tech |
Guo, Longyuan | Hunan Inst. of Science and Tech |
Tu, Bing | Hunan Inst. of Science and Tech |
Ou, Xianfeng | Hunan Inst. of Science and Tech |
Huang, Feng | Hunan Inst. of Science and Tech |
Keywords: Object detection, Motion and tracking
Abstract: Background modeling in complex scenes is a challenging problem. In this paper, a novel background subtraction method is proposed to address it. First, textures are modeled with local compact binary patterns (LCBP), which offer excellent robustness, strong discriminative power, and fast computation. To make LCBP more robust to appearance changes in complex scenarios, spatio-temporal local compact binary patterns (STLCBP) are then introduced, combining spatial texture information with temporal motion information. Multiple color spaces are also employed to separate foreground pixels from the background more accurately. To our knowledge, this is the first time LCBP have been used for background modeling. Extensive experimental results on a widely used dataset clearly show that the proposed method outperforms other state-of-the-art methods and works effectively in complex scenes.
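For reference, the plain 3x3 local binary pattern that LCBP compacts is computed as follows (the base encoding only, not the LCBP compaction itself).

    import numpy as np

    def lbp_code(patch):
        """8-bit LBP code for one pixel: threshold the 8 neighbors of a 3x3
        patch against its center."""
        center = patch[1, 1]
        ring = patch.ravel()[[0, 1, 2, 5, 8, 7, 6, 3]]   # clockwise neighbors
        bits = (ring >= center).astype(np.uint8)
        return int(np.packbits(bits)[0])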
|
|
15:00-17:00, Paper WePMP.54 | |
Facial Expression Recognition for Different Pose Faces Based on Special Landmark Detection |
Wu, Wenqi | Inst. of Automation, Chinese Acad. of Sciences |
Yin, Yingjie | Inst. of Automation Chinese Acad. of Sciences |
Wang, Yingying | Inst. of Automation, Chinese Acad. of Sciences |
Wang, Xingang | Inst. of Automation, Chinese Acad. of Sciences |
Xu, De | Inst. of Automation, Chinese Acad. of Sciences, Beijing 10 |
Keywords: Object recognition, Deep learning, Neural networks
Abstract: Facial expression recognition from a single facial image is a challenging task in computer vision. Since human faces are roughly convex, the self-occlusion caused by face pose seriously affects the accuracy of expression recognition. To solve this problem, we propose a novel facial expression recognition method for faces in different poses based on special landmark detection (FER-MPI-SFL). Our method is built on two shared networks. The outputs of the first network are 29 special landmarks and one face box, which serve as the inputs to the second network and are used to estimate the face pose. RoIAlign and feature-map concatenation are introduced in the second network to recognize the facial expression, with the weight allocation of the concatenated feature maps guided by the pose estimation result. In addition, an improved center loss is proposed to enlarge the distances between the features of different expressions, making them easier to classify in the feature space. As a result, performance superior to other state-of-the-art methods is achieved on the facial expression databases CK+, MMI, and Oulu-CASIA VIS, as well as on a newly created database, CASIA-MFE, which contains more faces in different poses.
|
|
15:00-17:00, Paper WePMP.55 | |
Plant Identification from Bark: A Texture Description Based on Statistical Macro Binary Pattern |
Boudra, Safia | LaSTIC, Univ. of Batna 2 |
Yahiaoui, Itheri | CReSTIC, Univ. De Reims Champagne-Ardenne |
Behloul, Ali | Univ. of Batna 2 |
Keywords: Applications of computer vision, Texture analysis, Image classification
Abstract: This paper presents a novel yet compact texture descriptor for plant species identification based on bark texture images. Termed the Statistical Macro Binary Pattern (SMBP), the descriptor is informative, rotation invariant, and designed to encode texture information from a large support area. The main novelty of the approach is the use of a statistical description to represent the intensity distribution in the large support area, together with an LBP-like encoding scheme that derives a statistical macro pattern by thresholding against an adaptive statistical prototype. We test three neighborhood sampling schemes according to the angular quantization at each level of the macrostructure. Comprehensive experiments on three challenging bark datasets (BarkTex, Trunk12, AFF) show that our descriptor achieves higher and more consistent identification rates compared with LBP-like texture descriptors.
|
|
15:00-17:00, Paper WePMP.56 | |
Person in Vehicle Counting Method of HOV HOT System |
Miyamoto, Shinichi | NEC Corp |
Keywords: Object recognition, Video analysis, Learning-based vision
Abstract: An automated system for counting the number of passengers in a vehicle is desired for HOV (High Occupancy Vehicle) and HOT (High Occupancy Toll) applications. In outdoor environments, many complicating factors, such as sunlight and darkly tinted vehicle windows, degrade captured image quality, and as a result counting accuracy has remained low. In this paper, a simple but refined image acquisition scheme and a novel passenger counting algorithm are proposed. In particular, by incorporating a deep neural network for face detection, integrating face detection results over multiple frames, and computing a confidence value for the estimation result, useful and highly accurate counting is realized.
|
|
15:00-17:00, Paper WePMP.57 | |
Kernel Dual Linear Regression for Face Image Set Classification |
Gao, Xizhan | Nanjing Univ. of Science and Tech |
Sun, Quansen | Nanjing Univ. of Science and Tech |
Xu, Haitao | Liaocheng Univ |
Li, Yanmeng | Nanjing Univ. of Science and Tech |
Keywords: Image classification, Regression, Support vector machine and kernel methods
Abstract: Dual linear regression classification (DLRC) extends linear regression classification (LRC) from conventional still-image-based classification to image-set-based classification and has demonstrated good performance on image set classification. However, when the image sets of different objects are not linearly separable, or when the linear regression axes of class-specific samples from different classes intersect, DLRC may fail to classify the image sets well. In this paper, a new classification method, kernel dual linear regression classification (KDLRC), is proposed. KDLRC is a nonlinear version of DLRC that overcomes this drawback: it first embeds the input data into a high-dimensional Hilbert space, where the data become easier to classify. Extensive experiments on four well-known databases show that the performance of KDLRC is better than that of DLRC and several state-of-the-art classifiers.
|
|
15:00-17:00, Paper WePMP.58 | |
Flexible Rotation Invariant Bases from Orthogonal Moments |
Yang, Bo | Northwestern Pol. Univ |
Chen, Xiaofeng | Northwestern Pol. Univ |
Zhang, Yuye | Xianyang Normal Univ |
Keywords: Object recognition, Image processing and analysis, Signal analysis
Abstract: Rotation is a basic but fundamental geometric transformation, and the design of rotation invariants is an indispensable part of research on moment invariants. Thanks to their better numerical stability and efficient invariant construction, Gaussian-Hermite moments have become powerful tools in pattern recognition. However, the existing rotation invariants of Gaussian-Hermite moments are constructed either under special constraints or for special patterns: the invariant bases neither contain all possible rotation invariants nor are available for all images. In this paper, we propose flexible rotation-invariant bases from Gaussian-Hermite moments. The invariants generated from such bases are available for any image, and they are complete and exact representations of all rotation invariants of Gaussian-Hermite moments. The inherent properties of these flexible bases, such as rotation invariance, completeness, and independence, are proven, and rotation invariance is verified on real data. Experiments on template matching and image recognition show that the invariants generated by the flexible bases have better feature representation ability and numerical stability than traditional complex moment invariants.
|
|
15:00-17:00, Paper WePMP.59 | |
Anchor Free Network for Multi-Scale Face Detection |
Wang, Chengji | Xiamen Univ |
Luo, Zhiming | Xiamen Univ |
Lian, Lancer | Xiamen Univ |
Li, Shaozi | Xiamen Univ |
Keywords: Object detection, Deep learning, Multitask learning
Abstract: Anchor-based deep methods are the most widely used methods for face detection and have achieved state-of-the-art results. Whereas anchor-based methods estimate bounding boxes relative to pre-defined anchor boxes, anchor-free methods perform localization by predicting the offsets from a pixel inside a face to its outer boundaries, which is much more precise. However, anchor-free methods suffer from low recall, mainly because 1) using only single-scale features leads to missed detections of small faces, and 2) there is a severe intra-class imbalance among faces of different sizes. In this paper, to address these problems, we propose a unified anchor-free network for detecting multi-scale faces by leveraging the local and global contextual information of multi-layer features. We also utilize a scale-aware sampling strategy that adaptively selects positive samples to mitigate the intra-class imbalance issue. Furthermore, a revised focal loss function is adopted to deal with the foreground/background imbalance. Experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method.
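For reference, the standard binary focal loss that the revised loss builds on is shown below (the original formulation, not the authors' revision).

    import torch

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """Binary focal loss: down-weight easy examples to counter the
        foreground/background imbalance."""
        p = torch.sigmoid(logits)
        pt = torch.where(targets == 1, p, 1 - p)
        alpha_t = torch.where(targets == 1,
                              alpha * torch.ones_like(p),
                              (1 - alpha) * torch.ones_like(p))
        return (-alpha_t * (1 - pt).pow(gamma)
                * pt.clamp(min=1e-8).log()).mean()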
|
|
15:00-17:00, Paper WePMP.60 | |
3DMAX-Net: A Multi-Scale Spatial Contextual Network for 3D Point Cloud Semantic Segmentation |
Ma, Yanxin | National Univ. of Defense Tech |
Guo, Yulan | National Univ. of Defense Tech |
Lei, Yinjie | Sichuan Univ |
Lu, Min | National Univ. of Defense Tech |
Zhang, Jun | National Univ. of Defense Tech |
Keywords: 3D vision, Scene understanding, Deep learning
Abstract: Semantic segmentation of 3D scenes is a fundamental problem in 3D computer vision. In this paper, we propose a deep neural network for the 3D semantic segmentation of raw point clouds. A multi-scale feature learning block is first introduced to obtain informative contextual features in 3D point clouds. A global and local feature aggregation block is then added to improve the feature learning ability of the network. Based on these strategies, a powerful architecture named 3DMAX-Net is provided for semantic segmentation of raw 3D point clouds. Experiments conducted on the Stanford Large-Scale 3D Indoor Spaces Dataset using only geometric information clearly show the superiority of the proposed network.
|
|
15:00-17:00, Paper WePMP.61 | |
Beyond Two-Stream: Skeleton-Based Three-Stream Networks for Action Recognition in Videos |
Xu, Jianfeng | KDDI Res. Inc |
Tasaka, Kazuyuki | KDDI Res. Inc |
Yanagihara, Hiromasa | KDDI Res. Inc |
Keywords: Behavior recognition, Deep learning for multimedia analysis, Neural networks
Abstract: The two-stream architecture, trained on stacked optical flows and image frames, has demonstrated excellent performance for human action recognition in videos and has served as the basis of many advanced architectures. To improve performance, we propose a novel three-stream architecture based on skeletons, extending the two-stream architecture while retaining video as the only input. It has recently become possible to reliably detect skeletons from videos; these contain high-level joint-motion information and are complementary to low-level optical flows and image frames. However, skeleton data lack information about objects (e.g., a book) involved in the action, which is essential to distinguish actions such as reading, writing, and typing. Therefore, our three-stream networks fuse the complementary information from skeleton data, optical flows, and image frames. Furthermore, we crop the human and object regions using skeleton data for sparse spatial sampling, and we calculate adaptive weights when fusing the three streams to further improve performance. The experimental results demonstrate that the proposed three-stream networks outperform state-of-the-art techniques on the NTU RGB+D dataset, significantly improving the accuracy from 80.8% to 86.4%.
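The adaptive late-fusion step can be sketched as below; the gating network over concatenated stream scores is our assumption about how the adaptive weights are produced.

    import torch
    import torch.nn as nn

    class AdaptiveFusion(nn.Module):
        """Per-sample softmax weights over the skeleton, flow, and RGB
        stream scores."""
        def __init__(self, n_classes):
            super().__init__()
            self.gate = nn.Linear(3 * n_classes, 3)

        def forward(self, s_skel, s_flow, s_rgb):      # each (B, n_classes)
            w = torch.softmax(
                self.gate(torch.cat([s_skel, s_flow, s_rgb], dim=1)), dim=1)
            return (w[:, 0:1] * s_skel + w[:, 1:2] * s_flow
                    + w[:, 2:3] * s_rgb)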
|
|
15:00-17:00, Paper WePMP.62 | |
Sparse Representation and Weighted Clustering Based Abnormal Activity Detection |
Jin, Dongliang | Nanjing Univ. of Posts and Telecommunications |
Songhao, Zhu | Nanjing Univ. of Posts and Telecommunications |
Songsong, Wu | Nanjing Univ. of Posts and Telecommunications |
Jing, Xiaoyuan | Nanjing Univ. of Posts and Telecommunications |
Keywords: Behavior recognition, Sparse learning, Image processing and analysis
Abstract: Abnormal activity detection is a challenging problem that can be divided into global abnormal activity detection and local abnormal activity detection. First, the hybrid histogram of optical flow feature is extracted; then, a double sparse representation is proposed to tackle global abnormal activity detection; finally, for local abnormal activity detection, the foreground of the region of interest within the current frame is first detected, and then an online weighted clustering method is utilized to detect the local abnormal activity. Experimental results on the UMN and UCSD datasets validate the advantages of the proposed method.
|
|
15:00-17:00, Paper WePMP.63 | |
Weak Supervised Learning Based Abnormal Behavior Detection |
Sun, Xian | Nanjing Univ. of Posts and Telecommunications |
Songhao, Zhu | Nanjing Univ. of Posts and Telecommunications |
Songsong, Wu | Nanjing Univ. of Posts and Telecommunications |
Jing, Xiaoyuan | Nanjing Univ. of Posts and Telecommunications |
Keywords: Behavior recognition, Semi-supervised learning, Image processing and analysis
Abstract: Hand-crafted features are adopted in most existing abnormal behavior detection methods; however, it is difficult to choose and design an effective behavior feature because of high computation costs and complex scenarios. To solve this problem, a weakly supervised abnormal behavior detection method based on temporal consistency is proposed in this paper. First, a temporal Gram matrix is constructed for a given pair of videos. Then, a pair of behavior units (candidate action fragments) is formed by exploiting the temporal consistency and smoothness of human behavior, which aims to locate the start and end frames of the related abnormal action and train the corresponding classifier. Finally, sparse reconstruction is utilized to detect abnormal behavior. Experimental results on the CAVIAR and BOSS datasets demonstrate the effectiveness of the proposed method.
|
|
15:00-17:00, Paper WePMP.64 | |
A New Method for Face Alignment under Extreme Poses and Occlusion |
Li, Jun | Nanjing Univ |
Xiao, Qiongling | Nanjing Univ |
Yang, Ruoyu | Nanjing Univ |
Keywords: Object detection, Applications of computer vision, Applications of pattern recognition and machine learning
Abstract: In real-world conditions, robust face alignment is challenging due to the large variability of occlusion and pose. Many methods aim to solve the problem, but they can handle either images with occlusion only or images with arbitrary poses only. In this paper, we propose a unified framework that ignores the points which cannot be seen under occlusion and extreme poses: we first obtain facial parts by classification and then train regression models to locate the key points. This leads to higher accuracy when locating the truly visible points, without considering the occluded and non-existent points. Besides, we observed that the drift and shape of face detection results affect face alignment. As far as we know, we are the first to explicitly raise this issue and solve it to some extent. Finally, our method outperforms the state-of-the-art methods on the AFLW and COFW datasets. It is also comparable to other methods on the LFPW dataset.
|
|
15:00-17:00, Paper WePMP.65 | |
Latent Linear Dynamics for Modeling Pedestrian Behaviors |
Dhaka, Devendra | NEC Corp. Japan |
Ishii, Masato | NEC |
Sato, Atsushi | NEC |
Keywords: Behavior recognition, Clustering
Abstract: We consider the problem of generative pedestrian modeling for capturing the common behaviors constituting a trajectory dataset. We present a model that simultaneously represents the trajectory data and the latent dynamics associated with different behaviors. The model represents the trajectory dynamics as a scaled component of the cluster dynamics, where the cluster dynamics is shared among all trajectories belonging to a cluster, thus giving rise to similarity. The cluster dynamics is modeled with Bayesian nonparametrics, in particular a Dirichlet process mixture model, which avoids fixing the number of unique behaviors or clusters in advance. Additionally, a relative velocity scaling term encapsulates the relative nature of an individual trajectory with respect to its cluster dynamics. Model parameters and latent states are inferred using a sequential blocked Gibbs sampler, which can be scaled to large datasets.
|
|
15:00-17:00, Paper WePMP.66 | |
SCUT-FBP5500: A Diverse Benchmark Dataset for Multi-Paradigm Facial Beauty Prediction |
Liang, Lingyu | South China Univ. of Tech |
Lin, Luojun | South China Univ. of Tech |
Jin, Lianwen | South China Univ. of Tech |
Xie, Duorui | South China Univ. of Tech |
Li, Mengru | South China Univ. of Tech |
Keywords: Image classification
Abstract: Facial beauty prediction (FBP) is a significant visual recognition problem whose aim is to assess facial attractiveness consistently with human perception. To tackle this problem, various data-driven models, especially state-of-the-art deep learning techniques, have been introduced into FBP, and benchmark datasets have become one of the essential elements for achieving FBP. Previous works have formulated the recognition of facial beauty as a specific supervised learning problem of classification or regression, which indicates that FBP is intrinsically a computation problem with multiple paradigms. However, most FBP benchmark datasets were built under specific computational constraints, which limits the performance and flexibility of the computational models trained on them. In this paper, we argue that FBP is a multi-paradigm computation problem, and we build a new diverse benchmark dataset, called SCUT-FBP5500, to enable multi-paradigm facial beauty prediction. The SCUT-FBP5500 dataset contains 5500 frontal faces with diverse properties (male/female, Asian/Caucasian, ages) and diverse labels (face landmarks, beauty scores within [1, 5], beauty score distributions), which allows computational models with different FBP paradigms, such as appearance-based/shape-based facial beauty classification/regression/ranking models for male/female Asian/Caucasian subjects. We evaluated the SCUT-FBP5500 dataset for FBP using different combinations of features and predictors, as well as various deep learning methods. The results indicate improvements in FBP and potential applications based on SCUT-FBP5500.
|
|
15:00-17:00, Paper WePMP.67 | |
Fast and Robust Pose Estimation Algorithm for Bin Picking Using Point Pair Feature |
Li, Mingyu | Tohoku Univ |
Hashimoto, Koichi | Tohoku Univ |
Keywords: Object recognition, 3D vision
Abstract: Bin picking refers to picking up objects randomly piled in a container (bin), and robotic bin picking is often used to improve industrial production efficiency. A pose estimation algorithm is necessary to tell the robot the poses of the objects. This paper proposes a pose estimation algorithm for bin picking using 3D point cloud data. The Point Pair Feature algorithm is performed in a fast way to propose possible poses, and the poses are verified by a voxel-based verification method. Iterative Closest Point is used to refine the resulting poses. Our algorithm is shown to be more accurate and faster than the Curve Set Feature and Point Pair Feature algorithms, robust to occlusion, and able to detect multiple poses in one scene.
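The point pair feature itself is compact enough to state directly; below is a minimal numpy sketch of the classic 4D feature of Drost et al., on which the paper builds (the accelerated variant and the voting stage are not reproduced):

```python
import numpy as np

def angle(a, b):
    """Angle in [0, pi] between two 3D vectors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def point_pair_feature(p1, n1, p2, n2):
    """Classic 4D point pair feature for two oriented points
    (position p, normal n):
        F = (||d||, ang(n1, d), ang(n2, d), ang(n1, n2)).
    Pairs are hashed on a quantised F to vote for candidate poses.
    """
    d = p2 - p1
    return np.array([np.linalg.norm(d),
                     angle(n1, d), angle(n2, d), angle(n1, n2)])
```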
|
|
15:00-17:00, Paper WePMP.68 | |
Hybrid 3D Surface Description with Global Frames and Local Signatures of Histograms |
Shen, Zhiqiang | School of Control Science and Engineering |
Ma, Xin | Shandong Univ |
Zeng, Xianglei | Shandong Univ |
Keywords: 3D vision, Object recognition
Abstract: This paper presents a novel 3D descriptor named Frame-SHOT that combines a global structural frame with local surface information. Global feature descriptors are generally more descriptive for surface representation, but are susceptible to occlusion and clutter. In contrast, local feature descriptors are more robust, but less discriminative due to the limits of the support region. We exploit the advantages of both by combining local and global information for surface description. The Signature of Histograms of OrienTations (SHOT) descriptor is used to characterize the local surface, and structural frame points of the object are used to encode the global surface. We have compared the proposed Frame-SHOT descriptor with state-of-the-art global and local methods on two public datasets. The results show that our proposed descriptor is more descriptive and robust for surface matching and recognition.
|
|
15:00-17:00, Paper WePMP.69 | |
A Structural Approach to Person Re-Identification Problem |
Mahboubi, Amal | Greyc Umr Cnrs 6072 |
Brun, Luc | ENSICAEN |
Conte, Donatello | Univ. of Tours |
Keywords: Object recognition, Image processing and analysis
Abstract: Although it has been studied extensively during the past decades, object tracking is still a difficult problem due to many challenges. Several improvements have been made, but increasingly complex scenes (dense crowds, complex interactions) require more sophisticated approaches. In particular, long-term tracking is an interesting problem that allows objects to be tracked even after they become occluded for a long time or leave and re-enter the field of view. In this case, the major challenges are significant changes in appearance, scale, and so on. At the heart of the solution to long-term tracking is the re-identification technique, which allows an object to be identified when it becomes visible again after an occlusion or re-enters the scene. This paper proposes an approach for pedestrian re-identification based on a structural representation of people. The experimental evaluation is carried out on two public datasets (ETHZ and CAVIAR4REID) and shows promising results compared to other state-of-the-art approaches.
|
|
15:00-17:00, Paper WePMP.70 | |
Pre-Trained VGG-Net Architecture for Remote-Sensing Image Scene Classification |
Usman, Muhammad | Univ. of Chinese Acad. of Sciences |
Wang, Weiqiang | Univ. of Chinese Acad. of Sciences |
Shahbaz Pervaiz, Chattha | Yanbu Univ. Coll |
Sajid, Ali | Univ. of Education |
Keywords: Image classification, Deep learning, Transfer learning
Abstract: The visual geometry group network (VGGNet) is widely used for image classification and has proven to be a very effective method. Most existing approaches use features of just one type, and traditional fusion methods generally combine multiple manually created features. However, obtaining the benefits of multi-layer features remains a significant challenge in the remote-sensing domain. To address this challenge, we present a simple yet powerful framework based on canonical correlation analysis and a 4-layer SVM classifier. Specifically, the pretrained VGGNet is employed as a deep feature extractor to extract mid-level and deep features from remote-sensing scene images. We then choose two convolutional (mid-level) layers and two fully-connected layers produced by VGGNet, in which each layer is treated as a separate feature descriptor. Next, canonical correlation analysis (CCA) is used as a feature fusion strategy to refine the extracted features and fuse them with more discriminative power. Finally, the support vector machine (SVM) classifier is used to construct the 4-layer representation of the scene images. Experiments on the UC Merced and WHU-RS datasets demonstrate that the proposed approach, even without data augmentation, fine-tuning, or a coding strategy, outperforms current state-of-the-art methods.
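A minimal sketch of the CCA fusion step with sklearn, assuming hypothetical feature matrices extracted from two VGGNet layers (dimensions, names, and the concatenation fusion rule are illustrative, not the paper's exact pipeline):

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import SVC

# Hypothetical features from two VGGNet layers for N training scenes.
N = 200
fc7 = np.random.rand(N, 1024)    # deep (fully-connected) features
conv = np.random.rand(N, 512)    # mid-level (convolutional) features
labels = np.random.randint(0, 21, N)

# CCA projects both views into a shared space where they are maximally
# correlated; concatenating the projections is one common fusion rule.
cca = CCA(n_components=32)
cca.fit(fc7, conv)
a, b = cca.transform(fc7, conv)
fused = np.hstack([a, b])

clf = SVC(kernel="linear").fit(fused, labels)  # scene classifier
```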
|
|
15:00-17:00, Paper WePMP.71 | |
Long-Term Object Tracking with Instance Specific Proposals |
Liu, Hao | National Univ. of Defense Tech |
Hu, Qingyong | National Univ. of Defense Tech |
Li, Biao | National Univ. of Defense Tech |
Guo, Yulan | National Univ. of Defense Tech |
Keywords: Motion and tracking, Online learning, Regression
Abstract: Correlation filter based trackers have been extensively investigated for their superior efficiency and fairly good robustness. However, it remains challenging to achieve long-term tracking when the object is under occlusion and severe deformation. In this paper, we propose a tracker named Complementary Learners with Instance-specific Proposals (CLIP). The CLIP tracker consists of three main components: a translation filter, a scale filter, and an error correction module. Complementary features are incorporated into the translation filter to cope with illumination changes and deformation, and an adaptive updating mechanism is proposed to prevent model corruption. The translation filter aims to provide excellent real-time inference. Furthermore, the error correction module is activated to correct the localization error using an instance-specific proposal generator, especially when the target suffers from dramatic appearance changes. Experimental results on the OTB, Temple-Color 128 and UAV20L datasets demonstrate that the CLIP tracker performs favorably against existing competitive trackers in terms of accuracy and robustness. Moreover, our proposed CLIP tracker runs at 33 fps on OTB, making it highly suitable for real-time applications.
|
|
15:00-17:00, Paper WePMP.72 | |
Which Part Is Better: Multi-Part Competition Network for Person Re-Identification |
Du, Peng | Xi’an Jiaotong Univ. School of Software Engineering |
Song, Yonghong | Xi'an Jiaotong Univ |
Zhang, Yuanlin | Xian JiaoTong Univ |
Keywords: Image classification, Deep learning
Abstract: Person re-identification is a challenging task due to background clutter, occlusion and illumination variations. In addition, pedestrian misalignment always exists in automatically detected datasets. In this paper, we propose a Multi-Part Competition Network (MPCN) consisting of a Multi-Part Network (MPN) and a Part Competition Network (PCN), which aims to solve the misalignment problem caused by detector errors and human pose variations. First, we construct original body parts and enlarged body parts using a human pose estimation algorithm. These two kinds of body parts not only alleviate the misalignment caused by background and varying human poses but also compensate for the missing details and imprecise body parts introduced by the human pose estimator. Then, we use the MPN to acquire global features and two different kinds of body part features. The components of the MPN, a global branch and two part branches, are combined by an ROI pooling layer. Finally, we apply the PCN to achieve a tradeoff between the original body parts and the enlarged body parts and to acquire discriminative part features from these two kinds of body parts. Extensive evaluations on three widely used re-id datasets, Market-1501, CUHK03, and VIPeR, demonstrate that our proposed network achieves competitive results compared to the state-of-the-art methods.
|
|
15:00-17:00, Paper WePMP.73 | |
Explain Black-Box Image Classifications Using Superpixel-Based Interpretation |
Wei, Yi | Univ. at Albany, State Univ. of New York |
Chang, Ming-Ching | Univ. at Albany - SUNY |
Ying, Yiming | SUNY Albany |
Lim, Ser Nam | GE |
Lyu, Siwei | SUNY Albany |
Keywords: Image classification, Deep learning, Applications of pattern recognition and machine learning
Abstract: How best to understand and interpret the decisions of deep neural networks is a crucial topic, as the impact of intelligent deep network systems is prevalent in many applications. We propose a superpixel-based method to interpret and explain the results of black-box deep networks in widely-applied image classification tasks. We perform probabilistic prediction difference analysis upon one or more superpixels clustered from image pixels. Our method calculates a superpixel score map visualization that can provide rich interpretation regarding image components. Such interpretation provides the supportive likelihoods of image regions for the decisions made by the black-box classifier. We compare our method against state-of-the-art pixelwise interpretation methods using the latest deep neural network classifiers on the ImageNet dataset. Results show that our method produces more consistent interpretations in less computation time. Our method also supports interactive interpretation, where users can acquire explanations on specified regions through a convenient interface for a prompt reaction.
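A minimal sketch of superpixel-level prediction-difference analysis, assuming a hypothetical black-box `predict` callable that maps an image to class probabilities (the probabilistic marginalization used in the paper is simplified here to a mean-fill occlusion test):

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_saliency(image, predict, target_class, n_segments=100):
    """Score each SLIC superpixel by how much the classifier's confidence
    in `target_class` drops when that superpixel is blanked out.
    Returns a per-pixel score map."""
    segments = slic(image, n_segments=n_segments)
    base = predict(image)[target_class]
    score_map = np.zeros(image.shape[:2])
    for s in np.unique(segments):
        masked = image.copy()
        masked[segments == s] = image.mean(axis=(0, 1))  # neutral fill
        drop = base - predict(masked)[target_class]
        score_map[segments == s] = drop                  # support for the decision
    return score_map
```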
|
|
15:00-17:00, Paper WePMP.74 | |
Improved Correlation Filter Tracking with Hard Negative Mining |
Qie, Chunguang | Xiamen Univ |
Guanjun, Guo | Xiamen Univ |
Yan, Yan | Xiamen Univ |
Liming, Zhang | Univ. of Macau |
Wang, Hanzi | Xiamen Univ |
Keywords: Motion and tracking, Video analysis
Abstract: Recently, correlation filter based trackers have achieved very good tracking performance. However, due to the boundary effects of the circulant matrix and the usage of a cosine window, the lack of effective negative samples becomes a challenging problem for correlation filter based trackers. This problem may cause overfitting, making these trackers very sensitive to deformation and occlusion. In this paper, we propose a novel object tracker (i.e., STAPLE HNM), which can effectively select hard negative samples and assign adaptive weights to these samples to train the correlation filter. Experimental results demonstrate that the proposed STAPLE HNM tracker effectively improves the performance of the baseline STAPLE CA tracker on the OTB-50 and OTB-100 datasets. Moreover, the proposed STAPLE HNM tracker also achieves superior performance compared with several state-of-the-art trackers.
|
|
15:00-17:00, Paper WePMP.75 | |
A Rigorous Solution for Closed-Form Correlation Filter Tracking |
Li, Dongdong | National Univ. of Defense Tech |
Wen, Gongjian | National Univ. of Defense Tech |
Kuai, Yangliu | National Univ. of Defense Tech |
Keywords: Motion and tracking, Video analysis, Scene understanding
Abstract: Recently, Discriminative Correlation Filters (DCF) have achieved enormous popularity in the tracking community due to their high efficiency and fair robustness. Exploiting the circular structure, DCFs transform the computationally expensive spatial correlation into efficient element-wise operations in the Fourier domain. In this paper, we argue that this element-wise solution can be derived only in the case of single-channel features. For tracking with multi-channel features, the element-wise solution trains each feature dimension independently and fails to learn a joint correlation filter. To tackle this problem, we propose a rigorous solution to closed-form correlation filter tracking. This rigorous solution can be computed pixel by pixel from small linear equation systems. Experimental results demonstrate that our rigorous pixel-wise solution achieves better tracking performance than the baseline element-wise solution.
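The contrast between the two solutions can be sketched in numpy: the well-known element-wise single-channel solution, and a per-bin joint solve over channels. This is the standard ridge-regression derivation; whether it matches the paper's exact formulation is an assumption:

```python
import numpy as np

# Single-channel case: ridge regression with circulant samples has a
# closed-form, element-wise solution in the Fourier domain.
x = np.random.rand(64, 64)           # template patch
y = np.random.rand(64, 64)           # desired (e.g. Gaussian) response
lam = 1e-2
X, Y = np.fft.fft2(x), np.fft.fft2(y)
H = np.conj(X) * Y / (np.conj(X) * X + lam)      # filter, element-wise

# Multi-channel case: each Fourier bin couples the channels and requires
# a small C x C linear solve instead of an element-wise division.
C = 3
Xc = np.fft.fft2(np.random.rand(C, 64, 64), axes=(-2, -1))
Hc = np.zeros_like(Xc)
for u in range(64):
    for v in range(64):
        xv = Xc[:, u, v]                                 # C channel values
        A = np.outer(np.conj(xv), xv) + lam * np.eye(C)  # C x C system
        Hc[:, u, v] = np.linalg.solve(A, np.conj(xv) * Y[u, v])
```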
|
|
15:00-17:00, Paper WePMP.76 | |
Incremental 3D Line Segment Extraction from Semi-Dense SLAM |
He, Shida | Univ. of Alberta |
Qin, Xuebin | Univ. of Alberta |
Zhang, Zichen | Univ. of Alberta |
Jagersand, Martin | Univ. of Alberta |
Keywords: 3D reconstruction, Multiple view geometry, Image based modeling
Abstract: Although semi-dense Simultaneous Localization and Mapping (SLAM) has become more popular over the last few years, there is a lack of efficient methods for representing and processing the large-scale point clouds it produces. In this paper, we propose using 3D line segments to simplify the point clouds generated by semi-dense SLAM. Specifically, we present a novel incremental approach for 3D line segment extraction. This approach reduces a 3D line segment fitting problem to two 2D line segment fitting problems and takes advantage of both images and depth maps. In our method, 3D line segments are fitted incrementally along detected edge segments by minimizing fitting errors on two planes. By clustering the detected line segments, the resulting 3D representation of the scene achieves a good balance between compactness and completeness. Our experimental results show that the 3D line segments generated by our method are highly accurate. As an application, we demonstrate that these line segments greatly improve the quality of 3D surface reconstruction compared to a feature-point based baseline.
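A minimal sketch of reducing a 3D line fit to two 2D total-least-squares fits, as a hypothetical simplification of the incremental method described above (the incremental update and edge-segment tracking are omitted):

```python
import numpy as np

def fit_line_2d(pts):
    """Total-least-squares 2D line fit: returns a point on the line and
    its unit direction (dominant right singular vector)."""
    mean = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - mean)
    return mean, vt[0]

def fit_line_3d_from_two_planes(uv, depth):
    """Illustrative reduction of a 3D fit to two 2D fits: one fit in the
    image plane (u, v) and one in the (arc-length, depth) plane, echoing
    the idea of minimizing errors on two planes.

    uv    : (N, 2) pixel coordinates along a detected edge segment
    depth : (N,) depths of the same points
    """
    p0, d = fit_line_2d(uv)
    t = (uv - p0) @ d                           # 1D position along the 2D line
    q0, e = fit_line_2d(np.column_stack([t, depth]))
    return (p0, d), (q0, e)                     # the two plane fits
```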
|
|
15:00-17:00, Paper WePMP.77 | |
Robust Locality-Constrained Label Consistent KSVD by Joint Sparse Embedding |
Zhang, Zhao | Soochow Univ |
Jiang, Weiming | Soochow Univ. the School of Computer Science and Tech |
Li, Sheng | Nanjing Univ. of Posts and Telecommunications |
Qin, Jie | ETH Zurich |
Liu, Guangcan | Cornell |
Yan, Shuicheng | National Univ. of Singapore |
Keywords: Learning-based vision, Image and video coding, Classification
Abstract: We propose a robust Embedded Locality-Constrained Label Consistent Dictionary Learning (ELC2DL) framework for discriminative classification. ELC2DL improves representation and classification performance by performing DL in a noise-removed sparse embedding space, since most real data contain noise and performing DL over noisy data for reconstruction may potentially decrease performance. To reduce the noise in the data, our model jointly computes a sparse projection for noise reduction and then uses the noise-removed data for DL. By incorporating a noise-reduction term with a discriminative locality-constrained label consistent term, which associates label information with each dictionary atom to preserve the local structure of the training data, a noise-reduction projection, an over-complete dictionary and discriminative sparse codes are obtained jointly. Simulations on several image databases show that our algorithm delivers enhanced performance over other state-of-the-art methods.
|
|
15:00-17:00, Paper WePMP.78 | |
Perceptual Face Completion Using a Local-Global Generative Adversarial Network |
Ma, Ruijun | Sun Yat-Sen Univ |
Hu, Haifeng | Sun Yat-Sen Univ |
Keywords: Mid-level vision, Learning-based vision, Inpainting
Abstract: Face completion is one of the most challenging problems, as the reconstruction algorithm must render the missing pixels with semantically plausible contents. Recent methods have achieved promising advances in photorealistic human face synthesis. However, these approaches are limited when completing highly-structured images because they cannot faithfully preserve identity information during training. In this paper, we propose a Two-Pathway Perceptual Generative Adversarial Network (TPP-GAN) for face completion that learns identity-preserving representations from both the global structure and the local details of a face. We combine a reconstruction network and a perceptual network containing two adversarial pathways (local and global) in our framework to ensure that the features prominent for identity classification are transferred to the occluded parts, encouraging a high identity-preserving quality in the synthesized results. Experimental results demonstrate that our proposed framework not only generates locally semantic as well as globally consistent fragments, but also outperforms existing methods on unaligned faces and on the synthesis of part components.
|
|
15:00-17:00, Paper WePMP.79 | |
Joint Identification-Verification Model for Visual Tracking |
Wu, Min | Air Force Engineering Univ |
Zha, Yufei | Air Force Engineering Univ |
Zhang, Yuanqinag | Air Force Engineering Univ |
Ku, Tao | Air Force Engineering Univ |
Zhang, Lichao | Air Force Engineering Univ |
Chen, Bin | Air Force Engineering Univ |
Keywords: Motion and tracking, Deep learning, Applications of pattern recognition and machine learning
Abstract: Similarity-based tracking algorithms determine the location of the target by the similarity between the template and the candidates: the candidate most similar to the template is considered the target. Most trackers only make use of the intra-class similarity, while the inter-class separability is ignored. In this paper, a joint identification-verification model is proposed to learn similarity together with category attribution for visual tracking. The approach constructs a cost function based on both the inter-class category and the intra-class similarity. Then, the training dataset is fed into the network, and discriminative features are learned in the embedding space. During tracking, the template and candidates are fed into the network simultaneously, and in the learned embedding space the object can be located correctly by the similarity metric between the template and candidates. We evaluate the proposed approach on the OTB50 and UAV123 tracking benchmarks. Extensive experimental results show that the inter-class category effectively increases the discrimination against similar distractors and boosts the performance of trackers based on similarity learning.
|
|
15:00-17:00, Paper WePMP.80 | |
Incremental Kernel Null Foley-Sammon Transform for Person Re-Identification |
Huang, Xinyu | Aviation Univ. of Airforce |
Xu, Jiaolong | Computer Vision Center |
Guo, Gang | Aviation Univ. of Airforce |
Keywords: Object recognition, Online learning, Visual surveillance
Abstract: Person re-identification (Re-ID) is an important technique for video surveillance and security systems. Most existing Re-ID methods assume a fixed size of training data. Given newly collected training data, these models have to be re-trained from scratch with both new and old data, which is time-consuming. Accelerating training with ever-increasing data is therefore desirable and critical for Re-ID. In this work, we propose to apply incremental learning to address this problem. We build the Re-ID model on the null Foley-Sammon transform (NFST) method. The idea is to extract new information from newly-added samples and integrate it with the existing NFST-trained model through an efficient updating scheme. We derive the incremental learning algorithm for both the non-kernelized and kernelized versions of NFST. Extensive experiments have been carried out on three public datasets, including VIPeR, PRID2011 and CUHK01. The results show that our proposed method achieves accuracy comparable to the batch learning method while significantly reducing the computational complexity.
|
|
15:00-17:00, Paper WePMP.81 | |
Simultaneous Context Feature Learning and Hashing for Large Scale Loop Closure Detection |
Fu, Zhiheng | Coll. of Electronic Science, National Univ. of Defense Te |
Guo, Yulan | National Univ. of Defense Tech |
An, Wei | National Univ. of Defense Tech |
Keywords: Vision for robotics, Deep learning, Clustering
Abstract: Visual loop closure is important for pose tracking and relocalization in many robotics and Augmented Reality (AR) systems. For large and highly repetitive environments, sparse keypoint-based methods face several challenges, especially regarding the discriminability of descriptors. In this paper, we propose an augmented descriptor that combines the ORB feature with a context descriptor to increase its discriminability and matching performance. An end-to-end network is adopted to perform simultaneous feature learning and code hashing for the context. In addition, feature position clustering is used to reduce the number of contexts, and hash mapping is adopted to reduce the dimensionality of the ORB features. Finally, the context descriptors and the dimensionality-reduced ORB features are stacked. Experimental results on the NewCollege and TUM datasets demonstrate that our algorithm achieves higher precision/recall and faster speed than the original algorithm proposed by Antonio et al. [1].
|
|
15:00-17:00, Paper WePMP.82 | |
From Text to Video: Exploiting Mid-Level Semantics for Large-Scale Video Classification |
Zhang, Ji | Inst. of Artificial Intelligence and Robotics, Xi'an Jiaoton |
Mei, Kuizhi | Inst. of Artificial Intelligence and Robotics, Xi'an Jiaoton |
Wang, Xiao | Inst. of Artificial Intelligence and Robotics, Xi'an Jiaoton |
Zheng, Yu | Xidian Univ |
Fan, Jianping | Univ. of North Carolina - Charlotte |
Keywords: Video analysis, Classification, Video processing and analysis
Abstract: Automatically classifying large-scale video data is an urgent yet challenging task. To bridge the semantic gap between low-level features and high-level video semantics, we propose a method to represent videos by their mid-level semantics. Inspired by the problem of text classification, we regard the visual objects in videos as the words in documents, and adapt the TF-IDF word weighting method to encode videos by their visual objects. Some extensions of the proposed method are also made according to the characteristics of videos. We integrate the proposed semantic encoding method with the popular two-stream CNN model for video classification. Experiments are conducted on two large-scale video datasets, CCV and ActivityNet. The experimental results validate the effectiveness of our method.
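A minimal sketch of the TF-IDF encoding over detected visual objects, using sklearn and toy counts (the paper's video-specific extensions are not reproduced):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Rows: videos; columns: visual-object "words" (counts of how often each
# detected object class appears across the sampled frames of a video).
counts = np.array([[12, 0, 3, 1],
                   [0,  7, 0, 5],
                   [4,  1, 9, 0]])

# TF-IDF re-weights the counts so objects occurring in every video
# contribute less than distinctive ones, exactly as in text retrieval.
video_repr = TfidfTransformer().fit_transform(counts).toarray()
print(np.round(video_repr, 3))
```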
|
|
15:00-17:00, Paper WePMP.83 | |
Occlusion Handling Human Detection with Refocused Images |
Kataoka, Hirokatsu | National Inst. of Advanced Industrial Science and Tech |
Shuhei, Ohki | AIST, Tsukuba Univ |
Iwata, Kenji | National Inst. of Advanced Industrial Science and Tech |
Satoh, Yutaka | National Inst. of Advanced Industrial Science and Tech |
Keywords: Vision sensors, Object detection, Object recognition
Abstract: The paper presents a novel robust human detection method based on a camera array system, broadening the application range of human detection. Currently, even with a deep neural network (DNN), it is difficult to detect a heavily occluded human. With the camera array system, we consider how to distinctly reveal a human occluded by environmental conditions. The refocused images generated by the camera array system allow us to remove the effect of such noise. Although refocused images have not been utilized in conventional human detection, we believe they are beneficial for improving detection performance, especially under severe conditions. For the experiments, we collected the Refocused Human DataBase (RHDB) with the camera array system. Compared with HOG+SVM on a monocular camera (a near-chance rate of 54.8%), the refocused images yielded a +10.1% improvement (64.9%) by making the human noticeably visible. The combined representation of refocused images and AlexNet achieved 94.6% on the RHDB. Moreover, our final model recorded 98.0% with an attention layer and fine-tuned parameters.
|
|
15:00-17:00, Paper WePMP.84 | |
Fourier Transform Based Features for Clean and Polluted Water Image Classification |
Wu, Xuerong | Nanjing Univ |
Palaiahnakote, Shivakumara | National Univ. of Singapore |
Zhu, Liping | Nanjing Univ |
Zhang, Hualu | NARI GROUP Corp. GRID ELECTRIC POWER Res. Inst |
Shi, Jie | NARI GROUP Corp. GRID ELECTRIC POWER Res. Inst |
Lu, Tong | State Key Lab. for Software Tech. Nanjing Univ |
Pal, Umapada | Indian Statistical Inst |
Blumenstein, Michael | Univ. of Tech. Sydney |
Keywords: Image classification, Applications of computer vision
Abstract: Water image classification is challenging because water images of oceans or rivers share the same properties as images of polluted water containing fungus, waste and rubbish. In this paper, we present a method for classifying clean and polluted water images. The proposed method explores Fourier transform based features for extracting the texture properties of clean and polluted water images. The Fourier spectrum of each input image is divided into several sub-regions based on angle and spatial information. For each region of the spectrum, the proposed method extracts mean and variance features from the intensity values, which results in a feature matrix. The feature matrix is then passed to an SVM classifier for the classification of clean and polluted water images. Experimental results on classes of clean and polluted water images show that the proposed method is effective. Furthermore, a comparative study with the state-of-the-art method shows that the proposed method outperforms the existing method in terms of classification rate, recall, precision and F-measure.
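A minimal numpy sketch of region-wise mean/variance features over the Fourier spectrum; the exact angular/radial partitioning used in the paper is an assumption:

```python
import numpy as np

def spectrum_features(img, n_angles=8, n_radii=4):
    """Mean/variance of the Fourier magnitude spectrum over sub-regions
    defined by angle and radius. `img` is a 2D grayscale array."""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ang = np.arctan2(yy - h / 2.0, xx - w / 2.0)
    rad = np.hypot(yy - h / 2.0, xx - w / 2.0)
    # Quantise every frequency bin into an (angle, radius) sector.
    ai = np.minimum((ang + np.pi) / (2 * np.pi / n_angles), n_angles - 1).astype(int)
    ri = np.minimum(rad / (rad.max() / n_radii), n_radii - 1).astype(int)
    feats = []
    for a in range(n_angles):
        for r in range(n_radii):
            region = mag[(ai == a) & (ri == r)]
            feats += [region.mean(), region.var()]
    return np.array(feats)  # one row of the feature matrix for the SVM

features = spectrum_features(np.random.rand(64, 64))
```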
|
|
15:00-17:00, Paper WePMP.85 | |
A Robust and Efficient Method for License Plate Recognition |
Meng, Ajin | Univ. of Science and Tech. of China |
Yang, Wei | Univ. of Science and Tech. of China |
Xu, Zhenbo | Univ. of Science and Tech. in China |
Huang, Huan | Xingtai Financial Holdings Group Co., Ltd |
Huang, Liusheng | Univ. of Science and Tech. of China |
Ying, Changchun | Xingtai Financial Holdings Group Co., Ltd |
Keywords: Object detection, Object recognition, Image processing and analysis
Abstract: License plate recognition is an essential step in automatic license plate recognition systems, since it is the key technology for recognizing detected license plates. Though there is extensive research on license plate recognition, it is still challenging to recognize license plates under conditions such as large tilt angles, uneven illumination, and distortion. Based on the observation that accurate shape correction can significantly improve recognition accuracy on such images, this paper proposes a robust methodology named LCR for license plate recognition that is free of conventional image analysis operations. The approach is based on three neural networks with three different purposes: (i) predicting the locations of the four vertices; (ii) predicting cutting locations; (iii) character classification. To the best of our knowledge, LCR is the first to address shape correction by designing neural networks that accurately predict the coordinates of license plate vertices. Experiments on over 250,000 unique images show that LCR significantly outperforms several state-of-the-art license plate recognition approaches. Moreover, in our evaluations, the application of shape correction significantly improves the recognition accuracy.
|
|
15:00-17:00, Paper WePMP.86 | |
Object-Adaptive LSTM Network for Visual Tracking |
Du, Yihan | Xiamen Univ |
Yan, Yan | Xiamen Univ |
Chen, Si | Xiamen Univ. of Tech |
Hua, Yang | Queen’s Univ. Belfast |
Wang, Hanzi | Xiamen Univ |
Keywords: Motion and tracking, Video analysis, Deep learning
Abstract: Convolutional Neural Networks (CNNs) have shown outstanding performance in visual object tracking. However, most classification-based tracking methods using CNNs are time-consuming due to the expensive computation of complex online fine-tuning and massive feature extraction. Besides, these methods suffer from over-fitting, since the training and testing stages of CNN models are based on videos from the same domain. Recently, matching-based tracking methods (such as Siamese networks) have shown remarkable speed superiority, but they cannot well handle target appearance variations and complex scenes due to an inherent lack of online adaptability and background information. In this paper, we propose a novel object-adaptive LSTM network, which can effectively exploit sequence dependencies and dynamically adapt to temporal object variations by constructing an intrinsic model of object appearance and motion. In addition, we develop an efficient strategy for proposal selection, where densely sampled proposals are first pre-evaluated using the fast matching-based method and then the well-selected high-quality proposals are fed to the sequence-specific LSTM network. This strategy enables our method to adaptively track an arbitrary object and operate faster than conventional CNN-based classification tracking methods. To the best of our knowledge, this is the first work to apply an LSTM network for classification in visual object tracking. Experimental results on the OTB and TC-128 benchmarks show that the proposed method achieves state-of-the-art performance, which exhibits the great potential of recurrent structures for visual object tracking.
|
|
15:00-17:00, Paper WePMP.87 | |
Context-Aware and Depthwise-Based Detection on Orbit for Remote Sensing Image |
Fu, Yanmei | Inst. of Software Chinese Acad. of Sciences |
Wu, Fengge | Inst. of Software Chinese Acad. of Sciences |
Zhao, Junsuo | Inst. of Software Chinese Acad. of Sciences |
Keywords: Object detection, Image processing and analysis, Deep learning
Abstract: Automatic detection on orbit is an efficient way to filter out useless data before it is downloaded to the ground. However, on-orbit detection is a challenging task due to the limited computational resources on a satellite. In this paper, a context-aware and depthwise-based detection framework for remote sensing images is proposed that can be used on orbit. Because of the limited computational resources on the satellite, on-orbit object detection should run with low memory cost and at fast speed while maintaining accuracy. To keep the model small during feature extraction, depthwise convolutions are applied instead of standard convolutions. In this light, a small deep neural network is built to run on orbit, using the Single Shot Multibox Detector (SSD) as the basic detection module. Motivated by its weak performance on remote sensing images, owing to the few pixels covering each target object, context information about the target object is added to improve performance. To further investigate the influence of the context information, we add a balance factor that trades off the context information against the background noise it brings. Then an experiment on a real remote sensing image dataset is conducted, comparing our extended model with other current state-of-the-art detection models. Results show that our extended model outperforms the other models in accuracy and speed. Deploying the pre-trained model on the Android platform with only 60M of memory confirms the feasibility of on-orbit detection. This detection system is to be verified on the TZ-1 satellite, which will be launched in 2018.
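A minimal PyTorch sketch of the depthwise separable convolution that replaces the standard convolution, illustrating the parameter savings (layer sizes are illustrative, not the paper's network):

```python
import torch
import torch.nn as nn

def depthwise_separable(c_in, c_out, k=3):
    """Per-channel k x k depthwise filter (groups=c_in) followed by a
    1x1 pointwise mix. For c_in = c_out = 64 and k = 3 this needs
    64*9 + 64*64 parameters instead of 64*64*9 for a standard conv,
    roughly an 8x reduction, which is what makes a small on-orbit
    model feasible."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
    )

x = torch.randn(1, 64, 32, 32)
print(depthwise_separable(64, 64)(x).shape)  # torch.Size([1, 64, 32, 32])
```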
|
|
15:00-17:00, Paper WePMP.88 | |
Which Content in a Booklet Is He/she Reading? Reading Content Estimation Using an Indoor Surveillance Camera |
Kawanishi, Yasutomo | Nagoya Univ |
Murase, Hiroshi | Nagoya Univ |
Xu, Jianfeng | KDDI Res. Inc |
Tasaka, Kazuyuki | KDDI Res. Inc |
Yanagihara, Hiromasa | KDDI Res. Inc |
Keywords: Behavior recognition, Visual surveillance, Image classification
Abstract: In this paper, we propose a method for estimating the content someone is reading in a booklet, using an image captured by an indoor surveillance camera. Here, we assume that the reading content can be specified by estimating the following: which booklet, which page of the booklet, and which region of the page. We propose a reading booklet/page estimation method based on image search, and a reading region estimation method focusing on the body pose of the reader. We evaluated the method as a 44-class classification problem consisting of eleven booklet pages with four regions on each page, and achieved an accuracy of 25.6% for reading content estimation.
|
|
15:00-17:00, Paper WePMP.89 | |
Hybrid Sparse Subspace Clustering for Visual Tracking |
Ma, Lin | Samsung Company |
Liu, Zhihua | Samsung |
Keywords: Motion and tracking, Clustering, Object detection
Abstract: In many conditions, object samples are distributed across a number of different subspaces. By segmenting the subspaces with spectral-clustering-based subspace clustering, a more accurate sample distribution is obtained. The LSR (Least Squares Regression) sparse subspace clustering method, which fulfills the EBD (Enforced Block Diagonal) criterion and has a closed-form solution, is an important spectral-clustering-based sparse subspace clustering method. However, LSR uses no discriminative information, which is important for discriminating positive samples from negative samples. Thus, we propose a new hybrid sparse subspace clustering method which makes the clustering discriminative by incorporating the discriminative information provided by graph embedding into LSR. The sub-subspaces obtained with the new subspace clustering method both retain the object distribution information and make the object samples less easily confused with the surrounding environment. Experimental results on a set of challenging videos for visual tracking demonstrate the effectiveness of our method in discriminating the object from the background.
|
|
15:00-17:00, Paper WePMP.90 | |
A Co-Occurrence Background Model with Hypothesis on Degradation Modification for Object Detection in Strong Background Changes |
Zhou, Wenjun | Graduate School of Information Science and Tech. Hokkaido |
Kaneko, Shun'ichi | Hokkaido Univ |
Hashimoto, Manabu | Chukyo Univ |
Satoh, Yutaka | National Inst. of Advanced Industrial Science and Tech |
Liang, Dong | Nanjing Univ. of Aeronautics and Astronautics |
Keywords: Object detection, Video analysis, Low-level vision
Abstract: Object detection has become an indispensable part of video processing, yet current background models are sensitive to background changes. In this paper, we propose a novel background model using an algorithm called Co-occurrence Pixel-Block Pairs (CPB) that is robust against background changes, such as illumination changes and background motion. We utilize the co-occurrence "pixel to block" structure to extract the spatial-temporal information of each pixel to build the background model, and then employ an efficient evaluation strategy, named the correlation-dependent decision function, to identify the current state of each pixel. Furthermore, we introduce a Hypothesis on Degradation Modification (HoD) into the CPB structure to reinforce its robustness. Experimental results on the PETS 2001, AIST-Indoor, SBMnet and CDW-2012 datasets show that our models can detect objects robustly under strong background changes.
|
|
15:00-17:00, Paper WePMP.91 | |
High-Quality and Memory-Efficient Volumetric Integration of Depth Maps Using Plane Priors |
Liu, YangDong | National Lab. of Pattern Recognition, Inst. of Automat |
Gao, Wei | Inst. of Automation, Chinese Acad. of Sciences |
Hu, Zhanyi | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: 3D reconstruction, Image based modeling
Abstract: Volumetric integration is widely used to fuse depth maps in dense 3D reconstruction systems, but a high memory footprint is one of its main disadvantages. We introduce a method to denoise depth maps and reduce memory usage during the volumetric integration of depth maps with the use of plane priors. We develop a new planar region detection method based on depth gradients and then denoise the planar regions of the depth maps. During volumetric integration, we also allocate voxels and integrate depth maps with the use of plane priors. Extensive experiments show that our method saves approximately 30% of the memory footprint and achieves higher reconstruction quality compared with some current state-of-the-art systems. These characteristics enable our method to be used for 3D scanning on mobile devices, which have limited memory resources.
|
|
15:00-17:00, Paper WePMP.92 | |
Appearance Variation Insensitive State Regression for Visual Tracking |
Ma, Lin | Samsung Company |
Liu, Zhihua | Samsung |
Keywords: Motion and tracking, Regression, Object detection
Abstract: In visual tracking, many methods first sample a set of candidate states and then select the optimal state with the best evaluation value. In this way, tracking avoids becoming trapped in local optima. However, the obtained state is not accurate when the appearance undergoes large challenges or the sample number is small, while the prediction information provided by the surrounding candidates is useful for improving the robustness of state determination. Thus, in this paper we propose a new object localization method which infers the object state by regression over the surrounding states. By acquiring the state weights according to two constraints, i.e., a constraint representing the confidence of a single state and a constraint that similar states should have similar weights, the sensitivity to appearance variation in state regression is reduced. Experimental results on a set of benchmark videos demonstrate the robustness of the proposed method.
|
|
15:00-17:00, Paper WePMP.93 | |
Online Temporal Calibration of Camera and IMU Using Nonlinear Optimization |
Liu, Jinxu | Inst. of Automation, Chinese Acad. of Sciences |
Gao, Wei | Inst. of Automation, Chinese Acad. of Sciences |
Hu, Zhanyi | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Vision for robotics, Motion and tracking
Abstract: In this paper, we aim to calibrate the time delay between the timestamps of camera frames and IMU measurements on Android smartphones or other low-cost devices whose camera and IMU are not temporally aligned. The time delay is estimated online in an iterative way through nonlinear optimization over sliding windows. We add new terms that depend on the time delay to the pre-integration results of the IMU measurements rather than to the feature observations, in order to improve the precision of the temporal calibration. The experimental results indicate that our calibration result is closer to the real value than that of the state-of-the-art system and that our method converges faster. With our temporal calibration, the visual-inertial odometry algorithm is less likely to suffer from fast turns or sudden stops.
|
|
15:00-17:00, Paper WePMP.94 | |
Local Regression Based Hourglass Network for Hand Pose Estimation from a Single Depth Image |
Li, Jia | Univ. of Science and Tech. of China |
Wang, Zengfu | Univ. of Science and Tech. of China |
Keywords: Mid-level vision, Deep learning, Pattern recognition for human computer interaction
Abstract: Hand pose estimation plays an important role in many applications such as human-computer interaction. With the advent of commodity depth sensors and developments in deep learning, noticeable improvements have been made in this field recently. Nevertheless, the accuracy and robustness of existing approaches are still unsatisfying. In this paper, we propose an end-to-end local regression based hourglass network with a modified loss function to estimate the 3D pose of the hand in a depth image. We use a third-order hourglass block to extract features of the hand. At the top of our network, we slice the feature map into several regions and first regress each region independently. Then we merge the regression results and feed them to the final regressor. Besides, we compare the performance of different loss functions for the task. The results indicate that the structure of the network and the loss function designed here lead to an obvious improvement, and the proposed approach is comparable or superior to the state-of-the-art on a challenging public dataset. Our system runs at over 910 FPS on a single GPU, and the mean estimation error is reduced to 12.36 mm.
|
|
15:00-17:00, Paper WePMP.95 | |
Action Recognition Method Based on Sets of Time Warped ARMA Models |
Sogi, Naoya | Univ. of Tsukuba |
Fukui, Kazuhiro | Univ. of Tsukuba |
Keywords: Behavior recognition, Classification, Sequence modeling
Abstract: In this paper, we propose a novel method for recognizing human actions from sequential body skeleton data. Our method is based on the ARMA (Autoregressive Moving Average) model, which is constructed from the matrix of 3D joint position time-series. The intrinsic structure of an action can be compactly summarized by the observability matrix of the ARMA model. Since the column vectors of an observability matrix span a subspace, given two ARMA models we can measure the similarity between them by the canonical angles between the corresponding subspaces. This framework based on subspace representation is useful for action recognition. However, it does not work well when handling various actions with different speeds, since the optimal row size of each observability matrix depends on the action speed. To address this limitation, we randomly sample the row elements of each observability matrix while preserving the order of the elements. By repeating this operation, we generate a set of time-warped ARMA models with various local motion speeds. The essence of this idea is that the whole set of such time-warped ARMA models is invariant to changes in action speed. Furthermore, to construct an effective classification framework, we apply Grassmann discriminant analysis to the time-warped ARMA models. The effectiveness of the proposed method is demonstrated through comparison experiments with state-of-the-art methods on two public datasets: the MSR 3D action dataset and the UT-Kinect dataset.
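A minimal numpy sketch of the subspace similarity via canonical angles (matrix sizes are illustrative; the ARMA construction and the random-sampling step for time warping are omitted):

```python
import numpy as np

def subspace_similarity(A, B):
    """Similarity between the column spans of two observability matrices
    via canonical angles: orthonormalise each span, then the singular
    values of Q_a^T Q_b are the cosines of the canonical angles."""
    qa, _ = np.linalg.qr(A)
    qb, _ = np.linalg.qr(B)
    cos = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.mean(np.clip(cos, 0.0, 1.0) ** 2)  # mean squared cosine

# Observability-style matrices stacked from 3D joint time-series
# (dimensions are illustrative).
A = np.random.rand(60, 5)
B = np.random.rand(60, 5)
print(subspace_similarity(A, B))
```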
|
|
15:00-17:00, Paper WePMP.96 | |
Multi-Spectral Fusion and Denoising of RGB and NIR Images Using Multi-Scale Wavelet Analysis |
Jung, Cheolkon | Xidian Univ |
Su, Haonan | Xidian Univ |
Keywords: Computational photography, Enhancement, restoration and filtering, Image processing and analysis
Abstract: In this paper, we propose multi-spectral fusion and denoising (MFD) of RGB and NIR images using multi-scale wavelet analysis. We formulate MFD of RGB and NIR images as a maximum a posteriori (MAP) estimation problem in the wavelet domain. The direct fusion of noisy RGB and NIR images often leads to contrast attenuation due to the discrepancy between the two modalities. Thus, we generate a wavelet scale map for fusion and denoising based on the correlation between NIR and RGB wavelet coefficients. To account for the local contrast and visibility of the NIR data on the RGB components, we introduce a contrast preservation term for scale map estimation based on local contrast and visibility. We use a regularization term to select NIR wavelet coefficients of high visibility and contrast in the scale map. Since noise generally appears in the high frequency band, we use the gradients of the NIR wavelet coefficients as weights for weighted least squares (WLS) smoothing of the scale map. Based on the wavelet scale map, we perform fusion and denoising of the RGB and NIR wavelet coefficients. Experimental results show that the proposed method successfully performs fusion of RGB and NIR images with noise reduction and detail preservation, and outperforms the state-of-the-art in terms of discrete entropy (DE) and the feature-based blind image quality evaluator (FBIQE).
|
|
15:00-17:00, Paper WePMP.97 | |
Visual Localization in Changing Environments Using Place Recognition Techniques |
Xin, Zhe | Inst. of Automation, Chinese Acad. of Sciences |
Cai, Yinghao | Chinese Acad. of Sciences |
Cai, Shaojun | UISEE Tech. Beijing Co., Ltd |
Zhang, Jixiang | Inst. of Automation, Chinese Acad. of Sciences |
Yang, Yiping | Inst. of Automation, Chinese Acad. of Sciences |
Wang, Yanqing | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Vision for robotics, Applications of computer vision, Neural networks
Abstract: This paper proposes a visual localization system combining Convolutional Neural Networks (CNNs) and sparse point features to estimate the 6-DOF pose of a robot. The challenge of visual localization across time lies in the fact that the same place captured at different times appears dramatically different due to varying illumination and weather conditions, viewpoint variations and dynamic objects. In this paper, a novel CNN-based place recognition approach is proposed which requires no time-consuming feature generation process and no task-specific training. Moreover, we demonstrate that the rich semantic context information obtained from place recognition can greatly improve the subsequent feature matching process for pose estimation. The semantic constraint performs much better than traditional Bag-of-Words based methods at establishing correspondences between the query image and the map. To evaluate the robustness of the algorithm, the proposed system is integrated into ORB-SLAM2 and verified on data collected under various illumination and weather conditions. Extensive experimental results show that even with weak ORB descriptors, the proposed system can significantly improve the success rate of localization under severe appearance changes.
|
|
15:00-17:00, Paper WePMP.98 | |
Probabilistic Voting for Sequence Based Visual Place Recognition |
Xin, Zhe | Inst. of Automation, Chinese Acad. of Sciences |
Cai, Yinghao | Chinese Acad. of Sciences |
Zhang, Jixiang | Inst. of Automation, Chinese Acad. of Sciences |
Yang, Yiping | Inst. of Automation, Chinese Acad. of Sciences |
Wang, Yanqing | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Vision for robotics, Applications of computer vision
Abstract: Visual place recognition is the task of recognizing the place of a query image within a set of dataset images. It is a challenging problem in computer vision due to frequent and unpredictable environmental changes. In this paper, a novel approach is proposed that treats visual place recognition as a probabilistic voting problem over coherent image sequences. According to the co-visibility relationships of images in the dataset, each query can be represented by a categorical variable; the whole sequence is therefore a collection of independent, non-identically distributed categorical variables. Introducing the probabilistic framework not only removes the need for heuristic parameters but also recognizes locations efficiently and effectively. Two widely used datasets are used to evaluate the performance of the proposed method. The probabilistic voting algorithm achieves superior performance compared with state-of-the-art methods and satisfies real-time requirements.
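A minimal sketch of combining per-frame categorical distributions over a sequence by voting (toy numbers; the co-visibility construction and the paper's exact combination rule are omitted):

```python
import numpy as np

def sequence_vote(frame_probs):
    """Combine per-frame categorical place distributions over a sequence
    by summing log-probabilities (independent, non-identical categorical
    variables), then pick the most probable place."""
    log_p = np.log(np.asarray(frame_probs) + 1e-12)
    return int(np.argmax(log_p.sum(axis=0)))

# Three frames voting over four candidate places (toy numbers).
votes = [[0.5, 0.2, 0.2, 0.1],
         [0.4, 0.3, 0.2, 0.1],
         [0.6, 0.1, 0.2, 0.1]]
print(sequence_vote(votes))  # 0
```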
|
|
15:00-17:00, Paper WePMP.99 | |
Convolutional Features-Based CRF Graph Matching for Tracking of Densely Packed Cells |
Qian, Weili | Hunan Univ |
Wei, Yangliu | Hunan Univ |
Wang, Xueping | Hunan Univ |
Liu, Min | Hunan Univ |
Keywords: Motion and tracking, Biological image and signal analysis, Deep learning
Abstract: The tracking of plant cells across large-scale microscopy image sequences is very challenging, because plant cells are densely packed in a specific honeycomb structure, and the microscopy images can be randomly translated, rotated and scaled during imaging. This paper proposes a convolutional-features-based conditional random field (CRF) graph matching method to track plant cells in unregistered image sequences, exploiting deep features extracted by deep convolutional neural networks and the tight spatial topology of neighboring cells as contextual information. Because the extracted convolutional features and spatial topology features are resilient to image translation, rotation and scaling, the proposed CRF matching approach is able to track plant cells across unregistered image sequences. Compared with other plant cell tracking methods, the experimental results show that the proposed method improves the tracking accuracy rate by about 30% on unregistered cell image sequences.
|
|
15:00-17:00, Paper WePMP.100 | |
SPCNet: Scale Position Correlation Network for End-To-End Visual Tracking |
Wang, Qiang | Inst. of Automation, Chinese Acad. of Sciences, Beijing, C |
Gao, Jin | Inst. of Automation Chinese Acad. of Sciences |
Zhang, Mengdan | Chinese Acad. of Sciences |
Xing, Junliang | Inst. of Automation, Chinese Acad. of Sciences |
Hu, Weiming | National Lab. of Pattern Recognition, Inst |
Keywords: Motion and tracking
Abstract: We present a novel Scale Position Correlation Network (SPCNet) for learning to track objects robustly and efficiently. Different from most previous Correlation Filter (CF) based tracking models, SPCNet unifies feature representation learning and CF based appearance modeling within one end-to-end learnable framework. In particular, SPCNet learns to track objects within a joint scale-position space and is very effective in learning features for the accurate prediction of object scale and position. To learn our model end to end, SPCNet introduces a differentiable correlation filter layer into a Siamese architecture. Therefore, the localization error can be effectively back-propagated through the whole network, enabling fast adaptation of feature learning and appearance modeling to the objects being tracked. Such task-driven feature learning admits a very lightweight design that can be efficiently pre-trained. In addition, the dense appearance modeling in the joint scale-position space is also efficient, benefiting from the computation of gradients within the Fourier frequency domain. Such careful architecture design ensures that SPCNet is effective and efficient with a small model size. Extensive experimental analyses and evaluations on three large benchmarks, OTB-2013, OTB-2015, and VOT2015, demonstrate its superiority over many state-of-the-art algorithms.
|
|
15:00-17:00, Paper WePMP.101 | |
Online Multi-Target Tracking with Tensor-Based High-Order Graph Matching |
Zhou, Zongwei | Inst. of Automation, Chinese Acad. of Sciences |
Xing, Junliang | Inst. of Automation, Chinese Acad. of Sciences |
Zhang, Mengdan | Chinese Acad. of Sciences |
Hu, Weiming | National Lab. of Pattern Recognition, Inst |
Keywords: Motion and tracking, Applications of computer vision
Abstract: In this paper we formulate multi-target tracking (MTT) as a high-order graph matching problem and propose an l1-norm tensor power iteration solution. Concretely, the search for trajectory-observation correspondences in the MTT task is cast as a hypergraph matching problem that maximizes a multilinear objective function over all permutations of the associations. This function is defined by a tensor representing the affinity between association tuples, in which pair-wise similarities, motion consistency and spatial structural information can be embedded conveniently. To solve the matching problem, a dual-direction unit l1-norm constrained tensor power iteration algorithm is proposed. Additionally, since measuring appearance affinity with features extracted from a rectangular patch, as most methods do, discriminates poorly when bounding boxes overlap heavily, we present a deep pair-wise appearance similarity metric based on object masks, in which only features from the true target region are utilized. Experimental evaluation shows that our approach achieves accuracy comparable to state-of-the-art online trackers. Our code will be made available soon.
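A hedged sketch of a unit l1-norm constrained tensor power iteration over a third-order affinity tensor is given below; the dense tensor, iteration count and simplex projection are simplifying assumptions, not the paper's dual-direction algorithm:

```python
import numpy as np

def l1_tensor_power_iteration(T, iters=50):
    """T: (n, n, n) affinity tensor over triples of association pairs;
    returns soft assignment scores over the n candidate pairs."""
    n = T.shape[0]
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Multilinear product: v_i <- sum_jk T[i, j, k] * v_j * v_k
        v = np.einsum('ijk,j,k->i', T, v, v)
        v = np.maximum(v, 0)
        v /= v.sum() + 1e-12  # project back onto the unit l1 ball (simplex)
    return v
```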
|
|
15:00-17:00, Paper WePMP.102 | |
Dense Receptive Field for Object Detection |
Yao, Yongqiang | Beijing Univ. of Posts and Telecommunications |
Dong, Yuan | Beijing Univ. of Posts and Telecommunications |
Huang, Zesang | Beijing Univ. of Posts and Telecommunications |
Bai, Hongliang | Beijing Faceall Co.Ltd |
Keywords: Object detection, Deep learning, Neural networks
Abstract: Current one-stage single-shot detectors, such as DSSD and StairNet, which aggregate context information from multiple scales, have shown promising accuracy. However, existing multi-scale context fusion techniques are insufficient for detecting objects of different scales. In this paper, we investigate how to detect objects of different scales with a good accuracy-vs-speed trade-off. We propose a novel single-shot detector, called DRFNet, which fuses feature maps with different receptive field sizes to boost detection accuracy. Our final DRFNet detector effectively unifies comprehensive context information from various receptive fields, enabling it to detect objects of different sizes with higher accuracy. Experimental results on the PASCAL VOC 2007 benchmark (79.6% mAP, 68 FPS) demonstrate that DRFNet outperforms other state-of-the-art one-stage detectors with FPN-like feature fusion. Code will be made publicly available soon.
|
|
15:00-17:00, Paper WePMP.103 | |
Online Learning of Spatial-Temporal Convolution Response for Robust Real-Time Tracking |
Zhou, Jinglin | People's Public Security Univ. of China |
Wang, Rong | People's Public Security Univ. of China |
Ding, Jianwei | People's Public Security Univ. of China |
Keywords: Motion and tracking, Occlusion and shadow detection
Abstract: The challenges of generic visual tracking have attracted great attention. However, it is still difficult for most existing trackers to track objects accurately in real time. We propose a framework that integrates a verifying mechanism and a correcting mechanism to improve the accuracy of real-time tracking. Under online learning, the target location and the sample model are updated in parallel. Validation is carried out in every frame according to the spatial-temporal convolution response. Furthermore, the correcting mechanism is activated when the current tracking result is considered unreliable. Synchronously, an online target model updating strategy is constructed to filter out contributive samples, so that the sample model is updated confidently. The proposed tracker is evaluated on four popular benchmarks, achieving state-of-the-art performance while running at real-time speed.
|
|
15:00-17:00, Paper WePMP.104 | |
Partial Descriptor Update and Isolated Point Avoidance Based Template Update for High Frame Rate and Ultra-Low Delay Deformation Matching |
Xu, Yuhao | Waseda Univ |
Hu, Tingting | Waseda Univ |
Du, Songlin | Waseda Univ |
Ikenaga, Takeshi | Waseda Univ. Japan |
Keywords: Motion and tracking, Video processing and analysis, Applications of computer vision
Abstract: High frame rate and ultra-low delay matching systems play an important role in various human-machine interactive applications, which demand better performance in matching deformable and out-of-plane rotating objects. Although many algorithms have been proposed for deformation tracking and matching, few of them are suitable for hardware implementation due to complicated operations and large time consumption. This paper proposes a hardware-oriented template update method for a high frame rate and ultra-low delay deformation matching system. In the proposed method, the new template is generated in real time by partially updating the template descriptor and adding new keypoints simultaneously with the pixel-wise matching process, and incorrect boundary points are avoided when judged to be isolated by distance-reachability, solving the problem of template drift. Evaluation results indicate that the proposed method supports real-time processing of a 784 fps, 640×480 resolution system on a field-programmable gate array (FPGA) with a delay of 0.808 ms/frame, and achieves satisfactory deformation matching results in comparison with other general methods.
|
|
15:00-17:00, Paper WePMP.105 | |
Human Routine Change Detection Using Bayesian Modelling |
Xu, Yangdi | Univ. of Bristol |
Damen, Dima | Univ. of Bristol |
Keywords: Video analysis, Behavior recognition, Applications of pattern recognition and machine learning
Abstract: Automatic discovery of changes in a human’s routine is one of the requirements for the future of smart home living, and its contribution to the E-health of the community. In this paper, a Bayesian modelling approach is used which models routine change discovery as a pairwise model selection problem. The method is evaluated on a collected office kitchen dataset that captures snapshots of the routine of the same person over multiple years (2014-2017). The results show that our method is able to detect not only the presence of routine changes, but also which activity patterns have been changed, fully automatically, and in a fully unsupervised manner. Moreover, changes within the same activity pattern can be discovered. Interestingly, discovered changes demonstrate subtle variations that are missed by the visual inspection of a human observer.
|
|
15:00-17:00, Paper WePMP.106 | |
Attention-Based Neural Network for Traffic Sign Detection |
Zhang, Jing | Nanjing Univ. of Science and Tech |
Hui, Le | Nanjing Univ. of Science and Tech |
Lu, Jianfeng | Nanjing Univ. of Science & Tech |
Zhu, Yuhua | Nanjing Univ. of Science and Tech |
Keywords: Object detection, Neural networks, Deep learning
Abstract: Existing object detection pipelines show superior performance for large, high-resolution objects but fail to detect very small objects such as traffic signs, so traffic sign detection is a notoriously challenging problem. In this paper, we propose a novel end-to-end architecture that improves small object detection by combining Faster R-CNN with an attention mechanism. Specifically, we focus on channel-wise features and utilize the attention mechanism to enhance the feature responses by explicitly modeling the interdependencies between channels. Finally, the regression of bounding boxes and the classification of traffic signs are performed after selecting discriminative features with the attention mechanism. Extensive evaluations demonstrate that the attention mechanism improves detection performance, especially for small targets. For the traffic sign detection task, our method achieves better performance than many state-of-the-art approaches on the largest traffic sign detection dataset, Tsinghua-Tencent 100K.
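For readers unfamiliar with channel-wise attention, the following is a minimal squeeze-and-excitation style block in PyTorch; it is one plausible form of the mechanism described, and the exact module in the paper may differ:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze: global average pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # excite: reweight channels
```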
|
|
15:00-17:00, Paper WePMP.107 | |
Depth-Assisted RefineNet for Indoor Semantic Segmentation |
Chang, Manyu | Xiamen Univ |
Guo, Feng | Xiamen Univ |
Ji, Rongrong | Department of Computer Science, Xiamen Univ |
Keywords: Scene understanding, Deep learning
Abstract: This paper focuses on indoor semantic segmentation using RGB-D data. It has been shown that incorporating depth information alongside RGB information helps improve segmentation accuracy. However, previous studies reveal two problems. One concerns model size: recent state-of-the-art methods generally build a separate network branch for depth images, inherently increasing the model size. The other concerns boundary segmentation: complex and varied object configurations with severe occlusions reduce the segmentation precision at object boundaries. To address these two problems, we propose a depth-assisted RefineNet (D-RefineNet) for refining boundary segmentation. The proposed network uses only RGB images to predict segmentation results; depth images are used only in the proposed loss function, without increasing the model size. When the depth values of adjacent pixels change drastically but the adjacent pixels have the same predicted semantic labels, the proposed loss function penalizes the predicted result. Experimental evaluations demonstrate that the proposed method is effective on two challenging RGB-D indoor datasets, NYUDv2 and SUN RGB-D.
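A speculative sketch of the depth-boundary penalty idea described above: it penalizes predictions that keep the same label across a strong depth discontinuity. The threshold tau and the soft label-agreement form are our assumptions, not the paper's definition:

```python
import torch

def depth_boundary_penalty(logits, depth, tau=0.1):
    """logits: (B, C, H, W) segmentation scores; depth: (B, 1, H, W),
    normalized depth. Horizontal neighbours only, for brevity."""
    p = logits.softmax(dim=1)
    # Probability that horizontally adjacent pixels share the same label
    agree = (p[:, :, :, 1:] * p[:, :, :, :-1]).sum(dim=1)
    depth_jump = (depth[:, 0, :, 1:] - depth[:, 0, :, :-1]).abs()
    # Penalize label agreement across strong depth discontinuities
    return (agree * (depth_jump > tau).float()).mean()
```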
|
|
15:00-17:00, Paper WePMP.108 | |
Robust Projective Low-Rank and Sparse Representation by Robust Dictionary Learning |
Ren, JiaHuan | Soochow Univ |
Zhang, Zhao | Soochow Univ |
Li, Sheng | Nanjing Univ. of Posts and Telecommunications |
Liu, Guangcan | Cornell |
Wang, Meng | Microsoft Res. Asia |
Yan, Shuicheng | National Univ. of Singapore |
Keywords: Learning-based vision, Image and video coding, Classification
Abstract: In this paper, we study the robust-factorization-based robust dictionary learning problem for data representation. A Robust Projective Low-Rank and Sparse Representation model (R-PLSR) is proposed. Our R-PLSR model integrates L1-norm based robust factorization and robust low-rank and sparse representation by robust dictionary learning into a unified framework. Specifically, R-PLSR performs joint low-rank and sparse representation over the informative low-dimensional representations obtained by robust sparse factorization, so that the results are more accurate. To make the factorization and representation procedures robust to noise and outliers, R-PLSR imposes the sparse L2,1-norm jointly on the reconstruction errors of the factorization and the dictionary learning. Note that the L2,1-norm also minimizes the reconstruction error as much as possible, since it theoretically tends to force many rows of the reconstruction error matrix to zero. The Nuclear-norm and L1-norm are jointly imposed on the representation coefficients so that salient representations can be obtained. Extensive results on several image datasets show that our R-PLSR formulation delivers superior performance over other state-of-the-art methods.
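For reference, the L2,1-norm used above is the sum of the l2 norms of a matrix's rows, which is why it drives entire rows of the reconstruction error matrix toward zero; a short NumPy version:

```python
import numpy as np

def l21_norm(E):
    # Sum over rows of each row's Euclidean norm: ||E||_{2,1}
    return np.sqrt((E ** 2).sum(axis=1)).sum()

E = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
print(l21_norm(E))  # 5.0 + 0.0 + 1.0 = 6.0
```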
|
|
15:00-17:00, Paper WePMP.109 | |
A Multi-Part Convolutional Attention Network for Fine-Grained Image Recognition |
Zhong, Weilin | Shanghai Jiao Tong Univ |
Jiang, Linfeng | Shanghai Jiao Tong Univ |
Zhang, Tao | Shanghai Jiao Tong Univ |
Ji, Jinsheng | Shanghai Jiao Tong Univ |
Xiong, Huilin | Shanghai Jiao Tong Univ |
Keywords: Applications of computer vision, Image classification, Deep learning
Abstract: The goal of fine-grained image recognition is to recognize hundreds of sub-categories affiliated with the same basic-level category (e.g., bird species). It is a highly challenging task due to the large intra-class variance and small inter-class variance. Existing approaches deal with the subtle differences among object classes by learning and localizing discriminative parts. However, most part localization methods follow a step-by-step manner that first localizes larger parts and then generates smaller parts from the larger ones, which is not efficient. In this paper, we present a Multi-part Convolutional Attention Network (M-CAN), which simultaneously focuses on discriminative image parts at multiple scales. Specifically, a convolutional attention based part localization network is presented to localize multi-scale parts from different layers of a deep Convolutional Neural Network (CNN). Importantly, our part localization network requires no part annotations but only image labels, which avoids the heavy labor of complex part labeling. Comprehensive experiments show that our method outperforms the state-of-the-art approaches on three challenging fine-grained datasets: CUB-Birds, Stanford-Dogs and Stanford-Cars.
|
|
15:00-17:00, Paper WePMP.110 | |
Improving Image Classification Performance with Automatically Hierarchical Label Clustering |
Chen, Zhiqiang | Inst. of Automation, Chinese Acad. of Sciences |
Du, Changde | Inst. of Automation, Chinese Acad. of Sciences |
Huang, Lijie | Inst. of Automation, Chinese Acad. of Sciences |
Li, Dan | Inst. of Automation, Chinese Acad. of Sciences |
He, Huiguang | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: Image classification, Deep learning, Clustering
Abstract: Image classification is a common and foundational problem in computer vision. In traditional image classification, each category is assigned a single label, which makes it difficult for networks to learn better features. In contrast, hierarchical labels depict the structure of categories better, which helps networks learn more hierarchical features and improves classification performance. Though many datasets contain images with multiple labels, the labels in these datasets usually lack hierarchy. To overcome this problem, we propose a new method to improve image classification performance with Automatically Hierarchical Label Clustering (AHLC). Firstly, AHLC calculates the similarity between each pair of original categories by how easily they are misclassified by a pre-trained classifier. Secondly, AHLC obtains hierarchical labels by merging similar categories using hierarchical clustering. Finally, AHLC trains a new classifier with hierarchical labels to improve on the original classification performance. We evaluate our method on the MNIST and CIFAR-100 datasets and the results demonstrate its superiority. The main contribution of this work is that an existing classification network can be improved by AHLC without extra information or heavy architecture redesign.
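A rough sketch of the AHLC pipeline using SciPy: derive a category similarity from a pre-trained classifier's confusion matrix, then merge confusable categories by agglomerative clustering. The linkage method and symmetrization are our assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_labels(confusion, n_super):
    """confusion: (C, C) counts from a pre-trained classifier;
    returns a super-label in 1..n_super for each of the C classes."""
    C = confusion / confusion.sum(axis=1, keepdims=True)
    sim = (C + C.T) / 2                      # symmetric misclassification rate
    np.fill_diagonal(sim, 0)
    dist = 1.0 - sim / (sim.max() + 1e-12)   # more confusable -> closer
    iu = np.triu_indices_from(dist, k=1)
    Z = linkage(dist[iu], method='average')  # condensed distances -> dendrogram
    return fcluster(Z, t=n_super, criterion='maxclust')
```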
|
|
15:00-17:00, Paper WePMP.111 | |
MSFD: Multi-Scale Receptive Field Face Detector |
Guo, Qiushan | Beijing Univ. of Posts and Telecommunications |
Dong, Yuan | Beijing Univ. of Posts and Telecommunications |
Guo, Yu | Beijing Univ. of Posts and Telecommunications |
Bai, Hongliang | Beijing Faceall Co.Ltd |
Keywords: Object detection, Deep learning, Neural networks
Abstract: We study the multi-scale receptive fields of a single convolutional neural network for detecting faces of varied scales. This paper presents our Multi-Scale Receptive Field Face Detector (MSFD), which has superior performance in detecting faces at different scales and enjoys real-time inference speed. MSFD agglomerates context and texture through a hierarchical structure. The additional information and rich receptive fields bring significant improvement at marginal extra time cost. We also propose an anchor assignment strategy that covers faces with a wide range of scales, improving the recall rate for small and rotated faces. To reduce the false positive rate, we train our detector with the focal loss, which keeps easy samples from overwhelming training. As a result, MSFD reaches superior results on the FDDB, Pascal-Faces and WIDER FACE datasets, and can run at 31 FPS on a GPU for VGA-resolution images.
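The focal loss mentioned above (Lin et al.) down-weights easy examples so they do not dominate training; a standard binary form in PyTorch, not necessarily MSFD's exact variant:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits/targets: same shape; targets in {0, 1}."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = p * targets + (1 - p) * (1 - targets)       # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma vanishes for well-classified (easy) examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```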
|
|
15:00-17:00, Paper WePMP.112 | |
Generic Calibration of Cameras with Non-Parallel Optical Elements |
Fasogbon, Peter | Nokia Tech |
Fan, Lixin | Nokia Tech |
Keywords: 3D vision
Abstract: Multiple cameras are increasingly prevalent in autonomous driving, and the increased need for 360-degree perception of the world has led to combinations of cameras with narrow-angle and wide-angle fields of view. This has raised issues with the quality of these optics as a result of bad and cheap designs. Intrinsic calibration is indispensable for accurate perception of the environment and for downstream tasks such as camera pose estimation and 3-D reconstruction. In this work, we propose a lens distortion model motivated by unintentional tilt in the optical lens system. The proposed distortion model is added to the state-of-the-art generic camera model of Kannala to form an extended model. To our knowledge, this is the first time that tilt distortion has been introduced for wide-angle and fish-eye cameras. Our experiments show improvements of 4 to 13 percent.
|
|
15:00-17:00, Paper WePMP.113 | |
Em-SLAM: A Fast and Robust Monocular SLAM Method for Embedded Systems |
Wu, Yirui | Hohai Univ |
Li, Zhi-Kai | National Key Lab for Novel Software Tech. Nanjing Univ |
Palaiahnakote, Shivakumara | National Univ. of Singapore |
Lu, Tong | State Key Lab. for Software Tech. Nanjing Univ |
Keywords: Vision for robotics, Motion and tracking, Applications of computer vision
Abstract: Simultaneous Localization and Mapping (SLAM) is difficult to deploy on embedded systems due to its high computational cost and its requirement for stable input. Building on excellent algorithms of recent years, we present Em-SLAM, a monocular SLAM method that is fast and robust on embedded systems. Em-SLAM comprises three stages: initial pose estimation, iterative pose optimization and correspondence estimation, and mapping with a nearest-frame queue. In the first stage, we perform stable initial pose estimation based on matched ORB features extracted around selected key points. Taking the initial pose and corresponding key points as input, the second stage iteratively optimizes these values by tracking key points in new frames. In the last stage, we first determine keyframes with the help of the proposed nearest-frame queue and then design a greedy search algorithm to find matched ORB features between keyframes, which are adopted for compact and robust map reconstruction. Owing to these designs for embedded systems, Em-SLAM is highly accurate and fast on all SLAM tasks: tracking, mapping and loop closing. We evaluate Em-SLAM on the most popular datasets by comparing it with a recent SLAM method.
|
|
15:00-17:00, Paper WePMP.114 | |
Non-Negative Subspace Representation Learning Scheme for Correlation Filter Based Tracking |
Xu, Tianyang | Jiangnan Univ |
Wu, Xiaojun | Jiangnan Univ |
Kittler, Josef | Univ. of Surrey |
Keywords: Motion and tracking
Abstract: Discriminative correlation filter (DCF) based tracking methods have achieved great success recently. However, the temporal learning scheme in the current paradigm is a linear recursion with a fixed learning rate, which cannot adaptively respond to appearance variations. In this paper, we propose a unified non-negative subspace representation constrained learning scheme for DCF. The subspace is constructed from several templates with auxiliary memory mechanisms. The current template is then projected onto the subspace to find the non-negative representation and to determine the corresponding template weights. Our learning scheme enables an efficient combination of the correlation filter and the subspace structure. Experimental results on OTB50 demonstrate the effectiveness of our learning formulation.
|
|
15:00-17:00, Paper WePMP.115 | |
Radial Lens Distortion Correction by Adding a Weight Layer with Inverted Foveal Models to Convolutional Neural Networks |
Shi, Yongjie | Peking Univ |
Zhang, Danfeng | Peking Univ |
Wen, Jingsi | Peking Univ |
Tong, Xin | Peking Univ |
Ying, Xianghua | Peking Univ |
Zha, Hongbin | Peking Univ |
Keywords: Low-level vision, Image based modeling
Abstract: Radial lens distortion often exists in images taken by commercial cameras, which do not satisfy the assumptions of the pinhole camera model. Eliminating the radial lens distortion of an image is a necessary preprocessing step for many vision applications. Some papers have employed Convolutional Neural Networks (CNNs) to achieve radial distortion correction. They generated a large number of training images with high variation of radial distortion, which can be well exploited by deep CNNs with high learning capacity, and reached state-of-the-art results. In this paper, we show that a weight layer with inverted foveal models can be added to these existing CNN methods for radial distortion correction. With the widely used deep Resnet-18 model, our method achieves about a 20 percent decrease in the loss function with faster convergence compared to previous methods.
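One plausible reading of an inverted foveal weight layer, sketched below, is a per-pixel weight map that grows toward the image periphery, where radial distortion is strongest; the weighting function and its parameter are purely assumptions for illustration:

```python
import torch

def inverted_foveal_weights(h, w, alpha=2.0):
    """Return an (h, w) weight map that increases with radial distance
    from the image center (the opposite of foveal emphasis)."""
    ys = torch.linspace(-1, 1, h).view(-1, 1).expand(h, w)
    xs = torch.linspace(-1, 1, w).view(1, -1).expand(h, w)
    r = torch.sqrt(xs ** 2 + ys ** 2) / (2 ** 0.5)  # normalized radius in [0, 1]
    return 1.0 + alpha * r                          # heavier weight at the edges
```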
|
|
15:00-17:00, Paper WePMP.116 | |
Semi-Supervised Learning Via Convolutional Neural Network for Hyperspectral Image Classification |
Ling, Zhigang | Hunan Univ |
Li, Xiuxin | Hunan Univ |
Zou, Wen | Hunan Univ |
Siyu, Guo | Hunan Univ |
Keywords: Image classification, Deep learning, Semi-supervised learning
Abstract: In order to make use of unlabeled data in hyperspectral images (HSIs), a simple but effective semi-supervised learning method based on a convolutional neural network (CNN) is proposed for HSI classification. First, we define a loss function by integrating a clustering loss for unlabeled data with a softmax loss for labeled data. Here, the labeled features extracted from the CNN are not only used to train the classifier but also provide anchors to initialize a set of clustering centers via the K-means method. Then, all data are used to jointly train the deep network for HSI classification. The experimental results demonstrate that our method achieves results competitive with the traditional supervised CNN-based learning method. Meanwhile, our method has a simple network structure and can be easily trained.
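A hedged sketch of such a joint objective: softmax cross-entropy on labeled samples plus a clustering term that pulls unlabeled features toward their nearest of K centers (the centers would be initialized from labeled features via K-means). The weighting lam is an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_l, y_l, feats_u, centers, lam=0.1):
    """logits_l: (N_l, C) labeled logits; y_l: (N_l,) labels;
    feats_u: (N_u, D) unlabeled features; centers: (K, D)."""
    sup = F.cross_entropy(logits_l, y_l)
    d = torch.cdist(feats_u, centers)          # (N_u, K) distances to centers
    cluster = d.min(dim=1).values.pow(2).mean()  # pull to nearest center
    return sup + lam * cluster
```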
|
|
15:00-17:00, Paper WePMP.117 | |
Single Shot Feature Aggregation Network for Underwater Object Detection |
Zhang, Lu | Inst. of Automation, Chinese Acad. of Sciences |
Yang, Xu | Inst. of Automation, Chinese Acad. of Sciences |
Liu, Zhiyong | Inst. of Automation, Chinese Acad. of Sciences |
Qi, Lu | The Chinese Univ. of Hong Kong |
Zhou, Hao | Harbin Engineering Univ |
Charles, Chiu | School for Higher and Professional Education, Chai Wan, Hong Kong |
Keywords: Applications of computer vision, Object detection, Deep learning
Abstract: Rapidly developing ocean exploration and observation make the demand for underwater object detection increasingly urgent. Recently, deep convolutional neural networks (CNNs) have shown strong feature representation ability, and CNN-based detectors have achieved remarkable performance, but they still face a big challenge when detecting multi-scale objects in a complex underwater environment. To address this challenge, we propose a novel underwater object detector that introduces multi-scale features and complementary context information for better classification and localization. Using the proposed method for real coastal underwater object detection, we won first place in the auto-grabbing contest of the 2017 Underwater Robot Picking Contest sponsored by the National Natural Science Foundation of China (NSFC).
|
|
15:00-17:00, Paper WePMP.118 | |
Visual Tracking by Combining the Structure-Aware Network and Spatial-Temporal Regression |
Xu, Dezhong | Beijing Univ. of Tech |
Wu, Lifang | Beijing Univ. of Tech |
Jian, Meng | Beijing Univ. of Tech |
Wang, Qi | Beijing Univ. of Tech |
Keywords: Motion and tracking
Abstract: In this paper, we propose a novel visual tracking algorithm that combines the structure-aware network (SA-Net) and a spatial-temporal regression model. We first use SA-Net to obtain an initial location proposal, and deep features are extracted using a fine-tuned convolutional neural network model. Finally, both the location proposal and the deep features, including historical information, are input into a long short-term memory (LSTM) network for end-to-end spatial-temporal regression that adjusts the initial location proposal from SA-Net. Experimental results on the challenging OTB dataset demonstrate that the proposed scheme is robust to tracking loss caused by occlusion or object deformation. Additionally, comparative experiments show that the proposed scheme is more competitive than state-of-the-art algorithms.
|
|
15:00-17:00, Paper WePMP.119 | |
Generative Band Feature Enhancement for Hyperspectral Image Classification |
Li, Jiming | Zhejiang Pol. Coll |
Chen, Fangjie | Zhejiang Univ. of Tech |
Yang, Dongyong | Zhejiang Univ. of Tech |
Keywords: Image classification, Neural networks, Deep learning
Abstract: In this paper, we propose a generative method for feature enhancement of hyperspectral image bands. The method can significantly improve the discriminative information and visual quality of hyperspectral images. Based on the generative adversarial network scheme, a randomly sampled small band subset from the original hyperspectral image cube can be used to disentangle spectral signals from noisy bands and to generate new bands that achieve much better performance in land-cover classification. Experiments on real hyperspectral datasets demonstrate the effectiveness of the generative band feature enhancement method.
|
|
15:00-17:00, Paper WePMP.120 | |
A Selective Tracking and Detection Framework with Target Enhanced Feature |
Ding, Xinyao | South China Univ. of Tech |
Li, Lian | Tencent Company |
Zhang, Xin | South China Univ. of Tech |
Keywords: Motion and tracking, Applications of computer vision
Abstract: In long-term tracking, object representation and occlusion handling are two important issues. We propose a novel selective tracking and detection framework into which a new probabilistic object-enhanced feature is integrated. Firstly, besides a precise object appearance feature, we believe the neighboring foreground-background contrast is another key factor in tracking. Hence, we propose a foreground probability map to enhance the target and weaken the surrounding background; it is computed from the object color distribution and its comparison with the surrounding background. Secondly, we introduce the selective tracking and detection framework, which has two sets of conditions to control detector activation and final result selection. The detector is only activated when the tracker is not trustworthy, which is determined by the tracking confidence and the foreground parochiality value. Then, given the tracking and detection results, the final output is selected in terms of their individual correspondence values. We have evaluated our method on two popular benchmark datasets. Extensive experiments demonstrate that our algorithm performs favorably compared with state-of-the-art methods.
|
|
15:00-17:00, Paper WePMP.121 | |
Voting-Based Incremental Structure-From-Motion |
Cui, Hainan | Inst. of Automation, Chinese Acad. of Sciences |
Shen, Shuhan | Inst. of Automation, Chinese Acad. of Sciences |
Gao, Wei | Inst. of Automation, Chinese Acad. of Sciences |
Keywords: 3D reconstruction, 3D vision
Abstract: The incremental Structure-from-Motion (SfM) technique is the most prevalent approach to image-based reconstruction, but its robustness relies heavily on each camera registration, where one false calibration can make everything that follows fail. In this paper, we propose a voting-based incremental SfM approach to improve the camera registration process. First, the degree of closeness between cameras is used as a vote to determine which cameras to register. Then, for each camera, two methods are simultaneously used to estimate the camera pose, and the number of inliers is used as a vote to determine which pose is more accurate. Finally, by estimating a priori global camera rotations from the view-graph, the camera poses that are consistent with these rotations are considered to receive double votes and are preferentially kept. After all these prioritized cameras are calibrated, the remaining cameras are incrementally registered. Compared to state-of-the-art incremental SfM approaches, extensive experiments demonstrate that our system performs similarly or better in terms of reconstruction efficiency, while achieving better robustness and accuracy. Especially for ambiguous datasets, our system has better potential to reconstruct them.
|
|
15:00-17:00, Paper WePMP.122 | |
Object Classification of Remote Sensing Images Based on Partial Randomness Supervised Discrete Hashing |
Kang, Ting | Nanjing Univ. of Science and Tech |
Liu, Yazhou | Nanjing Univ. of Science and Tech |
Sun, Quansen | Nanjing Univ. of Science and Tech |
Keywords: Image classification, Multilabel learning, Applications of pattern recognition and machine learning
Abstract: Recently, object classification of remote sensing images has attracted increasing research interest due to the development of satellite and aerial vehicle technologies. Hashing learning is an efficient method to handle the huge amount of remote sensing data. In this paper, we propose a novel hashing learning method named partial randomness supervised discrete hashing (PRSDH), which combines data-dependent and data-independent methods. It jointly learns a discrete binary code generation and partial random constraint optimization model. Through random projection, the computational complexity is reduced effectively. With the weight matrix derived from the training data, the semantic similarity between the data can be well preserved while generating the hashing codes. For the discrete constraint problem, this paper adopts the discrete cyclic coordinate descent (DCC) algorithm to optimize the codes bit by bit. The experimental results show that PRSDH outperforms other comparative methods and demonstrate that PRSDH adapts well to the characteristics of remote sensing objects.
|
|
15:00-17:00, Paper WePMP.123 | |
Context-Aware Trajectory Prediction |
Bartoli, Federico | Univ. of Florence |
Lisanti, Giuseppe | Univ. Degli Studi Di Pavia |
Ballan, Lamberto | Univ. of Padova |
Del Bimbo, Alberto | Univ. of Florence |
Keywords: Behavior recognition, Motion and tracking
Abstract: Human motion and behaviour in crowded spaces is influenced by several factors, such as the dynamics of other moving agents in the scene, as well as static elements that might be perceived as points of attraction or obstacles. In this work, we present a new model for human trajectory prediction which is able to take advantage of both human-human and human-space interactions. The future trajectories of humans are generated by observing their past positions and interactions with the surroundings. To this end, we propose a ''context-aware'' recurrent neural network LSTM model, which can learn and predict human motion in crowded spaces such as a sidewalk, a museum or a shopping mall. We evaluate our model on public pedestrian datasets, and we contribute a new challenging dataset that collects videos of humans navigating a real crowded space such as a big museum. Results show that our approach predicts human trajectories better than previous state-of-the-art forecasting models.
|
|
15:00-17:00, Paper WePMP.124 | |
A Multi-Modal Multi-View Dataset for Human Fall Analysis and Preliminary Investigation on Modality |
Tran, Thanh-Hai | Hanoi Univ. of Science and Tech |
Le, Thi-Lan | MICA Inst. Hanoi Univ. of Science and Tech |
Dinh-Tan, Pham | MICA Inst. Hanoi Univ. of Science and Tech |
Van-Nam, Hoang | MICA Inst. Hanoi Univ. of Science and Tech |
Van-Minh, Khong | MICA |
Quoc-Toan, Tran | MICA Inst. Hanoi Univ. of Science and Tech |
Thai-Son, Nguyen | PTIT |
Van-Cuong, Pham | PTIT |
Keywords: Video analysis, Behavior recognition, Deep learning for multimedia analysis
Abstract: Over the last decade, a large number of methods have been proposed for human fall detection. Most methods were evaluated on trimmed datasets. More importantly, these datasets lack variety in falls, subjects, views and modalities. This paper makes two contributions to the topic of automatic human fall detection. Firstly, to address the above issues, we introduce a large continuous multimodal multi-view dataset of human falls, namely CMDFALL. Our proposed dataset was captured from 50 subjects, with seven overlapping Kinect sensors and two wearable accelerometers. Each subject performs 20 activities, including 8 falls of different styles and 12 daily activities. All multi-modal multi-view data (RGB, depth, skeleton, acceleration) are time-synchronized and annotated for evaluating the performance of recognition algorithms for human activities, and human falls in particular, in an indoor environment. Secondly, based on the multimodal property of the dataset, we investigate the role of each modality and of their combinations in producing the best results in the context of human activity recognition. To this end, we adopt existing baseline techniques which have been shown to be efficient for each data modality, such as the C3D convnet on RGB, DMM-KDES on depth, Res-TCN on skeleton and a 2D convnet on acceleration data. We analyze which modality gives the best performance.
|
|
15:00-17:00, Paper WePMP.125 | |
A Novel Model for Multi-Label Image Annotation |
Wu, Xinjian | Soochow Univ |
Zhang, Li | Soochow Univ |
Li, Fan-Zhang | Soochow Univ |
Wang, Bangjun | Soochow Univ |
Keywords: Image captioning, Deep learning, Multilabel learning
Abstract: Multi-label image annotation is one of the most important open problems in computer vision. In this paper, we propose a novel model for image annotation. Unlike existing works that usually use conventional visual features to annotate images, this paper adopts features based on a convolutional neural network (CNN), which have shown the potential to achieve outstanding performance. In particular, we use a CNN to extract image features with higher semantic meaning and apply them to the image annotation method Tag Propagation (TagProp). Experimental results on four challenging datasets indicate that our model makes a marked improvement over the current state of the art.
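A simplified illustration of nearest-neighbour tag transfer in the spirit of TagProp over CNN features; the real TagProp learns rank- or distance-based weights, which are replaced here with a softmax over distances, and all names are hypothetical:

```python
import numpy as np

def propagate_tags(feat_q, feats_train, tags_train, k=10, tau=1.0):
    """feat_q: (D,) query CNN feature; feats_train: (N, D);
    tags_train: (N, L) binary tag matrix; returns (L,) tag scores."""
    d = np.linalg.norm(feats_train - feat_q, axis=1)
    nn = np.argsort(d)[:k]                 # k nearest training images
    w = np.exp(-d[nn] / tau)
    w /= w.sum()                           # normalized neighbour weights
    return w @ tags_train[nn]              # per-tag relevance in [0, 1]
```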
|
|
15:00-17:00, Paper WePMP.126 | |
Hallucinating Dense Optical Flow from Sparse Lidar for Autonomous Vehicles |
Vaquero, Victor | Iri, Upc-Csic |
Sanfeliu, Alberto | Univ. Pol. De Catalunya |
Moreno-Noguer, Francesc | CSIC-UPC |
Keywords: Applications of computer vision, Deep learning, Vision for robotics
Abstract: In this paper we propose a novel approach to estimating dense optical flow from sparse lidar data acquired on an autonomous vehicle. It is intended as a drop-in replacement for any image-based optical flow system when images are not reliable, e.g. due to adverse weather conditions or at night. In order to infer high-resolution 2D flow from discrete range data, we devise a three-block architecture of multiscale filters that combines multiple intermediate objectives in both the lidar and image domains. To train this network we introduce a dataset of approximately 20K lidar samples from the Kitti dataset, which we have augmented with a pseudo ground truth of image-based optical flow computed using FlowNet2. We demonstrate the effectiveness of our approach on Kitti, and show that despite using the low-resolution and sparse measurements of the lidar, we can regress dense optical flow maps which are on par with those estimated with image-based methods.
|
|
15:00-17:00, Paper WePMP.127 | |
Gaze-Aided Eye Detection Via Appearance Learning |
Cao, Lin | Inst. of Automation, Chinese Acad. of Sciences |
Gou, Chao | Chinese Acad. of Sciences |
Wang, Kunfeng | Inst. of Automation, Chinese Acad. of Sciences |
Xiong, Gang | Inst. of Automation, Chinese Acad. of Sciences |
Wang, Fei-Yue | Chinese Acad. of Sciences |
Keywords: Learning-based vision, Object detection, Image processing and analysis
Abstract: Image based eye detection and gaze estimation have a wide range of potential applications, such as medical treatment, biometric recognition, and human-computer interaction. Though a large number of researchers have attempted to solve these two problems, challenges remain due to variation in appearance and the lack of annotated images. In addition, most related works perform eye detection first, followed by gaze estimation via appearance learning. In this paper, we propose a unified framework that performs gaze estimation and eye detection simultaneously by learning cascade regression models from the appearance around eye-related key points. Intuitively, there is a coupled relationship among the location of the eye center, the shape of eye-related key points, the appearance representation, and the gaze information. To incorporate this information, at each cascade level we first learn a model that maps the shape and appearance around the current eye-related key points to the three-dimensional gaze. Then, with the help of the estimated gaze, we further learn a regression model that maps the gaze, shape and appearance information to eye location updates. By leveraging the power of cascade learning, the proposed method alternately optimizes the two tasks of eye detection and gaze estimation. Experiments are conducted on the GI4E and MPIIGaze benchmarks. Experimental results show that our proposed method achieves preferable results in gaze estimation and outperforms state-of-the-art methods in eye detection.
|
|
15:00-17:00, Paper WePMP.128 | |
Fish Detection from Low Visibility Underwater Videos |
Shevchenko, Violetta | Lappeenranta Univ. of Tech |
Eerola, Tuomas | Lappeenranta Univ. of Tech |
Kaarna, Arto | Lappeenranta Univ. of Tech |
Keywords: Applications of computer vision, Motion and tracking, Video analysis
Abstract: Counting and tracking fish populations is important for conservation purposes as well as for the fishing industry. Various non-invasive automatic fish counters exist based on principles such as resistivity, light beams and sonar. However, such methods typically cannot distinguish fish from other passing objects and, moreover, cannot recognize different species. Computer vision techniques provide an attractive alternative for building a more robust and versatile fish counting system. In this paper we present a fish detection framework for noisy videos captured in water with low visibility. For this purpose, we compare three background subtraction methods for the task. Moreover, we propose the necessary post-processing steps and heuristics to detect fish and separate them from other moving objects. The results show that by choosing an appropriate background subtraction method, it is possible to achieve satisfying detection accuracies of 80% and 60% on two challenging datasets. The proposed method will form a basis for the future development of fish species identification methods.
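A minimal OpenCV background-subtraction loop of the kind compared in the paper (MOG2 here), followed by simple morphological post-processing; the video filename is hypothetical and the fish-specific heuristics are omitted:

```python
import cv2

cap = cv2.VideoCapture('underwater.avi')  # hypothetical input video
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg.apply(frame)                                   # foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # suppress noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes

cap.release()
```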
|
|
15:00-17:00, Paper WePMP.129 | |
Accurate 3-D Reconstruction with RGB-D Cameras Using Depth Map Fusion and Pose Refinement |
Ylimäki, Markus | Univ. of Oulu |
Kannala, Juho | Aalto Univ |
Heikkilä, Janne | Univ. of Oulu |
Keywords: 3D reconstruction, Image based modeling, Multiple view geometry
Abstract: Depth map fusion is an essential part in both stereo and RGB-D based 3-D reconstruction pipelines. Whether produced with a passive stereo reconstruction or using an active depth sensor, such as Microsoft Kinect, the depth maps have noise and may have poor initial registration. In this paper, we introduce a method which is capable of handling outliers, and especially, even significant registration errors. The proposed method first fuses a sequence of depth maps into a single non-redundant point cloud so that the redundant points are merged together by giving more weight to more certain measurements. Then, the original depth maps are re-registered to the fused point cloud to refine the original camera extrinsic parameters. The fusion is then performed again with the refined extrinsic parameters. This procedure is repeated until the result is satisfying or no significant changes happen between iterations. The method is robust to outliers and erroneous depth measurements as well as even significant depth map registration errors due to inaccurate initial camera poses.
|
|
15:00-17:00, Paper WePMP.130 | |
Gender Recognition from Face Images Using Trainable Shape and Color Features |
Azzopardi, George | Univ. of Malta |
Foggia, Pasquale | Univ. Di Salerno |
Greco, Antonio | Univ. of Salerno |
Saggese, Alessia | Univ. of Salerno |
Vento, Mario | Univ. Degli Studi Di Salerno |
Keywords: Image classification, Biologically inspired vision, Soft biometrics
Abstract: Gender recognition from face images is an important application and is still an open computer vision problem, even though it is trivial for the human visual system. Variations in pose, lighting, and expression are a few of the issues that make such an application challenging for a computer system. Neurophysiological studies demonstrate that the human brain is able to distinguish men and women even in the absence of external cues, by analyzing the shape of specific parts of the face. In this paper, we describe an automatic procedure that combines trainable shape and color features for gender classification. In particular, the proposed method fuses edge-based and color-blob-based features by means of trainable COSFIRE filters. The former type of feature extracts information about the shape of a face, whereas the latter extracts information about shades of color in different parts of the face. We use these two sets of features to create a stacked SVM classification model and demonstrate its effectiveness on the GENDER-COLOR-FERET dataset, where we achieve an accuracy of 96.4%.
|
|
15:00-17:00, Paper WePMP.131 | |
Semantic-Only Visual Odometry Based on Dense Class-Level Segmentation |
Mahé, Howard | Airbus Defence and Space/cnrs-I3s/uca |
Marraud, Denis | Airbus Defence and Space |
Comport, Andrew Ian | CNRS-I3S/UCA |
Keywords: Motion and tracking, Learning-based vision, Segmentation, features and descriptors
Abstract: This paper proposes a novel approach called Semantic Visual Odometry (SemVO) which incorporates class-level consistency priors into the problem of 6-DoF Visual Odometry. Dense class-level labels are learnt for each pixel of the image using a CNN trained for semantic segmentation. A semantic error is formulated penalising the sum of squared differences (SSD) on class-level feature maps extracted from the decoder of a RefineNet. It will be shown how the proposed approach allows dense RGB-D camera tracking using solely a semantic error term. SemVO is evaluated on the ScanNet dataset and the results demonstrate how the number of classes affects performance. Results are also provided showing how best to fuse the new error function with classic dense photometric and geometric methods. Finally, it is demonstrated that SemVO improves over standard approaches for large camera motion applications.
|
|
15:00-17:00, Paper WePMP.132 | |
A Multi-Scale Feature Extraction Method for Single Sample |
Xu, Xiaoxiang | Soochow Univ |
Zhang, Li | Soochow Univ |
Li, Fan-Zhang | Soochow Univ |
Keywords: Image classification
Abstract: Face recognition matches a detected face image against a database of stored faces. Single-sample face recognition means that only one face image per subject is available as a training sample. In many practical settings, such as public security, airports and customs, most subjects have only one or a few face images. Research shows that the number of training samples has a great influence on face recognition performance: many excellent face recognition algorithms suffer a sharp decline in performance, or even fail, when dealing with single-sample problems. Single-sample face recognition has therefore long been a hot but difficult issue in face recognition research. Methods to solve this problem generally fall into two categories. One finds and selects features that are robust for face recognition, from the viewpoint of feature selection. The other generates multiple virtual samples, from the perspective of sample expansion, so as to reduce the influence of the small number of samples. However, existing algorithms generally consider only a single aspect, or combine the two mechanically. Motivated by this, we consider both perspectives, combined with the human learning and cognition mechanism: we use support vector transformation to generate multi-scale virtual samples for a single image, and extract the best support vector transformation feature for identification.
|
|
15:00-17:00, Paper WePMP.133 | |
Automatic Inspection of Aerospace Welds Using X-Ray Images |
Dong, Xinghui | The Univ. of Manchester |
Taylor, Chris | Univ. of Manchester |
Cootes, Tim | The Univ. of Manchester |
Keywords: Applications of computer vision, Applications of pattern recognition and machine learning
Abstract: The non-destructive testing (NDT) of components is very important to the aerospace industry. Welds in these components may contain porosities and other defects. These reduce the fatigue life of components and may result in catastrophic accidents if they end up in the aircraft. Currently such welds are inspected by humans studying radiographs of the welds. We describe an automatic system for detecting defects in welds, with the aim of creating a triage system to reduce the workload on human inspectors. Given an X-ray image of the aerospace weld, the system locates the weld line, then analyses the region around the line to identify abnormalities. Our results show that the weld can be precisely extracted from X-ray images and the defect detection operation can identify 83% of defects with fewer than 3 false positives per image, and thus may be useful for prompting human inspectors to reduce their workload.
|
|
15:00-17:00, Paper WePMP.134 | |
2D-To-3D Facial Expression Transfer |
Rotger, Gemma | Computer Vision Center and Dpt. Ciències De La Computació, Univ |
Felipe, Lumbreras | Computer Vision Center and Dpt. Ciències De La Computació, Univ |
Moreno-Noguer, Francesc | CSIC-UPC |
Agudo, Antonio | Iri, Csic-Upc |
Keywords: 3D vision, Applications of computer vision, 3D reconstruction
Abstract: Automatically changing the expression and physical features of a face from an input image is a topic that has been traditionally tackled in a 2D domain. In this paper, we bring this problem to 3D and propose a framework that given an input RGB video of a human face under a neutral expression, initially computes his/her 3D shape and then performs a transfer to a new and potentially non-observed expression. For this purpose, we parameterize the rest shape --obtained from standard factorization approaches over the input video-- using a triangular mesh which is further clustered into larger macro-segments. The expression transfer problem is then posed as a direct mapping between this shape and a source shape, such as the blend shapes of an off-the-shelf 3D dataset of human facial expressions. The mapping is resolved to be geometrically consistent between 3D models by requiring points in specific regions to map on semantic equivalent regions. We validate the approach on several synthetic and real examples of input faces that largely differ from the source shapes, yielding very realistic expression transfers even in cases with topology changes, such as a synthetic video sequence of a single-eyed cyclops.
|
|
15:00-17:00, Paper WePMP.135 | |
Segmentation-Guided Tracking with Prior Map Decision |
Ma, Ding | Harbin Inst. of Tech |
Bu, Wei | Harbin Inst. of Tech |
Wu, Xiangqian | Harbin Inst. of Tech |
Xie, Yuying | Michigan State Univ. Department of Computational Mathemati |
Cui, YueHua | Michigan State Univ. Department of Statistics and Probabil |
Keywords: Motion and tracking, Video analysis, Applications of computer vision
Abstract: For visual tracking, the target object is represented by an appearance model and the location of the target is estimated in each frame. Numerous tracking algorithms model the appearance of the target with a confidence score and rarely take into account the semantic information of the target. In this paper, we propose an efficient tracking algorithm that models the appearance of the target based on semantic segmentation. The overall architecture consists of two parts: the segmentation part and the tracking part. In the segmentation part, an attention model is employed, providing spatial highlights of the candidate region of the target. In the tracking part, the tracker is constructed by an online-updated convolutional neural network that identifies the target in subsequent frames, taking advantage of the segmentation information of the target from the segmentation part. To enhance the performance of this architecture, we design an incrementally updated prior map that takes both the segmentation signal and the tracking signal into consideration. Extensive experiments on three benchmarks, OTB-50, OTB-100, and Temple-Color, show that the proposed method outperforms other trackers.
|
|
15:00-17:00, Paper WePMP.136 | |
Fast Single Image Dehazing Via Positive Correlation |
Li, Bingheng | Xi’an Univ. of Posts and Telecommunications |
Lai, Yi | Xi’an Univ. of Posts and Telecommunications |
Wu, Chaoyan | Xi’an Univ. of Posts and Telecommunications |
Liu, Ying | Xi’an Univ. of Posts and Telecommunications |
Keywords: Applications of computer vision, Enhancement, restoration and filtering, Image processing and analysis
Abstract: In this paper, we propose a fast single image dehazing method based on positive correlation. Firstly, a linear model is built to describe the positive correlation between the minimum channel of the hazy image and its corresponding depth map. Then, the transmission map and the atmospheric light are separately obtained using the created linear model. Finally, based on the traditional atmospheric scattering model, the haze-free image can be recovered from the transmission map and the atmospheric light. Experimental results on numerous hazy images demonstrate that the proposed method has better performance and lower time complexity than state-of-the-art methods.
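For reference, the recovery step of the traditional atmospheric scattering model, J = (I - A)/t + A, with the usual lower bound on the transmission to avoid amplifying noise; the clipping value t0 is an assumed parameter:

```python
import numpy as np

def recover_scene(I, t, A, t0=0.1):
    """I: (H, W, 3) hazy image in [0, 1]; t: (H, W) transmission map;
    A: (3,) atmospheric light. Inverts I = J * t + A * (1 - t)."""
    t = np.clip(t, t0, 1.0)[..., None]   # lower-bound t to limit noise blow-up
    return np.clip((I - A) / t + A, 0.0, 1.0)
```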
|
|
15:00-17:00, Paper WePMP.137 | |
Temporal Action Detection by Joint Identification-Verification |
Wang, Wen | UESTC |
Yongjian, Wu | State Key Lab. of Synthetical Automation for Process Indus |
Liu, Haijun | UESTC |
Wang, Shiguang | Univ. of Electronic Science and Tech. of China |
Cheng, Jian | Univ. of Electronic Science and Tech. of China |
Keywords: Video analysis, Classification
Abstract: Temporal action detection aims not only to recognize the action category but also to detect the start time and end time of each action instance in an untrimmed video. The key challenge of this task is to accurately classify the actions and determine the temporal boundaries of each action instance. On the temporal action detection benchmark THUMOS 2014, large variations exist within the same action category while many similarities exist across different action categories, which limits the performance of temporal action detection. To address this problem, we propose to use a joint Identification-Verification network to reduce intra-action variations and enlarge inter-action differences. The joint Identification-Verification network is a Siamese network based on 3D ConvNets, which simultaneously predicts the action categories and the similarity scores for input pairs of video proposal segments. Extensive experimental results on the challenging THUMOS 2014 dataset demonstrate the effectiveness of our proposed method compared to existing state-of-the-art methods for temporal action detection in untrimmed videos. We further demonstrate that our model is a general framework by evaluating our approach on the Charades dataset.
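A hedged sketch of a joint identification-verification objective on a Siamese pair: a classification loss on each branch plus a contrastive term on the embedding distance. The margin and weight are assumed values, and the paper's exact similarity head may differ:

```python
import torch
import torch.nn.functional as F

def id_verif_loss(logits1, logits2, y1, y2, f1, f2, margin=1.0, lam=0.5):
    """logits*: (B, C) class scores; y*: (B,) labels; f*: (B, D) embeddings."""
    ident = F.cross_entropy(logits1, y1) + F.cross_entropy(logits2, y2)
    d = F.pairwise_distance(f1, f2)
    same = (y1 == y2).float()
    # Pull same-class pairs together, push different-class pairs apart
    verif = same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)
    return ident + lam * verif.mean()
```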
|
|
15:00-17:00, Paper WePMP.138 | |
Vehicle Re-Identification by Deep Feature Fusion Based on Joint Bayesian Criterion |
Li, Siyu | Beijing Inst. of Tech |
Pei, Mingtao | Beijing Inst. of Tech |
Zhu, Leyi | Univ. of Science and Tech. of China |
Keywords: Object recognition, Deep learning, Neural networks
Abstract: Vehicle re-identification is a challenging task, as the differences between vehicles of the same model are extremely small. In this paper, we propose to fuse deep features extracted by two different CNNs for vehicle re-identification. CNNs can extract discriminative features for classification tasks, and features extracted by different CNNs describe different aspects of the input image and are complementary to each other. We propose a new loss function, called the Joint Bayesian loss, to fuse the different deep features. The proposed Joint Bayesian loss minimizes the intra-class variations and simultaneously maximizes the inter-class variations of the fused features, making it well suited to vehicle re-identification. Experiments on a large-scale vehicle dataset demonstrate the effectiveness of the proposed method.
|
|
15:00-17:00, Paper WePMP.139 | |
Robust Attentional Pooling Via Feature Selection |
Zheng, Jian | State Univ. of New York at Binghamton |
Lee, Teng-Yok | Mitsubishi Electric Res. Lab. (MERL) |
Feng, Chen | Mitsubishi Electric Res. Lab. (MERL) |
Li, Xiaohua | State Univ. of New York at Binghamton |
Zhang, Ziming | Mitsubishi Electric Res. Lab. (MERL) |
Keywords: Object recognition, Deep learning
Abstract: In this paper we propose a novel network module, namely Robust Attentional Pooling (RAP), that potentially can be applied in an arbitrary network to generate single vector representations for classification. By taking a feature matrix for each data sample as the input, our RAP learns data-dependent weights that are used to generate a vector through linear transformations of the matrix. We utilize feature selection to control the sparsity in weights for compressing the data matrices as well as enhancing the robustness of attentional pooling. As exemplary applications, we plug RAP into PointNet and ResNet for 3D point cloud and 2D image recognition, respectively. We demonstrate that our RAP significantly improves the recognition performance for both networks whenever sparsity is high. For instance, in extreme cases where only one feature per matrix is selected for recognition, RAP achieves more than 60% improvement in terms of accuracy on the ModelNet40 dataset.
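A hedged PyTorch sketch of attentional pooling with explicit sparsity: data-dependent weights are learned over the rows of the feature matrix, only the top-k are kept, and the matrix is pooled into a single vector. The linear scoring head and hard top-k projection are our assumptions, not necessarily the RAP module:

```python
import torch
import torch.nn as nn

class SparseAttentionPool(nn.Module):
    def __init__(self, dim, k):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # data-dependent attention scores
        self.k = k

    def forward(self, X):                         # X: (B, N, D) feature matrix
        w = self.score(X).squeeze(-1)              # (B, N) attention logits
        topv, topi = w.topk(self.k, dim=1)         # keep only the k best rows
        sparse_w = torch.zeros_like(w).scatter(1, topi, topv.softmax(dim=1))
        return (sparse_w.unsqueeze(-1) * X).sum(dim=1)  # (B, D) pooled vector
```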
|
|
15:00-17:00, Paper WePMP.140 | |
Unsupervised Multi-Domain Image Translation with Domain-Specific Encoders/Decoders |
Hui, Le | Nanjing Univ. of Science and Tech |
Li, Xiang | NJUST |
Chen, Jiaxin | Nanjing Univ. of Science and Tech |
He, Hongliang | Nanjing Univ. of Science and Tech |
Yang, Jian | Nanjing Univ. of Science and Tech |
Keywords: Mid-level vision, Deep learning, Domain adaptation
Abstract: Unsupervised image-to-image translation has advanced spectacularly in recent years. However, recent approaches mainly focus on one model with two domains, which incurs a heavy burden in training time and model parameters when n (n > 2) domains must be freely transferred to each other in a general setting. To address this problem, we propose a novel and unified framework named Domain-Bank, which consists of a globally shared auto-encoder and n domain-specific encoders/decoders, under the assumption that a universal shared latent space exists onto which all domains can be projected. Thus, we not only reduce the number of model parameters but also achieve a huge reduction in the time budget. Besides the high efficiency, we show comparable (or even better) image translation results over the state of the art on various challenging unsupervised image translation tasks, including face image translation and painting style translation. We also apply the proposed framework to the domain adaptation task and achieve state-of-the-art performance on digit benchmark datasets.
|
|
15:00-17:00, Paper WePMP.141 | |
A Light CNN Based Method for Hand Detection and Orientation Estimation |
Yang, Li | Southeast Univ |
Qi, Zhi | Southeast Univ |
Liu, Zeheng | Southeast Univ |
Zhou, Shanshan | Southeast Univ |
Zhang, Yang | Southeast Univ |
Liu, Hao | Southeast Univ |
Wu, Jianhui | Southeast Univ |
Shi, Longxing | Southeast Univ |
Keywords: Object detection, Pattern recognition for human computer interaction
Abstract: Hand detection is an essential step in many tasks, including HCI applications. However, robustly detecting various hands under cluttered backgrounds, motion blur or changing light remains a challenging problem. Recently, object detection methods using CNN models have significantly improved the accuracy of hand detection, yet at a high computational expense. In this paper, we propose a light CNN network that uses a modified MobileNet as the feature extractor together with the SSD framework to achieve robust and fast detection of hand location and orientation. The network generates a set of feature maps at various resolutions to detect hands of different sizes. To improve robustness, we also employ a top-down feature fusion architecture that integrates context information across levels of features. For an accurate estimation of hand orientation by the CNN, we estimate the projections of two orthogonal vectors along the horizontal and vertical axes and then recover the size and orientation of a bounding box exactly enclosing the hand. Evaluated on the challenging Oxford hand dataset, our method reaches 83.2% average precision (AP) at 139 FPS on an Nvidia Titan X, outperforming previous methods in both accuracy and efficiency.
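The orientation recovery can be pictured as follows: if a side vector of the rotated box has horizontal and vertical projections (px, py), its length and angle follow directly. This NumPy sketch is a plausible reading of the abstract, not the paper's exact estimator.

import numpy as np

def box_from_projections(px, py, qx, qy):
    # (px, py) and (qx, qy): axis projections of the box's two orthogonal
    # side vectors, as regressed by the network.
    w = np.hypot(px, py)          # length of the first side
    h = np.hypot(qx, qy)          # length of the orthogonal side
    theta = np.arctan2(py, px)    # rotation angle of the box
    return w, h, theta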
|
|
15:00-17:00, Paper WePMP.142 | |
Orientation-Guided Similarity Learning for Person Re-Identification |
Jiang, Na | Beihang Univ |
Liu, Junqi | Beihang Univ |
Sun, Chenxin | Beihang Univ |
Wang, Yuehua | Beihang Univ |
Zhou, Zhong | Beihang Univ |
Wu, Wei | Beihang Univ |
Keywords: Image classification, Deep learning, Visual surveillance
Abstract: Person re-identification (re-id) is a promising topic in computer vision that concentrates on similarity learning of individuals across different camera views. It remains challenging due to unpredictable orientation variations, partial occlusions, and inaccurate detections. To solve these problems, we present an orientation-guided similarity learning architecture to learn discriminative feature representations and define a similarity metric for person re-id. Our proposed architecture explicitly leverages pedestrian orientation and body part cues to enhance generalization ability. In the architecture, an orientation-guided loss function that pulls positive samples with the same orientation closer is designed to alleviate orientation variations. Meanwhile, an aligned dense network with pose estimation is presented to extract robust global-local fusion representations, which effectively exploits local features to overcome partial occlusions. Finally, we introduce a two-stage Top-k re-ranking strategy to optimize initial re-id results by min-hash and weighted distance. Extensive experimental results demonstrate that our proposed approach significantly outperforms state-of-the-art re-id methods on the popular CUHK03, Market1501, and DukeMTMC-reID datasets.
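A hedged sketch of one way to realize an orientation-guided loss: a standard triplet loss in which positives sharing the anchor's orientation receive a larger pulling weight. The weighting scheme and names are assumptions, not the paper's exact formulation.

import torch

def orientation_guided_triplet(anchor, pos, neg, same_orient, margin=0.3, boost=2.0):
    # anchor/pos/neg: (N, D) embeddings; same_orient: (N,) bool mask that is
    # True when the positive shares the anchor's estimated orientation.
    d_ap = (anchor - pos).pow(2).sum(1)
    d_an = (anchor - neg).pow(2).sum(1)
    w = torch.where(same_orient,
                    torch.full_like(d_ap, boost),   # pull same-orientation pairs harder
                    torch.ones_like(d_ap))
    return torch.clamp(w * d_ap - d_an + margin, min=0).mean()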
|
|
WePMOT1 |
Ballroom C, 1st Floor |
WePMOT1.A Multitask and Multilabel Learning (Ballroom C, 1st Floor) |
Oral Session |
|
17:00-17:20, Paper WePMOT1.1 | |
Learning Multi-View Generator Network for Shared Representation |
Han, Tian | Univ. of California, Los Angeles |
Xing, Xianglei | Harbin Engineering Univ |
Wu, Yingnian | Univ. of California, Los Angeles |
Keywords: Multiview learning, Gait recognition, Deep learning
Abstract: Multi-view representation learning is challenging because different views contain both common structure and complex view-specific information. Traditional generative models may not be effective in such a situation, since view-specific and common information cannot be well separated, which may cause problems for downstream vision tasks. In this paper, we introduce a multi-view generator model to solve multi-view generation and recognition in a unified framework. We propose a multi-view alternating back-propagation algorithm that learns multi-view generator networks by allowing them to share common latent factors. Our experiments show that the proposed method is effective for both image generation and recognition. Specifically, we first qualitatively demonstrate that our model can rotate and complete faces accurately. We then show that our model achieves state-of-the-art or competitive recognition performance through quantitative comparisons.
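A minimal PyTorch sketch of per-view generators conditioned on a shared latent factor plus a view-specific one; the alternating back-propagation learning procedure is omitted, and all sizes and names are placeholders.

import torch
import torch.nn as nn

class MultiViewGenerator(nn.Module):
    # One generator per view; each consumes the shared latent factor z_c
    # concatenated with its own view-specific factor.
    def __init__(self, n_views, z_shared=64, z_view=16, out=784):
        super().__init__()
        self.gens = nn.ModuleList(
            nn.Sequential(nn.Linear(z_shared + z_view, 256), nn.ReLU(),
                          nn.Linear(256, out), nn.Tanh())
            for _ in range(n_views))

    def forward(self, z_c, z_views):        # z_c: (N, z_shared); z_views: list of (N, z_view)
        return [g(torch.cat([z_c, z_v], 1)) for g, z_v in zip(self.gens, z_views)]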
|
|
17:20-17:40, Paper WePMOT1.2 | |
Multi-Task Learning of Cascaded CNN for Facial Attribute Classification |
Zhuang, Ni | Xiamen Univ |
Yan, Yan | Xiamen Univ |
Chen, Si | Xiamen Univ. of Tech |
Wang, Hanzi | Xiamen Univ |
Keywords: Classification, Deep learning, Multitask learning
Abstract: Recently, facial attribute classification (FAC) has attracted significant attention in the computer vision community, and great progress has been made along with the availability of challenging FAC datasets. However, conventional FAC methods usually first pre-process the input images (i.e., perform face detection and alignment) and then predict facial attributes. These methods ignore the inherent dependencies among these tasks (i.e., face detection, facial landmark localization and FAC). Moreover, some methods using convolutional neural networks are trained with fixed loss weights, without considering the differences between facial attributes. To address the above problems, we propose a novel multi-task learning of cascaded convolutional neural network method, termed MCFA, for predicting multiple facial attributes simultaneously. Specifically, the proposed method takes advantage of three cascaded sub-networks (i.e., S_Net, M_Net and L_Net, corresponding to neural networks at different scales) to jointly train multiple tasks in a coarse-to-fine manner, achieving end-to-end optimization. Furthermore, the proposed method automatically assigns a loss weight to each facial attribute based on a novel dynamic weighting scheme, making the proposed method concentrate on predicting the more difficult facial attributes. Experimental results show that the proposed method outperforms several state-of-the-art FAC methods on the challenging CelebA and LFWA datasets.
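The dynamic weighting scheme is not detailed in the abstract; as a hypothetical stand-in, this sketch weights each attribute's binary cross-entropy by its current relative difficulty, so harder attributes contribute more to the gradient.

import torch
import torch.nn.functional as F

def dynamic_attribute_loss(logits, targets, eps=1e-6):
    # logits, targets: (N, A) for A binary facial attributes.
    per_attr = F.binary_cross_entropy_with_logits(
        logits, targets, reduction='none').mean(0)          # (A,) loss per attribute
    w = per_attr.detach() / (per_attr.detach().sum() + eps) # harder attribute -> larger weight
    return (w * per_attr).sum()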
|
|
17:40-18:00, Paper WePMOT1.3 | |
Learning with Latent Label Hierarchy from Incomplete Multi-Label Data |
Pei, Yuanli | Oregon State Univ |
Fern, Xiaoli Z | Oregon State Univ |
Raich, Raviv | Oregon State Univ |
Keywords: Multilabel learning, Probabilistic graphical model
Abstract: Exploiting hierarchical label structure for multi-label classification can significantly improve classification performance and also benefit the labeling process. Existing work either cannot make use of such structure or assumes the hierarchy is given as a prior. In practice, such a hierarchy is not always available beforehand, and it is desirable to learn it from data. Moreover, the labels in the training data may be incomplete due to an inconsistent labeling process, which raises another learning challenge. This paper studies multi-label learning with a latent label hierarchy and incomplete label assignments. Our goal is to simultaneously learn the hierarchy as well as a multi-label classifier given the input features and incomplete label assignments. We propose a probabilistic model that captures the hierarchical structure and the incompleteness of the labels, and introduce an Expectation-Maximization (EM) procedure for maximum likelihood estimation.
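As a toy illustration of EM over incomplete labels only (this ignores both the latent hierarchy and the input features, so it is far simpler than the paper's model), the sketch below alternates between imputing missing labels and re-estimating per-label rates under an independent-Bernoulli assumption.

import numpy as np

def em_fill_labels(Y, iters=50):
    # Y: (N, L) array with 1/0 for observed labels and np.nan where missing.
    # Assumes every label is observed for at least one instance.
    obs = ~np.isnan(Y)
    rates = np.nanmean(Y, axis=0)        # initial per-label rates
    E = np.where(obs, Y, rates)          # E-step: expected values of missing labels
    for _ in range(iters):
        rates = E.mean(axis=0)           # M-step: maximum-likelihood rates
        E = np.where(obs, Y, rates)      # E-step with the updated rates
    return E, rates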
|
|
18:00-18:20, Paper WePMOT1.4 | |
Learning Fixation Point Strategy for Object Detection and Classification |
Lyu, Jie | Xi'an Jiaotong Univ |
Yuan, Zejian | Xi'an Jiaotong Univ |
Chen, Dapeng | Xi'an Jiaotong Univ |
Zhao, Yun | Xi'an Jiaotong Univ |
Zhang, Hui | Shenzhen Forward Innovation Digital Tech. Co. Ltd. China |
Keywords: Multitask learning, Object detection, Deep learning
Abstract: We propose a novel recurrent attentional structure to localize and recognize objects jointly. The network learns to extract a sequence of local observations with detailed appearance and rough context, instead of sliding windows or convolutions over the entire image. These observations are then fused to complete the detection and classification tasks. For training, we present a hybrid loss function to learn the parameters of the multi-task network end-to-end. In particular, the combination of the stochastic and object-awareness strategies, named SA, selects richer context and ensures the last fixation is close to the object. In addition, we build a real-world dataset to verify the capacity of our method in detecting objects of interest, including small ones. Our method can predict a precise bounding box on an image and achieves high speed on large images. Experimental results indicate that the proposed method can mine effective context from several local observations. Moreover, precision and speed are easily improved by changing the number of recurrent steps. Finally, source code is available at https://github.com/jielyu/RADCN.
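A minimal sketch of a recurrent fixation loop: crop a local observation at the current fixation, update a recurrent state, and predict the next fixation point. Glimpse size, the GRU cell, and all names are assumptions; the authors' actual architecture is in the linked repository.

import torch
import torch.nn as nn

class FixationNet(nn.Module):
    # A toy recurrent glimpse loop (names and sizes are placeholders).
    def __init__(self, glimpse=32, hidden=256):
        super().__init__()
        self.g, self.h = glimpse, hidden
        self.encode = nn.Sequential(nn.Flatten(),
                                    nn.Linear(3 * glimpse * glimpse, hidden), nn.ReLU())
        self.rnn = nn.GRUCell(hidden, hidden)
        self.where = nn.Linear(hidden, 2)     # next fixation (x, y) in [-1, 1]

    def forward(self, img, steps=4):          # img: (N, 3, H, W) with H, W >= glimpse
        N, _, H, W = img.shape
        state = img.new_zeros(N, self.h)
        loc = img.new_zeros(N, 2)             # start looking at the image centre
        for _ in range(steps):
            # map the normalized fixation to a top-left crop corner
            cx = ((loc[:, 0] + 1) / 2 * (W - self.g)).long().tolist()
            cy = ((loc[:, 1] + 1) / 2 * (H - self.g)).long().tolist()
            crops = torch.stack([img[i, :, cy[i]:cy[i] + self.g, cx[i]:cx[i] + self.g]
                                 for i in range(N)])
            state = self.rnn(self.encode(crops), state)
            loc = torch.tanh(self.where(state))   # where to look next
        return state                              # fused observation for task heads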
|
|
WePMOT2 |
309A, 3rd Floor |
WePMOT1.B Clustering (309A, 3rd Floor) |
Oral Session |
|
17:00-17:20, Paper WePMOT2.1 | |
Probabilistic Sparse Subspace Clustering Using Delayed Association |
Jaberi, Maryam | Univ. of Central Florida |
Pensky, Marianna | Univ. of Central Florida |
Foroosh, Hassan | Univ. of Central Florida |
Keywords: Clustering, Dimensionality reduction, Sparse learning
Abstract: Discovering and clustering subspaces in high-dimensional data is a fundamental problem of machine learning with a wide range of applications in data mining, computer vision, and pattern recognition. Earlier methods divided the problem into two separate stages of finding the similarity matrix and finding the clusters. Similar to some recent works, we integrate these two steps using a joint optimization approach. We make the following contributions: (i) we estimate the reliability of the cluster assignment for each point before assigning it to a subspace, grouping the data points into "certain" and "uncertain" groups, with the assignment of the latter group delayed until their subspace association certainty improves. (ii) We demonstrate that delayed association is better suited for clustering subspaces that have ambiguities, i.e., when subspaces intersect or data are contaminated with outliers/noise. (iii) We demonstrate experimentally that such delayed probabilistic association leads to more accurate self-representation and final clusters; the proposed method has higher accuracy both for points that lie exclusively in one subspace and for those on the intersection of subspaces. (iv) We show that delayed association leads to a huge reduction in computational cost, since it allows for incremental spectral clustering.
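A small sketch of the delayed-association idea: given per-point posteriors over subspaces, points are split into "certain" and "uncertain" groups, with the latter held back until their certainty improves. The simple threshold rule here is an assumption.

import numpy as np

def split_by_certainty(P, tau=0.8):
    # P: (N, K) posterior of each point over K candidate subspaces.
    best = P.max(axis=1)
    certain = np.where(best >= tau)[0]   # assign these points now
    delayed = np.where(best < tau)[0]    # defer until their posterior sharpens
    return certain, delayed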
|
|
17:20-17:40, Paper WePMOT2.2 | |
Constrained Sparse Subspace Clustering with Side-Information |
Li, Chun-Guang | Beijing Univ. of Posts and Telecommunications |
Zhang, Junjian | Beijing Univ. of Posts and Telecommunications |
Guo, Jun | Beijing Univ. of Posts and Telecommunications |
Keywords: Clustering, Performance evaluation, Semi-supervised learning
Abstract: Subspace clustering refers to the problem of segmenting high-dimensional data drawn from a union of subspaces into the respective subspaces. In some applications, partial side-information indicating "must-link" or "cannot-link" constraints for clustering is available. This leads to the task of subspace clustering with side-information. However, prior work has not fully exploited the supervision value of the side-information for subspace clustering. To this end, in this paper, we present an enhanced approach for constrained subspace clustering with side-information, termed Constrained Sparse Subspace Clustering plus (CSSC+), in which the side-information is used not only in the stage of learning an affinity matrix but also in the stage of spectral clustering. Moreover, we propose to estimate clustering accuracy based on the partial side-information and theoretically justify its connection to the ground-truth clustering accuracy in terms of the Rand index. We conduct experiments on three cancer gene expression datasets to validate the effectiveness of our proposals.
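A sketch of how clustering quality might be estimated from partial side-information alone, in the spirit of the Rand index: count the fraction of constrained pairs the clustering satisfies. The paper's exact estimator may differ.

import numpy as np

def rand_index_from_side_info(labels, must_link, cannot_link):
    # labels: (N,) cluster assignments; must_link / cannot_link: lists of
    # (i, j) index pairs. Assumes at least one constrained pair is given.
    agree = sum(labels[i] == labels[j] for i, j in must_link)     # kept together
    agree += sum(labels[i] != labels[j] for i, j in cannot_link)  # kept apart
    return agree / (len(must_link) + len(cannot_link))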
|
|
17:40-18:00, Paper WePMOT2.3 | |
Stream Clustering with Dynamic Estimation of Emerging Local Densities |
Wang, Ziyin | Indiana Univ.-Purdue Univ. Indianapolis |
Tsechpenakis, Gavriil | Indiana Univ.-Purdue Univ. Indianapolis |
Keywords: Clustering, Image classification, Online learning
Abstract: We present a method for clustering data streams incrementally, designed to discover all valid density peaks in a single pass, in a non-parametric fashion. It detects emerging clusters along the stream by dynamically locating kernels in the most promising areas and performing a Stochastic Mean Shift procedure to find clustering centers. We present a density estimation approach for dynamic initialization, considering every sub-stream that follows 'emerging data' as a sample set and applying hypothesis testing (a p-value approach) to estimate its local density. The sub-stream size and the p-value are determined in a way that provides a provable accuracy guarantee. We compare our method with the state-of-the-art on realistic and complex datasets and show that it outperforms not only stream algorithms but also their more complex, non-stream foundational paradigms.
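A hedged sketch of a p-value test for an emerging dense region: under a background hit rate p0, test whether the fraction of sub-stream points near a candidate kernel is significantly higher (normal approximation). The test statistic and all parameters are assumptions, not the paper's construction.

import numpy as np
from math import erf, sqrt

def emerging_density_pvalue(substream, center, r, p0):
    # substream: (n, d) points following 'emerging data'; center: (d,) kernel;
    # r: radius; p0: background probability of landing within r of any point.
    n = len(substream)
    phat = np.mean(np.linalg.norm(substream - center, axis=1) <= r)
    z = (phat - p0) / sqrt(p0 * (1 - p0) / n)
    return 0.5 * (1 - erf(z / sqrt(2)))   # small p-value -> emerging dense region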
|
|
WePMOT3 |
309B, 3rd Floor |
WePMOT2 Motion Analysis (309B, 3rd Floor) |
Oral Session |
|
17:00-17:20, Paper WePMOT3.1 | |
A Benchmark for Full Rotation Head Tracking |
Li, Yulin | Inst. of Computing Tech. Chinese Acad. of Sciences |
Ma, Bingpeng | Univ. of Chinese Acad. of Sciences |
Chang, Hong | Inst. of Computing Tech. CAS |
Chen, Xilin | Inst. of Computing Tech |
Keywords: Motion and tracking, Visual surveillance
Abstract: This paper introduces a new benchmark for 360-degree rotation head tracking, named Full Rotation Head Tracking (FRHT). The benchmark consists of 50 color sequences containing diverse human activities with complicated head motions. Specifically, FRHT covers most of the challenges of head tracking and focuses on the appearance variations of heads during 360-degree rotation. It also pays attention to clutter from the heads of nearby people. Further, we propose a baseline tracker that guides selective adaptive updating via verification strategies, thus alleviating error accumulation. Extensive experiments validate the advantages of FRHT for head rotation and similar-object clutter.
|
|
17:20-17:40, Paper WePMOT3.2 | |
Depth Masked Discriminative Correlation Filter |
Kart, Ugur | Tampere Univ. of Tech |
Kamarainen, Joni-Kristian | Tampere Univ. of Tech |
Matas, Jiri | CTU Prague |
Fan, Lixin | Nokia Tech |
Cricri, Francesco | Nokia Tech |
Keywords: Motion and tracking, Video analysis, Vision for robotics
Abstract: Depth information provides a strong cue for occlusion detection and handling, but until recently it has been largely omitted in generic object tracking due to the lack of suitable benchmark datasets and applications. In this work, we propose a Depth Masked Discriminative Correlation Filter (DM-DCF) which adopts a novel depth-segmentation-based occlusion detection that stops correlation filter updating, and depth masking that adaptively adjusts the spatial support of the correlation filter. On the Princeton RGBD Tracking Benchmark, DM-DCF is among the state-of-the-art in overall ranking and the winner in multiple categories. Moreover, since it is based on DCF, DM-DCF runs an order of magnitude faster than its competitors, making it suitable for time-constrained applications.
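A minimal NumPy sketch of depth masking for a correlation filter: keep only pixels near the target's depth, then correlate in the Fourier domain. The tolerance rule, units, and shapes are assumptions, not the paper's segmentation.

import numpy as np

def depth_mask(depth_patch, target_depth, tol=0.3):
    # Binary mask keeping pixels whose depth lies within tol (assumed metres)
    # of the tracked target; restricts the filter's spatial support.
    return (np.abs(depth_patch - target_depth) <= tol).astype(np.float32)

def masked_response(filt, features, mask):
    # Apply the depth mask to the feature patch before cross-correlation.
    # filt, features, mask: same-shape 2D arrays.
    masked = features * mask
    return np.real(np.fft.ifft2(np.fft.fft2(masked) * np.conj(np.fft.fft2(filt))))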
|
|
17:40-18:00, Paper WePMOT3.3 | |
OWP: Objectness Weighted Patch Descriptor for Visual Tracking |
Jiang, Bo | Anhui Univ |
Zhang, Yuan | Anhui Univ |
Tang, Jin | Anhui Univ |
Luo, Bin | Anhui Univ |
Keywords: Motion and tracking, Video analysis, Learning-based vision
Abstract: Visual object tracking is an active research problem that has been widely used in the computer vision and pattern recognition area. Existing visual tracking methods usually localize the object with a bounding box, a representation that is often disturbed by introduced background information and partial occlusion. To deal with this problem, in this paper, we propose a novel Objectness Weighted Patch (OWP) descriptor for object feature description in visual tracking. The aim of OWP is to assign different objectness weights to the patches of the bounding box to reduce the influence of background information and partial occlusion. We propose to compute the objectness weights of patches in OWP by integrating multiple cues (background, foreground and local spatial consistency) in a general optimization model. The proposed model has a simple closed-form solution and can thus be computed efficiently. We incorporate OWP into a structured SVM tracking framework to obtain a new robust tracking method. Extensive experiments on two standard benchmark datasets, OTB100 and Temple-Color, demonstrate the effectiveness and benefits of the proposed tracking method.
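A toy sketch of an objectness-weighted patch descriptor: derive per-patch weights from a simple cue fusion, then concatenate the weight-scaled patch features into one descriptor. The paper instead obtains the weights from a closed-form optimization over background, foreground, and spatial-consistency cues; this convex combination is a hypothetical simplification.

import numpy as np

def objectness_weights(bg_cue, fg_cue, alpha=0.5):
    # Convex combination of foreground and inverted background cues,
    # normalized to sum to one. bg_cue, fg_cue: (P,) in [0, 1].
    w = alpha * fg_cue + (1 - alpha) * (1 - bg_cue)
    return w / (w.sum() + 1e-8)

def owp_descriptor(patch_feats, weights):
    # patch_feats: (P, D) per-patch features; weights: (P,) objectness weights.
    # Down-weighted patches (likely background or occlusion) contribute less.
    return (patch_feats * weights[:, None]).ravel()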
|
|
18:00-18:20, Paper WePMOT3.4 | |
Dual-SVM Tracker Via Multiple Support Instance and LEVER Strategy |
Ma, Ding | Harbin Inst. of Tech |
Wu, Xiangqian | Harbin Inst. of Tech |
Bu, Wei | Harbin Inst. of Tech |
Keywords: Motion and tracking, Video analysis, Applications of computer vision
Abstract: Visual tracking can be modeled as a binary classification problem, and classic support vector machine (SVM) based methods have demonstrated encouraging performance on recent object tracking benchmarks. However, the performance of an SVM is very sensitive to noisy training data during online updates. In this paper, we propose an efficient dual-SVM based tracker to improve classification performance for visual tracking. The proposed tracker consists of two models: a holistic model and a part model. To learn the holistic model, support instances are derived from an RMI-SVM trained in a deep feature space. For the part model, which highlights the local structure of the target, a linear SVM is learned to further encode local details of the target, taking as input candidate instances selected from the support instances by confidence. To fuse the holistic model and the part model, we design a simple but efficient decision strategy (LEVER) that enforces the dual SVMs to focus on the target. The proposed LEVER is updated incrementally to capture changes in the appearance of the target. Extensive experimental results show that the proposed tracker performs favorably against state-of-the-art methods.
|
|
WePMOT4 |
311B, 3rd Floor |
WePMOT4 Gait and Person Re-Identification (311B, 3rd Floor) |
Oral Session |
|
17:00-17:20, Paper WePMOT4.1 | |
3D Gait Recognition Based on Functional PCA on Kendall's Shape Space |
Hosni, Nadia | Ec. Nationale des Sciences Informatiques, Univ. of Manouba |
Drira, Hassen | LIFL (UMR Lille1/CNRS 8022), Univ. de Lille 1 |
Chaieb, Faten | CRISTAL Lab. ENSI, Manouba Univ |
Ben Amor, Boulbaba | IMT Lille Douai/CRIStAL (UMR CNRS 9189) |
Keywords: Gait recognition, Shape modeling and encoding, Classification
Abstract: In this paper we propose a novel gait recognition approach based on animated 3D skeletal data. Our approach combines two disparate ideas from shape analysis and Functional Data Analysis (FDA) for a joint geometric-functional analysis: skeletal sequences are viewed as time-parametrized trajectories on Kendall's shape space once scaling, translation and rotation variations are filtered out from the fixed-time 3D skeletons. A Riemannian Functional Principal Component Analysis (RFPCA) is carried out on our manifold-valued trajectories to build a new basis of principal functions, termed EigenTrajectories. Each trajectory can then be projected onto this eigenbasis, which gives rise to a compact signature, or EigenScores. The latter is fed to pre-trained 'One-vs-All' SVM classifiers for identity recognition and authentication. Based on the geometry of the underlying shape space, tools for re-sampling and synchronizing trajectories are naturally derived in order to apply the proposed variant of FPCA. We have conducted experiments on a subset of the CMU dataset. Our approach shows promising results compared to the state-of-the-art when a compact and robust signature is considered.
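A Euclidean stand-in for the pipeline (ignoring the Riemannian geometry of Kendall's shape space and rotation alignment, so this is only a flavor of RFPCA): normalize each skeleton for translation and scale, flatten the synchronized trajectories, run ordinary PCA to obtain scores, and feed them to one-vs-all SVMs.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def normalize_frame(X):
    # Crude Kendall-style normalization of one skeleton (J, 3): remove
    # translation and scale (rotation alignment is omitted in this sketch).
    X = X - X.mean(0)
    return X / (np.linalg.norm(X) + 1e-8)

def eigen_scores(trajs, n_comp=20):
    # trajs: (N, T, J, 3) synchronized skeletal sequences, with N > n_comp.
    flat = np.stack([np.concatenate([normalize_frame(f).ravel() for f in t])
                     for t in trajs])
    pca = PCA(n_components=n_comp).fit(flat)
    return pca.transform(flat), pca

# identification on the compact signatures, e.g.:
# clf = SVC(kernel='linear', decision_function_shape='ovr').fit(scores, ids)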
|
|
17:20-17:40, Paper WePMOT4.2 | |
Person Re-Identification with Vision and Language |
Yan, Fei | Univ. of Surrey |
Kittler, Josef | Univ. of Surrey |
Mikolajczyk, Krystian | Univ. of Surrey |
Keywords: Soft biometrics, Vision and language, Neural networks
Abstract: In this paper we propose a new approach to person re-identification using images and natural language descriptions. We propose a joint vision and language model based on CNN and LSTM architectures to match across the two modalities as well as to enrich visual examples for which there are no language descriptions. We also introduce new annotations in the form of natural language descriptions for two standard Re-ID benchmarks, namely CUHK03 and VIPeR. We perform experiments on these two datasets with techniques based on CNN, hand-crafted features as well as LSTM for analysing visual and natural description data. We investigate and demonstrate the advantages of using natural language descriptions compared to attributes as well as CNN compared to LSTM in the context of Re-ID. We show that the joint use of language and vision can significantly improve the state-of-the-art performance on standard Re-ID benchmarks.
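A minimal PyTorch sketch of a joint vision-and-language embedding in the spirit of the abstract: a small CNN embeds images and an LSTM embeds descriptions into one space, where cross-modal matching can use cosine similarity. The architecture and sizes are placeholders, not the paper's model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointVLEmbedding(nn.Module):
    def __init__(self, vocab, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(32, dim))
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, img, tokens):          # img: (N, 3, H, W); tokens: (N, T) ids
        v = F.normalize(self.cnn(img), dim=1)
        _, (h, _) = self.lstm(self.embed(tokens))
        t = F.normalize(h[-1], dim=1)
        return v, t   # train with a cross-modal ranking loss on v.t similarities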
|
|
17:40-18:00, Paper WePMOT4.3 | |
Does a Body Image Tell Age? |
Yuan, Baoyu | Sun Yat-Sen Univ |
Wu, Ancong | Sun Yat-Sen Univ |
Zheng, Wei-Shi | Sun Yat-Sen Univ |
Keywords: Soft biometrics, Visual surveillance
Abstract: Age estimation is an important task in computer vision and is widely used in applications. However, this technology is heavily affected by the resolution of the face, and estimating the age of a person at a distance is a challenge. While the body image of a person is often captured more clearly, when and how to use body-based visual cues for age estimation is largely understudied. In this work, we argue that body-based visual cues are better for estimating the age group and can assist the estimation of the exact age value. For this purpose, we develop a Body-based Age Net (BAN) that unifies selective local convolution features and contextual convolution features. The network is designed based on two assumptions: 1) a person's clothing is closely related to his/her age group; 2) some selective local parts of the body are more discriminative for age group estimation. We have contributed a large-scale and publicly available Body Age (BAG) dataset, on which we quantitatively evaluate the proposed model.
|
|
18:00-18:20, Paper WePMOT4.4 | |
Attend and Align: Improving Deep Representations with Feature Alignment Layer for Person Retrieval |
Xu, Qin | Tsinghua Univ |
Sun, Yifan | Tsinghua Univ |
Li, Yali | Tsinghua Univ |
Wang, Shengjin | Tsinghua Univ |
Keywords: Other biometrics, Learning-based vision, Visual surveillance
Abstract: In fine-grained recognition, object misalignment and background noise are two long-standing factors that affect the robustness of deep learning models. This paper focuses on person re-identification (re-ID) and introduces a feature alignment layer (FAL) which alleviates target misalignment and background noise simultaneously. Through an attention mechanism, FAL indicates the underlying importance of each pixel on the feature maps, i.e., whether the pixel is beneficial for discriminating different persons. The discriminative regions are then relocated to the center and stretched to fill the feature maps. This “attend and align” mechanism comprises two steps: target position prediction and value assignment. In the first step, each pixel on the feature maps learns to find a target position that is ID-discriminative. In the second step, the pixel is assigned a new value using the context of the predicted position. Moreover, FAL can easily be plugged into a canonical Convolutional Neural Network (CNN) and learned in an end-to-end manner. In experiments, our method yields competitive results compared with state-of-the-art approaches on three person re-ID datasets: Market-1501, DukeMTMC-reID and CUHK03. We also demonstrate that our method improves a competitive fine-grained recognition baseline on CUB-200-2011.
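A plausible PyTorch reading of the two-step “attend and align” mechanism: a convolution predicts a target position per output pixel, and grid sampling assigns each output pixel the value from that position. This is an interpretation of the abstract, not the authors' exact layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignLayer(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.offset = nn.Conv2d(ch, 2, 3, padding=1)   # predicted (dx, dy) per pixel

    def forward(self, x):                              # x: (N, C, H, W)
        N, _, H, W = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing='ij')
        base = torch.stack((xs, ys), dim=-1).expand(N, H, W, 2)  # identity grid
        flow = self.offset(x).permute(0, 2, 3, 1)      # step 1: target positions
        return F.grid_sample(x, base + torch.tanh(flow),
                             align_corners=True)       # step 2: value assignment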
|
| |