Face tracking and indexing in video sequences


Face Tracking and Indexing In Video

M. Malik MALLEM

M. Sami ROMDHANI

M. Kevin BAILLY

Supervisors:

M. Fakhreddine ABABSA

M. Maurice CHARBIT

A thesis submitted in fulfilment of the requirements

for the degree of Doctor of Philosophy

in the LTCI Department, Télécom ParisTech

July 2015


Face tracking in video sequences is an important problem in computer vision because of its multiple applications in many domains, such as video surveillance, human-computer interfaces and biometrics. Although there have been many recent advances in this field, it remains a challenging topic, in particular if the 6 Degrees of Freedom (DOF), i.e. three-dimensional (3D) translation and rotation, or fiducial points (to capture the facial animation) need to be estimated at the same time. Its challenge comes mainly from the following factors: illumination variations, wide head rotation, expression, occlusion, cluttered background, etc. In this thesis, contributions are made to address two of the mentioned major difficulties: expression and wide head rotation. We aim to build a 3D tracking framework on monocular cameras which is able to estimate accurate 6 DOF and facial animation while being robust to wide head rotation, even up to profile views. Our method adopts a 3D face model as a set of 3D vertices needed to compute continuous values of the 6 DOF in video sequences.

To track wide 3D head poses, there are significant difficulties regarding data collection, where the pose and/or a large number of fiducial points are generally required in order to build the statistical shape and appearance models. The pose ground truth is expensive to collect because it requires specific devices or sensors, while the manual annotation of correspondences on large databases can be tedious and error prone. Moreover, a further problem of correspondence annotation is the difficulty of locating hidden points on self-occluded faces at profile views. As a result, a general approach that wants to tackle out-of-plane rotations usually has to use view-based or adaptive models. The problem matters because different views, for example profile and frontal, have different numbers of visible fiducial points, which causes a gap in tracking between them using a single model. The first main contribution of our thesis is the use of large synthetic data to overcome this problem. Leveraging such data, covering the full range of head poses, and the on-line information correlating consecutive frames, we propose a combination of the two to robustify the tracking. Local features are adopted in our thesis to reduce the high variation of facial appearance when learning on a large dataset.


To simultaneously track facial animation and out-of-plane rotations, and to move towards in-the-wild conditions, the combination of synthetic and real datasets is proposed to train a cascaded-regression tracking model. From the perspective of cascaded regression, the local descriptor is very significant for high performance. Hence, we utilize feature learning in order to make the representation of local patches more discriminative than hand-crafted descriptors. Furthermore, the proposed method introduces some modifications of the traditional cascaded approach at later stages in order to boost the quality of the found fiducial points or correspondences.

Lastly, the Unconstrained 3D Pose Tracking (U3PT) dataset, our own recording, is proposed to the community for the evaluation of in-the-wild face tracking. The dataset captures ten subjects (five videos per subject) in an office environment with a cluttered background. The captured people move comfortably in front of the camera, rotating their heads in three directions (Yaw, Pitch and Roll), even up to 90 degrees of Yaw, with some occlusions and expressions. In addition, these videos are recorded under different lighting conditions. With the available 3D pose ground truth computed using an infrared camera system, the robustness and accuracy of a tracking framework that aims to work in-the-wild can be accurately evaluated.

Keywords: head pose tracking, pose estimation, 3D face tracking, in-the-wild face tracking, Bayesian tracking, face alignment, cascaded regression, synthetic data


During this study, the following papers were published or under submission:

U3PT: A New Dataset for Unconstrained 3D Pose Tracking Evaluation

Ngoc Trung Tran, Fakhreddine Ababsa and Maurice Charbit,

International Conference on Computer Analysis of Images and Patterns (CAIP), 2015.

Cascaded Regression of Learning Features for Face Alignment

Ngoc Trung Tran, Fakhreddine Ababsa, Sarra Ben Fredj and Maurice Charbit,

Advanced Concepts for Intelligent Vision Systems (ACIVS), 2015.

Towards Pose-Free Tracking of Non-Rigid Face using Synthetic Data

Ngoc Trung Tran, Fakhreddine Ababsa and Maurice Charbit,

International Conference on Pattern Recognition Applications and Methods (ICPRAM), 2015.

A Robust Framework for Tracking Simultaneously Face Pose and Animation using Synthesized Faces

Ngoc Trung Tran, Fakhreddine Ababsa and Maurice Charbit,

Pattern Recognition Letters, 2014.

3D Face Pose and Animation Tracking via Eigen-Decomposition based Bayesian Approach

Ngoc-Trung Tran, Fakhreddine Ababsa, Jacques Feldmar, Maurice Charbit, Dijana Petrovska-Delacretaz and Gerard Chollet,

International Symposium on Visual Computing (ISVC), 2013.

3D Face Pose Tracking from Monocular Camera via Sparse Representation of Synthesized Faces

Ngoc-Trung Tran, Jacques Feldmar, Maurice Charbit, Dijana Petrovska-Delacretaz and Gerard Chollet,

International Conference on Computer Vision Theory and Applications (VISAPP), 2013.

Towards In-the-wild 3D Head Tracking

Ngoc Trung Tran, Fakhreddine Ababsa and Maurice Charbit,

Machine Vision and Applications (MVA), 2015 (Submitted)


AAM Active Appearance Model

ASM Active Shape Model

CLM Constrained Local Model

DOF Degrees-of-Freedom

GMM Gaussian Mixture Model

IC Inverse Compositional

IRLS Iteratively Re-weighted Least Squares

LDM Linear Deformable Model

KLT Kanade–Lucas–Tomasi

MAE Mean Average Error

PCA Principal Component Analysis

RANSAC RANdom SAmple Consensus

RGB Red Green and Blue

RMS Root Mean Squared

POSIT Pose from Orthography and Scaling with ITerations

SDM Supervised Descent Method

SfM Structure from Motion

SSD Sum of Squared Differences

SVD Singular Value Decomposition

SVM Support Vector Machine

3DMM 3D Morphable Model


Contents v

1.1 Motivation 1

1.2 Challenge 4

1.3 Objectives 5

1.4 Overview 7

1.5 Notation 8

2 State-of-the-art 10

2.1 Face Models 10

2.1.1 Linear Deformable Models (LDMs) 11

2.1.2 Rigid Models 12

2.2 Tracking Approaches 13

2.2.1 Optical Flows Regularization 13

2.2.2 Manifold Embedding 14

2.2.3 Template Model 15

2.2.4 Local Matching 16

2.2.5 Discriminative 17

2.2.6 Regression 19

2.2.7 Hybrid 20

2.3 Multi-view Tracking 20

2.3.1 On-line adaptive models 20

2.3.2 View-based models 21

2.3.3 3D Synthetic models 21

2.4 Discussion 22

2.4.1 Off-line learning vs On-line adaptation 22

2.4.2 Real data vs Synthetic 23

2.4.3 Global vs Local features 23

2.5 Databases 23


3 A Baseline Framework for 3D Face Tracking 25

3.1 The General Bayesian Tracking Formulation 25

3.2 Face Tracking using Covariance Matrices of Synthesized Faces 26

3.2.1 Face Representation 26

3.2.2 Baseline Framework 29

3.2.2.1 Data Generation 29

3.2.2.2 Training 31

3.2.2.3 Tracking 31

3.2.3 Experiments 34

3.3 Face Tracking using Feature Reconstruction through Sparse Representation 36

3.3.1 Reconstructed-Error Function 36

3.3.2 Codebook 38

3.3.3 Experiments 39

3.4 Face Tracking using Adaptive Local Models 40

3.4.1 Adaptive Local Models 40

3.4.2 Experiments 42

3.4.2.1 Pose accuracy 42

3.4.2.2 Landmark precision 44

3.4.2.3 Out-of-plane tracking 47

3.5 Conclusion 47

4 Pose-Free Tracking of Non-rigid Face in Controlled Environment 49

4.1 Introduction 49

4.2 Wide Baseline Matching as Initialization 49

4.3 Large-rotation tracking via Pose-wise Appearance Models 52

4.3.1 View-based Appearance Model 52

4.3.2 Matching Strategy by Keyframes 54

4.3.3 Rigid and Non-rigid Estimation using View-based Appearance Model 55

4.3.4 Experimental Results 57

4.4 Flexible tracking via Pose-wise Classifiers 59

4.4.1 Templates for Matching using Support Vector Machine 60

4.4.2 Large off-line Synthetic Dataset 62

4.4.3 Local Appearance Models 63

4.4.4 Fitting via Pose-Wise Classifiers 64

4.4.5 Experimental Results 66

4.4.5.1 Pose Accuracy 66

4.4.5.2 Landmark precision 67

4.4.5.3 Out-of-plane tracking 68

4.5 Conclusion 69

5 Towards In-the-wild 3D Face Tracking 70

5.1 Cascaded Regression of Learning Features for Landmark Detection 70

5.1.1 Introduction 70


5.1.2 Cascaded Regression 72

5.1.3 Local Learning Feature 74

5.1.4 Coarse-to-Fine Correlative Cascaded Regression 75

5.1.5 Experimental Results 77

5.1.5.1 Learning Feature on Inner vs Boundary points 79

5.1.5.2 Coarse-to-Fine Correlative Regression (CFCR) 81

5.1.6 Conclusion 81

5.2 In-the-wild 3D Face Tracking 81

5.2.1 Introduction 81

5.2.2 Real and synthetic data for in-the-wild tracking 83

5.2.3 Unconstrained 3D Pose Tracking Dataset 88

5.2.3.1 Recording System 88

5.2.3.2 Calibration 89

5.2.3.3 Dataset 92

5.2.4 Experiments 93

5.2.4.1 Public datasets 93

5.2.4.2 Datasets for Visualization 94

5.2.4.3 Our dataset: Unconstrained Pose Tracking dataset 95

5.2.5 Conclusion 98

5.3 Conclusion 99

6 Conclusion and Future Works 100

A Algorithms in Use 103

A.1 Nelder-Mead algorithm 103

A.2 POSIT 103

A.3 RANSAC 106

A.4 Local Binary Patterns 107

A.5 3D Morphable Model 107


List of Figures

1.1 Three head orientations: Yaw, Pitch and Roll 2

1.2 Some challenging conditions of 3D face tracking: illumination, head rotation, cluttered background in one sample video sequence 4

2.1 Some 3D face models, from left to right: AAM, Cylinder, Mesh and 3DMM 11

3.1 The frontal and profile views of the Candide-3 model 27

3.2 One example of the perspective projection in our method to obtain 2D projected landmarks from the current model given the parameters 29

3.3 The diagram of baseline framework for 3D face tracking: data generation, training and tracking stages 30

3.4 Landmark initialization, 3D model fitting using POSIT and the rendering of synthesized images 31

3.5 The landmarks used in the framework, examples of mean and covariance matrices at the eye and mouth corners 32

3.6 The visualization of the first two and three components of descriptor projection using PCA, learned from synthetic images of some feature points 33

3.7 Some example videos of the BUFT dataset at different views 34

3.8 The framework using reconstructed features via sparse representation with two modifications from the baseline 37

3.9 The sparse representation to estimate the coefficients of the positive (green) and negative (red) local patches of one corner of eyebrow 38

3.10 The construction of positive and negative patches of eye corners 39

3.11 The framework using adaptive local models based on the baseline 40

3.12 A sample video of the BUFT dataset using the local adaptive model 44

3.13 Some tracking examples on BUFT dataset using local adaptive model (green) and FaceTracker (yellow) 45

3.14 The visualization of our (Yaw, Pitch, Roll) estimation (blue) and ground-truth (red) on the video jam7.avi using local adaptive model 45

3.15 The visualization of our (Yaw, Pitch, Roll) estimation (blue) and ground-truth (red) on the video vam2.avi using the local adaptive model 46

3.16 The RMS error of 12 selected points for tracking in our framework (red) compared to [97] (blue). The vertical axis is RMS error (in pixels) and the horizontal axis is the frame number 47

4.1 The basic idea of wide baseline matching 50


4.2 Computing 3D points Lk as the 3D intersection points 51

4.3 From left to right: the 2D SIFT keypoints of the keyframe, SIFT matching, outliers removed by RANSAC 52

4.4 The pipeline of the Pose-Wise Appearance Model based framework 53

4.5 The structure of the mapping table, with the key as three orientations and the content as the local descriptors 54

4.6 The cross-validation to select two parameters: the number of nearest neighbors Nq and the threshold kHub. The validation of a) first row: the number of nearest poses, b) second row: kHub for |Yaw| ≤ 30◦, and c) third row: kHub for |Yaw| > 30◦. The vertical axis is RMS error 58

4.7 Tracking examples on VidTimid and Honda/UCSD 59

4.8 How positive (blue) and negative (red) samples are picked, and the response map computed at the mouth corner after training 61

4.9 Some weight matrices or templates of local patches T(x) in our implementation 61

4.10 From left to right of the training process: 143 frontal images, landmark annotation and 3D model alignment, synthesized image rendering, and pose-wise SVM training 62

4.11 a) The Candide-3 model with facial points in our method; b) the way to compute the response map at the mouth corner using three descriptors via SVM templates 64

4.12 The pipeline of tracking process from the frame t to t + 1 64

4.13 The RMS of our framework (red curve) and FaceTracker [97] (blue curve). The vertical axis is RMS error (in pixels) and the horizontal axis is the frame number 67

4.14 Our tracking method on some sample videos of VidTimid and Honda/UCSD 68

5.1 Some results of face alignment on some challenging images of 300-W dataset 71

5.2 The visualization of weight matrices when learning local descriptors of eye and mouth corners using RBM 75

5.3 The overview of our approach. In training, we sequentially learn RBM models and coarse-to-fine regression (correlative and global regression) 76

5.4 CED evaluation of local correlative regression on specific landmarks; i → j means using the i-th landmark to detect the j-th landmark 77

5.5 The cross-validation of feature size, the number of hidden units, the number of random samples and the number of regressors 79

5.6 The evaluation of the effect of inner and boundary points on the cross-validation set of the 300-W dataset 80

5.7 Some results on 300-W dataset 82

5.8 The annotation of 51 landmarks in synthetic images and their automatic rendering in different views 83

5.9 Examples of landmark annotation of Multi-PIE at different views 84

5.10 Examples of automatic rendering of synthetic images 85

5.11 Landmark detection on some large rotations using synthetic training data 86


5.12 Examples of landmark detection on real images using the model trained from the synthetic data 86

5.13 The mean shape, the shape transformed as close as possible to the target shape using anisotropic scaling, and the local descriptors of the transformed shape 87

5.14 Tracking between consecutive frames using the simple strategy 88

5.15 The installation of recording system 89

5.16 The tree flystick and its detection using the stereo infrared camera and the Dtrack2 software 89

5.17 The checkerboard for calibration of the RGB camera 90

5.18 Annotation of the zero marker for stereo calibration. The red point is the annotation of the original coordinate position in 2D images 90

5.19 The diagram to estimate the rotation and translation from the infrared camera to RGB camera 91

5.20 Some examples of stereo-calibration results 91

5.21 Some sample video sequences in our datasets 94

5.22 The RMS error on the Talking Face video: 5.3 pixels, compared to 6.8 pixels for FaceTracker on this video 95

5.23 Some examples of face tracking on the YouTube Celebrities and Honda/UCSD datasets 96

5.24 The accuracy evaluation of U3PT on each subset of videos 97

5.25 The robustness evaluation of U3PT on each subset of videos 98

A.1 The pinhole camera model of POSIT (courtesy of [33]) 106

A.2 How to compute LBP of a pixel (courtesy of [90]) 107


3.1 The list of shape and animation units in our study 28

3.2 The comparison of robustness (Ps) and accuracy (Eyaw, Epitch, Eroll and Em) between two descriptors, intensity and SIFT, on the uniform-light set of the BUFT dataset 35

3.3 The comparison of robustness (Ps) and accuracy (Eyaw, Epitch, Eroll and Em) between different ranges of Yaw on the uniform-light set of the BUFT dataset 35

3.4 The comparison of robustness (Ps) and accuracy (Eyaw, Epitch, Eroll and Em) with the baseline method on the uniform-light set of the BUFT dataset. The results are computed at two different thresholds (95% and 100%) of robustness 40

3.5 The cross-validation using accuracy for the α and β terms of the adaptive method on the jam1.avi video 43

3.6 The comparison of robustness (Ps) and accuracy (Eyaw, Epitch, Eroll and Em) of adaptive local models with our previous results on the BUFT dataset 44

3.7 The comparison of robustness (Ps) and accuracy (Eyaw, Epitch, Eroll and Em) of adaptive local models with state-of-the-art methods on the uniform-light set of the BUFT dataset 44

4.1 The evaluation of the view-based appearance model based method on the BUFT dataset 58

4.2 The pose precision of our method and state-of-the-art methods on the uniform-light set of the BUFT dataset 66

5.1 Comparison with state-of-the-art methods on the 300-W dataset using 68 landmarks 80

5.2 Comparison with state-of-the-art methods on the 300-W dataset using 51 landmarks 81

5.3 The tracking performance on the uniform-light set of BUFT dataset 93

5.4 The performance of methods on our own U3PT dataset 97


- of a person's head relative to the camera view. As commonly used in the literature [82], we adopt the three terms Yaw (or Pan), Pitch (or Tilt) and Roll for the three axial rotations. The Yaw orientation is computed when, for example, rotating the head from right to left. The Pitch orientation is related to the movement of the head from forward to backward. The Roll orientation is obtained when bending the head from left to right. The 6 DOF are considered as rigid parameters. ii) The non-rigid parameters describe the facial muscle movements or facial animation, which are usually the early step in recognizing facial expressions such as happy, sad, angry, disgusted, surprised, and fearful. Indeed, the handling of non-rigid parameters is often cast in the form of detecting and tracking facial points. These points are known as fiducial points, feature points or landmarks in the face processing community. The word "indexing" in our report means that rigid and non-rigid parameters are estimated from video frames. Our aim is to read a video (or a webcam stream) capturing a single face (in case of multiple faces, the biggest face is selected) and to output the parameters (rigid and non-rigid) of the video frames. Indeed, the non-rigid parameters are difficult to represent because their representation depends on the application. In our study, they are represented indirectly by localizing or detecting feature points on the face.
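The three axial rotations above can be made concrete as rotation matrices. The sketch below (Python with NumPy) composes a head rotation from Yaw, Pitch and Roll; the axis assignment and composition order are our assumptions for illustration, since conventions differ across papers.

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    """Compose a 3D head rotation from the three axial angles (radians).

    Assumed convention (for illustration only): yaw about the vertical
    y-axis, pitch about the lateral x-axis, roll about the camera z-axis,
    applied as R = Rz(roll) @ Ry(yaw) @ Rx(pitch).
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])  # yaw
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])  # pitch
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])  # roll
    return Rz @ Ry @ Rx
```

Together with a 3D translation vector, such a matrix accounts for the 6 DOF rigid parameters discussed here.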


Figure 1.1: Three head orientations: Yaw, Pitch and Roll.

There are several potential applications in many domains which use face tracking. The most popular one is recognizing facial behaviors in order to support an automatic system for human communication understanding [76]. In this context, the visual focus of attention of a person is a very important cue to recognize. It is a nonverbal communication channel or an indicative signal in a conversation. For this problem, we first have to analyze the head pose to determine the direction where people are likely looking in video sequences. It makes sense that people may be focusing on someone or something while talking. Furthermore, there are important meanings in head movements as a form of gesturing in a conversation. For example, head nodding or shaking indicates that people understand or misunderstand, or agree or disagree, with what is being said. Emphasized head movements are a conventional way of directing someone to observe a particular object or location. In addition, the head pose is intrinsically linked with the gaze: head pose gives a coarse estimation of gaze in situations where the eyes are invisible, such as low-resolution imagery, very low-bit-rate video recordings, or eye occlusion due to wearing sunglasses. Even when the eyes are visible, the head pose helps to predict the gaze direction more accurately. There are other gestures able to indicate dissent, confusion, consideration, etc. Facial animation analysis is also necessary to be able to read what kind of expression people are showing. The facial expression is a natural part of human communication; it is one of the most cogent means for human beings to infer the attitude and emotions of other persons in the vicinity. Expression analysis, which requires facial animation detection, is a crucial topic not only in machine vision but also in psychology [1, 86]. Head gestures and expressions complement each other to clarify more comprehensively the interactions, intentions, emotions, etc. between persons.


Chapter 1 Introduction 3

3D face tracking plays an important role in many other applications. For example, integrating the estimation of facial animation allows a human user to naturally control a distant avatar in virtual environments [21]. It is useful for controlling the facial animations of virtual characters in video games, e.g. Second Life or Poker 3D, for creating characters as realistic as possible in animated movies, or for enhancing remote communication in a collaborative virtual environment. Vision-driven user interfaces for Driver Fatigue Detection Systems detect the early signs of fatigue or drowsiness to raise an alarm while driving a car [35, 114]. Human-machine interaction can control computers or devices using face gestures [107, 109]. Face tracking and pose estimation are useful for building teleconferencing and Virtual Reality interface systems [51, 113]. Face tracking is a compulsory stage of the face recognition system in video security applications that track and identify people, because the faces have to be tracked (at least the rigid parameters) before pre-processing stages such as normalizing or estimating the frontal view of the face. Many monitoring systems involve information associated with the face position. In addition to security issues, they can be used for marketing. One example is the tracking and recognition of human activity, which can be used in supermarkets to track customer behavior. This kind of system analyses human actions to understand their behaviors and, for instance, give efficient recommendations for product arrangement in a supermarket.

Nowadays, there are many types of camera devices which could be used for 3D face tracking, such as Kinect, 3D infrared cameras or stereo cameras. Even though these cameras can provide more useful information (like depth), allowing more robust and accurate tracking, monocular 2D cameras are still our choice in this thesis. The first reason is that 3D cameras are usually installed only for indoor applications. Second, they are more expensive and not portable enough to be integrated into small devices such as a mobile phone or a chat webcam. Third, complicated calibration and configuration are often required to use such devices. Therefore, there is still ample room for applications of monocular cameras. If a face tracking system is built successfully using a monocular camera, it can be extended to a stereo system or combined with depth cameras to improve performance.


Figure 1.2: Some challenging conditions of 3D face tracking: illumination, head rotation, cluttered background in one sample video sequence.

1.2 Challenge

In the literature, the estimation of rigid and non-rigid parameters with a monocular camera has usually been considered as two separate tasks: pose estimation [82] and face alignment or facial feature point detection [116]. When attempting to estimate the rigid parameters or head pose under wide rotation, previous works did not focus on non-rigid parameters. Conversely, when trying to represent the facial animation efficiently, it was difficult to address the wide range of rotations. Indeed, the simultaneous estimation of rigid and non-rigid parameters is a very difficult task because dozens of parameters have to be estimated simultaneously under various challenging conditions. In order to build a robust face tracking system for both rigid and non-rigid parameters, researchers have to face several difficulties, given below:

• 3D-2D projection: The projection is an ill-posed problem because the observation from monocular cameras is only 2D, whereas the recovery of 3D pose or the estimation of rigid parameters is in 3D.

• Illumination variations: Face appearance depends on light from sources at different locations; moreover, light reflection is not the same on all parts of the face and depends on the light sources, camera, or objects. Such conditions change head appearance in different ways, as can be seen in Fig. 1.2.

• Head orientation: Obviously, a head does not look the same from different sides, for instance between the frontal and profile views. When changing the head orientation relative to the camera, the face appearance changes depending on the visible parts of the face.

• Biological appearance variation: This is a problem that all face processing systems have to tackle. In other words, the tracking system has to be able to track the face of any unknown person.


– Intra variability: For the same person, different conditions such as short or long hair, beard and mustache, or make-up can change the face appearance.

– Extra variability: Different persons have different face appearances because of the different sizes and shapes of their eyes, noses, and mouths. Different facial hair and beards are other factors which make tracking less accurate.

• Simultaneous non-rigid expressions: The estimation of non-rigid parameters usually depends on some facial points around the eyes, nose and mouth corners. It is difficult to collect the amount of training data needed to build statistical models of shape and appearance, because a large number of annotated correspondences is generally required. Moreover, the estimation of dozens of rigid and non-rigid 3D parameters from noisy observations is challenging.

• Occlusions and self-occlusions: When wearing accessories such as glasses or hats, a part of the face is occluded, or half of the face is invisible in the case of profile views.

• Low resolution: Although it is not really a big problem thanks to today's high-resolution cameras, in some settings, like surveillance systems, low resolution usually occurs when people stay at a distance from the camera.

• Cluttered or ambiguous backgrounds probably cause the tracking system to fragment.

• Moreover, face tracking has to be real-time for many applications.

In addition, many stages must be rigorously integrated to build an automatic and robust 3D face tracking system for video sequences. First, face detection localizes the face in the first frame. Second, the alignment stage aligns the face model to the 2D facial image. Third, the tracking stage infers the face model at the next time step given the head model at a given time. Fourth, the recovery stage quickly recovers from tracking failures. Each stage is itself a challenging task in reality.
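The four stages above can be sketched as a minimal control loop. The stage functions passed in (detect_face, align_model, track, recover) are hypothetical placeholders for whatever detector, alignment, tracker and recovery modules are plugged in, not the thesis implementation:

```python
def run_tracker(frames, detect_face, align_model, track, recover):
    """Detect and align on the first frame, then alternate tracking
    with recovery whenever the tracker reports failure (None)."""
    state = None
    results = []
    for frame in frames:
        if state is None:                    # stage 1: face detection
            box = detect_face(frame)
            state = align_model(frame, box)  # stage 2: model alignment
        else:
            state = track(frame, state)      # stage 3: frame-to-frame tracking
            if state is None:                # stage 4: failure recovery
                state = recover(frame)
        results.append(state)
    return results
```

The loop makes explicit why each stage matters: without recovery, a single tracking failure would force a full re-detection or end the track.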

1.3 Objectives

Because the face is a deformable object, as discussed above, one of the major issues that is really challenging in real-world applications is the need for a large corpus of training data with ground truth covering a wide range of rotations, expressions, shapes and so on. In addition, a lot of human resources have to be spent on annotation, which seems very costly. However, there are still some ways to circumvent this issue in practice. The main aim of this thesis is to propose an algorithm which is able to track both facial pose and animation. We focus not only on divergence but also on accuracy, as well as on developing some methods to overcome current issues and make the application realistic. For that, we want to tackle the following challenging problems:

• A lot of studies have been proposed to estimate rigid and non-rigid parameters separately because of the challenges discussed above. In this study, we want to analyse how to track them simultaneously: the rigid parameters can help to estimate the non-rigid parameters and vice versa.

• Handling a wide range of poses is still a big challenge, and this problem is a focus of our study. We propose a combined approach using both off-line and on-line information for this problem.

• Our goal is to build a robust 3D face tracking framework, so a large amount of training data is required to build statistical models. Data collection is too expensive; moreover, annotating the hidden landmarks in profile views is very difficult. So, we propose the use of a synthetic dataset to train the tracking model. It is not expensive to collect and is useful for addressing the hidden landmarks of profile views.

• We investigate the utility of synthetic data, and then combine it with available real datasets, aiming to work reasonably on in-the-wild videos. Our study is the first work approaching efficient 3D tracking this way, where a combination of a cascaded-regression-based method and matching is adopted. In our study, we aim to develop a method that is able to achieve reasonable results for 3D in-the-wild tracking.

• We propose to create a new database with 3D pose ground truth. This dataset is recorded under challenging conditions: wide rotation, expression, occlusion, cluttered background, etc. We set up a recording system (RGB camera + 3D infrared cameras) that allows us to capture accurate 3D ground truth while people move comfortably in front of the system. Furthermore, the dataset is recorded at different levels of difficulty, enabling tracking to be evaluated more precisely (robustness, accuracy, profile-tracking capability) than with any dataset ever reported. This database could be a contribution to the vision community as a benchmark for in-the-wild tracking.

Chapter 2: To report a review of state-of-the-art methods This includes a detaileddiscussion in different perspectives of methods such as head model, tracking approaches.The chapter serves as a basis analysis of one existing method, where the strengths andweaknesses of them are discussed, to figure out which approach is good for our problems.Chapter 3: To present a basic investigation into the utility of synthetic dataset for rigidtracking We develop the baseline framework based on the idea of one published work

to derive a new efficient way of tracking rigid faces The applicability of the approach

is empirically evaluated through some experiments on a face tracking dataset Resultsthat we achieved are encouraging to improve in next chapters An evaluation of twodifferent descriptors used in our framework is also reported This chapter is mandatoryfor developments of our methods in next ones

Chapter 4: To describe the proposed two-step method that shows the applicability ofusing the wide baseline matching or tracking-by-detection to improve the robustness ofout-of-plane tracking a) The first step benefits matching between the current frameand some adaptive preceding-stored keyframes to estimate only rigid parameters Bythis way, our method is sustainable to fast movement and recoverable in terms of losttracking b) The second step obtains the whole set of parameters (rigid and non-rigid)

by a heuristic method using pose-wise SVMs This way can efficiently align a 3D modelinto the profile face in similar manner of the frontal face fitting The combination ofthree descriptors is also considered to have better local representation

Chapter 5: To report two improvements over the earlier chapters. In the first stage, we propose a new face alignment method using deep learning, based on the Supervised Descent Method (SDM) and on features learned by a Restricted Boltzmann Machine (RBM). We give details of its derivation and of the efficiency of learned features compared to hand-designed features from a face alignment point of view. We also propose the combination of local correlative regression and global regression to improve accuracy. The performance reported in this chapter is comparable to recent state-of-the-art methods. This study provides the alignment of the first video frame in the framework of the next chapter.

In the second stage, we extend the face alignment method of the first stage to enable the framework to work in in-the-wild conditions. In addition, we propose to use a 3D Morphable Model (3DMM) to generate a better synthetic dataset for training, combined with real databases. Empirical evaluations are performed on our own face tracking dataset, captured by a new recording system. Analyses of the results are presented along with ideas for further performance gains.

Chapter 6: To sum up the study with an overview of its contributions and give directions for future work.

1.5 Notation

• Scalars: are written in italics, in either lower- or upper-case, for example: a and B

• Vectors: are written in lower-case non-italic boldface, with components separated by spaces, for example: v = [a b c]^T

• Matrices: are written in upper-case non-italic boldface, for example: M; however, we sometimes use Greek symbols instead, such as Φ

• Functions: are typeset in upper-case Ralph Smith's Formal Script (RSFS), for example: F, G

• Function composition: is denoted by the ◦ symbol, for example: F ◦ G

• Perspective projection: is denoted as P; for example, P(v) is the projection of a 3D point v to 2D
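As a concrete instance of this notation, under a pinhole camera with focal length f and the principal point at the image origin (an assumption made here for illustration only), the projection of a point in camera coordinates reads:

```latex
\mathscr{P}(\mathbf{v}) \;=\; \frac{f}{z}\begin{bmatrix} x \\ y \end{bmatrix},
\qquad \mathbf{v} = [x \; y \; z]^{\mathrm{T}},\quad z > 0 .
```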

Chapter 2: State-of-the-art

In this section, the state-of-the-art methods in the literature are categorized into groups according to their tracking technique, that is, the technique used to align the face model to the 2D frames of a video sequence. Conveniently, each group of tracking methods tends to use one specific 3D model, feature descriptor and type of training data; for example, discriminative methods for face tracking often use Linear Deformable Models (LDMs) with local descriptors, and the face model is trained on an off-line dataset. The review of tracking methods therefore also covers face models, features and data. Nevertheless, we first discuss 3D face models, since they largely determine which tracking approaches can be used. In addition, robustness to profile views, one of our main objectives, is treated in a separate part to examine how previous works track profile faces efficiently. Some evaluation datasets are also discussed at the end of this section.

2.1 Face Models

To estimate the facial pose or head orientation (rigid parameters) in video sequences, three basic ways are commonly used. First, a 3D model is used to track and directly estimate the head pose. Second, 2D feature points are tracked, and the pose is estimated by fitting a given 3D face model to these points (sometimes using geometrical characteristics instead of a 3D model). Third, the 2D face region is tracked, features are extracted, and the pose is estimated directly from its appearance using trained models, without the need for a 3D face model. To estimate the facial shape and animation, 2D or 3D models of the feature points (also named facial points or fiducial points) are considered.


2.1.1 Linear Deformable Models (LDMs)

The most popular face model in the literature is the Linear Deformable Model (LDM). Some instances of this model represent deformability only in shape, others in both shape and appearance. The main idea is that, from a set of annotated training images, separate linear models of shape and appearance are learned through Principal Component Analysis (PCA) [89]. The pioneering work on LDMs is the Active Shape Model (ASM) proposed by [24, 25], which learns a PCA of shapes. ASMs were extended to Active Appearance Models (AAMs) [26] by combining two separate linear models of shape and appearance. At first, ASMs and AAMs were 2D; they were then extended to 3D by adding more parameters [122] or by using structure-from-motion to learn from 2D correspondences [106]. ASMs and AAMs are mostly used for face alignment (or landmark detection) rather than (rigid) pose estimation. A well-known problem of AAMs is their weak generalization capability, because the facial texture space is too large to be sufficiently captured. Recently, LDMs combined with local features have become more popular thanks to the development of invariant local descriptors [7, 97, 111, 118].
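The core of these linear models can be sketched in a few lines: given training shapes stacked as vectors, PCA yields a mean shape and a basis Φ so that any shape is approximated as s ≈ s̄ + Φb. The snippet below is a minimal illustration on synthetic shapes, not the implementation of the cited works; all sizes and data are made up.

```python
import numpy as np

def train_pca_shape_model(shapes, var_kept=0.95):
    """Learn a linear shape model s = mean + Phi @ b from stacked shape vectors.

    shapes: (N, 2L) array, each row the x/y coordinates of L landmarks.
    Returns the mean shape and the PCA basis keeping `var_kept` of the variance.
    """
    mean = shapes.mean(axis=0)
    X = shapes - mean
    # Eigen-decomposition of the covariance via SVD of the centered data.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    var = (S ** 2) / (len(shapes) - 1)
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_kept)) + 1
    return mean, Vt[:k].T            # basis columns are the shape modes

# Toy data: noisy variations of a 4-landmark square.
rng = np.random.default_rng(0)
base = np.array([0., 0., 1., 0., 1., 1., 0., 1.])
shapes = base + 0.05 * rng.standard_normal((100, 8))
mean, Phi = train_pca_shape_model(shapes)

# Project a new shape into the model and reconstruct it.
s = base + 0.05 * rng.standard_normal(8)
b = Phi.T @ (s - mean)
s_rec = mean + Phi @ b
```

Fitting then amounts to constraining any observed landmark configuration to lie (approximately) in the span of the learned modes.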


The Candide model [3] is another 3D face model designed to manage both rigid and non-rigid parameters in the same framework. In contrast to the previous LDMs, this model separates the shape and animation parameters, and it is publicly available. The latest version, Candide-3, is applied in many works [4, 36, 100]. It is an efficient model for frontal faces but, to the best of our knowledge, no work has used it to track profile faces.
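Concretely, Candide expresses the mesh g as a mean shape deformed by separate static shape units and dynamic animation units (σ and α denote the shape and animation parameter vectors):

```latex
\mathbf{g}(\boldsymbol{\sigma}, \boldsymbol{\alpha}) \;=\; \bar{\mathbf{g}} \;+\; \mathbf{S}\,\boldsymbol{\sigma} \;+\; \mathbf{A}\,\boldsymbol{\alpha}
```

where the columns of S encode person-specific shape variation, the columns of A encode expression, and the 6-DOF rigid pose is applied on top of g.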

The 3D Morphable Model (3DMM) [14], which is closely related to AAMs, is a technique for modeling textured 3D faces. The difference from AAMs is that a 3DMM is trained directly from 3D scans. Ideally, the 3DMM is the most appropriate model for 3D face tracking, because it is truly 3D and covers large regions on both sides of the head, which is significant for keeping tracking robust on profile faces. However, it is a "heavy" model, because a large number of vertices needs to be controlled. In addition, the texture on the two sides is not well reconstructed in current works. Only a few works, such as [80, 130], use this model for 3D face tracking.

Many other deformable models exist in the literature but are not commonly used. For example, [17, 32] manually constructed a 3D polygon mesh realized by a set of parameterized deformations; however, this model is limited in expressing some complicated motions, for example lip deformations. [67] constructed textured 3D face models from videos with some initial user interaction at the beginning.

2.1.2 Rigid Models

These approaches use pre-defined 3D geometric models, such as cylinders, ellipsoids or cubes, to represent the 3D head. They address only the estimation of rigid parameters. Some of them are the following:

Planar Model: a rectangle that captures the face region in the video sequence. For example, [135] used this model in a combination of face detection and 3D pose tracking for pose estimation; however, the head model has 6 degrees of freedom to initialize, and the experiments indicate that this kind of model is too simple and not efficient enough to track large rotations.

Cylindrical Model: the most popular rigid model used for pose estimation. It is the surface of points at a fixed distance from a given axis, parameterized by a 3D position and orientation. For tracking, the model texture is usually obtained by warping the frame texture from the video. This model is appropriate for tracking a wide range of rotations, especially in Yaw [2, 20, 50, 79, 103, 123], because it approximates the human head reasonably with few degrees of freedom.
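A minimal sketch of such a model: cylinder vertices standing in for the head surface, moved by the 6-DOF rigid pose. Radius, height and sampling density below are illustrative guesses, not values from the cited works.

```python
import numpy as np

def cylinder_vertices(radius=0.09, height=0.22, n_theta=24, n_h=10):
    """Sample 3D points on a cylinder approximating the head surface (meters)."""
    theta = np.linspace(-np.pi, np.pi, n_theta, endpoint=False)
    h = np.linspace(-height / 2, height / 2, n_h)
    T, H = np.meshgrid(theta, h)
    return np.stack([radius * np.sin(T), H, radius * np.cos(T)], axis=-1).reshape(-1, 3)

def rigid_transform(points, yaw, pitch, roll, t):
    """Apply the 6-DOF pose (three rotations + a translation) to the model."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # Yaw
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # Pitch
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # Roll
    return points @ (Rz @ Rx @ Ry).T + t

verts = cylinder_vertices()
posed = rigid_transform(verts, yaw=np.deg2rad(30), pitch=0.0, roll=0.0,
                        t=np.array([0.0, 0.0, 0.6]))
```

Tracking then amounts to estimating the six pose parameters that best re-align the warped model texture with each new frame.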

Ellipsoid Model: this model [5, 9, 23] is supposed to be better than the cylindrical one because it shows relatively more stable performance on Pitch movements. In addition, it can approximate the forehead well, since the geometric shape of the forehead is curved. However, controlling this model is more difficult than controlling the cylindrical one.

Mesh Model: a collection of vertices and polygons defining a fixed shape structure of the 3D face [112]. This kind of model is designed by hand, not learned from a training set. It is appropriate for modeling and tracking the 3D face; however, it is not flexible, because it needs to be adjusted to the specific face before every tracking session. In addition, the initialization of this model at the first frame is challenging. Other mesh models designed by hand exist, such as [131].

Most rigid head models are designed for rigid estimation only; they are easy to control and useful for tracking large rotations, but they are not appropriate for non-rigid estimation. However, it is possible to combine them with non-rigid models to obtain better estimates.

2.2 Tracking Approaches

In our study, we classify the state-of-the-art methods into categories depending mainly on the tracking approach. However, we also discuss some important properties, such as the features (local or global), adaptive versus off-line learning, and the type of training data (real, synthetic or hybrid). Other properties are sometimes mentioned, for example which parameters are estimated (rigid and non-rigid), the recovery from tracking failure, the robustness to occlusion, illumination and cluttered background, and the computational cost. The problem of out-of-plane rotation is considered in a separate section afterward, because it is one of the most important problems that must be solved to build a robust tracker.

2.2.1 Optical Flow Regularization

This approach tracks a set of vertices (given or random points) on the face model, regularized by optical flow. These points are not used to explicitly extract appearance as in other models; instead, they are tracked for motion estimation using optical flow or similar methods, with the constraint of a 3D model known in advance [63]. [32] regularized rigid and non-rigid motion tracking of the 3D face using optical flow as a hard constraint on the extracted motion. [17] estimated the rigid and non-rigid motion using a dense set of vertices from a 3D model in a recursive Bayesian framework, where the motion is measured using optical flow regularized by the 3D model. [131] proposed a motion segmentation algorithm, using an extended super-quadric, based on the motion residual error to detect occluded areas; here too, the motion is an optical flow estimate regularized by a 3D model.
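The measurement underlying all of these methods is the Lucas-Kanade linearization of brightness constancy: stacking I_x d_x + I_y d_y = -I_t over a patch and solving the resulting 2×2 normal equations. The self-contained sketch below runs on a synthetic one-pixel shift; the window size and test pattern are illustrative.

```python
import numpy as np

def lucas_kanade_step(I, J, x, y, win=7):
    """One Lucas-Kanade step: least-squares flow of the patch centred at (x, y)."""
    Iy, Ix = np.gradient(I)                  # np.gradient returns (d/drow, d/dcol)
    It = J - I                               # temporal difference
    s = np.s_[y - win:y + win + 1, x - win:x + win + 1]
    A = np.array([[np.sum(Ix[s] * Ix[s]), np.sum(Ix[s] * Iy[s])],
                  [np.sum(Ix[s] * Iy[s]), np.sum(Iy[s] * Iy[s])]])
    b = -np.array([np.sum(Ix[s] * It[s]), np.sum(Iy[s] * It[s])])
    return np.linalg.solve(A, b)             # estimated displacement (dx, dy)

# Smooth synthetic texture shifted right by exactly one pixel.
X, Y = np.meshgrid(np.arange(64.0), np.arange(64.0))
I = np.sin(X / 6.0) + np.cos(Y / 5.0) + 0.5 * np.sin((X + Y) / 8.0)
J = np.roll(I, 1, axis=1)                    # content moves +1 px in x
dx, dy = lucas_kanade_step(I, J, x=32, y=32) # expect dx close to 1, dy close to 0
```

The drawbacks listed below follow directly from this formulation: the linearization assumes small motion and constant brightness, and the normal matrix becomes ill-conditioned on noisy or textureless regions.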

This approach uses 3D models to estimate the rigid parameters directly and can run in real time. However, because optical flow is used to construct the tracking model, it has some known drawbacks: sensitivity to illumination, problems with fast motion, and a lack of robustness to image noise. Moreover, this approach is usually adaptive across frames, so it suffers from accumulated drift and has difficulty recovering from tracking failures, and the face model has to be initialized manually at the first frame.

2.2.2 Manifold Embedding

Any dimensionality reduction algorithm can be a candidate for manifold embedding. With a rigid head model (three positions and three orientations), it is possible to embed a high-dimensional head image into a low-dimensional manifold space constrained by pose variations. For a non-rigid model, the problem is much more difficult, because it is challenging to create an algorithm that recovers both rigid and non-rigid variations while ignoring other variations of the image. This approach usually tracks only the 2D rectangle of the face region in the video sequence and estimates the head pose directly from image features, without the need for a 3D model. [77] used SVD to learn view-based subspaces to reconstruct a prior model; the prior model helps estimate the pose change from the current frame, which is tracked with a Kalman Filter. The subspaces can also be learned by non-linear embedding of a large set of training faces using Isomap [91]. [66] tracked the non-rigid face using a linear combination of 1D manifold fields of facial expression. [49] proposed Supervised Local Subspace Learning, which learns a mixture of local subspaces for continuous pose estimation.
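The embed-then-regress idea can be made concrete with a toy example: high-dimensional "images" generated from a 1-D yaw manifold are embedded by PCA, and a linear regressor maps the embedding to the yaw angle. This only illustrates the principle; it is not the Isomap or SVD pipeline of the cited works, and all sizes and data are synthetic.

```python
import numpy as np

# Synthetic stand-in for head images: 100-D appearance vectors lying on a
# 1-D manifold parameterized by yaw (latent factors are sin/cos of yaw).
rng = np.random.default_rng(2)
yaw = np.linspace(-60.0, 60.0, 121)                    # degrees
basis = rng.standard_normal((2, 100))
latent = np.stack([np.sin(np.deg2rad(yaw)), np.cos(np.deg2rad(yaw))], axis=1)
images = latent @ basis + 0.01 * rng.standard_normal((121, 100))

# Linear embedding (PCA), then a linear pose regressor on the 2-D embedding.
mean = images.mean(axis=0)
U, S, Vt = np.linalg.svd(images - mean, full_matrices=False)
Z = (images - mean) @ Vt[:2].T                         # 2-D embedding
W, *_ = np.linalg.lstsq(np.c_[Z, np.ones(len(Z))], yaw, rcond=None)

# The pose of an unseen view is read off its embedding coordinates.
unseen = np.array([np.sin(np.deg2rad(25.0)), np.cos(np.deg2rad(25.0))]) @ basis
pred_yaw = np.r_[(unseen - mean) @ Vt[:2].T, 1.0] @ W
```

A linear embedding suffices here because the toy manifold is nearly flat; the non-linear embeddings cited above address exactly the cases where it is not.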

This approach has some drawbacks First, non-rigid modeling is hard because its ifold is probably complex Second, the problem of time consuming and adaptation

Trang 27

man-Chapter 2 State-of-the-art 15

because of the complexity of learning model, an important issue in tracking Third,the training data have to be well-prepared with careful annotation that covers the largefacial variations because it is sensitive to noisy and inaccurate labels In addition, theweak generalization is another problem which needs to be solved

2.2.3 Template-based

This approach registers appearance templates of the face to each new frame. An early direction combines warping templates and an illumination basis to handle the registration problem in the presence of lighting variation. [123] proposed a similar method that uses dynamic templates and applies the Iteratively Re-weighted Least Squares (IRLS) method for registration, to be robust to illumination and self-occlusion. [4] proposed a linear model relating the facial parameters to the appearance of face images; this method is only robust for face and landmark tracking on near-frontal faces. [88] cast 3D tracking as a non-linear optimization problem using a 3D morphable model, so that many existing numerical methods can be applied. [22, 58] rendered faces from the frontal image to learn templates as means and covariance matrices of local patches; to estimate the rigid and non-rigid parameters, the distance between the observation and the templates is computed. [36] proposed a fast method to track the 3D face using an on-line appearance model; the appearance model is global and adaptive, and the rigid and non-rigid parameters are estimated from the residual error between the template and the current observation. [78] developed 3D face tracking using an adaptive view-based 3D head appearance model; the current head pose is estimated from the current observation and previously selected static keyframes, to avoid the drift caused by adaptation.
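IRLS, as used by the registration methods above, repeatedly solves a weighted least-squares problem in which large residuals (e.g. occluded or badly lit pixels) are down-weighted. A minimal sketch with Huber weights on a toy linear system follows; the matrix A stands in for the registration Jacobian, and the sizes and threshold k are illustrative.

```python
import numpy as np

def irls(A, b, iters=20, k=1.0):
    """Iteratively Re-weighted Least Squares with Huber weights.

    Minimizes a robust cost over the residuals r = b - A x by repeatedly
    solving the weighted normal equations (A^T W A) x = A^T W b.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # plain LS start
    for _ in range(iters):
        r = b - A @ x
        w = np.where(np.abs(r) <= k, 1.0, k / np.abs(r))   # Huber weights
        Aw = A * w[:, None]
        x = np.linalg.solve(Aw.T @ A, Aw.T @ b)
    return x

# Toy registration-style problem: 20% of the "pixels" are grossly corrupted.
rng = np.random.default_rng(3)
A = rng.standard_normal((200, 2))
x_true = np.array([2.0, -1.0])
b = A @ x_true + 0.05 * rng.standard_normal(200)
b[:40] += 8.0 * rng.standard_normal(40)               # outliers (occlusion)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]           # kept for comparison
x_rob = irls(A, b)
```

The robust estimate stays close to the true parameters even though a fifth of the equations are corrupted, which is exactly the behaviour needed under self-occlusion.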

Active Appearance Models (AAMs) also belong to this family. The original AAM was first proposed by [26]; its forward local search was later made faster by a more efficient method named Inverse Compositional (IC) fitting [71]. [27] introduced AAMs at different views to simultaneously align the face and estimate the 3D pose. [122] used non-rigid structure-from-motion to construct 3D modes while enforcing the constraints of 2D AAMs, creating 2D+3D AAMs. Because they are modeled from 2D correspondences, these 3D models are pseudo-3D and ambiguous for estimating the 3D shape and pose in practice. Recently, [110] sped up the fitting algorithm and allowed AAMs to work well with in-the-wild data. AAMs were developed for the face alignment problem, but extending them to 3D face tracking is simple; the same holds for the other face alignment algorithms presented in the next sections. In contrast to AAMs, the 3D Morphable Model (3DMM) [14] is built on a real 3D shape and texture space. Similarly to the above, Inverse Compositional fitting can be deployed to make 3DMMs usable for tracking [93], or the Lucas-Kanade algorithm can be used to link the textures of the templates and the current frame [80].

With this approach, the method can be off-line [20, 26] or on-line [4, 36, 78]. The observation is simple to compute from the current shape, and the parameter estimation is cheap, using the residual error between the current observation and the template. However, with a 3DMM a large number of vertices has to be controlled, which slows down tracking; in other words, the 3DMM is a "heavy" model for real-time tracking. The training data needed to learn the appearance models can be real images [20, 26, 71] or synthetic images [4, 22]. However, some hard problems arise: a) If the template is learned off-line, obtaining the ground-truth needed to learn the appearance model and estimate the 3D head pose well is critical. Annotation is a real problem, especially for profile views, where the positions of the hidden landmarks cannot be known exactly; this is perhaps the reason why most off-line-training methods fail to work and to estimate the 3D head pose well at profile views. b) On-line methods deal with profiles more efficiently, but their robustness, their recovery from tracking failures and their resistance to accumulated drift must be further improved to work in in-the-wild conditions.

2.2.4 Local Matching

This approach is based on matching local points of interest, regularized by 3D models. The points of interest are identified by detectors [98], and local descriptors (e.g., [68]) invariant to some transformations (scale, rotation, translation) are extracted. They can also be simple salient points such as the corners of the eyes, nose and mouth. By matching these points between the model and the current observation, the 3D tracking problem, or 3D head pose tracking, is cast as a Perspective-n-Point problem [61]. [46] used local matching to track feature points and a Kalman Filter (KF) to relocate lost features; the pose is computed using multiple triplets of feature points. [69] tracked the rigid and non-rigid motions by using normalized cross-correlation [132] to match local points between exemplars and the current observation, constrained by a 3D model; this method aims to tackle occlusion and illumination problems. [100] used Structure from Motion (SfM) to track on-line features via an adaptive texture-mapped 3D head model in a KF framework; the use of an adaptive template makes the system robust to large rotations. [112] formulated the tracking problem as local bundle adjustment via correspondence matching; by combining off-line and on-line keyframes, the method is quite robust to drift, jitter and significant aspect changes. [59] treated wide-baseline matching of keypoints as a classification problem, using a large number of synthesized views. [117] formulated the 3D pose tracking task in a Bayesian framework, fusing matching information from both the previous frame and keyframes into the representation of the posterior pose distribution. [50] proposed the use of SIFT matching with a cylindrical model; the set of feature points is registered and updated for each reference pose via IRLS. [115] proposed keypoint descriptor learning over a wide range of rotations in an incremental scheme to obtain discriminative descriptors.
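The geometric core of these methods, recovering the pose from 2D-3D matches, is the Perspective-n-Point problem; under a scaled-orthographic camera it reduces to a single linear least-squares step. The POS-style sketch below is a simplification for illustration; in practice a full perspective PnP solver, usually wrapped in RANSAC over the matches, would be used.

```python
import numpy as np

def scaled_orthographic_pose(model, image_pts):
    """Estimate rotation and scale from 2D-3D matches (POS-style, one step).

    model: (N, 3) model points; image_pts: (N, 2) matched pixel positions.
    Returns the 3x3 rotation and the projection scale.
    """
    A = np.c_[model, np.ones(len(model))]
    sol, *_ = np.linalg.lstsq(A, image_pts, rcond=None)   # (4, 2) solution
    i_vec, j_vec = sol[:3, 0], sol[:3, 1]                 # scaled rotation rows
    s = 0.5 * (np.linalg.norm(i_vec) + np.linalg.norm(j_vec))
    r1 = i_vec / np.linalg.norm(i_vec)
    r2 = j_vec / np.linalg.norm(j_vec)
    R = np.stack([r1, r2, np.cross(r1, r2)])              # complete the rotation
    return R, s

# Synthetic check: project a point cloud with a known yaw rotation.
rng = np.random.default_rng(4)
model = rng.standard_normal((30, 3))
yaw = np.deg2rad(20)
R_true = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
image_pts = 100.0 * (model @ R_true.T)[:, :2] + np.array([320.0, 240.0])
R_est, s_est = scaled_orthographic_pose(model, image_pts)
```

Because the solve is linear, outlier matches corrupt it badly, which is why sample-consensus schemes such as RANSAC are standard companions of this step.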

The face appearance between consecutive frames does not change much, so the keypoints detected in these two frames can be used to find the geometric correspondence. With a good strategy, such as using keyframes, these methods can work efficiently with out-of-plane rotation, occlusion and fast motion, and can recover from failed tracking. However, keypoint detection and matching is expensive, and these methods degrade when too few keypoints are detected on the face. In addition, this approach allows estimation of the rigid parameters only.

2.2.5 Discriminative

In this case, discriminative classifiers learn the appearance of local patches. These patches are constrained by a set of points (2D or 3D) or by a statistical shape model, e.g. an LDM, so this is commonly named the constrained local model approach. For each feature point, a "response image" or "response map" is computed, i.e. the map of confidence over all pixels in the sampling region; a shape configuration optimizing the total fitting cost is then sought by manipulating the shape model parameters. The pioneer of this approach is the Active Shape Model (ASM) proposed by [24, 25] for shape alignment. It was improved by [28] using Adaboost for non-rigid face alignment, and further significant improvements were proposed later [29, 75]. Later methods aim to robustify these local distributions so as to satisfy the properties that make the optimization efficient and stable, while still approximately preserving the true certainty/uncertainty of the local detectors. To do so, local experts are built to learn the response maps of the facial feature points. [29] used Random Forest Regression Voting to build the response map and aligned the model using shape constraints. [118] used a linear Support Vector Machine (SVM) to quickly approximate the response map via a Sum of Squared Differences (SSD) matching distance for face alignment, and then extended it

to face tracking [119]. The response map is parametrically forced to be a convex quadratic before fitting the shape model; this quadratic approximation may overcome the drawbacks of some local detectors, but its estimate can be poor in some cases, in particular for multi-modal response maps. To account for this, [42] approximated the response map as a Gaussian Mixture Model (GMM). The GMM approximates the response map better than a quadratic, but some major drawbacks remain: estimating the GMM parameters is a non-linear, locally convergent optimization that requires an a-priori number of modes, and it is computationally costly. [87] proposed a Bayesian modification of [118] so that the fitting can be regarded as the Maximum Likelihood (ML) solution. [97] proposed a non-parametric approach using the Mean-Shift algorithm to approximate the response map at multiple scales. [65] tackled the problem slightly differently: instead of fitting a shape model as in ASMs, facial components are used as constraints to regularize the facial points. [12] trained local experts on SIFT features with a linear SVM and used RANdom SAmple Consensus (RANSAC) [39] to constrain the configuration of feature points through a similarity measure over training exemplar images. [7] learned a mapping from a low-dimensional representation of the response maps to the shape increment through sequential regressors. [128] extended [97] with active contours to refine the shape variations. [121] proposed a unified model of three tasks (face detection, pose estimation and landmark estimation) using a tree-structured pictorial structure [126].
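The computation shared by these methods, scoring every candidate location of a landmark with a local expert to form the response map, can be sketched as follows. The matched-filter "expert" below is only a stand-in for a trained linear SVM (w, b), and all sizes are illustrative.

```python
import numpy as np

def response_map(image, w, b, patch=5):
    """Score every candidate position with a linear expert (w, b).

    A linear SVM decision value w.x + b over a local patch is a correlation,
    so the whole map is obtained by sliding the filter over the region.
    """
    H, W = image.shape
    r = patch // 2
    resp = np.full((H, W), -np.inf)
    for y in range(r, H - r):
        for x in range(r, W - r):
            p = image[y - r:y + r + 1, x - r:x + r + 1].ravel()
            resp[y, x] = w @ p + b
    return resp

# Toy expert: the "landmark" is a bright blob; w is a matched filter for it.
yy, xx = np.mgrid[-2:3, -2:3]
blob = np.exp(-(xx ** 2 + yy ** 2) / 2.0)
w, b = blob.ravel(), 0.0

img = 0.05 * np.random.default_rng(5).random((40, 40))
img[18:23, 26:31] += blob                    # landmark centred at (x=28, y=20)
resp = response_map(img, w, b)
peak = np.unravel_index(np.argmax(resp), resp.shape)   # (y, x) of best score
```

The methods above differ mainly in what they do with `resp` afterwards: taking the peak, fitting a convex quadratic or a GMM, or running mean-shift, before enforcing the shape constraints.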

The discriminative approach is developed for non-rigid rather than rigid estimation; however, if the shape model is 3D, a pose estimation system can also be deployed. An off-line real dataset with landmark ground-truth is required to train the statistical shape and landmark appearance models. In practice, this makes building such methods expensive for tracking unconstrained poses, for example profile views. When tracking the face between consecutive frames, only the fitting of the previous frame is used, which is an issue if the motion is too fast. For recovery, when tracking is lost, face detection is used to detect the frontal face and re-align the model as in the first frame. Because of the training data limitation, such models are developed largely for the frontal face; although [97] was able to work with profiles, only half of the face is really tracked, in a controlled environment, and it is essentially a view-based tracking model. This approach can be implemented in real time [97]; the key factor is speeding up the computation of the response maps. Most works use a linear SVM or a Random Forest to learn the local detectors, because these classifiers can be linear, which makes the computation of the response maps fast enough.

2.2.6 Cascaded Regression

This approach learns a cascade of regressors that map the local appearance around the current shape estimate directly to a shape update, as in the Supervised Descent Method (SDM); a sample-consensus strategy can additionally be used to remove incorrect landmark candidates. [125] improved the sensitive initialization of SDM by learning a combination of multiple, ranked hypotheses.

Like the discriminative approach, it was first developed for non-rigid face alignment, but it is easy to extend to tracking by using the aligned model of the previous frame as the initialization for the current frame. The training data need to be designed carefully, since this approach is sensitive to the training data and to the initialization of the model. So far, no cascaded method works with large Yaw angles, because of the limitation of the training data. Fast motion is another issue that needs to be addressed in this approach.
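The cascaded update at the heart of SDM-style methods, x ← x + Rφ(x) + b with a regressor (R, b) learned per stage, can be illustrated on a 1-D toy task: localizing a smooth step edge in a scanline. The data, window size and three-stage cascade are illustrative choices, not those of the cited works; the integer-window feature extraction keeps the sketch simple at the cost of roughly half-a-sample accuracy.

```python
import numpy as np

def extract(signal, pos, half=6):
    """Feature: the signal values in a window around the current estimate."""
    i = int(round(pos))
    return signal[i - half:i + half + 1]

def train_cascade(signals, targets, x0, stages=3):
    """Learn a cascade of linear regressors x <- x + [phi(x), 1] @ sol."""
    n = len(signals)
    x = np.full(n, float(x0))
    cascade = []
    for _ in range(stages):
        F = np.stack([extract(s, xi) for s, xi in zip(signals, x)])
        A = np.c_[F, np.ones(n)]
        sol, *_ = np.linalg.lstsq(A, targets - x, rcond=None)
        cascade.append(sol)
        x = x + A @ sol                      # apply the stage before the next
    return cascade

def run_cascade(signal, x0, cascade):
    x = float(x0)
    for sol in cascade:
        x += np.r_[extract(signal, x), 1.0] @ sol
    return x

# Toy task: localize a smooth step edge in a 1-D "scanline".
rng = np.random.default_rng(6)
grid = np.arange(64, dtype=float)
pos = rng.uniform(24, 40, 300)               # training edge positions
signals = 1.0 / (1.0 + np.exp(-(grid[None, :] - pos[:, None]) / 2.0))
signals += 0.01 * rng.standard_normal(signals.shape)
cascade = train_cascade(signals, pos, x0=32.0)

test_sig = 1.0 / (1.0 + np.exp(-(grid - 29.3) / 2.0))
pred = run_cascade(test_sig, 32.0, cascade)
```

The sensitivity to initialization mentioned above is visible in the sketch: every regressor is trained on features extracted around the estimates produced by the previous stage, so an out-of-distribution starting point breaks the whole cascade.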

2.2.7 Hybrid

This approach combines the aforementioned methods for robust face tracking, since each method is efficient at one specific aspect. For example, [103] combined two face models, an AAM and a cylindrical model: the cylindrical model keeps the tracking robust at profile views and provides a good initialization for AAM fitting at near-frontal views. [6] combined skin detection, distance vector fields and a Convolutional Neural Network (CNN) to track and recognize the head pose and gaze direction. The hybrid approach is a promising direction, because each individual method only addresses a narrower subset of the 3D head tracking problems. However, developing such a system requires more implementation effort to link many components; in addition, running many components in parallel increases the complexity of the system in terms of computational cost and flexibility, both highly important in practice.

2.3 Multi-view Tracking

There are three popular approaches to tackling large views in the 3D face tracking problem: i) using on-line adaptive models, ii) using view-based models and iii) using 3D synthetic models.

2.3.1 On-line adaptive models

Most popular approaches for large rotations use the cylindrical model. The initialization of the model at the first frame may be manual with a 3D non-rigid model, or automatic with the cylindrical model. During tracking, either a single appearance model is adapted [32, 100, 103, 115, 123], or multi-view keyframes are stored and matched with the current frame [50, 112]. Adaptive models can accumulate errors and drift away in long video sequences, while keyframe-based models can estimate only the rigid parameters and may be too slow for practical use.


2.3.2 View-based models

View-based models are often trained on a large off-line dataset. The dataset is divided into pose-wise groups, and the view-based models are trained for the corresponding views. The 3D face tracking is then decomposed into a mixture of pose-wise tracking models, where each view-based model takes charge of a smaller range of poses [27, 77, 97, 121, 128]. This approach has some drawbacks: a pose detector needs to be integrated to choose which view-based model to use for alignment, and building training data that covers a large range of rotations is very expensive. In particular, such a large dataset is much more difficult to obtain for non-rigid estimation.

2.3.3 3D Synthetic models

This approach uses a real 3D face model, such as a 3DMM, for tracking. With a real 3D model, it is possible to handle both 3D pose estimation and the non-rigid face. [93] proposed the use of the IC algorithm to fit a 3DMM to 2D images, initialized from the previous fitting. Other works track the profile view robustly by using optical flow [80] or matching [130] to track salient points before fitting the 3DMM. These methods track and update the 3DMM on-line, which can accumulate errors in long video sequences. For landmark detection, [43] used view-based patches from an off-line synthetic dataset to detect landmarks at large Yaw angles. The use of 3DMMs is a promising direction [43] for several reasons. Creating a good dataset for tracking costs a lot of effort, and annotating the hidden landmarks on the non-visible side is very hard. Because of this annotation problem, most datasets providing profile-view data have different numbers of landmarks for the frontal and profile views. This gap is the reason why other works adopted the aforementioned view-based models: dealing with hidden landmarks is difficult. In contrast, the use of a 3DMM bridges this gap while requiring only a single 3D tracking model, because the hidden landmarks at profile are known. The remaining problem is that the synthetic data must be of high quality to be usable in real contexts.
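The advantage that "the hidden landmarks at profile are known" can be sketched directly: rotating a 3D landmark set through sampled yaw angles yields 2D annotations together with self-occlusion flags for free. The five-point "face", the orthographic camera and the normal-based visibility test below are simplifications for illustration, not the 3DMM rendering of the cited works.

```python
import numpy as np

def synthesize_views(landmarks3d, normals, yaw_angles):
    """Generate pose-labeled 2D annotations from a 3D landmark set.

    For each sampled yaw, landmarks are rotated, orthographically projected,
    and flagged as self-occluded when their surface normal faces away from
    the camera (+z) -- so hidden profile landmarks come annotated for free.
    """
    views = []
    for yaw in yaw_angles:
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
        p = landmarks3d @ R.T
        visible = (normals @ R.T)[:, 2] > 0
        views.append({"yaw": yaw, "pts2d": p[:, :2], "visible": visible})
    return views

# Tiny 5-point "face": nose tip, two eye corners, two cheek points.
pts = np.array([[0.0, 0.0, 1.0], [-0.4, 0.3, 0.6], [0.4, 0.3, 0.6],
                [-0.8, -0.2, 0.1], [0.8, -0.2, 0.1]])
normals = pts / np.linalg.norm(pts, axis=1, keepdims=True)
views = synthesize_views(pts, normals, np.deg2rad([0, 45, 90]))
```

At a 90° yaw the right-side cheek point is automatically flagged as hidden while its 2D position remains defined, which is exactly the information that real profile annotations cannot provide.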


2.4 Discussion

A 3D face tracking framework is robust if it can operate with a wide range of pose views, facial expressions, environmental changes, fast motion and occlusions, and if it has recovery capability. From the literature, we see that each approach tackles one issue of 3D face tracking. For example, the optical flow based approach can track the face with high precision under slow motion and constant illumination. The manifold embedding based approach is a good way to embed a high-dimensional representation into a low-dimensional manifold space constrained by rigid variations; it is also a dimensionality reduction method. The template based approach with a cylindrical model and on-line adaptation can handle a large variation of poses with high precision. The local matching based approach provides recovery capability, robustness to fast motion, and accurate rigid parameters even at profile views. The constrained local parts based approach, using 3D model constraints via response maps, can deal with occlusion and facial expression. The regression based approach is robust to facial expression, occlusion and cluttered background. However, all these approaches have limitations that need to be solved. In order to build a robust framework, we have to decide on the properties that most affect the tracking performance: i) whether the tracking method should use off-line learning, on-line adaptation or a combination of the two, ii) whether to use real or synthetic data or a combination of them, and iii) whether to use global or local features.

2.4.1 Off-line learning vs On-line adaptation

On-line adaptation dynamically updates the tracking model, whereas off-line learning is performed on an off-line dataset ahead of tracking. On-line adaptation is often more accurate than off-line learning: in most cases, consecutive frames are not too different, and trackers like KLT are accurate and reliable enough. Yet such conditions are not realistic, because of the challenges discussed in the previous chapter; for example, if a part of the face is hidden for a while, recovery when the face re-appears is really challenging. Additionally, adaptation is prone to making the model drift away in long video sequences. In contrast, a learned tracker often has weak generalization ability, due to the training algorithms or the lack of training data. In practice, for the face tracker to be robust and accurate, these two aspects should be integrated into a unique framework. In this thesis, we investigate both of them and their combination.


2.4.2 Real data vs Synthetic

For off-line learning based trackers, real data is of course much better than synthetic data for training the tracker models, because synthetic faces cannot completely capture the variations of faces in real contexts. However, the ground-truth campaign is very expensive due to the human resources required, and the annotation process may be error-prone. It is likely impossible to obtain good ground-truth in many cases, e.g. the annotation of landmarks at profile views; this creates obstacles that a robust tracker has to overcome, and we discuss these problems in more detail in the next chapters. The goal of this study is to track the non-rigid face, so the annotation of landmarks is crucial. Synthetic data probably avoids these problems, but the quality of the synthetic faces matters if they are to replace real ones completely. In this work, we focus on some traditional ways of creating synthetic data to investigate their efficiency in tracking contexts.

2.4.3 Global vs Local features

A global feature is information extracted from the whole face region, while a local feature is information from small local patches, e.g. around points of interest. Many recent discriminative and regression methods using local features have demonstrated their value in our context: they help tackle problems such as illumination, expression, occlusion and self-occlusion. In our study, we use local descriptors to be robust and accurate.

2.5 Databases

In the literature, several datasets have been reported for pose estimation evaluation in video sequences. The most popular one is the Boston University Face Tracking (BUFT) dataset [20]. Its 3D pose ground-truth is captured by "Flock of Birds" magnetic sensors with an accuracy better than 1°. It has two subsets: a uniform-light set and a varying-light set. The uniform-light set, which is used for evaluation, has a total of 45 video sequences (320×240 resolution) for 5 subjects (9 videos per subject), with available pose ground-truth for three directions: Yaw (or Pan), Pitch (or Tilt), and Roll. The varying-light set contains 27 sequences of 3 subjects recorded under the same conditions as the first set, except for fast-changing lighting. This is no longer a challenging dataset,


because all subjects kept their faces neutral while moving their heads slowly, no occlusion occurred, the background is not very cluttered, and the angles in the three directions rarely exceed 40°. Many works have reported high performance on this dataset [50, 115, 123]. CLEAR07 [99] contains multi-view video recordings of a seminar room. It consists of 15 videos from four synchronized cameras at a frame rate of 15 fps. This dataset provides pose data from both single and multiple views; however, the subject captured for head pose is always seated in the same place. IDIAP Head Pose [8] is another derivative of CLEAR07, and for the single view the values of Yaw, Pitch and Roll range only within (-60°, 60°), (-60°, 15°) and (-30°, 30°) respectively. Methods evaluated on this dataset are not fully automatic because the face bounding box is provided. [81] created a dataset recording drivers in daytime and nighttime lighting conditions; the drivers' faces are usually neutral and mostly at frontal views. Other datasets are reviewed in the survey [82], but they are not much more challenging than the above-mentioned ones, because of simple backgrounds, slow motion or near-frontal views, or because they are no longer available. Most of the reported datasets used magnetic sensors, which are uncomfortable to wear and restrict movement because of the wires connecting them to the computer. Another limitation of these databases is that evaluation is simply reported as the average errors of the three angles Yaw, Pitch and Roll; no further detail is considered, for example to validate robustness in terms of frontal vs profile tracking, expression, or occlusion. In this work, we propose a new way of recording 3D ground-truth that lets people move around more comfortably. We also propose protocols for recording a new database that allows detailed validation of several tracker attributes.
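Since evaluation on these datasets reduces to per-angle average errors, it can be sketched in a few lines (a hypothetical helper, assuming per-frame pose predictions and ground-truth in degrees):

```python
import numpy as np

def pose_mae(pred, gt):
    """Per-angle mean absolute error (degrees) over a sequence.

    pred, gt: (N, 3) arrays of [yaw, pitch, roll] per frame -- the usual
    summary reported on BUFT-style benchmarks.
    """
    err = np.abs(np.asarray(pred, float) - np.asarray(gt, float))
    return dict(zip(("yaw", "pitch", "roll"), err.mean(axis=0)))
```

Three scalar averages are exactly why such protocols cannot distinguish, say, profile-view failures from frontal-view jitter, which motivates the more detailed protocols proposed here.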


3.1 The General Bayesian Tracking Formulation

General visual tracking is often cast in a Bayesian framework that aims to quantify the density of the state sequence x1:t = {x1, x2, ..., xt} given the observations y1:t = {y1, y2, ..., yt}. The goal is to find a good state xt from the available observations y1:t. To estimate xt, the posterior density p(xt|y1:t) is constructed in the update stage via Bayes' rule:

p(xt|y1:t) = p(yt|xt) p(xt|y1:t−1) / p(yt|y1:t−1) = c p(yt|xt) p(xt|y1:t−1)    (3.1)



The update stage yields the posterior density at the current time, where c is a normalization constant and p(xt|y1:t−1) is the distribution of the predicted state given all previous observations. Assuming a first-order Markov process, it can be estimated via the Chapman-Kolmogorov equation:

p(xt|y1:t−1) = ∫ p(xt|xt−1) p(xt−1|y1:t−1) dxt−1    (3.2)

where p(xt−1|y1:t−1) is the posterior density at time t−1 and p(xt|xt−1) is the evolution model, which encodes the predicted motion. Equations (3.1) and (3.2) form the basis of the optimal Bayesian solution, but in general this recursive propagation of the posterior density cannot be solved in closed form, so some approximation must be made. This formulation is the usual setting for visual tracking. Assuming a Gaussian posterior density leads to the well-known Kalman Filter and Extended Kalman Filter; more generally, Particle Filters approximate the optimal Bayesian solution. The methods presented in later chapters use Bayesian tracking as their foundation.
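As an illustration of how equations (3.1) and (3.2) are approximated in practice, here is a minimal sampling-importance-resampling (SIR) particle filter step in Python. The Gaussian random-walk dynamics and the likelihood function are generic placeholders, not the observation model developed in this thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, observation,
                         dynamics_noise, likelihood):
    """One SIR step approximating Eqs. (3.1)-(3.2).

    particles: (N, d) samples of the state x_{t-1}
    weights:   (N,) normalized importance weights
    likelihood(y, x) -> p(y_t | x_t), a placeholder observation model
    """
    n = len(particles)
    # Prediction (Chapman-Kolmogorov, Eq. 3.2): sample x_t ~ p(x_t | x_{t-1}),
    # here a Gaussian random walk as a stand-in evolution model.
    particles = particles + rng.normal(0.0, dynamics_noise, particles.shape)
    # Update (Bayes' rule, Eq. 3.1): reweight by the observation likelihood.
    weights = weights * np.array([likelihood(observation, x) for x in particles])
    weights = weights / weights.sum()
    # Resample to avoid weight degeneracy.
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```

On a 1-D toy problem with a Gaussian likelihood, a few such steps concentrate the particle cloud around the observed value, which is the behaviour the face tracker relies on in higher-dimensional pose space.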

3.2 Face Tracking using Covariance Matrices of Synthesized Faces

In this section, we present a baseline method built on the idea of using synthetic data. The main idea is to create a dataset of synthetic images and then learn tracking models from this dataset. Through this baseline, we investigate the potential of synthetic data and how to use it efficiently for face tracking.

3.2.1 Face Representation

The 3D model is the first thing to define for 3D tracking. Because we aim to investigate the potential of synthetic data without requiring real annotated data, and later to track both rigid parameters (3D pose) and non-rigid parameters (animation), we propose to use an available, light-weight non-rigid 3D model. For the reasons mentioned in the state-of-the-art section, Candide-3 [3] is the choice for this chapter: this 3D model is available and easy to control without requiring much human effort.


Chapter 3 A Baseline Framework for 3D Face Tracking 27

Figure 3.1: The frontal and profile views of Candide-3 model.

Candide-3, initially proposed by [3], is a popular face model managing both rigid and non-rigid facial parameters. In particular, this 3D face model controls shape and animation separately. It consists of Np = 113 vertices forming 184 triangular surfaces. Let M(p) ∈ R^(3Np×1) denote the 3Np-dimensional vector obtained by concatenating the three components of the Np vertices; the model can be written:

M(p) = s + Φs ps + Φa pa

where s denotes the mean shape and p = [ps, pa] the non-rigid parameters. The known matrices Φs ∈ R^(3Np×14) and Φa ∈ R^(3Np×65) are the Shape and Animation Units, which control shape and animation through the ps and pa parameters respectively. Among the 14 and 65 components of ps and pa, 12 and 11 respectively are used to track the eyebrows, eyes and lips, as listed in Table 3.1. So, when mentioned later, ps and pa denote vectors of 12 and 11 components of shape and animation units respectively.

The rotation R and translation t matrices are considered rigid parameters during tracking. The rotation matrix can easily be converted into the three extrinsic orientations of pose, denoted r = [ryaw, rpitch, rroll]^T. Therefore, the full model parameter, noted Θ, comprises 17 components: the three components of rotation r, the three components of translation t = [tx, ty, tz]^T and the 11 components of the animation parameters pa:

Θ = [r^T, t^T, pa^T]^T
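This parameterization can be sketched in numpy as follows. The mean shape and the Shape/Animation Unit matrices are random placeholders here, not the real Candide-3 data (which ships with the model):

```python
import numpy as np

NP = 113                              # Candide-3 vertex count
s_bar = np.zeros(3 * NP)              # mean shape (placeholder values)
Phi_s = np.random.rand(3 * NP, 14)    # Shape Units basis (placeholder)
Phi_a = np.random.rand(3 * NP, 65)    # Animation Units basis (placeholder)

def candide_vertices(p_s, p_a):
    """M(p) = s_bar + Phi_s p_s + Phi_a p_a, reshaped to (113, 3)."""
    return (s_bar + Phi_s @ p_s + Phi_a @ p_a).reshape(NP, 3)

# Full tracked parameter Theta: 3 rotation + 3 translation + 11 animation = 17.
theta = np.concatenate([np.zeros(3),      # r = [yaw, pitch, roll]
                        np.zeros(3),      # t = [tx, ty, tz]
                        np.zeros(11)])    # tracked subset of p_a
assert theta.size == 17
```

With zero shape and animation parameters, the model reduces to the mean shape, which is why the tracker only needs to estimate the small parameter vector rather than all 339 vertex coordinates.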


Index   Shape Units                   Animation Units
2       Eyebrows vertical position    Jaw drop
3       Eyes vertical position        Lip stretcher
6       Eye separation distance       Outer brow raiser
11      Mouth vertical position       Upper lid raiser

Table 3.1: The list of shape and animation units in our study.

Note that landmark detection is required at the first frame, where both shape and animation parameters p are estimated; in subsequent frames only pa is estimated, because we assume that the shape parameters do not change. To initialize Θ0, the model parameter at the first frame, we first compute [r, t] from the 2D landmarks using the POSIT algorithm [33]. The non-rigid parameters p are then estimated by the Nelder-Mead method [83]. Θt at frame t is initialized from the aligned model of the previous frame, Θt−1. ps is estimated only once in this step; afterwards only pa is considered.
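The Nelder-Mead refinement can be sketched as below, assuming scipy is available. The `project` argument stands in for the perspective projection of the posed model, and the pose part of `theta0` would come from POSIT [33], which is not implemented here:

```python
import numpy as np
from scipy.optimize import minimize

def reprojection_error(params, pts3d, pts2d, A, project):
    """Sum of squared distances between detected 2D landmarks and the
    projection of the 3D model under the candidate parameters."""
    proj = project(params, pts3d, A)
    return np.sum((proj - pts2d) ** 2)

def refine_params(theta0, pts3d, pts2d, A, project):
    """Derivative-free Nelder-Mead refinement of the model parameters,
    in the spirit of [83]; `project` is a placeholder projection."""
    res = minimize(reprojection_error, theta0,
                   args=(pts3d, pts2d, A, project), method="Nelder-Mead")
    return res.x
```

Nelder-Mead suits this step because the reprojection error is cheap to evaluate but its gradient with respect to the pose and animation parameters is awkward to derive analytically.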

Projection: We assume a perspective projection, for which the camera calibration matrix A has been obtained from empirical experiments. If the rotation R and translation t are known, the perspective projection P(x3D) of a 3D point x3D is computed in our work as follows:

λ [u, v, 1]^T = A (R x3D + t),    P(x3D) = [u, v]^T
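A direct numpy implementation of the standard pinhole projection described here might look as follows (a sketch under the usual conventions: A is the 3×3 intrinsics matrix, R a rotation matrix, t a translation vector):

```python
import numpy as np

def project(points3d, A, R, t):
    """Perspective projection: dehomogenize A (R x3D + t).

    points3d: (N, 3) points in the model frame
    A: (3, 3) camera calibration (intrinsics); R: (3, 3); t: (3,)
    Returns (N, 2) pixel coordinates.
    """
    cam = points3d @ R.T + t          # rigid transform into the camera frame
    hom = cam @ A.T                   # homogeneous image coordinates
    return hom[:, :2] / hom[:, 2:3]   # divide by depth (dehomogenize)
```

The division by the third homogeneous coordinate is what makes the projection perspective rather than orthographic: points farther from the camera shrink toward the principal point.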
