IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 3, MARCH 2015

Fiducial Facial Point Extraction Using a Novel Projective Invariant Xin Fan, Member, IEEE, Hao Wang, Zhongxuan Luo, Yuntao Li, Wenyu Hu, and Daiyun Luo

Abstract— Automatic extraction of fiducial facial points is one of the key steps to face tracking, recognition, and animation. Great facial variations, especially pose or viewpoint changes, typically degrade the performance of classical methods. Recent learning or regression-based approaches highly rely on the availability of a training set that covers facial variations as wide as possible. In this paper, we introduce and extend a novel projective invariant, named the characteristic number (CN), which unifies the collinearity, cross ratio, and geometrical characteristics given by more (6) points. We derive strong shape priors from CN statistics on a moderate size (515) of frontal upright faces in order to characterize the intrinsic geometries shared by human faces. We combine these shape priors with simple appearance based constraints, e.g., texture, edge, and corner, into a quadratic optimization. Thereafter, the solution to facial point extraction can be found by the standard gradient descent. The inclusion of these shape priors renders the robustness to pose changes owing to their invariance to projective transformations. Extensive experiments on the Labeled Faces in the Wild, Labeled Face Parts in the Wild and Helen database, and cross-set faces with various changes demonstrate the effectiveness of the CN-based shape priors compared with the state of the art. Index Terms— Fiducial facial point extraction, pose changes, projective invariant, characteristic number.

I. INTRODUCTION

Facial feature extraction is an important task in computer vision that has broad applications to face

Manuscript received January 26, 2014; revised June 16, 2014, October 18, 2014, and December 20, 2014; accepted December 28, 2014. Date of publication January 12, 2015; date of current version February 11, 2015. This work was supported in part by the Natural Science Foundation of China under Grant 11171052, Grant 61003177, Grant 61033012, Grant 61272371, and Grant 61328206, in part by the Program for New Century Excellent Talents under Grant NCET-11-0048, and in part by the Science Foundation for Young Scholars of Jiangxi Provincial Education Department under Grant GJJ13647. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Shiguang Shan. X. Fan, H. Wang, and Y. Li are with the School of Software, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]; [email protected]; [email protected]). Z. Luo is with the School of Software, Dalian University of Technology, Dalian 116024, China, and also with the School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]). W. Hu is with the School of Mathematics and Computer Sciences, Gannan Normal University, Ganzhou 341000, China (e-mail: [email protected]). D. Luo is with the School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]). This paper has supplementary downloadable material available at http://ieeexplore.ieee.org., provided by the author. The material includes all the author’s results on 500 cross-set facial images and the histograms of characteristic number values for all combinations of six fiducial points. The total size of the file is 17 MB. Contact [email protected] for further questions about this work. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2015.2390976

Fig. 1. Fiducial point localization with pose changes as well as variations on age, expression and resolution.

tracking, recognition and animation as well as video communication. Significant efforts have been devoted to the development of accurate and robust algorithms in recent decades [1]. Dealing with large appearance changes due to variations in expression, illumination and pose (or viewpoint) remains one of the greatest challenges. Previous studies show that a subset of facial landmarks, called fiducial points, is stable under expression changes and is essential to pose recovery [2]. The localization of these points is also the key initialization step for many current algorithms for global facial shape extraction [3]–[5]. Additionally, the use of known fiducial landmarks can significantly improve the performance of face recognition algorithms across pose changes [6]. In this paper, we focus on 8 fiducial points, i.e., 4 eye corners, 2 nostrils, and 2 mouth corners, and address the challenges in their extraction when great variations, especially in pose, occur as shown in Fig. 1.

Most existing methods for facial feature extraction can be traced back to the framework of the classical active contour models (Snakes) [7], which introduces the external and internal energies as the constraints for contour extraction. The external energy reflects the appearance of the object of interest, and the internal one encodes the prior shape information. The final contour can be found by an optimization process on the total energy functional. The most popular textural features for facial appearance are derived from Gabor filtering [8]–[10]. These textural features are insensitive to slight illumination changes and in-plane rotations, but yield unnatural results due to weak shape constraints. Researchers therefore incorporate statistical models of facial shapes, which are highly structured, into the Snake framework for facial landmark extraction.
The pioneering works of active shape model (ASM) [11] and active appearance model (AAM)¹ [12] construct a parametric shape model by performing principal component analysis (PCA) on a set of labelled face shapes. Numerous variants on ASM and

¹ Appearance models with strong shape constraints.

1057-7149 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

FAN et al.: FIDUCIAL FACIAL POINT EXTRACTION USING A NOVEL PROJECTIVE INVARIANT


Fig. 2. Approach overview. Left: a facial image with fiducial points initialized by a face detector as blue dots. Middle: we combine shape and textural constraints into a quadratic optimization. Shape priors are derived from the characteristic number that reflects the underlying geometry of fiducial points. Textural constraints include intensity templates on local patches around fiducial points, edge and corner candidates obtained by the Sobel operator and Hessian matrix, respectively. Right: final fiducial points are localized as yellow dots, and numbered accordingly.

AAM have emerged, including constrained local models (CLMs) [13], position-optimized ASM [14], and extended ASM [15]. Among them, Milborrow and Nicolls extend the ASM in various aspects and their approach is able to localize feature points quite accurately for frontal faces. Dibeklioglu et al. combine statistical models on both appearance and shape, and develop a landmark extraction algorithm robust to resolution, expressions, natural occlusions, and small in-plane rotations [1]. In [16], the shape prior was modeled nonparametrically from training data. Xiong and Torre present the supervised descent method (SDM) as a non-parametric extension to AAM, which gives robust alignments [17]. Despite their great success, these approaches can hardly yield accurate results on non-frontal faces with pose changes.

Recently, regression based methods have attracted great attention for their ability to handle facial shapes with pose changes. These methods learn direct mappings from appearance to target shapes or landmarks, which provide fast and flexible localization. Chen et al. [18] use a boosting algorithm to determine the most probable positions of 5 fiducial points. Valstar et al. model the constraints between landmarks with Markov random fields (MRFs), and learn a regressor for individual landmarks [4]. Efraty et al. introduce Fern regressors into the extraction of 8 fiducial points [5]. Cao et al. [19] share the idea of Fern regressors with [5], but perform learning on non-parametric shape models that characterize complete facial shapes with more landmarks. Dantone et al. propose a fast localization algorithm by using a conditional regression forest to learn the relations conditional on global face properties [20]. These methods typically demand a large number of labeled training faces covering as many pose variations as possible. Consequently, the availability of training data severely limits their ability to handle a wide range of pose changes.
Moreover, the labeling of training images requires intensive and tedious manual labor.

The latest cognitive studies on the invariance mechanism for human faces reveal that direct geometrical associations for images under very different viewpoints work in parallel with the learning strategy for neighboring viewpoints [21], [22]. These discoveries inspire us to directly find facial shape constraints under projective geometry that characterize the intrinsic geometrical relationships under pose or viewpoint changes [23]. Early results on generic shape matching [24] show that the incorporation of the cross ratio, a projective invariant, can improve the matching accuracy. Gee et al. [2] use the collinear constraint of eye corners to recover the face geometry. The recent work in [3] develops a projective ASM that extracts facial features under perspective deformations. Their approach explicitly estimates the projective transformation matrix in every iteration, which demands accurate localization of stable fiducial points. These works from the geometric perspective have in common that they bring as many rational geometric constraints as possible into the localization for an accurate solution.

In this paper, we bring a newly developed projective invariant, named the characteristic number (CN) [25], [26], into fiducial point localization, and construct rich geometric constraints upon CN, which combines the collinearity, the cross ratio, and constraints on more points (6 in our current algorithm). These constraints reflect the common geometry of fiducial points shared by human faces, and they do not rely on training examples as the regression models in [5] and [19] do. We calculate all these constraints on a moderate dataset of frontal upright faces, in contrast to the vast collections covering various poses. We find 16 combinations of the fiducial points showing consistent CN values on all 515 images, and the CN values on these combinations remain unchanged under projective/perspective transformations. This invariance naturally brings robustness to facial viewpoint variations when we incorporate the shape priors from CN into the localization.

We formulate the shape prior together with the appearance information as a quadratic energy optimization. Specifically, we concatenate the coordinates of the fiducial points into a vector $S = [p_1, \ldots, p_8]^T$, where $p_i = (x_i, y_i)$ $(i = 1 \ldots 8)$. Our objective is to find a vector S that minimizes the following energy:

$$E(S) = \sum_{i} |F_i(S) - F_i(\bar{S})|^2, \quad i = 1 \ldots 8 \tag{1}$$

where $F_i$ denotes the constraint imposed on the points and $\bar{S}$ denotes the template points. In addition to the shape constraints from CN, we include relatively simple constraints for landmark appearance, i.e., pixel intensities, edges and corners, as shown in Fig. 2. The solution to this optimization can be found by standard iterative gradient descent. The experiments on the Labeled Faces in the Wild (LFW) database (over 13,000 faces) [27], Labeled Face Parts in


the Wild (LFPW) [16], Helen data set [28], and cross-set faces from a commercial set and several public databases demonstrate that the strong shape priors based on CN yield accurate localizations of fiducial points with multiple facial variations including age, expression and pose changes as long as no self occlusion occurs. The remainder of this paper is organized as follows. Section II reviews related works on learning based landmark localization and geometric invariants for shape representation. Section III gives the definition of CN that includes the collinearity and cross ratio, and extends the definition to 6 points. The new prior constraints for facial shapes from CN are given in Section IV, followed by the localization algorithm with other constraints in Section V. Section VI provides the experiments on facial data sets with multiple types of variations, and Section VII concludes the paper.

II. RELATED WORK

Dibeklioglu et al. provide a comprehensive study on the approaches to facial feature extraction for small pose changes in [1]. Herein, we briefly review the methods for faces with large pose variations since 2010, which leverage advanced machine learning (ML) techniques. Liang et al. employ a graph structure, specifically a Markov network, to represent the geometric and appearance constraints between facial points [29]. The work in [30] applies a similar idea to the detection and tracking of facial features. Zhu and Ramanan present an efficient algorithm for simultaneous face detection, pose estimation and landmark localization by learning regressors on a tree structure [31]. In addition to generative graphical models, discriminative learning is also applicable when we pose the landmark localization as a detection problem. Ding and Martinez employ strong Adaboosting learners to discriminate feature points from their contexts [32]. Saragih and Goecke combine nonlinear discriminative learning into AAM fitting [33]. Similarly, a discriminative approach is applied to build response fitting maps upon the CLMs in a recent work [34]. Zhao et al. propose a shape space pruning algorithm using learnt classifiers on a tree structure [35]. Moreover, they provide one of the most comprehensive comparisons on all LFW images. Gu and Kanade present a regularized shape model [36] and Zhang et al. learn a sparse representation for shape variations [37]. These models are robust to noise and low contrast. Burgos-Artizzu et al. explicitly detect occlusions by cascaded pose regression [38]. These approaches incorporate global facial geometries in a statistical manner that requires numerous examples showing a wide range of variations. The generalization of ML algorithms is still an open problem.

Computer vision has a long history of using geometric invariants that reflect the intrinsic geometries of an object under different transformation groups [39], [40]. The cross ratio (CR) on 5 coplanar points is a fundamental invariant to projective transformations. One may derive projective invariants from CR for more points [41], [42]. These invariants can be used to construct descriptors for shape recognition invariant to projective deformations [43], [44]. Researchers also build robust constraints upon these invariants in order to match geometric primitives between images such as points [42], [45], lines [46], [47] and closed contours [48], [49]. In a recent work, Bryner et al. derive a novel metric invariant to affine and projective groups in a general Riemannian framework, and develop shape analysis algorithms for both point sets and parametric curves [50]. In the context of facial analysis, Riccio and Dugelay devise features for recognition based on 2D/3D geometric invariants [51]. Their method performs well under pose variations. In this work, we characterize the common geometric structures of fiducial points shared by human faces using our projective invariants, and incorporate these informative constraints to obtain robust facial point localization.

III. CHARACTERISTIC NUMBER

In this section, we give the definition of the characteristic number (CN) [25] and relate CN to the collinearity and cross ratio, two fundamentals in projective geometry. We also extend the definition so that we can construct a projective invariant on 6 points that characterizes facial geometry on a larger scale. We use capital letters for points and lowercase letters for lines unless otherwise stated. We denote the line passing through two points P and Q as (P, Q), and the intersection of two lines l and k as <l, k>.

Definition 1 (Characteristic Ratio): Let $U, V \in \mathbb{P}^2$ be two points on a projective plane, and let $P_1, P_2, \ldots, P_k$ be k points on the line (U, V). Then there exist scalar coefficients $\{a_t, b_t\}_{t=1}^{k}$ satisfying $P_t = a_t U + b_t V$, $t = 1, 2, \ldots, k$. We name the ratio

$$R(P_1, \ldots, P_k) := \frac{b_1 b_2 \cdots b_k}{a_1 a_2 \cdots a_k} \tag{2}$$

the characteristic ratio of $P_1, P_2, \ldots, P_k$ with respect to the basic points U and V. The computation of the characteristic ratio is independent of the choice of the basic points U and V. It is straightforward to take the x and y coordinates of the fiducial points as a and b, respectively. We define CN upon the characteristic ratio.

Definition 2 (Characteristic Number): Let $\Gamma_n$ be a planar projective curve of degree n, and let l, k and m be 3 distinct lines on a projective plane. The points $\{P_t^l, P_t^k, P_t^m\}_{t=1}^{n}$ are the intersections of $\Gamma_n$ with l, k and m, respectively. We name the value

$$\kappa_n(\Gamma_n) := R(P_1^l, \ldots, P_n^l) \cdot R(P_1^k, \ldots, P_n^k) \cdot R(P_1^m, \ldots, P_n^m) \tag{3}$$

the characteristic number of the curve $\Gamma_n$ of degree n. The value of CN reflects intrinsic characteristics of the curve $\Gamma_n$, e.g., its degree. The CN of a straight line, a curve of degree 1, is −1, and curves of degree 2, including circles, ellipses, and parabolas, have a CN value of 1. We refer to [25] for more theoretical details, and relate CN to the collinearity and cross ratio, which are both important under projective transformations.

A. Collinearity and Cross Ratio

The homography between two imaging views preserves collinearity [23]. The images of three collinear points in


Fig. 3. Sketch of the characteristic number for (a) 3 points, (b) 5 points, and (c) 6 points on a plane.

a view lie on a line in another view. For facial feature extraction, the collinearity of eye corners remains unchanged across viewpoints. We can verify the collinearity of three points by Theorem 1 on CN. We give the proof of the theorem in [25] below since the proof leads to the extended version of CN on 6 points in this study.

Theorem 1: The characteristic number of three collinear points is −1.

Proof: Suppose that a line l on a projective plane intersects three arbitrary lines a, b, and c as shown in Fig. 3(a). Let U = <c, a>, V = <a, b> and W = <b, c> be the basic points, and let P = <l, a>, Q = <l, b> and R = <l, c> be the intersections of l with a, b, and c, respectively. There exist the following linear combinations:

$$\begin{cases} P = a_1 U + b_1 V \\ Q = a_2 V + b_2 W \\ R = a_3 W + b_3 U \end{cases} \tag{4}$$

where $a_i$ and $b_i$ are real values. Our goal is to prove:

$$\frac{b_1}{a_1} \cdot \frac{b_2}{a_2} \cdot \frac{b_3}{a_3} = -1. \tag{5}$$

As the characteristic number is independent of the basic points, let us take U = (1, 0, 0), V = (0, 1, 0), and W = (0, 0, 1) without loss of generality. The area of the triangle PQR equals zero since the three points lie on a line, that is:

$$\begin{vmatrix} a_1 & b_1 & 0 \\ 0 & a_2 & b_2 \\ b_3 & 0 & a_3 \end{vmatrix} = 0.$$

Expanding the determinant gives $a_1 a_2 a_3 + b_1 b_2 b_3 = 0$, which is exactly (5).

We now consider the cross ratio, a fundamental invariant in projective geometry. Given 5 points $P_i$ $(i = 1, \ldots, 5)$ on a projective plane (see Fig. 3(b)), the cross ratio of these 5 points is defined as [23], [24]:

$$CR = \frac{\sin(\alpha_{13}) \sin(\alpha_{24})}{\sin(\alpha_{23}) \sin(\alpha_{14})}, \tag{6}$$

where $\alpha_{ij}$ is the angle between the lines $(P_i, P_5)$ and $(P_j, P_5)$. We can readily verify that the calculation of the cross ratio is equivalent to that of CN. We incorporate the cross ratio [52] or CN of 5 points as the constraint for fiducial point extraction.
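The five-point cross ratio in (6) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes signed sines (computed from 2-D cross products) for the angles, and the helper names `cross_ratio_5` and `apply_homography`, the sample points, and the matrix `H` are all hypothetical choices made for illustration.

```python
import numpy as np

def cross_ratio_5(points):
    """Cross ratio (6) of five coplanar points given as a (5, 2) array P1..P5.

    Assumption: angles alpha_ij are treated as signed, so sin(alpha_ij) is the
    normalized 2-D cross product of the rays from P5 to Pi and Pj.
    """
    p = np.asarray(points, dtype=float)
    rays = p[:4] - p[4]  # directions of the lines (Pi, P5), i = 1..4

    def sin_angle(i, j):
        u, v = rays[i], rays[j]
        cross = u[0] * v[1] - u[1] * v[0]
        return cross / (np.linalg.norm(u) * np.linalg.norm(v))

    return (sin_angle(0, 2) * sin_angle(1, 3)) / (sin_angle(1, 2) * sin_angle(0, 3))

def apply_homography(H, points):
    """Map 2-D points through a 3x3 homography and return Euclidean coordinates."""
    p = np.asarray(points, dtype=float)
    ph = np.hstack([p, np.ones((len(p), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

# Hypothetical points in general position and a mild projective transformation.
pts = np.array([[0.0, 0.0], [4.0, 1.0], [3.0, 5.0], [-1.0, 3.0], [1.0, 2.0]])
H = np.array([[1.1, 0.2, 3.0], [-0.1, 0.9, -2.0], [0.001, 0.002, 1.0]])

cr_before = cross_ratio_5(pts)
cr_after = cross_ratio_5(apply_homography(H, pts))
print(cr_before, cr_after)  # the two values should agree up to rounding
```

The vector norms cancel in the ratio, so the value depends only on 3×3 determinants of homogeneous coordinates, which is why the result survives the homography.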

B. Characteristic Number on Six Points

Definition 2 for CN demands points lying on a curve of degree n, but curves of degree higher than 1 (straight lines) are not common in practice. We relax the restriction on the existence of underlying curves, and define CN on 6 points as depicted in Fig. 3(c). Theorem 2 shows the construction process and its invariance to projective transformations.

Theorem 2: Suppose that A, B, C, H, I and J are 6 points on a projective plane, no three of which are collinear. The characteristic number, defined as the product of ratios of directed triangle areas below, is a projective invariant.

$$\kappa(A, I, C, H, B, J) = \frac{(-1)^3 S_{ABH}}{(-1)^3 S_{ACH}} \cdot \frac{S_{BAI}}{S_{BCI}} \cdot \frac{S_{CAJ}}{S_{CBJ}}, \tag{7}$$

where S is the area of a triangle. Following the proof of Theorem 1, we can calculate κ with the areas replaced by the determinants of the points' homogeneous coordinates:

$$\kappa = \frac{|ABH|}{|ACH|} \cdot \frac{|BAI|}{|BCI|} \cdot \frac{|CAJ|}{|CBJ|}, \tag{8}$$

where

$$|ABH| = \begin{vmatrix} 1 & 1 & 1 \\ x_a & x_b & x_h \\ y_a & y_b & y_h \end{vmatrix}.$$
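The determinant form (8) is straightforward to evaluate. The sketch below is illustrative only: the function names `det3` and `cn6`, the argument order (A, B, C, H, I, J), and the sample coordinates are assumptions, not taken from the paper. The check uses an affine map, a special case of a projective transformation under which ratios of triangle areas are preserved.

```python
import numpy as np

def det3(p, q, r):
    """Determinant |PQR| of the homogeneous coordinates (1, x, y) of three points."""
    m = np.array([[1.0, 1.0, 1.0],
                  [p[0], q[0], r[0]],
                  [p[1], q[1], r[1]]])
    return np.linalg.det(m)

def cn6(A, B, C, H, I, J):
    """Product of determinant ratios as written in (8).

    Assumption: arguments are passed in the order (A, B, C, H, I, J).
    """
    return (det3(A, B, H) / det3(A, C, H)
            * det3(B, A, I) / det3(B, C, I)
            * det3(C, A, J) / det3(C, B, J))

# Hypothetical six points, no three of which are collinear.
pts = np.array([[0.0, 0.0], [4.0, 0.0], [1.0, 3.0],
                [3.0, 2.0], [-1.0, 1.5], [2.0, -1.0]])
kappa = cn6(*pts)

# Affine map x -> M x + t (rotation, scaling and shear plus translation).
M = np.array([[1.2, 0.3], [-0.4, 0.9]])
t = np.array([5.0, -2.0])
kappa_affine = cn6(*(pts @ M.T + t))
print(kappa, kappa_affine)
```

For these sample points the six determinants are 8, −7, −6, 10.5, 7 and −9, so the product of ratios is −32/63.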

We prove Theorem 2 in Appendix A, inspired by a generalization of CN in [24], [26], [44]. We can also relax the non-collinearity of any 3 points as long as the 6 points can form 6 triangles (20 at most) in our application. It is also possible to relax the coplanarity [26] by involving more complex geometrical configurations, but our experiments show that the coplanarity of the 8 fiducial points is a reasonable assumption in this study. As shown above, CN yields both the collinearity and the cross ratio. Moreover, the new invariant makes it possible for us to develop new constraints on 6 points, i.e., on a larger scale, to characterize the facial geometry. The more rational constraints we have, the more accurate the localization we can achieve.

IV. SHAPE PRIORS USING CHARACTERISTIC NUMBER

Human faces are highly structured and present common geometries across age, gender, and ethnicity of individuals.


Fig. 4. CN values for subsets with 3 (blue), 5 (red), and 6 (green) points: (a) histograms of CN values on the subsets of fiducial points whose locations are annotated in frontal face images in (b), and (c) histograms of CN values on the same combinations of points as (a). The point coordinates for (c) are extracted from images in (d) significantly different from (b). Horizontal axes of the histograms are CN values, and the vertical axes are the number of faces.

For example, the four eye corners are collinear and this collinearity is preserved under pose/viewpoint changes. Researchers employ this invariant property of collinearity for pose recovery [2]. We intend to incorporate more geometric constraints than the collinearity. The CN invariant is able to characterize more geometric information on faces in addition to the collinearity. For frontal faces, the lines connecting eye corners, nostrils and mouth corners are mutually parallel. The line through the eye corners is perpendicular to the line connecting the midpoints of the nostrils and mouth corners. The parallelism and perpendicularity vary with viewpoints, whereas CN values are preserved under projective transformations. CN values also describe the intrinsic geometric properties of the underlying curves on which the subsets of facial feature points lie. We therefore leverage CN to discover more shape priors shared by a group of frontal upright human faces (515 individuals with varying gender and race). These priors directly apply to different poses and viewpoints as the constraints for the extraction of fiducial points.

We take an exhaustive strategy on the CN values for the subsets of fiducial points in order to discover their common geometries. We enumerate all possible combinations of choosing 3, 5 and 6 points from the 8 manually labeled fiducial points, and calculate the CN values on every combination for all 515 images. Taking the discovery process of the collinearity using CN as an example, we have $\binom{8}{3} = 56$ 3-point combinations. Each combination generates 515 CN values for the 515 frontal faces. Theorem 1 states that the CN value of three collinear points is the constant −1, so that we can pick out 4 combinations with 3 collinear points satisfying:

$$|(-1) - CN_{sub}|^2 < \varepsilon, \tag{9}$$

where CNsub is the CN of the 3-point subsets and ε is a small positive constant. The blue bar in Fig. 4(a) shows the histogram of the CN values for one 3-point combination that satisfies (9) on 515 frontal faces. Almost all the CN values of the points, whose locations are annotated as blue dots on the top frontal face in Fig. 4(b), are quite close to −1.
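The value −1 in Theorem 1 can be reproduced numerically with the construction used in its proof. This is a minimal sketch under stated assumptions: the reference triangle U, V, W and the line are arbitrary hypothetical choices, `coeffs` and `cn_three_points` are illustrative names, and the paper does not state that CN on facial points is computed this way.

```python
import numpy as np

def coeffs(P, X, Y):
    """Solve P = a*X + b*Y for (a, b), with P, X, Y homogeneous 3-vectors."""
    sol, *_ = np.linalg.lstsq(np.column_stack([X, Y]), P, rcond=None)
    return sol[0], sol[1]

def cn_three_points(U, V, W, line):
    """Characteristic number of the three points where `line` meets the sides
    of the triangle with vertices U, V, W (all in homogeneous coordinates)."""
    a, b, c = np.cross(U, V), np.cross(V, W), np.cross(W, U)    # side lines
    P, Q, R = np.cross(line, a), np.cross(line, b), np.cross(line, c)
    a1, b1 = coeffs(P, U, V)   # P = a1*U + b1*V, as in (4)
    a2, b2 = coeffs(Q, V, W)   # Q = a2*V + b2*W
    a3, b3 = coeffs(R, W, U)   # R = a3*W + b3*U
    return (b1 / a1) * (b2 / a2) * (b3 / a3)

# Hypothetical triangle and a line through (1, -1) and (5, 4) that avoids the vertices.
U = np.array([0.0, 0.0, 1.0])
V = np.array([4.0, 0.0, 1.0])
W = np.array([0.0, 3.0, 1.0])
line = np.cross([1.0, -1.0, 1.0], [5.0, 4.0, 1.0])

cn = cn_three_points(U, V, W, line)
print(cn)  # close to -1, as stated by Theorem 1
```

Rescaling any of the homogeneous vectors leaves the product unchanged, because each basic point contributes once to a numerator coefficient and once to a denominator coefficient.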

We verify the invariance of the prior by using the same 3-point combination from the other set of 515 uncontrolled faces with different poses and identities, one of which is given in the top image of Fig. 4(d). The CN values calculated from all these points are almost equal to −1, as we expect. These histograms verify that CN can find the collinear fiducial points on human faces, and that the collinearity is preserved under pose variations. We perform a similar screening process on the combinations of 5 and 6 points whose CN values are approximately identical for all 515 frontal faces:

$$\begin{cases} |C - CN_{sub}|^2 < \varepsilon \\ Sd(CN_{sub}) < \sigma, \end{cases} \tag{10}$$

where Sd(·) denotes the standard deviation, and σ is a small positive constant. The constant C is called the intrinsic value that characterizes the geometric property of the curve underlying the 5 or 6 points. We find 6 combinations for 5-point and 6 for 6-point subsets that follow (10). The histograms and point locations on frontal faces are given in Figs. 4(a) and (b), respectively.² The CN values of one combination for all 515 frontal faces concentrate on one definite value. Again, Figs. 4(c) and (d) verify the projective invariance of CN on 5 points (cross ratio) and on 6 points given by Theorem 2.

In summary, we find 16 combinations in total (4 for 3-point, 6 for 5-point, and 6 for 6-point) that have consistent CN values, satisfying (9) or (10), for all 515 images. We list all these combinations and their corresponding intrinsic values in Tab. I. The numbering of the fiducial points in the brackets of Tab. I is given in Fig. 2. These invariant priors, reported for the first time to the best of our knowledge, reflect common facial geometries including the collinearity and those on a larger scale involving more points for more facial components. We can calculate these priors with one formula as (3). As theoretically proved above and verified by the experiments later, CN values of these combinations remain unchanged across

² The histograms and locations for all combinations are available in supplemental materials.


TABLE I
SIXTEEN COMBINATIONS OF FIDUCIAL POINTS SHOWING CONSISTENT CN VALUES FOR ALL IMAGES

poses and a wide range of faces. It is unnecessary to run the discovery process for any other faces in a new data set. The tight clustering of CN values makes it possible for us to combine all these geometric priors as hard constraints into a deterministic optimization framework.

V. EXTRACTION ALGORITHM

We pose the extraction problem as the optimization in (1) that combines three types of constraints. We employ simple intensity and edge/corner information in a patch surrounding a point candidate as appearance constraints. The shape priors derived from CN incorporate the collinearity, the cross ratio and constraints on more points into the energy function. The solution (localization) can be found by standard gradient descent on the energy.

A. Characteristic Number Constraints

In Section IV, we found 16 combinations or subsets showing consistent CN values across subjects (images). These values, termed intrinsic values, reflect the intrinsic geometrical characteristics of human faces. Although they were found on 515 frontal upright faces, there is no need to update the 16 combinations and their corresponding intrinsic values for new testing sets. We exploit the intrinsic values as the constraints for the localization of fiducial points under pose changes. We write the energy for the shape constraints as:

$$E_c = \sum_{i=1}^{n} \| C_i(\bar{S}_i) - CN(S_i) \|^2, \tag{11}$$

where n = 16 is the number of intrinsic values, $\bar{S}_1, \bar{S}_2, \ldots, \bar{S}_n$ are the subsets of fiducial points having intrinsic values $C_i$, and $S_1, S_2, \ldots, S_n$ are the corresponding subsets of the fiducial points to be estimated. The key to gradient descent for energy optimization is the calculation of the gradients. We take the partial derivatives with respect to the coordinates $\{p_k\}$, and set them to zero:

$$\frac{\partial E_c}{\partial p_k} = \sum_{i=1}^{n} f_{ki} \left( C_i(\bar{S}_i) - CN(S_i) \right) \frac{\partial CN(S_i)}{\partial p_k} = 0, \tag{12}$$

where $f_{ki}$ is a binary function taking 1 if the kth point $p_k$ falls in the ith subset $S_i$, and 0 otherwise. The partial derivative $\partial CN(S_i)/\partial p_k$ is computed as the difference of CN values when point locations change. We have the updating rule for fiducial points:

$$\Delta p_i = \frac{\partial E_c}{\partial p_i}. \tag{13}$$

This updating converges when the CN values of fiducial points take the intrinsic values.
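The update above can be sketched for a single six-point constraint. This is an illustrative sketch only, not the authors' implementation: it assumes the determinant form (8) for CN, a central finite difference for the gradient, textbook gradient descent with a backtracking step size (our choice, not stated in the paper), and hypothetical names (`cn6`, `energy`, `descend`) and coordinates.

```python
import numpy as np

def det3(p, q, r):
    """Closed form of the determinant |PQR| of homogeneous coordinates (1, x, y)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])

def cn6(s):
    """Six-point CN as written in (8); rows of s are assumed ordered A, B, C, H, I, J."""
    A, B, C, H, I, J = s
    return (det3(A, B, H) / det3(A, C, H)
            * det3(B, A, I) / det3(B, C, I)
            * det3(C, A, J) / det3(C, B, J))

def energy(s, target):
    """One term of the shape energy (11): squared gap to the intrinsic value."""
    return (target - cn6(s)) ** 2

def numerical_gradient(s, target, h=1e-6):
    """Central finite differences of the energy w.r.t. every coordinate."""
    g = np.zeros_like(s)
    for idx in np.ndindex(s.shape):
        step = np.zeros_like(s)
        step[idx] = h
        g[idx] = (energy(s + step, target) - energy(s - step, target)) / (2 * h)
    return g

def descend(s0, target, iters=200):
    """Gradient descent; the step is halved until the energy strictly decreases."""
    s = s0.copy()
    for _ in range(iters):
        g, e, t = numerical_gradient(s, target), energy(s, target), 1.0
        while t > 1e-8 and not energy(s - t * g, target) < e:
            t *= 0.5
        if t <= 1e-8:
            break  # no improving step found
        s = s - t * g
    return s

# Hypothetical template points, their CN as the intrinsic value, and a perturbed start.
template = np.array([[0.0, 0.0], [4.0, 0.0], [1.0, 3.0],
                     [3.0, 2.0], [-1.0, 1.5], [2.0, -1.0]])
target = cn6(template)
start = template + np.array([[0.2, -0.1], [-0.15, 0.1], [0.1, 0.2],
                             [-0.2, 0.15], [0.1, -0.2], [0.15, 0.1]])
result = descend(start, target)
print(energy(start, target), energy(result, target))
```

A single CN constraint leaves many degrees of freedom, so the result need not return to the template; in the full energy the other CN subsets and the appearance terms restrict the solution further.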

B. Edge/Corner Constraints

We assume that fiducial points have large values of the first derivatives and are likely to be corners. We pick out the points with larger responses (default threshold in Matlab) to the Sobel operator as the edge candidates in $S_{edge}$, and assign a larger value to all edge candidates:

$$P_{edge}(p_i) = \begin{cases} w_1, & p_i \in S_{edge} \\ w_2, & p_i \notin S_{edge} \end{cases} \tag{14}$$

where $w_1, w_2 \in (0, 1)$, and $w_1 > w_2$. We apply the Hessian matrix to detect corners. The determinant of the Hessian matrix for an image patch $I(p_i)$ can distinguish whether $p_i$ lies at a saddle or extreme point:

$$H = \begin{vmatrix} \dfrac{\partial^2 I}{\partial x^2} & \dfrac{\partial^2 I}{\partial x \partial y} \\ \dfrac{\partial^2 I}{\partial y \partial x} & \dfrac{\partial^2 I}{\partial y^2} \end{vmatrix}.$$

• H > 0: if $\partial^2 I / \partial x^2 > 0$, $p_i$ is a minimum; if $\partial^2 I / \partial x^2 < 0$, $p_i$ is a maximum.
• H < 0: $p_i$ is a saddle point.

Analogously to (14), we assign a larger value to all corner candidates:

$$P_{corn}(p_i) = \begin{cases} w_3, & p_i \in S_{corn} \\ w_4, & p_i \notin S_{corn} \end{cases} \tag{15}$$

where $w_3 > w_4$. We set the points with non-zero determinants of the Hessian matrices for their surrounding image patches as the corner candidates in $S_{corn}$. The energy for the edge/corner constraint is:

$$E_e(p_i) = \frac{1}{P_{edge}(p_i)} + \frac{1}{P_{corn}(p_i)}. \tag{16}$$

We simply search a 17 × 17 region around a fiducial candidate to find the minimum of the energy.

C. Texture Constraints

Our textural constraints share a similar idea with AAM [12]: PCA is applied to a local patch centered at each fiducial point, separately generating 8 texture templates for the fiducial points, $T_i$ $(i = 1, \ldots, 8)$. Herein, we use frontal upright faces to train the templates since local intensities are not so sensitive to global pose changes. The texture energy is:

$$E_t(p_i) = \| \nabla I(p_i) - \nabla T_i \|^2, \tag{17}$$

where ∇ denotes the gradient operator along x and y. We use the difference of gradients instead of the original patch intensities to reduce the effect of global intensity changes between images and templates. The optimization update can be derived similarly to AAM [12]. We sketch the process in Fig. 5.
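The texture energy (17) can be illustrated with a brute-force search over candidate patches. This is a simplified sketch under stated assumptions: a single stored patch stands in for the PCA-based template, the search is exhaustive rather than the AAM-style update mentioned above, and the names `patch`, `texture_energy`, `best_match` as well as the patch size and search radius are hypothetical.

```python
import numpy as np

def patch(image, center, half):
    """Square patch of side 2*half + 1 centered at (row, col)."""
    r, c = center
    return image[r - half:r + half + 1, c - half:c + half + 1]

def texture_energy(image, center, template, half):
    """Energy (17): squared difference between patch and template gradients."""
    grad_patch = np.stack(np.gradient(patch(image, center, half)))
    grad_template = np.stack(np.gradient(template))
    return float(np.sum((grad_patch - grad_template) ** 2))

def best_match(image, init, template, half, radius):
    """Candidate center within `radius` pixels of `init` with the lowest energy."""
    r0, c0 = init
    candidates = [(r, c)
                  for r in range(r0 - radius, r0 + radius + 1)
                  for c in range(c0 - radius, c0 + radius + 1)]
    return min(candidates, key=lambda rc: texture_energy(image, rc, template, half))

# Synthetic image; the template is cut from a known location for illustration.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
true_center = (30, 32)
template = patch(image, true_center, half=4).copy()

found = best_match(image, init=(32, 29), template=template, half=4, radius=8)
print(found)
```

Because only gradients are compared, adding a constant intensity offset to the image leaves the energy unchanged, which is the motivation given for using gradient differences.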


Fig. 5. Textural template matching of the left mouth corner. The red box indicates the desired patch of the fiducial point, and the blue boxes are candidate patches neighboring the desired point. The objective is to find the patch that best matches the textural template.

VI. EXPERIMENTAL RESULTS

In this section, we present our experimental setup, complexity analysis, and both qualitative and quantitative comparisons with 5 representative methods including Vukadinovic et al.'s [10], Milborrow and Nicolls's [15], Valstar et al.'s [4], Cao et al.'s [19] and Zhao et al.'s [35]. Vukadinovic et al. use the classical Gabor features,³ and Milborrow and Nicolls improve the ASM algorithm in various aspects.⁴ Both Valstar et al.'s⁵ and Cao et al.'s methods learn regressors for direct landmark prediction, but Valstar et al. explicitly model the global shape constraints with a graphical model. Zhao et al. take a pruning strategy with a learnt classifier that yields the state-of-the-art performance on LFW. We employ the normalized mean error (NME) as the metric for quantitative comparisons. This metric is widely accepted in comparative studies for facial point localization, and is defined as:

$$m_e = \frac{1}{n \, d_{lr}} \sum_{i=1}^{n} d_i, \tag{18}$$

where $n$ denotes the number of landmarks and the $d_i$ values are the Euclidean point-to-point distances between the estimated locations and the manually labeled ground truth. The mean distance is normalized by $d_{lr}$, the distance between the two pupils. We use a fixed distance of 80 for LFW facial images normalized to 200 × 200 for fair comparisons, since the pupil distances vary significantly with pose changes. The normalizing distance for LFPW and Helen is 100, as faces occupy a larger portion of the images in these two sets.

³ http://ibug.doc.ic.ac.uk/resources/fiducial-facial-point-detector-20052007/
⁴ http://www.milbo.users.sonic.net/stasm/
⁵ http://ibug.doc.ic.ac.uk/resources/facial-point-detector-2010/

A. Experimental Setup

Only frontal upright faces are necessary for the development of our algorithm. We use our collection of 515 frontal faces to find the facial shape priors on 3, 5 and 6 fiducial points as given in Sec. IV. These frontal collections are also the training examples for the PCA based templates of the local patches around the fiducial points shown in Fig. 5. The local textural features do not change as significantly as holistic ones under pose changes, and hence may suffice for robust localization together with our shape priors derived from CN.

We test our algorithm on 3 benchmark data sets, i.e., LFW [27], LFPW [16] and Helen [28], and on cross-set data. We set up our cross-set testing data from a commercial set and several public face sets including IMM-FACE-DB [53], LFW [27], AFLW [54], and Pointing’04 [55]. IMM-FACE-DB and Pointing’04 are sets of moderate scale captured under a controlled environment, which categorize facial images into identities and types of variations. LFW and AFLW have facial images in the wild, and the commercial set complements the testing images with faces of young children. We randomly select 500 images, a typical validation size for many existing methods [19], [31], [34], [38], from these sets.

We use the Viola-Jones face detector [56] available in OpenCV to pick out the faces in which both eyes, the nose and the mouth are present. The detector can also output the regions of the eyes and mouth for each detected face. We use these regions to roughly initialize the positions of the 8 fiducial points. This initialization is also adopted in [15], [19], and [35], and the method of [4] is able to “automatically take into account the error made by the face detector”.

B. Time Complexity Analysis

The calculation of CN (7) involves several determinant computations, which are available in MATLAB and other scientific computing packages. The determinant computation runs quite fast. An exhaustive strategy is used to discover the shape priors from 515 images in Section IV. We have 8 fiducial points, and need to enumerate the combinations of 3, 5 and 6 from 8 points. The number of all possible combinations is $C_8^3 + C_8^5 + C_8^6 = 140$. The exhaustive search over all 140 combinations for 515 frontal upright images takes about 2.8 seconds. We only store the 16 combinations and the corresponding intrinsic values, and no additional searching is necessary for any testing set.
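The combination count above can be checked with a short Python sketch that enumerates every 3-, 5- and 6-point subset of the 8 fiducial points. The names used here are hypothetical and the points are represented only by their indices; the CN evaluation of (7) on each subset is not reproduced.

```python
from itertools import combinations
from math import comb

NUM_POINTS = 8          # number of fiducial points
SUBSET_SIZES = (3, 5, 6)  # subset sizes on which CN priors are sought

def enumerate_point_subsets(num_points=NUM_POINTS, sizes=SUBSET_SIZES):
    """List every index subset of the given sizes drawn from the fiducial points."""
    subsets = []
    for k in sizes:
        subsets.extend(combinations(range(num_points), k))
    return subsets

subsets = enumerate_point_subsets()
# Closed-form count: C(8,3) + C(8,5) + C(8,6) = 56 + 56 + 28 = 140.
total = sum(comb(NUM_POINTS, k) for k in SUBSET_SIZES)
```

Both `len(subsets)` and `total` evaluate to 140, so an exhaustive pass over the training set visits 140 subsets per image.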
During the optimization, the image gradients in (17) and the Hessian matrix can be calculated prior to the iterations, resulting in fast local searches for the edge/corner and texture terms. We have to calculate CNs in every iteration of (12) as the locations of the fiducial candidates update. The gradient calculation for one iteration takes about 2.2 seconds; the energy decreases dramatically in the first 2-3 iterations and gradually converges in 8-10 iterations. It is possible to use other advanced sub-gradient techniques for faster convergence. Xiong and Torre [17] learn the descents from shape updates to the objective, which may also accelerate the gradient calculation of our approach in the future.

C. Localization Performance of Our Approach

Figures 6-9 show selected results of our algorithm on the testing images.⁶ Our algorithm works well on faces with pose and expression changes under controlled environments in IMM-FACE-DB and Pointing’04, as shown in Fig. 6. The shape constraints used by the algorithm help recover accurate positions when glasses or partial occlusions appear in the facial images of LFW and AFLW taken under uncontrolled environments, as shown in Fig. 7. The algorithm is also insensitive to low resolution (LR) as long as the face detector can find a face in the LR image.

The global facial geometry is preserved as a person grows. Our shape priors from CN, especially those on 5 and 6 points, reflect the geometry at a larger scale, and thus are applicable to images of babies and toddlers even though we discover the priors from adult faces. Figure 8 demonstrates the accurate localization of our algorithm on images of children from the commercial data set.

⁶ Results for all 500 images are available in the supplementary materials.

Fig. 6. Selected results from IMM-FACE-DB and Pointing’04.

Fig. 7. Selected results from LFW and AFLW.

Fig. 8. Selected results on children’s faces from the commercial set.


TABLE II AVERAGED NORMALIZED MEAN ERRORS

Fig. 9. Selected results on faces with large pose variations from the Pointing’04 set.
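The normalized mean error of (18), and a cumulative error distribution built from per-image NMEs, can be sketched in Python as follows. This is a minimal sketch, not the evaluation code used for the reported results: the function names are hypothetical, landmarks are assumed to be given as (x, y) rows, and the normalizing distance is passed in explicitly (for example the fixed value of 80 used for LFW).

```python
import numpy as np

def normalized_mean_error(pred, truth, norm_dist):
    """Mean point-to-point Euclidean error divided by a normalizing distance."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    d = np.linalg.norm(pred - truth, axis=1)  # d_i for each landmark
    return float(d.mean() / norm_dist)

def cumulative_error_distribution(nmes, thresholds):
    """Fraction of images whose NME is strictly below each threshold."""
    nmes = np.asarray(nmes, dtype=float)
    return np.array([(nmes < t).mean() for t in thresholds])
```

For instance, two landmarks with point errors of 5 and 0 pixels and a normalizing distance of 80 give an NME of 2.5 / 80 = 0.03125.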

Figure 9 shows that our method also works well on images with relatively large pose changes (yaw ≥ 60°) from Pointing’04. Feature points may occlude each other (self-occlusion) in images with a yaw angle close to 90 degrees. We cannot apply the CN based priors in this case as there are not enough points available for the calculation of CN.

We show the impact of the CN based shape priors by comparing the localization accuracies when using the collinearity (CN on 3 points), all CN constraints, and no shape constraint. We calculate NMEs on 110 testing images selected from the cross-set data, and plot the cumulative error distributions (CEDs) [1], [19] of the three configurations in Fig. 10. We can hardly reduce the errors below 0.15 for nearly 15% of the images if no geometric constraint is imposed. The collinearity improves the accuracy of eye corner localization so that the errors are down to about 0.12 for almost all images. More significantly, the errors are less than 0.1 for all the images when we combine more shape constraints from CN. The averaged NMEs are given in Tab. II. We achieve a 20% improvement in averaged accuracy, from 0.0649 to 0.0516, with the collinearity, and gain another 24% improvement, from 0.0516 to 0.0390, when combining more CN constraints. These improvements validate the use of our CN based shape priors.

Fig. 10. Cumulative error distributions of localizations with the collinearity (3-point CN, red dots), all CN constraints (green solid), and no shape constraints (blue dash dots). The x axis is the normalized mean error (NME), and the y axis indicates the percentage of images on which localization NMEs are lower than the x value.

D. Performance Comparisons on Localization Accuracy

We compare our approach with the 5 representative methods, [4], [10], [15], [19], [35], on all LFW images (13,232 faces of 5,749 subjects), all testing images of LFPW (208 images) and Helen (272 images), and our cross-set collections. We apply the downloaded implementations of [4], [10], and [15] and an executable program provided by the authors of [35] along with their trained models, following the protocols in [35]. We implemented [19] using Matlab, and trained Fern regressors using 2000 images from the LFW-A&G subset for the LFW tests. We complemented the training sets available in LFPW and Helen with 135 profile faces from IBUG⁷ to train Fern regressors for the respective tests. The 5 compared methods may have different landmark annotations, but the 8 fiducial points appear in all of their annotations. We pick out the 8 points (6 points without the 2 nostrils for Zhao et al.’s) from their localizations for comparison. Table III shows the comparisons of the detection rate of every individual fiducial point for LFW images. Detection rate gives the percentage of successfully detected (NME