Introduction
Problem statement
The analysis and interpretation of human movements is a key research theme of the Images and Signals laboratory. It involves examining and interpreting human motion through video analysis. Body language, including gestures, expressions, attitudes, and postures, plays a crucial role in face-to-face communication. Our interest lies in identifying and recognizing specific body actions: gestures, facial expressions, and head movements.
Facial expression recognition relies on analyzing the deformations of the permanent facial features. In addition, head movements, such as nodding, and mobile facial traits, such as blinking and yawning, provide valuable information for understanding emotions.
Internship objective
LIS has developed a method for analyzing facial feature deformations, focusing on the mouth, eyes, and eyebrows. However, fully automatic segmentation of the facial features has not yet been achieved. The LIS laboratory has also developed a method for analyzing both rigid and non-rigid head movements. The goal of this internship is to evaluate the performance of these two methods and to improve the automatic recognition of facial expressions.
Internship environment
My internship took place at the LIS (Laboratory of Images and Signals) over six months. The LIS is a Mixed Research Unit operated in collaboration with the National Center for Scientific Research (CNRS), the Grenoble Institute of Technology (INPG), and Joseph Fourier University.
The LIS is a recent laboratory, created in 1998 by the merger of CEPHAG and TIRF.
In 2007, LIS became part of a merger forming GIPSA (Grenoble Image Parole Signal Automatique), a new laboratory established together with the Institute of Spoken Communication (ICP) and the Grenoble Automatic Control Laboratory (LAG). This new structure aims to strengthen key themes in signal processing, imaging, and communication, while fostering collaboration in areas such as perception, multimodality, and diagnostics.
The scientific policy of LIS focuses on the challenges of processing and interpreting signals and images within the rapidly evolving field of information sciences. The laboratory's research activities are primarily directed towards signal and image processing, with two main application areas: geophysics and communications. This research dynamic is supported by the work of five specialized teams:
• Groupe Objets, Traitement et Analyse
• Signaux et Images dans les milieux Naturels
Organization of the report
The first part of this report presents tests of the head and facial movement analysis methods. The second chapter details the algorithms for head movement detection, eye state detection, and eye localization.
The third chapter introduces the facial expression recognition system, detailing the work undertaken and the solutions proposed for its improvement. It also presents the results obtained, with comparisons to the Hammal system.
The integration of my work into a project on multimodal emotion analysis during the eNTERFACE 2006 workshop is also presented, in Chapter 5.
Finally, Chapter 6 gives some conclusions and perspectives on this work.
Algorithm-based analysis of head and facial movements
Description of the algorithm
Several studies address 2D motion estimation methods. This chapter presents an approach developed by Alexandre Benoît, inspired by the functioning of the human retina. Figure 2-1 illustrates the overall scheme of the algorithm.
Figure 2-1 Diagram of the motion detection algorithm
The first stage is a pre-filtering process inspired by the human visual system: a retinal filter is applied, followed by a log-polar transformation of the spectrum of the filtered image for further analysis. Finally, the motion is interpreted.
The first step is the filtering, because the method relies on analyzing the frequency response of moving contours. It is essential to enhance these contours, since variations in illumination can obscure and alter them over time; noise also contributes significantly to their degradation.
The motion detector uses a retina filter to assess movement: a spatio-temporal filter designed to model the behavior of the human retina. This filter enhances moving contours, removes static contours, cancels spatio-temporal noise, and compensates for variations in illumination.
Pre-filtering occurs in two stages. The first stage is performed by the outer plexiform layer (OPL), which enhances all the contours present in the scene. The second stage, performed by the inner plexiform layer (IPL), accentuates moving contours while eliminating static ones.
At the OPL level, the spatio-temporal filter exhibits specific behaviors: at low temporal frequencies, it acts as a spatial band-pass filter, enhancing the contours in the processed image; at low spatial frequencies, it functions as a temporal band-pass filter, reducing local illumination variations; and at high temporal frequencies, it behaves as a spatial low-pass filter, effectively diminishing spatio-temporal noise.
Figure 2-2 Spatio-temporal transfer function [2]
At the IPL level, a temporal derivative operator performs this filtering: it keeps the moving contours while removing the static ones.
A major advantage of this filter is that the enhancement of moving contours runs in real time (processing at 25 frames per second).
After the retinal filtering stage, a log-polar transformation is applied. The overall energy of the spectrum is closely related to the amplitude of the movement present in the scene; in the absence of movement, this energy is minimal or zero. To estimate the direction of movement, the response of the spectrum is computed with a set of oriented Gabor filters.
In the log-polar domain, a spatial zoom or rotation becomes a translation of the spectrum. These properties allow the type of movement present in the analyzed scene to be interpreted.
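As an illustration, here is a minimal NumPy sketch of this spectrum analysis step, assuming the moving-contour image (the IPL output) is available. The simple log-polar sampling and orientation binning used here stand in for the bank of oriented Gabor filters of the original method and are assumptions of the sketch, not the original implementation.

```python
import numpy as np

def orientation_energies(moving_contours, n_orient=12, n_radius=32):
    """Accumulate log-polar spectrum energy per orientation.

    `moving_contours` stands in for the IPL output (moving contours only);
    it could be, for instance, the absolute difference of two retina-filtered
    frames.
    """
    # Amplitude spectrum, centred so that low frequencies are in the middle.
    spec = np.abs(np.fft.fftshift(np.fft.fft2(moving_contours)))
    h, w = spec.shape
    cy, cx = h / 2.0, w / 2.0

    # Log-polar sampling grid: radius on a log scale, angle in [0, pi).
    r_max = min(cy, cx)
    radii = np.exp(np.linspace(0.0, np.log(r_max), n_radius))
    angles = np.linspace(0.0, np.pi, n_orient, endpoint=False)

    energies = np.zeros(n_orient)
    for i, theta in enumerate(angles):
        ys = np.clip((cy + radii * np.sin(theta)).astype(int), 0, h - 1)
        xs = np.clip((cx + radii * np.cos(theta)).astype(int), 0, w - 1)
        energies[i] = spec[ys, xs].sum()

    total_energy = spec.sum()                # related to the amount of motion
    dominant = angles[np.argmax(energies)]   # orientation with maximum energy
    return total_energy, energies, dominant
```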
Event detection
In this part, we test an algorithm that detects, in a video scene, the presence of an event associated with motion in the analyzed scene.
2.2.1 Principle of the algorithm
From the analysis of the log-polar spectrum, we can interpret both the type of movement and its direction. To estimate the direction of movement, we accumulate the energies of the filter responses for each orientation. In this work, we use an indicator E1(t): the maximum energy of the IPL output. E1(t) is computed from the total energy E(t) of the log-polar spectrum of the OPL output, where E_bruit denotes the mean energy of the noise (see [1] for the exact formulation).
Figure 2-4 Temporal evolution of the total energy and of the maximum energy [1]
From the total energy of all the contours, motion alerts can be deduced thanks to an indicator α: if α > 0, this corresponds to motion, and vice versa. For more details, see article [1].
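A minimal sketch of the alert logic described above, assuming α is simply the excess of the energy E1(t) over a noise level E_bruit estimated on motion-free frames; the exact definition of α is the one given in [1].

```python
import numpy as np

def motion_alerts(e1, noise_window=25):
    """Raise a motion alert whenever the energy rises above the noise level.

    `e1` is the per-frame maximum energy of the IPL output (E1(t)).
    E_bruit is estimated here as the mean energy over the first
    `noise_window` frames, assumed to contain no motion (an assumption of
    this sketch).
    """
    e1 = np.asarray(e1, dtype=float)
    e_bruit = e1[:noise_window].mean()

    alpha = e1 - e_bruit        # stand-in for the indicator alpha
    alerts = alpha > 0          # alpha > 0  <=>  motion detected
    return alpha, alerts
```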
The goal here is to determine the performance of the motion detector on arbitrary video scenes.
We tested videos captured under various conditions, including standard office lighting, low light, and noisy environments. The scenes include both outdoor settings, such as streets, and indoor environments such as offices. The moving objects in the footage include people and vehicles.
The success, false alarm, and miss rates are determined as follows:
Success rate = number of correctly detected alerts / number of movements (ground truth)
False alarm rate = number of false alerts / number of movements (ground truth)
Miss rate = number of missed alerts / number of movements (ground truth)
Total coverage = duration detected by the program / duration detected by hand
A miss and a false alarm are defined in Figure 2-5:
Figure 2-5 Example of a miss and a false alarm
Misses and false alarms are identified manually, as shown in Figure 2-4: when a peak disappears in the second graph, it indicates a miss, while the opposite indicates a false alarm.
Total coverage measures the algorithm's ability to estimate the duration of a movement. Note that the basic method is not designed for this purpose; it is intended to detect the onset of movements.
Figure 2-6 gives an example of a result. Each event corresponds to a movement, and the duration of each detected movement can be measured.
Figure 2-6 Evolution of α; each peak corresponds to a movement
The following table summarizes the collected results, evaluating the previously described indicators and the time lag between alerts detected by the algorithm and the ground truth established through manual labeling.
Mean of the time lags / Standard deviation of the time lags
Table 2-1 Event detection results
We tested a large number of videos under normal lighting conditions, yielding 940 test alerts, while only 38 alerts were recorded in low-light conditions. The motion detector achieved a high success rate of 95-96% in detecting movements, with very low false alarm and miss rates. The time lag between the alerts detected by the algorithm and those noted visually is acceptable, indicating good synchronization of the algorithm. The coverage rate, however, is somewhat imprecise, which is expected given the methodology used.
This algorithm accurately detects short movements (figure above), but long movements are detected as a series of shorter alerts.
Figure 2-7 A long movement is detected as a series of shorter alerts
From frame 685 to frame 720 there is a single movement, but the detector reported several (each peak corresponds to a movement).
The program detected movements effectively even in low-light conditions, with better miss rate, standard deviation, and overall coverage figures than under normal lighting. However, given the limited amount of data available for evaluating performance in these conditions, caution is advised: the rates are only of the same order of magnitude.
To test movements under noisy conditions, I inserted noise into the video (using a filter of the VirtualDUB program).
Signal-to-noise ratio: SNR = 7 dB
Mean of the time lags: 13.92 (without noise) / 25.67 (with noise)
Standard deviation of the time lags: 20.38 (without noise) / 59.96 (with noise)
Table 2-2 Event detection results for video without and with noise
The false alarm rate remains unchanged; however, the percentage of missed detections has increased significantly. This is because the system evaluates the background noise before detecting movement: when movements are too subtle, they may be mistaken for noise and missed. In addition, in the test video several moving objects matched the background color, which further complicated detection in the presence of noise. This explains the high miss rate. In contrast, on the noise-free video the system successfully detected all the alerts, although with greater delays.
Detection of the open or closed state of the mouth and eyes
In this section, we examine how to detect whether the mouth or an eye is open or closed. The key idea is that more contours are present when the mouth or eye is open than when it is closed. We therefore estimate the total energy of the spectrum of the OPL filter output, which is higher when the mouth or eye is open than when it is closed.
Figure 2-8 Log-polar spectrum and orientations of the IPL (moving contours) filter output for different eye movements: blinking and change of gaze direction
2.3.1.1 Blink detection
First, the eye must be located. The MPT algorithm detects a bounding rectangle around the face. Each eye is searched for in the upper quarter of the detected face, while the mouth is analyzed in the lower half. When the eye is open, the contours of the iris become visible, so the state of the eye can be determined by examining the changes in energy over time.
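A small sketch of how the search regions can be derived from the face bounding box returned by MPT; the (x, y, w, h) convention and the exact quarter/half split are assumptions based on the description above.

```python
def facial_rois(face_box):
    """Derive eye and mouth search regions from a face bounding box.

    `face_box` = (x, y, w, h) with (x, y) the top-left corner (assumed
    convention).  Each eye is searched in one of the two upper quarters of
    the box and the mouth in its lower half, as described in the text.
    """
    x, y, w, h = face_box
    left_eye = (x, y, w // 2, h // 2)            # upper-left quarter
    right_eye = (x + w // 2, y, w // 2, h // 2)  # upper-right quarter
    mouth = (x, y + h // 2, w, h // 2)           # lower half of the face
    return left_eye, right_eye, mouth
```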
Figure 2-9 Evolution of the total OPL energy for the eye and the mouth
The figure above shows that the total OPL energy increases progressively with the degree of openness of the eye or mouth. The energy level associated with the open state is significantly higher than that of the closed state.
Another application of this system is yawn detection. When yawning, the mouth stays open longer than during shouting or speaking, and then closes rapidly. Analysis shows that the energy during a yawn is more than 1.5 times greater than that of normal movements and twice that of the closed-mouth state. Yawning can therefore be detected by monitoring the vertical motion of the mouth opening or closing, together with the changes in energy level: a factor of about 2 during opening and 0.5 during closing.
Figure 2-10 Evolution of the OPL energy during mouth movements
The green curve in the figure shows the behavior of the total energy of the OPL filter spectrum. During speech sequences, this energy reaches a maximum value nearly twice that of the closed state. During a yawn, the energy increases significantly over a longer duration and reaches a very high level. In the figure, the low level corresponds to the closed state and the high level to the open state.
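The following sketch illustrates how yawns could be flagged from this energy signal, assuming the factor-2 opening threshold mentioned above and a minimum opening duration; the threshold values and the simplified hysteresis (the closing factor of 0.5 is folded into a single threshold) are assumptions, not the original detector.

```python
import numpy as np

def detect_yawns(opl_energy, closed_level, min_open_frames=25):
    """Flag yawns from the total OPL energy of the mouth region.

    `opl_energy` is the per-frame total OPL energy and `closed_level` the
    energy level of the closed mouth (e.g. measured on a silent segment).
    An opening is flagged when the energy roughly doubles with respect to
    the closed level, and a yawn is an opening that lasts longer than
    `min_open_frames`.
    """
    e = np.asarray(opl_energy, dtype=float)
    is_open = e > 2.0 * closed_level      # factor 2 during opening
    yawns = []
    start = None
    for i, open_now in enumerate(is_open):
        if open_now and start is None:
            start = i                      # mouth just opened
        elif not open_now and start is not None:
            if i - start >= min_open_frames:   # long opening => yawn
                yawns.append((start, i))
            start = None
    return yawns
```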
This section evaluates the performance of the open/closed state detectors for the mouth and eyes. Tests are run on the whole mouth as well as on partial areas (50% and 30%), while the eyes are tested at 100% and 50% coverage of the analysis zone. Since the eye is much smaller than the mouth, 30% coverage for the eye is too small to be considered.
For example, a test eye covers approximately 50*140 pixels.
Moreover, since the proposed method analyzes the quantity of contours present in the image, too small an eye area makes no sense (the iris may disappear from it).
For yawn detection, videos were collected from ten people performing voluntary or simulated yawns, making sure that their hands did not hide their mouths. The test database consists of 152 minutes of footage containing 203 yawns, evenly divided between simulated and natural yawns. These test videos are interspersed with periods of inactivity (silence) and segments of speech, with and without yawns.
Here are the results for mouth state detection.
Mouth / % Success / % False alarms / % Misses
Table 2-3 Results for mouth state detection
And the results for eye state detection:
Eye / % Success / % False alarms / % Misses
Table 2-4 Results for eye state detection
Remarks
The success rate remains high even with partial mouth or eye coverage. When 100% of the areas of interest lie within the analysis window, the results are nearly perfect. With only 30% of the mouth and 50% of the eye visible, the false alarm and miss rates remain within acceptable limits.
Yawn detection test results
Table 2-5 Yawn detection results
Remarks
The results show that natural and simulated yawns are detected comparably, with a low false alarm rate and little confusion between speech and yawning. The miss rate is higher, however, due to confusion with speech movements for low-amplitude yawns.
2.4 Detection of head movement orientation
This section deals with the automatic detection of head nods and, more generally, of the orientation of head movement, which are important cues in face-to-face communication. Motion detection relies on the event detection module described previously, and the analysis of the log-polar spectrum provides the orientation of the movement.
Figure 2-11 Rigid head movements: a- vertical translation, b- vertical rotation, c- lateral rotation, d- oblique rotation
To improve the performance of the system, optical flow can be used to estimate both the speed and the direction of the movement; in the case of oblique rotations, optical flow provides more reliable information. When movement is detected, the two sources of information are combined.
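As an illustration of the optical-flow side of this estimate, here is a sketch using OpenCV's dense Farnebäck flow as a stand-in; the original system may use a different flow estimator, and the magnitude threshold is an assumption.

```python
import cv2
import numpy as np

def dominant_flow_direction(prev_gray, next_gray, mag_thresh=1.0):
    """Estimate the dominant direction of head motion between two frames.

    `prev_gray` and `next_gray` are 8-bit grayscale frames.  Dense optical
    flow is computed, small displacements are discarded, and the median
    angle of the remaining vectors is returned in degrees
    (0 = rightward, 90 = downward in image coordinates).
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    moving = mag > mag_thresh                # keep clearly moving pixels
    if not np.any(moving):
        return None                          # no significant motion
    angles = np.degrees(np.arctan2(dy[moving], dx[moving]))
    return float(np.median(angles))
```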
We evaluated the performance of the system on a 123-minute database, in which the ground-truth movement duration is about 75 minutes.
Estimation of the movement orientation: 93% / 7%
Table 2-6 Results of movement direction detection
The table above shows that the results are accurate; the remaining errors are attributed to oblique rotations. Turning the head also moves the mouth and eyes, which leads to detecting a vertical orientation in the contours of the mouth and eyes. From this information, an application can be built to detect nods indicating approval or negation.
Eye localization
This section describes the localization of the eye, starting from the bounding box around the face and deriving individual bounding boxes around each eye. The MPT algorithm provides the coordinates of a bounding box surrounding the face. The core idea of the algorithm is to determine the position of the eye within a quarter of the face.
Figure 2-12 The eye is localized within one quarter of the detected face bounding box
We assume that the eye is the only element with contours in multiple orientations. The goal is to find the point of the search area where the vertical and horizontal contours together yield the strongest energy. The output of the OPL filter is extracted in the search area, separately for the vertical and for the horizontal contours.
The figure below shows the OPL filter response within the search area and the results of the two one-dimensional filterings. Their product is then computed: the position of maximum energy coincides with the center of the iris.
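A sketch of this localization step, using Sobel filters as a stand-in for the two one-dimensional OPL filterings; the smoothing and the filter choice are assumptions of the sketch.

```python
import cv2
import numpy as np

def locate_iris(eye_region_gray, blur_sigma=3):
    """Locate the iris centre in an eye search region.

    Vertical and horizontal contour maps are extracted (here with Sobel
    filters standing in for the two 1-D OPL filterings), their product is
    smoothed, and the iris is taken at the maximum of the product, i.e.
    where both orientations respond strongly, as described in the text.
    """
    img = eye_region_gray.astype(np.float32)
    vertical = np.abs(cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3))    # vertical contours
    horizontal = np.abs(cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3))  # horizontal contours
    product = cv2.GaussianBlur(vertical * horizontal, (0, 0), blur_sigma)
    y, x = np.unravel_index(np.argmax(product), product.shape)
    return x, y   # coordinates of the assumed iris centre inside the region
```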
Figure 2-13 OPL filter output in the search area
2.5.2 Tests performed
This test is conducted on two face databases, Feret and BioID, for which the ground truth is known. These databases contain a variety of faces under different conditions, including people wearing glasses.
Test database / Average iris size / Image size / Number of images / Mean deviation / Standard deviation
Table 2-7 Eye localization results
The results are satisfactory; however, accuracy drops when glasses are present, since the contours of the eye then give a weaker response. In addition, some elements, such as the frame of the glasses or parts of the hair, have contours with more energy than the iris.
This algorithm detects the center of the iris with satisfactory accuracy. Compared with Hammal's method it is faster, at the cost of some precision. It also works well on lower-resolution images and is less sensitive to noise, making it a robust choice for iris detection.
Facial expression recognition
Existing facial expression recognition system
This system addresses the recognition of the six universal emotions (happiness, sadness, fear, anger, surprise, and disgust) by analyzing the deformations of facial features such as the eyes, eyebrows, and mouth. These features are assumed to provide sufficient information for accurate expression recognition.
Figure 3-1 Steps of the facial expression recognition process
During the segmentation phase, the system identifies the regions of the eyes, eyebrows, and mouth; the algorithm used is detailed in Hammal's thesis. In the data extraction step, the contours of the eyes, mouth, and eyebrows are extracted as skeletons. From these skeletons, five distances are selected to characterize the deformations.
Segmentation of the facial features
This section explains how expression skeletons are extracted from an image, under the key assumption that the user is facing the camera. Using an eye, eyebrow, and mouth contour segmentation algorithm, the facial feature outlines are extracted automatically (see Figure 3-2).
Extraction of characteristic data
The considered features are the contours of the eyes, the eyebrows, and the lips; from them, the expression skeletons are obtained.
Figure 3-2 Contour extraction and definition of the 5 distances
From the expression skeleton, the deformations of the facial features can be determined. Five distances are defined for each skeleton.
D1 Distance between the upper and lower eyelid
D2 Distance between the inner corner of the eye and that of the eyebrow
D4 Vertical opening of the mouth
D5 Distance between a mouth corner and the outer corner of the eye
Table 3-1 Definition of the distances
Classification with the Transferable Belief Model
Using the characteristic distances, we employ the Transferable Belief Model for facial expression recognition. Each distance value is associated with one of three symbolic states:
• state C+, for which the distance Di is larger than for the neutral expression
• state S, for which the distance Di is of the same order of magnitude as for the neutral expression
• state C-, for which the distance Di is smaller than for the neutral expression
Each facial expression is then characterized by a combination of symbolic states, with a specific set of states assigned to each expression.
E6 Fear SC+ SC+ SC- SC+ S
Table 3-2 Symbolic states associated with each expression
3.4.1 The Transferable Belief Model
Based on evidence theory, the Transferable Belief Model was developed by Smets.
To represent the degree of confidence in each proposition A of the set 2^Ω, an elementary mass of evidence m(A) is associated with it. This mass indicates the total confidence that can be placed in that proposition, with m: 2^Ω → [0, 1].
Here, the set Ω is defined as {C+, C-, S} and the corresponding set 2^Ω as {{S}, {C+}, {C-}, {SC+}, {SC-}}; the propositions {C+, C-} and {S, C+, C-} are considered impossible in this context. The union of S and C+, denoted SC+, represents the state of doubt between S and C+; similarly, the union of S and C-, denoted SC-, represents the state of doubt between S and C-.
The model is defined by the figure below.
Figure 3-3 The thresholds for each distance
Each distance value Di is assigned one of the symbols {C+, C-, S, SC+, SC-} together with an associated mass of belief; the thresholds (a, b, c, d, e, f, g, h) differ for each expression. A sketch of this assignment is given after the list of thresholds below.
• The threshold h of state C+ corresponds to the mean of the maximum values of Di over all subjects and all expressions.
• Similarly, a corresponds to the mean of the minimum values of Di over all subjects and all expressions.
• The thresholds d and e correspond to the mean of the maximum and minimum values of Di over all images of the neutral expression.
• The thresholds f, g, b and c are defined as [mean +/- median] of each state (C+, C-, S).
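Below is a sketch of such a basic belief assignment for one distance, assuming a simple piecewise-linear model with a neutral plateau and two transition zones in which the mass moves through the doubt states; the real thresholds (a..h) and ramp shapes are those defined in Hammal's thesis [3], so the interpolation used here is an assumption.

```python
def distance_masses(di, lo_c, lo_s, hi_s, hi_c):
    """Basic belief assignment for one characteristic distance Di.

    Values well below the neutral range give C-, the neutral plateau
    [lo_s, hi_s] gives S, values well above give C+, and in the two
    transition zones the mass moves continuously through the doubt
    states SC- and SC+ (thresholds play the role of a..h in Figure 3-3).
    """
    def transition(t, low_state, doubt_state, high_state, m):
        # t in [0, 1]: all mass on low_state at 0, on doubt at 0.5, on high at 1.
        m[low_state] = max(1.0 - 2.0 * t, 0.0)
        m[high_state] = max(2.0 * t - 1.0, 0.0)
        m[doubt_state] = 1.0 - m[low_state] - m[high_state]

    m = {"C-": 0.0, "SC-": 0.0, "S": 0.0, "SC+": 0.0, "C+": 0.0}
    if di <= lo_c:
        m["C-"] = 1.0
    elif di < lo_s:                                   # C- -> S transition
        transition((di - lo_c) / (lo_s - lo_c), "C-", "SC-", "S", m)
    elif di <= hi_s:
        m["S"] = 1.0
    elif di < hi_c:                                   # S -> C+ transition
        transition((di - hi_s) / (hi_c - hi_s), "S", "SC+", "C+", m)
    else:
        m["C+"] = 1.0
    return m
```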
To reach a final decision, the information from all the distances (viewed as sensors) is fused so that all available information is taken into account. This fusion uses Dempster's theory. Based on Table 3-2 above, a rule base can be established linking the characteristic distances to the seven expressions.
Table 3-3 Logical rules of the symbolic states for the characteristic distance D1
This table is used to associate the belief masses of each symbolic state with each corresponding expression.
To derive the final belief mass associated with each expression or set of expressions, a collection of basic belief masses has to be merged. This merging is done with Dempster's combination rule, known as the orthogonal sum, defined (in its unnormalized, conjunctive form) by:

(m1 ⊕ m2)(C) = Σ over A ∩ B = C of m1(A) · m2(B)

A, B, and C denote expressions or subsets of expressions. The goal of this combination is to transfer the evidence weight onto propositions that are more precise, i.e. that contain fewer elements than the original propositions.
An interesting aspect of this model is its ability to represent conflict. Conflict arises when a configuration does not match any of the expression descriptions in the table above. To handle conflict cases, an expression E8, called "unknown" (rejection), is introduced.
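A compact sketch of this combination, using the unnormalized conjunctive form of the orthogonal sum and routing the conflict mass onto E8 as described above; the representation of propositions as sets of expression labels is an implementation choice of the sketch.

```python
from itertools import product

def combine(m1, m2):
    """Conjunctive combination (orthogonal sum) of two belief mass functions.

    Masses are dictionaries mapping frozensets of expressions
    (e.g. frozenset({"E1", "E2"})) to masses in [0, 1].  The mass assigned
    to an empty intersection (the conflict) is kept and reported on the
    special class "E8" (unknown), as in the system described above.
    """
    out = {}
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        key = inter if inter else frozenset({"E8"})   # conflict -> unknown
        out[key] = out.get(key, 0.0) + ma * mb
    return out

# Example: one distance favouring Joy (E1), another hesitating.
m_d1 = {frozenset({"E1"}): 0.8, frozenset({"E1", "E3"}): 0.2}
m_d2 = {frozenset({"E1", "E2"}): 0.6, frozenset({"E3"}): 0.4}
print(combine(m_d1, m_d2))
```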
The decision step consists of choosing one expression among the different hypotheses Ei and their possible combinations.
The expressions Joy and Disgust can be confused, because their symbolic state descriptions overlap in the table above. To remove this ambiguity, Hammal proposes using the shape of the mouth in a post-processing stage.
Contributions to the facial expression recognition system
In this part, we present some contributions and achievements of this internship for the current system.
3.5.1 Contour detection and point tracking
In Hammal's thesis, methods were proposed for extracting the contours. The system works well on a static image; with a video sequence, however, there are limitations.
The current system uses the face detector of Viola and Jones, available through the Machine Perception Toolbox (MPT). This tool extracts from the localized face, represented as a bounding box, the information needed for facial expression analysis. However, face detection fails in 4% of the cases. To address this, we propose using OpenCV, a free image analysis and computer vision library developed by Intel in C/C++. OpenCV also provides face detection, but it only returns the bounding box of the face and does not locate features such as the eyes, unlike MPT.
When neither of the two detectors finds a face in the image, the information needed for the contour segmentation step is missing.
3.5.1.1 Tracking the coordinates of the face bounding box
To solve the above problem, tracking is added to the contour detection.
When MPT does not detect a face, the following linear prediction is used:

X_pred(i) = α · X(i) + α · X(i-1) + α · X(i-2)

where X(i) are the face coordinates found by OpenCV in frame i, and X(i-1) and X(i-2) are the coordinates in frames i-1 and i-2. The weight α is predefined; α = 0.33 is chosen so as to give the same importance to each frame used.
The predicted coordinates are then used as input parameters for the contour detection.
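A sketch of this fallback, assuming the prediction is the equally weighted (α = 0.33) combination of the OpenCV detection and the two previous boxes, as reconstructed above; it should be checked against the original report.

```python
import numpy as np

def predict_face_box(x_opencv_i, x_prev1, x_prev2, alpha=0.33):
    """Predict the face bounding box for frame i when MPT fails.

    Each argument is a box (x, y, w, h): the OpenCV detection in the
    current frame and the boxes kept from frames i-1 and i-2.  The three
    sources are combined with the same weight alpha = 0.33, i.e. each
    frame contributes equally.
    """
    boxes = np.array([x_opencv_i, x_prev1, x_prev2], dtype=float)
    return alpha * boxes.sum(axis=0)   # equal importance for each frame used
```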
3.5.1.2 Tracking of characteristic points of the eye, eyebrow, and mouth
Facial feature segmentation gives good results for neutral expressions and low-intensity facial expressions. In some cases, however, detection errors occur.
Figure 3-4 Left: the features are well detected; right (Disgust): the eyebrows are not well detected
In these cases, the characteristic points move more rapidly. To overcome these limitations, we propose a method for tracking the characteristic points, based on the Lucas-Kanade algorithm.
Let I(x, y, t) and I(x + δx, y + δy, t + δt) be the luminance values in the two frames Ft and Ft+1, where (δx, δy) is the displacement of the pixel at position (x, y).
We look in frame Ft+1 for the zone most similar to a zone Z of size n×n in frame Ft. To find it, we minimize the cost given by the sum of squared differences over all the pixels of the zone:

E(δx, δy) = Σ over (x, y) in Z of w(x) · [I(x + δx, y + δy, t + δt) - I(x, y, t)]²

where w(x) is a weighting function. Normally w(x) is equal to 1, but to give more importance to the central point a Gaussian or sinusoidal shape can be chosen.
The tracked points chosen here are the corners of the eyes and of the eyebrows.
Figure 3-5 The tracked points
In general, the tracking algorithm gives correct points from frame to frame. Nevertheless, after a few frames the error accumulates.
A combination of the tracking algorithm and the static contour extraction algorithm is therefore implemented.
The following rule is applied:
If |tracked point - detected point| > 2 × Threshold
    take the mean of the tracked point and the detected point
else
    take the detected point
end
The threshold is set to 0.3 for the eye and the eyebrow, and to 0.15 for the mouth.
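The sketch below combines pyramidal Lucas-Kanade tracking (OpenCV's calcOpticalFlowPyrLK) with the fusion rule above; the window size and the conversion of the 0.3 / 0.15 thresholds into pixel units are assumptions of the sketch.

```python
import cv2
import numpy as np

def track_and_fuse(prev_gray, next_gray, prev_pts, detected_pts, seuil_pixels):
    """Track characteristic points with pyramidal Lucas-Kanade and fuse them
    with the points returned by the static contour-extraction step.

    `prev_gray` and `next_gray` are 8-bit grayscale frames; `prev_pts` and
    `detected_pts` are arrays of (x, y) points (e.g. eye and eyebrow
    corners).  `seuil_pixels` is the fusion threshold expressed in pixels
    (the 0.3 / 0.15 values in the text are assumed to be normalised and
    must be rescaled by the caller).
    """
    p0 = np.float32(prev_pts).reshape(-1, 1, 2)
    tracked, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, p0, None, winSize=(21, 21), maxLevel=3)
    tracked = tracked.reshape(-1, 2)
    detected = np.float32(detected_pts)

    # Fusion rule from the text: a detection that lands far from the tracked
    # position (> 2 * threshold) is considered unreliable and is averaged
    # with the tracked point; otherwise the detected point is kept.
    fused = detected.copy()
    dist = np.linalg.norm(tracked - detected, axis=1)
    far = (status.ravel() == 1) & (dist > 2.0 * seuil_pixels)
    fused[far] = 0.5 * (tracked[far] + detected[far])
    return fused
```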
Figure 3-6 Contours from the automatic segmentation, before and after applying the tracking algorithm
However, some limitations remain. When the eyes are closed, the system cannot detect the irises, which are needed for accurately segmenting the eye contours; this leads to errors.
Figure 3-7 When the eyes are closed, errors occur
The figure above shows some examples of these errors.
Second, when the subject is very expressive, with the mouth and the eyes wide open (as in Surprise), the system detects wrong contours.
Figure 3-8 False detections when the subjects are too expressive
The figure above shows two cases of incorrect segmentation. On the left, the expression is Surprise: the eyes are wide open and the system produces an inaccurate result. For the mouth, the system confuses the boundary between the teeth and the lips in both images; the right image corresponds to the expression Disgust.
3.5.2 Computation and filtering of the distances
The measured distances are normalized by the distance between the irises in order to analyze the facial expressions. To observe more easily how each distance changes with each expression, its evolution is used.
Figure 3-9 Evolution of distance 5 with respect to the neutral distance
Evol(i) denotes the evolution of Dmesure(i), the distance measured in image i and normalized by the inter-iris distance Diris(i), expressed relative to Dneutre, the corresponding distance in the first image, which is assumed to show a neutral expression. The changes of distance 5 are illustrated in the figure above.
However, the characteristic points are not always well detected, so the curve is wavy.
To smooth the curve, a low-pass (Gaussian) filter is applied to the evolution curves.
After the filtering step, the normalized distances are used for the recognition step.
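A sketch of this normalization and smoothing chain, assuming Evol(i) is the iris-normalized distance divided by its value in the first (neutral) frame and using a Gaussian filter from SciPy for the low-pass step; the exact definition and filter parameters should be checked against the report.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def normalised_evolutions(distances, iris_distances, sigma=2.0):
    """Normalise the five characteristic distances and smooth them over time.

    `distances` has shape (n_frames, 5).  Each distance is divided by the
    inter-iris distance of the same frame, then expressed relative to its
    value in the first frame (assumed neutral), and finally smoothed with a
    Gaussian low-pass filter along the time axis.
    """
    d = np.asarray(distances, dtype=float)
    iris = np.asarray(iris_distances, dtype=float)[:, None]
    norm = d / iris                       # scale-invariant distances
    neutral = norm[0]                     # first frame taken as neutral
    evol = norm / neutral                 # evolution w.r.t. the neutral state
    return gaussian_filter1d(evol, sigma=sigma, axis=0)
```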
The facial expression recognition code by Hammal was largely unreadable. I therefore rewrote all the parts related to emotion recognition, incorporating my contributions, in order to evaluate the performance of the approach.
Results
We ran tests on the HCE database, which contains four expressions over 4,237 frames: Joy, Surprise, Disgust, and Neutral. The table below presents the recognition results obtained with automatic segmentation. Some sequences are missing, however, which prevents a complete comparison with Hammal's results.
Figure 3-11 Some illustrations of the expression Joy. The image on the left corresponds to the Neutral state, the one on the right to the Joy state. Next to each image, the indicator shows the mass of evidence.
Figure 3-12 Some illustrations of the expression Disgust. The image on the left shows the Neutral state, the one on the right the Disgust state. Next to each image, the indicator shows the mass of evidence; the gray bar indicates an alternative with a lower mass of evidence.
Figure 3-13 Some illustrations of the expression Surprise. The image on the left corresponds to the Neutral state, the one on the right to the Surprise state. Next to each image, the indicator shows the mass of evidence.
Here is the final results table.
Expert / E1-Joy / E2-Surprise / E3-Disgust / E7-Neutral
Table 3-4 Facial expression recognition results on the HCE database
The expression Disgust has the lowest recognition rate, although it is higher than in Hammal's results. Because of inaccuracies in the automatic facial feature detection, the Unknown expression is often confused with Surprise. Some images are labeled Unknown because they correspond to an intermediate state in which the subject is neither neutral nor displaying a specific expression.
A significant advantage is that the recognition rate for the neutral expression is 100%. Only neutral sequences are tested; neutral moments within expression sequences are not considered.
Workshop eNTERFACE 2006
Presentation of the eNTERFACE workshops
The eNTERFACE workshops are organized by the European network of excellence SIMILAR. They aim to foster research and development collaborations among researchers, PhD students, and students from various countries over a four-week period. Each project's outcomes include open-source software and a final project report. The second workshop took place in Dubrovnik, Croatia, in 2006.
Emotion detection project (project 7)
This project aims to develop a multimodal emotion detection technique using three modalities: brain signals recorded by functional Near Infrared Spectroscopy (fNIRS), facial video analysis, and Electroencephalogram (EEG) signals. Three specific emotions are targeted within this framework.
The following instruments are used to record the signals:
- fNIRS sensor for recording the activity of the frontal brain
- EEG sensor to capture the activity of the rest of the brain
- Camera and computer for recording the face video
In this project, the signals are combined in pairs. The EEG is strongly affected by facial muscle movements during the production of facial expressions. fNIRS, on the other hand, is a modality that can be effectively combined with either the video or the EEG signals.
Figure 4-1 Recording of video and fNIRS
In this project, I worked on video-based emotion detection, using the Transferable Belief Model to classify the facial expressions.
Automatic segmentation of the facial features is performed. However, the headband of the fNIRS sensors that the subject must wear on the forehead covers the eyebrows.
In addition, the feature point tracking algorithm had not yet been implemented at that time, so precise eyebrow detection is not possible: the detector struggles to distinguish the headband from the skin.
Figure 4-2 Result of the automatic segmentation
We designed the database structure so that each video is stored as individual image files. Each file is named according to the subject's name, the date and time of recording, and the type of stimulus.
Results of the video and fNIRS recordings:
- Three stimuli per class (neutral, happiness, disgust)
- 1h20 of emotional reactions (5 hours of recording in total)
Conclusions and perspectives on the project
In this project, we built a substantial common database of video, fNIRS, and EEG data. The synchronization issue between the modalities has been addressed, except for the EEG and video pair. Due to time constraints, we have not yet merged the information analyzed from each modality. Since no single modality is sufficient for emotion assessment, a multimodal approach should improve classification performance. With the temporal differences between the modalities now resolved, the data fusion step can proceed.
Conclusions and perspectives
The first part of this report presents a head and facial movement analysis method inspired by a biological approach. The algorithm effectively detects head movement orientation, which leads to applications for recognizing approval or negation from a subject. By analyzing the eye and mouth states, it can also identify blinks, yawns, non-verbal cues, and gaze direction. The second part presents contributions to facial expression recognition based on evidence theory. By improving the automatic segmentation of the facial features, we obtained promising results, showing that the characteristic facial contours are sufficient to classify the expressions and that the Transferable Belief Model is well suited to this task. A significant advantage of this model is its ability to model unknown expressions. Because of facial feature segmentation errors, the recognition rate is lower than with manual segmentation, which is acceptable. Nonetheless, some limitations remain, such as the requirement for a fixed head position; one perspective is to use Benoît's motion analysis method to address this issue. Finally, facial features are not the only cues for emotion detection: additional information such as speech and brain signals can also be useful.
[1] A. Benoit, A. Caplier. Motion Estimator Inspired from Biological Model for Head Motion Interpretation. WIAMIS05, Montreux, Switzerland, April 2005.
[2] W.H.A. Beaudot. The neural information processing in the vertebrate retina: A melting pot of ideas for artificial vision. PhD Thesis in Computer Science, INPG (France), December 1994.
[3] Z. Hammal. Segmentation des traits du visage, analyse et reconnaissance d'expressions faciales par le Modèle de Croyance Transférable. PhD Thesis in Cognitive Science, Université Joseph Fourier, Grenoble, June 2006.
[4] A. Benoit, A. Caplier. Hypovigilence Analysis: Open or Closed Eye or Mouth? Blinking or Yawning Frequency? IEEE AVSS, International Conference on Advanced Video and Signal based Surveillance, Como, Italy, September 2005.
[5] Workshop eNTERFACE 2006 http://www.enterface.net/enterface06/
[6] Machine Perception Toolbox (MPT). http://mplab.ucsd.edu/grants/project1/free-software/mptwebsite/introduction.html
[7] Open Computer Vision Library http://sourceforge.net/projects/opencvlibrary/
[8] A. Sarvan, K. Cifti, G. Chanel, J.C. Motta, H-V. Luong, B. Sankur, L. Akarun, A. Caplier, M. Rombaut. Emotion Detection in the Loop from Brain Signals and Facial Images. Project 7, Workshop eNTERFACE 2006, Dubrovnik, Croatia. Final report: http://enterface.tel.fer.hr/docs/reports/P7-report.pdf
[9] Smets, Ph. Data fusion in the Transferable Belief Model. Proc. ISIF, France (2000), 21-
[10] N. Eveno, A. Caplier, P.Y. Coulon. Automatic and Accurate Lip Tracking. IEEE Trans. on CSVT, Vol. 14 (2004), 706-715.
[11] Z. Hammal, A. Caplier. Eye and Eyebrow Parametric Models for Automatic Segmentation. IEEE SSIAI, Lake Tahoe, Nevada (2004).
[12] Z. Hammal. Facial Features Segmentation, Analysis and Recognition of Facial Expressions using the Transferable Belief Model. 29-06-2006.
[13] FERET Database http://www.itl.nist.gov/iad/humanid/feret/
[14] BIOID Database http://www.bioid.com/downloads/facedb/index.php
[15] A. Benoit, A. Caplier, L. Bonnaud. Gaze direction estimation tool based on head motion analysis or iris position estimation. EUSIPCO2005, Antalya, Turkey, September 2005.