VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
STUDENT RESEARCH REPORT
TOPIC: RESEARCH ON FACIAL EMOTION RECOGNITION METHOD BASED ON COMPUTER VISION
Team Leader: Nguyen Thi L
INTRODUCTION
Literature review
Facial Expression Recognition (FER) is a field of artificial intelligence that focuses on automatically detecting and classifying human facial expressions. Today, FER is both an important research field and a valuable practical technology. From security and surveillance systems to improving the user experience in human-machine communication, FER is becoming an important technology in everyday life. By analyzing facial expressions, FER provides valuable information about people's moods and intentions and opens up potential applications in many fields, from medicine to education and entertainment.
Traditional approaches to FER use handcrafted features such as Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP), and SIFT. These approaches often require careful and accurate data preprocessing to extract suitable features from faces. However, in complex and diverse scenarios they often perform worse than state-of-the-art deep learning methods such as RAN, SCN, and KTN. These deep learning approaches have benefited from large-scale datasets that provide sufficient training data from challenging real-world scenarios, and they have shown significant performance improvements over traditional methods.
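To make the idea of a handcrafted feature concrete, the following is a minimal sketch of the basic 8-neighbour Local Binary Pattern, written with the Python standard library only. The function names, the clockwise bit order, and the tiny example patch are illustrative assumptions and are not taken from the methods compared in this report.

```python
def lbp_code(img, r, c):
    """Return the 8-neighbour LBP code of pixel (r, c) in a 2D list of grey levels."""
    center = img[r][c]
    # Neighbours are visited clockwise, starting at the top-left pixel
    # (this bit order is an assumption; other orders are equally valid).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit to 1.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Return a 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


# Hypothetical 3x3 grey-level patch used only for illustration.
patch = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
print(lbp_code(patch, 1, 1))  # prints 120
```

In a traditional FER pipeline, a histogram of this kind would typically be computed per face region and passed to a classifier; the quality of the result then depends heavily on the preprocessing, as noted above.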
Despite the great progress made so far, several challenges remain in FER:
• Inter-class similarity: Similar images with subtle differences between them can belong to different expression categories. As illustrated in Fig 1 (a), a small change in a specific region of an image, such as the mouth, can determine the expression category even when the overall appearance remains largely unchanged. Because these differences are so subtle, current methods may not be robust enough to differentiate between such images.
• Intra-class discrepancy: Within the same expression category, images can differ significantly: the skin tone, gender, and age of the person vary across samples, as does the image background. As shown in Fig 1 (b), two images can both represent the expression of happiness but have very different visual appearances.
• Scale sensitivity: When naively applied, deep learning-based networks are often sensitive to image quality and resolution. Image sizes vary considerably both within FER datasets and in in-the-wild test images. Therefore, ensuring consistent performance across scales is critical in FER.
Figure 1: Inter-class similarity (a) and intra-class discrepancy (b)
In response to these challenges, recent FER research has focused on enhancing attention to detail and promoting invariance to intra-class differences. To achieve this, facial landmarks have been incorporated into deep learning-based FER methods, such as CNNs and Multi-Task Deep Learning Networks. However, these approaches often use simple concatenation to combine image and landmark features, which may limit their effectiveness.
Simply combining facial landmarks with image features in this way is likely to lose detail. With a cross-fusion transformer design and an integrated feature pyramid structure, we address all three problems in a unified framework, bridging this research gap and achieving new state-of-the-art (SOTA) results on several popular benchmarks.
Overall, our contributions are summarized as follows:
• We propose a Pyramid Cross-Fusion Transformer (POSTER) network to alleviate the inter-class similarity, intra-class discrepancy, and scale sensitivity problems in FER.
• The cross-fusion transformer structure ensures that image features can be guided by landmark features, with prior attention to salient facial regions, while landmark features can use the global information provided by image features beyond the landmarks.
• We extensively validate the effectiveness and efficiency of our proposed POSTER.
We show that POSTER outperforms previous state-of-the-art methods on three commonly used datasets (92.05% on RAF-DB, 67.31% on AffectNet, and 91.62% on FERPlus).
Related work
Deep learning in FER is a part of the field of artificial intelligence that focuses on recognizing and classifying emotions from human faces using deep learning models. It often uses deep neural network architectures such as convolutional neural networks (CNNs) to extract features from facial images. These models are then trained on large datasets of facial images labeled with the corresponding emotions.
Facial landmarks in FER: Facial landmark detection estimates the locations of predefined key points on the human face and is typically used to collect information about facial structure and expression. The detected landmarks are used to analyze physical features of the face such as mouth lines, eye corners, nose shape, and wrinkles. This information helps the system recognize and understand facial expressions. Significant progress has been made in facial landmark detection using deep learning techniques, and many accurate facial landmark detectors have been proposed.
The POSTER (Pyramid Cross-Fusion Transformer) method is a two-stream network for FER. It uses a transformer-based cross-fusion of facial landmark features and image features to direct attention to salient facial regions, and a pyramid structure to promote scale invariance. In this way it targets inter-class similarity, intra-class discrepancy, and scale sensitivity within a single framework.
Research Objectives and Scope
The goal of the research is to provide comprehensive insight into and knowledge of computer vision and facial emotion analysis methods. From there, we evaluate the results and compare them with previous studies in order to propose new, more effective methods.
To achieve the above goal, we will research the following issues:
● Study many different methods to find a method for recognizing facial emotions based on computer vision, going deep into the Baseline method and the
● Develop data preprocessing, feature extraction and classification methods based on deep learning methods.
● Evaluate and compare different methods on collected data sets.
● Build a website to analyze and evaluate the POSTER method for facial emotion recognition.
Research Methods
A facial emotion recognition system based on computer vision goes through the following stages: pre-processing facial images to obtain the best image quality, then extracting features and classifying emotions.
● Pre-processing: this stage improves the performance of the facial emotion recognition system by applying processes such as alignment, image scaling, brightness and contrast adjustment, and more advanced enhancement processes that improve the expression frames.
● Feature extraction: this is an important stage in machine vision. It marks the transition from a graphical description to a data description by extracting the most distinctive features of the image; these data descriptions then serve as input to the classification problem.
● Emotion classification: this is the final stage of the FER system, which classifies facial emotions as happiness, sadness, surprise, anger, fear, disgust, or neutral. Classification methods such as Decision Tree (ID3), SVM, and HMM (Hidden Markov Model) can be used. We will compare these methods and choose the one with the best classification accuracy for our recognition system.
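As an illustration of the pre-processing stage listed above, the sketch below shows two simple operations, nearest-neighbour image scaling and min-max contrast stretching, in plain Python. The function names and the 2x2 example image are assumptions made for this example; a real system would normally rely on an image library and would also include face alignment.

```python
def resize_nearest(img, out_h, out_w):
    """Scale a 2D list of grey levels to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def stretch_contrast(img):
    """Map grey levels linearly so the darkest pixel becomes 0.0 and the brightest 1.0."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # A flat image has no contrast to stretch.
        return [[0.0 for _ in row] for row in img]
    return [[(p - lo) / (hi - lo) for p in row] for row in img]


# Hypothetical low-contrast 2x2 face crop used only for illustration.
face = [
    [50, 100],
    [150, 200],
]
resized = resize_nearest(face, 4, 4)
normalized = stretch_contrast(resized)
print(resized[0])        # prints [50, 50, 100, 100]
print(normalized[3][3])  # prints 1.0
```

Bringing every face to a fixed size and intensity range in this way gives the feature extraction stage inputs that are comparable across images.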
To achieve the research objective, this research paper includes:
A Introduction: we present an overview of the research, research objectives, research methods, conclusions, research structure, research problems, and techniques used to achieve the objectives in this chapter.
RESEARCH CONTENT
BUILD A FACIAL EMOTION RECOGNITION SYSTEM
IV Conclusion and development direction
I OVERVIEW OF FACIAL EMOTION RECOGNITION ON COMPUTER VISION
1.1: Basic Concepts Of Facial Emotion Recognition
1.1.1: Definition of emotions and recognition of facial emotions
Emotions are physical and mental states caused by neurophysiological changes, involving thoughts, feelings, behavioral responses, and levels of pleasure or discomfort. There is no scientific consensus on a definition. Emotions are often associated with mood, temperament, personality, character, or creativity. Lexico's definition of emotion is "A strong feeling that results from a person's circumstances, mood, or relationship with another person." Emotions are reactions to important internal and external events[14].
Facial expression recognition (FER) is the process of automatically detecting and classifying human expressions by analyzing images or videos of faces. This is an important task in computer vision, with practical applications in areas such as human-computer interaction, education, healthcare, and online surveillance[1].
Figure 2: Steps of Facial Emotion Recognition
FER analysis comprises three steps: a) face detection, b) facial expression detection, and c) classification of the expression into an emotional state.
Detecting emotions from facial expressions has always been an easy task for humans, but achieving the same with computer algorithms is quite difficult. With recent advancements in computer vision and machine learning, it has become possible to detect emotions from images.
1.1.2: The basic emotions of facial emotion recognition
Facial expressions arise as muscles beneath the skin move, resulting in dozens of distinct displays. Despite their diversity, researchers have identified specific expressions universally associated with particular emotions.
Charles Darwin, a prominent proponent of the evolutionary theory of emotions, observed similarities in facial movements between primates and humans, particularly in expressions of disgust. These movements include lip curling, nose wrinkling, and eye squinting. As social behaviors evolved, these expressions came to play a more significant communicative role, allowing individuals to convey emotions and facilitate interactions within their social groups.
Studies in the 1960s by American psychologist Paul Ekman also supported this hypothesis. Worldwide, he studied six main emotions expressed through the face: happiness, sadness, anger, fear, surprise, and disgust. Facial emotion recognition typically involves identifying these basic emotions from facial expressions. They are expressed through the facial muscles as follows:
a, Happiness: Typically characterized by a smile, raised cheeks, wide eyes, and sometimes wrinkling around the eyes.
b, Sadness: Often marked by a downward curve of the lips, drooping eyelids, and sometimes tears or a quivering chin.
c, Anger: Usually recognized by narrowed eyes, furrowed brows, and tightened jaw muscles.
d, Fear: Commonly identified by widened eyes, raised eyebrows, and a tense or frozen expression.
e, Disgust: Often involves wrinkling of the nose, a raised upper lip, and a narrowing of the eyes.
f, Surprise: Typically shown by widened eyes, raised eyebrows, and sometimes an open mouth.
These emotions are considered to be universal across cultures, and facial recognition technology often focuses on identifying and categorizing these expressions in images or videos.
1.2: Methods For Recognizing Facial Emotions
Facial emotion recognition is a popular and important research area in artificial intelligence and image processing, and it has been studied with many different methods. Here are some articles and research on facial emotion recognition based on computer vision:
A widely studied approach is deep learning with the Residual Network (ResNet) architecture, a technique that has recently shown very positive results on object recognition problems. Many convolutional neural network (CNN) models have been applied to recognition problems, such as LeNet, AlexNet, VGG, GoogLeNet, and ResNet; in this study, we examine the ResNet101 network and its structure.
Figure: ResNet101
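The defining idea of ResNet is the skip connection: each block adds its input back to the output of its learned transformation, so the block computes F(x) + x. The sketch below illustrates this with small dense layers in plain Python. This is a deliberate simplification and an assumption of this example: the actual ResNet101 uses convolutional layers with batch normalization, and the weights shown here are hypothetical.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between them.

    Dense layers stand in for the convolutions of a real ResNet block.
    """
    hidden = relu(linear(w1, x))
    residual = linear(w2, hidden)
    # Skip connection: the input is added back before the final activation.
    return relu([r + xi for r, xi in zip(residual, x)])


# With all-zero weights F(x) is zero, so the block passes relu(x) through.
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_block([1.0, -2.0, 3.0], zeros, zeros))  # prints [1.0, 0.0, 3.0]
```

The zero-weight case shows why very deep networks such as ResNet101 remain trainable: a block that has learned nothing still carries its input forward instead of destroying it.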
Convolutional Neural Networks (CNNs) are deep learning algorithms used for feature extraction, classification, and recognition. CNNs offer the advantage of "end-to-end" learning, reducing reliance on physics-based models and preprocessing techniques. They have achieved remarkable results in fields like object and face recognition, scene understanding, and Facial Emotion Recognition (FER). Researchers have used CNN visualization techniques to understand what models learn on various FER datasets. These analyses showcase the capabilities of trained networks in emotion detection across datasets and FER-related tasks.
Another approach identifies emotion using facial features. We use the Haar Cascades method to check whether a face exists in the image; if no face exists, the system returns to the start and reads the next image frame. If a face exists, the eyes and mouth are located and the eye and mouth regions are cropped. Filtering and edge detection are carried out using the Sobel edge detection method, followed by feature extraction. A neural network is then trained on the extracted features.
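The Sobel edge detection step in this pipeline can be sketched as follows. The code applies the two standard 3x3 Sobel kernels to a grey-level image stored as a 2D list and returns the gradient magnitude. The function name, the choice to leave border pixels at zero, and the example image are assumptions of this sketch; the face detection and cropping steps are not shown.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical intensity change.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Return the Sobel gradient magnitude of a 2D list; border pixels stay 0.0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = gy = 0
            for i in range(3):
                for j in range(3):
                    p = img[r + i - 1][c + j - 1]
                    gx += SOBEL_X[i][j] * p
                    gy += SOBEL_Y[i][j] * p
            out[r][c] = math.hypot(gx, gy)
    return out


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
edge_img = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(edge_img)
print(edges[1][1])  # prints 1020.0
```

High magnitudes mark the boundaries of the eyes and mouth in the cropped regions, which is why edge maps are a common input to the later feature extraction step.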
A more recent method is FER based on Arousal-Valence (AV), which uses continuous labels to overcome the limitations of categorical FER. It uses a continuous AV space defined by the activation (arousal) and positivity (valence) of emotions. Compared with categorical FER, AV-based FER with continuous labels can in theory capture complex facial expressions and micro-expressions, and even detect hidden emotions[4].
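To show how a continuous AV label differs from a categorical one, the sketch below represents an emotion as a point in the AV plane and derives a coarse quadrant name and an intensity from it. The value range of [-1, 1], the quadrant names, and the function names are assumptions made for illustration and are not taken from the cited method.

```python
import math


def av_quadrant(valence, arousal):
    """Name the quadrant of the AV plane; both inputs are assumed to lie in [-1, 1]."""
    if arousal >= 0:
        return "excited/happy" if valence >= 0 else "angry/afraid"
    return "calm/content" if valence >= 0 else "sad/bored"


def av_intensity(valence, arousal):
    """Distance from the neutral origin, used here as a rough emotion intensity."""
    return math.hypot(valence, arousal)


print(av_quadrant(0.8, 0.6))   # prints excited/happy
print(av_quadrant(-0.7, 0.5))  # prints angry/afraid
```

Two faces that a categorical model would both label as "happy" can receive different AV coordinates, which is what lets a continuous model express subtle or mixed expressions.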
LA-Net: Landmark-Aware Learning for Reliable Facial Expression Recognition under Label Noise. The authors present a new FER model called Landmark-Aware Net (LA-Net), which takes advantage of facial landmarks to reduce the impact of label noise from two angles. First, LA-Net uses landmark information to suppress uncertainty in the expression space and constructs the label distribution of each sample by neighborhood aggregation, thereby improving the quality of training supervision. Second, the model incorporates landmark information into expression representations using a specially devised expression-landmark contrastive loss. The resulting expression feature extractor is less susceptible to label noise. This method can be integrated with any deep neural network for better training supervision without incurring additional inference overhead. The authors conduct extensive experiments on both in-the-wild datasets and synthetic noisy datasets and demonstrate that LA-Net achieves state-of-the-art performance[5].
Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition. When unexpected objects occlude the face, the FER network has difficulty extracting facial features and predicting facial expressions accurately, so occluded FER (OFER) is a challenging problem. Previous studies on occlusion-aware FER often require fully annotated facial images for training. However, collecting facial images with many different occlusions and expression annotations is time-consuming and costly. Latent-OFER, the proposed method, can detect occlusions, restore the occluded parts of the face as if they were not occluded, and recognize the expression, improving FER accuracy. This approach consists of three steps. First, a vision transformer (ViT)-based occlusion patch detector masks the occluded positions by training only on the latent vectors of unoccluded patches using the support vector data description algorithm. Second, a hybrid reconstruction network regenerates the masked positions as a complete image using the ViT and a convolutional neural network (CNN). Finally, an expression-relevant latent vector extractor retrieves and uses expression-related information from all latent vectors by applying a CNN-based class activation map. This mechanism has a significant advantage in preventing performance degradation caused by occlusion from unseen objects. Experimental results on several databases demonstrate the superiority of the proposed method over state-of-the-art methods[6].
POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition. The authors observe that FER suffers from three problems: inter-class similarity, intra-class discrepancy, and scale sensitivity. Therefore, they proposed a two-stream Pyramid crOss-fuSion TransformER network (POSTER), which aims to solve all three problems comprehensively. Specifically, they design a transformer-based cross-fusion method that enables efficient collaboration between facial landmark features and image features to maximize appropriate attention to salient facial regions. Furthermore, POSTER uses a pyramid structure to promote scale invariance[7].
FER faces three challenges: inter-class similarity, intra-class discrepancy, and scale sensitivity. To address these issues, we introduce the Pyramid Cross-Fusion Transformer (POSTER) network. POSTER uses a transformer-based cross-fusion approach that effectively combines facial landmark features and image features. This allows selective attention to crucial facial regions, enhancing emotion recognition accuracy.
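The cross-fusion idea can be illustrated with plain scaled dot-product cross-attention, in which the queries come from one stream and the keys and values come from the other. The sketch below is a simplified illustration and not the POSTER implementation: it omits the learned projection matrices, multiple heads, and the pyramid structure, and the token values and function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value tokens."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output token is a weighted average of the value tokens.
        fused.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return fused


# Hypothetical tokens: one landmark token and two image tokens of dimension 2.
landmark_tokens = [[1.0, 0.0]]
image_tokens = [[1.0, 0.0], [0.0, 1.0]]

# Landmark queries select the image tokens most relevant to the landmarks.
# Swapping the arguments gives the other direction of the fusion.
guided = cross_attention(landmark_tokens, image_tokens, image_tokens)
print(guided[0][0] > guided[0][1])  # prints True
```

Unlike simple concatenation, the attention weights here depend on how well each image token matches each landmark token, so the fused feature emphasizes the facial regions that the landmarks point to.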
II FACIAL EMOTION RECOGNITION BASED ON CNN