Design and implement reception Robot for student base on computer vision and natural language processing

Structure

  • CHAPTER 1: INTRODUCTION
    • 1.1 Define a problem
    • 1.2 Objectives
    • 1.3 Project scopes
    • 1.4 Research methods
    • 1.5 Content
  • CHAPTER 2: LITERATURE REVIEW
    • 2.1 Survey of robots being used in service industry
      • 2.1.1 Mission of robots in the service industry
      • 2.1.2 Pepper robot
    • 2.2 Background of face recognition system
      • 2.2.1 Concept
      • 2.2.2 Structure and procedure for face recognition
    • 2.3 Color spaces in image processing
      • 2.3.1 RGB color space (Red-Green-Blue)
      • 2.3.2 HSV color space (Hue-Saturation-Value)
    • 2.4 Viola-Jones algorithm
      • 2.4.1 Concept
      • 2.4.2 Haar-like feature
      • 2.4.3 Integral image
    • 2.5 Histogram of Oriented Gradients algorithm
      • 2.5.1 Concept
      • 2.5.2 Gradient calculation methodology
      • 2.5.3 Steps of HOG calculation
    • 2.6 Support Vector Machine algorithm
      • 2.6.1 Concept
      • 2.6.2 Building optimization problems for SVM
    • 2.7 Background of speech recognition system
      • 2.7.1 Concept
      • 2.7.2 The speech signal and its representation
      • 2.7.3 Applications
      • 2.7.4 Advantages and disadvantages of speech recognition
  • CHAPTER 3: HARDWARE DESIGN AND IMPLEMENTATION
    • 3.1 Requirement analysis for hardware selection
    • 3.2 Computers being used for computer vision and embedded applications
    • 3.3 Camera and microphone
    • 3.4 Monitor
    • 3.5 Speaker
    • 3.6 Design sketch of robot frame
    • 3.7 Implementation
  • CHAPTER 4: SOFTWARE DESIGN AND ALGORITHMS
    • 4.1 Programming languages and environment being used for computer vision
      • 4.1.1 Programming language
      • 4.1.2 Tools for coding
      • 4.1.3 OpenCV and support packages
    • 4.2 Method of face recognition
      • 4.2.1 Face capture
      • 4.2.2 Face train
    • 4.3 Method of voice assistant
    • 4.4 Process flow diagram
  • CHAPTER 5: EXPERIMENT RESULTS, FINDINGS AND ANALYSIS
    • 5.1 Face mask detection
    • 5.2 Face recognition
    • 5.3 Voice assistant
    • 5.4 Reception robot real-time working
  • CHAPTER 6: CONCLUSION AND FUTURE WORKS
    • 6.1 Conclusion
    • 6.2 Future works

Content

INTRODUCTION

Define a problem

Autonomous robotics is a research field which has been in development since the middle of the 20th century, and it is currently one of the main areas of interest within the field of Robotics. Even though great breakthroughs have been achieved throughout the years, this area still has a long way to go, as much in terms of sensory, mechanical, and mobility capabilities as in artificial intelligence and decision-making, before it can achieve efficient and flexible behaviors comparable to the ones observed in animals and humans.

In a research survey, people reported being afraid of meeting each other in public places such as airports, hospitals, schools and public transport. In fact, there are some activities for which students do not need to communicate face to face, and robots can make this work easier, especially as the coronavirus spreads around the world and urgent action is needed to prevent this disease. Up to now, relying on data collected from national governments, the number of individuals in the world infected with coronavirus has reached over 140 million cases, and this figure in Viet Nam was about 2.8 thousand cases. Besides, it is true that the trend of using robots and smart machines can definitely be harnessed to tackle this problem.

The receptionist robot can help to reduce the risk of spreading this disease, and it can be seen as a replacement for reception jobs in the future. In detail, the main purposes of the receptionist robot are described by two factors: interaction and coordination. The robot should be able to talk with humans and maintain eye contact, both of which are needed for a good conversation. Besides, it would be better if the reception robot could move flexibly, but positioning and tracking pose a considerable problem. Some of the available robots are sold at a very high cost and their applications are built for various purposes. As a result, it would be a waste of money if the robot cannot be used in the right way.

Thanks to the rapid development of technology, Computer Vision has become popular with many people. Applications of computer vision can solve increasingly complex problems, facial recognition for example, which has been continuously developed from classical techniques to the state of the art. Another field of artificial intelligence is natural language processing, in which computers analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, speech recognition, and topic segmentation. These two applications are extremely crucial for building a receptionist robot because they give the robot the ability to interact with humans, especially in decision-making.

Based on the thesis statements considered, the project team proposed the project title "DESIGN AND IMPLEMENT RECEPTION ROBOT FOR STUDENT BASE ON COMPUTER VISION AND NATURAL LANGUAGE PROCESSING".

Objectives

● Design a receptionist robot which supports student affairs such as discussion of lecture information, course enrollment, navigation to functional rooms and some important announcements.

● The robot can be in charge of detecting and recognizing students and teachers with high accuracy.

Project scopes

● The working environment for this project is supposed to be indoors, where noise and backlighting effects are low.

● The input face images are frontal images, with no more than half of the face missing, under normal lighting conditions.

Research methods

The project is approached based on the following methods:

● Document study method covering topics relevant to the project, such as the theory of Computer Vision and of Natural Language Understanding and Generation by computers

● Study of available methods for detecting and recognizing human faces, a survey of voice recognition methods, and an assessment of some common receptionist robots that are already in use

● Experimental method for constructing the robot, computing and programming

Content

In the rest of this report, the project has chapters as follows:

● Chapter 2: Literature review

To create a receptionist robot that can meet the set criteria, it is necessary to conduct research on the technologies currently applied to receptionist robots. The content of chapter 2 describes the popular reception robots currently on the market, as well as the popular facial recognition and voice recognition technologies applied to the robot's virtual assistant function.

● Chapter 3: Hardware design and implementation

The content of chapter 3 is related to the design of drawings and hardware implementation, including the analysis of the essential requirements of the robot. Based on that, the equipment is selected according to the requirements and assembled to form a complete reception robot.

● Chapter 4: Software design and algorithms

The content of chapter 4 is related to the software design for the receptionist robot. The software has two main parts, face recognition and voice assistant, together with a suitable environment for the technologies to be used. Based on the operational requirements, the flowchart and the state diagram are designed to describe the robot's processing.

● Chapter 5: Experiment results, findings and analysis

The content of chapter 5 describes the experimental results of the face recognition model by analyzing its accuracy on the data sets when the robot is actually operating; the virtual assistant model is also evaluated and its limitations addressed.

● Chapter 6: Conclusion and future works

The content of chapter 6 summarizes the results that the project has achieved and the future development path for overcoming the limitations and further improving the reception robot.

LITERATURE REVIEW

Survey of robots being used in service industry

When discussing robots and their uses, it is important to first establish what they actually are. In simple terms, a robot is a machine which has been built to carry out complex actions or tasks automatically. Some robots are designed to resemble humans and these are called androids, but many robots do not take such a form.

Modern robots can be either autonomous or semi-autonomous and may make use of artificial intelligence (AI) and speech recognition technology. That being said, most robots are programmed to perform specific tasks with great precision, an example being the industrial robots seen in factories and on production lines.

2.1.1 Mission of robots in the service industry

Part of the reason why robots have emerged as a popular technology trend within the hospitality industry is that ideas of automation and self-service are playing an increasingly vital role in the customer experience. The use of robots can lead to improvements in terms of speed, cost-effectiveness and even accuracy.

For example, chatbots allow a hotel or travel company to provide 24/7 support through online chat or instant messaging services, even when staff are unavailable, delivering extremely swift response times. Meanwhile, a robot used during the check-in process can speed up the entire procedure, reducing congestion.

2.1.2 Pepper robot

Pepper is a semi-humanoid robot manufactured by SoftBank Robotics (formerly Aldebaran Robotics), designed with the ability to read emotions. It was introduced at a conference on 5 June 2014 and was showcased in SoftBank Mobile phone stores in Japan beginning the next day. Pepper's ability to recognize emotion is based on the detection and analysis of facial expressions and voice tones. The image below shows the Pepper robot working in a mobile store [1].

Figure 2.1 Pepper robot working in a mobile store

The robot's head has four microphones, two HD cameras (in the mouth and forehead), and a 3-D depth sensor (behind the eyes). There is a gyroscope in the torso and there are touch sensors in the head and hands. The mobile base has two sonars, six lasers, three bumper sensors, and a gyroscope.

It is able to run the existing content in the app store designed for SoftBank's Nao robot. Some key information about the robot is shown in the specifications table below.

Table 2-1 Specifications of Pepper robot

Head Mic × 4, RGB camera × 2, 3D sensor × 1

Legs Sonar sensor × 2, Laser sensor × 6, Bumper sensor × 3, Gyro sensor × 1

Moving parts (degrees of motion) Head (2°), Shoulder (2° L&R), Elbow (2 rotations L&R), Wrist (1° L&R), Hand with 5 fingers (1° L&R), Hip (2°), Knee (1°), Base (3°)

Background of face recognition system

2.2.1 Concept

Facial recognition is a way of identifying or confirming an individual's identity using their face. Facial recognition systems can be used to identify people in photos, videos, or in real time. The problem of face recognition began to be studied in the 1960s, and since then many approaches have been proposed to solve it, but it was not until the end of the twentieth century that this technology achieved significant results.

2.2.2 Structure and procedure for face recognition

Generally, a face recognition system is often described as a process which involves four stages: face detection, face alignment, feature extraction, and finally face recognition

Figure 2.2 A typically procedure for face recognition model

Regarding the image above, it can be concluded that a face recognition model contains four stages, as described in detail below.

Face detection: As can be seen from the chart, the input of face detection is a sequence of images captured from a video stream. The detected faces may need to be tracked across multiple frames using a face tracking component. While face detection provides a coarse estimate of the location and scale of the face, face landmarking localizes facial landmarks (e.g., eyes, nose, mouth, and facial outline). This may be accomplished by a landmarking module or face alignment module. In short, face detection locates one or more faces in the image and marks them with a bounding box [2].

Face alignment: This stage is performed to normalize the face geometrically and photometrically. This is necessary because state-of-the-art recognition methods are expected to recognize face images with varying pose and illumination. The geometrical normalization process transforms the face into a standard frame by face cropping; warping or morphing may be used for more elaborate geometric normalization. The photometric normalization process normalizes the face based on properties such as illumination and gray scale [2].

Feature extraction: This is vital for face recognition. Face feature extraction is performed on the normalized face to extract salient information that is useful for distinguishing faces of different persons and is robust with respect to geometric and photometric variations. The extracted face features are used for face matching, which is described in the next stage [2].

Feature matching: The final stage matches the face against one or more known faces in a prepared database. The matcher outputs 'yes' or 'no' for 1:1 verification; in the case of 1:N identification, the output is the identity of the input face when the top match is found with sufficient confidence, or unknown when the top match score is below a threshold. The main challenge in this stage of face recognition is to find a suitable similarity metric for comparing facial features [2].
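As a rough illustration of these stages (not the method used in this project, which is detailed in Chapter 4), the open-source face_recognition package exposes detection, embedding extraction and matching directly; the file names below are only examples.

```python
# Minimal sketch of the detection / feature extraction / matching pipeline
# using the open-source face_recognition package (illustrative file names).
import face_recognition

known_image = face_recognition.load_image_file("student_known.jpg")
unknown_image = face_recognition.load_image_file("camera_frame.jpg")

# Feature extraction: 128-dimensional embeddings (alignment is handled internally)
known_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encodings = face_recognition.face_encodings(unknown_image)

# Feature matching: 1:1 verification against the known face
for encoding in unknown_encodings:
    match = face_recognition.compare_faces([known_encoding], encoding, tolerance=0.6)[0]
    print("yes" if match else "no")
```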

Color spaces in image processing

2.3.1 RGB color space (Red-Green-Blue)

RGB color models use additive color mixing, in which red, green, and blue light are combined in different ways to form other colors. Colors are represented as one or more integer values. The RGB color model is represented below.

Figure 2.3 RGB color space (Red-Green-Blue)

If each color channel is encoded with 1 byte (8 bits), taking values in the range [0, 255], then we have a 24-bit color image, and 2^8 × 2^8 × 2^8 = 16,777,216 colors can be encoded (about 16 million colors). For example, some of the basic colors represented in the RGB color space are: [0; 0; 0] is black, [255; 255; 255] is white, [255; 0; 0] is red, [0; 255; 0] is green, and [0; 0; 255] is blue.
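As a small illustration of the 24-bit encoding, the basic colors above can be written as NumPy pixels (note that OpenCV itself stores channels in BGR order):

```python
import numpy as np

# Each channel uses 1 byte, so a pixel is one of 256**3 = 16,777,216 colors.
black = np.array([0, 0, 0], dtype=np.uint8)
white = np.array([255, 255, 255], dtype=np.uint8)
red   = np.array([255, 0, 0], dtype=np.uint8)   # [R, G, B] order
green = np.array([0, 255, 0], dtype=np.uint8)
blue  = np.array([0, 0, 255], dtype=np.uint8)
print(256 ** 3)   # 16777216 possible colors
```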

2.3.2 HSV color space (Hue-Saturation-Value)

The HSV color space, closely related to HSI (Hue-Saturation-Intensity) and HSL (Hue-Saturation-Lightness), is based on visual color properties such as tint, shade, and tone; in other words, color, purity, and brightness. The figure below shows a brief description of the HSV color space.

Figure 2.4 HSV color space (Hue-Saturation-Value)

● Hue: the color tone, which runs from 0 to 360 degrees

● Saturation: the degree of purity of the color, i.e., how much white is mixed with the pure color. The value of S is in the range [0, 255], where S = 255 is the purest color, containing no white at all. In other words, the larger S is, the purer the color.

● Value: also known as Intensity or Lightness, with values in the range [0, 255], where V = 0 is completely dark (black) and V = 255 is completely bright. In other words, the larger V is, the brighter the color.
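For reference, converting an OpenCV BGR frame to HSV can be sketched as below; note that OpenCV scales Hue to [0, 179] for 8-bit images, while S and V stay in [0, 255]. The file name is only an example.

```python
import cv2

frame = cv2.imread("frame.jpg")                 # loaded in BGR channel order
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)    # H in [0, 179], S and V in [0, 255]
h, s, v = cv2.split(hsv)
```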

Viola-Jones algorithm

The Viola-Jones algorithm is named after the two computer vision researchers who proposed the method in 2001, Paul Viola and Michael Jones, in their paper "Rapid Object Detection using a Boosted Cascade of Simple Features". Despite being an older framework, Viola-Jones is quite powerful, and its application has proven to be exceptionally notable in real-time face detection. This algorithm is painfully slow to train but can detect faces in real time with impressive speed [3].

Given an image (the algorithm works on grayscale images), the algorithm looks at many smaller subregions and tries to find a face by looking for specific features in each subregion. It needs to check many different positions and scales because an image can contain many faces of various sizes. Viola and Jones used Haar-like features to detect faces in this algorithm.

The Viola-Jones algorithm has four main steps, which we shall discuss in the sections following the figure below:

Figure 2.5 Stages in executing Viola-Jones algorithm

In this project, "AdaBoost training" and "Cascading classifiers" were not applied in implementing the robot's program; only the first two stages, "Haar feature selection" and "Creating an integral image", were applied.

In explanation, Haar-like features, or rectangular Haar filters, provide information about the distribution of the gray levels of two adjacent regions in an image.

Isolated pixel values do not give any information other than the luminance and/or the color of the radiation received by the camera at a given point. A recognition process can therefore be much more efficient when it is based on the detection of features that encode some information about the class to be detected.

This is the case for Haar-like features, which encode the existence of oriented contrasts between regions in the image. A set of these features can be used to encode the contrasts exhibited by a human face and their spatial relationships, as shown in figure 2.6 [3].

These prototypes are scaled independently in the vertical and horizontal directions in order to generate a rich, over-complete set of features.

Figure 2.7 An image containing faces after featuring by Haar

The features are then built into a mask and scanned over the images containing faces. The results show that facial features such as the eyes, nose, and mouth are well defined and visible, as shown in figure 2.7.
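For reference, OpenCV ships pretrained Haar cascades that bundle all four Viola-Jones stages (including the AdaBoost and cascading steps that this project does not use); a minimal detection sketch, with an illustrative file name and parameter values, looks like this:

```python
import cv2

# Load OpenCV's pretrained frontal-face Haar cascade
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

# Detection works on grayscale images, scanned at multiple scales
gray = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    print(f"face at ({x}, {y}), size {w}x{h}")
```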

When computing Haar features, the problem that these kinds of approaches present is the computational effort required to compute each of the features as a kernel sweeps the whole image at various scales. Fortunately, each of the used features can be computed by looking up 8 values in a table (the integral image), independently of position or scale.

Rectangle features can be computed very rapidly using an intermediate representation of the image called the integral image. The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)   (2.1)

where ii(x, y) is the integral image and i(x, y) is the original image. Using the following pair of recurrences:

s(x, y) = s(x, y − 1) + i(x, y)   (2.2)
ii(x, y) = ii(x − 1, y) + s(x, y)   (2.3)

where s(x, y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0, the integral image can be computed in one pass over the original image.

Figure 2.8 An example of integral image

As shown in figure 2.8, the integral image value at location (x1, y1) is the sum of the pixels in rectangle A. Similarly, the value at location (x2, y2) is the sum A + B + C + D.
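A minimal NumPy sketch of the integral image and the constant-time rectangle sum (toy data, not the project's code):

```python
import numpy as np

img = np.random.randint(0, 256, (4, 6)).astype(np.int64)   # toy grayscale image

# ii(x, y) = sum of pixels above and to the left, inclusive (Eq. 2.1)
ii = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from at most four lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

assert rect_sum(ii, 1, 2, 3, 5) == img[1:4, 2:6].sum()
```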

Histogram of Oriented Gradients algorithm

There are many different methods in computer vision. For image classification, we can apply the family of CNN models (Inception Net, MobileNet, ResNet, DenseNet, AlexNet, U-Net, ...), and for object detection there are models such as YOLO, SSD, Faster R-CNN, Fast R-CNN, and Mask R-CNN. Before the explosion of deep learning, there was a classical but also very effective algorithm in image processing: HOG (Histogram of Oriented Gradients) [4].

This algorithm generates feature descriptors for object detection purposes. From an image, we extract two important matrices that store image information: the gradient magnitude and the gradient orientation. By combining these two pieces of information into a histogram, in which the gradient magnitudes are accumulated into bins of the gradient direction, we obtain the HOG feature vector representing that histogram. Some terms relevant to this algorithm are explained below.

Table 2-2 Explanation of common terms used in Computer Vision

Feature descriptor A transformation of data into features that are useful for classification or object recognition. Methods include HOG, SURF and SIFT.

Histogram A chart showing the distribution of color intensities over a range of values.

Gradient The derivative of the color intensity vector, which helps detect the direction of change of objects in the image.

Local cell In the HOG algorithm, an image is divided into cells by a grid of squares. Each cell is called a local cell.

Local portion A pre-extracted square area of the image. In the algorithm description, the local area is also called a block.

Local normalization Normalization performed within a local area, usually by dividing by the L2 norm or the L1 norm. The purpose of normalization is to bring the color intensity values to a common distribution.

Gradient direction The angle formed by the x and y gradient components, which helps determine the direction of the color intensity change or the direction of shadows in the image.

Gradient magnitude The length of the gradient vector in the x and y directions. Representing the histogram distribution of this magnitude according to the gradient direction yields the HOG feature vector.

In most image processing algorithms, the first step is pre-processing the image data. The aim of pre-processing is to improve the image data by suppressing unwanted distortions or enhancing image features that are important for further processing; geometric transformations of images (e.g. rotation, scaling, translation) are also classified among pre-processing methods, since similar techniques are used. This stage normalizes the color and gamma values.

However, this step can be omitted in the calculation of the HOG descriptor, since the descriptor normalization in a later step achieves the same result. Instead, in the first step of the descriptor computation, the gradient values are calculated. The most common method is to apply a discrete derivative mask in one or both of the horizontal and vertical directions; specifically, the image intensity matrix is filtered with masks such as the Sobel or Scharr kernels.

To compute the Sobel filter, a convolution with a 3×3 kernel is performed on the original image. Let I denote the original image matrix, and Gx and Gy the two image matrices whose entries are the derivatives along the x and y axes, respectively. They are computed as:

Gx = [ −1 0 +1 ; −2 0 +2 ; −1 0 +1 ] * I
Gy = [ −1 −2 −1 ; 0 0 0 ; +1 +2 +1 ] * I

where the * symbol denotes the convolution between the kernel on the left and the input image on the right. The gradient magnitude and gradient direction can then be obtained from the two derivatives Gx and Gy according to:

G = sqrt(Gx² + Gy²),   θ = arctan(Gy / Gx)
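A short OpenCV sketch of this gradient computation (the input file name is illustrative):

```python
import cv2
import numpy as np

gray = cv2.imread("patch.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # horizontal derivative
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # vertical derivative

# Gradient magnitude and direction (angle in degrees) per pixel
magnitude, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
```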

● Step 1: Preprocessing

The HOG feature descriptor used for pedestrian detection is calculated on a 64×128 patch of an image. Of course, an image may be of any size, so the patches only need to keep a fixed aspect ratio. For example, they can be 100×200, 128×256, or 1000×2000, but not 101×205.

To illustrate this point, the figure below shows a large image of size 720×475. A patch of size 100×200 has been selected for calculating the HOG feature descriptor. This patch is cropped out of the image and resized to 64×128.

Figure 2.9 Stages of image pre-processing

● Step 2: Calculate the gradient images

Before calculating a HOG descriptor, the horizontal and vertical gradients need to be calculated; after all, the histogram of gradients is built from them.

Figure 2.10 Gradient images in horizontal, vertical directions and combination image

The gradient image removes a lot of non-essential information (e.g. a constant colored background) but highlights outlines. In other words, from the gradient image it is easy to tell that there is a person in the picture.

At every pixel, the gradient has a magnitude and a direction. For color images, the gradients of the three channels are evaluated (as shown in the figure above). The magnitude of the gradient at a pixel is the maximum of the gradient magnitudes of the three channels, and the angle is the angle corresponding to that maximum gradient.

● Step 3: Calculate Histogram of Gradients in 8×8 cells

In this step, the image is divided into 8×8 cells and a histogram of gradients is calculated for each 8×8 cell. The histogram is essentially a vector (or an array) of 9 bins (numbers) corresponding to the angles 0, 20, 40, 60, ..., 160.

The following figure illustrates the process of building this histogram.

A bin is selected based on the direction, and the vote (the value that goes into the bin) is selected based on the magnitude. Let's first focus on the pixel encircled in blue. It has an angle (direction) of 80 degrees and a magnitude of 2; hence, it adds 2 to the 5th bin. The gradient at the pixel encircled in red has an angle of 10 degrees and a magnitude of 4. Since 10 degrees is halfway between 0 and 20, its vote splits evenly into the two bins.

Figure 2.11 Calculating Histogram of Gradients from its direction and magnitude

There is one more detail to be aware of. If the angle is greater than 160 degrees, it lies between 160 and 180, and the angle wraps around, making 0 and 180 equivalent. So, in the example below, a pixel with an angle of 165 degrees contributes proportionally to both the 0-degree bin and the 160-degree bin.
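The voting scheme can be sketched with a small helper function; the three calls reproduce the 80-degree, 10-degree and 165-degree examples above (an illustrative re-implementation, not the project's code):

```python
import numpy as np

# Each pixel's magnitude is split between the two nearest of the 9 bins
# (0, 20, ..., 160), with angles wrapping around at 180 degrees.
def add_vote(hist, angle, magnitude, bin_width=20):
    angle = angle % 180
    low_bin = int(angle // bin_width) % 9
    high_bin = (low_bin + 1) % 9
    high_weight = (angle - low_bin * bin_width) / bin_width
    hist[low_bin] += magnitude * (1.0 - high_weight)
    hist[high_bin] += magnitude * high_weight

hist = np.zeros(9)
add_vote(hist, 80, 2)    # all 2 goes to the 5th bin (80 degrees)
add_vote(hist, 10, 4)    # split evenly between the 0- and 20-degree bins
add_vote(hist, 165, 5)   # split proportionally between 160 and 0 degrees
print(hist)
```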

Figure 2.12 An example of special case in calculating Histogram of Gradients

The contributions of all the pixels in the 8×8 cell are added up to create the 9-bin histogram. For the patch above, it looks like this.

Figure 2.13 9-bin histograms generating from the image

In our representation, the y-axis corresponds to 0 degrees. The histogram has a lot of weight near 0 and 180 degrees, which is just another way of saying that in this patch the gradients point either up or down.

● Step 4: 16×16 block normalization

Ideally, this descriptor should be independent of lighting variations. In other words, we would like to "normalize" the histograms so that they are not affected by lighting variations.

Figure 2.14 Method of normalizing the histogram of image

A 16×16 block contains 4 histograms, which can be concatenated to form a 36×1 element vector, and this vector can be normalized just the way a 3×1 vector is normalized. The window is then moved by 8 pixels, a normalized 36×1 vector is calculated over the new window, and the process is repeated.

● Step 5: Calculate the Histogram of Oriented Gradients feature vector

To calculate the final feature vector for the entire image patch, the 36×1 vectors are concatenated into one giant vector.

Counting the positions of the 16×16 blocks, there are 7 horizontal and 15 vertical positions, making a total of 7 × 15 = 105 positions.

Each 16×16 block is represented by a 36×1 vector. Concatenating them all into one giant vector yields a 36 × 105 = 3780-dimensional vector.
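For reference, OpenCV's HOGDescriptor with these standard parameters produces exactly this 3780-dimensional vector; the image path is illustrative.

```python
import cv2

# Standard pedestrian-detection parameters: 64x128 window, 16x16 blocks,
# 8-pixel block stride, 8x8 cells, 9 orientation bins.
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

patch = cv2.resize(cv2.imread("person.jpg", cv2.IMREAD_GRAYSCALE), (64, 128))
features = hog.compute(patch)
print(features.size)   # 3780
```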

Figure 2.15 HOG features labeling in the image result

Support Vector Machine algorithm

SVM is a supervised learning algorithm which is used for classification as well as regression problems; primarily, however, it is used for classification problems in machine learning. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, which implicitly maps the inputs into high-dimensional feature spaces. The details of the kernel are discussed below.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below

Figure 2.16 An example of support vector in 2-Dimensional data

For 1-dimensional data, the support vector classifier is a point. Similarly, for 2-dimensional data the support vector classifier is a line, for 3-dimensional data it is a plane, and for 4 or more dimensions the support vector classifier is a hyperplane.

In geometry, a hyperplane is a subspace whose dimension is one less than that of its ambient space. If the space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines. This notion can be used in any general space in which the concept of the dimension of a subspace is defined [5].

Figure 2.17 Margins describing in a plane

The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point in the training set, giving a greater chance of new data being classified correctly.

However, data is rarely ever as clean as the simple example above. A dataset will often look more like the jumbled balls below, which represent a linearly non-separable dataset. In order to classify a dataset like this, it is necessary to move from a 2D view of the data to a 3D view.

Figure 2.18 An example of linearly non separable dataset

Because we are now in three dimensions, the hyperplane can no longer be a line; it must now be a plane, as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.

SVM works well on smaller, cleaner datasets, with high accuracy. Because it uses only a subset of the training points, it is also efficient. Despite these advantages, there are some drawbacks when applying the SVM algorithm. Firstly, this algorithm is not well suited to larger datasets, for which the training time becomes long. Secondly, it is less effective on noisier datasets with overlapping classes.
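As a hedged illustration of the algorithm (using scikit-learn and a toy dataset rather than the project's face data), an SVM classifier with an RBF kernel can be trained as follows:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# The RBF kernel implicitly maps the inputs into a high-dimensional space.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```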

2.6.2 Building optimization problems for SVM

Assume that the data pairs of a training dataset are (x1, y1), (x2, y2), ..., (xN, yN), where x represents the input of a data point and y is the label of that data point. Assume that the label of each data point is either y = 1 (class 1) or y = −1 (class 2), as in the PLA [5].

Let's consider the case of two-dimensional space below. Two-dimensional space is used for easy visualization; the math operations can be fully generalized to multi-dimensional space.

Figure 2.19 Two-dimensional space and data points

Assume that the green square points are class 1, the red circles are class −1, and the plane w^T x + b = w1 x1 + w2 x2 + b = 0 is the dividing surface between the two classes. Furthermore, class 1 is on the positive side and class −1 is on the negative side of the divider; otherwise, we only need to change the signs of w and b. The goal is to find the coefficients w and b.

From observation, the following important point holds: for any data pair (xn, yn), the distance from that point to the dividing surface is

yn (w^T xn + b) / ||w||_2   (2.5)

This can be easily seen because, according to the assumption above, yn always carries the same sign as the side on which xn lies. It follows that yn has the same sign as (w^T xn + b), so the numerator is always non-negative.

With the dividing surface defined as above, the margin is the closest distance from any point (in either class) to that surface:

margin = min_n [ yn (w^T xn + b) / ||w||_2 ]

The optimization problem in SVM is to find w and b so that this margin reaches its maximum value:

(w, b) = argmax_{w, b} { min_n [ yn (w^T xn + b) / ||w||_2 ] }   (2.6)

When the coefficient vector w is replaced by kw and b by kb, where k is a positive constant, the dividing surface does not change, i.e. the distance from each point to the surface remains the same and the margin does not change. Based on this property, it can be assumed that:

yn (w^T xn + b) = 1   (2.7)

for the points located closest to the dividing surface, as shown in the figure below:

Figure 2.20 Locating the data points closely to the boundary

So, the optimization problem (2.6) can be reduced to the following constrained optimization problem:

minimize (1/2) ||w||_2²   subject to   yn (w^T xn + b) ≥ 1 for all n = 1, 2, ..., N

However, solving this problem directly becomes complicated when the number of dimensions d of the data space and the number of data points N increase. Therefore, this problem is usually solved through its dual formulation. During the construction of the dual problem, it can be seen that SVM can also be applied to problems where the data is not linearly separable, in which case the dividing surfaces are not planes but can be more complex surfaces.

Background of speech recognition system

Speech recognition is a complex process. The original voice signal is analog; through sampling, quantization and coding, a digital signal is obtained, and features are extracted from these signal samples. These features are the input to the recognition process, and the recognition system outputs the recognition result.

Some difficult factors for speech recognition problem:

- Speakers pronounce words at different speeds, some fast and some slow

- Spoken words often differ in length

- The same person saying the same word may pronounce and end it differently, leading to different analysis results

- Each person has their own voice, expressed through pitch, loudness, intensity and timbre; noise from the environment and from the receiving equipment also has a considerable effect on recognition efficiency

Speech-to-text recognition and conversion systems are widely researched and developed by both domestic and international scientists.

2.7.2 The speech signal and its representation

A brief introduction to how the speech signal is produced and perceived by the human system can be regarded as a starting point for entering the field of speech recognition. The process from human speech production to human speech perception, between the speaker and the listener, is shown in figure 2.21.

Figure 2.21 Process from sounds by speaker transmitting to listener

Speech recognition systems try to establish a similarity to the human speech communication system. A source-channel model for a speech recognition system is illustrated in figure 2.22.

Figure 2.22 Block diagram of speech recognition systems

The aim of human speech communication is to transfer ideas. They are formed in the speaker's brain, and the source word sequence W is produced by his or her text generator. The human vocal system, modeled by the speech generator component, turns the source into the speech signal waveform that is transferred through the air (a noisy communication channel) to the listener, possibly affected by external noise sources. When the acoustic signal is perceived by the human auditory system, the listener's brain starts processing this waveform to understand its content, and the communication is then complete. This perception process is modeled by the signal processing and speech decoder components of the speech recognizer, whose aim is to process and decode the acoustic signal X into a word sequence Ŵ, which is hopefully close to the original word sequence W.

An efficient representation of the speech signal based on short-time Fourier analysis is the spectrogram. A spectrogram of a time signal is a special two-dimensional representation that displays time on the horizontal axis and frequency on the vertical axis. To indicate the energy at each time/frequency point, a grey scale is typically used, in which white represents low energy and black high energy. Sometimes spectrograms are represented with a color scale, as in figure 2.23, where the darkest blue parts represent low energy and the lightest red parts high energy.

Figure 2.23 Spectrogram of a typically sound

When speech is treated as a time-varying signal, its characteristics can be represented via a parameterization of the spectral activity. This representation is used by the front end of Automatic Speech Recognition systems, where the frame sequence is converted into feature vectors that contain the relevant speech information.
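As a rough illustration of this short-time Fourier representation, the following sketch computes a spectrogram of a synthetic tone with SciPy; the signal, sampling rate and window parameters are illustrative only.

```python
import numpy as np
from scipy import signal

# One second of a 440 Hz tone sampled at 16 kHz, standing in for an utterance
fs = 16000
t = np.arange(fs) / fs
waveform = np.sin(2 * np.pi * 440 * t)

# 25 ms analysis windows with 10 ms hop
frequencies, times, Sxx = signal.spectrogram(waveform, fs=fs, nperseg=400, noverlap=240)
print(Sxx.shape)   # (frequency bins, time frames)
```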

2.7.3 Applications

● Windows Speech Recognition

The built-in "Windows Speech Recognition" application on Microsoft Windows 7, Windows 8 and Windows 10, introduced in 2009, can recognize speech to manage and control software and applications on the Windows operating system, saving time for users.

Figure 2.24 Interface of Windows Speech Recognition

The main features of the application are generating text from voice and managing and controlling software and applications on the computer. However, this recognizer still has several shortcomings: it must be trained before use, it has difficulty distinguishing voices accurately, it is not always effective, and it cannot recognize Vietnamese yet.

● Voice-To-Text Facebook Messenger

The "Voice-To-Text" application integrated on Facebook Messenger, is Facebook went live in 2013 This application recognizes the voice and converts the voice into a text sent through the text message input on the Facebook Messenger application and sends that text message to the recipient

Its advantages include requiring no prior training, since it uses Facebook's data warehouse, and fairly accurate recognition. However, this tool only converts voice to text and does not support Vietnamese.

Figure 2.25 Interface of Voice-To-Text Facebook Messenger

● Google Speech to Text

"Google Speech to Text" was developed by Google a few years ago. The application runs in many environments such as Windows, iOS and Android, integrates into the Chrome browser, and can recognize long passages of text.

Figure 2.26 Interface of Google Speech to Text

This tool was launched in 2017 and has significantly improved on the disadvantages of its predecessors, with good language conversion and support for Vietnamese.

The above are a few well-known voice recognition tools, which demonstrate the usefulness of voice recognition around the world. Based on this research, the team decided to use Google's voice recognition tool for this project.
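As a hedged sketch of how such a tool can be called from Python (here via the open-source SpeechRecognition package's wrapper around the Google Web Speech API, not necessarily the exact interface the team used), the listening step of a voice assistant might look as follows; the language code and microphone settings are only examples.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:                       # requires PyAudio
    recognizer.adjust_for_ambient_noise(source, duration=1)
    audio = recognizer.listen(source)

try:
    # Send the captured audio to Google's recognizer (Vietnamese as an example)
    text = recognizer.recognize_google(audio, language="vi-VN")
    print("You said:", text)
except sr.UnknownValueError:
    print("Speech was not understood")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)
```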

2.7.4 Advantages and disadvantages of speech recognition

Speech recognition technology, one of the most important biometric technologies, has become an increasingly popular concept in recent years. Extraordinary computing power and the Cloud take expert speech transcription to new heights. Global communities of talented transcribers deliver accurate results without delay, and specialized transcription services even use artificial intelligence (AI) to improve results.

Advantages: Regular speech recognition software comes with advantages and disadvantages; it is all about using the right tool for the task. Sometimes traditional speech recognition software is the best option. For example, it is convenient to have it installed right on your computer, and such software integrates with your other programs. Accuracy often improves as a given user works with it and the software learns. Besides, the system can capture the user's speech faster than they could normally type, so it is now possible to get thoughts onto electronic paper without waiting for fingers to catch up.

Disadvantages: There are limitations to speech recognition software. It does not always work across all operating systems. Noisy environments, accents and multiple speakers may degrade results. A limited vocabulary may cause delays while the software stumbles over unfamiliar words, simply because new industry-specific vocabularies are only updated periodically.

HARDWARE DESIGN AND IMPLEMENTATION

Requirement analysis for hardware selection

In general, the robot consists of 4 main parts: the robot's frame and shell; the central processor; the input devices (microphone and camera); and the output devices (speaker and monitor). In addition, the robot needs to move flexibly and be safe for the user.

Table 3-1 Requirements of the hardware devices used in this project

Robot frame The robot frame needs to be firmly designed, with a flexible material that is easy to move (perhaps with a wheel mechanism). In particular, the robot shell needs to be fully insulated to protect the user. The robot size needs to suit the user for easy communication; the size must fit within a rectangular box of 60 x 60 x 100 centimeters (length x width x height).

Processor The processor needs to be compact and capable of performing multiple tasks at the same time (face recognition and voice recognition). In addition, it needs to provide appropriate communication standards for peripheral devices such as speakers, monitors, cameras, and microphones. The processor also needs a programming environment that supports the technologies used in this project.

Camera The camera should be compact, with a suitable focal length to easily collect images when interacting with users. In particular, it needs a good resolution to increase the quality of the input image, and it should use a common connection standard (USB, for example).

Microphone The microphone should be compact, with the right input sensitivity to easily and accurately pick up sound when interacting with the user. The selected microphone must be able to record sound accurately within a radius of up to 3 meters.

Monitor The monitor needs to be sized to match the overall structure of the robot, and it should provide good color and resolution for accurate image display. It also requires an interface compatible with the processor.

Speaker The speaker needs to be small and light so that it can be easily installed in the robot frame, and its connection standard must match the processor. Sound quality needs to be good to increase efficiency when communicating with users.

Safety All devices need to be connected to each other according to the correct communication standards and within the specifications of the devices; this ensures the safe operation of the robot. Power is supplied from the AC mains and converted to direct current through an adaptor. Connectors need to be secured against exposure of live conductors.

Computers being used for computer vision and embedded applications

The processor acts as the brain of the robot: it receives input data from the sensors and processes the information to make accurate decisions. Given the facial recognition and voice recognition requirements, the processor must be powerful enough to handle both tasks and must be compatible with most devices and platforms to ensure the stability required for the robot.

There are processors with compact size and good efficiency, such as PLCs, SoCs (System on Chip), and especially common embedded computers. Embedded computers, which are purpose-built computing platforms designed for a specific, software-controlled task, are a reasonable choice. The project team finds that embedded computers can connect easily to many peripheral devices (camera, microphone, monitor) by integrating a variety of connection standards. Moreover, these computers provide users with an easy coding environment and configuration for face and voice recognition.

These are not the typical tower or desktop consumer-grade computers we are used to working with at home or in the office. Applications of embedded computers range from industrial automation, digital signage and autonomous vehicles to space exploration, or just a small specific embedded application.

The image below shows some of the processors that are widely used in real projects today. The basic commonalities of these devices can be easily seen, so they are easy to identify.

Figure 3.1 Some kind of processor in a real-life project

The main differences between an embedded and a desktop computer are purpose and design. Embedded computers are purposeful, dedicated equipment built from scratch to perform a specific task. They can run at maximum load with low resources and withstand harsh conditions, something that is not possible with consumer-grade computers.

Another crucial distinction is that general-purpose desktops come with traditional motherboards, which allow you to expand or replace their components.

● Features of the embedded computer

- Small form factor: Embedded computers use small form factor motherboards. Their innovative enclosure designs and next-generation cooling systems also allow them to be small. Most industrial embedded computers are fanless; they rely only on thermodynamic principles and are capable of cooling down without big enclosures and fans.

- High reliability: Some embedded computers are designed for high reliability in mission-critical applications such as industrial or military deployments. They need to operate 24×7 in demanding applications and extreme environments such as rugged terrain, continuous vibration and high temperatures. These embedded computers can operate over a wide temperature range (e.g. -30°C to 70°C), are dust-proof, and are protected against humidity.

- Power efficiency: Some applications require that embedded computers remain operational day and night, which is why these computers are designed for power efficiency. Embedded computers come with lightweight, dedicated software, so they do not need a lot of processing power. Another advantage for power consumption is that some embedded computers come without fans and have no moving components.

Based on all the considerations in section 3.1, the project team decided to use an NVIDIA embedded computer, the Jetson Nano, as the processor of the robot. Its actual image is shown in the figure below.

The NVIDIA Jetson Nano Developer Kit is an AI computer for makers, learners, and developers that brings the power of modern artificial intelligence to a low-power, easy-to-use platform. In summary, the specifications of this board are described in the table below.

Table 3-2 NVIDIA Jetson Nano specifications [6]

GPU 128-core NVIDIA Maxwell

CPU Quad-core ARM Cortex-A57 @ 1.43 GHz

Memory 4 GB 64-bit LPDDR4, 25.6 GB/s

You can get started quickly with out-of-the-box support for many popular peripherals, add-ons, and ready-to-use projects. The Jetson Nano is supported by the comprehensive NVIDIA JetPack SDK and has the performance and capabilities needed to run modern AI workloads. JetPack includes:

● Full desktop Linux with NVIDIA drivers

● AI and Computer Vision libraries and APIs

Figure 3.3 Pinout of NVIDIA Jetson Nano

Figure 3.3 shows the pin layout of the NVIDIA Jetson Nano. It helps visualize the basic functions of the board's output pins, which in turn lets us choose suitable devices to communicate with the Jetson Nano [6].

Comparing against the requirement analysis table, it is clear that the Jetson Nano board integrates various connection standards. As can be seen from the figure above, it can connect to a monitor through HDMI or DisplayPort, while the USB Type-A ports can be used for interfacing with the camera, speaker and microphone. On the right side there is a microSD card slot for the memory card. In particular, connecting the Micro-USB power supply lets the developer kit power on automatically. The figure below shows the position and function of the various connectors on the board.

Figure 3.4 Various junctions of Jetson Nano board

The top view gives a brief overview of the board.

Table 3-3 Description of Jetson Nano board junctions

J2 SO-DIMM connector for Jetson Nano module

J6 HDMI and DP connector stack

J13 Camera connector; enables use of CSI cameras. The Jetson Nano Developer Kit works with IMX219 camera modules, including the Leopard Imaging LI-IMX219-MIPIFF-NANO camera module and Raspberry Pi Camera Module V2

J15 4-pin fan control header. Pulse Width Modulation (PWM) output and tachometer input are supported

J18 M.2 Key E connector, which can be used for wireless networking cards; includes interfaces for PCIe (x1), USB 2.0, UART, I2S, and I2C

J25 Power jack for a 5V⎓4A power supply (the maximum supported continuous current is 4.4A). Accepts a 2.1×5.5×9.5 mm plug with positive polarity

J28 Micro-USB 2.0 connector; can be used in either of two ways:

● If the J48 pins are not connected, you can power the developer kit from a 5V⎓2A Micro-USB power supply

● If the J48 pins are connected, it operates in Device Mode

J32 and J33 Each is a stack of two USB 3.0 Type A connectors. Each stack is limited to 1A total power delivery. All four ports are connected to the Jetson Nano module via a USB 3.0 hub built into the carrier board

J38 The Power over Ethernet (PoE) header exposes any DC voltage present on the J43 Ethernet jack

J40 Carrier board rev A02 only: 8-pin button header; brings out several system power, reset, and force recovery related signals

J41 40-pin expansion header; includes power pins and interface signal pins

J43 RJ45 connector for gigabit Ethernet

J44 Carrier board rev A02 only: 3.3V serial port header; provides access to the UART console

J48 Enables either the J28 Micro-USB connector or the J25 power jack as the power source for the developer kit. Without a jumper, the developer kit is powered by the J28 Micro-USB connector; with a jumper, no power is drawn from J28 and the developer kit can be powered via the J25 power jack.

Camera and microphone

In order to apply face recognition technology, the robot needs a device that can acquire good images in the required range. To evaluate specifically how suitable a camera is for the requirements of this project, the following criteria can be given:

● Resolution that determines the quality of the captured image: HD or Full HD

● The frame rate determines the detail and smoothness of the video, thereby helping the recognition and processing process better: 15-30 fps

● Angle: Good rotation allows increased flexibility in capturing frames, the wider the angle, the better

● Design and materials: A compact design that is easy to install is an advantage and the material will certainly increase the durability of the product

There are many devices on the market that can meet the size and specification requirements above, for example compact webcams from well-known brands such as Microsoft and Genius. However, Logitech still excels with its webcam product lines; specifically, the Logitech B525, with suitable specifications and an affordable price, is a reasonable choice.

Next, to apply voice recognition technology to the virtual assistant, a recording device suitable for the project is needed. For recording, there are many products on the market that easily meet the parameters in section 3.1; the two main product directions are a separate audio recorder and a recorder integrated into another product. Conveniently, the B525 camera has a built-in microphone with high-quality transceivers, so using this built-in mic saves space and cost and simplifies installation without sacrificing the quality of the system. Figure 3.5 shows the actual Logitech B525.

The specifications and features of this camera are shown clearly in the table below.

Table 3-4 The features description of Logitech B525

Dimensions and weight 68.5 x 29 x 40.4 mm (width x height x depth)

Rotation ability 360 degrees at various angles

Advanced features Autofocus and an integrated microphone for conversations with a clean and clear voice

Monitor

The display is one of the main means of conveying information to the user. Choosing a suitable monitor will increase the quality of the system's output information. Specifically, the screen used in this project should meet the following requirements:

● Dimensions: Suitable for mounting on robots

● Material: Good material will increase the life of the screen

● Resolution: High resolution will improve the visual experience for users (HD to Full HD is the optimal choice)

● Connection standard: Good connectivity is required with the selected processor

Based on the above criteria, the team decided to choose the Waveshare 7-inch LCD screen shown in the image above. Table 3-5 lists its specific parameters:

Table 3-5 Table of Waveshare 7-inch LCD ’s specific parameters

Dimension and weight 190.5 x 114.6 x 16.7 mm (width x height x depth) Weight of 0.408 kg

On-board interfaces USB and HDMI

Operation voltage and rate current 5 Volt and 490 mA

Touch control Capacitive touch control

Operating system Used with a Raspberry Pi, it supports Raspbian / Ubuntu / Kali / Retropie and WIN10 IoT, with no need to install any drivers

Advanced features Supports backlight control for greater power saving

Speaker

The speaker provides the audio output for the system. There are no strict requirements for the speakers, but the following basic requirements must still be met:

- Size: Fits the system size

- Standard connection: Supports standard 3.5mm headphone jack to connect to audio devices

- Power supply: Powered via USB standard

- Control Button: There is a volume knob to easily control the loudness

Based on the above criteria, the Logitech Z121 mini speaker was chosen for its compatible parameters. Figure 3.7 shows this product in real life, and its specifications are listed in the table below.

Table 3-6 Table of Logitech mini Z121’s specifications

Dimensions and weight 88 x 110 x 90 mm (width x height x depth), weight of 0.25 kg

Operating voltage 5 Volt (USB cable for power connection)

Buttons Volume control on the right side

Design sketch of robot frame

The drawings help the project team calculate and illustrate the size, shape, and movement mechanism of the robot, so that it is firm and moves flexibly. Through the drawings, the team can visualize the finished robot, recognize its mechanical limitations, and then edit the drawings accordingly.

Currently, there are many powerful tools for technical drawing, one of which is SOLIDWORKS. This popular engineering tool helps users create technical drawings, which can then be combined into a complete model. With fairly simple commands and operations, it is also suitable for beginners.

First, the robot must move flexibly on flat surfaces. To simplify this problem, the team intends to use a two-wheel mechanism plus an omnidirectional wheel to help the robot move more easily. The wheel frame (to which the wheels attach) therefore needs to be attached to the body of the robot.

The body of the robot is designed in a pyramid shape and the top will be connected to the head of the robot

The picture below shows the basic design of the body of the robot in Solidworks

Next, the head of the robot needs to carry the control and display components shown in the table below. All measurements in this sketch must therefore comply with the sizes of the monitor, camera, speaker and processor box.

Table 3-7 Table of device size requirements in the body part of the robot

Processor box 120 x 60 x 85 mm (width x height x depth)

Camera with microphone integrated 68.5 x 29 x 40.4 mm (width x height x depth)

Monitor 190.5 x 114.6 x 16.7 mm (width x height x depth)

Speaker 160 x 110 x 90 mm (width x height x depth)

Figure 3.10 The robot face design

Next is the face of the robot. This part is designed with openings to accommodate the LCD screen and holes to let sound out from the speaker. Figure 3.10 above shows the robot face design in Solidworks.

The face pieces are circular in shape, with cut-outs for the parts used to communicate with people, such as the camera, screen, and speaker. These parts are arranged in the proper order from top to bottom. The robot head attaches to the face to form a complete block for the upper body of the robot; the parts are machined with 4 mm diameter screw holes to make the connection between them stronger.

Figure 3.11 The robot head design in Solidworks

After the drawings for each part of the robot are completed, the final stage is to assemble them into a complete model. Based on this completed version, the project team checks once again whether the robot meets the criteria for size and movement in accordance with the design requirements. The image below shows the design of the robot after assembling the head and body together in Solidworks.

Figure 3.12 The design of assembled part of the robot in solidworks

The image below depicts the connection diagram of the electronic components included in the robot. Components that match each other according to their connection standards allow both stable communication and safety in terms of power consumption.

Figure 3.13 Connection diagrams of the electronic components

Implementation

After completing the robot drawings, 3D printing technology was used; however, due to the large size of the robot, the printer's build volume cannot fit the entire robot. The robot is therefore broken into pieces and printed as shown in the image below.

Figure 3.14 The robot parts are printed individually before being put together

After the pieces are printed, they are all glued together to form the appearance of a complete robot.

Next, the electronic components are connected to each other according to the connection diagram and arranged inside the robot frame. Figure 3.15 shows the process of placing the electronic devices in the head of the robot.

Figure 3.15 The head of the robot after the devices are installed

After the robot parts are printed successfully, the gluing and coating processes are carried out. This is a difficult stage, requiring meticulousness and focus from the team members. After the hardware is completed, the robot has the appearance shown below.

Figure 3.16 The Reception Robot in full view

After the hardware design process, the robot has a completed outer shell. With plastic material and a white background color, the robot's appearance is designed to be harmonious and user-friendly.

SOFTWARE DESIGN AND ALGORITHMS

Programing languages and environment being used for computer vision

There are a variety of programming language choices for computer vision – OpenCV with C++, OpenCV with Python, or MATLAB. However, most engineers have a personal favorite, depending on the task they perform. Beginners often pick OpenCV with Python for its flexibility: it is a language most programmers are familiar with and, owing to its versatility, it is very popular among developers.

Python has strong attributes as follow:

● Ease of coding: “Code as plain English” is Python’s primary goal. This allows programmers to focus on the design and not on the coding itself. Additionally, Python is easy to learn, especially for beginners; it is one of the first programming languages learnt by most users. The language is also easily adaptable to all kinds of programming needs.

● Widely used computing language: Python offers a complete environment for people who want to use it for various kinds of computer vision and machine learning experiments. Its NumPy, scikit-learn, matplotlib and OpenCV packages provide an exhaustive resource for any computer vision application.

● Open source: Python is freely available at no cost, and its source code is available as well. Anyone can modify, improve, or extend open source software.

● Large community: Being so commonly used means Python has a big community. There are many blog posts and online resources about Python and OpenCV, so help is usually available when trying to fix a problem.

The programming environment provides the main interface for programmers to write code, so finding a suitable programming application largely determines the efficiency of the programming process. The basic criteria used to select a programming application for the project are as follows:

● Programming language support: strong support for Python

● High compatibility: easily install advanced application and library packages without problems

● Programming interface: a smart and logical interface design optimizes coding efficiency and greatly helps coders

● Operating system: Can be installed on Ubuntu operating system

Figure 4.2 The working interface of Visual Studio Code

The programming environment of choice is Visual Studio Code. This tool supports several programming languages, with a different set of features for each. Many features of Visual Studio Code are not exposed through menus or the user interface but are accessible through the command palette. Visual Studio Code is extensible through plugins available from a central repository; these include editor additions and language support. One notable feature is the ability to create extensions that add support for new languages, themes, and debuggers, perform static code analysis, use language server protocols, and connect to additional services.

The use of libraries helps to reduce the programmer's workload. In addition, these libraries make it easier for beginners to apply an algorithm or technique that has been studied before. They are quite powerful and are supported by large user communities, so they are easy to access and modify.

Figure 4.3 OpenCV and Python icon

For instance, OpenCV (Open Source Computer Vision Library) is an open source software library for computer vision and machine learning. OpenCV was created to provide a shared infrastructure for computer vision applications and to speed up the use of machine perception in consumer products. As BSD-licensed software, OpenCV makes it simple for companies to use and change the code. It is one of the predefined packages and libraries that make development simpler.

OpenCV gives access to more than 2,500 state-of-the-art and classic algorithms. Using this library, users can perform various tasks such as removing red eyes, extracting 3D models of objects, and following eye movements. OpenCV is optimized mainly for real-time programs; moreover, it has been designed to take advantage of hardware acceleration and multi-core systems.

In this project, another package was also applied: Dlib. Dlib is an open source C++ library implementing a variety of machine learning algorithms, including classification, regression, clustering, data transformation, and structured prediction. It provides numerous machine learning algorithms such as SVMs, k-means clustering, Bayesian networks, and many others. Dlib works well on open source operating systems such as Ubuntu Linux. Especially when a GPU is available, the processing speed of its models increases significantly [5]

Figure 4.4 Dlib and Ubuntu icon

Method of face recognition

This is one of the two important parts of the software design. Choosing a workable face recognition method plays an important role in the system. As mentioned in Chapter 2, for both hardware and software related reasons, the project team chose the HOG and SVM algorithms for face recognition.

The recognition of a face in a stream of images is split into three primary tasks: face detection, face prediction, and face tracking. The tasks performed in the face capture program are performed during face recognition as well. To recognize a detected face, a vector of HOG features of the face is extracted. This vector is then fed to the SVM model to determine a matching score for the input vector against each of the labels. The SVM returns the label with the maximum score, which represents the confidence of the closest match within the trained face data. This approach is illustrated by the diagram below.

Figure 4.5 The face recognition step by step diagram

According to the diagram, the three major modules of face recognition can now be considered in detail.

The very first step in face recognition is to collect face samples. This is carried out in three basic steps: the face is detected, then the main section of the face is cropped, and finally the face image is saved.

The detection of the face is achieved using Haar feature-based cascade classifiers, as discussed in the previous section. Typically, the accuracy of face recognition depends heavily on the quality and variety of the sample images. Figure 4.6 shows the variety of sample images that can be obtained by capturing multiple images with multiple facial expressions for the same face.

Figure 4.6 Some sample images for one face after capturing
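To make the capture step concrete, the sketch below uses OpenCV's bundled Haar cascade to locate a face in the camera stream, crops it, and stores the crop in a per-person folder. This is only a minimal illustration: the folder layout, sample count, and file names are assumptions, not the exact ones used in the robot.

    import os
    import cv2

    def capture_samples(person_name, num_samples=20, out_dir="dataset"):
        # Haar cascade bundled with OpenCV, used only to locate the face in each frame
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        os.makedirs(os.path.join(out_dir, person_name), exist_ok=True)
        cap = cv2.VideoCapture(0)          # default camera
        count = 0
        while count < num_samples:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
                face = frame[y:y + h, x:x + w]          # keep only the face region
                cv2.imwrite(os.path.join(out_dir, person_name, "%03d.jpg" % count), face)
                count += 1
        cap.release()

In practice, samples should be collected with different facial expressions and head poses, as described above, so that the later training stage sees enough variety.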

If the face is detected, it can be cropped and stored as a sample image for analysis. The ubiquitous use of rectangles to bound regions in an image introduces a superfluous section into the cropped head image; rectangle-shaped bounded faces obtained using Haar cascade classifiers therefore contain insignificant data such as the area surrounding the neck, ears, hair, etc. This can be mitigated using a geometric face model, which is formed using the geometric relationships between the various features within a face, including the eyes, nose, and mouth. The images collected in the folder are then encoded into a 512-d vector and saved in the database that is used for the training model.

In this stage, features from the images associated with each person are gathered. Then, the complete set of information from all of the stored images, isolated per person as a single SVM label, is used to train an SVM model.
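As a rough illustration of this training stage, the sketch below walks the sample folders, encodes each face with the dlib-based face_recognition package (which produces 128-d encodings; the exact encoder and vector size used in the robot may differ from the figure quoted above), and fits one scikit-learn SVM with one label per person. File and folder names are assumptions.

    import os
    import pickle
    import face_recognition
    from sklearn.svm import SVC

    def train_face_model(dataset_dir="dataset", model_path="svm_model.pkl"):
        names, encodings = [], []
        for person in os.listdir(dataset_dir):            # one sub-folder per person
            person_dir = os.path.join(dataset_dir, person)
            for fname in os.listdir(person_dir):
                image = face_recognition.load_image_file(os.path.join(person_dir, fname))
                encs = face_recognition.face_encodings(image)
                if encs:                                   # skip images with no detectable face
                    names.append(person)
                    encodings.append(encs[0])
        clf = SVC(kernel="linear", probability=True)       # one SVM label per person
        clf.fit(encodings, names)
        with open(model_path, "wb") as f:
            pickle.dump(clf, f)
        return clf

A linear kernel is used here as a common starting point; other kernels or parameter values could be chosen after experimentation.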

Figure 4.7 SVM model used in the robot

An SVM model can be considered as a point space wherein multiple classes are isolated using hyperplanes.
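A minimal sketch of the prediction step is shown below. It assumes the SVM model saved by the training sketch above and uses face_recognition's HOG-based detector to locate faces before scoring each encoding against the trained labels; the file name and helper function are illustrative only.

    import pickle
    import face_recognition

    with open("svm_model.pkl", "rb") as f:        # model produced by the training sketch above
        clf = pickle.load(f)

    def recognize(frame_rgb):
        boxes = face_recognition.face_locations(frame_rgb, model="hog")   # HOG-based face detector
        encodings = face_recognition.face_encodings(frame_rgb, boxes)
        results = []
        for box, enc in zip(boxes, encodings):
            probs = clf.predict_proba([enc])[0]           # one score per trained label
            best = probs.argmax()
            results.append((clf.classes_[best], float(probs[best]), box))
        return results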

Method of voice assistant

In general, a speech recognizer is used to correctly convert speech to text; the process is described in the block diagram below [7]

Figure 4.8 Voice assistant block diagram

Feature extraction: In the first step, the sound is captured for processing by recording tools such as microphones. This is an analog signal, so for the processor to work with this type of data, there must be a method to convert it into a digital signal. This digital signal will be the feature vectors extracted from the phones or triphones heard by the microphone.

Feature extraction from the speech signal is the computation of a feature vector sequence that provides a compact representation of the signal. The most widely used spectral feature extraction methods are Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP).
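As an illustration of this step, the sketch below computes MFCC feature vectors with the librosa package; the 16 kHz sampling rate and 13 coefficients are typical values, not necessarily the ones used by the recognizer in the robot.

    import librosa

    def extract_mfcc(wav_path, n_mfcc=13):
        # Load the recording at 16 kHz and compute MFCCs frame by frame
        signal, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T          # one n_mfcc-dimensional feature vector per frame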

Decoding: After the digital features of the collected syllables are extracted, they are compared with reference patterns that were pre-trained and stored with class identities in order to accurately identify the spoken word. These pre-trained patterns, obtained in a learning process, are in our system the acoustic models for the speech units. The output of this process is the recognized words.

Word sequence search: Based on the words obtained from the previous step, methods are needed to understand these words, combine them into a complete meaningful sentence, and correct errors from the word recognition process. If a recognized word does not fit the overall sentence, the process returns to step 2 to compare and find a more suitable word. This is repeated, and the result is a text accurately adapted from the audio obtained in step 1.

There are many language models built from different methods, such as Hidden Markov Models (HMM), artificial neural networks (multilayer perceptrons, Kohonen networks, support vector machines) and hybrid models (fuzzy-HMM, fuzzy-multilayer perceptron, HMM/multilayer perceptron). Each model has different advantages and disadvantages, but this topic will not go into depth about them [8]

Based on the above recognition model, many applications and technologies that allow accurate speech recognition already exist, both worldwide and in Vietnam, and the most appropriate model will be applied in this topic.

Export answer block: Based on the text transcribed from the communicator's voice, the system looks it up in a premade database of questions and answers. This data includes keywords in the questions together with the answers that correspond to the communicator's question. From there, the text of the answer is converted into an audio signal and sent to the speaker to output the voice.
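A minimal sketch of this answer-export step is shown below. The keyword table is a tiny hypothetical extract of the robot's question/answer database, and gTTS is used here only as one possible Vietnamese text-to-speech backend; the actual matching rules and TTS engine in the robot may differ.

    from gtts import gTTS

    # Hypothetical miniature extract of the robot's question/answer database
    QA_DATABASE = {
        ("thành lập", "năm nào"): "Trường được thành lập từ năm 1962.",
        ("bãi giữ xe",): "Bãi giữ xe nằm ở khu A, B, D, E.",
    }

    def answer(question_text):
        text = question_text.lower()
        for keywords, reply in QA_DATABASE.items():
            if all(k in text for k in keywords):   # every keyword must appear in the question
                return reply
        return "Xin lỗi, mình chưa có câu trả lời cho câu hỏi này."

    def speak(reply, out_path="reply.mp3"):
        gTTS(reply, lang="vi").save(out_path)      # Vietnamese text-to-speech, saved for playback

In practice the matching can be made more robust by normalizing diacritics and ranking candidate answers by the number of matched keywords.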

Process flow diagram

First, it is important to know how many operational states the robot has, and the precondition for entering each state must also be clarified. From that, the state diagram for the robot can be drawn. This method enables both sequential and non-sequential processes to be described in graphical form. To organize and clarify how the system works, the figure below shows the state diagram that sets the rules for changes between the operational states of this robot.

Figure 4.9 Operational states diagram of this robot

Based on the state diagram, it can be seen that the robot has three states, in which the initial state is "Building Model". The remaining states represent the robot's operation and describe what the robot is able to do in each state. After successfully building the model, the robot starts working by recognizing people; if the database contains this person's image, the robot says a friendly greeting to them. In this state, the robot can also check whether the person is wearing a mask; if it detects that someone is not wearing one, the robot raises a voice alert asking them to wear a mask. Next, if the robot receives a greeting in response, it can switch from its current state to a state in which it acts as a virtual assistant, helping students by answering their questions based on the robot's knowledge. The user can ask the robot to return to previous states using the appropriate voice commands described in the state diagram. In addition, the generated model flowchart helps to describe in detail the calculation method as well as the logic performed inside the model.
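Before turning to the flowchart, the transition rules of the state diagram can be captured in a small lookup table, as in the sketch below; the state and event names are illustrative rather than the exact identifiers used in the robot code.

    # Allowed transitions between the three operational states;
    # any event not listed keeps the robot in its current state.
    TRANSITIONS = {
        ("BUILDING_MODEL", "model_ready"): "RECOGNIZING",
        ("RECOGNIZING", "greeting_heard"): "VIRTUAL_ASSISTANT",
        ("VIRTUAL_ASSISTANT", "goodbye_heard"): "RECOGNIZING",
    }

    def next_state(state, event):
        return TRANSITIONS.get((state, event), state)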

This flowchart describes all the logic implemented in the robot and its workflow. Generally, the model is based on three image processing techniques: the Haar feature extractor, the HOG feature extractor, and the SVM classifier. Based on the criteria considered, a suitable approach is chosen for each task.

Figure 4.10 Flowchart of robot’s operation

First of all, to identify whether a person is wearing a mask or not, a simple but effective way is to calculate the average saturation value of the mouth area. This value is then compared with a given threshold value (the next chapter describes how the threshold value is found). If the current value is greater than the threshold value, the person is most likely not wearing a mask. Second, face recognition is mainly based on the HOG algorithm and the SVM classifier; the 128-dimensional vector is computed continuously for the input image sequence from the camera. Based on this method, the face distance value is calculated from the distance between the hyperplane and the nearest point, and this value also represents the accuracy of the robot's recognition. The last function is the virtual assistant, based on a data set stored in the robot's memory, which includes questions and corresponding answers that represent the robot's knowledge. The sound of the user's question is the input of the module as an electrical signal; by filtering the audio signal, it is converted to text. These texts are scanned and compared against the robot's data to find the right answer. Finally, the system exits the session when it receives a goodbye.

EXPERIMENT RESULTS, FINDINGS AND ANALYSIS

Face mask detection

The mask test follows these steps: first, the landmark set is used to detect the mouth area of the face; then the average saturation in the mouth area is calculated, and this value is compared with a given threshold to determine whether the person is wearing a mask or not.

The pre-trained facial landmark detector inside the dlib library is applied to estimate the locations of 68 (x, y)-coordinates that map to facial structures on the face. The indexes of the 68 coordinates are visualized in the image below [5]

Figure 5.1 The pre-trained facial landmark detector supported by DLib library

With an input image containing a face, the landmark detector helps extract facial features such as the eyes, nose, and mouth. The image below depicts these features as bounding boxes.

Figure 5.2 Output images with facial feature shown in bounding box

After determining the locations of the facial features, the image is converted to the HSV color space in order to calculate the average saturation value. The saturation channel split from the HSV color system contains elements of unsigned integer type. The average saturation is computed as:

avg_saturation = sum_saturation / area

where sum_saturation is the sum of all values in the saturation channel and area is the number of elements in the saturation channel.
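The computation described above can be sketched as follows, using dlib's 68-point landmark model to locate the mouth and OpenCV to measure the mean saturation of that region; the model file path, helper names, and default threshold are assumptions for illustration (the threshold value itself is determined experimentally later in this section).

    import cv2
    import numpy as np
    import dlib

    detector = dlib.get_frontal_face_detector()
    # Path to the 68-point landmark model is an assumption about the local setup
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def mouth_mean_saturation(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            return None
        shape = predictor(gray, faces[0])
        # Landmarks 48-67 outline the mouth in the 68-point convention
        pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                       dtype=np.int32)
        x, y, w, h = cv2.boundingRect(pts)
        mouth = frame_bgr[y:y + h, x:x + w]
        saturation = cv2.cvtColor(mouth, cv2.COLOR_BGR2HSV)[:, :, 1]
        return float(np.mean(saturation))    # avg_saturation = sum_saturation / area

    def wearing_mask(frame_bgr, threshold=100):
        s = mouth_mean_saturation(frame_bgr)
        return s is None or s < threshold    # below the threshold (or invalid) => mask assumed on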

Figure 5.3 Real-time saturation extracted from mouth area

The next step is to determine the saturation threshold through many experiments under different lighting conditions. Based on these results, the most suitable reference saturation value is found so that the method works well in many different lighting environments. The calculated values are listed in detail in the following table:

Table 5-1 Relationship between luminance and saturation mean

Experiments were carried out under different lighting conditions. In some conditions the results show that this method is not feasible, because features are not recognized or are recognized incorrectly; however, its fast processing speed makes the method worth considering and applying. Some of the results can be seen in the image below, which illustrates the impact of light levels on the robot's recognition process.

Figure 5.4 Experiment in thresholding the value of average saturation

In summary, the optimal threshold for the average saturation is 100. With a saturation less than 100, or when the value is invalid, the function assumes that the person is wearing a mask. The image below shows the result of face mask detection.

Figure 5.5 The result of face mask detection

The table below shows the experimental data over 100 attempts.

Table 5-2 Statistics of face mask detection experiment

The experimental results of mask recognition are grouped by the brightness factor into three levels: high, medium, and low. The experiment was performed over 100 times, and the results show that the average accuracy was about 80%.

Face recognition

The purpose of this experiment is to evaluate the facial recognition function of the robot. In practice, the robot operates under many different and changing lighting conditions, so this experiment also tests the robot's ability to operate stably across lighting conditions.

Figure 5.6 A sample in face recognition validation test

At a “face distance” threshold of 0.4, the face recognition system works stably and achieves relatively high accuracy. The image below shows the relationship between “face distance” and “confidence score” when the “face distance” threshold is set to 0.4.

Figure 5.7 Face Distance, Confidence Score relationship at 0.4 Face Match Threshold

The relationship between Face Distance and Confidence Score can be described by the equations below:

Case 1: Face Distance < Face Match Threshold

Case 2: Face Distance ≥ Face Match Threshold

With “face distance” values ranging from 0 to less than 0.4, the “confidence score” is guaranteed to be above 0.8, which means greater than 80% accuracy. Therefore, the system achieves relatively reliable accuracy.
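The decision rule implied by the 0.4 threshold can be sketched as below, using face_recognition's face_distance helper; only the match/no-match decision is shown here, and the exact mapping from face distance to confidence score plotted in Figure 5.7 is not reproduced.

    import face_recognition

    def identify(known_encodings, known_names, face_encoding, threshold=0.4):
        # Distance between the new face and every stored encoding; smaller = more similar
        distances = face_recognition.face_distance(known_encodings, face_encoding)
        best = int(distances.argmin())
        if distances[best] < threshold:              # Case 1: below the threshold -> accepted match
            return known_names[best], float(distances[best])
        return "Unknown", float(distances[best])     # Case 2: at or above the threshold -> rejected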

Voice assistant

The operation of the virtual assistant follows these steps. The first step is to receive and process the input audio in Vietnamese (including noise); this step converts the audio signal into text by a speech-to-text method. Then, the obtained text is compared with the question data already stored on the system; this data is created by the user in order to answer the required questions. Based on the answer that matches the question data, the robot creates an audio recording to communicate with the user by a text-to-speech method.
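A minimal sketch of the speech-to-text step is given below, using the SpeechRecognition package with the Google Web Speech API and the Vietnamese language code; whether the robot uses this exact service is an assumption.

    import speech_recognition as sr

    def listen_vietnamese():
        recognizer = sr.Recognizer()
        with sr.Microphone() as source:
            recognizer.adjust_for_ambient_noise(source)   # compensate for background noise
            audio = recognizer.listen(source)
        try:
            # Google Web Speech API with the Vietnamese language code
            return recognizer.recognize_google(audio, language="vi-VN")
        except sr.UnknownValueError:                      # speech could not be understood
            return ""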

Based on the system requirements, the robot needs to be able to receive sound accurately within a radius of 3 m. To evaluate this ability, the team conducted sound acquisition experiments under a variety of environmental conditions and distances. The evaluation criteria are the accuracy of the text converted from the audio and the ability to handle noise. The results of this experiment are listed in the table below.

Table 5-3 Data experiment of speech to text following distance and noise criteria

After assessing the system's ability to receive audio input, the audio is converted to text and processed based on the available question data. The question data designed in this topic is built to answer students' questions related to the school, faculties, subjects, and other activities in the school. The questions are classified according to specific criteria as shown in the table below.

Table 5-4 Shortlist of the knowledge database included in the robot

Topic | Question | Answer

Trường | Thành lập năm nào | Trường Đại Học Sư Phạm Kỹ Thuật Thành Phố Hồ Chí Minh được thành lập từ năm 1962

Trường | Có bao nhiêu bãi giữ xe | Bãi giữ xe nằm ở khu A, B, D, E

Trường | Địa chỉ ở đâu | Trường Đại Học Sư Phạm Kỹ Thuật Thành Phố Hồ Chí Minh có địa chỉ là số 1, đường Võ Văn Ngân, phường Linh Chiểu, thành phố Thủ Đức

Trường | … | Trường Đại Học Sư Phạm Kỹ Thuật Thành Phố Hồ Chí Minh có 13 khoa và một viện SPKT

Trường | Đóng học phí | Có thể đóng học phí trực tiếp ở phòng A1-102 ở khu A, hoặc có thể sử dụng phương thức chuyển khoản online

Đăng ký | Đăng kí môn học như thế nào | Đăng ký môn học bằng cách truy cập trang online cá nhân sinh viên, sau đó vào mục đăng ký học phần

Đăng ký | Số lượng tín chỉ | Đăng ký môn học với số lượng ít nhất 15 tín chỉ

Đăng ký | Phiếu điểm | Đăng ký phiếu điểm ở phòng Công tác và Tuyển sinh sinh viên ở tầng 2 khu A

Đăng ký | Giấy xác nhận sinh viên | Đăng ký giấy xác nhận sinh viên bằng cách truy cập trang online cá nhân sinh viên, sau đó vào mục đăng ký giấy xác nhận sinh viên

Khoa Điện | Vị trí | Khoa Điện Điện Tử nằm ở khu C và khu D

Khoa Điện | Trưởng khoa | Thầy Nguyễn Minh Tâm là trưởng khoa Điện Điện Tử

Khoa Điện | Phòng thí nghiệm hệ thống thông minh | Phòng thí nghiệm hệ thống thông minh là phòng C103 và được quản lý bởi thầy Lê Mỹ Hà

Góc sẻ chia | Vị trí | Góc sẻ chia nằm ở tầng hầm của tòa nhà trung tâm

Góc sẻ chia | Tiện ích | Góc sẻ chia là nơi cung cấp các dịch vụ hỗ trợ cho sinh viên như đồ dùng thiết yếu, cho mượn xe đạp và sách

Góc sẻ chia | … | Góc sẻ chia mở cửa từ 7 giờ sáng đến 5 giờ chiều, từ thứ 2 đến thứ 7

Thư viện | Vị trí | Thư viện nằm ở khu A, tầng 1 là khu tự học, tầng 2 là khu mượn trả sách

Thư viện | … | Thư viện làm việc từ thứ 2 đến thứ 6 theo khung giờ sau: Sáng từ 07g00 đến 11g30; Chiều từ 13g00 đến 17g00

Thư viện | Thời gian mượn sách | Đối với giáo trình là 1 học kì; đối với sách tham khảo là 3 tuần

In addition, for questions on topics outside the question data, the robot refuses to answer in a friendly manner. Questions outside of this dataset may be added later depending on storage capacity.

Figure 5.8 Robot understanding human speech and communicating via text

The image above shows a short conversation between a student and the robot's virtual assistant. It can be seen that the robot is able to understand the speech and respond to the student appropriately.

Reception robot real-time working

The previous sections tested each function separately, such as facial recognition and communicating with users through voice recognition. This part of the experiment is an overall test of the whole reception robot. The functions included in the robot have been fully integrated and run under supervision from the project team. As shown in Figure 5.9, the robot is checking whether the person is wearing a mask and talking to them using the knowledge available in the database. It can be seen that the robot works confidently and communicates in a user-friendly way.

Figure 5.9 Reception Robot is checking face-mask and chatting with user confidently

As for emotional expression, the robot can express emotions in four states, as shown below.

Overall, the robot worked according to the requirements and criteria set by the project team. However, there are some cases where the robot expressed confusion because it could not find an answer or misidentified the person in front of it. These unexpected cases will be re-examined, and upgrades to the robot program are recommended for the future.

CONCLUSION AND FUTURE WORKS

Conclusion

After the research and completion of the project, the results have solved the required problems. First, this topic is a small study contributing to the field of using robots to assist people, in the university environment in particular and in other fields in general. In addition, accurate face recognition also increases the security of the robot during operation. In the current situation of the COVID pandemic, it is necessary to limit human-to-human contact, so using a robot to support students both helps answer their questions and meets other requirements.

After putting the robot into actual operation, it can be concluded that the robot has achieved the initial objectives of the project. The first is built-in facial recognition that is accurate with low latency. In addition, the robot can communicate and answer questions from a given knowledge database to help students study and work at the school, for example lecture information, enrollment information, and navigation around the campus.

However, during the project, the team also encountered some difficulties. First, fine-tuning the parameters so that the robot operates with the highest accuracy requires a lot of time and research to find a suitable solution. In the face recognition part, in some cases where the ambient lighting is not ideal, strong light noise sometimes makes face recognition inaccurate. Regarding the communication function, the voice recognition system is sometimes wrong due to environmental noise, which reduces the robot's performance.

Future works

There are quite a few directions in which to develop this robot in the future. The first option is to increase the robot's workload: not stopping at identifying and supporting students, the robot could automatically control actuators in the classroom or office where it works, for example electrical equipment such as light bulbs and fans. The second direction is to address the limitations of the current robot. One limitation is that the robot can only interact with one user at a time, which considerably reduces its throughput; an appropriate solution is to serve several users at the same time. Finally, manually updating the knowledge database by programming is not very efficient. There are two ways for the robot to update the knowledge data more automatically: first, an error-prompting method from the communicator can help the robot correct knowledge that was wrong; second, the communicator can rate their satisfaction with the answer by pressing a button. With these methods, the robot can update its knowledge faster and more easily for the user.

[1] Z. Al Barakeh, S. Alkork, A. S. Karar, S. Said, and T. Beyrouthy, "Pepper humanoid robot as a service robot: A customer approach," BioSMART 2019 - Proc. 3rd Int. Conf. Bio-Engineering Smart Technol., pp. 1–4, 2019, doi: 10.1109/BIOSMART.2019.8734250.

[2] D. Gries and F. B. Schneider, "Computer Vision: Algorithms and Applications." [Online]. Available: www.springer.com/series/3191

[3] P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features."

[4] N. Boyko, "2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP)," pp. 478–482, 2018.

[5] M. M. Sani, K. A. Ishak, and S. A. Samad, "Evaluation of face recognition system using support vector machine," SCOReD 2009 - Proc. 2009 IEEE Student Conf. Res. Dev.

[6] NVIDIA, "Jetson Nano," 2020. [Online]. Available: https://developer.nvidia.com/embedded/jetson-nano

[7] S. Mischie, L. Matiu-Iovan, and G. Gasparesc, "Implementation of Google Assistant on Raspberry Pi," 2018.

[8] D. Militaru and I. Gavat, "A historical perspective of speaker-independent speech recognition in Romanian language," 2014. [Online]. Available: https://www.researchgate.net/publication/299594444
