PROBLEM STATEMENT
Overview of human posture estimation
1.1.1 What is human pose estimation? [1]
Human posture estimation is a computer vision task that represents a person's orientation in a graphical format. This technique is widely applied to predict the position of a person's body parts or joints. It is one of the most exciting areas of research in computer vision and has gained a lot of traction because of the abundance of applications that can benefit from it. A human pose estimator identifies and classifies the posture of human body parts and joints in images or videos. In general, model-based techniques are used to represent and infer human body postures in 2D and 3D space.
It is basically a way to capture a set of coordinates by defining human body joints such as the wrist, shoulder, knee, eye, ear, ankle, and arm; these key points in images and videos can depict a person's posture.
Then, when an image or video is fed to the posture estimation model as input, it determines the coordinates of the detected body parts and joints as output, and a confidence score indicates the accuracy of the estimates.
1.1.2 The importance of estimating human posture
Person detection has long been a central topic for various applications in traditional object detection. With recent developments in machine learning algorithms, computers can now understand human body language by performing posture detection and posture tracking. The accuracy of these methods and the hardware required to run them have now reached commercially viable levels.
In addition, this technology proved profoundly transformative amid the coronavirus pandemic, where high-performance real-time posture detection and tracking enables some of the most impactful trends in machine vision. For example, it can be used to support social distancing by combining human posture estimation with distance prediction, helping people maintain physical distance from each other in crowded places.
Human posture estimation will significantly impact various industries, including security, business intelligence, health and safety, and entertainment. One of the areas where this technique has proven itself is autonomous driving. With the help of real-time human posture detection and tracking, computers can sense and predict pedestrian behavior, enabling more stable driving.
1.1.3 Human pose estimation in 2D and 3D
There are two main techniques by which posture estimation models can detect human posture.
2D Posture Estimation: In this type of pose estimation, you only estimate the position of the body joints in 2D space relative to the input data (i.e., image or video frames). The location is represented by X and Y coordinates for each key point.
3D Pose Estimation: In this type of pose estimation, you turn a 2D image into a 3D object by estimating an additional Z dimension in the prediction. 3D pose estimation allows us to predict the exact spatial position of the represented person or object.
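To make the two representations concrete, the sketch below shows a minimal, hypothetical data structure for the same key point in 2D and in 3D. The names and values are illustrative only, not the output format of any particular library:

# Hypothetical key point representations (illustrative values)
keypoint_2d = {"name": "left_wrist", "x": 312, "y": 540, "confidence": 0.91}
keypoint_3d = {"name": "left_wrist", "x": 312, "y": 540, "z": -0.12, "confidence": 0.91}  # z adds depth

# A full pose is simply a list of such key points, one entry per detected joint:
pose_2d = [keypoint_2d]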
Estimating 3D poses is a significant challenge for machine learning engineers because of the complexity involved in constructing datasets and algorithms that must account for several factors, such as the background scene of an image or video, lighting conditions, etc.
1.1.4 Models to estimate human posture [2]
a) Skeleton-based model: also known as the kinematic model, this representation includes a set of key points (joints) such as ankles, knees, shoulders, elbows, and wrists, together with the major limb orientations, and is used to estimate both 2D and 3D poses.
This flexible and intuitive human body model covers the skeletal structure of the human body and is often applied to capture the relationships between different body parts.
b) Contour-based model: also known as the planar model, it is used to estimate 2D posture and includes the rough contour and width of the body, trunk, and limbs. It basically represents the appearance and shape of the human body, where body parts are shown by the boundaries and rectangles of a person's contour.
A well-known example is Active Shape Modeling (ASM), which records the entire human body graph and silhouette deformations using principal component analysis (PCA).
c) Mass-based model: also known as the volumetric model, it is used to estimate 3D posture. It includes many popular 3D human body models and poses represented by meshes and geometric human shapes, often captured for deep learning-based 3D human pose estimation.
THEORETICAL BASIS
OpenCV image processing library
OpenCV (Open Source Computer Vision Library) is an open-source library developed by Intel starting in 2000. OpenCV is one of the best tools for developing computer vision and machine learning applications, especially real-time applications. The library is written in the C/C++ programming language and can run on the Windows, Linux, macOS, Android, and iOS operating systems.
The OpenCV library has more than 500 functions, divided into modules by functionality. Some of the main modules in the OpenCV library are listed below (a short usage sketch follows the list):
▪ Core: contains the basic structures and classes that OpenCV uses to store and process images, such as Mat, Scalar, Point, and Vec, as well as basic methods used by the other modules.
▪ Imgproc: OpenCV's image processing module, including linear and non-linear filters, geometric transformations, color space conversions, and algorithms related to image histograms.
▪ Highgui: allows user interaction through a user interface (UI), such as displaying images and videos.
▪ Feature2d: finds features in images and implements feature extraction algorithms such as PCA.
▪ Video: used for video analysis, including motion estimation, background subtraction, and object tracking algorithms.
▪ Objdetect: used to detect objects such as faces, eyes, people, and cars in images. Among the algorithms used in this module are Haar-like features.
▪ Ml: contains machine learning algorithms for classification and clustering problems. Some of the algorithms in this module are SVM (Support Vector Machine) and ANN (Artificial Neural Network).
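As a minimal illustration of how these modules fit together, the sketch below loads an image, applies an Imgproc filter, and displays the result with Highgui. The file name input.jpg is a placeholder; any image path will do:

import cv2  # OpenCV's Python bindings

img = cv2.imread("input.jpg")                 # Core: the image is stored as a Mat (a NumPy array in Python)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # Imgproc: color space conversion
blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # Imgproc: a linear (Gaussian) filter

cv2.imshow("Blurred", blurred)                # Highgui: display the result in a window
cv2.waitKey(0)
cv2.destroyAllWindows()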
The OpenCV library has been used to build applications in many different fields. In robotics, it is used for navigation, obstacle avoidance, and human-machine interaction. In medicine, it is used to classify and detect cancer cells, perform 2D and 3D segmentation, and reconstruct 3D images of organs. In industrial automation, the OpenCV library is used to build applications for fault identification, barcode checking, and product sorting. In the field of information security and safety, OpenCV is widely used for camera surveillance and biometric image processing.
MediaPipe
In a nutshell, MediaPipe is a collection of cross-platform, interoperable, and extremely lightweight machine learning solutions. Some of the advantages of this toolkit include (a minimal installation example follows the list):
• Provides a quick inference solution: Google claims that this toolkit can run stably on most common hardware configurations.
• Easy to install and deploy: installation is extremely easy and convenient, and the toolkit can be deployed on many different platforms such as mobile (Android/iOS), desktop/cloud, web, and IoT devices.
• Free and open source: the entire source code is publicly available, and users can use and customize it directly to suit their problems.
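For reference, a typical way to install and smoke-test the Python package is shown below; this is a common setup path, not an official excerpt from the MediaPipe documentation:

# install from PyPI:
# pip install mediapipe opencv-python

import mediapipe as mp

# list a few of the bundled solutions to confirm the install works
print(mp.solutions.pose)       # human pose estimation
print(mp.solutions.hands)      # hand landmark detection
print(mp.solutions.face_mesh)  # 468-point face mesh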
Figure 2.3 Solutions offered and availability across platforms in MediaPipe
Most of the outstanding problems in the field of computer vision have been implemented by Google in MediaPipe. We will go through the solutions provided to better understand the diversity of MediaPipe.
Face Detection is a problem familiar to everyone. With an image or a video as input, the task is to find the position and bounding box of the human faces appearing in it, as well as to mark the important points (MediaPipe uses 5 landmarks) on each face. MediaPipe Face Detection uses the BlazeFace network as its foundation but changes the backbone. In addition, the NMS (non-maximum suppression) algorithm has been replaced by another strategy, greatly reducing processing time.
Figure 2.4 Face Detection on an Android device
Instead of finding the bounding box surrounding the face, Face Mesh identifies a series of points on the face, thereby forming a mesh of the face. This mesh can be applied to 3D face image editing or to tasks related to 3D alignment and anti-spoofing. The tool in MediaPipe generates a total of 468 points on the face and creates the mesh without requiring much computing power or many cameras (just one front camera).
Figure 2.5 Face Mesh when applied to Augmented Reality
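A minimal sketch of running Face Mesh on a single image is shown below, assuming a local image file named face.jpg; it simply confirms the 468 landmarks mentioned above:

import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
with mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as face_mesh:
    img = cv2.imread("face.jpg")
    results = face_mesh.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        # each detected face carries 468 normalized landmarks
        print(len(results.multi_face_landmarks[0].landmark))  # -> 468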
Hands Detection, also known as hand recognition, produces as output a skeleton model of the hand, which includes the locations of landmarks on the hand, joined together to form a complete hand frame.
Figure 2.6 3D-Skeleton recognition of the hand
Expanding on the Hands Detection problem, Human Pose Estimation provides a 3D skeleton model of the whole body, with predefined key joints joined together to form the human frame. The strategy used for this problem is similar to Hands Detection and Face Mesh: BlazeFace, again, serves as the main idea for this processing algorithm.
Figure 2.7 Human Pose Tracking in 3D
In addition, there are many more solutions provided by Google in this kit, including segmentation problems (Selfie, Hair, ...), Object Detection, Motion Tracking, 3D Object Detection, and more.
PyCharm
PyCharm software provides a complete set of tools for professional Python developers.
PyCharm is built around an editor that understands code deeply and a debugger that gives a clear view of how the code works. PyCharm integrates with collaboration tools such as version control systems and issue trackers. The professional edition extends the essentials by seamlessly integrating with web frameworks, JavaScript engines, virtualization, and containerization support.
An important aspect of programming is understanding the code base you are working on. PyCharm lets you explore your project with just a few clicks, gives you an overview of the project structure, and gives you instant access to relevant documentation from the editor. Understanding a code base faster means speeding up your development.
Key features of PyCharm:
• Supports Windows, macOS and Linux
• Smart code completion, one-click navigation, and PEP 8 style checking
• Safe and automated refactoring of your project
• Automatic detection of code problems, e.g. unused code analysis
• Vim emulation mode
• Python debugging
• Version control
• Unit testing & code coverage
• Profiling
• Database tools
• Web frameworks: Django, Flask, Pyramid, Web2py
• Web development: JavaScript, HTML/CSS, AngularJS, React, Node.js, Vue.js
• Virtualization: remote servers, Vagrant boxes, Docker containers
BUILDING THE SYSTEM
Camera
- Use three cameras from three mobile phones with different resolutions:
Redmi Note 11: 720p, 30 FPS
Redmi Note 11S: 1080p, 30 FPS
Resolution   Horizontal × vertical pixels   Number of pixels
480p         640 × 480                      307,200
720p         1,280 × 720                    921,600
1080p        1,920 × 1,080                  2,073,600
480p is now considered a low resolution on mobile devices; it appears in the FWVGA (854 × 480 pixels), WVGA (720 × 480 pixels), and VGA (640 × 480 pixels) screen standards.
Currently, users only encounter this parameter on low-cost phones or feature phones.
720p is the quality of an HD screen, commonly 1,280 × 720 pixels. The HD standard also has another variant, HD+, which keeps the 720p height but comes in more varied widths such as 1,440 × 720, 1,480 × 720, and 1,520 × 720 pixels. Currently this screen standard is no longer popular and only appears on low-cost smartphones.
1080p is another name for Full HD and Full HD+ screens; these two types of screens share a height of 1,080 pixels but differ in width. Specifically, Full HD has a resolution of 1,920 × 1,080 pixels with an aspect ratio of 16:9, while Full HD+ comes in more varied sizes such as 2,160 × 1,080, 2,280 × 1,080, and 2,340 × 1,080 pixels.
Tripod
We use a Weifeng WT-3520 tripod to increase stability during video recording. Parameters:
- Folding height: 0.560 m
- Max operating height: 1.400 m
- Min height: 0.545 m
- Pipe diameter: 21.2 * 16.2 cm
- Net weight: 920 grams
- Max load capacity: 3 kg
The camera tripod consists of two main parts: the stand and the head for attaching the camera. The stand usually has three telescoping legs, which can be adjusted in length and angle. The camera head comes in two types, one with fine adjustment and one that rotates freely, to suit the purpose of use.
Personal computer
The personal computer serves as a high-precision motion calculation and processing engine based on the OpenCV and MediaPipe libraries described in the previous chapter.
Building processing algorithms
Figure 3.9 Diagram of processing algorithm
3.2.1 Loading the video into PyCharm
A video about 15 seconds long, after being captured from a camera, is loaded into PyCharm and processing begins. Videos of 480p, 720p, and 1080p quality are processed separately in turn, and the results are output to Excel. The following code shows how the video to be analyzed is opened from its path:

cap = cv2.VideoCapture("the_duc\h480-30fps.mp4")
# cap = cv2.VideoCapture("the_duc\h720-30fps.mp4")
# cap = cv2.VideoCapture('the_duc\h1080-30fps.mp4')
# cap = cv2.VideoCapture("VID_20230314_143131.mp4")
3.2.2 Using the pose estimation algorithm to determine the joint points
This code shows how the detected skeleton is drawn on the original image:

imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
self.results = self.pose.process(imgRGB)
if self.results.pose_landmarks:
    # # Display the found normalized landmarks:
    # print(f'{self.mpPose.PoseLandmark(i).name}:\n{self.results.pose_landmarks.landmark[self.mpPose.PoseLandmark(i).value]}')
    if draw:
        self.mpDraw.draw_landmarks(img, self.results.pose_landmarks, self.mpPose.POSE_CONNECTIONS)
return img
The video, after being captured from the camera, is fed to the computer and separated into many single RGB color images. The number of images depends on the frame rate of the camera; in this study the frame rate was 30 fps (frames per second), which means about 450 images for each 15-second video.
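As a quick check, OpenCV can report the frame rate and frame count of a capture directly; the sketch below, using the 480p file path from above, illustrates the arithmetic:

cap = cv2.VideoCapture("the_duc\h480-30fps.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)             # e.g. 30.0
frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)  # e.g. about 450 for a 15-second clip
print(fps, frames, frames / fps)            # duration in seconds = frames / fps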
After using the Pose Landmark Model (BlazePose GHUM 3D) to detect motion in an image, it returns 33 points on the body, as shown below:
Figure 3.10 Detected motion on an image by Pose Landmark Model
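The landmark indices used later in this project (12, 14, and 16) correspond to the right shoulder, right elbow, and right wrist. This can be verified with MediaPipe's PoseLandmark enum:

import mediapipe as mp

for i in (12, 14, 16):
    print(i, mp.solutions.pose.PoseLandmark(i).name)
# 12 RIGHT_SHOULDER
# 14 RIGHT_ELBOW
# 16 RIGHT_WRIST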
The way the joints on the body are found and displayed is shown by the following code:

self.lmList = []
if self.results.pose_landmarks:
    for id, lm in enumerate(self.results.pose_landmarks.landmark):
        h, w, c = img.shape
        # print(id, lm)
        cx, cy = int(lm.x * w), int(lm.y * h)  # scale normalized coordinates to pixels
        self.lmList.append([id, cx, cy])
        if draw:
            cv2.circle(img, (cx, cy), 5, (255, 0, 0), cv2.FILLED)
return self.lmList
After getting the width and height of the video frame, we multiply the normalized x and y ratios of each joint point by these dimensions to get the pixel coordinates of the point in the frame. For example, a landmark with normalized x = 0.5 in a 650-pixel-wide frame maps to cx = int(0.5 × 650) = 325. We then append the coordinates to a list, thereby recording the variation of the coordinates over time.
Each joint marker is drawn 5 units in size, creating a human frame in the RGB color system.
3.2.4 Output the results to Excel and compare graphically
3.2.4.1 Output the results to Excel
# collect the joints to export to Excel
joint_list = (lmList[14] + lmList[16] + lmList[12] + list([cTime]))
pointCoor = pd.concat([pointCoor, pd.Series(joint_list)], axis=1)
# key to press to stop processing and plotting
# if cv2.waitKey(1) & 0xFF == ord('d'):
# if cv2.waitKey(1) & (not success):
#     break

# create column names and write to Excel
pointCoor = pointCoor.T
pointCoor.columns = ['14', 'arm_x', 'arm_y', '16', 'wrist_x', 'wrist_y',
                     '12', 'shoulder_x', 'shoulder_y', 'times']
pointCoor.to_csv("Excel\w480-30fps.csv", index=None, header=True)
# pointCoor.to_csv("Excel\w720-30fps.csv", index=None, header=True)
# pointCoor.to_csv("Excel\w1080-30fps.csv", index=None, header=True)

# call main and plot using matplotlib.pyplot and pandas
if __name__ == "__main__":
    main()
The list of joints is exported to Excel through the joint_list variable. Landmarks 14, 16, and 12, at the elbow, wrist, and shoulder positions respectively, are written to their respective rows and columns and vary with the video's running time as the object moves.
The data is exported to Excel and then plotted into graphs. We process the data to create three graphs for the different camera qualities, then combine them into one graph and give a comparative assessment.
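The plotting step is not shown in the full listing below, so the following is a minimal sketch of how the exported CSV files could be combined into one comparison chart with pandas and matplotlib. The column names match the export code above, but the exact styling is an assumption:

import pandas as pd
import matplotlib.pyplot as plt

# read the three exported CSV files (one per resolution) and plot them together
for path, label in [("Excel/w480-30fps.csv", "480p"),
                    ("Excel/w720-30fps.csv", "720p"),
                    ("Excel/w1080-30fps.csv", "1080p")]:
    df = pd.read_csv(path)
    plt.plot(df["times"], df["wrist_y"], label=label)  # wrist y-coordinate over time

plt.xlabel("time (s)")
plt.ylabel("wrist y-coordinate (pixels)")
plt.legend()
plt.show()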
The full program code for the project
import cv2
import mediapipe as mp
import time
import math
import numpy as np
import pandas as pd
# Create a class that contains separate functions to define landmarks and human frames
class poseDetector():
    # Initialize the attributes used by the other functions in the class
    def __init__(self, mode=False, upBody=False, smooth=True, detectionCon=0.5, trackCon=0.5):
        self.mode = mode
        self.upBody = upBody
        self.smooth = smooth
        self.detectionCon = detectionCon
        self.trackCon = trackCon
        self.mpDraw = mp.solutions.drawing_utils
        self.mpPose = mp.solutions.pose
        self.pose = self.mpPose.Pose()  # the default Pose parameters are used here

    def findPose(self, img, draw=True):
        # Draw the detected skeleton on the original image
        imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        self.results = self.pose.process(imgRGB)
        if self.results.pose_landmarks:
            # # Display the found normalized landmarks:
            # print(f'{self.mpPose.PoseLandmark(i).name}:\n{self.results.pose_landmarks.landmark[self.mpPose.PoseLandmark(i).value]}')
            if draw:
                self.mpDraw.draw_landmarks(img, self.results.pose_landmarks, self.mpPose.POSE_CONNECTIONS)
        return img

    def findPosition(self, img, draw=True, display=True):
        # Find and display the joints on the body
        self.lmList = []
        if self.results.pose_landmarks:
            for id, lm in enumerate(self.results.pose_landmarks.landmark):
                h, w, c = img.shape
                # print(id, lm)
                cx, cy = int(lm.x * w), int(lm.y * h)
                self.lmList.append([id, cx, cy])
                if draw:
                    cv2.circle(img, (cx, cy), 5, (255, 0, 0), cv2.FILLED)
        return self.lmList

    def findAngle(self, img, p1, p2, p3, draw=True):
        # Find the angle based on the ids of three different joints passed in p1, p2, p3
        # Get the landmarks
        x1, y1 = self.lmList[p1][1:]
        x2, y2 = self.lmList[p2][1:]
        x3, y3 = self.lmList[p3][1:]

        # Calculate the angle at p2 from the segments p2-p1 and p2-p3
        angle = math.degrees(math.atan2(y3 - y2, x3 - x2) -
                             math.atan2(y1 - y2, x1 - x2))
        if angle < 0:
            angle += 360
        if angle > 180:
            angle = 360 - angle
        # print(angle)

        # Draw the two segments and highlight the three joints
        if draw:
            cv2.line(img, (x1, y1), (x2, y2), (255, 255, 255), 3)
            cv2.line(img, (x3, y3), (x2, y2), (255, 255, 255), 3)
            cv2.circle(img, (x1, y1), 10, (0, 0, 255), cv2.FILLED)
            cv2.circle(img, (x1, y1), 15, (0, 0, 255), 2)
            cv2.circle(img, (x2, y2), 10, (0, 0, 255), cv2.FILLED)
            cv2.circle(img, (x2, y2), 15, (0, 0, 255), 2)
            cv2.circle(img, (x3, y3), 10, (0, 0, 255), cv2.FILLED)
            cv2.circle(img, (x3, y3), 15, (0, 0, 255), 2)
            cv2.putText(img, str(int(angle)), (x2 - 50, y2 + 50),
                        cv2.FONT_HERSHEY_PLAIN, 2, (0, 0, 255), 2)
        return angle
# The main function that calls the functions created above
def main():
    # angle_list = pd.DataFrame()
    time_0 = time.time()  # get the initial time when starting to run the code
    pointCoor = pd.DataFrame()  # collect data in a DataFrame to plot with pandas

    # how to get the video to be analyzed from its path
    cap = cv2.VideoCapture("the_duc\h480-30fps.mp4")
    # cap = cv2.VideoCapture("the_duc\h720-30fps.mp4")
    # cap = cv2.VideoCapture('the_duc\h1080-30fps.mp4')
    # cap = cv2.VideoCapture("VID_20230314_143131.mp4")
    pTime = 0  # extra variable to calculate fps
    detector = poseDetector()  # call the class defined above
    # The processing works on each image, so it is put in a while loop to display continuously as a video
    while True:
        success, img = cap.read()  # read an image from the input source
        # print(type(img))
        if cv2.waitKey(1) & (not success):
            # if cv2.waitKey(1) & 0xFF == ord('d'):
            break
        # img = cv2.resize(img, dsize=None, fx=0.5, fy=0.5)
        dim = (650, 850)
        # resize the image for easy handling and for the same frame size;
        # the videos are compared with each other, so they must be as uniform as possible
        img = cv2.resize(img, dim, interpolation=cv2.INTER_AREA)
        # img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # image filter, not used for now
        # img = cv2.flip(img, 1)  # flip image
        img = detector.findPose(img)  # draw the skeleton
        lmList = detector.findPosition(img, draw=False)  # create the list of points
        # Draw and find angles
        if len(lmList) != 0:
            # print(lmList[14], ' ', type(lmList[14][1]))
            cv2.circle(img, (lmList[14][1], lmList[14][2]), 9, (0, 0, 255), cv2.FILLED)
            cv2.circle(img, (lmList[12][1], lmList[12][2]), 9, (0, 0, 255), cv2.FILLED)
            # the start times of the videos differ, so the elapsed time is used to align the units
            cTime = time.time() - time_0
            fps = 1 / (cTime - pTime)  # calculate fps
            pTime = cTime
            cv2.putText(img, str(int(fps)), (70, 50), cv2.FONT_HERSHEY_PLAIN, 3, (255, 0, 0), 3)
            cv2.imshow("Image", img)

            # collect the joints to export to Excel
            joint_list = (lmList[14] + lmList[16] + lmList[12] + list([cTime]))
            pointCoor = pd.concat([pointCoor, pd.Series(joint_list)], axis=1)
    # key to press to stop processing and plotting
    # if cv2.waitKey(1) & 0xFF == ord('d'):
    # if cv2.waitKey(1) & (not success):
    #     break

    # create column names and write to Excel
    pointCoor = pointCoor.T
    pointCoor.columns = ['14', 'arm_x', 'arm_y', '16', 'wrist_x', 'wrist_y',
                         '12', 'shoulder_x', 'shoulder_y', 'times']
    pointCoor.to_csv("Excel\w480-30fps.csv", index=None, header=True)
    # pointCoor.to_csv("Excel\w720-30fps.csv", index=None, header=True)
    # pointCoor.to_csv("Excel\w1080-30fps.csv", index=None, header=True)
# call main and plot using matplotlib.pyplot and pandas
if __name__ == "__main__":
    main()
EXPERIMENT AND RESULTS ASSESSMENT
Testing the measurement method
The hardware system includes three cameras from three different phones, a tripod to hold the three cameras, and a personal computer as described in the previous section. The cameras used in this project are mobile phone cameras with resolutions of 480p, 720p, and 1080p respectively, at a sampling rate of 30 FPS. The tripod holding the three cameras plays a very important role in limiting the effects of shaking and fixing the three cameras in one position, producing videos with the least influence from external factors. The personal computer runs PyCharm with OpenCV and MediaPipe installed, and the obtained results are exported to Excel, thereby providing parameters on how the error of each camera affects joint point recognition.
Figure 4.11 Camera setup for data collection
For the experimental video recording, the moving subjects were instructed in some movements, especially movements related to the position of joints such as shoulder, elbow, and wrist exercises. The exercise consists of spreading the arms and bringing them up and down in the most comfortable way within 15 seconds. The cameras that capture the video are placed on a fixed tripod and can capture the full frame of the experimenter.
The movements are performed in a room with good lighting conditions and a temperature suitable for the experimenter. The video is recorded within 15 seconds, from the time the experimenter starts the movement until the end of the movement.
Then the computer is used to process the captured videos to determine the effect of each camera.
Experimental results
We carried out the experiment on a number of volunteers who were prepared with pre-defined exercises. The volunteers' movements were recorded in short videos, long enough for processing and calculation. After processing the videos and outputting the data to Excel, we obtained a graph representing the coordinates of three lines over time, as shown below:
Figure 4.13 Line chart comparing joint point recognition at the 480P, 720P, and 1080P resolutions
Considering the overall shape of the three lines: the 480p line oscillates up and down continuously, even during intervals when the coordinate should be increasing steadily over time. The noise level decreases significantly as camera quality increases: the rising and falling segments become smoother, and when the point is stationary, the coordinates are less prone to change. After reviewing the data presented on the chart, the group recommends choosing the 1080p camera, because the detection noise of the joint points is minimized. Tasks that use the joint point parameters will therefore be more accurate and the data will not be interrupted, which matters especially for jobs that need high accuracy, such as angle measurement and length measurement.
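As a possible complement to the visual comparison, the noise in each line could also be quantified numerically, for example as the standard deviation of the frame-to-frame coordinate changes. The sketch below, which assumes the CSV files exported earlier, is one way to do this; it is not part of the original evaluation procedure:

import pandas as pd

for path, label in [("Excel/w480-30fps.csv", "480p"),
                    ("Excel/w720-30fps.csv", "720p"),
                    ("Excel/w1080-30fps.csv", "1080p")]:
    df = pd.read_csv(path)
    # frame-to-frame change of the wrist y-coordinate; its std is a simple noise measure
    jitter = df["wrist_y"].diff().std()
    print(label, round(jitter, 2))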
REFERENCES
[1] Jeong-Seop Han, Choong-Iyeol Lee, Young-Hwa Youn, and Sung-Jun Kim, "A Study on Real-time Hand Gesture Recognition," Journal of System and Management Sciences, vol. 12, pp. 466-470, 2022.
[2] Sankeerthana Rajan Karem, Sai Prathyusha Kanisetti, K. Soumya, J. Sri Gayathri Seelamanthula, and Madhurima Kalivarapu, "AI Body Language Decoder using MediaPipe and Python," International Journal of Advance Research, Ideas and Innovations in Technology, vol. 7, no. 3, pp. 2436-2437, 2021.
[3] Indriani, Moh. Harris, and Ali Suryaperdana Agoes, "Applying Hand Gesture Recognition for User Guide Application Using MediaPipe," Atlantis Press International, vol. 207, pp. 103-105, 2021.
[4] R. Josyula, "Human Pose Estimation," pp. 3-11, 2021.
[5] Chittineni Yashwanth and Akula Abheshek, "System Functionality Control by Air Mouse," International Journal of Emerging Technologies and Innovative Research, vol. 8, no. 7, pp. 855-856, July 2021.
[6] Tharani R, Gopika Sri R, Hemapriya R, and Karthiga M, "Gym Posture Recognition and Feedback Generation Using Mediapipe and OpenCV," vol. 8, no. 5, pp. 2054-2055, 2022.
[7] S. Shriram, B. Nagaraj, J. Jaya, S. Shankar, and P. Ajay, "Deep Learning-Based Real-Time AI Virtual Mouse System Using Computer Vision to Avoid COVID-19 Spread," Research Article, pp. 5-6, 2021.
[8] Sankha Sarkar, Indrani Naskar, Sourav Sahoo, and Sayan Ghosh, "A Vision Based Application For Virtual Mouse," International Journal of Innovative Science and Research Technology, vol. 6, no. 11, pp. 941-942, November 2021.