2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Socially aware robot navigation framework: Social activities recognition using deep learning techniques

Ngoc Anh Pham, Lan Anh Nguyen and Xuan Tung Truong
Faculty of Control Engineering, Le Quy Don Technical University, Hanoi, Vietnam
xuantung.truong@gmail.com

Abstract—In this study, we propose a deep learning based social activities recognition algorithm for a socially aware mobile robot navigation framework. The proposed method uses the OpenPose library and a long short-term memory (LSTM) deep neural network, which observes the human skeleton over a number of time steps and then predicts the human social activity: running, walking, standing, sitting or lying down. We train and test the proposed model on a dataset that we built ourselves. The experimental results show that the proposed method predicts human social activities with high accuracy.

Index Terms—Social activities recognition, OpenPose, LSTM, Socially aware navigation, Mobile service robot

I. INTRODUCTION

In recent years, autonomous mobile robots have been increasingly researched, developed and applied in social life and in the military field. The strong development of the fourth scientific and technological revolution, together with the trend of globalization, has been a driving force in manufacturing technology and in the application of autonomous mobile robots in all areas of life. Although today's robot navigation systems can drive a mobile robot to avoid and approach humans in a socially acceptable manner, and can provide respectful and polite behaviors akin to those of humans [1], [2], [3], they still fall short of the following requirements if we wish to deploy robots in daily-life settings: (i) a robot should react according to the social cues and signals of humans (facial expression, voice pitch and tone, body language, gestures), (ii) a robot should predict
the future actions of humans [4], and (iii) a robot should be able to estimate the social activities of the humans in its vicinity.

Robots that navigate in social environments are greatly affected by human navigation processes and by the decisions humans make. Robots can therefore make better decisions if they know in advance the plans that humans will make, so that they can foresee human trajectories. Previous work on trajectory prediction faces challenges that stem from the inherent properties of human motion in crowded scenes [5]: it is interpersonal, it is socially acceptable, and it admits multiple trajectories. Traditional methods based on hand-crafted features [6] have addressed the interpersonal aspect extensively. Methods based on recurrent neural networks (RNNs) [7], [8] adapt effectively to the socially acceptable aspect. These methods use long short-term memory (LSTM) networks to reason jointly across multiple agents and predict their trajectories in a scene. In addition, the problem of multiple trajectories has been studied in the context of route choices in a given scene [5]. Moreover, the authors of [9] demonstrate that pedestrians make different navigation choices in crowded scenes depending on their personal properties.

978-1-6654-1001-4/21/$31.00 ©2021 IEEE

Nevertheless, predicting human social activities is an important part of investigating human movement and of supporting future trajectory prediction, because it allows a mobile robot to anticipate the situation of nearby humans and to set up the corresponding action scenarios in advance. Human social activities prediction has been studied and incorporated into robotic systems, for example in trajectory planning of robot arms [10], [11], mobile robots [12] and autonomous driving [13]. The authors of [11] and [12] used hidden Markov models (HMMs) to model activities and recognize human intents. In addition, by employing radial basis function neural networks (RBFNNs) in [10], the motion intention of the
human has been estimated. However, these systems are limited when the number of human action types increases or when the scene is crowded. Therefore, in this study, we propose an improved system to predict social activities, including standing, sitting, lying down and walking, which uses the output of the OpenPose model and a deep long short-term memory network. We aim to apply the output of this system to the socially aware navigation system of mobile robots, helping them avoid humans effectively in crowded environments, because it enables the robot to understand humans' intentions and foresee their future trajectories.

The remainder of this paper is organized as follows. Section II introduces the background information used in the paper. The human social activities prediction algorithm using deep learning techniques is presented in Section III. Section IV shows the experimental results. The conclusion is provided in Section V.

II. BACKGROUND INFORMATION

A. The Overview of the OpenPose Model

The OpenPose model is a real-time multi-person keypoint detection library for body, face, hand and foot estimation. It was created by the CMU Perceptual Computing Laboratory. The first version was released in July 2017, and the latest version so far is 1.7.0. It supports Ubuntu (20, 18, 16, 14), Windows (10, 8), macOS and NVIDIA TX2 embedded computers. The OpenPose algorithm is presented in detail in [14] and [15]. The original architecture of the OpenPose model consists of a convolutional neural network with two branches, in which the first branch predicts the confidence maps of body parts and the second branch predicts the part affinity fields.

B. Long-Short Term Memory Technique

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture [16] used in the field of deep learning. Unlike standard feedforward neural networks, the LSTM model has feedback
connections. As a result, it can process not only single data points, such as images, but also entire sequences of data, such as speech or video. A common LSTM unit is composed of a cell state, an input gate, an output gate and a forget gate. The cell state remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.

The LSTM model is well suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series. The LSTM model was developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Its relative insensitivity to gap length is an advantage over traditional RNNs, hidden Markov models and other sequence learning methods in numerous applications.

[Fig.: The architecture of the LSTM model.]

Regarding the OpenPose model, its inputs can be an image, a video, a webcam, a Flir/Point Grey camera or an IP camera. Its outputs are the basic image with keypoint display, saved in popular formats such as PNG, JPG and AVI, or the keypoints saved as JSON, XML, etc. The number of body keypoints that can be exported from the OpenPose model is 15, 18 or 25. In particular, the authors also provide an API (Application Programming Interface) for two popular languages, Python and C++, allowing users to easily use the OpenPose model in their applications.

[Fig.: The architecture of the OpenPose algorithm.]
[Fig.: The example result of the OpenPose algorithm, panels (a) and (b).]

III. PROPOSED METHOD

In this study, we divide the human social activities recognized for socially aware mobile robot navigation systems into five categories, as illustrated in the figure of the experimental scenario. The human social activities are standing, sitting, lying down, walking and running. Studies have shown that a person's posture carries a lot of information, including
emotions and health conditions [17]. Does a person's posture also contain information about that person's social activities? To answer this question, we use an LSTM network to observe the person's posture over $n_{steps}$ time steps and then recognize the person's social activity. The block diagram of the proposed system is shown in the block-diagram figure. The proposed system consists of two phases: training and testing. The skeleton of each person is extracted by the OpenPose algorithm. A skeleton consists of the 2D coordinates of $j$ keypoints on the body (15, 18 or 25 keypoints); therefore, for each time step we have a coordinate vector $X = [x_1, x_2, \ldots, x_k]$ with $x_i \in \mathbb{R}$ and $k = 2j$. As a result, the input of the LSTM network is an $n_{steps} \times k$ matrix whose rows are these vectors, while the output is one of $n_{case}$ human social activities (as shown above, $n_{case} = 5$), represented as a one-hot vector.

[Fig.: The experimental scenario of human-robot interaction: (a) a standing human, (b) a sitting human, (c) a lying down human, (d) a walking human and (e) a running human.]
[Fig.: The sliding window.]
[Fig.: The block diagram of the social activities recognition algorithm.]

TABLE I
THE SET OF PARAMETERS

Parameter             Value
N steps               32
Hidden layer          48
Classes               5
Learning rate         0.0025
Init learning rate    0.005
Decay rate            0.96
Decay steps           6000
Epochs                300
Batch size            512
Lambda loss amount    0.0015

A. Data preparation

Dataset preparation is one of the most important steps in training deep learning models. It can significantly affect the overall performance, accuracy and usability of the trained model. A dataset of human social activities is not available, so we created our own dataset by recording multiple videos in different environmental conditions. To create the time step dataset, we use the sliding window technique, as shown in the sliding-window figure. A window of width $n_{steps}$ slides across the data series; at each step we obtain a data point and a corresponding label. The label is the name of the social activity: standing, sitting, lying down, walking or running. The values of the keypoints in each window are written to the input set X, while the ground truth, represented by a classification label, is written to the output set Y. We do the same for the training and testing sets.

B. Training Process

The dataset is split into two sets: 80 percent for training and 20 percent for testing. It is extremely important that the training set and the testing set are independent of each other and do not overlap. The parameter values were set empirically and are listed in Table I. We tried different values for the batch size and the number of epochs during training. The training process ran automatically and finished when the preset number of epochs was reached. The model was saved after every fixed number of epochs. At the end of the training process, we exported the prediction results on the training set and testing set to evaluate the newly trained model.

We filmed a variety of human subjects with different heights, weights and BMI (Body Mass Index) to create our datasets. The number of keypoints $j$ is set to 25 and the number of time steps $n_{steps}$ is set to 32, so the input of the network is a 32x50 matrix. We tested different hidden-layer sizes for the LSTM network to find a good parameter set. The resulting training and testing accuracy is shown in the corresponding figure. In this case, we used a hidden-layer size of 48 and 300 epochs. The confusion matrix for the testing set is shown in the confusion-matrix figure. The accuracy of the proposed model on the testing set is over 95 percent.

[Fig.: The training and testing accuracy.]
[Fig.: The confusion matrix.]

IV. EXPERIMENTAL RESULTS

A. Experimental Setup

To collect the data for training and testing the proposed model, we use a smartphone to represent the
position of the mobile robot; the smartphone camera has Full HD (1920x1080) resolution. However, to increase the frame rate of the prediction process, the videos are scaled to 640x480 resolution before being fed to the proposed model. The human stands 8–10 m away from the camera and performs the social activities, including standing, sitting, lying down, walking and running, as shown in the figure of the experimental scenario. We also record a video that contains a combination of several human social activities to evaluate the accuracy of the proposed algorithm. The training and testing processes were run on a desktop computer with an Intel Core i7-10700 CPU, 16 GB RAM and an NVIDIA GeForce GTX 1650 card, running the Ubuntu 18.04 operating system.

B. Experimental results

The experimental results are shown in Fig. 9. A video of our experiments can be found at the link given in footnote 1. The proposed LSTM network model predicts clear-cut human social activities such as standing, sitting and lying down very well, as illustrated in Fig. 9(a), 9(b) and 9(c). In the more difficult cases, for example when the human walks or runs, as shown in Fig. 9(d) and 9(e), the output of the proposed model is still good. In these cases, when the human changes moving direction, the model may confuse running with walking in the early frames. In addition, we conduct an experiment that combines the single social activities. Although the movement is complicated, the LSTM network model produces good and stable results. Based on these results, we are going to incorporate the information into socially aware navigation systems. It enables a robot to perceive humans' intentions and hence to predict their future trajectories. Therefore, the robot is able to avoid humans more proactively and efficiently.

V. CONCLUSIONS

In this article, we have presented an approach that recognizes human social activities for socially aware mobile robot navigation systems using deep learning techniques. We make use of the
OpenPose model to extract human posture and an LSTM network to observe a person over a certain period of time. We then distinguish the social activities of the human in front of the mobile robot. This approach initially gave some very positive results. In the future, we will continue to develop the algorithm by pre-processing information from the image and by applying the algorithm to multiple people. In addition, we will incorporate the social activities into the socially aware mobile robot navigation system to evaluate its usefulness.

[Fig. 9: The examples of experimental results: (a) a standing human, (b) a sitting human, (c) a lying down human, (d) a walking human and (e) a running human.]

1 https://youtu.be/WM5OJJ3icIA

REFERENCES

[1] M. Shiomi, F. Zanlungo, K. Hayashi, and T. Kanda, "Towards a socially acceptable collision avoidance for a mobile robot navigating among pedestrians using a pedestrian model," International Journal of Social Robotics, vol. 6, no. 3, pp. 443–455, 2014.
[2] X. T. Truong and T. D. Ngo, "Toward socially aware robot navigation in dynamic and crowded environments: A proactive social motion model," IEEE Transactions on Automation Science and Engineering, 2017, pp. 1743–1760.
[3] Y. F. Chen, M. Everett, M. Liu, and J. P. How, "Socially aware motion planning with deep reinforcement learning," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[4] X. T. Truong and T. D. Ngo, "Toward socially aware robot navigation in dynamic and crowded environments: A proactive social motion model," in ICRA 2019 Workshop on MoRobAE-Mobile Robot Assistants for the Elderly, Montreal, Canada, 2019, pp. 20–24.
[5] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[6] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, "Who are you with and where are you going?" in CVPR 2011, 2011, pp. 1345–1352.
[7] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 961–971.
[8] F. Bartoli, G. Lisanti, L. Ballan, and A. Del Bimbo, "Context-aware trajectory prediction," in 2018 24th International Conference on Pattern Recognition (ICPR), 2018, pp. 1941–1946.
[9] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, "Learning social etiquette: Human trajectory understanding in crowded scenes," in Computer Vision – ECCV 2016. Springer International Publishing, 2016, pp. 549–565.
[10] Y. Li and S. S. Ge, "Human–robot collaboration based on motion intention estimation," IEEE/ASME Transactions on Mechatronics, vol. 19, no. 3, pp. 1007–1014, 2013.
[11] J. S. Park, C. Park, and D. Manocha, "I-planner: Intention-aware motion planning using learning-based human motion prediction," The International Journal of Robotics Research, vol. 38, no. 1, pp. 23–39, 2019.
[12] R. Kelley, A. Tavakkoli, C. King, M. Nicolescu, M. Nicolescu, and G. Bebis, "Understanding human intentions via hidden Markov models in autonomous mobile robots," in Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, 2008, pp. 367–374.
[13] T. Bandyopadhyay, K. S. Won, E. Frazzoli, D. Hsu, W. S. Lee, and D. Rus, "Intention-aware motion planning," in Algorithmic Foundations of Robotics X. Springer, pp. 475–491, 2013.
[14] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
[15] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1,
2019, pp. 172–186.
[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, 1997, pp. 1735–1780.
[17] V. Narayanan, B. M. Manoghar, V. S. Dorbala, D. Manocha, and A. Bera, "Clearpath: Highly parallel collision avoidance for multiagent simulation," arXiv preprint arXiv:2003.01062, 2020.
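
The sliding-window dataset construction described in Section III-A can be illustrated with a minimal sketch. It is not the authors' code: the function name `make_windows`, the `stride` argument, and the choice to label each window with the activity of its last frame are assumptions made here for illustration. Only the window width (32), the vector length (50 for 25 keypoints) and the five activity classes are taken from the paper.

```python
import numpy as np

# Class order is an assumption; the paper only names the five activities.
ACTIVITIES = ["standing", "sitting", "lying down", "walking", "running"]


def make_windows(keypoints, labels, n_steps=32, stride=1):
    """Slice a per-frame keypoint series into fixed-length windows.

    keypoints: array of shape (n_frames, k), with k = 2 * j flattened
               (x, y) coordinates per frame (k = 50 for j = 25).
    labels:    integer class index per frame, shape (n_frames,).
    Returns X of shape (n_windows, n_steps, k) and a one-hot Y of shape
    (n_windows, n_classes). Each window takes the label of its last
    frame, which is an assumption of this sketch.
    """
    keypoints = np.asarray(keypoints, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.int64)
    n_frames = keypoints.shape[0]
    if n_frames < n_steps:
        raise ValueError("series is shorter than one window")

    # Start index of every window that fits entirely inside the series.
    starts = range(0, n_frames - n_steps + 1, stride)
    X = np.stack([keypoints[s:s + n_steps] for s in starts])
    last = np.array([labels[s + n_steps - 1] for s in starts])

    # Row i of the identity matrix is the one-hot vector for class i.
    Y = np.eye(len(ACTIVITIES), dtype=np.float32)[last]
    return X, Y


# Example with random stand-in data: 100 frames, 25 keypoints per frame.
rng = np.random.default_rng(0)
frames = rng.random((100, 50))
frame_labels = np.repeat([0, 3], 50)  # 50 frames standing, 50 walking
X, Y = make_windows(frames, frame_labels)
print(X.shape, Y.shape)  # (69, 32, 50) (69, 5)
```

With a stride of 1, neighbouring windows share 31 of their 32 frames. One way to keep the training and testing sets independent, as Section III-B requires, would be to split the recordings by video before windowing rather than splitting the windows afterwards; whether the authors did so is not stated in the paper.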