A study on deep learning techniques for human action representation and recognition with skeleton data

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

PHAM DINH TAN

A STUDY ON DEEP LEARNING TECHNIQUES FOR HUMAN ACTION REPRESENTATION AND RECOGNITION WITH SKELETON DATA

Major: Computer Engineering
Code: 9480106

DOCTORAL DISSERTATION IN COMPUTER ENGINEERING

SUPERVISORS:
1. Assoc. Prof. Vu Hai
2. Assoc. Prof. Le Thi Lan

Hanoi - 2022

DECLARATION OF AUTHORSHIP

I, Pham Dinh Tan, declare that the dissertation titled "A study on deep learning techniques for human action representation and recognition with skeleton data" has been entirely composed by myself. I affirm the following points:

This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
Appropriate acknowledgment has been given within this dissertation wherever reference has been made to the published work of others.
The dissertation submitted is my own, except where work done in collaboration has been included. The collaborative contributions have been indicated.

Hanoi, May 08, 2022
Ph.D. Student
Pham Dinh Tan

SUPERVISORS
1. Assoc. Prof. Vu Hai
2. Assoc. Prof. Le Thi Lan

ACKNOWLEDGEMENT

This dissertation was composed during my Ph.D. at the Computer Vision Department, MICA Institute, Hanoi University of Science and Technology. I am grateful to all the people who contributed in different ways to my Ph.D. journey.

First, I would like to express my sincere thanks to my supervisors, Assoc. Prof. Vu Hai and Assoc. Prof. Le Thi Lan, for their guidance and support.

I would like to thank all MICA members for their help during my Ph.D. study. My sincere thanks go to Dr. Nguyen Viet Son, Assoc. Prof. Dao Trung Kien, and Assoc. Prof. Tran Thi Thanh Hai for giving me a lot of support and valuable advice. Many thanks to Dr. Nguyen Thuy Binh, Nguyen Hong Quan, Hoang Van Nam, Nguyen Tien Nam, Pham Quang Tien, and Nguyen Tien Thanh for their support.
I would like to thank my colleagues at the Hanoi University of Mining and Geology for their support during my Ph.D. study. Special thanks to my family for understanding my hours glued to the computer screen.

Hanoi, May 08, 2022
Ph.D. Student

ABSTRACT

Human action recognition (HAR) from color and depth (RGB-D) sensors, especially from derived information such as skeleton data, has received the research community's attention due to its wide range of applications. HAR has many practical applications, such as abnormal event detection from camera surveillance, gaming, human-machine interaction, elderly monitoring, and virtual/augmented reality. Skeleton data offer fast computation, low storage, and invariance to human appearance, but they also have shortcomings. These include pose estimation errors, skeleton noise in complex actions, and incompleteness due to occlusion. Moreover, action recognition remains challenging due to the diversity of human actions, intra-class variations, and inter-class similarities. The dissertation focuses on improving action recognition performance using skeleton data. The proposed methods are evaluated on public skeleton datasets collected by RGB-D sensors. Specifically, these are MSR-Action3D and MICA-Action3D (datasets with high-quality skeleton data), CMDFALL (a challenging dataset with noise in skeleton data), and NTU RGB+D (a worldwide benchmark among the large-scale datasets). These datasets therefore cover different dataset scales as well as different levels of skeleton data quality.

To overcome the limitations of skeleton data, the dissertation presents techniques following different approaches. First, as joints have different levels of engagement in each action, techniques for selecting the joints that play an important role in human actions are proposed, including both preset joint subset selection and automatic joint subset selection. Two frameworks are evaluated to show the performance of using a subset of joints for action representation. The first framework employs Dynamic Time Warping (DTW) and Fourier Temporal Pyramid (FTP), while the second one uses Covariance Descriptors extracted from joint position and velocity. Experimental results show that joint subset selection helps improve action recognition performance on datasets with noise in skeleton data.

However, HAR using handcrafted feature extraction cannot exploit the inherent graph structure of the human skeleton. Recent Graph Convolutional Networks (GCNs) are studied to address this issue. Among GCN models, the Attention-enhanced Adaptive Graph Convolutional Network (AAGCN) is used as the baseline model. AAGCN achieves state-of-the-art performance on large-scale datasets such as NTU RGB+D and Kinetics. However, AAGCN employs only joint information. Therefore, a Feature Fusion (FF) module is proposed in this dissertation. The new model is named FF-AAGCN. The performance of FF-AAGCN is evaluated on the large-scale dataset NTU RGB+D and on CMDFALL. The evaluation results show that the proposed method is robust to noise and invariant to skeleton translation. In particular, FF-AAGCN achieves remarkable results on challenging datasets. Finally, as the computing capacity of edge devices is limited, a lightweight deep learning model is desirable for application deployment. A lightweight GCN architecture is proposed to show that the complexity of the GCN architecture can still be reduced depending on the dataset's characteristics. The proposed lightweight model is suitable for application development on edge devices.

Hanoi, May 08, 2022
Ph.D. Student

CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
ABBREVIATIONS
SYMBOLS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1. LITERATURE REVIEW
  1.1. Introduction
  1.2. An overview of action recognition
  1.3. Data modalities for action recognition
    1.3.1. Color data
    1.3.2. Depth data
    1.3.3. Skeleton data
    1.3.4. Other modalities
    1.3.5. Multi-modality
  1.4. Skeleton data collection
    1.4.1. Data collection from motion capture systems
    1.4.2. Data collection from RGB+D sensors
    1.4.3. Data collection from pose estimation
  1.5. Benchmark datasets
    1.5.1. MSR-Action3D
    1.5.2. MICA-Action3D
    1.5.3. CMDFALL
    1.5.4. NTU RGB+D
  1.6. Skeleton-based action recognition methods
    1.6.1. Handcraft-based methods
      1.6.1.1. Joint-based action recognition
      1.6.1.2. Body part-based action recognition
    1.6.2. Deep learning-based methods
      1.6.2.1. Convolutional Neural Networks
      1.6.2.2. Recurrent Neural Networks
  1.7. Research on action recognition in Vietnam
  1.8. Conclusion of the chapter
CHAPTER 2. JOINT SUBSET SELECTION FOR SKELETON-BASED HUMAN ACTION RECOGNITION
  2.1. Introduction
  2.2. Proposed methods
    2.2.1. Preset Joint Subset Selection
      2.2.1.1. Spatial-Temporal Representation
      2.2.1.2. Dynamic Time Warping
      2.2.1.3. Fourier Temporal Pyramid
    2.2.2. Automatic Joint Subset Selection
      2.2.2.1. Joint weight assignment
      2.2.2.2. Most informative joint selection
      2.2.2.3. Human action recognition based on MIJ joints
  2.3. Experimental results
    2.3.1. Evaluation metrics
    2.3.2. Preset Joint Subset Selection
    2.3.3. Automatic Joint Subset Selection
  2.4. Conclusion of the chapter
CHAPTER 3. FEATURE FUSION FOR THE GRAPH CONVOLUTIONAL NETWORK
  3.1. Introduction
  3.2. Related work on Graph Convolutional Networks
  3.3. Proposed method
  3.4. Experimental results
  3.5. Discussion
  3.6. Conclusion of the chapter
CHAPTER 4. THE PROPOSED LIGHTWEIGHT GRAPH CONVOLUTIONAL NETWORK
  4.1. Introduction
  4.2. Related work on Lightweight Graph Convolutional Networks
  4.3. Proposed method
  4.4. Experimental results
  4.5. Application demonstration
  4.6. Conclusion of the chapter
CONCLUSION AND FUTURE WORKS
PUBLICATIONS
BIBLIOGRAPHY

ABBREVIATIONS

No   Abbreviation   Meaning
1    2D             Two-Dimensional
2    3D             Three-Dimensional
3    AAGCN          Attention-enhanced Adaptive Graph Convolutional Network
4    AGCN           Adaptive Graph Convolutional Network
5    AMIJ           Adaptive number of Most Informative Joints
6    AS             Action Set
7    AS-GCN         Actional-Structural Graph Convolutional Network
8    BN             Batch Normalization
9    BPL            Body Part Location
10   CAM            Channel Attention Module
11   CCTV           Closed-Circuit Television
12   CNN            Convolutional Neural Network
13   CovMIJ         Covariance Descriptor on Most Informative Joints
14   CPU            Central Processing Unit
15   CS             Cross-Subject
16   CV             Cross-View
17   DFT            Discrete Fourier Transform
18   DTW            Dynamic Time Warping
19   FC             Fully Connected
20   FF             Feature Fusion
21   FLOP           Floating-Point Operation
22   FMIJ           Fixed number of Most Informative Joints
23   fps            frames per second
24   FTP            Fourier Temporal Pyramid
25   GCN            Graph Convolutional Network
26   GCNN           Graph-based Convolutional Neural Network
27   GPU            Graphics Processing Unit
28   GRU            Gated Recurrent Unit
29   HAR            Human Action Recognition
30   HCI            Human-Computer Interaction