Rochester Institute of Technology
RIT Scholar Works — Theses
11-2019

Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users

Sushant Kafle
sxk5664@rit.edu

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation: Kafle, Sushant, "Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users" (2019). Thesis. Rochester Institute of Technology. Accessed from RIT Scholar Works.

This Dissertation is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.

by Sushant Kafle

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computing and Information Sciences.

B. Thomas Golisano College of Computing and Information Sciences
Rochester Institute of Technology
November 2019

Committee Approval:

We, the undersigned committee members, certify that we have advised and/or supervised the candidate on the work described in this dissertation. We further certify that we have reviewed the dissertation manuscript and approve it in partial fulfillment of the requirements of the degree of Doctor of Philosophy in Computing and Information Sciences.

Dr. Matt Huenerfauth, Dissertation Advisor
Dr. Cecilia Ovesdotter Alm, Dissertation Committee Member
Dr. Vicki Hanson, Dissertation Committee Member
Dr. Emily Prud'hommeaux, Dissertation Committee Member
Dr. Jai Kang, Dissertation Chair

Certified by: Dr. Pengcheng Shi, Director, Computing and Information Sciences

© 2019 Sushant Kafle. All rights reserved.

Submitted to the B. Thomas Golisano College of Computing and Information Sciences Ph.D. Program in Computing and Information Sciences, in partial fulfillment of the requirements for the Doctor of Philosophy Degree at the Rochester Institute of Technology.

Abstract

People who are deaf or hard-of-hearing (DHH) benefit from sign-language interpreting or live captioning (with a human transcriptionist) to access spoken information. However, such services are often not legally required, affordable, or available in many settings, e.g., impromptu small-group meetings in the workplace or online video content that has not been professionally captioned. As Automatic Speech Recognition (ASR) systems improve in accuracy and speed, it is natural to investigate the use of these systems to assist DHH users in a variety of tasks. But ASR systems are still not perfect, especially in realistic conversational settings, leading to issues of trust and acceptance of these systems in the DHH community. To overcome these challenges, our work focuses on: (1) building metrics for accurately evaluating the quality of automatic captioning systems, and (2) designing interventions for improving the usability of captions for DHH users.

The first part of this dissertation describes our research on methods for identifying words that are important for understanding the meaning of a conversational turn within transcripts of spoken dialogue. Such knowledge about the relative importance of words in spoken messages can be used in evaluating ASR systems or in creating new applications for DHH users of captioned video (both described in later parts of this dissertation). We found that models which consider both the acoustic properties of spoken words and text-based features (e.g., pre-trained word embeddings) are more effective at predicting the semantic importance of a word than models that utilize only one of these types of features.

The second part of this dissertation describes studies to understand DHH users' perception of the quality of ASR-generated captions; the goal of this work was to validate the design of automatic metrics for evaluating captions in real-time applications for these users. Such a metric could facilitate comparison of various ASR systems and help determine the suitability of specific ASR systems for supporting communication for DHH users. We designed experimental studies to elicit feedback on the quality of captions from DHH users, and we developed and evaluated automatic metrics for predicting the usability of automatically generated captions for these users. We found that metrics that consider the importance of each word in a text are more effective at predicting the usability of imperfect text captions than the traditional Word Error Rate (WER) metric.

The final part of this dissertation describes research on importance-based highlighting of words in captions as a way to enhance the usability of captions for DHH users. Similar to highlighting in static texts (e.g., textbooks or electronic documents), highlighting in captions involves changing the appearance of some text in a caption to enable readers to attend to the most important bits of information quickly. Despite the known benefits of highlighting in static texts, the usefulness of highlighting in captions for DHH users remains largely unexplored. For this reason, we conducted experimental studies with DHH participants to understand the benefits of importance-based highlighting in captions and their preferences among different design configurations for highlighting. We found that DHH users subjectively preferred highlighting in captions, and they reported higher readability and understandability scores and lower task-load scores when viewing videos with captions containing highlighting compared to videos without highlighting. Further, in partial contrast to recommendations from prior research on highlighting in static texts (which had not been based on experimental studies with DHH users), we found that DHH participants preferred boldface, word-level, non-repeating highlighting in captions.

Acknowledgments

I would like to express my sincere gratitude to my advisor, Dr. Matt Huenerfauth, for his continuous support during my Ph.D. studies and research. It was a great pleasure to work under his advisement and to learn from him. His guidance has helped me not only in the time of research and writing of this thesis, but also at times when I doubted myself and my intuitions. I could not have imagined having a better advisor and mentor.

I would also like to thank my wonderful thesis committee members, Drs. Cecilia Alm, Vicki Hanson, and Emily Prud'hommeaux, for their encouragement and insightful comments. I will forever be indebted to their stimulating discussions and hard questions, which have helped shape this research tremendously.

I thank my colleagues and research assistants in the Center for Accessibility and Inclusion Research (CAIR) lab for their collaboration and support. Special thanks to Larwan Berke, a fellow researcher at the CAIR lab, for invaluable discussions and collaborations without which this research would not have been possible.

Last but not least, I would like to thank my family, especially my parents Dev Raj Kafle and Kabita Kafle, for their endless love and support, and my girlfriend Swapnil Sneham for believing in me, always.

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Motivating Challenges
  1.2 Research Questions Investigated in this Dissertation
  1.3 Overview of The Chapters

2 Background on Automatic Speech Recognition Technology
  2.1 Conventional Speech Recognition Architecture
    2.1.1 Acoustic Models
    2.1.2 Language Models
    2.1.3 Decoding
  2.2 Recent Advancements: End-to-End ASR
  2.3 Other Terminology
    2.3.1 Confidence Scores
    2.3.2 Word Error Rate

Part I: Word Importance Modeling

Prologue to Part I

3 Prior Methods of Word Importance Estimation
  3.1 Word Importance Estimation as a Keyword Extraction Problem
    3.1.1 Frequency-based Keyword Extraction
    3.1.2 Supervised Methods of Keyword Extraction
    3.1.3 Limitations and Challenges
  3.2 Reading Strategies of Deaf Individuals
  3.3 Acoustic-Prosodic Cues for Semantic Knowledge

4 Unsupervised Models of Word Importance
  4.1 Defining the Word Predictability Measure
  4.2 Methods for Computing Word Predictability
    4.2.1 N-gram Language Model
    4.2.2 Neural Language Model
  4.3 Evaluation and Conclusion

5 Building the Word Importance Annotation Corpus
  5.1 Defining Word Importance
  5.2 Word Importance Annotation Task
    5.2.1 Annotation Scheme
  5.3 Inter-Annotator Agreement Analysis
  5.4 Summary of the Corpus
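The Word Error Rate metric listed above (Section 2.3.2) is the measure that the abstract contrasts with importance-weighted alternatives. As a minimal sketch of that contrast — not the dissertation's implementation, and using a purely hypothetical weighting scheme for illustration — standard WER and an importance-weighted variant might look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Standard Word Error Rate: (substitutions + deletions + insertions)
    divided by the number of reference words, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[-1][-1] / len(ref)


def weighted_wer(ref_words, hyp_words, weights, insert_cost=1.0):
    """Hypothetical importance-weighted variant (illustration only): an error
    on a high-importance reference word costs more than one on a filler word."""
    n, m = len(ref_words), len(hyp_words)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + weights[i - 1]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref_words[i - 1] == hyp_words[j - 1] else weights[i - 1]
            dp[i][j] = min(dp[i - 1][j] + weights[i - 1],  # weighted deletion
                           dp[i][j - 1] + insert_cost,     # insertion
                           dp[i - 1][j - 1] + sub)         # weighted substitution
    return dp[n][m] / sum(weights)


print(wer("the cat sat on the mat", "the cat sat on mat"))   # one deletion out of six words
print(weighted_wer(["the", "cat"], ["the"], [0.1, 0.9]))     # dropping the important word dominates
```

Under plain WER, dropping "the" and dropping "cat" are equally bad; under the weighted variant, the cost of an error scales with the importance of the word affected, which is the intuition behind the importance-based caption metrics studied in this dissertation.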
Appendices

Appendix A: Publications

Peer-Reviewed Journal Articles

1. Sushant Kafle and Matt Huenerfauth. 2019. Predicting the Understandability of Imperfect English Captions for People Who Are Deaf or Hard of Hearing. ACM Transactions on Accessible Computing 12, 2 (June 2019), 32 pages. DOI: https://doi.org/10.1145/3325862

Peer-Reviewed Conference Articles

1. Sushant Kafle, Peter Yeung, and Matt Huenerfauth. 2019. Evaluating the Benefit of Highlighting Key Words in Captions for People who are Deaf or Hard of Hearing. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2019). ACM. http://cair.rit.edu/share/kafle-et-al-2019-assets.pdf

2. Sushant Kafle, Cecilia Ovesdotter Alm, and Matt Huenerfauth. 2019. Fusion Strategy for Prosodic and Lexical Representations of Word Importance. In Proceedings of Interspeech 2019, 1313–1317. DOI: 10.21437/Interspeech.2019-1898

3. Sushant Kafle, Cissi Ovesdotter Alm, and Matt Huenerfauth. 2019. Modeling Acoustic-Prosodic Cues for Word Importance Prediction in Spoken Dialogues. In Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT 2019). Association for Computational Linguistics, Minneapolis, 9–16. DOI: http://dx.doi.org/10.18653/v1/W19-1702

4. Matthew Seita, Khaled Albusays, Sushant Kafle, Michael Stinson, and Matt Huenerfauth. 2018. Behavioral Changes in Speakers Who Are Automatically Captioned in Meetings with Deaf or Hard-of-Hearing Peers. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '18). ACM, New York, NY, USA, 68–80. DOI: http://dx.doi.org/10.1145/3234695.3236355

5. Sedeeq Al-khazraji, Larwan Berke, Sushant Kafle, Peter Yeung, and Matt Huenerfauth. 2018. Modeling the Speed and Timing of American Sign Language to Generate Realistic Animations. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '18). ACM, New York, NY, USA, 259–270. DOI: https://doi.org/10.1145/3234695.3236356

6. Sushant Kafle and Matt Huenerfauth. 2018. A Corpus for Modeling Word Importance in Spoken Dialogue Transcripts. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA). https://www.aclweb.org/anthology/L18-1016

7. Sedeeq Al-khazraji, Sushant Kafle, and Matt Huenerfauth. 2018. Modeling and Predicting the Location of Pauses for the Generation of Animations of American Sign Language. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA). http://lrec-conf.org/workshops/lrec2018/W1/pdf/18013_W1.pdf

8. Larwan Berke, Sushant Kafle, and Matt Huenerfauth. 2018. Methods for Evaluation of Imperfect Captioning Tools by Deaf or Hard-of-Hearing Users at Different Reading Literacy Levels. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM, New York, NY, USA, Paper 91, 12 pages. DOI: https://doi.org/10.1145/3173574.3173665

9. Sushant Kafle and Matt Huenerfauth. 2017. Evaluating the Usability of Automatically Generated Captions for People who are Deaf or Hard of Hearing. In Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '17). ACM, New York, NY, USA, 165–174. DOI: https://doi.org/10.1145/3132525.3132542

10. Sushant Kafle and Matt Huenerfauth. 2016. Effect of Speech Recognition Errors on Text Understandability for People who are Deaf or Hard of Hearing. In Proceedings of the SLPAT 2016 Workshop on Speech and Language Processing for Assistive Technologies, 20–25. DOI: 10.21437/SLPAT.2016-4

Other Technical Papers and Pending Submissions

1. Sushant Kafle, Becca Dingman, and Matt Huenerfauth. 2019. Deaf and Hard-of-Hearing Users Evaluating Designs for Highlighting Key Words in Educational Lecture Videos. In ACM CHI Conference on Human Factors in Computing Systems (CHI '20). Under review.

2. Sushant Kafle, Abraham Glasser, Sedeeq Al-khazraji, Larwan Berke, Matthew Seita, and Matt Huenerfauth. 2019. Artificial Intelligence Fairness in the Context of Accessibility Research on Intelligent Systems for People who are Deaf or Hard of Hearing. In ASSETS 2019 Workshop on AI Fairness for People with Disabilities.

3. Sushant Kafle and Matt Huenerfauth. 2018. Usability Evaluation of Captions for People who are Deaf or Hard of Hearing. ACM SIGACCESS Newsletter, October 2018 issue.

Appendix B: IRB Approval Forms

All of the studies presented in this thesis have been approved by the Institutional Review Board (IRB). Below, we provide the IRB decision forms for two projects:

• Creating the Next Generation of Live-Captioning Technologies: This IRB covers the ASR evaluation studies presented in Part of this work.

• Identifying the Best Methods for Displaying Word-Confidence in Automatically Generated Captions for Deaf and Hard-of-Hearing Users: This IRB covers the caption highlighting studies presented in Part of this work.

Figure B.1: IRB Decision Form for "Creating the Next Generation of Live-Captioning Technologies"

Figure B.2: IRB Decision Form for "Identifying the Best Methods for Displaying Word-Confidence in Automatically Generated Captions for Deaf and Hard-of-Hearing Users"
