
Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users


DOCUMENT INFORMATION

Basic information

Title: Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users
Author: Sushant Kafle
Committee: Dr. Matt Huenerfauth, Dissertation Advisor; Dr. Cecilia Ovesdotter Alm, Dissertation Committee Member; Dr. Vicki Hanson, Dissertation Committee Member; Dr. Emily Prud’hommeaux, Dissertation Committee Member; Dr. Jai Kang, Dissertation Chair
Institution: Rochester Institute of Technology
Program: Computing and Information Sciences
Document type: Thesis
Year of publication: 2019
City: Rochester
Format
Number of pages: 314
File size: 12.27 MB

Structure

  • Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users

    • Recommended Citation

  • List of Figures

  • List of Tables

  • 1 Introduction

    • 1.1 Motivating Challenges

    • 1.2 Research Questions Investigated in this Dissertation

    • 1.3 Overview of The Chapters

  • 2 Background on Automatic Speech Recognition Technology

    • 2.1 Conventional Speech Recognition Architecture

      • 2.1.1 Acoustic Models

      • 2.1.2 Language Models

      • 2.1.3 Decoding

    • 2.2 Recent Advancements: End-to-End ASR

    • 2.3 Other Terminology

      • 2.3.1 Confidence Scores

      • 2.3.2 Word Error Rate

  • Part I: Word Importance Modeling

  • Prologue to Part I

  • 3 Prior Methods of Word Importance Estimation

    • 3.1 Word Importance Estimation as a Keyword Extraction Problem

      • 3.1.1 Frequency-based Keyword Extraction

      • 3.1.2 Supervised Methods of Keyword Extraction

      • 3.1.3 Limitations and Challenges

    • 3.2 Reading Strategies of Deaf Individuals

    • 3.3 Acoustic-Prosodic Cues for Semantic Knowledge

  • 4 Unsupervised Models of Word Importance

    • 4.1 Defining the Word Predictability Measure

    • 4.2 Methods for Computing Word Predictability

      • 4.2.1 N-gram Language Model

      • 4.2.2 Neural Language Model

    • 4.3 Evaluation and Conclusion

  • 5 Building the Word Importance Annotation Corpus

    • 5.1 Defining Word Importance

    • 5.2 Word Importance Annotation Task

      • 5.2.1 Annotation Scheme

    • 5.3 Inter-Annotator Agreement Analysis

    • 5.4 Summary of the Corpus

  • 6 Supervised Models of Word Importance

    • 6.1 Text-based Model of Word Importance

      • 6.1.1 Model Architecture

      • 6.1.2 Experimental Setup

      • 6.1.3 Experiment 1: Performance of the Models

      • 6.1.4 Experiment 2: Comparison with Human Annotators

      • 6.1.5 Limitations of this Research

    • 6.2 Speech-based Importance Model

      • 6.2.1 Model Architecture

      • 6.2.2 Acoustic-Prosodic Feature Representation

      • 6.2.3 Experimental Setup

      • 6.2.4 Experiment 1: Comparison of the Projection Layers

      • 6.2.5 Experiment 2: Ablation Study on Speech Features

      • 6.2.6 Experiment 3: Comparison with the Text-based Models

      • 6.2.7 Limitations of this Research

    • 6.3 Text- and Speech-based Importance Model

      • 6.3.1 Prior Work on Joint Modeling of Speech and Text

      • 6.3.2 Lexical-Prosodic Feature Representation

      • 6.3.3 Experimental Setup

      • 6.3.4 Experiment 1: Error Analysis of Unimodal Models

      • 6.3.5 Experiment 2: Comparison of Fusion Strategies

    • 6.4 Conclusions

  • Epilogue for Part I

  • Part II: Automatic Caption Quality Evaluation

  • Prologue to Part II

  • 7 Prior Approaches to ASR Evaluation

    • 7.1 Limitations of the Word Error Rate Metric

    • 7.2 Other Methods of ASR Evaluation

    • 7.3 Metric of ASR Quality for DHH users

  • 8 Collection of Understandability Scores from DHH users for Text with Errors

    • 8.1 Understanding the Effect of Recognition Errors

    • 8.2 User Study (QUESTION-ANSWER STUDY)

      • 8.2.1 ASR Error Category

      • 8.2.2 Study Resources

      • 8.2.3 Recruitment and Participants

      • 8.2.4 Study Procedure

    • 8.3 Summary of the data

  • 9 Metric for ASR Evaluation for Captioning Applications

    • 9.1 Automatic-Caption Evaluation Framework

      • 9.1.1 Word Importance Sub-score

      • 9.1.2 Semantic Distance Sub-score

      • 9.1.3 The Weighting Variable

    • 9.2 Research Methodology and Hypotheses

      • 9.2.1 Four Phases of this Research

    • 9.3 Phase 1: Designing and Evaluating the ACE Metric

      • 9.3.1 Computing the Word Importance Sub-score

      • 9.3.2 Computing the Semantic Distance Sub-score

      • 9.3.3 From Individual-Error Impact Scores to an Overall Sentence Error Score

      • 9.3.4 Designing Stimuli for Metric Evaluation (PREFERENCE-2017 Study)

      • 9.3.5 Experimental Study Setup and Procedure

      • 9.3.6 Results and Discussion

      • 9.3.7 Summary and Discussion of Limitations of ACE

    • 9.4 Phase 2: Improving the ACE Metric to Create ACE2

      • 9.4.1 Improving the Word Importance Sub-score

      • 9.4.2 Alternatives for Combining Individual Error Scores into a Sentence Score

    • 9.5 Phase 3: Comparison with Prior Metrics

      • 9.5.1 Human Perceived Accuracy (HPA)

      • 9.5.2 Information Retrieval Based Evaluation Metrics

      • 9.5.3 Word Information Lost (WIL)

      • 9.5.4 Weighted Word Error Rate (WWER)

      • 9.5.5 Weighted Keyword Error Rate (WKER) and Keyword Error Rate (KER)

    • 9.6 Phase 4: User-Based Evaluation of ACE and ACE2 (PREFERENCE-2018 Study)

      • 9.6.1 Designing Stimuli

      • 9.6.2 User Study Setup

      • 9.6.3 Results and Discussion

    • 9.7 Conclusions

  • Epilogue for Part II

  • Part III: Enhancements to Improve Caption Usability

  • Prologue to Part III

  • 10 Prior Work on Caption Accessibility

    • 10.1 Caption Accessibility Challenges

    • 10.2 Improving Caption Accessibility

    • 10.3 Importance-based Highlighting in Text

      • 10.3.1 Style Guidelines for Highlighting

      • 10.3.2 Visual Markup of Text in Captions

  • 11 Evaluating the Benefits of Highlighting in Captions

    • 11.1 Background and Introduction

      • 11.1.1 Research Questions Investigated in this Chapter

    • 11.2 Formative Studies: Method and Results

      • 11.2.1 Highlighting Configurations for Formative Studies

      • 11.2.2 Stimuli Preparation for Formative Studies

      • 11.2.3 Recruitment and Participants for Formative Studies

      • 11.2.4 Questionnaires for Smaller Studies

      • 11.2.5 Round-1 Results: Comparing Markup-Styles

      • 11.2.6 Round-2 Results: Comparing Highlight Percentage

      • 11.2.7 Round-1 and Round-2 Results: Interest in Highlighting

      • 11.2.8 Discussion of Results from Round-1 and Round-2

    • 11.3 Larger Study: Method and Results

      • 11.3.1 Preparation of the Stimuli Video

      • 11.3.2 Study Setup and Questionnaires

      • 11.3.3 Recruitment and Participants

      • 11.3.4 Results

    • 11.4 Discussion and Conclusion

    • 11.5 Limitations of this Research and the Need for an Additional Study

  • 12 Evaluating the Designs for Highlighting Captions

    • 12.1 Background and Introduction

      • 12.1.1 Harmful Effects of Inappropriate Highlighting

      • 12.1.2 Research Questions Investigated in this Chapter

    • 12.2 Methodology

      • 12.2.1 Four Phases in the Study

      • 12.2.2 Details of Video Stimuli Creation for Each Condition

      • 12.2.3 Questions Asked in the Study

      • 12.2.4 Recruitment and Participants

    • 12.3 Results

      • 12.3.1 Text Decoration Style for Highlighting

      • 12.3.2 Granularity for Highlighting

      • 12.3.3 Handling Key Term Repetition

      • 12.3.4 Interest in Highlighting Applications

    • 12.4 Discussion of the Results

    • 12.5 Conclusions

  • Epilogue for Part III

  • 13 Limitations and Future Work

    • 13.1 Word Importance Modeling

      • 13.1.1 Modeling Importance at Larger Semantic Units

      • 13.1.2 Unsupervised (and Semi-supervised) Models of Word Importance

    • 13.2 Automatic Caption Quality Evaluation

    • 13.3 Highlighting in Captions to Improve Caption Usability

    • 13.4 Using Word-Importance Models during the Training or Decoding of ASR Systems

      • 13.4.1 N-best list Re-scoring Technique

      • 13.4.2 Improved Optimization Strategy (End-to-End Models)

  • 14 Summary and Contributions

    • 14.1 Summary of the Contribution of This Research

    • 14.2 Final Comments

  • Bibliography

  • Appendices

  • A Publications

  • B IRB Approval Forms

Content

Motivating Challenges

The rise of cloud-enabled services has made Automatic Speech Recognition (ASR) systems affordable, scalable, and widely accessible, making them ideal for real-time captioning for Deaf and Hard of Hearing (DHH) users. These systems can now be seamlessly integrated into mobile phones and tablets, allowing for on-demand transcription of spoken messages into digital text.

In our exploratory study, we observed a deaf student collaborating effectively with two hearing peers, utilizing automatic speech recognition technology integrated into their mobile devices. This innovative service demonstrated significant potential in enhancing communication and collaboration among students with differing hearing abilities.

Fig. 1.1 shows how an ASR system installed on mobile devices could be used to enable participation of DHH users in mainstream meetings with their hearing peers.

Despite the recent leaps in the accuracy of ASR systems, their performance still falls short of the human-generated captioning that DHH users currently rely on.

Chapter 1 introduces the challenges faced by automatic speech recognition (ASR) systems in providing accurate captioning for Deaf and Hard of Hearing (DHH) users. Currently, human-generated captions outperform these systems, highlighting the need for improved design and evaluation to gain the trust of DHH users for real-time applications. Despite the significant potential of ASR-based captioning, research in this area remains largely underexplored.

This dissertation tackles challenges in evaluating and enhancing the usability of Automatic Speech Recognition (ASR) technology to facilitate communication between Deaf and Hard of Hearing (DHH) users and their hearing peers. It begins by investigating methods to determine the significance of specific words in spoken messages, establishing a word-importance model that serves as a foundation for subsequent research phases. By identifying the semantic importance of words, the study aims to accurately assess the understandability of automatically generated captions, thereby addressing usability issues in captioning applications.

Specifically, we investigate two main challenges discussed below (also illustrated by two rectangles in Fig. 1.2):

Evaluating the quality of automatic captions presents significant challenges, as traditional metrics often rely on a simplistic error-counting approach. These methods fail to account for the importance of specific words, leading to a lack of correlation with human performance on related tasks. Prior research, although not focused on Deaf and Hard of Hearing (DHH) users or captioning, has highlighted the inadequacy of these evaluation metrics.

This thesis investigates the correlation between simplistic metrics of Automatic Speech Recognition (ASR) output and the perceptions of Deaf and Hard of Hearing (DHH) users regarding caption quality. It aims to identify whether existing metrics adequately reflect DHH users' judgments and explores the need for improved ASR performance metrics that align more closely with their experiences. This analysis is detailed in Part II of the thesis.

User-experience challenges arise when automatic speech recognition (ASR) systems produce text with errors, making it harder to comprehend than human-generated transcripts. Research indicates that while both types of transcripts may contain mistakes, errors from human transcriptionists tend to be less confusing than those from ASR. To improve the user experience for Deaf and Hard of Hearing (DHH) individuals using ASR as a captioning tool, it is essential to explore ways to enhance the usability of the caption text output, even when errors are present.

Chapter 1 also introduces the concept of highlighting as a technique to emphasize key segments of text, which has been shown to enhance reading experiences and facilitate quicker information recall in educational settings. Despite its effectiveness in traditional texts, the application of importance-based highlighting in video captions remains largely unexamined. Video captions present unique challenges due to their dynamic nature, with text appearing for only 2 to 4 seconds and often limited to one or two lines. Additionally, factors such as caption speed, font size, and visual decorations significantly impact readability, particularly for Deaf and Hard of Hearing (DHH) users, a topic further explored in Part III of this thesis.

In the coming sections, we discuss how we use the information about the importance of words in a text to design solutions for tackling these challenges.

Research Questions Investigated in this Dissertation

This research investigates the challenges faced by Automatic Speech Recognition (ASR)-based captioning technologies in creating more effective captions for Deaf or Hard of Hearing (DHH) users. We aim to provide methodological solutions to these challenges, validated through user studies. Our work specifically addresses a series of key research questions to enhance the usability of captions for DHH individuals.

This research explores how to identify key words in spoken messages that enhance understandability for Deaf and Hard of Hearing (DHH) readers. By predicting the significance of these words, we aim to improve the usability of Automatic Speech Recognition (ASR) captioning technologies. Our preliminary studies indicate that addressing this research question can aid in evaluating ASR system quality and enhancing caption usability through importance-based highlighting. However, existing methods for pinpointing important words in conversational texts face challenges, which motivates a detailed investigation in Part I of this study.

The effectiveness of our models in estimating the quality of Automatic Speech Recognition (ASR) systems for generating captions for Deaf and Hard of Hearing (DHH) users is under scrutiny. Current evaluation methods, particularly the Word Error Rate metric, have proven inadequate in accurately predicting human task performance across various applications. This highlights the necessity of a more reliable measurement system to assess ASR output quality, ensuring it meets the standards required for automatic caption generation for DHH users.

In this research, we aim to examine additional metrics that provide deeper insights into how different errors affect text comprehensibility, particularly in the context of assessing the quality of automatic captions for Deaf and Hard of Hearing (DHH) users. This analysis is elaborated upon in Part II of our study.

This research also explores the receptiveness of Deaf and Hard of Hearing (DHH) users to importance-based word highlighting in captions, focusing on their preferences for such highlighting. Because of the challenge of splitting visual attention between captions and video content, emphasizing key words can enhance comprehension. Chapters 11 and 12 examine the advantages of highlighting in captions for DHH viewers during online educational lectures, as well as their design preferences, through experimental studies. These findings are elaborated upon in Part III of the research.

Overview of The Chapters

Chapter 2 offers a concise overview of Automatic Speech Recognition (ASR) technology, including its architecture and key concepts, to equip readers with the foundational knowledge necessary for the subsequent discussions in this work.

In Part I, we explore the estimation of word importance in texts, with a review of prior research in Chapter 3 and a focus on spoken dialogues. Chapter 4 introduces our initial method for assessing word importance based on predictability, drawing inspiration from eye-tracking studies of DHH readers. Chapters 5 and 6 then examine supervised models of word importance using human-labelled data. In Part II, we investigate the evaluation practices for Automatic Speech Recognition (ASR) systems across various applications, as discussed in the subsequent chapters.

Chapter 8 describes our approaches to analyzing the impact of recognition errors on text comprehension for Deaf and Hard of Hearing (DHH) readers. Subsequently, Chapter 9 utilizes these findings to create and assess various automated metrics aimed at evaluating Automatic Speech Recognition (ASR) performance in real-time captioning applications for DHH users.

In Part III, we explore strategies to enhance the usability of captioning systems by improving the user experience. Our focus is on importance-based highlighting in captions, aimed at increasing readability and reducing reading times. Chapter 11 evaluates the advantages of highlighting key words in captions for Deaf and Hard of Hearing (DHH) users, particularly in educational lecture videos. Following this, Chapter 12 investigates DHH users' preferences regarding various design options for caption highlighting.

An Automatic Speech Recognition (ASR) system is designed to convert spoken language into written text. This chapter offers a brief overview of how ASR systems function, along with essential terminology that serves as useful background for this document.

Conventional Speech Recognition Architecture

Acoustic Models

Acoustic models play a crucial role in speech recognition systems by capturing the statistical properties of speech. They estimate the probability of generating observed speech waveforms based on linguistic units. Traditionally, Hidden Markov Models (HMM) serve as finite state machines to infer temporal structures probabilistically. HMMs are commonly paired with Gaussian Mixture Models (GMM) to calculate observation probabilities from the input feature vectors of speech. More recently, several Deep Neural Network (DNN) based acoustic models have been proposed, ranging from hybrid HMMs that use a deep neural network to approximate the likelihood probability P(X|W) [66, 108], to fully DNN-based (particularly Recurrent Neural Network) acoustic models that directly model sequential acoustic signals to generate posterior probabilities of the acoustic states [153, 175].

Language Models

The task of language models in speech recognition is to compute the probabilistic parameter P(W) in Eq. 2.1, which refers to the probability that a given string of words (W = w_1, w_2, ..., w_n) belongs to a language.

N-gram models are a prevalent method for representing language, relying on probability estimates of word sequences derived from extensive text corpora. To simplify these estimates, the probability of a word is approximated based on the preceding one (bigram), two (trigram), or three (four-gram) words, leading to the term "n-gram models." Although n-gram models have historically been the standard in language modeling, there has been a recent shift toward Recurrent Neural Network (RNN) based models, which have gained popularity in the field.

Decoding

The decoding process in speech recognition is the final step, where the input speech features are matched to a sequence of words using the acoustic and language models. Acoustic models are often defined over phones, so a pronunciation lexicon is required to map these phones to words. Typically, a search algorithm like Viterbi decoding is employed to find the optimal word sequence Ŵ = w_1, w_2, ..., w_n that maximizes the posterior probability P(W|X) for the input speech waveform X = x_1, x_2, ..., x_t.
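The decoder searches an enormous hypothesis space, but the decision rule it implements is simple. As a toy illustration, here is the MAP rule applied to a small n-best list in Python; the hypotheses and their acoustic/language-model probabilities are invented for the example, not outputs of a real recognizer:

```python
import math

def map_decode(nbest):
    """Pick the hypothesis maximizing log P(X|W) + log P(W)."""
    # Each entry: (word sequence, acoustic log-prob, language-model log-prob).
    return max(nbest, key=lambda hyp: hyp[1] + hyp[2])[0]

nbest = [
    (["the", "dog", "was", "barking"], math.log(0.20), math.log(0.010)),
    (["the", "dock", "was", "barking"], math.log(0.25), math.log(0.001)),
]
print(map_decode(nbest))  # the language model overrides the acoustic near-tie
```

In a real system, the same rule is applied implicitly inside the Viterbi search over the full lattice rather than over an explicit n-best list.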

Recent Advancements: End-to-End ASR

End-to-end speech recognition has gained significant attention as it integrates the training of acoustic and language models into a single system. Unlike traditional generative models, this discriminative approach employs a unified sequence-to-sequence model that directly converts raw audio into words or graphemes. This innovation streamlines the speech recognition process, making it more efficient and effective.

The development of end-to-end Automatic Speech Recognition (ASR) systems began with Connectionist Temporal Classification (CTC), introduced by Graves et al. CTC allows for the training of acoustic models without the need for frame-level alignments between acoustic signals and their corresponding transcripts. By incorporating a special blank symbol, CTC encodes sequential input and maximizes the total probability of the label sequence by considering all possible alignments with the encoded input. An early implementation of end-to-end ASR utilizing CTC with phoneme output targets achieved state-of-the-art performance.

Chapter 2 further discusses advancements in end-to-end ASR, highlighting the work of Graves and Jaitly, who introduced a character-based Connectionist Temporal Classification (CTC) system that outputs word sequences from speech input. Their approach utilized an external language model to improve the accuracy of the CTC outputs. Subsequent research has led to various enhancements in CTC-based ASR systems, yet these systems face intrinsic challenges. Notably, CTC operates under the assumption of conditional independence among network outputs at different frames, and it typically necessitates an external language model, as direct greedy decoding tends to yield suboptimal results.

Recently, attention-based encoder-decoder models have gained popularity; they were first applied to automatic speech recognition (ASR) by Chan et al. and Chorowski et al. These models consist of three key components:

• Encoder Layer: The function of the encoder is to transform the input speech into a higher-level representation. This can be thought of as the acoustic model in conventional ASR.

• Attention Layer: The attention layer identifies the encoded frames that are pertinent to generating the current output. Essentially, it functions as an alignment model that establishes connections between the input and output, facilitating accurate predictions (a small sketch of this mechanism follows this list).

• Decoder Layer: The decoder layer operates by predicting each output token as a function of the previous predictions and the contextualized representation from the attention layer.
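To make the attention mechanism concrete, below is a minimal numpy sketch of a single attention step. It assumes simple dot-product scoring; the models of Chan et al. and Chorowski et al. use learned scoring functions, so this illustrates the mechanism rather than their exact formulation:

```python
import numpy as np

def attend(encoder_states, decoder_state):
    """One step of dot-product attention over encoded speech frames."""
    scores = encoder_states @ decoder_state   # one relevance score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax alignment weights
    return weights @ encoder_states           # context vector: weighted sum of frames

rng = np.random.default_rng(0)
enc = rng.standard_normal((50, 8))  # 50 encoded frames of dimension 8
dec = rng.standard_normal(8)        # current decoder state
print(attend(enc, dec).shape)       # (8,) context vector fed to the decoder
```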

Attention-based encoder-decoder models have demonstrated exceptional performance in large-scale automatic speech recognition (ASR) tasks; however, they are not well-suited for streaming applications. To address this limitation, researchers have proposed various solutions, including the integration of different end-to-end methods, such as Connectionist Temporal Classification (CTC) and attention-based models, as well as structural enhancements like multi-headed attention.

Other Terminology

Confidence Scores

In speech recognition, confidence scores, ranging from 0 to 1, assess the reliability of recognized words by indicating the likelihood of their correct identification by the ASR system. These scores are derived from features collected during the decoding process, incorporating both acoustic and language data. A classifier is typically trained on these features to produce a single score reflecting the accuracy of recognition decisions. Alternatively, confidence scores can be estimated from the posterior probability P(W|X) used in the maximum a posteriori (MAP) decision rule, providing an absolute measure of the model's trust in its recognition outcomes; this posterior-based confidence estimation is also very popular. However, as shown in Eq. 2.1, a challenge is that the posterior probability estimated for the MAP decision does not include the normalization term P(X) in the denominator, which needs to be approximated.

Word Error Rate

Accurate and continuous speech recognition remains a challenging problem, despite recent advancements in automatic speech recognition (ASR) systems. Currently, ASR performance does not match human capabilities, which are relied upon for providing captions to Deaf and Hard of Hearing (DHH) users. Factors such as background noise, speech ambiguity, and unique speaker traits, like strong accents, contribute to errors in ASR. Researchers typically assess the effectiveness of their systems using the Word Error Rate (WER) metric as they strive to enhance ASR accuracy.

Word Error Rate (WER) is determined by comparing the output of Automatic Speech Recognition (ASR) systems to a human-generated reference transcript through a Levenshtein distance calculation. Due to its widespread use, minimizing WER is a common objective in various ASR research initiatives, whether explicitly stated or not.

The Word Error Rate is computed as WER = (S + D + I) / N, where S denotes the number of incorrect word substitutions, D the number of omitted words, I the number of extraneous words added in the Automatic Speech Recognition (ASR) output, and N the total number of words actually spoken.

Word Error Rate (WER), as outlined in Equation 2.2, measures the accuracy of Automatic Speech Recognition (ASR) systems by comparing the system's output, known as the "hypothesis text," to the actual spoken words in the "reference text." This metric quantifies the number of misrecognitions in the hypothesis, normalized by the word count of the reference. However, WER does not account for the varying significance of different words in conveying meaning or their predictability within the context. Research indicates that humans perceive ASR errors as having differing impacts on the overall message, suggesting that some mistakes alter the meaning more significantly than others. Additionally, the consequences of these errors can vary based on the specific application of the ASR technology.
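For reference, here is a minimal Python sketch of the WER computation via Levenshtein distance over word tokens; the example sentences are illustrative:

```python
import numpy as np

def wer(reference, hypothesis):
    """Word Error Rate: edit distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i, j] = edit distance between first i reference and first j hypothesis words
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)  # all deletions
    d[0, :] = np.arange(len(hyp) + 1)  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j - 1] + sub,  # substitution (or match)
                          d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1)        # insertion
    return d[len(ref), len(hyp)] / len(ref)

print(wer("the meeting is scheduled on monday",
          "the meeting scheduled on a monday"))  # 1 deletion + 1 insertion = 2/6
```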

Speech-based models often treat words as the primary units of meaning and prosody, yet their significance varies within utterances. Some words are essential for comprehension, while others hold less importance. This variance in word significance has benefited various tasks, including automatic speech recognition (ASR) evaluation, text classification, and summarization. Research focused on identifying crucial words in spoken dialogues aims to address usability challenges of ASR systems, particularly for captioning applications benefiting Deaf and Hard of Hearing (DHH) users. For example, the second part of this study explores metrics that assess the quality of automatic captioning systems based on word importance.

In Part I of this thesis, we aim to predict the significance of words in spoken dialogues for different readers. We start by examining previous research on word importance for various applications in Chapter 3. Drawing inspiration from studies on the reading mechanisms of Deaf and Hard of Hearing (DHH) readers, Chapter 4 introduces our unsupervised method of estimating the importance of a word as a measure of its predictability in context. Next, in Chapter 5, we describe our method for building the corpus of word importance. Subsequently, Chapter 6 describes our methods for more accurate word importance modeling through supervised training of statistical models, based on the human-labelled data on the importance of words collected in Chapter 5.

Specifically, Part I of this thesis will explore research question RQ1 (as presented in Section 1.2), which states:

To enhance the understandability of spoken messages for DHH (Deaf and Hard of Hearing) readers, it is crucial to identify key words within the message. This question is explored through four specific sub-research questions, which are examined in the following chapters:

RQ1.1: Does measuring the predictability of a word (given its context) help measure the importance of that word when focusing on applications for DHH users? (We will examine RQ1.1 in Chapter 4.)

RQ1.2: Do supervised models based on textual features from a spoken language transcript accurately predict word importance? (We will examine RQ1.2 in Chapter 6.)

RQ1.3: Do acoustic-prosodic cues in spoken dialogues help identify important words in the dialogue? (We will examine RQ1.3 in Chapter 6.)

RQ1.4: Do models trained on both textual and speech features from spoken dialogues outperform word importance models trained on a single type of feature? (We will examine RQ1.4 in Chapter 6.)

Researchers in speech and language technology are increasingly focused on determining the significance of individual words in conveying the overall meaning of a text. The definition of word importance can vary based on context, making this analysis valuable for applications such as text summarization, text classification, and speech synthesis. This chapter reviews previously explored methods for estimating word importance across these diverse applications.

3.1 Word Importance Estimation as a Keyword Extraction Problem

Previous research on keyword extraction has concentrated on identifying key descriptive words that summarize a document effectively. Various automatic techniques have been explored, including unsupervised methods like Term Frequency Inverse Document Frequency (TF-IDF) and supervised approaches that utilize semantic features for prediction. Recently, neural network architectures have emerged, focusing on how individual words contribute to the network's discriminative tasks. In Sections 3.1.1 to 3.3, we delve into these methods in detail, categorizing them into two main groups.

Frequency-based Keyword Extraction

The Term-Frequency Inverse Document Frequency (TF-IDF) measure is a widely used technique for identifying important keywords in a text. This method helps pinpoint relevant words by analyzing data from a larger collection of documents. Similar to the word predictability score, TF-IDF operates as an unsupervised measure, removing the necessity for subjective human scoring, which can be resource-intensive and time-consuming.


The TF-IDF score for a word (w), referred to as a term, in a text document (D) is computed in reference to a collection of text documents (𝒟, such that D ∈ 𝒟), as follows:

TF-IDF(w, D) = tf(w, D) · idf(w, 𝒟), where tf(w, D) is the frequency of the term w within the document D, and idf(w, 𝒟) = log(|𝒟| / n_w) is the logarithm of the ratio of the total number of documents |𝒟| to the number of documents n_w that contain the term w. This method effectively ranks words by their occurrence patterns across documents and has been extensively explored in the literature. Similar techniques, including raw word occurrence frequency and its variants, are also popular in text analysis.
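A minimal Python sketch of this computation, using raw counts and the natural logarithm (production implementations often smooth or normalize both terms):

```python
import math
from collections import Counter

def tf_idf(word, document, collection):
    """TF-IDF of a word in one document relative to a document collection.

    document: list of word tokens; collection: list of such token lists.
    """
    tf = Counter(document)[word]                       # term frequency tf(w, D)
    n_w = sum(1 for doc in collection if word in doc)  # documents containing w
    return tf * math.log(len(collection) / n_w) if n_w else 0.0

docs = [["the", "dog", "was", "barking"],
        ["the", "meeting", "is", "scheduled"],
        ["the", "dog", "sat"]]
print(tf_idf("dog", docs[0], docs))      # 1 * ln(3/2): "dog" is in 2 of 3 docs
print(tf_idf("meeting", docs[1], docs))  # 1 * ln(3): rarer, hence higher score
```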

Supervised Methods of Keyword Extraction

In addition to unsupervised keyword extraction techniques, researchers have explored various supervised methods that leverage the semantic features of words and their contexts to identify significant terms within text [156, 174]. The following sections discuss research in this avenue, which we have categorized into two big sub-groups:

Utilizing Linguistic Features for Keyword Extraction

Hulth [69] enhanced keyword extraction from text abstracts by incorporating syntactic features alongside traditional word frequency statistics. Their findings indicated that utilizing simple syntactic elements, like parts-of-speech tags and noun-phrase chunks, improved the performance of supervised learning algorithms. Similarly, Hong et al. [67] explored methods to evaluate word significance based on their likelihood of inclusion in human-generated summaries, employing various unsupervised scoring techniques such as word probability, log-likelihood ratio, and Markov random walk models to assess word importance. They also incorporated additional text features, including word position and part of speech, to train supervised models for word importance prediction. Other researchers have applied similar methodologies to different text genres, such as conversational styles in meetings for keyword identification [102, 156]. Recently, researchers have also explored neural network architectures, which have proven beneficial for language modeling tasks [114, 147], for the task of importance scoring of words in various applications [29, 157, 169]. For example, Chopra et al. [29] present a neural encoder-decoder architecture based on Recurrent Neural Network (RNN) units to generate an abstractive summarized representation of a sentence. As discussed in Section 2.1.2, researchers use RNN units to generate a context-based representation of words in a text, which has been shown to be a useful feature in many linguistic applications, including text summarization. However, rather than using features learned from every word in the sentence, the researchers used an attention-based filter in their neural architecture that ensures that only important input words are selected for further processing [29]. With this setup, they demonstrate significant improvement in the sentence summarization task. A similar methodology was utilized by Wang et al. [169], who used an attention-based encoder-decoder architecture to generate an abstract from multiple sources of opinions and arguments in text form. Beyond the attention-based networks, Sheikh et al. [157] demonstrated a Neural Bag-of-Words model that uses a weighted bag-of-words architecture to obtain a summative representation of a text. In this setup, each word's feature representation is weighted by a learned parameter (α), and the weighted word features are summed to obtain a total representation of the text. When this final representation is trained to produce specific application-oriented predictions, such as sentiment prediction in text sentiment analysis [157], the model in turn learns the weighting parameters (α) for each word, which are shown to correspond with the importance of each word in that application.

Limitations and Challenges

The conceptualization of word importance as a keyword-extraction problem has shown success in text summarization; however, this method may not be suitable for all applications. In spontaneous speech dialogue, where topic transitions can be unpredictable, a more localized model of word importance, focusing on individual sentences, utterances, or segments of dialogue, might be more effective. Additionally, the interactive nature of dialogue, characterized by contributions from multiple speakers, necessitates a tailored approach to assessing word importance.

In response to these challenges, we introduce new methods for scoring word importance that take into account the local context of each word. Additionally, we present a specially annotated corpus that aids in the research of word importance measurement, allowing for a more detailed exploration of this complex topic.

This chapter examines the previous research that lays the foundation for our unsupervised and supervised models of word importance. In Section 3.2, we analyze prior studies on the reading strategies employed by deaf individuals, highlighting how various text features, including frequency and predictability, influence comprehension. Building on this research, Chapter 4 presents our approach to modeling the predictability of a word based on its context and its application in predicting word importance for Deaf and Hard of Hearing (DHH) users.


Section 3.3 discusses prior work on harnessing acoustic-prosodic cues from speech for semantic modeling in various natural language processing tasks. This work is later referenced in Chapter 6, which discusses various supervised models of word importance, including unimodal-feature-based models (e.g., models trained only on acoustic-prosodic features from speech) and multimodal-feature-based models (e.g., models trained on both text- and speech-based features).

Reading Strategies of Deaf Individuals

Research suggests that deaf readers employ a strategy focused on identifying content words to grasp sentence meaning, often overlooking morpho-syntactic relationships. Eye movement studies indicate that deaf readers fixate on about 30% of the words in a text, with skipped words influenced by lexical factors such as word frequency, length, and predictability. Additionally, findings by Keith et al. reveal that both word length and contextual predictability affect whether readers skip words and the time spent on non-skipped words. Generally, highly predictable words are read more quickly and are skipped more frequently, particularly by less-skilled readers.

Word predictability plays a crucial role in evaluating text readability and reading comprehension skills, as highlighted in previous research. The Cloze procedure, a longstanding assessment method, effectively measures both text readability and participants' reading abilities by requiring them to fill in missing words in a given text. This technique is commonly employed in standardized English-language tests such as TOEFL, GRE, and WRAT. Essentially, word predictability reflects how well a reader can infer a missing word based on the surrounding context.

The ___ was barking at the mail-man.

The word "dog" is highly predictable within its context, which offers strong clues about its meaning.

The meeting is scheduled on ___. Here the predictability is very low, suggesting that readers might not be able to rely on the context to predict the word.

The use of word predictability in reading assessments, such as Cloze tests, alongside eye-tracking research suggests that Deaf and Hard of Hearing (DHH) readers tend to skip highly predictable words. This linguistic characteristic can therefore be valuable for evaluating the significance of words in text, particularly for DHH individuals.

In contrast to the popular frequency-based approaches (like the TF-IDF measure discussed in Section 3.1.1), this approach of scoring the importance of words based on their predictability in context remains underexplored in the existing literature. Considering the distinct reading strategies employed by deaf readers compared to hearing readers, it is essential to examine this measure for applications tailored to the deaf community. Our methodology for assessing word importance in text is detailed in Chapter 4.

Acoustic-Prosodic Cues for Semantic Knowledge

Previous research has focused on modeling prosodic cues in speech for various applications, including automatic prominence detection, which helps identify regions of speech with greater stress and supports the automatic identification of content words, a key aspect of spoken language understanding. Additionally, studies have explored prosodic patterns to uncover syntactic relationships among words, with evidence showing that speech-based features can enhance the parsing of conversational texts. Researchers have also examined prosodic events to pinpoint significant segments in speech, aiding the creation of generic summaries of meeting recordings. However, prosodic cues present challenges, as they fulfill diverse linguistic functions and convey emotional nuances. Our investigation targets models applied to spoken messages at the dialogue-turn level to predict the significance of words for comprehending utterances.

Unsupervised Models of Word Importance

In Section 3.2, we explored a unique reading strategy used by deaf readers, who often skip common and contextually predictable words. While previous research has examined word frequency as a measure of word importance, the predictability of words within a context has not been thoroughly investigated. This chapter aims to quantify the predictability of words in text and assess its relevance for deaf and hard-of-hearing (DHH) users.


Defining the Word Predictability Measure

Word predictability is a metric that assesses how easily a word can be anticipated within a specific context, reflecting the effort needed to make such a prediction. A word's significance in a text is inversely related to its predictability; if a word is crucial, its absence would challenge the reader's understanding. Conversely, words embedded in strong contexts that facilitate inference are deemed less important. To quantify this concept, we utilize language models that predict words based on their contextual cues, allowing us to evaluate a word's predictability by measuring the difficulty the model faces in making accurate inferences.

The language model analyzes the predictability of a word based on its contextual usage within a sentence, as illustrated in Figure 4.1. For a deeper understanding of the mathematical computation behind this scoring, readers are encouraged to consult Section 4.2.

In estimating the difficulty of predicting a word within a given context, we analyze various candidate words, such as "dog" and "tree," for the incomplete sentence, "The ___ is barking at the mail-man." The language model evaluates its confidence across these candidates to gauge the overall challenge of making an inference. Unlike traditional measures of word predictability, such as surprisal, which focus on the model's confidence in predicting one specific word, our approach emphasizes the general difficulty of making a contextual guess.

Methods for Computing Word Predictability

N-gram Language Model

N-gram models are one of the popularly used approaches for language modeling; they use the relative frequency counts of words and word phrases from a very large corpus to make estimates about the language. As discussed in Section 2.1.2, one of the key tasks of a language model is to make predictions about a word (w) given its context (c), i.e., computing P(w|c). Rather than using the entire context (history) to make the prediction, n-gram models approximate the context using only the last n−1 words. For example, a bigram model (n=2) approximates the prediction of a word using only the preceding word, i.e., P(w_t | w_{t−1}). Similarly, a trigram model (n=3) makes a prediction about a word using the last two words of the context, i.e., P(w_t | w_{t−1}, w_{t−2}). Theoretically, as the value of n increases, the model can make more accurate predictions. In practice, however, using a longer context also increases data sparsity, as it becomes increasingly difficult to find longer common sequences of words in the training corpus, making accurate predictions more difficult.
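As a toy illustration of the maximum-likelihood estimates underlying these models (a real system trains on millions of tokens and, as discussed next, adds a back-off strategy for unseen sequences):

```python
from collections import Counter

corpus = "the dog was barking at the mail man and the dog sat".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """Maximum-likelihood bigram estimate P(w_t | w_{t-1}) from raw counts."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("dog", "the"))  # "the dog" follows 2 of the 3 occurrences of "the"
```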

Methodology: Estimating Word Predictability Using N-grams

To calculate the predictability score of a word, we employed various n-gram language models that analyze the frequency of word sequences in extensive text collections, similar to those used in word-prediction systems for text-entry applications. Our n-gram models, ranging from n = 1 to 5, were trained on the Switchboard, English CALLHOME, and TEDLIUM corpora, which collectively encompass 1.9 million word tokens and represent the conversational speech dialogues relevant to real-time captioning scenarios. The models operated bi-directionally, making predictions from both the left and right word-sequence contexts, using a Stupid Back-off mechanism to rank potential word candidates independently for each context; the left-context scoring function is:

S(w_t | w_{t−n+1}^{t−1}) =
    count(w_{t−n+1}^{t}) / count(w_{t−n+1}^{t−1})    if count(w_{t−n+1}^{t}) > 0
    λ · S(w_t | w_{t−n+2}^{t−1})                     otherwise

In this formula, w_t denotes the word at position t, and w_x^y denotes a word sequence starting at position x and ending at position y (with y > x). Here n is the n-gram size, and we set the back-off weight λ to 0.4, following the recommendation of Brants et al. A comparable scoring function was used to rank candidates based on the right context. Ultimately, the predictions from the right and left contexts were merged and ranked for subsequent use.
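A minimal Python sketch of this back-off recursion for the left context, assuming the n-gram counts have already been collected into a single Counter keyed by word tuples (the right-context scorer is symmetric):

```python
from collections import Counter

LAMBDA = 0.4  # back-off penalty, following Brants et al.

def stupid_backoff(word, context, ngram_counts):
    """Stupid Back-off score S(w | context); context is a tuple of preceding words."""
    if context:
        if ngram_counts[context + (word,)] > 0:
            return ngram_counts[context + (word,)] / ngram_counts[context]
        # Unseen n-gram: back off to a shorter context, discounted by lambda.
        return LAMBDA * stupid_backoff(word, context[1:], ngram_counts)
    # Unigram base case: relative frequency of the word.
    total = sum(c for ngram, c in ngram_counts.items() if len(ngram) == 1)
    return ngram_counts[(word,)] / total
```

Note that these scores are not true probabilities (they need not sum to 1), which is acceptable here because they are only used to rank candidate words.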

To derive a predictability score from our predictions, we took the top 20 ranked unique candidates and converted their count-based scores into normalized probabilities that sum to 1. For example, in the phrase "The meeting is scheduled on ___," the language model predicts several potential words such as "Monday," "Friday," and "Tuesday," each assigned a specific probability of occurrence. An entropy score was then computed over the probability distribution of these candidates.


The entropy of a word w at a specific location in the text is calculated as E(w) = −Σᵢ P(w_c(i)) · log P(w_c(i)), where w_c(i) denotes a candidate word predicted by the language model at that location, and P(w_c(i)) is the probability assigned to that candidate by the model.

The entropy score, derived from information theory, quantifies the unpredictability of a state, in our case the unpredictability of a word given its surrounding context. A higher entropy score signifies a lower likelihood of selecting the correct word from the potential candidates, indicating greater difficulty in prediction. Conversely, a lower score suggests that certain words are much more probable than others, making the word easier to predict. This entropy is normalized to produce a predictability score ranging from 0 to 1.
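The dissertation does not restate the exact normalization at this point, so the sketch below shows one plausible reading: renormalize the top candidates, compute the entropy, divide by the maximum possible entropy (a uniform distribution), and invert so that higher values mean more predictable words:

```python
import math

def predictability(candidate_scores, top_k=20):
    """Normalized predictability in [0, 1] from language-model candidate scores."""
    top = sorted(candidate_scores, reverse=True)[:top_k]
    total = sum(top)
    probs = [s / total for s in top]                     # renormalize to sum to 1
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))                   # entropy of uniform guesses
    return 1.0 - entropy / max_entropy if max_entropy else 1.0

print(predictability([0.80, 0.05, 0.05, 0.05, 0.05]))  # peaked: easy to guess
print(predictability([0.20] * 5))                      # flat: 0.0, hard to guess
```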

The existing n-gram based word importance prediction model struggles to generalize to unseen data, particularly when encountering out-of-vocabulary words, due to its reliance on exact-search and matching strategies. Additionally, these models often overlook long-range text dependencies, which can significantly limit their effectiveness.

To overcome these limitations, in the coming section (Section 4.2.2) we discuss a neural architecture for language modeling.

Neural Language Model

In the previous section, we explored various n-gram models to assess the predictability of words based on context. However, these n-gram models face significant challenges, particularly when dealing with out-of-vocabulary words, as highlighted in Section 4.2.1.

Figure 4.2: Diagram of the neural word predictability model, demonstrating how the context of a word w(i) is captured using bi-directional recurrent units.

To address these limitations, we explored neural-network-based language models for estimating word predictability in context. These models have gained popularity and proven effective across various applications. Consequently, the next step in enhancing our word predictability model involves estimating word predictability with neural language models. However, we must make specific architectural modifications to the standard neural language model framework to suit our application, which are detailed in the subsequent section.


Our language model is built using a bi-directional RNN, leveraging pre-trained GloVe embeddings for word representation. The model processes these embeddings with Long Short-Term Memory (LSTM) units to effectively capture contextual information.

To predict a word (w(t)) at time (t), the model leverages the hidden representations from the forward-moving LSTM (h_fw(t−1)) and the backward-moving LSTM (h_bw(t+1)), while omitting the hidden representations of the word itself (h_fw(t) and h_bw(t)). This setup follows the language-modeling approach of Rei [143].

The hidden representations from the forward and backward LSTMs are each passed through a softmax layer to produce a prediction for the word (w(t)).

For training, the objective function for each component is constructed as a regular language-modeling objective that calculates the negative log-likelihood of the prediction, i.e., L = −Σ_t log P(w(t) | c), where c is the directional context for w(t).

GloVe, or Global Vectors for Word Representation, is an unsupervised learning algorithm that generates vector representations for words by analyzing global word-word co-occurrence statistics from a large corpus. This method effectively captures semantic relationships between words, allowing for interesting linear substructures in the word vector space. The pre-trained GloVe embeddings utilized in our analysis can be accessed from the official Stanford NLP project page.
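Pre-trained GloVe vectors are distributed as plain-text files with one word per line followed by its vector components. A small loading sketch follows; the file name corresponds to one of the standard Stanford downloads, and the 300-dimension choice is an assumption:

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Load pre-trained GloVe vectors into a word -> numpy vector dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

# embeddings = load_glove()
# embeddings["dog"].shape  # -> (300,)
```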

The model's total loss is calculated by summing the losses from both the forward and backward contexts, each of the form −log P(w(t)|c), where c indicates the context for the word w(t): either h_fw(t−1) or h_bw(t+1). To generate predictions, we combine the probability scores derived from both contexts.

The neural network was developed using TensorFlow, with LSTM hidden layers of size 650 in each direction for word context modeling. All digits in the text were replaced with the character '0', and the vocabulary was limited to 23,000 words. To speed optimization, utterances were organized into batches of 50. The training process employed the Adam optimizer with an initial learning rate of 0.001 and a decay rate of 0.9, operating on individual utterances without retaining prior dialogue context.
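Pulling these details together, below is a condensed Keras sketch of the architecture as described. It follows the Rei-style setup in which the softmax for w(t) sees only the shifted hidden states h_fw(t−1) and h_bw(t+1); the embedding dimension and the exact layer wiring are assumptions where the text is silent:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, EMB, HIDDEN = 23000, 300, 650  # vocabulary and LSTM sizes from the text; EMB assumed

tokens = layers.Input(shape=(None,), dtype="int32")
emb = layers.Embedding(VOCAB, EMB)(tokens)  # initialized from GloVe in practice

# Forward and backward LSTMs read the utterance in opposite directions.
h_fw = layers.LSTM(HIDDEN, return_sequences=True)(emb)
h_bw = layers.LSTM(HIDDEN, return_sequences=True, go_backwards=True)(emb)
h_bw = layers.Lambda(lambda t: tf.reverse(t, axis=[1]))(h_bw)  # restore time order

# Predict w(t) from h_fw(t-1) and h_bw(t+1), never from the word's own states:
# shift the forward states one step right and the backward states one step left.
shift_right = layers.Lambda(lambda t: tf.pad(t, [[0, 0], [1, 0], [0, 0]])[:, :-1, :])
shift_left = layers.Lambda(lambda t: tf.pad(t, [[0, 0], [0, 1], [0, 0]])[:, 1:, :])

# One softmax per direction, matching the two language-modeling objectives.
p_fw = layers.Dense(VOCAB, activation="softmax")(shift_right(h_fw))
p_bw = layers.Dense(VOCAB, activation="softmax")(shift_left(h_bw))

model = tf.keras.Model(tokens, [p_fw, p_bw])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # lr 0.001 as in the text
              loss="sparse_categorical_crossentropy")    # per-direction NLL
```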

Methodology: Estimating Word Predictability Using a Neural Language Model

To calculate word predictability scores, we used the combined probability scores derived from the forward and backward LSTMs at each word position. This combined score was then used to compute entropy, following the method outlined in Equation 4.2. Finally, the resulting entropy score was normalized to produce a predictability score within the (0, 1) range.

Evaluation and Conclusion

This chapter explores the impact of word predictability on reading patterns, highlighting how readers tend to spend more time on less predictable words. Eye-tracking studies indicate that words that are difficult to predict from their context require greater attention, underscoring their significance in text comprehension. To further investigate this phenomenon, the chapter examines various language models, including n-gram and deep neural network (DNN) models, to estimate word predictability and assess its role in understanding written content.

Instead of conducting an intrinsic evaluation of these models at this point, we integrate them into a practical application. Our assessment of the word-importance models is discussed in a subsequent chapter, specifically focusing on their role in predicting the impact of ASR-generated errors on text understandability for DHH users. Further details on this evaluation can be found in Section 9.4.1.

In Chapter 9, we present findings that demonstrate the effectiveness of modeling word predictability as a measure of word importance in error-impact prediction tasks. Our results indicate that this approach significantly outperforms traditional frequency-based methods, such as TF-IDF. Notably, the neural language model used for word predictability estimation achieved superior performance compared to the other models in this context.

Building the Word Importance Annotation Corpus

In Chapter 4, we explored methods for estimating word predictability as an initial approach to assessing word importance. However, this method lacks empirical validation from user data. To accurately gauge word significance, it is essential to collect data from individuals and develop more advanced models. This chapter focuses on our efforts to gather data regarding word importance in spoken conversations, which will be utilized in a subsequent chapter.

In this chapter, we focus on training and evaluating supervised models to assess word importance in spoken dialogues. We begin by defining how the significance of words within these dialogues is quantified, as detailed in Section 5.1. Following this, we outline the annotation task designed to gather judgments of word importance in Section 5.2, and we analyze the validity of the resulting annotations in Section 5.3.

Defining Word Importance

In our project, we visually represent the importance scores assigned to the words in a sentence by a human annotator. The height and font size of each word indicate its significance, with color coding to enhance clarity: green signifies high-importance words with scores above 0.6, blue represents words with scores between 0.3 and 0.6, and gray is used for words with lower scores.

Eye-tracking studies have shown distinct reading patterns among different readers, indicating potential features that influence their perceptions of word importance based on eye fixations. To create effective annotation guidelines for our research, we needed to establish a clear definition of word importance, rather than directing annotators to rely on specific characteristics like word length.

In Chapter 5, we explore the concept of word importance in spontaneous spoken conversation, defining it as the extent to which omitting a word from a dialogue transcript hinders the reader's understanding of the overall meaning. This functional perspective is crucial for our application domain, particularly in evaluating automatic speech recognition (ASR) systems for real-time meeting captioning, and it guided our data acquisition strategy.

Word Importance Annotation Task

Annotation Scheme

For our annotation project, we defined word importance as a single-dimensional property, expressed on a continuous scale from 0.0 (not important) to 1.0 (very important); Figure 5.1 shows actual scores given by human annotators. However, quantifying word importance with specific numerical scores is challenging due to its subjective nature. This section describes our efforts to enhance consistency among annotators while developing this annotated resource, and Section 5.3 details the agreement levels achieved among annotators on this task.

To reduce the cognitive load on annotators and to promote consistency, we created the following annotation scheme:

Range and Constraints: Each word is assigned a numeric score between 0 and 1, with 1 representing the highest importance. Scores are assigned with a precision of 0.05. The scores are not meant to apportion the overall meaning of the utterance among its individual words; therefore, the scores within an utterance are not required to sum to 1.

When analyzing an utterance, the annotator first considers the overall meaning of the speaker's statement, taking into account the context provided by the prior conversation history. Each word within the utterance is then scored based on its contribution, whether direct or indirect, to the overall meaning, following the established rubric outlined in the Interpretation and Scoring section.

Rating Scheme: To help annotators calibrate their scores, Table 5.1 provides some recommendations for how to select word-importance scores in various numerical ranges:

  • Words that are of least importance: these words can be easily omitted from the text without much consequence.

  • Words that are fairly important: omitting these words will take away some important details from the utterance.

  • Words that are of high importance: omitting these words will change the message of the utterance quite significantly.

Table 5.1: Guidance for the annotators to promote consistency and uniformity in the use of numerical scores.

When annotating conversations, it is crucial to evaluate how the removal of specific words impacts the overall understanding of the utterance. Annotators should assess the potential confusion that might arise for the other speaker in the dialogue if a key word were replaced with a blank space. This analysis helps ensure that the clarity of the message is accurately reflected in the scoring process.

Inter-Annotator Agreement Analysis

In our analysis, we identified 3,100 tokens within the overlap set, which consisted of transcripts labeled independently by both annotators. This subset served as the foundation for assessing inter-annotator agreement. Given that the scores were nearly continuous, ranging from 0 to 1 with a precision of 0.05, we calculated the concordance correlation to evaluate the level of agreement.

The coefficient (ρc), or Lin’s concordance correlation coefficient, serves as the main metric for evaluating agreement among annotators This coefficient assesses how effectively a new measurement (X) aligns with a gold standard (Y) By treating one annotator's annotations as the gold standard, we can extend this metric to analyze the agreement between two annotators Similar to other correlation coefficients, ρc ranges from -1 to 1, with a score of 1 indicating perfect agreement.

Concordance between the two measures can be characterized by the expected value of their squared difference as:

E[(Y − X)²] = (μy − μx)² + σx² + σy² − 2ρσxσy

where ρ is the correlation coefficient between X and Y, μx and μy are the population means, and σx and σy are the standard deviations. The sample concordance correlation coefficient is then computed as:

ρc = 2ρSxSy / ((X̄ − Ȳ)² + Sx² + Sy²)

where X̄ and Ȳ are the sample means and Sx and Sy the sample standard deviations of X and Y. Like ρ, the coefficient ρc takes values from −1 to 1.
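To make the computation concrete, the following is a minimal sketch in Python with NumPy of Lin's ρc for two annotators' score vectors; this is an illustration, not part of the original annotation tooling:

```python
import numpy as np

def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient (rho_c) between two
    annotators' score vectors, following the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]          # Pearson correlation
    sx, sy = x.std(), y.std()            # sample standard deviations
    return (2 * r * sx * sy) / (sx**2 + sy**2 + (x.mean() - y.mean())**2)

# Example: near-identical scores yield a coefficient close to 1.
print(concordance_correlation([0.9, 0.2, 0.5], [0.85, 0.25, 0.5]))
```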

Our analysis revealed a concordance correlation score of ρc = 0.89 between annotators, indicating a satisfactory level of agreement. This result is encouraging given the subjective difficulty of assessing word importance in spoken dialogue transcripts.


Summary of the Corpus

The Word Importance Annotation corpus is a newly developed resource that annotates transcripts from the Switchboard conversational speech corpus with the importance of individual words in conveying the meaning of an utterance. The corpus includes over 25,000 tokens, each labeled with an importance score between 0 and 1, where 1 indicates high importance and 0 low importance. Notably, each word is scored in its specific conversational context, since a word's importance can vary across contexts.

In our project, hearing annotators evaluated word-importance scores for these transcripts, annotating 25,048 tokens (including roughly 3,100 overlapping tokens labeled by both annotators) from 44 different English speakers by September 2017. The annotations are publicly accessible as supplementary files², aligned with the ISIP transcripts for the Switchboard corpus.

By developing a comprehensive protocol for our annotation team, as outlined in Section 5.2, we achieved a high level of agreement on word importance among annotators: a concordance correlation score of 0.89 between the two annotators.

2 http://latlab.ist.rit.edu/lrec2018

Current methods for estimating word importance in text typically assign a uniform importance score to all instances of a term within a document, regardless of where they occur. This research refines that approach by predicting word importance at a finer granularity, namely the sentence level. The subsequent sections illustrate how this corpus can be used to train and evaluate supervised models for word-importance prediction.

By adding word-level importance annotations to a conversational speech corpus, this resource supports the development of word importance models tailored to spoken language. Such models are particularly beneficial for the real-time communication captioning applications explored in this thesis.

Text-based Model of Word Importance

Model Architecture

We implemented a neural architecture based on Lample et al. [90], using bidirectional LSTM encoders combined with a Conditional Random Field (CRF) layer for sequence prediction. Input word tokens were first mapped to pre-trained distributed embeddings, which were concatenated with learned character-based representations to form the full word representations. The bidirectional LSTM encoders then produced context-aware representations, with the hidden states of the two directions concatenated so that each word's representation is conditioned on the entire sentence. The CRF layer used this representation to determine the optimal state sequence. The framework was implemented in TensorFlow, with publicly available code¹, and word embeddings were initialized with GloVe vectors. Character embeddings were randomly initialized with length 100, and the LSTM layers were configured with hidden sizes of 300 for the word component and 100 for the character component. Parameters were optimized with the Adam optimizer, using an initial learning rate of 0.001 and a decay rate of 0.9, processing sentences in batches of 20 and applying a dropout rate of 0.5 during training.

We explored two variations of our model: the first is a bidirectional LSTM integrated with a sequential CRF layer (LSTM-CRF), which approaches the problem as a discrete classification task. The second is a bidirectional LSTM with a sigmoid output layer (LSTM-SIG), which treats the task as the prediction of a continuous importance score.

1 https://github.com/SushantKafle/speechtext-wimp-labeler


The general unfolded network structure of our model, inspired by Lample et al., consists of a bottom layer for word-embedding inputs, processed by bi-directional LSTM layers. Each LSTM receives the previous hidden state and the current word embedding, generating a new hidden state. The concatenation of the hidden representations from the two LSTMs captures the contextual meaning of each word at each time step. The LSTM-CRF variant interprets the prediction task as classification with fixed, non-ordinal class labels. In contrast, the LSTM-SIG variant applies a sigmoid non-linearity to constrain prediction scores to the range 0 to 1, and is trained with squared loss to directly predict the annotation scores, akin to a regression task.
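As a concrete illustration of this architecture, below is a minimal tf.keras sketch of the LSTM-SIG variant using the hyperparameters described above. The vocabulary sizes and maximum sequence lengths are placeholder assumptions; the authors' actual implementation is available via footnote 1, and the LSTM-CRF variant would replace the sigmoid head with a CRF layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder sizes (assumptions for illustration only).
VOCAB_SIZE, CHAR_VOCAB = 20000, 80
MAX_WORDS, MAX_CHARS = 50, 20

# Word-level input: indices into pre-trained (e.g., GloVe) embeddings,
# here randomly initialized as a stand-in.
word_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
word_emb = layers.Embedding(VOCAB_SIZE, 300)(word_in)

# Character-level input: each word's character sequence is encoded by a
# small bidirectional LSTM (2 x 50 = 100-dim word representation).
char_in = layers.Input(shape=(MAX_WORDS, MAX_CHARS), dtype="int32")
char_emb = layers.Embedding(CHAR_VOCAB, 100)(char_in)
char_repr = layers.TimeDistributed(
    layers.Bidirectional(layers.LSTM(50)))(char_emb)

# Concatenate word and character representations, then contextualize each
# word with a sentence-level bidirectional LSTM (size 300 per direction).
x = layers.Concatenate()([word_emb, char_repr])
x = layers.Bidirectional(layers.LSTM(300, return_sequences=True))(x)

# LSTM-SIG head: one sigmoid unit per word constrains scores to (0, 1);
# trained with squared loss against the continuous annotation scores.
scores = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

model = tf.keras.Model([word_in, char_in], scores)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```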

Experimental Setup

We partitioned the Word Importance Annotation corpus into training (80%), development (10%), and test (10%) sets to evaluate our models. We used two metrics: the total root mean square (RMS) error, which measures the deviation of model predictions from the human annotations, and the macro-averaged F1 measure for classification accuracy. To assess classification performance, we discretized the annotation scores into six classes: [0, 0.1), [0.1, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 0.9), and [0.9, 1].
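The discretization step can be implemented with simple binning; the sketch below (illustrative, not the authors' code) maps continuous scores to the six class indices:

```python
import numpy as np

# Upper boundaries separating the six importance classes described above.
BOUNDARIES = [0.1, 0.3, 0.5, 0.7, 0.9]

def discretize(scores):
    """Map continuous annotation scores in [0, 1] to class indices 0..5,
    e.g., 0.05 -> 0 (class [0, 0.1)) and 0.95 -> 5 (class [0.9, 1])."""
    return np.digitize(scores, BOUNDARIES)

print(discretize([0.05, 0.3, 0.92]))  # -> [0 2 5]
```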

Experiment 1: Performance of the Models

Table 6.1 provides an overview of our models' performance on the test set, reporting average scores across five runs to mitigate the effect of outlier results from random model initialization. Although the LSTM-CRF achieved a superior F-score on the classification task, its RMS error was higher than that of the LSTM-SIG model, potentially highlighting the limitations of the LSTM-CRF discussed in Section 6.1.5.

The confusion matrices presented in Figure 6.2 offer a more detailed view of each model's classification performance. Notably, the LSTM-SIG model was trained to optimize the accuracy of its continuous predictions, which accounts for the wider diagonal observed in its results, reflecting its focus on continuous rather than discrete class assignments.


Table 6.1: Model performance in terms of RMS deviation and macro-averaged F1 score, with best results in bold font.

Figure 6.2 shows the confusion matrices for the two models across the six classes, defined as c1 = [0, 0.1), c2 = [0.1, 0.3), and so on. The matrix in Figure 6.2(b) confirms that the LSTM-SIG model's misclassifications tend to fall into ordinally adjacent classes. In addition, both models exhibited reduced accuracy when classifying words with importance scores in the middle range [0.3, 0.7).

Experiment 2: Comparison with Human Annotators

We evaluated the agreement between the human annotations and each model by calculating the concordance correlation coefficient. The LSTM-CRF model showed a higher average correlation with the human annotators (ρc = 0.839) than the LSTM-SIG model (ρc = 0.826). By comparison, agreement between the human annotators themselves was higher still, at ρc = 0.89.

Limitations of this Research

We developed a supervised model that estimates the importance of individual words within spoken conversation transcripts, operating at the utterance level. The model employs a bi-directional neural network architecture for sequence modeling and tagging, and represents spoken words as vectors using pre-trained GloVe embeddings, helping the model capture the overall meaning conveyed by each utterance.

All models were trained and assessed on the Word Importance Annotation corpus, achieving an F1 score of 0.60 on a three-class word importance classification task. The model's predictions correlated with human annotations at ρc = 0.839, while human-human agreement reached ρc = 0.89.

The Word Importance Annotation corpus thus proved effective for developing neural network models that predict word importance. However, the model presented so far relies solely on text-based linguistic knowledge and overlooks valuable information in the speech signal, such as prosody. The next section therefore explores the potential benefits of incorporating speech-based information into word importance modeling for conversational speech.


Speech-based Importance Model

Model Architecture

We propose a sequence labeling architecture for predicting word importance in spoken dialogue turns. Given word-level timestamps, the model employs a bi-directional LSTM architecture to assign an importance label to each spoken word in the utterance.

To properly assess the efficacy of speech-based features for determining word importance, we used high-quality, human-annotated word-level timestamp data in our training and evaluation datasets. Automating the speech tokenization process is left as future work.


The architecture for representing spoken words operates on time-series speech data: each identified word w is associated with a word-level timestamp, and a fixed-length interval window τ segments the spoken word into n sub-word intervals, where n = time(w)/τ. An RNN processes this variable-length sub-word sequence to extract a coherent word-level feature s, represented as a fixed-length vector.

The acoustic-prosodic representation for each word is generated using word-level timestamp information from the speech signal. To create a context-aware representation, two LSTM units process the word units in opposite directions within an utterance. Each LSTM receives the word's representation along with its previous hidden state, producing a new hidden state at each time step. The hidden representations from both LSTMs are concatenated to form a contextualized representation for each word, which is then passed through a projection layer to produce the final word prediction.

Word importance prediction here involves classifying words into importance categories: high (hi), medium (mid), and low (low). These categories are ordered, so misclassifying a high-importance word as low should carry a greater penalty than misclassifying it as medium. To address this ordinal structure, we explore three output projection layers: a standard softmax layer for local predictions, a relaxed softmax designed specifically for ordinal classification, and a linear-chain conditional random field (CRF) that makes decisions over the entire sequence.

• Softmax Layer: For the softmax layer, the model predicts a normalized distribution over all possible labels L for every word, conditioned on the hidden vector h_t.

• Relaxed Softmax Layer: The relaxed softmax (ord) layer, unlike the standard softmax layer, employs a sigmoid projection for each output label candidate, without normalization. This allows the model to predict multiple labels for a word rather than exactly one: for a word with label l ∈ L, the model is trained to also predict all labels that are ordinally less than l.

Both the softmax and relaxed-softmax models are trained to minimize the categorical cross-entropy, which is equivalent to minimizing the negative log-probability of the correct labels. They differ, however, in how they make the final prediction. Unlike the softmax layer, which selects the most probable label, the ord layer uses a scanning strategy [25]: for each word, the candidate labels are scanned from low to high ordinal rank until a label's score falls below a threshold (usually 0.5) or no labels remain; the last scanned label with a score above the threshold is selected as the output (see the sketch after this list).

• CRF Layer: The CRF layer models dependencies between the importance labels of adjacent words, enabling the network to identify the best-scoring path among all possible label sequences. This model is trained by maximizing the score of the correct label sequence while minimizing the likelihood of all other sequences.
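To illustrate how the relaxed-softmax layer's training targets and scanning decode work, here is a small sketch; the label names and the 0.5 threshold follow the description above, while the function names are ours:

```python
import numpy as np

# Ordinal rank order of the importance labels described above.
LABELS = ["low", "mid", "hi"]

def ordinal_targets(rank):
    """Relaxed-softmax training target: a word with label rank r also
    activates every ordinally lower label, e.g., 'hi' -> [1, 1, 1]."""
    return np.array([1.0 if i <= rank else 0.0 for i in range(len(LABELS))])

def scan_decode(sigmoid_scores, threshold=0.5):
    """Scanning strategy: move from low to high rank until a label's score
    drops below the threshold; output the last label that cleared it."""
    rank = 0
    for i, score in enumerate(sigmoid_scores):
        if score < threshold:
            break
        rank = i
    return LABELS[rank]

print(ordinal_targets(1))               # [1. 1. 0.]
print(scan_decode([0.95, 0.70, 0.30]))  # 'mid'
```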

In the remainder of this section, we compare models built with these different projection layers. Section 6.2.2 details the architecture used for word-level acoustic-prosodic feature representation, while Sections 6.2.3 to 6.2.6 describe the experimental setup and evaluations.

Acoustic-Prosodic Feature Representation

Researchers have explored vector representations of words derived from speech, akin to traditional word representations like word2vec and GloVe. These acoustic embeddings not only capture the acoustic-phonetic characteristics of speech but also aim to encode the semantic properties of words directly from spoken language.

This study explores a speech-based feature representation strategy that focuses on prosodic characteristics at the sub-word level, aiming to build a word-level representation for predicting word importance in spoken dialogue.

We examined four categories of features that have been previously considered in computational models of prosody: pitch-related features (10), energy features (11), voicing features (3), and spoken-lexical features (6):

Pitch and energy features are crucial for modeling intonation and identifying emphasized regions of speech. From the pitch and energy contours we extracted the minimum, time of minimum, maximum, time of maximum, mean, median, range, slope, standard deviation, and skewness. Additionally, we obtained RMS energy from a mid-range frequency band (500–2000 Hz), which is effective for detecting syllable prominence in speech.
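As an illustration, these per-contour statistics can be computed as follows (a NumPy/SciPy sketch; the thesis extracted the features with Praat, so this is only a stand-in):

```python
import numpy as np
from scipy.stats import skew

def contour_stats(values, times):
    """Per-word statistics of a pitch or energy contour sampled at `times`
    (seconds), mirroring the feature list above."""
    v, t = np.asarray(values, float), np.asarray(times, float)
    slope = np.polyfit(t, v, deg=1)[0]  # least-squares slope of the contour
    return {
        "min": v.min(), "t_min": t[v.argmin()],
        "max": v.max(), "t_max": t[v.argmax()],
        "mean": v.mean(), "median": np.median(v),
        "range": v.max() - v.min(), "slope": slope,
        "std": v.std(), "skewness": skew(v),
    }
```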

Our spoken-lexical features (lex) included the duration of the spoken word, its position within the utterance, and the duration of any silence preceding it. We also estimated the number of syllables in each spoken word, following the methodology of Jong et al. [33]. Further, we considered the per-word average syllable duration and the per-word articulation rate of the speaker (number of syllables per second).
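A sketch of how the spoken-lexical features could be derived from word timestamps and precomputed syllable counts (illustrative only; the syllable estimation itself follows Jong et al. [33] and is not reproduced here):

```python
def spoken_lexical_features(word_start, word_end, prev_word_end, n_syllables):
    """Duration, preceding pause, mean syllable duration, and articulation
    rate for one word; timestamps are in seconds."""
    duration = word_end - word_start
    return {
        "duration": duration,
        "pause_before": max(0.0, word_start - prev_word_end),
        "syllable_dur": duration / max(n_syllables, 1),
        "articulation_rate": n_syllables / duration if duration > 0 else 0.0,
    }
```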

For our voicing features (voc), a measure of voice quality, we focused on spectral tilt, defined as the difference between the amplitudes of the first harmonic (H1) and the second harmonic (H2) in the Fourier spectrum. This spectral-tilt measure is effective in characterizing glottal constriction, which helps distinguish various voicing characteristics, such as whispering. Additionally, we included two other voicing measures: the Harmonics-to-Noise Ratio and the Voiced-Unvoiced Ratio.
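For concreteness, spectral tilt (H1 − H2) can be approximated from a voiced frame when the fundamental frequency f0 is already known, e.g., from a pitch tracker; the following is an illustrative sketch, not the Praat-based extraction used in the thesis:

```python
import numpy as np

def spectral_tilt_h1_h2(frame, f0, sample_rate):
    """Approximate H1 - H2 (dB): the amplitude difference between the first
    harmonic (at f0) and the second harmonic (at 2*f0) of a voiced frame."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    h1 = spectrum[np.argmin(np.abs(freqs - f0))]
    h2 = spectrum[np.argmin(np.abs(freqs - 2 * f0))]
    return 20.0 * np.log10(h1 / h2)
```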

We extracted a total of 30 features using Praat, which were then augmented with their speaker-normalized (znorm) versions. This resulted in a set of 60 speech-based features derived from the sub-word units.
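The speaker-normalization step might look like the following sketch, which standardizes each feature column by that speaker's mean and standard deviation and concatenates the result with the raw features (30 to 60 dimensions):

```python
import numpy as np

def speaker_znorm(features, speaker_ids):
    """Speaker z-normalization sketch: standardize each feature column using
    that speaker's own statistics, then append to the raw features."""
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    normed = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        rows = speaker_ids == spk
        mu = features[rows].mean(axis=0)
        sd = features[rows].std(axis=0) + 1e-8  # avoid divide-by-zero
        normed[rows] = (features[rows] - mu) / sd
    return np.concatenate([features, normed], axis=1)  # 30 -> 60 features
```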

Sub-word to Word-level Representation

Acoustic features were extracted using a 50-ms sliding window with a 10-ms overlap over each word region, so that each word was represented by a variable-length sequence of sub-word features. To obtain a fixed-length feature representation for each word, a bi-directional Recurrent Neural Network (RNN) layer was applied over these sub-word features. The spoken-lexical features were then concatenated with the word-level representation to form the final feature vectors. For this layer, Gated Recurrent Units (GRUs) were chosen over LSTM units due to their superior performance in preliminary tests.
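A minimal tf.keras sketch of this sub-word-to-word encoder is shown below; the GRU width is an assumption, as the thesis does not specify it here:

```python
import tensorflow as tf
from tensorflow.keras import layers

N_FRAME_FEATS = 60   # 30 Praat features plus speaker-normalized versions
N_LEX_FEATS = 6      # spoken-lexical features, appended at the word level

# Variable-length sequence of sub-word (frame-level) features for one word.
frames_in = layers.Input(shape=(None, N_FRAME_FEATS))
lex_in = layers.Input(shape=(N_LEX_FEATS,))

# A bidirectional GRU collapses the frame sequence into a fixed-length
# vector (64 units per direction is an assumed size).
word_vec = layers.Bidirectional(layers.GRU(64))(frames_in)
word_repr = layers.Concatenate()([word_vec, lex_in])

encoder = tf.keras.Model([frames_in, lex_in], word_repr)
```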

Experimental Setup

We employed the Word Importance Annotation corpus based on Switchboard, which includes 25,048 annotated tokens from 44 English speakers, each assigned a word-importance score between 0 and 1. Word-level timestamp information is available for the corpus, and for these experiments the scores were grouped into three ordinal ranges: low importance [0, 0.3), medium importance [0.3, 0.6), and high importance [0.6, 1].

The models were trained and evaluated on this data, treating the problem as an ordinal classification task with the labels ordered as (l_low < l_mid < l_hi).
