ADVANCES IN PUNCTUATION AND DISFLUENCY PREDICTION

WANG XUANCONG
B.Sc. (Hons.), NUS

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
NUS GRADUATE SCHOOL FOR INTEGRATIVE SCIENCES AND ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2015

DECLARATION

I hereby declare that this thesis is my original work and that it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

Wang Xuancong
23 January 2015

Acknowledgment

My PhD journey is a life journey during which I have not only learned knowledge in the field of speech and natural language processing, but also learned various techniques for doing research: how to collaborate with other people, and how to analyze problems and come up with effective solutions. Now, at the end of this journey, it is time to acknowledge all those who have contributed to it.

First and foremost, I would like to thank my main supervisor Prof. Ng Hwee Tou and my co-supervisor Prof. Sim Khe Chai. I began my initial research in speech processing under Prof. Sim Khe Chai. As a physics undergraduate, I lacked various techniques for doing computer science research. Prof. Sim was very patient and helpful in teaching me basic experimental skills in addition to knowledge in speech processing. Later, my research focus shifted to natural language processing (NLP), because I realized that there was a gap between speech recognition and natural language processing in real-life applications, and some intermediate processing was indispensable for downstream NLP tasks. Prof. Ng, with his many years of experience in the NLP field, has helped me tremendously in coming up with useful ideas and tackling difficult problems. Under their teaching and supervision, I have acquired knowledge in both the speech and NLP fields. They have also spent countless hours providing me with invaluable guidance and assistance in the writing of my papers and thesis. Discussions with them have been very pleasant and helpful in improving my scientific skills.

Next, I would like to thank the other member of my thesis advisory committee, Prof. Wang Ye. His guidance and feedback during my candidature have always been helpful and encouraging.

I would also like to thank my friends, schoolmates and colleagues in the NUS Graduate School for Integrative Sciences and Engineering and the NUS School of Computing for their support, helpful discussions, and fellowship.

Finally, I would like to thank my parents for their continued emotional care and spiritual support, especially when I encountered difficulties or failures.

Contents

1 Introduction
  1.1 Why do we need to predict punctuation?
  1.2 Why do we need to predict disfluency?
  1.3 Contributions of this Thesis
    1.3.1 Dynamic Conditional Random Fields for Joint Sentence Boundary and Punctuation Prediction
    1.3.2 A Beam-Search Decoder for Disfluency Detection
    1.3.3 Combining Punctuation and Disfluency Prediction
  1.4 Organization of the Thesis
2 Related Work
  2.1 Sentence Boundary and Punctuation Prediction
  2.2 Disfluency Prediction
  2.3 Joint Learning and Joint Label Prediction
  2.4 Model Combination using Beam-Search Decoders
3 Machine Learning Models
  3.1 Conditional Random Fields
  3.2 Max-margin Markov Networks (M3N)
  3.3 Graphical Model Extension
  3.4 The Relationship between Model Complexity and Clique Order
4 Dynamic Conditional Random Fields for Joint Sentence Boundary and Punctuation Prediction
  4.1 Introduction
  4.2 Model Description
  4.3 Feature Extraction
    4.3.1 Lexical Features
    4.3.2 Prosodic Features
    4.3.3 Normalized N-gram Language Model Scores
  4.4 Experiments
    4.4.1 Data Preparation
    4.4.2 Incremental Local Training
    4.4.3 Vocabulary Pruning
    4.4.4 Experimental Results
    4.4.5 Comparison to a Two-Stage LCRF+LCRF
    4.4.6 Results on the Switchboard Corpus
  4.5 Conclusion
5 A Beam-Search Decoder for Disfluency Detection
  5.1 Introduction
  5.2 The Improved Baseline System
    5.2.1 Node-Weighted and Label-Weighted Max-Margin Markov Networks (M3N)
    5.2.2 Features
  5.3 The Beam-Search Decoder Framework
    5.3.1 Motivation
    5.3.2 General Framework
    5.3.3 Hypothesis Producers
    5.3.4 Hypothesis Evaluators
    5.3.5 Integrating M3N into the Decoder Framework
    5.3.6 POS-Class Specific Expert Models
  5.4 Experiments
    5.4.1 Experimental Setup
    5.4.2 Results
    5.4.3 Discussion
  5.5 Conclusion
6 Combining Punctuation and Disfluency Prediction: An Empirical Study
  6.1 Introduction
  6.2 The Baseline System
    6.2.1 Experimental Setup
    6.2.2 Features
    6.2.3 Evaluation and Results
  6.3 The Cascade Approach
    6.3.1 Hard Cascade
    6.3.2 Soft Cascade
    6.3.3 Experimental Results
  6.4 The Rescoring Approach
  6.5 The Joint Approach
  6.6 Discussion
  6.7 Conclusion
7 Conclusion and Future Work

[...]

Table 6.7 shows the comparison of results. On DF alone, the improvement of the cross-product LCRF over the mixed-label LCRF, and the improvement of the mixed-label LCRF over the isolated baseline, are not statistically significant. However, if we test statistical significance on the overall performance on both PU and DF, both the 2-layer FCRF and the cross-product LCRF perform better than the mixed-label LCRF. We also reach the same conclusion as Stolcke et al. (1998): the mixed-label LCRF performs better than isolated prediction. Comparing the 2-layer FCRF with the cross-product LCRF, the 2-layer FCRF performs better on disfluency prediction but worse on punctuation prediction. Overall, the two methods perform about the same; their difference is not statistically significant. In addition, both the 2-layer FCRF and the cross-product LCRF slightly outperform the soft cascade method (statistically significant at p = 0.04).

6.6 Discussion

In this section, we summarise our observations based on the empirical studies conducted in this chapter.

Firstly, punctuation prediction and disfluency prediction influence each other. The output from one task does provide useful information that can improve the other task. All the approaches studied in this chapter that link the two tasks together perform better than their corresponding isolated prediction baselines.

Secondly, soft cascade performs better than hard cascade. This is because hard cascade applies strict corrections to the output of the first stage, while soft cascade only adds information from the first stage to the second stage. Therefore, soft cascade is less sensitive to prediction errors from the first stage, although it may not fully exploit the additional information even when it is all correct. Hard cascade, on the other hand, is much more sensitive to error propagation: unless the error rate of the first stage is sufficiently low, hard cascade can hardly perform better than soft cascade.

Thirdly, if we train a model on a finer-grained label set but evaluate it on the original coarse-grained label set, we are very likely to see an improvement. For example:

• The edit word F1 for mixed edit and filler prediction using {E, F, O} is better than that for edit prediction using {E, O} (see the second and third rows in Table 6.4). This is because the former splits the O label of the latter into F and O, and thus has a finer label granularity.

• Disfluency prediction using the mixed-label LCRF (label set {E, F, Comma, Period, Question, None}) performs better than that using the isolated LCRF (label set {E, F, O}) (see the second and fourth rows in Table 6.7). This is because the former distinguishes between different punctuation symbols for fluent tokens and thus has a finer label granularity.

• Both the cross-product LCRF and the 2-layer FCRF perform better than the mixed-label LCRF because the former two distinguish between different punctuation symbols for edit, filler and fluent tokens, while the latter makes this distinction only for fluent tokens. Thus, the former two have a much finer label granularity (a small sketch of these label sets is given below).
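To make the label-set comparison concrete, here is a minimal sketch (ours, not from the thesis; the label names simply follow the sets quoted above) of how the isolated, mixed, and cross-product label sets could be enumerated in code.

```python
from itertools import product

# Disfluency labels: E = edit word, F = filler, O = other (fluent).
disfluency_labels = ["E", "F", "O"]
# Punctuation labels attached to each word position.
punctuation_labels = ["Comma", "Period", "Question", "None"]

# Isolated prediction: each task keeps its own small label set.
isolated_df = list(disfluency_labels)            # 3 labels
isolated_pu = list(punctuation_labels)           # 4 labels

# Mixed-label LCRF: punctuation is distinguished only for fluent tokens;
# disfluent tokens just keep their E or F label.
mixed_labels = ["E", "F"] + punctuation_labels   # 6 labels

# Cross-product LCRF: every (disfluency, punctuation) pair is one label,
# so punctuation is also distinguished for edit and filler tokens.
cross_product_labels = ["%s+%s" % (d, p)
                        for d, p in product(disfluency_labels, punctuation_labels)]

print(len(isolated_df), len(isolated_pu), len(mixed_labels), len(cross_product_labels))
# -> 3 4 6 12
```

The finer granularity is what drives the gains above, but it also multiplies the number of transition parameters (roughly the square of the label-set size for first-order cliques, and a higher power for higher clique orders), which is exactly the complexity trade-off discussed next.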
From the above comparisons, we can see that increasing the label granularity can greatly improve the accuracy of a model. However, it can also increase the model complexity dramatically, especially when a higher clique order is used. Although the joint approaches (2-layer FCRF and cross-product LCRF) are better than the soft-cascade approach, they cannot easily be scaled up to higher clique orders, which greatly limits their potential. In practice, the soft cascade approach offers a simpler and more efficient way to achieve joint prediction of punctuation and disfluency.

6.7 Conclusion

In general, punctuation prediction and disfluency prediction can improve downstream NLP tasks. Combining the two tasks can potentially improve the effectiveness of the overall framework and minimize error propagation. In this chapter, we have carried out an empirical study of various methods for combining the two tasks. Our results show that the various methods linking the two tasks perform better than isolated prediction. This means that punctuation prediction and disfluency prediction influence each other, and the prediction outcome of one task can provide useful information that helps to improve the other task. Specifically, we compare the cascade models and the joint prediction models. For the cascade approach, we show that soft cascade is less sensitive to prediction errors in the first step, and thus performs better than hard cascade. For the joint model approach, we show that, when a clique order of one is used, all three joint model approaches perform significantly better than the isolated prediction baseline. Moreover, the 2-layer FCRF and the cross-product LCRF perform slightly better than the mixed-label LCRF and the soft-cascade approach, suggesting that modelling at a finer label granularity is potentially beneficial. However, the soft cascade approach is more efficient than the joint approach when a higher clique order is used.

Chapter 7
Conclusion and Future Work

In this thesis, we have made several contributions that advance natural language processing (NLP) research in punctuation and disfluency prediction. The two tasks serve as a bridge between automatic speech recognition (ASR) and downstream NLP tasks such as machine translation and spoken language understanding. In the beginning, we have introduced the use of punctuation symbols in written language and the importance of punctuation prediction in disambiguating the meaning of a sequence of words. We have also introduced the different types of disfluency in human speech and the importance of detecting speech disfluencies for practical applications. After giving an overview of previous work in this field, we have described in detail the machine learning algorithms used in this work. We have drawn connections between logistic regression, the maximum entropy model (MaxEnt), the conditional random field (CRF), the dynamic conditional random field, and the max-margin Markov network (M3N) (see Figure 7.1). We have also highlighted the trade-off between performance and efficiency in building practical systems. After that, we have presented our work on improving punctuation prediction, disfluency prediction, and the joint prediction of both.
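As a concrete (and deliberately simplified) illustration of the model progression summarised in Figure 7.1, the sketch below contrasts the per-token softmax of a MaxEnt / multinomial logistic regression model with the unnormalized sequence score of a linear-chain CRF, which adds an edge (transition) term over adjacent labels. The code and its variable names are ours, not the thesis's.

```python
import math

def maxent_probs(features, weights, labels):
    """MaxEnt / multinomial logistic regression for one token:
    softmax over the summed weights of active (feature, label) pairs."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in features) for y in labels}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def crf_log_score(feature_seq, label_seq, weights, trans_weights):
    """Linear-chain CRF, unnormalized log-score of one labelling:
    the MaxEnt-style node terms plus an edge term for each adjacent
    label pair -- the extra ingredient highlighted in Figure 7.1."""
    score, prev = 0.0, "<s>"
    for features, y in zip(feature_seq, label_seq):
        score += sum(weights.get((f, y), 0.0) for f in features)
        score += trans_weights.get((prev, y), 0.0)
        prev = y
    return score

# Toy example with made-up weights.
labels = ["Period", "None"]
weights = {("word=okay", "Period"): 1.2, ("word=okay", "None"): 0.3}
trans = {("<s>", "None"): 0.1, ("None", "Period"): 0.5}
print(maxent_probs(["word=okay"], weights, labels))
print(crf_log_score([["word=so"], ["word=okay"]], ["None", "Period"], weights, trans))
```

A full CRF would additionally compute the partition function over all label sequences (e.g. with the forward algorithm) to turn this score into a probability; the sketch only shows the scoring step that distinguishes the two models.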
[Figure 7.1: An overall picture showing the relationship between different machine learning models used in this thesis and their evolution over time. The figure traces a chain: logistic regression (a smooth mapping from linear space into probability space), extended to multinomial logistic regression (support for multiple classes), which with the same equation for all classes gives the maximum entropy model; modelling the joint distribution of adjacent nodes (i.e. edge probabilities) gives the conditional random field, which branches into the dynamic CRF (more complex graphical structure) and the max-margin Markov network (discriminative objective function).]

In Chapter 4, we show that joint sentence boundary and punctuation prediction using a dynamic conditional random field (DCRF) outperforms isolated sentence boundary prediction and punctuation prediction using linear-chain CRFs. We also show that, by performing feature pruning and vocabulary pruning, we can significantly reduce the model size while slightly improving performance, which is useful for building practical systems. In Chapter 5, we propose a beam-search decoder for disfluency detection. In particular, we show the importance of multiple iterations of disfluency clean-up, of measuring the quality of cleaned-up utterances, and of combining expert M3N systems by node-biasing words of different POS tags. We have also achieved the highest performance on the dataset, both with and without external knowledge sources. In Chapter 6, we have conducted an empirical study of various methods of combining the two tasks: isolated prediction, cascade prediction, and joint prediction. We conclude that information from one prediction task is useful for the other task. Moreover, both joint prediction and cascade prediction outperform isolated prediction. Although joint prediction outperforms cascade prediction marginally, it increases the model complexity tremendously, which again shows the trade-off between performance and efficiency.

Even though we have advanced the current state of the art for punctuation and disfluency prediction in several directions, structured prediction incorporating syntactic knowledge and joint learning is still among the most popular research topics in natural language processing, and there is still much work to be done. For example, the shift-reduce algorithm has recently been shown to be effective in structured learning tasks such as parsing (Zhang and Clark, 2011) and punctuation prediction (Zhang et al., 2013). Like the Viterbi algorithm used in HMMs and CRFs, the shift-reduce algorithm is a dynamic programming algorithm. In addition, it keeps short-term hierarchical information in a stack, from which it can extract additional features on the fly during decoding. By doing so, it can make use of not only the static features from the input observations, but also dynamic features, because the stack changes during decoding. This has led to significant improvements.

Another area is joint learning. As discussed in this thesis, it is not possible to model high clique-order joint distributions for every observed feature because of memory and CPU limitations. Therefore, if we can model joint distributions at various clique orders automatically, we may achieve a good balance between performance and efficiency, since that makes the model adaptive, i.e., the model can allocate more parameters to more important features and fewer parameters to less important features.

The third area is semantic representation. Recently, vector-space representation (or word embedding) has received much attention in the NLP research community.
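The following toy sketch (ours; the vectors are made up purely for illustration, whereas real embeddings are learned from large corpora, e.g. with a neural network) shows the kind of vector arithmetic and similarity computation that word embeddings make possible.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real ones are learned, not hand-set.
vec = {
    "king":  [0.80, 0.30, 0.10],
    "queen": [0.78, 0.32, 0.90],
    "man":   [0.20, 0.75, 0.08],
    "woman": [0.18, 0.77, 0.88],
}

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# The "relational similarity" idea: the offset king - queen should point in
# roughly the same direction as man - woman.
print(cosine(sub(vec["king"], vec["queen"]), sub(vec["man"], vec["woman"])))
```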
If we look at how MaxEnt or CRF models make predictions, we will find that these models work by relating observations to predictions through the statistics of features. In fact, this is quite superficial, because intrinsically the computer does not understand the meaning of language as we humans do. However, if we represent words and phrases by meaningful vectors, the computer can to some extent learn the semantic relationships between words. It has been shown that vector representations of words (also known as word embeddings) can capture many relational similarities (Mikolov et al., 2013). For example, for word-embedding vectors trained using a neural network, apples − apple ≈ cars − car and king − queen ≈ man − woman. However, how to incorporate vector representations into the existing prediction frameworks, or how to create a new framework for these tasks, still requires much research. We believe that a machine-level understanding of human language should be helpful for existing NLP tasks.

References

Don Baron, Elizabeth Shriberg, and Andreas Stolcke. 2002. Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues. Channels, 20(61):41.

Doug Beeferman, Adam Berger, and John Lafferty. 1998. CYBERPUNC: A lightweight punctuation annotation system for speech. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.

Adam L Berger, Vincent J Della Pietra, and Stephen A Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Paul Boersma and David Weenink. 2009. Praat: doing phonetics by computer (version 5.1.05) [computer program]. http://www.praat.org.

Eugene Charniak and Mark Johnson. 2001. Edit detection and parsing for transcribed speech. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics, pages 1–9. Association for Computational Linguistics.

Heidi Christensen, Yoshihiko Gotoh, and Steve Renals. 2001. Punctuation annotation using statistical prosody models. In ISCA Tutorial and Research Workshop (ITRW) on Prosody in Speech Recognition and Understanding.

Christopher Cieri, David Graff, Mark Liberman, Nii Martey, and Stephanie Strassel. 2000. Large multilingual broadcast news corpora for cooperative research in topic detection and tracking: The TDT2 and TDT3 corpus efforts. In Proceedings of the Language Resources and Evaluation Conference, Athens, Greece.

Daniel Dahlmeier and Hwee Tou Ng. 2012. A beam-search decoder for grammatical error correction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 568–578. Association for Computational Linguistics.

Kallirroi Georgila. 2009. Using integer linear programming for detecting speech disfluencies. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 109–112. Association for Computational Linguistics.

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword corpus. Corpus number LDC2003T05, Linguistic Data Consortium, Philadelphia.

Agustin Gravano, Martin Jansche, and Michiel Bacchiani. 2009. Restoring punctuation and capitalization in transcribed speech. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.

Jing Huang and Geoffrey Zweig. 2002. Maximum entropy model for punctuation annotation from speech. In Proceedings of Eurospeech 2002.
Frederick Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.

Mark Johnson and Eugene Charniak. 2004. A TAG-based noisy channel model of speech repairs. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 33–39. Association for Computational Linguistics.

Jeremy G Kahn, Matthew Lease, Eugene Charniak, Mark Johnson, and Mari Ostendorf. 2005. Effective use of prosody in parsing conversational speech. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 233–240. Association for Computational Linguistics.

Ji-Hwan Kim and Philip C Woodland. 2001. The use of prosody in a combined system for punctuation generation and speech recognition. In Proceedings of the European Conference on Speech Communication and Technology 2001.

Joungbum Kim. 2004. Automatic detection of sentence boundaries, disfluencies, and conversational fillers in spontaneous speech. Ph.D. thesis, University of Washington.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Jachym Kolar, Jan Svec, and Josef Psutka. 2004. Automatic punctuation annotation in Czech broadcast news speech. In Proceedings of the 9th International Conference "Speech and Computer" (SPECOM'2005).

Taku Kudo. 2005. CRF++: yet another CRF toolkit. http://crfpp.sourceforge.net.

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning.

Dong C Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528.

Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1526–1540.

Wei Lu and Hwee Tou Ng. 2010. Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Sameer Maskey, Bowen Zhou, and Yuqing Gao. 2006. A phrase-level machine translation approach for disfluency detection using weighted finite state transducers. In Proceedings of Interspeech.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of Human Language Technologies: The 2013 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 746–751.

Zeev Nehari. 1975. Conformal Mapping. Courier Dover Publications.

Mark EJ Newman. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351.

Hwee Tou Ng and Jin Kiat Low. 2004. Chinese part-of-speech tagging: One-at-a-time or all-at-once? Word-based or character-based? In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 277–284.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1, pages 160–167. Association for Computational Linguistics.

Mari Ostendorf, Benoît Favre, Ralph Grishman, Dilek Hakkani-Tur, Mary Harper, Dustin Hillard, Julia Hirschberg, Heng Ji, Jeremy G Kahn, Yang Liu, et al. 2008. Speech segmentation and spoken document processing. IEEE Signal Processing Magazine, 25(3):59–69.

Xuan-Hieu Phan, Le-Minh Nguyen, and Cam-Tu Nguyen. 2005. FlexCRFs: Flexible conditional random field toolkit. http://www.jaist.ac.jp/~hieuxuan/flexcrfs/flexcrfs.html.

Xian Qian and Yang Liu. 2013. Disfluency detection using multi-step stacked learning. In Proceedings of Human Language Technologies: The 2013 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 820–825.

Stuart Russell and Peter Norvig. 2009. Artificial Intelligence: A Modern Approach. Prentice Hall Press.

Guergana Savova and Joan Bachenko. 2003. Prosodic features of four types of disfluencies. In ISCA Tutorial and Research Workshop on Disfluency in Spontaneous Speech.

Yanxin Shi and Mengqiu Wang. 2007. A dual-layer CRFs based joint decoding method for cascaded segmentation and labeling tasks. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1707–1712.

Elizabeth Shriberg, Andreas Stolcke, Daniel Jurafsky, Noah Coccaro, Marie Meteer, Rebecca Bates, Paul Taylor, Klaus Ries, Rachel Martin, and Carol Van Ess-Dykema. 1998. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech, 41(3-4):443–492.

Elizabeth E Shriberg. 1999. Phonetic consequences of speech disfluency. Technical report, DTIC Document.

Andreas Stolcke, Elizabeth Shriberg, Rebecca A Bates, Mari Ostendorf, Dilek Hakkani, Madelaine Plauche, Gökhan Tür, and Yu Lu. 1998. Automatic detection of sentence boundaries and disfluencies based on recognized words. In Proceedings of the International Conference on Spoken Language Processing.

Andreas Stolcke. 2002. SRILM – An extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. 2007. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. Journal of Machine Learning Research, 8:693–723.

Charles Sutton. 2006. GRMM: GRaphical Models in Mallet. http://mallet.cs.umass.edu/grmm.

Johan AK Suykens and Joos Vandewalle. 1999. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. Advances in Neural Information Processing Systems, 16:25.

Dagen Wang and Shrikanth S Narayanan. 2004. A multi-pass linear fold algorithm for sentence boundary detection using prosodic cues. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I-525. IEEE.

Pidong Wang and Hwee Tou Ng. 2013. A beam-search decoder for normalization of social media text with application to machine translation. In Proceedings of Human Language Technologies: The 2013 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 471–481.
Xuancong Wang, Hwee Tou Ng, and Khe Chai Sim. 2012. Dynamic conditional random fields for joint sentence boundary and punctuation prediction. In Proceedings of Interspeech.

Xuancong Wang, Hwee Tou Ng, and Khe Chai Sim. 2014a. A beam-search decoder for disfluency detection. In Proceedings of the International Conference on Computational Linguistics.

Xuancong Wang, Khe Chai Sim, and Hwee Tou Ng. 2014b. Combining punctuation and disfluency prediction: An empirical study. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, et al. 1997. The HTK Book, volume 2. Entropic Cambridge Research Laboratory, Cambridge.

Yue Zhang and Stephen Clark. 2011. Shift-reduce CCG parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 683–692. Association for Computational Linguistics.

Qi Zhang, Fuliang Weng, and Zhe Feng. 2006. A progressive feature selection algorithm for ultra large feature spaces. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 561–568. Association for Computational Linguistics.

Dongdong Zhang, Shuangzhi Wu, Nan Yang, and Mu Li. 2013. Punctuation prediction with transition-based parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 752–760.

Chengqing Zong and Fuji Ren. 2003. Chinese utterance segmentation in spoken language translation. In Computational Linguistics and Intelligent Text Processing, pages 516–525. Springer.

Simon Zwarts and Mark Johnson. 2011. The impact of language models and loss functions on repair disfluency detection. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 703–711. Association for Computational Linguistics.

[...]

... long/sentence-joined sequences, one from each speaker
6.2 Labels for punctuation prediction and disfluency prediction
6.3 Feature templates for disfluency prediction, or punctuation prediction, or joint prediction for all the experiments in this chapter
6.4 Baseline results showing the degradation by joining utterances into long sentences, removing precision/recall balancing, and reducing the clique ...

... clean up disfluency in speech so as to improve the accuracy and reduce ambiguity in downstream NLP tasks such as information extraction and machine translation.

1.3 Contributions of this Thesis

This thesis consists mainly of three parts: punctuation prediction, disfluency prediction, and joint punctuation and disfluency prediction. Parts of this thesis have been published in the following papers: (Wang et ...

... overview of related work in sentence boundary prediction, punctuation prediction, and disfluency detection. Chapter 3 describes the major machine learning algorithms used in this thesis, namely the linear-chain conditional random field (CRF), the dynamic conditional random field (DCRF), and the max-margin Markov network (M3N). Chapter 4 focuses on joint sentence boundary and punctuation prediction. Chapter 5 describes ...
... on sentences joined together. Prosodic information has been shown to be helpful for punctuation prediction. There are several works that make use of both prosodic and lexical features. Kim and Woodland (2001) combined prosodic and lexical information for punctuation prediction. In their work, prosodic features were incorporated using the classification and regression tree (CART), and lexical information was ...

... processing utilizes machine learning algorithms. In fact, the method we used for punctuation and disfluency prediction is adopted from sparse-feature label-sequence prediction algorithms in machine learning. In machine learning, researchers have also developed algorithms to predict multiple layers of labels together. Ng and Low (2004) proposed cross-product label prediction for Chinese word segmentation and ...

... and "disfluency prediction" interchangeably in some sections, i.e., the term "disfluency prediction" in this thesis refers to "disfluency detection" in the literature.

1.1 Why do we need to predict punctuation?

Punctuation is a very important constituent of written language. It is a product of language evolution, because not all languages have contained punctuation from the beginning. For example, punctuation ...

... used in Japanese and Korean writing until the late 19th century and early 20th century. Moreover, the punctuation used in ancient Chinese is very different from what is used now; in fact, most of the ancient inscriptions do not contain punctuation. The reason why humans introduced punctuation into written language is that, without punctuation, the meaning of a sequence of words can often be ambiguous. This kind of ...

... technical term for inserting punctuation symbols into unpunctuated text is called "punctuation prediction", because punctuation is not present in the original text, so the algorithm needs to find possible locations and insert an appropriate punctuation symbol at each location. In this thesis, since we have treated both tasks as label prediction tasks, we will refer to both problems as prediction tasks for ...

... methods. We show that the two tasks influence each other: the prediction in one task can provide useful information to the other task, and thus joint prediction works better than isolated prediction. However, using joint prediction models leads to higher model complexity, which limits their application in practice.

1.4 Organization of the Thesis

The remainder of this thesis is organized as follows. The ...