Hidden Markov Model for Text Analysis

Student: Tun Tao Tsai
Advisor: Dr. Mark Stamp
Committee Member: Dr. Jeff Smith
Committee Member: Dr. Chris Pollett
Department of Computer Science, San Jose State University
Email: joetsai@nanocluster.net

Contents

Abstract
Introduction
The Basic probability theory
The Cave and Neuwirth experiment
Hidden Markov model 13
3.1 The Markov property 14
3.2 The Hidden Markov model definition 15
3.3 Finding the probability of the observed sequence 15
3.4 The forward-backward algorithm 16
3.4.1 The forward recursion 17
3.4.2 The backward recursion 18
3.5 Choosing the best state sequence 19
3.6 Parameter re-estimation 19
Chinese information processing 21
Phonology 22
5.1 Chinese phonemic transcription 22
5.2 English phoneme transcription 24
Experiment design and the software 25
6.1 Number of iterations 25
6.2 Numbers of states 26
6.3 Chinese corpus 26
6.4 The software 27
Experiment results 28
7.1 English alphabet experiment results 28
7.2 English phoneme results 28
7.2.1 English phoneme experiment using two states 29
7.2.2 English phoneme experiment with more than two states 29
7.3 Chinese characters experiment result 30
7.4 Zhuyin experiment results 31
7.5 Entropy 33
Summary and conclusions 35
Future work 35
Reference 36
Appendix 1: Experiment results 38
Appendix 2: Entropy experiment results 48
Appendix 3: Brown University corpus 51
Appendix 4: Chinese character encoding 52
Appendix 5: Zhuyin – Pinyin conversion table 54
Appendix 6: CMU pronouncing dictionary phoneme chart 55
Appendix 7: Trial experiment result to determine the number of iterations to use 56

Abstract

In the field of Natural Language Processing, the Hidden Markov Model (hereafter HMM) method has proven useful for finding patterns in sequences of data. In this study, we apply the HMM technique to the Brown corpus [1], the Brown corpus in its phonemic representation, a Chinese corpus, and the Chinese corpus in its Zhuyin representation, in an attempt to find statistical models that can explain certain language features. We first give a brief overview of the original experiment conducted by Cave and Neuwirth [14], the Hidden Markov Model, the Chinese language processing issues and English phonetic properties that are related to our experiments, the design of our experiments, and the findings from our experiments. Finally, we discuss some of the challenges of future research.

Keywords: Chinese, Hidden Markov Model

Introduction

The theory of the Hidden Markov Model (HMM) was first introduced in the late 1960s by Baum and his colleagues. Quickly spotting the potential of the method, Jack Ferguson (The Institute for Defense Analyses) brought it into the area of speech recognition. During the 1970s, Baker at Carnegie-Mellon University (CMU) and Jelinek at International Business Machines (IBM) applied the HMM technique to speech processing. Several new studies of language models were quickly spawned after the introduction of the HMM technique, such as part-of-speech tagging and the word segmentation problem that is common in Asian languages [12]. Even today, HMMs are still widely used in many new research areas such as DNA sequence alignment, network intrusion detection, and vision recognition.

An HMM is a statistical model of a Markov process with hidden states. One use of an HMM is to determine the relationship of the hidden states to the observations, which depends on the associated probabilities. The graphical representation of an HMM is illustrated in Figure 1: the states of the Markov chain are hidden, while the outputs from the Markov chain are observable.

[Figure 1: Graphic illustration of the Hidden Markov model]
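To make the model concrete before the formal treatment in section 3, here is a minimal sketch (our own illustration with toy numbers, not the software used in this study) of the three parameter sets that define an HMM, together with the forward recursion of section 3.4.1 for computing the probability of an observed sequence:

```python
import numpy as np

# Toy two-state HMM with three observation symbols (illustrative values only).
# A[i][j] = P(state j at time t+1 | state i at time t)  -- state transitions
# B[i][k] = P(observing symbol k | state i)             -- observation probabilities
# pi[i]   = P(starting in state i)                      -- initial distribution
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],
              [0.7, 0.2, 0.1]])
pi = np.array([0.6, 0.4])

def forward(obs):
    """Forward recursion: P(obs | model), summing over all hidden state paths."""
    alpha = pi * B[:, obs[0]]          # initialization at t = 0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction: advance one time step
    return alpha.sum()                 # termination: sum over final states

print(forward([0, 1, 2, 1]))  # probability of one example observation sequence
```

In an experiment like the one described next, B would instead have one column per letter of the alphabet, and after training it is the rows of B (the per-state emission distributions) that reveal how the observations cluster.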
This is a common technique used in language processing. One classic example is the experiment done by R. L. Cave and L. P. Neuwirth, published in 1980. In this experiment, they applied an HMM to a large body of English text (known as a corpus), where the individual letters were taken as the set of observations. With two states, they found that the letters naturally separated into vowels and consonants. In some sense, this is not surprising. But since no a priori assumption was made regarding the two hidden states, the experiment clearly demonstrates that the vowel-consonant separation of letters in English is significant in a very strong statistical sense. Cave and Neuwirth [14] also conducted similar experiments with more than two hidden states, and they were able to sensibly interpret the results for up to 12 hidden states.

We have conducted an HMM analysis of Chinese text somewhat analogous to the HMM study of English text discussed in the previous paragraph. Chinese is, of course, very different from English. Unlike English, Chinese has a large character base, and the characters are written in sequence without spaces or other separators. The only form of separation in Chinese is the use of punctuation marks in locations such as sentence boundaries. This language structure affects the outcome of our experiments; more detail is given in the chapter on Chinese information processing.

In addition, we have extended this study to the area of phonology. We applied HMM techniques to both corpora under their respective phonemic transcriptions. The result is interesting since phonemes are more primitive than graphemes. In addition, the experiment may reveal the different uses of nasal and oral sounds between the two languages. We applied HMMs with two to six states, using the individual characters/phonemes as the observations. Such an HMM application is computationally intensive and requires vast amounts of data in order to have any chance of success. We also attempt to give meaningful linguistic interpretations to the results we obtain.

The Basic probability theory

The mathematics behind the Hidden Markov Model method is straightforward and easy to understand. In this section, we describe several results from basic probability theory that are required to understand the Hidden Markov Model technique. This information is obtainable from many sources [11].

Probability axioms: Given a finite sample space $S$ and an event $A$ in $S$, we define $P(A)$ as the probability of $A$. Then:
(a) $0 \le P(A) \le 1$ for each event $A$ in $S$;
(b) $P(S) = 1$;
(c) $P(A \cup B) = P(A) + P(B)$ if $A$ and $B$ are mutually exclusive events in $S$.

Joint probability: If $A$ and $B$ are random variables, then the joint probability function is $P(a, b) = P(A = a, B = b)$.

Conditional probability: The probability of $A$ conditioned on $B$ is defined as $P(A \mid B) = P(A, B) / P(B)$.

Product rule: From the definition of conditional probability, the product rule is $P(A, B) = P(A \mid B) P(B) = P(B \mid A) P(A)$.

Chain rule: The chain rule is an extension of the product rule, which we can write in more general form as $P(a_1, a_2, \ldots, a_n) = P(a_1 \mid a_2, \ldots, a_n) P(a_2 \mid a_3, \ldots, a_n) \cdots P(a_{n-1} \mid a_n) P(a_n)$.

Bayes' rule: Bayes' rule is an alternative way to calculate a conditional probability when the joint probability $P(A, B)$ is unknown. From the definition of conditional probability, we know that $P(A \mid B) P(B) = P(A, B)$ as well as $P(B \mid A) P(A) = P(A, B)$. Bayes' rule is therefore $P(A \mid B) = P(B \mid A) P(A) / P(B)$.

Entropy: Entropy is a measurement of the randomness of a system. If $S$ is a finite set of values based on a probability distribution, the entropy is $H(S) = -\sum_{i=1}^{n} P(s_i) \log P(s_i)$, where $S$ takes the values $s_i$.
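As a quick numerical illustration of these definitions (a sketch of our own with made-up probabilities, not data from the study), the following checks the product rule and Bayes' rule, and computes the entropy of two small distributions:

```python
import math

# Made-up probabilities, for illustration only.
p_b = 0.4            # P(B)
p_a_given_b = 0.5    # P(A | B)
p_a = 0.35           # P(A)

p_ab = p_a_given_b * p_b    # product rule: P(A,B) = P(A|B) P(B) = 0.2
p_b_given_a = p_ab / p_a    # conditional probability: P(B|A) = P(A,B) / P(A)

# Bayes' rule recovers P(A|B) from P(B|A), P(A), and P(B):
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12

def entropy(probs):
    """H(S) = -sum of P(s_i) * log2(P(s_i)); base 2 measures entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, the maximum for two outcomes
print(entropy([0.9, 0.1]))  # biased coin: about 0.47 bits, less random
```

Note that the choice of logarithm base is a convention; base 2 gives entropy in bits, and any base changes the result only by a constant factor.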
[Table excerpt spilled from the appendix experiment results: three columns of per-iteration counts, headed State[0], State[0] = State[1], and State[1]; each row sums to the same total of 5289.]

... M. Stamp, "A Revealing Introduction to Hidden Markov Models," July 29, 2003.
[14] R. L. Cave and L. P. Neuwirth, "Hidden Markov models for English," in Hidden Markov Models for Speech, J. D. Ferguson, editor, ...

... increases; that is, the experiment result becomes more meaningful.

Hidden Markov model

The Hidden Markov model is a statistical model for a sequence of observation items. There are three different ...

... the Markov system at time t.

3.2 The Hidden Markov model definition

A Hidden Markov Model (HMM) is a probabilistic function of a Markov process. The term "hidden" indicates that the system is ...
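The truncated fragment above begins the formal HMM definition: the word "hidden" refers to the state sequence, which is never observed directly. As a companion to the earlier forward-recursion sketch (again our own illustration, reusing the same toy parameters), the following generates data from such a model, making the split between hidden states and visible observations explicit:

```python
import numpy as np

rng = np.random.default_rng(0)

# The same toy parameters as in the earlier sketch.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])
pi = np.array([0.6, 0.4])

def generate(T):
    """Walk the hidden Markov chain for T steps; only the emissions are visible."""
    states, obs = [], []
    s = rng.choice(2, p=pi)                # draw the initial hidden state
    for _ in range(T):
        states.append(s)
        obs.append(rng.choice(3, p=B[s]))  # emission depends on current state only
        s = rng.choice(2, p=A[s])          # transition to the next hidden state
    return states, obs

hidden, visible = generate(10)
print("hidden: ", hidden)   # unknown to the observer in a real experiment
print("visible:", visible)  # the sequence an HMM would be trained on
```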