Evaluating Language Models with Perplexity

V. Experiments on the VOV data (VOV Corpus)

B.2.4. Evaluating language models with perplexity

The performance of a language model can be measured with an average probability, a notion developed in Shannon's information theory. A speaker can be regarded as an information source that emits words sequentially, w_1, w_2, …, w_m, from a vocabulary W. The probability of symbol i depends on the preceding i-1 symbols. The entropy H expresses the average amount of non-redundant information carried by each new word, measured in bits:

$$H = -\lim_{m \to \infty} \frac{1}{m} \sum_{w_1, \dots, w_m} P(w_1, \dots, w_m) \log_2 P(w_1, \dots, w_m)$$

The sum above runs over all possible word sequences, but if the source is ergodic the formula is equivalent to:

$$H = -\lim_{m \to \infty} \frac{1}{m} \log_2 P(w_1, \dots, w_m)$$

or, for a sufficiently large m, approximately

$$\hat{H} = -\frac{1}{m} \log_2 P(w_1, \dots, w_m)$$

If a language model is regarded as an information source, then the model performs well when it is good at predicting words.
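A minimal sketch of the per-word entropy estimate, -(1/m) log2 P(w_1, …, w_m), may help here. It assumes the sequence probability is factored by the chain rule into one conditional probability per word, so the logarithm of the product becomes a sum. The function name `entropy_estimate` and the sample probabilities are hypothetical and chosen only for illustration.

```python
import math

def entropy_estimate(cond_probs):
    """Per-word entropy estimate in bits.

    cond_probs[i] is the probability assigned to word i given the
    words before it, so their product is P(w_1, ..., w_m).
    """
    m = len(cond_probs)
    if m == 0:
        raise ValueError("need at least one word probability")
    # log2 of a product is the sum of the log2 terms
    return -sum(math.log2(p) for p in cond_probs) / m

# Hypothetical probabilities for a four-word sequence.
probs = [0.5, 0.25, 0.25, 0.5]
h_hat = entropy_estimate(probs)
print(h_hat)
```

Summing logarithms instead of multiplying probabilities also avoids numerical underflow on long word sequences.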

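Perplexity (PP) is one of the measures used to evaluate a language model. It is computed as:

$$PP = \hat{P}(w_1, \dots, w_m)^{-\frac{1}{m}} = 2^{-\frac{1}{m} \log_2 \hat{P}(w_1, \dots, w_m)}$$

where $\hat{P}(w_1, \dots, w_m)$ is the probability that the language model assigns to the word sequence. In other words, PP is 2 raised to the per-word entropy estimated with the model's probabilities. Perplexity can be interpreted as the average number of different, equally most probable words that can follow any given word. A lower perplexity indicates a better language model.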

Computing the perplexity of two language models requires some test data. A comparison between two language models is meaningful only when both perplexities are computed on the same test data and with an equivalent vocabulary.
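The comparison described above can be sketched as follows. This is an illustrative toy, not the procedure used in the experiments: the unigram and bigram models, the add-one smoothing, the function names and the tiny data set are all assumptions. Both models are scored on exactly the same test words and share one vocabulary, which is the condition required for their perplexities to be comparable.

```python
import math
from collections import Counter

def perplexity(log2_probs):
    """PP = 2 ** (-(1/m) * sum(log2 P(w_i | history))) over m scored words."""
    return 2 ** (-sum(log2_probs) / len(log2_probs))

def unigram_scores(train, test, vocab):
    """log2 P(w) with add-one smoothing, for every test word after <s>."""
    counts = Counter(train)
    denom = len(train) + len(vocab)
    return [math.log2((counts[w] + 1) / denom) for w in test[1:]]

def bigram_scores(train, test, vocab):
    """log2 P(w | previous word) with add-one smoothing, same scored words."""
    pairs = Counter(zip(train, train[1:]))
    histories = Counter(train[:-1])
    return [
        math.log2((pairs[(prev, w)] + 1) / (histories[prev] + len(vocab)))
        for prev, w in zip(test, test[1:])
    ]

# Toy data (hypothetical); "<s>" marks the start of the text.
train = "<s> a b a b a b".split()
test = "<s> a b a b".split()
vocab = set(train) | set(test)  # both models share one vocabulary

pp_unigram = perplexity(unigram_scores(train, test, vocab))
pp_bigram = perplexity(bigram_scores(train, test, vocab))
print(pp_unigram, pp_bigram)
```

On this strictly alternating toy text the bigram model obtains the lower perplexity, because the previous word makes the next one easier to predict.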

Appendix C. Program source code

C1. Read the directory of transcript files and merge them into a single file

import glob
import os
import re

# This file is not used; it is kept for reference only.

output = open("transcriptAll.txt", "w", encoding='utf-8')
output2 = open("transcriptWithFileName.txt", "w", encoding='utf-8')
p = re.compile(r'\w+', re.IGNORECASE)
fName = ''

for filename in glob.glob('transcript all\\*.txt'):
    with open(filename, "r", encoding='utf-8') as f:
        lines = f.readlines()
    for line in lines:
        if line.endswith('.txt\n'):
            fName = line[0:-5]  # file name with the trailing ".txt" removed
        else:
            # strip the leading <s> and the trailing </s>, then upper-case
            modified = line[3:-5].upper()
            # remove punctuation
            modified = re.sub(r'[.\'",;:\-”()…“–?!]', '', modified)
            output.write(modified + "\n")

output.close()
output2.close()

C2. Convert the special characters that Sphinx does not recognize into digits

import re

inputFile = open("transcriptEncripted.txt", "r", encoding='utf-8')
outputFile = open("transcript.txt", "w", encoding='utf-8')

lines = inputFile.readlines()
for line in lines:
    # replace each special character with a digit
    edit1 = line.replace("`", "2")
    edit2 = edit1.replace("~", "3")
    edit3 = edit2.replace("?", "4")
    edit4 = edit3.replace(".", "6")
    edit5 = edit4.replace("'", "5")
    edit6 = edit5.replace("^", "8")
    edit7 = edit6.replace("(", "7")
    edit8 = edit7.replace("+", "9")
    edit9 = re.sub(r'[/]\w+', '', edit8)
    edit10 = re.sub(r'[#/]', '', edit9)  # remove special characters
    #edit10 = edit9.replace('#/#','')
    if edit10.strip() != '':
        words = edit10.split()
        print(words)
        # the last token is the file id; the rest is the utterance
        edit11 = ' '.join(words[:-1])
        edit12 = re.sub(r'[^\w\s]', '', edit11)
        #outputFile.write('<s> ' + ' '.join(words[:-1]) + ' </s> (' + words[-1] + ')\n')
        outputFile.write('<s> ' + edit12.strip() + ' </s> (' + words[-1] + ')\n')

inputFile.close()
outputFile.close()
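The chain of `replace` calls in C2 can also be written as a single translation table. The sketch below is an alternative formulation, not the code used in the thesis. The character-to-digit mapping is copied from the listing, while the names `MARK_TO_DIGIT` and `encode_marks` and the sample string are hypothetical. The two forms give the same result because no replacement digit is itself one of the replaced characters.

```python
# Mapping copied from the C2 listing: each mark becomes one digit.
MARK_TO_DIGIT = str.maketrans({
    "`": "2",
    "~": "3",
    "?": "4",
    ".": "6",
    "'": "5",
    "^": "8",
    "(": "7",
    "+": "9",
})

def encode_marks(text):
    """Replace every mark in the table with its digit in one pass."""
    return text.translate(MARK_TO_DIGIT)

# Hypothetical input fragment, for illustration only.
print(encode_marks("a` o~ e? u+ o^"))
```

A single table keeps the mapping in one place, which makes it easier to check against the tone digits that the dictionary script later strips.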

C3. Build the dictionary from the transcript file

# usage: python createDic.py y : use tones
#        python createDic.py n : do not use tones

import re
import sys

# split a word into phonetic units: Nghe -> ngh e, or ban -> b a n
def buildAmVi(word, thanhdieu):
    start = 0
    end = -111  # sentinel: the last character is a digit
    last = ''
    amvi = list()
    # three-letter initial (ngh), needed for the documented split Nghe -> ngh e
    if word[0:3].lower() in amdau:
        amvi.append(word[0:3])
        start = 3
    elif word[0:2].lower() in amdau:  # are the first 2 characters an initial consonant?
        amvi.append(word[0:2])
        start = 2
    elif word[0:1].lower() in amdau:  # is the first character an initial consonant?
        # this line is missing from the source listing (page break);
        # it is implied by the documented split ban -> b a n
        amvi.append(word[0:1])
        start = 1
    if word[-2:].lower() in amcuoi:
        last = word[-2:]
        end = -2
    elif word[-1:].isalpha():
        last = word[-1:]
        end = -1
    if end != -111:  # the last character is not a digit
        iterator = p.finditer(word[start:end])
        for match in iterator:
            if thanhdieu == 'y':
                amvi.append(match.group(0))
            else:
                # 2, 3, 4, 5, 6 are the tone digits
                amvi.append(re.sub(r'[23456]', '', match.group(0)))
        if len(word) != 1:  # nothing to add when the word has one character, e.g. P -> P
            amvi.append(last)
    else:
        iterator = p.finditer(word[start:])
        for match in iterator:
            if thanhdieu == 'y':
                amvi.append(match.group(0))
            else:
                # this branch is missing from the source listing (page break);
                # assumed to mirror the tone-stripping branch above
                amvi.append(re.sub(r'[23456]', '', match.group(0)))
    if len(amvi) == 0:
        amvi.append(word)
    return amvi

thanhdieu = sys.argv[1]  # whether to use tones: y if yes, n if no
print('Using tones: ' + sys.argv[1])

outputFile = open("tudien.dic", "w", encoding='utf-8')
inputFile = open("transcript.txt", "r", encoding='utf-8')
dic = set()

lines = inputFile.readlines()
for line in lines:
    words = line.split()[1:-2]  # drop <s>, </s> and the file id
    for word in words:
        dic.add(word)

# convert dic to a list so that it can be sorted
mylist = list(dic)
mylist.sort()

amdau = set(['b','c','ch','d','đ','g','gh','h','k','kh','l','m','n','ng','ngh','nh','p','ph','q',
             'r','s','t','th','tr','v','x'])
# amcuoi is not defined anywhere in the source listing; ASSUMPTION: it holds
# the two-letter final consonants tested by word[-2:] in buildAmVi
amcuoi = set(['ch', 'ng', 'nh'])
p = re.compile(r'\w\d*', re.IGNORECASE)

for word in mylist:
    outputFile.write(word + ' ')  # write the word to the dic file
    line = ''
    for w in word.split('_'):
        amvi = buildAmVi(w, thanhdieu)
        line = line + ' '.join(amvi) + ' '
        #outputFile.write(" ".join(amvi) + ' ')
        #if not word.endswith(w):  # do not add a space after the last word
        #    outputFile.write(' ')
    outputFile.write(line.strip() + '\n')

inputFile.close()
outputFile.close()
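The middle part of each word is cut into units by the pattern `\w\d*`, that is, one character followed by any digits attached to it. The sketch below isolates that step. The helper name `split_units` and the sample strings are hypothetical. Only the digits 2 to 6 are treated as tones, as stated in the listing, so the digits 7, 8 and 9 (which C2 produces from `(`, `^` and `+`) stay attached to their letter.

```python
import re

UNIT = re.compile(r'\w\d*')  # same pattern as `p` in the C3 listing

def split_units(core, keep_tones):
    """Split a string into letter+digits units, optionally dropping tone digits."""
    units = [m.group(0) for m in UNIT.finditer(core)]
    if not keep_tones:
        # digits 2..6 encode tones; other digits are left in place
        units = [re.sub(r'[23456]', '', u) for u in units]
    return units

# Hypothetical encoded fragments, for illustration only.
print(split_units('a82n', keep_tones=True))
print(split_units('a82n', keep_tones=False))
```

Dropping the tone digits merges units that differ only in tone, which gives a smaller unit inventory for the dictionary when the script is run with `n`.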

C4. Create the training and test transcription files from the lists of training and test file ids

import re

inputTranscript = open("asrvn_train.fileids", "r")
inputTestTranscript = open("asrvn_test.fileids", "r")
# transcript.txt is written as UTF-8 by the C2 script, so read it the same way
inputTranscriptAll = open('transcript.txt', 'r', encoding='utf-8')
outputTranscript = open("asrvn_train.transcription", "w", encoding='utf-8')
outputTestTranscript = open("asrvn_test.transcription", "w", encoding='utf-8')

transcriptAllLines = inputTranscriptAll.readlines()

# training set
lines = inputTranscript.readlines()
for line in lines:
    fileName = line.split('/')[1].strip()  # take the file name from a string such as ta/ox0328
    for content in transcriptAllLines:
        # compare only the last 6 characters of the file name, e.g. TBox0328 => ox0328
        if content.strip().endswith(fileName[-6:] + ')'):
            #print(re.sub(r'\(.*\)', '(' + fileName + ')', content))
            outputTranscript.write(re.sub(r'\(.*\)', '(' + fileName + ')', content))
            break
outputTranscript.close()

# test set
lines = inputTestTranscript.readlines()
for line in lines:
    fileName = line.split('/')[1].strip()  # take the file name from a string such as ta/ox0328
    for content in transcriptAllLines:
        if content.strip().endswith(fileName[-6:] + ')'):
            outputTestTranscript.write(re.sub(r'\(.*\)', '(' + fileName + ')', content))
            break
outputTestTranscript.close()
print('done!')
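The nested loops in C4 scan the whole transcript once per file id. A possible alternative, sketched below under stated assumptions, indexes the transcript lines once by the last six characters of the id in parentheses and then looks each file id up directly. The function names `build_index` and `select` and the sample lines are hypothetical. The sketch also assumes every transcript line ends with an id in parentheses that is at least six characters long.

```python
import re

def build_index(transcript_lines):
    """Map the last 6 characters of each trailing (id) to its transcript line."""
    index = {}
    for content in transcript_lines:
        m = re.search(r'\(([^()]*)\)\s*$', content)
        if m:
            # keep the first line seen for a key, like `break` in the loop version
            index.setdefault(m.group(1)[-6:], content)
    return index

def select(fileid_lines, index):
    """Return transcript lines for the given file ids, with the full id written back."""
    selected = []
    for line in fileid_lines:
        file_name = line.split('/')[1].strip()  # e.g. "ta/TBox0328" -> "TBox0328"
        content = index.get(file_name[-6:])
        if content is not None:
            selected.append(re.sub(r'\(.*\)', '(' + file_name + ')', content))
    return selected

# Hypothetical data, for illustration only.
transcript = ['<s> XIN CHAO </s> (ox0328)\n', '<s> TAM BIET </s> (ox0329)\n']
print(select(['ta/TBox0329\n'], build_index(transcript)))
```

The same `select` call can serve both the training and the test list, so the duplicated loop is written only once.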
