Final exam for the Machine Learning course, Information Technology major


Problem 1: Classify Vietnamese text using the K-NN or Bayes method

Training Phase:

Input: D={d1,…,dn}: a collection of documents that have been assigned to categories in C.

C={c1,…,ck}: categories

Output: Determine a representation of each category

Processing:

- Segment the documents in D into 2 word sets: a noun set and an other-word set

- Calculate F(wi) in the noun set by:

$$F(w_i) = \frac{N_D(w_i)}{N_D}$$

- For each category, select all words that are nouns and pass the threshold s > 0. Each category then has its own representation.
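The training steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the exam's reference solution: documents are assumed to arrive already segmented and reduced to their nouns (Vietnamese word segmentation and part-of-speech tagging are out of scope here), N_D(wi) is read as the number of occurrences of wi in a category's documents and N_D as the total noun count, and the threshold is read as a lower bound on F(wi). The function name `category_representation` and the toy data are hypothetical.

```python
from collections import Counter

def category_representation(noun_docs, threshold):
    """Return {noun: F(noun)} for nouns whose relative frequency exceeds threshold.

    noun_docs: the documents of ONE category, each given as a list of nouns.
    """
    counts = Counter(noun for doc in noun_docs for noun in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    # F(w) = N_D(w) / N_D, kept only when it passes the threshold (assumed reading)
    return {w: c / total for w, c in counts.items() if c / total > threshold}

# Made-up toy training set: category -> list of noun lists
training = {
    "computing": [["lớp", "máy tính"], ["phòng", "máy tính", "lớp"]],
    "sport": [["bóng đá"], ["bóng đá", "sân"]],
}
representations = {
    category: category_representation(docs, threshold=0.25)
    for category, docs in training.items()
}
```

With this toy data, "phòng" has relative frequency 1/5 in the "computing" category and is dropped by the 0.25 threshold, while "lớp" and "máy tính" are kept.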

Testing phase

Input: d: document

Output: d is assigned to a category in C

Preprocessing:

- d is segmented into 2 word sets: noun set and other word set (not noun)

- Calculate I(wi) in the noun set (the formula is garbled in the source; its surviving symbols are I(wi), N(w), S, and a sum over wj ∈ d)

- Reduce the feature dimension by removing every wi with I(wi) < 0.20

- d is represented by Tnoun={<t1; w1>, <t2; w2>, …,<tm; wm>}

Algorithm (Bayes)

For d, calculate the probability Prob(d|Ci) for each category Ci.

Predict that d belongs to the category Ck for which Prob(d|Ck) is maximal.
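The probability equation itself is not given in the text, so the sketch below substitutes one common choice: multinomial naive Bayes with add-one (Laplace) smoothing, scored in log space with a category prior. That choice, the function names `train_naive_bayes` and `classify`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (category, list_of_nouns) pairs."""
    doc_counts = Counter()   # number of training documents per category
    word_counts = {}         # category -> Counter of noun occurrences
    vocab = set()
    for category, nouns in labeled_docs:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(nouns)
        vocab.update(nouns)
    return doc_counts, word_counts, vocab

def classify(nouns, model):
    """Return the category with the highest smoothed log-probability score."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        counts = word_counts[category]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(doc_counts[category] / n_docs)  # log prior
        for w in nouns:
            # unseen nouns get count 0 and therefore probability 1 / denom
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Made-up toy data
labeled = [
    ("computing", ["lớp", "máy tính"]),
    ("computing", ["phòng", "máy tính", "lớp"]),
    ("sport", ["bóng đá"]),
    ("sport", ["bóng đá", "sân"]),
]
model = train_naive_bayes(labeled)
predicted = classify(["máy tính", "lớp"], model)
```

Working in log space avoids numeric underflow when a document contains many nouns; the argmax is unchanged because the logarithm is monotonic.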

Problem 2: Cluster Vietnamese text using K-means or hierarchical clustering

Training Phase:

Input: D={d1,…dn}: collection of documents

s: threshold

Output: clusters

Processing:

- Calculate F(wi) in the noun set by:

$$F(w_i) = \frac{N_D(w_i)}{N_D}$$

- For each category, select all words that are nouns and pass the threshold s > 0

The hierarchical algorithm is given below.

Problem 3: Summarize Vietnamese text using an unsupervised method

Training Phase:

Input: D={d1,…dn}: collection of documents.

Output: Calculated F(wi)


$$F(w_i) = \frac{N_D(w_i)}{N_D}$$

Testing phase

Input: d: original document, r: rate of summary.

Output: d’: summary of document

Preprocessing:

- d is segmented into a set of sentences S={s1, s2, …, sn}

- In each sentence:

+ segment into 2 word sets: noun set and other word set (not noun)

+ Calculate I(wi) in the noun set (the formula is garbled in the source; its surviving symbols are I(wi), N(w), S, and a sum over wj ∈ d)

+ P(si) = 1/i;

Algorithm:

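V = "";

For each sentence si, calculate the weight of the sentence:

W(si) = I(wi) + P(si);

Sort the sentences si by W(si) in descending order.

The target length is length(d’) = length(d) * r%;

While (length(V) < length(d) * r%)

V = V + si, taking the next si in sorted order;

Arrange all selected sentences in their order in the original document; the result is d’.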
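The summarization loop can be sketched in Python as below. Several points are assumptions rather than statements from the text: the importance scores I(w) are taken as a precomputed dictionary (their formula is not reproduced here), the I term of a sentence is read as the sum of I(w) over its nouns, length is measured in sentences, and the rate r is passed as a fraction. The function name `summarize` and the sample values are hypothetical.

```python
def summarize(sentences, importance, rate):
    """Select sentences by weight and return their indices in document order.

    sentences:  list of noun lists, in original order (sentence i is sentences[i-1])
    importance: dict noun -> I(w), assumed to be computed beforehand
    rate:       fraction of sentences to keep, e.g. 0.5 for r = 50%
    """
    weights = []
    for position, nouns in enumerate(sentences, start=1):
        # W(s_i) = sum of I(w) over the sentence's nouns + P(s_i), with P(s_i) = 1/i
        weight = sum(importance.get(n, 0.0) for n in nouns) + 1.0 / position
        weights.append(weight)

    # Sort sentence indices by descending weight
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)

    target = len(sentences) * rate
    selected = []
    for i in ranked:
        if len(selected) >= target:
            break
        selected.append(i)

    # Restore the original document order
    return sorted(selected)

# Made-up sample: four sentences given as noun lists, with made-up I(w) values
sentences = [["lớp", "máy tính"], ["cô ấy"], ["máy tính", "phòng", "lớp"], ["bóng đá"]]
importance = {"máy tính": 0.6, "lớp": 0.3, "phòng": 0.1, "cô ấy": 0.2, "bóng đá": 0.2}
summary = summarize(sentences, importance, rate=0.5)
```

The position term 1/i favors early sentences, so the first sentence needs little noun weight to be selected, while later sentences must earn their place through I(w).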

Example of the hierarchical text-clustering algorithm

- Input: n input documents

- Treat each object as 1 cluster (for example, with 3 documents, the 3 documents are 3 clusters)

- In each document, extract the nouns and compute the frequency of each noun

- Measure the pairwise distance between documents, using the Euclidean distance between their noun-frequency vectors

- Set a threshold on the distance d(i,j)

- Output: merge the clusters whose distance d(i,j) <= threshold
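The merging step can be sketched as follows, taking a precomputed pairwise distance matrix as input. This is one possible reading (a single pass that links any two documents whose distance is within the threshold, i.e. single-linkage at a fixed cut), not necessarily the exact procedure intended. The function name `merge_by_threshold` and the sample distances are hypothetical.

```python
def merge_by_threshold(dist, threshold):
    """Group item indices so that any pair with dist <= threshold shares a cluster.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Returns a sorted list of clusters, each a sorted list of indices.
    """
    n = len(dist)
    parent = list(range(n))  # each item starts as its own cluster

    def find(i):
        # Follow parent links to the cluster root, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Made-up distances between three documents
distances = [
    [0.0, 0.5, 0.9],
    [0.5, 0.0, 1.0],
    [0.9, 1.0, 0.0],
]
clusters = merge_by_threshold(distances, threshold=0.6)
```

Here only the first two documents are within the threshold of each other, so the three initial clusters collapse into two.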

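Example:

Given the 3 input documents below

Document 1: Chiều nay, lớp D3tin thực hành máy tính (This afternoon, class D3tin has computer practice)

Document 2: Chiều nay, phòng máy tính A202 phải để cho lớp D4tin sử dụng (This afternoon, computer room A202 must be left for class D4tin to use)

Document 3: Sáng nay, cô ấy đi xem bóng đá (This morning, she goes to watch football)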

Compare document 1 and document 2:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots}$$


* Extract the topic words in each document and compute their frequencies, forming the feature vector that represents each document

D1={<lớp,0.3>; <d3tin,0.3>; <máy tính,0.6>}

D2={<phòng,0.1>; <máy tính,0.55>; <A202,0.12>; <lớp,0.2>; <D4tin,0.4>}

D3={<cô ấy,0.6>; <bóng đá,0.2>}

* Compute the distance between each pair of documents:

d(1,2) = sqrt(|0.3-0.2|^2 + |0.6-0.55|^2 + |0-0.1|^2 + |0-0.12|^2 + |0.3-0|^2 + |0-0.4|^2)

= sqrt(0.1^2 + 0.05^2 + 0.1^2 + 0.12^2 + 0.3^2 + 0.4^2)

d(1,3) = sqrt(0.3^2 + 0.3^2 + 0.6^2 + 0.6^2 + 0.2^2)

Similarly, compute d(2,3)

Comparing the distances, d(1,2) is small, so document 1 and document 2 belong to the same cluster. With the 3 documents above as input, they can therefore be merged into 2 clusters.
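The arithmetic of this example can be checked with a short Python sketch. It stores each document as a dictionary from topic word to frequency, using the vectors D1, D2, and D3 exactly as listed, and treats a word missing from a document as frequency 0. The function name `euclidean` is a hypothetical helper.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two sparse vectors given as dicts."""
    keys = set(u) | set(v)  # a word absent from one document counts as 0 there
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

# Feature vectors from the example
D1 = {"lớp": 0.3, "d3tin": 0.3, "máy tính": 0.6}
D2 = {"phòng": 0.1, "máy tính": 0.55, "A202": 0.12, "lớp": 0.2, "D4tin": 0.4}
D3 = {"cô ấy": 0.6, "bóng đá": 0.2}

d12 = euclidean(D1, D2)  # sqrt(0.2869), about 0.536
d13 = euclidean(D1, D3)  # sqrt(0.94),   about 0.970
d23 = euclidean(D2, D3)  # sqrt(0.9269), about 0.963
```

Under these vectors d(1,2) is clearly the smallest of the three distances, so any threshold between roughly 0.54 and 0.96 groups documents 1 and 2 together and leaves document 3 on its own.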
