Parallelizing the Training Phase

In the training phase, the author of CRF++ (Taku Kudo <taku@chasen.org>) provides three algorithms: CRF_L1, CRF_L2 and MIRA. CRF_L1 estimates the model parameters using a Laplacian measure, and CRF_L2 estimates them using a Gaussian measure [19]. The author has already parallelized CRF_L1 and CRF_L2. Both algorithms use the LBFGS library to find the extremum of the log-likelihood function, and this library stores its variables (arrays, vectors) as double, a type that some GPU lines do not support (GeForce 210, compute capability 1.2).

The difficulties in parallelizing training are:

- How CRF++ organizes and stores its data: CRF++ builds its objects entirely in C++, and converting these objects to C is extremely difficult. The tagger object presented above is one example.

- The computational functions of CRF++ depend on the structure of these objects, for example when computing the alpha and beta values and the expectations of the nodes of a tagger.

- CRF++ is open-source software without detailed documentation of its modules and functions, so reading the code and understanding the author's intent is also difficult.

For these reasons, this thesis focuses only on using the GPU to parallelize some of the computations in the MIRA algorithm.

MIRA is a 1-best algorithm that the author added to CRF++ in version 0.45 (November 2006).

MIRA algorithm

1. For each iteration (maxiter = 10,000):

2. For each sentence in the training data set:

o Compute the values:

    - Err

    - Active set

    - Upper active set

    - Max_kkt_violation

3. Compute the value obj.

4. Display the results (number of iterations, error rate over tags, error rate over sentences, ...).

5. If max_kkt_violation = 0, increase the convergence counter (converge) by 1.

6. If converge = 2, stop.
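The listing above can be sketched on the CPU as a reference for the GPU version. This is a minimal sketch, not the CRF++ implementation: the per-sentence update follows the kernel shown later in this section, while the struct names, the function names and the assumption that error_num, cost_diff and s are already available per sentence are hypothetical. The weight update itself is left out, as it is in the kernel.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Per-sentence inputs (hypothetical struct; field names follow the kernel).
struct SentenceStats {
  int error_num;    // number of wrongly tagged tokens in the sentence
  float cost_diff;  // cost difference reported for the sentence
  float s;          // normalizer used when computing the step size mu
};

// Totals accumulated over one pass (step 2 of the listing).
struct PassTotals {
  int err = 0;
  int zeroone = 0;
  int active_set = 0;
  int upper_active_set = 0;
  float max_kkt_violation = 0.0f;
};

// One pass over all sentences: loop 2 of the listing.
PassTotals mira_pass(const std::vector<SentenceStats>& x,
                     std::vector<int>& shrink,
                     std::vector<float>& upper_bound,
                     float C, int shrinking_size) {
  PassTotals t;
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (shrink[i] >= shrinking_size) continue;  // sentence was shrunk away
    ++t.active_set;
    const int e = x[i].error_num;
    t.err += e;
    if (e == 0) {  // correct sentence: one step closer to being shrunk
      ++shrink[i];
      continue;
    }
    ++t.zeroone;
    shrink[i] = 0;
    const float violation = e - x[i].cost_diff;
    float mu = x[i].s > 0.0f ? std::max(0.0f, violation / x[i].s) : 0.0f;
    if (upper_bound[i] + mu > C) {  // clip the step at the bound C
      mu = C - upper_bound[i];
      ++t.upper_active_set;
    } else {
      t.max_kkt_violation = std::max(t.max_kkt_violation, violation);
    }
    if (mu > 1e-10f) upper_bound[i] = std::min(C, upper_bound[i] + mu);
  }
  return t;
}

// Steps 5 and 6: count passes without a violation; stop at the second one.
bool mira_converged(float max_kkt_violation, int& converge) {
  if (max_kkt_violation == 0.0f) ++converge;
  return converge == 2;
}
```

Because every sentence touches only its own shrink[i] and upper_bound[i], the loop body is independent across sentences; only the totals need a reduction afterwards, which is what makes loop 2 a natural candidate for one GPU thread per sentence.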

Parallelization approach

Parallelizing loop 2 (the loop over sentences):

1. Copy the data from the CPU to the device.

2. Choose the maximum number of threads per block: NUM_THREADS = 512.

3. Launch the kernel; each thread runs it for one sentence, so the sentences are processed in parallel.

    // Assumes d_zeroone, d_err, d_active_set, d_upper_active_set and
    // d_max_kkt_violation were zeroed on the device before the launch.
    __global__
    void gpu_mira_kernel(int *d_zeroone, int *d_err,
                         int *d_active_set, int *d_upper_active_set,
                         float *d_max_kkt_violation, int *d_shrink,
                         float *d_cost_diff, float *d_s, int *d_error_num,
                         float *d_upper_bound, float *d_mu, const float C,
                         const int sentences_size,
                         const short shrinking_size) {
      // One thread per sentence.
      const int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i >= sentences_size) return;
      // Skip sentences that have been shrunk out of the active set.
      if (d_shrink[i] >= shrinking_size) return;

      d_active_set[i] = 1;
      const float cost_diff = d_cost_diff[i];
      const int error_num = d_error_num[i];
      d_err[i] = error_num;
      if (error_num == 0) {
        ++d_shrink[i];
        return;
      }
      d_zeroone[i] = 1;
      d_shrink[i] = 0;

      // float rather than double: compute capability 1.2 has no double support.
      float mu = 0.0f;
      if (d_s[i] > 0.0f) {
        mu = fmaxf(0.0f, (error_num - cost_diff) / d_s[i]);
      }
      if (d_upper_bound[i] + mu > C) {
        mu = C - d_upper_bound[i];
        ++d_upper_active_set[i];
      } else {
        d_max_kkt_violation[i] = fmaxf(d_max_kkt_violation[i], error_num - cost_diff);
      }
      if (mu > 1e-10f) {
        d_upper_bound[i] = fminf(C, d_upper_bound[i] + mu);
      }
      // One mu per sentence, so no two threads write the same element.
      d_mu[i] = mu;
    }

4. With sentences_size being the total number of sentences in the training data set, the number of blocks used is:

NUM_BLOCKS = ceil(sentences_size / NUM_THREADS)

5. Copy the results from the device back to the CPU and finish the computation there: total number of errors, total zeroone, and so on.

    // s_size and s_sizef are assumed to be the byte sizes of the int and float
    // arrays (sentences_size * sizeof(int) and sentences_size * sizeof(float)).
    int num_blocks = ceil(sentences_size / (float)NUM_THREADS);
    dim3 dimBlock(NUM_THREADS);
    dim3 dimGrid(num_blocks);

    gpu_mira_kernel<<<dimGrid, dimBlock>>>(d_zeroone, d_err,
        d_active_set, d_upper_active_set,
        d_max_kkt_violation, d_shrink, d_cost_diff, d_s,
        d_error_num, d_upper_bound, d_mu, C,
        sentences_size, shrinking_size);

    // Copy the per-sentence results back to the host.
    cudaMemcpy(h_zeroone, d_zeroone, s_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_err, d_err, s_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_active_set, d_active_set, s_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_upper_active_set, d_upper_active_set, s_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_max_kkt_violation, d_max_kkt_violation, s_sizef, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_shrink, d_shrink, s_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_cost_diff, d_cost_diff, s_sizef, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_error_num, d_error_num, s_size, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_upper_bound, d_upper_bound, s_sizef, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_s, d_s, s_sizef, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_mu, d_mu, s_sizef, cudaMemcpyDeviceToHost);

    // Reduce the per-sentence results on the CPU.
    int sum_err = 0;
    int sum_zeroone = 0;
    int sum_active_set = 0;
    int sum_upper_active_set = 0;
    float sum_max_kkt_violation = 0.0f;

    for (int i = 0; i < sentences_size; ++i) {
      sum_err += h_err[i];
      sum_zeroone += h_zeroone[i];
      sum_active_set += h_active_set[i];
      sum_upper_active_set += h_upper_active_set[i];
    }
    // Despite its name, this holds the maximum violation over all sentences.
    sum_max_kkt_violation = max_reduce(h_max_kkt_violation, sentences_size);
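Two details of the host code can be illustrated in plain C++. First, ceil(sentences_size / NUM_THREADS) with two integer operands would truncate before ceil is applied, which is why the launch code casts to float; an integer ceiling division avoids floating point altogether. Second, max_reduce is not defined in the text, so the version below is only a hypothetical stand-in with the same call shape, and the name blocks_for is likewise an assumption made for this sketch.

```cpp
#include <algorithm>
#include <vector>

// Integer ceiling division: the smallest number of blocks whose threads
// cover every sentence.
int blocks_for(int sentences_size, int num_threads) {
  return (sentences_size + num_threads - 1) / num_threads;
}

// Hypothetical stand-in for the max_reduce helper called by the host code:
// returns the largest of the n values.
float max_reduce(const float* values, int n) {
  if (n <= 0) return 0.0f;  // assumption: an empty input yields 0
  return *std::max_element(values, values + n);
}
```

With 1000 sentences and 512 threads per block, plain integer division gives 1 block and would leave 488 sentences unprocessed, while blocks_for gives 2. The surplus threads in the last block are then filtered out by the bounds check on i inside the kernel.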

Parallelizing computation 3

This computation has the form of a sum of squares over an array of N values: $obj = \sum_{i=1}^{N} x_i^2$.

It is parallelized with the Thrust library as follows:

- Parallelizing the computation loops. The computations have the following form:

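Define the transformation (the unary operation applied to each element):

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/transform_reduce.h>

    // Unary functor that squares its argument.
    template <typename T>
    struct square {
      __host__ __device__
      T operator()(const T& x) const {
        return x * x;
      }
    };

Then reduce the transformed values:

    // Returns the sum of squares of x[0..N-1].
    float sum(float *x, int N) {
      // transfer to device
      thrust::device_vector<float> d_x(x, x + N);
      // setup arguments
      square<float> unary_op;
      thrust::plus<float> binary_op;
      float init = 0.0f;
      // compute the sum of squares (the squared norm)
      return thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op);
    }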
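To check the Thrust reduction above on small inputs, a CPU reference that computes the same sum of squares is useful. The following is a minimal sketch; the function name sum_of_squares_cpu is hypothetical and not part of CRF++ or Thrust.

```cpp
#include <numeric>

// CPU reference: sum of squares of x[0..N-1], computed as the inner
// product of the array with itself.
float sum_of_squares_cpu(const float* x, int N) {
  return std::inner_product(x, x + N, x, 0.0f);
}
```

Since both versions accumulate in float but in a different order, results on large arrays should be compared with a small tolerance rather than for exact equality.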

Optimizing memory usage

Estimating the parameters of the CRF model