Master's thesis in Computer Science: Filtering private information in electronic medical records

Related-work table (columns: Year, Work - Authors, Method): [...] et al.: MEDINA was developed for filtering private information in French-language electronic medical records, based on the characteristics of the system.

GENERAL INTRODUCTION

Research objectives

The goal is to build a model for filtering private information in electronic medical records (EMRs) that can be flexibly applied to languages other than English. The thesis studies two independent kinds of system capable of filtering private information: machine learning and rule sets. The two are then combined into a hybrid system that exploits the strengths of each in order to maximize the results achieved. Specifically, RECALL and PRECISION are used to judge how successful the system is.

- The English-language data used to implement the technique comes from the 2006 i2b2 challenge, available at https://www.i2b2.org.

Scientific significance

Two challenges dedicated to filtering private information in electronic medical records have been organized by i2b2, in 2006 and in 2014. The entries in both years achieved very promising results but, as noted in [28], they are still not effective enough for such methods to be deployed in practice. The problem is therefore still regarded as unsolved, even though many approaches have been developed and published worldwide.

Most current approaches are hybrid, that is, they combine machine-learning methods with methods based on rules, pattern matching, and dictionary lookup. Some early works on this problem, such as [13, 29], and works developed for EMR collections in languages other than English, such as [11, 17, 27], all started with a rule-based approach. For the machine-learning approach used in most other works, extracting features and preparing training and test sets is hard to do without limiting how well the solution generalizes to other data sources. Hybrid solutions combining the two approaches are therefore the trend in recent works, such as the i2b2 2014 entries [7, 21, 34].

Scientifically, then, the thesis aims to contribute a hybrid approach combining machine learning and rule-based methods for filtering private information in electronic medical records, one that can be extended to languages other than English and that, by using semi-supervised learning, copes with input data that is still scarce.

Practical significance

No solution for filtering private information in EMRs has yet been developed and widely adopted; this is evident in Vietnam.

The list of the types of patient information that must be hidden has not yet been reviewed and specified. Filtering private information in EMRs is therefore an urgent problem with many difficulties. The thesis aims to solve it in a way that does not depend on the language or on the initial data set.

In addition, in practice there will be many cases in which electronic medical records are shared with organizations outside the health sector for research, education, or forecasting of disease-related issues. For this to be possible, each record must hide the patient's information, and any information from which the patient's information could be derived, in order to satisfy legal requirements. This is a practical outcome the thesis aims for.

THEORETICAL BACKGROUND

Machine learning with the CRF model

Machine learning is defined as follows:

- Given a universe of data X

- A sample set S, where S is a subset of X

- A target function f: X → {true, false}

- A labelled training set D, D = {(x, y) | x ∈ S ∧ y = f(x)}

- Compute a function f′: X → {true, false} from D such that f′(x) ≈ f(x) for all x in X

There are many machine-learning models, such as the hidden Markov model (HMM), the maximum-entropy Markov model (MEMM), and the conditional random field (CRF). Of these, the CRF addresses a problem that other models run into, the label bias problem, in which a wrong sequence is chosen without the learner being able to detect it. Both in experiments and in the literature reviewed, CRFs produce fewer errors than the other two models.

As evidence, among the machine-learning entries in the i2b2 challenge the two most frequently used models were SVMs (support vector machines) and CRFs, and CRFs were the more widely used of the two.

2.1.1 Introduction to the CRF model

CRFs (conditional random fields) were introduced in 2001 by Lafferty and colleagues [33]. A CRF is a model based on conditional probability, commonly used for labelling and analysing sequential data. It is an undirected graphical model, which allows it to define the probability distribution of the entire state sequence given the observation sequence, rather than a distribution over each state given the previous state and the current observation.

Instead of using independent probabilities over the label sequence and the observation sequence, a CRF uses the conditional probability P(Y|X) of the entire label sequence given an observation sequence X. A CRF is an undirected graphical model that defines a single log-linear distribution over label sequences given the observation sequence. CRFs are well suited to natural language: labelling the words of a sentence corresponds to assigning each word a PHI (Protected Health Information) type (DOCTOR, PATIENT, HOSPITAL, LOCATION, AGE, ADDRESS, PHONE, ...) or marking it as not PHI. For this we need the BIO scheme, which maps each word in the text to a B, I, or O label.

The BIO scheme assigns labels as follows. If an entity is labelled DOCTOR, the scheme distinguishes whether a token starts the entity or lies inside it. If the token starts the entity, the prefix "B-" is added before the PHI type; if the token is not the first token of the entity, the prefix is "I-". Tokens outside any entity are labelled O-PHI.

Example: [B-DOCTOR Nguyễn] [I-DOCTOR Đông] [I-DOCTOR Phương] [O-PHI là] [O-PHI bác] [O-PHI sĩ] [O-PHI chính] [O-PHI trong] [O-PHI ca] [O-PHI này] (in English: "Nguyễn Đông Phương is the lead doctor in this case")

A CRF is a sequence model of conditional probabilities, trained to maximize conditional likelihood. It is a framework for building probabilistic models that segment and label sequence data. Like a Markov random field, a CRF is an undirected graphical model in which each vertex represents a random variable whose distribution is to be inferred and each edge represents a dependency between two random variables.

X is a random variable over the data sequences to be labelled, and Y is a random variable over the corresponding label (or state) sequences. For example, X ranges over sequences of observed words in natural-language sentences, and Y over the sequences of part-of-speech labels assigned to those sentences (drawn from a predefined set of part-of-speech labels). A linear-chain CRF with parameters $\Lambda = \{\lambda_1, \dots, \lambda_K\}$ is given by:

$$P_\Lambda(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)$$

where $Z_x$ is a normalization factor that ensures the probabilities of all state sequences sum to 1.

$f_k(y_{t-1}, y_t, x, t)$ is a feature function, usually binary-valued but possibly real-valued, and $\lambda_k$ is a learned weight associated with feature $f_k$. Feature functions can measure any aspect of a state transition $y_{t-1} \to y_t$ and of the observation sequence $x$, centred on the current time step $t$. For example, a feature function might have value 1 when $y_{t-1}$ is the state TITLE, $y_t$ is the state AUTHOR, and $x_t$ is a word that appears in a lexicon of person names.

CRFs are usually trained by maximizing the likelihood of the training data with optimization techniques such as L-BFGS¹. Inference with the learned model means finding the label sequence that corresponds to an input observation sequence. For CRFs this is typically done with a dynamic-programming algorithm, namely Viterbi², which finds the most likely sequence of hidden states.

¹ http://en.wikipedia.org/wiki/L-BFGS

² http://en.wikipedia.org/wiki/Viterbi_algorithm
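The linear-chain CRF score and Viterbi inference can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration: the three labels, the four feature functions, the hand-picked weights (a real model would learn them, for example with L-BFGS), and the small name lexicon. The normalizer Z_x is computed by brute force, which is only feasible for very short sequences.

```python
from itertools import product
from math import exp, isclose

LABELS = ["B-DOCTOR", "I-DOCTOR", "O"]
NAME_LEXICON = {"Nguyễn", "Đông", "Phương"}   # hypothetical name dictionary
WEIGHTS = [2.0, 1.5, -3.0, 1.0]               # hypothetical weights lambda_k


def features(y_prev, y_cur, x, t):
    """Binary feature functions f_k(y_{t-1}, y_t, x, t); y_prev is None at t = 0."""
    in_lexicon = x[t] in NAME_LEXICON
    return [
        1.0 if y_cur != "O" and in_lexicon else 0.0,                    # name word tagged as PHI
        1.0 if y_prev == "B-DOCTOR" and y_cur == "I-DOCTOR" else 0.0,   # B -> I transition
        1.0 if y_prev in (None, "O") and y_cur == "I-DOCTOR" else 0.0,  # I with no entity open
        1.0 if y_cur == "O" and not in_lexicon else 0.0,                # ordinary word tagged O
    ]


def local_score(y_prev, y_cur, x, t):
    """Weighted feature sum for one position: sum_k lambda_k * f_k."""
    return sum(w * f for w, f in zip(WEIGHTS, features(y_prev, y_cur, x, t)))


def sequence_score(y, x):
    """Unnormalized log-score of a whole label sequence."""
    return sum(local_score(y[t - 1] if t else None, y[t], x, t) for t in range(len(x)))


def probability(y, x):
    """P(y | x), with Z_x summed over every possible label sequence."""
    z_x = sum(exp(sequence_score(c, x)) for c in product(LABELS, repeat=len(x)))
    return exp(sequence_score(y, x)) / z_x


def viterbi(x):
    """Most likely label sequence, found by dynamic programming."""
    # best[t][y] = (best score of a prefix ending in label y at position t, previous label)
    best = [{y: (local_score(None, y, x, 0), None) for y in LABELS}]
    for t in range(1, len(x)):
        best.append({
            y: max((best[t - 1][yp][0] + local_score(yp, y, x, t), yp) for yp in LABELS)
            for y in LABELS
        })
    y = max(LABELS, key=lambda label: best[-1][label][0])
    path = [y]
    for t in range(len(x) - 1, 0, -1):   # follow the back-pointers
        y = best[t][y][1]
        path.append(y)
    return path[::-1]
```

Calling `viterbi(["Nguyễn", "Đông", "là"])` on this toy model returns the same sequence that an exhaustive search over all 27 label sequences would pick, while visiting each position only once per label pair.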

To train this CRF model, we need to extract the features of each token and assign each token its label in advance, following the BIO scheme as in the example above.

The measures used to evaluate a classification model include sensitivity (also called coverage, or Recall), Precision, and the F-measure. Note that although accuracy is a specific measure, the word "accuracy" is also used as a general term for the predictive ability of a classifier [39].

The confusion matrix is used only for classification problems:

TP_i: the number of examples of class c_i that are correctly classified into class c_i

FP_i: the number of examples not of class c_i that are wrongly classified into class c_i

TN_i: the number of examples not of class c_i that are (correctly) not classified into class c_i

FN_i: the number of examples of class c_i that are wrongly classified (into classes other than c_i)

P = total number of examples of class c_i; N = total number of examples of classes other than c_i

| | Predicted: c_i | Predicted: not c_i |
|---|---|---|
| Actual: c_i | TP_i | FN_i |
| Actual: not c_i | FP_i | TN_i |

Precision: of the items retrieved, how many are (classified) correctly, that is, Precision = TP_i / (TP_i + FP_i)

Recall: of the items that exist, how many are found (classified), that is, Recall = TP_i / (TP_i + FN_i)

These measures are used to evaluate text classification systems.
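As a minimal sketch of these measures, the function below computes precision, recall, and the F-measure for one class from parallel lists of gold and predicted labels. It assumes the balanced F-measure F1 = 2PR / (P + R), since the text only names the measure "F"; the function name and the per-token counting are illustrative choices.

```python
def precision_recall_f1(gold, pred, target):
    """Per-class precision, recall and F1 from parallel label lists."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Balanced F-measure: harmonic mean of precision and recall (assumed variant).
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For a de-identification system, a missed PHI token (a false negative) leaks private data, which is why recall is usually the measure watched most closely.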

Semi-supervised learning

Machine-learning techniques can be divided into three basic kinds:

- Unsupervised learning: learning from a training set that is entirely unlabelled

- Supervised learning: learning from a training set that is fully labelled

- Semi-supervised learning: the data includes both labelled and unlabelled examples. Because the aim is to improve learning effectiveness when hardly any initial data exists, and to allow extension in the future, semi-supervised learning suits this thesis. Initially, medical data is limited because patient information is hard to release, so very little data is available; as the data grows, this technique can gradually improve learning effectiveness and scale to larger data in the future.

Semi-supervised learning can be seen as supervised learning repeated many times.

An initial classifier is trained on a small amount of labelled data and is then used to label the unlabelled data. The newly labelled examples with high confidence (a confidence measure above a permitted threshold), together with their labels, are added to the training set, with corrections to keep the training set accurate. The classifier is then retrained, and the procedure is repeated until the classifier reaches the permitted confidence threshold or an iteration performs worse than the previous one.

2.2.2 Semi-supervised learning algorithm:

- L: the set of labelled data

- U: the set of unlabelled data

- Goal: label the set U and select the subset U′ with the highest confidence

- Train the classifier h on the training set L

- Use h to classify the data in U

- Find the subset U′ of U with the highest confidence

- Add U′, with its predicted labels, to L and repeat, as described above
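The loop can be sketched in a few lines. This is a simplified illustration, not the thesis implementation: a toy one-dimensional nearest-centroid classifier stands in for the CRF, the confidence score is a hypothetical margin between the two nearest centroids, and the loop stops when no example clears the threshold or after `max_rounds`, rather than monitoring recall.

```python
from statistics import mean


class CentroidClassifier:
    """Toy 1-D nearest-centroid classifier (a stand-in for the CRF); needs >= 2 classes."""

    def fit(self, xs, ys):
        self.centroids = {
            label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)
        }
        return self

    def predict_with_confidence(self, x):
        # Confidence = relative margin between the two nearest centroids, in [0, 1].
        dists = sorted((abs(x - c), label) for label, c in self.centroids.items())
        (d0, best), (d1, _) = dists[0], dists[1]
        confidence = 1.0 if d0 + d1 == 0 else (d1 - d0) / (d1 + d0)
        return best, confidence


def self_train(labeled, unlabeled, threshold=0.6, max_rounds=10):
    """Grow the labelled set L with confidently labelled examples from U."""
    L = list(labeled)      # [(x, y), ...]
    U = list(unlabeled)    # [x, ...]
    h = CentroidClassifier().fit(*zip(*L))
    for _ in range(max_rounds):
        if not U:
            break
        scored = [(x, *h.predict_with_confidence(x)) for x in U]
        confident = [(x, y) for x, y, conf in scored if conf >= threshold]  # U'
        if not confident:
            break
        L.extend(confident)
        chosen = {x for x, _ in confident}
        U = [x for x in U if x not in chosen]
        h = CentroidClassifier().fit(*zip(*L))   # retrain on the enlarged L
    return h, L, U
```

The threshold controls the trade-off described in the text: a high value adds few but reliable examples per round, while a low value grows L faster at the risk of accumulating labelling errors.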

2.3 The problem of filtering private information in electronic medical records:

2.3.1 The problem of filtering private information in general:

a. The concept of information filtering:

Filtering private information is essentially a data classification problem: after classification, each unit of data must belong to one of a set of predefined classes. This classification problem is commonly solved with machine learning, which removes the need for a large number of experts to decide which information must be filtered and makes it possible to classify very large volumes of data quickly.

This classification problem can be approached with supervised or semi-supervised learning.

b. Common classification techniques:

- K-nearest neighbours

- Conditional-probability maximization with CRFs

This thesis uses the CRF model presented above to solve the problem.

2.3.2 The problem of filtering private information in electronic medical records:

a. Characteristics of working with electronic medical records:

Compared with the general problem, working with electronic medical records has distinctive characteristics that call for specific handling:

- First, data size. The data consists of unstructured electronic medical records: doctors' comments, nurses' notes, prescriptions, or ordinary examination forms. The input is therefore unstructured, and its size is not fixed at all. The number of records is very large and grows over time.

- Second, specialized terminology and noise, the two main factors that increase the error of the classification output. Because records are written to suit the working practices of doctors and nurses, spelling errors, that is, noise, are unavoidable. The problem must therefore either require input that has already been cleaned or handle noise within the classifier itself.

- Third, evaluation criteria. The classifier's result depends on two thresholds whose exact values can only be found experimentally: the one used to select the labelled data with the highest confidence, and the one at which retraining stops because the training result has become acceptable. Choosing both thresholds is purely a matter of trial and error.


ELECTRONIC MEDICAL RECORDS

Three basic processing phases of the solution

In this phase, all electronic medical records are first preprocessed. Tokenization returns the words in sequence, in the positions they occupy in the record. Each word is represented by a set of features derived from the properties that distinguish words of one PHI type from words of other types. If a word has been labelled, its PHI type is already known; otherwise, the PHI type is what must be determined.

The feature list is based on the feature sets of the related works [7, 21, 34] and takes into account the effect of each feature on the results for each PHI type, as reported in those works.

The feature list includes two readily available features, POS (Part-Of-Speech) and NER (Named Entity Recognition), which are widely used in the referenced works; some works regard them as two important features. For NER and POS tagging, The Stanford Natural Language Processing Group has developed a complete toolkit for recognizing named entities in English, which many authors use.

Table 8. Proposed feature list for the machine-learning module of the private-information filtering solution

| No. | Feature group | Description |
|---|---|---|
| 3 | Combinations of token and POS within a context window of size 5 | w_0 p_{-1}, w_0 p_1, w_0 p_{-1} p_{-2}, w_0 p_1 p_2, w_0 p_{-1} p_1 |
| 4 | Token prefixes and suffixes | Prefixes and suffixes of length 1 to 5 |
| 5 | Long token shape | The token rewritten according to the class of each of its characters, one symbol per character |
| 6 | Short token shape | As above, but with runs of identical adjacent symbols collapsed |
| 7 | Orthographic features of the token | Initial capital, all capitals, internal capital, contains a digit, contains a letter, contains a special character |
| 8 | Dictionary/pattern features | Appears in the dictionaries and matches the regular expressions for age, city, date, holiday, phone number, occupation, state, street, prefix, suffix |
| 9 | Entity type from the NER tool | Output of the NER tool, e.g. PERSON, ... |
| 10 | PHI type from the type indicator | Output of the context indicator for PHI types, e.g. Dr for DOCTOR |
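Feature groups 4 to 7 depend only on the token itself, so they can be sketched without any external tool. The function and key names below are illustrative assumptions, as is the shape alphabet ("A" for upper case, "a" for lower case, "0" for digits, other characters kept as they are). Groups 3 and 8 to 10 need a POS tagger, dictionaries, an NER tool, and the type indicator, and are left out.

```python
import re


def shape_symbol(ch):
    """Map a character to its class symbol for the token-shape features."""
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    if ch.isdigit():
        return "0"
    return ch  # punctuation and other symbols are kept unchanged


def token_features(token):
    """Token-internal features (groups 4-7 of the table above)."""
    feats = {}
    # Group 4: prefixes and suffixes of length 1 to 5.
    for n in range(1, min(5, len(token)) + 1):
        feats[f"prefix{n}"] = token[:n]
        feats[f"suffix{n}"] = token[-n:]
    # Groups 5 and 6: long shape, and short shape with repeated symbols collapsed.
    long_shape = "".join(shape_symbol(ch) for ch in token)
    feats["shape_long"] = long_shape
    feats["shape_short"] = re.sub(r"(.)\1+", r"\1", long_shape)
    # Group 7: orthographic properties.
    feats["init_cap"] = token[:1].isupper()
    feats["all_caps"] = token.isupper()
    feats["inner_cap"] = any(ch.isupper() for ch in token[1:])
    feats["has_digit"] = any(ch.isdigit() for ch in token)
    feats["has_letter"] = any(ch.isalpha() for ch in token)
    feats["has_special"] = any(not ch.isalnum() for ch in token)
    return feats
```

Shape features let the model generalize across tokens it has never seen: "5/06" and "12/11" share the short shape "0/0", which is a strong cue for DATE.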

4.1.2 PHI identification

This phase builds a multi-class classifier that predicts the possible PHI type of each word processed in the previous phase. A word may be the beginning of a PHI instance, lie inside one, or not be PHI at all. The BIO scheme expresses this: a word that begins a PHI instance is labelled B-PHI, a word inside one is labelled I-PHI, and a word that is not of any PHI type is labelled O-PHI. As mentioned above, the classifier is built with semi-supervised learning, so that the user does not need a very large initial data set to build it. Newly labelled records are added to the training set, to enrich it, only if they reach the highest accuracy.

Post-processing is the final phase. It works record by record, on every part of a record regardless of that part's complexity. PHI instances are replaced by representative values for each PHI type. Post-processing corrects the classifier's mistakes and adjusts the labels so that the right PHI values are obtained. This phase is essential because each record contains many PHI types that follow a fixed convention or a clear pattern, such as LOCATION or ID, and these can be caught at post-processing with dictionaries, pattern matching, or regular expressions.

The output of post-processing is a set of records whose private information has been hidden and which are ready to be shared.

Details of the solution's data processing

Careful analysis of the training records showed that a record is usually divided into segments with distinct structures and purposes. The first segment is written in a structured form and serves as the header of the electronic medical record; declarations of ID or DOCTOR are usually found here. The segment describing the case usually contains the fewest PHI types; LOCATION or PATIENT are usually found here. The third segment is semi-structured and varies from record to record. Examining a record only at the word level is therefore not enough: a word plays a different role in each segment, so each part of a record calls for a different approach and a different treatment. In the multilevel approach, a record is examined at three levels: the word level, the entity level, and the record-segment level.

The word level is the key level for PHI identification, so preprocessing must tokenize the text and gather a feature set rich enough to serve as input to training. The multilevel division starts from identification at the word level, which is then aggregated into entities and record segments. At the word level, each word is assigned to B-PHI, I-PHI, or O-PHI: B-PHI labels words that begin an entity, I-PHI labels words inside an entity, and O-PHI labels words that belong to neither class. Word-level labelling uses supervised learning with a multi-class classifier. The entity level is handled during post-processing, where entities are rebuilt from the word-level labels for the purpose of hiding information; hiding is carried out at the entity level. It is this entity level that makes the B-I-O scheme at the word level more effective than other schemes such as I-O or B-I-O-E-S (Begin-Inside-Other-End-Single). With B-I-O, assembling single-word and multi-word entities becomes clearer, as shown by the entity aggregation matrix for the B-I-O scheme.

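Table 7. Entity aggregation matrix for the B-I-O scheme (rows: label of the following word; columns: label of the preceding word)

| Following \ Preceding | B | I | O |
|---|---|---|---|
| B | Case 1: preceding word is B-PHI1, following word is B-PHI1. Case 2: preceding word is B-PHI1, following word is B-PHI2 | Case 3: preceding word is I-PHI1, following word is B-PHI1. Case 4: preceding word is I-PHI1, following word is B-PHI2 | Case 5: preceding word is O-PHI, following word is B-PHI1 |
| I | Case 6: preceding word is B-PHI1, following word is I-PHI1. Case 7: preceding word is B-PHI1, following word is I-PHI2 | Case 8: preceding word is I-PHI1, following word is I-PHI1. Case 9: preceding word is I-PHI1, following word is I-PHI2 | Case 10: preceding word is O-PHI, following word is I-PHI1 |
| O | Case 11: preceding word is B-PHI, following word is O-PHI | Case 12: preceding word is I-PHI, following word is O-PHI | No case that requires determining an entity |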

From the matrix above we obtain the following cases:

- Case 1: the preceding token is immediately identified as a single-token entity.
- Case 2: the preceding token is immediately identified as a single-token entity.
- Case 3: the preceding token is immediately identified as the end of a multiple-token entity.
- Case 4: the preceding token is immediately identified as the end of a multiple-token entity.
- Case 5: the preceding token is O-PHI.
- Case 6: the preceding token is immediately identified as the beginning of a multiple-token entity.
- Case 7: the preceding token is immediately identified as a single-token entity. The following token is an error and can be converted automatically to B-PHI2.
- Case 8: the preceding token is immediately identified as the middle of a multiple-token entity.
- Case 9: the preceding token is immediately identified as the end of a multiple-token entity. The following token is an error and can be converted automatically to B-PHI2.
- Case 10: the preceding token is O-PHI; the following token is an error and can be converted automatically to B-PHI.
- Case 11: the preceding token is immediately identified as a single-token entity.
- Case 12: the preceding token is immediately identified as the end of a multiple-token entity.

Correspondingly, at the entity level we have the following cases:

- Cases 1, 2, 7, and 11 identify the current word as a single-word entity

- Case 6 identifies the current word as the beginning of a multi-word entity

- Case 8 identifies the current word as the middle of a multi-word entity

- Cases 3, 4, 9, and 12 identify the current word as the end of a multi-word entity
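The twelve cases reduce to a short left-to-right pass over the tags. The sketch below is one possible reading of the matrix, with illustrative names: an I- tag that cannot continue an open entity of the same type (cases 7, 9, and 10) is treated as if it were a B- tag, which is the automatic conversion the case list describes.

```python
def assemble_entities(tokens, tags):
    """Group word-level BIO tags into (phi_type, text) entities.

    Tags look like 'B-DOCTOR', 'I-DOCTOR' or 'O-PHI'.
    """
    entities = []    # list of (phi_type, [tokens])
    current = None   # the entity currently being built, if any
    for token, tag in zip(tokens, tags):
        prefix, _, phi_type = tag.partition("-")
        if prefix == "O":
            current = None                    # cases 11 and 12: the open entity ends
        elif prefix == "I" and current is not None and current[0] == phi_type:
            current[1].append(token)          # cases 6 and 8: the entity continues
        else:
            # Cases 1 to 5: a B- tag starts a new entity.
            # Cases 7, 9 and 10: an inconsistent I- tag is repaired to B-.
            current = (phi_type, [token])
            entities.append(current)
    return [(phi_type, " ".join(words)) for phi_type, words in entities]
```

Because B- always opens a new entity, two adjacent entities of the same type stay separate, which an I-O scheme could not guarantee.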

Finally, the segment level determines which segment of the record each word belongs to. The complexity of a record's structure is taken into account in order to choose a suitable approach for the later phase. Here, after reading the structure of all the records, we decided to divide records into two kinds of segment: segments with a very clear structure of the form "header: values", and segments containing semi-structured or unstructured data.

A record is thus split into 3 sections: the structured segments are sections 1 and 3, which contain the header and the conclusion; the semi-structured or unstructured segment is section 2, which contains the case description or the doctor's notes. In the experiments, the structured segments yielded 100% for the PHI types, so they can immediately be put back into the training set, which increases the classifier's predictive ability and reduces its workload.

Identifying which part of a record is structured, semi-structured, or unstructured relies entirely on the user reading and understanding the records, so this is not a mandatory stage of the method, only a way of exploiting the available records more effectively. The semi-supervised process then repeats labelling only for the semi-structured or unstructured parts.

Dividing the data into these two kinds of segment splits the problem into two parts: (1) handling the structured segments and (2) handling the semi-structured or unstructured segments. Both share the same preprocessing.

4.2.1 Labelling the structured record segments

As mentioned above, the words of each segment are handled differently depending on the segment's complexity. For the structured part, which has high accuracy, the steps follow Figure 2:

Figure 2: Labelling process for the structured record segments

The process above consists of 3 sub-processes: (1) supervised learning with CRFs and k-fold cross-validation, (2) PHI identification, and (3) rule-based post-processing.

Supervised learning with CRFs and cross-validation: a CRF model is built and evaluated with k-fold cross-validation, with k = 5, to improve accuracy and avoid over-fitting. PHI labels are predicted with the conditional probability returned by the CRF model.

Rule-based post-processing: rules are built specifically for the structured record segments to carry out 2 tasks: (1) recover PHI tokens that were labelled O-PHI, and (2) revert non-PHI tokens that were labelled as PHI. Two measures assess the effectiveness of this process, RECALL and PRECISION; RECALL is emphasized here in order to catch as many true PHI instances as possible. The rules for the structured part of a record are as follows:

Rule 1: change O-PHI to B-LOCATION and I-LOCATION using the matching pattern [No] [Name] [Blvd/Road/St/ ] [,] [State Name or its Abbreviation], where No is labelled B-LOCATION and the remaining components are labelled I-LOCATION

Rule 2: change O-PHI to B-PATIENT and I-PATIENT using the matching pattern [SUMMARY NAME :] [PatientName_1] [PatientName_2], where PatientName_1 is B-PATIENT and PatientName_2 is I-PATIENT

Rule 3: change O-PHI to B-DATE using the regular expressions d/mm, dd/mm, or dd/m. Similarly, there are the further regular expressions d/mmm, d/mmmm, dd/mmm, and dd/mmmm

Rule 4: change O-PHI to B-DATE and I-DATE using the matching patterns [dd] [mm], [mmm] [dd], [dd] [mmmm], or [mmmm] [dd], where the first part is B-DATE and the rest is I-DATE

Rule 5: change O-PHI to B-HOSPITAL using the matching pattern [presented to]/[transferred to]/[transferred from]/[admitted to]/[discharge to]/[hospitalization at]/[admission to]/[etc] [Name], where Name is B-
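Rules 3 and 5 can be sketched as a pass over tokens still labelled O-PHI. The details are assumptions made for illustration: the date expression `\d{1,2}/\d{1,2}` is slightly broader than the listed forms (it also accepts d/m), the trigger list contains only the phrases named in Rule 5, and the requirement that the hospital name start with a capital letter is an added guard that the rule itself does not state.

```python
import re

DATE_RE = re.compile(r"\d{1,2}/\d{1,2}")   # numeric day/month forms of Rule 3
HOSPITAL_TRIGGERS = {                       # two-word trigger phrases of Rule 5
    ("presented", "to"), ("transferred", "to"), ("transferred", "from"),
    ("admitted", "to"), ("discharge", "to"), ("hospitalization", "at"),
    ("admission", "to"),
}


def apply_rules(tokens, tags):
    """Return a copy of tags with O-PHI tokens relabelled by Rules 3 and 5."""
    new_tags = list(tags)
    for i, token in enumerate(tokens):
        if new_tags[i] != "O-PHI":
            continue                        # only recover tokens the classifier missed
        if DATE_RE.fullmatch(token):
            new_tags[i] = "B-DATE"
        elif (
            i >= 2
            and (tokens[i - 2].lower(), tokens[i - 1].lower()) in HOSPITAL_TRIGGERS
            and token[:1].isupper()         # assumed guard: names are capitalized
        ):
            new_tags[i] = "B-HOSPITAL"
    return new_tags
```

Rules of this kind only ever turn O-PHI into a PHI label, so they can raise recall but may lower precision, which matches the stated emphasis on RECALL.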

Stage 2: masking the data

After PHI identification, every word carries one label: B-PHI, I-PHI, or O-PHI. This stage assembles the words into entities and masks each entity with a replacement value. The choice of the BIO scheme pays off here: the entity aggregation matrix for the B-I-O scheme presented in the previous section shows clearly how data is assembled at the entity level.

Each entity obtained is then replaced by its own PHI type. For example, the phrase "5/06", identified as DATE-PHI, is replaced by the word "DATE", and the phrase "John Henrry", identified as PATIENT-PHI, is replaced by the word "PATIENT" in the record.

This completes the masking of information in stage 2.
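A minimal sketch of this masking step, under the assumption that tags have the form 'B-TYPE', 'I-TYPE', or 'O-PHI' and that the output is rebuilt by joining tokens with single spaces (the original spacing of the record is not preserved). The function name is illustrative.

```python
def mask_record(tokens, tags):
    """Replace every PHI entity with the name of its PHI type."""
    output = []
    open_type = None   # PHI type of the entity currently being skipped over
    for token, tag in zip(tokens, tags):
        prefix, _, phi_type = tag.partition("-")
        if prefix == "O":
            output.append(token)       # non-PHI words are copied unchanged
            open_type = None
        elif prefix == "I" and open_type == phi_type:
            continue                   # inside an entity that was already replaced
        else:
            output.append(phi_type)    # first word of an entity: emit its type once
            open_type = phi_type
    return " ".join(output)
```

With the example from the text, the tokens of "John Henrry" collapse into the single word "PATIENT", and "5/06" becomes "DATE".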

EXPERIMENTAL RESULTS

The experimental data set comes from the 2006 i2b2 challenge. It comprises 668 labelled records used as the training set and 220 labelled records used as the test set. Although i2b2 has released the 2014 data, the thesis still uses the 2006 data set, for 3 reasons:

- The 2014 data is larger and more complex than the 2006 data, but both were prepared and normalized by Partners HealthCare (http://www.partners.org/), so they have much in common

- The thesis aims to compare its results not only with the top-performing entries in the 2006 challenge, such as [30] and [31], but also with later works that used the i2b2 2006 data, such as [37] and [38], for a more objective assessment of the results

- The 2014 data had only just been released, so few related works had been published, which makes comparison difficult

The labelled records go through k-fold cross-validation: they are divided completely at random into 5 groups, and each fold uses 4 groups as the training set and the remaining group as the test set. After the k-fold procedure completes in each iteration, RECALL is computed by aggregating the results of the 5 test files and is used as the stopping condition for the semi-supervised process.
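The fold construction and the aggregated recall can be sketched as follows. The function names are illustrative, and pooling the true-positive and false-negative counts of the 5 test groups before dividing is one plausible reading of "aggregating the results"; the thesis does not spell out the exact aggregation.

```python
import random


def k_fold_splits(records, k=5, seed=0):
    """Yield (train, test) pairs after shuffling records into k random groups."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    groups = [shuffled[i::k] for i in range(k)]   # k groups of near-equal size
    for i in range(k):
        test = groups[i]
        train = [r for j, group in enumerate(groups) if j != i for r in group]
        yield train, test


def pooled_recall(fold_counts):
    """Recall over all folds, from per-fold (true_positive, false_negative) counts."""
    tp = sum(t for t, _ in fold_counts)
    fn = sum(f for _, f in fold_counts)
    return tp / (tp + fn) if tp + fn else 0.0
```

In the semi-supervised loop, a round whose pooled recall is lower than that of the previous round would trigger the stopping condition described above.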

Records in the test set are divided into 4 parts: part 1, from the start of the record up to the phrase "HISTORY OF PRESENT ILLNESS" written in capitals, and part 3, containing the 3 sentences beginning with "TR:", "DD " and "TD ", are the structured segments; the rest of the record consists of semi-structured or unstructured segments. The program is written in C# on .NET 4.5 and uses the existing Stanford natural language processing tools and the CRFSharp toolkit. The experiments ran on a server with an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz and 96 GB RAM running MS Windows.

The program's results are compared with those of [30], [31], [37], and [38]. [30] and [31] are the two entries in the 2006 i2b2 challenge that achieved the best results; [37] is a later work that used the i2b2 2006 data and achieved very high values for many PHI types; [38] takes a similar approach to the problem, also hybrid and semi-supervised, but does not break records down into multiple processing levels.

Table 8. PRECISION values for each PHI type

Table 9. RECALL values for each PHI type

Table 10. F-measure values for each PHI type

In the result tables above, the approach of this thesis is named MLHSLA (multilevel hybrid semi-supervised learning approach). Values in bold are the best results; underlined values are the second best.

For PRECISION, the method achieves the best results for AGE and PHONE and the second best for LOCATION. This is understandable: the method targets RECALL, so PRECISION is not its strength; each iteration only increases RECALL, and the semi-supervised process ends only when RECALL drops. The RECALL table (Table 9) shows promising results for the PHI types AGE, DOCTOR, PATIENT, and PHONE, with DATE and LOCATION second best. In the F-measure table (Table 10), compared with [38], which also uses a hybrid semi-supervised approach, this thesis does better for almost every PHI type and is worse only for ID. Since [38] does not divide records into multiple levels, the multilevel division is indeed an effective approach.

Publication derived from this thesis:

- P. D. Nguyen, C. T. N. Vo, and B. T. Ho, "A hybrid semi-supervised learning approach to identifying protected health information in electronic medical records," in Proc. of the 10th ACM IMCOM, 2016, pp. 82:1-82:8

KẾT LUẬN VÀ ĐỀ XUẤT

The approach taken in this thesis has proven effective when its results are compared with those of the best-performing related works, especially in terms of RECALL and F-measure. The thesis therefore offers a new approach that is highly practical to deploy:

- Semi-supervised learning does not need a large amount of initial training data.

- The learned results are added to the training set, while the accumulation of errors over time is still kept in check.

- Post-processing relies on the natural language structures that make up a record, so it depends little on the language in use. The processing of English electronic medical records can therefore be adapted to other languages.

Limitation: Splitting records at the segment level is not yet optimal. A selected segment contains, at some rate, mislabeled sentences; these errors accumulate over time and reduce the effectiveness of the system.

Recommendation: Split records further, down to the sentence level, to give each iteration of the semi-supervised process finer granularity. At the sentence level, the context of each PHI type can still be preserved, and sentences can be selected more accurately for the next iteration. The granularity is clearly much finer than at the segment level, since one record segment can contain from 3 to nearly 300 sentences. Future work will therefore add a sentence level to the current multilevel mechanism.
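The sentence-level recommendation can be illustrated with a minimal sketch. It is not the thesis implementation: the splitter is a naive regular expression (clinical abbreviations such as "Dr." would need special handling), and the per-sentence confidence scores and the 0.9 threshold are hypothetical inputs chosen only for illustration.

```python
import re


def split_sentences(segment):
    """Naively split a record segment after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", segment)
    return [p.strip() for p in parts if p.strip()]


def select_confident_sentences(sentences, scores, threshold=0.9):
    """Keep sentences whose (hypothetical) confidence score reaches the threshold."""
    return [s for s, score in zip(sentences, scores) if score >= threshold]
```

With this granularity, one doubtful sentence no longer forces the whole segment to be kept or dropped in an iteration.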

[1] J Aberdeen, S Bayer, R Yeniterzi, B Wellner, C Clark, D Hanauer, B Malin, L Hirschman, “The MITRE identification Scrubber toolkit: design, training, and assessment,” International Journal of Medical Informatics, vol 79, pp 849-859, 2010

[2] M Adnan, J Warren, M Orr, “Iterative refinement of SemLink to enhance patient readability of discharge summaries,” In: Health Informatics: Digital Health Service Delivery - The Future is Now! H Grain and L.K Schaper (Eds.), 2013, pp 128-134

[3] R Bjurstrøm, J Singh, “De-identification of Norwegian health record notes: an experimental approach,” Master Thesis in Computer Science, Norwegian University of Science and Technology, 2013

[4] A Boonstra, M Broekhuis, “Barriers to the acceptance of electronic medical records by physicians from systematic review to taxonomy and interventions,” BMC Health Services Research, vol 10, no 231, pp 1-17, 2010

[5] W W Chapman, P M Nadkarni, L Hirschman, L W D’Avolio, G K Savova, O Uzuner, “Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions,” J Am Med Inform Assoc, vol 18, no 5, Sept 2011

[6] H Dalianis, S Velupillai, “De-identifying Swedish clinical text – refinement of a gold standard and experiments with Conditional Random Fields,” Journal of Biomedical Semantics, vol 1, no 6, pp 1-10, 2010

[7] A Dehghan, A Kovacevic, G Karystianis, J A Keane, G Nenadic, “Combining knowledge- and data-driven methods for de-identification of clinical narratives,” J Biomed Inform, 2015 http://dx.doi.org/10.1016/j.jbi.2015.06.029

[8] O Ferrández, B R South, S Shen, F J Friedlin, M H Samore, S M Meystre, “BoB, a best-of-breed automated text de-identification system for VHA clinical documents,” J Am Med Inform Assoc., vol 20, pp 77-83, 2013

[9] J Gardner, L Xiong, “HIDE: an integrated system for health information DE-identification,” In: Proc The 2008 21st IEEE International Symp On Computer-based Medical Systems, 2008, pp 254-259

[10] C Grouin, A Névéol, “De-identification of clinical notes in French: towards a protocol for reference corpus development,” Journal of Biomedical Informatics, vol 50, pp 151-161, 2014

[11] C Grouin, A Rosier, O Dameron, P Zweigenbaum, “Testing tactics to localize de-identification,” Stud Health Technol Inform, vol 150, pp 735-739, 2009

[12] C Grouin, P Zweigenbaum, “Automatic de-identification of French clinical records: comparison of rule-based and machine learning approaches,” MEDINFO 2013, pp 476-480, 2013

[13] D Gupta, M Saul, J Gilbertson, “Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research,” Am J Clin Pathol, vol 121, pp 176-186, 2004

[14] D Hanauer, J Aberdeen, S Bayer, B Wellner, C Clark, K Zheng, L Hirschman, “Bootstrapping a de-identification system for narrative patient records: Cost-performance tradeoffs,” International Journal of Medical Informatics, vol 82, pp 821-831, 2013

[15] A Henriksson, M Conway, M Duneld, W W Chapman, "Identifying synonymy between SNOMED clinical terms of varying length using distributional analysis of electronic health records," In: AMIA Annu Symp Proc., 2013, pp 600-609

[16] F Hu, Z Shao, T Ruan, “Self-supervised synonym extraction from the Web,” Journal of Information Science and Engineering, vol 31, pp 1133-1148, 2015

[17] J Jaćimović, C Krstev, D Jelovac, “A rule-based system for automatic de-identification of medical narrative texts,” Informatica, vol 39, pp 45-53, 2015

[18] M-Y Kim, Y Xu, O Zaiane, R Goebel, “Patient information extraction in noisy tele-health texts,” In: Proc of the IEEE International Conference on Bioinformatics and Biomedicine, 2013, pp 326-329

[19] M-Y Kim, Y Xu, O R Zaiane, R Goebel, “Recognition of patient-related named entities in noisy tele-health texts,” ACM Transactions on Intelligent Systems and Technology, vol 6, no 4, pp 59:1-59:23, 2015

[20] M Li, D Carrell, J Aberdeen, L Hirschman, B A Malin, “De-identification of clinical narratives through writing complexity measures,” International Journal of Medical Informatics, vol 83, pp 750-767, 2014

[21] Z Liu, Y Chen, B Tang, X Wang, Q Chen, H Li, J Wang, Q Deng, S Zhu, “Automatic de-identification of electronic medical records using token-level and character-level conditional random fields,” Journal of Biomedical Informatics, 2015 http://dx.doi.org/10.1016/j.jbi.2015.06.009

[22] Y Liu, T Ge, K S Mathews, H Ji, D L McGuinness, “Exploiting task-oriented resources to learn word embeddings for clinical abbreviation expansion,” In: Proc the 2015 Workshop on Biomedical Natural Language Processing (BioNLP 2015), 2015, pp 92-97

[23] S M Meystre, F J Friedlin, B R South, S Shen, M H Samore, “Automatic de-identification of textual documents in the electronic health record: a review of recent research,” BMC Medical Research Methodology, vol 10, no 70, pp 1-16, 2010

[24] S M Meystre, G K Savova, K C Kipper-Schuler, J F Hurdle, “Extracting information from textual documents in the electronic health record: a review of recent research,” IMIA Yearbook of Medical Informatics 2008 Methods Inf Med 2008, vol 47, S1, pp 128-144, 2008

[25] I Neamatullah, M M Douglass, L H Lehman, A Reisner, M Villarroel, W J Long, P Szolovits, G B Moody, R G Mark, G D Clifford, “Automated de-identification of free-text medical records,” BMC Medical Informatics and Decision Making, vol 8, no 32, 2008

[26] E Scheurwegs, K Luyckx, F Van der Schueren, T Van den Bulcke, “De-identification of clinical free text in Dutch with limited training data: a case study,” In: Proc the Workshop on NLP for Medicine and Biology, 2013, pp 18-23

[27] S-Y Shin, Y R Park, Y Shin, H J Choi, J Park, Y Lyu, M-S Lee, C-M Choi, W-S Kim, J H Lee, “A de-identification method for bilingual clinical texts of various note types,” J Korean Med Sci, vol 30, pp 7-15, 2015

[28] A Stubbs, “Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1,” J Biomed Inform, 2015 http://dx.doi.org/10.1016/j.jbi.2015.06.007

[29] L Sweeney, “Replacing personally-identifying information in medical records, the Scrub system,” In: AMIA Annu Fall Symp, 1996, pp 333-337

[30] G Szarvas, R Farkas, R Busa-Fekete, “State-of-the-art anonymization of medical records using an iterative machine learning framework,” J Am Med Inform Assoc., vol 14, issue 5, pp 574-580, 2007

[31] B Wellner, M Huyck, S Mardis, J Aberdeen, A Morgan, L Peshkin, A Yeh, J Hitzeman, L Hirschman, “Rapidly retargetable approaches to de-identification in medical records,” J Am Med Inform Assoc, vol 14, pp 564-573, 2007

[32] Fredric Brown Information Extraction: 10-707 and 11-748 (slide)

[33] John Lafferty and Andrew McCallum, “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” pp 1-8

[34] H Yang, J M Garibaldi, “Automatic detection of protected health information from clinic narratives,” J Biomed Inform, 2015 http://dx.doi.org/10.1016/j.jbi.2015.06.015

[35] A Gkoulalas-Divanis, G Loukides, J Sun, “Toward smarter healthcare: anonymizing medical data to support research studies,” IBM J Res & Dev., vol

[37] G Zuccon, D Kotzur, A Nguyen, and A Bergheim, "De-identification of health records using Anonym: effectiveness and robustness across datasets," Artificial Intelligence in Medicine, vol 61, issue 3, pp 145-151, July 2014.

[38] P D Nguyen, C T N Vo, and B T Ho, “A hybrid semi-supervised learning approach to identifying protected health information in electronic medical records,” in Proc of the 10th ACM IMCOM, 2016, pp 82:1-82:8

[39] J Han, M Kamber (2001), Data Mining: Concepts and Techniques,

CURRICULUM VITAE

Full name: Nguyễn Đông Phương
Date of birth: 05/06/1987
Place of birth: Khánh Hòa
Contact address: 321 CC 234 Phan Văn Trị P11 Quận Bình Thạnh Hồ Chí Minh
Email: phuongndfree@gmail.com

EDUCATION

PERIOD | INSTITUTION | MAJOR
2005 - 2010 | Trường Đại Học Bách Khoa |
2012 - 2016 | Trường Đại Học Bách Khoa - ĐHQG Tp HCM |

EMPLOYMENT

PERIOD | EMPLOYER | POSITION
2013 to present | Trường Đại Học Tôn Đức Thắng | Lecturer

Association for Computing Machinery (ACM) Sungkyunkwan University (SKKU), Korea

ACM IMCOM 2016, January 4-6, Danang, Vietnam

Conference Program

Session 12: Information Retrieval and Management Room: SALON VI

Session Chairs: Shahrulniza Musa, Sangwook Kim

A Buffer Cache Algorithm Using the Characteristic of Mobile Applications Based on Hybrid Memory System

Chansoo Oh (Hanwha Techwin, Korea), Dong Hyun Kang (Sungkyunkwan University, Korea), Minho Lee (Sungkyunkwan University, Korea), Young Ik Eom (Sungkyunkwan University, Korea)

A Syllable-based Method for Vietnamese Text Compression

Vu Nguyen (Ton Duc Thang University, Vietnam), Hien Nguyen (Ton Duc Thang University, Vietnam), Hieu Duong (Ho Chi Minh City University of Technology, Vietnam), Vaclav Snasel (VSB-Technical University of Ostrava, Czech Republic)

Candidate Searching and Key Coreference Resolution for Wikification

Minh Pham (John von Neumann Institute, Vietnam), Tru Cao (Ho Chi Minh City University of Technology, Vietnam), Huy Huynh (Ton Duc Thang University, Vietnam)

Escalating Memory Accesses to Shared Memory by Profiling Reuse

Yohan Ko (Sungkyunkwan University, Korea), Hyunjun Kim (Sungkyunkwan University, Korea), Hwansoo Han (Sungkyunkwan University, Korea)

Session 11: Machine Learning Room: SALON IV + V

Session Chairs: Jong-Seok Lee, Kangwoo Lee

Standard Based Personal Mobile Health Record System

Yeong-Tae Song (Towson University, USA), Tao Qiu (Towson University, USA)

The Hybrid Approaches for Forecasting Real Time Multi-step-ahead Boiler Efficiency

Hieu Duong Ngoc (Ho Chi Minh City University of Technology, Vietnam), Vu Nguyen (Ton Duc Thang University, Vietnam), Tam M Nguyen (Petro Vietnam Fertilizer and Chemical Corporation, Vietnam), Hien Nguyen (Ton Duc Thang University, Vietnam), Vaclav Snasel (VSB-Technical University of Ostrava, Czech Republic)

A Hybrid Semi-supervised Learning Approach to Identifying Protected Health Information in Electronic Medical Records

Phuong Nguyen (Ton Duc Thang University, Vietnam), Chau Vo (Ho Chi Minh City University of Technology, Vietnam), Bao Ho Tu (Japan Advanced Institute of Science and Technology, Japan)

11-4 A Framework of Information Technology Supported Intelligent Learning Environment

Toyohide Watanabe (Nagoya Industrial Science Research Institute, Japan)

Conference Program http://www.IMCOM.org


A Hybrid Semi-supervised Learning Approach to Identifying Protected Health Information in Electronic Medical Records

INTRODUCTION

Nowadays, a very large number of electronic medical records are prepared and used worldwide for health care and medical research. Making such data sets available to researchers outside their originating institutions is a significant concern because of the need to protect patients' private information, which is called protected health information (PHI). Indeed, [2] identifies the lack of access to shared data as one of the barriers to natural language processing development in the clinical domain, and this lack stems from the absence of reliable and inexpensive de-identification techniques. This has drawn the attention of many works since about 1995, along with the two shared tasks of i2b2 (Informatics for Integrating Biology and the Bedside, http://www.i2b2.org) in 2006 and 2014. [15] summarizes 18 de-identification works from 1995-2010, and [20] summarizes the 10 works with the highest results in the 2014 i2b2 shared task. Although these works produced positive research outcomes for de-identification, [20] marked this problem as unsolved. That is why we have witnessed a large number of related works such as [1, 3, 5-10, 12-14, 16-18, 21-25], with a diversity of de-identification systems for many clinical document types in various languages.

Among the aforementioned works, [22, 23] achieved the highest results in the 2006 i2b2 shared task on de-identification of discharge summaries, while [5, 14, 24] achieved the highest results in the 2014 i2b2 shared task on de-identification of longitudinal clinical narratives. Other works developed a wide range of de-identification systems. The Scrub system in [21] is considered one of the first de-identification systems for clinical text. The De-Id system is a commercial de-identification system introduced in [9]. Both Scrub and De-Id are rule-based systems. [7] proposed the HIDE system, based on an integrated approach that uses a conditional random fields (CRF)-based technique for unstructured data and a k-anonymization-based technique for structured data. In [16], a de-identification system used a rule-based pattern matching approach with look-up tables (dictionaries), regular expressions, and heuristics to de-identify free-text medical records. The MIST system was developed in [1] using a machine learning-based approach with CRF models for 4 types of patient records. Based on the MIST system in [1], [10] obtained a bootstrapping MIST system to consider the annotation of clinical records for the de-identification task, and [13] enabled the MIST system to handle the writing complexity in clinical narratives. Notably, in a stepwise hybrid approach, [6] built a best-of-breed system called BoB by combining a rule-based method and a machine learning-based method with a false-positive filtering component in a supervised learning mechanism, to improve both recall and precision. In addition to the works at the system level, we are also aware of works for specific languages: [8] for French records, [3] for Swedish records, [25] for English records from the Australian perspective, [17] for Dutch records, [18] for bilingual English and Korean records, and [12] for Serbian records.

In the aforementioned works, a hybrid approach is quite popular, except that the early works on de-identification systems like [9, 21] and the initial works on non-English medical records like [8, 12, 18] were based on a rule-based method with regular expressions, dictionary look-up, and heuristic rules. This method is simple and quick to develop as a starting point. Nevertheless, it depends on the data set from which the extraction patterns are derived, and it captures little context for ambiguous instances that may belong to different PHI types. In contrast, a machine learning method is more complicated because it requires a good set of features and an annotated training data set large enough for building the classifiers. Hence, it is sometimes hard to generalize to other data sources. These reasons motivate a hybrid approach, which appears to be a recent trend in most of the works, like [5, 14, 24], that took part in the 2014 i2b2 shared task.

© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

IMCOM '16, January 04-06, 2016, Danang, Viet Nam. © 2016 ACM. ISBN 978-1-4503-4142-4/16/01…$15.00. DOI: http://dx.doi.org/10.1145/2857546.2857630

In this paper, our work is dedicated to identifying PHI instances in free-text medical records. In particular, we concentrate on a hybrid semi-supervised learning approach so that the resulting solution can be enhanced over time and adapted to PHI identification in medical records written in languages other than English. A hybrid approach implies that our work takes advantage of both machine learning and rule-based methods to identify PHI instances. A semi-supervised learning approach means that our work enhances the training data set of the resulting PHI classifier over time with new medical records that have been predicted with high confidence. In that manner, we believe that the resulting PHI classifier can become more accurate with a larger training data set, although it starts from a small annotated training data set. Such a property makes our approach more practical and flexible in reality as compared to the existing approaches. As an assumption of our current work, 8 main PHI types are considered: AGE (ages over 89), DATE, DOCTOR, HOSPITAL, IDENTIFIER (ID), LOCATION, PATIENT, and PHONE. Nevertheless, our approach is not limited to the listed PHI types, as it is trivial to generalize this work to identify instances of more PHI types.

A HYBRID SEMI-SUPERVISED LEARNING APPROACH FOR

The de-identification task can be seen as a two-phase process: (1) extracting protected health information from free-text medical records; and (2) removing the identified PHI instances, or consistently replacing them with other values that cannot be used to recognize the patients of the de-identified medical records. In this work, our contribution addresses the first phase of the de-identification task. Therefore, we consider the protected health information extraction sub-process as a labeling problem that assigns a label of the corresponding PHI type to the information extracted from the text. To tackle this problem, a machine learning-based method, a rule-based method, or a hybrid method can be defined, as discussed in [15].
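Although this paper addresses only the first phase, the second phase can be sketched briefly to show what "consistently replacing" means in practice. The example below is an illustration under assumptions, not part of the proposed approach: PHI instances are given as non-overlapping character spans, and the surrogate format `[TYPE_n]` is hypothetical.

```python
from collections import defaultdict


def replace_phi(text, spans):
    """Replace PHI spans with surrogates; the same surface form always maps to the same surrogate.

    spans: non-overlapping (start, end, phi_type) character offsets.
    """
    surrogates = {}
    counters = defaultdict(int)
    pieces = []
    cursor = 0
    for start, end, phi_type in sorted(spans):
        key = (phi_type, text[start:end].lower())
        if key not in surrogates:
            counters[phi_type] += 1
            surrogates[key] = f"[{phi_type}_{counters[phi_type]}]"
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(surrogates[key])     # surrogate in place of the PHI instance
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Because the mapping is keyed on the PHI type and the lowercased surface form, repeated mentions of one patient remain linked after replacement.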

Different from the existing works on PHI identification, our work proposes a hybrid semi-supervised learning approach, delineated in Figure 1. The input of our approach consists of a set L of labeled records and a set U of unlabeled records that need to be labeled with the appropriate PHI types. The expected output is U, now a set of labeled records, together with a PHI identifier. The PHI identifier can be used over time either in a traditional manner, as a classifier along with a rule-based post-processing module, or in an iterative semi-supervised learning manner, as a classifier with a rule-based post-processing module in each iteration.

As shown in Figure 1, our approach includes six phases as follows.

(1) Supervised learning with CRFs and k-fold cross validation: In this phase, we build a CRF-based PHI identifier using the set L of labeled records as a training data set. The IO (inside-outside) scheme is used along with the labels of the PHI types of interest. The resulting PHI identifier is evaluated in a k-fold cross validation scheme. The first time a CRF-based PHI identifier is built, if its performance is good enough for PHI identification, the next phase is conducted. Otherwise, we need to reconsider the feature set and/or the training data set. In later iterations of the semi-supervised learning mechanism, if the performance of the PHI identifier has improved, the next phase is executed. Otherwise, the PHI identifier from the previous iteration, which is the best one so far, is returned for future use. The details of building a CRF-based PHI identifier in this phase are provided in subsection 2.1.

(2) PHI identifying: In this phase, we use the PHI identifier to predict a PHI label for each token in the set U of unlabeled records. The output is the set U, now a set of labeled records called CRF-based labeled records. Each PHI instance is associated with a conditional probability returned by the CRFs.

Figure 1. An overall view of the proposed hybrid semi-supervised learning approach for PHI identification

(3) Rule-based post-processing: In this phase, a rule-based post-processing procedure is carried out to further examine and improve the result of the CRF-based PHI identifier by extracting PHI instances that were missed and filtering out PHI instances that were mislabeled. In our work, we favor recall over precision so that we can extract as many PHI instances from the free-text medical records as possible. The output of this phase is a set of CRF-based and rule-based labeled records. The details of this phase are described in subsection 2.2.

(4) Selecting records with the most confident prediction: In this phase, we select the CRF-based and rule-based labeled records with the most confident prediction to enhance the current training data set of the current CRF-based PHI identifier. The features of the selected records are then reexamined to make them consistent with their predicted PHI types. To perform this phase, we define a confident prediction score that reflects how confidently a record is correctly labeled. This score is based on the conditional probability of each PHI instance in the record and on the performance of the current CRF-based PHI identifier that was used to label the record. It is detailed in subsection 2.3.

(5) Update unlabeled records: In the previous phase, some records were selected as the most confidently predicted records, and the rest need to be reconsidered for labeling. Thus, in this phase, we update the set of unlabeled records by removing the selected records with the most confident prediction.

(6) Enhance labeled records: In this phase, we enhance the current training data set with the records selected with the highest confidence in phase (4). By doing that, the training data set for a CRF-based PHI identifier gets larger over time. If the selected records are indeed correctly labeled, a new CRF-based PHI identifier built on such an enhanced training data set can become more accurate over time. Nevertheless, we are aware of the accumulation of errors from the mislabeled PHI instances in the selected records over time. This error accumulation is a risk in our approach. To diminish the influence of the “wrongly” selected records, our approach compares the new CRF-based PHI identifier with the previous one. If the performance improves, the new CRF-based PHI identifier is accepted and used for the next iteration. Otherwise, the previous one is retained and used for future prediction. In addition, as mentioned in phase (4), we reexamine the selected records prior to enhancement. This step also aims at smoothing out the effects of errors in the mislabeled records. Furthermore, error accumulation can be reduced over time if the approach is put into practice with human interaction to filter out errors.

The following subsections 2.1, 2.2, and 2.3 elaborate on three main phases of the approach: building a CRF-based identifier, the rule-based post-processing procedure, and selecting records with the most confident prediction, respectively. After that, the characteristics of our approach are highlighted in subsection 2.4.
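The six phases can be summarized as a generic self-training loop. The sketch below is a hedged outline rather than the implementation used in this work: `train`, `evaluate`, `predict`, `post_process`, and `confidence` are hypothetical callables supplied by the caller (for example, CRF training, k-fold evaluation, CRF labeling, the rule-based procedure, and the confident prediction score), and `top_n` and `max_iter` are illustrative parameters.

```python
def self_train(labeled, unlabeled, train, evaluate, predict, post_process,
               confidence, top_n=10, max_iter=10):
    """Iteratively grow the training set; stop when the new identifier is not better."""
    model = train(labeled)                      # phase 1: supervised learning
    best_score = evaluate(model, labeled)
    for _ in range(max_iter):
        if not unlabeled:
            break
        # Phases 2-3: label every unlabeled record, then apply post-processing.
        predicted = [post_process(predict(model, rec)) for rec in unlabeled]
        # Phase 4: pick the records with the most confident prediction.
        ranked = sorted(range(len(predicted)),
                        key=lambda i: confidence(predicted[i]), reverse=True)
        chosen = set(ranked[:top_n])
        # Phase 6 (tentative): enhance the training set and retrain.
        candidate_labeled = labeled + [predicted[i] for i in sorted(chosen)]
        candidate_model = train(candidate_labeled)
        candidate_score = evaluate(candidate_model, candidate_labeled)
        if candidate_score <= best_score:
            break                               # keep the previous, best identifier
        model, best_score, labeled = candidate_model, candidate_score, candidate_labeled
        # Phase 5: remove the selected records from the unlabeled set.
        unlabeled = [rec for i, rec in enumerate(unlabeled) if i not in chosen]
    return model, labeled, unlabeled
```

Rejecting a candidate identifier that does not improve on its predecessor is the guard against error accumulation described in phase (6).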

2.1 Supervised Learning with CRFs and K-fold Cross Validation

In order to build a CRF-based PHI identifier, we employ a supervised learning mechanism with CRFs and k-fold cross validation. In the current work, our PHI identifier is a token-level CRF-based PHI identifier. First, we prepare a set of features that capture as many aspects of a PHI instance as possible, to distinguish it from PHI instances of the other PHI types. Second, we perform an automatic token-level feature extraction process to form an input training data set. Third, we use an available CRF toolkit to obtain a trained model. Moreover, we suggest applying the k-fold cross validation scheme in this phase so that the capability of the resulting CRF-based PHI identifier is estimated more reliably and over-fitting is avoided. In our approach, the CRF-based PHI identifier from the first iteration is required to reach at least 90% precision and recall on average, and the identifiers in later iterations are required to have higher average precision and average recall than the identifier from the previous iteration.
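The fold construction and the acceptance rule stated above can be sketched as follows. This is a minimal illustration, assuming contiguous folds and averaged precision and recall already computed per identifier; the function names are hypothetical and the CRF training itself is omitted.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n records."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def accept_identifier(avg_precision, avg_recall, previous=None, first_threshold=0.90):
    """First iteration: both averages must reach the threshold.
    Later iterations: both averages must exceed those of the previous identifier."""
    if previous is None:
        return avg_precision >= first_threshold and avg_recall >= first_threshold
    prev_precision, prev_recall = previous
    return avg_precision > prev_precision and avg_recall > prev_recall
```

For instance, `accept_identifier(0.93, 0.90, previous=(0.92, 0.91))` rejects the new identifier because its recall dropped, even though its precision improved.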

In the following, we introduce the set of token-level features defined for our CRF-based PHI identifier. These features are based on the ones used in [5, 14, 24].

- POS feature: a tag returned by a Part-of-Speech (POS) tagger

- Combinations of tokens and their POS tags: $\{w_0 p_{-2} p_{-1},\ w_0 p_{-1},\ p_{-1} w_0 p_1,\ w_0 p_1,\ w_0 p_1 p_2\}$, where $w_0$ denotes the current token and $p_{-2}$, $p_{-1}$, $p_1$, $p_2$ denote the POS tags of the two previous and two next tokens in the 5-token window, respectively

- Affix features: all prefixes and suffixes of the widely used lengths from 1 to 5

- Orthographic features: form information about the token, indicating whether the first letter is capitalized, whether all letters are uppercase, whether there is at least one uppercase letter inside the token, whether the token contains at least one letter, whether it contains at least one digit, and whether it contains a punctuation mark

- Word shape features: the shape of the token, using “#” for a digit, “A” for an uppercase letter, “a” for a lowercase letter, and “-” for a punctuation mark in the token. Both full and short shapes are included in the feature list

- Regular expression features: indicating whether the token matches a regular expression or is part of a token sequence that matches a regular expression. Regular expressions were defined for ages, dates, and phone numbers
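Several of the features listed above can be computed per token as in the sketch below. It is an approximation for illustration only: the regular expressions are hypothetical stand-ins for the ones defined in this work, any character that is neither a letter nor a digit is treated as punctuation in the shape, and the POS-based features are omitted because they require an external tagger.

```python
import re
import string

# Hypothetical single-token patterns; the actual expressions are not given in the text.
REGEXES = {
    "age": re.compile(r"^\d{1,3}(?:yo|y/o)?$", re.IGNORECASE),
    "date": re.compile(r"^\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?$"),
    "phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
}


def word_shape(token, short=False):
    """Map digits to '#', uppercase to 'A', lowercase to 'a', anything else to '-'.
    The short shape collapses runs of the same symbol."""
    shape = []
    for ch in token:
        if ch.isdigit():
            sym = "#"
        elif ch.isalpha():
            sym = "A" if ch.isupper() else "a"
        else:
            sym = "-"
        if short and shape and shape[-1] == sym:
            continue
        shape.append(sym)
    return "".join(shape)


def token_features(token):
    """Affix, orthographic, word shape, and regular expression features of one token."""
    feats = {"token": token}
    for n in range(1, 6):
        if len(token) >= n:
            feats[f"prefix{n}"] = token[:n]
            feats[f"suffix{n}"] = token[-n:]
    feats["init_cap"] = token[:1].isupper()
    feats["all_caps"] = token.isupper()  # all cased characters are uppercase
    feats["inner_cap"] = any(c.isupper() for c in token[1:])
    feats["has_letter"] = any(c.isalpha() for c in token)
    feats["has_digit"] = any(c.isdigit() for c in token)
    feats["has_punct"] = any(c in string.punctuation for c in token)
    feats["shape"] = word_shape(token)
    feats["short_shape"] = word_shape(token, short=True)
    for name, pattern in REGEXES.items():
        feats[f"re_{name}"] = bool(pattern.match(token))
    return feats
```

The resulting dictionary per token is the kind of input that common CRF toolkits accept after conversion to their own feature format.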

EXPERIMENTAL RESULTS

In order to further evaluate our proposed approach, we perform the PHI identification phase of the de-identification task on the data set from the i2b2 de-identification shared task in 2006. Although the 2014 i2b2 de-identification track on identifying PHI in longitudinal clinical narratives has taken place, the corpus has not yet been distributed worldwide. As introduced in [22], it will be available at [11] in November 2015. Therefore, in this work, we used 668 records in the training data set and 220 records in the test data set from the 2006 i2b2 data set. For PHI instance representation in both training and test data sets, the IO (inside-outside) scheme was chosen after trial-and-error tests with other schemes such as BIOES (begin-inside-outside-end-single) and BIO (begin-inside-outside). Our proposed approach is implemented in .NET (C#), making use of the existing Stanford natural language processing tools at [19] and the CRFSharp toolkit at [3]. The experiments were performed on a server machine with an Intel(R) Xeon(R) CPU E5-2620 0 @2.00GHz and 96 GB RAM, running MS Windows.
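The difference between the three representation schemes can be made concrete with a short sketch. The tag format `PREFIX-TYPE` and the token-index span input are assumptions for illustration; the exact label strings used in the experiments are not specified here.

```python
def encode_tags(n_tokens, spans, scheme="IO"):
    """Encode PHI instances as token tags.

    spans: (start_token, end_token_exclusive, phi_type) triples, non-overlapping.
    """
    tags = ["O"] * n_tokens
    for start, end, phi_type in spans:
        for i in range(start, end):
            if scheme == "IO":
                prefix = "I"
            elif scheme == "BIO":
                prefix = "B" if i == start else "I"
            elif scheme == "BIOES":
                if end - start == 1:
                    prefix = "S"
                elif i == start:
                    prefix = "B"
                elif i == end - 1:
                    prefix = "E"
                else:
                    prefix = "I"
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            tags[i] = f"{prefix}-{phi_type}"
    return tags
```

IO yields the fewest distinct labels per PHI type, at the price of not separating two adjacent instances of the same type.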

Also, we use Precision, Recall, and F-measure to check how effective the approaches of the existing works and ours are for each PHI type. For comparison, we examine the works typical for the 2006 i2b2 data set, [22] and [23], which participated in the 2006 i2b2 shared task. These two works produced the highest results on the 2006 i2b2 data set used in our experiments. Their results are taken from the corresponding papers. In addition to the works in [22, 23], we are aware of the work in [25], which also conducted experiments on the same 2006 i2b2 data set. Unfortunately, it gave no detailed report of the result for each PHI type. Therefore, we omit [25] from the comparison below. For our proposed approach, we examine the solutions at different phases: (1) CRF, at the first phase of the approach, which uses only the resulting CRF model for PHI identification in the test data set; (2) CRF_PP, at the second phase without iteration, which uses the resulting CRF model and the post-processing procedure for PHI identification in the test data set; and (3) Semi_CRF_PP_Final, at the second phase in an iterative manner, which uses the resulting CRF model and the post-processing procedure for PHI identification in the test data set with a semi-supervised learning mechanism. The overall performance of each solution on the 2006 i2b2 test data set, in comparison with the existing works [22, 23] in the 2006 i2b2 shared task, is given in Table 1.
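For reference, the three measures per PHI type can be computed as in the following sketch. It assumes exact-match, instance-level scoring over `(record_id, start, end, phi_type)` tuples, which may differ from the scoring granularity of the shared task.

```python
from collections import Counter


def prf_by_type(gold, pred):
    """Return {phi_type: (precision, recall, f_measure)} for exact-match instances.

    gold, pred: sets of (record_id, start, end, phi_type) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for item in pred:
        if item in gold:
            tp[item[-1]] += 1   # predicted and correct
        else:
            fp[item[-1]] += 1   # predicted but not in the gold standard
    for item in gold:
        if item not in pred:
            fn[item[-1]] += 1   # gold instance that was missed
    result = {}
    for phi_type in set(tp) | set(fp) | set(fn):
        predicted = tp[phi_type] + fp[phi_type]
        actual = tp[phi_type] + fn[phi_type]
        p = tp[phi_type] / predicted if predicted else 0.0
        r = tp[phi_type] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        result[phi_type] = (p, r, f)
    return result
```

A recall-oriented system accepts extra false positives (lower precision) in exchange for fewer missed instances.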

Table 1. The overall performance from our approach in comparison with the existing works in the 2006 i2b2 shared task

PHI types AGE DATE DOCTOR HOSPITAL ID LOCATION PATIENT PHONE

For readability, the best recall values are presented in bold and the second best recall values are underlined, while the best precision values are in bold and italics. Although most of the results of [22, 23] in Table 1 are higher than ours, our approach can extract more LOCATION, PATIENT, and PHONE instances.

Unfortunately, our semi-supervised learning mechanism cannot recognize AGE instances in an iterative manner. This points to the strong impact of how the new medical records with the most confident prediction are selected. Perhaps, at this moment, our confident prediction scores do not truly reflect how reliably the unlabeled records have been predicted. Nevertheless, our approach executed in the traditional manner can extract all PHI instances and label them correctly, like the approach in [22].

Generally speaking, our work attains PHI identification results on the 2006 i2b2 data set that are comparable to the best results in [22, 23].

Table 2. Comparison between our solutions on average

In addition, we further compare the various solutions from our approach in Table 2, where Semi_CRF_PP_2 is the solution obtained right after the CRF_PP solution and Semi_CRF_PP_3 is the solution obtained right after Semi_CRF_PP_2. As the performance of the Semi_CRF_PP_3 solution is lower than that of the Semi_CRF_PP_2 solution, the semi-supervised learning process stops at Semi_CRF_PP_3 and returns the Semi_CRF_PP_2 solution as the final solution of our approach, named Semi_CRF_PP_Final, as previously mentioned. As displayed in Table 2, our hybrid semi-supervised learning approach improves both recall and precision over the traditional hybrid supervised learning approach, as the results of Semi_CRF_PP_Final are higher than those of CRF_PP. This leads to an improvement in the F-measure of Semi_CRF_PP_Final as compared to that of CRF_PP. In comparison with the results of CRF, the recall values of CRF_PP, Semi_CRF_PP_2, and Semi_CRF_PP_Final are all larger than that of CRF. This implies that the rule-based post-processing procedure works well for extracting more PHI instances. In contrast, the precision value of CRF is better than those of CRF_PP, Semi_CRF_PP_2, and Semi_CRF_PP_Final, indicating that the rule-based post-processing procedure is not well suited to filtering out misclassified PHI instances. The reason might be that the current rules for filtering out those PHI instances are not informative enough to capture the characteristics of those PHI instances and their surrounding contexts. This calls for an improvement of our work in the future.
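The recall-oriented half of the post-processing procedure, which adds instances the CRF missed, can be illustrated with a small sketch. The two patterns below are hypothetical examples and not the rules used in this work; the filtering half of the procedure is not shown.

```python
import re

# Hypothetical rules for illustration; the actual rule set is not reproduced here.
RULES = {
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"),
}


def add_missed_instances(text, crf_spans):
    """Add rule matches that do not overlap any existing span.

    crf_spans: (start, end, phi_type) character spans from the CRF-based identifier.
    """
    spans = list(crf_spans)
    for phi_type, pattern in RULES.items():
        for match in pattern.finditer(text):
            overlaps = any(match.start() < end and start < match.end()
                           for start, end, _ in spans)
            if not overlaps:
                spans.append((match.start(), match.end(), phi_type))
    return sorted(spans)
```

Because the rules only add spans, they can raise recall but may also introduce false positives, which is consistent with the lower precision of the post-processed solutions reported above.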

RELATED WORKS

In this section, an overall review of the related works is presented in comparison with ours.

First of all, we look at the works [5, 14, 22, 23, 24] that took part in the i2b2 challenge tasks in 2006 and 2014 with the highest results. Due to the unavailability of the 2014 i2b2 data set, we used the 2006 i2b2 data set and thus compared our results with the best results of [22, 23], which participated in the 2006 shared task. The results show that our approach is comparable to [22, 23]. However, our approach is more practical thanks to its semi-supervised learning mechanism. This mechanism can be seen as a generalization of the other most recent works [5, 14, 24], where the machine learning-based method and the rule-based post-processing phase were performed once. In contrast, our hybrid approach works in an incremental and iterative manner so that our PHI identification solution can be tuned over time. As a trade-off, the cost of our approach is higher than that of the approaches in these related works.

Secondly, we discuss the differences between ours and the works in [1, 6, 7, 9, 10, 16, 21], focusing on de-identification systems.

As one of the first de-identification systems, Scrub in [21] was developed with a set of detection algorithms for pattern matching based on orthographic rules, templates with likelihood values, and lists of commonly known information such as first names and last names. Later, De-Id, a commercial de-identification system, was built in [9]. De-Id also followed a pattern matching method based on rules and dictionaries and made use of the Unified Medical Language System (UMLS). In [16], a de-identification system, called the MIT system by [15], used a rule-based pattern matching approach with look-up tables (dictionaries of known PHI instances such as patient names, doctor names, etc.), regular expressions, and heuristics to de-identify free-text medical records. Note that [9, 16, 21] all utilized a rule-based method. Switching to a machine learning-based approach, [7] proposed the HIDE system, which uses CRF-based NER to obtain a CRF-based classifier that identifies and extracts terms from textual pathology reports. In contrast to the single approaches in [7, 9, 16, 21], ours is a hybrid approach that can make the most of both rule-based and machine learning-based methods for PHI identification. More recently, the MIST system was developed in [1] using the machine learning-based approach with CRF models for 4 types of patient records. The system includes a web-based graphical annotation tool, a training module, a tagging module, a redaction and resynthesis module, and an experiment engine. It also enables users to conduct an iterative PHI locating and redacting procedure. Based on the MIST system in [1], [10] obtained a bootstrapping MIST system to consider the annotation of clinical records for the de-identification task, and [13] enabled the MIST system to handle the writing complexity in clinical narratives. Different from the MIST system and its extended versions, our approach has a rule-based post-processing procedure for enhancing our CRF-based PHI identifier in a semi-supervised learning mechanism. As a best-of-breed de-identification system, BoB, built in [6], combined a rule-based method and a machine learning-based method in a stepwise hybrid approach, along with a false-positive filtering component, in a supervised learning mechanism with Support Vector Machine (SVM) models to improve both recall and precision. Although also a hybrid approach, our work differs from [6] in that PHI identification in our work is performed iteratively, whereas there is no iterative identification in [6]. Furthermore, the identification process in [6] is based on a supervised learning mechanism, whereas ours follows a semi-supervised learning mechanism that requires fewer labeled records for building an effective PHI classifier over time.
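To make the rule-based ingredients mentioned above more concrete (look-up tables of known PHI instances plus regular expressions), the following is a small illustrative sketch. It is not the implementation of any of the cited systems or of our post-processing procedure: the name list `KNOWN_NAMES`, the two patterns in `PATTERNS`, and the functions `find_phi` and `redact` are all hypothetical and chosen only for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical look-up table of known names (lowercased for matching).
KNOWN_NAMES = {"smith", "johnson"}

# Hypothetical regular expressions for two PHI types.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_phi(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans found by the patterns and the dictionary."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    # Dictionary look-up on alphabetic tokens.
    for match in re.finditer(r"[A-Za-z]+", text):
        if match.group().lower() in KNOWN_NAMES:
            spans.append((match.start(), match.end(), "NAME"))
    return sorted(spans)


def redact(text: str) -> str:
    """Replace every detected span with a bracketed PHI-type tag."""
    pieces, last = [], 0
    for start, end, label in find_phi(text):
        if start < last:
            continue  # skip spans overlapping one already redacted
        pieces.append(text[last:start])
        pieces.append(f"[{label}]")
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


sample = "Dr. Smith saw the patient on 03/14/2005; call 555-123-4567."
redacted = redact(sample)
```

A purely rule-based matcher like this one finds only what its dictionaries and patterns anticipate, which is one motivation for combining such rules with a learned CRF-based identifier.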

Thirdly, we identify the contributions of our work as compared to the works [3, 8, 17, 18, 25] developed for de-identifying medical records written in languages other than English. In [8], the authors introduced a rule-based system named MEDINA and a CRF-based system for the de-identification of French records. The MEDINA system was constructed based on the characteristics of the De-Id system in [9]. [3] aimed at refining a manually annotated gold standard, the Stockholm EPR PHI Corpus in Swedish, and employed the CRF algorithm in its automatic de-identification system. [25] introduced an approach to automatically de-identifying electronic health records by combining a CRF-based classifier with pattern matching techniques, using regular expressions for feature extraction in addition to lexical and linguistic features. Their work was evaluated on the 2006 i2b2 data set and an Australian data set. For the de-identification of Dutch medical records, [17] is based on a supervised learning approach with Random Forests, one-against-one SVMs, and one-against-all SVMs using four types of features: direct target word characteristics, pattern matching features, dictionary features, and contextual word features.

Different from [17, 25], our work defines a richer feature set for the CRF-based identifier and adds a rule-based post-processing procedure in a semi-supervised learning mechanism for this PHI identification task. [12] is also a rule-based de-identification system, in this case for Serbian records. This system is an adaptation of an existing rule-based named entity recognition system that handles amount expressions, time expressions, personal names, geopolitical names, and urban names. In addition, finite-state transducers with local grammars were used for modeling various triggers and named entity contexts. Unlike the aforementioned works, [18] processed bilingual clinical records in English and Korean, applying a rule-based method with 15 regular expressions. Compared to the works in this group, which followed either a machine learning-based method or a rule-based method alone, our work can identify PHI instances incrementally and iteratively by means of both a CRF-based PHI identifier and a rule-based post-processing procedure.

In summary, our work provides a hybrid semi-supervised learning approach that combines a CRF-based method and a rule-based method in an incremental and iterative manner, which is effective and practical for the protected health information identification task on electronic medical records.

CONCLUSIONS

Identifying protected health information is one of the most important tasks for enabling electronic medical records to be shared and processed for further research and development in the medical, biomedical, and other related fields.

Therefore, our work introduced a novel hybrid semi-supervised learning approach to this problem by taking advantage of the machine learning-based approach and the rule-based approach in an iterative self-training manner. The resulting PHI classifier is capable of identifying the instances of 8 PHI types in the 2006 i2b2 data set as effectively as the existing works on the same data set. Unlike them, however, our PHI classifier is enhanced with the new medical records that have the most confident predictions in order to obtain a new PHI classifier with higher accuracy in the future.
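As a rough illustration of enhancing the classifier with the most confidently predicted records, the sketch below ranks unlabeled records by a confidence score and keeps the top k for the next training round. The scoring rule used here (the probability of the least certain token in a record) and the names `record_confidence` and `select_confident_records` are assumptions made for this example only; the confidence score actually used in our approach may be defined differently.

```python
from typing import Dict, List


def record_confidence(token_probs: List[float]) -> float:
    """Assumed scoring rule: a record is as confident as its least certain token."""
    return min(token_probs) if token_probs else 0.0


def select_confident_records(
    predictions: Dict[str, List[float]],  # record id -> per-token probabilities
    k: int,
) -> List[str]:
    """Return the ids of the k records with the highest confidence score."""
    ranked = sorted(
        predictions,
        key=lambda record_id: record_confidence(predictions[record_id]),
        reverse=True,
    )
    return ranked[:k]


# Invented probabilities for three hypothetical unlabeled records.
toy_predictions = {
    "r1": [0.99, 0.97],
    "r2": [0.99, 0.60],
    "r3": [0.95, 0.96],
}
chosen = select_confident_records(toy_predictions, k=2)
```

Here the record with one very uncertain token is left out, even though its other token is predicted with high probability; an average-based score would rank the records differently, so the choice of scoring rule matters.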

In addition, our approach can be easily adapted to identifying PHI in electronic medical records written in languages other than English by replacing the natural language processing tools, dictionaries, and regular expressions appropriately.

As future work, more experiments on the 2014 i2b2 data set will be conducted to confirm the effectiveness of our approach. Besides, we plan to apply the proposed approach to the de-identification of real Vietnamese free-text medical records. Above all, we will investigate more advanced techniques to overcome the challenges of the de-identification problem, such as the high ambiguity among instances of different PHI types (e.g., hospital names, patient names, and doctor names), the imbalance between the different PHI types and the group of non-PHI instances, and the representation learning of PHI-based features. In particular, we will reconsider the confident prediction scores in order to choose the truly best records for enhancing the set of labeled records in the training data set over time.

ACKNOWLEDGMENTS

This work is funded by Vietnam National University at Ho Chi Minh City under grant number B2015-42-02. In addition, we would like to thank the John von Neumann Institute, Vietnam National University at Ho Chi Minh City, for providing us with a powerful server machine to carry out the experiments.

Title	Lọc thông tin riêng trong bệnh án điện tử (Filtering private information in electronic medical records)
Author	Nguyễn Đông Phương
Supervisor	Dr. Võ Thị Ngọc Châu
University	Trường Đại học Bách Khoa
Major	Computer Science
Type	Master's thesis
Year of publication	2016
City	Ho Chi Minh City
