Developing optimization methods for problems related to protein sequences


FORM 14/KHCN

(Issued with Decision No. 3839/QĐ-ĐHQGHN dated 24 October 2014

of the President of Vietnam National University, Hanoi)

VIETNAM NATIONAL UNIVERSITY, HANOI

FINAL REPORT ON THE IMPLEMENTATION RESULTS OF A SCIENCE AND TECHNOLOGY PROJECT

AT THE VIETNAM NATIONAL UNIVERSITY LEVEL

Project title: Developing optimization methods for problems related to protein sequences.

Project code: QG.15.21

Principal investigator: Đặng Thanh Hải

Hanoi, 26/12/2017


PART I GENERAL INFORMATION

1.1 Project title: Developing optimization methods for problems related to protein sequences.

1.2 Project code: QG.15.21

1.3 Principal investigator and project members

No.; Title, academic degree and full name; Affiliation; Role in the project

Computing, Faculty of Information Technology, University of Engineering and Technology, VNU Hanoi

Principal investigator; research on the project's content

VNU Hanoi

Research on the project's content

Computing, Faculty of Information Technology, University of Engineering and Technology, VNU Hanoi

Research on the project's content

Computational Engineering, Faculty of Information Technology,

1.4 Host unit: Faculty of Information Technology, University of Engineering and Technology, VNU Hanoi.

1.5 Implementation period:

1.5.1 According to the contract: from February 2015 to February 2017.

1.5.2 Extension (if any): until December 2017.

1.5.3 Actual implementation: from February 2015 to December 2017.

1.6 Changes from the original proposal (if any):

(Regarding objectives, content, methods, research results and organization of implementation; reasons; opinion of the managing agency)

1.7 Total approved budget of the project: 250 million VND.

PART II OVERVIEW OF RESEARCH RESULTS

Written in the structure of a scientific review article of 6-15 pages (this report will be published in the VNU Journal of Science after the project has been accepted), with the following parts:

1 Introduction

Proteins are complex organic molecules, in both structure and function, built from a linear chain of amino acids of 20 different types. Known as the functional machinery of the cell, proteins carry out most of the physiological functions encoded in the cell's genome, such as transporting oxygen throughout a multicellular organism, transmitting signals between cells, or catalysing the hundreds of chemical reactions that cells need to live (Alberts, 2007).

Accurately predicting the physiological function of proteins is the key to understanding life at the molecular level, and it therefore has an enormous impact on biomedicine and pharmacology.


However, determining protein function experimentally is usually difficult and expensive, so it cannot keep pace with the enormous volume of sequence data produced by next-generation sequencing technologies (Liolios et al., 2009). Research on computational methods for predicting protein function has therefore become a key direction in molecular and computational biology. This is reflected in two facts: (1) as much as 98% of the annotations in the GO (Gene Ontology) database are computational predictions, whereas only 0.6% have been verified experimentally (du Plessis et al., 2011); (2) the number of computationally predicted protein function annotations in UniProtKB/TrEMBL keeps growing exponentially, on the order of tens of millions of annotations, whereas the number of verified annotations in UniProtKB/Swiss-Prot grows only linearly and almost negligibly.

Accurate prediction of protein function is a hard and challenging problem, because a protein's function is determined not only by its amino acid sequence but also by its interactions with certain other proteins and by more than 200 types of post-translational modification (PTM), which occur very frequently in the cell.

Post-translational modifications occur frequently in the cell (Khoury et al., 2011): biochemists estimate that about one third of human proteins are phosphorylated, phosphorylation being an important PTM and the most studied one (Alberts, 2007). The amino acid positions that undergo PTM can be determined by biochemical experiments. Over time, a certain number of PTM sites have been identified (Sugiyama et al., 2008; Boersema et al., 2010). However, because these experiments are usually time-consuming, difficult and expensive, the number of PTM sites discovered so far remains very limited. Many PTMs are known to be major causes of a large number of diseases (Manning et al., 2002), including cancer (Seeler et al., 2007), Alzheimer's disease (Huen and Chen, 2008) and Huntington's disease (Steffan et al., 2004). These practical difficulties, together with the importance of PTMs, have created both an opportunity and a challenge for the bioinformatics (computational biology) community: to develop computational models that can accurately predict the amino acid positions that undergo PTM (Suo et al., 2014). Many advanced computational methods have been developed to predict protein phosphorylation, and they have brought considerable progress, but they still have many limitations that need to be overcome (Suo et al., 2014). New prediction methods therefore still need to be developed to predict phosphorylation sites more accurately. This need is even more pressing for the other types of PTM.

Some properties of protein interaction networks, such as allosteric interactions (point interactions that change structure) and interaction hotspots, have been applied in drug design strategies (Arkin and Wells, 2004; Chen et al., 2013). The relevance of protein interaction networks as primary therapeutic targets for developing new treatments is quite clear for cancer, where several clinical trials have been carried out. The consensus on these potential targets is reflected in the fact that many drugs are already on the market for treating a wide range of diseases. Examples include Tirofiban, a glycoprotein IIb/IIIa inhibitor used as a cardiovascular drug, and Maraviroc, an inhibitor of the CCR5-gp120 interaction used as an anti-HIV drug (Ivanov et al., 2013).

Proteins and protein interaction networks can also interact with chemical compounds/drugs (Ivanov et al., 2013; Arkin and Wells, 2004). Chemical compounds/drugs, like proteins, interact with diseases (causing or inhibiting them) in extremely complex ways. To date, human understanding of these interactions remains very limited and far from what actually happens (Duran-Frigola et al., 2015). Protein function can therefore be understood better once we can capture the interactions between chemical compounds/drugs and diseases. These interactions are usually described in a huge body of scientific articles in biology, medicine and pharmacology, which are published, stored, indexed and managed by the PubMed (MEDLINE) system. As of 8 February 2015, PubMed indexed more than 24.6 million scientific articles published since 1966, and about 500,000 new articles are added each year. Of these, 13.1 million articles have an abstract and 14.2 million have a link to the full text (of which 3.8 million are freely available to any reader). In 2011, Plake and Schroeder concluded from their study that biomedical text mining is an essential tool for supporting and accelerating the discovery of interactions between chemical compounds/drugs and diseases (Plake and Schroeder, 2011).

Although many computational methods/models for predicting protein function have been developed, they still need optimization methods to select parameter values, features, or even a suitable training data set in order to predict accurately (Radivojac et al., 2013). By surveying, evaluating and comparing existing protein function prediction methods, Radivojac et al. (2013) showed that protein function prediction is likely to remain a major, attractive and fast-growing research area, and that better-optimized computational models need to be developed.

2 Objectives

This project therefore focuses on studying optimization methods and advanced data mining and machine learning techniques for solving problems related to protein sequence analysis.

3 Research methods

We surveyed and studied in detail the best existing methods/solutions related to the project's content, and thereby evaluated and compared the strengths and weaknesses of each. Hypotheses were formulated and tested by theoretical proof or by experiment. Finally, we selected and improved an existing solution, or developed a new one, and implemented it to fulfil the stated content.

Related methods previously developed by the project members were reused and combined with a detailed study and comparison of the best existing related methods, so that new solutions could be developed for the stated objectives. We concentrated on state-of-the-art data mining and machine learning techniques, such as Ant Colony Optimization (ACO), probabilistic models and deep learning, in order to solve problems related to protein sequences.

In addition, we studied the physicochemical properties of amino acids and methods for representing these properties in problems related to protein sequences. Information about protein structure (most of it predicted by computational models, uncertain and not yet determined experimentally) and its importance for protein-related problems was also studied, so that it could be integrated into methods for solving those problems.

Finally, we surveyed data sources and knowledge related to proteins (including chemical compounds/drugs and diseases), so that advanced data mining techniques (for example frequent itemset / association rule mining) could be applied to integrate them into optimization methods for problems related to protein sequences.
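
To make the frequent itemset / association rule idea concrete, the following is a minimal Python sketch of the two basic quantities involved, support and confidence. The toy transactions and the function names `support` and `confidence` are illustrative assumptions and are not data or code from the project.

```python
from typing import FrozenSet, List

def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(A and C together) / support(A); 0.0 if A never occurs."""
    s_a = support(antecedent, transactions)
    if s_a == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / s_a

# Hypothetical toy data: each transaction is a set of entities observed together.
TRANSACTIONS = [
    frozenset({"drugA", "protein1", "diseaseX"}),
    frozenset({"drugA", "protein1"}),
    frozenset({"drugA", "diseaseX"}),
    frozenset({"drugB", "protein2"}),
]
```

A rule such as {drugA} -> {diseaseX} would be kept only if both its support and its confidence exceed chosen thresholds; the thresholds themselves are a tuning choice that the text above does not specify.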

4 Summary of research results

We studied and developed models based on optimization methods and advanced data mining and machine learning techniques to address important problems that affect (directly or indirectly) protein sequence analysis, including protein-protein interactions, post-translational modifications, reconstruction of evolutionary founder sequences, and interactions between chemical compounds/drugs and diseases.


4.1 Models related to protein-protein interactions

We built a web server that predicts interactions between protein kinases (catalytic enzymes) and their substrates, and that also reports the specific sites of these interactions. The web server is freely available at http://fit.uet.vnu.edu.vn:8286/subin/web and lets users (biochemistry researchers) quickly and easily predict whether a kinase-substrate pair interacts and, if so, at which sites. It is built on the conditional random field (CRF) probabilistic graphical model combined with a compression-based association rule mining algorithm (Vreeken et al., 2011). Among all known PTM-based protein-protein interactions, kinase-substrate interactions are among the two types that occur most frequently in the cell and are the most studied (Suo et al., 2014). Many interactions of this type are known to be major causes of a large number of diseases (Manning et al., 2002), including cancer (Seeler et al., 2007), Alzheimer's disease (Huen and Chen, 2008) and Huntington's disease (Steffan et al., 2004).

Previously, protein functions were often inferred from evolutionary relationships, with sequence similarity between proteins as the usual criterion (Remm et al., 2001). However, this approach is often not good enough to identify protein functions (Park et al., 2011). Advances in biotechnology over more than a decade have made it possible to build protein interaction networks for many species. These networks can also be supplemented (or even created) by many advanced computational models (for example the model deployed as the web server described above). Analysing and comparing such interaction networks provides much useful information for predicting unknown functions, or validating known functions, of protein sequences (Dutkowski and Tiuryn, 2007; Memisevic and Przulj, 2012). The underlying network alignment problem has been proven NP-hard (Aladag and Erten, 2013).

We proposed a new algorithm called FASTan for global alignment of PPI networks (Đỗ Đức Đông et al., 2015). The algorithm has two phases. The first phase builds an initial global alignment with a heuristic based on the correlation between the topological structure and the sequence similarity of the nodes. The second phase introduces the Rebuild procedure (the strength of the algorithm), which keeps the well-aligned parts from the first phase, discards the poorly aligned ones, and uses what is kept to rebuild the whole alignment. FASTan was later improved further by using ant colony optimization (ACO), with FASTan's Rebuild procedure serving as a local search procedure (Đỗ Xuân Quyền et al., 2016).
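
The two-phase idea can be illustrated with a small seed-and-rebuild sketch in Python. This is not the FASTan algorithm itself: the scoring function (a weighted sum of conserved neighbours and sequence similarity), the weight `alpha`, and the criterion for keeping a pair (it must conserve at least one edge) are all assumptions made only for illustration.

```python
from typing import Dict, Set, Tuple

Graph = Dict[str, Set[str]]          # undirected graph as adjacency sets
Sim = Dict[Tuple[str, str], float]   # sequence similarity of node pairs

def conserved(u: str, v: str, g1: Graph, g2: Graph, aln: Dict[str, str]) -> int:
    """Number of neighbours of u that are already aligned to a neighbour of v."""
    return sum(1 for x in g1[u] if x in aln and aln[x] in g2[v])

def greedy_extend(g1: Graph, g2: Graph, sim: Sim,
                  aln: Dict[str, str], alpha: float = 0.5) -> Dict[str, str]:
    """Phase-1 style heuristic: repeatedly add the best-scoring unaligned pair."""
    aln = dict(aln)
    used = set(aln.values())
    while True:
        best, best_score = None, float("-inf")
        for u in sorted(g1):
            if u in aln:
                continue
            for v in sorted(g2):
                if v in used:
                    continue
                score = (alpha * conserved(u, v, g1, g2, aln)
                         + (1 - alpha) * sim.get((u, v), 0.0))
                if score > best_score:
                    best, best_score = (u, v), score
        if best is None:          # nothing left to align
            return aln
        aln[best[0]] = best[1]
        used.add(best[1])

def rebuild(g1: Graph, g2: Graph, sim: Sim,
            aln: Dict[str, str], alpha: float = 0.5) -> Dict[str, str]:
    """Phase-2 style step: keep pairs that conserve an edge, then re-extend."""
    kept = {u: v for u, v in aln.items() if conserved(u, v, g1, g2, aln) > 0}
    return greedy_extend(g1, g2, sim, kept, alpha)
```

In an ACO variant such as the one mentioned above, a step like `rebuild` would play the role of local search applied to each ant's solution; how pheromone is laid out over node pairs is not described in the text and is left out of this sketch.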

4.2 Model related to post-translational modification

We developed SKIPHOS, a model for predicting phosphorylation sites (phosphorylation being one of the most important PTMs for cellular activity). It is based on random forests and uses features computed from the physicochemical properties of the protein sequence together with continuous representations of amino acids learned with deep learning techniques. SKIPHOS, with a simple interface, is freely available online at http://fit.uet.vnu.edu.vn/SKIPHOS. Building computational models that can predict phosphorylation sites accurately and efficiently is becoming an urgent and challenging problem (Trost and Kusalik, 2011).
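
As a rough illustration of sequence-derived features, the sketch below extracts a fixed-size window around each candidate residue (S, T or Y) and encodes it with the Kyte-Doolittle hydropathy scale. The window size, the padding symbol and the choice of this single scale are assumptions made for the example; the actual SKIPHOS feature set, including its learned continuous amino acid representations, is not reproduced here.

```python
from typing import List

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def candidate_sites(seq: str) -> List[int]:
    """0-based positions of serine, threonine and tyrosine residues."""
    return [i for i, aa in enumerate(seq) if aa in "STY"]

def window(seq: str, pos: int, flank: int = 3, pad: str = "X") -> str:
    """Residues from pos-flank to pos+flank, padded with `pad` at the ends."""
    padded = pad * flank + seq + pad * flank
    return padded[pos:pos + 2 * flank + 1]

def hydropathy_features(win: str) -> List[float]:
    """One hydropathy value per window position (0.0 for padding/unknown)."""
    return [KD.get(aa, 0.0) for aa in win]
```

Feature vectors of this kind, one per candidate site with a label saying whether the site is known to be phosphorylated, are the sort of input a random forest classifier could be trained on.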

4.3 Model for founder sequence reconstruction

Reconstructing the founder (ancestral) sequences (genes or proteins) of a given population is an important problem in biology. It involves finding a set of founder sequences that can be recombined to produce the given sequences of the individuals in the population. Founder sequence reconstruction can be modelled as a combinatorial optimization problem: find a set of (ancestral) sequences such that the given individuals can be generated from them with the smallest number of recombination events. The problem was proposed by Ukkonen and has been proven NP-hard when the number of founder sequences is greater than 2 (Ukkonen, 2002).
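
The objective can be made concrete with a short Python sketch that evaluates a candidate founder set: for each given sequence it computes, by dynamic programming, the fewest breakpoints (switches between founders) needed to spell that sequence out of the founders. The function names and the toy binary sequences are illustrative assumptions; searching for the best founder set, which is the hard part of the problem, is not shown.

```python
from typing import List, Optional

def min_breakpoints(recombinant: str, founders: List[str]) -> Optional[int]:
    """Fewest founder switches needed to copy `recombinant` from `founders`.

    Returns None if some position matches no founder at all.
    All sequences are assumed to have the same length.
    """
    inf = float("inf")
    # cost[i] = fewest breakpoints so far if the current site is copied from founder i
    cost = [0 if f[0] == recombinant[0] else inf for f in founders]
    for j in range(1, len(recombinant)):
        best_prev = min(cost)
        cost = [
            min(cost[i], best_prev + 1) if founders[i][j] == recombinant[j] else inf
            for i in range(len(founders))
        ]
    best = min(cost)
    return None if best == inf else int(best)

def total_breakpoints(recombinants: List[str], founders: List[str]) -> Optional[int]:
    """Objective value of a founder set: sum of breakpoints over all sequences."""
    total = 0
    for r in recombinants:
        b = min_breakpoints(r, founders)
        if b is None:
            return None
        total += b
    return total
```

A search method such as ACO would propose founder sets and use a value like `total_breakpoints` to compare them, with lower totals being better.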

We proposed ACOFSRP, a founder sequence reconstruction method based on ant colony optimization (ACO) with several important improvements (Anh Thị Vũ Ngọc et al., 2018). These improvements include a strategy in which the ants search for solutions simultaneously, neighbourhood search, and search in both the forward and backward directions.

4.4 Model related to chemicals/drugs and diseases

Among the relationships between biomedical entities, those between chemical compounds/drugs and diseases, as well as those between diseases and genes/proteins, are receiving increasing attention from the biomedical text mining community. A survey of PubMed users' search behaviour shows that disease names, chemical compounds and drugs, and names of disease-causing genes/proteins are three of the most frequently searched topics (Dogan et al., 2009). These three entity types are central to many important research problems, such as drug development and the detection of adverse drug reactions. Discovering interactions between drugs and diseases is essential for understanding the nature of diseases and for discovering important functions of chemical compounds/drugs and genes/proteins (Yu et al., 2015).

In daily life, people are exposed to a large number of chemicals, including drugs and environmental toxins. The therapeutic effects and side effects of these chemicals result from extremely complex molecular-level interactions with the human body. To date, human understanding of these interactions remains very limited and far from what actually happens (Duran-Frigola et al., 2015).

Extracting chemical/drug-disease relations from biomedical text is still a difficult and challenging problem (Leaman et al., 2015) and a hot research topic worldwide (Choi et al., 2016). It consists of two steps: (i) the first step recognizes and normalizes chemical/drug and disease entities; (ii) the second step detects and extracts drug side-effect relations between the entities recognized and normalized in the first step.
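
The two steps can be sketched as a toy pipeline in Python. Everything here is a deliberately simplified assumption: the lexicon, the identifiers `CHEM:0001` and `DIS:0001`, and the example text are hypothetical, and step (ii) is reduced to generating candidate pairs from sentence-level co-occurrence, where a real system would apply a trained classifier to each candidate.

```python
import re
from typing import List, Set, Tuple

# Hypothetical lexicon: surface form -> (entity type, normalized concept id).
LEXICON = {
    "aspirin": ("Chemical", "CHEM:0001"),
    "acetylsalicylic acid": ("Chemical", "CHEM:0001"),
    "gastric ulcer": ("Disease", "DIS:0001"),
}

def recognize_and_normalize(sentence: str) -> List[Tuple[int, str, str]]:
    """Step (i): dictionary lookup returning (offset, type, concept id)."""
    lowered = sentence.lower()
    found = []
    for name, (etype, cid) in LEXICON.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
            found.append((m.start(), etype, cid))
    return sorted(found)

def candidate_relations(text: str) -> Set[Tuple[str, str]]:
    """Step (ii), simplified: chemical-disease pairs co-occurring in a sentence."""
    pairs: Set[Tuple[str, str]] = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        entities = recognize_and_normalize(sentence)
        chemicals = {cid for _, etype, cid in entities if etype == "Chemical"}
        diseases = {cid for _, etype, cid in entities if etype == "Disease"}
        pairs |= {(c, d) for c in chemicals for d in diseases}
    return pairs
```

Because both surface forms of the chemical map to the same concept id, normalization lets mentions in different sentences contribute to the same chemical-disease pair.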

We successfully built the UET-CAM system, which automatically extracts chemical-disease relations from biomedical text (Lê Hoàng Quỳnh et al., 2015, 2016). UET-CAM uses multi-pass sieve coreference resolution, combined with an SVM-based model that predicts relations occurring within a single sentence. Usually, named entity recognition (NER) and named entity normalization (NEN) for chemical/drug and disease entities are built as two independent stages of a pipeline. This leads to serious limitations: errors in the NER stage propagate to the NEN stage, and there is no feedback from NEN to NER (Liu et al., 2011). UET-CAM currently addresses this limitation with a joint-decoding model for NEN and NER, although a better, if not the best, solution would be a joint inference/learning model.

5 Evaluation of the results and conclusion

The models and methods we developed were validated through rigorous and thorough experiments. We also compared them with state-of-the-art models and methods of the same kind available at the time. The comparative experimental results demonstrate the effectiveness of the models and methods we built.

The performance of the kinase-substrate interaction prediction web server was compared with one of the best existing methods, that of Song et al. (2017). Our system provides predictions for 56 protein kinases/kinase groups, whereas that of Song et al. covers only 12. Our system predicts better than that of Song et al. for interactions of the PKA kinase group (our AUC is 96%, versus 93% for Song et al.). For the remaining protein kinases our system performs worse. This can be explained by the fact that our system relies only on protein sequence information, whereas Song et al. integrate a great deal of important additional information, including protein structure, Gene Ontology, the Kyoto Encyclopedia of Genes and Genomes (KEGG), other types of protein-protein interaction, and functional domains on proteins. When only protein sequence information is used, as in our system, the system of Song et al. predicts better than ours for only 2 kinases (of the 12 they provide), namely GSK-3 and the MAPK kinase group; for the remaining kinases our system predicts better. In the near future we will upgrade the current system by integrating additional information, as done by Song et al. (2017).
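
For readers unfamiliar with the AUC figures quoted above, the following Python sketch computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the example scores are illustrative assumptions and are unrelated to the reported experiments.

```python
from typing import Sequence

def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; sorting-based implementations are preferable for large data sets.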

Our FASTan model for global alignment of two protein interaction networks was compared with SPINAL, the best comparable model at the time of our research and experiments (Aladag and Erten, 2013). The comparison was carried out on 4 benchmark data sets used by the SPINAL authors: the protein-protein interaction networks of 4 species, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans and Homo sapiens. The experimental results show that FASTan outperforms SPINAL on both standard, widely used evaluation criteria (Chindelevitch et al., 2013): the global network alignment score (GNAS) and edge correctness (EC). Moreover, FASTan aligns faster than SPINAL. The upgraded version of FASTan, which uses ant colony optimization (ACO) with FASTan's Rebuild procedure as local search, was also evaluated on these 4 benchmark data sets and proved superior to the original FASTan.
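
Of the two criteria, edge correctness is simple enough to state in a few lines. The Python sketch below uses the common definition, the fraction of edges of the first network that are mapped onto edges of the second network by the alignment; the function name and the toy graphs are assumptions for illustration, and GNAS is not sketched because its formula is not given in this report.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]

def edge_correctness(edges1: Iterable[Edge], edges2: Iterable[Edge],
                     aln: Dict[str, str]) -> float:
    """Fraction of edges in network 1 whose aligned endpoints form an edge in network 2."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    conserved = 0
    for e in e1:
        # both endpoints must be aligned, and their images must be adjacent in network 2
        if all(x in aln for x in e) and frozenset(aln[x] for x in e) in e2:
            conserved += 1
    return conserved / len(e1)
```

Edges are stored as unordered pairs, so `("a", "b")` and `("b", "a")` count as the same undirected edge.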

Our SKIPHOS model (Đặng Thanh Hải et al., 2018, Bioinformatics, submitted) for predicting phosphorylation sites on protein sequences was compared thoroughly with the 4 best recent methods of the same kind, on a benchmark data set as well as on the additional data sets used by those models. The compared methods are RF-Phos (Ismail et al., 2016), PhosphoSVM (Dou et al., 2014), PHOSFER (Trost and Kusalik, 2013, Bioinformatics) and iPhos-PseEn (Qiu et al., 2016, Oncotarget). The experimental results show that SKIPHOS predicts phosphorylation sites on protein sequences better than these 4 methods.

The ACOFSRP model, which reconstructs founder sequences using ant colony optimization (ACO) with several important improvements, was evaluated experimentally and compared with the best comparable method at the time of the study, LNS-1c (Roli and Blum, 2012). The experiments were run on 108 test instances taken from 3 benchmark data sets used by the authors of LNS-1c. The results demonstrate the effectiveness of ACOFSRP: compared with LNS-1c, it reconstructs better founder sequences on 45 test instances, is equivalent on 44, and is worse on only 19.

The UET-CAM system took part in the BioCreative V challenge and was ranked 4th by the BioCreative V committee for automatic extraction of chemical/drug-disease relations, among all participating research groups from Australia, Europe, Asia and North America (Wei et al., 2015). With this result, UET-CAM was selected for publication in the proceedings of the BioCreative V workshop in Sevilla, Spain (Lê Hoàng Quỳnh et al., 2015). The BioCreative V committee also recommended that the system be further upgraded and invited it for publication in the journal Database (2015 Impact Factor: 3.35; ranked 5/57 among ISI journals in Mathematical and Computational Biology) (Lê Hoàng Quỳnh et al., 2016).

The models and methods we developed all address important problems directly related to protein sequence analysis. Given their better performance compared with the best existing related methods, and especially because they are provided as easy-to-use software interfaces, they are likely to play a role in helping biochemistry researchers accelerate their related work and thereby gain a better understanding of protein function.


6 Summary of results (in Vietnamese and in English)

Proteins carry out most of the functions (encoded in the genome) in the cell. Accurately predicting protein function is the key to understanding life at the molecular level, and it therefore has an enormous impact on biomedicine and pharmacology. A protein's function is determined not only by its amino acid sequence but also by its 2D, 3D and 4D structure, by its interactions with certain other proteins and chemical compounds, and by more than 200 types of post-translational modification, which occur very frequently in the cell.

We successfully studied and developed models based on optimization methods and advanced data mining and machine learning techniques to address important problems that directly affect protein sequence analysis. They comprise: 01 web server that predicts interactions between protein kinases (catalytic enzymes) and substrates; two versions of a model for global alignment of two protein interaction networks; 01 system for predicting phosphorylation (one of the most important, essential and most studied post-translational modifications); 01 model for reconstructing evolutionary founder sequences; and 01 model for extracting interactions between chemical compounds/drugs and diseases from biomedical text.

The models and methods we developed were validated through rigorous and thorough experiments. We also compared them with state-of-the-art models and methods of the same kind available at the time. The comparative experimental results demonstrate their effectiveness relative to the best existing related models. In addition, all of the models and methods can be extended and upgraded further in the future.

We hope that, by being provided as easy-to-use software interfaces, these models and methods can play a role in helping biochemistry researchers accelerate their related work and thereby gain a better understanding of protein function.

In English

Proteins perform most biological functions (encoded in the genome) in living cells. Accurately predicting protein function is the key to understanding life at the molecular level and thus has a tremendous impact on biomedicine and pharmacy. A protein's function is determined not only by its primary amino acid sequence but also by its 2D, 3D and 4D structure, by its interactions with certain proteins and chemical compounds, and by more than 200 types of post-translational modification (PTM), which occur very often in living cells.

We have successfully studied and developed novel models based on optimization methods and advanced data mining and machine learning techniques to address important issues that directly affect the analysis of protein sequences. They include: 01 web server that predicts position-specific kinase-substrate interactions; 02 versions of a global alignment model for two protein interaction networks; 01 phosphorylation prediction system (phosphorylation being one of the most important, essential and well-studied PTMs); 01 model for the reconstruction of founder sequences; and 01 model for the extraction of interactions between chemical compounds/drugs and diseases from the biomedical literature.

The proposed models and methods have been validated through rigorous and thorough experiments. We have also compared them with state-of-the-art methods and models of the same kind available at the time. The comparative experimental results show the effectiveness of the proposed models and methods relative to the state of the art. In addition, the proposed models and methods can be extended and upgraded further in follow-up work.


We anticipate that the proposed models and methods, delivered as easy-to-use software interfaces, can play a role in helping chemists and biologists accelerate their related research and gain a better understanding of protein function.

7 References

• C.-H. Wei, Y. Peng, R. Leaman, A. P. Davis, C. J. Mattingly, J. Li, T. C. Wiegers, Z. Lu, Overview of the BioCreative V chemical disease relation (CDR) task, in: Proceedings of the fifth BioCreative challenge evaluation workshop, Sevilla, Spain, 2015, pp. 154-166

• B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, The shape and structure of proteins

• K. Liolios, I.-M. A. Chen, K. Mavromatis, N. Tavernarakis, P. Hugenholtz, V. M. Markowitz, N. C. Kyrpides, The genomes on line database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata, Nucleic acids research 38 (suppl_1) (2009) D346-D354

• C. Plake, M. Schroeder, Computational polypharmacology with text mining and ontologies, Current pharmaceutical biotechnology 12 (3) (2011) 449-457

• G. A. Khoury, R. C. Baliban, C. A. Floudas, Proteome-wide post-translational modification statistics: frequency analysis and curation of the swiss-prot database, Scientific reports 1 (2011) 90

• L. du Plessis, N. Škunca, C. Dessimoz, The what, where, how and why of gene ontology - a primer for bioinformaticians, Briefings in bioinformatics 12 (6) (2011) 723-735

• A. Roli, S. Benedettini, T. Stützle, C. Blum, Large neighbourhood search algorithms for the founder sequence reconstruction problem, Computers & operations research 39 (2) (2012) 213-224

• N. Sugiyama, H. Nakagami, K. Mochida, A. Daudi, M. Tomita, K. Shirasu, Y. Ishihama, Large-scale phosphorylation mapping reveals the extent of tyrosine phosphorylation in arabidopsis, Molecular systems biology 4 (1) (2008) 193

• P. J. Boersema, L. Y. Foong, V. M. Ding, S. Lemeer, B. van Breukelen, R. Philp, J. Boekhorst, B. Snel, J. den Hertog, A. B. Choo, et al., In-depth qualitative and quantitative profiling of tyrosine phosphorylation using a combination of phosphopeptide immunoaffinity purification and stable isotope dimethyl labeling, Molecular & Cellular Proteomics 9 (1) (2010) 84-99

• Y. Dou, B. Yao, C. Zhang, Phosphosvm: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine, Amino acids 46 (6) (2014) 1459-1469

• H. D. Ismail, A. Jones, J. H. Kim, R. H. Newman, D. B. Kc, Rf-phos: a novel general phosphorylation site prediction tool based on random forest, BioMed research international 2016

• B. Trost, A. Kusalik, Computational phosphorylation site prediction in plants using random forests and organism-specific instance weights, Bioinformatics 29 (6) (2013) 686-694

• W.-R. Qiu, X. Xiao, Z.-C. Xu, K.-C. Chou, iphos-pseen: identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier, Oncotarget 7 (32) (2016) 51270

• G. Manning, D. B. Whyte, R. Martinez, T. Hunter, S. Sudarsanam, The protein kinase complement of the human genome, Science 298 (5600) (2002) 1912-1934

• J.-S. Seeler, O. Bischof, K. Nacerddine, A. Dejean, Sumo, the three rs and cancer, in: Acute Promyelocytic Leukemia, Springer, 2007, pp. 49-71

• S.-B. Suo, J.-D. Qiu, S.-P. Shi, X. Chen, R.-P. Liang, Psea: Kinase-specific prediction and analysis of human phosphorylation substrates, Scientific reports 4 (2014) 4524

• M. S. Huen, J. Chen, The dna damage response pathways: at the crossroad of protein modifications, Cell research 18 (1) (2008) 8.

• L Chindelevitch, C.-Y Ma, C.-S Liao, B Berger, Optimizing a global alignment of protein interaction networks, Bioinformatics 29 (21) (2013) 2765-2773

• J S Ste an, N Agrawal, J Pallos, E Rockabrand, L c Trotman, N Slepko, K Illes, T Lukacsovich, Y.-Z Zhu, E Cattaneo, et al., Sumo modification of huntingtin and huntington’s disease pathology, Science

• A E Alada'g, c Erten, Spinal: scalable protein interaction network alignment, Bioinformatics 29 (7) (2013) 917-924

• B P Kelley, B Yuan, F Lewitter, R Sharan, B R Stockwell, T Ideker, Pathblast: a tool for alignment of protein interaction networks, Nucleic acids research 32 (suppl_2) (2004) W83-W88

• M Remm, C E Storm, E L Sonnhammer, Automatic clustering of orthologs and in-paralogs from pairwise species comparisons, Journal of molecular biology 314 (5) (2001) 1041-1052

• D Park, R Singh, M Baym, C.-S Liao, B Berger, Isobase: a database of functionally related proteins across ppi networks, Nucleic acids research 39 (suppl_1) (2010) D295-D300

• J Dutkowski, J Tiuryn, Identification of functional modules from conserved ancestral protein-protein interactions, Bioinformatics 23 (13) (2007) i149-i158

• V Memišević, N Pržulj, C-graal: Common-neighbors-based global graph alignment of biological networks, Integrative Biology 4 (7) (2012) 734-743

• B Trost, A Kusalik, Computational prediction of eukaryotic phosphorylation sites, Bioinformatics 27 (21) (2011)


• M R Arkin, J A Wells, Small-molecule inhibitors of protein-protein interactions: progressing towards the dream, Nature reviews Drug discovery 3 (4) (2004) 301.

• J Chen, N Sawyer, L Regan, Protein-protein interactions: General trends in the relationship between binding affinity and interfacial buried surface area, Protein Science 22 (4) (2013) 510-515

• A A Ivanov, F R Khuri, H Fu, Targeting protein-protein interactions as an anticancer strategy, Trends in pharmacological sciences 34 (7) (2013) 393-400

• J Song, H Wang, J Wang, A Leier, T Marquez-Lago, B Yang, Z Zhang, T Akutsu, G I Webb, R J Daly, Phosphopredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection, Scientific Reports 7 (1) (2017) 6862

• J Vreeken, M Van Leeuwen, A Siebes, Krimp: mining itemsets that compress, Data Mining and Knowledge Discovery

PART III. PRODUCTS, PUBLICATIONS AND TRAINING RESULTS OF THE PROJECT

3.1 Research results

- Deployed as a Web application with an interface that is easy to understand and use for biochemistry researchers.

- A model that predicts better than the 4 best existing methods of the same kind that have been published in reputable ISI journals (including Bioinformatics).

- 01 paper in the journal Database (IF = 3.35, ISI) and 01 paper submitted to the journal Bioinformatics (IF = 7.307, ISI).

- Deployed as a Web application with an interface that is easy to understand and use for biochemistry researchers. Freely available at:

http://fit.uet.vnu.edu.vn:8286/subin/web


2 Interaction prediction model

- Deployed as a Web application with an interface that is easy to understand and use for biochemistry researchers.

- A Webserver system that predicts better than 01 of the best existing methods of the same kind, that of Song et al. (2017), for interactions with the protein kinase group PKA (one of the most intensively studied groups); in addition, our system predicts better than the system of Song et al. when the latter does not integrate supplementary information (as is the case for our system).

- 02 versions of a model for the global alignment of two protein-protein interaction networks; they perform better than the best existing comparable model.

- 01 paper in the proceedings of an international conference.

- 01 paper in the proceedings of a national conference.

- Deployed as a Web application with an interface that is easy to understand and use for biochemistry researchers. Freely available at:

http://fit.uet.vnu.edu.vn/SKIPHOS

3 Survey of optimization methods for solving problems related to protein sequences

- Presented as a survey article that covers, as completely and in as much detail as possible (including strengths, weaknesses and directions for improvement), the typical and best existing optimization methods for problems related to protein sequences.

- Published in the Information Technology series of the VNU journal.

- 01 thematic report presented as a survey article that covers, as completely and in as much detail as possible (including strengths, weaknesses and directions for improvement), the typical and best existing optimization methods for problems related to protein sequences.

- 01 paper published in the VNU Journal of Science.

- 01 founder sequence (protein/gene) reconstruction model that performs better than the best existing comparable model.


Acknowledgement of address and of VNU funding in accordance with regulations

Overall assessment (Pass / Fail)

1 Papers published in international scientific journals indexed by ISI/Scopus

1.1 Le, Hoang-Quynh, Mai-Vu Tran, Thanh Hai Dang, Quang-Thuy Ha, and Nigel Collier. "Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction." Database 2016 (2016) (IF: 3.29).

has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.15.21.

Pass

1.2 Thanh Hai Dang, Quang Thinh Trac, Kinh Huy Phan, Manh Cuong Nguyen, Quynh Trang Pham Thi (2018) SKIPHOS: A novel non-kinase specific phosphorylation site prediction with random forests and amino acid skip-gram embeddings. Bioinformatics (IF: 7.307) (being reviewed)

Submitted (on 5/12/2017)

This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.15.21.

Pass (target exceeded)

2 Monographs published or under a publishing contract

4 International papers not indexed by ISI/Scopus

4.1 Duc Dong Do and Ngoc Ha Tran and Thanh Hai Dang and Cao Cuong Dang and Xuan Huan

QG.15.21.

Pass


4.2 Đỗ Xuân Quyền, Nguyễn Hoàng Đức, Thái Đình Phúc, Đỗ Đức Đông. Phương pháp tối ưu đàn kiến dóng hàng toàn cục các mạng tương tác protein (An ant colony optimization method for the global alignment of protein interaction networks). Proceedings of the 9th National Conference on Fundamental and Applied Information Technology Research (FAIR'9), Cần Thơ, 4-5 August 2016. DOI: 10.15625/vap.2016.00077.

completed within the framework of the VNU-level science and technology project, project code: QG.15.21.

Pass (target exceeded)

5 Papers in VNU scientific journals, national specialized scientific journals, or scientific reports published in international conference proceedings

5.1 Anh Vu Thi Ngoc, Dinh Phuc Thai, Hoang Duc Nguyen, Thanh Hai Dang, Duc Dong Do. Ant colony optimization based founder sequence reconstruction. VNU Journal of Science: Computer Science and Communication Engineering, 2018.

has been supported by Vietnam National University, Hanoi (VNU), under Project No.

S&T products column: list the information on each S&T product in the following order: <author names, title of the work, name of the journal/publisher, issue number, year of publication, pages on which the work appears, identifier of the work in the journal/monograph (DOI), type of journal ISI/Scopus>

Scientific publications (papers, scientific reports, monographs) are accepted only if they state the address and acknowledge VNU funding in accordance with regulations.

Full-text photocopies of these publications must be included in the evidence appendix of the report. For monographs, photocopies of the cover, the first page and the last page showing the publication code are required.


3.3 Training results

336, 14-16 October 2015, Ho Chi Minh city, Vietnam Print ISSN: 2162-1020.

Defense procedures in progress.

10.15625/vap.2016.00077

* Thesis: “Ứng dụng phương pháp tối ưu đàn kiến dóng hàng hai đồ thị” (Applying ant colony optimization to the alignment of two graphs)

Defended; the Master's degree has been recognized and awarded (Decision No. 812/QG- ĐHCNTT&TT)

2 Phạm Văn Hiếu

protein-protein using data mining techniques”

Successfully defended, with an average grade of 7.8/10 (Decision No. 1163/QĐ-ĐT, dated 23/11/2017).

3 Đặng Quốc Hùng

between proteins based on deep learning techniques”

Successfully defended.

Notes:

Attach a photocopy of the cover page of the dissertation/thesis/graduation thesis and of the degree or certificate of the PhD/master's student if the student has successfully defended the dissertation/thesis;

The publications column is filled in as in Section III.1.

PART IV. SUMMARY OF THE PROJECT'S S&T PRODUCTS AND TRAINING RESULTS

Number registered

Number completed

1 Papers published in international scientific journals indexed by ISI/Scopus

published, 01 awaiting peer review)

2 Monographs published or under a publishing contract

3 Intellectual property registrations

5 Number of papers in VNU scientific journals, national specialized scientific journals, or scientific reports published in international conference proceedings

6 Scientific reports with recommendations or policy advice commissioned by the user organization

7 Results expected to be applied at policy-making bodies or S&T application facilities

decision to postpone the defense because the English certificate had only just been submitted)

PART V. USE OF FUNDS

Approved budget (million VND)

Actual expenditure (million VND)

Notes

A Direct costs

2 Raw materials, fuel, plants and seedlings.


(Head of the unit: signature and seal)

Hanoi, 26 December 2017.

Principal investigator

(Full name, signature)


Database, 2016, 1-14 doi: 10.1093/database/baw102

Original article

Sieve-based coreference resolution enhances

semi-supervised learning model for

chemical-induced disease relation extraction

Hoang-Quynh Le1, Mai-Vu Tran1, Thanh Hai Dang1,*, Quang-Thuy Ha1 and Nigel Collier2,*

1Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam, Building E3, 144 Xuan Thuy str., Cau Giay dist., Hanoi, Vietnam Postal code: 100000 and 2Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK

*Corresponding author: Tel: +44 (0)1223 767356, Email: nhc30@cam.ac.uk

Correspondence may also be addressed to Thanh Hai Dang. Tel: +84(4)37547064; Fax: +84(4)37547460; Email: hai.dang@vnu.edu.vn

Citation details: Le,H-Q., Tran,M-V., Hai Dang,T. et al. DEOP: a database on osmoprotectants and associated pathways. Database (2016) Vol. 2016: article ID baw102; doi:10.1093/database/baw102

Received 4 December 2015; Revised 4 June 2016; Accepted 6 June 2016

Abstract

The BioCreative V chemical-disease relation (CDR) track was proposed to accelerate the progress of text mining in facilitating integrative understanding of chemicals, diseases and their relations. In this article, we describe an extension of our system (namely UET-CAM) that participated in the BioCreative V CDR. The original UET-CAM system's performance was ranked fourth among 18 participating systems by the BioCreative CDR track committee. In the Disease Named Entity Recognition and Normalization (DNER) phase, our system employed joint inference (decoding) with a perceptron-based named entity recognizer (NER) and a back-off model with Semantic Supervised Indexing and Skip-gram for named entity normalization. In the chemical-induced disease (CID) relation extraction phase, we proposed a pipeline that includes a coreference resolution module and a Support Vector Machine relation extraction model. The former module utilized a multi-pass sieve to extend entity recall. In this article, the UET-CAM system was improved by adding a 'silver' CID corpus to train the prediction model. This silver standard corpus of more than 50 thousand sentences was automatically built based on the Comparative Toxicogenomics Database (CTD). We evaluated our method on the CDR test set. Results showed that our system could reach state-of-the-art performance with F1 of 82.44 for the DNER task and 58.90 for the CID task. Analysis demonstrated substantial benefits of both the multi-pass sieve coreference resolution method (F1 +4.13%) and the silver CID corpus (F1 +7.3%).

Database URL: SilverCID-The silver-standard corpus for CID relation extraction is freely available online at: https://zenodo.org/record/34530 (doi:10.5281/zenodo.34530).

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.


Introduction

A survey of PubMed users’ search behavior showed that diseases and chemicals were two of the most frequently requested entities by PubMed users worldwide: diseases appeared in 20% of queries and chemicals in 11% (1). These two entities are central to several topics such as developing drugs for therapeutics, discovering adverse drug reactions (ADRs) as well as chemical safety/toxicity among patient groups and facilitating hypothesis discovery for new pharmaceutical substances. As a consequence, extracting chemical-disease relations (CDR) from unstructured free text has become an important field in biomedical text mining.

In recent years, there has been an increased focus in research on capturing disease and chemical relations (e.g. drug-side-effect relations) from biomedical literature text. The Comparative Toxicogenomics Database (CTD) has been a notable target of many studies. The CTD is a manually curated database that promotes understanding about the effects of environmental chemicals (e.g. arsenic, heavy metals and dioxins) on human health (2). As of June 2015, the CTD database had 1 842 746 chemical-disease associations. Due to the high cost of manual curation and the rapid growth of the biomedical literature, a number of researchers have attempted to extract chemical-disease relations or drug side effects automatically. The simplest class of approaches is based on the co-occurrence statistics of chemical and disease entities, i.e. if two entities are mentioned together in the same sentence or abstract, they are probably related. Chen et al. (3) used this method to identify and rank associations between eight diseases and relevant drugs. This approach tends to achieve high recall but low precision, and fails to distinguish the chemical-induced disease (CID) relations from other relations that commonly occur between chemicals and diseases. Knowledge-based approaches were also successfully applied to ADR extraction (4, 5). They, however, demand the time-consuming and labor-intensive manual compilation of a large body of knowledge (in terms of rules as in 4 or a three-tier hierarchical graph as in 5), which results from the wide variety of contexts in which relations can occur. These approaches, therefore, tend to suffer from low recall. Other approaches are based on automated machine learning techniques, such as Support Vector Machines (SVMs) (6) and decision trees (7). Their performance, however, has still been limited, mainly due to the lack of a substantial data set for training. Moreover, the wide variety of ADR syntaxes as well as a failure to resolve inter-sentence alternative entity mentions also hamper performance.

To accelerate the progress, BioCreative V proposed a challenge task for automatic extraction of CDRs (8, 9).

Figure 1. Analysis of the direct evidence field in the CTD database.

The CDR challenge has two sub-tasks:

(A) Disease Named Entity Recognition (DNER). This task includes automatic recognition of disease mentions (named entity recognition, NER) in PubMed abstracts and assignment of Medical Subject Heading (MeSH, 10) identifiers to these mentions (named entity normalization, NEN).

(B) CID relation extraction. Participating systems were provided with raw text from PubMed articles as input and asked to return a list of <chemical, disease> pairs, in which chemicals and diseases are normalized concepts that participate in a CID relation.

In these challenge tasks, diseases were annotated using the ‘Diseases’ [C] branch of MeSH 2015, including diseases, disorders, signs and symptoms; chemical terminologies were annotated using the ‘Drugs and Chemicals’ [D] branch of MeSH 2015. The CID relations can be marked as ‘marker/mechanism’ in the CTD database. There are two types of such relationships: (i) biomarker relations between a chemical and disease indicating that the chemical correlates with the disease and (ii) putative mechanistic relationships between a chemical and disease indicating that the chemical may play a role in the etiology of the disease (see Figure 1).

As a team participating in the CDR challenge, we proposed a modular system that handled the DNER and CID tasks separately. For the DNER as the first phase, we proposed a method for combining several state-of-the-art word-embedding techniques in the NEN module in order to take advantage of both the gold standard annotated corpus and large-scale unlabeled data. The NEN and NER modules were then combined into a joint inference model to boost performance and reduce noise. For the second phase, the CID task exposed many challenges such as (i) complex grammatical structures, (ii) entities that belong to a relation may appear not only in a single sentence but also in multiple sentences, in which they are often mentioned coreferentially or using different forms, (iii) entities being expressed in MeSH IDs instead of free-text forms. To overcome these challenges, a traditional machine learning model for relation extraction, which is based only on explicit mentions of entities in a single sentence, will not be adequate. We thus had to employ a coreference module along with an SVM-based relation extraction module as the central core. The intention of using the coreference module was to extend system recall on disease/chemical mentions, and then to convert inter-sentence relations to intra-sentence relations. Additionally, in order to exploit as much useful information as possible from the literature, we built a silver-standard corpus (namely ‘SilverCID’) for training the DNER averaged perceptron model and the SVM intra-sentence relation extraction model. This corpus was a carefully selected sub-set of citations in the CTD database and totally disjoint from the targets in the testing set. In addition, we explored the benefit of using a large-scale feature set to handle the variety of CTD relation mentions.

The novel contributions of this article are as follows: (i) we proposed a DNER model that was based on the joint inference between an averaged perceptron NER model and a NEN pipeline of two phases, i.e. Supervised Semantic Indexing (SSI) followed by a skip-gram model; (ii) we demonstrated the benefit of our automatically built SilverCID corpus (a sentence-level corpus) for the CID relation extraction; (iii) we presented evidence for the efficacy of using the multi-pass sieve in the CID relation extraction task and (iv) we demonstrated the strength of the rich feature set (see section SVM-based intra-sentence relation extraction and Table 2 for more details) for CID relation extraction.

Materials and Methods

Data set

Our experiments were conducted on the BioCreative V CDR data. In order to take advantage of the CTD database, we also built a SilverCID corpus from PubMed articles that were cited in the CTD database but which did not appear in the BioCreative CDR track data set.

BioCreative CDR track data set

To assist the development and assessment of participating CDR systems, the BioCreative V workshop organizers created an annotated text corpus that consists of expert annotations for all chemicals, diseases, and their CID relations. This corpus contained a total of 1500 PubMed articles that were separated into three sub-sets of 500 articles each: the training, development and test sets (the details are shown in Table 1). According to the BioCreative data survey (9), of these 1500 articles, 1400 were selected from an existing CTD-Pfizer data set that had been jointly curated via a previous collaboration between CTD and Pfizer (11). The remaining 100 articles contained newly curated data and were incorporated into the test set.

Table 1 note: Men, Mention; CID, CID relations.

SilverCID corpus

The CTD (2) is a robust, publicly available database that aims to advance understanding about how environmental exposures affect human health. Chemicals in the CTD come from the chemical subset of MeSH. The CTD’s disease vocabulary is a modified subset of descriptors from the ‘Diseases’ category of MeSH, combined with genetic disorders from the Online Mendelian Inheritance in Man (OMIM) database (12).

Among the more than 28 million CTD toxicogenomic relationships, there are 1 919 790 disease-chemical relations (curated or inferred via CTD-curated chemical-gene interactions) (October 2015). There are several types of relations between diseases and chemicals, which may be described within the ‘Direct Evidence’ field of the CTD database. This field has two labels, M and T, in which the label M indicates that a chemical can correlate with a disease or can be the etiology of a disease (Figure 1). Relations curated as M, therefore, are more likely to be CID relations.

Moreover, we observed that if two entities that participate in a relation appear in the same sentence, it is highly probable that this sentence contains the grammatical relation that we are considering. Taking into account these two observations, a silver standard CID corpus, SilverCID, was constructed using the CTD database and PubMed according to five steps (Figure 2 gives an example of how the SilverCID was constructed):

Step 1 (Relation filtering): CID relations in the CTD database were filtered using information from the ‘Direct evidence’ field. Only relations marked as ‘M’ (marker/mechanism) were chosen.

Step 2 (Collecting): We collected PubMed abstracts from the reference list of the relations that had been chosen in Step 1. This reference list was provided by the CTD database.

Figure 2. An example of constructing the SilverCID corpus.

Step 3 (Overlap removal): To avoid overlap between the SilverCID corpus and the CDR test set, we removed all the PubMed abstracts which appeared in the CDR track data set to ensure a fair evaluation of the SilverCID’s contribution.

Step 4 (Annotating): For each relation that had been chosen in Step 1, all disease and chemical mentions in its referring PubMed articles were automatically annotated.

Step 5 (Sentence filtering): Sentences in the abstracts that remained after Step 4 were kept for downstream work if they contained both chemical and disease entities, which may participate in a CID relation. Sentences that did not contain any entity or contained only one entity were removed.
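The five steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' implementation: the `CtdRelation` record, the naive `split_sentences` helper and the use of case-insensitive string matching in place of the automatic annotation of Step 4 are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CtdRelation:
    """Hypothetical, simplified view of one CTD chemical-disease record."""
    chemical: str         # chemical name
    disease: str          # disease name
    direct_evidence: str  # 'M' (marker/mechanism) or 'T'
    pmids: tuple          # PubMed IDs that CTD cites for this relation


def split_sentences(abstract):
    # Naive sentence splitter, used only for this illustration.
    return [s.strip() for s in abstract.split(". ") if s.strip()]


def build_silver_corpus(relations, abstracts, cdr_pmids):
    """Return (pmid, sentence, chemical, disease) tuples following Steps 1-5."""
    corpus = []
    for rel in relations:
        if rel.direct_evidence != "M":       # Step 1: relation filtering
            continue
        for pmid in rel.pmids:               # Step 2: collect cited abstracts
            if pmid in cdr_pmids or pmid not in abstracts:
                continue                     # Step 3: overlap removal
            for sentence in split_sentences(abstracts[pmid]):
                text = sentence.lower()
                # Step 4 (annotation) is reduced to string matching here.
                has_chemical = rel.chemical.lower() in text
                has_disease = rel.disease.lower() in text
                if has_chemical and has_disease:   # Step 5: sentence filtering
                    corpus.append((pmid, sentence, rel.chemical, rel.disease))
    return corpus
```

A real pipeline would replace the string matching with an entity recognizer and normalizer, since the same chemical or disease is rarely written identically across abstracts.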

Two novel aspects that make our SilverCID corpus different from other resources are that (i) it was built automatically and (ii) it is a sentence-level corpus (i.e. a set of sentences that contains at least one intra-sentence CID relation with its participating chemical and disease entities), which covered about 60% of CID relations in the CTD database. This data set contains 38 332 sentences, 1.25 million tokens, 48 856 chemical entities (1196 unique chemical entities), 44 744 disease entities (2098 unique disease entities) and 48 199 CID relations (12 776 unique CID relations). It is freely available online at URL: https://zenodo.org/record/34530 (doi: 10.5281/zenodo.34530).

Figure 3. Architecture of the proposed CDR extraction system, which includes the pipeline of processing modules and material resources; boxes with dotted lines indicate sub-modules.

Proposed model

The overall architecture of our proposed system is described in Figure 3. Compared with our previous system in the BioCreative CDR track, the improved system used the SilverCID corpus for training in both the DNER and CID phases. The impact of this improvement on the system’s performance will be analyzed in the next sections. Pre-processing steps include sentence splitting, tokenization, abbreviation identification, stemming, POS tagging and dependency parsing (Stanford; Stanford Dependencies: http://nlp.stanford.edu/software/stanford-dependencies.shtml). The system was based on the integration of several state-of-the-art machine learning techniques in order to maximize their strengths and overcome their weaknesses.

Named entity recognition and normalization

This module solved the DNER sub-task. It was a joint-decoding model of NER and NEN modules, designed to boost performance and reduce noise (13). The NER and NEN modules were trained separately and then decoded simultaneously.

Following reports of the high performance of the joint-inference model by Li and Ji (13) and Zhang and Clark (14), we decided to employ a structured perceptron model for NER. Its output was a set of real numbers, each of which corresponded to the weight of a class label. This output format was the same as that of the NEN model; therefore, it was suitable for joint inference in the decoding phase. The structured perceptron was an extension of the standard perceptron for structured prediction by applying inexact search with violation-fixing update methods (15). It was trained on the CDR training and development sets and the SilverCID corpus with a standard lexicographic feature set: orthography features, context features, POS tagging features and dictionary (CTD) features.

The NEN module was a sequential back-off model based on two word embedding (WE) methods: SSI (16), a supervised WE method, and skip-grams (17), an unsupervised WE method. The SSI model was trained on the CDR training and development sets to obtain a correlation matrix w between tokens in the training data as well as MeSH. Skip-gram is a state-of-the-art word-to-vector method that takes advantage of large unlabeled data. We used an open source skip-gram model provided by NLPLab (http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin), which was trained on all PubMed abstracts and PMC full texts (4.08 million distinct words). The output of the skip-gram model was a set of word vectors of 200 dimensions, from which similarities between all word pairs were calculated. As a result, we constructed a correlation matrix that was in the same format as the output of the SSI model. Therefore, we could combine the SSI model and the skip-gram model into the back-off model. For normalizing entities, we created pairs of each entity and each MeSH concept and then processed them by the SSI and skip-gram sequential back-off model. In this regard, we first applied the SSI model to find which pairs are linked, and then processed the non-linked pairs once again by the skip-gram model.
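The sequential back-off described above can be illustrated with a small sketch. It is a minimal approximation under stated assumptions: `ssi_score` and `embed` are hypothetical callables standing in for the trained SSI model and the skip-gram vectors, the concept identifiers are toy values, and both thresholds are invented for illustration (the source does not state how a pair is judged to be linked).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalize_mention(mention, concepts, ssi_score, embed,
                      ssi_threshold=0.5, sg_threshold=0.5):
    """Link a mention to a concept ID: try SSI first, then back off to skip-gram.

    concepts maps a concept ID to its name. Thresholds are illustrative only.
    Returns (concept_id, source) or (None, None) when no pair is linked.
    """
    # First pass: supervised score between the mention and each concept name.
    best = max(concepts, key=lambda cid: ssi_score(mention, concepts[cid]))
    if ssi_score(mention, concepts[best]) >= ssi_threshold:
        return best, "ssi"

    # Back-off pass: cosine similarity of unsupervised word vectors.
    mention_vec = embed(mention)
    best = max(concepts, key=lambda cid: cosine(mention_vec, embed(concepts[cid])))
    if cosine(mention_vec, embed(concepts[best])) >= sg_threshold:
        return best, "skip-gram"
    return None, None
```

The ordering reflects the design choice in the text: the supervised model is trusted first because it was trained on gold annotations, and the unsupervised vectors only handle pairs it could not link.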

The CID subtask required the system to extract the CID relations at the abstract level. In simple cases, a CID relation might be expressed in a single sentence (intra-sentence relation), i.e. two entities that participate in a CID relation appear in the same sentence. Unfortunately, they might also be expressed across multiple sentences (inter-sentence relation). Our system was based on a strategy that first converted inter-sentence relations to intra-sentence relations by using a coreference resolution method and then applied a machine learning model to extract them.

Our DNER system was a joint decoding model, which used a modified beam search for decoding (13, 18). In this model, we trained two separate models for NER and NEN and then decoded them simultaneously. We also proposed a new scoring function for beam search decoding as follows (see formula 1):

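$$\arg\max \sum_{i} \Big( w_{NER}\big(x_{t=i},\, y_{t=i-1;NER}\big) + w_{NEN}\big(x_{t=i},\, x_{t=i-1},\, y_{t=i-1;NER},\, y_{t=i;NER}\big) \Big) \qquad (1)$$

The scoring function for NEN is:

$$w_{NEN}\big(x_{t=i},\, x_{t=i-1},\, y_{t=i-1;NER},\, y_{t=i;NER}\big) = \begin{cases} 0, & \text{if } y_{t=i}: O \\ \ldots, & \text{if } y_{t=i-1}: B\text{-}DS \mid I\text{-}DS \mid O \text{ and } y_{t=i}: B\text{-}CD \\ \ldots, & \text{if } y_{t=i-1}: B\text{-}CD \mid I\text{-}CD \mid O \text{ and } y_{t=i}: B\text{-}DS \\ \ldots, & \text{if } y_{t=i-1}: B\text{-}DS \mid I\text{-}DS \text{ and } y_{t=i}: I\text{-}DS \end{cases}$$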
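To make the joint decoding idea concrete, the sketch below runs a plain beam search that, at every token, adds a NER score and a NEN score and keeps the best partial label sequences. It is an illustration under assumptions rather than the authors' modified beam search: `w_ner` and `w_nen` are hypothetical callables supplied by the caller, their argument lists are chosen for convenience (the current label is passed explicitly to both), and the start label `"O"` is assumed.

```python
def beam_decode(tokens, labels, w_ner, w_nen, beam_size=3):
    """Return (best_label_sequence, score) under summed NER + NEN scores."""
    beam = [(0.0, [])]  # each hypothesis is (accumulated score, labels so far)
    for i, token in enumerate(tokens):
        prev_token = tokens[i - 1] if i > 0 else None
        candidates = []
        for score, seq in beam:
            prev_label = seq[-1] if seq else "O"  # assumed start label
            for label in labels:
                # Joint step score: both models vote on the same transition.
                step = (w_ner(token, prev_label, label)
                        + w_nen(token, prev_token, prev_label, label))
                candidates.append((score + step, seq + [label]))
        # Keep only the highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    best_score, best_seq = beam[0]
    return best_seq, best_score
```

Because both scores are added before pruning, a label that the NER model finds only moderately likely can still survive in the beam when the normalization model strongly supports it, which is the motivation for decoding the two models simultaneously.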

Formally, the coreference consists of two linguistic expressions, the antecedent and the anaphor (19). Figure 4 is an example of the coreference, in which the anaphor ‘side effect’ is the expression whose interpretation depends on that of the other expression, and the antecedent ‘hemorrhagic cystitis’ is the linguistic expression on which the anaphor ‘side effect’ depends.

Figure 4. An example of the coreference between chemical entities. Two sequential sentences are extracted from PubMed abstract PMID: 23949582. (First sentence shown: ‘It is characterized by its intense urotoxic action, leading to hemorrhagic cystitis.’)

The traditional coreference resolution task is normally to discover the antecedents for each anaphor in a document. From the perspective of this study, it was not always necessary to make clear which expression is the antecedent and which is the anaphor. Our system considered both antecedents and anaphors as mentions of entities, and strove to recognize as many mentions of an entity as possible.

Studies on coreference resolution in the general English domain date back to the 1960s and 1970s and often focus on persons, locations and organizations. In biomedicine, because the entity types to be resolved are atypical of general domains (i.e. protein, gene, disease, chemical, etc.), coreference research has received comparatively less attention (19). Previous approaches applied several methods, ranging from heuristics-based (20, 21) to machine learning (22, 23).

In this regard, our proposed system employed a coreference module that was based on a multi-pass sieve model (21). It has been evaluated as a simple yet effective means for disorder mention normalization (21). We first processed each abstract by noun phrase (NP) chunking (using Genia tagger; http://www.nactem.ac.uk/GENIA/tagger/) and then created a set of NP pairs for each abstract. These pairs of NPs were then passed through the sieves. Pairs that were kept by any sieve were considered coreferent; pairs that were not kept by a sieve were passed on to the next sieve, until the last one. Nine sieves were used, each corresponding to a set of rules. Figure 5 is an illustration of the sieve-based coreference resolution module with example pairs that were kept by each sieve.

Sieve 1 – ID matching: Two chemical or disease mentions that have the same MeSH ID are coreferent. This sieve used information from the previous NEN step. For example, ‘irregular heartbeat’ and ‘irregular heart beat’ were both normalized to MeSH ID: D001145, and were thus considered coreferent.

Sieve 2 (Abbreviation expansion): In this sieve we used the BioText abbreviation recognition software (http://biotext.berkeley.edu/software.html) (24) to identify abbreviations and their full forms (e.g. the full form of ‘PND’ in the abstract PMID:11708428 is ‘prednisone’). We then checked the MeSH ID of the full form and applied it to the abbreviation in order to unify mentions.

Figure 5. Coreference resolution using the nine-pass sieve. Examples are pairs that were kept by sieves: ‘propanolol’ and ‘propranolol’; ‘macroprolactinemia’ and ‘microprolactinoma’; ‘abortion’ and ‘abortions’; ‘calcium channel blocking agents’ and ‘calcium channel blockers’; ‘tohemorrhagic cystitis’ and ‘side effect’. Pairs kept by no sieve are not coreference pairs.

Sieve 3 (Grammatical conversion): Similar forms of an entity mention were automatically generated by changing grammatical elements in mentions, including subjects, objects and prepositions, etc. The ID match criterion was then checked. New forms were obtained by applying rules proposed by D’Souza and Ng (21), which include: (i) replacing the preposition in the name with other prepositions, (ii) dropping the preposition from the name and swapping the substrings surrounding it, (iii) bringing the last token to the front, inserting a preposition as the second token, and shifting the remaining tokens to the right by two and (iv) moving the first token to the end, inserting a preposition as the second to last token, and shifting the remaining tokens to the left by two. Examples include ‘calcification of the artery’ and ‘artery calcification’, ‘mental status alteration’ and ‘alteration in mental status’.
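The four rewriting rules can be sketched roughly as below. This is an illustrative reading of the rules, not the authors' code: the preposition list, the removal of determiners before rewriting, and the choice to apply rules (iii) and (iv) only to mentions without a preposition are all assumptions made here.

```python
PREPOSITIONS = ("of", "in", "with", "on", "for")  # illustrative list (assumption)
DETERMINERS = {"the", "a", "an"}                  # dropped before rewriting (assumption)


def grammatical_variants(mention):
    """Generate alternative surface forms of a mention using rules (i)-(iv)."""
    tokens = [t for t in mention.lower().split() if t not in DETERMINERS]
    original = " ".join(tokens)
    variants = set()
    prep_positions = [i for i, tok in enumerate(tokens) if tok in PREPOSITIONS]

    for i in prep_positions:
        left, right = tokens[:i], tokens[i + 1:]
        # Rule (i): replace the preposition with every other preposition.
        for prep in PREPOSITIONS:
            if prep != tokens[i]:
                variants.add(" ".join(left + [prep] + right))
        # Rule (ii): drop the preposition and swap the surrounding substrings.
        variants.add(" ".join(right + left))

    if not prep_positions and len(tokens) >= 2:
        for prep in PREPOSITIONS:
            # Rule (iii): last token to the front, preposition as second token.
            variants.add(" ".join([tokens[-1], prep] + tokens[:-1]))
            # Rule (iv): first token to the end, preposition as second-to-last token.
            variants.add(" ".join(tokens[1:] + [prep, tokens[0]]))

    variants.discard(original)
    return variants
```

Each generated variant would then be looked up against the normalized mentions, so that a variant sharing a MeSH ID with another mention marks the pair as coreferent.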

Sieve 4 (Number replacement): Similar forms of a mention were generated by replacing numbers with other forms and the ID match criterion was checked. In this regard, we considered the numeral, roman numeral, cardinal and multiplicative forms of a number for generating new mention forms, e.g. ‘two’ can be converted to ‘2’, ‘ii’ and ‘double’.
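A minimal sketch of number replacement, assuming a small hand-written table of equivalent number forms. Only the row for ‘two’ comes from the text; the other rows and the function name are hypothetical.

```python
# Each set groups the cardinal, numeral, roman-numeral and multiplicative
# forms of one number. Only the "two" row is taken from the text.
NUMBER_FORMS = [
    {"one", "1", "i", "single"},
    {"two", "2", "ii", "double"},
    {"three", "3", "iii", "triple"},
]


def number_variants(mention):
    """Return new mention forms with each number token swapped for its other forms."""
    tokens = mention.lower().split()
    variants = set()
    for i, tok in enumerate(tokens):
        for forms in NUMBER_FORMS:
            if tok in forms:
                for alt in forms - {tok}:
                    variants.add(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return variants
```

For a mention such as "type two diabetes" this produces the forms with "2", "ii" and "double" in place of "two".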

Sieve 5 (Synonym replacement): The ID match criterion for synonyms of mentions was checked. This sieve used a synonym dictionary constructed from MeSH, which contains 780 982 entries. Examples include ‘propanolol’ and ‘propranolol’.

Sieve 6 (Affix normalization): New forms of a mention were generated by changing affixes (including prefixes and suffixes) and then the ID match criterion was checked. For example, ‘macroprolactinemia’ and ‘microprolactinoma’ (PMID:20595935), and ‘nephrotoxicity’ and ‘nephrotoxic’ (PMID:19642243), are coreferent.

Sieve 7 (Stemming): Entity mentions were stemmed using the Porter stemmer (http://tartarus.org/martin/PorterStemmer/), and then the ID match criterion was checked. Examples include ‘abortion’ and ‘abortions’.

Sieve 8 (Partial match): This sieve used the output information from the abbreviation expansion sieve and applied the criterion for partial matching proposed by D’Souza and Ng (21), which states that ‘a mention can be partially matched with another mention for which it shares the most tokens’. For example, ‘calcium channel blocking agents’ and ‘calcium channel blockers’ in abstract PMID:3323259 were marked as coreferent.
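The partial match criterion quoted above can be illustrated with a simple token-overlap search. This is a sketch under the assumption that "sharing the most tokens" means the largest set of common lowercase tokens; the function name and the candidate list are hypothetical.

```python
def best_partial_match(mention, candidates):
    """Return the candidate sharing the most tokens with `mention`, or None.

    Ties are resolved in favour of the earlier candidate; a candidate
    sharing no token at all is never returned.
    """
    mention_tokens = set(mention.lower().split())
    best, best_overlap = None, 0
    for candidate in candidates:
        overlap = len(mention_tokens & set(candidate.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = candidate, overlap
    return best
```

With the example from the text, ‘calcium channel blocking agents’ is matched to ‘calcium channel blockers’ because the two mentions share the tokens "calcium" and "channel".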

Sieve 9 (Hyponymic terms): We created two dictionaries for chemicals and diseases, containing hyponymic nouns that often referred to chemicals/diseases. For example, the chemical hyponymic dictionary includes ‘drug’, ‘dose’, etc.; the disease hyponymic dictionary includes ‘disease’, ‘case’, ‘infection’, ‘side effect’, etc. In this sieve, NER information was used to find chemical and disease entities, and if any dictionary term occurred within a context window of two sentences before or after an entity, we could determine a coreference.
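The window search of the hyponymic sieve might look roughly like the sketch below. The dictionary entries are the ones listed in the text; the function name, its parameters and the plain whole-word matching are assumptions, and the real sieve presumably works on NP chunks rather than raw sentence strings.

```python
import re

# Dictionary entries quoted in the text (the full dictionaries are larger).
CHEMICAL_HYPONYMS = {"drug", "dose"}
DISEASE_HYPONYMS = {"disease", "case", "infection", "side effect"}


def hyponym_candidates(sentences, entity_sentence, entity_type, window=2):
    """Find hyponymic terms within `window` sentences of an entity.

    Returns (sentence index, term) pairs that may corefer with the entity
    located in sentence `entity_sentence`.
    """
    terms = CHEMICAL_HYPONYMS if entity_type == "chemical" else DISEASE_HYPONYMS
    start = max(0, entity_sentence - window)
    stop = min(len(sentences), entity_sentence + window + 1)
    hits = []
    for i in range(start, stop):
        text = sentences[i].lower()
        for term in sorted(terms):
            # Whole-word match so that e.g. "case" does not match "cases".
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                hits.append((i, term))
    return hits
```

Because generic nouns such as ‘case’ or ‘dose’ are frequent, restricting the search to a two-sentence window is what keeps this sieve from over-generating coreference links.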

SVM-based intra-sentence relation extraction

Our work was based on the observation that if an NP and an entity are coreferent, the NP can be considered an entity of that type. The intra-sentence relation extraction module received sentences that contain a disease-chemical pair as input and classified whether this pair had the CID relation or not.

The example in Figure 4 (section Coreference resolution) also shows how to combine the coreference resolution module and the intra-sentence relation extraction module for handling inter-sentence relations. The strategy is that if the intra-sentence relation extraction module can recognize the relation between ‘side effect’ and ‘IFO’, we can also determine the relation between ‘tohemorrhagic cystitis’ and ‘IFO’ because ‘tohemorrhagic cystitis’ and ‘side effect’ are coreferent.
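The strategy above amounts to propagating an intra-sentence relation along a coreference link. A minimal sketch, assuming relations are (chemical, disease mention) tuples and coreference links are (NP, entity) tuples; both representations and the function name are hypothetical.

```python
def propagate_relations(intra_relations, coref_links):
    """Extend intra-sentence relations to coreferent entity mentions.

    intra_relations: iterable of (chemical, disease_mention) pairs found
        by the intra-sentence classifier.
    coref_links: iterable of (np_mention, entity_mention) pairs found by
        the coreference module.
    """
    relations = set(intra_relations)
    for np_mention, entity_mention in coref_links:
        for chemical, disease_mention in intra_relations:
            if disease_mention == np_mention:
                # The NP stands for the entity, so the entity inherits the relation.
                relations.add((chemical, entity_mention))
    return relations
```

Applied to the example from the text, the intra-sentence relation between ‘IFO’ and ‘side effect’ yields an inter-sentence relation between ‘IFO’ and ‘tohemorrhagic cystitis’.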

The intra-sentence relation extraction module was based on an SVM (25), one of the most popular machine learning methods that has been successfully applied for biomedical relation extraction (26, 27). We used the Liblinear tool (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) to train a supervised binary SVM classifier (L2-regularized and L1-loss) on the CDR track training/development data set and our SilverCID corpus. In this study, we observed that the complexities of CID relations (several structural forms, abundance-related vocabulary sets, difficulty to determine the distance between the two entities, etc.) are very similar to the event extraction problem. As a consequence, the feature set that was specially constructed for event extraction might work better than that commonly used for normal relation extraction [they were words, entity types, mention levels, overlap, dependency, parse tree and dictionary (28-30)]. Following a report of high performance in event extraction (31), we decided to use a large-scale feature set including four types of features: token features, neighboring token features, token n-gram features, pair n-gram features and shortest path features. The details of the features are shown in Table 2.
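For reference, an L2-regularized, L1-loss (hinge loss) linear SVM of the kind named above corresponds to the optimization problem below, as formulated in the LIBLINEAR documentation. The notation is introduced here and is not taken from the paper: w is the weight vector, x_i the feature vector of the i-th candidate chemical-disease pair, y_i its label, l the number of training instances and C the penalty parameter.

```latex
\min_{\mathbf{w}} \;\; \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w}
  \;+\; C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\,\mathbf{w}^{\top}\mathbf{x}_i\bigr),
  \qquad y_i \in \{-1, +1\}
```

The first term is the L2 regularizer and the sum is the L1 (hinge) loss; a pair is classified as a CID relation when the decision value of its feature vector is positive.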

Experimental results

For evaluation, disease entities and CID relations that had been predicted by our proposed model were compared to the gold standard annotated CDR testing data set using standard metrics: precision (P, indicating the percentage of predicted positives that are true instances), recall (R, indicating the percentage of true positive instances that the system has retrieved) and F1 (the harmonic mean of R and P). BioCreative V also evaluates the running time of participating systems based on response time via teams’ respective web services.
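The three metrics can be computed as in the following sketch, which treats predictions and gold annotations as sets of items (for example, (chemical ID, disease ID) pairs per abstract). The function names and this set representation are assumptions made for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted, gold):
    """Compute P, R and F1 from a set of predictions and a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)
```

As a consistency check, a precision of 53.41% and a recall of 49.91% give an F1 of about 51.60%, which matches the CID figures reported for the multi-pass sieve configuration in this article.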

DNER results

The experimental results of the DNER phase on the CDR track testing data set are shown in Table 3. Note that only disease entities were evaluated.

We compared our results with the benchmarks provided by the BioCreative organizer, including:

• The straightforward dictionary look-up method that relied on disease names from the CTD database.

• Retrained models using the out-of-box DNorm (16), which was a competitive system that achieved the highest performance in a previous disease challenge. DNorm combined an approach based on conditional random fields (CRFs) and rich features for NER with pairwise learning to rank for NEN.

• BioCreative DNER average results: Average results of the best run of 16 teams participating in the DNER task.

• BioCreative DNER no. 1 ranked team results: Results from the team that was ranked no. 1 (in terms of F1) in the DNER task (32). This system used a linear chain CRF with rich features for NER; they used three lexicon resources to generate CRF dictionary features and multiple post-processing steps to optimize the results. In the NEN step, they used a dictionary look-up method based on the collection of MEDI, the NCBI disease corpus and the CDR task data set.

In this article, we improved our system (33) that had participated in the BioCreative DNER task by adding the silverCID corpus to the training set of the NER averaged perceptron. Table 3 also shows how useful the silverCID corpus was in boosting the performance of our proposed model.

In the BioCreative V evaluation, our system performed far better than the dictionary look-up method, but worse than DNorm, which was considered a very strong benchmark (note that only seven participating teams achieved performance better than DNorm). Using the silverCID corpus for training the NER model boosted the performance by 6% in F1, surpassing DNorm’s result.

Database, Vol. 2016, Article ID baw102

Table 2. Large-scale feature set used in the intra-sentence relation extraction module. Recoverable entries: information about the current token [... letter of sentence, number, etc.; base form of token; n-grams (n = 1-4) of token; part-of-speech tagging]; context of the current token (... before and three tokens after the target token); token n-grams (n = 2; 3); dependency n-grams (n = 2); token and dependency n-grams (n = 2-4); n-grams (n = 2-4) of dependencies; edge walks (word-dependency-word) and their sub-structures; vertex walks (dependency-word-dependency) and their sub-structures.

a Provided by the BioCreative 2015 organizer (33).

b The silverCID corpus was included in training the NER module.

To demonstrate the benefit of the joint decoding model, we also built a baseline system based on the traditional pipeline model: NER was employed first and its result was then used for NEN. The NER and NEN modules in this baseline were identical to those in our joint decoding model. The results showed that the joint decoding model boosted the performance by 1.8% in F1 score.

Following the results reported by BioCreative (8), the average response time in the DNER task was 5.6 s, and our system had one of the smallest response times among participating systems (276 ms, ranked no. 2).

CID results

Table 4 shows the results of our system on the CID task. It serves two purposes: first, comparing our results with the BioCreative benchmark results, and second, evaluating the contribution of the coreference resolution approach and the silverCID corpus as well as finding the best combination of them.

We compared our results with the benchmarks provided by the BioCreative organizer, including:

• The co-occurrence baseline method with two variants: abstract-level and sentence-level.

• BioCreative CID average results: Average results of the best run of 18 teams participating in the CID task.

• BioCreative CID no. 1 ranked team results: Results from the team that was ranked no. 1 (in terms of F1) in the CID task (34). This system combines two SVM classifiers trained at sentence and document level; its novel aspect is the use of rich features coming from CID relations in other biomedical resources.

Table 4. CID relation extraction results. [...] intra-sentence relation extraction. Bold values are performance measures of our two models on the test set, not using cross-validation evaluation.

The configuration of our system that participated in the CID task was a pipeline of a multi-pass sieve coreference resolution module and an SVM intra-sentence relation extraction module. It achieved an F1 score of 51.60%, much better than that of the co-occurrence benchmark method. Further, the improved system that used the SilverCID corpus for training the SVM module boosted the performance by 7.3% in F1. Notably, this result is better than that of the highest ranked system in the CID task. The contributions of the coreference resolution and the silverCID corpus were evaluated by comparing results of the SVM-based intra-sentence relation extraction module with and without adding the coreference resolution module/silverCID corpus. A comparative evaluation showed that the original SVM approach (only trained on the CDR training and development set) achieved an F1 of 47.47%, whilst adding the SilverCID corpus boosted F1 by 4.64% (51.60%) and, further, adding the multi-pass sieve coreference resolution module boosted F1 by 4.13% more (58.90%).

We also compared our heuristic-based multi-pass sieve method with a state-of-the-art machine learning-based method for coreference resolution. In this regard, we re-implemented a method proposed by Ng (22), which was an expectation maximization (EM) clustering coreference approach. This system also used the SVM-based intra-sentence relation extraction model that had been trained on the CDR training and development set. The results demonstrated the strength of our multi-pass sieve method. We achieved 53.41% in precision (5.77% better than that of the EM clustering-based method), 49.91% in recall (0.37% worse) and 51.60% in F1 (2.67% better).

The feature set used in the SVM model contains 332 570 features, clearly a large, non-trivial feature space to compute. In our experiments, the SVM model took more than an hour to train. According to the results reported by BioCreative (8), the average response time in the CID task was 9.3 s. Our system’s response time was 8.993 s.

Discussion

Our work makes available to others the SilverCID corpus, which was built automatically, in addition to the extant resources, i.e. the CDR track data set (section BioCreative CDR track data set) and the CTD-Pfizer collaboration data set (11), which have resulted from manual curation. A further novel aspect that distinguishes the SilverCID corpus from other sources is that it is a sentence-level corpus which covers about 60% of the CID relations in the CTD database.

Traditionally, NER and NEN were treated as two separate tasks, in which NEN took the output of NER as its input. Following several studies, e.g. Liu et al. (35), we begin to understand the limitations of this pipeline approach, i.e. errors propagate from NER to NEN and there is no feedback from NEN to NER. Khalid et al. (36) also demonstrated that most NEN errors were caused by recognition errors. The joint decoding model is expected to overcome these disadvantages of the traditional pipeline model. Table 3 shows that the joint decoding model boosted performance by 1.8% in terms of the F1 score. Joint decoding outperformed the pipeline model in the cases of long entities that belong to MeSH, such as ‘combined oral contraceptives’ (MeSH:68003277) and ‘angiotensin-converting enzyme inhibitors’ (MeSH:D000806).

In the DNER phase, the NEN back-off model could take advantage of both the labeled CDR data set and the extremely large unlabeled PubMed data. The SSI model calculated the correlation matrix between tokens; it worked better than Skip-gram in cases where tokens appeared in the training data or MeSH [e.g. SSI links ‘arrhythmias’ to MeSH:D001145 (Arrhythmias, Cardiac) and ‘peripheral neurotoxicity’ to MeSH:D010523 (Peripheral Nervous System Diseases)]. The Skip-gram model calculated similarity between tokens by taking advantage of the large unlabeled PubMed data, and helped improve the system recall [e.g. Skip-gram linked ‘disordered gastrointestinal motility’ to MeSH:D005767 (Gastrointestinal Diseases) and ‘hyperplastic marrow’ to MeSH:D001855 (Bone Marrow Diseases), which were false negatives by the SSI model].

Table 5. Analysis of the contribution of methods and resources used in our proposed system for capturing CID relationships. Recoverable disease entries: [...] thrombosis (D003328), reperfusion injury (D015427), [...] cell lymphoma (D008224).

SVM, SVM intra-sentence relation extraction; CR, multi-pass sieve coreference resolution; sc, silverCID corpus; Intra-, intra-sentence CID relation; Inter-, inter-sentence CID relation; ✓, chemical-disease pair is classified as CID relation correctly. See Supplementary 1 for the sample texts.
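To make the idea of similarity-based linking in the DNER discussion concrete, the sketch below links a mention to the concept whose name has the most similar averaged token vector under cosine similarity. It is only an illustration and not the authors' back-off model: the two-dimensional vectors are invented toy values, and averaging token vectors is an assumption about how token-level similarity could be lifted to mentions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]


def link_by_similarity(mention, concept_names, embeddings):
    """Return the concept ID whose name is most similar to the mention."""
    mention_vec = mean_vector(mention.lower().split(), embeddings)
    if mention_vec is None:
        return None
    best_id, best_score = None, -1.0
    for concept_id, name in concept_names.items():
        name_vec = mean_vector(name.lower().split(), embeddings)
        if name_vec is None:
            continue
        score = cosine(mention_vec, name_vec)
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id


# Toy vectors, invented for illustration; real Skip-gram vectors would be
# learned from unlabeled PubMed text.
TOY_EMBEDDINGS = {
    "hyperplastic": [1.0, 0.1],
    "marrow": [0.9, 0.2],
    "bone": [0.8, 0.3],
    "diseases": [0.7, 0.2],
    "gastrointestinal": [0.1, 1.0],
}
CONCEPTS = {"D001855": "Bone Marrow Diseases", "D005767": "Gastrointestinal Diseases"}
```

With these toy vectors, ‘hyperplastic marrow’ is linked to D001855 even though the mention never appears verbatim in the concept name, which is the kind of recall gain the text attributes to Skip-gram.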

In the CID phase, we compared some true positive results of three other systems, which are listed in Table 5. The compared systems include: (i) the SVM model that was only trained on the CDR training and development data, (ii) the pipeline model of the SVM module and the multi-pass sieve coreference resolution module and (iii) the same model as in (ii), but with the Silver-CID corpus used for training the SVM module. The disagreements between these three systems, which are shown in Table 5, clarify the contribution of the method and data set used in our model (see Supplementary 1 for the sample texts).

First, Table 5 shows that the SVM-based intra-sentence relation extraction model played the central role in our system. It worked well in the cases of intra-sentence CID relations (e.g. examples 1 and 2). As a result, if the SVM failed on an intra-sentence relation, adding the multi-pass sieve coreference resolution module was not helpful (e.g. examples 3 and 4).

Since the SilverCID corpus enriched the training data for the SVM, using it might help to find more relations than the SVM model alone did (e.g. example 3). It might, however, also bring some noise with small adverse effects on the system, i.e. adding the silverCID corpus could lead to some missing results (e.g. example 2).

It is certain that an SVM-only model, whether trained on the SilverCID corpus or not, could not catch the inter-sentence relations (e.g. examples 5-8). Therefore, coreference resolution was necessary for handling the inter-sentence relations (e.g. examples 5 and 7). Similar to the intra-sentence cases, adding the silverCID corpus might help (e.g. example 6) or reject a very small number of true positives classified by the SVM + coreference model (e.g. example 7).

Furthermore, there were still many cases in which the system as a whole failed (e.g. examples 4 and 8).

Table 6 shows examples of where our system disagreed with the annotation standard (see Supplementary 2 for example texts). There were two types of errors: wrong results (FP) and missing results (FN). Since entities that participated in the CID relations were expressed using their MeSH ID and the evaluation was made at the abstract level, it was very hard to clarify the cause of errors within the whole system. Our comments for these cases were empirical, based on a heuristic survey of the system output.

In Table 6, some errors were caused by the previous DNER phase: in example 4, the NER module did not recognize ‘theophylline’ as a chemical; in example 5, an FP result of the NER module for ‘retention deficit’ led to an FP error of the whole system; in example 6, NER determined the wrong boundary of the entity ‘acute hepatitis’ and in example 7, the NEN module matched ‘heart hypertrophy’ to the wrong MeSH ID (it should be ‘D006332’).

Inter-sentence relations often have very complex structures. The two entities involved in such a relation might belong to two sentences that are not adjacent. In the worst cases, one entity was even hidden, which caused many FN errors (e.g. examples 1 and 2).

The SVM module depended on the training data set, which limited its ability to find new relations (those not similar to relations in the training set). Example 9 in Table 6 demonstrates this type of limitation.

Coreference resolution was not a trivial problem; it had several types of errors by itself. The FP error in example 10 seems to be caused by the coreference resolution module, i.e. linking the term ‘dose’ to the wrong entity.

Table 6. Sources of errors by our system on the CDR test set. Columns include the type of error (FP or FN) and the cause of error.

Intra-, intra-sentence CID relation; Inter-, inter-sentence CID relation; FP, false positive; FN, false negative. See Supplementary 2 for the sample texts.

Regarding errors caused by using the silverCID corpus, we noted that this corpus might bring much valuable information, but it also might bring some noise, leading to FN errors (e.g. example 3) and FP errors (e.g. example 8). These two errors would disappear if we removed the silverCID corpus from our system.

Conclusions

In this research, we have presented a systematic study of our approach to the BioCreative V CDR task. Improvements on the original system include: (i) a joint decoding approach for NER and NEN based on several state-of-the-art machine learning methods for the DNER sub-task and (ii) improvements of an SVM-based model for the CID sub-task by using a large-scale feature set, the silverCID corpus and, crucially, a multi-pass sieve coreference resolution module. Our best performance achieved an F1 of 81.93% for DNER, while that of DNorm, the state-of-the-art DNER system based on SSI, was 80.64%. The best performance for CID of our improved system had an F1 of 58.90%, comparable to the 57.03% of the highest ranked system in the CID task.

Based on the CTD database, we built a silver standard data set (called the ‘silverCID’ corpus), including 51 719 sentences that contained CID relations with silver annotations for NER, NEN and CID relations. The use of the SilverCID corpus would not have been allowed under the original rules of the task because it was unknown which subset of the database was used in the test evaluation. In our comparison, the SilverCID corpus proved its effectiveness by boosting our system performance by 7.3% in terms of the F1 score (note that we checked to make sure that there was no overlap between the CTD-silver set and the test set).

Several comparisons were made to compare our results with those of other systems and to analyze the system errors. The evidence pointed towards complementarities between the NER-NEN joint decoding model, the SVM model, the SilverCID corpus and the coreference resolution module. The empirical results also demonstrated the advantage of using multi-pass sieve coreference resolution to handle inter-sentence relations.

One limitation of our system is that DNER was the initial step of CID; thus, the DNER results greatly influenced the CID results. The comparison here therefore requires further validation, because we used NER and NEN information provided by our DNER phase while other systems used theirs.

Our proposed system is extensible in several ways. Improving the coreference resolution module is the most obvious follow-up. Although the coreference resolution module plays a central role in extracting inter-relations, at this time it only boosted the performance by 4.13% in terms of F1. One potential suggestion is to use the SilverCID corpus for training a multi-pass sieve coreference module, since the more results the coreference resolution module can find, the more inter-relations can be found. The second possible follow-up may come from several useful biomedical resources that we did not utilize. According to the report of the best team in the DNER sub-task of BioCreative V (32), they exploited many databases such as CTD, MEDI (37), SIDER (38), etc. to extract various useful knowledge-based features for their machine learning based participating system or to serve as a dictionary for matching. The third can be the application of several post-processing steps, such as abbreviation resolution and consistency improvement, which were applied by the best team in the DNER sub-task of BioCreative V and demonstrated their effectiveness (31).


Acknowledgements

This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.15.21.

Funding

H.-Q.L. and T.H.D. gratefully acknowledge funding support from Vietnam National University, Hanoi (VNU), under Project No. QG.15.21. N.C. gratefully acknowledges funding support from the UK EPSRC (grant number EP/M005089/1). Funding for open access charge: VNUH Project No. QG.15.21.

Supplementary data

Supplementary data are available at Database Online.

References

1. Dogan,R.I., Murray,G.C., Névéol,A. et al. Understanding PubMed user search behavior through log analysis. Database, 2009, bap018.
2. Davis,A.P., Murphy,C.G., Saraceni-Richards,C.A. et al. (2009) Comparative toxicogenomics database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res., 37(suppl 1), D786-D792.
3. Chen,E.S., Hripcsak,G., Xu,H. et al. (2008) Automated acquisition of disease-drug knowledge from biomedical and clinical documents: an initial study. J. Am. Med. Inform. Assoc., 15, 87-98.
4. Liu,J., Li,A., and Seneff,S. (2011) Automatic drug side effect discovery from online patient-submitted reviews: focus on statin [...] Information Mining and Management (IMMM). In: Norbisrath,U. [...] International Academy, Research, and Industry Association (IARIA), Barcelona, Spain, pp. 23-9.
5. Kang,N., Singh,B., Bui,C. et al. (2014) Knowledge-based extraction of adverse drug events from biomedical text. BMC Bioinformatics, 15, 64.
6. Miura,Y., Aramaki,E., Ohkuma,T. et al. (2010) Adverse-effect relations extraction from massive clinical records. 23rd International Conference on Computational Linguistics, p. 75.
7. Hammann,F., Gutmann,H., Vogt,N. et al. (2010) Prediction of adverse drug reactions using decision tree modeling. Clin. Pharmacol. Ther., 88, 52-59.
8. Wei,C.H., Peng,Y., Leaman,R. et al. (2015) Overview of the BioCreative V Chemical Disease Relation (CDR) task. The Fifth BioCreative Challenge Evaluation Workshop, Seville, Spain, pp. 154-66.
9. Li,J., Sun,Y., Johnson,R. et al. (2015) Annotating chemicals, diseases, and their interactions in biomedical literature. The Fifth BioCreative Challenge Evaluation Workshop, pp. 154-66.
10. Lipscomb,C.E. (2000) Medical subject headings (MeSH). Bull. Med. Libr. Assoc., 88, 265.
11. Davis,A.P., Wiegers,T.C., Roberts,P.M. et al. (2013) A CTD-Pfizer collaboration: manual curation of 88 000 scientific articles text mined for drug-disease and drug-phenotype interactions. Database, 2013, bat080.
12. Hamosh,A., Scott,A.F., Amberger,J.S. et al. (2005) Online mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 33(Suppl 1), D514-D517.

13. Li,Q. and Ji,H. (2014) Incremental joint extraction of entity [...] Linguistics. Baltimore, USA, pp. 402-412.
14. Zhang,Y. and Clark,S. (2008) Joint word segmentation and POS [...] Computational Linguistics, Columbus, Ohio, USA, pp. 888-96.
15. Huang,L., Fayong,S., and Guo,Y. (2012) Structured perceptron with inexact search. 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, PA, USA, pp. 142-151.
16. Leaman,R., Dogan,R.I., and Lu,Z. (2013) DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, btt474.
17. Mikolov,T., Sutskever,I., Chen,K. et al. (2013) Distributed representations of words and phrases and their compositionality. Conference on Advances in Neural Information Processing Systems, USA, December 5-10, 2013, 3111-3119.
18. Miwa,M. and Sasaki,Y. (2014) Modeling joint entity and relation extraction with table representation. The 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 1858-1869.
19. Zheng,J., Chapman,W.W., Crowley,R.S. et al. (2011) Coreference resolution: a review of general methodologies and applications in the clinical domain. J. Biomed. Inform., 44, 1113-1122.
20. Lee,H., Peirsman,Y., Chang,A. et al. (2011) Stanford’s multi-pass sieve coreference resolution system at the CoNLL-2011 [...] Computational Natural Language Learning: Shared Task. Association for Computational Linguistics, pp. 28-34.
21. D’Souza,J. and Ng,V. (2015) Sieve-based entity linking for the [...] Short Papers, 297.
22. Ng,V. (2008) Unsupervised models for coreference resolution. The Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 640-649.
23. Bejan,C.A. and Harabagiu,S. (2010) Unsupervised event coreference resolution with rich linguistic features. The 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1412-1422.
24. Oliver,D.E., Bhalotia,G., Schwartz,A.S. et al. (2004) Tools for loading MEDLINE into a local relational database. BMC Bioinformatics, 5, 1.
25. Cortes,C. and Vapnik,V. (1995) Support-vector networks. Machine Learning, 20, 273-297.
26. Song,S.J., Heo,G.E., Kim,H.J. et al. (2014) Grounded feature selection for biomedical relation extraction by the combinative ap[...] Text Mining in Bioinformatics, ACM, New York, NY, USA, pp. 29-32.

27. Kim,S., Liu,H., Yeganova,L. et al. (2015) Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach. J. Biomed. Inf., 55, 23-30.
28. Kambhatla,N. (2004) Combining lexical, syntactic, and semantic features with maximum entropy models for extracting rela[...] sessions. Association for Computational Linguistics, Stroudsburg, PA, USA, Barcelona, p. 22.
29. GuoDong,Z., Jian,S., Jie,Z. et al. (2005) Exploring various knowledge in relation extraction. The 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, Michigan, USA, pp. 427-434.
30. Jiang,J. and Zhai,C. (2007) A systematic exploration of the fea[...] Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics. Rochester, NY, USA, April 22-27, 2007, pp. 113-120.
31. Miwa,M., Sætre,R., Kim,J.D. et al. (2010) Event extraction with complex event classification using rich features. J. Bioinf. Comput. Biol., 8, 131-146.
32. Lee,H.C., Hsu,Y.Y., and Kao,H.Y. (2015) An enhanced CRF-based system for disease name entity recognition and normalization on BioCreative V DNER Task. The Fifth BioCreative [...]
34. Xu,J., Wu,Y., Zhang,Y. et al. (2015) UTH-CCB@BioCreative V CDR Task: identifying chemical-induced disease relations in biomedical text. Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, Seville, Spain, 9-11 September 2015, pp. 254-259.
35. Liu,X., Zhou,M., Wei,F. et al. (2012) Joint inference of named entity recognition and normalization for tweets. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, pp. 526-535.
36. Khalid,M.A., Jijkoun,V., and De Rijke,M. (2008) The impact of named entity normalization on information retrieval for question answering. Adv. Inform. Retrieval, 705-710. Springer, Berlin Heidelberg.
37. Wei,W.Q., Cronin,R.M., Xu,H. et al. (2013) Development and evaluation of an ensemble resource linking medications to their indications. J. Am. Med. Inform. Assoc., 20, 954-961.
38. Kuhn,M., Campillos,M., Letunic,I. et al. (2010) A side effect resource to capture phenotypic effects of drugs. Mol. Syst. Biol., 6, 343.


doi 10.1093/bioinformatics/xxxxxx Advance Access Publication Date: Day Month Year

Original paper

Sequence analysis

SKIPHOS: non-kinase specific phosphorylation site prediction with random forests and amino acid skip-gram embeddings

T han h Hai D a n g t’*, Quang Thinh T ra c t, Huy Kinh P han , Manh Cuong

N guyen and Quynh Trang Pham Thi

Faculty of Information Technology, VNU University ol Engineering and Technology, Hanoi, Vietnam.

‘ To whom correspondence should be addressed,

t Authors contributed equally to this work.

Associate Editor: x x x x x x x

Received on XXXXX; revised on XXXXX; accepted on x x x x x

Abstract

Motivation: Phosphorylation, which is catalyzed by kinase proteins, is one of the two most common and widely studied types of known essential post-translational protein modification (PTM). Phosphorylation is known to regulate most cellular processes such as protein synthesis, cell division, signal transduction, cell growth, development and aging. Various phosphorylation site prediction models have been developed, which can be broadly categorized as being kinase-specific or non-kinase specific (general). Unlike the latter, the former requires a large enough number of experimentally known phosphorylation sites annotated with a given kinase for training the model, which is not the case in reality: less than 3% of the phosphorylation sites known to date have been annotated with a responsible kinase. To date, only a few non-kinase specific phosphorylation site prediction models have been proposed.

Results: This paper proposes SKIPHOS, a non-kinase specific phosphorylation site prediction model based on random forests on top of a continuous distributed representation of amino acids. Experimental results on the benchmark dataset and the independent test set demonstrate that SKIPHOS compares favorably to recent state-of-the-art related methods for three phosphorylation residues. Although trained on phosphorylation sites in mammals, SKIPHOS can yield better predictions for Y residues than PHOSFER, a recently proposed plant-specific phosphorylation prediction model.

Availability and Implementation: The SKIPHOS Web Server is freely available for non-commercial use at http://fit.uet.vnu.edu.vn/SKIPHOS or http://112.137.130.46:5000.

Contact: hai.dang@vnu.edu.vn

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Among the known essential post-translational protein modification (PTM) types, phosphorylation is one of the two most common and widely studied (Khoury et al., 2011). A protein kinase catalyzes phosphorylation by adding a phosphate group to certain protein substrates on specific residues, including serine (S), threonine (T) and tyrosine (Y). Phosphorylation is known to regulate most cellular processes such as protein synthesis, cell division, signal transduction, cell growth, development and aging (Hunter, 2000). At least 30% of all human proteins are likely to be phosphorylated, and about 518 protein kinases are encoded in the human genome (Manning et al., 2002; Ardito et al., 2017). The mouse proteome has more than 540 putative protein kinases (Caenepeel et al., 2004), while plant genomes encode more than 1,000 protein kinases (Vlad et al., 2008).

An increasing number of phosphorylation sites in various species have been experimentally validated, collected and compiled into specialized databases, motivating the bioinformatics community to develop advanced in silico prediction models as fast, lower-cost and efficient complements. As a result, various phosphorylation site prediction models have been developed over the past years.

Those models can be broadly categorized as being kinase-specific or non-kinase specific (general). The former aims at building computational models that predict whether a residue is phosphorylated by a given kinase, while the latter predicts phosphorylation irrespective of the kinase. The former thus requires a large enough number of experimentally known phosphorylation sites with a known catalyzing kinase for model training. This guarantees that the resulting trained models give satisfactory and consistent kinase-specific phosphorylation predictions. However, less than 3% of the phosphorylation sites known to date have been annotated with information about the responsible kinases (Newman et al., 2013). As a consequence, the number of kinases known to phosphorylate a large number of residues is still limited. For example, Phospho.ELM version 9.0 (Dinkel et al., 2010), the benchmark dataset for most phosphorylation prediction studies to date (Trost and Kusalik, 2011; Dou et al., 2014; Ismail et al., 2016; Song et al., 2017), has only 9 kinases that each catalyze more than 100 phosphorylation sites. These 9 kinases account for 1,616 phosphorylated residues out of 42,500. Over the last decade, the genomes of an increasing number of non-model organisms have been sequenced thanks to the development of next-generation sequencing technologies, leading to more protein kinases and putative phosphorylation sites being identified. Therefore, novel non-kinase specific phosphorylation site prediction models are in high demand as an essential initial phase in phosphorylation studies for a wide range of species (Trost and Kusalik, 2011).

To date, a few non-kinase specific phosphorylation site prediction models have been proposed. Most of them employ advanced machine learning algorithms, such as neural networks in NetPhos (Blom et al., 1999), Support Vector Machines in Musite (Gao et al., 2010), PPRED (Biswas et al., 2010) and PhosphoSVM (Dou et al., 2014), and random forests in PHOSFER (Trost and Kusalik, 2013) and RFPhos (Ismail et al., 2016). Note that most efforts in the development of phosphorylation prediction models are focused on kinase-specific prediction (Trost and Kusalik, 2011; Song et al., 2017). However, kinase-specific models, when re-adapted for non-kinase specific prediction, often generate more false positives (Dou et al., 2014).

To this end, this paper introduces a non-kinase specific phosphorylation site prediction model based on random forests on top of a continuous distributed representation of amino acids. Experimental results demonstrate that our model compares favorably to three recent state-of-the-art methods, namely PhosphoSVM (Dou et al., 2014), iPhos-PseEn (Qiu et al., 2016) and RFPhos (Ismail et al., 2016). Our method outperforms PhosphoSVM, RFPhos and iPhos-PseEn in predictions for S, Y and T residues in terms of overall scoring metrics.

Experimentally validated phosphorylation sites were extracted from Phospho.ELM version 9.0 (Dinkel et al., 2010), the benchmark dataset for most phosphorylation prediction studies to date (Trost and Kusalik, 2011; Dou et al., 2014; Ismail et al., 2016; Song et al., 2017). All redundant protein sequences were eliminated by CD-HIT (Fu et al., 2012) with a cutoff of 70% sequence identity. The total numbers of protein sequences and phosphorylation sites remaining for downstream analyses after the redundancy removal are listed in Table 1. For each potential phosphorylation residue (S, Y and T), surrounding windows of certain sizes centered at that residue are extracted. The resulting subsequences are taken as input to CD-HIT with a 70% identity cutoff to keep only non-redundant subsequences. A subsequence that has a verified phosphorylation site in the middle is considered a positive, otherwise a negative. In order to avoid bias, CD-HIT with a 30% identity cutoff was applied to each of the positive and negative sets to remove redundant subsequences. Because the number of negative subsequences is much larger than that of positive subsequences for S/Y/T (Dou et al., 2014), a subset of negative subsequences was randomly selected such that the ratio of negatives to corresponding positives is 1:1 for each of S/Y/T. This ratio has been demonstrated to be optimal for phosphorylation site prediction models (Biswas et al., 2010). Our dataset (called P.ELM) for cross-validating SKIPHOS is made of these remaining non-redundant positive and negative subsequences.

Table 1 The number of potential phosphorylation sites in non-redundant protein sequences from the benchmark dataset P.ELM

Residue    Number of sequences    Number of sites

For the independent test set (called PPA), non-redundant Arabidopsis thaliana protein sequences from PhosphAt version 3.0 (Zulawski et al., 2012) are used to extract experimentally verified positive and negative subsequences. Note that P.ELM contains phosphorylation sites from mammals whereas PPA contains those from Arabidopsis thaliana. These two datasets have been shown to be independent of each other (Dou et al., 2014). The numbers of Ser, Thr and Tyr subsequences for specific window sizes in P.ELM and PPA are provided in Table 2. The chosen sizes are the same as in PhosphoSVM and RFPhos, two recent state-of-the-art methods to which we compared our model.

Table 2 The number of non-redundant known phosphorylation sites for different context window sizes in the benchmark dataset P.ELM and the independent test set PPA

Dataset    Residue    Window size    Positive number    Negative number
                      21             12657              12657
                      15             1191               1191
                      9              321                321
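The window extraction and 1:1 negative subsampling described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy sequence and the rule of skipping residues too close to a sequence end are assumptions, and the CD-HIT redundancy filtering is left out.

```python
import random

def extract_windows(sequence, residue, window, sites):
    """Return (subsequence, label) pairs for every occurrence of `residue`.

    `sites` holds 0-based positions of verified phosphorylation sites.
    Residues too close to either end to fill a whole window are skipped
    (an assumption; the source does not say how sequence ends are handled).
    """
    half = window // 2
    samples = []
    for pos, aa in enumerate(sequence):
        if aa != residue or pos < half or pos + half >= len(sequence):
            continue
        sub = sequence[pos - half : pos + half + 1]
        samples.append((sub, 1 if pos in sites else 0))
    return samples

def balance(samples, seed=0):
    """Randomly subsample negatives so that negatives:positives is 1:1."""
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    rng = random.Random(seed)
    kept = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + kept

# Hypothetical toy protein with verified sites at positions 2 and 11.
toy = "MKSAASLLSGGSPTASKK"
samples = extract_windows(toy, "S", window=5, sites={2, 11})
balanced = balance(samples)
```

Fixing the random seed keeps the selected negative subset reproducible across runs, which matters when the same dataset is reused for repeated cross-validation.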

2.2 Random forests based prediction

Random Forest is a popular ensemble algorithm for classification and regression on high-dimensional data (Breiman, 2001). This algorithm constructs a number of decision trees during the training phase and uses the majority vote for prediction. Trees are constructed using bootstrap samples with randomly selected features from the training dataset. The tree construction is guided by the Gini impurity index calculated for each of the selected features. Various recent bioinformatics studies have employed random forests, demonstrating their benefit and robustness for high-dimensional datasets (Ismail et al., 2016; Song et al., 2017).

In this study, a random forest with 500 decision trees is used to predict phosphorylation sites from rich features derived from subsequences, including amino acid embeddings, Composition, Transition and Distribution features, Sequence Order Coupling Number features, Quasi Sequence Order features and protein disorder features. The model is implemented using a popular machine learning tool called scikit-learn (version ) (Buitinck et al., 2013).
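The three ingredients named above (Gini impurity, bootstrap sampling and majority voting) can be illustrated with a small standard-library sketch. It is not a full random forest and not the authors' implementation; all function names are illustrative assumptions.

```python
import random
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting one feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (one bootstrap sample per tree)."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(votes):
    """Combine the class votes of the individual trees."""
    return Counter(votes).most_common(1)[0][0]

# A perfectly separating threshold yields zero impurity.
perfect = split_impurity([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], threshold=0.5)
sample = bootstrap_indices(10, random.Random(0))
```

In scikit-learn, the setting described in the text would correspond to `RandomForestClassifier(n_estimators=500)`, whose default split criterion is the Gini impurity.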

2.2.1 Feature extraction

Composition, Transition and Distribution (CTD)

In 1995, Dubchak et al. introduced the Composition, Transition and Distribution (CTD) features for predicting protein folding, which are based on 7 physicochemical properties of amino acids, namely charge, hydrophobicity, normalized van der Waals volume, polarity, polarizability, secondary structure and solvent accessibility (Dubchak et al., 1995). Based on a given physicochemical property, the twenty amino acids can be categorized into three groups (i.e. 1, 2 and 3). Each amino acid is then encoded as 1, 2 or 3 according to the group it belongs to. For example, based on the charge property, the subsequence "LLAKKGYQERDLE" is encoded as "1113311123212". For each of the 7 physicochemical properties, three types of features can be derived for a subsequence of length L (Li et al., 2006; Chou, 2011; Cao et al., 2013), including:

• Composition of a given group (namely 1, 2 or 3) is the global percentage of that group in the subsequence and is calculated as follows:

$$C_t = \frac{N_t}{L}, \quad t = 1, 2, 3$$

where N_t is the number of times group t appears in the subsequence.

• Transition for a given pair of groups (t, v) characterizes the percent frequency with which group t is followed by group v or vice versa. It is calculated as follows:

$$T_{t,v} = \frac{N_{t,v} + N_{v,t}}{L - 1}, \quad t, v = 1, 2, 3$$

where N_{t,v} is the number of times group t is followed by group v.

• Distribution descriptor of each group comprises five values, i.e. the fractions of the subsequence where the group is located for the first time, and where 25%, 50%, 75% and 100% of the group are included.
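The three CTD descriptors above can be sketched for the charge property as follows. The group assignment (K and R in group 3, D and E in group 2, all others in group 1) is inferred from the encoded example in the text and is an assumption, as is the rounding rule used to locate the 25%, 50% and 75% occurrences.

```python
# Charge grouping inferred from the example "LLAKKGYQERDLE" -> "1113311123212".
CHARGE_GROUPS = {"1": "ACFGHILMNPQSTVWY", "2": "DE", "3": "KR"}
AA_TO_GROUP = {aa: g for g, members in CHARGE_GROUPS.items() for aa in members}

def encode(subsequence):
    """Replace every amino acid by the label of its group."""
    return "".join(AA_TO_GROUP[aa] for aa in subsequence)

def composition(encoded):
    """C_t = N_t / L for t = 1, 2, 3."""
    return {g: encoded.count(g) / len(encoded) for g in "123"}

def transition(encoded):
    """T_{t,v} = (N_{t,v} + N_{v,t}) / (L - 1) for each unordered group pair."""
    steps = list(zip(encoded, encoded[1:]))
    result = {}
    for t, v in (("1", "2"), ("1", "3"), ("2", "3")):
        n = sum(1 for a, b in steps if (a, b) in ((t, v), (v, t)))
        result[t + v] = n / (len(encoded) - 1)
    return result

def distribution(encoded):
    """Relative positions of the first, 25%, 50%, 75% and 100% occurrence."""
    result = {}
    for g in "123":
        positions = [i + 1 for i, c in enumerate(encoded) if c == g]
        n = len(positions)
        if n == 0:
            result[g] = [0.0] * 5
            continue
        # Assumed rounding rule: the floor(n * p)-th occurrence, at least the first.
        ranks = [1] + [max(1, int(n * p)) for p in (0.25, 0.5, 0.75, 1.0)]
        result[g] = [positions[r - 1] / len(encoded) for r in ranks]
    return result

encoded = encode("LLAKKGYQERDLE")
comp = composition(encoded)
trans = transition(encoded)
dist = distribution(encoded)
```

Per property this gives 3 composition, 3 transition and 15 distribution values; repeating the procedure for all 7 properties would need the remaining six group tables, which are not reproduced here.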

Sequence Order Coupling Number (SOCN)

Using the Schneider-Wrede physicochemical distance matrix (Schneider and Wrede, 1994) and the Grantham chemical distance matrix (Grantham, 1974), the k-th rank Sequence Order Coupling Number of a subsequence of L amino acids is calculated as follows:

$$\tau_k = \sum_{t=1}^{L-k} \left(d_{t,t+k}\right)^2, \quad k = 1, 2, \ldots, m$$

where d_{t,t+k} is the distance between the amino acids at position t and position t + k, and m = 30 is the maximum lag.

Quasi Sequence Order (QSO)

The quasi sequence order comprises two types of features: the first 20 features reflect the frequency ratios of amino acids in a subsequence, and the remaining ones reflect the sequence order calculated on the Schneider-Wrede physicochemical distance matrix (Schneider and Wrede, 1994) and the Grantham chemical distance matrix (Grantham, 1974).

The first twenty QSO features are calculated as:

$$X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w \sum_{k=1}^{m} \tau_k}, \quad r = 1, 2, \ldots, 20$$

where f_r is the normalized occurrence of amino acid type r in the subsequence, \tau_k is the k-th rank sequence order coupling number and w is a weighting factor.

(Iakoucheva et al., 2004). Many phosphorylation prediction studies have used protein disorder as an enriched feature to increase the model accuracy (Gao et al., 2010; Ismail et al., 2016). In this study, we use DISOPRED (Ward et al., 2004) to predict the disorder feature of protein sequences, and the disorder scores predicted for the amino acids within a subsequence are then extracted.

Amino acid embeddings (AAE)

In natural language processing, a word embedding is an algorithm to learn a high-dimensional dense vector representation for words from a very large textual corpus (i.e. the training corpus) with billions of words. Words with similar syntax and semantics are embedded to close vectors in the space. It works on the basic idea that the meaning of a word is affected by the surrounding words within its context.

Recently, Mikolov et al. introduced the Skip-gram model, a novel word embedding architecture based on the neural network language model (Mikolov et al., 2013a). Since then, Skip-gram has been employed in numerous natural language processing studies, demonstrating its power and effectiveness in providing good vector representations of words in terms of syntactic and semantic relationships.

Given a sentence of N words w_1, w_2, ..., w_N in the training corpus, the word embedding aims to maximize the probability of the observed contexts conditioned on each of these N words at the center:

$$\frac{1}{N} \sum_{t=1}^{N} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

where c is the size of the context window. The conditional probability is defined by the softmax function:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where v_w and v'_w are the input and output vector representations of word w, and W is the number of words in the vocabulary of the training corpus. For the sake of computational efficiency, this full softmax function is approximated with the hierarchical softmax (Morin and Bengio, 2005), in which all W words are represented as leaves of a binary Huffman tree. The Skip-gram model was then further improved with negative sampling, in which the log probability given by the softmax is replaced with the following (Mikolov et al., 2013b):

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

where \sigma is the logistic sigmoid function, the noise distribution P_n(w) was empirically chosen to be the unigram distribution U(w) raised to the 3/4 power (i.e. U(w)^{3/4}/Z), and k is a predefined number of negative samples for each data sample. The authors have experimentally shown that k should be in the range 5-20 for small training datasets and 2-5 for large datasets (Mikolov et al., 2013b).

In the context of protein bioinformatics, we note that protein sequences or peptides can be considered as "biological" sentences in which each amino acid acts as a distinct "biological" word. The functions of each amino acid on a protein sequence/peptide are affected by the neighboring amino acids surrounding it. In this regard, the protein sequences remaining after the redundancy removal were used as the training corpus for the Skip-gram. We employ word2vec (Mikolov et al., 2013a,b), which implements the state-of-the-art Skip-gram model, to learn continuous vector representations of 300 dimensions for the 20 amino acids. This number was determined experimentally.
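To make the "amino acids as words" idea concrete, the sketch below builds the (center, context) training pairs that a Skip-gram model would see for a protein sequence, and the unigram noise distribution raised to the 3/4 power used for negative sampling. It only prepares the training signal; it does not train embeddings and is not the word2vec implementation used in the study. The function names and toy inputs are assumptions.

```python
from collections import Counter

def skipgram_pairs(sequence, window):
    """List (center, context) amino-acid pairs within `window` positions."""
    pairs = []
    for t, center in enumerate(sequence):
        lo = max(0, t - window)
        hi = min(len(sequence), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, sequence[j]))
    return pairs

def noise_distribution(sequences, power=0.75):
    """P_n(w) = U(w)^power / Z computed from amino-acid counts."""
    counts = Counter(aa for seq in sequences for aa in seq)
    weights = {aa: c ** power for aa, c in counts.items()}
    z = sum(weights.values())
    return {aa: w / z for aa, w in weights.items()}

pairs = skipgram_pairs("ACD", window=1)
noise = noise_distribution(["AAB"])
```

Raising the counts to the 3/4 power flattens the distribution, so frequent amino acids are drawn as negatives somewhat less often than their raw frequency would suggest.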

In the performance scores, TP, TN, FP and FN respectively represent the numbers of true positives, true negatives, false positives and false negatives in the confusion matrix. We run a 10-fold cross-validation procedure 30 times and report the averages of the resulting performance scores for evaluation.
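The score definitions themselves are not reproduced in the text above. As a minimal sketch, assuming the standard definitions of the metrics reported in the result tables (recall, precision, accuracy, F1 and MCC), they can be computed from the confusion matrix counts as follows; the function name and the example counts are hypothetical.

```python
import math

def scores(tp, tn, fp, fn):
    """Standard binary classification scores from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }

example = scores(tp=40, tn=30, fp=10, fn=20)
```

For repeated cross-validation, these scores would be computed per fold and averaged over all folds and repetitions.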

3 Results and Discussion

Our proposed general phosphorylation site prediction model SKIPHOS uses subsequences of 9, 15 and 19 amino acids centered at S, Y and T, respectively. These three lengths have been experimentally demonstrated to produce the best performance for S, Y and T, respectively, when compared with other lengths. We present our model's performance using the lengths 21, 9 and 9 for S, Y and T, respectively, only as an extra reference for the comparison with two recent state-of-the-art models, namely RFPhos (Ismail et al., 2016) and PhosphoSVM (Dou et al., 2014). These extra lengths allow our models to work on subsequences of the same lengths as these two models: RFPhos uses subsequences of 9 amino acids for S, Y and T, while PhosphoSVM uses subsequences of 21, 15 and 19 for S, Y and T, respectively. We re-implemented these two models for the (cross-validated) comparison with SKIPHOS on our subsequence datasets. The reasons are: (i) the authors of RFPhos only provide three trained models (for S, Y and T), coupled with the subsequence dataset on which their models were trained, and (ii) the authors of PhosphoSVM provide neither, nor do they release its source code.

Experimental results show that our proposed model yields favorable performance on non-kinase specific prediction of S, Y and T phosphorylation sites when compared to iPhos-PseEn (Qiu et al., 2016), RFPhos (Ismail et al., 2016) and PhosphoSVM (Dou et al., 2014), three recent state-of-the-art models.

For 10-fold cross-validation on the subsequence dataset of RFPhos, SKIPHOS yields excellent prediction performance. It achieves AUC values of 90%, 91.7% and 91.3% for S, Y and T residues, respectively, which are better than those of both RFPhos (i.e. 88%, 91% and 90%) and PhosphoSVM (i.e. 84%, 74% and 82%). Further, when using random forests of 100 decision trees, which is the same number as in RFPhos, SKIPHOS yields AUC values of 89.5%, 91.3% and 90.8%, respectively, still outperforming RFPhos. This demonstrates the predictive power of SKIPHOS's features. Note that the RFPhos model implemented by us performs exactly on par with the trained model provided by the RFPhos authors (data not shown), which supports the correctness of our re-implementation.

Further, SKIPHOS is also compared with iPhos-PseEn (Qiu et al., 2016), a human-specific non-kinase phosphorylation site predictor based on ensemble random forests. The same 5-fold cross-validation scheme as used in iPhos-PseEn is employed for SKIPHOS on the dataset provided by iPhos-PseEn. In this setting, SKIPHOS yields excellent performance for S, Y and T in terms of AUC, i.e. 91.96%, 88.23% and 84.43%, respectively. Prediction results show that SKIPHOS outperforms iPhos-PseEn for S and Y, and vice versa for T. The prediction accuracy values of iPhos-PseEn are all less than 80% (79.76% for S, 76.28% for Y and 79.88% for T), while those of SKIPHOS are 86.66% for S, 80.52% for Y and 76.28% for T. We note that the MCC values of SKIPHOS are much better than those of iPhos-PseEn (see Table 3 for more details).

Table 3 Performance of SKIPHOS and iPhos-PseEn (Qiu et al., 2016) on the dataset provided by the iPhos-PseEn authors. The best value of each scoring metric is in bold.

For cross-validation on our P.ELM subsequence dataset, SKIPHOS yields good prediction performance for Y (AUC = 75.5%) and very good performance for S (AUC = 84.5%) and T (AUC = 84.4%). It still outperforms PhosphoSVM (for S, Y and T sites) and RFPhos (for both Y and T) in terms of the overall scoring metrics F1, AUC and MCC (see Table 4 for more details).

Table 4 Performance of SKIPHOS in comparison with two recent state-of-the-art related models using 10-fold cross-validation on the benchmark dataset P.ELM. (*) indicates the use of the same context window sizes as in SKIPHOS, i.e. 15 for Y and 19 for T. The best value of each scoring metric is in bold.

Residue = Y
Methods       F1      AUC     MCC     Recall  Precision
SKIPHOS       0.700   0.755   0.396   0.711   0.691
RFPhos*       0.660   0.713   0.318   0.668   0.654
PhosphoSVM    0.627   0.677   0.253   0.628   0.627
RFPhos        0.607   0.656   0.226   0.603   0.616

Residue = T
Methods       F1      AUC     MCC     Recall  Precision
SKIPHOS       0.765   0.844   0.547   0.744   0.788
RFPhos*       0.747   0.824   0.502   0.741   0.753
PhosphoSVM    0.729   0.804   0.464   0.720   0.738
RFPhos        0.747   0.815   0.475   0.784   0.716

Residue = S
Methods       F1      AUC     MCC     Recall  Precision
SKIPHOS       0.765   0.845   0.521   0.785   0.749
PhosphoSVM    0.743   0.819   0.499   0.724   0.762
RFPhos        0.781   0.842   0.547   0.816   0.751

Fig. 1 The ROC curves of SKIPHOS and two recent state-of-the-art related models using 10-fold cross-validation on the benchmark dataset P.ELM. (*) indicates the use of the same context window sizes as in SKIPHOS, i.e. 15 for Y and 19 for T.

Table 5 Performance of SKIPHOS with the use of different feature types. All models are tested with 10-fold cross-validation on the benchmark dataset P.ELM. * indicates that DIS is replaced with QSO in the case of T.

Residue = Y
Feature         AUC     F1      Recall  MCC     Accuracy
AAE             0.653   0.607   0.603   0.225   0.612
AAE+CTD         0.720   0.666   0.677   0.326   0.663
AAE+DIS         0.677   0.632   0.620   0.281   0.640
AAE+QSO         0.667   0.617   0.613   0.244   0.622
AAE+SOCN        0.661   0.613   0.608   0.239   0.619
AAE+CTD+DIS*    0.748   0.694   0.707   0.382   0.690
CTD             0.565   0.580   0.587   0.157   0.578
DIS             0.616   0.588   0.587   0.182   0.591
QSO             0.676   0.630   0.639   0.256   0.627
SOCN            0.545   0.529   0.532   0.059   0.529
All-AAE         0.739   0.683   0.685   0.368   0.684

Residue = T
Feature         AUC     F1      Recall  MCC     Accuracy
AAE             0.804   0.706   0.644   0.476   0.734
AAE+CTD         0.835   0.752   0.713   0.536   0.767
AAE+DIS         0.803   0.718   0.677   0.475   0.736
AAE+QSO         0.808   0.709   0.647   0.479   0.736
AAE+SOCN        0.807   0.707   0.643   0.478   0.735
AAE+CTD+DIS*    0.838   0.754   0.714   0.541   0.769
CTD             0.581   0.582   0.581   0.171   0.585
DIS             0.649   0.631   0.666   0.228   0.613
QSO             0.769   0.715   0.723   0.430   0.715
SOCN            0.604   0.576   0.584   0.148   0.574
All-AAE         0.80)   0.741   0.758   0.476   0.737

Residue = S
Feature         AUC     F1      Recall  MCC     Accuracy
AAE             0.801   0.714   0.666   0.476   0.736
AAE+CTD         0.815   0.725   0.682   0.490   0.744
AAE+DIS         0.809   0.727   0.690   0.489   0.744
AAE+QSO         0.805   0.716   0.668   0.478   0.737
AAE+SOCN        0.804   0.716   0.669   0.478   0.737
AAE+CTD+DIS*    0.824   0.741   0.711   0.509   0.754
CTD             0.659   0.637   0.651   0.264   0.632
DIS             0.645   0.651   0.729   0.233   0.612
QSO             0.766   0.712   0.730   0.417   0.708
SOCN            0.614   0.584   0.587   0.170   0.585
All-AAE         0.810   0.751   0.787   0.484   0.741

For S residues, it performs on par with RFPhos. However, SKIPHOS predicts better than RFPhos at low false positive rates (i.e. <20%) (see Figure 1). It can be argued that the better performance of SKIPHOS over RFPhos for Y and T may come from the larger context windows used by SKIPHOS. We thus evaluate RFPhos with the larger context windows used in SKIPHOS, i.e. 15 amino acids for Y and 19 for T. This variant of RFPhos still performs worse than ours, demonstrating the great utility of the features used in SKIPHOS (Table 4).

Interestingly, experimental results showed that extending the context windows surrounding Y and T brings significant performance improvements to both SKIPHOS and RFPhos. The opposite holds for S. This phenomenon suggests that the factors determining the phosphorylation status of S residues are likely to be located within windows of only 9 amino acids centered at them. For Y and T residues, these windows are much larger, i.e. 15 and 19, respectively.

We evaluate the impact of every feature type on SKIPHOS in the prediction of non-kinase specific phosphorylation sites by 10-fold cross-validating SKIPHOS with each of them. Table 5 shows the greatest impact of the amino acid embeddings, which contribute up to 96.6%, 86.5% and 95.3% of the SKIPHOS predictive capacity for S, Y and T, respectively. Among all feature types, the amino acid embeddings contribute most to the predictive strength of SKIPHOS for S and T. For Y, they take second place, slightly behind the quasi sequence order features. Surprisingly, the contributions of the amino acid embeddings to SKIPHOS in the prediction of S and T are on par with those of all the remaining features together. Note that the amino acid embeddings are calculated offline only once, while all the remaining features are calculated on the given protein sequences. This is useful when using SKIPHOS to make predictions for a newly given protein sequence.
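The text does not state how the contribution percentages are computed. One reading, offered here purely as an assumption, is the ratio of the AUC obtained with the amino acid embeddings alone (Table 5) to the AUC of the full SKIPHOS model (Table 4). The sketch below works through that arithmetic.

```python
# AUC values copied from Table 4 (full SKIPHOS) and Table 5 (AAE only).
FULL_AUC = {"S": 0.845, "Y": 0.755, "T": 0.844}
AAE_AUC = {"S": 0.801, "Y": 0.653, "T": 0.804}

def contribution(residue):
    """Assumed definition: AAE-only AUC as a percentage of the full-model AUC."""
    return 100.0 * AAE_AUC[residue] / FULL_AUC[residue]

shares = {residue: round(contribution(residue), 1) for residue in "SYT"}
```

Under this assumed reading, the values for Y (86.5%) and T (95.3%) match the reported figures, whereas S comes out at 94.8% rather than the reported 96.6%, so a different reference value may have been used for S.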

3.1 Performance on the independent test set

SKIPHOS, PhosphoSVM re-implemented by us and the trained RFPhos

is also the case for PhosphoSVM, whose AUC values for S, Y and T are 66%, 57.8% and 63.1%, respectively.

Looking deeper into the ROC curves of the three models in Figure 2, it can be observed that the ROC curves of SKIPHOS for S, Y and T are above those of both RFPhos and PhosphoSVM in the upper left regions, in which recall values are high (say >= 50%) and false positive rates (FPR) are low (say <= 40%), implying better performance. Within the lower left regions (recall <= 32.5% and FPR <= 18%), SKIPHOS performs better than RFPhos and PhosphoSVM, except for the case of T predicted by PhosphoSVM.
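Comparisons of this kind read recall at a bounded false positive rate off the ROC curve. The following minimal sketch shows how ROC points, the AUC (trapezoidal rule) and the best recall within an FPR limit can be derived from prediction scores. The function names and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def roc_points(labels, scores):
    """ROC points (FPR, recall), one per distinct score threshold."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at the current threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

def recall_at_fpr(points, max_fpr):
    """Best recall among ROC points whose FPR does not exceed `max_fpr`."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)

toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(toy_labels, toy_scores)
```

With these helpers, two models can be compared in a chosen ROC region by evaluating `recall_at_fpr` at the same FPR limit for each of them.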

Compared with the plant-specific model PHOSFER, the SKIPHOS performance is better only at recall values greater than 35% (associated with FPR > 21%) for Y and only up to 59% (FPR 37.5%) for T. The ROC curve of SKIPHOS for S is entirely under that of PHOSFER, implying worse performance, which is not a surprise since PHOSFER is trained on a much larger training dataset from 9 organisms including plants, while SKIPHOS is not trained on plant phosphorylation sites. However, this fact, in turn, demonstrates the predictive strength and stability of SKIPHOS, as it can yield predictions in the upper left regions (high recall and acceptably low FPR) of the ROC curves that are better than those of PHOSFER for Y.

Fig. 2 The ROC curves of SKIPHOS and recent state-of-the-art related models on the independent dataset PPA.

We developed a web server with a user-friendly graphical interface for SKIPHOS and deployed it online, freely accessible for non-commercial use at http://fit.uet.vnu.edu.vn/SKIPHOS. Users only need to provide a protein sequence and choose the phosphorylation types for which they want SKIPHOS to make predictions. When completed, the web server returns a list of all phosphorylation residues predicted for each of the chosen types.

5 Conclusion

In this paper we present SKIPHOS, a novel computational model for non-kinase specific prediction of phosphorylation sites using random forests and amino acid skip-gram embeddings. Experimental results from rigorous validation schemes demonstrate the favorable strength and stability of SKIPHOS when compared to recent state-of-the-art related models, namely PhosphoSVM (Dou et al., 2014), iPhos-PseEn (Qiu et al., 2016) and RFPhos (Ismail et al., 2016). The SKIPHOS performance cross-validated on the benchmark dataset is better than that of iPhos-PseEn, RFPhos and PhosphoSVM for all cases, except for S residues when compared with RFPhos, with which on-par performance is observed. However, SKIPHOS outperforms both RFPhos and PhosphoSVM on the independent dataset of phosphorylation sites in plants. Surprisingly, SKIPHOS can yield high-recall predictions for Y and T that are better than those of PHOSFER. Note that PHOSFER is trained on a large dataset containing phosphorylation sites in plants, whereas SKIPHOS is only trained on a smaller dataset of those in mammals.

We anticipate that SKIPHOS, with its freely available web server, will facilitate other basic and/or translational research related to the identification of phosphorylation sites, accelerating discoveries of new important biochemical insights at low cost.

Acknowledgements

This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.15.21. We would like to thank the author of RFPhos, Dr Hamid D. Ismail, Research Associate at the Department of Animal Sciences, North Carolina Agricultural and Technical State University, for sending us their trained RFPhos model and dataset.

References

Ardito, F., Giuliani, M., Perrone, D., Troiano, G. and Muzio, L. L. (2017) The crucial role of protein phosphorylation in cell signaling and its use as targeted therapy (Review). International Journal of Molecular Medicine, 40(2), 271-280.

Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O. and Layton, R. (2013) API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238.

Breiman, L. (2001) Random forests. Machine Learning, 45(1), 5-32.
