Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.Dự đoán liên kết trong mạng hỗn tạp và ứng dụng dự đoán mối quan hệ giữa RNA không mã hóa và bệnh.MINISTRY OF EDUCATION AND TRAINING HANOI NATIONAL UNIVERSITY OF EDUCATION NGUYEN VAN TINH LINK PREDICTION IN HETEROGENEOUS INFORMATION NETWORKS AND ITS APPLICATIONS IN PREDICTING ASSOCIATIONS BETWEEN.
MINISTRY OF EDUCATION AND TRAINING HANOI NATIONAL UNIVERSITY OF EDUCATION NGUYEN VAN TINH LINK PREDICTION IN HETEROGENEOUS INFORMATION NETWORKS AND ITS APPLICATIONS IN PREDICTING ASSOCIATIONS BETWEEN NON-CODING RNAS AND DISEASES Major: Computer Science Major code: 9480101 SUMMARY OF DISSERTATION IN COMPUTER SCIENCE Hanoi, 2023 The dissertation was completed at the Faculty of Information Technology, Hanoi National University of Education Supervisors: Assoc Proc Dr Tran Dang Hung Dr Le Thi Tu Kien Reviewer 1: Assoc.Prof.Dr Nguyen Long Giang, Institute of Information Technology- Vietnam Academy of Science and Technology Reviewer 2: University Assoc.Prof.Dr Le Duc Hau, Thuyloi Reviewer 3: Assoc.Prof.Dr Nguyen Ngoc Hoa, University of Engineering and Technology-Vietnam National University, Hanoi (VNU) The dissertation will be defensed at The university committee: Hanoi National University of Education At: ……………… The dissertation can be read at: National Library of Vietnam, Hanoi, Vietnam or Library of Hanoi National University of education, Hanoi, Vietnam PUBLICATION [VTN1] Van Tinh Nguyen, Thi Tu Kien Le and Dang Hung Tran, "A new method on lncRNA-disease-miRNA tripartite graph to predict lncRNA-disease associations", 2020 12th International Conference on Knowledge and Systems Engineering (KSE), 2020, pp 287-293, doi: 10.1109/KSE50997.2020.9287563 (Scopus indexed) [VTN2] Van Tinh Nguyen, Thi Tu Kien Le, Tran Quoc Vinh Nguyen and Dang Hung Tran, “Inferring miRNA-disease associations using collaborative filtering and resource allocation on a tripartite graph”, BMC Med Genomics 14, 225 (2021) https://doi.org/10.1186/s12920-021-01078-8 (ISI Q2 journal) [VTN3] Van Tinh Nguyen and Dang Hung Tran, "An improved computational method for prediction of lncRNA-disease associations based on collaborative filtering and resource allocation", 2021 13th International Conference on Knowledge and Systems Engineering (KSE), 2021, pp 1-6, doi: 10.1109/KSE53942.2021.9648632 (Scopus indexed) [VTN4] Van Tinh Nguyen, Thi Tu Kien Le, Khoat Than and Dang Hung Tran, “Predicting miRNA–disease associations using improved random walk with restart and integrating multiple similarities”, Sci Rep 11, 21071 (2021) https://doi.org/10.1038/s41598-021-00677-w (ISI Q1 journal) INTRODUCTION Nowadays, we are in a connected world where data or objects’ information, actors or agents, object groups or component groups are interacted with each other to compose large networks These networks are complex They contain multiple types of nodes and multiple types of interactions These networks are called heterogeneous information networks (HINs) They are rich in semantic information and can be constructed from multiple data sources Analyzing of heterogeneous information network (HIN) generates a trendy research of mining of data, retrieving of information, link prediction, mining of graph, network science, and so forth… Link prediction is a crucial and active task in HIN analysis It benefits many researchers and organizations in a variety of fields The link prediction’s main objective is to discover absent links in a network or to forecast links which may soonly occur in a network Link prediction has been broadly applied in various domains from social networks to biological systems For biological systems, link prediction has been used to discover the relationships or associations among biological objects such as disease-phenotype/gene associations, drug-protein interactions, drug-miRNA associations, disease-drug associations, non-coding RNA-disease associations, and so forth Especially, for a long time, identifying non-coding RNAs (ncRNAs) in the human genome is difficult They were treated as noise However, ncRNAs play vital roles in life activities Identifying relationships between ncRNAs and diseases has exposed opportunities for therapeutic and diagnostic of human diseases Therefore, the studies of ncRNA-disease relationships have extensively been executed in recent years Conventional biological experiments make it costly, time- consuming, and laborious to discover potential ncRNA-disease relationships Therefore, it requires to have computational methods for identifying ncRNA-disease associations, especially associations between non-coding RNAs (ncRNAs) and diseases In the past few years, various computational methods for predicting ncRNA-disease associations have been developed Although actual computational methods have made massive benefits in revealing disease‐associated ncRNAs in each category and typically decrease the cost as well as time of biological experiments Howerver, there are still some limitations which are needed to be solved as follows Firstly, the computational approaches for predicting ncRNAdisease associations ought to deal with sparse data problem It bases on the reality that the known ncRNA-disease associations' number is quite smaller compared to the unknown associations Hence, it is difficult to obtain a reliable network to represent a reasonable biological network Therefore, it limits prediction accuracy Secondly, due to the sparsity data problem, it causes another issue that the is unbalancing of positive and negative samples in performing computational methods for predicting ncRNA-disease associations It is the reason that the prediction performance of computational methods is not very reliable Thirdly, the similarity calculation in existing computational methods depends excessively on known associations between ncRNAs and diseases It could generate the noticeable bias to construct computational models for predicting ncRNA-disease associations Therefore, it requires to reasonably fuse different similarity scores from different souces of biological information to enhance ncRNA-disease association prediction performance Fourthly, most of existing computational methods are not applicable to predict associations for isolated diseases or ncRNAs (miRNAs or lncRNAs) which have not any known association with other ncRNAs or other diseases in the examined data sets So, it is nessessary to combine different biological information to improve the capability of prediction ncRNA-disease association for isolated cases Fifthly, there are too many parameters that need to be adjusted in many computational methods leading to the difficulty in performing ncRNA-disease association prediction It means that the researchers need to develop more computational methods which will be easier to employ in ncRNA-disease association prediction And finally, since more and more biological databases become available so it requiresto effectively fuse data from multiple data sources to enhance the reliability and performance of prediction Up to date, a numerous number of research are weekly published in scientific journals or conferences to show new results of research on developing computational methods for ncRNA-disease association prediction Many of them concentrates on solving the above mentioned limitations Additionally, based on the fact that selecting useful data from heterogeneous information to build up a reliable HIN is still a challenge, it remains room for scientists and researchers to research for constructing a reliable HIN and training an useful computational method to achieve more decisive performance of ncRNA-disease associations prediction It is the reason that the Ph.D student selects the topic “Link prediction in heterogeneous information networks and its applications in predicting associations between non-coding RNAs and diseases” for this dissertation • Dissertation objective and research problem Through this dissertation, the research will focus on: proposing computational methods or models to improve prediction performance for predicting human non-coding RNA-disease associations on heterogeneous information networks by solving the following problems: - Solving the sparse data problem to improve the accuracy of human ncRNA-disease associations prediction performance - Fusing multi-types of information from different biological datasets to have more realistic similarities and to decrease the impact of depending on known human ncRNA-disease associations excessively - Inheriting the computational methods from other domains such as predicting of microbe-disease associations, drug-disease relationships, drug-target interactions and so forth, and improving them to achieve better performance in predicting human non-coding RNA-disease associations • Thesis’s scientific contributions: The thesis has the following scientific contributions: Contribution 1: Proposed an improved computational model by combining collaborative filtering and a resource allocation process on a tripartite graph using multiple known associations' types to forecast ncRNA-disease associations Contribution 2: Proposed a new miRNA-disease associations prediction method which used a WKNKN algorithm to preprocess data to decrease unverified associations in miRNAdisease association dataset and uncover latent miRNA-disease associations using improved random walk with restart (RWR) algorithm and fusing multiple similarities from HINs The contribution is presented in Chapter of the dissertation, related contents of this contribution were published the Proceeding of the KSE2020 ([VTN1], Scopus Indexed), BMC Medical Genomics journal in 2021 ([VTN2], ISI Q2 journal), and Proceeding of the KSE2021 ([VTN3], Scopus Indexed) Thesis’s structure: Thesis structure is shown in the below figure CHAPTER BACKGROUND 1.1 Basic concepts 1.1.1 Heterogeneous information networks • Information network Definition 1.1 Information network An information network is represented by a graph 𝐺 = (𝑉, 𝐸) where 𝑉 represents set of nodes and 𝐸 represents set of edges of the graph The graph G contains an mapping function of object type, ϕ: V → A It also contains a mapping function of link type, ψ: E → R Each node v ϵ V has only one particular object type, ϕ(v) ϵ A Each link e ϵ E has only one particular link type, ψ(e) ϵ R If two links share the same starting and the same ending object type, they are same link type • Heterogeneous/Homogeneous information network Definition 1.2 Heterogeneous/Homogeneous information network If the information network contains moreover one object type or one link type, it is known as a HIN, typically |A|>1 or |R|>1; otherwise it is called a homogeneous information network, typically, |A|=1 and |R|=1 1.1.2 Biological systems Biological systems are a special class of heterogeneous information networks which consists of a large number of biological entities such as genes, miRNAs, lncRNAs, gene expressions, phenotypes, and so forth 1.1.3 Non-coding RNAs (ncRNAs) The RNAs which can not be transferred into proteins are referred to as non-coding RNAs Micro RNAs (miRNAs) MiRNAs are a class of single-stranded, endogenous, small, evolutionarily conserved ncRNAs with length of 22-26 nucleotides Long non-coding RNAs (lncRNAs) LncRNAs belong to a subclass of ncRNAs They lengthen more than 200 nucleotides 1.2 Link prediction in heterogeneous information networks 1.2.1 Link prediction problem Definition 1.5 Link prediction in heterogeneous information networks Giving a network which is represented by a graph 𝐺 = (𝑉1 ∪ 𝑉2 ∪ … ∪ 𝑉𝑀 , 𝐸1 ∪ 𝐸2 ∪ … ∪ 𝐸𝑁 ), where 𝑉𝑖 (𝑖 = 1,2, … , 𝑀) indicates the node set with type i, 𝐸𝑗 (𝑗 = 1,2, … , 𝑁) relects the link set with type j The link prediction task is to find if there will exist a link 𝑒𝑘 between 𝑣𝑖 (𝑣𝑖 ∈ 𝑉𝑖 ) and 𝑣𝑗 (𝑣𝑗 ∈ 𝑉𝑗 ) In this sense, 𝑒𝑘 reflects the prediction task's target link Input: The graph 𝐺 = (𝑉1 ∪ 𝑉2 ∪ … ∪ 𝑉𝑀 , 𝐸1 ∪ 𝐸2 ∪ … ∪ 𝐸𝑁 ): 𝑉𝑖 (𝑖 = 1,2, … , 𝑀) is the node set with type i and 𝐸𝑗 (𝑗 = 1,2, … , 𝑁) reflects the link set with type j Output: For any two potentially connected objects 𝑣𝑖 (𝑣𝑖 ∈ 𝑉𝑖 ) and 𝑣𝑗 (𝑣𝑗 ∈ 𝑉𝑗 ), whether link 𝑒𝑘 exists (1) or not exist (0)? 1.2.2 Link prediction methods In general, the link prediction methods can roughly be categorized into the following types: Network similarity-based methods, probabilistic and maximum likelihood-based methods, machine learning based methods and deep learning-based methods 1.2.3 Link prediction applications in biological systems In biological networks, link prediction is normally used in predicting associations between diseases and genes, prioritizing of disease candidate metabolites, drug development, predicting drugprotein interactions, identifying drug-miRNA associations, discovering disease-drug associations, predicting associations between non-coding RNAs and diseases, and so forth 1.3 Computational methods for predicting associations between non-coding RNAs and diseases 1.3.1 Predicting non coding RNA-disease association prediction as a link prediction problem Predicting ncRNA-disease associations problem is considered as a HIN-based link prediction problem It typically uses a heterogeneous network with multiple biological objects and relationships among 12 miRNA-disease associations, known lncRNA-disease associations, and virified miRNA-lncRNA interactions, a tripartite graph G is constructed During the second stage, an item-based CF algorithm is Figure 2.1 The proposed model's flowchart 13 applied on graph G0 to solve the problem of sparsity data and to produce a tripartite graph Gu At the third stage, a resource allocation algorithm uses the graph Gu to determine the final resource score of each disease's ncRNA candidates And finally, all ncRNA candidates’ resource score for each disease are ranked in descending order at the final stage, so that the candidate with higher resource score will have greater possibility of being verified in the future 2.4 Employing the proposed model to infer miRNA-disease associations based on collaborative filtering and resource allocation 2.4.1 Proposed model's stages in inferring miRNA-disease associations • Stage 1: Construction of a tripartite graph G0 • Stage 2: Construction of a tripartite graph Gu • Stage 3: Employing resource allocation process on the tripartite graph Gu to infer miRNA-disease associations • Stage 4: Ranking all candidate miRNAs’ Rscores for each disease in descending order 2.4.2 Proposed method’s experiments and results • Step Preparing datasets The proposed method used datasets which came from the study of Zhao et al There are 190 diseases, 111 lncRNAs, 264 miRNAs, 1880 known lncRNA-miRNA associations, 936 known associations between diseases and lncRNAs, and 3552 verified miRNA-disease associations in these datasets Figure 2.2 shows the relationships among different datasets and the number of data nodes used in the proposed method • Step 2: Implementing the proposed method and estimating its computational time complexity The proposed method was implemented by using the Python 14 Figure 2.2 The datasets and the numbers of data nodes in the proposed method programming language Its time complexity is equivalent to O(n3) It is polynomial time complexity • Step Evaluating prediction performance The 5-fold cross-validation experiments, and the Area under roc curve (AUC) as well as Area under precision-recall curve (AUPR) are used to evaluate the proposed method's performance in identifying miRNAdisease associations Evaluating the AUC under 5-fold-cross validation Figure 2.3 ROC curve and AUC value of the proposed method with γ = 0.9 in one experimental running time The proposed method achieves the best averaged value of 0.9788 15 with γ = 0.9 after performing the experiments for 10 times under 5fold-cross-validation Figure 2.3 illustrates ROC curve and AUC value of proposed method with γ = 0.9 in one experimental running time Evaluating the AUPR under 5-fold-cross validation The proposed method achieves the best averaged AUPR value 0.9373 with γ = 0.9 under 5-fold cross experiments Figure 2.4 shows Precision-Recall curve and AUPR value of proposed method with γ = 0.9 in one experimental running time Figure 2.4 Precision-Recall curve and AUPR value of the proposed method with γ = 0.9 in one experimental running time Comparing prediction performance with other related models The performance of proposed method is compared with the performance of DCSMDA and TPGLDA methods The performances of these methods are shown in the Table 2.1 Table 2.1 Performance comparison with other related methods Method AUC value AUPR value TPGLDA 0.9703 0.7421 DCSMDA 0.8155 The proposed method 0.9788 • Step Checking case studies 0.9373 16 Case studies on Prostatic Neoplasms, Heart Failure, Glioma and Open-angle Glaucoma are constructed to show the ability of our model in order to identify new disease-associated miRNAs 2.5 Employing the proposed model to predict lncRNA-disease associations based on collaborative filtering and resource allocation 2.5.1 Proposed model's stages in predicting lncRNA-disease associations • Stage 1: Construction of a tripartite graph G0 • Stage 2: Applying collaborative filtering algorithm on known lncRNA-disease associations and verified lncRNA-miRNA interactions to obtain a new tripartite graph Gu • Stage 3: Using improved resource allocation process to obtain predicted lncRNA-disease associations • Stage 4: Ranking all predicted lncRNAs for each disease to have final results 2.5.2 Proposed method’s experiments and results • Step Preparing datasets Figure 2.5 The relationships between the different data sources and the numbers of data nodes used in the proposed method The proposed method used the datasets came from studies of Fu et al and Yao et al These datasets contained 2697 tentatively known lncRNA-disease associations (LDAs), 13562 known miRNA-disease associations (MDAs) and 1002 verified lncRNA-miRNA interactions 17 (LMIs) among 240 lncRNAs, 420 diseases and 495 miRNAs Figure 2.5 illustrates the data nodes’ numbers and the relationships of the different data sources used in the proposed method • Step 2: Implementing the proposed method and estimating time complexity of proposed method This proposed method was also implemented by using the Python programming language The time complexity of the proposed method is O(nl*nd*nm) ≈ O(n3) It means its time complexity is polynomial • Step 3: Evaluating prediction performance The 5-fold cross-validation experiments were employed and AUC and AUPR were used to estimate the proposed method's performance Evaluating AUC under 5-fold cross-validation The new proposed method achieves the best AUC values when 𝛾 = 0.8 under 5-fold cross validation experiments The Figure 2.6 demonstrates the experimental running times' ROC curves as well as AUC values with 𝛾 = 0.8 Figure 2.6 The proposed method's ROC curves and AUC values in running times of experiments with 𝛾 = 0.8 • Evaluating AUPR under 5-fold cross-validation The new proposed method accomplishes best AUPR values when 𝛾 = 0.8 under 5-fold cross validation experiments Figure 2.7 shows