Link prediction in heterogeneous information networks and its applications in predicting associations between noncoding RNAs and diseases.


BACKGROUND

Basic concepts

As mentioned before, most real-world systems may be treated as heterogeneous information networks (HINs), and biological systems are a special class of HINs. Therefore, this section introduces some basic concepts related to HINs and biological systems.

Most real-world systems are composed of multi-typed components and a large number of interactions or associations among them, for example, human social activity systems, communication systems, computer systems, biological systems, and so forth. Without loss of generality, such systems can be regarded as information networks. Formally, an information network is defined as follows:

Definition 1.1 Information network [2]. An information network is represented by a graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The graph $G$ has an object type mapping function $\phi: V \to A$ and a link type mapping function $\psi: E \to R$. Each node $v \in V$ belongs to exactly one object type, $\phi(v) \in A$, and each link $e \in E$ belongs to exactly one link type, $\psi(e) \in R$. If two links have the same starting object type and the same ending object type, they belong to the same link type.

Definition 1.2 Heterogeneous/homogeneous information network [2]. If an information network contains more than one object type or more than one link type, that is, $|A| > 1$ or $|R| > 1$, it is called a heterogeneous information network (HIN); otherwise, when $|A| = 1$ and $|R| = 1$, it is called a homogeneous information network. Figure 1.1 illustrates a heterogeneous information network with multiple node types and multiple link types.

Figure 1.1 An illustration of HIN with multiple node types and multiple link types.
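Definitions 1.1 and 1.2 can be made concrete with a small sketch. The class name, the toy node names, and the link type labels below are hypothetical and serve only as an illustration; the two dictionaries play the roles of the mapping functions $\phi$ and $\psi$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationNetwork:
    """Toy graph G = (V, E) with type mappings phi (nodes) and psi (links)."""
    node_type: dict  # phi: node -> object type in A
    edge_type: dict  # psi: (source, target) -> link type in R

    def object_types(self):
        return set(self.node_type.values())

    def link_types(self):
        return set(self.edge_type.values())

    def is_heterogeneous(self):
        # Definition 1.2: |A| > 1 or |R| > 1
        return len(self.object_types()) > 1 or len(self.link_types()) > 1


# Hypothetical example: two miRNAs and one disease, two link types.
toy_hin = InformationNetwork(
    node_type={"m1": "miRNA", "m2": "miRNA", "d1": "disease"},
    edge_type={("m1", "d1"): "associated_with", ("m1", "m2"): "similar_to"},
)

# Hypothetical example: a single object type and a single link type.
toy_homogeneous = InformationNetwork(
    node_type={"g1": "gene", "g2": "gene"},
    edge_type={("g1", "g2"): "interacts_with"},
)

print(toy_hin.is_heterogeneous(), toy_homogeneous.is_heterogeneous())
```

The check follows the "or" in Definition 1.2: a network with a single object type is still heterogeneous if it has more than one link type.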

Definition 1.3 Network schema [2]. A network schema, denoted by $T_G = (A, R)$, is a meta template for a heterogeneous network $G = (V, E)$ with the object type mapping function $\phi: V \to A$ and the link type mapping function $\psi: E \to R$. It is defined over the object types in $A$ and the link types in $R$.

The network schema of a HIN specifies the constraints on the set of objects and on the relationships among them. These constraints can be used to guide the semantic exploration of the network. A heterogeneous network that follows a network schema is called a network instance of that schema. For a link type $R$ that connects object type $X$ to object type $Y$, written $X \xrightarrow{R} Y$, $X$ and $Y$ are called the source and target object types of $R$, denoted by $R.X$ and $R.Y$, respectively. Figure 1.2 illustrates the network schema of a HIN.

Figure 1.2 An illustration of HIN’s network schema.

In a HIN, two objects can be linked through various paths, and each path has a particular meaning. These paths are known as meta paths.

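Definition 1.4 Meta path [2]. A meta path $P$ on a network schema $T_G = (A, R)$ is denoted by $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \dots \xrightarrow{R_l} A_{l+1}$. It defines a composite relation $R = R_1 \circ R_2 \circ \dots \circ R_l$ among nodes, where $\circ$ is the composition operator on relations.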
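One common way to work with a composite relation is to multiply the adjacency matrices of the relations along the meta path: each entry of the product counts the path instances between a source object and a target object. The sketch below assumes a hypothetical miRNA-gene-disease meta path and made-up toy matrices; it is an illustration and is not taken from the cited works.

```python
import numpy as np

# Hypothetical toy data: 3 miRNAs, 2 genes, 2 diseases.
mirna_gene = np.array([[1, 0],
                       [1, 1],
                       [0, 0]])    # relation R_1: miRNA -> gene
gene_disease = np.array([[1, 1],
                         [0, 1]])  # relation R_2: gene -> disease


def metapath_counts(*relation_matrices):
    """Multiply the adjacency matrices of R_1, ..., R_l in order.

    Entry (i, j) of the result is the number of path instances that follow
    the meta path from source object i to target object j.
    """
    result = np.asarray(relation_matrices[0])
    for matrix in relation_matrices[1:]:
        result = result @ np.asarray(matrix)
    return result


counts = metapath_counts(mirna_gene, gene_disease)
print(counts)
```

In this toy example the second miRNA reaches the second disease through both genes, so the corresponding entry is 2, while the third miRNA has no gene links and its row stays zero.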

 Reasons for heterogeneous information networks

In numerous studies, information networks are treated as homogeneous networks in which all nodes share one object type and all links share one relation type. In fact, however, most networks are heterogeneous: their nodes and links belong to different types. HINs can be constructed in almost any domain. For example, in biological networks, nodes can be genes, proteins, microbes, diseases, miRNAs, lncRNAs, gene expressions, phenotypes, and so on [24]. HINs are therefore powerful and expressive representations of the natural interactions among different object types in diverse domains [25].

An information network can be analyzed in a variety of ways. For homogeneous information networks, and social networks in particular, various data mining tasks have been investigated, including ranking, clustering, link prediction, and influence analysis. However, most methods designed for homogeneous information networks cannot be applied directly to HINs to mine heterogeneous data. The reason is that a HIN contains heterogeneous links across objects of different types and generally holds richer information than a homogeneous network [25]. In recent years, objects of various types have become increasingly interconnected, and they are difficult to model with homogeneous networks. HINs are therefore a natural choice for representing different object types and relationships.

Biological systems are a special class of heterogeneous information networks that consist of a large number of biological entities such as genes, miRNAs, lncRNAs, gene expressions, phenotypes, and so forth [7], [22], [26], [27]. Normally, all biological processes are regulated by molecular entities and their interactions or associations. Understanding biological processes requires knowledge not only of the biological entities themselves but also of the relationships among them. These processes are naturally represented by a graph, also called a network, in which nodes represent biomolecules and links represent interactions or associations among molecular entities [22], [27]. In other words, networks are representations of heterogeneous and complex biological systems. Analyzing the interactions or associations among biomolecules plays a crucial role in understanding the physiology and pathology of various forms of life, including the development of new drugs and the discovery of disease mechanisms. The study of biological interactions has therefore become a dominant topic in biological networks [26]. In recent years, the diversity of biological objects and the rapid growth in the amount of biological data have confronted biological networks with big data challenges. HINs are therefore considered powerful tools for dealing with the heterogeneity and complexity of biological networks [3].

Biological networks are commonly classified into types such as protein-protein interaction (PPI) networks, gene regulatory networks (GRNs), metabolic networks, disease networks, and so forth. Depending on the research purpose, a biological network can also be defined as a gene-disease, drug-disease, drug-disease-gene, lncRNA-disease, or miRNA-disease network, and so forth [7].

As mentioned before, most of the human genome is transcribed into RNAs, which fall into two classes. RNAs of the first class encode proteins and account for only approximately 2% of the human genome. RNAs of the second class account for nearly 98% of the human genome and are not translated into proteins; they are referred to as noncoding RNAs (ncRNAs) [11], [23].

Non-coding RNAs can be divided into different types such as tRNAs (transfer RNAs), rRNAs (ribosomal RNAs), snRNAs (small nuclear RNAs), smcRNAs (small non-coding RNAs), lncRNAs, miRNAs, and circRNAs. In particular, miRNAs and lncRNAs are two thoroughly studied types of ncRNAs, because miRNAs regulate most protein-coding genes whereas lncRNAs are the most ubiquitous ncRNAs in mammals [11].

MiRNAs are a class of small, single-stranded, endogenous, evolutionarily conserved ncRNAs with a length of 22-26 nucleotides [28], [29].

Long non-coding RNAs (lncRNAs)

LncRNAs are a subclass of ncRNAs that are longer than 200 nucleotides [11].

Non-coding RNAs and complex human diseases

For a long time, ncRNAs were hard to recognize in the human genome. They were treated as noise and considered to have no biological function. Nevertheless, ncRNAs play vital roles in life activities, and it has been demonstrated that they have a significant influence on the occurrence, progression, and development of human diseases. For example, miRNAs have been shown to play crucial roles in primary cellular functions such as development, differentiation, growth, and so forth [29]. Various works have indicated that the development and progression of human diseases are related to the abnormal expression and dysregulation of miRNAs [9], [29]. Discovering miRNA-disease relationships can help us understand disease mechanisms at the molecular level and detect disease biomarkers for diagnosis, treatment, prognosis, and prevention [9]. In addition, lncRNAs have been shown to play essential roles in multiple biological processes in the human body, such as transcription, translation, splicing, differentiation, and so on [10], [30]. It has been confirmed that the dysregulation and mutation of lncRNAs are closely involved in the development and progression of numerous complex human diseases. Understanding lncRNA-disease relationships can not only help explain disease mechanisms but also facilitate the diagnosis, treatment, prognosis, and prevention of complex diseases [10], [11], [30].

Link prediction in heterogeneous information networks

Link prediction is a fundamental task in HINs. It estimates the probability that a link exists between two nodes, based on the known links and the attributes of the nodes in a network [2]. Formally, a link prediction problem can be stated as follows.

Definition 1.5 Link prediction in heterogeneous information networks

Given a network represented by a graph $G = (V_1 \cup V_2 \cup \dots \cup V_M, E_1 \cup E_2 \cup \dots \cup E_N)$, where $V_i$ ($i = 1, 2, \dots, M$) denotes the node set of type $i$ and $E_j$ ($j = 1, 2, \dots, N$) denotes the link set of type $j$, the link prediction task is to determine whether a link $e_k$ exists or will exist between $v_i$ ($v_i \in V_i$) and $v_j$ ($v_j \in V_j$). Here $e_k$ is the target link of the prediction task [31].

Input: A network $G = (V_1 \cup V_2 \cup \dots \cup V_M, E_1 \cup E_2 \cup \dots \cup E_N)$, where $V_i$ ($i = 1, 2, \dots, M$) is the node set of type $i$ and $E_j$ ($j = 1, 2, \dots, N$) is the link set of type $j$.

Output: For any two potentially connected objects $v_i$ ($v_i \in V_i$) and $v_j$ ($v_j \in V_j$), whether the link $e_k$ exists (1) or not (0).

The objective of link prediction is to reveal missing links in the network or to predict potential links that may emerge in the current network [7]. Link prediction is normally treated as a binary classification problem: for any two nodes that could be connected, we have to predict whether a link exists (1) or not (0). An illustration of a link prediction problem is shown in Figure 1.3.

Figure 1.3 An illustration of a link prediction problem

Link prediction in HINs has the following characteristics. First, a HIN contains different types of links, so the links to be predicted may be of different types. Second, dependencies may exist among multiple link types. Link prediction in a HIN may therefore need to predict multiple link types collectively, by capturing the diverse and complicated relationships among different link types and exploiting this interdependent information for prediction [2].

In general, link prediction methods can be roughly categorized into the following types: network similarity-based methods, probabilistic and maximum likelihood-based methods, machine learning-based methods, and deep learning-based methods [32]–[35].

 Network similarity-based link prediction methods

Network similarity-based methods are the simplest link prediction methods. In these methods, a similarity score $S(v_i, v_j)$ is computed for each pair of nodes $v_i$ ($v_i \in V_i$) and $v_j$ ($v_j \in V_j$), based on the structural information of the pair under consideration. The unknown link $e_k$ is assigned a score according to the similarity of its two nodes: the higher the similarity score of a node pair, the more likely the predicted link is to be true [32], [35]. In more detail, similarity-based methods can be divided into local, global, and quasi-local similarity-based methods. Local similarity-based methods derive similarity scores from the number of common neighbors. Global similarity-based methods use the structure of the entire network to compute the similarity score matrix. Quasi-local similarity-based methods use information about neighboring nodes to calculate similarity scores [32], [35]. These methods offer high accuracy, but they are sensitive to network sparsity [35]. Recent similarity-based link prediction methods include the Clustering Coefficient for Link Prediction (CCLP) method for local similarity-based link prediction, Linear Optimization for global similarity-based link prediction, and the Common Neighbors Degree Penalization (CNDP) method for quasi-local similarity-based link prediction [36]–[38].
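As an illustration of the local similarity-based idea, the following minimal sketch computes two classical local scores, the common neighbors count and the Jaccard index, for every node pair of an undirected, unweighted graph. It is not an implementation of CCLP, Linear Optimization, or CNDP; the function names and the toy graph are assumptions made for this example.

```python
import numpy as np


def common_neighbors(adj):
    """Entry (i, j) is the number of neighbors shared by nodes i and j.

    `adj` is a symmetric 0/1 adjacency matrix without self-loops.
    The diagonal holds node degrees and is not a pair score.
    """
    adj = np.asarray(adj, dtype=float)
    return adj @ adj.T


def jaccard_index(adj):
    """Common neighbors divided by the size of the union of both neighborhoods."""
    adj = np.asarray(adj, dtype=float)
    shared = adj @ adj.T
    degree = adj.sum(axis=1)
    union = degree[:, None] + degree[None, :] - shared
    return np.divide(shared, union, out=np.zeros_like(shared), where=union > 0)


# Hypothetical toy graph with edges 0-1, 0-2, 1-2, 2-3.
toy_adj = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]])

cn_scores = common_neighbors(toy_adj)
jaccard_scores = jaccard_index(toy_adj)
print(cn_scores[0, 3], jaccard_scores[0, 3])
```

Nodes 0 and 3 are not linked but share neighbor 2, so the pair receives a nonzero score and becomes a candidate link. Pairs without any shared neighbor always score zero, which illustrates why such scores degrade on sparse networks.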

 Probabilistic and maximum likelihood-based link prediction methods

Recently, several probabilistic and likelihood-based link prediction methods, which build a model with multiple parameters by optimizing an objective function, have been proposed [39]–[41]. Probabilistic methods normally need extra information beyond the basic network structure, for example knowledge about node or edge properties, and this extra information is not easy to obtain. Furthermore, the parameters of such models are hard to tune, which limits their applicability. Maximum likelihood methods are complicated and laborious, so they are not practical for massive real-world networks [35].

 Machine learning-based link prediction methods

One of the well-known problems in machine learning-based link prediction is the curse of dimensionality. Dimension reduction techniques, for example network embedding and matrix decomposition, have been used to deal with this problem in the link prediction scenario.

Network embedding-based link prediction methods aim to map the nodes of a network to low-dimensional representations while effectively preserving the network structure, node properties, and other side information [35]. During embedding, nodes in the original D-dimensional space are mapped to a representation space of lower dimension d (d ≪ D) in a way that keeps the neighborhood structure of each node. In other words, nodes that are similar in the original network have similar embeddings in the d-dimensional representation space. In recent years, numerous network embedding-based link prediction methods have been proposed, such as the methods of Wang et al. and Zhang et al. [42], [43].

Matrix decomposition-based link prediction methods factorize the adjacency matrix into two low-rank factor matrices and then reconstruct the network with the least amount of error. These methods can use additional node attribute information, analyze the network structure, filter out irregular noise, and handle multiple link types to enhance prediction performance. Matrix factorization-based link prediction methods have been applied in various studies and systems, and they have achieved high performance on some specific real-world networks. However, a typical drawback of these methods is that only a single layer maps the adjacency matrix of the original network to the low-dimensional space [35]. Recent nonnegative matrix factorization (NMF)-based link prediction methods include the FSSDNMF method, Ma et al.’s method, the GRNMF method, and the DANMF method [35], [44]–[46].
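The factorize-then-reconstruct idea can be sketched as follows. Note that this example swaps in a truncated singular value decomposition rather than NMF or any of the cited methods, because it is deterministic and short; the function name, the chosen rank, and the toy matrix are assumptions.

```python
import numpy as np


def low_rank_scores(matrix, rank):
    """Factorize `matrix` into two rank-`rank` factors and rebuild it.

    Truncated SVD gives the best rank-`rank` approximation in the
    least-squares sense; the rebuilt entries act as link scores.
    """
    u, s, vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    left = u[:, :rank] * np.sqrt(s[:rank])      # row factors
    right = vt[:rank, :].T * np.sqrt(s[:rank])  # column factors
    return left @ right.T


# Hypothetical 4 x 3 association matrix (rows and columns are two node types).
toy_matrix = np.array([[1, 1, 0],
                       [1, 1, 0],
                       [0, 0, 1],
                       [1, 0, 0]])

scores = low_rank_scores(toy_matrix, rank=2)
print(np.round(scores, 3))
```

Entries that are 0 in the input but receive a comparatively high reconstructed score are the candidate links. In practice the rank would be selected by validation, and NMF-based methods additionally constrain both factors to be nonnegative.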

Besides dimension reduction methods, other machine learning methods, for instance support vector machines (SVM), Naive Bayes, logistic regression, decision trees, and so on, have also been applied to link prediction as a binary classification task. The main challenge of these methods is selecting an appropriate feature set [32]. Some studies extract feature sets from the topology of networks [47]. These features are typically general and cross-domain, and are applicable to any network.

 Deep learning based link prediction methods

Deep learning methods are a fundamental component of machine learning. They rely on learning data representations and have been applied to a variety of tasks including image processing, pattern recognition, natural language processing, computer vision, and so forth [32]. These methods normally model a network as a multi-layered structure in order to explore its latent and internal hierarchical structure more fully. They are able to examine latent community structure, obtain the hierarchical semantics of multi-view data, and capture the intrinsic geometric structure of each view of the data. By converting network structures into low-dimensional representations, these methods extend deep learning techniques to graph structures in order to carry out link prediction tasks [48]. Recently, several deep learning-based link prediction methods have been proposed, such as Gu et al.’s method, Dadu et al.’s method, Wang et al.’s method, and Wang et al.’s method.

1.2.3 Link prediction applications in biological systems

HINs can efficiently integrate complex and interconnected data in order to address many biological problems [2], [3], [24]. In a biological HIN, nodes and links represent biological objects and their interactions or associations, respectively. HIN-based methods, particularly link prediction-based methods, are promising approaches for comprehensive analysis in systems biology [3]. In this section, we introduce several applications of HIN link prediction in biological systems.

The diagnosis, treatment, and prevention of many complex human diseases, including cancers and diabetes, could be improved by identifying disease-associated genes. Recently, numerous HIN-based approaches built on multi-omics data have been established to mine disease-causing genes. HIN-based methods can also identify candidate disease genes by utilizing known disease-gene relationships. Moreover, a disease that affects one or more systems in the body produces a particular set of phenotypes. In recent years, various methods for gene-disease association prediction have been proposed. For example, a bi-layer disease/phenotype-gene HIN, composed of a gene network and a disease/phenotype network, is one of the most popular network models for predicting gene-disease associations. A multi-layered HIN, which integrates different genomic and phenotype data into one network, is also used to predict disease/phenotype-gene associations. In addition, some complex HINs containing multi-omics data are used to predict disease/phenotype-gene associations [3].

Metabolomics is used to identify epigenetic metabolites in cells and bodily fluids in order to elucidate the pathogenesis of diseases at the molecular level and to help identify metabolic biomarkers for disease diagnosis [3]. Identifying disease-related metabolites plays an essential role in the medical and biological fields. Link prediction in HINs has been applied in various studies to reveal metabolites related to diseases [51], [52].

Computational methods for predicting associations between non-coding RNAs and diseases

1.3.1 Non-coding RNA-disease association prediction as a link prediction problem

In recent years, there has been a lot of research into the relationships between ncRNAs and diseases. A number of experimental methods and technologies have been developed to generate biological data that support research on ncRNA-disease associations. However, because experimental methods are laborious, time-consuming, and expensive, computational methods have received a lot of attention from computer scientists and have been frequently used to address these issues. From a computational point of view, this section presents two typical types of ncRNAs, i.e., miRNAs and lncRNAs, and the related computational approaches for predicting associations between these ncRNAs and diseases.

Formally, the problem of predicting ncRNA-disease associations is treated as a HIN-based link prediction problem. It typically uses a heterogeneous network with multiple types of biological objects and the relationships among them. These objects and links are collected from different sources and include nodes of the ncRNA (miRNA, lncRNA) type and the disease type. The task is then to predict latent associations between ncRNAs and diseases. The predicted associations may be new associations or missing ones.

1.3.2 Materials used for ncRNA-disease association prediction

 Materials used for miRNA-disease association prediction

Information about miRNAs and miRNA-target associations can be gathered from data sources such as miRBase, miReg, miRTarBase, miRecords, and so on. Verified miRNA-disease associations can be collected from public databases such as MiRCancer, MiR2Disease, HMDD, MiREC, DbDEMC, and so forth. Some details of the materials used for miRNA-disease association prediction are shown in Table 1.1.

Table 1.1 Databases containing miRNA-related information and miRNA-disease associations

 Materials used for predicting lncRNA-disease associations

Information about lncRNAs can be obtained from data sources such as LNCipedia, the NONCODE database, LncRBase, and so on. Information about lncRNA-related interactions can be gathered from databases such as DIANA-LncBase, lncRNA2Target, and so on. Information about lncRNA-disease associations can be collected from databases such as LncRNADisease, Lnc2Cancer, MNDR, and so on [10]. Additionally, information about other biological objects and their associations can be collected from multiple data sources. Table 1.2 presents databases containing lncRNA information, lncRNA-related interactions, and lncRNA-disease associations.

Table 1.2 Databases containing lncRNA-related information

As mentioned, the information about ncRNAs and ncRNA-disease associations is diverse. It can be collected from different data sources and can contain multiple attributes or properties. Collecting and integrating information from these databases is time-consuming and costly. In this dissertation, however, an association between an ncRNA and a disease only records whether any known association or interaction exists between them. If the association is already known or has been verified by biological experiments, it is represented by "1"; otherwise it is denoted by "0" in the computational approaches for ncRNA-disease association prediction. Additionally, to be used in link prediction problems, the ncRNA-disease association data need to be pre-processed and represented as networks or association matrices. Several studies in this field have already pre-processed these different types of data and provide them as benchmark datasets. Therefore, the datasets used in each contribution of this dissertation, along with their characteristics, will be described in the corresponding chapters.
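The binary representation described above can be sketched in a few lines. The identifiers, the list of known pairs, and the function name are hypothetical placeholders and do not come from any of the benchmark datasets.

```python
import numpy as np


def build_association_matrix(ncrnas, diseases, known_pairs):
    """Return a 0/1 matrix with one row per ncRNA and one column per disease.

    Entry (i, j) is 1 if the association between ncRNA i and disease j is
    known or experimentally verified, and 0 otherwise.
    """
    row_index = {name: i for i, name in enumerate(ncrnas)}
    col_index = {name: j for j, name in enumerate(diseases)}
    matrix = np.zeros((len(ncrnas), len(diseases)), dtype=int)
    for ncrna, disease in known_pairs:
        matrix[row_index[ncrna], col_index[disease]] = 1
    return matrix


# Hypothetical identifiers and known associations.
ncrna_names = ["ncRNA_a", "ncRNA_b", "ncRNA_c"]
disease_names = ["disease_x", "disease_y"]
verified_pairs = [("ncRNA_a", "disease_x"),
                  ("ncRNA_a", "disease_y"),
                  ("ncRNA_b", "disease_x")]

association = build_association_matrix(ncrna_names, disease_names, verified_pairs)
print(association)
```

A 0 entry means only that no association is known, not that the association has been shown to be absent; the 0 entries are exactly the candidate links that the prediction methods score.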

1.3.3 Similarity calculation and network construction

Similarities between diseases can be identified from their semantic descriptions and from the effects of external biomolecules. In addition, there are numerous interactions among ncRNAs as well as between ncRNAs and other biomolecules. In this section, several methods for calculating disease and ncRNA similarities, as well as network construction, are briefly introduced.

Calculating disease similarity is typically required in order to predict ncRNA-disease associations. Typical methods are founded primarily on the semantics of diseases and on their relationships with other biomolecules. The first type of method measures disease similarity by computing the contributions of the ancestor nodes of each disease in a tree structure such as MeSH (https://www.nlm.nih.gov/mesh/meshhome.html). The second type uses information about other related biological molecules to calculate disease similarity, for example, Gaussian interaction profile (GIP) kernel similarity. In these methods, the common influence of diseases on biological molecules is used to measure disease similarity [11], [12].
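The ancestor-contribution idea can be illustrated with one widely used scheme, which is assumed here because the text does not fix a formula: a disease contributes 1 to its own semantics, each step up the hierarchy multiplies the contribution by a decay factor (0.5 below, an assumption), and the similarity of two diseases is the summed contribution of their shared terms divided by their total semantic values. The toy hierarchy and all names are hypothetical.

```python
def semantic_contributions(disease, parents, decay=0.5):
    """Contribution of `disease` and each of its ancestors to its semantics.

    `parents` maps a term to the list of its direct parent terms.
    When several paths reach an ancestor, the largest contribution is kept.
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = decay * contrib[node]
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(disease_a, disease_b, parents, decay=0.5):
    """Shared contributions divided by the sum of both semantic values."""
    contrib_a = semantic_contributions(disease_a, parents, decay)
    contrib_b = semantic_contributions(disease_b, parents, decay)
    shared = set(contrib_a) & set(contrib_b)
    numerator = sum(contrib_a[t] + contrib_b[t] for t in shared)
    return numerator / (sum(contrib_a.values()) + sum(contrib_b.values()))


# Hypothetical hierarchy: root -> A -> A1, and root -> B.
toy_parents = {"A": ["root"], "B": ["root"], "A1": ["A"]}

sim_siblings = semantic_similarity("A", "B", toy_parents)
sim_parent_child = semantic_similarity("A1", "A", toy_parents)
print(sim_siblings, sim_parent_child)
```

In this toy hierarchy a term and its direct child are more similar than two siblings that share only the root, which matches the intuition behind semantic similarity.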

 Non-coding RNA similarity calculation

Similar to disease similarity, ncRNA similarity can be computed with various methods. The most popular approach measures ncRNA similarity from the biological information of the ncRNAs themselves, including sequence, expression, regulation, and functional information. Alternatively, ncRNA similarity can be measured from the effects of ncRNAs on other biological objects, for example, with GIP kernel similarity [11], [12].
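GIP kernel similarity can be sketched directly from an association matrix. The sketch assumes the commonly used form $K(i, j) = \exp(-\gamma \lVert IP(i) - IP(j) \rVert^2)$, where $IP(i)$ is the association profile of entity $i$ and the bandwidth $\gamma$ equals a parameter $\gamma'$ divided by the mean squared norm of all profiles. The value $\gamma' = 1$, the function name, and the toy data are assumptions.

```python
import numpy as np


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Each row is the 0/1 association profile of one entity, for example the
    row of an ncRNA in the ncRNA-disease association matrix.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq_norm = sq_norms.mean()
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    # Squared Euclidean distance between every pair of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (profiles @ profiles.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


# Hypothetical association matrix: 3 ncRNAs (rows) x 2 diseases (columns).
toy_association = np.array([[1, 0],
                            [1, 0],
                            [0, 1]])

ncrna_gip = gip_kernel(toy_association)      # similarity between ncRNAs
disease_gip = gip_kernel(toy_association.T)  # similarity between diseases
print(np.round(ncrna_gip, 3))
```

The same function serves both node types: rows of the association matrix give ncRNA profiles and rows of its transpose give disease profiles. The first two ncRNAs have identical profiles, so their similarity is 1.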

After the similarities are obtained, a heterogeneous network is normally constructed. It is commonly represented by a graph $G = (V_1 \cup V_2 \cup \dots \cup V_M, E_1 \cup E_2 \cup \dots \cup E_N)$, where $V_i$ ($i = 1, 2, \dots, M$) is the node set of type $i$ and $E_j$ ($j = 1, 2, \dots, N$) is the edge set of type $j$.
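For the two-node-type case (ncRNAs and diseases), such a network is often stored as one block adjacency matrix that combines the two similarity matrices with the association matrix. The sketch below assumes this layout; the function name and the toy values are hypothetical.

```python
import numpy as np


def build_heterogeneous_adjacency(ncrna_sim, disease_sim, association):
    """Stack similarity and association matrices into one adjacency matrix.

    Layout (ncRNA nodes first, then disease nodes):
        [[ncrna_sim,     association],
         [association.T, disease_sim]]
    """
    ncrna_sim = np.asarray(ncrna_sim, dtype=float)
    disease_sim = np.asarray(disease_sim, dtype=float)
    association = np.asarray(association, dtype=float)
    return np.block([[ncrna_sim, association],
                     [association.T, disease_sim]])


# Hypothetical toy input: 3 ncRNAs and 2 diseases.
toy_ncrna_sim = np.array([[1.0, 0.8, 0.1],
                          [0.8, 1.0, 0.2],
                          [0.1, 0.2, 1.0]])
toy_disease_sim = np.array([[1.0, 0.3],
                            [0.3, 1.0]])
toy_links = np.array([[1, 0],
                      [1, 0],
                      [0, 1]])

hetero_adj = build_heterogeneous_adjacency(toy_ncrna_sim, toy_disease_sim, toy_links)
print(hetero_adj.shape)
```

The two diagonal blocks hold the within-type similarity edges and the off-diagonal blocks hold the cross-type association edges, so the node index alone determines the node type.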

1.3.4 Literature review of computational methods to predict ncRNA-disease associations

Numerous computational approaches have recently been proposed for ncRNA-disease association prediction. They can generally be grouped into the following categories: network-based, recommendation-based, resource allocation-based, machine learning-based, deep learning-based, and multi-biological information and multi-model integration-based methods [9]–[12].

 Network-based methods for ncRNA-disease associations prediction

Most network-based approaches for predicting ncRNA-disease associations rely on the common assumption that ncRNAs associated with phenotypically similar diseases tend to share similar functions, and vice versa [9]–[12], [22], [27], [58], [59]. Many network-based methods for predicting ncRNA-disease associations have been developed in recent years. For example, Jiang et al. [60] predicted potential miRNA-disease associations by prioritizing disease-related miRNAs through the human phenome-microRNAome network. Gu et al. [61] integrated similarity and association networks in a network consistency projection algorithm to predict potential miRNA-disease associations. Chen et al. [62] used known miRNA-disease associations, integrated miRNA similarity, and integrated disease similarity to propose a computational model called Bipartite Network Projection for MiRNA-Disease Association prediction (BNPMDA). Liang et al. [63] developed an Adaptive Multi-View Multi-Label (AMVML) model that learns a new affinity graph for both diseases and miRNAs in order to identify possible miRNA-disease associations. For lncRNA-disease association prediction, Guoxian Yu et al. [64] devised a bi-random walk method. Gu et al. [65] proposed GrwLDA, a global network method based on random walk with restart (RWR), for identifying possible associations between lncRNAs and diseases. Xiao et al. [66] proposed BPLLDA, a method that predicts lncRNA-disease associations using simple paths of limited length in a heterogeneous network.

A main advantage of these methods is that they can be used to predict ncRNAs associated with isolated diseases. However, their performance is not particularly satisfying [63].

In addition, various random walk-based methods for ncRNA-disease association prediction have been developed recently. The majority of these methods, for instance RWRMDA [67], MIDP and MIDPE [68], and NTSMDA [69], assign the same walk probability to each neighbor linked to a disease or ncRNA node, in accordance with its degree.

Some extended random walk-based methods have also been devised in recent years to solve the ncRNA-disease association prediction problem, such as Le et al.’s method [70], the RWBRMDA method [71], and the NPRWR method [72]. The disadvantage of the aforementioned random walk-based methods is that they cannot effectively make predictions for most diseases or ncRNAs without any known associated ncRNAs or diseases.
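The basic random walk with restart (RWR) scheme that these methods build on can be sketched as follows. This is a generic illustration under stated assumptions (restart probability 0.5, column normalization by node degree, hypothetical toy graph) and not an implementation of RWRMDA, MIDP, NTSMDA, or any other cited method.

```python
import numpy as np


def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    W is the column-normalized adjacency matrix, so a walker at a node moves
    to each linked neighbor with the same probability (1 / degree).
    """
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # isolated nodes keep an all-zero column
    transition = adj / col_sums
    p0 = np.zeros(adj.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (transition @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Hypothetical toy graph: a path 0 - 1 - 2 plus an isolated node 3.
toy_graph = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 0],
                      [0, 0, 0, 0]])

rwr_scores = random_walk_with_restart(toy_graph, seeds=[0])
print(np.round(rwr_scores, 4))
```

The isolated node never receives any probability mass, which mirrors the limitation noted above: a disease or ncRNA without known links cannot be reached by the walker unless similarity edges are added to the network.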

 Resource allocation-based methods to predict ncRNA-disease associations

Resource allocation-based approaches use the original values of the matrices from multiple data sources as potential values for node-to-node relationships [12]. A resource allocation algorithm is then used to distribute the available resources to each network node. In recent years, different resource allocation-based methods have been implemented to infer potential ncRNA-disease associations. For instance, Ding et al. [18] proposed a model called TPGLDA that applies a resource allocation process on a lncRNA-disease-gene tripartite graph to predict lncRNA-disease associations. Fan et al. [73] proposed an approach known as IDHI-MIRW, which combines positive pointwise mutual information from a variety of heterogeneous sources with an RWR algorithm to identify potential lncRNA-disease associations. Wang et al. [74] established a model called IIRWR, based on an internal inclined random walk with restart algorithm, to infer potential lncRNA-disease associations. Marissa Sumathipala and Scott T. Weiss [20] proposed the MAP method, which uses a network diffusion approach to predict miRNA-disease associations, starting from the validated disease-related genes in a HIN made up of protein-protein interactions, miRNA-gene associations, and gene-disease associations. Furthermore, additional information has been incorporated into the resource allocation process to enhance prediction performance. For instance, Liu et al. [75] proposed a model named NBLDA, which obtains four matrices based on resource allocation to discover ncRNA-disease associations.

The disadvantage of the aforementioned methods is that they cannot accurately predict associations for the majority of ncRNAs or diseases that have no known associated diseases or ncRNAs.

 Recommendation-based methods to predict ncRNA-disease associations

Recommendation-based approaches, which aim to suggest a node in a network that could be associated with other nodes, are frequently used to predict ncRNA-disease associations [11], [12]. Content-based recommendation, collaborative filtering, and matrix factorization are the typical methods of a recommendation system. First, content-based recommendation methods aim to suggest, for a given node, nodes similar to those it was previously associated with. For instance, Li et al. [76] produced the NCPLDA model, which determines lncRNA-disease association scores by combining the probability matrix of lncRNA-disease associations with integrated disease similarity and lncRNA similarity, relying on network consistency projection. Second, collaborative filtering (CF) approaches typically find nodes that are very similar to the given nodes and use the data of these nodes to make suggestions [12]. For example, Yu et al. [13] applied a Naïve Bayesian classifier and CF on multiple matrices to determine whether lncRNA-disease associations exist and thereby reveal new associations. Li et al. [77] introduced a CF-based method named CFMDA to predict miRNA-disease associations. CFMDA is straightforward and powerful: it requires only a small amount of related data and has no tunable parameters. However, because it uses only miRNA-disease associations to make predictions, its prediction performance could be biased. Finally, the matrix factorization algorithm serves as the foundation for various matrix factorization methods such as DSCMF, MFLDA, and SDLDA [78]–[80]. DSCMF uses an integrated Gaussian interaction profile kernel and collaborative matrix factorization to uncover lncRNA-disease associations [78]. SDLDA [80] derives non-linear features of lncRNAs and diseases, which improve on the representation ability of the linear features obtained through matrix factorization techniques. Shen et al. [81] used a method called Collaborative Matrix Factorization to discover miRNA-disease associations, which resulted in a bias toward miRNAs with more known related diseases.

Each sub-category of recommendation algorithm-based methods has its own issues. For instance, content-based recommendation methods are unable to resolve new ncRNA-disease associations because they require prior knowledge of the node [12], [76]. Collaborative filtering techniques can resolve new ncRNA-disease associations, but when the amount of data is too large, collaborative filtering algorithms become significantly more complicated and time-consuming. The matrix factorization methods have to deal with the same issue [12].

 Machine learning-based methods to predict ncRNA-disease associations

Thesis’s research directions

Following the above literature review of computational approaches for predicting ncRNA-disease relationships, the research in this thesis focuses on proposing network-based computational methods for predicting ncRNA-disease associations, especially miRNA-disease and lncRNA-disease associations. Network-based methods were chosen because they can achieve high prediction accuracy and can be used to reveal associations for diseases (ncRNAs) without any known associated ncRNAs (diseases) if the data sparsity issue is resolved. The research can be carried out in the following directions.

Firstly, it is necessary to develop network representation methods and more reasonable methods for feature extraction, similarity calculation, and fusion in order to solve the sparse data problem or enhance the reliability of prediction performance.

Secondly, the thesis can concentrate on integrating different biological datasets to construct more reasonable similarities and on proposing new computational approaches for predicting non-coding RNA-disease associations.

Thirdly, the computational approaches for revealing ncRNA-disease associations can also be applied to other research areas, including predicting drug-disease associations, microbe-disease associations, metabolite-disease associations, and so on. Conversely, new approaches or models for predicting non-coding RNA-disease associations can borrow computational techniques from the aforementioned fields and adapt them in order to increase performance in the area of interest.

Finally, if at all possible, the research may extend the application of deep learning and machine learning techniques for a more comprehensive predictive analysis.

Some evaluation methods and metrics to evaluate prediction performance

It is very important to evaluate a classifier's prediction performance in order to judge its usefulness and to compare it with competing approaches. In this dissertation, the prediction performance of the proposed models is evaluated by the area under the ROC curve (AUC) and the area under the Precision-Recall curve (AUPR), obtained from 5-fold cross-validation and leave-one-out cross-validation (LOOCV) experiments. In addition, to support the reliability of the prediction performance, case studies are carried out for each proposed method. Although time complexity is not usually taken into account when evaluating a method, in this dissertation the time complexity of the proposed methods was quantitatively estimated to ensure that they finish within an acceptable execution time.

Cross-validation is widely used to evaluate how well predictive models generalize. It is a resampling technique that trains and tests a model multiple times using different parts of the data. Its objective is to estimate the performance of the final model. Cross-validation is also typically applied for tuning model parameters, with separate iterations run for different values of the tuning parameters. In practice, the most popular cross-validation methods used in bioinformatics are k-fold cross-validation (for example, k=5) and leave-one-out cross-validation (LOOCV) [110].

k-fold cross-validation

In the k-fold cross-validation experimental method, the learning set is divided into k disjoint subsets of approximately equal size. The “fold” in this context refers to the number of resulting subsets. The partitioning is done by sampling cases from the learning set at random without replacement. The model is then trained on k-1 subsets while the remaining subset is used for testing, and the performance is measured. This procedure is repeated until each of the k subsets has served as the validation set. The cross-validated performance is the average of the k performance measurements on the k validation sets.
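The partitioning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the dissertation: the function names `k_fold_splits` and `cross_validate`, the fixed random seed, and the caller-supplied `evaluate` callback are all hypothetical.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition without replacement
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, k, evaluate, seed=0):
    """Average a caller-supplied evaluate(train_idx, test_idx) over the k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_splits(n_samples, k, seed)]
    return sum(scores) / len(scores)

# Toy usage: 10 cases split into 5 folds of 2 cases each.
splits = list(k_fold_splits(n_samples=10, k=5))
```

Setting `k` equal to `n_samples` makes every fold a single case, which is the leave-one-out setting.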

Leave-one-out-cross-validation (LOOCV)

LOOCV is the special case of k-fold cross-validation in which k=n, where n is the size of the learning set. In LOOCV, each case individually acts as the hold-out validation set. As a result, only the first case is included in the first validation set, only the second case is included in the second validation set, and so on. This procedure is repeated for the entire learning set. In LOOCV, the test performance is typically an approximately unbiased estimate of the true prediction performance. However, it has a high variance, since the n training sets are nearly identical: any two training sets differ in only one case [110]. With large n, LOOCV is computationally costly, so LOOCV experiments are performed only when required.

1.5.2 Area under ROC Curve (AUC)

One of the most important and popular metrics for assessing prediction performance is the area under the ROC curve (AUC). In a diagnostic test with a dichotomous outcome (positive/negative test results), sensitivity and specificity are the conventional measures of accuracy. A receiver operating characteristic (ROC) curve plots sensitivity against 1-specificity, and the area under the ROC curve (AUC) is regarded as a useful measure of diagnostic test accuracy. Sensitivity, or True Positive Rate (TPR), is the proportion of real positive samples that are correctly predicted as positive. 1-Specificity, also referred to as False Positive Rate (FPR), is the proportion of real negative samples that are incorrectly predicted as positive. The TPR and FPR can be computed as below:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{1.1}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{1.2}$$

where TP (true positive) indicates that a positive sample was correctly predicted to be positive; FN (false negative) indicates that a positive sample was incorrectly predicted to be negative; FP (false positive) indicates that a negative sample was incorrectly predicted to be positive; and TN (true negative) indicates that a negative sample was correctly predicted to be negative. TPR and FPR are used as the vertical and horizontal axes, respectively, of the receiver operating characteristic (ROC) curve [111]. Figure 1.4 illustrates a ROC curve and the AUC.

Figure 1.4 An illustration of a ROC curve and the AUC
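As a rough illustration of how the ROC curve and its area follow from TP, FP, FN, and TN counts, the sketch below sweeps the decision threshold over a list of prediction scores and integrates the resulting (FPR, TPR) points with the trapezoidal rule. The function names and the toy labels and scores are hypothetical and are not taken from the experiments in this dissertation.

```python
def roc_curve_points(labels, scores):
    """Return the (FPR, TPR) points obtained by sweeping the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sort samples by decreasing score; each distinct score acts as one threshold.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data: 1 marks a real positive sample, 0 a real negative sample.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc = auc(roc_curve_points(labels, scores))
```

For this toy input, 7 of the 9 positive-negative pairs are ranked correctly, so the area equals 7/9.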

1.5.3 Area under Precision-Recall Curve (AUPR)

Another important evaluation metric is the AUPR, which is used to assess a classifier's prediction performance on an imbalanced dataset. In a diagnostic test with a dichotomous outcome (positive/negative test results) based on an imbalanced dataset, Precision and Recall are normally used to estimate the classifier's prediction performance. The plot of Precision values against the corresponding Recall (sensitivity) values is known as a Precision-Recall curve, and the area under the Precision-Recall curve (AUPR) is regarded as an efficient metric for evaluating the prediction accuracy of a classifier. Precision is the proportion of correctly predicted positive samples among all predicted positive samples, whereas Recall is the proportion of correctly predicted positive samples among all real positive samples. Precision and Recall are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP (true positive), FN (false negative), FP (false positive), and TN (true negative) have the same meanings as in the previous section. In the Precision-Recall curve, the vertical axis represents Precision and the horizontal axis represents Recall [112]. Figure 1.5 is an illustration of a Precision-Recall curve and the AUPR.

Figure 1.5 An illustration of a Precision-recall curve and AUPR.
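The following sketch shows one common way to turn ranked prediction scores into a Precision-Recall curve and an area estimate. It uses the step-wise (average precision) summation, which is an assumption made here for illustration; the dissertation does not state which interpolation was used for the AUPR. Function names and toy data are hypothetical.

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) points obtained by sweeping the threshold."""
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / n_pos, tp / (tp + fp)))
    return points

def average_precision(points):
    """Step-wise area under the PR curve: sum of precision times recall increment."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

# Toy data: 1 marks a real positive sample, 0 a real negative sample.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
aupr = average_precision(precision_recall_points(labels, scores))
```

Because Precision ignores true negatives, this summary is less affected than the AUC by a very large number of unlabeled negative pairs.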

To support the reliability of the prediction performance, case studies are carried out for each proposed method in the dissertation. After the prediction results are obtained, some diseases are selected as case studies. For each selected disease, the predicted ncRNAs are first ranked. The top predicted ncRNAs for the selected disease are then checked to see whether they have known associations with that disease. If they have no known associations with the selected disease, the predicted associations are further checked in other public databases or in the literature. The more associations that were already known or that are verified by other databases or literature, the higher the reliability of the prediction performance. The unverified associations predicted for each selected disease are suggested for further checking by biologists.
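The case-study procedure can be sketched as a simple lookup over the ranked candidates. The function name `check_top_candidates`, the status labels, and all ncRNA and disease identifiers below are hypothetical placeholders rather than real database entries.

```python
def check_top_candidates(scores, disease, known_pairs, external_pairs, top_k=3):
    """Rank ncRNAs for one disease and label the evidence for each top candidate."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
    report = []
    for ncrna, _score in ranked:
        pair = (ncrna, disease)
        if pair in known_pairs:
            status = "known"                 # already in the training associations
        elif pair in external_pairs:
            status = "verified externally"   # found in another database or paper
        else:
            status = "unverified"            # left for biologists to check
        report.append((ncrna, status))
    return report

# Placeholder prediction scores for a single placeholder disease.
scores = {"ncRNA_a": 0.91, "ncRNA_b": 0.75, "ncRNA_c": 0.60, "ncRNA_d": 0.12}
known = {("ncRNA_a", "disease_x")}
external = {("ncRNA_c", "disease_x")}
report = check_top_candidates(scores, "disease_x", known, external)
```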

NCRNA-DISEASE ASSOCIATIONS PREDICTION WITH

Motivations

Numerous computational approaches have recently been proposed to predict potential ncRNA-disease associations, particularly miRNA-disease and lncRNA-disease associations. Many of them depend heavily on known ncRNA-disease associations. They need to use different similarity matrices, including the disease semantic similarity matrix and the miRNA or lncRNA functional similarity matrix. However, these matrices are not directly connected with the miRNA-disease or lncRNA-disease associations, respectively. Therefore, numerous computational methods have recently been constructed that use different types of known associations among different types of objects to discover latent miRNA-disease and lncRNA-disease associations. In general, computational methods that take into account a variety of known associations among a variety of objects contribute to an increase in accuracy when predicting ncRNA-disease associations.

In addition, the number of known associations is very small compared with the number of potential associations among biological objects. Therefore, the problem of sparse similarity matrices, which affects prediction accuracy, must be addressed by the majority of computational methods for ncRNA-disease association prediction.

To solve the aforementioned issues, this chapter introduces a new computational model that integrates a variety of known associations between multiple objects to address the data sparsity problem and enhance prediction accuracy. The ncRNA-disease associations are predicted by a resource allocation process on a tripartite graph.

The data sparsity problem is addressed with an approach inspired by the item-based CF algorithm presented in Yu et al.'s CFNBC model [13]. The CF algorithm is used because it can take advantage of multiple types of known associations between multiple biological objects, each of which can affect the others by being associated with the same objects in a network. It also reduces the dependence on a single type of known association and thereby enhances prediction accuracy.

The ncRNA-disease association prediction is carried out by adopting or improving the resource allocation process on a tripartite graph used in the TPGLDA model of Ding et al. [18]. The tripartite graph integrates a considerable number of disease-related ncRNAs for the collaborative prediction of hidden ncRNA-disease associations, which draws on the properties of diseases during the resource allocation process on the tripartite graph [18].

As mentioned before, ncRNAs comprise different types, including rRNAs, tRNAs, snRNAs, sncRNAs, lncRNAs, miRNAs, and circRNAs. However, the two types of ncRNAs that have received the most attention in recent years are lncRNAs and miRNAs, because miRNAs are the regulators of most protein-coding genes whereas lncRNAs are the most ubiquitous in mammals [11]. Therefore, the proposed model in this chapter is applied to two different problems, predicting lncRNA-disease associations and inferring miRNA-disease associations, as shown in sections 2.4 and 2.5 of this chapter.

Main related works

2.2.1 The item-based collaborative filtering algorithm for ncRNA-disease association prediction

To date, several computational prediction methods have incorporated CF algorithms to identify potential biological objects associated with diseases, thereby addressing the issue of limited known associations between the various objects.

The objective of collaborative filtering is to suggest new items, or to predict the utility of an item, for a specific user based on his or her previous preferences and the opinions of other users with similar interests [113].

Normally, a CF algorithm has a list of $m$ users $U = \{u_1, u_2, \ldots, u_m\}$ and a list of $n$ items $I = \{i_1, i_2, \ldots, i_n\}$. A particular user $u_i$ has a list of items $I_{u_i}$ about which he or she has expressed opinions. Opinions can be reported explicitly by the user in the form of a rating score, or they can be derived implicitly from previous records, such as timing logs or previous purchases. Note that $I_{u_i} \subseteq I$, and it can be an empty set. There is a distinguished user $u_\alpha \in U$, called the active user, for whom the task of collaborative filtering is to find an item likelihood.

CF algorithms represent the entire $m \times n$ user-item data as a rating matrix $A$. Each element $a_{i,j}$ of $A$ is the rating score (preference) of the $i$th user on the $j$th item. Each rating score is on a numerical scale, and it can be 0 if the user has not rated the item. CF-based recommendation techniques can be classified into user-based and item-based CF techniques [113].

In this chapter, our proposed model for predicting ncRNA-disease associations used the item-based CF algorithms for recommendation.
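To make the item-based variant concrete, the sketch below computes cosine similarities between the item columns of a small rating matrix and scores the unrated items of one user by the similarity-weighted ratings of the items that user has already rated. It is a generic textbook-style illustration under stated assumptions, not the specific formulation used later in this chapter; the function names and the toy matrix are hypothetical.

```python
from math import sqrt

def item_similarity(ratings):
    """Cosine similarity between the item columns of a user-item rating matrix."""
    n_items = len(ratings[0])
    cols = [[row[j] for row in ratings] for j in range(n_items)]
    norms = [sqrt(sum(v * v for v in col)) for col in cols]
    sim = [[0.0] * n_items for _ in range(n_items)]
    for a in range(n_items):
        for b in range(n_items):
            if a != b and norms[a] > 0 and norms[b] > 0:
                dot = sum(x * y for x, y in zip(cols[a], cols[b]))
                sim[a][b] = dot / (norms[a] * norms[b])
    return sim

def predict_scores(ratings, sim, user):
    """Score each unrated item by similarity-weighted ratings of the user's items."""
    row = ratings[user]
    return {j: sum(sim[j][i] * row[i] for i in range(len(row)))
            for j in range(len(row)) if row[j] == 0}

# Rows are users, columns are items; 0 means "not rated".
ratings = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
]
sim = item_similarity(ratings)
scores = predict_scores(ratings, sim, user=0)
```

Here the first user has not rated the third item, and the score it receives comes entirely from how often that item co-occurs with the two items the user did rate.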

2.2.2 Resource allocation on a tripartite graph

The resource allocation algorithm on a tripartite graph has been successfully implemented in several computational approaches for predicting associations between ncRNAs and diseases, including TPGLDA [18] and ncPred [114]. It is inspired by the research of Zhang et al., which used a user-item-tag tripartite graph for making recommendations [115].

 Resource allocation on a user-item-tag tripartite graph for recommendation

In Zhang et al.’s study [115], a tripartite graph was constructed on three sets: $n$ users $U = \{U_1, U_2, \ldots, U_n\}$, $m$ items $I = \{I_1, I_2, \ldots, I_m\}$, and $r$ tags $T = \{T_1, T_2, \ldots, T_r\}$. Two adjacency matrices, $A$ and $A'$, represent the user-item and item-tag relationships, respectively, in this tripartite graph. If user $U_i$ has collected item $I_j$, then $a_{ij}$ is set to 1; otherwise $a_{ij}$ is set to 0. Similarly, $a'_{jk}$ is set to 1 if item $I_j$ has been assigned tag $T_k$; otherwise $a'_{jk} = 0$.

They started by considering a bipartite graph $G_{ui} = (U, I, E_{ui})$, in which $U$, $I$, and $E_{ui}$ are the sets of users, items, and edges connecting users and items, respectively.

Assuming that a certain kind of resource is initially located on items, each item distributes its resource equally to all of its neighboring users. Each user then redistributes the received resource equally to all the items he or she has collected. Denoting the initial resource vector on items by $\vec{f}$ (i.e., $f_j$ is the amount of resource located on item $I_j$), the final resource vector $\vec{f}'$ after these two diffusion steps is computed as in the following equation:

$$f'_j = \sum_{l=1}^{n} \sum_{s=1}^{m} \frac{a_{lj}\, a_{ls}\, f_s}{k(U_l)\, k(I_s)} \tag{2.1}$$

where $k(U_l) = \sum_{j=1}^{m} a_{lj}$ is the number of items collected by user $U_l$, and $k(I_s) = \sum_{i=1}^{n} a_{is}$ is the number of neighboring users of item $I_s$.

Additionally, different users may assign different tags to the same items. As a result, in a recommender system, the significance of collaborative tags is twofold. First, tags may enrich the item information: two items with many similar tags may have similar content. Second, the various ways in which tags are used frequently reflect personalized preferences. This is why tag information is used to provide better recommendations. In a similar way, assume that a certain kind of resource is initially located on items. The resource of each item is distributed equally to all of its neighboring tags, and each tag then redistributes the received resource equally to all of its neighboring items. So, for a particular initial resource vector $\vec{f}$, the final resource vector $\vec{f}''$ is as follows:

$$f''_j = \sum_{l=1}^{r} \sum_{s=1}^{m} \frac{a'_{jl}\, a'_{sl}\, f_s}{k(T_l)\, k'(I_s)} \tag{2.3}$$

in which $k(T_l) = \sum_{j=1}^{m} a'_{jl}$ is the number of neighboring items of tag $T_l$, while $k'(I_s) = \sum_{i=1}^{r} a'_{si}$ is the number of neighboring tags of item $I_s$.

For a target user $U_i$, the initial resource is given in accordance with equation (2.2), and the final resource allocation $\vec{f}_{final}$ on the user-item-tag tripartite graph, used to recommend items to the target user, is obtained as a linear superposition of $\vec{f}'$ and $\vec{f}''$:

$$\vec{f}_{final} = \gamma\, \vec{f}' + (1 - \gamma)\, \vec{f}'' \tag{2.4}$$

in which $\gamma \in [0,1]$ is a tunable parameter, $\vec{f}'$ is the vector obtained in equation (2.1), and $\vec{f}''$ is the vector obtained in equation (2.3). Both vectors are computed for the same target user.
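The two diffusion processes and their linear combination can be sketched as follows. This is an illustrative reading of equations (2.1), (2.3), and (2.4) under stated assumptions: the initial resource of the target user is taken to be that user's row of the user-item matrix (one unit on every collected item), the item-tag relation is passed as a tag-by-item matrix (the transpose of $A'$) so that one routine serves both diffusions, and all names and toy matrices are hypothetical.

```python
def diffuse(adj, f):
    """Two-step diffusion items -> neighbors -> items on a bipartite graph.

    adj[x][j] = 1 if neighbor x (a user or a tag) is linked to item j.
    Computes out[j] = sum_x sum_s adj[x][j] * adj[x][s] * f[s] / (k(x) * k(s)).
    """
    n_items = len(f)
    k_item = [sum(row[s] for row in adj) for s in range(n_items)]
    out = [0.0] * n_items
    for row in adj:
        k_x = sum(row)
        if k_x == 0:
            continue
        # Resource this neighbor receives from the items it is linked to.
        received = sum(f[s] / k_item[s] for s in range(n_items) if row[s])
        # The neighbor passes it back in equal shares to its own items.
        for j in range(n_items):
            if row[j]:
                out[j] += received / k_x
    return out

def integrated_diffusion(user_item, tag_item, target_user, gamma):
    """Linear superposition of the user-item and item-tag diffusion results."""
    f0 = [float(v) for v in user_item[target_user]]  # assumed initial resource
    f_ui = diffuse(user_item, f0)
    f_it = diffuse(tag_item, f0)
    return [gamma * a + (1 - gamma) * b for a, b in zip(f_ui, f_it)]

user_item = [[1, 1, 0], [0, 1, 1]]   # 2 users x 3 items
tag_item = [[1, 0, 1], [0, 1, 1]]    # 2 tags x 3 items (transpose of A')
f_ui = diffuse(user_item, [1.0, 1.0, 0.0])
f_final = integrated_diffusion(user_item, tag_item, target_user=0, gamma=0.5)
```

Each diffusion conserves the total amount of resource, so the entries of `f_ui` still sum to the two units that were placed on the items initially.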

 TPGLDA method’s resource allocation on a tripartite graph for lncRNA- disease association prediction

The TPGLDA model, proposed by Ding et al. [18], is one of the models that infer lncRNA-disease associations from multiple data sources. In the TPGLDA model, confirmed lncRNA-disease associations and validated associations between genes and diseases are combined in a lncRNA-disease-gene tripartite graph. The model's authors applied Zhang et al.’s method for allocating resources on a user-item-tag tripartite graph for recommendation [115] to calculate the resource allocation on diseases and thereby predict associations between lncRNAs and diseases. In this model, lncRNA, disease, and gene are treated as user, item, and tag, respectively. The model performed well in predicting lncRNA-disease associations. However, its authors point out that it does not integrate the third type of known interactions on the tripartite graph, so its performance may be limited by the incompleteness of the data. It could be extended further by incorporating additional biological information.

The proposed model for predicting ncRNA-disease associations based on a

The flowchart of the proposed model is provided in Figure 2.1. The model consists of four main stages. In the first stage, a tripartite graph $G_0$ is constructed from known associations between miRNAs and diseases, known associations between lncRNAs and diseases, and verified interactions between miRNAs and lncRNAs. In the second stage, an item-based CF algorithm is applied to the graph $G_0$ to address the data sparsity problem and to produce a tripartite graph $G_u$. In the third stage, a resource allocation algorithm uses the graph $G_u$ to determine the final resource score of each disease's ncRNA candidates. Finally, for each disease, the resource scores of all ncRNA candidates are ranked in descending order, so that a candidate with a higher resource score has a greater probability of being verified by future biological experiments.

The author's contributions to the proposed model lie in stage 2 and stage 3. In this dissertation, the model was employed to predict miRNA-disease associations as well as lncRNA-disease associations, and the results were published in journals and international conferences, as detailed in sections 2.4 and 2.5 of this chapter.

Figure 2.1 The proposed model's flowchart

Employing the proposed model to infer miRNA-disease associations based on collaborative filtering and resource allocation

This section presents an application of the proposed model of section 2.3 to inferring miRNA-disease associations. It is the first method for predicting associations between miRNAs and diseases that combines an item-based collaborative filtering algorithm, to address the data incompleteness issue, with a resource allocation process on a tripartite graph that uses multiple data sources, to avoid relying only on known miRNA-disease associations.

2.4.1 Detailed description of proposed model's stages in inferring miRNA-disease associations

 Stage 1: Construction of a tripartite graph G 0

At this stage, motivated by a former study [18], a miRNA-disease-lncRNA tripartite graph $G_0$ is constructed as follows.

Let $M = \{m_k;\ k = 1, \ldots, n_m\}$ be the set of miRNAs, $D = \{d_j;\ j = 1, \ldots, n_d\}$ the set of diseases, and $L = \{l_i;\ i = 1, \ldots, n_l\}$ the set of lncRNAs, where $n_m$, $n_d$, and $n_l$ are the numbers of miRNAs, diseases, and lncRNAs, respectively.

Firstly, an $MD_0$ graph is built from the known miRNA-disease associations. It is represented by the adjacency matrix $A^0_{MD}$ of known miRNA-disease associations. The element of $A^0_{MD}$ in the $k$th row and $j$th column is $A^0_{MD}(m_k, d_j)$, where $A^0_{MD}(m_k, d_j) = 1$ if the association between miRNA $m_k$ and disease $d_j$ is known; otherwise $A^0_{MD}(m_k, d_j) = 0$.

Secondly, an $ML_0$ graph is constructed from the known interactions between miRNAs and lncRNAs. It is represented by the adjacency matrix $A^0_{ML}$ of known miRNA-lncRNA interactions. The element of $A^0_{ML}$ in the $k$th row and $i$th column is $A^0_{ML}(m_k, l_i)$, where $A^0_{ML}(m_k, l_i) = 1$ if the interaction between miRNA $m_k$ and lncRNA $l_i$ has been validated; otherwise $A^0_{ML}(m_k, l_i) = 0$.

Thirdly, a $DL_0$ graph is built from the known associations between diseases and lncRNAs. It is represented by the adjacency matrix $A^0_{DL}$ of known lncRNA-disease associations. The element of $A^0_{DL}$ in the $j$th row and $i$th column is $A^0_{DL}(d_j, l_i)$, where $A^0_{DL}(d_j, l_i) = 1$ if the association between disease $d_j$ and lncRNA $l_i$ is known; otherwise $A^0_{DL}(d_j, l_i) = 0$.

Finally, integrating the three graphs $MD_0$, $ML_0$, and $DL_0$ yields the tripartite graph $G_0$. It is represented by the three matrices $A^0_{MD}$, $A^0_{ML}$, and $A^0_{DL}$ described above.
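The construction of the three adjacency matrices can be sketched directly from lists of known pairs. The helper name `adjacency` and the miRNA, disease, and lncRNA identifiers below are hypothetical placeholders, not entries from the datasets used in this chapter.

```python
def adjacency(rows, cols, pairs):
    """Binary adjacency matrix: entry (r, c) is 1 if (r, c) is a known pair."""
    r_idx = {name: i for i, name in enumerate(rows)}
    c_idx = {name: i for i, name in enumerate(cols)}
    matrix = [[0] * len(cols) for _ in rows]
    for r, c in pairs:
        matrix[r_idx[r]][c_idx[c]] = 1
    return matrix

# Placeholder node sets M, D, and L.
mirnas = ["m1", "m2"]
diseases = ["d1", "d2"]
lncrnas = ["l1"]

# The three matrices that together represent the tripartite graph G0.
a0_md = adjacency(mirnas, diseases, [("m1", "d1"), ("m2", "d2")])
a0_ml = adjacency(mirnas, lncrnas, [("m1", "l1")])
a0_dl = adjacency(diseases, lncrnas, [("d2", "l1")])
```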

 Stage 2: Construction of a tripartite graph G u

In the tripartite graph $G_0$, the number of known miRNA-disease associations and the number of known miRNA-lncRNA interactions are extremely limited in comparison with the number of possible associations. Therefore, for a particular pair of lncRNA node $l_i$ and disease node $d_j$, the number of miRNA nodes associated with both $l_i$ and $d_j$ is also very limited. To address this, the proposed method uses an item-based collaborative filtering algorithm to recommend suitable miRNA nodes to the corresponding lncRNA and disease nodes. Since a recommender system can work with a variety of input data, such as users and items [116], the proposed method treats lncRNAs and diseases as users and miRNAs as items. Because the two adjacency matrices $A^0_{ML}$ and $A^0_{MD}$ obtained previously have the same number of rows, they can be joined into another adjacency matrix $A^0_{MLD} = [A^0_{ML}, A^0_{MD}]$. Each row vector of $A^0_{MLD}$ is the concatenation of the corresponding row vectors of $A^0_{ML}$ and $A^0_{MD}$, whereas the column vectors of $A^0_{MLD}$ are the same as the column vectors of $A^0_{ML}$ or $A^0_{MD}$.

From the matrix $A^0_{MLD}$ of the tripartite graph $G_0$, a co-occurrence matrix $R^{n_m \times n_m}$ is obtained, in which the element in the $k$th row and $r$th column is $R(m_k, m_r)$, where $R(m_k, m_r) = 1$ if and only if the two miRNAs $m_k$ and $m_r$ share at least one common neighboring node (a lncRNA or a disease) in $G_0$; otherwise $R(m_k, m_r) = 0$. By normalizing $R^{n_m \times n_m}$, a similarity matrix $R_{nor}$ can be computed as follows:

(2.5)

where $k$ and $r$ index miRNAs; $|N(m_k)|$ is the number of known lncRNAs and diseases related to $m_k$ in the $G_0$ graph, i.e., the number of elements equal to 1 in the $k$th row of $A^0_{MLD}$; $|N(m_r)|$ is the number of known lncRNAs and diseases related to $m_r$ in the $G_0$ graph, i.e., the number of elements equal to 1 in the $r$th row of $A^0_{MLD}$; and $|N(m_k) \cap N(m_r)|$ is the number of known lncRNAs and diseases related to both miRNA $m_k$ and miRNA $m_r$ in $G_0$.

Based on the similarity matrix $R_{nor}$ and the adjacency matrix $A^0_{MLD}$, a new recommender matrix $A^u_{MLD}$ is computed as follows:

Specifically, for each lncRNA $l_i$ or disease $d_j$ in $G_0$, if there exists a miRNA $m_k$ with $A^0_{MLD}(m_k, l_i) = 1$ or $A^0_{MLD}(m_k, d_j) = 1$, the sum of all elements in the $l_i$ or $d_j$ column of $A^u_{MLD}$ is computed, respectively, and its average value $P$ is obtained. Next, if the column $l_i$ or $d_j$ of $A^u_{MLD}$ contains a miRNA $m_\theta$ that satisfies $A^u_{MLD}(m_\theta, l_i) > P$ or $A^u_{MLD}(m_\theta, d_j) > P$, then miRNA $m_\theta$ is recommended for lncRNA $l_i$ or disease $d_j$, respectively. At the same time, a new association between miRNA $m_\theta$ and lncRNA $l_i$, or between miRNA $m_\theta$ and disease $d_j$, is added to the tripartite graph $G_0$. As a result, a tripartite graph $G_u$ is obtained. It consists of three graphs, $MD_{update}$, $ML_{update}$, and $DL_0$, and is represented by three adjacency matrices, $A^u_{MD}$, $A^u_{ML}$, and $A^0_{DL}$. Here $MD_{update}$ is the updated graph of $MD_0$ after the addition of the newly recommended miRNA-disease associations, and $ML_{update}$ is the updated graph of $ML_0$ after the addition of the newly recommended miRNA-lncRNA associations. $A^u_{MD}$ is the adjacency matrix that represents the $MD_{update}$ graph, and $A^u_{ML}$ is the adjacency matrix that represents the $ML_{update}$ graph.
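A compact sketch of this stage is given below. Two steps are assumptions made only for illustration, because their defining equations are not reproduced in the text above: the similarity is taken to be the cosine-style ratio of the number of shared neighbors to the square root of the product of the two neighbor counts, and the recommender matrix is taken to be the product of the similarity matrix with the joined adjacency matrix. The function name `cf_update` and the toy matrices are hypothetical.

```python
from math import sqrt

def cf_update(a_ml, a_md):
    """Return updated (A_ML, A_MD) after item-based CF recommendation of miRNAs."""
    n_m = len(a_ml)
    a0 = [ml + md for ml, md in zip(a_ml, a_md)]   # joined matrix [A_ML, A_MD]
    n_cols = len(a0[0])
    deg = [sum(row) for row in a0]                 # |N(m_k)| for every miRNA
    # ASSUMED normalization: |N(k) & N(r)| / sqrt(|N(k)| * |N(r)|).
    r_nor = [[0.0] * n_m for _ in range(n_m)]
    for k in range(n_m):
        for r in range(n_m):
            common = sum(x * y for x, y in zip(a0[k], a0[r]))
            if common:
                r_nor[k][r] = common / sqrt(deg[k] * deg[r])
    # ASSUMED recommender matrix: similarity matrix times joined adjacency matrix.
    a_u = [[sum(r_nor[k][r] * a0[r][c] for r in range(n_m)) for c in range(n_cols)]
           for k in range(n_m)]
    updated = [row[:] for row in a0]
    for c in range(n_cols):
        if not any(a0[k][c] for k in range(n_m)):
            continue                               # node has no known miRNA
        p = sum(a_u[k][c] for k in range(n_m)) / n_m   # column average P
        for k in range(n_m):
            if a_u[k][c] > p:
                updated[k][c] = 1                  # recommend miRNA k for node c
    n_l = len(a_ml[0])
    return [row[:n_l] for row in updated], [row[n_l:] for row in updated]

# 4 miRNAs, 1 lncRNA, 2 diseases; the first miRNA lacks a link to the second disease.
a_ml = [[1], [1], [1], [0]]
a_md = [[1, 0], [1, 1], [1, 1], [0, 0]]
au_ml, au_md = cf_update(a_ml, a_md)
```

In this toy graph the first miRNA shares its lncRNA and first disease with two miRNAs that are both linked to the second disease, so it is recommended for that disease, while the isolated fourth miRNA receives no recommendation.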

 Stage 3: Employing resource allocation process on the tripartite graph G u to infer miRNA-disease associations

The resource allocation algorithm is applied to the tripartite graph $G_u$ to predict the associations between miRNAs and diseases in the following steps:

Step 1: Calculating resource allocation between miRNAs and diseases

For a particular miRNA $m_k$, the initial resources located on disease $d_j$ are defined as:

$$f_{d_j}(m_k) = A^u_{MD}(m_k, d_j), \quad j = 1, 2, \ldots, n_d \tag{2.7}$$

where $n_d$ is the number of diseases.

Following this, the weight matrix $W = \{w_{kt}\}_{n_m \times n_m}$ is used to calculate the resource moved back from $D$ to $M$, reflecting the resource allocation process between miRNAs and diseases, as:

where $w_{kt}$ can be thought of as the similarity between miRNA $m_k$ and miRNA $m_t$ in the $MD_{update}$ graph. It is the contribution of the resource moved from the $t$th node to the $k$th node in $M$. $\deg_{A^u_{MD}}(m_k)$ is the degree of miRNA $m_k$ in the $MD_{update}$ graph, i.e., the number of diseases related to miRNA $m_k$. Analogously, $\deg_{A^u_{MD}}(d_j)$ is the degree of disease $d_j$ in the $MD_{update}$ graph, i.e., the number of miRNAs related to disease $d_j$.

In accordance with a previous study [18], the resource allocation algorithm is customized by taking into account the level of consistency between the contributions of the resources transferred in both directions. This reflects the effect of the co-selection $(m_k, m_t)$ between the resource contribution from $m_k$ to $m_t$ and the resource contribution from $m_t$ to $m_k$. The following equation can be used to define a consistency-based resource allocation for a final miRNA-disease weight matrix $W'$:

A final resource Rscore_ondisease_1 located on $D$ is defined by combining the final miRNA-disease weight matrix $W'$ and the adjacency matrix $A^u_{MD}$, as follows:
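For orientation only, the sketch below implements the plain two-step resource allocation on a miRNA-disease bipartite graph, in which each miRNA spreads its resource equally over its diseases and each disease returns it equally to its miRNAs. This is an assumed baseline form of the weight matrix: it does not include the consistency-based customization described above, and it should not be read as the exact equations of the proposed method. All names and the toy matrix are hypothetical.

```python
def allocation_weights(a_md):
    """Assumed baseline weights: w[k][t] is the share of miRNA t's resource
    that reaches miRNA k after going to the diseases and back."""
    n_m, n_d = len(a_md), len(a_md[0])
    deg_m = [sum(row) for row in a_md]                               # diseases per miRNA
    deg_d = [sum(a_md[k][j] for k in range(n_m)) for j in range(n_d)]  # miRNAs per disease
    w = [[0.0] * n_m for _ in range(n_m)]
    for k in range(n_m):
        for t in range(n_m):
            if deg_m[t] == 0:
                continue
            shared = sum(a_md[k][j] / deg_d[j] for j in range(n_d) if a_md[t][j])
            w[k][t] = shared / deg_m[t]
    return w

def resource_scores(w, a_md):
    """Resource located on each disease for each miRNA: matrix product W x A."""
    n_m, n_d = len(a_md), len(a_md[0])
    return [[sum(w[k][t] * a_md[t][j] for t in range(n_m)) for j in range(n_d)]
            for k in range(n_m)]

# 3 miRNAs x 2 diseases.
a_md = [[1, 1], [1, 0], [0, 1]]
w = allocation_weights(a_md)
scores = resource_scores(w, a_md)
```

With these weights every column of `w` sums to 1, so resource is conserved, and a miRNA obtains a nonzero score on a disease it is not yet linked to whenever it shares another disease with a miRNA that is.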

Step 2: Calculating resource allocation between diseases and lncRNAs

Following the resource allocation between genes and diseases in TPGLDA [18], the identical initial resources located on the nodes in $M$ are assigned from the nodes in $M$ to the nodes in $D$ and afterward moved back. The final resource matrix Rscore_ondisease_2 located on the $D$ nodes is computed by:

where $\deg_{A^0_{DL}}(l_i) = \sum_{j=1}^{n_d} A^0_{DL}(d_j, l_i)$ is the degree of lncRNA $l_i$ in the $DL_0$ graph, i.e., the number of diseases related to lncRNA $l_i$, and $\deg_{A^0_{DL}}(d_j) = \sum_{i=1}^{n_l} A^0_{DL}(d_j, l_i)$ is the degree of disease $d_j$ in the $DL_0$ graph, i.e., the number of lncRNAs related to disease $d_j$.

Step 3: Calculating the final resource score Rscore_final to infer the potential disease-related miRNAs

To measure potential disease-related miRNAs, the final resource score Rscore_final is calculated as in equation (2.12):

Rscore_final = γ ∗ Rscore_ondisease_1 + (1-γ) ∗ Rscore_ondisease_2 (2.12)

where γ is a tunable parameter with a value between 0 and 1. When γ = 0.9, the proposed model has the best prediction performance.

 Stage 4: Ranking all candidate miRNAs’ Rscores for each disease in descending order

Lastly, Rscore_final of all miRNAs for each disease are ranked in descending order so that candidates with higher score values will have more chances to be confirmed in the future.
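The combination in equation (2.12) and the ranking of this stage can be sketched as follows. The two input score matrices are invented toy numbers, and the function names, the `known` filter, and the miRNA labels are hypothetical; only the weighting by γ and the descending sort come from the text.

```python
def final_scores(score_1, score_2, gamma=0.9):
    """Element-wise gamma * score_1 + (1 - gamma) * score_2, as in equation (2.12)."""
    return [[gamma * a + (1 - gamma) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(score_1, score_2)]

def rank_candidates(scores, mirnas, disease_idx, known=()):
    """miRNAs sorted by descending final score for one disease, skipping known ones."""
    candidates = [(mirnas[k], row[disease_idx]) for k, row in enumerate(scores)
                  if mirnas[k] not in known]
    return sorted(candidates, key=lambda kv: -kv[1])

# Toy miRNA x disease score matrices standing in for the two resource scores.
score_1 = [[1.0, 1.0], [0.75, 0.25], [0.25, 0.75]]
score_2 = [[0.2, 0.4], [0.6, 0.8], [0.0, 1.0]]
mirnas = ["m0", "m1", "m2"]

final = final_scores(score_1, score_2, gamma=0.9)
ranking = rank_candidates(final, mirnas, disease_idx=1)
```

Passing the miRNAs already associated with the disease as `known` restricts the ranking to new candidates, which is the list that would be handed to a case study.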

2.4.2 Proposed method's experiments and results

To assess the performance of the proposed method, we performed the experiments described in the following steps:

- Step 2: Implementing the proposed method and estimating its time complexity.

- Step 4: Checking case studies to support prediction performance reliability.

The proposed method used the same datasets as the study of Zhao et al. [117]. For this reason, the prediction performance of the proposed method is compared with that of Zhao et al.'s method in the section on evaluating prediction performance. These datasets contain 190 diseases, 111 lncRNAs, 264 miRNAs, 1880 known lncRNA-miRNA associations, 936 known associations between diseases and lncRNAs, and 3552 verified miRNA-disease associations, as described below:

Known lncRNA-miRNA associations. The dataset of known lncRNA-miRNA associations was gathered from starBase v2.0 in February 2017 [118]. Based on large-scale CLIP-Seq data, it provides the most comprehensive set of experimentally confirmed lncRNA-miRNA interactions. The DS1 dataset, which includes 1880 known lncRNA-miRNA associations, was obtained after eliminating duplicate values and inaccurate data and removing lncRNAs not included in the known lncRNA-disease associations (DS2) dataset.

Known lncRNA-disease associations. The dataset of known lncRNA-disease associations was gathered from the 8842 known disease-lncRNA associations of the MNDR database [119] and the 2934 known disease-lncRNA associations of the LncRNADisease database [120]. Because the disease names came from different data sources, the DS2 dataset, which contains 936 known associations between diseases and lncRNAs, was obtained after eliminating diseases without any MeSH descriptors, consolidating diseases with the same MeSH descriptors, and eliminating lncRNAs not included in the lncRNA-miRNA dataset (DS1).

Employing the proposed model to predict lncRNA-disease associations based

In the previous section 2.4, a new method was applied on the proposed model to infer miRNA-disease associations. It achieved a reliable performance in inferring miRNA-disease associations. However, as indicated in section 2.4.3, it did not integrate the third type of association into the resource allocation process on the tripartite graph. Therefore, this section presents a new method of employing the proposed model, which uses a collaborative filtering algorithm and an improved resource allocation process integrating the third type of association on a tripartite graph, to increase the performance and reliability of lncRNA-disease association prediction.

2.5.1 Detailed description of proposed model's stages in predicting lncRNA-disease associations

At the first stage, a tripartite graph $G_0$ was built from the known lncRNA-disease, known miRNA-disease and verified lncRNA-miRNA interaction datasets. An adjacency matrix $A_{LD}^{known}$ was used to represent the known lncRNA-disease associations, with $A_{LD}^{known}(i, j) = 1$ if and only if the association between lncRNA $l_i$ and disease $d_j$ was confirmed, otherwise $A_{LD}^{known}(i, j) = 0$. An adjacency matrix $A_{MD}^{known}$ was used to represent the known miRNA-disease associations, with $A_{MD}^{known}(t, j) = 1$ if and only if the association between miRNA $m_t$ and disease $d_j$ was confirmed, otherwise $A_{MD}^{known}(t, j) = 0$. An adjacency matrix $A_{LM}^{known}$ was used to represent the verified lncRNA-miRNA interactions, with $A_{LM}^{known}(i, t) = 1$ if and only if the interaction between lncRNA $l_i$ and miRNA $m_t$ was known, otherwise $A_{LM}^{known}(i, t) = 0$. The numbers of lncRNAs, miRNAs and diseases are $n_l$, $n_m$ and $n_d$, respectively.
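As a small illustration of this first stage, the three binary adjacency matrices can be assembled from lists of index pairs. The helper `build_adjacency` and the toy index pairs below are hypothetical; real inputs would come from the datasets described in the experiments section.

```python
import numpy as np

def build_adjacency(pairs, n_rows, n_cols):
    """Return a binary matrix A with A[i, j] = 1 for every (i, j) in pairs, else 0."""
    A = np.zeros((n_rows, n_cols), dtype=int)
    for i, j in pairs:
        A[i, j] = 1
    return A

# Hypothetical toy data: 3 lncRNAs, 2 diseases, 2 miRNAs.
A_LD_known = build_adjacency([(0, 0), (1, 1), (2, 1)], 3, 2)  # lncRNA x disease
A_MD_known = build_adjacency([(0, 0), (1, 1)], 2, 2)          # miRNA x disease
A_LM_known = build_adjacency([(0, 0), (1, 0), (2, 1)], 3, 2)  # lncRNA x miRNA
```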

Stage 2: Applying collaborative filtering algorithm on known lncRNA-disease associations and verified lncRNA-miRNA interactions to obtain a new tripartite graph $G_u$

The numbers of known lncRNA-disease associations and verified lncRNA-miRNA interactions are very limited in comparison with the total number of possible associations of each type. Therefore, we used an improved item-based collaborative filtering algorithm to reduce the impact of the imbalanced data problem. Specifically, the known lncRNA-disease associations and verified lncRNA-miRNA interactions are taken as inputs to an item-based collaborative filtering algorithm in order to recommend reasonable lncRNA nodes to disease nodes and miRNA nodes, respectively. Concretely, the diseases and miRNAs are considered as users while lncRNAs are considered as items in a recommender system. This differs from the collaborative filtering process in [13], where lncRNAs and diseases were considered as users whereas miRNAs were treated as items. The steps below describe the improved item-based collaborative filtering process in the proposed method:

Step 1: Constructing a new adjacency matrix 𝐴𝐿𝐷𝑀 1

Because $A_{LD}^{known}$ and $A_{LM}^{known}$ have the same number of rows, we spliced them together to obtain a new adjacency matrix $A_{LDM}^{1}$. The row vectors of $A_{LDM}^{1}$ contain the combination of the row vectors in $A_{LD}^{known}$ and $A_{LM}^{known}$, whereas its column vectors are the same as the column vectors in $A_{LD}^{known}$ and

Step 2: Computing a new recommender matrix 𝐴𝐿𝐷𝑀 2

Based on the matrix $A_{LDM}^{1}$, a co-occurrence matrix $R_{n_l \times n_l}$ is computed, where $n_l$ is the number of lncRNAs. The entry $R(l_k, l_r)$ denotes the element in the k-th row and r-th column of $R_{n_l \times n_l}$, and $R(l_k, l_r) = 1$ if and only if lncRNA $l_k$ and lncRNA $l_r$ share at least one common disease or miRNA, otherwise $R(l_k, l_r) = 0$.

A similarity matrix $R_{normalized}$ is built by normalizing $R_{n_l \times n_l}$ as in equation (2.13), where k and r are the indexes of lncRNAs. $|N(l_k)|$ is the number of known diseases and verified miRNAs associated with lncRNA $l_k$ in $A_{LDM}^{1}$; in other words, it is the number of values equal to 1 in the k-th row of $A_{LDM}^{1}$. In a like manner, $|N(l_r)|$ is the number of values equal to 1 in the r-th row of $A_{LDM}^{1}$. $|N(l_k) \cap N(l_r)|$ is the number of known diseases and verified miRNAs associated with both lncRNA $l_k$ and lncRNA $l_r$.

A new recommender matrix 𝐴𝐿𝐷𝑀 2 is constructed on 𝐴𝐿𝐷𝑀 1 and R normalized as:

Step 3: Updating the adjacency matrix 𝐴𝐿𝐷𝑀 1 based on 𝐴𝐿𝐷𝑀 2 matrix to obtain a new updated adjacency matrix 𝐴𝐿𝐷𝑀 3

For a specific disease $d_j$ or miRNA $m_t$ in $A_{LDM}^{1}$, if there exists a lncRNA $l_k$ such that $A_{LDM}^{1}(l_k, d_j) = 1$ or $A_{LDM}^{1}(l_k, m_t) = 1$, the sum of all elements in the j-th or t-th column of the $A_{LDM}^{2}$ matrix is first computed to obtain its corresponding average value $P_a$. Next, if the j-th or t-th column of $A_{LDM}^{2}$ contains a lncRNA $l_\theta$ such that $A_{LDM}^{2}(l_\theta, d_j) > P_a$ or $A_{LDM}^{2}(l_\theta, m_t) > P_a$, then lncRNA $l_\theta$ is recommended for disease $d_j$ or miRNA $m_t$, respectively. Accordingly, a new updated adjacency matrix $A_{LDM}^{3}$ is computed as:

Step 4: Split $A_{LDM}^{3}$ into two matrices $A_{LD}^{new}$ and $A_{LM}^{new}$, which have the same shapes and positions as $A_{LD}^{known}$ and $A_{LM}^{known}$ in $A_{LDM}^{1}$. $A_{LD}^{new}$ is the new set of lncRNA-disease associations and $A_{LM}^{new}$ is the new set of lncRNA-miRNA interactions after the collaborative filtering process has finished.
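The four collaborative filtering steps above can be sketched in Python as follows. Three details are assumptions of this sketch rather than statements of the original equations: the normalization uses the cosine-style form $|N(l_k) \cap N(l_r)| / \sqrt{|N(l_k)| \, |N(l_r)|}$ that is common in item-based collaborative filtering, the recommender matrix is taken as the product of the normalized similarity matrix with the spliced matrix, and the threshold $P_a$ is the column mean. The function name `collaborative_filtering` is hypothetical.

```python
import numpy as np

def collaborative_filtering(A_LD, A_LM):
    """Item-based collaborative filtering sketch: lncRNAs are items, diseases/miRNAs are users."""
    n_d = A_LD.shape[1]
    # Step 1: splice the two matrices side by side (rows = lncRNAs).
    A1 = np.hstack([A_LD, A_LM]).astype(float)
    # Step 2: shared-neighbour counts and ASSUMED cosine-style normalisation.
    common = A1 @ A1.T                      # |N(l_k) ∩ N(l_r)|
    deg = A1.sum(axis=1)                    # |N(l_k)|
    denom = np.sqrt(np.outer(deg, deg))
    R_norm = np.divide(common, denom, out=np.zeros_like(common), where=denom > 0)
    A2 = R_norm @ A1                        # ASSUMED form of the recommender matrix
    # Step 3: in every column with at least one known link, recommend the
    # lncRNAs whose score exceeds the column average P_a; keep known links.
    has_link = A1.sum(axis=0) > 0
    P_a = A2.mean(axis=0)
    recommended = (A2 > P_a) & has_link
    A3 = np.where((A1 == 1) | recommended, 1, 0)
    # Step 4: split back into the lncRNA-disease and lncRNA-miRNA parts.
    return A3[:, :n_d], A3[:, n_d:]

# Toy example: lncRNAs 0 and 1 share miRNA 0, and only lncRNA 0 is linked to disease 0.
A_LD_known = np.array([[1], [0], [0]])
A_LM_known = np.array([[1, 0], [1, 0], [0, 1]])
A_LD_new, A_LM_new = collaborative_filtering(A_LD_known, A_LM_known)
```

In this toy case lncRNA 1 is recommended to disease 0 because it shares a miRNA with lncRNA 0, while the unrelated lncRNA 2 is not.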

A tripartite graph was built by integrating the new lncRNA-disease associations, new lncRNA-miRNA interactions and known miRNA-disease associations, which are represented by the three matrices $A_{LD}^{new}$, $A_{LM}^{new}$ and $A_{MD}^{known}$, respectively. The tripartite graph is denoted by $TG(LS, DS, MS, E)$, where $LS = (l_1, l_2, \ldots, l_{n_l})$ is the set of lncRNAs, $DS = (d_1, d_2, \ldots, d_{n_d})$ is the set of diseases and $MS = (m_1, m_2, \ldots, m_{n_m})$ is the set of miRNAs; $n_l$, $n_d$ and $n_m$ indicate the numbers of lncRNAs, diseases and miRNAs, respectively. It consists of three subgraphs, the LD graph, the LM graph and the MD graph, which correspond to the three matrices $A_{LD}^{new}$, $A_{LM}^{new}$ and $A_{MD}^{known}$, respectively.

Stage 3: Using improved resource allocation process to obtain predicted lncRNA-disease associations.

After attaining a tripartite graph at Stage 2, an improved resource allocation process was employed as in the following steps:

Step 1: Calculating resource allocation between lncRNAs and diseases, and resource allocation between lncRNAs and miRNAs, simultaneously.

For a particular lncRNA $l_i$, the initial resource located at disease $d_j$ was determined as in equation (2.16):

$$iRS_d = A_{LD}^{new}(i, j), \quad i = 1, \ldots, n_l, \; j = 1, \ldots, n_d \qquad (2.16)$$

and the initial resource located at miRNA $m_t$ was determined as in equation (2.17):

$$iRS_m = A_{LM}^{new}(i, t), \quad i = 1, \ldots, n_l, \; t = 1, \ldots, n_m \qquad (2.17)$$

Following the previous study [18], the resource transferred back to DS from LS is computed by using a weight matrix $WM1_{n_l \times n_l}$ to indicate the resource allocation process between lncRNAs and diseases as in equation (2.18):

where $wm1_{ij}$ is the resource contribution moved from the i-th node to the j-th node in LS. It represents the similarity between lncRNA $l_i$ and lncRNA $l_j$ in the LD graph. $\deg_{A_{LD}^{new}}(l_i)$ is the number of diseases associated with lncRNA $l_i$, that is, the degree of lncRNA $l_i$ in the LD graph, and $\deg_{A_{LD}^{new}}(d_s)$ is the degree of disease $d_s$ in the LD graph.

Similarly, the resource moved back to MS from LS is computed by utilizing a weight matrix $WM2_{n_l \times n_l}$ to reflect the resource allocation process as follows:

where $wm2_{ij}$ is the similarity between lncRNA $l_i$ and lncRNA $l_j$ in the LM graph, $\deg_{A_{LM}^{new}}(l_i)$ is the degree of lncRNA $l_i$ in the LM graph, and $\deg_{A_{LM}^{new}}(m_t)$ is the degree of miRNA $m_t$ in the LM graph.

Two final weight matrices $WM1_{final}$ and $WM2_{final}$ are defined as follows:

The final total resource allocation weight matrix 𝑊𝑀 𝑓𝑖𝑛𝑎𝑙 is determined as:

A resource scores RS1 matrix located on disease set DS is built by the multiplication of final total resource allocation weight matrix and ALD new as:

Step 2: Computing resource allocation between miRNAs and diseases

Similar to the resource allocation between genes and diseases in TPGLDA method [18], we defined a resource score matrix RS2 between miRNAs and diseases located on DS nodes as:

where $\deg_{A_{MD}^{known}}(m_t)$ represents the degree of miRNA $m_t$ in the MD graph.

Step 3: Computing final resource scores RSfinal to infer lncRNA-disease associations.

At this step, the final resource score matrix $RS_{final}$, which is used to estimate potential disease-associated lncRNAs, is computed as below:

Stage 4: Ranking all predicted lncRNAs for each disease to have final results

To identify the most promising disease-associated lncRNAs for each disease, the resource scores of all candidate lncRNAs for each disease in $RS_{final}$ are ranked in descending order. A candidate with a higher value has a greater possibility of being a true association and a better chance of being confirmed in the future.
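The general idea behind the resource allocation stage can be illustrated with the standard two-step allocation on a bipartite graph, in which resource flows from lncRNAs to their neighbors and back. The weight formula used below, $w_{ij} = \frac{1}{\deg(l_j)} \sum_s \frac{A(i, s) A(j, s)}{\deg(s)}$, is the common network-based inference form and is an assumption of this sketch; the improved weights of the proposed method, which also combine the LD and LM graphs, may differ. The function name `allocation_weights` is hypothetical.

```python
import numpy as np

def allocation_weights(A):
    """Two-step resource-allocation weights between the row nodes of a bipartite graph.

    A: binary matrix (rows = lncRNAs, columns = diseases or miRNAs).
    W[i, j] = (1 / deg(row j)) * sum_s A[i, s] * A[j, s] / deg(col s)
    (ASSUMED standard network-based-inference form).
    """
    A = np.asarray(A, dtype=float)
    row_deg = A.sum(axis=1)
    col_deg = A.sum(axis=0)
    inv_col = np.divide(1.0, col_deg, out=np.zeros_like(col_deg), where=col_deg > 0)
    inv_row = np.divide(1.0, row_deg, out=np.zeros_like(row_deg), where=row_deg > 0)
    # (A * inv_col) scales column s by 1/deg(s); the trailing product scales
    # column j of the result by 1/deg(row j).
    return ((A * inv_col) @ A.T) * inv_row

# Toy lncRNA x disease matrix (made-up values).
A_LD_new = np.array([[1, 1, 0], [0, 1, 1], [0, 0, 1]])
W = allocation_weights(A_LD_new)
RS = W @ A_LD_new                    # resource scores located on the diseases
ranking = np.argsort(-RS, axis=0)    # candidate lncRNAs per disease, best first
```

With this form every column of `W` sums to 1, so the total resource placed on each disease is preserved by the allocation.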

2.5.2 Proposed method’s experiments and results

To assess the performance of the proposed method, experiments were performed in the following steps:

- Step 2: Implementing the proposed method and estimating its time complexity.

- Step 4: Checking case studies to support prediction performance reliability.

The proposed method used the datasets from the studies of Fu et al. [79] and Yao et al. [89]. These datasets contain 2697 tentatively known lncRNA-disease associations (LDAs), 13562 known miRNA-disease associations (MDAs) and 1002 verified lncRNA-miRNA interactions (LMIs) among 240 lncRNAs, 420 diseases and 495 miRNAs, as described in detail below.

Known lncRNA-disease associations. The known lncRNA-disease associations dataset (DS4) was gathered from the Lnc2Cancer [137], lncRNADisease [138] and GeneRIF [139] databases. It contains 2697 experimentally supported lncRNA-disease associations between 240 lncRNAs and 420 diseases.

Known miRNA-disease associations. The dataset (DS5) of known miRNA-disease associations was obtained from the HMDD (V2.0) database [121]. There are a total of 13562 known associations between miRNAs and diseases, involving 495 miRNAs and 412 diseases.

Verified lncRNA-miRNA interactions. The dataset of confirmed lncRNA-miRNA interactions (DS6) was acquired from the starBase database [118]. It includes 1002 confirmed lncRNA-miRNA interactions involving 240 lncRNAs and

Figure 2.5 illustrates the data nodes’ numbers and the relationships of the different data sources used in the proposed method.

Figure 2.5 The relationships between the different data sources and the numbers of data nodes used in the proposed method

Step 2: Implementing the proposed method and estimating its time complexity

Similar to the previous proposed method, this proposed method was also implemented by using the Python programming language in PyCharm Community IDE along with the supported packages and libraries.

The time complexity of the most complex stage is used to estimate the proposed method's time complexity. This is stage 3, using the improved resource allocation process to obtain predicted lncRNA-disease associations. Algorithm 2.2 shows the pseudocode that describes the steps in this stage. The most complex block of statements contains three nested for loops and other basic operations. As a result, the proposed method's time complexity is $O(n_l \cdot n_d \cdot n_m) \approx O(n^3)$, which means that its time complexity is polynomial.

5-fold cross-validation experiments were used to estimate the proposed method's performance. As mentioned above, there were 2697 known associations in the known lncRNA-disease associations dataset. The known association rate of 2.727% across all possible lncRNA-disease associations demonstrates a data imbalance problem. While these known associations are regarded as positive samples, the remaining associations are regarded as negative samples. All positive and negative samples are randomly divided into five equal parts for 5-fold cross-validation; in each run, four parts are used for training and the remaining part is used for testing. Then, firstly, the $A_{LD}^{new}$ and $A_{LM}^{new}$ matrices are recalculated. Secondly, the final resource score $RS_{final}$ is computed again. Finally, the values of AUC and AUPR are measured.

Evaluating AUC under 5-fold cross-validation

MIRNA-DISEASE ASSOCIATIONS PREDICTION USING

Motivation and main related works

In recent years, random walk-based computational methods have found numerous significant applications in predicting miRNA-disease associations, including the studies of Le et al. [70] and BRWH [145]. Additionally, Niu et al. [71] introduced the Random Walk and Binary Regression based miRNA-disease association prediction (RWBRMDA) method, which uses the integrated miRNA similarity network for binary logistic regression to extract features for each miRNA. Li et al. [72] predicted miRNA-disease associations using a network projection-based dual random walk with restart (NPRWR) model.

However, in the majority of the above common random walk-based methods, the walk probabilities of the associated neighbor nodes of a disease or miRNA node were assigned uniformly in accordance with its degree. Additionally, these methods were ineffective in predicting associations for isolated diseases or miRNAs, which have no known associated miRNAs or diseases in the examined datasets. As a result, researchers have recently adopted the assumption that a disease (miRNA) has distinct relevance probabilities for each associated miRNA (disease), so that each miRNA-disease association is given a distinct weight value in the various heterogeneous network spaces constructed by integrating multiple similarities. Moreover, a recent study by Luo J and Long Y used an extended random walk with restart algorithm to identify the majority of potential microbe-disease associations based on a heterogeneous network of known disease-microbe associations, a Gaussian kernel microbe similarity network, and a Gaussian kernel disease similarity network [146]. It performed well in predicting associations between diseases and microbes. However, its authors noted that incorporating other types of prior biological information, such as disease symptom similarity, microbe functional similarity, and disease semantic similarity networks, could improve its performance. For this reason, we improve the random walk with restart algorithm by assigning a different weight value to each miRNA-disease association in different spaces, in contrast to common RWR algorithms [70], [71], to forecast miRNA-disease associations.

Additionally, common RWR computational approaches for predicting miRNA-disease associations still have some drawbacks that could be overcome for improved performance. One of these drawbacks is the problem of sparse and incomplete data, which affects prediction accuracy. The number of known miRNA-disease associations is very limited in comparison to the number of non-interacting miRNA-disease pairs, which are unknown cases and could potentially be true associations in the training datasets based on their known neighbors. Furthermore, the weighted K-nearest known neighbors (WKNKN) algorithm was recently utilized as a pre-processing step to reduce the number of unknown values in the miRNA-disease association set in different studies, including the methods of Ezzat et al. [14], Wu et al. [15], Gao et al. [16], and Li et al. [17], by taking into account the nearest neighbor information for miRNAs and diseases. In these methods, the association profile of a new miRNA or disease was determined using its similarities to other miRNAs or diseases, respectively, to diminish the negative impact of a large number of missing associations [14], [16].

In this chapter, a new method to forecast latent miRNA-disease associations utilizing an improved RWR algorithm and integrating multiple similarities (RWRMMDA) is proposed. The proposed method uses the WKNKN algorithm as a pre-processing step to resolve the data sparsity issue. It also integrates multiple data sources to increase prediction reliability. In addition, it adopts the random walk with restart method introduced by Luo J and Long Y [146] for predicting microbe-disease associations and improves the random walk with restart process to uncover latent miRNA-disease associations.

3.2 Datasets used in the proposed method

In the proposed method, the dataset of miRNA-disease associations was downloaded from the HMDD V2.0 database [121]. It contained 5430 experimentally confirmed relationships between 383 diseases and 495 miRNAs. The miRNA-disease associations dataset was represented by an adjacency matrix $A_{DM}$. Specifically, if the association between disease $d_i$ and miRNA $m_j$ is already known, the element $A_{DM}(i, j)$ is set to 1, otherwise $A_{DM}(i, j)$ is set to 0. Thereby, the i-th row of $A_{DM}$ is a binary vector that reflects the associations between disease $d_i$ and each miRNA, and the j-th column of $A_{DM}$ is a binary vector that represents the associations between miRNA $m_j$ and each disease.

Disease semantic similarity was measured according to the literature [59], [84], [147]. By downloading MeSH descriptors from the National Library of Medicine (http://www.ncbi.nlm.nih.gov/), the relationships of diseases based on hierarchical directed acyclic graphs (DAGs) were gathered. Typically, DAGs are used to calculate the similarity between diseases.

In particular, the directed acyclic graph of a disease d is denoted by $DAG(d) = (d, TA_d, EC_d)$, where $TA_d$ denotes the set containing the ancestors of disease d and d itself, and $EC_d$ denotes the set of edges that link parent nodes to child nodes in the MeSH tree. Then, the semantic contribution of disease t to disease d is determined as in equation (3.1) below.

Here $\Delta$ signifies a predefined semantic contribution factor with values between 0 and 1. In accordance with Wang et al. [147], Xu et al. [59] and Chen et al. [84], $\Delta$ was set to 0.5 in the proposed method. Then, the semantic similarity between diseases is calculated based on the assumption that two diseases sharing a larger part of their DAGs tend to have higher semantic similarity, as in formula (3.2).
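The DAG-based computation can be sketched in Python as below. The sketch assumes the formulation commonly attributed to Wang et al.: a disease contributes 1 to itself, each ancestor receives the maximum of $\Delta$ times the contribution of its children, and the similarity of two diseases is the summed contribution of their shared ancestors divided by the sum of their total semantic values. The `parents` dictionary, the disease names and the function names are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of every node in DAG(disease) to the semantics of `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, ()):
                value = delta * contrib[node]
                # Keep the largest contribution reaching each ancestor.
                if value > contrib.get(parent, 0.0):
                    contrib[parent] = value
                    next_frontier.append(parent)
        frontier = next_frontier
    return contrib

def semantic_similarity(d1, d2, parents, delta=0.5):
    """Shared-ancestor contribution divided by the two total semantic values."""
    c1 = semantic_contributions(d1, parents, delta)
    c2 = semantic_contributions(d2, parents, delta)
    shared = set(c1) & set(c2)
    numerator = sum(c1[t] + c2[t] for t in shared)
    return numerator / (sum(c1.values()) + sum(c2.values()))

# Hypothetical MeSH-like hierarchy given as child -> list of parents.
parents = {"disease_A": ["group_C"], "disease_B": ["group_C"], "group_C": ["root"]}
sim_ab = semantic_similarity("disease_A", "disease_B", parents)
sim_aa = semantic_similarity("disease_A", "disease_A", parents)
```

In this toy hierarchy the two sibling diseases share their parent and the root, which gives a similarity of 1.5 / 3.5, while a disease compared with itself gives 1.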

In accordance with previous studies [59], [147], the functional similarities among miRNAs were calculated by functional similarity measurements in the proposed method.

Figure 3.1 depicts the computation of miRNA functional similarity.

Figure 3.1 Illustration of computing miRNA functional similarity

In particular, let the disease sets associated with any two miRNAs $m_i$ and $m_j$ be represented by $DTT_i = \{d_{i1}, d_{i2}, \ldots, d_{ik}\}$ and $DTT_j = \{d_{j1}, d_{j2}, \ldots, d_{jl}\}$. First, $SS(d, DTT) = \max_{d_i \in DTT} DSS(d, d_i)$ was used to indicate how similar a disease d is to a set DTT, in accordance with Wang et al. [147] and Xu et al. [59]. Then, the similarity between $m_i$ and $m_j$ was calculated as below:
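The set-based comparison described above can be sketched in Python. The sketch assumes the widely used form in which the similarity of two miRNAs is the sum of $SS(d, DTT_j)$ over the diseases of $m_i$ plus the sum of $SS(d, DTT_i)$ over the diseases of $m_j$, divided by $|DTT_i| + |DTT_j|$. The function names and the toy DSS matrix are hypothetical.

```python
import numpy as np

def set_similarity(d, dtt, dss):
    """SS(d, DTT): the largest semantic similarity between disease d and any disease in DTT."""
    return max(dss[d, other] for other in dtt)

def mirna_functional_similarity(dtt_i, dtt_j, dss):
    """ASSUMED form: summed best matches in both directions, divided by the total set size."""
    total = sum(set_similarity(d, dtt_j, dss) for d in dtt_i)
    total += sum(set_similarity(d, dtt_i, dss) for d in dtt_j)
    return total / (len(dtt_i) + len(dtt_j))

# Made-up disease semantic similarity matrix for three diseases.
dss = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.4],
                [0.2, 0.4, 1.0]])
mfs = mirna_functional_similarity([0, 1], [1, 2], dss)  # disease index sets of two miRNAs
```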

Proposed method

Figure 3.2 The workflow of the proposed method (RWRMMDA)

Figure 3.2 depicts the workflow of the proposed improved random walk with restart and integrating multiple similarities (RWRMMDA) method for predicting potential miRNA-disease associations.

In general, RWRMMDA consists of six stages, as described in the remainder of this section. It uses information about known miRNA-disease associations, miRNA functional similarity, and disease semantic similarity as inputs.

At the first stage, the known association adjacency matrix $A_{DM}$ was used to compute the Gaussian interaction profile kernel similarities of miRNAs and diseases, $GIP_{miRNA}$ and $GIP_{disease}$, respectively. The detailed equations to calculate these similarities are given in section 3.3.2.

At the second stage, the Integrated Similarity for miRNAs (ISM) and the Integrated Similarity for diseases (ISD) were computed. ISM was obtained by integrating miRNA functional similarity (MFS) with $GIP_{miRNA}$, and ISD was obtained by integrating disease semantic similarity (DSS) with $GIP_{disease}$, according to previous studies [59], [148]. The detailed equations and steps to calculate ISM and ISD are given in section 3.3.3.

During the third stage, the WKNKN algorithm was performed as a pre-processing step to eliminate unknown values in the set of miRNA-disease associations by taking into account the nearest neighbor information for miRNAs and diseases. The detailed steps and explanation of WKNKN are given in section 3.3.4.

Following that, two heterogeneous networks, one miRNA similarity-based and one disease similarity-based, were built at the fourth stage. Because the degree to which different miRNAs (diseases) are associated with a particular disease (miRNA) differs, constructing two heterogeneous networks in different spaces allows different walk probabilities to be assigned to nodes in different networks. The detailed steps of constructing the two heterogeneous networks are given in section 3.3.5.

Then, the final prediction probabilities were calculated using an improved RWR algorithm on the disease similarity-based and miRNA similarity-based heterogeneous networks simultaneously. The main difference between the improved random walk with restart in our proposed method and other common RWR algorithms [70], [71] is that the random walk process is performed on two heterogeneous networks in different spaces simultaneously, and the walk probabilities for nodes in different networks are also different. The details of the improved random walk process are presented in section 3.3.6.

Finally, the prediction scores were ranked in descending order to obtain the most promising miRNA-disease associations.

3.3.2 Calculating Gaussian interaction profile kernel similarity for miRNAs and diseases.

At the first stage, the Gaussian interaction profile kernel similarities for miRNAs and diseases were computed from the known association adjacency matrix $A_{DM}$ in accordance with the literature [59], [84]. The vector related to disease $d_i$ is denoted by $A_{DM}(d_i)$, the i-th row of the adjacency matrix $A_{DM}$. Analogously, the vector related to miRNA $m_j$ is denoted by $A_{DM}(m_j)$, the j-th column of $A_{DM}$. Thus, we calculated $GIP_{disease}$, the Gaussian interaction profile kernel similarity between disease $d_i$ and disease $d_j$, as below:

$$GIP_{disease}(d_i, d_j) = \exp\left(-\gamma_d \left\| A_{DM}(d_i) - A_{DM}(d_j) \right\|^2\right) \qquad (3.4)$$

where $\gamma_d$ is an adjustment parameter for the kernel bandwidth, and it is updated as below:

$$\gamma_d = \gamma'_d \Big/ \left( \frac{1}{n_d} \sum_{i=1}^{n_d} \left\| A_{DM}(d_i) \right\|^2 \right)$$

Here $\gamma'_d$ is generally set to 1 in accordance with previous studies [59], [84].

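Similarly, the GIP kernel similarity between miRNA $m_i$ and miRNA $m_j$ was measured as below:

$$GIP_{miRNA}(m_i, m_j) = \exp\left(-\gamma_m \left\| A_{DM}(m_i) - A_{DM}(m_j) \right\|^2\right)$$

where $\gamma_m$ is an adjustment parameter for the kernel bandwidth, and it is updated as:

$$\gamma_m = \gamma'_m \Big/ \left( \frac{1}{n_m} \sum_{i=1}^{n_m} \left\| A_{DM}(m_i) \right\|^2 \right)$$

Here $\gamma'_m$ is generally set to 1 in accordance with previous studies [59], [84].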
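A compact NumPy sketch of the Gaussian interaction profile kernel is given below. It assumes the standard formulation, in which the bandwidth is normalized by the average squared norm of the interaction profiles, and it requires at least one known association so that this average is non-zero. The function name `gip_kernel` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Bandwidth: gamma' divided by the mean squared norm of the profiles.
    gamma = gamma_prime / sq_norms.mean()
    # Squared Euclidean distances between all pairs of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (profiles @ profiles.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

# Toy disease x miRNA adjacency matrix (made-up values).
A_DM = np.array([[1, 0, 1],
                 [1, 0, 0],
                 [0, 1, 1]])
GIP_disease = gip_kernel(A_DM)    # rows of A_DM are disease profiles
GIP_miRNA = gip_kernel(A_DM.T)    # columns of A_DM are miRNA profiles
```

Both kernels are symmetric with ones on the diagonal, since every profile is at distance zero from itself.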

3.3.3 Calculating Integrated similarity for miRNAs and diseases

The disease semantic similarity (DSS) can be determined based on DAGs as mentioned above. However, DAGs could not be obtained for every disease, so the DSS of a disease without a DAG could not be assessed. Consequently, DSS needs to be integrated with the Gaussian interaction profile kernel similarity to measure the similarity information of all diseases, in line with previous studies [59], [148], as below:

Similarly, integrated miRNA similarity was calculated in accordance with previous studies [59], [148] as below:

3.3.4 Weighted K-nearest known neighbors algorithm

The WKNKN algorithm, which was introduced in [14], [17], was used as a pre-processing step at the third stage to eliminate unknown values from the miRNA-disease association set by taking into account the information about diseases and their nearest neighbors. The information provided by known neighbors was taken into account because many of the non-interacting miRNA-disease pairs in $A_{DM}$ are unknown cases that could potentially be true associations. In particular, WKNKN replaces each entry $A_{DM}(i, j) = 0$ with a continuous interaction likelihood value between 0 and 1, as follows.

Firstly, in order to quantify and evaluate the likelihood profile of interaction for each disease 𝑑 𝑖 , the semantic similarities with K known diseases which are closest to

Secondly, in order to estimate the likelihood profile of interaction for each 𝑚 𝑗 , its functional similarities with K known miRNAs which are closest to 𝑚 𝑗 and their corresponding interaction profiles were chosen.

Finally, if $A_{DM}(i, j) = 0$, it was replaced by the average of the two interaction likelihood profiles. Algorithm 3.1 shows the pseudocode that illustrates the above steps in detail. In the pseudocode, r ≤ 1 is a decay term, and KNN() returns the K-nearest known neighbors in descending order of their similarities to
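The pre-processing idea can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not a transcription of Algorithm 3.1: a "known" neighbor is taken to be any other disease (or miRNA) with at least one known association, the i-th nearest neighbor is weighted by $r^{i-1}$ times its similarity, and the weighted sum is normalized by the sum of the neighbor similarities. The function name `wknkn`, its parameters and the toy matrices are hypothetical.

```python
import numpy as np

def wknkn(A, S_d, S_m, K=5, r=0.8):
    """Replace zeros of A (diseases x miRNAs) by neighbour-based interaction likelihoods."""
    A = np.asarray(A, dtype=float)

    def neighbor_profile(Y, S):
        # Y: one row per entity to complete; S: similarity matrix of those entities.
        out = np.zeros_like(Y)
        known = np.flatnonzero(Y.sum(axis=1) > 0)
        for q in range(Y.shape[0]):
            cand = known[known != q]
            if cand.size == 0:
                continue
            # K most similar known neighbours, in descending order of similarity.
            order = cand[np.argsort(-S[q, cand], kind="stable")][:K]
            sims = S[q, order]
            norm = sims.sum()
            if norm > 0:
                weights = (r ** np.arange(order.size)) * sims  # decayed similarities
                out[q] = weights @ Y[order] / norm
        return out

    Y_d = neighbor_profile(A, S_d)        # likelihoods from similar diseases
    Y_m = neighbor_profile(A.T, S_m).T    # likelihoods from similar miRNAs
    Y_avg = (Y_d + Y_m) / 2.0
    return np.where(A == 0, Y_avg, A)     # known associations stay equal to 1

# Toy data: 3 diseases x 2 miRNAs with made-up similarities.
A_DM = np.array([[1, 0], [0, 0], [0, 1]])
S_d = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
S_m = np.array([[1.0, 0.5], [0.5, 1.0]])
A_DM_new = wknkn(A_DM, S_d, S_m, K=2, r=0.8)
```

In the toy example, the second disease has no known miRNA; because it is most similar to the first disease, it receives a higher likelihood for the first miRNA than for the second.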

3.3.5 Constructing miRNA similarity-based and disease similarity-based heterogeneous networks

At the fourth stage, two heterogeneous networks, one miRNA similarity-based and one disease similarity-based, were constructed to perform the extended RWR process at the next stage. In most common RWR algorithms, the transition probabilities from a disease (miRNA) node to each related neighbor miRNA (disease) are distributed equally [70], [71], and the probabilities sum to 1. However, the degrees to which various miRNAs (diseases) are associated with a given disease (miRNA) are generally different [146], [149]. For example, among the miRNAs associated with a given disease $d_i$, some are highly similar to one another, whereas others share little or no similarity with the rest. Therefore, it is hypothesized that a disease (miRNA) has a stronger association with a miRNA (disease) when that miRNA (disease) is similar to a larger number of the other miRNAs (diseases) associated with the same disease (miRNA) [146]. On the basis of this assumption, the topological similarity was integrated with disease semantic similarity (DSS) or miRNA functional similarity (MFS) to assess the degree to which a disease (miRNA) is associated with a miRNA (disease) [146], [149]. The weights assigned to the edges in the miRNA-disease association network, which represent the degree of real association based on the integrated similarity for diseases and the integrated similarity for miRNAs, respectively, are defined as follows.

Firstly, a bipartite graph was constructed containing disease nodes and miRNA nodes.

Secondly, when walking from the disease network to the miRNA network, the probability of the target miRNA node $m_j$ (j = 1, 2, …, $n_m$) for a certain disease node $d_i$ (i = 1, 2, …, $n_d$) was chosen depending entirely on the similarities between $m_j$ and all neighboring $d_i$-related miRNA nodes, including $m_j$. Similarly, for a particular miRNA node $m_j$ (j = 1, 2, …, $n_m$), when walking from the miRNA network to the disease network, the probability of the target disease node $d_i$ (i = 1, 2, …, $n_d$) was selected depending entirely on the similarities between $d_i$ and all neighboring $m_j$-related disease nodes, including $d_i$ [146]. Figure 3.3 shows a simple illustration of the weight assignment process in the disease and miRNA spaces, respectively.

Figure 3.3 Illustration of the process of weight assignment in disease space and miRNA space

Finally, two new integrated adjacency matrices, $A_{DMdiseasebase}$ and $A_{DMmirnabase}$, were defined based on the ISD matrix for diseases, the ISM matrix for miRNAs and the $A_{DMnew}$ matrix, as in equations (3.10) and (3.11) below:

3.3.6 Employing improved random walk with restart to predict miRNA-disease associations

After the two heterogeneous networks, miRNA similarity-based and disease similarity-based, were obtained, an improved RWR algorithm was executed to predict miRNA-disease associations. The steps of the improved RWR process are shown in Figure 3.4 below.

Figure 3.4 The improved RWR process's steps to predict miRNA-disease associations

Firstly, two transition probability matrices were calculated: $T_{DM}$, the transition probability matrix from the disease network to the miRNA network, and $T_{MD}$, the transition probability matrix from the miRNA network to the disease network. These calculations used the two integrated adjacency matrices previously determined, as follows:

where $\varphi \in (0,1)$ is the random walker's jumping probability between these two networks [146].

Secondly, the transition probabilities from a disease node to all neighboring disease nodes in a disease-based network are represented by a disease transition probability matrix 𝑊 𝑑 where the element 𝑊 𝑑 (𝑖, 𝑗) represents the jumping probability from disease 𝑑 𝑖 to disease 𝑑 𝑗 as in equation below.

Similarly, the miRNA transition probability matrix in miRNA-based network

Thirdly, rather than employing the vector form of the initial probability as in common RWR algorithms [67]–[69], and motivated by Luo and Long's extended RWR [146], the initial probability matrix of the heterogeneous network is defined as in equation (3.16), where the diagonal matrices $P_{D0}$ and $P_{M0}$, with $P_{D0}(i, i) = 1/n_d$ and $P_{M0}(j, j) = 1/n_m$, serve as the normalized probabilities of disease seed nodes and miRNA seed nodes, respectively, while $\delta$ is the weight factor used to indicate the importance level or impact factor of the two sub-networks previously identified by

Experiments and results

The datasets used in the experiments in this study were described in section 3.2.

3.4.2 Implementing and Estimating time complexity of the proposed method

This proposed method was also implemented by using the Python programming language in PyCharm Community IDE along with the supported packages and libraries.

As in the previous chapter, the time complexity of the proposed method is estimated by considering only the most complex steps, because the proposed method contains many separate steps. The most complex steps are calculating the transition probability matrix from the disease network to the miRNA network, $T_{DM}$, and the transition probability matrix from the miRNA network to the disease network, $T_{MD}$, as in equations (3.12) and (3.13). Algorithms 3.2 and 3.3 show the pseudocode that describes the steps to define these two matrices. Each algorithm contains three nested for loops and other operations. Therefore, the proposed method's time complexity is $O(n_d \cdot n_m \cdot n_m)$ or $O(n_m \cdot n_d \cdot n_d)$, which is equivalent to $O(n^3)$. It can therefore be concluded that the proposed method's time complexity is polynomial.

Global LOOCV and 5-fold cross-validation experiments were used to evaluate the efficacy of the proposed method in identifying associations between miRNAs and diseases. The area under the ROC curve (AUC) [111] and the area under the precision-recall curve (AUPR) [112] were measured as described below.

Evaluating the AUC and AUPR under 5-fold cross-validation

The steps of the 5-fold cross-validation experiments were as follows:

- Firstly, the known miRNA-disease associations were considered as positive samples and the remaining unknown associations as negative samples.

- Secondly, all positive and negative samples in the known adjacency matrix $A_{DM}$ were randomly partitioned into 5 equal parts to perform 5-fold cross-validation.

- Thirdly, in each run, 4 parts of the positive and negative samples were used for training and the remaining part was used for testing. The elements equal to 1 in the testing part were set to 0.

- Fourthly, the Final_score was recalculated in each run.

- Finally, the AUC and AUPR values were computed.
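The partition-and-mask part of these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the pooling of test entries across folds, and the `toy_predict` scorer (a stand-in for recalculating the Final_score) are all hypothetical.

```python
import random


def make_folds(adjacency, n_folds=5, seed=0):
    """Randomly partition all (disease, miRNA) index pairs into n_folds parts."""
    n_rows, n_cols = len(adjacency), len(adjacency[0])
    pairs = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    random.Random(seed).shuffle(pairs)
    # Fold f takes every n_folds-th pair, so fold sizes differ by at most one.
    return [pairs[f::n_folds] for f in range(n_folds)]


def hide_test_positives(adjacency, test_pairs):
    """Copy the matrix and set every entry of the test fold to 0."""
    train = [row[:] for row in adjacency]
    for i, j in test_pairs:
        train[i][j] = 0
    return train


def cross_validate(adjacency, predict, n_folds=5, seed=0):
    """Return (true label, predicted score) for every test entry of every fold."""
    results = []
    for test_pairs in make_folds(adjacency, n_folds, seed):
        train = hide_test_positives(adjacency, test_pairs)
        scores = predict(train)  # hypothetical stand-in for the scoring step
        results.extend((adjacency[i][j], scores[i][j]) for i, j in test_pairs)
    return results


def toy_predict(train):
    """Hypothetical scorer: row degree plus column degree of the training matrix."""
    row_sums = [sum(r) for r in train]
    col_sums = [sum(c) for c in zip(*train)]
    return [[rs + cs for cs in col_sums] for rs in row_sums]


# Small made-up adjacency matrix (4 diseases x 3 miRNAs), for illustration only.
A_DM = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
labelled_scores = cross_validate(A_DM, toy_predict)
```

The resulting list of label and score pairs is what an AUC or AUPR routine would consume; whether the metrics are computed per fold or on pooled entries is a design choice that this sketch leaves open.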

To increase the reliability of the AUC and AUPR values, the 5-fold cross-validation experiments were repeated 25 times, and the AUC and AUPR values were then averaged to obtain the final results. The proposed method achieved a best averaged AUC value of 0.9855 and a best averaged AUPR value of 0.8642 over 25 repetitions of 5-fold cross-validation. These best averaged AUC and AUPR values are supported by statistical tests: a one-sample t-test at the 95% confidence level was performed on the AUC and AUPR values. Table 3.1 shows the results of the one-sample t-tests on AUC and AUPR. Figure 3.5 illustrates the ROC curves and AUC values (a) and the PR curves and AUPR values (b) in 5 runs of the 5-fold cross-validation experiments.
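For readers unfamiliar with the test, the one-sample t statistic can be computed as in the sketch below. The AUC values and the hypothesized mean `mu0` are made up for illustration and are not results of this study; the function name is also hypothetical.

```python
import math
import statistics


def one_sample_t(values, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(values)
    mean = statistics.mean(values)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    std_err = statistics.stdev(values) / math.sqrt(n)
    return (mean - mu0) / std_err, n - 1


# Hypothetical AUC values from repeated runs (illustration only).
auc_runs = [0.981, 0.984, 0.979, 0.986, 0.983]
t_stat, dof = one_sample_t(auc_runs, mu0=0.98)

# Two-sided critical value of Student's t at the 95% level for 4 degrees of freedom.
T_CRIT_DF4 = 2.776
reject_null = abs(t_stat) > T_CRIT_DF4
```

With 25 repetitions the degrees of freedom would be 24, and the corresponding critical value would have to be used instead of the one hard-coded here.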

Table 3.1 AUC and AUPR One-sample t-test

Figure 3.5 ROC curves and AUC values (a) and PR curves and AUPR values (b) in 5 runs of 5-fold cross-validation experiments

Evaluating AUC and AUPR under global LOOCV experiments

In addition to 5-fold cross-validation, leave-one-out cross-validation (LOOCV) is commonly used to evaluate the global prediction ability of a model [59], [110]. In the proposed method, global LOOCV experiments were performed by removing each known miRNA-disease association in turn as a testing sample and using all remaining associations as training samples. The final prediction matrix P was then recalculated in each run to evaluate prediction performance. Under global LOOCV, the proposed method reached an AUC value of 0.9882 and an AUPR value of 0.9066, as illustrated in Figure 3.6. These values are slightly higher than the AUC and AUPR values under 5-fold cross-validation because more known associations are removed in each run of 5-fold cross-validation than in each run of global LOOCV.
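The leave-one-out loop can be sketched as below. This is a simplified illustration, not the evaluation code of this study: it assumes that each left-out association is ranked against all unknown pairs and that the per-sample rank scores are averaged, and `predict` is a hypothetical stand-in for recalculating the prediction matrix P.

```python
def rank_auc(pos_score, neg_scores):
    """Fraction of negatives scored below the positive (ties count one half)."""
    below = sum(1 for s in neg_scores if s < pos_score)
    ties = sum(1 for s in neg_scores if s == pos_score)
    return (below + 0.5 * ties) / len(neg_scores)


def global_loocv_auc(adjacency, predict):
    """Hide each known association in turn and average its rank-based AUC."""
    n_rows, n_cols = len(adjacency), len(adjacency[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    known = [(i, j) for i, j in cells if adjacency[i][j] == 1]
    unknown = [(i, j) for i, j in cells if adjacency[i][j] == 0]
    per_sample = []
    for i, j in known:
        train = [row[:] for row in adjacency]
        train[i][j] = 0  # the single test sample of this run
        scores = predict(train)  # hypothetical stand-in for recomputing P
        negatives = [scores[a][b] for a, b in unknown]
        per_sample.append(rank_auc(scores[i][j], negatives))
    return sum(per_sample) / len(per_sample)


def toy_predict(train):
    """Hypothetical scorer: row degree plus column degree of the training matrix."""
    row_sums = [sum(r) for r in train]
    col_sums = [sum(c) for c in zip(*train)]
    return [[rs + cs for cs in col_sums] for rs in row_sums]


# Made-up adjacency matrix, for illustration only.
A_DM = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
auc_estimate = global_loocv_auc(A_DM, toy_predict)
```

Because only one association is hidden per run, the training matrix stays closer to the full data than in 5-fold cross-validation, which matches the explanation given above for the slightly higher LOOCV scores.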

Figure 3.6 ROC curve and AUC value (a) and PR curve and AUPR value (b) under global LOOCV experiment

The proposed method has 5 parameters that affect its performance. In other words, the best averaged AUC and AUPR values reported above were attained by jointly tuning multiple parameters over different values.

In the proposed method, the WKNKN algorithm was used as a pre-processing step to replace unknown values in the miRNA-disease association set with estimates based on their known neighbors, on the assumption that some of the unknown miRNA-disease associations in the matrix $A_{DM}$ are in fact true associations. In this algorithm, the parameter K is the number of nearest known neighbors and r is a decay term with r ≤ 1. Because the number of nearest known neighbors chiefly determines how much the data sparsity problem is reduced, we concentrate on the impact of K. The more nearest known neighbors are selected, the more associations between diseases and miRNAs are appended to the heterogeneous network, which decreases the impact of the data sparsity problem. However, when the number of added associations is too large, it can introduce bias. Therefore, the optimal values of the two parameters have to be identified before performing the improved random walk on the heterogeneous networks. In the experiments, the values of K and r were varied repeatedly to choose the optimal values, and the AUC and AUPR achieved their best values when K = 5 and r = 0.7. This is consistent with the result of the NPCMF method [16]. Table 3.2 reports how the evaluation indices change when K is fixed to 5 and r runs from 0.1 to 0.9, and when r is fixed to 0.7 and K runs from 1 to 9, evaluating prediction performance over all samples.
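To make the roles of K and r concrete, the sketch below follows the commonly described form of weighted K nearest known neighbors: each profile is estimated from its K most similar neighbors with weights that decay by r per rank, the disease-side and miRNA-side estimates are averaged, and the result never lowers an existing entry. It is an illustration under assumptions, not the code used in this study: the function names are hypothetical, and a neighbor is treated as "known" when it has at least one recorded association.

```python
def _row_estimate(A, S, K, r):
    """Estimate each row profile of A from its K most similar known neighbours."""
    n, n_cols = len(A), len(A[0])
    known = [q for q in range(n) if any(A[q])]  # rows with at least one association
    est = [[0.0] * n_cols for _ in range(n)]
    for p in range(n):
        nbrs = sorted((q for q in known if q != p),
                      key=lambda q: S[p][q], reverse=True)[:K]
        norm = sum(S[p][q] for q in nbrs)
        if norm == 0:
            continue  # no usable neighbour: leave the estimate at zero
        for rank, q in enumerate(nbrs):
            w = (r ** rank) * S[p][q]  # weight decays with neighbour rank
            for c in range(n_cols):
                est[p][c] += w * A[q][c] / norm
    return est


def _transpose(M):
    return [list(col) for col in zip(*M)]


def wknkn(A, SD, SM, K=5, r=0.7):
    """Return A with unknown entries replaced by neighbour-based estimates."""
    row_est = _row_estimate(A, SD, K, r)  # disease side
    col_est = _transpose(_row_estimate(_transpose(A), SM, K, r))  # miRNA side
    return [[max(A[i][j], 0.5 * (row_est[i][j] + col_est[i][j]))
             for j in range(len(A[0]))] for i in range(len(A))]


# Made-up 2 x 2 example: one known association, two similarity matrices.
A_DM = [[1, 0], [0, 0]]
SD = [[1.0, 0.5], [0.5, 1.0]]
SM = [[1.0, 0.4], [0.4, 1.0]]
A_filled = wknkn(A_DM, SD, SM, K=1, r=0.7)
```

A larger K pulls in more neighbors and therefore fills more entries, which is the trade-off between sparsity and bias discussed above.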

Table 3.2 Evaluation of index changes in WKNKN algorithm

Three parameters from improved random walk with restart

There are three parameters that can affect performance when performing the improved random walk with restart on heterogeneous networks. The parameter $\varphi \in (0,1)$ is the probability that the random walker jumps between the two networks. The parameter $\delta \in (0,1)$ is the weight factor representing the importance level or impact factor of the two sub-networks. The parameter $\gamma \in (0,1)$ is the restart probability. The influence of the three parameters was assessed by adjusting them over repeated experiments, and $\varphi = 0.9$, $\delta = 0.7$ and $\gamma = 0.7$ were selected as the optimal combination in the proposed method.
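The role of the restart probability can be illustrated with a generic random walk with restart in matrix form, iterating P ← (1 − γ)·W·P + γ·P0 until convergence. This is a sketch of the standard iteration only, not the improved algorithm of this chapter: the jumping probability and the weight factor would enter through the construction of the transition matrix W and the initial matrix P0, which is not reproduced here, and all names below are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def column_normalize(W):
    """Scale each column to sum to 1 (all-zero columns are left as zeros)."""
    sums = [sum(col) for col in zip(*W)]
    return [[w / s if s else 0.0 for w, s in zip(row, sums)] for row in W]


def rwr(W, P0, gamma=0.7, tol=1e-10, max_iter=1000):
    """Iterate P <- (1 - gamma) * W @ P + gamma * P0 until the update is below tol."""
    P = [row[:] for row in P0]
    for _ in range(max_iter):
        WP = matmul(W, P)
        nxt = [[(1 - gamma) * wp + gamma * p0 for wp, p0 in zip(r1, r2)]
               for r1, r2 in zip(WP, P0)]
        change = max(abs(a - b) for r1, r2 in zip(nxt, P) for a, b in zip(r1, r2))
        P = nxt
        if change < tol:
            break
    return P


# Made-up two-node network with a single seed column, for illustration only.
W = column_normalize([[0, 1], [1, 0]])
P0 = [[1.0], [0.0]]
P_steady = rwr(W, P0, gamma=0.5)
```

A larger γ keeps the walker closer to its seed nodes, while a smaller γ lets the scores diffuse further through the network.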

3.4.4 Performance comparison with other related models

To demonstrate that the proposed method outperforms other related approaches, its prediction performance was compared with that of the NTSHMDA [146], PMFMDA [59], IMCMDA [92] and MCLPMDA [93] approaches under best averaged 5-fold cross-validation. The NTSHMDA method uses an extended random walk with restart algorithm, which is improved upon in the proposed method. The PMFMDA, IMCMDA and MCLPMDA methods used the same miRNA-disease association dataset as the experiments of the proposed method. The performances of these methods in terms of AUC and AUPR are shown in Figure 3.7.

As illustrated in Figure 3.7, the proposed approach performs better than all of the related methods NTSHMDA, PMFMDA, IMCMDA and MCLPMDA. In terms of AUC, the proposed method is higher than NTSHMDA, PMFMDA, IMCMDA and MCLPMDA by 0.61%, 0.6%, 14.5% and 7.5%, respectively. It is also better than NTSHMDA, PMFMDA, IMCMDA and MCLPMDA in terms of AUPR, with values higher by 13.62%, 35.04%, 60.44% and 53.52%, respectively. This indicates that the proposed method outperforms these previous related methods. In particular, because the dataset is imbalanced, the significant improvement in AUPR suggests that the proposed method is more informative and achieves better performance than the previous related methods.

Figure 3.7 ROC curves and AUC values (a) and Precision-Recall curves and AUPR values (b) in comparison with other related approaches

Additionally, to assess the separate effects of using WKNKN and of integrating multiple similarities when performing the improved random walk with restart, the ROC and precision-recall curves were also drawn for two cases:

(1) Using WKNKN as a pre-processing step but not using integrated similarities.

(2) Using integrated similarities but not using WKNKN as a pre-processing step.

Figure 3.8 ROC curves and AUC values (a) and Precision-Recall curves and AUPR values (b) in different cases of RWRMMDAs

As illustrated in Figure 3.8 (a), the AUC value of the proposed method appears to be about the average of the AUC values of cases (1) and (2) above. As can be seen in Figure 3.8 (b), the proposed method achieves the highest AUPR value in comparison with these cases. This means that both using the WKNKN algorithm as a pre-processing step and using integrated similarities can increase the AUPR value, while using the WKNKN algorithm as a pre-processing step can reduce the impact of the data sparsity problem when evaluating the AUC value.

Besides that, to compare performance with some of the latest computational methods for miRNA-disease association prediction, we use the results reported by the authors of the GATMDA [150], HGCNELMDA [151] and PATMDA [152] methods under 5-fold cross-validation. All of these methods use the same miRNA-disease association dataset as the proposed method. Table 3.3 compares the AUC and AUPR values of the proposed method with those of the above-mentioned methods.

Table 3.3 Comparison of AUC and AUPR values of RWRMMDA and other recent methods

As can be seen in Table 3.3, the proposed method achieves performance comparable to that of the other recent related methods.

In addition to the 5-fold cross-validation and LOOCV experiments, case studies were also carried out with the proposed approach in the following steps:

- Run the method on all known miRNA-disease association samples to obtain predicted scores.

- For a given disease, sort the scores of the candidate associated miRNAs in descending order to obtain the predicted associations.

- For the top-ranked miRNAs of each disease, manually check whether the predicted miRNA-disease associations have already been verified and published in the biological literature or in other databases.
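The ranking step of a case study can be sketched as follows. The function name, the miRNA labels and the scores are hypothetical and serve only to show the sorting and the flag that separates already known associations from new candidates that still need manual verification.

```python
def top_candidates(scores, known, mirna_names, disease_idx, k=40):
    """Rank miRNAs for one disease by predicted score, highest first.

    Returns (miRNA name, score, already_known) tuples for the top k miRNAs.
    """
    row = scores[disease_idx]
    ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
    return [(mirna_names[j], row[j], bool(known[disease_idx][j])) for j in ranked]


# Made-up scores for one disease and three placeholder miRNAs.
scores = [[0.2, 0.9, 0.5]]
known = [[0, 1, 0]]
names = ["miRNA_a", "miRNA_b", "miRNA_c"]
top2 = top_candidates(scores, known, names, disease_idx=0, k=2)
new_candidates = [name for name, _, is_known in top2 if not is_known]
```

Only the entries flagged as not yet known would then be looked up in the literature or in external databases.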

In more detail, case studies on Breast Neoplasms, Carcinoma Hepatocellular and Stomach Neoplasms were carried out to show the ability of the proposed method to predict miRNA-disease associations.

Breast Neoplasms, also known as breast cancer, is the most common type of cancer that kills women worldwide. It has been reported that microRNAs (miRNAs) play crucial roles in breast cancer [153], [154]. For instance, members of the miR-34 family control breast cancer cell proliferation, apoptosis, invasion, and metastasis [155]. Through down-regulation of Bcl-2 and SIRT1, miR-34a stops breast cancer cells from proliferating and migrating [156]. Breast Neoplasms was chosen as a case study to demonstrate the ability of the proposed method to infer miRNA-disease associations. As shown in Table 3.4, there is one new miRNA-disease association among the top 40 predicted Breast Neoplasms-associated miRNAs. It has been confirmed in the dbDEMC V2.0 database.

Table 3.4 Top 40 predicted Breast Neoplasms-associated miRNAs

Chapter summary and discussion

In this chapter, a new method entitled “Predicting miRNA–disease associations using improved random walk with restart and integrating multiple similarities” is presented. In this method, the author has contributed the following new points.

First, by integrating multiple similarity networks to assign distinct walk probabilities to each related neighbor node of the disease or miRNA node in accordance with its degree in various spaces, the author constructed two heterogeneous networks in disease and miRNA spaces, respectively.

Second, the author used a WKNKN algorithm as a pre-processing step to solve the problem of sparsity and incompleteness to decrease the negative effects of a large number of missing associations.

Finally, and most importantly, the author employed the improved random walk with restart algorithm on the miRNA similarity-based and disease similarity-based heterogeneous networks simultaneously. This differs from the common random walk with restart algorithms used to forecast miRNA-disease associations.

The results of the simulated experiments for Lung neoplasms and Ovarian neoplasms, in the section on predicting new disease-related miRNAs, indicate that the proposed method can infer new disease-related miRNAs and achieve reliable prediction performance. However, bias in prediction can be caused by subjectively selecting a new disease for simulated experiments and eliminating all of its known associations. Hence, further research, or the incorporation of more biological information, is needed to improve the reliability of prediction for new diseases or new miRNAs.

Traditionally, linkage studies, genome-wide association studies, RNA interference screens, and wet-lab experiments were used to identify associations between ncRNAs and diseases, including miRNA-disease and lncRNA-disease associations. However, identifying associations between ncRNAs and diseases in this way takes a long time and is costly. To save time and money by providing reasonable candidate disease-related ncRNAs, it is urgent to develop meaningful and valuable computational methods for predicting ncRNA-disease associations.

This dissertation has presented contributions to the development of computational methods for predicting ncRNA-disease associations based on link prediction in heterogeneous networks. Concretely, the dissertation has two main contributions. First, the dissertation proposed a new computational model for predicting ncRNA-disease associations. It addresses the data sparsity problem with a collaborative filtering algorithm and combines this with a resource allocation process on a tripartite graph, built from multiple types of known associations between multiple objects, to predict ncRNA-disease associations. The new computational model was then applied to inferring miRNA-disease associations and to predicting lncRNA-disease associations. In the application to miRNA-disease associations, the experimental results show that the new method outperforms several related methods, with AUC and AUPR values of 0.9788 and 0.9373, respectively. Case studies of Prostatic Neoplasms, Heart Failure, and Glioma have demonstrated its ability to infer potential associations between miRNAs and diseases. Additionally, it can find new associations for new diseases (or miRNAs) with no known association in the examined dataset, as demonstrated in the case study of Open-angle glaucoma. In the application to lncRNA-disease associations, the new method obtains better prediction performance by taking into account one more type of association and improving the prediction process. This was demonstrated by best AUC and AUPR values that both reach 0.983, as mentioned earlier. Therefore, the proposed model could be a useful tool for predicting associations between ncRNAs and diseases.

Second, the dissertation has developed a new computational method for predicting miRNA-disease associations. It uses the WKNKN algorithm as a pre-processing step to reduce the number of unknown values in the miRNA-disease association set. It integrates multiple similarities from different sources to build two different heterogeneous information networks based on the miRNA and disease spaces.

As a result, different walk probabilities were assigned to each disease-related or miRNA-related neighbor node based on its degree in the various spaces. Following that, an extended random walk with restart algorithm on the miRNA similarity-based and disease similarity-based heterogeneous networks was employed to calculate miRNA-disease association prediction probabilities. The proposed method could be considered a useful tool for forecasting miRNA-disease associations. This is supported by the global LOOCV AUC and AUPR values of 0.9882 and 0.9066, respectively, and the 5-fold cross-validation AUC and AUPR values of 0.9855 and 0.8642, respectively.

Although the computational methods proposed in this dissertation have contributed substantially to revealing disease-related lncRNAs and miRNAs, there is still room for improvement in the future. Firstly, the proposed methods for predicting ncRNA-disease associations still rely on an unweighted tripartite graph. Future research can therefore improve them by weighting the known associations or interactions among the biological objects in the tripartite graph. Secondly, future research can enhance the resource allocation process on the tripartite graph to increase prediction performance. Thirdly, subjectively choosing a new disease for simulated experiments and removing all of its known associations can cause bias in prediction. Further research, or the integration of more biological information, is therefore required to increase the reliability of prediction for new diseases, new miRNAs or new lncRNAs. Fourthly, future research can extend the application of machine learning and deep learning methods for more comprehensive predictive analysis. Fifthly, future computational methods for predicting non-coding RNA-disease associations need to integrate different biological datasets to construct more reasonable similarities. Finally, the above computational methods can also be applied to other research areas such as microbe-disease association prediction, metabolite-disease association prediction, drug-disease association prediction, drug-target prediction and so on. Conversely, future research on non-coding RNA-disease association prediction can borrow computational methods from other fields and customize them to obtain better performance.

[VTN1] Van Tinh Nguyen, Thi Tu Kien Le and Dang Hung Tran, "A new method on lncRNA-disease-miRNA tripartite graph to predict lncRNA-disease associations", 2020 12th International Conference on Knowledge and Systems Engineering (KSE), 2020, pp 287-293, doi: 10.1109/KSE50997.2020.9287563

[VTN2] Van Tinh Nguyen, Thi Tu Kien Le, Tran Quoc Vinh Nguyen and Dang Hung Tran, “Inferring miRNA-disease associations using collaborative filtering and resource allocation on a tripartite graph”, BMC Med Genomics 14, 225 (2021). https://doi.org/10.1186/s12920-021-01078-8 (ISI Q2 journal).

[VTN3] Van Tinh Nguyen and Dang Hung Tran, "An improved computational method for prediction of lncRNA-disease associations based on collaborative filtering and resource allocation", 2021 13th International Conference on Knowledge and Systems Engineering (KSE), 2021, pp 1-6, doi:

[VTN4] Van Tinh Nguyen, Thi Tu Kien Le, Khoat Than and Dang Hung Tran, “Predicting miRNA–disease associations using improved random walk with restart and integrating multiple similarities”, Sci Rep 11, 21071 (2021). https://doi.org/10.1038/s41598-021-00677-w (ISI Q1 journal).

1 Han J (2009), "Mining Heterogeneous Information Networks by Exploring the Power of Links", Discov Sci DS 2009 Lect Notes Comput Sci vol 5808., doi: 10.1007/978-3-642-04414-4_3.

2 Chuan Shi and Yu P S (2015), Heterogeneous Information Network Analysis and Applications 2015 doi: 10.1007/978-3-319-56212-4.

3 Ding P et al (2019), "Heterogeneous information network and its application to human health and disease", Brief Bioinform., vol 00, no 00, pp 1–20, doi: 10.1093/bib/bbz091.

4 Liben-Nowell D and Kleinberg J (2007), "The link-prediction problem for social networks", J Am Soc Inf Sci Technol., vol 58, no 7, pp 1019–1031, doi: 10.1002/asi.20591.

5 Dong Y et al (2012), "Link prediction and recommendation across heterogeneous social networks", Proc - IEEE Int Conf Data Mining, ICDM, pp 181–190, doi: 10.1109/ICDM.2012.140.

6 Abbas K et al (2021), "Application of network link prediction in drug discovery", BMC Bioinformatics, vol 22, no 1, pp 1–21, doi:

7 Sulaimany S., Khansari M., and Masoudi-Nejad A (2018), "Link prediction potentials for biological networks", Int J Data Min Bioinform., vol 20, no.

8 Yang Y et al (2012), "Link prediction in heterogeneous networks: Influence and time matters", Proc 2012 IEEE …, [Online] Available: http://web.engr.illinois.edu/~hanj/pdf/icdm12_yyang.pdf

9 Chen X et al (2019), "MicroRNAs and complex diseases: From experimental results to computational models", Brief Bioinform., vol 20, no.

10 Chen X et al (2017), "Long non-coding RNAs and complex diseases: From experimental results to computational models", Brief Bioinform., vol 18, no.

11 Lei X et al (2020), "A comprehensive survey on computational methods of non-coding RNA and disease association prediction", Brief Bioinform., vol 00, no August, pp 1–31, doi: 10.1093/bib/bbaa350.

12 Yan C et al (2020), "Computational Methods and Applications for Identifying Disease-Associated lncRNAs as Potential Biomarkers and Therapeutic Targets", Mol Ther - Nucleic Acids, vol 21, pp 156–171, doi: 10.1016/j.omtn.2020.05.018.

13 Yu J et al (2019), "A novel collaborative filtering model for LncRNA-disease association prediction based on the Naïve Bayesian classifier", BMC Bioinformatics, vol 20, no 1, pp 1–13, doi: 10.1186/s12859-019-2985-0.

14 Ezzat A et al (2017), "Drug-target interaction prediction with graph regularized matrix factorization", IEEE/ACM Trans Comput Biol. Bioinforma., vol 14, no 3, pp 646–656, doi: 10.1109/TCBB.2016.2530062.

15 Wu T.-R et al (2020), "MCCMF: Collaborative matrix factorization based on matrix completion for predicting miRNA-disease associations", BMC Bioinformatics, vol 21, p 454, doi: 10.21203/rs.3.rs-36602/v1.

16 Gao Y L et al (2019), "NPCMF: Nearest Profile-based Collaborative Matrix Factorization method for predicting miRNA-disease associations", BMC

17 Li G et al (2018), "Predicting microRNA-disease associations using label propagation based on linear neighborhood similarity", J Biomed Inform., vol 82, no February, pp 169–177, doi: 10.1016/j.jbi.2018.05.005.

18 Ding L et al (2018), "TPGLDA: Novel prediction of associations between lncRNAs and diseases via lncRNA-disease-gene tripartite graph", Sci Rep., vol 8, no 1, pp 1–11, doi: 10.1038/s41598-018-19357-3.

19 Sumathipala M et al (2019), "Network Diffusion Approach to Predict LncRNA Disease Associations Using Multi-Type Biological Networks: LION", Front Physiol., vol 10, no July, pp 1–11, doi:

20 Sumathipala M and Weiss S T (2020), "Predicting miRNA-based disease-disease relationships through network diffusion on multi-omics biological data", Sci Rep., vol 10, no 1, pp 1–12, doi: 10.1038/s41598-020-65633-6.

21 Cao B., Kong X., and Yu P S (2014), "Collective Prediction of Multiple Types of Links in Heterogeneous Information Networks", Proc - IEEE Int. Conf Data Mining, ICDM, vol 2015-Janua, no January, pp 50–59, doi:

22 Muzio G., O’Bray L., and Borgwardt K (2021), "Biological network analysis with deep learning", Brief Bioinform., vol 22, no 2, pp 1515–1530, doi: 10.1093/bib/bbaa257.

23 Beermann J et al (2016), "Non-coding rnas in development and disease: Background, mechanisms, and therapeutic approaches", Physiol Rev., vol.

24 Tsuyuzaki K and Nikaido I (2017), "Biological systems as heterogeneous information networks: A Mini-review and Perspectives", arXiv, pp 1–8.

25 Sun Y and Han J (2013), "Mining Heterogeneous Information Networks: A Structural Analysis Approach ∗", ACM SIGKDD Explor., vol 14, no 2, pp. 20–28, doi: 10.1145/2481244.2481248.

26 Baltoumas F A et al (2021), "Biomolecule and bioentity interaction databases in systems biology: A comprehensive review", Biomolecules, vol.

27 Ding Y et al (2021), "Machine learning approaches for predicting biomolecule-disease associations", Brief Funct Genomics, vol 20, no 4, pp. 273–287, doi: 10.1093/bfgp/elab002.

28 Ambros V (2004), "The functions of animal microRNAs", Nature, vol 431, no 7006, pp 350–355, doi: 10.1038/nature02871.

29 Ardekani A M and Naeini M M (2010), "The role of microRNAs in human diseases", Avicenna J Med Biotechnol., vol 2, no 4, pp 161–179.

30 Li X et al (2018), Non-coding RNAs in Complex Diseases Springer Singapore, 2018 doi: 10.1007/978-981-13-0719-5.

31 Chen K J et al (2017), "On link formation in heterogeneous information networks: A view based on multi-Label learning", Proc 2017 IEEE/ACM Int. Conf Adv Soc Networks Anal Mining, ASONAM 2017, no July, pp 50–53, doi: 10.1145/3110025.3110076.

32 Kumar A et al (2020), "Link prediction techniques, applications, and performance: A survey", Phys A Stat Mech its Appl., vol 553, p 124289, doi: 10.1016/j.physa.2020.124289.

33 Wang X W., Chen Y., and Liu Y Y (2020), "Link Prediction through Deep Generative Model", iScience, vol 23, no 10, p 101626, doi: 10.1016/j.isci.2020.101626.

34 Almansoori W et al (2012), "Link prediction and classification in social networks and its application in healthcare and systems biology", Netw Model. Anal Heal Informatics Bioinforma., vol 1, no 1–2, pp 27–36, doi:

35 Chen G et al (2022), "Link prediction by deep non-negative matrix factorization", Expert Syst Appl., vol 188, no February 2021, p 115991, doi: 10.1016/j.eswa.2021.115991.

36 Wu Z et al (2016), "Link prediction with node clustering coefficient", Phys. A Stat Mech its Appl., vol 452, no xxxx, pp 1–8, doi:

37 Pech R et al (2019), "Link prediction via linear optimization", Phys A Stat. Mech its Appl., vol 528, pp 1–29, doi: 10.1016/j.physa.2019.121319.

38 Rafiee S., Salavati C., and Abdollahpouri A (2020), "CNDP: Link prediction based on common neighbors degree penalization", Phys A Stat Mech its

39 Gao S., Denoyer L., and Gallinari P (2011), "Temporal link prediction by integrating content and structure information", Int Conf Inf Knowl Manag.

40 Guimerà R and Sales-Pardo M (2009), "Missing and spurious interactions and the reconstruction of complex networks", Proc Natl Acad Sci U S A., vol 106, no 52, pp 22073–22078, doi: 10.1073/pnas.0908366106.

41 Clauset A., Moore C., and Newman M E J (2008), "Hierarchical structure and the prediction of missing links in networks", Nature, vol 453, no 7191, pp 98–101, doi: 10.1038/nature06830.

42 Wang D., Cui P., and Zhu W (2016), "Structural deep network embedding", Proc ACM SIGKDD Int Conf Knowl Discov Data Min., vol 13-17-Augu, pp 1225–1234, doi: 10.1145/2939672.2939753.

43 Zhang Z et al (2018), "Arbitrary-order proximity preserved network embedding", Proc ACM SIGKDD Int Conf Knowl Discov Data Min., pp. 2778–2786, doi: 10.1145/3219819.3219969.

44 Ma X., Sun P., and Qin G (2017), "Nonnegative matrix factorization algorithms for link prediction in temporal networks using graph communicability", Pattern Recognit., vol 71, pp 361–374, doi: 10.1016/j.patcog.2017.06.025.

45 Chen G et al (2019), "Graph regularization weighted nonnegative matrix factorization for link prediction in weighted complex network", Neurocomputing, vol 369, no xxxx, pp 50–60, doi: 10.1016/j.neucom.2019.08.068.

46 Zhang M and Zhou Z (2018), "Deep Autoencoder-like Nonnegative Matrix Factorization for community detection", Proc 27th ACM Int Conf Inf. Knowl Manag., pp 1393–1402, doi: 10.1016/j.asoc.2020.106846.

47 Raeini M G (2020), "Link Prediction Using Supervised Machine Learning based on Aggregated and Topological Features", [Online] Available: http://arxiv.org/abs/2006.16327

48 Gu W et al (2019), "Link Prediction via Deep Learning", arXiv:1910.04807v1, pp 1–11, [Online] Available: http://arxiv.org/abs/1910.04807

49 Dadu A et al (2018), "A Study of Link Prediction Using Deep Learning", in

50 Wang X et al (2021), "Link prediction in heterogeneous information networks: An improved deep graph convolution approach", Decis Support Syst., vol 141, no September 2020, p 113448, doi: 10.1016/j.dss.2020.113448.

51 Shang D et al (2014), "Prioritizing candidate disease metabolites based on global functional relationships between metabolites in the context of metabolic pathways", PLoS One, vol 9, no 8, pp 1–11, doi: 10.1371/journal.pone.0104934.

52 Pinu F R et al (2019), "Systems biology and multi-omics integration: Viewpoints from the metabolomics research community", Metabolites, vol 9, no 4, pp 1–31, doi: 10.3390/metabo9040076.

53 Hao M., Bryant S H., and Wang Y (2017), "Predicting drug-target interactions by dual-network integrated logistic matrix factorization", Sci. Rep., vol 7, no June 2016, pp 1–11, doi: 10.1038/srep40376.

54 Berenstein A J et al (2016), "A Multilayer Network Approach for Guiding Drug Repositioning in Neglected Diseases", PLoS Negl Trop Dis., vol 10, no 1, pp 1–33, doi: 10.1371/journal.pntd.0004300.

55 Qu J et al (2018), "Inferring potential small molecule–miRNA association based on triple layer heterogeneous network", J Cheminform., vol 10, no 1, pp 1–14, doi: 10.1186/s13321-018-0284-9.

56 Luo H et al (2018), "Computational drug repositioning using low-rank matrix approximation and randomized algorithms", Bioinformatics, vol 34, no 11, pp 1904–1912, doi: 10.1093/bioinformatics/bty013.

57 Martínez V et al (2015), "DrugNet: Network-based drug-disease prioritization by integrating heterogeneous data", Artif Intell Med., vol 63, no 1, pp 41–

58 Xuan Z et al (2019), "A probabilistic matrix factorization method for identifying lncRNA-disease associations", Genes (Basel)., vol 10, no 2, doi: 10.3390/genes10020126.

59 Xu J et al (2019), "Identifying Potential miRNAs–Disease Associations With Probability Matrix Factorization", Front Genet., vol 10, no December, p 1234, doi: 10.3389/fgene.2019.01234.

60 Jiang Q et al (2010), "Prioritization of disease microRNAs through a human phenome-microRNAome network", BMC Syst Biol., vol 4, no SUPPL 1, p. S2, doi: 10.1186/1752-0509-4-S1-S2.

61 Gu C et al (2016), "Network Consistency Projection for Human miRNA-Disease Associations Inference", Sci Rep., vol 6, no October, p 36054, doi: 10.1038/srep36054.

62 Chen X et al (2018), "BNPMDA: Bipartite network projection for MiRNA–Disease association prediction", Bioinformatics, vol 34, no 18, pp 3178–

63 Liang C., Yu S., and Luo J (2019), "Adaptive multi-view multi-label learning for identifying disease-associated candidate miRNAs", PLoS Comput Biol., vol 15, no 4, p e1006931, doi: 10.1371/journal.pcbi.1006931.

64 Yu G et al (2017), "BRWLDA: Bi-random walks for predicting lncRNA-disease associations", Oncotarget, vol 8, no 36, pp 60429–60446, doi: 10.18632/oncotarget.19588.

65 Gu C et al (2017), "Global network random walk for predicting potential human lncRNA-disease associations", Sci Rep., vol 7, no 1, pp 1–11, doi: 10.1038/s41598-017-12763-z.

66 Xiao X et al (2018), "BPLLDA: Predicting lncRNA-disease associations based on simple paths with limited lengths in a heterogeneous network", Front Genet., vol 9, no OCT, p 411, doi: 10.3389/fgene.2018.00411.

67 Chen X., Liu M X., and Yan G Y (2012), "RWRMDA: Predicting novel human microRNA-disease associations", Mol Biosyst., vol 8, no 10, pp. 2792–2798, doi: 10.1039/c2mb25180a.

68 Xuan P et al (2015), "Prediction of potential disease-associated microRNAs based on random walk", Bioinformatics, vol 31, no 11, pp 1805–1815, doi: 10.1093/bioinformatics/btv039.

69 Sun D et al (2016), "NTSMDA: prediction of miRNA–disease associations by integrating network topological similarity", Mol Biosyst., vol 12, no 7, pp 2224–2232, doi: 10.1039/C6MB00049E.

70 Le D et al (2017), "Random walks on mutual microRNA-target gene interaction network improve the prediction of disease-associated microRNAs", BMC Bioinformatics, vol 18, p 479, doi: 10.1186/s12859-017-1924-1.

71 Niu Y W et al (2019), "Integrating random walk and binary regression to identify novel miRNA-disease association", BMC Bioinformatics, vol 20, no.

72 Li A et al (2021), "A novel miRNA-disease association prediction model using dual random walk with restart and space projection federated method", PLoS One, vol 16, no 6 June 2021, p e0252971, doi:

73 Fan X et al (2019), "Prediction of lncRNA-disease associations by integrating diverse heterogeneous information sources with RWR algorithm and positive pointwise mutual information", BMC Bioinformatics, pp 1–12.

74 Wang L E I (2019), "IIRWR : Internal Inclined Random Walk With Restart for LncRNA-Disease Association Prediction", IEEE Access, vol 7, pp.

75 Liu Y et al (2019), "A Novel Network-Based Computational Model for Prediction of Potential LncRNA – Disease Association", doi: 10.3390/ijms20071549.

76 Li G et al (2019), "Prediction of LncRNA-Disease Associations Based on Network Consistency Projection", IEEE Access, vol 7, pp 58849–58856, doi:

77 Li Z S., Liu B., and Yan C (2019), "CFMDA: collaborative filtering-based MiRNA-disease association prediction", Multimed Tools Appl., vol 78, no.

78 Liu J X et al (2021), "DSCMF: prediction of LncRNA-disease associations based on dual sparse collaborative matrix factorization", BMC Bioinformatics, vol 22, pp 1–18, doi: 10.1186/s12859-020-03868-w.

79 Fu G et al (2018), "Matrix factorization-based data fusion for the prediction of lncRNA-disease associations", Bioinformatics, vol 34, no 9, pp 1529–

80 Zeng M et al (2020), "SDLDA: lncRNA-disease association prediction based on singular value decomposition and deep learning", Methods, vol 179, no May, pp 73–80, doi: 10.1016/j.ymeth.2020.05.002.

81 Shen Z et al (2017), "MiRNA-disease association prediction with collaborative matrix factorization", Complexity, vol 2017, doi: 10.1155/2017/2498957.

82 Lan W et al (2017), "LDAP: a web server for lncRNA-disease association prediction", Bioinformatics, vol 33, pp 458–460, doi: 10.1093/bioinformatics/btw639.

83 Chen X., Wu Q F., and Yan G Y (2017), "RKNNMDA: Ranking-based KNN for MiRNA-Disease Association prediction", RNA Biol., vol 14, no 7, pp 952–962, doi: 10.1080/15476286.2017.1312226.

84 Chen X., Zhu C C., and Yin J (2019), "Ensemble of decision tree reveals potential miRNA-disease associations", PLoS Comput Biol., vol 15, no 7, p. e1007209, doi: 10.1371/journal.pcbi.1007209.

85 Chen X and Yan G Y (2013), "Novel human lncRNA-disease association inference based on lncRNA expression profiles", Bioinformatics, vol 29, no.

86 Chen X and Yan G Y (2014), "Semi-supervised learning for potential human microRNA-disease associations inference", Sci Rep., vol 4, p.

87 Chen X and Huang L (2017), "LRSSLMDA: Laplacian Regularized Sparse Subspace Learning for MiRNA-Disease Association prediction", PLoS Comput Biol., vol 13, no 12, p e1005912, doi:

88 Wang B (2018), "Multiple Linear Regression Analysis of lncRNA – Disease", vol 2018.

89 Yao D et al (2020), "A random forest based computational model for predicting novel lncRNA-disease associations", BMC Bioinformatics, vol 21, no 1, pp 1–18, doi: 10.1186/s12859-020-3458-1.

90 Fraidouni N and Zaruba G (2019), "A Matrix Completion Approach for Predicting lncRNA-disease association", Int’l Conf Bioinforma Comput.

91 Li W et al (2019), "Inferring Latent Disease-lncRNA Associations by Faster Matrix Completion on a Heterogeneous Network", Front Genet., vol 10, no. September, pp 1–15, doi: 10.3389/fgene.2019.00769.

92 Chen X et al (2018), "Predicting miRNA-disease association based on inductive matrix completion", Bioinformatics, vol 34, no 24, pp 4256–4265, doi: 10.1093/bioinformatics/bty503.

93 Yu S P et al (2019), "MCLPMDA: A novel method for miRNA-disease association prediction based on matrix completion and label propagation", J Cell Mol Med., vol 23, no 2, pp 1427–1438, doi: 10.1111/jcmm.14048.

94 Xuan P et al (2019), "CNNDLP: A method based on convolutional autoencoder and convolutional neural network with adjacent edge attention for predicting lncrna–disease associations", Int J Mol Sci., vol 20, no 17, doi: 10.3390/ijms20174260.

95 Xuan P et al (2019), "Dual Convolutional Neural Networks With Attention Mechanisms Based Method for Predicting Disease-Related lncRNA Genes", Front Genet., vol 10, no May, pp 1–11, doi: 10.3389/fgene.2019.00416.

96 Peng J et al (2019), "A learning-based framework for miRNA-disease association identification using neural networks", Bioinformatics, vol 35, no.

97 Tian Q., Zhou S., and Wu Q (2022), "A miRNA-Disease Association Identification Method Based on Reliable Negative Sample Selection and Improved Single-Hidden Layer Feedforward Neural Network", Inf., vol 13, no 3, doi: 10.3390/info13030108.

98 Madhavan M and Gopakumar G (2020), "DBNLDA: Deep Belief Network based representation learning for lncRNA-disease association prediction", arXiv Prepr arXiv, no 2006:12534., doi: 10.1007/s10489-021-02675-x.

99 Ding Y et al (2020), "Deep belief network–Based Matrix Factorization Model for MicroRNA-Disease Associations Prediction", Evol Bioinforma., vol 16, doi: 10.1177/1176934320919707.

100 Chen X et al (2021), "Deep-belief network for predicting potential miRNA-disease associations", Brief Bioinform., vol 22, no 3, pp 1–10, doi: 10.1093/bib/bbaa186.

101 Li J et al (2020), "Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction", Bioinformatics, vol 36, no 8, pp 2538–2546, doi:

102 Fu H et al (2022), "MVGCN: data integration through multi-view graph convolutional network for predicting links in biomedical bipartite networks", Bioinformatics, vol 38, no 2, pp 426–434, doi: 10.1093/bioinformatics/btab651.

103 Shi Z et al (2021), "A representation learning model based on variational inference and graph autoencoder for predicting lncRNA-disease associations", BMC Bioinformatics, vol 22, no 1, pp 1–20, doi: 10.1186/s12859-021-

104 Zeng M et al (2020), "DMFLDA: A deep learning framework for predicting lncRNA–disease associations", IEEE/ACM Trans Comput Biol Bioinforma., p 1, doi: 10.1109/tcbb.2020.2983958.

105 Ji C et al (2021), "AEMDA: inferring miRNA-disease associations based on deep autoencoder", Bioinformatics, vol 37, no 1, pp 66–72, doi: 10.1093/bioinformatics/btaa670.

106 Mørk S et al (2014), "Protein-driven inference of miRNA-disease associations", Bioinformatics, vol 30, no 3, pp 392–397, doi: 10.1093/bioinformatics/btt677.

107 Shi H et al (2013), "Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes", BMC Syst Biol., vol 7, doi: 10.1186/1752-0509-7-101.

108 Guo Z H et al (2019), "A Learning-Based Method for LncRNA-Disease Association Identification Combing Similarity Information and Rotation Forest", iScience, vol 19, pp 786–795, doi: 10.1016/j.isci.2019.08.030.

109 Guo Z H et al (2019), "Combining High Speed ELM with a CNN Feature Encoding to Predict LncRNA-Disease Associations", vol 11644 LNCS.

110 Berrar D (2019), "Cross-validation", Encycl Bioinforma Comput Biol. Acad Press., vol 1, pp 542–545, doi: 10.1016/B978-0-12-809633-8.20349-

111 Hajian-Tilaki K (2013), "Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation", Casp J Intern Med., vol 4, no 2, pp 627–635.

112 Saito T and Rehmsmeier M (2015), "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets", PLoS One, vol 10, no 3, p e0118432, doi: 10.1371/journal.pone.0118432.

113 Sarwar B., Karypis G., and Konstan J (2001), "Item-based collaborative filtering recommendation algorithms", WWW ’01 Proc 10th Int Conf World

114 Alaimo S., Giugno R., and Pulvirenti A (2014), "ncPred: ncRNA-disease association prediction through tripartite network-based inference", Front Bioeng Biotechnol., vol 2, no DEC, doi: 10.3389/fbioe.2014.00071.

115 Zhang Z K., Zhou T., and Zhang Y C (2010), "Personalized recommendation via integrated diffusion on user-item-tag tripartite graphs", Phys A Stat Mech its Appl., vol 389, no 1, pp 179–186, doi:

116 Liu N N., He L., and Zhao M (2013), "Social temporal collaborative ranking for context aware movie recommendation", ACM Trans Intell Syst Technol., vol 4, no 1, doi: 10.1145/2414425.2414440.

117 Zhao H et al (2018), "Prediction of microRNA-disease associations based on distance correlation set", BMC Bioinformatics, vol 19, no 1, pp 1–14, doi: 10.1186/s12859-018-2146-x.

118 Li J et al (2014), "starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein–RNA interaction networks from large-scale CLIP-Seq data", Nucleic Acids Res., vol 42, no December 2013, pp 92–97, doi: 10.1093/nar/gkt1248.

119 Cui T et al (2018), "MNDR v2.0: An updated resource of ncRNA-disease associations in mammals", Nucleic Acids Res., vol 46, no D1, pp D371–D374, doi: 10.1093/nar/gkx1025.

120 Bao Z et al (2019), "LncRNADisease 2.0: An updated database of long noncoding RNA-associated diseases", Nucleic Acids Res., vol 47, no D1, pp. D1034–D1037, doi: 10.1093/nar/gky905.

121 Li Y et al (2014), "HMDD v2.0: a database for experimentally supported human microRNA and disease associations", Nucleic Acids Res., vol 42, no. Database, pp 1070–1074, doi: 10.1093/nar/gkt1023.

122 McGuire S (2016), "World Cancer Report 2014 Geneva, Switzerland: World Health Organization, International Agency for Research on Cancer, WHO Press, 2015", Adv Nutr., vol 7, no 2, pp 418–419, doi: 10.3945/an.116.012211.

123 Fu Q et al (2019), "SOX30, a target gene of miR-653-5p, represses the proliferation and invasion of prostate cancer cells through inhibition of Wnt/β-catenin signaling", Cell Mol Biol Lett., vol 24, no 1, pp 1–13, doi: 10.1186/s11658-019-0195-4.

124 Budd W T et al (2015), "Dual action of miR-125b as a tumor suppressor and OncomiR-22 promotes prostate cancer tumorigenesis", PLoS One, vol 10, no 11, pp 1–21, doi: 10.1371/journal.pone.0142373.

125 Chen Y et al (2017), "The association of heart failure-related microRNAs with neurohormonal signaling", BBA - Mol Basis Dis., vol 1863, no 8, pp 2031–2040, doi: 10.1016/j.bbadis.2016.12.019.

126 Wei X J et al (2015), "Biological significance of miR-126 expression in atrial fibrillation and heart failure", Brazilian J Med Biol Res., vol 48, pp 983–989.

127 Bernardo B C et al (2012), "Therapeutic inhibition of the miR-34 family attenuates pathological cardiac remodeling and improves heart function", Proc Natl Acad Sci U S A., vol 109, no 43, pp 17615–17620, doi:

128 van Middendorp L B et al (2017), "Local microRNA-133a downregulation is associated with hypertrophy in the dyssynchronous heart", ESC Hear Fail., vol 4, no 3, pp 241–251, doi: 10.1002/ehf2.12154.

129 Zhou Q et al (2018), "MicroRNAs as potential biomarkers for the diagnosis of glioma: A systematic review and meta-analysis", Cancer Sci., vol 109, no.

130 Vaitkiene P et al (2019), "Association of miR-34a expression with quality of life of glioblastoma patients: A prospective study", Cancers (Basel)., vol 11, no 3, pp 1–11, doi: 10.3390/cancers11030300.

131 Yuan M et al (2018), "MicroRNA (miR) 125b regulates cell growth and invasion in pediatric low grade glioma", Sci Rep., vol 8, no 1, pp 1–14, doi: 10.1038/s41598-018-30942-4.

132 Luo G et al (2017), "MicroRNA-21 promotes migration and invasion of glioma cells via activation of Sox2 and β-catenin signaling", Mol Med Rep., vol 15, no 1, pp 187–193, doi: 10.3892/mmr.2016.5971.

133 Shaikh Y., Yu F., and Coleman A L (2014), "Burden of undetected and untreated glaucoma in the United States", Am J Ophthalmol., vol 158, no 6, pp 1121-1129.e1, doi: 10.1016/j.ajo.2014.08.023.

134 Drewry M D et al (2018), "Differentially expressed microRNAs in the aqueous humor of patients with exfoliation glaucoma or primary open-angle glaucoma", Hum Mol Genet., vol 27, no 7, pp 1263–1275, doi: 10.1093/hmg/ddy040.

135 Hindle A G et al (2019), "Identification of candidate miRNA biomarkers for glaucoma", Investig Ophthalmol Vis Sci., vol 60, no 1, pp 134–146, doi: 10.1167/iovs.18-24878.

136 Qin W et al (2016), "Down-regulation of miR-34a promotes the cell proliferation and inhibits apoptosis in glaucoma", Int J Clin Exp Pathol, vol.

137 Ning S et al (2016), "Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers", Nucleic Acids Res., vol 44, no October 2015, pp 980–985, doi: 10.1093/nar/gkv1094.

138 Chen G et al (2013), "LncRNADisease: a database for long-non-coding RNA-associated diseases", Nucleic Acids Res., vol 41, no November 2012, pp 983–986, doi: 10.1093/nar/gks1099.

139 Lu Z., Cohen K B., and Hunter L (2009), "GeneRIF QUALITY ASSURANCE AS SUMMARY REVISION", no 1999, pp 269–280.

140 Malik R et al (2014), "The lncRNA PCAT29 inhibits oncogenic phenotypes in prostate cancer", Mol Cancer Res., vol 12, no 8, pp 1081–1087, doi: 10.1158/1541-7786.MCR-14-0257.

141 Yu Y et al (2020), "lncRNA UCA1 Functions as a ceRNA to Promote Prostate Cancer Progression via Sponging miR143", Mol Ther - Nucleic Acids, vol 19, no March, pp 751–758, doi: 10.1016/j.omtn.2019.11.021.

142 Rawla P and Barsouk A (2019), "Epidemiology of gastric cancer: Global trends, risk factors and prevention", Prz Gastroenterol., vol 14, no 1, pp.

143 Yuan L et al (2020), "Long non-coding RNAs towards precision medicine in gastric cancer: Early diagnosis, treatment, and drug resistance", Mol Cancer, vol 19, no 1, pp 1–22, doi: 10.1186/s12943-020-01219-0.

144 Xu W et al (2020), "Circulating lncRNA SNHG11 as a novel biomarker for early diagnosis and prognosis of colorectal cancer", Int J Cancer, vol 146, no 10, pp 2901–2912, doi: 10.1002/ijc.32747.

145 Luo J and Xiao Q (2017), "A novel approach for predicting microRNA-disease associations by unbalanced bi-random walk on heterogeneous network", J Biomed Inform., vol 66, pp 194–203, doi: 10.1016/j.jbi.2017.01.008.

146 Luo J and Long Y (2020), "NTSHMDA: Prediction of Human Microbe-Disease Association Based on Random Walk by Integrating Network Topological Similarity", IEEE/ACM Trans Comput Biol Bioinforma., vol.

147 Wang D et al (2010), "Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases", Bioinformatics, vol 26, no 13, pp 1644–1650, doi:

148 Chen X et al (2016), "WBSMDA: Within and between Score for MiRNA-Disease Association prediction", Sci Rep., vol 6, no November 2015, p.

149 Lu M et al (2008), "An Analysis of Human MicroRNA and Disease Associations", PLoS One, vol 3, no 10, p e3420, doi: 10.1371/journal.pone.0003420.

150 Li G et al (2022), "Predicting miRNA-disease associations based on graph attention network with multi-source information", BMC Bioinformatics, vol.

151 Huang D et al (2022), "Computational method using heterogeneous graph convolutional network model combined with reinforcement layer for MiRNA– disease association prediction", BMC Bioinformatics, vol 23, no 1, pp 1–19, doi: 10.1186/s12859-022-04843-3.

152 Xie X et al (2023), "Predicting miRNA-disease associations based on PPMI and attention network", BMC Bioinformatics, pp 1–19, doi: 10.1186/s12859-023-05152-z.

153 Singh R and Mo Y (2013), "Role of microRNAs in breast cancer", Cancer Biol Ther., vol 14, no 3, pp 201–212.

154 Zografos E et al (2019), "Prognostic role of microRNAs in breast cancer: A systematic review", Oncotarget, vol 10, no 67, pp 7156–7178, doi: 10.18632/oncotarget.27327.

155 Imani S., Wu R C., and Fu J (2018), "MicroRNA-34 family in breast cancer: From research to therapeutic potential", J Cancer, vol 9, no 20, pp 3765–

156 Li L et al (2013), "MiR-34a inhibits proliferation and migration of breast cancer through down-regulation of Bcl-2 and SIRT1", Clin Exp Med., vol.

157 Xu X et al (2018), "The role of MicroRNAs in hepatocellular carcinoma", J.

Cancer, vol 9, no 19, pp 3557–3569, doi: 10.7150/jca.26350.

158 O’Connor S et al (2010), "Hepatocellular Carcinoma — United States, 2001–2006", Morb Mortal Wkly Rep., vol 59, no 17, pp 517–520.

159 Balogh J et al (2016), "Hepatocellular carcinoma: a review", J Hepatocell Carcinoma, vol 3, pp 41–53, [Online] Available: https://www.dovepress.com/hepatocellular-carcinoma-a-review-peer-reviewed-article-JHC

160 Zhang Z et al (2015), "MicroRNA-146a inhibits cancer metastasis by downregulating VEGF through dual pathways in hepatocellular carcinoma", Mol Cancer, vol 14, no 1, p 5, doi: 10.1186/1476-4598-14-5.

161 Zhou Y et al (2018), "Hepatocellular carcinoma-derived exosomal miRNA-21 contributes to tumor progression by converting hepatocyte stellate cells to cancer-associated fibroblasts", J Exp Clin Cancer Res., vol 37, no 1, p.

162 Rong M.-H et al (2019), "Overexpression of MiR-452-5p in hepatocellular carcinoma tissues and its prospective signaling pathways", Int J Clin Exp Pathol., vol 12, no 11, pp 4041–4056, [Online] Available: http://www.ncbi.nlm.nih.gov/pubmed/31933800, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC6949781

163 Xia Q et al (2019), "Identification of novel biomarkers for hepatocellular carcinoma using transcriptome analysis", J Cell Physiol., vol 234, no 4, pp. 4851–4863, doi: 10.1002/jcp.27283.

164 Zhang H., Chen X., and Yuan Y (2020), "Investigation of the miRNA and mRNA Coexpression Network and Their Prognostic Value in Hepatocellular Carcinoma", Biomed Res Int., vol 2020, p Article ID 8726567, doi: 10.1155/2020/8726567.

165 Yu L et al (2015), "miR-454 functions as an oncogene by inhibiting CHD5 in hepatocellular carcinoma", Oncotarget, vol 6, no 36, pp 39225–39234, doi: 10.18632/oncotarget.4407.

166 Wu G et al (2016), "MicroRNA-655-3p functions as a tumor suppressor by regulating ADAM10 and β-catenin pathway in Hepatocellular Carcinoma", J Exp Clin Cancer Res., vol 35, no 1, p 89, doi: 10.1186/s13046-016-0368-

167 Zhang C et al (2018), "Downregulation of microRNA-376a in gastric cancer and association with poor prognosis", Cell Physiol Biochem., vol 51, no 5, pp 2010–2018, doi: 10.1159/000495820.

168 Gong J et al (2014), "Characterization of microRNA-29 family expression and investigation of their mechanistic roles in gastric cancer", Carcinogenesis, vol 35, no 2, pp 497–506, doi: 10.1093/carcin/bgt337.

169 Feng Y et al (2018), "Dysregulated microrna expression profiles in gastric cancer cells with high peritoneal metastatic potential", Exp Ther Med., vol.

170 Lu Q et al (2019), "MicroRNA-181a functions as an oncogene in gastric cancer by targeting caprin-1", Front Pharmacol., vol 9, p 1565, doi: 10.3389/fphar.2018.01565.

171 Li H et al (2019), "MicroRNA-183 affects the development of gastric cancer by regulating autophagy via MALAT1-miR-183-SIRT1 axis and PI3K/AKT/mTOR signals", Artif Cells, Nanomedicine Biotechnol., vol 47, no 1, pp 3163–3171, doi: 10.1080/21691401.2019.1642903.

172 Zhenkai Wang et al (2018), "The Role of mir-152 and DNMT1 in Gastric Cancer Cell Proliferation and Invasion", Gastroenterol Hepatol Res., vol 3, no 1, p 011, doi: 10.24966/ghr-2566/100011.

173 Peng Y et al (2014), "MicroRNA-338 inhibits growth, invasion and metastasis of gastric cancer by targeting NRP1 expression", PLoS One, vol 9, no 4, p e94422, doi: 10.1371/journal.pone.0094422.

174 Wu K L et al (2019), "The roles of microRNA in lung cancer", Int J Mol.

175 Liao J et al (2020), "MicroRNA-based biomarkers for diagnosis of non-small cell lung cancer (NSCLC)", Thorac Cancer, vol 11, pp 762–768, doi: 10.1111/1759-7714.13337.

176 Staicu C E et al (2020), "Role of microRNAs as Clinical Cancer Biomarkers for Ovarian Cancer: A Short Overview", Cells, vol 9, no 1, p.

177 Zhang S et al (2018), "Identification of common differentially-expressed mirnas in ovarian cancer cells and their exosomes compared with normal ovarian surface epithelial cell cells", Oncol Lett., vol 16, no 2, pp 2391–

178 Alshamrani A A (2020), "Roles of microRNAs in Ovarian Cancer

Title	Link Prediction in Heterogeneous Information Networks and Its Applications in Predicting Associations Between Non-Coding RNAs and Diseases
Author	Nguyen Van Tinh
Supervisors	Assoc. Prof. Dr. Tran Dang Hung, Dr. Le Thi Tu Kien
Institution	Hanoi National University of Education
Major	Computer Science
Type	Doctoral Dissertation
Year of publication	2023
City	Hanoi

Format
Pages	156
File size	3.36 MB