MINISTRY OF EDUCATION AND TRAINING
VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY

Tran Lam Quan

SOME SEARCHING TECHNIQUES FOR ENTITIES BASED ON IMPLICIT SEMANTIC RELATIONS AND CONTEXT-AWARE QUERY SUGGESTIONS

Major: Mathematical Theory of Informatics
Code: 9.46.01.10

SUMMARY OF MATHEMATICS DOCTORAL THESIS

Hanoi - 2020

The work was completed at: Graduate University of Science and Technology - Vietnam Academy of Science and Technology
Scientific supervisor: Dr. Vũ Tất Thắng
Reviewer 1: …
Reviewer 2: …
Reviewer 3: …
The thesis will be defended before the Academy-level doctoral thesis evaluation council, meeting at the Graduate University of Science and Technology - Vietnam Academy of Science and Technology at …, on day … month … year 202…
The thesis can be found at:
- Library of the Graduate University of Science and Technology
- National Library of Vietnam

INTRODUCTION

The necessity of the thesis

In the big data era, when new data is generated incessantly, the search engine is a useful tool for finding information. According to statistics, approximately 71% of web search queries include entity names [7], [8]. Consider a query that contains only entity names: "Vietnam", "Hanoi", "France". Intuitively, we see the underlying semantics behind this query: the relationship that holds for the entity pair "Vietnam": "Hanoi" should also hold for the entity pair "France": "?". This is one of the "natural" abilities of humans: inferring unknown information/knowledge by analogy. A human can answer the above query immediately, but a Search Engine (SE) can only find documents containing the given keywords; it cannot immediately give the answer "Paris". The same happens in the real world, with questions such as: "If Fansipan is the highest mountain in Vietnam, which one is the highest in Tibet?"
or "If you know Elizabeth as Queen of England, who is the Japanese monarch?", etc. For queries with analogous relationships like these, a keyword search engine has difficulty giving answers, while humans can easily infer them by analogy.

Figure 1.1: The list returned by a keyword SE for the query "Việt Nam", "Hà Nội", "Pháp" ("Vietnam", "Hanoi", "France")

Researching and simulating the human ability to reason from a familiar semantic domain ("Vietnam", "Hanoi") to an unfamiliar one ("France", "?") is the purpose of the first problem.

The second problem concerns query suggestions. Also according to statistics, the queries that users enter are often short, ambiguous, and polysemous [1-6]. A search session returns many results, but most of them do not match the user's search intent (footnote 1: https://static.googleusercontent.com/media/guidelines.raterhub.com/en//searchqualityevaluatorguidelines.pdf). Therefore, many research directions aim to improve results and assist searchers, including: query suggestion, query rewriting, query expansion, personalized recommendation, ranking/re-ranking of search results, etc. Query suggestion research often applies traditional techniques such as clustering and similarity measurement of queries [9], [10]. However, traditional techniques have three disadvantages. First, they can only suggest queries similar or related to the query just entered (the current query), with no guarantee that the suggestions are better than the current query. Second, they cannot capture the trend of what most users ask after the current query. Third, they do not treat the user's queries as a seamless sequence in order to capture the search intent. For example, on a keyword SE, type the consecutive queries q1: "Who is Joe Biden" and q2: "How old is he". q1 and q2 are semantically related; however, the result sets returned for q1 and q2 are very different. This shows the disadvantage of keyword
search.

Figure 1.2: The answer lists returned by the SE for q1 and q2

By capturing a seamless query sequence, in other words the search context, the SE can "understand" the user's search intent. Moreover, having captured the query sequence, the SE can suggest follow-up queries that reflect the knowledge of the majority: what the community often asks after q1, q2. This is the purpose of the second problem.

Thesis: Objectives, Layout and Contributions

The thesis researches, identifies and experiments with methods to solve the two problems above. In line with these objectives, the main contributions of the thesis are:
- Research and build an entity search technique based on implicit semantic relations, using clustering methods to improve search efficiency.
- Apply context-aware techniques to build a vertical search engine that applies context-awareness in its own knowledge base domain (aviation data).
- Propose a combined similarity measure for the context-aware query suggestion problem to improve the quality of suggestions.

CHAPTER 1: OVERVIEW

1.1 The problem for searching entities based on implicit semantic relations

Consider a query consisting of entities: "Qur'an": "Islam", "Gospels": "?". Humans can immediately deduce the "?", but the SE only returns documents containing the above keywords and does not immediately give the answer: "Christianity". Because only entities are sought, query expansion and query rewriting techniques do not apply to a relation whose meaning is hidden in the entity pair. Hence a new search form is studied, in which the query has the form {(A, B), (C, ?)}, where (A, B) is the source entity pair and (C, ?) is the target entity pair. At the same time, the two pairs (A, B), (C, ?)
have a similar semantic relationship. Specifically, when the user enters a query consisting of the entities {(A, B), (C, ?)}, the SE has the task of searching for and listing the candidate entities D (the entity marked "?"), where each entity D has a semantic relationship with C and the pair (C, D) has a relation similar to that of the pair (A, B). A semantic relation, in the narrow sense and from the lexical perspective, is expressed by the terms/patterns/context surrounding (before, between, after) the known entity pair (footnote 2: Birger Hjorland, Link: http://vnlp.net). Because the semantic relation and the similarity relation are not explicitly stated in the query (the query consists only of the entities A, B, C), this search form is called Implicit Relational Entity Search or Implicit Relational Search, in short: IRS.

Consider the input query q = "Mekong": "Vietnam", "?": "China". The query q contains only entities; it does not describe a semantic relation ("longest river", "largest", "widest basin", etc.).
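As a rough illustration of how the lexical context surrounding a known entity pair (before, between, after) could be captured, the Python sketch below splits a sentence around the two entities. The helper name extract_context, the whitespace tokenization, the single-token entities and the window size are all assumptions made for this example; they are not taken from the thesis.

```python
def extract_context(sentence, a, b, window=3):
    """Return the (before, between, after) terms around entities a and b.

    Hypothetical helper: assumes whitespace tokenization and that each
    entity is a single token; `window` limits the before/after spans.
    """
    tokens = sentence.split()
    if a not in tokens or b not in tokens:
        return None  # the pair does not occur in this sentence
    i, j = tokens.index(a), tokens.index(b)
    if i > j:
        i, j = j, i  # keep positions in sentence order
    before = tuple(tokens[max(0, i - window):i])
    between = tuple(tokens[i + 1:j])
    after = tuple(tokens[j + 1:j + 1 + window])
    return before, between, after


# Illustrative sentence only; the "between" terms carry the implicit relation.
ctx = extract_context(
    "The Mekong is the longest river in Vietnam today", "Mekong", "Vietnam"
)
```

In this sketch the "between" span ("is the longest river in") is the kind of pattern that would be stored alongside the entity pair, so that a query containing only "Mekong" and "Vietnam" can still be linked to the relation it leaves unstated.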
The search model based on implicit semantic relations is responsible for finding the entity "?" that has a semantic relationship with the entity "China", such that the pair "?": "China" is similar in relation to the pair "Mekong": "Vietnam".

Finding/calculating the relational similarity between two pairs of entities is a difficult problem because:
- First, relational similarity changes over time. Consider the two entity pairs (Joe Biden, US President) and (Elizabeth, Queen of England): the similarity of the relations changes with the term of office.
- Second, entities are proper names (names of individuals, organizations, places, ...) which are not common words and do not appear in dictionaries.
- Third, a pair of entities can have many different semantic relations, such as: "The Corona outbreak originated from Wuhan"; "Corona isolates Wuhan city"; "The number of Corona infections decreased gradually in Wuhan"; etc.
- Fourth, because of timing, entity pairs may share little or none of the context around them, such as Apple: iPod (in 2010s) and Sony: Walkman (1980s), so the entity pairs are not judged to be similar.
- Fifth, a pair of entities may have only one semantic relation but more than one expression: "X was acquired by Y" and "Y buys X".
- Finally, the entity D is unknown; it is still being searched for.

Figure 1.3: Input query: "Cuba", "José Marti", "Ấn Độ" ("India") (implicit semantics: "national hero")

The query's search form is q = {(A, B), (C, ?)}; the query consists only of the entities A, B, C. Identifying the similarity relationship between the entity pairs (A, B), (C, ?)
is a necessary condition for determining the entity being sought. As a problem of NLP (Natural Language Processing), measuring relational similarity is one of the most important tasks in searching for entities based on implicit semantic relations. The thesis therefore lists the main research directions on relational similarity.

1.2 IRS - Related work

1.2.1 SMT - Structure Mapping Theory

SMT [12] considers similarity as a mapping of "knowledge" from the source domain to the target domain, according to the mapping rule: eliminate the attributes of the objects but maintain the relational mapping between objects from the source domain to the target domain.
- Mapping rule: M: s_i → t_i (where s: source, t: target)
- Eliminate attributes: HOT(s_i) ↛ HOT(t_i); MASSIVE(s_i) ↛ MASSIVE(t_i)
- Maintain relational mapping: Revolves(Planet, Sun) → Revolves(Electron, Nucleus)

Figure 1.5: Structure Mapping Theory (SMT)

Figure 1.5 shows that, because they share the same s (subject), o (object) structure, SMT considers the pairs (Planet, Sun) and (Electron, Nucleus) to be relationally similar, regardless of the fact that the source and target objects (Sun and Nucleus, Planet and Electron) are very different in their properties (HOT, MASSIVE, ...). With respect to the purpose of the thesis, if the query is ((Planet, Sun), (Electron, ?)), SMT will output the correct answer: "Nucleus". However, SMT is not feasible with low-level structures (lacking relations). Therefore, SMT is not feasible for the problem of searching entities based on implicit semantic relations.

1.2.2 Relational similarity based on the WordNet classification system

Cao [20] and Agirre [21] proposed relational similarity measures based on the classification system in WordNet. However, as mentioned above, WordNet does not contain named entities. Thus, WordNet is not suitable for the entity search model.

1.2.3 VSM - Vector Space Model

Using the vector space model, Turney [13] presents the concept of a vector formed by the patterns containing the entity
pair (A, B) and the occurrence frequency of each pattern. VSM measures relational similarity as follows: patterns are generated manually and submitted as queries to the Search Engine (SE); the number of results returned by the SE is taken as the occurrence frequency of each pattern. The relational similarity of two entity pairs is then computed as the cosine between their two frequency vectors.

1.2.4 LRA - Latent Relational Analysis

Extending VSM, Turney proposes LRA to determine the degree of relational similarity [14-16]. Like VSM, LRA uses a vector made up of the patterns/contexts containing the entity pair (A, B) and the frequency of each pattern (patterns in n-gram format). In addition, LRA applies a thesaurus to generate variants such as: A bought B, A acquired B; X headquarters in Y, X offices in Y, etc. LRA selects the most frequent n-grams as the patterns associated with the entity pair (A, B), then builds a pattern - entity pair matrix in which each element represents the frequency of the pair (A, B) in a pattern. To reduce the matrix dimension, LRA uses Singular Value Decomposition (SVD) to reduce the number of columns. Finally, LRA applies the cosine measure to compute the relational similarity between two entity pairs.

Although LRA is an effective approach to identifying relational similarity, it requires a long time to compute and process: LRA requires days to process 374 SAT analogy questions [17]. This is infeasible for a real-time response system.

1.2.5 LRME - Latent Relation Mapping Engine

To avoid the manual construction of the mapping rules and of s (subject), o (object) in SMT, Turney proposes the Latent Relation Mapping Engine (LRME) [11], which combines SMT and LRA. Purpose: find the relationship between terms A, B (considering terms as entities). The input (Table 1.1) is the lists of terms from the source and target domains; the output (Table 1.2) is the resulting mapping between the lists.

1.2.6 LSR - Latent Semantic Relation

Bollegala,
Duc et al [17, 18] and Kato [19] use the Distributional Hypothesis at the context level: in the corpus, if two contexts p_i and p_j are different but usually co-occur with the same entity pairs w_m, w_n, they are semantically similar. When p_i, p_j are semantically similar, the entity pairs w_m, w_n are relationally similar. The Distributional Hypothesis requires entity pairs to always co-occur with contexts, and the Bollegala clustering algorithm operates at the context level rather than at the level of terms in the sentence. A similarity measure based only on the distributional hypothesis, and not on term similarity, significantly affects the quality of the clustering and thus the quality of the search system.

1.2.7 Word2Vec

The Word2Vec model, proposed by Mikolov et al [22], is a learning model that represents each word as a vector (a word, given as a one-hot vector, is mapped to a dense vector); Word2Vec describes the relationship (probability) between a word and its context. The model has two simple neural network architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. With Skip-gram, at each training step the model predicts the context words within a given window. Assuming the input training word is "banking" and the sliding window is skip = m = 2, the left context output is "turning into" and the right context output is "crises as".

Figure 1.6: Relationship between the target word and the context in the Word2Vec model

For prediction, the objective function of Skip-gram maximizes probability. With a sequence of training words w_1, w_2, ..., w_T, Skip-gram applies Maximum Likelihood:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-m \le j \le m,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

in which: T: number of words in the data-set; w_t: the training word at position t; m: window size (skip); θ: vector representation.

The training process applies the back-propagation algorithm; the output probability p(w_{t+j} | w_t) is determined by the softmax function:

$$p(o \mid c) = \frac{\exp(u_o^{T} v_c)}{\sum_{w=1}^{W} \exp(u_w^{T} v_c)}$$

in which: W: vocabulary; c: the trained (input/center) word; o: output word of c; u: vector representing o; v: vector representing c.

In the experiments, Mikolov et al [22-25] treat phrases as single words, eliminate frequently repeated words, and use the Negative Sampling loss function, which randomly selects n words for the computation instead of all the words in the data-set, making training faster than with the softmax function above.

Figure 1.7: Word2Vec "learns" the "hidden" relationship between the target word and its context (see footnote 3)

Vector operations such as vec("king") - vec("man") ≈ vec("queen") - vec("woman") show that the Word2Vec model suits queries of the form "A : B :: C : ?"; in other words, the Word2Vec model is quite close to the research direction of the thesis. The difference: the input of Word2Vec (following the Skip-gram model) is one word and the output is a context, whereas the input of the IRS model is a set of entities (A : B :: C : ?) and the output is the entity being sought (D).

Starting from these open problems in semantic entity search, and aiming to bring the search engine closer to "artificial intelligence", the thesis studies and applies techniques that simulate a natural human ability: inferring undetermined information/knowledge by analogy.

1.3 The problem of context-aware query suggestions

For an SE, the ability to "understand" the search intent behind a user's query is a challenge. The data set used for mining is the Query Logs (QLogs), the set of past queries. QLogs record the queries and the "interactions" of users with the search engine, so they contain valuable information about query content, purpose, behavior, habits and preferences, as well as the user's implicit feedback on the result sets returned by the SE. Mining the logs is useful in many applications: query analysis, advertising, trends, personalization, query suggestion, etc. For query suggestion, traditional techniques such as Explicit Feedback [30-32], Implicit Feedback
[33-36], User profile [37-39], Thesaurus [40-42] only give suggestions similar to the user's input query.

Figure 1.12: Traditional suggestions for the input query "điện thoại di động" ("mobile phone")

1.4 Query suggestions - Related work

With QLogs at the core, traditional query suggestion techniques can be said to fall into two main groups:
- Cluster-based techniques apply similarity measures to aggregate similar queries into clusters (groups).
- Session-based techniques, where a search session is a continuous sequence of queries.

(Footnote 3: https://cs224d.stanford.edu/lectures/CS224d-Lecture2.pdf)

1.4.1 Session-based query suggestion technique

a) Based on co-occurrence or adjacency of queries belonging to sessions in QLogs: in the session-based approach, adjacent or co-occurring query pairs belonging to the same session act as the candidate list for query suggestion.

b) Based on the graph (Query Flow Graph - QFG): on the QFG, two queries q_i, q_j that belong to the same search intent (search mission) are represented by a directed edge from q_i to q_j. Each node of the graph corresponds to a query, and each edge of the graph is also considered a search behavior. The general session structure in QFG is represented: QLog = ; Boldi et al [50, 51] use the simplified session structure QLog = to perform query suggestion, following a series of steps:
- Construct the QFG from the set of sessions in the Query Logs. The queries q_i and q_j are connected if there exists at least one session in which q_i and q_j occur consecutively.
- Calculate the weight w(q_i, q_j) on each edge:

$$w(q_i, q_j) = \begin{cases} \dfrac{f(q_i, q_j)}{f(q_i)}, & \text{if } (w(q_i, q_j) > \theta) \lor (q_i = s) \lor (q_i = t) \\ 0, & \text{otherwise} \end{cases} \qquad (1.6)$$

in which: f(q_i, q_j): the number of times q_j occurs immediately after q_i in a session; f(q_i): the number of occurrences of q_i in QLogs; θ: threshold; s, t: the start and end state nodes of the query chain in a session.
- Identify the chains that satisfy condition (1.6) to analyze the user's intent: when a new query is entered, based
on the graph, the system gives query suggestions in decreasing order of edge weight.

1.4.2 Cluster-based query suggestion technique

K-means; Hierarchical; DB-SCAN; …

Figure 1.9: Clustering methods [54]

Context-aware Query Suggestion is a new approach. It considers the queries immediately before the current query as the search context, in order to "capture" the user's search intent. It then mines the queries that immediately follow the current query as the list of suggestions. This is the unique advantage of the approach compared with one that only suggests similar queries. The layer of queries that immediately follows the current query reflects the problems that users often ask about after the current query. At the same time, this layer often includes better queries (query strings) that reflect the search intent more accurately.

CHAPTER 2: SEARCH FOR ENTITIES BASED ON IMPLICIT SEMANTIC RELATIONS

2.1 Problem

In nature, relationships exist between two entities, such as: Khue Van Cac - Temple of Literature; Stephen Hawking - Physicist; Shakyamuni - Mahayana group; Apple - iPhone. In the real world, there are questions like: "Knowing that Fansipan is the highest mountain in Vietnam, which is the highest mountain in India?", "If Biden is president-elect of the United States, who is the most powerful person in Sweden?
”, … For keyword search engines, according to statistics, queries are often short, ambiguous, and polysemous [1-6]. Statistics also show that approximately 71% of web search queries contain entity names [7, 8]. If the user enters entities such as "Vietnam", "Hanoi", "France", the search engine only returns documents that contain these keywords and does not immediately answer "Paris". Because only entities are sought, query expansion and query rewriting techniques do not apply to an implicit semantic relation within an entity pair. Therefore, a new search form is studied. The search query has the form (A, B), (C, ?), in which (A, B) is the source entity pair and (C, ?) is the target entity pair, and the two pairs (A, B), (C, ?) are semantically similar. In other words, when the user enters the query (A, B), (C, ?), the search engine has the duty of listing the entities D such that each D has a semantic relation with C and the pair (C, D) has a relation similar to that of the pair (A, B). With an input consisting only of the entities "Vietnam", "Hanoi", "France", the semantic relation "is the capital" is not indicated in the query.

2.2 Method for searching entities based on implicit semantic relations

2.2.1 Architecture - Modeling

Searching for entities through implicit semantic relations is what most clearly distinguishes this approach from keyword-based search engines. Figure 2.1 illustrates a query consisting of only three entities, query = (Vietnam, Mekong), (China, ?). By convention we write q = {(A, B), (C, ?)}, where (Vietnam, Mekong) is the source entity pair and (China, ?) is the target entity pair. The search engine is responsible for identifying the entity ("?") that has a semantic relation with the entity "China", and the entity pair (China, ?)
must be related in a way similar to the entity pair (Vietnam, Mekong). Note that the above query does not explicitly contain the semantic relation between the two entities. This is because semantic relations are expressed in various ways around the entity pair (Vietnam, Mekong), such as "the longest river", "big river system", "the largest basin", etc.

Figure 2.1: Implicit Semantic Relation Search with input consisting of entities

The similarity of contexts p, q by the distributional hypothesis:

$$Sim_{DH}(p, q) = \mathrm{Cosine}(PMI(p, q)) = \frac{\sum_{i} PMI(w_i, p) \cdot PMI(w_i, q)}{\lVert PMI(w_i, p) \rVert \, \lVert PMI(w_i, q) \rVert} \qquad (2.25)$$

The similarity by terms of contexts p, q:

$$Sim_{term}(p, q) = \frac{\sum_{i=1}^{n} weight_i(p) \cdot weight_i(q)}{\lVert weight(p) \rVert \, \lVert weight(q) \rVert} \qquad (2.26)$$

The combined similarity measure:

$$Sim(p, q) = \max\big(Sim_{DH}(p, q),\ Sim_{term}(p, q)\big) \qquad (2.27)$$

c) Clustering algorithm:
- Input: Set P = {p1, p2, …, pn}; cluster threshold: θ1; heuristic threshold: θ2; Dmax: cluster diameter; Sim_cp: result of the combined similarity function, formula (2.27)
- Output: Set of clusters: Cset (ClusterID; context; weight of each context; respective entity pairs)

Program Clustering_algorithm
Cset = {}; iCount = 0;
for each context pi ∈ P
    Dmax = 0; c* = NULL;
    for each cluster cj ∈ Cset
        Sim_cp = Sim(pi, Centroid(cj))
        if (Sim_cp > Dmax) then
            Dmax = Sim_cp; c* ← cj;      // most similar cluster found so far
        end if
    end for
    if (Dmax > θ1) then
        c*.append(pi)
    else
        Cset ∪= new cluster{pi}          // no cluster is similar enough: pi starts a new cluster
    end if
    if (iCount > θ2) then
        iCount++;
        exit Current_Proc_Cluster_Alg();
    end if
end for
Return Cset; @CallMerge_Cset_from_OtherNodes()

2.2.4 Module calculating the relational similarity between two pairs of entities

This module performs two tasks: filtering (searching) and ranking. As illustrated in 3.1, given the input query q = (A, B), (C, ?), the IRS uses the inverted index to execute the function Filter-Entities Fe, which filters (searches) out the candidate sets containing entity pairs (C, Di) and the corresponding contexts, such that (C,
Di) is similar to (A, B). It then executes the function Rank-Entities Re to rank the entities Di, Dj within the candidate set according to the RelSim (Relational Similarity) measure, which finally yields the ranked list {Di}.

Filter-Entities algorithm: filter to find the candidate set containing the answer.
Input: Query q = (A, B), (C, ?)
Output: Candidate set S (including the entities Di and the corresponding contexts)

Program Filter_Entities
S = {};
P(w) = EntPair_from_Cset.Context();
for each context pi ∈ P(w)
    W(p) = Context(pi).EntPairs();
    if (W(p) contains (C:Di)) then S ∪= W(p);
end for
return S

After executing Filter-Entities, a subset of the entities Di and the corresponding contexts is obtained. RelSim only processes and calculates on this very small subset. In addition, RelSim uses the threshold α to eliminate entities Di with low RelSim values. With Fe(q, D) = Fe({(A, B), (C, ?)}, D):

$$F_e(q, D_i) = \begin{cases} 1, & \text{if } RelSim((A, B), (C, D_i)) > \alpha \\ 0, & \text{otherwise} \end{cases} \qquad (2.29)$$

Rank-Entities algorithm: responsible for calculating RelSim.
Input: Candidate set S and:
- Source entity pair (A, B), denoted s; candidate entity pairs (C, Di), denoted c;
- Contexts corresponding to s, c; the resulting cluster set: Cset;
- The known entities A, B, C, for which the corresponding clusters containing A, B, C are identified;
- Threshold α (compared with the RelSim value); α is set while testing the program;
- Initialize the dot product (β) and the used-context set (γ);
Output: List of answers (ranked entity list) Di;
Notation:
- P(s), P(c) given in formulas (2.19), (2.20);
- f(s, pi), f(c, pi), ɸ(s), ɸ(c) given in (2.21), (2.22);
- γ: variable (set of contexts) holding the contexts already considered;
- q: temporary/intermediate variable (context); Ω: cluster;

Program Rank_Entities
01 for each context pi ∈ P(c)
02   if (pi ∈ P(s)) then
03     β ← β + f(s, pi)·f(c, pi)
04     γ ← γ ∪ {pi}
05   else
06     Ω ← cluster containing pi
07     max_co-occurs = 0;
08     q ← NULL;
09     for each context pj ∈ (P(s)\P(c)\γ)
10       if (pj ∈ Ω) & (f(s, pj) > max_co-occurs)
11         max_co-occurs ← f(s, pj);
12         q ← pj;
13       end if
14     end for
15     if (max_co-occurs > 0) then
16       β ← β + f(s, q)·f(c, pi)
17       γ ← γ ∪ {q}
18     end if
19   end if
20 end for
21 RelSim ← β/L2-norm(ɸ(s), ɸ(c))
22 if (RelSim ≥ α) then return RelSim

Algorithm interpretation: When the source and target entity pairs have the same semantic relationship (they share the same context, statements 1-2), i.e. pi ∊ P(s) ∩ P(c), the dot product is calculated as in a modified version of the standard cosine similarity formula. When pi ∊ P(c) but pi ∉ P(s), the algorithm finds the context pj (the temporary variable q, line 12) such that pi, pj belong to the same cluster. The loop body (statements 10-13) chooses the context pj with the largest frequency of co-occurrence with s. Under the Distributional Hypothesis, the more entity pairs two contexts pi, pj co-occur with, the higher the cosine similarity between their two vectors, and the higher the cosine value, the more similar pi and pj are. Therefore, the pair (C, Di) is more accurate and semantically consistent with the source entity pair (A, B). Statements 15-18 update the dot product. Statements 21-22 calculate the RelSim value. Among the resulting RelSim values, entities Di with a higher RelSim appear closer to the top of the list (a higher rank). Finally, the result set Di is the answer list for the query that the end-user wants to find.

2.3 Experiment Results - Evaluation

2.3.1 Dataset

The dataset is built from the empirical sample dataset, based on four entity subclasses: PER, ORG, LOC and TIME.

2.3.2 Test - Parameter adjustment

To evaluate the effectiveness of the clustering algorithm and the Rank_Entities ranking algorithm, Chapter 2 varies the values θ1 and α, then calculates the Precision, Recall and F-Score measures corresponding to each value of α, θ1. Figure 2.3 shows that the F-Score is highest at α = 0.5, θ1 = 0.4.

Figure 2.3: F-Score value corresponding to each changed value of α, θ1
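To make the role of the threshold α more concrete, the Rank_Entities procedure of Section 2.2.4 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the thesis implementation: the function name rel_sim, the dictionary-based inputs and the toy contexts are hypothetical, and "L2-norm(ɸ(s), ɸ(c))" is interpreted here as the product of the two vectors' L2 norms.

```python
import math


def rel_sim(f_s, f_c, cluster_of, alpha=0.5):
    """Sketch of RelSim between a source pair s and a candidate pair c.

    f_s, f_c   : dict mapping context -> frequency for s and for c
    cluster_of : dict mapping context -> cluster id (from the clustering step)
    Returns the score, or None when it falls below the threshold alpha.
    """
    beta = 0.0    # accumulated dot product
    used = set()  # contexts of s already consumed (gamma in the pseudocode)
    for p_i, fc in f_c.items():
        if p_i in f_s:
            # shared context: ordinary cosine-style contribution
            beta += f_s[p_i] * fc
            used.add(p_i)
            continue
        # otherwise borrow the most frequent unused context of s
        # that lies in the same cluster as p_i
        omega = cluster_of.get(p_i)
        best, best_f = None, 0
        for p_j, fs in f_s.items():
            if p_j in f_c or p_j in used:
                continue
            if omega is not None and cluster_of.get(p_j) == omega and fs > best_f:
                best, best_f = p_j, fs
        if best is not None:
            beta += best_f * fc
            used.add(best)
    norm = math.sqrt(sum(v * v for v in f_s.values())) * \
        math.sqrt(sum(v * v for v in f_c.values()))
    if norm == 0:
        return None
    score = beta / norm
    return score if score >= alpha else None


# Toy data (hypothetical contexts, all assumed to be in one cluster)
f_s = {"X is the capital of Y": 3, "Y 's capital X": 1}
f_c = {"X is the capital of Y": 2, "X , capital of Y": 1}
cluster_of = {ctx: 0 for ctx in list(f_s) + list(f_c)}
score = rel_sim(f_s, f_c, cluster_of, alpha=0.5)
```

Raising alpha in this sketch makes the function return None for more candidates, which corresponds to the filtering behavior of line 22 of the pseudocode.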
Line 22 of the Rank_Entities algorithm (if (RelSim ≥ α) return RelSim) shows that when α is small, the number of candidates increases and noise appears; real-time processing also becomes costly, because the system must process many candidate queries. Conversely, when α is large, Recall is small, which causes the F-Score to drop significantly.

2.3.3 Evaluation with MRR (Mean Reciprocal Rank)

For a query set Q, if the rank of the first correct answer for the query q ∈ Q is r_q, then the MRR of Q is calculated as:

$$MRR(Q) = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_q} \qquad (2.33)$$

With the entity subclasses PER, ORG, LOC and TIME, the method based on co-occurrence frequency (f) reaches an average MRR ≈ 0.69, while the PMI-based method reaches 0.86. This shows that PMI improves the accuracy of semantic similarity more than the co-occurrence frequency of contexts and entity pairs does.

Figure 2.4: Comparison of PMI with f (co-occurrence frequency) based on MRR

2.3.4 Experimental system

The dataset was downloaded from Viwiki (7877 files) and Vn-news (35440 files). These sources were selected because they contain samples of named entities. After reading and extracting the file contents and separating paragraphs and sentences (main sentences, sub-sentences), 1572616 sentences are obtained.

The general NER (Named Entity Recognition) labels include: PER: name of person; ORG: name of organization; LOC: name of place; TIME: time type; NUM: number type; CUR: currency; PCT: percentage type; MISC: another entity type; O: not an entity.

Using the algorithm for extracting contexts, and after the processing steps and restriction conditions, 404507 context sentences remain in the database. From this set of contexts, the semantic relation clustering algorithm produces 124805 clusters.

Figure 2.5: IRS experiment with the B-PER entity label

To evaluate accuracy, 500 test queries were run; the results showed an accuracy of about 92%.

Table 2.3: Examples of experimental results with input q = {A, B, C} and output D

A | B | C | D
German | Angela Merkel | Israel | Benjamin Netanyahu
Harry Kane | Tottenham | Messi | Barca
Hoàng Công Lương | Hòa Bình | Thiên Sơn | RO

2.4 Conclusion

The ability to infer undetermined information/knowledge by analogy is one of the natural abilities of humans. Chapter 2 presents an Implicit Relational entity Search model (IRS) that simulates this ability. The IRS model searches for information/knowledge in an unfamiliar domain without requiring keywords in advance, using an analogous example (similarity relation) from a familiar domain. The main contribution of Chapter 2: build an entity search technique based on implicit semantic relations, using a clustering method to improve search efficiency. The thesis also proposes a similarity measure combining term similarity and the distributional hypothesis, and, based on this measure, applies a heuristic to the clustering algorithm to improve cluster quality.

CHAPTER 3: CONTEXT-AWARE QUERY SUGGESTION

3.1 Problem

In the field of query suggestion, traditional approaches such as session-based, document-click based, and so on mine the Query Logs to generate suggestions. The approach "Suggesting context-aware query by session data mining and click-through documents" (for short: the "context-aware approach", by Huanhuan Cao et al [9], [10]) is a new approach: it considers the queries immediately before the query just entered (the current query) as the search context, in order to "capture" the user's search intent and provide more accurate and valuable suggestions. Naturally, the preceding layer of queries has a semantic relationship with the current query. Next, the approach mines the queries that immediately follow the current query as the list of suggestions. This method makes use of the "knowledge" of the community, because the query layer immediately following the current query reflects the problems that users often ask about after the current query.

The main contributions of Chapter 3 include: 1) Apply
context-aware techniques to build a vertical search engine that applies context-awareness in its own knowledge base domain (aviation data). 2) Propose a combined similarity measure for the context-aware query suggestion problem to improve the quality of suggestions.

In addition, Chapter 3 has further experimental contributions: i) integrating Vietnamese speech recognition and synthesis as an option into the search engine, to create a voice-search system with speech interaction; ii) applying the Concept-lattice structure to classify the returned result set.

3.2 Context-aware method

3.2.1 Definition - Terminology

Search session: a continuous sequence of queries, represented in chronological order. Each session corresponds to one user. General session structure: {sessionID; queryText; queryTime; URL_clicked}

Context: the sequence of queries adjacent to and preceding the current query. In a user's search session, the context is the query sequence immediately preceding the query just entered:

(Query layer before q_current ↔ context)   q_current   (Query layer after q_current)

3.2.2 Architecture - Modeling

The Context-aware idea is based on two phases, online and offline. In general: during a search session (online phase), the context-aware approach waits for the current query and considers the query sequence preceding it as the context. More precisely, this sequence is translated into a concept sequence, which expresses the user's search intent.

Figure 3.4: Context-aware query suggestion model

Once a search context is obtained, the system matches it against the pre-built context set (offline phase: the context set is built in advance from the set of past queries, the Query Logs; in terms of data structure and storage, it is stored in a suffix tree). A maximal match provides a list of candidates: the issues that most users often ask about after the queries they have already entered. After the ranking step, the
candidate list becomes the suggestion list.

3.2.4 Offline phase - Clustering algorithm

The idea of the clustering algorithm: the algorithm scans all queries in the Query Logs once, and the clusters are generated during the scan. Each cluster is initialized with a single query and then expanded gradually with similar queries. The expansion stops when the cluster diameter would exceed the threshold Dmax. Because each cluster is regarded as a concept, the cluster set is the concept set.

Input: Query Logs Q, threshold Dmax
Output: set of clusters Cset

program Context_Aware_Clustering_alg
    // Initialize the array dim_array[d] = Ø, ∀d (d: document click)
    // The array dim_array holds the dimensions of the vectors
    for each query qi ∈ Q
        θ = Ø;
        for each nonZeroDimension d of vector qi
            θ ∪= dim_array[d];
        C = arg min_{C' ∈ θ} distance(qi, C');
        if (diameter(C ∪ {qi}) ≤ Dmax)
            C ∪= qi; update the diameter and centroid of cluster C;
        else
            C = new cluster({qi}); Cset ∪= C;
        for each nonZeroDimension d of vector qi
            if (C ∉ dim_array[d]) dim_array[d] ∪= C;
    end for
    return Cset;

3.3.6 Analyze pros and cons

Advantages:

- The context-aware approach is novel. For query suggestion, almost all traditional approaches propose queries that already exist in the Query Logs. Such approaches can only propose queries similar or related to the current query, rather than indicating what the community often asks after the current query. Likewise, no other approach treats the queries preceding the current query as a search context, that is, as a continuous expression of the user's intent. Above all, the context-aware technique suggests the problems that users often ask about after the current query, which is a distinctive, efficient, and "smart" focus in the field of query suggestion.

Disadvantages:

- When the user enters the first query of a session, or a query that is new compared to past queries, or a query that is not new but is not
present in the frequent concept sequences (for example, in a data set with concept sequences c2c3 and c1c2c3, the algorithm determines c2c3 to be the frequent sequence, and the user then enters c1), the context-aware approach generates no suggestion even though c1 occurred in the past (it is already in the Query Logs).

- Each cluster (each concept) consists of a group of similar queries. The similarity measure is based only on clicked URLs and ignores term similarity, which can significantly affect the quality of the clustering.

- The constraint that each query belongs to exactly one cluster (concept) is unreasonable and unnatural for a polysemous query such as "tiger" or "gladiator", or for the many polysemous words in the Vietnamese language.

- The approach suggests queries only, without considering URL recommendation or document suggestion. Likewise, although it is "click-through" oriented, it does not use clicked-URL information in the search context (when searching the suffix tree, the input concept sequence consists of queries only).

- On the bipartite graph, the vectors on the vertex set Q are quite sparse (few non-zero dimensions), and the set of clicked URLs is also sparse. When the vectors are sparse, the quality of the clustering is affected.

- In the clustering algorithm, when the Query Logs are large or the number of dimensions of each vector is large, the array dim_array[d] becomes very large and requires a large amount of memory.

In practice, in any search session the user enters one or more queries, and may click none or many of the resulting URLs, some of which are not what the user expected and act as noise. The context-aware method requires a series of consecutive queries to form a context, which does not reflect reality when users enter only one query. However, depending only on clicked URLs and not
taking into account term similarity is the most significant disadvantage of this approach.

3.3.7 Technical proposal

In terms of query suggestion, our search engine shares the philosophy of the context-aware team in [9], [10] ("provide the suggestions that the majority of users often ask after the current query"), but the approach, implementation, formulas, data structures, design, algorithms, and source code are completely different.

When mining the Query Logs, the clustering step in our application does not rely on click-through alone. It focuses on three fixed components: the query; the top N results; and the set of clicked URLs. These are the three most important components for the data mining tasks in the context-aware search engine, under the following premises:

- If the intersection of the keyword (term) sets of two queries reaches a certain rate, the two queries are considered similar.
- If the intersection of the top N results of two queries reaches a certain rate, the two queries are considered similar.
- If the intersection of the clicked-URL sets of two queries reaches a certain rate, the two queries are considered similar.

Based on these premises, combined with thresholds drawn from experiments, the thesis defines the following formulas to make the similarity measurement precise.

Similarity according to the keywords in queries p, q:

$$\mathrm{Sim}_{\mathrm{keywords}}(p, q) = \frac{\sum_{i=1}^{n} \left[ w(k_i(p)) + w(k_i(q)) \right]}{2 \times \mathrm{MAX}(kn(p), kn(q))} \tag{3.9}$$

In the above formula:
- kn(.): the total weight of the terms in p, in q;
- w(k_i(.)): the weight of the i-th common term of p and q.

The similarity in the top 50 URL results of queries p, q:

$$\mathrm{Sim}_{\mathrm{top50URL}}(p, q) = \frac{\wedge(topU_p, topU_q)}{2 \times \mathrm{MAX}(kn(p), kn(q))} \tag{3.10}$$

where ∧(topU_p, topU_q) denotes the intersection of the top 50 URL results of p and q.

The similarity of two queries p, q by clicked URLs:

$$\mathrm{Sim}_{\mathrm{URLsClicked}}(p, q) = \frac{\wedge(U\_click_p, U\_click_q)}{2 \times \mathrm{MAX}(kn(p), kn(q))} \tag{3.11}$$

where ∧(U_click_p, U_click_q) denotes the clicked URLs common to p and q, and U_click_p is the number of URLs clicked for query p.
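As an illustration of the keyword measure (3.9), the following is a minimal Python sketch rather than the thesis implementation. It assumes each query is given as a dictionary mapping its terms to weights; the function name sim_keywords, the uniform weights, and the example terms are hypothetical choices made only for this sketch.

```python
def sim_keywords(p_weights, q_weights):
    """Keyword similarity following the shape of formula (3.9).

    p_weights, q_weights: dict mapping each term of a query to its weight.
    The weighting scheme is an assumption of this sketch.
    """
    kn_p = sum(p_weights.values())  # kn(p): total term weight of p
    kn_q = sum(q_weights.values())  # kn(q): total term weight of q
    denom = 2 * max(kn_p, kn_q)
    if denom == 0:
        return 0.0  # assumption: two empty queries count as dissimilar
    # Terms shared by both queries (dict key views support set operations).
    common = p_weights.keys() & q_weights.keys()
    numer = sum(p_weights[t] + q_weights[t] for t in common)
    return numer / denom


# Hypothetical queries with uniform term weights.
p = {"flight": 1.0, "hanoi": 1.0, "paris": 1.0}
q = {"flight": 1.0, "hanoi": 1.0, "schedule": 1.0}
score = sim_keywords(p, q)
```

With these inputs two of three equally weighted terms are shared, so the score is 4/6. Identical queries score 1 and queries with no common term score 0.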
From (3.9), (3.10), (3.11), the thesis proposes the following equation for the combined similarity:

$$\mathrm{Sim}_{\mathrm{combination}}(p, q) = \alpha\,\mathrm{Sim}_{\mathrm{keywords}}(p, q) + \beta\,\mathrm{Sim}_{\mathrm{top50URL}}(p, q) + \gamma\,\mathrm{Sim}_{\mathrm{URLsClicked}}(p, q) \tag{3.12}$$

with α + β + γ = 1, where α, β, γ are weighting parameters determined experimentally. In the search application, α = 0.4; β = 0.4; γ = 0.2.

3.2.8 Search result classification technique based on Concept Lattice

The Concept-lattice structure can automatically group the result set returned by the search engine. Grouping the result set into topics makes it easier for the user to scan the results and decide which document is appropriate, and it reduces the risk that relevant information is buried in a long list. Formal Concept Analysis (FCA) is one of the main techniques applied on lattices. FCA is used to create tables in which the rows describe the objects and the columns describe the ...