Nghiên cứu khoa học công nghệ (Science and technology research)

THE INDEXING ALGORITHM IN SEARCHING ENGINE FOR VIETNAMESE TEXT

Nguyen Dang Tien

Abstract: This work investigates the indexing algorithm, which is essential for searching large-scale text data sets. The problem consists of dividing the data set into words and building metadata for each word in order to speed up search. For Vietnamese text, the ambiguity of tokenization in traditional indexing algorithms leads to large lexicons and to low precision of results. In this paper, we apply a tokenizer for Vietnamese text to build the lexicon, so that each record of the lexicon may contain several single words. With this method, we decrease the size of the lexicon and improve the precision of search while maintaining the complexity of the process. Consequently, the time taken by each user query is shortened. Simulation shows that our algorithm performs better than the traditional strategy in terms of lexicon size and precision.

Keywords: Tokenization, Search engine, Indexer, Lexicon, Page rank.

1. INTRODUCTION

We investigate the indexing algorithm, in which text documents are divided into subsets, each identified by a word, and the resulting index is stored in a cache-based search engine. The purpose of storing an index is to return precise results for a search query while optimizing the speed of the process. In particular, we focus on building an information retrieval module for large-scale Vietnamese text data and apply it to divide the corpus into fewer subsets than the traditional method does. A challenge in the indexing algorithm is to program the computer to identify what forms an individual or distinct word, referred to as a token.

Information retrieval is not a new problem in the literature. Many approaches have been proposed for extracting information efficiently from a large-scale corpus. The first class of algorithms is the
string-matching strategy, which tries to find a place where one or several strings (also called patterns) occur within a larger string or text [1, 8]. Generally speaking, let $\Sigma$ be an alphabet (a finite set); both the pattern and the searched text are sequences of elements of $\Sigma$. $\Sigma$ may be a usual human alphabet (for example, the letters A through Z in the Latin alphabet). The simplest strategy is brute-force searching, in which the characters of the query and of the text are compared one by one, in order. However, the time complexity of the brute-force strategy is not feasible for search engines, where the corpus is large.

Another approach to string matching is the Knuth–Morris–Pratt (KMP) algorithm [8]. The algorithm searches for occurrences of a "word" W within a main "text string" S by employing the observation that, when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. However, the performance of the KMP algorithm is limited on natural language text. A related approach is the Boyer–Moore string search algorithm, proposed by Robert S. Boyer and J Strother Moore [2], which uses information gathered during a preprocessing step to skip sections of the text, resulting in a lower constant factor than many other string search algorithms. The string-matching algorithms, however, are only appropriate for small texts.

In order to make a search engine perform effectively, one needs to build an intelligent information retrieval module in addition to a crawler and a page ranker. The purpose of this module is to find a particular phrase in a large-scale corpus. For information retrieval in large-scale data sets, the indexing technique is widely used [3, 9]. Similar to the index of a book, the index of a search
engine contains information about words and their positions. The technique provides a significant improvement over the string-matching methods. However, the traditional indexing algorithm is designed for inflectional languages, in which words are separated by whitespace. For isolating languages, a better tokenizer is needed for the search engine to "understand" the text.

Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not "know" that a space character separates words in a document. During tokenization, the parser identifies sequences of characters that represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When each token is identified, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like "noun" or "verb"), position, sentence number, sentence position, length, and line number.

Motivated by the observation above, our idea is to apply Vietnamese tokenization to the indexing algorithm to improve the quality of the lexicon. Moreover, we propose two data structures to store the lexicons in memory. The contributions of our work are threefold:
- Apply a Vietnamese tokenization process to break a Vietnamese corpus into words.
- Implement an indexing technique to build the lexicon and store it in an appropriate data structure.
- Build a computer simulation to demonstrate how well our method works.

2. THE STATE-OF-THE-ART INDEXING ALGORITHM

For large-scale data sets (hundreds of megabytes of text data), using string-matching algorithms leads to infeasible performance. In these cases, an indexing technique is usually applied before
performing the search. In the following, we describe the state-of-the-art indexing algorithm in detail.

2.1 The model

In the traditional indexing algorithm, two steps are performed before results are returned for a query:
- The first step is to index the corpus. In other words, an index file (lexicon) is created by an indexer.
- In the second step, for each query, we parse the query to retrieve the words it contains.

As described above, the search is not performed over the corpus by traditional string-matching algorithms. Instead, the lexicon is used to find the information for each word, and the final results are composed by combining the information of all words in the query.

2.2 The structure of a lexicon

A lexicon is a file that contains a list of items. Each item corresponds to a word in the corpus and has the following format:

$$\{\mathit{word}\}\,\{t_i, \{p_j\}\}, \quad i, j = 0, 1, \ldots$$

where $\mathit{word}$ is a word in the text, $t_i$ are the identifiers of the documents in which the word appears, and $p_j$ are the positions of the word in document $t_i$.

To search for a single word, we only run the search over the lexicon instead of looking for the word in the whole corpus. The information about each word, namely the documents that contain it and its positions in each document, can be extracted entirely from the lexicon. Note that a word can appear in a document multiple times, and it can also appear in different documents of the corpus. Hence, for each value of $t_i$, we may have a large set of $p_j$.

2.3 Searching algorithm

For each query, we extract the words it contains and search for them in the lexicon. The information to extract includes the document identifiers and the positions of each word in the document. The following example shows how the traditional algorithm works.

Example: There are two documents in the corpus, each containing one sentence:
- Paris
(nicknamed the City of light) is the capital city of France, and the largest city in that country.
- The Greater Tokyo Area is the most populous metropolitan area in the world.

After the traditional indexing algorithm is performed, the information in the lexicon is constructed as shown in Table 1.

Table 1. The information in the lexicon. Columns: Word, Document ID, Positions. Rows (one per word): Paris, nicknamed, the, City, of, light, is, capital, France, and, in, that, Greater, Tokyo, Area, most, metropolitan, world.

On the basis of Table 1, we can accurately answer the query "the capital city" without reading the documents. For English text, the indexing algorithm can boost search performance in comparison with traditional string-matching methods. However, for Vietnamese text, some inaccurate results can be returned. For instance, for the query "đại học", the results may include documents in which the words "đại" and "học" both occur but are not adjacent. An accurate result, however, should contain the whole word "đại học".

To solve this issue, we can perform the following steps: (1) search for the words "đại" and "học" separately and independently; (2) then, on the basis of these results, output only the documents in which the whole word "đại học" appears (in other words, the documents in which "học" stands immediately after "đại").

In this paper, we propose another approach to solve the issue, which includes four steps:
- Use a Vietnamese tokenizer to split documents into Vietnamese words. Each word is one record in the lexicon. Thanks to the tokenizer, this step makes the lexicon "understand" Vietnamese compound words.
- On the basis of the parsed text, build a lexicon.
- For each query, apply the same tokenizer to split it into words.
- Search for the query in the lexicon.

3. PROBLEM ESTABLISHMENT

In this part, we briefly
describe a Vietnamese tokenizer that was introduced in the literature and apply it to the indexing algorithm.

3.1 Vietnamese tokenizer

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis. Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:
- Punctuation and whitespace may or may not be included in the resulting list of tokens.
- All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
- Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

Tokenization is an important process in natural language processing, in particular for East Asian languages such as Chinese, Japanese, and Korean. For this kind of language, whitespace does not delimit words as it does in English. Instead, single words are connected to each other, i.e. a word may consist of several single words. As a consequence, a good tokenizer has to reduce the ambiguity of words.

For Vietnamese text, a popular algorithm implemented in tokenizers is "minimum weight" [4, 10], in which tokenization is transformed into a graph problem as follows:
(1) Create two virtual nodes: a start node and an end node.
(2) Compare, in sequence, segments of arbitrary length with a dictionary of the language.
(3) A segment that is contained in the dictionary corresponds to a new node in the graph.
(4) The weight between two nodes (two consecutive segments in the sentence) is
calculated by the formula:

$$f(i, j) = -\log \frac{f}{N}$$

(5) Find the shortest path from the start node to the end node.

In the formula, $f(i, j)$ is usually calculated from the uni-gram value (the probability that a word appears) and the bi-gram value (the probability that two words appear together). In addition, other aspects can be added to obtain better values of $f$: word type, linking word, etc. Previously, these values (except the uni-gram and bi-gram values, which are obtained by statistically analyzing the corpus) were evaluated manually. However, with machine learning models (Markov models, CRFs, etc.), they can be calculated automatically. In step (5), the shortest path is found by the Viterbi algorithm, with complexity O(n), where n is the length of the input sentence. This strategy can reach a precision of 97% for Vietnamese text. Refer to the papers of Dien et al. [4] and Tran et al. [10] for a more detailed description of the tokenizer.

3.2 Search engine model

The model of our proposed system is shown in Fig. 1. The indexing algorithm performs two steps:
- Apply the tokenizer to split Vietnamese text into words.
- Build a lexicon on the hard disk. The information to be stored in the lexicon includes the strings of the words, the identifiers of the documents that contain them, and their positions.

Figure 1. A chromosome example

For each query, the system performs three steps:
- Use the Vietnamese tokenizer to split the query into words.
- Retrieve the information of these words from the lexicon.
- Combine the retrieved information and output the documents that contain the query.

How the list of words is stored in the lexicon matters for the performance of the algorithm. Since searching and inserting are the two most frequent operations on the lexicon, we distinguish two types of lexicon:
- The lexicon for large-scale Vietnamese data sets: In this case, the size of the lexicon is
significant in comparison with the size of internal memory. We need an appropriate data structure that stores the list of words in external memory and supports searching and inserting. Note that the average time to access a byte on a hard disk is 19 ms, while in internal memory the average time is 0.000113 ms. In this work, we suggest using the B-tree to store and search the lexicon.
- The small lexicons: A small lexicon can be stored in internal memory to take advantage of its speed. We use the red-black tree [6] structure to store the lexicon.

For both data structures above, the complexity of searching and insertion is O(log n), where n is the number of words stored in the lexicon.

4. SIMULATION AND RESULT

4.1 Simulation description

To evaluate the performance of our proposed indexing strategy, we have built a computer simulation in C++ and Python, which is described in Fig. 2.

Figure 2. Simulation description

The crawler module (which collects Vietnamese text from the internet) is written in Python. Its output is a large-scale Vietnamese corpus. The indexer is written in C++. We store lexicons in external memory using a B-tree [5]. This allows us to expand the corpus without erasing the old lexicon. Moreover, in the tokenizer, we implement the "minimum weight" algorithm. The performance of the systems is evaluated in terms of precision and recall, where:

$$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$

$$\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$

To show the advantages of our proposed approach, we compare the results of our algorithm with those of the traditional algorithm for the Vietnamese language, which was presented in [9], in terms of accuracy.

4.2 Result and discussion

Figure 3. Precision for single and compound terms

As can be seen from Fig. 3, our method performs better than the traditional strategy in terms of precision. In particular, for the case of compound terms and 200 megabytes of
corpus, the precision of our method and of the traditional algorithm is 0.652 and 0.587, respectively. This demonstrates the effect of applying the Vietnamese tokenizer to the indexing algorithm.

Figure 4. Recall for single and compound terms

We remark that the improvement of our method is more significant for compound-term searches than for single-term searches. The reason is that, with the Vietnamese tokenizer, some single words are absorbed into compound terms in the lexicon. Hence, searching for these single words gives worse results.

We also remark that our approach outperforms the traditional method in terms of recall. As shown in Fig. 4, for single terms, the recall of our algorithm is 0.641, compared with 0.530 for the traditional indexer. Similarly, the recall for compound terms is 0.598 for the system with the tokenizer and 0.304 for the system without it. This demonstrates the advantage of tokenization in our method.

5. CONCLUSION

In this paper, we apply a Vietnamese tokenizer to the traditional indexing algorithm. Through the tokenization process, the indexer "understands" more about the language than the traditional strategy does and hence outputs a better lexicon. Our proposed search engine, which uses a crawler module and an indexer with a tokenization process, performs better than the traditional strategy. We consider the application of our approach to other isolating languages as future work.

Nguyen Dang Tien, “The indexing algorithm in searching engine for Vietnamese text.” Điều khiển – Cơ điện tử - Truyền thông (Control, Mechatronics, Communications)

REFERENCES

[1] Charras, C. and Lecroq, T., “Handbook of exact string matching algorithms”, Citeseer, 2004.
[2] Boyer, R. S. and Moore, J. S., “A fast string searching algorithm”, Communications of the ACM, Vol. 20, no. 10, pp. 762–772, 1977.
[3] Brin, S. and Page, L., “Reprint of: The anatomy of a large-scale hypertextual web search engine”, Computer Networks, Vol. 56, no. 18, pp. 3825–3833, 2012.
[4]
Dien, D., Kiem, H., and Van Toan, N., “Vietnamese word segmentation”, in NLPRS, Vol. 1, 2001, pp. 749–756.
[5] Ferragina, P. and Grossi, R., “The string B-tree: a new data structure for string search in external memory and its applications”, Journal of the ACM (JACM), Vol. 46, no. 2, pp. 236–280, 1999.
[6] Hanke, S., “The performance of concurrent red-black tree algorithms”, Springer, 1999.
[7] Johnson, C., “Method and system for visual internet search engine”, Oct. 10, 2001, US Patent App. 09/975,755.
[8] Knuth, D. E., Morris, J. H., Jr., and Pratt, V. R., “Fast pattern matching in strings”, SIAM Journal on Computing, Vol. 6, no. 2, pp. 323–350, 1977.
[9] Orlando, S., Perego, R., and Silvestri, F., “Design of a parallel and distributed web search engine”, arXiv preprint cs/0407053, 2004.
[10] Tran, O. T., Le, C. A., and Ha, T. Q., “Improving Vietnamese word segmentation and POS tagging using MEM with various kinds of resources”, Natural Language Processing, Vol. 17, no. 3, pp. 41–60, 2010.

SUMMARY (translated from the Vietnamese)

THE INDEXING ALGORITHM IN A SEARCH ENGINE FOR VIETNAMESE TEXT

This paper proposes an indexing algorithm for searching large Vietnamese-language data sets. The problem has two parts: splitting the data into individual words and building metadata for each word. For Vietnamese, the ambiguity of tokenization in traditional indexing algorithms leads to inaccurate search results. In this paper, we apply a tokenization method for Vietnamese in which a term may consist of several words. This increases the accuracy of the search and reduces the search time. Simulation shows the advantage of the proposed algorithm.

Keywords: Tokenization, Search engine, Index, Term, Page rank.

Received 2 May 2017. Revised manuscript 10 June 2017. Published 20 July 2017.

Author affiliations: People's Police University of Technology and Logistics, Bac Ninh, Vietnam. Email: dangtient36@gmail.com

Tạp chí Nghiên cứu KH&CN quân sự, Special Issue ACMEC, 07-2017.
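The lexicon format of Section 2.2, the adjacency check of Section 2.3, and the tokenizer-based lexicon of Section 3.2 can be illustrated with a minimal Python sketch. It is not the C++ indexer used in the simulation. The function names (`build_lexicon`, `phrase_search`, `make_dictionary_tokenizer`), the two sample documents, and the small compound dictionary are all hypothetical; the lexicon is kept in an in-memory dictionary rather than a B-tree or red-black tree; and a greedy longest-match tokenizer stands in for the "minimum weight" algorithm.

```python
from collections import defaultdict

# Hypothetical sample data, chosen only for illustration.
DOCS = {1: "trường đại học bách khoa", 2: "đại dương và học sinh"}
COMPOUNDS = {"đại học", "đại dương", "học sinh", "bách khoa"}


def whitespace_tokenize(text):
    """Traditional tokenization: every whitespace-separated unit is a word."""
    return text.lower().split()


def make_dictionary_tokenizer(compounds):
    """Greedy longest-match tokenizer over a set of compound words.

    This is a simple stand-in, NOT the "minimum weight" algorithm.
    """
    max_len = max((len(c.split()) for c in compounds), default=1)

    def tokenize(text):
        words = text.lower().split()
        tokens, i = [], 0
        while i < len(words):
            # Try the longest candidate first; a single word always matches.
            for n in range(min(max_len, len(words) - i), 0, -1):
                candidate = " ".join(words[i:i + n])
                if n == 1 or candidate in compounds:
                    tokens.append(candidate)
                    i += n
                    break
        return tokens

    return tokenize


def build_lexicon(docs, tokenize):
    """Build {word: {document id t_i: [positions p_j]}}, positions from 1."""
    lexicon = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text), start=1):
            lexicon[token][doc_id].append(pos)
    return lexicon


def phrase_search(lexicon, query_tokens):
    """Return ids of documents where the tokens occur at consecutive positions."""
    if not query_tokens or any(t not in lexicon for t in query_tokens):
        return set()
    hits = set()
    for doc_id, starts in lexicon[query_tokens[0]].items():
        for start in starts:
            # Token k of the query must sit at position start + k.
            if all(start + k in lexicon[t].get(doc_id, ())
                   for k, t in enumerate(query_tokens)):
                hits.add(doc_id)
                break
    return hits


if __name__ == "__main__":
    vi_tokenize = make_dictionary_tokenizer(COMPOUNDS)
    traditional = build_lexicon(DOCS, whitespace_tokenize)
    tokenized = build_lexicon(DOCS, vi_tokenize)
    print(phrase_search(traditional, whitespace_tokenize("đại học")))
    print(phrase_search(tokenized, vi_tokenize("đại học")))
    print(len(traditional), len(tokenized))
```

On these sample documents both lexicons return document 1 for the query "đại học", but the whitespace lexicon holds eight records and needs two lookups plus an adjacency check, while the tokenized lexicon holds six records and answers the query with a single lookup.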