

Data Engineering

ASSIGNMENT SIMILARITY SEARCH

Lecturer: Dr. Phan Trong Nhan

Nguyen Khoa Gia Cat 1912749

Ho Chi Minh City, May 2022


2.2.1 Information Retrieval Measure TF-IDF

2.2.2 Jaccard similarity coefficient for Multiset

3 The Proposed Scheme

3.1 Pairwise Similarity

3.1.1 Calculate the number of occurrences of each word in each document

3.1.2 Calculate the frequency of words in each document

3.1.3 Calculate the TF-IDF of words in each document

3.1.4 Calculate the total TF-IDF of each document

3.1.5 Find the list of TF-IDF values of each word over all documents

3.1.6 Calculate the similarity score for each pair of books

3.2 Search By Example

3.2.1 List the words of the search query that appear in each document

3.2.2 Calculate the percentage of the total TF-IDF of the search query over the total TF-IDF of each document


1.1 Similarity Search

• Similarity search is the most general term for a variety of techniques that aim to find all records in a database that are sufficiently similar to a query record. This is becoming increasingly important in an age of large information repositories, where similarity search applications need to adhere to much stricter time constraints.

Figure 1: Similarity Search

• Many researchers have addressed the field of similarity search. First, if similarity search is to be performed on vector data, there exist many appropriate similarity measures and index structures. Another popular group of approaches addresses the metric space, which requires the similarity measure to fulfill the metric requirements, in particular the triangle inequality, which is often a handicap for modeling similarity measures. For similarity search on a set of strings, several algorithms for specific similarity measures, such as the edit distance, have been proposed.

• To deal with the challenge of time constraints when implementing these approaches on large-scale datasets, they can be deployed on a parallel paradigm like MapReduce to accelerate computations.

Figure 2: MapReduce Paradigm


1.2 MapReduce

• The MapReduce programming model and runtime environment were first introduced by Jeffrey Dean and Sanjay Ghemawat at Google in 2004.

• The motivation behind the MapReduce system was to handle special-purpose computations on large datasets (e.g., computing inverted indexes from Web content collected via Web crawling; building Web graphs; and extracting statistics from Web logs, such as frequency distribution of search requests by topic, by region, by type of user, etc.).

• MapReduce is a fault-tolerant implementation and runtime environment that scales to thousands of processors. The underlying data model is the key-value pair. The model is built around map and reduce tasks, which allow the infrastructure to parallelize work and execute it on large clusters of commodity hardware.

– Map is a generic function that takes a key of type K1 and a value of type V1, and returns a list of key-value pairs of types K2 and V2.

– Reduce is a generic function that takes a key of type K2 and a list of values of type V2, and returns a list of pairs of type (K3, V3).

Figure 3: Process of MapReduce
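To make these signatures concrete, the following minimal Python sketch (the helper names are my own invention, not the Hadoop API) emulates the map/reduce contract with an in-memory dictionary standing in for the framework's shuffle phase:

from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Toy driver: apply map_fn to each (k1, v1) pair, group the emitted
    (k2, v2) pairs by key (the 'shuffle'), then apply reduce_fn per group."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):      # Map: (K1, V1) -> list of (K2, V2)
            groups[k2].append(v2)
    out = []
    for k2, values in groups.items():
        out.extend(reduce_fn(k2, values))  # Reduce: (K2, [V2]) -> list of (K3, V3)
    return out

# Word count, the canonical MapReduce example:
def wc_map(doc_name, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

print(run_mapreduce(wc_map, wc_reduce, [("d1", "a b a"), ("d2", "b c")]))
# [('a', 2), ('b', 2), ('c', 1)]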

1.3 Hadoop

• Hadoop is an open-source implementation of the MapReduce programming model, developed by Cutting and Cafarella in 2004 as part of a search engine.

• In 2006, Cutting joined Yahoo, where he attempted to improve its search processing based on ideas from the Google File System and the MapReduce programming paradigm. In 2011, Yahoo spun off a Hadoop-centered company.

• The two core components of Hadoop are the MapReduce programming paradigm and HDFS, the Hadoop Distributed File System.

Figure 4: Hadoop with MapReduce


1.4 Hadoop Distributed File System

• The Hadoop Distributed File System (HDFS) is the file system component of Hadoop and is designed to run on a cluster of commodity hardware.

– Patterned after the UNIX file system
– Provides high-throughput access to large datasets
– Stores metadata on a NameNode server
– Stores application data on DataNode servers

• HDFS was designed with the following assumptions and goals:

– Hardware failure is the norm rather than the exception
– Batch processing rather than interactive use
– Metadata is decoupled from data operations
– Replication provides reliability and high availability
– Network traffic is minimized

Figure 5: Architecture of the Hadoop Distributed File System

• The master server is called the NameNode, and the slaves are called DataNodes.

– The NameNode maintains an image of the file system comprising i-nodes and the corresponding block locations. Changes to the file system are recorded in a write-ahead commit log called the Journal.

– Secondary NameNodes perform either a checkpointing role or a backup role.

– DataNodes store blocks in the node's native file system and periodically report their state to the NameNode.

• File I/O operations:

– Single-writer, multiple-reader model
– Files cannot be updated, only appended
– A write pipeline is set up to minimize network utilization

• Block placement:

– Nodes of a Hadoop cluster are typically spread across many racks

• Replica management:

– The NameNode tracks the number of replicas and block locations based on block reports
– A replication priority queue contains the blocks that need to be replicated

• Consider a library that manages thousands of books, where each book can be cataloged under a variety of subjects.

• Assume that we have to rearrange the books based on their contents alone; we then have to find the pairs of similar books to create clusters. However, it is hard to run fast, effective queries on such large datasets, especially text data.

• Often, queries against this dataset have to be answered extremely fast, e.g., to recommend a book title based on keywords, no matter what the catalog is.

• In this assignment, we will use Hadoop with MapReduce to solve the following two problems:

– Problem 1. Given a dataset consisting of N books' files in txt format, find the similarity of each pair of books.

– Problem 2. Given a dataset consisting of N books' files in txt format and a query Q, find the similarity of query Q to each book.

2.2.1 Information Retrieval Measure TF-IDF

• The term count in a given document is simply the number of times a given term appears in that document. For the term $t_i$ within a particular document $d_j$, its term frequency is defined as follows:

$tf_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$

In the formula, $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in document $d_j$, and the denominator is the sum of the numbers of occurrences of all terms in document $d_j$.

• The inverse document frequency is a measure of the general importance of the term. The formula is defined as follows:

$idf_i = \log \dfrac{|D|}{|\{j : t_i \in d_j\}|}$

In the formula, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents in which the term $t_i$ appears (that is, $n_{i,j} \neq 0$).

• The tf-idf weight of a term is the product of its tf and idf. The formula is defined as follows:

$(tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i$
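As a quick numerical check of these formulas, here is a minimal Python sketch over a hypothetical two-document corpus (all names illustrative):

import math
from collections import Counter

docs = {"d1": "cat sat on the mat".split(),
        "d2": "cat ate the fish".split()}

def tf(term, doc):
    counts = Counter(docs[doc])
    return counts[term] / sum(counts.values())      # n_ij / sum_k n_kj

def idf(term):
    d = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / d)                  # log(|D| / |{j : t_i in d_j}|)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("cat", "d1"))  # 0.0 -- 'cat' occurs in every document
print(tfidf("mat", "d1"))  # 0.2 * log(2) ~ 0.1386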


2.2.2 Jaccard similarity coefficient for Multiset

• We can vectorize each book into a multiset and then use the Jaccard similarity coefficient to calculate the similarity score between books.

• The Jaccard similarity coefficient is defined as follows:

If $x = (x_1, x_2, x_3, \ldots, x_n)$ and $y = (y_1, y_2, y_3, \ldots, y_n)$ are two vectors with all real $x_i, y_i \geq 0$, then their Jaccard similarity coefficient is defined as

$J(x, y) = \dfrac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$

In the formula, $x_i$ is the count of element $i$ in the multiset $x$.
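A direct Python rendering of this coefficient, with the two multisets represented as aligned count vectors (a hypothetical helper, not from the assignment code):

def weighted_jaccard(x, y):
    # Jaccard coefficient for multisets:
    # sum_i min(x_i, y_i) / sum_i max(x_i, y_i)
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0

print(weighted_jaccard([1, 2, 0], [1, 1, 3]))  # 2 / 6 = 0.333...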

3 The Proposed Scheme

3.1 Pairwise Similarity

• Pairwise Similarity Search is the case in which we want to find all possible similar pairs. In other words, every object is compared with every other object to compute their similarity.

• In this assignment, we design six MapReduce processes to achieve the pairwise similarity.

• We use the following dataset to simulate the execution processes.

Figure 6: Sample Dataset

3.1.1 Calculate the number of occurrences of each word in each document

• In the mapper, we match words and write (word#documentName, 1) pairs to intermediate values, which will be processed by the reducer.

• Then we calculate the number of occurrences of the word in the document directly in the reducer. The output of the reducer needs to be written to intermediate files, which will be processed by the next MapReduce job.

• The output uses (word#documentName) as the key and (n) as the value, where 'n' is the number of occurrences of the term 'word' in 'documentName'. The function is designed as follows:
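The figure below shows the function design; as a complement, here is a minimal local Python stand-in for the job's logic (not actual Hadoop code; passing the document name to the mapper explicitly is a simplifying assumption about how the input split is exposed):

from collections import defaultdict

def map1(doc_name, line):
    # Emit (word#documentName, 1) for every word in an input line.
    return [(f"{word}#{doc_name}", 1) for word in line.split()]

def reduce1(key, values):
    # key = word#documentName; summing the 1s gives the occurrence count n.
    return [(key, sum(values))]

# Local stand-in for the shuffle phase:
groups = defaultdict(list)
for doc, text in [("1.txt", "map reduce map"), ("2.txt", "map shuffle")]:
    for k, v in map1(doc, text):
        groups[k].append(v)
print([reduce1(k, vs)[0] for k, vs in sorted(groups.items())])
# [('map#1.txt', 2), ('map#2.txt', 1), ('reduce#1.txt', 1), ('shuffle#2.txt', 1)]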

Figure 7: MapReduce-1 operation

3.1.2 Calculate the frequency of words in each document

• In this step, we reorganize the (key, value) pairs in the mapper, using (documentName) as the key and (word=n) as the value.

• Then we calculate the total number of words of each document in the reducer. The output of the reducer needs to be written to intermediate files, which will be processed by the next MapReduce job.

• The output uses (word#documentName) as the key and (n/N) as the value, where 'n' is the number of occurrences of the term 'word' in the document 'documentName' and 'N' is the total number of words in 'documentName'. The function is designed as follows:
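A minimal local sketch of this job's logic in Python (illustrative names; the word=n string encoding follows the key/value format described above):

def map2(key, n):
    # (word#documentName, n) -> (documentName, word=n)
    word, doc = key.split("#")
    return [(doc, f"{word}={n}")]

def reduce2(doc, word_counts):
    # N = total number of words in the document; emit (word#doc, n/N).
    pairs = [wc.split("=") for wc in word_counts]
    N = sum(int(n) for _, n in pairs)
    return [(f"{word}#{doc}", int(n) / N) for word, n in pairs]

print(reduce2("1.txt", ["map=2", "hadoop=1", "reduce=1"]))
# [('map#1.txt', 0.5), ('hadoop#1.txt', 0.25), ('reduce#1.txt', 0.25)]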


3.1.3 Calculate the TF-IDF of words in each document

• In this step, we reorganize the (key, value) pairs in the mapper, using (word) as the key and (documentName#n/N) as the value.

• Then we compute the number 'd', which is the number of documents containing this word, and the number 'D', which is the total number of documents.

• Finally, we calculate the TF-IDF according to the formula TF-IDF = n/N * log(D/d). The function is designed as follows:

Map():
Input: (word#documentName, n/N)
Output: (word, documentName#n/N)

Reduce():
Input: (word, documentName#n/N)
Output: (word#documentName, n/N * log(D/d))

Figure 9: MapReduce-3 operation
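A minimal local sketch of this job in Python; the total document count D must be known to the reducer, and it is assumed here to arrive as a job-wide constant (e.g., via configuration):

import math

D = 3  # total number of documents; assumed to be supplied to the job

def map3(key, tf):
    # (word#documentName, n/N) -> (word, documentName#n/N)
    word, doc = key.split("#")
    return [(word, f"{doc}#{tf}")]

def reduce3(word, doc_tfs):
    # d = number of documents containing this word; TF-IDF = n/N * log(D/d)
    d = len(doc_tfs)
    out = []
    for entry in doc_tfs:
        doc, tf = entry.rsplit("#", 1)
        out.append((f"{word}#{doc}", float(tf) * math.log(D / d)))
    return out

print(reduce3("map", ["1.txt#0.5", "2.txt#0.5"]))
# [('map#1.txt', 0.2027...), ('map#2.txt', 0.2027...)]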

3.1.4 Calculate the total TF-IDF of each document

• In this step, we reorganize the (key, value) pairs in the mapper, using (documentName) as the key and (word#TF-IDF) as the value.

• Then we calculate the total TF-IDF of each document in the reducer. The output of the reducer needs to be written to intermediate files, which will be processed by the next MapReduce job.

• The output uses (word#documentName@Total TF-IDF) as the key and (TF-IDF) as the value, where 'Total TF-IDF' is the total TF-IDF of the document and 'TF-IDF' is the weight of the term 'word' in the document 'documentName'. The function is designed as follows:

Map():
Input: (word#documentName, TF-IDF)
Output: (documentName, word#TF-IDF)

Reduce():
Input: (documentName, word#TF-IDF)
Output: (word#documentName@Total TF-IDF, TF-IDF)
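A minimal local sketch of this job in Python (illustrative names, following the #- and @-separated key encodings above):

def map4(key, tfidf):
    # (word#documentName, TF-IDF) -> (documentName, word#TF-IDF)
    word, doc = key.split("#")
    return [(doc, f"{word}#{tfidf}")]

def reduce4(doc, word_tfidfs):
    # Sum the per-word TF-IDF weights into the document total, then
    # attach that total to every output key.
    pairs = [entry.rsplit("#", 1) for entry in word_tfidfs]
    total = sum(float(t) for _, t in pairs)
    return [(f"{word}#{doc}@{total}", float(t)) for word, t in pairs]

print(reduce4("1.txt", ["map#0.25", "hadoop#0.5"]))
# [('map#1.txt@0.75', 0.25), ('hadoop#1.txt@0.75', 0.5)]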

Figure 10: MapReduce-4 operation

3.1.5 Find the list of TF-IDF values of each word over all documents

• In this step, we reorganize the (key, value) pairs in the mapper, using (word) as the key and (documentName@Total TF-IDF#TF-IDF) as the value.

• Then we find the list of TF-IDF values of each word over all documents in the reducer. The output of the reducer needs to be written to intermediate files, which will be processed by the next MapReduce job.

• The output uses (word) as the key and (List((documentName@Total TF-IDF, TF-IDF))) as the value, where 'Total TF-IDF' is the total TF-IDF of the document and 'TF-IDF' is the weight of the term 'word' in the document 'documentName'. The function is designed as follows:

Map():
Input: (word#documentName@Total TF-IDF, TF-IDF)
Output: (word, documentName@Total TF-IDF#TF-IDF)

Reduce():
Input: (word, documentName@Total TF-IDF#TF-IDF)
Output: (word, List((documentName@Total TF-IDF, TF-IDF)))

Figure 11: MapReduce-5 operation
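A minimal local sketch of this job in Python (illustrative names):

def map5(key, tfidf):
    # (word#documentName@Total TF-IDF, TF-IDF)
    #   -> (word, documentName@Total TF-IDF#TF-IDF)
    word, rest = key.split("#", 1)
    return [(word, f"{rest}#{tfidf}")]

def reduce5(word, entries):
    # Collect, per word, the list of (documentName@Total TF-IDF, TF-IDF) pairs.
    return [(word, [tuple(e.rsplit("#", 1)) for e in entries])]

print(reduce5("map", ["1.txt@0.75#0.25", "2.txt@0.9#0.3"]))
# [('map', [('1.txt@0.75', '0.25'), ('2.txt@0.9', '0.3')])]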


3.1.6 Calculate the similarity score for each pair of books

• In this step, we reorganize the (key, value) pairs in the mapper, using (documentName1@Total TF-IDF 1, documentName2@Total TF-IDF 2) as the key and min(TF-IDF 1, TF-IDF 2) as the value for each pair of entries in the value list.

• Then we calculate the similarity score for each pair of books in the reducer, using the Jaccard similarity coefficient for multisets.

• The output uses (documentName1 documentName2) as the key and (similarity-score) as the value. The function is designed as follows:

Map():
Input: (word, List((documentName@Total TF-IDF, TF-IDF)))
Output: ((documentName1@Total TF-IDF 1, documentName2@Total TF-IDF 2), min(TF-IDF 1, TF-IDF 2))

Reduce():
Input: ((documentName1@Total TF-IDF 1, documentName2@Total TF-IDF 2), min(TF-IDF 1, TF-IDF 2))
Output: (documentName1 documentName2, similarity-score)

Figure 12: MapReduce-6 operation
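A minimal local sketch of this job in Python. How the reducer combines the per-word minima with the two document totals is my reading of the scheme: since max(a, b) = a + b - min(a, b), the multiset Jaccard denominator can be computed as Total1 + Total2 - sum(min):

from itertools import combinations

def map6(word, entries):
    # entries: list of (documentName@Total TF-IDF, TF-IDF) pairs for one word.
    # Emit, for every pair of documents, the min of the two TF-IDF weights.
    out = []
    for (d1, t1), (d2, t2) in combinations(entries, 2):
        out.append(((d1, d2), min(float(t1), float(t2))))
    return out

def reduce6(key, mins):
    # Multiset Jaccard: sum(min) / sum(max),
    # with sum(max) = Total1 + Total2 - sum(min).
    (doc1, total1), (doc2, total2) = (k.split("@") for k in key)
    s = sum(mins)
    score = s / (float(total1) + float(total2) - s)
    return [(f"{doc1} {doc2}", score)]

print(reduce6(("1.txt@0.75", "2.txt@0.9"), [0.25, 0.1]))
# [('1.txt 2.txt', 0.2692...)]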

• According to the results, we can see that '1.txt' and '2.txt' are the most similar books among the three books in the sample dataset.


3.2 Search By Example

• Search by example is a well-known similarity search case in which a pivot object is given as an example for the search. The goal is to find the objects most similar to the pivot.

• In this assignment, we design two MapReduce processes to achieve the search by example.

• We use the above dataset and the following query string to simulate the execution processes. In addition, the output of the pairwise MapReduce-5 is used as input.

Figure 13: Search Query

Figure 14: Output of Pairwise MapReduce-5

3.2.1 List the words of the search query that appear in each document

• In this step, we reorganize the (key, value) pairs in the mapper, using (word) as the key; the value depends on the kind of input record:

– If the record comes from the output of the above MapReduce-5, the value is the list of TF-IDF entries of the term 'word' over all documents.

– Otherwise (the record comes from the search query), the value is 0.

• Then we can find the words of the search query that appear in each document.

• The output uses (word) as the key and (List((documentName@Total TF-IDF, TF-IDF))) as the value. The function is designed as follows:


Figure 15: MapReduce-1 operation
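A minimal local sketch of this job in Python; tagging each input record with its source ('corpus' for MapReduce-5 output, 'query' for query words) is an assumption about how the two kinds of input are told apart:

def map_q1(source, record):
    # Two kinds of records share this mapper:
    #  - 'corpus': a MapReduce-5 output pair (word, [(doc@Total, TF-IDF), ...])
    #  - 'query':  a single word of the search query, flagged with value 0
    if source == "corpus":
        word, doc_list = record
        return [(word, doc_list)]
    return [(record, 0)]

def reduce_q1(word, values):
    # Keep the word only if the query marker 0 appears alongside corpus data.
    doc_lists = [v for v in values if v != 0]
    if 0 in values and doc_lists:
        return [(word, doc_lists[0])]
    return []

print(reduce_q1("map", [0, [("1.txt@0.75", "0.25")]]))
# [('map', [('1.txt@0.75', '0.25')])]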

3.2.2 Calculate the percentage of the total TF-IDF of the search query over the total TF-IDF of each document

• In this step, the mapper takes the (word, List((documentName@Total TF-IDF, TF-IDF))) pairs and emits (documentName@Total TF-IDF, TF-IDF) for each element in the value list.

• Then we calculate the percentage of the total TF-IDF of the search query words over the total TF-IDF of each document as the result.

• The output uses (documentName) as the key and (percentage) as the value. The function is designed as follows:

Map():
Input: (word, List((documentName@Total TF-IDF, TF-IDF)))
Output: (documentName@Total TF-IDF, TF-IDF)

Reduce():
Input: (documentName@Total TF-IDF, TF-IDF)
Output: (documentName, percentage)

Figure 16: MapReduce-2 operation
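A minimal local sketch of this final job in Python (illustrative names, same key encodings as above):

def map_q2(word, doc_list):
    # Flatten the per-word list: one (documentName@Total TF-IDF, TF-IDF)
    # pair per element.
    return [(doc_at_total, float(tfidf)) for doc_at_total, tfidf in doc_list]

def reduce_q2(key, tfidfs):
    # Percentage of the query words' TF-IDF mass over the document total.
    doc, total = key.split("@")
    return [(doc, 100.0 * sum(tfidfs) / float(total))]

print(reduce_q2("1.txt@0.75", [0.25, 0.125]))
# [('1.txt', 50.0)]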

• According to the results, we can see that '1.txt' is the most similar book to the search query among the three books.


• In our experiments we used the dataset provided by https://www.gutenberg.org, which contains approximately 60 thousand books.

• Due to resource limitations, a small dataset is used to demonstrate the above method. The input is a set of 20 text files, each containing the first page of a different book. They contain a summary of the book's content with the necessary keywords for queries.

– Size: 20 files x 120KB (total size: 11MB)
– Extension: txt.utf8
– Key attributes: semantics, meaningful words, subjects, ...

4.2.1 Pairwise Similarity

• We will use this dataset to find the similarity of each pair of books.

Figure 17: Data Set - First Page of 1.txt

• After running the six MapReduce jobs, the results are shown below.

Figure 18: Result of Pairwise Similarity

• According to the results, we can see that '2.txt' and '4.txt' are the most similar books among the 20 books (similarity score = 0.358133).


4.2.2 Search By Example

• We will use the following search query to find the most similar books.

Figure 19: The Search Query

• After running the two MapReduce jobs, the results are shown below.

Figure 20: Result of Search By Example

• According to the results, we can see that '16.txt' is the most similar book to the search query among the 20 books (percentage of total TF-IDF = 9.846140%).
