

Data Engineering

ASSIGNMENT SIMILARITY SEARCH

Lecturer: Dr. Phan Trong Nhan

Nguyen Khoa Gia Cat 1912749

Ho Chi Minh City, May 2022


2.2.1 Information Retrieval Measure TF-IDF

2.2.2 Jaccard similarity coefficient for Multiset

3 The Proposed Scheme

3.1 Pairwise Similarity

3.1.1 Calculate the number of occurrences of each word in each document

3.1.2 Calculate the frequency of words in each document

3.1.3 Calculate the TF-IDF of words in each document

3.1.4 Calculate the total TF-IDF of each document

3.1.5 Find the list of TF-IDF values of each word over all documents

3.1.6 Calculate the similarity score for each pair of books

3.2 Search By Example

3.2.1 List the words of the search query that appear in each document

3.2.2 Calculate the percentage of the total TF-IDF of the search query over the total TF-IDF of each document


1.1 Similarity Search

• Similarity search is the most general term for a variety of techniques that aim to find all records in a database that are sufficiently similar to a query record. This is becoming increasingly important in an age of large information repositories, where similarity search applications need to adhere to much stricter time constraints.

Figure 1: Similarity Search

• Many researchers have addressed the field of similarity search. First, if similarity search is to be performed on vector data, there exist many appropriate similarity measures and index structures. Another popular group of approaches addresses the metric space, which requires the similarity measure to fulfill the metric requirements, in particular the triangle inequality, which is often a handicap for modeling similarity measures. For similarity search on a set of strings, several algorithms for specific similarity measures, such as the edit distance, have been proposed.

• To deal with the challenge of time constraints when implementing these approaches on large-scale datasets, they can be deployed on a parallel paradigm like MapReduce to accelerate computations.

Figure 2: MapReduce Paradigm


1.2 MapReduce

• The MapReduce programming model and runtime environment were first introduced by Jeffrey Dean and Sanjay Ghemawat at Google in 2004.

• The motivation behind the MapReduce system was to handle special-purpose computations on large datasets (e.g., computing inverted indexes from Web content collected via Web crawling; building Web graphs; and extracting statistics from Web logs, such as frequency distribution of search requests by topic, by region, by type of user, etc.).

• MapReduce is a fault-tolerant implementation and runtime environment that scales to thousands of processors. The underlying data model is the key-value pair. The model is built around map and reduce tasks, which allow the infrastructure to parallelize work and execute it on large clusters of commodity hardware.

– Map is a generic function that takes a key of type K1 and a value of type V1, and returns a list of key-value pairs of types K2 and V2.

– Reduce is a generic function that takes a key of type K2 and a list of values of type V2, and returns a list of pairs of type (K3, V3).

Figure 3: Process of MapReduce
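To make these signatures concrete, the following minimal Python sketch (the helper names are my own invention, not the Hadoop API) emulates the map/reduce contract with an in-memory dictionary standing in for the framework's shuffle phase:

from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Toy driver: apply map_fn to each (k1, v1) pair, group the emitted
    (k2, v2) pairs by key (the 'shuffle'), then apply reduce_fn per group."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):      # Map: (K1, V1) -> list of (K2, V2)
            groups[k2].append(v2)
    out = []
    for k2, values in groups.items():
        out.extend(reduce_fn(k2, values))  # Reduce: (K2, [V2]) -> list of (K3, V3)
    return out

# Word count, the canonical MapReduce example:
def wc_map(doc_name, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

print(run_mapreduce(wc_map, wc_reduce, [("d1", "a b a"), ("d2", "b c")]))
# [('a', 2), ('b', 2), ('c', 1)]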

1.3 Hadoop

• Hadoop is an open-source implementation of the MapReduce programming model, developed by Cutting and Cafarella in 2004 as part of a search engine.

• In 2006, Cutting joined Yahoo, where he attempted to improve its search processing based on ideas from the Google File System and the MapReduce programming paradigm. In 2011, Yahoo spun off a Hadoop-centered company.

• The two core components of Hadoop are the MapReduce programming paradigm and HDFS, the Hadoop Distributed File System.

Figure 4: Hadoop with MapReduce


1.4 Hadoop Distributed File System

• The Hadoop Distributed File System (HDFS) is the file system component of Hadoop and is designed to run on a cluster of commodity hardware.

– Patterned after the UNIX file system
– Provides high-throughput access to large datasets
– Stores metadata on a NameNode server
– Stores application data on DataNode servers

• HDFS was designed with the following assumptions and goals:

– Hardware failure is the norm rather than the exception
– Batch processing rather than interactive use
– Metadata is decoupled from data operations
– Replication provides reliability and high availability
– Network traffic is minimized

Figure 5: Architecture of the Hadoop Distributed File System

• The master server is called the NameNode, and the slaves are called DataNodes.

– The NameNode maintains an image of the file system comprising i-nodes and the corresponding block locations. Changes to the file system are recorded in a write-ahead commit log called the Journal.

– Secondary NameNodes perform either a checkpointing role or a backup role.

– DataNodes store blocks in the node's native file system and periodically report their state to the NameNode.

• File I/O operations:

– Single-writer, multiple-reader model
– Files cannot be updated, only appended
– A write pipeline is set up to minimize network utilization

• Block placement:

– Nodes of a Hadoop cluster are typically spread across many racks

• Replica management:

– The NameNode tracks the number of replicas and block locations based on block reports
– A replication priority queue contains the blocks that need to be replicated

• Consider a library that manages thousands of books, where each book can be cataloged under a variety of subjects.

• Assume that we have to rearrange the books based on their contents alone; we then have to find the pairs of similar books to create clusters. However, it is hard to run fast, effective queries on such large datasets, especially text data.

• Often, queries against this dataset have to be answered extremely fast, e.g., to recommend a book title based on keywords, no matter what the catalog is.

• In this assignment, we will use Hadoop with MapReduce to solve the following two problems:

– Problem 1. Given a dataset consisting of N books' files in txt format, find the similarity of each pair of books.

– Problem 2. Given a dataset consisting of N books' files in txt format and a query Q, find the similarity of query Q to each book.

2.2.1 Information Retrieval Measure TF-IDF

• The term count in a given document is simply the number of times a given term appears in that document. For the term $t_i$ within a particular document $d_j$, its term frequency is defined as follows:

$tf_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$

In the formula, $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in document $d_j$, and the denominator is the sum of the numbers of occurrences of all terms in document $d_j$.

• The inverse document frequency is a measure of the general importance of the term. The formula is defined as follows:

$idf_i = \log \dfrac{|D|}{|\{j : t_i \in d_j\}|}$

In the formula, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents in which the term $t_i$ appears (that is, $n_{i,j} \neq 0$).

• The tf-idf weight of a term is the product of its tf and idf. The formula is defined as follows:

$(tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i$
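As a quick numerical check of these formulas, here is a minimal Python sketch over a hypothetical two-document corpus (all names illustrative):

import math
from collections import Counter

docs = {"d1": "cat sat on the mat".split(),
        "d2": "cat ate the fish".split()}

def tf(term, doc):
    counts = Counter(docs[doc])
    return counts[term] / sum(counts.values())      # n_ij / sum_k n_kj

def idf(term):
    d = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / d)                  # log(|D| / |{j : t_i in d_j}|)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("cat", "d1"))  # 0.0 -- 'cat' occurs in every document
print(tfidf("mat", "d1"))  # 0.2 * log(2) ~ 0.1386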


2.2.2 Jaccard similarity coefficient for Multiset

• We can vectorize each book into a multiset and then use the Jaccard similarity coefficient to calculate the similarity score between books.

• The Jaccard similarity coefficient is defined as follows:

If $x = (x_1, x_2, x_3, \ldots, x_n)$ and $y = (y_1, y_2, y_3, \ldots, y_n)$ are two vectors with all real $x_i, y_i \geq 0$, then their Jaccard similarity coefficient is defined as

$J(x, y) = \dfrac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$

In the formula, $x_i$ is the count of element $i$ in the multiset $x$.
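A direct Python rendering of this coefficient, with the two multisets represented as aligned count vectors (a hypothetical helper, not from the assignment code):

def weighted_jaccard(x, y):
    # Jaccard coefficient for multisets:
    # sum_i min(x_i, y_i) / sum_i max(x_i, y_i)
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0

print(weighted_jaccard([1, 2, 0], [1, 1, 3]))  # 2 / 6 = 0.333...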

3 The Proposed Scheme

3.1 Pairwise Similarity

• Pairwise Similarity Search is the case in which we want to find all possible similar pairs. In other words, every object is compared with every other object to compute their similarity.

• In this assignment, we design six MapReduce processes to achieve the pairwise similarity.

• We use the following dataset to simulate the execution processes.

Figure 6: Sample Dataset

3.1.1 Calculate the number of occurrences of each word in each document

• In the mapper, we match words and write (word#documentName, 1) pairs to intermediate values, which will be processed by the reducer.

• Then we calculate the number of occurrences of the word in the document directly in the reducer. The output of the reducer needs to be written to intermediate files, which will be processed by the next MapReduce job.

• The output uses (word#documentName) as the key and (n) as the value, where 'n' is the number of occurrences of the term 'word' in 'documentName'. The function is designed as follows:
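The figure below shows the function design; as a complement, here is a minimal local Python stand-in for the job's logic (not actual Hadoop code; passing the document name to the mapper explicitly is a simplifying assumption about how the input split is exposed):

from collections import defaultdict

def map1(doc_name, line):
    # Emit (word#documentName, 1) for every word in an input line.
    return [(f"{word}#{doc_name}", 1) for word in line.split()]

def reduce1(key, values):
    # key = word#documentName; summing the 1s gives the occurrence count n.
    return [(key, sum(values))]

# Local stand-in for the shuffle phase:
groups = defaultdict(list)
for doc, text in [("1.txt", "map reduce map"), ("2.txt", "map shuffle")]:
    for k, v in map1(doc, text):
        groups[k].append(v)
print([reduce1(k, vs)[0] for k, vs in sorted(groups.items())])
# [('map#1.txt', 2), ('map#2.txt', 1), ('reduce#1.txt', 1), ('shuffle#2.txt', 1)]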

Figure 7: MapReduce-1 operation

3.1.2 Calculate the frequency of words in each document

• In this step, we reorganize the (key, value) pairs in the mapper, using (documentName) as the key and (word=n) as the value.

• Then we calculate the total number of words of each document in the reducer. The output of the reducer needs to be written to intermediate files, which will be processed by the next MapReduce job.

• The output uses (word#documentName) as the key and (n/N) as the value, where 'n' is the number of occurrences of the term 'word' in the document 'documentName' and 'N' is the total number of words in 'documentName'. The function is designed as follows:
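A minimal local sketch of this job's logic in Python (illustrative names; the word=n string encoding follows the key/value format described above):

def map2(key, n):
    # (word#documentName, n) -> (documentName, word=n)
    word, doc = key.split("#")
    return [(doc, f"{word}={n}")]

def reduce2(doc, word_counts):
    # N = total number of words in the document; emit (word#doc, n/N).
    pairs = [wc.split("=") for wc in word_counts]
    N = sum(int(n) for _, n in pairs)
    return [(f"{word}#{doc}", int(n) / N) for word, n in pairs]

print(reduce2("1.txt", ["map=2", "hadoop=1", "reduce=1"]))
# [('map#1.txt', 0.5), ('hadoop#1.txt', 0.25), ('reduce#1.txt', 0.25)]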


3.1.3 Calculate the TF-IDF of words in each document

• In this step, we reorganize the (key, value) pairs in the mapper, using (word) as the key and (documentName#n/N) as the value.

• Then we compute the number 'd', which is the number of documents containing this word, and the number 'D', which is the total number of documents.

• Finally, we calculate the TF-IDF according to the formula TF-IDF = n/N * log(D/d). The function is designed as follows:

Map():
Input: (word#documentName, n/N)
Output: (word, documentName#n/N)

Reduce():
Input: (word, documentName#n/N)
Output: (word#documentName, n/N * log(D/d))

Figure 9: MapReduce-3 operation
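A minimal local sketch of this job in Python; the total document count D must be known to the reducer, and it is assumed here to arrive as a job-wide constant (e.g., via configuration):

import math

D = 3  # total number of documents; assumed to be supplied to the job

def map3(key, tf):
    # (word#documentName, n/N) -> (word, documentName#n/N)
    word, doc = key.split("#")
    return [(word, f"{doc}#{tf}")]

def reduce3(word, doc_tfs):
    # d = number of documents containing this word; TF-IDF = n/N * log(D/d)
    d = len(doc_tfs)
    out = []
    for entry in doc_tfs:
        doc, tf = entry.rsplit("#", 1)
        out.append((f"{word}#{doc}", float(tf) * math.log(D / d)))
    return out

print(reduce3("map", ["1.txt#0.5", "2.txt#0.5"]))
# [('map#1.txt', 0.2027...), ('map#2.txt', 0.2027...)]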

3.1.4 Calculate the total TF-IDF of each document

• In this step, we reorganize the (key, value) pairs in the mapper, using (documentName) as the key and (word#TF-IDF) as the value.

• Then we calculate the total TF-IDF of each document in the reducer. The output of the reducer needs to be written to intermediate files, which will be processed by the next MapReduce job.

• The output uses (word#documentName@Total TF-IDF) as the key and (TF-IDF) as the value, where 'Total TF-IDF' is the total TF-IDF of the document and 'TF-IDF' is the weight of the term 'word' in the document 'documentName'. The function is designed as follows:

Map():
Input: (word#documentName, TF-IDF)
Output: (documentName, word#TF-IDF)

Reduce():
Input: (documentName, word#TF-IDF)
Output: (word#documentName@Total TF-IDF, TF-IDF)
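A minimal local sketch of this job in Python (illustrative names, following the #- and @-separated key encodings above):

def map4(key, tfidf):
    # (word#documentName, TF-IDF) -> (documentName, word#TF-IDF)
    word, doc = key.split("#")
    return [(doc, f"{word}#{tfidf}")]

def reduce4(doc, word_tfidfs):
    # Sum the per-word TF-IDF weights into the document total, then
    # attach that total to every output key.
    pairs = [entry.rsplit("#", 1) for entry in word_tfidfs]
    total = sum(float(t) for _, t in pairs)
    return [(f"{word}#{doc}@{total}", float(t)) for word, t in pairs]

print(reduce4("1.txt", ["map#0.25", "hadoop#0.5"]))
# [('map#1.txt@0.75', 0.25), ('hadoop#1.txt@0.75', 0.5)]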

Figure 10: MapReduce-4 operation

3.1.5 Find the list of TF-IDF values of each word over all documents

• In this step, we reorganize the (key, value) pairs in the mapper, using (word) as the key and (documentName@Total TF-IDF#TF-IDF) as the value.

• Then we find the list of TF-IDF values of each word over all documents in the reducer. The output of the reducer needs to be written to intermediate files, which will be processed by the next MapReduce job.

• The output uses (word) as the key and (List((documentName@Total TF-IDF, TF-IDF))) as the value, where 'Total TF-IDF' is the total TF-IDF of the document and 'TF-IDF' is the weight of the term 'word' in the document 'documentName'. The function is designed as follows:

Map():
Input: (word#documentName@Total TF-IDF, TF-IDF)
Output: (word, documentName@Total TF-IDF#TF-IDF)

Reduce():
Input: (word, documentName@Total TF-IDF#TF-IDF)
Output: (word, List((documentName@Total TF-IDF, TF-IDF)))

Figure 11: MapReduce-5 operation
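A minimal local sketch of this job in Python (illustrative names):

def map5(key, tfidf):
    # (word#documentName@Total TF-IDF, TF-IDF)
    #   -> (word, documentName@Total TF-IDF#TF-IDF)
    word, rest = key.split("#", 1)
    return [(word, f"{rest}#{tfidf}")]

def reduce5(word, entries):
    # Collect, per word, the list of (documentName@Total TF-IDF, TF-IDF) pairs.
    return [(word, [tuple(e.rsplit("#", 1)) for e in entries])]

print(reduce5("map", ["1.txt@0.75#0.25", "2.txt@0.9#0.3"]))
# [('map', [('1.txt@0.75', '0.25'), ('2.txt@0.9', '0.3')])]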


3.1.6 Calculate the similarity score for each pair of books

• In this step, we reorganize the (key, value) pairs in the mapper, using (documentName1@Total TF-IDF 1, documentName2@Total TF-IDF 2) as the key and min(TF-IDF 1, TF-IDF 2) as the value for each pair of entries in the value list.

• Then we calculate the similarity score for each pair of books in the reducer, using the Jaccard similarity coefficient for multisets.

• The output uses (documentName1 documentName2) as the key and (similarity-score) as the value. The function is designed as follows:

Map():
Input: (word, List((documentName@Total TF-IDF, TF-IDF)))
Output: ((documentName1@Total TF-IDF 1, documentName2@Total TF-IDF 2), min(TF-IDF 1, TF-IDF 2))

Reduce():
Input: ((documentName1@Total TF-IDF 1, documentName2@Total TF-IDF 2), min(TF-IDF 1, TF-IDF 2))
Output: (documentName1 documentName2, similarity-score)

Figure 12: MapReduce-6 operation
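A minimal local sketch of this job in Python. How the reducer combines the per-word minima with the two document totals is my reading of the scheme: since max(a, b) = a + b - min(a, b), the multiset Jaccard denominator can be computed as Total1 + Total2 - sum(min):

from itertools import combinations

def map6(word, entries):
    # entries: list of (documentName@Total TF-IDF, TF-IDF) pairs for one word.
    # Emit, for every pair of documents, the min of the two TF-IDF weights.
    out = []
    for (d1, t1), (d2, t2) in combinations(entries, 2):
        out.append(((d1, d2), min(float(t1), float(t2))))
    return out

def reduce6(key, mins):
    # Multiset Jaccard: sum(min) / sum(max),
    # with sum(max) = Total1 + Total2 - sum(min).
    (doc1, total1), (doc2, total2) = (k.split("@") for k in key)
    s = sum(mins)
    score = s / (float(total1) + float(total2) - s)
    return [(f"{doc1} {doc2}", score)]

print(reduce6(("1.txt@0.75", "2.txt@0.9"), [0.25, 0.1]))
# [('1.txt 2.txt', 0.2692...)]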

• According to the results, we can see that '1.txt' and '2.txt' are the most similar books among the three books in the sample dataset.


3.2 Search By Example

• Search by example is a well-known similarity search case in which a pivot object is given as an example for the search. The goal is to find the objects most similar to the pivot.

• In this assignment, we design two MapReduce processes to achieve the search by example.

• We use the above dataset and the following query string to simulate the execution processes. In addition, the output of the pairwise MapReduce-5 is used as input.

Figure 13: Search Query

Figure 14: Output of Pairwise MapReduce-5

3.2.1 List the words of the search query that appear in each document

• In this step, we reorganize the (key, value) pairs in the mapper, using (word) as the key; the value depends on the kind of input record:

– If the record comes from the output of the above MapReduce-5, the value is the list of TF-IDF entries of the term 'word' over all documents.

– Otherwise (the record comes from the search query), the value is 0.

• Then we can find the words of the search query that appear in each document.

• The output uses (word) as the key and (List((documentName@Total TF-IDF, TF-IDF))) as the value. The function is designed as follows:


Figure 15: MapReduce-1 operation
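A minimal local sketch of this job in Python; tagging each input record with its source ('corpus' for MapReduce-5 output, 'query' for query words) is an assumption about how the two kinds of input are told apart:

def map_q1(source, record):
    # Two kinds of records share this mapper:
    #  - 'corpus': a MapReduce-5 output pair (word, [(doc@Total, TF-IDF), ...])
    #  - 'query':  a single word of the search query, flagged with value 0
    if source == "corpus":
        word, doc_list = record
        return [(word, doc_list)]
    return [(record, 0)]

def reduce_q1(word, values):
    # Keep the word only if the query marker 0 appears alongside corpus data.
    doc_lists = [v for v in values if v != 0]
    if 0 in values and doc_lists:
        return [(word, doc_lists[0])]
    return []

print(reduce_q1("map", [0, [("1.txt@0.75", "0.25")]]))
# [('map', [('1.txt@0.75', '0.25')])]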

3.2.2 Calculate the percentage of the total TF-IDF of the search query over the total TF-IDF of each document

• In this step, the mapper takes the (word, List((documentName@Total TF-IDF, TF-IDF))) pairs and emits (documentName@Total TF-IDF, TF-IDF) for each element in the value list.

• Then we calculate the percentage of the total TF-IDF of the search query words over the total TF-IDF of each document as the result.

• The output uses (documentName) as the key and (percentage) as the value. The function is designed as follows:

Map():
Input: (word, List((documentName@Total TF-IDF, TF-IDF)))
Output: (documentName@Total TF-IDF, TF-IDF)

Reduce():
Input: (documentName@Total TF-IDF, TF-IDF)
Output: (documentName, percentage)

Figure 16: MapReduce-2 operation
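A minimal local sketch of this final job in Python (illustrative names, same key encodings as above):

def map_q2(word, doc_list):
    # Flatten the per-word list: one (documentName@Total TF-IDF, TF-IDF)
    # pair per element.
    return [(doc_at_total, float(tfidf)) for doc_at_total, tfidf in doc_list]

def reduce_q2(key, tfidfs):
    # Percentage of the query words' TF-IDF mass over the document total.
    doc, total = key.split("@")
    return [(doc, 100.0 * sum(tfidfs) / float(total))]

print(reduce_q2("1.txt@0.75", [0.25, 0.125]))
# [('1.txt', 50.0)]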

• According to the results, we can see that '1.txt' is the most similar book to the search query among the three books.


• In our experiments we used the dataset provided by https://www.gutenberg.org, which contains approximately 60 thousand books.

• Due to resource limitations, a small dataset is used to demonstrate the above method. The input is a set of 20 text files, each containing the first page of a different book. They contain a summary of the book's content with the necessary keywords for queries.

– Size: 20 files x 120KB (total size: 11MB)
– Extension: txt.utf8
– Key attributes: semantics, meaningful words, subjects, ...

4.2.1 Pairwise Similarity

• We will use this dataset to find the similarity of each pair of books.

Figure 17: Data Set - First Page of 1.txt

• After running the six MapReduce jobs, the results are shown below.

Figure 18: Result of Pairwise Similarity

• According to the results, we can see that '2.txt' and '4.txt' are the most similar books among the 20 books (similarity score = 0.358133).


4.2.2 Search By Example

• We will use the following search query to find the most similar books.

Figure 19: The Search Query

• After running the two MapReduce jobs, the results are shown below.

Figure 20: Result of Search By Example

• According to the results, we can see that '16.txt' is the most similar book to the search query among the 20 books (percentage of total TF-IDF = 9.846140%).
