FUTURE DIRECTIONS FOR THE PROJECT

We believe this product can be improved further toward a deeper understanding of the Vietnamese language and of queries written by Vietnamese users, so that it can answer user queries accurately.

For example, when a user searches for “Trịnh Công Sơn”, the program should recognize this as the name of a composer rather than as a song whose text happens to contain the words “Trịnh Công Sơn”.

Going a step further, a user who enters the query “nhạc Trịnh” (“Trịnh music”) wants music composed by Trịnh Công Sơn, or his best-known songs. If a Vietnamese WordNet were available in which the relatedness between these two keywords is very high, one keyword could be inferred from the other.
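The inference described here could be sketched as a simple query-expansion step. This is only an illustration: the `RELATED` table, its score of 0.9, and the 0.8 threshold are hypothetical stand-ins for what a Vietnamese WordNet might provide, not data from this project.

```python
# Hypothetical relatedness table; a real system would look this up in a
# Vietnamese WordNet. The score and the threshold are illustrative only.
RELATED = {
    "nhạc Trịnh": [("Trịnh Công Sơn", 0.9)],
}

def expand_query(query, related=RELATED, threshold=0.8):
    """Return the query followed by every keyword related strongly enough to it."""
    terms = [query]
    for term, score in related.get(query, []):
        if score >= threshold:
            terms.append(term)
    return terms

print(expand_query("nhạc Trịnh"))
```

The expanded list can then be passed to the search component as alternative keywords for the same query.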

When a user types “Trịnh Côn Sơn”, the program should likewise recognize the misspelling and still run the query for the name “Trịnh Công Sơn”.
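One common way to handle such typos is to compare the query with a list of known names using edit distance. The sketch below assumes a small in-memory list of names and a tolerance of two edits; both are illustrative choices, not part of the existing program.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def correct(query, known_names, max_distance=2):
    """Return the closest known name, or the query itself if nothing is close enough."""
    best = min(known_names,
               key=lambda name: edit_distance(query.lower(), name.lower()),
               default=query)
    if edit_distance(query.lower(), best.lower()) <= max_distance:
        return best
    return query

print(correct("Trịnh Côn Sơn", ["Trịnh Công Sơn", "Văn Cao"]))
```

Comparing against every known name is only practical for small lists; a production system would need an index over the dictionary.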

To achieve results of this kind, we believe considerably more time must be invested in researching the Vietnamese-language modules and the ranking of results.


APPENDICES

APPENDIX A. GOOGLE ARCHITECTURE

Source: http://seogurudelhi.blogspot.com/

The following figure gives a picture of Google's high-level architecture.

Figure 24. Google architecture.

Downloading and indexing Web pages is carried out by many distributed crawlers. A few URLservers pass lists of URLs to the crawlers. Downloaded Web pages are handed to the storeserver, which compresses them and stores them in the repository. Every Web page has an identifier called a docID, which is assigned whenever a new URL is parsed out of a downloaded Web page.

Indexing is performed by the Indexer and the Sorter. The Indexer reads the repository, decompresses the documents, and parses them. The words are separated out and stored in barrels. The Indexer also parses the information attached to each hyperlink on a Web page and saves it (as anchor information) in the anchors file. This file holds enough information to tell where each link points and which text appears on the Web page for that link.

The URL_Resolver reads the information in the anchors file, converts it into actual URLs, and associates these with docIDs based on the URLs already known. It also builds a database of links, which is used to compute the popularity of a Web page.

The Sorter re-sorts the barrels by wordID instead of by docID to produce the inverted index. A program called DumpLexicon takes the list of words and updates the Lexicon (dictionary).
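The re-sorting step can be illustrated with a small sketch. It assumes a barrel is simply a list of (docID, wordID) pairs held in memory, which is a simplification made for this example; the function name `invert` is also hypothetical.

```python
from collections import defaultdict

def invert(forward_postings):
    """Turn (doc_id, word_id) pairs ordered by doc_id into an inverted index.

    Returns a dict mapping each word_id to the sorted list of doc_ids
    in which that word occurs.
    """
    # Re-sort by word_id first (then doc_id), as the Sorter does with barrels.
    by_word = sorted(forward_postings, key=lambda pair: (pair[1], pair[0]))
    inverted = defaultdict(list)
    for doc_id, word_id in by_word:
        postings = inverted[word_id]
        if not postings or postings[-1] != doc_id:  # skip repeated occurrences
            postings.append(doc_id)
    return dict(inverted)

forward = [(1, 10), (1, 20), (2, 10), (3, 20), (3, 30)]
print(invert(forward))
```

With the index keyed by wordID, answering a query only requires looking up the query words rather than scanning every document.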

To answer a user query, Google uses the Lexicon, the inverted index, and PageRanks.

APPENDIX B. SEARCH ENGINE CONCEPTS

Source: http://www.cadenza.org/Search_engine_terms/srchad.htm

Adjacency

A property of the relationship between words in a Search Engine (or directory) query. Search engines often allow users to specify that words should be next to one another or somewhere near one another in the Web pages searched.

ArchitextSpider

The name of the Excite Search engine's spider.

Cloaking

The hiding of page content. Normally carried out to stop page thieves stealing optimized pages.

Clustering

The listing of only one page from each Web site in a Search Engine or directory's list of Search results. This avoids occupation of all the top results by a small number of Web sites and makes the list of results clearer and more useful to the user.

Crawler

See Spider.

DeadLink

An Internet link which doesn't lead to a page or site, probably because the server is down or the page has moved or no longer exists. Most Search engines have techniques for removing such pages from their listings automatically, but as the Internet continues to increase in size, it becomes more and more difficult for a Search Engine to check all the pages in the index regularly. Reporting of dead links helps to keep the indexes clean and accurate, and this can usually be done by submitting the dead link to the Search engine.

Directory

A server or a collection of servers dedicated to indexing Internet Web pages and returning lists of pages which match particular queries. Directories (also known as Indexes) are normally compiled manually, by user submission (such as at whatsnew.com), and often involve an editorial selection and/or categorization process (such as at LookSmart and Yahoo).

Domain

A sub-set of Internet addresses. Domains are hierarchical, and lower-level domains often refer to particular Web sites within a top-level domain. The most significant part of the address comes at the end - typical top-level domains are .com, .edu, .gov, .org (which sub-divide addresses into areas of use). There are also various geographic top-level domains (e.g. .ar, .ca, .fr, .ro etc.) referring to particular countries.

Heading

Many Search engines give extra weight and importance to the text found inside HTML heading sections. It is generally considered good advice to use headings when designing Web pages and to place keywords inside headings.

Hidden Text

Text on a Web page which is visible to Search Engine spiders but not visible to human visitors. This is sometimes because the text has been set the same colour as the background, because multiple TITLE tags have been used or because the text is an HTML comment. Hidden text is often used for spamdexing. Many Search engines can now detect the use of hidden text, and often remove offending pages from their database or lower such pages' positioning.

Hit

In the context of visitors to Web pages, a hit (or site hit) is a single access request made to the server for either a text file or a graphic. If, for example, a Web page contains ten buttons constructed from separate images, a single visit from someone using a Web browser with graphics switched on (a "page view") will involve eleven hits on the server. (Often the accesses will not get as far as your server because the page will have been cached by a local Internet service provider).

In the context of a Search Engine query, a hit is a measure of the number of Web pages matching a query returned by a Search Engine or directory.

HTML

HyperText Markup Language - the (main) language used to write Web pages.

HTTP

HyperText Transfer Protocol - the (main) protocol used to communicate between Web servers and Web browsers (clients).

Inbound Link

A hypertext link to a particular page from elsewhere, bringing traffic to that page. Inbound links are counted to produce a measure of the page popularity.

Index

See Directory. Also refers to the database of Web pages maintained by a Search Engine or directory.

Keyword

A word which forms (part of) a Search Engine query.

Keyword Density

A property of the text in a Web page which indicates how close together the keywords appear. Some Search engines use this property for Positioning. Analysers are available which allow comparisons between pages. Pages can then be produced with keyword densities similar to those found in high ranking pages.

Keyword Domain Name

The use of keywords as part of the URL to a Website. Positioning is improved on some Search engines when keywords are reinforced in the URL.

Keyword Phrase

A phrase which forms (part of) a Search Engine query.

Keyword Purchasing

The buying of Search keywords from Search engines, usually to control banner ad placement. All the major Search engines (except EuroSeek and GoTo) insist that keyword purchasing is only used for banner ad placement, and doesn't influence Search results. The display of banner ads for bought keywords can be studied using a service called Bannerstake from Thomson and Thomson at http://www.namestake.com, which returns the banner ads displayed when particular queries are used.

Keyword Stuffing

The repeating of keywords and keyword phrases in META tags or elsewhere.

Meta Search

A Search of Searches. A query is submitted to more than one Search Engine or directory, and results are reported from all the engines, possibly after removal of duplicates and sorting. Also the meta Search engine of the same name, found at http://www.metaSearch.com.

Meta Search Engine

A server which passes queries on to many Search engines and/or directories and then summarises all the results. Ask Jeeves, Dogpile, Infind, Metacrawler, Metafind and MetaSearch are examples of meta Search engines.

Meta Tag

A construct placed in the HTML header of a Web page, providing information which is not visible to browsers. The most common meta tags (and those most relevant to Search engines) are KEYWORDS and DESCRIPTION.

Page Popularity

A measure of the number and quality of links to a particular page (inbound links).

Portal

See Gateway page. Can also mean Portal Site.

Portal Page

See Gateway page.

Portal Site

A generic term for any site which provides an entry point to the Internet for a significant number of users.

Positioning

The process of ordering Web sites or Web pages by a Search Engine or a directory so that the most relevant sites appear first in the Search results for a particular query. Software such as PositionAgent, Rank This and Webposition can be used to determine how a URL is positioned for a particular Search Engine when using a particular Search phrase. The GoHip Search site allows you to see positioning information from many of the big Search engines, displayed all on one page.

Positioning Technique

A method of modifying a Web page so that Search engines (or a particular Search engine) treat the page as more relevant to a particular query (or a set of queries).

Query

A word, a phrase or a group of words, possibly combined with other syntax used to pass instructions to a Search Engine or a directory in order to locate Web pages.

Ranking

Robot

Any browser program which follows hypertext links and accesses Web pages but is not directly under human control. Examples are the Search Engine spiders, the "harvesting" programs which extract e-mail addresses and other data from Web pages and various intelligent Web Searching programs. A database of Web robots is maintained by Webcrawler.

robots.txt

A text file stored in the top level directory of a Web site to deny access by robots to certain pages or sub-directories of the site. Only robots which comply with the Robots Exclusion Standard will read and obey the commands in this file. Robots will read this file on each visit, so that pages or areas of sites can be made public or private at any time by changing the content of robots.txt before re-submitting to the Search engines. The simple example below attempts to prevent all robots from visiting the /secret directory:
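A file of that kind can be checked with Python's standard `urllib.robotparser` module. The two-line robots.txt content, the host example.com, and the helper name `allowed` are assumptions made for this sketch and are not taken from the source glossary.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: every robot is asked to stay out of /secret.
ROBOTS_TXT = """\
User-agent: *
Disallow: /secret
"""

def allowed(url, user_agent="*"):
    """Return True if a compliant robot may fetch the given URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed("http://example.com/secret/plan.html"))
print(allowed("http://example.com/index.html"))
```

As the definition notes, this only restrains robots that choose to honour the file; it is not an access-control mechanism.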

APPENDIX C. THE HEAPSORT ALGORITHM FOR SEARCH

Source: http://vi.wikipedia.org/wiki/S%E1%BA%AFp_x%E1%BA%BFp_vun_%C4%91%E1%BB%91ng

Heap

Any array a[1..n] can be viewed as a nearly complete binary tree (whose node values are the array values) with the root at the first element. The left child of node a[i] is a[2*i] and the right child is a[2*i+1], provided that 2*i <= n and 2*i+1 <= n respectively. (If the array starts at index 0, the two children are a[2*i+1] and a[2*i+2].) Elements with an index greater than Int(n/2) have no children and are therefore leaves.

For example, the array (45, 23, 35, 13, 15, 12, 15, 7, 9) is a heap.

A binary tree is called a max-heap if the key of every node is no smaller than the keys of its children. When an array a[] is represented as a binary tree in the natural order, this means a[i] >= a[2*i] and a[i] >= a[2*i+1] for every i = 1..Int(n/2), whenever the child index does not exceed n. We also call such an array a heap. In a heap, a[1] (the root of the tree) is therefore the largest element. Any array with a single element is always a heap.

A min-heap is defined by the reverse inequalities: a[i] <= a[2*i] and a[i] <= a[2*i+1]. The element at the root of a min-heap is the smallest element.
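The max-heap condition can be checked directly. The sketch below uses Python's 0-based lists, so the children of index i are 2*i+1 and 2*i+2 (the 0-based variant of the indexing described above); the function name `is_max_heap` is chosen for this example.

```python
def is_max_heap(a):
    """Return True if the 0-based list a satisfies the max-heap property."""
    n = len(a)
    return all(
        a[i] >= a[child]
        for i in range(n // 2)               # only non-leaf nodes have children
        for child in (2 * i + 1, 2 * i + 2)
        if child < n                         # the last parent may lack a right child
    )

print(is_max_heap([45, 23, 35, 13, 15, 12, 15, 7, 9]))
print(is_max_heap([2, 3, 5, 6, 4, 1, 7]))
```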

Heapifying

Rearranging the elements of an array so that it becomes a heap is called heapifying.

If the two subtrees rooted at 2 * i and 2 * i + 1 are already heaps, then to make the subtree rooted at i a heap it suffices to compare a[i] with the larger of a[2 * i] and a[2 * i + 1] and swap the two if a[i] is smaller. After a swap, continue comparing the moved element with the larger of its two new children, until it is no smaller than both of them or it reaches a leaf. (This is the DownHeap procedure in the pseudocode below.)

Building a heap from an array

To turn the array a[1..n] into a heap, heapify from the bottom up, starting at element a[j] with j = Int(n/2) and working back to a[1]. (This is the MakeHeap procedure in the pseudocode below.)

Sorting by heapifying

Swap: once the array a[1..n] is a heap, remove the element a[1] at the top of the heap and place it in the last position n, moving the former last element a[n] to the top. The element now at position n is in its final position.

Re-heapify: the remaining part a[1..n-1] differs from a heap only at element a[1]. Re-heapify it as a heap of n-1 elements.

Repeat: continue in the same way with the array a[1..n-1]. The process stops when the heap has only one element left.

Example

Let a = (2, 3, 5, 6, 4, 1, 7), so n = 7. The elements a[4] through a[7] are leaves.

Heapifying

Heapifying the subtree rooted at a[3] gives a = (2, 3, 7, 6, 4, 1, 5).

Heapifying the subtree rooted at a[2] gives a = (2, 6, 7, 3, 4, 1, 5).

Heapifying the subtree rooted at a[1] gives a = (7, 6, 5, 3, 4, 1, 2).

Now a = (7, 6, 5, 3, 4, 1, 2) is a heap.

Sorting

Swap a[1] with a[7]: a = (2, 6, 5, 3, 4, 1, 7); re-heapifying a[1..6] gives a = (6, 4, 5, 3, 2, 1, 7).

Swap a[1] with a[6]: a = (1, 4, 5, 3, 2, 6, 7); re-heapifying a[1..5] gives a = (5, 4, 1, 3, 2, 6, 7).

Swap a[1] with a[5]: a = (2, 4, 1, 3, 5, 6, 7); re-heapifying a[1..4] gives a = (4, 3, 1, 2, 5, 6, 7).

Swap a[1] with a[4]: a = (2, 3, 1, 4, 5, 6, 7); re-heapifying a[1..3] gives a = (3, 2, 1, 4, 5, 6, 7).

Swap a[1] with a[3]: a = (1, 2, 3, 4, 5, 6, 7); re-heapifying a[1..2] gives a = (2, 1, 3, 4, 5, 6, 7).

Swap a[1] with a[2]: a = (1, 2, 3, 4, 5, 6, 7). Only one element remains in the heap, so the sort is complete.

Pseudocode

function heapSort(a[1..count], count) {
    var int end := count
    MakeHeap(a, count)
    while end > 0 {
        swap(a[end], a[1])
        end := end - 1
        DownHeap(a, 1, end)
    }
}

function MakeHeap(a, count) {
    var int start := Int(count/2)
    while start > 0 {
        DownHeap(a, start, count)
        start := start - 1
    }
}

function DownHeap(a, start, count) {
    var int i := start, j
    while i * 2 <= count {
        j := i * 2
        if j + 1 <= count and a[j] < a[j + 1]
            j := j + 1
        if a[i] < a[j] {
            swap(a[i], a[j])
            i := j
        } else
            return
    }
}
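For reference, the same three procedures can be written as runnable Python. This is a sketch rather than the project's own code: it uses 0-based lists, so the children of index i are 2*i+1 and 2*i+2, and the snake_case names are chosen for this example.

```python
def down_heap(a, start, count):
    """Sift a[start] down until the subtree within a[0:count] is a max-heap."""
    i = start
    while 2 * i + 1 < count:
        j = 2 * i + 1                        # left child
        if j + 1 < count and a[j] < a[j + 1]:
            j += 1                           # right child is larger
        if a[i] >= a[j]:
            return                           # heap property already holds
        a[i], a[j] = a[j], a[i]
        i = j

def make_heap(a):
    """Heapify bottom-up, from the last parent back to the root."""
    for start in range(len(a) // 2 - 1, -1, -1):
        down_heap(a, start, len(a))

def heap_sort(a):
    """Sort the list a in place in ascending order and return it."""
    make_heap(a)
    for end in range(len(a) - 1, 0, -1):
        a[0], a[end] = a[end], a[0]          # move the current maximum into place
        down_heap(a, 0, end)                 # restore the heap on a[0:end]
    return a

print(heap_sort([2, 3, 5, 6, 4, 1, 7]))
```

Calling make_heap on the example array (2, 3, 5, 6, 4, 1, 7) yields the heap (7, 6, 5, 3, 4, 1, 2) from the worked example.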

APPENDIX D. VIETNAMESE CHARACTER ENCODING TABLE

Table 16. Vietnamese letters encoded in different character sets (byte values in hexadecimal)

Letter   Unicode   VNI     VPS   VISCII   TCVN3   VIQR
À        U+00C0    41 D8   80    C0       41 B5   A`
Á        U+00C1    41 D9   C1    C1       41 B8   A'
Â        U+00C2    41 C2   C2    C2       A2      A^
Ã        U+00C3    41 D5   82    C3       41 B7   A~
È        U+00C8    45 D8   D7    C8       45 CC   E`
É        U+00C9    45 D9   C9    C9       45 D0   E'
Ê        U+00CA    45 C2   CA    CA       A3      E^
Ì        U+00CC    CC      B5    CC       49 D7   I`
Í        U+00CD    CD      B4    CD       49 DD   I'
Ò        U+00D2    4F D8   BC    D2       4F DF   O`
Ó        U+00D3    4F D9   B9    D3       4F E3   O'
Ô        U+00D4    4F C2   D4    D4       A4      O^
Õ        U+00D5    4F D5   BE    A0       4F E2   O~
Ù        U+00D9    55 D8   A8    D9       55 EF   U`
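A table like this can drive a simple converter. The sketch below maps only the VIQR sequences listed above to their Unicode code points; the function name `viqr_to_unicode` is chosen for this example, and a complete converter would need the full alphabet as well as a way to tell ordinary apostrophes from tone marks.

```python
# VIQR sequence -> Unicode character, limited to the rows of the table above.
VIQR_TO_UNICODE = {
    "A`": "\u00C0", "A'": "\u00C1", "A^": "\u00C2", "A~": "\u00C3",
    "E`": "\u00C8", "E'": "\u00C9", "E^": "\u00CA",
    "I`": "\u00CC", "I'": "\u00CD",
    "O`": "\u00D2", "O'": "\u00D3", "O^": "\u00D4", "O~": "\u00D5",
    "U`": "\u00D9",
}

def viqr_to_unicode(text):
    """Replace each known two-character VIQR sequence with its Unicode letter."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in VIQR_TO_UNICODE:
            out.append(VIQR_TO_UNICODE[pair])
            i += 2
        else:
            out.append(text[i])              # pass other characters through unchanged
            i += 1
    return "".join(out)

print(viqr_to_unicode("CA' THU"))
```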