FULL TEXT OF ORAL PRESENTATIONS
Subcommittee: INFORMATION TECHNOLOGY

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF SCIENCE
ISBN: 978-604-82-1375-6
CONFERENCE PROCEEDINGS FULL TEXT
Ho Chi Minh City, 21/11/2014
www.hcmus.edu.vn

Full-text proceedings of the 9th Scientific Conference, University of Science, VNU-HCM

VII-O-1
A SHALLOW APPROACH FOR QUERYING GRAPH DATABASE
Dương Quang Hưng, Nguyễn Minh Nhựt, Nguyễn Trần Minh Thư, Bùi Đắc Thịnh
Information System Department, University of Science, Ho Chi Minh City

ABSTRACT
The rapid growth of information system applications, driven by diverse human demands, has made data storage an essential requirement. NoSQL databases, the most common alternative to the traditional data models used to store structured data, are applied to improve the performance of systems with scalable databases. Among them, graph databases are responsible for storing and querying data made of nodes and links, which can be regarded as large-scale data. In this paper, we analyze the pros and cons of graph databases in comparison with traditional data models, and we build an experimental scenario to evaluate the time efficiency of queries. The evaluation on real data crawled from an operating information system shows why choosing a graph database is a justifiable decision for scalable data.

Keywords: NoSQL; graph database; graph model; scalable data model

INTRODUCTION
Relational databases have been around for many decades and are the preferred database technology for most traditional data storage and retrieval applications [8]. They typically use SQL, a declarative query language, to exploit such databases. According to many analyses, relational databases are generally efficient when the data does not contain many relationships; heavily related data requires join operations between large tables, which cost a great deal of time. Although there have been other approaches, such as XML or object
databases, they have all been absorbed by most relational database management systems (RDBMSs) [1,2,9].

Recently, there have been many shifts in data stores, known as the NoSQL movement, driven by the challenge of reading and writing big data with high performance as the Internet and cloud computing developed [1,9]. NoSQL still has many definitions that try to capture its core themes. In [9], the authors define NoSQL as a set of concepts that allows rapid and efficient processing of data sets with a focus on performance, reliability, and agility. The most important difference between NoSQL and SQL is that NoSQL is free of joins and schemas: it allows data to be created without an entity model and extracted without joins, which are considered the most time-consuming operation. Unlike relational databases, NoSQL uses a diversity of data store types, from simple key-value stores, to column-family stores (an extension of columns in relational databases), to graph stores used to represent relationships, to document stores used for variable data [2,9].

Among them, the graph database is the most appropriate solution for problems with dense relationships. As a system of nodes and relationships, a graph store is used for typical problems such as social networks, fraud detection, or relationship-heavy data, where graphs are one of the most useful structures for modeling objects and links [1,2,5,8,9]. In a graph store, any two nodes can be linked by relationships, and both nodes and relationships have their own properties, stored in key-value fields [9].

In this paper, we present a shallow approach to querying a graph data store on real data crawled from an operating information system. In an initial experiment, we evaluate the time efficiency of common and advanced queries on two database management systems, one representing relational databases and one representing NoSQL databases. In addition, we also deploy an information system using a graph database to
demonstrate the feasibility of our application and data.

The remainder of this paper is organized as follows. First, Section II presents related work on NoSQL and on graph data stores in particular. Next, we describe our approach to querying data in a graph database. Then, Section IV presents experiments on the crawled real data. Finally, the conclusion is presented in Section V.

RELATED WORK
There have been many studies investigating alternative storage systems to relational databases, and NoSQL is, in some ways, the blanket term for them. Under this term, many projects such as Cassandra, BigTable, CouchDB, Voldemort, and Dynamo have been presented and are increasingly widely used [1,8]. BigTable [12] is a large-scale, fast, distributed database system created and used by Google. Cassandra is an open-source, distributed key-value data store developed by Facebook [13], while Project Voldemort is LinkedIn's large-scale, persistent hash table, which works in a distributed environment and is designed mainly to handle failures [14].

More recently, newer projects such as Redis [11] are suitable for providing high-performance computing on small amounts of data, but not for big data storage. MongoDB can also be included as a hybrid between relational and non-relational databases. MongoDB supports a range of complex data types with a powerful query language that covers most single-table query functions, together with effective indexing, which according to [10] can make access 10 times faster than MySQL. While [1] points out that the main features distinguishing NoSQL from relational databases fall into four aspects (concurrent reading and updating, support for mass storage and access requirements, easy scalability, and low cost) [1,8,9], the author of [7] claims that there are just two possible reasons to move from relational databases to NoSQL: performance and
flexibility. These judgements are fairly accurate for business problems with massively complex relationships between objects, such as social networking, rules-based engines, and mashups. In these cases, a graph system is the most suitable choice for quickly analyzing complex network structures, even for mining patterns [8,9]. A graph store represents any complex network problem as a graph that contains nodes (vertices), relationships (edges), and their properties. A relationship can be thought of as the connection between objects in the real world [9]. The authors of [9] also point out that queries in graph data stores are similar to traversing nodes in an ordinary graph: what is the shortest path between two nodes, which nodes have nearest neighbors with given properties, and so on.

Although graph data stores can address these problems, there are still few experiments that compare graph data stores with relational databases. In [1], the authors only give some options for considering which properties NoSQL is well adapted to. The authors of [8] obtained results on specific aspects: they designed experiments comparing MySQL [16], a representative relational database, with Neo4j [1, 3, 4], a representative NoSQL database. The experiments, based on a predefined set of queries, evaluated processing speed on both data stores. However, their data consists of random characters (8K or 32K) and is not real-world data.

Compared with previous work, our work makes the following contributions to the assessment of the NoSQL movement:
- We evaluate time efficiency and compare a relational database management system with a graph data store system.
- The evaluation is performed on real-world data, crawled from an operating information system, using meaningful graph queries.
- We build and deploy another information system using a graph data store and graph queries to illustrate the feasibility of using a graph data store in practice.

To our knowledge, this is one of the first
works that exploit real-world data to compare the performance of a relational database and a NoSQL database, in particular MSSQL Server [17] and the Neo4j graph database [1, 3, 4].

A SHALLOW APPROACH FOR QUERYING DATA
To compare time efficiency, we evaluated specific database management systems, one relational and one graph-based. Based on related work and our technical knowledge, we chose MSSQL Server to represent relational databases and Neo4j to represent graph (NoSQL) databases. The process for benchmarking the two systems is as follows. First, we build a crawler to collect real-world data from Foody (http://www.foody.vn), a system with more than one million users. The data describes a social network in which food courts are the nuclei. Then, based on our knowledge of the data schema, we create two schemas, one for each database management system (MSSQL Server and Neo4j). Next, the crawled data goes through ETL processing [15] before being loaded into complete databases. The experiments are run on these databases with the same predefined, meaningful queries, which are suitable and essential for real applications.

We briefly describe the objects in Table 1.

Table 1. Object description in data
Object name (vi-VN)          Object name (en-US)        Description
THANHVIEN                    USER                       Member's information
DIADIEMANUONG                FOODCOURT                  Food court information
BANBE                        FRIEND                     Information about a member's friends
BINHLUAN                     COMMENT                    Member's comment on a venue
DANHGIA                      RATING                     Member's rating of a venue
DIADIEM_MONAN                FOODCOURT_FOOD             Relationship between COURT and FOOD
MONDACTRUNG                  TYPICALFOOD                Food information, corresponding to some courts
THANHVIEN_CHECKIN_DIADIEM    USER_CHECKIN_FOODCOURT     Relationship between USER and COURT for the "Check-in" action
THANHVIEN_LIKE_DIADIEM       USER_LIKE_FOODCOURT        Relationship between USER and COURT for the "Like" action

EXPERIMENT
To evaluate the time efficiency of queries on the two database management systems, we predefined three queries that correspond to existing problems on densely related network data. The queries are presented in Table 2. Running time was measured on a machine with a 2.1 GHz Core i3 CPU and 2 GB of RAM. Execution time is reported in milliseconds (ms).

Table 2. Experiment queries on food court data
ID   Query
1    Finding friends of friends at various depth levels
2    Browsing food courts that friends have checked in to, liked, commented on, or rated, with given properties
3    Suggesting food courts that follow a pattern (users who came to X then came to Y)

Data
The experiments use the full data set described above. Table 3 lists the number of records for each object. The same data is loaded into MSSQL Server and Neo4j.

Table 3. Data records used for the experiment
Object name (vi-VN)          Object name (en-US)      Number of records
THANHVIEN                    USER                     41881
DIADIEMANUONG                VENUE                    16633
BANBE                        FRIEND                   86746
BINHLUAN                     COMMENT                  28148
DANHGIA                      RATING                   14345
DIADIEM_MONAN                VENUE_CUISINE            17571
MONDACTRUNG                  TYPICALCUISINE           14045
THANHVIEN_CHECKIN_DIADIEM    USER_CHECKIN_VENUE       3886
THANHVIEN_LIKE_DIADIEM       USER_LIKE_VENUE          13980

Query
As presented above, the experiments evaluate the two systems on three queries in turn. Each of the following subsections describes the purpose of one query and analyzes its execution time (ms) in detail.

i. Query 1: Finding friends of friends at various depth levels
This query finds friends, along with their properties, for a given user and depth level. For example, suppose the current user's name is Nam. The query finds all friends of Nam up to the given depth level; if the depth level equals 2, the query finds not only the friends of Nam but also all friends of those friends. The query's
experimental result on the two database systems is presented in Figure 1. The time cost of this query in Neo4j tends to stay stable as the depth level increases, while in MSSQL Server it increases rapidly at higher depth levels. The results also show that the time cost in Neo4j grows with the depth level, but only slightly compared with MSSQL Server.

Figure 1. First query's experimental result: execution time (ms) versus depth level for RDBMS and Graph Datastore.

Figure 2. First query's experiment on Neo4j (Graph Datastore) at high depth levels: execution time (ms) versus depth level.

ii. Query 2: Browsing food courts that friends have checked in to, liked, commented on, or rated, with given properties
This query lists all food courts related to the current user's friends (through check-in, like, comment, or rating) with several given properties. The result is presented in Figure 3, where the time cost is measured over repeated executions. Besides being about 20 times faster than MSSQL Server on this query, Neo4j also shows stable execution times.

Figure 3. Second query's experimental result: execution time (ms) over the 1st to 5th executions for RDBMS and Graph Datastore.

iii. Query 3: Suggesting food courts that follow a pattern (users who came to X then came to Y)
In the real world, people often want suggestions before making a decision. We assume that when user A has visited food court X, user A tends to visit food court Y, and so on. With a large data set, such patterns emerge, and this query is used to suggest them to users. The properties of the "next" food court are listed as well. In this case, we explore how the time cost increases for each database system when more
criteria (the check-in, like, and comment actions) are included. The result is presented in Figure 4 and Figure 5. When the query includes more criteria, the time cost naturally increases on both database systems, but Neo4j's stability is clearly evident.

Figure 4. Third query's experimental result: execution time (ms) versus number of criteria for RDBMS and Graph Datastore.

Figure 5. Third query's experimental result on Neo4j (Graph Datastore): execution time (ms) versus number of criteria.

Application development
Based on the characteristics of the crawled data and several functional and non-functional requirements, we developed an information system application that uses ASP.NET MVC [18] and Neo4j Community Server [3, 4], aiming to show the feasibility of this approach to storing and querying large-scale data. The application is deployed as a website with the same purpose as Foody, but it focuses on advanced queries that take advantage of the graph data store.

CONCLUSION
In this paper, we presented a shallow approach to querying data in a graph database and compared it with a relational database. We also described the advantages and disadvantages of the Neo4j graph database in comparison with MSSQL Server as a case study. A graph database fits scalable data that can be represented as nodes and links between them. Experiments show that a graph database is considerably more effective than a relational database when queries are complex and require join operations between objects. As a drawback, for simple queries or sparsely related data, a relational database still performs well compared with a graph database. A graph database is therefore best suited to large-scale, densely related data. One reason is that a relational database imposes many constraints on
data, which are not considered important at run time in a graph data store. However, our research still has limitations, such as the specific SQL and NoSQL systems used, here MSSQL Server and Neo4j. To get an objective view, the comparison should be extended to a wider set of systems on the crawled real data that we have already prepared. Moreover, the application we built should be deployed in practice to gather feedback as the data scales.

REFERENCES
[1] Han, Jing, et al. "Survey on NoSQL database." Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on. IEEE, 2011.
[2] Robinson, Ian, Jim Webber, and Emil Eifrem. Graph Databases. O'Reilly Media, Inc., 2013.
[3] Miller, Justin J. "Graph Database Applications and Concepts with Neo4j." (2013).
[4] Partner, Jonas, Aleksa Vukotic, and Nicki Watt. Neo4j in Action. O'Reilly Media, 2013.
[5] Neubauer, Peter. "Graph databases, NOSQL and Neo4j." (2010).
[6] Holzschuher, Florian, and René Peinl. "Performance of graph query languages: comparison of Cypher, Gremlin and native access in Neo4j." Proceedings of the Joint EDBT/ICDT 2013 Workshops. ACM, 2013.
[7] Stonebraker, Michael. "SQL databases v. NoSQL databases." Communications of the ACM 53.4 (2010): 10-11.
[8] Vicknair, Chad, et al. "A comparison of a graph database and a relational database: a data provenance perspective." Proceedings of the 48th Annual Southeast Regional Conference. ACM, 2010.
[9] McCreary, Dan, and Ann Kelly. Making Sense of NoSQL. Greenwich, Conn.: Manning Publications, 2013.
[10] Banker, Kyle. MongoDB in Action. Manning Publications Co., 2011.
[11] Carlson, Josiah L. Redis in Action. Manning Publications Co., 2013.
[12] Chang, Fay, et al. "Bigtable: A distributed storage system for structured data."
ACM Transactions on Computer Systems (TOCS) 26.2 (2008).
[13] Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.
[14] Sumbaly, Roshan, et al. "Serving large-scale batch computed data with Project Voldemort." Proceedings of the 10th USENIX Conference on File and Storage Technologies. USENIX Association, 2012.
[15] Karakasidis, Alexandros, Panos Vassiliadis, and Evaggelia Pitoura. "ETL queues for active data warehousing." Proceedings of the 2nd International Workshop on Information Quality in Information Systems. ACM, 2005.
[16] MySQL: the world's most popular open source database. MySQL AB, 1995.
[17] Mistry, Ross, and Stacia Misner. Introducing Microsoft SQL Server 2012. O'Reilly Media, Inc., 2012.
[18] Esposito, Dino. Programming Microsoft ASP.NET MVC. Pearson Education, 2011.

VII-O-2
A SEMI-AUTOMATIC MODEL FOR CONVERTING STATIC WEB SILVERLIGHT 5.0 USER INTERFACES TO ANDROID 4.2 BASED ON THE TRANUI SOLUTION
Nguyễn Đức Huy, Nguyễn Văn Vũ, Trần Minh Triết
Faculty of Information Technology, University of Science, VNU-HCM
Email: {ndhuy,nvu,tmtriet}@fit.hcmus.edu.vn

ABSTRACT
This paper presents a model for semi-automatically converting the static user interface of Web pages built with Microsoft Silverlight 5.0 into application interfaces on the Android 4.2 platform for mobile devices. Based on the TranUI solution, the authors build interface models for MS Silverlight 5.0 and Android 4.2, together with a general conversion model (CUI) for interfaces across the platforms. In addition, to carry out the conversion, the authors propose a set of rules for transforming between the models.

Keywords: RIA, MDD, MBUID, Transformation Modeling, User Interface, Silverlight, Android UI

INTRODUCTION
With the worldwide development of the Internet, most daily human activities, such as exchanging information, communicating, working, and searching for information, are now carried out online. As a result, web applications with friendly, familiar user interfaces have nearly replaced desktop applications. In addition, the development of RIA (Rich Internet Application) technology has given web applications new power. An RIA (Rich Internet Application) [footnote 1] is a web application that has many characteristics of a desktop application. An RIA typically works by communicating through the web browser, via a plugin, an isolated area of the website (sandbox), JavaScript code, or a virtual machine specific to the particular RIA platform. The period 2007-2008 saw strong growth of RIA applications. Typical technology platforms in this direction are Adobe Flash/Flex, Microsoft Silverlight, and JavaFX.

Furthermore, according to statistics from Wikimedia and IDC, mobile and communication technologies have developed strongly in recent years (especially from 2010 to 2013). This growth of mobile technology has brought convenience to end users thanks to its flexibility and ease of use. The number of mobile device users has increased sharply in recent years. Sales of mobile devices and Internet access from them have clearly surpassed those of personal computers (laptops and desktops). Consequently, demand for applications on mobile platforms is growing rapidly. Figure 1 illustrates the need to convert existing Web-based applications into mobile applications. Mobile devices do have web browsers that let end users work with web applications such as Facebook, YouTube, and Gmail, but these web applications do not take full advantage of the device, for example synchronizing contacts with the Gmail application or automatically notifying the user of new mail. Therefore, developers aim to build separate applications on the different mobile platforms for their systems.

Figure 1. Interfaces of the Facebook and Gmail applications on the Web, iOS, and Android.

However, the cost of developing an application on many different mobile platforms is high. In particular, converting the interface between platforms takes considerable time and cost to ensure that the parameters of the interface components are preserved. The table is a survey of the development costs of application projects [...]

Footnote 1. RIA, Rich Internet Application: http://en.wikipedia.org/wiki/Rich_Internet_application

[...] Which function word
is considered a suitable candidate for that position? All of these questions are addressed in this paper in order to improve SMT translation quality.

RELATED WORK
The statistical approach is a methodological breakthrough for machine translation, but the practical results of such systems are still low. Researchers have therefore tried to improve it by adding linguistic knowledge. There are now many ways of improving the efficiency and quality of SMT, including approaches that focus on function words. Several studies improve SMT quality by addressing issues related to function words: using function words to reorder phrases [17] or the syntax of the source and target languages [5], and improving machine translation quality by deleting and inserting function words [12][2].

The first approach uses function words to reorder phrases and syntax between the source and target languages. Its goal is to resolve word order differences between the two languages, that is, to make the source-language word order closer to the target-language word order, which improves translation quality. The authors of [17] use the FWS (Function Word centered, Syntax-based) solution to handle phrase reordering in SMT based on function words. In this method, they propose a probabilistic synchronous grammar to encode the ordering of function words and their left and right arguments. Experiments show that FWS outperforms the baseline translation system in ordering the arguments of function words and improves translation quality whether the alignment is accurate or noisy. However, the solution is applied only on the source-language side and has difficulty reordering long phrases (because the phrase boundary detection model does not yet support these cases well). Another study, by the authors of [5], uses function words to reorder the syntax on the source-language side of SMT following a non-deterministic reordering approach. The model was tested and compared across three systems: a baseline phrase-based SMT system, a syntax-based reordering system with patterns extracted from the corpus, and a syntax-based reordering system with patterns extracted using function words. Experimental results on Chinese-English translation (tested only on a medium-sized corpus) show that the model gains 0.34% over the baseline system.

Like the first approach, models that improve machine translation quality by deleting and inserting function words also bring significant gains. The authors of [12] insert and delete function words based on syntactic cues in syntax-based translation (typically Treelet). The model is relatively simple and significantly improves translation quality for language pairs with different structures (for example, English-Japanese). The method was evaluated on English-Japanese (BLEU increases by 1.1% over the baseline Treelet system) and English-Spanish (BLEU increases by 0.5% to 1.1%). Another experiment that contributes to improving SMT quality is that of [2]. The authors propose deleting and inserting function words in the target language; overall, this method brings significant improvements over the baseline system for Chinese-English translation (BLEU increases by about 1.28% on the NIST 2005 data set and 1.19% on the NIST 2006 data set).

In this paper, we follow the approach of improving machine translation quality by deleting function words and integrating function word insertion into the decoding stage of a Vietnamese-English translation system.

MODEL
In this paper, we propose the following model:

Figure: Model for improving translation quality based on function words.

The model consists of three stages:
- Identify the set of function words to delete.
- Delete function words: this process takes place throughout model training. The purpose of this stage is to reduce the noise caused by function words.
- Insert function words: using the TFWIM (Target Function Word Insertion Model).

Identifying Function Words
In this paper, we adopt the convention that the function words to delete are those that are frequently unaligned. This stage is carried out in two steps:
Step 1: Compute the probability p(w) that a word w is unaligned, using formula (1):

$$p(w) = \frac{\text{number of unaligned word segments of } w}{\text{number of word segments of } w \text{ in the corpus}} \qquad (1)$$

Step 2: Sort the words w in descending order of p(w) to obtain the list of function words.

Deleting Function Words
Following [2], this stage deletes function words in the target language. When deleting, the following context information (also called context information regions) must be stored:
- Part-of-speech (POS) information
- Lexical information

Deletion proceeds in the following steps:
Step 1: Preprocess the data, for example word segmentation and POS tagging.
Step 2: Delete function words.

Because experiments show that the context region of neighboring words (two on each side, as defined below) gives more accurate results, we refer only to this context region throughout the paper.
Let:
- $T$ be the candidate set (the set of function words identified in stage 1);
- $w_i$ be the word under consideration; $w_{i-1}$ and $w_{i+1}$ be the words immediately to the left and right of $w_i$;
- $w_{i-2}$ and $w_{i+2}$ be the words immediately to the left of $w_{i-1}$ and to the right of $w_{i+1}$;
- $P_{i-1}$ and $P_{i+1}$ be the POS tags of the words immediately to the left and right of $w_i$;
- $P_{i-2}$ and $P_{i+2}$ be the POS tags of the words immediately to the left of $w_{i-1}$ and to the right of $w_{i+1}$;
- CLW and CPW be the lexical and POS context information.

To avoid losing context information later, the following deletion rules must be observed:
IF $w_i \in T \land (w_{i-1} \notin T,\ w_{i+1} \notin T)$ THEN delete $w_i$; the context information stored on deletion is:
CLW = $(w_i,\ w_{i-2}\, w_{i-1}\, w_{i+1}\, w_{i+2})$
CPW = $(P_i,\ P_{i-2}\, P_{i-1}\, P_{i+1}\, P_{i+2})$
Otherwise, IF $w_i \in T \land (w_{i-1} \in T \lor w_{i+1} \in T)$ THEN do not delete $w_i$.

Consider the following POS-tagged English sentence. To avoid ambiguity when deleting, we add the lexical tokens START at the beginning of the sentence and END at the end:
"START|START the|DT islanders|NNS of|IN torcello|NN ,|, who|WP have|VBP perhaps|RB already|RB spread|VBN to|TO neighbouring|JJ islands|NNS in|IN the|DT venetian|JJ lagoon|NN ,|, are|VBP included|VBN in|IN the|DT exarchate|NN .|. END|END"
Given T = {"the", "in", "to", "for", "of"}, the context information for the deletion cases is shown in the following table:

Table: Context information when deleting function words
No.   Function word   Context (POS)           Context (lexical)
1     the             START START NNS IN      START START islanders of
2     of              START NNS NN ,          START islanders torcello ,
3     to              RB VBN JJ NNS           already spread neighbouring islands

Inserting Function Words
This stage is modeled as a classification problem and uses the Maximum Entropy method to classify and decide where to insert function words.
Step 1: Identify the positions where insertion is needed; this step uses the POS context information.
Step 2: After the insertion positions have been found, the model finds a suitable function word to insert.
We rely on the TFWIM model to find the positions and the suitable function words. The model is computed with the Maximum Entropy formula (2):

$$P(w \mid C) = \frac{\exp\left[\sum_i \lambda_i f_i(w, C)\right]}{\sum_{w' \in W \cup \{\mathrm{NULL}\}} \exp\left[\sum_i \lambda_i f_i(w', C)\right]} \qquad (2)$$

where:
- $C$: the lexical and POS context information stored during the deletion stage
- $f_i(w, C)$: the frequency with which $w$ appears in the context information $C$
- $\lambda_i$: the corresponding weight hi(0
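As a rough illustration of formula (2), the Maximum Entropy probability can be read as a softmax over the candidate function words plus NULL. The sketch below is not the authors' implementation: the function name `maxent_probability`, the toy feature function, the dictionary-shaped context, and the reading of NULL as "insert nothing" are all assumptions made for this example.

```python
import math

NULL = "NULL"  # assumed meaning: insert no function word at this position


def maxent_probability(w, context, candidates, weights, features):
    """Return P(w | C) following formula (2), normalised over W plus NULL.

    weights  -- the lambda_i values (hypothetical numbers in this sketch)
    features -- feature functions f_i(word, context) returning a number
    """
    def score(word):
        return sum(lam * f(word, context) for lam, f in zip(weights, features))

    vocabulary = list(candidates) + [NULL]
    scores = {word: score(word) for word in vocabulary}
    top = max(scores.values())  # subtracting the maximum avoids overflow in exp
    normaliser = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[w] - top) / normaliser


# Toy setup (hypothetical): one indicator feature that fires when the
# candidate is "of" and the word to the left is tagged NNS.
candidates = ["the", "of", "to"]
features = [lambda w, c: 1.0 if w == "of" and c.get("left_pos") == "NNS" else 0.0]
context = {"left_pos": "NNS"}

p_of = maxent_probability("of", context, candidates, [2.0], features)
p_the = maxent_probability("the", context, candidates, [2.0], features)
```

With a positive weight on the feature, `p_of` comes out larger than `p_the`; with all weights at zero every candidate, including NULL, gets the same probability.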

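The deletion stage described in the "Deleting Function Words" section (formula (1) and the IF/THEN rule) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the dictionary layout of the context records, and the choice to take the left context from the words kept so far (so that an already deleted word does not reappear in a later context, which is how the rows of the context table read) are assumptions of this example.

```python
def unaligned_probability(unaligned_count, total_count):
    """Formula (1): fraction of occurrences of a word that are unaligned."""
    return unaligned_count / total_count if total_count else 0.0


def delete_function_words(tagged, candidates):
    """Delete w_i when it is in T and neither direct neighbour is in T.

    tagged     -- list of (word, pos) pairs, wrapped in START ... END
    candidates -- the set T of function words
    Returns the kept tokens and one context record per deleted word.
    """
    kept, contexts = [], []
    for i, (word, pos) in enumerate(tagged):
        prev_word = tagged[i - 1][0] if i > 0 else "START"
        next_word = tagged[i + 1][0] if i + 1 < len(tagged) else "END"
        if (word in candidates and prev_word not in candidates
                and next_word not in candidates):
            # Two tokens on each side, padded with START / END at the edges.
            left = ([("START", "START")] * 2 + kept)[-2:]
            right = (tagged[i + 1:i + 3] + [("END", "END")] * 2)[:2]
            window = left + right
            contexts.append({
                "CLW": (word, tuple(w for w, _ in window)),
                "CPW": (pos, tuple(p for _, p in window)),
            })
        else:
            kept.append((word, pos))
    return kept, contexts


SENTENCE = ("START|START the|DT islanders|NNS of|IN torcello|NN ,|, who|WP "
            "have|VBP perhaps|RB already|RB spread|VBN to|TO neighbouring|JJ "
            "islands|NNS in|IN the|DT venetian|JJ lagoon|NN ,|, are|VBP "
            "included|VBN in|IN the|DT exarchate|NN .|. END|END")
T = {"the", "in", "to", "for", "of"}

tagged = [tuple(token.rsplit("|", 1)) for token in SENTENCE.split()]
kept, contexts = delete_function_words(tagged, T)
deleted = [record["CLW"][0] for record in contexts]
```

On the example sentence, `deleted` holds the three words "the", "of", and "to"; both occurrences of "in the" are kept because each word has a neighbour in T.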