Ideal and actual data clusters in Experiment 5


4 Implementation and Evaluation

4.5 Ideal and actual data clusters in Experiment 5

While carrying out this evaluation step, I reviewed the datasets used in the experiments and made the following observations. The datasets that were clustered optimally all have many set-type attributes, such as days of the week (Monday, Tuesday, Wednesday, ...) or values drawn from shared conventions (domain-specific abbreviations, street names, ...), and therefore have high Set Unionability. The datasets in clusters that were not grouped optimally either did not meet the required number of attributes or had some scores that were not high enough, and so were excluded. This was anticipated, because the thesis uses only the Set Unionability measure.

• In Experiment 5, almost none of the datasets were clustered together. This result shows that datasets with low Set Unionability were not clustered with datasets with high Set Unionability. There was one exception in this experiment: a few datasets were grouped together. The reason was explained above, and this exception does not show that the proposed algorithm is wrong.

• In addition, when I checked the output of the clustering algorithm in the Clustering Step, the datasets were clustered correctly according to their original datasets; new clusters were created only when the threshold requirements were taken into account. This shows that clustering datasets with a hierarchical clustering algorithm and Set Unionability is feasible, but the threshold-related parameters need to be tuned to achieve good results. Besides tuning the thresholds, these issues could also be addressed by combining additional measures in the Clustering Step. This is one of my next research directions.
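The effect of the threshold can be illustrated with a minimal sketch of threshold-stopped agglomerative clustering. This is not the thesis implementation: the function name `cluster`, the average-linkage rule, and the toy similarity table are all assumptions made for illustration.

```python
def cluster(names, similarity, threshold):
    """Agglomerative clustering that stops merging below a threshold.

    Average linkage is an assumption here; the linkage rule used in
    the Clustering Step is not stated in the text.
    """
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        # Mean pairwise similarity between the members of two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(similarity(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best_score, best_i, best_j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_score < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]
    return clusters


# Hypothetical pairwise scores between three datasets A, B and C.
scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.2,
    frozenset(("B", "C")): 0.3,
}


def toy_similarity(a, b):
    return scores[frozenset((a, b))]


print(cluster(["A", "B", "C"], toy_similarity, 0.5))  # [['A', 'B'], ['C']]
print(cluster(["A", "B", "C"], toy_similarity, 0.2))  # [['A', 'B', 'C']]
```

With the same scores, lowering the threshold from 0.5 to 0.2 merges everything into one cluster, which is why the threshold parameters need careful tuning.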

From the experiments above, I make the following observation: for datasets with many set-type attributes, such as days of the week (Monday, Tuesday, Wednesday, ...) or values drawn from shared conventions (domain-specific abbreviations, street names, ...), the Set Unionability measure works very well. However, if the datasets do not have many set-type attributes, Set Unionability is ineffective.
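The intuition behind this observation can be sketched with a toy overlap score. The sketch uses plain Jaccard overlap between distinct attribute values as a simplified stand-in; it is not the Set Unionability measure used in the thesis, and the function name and sample values are hypothetical.

```python
def value_overlap(a, b):
    """Jaccard overlap between the distinct values of two attributes.

    A simplified stand-in for a set-based measure, used only to show
    why set-type attributes score well.
    """
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Set-type attributes: both columns draw from the same small domain.
days_a = ["Monday", "Tuesday", "Wednesday", "Monday"]
days_b = ["Tuesday", "Wednesday", "Thursday"]

# Free-form attributes: values rarely repeat across datasets.
notes_a = ["late delivery", "ok", "damaged box"]
notes_b = ["customer absent", "signed by neighbour"]

print(value_overlap(days_a, days_b))    # 0.5
print(value_overlap(notes_a, notes_b))  # 0.0
```

Attributes drawn from a small shared domain overlap heavily, while free-form attributes share almost no values, so a purely set-based score cannot relate them.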

It can be seen that clustering based on Set Unionability is not yet effective for every kind of dataset. However, when the clusters produced by the algorithm are compared with the optimal clusters, it is easy to see that the algorithm's result, although not optimal, contains no incorrect groupings. In addition, for datasets with many set-type attributes, the clustering algorithm using Set Unionability works very well. For datasets with few set-type attributes, Set Unionability does not work well, by the nature of the measure. In this case, I suggest combining it with other measures such as Semantics Unionability or Natural Language Unionability. This is a very reasonable next direction for this research.

4.3.3 Conclusion

From the experiments above, I conclude that clustering datasets with a hierarchical clustering algorithm is entirely feasible, and that the Set Unionability measure can be used together with hierarchical clustering to carry out the data union process. However, Set Unionability alone is not accurate enough, because it is effective only on data with many set-type attributes. Therefore, other measures such as Semantics Unionability and Natural Language Unionability [7] need to be combined with Set Unionability to form a better measure. This is my next research direction.

Summary and future work

While working on the proposal and the thesis, I took part in the research and development of the following papers:

• An Elastic Data Conversion Framework for Data Integration, at the International Conference on Future Data and Security Engineering (2020).

• An Elastic Data Conversion Framework - A Case Study for MySQL and MongoDB, published in the journal SN Computer Science (2021).

• A Data Union Method Using Hierarchical Clustering and Set Unionability, at the International Conference on Future Data and Security Engineering (2021).

The studies above helped me complete my thesis. Details of the papers can be found in the appendix. In addition, during my research I also took part in a science and technology research project of the Ho Chi Minh City Department of Science and Technology. That project helped me a great deal in carrying out the thesis.

Data union is an important area related to many fields, such as open data and data integration. This thesis proposes using a hierarchical clustering algorithm combined with the Set Unionability measure to solve the data union problem in the datastore context. Using datasets from the data portals of the United Kingdom and Canada, I ran experiments to evaluate the feasibility of the method. The experiments showed that data union using hierarchical clustering, and in particular hierarchical clustering with the Set Unionability measure, is feasible. However, the experiments also showed that using Set Unionability alone is not effective enough; other measures need to be combined with it.

These weaknesses are also the directions for my future research. Future work includes:

• Combining other measures with Set Unionability (specifically Semantics Unionability and Natural Language Unionability), and studying how to aggregate the "scores" of several measures instead of only using the max function as in [7].

• Studying how to aggregate the similarity between the attributes of two datasets instead of using a simple product function.

• Studying how to parallelize the data union method based on hierarchical clustering to improve efficiency.

• Studying and proposing a data union system that can combine data automatically and incrementally (using the combined datasets in place of the old ones), integrated with an open data portal.
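The first two directions concern aggregation functions, and the difference between the options can be sketched as follows. The max and product functions correspond to what the text says is currently used; the weighted average and the geometric mean are only examples of possible alternatives, not methods proposed in the thesis. All function names, measure keys, and score values are hypothetical.

```python
import math


def combine_max(scores):
    """Combine several measures by taking the highest score."""
    return max(scores.values())


def combine_weighted(scores, weights):
    """One possible alternative: a weighted average of the measures."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


def dataset_score_product(attr_scores):
    """Aggregate per-attribute similarities with a simple product."""
    return math.prod(attr_scores)


def dataset_score_geometric_mean(attr_scores):
    """One possible alternative: normalise the product by the attribute count."""
    if any(s == 0 for s in attr_scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in attr_scores) / len(attr_scores))


# Hypothetical scores of three measures for one attribute pair.
measure_scores = {"set": 0.2, "semantic": 0.7, "natural_language": 0.6}
equal_weights = {"set": 1, "semantic": 1, "natural_language": 1}
print(combine_max(measure_scores))
print(combine_weighted(measure_scores, equal_weights))

# Hypothetical per-attribute similarities between two datasets.
attr_scores = [0.9, 0.8, 0.9]
print(dataset_score_product(attr_scores))
print(dataset_score_geometric_mean(attr_scores))
```

The product shrinks as more attributes are matched, even when every attribute matches well, whereas the geometric mean stays on the same scale regardless of the number of attributes. This is one reason an alternative to the simple product may be worth studying.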

List of scientific publications

M. H. Ta, D. T. Khanh and N. L. Hoang, “An Elastic Data Conversion Framework for Data Integration,” in Proceedings of the Future Data and Security Engineering, Springer, 2021.

T. K. Dang, M. H. Ta, L. H. Dang and N. L. Hoang, “An Elastic Data Conversion Framework - A Case Study for MySQL and MongoDB,” SN Computer Science, vol. 2, 2021.

T. K. D. Manh Huy Ta and N. Nguyen-Tan, “A Data Union Method Using Hierarchical Clustering and Set Unionability,” in Proceedings of the Future Data and Security Engineering, Springer, 2021.


Future Data and Security Engineering: 7th International Conference, FDSE 2020, Quy Nhon, Vietnam, November 25-27, 2020, Proceedings. Editors: Tran Khanh Dang, Josef Küng, Makoto Takizawa, Tai M. Chung. Lecture Notes in Computer Science, volume 12466. Springer, Cham. https://link.springer.com/book/10.1007/978-3-030-63924-2


Tran Khanh Dang, Ta Manh Huy, and Nguyen Le Hoang

Ho Chi Minh City University of Technology, Vietnam National University Ho Chi Minh City, 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam

khanh@hcmut.edu.vn, tamanhhuy@yahoo.vn, nlhoang@hcmut.edu.vn

Abstract. Data is nowadays an extremely valuable resource. However, it is created and stored in different places, in various formats and types. As a result, data analysis and data mining, which can benefit every aspect of social applications, are neither easy nor efficient. To overcome this problem, data conversion is a crucial step for linking and merging different data resources into a unified data store. In this paper, based on the intermediate data conversion model, we propose an elastic data conversion framework for data integration systems.

Keywords: data conversion · data integration system · data transformation · open data

1 Introduction

With the development of technology, data is becoming an extremely valuable resource. Data is being created, analyzed and used on a massive scale in every modern system. As a result, data analysis and data mining are essential in every aspect of social applications. Data becomes more valuable when it can be linked and merged with other data resources, especially for solving current social problems. To meet this challenge, data transformation is a crucial step that we have to overcome.

Data transformation can be described as a task that flexibly converts data among different models and formats, thereby supporting the combination of data from various resources into a unified one, in other words, a unified dataset. This problem is not easy even when converting traditional data from a few data sources with simple structures. Usually, this process requires human participation to understand and correct the meaning of the data in each source in order to solve data ambiguity problems, including semantic and data representation ambiguity. In the age of big data, this problem becomes more and more challenging, as data are not only heterogeneous but are also produced continuously and in enormous quantities. These three main characteristics of big data are known as the "3Vs":

– Volume: [...] of sources of the data also becomes very large

– Velocity: data are continuously generated and changed over time

– Variety: data from many different sources are diverse and heterogeneous

Fig. 1. Data Stores

Data transformation is an essential problem in many industries. For example, traffic data integrated from bus black boxes and road cameras provides a comprehensive view of the traffic situation in a city. If combined with population data such as population density and distribution, it allows management agencies and related departments to make appropriate decisions and policies on matters such as traffic flow, reconstructing and establishing traffic infrastructure, or navigating traffic to avoid traffic jams. The problem is that departments often store data with completely different models and formats. Hence, data transformation is an indispensable step in the integration, analysis, and decision-making process. In the US, transport agencies rely on large amounts of data to support everyday tasks such as planning, design, and construction [5]. Therefore, these agencies also need to gather and exchange a lot of information. The speed of access, together with the accuracy and consistency of the information from these different platforms and targets, leads to the problem of data conversion. In addition, the converted data can be combined into a unified dataset and, through the data mining process, this can bring [...] provide potential and optimal value. Furthermore, this combined dataset is a rich resource for making predictions and supporting decision making. Data can now be collected and integrated for storage and management in data centers for a variety of purposes (Fig. 1).

However, the challenge in the data transformation problem is that this process needs to interact with many different data sources in various structures and formats. Hence, it is necessary to research and propose a standard data format to support storage in data centers, and to propose a framework supporting data transformation before the data are integrated into data centers. This research direction is also one of the research trends in Information Technology for Ho Chi Minh City in the period 2018-2023. In this paper, we propose a novel data conversion framework for data integration systems. The rest of this paper is organized as follows: related work is discussed in Section 2, our proposed framework is presented in Section 3, and the summary and conclusion are given in Section 4.

2 Related Works

Since 2010, much research has been done and many methods have been proposed for data conversion. For example, in 2013, Ivan et al. proposed a data transformation system based on a community contribution model [8]. As depicted in Fig. 2, the data shared on the publicdata.eu portal includes data from many different organizations in various formats. The system makes initial mappings, then lets the community contribute by creating new mappings, re-editing existing mappings, transforming the data, and using the data. The accuracy of the data conversion improves over time with the contributions of the community.

Fig. 2. Data transformation system based on a community contribution model (Ivan et al., 2013)

In 2015, Rocha et al. proposed a method to support the migration of data from relational databases RDBMS to NoSQL [18]. This method includes 2 main modules which are data migration and data mapping.

– The data migration module (Fig. 3) [...] identifying all elements from the original relational databases (e.g. tables, properties, relationships, indexes, etc.), then creates equivalent structures using the NoSQL data model and exports the data to the new model

– The data mapping module (Fig. 4) consists of an abstract class, designed as an interface between the application and the DBMS, which oversees all SQL transactions from the applications, translates these operations, and forwards them to the NoSQL model that was created in the previous module
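The idea behind the data migration module can be illustrated with a small sketch that turns two related tables into nested documents. The table layout, the field names, and the choice to embed child rows in the parent document are assumptions for illustration; they are not taken from Rocha et al.

```python
def migrate(customers, orders):
    """Build one document per customer, embedding that customer's orders.

    Embedding is only one common way to model a one-to-many
    relationship in a document store; it is an assumption here.
    """
    documents = {}
    for row in customers:
        documents[row["id"]] = {"_id": row["id"], "name": row["name"], "orders": []}
    for row in orders:
        # The foreign key customer_id decides which document receives the order.
        documents[row["customer_id"]]["orders"].append(
            {"order_id": row["id"], "total": row["total"]}
        )
    return list(documents.values())


# Hypothetical relational rows, as they might be read from two tables.
customers = [{"id": 1, "name": "An"}, {"id": 2, "name": "Binh"}]
orders = [
    {"id": 10, "customer_id": 1, "total": 25.0},
    {"id": 11, "customer_id": 1, "total": 40.0},
]

for document in migrate(customers, orders):
    print(document)
```

The relationship expressed by the foreign key in the relational model becomes a nested list in the document model, which is the kind of equivalent structure the migration step has to create.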

Fig. 3. The migration of data from relational databases RDBMS to NoSQL - Data migration module (Rocha et al., 2015)

Fig. 4. The migration of data from relational databases RDBMS to NoSQL - Data mapping module (Rocha et al., 2015)

Hyeonjeong et al. developed a semi-automatic tool for converting ecological data in Korea in 2017 [6]. The goal of this tool is to gather data in different formats from various research organizations and institutes specializing in the environment in Korea and then convert it to a shared standard ecological dataset. To accomplish this goal, the authors propose 4 transformation steps, as described in Fig. 5:

– Step 1 [...] and the corresponding protocol

– Step 2 - Species Selection: choosing which species in the data are to be converted

– Step 3 - Attribute Mapping: mapping attributes from the source data to the normalized attributes defined in the protocol

– Step 4 - Data Standardization: converting the mapped data to a shared standard

However, this tool currently converts data for only a few species from the original data. Another limitation is that it supports only data sources stored in .csv format, whereas actual data is usually represented in many different formats.
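Steps 2 to 4 can be sketched for a .csv source as follows. This is only an illustration of the general idea, not the tool of Hyeonjeong et al.: the column names, the attribute map, and the sample records are hypothetical.

```python
import csv
import io

# Hypothetical mapping from source column names to normalized attribute names.
ATTRIBUTE_MAP = {"sp_name": "species", "obs_dt": "observed_on", "cnt": "count"}


def standardize(csv_text, attribute_map, selected_species):
    """Select species, map attributes, and standardize values from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Attribute mapping: keep only mapped columns, under their normalized names.
        mapped = {
            attribute_map[key]: value.strip()
            for key, value in row.items()
            if key in attribute_map
        }
        # Species selection: skip species that were not chosen for conversion.
        if mapped["species"] not in selected_species:
            continue
        # Data standardization: store the count as an integer.
        mapped["count"] = int(mapped["count"])
        records.append(mapped)
    return records


source = (
    "sp_name,obs_dt,cnt,remark\n"
    "Pica pica,2016-05-01, 3,near river\n"
    "Parus major,2016-05-02,5,\n"
)
print(standardize(source, ATTRIBUTE_MAP, {"Pica pica"}))
```

Supporting another source format would mean replacing only the reader, while the mapping and standardization logic could stay the same.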

Fig. 5. Semi-automatic tool for converting ecological data in Korea (Hyeonjeong et al., 2017)

In 2017, Milan et al. looked at factory integration through the use of a data transformation toolkit for AutomationML (AML), an open standard XML-based data format for the storage and exchange of technical information about a plant [14].

In this context, factory automation requires participation and collaboration across a variety of fields, including automation control, mechanical engineering, electronics, and software engineering. These domains all have different support tools, and the tools manipulate different data structures. Therefore, the authors propose a model that integrates these tools with AML by using a process engineering transformation tool. This model converts the data described in AML into the formats required by the technical tools of the different disciplines, as depicted in Fig. 6. Although the model can work well, the input of the process can be stored only in the AML standard.

Fig. 6. Factory integration through the use of the data transformation toolkit for AutomationML (Milan et al., 2017)

In 2017, Luis et al. developed a data conversion framework to support energy simulation [10]. The goal of this framework is to convert data in different formats to enable communicating and interacting among different systems in an auto-
