Document: Mining Database Structure; Or, How to Build a Data Quality Browser (docx)

Document information

Mining Database Structure; Or, How to Build a Data Quality Browser

Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk
AT&T Labs–Research

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM SIGMOD 2002, June 4-6, Madison, Wisconsin, USA. Copyright 2002 ACM 1-58113-497-5/02/06 $5.00.

ABSTRACT
1. INTRODUCTION
1.1 Related Work
2. SUMMARIZING VALUES OF A FIELD
2.1 Set Resemblance
2.2 Multiset Resemblance
2.3 Substring Resemblance
2.3.1 Q-gram Signature
2.4 Q-gram Sketches
2.5 Finding Keys
3. MINING DATABASE STRUCTURES
3.1 Finding Join Paths
3.2 Finding Composite Fields
3.3 Finding Heterogeneous Tables
4. BELLMAN
5. EXPERIMENTS
5.1 Estimating Field Intersection Size
5.2 Estimating Join Sizes
[Figure: Error in intersection size estimation, 50 samples. Axes: Resemblance; Error in estimation.]
[Figure: Error in intersection size estimation, 100 samples. Axes: Resemblance; Error in estimation.]
[Figure: Error in join size estimation, 100 samples. Axes: Resemblance; Estimation error.]
[Figure: Error in join size estimation, 250 samples. Axes: Resemblance; Error in estimation.]
[Figure: Unadjusted join size vs. actual join size, 100 samples. Axes: Actual join size; Estimated join size (logarithmic scales).]
5.3 Q-gram Signatures
[Figure: Adjusted join size vs. actual join size, 100 samples. Axes: Actual join size; Estimated join size (logarithmic scales).]
[Figure: Estimated vs. actual q-gram resemblance, 50 samples. Axes: Actual resemblance; Estimated resemblance.]
5.4 Q-gram Sketches
[Figure: Estimated vs. actual q-gram resemblance, 150 samples. Axes: Actual resemblance; Estimated resemblance.]
[Figure: Estimated vs. actual q-gram vector distance, 50 sketch samples. Axes: Actual q-gram vector distance; Estimated q-gram distance.]
[Figure: Estimated vs. actual q-gram vector distance, 150 sketch samples. Axes: Actual q-gram vector distance; Estimated q-gram distance.]
[Figure: Q-gram vector distance vs. q-gram resemblance. Axes: Q-gram resemblance; Q-gram vector distance.]
5.5 Qualitative Experiments
5.5.1 Using Multiset Resemblance
[...]
5.5.2 Using Q-gram Similarity
[...]

Date posted: 19/02/2014, 12:20
