Mining Database Structure; Or, How to Build a Data Quality Browser
Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk
AT&T Labs–Research
ABSTRACT
1. INTRODUCTION
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ACM SIGMOD ’2002 June 4-6, Madison, Wisconsin, USA
Copyright 2002 ACM 1-58113-497-5/02/06 $5.00.
1.1 Related Work
2. SUMMARIZING VALUES OF A FIELD
2.1 Set Resemblance
2.2 Multiset Resemblance
2.3 Substring Resemblance
2.3.1 Q-gram Signature
2.4 Q-gram Sketches
2.5 Finding Keys
3. MINING DATABASE STRUCTURES
3.1 Finding Join Paths
3.2 Finding Composite Fields
3.3 Finding Heterogeneous Tables
4. BELLMAN
5. EXPERIMENTS
5.1 Estimating Field Intersection Size
5.2 Estimating Join Sizes
[Figure: Error in intersection size estimation, 50 samples. Axes: Resemblance; Error in estimation.]
[Figure: Error in intersection size estimation, 100 samples. Axes: Resemblance; Error in estimation.]
[Figure: Error in join size estimation, 100 samples. Axes: Resemblance; Estimation error.]
[Figure: Error in join size estimation, 250 samples. Axes: Resemblance; Error in estimation.]
[Figure: Unadjusted join size vs. actual join size, 100 samples. Axes (log scale): Actual join size; Estimated join size.]
5.3 Q-gram Signatures
[Figure: Adjusted join size vs. actual join size, 100 samples. Axes (log scale): Actual join size; Estimated join size.]
[Figure: Estimated vs. actual q-gram resemblance, 50 samples. Axes: Actual resemblance; Estimated resemblance.]
5.4 Q-gram Sketches
[Figure: Estimated vs. actual q-gram resemblance, 150 samples. Axes: Actual resemblance; Estimated resemblance.]
[Figure: Estimated vs. actual q-gram vector distance, 50 sketch samples. Axes: Actual q-gram vector distance; Estimated q-gram distance.]
[Figure: Estimated vs. actual q-gram vector distance, 150 sketch samples. Axes: Actual q-gram vector distance; Estimated q-gram distance.]
[Figure: Q-gram vector distance vs. q-gram resemblance. Axes: Q-gram resemblance; Q-gram vector distance.]
5.5 Qualitative Experiments
5.5.1 Using Multiset Resemblance
5.5.2 Using Q-gram Similarity