Skew estimation for document images

SKEW ESTIMATION FOR DOCUMENT IMAGES

YUAN BO
(M.Sc., NUS, Peking; B.Sc., Peking)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2005

Acknowledgements

I would like to thank my academic advisor, Professor Tan Chew Lim of Computer Science, for his invaluable guidance and encouragement throughout the many years of my graduate studies in Computer Science. He has provided me with the best of his knowledge and research equipment, which made my research smooth and fruitful.

I am always grateful to Professor Tang Seung Mun of Physics for providing me with many opportunities to widen my academic views and enrich my personal life. Without his help, I would not have been able to achieve what I have today.

I owe so much to my wife Xiaojing for her extraordinary hard work for the family and her constant support for my study in the years past. I could never really understand how difficult it is to be a wife with a successful career and two young children, our daughter Xinyi and our son Xinran, with almost no external help.

My parents longed year after year for me to be the first Ph.D. in the family. Now the time has come, even though they hardly understand the work I have done. To them, the degree itself is glorious enough to reward their decades of sacrifice to bring me up and give me the highest possible education.

Table of Contents

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
SUMMARY
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Digital and Analog Publications
1.2 Motivations and Contributions
1.3 Organization of This Thesis
CHAPTER 2 RELATED WORK
2.1 Related Work on Skew Estimation
2.1.1 Projection-profile based skew estimation class
2.1.2 Hough-transform based skew estimation class
2.1.3 Nearest-neighbor clustering based skew estimation class
2.1.4 Morphological operation based skew estimation class
2.1.5 Spatial frequency based skew estimation class
2.1.6 Other approaches of skew estimation methods
2.2 Related Work on Page Segmentation
2.2.1 Connected component analysis based
2.2.2 Projection profile based
2.2.3 Morphological operation based
2.2.4 Background based
2.2.5 Other approaches
CHAPTER 3 SKEW ESTIMATION FROM FIDUCIAL LINES
3.1 Skew Estimation
3.1.1 Histogram generation
3.1.2 Peaks searching
3.1.3 Results verification
3.1.4 Working on component holes – the background mode
3.2 Speedup Measures
3.2.1 Alternative histogram configuration
3.2.2 Filters for individual components
3.2.3 Filters for component pairs
3.2.4 Faster slope-to-angle calculation
3.2.5 Skew-independent segmentation
3.3 Experimental Results
3.3.1 Synthetic images (total 168 from UW-I)
3.3.2 Scanned images (total 979 from UW-I)
3.3.3 Scanned images from Chinese newspaper clips
3.4 Conclusion
CHAPTER 4 SKEW ESTIMATION FROM CONVEX HULLS
4.1 Components Grouping
4.1.1 The proposed grouping function
4.1.2 The advantages of using convex hulls
4.1.3 The choice of the parameter k
4.1.4 The detection-loop with k feedback
4.2 Skew Estimation
4.2.1 The detection of parallel/perpendicular edges of convex hulls
4.2.2 The accumulator array for the edge slopes
4.2.3 The search for peaks
4.3 Experimental Results
4.3.1 Synthetic images (total 168 from UW-I)
4.3.2 Scanned images (total 979 from UW-I)
4.4 Conclusion
CHAPTER 5 SKEW ESTIMATION FROM STRAIGHT EDGES
5.1 Skew Estimation
5.1.1 The edge enhancement
5.1.2 The Wallace parameterization
5.1.3 The probe-line sweeping and scanning
5.1.4 The perpendicular criterion
5.1.5 The speedup measures
5.2 Experimental Results
5.3 Conclusion
CHAPTER 6 COMPARISONS AND DISCUSSIONS
6.1 Suite Tests
6.1.1 Suite tests using UW-I
6.1.2 Suite tests using “DOE samples”
6.2 Feature-by-Feature Comparisons
6.2.1 Excessive noises
6.2.2 Multiple skews
6.2.3 Non-textual documents
6.2.4 Other issues
6.3 Complementary Features
6.3.1 The choice of centroids
6.3.2 Divide and conquer
6.4 Future Work
ANNOTATED BIBLIOGRAPHY
APPENDIX

Summary

This thesis represents our efforts in developing a series of skew estimation models for scanned document images. The three self-contained yet complementary models are the fiducial-line based model, which relies on the existence of text lines; the convex-hull based model, which relies on the existence of paragraphs or columns; and the straight-edge based model, which relies on the existence of non-textual components that have straight edges or lines. The objective is to tackle some of the major problems that still challenge document analysis systems at present: excessive interfering noise; multiple skews and their locations in a single document; and skew estimation for non-textual documents.

The first model is based on fiducial lines. A fiducial line is defined as the virtual straight line that passes through the centroids of any two components. For textual document images, the fiducial lines of the components comprising the text lines have highly concentrated slopes. Component pairs that do not lie on the same text line have fiducial lines whose slopes spread widely across the slope histogram. The central values of the highest peaks in the histogram are taken as the estimated skews of the image. This model works very well for document images with excessive noise, without requiring the separation of the textual components from the non-textual ones. Speedup measures for the model's baseline implementation are provided.

The second model is based on convex hulls. A convex hull is defined as the smallest virtual convex polygon that fully contains a component or a group of components. This model has two integral parts, a grouping scheme and an orientation estimator, both of which are based on the detection of the convex hulls. It is developed mainly for solving the multi-skew problem. It first extracts the convex hulls of the components in the image, and then groups the components according to both the spatial distances and the size similarities among their convex hulls. This not only reveals hints of the alignments of the text groups, but also separates noise and graphical components from the textual ones. Therefore, this model can detect not only the angles of the multiple skews in a single image, but also their locations.

The third model is based on straight edges. The straight edges or lines in an image include the separators of tables, the borders of graphical inserts, the black bars around the borders and along the center spine of bound materials, etc. This model first applies an edge detector to an image to highlight the borders. Then, it uses a line-probing algorithm in the same configuration as the Wallace parameterization for the Muff Transform. Any significant straight edges or lines are identified and used as the basis for skew estimation. Various strategies for optimized line probing are devised. This model is applicable to both textual and graphical documents scanned with ordinary scanners or copiers under normal conditions.

The performance of these models is evaluated using the full set of 168 synthetic and 979 real images from the University of Washington English Document Image Database I (UW-I).

List of Figures

Figure 1.1. The different levels of tasks in document image analysis.
Figure 1.2. Some of the scanned document images that contain excessive noise, multiple skews, sparse text or short text blocks, scanning artifacts, and so on.
They are still serious challenges to any document analysis system.
Figure 3.1. An enlarged portion of a document image superposed with the fiducial lines drawn among the centroids of components.
Figure 3.2. The fiducial lines are drawn on the image in Figure 3.1 along the angles of 1.72±0.02°.
Figure 3.3. The flowchart of the fiducial line based skew estimation model.
Figure 3.4. The slope histogram of the fiducial lines for the image in Figure 3.1. The prominent spikes at 0°, -90°, 45°, etc. are the results of the quantization effects and can be removed by the convolution as proposed in this chapter.
Figure 3.5. The slope histograms for separated inter-/intra-line components of the image in Figure 3.1. The contributions from intra-line components form a broad background, while the contributions from inter-line components form an easily recognizable sharp peak.
Figure 3.6. Distance versus angle plot for the grid points in a square imaging grid. The quantization effects are obvious at short distances, especially along ±90° or tg⁻¹(±∞), 0° or tg⁻¹(0), ±45° or tg⁻¹(±1), ±63.44° or tg⁻¹(±2), ±26.56° or tg⁻¹(±1/2).
Figure 3.7. The convolved histogram of Figure 2.3 with the kernel shown as inset (σ=0.5°, not to scale).
Figure 3.8. The convolved histograms for the individual lines of the image in Figure 3.1. The histogram in Figure 2.6 is the sum of all nine lines.
Figure 3.9. Working on the component holes along the angles of 1.78±0.02° for the image in Figure 3.1. The holes are extracted with 4-connectedness on the background. The major white spaces have no effect on the skew detection (removed here for cleaner presentation).
Figure 3.10. The convolved histogram in the background mode for the image in Figure 3.1 (σ=0.5°). The S/N ratio and peak accuracy are both lower than in the foreground mode, but still usable even for skew detection on low-resolution or down-sampled images.
Figure 3.11.
Design of the distance filter for component pairs using the image in Figure 3.1. The dense, stripe-like central pattern is formed by the intra-line component pairs, while the other hyperbola-shaped patterns are from inter-line pairs.
Figure 3.12. Design of the size-difference filter for component pairs using the image in Figure 3.1. The major contributions to the central peak are from the component pairs whose size differences are less than 100 pixels.
Figure 3.13. The accumulated percentage of samples versus the absolute error on the 168 synthetic document images in UW-I. Each of the images in the database is randomly rotated three times in the range of [0, 90°), resulting in 504 rotated images.
Figure 3.14. The accumulated percentage of samples versus the absolute error on the 979 real document images in UW-I.
Figure 3.15. Regression analysis using the real document images in UW-I. The linear correlation coefficient is 92.80%. The details of the labeled outliers are shown in Figure 3.26 to Figure 3.30.
Figure 3.16. The sample H04I from UW-I. The ground truth is -0.1° and the detected skew angle is 89.82°. Fiducial lines are drawn along the angles of 89.82±0.02°.
Figure 3.17. The raw (top) and the convolved (bottom) histograms of the sample H04I from UW-I. The detected skew angle is -0.18° (89.82° - 90°). The peak at 0.1° is from the horizontal text lines at the bottom of the page.
Figure 3.18. The sample D03E from UW-I. The ground truth is -0.21°, and the detected skew angle is 0.16°. Fiducial lines are drawn along the angles of 0.16±0.02°.
Figure 3.19. The raw (top) and the convolved (bottom) histograms of the sample D03E from UW-I. The detected skew angle is 0.16°.
Figure 3.20. The sample E01E from UW-I. The ground truth is -0.04°, and the detected skew angle is -0.04°. Fiducial lines are drawn along the angles of -0.04±0.02°.
Figure 3.21.
The raw (top) and the convolved (bottom) histograms of the sample E01E from UW-I. The detected skew angle is -0.04°.
Figure 3.22. The sample H00L from UW-I (labeled in Figure 3.15). The ground truth is 1.47°, and the detected skew angle is 1.36°. Fiducial lines are drawn along the angles of 1.36±0.02°. This is one of the images that contain sparse text.
Figure 3.23. The raw (top) and the convolved (bottom) histograms of the sample H00L from UW-I. The detected skew angle is 1.36°. The S/N ratio is only 19.68 dB.
Figure 3.24. A scanned Chinese newspaper clip. Fiducial lines are drawn on the original image (top-left) along the angles of 0.04±0.02° (top-right), 50.44±0.02° (bottom-left) and 89.86±0.02° (bottom-right).
Figure 3.25. The raw (top) and the convolved (bottom) histograms of the Chinese newspaper clip. There are multiple prominent peaks in the convolved histogram due to the special style of Chinese text.
Figure 3.26. The sample A03I from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of 0.90±0.02° (top, detected) and -0.65±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. The detected skew angle is for the dominant left page, while the ground truth may be for the right page.
Figure 3.27. The sample A03J from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of -0.54±0.02° (top, detected) and 0.81±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. The detected skew angle is for the dominant right page, while the ground truth may be for the left page.
Figure 3.28. The sample A05G from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of 0.48±0.02° (top, detected) and -2.12±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. The detected skew angle is for the dominant right page, while the ground truth is doubtful.
Figure 3.29. The sample N03I from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of -0.64±0.02° (top, detected) and 0.25±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. This is a false detection caused by cross-column correlation when the text lines in different columns are not collinearly aligned. Many skew detectors suffer from this kind of sample.
Figure 3.30. The sample S021 from UW-I (labeled in Figure 3.15). Fiducial lines are drawn on the original image along the angles of 0.58±0.02° (top, detected) and -1.00±0.02° (bottom, ground truth). The images are rotated 90° counter-clockwise. This is another false detection caused by cross-column correlation when the text lines in different columns are not collinearly aligned.
Figure 4.1. The convex hulls of the components with their vertices and centroids marked. This is a clip of the image A00O from UW-I.
Figure 4.2. The convex hulls of the component groups with their vertices marked. This is a clip of the image A00O from UW-I.
Figure 4.3. A reference implementation of the components grouping algorithm in pseudo code. In principle, this is a partition algorithm with a binary predicate.
Figure 4.4. The area distributions of the components (top) and their convex hulls (bottom) of the image A00O from UW-I. The background is the original image represented by its components (top) and convex hulls (bottom) of the corresponding half.
Figure 4.5. The area distributions of the components (top) and their convex hulls (bottom) of a Chinese newspaper clip. The background is the original image represented by its components (top) and convex hulls (bottom) of the corresponding half.
Figure 4.6.
Various components grouping stages for the image A00O: (foreground) the weighted density of the groups versus the k value; (background) the component groups and their convex hulls at k = 6, 12 and 20.
Figure 4.7. The frequency distribution of the smallest k at which the formation of the paragraphs stabilizes for the 979 real images and the 168 synthetic images from UW-I. Using the convex hulls of the components (bottom) is superior to using the components directly (top).
Figure 4.8. The flowchart of the convex hull based skew estimation model.
Figure 4.9. The accumulated percentage of samples versus the absolute error on the 168 synthetic document images in UW-I. Each of the images in the database is randomly rotated three times in the range of [-45°, 45°), resulting in 504 rotated images.
Figure 4.10. The accumulated percentage of samples versus the absolute error on the 979 real document images in UW-I.
Figure 4.11. Regression analysis using the 979 real document images in UW-I. The linear correlation coefficient is 92.1%.
Figure 4.12. The sample A002 from UW-I (labeled in Figure 4.11). The ground truth is 0.4°, and the detected skew angle is -2.54° (highest peak) for the left half and 0.28° (second highest peak) for the right half of the image. The components in gray are those filtered out by the size filter or the aspect-ratio filter, while the components in black are those grouped by the grouping function. The edges and vertices of their convex hulls are drawn in gray.
Figure 4.13. The sample A03I from UW-I (labeled in Figure 4.11). The ground truth is -0.65°, and the detected skew angle is 1.06°.
Figure 4.14. The sample A05G from UW-I (labeled in Figure 4.11). The ground truth is -2.12°, and the detected skew angle is 0.14°. The ground truth is doubtful in this case.
Figure 4.15. The sample J00B from UW-I (labeled in Figure 4.11).
The ground truth is -0.48°, and the detected skew angle is 0.95° for the left page and -0.52° for the right page. The most prominent peak is for the left page. The value of the parameter k has been increased from 16 to 35.
Figure 4.16. The sample N042 from UW-I (labeled in Figure 4.11). The ground truth is 0.79°, and the detected skew angle is 0.0°. This sample reveals the limitation of angular resolution at short distances, which is true for any skew estimation method.
Figure 4.17. The sample H04I from UW-I. The ground truth is -0.10°, and the detected skew angle is -0.19°. This is one of the samples that demonstrate the robustness of the proposed skew estimation method in the presence of excessive noise.
Figure 4.18. The sample I047 from UW-I. The ground truth is 0.50°, and the detected skew angle is 0.25°. This is one of the samples that demonstrate the robustness and versatility of the convex hull based model in selecting hints for skew estimation.
Figure 4.19. The sample A06M from UW-I. The ground truth is -3.00°, and the detected skew angle is -2.75°. The warping along the spine of the original document does not impede the correct detection of the skew angle.
Figure 5.1. The scanned pages with black bars and table dividers (left), photographic inserts (center), and field dividers (right).
Figure 5.2. The Wallace parameterization [54], where w and h are the width and height of the image, respectively. The line from S1 to S2 is a probe-line. Note that S2 is always greater than S1 in this configuration in order to achieve unique probe-lines.
Figure 5.3. The flowchart of the straight edge based skew estimation model.
Figure 5.4. The possible range of the unique (S1, S2) pairs in the Muff space shown in the shaded areas, where w and h are the width and height of the image. The total size of the shaded areas is 6wh + (h-w)^2. An area marked “edge” is on one of the four edges of the image, and thus is not useful.
An area marked “dup” duplicates the area symmetric to it about the diagonal.
Figure 5.5. The perpendicular criterion for any two probe-lines.
Figure 5.6. Detection result for the image A00G (ground truth: 0.95°, detected: 1.46°).
Figure 5.7. Detection result for the image A002 (ground truth: 0.40°, detected: 0.35°).
Figure 5.8. Detection result for the image H04I (ground truth: -0.10°, detected: -0.13°).
Figure 5.9. Detection result for the image D053 (ground truth: 0.02°, detected: 0.09°).

[...]

... handle skewed documents before doing OCR. However, if there are multiple skews, the OCR engines will only work on textual areas with the estimated predominant skew. For document images that contain excessive noise, the performance of the existing methods that rely on the predominant alignment of the text lines for skew estimation will deteriorate or even fail. Faxed documents are often badly skewed ...

... Work on Skew Estimation

Due to the widespread use of digital scanners and copiers, skew estimation and correction for scanned document images has triggered extensive studies, and a large array of techniques has consequently been developed. Different skew estimation methods compete on detection accuracy, time and space efficiency, the ability to detect the existence of multiple skews ...

... original documents on the surfaces of the scanners or copiers, the edges of the captured pages in an image may not always align precisely with those of the image. This amount of misalignment or offset is usually referred to as the skew angle of a document image. Skew estimation is the process of detecting the skew angles and their specific locations in an image for the subsequent correction. Skew estimation ...
... such as paper prints or microfilm archives. Therefore, many efforts have been put into the research and development of new technologies to convert the legacy analog publications into the new electronic form [1]. The ultimate goal is to integrate the analog publications with the new digital publications to form a ...

List of Tables

Table 6.1. Performance comparison using the 979 real document images in UW-I. Shaded rows are the best performers from Chen, Bloomberg and Yuan. (Chart digitization uncertainty: ±0.5%)
Table 6.2. Feature comparison among the three skew estimation models in this thesis and the three popular approaches.

Chapter 1 Introduction

Skew estimation is the main ...

... performance evaluation of some selected skew estimation methods [14] available in the research literature. Besides the journal publications and conference proceedings, patent documents are another rich source of information that in many cases is even more comprehensive in terms of detail and completeness. In a textual document image, there are various hints of skews. The most explored hint of skew ...

... parameter) to select the “good” lines for estimating the “optimal estimate of the test skew angle” in a Bayesian framework. The test images total 12617 = 11 × (168 + 979) from UW-I. The factor 11 represents the original images plus additional images created from rotating each image in 10 intervals. The estimated skew angles and their ground truth of all the test images are subject to a training process ...

... frequency based skew estimation class

Typical spatial frequency based skew estimation methods treat the text lines in a textual document image as textures or patterns. They use the Fourier transform or other waveforms, such as the distributions in Cohen’s class, to reveal this global trend. This class of methods usually depends on the availability of dominant text lines. They cannot provide the local information ...

... textual documents with multiple skews. O’Gorman’s Docstrum [32] takes this approach. His Docstrum, which is an angle-distance scatter plot for all the nearest-neighbor pairs of components, not only shows the clustering of the component pairs that have similar directions (for skew estimation), but also shows the inter-component and the inter-line spacing (for component grouping). Therefore, by ...

... a skew corrected page.

2.2.5 Other approaches

There are still some other approaches, such as the method from Jain et al. using Gabor filters [97] and the method from Tang et al. using fractal geometry [80][83]. They all have special merits in their algorithm design and applications.

Chapter 3 Skew Estimation from Fiducial Lines

Skew estimation for textual document images ...
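The fiducial-line idea stated in the Summary (take the line through the centroids of every pair of components, accumulate the slopes in a histogram, and read the skew from the highest peak) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the 0.1° bin width, the Gaussian shape of the smoothing kernel, the circular wrap-around, and the names `estimate_skew` and `synthetic_centroids` are assumptions made for illustration. Only the kernel width σ = 0.5° is taken from the figure captions, and none of the speedup filters of Section 3.2 are modeled.

```python
import math
from itertools import combinations


def estimate_skew(centroids, bin_width=0.1, sigma=0.5):
    """Return a skew estimate in degrees, in [-90, 90).

    centroids : list of (x, y) component centroids.
    bin_width : histogram resolution in degrees (assumed value).
    sigma     : width of the smoothing kernel in degrees.
    """
    n_bins = round(180 / bin_width)
    hist = [0.0] * n_bins

    # Accumulate the slope angle of the line through every centroid pair.
    for (x1, y1), (x2, y2) in combinations(centroids, 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # coincident centroids define no line
        angle = math.degrees(math.atan2(dy, dx))
        # A line has no direction, so fold the angle into [-90, 90).
        if angle >= 90:
            angle -= 180
        elif angle < -90:
            angle += 180
        hist[int((angle + 90) / bin_width) % n_bins] += 1

    # Smooth with a Gaussian kernel; the histogram is circular
    # because -90 and +90 degrees describe the same line.
    radius = round(3 * sigma / bin_width)
    kernel = [math.exp(-0.5 * (k * bin_width / sigma) ** 2)
              for k in range(-radius, radius + 1)]
    smoothed = [
        sum(kernel[k + radius] * hist[(i + k) % n_bins]
            for k in range(-radius, radius + 1))
        for i in range(n_bins)
    ]

    # The centre of the highest bin is taken as the skew estimate.
    peak = max(range(n_bins), key=smoothed.__getitem__)
    return -90 + (peak + 0.5) * bin_width


def synthetic_centroids(skew_deg, lines=5, per_line=20, dx=12.0, dy=40.0):
    """Hypothetical test page: a grid of 'characters' rotated by skew_deg."""
    t = math.radians(skew_deg)
    c, s = math.cos(t), math.sin(t)
    return [(c * j * dx - s * i * dy, s * j * dx + c * i * dy)
            for i in range(lines) for j in range(per_line)]


if __name__ == "__main__":
    print(estimate_skew(synthetic_centroids(2.0)))
```

On the synthetic grid, pairs on the same text line all share one slope and dominate the histogram, while pairs from different lines scatter over many angles, which is the behaviour the Summary describes. The pairwise loop is quadratic in the number of components, which is presumably why the thesis devotes a section to speedup measures.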

Format
Pages: 168
File size: 22.01 MB