MINISTRY OF EDUCATION AND TRAINING
VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
***
HA DAI TON
DOCUMENT GEOMETRIC LAYOUT ANALYSIS BASED ON ADAPTIVE THRESHOLD
Major: Mathematics for Informatics
Code: 62460110
SUMMARY OF PhD THESIS IN MATHEMATICS
Hanoi - 2018

The work was completed at: Graduate University of Science and Technology, Vietnam Academy of Science and Technology
Supervisor: Prof. Dr. Nguyen Duc Dung
Review 1:
Review 2:
Review 3:
The thesis will be defended before the PhD thesis defense committee, meeting at the Graduate University of Science and Technology, Vietnam Academy of Science and Technology, at ... hour, date ... month ... 201...
The dissertation can be found at:
- Library of the Graduate University of Science and Technology
- National Library of Vietnam

INTRODUCTION

Text recognition is a field that has been researched and applied for many years. The text recognition process consists of the following main steps: the input page image goes through a preprocessing step and then a page analysis step; the output of page analysis is the input of the recognition step; finally, the result is post-processed. The result of a recognition system depends on two main steps: page analysis and recognition. At this point, the problem of recognizing printed text has been resolved almost completely (ABBYY's FineReader 12.0 commercial product can recognize printed text in various languages, and the Vietnamese recognition software VnDOCR 4.0 of the Hanoi Information Technology Institute achieves an accuracy of over 98%). However, in the world as well as in Vietnam, the page analysis problem remains a major challenge and still receives the attention of many researchers. Every two years an international page analysis contest is held to promote the development of page analysis algorithms. These were the motivations for the dissertation to look for effective solutions to the page analysis problem.

In recent years, many page analysis algorithms have been developed, especially
hybrid-oriented algorithms. The proposed algorithms show different strengths and weaknesses, but in general most of them still suffer from two basic errors: over-segmentation, in which a correct text area is split into smaller areas so that the information of text lines or paragraphs is misread or lost, and under-segmentation, in which text areas belonging to different text columns or paragraphs are merged together. Therefore, the objective of the dissertation is to study and develop page analysis algorithms that simultaneously reduce both types of errors. The issues in page analysis are very broad, so the dissertation limits its scope to page images of text written in Latin script, English in particular, and focuses on the analysis of text areas. The dissertation does not address the detection and structural analysis of tables, the detection of image areas, or the analysis of logical structure. With these objectives, the dissertation has achieved the following results:

- Proposed a solution that speeds up the algorithm for detecting the background of page images.
- Proposed an adaptive parameterization method that reduces the effect of font size and font type on the results of page analysis.
- Proposed a new solution for detecting and using separator objects in page analysis algorithms.
- Proposed a new solution that separates text areas into paragraphs based on context analysis.

CHAPTER 1. OVERVIEW OF DOCUMENT LAYOUT ANALYSIS

In this chapter, I present an overview of the text recognition system, the page analysis problem, typical page analysis algorithms, and the most basic errors of page analysis algorithms. This leads to the research objectives and results of this dissertation.

1.1 The main elements of the text recognition system

Basically, a text recognition system proceeds through the basic steps described in Figure 1. Information in the form of text, such as books, newspapers and magazines, is scanned into image files. These image
files are the input of a recognition system, whose output is text files that can be easily edited and archived, such as *.doc, *.docx, *.excel and *.pdf files. The dissertation focuses on the page analysis step, in particular on the analysis of the geometric structure of the layout.

[Figure 1 block labels: Document layout; Pre-processing; Analysis of the geometric structure; Analysis of the logical structure; Recognize; Post-processing; Text file]

Figure 1: Illustration of the basic processing steps of a text recognition system

1.1.1 Pre-processing

Pre-processing a page image usually consists of binarization, extraction of connected components, noise filtering and skew correction. The output of the pre-processing step is the input of the page analysis process, so the pre-processing results also have a significant effect on the results of page analysis.

1.1.2 Document layout analysis

Document layout analysis is one of the major components of text recognition (OCR) systems. It is also widely used in other fields of computing such as document digitization, automatic data entry and computer vision. The task of page analysis includes automatically detecting the regions of a document page image (physical structure) and categorizing them into different data regions such as text area, image, table, header and footer (logical structure). Page analysis results are used as input to the recognition and automatic data entry stages of document image processing systems.

1.1.3 Recognition of optical characters

This is the most important stage, since it determines the accuracy of the recognition system. Many different classification methods are applied in character recognition systems, such as the matching method, the direct approach method, the grammar method, the graph method, neural networks, statistical methods and support vector machines.

1.1.4 Post-processing

This is the final stage of the recognition process. Post-processing may be a step to
join the recognized characters into words, sentences and paragraphs to reconstruct the text, while detecting recognition errors by checking spelling based on the structure and semantics of the words, sentences or paragraphs. Detecting recognition errors at this stage contributes significantly to the quality of recognition.

1.2 Typical algorithms for analyzing the geometric structure of a page

Over decades of development, many page analysis algorithms have been published. Based on the order in which they process the page, document layout analysis algorithms can be divided into three approaches: top-down, bottom-up and hybrid.

1.2.1 Top-down approach

Typical top-down algorithms include X-Y Cut and WhiteSpace. These algorithms perform page analysis by splitting the document layout horizontally or vertically along the white spaces of the page. These spaces usually run along column boundaries or paragraph borders. The strength of these algorithms is their low computational complexity, and they give good results on rectangular pages, i.e. layouts whose regions can be enclosed by non-intersecting rectangles. However, they cannot process pages whose regions are non-rectangular.

1.2.2 Bottom-up approach

Typical bottom-up algorithms include Smearing, Docstrum and Voronoi. These algorithms start with small areas of the image (pixels or characters) and successively group small areas of the same type to form regions. The strength of this approach is that it can process pages with any structure (rectangular or non-rectangular). The weakness of bottom-up algorithms is that they are slow and memory-hungry, because they group small areas step by step based on distance parameters, which are typically estimated over the entire page. As a result, these algorithms are often too sensitive to parameter values and over-segment text areas, especially text areas with differences in font size and style.

1.2.3 Hybrid approach

From the above analysis, the advantage of the bottom-up approach is the disadvantage of the top-down approach and vice versa. Thus, in recent years many hybrid algorithms combining top-down and bottom-up have been developed, typical examples being RAST, Tab-Stop and PAL. Algorithms of this kind are often based on analytic objects such as white space rectangles and tab-stops to infer the structure of text columns. From there, the regions are determined by a bottom-up method. The results show that hybrid algorithms have overcome some of the limitations of top-down and bottom-up algorithms: they can be applied to document layouts of any structure and are less restricted by distance parameters. However, detecting the analytic objects is a difficult problem for many reasons, such as letters that are too closely spaced, text areas that are justified or not aligned on the left and right, or distances between connected components that are too large. As a result, existing algorithms often miss or misidentify analytic objects, which leads to analysis errors.

1.3 Methods and data sets for evaluating document layout analysis algorithms

1.3.1 Measures

Evaluating document layout analysis algorithms is always a complex issue, as it depends on data sets, ground truths and evaluation methods. The issue of evaluating the quality of page analysis algorithms has received a lot of attention. In this dissertation, three measures are used for all experimental assessments: F-Measure, PSET-Measure and PRImA-Measure. The PRImA-Measure has been successfully used at the international page analysis competitions in 2009, 2011, 2013, 2015 and 2017.

1.3.2 Data

In this dissertation, I use three data sets, UW-III, PRImA and UNLV, for the experimental assessment and comparison of document layout analysis algorithms. The
UW-III data set has 1600 images, PRImA has 305 images and UNLV has 2000 images. These data sets have ground truth at the paragraph level and text level, represented by non-intersecting polygons. The pages were scanned at 300 DPI and have been deskewed. They contain a wide variety of layout styles, which reflect many of the challenges of page analysis. The layouts range from simple to complex and include pictures surrounded by text and large variations in font size. Therefore, these are very good data sets for the comparative analysis of page analysis algorithms.

1.4 Conclusion of the chapter

This chapter has presented an overview of the field of text recognition, in which page analysis is an important step. The page analysis problem still attracts many domestic and foreign researchers. Many page analysis algorithms have been proposed, especially at the international page analysis competitions (ICDAR). However, these algorithms still suffer from two basic errors: over-segmentation and under-segmentation. Therefore, the dissertation focuses on solutions to the problem of document layout analysis.

There are three main approaches to document layout analysis: top-down, bottom-up and hybrid. In particular, the hybrid approach has been thriving in recent times as it overcomes the disadvantages of both top-down and bottom-up approaches. For that reason, the dissertation focuses on hybrid algorithms, particularly the techniques for detecting and using the analytic objects of hybrid algorithms. The next chapter presents a fast layout background detection technique, which is used as a module in the algorithms proposed in Chapter 3.

CHAPTER 2. QUICK ALGORITHM TO DETECT THE BACKGROUND OF THE DOCUMENT LAYOUT

This chapter presents the advantages and disadvantages of the layout-background-based approach to document layout analysis, the WhiteSpace page
analysis algorithm, the fast layout background detection algorithm, and finally the experimental results.

2.1 Advantages and disadvantages of the layout-background-based approach in document layout analysis

Intuitively, in many cases the layout background is easier to detect than the content, and the page layout can then be easily separated into different areas based on the background. So early on, many page analysis algorithms based on the layout background were developed, typical examples being X-Y Cut, WhiteSpace-Analysis and WhiteSpace-Cuts. More recent background-based algorithms include Fraunhofer (winner at ICDAR 2009), Jouve (winner at ICDAR 2011) and PAL (winner at ICDAR 2013). The background-based approach is used not only in page analysis, but also widely in table detection, table structure analysis and logical structure analysis.

The above examples show that the background-based approach has many advantages. Many different algorithms have been developed for layout background detection, such as X-Y Cuts, WhiteSpace-Analysis and WhiteSpace-Cuts (hereinafter referred to as WhiteSpace). Among them, WhiteSpace is a well-known geometric algorithm for layout background detection. It is included in the open-source OCRopus system, so it is widely used as a basic step in developing other algorithms. However, the WhiteSpace algorithm is quite slow, as shown in Figure 2. Thus, accelerating the WhiteSpace algorithm is of practical significance.

2.2 Layout background detection algorithm (WhiteSpace) for the problem of page analysis

Figure 2: Illustration of the average execution time of each algorithm

2.2.1 Definition

The largest white space in a layout is defined as the largest rectangle that lies inside the envelope (bounding box) of the layout and does not contain any characters, as shown in Figure 3.

Figure 3: The blue rectangle represents the largest white space found

2.2.2 The algorithm for finding the largest white space

The algorithm for finding the largest white space (hereinafter referred to as MaxWhitespace) can be applied to objects that are points or rectangles. The key idea of the algorithm is the branch-and-bound method combined with pivoting as in the Quicksort algorithm. Figures 5 and 4 illustrate the pseudocode of the algorithm and the step of dividing a rectangle into sub-rectangles.

In this dissertation, the input of the algorithm is a set of rectangles (the envelopes of characters), the bound rectangle (the envelope of the whole layout) and a quality function quality(rectangle) that returns the area of a rectangle, see Figure 4.a). The algorithm defines a state consisting of a rectangle r, the set of obstacle rectangles (envelopes of characters) that reside in r, and the area of r (q = quality(r)). State state_i is defined as greater than state state_j if quality(r_i) > quality(r_j). A priority queue is used to store the states.

In each iteration, the algorithm takes the state (q, r, obstacles) at the head of the priority queue, which is the state whose rectangle r has the largest area. If r contains no obstacles, then r is the largest white rectangle and the algorithm terminates. Otherwise, the algorithm selects one of the obstacles as the pivot; the best choice is as close to the center of r as possible, see Figure 4.b). The largest white space does not contain any obstacle, so it does not contain the pivot either. Therefore, there are four possibilities for the largest white space: it lies to the left or to the right of the pivot, see Figure 4.c), or above or below the pivot, see Figure 4.d). Next, the algorithm identifies the obstacles that intersect each of the four sub-rectangles r0, r1, r2, r3 generated from r, see Figure 4, and calculates an upper bound on the largest possible white space in each newly created sub-rectangle; the
upper bound used is the area of each sub-rectangle. Each sub-rectangle, together with the obstacles in it and its upper bound, is pushed into the priority queue, and the above steps are repeated until a state appears whose rectangle r does not contain any obstacles. This rectangle is the global solution of the problem of finding the largest white space.

Figure 4: The step that divides the layout into four sub-regions in the algorithm for finding the largest white space: (a) envelope and rectangles, (b) candidate pivots, (c, d) left/right and above/below sub-regions

```python
import heapq
from collections import namedtuple

Rect = namedtuple("Rect", "x0 y0 x1 y1")  # y grows downwards

def quality(r):
    # Upper bound for a state: the area of the rectangle (0 if degenerate).
    return max(0, r.x1 - r.x0) * max(0, r.y1 - r.y0)

def overlaps(u, r):
    # True if the interiors of the two rectangles intersect.
    return u.x0 < r.x1 and r.x0 < u.x1 and u.y0 < r.y1 and r.y0 < u.y1

def pick(r, obstacles):
    # Pivot: the obstacle whose centre is closest to the centre of r.
    cx, cy = (r.x0 + r.x1) / 2, (r.y0 + r.y1) / 2
    return min(obstacles, key=lambda u: ((u.x0 + u.x1) / 2 - cx) ** 2
                                        + ((u.y0 + u.y1) / 2 - cy) ** 2)

def find_whitespace(bound, rectangles):
    queue = [(-quality(bound), bound, list(rectangles))]  # negated: heapq is a min-heap
    while queue:
        q, r, obstacles = heapq.heappop(queue)
        if not obstacles:
            return r
        pivot = pick(r, obstacles)
        r0 = Rect(pivot.x1, r.y0, r.x1, r.y1)  # right of the pivot
        r1 = Rect(r.x0, r.y0, pivot.x0, r.y1)  # left of the pivot
        r2 = Rect(r.x0, pivot.y1, r.x1, r.y1)  # below the pivot
        r3 = Rect(r.x0, r.y0, r.x1, pivot.y0)  # above the pivot
        for sub_r in (r0, r1, r2, r3):
            sub_q = quality(sub_r)
            if sub_q > 0:  # skip empty sub-rectangles
                # Keep only the obstacles that intersect the sub-rectangle.
                sub_obstacles = [u for u in obstacles if overlaps(u, sub_r)]
                heapq.heappush(queue, (-sub_q, sub_r, sub_obstacles))
    return None
```

Figure 5: The algorithm for finding the largest white space (Python rendering of the pseudocode)

2.2.3 Layout background detection algorithm

To detect the layout background, the WhiteSpace algorithm uses a module that applies the MaxWhitespace algorithm to find m white spaces (m of about 300 is sufficient to describe the layout background well). This background detection algorithm is hereinafter called WhiteSpaceDetection. The diagram of the algorithm is shown in Figure 5b).

2.3 Acceleration of the layout background detection algorithm

To find the white spaces that cover the layout background, the WhiteSpaceDetection algorithm recursively divides the layout into sub-regions until a sub-region does not contain any characters. In each iteration the algorithm divides a sub-region of the layout into four sub-regions, see Figure 4. This process forms a quadtree, so if the number of iterations is large then the number of regions that need to be considered is very large, and the algorithm is very slow. Therefore, in order to accelerate the layout background detection algorithm, it is necessary to minimize the number of sub-regions that need to be considered, by limiting the generation of unnecessary branches of the quadtree.

Figure 6 shows that the region Z_G (the grandparent region) is divided into four sub-regions: the top sub-region Z_PT, the bottom sub-region Z_PB, the left sub-region Z_PL and the right sub-region Z_PR. When Z_PT is divided further, its sub-region Z_CTR lies inside Z_PR, so when Z_PR is processed the area of Z_CTR is considered again. The example in Figure 6 also shows that the sub-region Z_CRT of Z_PR revisits the area of Z_CTR. This division process forms a quadtree, and the further down the tree, the more sub-regions are reconsidered.

In this chapter, the dissertation proposes a solution that minimizes the number of sub-regions being reconsidered. The proposed algorithm (hereinafter referred to as Fast-WhiteSpaceDetection) does not generate sub-regions that lie fully inside previously generated sub-regions, based on the position of the pivot of the region being considered relative to the pivot of its parent region. In the example of Figure 6, the sub-region Z_CTR is not generated because it lies inside the region Z_PR. However, the removal of sub-regions is considered for one pair only, either the left/right sub-regions or the above/below sub-regions, in all regions. That is, if we consider removing the left/right sub-regions, we do not consider removing the above/below sub-regions, and vice versa, because if all four sub-regions were candidates for removal, there would be areas that are never considered, resulting in the omission of some white spaces. For example, in Figure 6, if all four sub-regions were candidates, both Z_CTR and Z_CRT would be removed, so that parts of their intersection would never be considered.
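The redundancy described above can be checked with a few lines of geometry. The following is a minimal sketch and not the dissertation's implementation: the names Rect, split, contains and redundant_children are hypothetical, and image coordinates with y growing downwards are assumed. It splits a region around a parent pivot, splits one of the resulting sub-regions around a child pivot, and reports which child sub-regions lie fully inside one of the other sub-regions of the parent split.

```python
from collections import namedtuple

# Hypothetical rectangle type: (x0, y0) is the top-left corner, (x1, y1) the bottom-right.
Rect = namedtuple("Rect", "x0 y0 x1 y1")

def split(region, pivot):
    """Return the four sub-regions of `region` around `pivot`."""
    return {
        "top": Rect(region.x0, region.y0, region.x1, pivot.y0),
        "bottom": Rect(region.x0, pivot.y1, region.x1, region.y1),
        "left": Rect(region.x0, region.y0, pivot.x0, region.y1),
        "right": Rect(pivot.x1, region.y0, region.x1, region.y1),
    }

def contains(outer, inner):
    """True if `inner` lies fully inside `outer`."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)

def redundant_children(region, parent_pivot, child_pivot, side):
    """Names of the child sub-regions of parents[side] that are fully
    covered by another sub-region of the parent split."""
    parents = split(region, parent_pivot)
    children = split(parents[side], child_pivot)
    return sorted(
        name for name, sub in children.items()
        if any(contains(p, sub) for pname, p in parents.items() if pname != side)
    )

page = Rect(0, 0, 100, 100)
parent_pivot = Rect(40, 40, 50, 50)
child_pivot = Rect(60, 10, 70, 20)  # inside the top sub-region, right of the parent pivot
print(redundant_children(page, parent_pivot, child_pivot, "top"))  # ['right']
```

In this example the right child of the top sub-region (Z_CTR in the notation of Figure 6) falls inside the parent's right sub-region (Z_PR), so exploring it again adds no new area. The sketch only detects such containment; which redundant sub-regions may safely be skipped is governed by the left/right versus above/below restriction discussed above.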
Thus, the improved Fast-WhiteSpaceDetection algorithm produces the following sub-regions (Figure 7):
• Produce the above sub-region.
• Produce the below sub-region.
• Produce the left sub-region if the left coordinate of its pivot is greater than the left coordinate of the pivot of the parent region and the two pivots do not overlap vertically.
• Produce the right sub-region if the right coordinate of its pivot is less than the right coordinate of the pivot of the parent region and the two pivots overlap vertically.

Figure 6: The drawback that slows down the white space search of the WhiteSpaceDetection algorithm: Z_CTR, Z_CRT and their sub-regions are reviewed multiple times

Figure 7: Sub-region generation by the WhiteSpaceDetection and Fast-WhiteSpaceDetection algorithms: a) sub-regions generated by WhiteSpaceDetection, b) sub-regions generated by Fast-WhiteSpaceDetection

2.4 WhiteSpace algorithm and Fast-WhiteSpace algorithm

2.4.1 WhiteSpace algorithm

Analyzing the background structure of the layout is an approach developed by many authors. However, these approaches are difficult to implement, requiring a large number of detailed geometric data structures with many special cases. Therefore, these methods have not been widely applied. The WhiteSpace algorithm proposed by Breuel can be implemented simply, without considering special cases. The main steps of the algorithm are:

Step 1 (Figure 8b): Find the connected components and divide them into three groups based on size: the large group includes graphic objects, lines, etc.; the medium group includes characters (CCs); and the small group includes noise objects.
Step 2 (Figure 8c): Find rectangular white spaces.
Step 3: From the white spaces found, filter to obtain the vertical white spaces (vspace) separating columns and the horizontal white spaces (hspace) separating segments, according to criteria on the size and overlap of the white spaces and the density of the characters adjacent to each white space.

Figure 9: Execution time and accuracy of the Fast-WhiteSpace algorithm compared to those of WhiteSpace and typical algorithms: a) execution
time, b) accuracy

CHAPTER 3. DOCUMENT LAYOUT SEGMENTATION ALGORITHMS HP2S AND AOSM

This chapter presents two document layout analysis algorithms: a hybrid paragraph-level page segmentation algorithm, hereinafter referred to as HP2S, and an adaptive over-split and merge algorithm for page segmentation, hereinafter referred to as AOSM. The first part presents the layout analysis models of both algorithms. The second part presents the phase of HP2S that groups connected components to form text areas. The third part presents the two phases of AOSM: phase 1 segments the layout into candidate text areas, and phase 2 merges the small segmented text areas to form text areas. The phase of segmenting text areas into paragraphs is presented in the fourth part. Finally, the experimental results on the data sets of the page analysis competitions of 2009, 2015 and 2017 and on the UW-III and UNLV data sets are presented.

3.1 Page analysis models of the HP2S and AOSM algorithms

Both algorithms analyze pages with a hybrid approach, a combination of the top-down and bottom-up approaches. In recent years, many powerful algorithms have been developed following the hybrid approach. The general idea is to use low-level information (normally connected components) to identify separators and thereby infer the column structure of the layout, that is, to determine the number of text columns in the layout, which lie on different sides of the separators. Then a grouping method is used to group low-level components into text areas. Finally, the text areas are segmented into paragraphs.

In this section, the thesis presents the page analysis models of both HP2S and AOSM, see Figure 10. Figure 10 shows that HP2S and AOSM apply the same method of segmenting text areas into paragraphs. However, the two algorithms use different approaches to identify the text areas, see Figure 11. HP2S uses a bottom-up approach to group connected components into text areas, while AOSM uses a top-down approach to segment the
layout into candidate text areas and then applies the adaptive parameter method to merge the small segmented text areas. Details of both algorithms are presented in the following sections.

Figure 10: General models of the HP2S and AOSM algorithms

Figure 11: Algorithm diagrams of HP2S and AOSM: a) HP2S algorithm, b) AOSM algorithm

3.2 HP2S algorithm

In this section, the thesis presents the main steps for determining the text areas in the HP2S algorithm. This process consists of three main steps, as illustrated in Figure 12. In step 1, the algorithm detects tab-lines between text columns. In step 2, the algorithm uses the Hough transform and the tab-lines to identify text lines. Finally, the text lines are grouped to form text areas. Details of these steps are presented in Sections 3.2.1, 3.2.2 and 3.2.3.

Figure 12: Main steps for determining text areas in the HP2S algorithm

3.2.1 Tab-line detection

The Tab-Stop algorithm formulates the problem of detecting tab-lines as finding sequences of characters that lie at the beginning or the end of a line (tab-stops) and are vertically aligned. These segmentation lines can replace physical separators or rectangular white spaces in detecting the column structure of a document layout. In this section, I introduce a simple method for detecting tab-lines. The tab-line detection method of HP2S has fewer steps, is simpler and is easier to implement.

3.2.2 Text line identification

First, the Hough transform is applied to the set of midpoints of the bottom edges of the characters to find sequences of horizontally aligned characters. These sequences are the best candidates to form text lines; each of them is called a candidate text line, see Figures 13 and 14. For each candidate text line, the algorithm estimates the horizontal spacing between adjacent characters and words; the spacing between words is denoted by dw. The spacing dw is used together with the segmentation lines to segment the candidate text lines into text lines as follows: two horizontally adjacent characters are in the same text line if they are not on two sides of a certain
segmentation line and their horizontal spacing does not exceed two times dw. Combining segmentation lines with the traditional bottom-up method of identifying text lines helps the algorithm separate the text lines of very close text columns. In some cases the spacing between two columns is almost equal to the spacing between the words of a candidate text line (Figure 13a). However, the vertical segmentation lines help the algorithm segment the candidate text lines into different text lines in different columns, see Figure 13b). When the text columns are not aligned, there is no segmentation line, and the dw parameter is then useful for identifying text lines. In most of these cases, the spacing d between the text lines is greater than the spacing dw between the words (Figure 14).

Unlike traditional bottom-up algorithms, our algorithm does not use a single dw parameter for all candidate text lines. The dw parameter is estimated on each set of characters that have similar font size and lie in the same candidate text line. This remarkably reduces the fragmentation of text lines, especially of text lines in headers (Figure 13b).

Figure 13: Segmentation lines used in the process of identifying text lines: a) candidate text lines; characters located on different sides of a segmentation line belong to different text lines; b) the text lines identified by the algorithm

Figure 14: a) candidate text lines, b) when there is no segmentation line, dw is used to segment characters into text lines

In some cases, for example in reference lists or in paragraphs beginning with special characters, the text areas are often aligned and indented relative to the indices and special characters. Therefore, the segmentation line separates the indices and special characters from the text lines.

In order to fix this type of error, we first find more candidate tab-stops by applying the same tab-stop search method as in Section 3.2.1, with the width of the right adjacent rectangle equal to the width of
the character being considered. Then, the newly found candidate tab-stops that intersect the left candidate tab-stops identified in Section 3.2.1 are updated as reference tab-stops or special characters, denoted by m_tabs. The m_tabs are characters that have been separated from their text line by a segmentation line. Finally, the algorithm combines the m_tabs with the text lines adjacent on the right and labels them as segmentation text lines. The segmentation text lines are re-used in the section on identifying the paragraphs.

3.2.3 Grouping text lines into text areas

In this section, the process of grouping text lines into text areas is presented. A bottom-up approach is used to group adjacent text lines into text areas with arbitrary envelopes. The set of text lines identified in the previous section is sorted from left to right and from top to bottom.

Figure 15: a) original image, b) separation lines, c) defined text areas

A pair of lines (line_i, line_j) that simultaneously satisfies the following conditions is grouped into the same text area:

(i) $\mathrm{DistHoriz}(\mathit{line}_i, \mathit{line}_j) < \mathrm{AvgHoriz}$,
(ii) $\mathrm{CheckRulling}(\mathit{line}_i, \mathit{line}_j)$ is false,
(iii) $\mathrm{CheckTabline}(\mathit{line}_i, \mathit{line}_j)$ is false,
(iv) $|y_i - y_j|$