Lecture Notes in Computer Science- P15 doc

60 J. Qiu et al.

To solve the above problems, a new method needs to be proposed. It should be able to distinguish content pages from non-content pages, and then extract the main contents from content pages without using templates or a DOM-Tree.

In this paper, we propose a novel method for extracting main contents. The main contributions include:

(1) We define a new concept of block and propose a block-partition method for web pages. Without using a DOM-Tree or templates, main contents and noise may be well partitioned into different blocks.

(2) We define the concept of Block Distribution and study its features. Based on these features, we employ a classification method to distinguish content pages from non-content pages, and then employ outlier analysis to get the main contents from the Block Distribution.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to related work. Section 3 presents the block-partition method for web pages. Section 4 introduces the block distribution concept and its statistical features. Section 5 gives a thorough study of the performance of the new method. Section 6 summarizes our work.

2 Related Works

Some works [3, 4, 5] have studied template-based methods for extracting the contents of web pages. Li [3] proposes a hybrid method that employs both tag sequence matching and tree matching to extract news from news web pages. Geng [4] first generates mapping rules from specified news pages, and then employs these rules to extract information from web pages that have the same page structure. Yi [5] assumes that the layout of web pages is fixed within the same website and builds a Style Tree for the website. The contents of the website's pages may then be well extracted by using the Style Tree.

Lin [6] partitions a web page into blocks, and then builds a profile vector for each block. According to the entropy value of each feature in a content block, the entropy of the block may be derived. By entropy, blocks are determined to be either informative or redundant.
Cai [8] utilizes visual cues of web pages, such as layout, font size and color, to extract information. Wang [9] proposes the STU-DOM tree data structure, which may be regarded as a DOM tree with some semantic contextual attributes. After being pruned, the STU-DOM tree can be used to automatically and accurately extract the useful and relevant contents from an HTML document. [10] uses a heuristic method to partition web pages, and then calculates the probability of each individual block. Contents are extracted from blocks with high probability.

There exist three kinds of methods for web page partition: DOM-Based [11], Location-Based [12] and Visual-Based [8]. [6] uses the tag <TABLE> as the basic granularity to partition a web page. [7] proposes an efficient fragment-aware data structure to model dynamic web pages and detect fragments that are shared among documents.

Web Contents Extracting for Web-Based Learning 61

3 Block Partition of Web Page

To extract main contents from web pages, some studies first convert the web page to a DOM-Tree, and then get the contents by traversing the DOM-Tree. Compared with these methods, we want to partition the web page directly into blocks, and then store them in a list structure without using a DOM-Tree. Because it avoids complicated operations on a DOM-Tree, such a method may have better time performance in extracting main contents than DOM-based methods.

In many studies [6, 10, 12], a block is defined as the portion of a web page between an open-tag and its corresponding close-tag. Such blocks may contain much noise besides the main contents, or the main contents may be scattered over multiple blocks. We try to put the main contents into one block without noise, and therefore give a new definition of block.

Definition 1 (block and sub-block): Let S be a sequence of characters which represents a piece of HTML document. For a pair of tags <TAG> and </TAG> in S, s = (<TAG>, …, </TAG>) ⊂ S is a sub-sequence of S starting from <TAG> and ending in </TAG>.
For any sub-sequence s_i ⊂ S, if ∃ s_j ⊂ s_i, then (s_i − s_j) is called a Block; otherwise s_i itself is called a Block. A Block is denoted as B, and s_j is called a sub-block of s_i. A Block B consists of a pair of tags <TAG>, </TAG> and the contents c between the two tags. Let s_i and s_j be two sub-sequences corresponding to blocks b_i and b_j in S respectively. If s_i ∩ s_j ≠ ∅, we say b_i ∩ b_j ≠ ∅.

Definition 2 (Block List): Let BSet be the block collection of a partitioned web page, and let Block-List be a list storing BSet. A node in Block-List corresponds to a block b ∈ BSet. Each node consists of two fields t and c, where t records the open-tag of block b and c records the content of b. Figure 1 gives an example of a Block-List.

By analyzing the structure of web pages, we get Observation 1.

Observation 1:
(1) Most tags in HTML documents occur in pairs. A pair of tags consists of an open-tag <TAG> and a close-tag </TAG>. The contents of the web page appear between tags. Tag pairs may be nested, for example <table><tr></tr></table>.
(2) Some tags may cross each other, for example <Table><Form></Table></Form>.
(3) Some tags may occur singly, for example <br>, <p>.
(4) Some web pages do not strictly comply with the HTML specification. Some tags fail to occur in pairs; we call them missing tags.

After eliminating crossing tags, single tags and missing tags, all tags in an HTML document occur in pairs and are properly nested. Such an HTML document is called a normalized HTML document, and it may be produced by using some techniques. This paper assumes that all HTML documents involved in our work are normalized.

Through some tests, we have observed that the following techniques may well partition a web page:
(1) Keeping tags that involve the structure of the web page, for example <TABLE>, <TR>, <TD>, <DIV>.
(2) Neglecting denoting tags, for example <FONT>, <SPAN>.
(3) Skipping tag pairs that are unrelated to the contents of the web page, for example <STYLE>, <SCRIPT>.
(4) Regarding <A> as a structure tag.
To partition a web page into blocks as defined in this paper, a new method needs to be proposed.

We use a stack to aid the block partition of a web page. While scanning the web page, once an open-tag is met, a block is built. The block is then inserted into Block-List, and at the same time the open-tag and a reference to the block are pushed onto the stack. The top tag in the stack is popped when a close-tag is met. Whatever tag is met, the contents between that tag and the former tag are extracted and inserted into the block corresponding to the top element of the stack. This method is simpler than a DOM-based method. Algorithm 1 describes the process of block partition.

Algorithm 1. Web page block partition
Input: HTML document f
Output: Block-List BL

    s ← build_aid_stack(); BL ← build_Block_list();
    while (NOT EOF of f) {
        tag ← getNextTag();
        content ← getContent();        // get contents between current tag and former tag
        block ← getTop();              // get block corresponding to top tag in stack
        insert(content, block);        // put contents into the block
        if (isNeglect(tag)) continue;  // is insignificant tag?
        if (isJump(tag))               // is skipped tag?
            { jump(); continue; }      // skip tags and contents between them
        if (isOpenTag(tag)) {          // is open tag?
            block ← new Block(tag);
            insert(BL, block);         // insert new block into Block-List
            push(s, tag, block);       // put tag and reference of block onto stack
        } else                         // is close tag?
            pop(s);
    } /* end of while */

Lemma 1: Given a piece of HTML document f, let t1 and t2 be the time costs of building the Block-List and the DOM-Tree for f respectively. Then t1 < t2 may be concluded.

Rationale: Let t1_T and t2_T be the time costs of scanning f when building the Block-List and the DOM-Tree respectively, and let t1_I and t2_I be the times for inserting contents into the Block-List and the DOM-Tree respectively. Then t1 ≈ t1_T + t1_I and t2 ≈ t2_T + t2_I.

(1) When building the Block-List, some tag pairs may be neglected or skipped, whereas every tag is processed when building the DOM-Tree. Thus, t1_T < t2_T.
(2) Inserting contents into the Block-List is a simple operation: the reference to the block is obtained from the top element of the stack. However, before inserting contents into the DOM-Tree, the insertion position must first be located in the DOM-Tree. Thus, t1_I < t2_I.

(3) By (1) and (2), t1 < t2.

Example 1: Use Algorithm 1 to build the Block-List for the web page shown in Fig. 1(a). Fig. 1(b) is the derived Block-List.

<DIV>Hello
<TABLE><TR><TD>USA</TD></TR>
<TR><TD>CHN</TD></TR></TABLE>
<DIV>GBK</DIV>World
</DIV>

Fig. 1(a). A portion of HTML document

Fig. 1(b). Block-List (nodes as tag and content: DIV "Hello World"; TABLE; TR; TD "USA"; TR; TD "CHN"; DIV "GBK")

4 Block Distribution and Main Contents Extraction

In practical applications, before extracting contents from web pages, the first step is to determine whether the web pages contain main contents. Web pages may be divided into two types: content pages and non-content pages. A non-content page carries various information but no main contents. A content page contains main contents. Fig. 2 gives an example of a content page and a non-content page. To distinguish content pages from non-content pages, it is necessary to study the features of web pages, and then use these features to classify web pages.

Fig. 2(a). Non-content page
Fig. 2(b). Content page (the main content is marked)
Fig. 3. Curve of block distribution (x-axis: index of blocks; y-axis: size of blocks)

Definition 3 (Block Distribution and Block Distribution Curve): Given a Block-List BL, let o be a node in BL, c be the content of o, and n be the size of c. The collection {n_1, …, n_k} represents the sizes of all blocks in BL. After the collection is sorted in descending order, we call the sequence D = (n_1, …, n_k) the block distribution of the web page. Let the y-axis represent n_i and the x-axis represent the index of n_i in D. D is then represented as a piecewise curve, called the Block Distribution Curve.

By using Algorithm 1, we can derive Block Distributions for web pages. Fig. 3 shows examples of Block Distribution Curves.
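The stack-based partition of Algorithm 1 and the block distribution of Definition 3 can be sketched in Python as follows. This is only a minimal illustration, not the authors' Java implementation. The tag sets `NEGLECTED` and `SKIPPED` contain just the examples named in Section 3, the regular-expression tag scanner is a simplification, block size is assumed to be the number of characters of content, and the names `partition` and `block_distribution` are hypothetical. The input is assumed to be a normalized HTML document.

```python
import re

# Assumed tag categories, taken from the examples in Section 3; the full
# sets used by the authors are not given in the paper.
NEGLECTED = {"font", "span"}     # denoting tags: ignored, their text is kept
SKIPPED = {"style", "script"}    # tag pairs skipped together with their content
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def partition(html):
    """Return the Block-List as (open-tag, content) pairs, in creation order.

    Assumes a normalized document: every remaining tag is paired and nested.
    """
    block_list = []   # every block ever created
    stack = []        # blocks whose close-tag has not been seen yet
    pos = 0
    while True:
        match = TAG_RE.search(html, pos)
        if match is None:
            break
        # Text between the former tag and this one goes to the top block.
        content = html[pos:match.start()].strip()
        if content and stack:
            stack[-1]["content"].append(content)
        pos = match.end()
        is_close = match.group(1) == "/"
        name = match.group(2).lower()
        if name in NEGLECTED:
            continue
        if name in SKIPPED:
            if not is_close:
                # Jump past the matching close-tag, dropping what is inside.
                closing = re.compile("</" + name + r"\s*>", re.IGNORECASE)
                end = closing.search(html, pos)
                pos = end.end() if end else len(html)
            continue
        if is_close:
            if stack:
                stack.pop()
        else:
            block = {"tag": name.upper(), "content": []}
            block_list.append(block)
            stack.append(block)
    return [(b["tag"], " ".join(b["content"])) for b in block_list]


def block_distribution(blocks):
    """Sizes of all block contents, sorted in descending order (Definition 3)."""
    return sorted((len(content) for _, content in blocks), reverse=True)


# The HTML fragment of Fig. 1(a), written on one line.
EXAMPLE_1 = (
    "<DIV>Hello <TABLE><TR><TD>USA</TD></TR> "
    "<TR><TD>CHN</TD></TR></TABLE> <DIV>GBK</DIV>World </DIV>"
)
blocks = partition(EXAMPLE_1)
distribution = block_distribution(blocks)
```

On the fragment of Fig. 1(a), `blocks` holds seven nodes: the outer DIV with content "Hello World", an empty TABLE, two empty TR nodes, the TD nodes "USA" and "CHN", and the inner DIV "GBK". Under the character-count assumption, `distribution` is `[11, 3, 3, 3, 0, 0, 0]`.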
Algorithm 1 may well put the main contents into one block and scatter noise over multiple blocks. If the content block is large enough, the Block Distributions of content pages and non-content pages differ obviously. For example, in Fig. 3, the Block Distribution Curve of a content page, Curve 1, is steeper than the Block Distribution Curve of a non-content page, Curve 5.

Lemma 2: Let Var(D) denote the variance of a block distribution D. Let D1 = (n_1, …, n_m) and D2 = (n_1 + k, …, n_m), k > 0, be the block distributions of two web pages, so that D1 and D2 are equal except for the value at index 1. Then Var(D2) > Var(D1) can be concluded.

Proof: see appendix.

Lemma 2 shows that the larger the main content block in a Block Distribution is, the larger the variance of the Block Distribution is. Because there is no obviously large block in the Block Distribution of a non-content page, its variance will be small. Therefore variance may be used to distinguish content pages from non-content pages. However, using variance alone sometimes does not give a good enough result: Test 2 in Section 5 demonstrates that variance alone does not distinguish content pages from non-content pages with sufficient accuracy. So we introduce a new feature for block distribution.

Definition 4 (Bending of Block Distribution β): Let D = (n_1, …, n_k) be the Block Distribution of a web page. In the Block Distribution Curve of D, α_i (i = 1, …, k−1) is the slope of the i-th piece of the curve. Then

β(D) = Max(α_1, …, α_{k−1}) − Min(α_1, …, α_{k−1})

is called the bending of D.

If two blocks in a Block Distribution have the same size (so that the largest slope is 0), the bending equals the maximum difference between two adjacent blocks in the Block Distribution. For example, the bending of Block Distribution D1 = (5, 2, 2, 2, 2) is β(D1) = 3.

After deriving the variance and bending of each Block Distribution, a classification algorithm may be employed to distinguish content pages from non-content pages.
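The two features can be computed with a few lines of Python. The sketch below is illustrative only and rests on assumptions the text does not state: variance is taken as the population variance, adjacent indices on the x-axis are taken to be one unit apart (so each slope is simply the difference of adjacent sizes), and the function names `slopes`, `bending` and `features` are hypothetical.

```python
from statistics import pvariance


def slopes(distribution):
    """Slope of each segment of the curve, taking adjacent indices 1 apart."""
    return [b - a for a, b in zip(distribution, distribution[1:])]


def bending(distribution):
    """Definition 4: largest segment slope minus smallest segment slope."""
    alphas = slopes(distribution)
    if not alphas:            # fewer than two blocks: no segment exists
        return 0
    return max(alphas) - min(alphas)


def features(distribution):
    """(variance, bending) pair that a classifier could take as input."""
    return pvariance(distribution), bending(distribution)


d1 = [5, 2, 2, 2, 2]          # the example distribution from the text
d2 = [8, 2, 2, 2, 2]          # d1 with a larger first block (k = 3)
variance_1, bending_1 = features(d1)
variance_2, bending_2 = features(d2)
```

For `d1` the slopes are -3, 0, 0, 0, so `bending_1` is 3, which agrees with the example in the text. `variance_2` is larger than `variance_1`, in line with the remark after Lemma 2 that a larger main content block gives a larger variance.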
Test 2 shows that classification methods may well distinguish content pages from non-content pages based on these two features.

In content pages, the main content block is large and sparse. Compared with noise blocks, it is suitable to consider content blocks as outliers. In application, we employ a deviation-based outlier detection algorithm [13] to derive the content blocks. The contents in the content blocks are the main contents of the web page. Experiments demonstrate the feasibility of our method.

5 Experiments and Results

In this section, we perform a thorough analysis of our method. All experiments were implemented in Java and conducted on an Intel P2.6G system with 512M of RAM.

5.1 Dataset

Experiments are conducted on three data sets. Dataset1 consists of 543 web pages (220 content pages (news pages) and 323 non-content pages) collected from the websites SOHU, YAHOO, CHINA and Netease. Dataset2 comes from Chinese Web .
