Proceedings of ACL-08: HLT, pages 299-307, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

A Generic Sentence Trimmer with CRFs

Tadashi Nomoto
National Institute of Japanese Literature
10-3 Midori, Tachikawa, Tokyo 190-0014, Japan
nomoto@acm.org

Abstract

The paper presents a novel sentence trimmer for Japanese, which combines a non-statistical yet generic tree generation model with Conditional Random Fields (CRFs), to improve the grammaticality of compressions while retaining their relevance. Experiments found that the present approach outperforms, in both grammaticality and relevance, a dependency-centric approach (Oguro et al., 2000; Morooka et al., 2004; Yamagata et al., 2006; Fukutomi et al., 2007), the only line of work in the prior literature on Japanese compression we are aware of that allows replication and permits a direct comparison.

1 Introduction

For better or worse, much of the prior work on sentence compression (Riezler et al., 2003; McDonald, 2006; Turner and Charniak, 2005) turned to a single corpus developed by Knight and Marcu (2002) (K&M, henceforth) for evaluating their approaches. The K&M corpus is a moderately sized corpus consisting of 1,087 pairs of sentence and compression, which account for about 2% of the Ziff-Davis collection from which it was derived. Despite its limited scale, prior work in sentence compression relied heavily on this particular corpus for establishing results (Turner and Charniak, 2005; McDonald, 2006; Clarke and Lapata, 2006; Galley and McKeown, 2007). It was not until recently that researchers started to turn their attention to an alternative approach that does not require supervised data (Turner and Charniak, 2005).

Our approach is broadly in line with prior work (Jing, 2000; Dorr et al., 2003; Riezler et al., 2003; Clarke and Lapata, 2006), in that we make use of some form of syntactic knowledge to constrain the compressions we generate. What sets this work apart, however, is a novel use of Conditional Random Fields (CRFs) to select among possible compressions (Lafferty et al., 2001; Sutton and McCallum, 2006). An obvious benefit of using CRFs for sentence compression is that the model provides a general and principled probabilistic framework which permits information from various sources to be integrated towards compressing a sentence, a property K&M do not share.

Nonetheless, there is some cost that comes with the straightforward use of CRFs as a discriminative classifier in sentence compression: its outputs are often ungrammatical, and it allows no control over the length of the compressions it generates (Nomoto, 2007). We tackle these issues by harnessing CRFs with what we might call dependency truncation, whose goal is to restrict CRFs to working with candidates that conform to the grammar.

Thus, unlike McDonald (2006), Clarke and Lapata (2006) and Cohn and Lapata (2007), we do not insist on finding a globally optimal solution in the space of $2^n$ possible compressions for an $n$-word sentence. Rather, we insist on finding a most plausible compression among those that are explicitly warranted by the grammar.

Later in the paper, we will introduce an approach called the 'Dependency Path Model' (DPM) from the previous literature (Section 4), which purports to provide a robust framework for sentence compression in Japanese. We will look at how the present approach compares with DPM in Section 6.
2 A Sentence Trimmer with CRFs

Our idea on how to make CRFs comply with grammar is quite simple: we focus only on those label sequences that are associated with grammatically correct compressions, by making CRFs look only at those that comply with some grammatical constraints G, and ignore the others, regardless of how probable they are.[1] But how do we find compressions that are grammatical? To address the issue, rather than resort to statistical generation models as in the previous literature (Cohn and Lapata, 2007; Galley and McKeown, 2007), we pursue a particular rule-based approach we call 'dependency truncation,' which, as we will see, gives us greater control over the form a compression takes.

Let us denote the set of label assignments for S that satisfy the constraints by G(S).[2] We seek to solve the following:

$$y^\star = \arg\max_{y \in G(S)} p(y \mid x; \boldsymbol{\theta}). \qquad (2)$$

There would be a number of ways to go about the problem. In the context of sentence compression, a linear programming based approach such as Clarke and Lapata (2006) is certainly one that deserves consideration. In this paper, however, we explore a much simpler approach which does not require as involved a formulation as Clarke and Lapata (2006). We approach the problem extensionally, i.e., by generating sentences that are grammatical, or that conform to whatever constraints there are.

Footnote 1: Assume as usual that CRFs take the form

$$p(y \mid x) \propto \exp\Big(\sum_{k,j} \lambda_j f_j(y_k, y_{k-1}, x) + \sum_{i} \mu_i g_i(x_k, y_k, x)\Big) = \exp\big[\mathbf{w}^\top \mathbf{f}(x, y)\big] \qquad (1)$$

where f_j and g_i are 'features' associated with edges and vertices, respectively, and k ∈ C, where C denotes the set of cliques in the CRF. λ_j and µ_i are the weights for the corresponding features; w and f are vector representations of the weights and features, respectively (Tasker, 2004).

Footnote 2: Note that a sentence compression can be represented as an array of binary labels, one of them marking words to be retained in the compression and the other those to be dropped.

Consider the following.

(3) Mushoku-no John-ga takai kuruma-wo kat-ta.
    unemployed John-SBJ expensive car-ACC buy-PAST
    'John, who is unemployed, bought an expensive car.'

Its grammatically legitimate compressions would include:

(4) (a) John-ga takai kuruma-wo kat-ta. 'John bought an expensive car.'
    (b) John-ga kuruma-wo kat-ta. 'John bought a car.'
    (c) Mushoku-no John-ga kuruma-wo kat-ta. 'John, who is unemployed, bought a car.'
    (d) John-ga kat-ta. 'John bought.'
    (e) Mushoku-no John-ga kat-ta. 'John, who is unemployed, bought.'
    (f) Takai kuruma-wo kat-ta. 'Bought an expensive car.'
    (g) Kuruma-wo kat-ta. 'Bought a car.'
    (h) Kat-ta. 'Bought.'

This gives us G(S) = {a, b, c, d, e, f, g, h} for input (3). Whatever choice we make among the candidates in G(S) should be grammatical, since they all are.

One linguistic feature of the Japanese language we need to take into account when generating compressions is that the sentence, which has free word order and is verb-final, typically takes a left-branching structure as in Figure 1, consisting of an array of morphological units called bunsetsu (BS, henceforth).

[Figure 1: Syntactic structure in Japanese]
A BS, which we might regard as an inflected form (case-marked in the case of nouns) of a verb, adjective, or noun, may involve one or more independent linguistic elements such as a noun and a case particle, but it acts as a morphological atom, in that it cannot be torn apart, or partially deleted, without compromising grammaticality.[3]

Footnote 3: Example (3) can be broken into BSs as: / Mushoku-no / John-ga / takai / kuruma-wo / kat-ta /.

Noting that a Japanese sentence typically consists of a sequence of case-marked NPs and adjuncts, followed by a main verb at the end (what would be called the 'matrix verb' in linguistics), we seek to compress each of the major chunks in the sentence, leaving the matrix verb untouched, as its removal often leaves the sentence unintelligible. In particular, starting with the leftmost BS in a major constituent, we work up the tree, pruning BSs on our way up, which in general gives rise to grammatically legitimate compressions of various lengths (Figure 2).

[Figure 2: Compressing an NP chunk]

More specifically, we take the following steps to construct G(S). Let S = ABCDE, and assume that it has a dependency structure as in Figure 3. We begin by locating the terminal nodes, i.e., those which have no incoming edges, depicted as filled circles in Figure 3, and find a dependency (singly linked) path from each terminal node to the root, the node labeled 'E' here. This gives us two paths, p1 = A-C-D-E and p2 = B-C-D-E (call them terminating dependency paths, or TDPs).

[Figure 3: Trimming TDPs]

Now create a set T of all trimmings, or suffixes, of each TDP, including the empty string:

  T(p1) = {<A C D E>, <C D E>, <D E>, <E>, <>}
  T(p2) = {<B C D E>, <C D E>, <D E>, <E>, <>}

Then we merge subpaths from the two sets in every possible way, i.e., for any two subpaths t1 ∈ T(p1) and t2 ∈ T(p2), we take the union over the nodes in t1 and t2; Figure 4 shows how this might be done. We remove duplicates, if any. This gives us G(S) = {{A B C D E}, {A C D E}, {B C D E}, {C D E}, {D E}, {E}, {}}, a set of compressions over S based on TDPs.

What is interesting about the idea is that creating G(S) does not involve much of anything that is specific to a given language. Indeed, this could be done for English as well. Take for instance the sentence at the top of Table 1, which is a slightly modified lead sentence from an article in the New York Times. Assume that we have the relevant dependency structure shown in Figure 5, where we have three TDPs: one with southern, one with British, and one with lethal. Then G(S) would include the compressions listed in Table 1. A major difference from Japanese lies in the direction in which the tree branches out: right versus left.[4]

Footnote 4: We stand in marked contrast to previous 'grafting' approaches, which more or less rely on an ad-hoc collection of transformation rules to generate candidates (Riezler et al., 2003).
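To make the construction of G(S) just described concrete, the following is a minimal sketch in Python. It assumes the dependency structure is given as a mapping from each node to its head, with the root mapping to None; the data structure, node names, and function names are ours for illustration, not the paper's implementation.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Optional, Set

def terminating_dependency_paths(head: Dict[str, Optional[str]]) -> List[List[str]]:
    """Collect the TDPs: paths from each terminal node (no incoming edges) to the root.

    `head` maps every node to the node it depends on; the root maps to None.
    """
    heads = {h for h in head.values() if h is not None}
    terminals = [n for n in head if n not in heads]   # nodes nothing depends on
    paths = []
    for t in terminals:
        path, node = [], t
        while node is not None:                       # follow the singly linked path to the root
            path.append(node)
            node = head[node]
        paths.append(path)
    return paths

def candidate_set(head: Dict[str, Optional[str]]) -> Set[FrozenSet[str]]:
    """Build G(S): merge one suffix (possibly empty) of every TDP and drop duplicates."""
    suffixes_per_tdp = [
        [frozenset(p[i:]) for i in range(len(p) + 1)]  # <A C D E>, <C D E>, ..., <>
        for p in terminating_dependency_paths(head)
    ]
    return {frozenset().union(*combo) for combo in product(*suffixes_per_tdp)}

# The worked example above: S = ABCDE with edges A->C, B->C, C->D, D->E.
head = {"A": "C", "B": "C", "C": "D", "D": "E", "E": None}
for cand in sorted(candidate_set(head), key=len, reverse=True):
    print(sorted(cand))
# prints {A,B,C,D,E}, {A,C,D,E}, {B,C,D,E}, {C,D,E}, {D,E}, {E} and the empty set
```

On real input, the nodes would be bunsetsu taken from KNP's dependency output rather than letters, and the language-specific filters discussed below would be applied on top of the resulting set.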
Table 1: Hedge-clipping in English
  An official was quoted yesterday as accusing Iran of supplying explosive technology used in lethal attacks on British troops in southern Iraq
  An official was quoted yesterday as accusing Iran of supplying explosive technology used in lethal attacks on British troops in Iraq
  An official was quoted yesterday as accusing Iran of supplying explosive technology used in lethal attacks on British troops
  An official was quoted yesterday as accusing Iran of supplying explosive technology used in lethal attacks on troops
  An official was quoted yesterday as accusing Iran of supplying explosive technology used in lethal attacks
  An official was quoted yesterday as accusing Iran of supplying explosive technology used in attacks
  An official was quoted yesterday as accusing Iran of supplying explosive technology
  An official was quoted yesterday as accusing Iran of supplying technology

[Figure 4: Combining TDP suffixes]
[Figure 5: An English dependency structure and TDPs]

Having said this, we need to address some language-specific constraints. In Japanese, for instance, we should keep a topic-marked NP in the compression, as its removal often leads to decreased readability; it is also grammatically wrong to start a compressed segment with sentence nominalizers such as -koto and -no. In English, we should keep a preposition from being left dangling, as in "An official was quoted yesterday as accusing Iran of supplying technology used in." In any case, we need some extra rules on G(S) to take care of language-specific issues (cf. Vandeghinste and Pan (2004) for English). An important point about the dependency truncation is that most of the time a compression it generates comes out reasonably grammatical, so the number of 'extras' should be small.

Finally, in order for CRFs to work with the compressions, we need to translate them into sequences of binary labels, which involves labeling each element token, a bunsetsu or a word, with a label, e.g., 0 for 'remove' and 1 for 'retain,' as in Figure 6.

[Figure 6: Compression in binary representation]

Consider the following compressions y1 to y4 for x = β1 β2 β3 β4 β5 β6, where βi denotes a bunsetsu (BS), '0' marks a BS to be removed and '1' one to be retained.

        β1  β2  β3  β4  β5  β6
  y1     0   1   1   1   1   1
  y2     0   0   1   1   1   1
  y3     0   0   0   0   0   1
  y4     0   0   1   0   0   0

Assume that G(S) = {y1, y2, y3}. Because y4 is not part of G(S), it is not considered a candidate compression, even if its likelihood exceeds those of the others in G(S). We note that the approach here relies on CRFs not so much as a discriminative classifier as a strategy for ranking a limited set of label sequences which correspond to syntactically plausible simplifications of the input sentence.

Furthermore, we can dictate the length of a compression by putting an additional constraint on the output, as in

$$y^\star = \arg\max_{y \in G'(S)} p(y \mid x; \boldsymbol{\theta}), \qquad (5)$$

where G'(S) = {y : y ∈ G(S), R(y, x) = r}. R(y, x) denotes the compression rate r at which y is desired, where

$$r = \frac{\#\ \text{of 1s in}\ y}{\text{length of}\ x}.$$

The constraint forces the trimmer to look for the best solution among candidates that satisfy the constraint, ignoring those that do not.[5]

Footnote 5: It is worth noting that the present approach can be recast as one based on 'constraint relaxation' (Tromble and Eisner, 2006).

Another point to note is that G(S) is finite and relatively small: we found that for our domain |G(S)| usually runs somewhere between a few hundred and ten thousand, so in practice it suffices to visit each compression in G(S) and select the one that gives the maximum value of the objective function. We will have more to say about the size of the search space in Section 6.
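Because G(S) is small enough to enumerate, the selection in Eqs. 2 and 5 amounts to scoring every candidate and keeping the best one that meets the rate constraint. The sketch below is ours: candidates are binary label lists as above, `score` is a stand-in for the trained CRF's log-probability log p(y | x; θ) (the paper's model is built with GRMM), and the small tolerance on the rate is an assumption, since few candidates will hit a target rate exactly.

```python
from typing import Callable, List, Optional, Sequence

Candidate = List[int]          # binary labels over bunsetsu: 1 = retain, 0 = drop

def compression_rate(y: Candidate) -> float:
    """R(y, x) = (# of 1s in y) / length of x (one label per bunsetsu of x)."""
    return sum(y) / len(y)

def best_compression(
    x: Sequence[str],                                    # the input sentence, one bunsetsu per item
    candidates: List[Candidate],                         # G(S), e.g. from the TDP construction above
    score: Callable[[Sequence[str], Candidate], float],  # stand-in for log p(y | x; theta)
    target_rate: Optional[float] = None,                 # r in Eq. 5; None means plain Eq. 2
    tolerance: float = 0.05,                             # assumed slack around the target rate
) -> Candidate:
    pool = candidates
    if target_rate is not None:
        pool = [y for y in candidates if abs(compression_rate(y) - target_rate) <= tolerance]
    if not pool:
        raise ValueError("no candidate in G(S) satisfies the rate constraint")
    # |G(S)| is typically a few hundred to ten thousand, so exhaustive search is fine.
    return max(pool, key=lambda y: score(x, y))
```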
3 Features in CRFs

We use an array of features in CRFs which are either derived or borrowed from the taxonomy that JUMAN, a Japanese tokenizer, and KNP, a Japanese dependency parser (aka the Kurohashi-Nagao Parser), make use of in characterizing the output they produce;[6] both JUMAN and KNP are part of the compression model we build.

Footnote 6: http://nlp.kuee.kyoto-u.ac.jp/nl-resource/top-e.html

Features come in three varieties: semantic, morphological and syntactic. Semantic features are used for classifying entities into semantic types such as name of person, organization, or place, while syntactic features characterize the kinds of dependency relations that hold among BSs, such as whether a BS is of the type that combines with a verb (renyou) or of the type that combines with a noun (rentai), etc. A morphological feature can be thought of as something that broadly corresponds to an English POS, marking some syntactic or morphological category such as noun, verb, numeral, etc. We also included n-gram features to encode the lexical context in which a given morpheme appears: for some words (morphemes) w1, w2, and w3, f_{w1·w2}(w3) = 1 if w3 is preceded by w1, w2, and 0 otherwise. In addition, we make use of an IR-related feature, whose job is to indicate whether a given morpheme in the input appears in the title of the associated article; the motivation for this feature is obviously to identify concepts relevant to, or unique to, the associated article. Also included was a tfidf feature, to mark words that are conceptually more important than others. The number of features came to around 80,000 for the corpus we used in the experiment.

4 The Dependency Path Model

In what follows, we describe in some detail a prior approach to sentence compression in Japanese which we call the "dependency path model," or DPM. DPM was first introduced by Oguro et al. (2000) and later explored by a number of people (Morooka et al., 2004; Yamagata et al., 2006; Fukutomi et al., 2007).[7]

Footnote 7: Kikuchi et al. (2003) explore an approach similar to DPM.

DPM has the form

$$h(y) = \alpha f(y) + (1 - \alpha)\, g(y), \qquad (6)$$

where y = β0, β1, ..., β_{n-1}, i.e., a compression consisting of any number of bunsetsu's, or phrase-like elements. f(·) measures the relevance of the content in y, and g(·) the fluency of the text; α provides a way of weighing the contributions from each component. We further define

$$f(y) = \sum_{i=0}^{n-1} q(\beta_i), \qquad (7)$$

and

$$g(y) = \max_{s} \sum_{i=0}^{n-2} p(\beta_i, \beta_{s(i)}). \qquad (8)$$

q(·) is meant to quantify how worthy of inclusion in a compression a given bunsetsu is, and p(βi, βj) represents the connectivity strength of the dependency relation between βi and βj. s(·) is a linking function that associates with a bunsetsu any one of those that follow it. g(y) thus represents the set of linked edges that, if combined, give the largest probability for y.

Dependency path length (DL) refers to the number of (singly linked) dependency relations (or edges) that span two bunsetsu's. Consider the dependency tree in Figure 7, which corresponds to the somewhat contrived sentence 'Three-legged dogs disappeared from sight.' Take an English word for a bunsetsu here.

[Figure 7: A dependency structure]
We have

  DL(three-legged, dogs) = 1
  DL(three-legged, disappeared) = 2
  DL(three-legged, from) = ∞
  DL(three-legged, sight) = ∞

Since dogs is one edge away from three-legged, the DL for the pair is 1; and we have a DL of two for three-legged and disappeared, as we need to cross two edges in the direction of the arrows to get from the former to the latter. In case there is no path between words, as in the last two cases above, we take the DL to be infinite.

DPM takes a dependency tree to be a set of linked edges. Each edge is expressed as a triple <C_s(βi), C_e(βj), DL(βi, βj)>, where βi and βj represent the bunsetsu's that the edge spans. C_s(β) denotes the class of the bunsetsu where the edge starts and C_e(β) that of the bunsetsu where the edge ends. What we mean by 'class of bunsetsu' is some sort of classificatory scheme that concerns linguistic characteristics of the bunsetsu, such as the part of speech of its head, whether it has an inflection, and if it does, what type of inflection it has, etc. Moreover, DPM uses two separate classificatory schemes for C_s(β) and C_e(β).

In DPM, we define the connectivity strength p by

$$p(\beta_i, \beta_j) =
\begin{cases}
\log S(t) & \text{if } DL(\beta_i, \beta_j) \neq \infty \\
-\infty & \text{otherwise}
\end{cases} \qquad (9)$$

where t = <C_s(βi), C_e(βj), DL(βi, βj)>, and S(t) is the probability of t occurring in a compression, which is given by

$$S(t) = \frac{\#\ \text{of}\ t\text{'s found in compressions}}{\#\ \text{of triples found in the training data}}. \qquad (10)$$

We complete the DPM formulation with

$$q(\beta) = \log p_c(\beta) + \mathrm{tfidf}(\beta). \qquad (11)$$

p_c(β) denotes the probability of having bunsetsu β in a compression, calculated analogously to Eq. 10,[8] and tfidf(β) denotes the tfidf value of β.

Footnote 8: DPM puts bunsetsu's into groups based on linguistic features associated with them, and uses the statistics of the groups for p_c rather than those of the bunsetsu's that actually appear in the text.

In DPM, a compression of a given sentence is obtained by finding arg max_y h(y), where y ranges over the possible candidate compressions of a particular length that may be derived from that sentence. In the experiment described later, we set α = 0.1 for DPM, following Morooka et al. (2004), who found the best performance with that setting for α.
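To summarize Eqs. 6-11, here is a rough sketch of how a single candidate might be scored under DPM. It assumes the triple statistics S(t) and the per-bunsetsu scores q(β) = log p_c(β) + tfidf(β) have already been estimated from training data, that the classes C_s and C_e of each bunsetsu are precomputed, and that the linking function s(·) chooses, for each retained bunsetsu, the best-scoring bunsetsu that follows it in the compression; since each s(i) is chosen independently, the max over s in Eq. 8 decomposes into per-bunsetsu maxima. All names are ours; this is not the implementation used in the experiments.

```python
import math
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, int]   # (start class C_s, end class C_e, dependency path length)

def dep_length(head: Dict[int, Optional[int]], i: int, j: int) -> Optional[int]:
    """DL(b_i, b_j): number of head links leading from b_i to b_j, or None if there is no path."""
    steps, node = 0, i
    while node is not None:
        if node == j:
            return steps
        node = head[node]
        steps += 1
    return None

def connectivity(s_stats: Dict[Triple, float], cls_s: List[str], cls_e: List[str],
                 head: Dict[int, Optional[int]], i: int, j: int) -> float:
    """p(b_i, b_j) = log S(t) if DL(b_i, b_j) is finite, and -infinity otherwise (Eq. 9)."""
    dl = dep_length(head, i, j)
    if dl is None:
        return -math.inf
    s = s_stats.get((cls_s[i], cls_e[j], dl), 0.0)   # S(t), estimated as in Eq. 10
    return math.log(s) if s > 0 else -math.inf

def dpm_score(y: List[int], q: List[float], s_stats: Dict[Triple, float],
              cls_s: List[str], cls_e: List[str],
              head: Dict[int, Optional[int]], alpha: float = 0.1) -> float:
    """h(y) = alpha * f(y) + (1 - alpha) * g(y), computed over the retained bunsetsu (Eq. 6)."""
    kept = [i for i, label in enumerate(y) if label == 1]
    f = sum(q[i] for i in kept)                      # Eq. 7: content relevance
    g = sum(                                         # Eq. 8: link each bunsetsu to its best follower
        max(connectivity(s_stats, cls_s, cls_e, head, i, j) for j in kept[pos + 1:])
        for pos, i in enumerate(kept[:-1])
    )
    return alpha * f + (1 - alpha) * g
```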
5 Evaluation Setup

We created a corpus of sentence summaries based on email news bulletins we had received over five to six months from an on-line news provider called Nikkei Net, which mostly deals with finance and politics.[9] Each bulletin consists of six to seven news briefs, each with a few sentences. Since a news brief contains nothing to indicate what its longer version might look like, we manually searched the news site for a full-length article that might reasonably be considered a long version of that brief. We extracted lead sentences both from the brief and from its source article, and aligned them using what is known as the Smith-Waterman algorithm (Smith and Waterman, 1981), which produced 1,401 pairs of summary and source sentence.[10] For ease of reference, we call the corpus so produced 'NICOM' for the rest of the paper. A part of our system makes use of a modeling toolkit called GRMM (Sutton et al., 2004; Sutton, 2006). Throughout the experiments, we call our approach the 'Generic Sentence Trimmer,' or GST.

Footnote 9: http://www.nikkei.co.jp
Footnote 10: The Smith-Waterman algorithm aims at finding a best match between two sequences which may include gaps, such as A-C-D-E and A-B-C-D-E. The algorithm is based on an idea rather akin to dynamic programming.

Table 2: The rating scale on fluency
  RATING  EXPLANATION
  1       makes no sense
  2       only partially intelligible/grammatical
  3       makes sense; seriously flawed in grammar
  4       makes good sense; only slightly flawed in grammar
  5       makes perfect sense; no grammar flaws

6 Results and Discussion

We ran DPM and GST on NICOM in the 10-fold cross-validation format, where we break the data into 10 blocks, use 9 of them for training, and test on the remaining block. In addition, we ran the test at three different compression rates, 50%, 60% and 70%, to learn how they affect the way the models perform. This means that for each input sentence in NICOM, we have three versions of its compression, each corresponding to a particular rate at which the sentence is compressed. We call the set of compressions so generated 'NICOM-g.'

In order to evaluate the quality of the outputs GST and DPM generate, we asked 6 people, all Japanese natives, to make an intuitive judgment on how each compression fares in fluency and in relevance to the gold standards (created by humans), on a scale of 1 to 5. To this end, we conducted the evaluation in two separate formats: one concerns fluency and the other relevance. The fluency test consisted of a set of compressions which we created by randomly selecting 200 of them from NICOM-g, for each model at compression rates 50%, 60%, and 70%; thus we have 200 samples for each model and each compression rate.[11] The total number of test compressions came to 1,200. The relevance test, on the other hand, consisted of paired compressions along with the associated gold standard compressions. Each pair contains compressions both from DPM and from GST at a given compression rate. We randomly picked 200 of them from NICOM-g at each compression rate, and asked the participants to make a subjective judgment on how much of the content in a compression semantically overlaps with that of the gold standard, on a scale of 1 to 5 (Table 3). Also included in the survey were 200 gold standard compressions, to get some idea of how fluent "ideal" compressions are compared to those generated by machine.

Footnote 11: As stated elsewhere, by compression rate we mean r = (# of 1s in y) / (length of x).

Table 3: The rating scale on content overlap
  RATING  EXPLANATION
  1       no overlap with reference
  2       poor or marginal overlap w. ref.
  3       moderate overlap w. ref.
  4       significant overlap w. ref.
  5       perfect overlap w. ref.

Tables 4 and 5 summarize the results. Table 4 looks at the fluency of the compressions generated by each of the models; Table 5 looks at how much of the content in the reference is retained in the compressions. In either table, CR stands for compression rate. All results are averaged over samples.

Table 4: Fluency (Average)
  MODEL/CR   50%    60%    70%
  GST        3.430  3.820  3.810
  DPM        2.222  2.372  2.660
  Human      -      4.45   -

Table 5: Semantic (Content) Overlap (Average)
  MODEL/CR   50%    60%    70%
  GST        2.720  3.181  3.405
  DPM        2.210  2.548  2.890

We find in Table 4 a clear superiority of GST over DPM at every compression rate examined, with fluency improved by as much as 60% at CR = 60%. However, GST fell short of what human compressions achieved in fluency, an issue we need to address in the future. Since the average CR of the gold standard compressions was 60%, we report their fluency at that rate only.

Table 5 shows the results for relevance of content. Again GST marks a superior performance over DPM, beating it at every compression rate. It is interesting to observe that GST manages to do well in semantic overlap, despite the cutback on the search space we forced on GST.
As for fluency, we suspect that the superior performance of GST is largely due to the dependency truncation the model is equipped with, while its performance in content overlap owes a lot to CRFs. However, just how much improvement GST achieves over regular CRFs (with no truncation) in fluency and in relevance remains to be seen, as the latter do not allow for variable-length compression, which prohibits a straightforward comparison between the two kinds of models.

We conclude the section with a few words on the size of |G(S)|, i.e., the number of candidates generated per run of compression with GST. Figure 8 shows the distribution of the number of candidates generated per compression, which looks like the familiar scale-free power curve. Over 99% of the time, the number of candidates, |G(S)|, is found to be less than 500.

[Figure 8: The distribution of |G(S)|]

7 Conclusions

This paper introduced a novel approach to sentence compression in Japanese, which combines a syntactically motivated generation model and CRFs, in order to address the fluency and relevance of the compressions we generate. What distinguishes this work from prior research is its overt withdrawal from a search for global optima to a search for local optima that comply with the grammar.

We believe that our idea was empirically borne out, as the experiments found that our approach outperforms, by a large margin, a previously known method called DPM, which employs a global search strategy. The results on semantic overlap indicate that narrowing down the compressions we search over does not harm their relevance to the references.

An interesting future exercise would be to explore whether it is feasible to rewrite Eq. 5 as a linear integer program. If it is, the whole scheme of ours would fall under what is known as 'Linear Programming CRFs' (Tasker, 2004; Roth and Yih, 2005). What remains to be seen, however, is whether GST is transferrable to languages other than Japanese, notably English. The answer is likely to be yes, but the details have yet to be worked out.

References

James Clarke and Mirella Lapata. 2006. Constraint-based sentence compression: An integer programming approach. In Proceedings of the COLING/ACL 2006, pages 144-151.

Trevor Cohn and Mirella Lapata. 2007. Large margin synchronous generation and its application to sentence compression. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 73-82, Prague, June.

Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL Text Summarization Workshop and Document Understanding Conference (DUC03), pages 1-8, Edmonton, Canada.

Satoshi Fukutomi, Kazuyuki Takagi, and Kazuhiko Ozeki. 2007. Japanese sentence compression using probabilistic approach. In Proceedings of the 13th Annual Meeting of the Association for Natural Language Processing Japan.

Michel Galley and Kathleen McKeown. 2007. Lexicalized Markov grammars for sentence compression. In Proceedings of the HLT-NAACL 2007, pages 180-187.

Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In Proceedings of the 6th Conference on Applied Natural Language Processing, pages 310-315.

Tomonori Kikuchi, Sadaoki Furui, and Chiori Hori. 2003. Two-stage automatic speech summarization by sentence extraction and compaction. In Proceedings of ICASSP 2003.
Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139:91-107.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML-2001).

Ryan McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In Proceedings of the 11th Conference of EACL, pages 297-304.

Yuhei Morooka, Makoto Esaki, Kazuyuki Takagi, and Kazuhiko Ozeki. 2004. Automatic summarization of news articles using sentence compaction and extraction. In Proceedings of the 10th Annual Meeting of Natural Language Processing, pages 436-439, March. (In Japanese).

Tadashi Nomoto. 2007. Discriminative sentence compression with conditional random fields. Information Processing and Management, 43:1571-1587.

Rei Oguro, Kazuhiko Ozeki, Yujie Zhang, and Kazuyuki Takagi. 2000. An efficient algorithm for Japanese sentence compaction based on phrase importance and inter-phrase dependency. In Proceedings of TSD 2000 (Lecture Notes in Artificial Intelligence 1902, Springer-Verlag), pages 65-81, Brno, Czech Republic.

Stefan Riezler, Tracy H. King, Richard Crouch, and Annie Zaenen. 2003. Statistical sentence condensation using ambiguity packing and stochastic disambiguation methods for lexical functional grammar. In Proceedings of HLT-NAACL 2003, pages 118-125, Edmonton.

Dan Roth and Wen-tau Yih. 2005. Integer linear programming inference for conditional random fields. In Proceedings of the 22nd International Conference on Machine Learning (ICML 05).

T. F. Smith and M. S. Waterman. 1981. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197.

Charles Sutton and Andrew McCallum. 2006. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press. To appear.

Charles Sutton, Khashayar Rohanimanesh, and Andrew McCallum. 2004. Dynamic conditional random fields: Factorized probabilistic labeling and segmenting sequence data. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.

Charles Sutton. 2006. GRMM: A graphical models toolkit. http://mallet.cs.umass.edu.

Ben Tasker. 2004. Learning Structured Prediction Models: A Large Margin Approach. Ph.D. thesis, Stanford University.

Roy W. Tromble and Jason Eisner. 2006. A fast finite-state relaxation method for enforcing global constraints on sequence decoding. In Proceedings of the NAACL, pages 423-430.

Jenine Turner and Eugene Charniak. 2005. Supervised and unsupervised learning for sentence compression. In Proceedings of the 43rd Annual Meeting of the ACL, pages 290-297, Ann Arbor, June.

Vincent Vandeghinste and Yi Pan. 2004. Sentence compression for automatic subtitling: A hybrid approach. In Proceedings of the ACL Workshop on Text Summarization, Barcelona.

Kiwamu Yamagata, Satoshi Fukutomi, Kazuyuki Takagi, and Kazuhiko Ozeki. 2006. Sentence compression using statistical information about dependency path length. In Proceedings of TSD 2006 (Lecture Notes in Computer Science, Vol. 4188/2006), pages 127-134, Brno, Czech Republic.
