Extracting Summary Sentences Based on the Document Semantic Graph

Jure Leskovec (Carnegie Mellon University, USA; Jozef Stefan Institute, Slovenia), Jure.Leskovce@ijs.si
Natasa Milic-Frayling (Microsoft Research Ltd, Cambridge, UK), natasamf@microsoft.com
Marko Grobelnik (Jozef Stefan Institute, Ljubljana, Slovenia), Marko.Grobelnik@ijs.si

January 31, 2005
MSR-TR-2005-07
Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052

ABSTRACT

We present a method for extracting sentences from an individual document to serve as a document summary or a precursor to creating a generic document abstract. We apply syntactic analysis of the text that produces a logical form analysis for each sentence. We use subject–object–predicate (SOP) triples from individual sentences to create a semantic graph of the original document and the corresponding human extracted summary. Using the Support Vector Machines learning algorithm, we train a classifier to identify SOP triples from the document semantic graph that belong to the summary. The classifier is then used for automatic extraction of summaries from test documents. Our experiments with the DUC 2002 and CAST datasets show that including semantic properties and topological graph properties of logical triples yields statistically significant improvement of the micro-average F1 measure for both the extraction of SOP triples that correspond to the semantic structure of extracts and the extraction of summary sentences. Evaluation based on ROUGE shows similar results for the extracted summary sentences.

INTRODUCTION

Document summarization refers to the task of creating document surrogates that are smaller in size but retain various characteristics of the original document. To automate the process of abstracting, researchers generally rely on a two-phase process. First, key textual elements, e.g., keywords, clauses, sentences, or paragraphs, are extracted from text using linguistic and statistical analyses. In the second step, the extracted text may be used as a summary. Such summaries are referred to as ‘extracts’. Alternatively, textual elements can be used to generate new text, similar to the human authored abstract.

Automatic generation of texts that resemble human abstracts presents a number of challenges. While abstracts may include portions of document text, it has been shown that authors of abstracts often rewrite the text, interpreting the content and fusing the concepts. In the study by Jing [6] of 300 human-written summaries of news articles, 19% of summary sentences did not have matching sentences in the document. The remaining summary sentences overlapped with the content of a single document sentence in 42% of cases. This included matches through paraphrasing and syntactic transformation, implying that the number of perfectly aligned matches would be even lower. Other studies show that the number of aligned sentences varies significantly from corpus to corpus. For the set of 202 computational linguistic papers used by Teufel and Moens [18], perfect alignment is observed for only 31.7% of abstract sentences. That figure rises to 79% in 188 technical papers in [9]. Thus, if automatic summarization methods are to take advantage of the texts from the document, it is important to investigate alignment on the sub-sentence level, e.g., at the level of clauses as investigated by Marcu [12].

Comparing the meaning of clauses in the document and corresponding abstracts, by employing human subjects, Marcu [12] showed that in order to create an abstract from extracted text one may need to start with a pool of extracted clauses with a total length 2.76 times larger than the length of the resulting abstract. This implies that relevant concepts, carrying the meaning, are scattered across clauses.

Starting with a hypothesis that the main functional elements of sentences and clauses are Subjects,
Objects, and Predicates, we ask whether identifying and exploiting links among them could facilitate the extraction of relevant text. Thus, we devise a method that creates a semantic graph of a document, based on logical form triples subject–predicate–object (SPO), and learns a relevant sub-graph that could be used for creating summaries.

In order to establish the plausibility of this approach we first focus on learning to automate human extracts. We assess how well the model can extract the substructure of the graph that corresponds to the extracted sentences. This substructure is then the basis for extracting the relevant text from the document. Restricting the evaluation to sentence extraction, we gain a good understanding of the effectiveness of the approach and the learnt model. Essentially, we decouple the evaluation of the learning model from the issues of text generation that arise in the creation of abstracts.

In this paper we present results from our experiments on two data sets, CAST [4] and a part of DUC 2002 [3], equipped with human extracted summaries. We demonstrate that the feature attributes related to the connectivity of the semantic graph and linguistic properties of the graph nodes significantly contribute to the performance of our summary extraction model. With this understanding we set solid foundations for exploring similar learning models for document abstraction.

[Figure: Summarization procedure based on semantic structure analysis. Panel labels: Original document; Linguistic Processing and Semantic Graph Creation; Sub-graph Selection using Machine Learning Methods; Natural Language Generation; Automatically generated document summary.]

In the following sections we describe the procedure that we use to generate the semantic graphs and define feature attributes for the learning model. We present the results of the experiments and discuss how they can guide the future work.

SEMANTIC GRAPH GENERATION

In this study we create a novel representation of the document content that relies on the deep syntactic analysis of the text. We extract elementary syntactic structures from individual sentences in the form of logical form triples, i.e., subject–predicate–object triples, and use linguistic properties of the nodes in the triples to build semantic graphs for both documents and corresponding summaries. We expect that the graph of the extracted summary would capture essential semantic relations among concepts and that the resulting structure could be found within the corresponding document semantic graph. Thus, we reduce the problem of summarization to acquiring machine learning models for mapping between the document graph and the graph of a summary.

We generate a semantic graph in three steps:

- Syntactic analysis of the text – We apply deep syntactic analysis to document sentences, using NLPWin linguistic tool [2][5],
and extract logical form triples.
- Co-reference resolution – We identify co-references for named entities through surface form matching and text layout analysis, and thus consolidate expressions that refer to the same named entity.
- We merge the resulting logical form triples into a semantic graph and analyze the graph properties.

The nodes in our graphs correspond to Subjects and Objects. A link between them corresponds to a Predicate.

In our research we also investigated semantic graphs that involved pronominal reference resolution and semantic normalization. However, initial experiments showed that using anaphora resolution, which achieved 80% accuracy, and WordNet [20] for synonym normalization yields only marginal improvement in the performance of the summary extractor. Thus, for the sake of clarity and simplicity, we present the method using minimal post-processing of the NLPWin output through co-reference resolution.

2.1 Linguistic Analysis

For linguistic analysis of text we use Microsoft’s NLPWin natural language processing tool. NLPWin first segments the text into individual sentences, converts sentence text into a parse tree that represents the syntactic structure of the text (Figure 2), and then produces a sentence logical form that reflects the meaning, i.e., semantic structure of the text (Figure 3). This process involves a variety of techniques: use of a knowledge base, grammar rules, and probabilistic methods in analyzing the text.

Figure: Process of creating a semantic graph. The example text “Tom Sawyer went to town. He met a friend. Tom was happy.” passes through deep syntactic analysis and co-reference resolution (Tom = Tom Sawyer), giving the refined subject–predicate–object triples Tom Sawyer ← go → town, Tom Sawyer ← meet → friend, and Tom Sawyer ← is → happy, which are then merged in the creation of the semantic graph.

Figure 2. Syntactic tree for the sentence “Jure sent Marko a letter”.

Figure 3. Logical form for the sentence.

The logical form in Figure 3 shows that the sentence is about sending, where “Jure” is the deep subject (an “Agent” of the activity), “Marko” is the deep indirect object (having a “Benefactive” role), and the “letter” is the deep direct object (assuming the “Patient” role). The notations in parentheses provide semantic information about each node (e.g., “Jure” is a masculine, singular, proper name). From the logical form we extract constituent sub-structures in the form of triples: “Jure”→“send”→“Marko” and “Jure”→“send”→“letter”. For each node we preserve the semantic tags that are assigned by the NLPWin software. These are used in our further linguistic analyses and in the machine learning stage.

Figure outlines the main processes. Identified logical form triples are linked into a graph based on common nodes. Figure shows an example of a semantic graph for an entire document.

2.2 Co-reference Resolution for Named Entities

It is common that terms with different surface forms refer to the same entity in the same document. Identifying such terms is referred to as co-reference resolution. We restrict our co-reference resolution attempt to syntactic nodes that, in the NLPWin analysis, have the attribute of ‘named entity’. These are names of people, places, companies, and similar. For each named entity we record the gender tag, which reduces the number of terms that need to be examined for co-reference resolution. Starting with multi-word named entities, we first eliminate the standard set of English stop words and ‘common’ words, such as “Mr.”, “Mrs.”, “international”, “company”, “group”, “federal”, etc. We then apply a simple rule by which two terms with distinct surface forms refer to the same entity if all the words from one term also appear as words in the other term. The algorithm, for example, correctly finds that “Hillary Rodham
Clinton”, “Hillary Clinton”, “Hillary Rodham”, and “Mrs Clinton” all refer to the same entity. This approach is similar to the ones explored in related research [14] and has proven effective in the context of our study, yielding better learning models.

2.3 Construction of the Semantic Graph

We merge the logical form triples on subject and object nodes that belong to the same normalized semantic class and produce a semantic graph, as shown in Figure. Subjects and objects are nodes in the graph, and predicates label the relations between them. Each node is also described with a set of properties: explanatory words that are helpful for understanding the content of the node. For each node in a semantic graph we calculate a number of topological properties. These are later used as attributes of logical form triples during the sub-graph learning process. The full set of features used in the learning process is given in Section 3.2.

LEARNING SEMANTIC SUBGRAPHS USING SUPPORT VECTOR MACHINES

Using the linguistic procedures described in Section 2 we can generate, for each pair of document and document summary, the corresponding set of subject–predicate–object triples and associate them with a rich set of attributes coming from linguistic, statistical, and graph analysis. These serve as the basis for training our summarization models.

3.1 Data Sets

We run our experiments on two data sets: a subset of the DUC2002 dataset and the CAST collection.

3.1.1 DUC2002 Data set

We use the DUC2002 document collection from the Document Understanding Conference (DUC) 2002 [3]. For our experiments we use the training part of DUC 2002, which consists of 300 newspaper articles on 30 different topics, collected from Financial Times, Wall Street Journal, Associated Press, and similar sources. Almost half of these documents have human extracted sentences, interpreted as extracted summaries. These are not used in the official DUC evaluation, since DUC is primarily focused on generating abstracts. Thus, we cannot make a
direct comparison with the performance of DUC systems. However, the data is useful for our objective of exploring various aspects of our approach.

On average, an article in the DUC data set contains about 1100 words or 50 sentences, each having 22 words. About 7.5 sentences are selected into the summary. After applying our linguistic processing, we find on average 81 logical triples per document, with 15 of them contained in extracted summary sentences.

In preparation for learning, we label as positive examples all subject–predicate–object triples that correspond to sentences in the human extracted summaries. Triples from other sentences are designated as negative examples.

3.1.2 CAST Data set

The CAST corpus [4] contains texts from the Reuters Corpus annotated with information that can be used to train and evaluate automatic summarization methods. Four annotators marked 15% of document sentences as essential and an additional 15% as important for the summary. However, the distribution of documents across assessors has been rather arbitrary: for some documents we have up to three sets of sentence selections, while for others only one. For that reason we decided to run our experiments on the set of 89 documents annotated by a single assessor, Annotator. We run experiments that model separately the extraction of short (15%) summaries, represented by sentences marked as essential, and longer (30%) summaries, which include both sentences marked as essential and sentences marked as important.

An average length article in the CAST data set contains about 528 words or 29 sentences, each having 18 words. The assessor selected on average about sentences for short summaries and additional for longer summaries. After applying our linguistic processing, we find on average 41 logical form triples per document with or 12 of them included in extracted sentences for short and longer summaries, respectively.

Figure: Automatically generated summary (semantic graph) from the document “Long Valley volcano activities”. Light (yellow) subject/object nodes indicate correct logical form nodes. Dark gray nodes are false positive and false negative nodes.

Figure: Full semantic graph of the DUC 2002 document “Long Valley volcano activities”. Light (yellow) subject/object nodes indicate summary nodes. Gray nodes indicate non-summary nodes. We learn a model for distinguishing between the light and dark nodes in the graph.

3.2 Feature Set

As features for the learning process, we consider logical form triples characterized by three types of attributes:
- Linguistic attributes, which include logical form tags (subject, predicate, object), part of speech tags, and about 70 semantic tags (such as gender, location name, person name, etc.). There are 118 distinct linguistic attributes in total for each node.
- Semantic graph attributes describing properties of the graph. For each node we calculate the number of incoming and outgoing links, Hubs and Authorities [8] weights, and PageRank [15] weights. We also include statistics on the number of nodes reachable by 2, and hops away respectively, and the total number of reachable nodes. We consider both the directed and undirected versions of the semantic graph when calculating these statistics. In total, 14 attributes are calculated from the semantic graph.
- Document discourse structure, which is approximated by several attributes: the location of the sentence in the document and of the triple in the sentence, the frequency and location of the word inside the sentence, the number of different senses of the word, and related attributes.

Each set of attributes is represented as a sparse vector of binary and real-valued numbers. These are concatenated into a single sparse vector and normalized to unit length to represent a node in the logical form triple. Similarly, for each triple the node vectors are concatenated and normalized. The resulting vectors for logical form triples contain about 372 binary and real-valued
attributes. For the DUC dataset, 69 of these components have non-zero values on average. For the CAST dataset we find 327 attributes in total, with 68 non-zero values per triple on average.

3.3 Learning Algorithm

This rich set of features serves as input to the Support Vector Machine (SVM) classifier [1][7]. In the initial experiments we explored SVMs with a polynomial kernel (up to degree five) and an RBF kernel. However, the results were not significantly different from those of SVMs with the linear kernel. Thus we continued our experiments with linear SVMs.

We define the learning task as a binary classification problem. We label as positive examples all subject–predicate–object triples that were extracted from the document sentences which humans selected into the summary. Triples from all other sentences are designated as negative examples. We then learn a model that discriminates between these two classes of triples.

3.4 Experimental Setup

We evaluate the performance of both the extraction of semantic structure elements, i.e., logical form triples, and the extraction of document sentences. We use the extracted logical form triples to identify the appropriate sentences for inclusion in the summary. We apply a simple decision rule by which a sentence is included in the summary if it contains at least one triple identified by the learning algorithm. We accumulate the summaries to satisfy the length criteria. All reported experiment statistics are micro-averaged over the instances of logical triple and sentence classifications, respectively.

One important objective of our research is to understand the relative importance of the various attribute types that describe the logical form triples. Thus we evaluate how adding features to the model impacts the precision and recall of the extracted logical form triples and corresponding summaries. We report the standard precision and recall and their harmonic mean, the F1 score.

All the experiments are run using stratified 10-fold cross-validation, where samples of
documents are selected randomly and the corresponding sentences (triples) are used for training and testing. We take into account the document boundaries, and therefore the triples from a single document all belong either to the training set or to the test set and are never shared between the two. We always run and evaluate the resulting models on both the training and the test sets, to gain insight into the generalization of the model.

When evaluating summaries, we are also interested in the coverage of the human extracts achieved by our extracted summaries. In instances where we fail to extract the correct sentence, we still wish to assess whether the automatically extracted sentence is close in content to the ones that we missed. For that we calculate the overlap between the automatically extracted summaries and the human extracted summaries using ROUGE [10], the measure adopted by DUC as the standard for assessing summary coverage. ROUGE is a recall-oriented measure based on n-gram statistics that has been found to be highly correlated with human evaluations. We use ROUGE n-gram(1,1) statistics and restrict the length of the automatically generated summary to be the same as that of the human sentence extract.

EXPERIMENT RESULTS

Tables 1–3 summarize the results of the sentence extraction based on the learned SVM classifier for the DUC and CAST datasets. Precision, recall and F1 measures for the extraction of triples are very close to the performance of extracted sentences, and therefore we do not present them separately.

4.1.1 Impact of Different Feature Attributes

The performance statistics presented in Tables 1 to 3 provide insight into the relative importance of different attribute types: the graph topological properties, the linguistic features, and the statistical and discourse attributes. The first row of each table shows the baseline model, where we use only sentence position and sentence terms for learning the model. In all cases we observe very good performance of the baseline on the training set, but
the model does not generalize well: it has poor performance on the test set. The ROUGE score of the baseline is also quite low. For comparison we also generated another set of baseline summaries by taking the first sentences in each document. Over all datasets, the ROUGE score of these summaries was an additional 0.10 lower than that of the baseline obtained using machine learning.

Table 1: Performance of sentence extraction on the DUC2002 extracts, in terms of macro-average Precision, Recall and F1 measures and ROUGE score. Results for stratified ten-fold cross validation.

                                Training set               Test set
Attribute set                   Precision  Recall  F1      Precision  Recall  F1      ROUGE
Sentence position and terms     65.08      92.14   76.28   28.77      37.27   32.48   0.69
Triple and sentence position    31.29      53.38   39.45   31.12      53.34   39.32   0.71
Graph attributes                28.26      62.99   39.02   27.58      61.67   38.11   0.73
Linguistic attributes           25.79      62.48   36.51   20.79      51.87   29.69   0.78
Position + Linguistic           30.74      67.33   42.21   28.66      63.23   39.44   0.76
Position + Graph                34.44      65.37   45.11   33.67      64.39   44.22   0.83
Position + Graph + Linguistic   34.25      71.40   46.29   31.85      66.77   43.13   0.82

Table 2: Performance of the sentence selection on the CAST 15% extracts (essential sentences), in terms of macro-average Precision, Recall and F1 measures and ROUGE score. Results for the stratified ten-fold cross validation.

                                Training set               Test set
Attribute set                   Precision  Recall  F1      Precision  Recall  F1      ROUGE
Sentence position and terms     85.54      87.43   86.42   30.32      25.14   27.49   0.59
Triple and sentence position    33.07      65.69   44.99   32.54      64.54   43.27   0.62
Graph attributes                20.92      59.52   30.95   19.82      56.85   29.39   0.66
Linguistic attributes           35.95      57.10   44.12   21.34      32.83   25.87   0.62
Position + Linguistic           39.89      74.59   51.89   34.31      63.41   44.53   0.73
Position + Graph                33.70      72.63   46.04   32.47      70.92   44.54   0.73
Position + Graph + Linguistic   40.43      77.40   53.12   33.83      64.35   44.34   0.74

Table 3: Performance on the CAST 30% extracts (essential and important sentences), in terms of macro-average Precision, Recall and F1 measures and ROUGE score. Results for the stratified ten-fold cross validation.

                                Training set               Test set
Attribute set                   Precision  Recall  F1      Precision  Recall  F1      ROUGE
Sentence position and terms     87.97      84.27   86.08   43.24      33.68   37.86   0.59
Triple and sentence position    44.62      59.44   50.97   43.67      58.42   49.98   0.68
Graph attributes                38.42      67.42   48.95   36.80      65.85   47.22   0.67
Linguistic attributes           45.96      80.41   58.41   40.22      70.84   51.31   0.73
Position + Linguistic           50.57      74.18   60.14   43.92      64.25   52.18   0.72
Position + Graph                45.10      70.60   55.04   43.47      67.26   52.81   0.71
Position + Graph + Linguistic   51.04      75.00   60.74   44.45      65.57   52.98   0.72

Table 4: Some of the most important Subject–Predicate–Object triple attributes for DUC experiments. Columns: attribute name; attribute rank (1st quartile, median, 3rd quartile).

Attribute names, in the order listed:
- Authority weight of Object node
- Size of weakly connected component of Object node
- Number of links of Object node
- Is Object a name of a country
- Size of weakly connected component of Subject node
- Number of links of Subject node
- PageRank weight of Object node
- Is Object a name of a geographical location
- Authority weight of Subject

Attribute rank values as they survive in the source (row and column alignment not recoverable): 1 2.5 3; 5 10.5 12; 11 12 13 16 13 18.5 23.

For all the datasets, the performance statistics are obtained from the 10-fold cross-validation. The relative difference in performance has been evaluated using a pair-wise t-test, and it has been established that the differences between different runs are statistically significant.

From Table 1 we see that including semantic graph attributes consistently improves recall and thus the F1 score. Starting with only linguistic attributes and adding information about position, we see a 9.75% absolute increase in the F1 measure. As new attributes are added to describe the triple from additional perspectives, the performance of the classifier consistently increases. The cumulative effect of all attributes considered in the study is a 26.5% relative increase in the F1 measure over the baseline that uses only sentence terms and position attributes. The model which uses information
about the position of the triple and the structure of the semantic graph performs best in both F1 and ROUGE scores. In terms of the ROUGE measure, linguistic features (syntactic and semantic tags) outperform the model which relies only on the semantic graph. For linguistic attributes we also observe a discrepancy between the F1 and ROUGE scores: linguistic attributes score low on F1 but usually relatively high on ROUGE. For position attributes, on the other hand, we observe the reverse effect: good F1 and a low ROUGE score.

We make similar observations on the CAST dataset (Tables 2 and 3). We see that using position and graph attributes gives very good performance in terms of the F1 and ROUGE measures. We observe that using only semantic graph attributes does not give very good performance. While the sizes of the sentence extracts in DUC and CAST are similar, DUC documents are much longer, contain more logical triples, and therefore have semantic graphs that are better connected. We manually inspected CAST semantic graphs and observed that they are not as well connected and appear less helpful for summarization.

4.1.2 Observations from the SVM Normal

We also inspect the learned SVM models, i.e., the SVM normal, for the weights assigned to various attributes during the training process. We normalize each attribute to have a value between and. This way we prevent the attributes with smaller values from automatically receiving higher weights. We then observe the relative rank of attribute weights over the 10 folds. Since the distributions of weights and corresponding attribute ranks are skewed, they are best described by the median.

From Table 4 it is interesting to see that the semantic graph attributes are consistently ranked high among the attributes used in the model. They describe the elements of a triple in relation to other entities mentioned in the text and capture the overall structure of the document. For example, ‘Authority weight of Object node’ measures how other important ‘hub’ nodes in the graph link to it. A good
‘hub’ points to nodes with ‘authoritative’ content, and a node has a high ‘authority’ if it is pointed to by good hubs. In our graph representations, subjects are hubs pointing to authorities (objects), and thus the authority weight captures how important the object is, i.e., in how many actions, described by predicates, it is involved.

These results support our intuition that relations among concepts in the document that result from the syntactic and semantic properties of the text are important for summarization. Interestingly, the feature attributes that most strongly characterize non-summary triples are mainly linguistic attributes describing gender, position of the verb, being inside quotes, position of the sentence in the document, word frequency, and similar. The latter few attributes are typically used in statistical approaches to summary extraction.

RELATED WORK

Over the past decades, research in text summarization has produced a great volume of literature and methods. For an overview and insights into the state of the art we refer to [16][17], and comment here on the work that relates to several aspects of our approach.

While most of the past work stays at the level of shallow text parsing and statistical processing, our approach is unique in that it combines two aspects: (1) it introduces an intermediate layer of text representation within which the structure and the content of both the document and the summary are captured, and (2) it uses machine learning to identify elements of the semantic structures, i.e., concepts and relations, as opposed to learning from linguistic features of finer granularity, such as keywords and noun phrases [9][18], or from complete sentences [13]. We also note that the semantic graph representation opens possibilities for novel types of document surrogates, focused not on reading but on navigation through the document on the basis of captured concepts and relations.

Graph based methods

Application of graph representation in summarization has
been explored by Mihalcea [13], who treats individual sentences as nodes in the graph and establishes links among the sentences based on content overlap. In addition to the difference in the text granularity level at which the graph is applied, the method in [13] does not involve learning. It selects sentences by setting a threshold on the scores associated with the graph nodes.

Most similar to our approach to constructing the semantic graph is the method by Vanderwende et al. [19], aimed at generating event-centric summaries. The method uses the same linguistic tool, NLPWin, to obtain logical form triples from sentences but constructs the semantic graph in a rather different way. In order to capture text about events, Vanderwende et al. [19] treat Predicates as nodes in the graph, together with Subjects and Objects, while the links between the nodes are inherited from the logical form analysis. More precisely, the atomic structure of the graph is a triple (Node_i, relation/link, Node_j), where the relation is a syntactic tag such as direct object, location, time, and similar. For example, the graph would contain (“Marko”, Subject, “Send”), (“Send”, Object, “Letter”), (“Send”, Time, “Wednesday”). In our representation the elementary structure is (“Marko”, “Send”, “Letter”). Therefore, the statistical properties of the graph and link weight propagation have a different meaning and effect.

Similarly to Mihalcea [13], Vanderwende et al. [19] do not apply learning to select substructures but set a score threshold for the selection of logical form triples. Both methods [13] and [19] are applied in the context of generating abstracts, and their encouraging results lead us to believe that further evaluation of our method will show similar results.

In their work, Mani & Bloedorn [11] and Kupiec et al. [9] applied several learning algorithms to the set of features that in previous research were applied in an ad hoc manner to select text for summarization (sentence location, statistical measures of
term prominence, similarity between sentences, presence of proper names or certain syntactic features in the sentence, etc.). The significant contribution of our work is in widening the type of features for learning to those that capture both the structure and the content, and in enhancing our understanding of the role that these structural elements play in modeling sentence extraction for summarization.

SUMMARY AND FUTURE WORK

We presented a novel approach to document summarization which generates a semantic representation of the document and applies machine learning to extract semantic sub-structures suitable for creating summaries. We evaluated our approach on the simpler problem of sentence extraction for document summaries. This enabled us to focus on the characteristics of the learning model and investigate the relative importance of the feature attributes used in learning. Experiments on the two data sets show that the attributes which capture properties of the document semantic structure play an important role in the sentence selection process.

Our approach has a number of advantages over methods used so far. The semantic structure based on the logical form enables us to extract triples that correspond to sub-clauses of document sentences. This provides a good foundation for collecting text segments that would be useful for abstract creation and multi-document summarization. Furthermore, the rich set of linguistic and graph attributes enables the learning algorithm to select the set of attributes that best models the summarization process for a particular set of documents and a particular performance measure. For example, we noticed that for training data with shorter summaries, linguistic features play a more significant role in optimizing the performance than the structure features. That is reversed in the situation where we have longer summaries and longer documents, for which the semantic structure is richer and more informative.

Our future work will involve explorations of alternative
semantic structures on additional data sets and a wider set of summarization problems, including human generated abstracts and cross-document summaries.

REFERENCES

[1] Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2): 121–167, 1998.
[2] Corston-Oliver, S.H. and Dolan, B. Less is more: eliminating index terms from subordinate clauses. ACL, 1999.
[3] Document Understanding Conference (DUC), 2002. http://tides.nist.gov/
[4] Hasler, L., Orasan, C. and Mitkov, R. Building better corpora for summarization. Corpus Linguistics 2003.
[5] Heidorn, G.E. Intelligent Writing Assistance. In Handbook of Natural Language Processing, Eds. Dale, R., Moisl, H. and Somers, H. Marcel Dekker, 2000.
[6] Jing, H. Using Hidden Markov Modeling to Decompose Human-Written Summaries. Computational Linguistics, 28(4): 527–543, 2002.
[7] Joachims, T. Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Learning. The MIT Press, 1999.
[8] Kleinberg, J.M. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5): 604–632, 1999.
[9] Kupiec, J., Pedersen, J. and Chen, F. A Trainable Document Summarizer. In Proceedings of SIGIR’95, 1995.
[10] Lin, C.-Y. and Hovy, E.H. Automatic evaluation of summaries using n-gram co-occurrence statistics. Human Language Technology Conference, Edmonton, 2003.
[11] Mani, I. and Bloedorn, E. Machine Learning of Generic and User-Focused Summarization. AAAI 1998.
[12] Marcu, D. The automatic construction of large-scale corpora for summarization research. SIGIR 1999.
[13] Mihalcea, R. Graph-based ranking algorithms for sentence extraction, applied to text summarization. ACL 2004.
[14] Nenkova, A. and McKeown, K. References to Named Entities: a Corpus Study. HLT-NAACL 2003.
[15] Page, L., Brin, S., Motwani, R. and Winograd, T. The PageRank citation ranking: Bringing order to the web. Digital libraries project report, Stanford University, 1998.
[16] Paice, C.D. Constructing literature abstracts by computer: Techniques and prospects. Information Processing and Management, 26: 171–186, 1990.
[17] Sparck-Jones, K. Summarizing: Where are we now? Where should we go? ACL/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 1997.
[18] Teufel, S. and Moens, M. Sentence extraction as a classification task. ACL/EACL'97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 1997.
[19] Vanderwende, L., Banko, M. and Menezes, A. Event-Centric Summary Generation. DUC 2004.
[20] Fellbaum, C. WordNet: An Electronic Lexical Database. MIT Press, 1998.
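
As an illustration of the co-reference rule of Section 2.2, the following Python sketch groups named entities whose content words are all contained in those of a longer name. It is a minimal sketch and not the implementation used in the study: the function names, the small stop-word list, and the choice to compare each name only against the longest name of a group are assumptions made for this example, and the gender tag filter is omitted.

```python
# Hypothetical stop-word and 'common' word list; the study uses a standard
# English stop-word set plus words such as "Mr.", "Mrs.", "company".
COMMON_WORDS = {"mr", "mrs", "international", "company", "group", "federal",
                "the", "of", "and"}


def content_words(name):
    """Lower-case the words of a name and drop stop/common words."""
    words = (w.strip(".,").lower() for w in name.split())
    return frozenset(w for w in words if w and w not in COMMON_WORDS)


def same_entity(a, b):
    """Rule from Section 2.2: all words of one term appear in the other."""
    wa, wb = content_words(a), content_words(b)
    if not wa or not wb:
        return False
    return wa <= wb or wb <= wa


def resolve(names):
    """Group names into entities, starting with the longest names.

    Each name is compared only with the first (longest) member of a group,
    which is a simplification assumed for this sketch.
    """
    groups = []
    for name in sorted(names, key=lambda n: -len(content_words(n))):
        for group in groups:
            if same_entity(name, group[0]):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ["Hillary Rodham Clinton", "Hillary Clinton", "Hillary Rodham",
         "Mrs Clinton", "Tom Sawyer", "Tom"]
for group in resolve(names):
    print(group)
```

Comparing against the longest member keeps “Hillary Clinton” and “Hillary Rodham” in one group even though neither contains all the words of the other.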

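
The hub and authority weights [8] used as semantic graph attributes (Section 3.2) can be illustrated with a short Python sketch. It assumes the standard iterative hub/authority update on a directed graph whose links run from subject to object. The function names, the fixed iteration count, and the third example triple (“Marko”, “read”, “letter”) are hypothetical and are not taken from the study.

```python
import math
from collections import defaultdict


def build_graph(triples):
    """Merge subject-predicate-object triples on shared nodes.

    Subjects and objects become nodes; each triple adds a directed link
    from subject to object (the predicate label is ignored here).
    """
    out_links, in_links, nodes = defaultdict(set), defaultdict(set), set()
    for subj, _pred, obj in triples:
        out_links[subj].add(obj)
        in_links[obj].add(subj)
        nodes.update((subj, obj))
    return nodes, out_links, in_links


def _normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(nodes, out_links, in_links, iterations=50):
    """Iterate hub and authority scores for a fixed number of rounds."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A node is a good authority if good hubs point to it.
        auth = _normalize(
            {n: sum(hub[s] for s in in_links.get(n, ())) for n in nodes})
        # A node is a good hub if it points to good authorities.
        hub = _normalize(
            {n: sum(auth[o] for o in out_links.get(n, ())) for n in nodes})
    return hub, auth


triples = [("Jure", "send", "Marko"),
           ("Jure", "send", "letter"),
           ("Marko", "read", "letter")]  # last triple is a made-up example
nodes, out_links, in_links = build_graph(triples)
hub, auth = hits(nodes, out_links, in_links)
for node in sorted(nodes, key=lambda n: -auth[n]):
    print(node, round(auth[node], 3), round(hub[node], 3))
```

In this toy graph “letter” receives the highest authority weight because two subjects point to it, which matches the reading of the authority weight as the number of actions an object is involved in.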