Proceedings of ACL-08: HLT, pages 139–147, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

A Re-examination of Query Expansion Using Lexical Resources

Hui Fang
Department of Computer Science and Engineering
The Ohio State University
Columbus, OH, 43210
hfang@cse.ohio-state.edu

Abstract

Query expansion is an effective technique to improve the performance of information retrieval systems. Although hand-crafted lexical resources, such as WordNet, could provide more reliable related terms, previous studies showed that query expansion using only WordNet leads to very limited performance improvement. One of the main challenges is how to assign appropriate weights to expanded terms. In this paper, we re-examine this problem using recently proposed axiomatic approaches and find that, with an appropriate term weighting strategy, we are able to exploit the information from lexical resources to significantly improve the retrieval performance. Our empirical results on six TREC collections show that query expansion using only hand-crafted lexical resources leads to significant performance improvement. The performance can be further improved if the proposed method is combined with query expansion using co-occurrence-based resources.

1 Introduction

Most information retrieval models (Salton et al., 1975; Fuhr, 1992; Ponte and Croft, 1998; Fang and Zhai, 2005) compute relevance scores based on matching of terms in queries and documents. Since various terms can be used to describe the same concept, it is unlikely for a user to use a query term that is exactly the same term as used in relevant documents. Clearly, such vocabulary gaps make the retrieval performance non-optimal. Query expansion (Voorhees, 1994; Mandala et al., 1999a; Fang and Zhai, 2006; Qiu and Frei, 1993; Bai et al., 2005; Cao et al., 2005) is a commonly used strategy to bridge the vocabulary gaps by expanding original queries with related terms. Expanded terms are often selected from co-occurrence-based thesauri (Qiu and Frei, 1993; Bai et al., 2005; Jing and Croft, 1994; Peat and Willett, 1991; Smeaton and van Rijsbergen, 1983; Fang and Zhai, 2006), hand-crafted thesauri (Voorhees, 1994; Liu et al., 2004), or both (Cao et al., 2005; Mandala et al., 1999b).

Intuitively, compared with co-occurrence-based thesauri, hand-crafted thesauri such as WordNet could provide more reliable terms for query expansion. However, previous studies failed to show any significant gain in retrieval performance when queries are expanded with terms selected from WordNet (Voorhees, 1994; Stairmand, 1997). Although some researchers have shown that combining terms from both types of resources is effective, the benefit of query expansion using only manually created lexical resources remains unclear. The main challenge is how to assign appropriate weights to the expanded terms.

In this paper, we re-examine the problem of query expansion using lexical resources with the recently proposed axiomatic approaches (Fang and Zhai, 2006). The major advantage of axiomatic approaches in query expansion is to provide guidance on how to weight related terms based on a given term similarity function. In our previous study, a co-occurrence-based term similarity function was proposed and studied.
In this paper, we study several term similarity functions that exploit various information from two lexical resources, i.e., WordNet and the dependency-based thesaurus constructed by Lin (Lin, 1998), and then incorporate these similarity functions into the axiomatic retrieval framework. We conduct empirical experiments over several standard TREC collections to systematically evaluate the effectiveness of query expansion based on these similarity functions. Experiment results show that all the similarity functions improve the retrieval performance, although the performance improvement varies for different functions. We find that the most effective way to utilize the information from WordNet is to compute the term similarity based on the overlap of synset definitions. Using this similarity function in query expansion can significantly improve the retrieval performance. In terms of retrieval performance, the proposed similarity function is significantly better than a simple mutual information based similarity function, and comparable to the function proposed in (Fang and Zhai, 2006). Furthermore, we show that the retrieval performance can be further improved if the proposed similarity function is combined with the similarity function derived from co-occurrence-based resources.

The main contribution of this paper is to re-examine the problem of query expansion using lexical resources with a new approach. Unlike previous studies, we are able to show that query expansion using only manually created lexical resources can significantly improve the retrieval performance.

The rest of the paper is organized as follows. We discuss the related work in Section 2, and briefly review the studies of query expansion using axiomatic approaches in Section 3. We then present our study of using lexical resources, such as WordNet, for query expansion in Section 4, and discuss experiment results in Section 5. Finally, we conclude in Section 6.

2 Related Work

Although the use of WordNet in query expansion has been studied by various researchers, the improvement of retrieval performance is often limited. Voorhees (Voorhees, 1994) expanded queries using a combination of synonyms, hypernyms and hyponyms manually selected from WordNet, and achieved limited improvement (i.e., around −2% to +2%) on short verbose queries. Stairmand (Stairmand, 1997) used WordNet for query expansion, but concluded that the improvement was restricted by the coverage of WordNet, and no empirical results were reported.

More recent studies focused on combining the information from both co-occurrence-based and hand-crafted thesauri. Mandala et al. (Mandala et al., 1999a; Mandala et al., 1999b) studied the problem in the vector space model, and Cao et al. (Cao et al., 2005) focused on extending language models. Although they were able to improve the performance, it remains unclear whether using only information from hand-crafted thesauri would help to improve the retrieval performance.

Another way to improve retrieval performance using WordNet is to disambiguate word senses. Voorhees (Voorhees, 1993) showed that using WordNet for word sense disambiguation degrades the retrieval performance. Liu et al. (Liu et al., 2004) used WordNet for both sense disambiguation and query expansion and achieved reasonable performance improvement. However, the computational cost is high and the benefit of query expansion using only WordNet is unclear. Ruch et al.
(Ruch et al., 2006) studied the problem in the domain of biology literature and proposed an argumentative feedback approach, where expanded terms are selected only from sentences classified into one of four disjoint argumentative categories.

The goal of this paper is to study whether query expansion using only manually created lexical resources could lead to performance improvement. The main contribution of our work is to show that query expansion using only hand-crafted lexical resources is effective in the recently proposed axiomatic framework, which has not been shown in previous studies.

3 Query Expansion in Axiomatic Retrieval Model

Axiomatic approaches have recently been proposed and studied to develop retrieval functions (Fang and Zhai, 2005; Fang and Zhai, 2006). The main idea is to search for a retrieval function that satisfies all the desirable retrieval constraints, i.e., axioms. The underlying assumption is that a retrieval function satisfying all the constraints would perform well empirically. Unlike other retrieval models, axiomatic retrieval models directly model relevance with term-level retrieval constraints.

In (Fang and Zhai, 2005), several axiomatic retrieval functions have been derived based on a set of basic formalized retrieval constraints and an inductive definition of the retrieval function space. The derived retrieval functions are shown to perform as well as the existing retrieval functions with less parameter sensitivity. One of the components in the inductive definition is the primitive weighting function, which assigns the retrieval score to a single-term document {d} for a single-term query {q} based on

    S({q}, {d}) = ω(q)   if q = d
    S({q}, {d}) = 0      if q ≠ d        (1)

where ω(q) is a term weighting function of q. A limitation of the primitive weighting function described in Equation 1 is that it cannot bridge vocabulary gaps between documents and queries.

To overcome this limitation, in (Fang and Zhai, 2006), we proposed a set of semantic term matching constraints and modified the previously derived axiomatic functions to make them satisfy these additional constraints. In particular, the primitive weighting function is generalized as

    S({q}, {d}) = ω(q) × f(s(q, d)),

where s(q, d) is a semantic similarity function between two terms q and d, and f is a monotonically increasing function defined as

    f(s(q, d)) = 1                          if q = d
    f(s(q, d)) = (s(q, d) / s(q, q)) × β    if q ≠ d        (2)

where β is a parameter that regulates the weighting of the original query terms and the semantically similar terms. We have shown that the proposed generalization can be implemented as a query expansion method. Specifically, the expanded terms are selected based on a term similarity function s, and the weight of an expanded term t is determined by its similarity with a query term q, i.e., s(q, t), as well as the weight of the query term, i.e., ω(q). Note that the weight of an expanded term t is ω(t) in traditional query expansion methods.

In our previous study (Fang and Zhai, 2006), the term similarity function s is derived based on the mutual information of terms over collections that are constructed under the guidance of a set of term semantic similarity constraints. The focus of this paper is to study and compare several term similarity functions exploiting the information from lexical resources, and to evaluate their effectiveness in the axiomatic retrieval models.
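To make the weighting scheme concrete, the sketch below implements the generalized weighting of Equation 2 as a simple expansion step. The function names, the default value of β, and the top-k cutoff for selecting expansion terms are illustrative assumptions, not details taken from the paper.

```python
def expansion_weight(q, t, omega, s, beta=0.5):
    """Weight of term t when it is used to expand query term q (Equation 2).

    omega(q): weight of the original query term (e.g., an IDF-like score).
    s(a, b):  a term similarity function, e.g., one of those in Section 4.
    beta:     regulates expanded terms relative to original query terms.
    """
    if t == q:
        return omega(q)                               # original term keeps its full weight
    return omega(q) * (s(q, t) / s(q, q)) * beta      # expanded term is discounted by similarity


def expand_query(query_terms, candidates, omega, s, beta=0.5, k=20):
    """Pick the k most similar candidate terms per query term and weight them.

    A minimal sketch of the expansion step; the actual candidate selection in the
    axiomatic framework is more involved than a plain top-k cutoff.
    """
    weights = {q: omega(q) for q in query_terms}
    for q in query_terms:
        scored = sorted(((s(q, t), t) for t in candidates if t not in query_terms),
                        reverse=True)
        for sim, t in scored[:k]:
            if sim > 0:
                weights[t] = max(weights.get(t, 0.0), expansion_weight(q, t, omega, s, beta))
    return weights
```

The resulting weights replace ω(t) for the expanded terms when the retrieval function scores documents, which is how the weighting guidance of the axiomatic framework enters the expansion procedure.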
4 Term Similarity based on Lexical Resources

In this section, we discuss a set of term similarity functions that exploit the information stored in two lexical resources: WordNet (Miller, 1990) and the dependency-based thesaurus (Lin, 1998).

The most commonly used lexical resource is WordNet (Miller, 1990), a hand-crafted lexical system developed at Princeton University. Words are organized into four taxonomies based on different parts of speech. Every node in WordNet is a synset, i.e., a set of synonyms. The definition of a synset, referred to as its gloss, is also provided. For a query term, all the synsets in which the term appears can be returned, along with the definitions of the synsets. We now discuss six possible term similarity functions based on the information provided by WordNet.

Since the definition provides valuable information about the semantic meaning of a term, we can use the definitions of terms to measure their semantic similarity. The more common words the definitions of two terms have, the more similar these terms are (Banerjee and Pedersen, 2005). Thus, we can compute the term semantic similarity based on synset definitions in the following way:

    s_def(t1, t2) = |D(t1) ∩ D(t2)| / |D(t1) ∪ D(t2)|,

where D(t) is the concatenation of the definitions for all the synsets containing term t and |D| is the number of words in the set D.

Within a taxonomy, synsets are organized by their lexical relations. Thus, given a term, related terms can be found in the synsets related to the synsets containing the term. In this paper, we consider the following five word relations.

• Synonym (Syn): X and Y are synonyms if they are interchangeable in some context.
• Hypernym (Hyper): Y is a hypernym of X if X is a (kind of) Y.
• Hyponym (Hypo): X is a hyponym of Y if X is a (kind of) Y.
• Holonym (Holo): Y is a holonym of X if X is a part of Y.
• Meronym (Mero): X is a meronym of Y if X is a part of Y.

Since these relations are binary, we define the term similarity functions based on these relations in the following way:

    s_R(t1, t2) = α_R   if t1 ∈ T_R(t2)
    s_R(t1, t2) = 0     if t1 ∉ T_R(t2)

where R ∈ {syn, hyper, hypo, holo, mero}, T_R(t) is the set of words related to term t through relation R, and the α_R are non-zero parameters that control the similarity between terms based on different relations. However, since the similarity values for all term pairs are the same, the values of these parameters can be ignored when we use Equation 2 in query expansion.

Another lexical resource we study in this paper is the dependency-based thesaurus provided by Lin (Lin, 1998), available at http://www.cs.ualberta.ca/~lindek/downloads.htm. The thesaurus provides term similarities that are automatically computed based on dependency relationships extracted from a parsed corpus. We define a similarity function that utilizes this thesaurus as follows:

    s_Lin(t1, t2) = L(t1, t2)   if (t1, t2) ∈ TP_Lin
    s_Lin(t1, t2) = 0           if (t1, t2) ∉ TP_Lin

where L(t1, t2) is the similarity of the terms stored in the dependency-based thesaurus and TP_Lin is the set of all term pairs stored in the thesaurus. The similarity of two terms is assigned zero if the term pair cannot be found in the thesaurus.
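As an illustration of the definition-based and relation-based similarities above, the sketch below computes s_def and a binary synonym similarity using NLTK's WordNet interface. The paper itself accesses WordNet 3.0 through the TrecWN library and does not specify tokenization or stopword handling for glosses, so the NLTK calls and the plain whitespace tokenization here are assumptions for illustration only.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus to be installed


def gloss_words(term):
    """D(t): the set of words appearing in the glosses of all synsets containing the term."""
    words = set()
    for synset in wn.synsets(term):
        words.update(synset.definition().lower().split())
    return words


def s_def(t1, t2):
    """Definition overlap: |D(t1) ∩ D(t2)| / |D(t1) ∪ D(t2)|."""
    d1, d2 = gloss_words(t1), gloss_words(t2)
    if not d1 or not d2:
        return 0.0
    return len(d1 & d2) / len(d1 | d2)


def s_syn(t1, t2, alpha=1.0):
    """Binary synonym-based similarity: alpha if t1 is a synonym of t2, else 0."""
    synonyms = {lemma.name().lower().replace('_', ' ')
                for synset in wn.synsets(t2) for lemma in synset.lemmas()}
    return alpha if t1.lower() in synonyms else 0.0
```

Note that with this implementation s_def(t, t) = 1 whenever the term has at least one synset, which is exactly the normalizer s(q, q) used in Equation 2.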
Since all the similarity functions discussed above capture different perspectives of term relations, we propose a simple strategy to combine them, so that the similarity of a term pair is the highest similarity value of the two terms over all the above similarity functions:

    s_combined(t1, t2) = max over R ∈ Rset of s_R(t1, t2),

where Rset = {def, syn, hyper, hypo, holo, mero, Lin}.

In summary, we have discussed eight possible similarity functions that exploit the information from the lexical resources. We then incorporate these similarity functions into the axiomatic retrieval models based on Equation 2, and perform query expansion based on the procedure described in Section 3. The empirical results are reported in Section 5.

5 Experiments

In this section, we experimentally evaluate the effectiveness of query expansion with the term similarity functions discussed in Section 4 in the axiomatic framework. Experiment results show that the similarity function based on synset definitions is the most effective. By incorporating this similarity function into the axiomatic retrieval models, we show that query expansion using the information from only WordNet can lead to significant improvement of retrieval performance, which has not been shown in previous studies (Voorhees, 1994; Stairmand, 1997).

5.1 Experiment Design

We conduct three sets of experiments. First, we compare the effectiveness of the term similarity functions discussed in Section 4 in the context of query expansion. Second, we compare the best one with the term similarity functions derived from co-occurrence-based resources. Finally, we study whether the combination of term similarity functions from different resources can further improve the performance.

All experiments are conducted over six TREC collections: ap88-89, doe, fr88-89, wt2g, trec7 and trec8. Table 1 shows some statistics of the collections, including the description, the collection size, the vocabulary size, the number of documents and the number of queries.

Table 1: Statistics of Test Collections

Collection   Description            Size     # Voc.   # Doc.   # Query
ap88-89      news articles          491MB    361K     165K     150
doe          technical reports      184MB    163K     226K     35
fr88-89      government documents   469MB    204K     204K     42
trec7        ad hoc data            2GB      908K     528K     50
trec8        ad hoc data            2GB      908K     528K     50
wt2g         web collections        2GB      1968K    247K     50

The preprocessing only involves stemming with Porter's stemmer. We use WordNet 3.0 (http://wordnet.princeton.edu/), the Lemur Toolkit (http://www.lemurproject.org/) and the TrecWN library (http://l2r.cs.uiuc.edu/~cogcomp/software.php) in the experiments. The results are evaluated with both MAP (mean average precision) and gMAP (geometric mean average precision) (Voorhees, 2005), which emphasizes the performance of difficult queries.

There is one parameter β in the query expansion method presented in Section 3. We tune the value of β and report the best performance. The parameter sensitivity is similar to the observations described in (Fang and Zhai, 2006) and will not be discussed in this paper. In all the result tables, ‡ and † indicate that the performance difference is statistically significant according to the Wilcoxon signed rank test at the levels of 0.05 and 0.1, respectively.

We now explain the notation for the different methods. BL is the baseline method without query expansion. In this paper, we use the best performing function derived in the axiomatic retrieval models, i.e., F2-EXP in (Fang and Zhai, 2005), with a fixed parameter value (b = 0.5).
QE_X is the query expansion method with term similarity function s_X, where X can be Def., Syn., Hyper., Hypo., Mero., Holo., Lin or Combined.

Furthermore, we examine query expansion methods using co-occurrence-based resources. In particular, we evaluate the retrieval performance using the following two similarity functions: s_MIBL and s_MIImp. Both functions are based on the mutual information of terms in a set of documents. s_MIBL uses the collection itself to compute the mutual information, while s_MIImp uses working sets constructed based on several constraints (Fang and Zhai, 2006). The mutual information of two terms t1 and t2 in collection C is computed as follows (van Rijsbergen, 1979):

    I(X_t1, X_t2) = Σ p(X_t1, X_t2) log [ p(X_t1, X_t2) / ( p(X_t1) p(X_t2) ) ],

where the sum is over the values of X_t1 and X_t2, and X_ti is a binary random variable corresponding to the presence/absence of term ti in each document of collection C.
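For concreteness, the document-level mutual information can be computed as in the sketch below. The iteration over binary outcomes and the skipping of zero-count outcomes are standard choices that the paper does not spell out, so treat them as assumptions.

```python
import math


def mutual_information(t1, t2, docs):
    """I(X_t1, X_t2) over a collection, where docs is a list of sets of terms.

    X_ti = 1 if term ti occurs in a document, 0 otherwise; probabilities are
    maximum-likelihood estimates from document counts.
    """
    n = len(docs)
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for doc in docs:
        counts[(int(t1 in doc), int(t2 in doc))] += 1

    mi = 0.0
    for (a, b), c in counts.items():
        if c == 0:
            continue  # zero-probability outcomes contribute nothing to the sum
        p_ab = c / n
        p_a = sum(counts[(a, y)] for y in (0, 1)) / n
        p_b = sum(counts[(x, b)] for x in (0, 1)) / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi
```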
5.2 Effectiveness of Lexical Resources

We first compare the retrieval performance of query expansion with different similarity functions using short keyword (i.e., title-only) queries, because query expansion techniques are often more effective for shorter queries (Voorhees, 1994; Fang and Zhai, 2006). The results are presented in Table 2.

Table 2: Performance of query expansion using lexical resources (short keyword queries)

               trec7                            trec8                            wt2g
               MAP              gMAP            MAP              gMAP            MAP              gMAP
BL             0.186            0.083           0.250            0.147           0.282            0.188
QE_def         0.216‡ (+16%)    0.105‡ (+27%)   0.266‡ (+6.4%)   0.164‡ (+12%)   0.301‡ (+6.7%)   0.210‡ (+12%)
QE_syn         0.194  (+4.3%)   0.085‡ (+2.4%)  0.252† (+0.8%)   0.150† (+2.0%)  0.287‡ (+1.8%)   0.194‡ (+3.2%)
QE_hyper       0.186  (0%)      0.086  (+3.6%)  0.250  (0%)      0.152  (+3.4%)  0.286† (+1.4%)   0.192† (+2.1%)
QE_hypo        0.186† (0%)      0.085‡ (+2.4%)  0.250  (0%)      0.147  (0%)     0.282† (0%)      0.190  (+1.1%)
QE_mero        0.187‡ (+0.5%)   0.084‡ (+1.2%)  0.250  (0%)      0.147  (0%)     0.282  (0%)      0.189  (+0.5%)
QE_holo        0.191‡ (+2.7%)   0.085‡ (+2.4%)  0.250  (0%)      0.147  (0%)     0.282  (0%)      0.188  (0%)
QE_Lin         0.193‡ (+3.7%)   0.092‡ (+11%)   0.256‡ (+2.4%)   0.156‡ (+6.1%)  0.290‡ (+2.8%)   0.200‡ (+6.4%)
QE_Combined    0.214‡ (+15%)    0.104‡ (+25%)   0.267‡ (+6.8%)   0.165‡ (+12%)   0.300‡ (+6.4%)   0.208‡ (+10.5%)

               ap88-89                          doe                              fr88-89
               MAP              gMAP            MAP              gMAP            MAP              gMAP
BL             0.220            0.074           0.174            0.069           0.222            0.062
QE_def         0.254‡ (+15%)    0.088‡ (+19%)   0.181‡ (+4%)     0.075‡ (+10%)   0.225‡ (+1.4%)   0.067‡ (+8.1%)
QE_syn         0.222‡ (+0.9%)   0.077‡ (+4.1%)  0.174  (0%)      0.074  (+7.3%)  0.222  (0%)      0.065  (+4.8%)
QE_hyper       0.222‡ (+0.9%)   0.074  (0%)     0.175  (+0.5%)   0.070  (+1.5%)  0.222  (0%)      0.062  (0%)
QE_hypo        0.222‡ (+0.9%)   0.076‡ (+2.7%)  0.176† (+1.1%)   0.073† (+5.8%)  0.222  (0%)      0.062  (0%)
QE_mero        0.221  (+0.45%)  0.074† (0%)     0.174† (0%)      0.070† (+1.5%)  0.222  (0%)      0.062  (0%)
QE_holo        0.221  (+0.45%)  0.076  (+2.7%)  0.177† (+1.7%)   0.073  (+5.8%)  0.222  (0%)      0.062  (0%)
QE_Lin         0.245‡ (+11%)    0.082‡ (+11%)   0.178  (+2.3%)   0.073  (+5.8%)  0.222  (0%)      0.067† (+8.1%)
QE_Combined    0.254‡ (+15%)    0.085‡ (+12%)   0.179† (+2.9%)   0.074† (+7.3%)  0.223† (+0.5%)   0.065  (+4.3%)

It is clear that query expansion with these functions can improve the retrieval performance, although the performance gains achieved by different functions vary a lot. In particular, we make the following observations.

First, the similarity function based on synset definitions is the most effective one. QE_def significantly improves the retrieval performance for all the data sets. For example, on trec7, it improves the performance from 0.186 to 0.216. As far as we know, none of the previous studies showed such significant performance improvement by using only WordNet as the query expansion resource.

Second, the similarity functions based on term relations are less effective compared with the definition-based similarity function. We think the worse performance is related to the following two reasons: (1) the similarity functions based on relations are binary, which is not a good way to model term similarities; (2) the relations are limited by the part of speech of the terms, because two terms in WordNet are related only when they have the same part of speech tags. The definition-based similarity function does not have such a limitation.

Third, the similarity function based on Lin's thesaurus is more effective than those based on term relations from WordNet, while it is less effective than the definition-based similarity function, which might be caused by its smaller coverage.

Finally, combining different WordNet-based similarity functions does not help, which may indicate that the expanded terms selected by different functions overlap.

5.3 Comparison with Co-occurrence-based Resources

As shown in Table 2, the similarity function based on synset definitions, i.e., s_def, is the most effective. We now compare the retrieval performance of using this similarity function with that of using the mutual information based functions, i.e., s_MIBL and s_MIImp. The experiments are conducted over two types of queries, i.e., short keyword (keyword title) and short verbose (one-sentence description) queries.

The results for short keyword queries are shown in Table 3. The retrieval performance of query expansion based on s_def is significantly better than that based on s_MIBL on almost all the data sets, while it is slightly worse than that based on s_MIImp on some data sets.

Table 3: Performance comparison of hand-crafted and co-occurrence-based thesauri (short keyword queries)

            MAP                               gMAP
Data        QE_def    QE_MIBL   QE_MIImp      QE_def    QE_MIBL   QE_MIImp
ap88-89     0.254     0.233‡    0.265‡        0.088     0.081‡    0.089‡
doe         0.181     0.175†    0.183         0.075     0.071†    0.078
fr88-89     0.225     0.222‡    0.227†        0.067     0.063     0.071‡
trec7       0.216     0.195‡    0.236‡        0.105     0.089‡    0.097
trec8       0.266     0.250‡    0.278         0.164     0.148‡    0.172
wt2g        0.301     0.311     0.320‡        0.210     0.218     0.219‡
We can make similar observations from the results for short verbose queries, shown in Table 4.

Table 4: Performance Comparison (MAP, short verbose queries)

Data       BL       QE_def            QE_MIBL           QE_MIImp
ap88-89    0.181    0.220‡ (21.5%)    0.205‡ (13.3%)    0.230‡ (27.1%)
doe        0.109    0.121‡ (11%)      0.119  (9.17%)    0.117  (7.34%)
fr88-89    0.146    0.164‡ (12.3%)    0.162‡ (11%)      0.164‡ (12.3%)
trec7      0.184    0.209‡ (13.6%)    0.196  (6.52%)    0.224‡ (21.7%)
trec8      0.234    0.238‡ (1.71%)    0.235  (0.4%)     0.243† (3.85%)
wt2g       0.266    0.276  (3.76%)    0.276† (3.76%)    0.282‡ (6.02%)

One advantage of s_def over s_MIImp is the computational cost: s_def can be computed offline in advance, while s_MIImp has to be computed online from query-dependent working sets, which takes much more time. The low computational cost and high retrieval performance make s_def more attractive in real-world applications.

5.4 Additive Effect

Since both types of similarity functions are able to improve retrieval performance, we now study whether combining them could lead to better performance. Table 5 shows the retrieval performance of combining both types of similarity functions for short keyword queries. The results for short verbose queries are similar. Clearly, combining the similarity functions from different resources can further improve the performance.

Table 5: Additive Effect (MAP, short keyword queries)

                ap88-89   doe      fr88-89   trec7    trec8    wt2g
QE_MIBL         0.233     0.175    0.222     0.195    0.250    0.311
QE_def+MIBL     0.257‡    0.183‡   0.225‡    0.217‡   0.267‡   0.320‡
QE_MIImp        0.265     0.183    0.227     0.236    0.278    0.320
QE_def+MIImp    0.269‡    0.187    0.232‡    0.237‡   0.280†   0.322†

6 Conclusions

Query expansion is an effective technique in information retrieval to improve the retrieval performance, because it can often bridge the vocabulary gaps between queries and documents. Intuitively, a hand-crafted thesaurus could provide reliable related terms, which would help improve the performance. However, none of the previous studies was able to show significant performance improvement through query expansion using information only from manually created lexical resources.

In this paper, we re-examine the problem of query expansion using lexical resources in the recently proposed axiomatic framework and find that we are able to significantly improve retrieval performance through query expansion using only hand-crafted lexical resources. In particular, we first study a few term similarity functions exploiting the information from two lexical resources: WordNet and the dependency-based thesaurus created by Lin. We then incorporate the similarity functions with the query expansion method in the axiomatic retrieval models. Systematic experiments have been conducted over six standard TREC collections and show promising results. All the proposed similarity functions improve the retrieval performance, although the degree of improvement varies for different similarity functions. Among all the functions, the one based on synset definitions is the most effective and is able to significantly and consistently improve retrieval performance for all the data sets. This similarity function is also compared with similarity functions using mutual information. Furthermore, experiment results show that combining similarity functions from different resources can further improve the performance.

Unlike previous studies, we are able to show that query expansion using only manually created thesauri can lead to significant performance improvement.
The main reason is that the axiomatic approach provides guidance on how to appropriately assign weights to expanded terms.

There are many interesting future research directions based on this work. First, we will study the same problem in a specialized domain, such as biology literature, to see whether the proposed approach can be generalized to the new domain. Second, the fact that using axiomatic approaches to incorporate linguistic information can improve retrieval performance is encouraging. We plan to extend the axiomatic approach to incorporate more linguistic information, such as phrases and word senses, into retrieval models to further improve the performance.

Acknowledgments

We thank ChengXiang Zhai, Dan Roth, and Rodrigo de Salvo Braz for valuable discussions. We also thank the three anonymous reviewers for their useful comments.

References

J. Bai, D. Song, P. Bruza, J. Nie, and G. Cao. 2005. Query expansion using term relationships in language models for information retrieval. In Fourteenth International Conference on Information and Knowledge Management (CIKM 2005).

S. Banerjee and T. Pedersen. 2005. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the 18th International Joint Conference on Artificial Intelligence.

G. Cao, J. Nie, and J. Bai. 2005. Integrating word relationships into language models. In Proceedings of the 2005 ACM SIGIR Conference on Research and Development in Information Retrieval.

H. Fang and C. Zhai. 2005. An exploration of axiomatic approaches to information retrieval. In Proceedings of the 2005 ACM SIGIR Conference on Research and Development in Information Retrieval.

H. Fang and C. Zhai. 2006. Semantic term matching in axiomatic approaches to information retrieval. In Proceedings of the 2006 ACM SIGIR Conference on Research and Development in Information Retrieval.

N. Fuhr. 1992. Probabilistic models in information retrieval. The Computer Journal, 35(3):243–255.

Y. Jing and W. Bruce Croft. 1994. An association thesaurus for information retrieval. In Proceedings of RIAO.

D. Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning (ICML).

S. Liu, F. Liu, C. Yu, and W. Meng. 2004. An effective approach to document retrieval via utilizing WordNet and recognizing phrases. In Proceedings of the 2004 ACM SIGIR Conference on Research and Development in Information Retrieval.

R. Mandala, T. Tokunaga, and H. Tanaka. 1999a. Ad hoc retrieval experiments using WordNet and automatically constructed thesauri. In Proceedings of the Seventh Text REtrieval Conference (TREC7).

R. Mandala, T. Tokunaga, and H. Tanaka. 1999b. Combining multiple evidence from different types of thesaurus for query expansion. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval.

G. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography, 3(4).

H. J. Peat and P. Willett. 1991. The limitations of term co-occurrence data for query expansion in document retrieval systems. Journal of the American Society for Information Science, 42(5):378–383.

J. Ponte and W. B. Croft. 1998. A language modeling approach to information retrieval. In Proceedings of ACM SIGIR'98, pages 275–281.

Y. Qiu and H. P. Frei. 1993. Concept based query expansion. In Proceedings of the 1993 ACM SIGIR Conference on Research and Development in Information Retrieval.

P. Ruch, I. Tbahriti, J. Gobeill, and A.
R. Aronson. 2006. Argumentative feedback: A linguistically-motivated term expansion for information retrieval. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 675–682.

G. Salton, C. S. Yang, and C. T. Yu. 1975. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, 26(1):33–44, Jan–Feb.

A. F. Smeaton and C. J. van Rijsbergen. 1983. The retrieval effects of query expansion on a feedback document retrieval system. The Computer Journal, 26(3):239–246.

M. A. Stairmand. 1997. Textual context analysis for information retrieval. In Proceedings of the 1997 ACM SIGIR Conference on Research and Development in Information Retrieval.

C. J. van Rijsbergen. 1979. Information Retrieval. Butterworths.

E. M. Voorhees. 1993. Using WordNet to disambiguate word sense for text retrieval. In Proceedings of the 1993 ACM SIGIR Conference on Research and Development in Information Retrieval.

E. M. Voorhees. 1994. Query expansion using lexical-semantic relations. In Proceedings of the 1994 ACM SIGIR Conference on Research and Development in Information Retrieval.

E. M. Voorhees. 2005. Overview of the TREC 2005 robust retrieval track. In Notebook of the Thirteenth Text REtrieval Conference (TREC 2005).
