GENERAL GUIDELINES ON RANDOM-QUERY EVALUATION

Version 3.1
Last update: December 31, 2003

Random-Query Evaluation

Thank you for participating in one of Google's routine quality control processes, the "random-query" evaluation. This form of search evaluation takes its name from the fact that the queries it draws on were randomly selected from our query logs – in other words, these are all queries that someone, at some point, actually entered into the Google query box. Because we want to obtain a realistic impression of how well we're serving the average user, we are careful not to pick queries that are particularly well phrased, or easy to search, or unambiguous in intent. (At present we only filter out queries that are clearly pornographic, queries that are complete or incomplete URL addresses such as sport.com, www.simslots.com, ssa.gov or www.California.com, and certain numerical queries.) That means that you will encounter queries posed by school-age children and reference librarians, research scientists and housekeepers, first-time Internet users and experienced computer geeks.

Of course, our decision to include the full spectrum of queries people pose means that evaluation of search results is a tricky business at times, and that – in the absence of help from the person who originally posed the query – we are often confronted with uncertainty about the meaning or purpose of a query and the suitability of the results it brings up. Let us note from the outset that we evaluate results based on relevance not to the specific person who actually posed the query, but to an imaginary rational mind "behind" the query. Oftentimes, a query may have more than one meaning, or interpretation. In such cases we have to look at the hypothetical set of rational search engine users behind the ambiguous query and deduce, or roughly estimate, the make-up of that set; for instance, we will consider the relative presence of zoology enthusiasts and car shoppers in a hypothetical representative sample of the users who could have queried [jaguar].¹

People use the web, and search in particular, for all sorts of needs and in all sorts of ways. The suitability of results to their searches can be assessed from several perspectives, and often along several dimensions. One result on {product ABC} is good if you want to buy the product, but has no information on what to do if ABC malfunctions; another result has the troubleshooting guide for ABC, while a third does a good job comparing ABC to similar products. There is a certain subjective element involved in evaluation. Despite this complexity, there are some general principles on how to rate query results appropriately and consistently. This document tries to articulate these principles as clearly as possible. Undoubtedly there are many specific situations that it does not cover. If you find yourself repeatedly stumped by a certain type of query and/or result, please do not hesitate to contact your project manager for advice.

¹ Throughout this document, we will use square brackets to denote a query exactly as posed to the search engine, including any syntax, e.g. [philosophy+mind], [car dealers, "Mountain View"]. We will use curly brackets to denote a query, or part of a query, by type: {celebrity by name} means a query for anyone who's currently considered a celebrity; [waiters-on-wheels {location}] can mean [waiters-on-wheels san francisco] or [waiters-on-wheels san jose]; {global company, location} can stand for [ikea germany] or [hp palo alto]; etc.
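To make the "set of rational users behind a query" concrete, here is a minimal Python sketch that models an ambiguous query as a rough probability distribution over its interpretations. The interpretation names and the probability figures for [jaguar] are invented for illustration; the guidelines leave the actual estimates to the rater's research.

    # A minimal sketch: an ambiguous query modeled as a set of
    # interpretations with rough probabilities. The figures below are
    # invented for illustration, not taken from the guidelines.
    from dataclasses import dataclass

    @dataclass
    class Interpretation:
        meaning: str
        probability: float  # rough share of users intending this meaning

    def dominant(interpretations):
        """Return the interpretation the largest share of users likely intends."""
        return max(interpretations, key=lambda i: i.probability)

    # [jaguar]: zoology enthusiasts vs. car shoppers (figures invented)
    jaguar = [
        Interpretation("the animal", 0.35),
        Interpretation("the car brand", 0.65),
    ]
    print(dominant(jaguar).meaning)  # -> the car brand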
Introduction

The evaluation process always starts with web coverage research for the query. The major goals of the web coverage research are:

• Determining whether the query is ambiguous, i.e. whether it possesses multiple interpretations.

• Assigning rough probability values to each interpretation, based on the available evidence. For example, the query [wallpaper] is ambiguous between the real, tangible wallpaper and computer wallpaper. Statistically speaking, seeking to download computer wallpaper may be the more likely scenario, but the home improvement interpretation also has its niche on the web. Take a different query, [beer wallpaper], and the desktop decoration interpretation clearly wins over the home improvement one.

• Ascertaining how much, or how little, information there is on the web for the query.² Knowledge of the query's coverage will come in handy when you have to decide whether a particular result merits a relatively high position in the search result listing. For instance, if the query is [Maltese music], a top-level page on a site devoted to regional types of music, with a link to a site on Maltese music among many other links, should not belong in the top ten or so results. Maltese music enjoys good web coverage, and there is no reason to promote a result that is not exactly right.

• Determining the "country of origin" for the query. The default assumption is that the query comes from the US; broad, "global" results and specific US results are thus appropriate. But many queries override the default assumption. For instance, the query [Motorized bicycle Singapore petrol] uses a word that is not part of American English and specifies a region outside of the United States. Whereas Singapore results to other queries – those that by default fall under the US origin rule – may be inappropriate, Singapore and only Singapore results are appropriate, and may belong in the top ten, for this particular query.

² Some queries have ample coverage – results to them should be put to the strictest scrutiny. Other queries have scant coverage – rate results to those more leniently. For example, the query ["new mexico" state penal code] cannot bring the desired result, because the State of New Mexico, as of this writing, does not have a state penal code (New Mexico relies instead on its common law).

The Query Types

While there is no simple way to categorize all searches into a neatly organized system, analysts of web search have drawn a useful distinction between three major categories: navigational queries, informational queries, and transactional queries. This classification, as it turns out, allows for some useful generalizations in the context of query-result evaluation.

A navigational query is one that normally has only one satisfactory result: the user types in the name of an entity ("United Airlines") and expects to be taken to the homepage of that entity.

An informational query can have many or few appropriate results, with varying degrees of relevance/informativeness/utility and varying degrees of authority. The user types in a topic ("renaissance paintings", "aging disease"), sometimes in the form of an actual question ("What is a quark?", "How do I ...?"), and expects to be provided with information on this topic.

A transactional query can likewise have many or few appropriate results, of varying quality. However, in this case the user is not requesting information – or at least not only information – but instead has the primary goal of carrying out a transaction. Typically, the transaction consists of the acquisition – for money or for free – of a product or service. Some transactions can be fully carried out on the web (think furniture clipart download); some come to fruition offline (think furniture to put in a house).

Again, not every query can be clearly classified. Since products include information products, the line between informational and transactional queries is sometimes hard to draw. Similarly, because the ulterior motive for a navigational query is often to locate a site for potential transactions, there is a grey zone between navigational and transactional queries. To the extent the classification is helpful, use it, but do not attempt to fit every query that comes your way into one of the three boxes: always trying to decide in favor of one or the other will only lead to frustration. It may be more helpful to think of the different aspects of a query, as the sketch below illustrates: the query [thomas the tank engine], for instance, can have (a) a navigational aspect – take me to Thomas' homepage; (b) an informational aspect – tell me the history of Thomas' creation; and (c) a transactional aspect – I want to buy a book or a toy engine from the Thomas collection.
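As a rough illustration of the typology – and of the advice to think in aspects rather than boxes – the sketch below models a query as carrying a set of aspects. The example queries come from the text above; the data structure itself is just an invented teaching aid, not part of the guidelines.

    # Toy sketch: a query carries a SET of aspects rather than exactly
    # one type. Examples are from the text; the structure is invented.
    from enum import Enum

    class Aspect(Enum):
        NAVIGATIONAL = "take me to a specific entity's site"
        INFORMATIONAL = "give me information on a topic"
        TRANSACTIONAL = "let me carry out a transaction"

    query_aspects = {
        "united airlines": {Aspect.NAVIGATIONAL},
        "what is a quark?": {Aspect.INFORMATIONAL},
        "thomas the tank engine": {  # mixes all three aspects
            Aspect.NAVIGATIONAL,     # Thomas' homepage
            Aspect.INFORMATIONAL,    # the history of Thomas' creation
            Aspect.TRANSACTIONAL,    # buying a book or a toy engine
        },
    }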
The Rating Categories

We make use of the following categories for evaluating our search results:

Vital
Useful
Relevant
Off Topic
Offensive
Erroneous
Didn't Load
Foreign Language
Unrated

Please note that the technical term Relevant differs from the generic word relevant. The same may be true of other category names; to avoid confusion, we always capitalize our category names to set them apart from their generic meanings.

The options from Vital down to Offensive constitute the merit scale. A major subset of the merit scale is the utility continuum, spanning the categories from Useful to Off Topic. Assigning ratings on the merit scale reflects your opinion on where, roughly, the rated results should or could appear in an idealized search result ranking for the given query. Due to the multi-dimensional relativity of the merit scale ratings, more than one rating can be justified at times. In such cases, we ask you to pick the lower rating on the scale.

Erroneous, Didn't Load, Foreign Language and Unrated are special categories that are, in effect, non-ratings. By selecting one of these categories, you do not express an opinion on the range of positions the result may occupy in a ranking; rather, you record certain technical attributes of the result page. To match the workflow of query-result evaluation, we will start by briefly introducing the non-ratings.³ For detailed examples, please view the FAQ posted on the Rater Hub.

³ Please read "Using Quest" to familiarize yourself with the logistics of ratings. "Using Quest" is available online at http://eval.google.com/happier/portal_files/Using_Quest.pdf

Didn't Load

A result that is not visible cannot be evaluated. If you are seeing a "Page not found" message, assign Didn't Load. Note that many sites experience a certain amount of downtime on a regular basis. As a result, a page that does not load in the morning may load later in the day or the next day. We would appreciate it if you noted which results you marked Didn't Load as you work your way through your project, and briefly revisited them before you sign off on your completed job. Doing so may yield a few extra informative ratings. An in-depth discussion of rating policies in the absence of a working cache is contained in "Using Quest", an instruction on the rating interface.

Foreign Language

A result that loads fine but is fully in a foreign language should be labeled as such.
For English-language rating projects, any result that is not in English is Foreign Language (for projects in other languages, please read the Appendix "Evaluating i18n results" for the description of the Foreign Language category).⁴ Certain exceptions apply when the foreign language page is essentially non-verbal (think images, downloads), and in a few other cases discussed in the answers to the FAQ on the Rater Hub.

⁴ The Appendix is NOT required reading if your project is English.

Erroneous

Erroneous results load fine and are not in a foreign language. This category designates what you might think of as "indirect results": an output of searching on an engine or a directory, or a page that invites you to search an engine or directory. Engines and directories that fall in this category search the whole web, not just a subset of it such as everything in one city or all travel-related information. Of course, the Erroneous rating does not apply to engines or directories that are expressly requested by the query.⁵

⁵ See the [ask jeeves] example in Table 1.

Unrated

Under certain circumstances, you may be unable to assign a valid rating. For example, despite your best efforts at researching the query and/or result, you may feel you lack the knowledge needed to express an opinion. Choose Unrated then.

A page can well possess several of the above technical attributes. There is more discussion of this in the FAQ section of the Rater Hub. If none of the above technical categories applies, meaning that:

• the page loads fine; and
• the page is in the "correct" language; and
• the page is not a search output from an engine or directory (be it Google or another engine/directory organizing information from the web as a whole);⁶ and
• you have sufficient information about, and understanding of, the query and/or result,

then the result should be rated on the merit scale.

⁶ Infrequently, you may see a result that is a Google directory listing, or a result page from another Google interface, or even a search result listing from Google.com. Those instances should all be categorized as Erroneous.
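This decision order – rule out the technical non-ratings first, and only then place the result on the merit scale, picking the lower rating whenever more than one is justified – can be summarized in a short Python sketch. The category names are the guidelines' own; the field names and example values are invented stand-ins for judgments the rater actually makes.

    # Sketch of the rating workflow described above: technical
    # non-ratings are checked first; only if none applies is the result
    # placed on the merit scale. Field names and example values are
    # invented stand-ins for rater judgments.
    from dataclasses import dataclass, field
    from enum import IntEnum

    class Merit(IntEnum):
        # Higher value = higher position on the merit scale.
        OFFENSIVE = 0
        OFF_TOPIC = 1
        RELEVANT = 2
        USEFUL = 3
        VITAL = 4

    @dataclass
    class Result:
        loaded: bool = True
        in_project_language: bool = True
        non_verbal: bool = False                  # images, downloads, ...
        whole_web_search_output: bool = False     # engine/directory over the web as a whole
        engine_expressly_requested: bool = False  # e.g. [ask jeeves]
        rater_can_judge: bool = True
        justified_ratings: list = field(default_factory=list)

    def rate(r):
        # 1. Technical non-ratings, in workflow order.
        if not r.loaded:
            return "Didn't Load"
        if not r.in_project_language and not r.non_verbal:
            return "Foreign Language"
        if r.whole_web_search_output and not r.engine_expressly_requested:
            return "Erroneous"
        if not r.rater_can_judge or not r.justified_ratings:
            return "Unrated"
        # 2. Merit scale: if more than one rating is justified,
        #    pick the LOWER rating on the scale.
        return min(r.justified_ratings)

    # A page that loads fine and could defensibly be Useful or Relevant:
    print(rate(Result(justified_ratings=[Merit.USEFUL, Merit.RELEVANT])))
    # -> Merit.RELEVANT (the lower of the two)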
We will now introduce the merit scale categories from the top down: from Vital to Offensive.

VERY IMPORTANT: Merit scale evaluation is not based on the absence or presence of the queried terms on the result page. Consider a result to the query [German educational toys], http://www.toy-spectrum.com/overview/puzzle/puzzle.html. Does the absence of the word "educational" reduce the quality of the match in any way? No. The products on the page are clearly educational without being overtly described with this term.⁷ Similarly, the query [Users of the internet (Graph)] can be quite adequately answered by a resource that gives a graph and mentions such terms as "statistics of", "demographic", "access", "usage", "database", "table", etc., without explicitly mentioning "users" and "Graph".

⁷ Arguably, a self-respecting educational toy outlet would not mention the word "educational" very much, assuming that customers can recognize a quality product on their own.

VITAL

Vital is a category reserved for very special, unique results to a special subset of queries. Examples best illustrate the special attributes of the queries that can have Vital results. Listed in Table 1 are some examples of queries with Vital results; you will notice that these queries are predominantly navigational.

Table 1.

[OfficeMax]
    http://www.officemax.com/

[arizona lottery]
    http://www.arizonalottery.com/

[san jose international airport]
    http://www.sjc.org/

[Simon and Garfunkel]
    http://www.simonandgarfunkel.com/

[suny Binghamton]
    http://www.binghamton.edu/

[Union Bank of California]
    http://www.uboc.com/uboc/home

[The weather channel]
    http://www.weather.com/

[banana republic]
    http://www.bananarepublic.com/default.htm

[ask jeeves]
    http://www.ask.com/

[interact australia]
    http://www.interactaust.com.au/intaust.htm

[los altos school district boundaries map]
    http://www.losaltos.k12.ca.us/schls_boundmap.htm#top or
    http://www.losaltos.k12.ca.us/PDF_Files/Boundaries_2003.pdf (both fit the bill)

[san jose public library]
    http://www.sjlibrary.org/

[san jose public library branches]
    http://www.sjlibrary.org/about/contacts/branches.htm or
    http://www.sjlibrary.org/about/locations/index.htm

[N-400 form]
    http://uscis.gov/graphics/formsfee/forms/n-400.htm

[Canadian parliament]
    http://www.parl.gc.ca/ or
    http://www.parl.gc.ca/common/index.asp?Language=E&Parl=37&Ses=2

[disable javascript ie]
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;244233
    Note that the information may not be displayed conspicuously (in the case at hand, one needs to scroll down the page to read the how-to). The page is not necessarily wholly on the topic of the query. Yet it provides the how-to endorsed by the creator of IE. Hence, Vital.

[Barbie]
    www.barbie.com, from Mattel, the company owning the rights to the brand

[form iap-66]
    http://travel.state.gov/visa%3Bexchange.html
    A page on the site of the ultimate authority on the subject advises of the change in the title of the form, and is therefore Vital to the query. Note how Vital results are not necessarily the most useful – but they are uniquely authoritative. The pertinent paragraph is buried in the dense text of this page. Someone's personal page shouting "Hey, I searched for IAP-66 and could not find it, guess what, the world has changed and I want everyone to know!" could have been user-friendlier, yet it lacks authority.

Is there a Vital result out there for any query imaginable? Emphatically, no. Indeed, most queries cannot have Vital results. We will call those queries generic. Generic queries cannot have Vital results because no one has ultimate authority on the subject matter and no entity is the uniquely definitive target of the search. Some queries, such as [things different cultures do for fun], are obviously generic – no ultimate, omniscient authority can ever put together "the" resource for such queries. Other generic queries are sometimes matched by results that may, incorrectly, appear uniquely appropriate. Table 2 lists a few cases in point.

Table 2.

[Learn How To knit]
    Result that may wrongly appear Vital: http://www.learnhowtoknit.com
    Why there is no Vital result: Please refer to the discussion below on URLs that match the query verbatim.

[crime and punishment]
    Result that may wrongly appear Vital: http://www.nbc.com/Crime_&_Punishment/
    Why there is no Vital result: Several interpretations "compete" for this query. There is the book by Fyodor Dostoevsky, which is part of the global canon and thus an appropriate result for a US-based query. A seminal work by Lawrence Meir Friedman, "Crime and Punishment in American History", is very well known and widely cited in the US. Although it does not match the query fully, the book is commonly referred to as "Crime and Punishment" by Friedman, omitting the rest of the title. Then there is the book "Crime and Punishment in America" by Elliott Currie, another possible interpretation for a US-based query. And a handful of resources on law enforcement juxtapose "crime" and "punishment" in their descriptions, evidencing that the word combination has a generic sense to it.
[map of central America]
    Result that may wrongly appear Vital: http://www.infoplease.com/atlas/centralamerica.html
    Why there is no Vital result: There are terrific resources for maps, but none can claim ultimate authority on the subject of maps of any region.

[mouth ulcers]
    Results that may wrongly appear Vital: (a) http://www.mouthulcers.org/ (b) http://www.nlm.nih.gov/medlineplus/ency/article/001448.htm
    Why there are no Vital results: Diseases cannot have homepages; no one can claim unique authority on everything related to diagnosing, treating, and preventing any particular disease. Neither a personal homepage such as result (a), nor an informative page from a well-regarded source such as result (b), can claim the unique Vital status in relation to a disease query.

[How to build a fence]
    Result that may wrongly appear Vital: http://www.theworkshop.net/Tips/htm/fence_howtobuild.htm
    Why there is no Vital result: Good, but not unique. There are many fence models out there, and many opinions on how best to construct each.

[music]
    Results that may wrongly appear Vital: http://www.music.com/ and http://www.music.org
    Why there are no Vital results: Please refer to the discussion below on URLs that match the query verbatim.

[quality of life]
    Result that may wrongly appear Vital: http://www.utoronto.ca/qol/
    Why there is no Vital result: A concerted effort to research the quality of life cannot speak on this query with unique authority.

[wrongful dismissal from employment]
    Result that may wrongly appear Vital: http://www.wrongful-dismissal.com/
    Why there is no Vital result: Although this site is solely concerned with wrongful dismissal cases, it is not a uniquely authoritative resource for the body of law on wrongful dismissal.

[london student apartments]
    Result that may wrongly appear Vital: http://www.londonnet.co.uk/ln/guide/accomm/budget_student.html
    Why there is no Vital result: A good list, but again, unless, counterfactually, all student apartments in London are monopolized by one agency, no unique resource exists for this query.

Please note: Certain queries for familiar named entities will jump out at you as such. You will be able to tell that those queries have Vital results without pressing a key or pointing a mouse. Others may wrongly appear generic. For instance, [interact australia] may appear to be an awkward query placed by someone in need of online or offline companionship on the continent. However, doing web research for the query quickly reveals that there has long been a unique organization by the name of "Interact Australia". Similarly, [men's health online] could be broadly generic, but given that there is a magazine called exactly Men's Health, the likelier scenario is that the query is targeting the online version of that magazine. More likely still, the query [economist] pertains to the magazine, making Economist.com Vital to the query. Why? The existence of the magazine, and now of its online resource, gives this single-word query a very strong, dominant interpretation. What about the generic interpretation? It is weak. Taken in the generic sense, it is unclear what [economist] would be looking for – information on economists? On the latest Nobel Prize award in economics? Schools, or professional organizations with directories of economists? Economist jokes? When a query has a vague generic meaning and a clear named-entity meaning, treat the generic interpretation as a minor one at most.
By the same token, appropriate results to the queries [amnesty], ["Amnesty International"] and [amnesty international] should be the same, as the queries are essentially no different from one another – the generic meaning of [amnesty] is weak.⁸

⁸ In print, an all-lowercase string is less likely to identify a named entity than one in which the initial letters are capitalized. However, it is well known among web users that Google does not distinguish upper- and lowercase in queries; hence a query without caps might just reflect efficiency on the part of an experienced user.

To [legal information institute], http://www.law.cornell.edu/ is Vital because it is the homepage of a widely (and internationally) cited resource, the LII at Cornell Law School. It is THE resource that most people who could have placed the query (legal scholars and practitioners, students of law, legislators, anyone interested in legal research) would associate with the query. At the same time, there is an international network of legal information institutes, to which the LII at Cornell belongs. The existence of the network does not render the query generic: the network and its non-US branches are less well known. It is inconceivable that the representative user behind [legal information institute] would be aware of the network, or of a regional institution such as the Australasian Legal Information Institute, without knowing of the Cornell Law School LII. Hence, had the (rational) user wanted the British or Australasian resource, the query would have reflected that preference. The network homepage and the regional institutes merit high ratings, but they are not Vital.

What if an ambiguous query has two strong interpretations, so that each can be roughly assigned a probability of 50 percent, and each interpretation "possesses" a unique homepage? Do we have a Vital result for each? The answer is no. Reflecting the ambiguity inherent in the query, we demote what otherwise would have been a Vital result to the next rating down the merit scale. As a result, both unique, ideal results – one per interpretation – should be rated Useful.⁹ Similarly, if one interpretation of the query happens to have a uniquely matching homepage, but that interpretation does not stand out as the most salient, predominantly likely one, then the homepage – which would have been Vital in the absence of other, stronger query interpretations – should be appropriately demoted on the merit scale.

⁹ Please see the [ADA] example below for the application of this rule.

In essence, you as a rater will make two judgments before arriving at the final rating for results to an ambiguous query: first, you will determine the rating applicable per interpretation; second, you will map this rating onto the merit scale, considering the presence and relative likelihoods of the other query interpretations. What is Vital to the dominant interpretation becomes a final Vital score, but what would have been Vital to a non-dominant interpretation should be mapped down on the merit scale. To sum up, a Vital result is one that uniquely matches the most dominant query interpretation. The sketch below restates this two-step procedure.
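Here is a compact sketch of the two-step procedure. It encodes only the demotion rule spelled out above (a would-be Vital result for a non-dominant interpretation drops to Useful); the guidelines leave both the judgment of dominance and the size of further demotions to the rater, so the numeric comparison below is an invented stand-in.

    # Two-step rating for ambiguous queries, as described above: a
    # would-be Vital result stays Vital only under the dominant
    # interpretation; otherwise it is demoted to the next rating down,
    # Useful. The dominance test via raw probability shares is an
    # invented stand-in for rater judgment.
    from enum import IntEnum

    class Merit(IntEnum):
        OFFENSIVE = 0
        OFF_TOPIC = 1
        RELEVANT = 2
        USEFUL = 3
        VITAL = 4

    def final_rating(rating_per_interpretation, this_share, largest_other_share):
        dominant = this_share > largest_other_share
        if rating_per_interpretation == Merit.VITAL and not dominant:
            return Merit.USEFUL  # one step down the merit scale
        return rating_per_interpretation

    # Two strong interpretations at roughly 50/50, each with a unique
    # homepage: neither homepage is Vital; both end up Useful.
    print(final_rating(Merit.VITAL, 0.50, 0.50))  # -> Merit.USEFUL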
[...]

...site of a large store network can be Useful. Or consider the hours-and-location page, if the queried entity is one that people likely visit in person.

• Results that are Useful to the non-navigational aspect of the query. For example, the [simon and garfunkel] query may well be placed not just in expectation of the homepage of Simon and Garfunkel, but in expectation of browsing through good resources for tablatures, [...]

[...] exclusively on spam identification tools and is required reading after you become comfortable with these General Guidelines.

[...] If you object to a pornographic web environment, we will accommodate your preferences and not require that you rate objectionable content.

Table 5 (fragment).

[Charleston, SC Chamber of Commerce]
    Result URL: http://www.jicccharleston.com/charleston-sc-chamber-of-commerce.shtml
    Explanation: [...]

NOT RELEVANT

Further down on the utility continuum, Not Relevant results are generally not helpful to users but are still connected with the query topic: you can see a relationship, albeit an attenuated one, between the query and the result. Thus, on-topic results that are too marginal in scope, outdated, too narrowly regional, too specific, too broad, etc. are Not Relevant. Take the query [yellow pages]. Of [...]

[...] the meaning of the query and its type – is it navigational, informational, transactional, or a mixture of two or three? Determining the merit rating in light of the query coverage and considering the various utility dimensions, as well as taking into account evidence of deceitful web design where appropriate. Research for the query should be done before you open any results that are up for evaluation. [...]

[...] meaning of the queried word combination.

Resources that may be Not Relevant, although an Off Topic rating too is appropriate: aromatherapy resources in general; online aromatherapy classes; resources on Latin other than translation from English to Latin, in particular Latin-to-English translation; results on philosophy without overt mentions of Berkeley's contribution; good geography resources that allow [...]

[...] the news of the day on that individual on an ongoing basis. Conversely, a news site may be Useful if reliable and timely, without offering the benefit of great depth. Useful results ought to be highly satisfying for the user: if the query is informational, they should be very informative; if the query is transactional, they should allow the user to complete the transaction.¹⁰ Table 3 contains a few examples [...]

[...] English-speaking users who pose this query, how many do we expect to be based in New Zealand, statistically speaking? Very few. Hence, the New Zealand Yellow Pages should not be in the top ten results. They are Not Relevant.

Consider now a broad informational query: [information on law school programs]. The query is clearly placed with an expectation of a broad resource. Technically, information on the sports law program [...] the query – it does describe a law school program. However, it is too narrow on several levels – it is an isolated page from one law school covering one specific, esoteric program. It is Not Relevant.

To provide some concrete meaning to the Not Relevant category as applicable to informational queries, think of someone who would want to do extremely thorough, exhaustive research on the topic of the query, [...]
[...] NOT be rated Offensive to the query [Anthony calleja].

IMPORTANT: Observe the query-matching words in the URL structure of the first three examples. Spam tactics such as these are another reason not to take URL addresses at face value, but to evaluate the actual pages.

Concluding Remarks

As you see, the rating task consists of: [...]

[...] If you come to the realization that the query could have been posted [...]

[...] Off Topic: none or very few Not Relevant results are conceivable. That is OK – we cannot say it often enough: go with your best judgment and do not worry too much about rationalizing every single rating decision. What we hope to instill via these Guidelines is a general understanding of the rating methodology. Once you internalize the criteria for placing results on the merit scale based on attributes [...]