
DIRS: Disconnected Information Retrieval System

Gregory Tracy (gtracy@cs.wisc.edu) and Patrick Votruba (votruba@cs.wisc.edu)
Department of Computer Sciences, University of Wisconsin, Madison
1210 West Dayton Street, Madison, WI 53706

Abstract

The World Wide Web gives individuals access to huge amounts of data. This includes access to information also found in traditional formats such as news copy. This study addresses a desire to blend these two mediums in such a way that media consumers can move transparently from a hard copy of a given article to an electronic copy. Document retrieval experiments were performed in an attempt to determine the feasibility of implementing a handheld scanning device used to mark traditional newspaper articles for subsequent online retrieval. Several thousand random articles were fetched from two popular news search services to emulate the scanning of print media also available online. Experiments were performed on these articles to quantify the success of searching with various article attributes. Query success is quantified by measuring whether or not the article is found, and how deep into the query results we must parse to locate the correct article. When searching on the title of a news article, it was retrieved correctly 98% of the time with an average depth of one. When searching for an article based on a randomly chosen, 30-character string, 92% of the articles were retrieved successfully with an average depth of two.

1 Introduction

The World Wide Web provides individuals with access to a tremendous amount of information. Although there are web search services that do a terrific job of locating specific items of interest, it can still be a difficult and time-consuming process. In addition, it requires that a person be connected to the network. Given the fact that lifestyles have become more hectic, there exists a market for products and services to help manage this wealth of information. The information age has already spawned the use of cellular telephones and personal digital assistants (PDAs) to manage our daily lives. However, more can be done to bridge the divide between traditional sources of information and their more recent electronic counterparts.

People's daily routines will cause them to shift from the connected world to the disconnected world and back again. There is a need for a service that can make this transition easier; a service which makes the disconnected moments feel more connected and the transitions back to the connected world less cumbersome. For instance, our quest for information does not stop when we step away from our computer. Walking down the street, we can be drawn to various data points that we may want to do more research on. A flyer on a telephone pole may advertise a concert we wish to attend or learn more about. A certain dish on a Thai restaurant menu may tease our curiosity about recipes for home. A visit to Home Depot may generate some home improvement thoughts. These moments of curiosity can be captured in many different ways, including a simple scrap of paper. But there is a need to make a transparent transition from these disconnected moments to a connected world without the manual transfer of thoughts from a notation device to an online data query.

Perhaps the best example of consuming information while disconnected is print media. Now that the vast majority of print media is available online via the World Wide Web, it should be possible to make note of an article printed in a periodical and easily find the corresponding digital version via one of the several Internet search engines. For instance, if someone finds a newspaper article of interest while waiting in an airport or a doctor's office, but doesn't have time to finish it, they could later use a search engine to retrieve the web-based version of the article. This scenario raises two primary questions: 1) What sort of information needs to be captured from the article so that it can be fed to a search engine for a reliable, automated retrieval? 2) Is there an electronic device capable of capturing the relevant information? This paper presents the results of experiments attempting to answer the first question while speculating on the feasibility of using a pen-sized scanning device to address the second question.

2 Background

The inspiration for this work was a Sony product called the eMarker [7]. The now discontinued eMarker was a small, inexpensive device that allowed users to "bookmark" songs they heard on the radio by pressing the lone button on the device. Users could later connect the device to their personal computer (PC) via a Universal Serial Bus (USB) connection. The eMarker retrieved song information by recording the time that the user pressed the device button. It then used a separate service, Broadcast Data Systems (BDS), which provides the play lists for over 1,000 radio stations. A user provides a set of favorite radio stations as part of the eMarker account preferences, which allows a radio station's play list to be easily searched using the time recorded by the device. This allows retrieval of a song's primary attributes such as the song title, artist, and album that was playing at the time the user set the bookmark.

The eMarker was innovative in many ways. It relied on a simple device that could be purchased for a low cost ($19.95) and that interfaced with existing technology (personal computers with web browsers) via a standardized interface (USB). Furthermore, the eMarker's reliance on a pre-existing service (BDS), whose primary source of revenue was already well established, minimized the overhead needed to provide the necessary services to its customers.

We propose that producing an analogous device for marking print material for later retrieval via an Internet search engine would be relatively straightforward. The device used to capture information could be a pen-sized scanning device, and the information from the bookmarked articles could then be fed to a web-based search engine. In the next section, we present an approach using the Yahoo! News search engine [8]. Yahoo! News is a specialized search engine designed for retrieving news articles. The advantage of using a specialized news search engine is that it is updated with wire articles from the Associated Press and Reuters news services as soon as they are released; these articles are typically archived by Yahoo! News for two weeks. Yahoo! News also archives articles from other news sources such as The New York Times, Business Week, and USA Today for about one month. Although the number of publishers is limited, it is our belief that all print media will be available online in the not too distant future.

3 Experimental Approach

The approach of our feasibility study parallels that of an actual implementation for a prospective retrieval service. If this service were to be successful, all manual data entry would need to be eliminated from the process of retrieving articles. Furthermore, the intensity and volume of the large experiments in this study make manual data entry prohibitive. To automate the process of performing web searches, we created Perl scripts that perform two primary tasks: 1) collect large numbers of news articles stored on the web to be used as test cases, and 2) generate query strings based on the attributes of the articles collected, feed them into an online search engine, and check the results to determine whether the search successfully found the respective article.

3.1 Data Collection

In order to evaluate the success of searching with various search criteria, we first needed to collect a set of online articles. As we later determined, it was important to collect articles repeatedly to have a "fresh" bank of articles to test against. To simplify this process we developed a Perl script, fetchit, which takes a generic search topic as an argument and produces all of the online articles that match the search string. Under the hood of fetchit is an HTTP interface to the Yahoo! News web portal.

Rather than focus on a particular publisher, we did not discriminate, and chose instead to treat all articles as if they came from print media and thus could be scanned with a prospective pen-scanning device. In terms of the experiments we performed, the source of the article is not as important as the fact that Yahoo! News or some other online search service is caching it. For example, one cannot locate an article online, no matter how much detail has been saved by a scanning device, if the search engine being used does not crawl the site containing the article. As stated earlier, it is assumed that sometime soon, all print material will be available online and thus searchable by a news search engine such as Yahoo! News. Based on this argument, if Yahoo! News has cached an article, we assume that there is a print version that could be scanned. Although we limited ourselves to the publishers cached by Yahoo! News, this is not an architectural issue associated with the prospective scanning device itself. As we show later, we were able to use Excite's news search engine to run similar experiments. When implemented, the searching can be extended to any number of online sources, including a publisher's own content site.

    URL: http://dailynews.yahoo.com/h/ap/20011209/sp/bba_red_sox_coaches_1.html
    TITLE: Red Sox Talking to Herzog, Alou
    AUTHOR: Jimmy Golen
    DATE: December 09
    PUBLISHER: AP
    TEXT: By JIMMY GOLEN, AP Sports Writer BOSTON (AP) - The Red Sox have approached former major league managers Whitey Herzog and Felipe Alou about serving as bench coaches alongside inexperienced Boston manager Joe Kerrigan...

Figure 1: One example of a meta-data file created by the fetchit script. Each line begins with a tag identifying an article attribute. In this example, the "TEXT" section has been cut short for space considerations.

Each article retrieved from an online source is actually a content-rich web page with the article embedded within it. In order to simplify the experiments, each web page is parsed as the fetchit script reads it in. The HTML code is broken down into the pieces that make up a print version and fall under the following categories:

• Title
• Author
• Publisher
• Date
• Text body of the article

This information is stored in a unique file along with the URL where the article resides online. Collectively, this information makes up the meta-data our experiments can use to create search queries as well as to determine whether the search was successful or not. An example of a meta-data file is shown in Figure 1.

When articles were collected and put into this meta-data format, a wide range of topics was chosen. From "Rudy Giuliani" to "Green Bay Packers football", anywhere from 50 to 200 articles were retrieved for each topic, totaling between 1500 and 2000 articles. With this process automated, we were able to focus our attention on the feasibility experiments.

3.2 Article Searching

Once we were able to collect large sets of news articles, it was much easier to generate scenarios to understand and measure the success or failure of searches that could be performed by users of this proposed service. The search mechanism is similar in nature to fetchit. We used a Perl script, matchit, which performed three tasks:

1) Determine search criteria: This is intended to simulate the choices a user might have when scanning print media. The search criterion is a combination of the article attributes stored in each meta-data file. By selecting one attribute, such as the article title, or a set of attributes, we are able to experiment with various combinations and quantify their success in finding the correct article. For example, we may set up an experiment in which the search criterion is made up of the article's title and author. These two attributes are extracted from a meta-data file and used to construct a web search query. If the criterion includes a portion of the text body, we also specify the number of characters from the text to be used. As will be explained later in Section 4.2, simply specifying the number of consecutive characters was not enough information to produce successful search strings.

2) Apply search criteria: Once the search criteria have been determined for an experiment, they are applied to every article in our test bank. The matchit script reads in a meta-data file corresponding to a single article, extracts the data associated with the search criteria, concatenates a search string, and sends an HTTP request to an online news search service.

3) Parse search result: When the search engine returns its results, we break down the HTML code and examine only the URLs that pertain to actual search results. We compare these URLs against the article URL we store in the respective meta-data file, and look for a match.

The success of the search is quantified in two ways: 1) Did we find the article? 2) If it was found, how deep into the search results did we have to comb before we found the link to the article? We refer to the first metric as the hit rate. For example, if nine out of ten articles are successfully found in a given experiment, the hit rate is 90%. We refer to the second metric as the URL depth. If a search returns ten URLs, and we find the matching URL five links deep into the results list, the URL depth is five. This value is averaged over all article searches for a given experiment.

The matchit script repeats steps two and three for each article (meta-data file) we have collected. In the case of searches using the text body as criteria, we loop over the set of articles five times and take the average of the results. This is due to the fact that when selecting text body for a query, we seek to a random location in the file to use as a starting point. Averaging over several runs allows us to detect and avoid any abnormalities produced by the random function. As it turned out, there was very little variance amongst the five loop iterations.

4 Results

Given the ease with which we are able to set up experiments and evaluate results with the matchit script, we were able to focus our attention on the breadth of scenarios to test. What are some alternatives for recording information to give us the best opportunity for retrieving an article later? It's a very basic question, and one that we used to direct our experiments.

All experiments were performed on Linux PCs in the Computer Sciences Department at the University of Wisconsin, Madison. The only apparent bottleneck of the scripts is the latency of the search engine servers and the Internet itself. It took between one and two seconds on average to successfully find an article we were targeting. Complete experiments typically ran for around thirty minutes.

4.1 Searching On Title

Perhaps the most unique characteristic of an article is its title. If one were able to scan the entire title of an article before walking away from the waiting room, the chances of finishing that article later are remarkably good. As shown in the first row of Table 1, the correct article will be found 98% of the time. Furthermore, when the article is found, the correct URL will be in one of the first two returned by the search. This experiment was set up using the string found after the TITLE tag in each article's meta-data file. The exact title string was submitted to the search engine surrounded by quotation marks. If the quotation marks are removed, the success rate falls only slightly (less than 2%) but the URL depth suffers more significantly: it grows by a factor of four. While four doesn't seem like a huge number, that's four more articles (on average) that a user may have to visit before they find the one they intended to find.

    Criteria           Hit Rate   URL Depth
    title              98%        1.1
    title/author       61%        1.1
    title/publisher    49%        1.1
    title/date         1%         4.6

Table 1: The hit rate and average URL depth for queries using Yahoo! News with a combination of article attributes.

Table 1 also lists the results of our experiments when we used various combinations of search criteria that included the title string. In these cases, each article attribute was surrounded by quotation marks and the combination was concatenated together. For example, given the article found in Figure 1, a search using title and author would produce the following query string¹:

    "Red Sox Talking to Herzog, Alou" +"Jimmy Golen"

The results are mixed, but adding any other criteria to the title search consistently makes the search results worse. Interestingly, however, the URL depth is very consistent. If the article is in fact found, we will usually find it within the first two articles returned by the search engine. This is investigated in more detail in Section 4.4.

Date experiments can be improved using a special interface to Yahoo! News. Currently, when the date is used in the search criteria, it gets passed to the search engine as if it were just another text string. It is clear, however, that these dates are not indexed by the search engine as other words in the article are. Yahoo! News has added an interface which allows a query to specify a date (or range of dates) the article can be found on. Although this is appealing when using Yahoo! News, the solution does not port well to other services and thus is not studied in great detail in this paper.

¹ The format of these strings is a subtlety of each search engine. Although we experimented with different combinations that were successful with Yahoo! News, the format may not yield similar results with other search engines. In fact, using any form of quotation marks with Excite's News produces a hit rate of 0%.

4.2 Searching On Text

Users may not always have the option of scanning an article title. Font sizes, graphics, and page layout may prevent this type of operation. One other unique characteristic of an article is a string of consecutive characters from within the article body. It is not uncommon, however, for an arbitrary string of characters to be found in multiple articles. Naturally, the longer the string is, the more likely it is unique to one article. In an attempt to understand these properties, we conducted a set of experiments that searched for articles using a variable number of consecutive characters from the text body of the article. The string is chosen through the use of the TEXT tag in the meta-data file. A random offset into the text body is chosen for each search. We seek to that location and then slide forward if necessary to find the first word boundary². The end point of the string is the start offset plus the specified number of characters, plus a variable number of characters if necessary to get to the first word boundary beyond that number.

² It was discovered very early in the experiments that it is imperative to send search queries that fall on word boundaries. Even though a set of strings may be unique to an article, the search engines are indexing words and not random character strings.

The number of characters we experimented with ranged from five to fifty, and the results are reported in Figure 2. The hit rate very quickly ramps up to 90% as the number of characters used to search grows from five to fifteen. In fact, based solely on the hit rate, there is little motivation to use more than fifteen characters from the article body to create a search query. As shown by Figure 2, the hit rate hits a plateau and sustains approximately a 92% success rate up to fifty-character search strings. We did not test strings longer than fifty characters.

This experiment emphasizes the importance of our second quantitative metric for success, the URL depth. As more characters are added to the search string, the range of our search depth decreases and approaches one. As was stated earlier, this is a very practical measurement of the search results. As the uniqueness of our search string grows, the number of articles returned by the query shrinks and a user is able to locate the correct article more quickly. Figure 3 isolates the average URL depth measurements. It illustrates that by jumping from fifteen to thirty characters, the search depth improves by a factor of three.

Figure 2: The average URL depth and average hit rate for Yahoo! News searches performed using a variable number of characters from the article's text body.

Figure 3: The average URL depth for Yahoo! News searches using a variable number of characters from the article's text body (blown up from Figure 2).

4.3 Effects Of Time

When the search criterion is set to title, the results are not always stellar. Other than in the examples shown in Table 1, the results also degrade as the data ages. In an attempt to simulate this behavior, we collected sets of articles (and the corresponding meta-data files) each day for seven days. On the eighth day, we began running the title searches on the file sets for each day. Figure 4 illustrates the deterioration of the hit rate as the data ages. The primary reason for this behavior is that Yahoo! News stops caching some articles after a specified period of time. Although this isn't happening within a seven-day span, this does show that on any given day, a person may pick up an article in print, and potentially be limited by the number of days they could successfully search for the article.

Figure 4: The hit rate for a Yahoo! News search using article titles. The results progressively get worse as the titles become older.

4.4 Analyzing Misses

When an article title was combined with other attributes, we were surprised to see the dramatic decrease in hit rate. For instance, when an article's author(s) is combined with its title, the hit rate falls from 98% to 61%, as shown in Table 1. In the latter case, the 39% of the articles that constitute "misses" (where the article could not be found) were analyzed to try and understand, when there is a miss, why there is a miss.

It was discovered that the success of searches containing the author depends almost exclusively on the source of the article. The publishers responsible for the majority of the misses are Reuters, the New York Daily News, and the New York Post. Reuters was the publisher for 82% of all misses in the title/author search. In another experiment, in which strings from the text body were combined with the author, Reuters was the publisher of 78% of those articles that could not be found.

As for the 61% of the articles that could be found, 76% of those hits came from the AP. Some of the other publications that performed well with author searches included The Sporting News, The Daily Herald, Business Week, and the San Jose Mercury News. It is not clear what makes these publishers more "author friendly". We cannot find anything inherent in the HTML documents themselves that shows why in some cases the author is part of the article's text body and in others it is not. In each case, the author has clear markup so that it stands out when rendered by the browser. The answer lies somewhere in the interface between Yahoo! News and the publishers themselves.

5 Validation

In an effort to demonstrate that the results presented here extend beyond the use of the Yahoo! News engine, the hit rate experiment shown in Figure 2 was repeated with the Excite News search engine using Perl scripts analogous to fetchit and matchit. Figure 5 compares the Yahoo! News results with those generated using Excite's News service. The performance of Yahoo! News is clearly superior, but Excite News also produces good hit rates if a long enough search string is used. Despite the mixed results produced by Excite News, its high hit count with longer search strings leads us to believe that this model could be extended to use search engines other than Yahoo! News. Therefore, we assert that the implementation of this scheme does not rely solely on the service provided by Yahoo! News.

Figure 5: The hit rate for text string searches using both Yahoo! News and Excite News.

In addition to the automated experiments, we were able to construct a prototype service and generate experiments that resembled a real-world situation involving a handheld pen scanner. Using the Quick Link™ pen from Wizcom Technologies, a group of articles from a printed copy of The New York Times was used to test the success rate of various text scans. The matchit Perl script was adapted into a cgi-bin script to complete the real-world setting as well as to ease the implementation of the pen experiments. A web page on the department's web server was created which had a text input form and an action button that sent the text to the matchit cgi script. The Quick Link™ pen can be configured such that scans are transferred directly into the text box of a browser window³. Once the text is there, the action button is pressed and a list of potential web links is returned.

³ The Quick Link™ pen scanner can also be configured so that each individual scan makes up a unique "note file". All notes can be uploaded at the same time and posted to the cgi script individually.

Among the feasibility experiments run earlier, only the text-body experiment was duplicated with the pen scanner. This was due to two factors. First, the font sizes of the article titles in The New York Times were too large for the Quick Link™ scanner to capture. This prevented any experiments involving the article title. Second, the process is labor intensive, which is what led us to the earlier methodology in the first place. Therefore, we scanned portions of ten different articles. Each article had three non-overlapping text-body selections scanned and uploaded into the web browser displaying our cgi form. Each individual scan was one string that stretched an entire column of the article. This is a logical start point and stop point for a user trying to mark a particular article. The scans ranged in length from 27 to 29 characters (not including spaces), which resulted in a few words for each scan.

Among the ten articles scanned, five were successfully retrieved from the web. In each of these five cases, all three scans were successful with a URL depth of one. Overall, the 50% hit rate was disappointing. However, after more investigation it was discovered that Yahoo! News was not caching the articles that constituted misses. No matter what the search criteria may have been, those articles would not be found. This may sound disappointing, but this is just one of the available news services. There are other services, which come at some expense, that index nearly every publisher in the United States. Therefore, every article can be found given the right search engine with minimal query data. The contribution of this paper is that it demonstrates the characteristics and properties of accurate web searches given the limited query data common in a disconnected application.

6 Future Work

Future work in this area will need to address two primary questions. First, how can the services described here be integrated into existing hardware? Second, can this technology be applied to the retrieval of other forms of media besides print material?

6.1 Integrating Document Retrieval Services

One possible solution for the implementation of this document retrieval scheme is a pen-sized text-scanning device such as the Quick Link™ described in Section 5. A pen device has many desirable features. They are small, lightweight, and use standard communication interfaces to easily transfer data to and from PCs and PDAs. One drawback of pen devices is the relatively steep price when compared to the eMarker, which retailed for $20. Another alternative would be to incorporate small scanning devices into existing, well-established technologies such as PDAs and cellular telephones. Consumers may be more likely to use a service like this if it were available through devices that they are already likely to own. Another possible approach would be to use a dictation device that records whatever notations are spoken into it. Using voice recognition, article attributes such as title and author can be identified and sent to a search engine to retrieve the desired information.

6.2 Alternative Data Retrieval Applications

The experiments described in this paper addressed the retrieval of documents found in periodicals. Work should be done to see if a variation of this scheme could be used for the retrieval of other types of media such as music. The proliferation of online music has made it feasible to locate music titles and artists without the use of a time-stamped database like the one used by the eMarker. A "smart" eMarker device would rely on recorded music samples as a means to locate the song's album and artist. Using one of the popular music sharing services like AudioGalaxy or Gnutella, it could be possible either to search for a string of music notes, or to use voice recognition technologies to gather song lyrics and use them in a search query.

7 Conclusion

As more traditional media producers make their product available in electronic form, the distinction between print media and web media begins to blur. One of the benefits of this trend is the ability to locate information whenever a user is connected to the network. However, many media consumers still lack tools to help them easily manage personal notations as they transition from their disconnected lives into their connected ones. The waiting room of a doctor's office or a city bus ride will many times become a playground of information as people become exposed to new genres or topics which spark interest. Through the use of paper scraps, most people manage to mark information they would like to revisit later. This paper addresses an alternative approach to simplify the interaction between an individual and information on the web.

In this feasibility study we have shown that it is possible to ease the transition through the use of small text scanning devices already available commercially. The best results have been seen through the use of article titles as the primary search query. If one is able to scan the title, they will be able to retrieve the same article electronically over 90% of the time. If, while riding the bus, one is only able to find the last page of a torn newspaper article, we have shown that the prospect of retrieving that article in its entirety is still very promising. Yahoo! News had a 91% success rate when searching from a string of fifteen characters from an article. We have shown that such a service is viable, and the only roadblock we see in the adoption of this technology is the breadth at which search engines are able to crawl and cache traditional print media.

Acknowledgments

The authors thank David DeWitt for inspiring the ideas proposed in this work as well as for his helpful discussions concerning the direction of this project.

References

[1] Brin, S. and L. Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine". WWW-7, April 1998.
[2] Hawking, D., N. Craswell, P. Thistlewaite, and D. Harman. "Results and Challenges in Web Search Evaluation". WWW-8 Proceedings, 1999.
[3] Hu, W., Y. Chen, M. S. Schmalz, and G. X. Ritter. "An Overview of World Wide Web Search Technologies". In Proceedings of the 5th World Multi-Conference on Systemics, Cybernetics and Informatics, SCI 2001, pages 356-361, Orlando, Florida, July 22-25, 2001.
[4] Kobayashi, M. and K. Takeda. "Information Retrieval on the Web". ACM Computing Surveys, 32(2), pp. 144-173, 2000.
[5] Marchiori, M. "The Quest for Correct Information on the Web: Hyper Search Engines". Proceedings of the Sixth International World Wide Web Conference (WWW6), Santa Clara, CA, 1997.
[6] Mendelzon, A., G. Mihaila, and T. Milo. "Querying the World Wide Web". In Proceedings of PDIS'96, December 1996.
[7] Sony eMarker. http://www.emarker.com
[8] Yahoo! News. http://news.yahoo.com
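The original fetchit and matchit tools were Perl scripts that are not reproduced in the paper. As a rough illustration of how the meta-data format in Figure 1 could be read back in, here is a Python sketch; the function name, dict layout, and continuation-line handling are our own assumptions, not the authors' implementation.

```python
# Sketch: parse a fetchit-style meta-data file (format as in Figure 1)
# into a dict keyed by attribute tag. Hypothetical code, not the
# authors' Perl; tag names follow the figure.

TAGS = ("URL", "TITLE", "AUTHOR", "DATE", "PUBLISHER", "TEXT")

def parse_metadata(lines):
    """Each line starts with an uppercase tag followed by ': ' and the
    value; any line without a known tag is treated as a continuation
    of the previous field (e.g. a long TEXT body)."""
    record = {}
    current = None
    for line in lines:
        head, sep, rest = line.partition(": ")
        if sep and head in TAGS:
            current = head
            record[current] = rest.rstrip("\n")
        elif current is not None:
            record[current] += " " + line.strip()
    return record

sample = [
    "TITLE: Red Sox Talking to Herzog, Alou\n",
    "AUTHOR: Jimmy Golen\n",
    "PUBLISHER: AP\n",
    "TEXT: By JIMMY GOLEN, AP Sports Writer\n",
    "BOSTON (AP) - The Red Sox have approached former managers\n",
]
rec = parse_metadata(sample)
```

With this layout, `rec["AUTHOR"]` yields `"Jimmy Golen"` and the two TEXT lines are joined into one field.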
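The query format described in Section 4.1 (each attribute in quotation marks, secondary attributes prefixed with `+`) can be sketched as follows. This is an illustrative Python reconstruction of the behavior the paper describes for matchit, not the original Perl; the function name is ours.

```python
# Sketch: build a matchit-style query string from article attributes.
# Each selected attribute is surrounded by quotation marks; attributes
# after the first are prefixed with '+' (required-term syntax), matching
# the title/author example in the paper.

def build_query(attrs, criteria):
    quoted = ['"%s"' % attrs[c] for c in criteria]
    return " ".join([quoted[0]] + ["+" + q for q in quoted[1:]])

attrs = {
    "title": "Red Sox Talking to Herzog, Alou",
    "author": "Jimmy Golen",
}
q = build_query(attrs, ["title", "author"])
# q == '"Red Sox Talking to Herzog, Alou" +"Jimmy Golen"'
```

Before being sent in an HTTP GET request, such a string would still need URL encoding (e.g. with `urllib.parse.quote_plus`), and, as the paper's footnote warns, the quoting syntax is engine-specific: the same string produced a 0% hit rate on Excite News.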
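The word-boundary rule from Section 4.2 (seek to a random offset, slide forward to the next word start, then extend past the requested character count to the following word boundary) can be sketched like this. The function name and the exact boundary test are our own assumptions about a reasonable implementation.

```python
import random

def snap_to_words(text, offset, n):
    """Slide `offset` forward to the start of the next word, then take
    at least n characters, extending to the following word boundary
    (queries not on word boundaries fail: engines index words)."""
    start = offset
    while 0 < start < len(text) and not text[start - 1].isspace():
        start += 1  # slide forward until the previous char is a space
    end = min(start + n, len(text))
    while end < len(text) and not text[end].isspace():
        end += 1    # extend to the end of the current word
    return text[start:end]

def random_search_string(text, n, rng=random):
    # A random start offset, as in the paper's five-iteration averaging.
    return snap_to_words(text, rng.randrange(len(text)), n)

text = "The Red Sox have approached former major league managers"
```

For example, an offset landing inside "Red" slides forward to "Sox", and a 10-character request then extends to the end of "approached", yielding `"Sox have approached"`.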
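The two metrics defined in Section 3.2, hit rate and URL depth, can be computed as in the following Python sketch. The helper names are ours; the averaging of depth over found articles only reflects how the paper reports depth "if the article is in fact found".

```python
# Sketch: hit rate and average URL depth over a set of article searches.
# Hypothetical helpers illustrating the metrics, not the matchit script.

def score_results(result_urls, target_url):
    """Return (hit, depth): depth is the 1-based position of the
    target URL in the returned results, or None on a miss."""
    for depth, url in enumerate(result_urls, start=1):
        if url == target_url:
            return True, depth
    return False, None

def summarize(per_article):
    """per_article: list of (hit, depth) pairs, one per search."""
    depths = [d for hit, d in per_article if hit]
    hit_rate = len(depths) / len(per_article)
    avg_depth = sum(depths) / len(depths) if depths else None
    return hit_rate, avg_depth
```

So a run of four searches where three succeed at depths 1, 1, and 2 gives a hit rate of 75% and an average URL depth of about 1.33, mirroring the paper's example that nine hits out of ten searches is a 90% hit rate.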