Enterprise Search
Martin White

Enterprise Search
by Martin White

Copyright © 2013 Martin White. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Simon St. Laurent and Meghan Blanchette
Production Editor: Christopher Hearse
Proofreader: Christopher Hearse
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Kara Ebrahim

December 2012: First Edition

Revision History for the First Edition:
2012-11-21 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449330446 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Enterprise Search, the image of a Purple Martin, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-33044-6
[LSI]

Table of Contents
Preface

1 Searching the Enterprise
  Every Day Is a Decision Day
  Information as a Corporate Asset
  The Information Paradox
  Enterprise Search
  Search and Information Retrieval
  Search Is a Dialog
  Search Has to Be Managed
  Why Search Is Important
  Summary
  Further Reading

2 Enterprise Search Is Difficult
  A Day at the Office
  There Are 3,245 Results
  There Are Results
  There Are 230 Results
  There Are 400 Results
  There Are 425 Results
  There Are 390 Results
  You Think It’s All the Relevant Information
  A Short History of Search
  A Short History of Information Retrieval
  Recall, Precision, and Relevance
  Why Can’t Our Search Be Like Google?
  With Web Search You Have Options
  Information Quality
  Poor Titles
  No Author Information
  Metadata
  Ambiguous Date Formats
  Document Structure
  Language
  Summary
  Further Reading

3 Defining User Requirements
  Information Seeking Models
  Another Search Engine! Why?
  User Requirements and User Satisfaction
  Climate Surveys
  Diaries
  Focus Groups
  Help Desk Calls
  Microsoft Product Description Cards
  Personas
  Team Meetings
  Usability Tests
  Use Cases
  Analysis
  Compliance
  Expertise
  Induction
  Item
  Learning
  Mobile
  Monitor
  Product
  Task
  User Interviews
  User Surveys
  Search Benchmarking
  Search Logs
  Stories
  User Feedback
  Writing the User Requirements Report
  Summary
  Further Reading

4 Planning for Search
  Making a Business Case
  Invest in Skills Before Software
  Search Support Team
  Stakeholder Analysis
  Business Impact
  Search Owner
  Content Owner
  Scope
  Document Size and File Formats
  Metadata Management
  Language
  Security
  Technology
  Infrastructure
  Disaster Recovery
  Security
  Performance
  Metadata and Taxonomies
  Help Desk
  Usability
  Training and Support
  Risks
  Web Site Search
  Summary
  Further Reading

5 Search Technology Part 1
  Content Gathering
  Connectors
  Document Filters and Language Identification
  Parsing and Tokenising
  Stop Words
  Stemming and Lemmatization
  Dates
  Phrases
  Processing Pipeline
  Building and Managing the Index
  Security and ACLs
  Query Management
  Spell Checking
  Retrieval Models
  Ranking
  Summarization
  Document Thumbnails
  Summary
  Further Reading

6 Search Technology Part 2
  Entity Extraction
  People Search
  Federated Search
  Duplicate and Similar Documents
  Mobile Search
  Faceted Search
  Multilingual Search
  Search-Based Applications
  Semantic Search
  Social Search
  Text Mining and Sentiment Analysis
  Summary
  Further Reading

7 The Business of Search
  Industry Structure
  Dassault
  HP
  IBM
  Lexmark
  Oracle
  Independent Search Vendors
  Open Source Search Software
  Google and Search Appliances
  Microsoft SharePoint
  Specialized Search Components
  Cloud-Based Search
  OEM Applications
  Systems Integrators
  e-Discovery
  Summary
  Further Reading

8 Specification and Selection
  The Project Teams
  Specification Project Team
  Selection Project Team
  Installation Project Team
  Project Programme Office
  The Global Dimension
  Risk Management
  Project Schedule
  Writing the Specification
  The Story So Far
  Content Scope
  User Expectations
  Information Systems Architecture
  IT Partnerships
  Internal Development and Support Resources
  Security and Identity Management
  Federated Search Requirements
  People Databases
  Project Timetable
  Functional Specification
  Connectors and APIs
  Federated Search
  User Interfaces
  Index Freshness
  Filters and Facets
  Taxonomy and Metadata Management
  Search and System Logs
  Entity Extraction
  Questions for the Vendors
  Risk Assessment
  Project Schedule
  Project Management Methodology
  Upgrade Release Schedule
  Supporting a Global Implementation
  User Groups
  Key Employee Strategy
  License and Support Costs
  Reference Sites
  Training
  Building the Vendor Short List
  Using a Consultant
  Using an Implementation Partner
  Open Source Software Procurement
  The Best of Both Worlds?
  Proof of Concept
  Contract Negotiation
  Summary
  Further Reading

9 Installation and Implementation
  Project Management
  Customer Responsibilities
  Implementation Schedule
  Knowledge Transfer
  The Show Stoppers
  Get Indexing!
  User Interface Design
  Usability and Accessibility Testing
  Disaster Recovery Tests
  Help Desk
  Metadata Management
  Communications Plan
  Summary
  Further Reading

10 Managing Search
  Search Support Team Roles
  Search Manager
  Search Technology Manager
  Search Analytics Manager
  Search Information Specialist
  Search User Support Manager
  Supporting Global Enterprise Search
  Creating a Centre of Search Excellence
  Search Team Skills
  Introduction to Information Retrieval
  Indexing
  Retrieval and Ranking
  User Interaction and Interface Design
  Evaluation of IR Systems
  Web Search
  Enterprise Search
  Help Desk Management
  Security and Compliance
  Search Liaison Specialists
  Reporting Lines

Professional Microsoft Search, Mark Bennett, Jeff Fried, Miles Kehoe, and Natalya Voskresenskaya (2011) Wrox
Prospects of Mobile Search, Jose Luis Gomez-Barroso et al. (2010) Institute for Prospective Technological Studies, Joint Research Centre, European Commission
Search Analytics For Your Site, Louis Rosenfeld (2011) Rosenfeld Media
Search-Based Applications: At the Confluence of Search and Database Technologies, Gregory Grefenstette and Laura Wilber (2010) Morgan and Claypool Publishers
Search Engines: Information Retrieval in Practice, W. Bruce Croft, Donald Metzler, and Trevor Strohman (2010) Addison Wesley
Search Patterns, Peter Morville and Jeffery Callender (2010) O’Reilly Publishing
Search User Interfaces, Marti A. Hearst (2009) Cambridge University Press
Search User Interfaces Design, Max Wilson (2012) Morgan and Claypool Publishers
Semantic Software Technologies, Lynda Moulton (2010) Gilbane Group
Successful Enterprise Search Management, Stephen Arnold and Martin White (2010) Galatea Publishing
Teaching and Learning in Information Retrieval, E. Efthimiadis,
J. M. Fernández-Luna, J. F. Huete, and A. MacFarlane (2011) Springer
Text Mining: Advanced Approaches in Analyzing Unstructured Data, Ronen Feldman and James Sanger (2006) Cambridge University Press
Text Mining: Applications and Theory, Michael W. Berry and Jacob Kogan (Editors) (2010) Wiley-Blackwell
Understanding Information Retrieval Systems: Management, Types, and Standards, Marcia J. Bates (Editor) (2011) Auerbach Publications
Working with Microsoft FAST Search Server 2010 for SharePoint, Mikael Svenson, Marcus Johansson, and Robert Piddocke (2012) Microsoft Press

Enterprise Search - Blogs

This page is a list of blogs which track developments in enterprise search technology and implementation. Some of these blogs are published by search vendors. Others (such as the Real Story Group blog) cover a range of topics including enterprise search. This list was checked in September 2012.

Attensity Blog
Basis Technology
Beyond Search (Stephen Arnold)
Concept Searching
Coveo Insights
Darwin Awareness Engine
Do More With Search (BA Insight)
Enterprise Search (New Idea Engineering)
Enterprise Search (Lynda Moulton)
Exalead
Flax
Findability (Findwise)
Google Enterprise
Information Interaction (Tony Russell-Rose)
Information Optimized (Vivisimo - discontinued from 09/2012 but the archive remains live)
Mindbreeze
Noisy Channel (Daniel Tunkelang)
Perfect Search
Polyspot Blog
Real Story Group Blog
Searchblox
Search Chronicles (Paul Nelson, Search Technologies)
Search Hub Lucid Works
Sematext Blog
Sinequa’s Blog
SmartLogic Journal
The Core Perspective Recommind
Unified Information Access (Attivio)

There is also a very good LinkedIn Enterprise Search Engine Professionals Group. Membership is by application.

Further Reading

Here you’ll find a list of further readings recommended for the following chapters.

Chapter Readings
Unlocking the Value of the Information Economy, (2011) Harvard Business
Review Analytic Services
“The Post-Relational Reality Sets In”, (2011) Survey on Unstructured Data, MarkLogic
Mind the Enterprise Search Gap, (2011) SmartLogic
“Enterprise Search and Findability Survey”, (2012) Findwise
From Overload to Impact: An Industry Scorecard to Big Data Business Challenges, (2012) Oracle

Chapter Readings
The Name Matching You Need, (2011) Basis Technologies

Chapter Readings
Rosenfeld Media specializes in books on determining user requirements and the development of user interfaces.
The User is Always Right, Steve Mulder with Ziv Yaar (2007) New Riders Publishing
Ad-hoc Personas and Empathetic Focus, Donald Norman

Chapter Readings
“Enterprise Search and Findability Survey”, (2012) Findwise
Digital Workplace Trends, NetStrategy/JMC

Chapter Readings
The Name Matching You Need, (2011) Basis Technologies
The Answer Machine, Susan Feldman (2012) Morgan and Claypool Publishers

Chapter Readings
New Landscape of Enterprise Search, Stephen Arnold (2011) Pandia Search Central
Enterprise Search, Real Story Group
Beyond Search – News and Information About Search and Content Processing

Chapter Readings
Enterprise Search, Real Story Group

Chapter Readings
Professional Microsoft Search, Mark Bennett, Jeff Fried, Miles Kehoe, and Natalya Voskresenskaya (2010) Wrox
Working With Microsoft FAST Search Server 2010 for SharePoint, Mikael Svenson, Marcus Johansson, and Robert Piddocke (2012) Microsoft Press
Although these two books are specifically about SharePoint search implementation, they both illustrate the work involved in installation and implementation.

Chapter 10 Readings
Search Analytics for Your Site, Louis Rosenfeld (2010) Rosenfeld Media

Chapter 11 Readings
The Answer Machine, Susan Feldman (2012) Morgan and Claypool Publishers
Search-Based Applications,
Gregory Grefenstette and Laura Wilber (2011) Morgan and Claypool Publishers
Reshaping the Workforce With the New Analytics, Technology Forecast (2012) PwC
Semantic Software Technologies - Landscape of High Value Added Applications for the Enterprise, Lynda Moulton (2010) Gilbane Group
Second Strategic Workshop on Information Retrieval in Lorne, (2012) http://www.cs.rmit.edu.au/swirl12/

APPENDIX B
Vendor List

This table lists over 70 vendors. Not all are search software vendors. Also included are vendors providing a range of ancillary services and some of the many vendors providing data analysis and business intelligence software. In the case of major corporations, only the home page URL has been provided, as the pages for search products tend to be far from permanent. All links were verified in September 2012. No warranty is given by Intranet Focus Ltd or the author.

Company | Country | URL
Active Navigation | UK | http://www.activenav.com/
Alcove9 | USA | http://www.alcove9.com
Amazon | USA | http://a9.com/
Ankiro | Denmark | http://www.ankiro.com
Apache (Lucene/Solr) | Community | http://lucene.apache.org
Applied Relevance | USA | http://www.appliedrelevance.com
Attensity | USA | http://www.attensity.com
Attivio | USA | http://www.attivio.com
Autonomy (HP) | UK | http://www.autonomy.com
BA-Insight | USA | http://www.bainsight.com
Basis | USA | http://www.basistech.com
Brainware | USA | http://www.perceptivesoftware.com
Clearwell | USA | http://www.clearwellsystems.com/
Cognition Systems | USA | http://cognition.com
Commvault | USA | http://www.comvault.com
Concept Searching | UK | http://www.conceptsearching.com
Constellio | Canada | http://www.constellio.com
Coveo | Canada | http://www.conveo.com
Dieselpoint | USA | http://www.dieselpoint.com
Digital Reasoning | USA | http://digitalreasoning.com
dtSearch | USA | http://www.dtsearch.com
EMC | USA | http://www.emc.com
ElasticSearch | Netherlands | http://www.elasticsearch.com
Endeca | USA | http://www.oracle.com
Exalead | France | http://www.3ds.com/products/exalead
Exorbyte | USA | http://www.exorbyte.com
Expert System | Italy | http://www.expertsystem.net
Fabasoft | Austria | http://www.mindbreeze.com/en
Funnelback | Australia | http://www.funnelback.com
Google | USA | http://www.google.com/enterprise/search/products_gsa.html
IBM | USA | http://www.ibm.com
Inbenta | Spain | http://www.inbenta.com
Infosys | USA | http://www.infosys.com
InQuira | USA | http://www.oracle.com
Intelligenx | USA | http://www.intelligenx.com
Intrafind | Germany | http://www.intrafind.de
ISYS | Australia | www.perceptivesoftware.com
Karmasphere | USA | http://www.karmasphere.com
Lexalytics | USA | http://www.lexalytics.com
LTU | France | http://www.ltutech.com
Lucid Works | USA | http://www.lucidworks.com
Mark Logic | USA | http://www.marklogic.com
MaxxCAT | USA | http://www.maxxcat.com
Microsoft | USA | http://www.microsoft.com
Omniture | USA | http://www.omniture.com
OpenSearchServer | France | http://www.open-search-server.com
OpenText | Canada | http://www.opentext.com
Oracle | USA | http://www.oracle.com
Perfect Search | USA | http://www.perfectsearchcorp.com
Polyspot | France | http://www.polyspot.com
Q-Sensei | USA | http://www.qsensei.com/
Recommind | USA | http://www.recommind.com
SAP | Germany | http://www.sap.com
SchemaLogic | USA | http://www.schemalogic.com
SearchBlox | USA | http://www.searchblox.com
SearchDaimon | Sweden | http://www.searchdaimon.com
Sematext | USA | http://sematext.com
Sinequa | France | http://www.sinequa.com
Smart Logic | UK | http://www.smartlogic.com
Sophia Systems | UK | http://www.sophiasearch.com
Sphinx | Community | http://sphinxsearch.com
Stored IQ | USA | http://www.storediq.com
SurfRay | Denmark | http://www.surfray.com
Synaptica | USA | http://www.synaptica.com
Temis | France | http://temis.com
Teragram | USA | http://www.teragram.com/oem
TeraText | USA | http://www.teratext.com
Terrier | UK | http://terrier.org
Thetus | USA | http://thetus.com
Thunderstone | USA | http://www.thunderstone.com
Vivisimo | USA | http://www.vivisimo.com
Wand | USA | http://wandinc.com
Xapian | Community | http://xapian.org
X1 Technologies | USA | http://www.x1.com
ZyLab | USA | http://zylab.com

Search integration specialists

All links were verified in September 2012. No warranty is given by Intranet Focus Ltd or the author.

Capax Global | USA | http://www.capaxglobal.com
Comperio | Norway | http://www.comperiosearch.com
Enterprise Data Fusion | USA | http://www.edatafusion.com/
Findwise | Sweden | http://www.findwise.com
Flax | UK | http://www.flax.co.uk
New Idea Engineering | USA | http://www.ideaeng.com
Raytion | Germany | http://www.raytion.com
Search Technologies | USA | http://www.searchtechnologies.com
Tieto | Finland | http://www.tieto.com
TNR Global | USA | http://tnrsearch.com

Glossary

Adjacent result
A result that is comparable or analogous to a searched term; often produced by “Find more like this” searches.

Absolute boosting
Ensuring that a specified document always appears at the same point in a results set, or always appears on the first page of results.

Access Control List (ACL)
Defines permissions to access a specific repository, a set of documents, or a section of a document.

Ambiguity
A search involving one word with many different meanings, or a search for an object that can be described in many different ways.

Appliance
A search application pre-installed on a server ready for insertion into a standard server rack.

Approximate Pattern Matching
A process in which an algorithm determines the similarity between items, for example in spell checking.

Automatic Indexing
An entirely automated process of converting information into an index.

Auto-categorization
An automated process for creating a classification system (or taxonomy) from a collection of nominally related documents.

Auto-classification
An automated process for assigning metadata or index values to documents, usually in conjunction with an existing taxonomy.

Average response time
An average of the time taken for the search engine to respond to a query, or the average end-to-end time of a query.

Bayesian Inference or Bayesian Statistics
A probability technique based on the work of Thomas Bayes (1702-1761) and used to determine the relevancy of a given document against a particular query.

BigTable
A highly scalable database technology that is proprietary to Google.

Boolean Operators
A widely used approach to creating search queries. Examples include AND, OR, and NOT, e.g., John AND Smith.

Boolean Search
A search query using Boolean operators.

Boosting
The changing of a parameter of a search to ensure that a certain object or objects appear in the results.

Case-Based Reasoning
A technology that allows a system to “learn” by gathering past instances into a “case base” that it can use to solve future problems.

Categorization
The placing of boundaries around objects that share similarities, e.g., taxonomy.

Clustering
A process employed to generate groupings of related words by identifying patterns in a document index.

Collection
A group of objects methodically sorted and placed into a category.

Computational linguistics
The use of computer-based statistical analysis of language to determine patterns and rules that aid semantic understanding.

Concept extraction
The process of determining concepts from text using linguistic analysis.

Connector
A software application that enables a search application to index content in another application.

Controlled Vocabulary
An organized list of words, phrases, or some other set employed to identify and retrieve documents.

Corpus
A collection of objects with a defined scope, e.g., all annual reports.

COTS
Commercial Off-The-Shelf software.

Crawler
A program used to index documents. See Also
Spider

Description
A brief statement in a document that effectively summarizes the meaning of a document, often employed to annotate search results. See Also Key Sentence.

Document
A structured sequence of text information, but often used as a generic description of any content item in a search application.

Document processing
The deconstruction of a document into a form that can be tokenised and indexed.

Document Repository
A site where source documents or other content objects are stored, generally a folder or folders. See Also Information Source.

Early binding
A search is conducted only across documents that a user has permission to access. See Also Late binding.

Entity extraction
The automatic detection of defined items in a document, such as dates, times, locations, names, and acronyms.

Exact Match
Two or more words considered mutually inclusive in a search, often by enclosing them in quotation marks, e.g., “United Nations.”

Extract-Transform-Load
The process of migrating content between databases when undertaken by a single specialised software application.

Facet
Presentation of topic categories on the search user interface to support the refinement of a search query.

Fallout
A quantity representing the percentage of irrelevant hits retrieved in a search.

Federated search
A search carried out across multiple repositories and/or applications.

Field Query
A search that is limited to a specific field in a document, e.g., a title or date.

Filter
A function that sets specific criteria for search results.

Free Text Query
A search enabling a user to input words in any form, without following any query language criteria.

Freshness
The time period between a document being crawled and the index being updated so that a user will be able to find the document.

Fuzzy Search
A search allowing a degree of flexibility for generating hits, i.e., matches that are phonetically or typographically similar.

Golden Set
A set of documents and other content that is representative of content that will be searched on a regular basis and that can be used to benchmark search performance.

Guided Search
A search in which the system prompts the user for information that will refine the search results.

Hit
A search result matching given criteria.

Index
A list containing data and/or metadata indicating the identity and location of a given file or document.

Index File
A file that stores data in a format capable of retrieval by a search engine.

Indexer (automatic)
A program that collects data on a given set of files or documents and provides results for a user search.

Indexer (human)
A person who assigns metadata to a given set of files or documents and makes results available for a user search.

Information Source
The location of indexed documents. See Also Document Repository.

Ingestion Rate
The rate at which documents can be indexed, usually specified in Mb/sec.

Inverse Document Frequency (IDF)
A measure of the rarity of a given term in a file or document collection.

Inverted File
A list of the words contained within a set of documents, and which document each word is present in.

Inverted Index
An index whose entries identify a given word and the documents in which it appears.

Iterative Calculation
A calculation utilizing a recursive and self-referential algorithm.

Key Sentence
A brief statement that effectively summarizes a document, often employed to annotate search results.

Keyword
A word used in a query to search for documents.

Keyword Search
A search that compares an inputted word against an index and returns matching results.

Keyword Targeting
A process which helps to ensure the inclusion of given web sites in a search for a specific object.

Knowledge Extraction
The procurement of metadata from a given set of objects.

Late binding
A search carried out across a
complete repository of documents, with a check on access permissions carried out immediately before the presentation of the document to the user. See Also Early binding.

Lemmatization
A process that identifies the root form of words contained within a given document based on grammatical analysis (e.g., run from running). See Also Stemming.

Lexical Analysis
An analysis that reduces a text to a set of discrete words, sentences, and paragraphs.

Linguistics
The study of the structure, use, and development of language.

Linguistic Indexing
The classification of a set of words into grammatical classes, such as noun or verb.

Meta Search Engine
A class of search engine that generally answers user queries by retrieving information from other search engines.

Meta Tag
An HTML element located within the header of a web page that supplies additional or referential data not present on the page itself.

Metadata
Data about data.

Morphologic analysis
The analysis of the structure of language.

Natural Language Processing
A process that identifies content by attempting to adhere to the rules of a given language.

Natural Language Query
A search input entered using conventional language, e.g., a sentence.

Parametric Search
A search that adheres to predefined attributes present within a given data source.

Parsing
The process of analysing text to determine its semantic structure.

Pattern Matching
Recognizes naturally occurring patterns (word usage, frequency of use, etc.) within a document.

Phrase Extraction
The procurement of linguistic concepts, generally phrases, from a given document.

Precision
The quantification of the number of correct documents returned in a given search.

Probabilistic Method
A method that utilizes user-supplied information to determine the probability that a given word appears in a document.

Proximity Searching
A search whose results are returned based on the proximity of given words.

Query by Example
A search in which a previously returned result is used to obtain similar results.

Query Performance
A measure of performance based on the speed with which a system can receive a query and return results.

Query transformation
The process of analyzing the semantic structure of a query prior to processing in order to improve search performance.

Ranking
A value assigned to a specific result returned for a query. The first item listed has a ranking of 1, the second has a ranking of 2, etc.

Recall
A percentage representing the relationship between correct results generated by a query and the total number of correct results within an index.

Relevance
The value that a user places on a specific document or item of information.

Relevance Ranking
See Ranking.

Search Results
The documents or data that are returned from a search.

Search Terms
The terms used within a search field.

Semantic Analysis
An analysis based upon grammatical or syntactical constraints that attempts to decipher information contained in a document.

Sentiment analysis
The use of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in documents.

Soundex Search
A search in which the user receives results that are phonetically similar to their query.

Spider
An automated process that provides documents to a data extraction or parsing engine. See Also Crawler.

Statistical Indexing
Probabilistic methods relying on mathematics, not “linguistics.” See Bayesian Inference.

Stemming
A process based on a set of heuristic rules that identifies the root form of words contained within a given document (e.g., run from running). See Also Lemmatization.

Stop words
Words that are deemed to have no value in an index.

Structured Data
Data that can be represented according to specific descriptive parameters, e.g., rows and columns in a relational database, or hierarchical nodes in an XML document or fragment.

Summarization
An automated process for producing a short summary of a document and presenting it in the list of results.

Synonym expansion
Automatically expanding a search by adding synonyms of the query terms derived from a thesaurus.

Syntactic Analysis
An analysis capable of associating a word with its respective part of speech by determining its context in a given statement.

Taxonomy
In respect to search, the broad categorization of objects (typically a tree structure of classifications for a given set of objects) in order to make them easier to retrieve and possibly sort.

Term Frequency
A quantity representing how often a term appears in a document.

Thesaurus
A collection of words in a cross-reference system that refers to multiple taxonomies and provides a kind of meta-classification, thereby facilitating document retrieval.

TREC
Text Retrieval Conference, a conference held by the National Institute of Standards and Technology in which participants search a collection of documents and present results on various search metrics.

Tokenising
The process of identifying the elements of a sentence, such as phrases, words, abbreviations, and symbols, prior to the creation of an index.

Truncation
The removal of a prefix or suffix.

Unstructured Information
Information that is without document or data structure (i.e., cannot be effectively decomposed into constituent elements or chunks for atomic storage and management).

Vector space
A model that enables documents to be ranked for relevance against a query by comparing an algebraic expression of a set of documents with that of the query.

Weight
A value applied to a given area of a search system, e.g., term weighting, which represents its importance with respect to other factors.

Wildcard
A notation, generally an asterisk or question mark, which, when used in a query, represents all possible characters, e.g., a search for boo* would return book, boom, boot, etc.

Word Exclusion and Stop Lists
A list containing words that will not be indexed, usually words that are excessively common, e.g., a, an, the, etc.

Word Proximity Analysis
An analysis that measures the distance between searched words in a document.

About the Author

Martin White is an information management consultant specialising in enterprise search assignments. He established Intranet Focus Ltd in 1999 and has been a Visiting Professor at the iSchool, University of Sheffield since 2002. He is a Fellow of the Royal Society of Chemistry and a Member of the Association for Computing Machinery.