DSpace at VNU: Argumentation-based schema matching for multiple digital libraries

Online Information Review

Article information:
To cite this document: Tho Thanh Quan, Xuan H Luong, Thanh C Nguyen, Hui Siu Cheung (2015), "Argumentation-based schema matching for multiple digital libraries", Online Information Review, Vol 39 Iss pp
Permanent link to this document: http://dx.doi.org/10.1108/OIR-02-2014-0023
Downloaded on: 27 December 2014, At: 12:11 (PT)
Downloaded by University of Reading At 12:11 27 December 2014 (PT)
To copy this document: permissions@emeraldinsight.com

Argumentation-based schema matching for multiple digital libraries

Tho T Quan*
Department of Software Engineering
Xuan H Luong and Thanh C Nguyen
Department of Computer Science
Ho Chi Minh City University of Technology
Ho Chi Minh City, Vietnam

Hui Siu Cheung
Department of Computer Engineering
Nanyang Technological University
Singapore

Acknowledgement
This work was supported by research project B0212-20-02TD funded by Vietnam National University – Ho Chi Minh City.

About the authors
*Quan Thanh Tho is an associate professor in the Faculty of Computer Science and Engineering at Ho Chi Minh City University of Technology (HCMUT), Vietnam. He received his BEng from HCMUT in 1998 and his PhD in 2006 from Nanyang Technological University. His current research interests include formal methods, program analysis/verification, the semantic web, machine learning/data mining and intelligent systems. Currently he heads the Department of Software Engineering at HCMUT and also serves as Chair of the Computer Science Programme (undergraduate level). Dr Quan is the corresponding author and may be contacted at qttho@cse.hcmut.edu.vn

Hui Siu Cheung is an associate professor in the School of Computer Engineering at Nanyang Technological University. He received his BSc (1983) and DPhil (1987) from the University of Sussex. He worked at IBM China/Hong Kong as a system engineer from 1987 to 1990. His current research interests include data mining, web mining, the semantic web, intelligent systems, information retrieval, intelligent tutoring systems, timetabling and scheduling.

Xuan Hoai Luong earned his BSc in computer science at HCMUT. He is currently a master's student in computer science at the Swiss Federal Institute of Technology (EPFL, Lausanne). His research background includes software verification, data integration and argumentation.

Nguyen Chanh Thanh is an invited lecturer at HCMUT, where he also obtained his PhD. His research interests include natural language processing, digital libraries and software engineering.

Paper received March 2014
Second revision approved 10 November 2014
Abstract
Purpose – Most digital libraries (DLs) are now available online. They also support the Z39.50 standard protocol, which allows computer-based systems to retrieve the information stored in the DLs effectively. The major difficulty lies in the inconsistency between the database schemas of multiple DLs. This paper presents a system known as Argumentation-based Digital Library Search (or ADLSearch), which facilitates information retrieval across multiple DLs.
Design/methodology/approach – The proposed approach uses argumentation theory to reconcile the schema matching results produced by multiple schema matching algorithms. In addition, a distributed architecture is proposed for the ADLSearch system for information retrieval from multiple DLs.
Findings – Initial performance results are promising. First, schema matching can improve retrieval performance in DLs compared with the baseline technique. Second, argumentation-based retrieval can yield better matching accuracy and retrieval efficiency than individual schema matching algorithms.
Research limitations/implications – The work discussed in this paper has been implemented as a prototype supporting scholarly retrieval from about 800 DLs around the world. However, due to the complexity of the argumentation algorithms, the process of adding new DLs to the system cannot be performed in real time.
Originality/value – In this paper an argumentation-based approach is proposed for reconciling the conflicts between multiple schema matching algorithms in the context of information retrieval from multiple digital libraries. Moreover, the proposed approach can also be applied to similar applications which require automatic mapping between multiple database schemas.
Keywords Schema matching, Information retrieval, Digital libraries
Article classification Research paper

Introduction
Unlike traditional means of storage, digital libraries (Saracevic and Dalbello, 2001) are a new
kind of library that has emerged since the end of the twentieth century. In digital libraries, information and documents are stored in digital form and can be accessed and retrieved over the web. Through standard protocols such as Z39.50, search engines can also search information from different digital libraries, and crawlers can connect directly to the database servers and access the data of the digital libraries. Digital libraries have now become one of the major sources for researchers looking for scholarly information over the web.

Traditionally, digital libraries organise information according to a database schema. To support information retrieval from multiple digital libraries, it is commonly assumed that the databases of the different digital libraries share the same schema. In practice, however, each digital library has its own schema. As shown in Figure 1, the same publication record may be represented differently when stored in different digital libraries.

a) A document record retrieved from the digital library of Universidad Complutense de Madrid (INNOPAC)
b) A document record retrieved from the digital library of Rice University (SirsiDynix)
Figure 1 The same publication record may be represented and stored differently in different digital libraries

In Figure 2 we present a closer view of the problem of inconsistent concept representation in different schemas. When representing the concept of "academic paper", one schema may adopt the term Document while other schemas may use the term Publication. Others may even split the concept into two sub-concepts such as Article and Publisher. It may be easy for humans to understand the similarity between these terms. However, the inconsistency of terms or keywords used to represent the same concept poses a serious problem for information retrieval from different sources of digital libraries. This leads to a well-known research problem
called schema matching. Different algorithms have been proposed for automatic matching between schemas. However, as most algorithms rely mainly on heuristics to deal with the inconsistency of keywords, applying them to different datasets can lead to different, or even conflicting, results (Nguyen et al., 2012). In general each algorithm works well in certain domains, but its performance suffers when it is applied to other domains. For the digital library domain, the difficulty lies in the fact that the scholarly materials stored in digital libraries come from different domains, ranging from social sciences to natural sciences. Hence selecting a suitable one-size-fits-all matching algorithm is a very challenging task.

Figure 2 Different terms for the same concept from different schemas and their mappings (terms: Document, Article, Publisher, Publication; mappings: c₁, c₂, c₃, c₄)

In this paper we propose to apply argumentation theory to tackle this problem. The idea is that, instead of fixing a certain schema matching algorithm, we can try multiple matching strategies at the same time. If any conflict is then found among the matching results, argumentation theory is applied to infer the most logical and appropriate answer.

This paper makes two main contributions. First, we propose an argumentation-based approach to perform schema matching across multiple digital libraries. The argumentation framework has been published in our previous work (Nguyen et al., 2013); however, this is the first time it has been applied to the digital library domain. Moreover, we also improve our argumentation framework to make it fully automatic, instead of relying on the involvement of human experts. Second, the proposed approach is incorporated into a search system for digital libraries, called Argumentation-based Digital Library Search (or ADLSearch). To the best of our knowledge, the matching between multiple digital libraries has so far mainly involved manual methods. In contrast, the ADLSearch system is capable of
handling more than 800 digital libraries automatically, owing to the integration of our extended argumentation framework.

Related work
Classical schema matching algorithms
Schema matching has been recognised as one of the most important operations required by the process of data integration, which has been studied by the database and AI communities for over 25 years (Doan and Halevy, 2005). There are many cutting-edge schema matching techniques and tools (Bernstein et al., 2011), such as element-level matching, structure-level matching, instance-based matching and combined techniques. Classical and recent tools developed along this direction are discussed in detail by Nguyen et al. (2012), notably including Bmatch (Duchateau et al., 2007), COMA++ (Aumueller et al., 2005), ASMOV (Jean-Mary et al., 2009), Falcon-AO (Gonzalez et al., 2010), AgreementMaker (Marie and Gal, 2007), OII Harmony (Melnik et al., 2002), AMC (Peukert et al., 2011) and OntoBuilder (Roitman and Gal, 2006). Most systems focus on semi-structured schema types (e.g. XML, OWL and RDF), in order to be aligned with current business standards (Kabak and Dogac, 2010). These tools introduced various approaches to capture similarities between schemas, including linguistic processing (dictionary lookup, string matching, etc.), structure-based analysis and tuning selection methods. However, the outputs of these methods are still inherently uncertain, as many irrelevant items and mismatches were found when the methods were applied to real-life datasets.

Schema matching of big data on the web
As the amount of data shared over the World Wide Web keeps growing dramatically, schema matching for structured data on the web, especially ontological data used by semantic technologies, is also attracting considerable attention. Schema matching is considered one of the four challenges of Big Data processing, known as Orri's Challenge (Bizer et
al., 2012). To tackle this problem, the use of linked data such as Wikipedia to improve schema matching performance has been considered (Assaf et al., 2012). However, this method suffers from performance issues when dealing with real data, where the linkage between elements/entities is very large. Crowdsourcing (Doan et al., 2011), where the major ideas of communities are taken into account and analysed to eventually infer the most logical ones, is a noteworthy approach. However, building a reliable community is another real challenge.

Applying classic schema matching algorithms to big data, especially in the context of the semantic web, was recently discussed (Pinkel et al., 2013). However, the same problem persists when different algorithms are applied. The most recent work (Dong and Srivastava, 2014) suggested a model for data integration in big data, which is a two-fold process: 1) constructing a mediated global schema, and 2) generating the mappings between the mediated (global) schema and the local schemas. We follow the same approach for schema matching in digital libraries, adopting argumentation for the second step, mapping generation.

Schema matching for multiple digital libraries
Different digital libraries have been proposed and developed. For example, JeromeDL (Kruk, 2010) is an open source semantic digital library. CDS Invenio (http://invenio-software.org) is another open source digital library with approximately one million documents in 700 collections of different categories. Papadakis et al. (2009) proposed a subject-based digital library, whereas Cinque et al. (2004), Bloehdorn (2007) and Quan et al. (2007) proposed ontology-based digital libraries.

Supporting information search from multiple digital libraries is an emerging research area. The ICDL project (Hutchinson et al., 2005) aimed to organise the indexes and search information from several digital libraries located
in different countries. ANTAEUS (Joint, 2010) introduced an amalgamated search engine which searches information sources gathered from multiple digital libraries. Chen et al. (2011) developed CollabSeer, which searches information on researchers' publications stored in digital libraries in order to recommend suitable candidates for research projects.

However, in order to support scholarly retrieval from multiple digital libraries, the issue of schema matching cannot be avoided. Schema matching in digital libraries can be considered a specific case of big data schema matching in which the stored data is structured scholarly information. In addition, standards for digital libraries such as Z39.50 and MARC-21 can support information retrieval from multiple digital libraries. A web data integration approach, in which schema matching plays a crucial role, was proposed by Belhajjame et al. (2011) and Bernstein et al. (2011). However, due to the unresolved problem of inconsistency between schema matching algorithms, most of the methods for data integration from multiple digital libraries are still manual in practice (Song et al., 2005; Kent and Bowman, 2011).

Unlike classic schema matching algorithms, COSM (Song et al., 2005) is a clustering-based approach which aims to infer matchings from element-based clustering results over the digital libraries' data. However, applying clustering to large-scale data still requires data preprocessing steps. Content-based systems, such as SIMPLIcity (Chen and Wang, 2002) or ETANA (Ravindranathan et al., 2004) for multimedia retrieval from digital libraries, also take a noteworthy approach, as they try to extract semantic information from the contents of the materials stored in the DLs, rather than processing at the schema layer. However, attempts to automate this process using machine learning algorithms are still encountering considerable difficulty due to the complexity of dealing with large
volumes of data (Shvaiko and Euzenat, 2013). As a result, information retrieval from multiple digital libraries with various data schemas still takes a manual approach, such as the Nebula interface for constructing conceptual knowledge systems for DLs (Kent and Bowman, 2011).

Applications of argumentation-based approaches
The argumentation-based approach, in which matching decisions are formulated as arguments, is a kind of propositional logic supporting reasoning and reconciliation in n-party games (Phan, 1995). This work then evolved into argumentation theory, which is a systematic study of techniques to reach conclusions from given premises (Besnard and Hunter, 2008). Once decisions are expressed as arguments, we can detect the conflicts between them and support the selection of the most reasonable arguments to resolve those conflicts.

There are two kinds of argumentation approach: abstract argumentation and logical argumentation (Prakken, 2012). The former was proposed by Dung (1995), who described arguments as abstract objects. Dung (Dung, 1995; Dung et al., 2007) also introduced the concept of acceptability semantics, which defines different levels of acceptance for a proposed argument. However, the most prominent proposal in this area is logical argumentation (Besnard and Hunter, 2008), which was adopted in this research. This approach relies on propositional logic to describe the arguments. The theoretical details and a running example of applying logical argumentation to schema matching will be presented in the next section.

The argumentation-based approach has been successfully applied to many practical applications. Bentahar et al. (2010) used argumentation for solving conflicts that may arise among web services and resources in business processes of e-commerce systems. In collaborative and cooperative planning (Sapena et al., 2011), argumentation can be combined with machine learning to improve the automation level of operations policies. In social networks (Grosse et al., 2012), natural
language processing is adopted to extract arguments from textual data, which are used to reach social agreements among participants. In cloud computing (Heras et al., 2012), argumentation can be used to help cloud providers handle physical failures in a collaborative manner. In the semantic web (Rahwan et al., 2007), argumentation has been modelled using the Argument Interchange Format ontology, allowing large-scale collections of interconnected arguments on the web.

Motivation for this research from existing work
Schema matching is a technique which aims at reasonably matching elements from different schemas. It therefore plays a crucial role in data integration from various sources, especially those available on the internet. Many classic schema matching algorithms have been proposed, each of which achieves better accuracy when applied to certain domains of data. However, identifying which algorithm is the best for a given dataset is an important task which remains unsolved.

With the recent emerging trends of big data and semantic technologies, schema matching is one of the four major challenges of performing data integration from multiple databases/ontologies. One of the works in this field suggested the use of a central schema around which the element matchings are centred. We adopt this idea to integrate scholarly data from multiple digital libraries.

So far, data integration of multiple digital libraries has relied heavily on manual methods. We propose to use the argumentation technique to automate this process, as this method can yield reasonable combinations of matching results. However, the existing argumentation approach still requires intervention from human experts to approve or disapprove each matching produced. We overcome this obstacle by using empirical thresholds to replace human decisions, as discussed in the subsequent sections.

Argumentation-based conflict
reconciliation of schema matching results
In this section we describe the use of the argumentation-based approach for conflict reconciliation of schema matching results. Several schema matching algorithms are currently available. However, thus far no algorithm has been shown to be better than the others. Moreover, conflicts can arise between the matching results produced by these algorithms. In previous work we proposed an argumentation-based framework to handle this problem (Nguyen et al., 2013). In this work the framework is adopted and extended to support automatic schema matching for digital libraries. As shown in Figure 3, the framework consists of two phases: individual validation and conflict reconciliation.

Individual validation involves two steps. The first step is individual matching, which involves several matching algorithms. The mappings output by the matching algorithms are integrated in the schema mapping table. The second step is argument construction, which converts the stored mappings into a mathematical representation, the argument, for further processing. The arguments are stored in the arguments set.

Figure 3 Conflict reconciliation framework (Phase 1, individual validation: individual matching by algorithms 1 to k, steps 1.1.1 to 1.1.k, feeding the schema mapping table, followed by argument construction, step 1.2, producing the arguments set; Phase 2, conflict reconciliation: negotiation rounds 2.1 to 2.n, each comprising conflict detection, argument evaluation and guided resolution)

The conflict reconciliation phase reconciles the mapping conflicts. It comprises the following tasks:
• Conflict detection: As the mappings are converted into arguments in the first phase, we process the arguments to detect any conflicts among them mathematically
• Argument evaluation: When a conflict between arguments is detected, the
involved arguments will be evaluated to determine their strengths
• Guided resolution: Based on the strength of the arguments a final resolution will be

Figure 6 Aggregated score of a mapping (arguments a4, a5 and a6, with scores 0.2, 0.4 and 0.3, support ¬c4; AVG = 0.3, SUM = 0.9, MAX = 0.4, MIN = 0.2)

Figure 6 illustrates the mapping evaluation. In the example given in Table we have three arguments a4, a5 and a6 claiming that the mapping c4 should be disapproved, with respective scores of 0.2, 0.4 and 0.3.

Figure 7 Final resolution for a mapping conflict (arguments a4, a5 and a6 support ¬c4, with an aggregated score of 0.9; arguments b2 and k2 support c4, with an aggregated score of 0.4)

Figure 7 depicts the conflict resolution process between approving (c4) and disapproving (¬c4). The disapproval decision (¬c4) is derived from arguments a4, a5 and a6, while the approval comes from arguments b2 and k2. Assuming that the SUM operator is applied, the scores of c4 and ¬c4 are 0.4 and 0.9 respectively. Since 0.9 exceeds 0.4, we follow the disapproval decision (¬c4) and discard the approval.

Extension of the reconciliation framework in this work
Compared to our previous work (Nguyen et al., 2013), the framework discussed above is extended in the following ways:
• In the previous work we relied on human experts to approve or disapprove a mapping. In this work this step is automated by using upper and lower scores. The reconciliation framework is therefore scalable to the large number of schemas available for various digital libraries
• We suggest using a defence graph to calculate argument strength. The complexity of this step is thus reduced significantly compared to the logic-based approach introduced in the previous work

The ADLSearch system
Figure 8 shows our proposed ADLSearch system, which is a search engine designed for searching scholarly information from multiple digital libraries over the internet. The architecture of ADLSearch comprises the major components of a typical search engine, including crawling, retrieving and an indexed data layer. In particular, the system is enhanced by the argumentation-based conflict reconciliation framework which has just been discussed. This component is incorporated to handle conflicts when mapping schemas between multiple digital libraries.

Figure 8 The architecture of ADLSearch (digital libraries such as GATech DL, NTU DL and VNU DL, each with a schema, a database, a web GUI and a Z39.50 interface; ADLSearch components: crawling, with schema extractor and document crawling; the conflict reconciliation framework; retrieving, with query processing, descriptor retrieval, full text retrieval and result producing; the indexed data layer, with the schema mapping table, central schema and central database)

The main function of the crawling component is to crawl scholarly information from multiple digital libraries on the internet. Digital libraries usually offer a web-based graphical user interface (GUI) allowing general users to search for information in a convenient manner. Apart from that, information from digital libraries can also be retrieved automatically through the Z39.50 protocol. As such, the crawling component can retrieve information from a specific digital library. There are two types of information to be crawled: the schema of the scholarly information organised in the digital library, and the document descriptors that describe the significant attributes of the documents such as authors, titles, publication information, etc. However, accessing the full text of the documents may require membership.

To store and index the schema and document descriptors crawled from the digital libraries, ADLSearch maintains a central schema and a central database in the indexed data layer. The central schema defines a "standardised" schema adopted by the system. When ADLSearch crawls information from a new digital library, the schema of the new digital library will be extracted and
mapped into the central schema. Based on the attributes defined in the central schema, the crawled document descriptors will be indexed and stored in the central database. Figure 9 illustrates a document record stored in our central schema.

In addition, as ADLSearch collects information from multiple digital libraries over the internet, a schema mapping table is also constructed to store all of the mappings between the schemas of the crawled digital libraries and the central schema. When ADLSearch connects to a new digital library, the mappings between the central schema of ADLSearch and the schema of the new digital library will be generated and added to the schema mapping table. As discussed before, the proposed argumentation-based conflict reconciliation framework will be responsible for generating the contents of the schema mapping table and handling the conflicts.

<articleid> 1
<title_statement> 1993 computer architectures for machine perception [electronic resource]
<title_uniform> computer architectures
<publish_place> Los Alamitos, Calif
<publish_name> IEEE Computer Society Press
<publish_date> 1993
<author_personal_name> Bayoumi, Magdy A
<author_corporate_name> University of Southwestern Louisiana.$bCenter for Advanced Computer Studies
<author_conference_name> Workshop on Computer Architectures for Machine Perception
<edition> 1st ed
<series> McGraw-Hill series on computer communications
<foundin> IEEE/IET Electronic Library (IEL) Proceedings By Volume
<description> x, 456 p : ill ;
<lc_call_no> QA76.9.A73A17 1993
<dewey_no> 006.3/7 20
<isbn> 0818654201 (paper) 081865421X (microfiche)
<issn> 1527-6805
<other_authors> Davis, Larry S
<availability> 978-1-4020-2758-1 Springer
<notes> Workshop on Computer Architectures for Machine Perception" Cover IEEE catalog number 93TH0608-0" Cover Includes bibliographical references and index Only available to UniSA staff and
students Mode of access: WWW
<subjects> Computer architecture
<control_no> 1595693
<access_location> http://purl.access.gpo.gov/GPO/LPS1988
<access_note> Digital access
Figure 9 A document record stored in the central schema of ADLSearch

Based on the indexed central database, the retrieving component performs descriptor retrieval to retrieve documents whose descriptors match the queries submitted by users via query processing. If the full texts of the retrieved documents are available, either under the policy of the hosting digital libraries or through the memberships of the users, then full text retrieval retrieves the full text of the corresponding document. Finally, result producing displays the final retrieval results to the users.

System interface
In ADLSearch we have downloaded schemas from the digital libraries listed at http://www.loc.gov/z3950/gateway.html. This page lists approximately 800 libraries that support the Z39.50 protocol, thereby enabling automatic information access and retrieval for these libraries. The downloaded schemas are mapped and indexed in ADLSearch as discussed before. Similar to other search engines for digital libraries, ADLSearch lets users search for any relevant scholarly information from the indexed digital libraries. Currently the following search functions are supported in ADLSearch:
• Document search searches for documents related to the submitted keywords
• Author search searches for publications of specified authors
• Publisher search searches for documents published by specified publishers
• Expert search searches for experts in areas specified by keywords

Figure 10 Search interface of ADLSearch
Figure 11 A document descriptor retrieved by ADLSearch

Figure 10 shows the document search interface of the system, where users can choose which of the digital libraries to search among those indexed
by ADLSearch. Users can view the detailed information of a retrieved document as illustrated in Figure 11. If users have the necessary permission, they can continue to retrieve the full text of the document from the digital library that hosts it.

A distinctive feature of ADLSearch is that the system allows users to add and index new digital library schemas automatically. Users can keep track of the mapping decisions, the generated arguments and their evaluated strengths as shown in Figure 12. Moreover, expert users can even view information on the technical implementation, such as the details of the evaluation process (for example the defence graph), as shown in Figure 13.

Figure 12 Visualisation of mapping decisions and their corresponding scores
Figure 13 Tracking a defence graph in ADLSearch

Implementation
The system was developed using the Java programming language. As mentioned earlier, the ADLSearch system currently employs three matching algorithms: COMA++, AMC and OntoBuilder. We have also used the Vispartix tool (Charwat et al., 2012) to support the generation of arguments from the outputs of the matching algorithms. In addition, the ASP solver in DLV-Complex (Calimeri et al., 2008) was adopted to detect conflicts between arguments.

Experiment results
Research questions
To evaluate the performance of the proposed approach we conducted experiments to verify the following two hypotheses:
H1 The argumentation approach can improve the schema matching accuracy compared to individual matching algorithms. This claim has been supported in our previous work (Nguyen et al., 2013), but we wanted to verify it again when applied to scholarly datasets collected from digital libraries.
H2 Employing schema matching can improve the retrieval efficiency from various digital libraries. In addition, benefiting from better schema matching
accuracy, the argumentation-based approach should achieve better retrieval precision.

Following these two hypotheses, we evaluated the performance of our system using two measures: schema matching accuracy and retrieval efficiency, respectively. Appropriate metrics were adopted in the evaluation of each measure.

Datasets and matching algorithms

In this experiment we used a dataset comprising schemas of digital libraries collected from the webpage http://www.loc.gov/z3950/gateway.html. We classified similar schemas into schema patterns. In total we had 71 patterns, which can be downloaded from http://www.cse.hcmut.edu.vn/~save/patterns.zip. We only selected matching tools whose sources were available and free of licensing issues. In the evaluation we used the three most popular schema matchers, COMA++ (Aumueller et al., 2005), Auto Mapping Core (AMC) (Peukert et al., 2011) and OntoBuilder (Roitman and Gal, 2006), as listed in the table below. These three matching tools are also deployed in ADLSearch.

Table. Schema matching tools

Matching tool   Features               Supported format
COMA++          GUI, API, commercial   XSD, XDR, TXT, RDF
AMC             GUI, API, commercial   XSD
OntoBuilder     GUI, API, research     RDF, XSD (partial)

Schema matching accuracy

We evaluated the schema matching accuracy of our argumentation-based approach against that of the individual matching algorithms. To carry out the experiments we selected a dataset comprising 20 schema patterns, which covered about 2,000 corresponding records. We then manually produced the matchings between these patterns; these manual matchings served as the ground truth of the experiment. We then ran the schema matching algorithms on the dataset of 20 schema patterns. If the output of a schema matching algorithm agreed with the ground truth, we counted it as a hit, and otherwise as a miss. Then
we defined the ratio of accuracy metric, which measures the number of hits over the total number of suggestions provided by the corresponding resolution strategy. It was calculated using the following formula:

\[ \mathrm{Ratio\ of\ accuracy}(\mathit{algorithm}, \mathit{action\_kind}) = \frac{n_{\mathit{HITs}}}{n_{\mathit{conflict}}} \]

where algorithm is the matching algorithm involved and action_kind is the kind of matching action suggested by the algorithm. The action_kind can be approving or disapproving of a mapping of schema attributes. For example, if the algorithm is COMA++ and the action_kind is approving, we are evaluating the accuracy of the COMA++ algorithm when it suggests an approving action. As discussed earlier, upper and lower thresholds were used to identify whether a matching algorithm approves or disapproves a mapping. We tuned the upper and lower thresholds for each algorithm to identify the most appropriate values for approving and disapproving actions.

Table. Schema matching accuracy evaluation

Action         AMC    OntoBuilder   COMA++   Argumentation
Approving      0.72   0.75          0.84     0.96
Disapproving   0.55   0.62          0.64     0.83
Combination    0.68   0.71          0.77     0.89

This table presents the results of the matching accuracy evaluation, where we evaluated three kinds of actions: approving, disapproving and overall (the combination of approving and disapproving). Among the three individual matching tools, COMA++ achieved the best matching accuracy. However, the argumentation-based approach outperformed all of them. It achieved an increase of about 20 percent compared to the average of the three matching tools, and an increase of about 14 percent compared with the best matching tool, COMA++. Notably, for the disapproving action, where the three individual tools perform quite poorly, the argumentation-based approach still maintains relatively good accuracy.

Retrieval efficiency

Since
ADLSearch is an information retrieval system, we used the traditional information retrieval metrics of precision, recall and F-measure (Rijsbergen, 1979) to evaluate the retrieval efficiency of the system. In addition, to highlight the advantages gained by schema matching algorithms when applied to retrieval from multiple digital libraries, we also measured the performance of a baseline method based on the Z39.50 protocol, referred to as BaseZ3950. In this baseline method we simply used the results retrieved through Z39.50 when processing keyword queries, without handling inconsistencies if they arose.

Table. Retrieval efficiency evaluation

Metric      BaseZ3950   AMC    OntoBuilder   COMA++   Argumentation
Precision   0.58        0.77   0.74          0.84     0.95
Recall      0.81        0.66   0.77          0.83     0.78
F-measure   0.67        0.70   0.75          0.83     0.85

This table compares the average performance of the individual tools and the argumentation-based approach on the dataset. For each method we present the average precision, recall and F-measure. BaseZ3950 enjoys good recall, which is even better than that of the argumentation-based method. This can be explained by the fact that, when handling inconsistencies, the schema matching algorithms may miss true positive cases (i.e. some correct matchings are not approved and are therefore missing from the final results). However, BaseZ3950 achieved very poor precision, meaning that many false positive cases are included in the final result because of the unresolved inconsistencies. As a result this baseline method is outperformed by all matching algorithms in terms of F-measure.

Among all the matching methods the argumentation-based approach is the clear winner in terms of precision and F-measure, although its recall (0.78) is lower than that of COMA++ (0.83). The argumentation-based approach achieved an increase of about 17 percent in precision, and a smaller gain in recall, compared to the average of the other three matching tools. For a
predefined weighting of precision and recall for F-measure, the argumentation-based approach is also the best technique in the overall results. Its F-measure (0.85) exceeds both the average of the three individual matching tools and that of the best individual tool, COMA++ (0.83).

Conclusion

This paper introduces a system called Argumentation-based Digital Library Search, or ADLSearch. It is a search engine designed for scholarly information retrieval from multiple digital libraries distributed over the internet. On the one hand, ADLSearch makes use of the standard Z39.50 protocol to connect with external digital libraries for crawling and indexing scholarly information. On the other hand, ADLSearch includes an internal argumentation-based conflict reconciliation framework, which uses argumentation theory to handle inconsistencies when matching multiple schemas of the external digital libraries. The framework allows new digital libraries to be indexed by ADLSearch automatically. Currently ADLSearch has indexed over 800 digital libraries and scales well owing to its use of best practices for handling large-scale datasets on the server side.

Our research work has opened up some new research directions. First, we would like to design a negotiation protocol to enable negotiation within the ADLSearch system. Second, we intend to extend the notion of the proposed constraints to consider integrity constraints that are relevant in practice (e.g. functional dependencies, domain-specific constraints, etc.).
Third, we intend to apply our proposed approach to other problems. While our work focuses on schema matching between digital libraries, our techniques, especially the argumentation-based conflict reconciliation framework, could be applied to other tasks such as entity resolution or business process matching.

References

Assaf, A., Louw, E., Senart, A., Follenfant, C., Troncy, R. and Trastour, D. (2012), “Improving schema matching with linked data”, Computing Research Repository (CoRR), [online], available at: http://arxiv.org/abs/1205.2691 (accessed 27 November 2014).
Aumueller, D., Do, H.H., Rahm, E. and Massmann, S. (2005), “Schema and ontology matching with COMA++”, in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, ACM, New York, pp. 906-8.
Belhajjame, K., Paton, N.W., Fernandes, A.A.A., Hedeler, C. and Embury, S.M. (2011), “User feedback as a first class citizen in information integration systems”, in Proceedings of the Conference on Innovative Data Systems Research, ACM, New York, pp. 175-83.
Bentahar, J., Alam, R., Maamar, Z. and Narendra, N.C. (2010), “Using argumentation to model and deploy agent-based B2B applications”, Knowledge-Based Systems, Vol. 23 No. 7, pp. 677-92.
Bernstein, P.A., Madhavan, J. and Rahm, E. (2011), “Generic schema matching, ten years later”, Proceedings of the VLDB Endowment, Vol. No. 11, pp. 695-701.
Besnard, P. and Hunter, A. (2008), Elements of Argumentation, MIT Press, Cambridge.
Bloehdorn, S., Cimiano, P., Duke, A., Haase, P., Heizmann, J., Thurlow, I. and Völker, J. (2007), “Ontology-based question answering for digital libraries”, Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science, Vol. 4675/2007, pp. 14-25, DOI: 10.1007/978-3-540-74851-9_2.
Bizer, C., Boncz, P., Brodie, M.L. and Erling, O. (2012), “The meaningful use of big data: four perspectives – four challenges”, ACM SIGMOD Record, Vol. 40 No. 4, pp. 56-60.
Calimeri, F.,
Cozza, S., Ianni, G. and Leone, N. (2008), “Computable functions in ASP: theory and implementation”, in Logic Programming, Lecture Notes in Computer Science, Vol. 5366, Springer, Berlin, pp. 407-24.
Chang, C.L. and Lee, R.C.T. (1973), Symbolic Logic and Mechanical Theorem Proving, Academic Press, New York.
Charwat, G., Wallner, J.P. and Woltran, S. (2012), “Utilizing ASP for generating and visualizing argumentation frameworks”, in Proceedings of the 5th Workshop on Answer Set Programming and Other Computing Paradigms, pp. 51-65, [online], available at: http://www.dbai.tuwien.ac.at/research/project/argumentation/papers/CharwatWW12.pdf (accessed 27 November 2014).
Chen, C.C. and Wang, J.Z. (2002), “Large-scale Emperor Digital Library and semantics-sensitive region-based retrieval”, in Proceedings of Digital Library – IT Opportunities and Challenges in the New Millennium, Beijing Library Press, Beijing, pp. 454-62.
Chen, H.H., Gou, L., Zhang, X. and Giles, C.L. (2011), “CollabSeer: a search engine for collaboration discovery”, in Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL '11), pp. 231-40, DOI: 10.1145/1998076.1998121.
Cinque, L., Malizia, A. and Navigli, R. (2004), “OntoDoc: an ontology-based query system for digital libraries”, in Proceedings of the 17th International Conference on Pattern Recognition, Vol. 2, IEEE, Los Alamitos, CA, pp. 671-4.
Doan, A. and Halevy, A.Y. (2005), “Semantic-integration research in the database community”, AI Magazine, Vol. 26 No. 1, pp. 83-94.
Doan, A., Franklin, M.J., Kossmann, D. and Kraska, T. (2011), “Crowdsourcing applications and platforms: a data management perspective”, Proceedings of the VLDB Endowment, Vol. No. 12, pp. 1508-9.
Dong, X.L. and Srivastava, D. (2014), “Big data integration”, in Proceedings of 2014 IEEE 30th International Conference on Data Engineering, IEEE, Los Alamitos, CA, pp. 1245-8.
Duchateau, F., Bellahsene, Z. and Roche, M.
(2007), “BMatch: a semantically context-based tool enhanced by an indexing structure to accelerate schema matching”, [online], available at: http://www2.lirmm.fr/~mroche/Web/Publications/All_papers/duchateau_BDA07.pdf (accessed 27 November 2014).
Dung, P.M. (1995), “On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games”, Artificial Intelligence, Vol. 77 No. 2, pp. 321-57.
Dung, P.M., Mancarella, P. and Toni, F. (2007), “Computing ideal sceptical argumentation”, Artificial Intelligence, Vol. 171 No. 10, pp. 642-74.
Egly, U., Gaggl, S.A. and Woltran, S. (2010), “Answer-set programming encodings for argumentation frameworks”, Argument and Computation, Vol. No. 2, pp. 147-77.
Gonzalez, H., Halevy, A.Y., Jensen, C.S., Langen, A., Madhavan, J., Shapley, R., Shen, W. and Goldberg-Kidon, J. (2010), “Google Fusion Tables: web-centered data management and collaboration”, ACM’s Special Interest Group on Management of Data (SIGMOD 2010), ACM, New York, pp. 1061-6.
Grosse, K., Chesñevar, C.I. and Maguitman, A.G. (2012), “An argument-based approach to mining opinions from Twitter”, in Proceedings of the First International Conference on Agreement Technologies, Vol. 918, pp. 408-22, [online], available at: http://ceur-ws.org/Vol-918/111110408.pdf (accessed 27 November 2014).
Heras, S., de la Prieta, F., Rodríguez, S., Bajo, J., Botti, V.J. and Julián, V. (2012), “The role of argumentation on the future internet: reaching agreements on clouds”, in Proceedings of the First International Conference on Agreement Technologies, Vol. 918, pp. 393-407, [online], available at: http://ceur-ws.org/Vol-918/111110393.pdf (accessed 27 November 2014).
Hutchinson, H.B., Rose, A., Bederson, B.B., Weeks, A.C. and Druin, A. (2005), “The International Children’s Digital Library: a case study in designing for a multilingual, multicultural, multigenerational audience”, Information Technology and
Libraries, Vol. 24 No. 1, pp. 4-12.
Jean-Mary, Y.R., Shironoshita, E.P. and Kabuka, M.R. (2009), “Ontology matching with semantic verification”, Web Semantics, Vol. No. 3, pp. 235-51.
Joint, N. (2010), “The one-stop shop search engine: a transformational library technology?: ANTAEUS”, Library Review, Vol. 59 No. 4, pp. 240-8.
Kabak, Y. and Dogac, A. (2010), “A survey and analysis of electronic business document standards”, ACM Computing Surveys, Vol. 42 No. 3, pp. 11:1-31.
Kent, R.E. and Bowman, C. (2011), “Digital libraries, conceptual knowledge systems, and the Nebula interface”, Computing Research Repository (CoRR), [online], available at: http://arxiv.org/abs/1109.1841 (accessed 27 November 2014).
Kruk, S.R. (2010), “Semantic digital libraries – improving usability of information discovery with semantic and social services”, PhD thesis, National University of Ireland, Galway.
Marie, A. and Gal, A. (2007), “On the stable marriage of maximum weight royal couples”, in Proceedings of AAAI Workshop on Information Integration on the Web (II-Web07), AAAI Press, Menlo Park, CA, pp. 62-7.
Melnik, S., Garcia-Molina, H. and Rahm, E. (2002), “Similarity flooding: a versatile graph matching algorithm and its application to schema matching”, paper presented at International Conference on Data Engineering (ICDE), 26 February to March, San Jose, CA.
Nguyen, T.T., Nguyen, Q.V.H. and Quan, T.T. (2012), “A framework to combine multiple matchers for pair-wise schema matching”, in Proceedings of 2012 IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF 2012), IEEE Press, Los Alamitos, CA, pp. 1-6.
Nguyen, Q.V.H., Luong, H.X., Miklos, Z., Quan, T.T. and Aberer, K. (2013), “Collaborative schema matching reconciliation”, in Proceedings of the 21st International Conference on Cooperative Information Systems (CoopIS 2013), Springer, Berlin, pp. 222-40.
Papadakis, I., Kyprianos, K.,
Mavropodi, R. and Stefanidakis, M. (2009), “Subject-based information retrieval within digital libraries employing LCSHs”, D-Lib Magazine, Vol. 15 No. 9/10, [online], available at: http://www.dlib.org/dlib/september09/papadakis/09papadakis.html (accessed 26 November 2014).
Phan, M.D. (1995), “On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games”, Artificial Intelligence, Vol. 77 No. 2, pp. 321-58.
Peukert, E., Eberius, J. and Rahm, E. (2011), “AMC – a framework for modelling and comparing matching systems as matching processes”, in Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE), IEEE Press, Los Alamitos, CA, pp. 1304-7.
Pinkel, C., Binnig, C., Kharlamov, E. and Haase, P. (2013), “IncMap: pay as you go matching of relational schemata to OWL ontologies”, in Proceedings of Ontology Matching 2013 (OM 2013), pp. 37-48, [online], available at: http://ceur-ws.org/Vol-1111/om2013_proceedings.pdf#page=46 (accessed 27 November 2014).
Prakken, H. (2012), “Some reflections on two current trends in formal argumentation”, Logic Programs, Norms and Action, Springer, Berlin, pp. 249-72.
Quan, T.T., Fong, A.C.M. and Hui, S.C. (2007), “A scholarly semantic web system for advanced search functions”, Online Information Review, Vol. 31 No. 3, pp. 353-64.
Rahwan, I., Zablith, F. and Reed, C. (2007), “Towards large scale argumentation support on the semantic web”, in Proceedings of the 22nd National Conference on Artificial Intelligence, Vol. 2, AAAI Press, Menlo Park, CA, pp. 1446-51.
Ravindranathan, U., Shen, R., Gonçalves, M.A., Fan, W., Fox, E.A. and Flanagan, J.W. (2004), “ETANA-DL: a digital library for integrated handling of heterogeneous archaeological data”, in Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, ACM, New York, pp. 76-7.
Rijsbergen, C.J.V. (1979), Information Retrieval, 2nd ed., Butterworths, London.
Roitman, H.
and Gal, A. (2006), “OntoBuilder: fully automatic extraction and consolidation of ontologies from web sources using sequence semantics”, in Proceedings of the 2006 International Conference on Current Trends in Database Technology, Springer, Berlin, pp. 573-6.
Sapena, O., Torreño, A. and Onaindia, E. (2011), “On the construction of joint plans through argumentation schemes”, in Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems, Vol. 3, International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, pp. 1195-6.
Saracevic, T. and Dalbello, M. (2001), “A survey of digital library education”, Proceedings of the American Society for Information Science and Technology, Vol. 38, pp. 209-23.
Shvaiko, P. and Euzenat, J. (2013), “Ontology matching: state of the art and future challenges”, IEEE Transactions on Knowledge and Data Engineering, Vol. 25 No. 1, pp. 158-76.
Song, H., Ma, F. and Wang, C. (2005), “Clustering-based schema matching of web data for constructing digital library”, in Computational Science and Its Applications – ICCSA 2005, Springer, Berlin, pp. 1086-95.

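The two evaluation measures used in the experiments (the ratio of accuracy for schema matching, and precision, recall and F-measure for retrieval) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the implementation used in ADLSearch: the function names, the attribute names, the scores and the threshold values are all hypothetical. It also assumes that a score between the lower and upper thresholds yields no suggestion, and that the denominator of the ratio is the number of suggestions of the given action kind, following the prose definition.

```python
from typing import Dict, Optional, Set, Tuple

Mapping = Tuple[str, str]  # (source attribute, target attribute); hypothetical shape


def suggest_action(score: float, lower: float, upper: float) -> Optional[str]:
    """Turn a matcher's similarity score into a suggested action.

    Assumption: score >= upper approves, score <= lower disapproves,
    and anything in between gives no suggestion.
    """
    if score >= upper:
        return "approving"
    if score <= lower:
        return "disapproving"
    return None


def ratio_of_accuracy(
    scores: Dict[Mapping, float],
    ground_truth: Set[Mapping],
    action_kind: str,
    lower: float,
    upper: float,
) -> float:
    """Hits divided by the number of suggestions of the given action kind."""
    hits = 0
    suggestions = 0
    for mapping, score in scores.items():
        if suggest_action(score, lower, upper) != action_kind:
            continue
        suggestions += 1
        mapping_is_correct = mapping in ground_truth
        # Approving a correct mapping, or disapproving an incorrect one, is a hit.
        if (action_kind == "approving") == mapping_is_correct:
            hits += 1
    return hits / suggestions if suggestions else 0.0


def precision_recall_f(
    retrieved: Set[str], relevant: Set[str], beta: float = 1.0
) -> Tuple[float, float, float]:
    """Precision, recall and weighted F-measure for one query."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical matcher output and manually produced ground truth.
SAMPLE_SCORES: Dict[Mapping, float] = {
    ("title", "dc_title"): 0.92,
    ("author", "creator"): 0.85,
    ("author", "publisher"): 0.81,
    ("subjects", "publisher"): 0.10,
    ("control_no", "identifier"): 0.15,
    ("title", "creator"): 0.50,
}
SAMPLE_TRUTH: Set[Mapping] = {
    ("title", "dc_title"),
    ("author", "creator"),
    ("control_no", "identifier"),
}

if __name__ == "__main__":
    for kind in ("approving", "disapproving"):
        print(kind, ratio_of_accuracy(SAMPLE_SCORES, SAMPLE_TRUTH, kind, 0.2, 0.8))
    print(precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"}))
```

The `beta` parameter stands in for the predefined weighting of precision and recall mentioned in the retrieval evaluation; `beta = 1.0` gives the balanced F-measure, and the weighting actually used in the experiments is not stated in the text.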