Data Mining and Knowledge Discovery Handbook, 2nd Edition (part 126)

64 NHECD - Nano Health and Environmental Commented Database
Oded Maimon and Abel Browarnik

Fig. 64.10. NHECD model.

64.3 NHECD implementation

NHECD is built around Documentum [11], an enterprise content management system (ECM). Documentum acts as a central repository for documents (i.e., unstructured data), metadata (semi-structured to structured data, mostly in XML) and extracted information (mostly structured, tabular data).

NHECD assumes that all target scientific papers to be included in the repository are in Adobe PDF format. A review of several websites hosting scientific papers shows that this is a safe choice. If other formats are found, the crawler can be instructed to convert them to PDF almost seamlessly.

[11] http://www.emc.com/products/family/documentum-family.htm

Fig. 64.11. NHECD process. (Diagram labels: rating, information extraction, annotation, metadata, taxonomies.)

64.3.1 Taxonomies

Each taxonomy deals with a certain aspect of the Nanotox domain. The taxonomies were built by teams of domain experts and information management experts. Taxonomies are not an expected output of NHECD, yet they are essential to the NHECD process. Hence, building them was one of the initial steps in the NHECD implementation. The taxonomies are listed in Table 64.1. Figure 64.12 shows the taxonomies describing the subject “commercial NP characterization”.

Fig. 64.12. Commercial characterization of NP.

Table 64.1. Taxonomies.

Subject | Taxonomies
animal model | animal gender; species
experimental exposure parameters | mode of exposure; metabolism; excretion; distribution; NP exposure protocol; effecting agents; standard test protocol
NP chemical characterization | NP chemical composition; core impurities; coat impurities; coat chemical composition; NP carrier solubility
commercial NP characterization | Substance fraction NP stated Mixture (GHS or 1999/45/EC) Article
NP characterization methods | X-ray and neutron based instruments; electron beam methods; ion beam analysis; optical methods; other NP measurement methods
NP general characterization | specific surface area; NP shape; dispersion and adsorption; Nano Delivery system Structure; zeta potential; NP type
Overall NP | taxonomies of NP characterization; taxonomies of NP chemical characterization; taxonomies of NP characterization methods
Results | visible toxic effect by system; biological effects; pathologic effects; time for effect manifestation; reversibility; site of effect; measurement methods for biological effects; measurement methods for pathological effects

64.3.2 Crawling

Crawling is the process of automatically obtaining scientific papers and data about each paper (such as the name of the author or authors, the publication date, the name of the journal, keywords, the abstract, and in general any detail made available with the paper itself) by visiting scientific paper repositories available on the web (whether restricted to subscribers or available to everyone) and searching by keywords on the paper text.

NHECD developed a crawler for PubMed [12]. Crawlers for other leading scientific sites, such as ISIWEB [13] or SciFinder [14], are in a development stage. The main obstacles encountered relate to intellectual property issues and the publishers' efforts to enforce their rights.

[12] http://www.ncbi.nlm.nih.gov/pubmed/
[13] http://www.webofknowledge.com/

The crawler is written in Java. It takes as input a set of keywords. Using the website's API [15], it obtains a list of pointers to the targeted scientific papers. Those pointers are then processed to transform them into downloadable links.
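The pointer-to-link step can be sketched as follows. This is a minimal illustration in Python (the NHECD crawler itself is written in Java). It assumes, purely for the example, that each pointer is a numeric PubMed identifier that can be appended to the PubMed address given in the chapter's footnote; the chapter does not describe the actual transformation, and the names PUBMED_BASE_URL, pointer_to_link and pointers_to_links are hypothetical.

```python
from typing import Iterable, List, Optional

# Assumption: base address taken from the chapter's PubMed footnote.
PUBMED_BASE_URL = "http://www.ncbi.nlm.nih.gov/pubmed/"


def pointer_to_link(pointer: str) -> Optional[str]:
    """Turn one pointer into a link, or return None if it cannot be used.

    Assumption: a usable pointer is a string of ASCII digits.
    """
    candidate = pointer.strip()
    if not (candidate.isascii() and candidate.isdigit()):
        return None
    return PUBMED_BASE_URL + candidate


def pointers_to_links(pointers: Iterable[str]) -> List[str]:
    """Convert a list of pointers, silently skipping unusable ones."""
    links = []
    for pointer in pointers:
        link = pointer_to_link(pointer)
        if link is not None:
            links.append(link)
    return links
```

Keeping the conversion in a pure function separates it from the network access, so the download step only ever sees well-formed links.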
If a downloadable link is obtained, the scientific paper is downloaded, provided that NHECD has access to the paper (e.g., there is a subscription to the resource or it is publicly available). The paper (or, if it is not available, its placeholder), along with the metadata already converted to XML, is uploaded to the NHECD document repository.

[14] http://pubs.acs.org/
[15] Application Program Interface

64.3.3 Information extraction

The goals of information extraction in NHECD are:

1. To enable users to ask specific questions about specific attributes and receive answers. If possible, a link to the paper is given, along with a pointer to the location of the requested information within the document.
2. To enable, in the future, data mining on extracted data (e.g., patterns).

The process starts with a multistep preprocessing stage:

1. Convert the input documents from PDF to text
2. Perform parsing and stemming
3. Perform zoning within the document
4. Classify the document according to NHECD taxonomies

Next in the process is the tagging stage, used to recognize keywords, either by using the taxonomies or the values involved. As an example, the input phrase

“To determine the effect of particle size, labeled microspheres of 500 and 1000 nm in diameter were incubated with mouse melanoma B16 cells”

would result in the tagged form

“To determine the effect of particle size, labeled microspheres of <NUMBER_1> and <NUMBER_2> <LENGTH-UNIT_3> in diameter were incubated with <SPECIES_4> <CELL-TYPE_5> <CELL-LINE_6> cells”

The pattern matching stage is based on the output of the previous stages and on the process of annotation, an auxiliary step performed by Nanotox domain experts to prepare a training set for this stage. The tasks needed to obtain patterns are:

1. Define the list of features to be extracted (based on the taxonomy).
2. For each feature that needs to be extracted, define a list of extraction patterns.
3. Each extraction pattern (p) consists of the following items:
   a) p.attributes: the associated attributes to be extracted (note that the same pattern can be used to extract several attributes concurrently)
   b) p.precondition: a pre-condition
   c) p.match: a regular expression to be matched
   d) p.extraction: a regular expression used for extracting the values, assuming that pattern p has been matched
   e) p.scope: determines the scope of the extracted values in the text
   f) p.store: an SQL query for storing the results in the database

The closing stage of the process is the conflict resolution stage. It is required for cases where several possibly contradicting patterns can be matched to the same text, or where the same pattern can be matched to different parts of the text. The information extraction process is depicted in Figure 64.13.

Fig. 64.13. The information extraction process.

64.3.4 NHECD products

The results of NHECD consist mainly of two products:

1. A repository of scientific papers related to Nanotox, augmented by metadata provided by authors and publishers, metadata extracted from the papers using text mining algorithms, and ratings for the articles based on methods adopted by NHECD. All of the above are indexed using NHECD taxonomies. As a result, it is possible to retrieve scientific papers using sophisticated queries.
2. A set of structured facts extracted from the scientific papers in tabular format. The structured facts should make it possible to perform data mining to obtain new, unforeseen knowledge.

64.3.5 Scientific paper rating

A scientific paper has a well-established life cycle. After the paper is written, refereed and eventually accepted, it is published. From this point in time the paper can be cited. The rating of a paper depends on several variables:

1. Journal name
2. Publication year
3. Full author names
4. For each citing article:
   a) Citing article name (and a unique identifier for the paper itself; NHECD decided to adopt SICI [16] for this purpose)
   b) Citing journal name
5. From JCR (Journal Citation Report), for each journal name (including citing journals):
   a) Impact factor
   b) Cited half-life
6. H-indices [17] per author, from PoP

The rating algorithm is applied when the paper is loaded and then on a periodic basis, to reflect changes such as new citations and changes in impact factors, in “Cited Half Life”, in JCR data and more. The rating algorithm takes into account the publication date of newly published papers to avoid less-than-fair ratings for such papers.

The scientific paper rating devised by NHECD is composed of a rating by journal impact factor and a rating by H-indices. These components are defined below.

1. Rating by journal impact factor

\[ \mathrm{Rating}_1(\mathrm{Article}(i)) = 1 - 2^{-0.6 \cdot \mathrm{CitationScore}_{\mathrm{Article}(i)}} \]

where

\[ \mathrm{CitationScore}_{\mathrm{Article}(i)} = \frac{\sum_{\mathrm{Article}(j) \in \mathrm{citations}(\mathrm{Article}(i))} \mathrm{Map}(\mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j))))}{\mathrm{Age}(\mathrm{Article}(i))} \]

and

\[ \mathrm{Map}(\mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j)))) = \begin{cases} 0.08 & 0 \le \mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j))) \le 1.296 \\ 0.4 & 1.297 \le \mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j))) \le 3.76 \\ 1 & 3.77 \le \mathrm{impact}(\mathrm{Journal}(\mathrm{Article}(j))) \le \infty \end{cases} \]

2. Rating by H-indices

\[ \mathrm{Rating}_2(\mathrm{Article}(i)) = 1 - 1.05^{-\mathrm{HScore}_{\mathrm{Article}(i)}} \]

where

\[ \mathrm{HScore}_{\mathrm{Article}(i)} = \frac{\sum_{\mathrm{Article}(j) \in \mathrm{citations}(\mathrm{Article}(i))} \operatorname{Average}_{\mathrm{Author}(k) \in \mathrm{Article}(j)} (\mathrm{H\text{-}Index}_k)}{\mathrm{Age}(\mathrm{Article}(i))} \]

3. Final rating

\[ \mathrm{Rating} = \frac{\alpha_1 \mathrm{Rating}_1 + \alpha_2 \mathrm{Rating}_2}{\alpha_1 + \alpha_2}, \qquad 0 \le \alpha_i \le 1 \]

[16] http://en.wikipedia.org/wiki/SICI
[17] http://en.wikipedia.org/wiki/Hirsch number

64.3.6 NHECD Frontend

NHECD provides a free-access website with information retrieval functionalities to facilitate searching the NHECD repository. It includes the following components:

1. An open source content management system implemented on Drupal, which stores and manages the entire frontend database (including user information and usage patterns).
2. The user interface component, which handles all the input or requests from the user.
The frontend interacts with the backend repository, which is stored and managed on Documentum. Figure 64.14 shows the architecture design of the NHECD frontend.

Fig. 64.14. Architecture.

User communities and characteristics. The NHECD frontend is designed to meet the different needs of three main communities and an additional group, the administrators:

1. Scientists: Users in this community are scientists from academia and industry, the most expert users among the three communities. These users should have extensive prior knowledge in the domain of nanotoxicology. The system assumes that these users are proficient in information searches.
2. Regulators: Users working for (or on behalf of) government institutes and regulatory agencies are part of the NHECD regulatory community. This community aims at providing legislation and regulation on the health, safety or environmental concerns regarding the use of nanoparticles. Usage patterns of this group often overlap with those of the other communities.
3. General public: This community is composed of individuals and NGOs who are active in a wide range of fields where information provided by NHECD may be relevant. We assume that most general-public users are not able to read or evaluate the scientific material NHECD provides. Therefore, for this community the frontend mainly provides answers to queries on general information, light reviews, or news on the impact of exposure to nanoparticles.
4. Administrator: The administrator is in charge of managing the daily operation of the system. Administrators are responsible for managing user accounts, general settings and monitoring.

The NHECD frontend provides the following features:

1. Basic search
2. Advanced search
3. Intelligent search
4. Taxonomic navigation
5. Recommender results (i.e., recommendations based on the analysis of usage patterns of other users)
6. Option to resubmit queries, adding criteria to refine the results
7. Site registration
8. Personalization features
9. Displaying a list of most viewed papers
10. Links to other Nanotox-related sites
11. NHECD news, updates and FAQs

64.4 Conclusions

NHECD provides two important products:

1. An extensive and commented repository of scientific papers and other publications in the Nanotox area, searchable using taxonomies and full text search. The scientific papers are rated according to published NHECD criteria, to help users better assess their findings. Such a repository significantly expands on currently available repositories because it goes beyond mapping existing research in Nanotox (as most current initiatives do): NHECD gives access to the results of the research papers, extracted from the sources using text mining algorithms. Access to scientific papers is granted to visitors subject to copyright and the restrictions imposed by publishers. This NHECD result is intended for Nanotox scientists, regulators and the general public.
2. A set of structured results extracted from the scientific papers populating the NHECD repository. Using these results it will be possible to perform data mining. Data mining will result in validated results and further knowledge discovery. This part of the NHECD results is targeted at Nanotox scientists and regulators.

Fig. 64.15. NHECD 2.0. (Diagram labels: rating, information extraction, table extraction, graph mining, annotation, metadata, taxonomies.)

64.5 Further research

Graph and table mining

NHECD relies on text mining algorithms, which allow information extraction from textual data. Scientific Nanotox papers (as in many other areas) often include other types of elements, such as graphs and tables.
Moreover, the expressiveness of these elements is generally higher than that of text. Hence, expanding NHECD to include graph and table mining seems desirable. Preliminary research on these subjects by the NHECD team shows that, at least for some types of graphs and tables, the task is feasible. The concept of the future NHECD (dubbed NHECD 2.0) is shown in Figure 64.15.
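As a worked illustration of the rating formulas in Section 64.3.5, the following Python sketch computes Rating 1, Rating 2 and the final rating for one article. It is not the NHECD implementation. The names CitingArticle, impact_to_weight, rating_by_impact, rating_by_h_index and final_rating are hypothetical, the age is assumed to be a positive number (the chapter does not state its unit), and impact values that fall in the small gaps between the published intervals (for example 1.2965) are assigned to the lower band.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CitingArticle:
    """One article that cites the paper being rated (hypothetical record)."""
    journal_impact: float               # impact factor of the citing journal
    author_h_indices: Sequence[float]   # one H-index per author


def impact_to_weight(impact: float) -> float:
    """The Map() step: translate an impact factor into a weight."""
    if impact < 0:
        raise ValueError("impact factor must be non-negative")
    if impact <= 1.296:
        return 0.08
    if impact <= 3.76:   # assumption: gap values join the lower band
        return 0.4
    return 1.0


def rating_by_impact(citations: Sequence[CitingArticle], age: float) -> float:
    """Rating 1 = 1 - 2^(-0.6 * CitationScore)."""
    if age <= 0:
        raise ValueError("age must be positive")
    score = sum(impact_to_weight(c.journal_impact) for c in citations) / age
    return 1.0 - 2.0 ** (-0.6 * score)


def rating_by_h_index(citations: Sequence[CitingArticle], age: float) -> float:
    """Rating 2 = 1 - 1.05^(-HScore), averaging H-indices per citing article."""
    if age <= 0:
        raise ValueError("age must be positive")
    averages = [
        sum(c.author_h_indices) / len(c.author_h_indices)
        for c in citations
        if c.author_h_indices  # skip citing articles with no author data
    ]
    return 1.0 - 1.05 ** (-(sum(averages) / age))


def final_rating(r1: float, r2: float,
                 alpha1: float = 1.0, alpha2: float = 1.0) -> float:
    """Weighted mean of the two ratings, with 0 <= alpha_i <= 1."""
    if not (0 <= alpha1 <= 1 and 0 <= alpha2 <= 1) or alpha1 + alpha2 == 0:
        raise ValueError("alphas must lie in [0, 1] and not both be zero")
    return (alpha1 * r1 + alpha2 * r2) / (alpha1 + alpha2)
```

Both component ratings start at 0 for an uncited paper and approach 1 as the score grows. Dividing by the age normalizes the accumulated citations, which is one way the formulas can avoid penalizing newly published papers.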
