Integrating Natural Language Processing and Web GIS for Interactive Knowledge Domain Visualization



INTEGRATING NATURAL LANGUAGE PROCESSING AND WEB GIS FOR INTERACTIVE KNOWLEDGE DOMAIN VISUALIZATION

A Thesis Presented to the Faculty of San Diego State University

In Partial Fulfillment of the Requirements for the Degree Master of Science in Geography with a Concentration in Geographic Information Science

by Fangming Du
Summer 2014

Copyright © 2014 by Fangming Du
All Rights Reserved

DEDICATION

To my parents and my family

ABSTRACT OF THE THESIS

Integrating Natural Language Processing and Web GIS for Interactive Knowledge Domain Visualization
by Fangming Du
Master of Science in Geography with a Concentration in Geographic Information Science
San Diego State University, 2014

Recent years have seen a powerful shift towards data-rich environments throughout society. This has extended to a change in how the artifacts and products of scientific knowledge production can be analyzed and understood. Bottom-up approaches are on the rise that combine access to huge amounts of academic publications with advanced computer graphics and data processing tools, including natural language processing. Knowledge domain visualization is one of those multi-technology approaches, with its aim of turning domain-specific human knowledge into highly visual representations in order to better understand the structure and evolution of domain knowledge. For example, network visualizations built from co-author relations contained in academic publications can provide insight into how scholars collaborate with each other in one or multiple domains, and visualizations built from the text content of articles can help us understand the topical structure of knowledge domains. These knowledge domain visualizations need to support interactive viewing and exploration by users. Such spatialization efforts are increasingly looking to geography and GIS as a source of metaphors and practical technology solutions, even when non-georeferenced information is managed, analyzed, and visualized. When it comes to deploying spatialized representations online, web mapping and web GIS can provide practical technology solutions for interactive viewing of knowledge domain visualizations, from panning and zooming to the overlay of additional information.

This thesis presents a novel combination of advanced natural language processing, in the form of topic modeling, with dimensionality reduction through self-organizing maps and the deployment of web mapping/GIS technology towards intuitive, GIS-like exploration of a knowledge domain visualization. A complete workflow is proposed and implemented that processes any corpus of input text documents into a map form and leverages a web application framework to let users explore knowledge domain maps interactively. This workflow is implemented and demonstrated for a data set of more than 66,000 conference abstracts.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER
INTRODUCTION
    Problem Statement
    Objectives and Intellectual Merit
LITERATURE REVIEW
    Knowledge Domain Visualization
    Web GIS
    Spatialization
    Topic Modeling
RESEARCH DESIGN
    Functionality Design
    Spatial Concepts
    Real World
    Semantic World
    From Concepts to Functionality
    Workflow Design
    Web GIS Application Design
IMPLEMENTATION
    Workflow
    Text Processing Workflow
    Data Preprocessing
    LDA Topic Modeling
    SOM Training and Clustering
    Programming Environment
    GIS Processing Workflow
    Integrating Workflow with Web GIS
    Web GIS Implementation Framework
    Web Inferencing Services
    Mapping and Geoprocessing Services
    Web User Interface
    Evaluation of Performance
CONCLUSION
    Results Summary
    Limitations and Future Studies
REFERENCES
A FILTERED STOP TOPICS

LIST OF TABLES

Table 1. Semantic Generalization (Fabrikant and Skupin 2005)
Table 2. Functions for Non-geographic Information Visualization in GIS
Table 3. Dataset Format
Table 4. LDA Topic Model Training Output Files
Table 5. Input and Output Data for Inferencing Services
Table 6. Filtered Out Stop Topics, Each Stop Topic Consists of Several Topic Phrases

LIST OF FIGURES

Figure 1. Google Maps technology deployed for knowledge domain visualization
Figure 2. Perplexity evaluations of different computational language models (Blei 2003)
Figure 3. Data processing workflow
Figure 4. Web GIS application framework
Figure 5. NLP services
Figure 6. XML Processing for PDF Format Data
Figure 7. XML Schema
Figure 8. Data Content Preprocessing
Figure 9. Perplexity Computation Using Mallet
Figure 10. Perplexity Graph for Our Model
Figure 11. Trained SOM represented as Shape File. Panels (a) and (b) show the SOM neurons as hexagons at zoom levels. Panels (c) and (d) contain renderings of component planes, i.e., the distribution of the weights of one particular attribute across the two-dimensional neuron geometry
Figure 12. GIS Processing Workflow
Figure 13. SOM Polygon Dissolve and Labeling. (a) represents the dissolved polygons from SOM neurons (Figure 11 (a)); (b) adds labels to cluster polygons
Figure 14. Data and Process Flow in Web GIS
Figure 15. Inferencing Services for Projection Functionality
Figure 16. User Interface Components
Figure 17. Projection as Point
Figure 18. Projection as Overlay Map Layer
Figure 19. Time Consumption in Inferencing Service with Three Test Groups of Data
Figure 20. Time Consumption in Geoprocessing Service

ACKNOWLEDGEMENTS

I would like to thank the members of my thesis committee, Dr. Skupin, Dr. Tsou, and Dr. Eckberg, for their help, support, and interest in my thesis work. In particular, I would like to thank Dr. Skupin for serving as my major professor and graduate advisor. This thesis would not have been possible without his great amount of support. I am very grateful to him for giving me invaluable advice and continuous guidance on my research project. In addition I would like to acknowledge
and thank David Mckinsey and Marcus Chiu for their great technical support. I would also like to thank Raymond Lee for his useful advice and help in my thesis writing. I extend my gratitude to my colleagues and friends Jay Yang, Shuang Yang, and Marilyn Stowell for their support. Finally, last but not least, I would like to address a special thank you to my parents for the love and care they have continually given me all these years.

... the text documents, which here is akin to obtaining geographic locations in the form of latitude and longitude. Then, the SOM performs a transformation from the high-dimensional space to a two-dimensional display space, akin to how cartographic map projections perform a transformation from the latitude/longitude space into a Cartesian display space. By analogy, spatial concepts and techniques can be transferred to the high-dimensional case (Table 2), though important differences need to be taken into account in terms of the specific computations. For example, in analogy to the projection concept extensively used in geography and GIS (measured lat/lon projected into x/y), this thesis demonstrates how the location of an input text in the high-dimensional topic space can be measured (i.e., inferred) and then projected (through similarity computation) into the two-dimensional space as a single point by matching the new text to its closest SOM neuron. In addition, it is possible to project the user-supplied input text from the topical space to a two-dimensional surface, which enables detecting more nuanced patterns. Figures 17 and 18 show the results for the projection as a point or a surface. When projected as a point, the input text (abstract of Skupin (2013)) is located in the region labeled "map", indicating the main topical focus of the input artifact. When projected as a surface, the "map" region is still dominant, but its reach is extended to include "visualization" and "GIS" and regions of secondary fit are highlighted
elsewhere in the map. It appears that projection of input text as either point or surface can serve different, yet complementary goals in exploratory visual analysis of knowledge domain visualizations.

LIMITATIONS AND FUTURE STUDIES

The creation of the knowledge domain maps could be improved by a better way to identify stop topics and phrases. This research focused mainly on the integration of LDA topic modeling, SOM, and GIS into a comprehensive workflow. This workflow was demonstrated to be able to generate a high-resolution base map of the geography domain, derived from abstracts of twenty years of AAG annual meetings. However, stop topics and phrases were identified manually by Dr. Skupin, as an expert from the knowledge domain of geography. Certain stop topics would apply to any academic corpus, but other, geography-specific stop topics could hardly be applied to other knowledge domains. For future improvements, other methods should therefore be considered for removing domain-specific stop topics. As a recommendation for future studies, based on our observations, the component planes corresponding to stop topics seem to exhibit a lack of spatial autocorrelation in the two-dimensional SOM space, which possibly points towards automated detection of stop topics via spatial autocorrelation algorithms that are yet to be identified.

Concerning the performance of the web application, the inferencing services can scale up to very large datasets without any trade-off in performance, but the geoprocessing service can take a long time if the dimensions of the SOM are large. One solution would be to store neuron polygon features in the database and have the geoprocessing service read from it, instead of generating polygon features only in response to user input.

As far as analytical functionality is concerned, Table 2 lays out several additional functions, all of which could be accomplished within the web application framework. Within that
same framework, individual functions could also be combined, such as in the overlay of multiple input texts. Finally, this research has potential implications for many different application areas. One application would be to project different scholars or institutions onto the map to study their relationships and gain insight into what topics they study or how they collaborate with each other. The workflow can of course be applied to any knowledge domain, as well as to domain-independent document collections.

REFERENCES

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
Börner, K., Chen, C., & Boyack, K. W. (2003). Visualizing knowledge domains. Annual Review of Information Science and Technology, 37(1), 179-255.
Börner, K. (2010). Atlas of science. MIT Press.
Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351-374.
Card, S. K., Mackinlay, J. D., & Shneiderman, B. (Eds.) (1999). Readings in information visualization: using vision to think. Morgan Kaufmann.
Dent, B. D. (1999). Cartography: thematic map design. WCB/McGraw-Hill.
DiBiase, D., DeMers, M., Johnson, A., Kemp, K., Luck, A. T., Plewe, B., & Wentz, E. (2006). Geographic information science and technology body of knowledge. Association of American Geographers.
Fabrikant, S. I. (2000). Spatialized browsing in large data archives. Transactions in GIS, 4(1), 65-78.
Fabrikant S. (2001). Visualizing region and scale in information spaces. In Proceedings of the 20th International Cartographic Conference, August 6-10, 2001, Beijing, China. 2522-9.
Fabrikant S., and Skupin A. (2005). Cognitively plausible information visualization. In Dykes J., MacEachren A., and Kraak MJ. (Eds.) Exploring Geovisualization (pp. 667-690). Amsterdam: Elsevier.
Garrett J. (2011). The elements of user experience: user-centered design for the web and beyond. Berkeley: New Riders.
Gartner G., Bennett D., Morita T. (2007). Towards Ubiquitous Cartography. Cartography and Geographic Information Science 34(4): 247-257.
Golledge, R. G. (1995). Primitives of spatial knowledge. In Cognitive aspects of human-computer interaction for geographic information systems (pp. 29-44). Springer Netherlands.
Goodchild MF. (2007). Citizens as sensors: the world of volunteered geography. GeoJournal 69(4): 211-221.
Green D., and Bossomaier T. (2002). Online GIS and spatial metadata. London; New York: Taylor & Francis.
Haklay M., & Tobon C. (2003). Usability evaluation and PPGIS: towards a user-centered design approach. International Journal of Geographical Information Science 17(6): 577-592.
Haklay M., Singleton A., & Parker C. (2008). Web Mapping 2.0: The Neogeography of the GeoWeb. Geography Compass 2(6): 2011-2039.
Haklay M., & Weber P. (2008). OpenStreetMap: User-Generated Street Maps. Pervasive Computing, IEEE 7(4): 12-18.
Hewett T., Baecker R., Card S., Carey T., Gasen J., Mantei M., Perlman G., Strong G., & Verplank W. (1997). ACM SIGCHI Curricula for Human-Computer Interaction. http://old.sigchi.org/cdg/, Accessed April 29, 2012.
Ivory M., & Hearst M. (2001). The State of the Art in Automating Usability Evaluation of User Interfaces. ACM Computing Surveys 33(4): 470-516.
Janelle, D. G., & Goodchild, M. F. (2011). Concepts, principles, tools, and challenges in spatially integrated social science. The SAGE handbook of GIS and society. Thousand Oaks, CA: SAGE, 27-45.
Kraak MJ., and Brown A. (Eds.) (2001). Web Cartography: Developments and Prospects. London; New York: Taylor & Francis.
Kuhn W. and Blumenthal B. (1996). Spatialization: Spatial Metaphors for User Interfaces. Vienna, Austria: Technical University of Vienna.
MacEachren A., Gahegan M., Pike W., Brewer I., Cai G., Lengerich E., & Hardisty F. (2004). Geovisualization for knowledge construction and decision support. Computer Graphics and Applications, IEEE 24(1): 13-17.
Mark D., & Gould M. (1991). Interacting with geographical information: A commentary. Photogrammetric Engineering and Remote Sensing 57(11): 1427-1430.
McCallum, A. (2002). Mallet: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu
McNurlin B., Sprague R., & Bui T. (2008). Information Systems Management in Practice. Englewood Cliffs: Prentice-Hall.
Nielsen J. (1993). Usability Engineering. San Diego: Academic Press.
Nivala A-M., Sarjakoski L., & Sarjakoski T. (2005). User-Centered Design and Development of a Mobile Map Service. In the Proceedings of Scandinavian Research Conference on GIScience (ScanGIS). 109-123.
Price DJ. (1965). Networks of Scientific Papers. Science 149(3683): 510-515.
Robinson A. et al. (1995). Elements of Cartography. New York: Wiley.
Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley.
Skupin A. and Buttenfield B. P. (1996). Spatial metaphors for visualizing very large data archives. In Proceedings of GIS/LIS 96, Denver, Colorado: 607-17.
Skupin A. and Buttenfield B. P. (1997). Spatial metaphors for visualizing information spaces. In Proceedings of the ACSM/ASPRS Annual Convention and Exhibition, Seattle, Washington: 116-25.
Skupin A. (2000). From Metaphor to Method: Cartographic Perspectives on Information Visualization. In: Roth S.F., and Keim D.A. (Eds.) Proceedings IEEE Symposium on Information Visualization (InfoVis 2000), 9-10 October, Salt Lake City, Utah. 91-97. Los Alamitos: IEEE Computer Society.
Skupin A. (2002). A Cartographic Approach to Visualizing Conference Abstracts. IEEE Computer Graphics and Applications 22(1): 50-58.
Skupin A., and Fabrikant S. (2008). Spatialization. In Wilson J., and Fotheringham S. (Eds.) The Handbook of Geographic Information Science. Blackwell Publishing. 61-79.
Skupin A., Biberstine J. R., and Börner, K. (2013). Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach. PLOS ONE, 8(3), e58779.
Shiffrin R., and Börner K. (2004). Mapping Knowledge Domains. Proceedings of the National Academy of Sciences of the United States of America 101(suppl 1): 5183-5185.
Smith T., and Frew J. (1995). Alexandria Digital Library. Communications of the ACM 38(4): 61-62.
Spence R. (2001). Information Visualization. Harlow, England; New York: Addison Wesley.
Spence R. (2007). Information visualization: design for interaction. Harlow, England; New York: Addison Wesley.
Tsou MH. (2004). Integrating web-based GIS and image processing tools for environmental monitoring and natural resource management. Journal of Geographical Systems 6:120.
Tsou MH., & Curran J. (2008). User-Centered Design Approaches for Web Mapping Applications: A Case Study with USGS Hydrological Data in the United States. In Peterson M. (Ed.) International Perspectives on Maps and the Internet (pp. 209-320). Berlin; New York: Springer.
Tsou MH. (2011). Revisiting Web Cartography in the United States: the Rise of User-Centered Design. Cartography and Geographic Information Science 38(3): 250-257.

APPENDIX A

FILTERED STOP TOPICS

Table 6. Filtered Out Stop Topics, Each Stop Topic Consists of Several Topic Phrases

Topic phrase              Weight       Count
Long Term                 0.009230769  15
academic literature       0.012884044  26
addres issue              0.00960334   23
addres question           0.007515658  18
address issue             0.009487666  20
address question          0.016840417  21
adequately address        0.004743833  10
adequately addressed      0.005218216  11
alternative solution      0.0056926    12
ambitious project         0.00990099   11
answer question           0.018371608  44
answered question         0.005613472
attention paid            0.012878788  34
basic question            0.006682868  11
billion dollar            0.006446991  18
briefly outline           0.005613472
briefly review            0.008821171  11
broad array               0.005447942
broad range               0.041162228  68
broad spectrum            0.007869249  13
broader issue             0.005613472
case study                0.504714863  2248
central role              0.011742424  31
challenge encountered     0.005927682  10
challenge faced           0.018026565  38
challenge facing          0.031783681  67
challenge posed           0.015654649  33
common set                0.004237288
common type               0.004842615
concerted effort          0.01620162   18
conducting research       0.009624198  21
considerable attention    0.015361744  31
constraint imposed        0.004149378
conventional wisdom       0.020048603  33
cost effective            0.009484292  16
critical attention        0.007433102  15
critical issue            0.012108559  29
critical role             0.021969697  58
crucial role              0.029166667  77
current debate            0.011691023  28
current literature        0.006442022  13
current research          0.014207149  31
demonstration project     0.00720072
difficult task            0.005927682  10
difficulty encountered    0.005218216  11
difficulty faced          0.006641366  14
dissertation research     0.022456462  49
distinct type             0.005447942
diverse array             0.005447942
diverse range             0.00968523   16
diverse set               0.006053269  10
doctoral research         0.010082493  22
draw attention            0.009415263  19
effective strategy        0.004149378
evidence suggests         0.014769231  24
extremely difficult       0.007113219  12
face challenge            0.005927682  10
final goal                0.00720072
final section             0.01170117   13
focused attention         0.006937562  14
forty percent             0.007163324  20
frequently cited          0.004307692
full range                0.013317191  22
fully understood          0.004923077
fundamental question      0.011273486  27
funded project            0.0083022    20
generally accepted        0.005538462
generally assumed         0.004307692
generally understood      0.003692308
good deal                 0.004860267
great deal                0.016403402  27
heated debate             0.012526096  30
high percentage           0.007163324  20
high proportion           0.006446991  18
high rate                 0.025429799  71
higher proportion         0.007879656  22
higher rate               0.013252149  37
highest rate              0.016833811  47
hold true                 0.005467801
important aspect          0.009848485  26
important contribution    0.00719697   19
important implication     0.00719697   19
important question        0.01042502   13
important role            0.163636364  432
improvement project       0.0090009    10
integral role             0.009090909  24
interesting question      0.005467801
intractable problem       0.004149378
issue addressed           0.004743833  10
issue faced               0.006641366  14
issue facing              0.00711575   15
issue raised              0.017118998  41
issue related             0.006442022  13
issue relating            0.007217322
issue surrounding         0.011273486  27
key issue                 0.010020877  24
key question              0.007515658  18
key role                  0.043560606  115
lack thereof              0.009013283  19
large number              0.010028653  28
larger project            0.00630063
lesson learned            0.04680468   52
literature review         0.010458063  50
long lasting              0.007384615  12
long recognized           0.008615385  14
long run                  0.006153846  10
long standing             0.019692308  32
long term                 0.316923077  515
long time                 0.006153846  10
longer term               0.030769231  50
million acre              0.017191977  48
ninety percent            0.008237822  23
ongoing project           0.01577418   38
ongoing research          0.01784973   43
ongoing research project  0.00664176   16
overarching goal          0.00630063
paper address             0.011691023  28
paper address             0.013386321  64
paper aims                0.211144698  15437
paper analyzes            0.021752771  104
paper argues              0.045597155  218
paper asks                0.008019246  10
paper assesses            0.008784773  42
paper attempt             0.006442022  13
paper begins              0.020706965  99
paper briefly             0.005613472
paper concludes           0.084919473  406
paper considers           0.035975737  172
paper describes           0.025517674  122
paper discusses           0.083455344  399
paper draws               0.010248902  49
paper examines            0.135954821  650
paper explores            0.096632504  462
paper focuses             0.039113156  187
paper highlights          0.008993934  43
paper identifies          0.009412257  45
paper investigates        0.011397423  23
paper outline             0.009621418  46
paper presents            0.045597155  218
paper question            0.005613472
paper report              0.00720072
paper review              0.020497804  98
paper seeks               0.021803766  44
paper shows               0.211144698  15437
paper suggests            0.011503869  55
paying attention          0.014370664  29
pertinent question        0.004811548
pivotal role              0.023106061  61
play important role       0.010227273  27
posed question            0.005613472
potential benefit         0.006520451  11
potential problem         0.004742146
potential solution        0.008298755  14
practical solution        0.007113219  12
preliminary finding       0.007184553  32
preliminary result        0.012123934  54
presentation address      0.008821171  11
presentation describes    0.00810081
previous research         0.013920072  62
previous study            0.028064661  125
problem arise             0.006641366  14
problem arising           0.004149378
problem confronting       0.005927682  10
problem encountered       0.016129032  34
problem faced             0.012333966  26
problem facing            0.023244782  49
problem inherent          0.006166983  13
problem solving           0.040796964  86
project aimed             0.0180018    20
project aims              0.01203819   29
project attempt           0.00630063
project design            0.00788709   19
project goal              0.01170117   13
project include           0.00810081
project involved          0.00720072
project involves          0.00788709   19
project involving         0.0090009    10
project seeks             0.02340234   26
prominent role            0.009469697  25
proposed project          0.01170117   13
proven difficult          0.004149378
put forward               0.020656136  34
question addressed        0.020041754  48
question arise            0.007933194  19
question asked            0.006075334  10
question posed            0.005613472
question raised           0.013361169  32
question remains          0.004860267
raise question            0.032985386  79
raised question           0.007933194  19
raises concern            0.005613472
raises issue              0.006415397
raises question           0.052192067  125
raising question          0.005613472
recent debate             0.006442022  13
recent literature         0.014866204  30
recent research           0.016352825  33
remain largely            0.004923077
remain unanswered         0.003692308
remains unclear           0.005538462
research agenda           0.011397423  23
research aims             0.00871731   21
research analyzes         0.011457379  25
research conducted        0.007184553  32
research design           0.01577418   38
research examines         0.069660862  152
research explores         0.039871677  87
research field            0.009624198  21
research finding          0.021081577  46
research focuses          0.00954753   23
research investigates     0.050870761  111
research methodology      0.01512374   33
research objective        0.00664176   16
research project          0.165628892  399
research question         0.009415263  19
research seeks            0.022456462  49
research study            0.005837449  26
research team             0.01784973   43
research topic            0.016498625  36
research utilizes         0.010082493  22
role played               0.064015152  169
scant attention           0.010406343  21
scholarly attention       0.009415263  19
shed light                0.011273486  27
shed light                0.055681818  147
sheds light               0.020075758  53
short term                0.057230769  93
short time                0.003692308
significant role          0.029166667  77
solve problem             0.011385199  24
solving problem           0.004149378
special attention         0.021212121  56
specific issue            0.007217322
specific question         0.008019246  10
specific reference        0.004811548
specific type             0.012106538  20
square kilometer          0.01504298   42
square mile               0.00752149   21
starting point            0.010328068  17
stated goal               0.01530153   17
study aims                0.01077683   48
study analyzes            0.011457379  25
study conducted           0.008756174  39
study demonstrates        0.007633588  34
study employs             0.006511001  29
study examines            0.046746104  102
study explores            0.028414299  62
study focuses             0.011674899  52
study investigates        0.014593624  65
study presents            0.005163898  23
study seeks               0.014207149  31
study utilizes            0.005388415  24
ten percent               0.006446991  18
ten year                  0.010028653  28
total amount              0.006446991  18
total number              0.011103152  31
ultimate goal             0.0124533    30
unanswered question       0.004860267
unique set                0.007869249  13
unrelated variety         0.004842615
varying degree            0.012711864  21
viable alternative        0.007705987  13
viable option             0.009484292  16
viable solution           0.004742146
vice versa                0.034237996  82
vital role                0.01780303   47
wide array                0.034503632  57
wide range                0.260895884  431
wide ranging              0.006658596  11
wide spectrum             0.009079903  15
wide variety              0.133171913  220
widely accepted           0.013973269  23
widely recognized         0.004307692
widely regarded           0.004252734
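
The concluding chapter observes that component planes of stop topics such as those listed above seem to lack spatial autocorrelation in the two-dimensional SOM space, and leaves the choice of algorithm open. One candidate, offered here only as a rough sketch and not as the thesis's method, is global Moran's I. The function name `morans_i` is hypothetical, and the sketch assumes a rectangular grid with rook (4-neighbour) adjacency and binary weights, whereas the SOM in the thesis uses hexagonal neurons, so the neighbourhood definition would need adapting.

```python
import numpy as np


def morans_i(plane):
    """Global Moran's I of a 2-D grid of values (hypothetical helper).

    Assumes rook adjacency: each cell neighbours the cells directly
    left, right, above, and below it, with binary weights.
    """
    z = np.asarray(plane, dtype=float)
    z = z - z.mean()                      # deviations from the mean
    denom = (z ** 2).sum()
    if denom == 0:
        return 0.0                        # constant plane: no variation to correlate
    # Sum of z_i * z_j over ordered neighbour pairs; every horizontal
    # and vertical adjacency is counted once in each direction.
    cross = 2.0 * ((z[:, :-1] * z[:, 1:]).sum() + (z[:-1, :] * z[1:, :]).sum())
    rows, cols = z.shape
    weight_sum = 2.0 * (rows * (cols - 1) + (rows - 1) * cols)
    return (z.size / weight_sum) * (cross / denom)


# Toy component planes (made-up data, not from the thesis).
rng = np.random.default_rng(0)
smooth_plane = np.add.outer(np.arange(20), np.arange(20))  # gradual gradient
noisy_plane = rng.random((20, 20))                         # no spatial structure

print("smooth:", morans_i(smooth_plane))
print("noisy: ", morans_i(noisy_plane))
```

Values near 1 indicate that neighbouring neurons carry similar weights, while values near 0 indicate no spatial pattern. Under the thesis's observation, a topic whose component plane scores close to 0 would be a candidate stop topic, though any cut-off would have to be chosen empirically.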

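The point and surface projections described in the concluding chapter can be illustrated with a small sketch. It assumes that the topic vector of the input text has already been inferred (the thesis uses Mallet for this step, which is not shown), that the trained SOM is available as an array of shape (rows, cols, n_topics), and that cosine similarity is the similarity measure; the thesis only speaks of "similarity computation", so that choice and the function names `cosine_surface` and `project_as_point` are assumptions made for illustration.

```python
import numpy as np


def cosine_surface(som_weights, topic_vector):
    """Cosine similarity between one topic vector and every SOM neuron.

    som_weights: array of shape (rows, cols, n_topics)
    topic_vector: array of shape (n_topics,)
    Returns an array of shape (rows, cols): the "surface" projection.
    """
    w = np.asarray(som_weights, dtype=float)
    v = np.asarray(topic_vector, dtype=float)
    dots = w @ v                                         # (rows, cols)
    norms = np.linalg.norm(w, axis=2) * np.linalg.norm(v)
    # Neurons (or inputs) with zero norm get similarity 0.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


def project_as_point(som_weights, topic_vector):
    """Grid position (row, col) of the most similar neuron: the "point" projection."""
    surface = cosine_surface(som_weights, topic_vector)
    row, col = np.unravel_index(np.argmax(surface), surface.shape)
    return int(row), int(col)


# Toy SOM with made-up weights: 4 x 6 neurons, 5 topics each.
rng = np.random.default_rng(42)
som = rng.dirichlet(np.ones(5), size=(4, 6))
new_text_topics = som[2, 3]          # pretend the new text matches neuron (2, 3)

print(project_as_point(som, new_text_topics))
print(cosine_surface(som, new_text_topics).round(2))
```

The point projection keeps only the single best match, while the surface keeps a similarity value for every neuron, which is what allows regions of secondary fit to show up when the surface is rendered as an overlay layer.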
Date posted: 29/04/2017, 11:20
