Big Data Computing
Edited by Rajendra Akerkar, Western Norway Research Institute, Sogndal, Norway
CRC Press, Taylor & Francis Group, Boca Raton, FL
ISBN-13: 978-1-4665-7838-8 (eBook - PDF)

To all the visionary minds who have helped create a modern data science profession.

Contents

Preface
Editor
Contributors

Section I: Introduction
1. Toward Evolving Knowledge Ecosystems for Big Data Understanding (Vadim Ermolayev, Rajendra Akerkar, Vagan Terziyan, and Michael Cochez)
2. Taxonomy and Review of Big Data Solutions Navigation (Pierfrancesco Bellini, Mariano di Claudio, Paolo Nesi, and Nadia Rauch)
3. Big Data: Challenges and Opportunities (Roberto V. Zicari)

Section II: Semantic Technologies and Big Data
4. Management of Big Semantic Data (Javier D. Fernández, Mario Arias, Miguel A. Martínez-Prieto, and Claudio Gutiérrez)
5. Linked Data in Enterprise Integration (Sören Auer, Axel-Cyrille Ngonga Ngomo, Philipp Frischmuth, and Jakub Klimek)
6. Scalable End-User Access to Big Data (Martin Giese, Diego Calvanese, Peter Haase, Ian Horrocks, Yannis Ioannidis, Herald Kllapi, Manolis Koubarakis, Maurizio Lenzerini, Ralf Möller, Mariano Rodriguez Muro, Özgür Özçep, Riccardo Rosati, Rudolf Schlatte, Michael Schmidt, Ahmet Soylu, and Arild Waaler)
7. Semantic Data Interoperability: The Key Problem of Big Data (Hele-Mai Haav and Peep Küngas)

Section III: Big Data Processing
8. Big Data Exploration (Stratos Idreos)
9. Big Data Processing with MapReduce (Jordà Polo)
10. Efficient Processing of Stream Data over Persistent Data (M. Asif Naeem, Gillian Dobbie, and Gerald Weber)

Section IV: Big Data and Business
11. Economics of Big Data: A Value Perspective on State of the Art and Future Trends (Tassilo Pellegrini)
12. Advanced Data Analytics for Business (Rajendra Akerkar)

Section V: Big Data Applications
13. Big Social Data Analysis (Erik Cambria, Dheeraj Rajagopal, Daniel Olsher, and Dipankar Das)
14. Real-Time Big Data Processing for Domain Experts: An Application to Smart Buildings (Dario Bonino, Fulvio Corno, and Luigi De Russis)
15. Big Data Application: Analyzing Real-Time Electric Meter Data (Mikhail Simonov, Giuseppe Caragnano, Lorenzo Mossucca, Pietro Ruiu, and Olivier Terzo)
16. Scaling of Geographic Space from the Perspective of City and Field Blocks and Using Volunteered Geographic Information (Bin Jiang and Xintao Liu)
17. Big Textual Data Analytics and Knowledge Management (Marcus Spies and Monika Jungemann-Dorner)

Index

Preface

In the international marketplace, businesses, suppliers, and customers create and consume vast amounts of information. Gartner* predicts that enterprise data in all forms will grow by up to 650% over the next five years. According to IDC,† the world's volume of data doubles every 18 months, and according to the MIT Centre for Digital Research, digital information is doubling every 1.5 years and will exceed 1000 exabytes next year. In 2011, medical centers held almost a billion terabytes of data, that is, almost 2000 billion file cabinets' worth of information. This deluge of data, often referred to as Big Data, obviously creates a challenge for the business community and for data scientists.

The term Big Data refers to data sets whose size is beyond the capabilities of current database technology. It is an emerging field in which innovative technology offers alternatives for resolving the inherent problems that appear when working with massive data, offering new ways to reuse and extract value from information.

Businesses and government agencies aggregate data from numerous private and/or public data sources. Private data is information that an organization stores exclusively and that is available only to that organization, such as employee data, customer data, and machine data (e.g., user transactions and customer behavior). Public data is information that is available to the public for a fee or at no charge, such as credit ratings and social media content (e.g., LinkedIn, Facebook, and Twitter).

Big Data has now reached every sector of the world economy. It is transforming competitive opportunities in every industry, including banking, healthcare, insurance, manufacturing, retail, wholesale, transportation, communications, construction, education, and utilities. It also plays key roles in trade operations such as marketing, operations, supply chain, and new business models. It is becoming evident that enterprises that fail to use their data efficiently are at a large competitive disadvantage compared with those that can analyze and act on their data. The possibilities of Big Data continue to evolve swiftly, driven by innovation in the underlying technologies, platforms, and analytical capabilities for handling data, as well as by the evolving behavior of users, who increasingly live digital lives.

Big Data also differs from conventional data models (e.g., relational databases and data models, or conventional governance models). Thus, it is triggering organizations' concern as they try to separate information nuggets from the data heap.
The conventional models of structured, engineered data do not adequately reveal the realities of Big Data.

* http://www.gartner.com/it/content/1258400/1258425/january_6_techtrends_rpaquet.pdf
† http://www.idc.com/

Big Textual Data Analytics and Knowledge Management

Analyses of large corpora can be based on document spaces, for example, for following historical trends, while analyses on the term spaces can uncover shifts in terminology. A separation between topic shift and terminology shift is impossible in LDA; thus, LDA cannot even identify topics across several analyses, which is readily possible with LSA.

Knowledge Life Cycles

In order to examine the impact of unstructured information processing on knowledge management, we introduce the key concepts of a specific approach to knowledge management by Firestone (2003). While many approaches are available, Firestone's work is most closely related to enterprise architecture and enterprise information systems infrastructure. According to this approach, knowledge management is the management of enterprise knowledge life cycles. These knowledge life cycles are composed of two essential phases, namely knowledge production and knowledge dissemination. An overview of Firestone's model in the Object Management Group's system modeling language (Object Management Group, 2006) is given in Figure 17.2.

Knowledge production comprises all activities related to knowledge generation on an individual, team, or organizational scale. It is a problem-solving activity that implies learning or information acquisition activities. Information acquisition consists of integrating information from various sources and applying the results of this research to a specific problem at hand in the daily work of a knowledge worker. A convenient example of information acquisition is the application of business intelligence methods to data of relevance to the problem to be solved, for example, market basket data for a decision prioritization problem related to product life-cycle management.

Figure 17.2: A SysML model of enterprise knowledge life cycles. For details, see the text.

Learning, in this framework, is a procedure for generating and testing hypotheses relevant to the problem to be solved. Learning comprises dedicated empirical and even foundational research. Knowledge production by learning is closely related to innovation.

The key task of knowledge production in the overall knowledge life cycle is producing codified knowledge claims. A knowledge claim is not just a hypothesis; rather, it is a specific approach consisting of elements of knowledge supposed to solve the problem that kicked off a particular knowledge life-cycle instance. As an example of a knowledge claim resulting from business intelligence, take a set of association rules, derived from mining suitable customer data, that suggest a way to address emerging demands with a new product.

Codified knowledge claims are passed to a validation process that leads either to the adoption or to the rejection of the claim. An adoption is not merely a confirmation of a hypothesis; rather, it means that the knowledge produced is going to be activated in the business transaction space of the organization in question. It means that the validated knowledge claim is now a knowledge asset, part of the organization's intellectual capital. Activation of a knowledge asset presumes a set of processes that allows one to integrate the related knowledge elements into specific business activities across an organization.
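As a compact summary, the production phase just described, from problem statement to codified knowledge claim to validation and activation as a knowledge asset, can be rendered as a small state machine. The following Python sketch is our own illustration of Firestone's life-cycle concepts, not code from the MUSING project; all class, state, and method names are assumptions:

    from __future__ import annotations

    from dataclasses import dataclass, field
    from enum import Enum, auto

    class ClaimState(Enum):
        FORMULATED = auto()  # codified during knowledge production
        VALIDATED = auto()   # adopted: activated as a knowledge asset
        REJECTED = auto()    # validation failed

    @dataclass
    class KnowledgeClaim:
        problem: str   # the problem that kicked off the life-cycle instance
        content: str   # e.g., a set of association rules
        state: ClaimState = ClaimState.FORMULATED

        def validate(self, adopted: bool) -> None:
            # Adoption means activation in the business transaction space,
            # not merely confirmation of a hypothesis.
            self.state = ClaimState.VALIDATED if adopted else ClaimState.REJECTED

    @dataclass
    class KnowledgeLifeCycle:
        intellectual_capital: list[KnowledgeClaim] = field(default_factory=list)

        def produce(self, problem: str, content: str) -> KnowledgeClaim:
            # Knowledge production: learning or information acquisition
            # yields a codified knowledge claim.
            return KnowledgeClaim(problem, content)

        def activate(self, claim: KnowledgeClaim) -> None:
            # Only validated claims become part of the intellectual capital.
            if claim.state is ClaimState.VALIDATED:
                self.intellectual_capital.append(claim)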
Firestone refers to these integration processes as knowledge integration. Knowledge integration comprises all activities related to maintaining and fostering the usage of knowledge that a company or an organization has acquired through knowledge production. Joseph Firestone distinguishes four key processes of knowledge integration, namely broadcasting, searching or retrieving, teaching, and knowledge sharing. The distinctive feature of knowledge sharing compared to broadcasting is that it addresses a limited set of recipients (e.g., a community of practice), while broadcasting is unaware of the number and composition of recipients. Broadcasting and knowledge sharing can be subsumed under the concept of knowledge dissemination. Teaching, in this framework, is knowledge sharing with a control loop in which the degree of success of knowledge transfer is monitored and used for adapting the content and even the method of sharing elements of a knowledge domain.

Searching and retrieving cover knowledge integration from the demand point of view. These processes are closely related to intelligent search engines building on information retrieval technologies. Searching and retrieving address elements of knowledge that are available without having been supplied for a specific knowledge integration process, for example, government documents, reports, and public-domain research reports. Searching and retrieving are also keys to leveraging knowledge captured in organizational document and content repositories.

All knowledge integration activities are focused on, but not confined to, the knowledge elements from validated knowledge claims. As an example, consider the adoption of a new go-to-market strategy for a product as a result of a knowledge production process. Implementing the new strategy usually affects several existing processes, might require adaptations of roles, etc. The knowledge needed for these implementation specifics is accessed and combined with the newly validated knowledge in knowledge integration.

It should be noted that since the rapid increase of social web functionalities, including massive content posting and social bookmarking, knowledge broadcasting has become a widespread everyday activity, even if sometimes on a rather superficial level, contrary to the knowledge management discussions of the 1990s that focused on incentives for fostering knowledge sharing within companies. In addition, Wikipedia and related information resources have considerably reshaped the teaching and the searching/retrieving processes related to knowledge integration. Nevertheless, all four knowledge integration subprocesses take a specific shape when used in a business context of problem solving in the knowledge life cycle. Quality, effectiveness, and reliability of these processes are key, while the social web realizations of these processes are characterized by spontaneity and large degrees of error tolerance.

Impact of Unstructured Information Processing on Knowledge Management

How can unstructured information processing contribute to knowledge life-cycle processes and their specific implementations?
Considering first the essential phase of knowledge production, contributions to information acquisition and even to learning processes are possible either through information extraction on the relational level or through text clustering that identifies topics or subjects of interest and estimates their proximity and interrelationships.

Information extraction on the relational level is equivalent to formulating meaningful sentences in a given domain related to a specific problem. From a logical point of view, these sentences are individual statements (e.g., "IBM has developed a new software for credit card fraud detection"). Collections of individual statements support the accumulation of evidence in favor of or against a given hypothesis.

Text clustering is a means of identifying topics of interest for knowledge claims. Proximity relationships as revealed by clustering can help in formulating specific hypotheses relevant to the knowledge life-cycle instance at hand. As an example, consider the problem of preventing high losses due to system failures at a provider of managed IT operations. Typically, the provider has document collections describing system failures as well as customer claims data. Through clustering, it is possible to derive significant groupings of system log entries and failure descriptions, on the one hand, and significant groupings of customer claims, on the other. Through data merging, it is further possible to use the results from both cluster analyses to derive plausible causation hypotheses for a customer claim given a system failure. A related method of knowledge production has been used in the EU MUSING project for IT operational risk analysis in the private customers division of a major Italian financial services provider.

Regarding the second essential phase of the knowledge life cycle, knowledge integration, the assignment of unstructured information to distinct information categories by text classification or entity recognition is a major supporting technology for searching and retrieving processes, since it allows the integration of much more comprehensive sets of knowledge resources than would be possible with fixed index terms and predefined document categories. In fact, the first generation of knowledge document repositories in the 1990s was entirely based on author- or editor-provided document classification terms and search terms, under which important facets of documents often went unnoticed. This led to the often-described underutilization of such repositories and, finally, to their abandonment in favor of more dynamic, socially supported wikis, etc.

Processing unstructured information is becoming of key importance to the further knowledge integration processes distinguished by Firestone. For teaching, the most promising applications of unstructured information processing lie in the capabilities of content syndication. Content syndication denotes the capability to dynamically reorganize textual material with respect to the specific interests of user communities or enterprise teams. Finally, with respect to knowledge dissemination, most of the functionalities supplied by unstructured information processing can provide significant enhancements to today's knowledge-sharing facilities. As an example, in recent years keyword spotting has become part of many blogosphere platforms, often visualized by word clouds (e.g., see wordle.net). Sophisticated algorithms for information extraction will allow combining information from different sources in a still more efficient way for information consumers and, later, for information producers.
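To make the clustering step described above concrete, the following minimal sketch groups short failure descriptions and customer claims by TF-IDF similarity. It is a toy illustration using scikit-learn, not the method actually employed in the MUSING risk-analysis pilot; the sample texts and the choice of two clusters are assumptions:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy stand-ins for system failure descriptions and customer claims.
    documents = [
        "database connection timeout during nightly batch run",
        "batch job aborted after database timeout",
        "invoice shows duplicate charge for premium service",
        "customer billed twice for the same premium subscription",
    ]

    # Map each text to a TF-IDF vector, then cluster the vectors.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Documents in the same cluster are candidates for a common topic,
    # e.g., a failure grouping that may explain a grouping of claims.
    for doc, label in zip(documents, kmeans.labels_):
        print(label, doc)

In a real setting, the two document collections would be clustered separately and the resulting groupings merged with operational data to form causation hypotheses, as described above.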
Enterprise Monitoring: A Case Study

In this part, we describe a pilot application that exploits information extraction from web pages to improve knowledge life cycles for credit information providers and other information service providers with regard to business-relevant information about companies.

Credit information providers offer a large variety of services that focus on the sector of company information, such as international company profiles, annual accounts/balance sheets, address data, data from the commercial register, etc. The readiness to take a chance, and thus fast decision-making, are important qualities for successful companies in the fast-moving world of the twenty-first century: make the right decision and take advantage of Creditreform's reliable business and consumer information. The Creditreform company profiles provide customers with general business information. These profiles help customers to get a quick and inexpensive overview of a company's situation and to add to or update the data stored in their own databases. With a monitoring service, customers can even watch single customers or complete customer portfolios in order to notice changes in the address or in the financial or corporate structure of a business partner as early as possible.

Updates of information about a company are necessary in the case of change events, which are detected by analyzing the content of web pages or news articles as soon as a company under monitoring appears in current data sources registered to the application. The underlying events are usually caused by a change of business information, for example, a change in capital, business rating, or address. The list of possible causes that trigger change events has been classified as "a trigger to be monitored by a research analyst." The monitoring service is based on a MUSING information extraction service that builds the monitoring information by applying natural language processing to the unstructured information resources (web pages, etc.).
The overall information flow of the pilot application is depicted in Figure 17.3. The Creditreform enterprise monitoring service allows users to take into account huge amounts of public data, which speeds up the workflow for obtaining a reliable estimate of a company's correct address or financial situation. The interest of customers aims at a high level of transparency of company information and at the collection of details supporting the evaluation of the business conduct of a company.

Figure 17.3: Enterprise monitoring, overview of the information flow: changes in company profiles detected by MUSING pass validation and data maintenance by a case worker before the company information database is updated.

The goal of this innovative application is the development of a versatile information extraction service that is able to aggregate enterprise information from a variety of data sources and is capable of converting it into a predefined format for further processing. Converting the extracted data into a predefined data format is supposed to facilitate comfortable integration into the existing databases of Creditreform in order to have updated information of high quality. Therefore, a final manual check of the extracted information is performed by human experts before committing the data to the Creditreform FirmenWissen database. The application thus combines information extraction and monitoring services and provides the basis for an innovative data maintenance service with high potential for market relevance in other sectors, for example, address and competitor monitoring.

The use case starts with a researcher at a local branch of Creditreform who needs to update the available information on a company in the central database. Each company's information is monitored for a period of time; updates are registered as monitoring events (triggered by a change of business information, e.g., a change in personnel, capital, or address). These event triggers have been classified as "reasons to forward a supplement to previously provided information." In the case of changes to a company website, notifications by e-mail are delivered to the Creditreform analysts and business research experts. The notifications also identify outdated data in the internal company information database. These alerts facilitate more efficient and time-saving enquiries with respect to company information for Creditreform's information researchers. Furthermore, the researchers decide whether the information attached by the enterprise monitoring service is of value for customers and should be inserted in the central database.

In the sequel, we explain the processing steps of the enterprise monitoring pilot application as developed in the EU MUSING project (see Figure 17.4). The starting point is a set of URLs of the company websites that shall be monitored by the service. Every URL is attributed with a Creditreform unique identifier (UID), which is used by Creditreform to identify data sets of specific companies and to encode additional meta-information. Assigning the UID to each company URL is a first step towards simplifying the integration process: each extract from a company website can easily be attributed with the UID and therefore be easily associated with the corresponding company profile in the database. This is crucial, since in the further process, comparisons between the extract and the company profile need to be conducted at regular intervals. Furthermore, the UID is important for the automatic distribution of notifications/alerts to the responsible Creditreform branch.
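A minimal sketch of this UID bookkeeping, assuming a simple in-memory registry, might look as follows. The class and field names are our own illustration; the actual Creditreform and MUSING components are not public:

    from __future__ import annotations

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class MonitoredCompany:
        uid: str                          # Creditreform unique identifier
        url: str                          # company website under monitoring
        last_extract: dict | None = None  # most recent extraction result
        last_checked: datetime | None = None

    class MonitoringRegistry:
        """Keys every monitored URL by its UID so that each extract can be
        associated with the corresponding company profile."""

        def __init__(self) -> None:
            self._companies: dict[str, MonitoredCompany] = {}

        def register(self, uid: str, url: str) -> None:
            self._companies[uid] = MonitoredCompany(uid, url)

        def record_extract(self, uid: str, extract: dict) -> None:
            # The timestamp makes later comparisons against this extract possible.
            company = self._companies[uid]
            company.last_extract = extract
            company.last_checked = datetime.now(timezone.utc)

        def previous_extract(self, uid: str) -> dict | None:
            return self._companies[uid].last_extract

A scheduler would call record_extract after every extraction run and feed the stored value into the comparison step described below.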
The IE tool analyzes the links on company websites, which can easily be identified by the HTML tags (<a href="URL">description</a>). The links that include strings indicating that they are directed at an imprint page are then selected by the IE tool. The administrator is informed in case the imprint page could not be found, so that the imprint URL can be inserted manually if necessary.

Figure 17.4: Enterprise monitoring pilot in MUSING, application workflow: each company URL has a unique ID; IE tools perform imprint, Wikipedia, and DBpedia extraction and merge the extracted data for companies; the MUSING ontology is updated if data has changed; a notification service notifies users about detected changes.

After an imprint page has been identified, the extraction process can be executed. The extracted information is aligned with the company profiles by directly mapping it to the FirmenWissen XML schema. Subsequently, the XML document is time-stamped automatically in order to record the time of the extraction for later comparisons. This value is important for the monitoring component, because a new extract is always compared with the last extract to determine whether the provided company information was altered and needs to be analyzed, or whether no further processing is necessary.

There is the possibility of extracting ballast information while processing the imprint data: web design agencies often include their contact information on sites they have developed for an enterprise. The system must be able to differentiate between that information and the information of the enterprise itself. By operating on segregated text segments, the IE system is able to avoid these kinds of problems. The system is mainly based on handwritten extraction rules for recognizing patterns. Before the extraction rules are applied, the HTML code of an imprint page is converted into an intermediate format called "WebText." The extraction rules then work on text items, whereby confusions can be minimized. Currently, the manually crafted extraction patterns are being analyzed and utilized to advance the development of the machine-learning-based IE system.

The link between the online portal of FirmenWissen enterprise profiles and the monitoring component is the FirmenWissen XML Gateway. The comparisons of enterprise profiles and the extracted information, which need to be conducted to identify changes in company information, are realized by the monitoring tool. The monitoring component pulls the company profiles via the XML Gateway from the FirmenWissen database; the selection process is based on the given UID. The normalized extracts can then be compared with the corresponding company profile. The IE tool extracts information regularly and passes it on to the monitoring component. If the information on the imprint page has changed, an alert is created; in this case, a monitoring alert is sent to the responsible researcher in the Creditreform office. As quality assurance is of high value to Creditreform, the extracted information needs to be checked by a human expert: information on an imprint page can be obsolete or incorrect, even though this is classified as an administrative offence in countries such as Germany and Austria. The use case has a large variety of deployment potential, as the impact of the enterprise monitoring service so far may indicate.
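The pipeline just described, selecting an imprint link, applying extraction patterns, and comparing the new extract with the last one, might be sketched roughly as follows. This is a simplified illustration of the approach, not MUSING code; the imprint-hint strings, regular expressions, and field names are assumptions:

    from __future__ import annotations

    import re

    IMPRINT_HINTS = ("impressum", "imprint", "legal-notice")

    def find_imprint_url(html: str) -> str | None:
        # Select links whose target or anchor text hints at an imprint page.
        for m in re.finditer(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                             html, re.IGNORECASE | re.DOTALL):
            href, text = m.group(1), m.group(2)
            if any(h in href.lower() or h in text.lower() for h in IMPRINT_HINTS):
                return href
        return None  # administrator is informed; the URL can be set manually

    def extract_contact_fields(imprint_text: str) -> dict:
        # Hand-written patterns standing in for the rule-based extraction.
        fields = {}
        phone = re.search(r"(?:Tel\.?|Phone)[:\s]*([+\d][\d\s/()-]{5,})", imprint_text)
        email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", imprint_text)
        if phone:
            fields["phone"] = phone.group(1).strip()
        if email:
            fields["email"] = email.group(0)
        return fields

    def detect_change(uid: str, new_extract: dict, last_extract: dict | None) -> bool:
        # A new extract is always compared with the last one; on a difference,
        # a monitoring alert is raised for the responsible researcher.
        if last_extract is not None and new_extract != last_extract:
            print(f"ALERT: imprint data for company {uid} has changed: {new_extract}")
            return True
        return False

In the pilot, the comparison operates on extracts mapped to the FirmenWissen XML schema rather than on plain dictionaries, and a human expert checks any change before the database is updated.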
The value of information is high if it is relevant, true, and reliable, and it provides impetus for progress in key areas. The value added by this use case lies in enhancing the quality of knowledge work in terms of production efficiency, improvement of performance, and the value of current products and services: it allows the information provider to focus human efforts on more demanding analysis and development phases and to generate growth by unearthing revenue opportunities.

A series of validation studies has been carried out on successive versions of a pilot application implementing the enterprise monitoring service in the framework of the MUSING project. The last validation study (2010) was performed in direct cooperation with six selected Creditreform branch offices. In the validation phase, 1605 e-mails were generated on the basis of 925 company websites. The functionality of the enterprise monitoring service under development was validated with regard to the following major items:

• What is the quality of the results delivered by the prototype? This includes questions about the usefulness, completeness, and proper allocation of the extracted information.
• Can the potential time and cost savings, as well as the expected improvements in quality, actuality, and degree of filling obtained by using such a monitoring service, be estimated?
• Another focus was on the scalability of the prototype at this stage.

The validation showed that 84% of the information delivered by the enterprise monitoring application was useful or partly useful. There was no problem with the scalability of the service. All validators confirmed the usefulness of this service, as there will be a growing need for optimized instruments for continuous, daily knowledge work. The classical work of a knowledge worker will change in the upcoming years: information must be acquired and processed in a short and efficient way. However, the goal should be to offer automated data synchronization as a service. The reasons for this are obvious: monitoring all or parts of the data manually consumes too much time and personnel effort, and timely forwarding that supports the current work process would be more effective. If the application can be effectively adjusted to this requirement, this would be a major benefit.

This application improves the quality of the work of the knowledge or research worker. Processing the internet information base is part of the daily work; the researcher in the local office receives valuable input and advice and no longer has to grope in the fog. Enterprise monitoring is an innovative instrument for the intelligent engineering of information. These results also indicated that the key improvement from the knowledge workers' point of view is not time saving, but rather quality enhancement in terms of the reliability and timeliness of information updates about enterprises. The value added lies in enhancing the quality of reports and the reliability of addresses, and therefore in protecting enterprises from unforeseen risk and securing revenue opportunities.

Conclusion

In the context of knowledge-driven business information gathering, processing, and scoring, MUSING* supports development in this domain, aiming at enhancing the use of semantic technologies in specific industrial sectors. With the aid of semantic technology, information providers or knowledge-driven companies can generate added value through the implementation of business intelligence software.

* MUlti-industry, Semantic-based next-generation business INtelliGence.
On the other hand, it is clear that knowledge-driven products are often purchased under pressure for modernization or owing to heightened expectations of the management.

Knowledge life cycles depend, in most cases, on semi- or unstructured information. From production log sheets that are annotated manually even in completely automated car manufacturing plants, through analyst opinions, to customer satisfaction reports, textual information that can initiate or support knowledge life cycles is abundant. As is well confirmed, the relative amount of unstructured over structured information is increasing fast; in a standard presentation of its InfoSphere business intelligence suite, IBM reports an expected four times faster increase in unstructured over structured information in the forthcoming decade (2010 onwards). Many of the texts being produced are not aimed at supporting a given knowledge life cycle; they come from different business contexts. In order to exploit the information residing in these sources, information retrieval and extraction methods are of prime importance.

As our case study has shown, analysis of semistructured (web pages) and unstructured (texts) information for tracing entity descriptions and registering their changes over time is a feasible application of semantic technology. These registered changes can support a knowledge production life cycle quite directly, since they amount to codified knowledge claims about an enterprise. They can also deliver back-up arguments for the assessment, rating, or valuation of a company. On the knowledge integration side, the production of structured information by information extraction methods offers tangible benefits, especially for knowledge broadcasting, that is, distributing knowledge to subscribers participating potentially in many different knowledge-related activities. Combined with content syndication mechanisms like RSS or specific publish/subscribe solutions, this step in our case study has enabled the deployment of a research prototype to the desks of many knowledge workers and led to high usability ratings.

The authors of the present paper expect the integration of natural language processing technologies, notably information extraction, with knowledge life-cycle optimization and knowledge management to continue and to enable a new generation of broadly usable organizational knowledge management support tools.

Acknowledgments

The sections on knowledge life cycles and on enterprise monitoring in the present work have been supported by the European Commission under the umbrella of the MUSING (multi-industry semantics based next generation business intelligence) project, contract number 027097, 2006–2010. The authors wish to thank the editor of the present volume for his continued support.

References

Alexandrov, A., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., …, Warneke, D. 2011. MapReduce and PACT: Comparing data parallel programming models. In Proceedings of the 14th Conference on Database Systems for Business, Technology, and Web (BTW), pp. 25–44. Kaiserslautern, Germany: GI.
Apache UIMA Development Community. 2008a. UIMA Overview and SDK Setup (Apache Software Foundation, Trans.), Version 2.2.2. International Business Machines.
Apache UIMA Development Community. 2008b. UIMA References (Apache Software Foundation, Trans.), Version 2.2.2. International Business Machines.
Blei, D. M. 2011. Introduction to Probabilistic Topic Models. Princeton, NJ: Princeton University.
Blei, D. M. 2012. Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Blei, D. M. and Lafferty, J. D. 2007. A correlated topic model of Science. Annals of Applied Statistics, 1(1), 17–35.
Blei, D. M., Griffiths, T., Jordan, M., and Tenenbaum, J. 2003a. Hierarchical topic models and the nested Chinese restaurant process. Presented at Neural Information Processing Systems (NIPS) 2003, Vancouver, Canada.
Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003b. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bradford, R. B. 2011. Implementation techniques for large-scale latent semantic indexing applications. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 339–344. Glasgow, Scotland, UK: ACM.
Bunescu, R. and Mooney, R. 2007. Statistical relational learning for natural language information extraction. In L. Getoor and B. Taskar (Eds.), Introduction to Statistical Relational Learning, pp. 535–552. Cambridge, MA: MIT Press.
Chomsky, N. 2006. Language and Mind, 3rd ed. Cambridge: Cambridge University Press.
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., Aswani, N., Roberts, I., Gorrell, G., et al. 2008. Developing Language Processing Components with GATE (General Architecture for Text Engineering), Version 5: A User Guide. Sheffield: University of Sheffield.
Fan, J., Kalyanpur, A., Gondek, D. C., and Ferrucci, D. A. 2012. Automatic knowledge extraction from documents. IBM Journal of Research and Development, 56(3/4), 5:1–5:10.
Feinerer, I. 2008. An introduction to text mining in R. R News, 8(2), 19–22.
Ferguson, T. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2), 209–230.
Firestone, J. M. 2003. Enterprise Information Portals and Knowledge Management. Amsterdam/Boston: Butterworth-Heinemann.
Grcar, M., Podpecan, V., Jursic, M., and Lavrac, N. 2010. Efficient visualization of document streams. Lecture Notes in Computer Science, vol. 6332, pp. 174–188. Berlin: Springer.
Grcar, M., Kralj, P., Smailovic, J., and Rutar, S. 2012. Interactive visualization of textual streams. In FIRST EU project no. 257928 (Ed.), Large Scale Information Extraction and Integration Infrastructure for Supporting Financial Decision Making.
Grün, B. and Hornik, K. 2011. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1–30.
Gsell, M., Razimski, M., Häusser, T., Gredel, L., and Grcar, M. 2012. Specification of the information-integration model. In FIRST EU project no. 257928 (Ed.), Large Scale Information Extraction and Integration Infrastructure for Supporting Financial Decision Making. http://project-first.eu/
Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning. Berlin: Springer.
Heinrich, G. 2008. Parameter Estimation for Text Analysis. Leipzig: University of Leipzig.
Jelinek, F. 1998. Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press.
Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., and Saarela, A. 2000. Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), 574–585. doi:10.1109/72.846729.
Lally, A., Prager, J. M., McCord, M. C., Boguraev, B. K., Patwardhan, S., Fan, J., Fodor, P., and Chu-Carroll, J. 2012. Question analysis: How Watson reads a clue. IBM Journal of Research and Development, 56(3/4), 2:1–2:14.
Landauer, T. K. and Dumais, S. T. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Li, W. and McCallum, A. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML.
Manning, C. D., Raghavan, P., and Schütze, H. 2009. An Introduction to Information Retrieval. http://www.informationretrieval.org/
McCord, M. C., Murdock, J. W., and Boguraev, B. K. 2012. Deep parsing in Watson. IBM Journal of Research and Development, 56(3/4), 3:1–3:15.
Mimno, D., Li, W., and McCallum, A. 2007. Mixtures of hierarchical topics with Pachinko allocation. In ICML.
Motik, B., Patel-Schneider, P., and Horrocks, I. 2006. OWL 1.1 Web Ontology Language: Structural Specification and Functional-Style Syntax.
Object Management Group. 2006. OMG Systems Modeling Language (OMG SysML) Specification (ptc/06-05-04). Object Management Group.
Paisley, J., Wang, C., and Blei, D. M. 2012a. The discrete infinite logistic normal distribution. arXiv.org.
Paisley, J., Wang, C., Blei, D. M., and Jordan, M. I. 2012b. Nested hierarchical Dirichlet processes. arXiv.org.
Pepper, S. 2006. The TAO of Topic Maps: Finding the Way in the Age of Infoglut. Oslo: Ontopia.
Spies, M. 2013. Knowledge discovery from constrained relational data: A tutorial on Markov logic networks. In E. Zimanyi and M.-A. Aufaure (Eds.), Business Intelligence: Second European Summer School, eBISS 2012. Berlin: Springer.
Steyvers, M. and Griffiths, T. 2006. Probabilistic topic models. In T. Landauer, D. S. McNamara, S. Dennis, and W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning, pp. 1–15. Hillsdale, NJ: Laurence Erlbaum.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, Edmonton, Canada.
Trautmann, D. and Suleiman, F. 2012. Topic Analyse auf Grundlage des LDA-Modells [Topic analysis based on the LDA model]. In M. Spies (Ed.), Knowledge Management Working Papers. Munich: LMU, University of Munich.
Wang, C., Kalyanpur, A., Fan, J., Boguraev, B. K., and Gondek, D. C. 2012. Relation extraction and scoring in DeepQA. IBM Journal of Research and Development, 56(3/4), 9:1–9:12.
Wild, F. 2011. Latent Semantic Analysis: Package lsa. CRAN Repository.
WordNet: An Electronic Lexical Database. 1998. Cambridge, MA: MIT Press.
Zhang, J., Song, Y., Zhang, C., and Liu, S. 2010. Evolutionary hierarchical Dirichlet processes for multiple correlated time-varying corpora. Presented at KDD, July 25–28, 2010, Washington, DC.
Zikopoulos, P. C., deRoos, D., Parasuraman, K., Deutsch, T., Corrigan, D., and Giles, J. 2013. Harness the Power of Big Data: The IBM Big Data Platform. New York: McGraw-Hill.