Automation, Communication and Cybernetics in Science and Engineering 2015/2016

Editors: Sabina Jeschke, Ingrid Isenhardt, Frank Hees, Klaus Henning
IMA/ZLW & IfU, RWTH Aachen University, Faculty of Mechanical Engineering, Aachen, Germany

This book was printed with the kind support of RWTH Aachen University.

ISBN 978-3-319-42619-8
ISBN 978-3-319-42620-4 (eBook)
DOI 10.1007/978-3-319-42620-4
Library of Congress Control Number: 2016947394
Mathematics Subject Classification (2010): 68-06, 68Q55, 68T30, 68T37, 68T40
CR Subject Classification: I.2.4, H.3.4

© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that
may have been made.

Cover figure: © Fotolia - red150770
Printed on acid-free paper
This Springer imprint is published by Springer Nature. The registered company is Springer International Publishing AG Switzerland.

Foreword

Dear Reader,

Today we present the fourth instalment of our book series Automation, Communication and Cybernetics. Like its predecessors, this book brings together our scientifically diverse and widespread publications from the period of July 2014 to June 2016. The peer-reviewed publications have been published in recognised journals, books or conference proceedings of the various disciplinary cultures.

Below you find an up-to-date version of the organisational structure of our Cybernetics Lab. It is headed by Sabina Jeschke, with Ingrid Isenhardt and Frank Hees as her Deputy and Vice Deputy. The former Head, Klaus Henning, still supports us as Senior Advisor. The Cybernetics Lab itself consists of three institutes: the Institute of Information Management in Mechanical Engineering IMA, the Center for Learning and Knowledge Management ZLW and the Associated Institute for Management Cybernetics e.V. IfU, which are managed by Tobias Meisen, Anja Richert and René Vossen respectively. Our research activities are arranged in nine different research groups, whose activities are described further below.

Although the structure itself has not changed, the people or their statuses have. Our managing director of the IMA, Tobias Meisen, is now a Junior Professor, and within the Cybernetics Lab we have several new research group leaders:
• At the IMA, Thomas Thiele is now the leader of the group Production Technology, Christian Kohlschein of the group Cognitive Computing & eHealth, and Max Haberstroh heads the group Traffic and Mobility.
• In the ZLW, Christian Tummel now heads the research group Agile Management & eHumanities. Stefan Schröder took over the team Innovation & Research Futurology, Sebastian Stiehm the group Knowledge Engineering and Valerie Stehling the
group Didactics in STEM Fields.
• At the IfU, the Heads of the research groups are now Kristina Lahl and Sebastian Reuter for the groups Economic and Social Cybernetics and Engineering Cybernetics respectively.

The scientific core of the Institute of Information Management in Mechanical Engineering (IMA) consists of three research groups:
• The scope of the research group Production Technology is to provide innovative research regarding information management for Industry 4.0. The group is specialised in methods and procedures from computer science to semantically integrate, to consolidate and to propagate data generated in these domains. In addition, their research focuses on visualisation and interaction techniques to enable the user to analyse the retrieved information in an explorative and interactive way. Thereby, their research covers a broad range of different areas, especially virtual and automated production. Meeting the challenges of information management within these areas, the group studies information integration, descriptive and predictive analysis using a variety of techniques from artificial intelligence like regression, machine learning, natural language processing and data mining as well as visual analytics. Regarding the domain of virtual production, the group has shaped the concept of the Virtual Production Intelligence (VPI) to collaboratively and holistically support product, factory and machine planners. The work of the group provides essential basics to facilitate the realisation of cyber-physical production systems (CPPS) and therefore is a cornerstone of information management in Industry 4.0.
• The research group Traffic and Mobility is working on concepts for multimodal freight transport and urban mobility, intelligent transport systems and on the design of user-friendly and barrier-free mobility solutions and human-machine interaction. In its projects, the research group investigates concepts for autonomous vehicles, advanced driver
assistance systems and the interactions and interdependencies between humans, organisation and technology. In order to develop holistic solutions, the interdisciplinary team combines skills and knowledge from engineering, computer science, sociology and economics. The methods applied by the research group range from simulator and real-life testing through usage-centred design and empirical studies to acceptance analysis and mental stress and strain analysis. One approach to reaching the ideal of efficient freight traffic in the future is to use modular loading units that are usable worldwide, together with appropriate transport carriers. All research is based upon the holistic consideration of the three recursion dimensions: human, organisation and technology. The activities of the research group include the research and development of new technologies as well as the development of methods and tools for the product development process in the above-mentioned application fields.
• The research group Cognitive Computing & eHealth focuses on information management supporting healthcare. The group specialises in the challenges within the research fields of predictive data analytics and visual analytics. The research group understands itself as an “integrator” within the eHealth domain: educating and providing experts in the mentioned research fields, but also understanding the importance of covering and dealing with problems of all phases (i.e. needs assessment, integration, evaluation, and deployment) of the information management cycle. Lately, the group has focused on research topics arising in medical emergency scenarios, developing an intelligent and reliable ad-hoc network structure that streams medical data in real time from the site of an incident to an expert. Predictive analytics is used to detect upcoming delays, future connection losses, or approaching quality reductions. The eHealth group coined the term “prescient profiling”, which describes an AI-driven concept
selecting relevant laypersons for nearby medical emergencies. To determine relevancy, the solution considers, for example, travelling speed, known behavioural patterns (i.e. trajectories), current circumstances, and infrastructural limitations. Currently, the group works on an algorithm to predict the emotions of a driver interacting with a navigation system in order to adapt the system’s behaviour accordingly. In the near future, the group will also use its expertise to establish a complex and highly available information management system for rapidly changing ad-hoc infrastructures, which are needed, for example, to ensure information availability in the case of major incidents.

The Center for Learning and Knowledge Management (ZLW) has four research groups:
• The research group Innovation Research and Futurology focuses on two fields. Innovation research concentrates on a concept of innovation management which not only comprises the planning, realisation and design of processes and structures to create innovation, but also stresses innovative capability. The first research field focuses on innovation systems with various dimensions, such as regional and national innovation systems and their relevant subsystems, which are created and analysed from a cybernetic perspective. This is achieved by a holistic consideration of the system-intrinsic dimensions “human, organisation and technology”, in order to produce innovative capability of the involved actors under competitive and sustainable conditions. The second field of the research group is futurology. Here, a monitoring approach is applied to different research, development and funding programmes. From this, a range of future trends, scenarios and development strategies is derived for the respective target groups. This expertise is supplied to experts in science, economy and policy.
• The research group Knowledge Engineering currently focuses on three topics: First, it supports and explores the development and steering of
inter- and transdisciplinary networks and clusters of excellence with the aim of identifying and promoting synergies as a source of innovation. To this end, a continuous qualitative and quantitative evaluation of the research network is carried out, as is text mining of scientific publications with machine learning. In both cases, interactive data visualisations are considered for feedback on and exploration of the results. Second, within the framework of demographic change in the world of work, the research group develops concepts for evaluating and analysing companies’ demographic alignment. Making use of a wide range of quantitative research methods, holistic demography management systems can be implemented which also respect the perspectives of the various stakeholders involved. Third, the research group focuses on identifying opportunities and potentials for (re-)integrating production sites into urban space with a holistic, transdisciplinary view. Following a socio-technical research approach, the research group develops factors and scenarios of urban production by combining methods of empirical social research with data science.
• With an interdisciplinary team of communication scientists, engineers, psychologists, sociologists and computer scientists, the research group Didactics in STEM Fields deals with challenges of didactics, especially those of the STEM fields, including mathematics, computer science and engineering. To ensure the development of successful didactical concepts, the involvement of every actor actively participating in education is needed. Therefore, groups of students, teaching staff, intermediate organisations and other experts on university didactics are involved in our research activities. The user-oriented approach of the research focuses on learning in virtual environments, learning with natural user interfaces and VR technology, remote and virtual laboratories and other forms of computer- and web-based learning.
Moreover, social aspects of learning in a higher education context are investigated. Here, the focus lies on mentoring concepts, students’ mobility and service-based learning methods. In all its activities, the research group considers the whole student life cycle, from pupils through bachelor and master students up to doctoral candidates.
• The research group Agile Management & eHumanities deals with the application of data analytics approaches in the social sciences and humanities. Its major effort is the examination of how computer-assisted processes and digital resources are systematically used in these disciplines, with the main emphasis on data analytics with special regard to social media. In order to manage the continuously increasing complexity and dynamics in organisational structures, the field of Agile Management investigates the application and implementation of agile methods, techniques, principles, and values. In the application area of research on competencies, the focus is on analysing the “digital footprints” of employers and staff, which allows conclusions to be drawn about hidden profile characteristics. The identification of these hidden characteristics and their significance for tomorrow’s job market are current research topics in this field. Furthermore, the research group analyses the effects of these characteristics in order to enhance individual competencies by optimising qualification processes and programmes in the context of academic teaching.

The Associated Institute for Management Cybernetics e.V. (IfU) used the opportunity to extend its research focus once more:
• The research team Economic and Social Cybernetics deals with cybernetic methods and tools for industrial applications. The main research topics include the assessment of organisational culture and structure, business model innovation and the development of decision support tools. In the context of evaluation and decision support, enhanced economic assessment
tools including uncertainty and soft aspects, as well as sustainability assessment tools, are generated. In interdisciplinary research projects, cybernetic tools and solutions for complex problems are developed in collaboration with industrial and research partners. The employed methods include system dynamics, the viable system model, the organizational culture assessment instrument (OCAI) and the business model canvas. Furthermore, cybernetic tools are applied for the development of sustainable product strategies, the design of efficient organisational structures, the culture-based implementation of quality management, and change processes.
• The research team Engineering Cybernetics is a part of the Institute for Management Cybernetics at the RWTH Aachen University. Its research objectives are intelligent planning and control algorithms for technical systems. The focus is on mobile robotics within intralogistic applications as well as process planning and industrial robotics. Here the group addresses aspects of human-robot interaction and collaboration. The main goal is to endow the respective technical systems with autonomy and situational awareness in order to achieve more robust behaviour and increased flexibility while at the same time simplifying the interaction with those systems. (Multi-)agent technologies, closed-loop control systems and visual servoing, and natural interface technologies play an important role. The research group also maintains the institute’s school labs.

We would like to thank our scientific researchers, who work hard and publish continuously and without whom we would not have accomplished this fourth instalment of our diverse and comprehensive collection. Further, we would like to acknowledge the support of our administrative and technical staff, who fight the battles with bureaucracy and IT for us so that we can keep our minds focused on our research projects and the education of students. Finally, we would like to thank our Public Relations team and especially
Miro Tommack for the unification of all these different articles.

Aachen, Germany, September 2016
Sabina Jeschke
Ingrid Isenhardt
Frank Hees
Klaus Henning

Text Mining Analytics as a Method of Benchmarking Interdisciplinary Research Collaboration

Stefan Schröder, Thomas Thiele, Claudia Jooß, René Vossen, Anja Richert, Ingrid Isenhardt and Sabina Jeschke

Abstract This paper introduces the process of adopting and implementing modern text mining approaches within the Cluster of Excellence (CoE) Tailor-Made Fuels from Biomass (TMFB) at RWTH Aachen University and presents initial results of an analysis of research output using common clustering algorithms, namely Principal Component Analysis and k-means. One main part of this paper places the data-driven approach within the benchmarking efforts that form part of the research work of the so-called Supplementary Cluster Activities (SCA). The SCA, which support the cluster management, were initiated to promote interdisciplinary collaboration among CoE researchers with different disciplinary backgrounds. This cross-linking is aided by knowledge engineering and knowledge transfer strategies, such as the exploration of synergies and the benchmarking of research results and progress. In this context, an adaptation of current benchmarking efforts to the specific framework conditions of cluster research is described. Because the data sources differ from those used in widespread business-oriented benchmarking, possible TMFB data sources are outlined and the selection made for analysis is justified. While benchmarking is usually divided into internal and external benchmarking, the focus here lies on the internal analysis of publications in order to reflect research work. Benchmarking of publications is used to identify (best) methods, practices and processes of the CoE in order to improve the research organization. The second major part and central question of this paper is in which
way text mining and clustering algorithms are sensibly applicable to TMFB publications and can be used as a benchmark for research clusters. Thus, the thematic priorities of TMFB researchers are investigated in order to create an overview of research topics, keywords and methods. As an outlook, further steps are formulated, e.g. dealing with generated results, data visualisation or the further acquisition of data corpora.

Keywords Benchmarking · Interdisciplinarity · Text Mining · Clustering · K-Means · Principal Component Analysis

S. Schröder · T. Thiele · C. Jooß · R. Vossen · A. Richert · I. Isenhardt · S. Jeschke
IMA/ZLW & IfU, RWTH Aachen University, Dennewartstr. 27, 52068 Aachen, Germany
e-mail: stefan.schroeder@ima-zlw-ifu.rwth-aachen.de

Originally published in “12th International Conference on Intellectual Capital, Knowledge Management & Organisational Learning (ICICKM 2015)”, © Academic Conferences and Publishing International Limited 2015. Reprint by Springer International Publishing Switzerland 2016, DOI 10.1007/978-3-319-42620-4_73

1 Introduction: Cluster of Excellence ‘Tailor-Made Fuels from Biomass’ and ‘Supplementary Cluster Activities’ at the ‘RWTH Aachen University’

In 2012 the further funding of the research Cluster of Excellence [1] “Tailor-Made Fuels from Biomass” (TMFB) at RWTH Aachen University was announced by the German Research Foundation (German abbreviation: DFG) and the German Council of Science and Humanities (German abbreviation: WR). Since 2007, TMFB has been working on a solution for one of the major challenges that society is facing today: a rising energy demand and the limited availability of fossil energy resources. With this challenge in mind, researchers from the fields of chemistry, biology, process engineering and mechanical engineering have joined forces in this CoE to develop new alternative fuels from biomass which will not compete with the food chain [2]. In an
interdisciplinary approach, more than 70 scientists are engaged in the synthesis and combustion of such fuels.

Over the course of the past decade, the number of interdisciplinary CoE at German universities has been increasing, primarily because of the emergence of research issues that cannot be met by purely disciplinary approaches [3, 4]. Thus, German CoE are initiated to focus on current issues and research questions which require the input of more than one discipline and its researchers. With the initiation of CoE, the complexity of cooperation increases. Hence, as part of TMFB, the Supplementary Cluster Activities (SCA) were initiated to promote cross-linking and collaboration among CoE members.

The SCA pursue a two-fold strategy (cf. Figure 1). One working area mainly focuses on the promotion of early career researchers, the promotion of gender equality and performance measurement as well as knowledge engineering. These tasks are summarized under the so-called Collaboration-Enhancing Services, which comprise e.g. strategy workshops, colloquia, trainings and public relations. The second working area, which aims to strengthen interdisciplinary collaboration, is called Collaboration-Enhancing Research. Its tasks include e.g. the identification of key performance indicators, the exploration of synergies, benchmarking and the support of knowledge transfer within and outside of the CoE. In the following, measures belonging to the research work package, in particular the benchmarking task, are outlined.

This paper presents an approach to investigating and adopting data-driven analytical methods in order to achieve objectives such as establishing (internal) benchmarks, utilizing synergies and supporting knowledge transfer. Section 2 therefore deals with the relevance and connection of benchmarking, data mining analytics and the tasks of the Supplementary Cluster Activities. In doing so, the necessity of benchmarking and the differences to commonly used business benchmarking approaches are outlined. In Section 3
existing and presumably useful databases for analysis are described. The text-mining-driven approach itself, the reasons for its application and its targets are described in Section 4. Afterwards, initial results generated by applying two frequently used clustering algorithms are presented to test whether the approach leads to results that can be processed further. As an outlook, Section 5 discusses the idea of adopting clustering algorithms in order to benchmark scientific research and briefly sketches the procedure for comparing TMFB research priorities with relevant topics from the general scientific community. Further steps, e.g. visualisation and data acquisition, are pointed out as well.

Figure 1 SCA tasks within the Cluster of Excellence TMFB (own diagram)

2 Supplementary Cluster Activities in the Context of Benchmarking

Benchmarking is one task within the SCA portfolio. In the following, its interpretation and the necessity of adapting available approaches are explained.

Originating from the ‘business world’, benchmarking is traditionally thought of as a management tool that improves performance by identifying and applying best documented practices. Modern commercial benchmarking has now come to refer to the process of identifying the best methods, practices, and processes of an organization and implementing them to improve one’s own organization [5, 6]. In order to do so, the best entities in a sector have to be identified and specific indicators like unit costs have to be compared [6–8]. The objective of benchmarking is to help the management of an organisation to improve performance. Over the last years, many governmental and public organisations have adopted a range of approaches to research benchmarking [9, 10, 32–34]. In an effort to create comparable and compatible quality assurance and academic degree standards, universities increasingly employ benchmarking
strategies, too [5].

Following the DFG and WR, who interpret benchmarking as “evaluating and steering […] research performance”, “increasing self-reflection” and “avoiding […] ton technology” [11], business-driven benchmarking approaches need to be adapted to research clusters and their differing frameworks. Particularly in the context of acquiring university data, the CHE reveals common challenges for university benchmarking [12]. In particular, issues regarding data acquisition (quality, format and effort) as well as assignment and interpretation problems are outlined. How data analytics can be applied to meet these challenges is described in Sections 3 and 4.

Jackson and Lund [9] define benchmarking as a collection of approaches for self-evaluation activities. On this account, and with regard to Levy and Valcik [13], the SCA deal with benchmarking in a functional, primarily internal but also external way. Functional benchmarking examines similar functions in organizations that are not direct competitors [5]. Levy and Valcik [13] outline benchmarking in a manner that transfers well to the SCA understanding:

“Benchmarking is a strategic and structured approach whereby an organization compares aspects of its processes and/or outcomes to those of another organization or set of organizations to identify opportunities for improvement.” [13]

Transferred to the SCA approach:
• Organization stands for the Cluster of Excellence TMFB
• Another organization or set of organizations stands for the relevant scientific community (this applies to other CoE as well as to other similar specialist disciplines)
• Improvement in this case includes: revealing synergies and intersections as well as uncovering research methods, practices and processes

Whatever its origins, implicit in the concept of benchmarking is the use of references by which other objects can be measured, compared, or judged [35]. Which ‘references’ can be considered for analysis is discussed
in Section 3.

At this point the question may arise what objectives are pursued by implementing and adopting a benchmarking approach for TMFB. The objective is to …
• …create an overview of research topics, keywords and methods
• …compare actual and desired values or results
• …reflect on research focus and alignment
• …identify synergies and intersections between TMFB researchers
• …identify synergies with external entities
• …analyse analogies and differences of TMFB research priorities in comparison to general scientific community research topics
• …adapt user-specific visualization of results

These objectives are not only motivated theoretically or by research funding bodies like the WR. As Jooß [14], among others, worked out in a CoE survey, the success of interdisciplinary cooperation depends on methodological exchange, identification of key persons, reflection of research results, identification and visualization of interfaces, cross-linking and reflection [14]. In order to handle these demands, the following SCA approach to analysing data is introduced. While this paper introduces the overall SCA approach and its holistic objectives, the results presented in Section 4 relate to the above-mentioned objective ‘create an overview of research topics, keywords and methods’ used within TMFB. Section 3 briefly explains the benchmarking differences between CoE and commercially operating companies and lists a selection of possible information bases which can be analysed.

3 Benchmarking and Data Sources

In the context of benchmarking, there is a need for measuring, comparing and judging standards or references [5]. University frameworks and CoE funding objectives differ from those of companies. While CoE are driven and judged by scientific output and knowledge production, companies are driven by sales and product output [6]. Epistemic interests (WR) are paramount to CoE, contrary to companies’
products or service output. On this account, the basis of valuation differs. If the quality of CoE performance is to be judged, scientific output (e.g. publications) rather than (non-existent) sales revenues is consulted. (Certainly, CoE are managed on a financial basis, but no direct revenues are expected from CoE from a political or funding perspective. Returns on investment are desired for Germany as a business location and are therefore hardly quantifiable.) Therefore, contrary to business units, the success and progress of CoE must be evaluated on a different information basis. For this reason, possible access to CoE information from an external point of view is identified. Although the SCA, as part of the management, have further internal access, external access is preferred for the sake of transferability. Table 1 lists possible sources (without any claim to completeness) and the corresponding content.

Table 1 Free access to CoE data sources (own diagram)
Source: (derivable) content

Online presence CoE:
• General information on project structure and objectives
• Background information on funding volume and runtime
• News and announcements
• Contact persons and responsibilities

Online presence involved partners:
• Partner objectives
• Researcher assignment and expertise
• Number of employees
• Dissertation projects

Publication databases TMFB:
• Publications (e.g. conferences, journals)
• Research results (e.g. research questions, methodology, findings, further research desiderata)
• Cooperation structure (who cooperates with whom, internal and external)
• Quality of research output (e.g. journal impact factor)

Publication databases general:
• Research results (e.g. research questions, methodology) according to the scientific community
• See above

Newsletter:
• Current information and news

Social Media (Xing, LinkedIn, facebook, twitter, YouTube):
• Employee information (professional and personal)
• CoE/project news
• Announcements (e.g. Call for Papers, new Professorships)
• Images/photos/illustrations
• Videos
• Project visits

Brochures, flyer and demonstrator:
• General information
• Research questions
• Specific advertisements
• Lego© demonstrator (process description)

Research funding:
• General information regarding e.g. funding criteria
• General annual reports (e.g. funding volume)

Online Stores (e.g. Amazon, Springer):
• Book publications

Proposals, research reports, project reviews, advisory board advice etc.:
• No external access

…:
• …

It is natural to work with this existing information. Notably, nearly all types of data sources contain, to a greater or lesser extent, vast amounts of text. In order to develop a benchmarking approach that analyses these amounts of text efficiently and yields results regarding content, methods and intersections of TMFB work, the SCA decided to adopt data-driven analytical methods. In a first step, publications are chosen as the information source because of their informative value, accessibility and comparability. This paper outlines the processing of a TMFB publication sample in order to reflect current research foci and jointly used keywords (e.g. methods or materials). Future analysis will consider general scientific publication databases to achieve the SCA objectives mentioned in Section 2 (e.g. analysing analogies and differences of TMFB research priorities in comparison to general scientific community research topics). For the analysis of publications, text mining analytics are a sensible choice. Later on, in accordance with the holistic analysis of the data sources listed in Table 1, web mining and further analytics have to be applied as well (e.g. when analysing social media content).

At this point the question may arise why (semi-)automated approaches should be used.
There are several reasons for implementing an automated system for the content analysis of text. Krcmar [15] also describes the necessity of adapted information management systems for the management of organizations. He outlines their efficient use to supply researchers and stakeholders with relevant information [15]. The benefit for TMFB is the automation of this process, which reduces the manual workload and allows more efficient analysis and reanalysis of text. Such a system will be applicable to extremely large quantities of text where there is little possibility of intense human analysis [16]. For the TMFB publication sample (described in Section 4), and especially with a view to analysing the entire body of TMFB publications as well as community-relevant publications, process automation is essential. Otherwise the workload of manual analysis is disproportionate, and patterns might not be revealed by classical text analysis. Section 4 clarifies the functionalities of text mining analytics, assesses existing text mining and clustering algorithms, and presents first results from a data sample of 28 TMFB publications.

4 Applying Text Mining Analytics

“While technology enables us to capture and store ever larger quantities of data, finding relevant information like underlying patterns, trends, anomalies, and outliers in the data and summarizing them with simple understandable and robust quantitative and qualitative models is a grand challenge. Data mining helps to discover underlying structures in the data, to turn data into information, and information into knowledge.” [17]

Text mining is the extraction of implicit, previously unknown, and potentially useful information from data [17]. Given the available data, a quantity of text (28 publications) is analysed in order to investigate the applicability of clustering algorithms for preparing the existing data for further analysis with regard to the objectives mentioned earlier. With the help of
text mining, patterns can be discovered, and it is possible to predict future values, derive recommendations, optimize the choice of methods and identify possible or even new research paths [17] (Miner 2009). Text mining extends the applicability to un- or semi-structured data like texts from documents, news, customer feedback, e-mails, web pages, internet discussion groups and social media [17]. The following subsections elaborate further on pre-processing the data and applying clustering algorithms. At the end of this section, results for the TMFB data are described.

4.1 Clustering and Pre-processing Publications

One approach to handling TMFB publications with text mining algorithms, in order to create an overview of research topics, keywords and methods, is clustering, i.e., the task of automatically grouping objects into groups of similar objects. Cluster analysis divides a heterogeneous group of records into several more homogeneous classes or clusters [18]. Below, the choice of a cluster algorithm is roughly sketched. Following Miner [19], the figure on the choice of clustering algorithm underlines the selection of clustering. To summarize, clustering is sensible because in this case we analyze sorted documents in an exploratory manner, with no given categories. Whether existing algorithms are applicable to the TMFB application will be determined in the following paragraph. “Clustering is a technique that is useful for exploring data. It is particularly useful where there are many cases and no obvious natural groupings.” [17] As mentioned by Kempf and Siebert [20], the benefits of cluster analysis lie in grouping data that are not clearly assignable, poorly structured and assorted [20]. Thus patterns within heterogeneous publications can be revealed [6]. By clustering, groups of publications with mainly related properties are built [17]. Affiliations are expressed by homogeneity of objects within the clusters and heterogeneity between them [21, 22]. A good clustering method produces
high-quality clusters to ensure that the inter-cluster similarity is low and the intra-cluster similarity is high; members of a cluster are more like each other than they are like members of a different cluster. Specifically, functionally related properties can be grouped together, e.g. to identify novel research pathways or patterns based on statistical similarity [7]. This depends on a distance metric, such as Euclidean distance (square root of the sum of squared differences along each feature) or Manhattan distance (sum of absolute feature differences) [18]. The aim of clustering is twofold. Firstly, it seeks to separate data items into a number of groups so that items within one group are more similar to other items in the same group. Secondly, it aims to arrange the groups so that items in one group are different from items in other groups. Visualizing how the clusters are formed and which data points are in which cluster is difficult [17]. Clustering techniques can be divided into different approaches, i.e. agglomerative and divisive approaches as well as partitioning approaches (e.g. [23]) [7, 17].

Before clustering can proceed, pre-processing the data is an important step in text mining processes, because the given data has to be transformed into a machine-readable format. In the case of analyzing publications, the original text has to be reduced to only relevant words, which are accessible for statistical analysis. Therefore, steps like tokenizing, filtering stopwords, filtering tokens by length, and POS tagging are conducted.

Figure: Choice of clustering algorithm (own diagram following [19])

Figure: Word occurrences after pre-processing, sorted by frequency (columns: word, total, in class (TMFB_Literatur); both count columns are identical). Words and counts: production 1175, reaction 1077, model 789, temperature 775, pressure 713, conditions 706, process 672, glucose 599, reactions 585, cellulose 576, experiments 520, ignition 520, oxygen 511, energy 501, yield 491, concentration 462, liquid 444, biomass 439, lignin 437, phase 430, hydrogen 399, method 395, formation 383, species 380, product 374, water 370.

The word-occurrence figure shows first results of the data pre-processing. Occurrences of recurrent words, i.e. attributes, are investigated after splitting the text into words, removing articles, conjunctions, prepositions and negations, and assigning words and punctuation marks to parts of speech. Further investigation comparing word occurrences with document membership has to be done and may give insights into specific word usage or common cluster topics. For instance, at first view the identification of commonly used substances (e.g. glucose, cellulose, oxygen, lignin, hydrogen etc.) is possible. When clustering algorithms are applied, word occurrences influence the assignment. Whether clustering techniques can be applied sensibly is worked out in the following.

4.2 Principal Component Analysis (PCA)

Often in the context of cluster analysis, k-means is used together with principal component analysis (PCA), which reduces the dimensions of the parameter space. PCA is one of the most widely employed and useful tools in the field of exploratory analysis [24]. PCA is used if the structure of the data is not obvious. It rotates the orthogonal coordinate system so that the covariance matrix is diagonalized. The basic assumption of PCA is that the direction with the biggest variance contains most of the information [25]. The axis of the greatest variance is called the first principal component. Another axis, which is orthogonal to the previous one and positioned to represent the next greatest variance, is called the second principal component [26]. PCA is often used to identify the strongest predictor variables in a data set. PCA is a technique for revealing the relationships between
variables in a data set by identifying and quantifying a group of principal components [18]. Principal components have been used frequently in studies as a means to reduce the number of raw variables in a data set [27, 28]. PCA is interesting because it serves as a starting point for many modern algorithms [29]. Applying PCA to TMFB research papers, the following components can be investigated. Table 2 gives exemplary insights into principal components and their attributes. It becomes clear that attributes are successfully assigned by their eigenvectors. Concerning the contents, intersections between different TMFB authors can be exposed, as every principal component is determined by no fewer than two different publications. Looking up the principal component assignment of papers according to Table 2, the following distribution of publications within the principal components (PC) results: PC3 is mainly described (highest impact) by nearly … publications, PC6 by … publications and PC … by …. In this case, validating the results by looking up the eigenvectors in the documents shows a successful application of the PCA. This approach allows, inter alia, the identification of intersections between researchers as well as the revealing of keywords that are commonly used together.

Table 2 Exemplary eigenvectors, PCA of TMFB publications

Attribute    PC …    Attribute     PC …    Attribute      PC …    Attribute     PC 12
Cellulose    0.278   Fuel          0.394   Lignin         0.497   Descriptors   0.235
Ignition     0.234   Bio fuel      0.290   Organosolv     0.219   Xylan         0.177
Flame        0.190   Methylfuran   0.115   Liquids        0.188   Lignin        0.159
Laminar      0.186   Toxicity      0.098   Conductivity   0.183   Catalysts     0.131
Rules        0.156   Bio mass      0.097   Viscosity      0.146   Xylose        0.110
Delay        0.144   Ignition      0.071   Alcohol        0.140   Silica        0.107
Radicals     0.135   Motor         0.065   Solutions      0.119   Conformer     0.097
Hydrolysis   0.132   Bio diesel    0.062   Catalytic      0.094   Organosolv    0.092
Glucose      0.131   Butanol       0.047   Acetate        0.087   Catalysis     0.088
…            …       …             …       …              …       …             …

Concluding at this point
leads to the assumption that clustering TMFB publications by PCA will meet the desired objectives of analysing publications and reflecting research work. Further investigation, applying the algorithm to the whole TMFB data, has to be done, especially with regard to extracting and separating methods from research topics or substances and approaches. Likewise, mapping researchers to keywords has to be integrated in a next step.

4.3 K-Means Algorithm

The application of the most common clustering algorithm, k-means, and its results are presented in the following paragraph. The classical k-means algorithm was introduced by [36]. Its basic operation is explained hereafter: given a fixed number (k) of clusters, assign observations to those clusters so that the means across clusters are as different from each other as possible. The points are iteratively adjusted so that each of the N points is assigned to one of the K clusters, and each of the K clusters is the mean of its assigned points [18, 30]. The underlying mathematical function is

\[ J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \tag{1} \]

whereby $x_j$ denotes the data points and $\mu_i$ the centroid (mean) of cluster $S_i$. The target function is based on the method of least squares and is described as clustering by variance minimization. Because $\lVert x_j - \mu_i \rVert^2$ is the squared Euclidean distance, k-means assigns every object to the nearest cluster centroid. The k-means algorithm searches for the best location of k centroids, each corresponding to the center of a cluster. Starting with k random centroids, the algorithm examines the data points closest to each centroid. For each cluster, the points within it are considered to determine a new centroid. This new centroid is then used to determine the closest points, potentially causing points to move into different clusters [17, 18]. This process is repeated until the centroids stop moving or until an error function is minimized. It is best suited for clusters that are spherical in shape and are sufficiently distant from one another to be
distinct. Its weaknesses are that it is limited to spherical clusters of similar density and that it sometimes gets stuck in a local minimum where the centroids do not represent the best clustering. Another aspect is that the number of clusters, k, is not determined automatically; however, this does allow different runs to be performed so that cluster validity measures can be compared [17]. Applying k-means to the TMFB data, the publications were clustered by setting k to 25 (according to the PCA, 25 clusters with sufficient impact were built). Exemplary results are displayed in Table 3.

Table 3 Exemplary attributes per cluster (k-means)

Attribute   Cluster_0   Attribute    Cluster_4   Attribute   Cluster_6
Hexanol     0.456       Toxic        0.357       Flame       0.505
Phase       0.309       Mutagen      0.278       Stencil     0.302
Analyst     0.267       Bio diesel   0.194       Isotherm    0.193
…           …           …            …           …           …

Validating the results as done for the PCA leads to similar results. Building content-related clusters from TMFB publications using the k-means algorithm works. Compared with the PCA algorithm, the performance of the k-means algorithm is worse. For greater numbers of publications, PCA therefore seems to be the more performant alternative at this point. To sum up, applying text mining algorithms to research publications is suitable. Relevant keywords and their allocation can be described by using k-means as well as PCA algorithms. This is the starting point for further investigations in order to create an overview of research topics, keywords and methods. Further steps and thoughts are pointed out in Section 5.

5 Outlook: Clustering, Web Mining and Visualization

To sum up the considerations made in the introduction of this paper: the introduced concept of benchmarking and reflecting scientific research output by applying data-mining approaches is feasible. Clustering data with common algorithms yields first suitable and promising results. Correlations concerning the contents have been investigated. Based on the findings in Section 4, results can be
derived in order to satisfy objectives like identifying current research topics, keywords and methods. Nevertheless, the algorithms used, and maybe even further algorithms, must be adapted and refined to deliver more detailed information on specific content generated by the CoE with better performance. Thus, in the case of the PCA, reducing the threshold by excluding attributes may be one possible way, which will be further investigated. Additional attention is paid to the validation of clustering approaches. For example, the given examples (cf. Section 4: Tables 2 and 3) must be reflected against the backdrop of the generated clusters and their attribute composition. Besides, the acquisition of a second data corpus has been initiated. Here, publications from associated fields (e.g. mechanical engineering, biology and chemistry) are collected in order to apply k-means and PCA algorithms and thereby address the SCA objectives mentioned earlier. Thus the reflection of research focus and alignment, the identification of synergies with external entities and the analysis of analogies and differences between TMFB research priorities and the research topics of the general scientific community should be facilitated. In this way the benchmarking approach, as described in Section 2, can be applied appropriately.

Figure: Association rules, graphic view of intersections related to the attribute “catalytic”

Not yet addressed so far, this topic also includes questions regarding the visualisation of results, in order to interpret and present them and thus produce benefits for researchers in an easily accessible and understandable way. For this, besides the usage of semantic mapping, association rule algorithms can be investigated as well. First studies show promising results (cf. the association rules figure and [31]). At the same time, the need for adjustments becomes obvious, e.g. reducing generic attributes such as “complex”, “angew” or “table”. Possibilities will be worked out within the scope of
SCA research activities.

Acknowledgments This research has been funded by the German Research Foundation (DFG) as part of the Clusters of Excellence ‘Tailor-Made Fuels from Biomass’ and ‘Integrative Production Technology for High-Wage Countries’ at RWTH Aachen University.

References

1. CoE Förderantrag, Exzellenzcluster Maßgeschneiderte Kraftstoffe aus Biomasse, Proposal for the Establishment of Clusters of Excellence: Unveröffentlichter Förderantrag, eingereicht bei der Deutschen Forschungsgemeinschaft (DFG) und dem Wissenschaftsrat (WR). 2007
2. Cluster of Excellence, Renewal proposal for the cluster of excellence “tailor-made fuels from biomass”: Excellence initiative by the German federal and state governments to promote science and research at German universities: unpublished, 2011
3. E.M.K.a.B CoE Förderantrag, Renewal Proposal for the Cluster of Excellence: Tailor-Made Fuels from Biomass: Unveröffentlichter Förderantrag, eingereicht bei der Deutschen Forschungsgemeinschaft (DFG) und dem Wissenschaftsrat (WR). 2011
4. Deutsche Forschungsgemeinschaft (DFG), Verwaltungsvereinbarung zwischen Bund und Ländern gemäß Artikel 91 b Abs. … Nr. … des Grundgesetzes über die Fortsetzung der Exzellenzinitiative des Bundes und der Länder zur Förderung von Wissenschaft und Forschung an deutschen Hochschulen - Exzellenzvereinbarung II (ExV II). Bonn, 2009
5. G.D. Levy, S.L. Ronco, How benchmarking and higher education came together. New Directions for Institutional Research 2012(156), 2012, pp. 5–13. doi:10.1002/ir.20026
6. J. Rehäuser, Prozeßorientiertes informationsmanagement-benchmarking. In: IV-Controlling auf dem Prüfstand, ed. by H. Krcmar, A. Buresch, M. Reb, Gabler Verlag, Wiesbaden, 2000, pp. 189–229. doi:10.1007/978-3-322-90245-0_10
7. X. Dai, T. Kuosmanen, Best-practice benchmarking using clustering methods: Application to energy regulation. Omega 42(1), 2014, pp. 179–188. doi:10.1016/j.omega.2013.05.007
8. P. Bogetoft, Performance benchmarking: Measuring and managing performance. Management for Professionals. Springer, New York, 2012
9. N. Jackson, H. Lund, Benchmarking for higher education. Buckingham, 2000
10. H. Lund, P. Vestergaard-Poulsen, I. Kanstrup, P. Sejrsen, The effect of passive stretching on delayed onset muscle soreness, and other detrimental effects following eccentric exercise. Scandinavian Journal of Medicine & Science in Sports 8(4), 1998, pp. 216–221
11. D. Dzwonnek, Statement zum forum „informationsbedarf der hochschulen und relevanz von leistungsvergleichen“ im rahmen der „nationalen tagung zur bedeutung des forschungsratings als instrument der strategischen steuerung und kommunikation“ des wissenschaftsrats, 21 september 2012, bonn, 2012
12. Y. Hener, P. Giebisch, I. Roessler, Entwicklung geeigneter Indikatoren und Kennzahlen für die Steuerung der University Leipzig – Benchmarking von Fakultäten, CHE Centrum für Hochschulentwicklung gGmbH. Workingpaper (103), 2008
13. G.D. Levy, N.A. Valcik, eds., Benchmarking in Institutional Research. New Directions of Institutional Research. San Francisco, 2012
14. C. Jooß, Gestaltung von Kooperationsprozessen interdisziplinärer Forschungsnetzwerke. Aachen, 2014
15. H. Krcmar, Einführung in das Informationsmanagement, 2nd edn. Springer-Lehrbuch. Springer Gabler, Berlin, Heidelberg, 2015
16. A.S. Smith, M.S. Humphreys, Evaluation of unsupervised semantic mapping of natural language with leximancer concept mapping. Behavior Research Methods (38), 2006, pp. 262–279
17. M. Hofmann, R. Klinkenberg, Rapidminer: Data mining use cases and business analytics applications, 2014
18. R. Nisbet, J. Elder, G. Miner, Handbook of Statistical Analysis & Data Mining Applications. Elsevier, 2009
19. G. Miner, D. Delen, J. Elder, A. Fast, T. Hill, R. Nisbet, eds., Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Elsevier, 2012
20. K. Mertins, G. Siebert, S. Kempf, Benchmarking: Praxis in deutschen Unternehmen. Springer-Verlag, Berlin and New York, 1995
21. K. Backhaus, Multivariate Analysemethoden – Eine anwendungsorientierte Einführung, 6th edn. Berlin, 1990
22. J. Hartung, B. Elpelt, Multivariate Statistik - Lehr- und Handbuch der angewandten Statistik. München Wien, 1984
23. C. Ding, X. He, K-means clustering via principal component analysis. Proceedings of the 21st International Conference on Machine Learning. Canada, 2004
24. M. Monfreda, Principal component analysis: A powerful interpretative tool at the service of analytical methodology. In: Principal Component Analysis, ed. by P. Sanguansat, 2012, pp. 49–66
25. I.T. Jolliffe, Principal Component Analysis, 2nd edn. Springer, 2010
26. P. Sanguansat, Two-dimensional principal component analysis and its extensions. In: Principal Component Analysis, ed. by P. Sanguansat, 2012, pp. 1–22
27. I. Fodor, A survey of dimension reduction techniques, 2002. http://e-reports-ext.llnl.gov/pdf/240921.pdf
28. M. Hall, G. Holmes, Benchmarking attribute selection techniques for discrete class data mining. In: Proceedings of IEEE Transactions on Knowledge and Data Engineering, vol. 15, 2003, pp. 1437–1447
29. O. Maimon, L. Rokach, Data mining and knowledge discovery handbook, 2nd edn. Springer, New York and London, 2010
30. C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007
31. S. Schröder, T. Thiele, C. Jooß, R. Vossen, A. Richert, I. Isenhardt, S. Jeschke, Text mining analytics as a method of benchmarking interdisciplinary research collaboration. 12th International Conference on Intellectual Capital, Knowledge Management & Organisational Learning, ICICKM 2015, 05.11.2015-06.11.2015, Bangkok, Thailand, 2015, pp. 408–417
32. R. Farquhar, Higher education benchmarking in Canada and the United States of America. In: A. Schofield (ed.), Benchmarking in Higher Education: An International Review. London: United Nations Educational, Scientific and Cultural Organization, CHEMS; Paris, 1998
33. V. Massaro, Benchmarking in australian higher education. In: A. Schofield (ed.), Benchmarking in Higher Education. United Nations Educational, Scientific and Cultural Organization, London: CHEMS; Paris, 1998
34. U. Schreiterer, Benchmarking in european higher education. In: A. Schofield (ed.), Benchmarking in Higher Education: An International Review. United Nations Educational, Scientific and Cultural Organization, London: CHEMS; Paris, 1998
35. S.L. Ronco, Internal benchmarking for institutional effectiveness. In: New Directions for Institutional Research, Issue 156, 2012, pp. 15–23
36. J.A. Hartigan, Clustering Algorithms (Probability & Mathematical Statistics). John Wiley & Sons Inc., 1975
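
As a supplementary illustration of the k-means procedure and target function (1) described in Section 4.3, the iteration can be sketched in plain Python. This is only a minimal sketch under stated assumptions, not the tooling or data used for the TMFB analysis: the function names, the two-dimensional toy document vectors and the fixed random seed are all hypothetical and chosen purely for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors (the centroid)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means iteration: assign points to the nearest centroid, then update centroids.

    Returns the cluster label of every point, the centroids, and the value of
    the target function J (sum of squared distances to the assigned centroid).
    """
    rng = random.Random(seed)           # fixed seed: illustrative assumption
    centroids = rng.sample(points, k)   # start with k randomly chosen points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:        # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:                 # keep the old centroid if a cluster is empty
                centroids[i] = mean(members)
    objective = sum(squared_distance(p, centroids[lab])
                    for p, lab in zip(points, labels))
    return labels, centroids, objective


# Hypothetical toy data: six documents described by two term weights,
# e.g. (weight of "ignition", weight of "lignin").
docs = [(0.9, 0.1), (0.8, 0.2), (1.0, 0.0),
        (0.1, 0.9), (0.2, 0.8), (0.0, 1.0)]

labels, centroids, objective = kmeans(docs, k=2)
print(labels, centroids, objective)
```

On such well-separated toy data the first three and the last three documents end up in different clusters. As noted in Section 4.3, on real data the outcome depends on the initial centroids and on the chosen k, so several runs with different seeds and values of k would normally be compared.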