
Dark Web: Exploring and Data Mining the Dark Side of the Web (Chen, 2011-12-23)


DOCUMENT INFORMATION

Basic information

Pages: 478
File size: 10.23 MB

Content

Integrated Series in Information Systems
Volume 30

Series Editors
Ramesh Sharda, Oklahoma State University, Stillwater, OK, USA
Stefan Voß, University of Hamburg, Hamburg, Germany

For further volumes: http://www.springer.com/series/6157

Hsinchun Chen

Dark Web
Exploring and Data Mining the Dark Side of the Web

Hsinchun Chen
Department of Management Information Systems
University of Arizona
Tucson, AZ, USA
hchen@eller.arizona.edu

ISSN 1571-0270
ISBN 978-1-4614-1556-5
e-ISBN 978-1-4614-1557-2
DOI 10.1007/978-1-4614-1557-2
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011941611

© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Aims

The University of Arizona Artificial Intelligence Lab (AI Lab) Dark Web project is a long-term scientific research program that aims to study and understand the international terrorism (jihadist) phenomena via a computational, data-centric approach. We aim to collect “ALL” web content generated by international terrorist groups, including web sites, forums, chat rooms, blogs, social networking sites, videos, virtual world, etc. We have developed various multilingual data mining, text mining, and web
mining techniques to perform link analysis, content analysis, web metrics (technical sophistication) analysis, sentiment analysis, authorship analysis, and video analysis in our research. The approaches and methods developed in this project contribute to advancing the field of Intelligence and Security Informatics (ISI). Such advances will help related stakeholders perform terrorism research and facilitate international security and peace.

Dark Web research has been featured in many national, international, and local press and media outlets, including: National Science Foundation press, Associated Press, BBC, Fox News, National Public Radio, Science News, Discover Magazine, Information Outlook, Wired Magazine, The Bulletin (Australian), Australian Broadcasting Corporation, Arizona Daily Star, East Valley Tribune, Phoenix ABC Channel 15, and several Tucson channels (including Channels 4 and 6).

As an NSF-funded research project, our research team has generated significant findings and publications in major computer science and information systems journals and conferences. We hope our research will help educate the next generation of cyber/Internet-savvy analysts and agents in the intelligence, justice, and defense communities.

This monograph aims to provide an overview of the Dark Web landscape, suggest a systematic, computational approach to understanding the problems, and illustrate research progress with selected techniques, methods, and case studies developed by the University of Arizona AI Lab Dark Web team members.

Audience

This book aims to provide an interdisciplinary and understandable monograph about Dark Web research. We hope to bring useful knowledge to scientists, security professionals, counter-terrorism experts, and policy makers. The proposed work could also serve as reference material or as a textbook in graduate-level courses related to information security, information policy, information assurance, information systems, terrorism, and public policy. The primary audience for the
proposed monograph will include the following:

• IT Academic Audience: College professors, research scientists, graduate students, and select undergraduate juniors and seniors in computer science, information systems, information science, and other related IT disciplines who are interested in intelligence analysis and data mining and their security applications.
• Security Academic Audience: College professors, research scientists, graduate students, and select undergraduate juniors and seniors in political sciences, terrorism study, and criminology who are interested in exploring the impact of the Dark Web on society.
• Security Industry Audience: Executives, managers, analysts, and researchers in security and defense industry, think tanks, and research centers that are actively conducting IT-related security research and development, especially using open source web contents.
• Government Audience: Policy makers, managers, and analysts in federal, state, and local governments who are interested in understanding and assessing the impact of the Dark Web and their security concerns.

Scope and Organization

The book consists of three parts. In Part I, we provide an overview of the research framework and related resources relevant to intelligence and security informatics (ISI) and terrorism informatics. Part II presents ten chapters on computational approaches and techniques developed and validated in the Dark Web research. Part III presents nine chapters of case studies based on the Dark Web research approach. We provide a brief summary of each chapter below.

Part I Research Framework: Overview and Introduction

• Chapter 1: Dark Web Research Overview. The AI Lab Dark Web project is a long-term scientific research program that aims to study and understand the international terrorism (jihadist) phenomena via a computational, data-centric approach. We aim to collect “ALL” web content generated by international terrorist groups, including web sites, forums, chat rooms, blogs, social
networking sites, videos, virtual world, etc. We have developed various multilingual data mining, text mining, and web mining techniques to perform link analysis, content analysis, web metrics (technical sophistication) analysis, sentiment analysis, authorship analysis, and video analysis in our research.
• Chapter 2: Intelligence and Security Informatics (ISI): Research Framework. In this chapter we review the computational research framework that is adopted by the Dark Web research. We first present the security research context, followed by a description of a data mining framework for intelligence and security informatics research. To address the data and technical challenges facing ISI, we present a research framework with a primary focus on KDD (Knowledge Discovery from Databases) technologies. The framework is discussed in the context of crime types and security implications.
• Chapter 3: Terrorism Informatics. In this chapter we provide an overview of selected resources of relevance to “Terrorism Informatics,” a new discipline that aims to study the terrorism phenomena with a data-driven, quantitative, and computational approach. We first summarize several critical books that lay the foundation for studying terrorism in the new Internet era. We then review important terrorism research centers and resources that are of relevance to our Dark Web research.

Part II Dark Web Research: Computational Approach and Techniques

• Chapter 4: Forum Spidering. In this study we propose a novel crawling system designed to collect Dark Web forum content. The system uses a human-assisted accessibility approach to gain access to Dark Web forums. Several URL ordering features and techniques enable efficient extraction of forum postings. The system also includes an incremental crawler coupled with a recall improvement mechanism intended to facilitate enhanced retrieval and updating of collected content.
• Chapter 5: Link and Content Analysis. To improve understanding of terrorist activities, we
have developed a novel methodology for collecting and analyzing Dark Web information. The methodology incorporates information collection, analysis, and visualization techniques, and exploits various web information sources. We applied it to collecting and analyzing information from selected jihad web sites and developed visualizations of their site contents, relationships, and activity levels.
• Chapter 6: Dark Network Analysis. Dark networks such as terrorist networks and narcotics-trafficking networks are hidden from our view yet could have a devastating impact on our society and economy. Based on analysis of four real-world “dark” networks, we found that these covert networks share many common topological properties with other types of networks. Their efficiency in communication and flow of information, commands, and goods can be tied to their small-world structures, characterized by small average path length and high clustering coefficient. In addition, we found that because of the small-world properties, dark networks are more vulnerable to attacks on the bridges that connect different communities than to attacks on the hubs.
• Chapter 7: Interactional Coherence Analysis. Despite the rapid growth of text-based computer-mediated communication (CMC), its limitations have rendered the media highly incoherent. Interactional coherence analysis (ICA) attempts to accurately identify and construct interaction networks of CMC messages. In this study, we propose the Hybrid Interactional Coherence (HIC) algorithm for identification of web forum interaction. HIC utilizes both system features, such as header information and quotations, and linguistic features, such as direct address and lexical relation. Furthermore, several similarity-based methods, including a Lexical Match Algorithm (LMA) and a sliding window method, are utilized to account for interactional idiosyncrasies.
• Chapter 8: Dark Web Attribute System. In this study we propose a Dark Web Attribute System (DWAS) to enable
quantitative Dark Web content analysis from three perspectives: technical sophistication, content richness, and web interactivity. Using the proposed methodology, we identified and examined the Internet usage of major Middle Eastern terrorist/extremist groups. In our comparison of terrorist/extremist web sites to U.S. government web sites, we found that terrorist/extremist groups exhibited levels of web knowledge similar to those of U.S. government agencies. Moreover, terrorists/extremists had a strong emphasis on multimedia usage, and their web sites employed significantly more sophisticated multimedia technologies than government web sites.
• Chapter 9: Authorship Analysis. In this study we addressed the online anonymity problem by successfully applying authorship analysis to English and Arabic extremist group web forum messages. The performance impact of different feature categories and techniques was evaluated across both languages. In order to facilitate enhanced writing style identification, a comprehensive list of online authorship features was incorporated. Additionally, an Arabic language model was created by adopting specific features and techniques to deal with the challenging linguistic characteristics of Arabic, including an elongation filter and a root clustering algorithm.
• Chapter 10: Sentiment Analysis. In this study the use of sentiment analysis methodologies is proposed for classification of web forum opinions in multiple languages. The utility of stylistic and syntactic features is evaluated for sentiment classification of English and Arabic content. Specific feature extraction components are integrated to account for the linguistic characteristics of Arabic. The Entropy Weighted Genetic Algorithm (EWGA) is also developed, which is a hybridized genetic algorithm that incorporates the information gain heuristic for feature selection. The proposed features and techniques are evaluated on U.S. and Middle Eastern extremist web forum postings.
• Chapter 11: Affect
Analysis. Analysis of affective intensities in computer-mediated communication is important in order to allow a better understanding of online users’ emotions and preferences. In this study we compared several feature representations for affect analysis, including learned n-grams and various automatically- and manually-crafted affect lexicons. We also proposed the support vector regression correlation ensemble (SVRCE) method for enhanced classification of affect intensities. Experiments were conducted on U.S. domestic and Middle Eastern extremist web forums.
• Chapter 12: CyberGate Visualization. Computer-mediated communication (CMC) analysis systems are important for improving participant accountability and researcher analysis capabilities. However, existing CMC systems focus on structural features, with little support for analysis of text content in web discourse. In this study we propose a framework for CMC text analysis grounded in Systemic Functional Linguistic Theory. Our framework addresses several ambiguous CMC text mining issues, including the relevant tasks, features, information types, feature selection methods, and visualization techniques. Based on it, we have developed a system called CyberGate, which includes the Writeprint and Ink Blot techniques. These techniques incorporate complementary feature selection and visualization methods in order to allow a breadth of analysis and categorization capabilities.
• Chapter 13: Dark Web Forum Portal. The Dark Web Forum Portal provides web-enabled access to critical international jihadist web forums. The focus of this chapter is on the significant extensions to previous work, including: increasing the scope of our data collection; adding an incremental spidering component for regular data updates; enhancing the searching and browsing functions; enhancing multilingual machine translation for Arabic, French, German, and Russian; and advanced Social Network Analysis. A case study on identifying active jihadi participants
in web forums is shown at the end.

Part III Dark Web Research: Case Studies

• Chapter 14: Jihadi Video Analysis. This chapter presents an exploratory study of jihadi extremist groups’ videos using content analysis and a multimedia coding tool to explore the types of video, groups’ modus operandi, and production features that lend support to extremist groups. The videos convey messages powerful enough to mobilize members, sympathizers, and even new recruits to launch attacks that are captured (on video) and disseminated globally through the Internet. The videos are important for jihadi extremist groups’ learning, training, and recruitment. In addition, the content collection and analysis of extremist groups’ videos can help policy makers, intelligence analysts, and researchers better understand the extremist groups’ terror campaigns and modus operandi, and help suggest counter-intelligence strategies and tactics for troop training.
• Chapter 15: Extremist YouTube Videos. In this study, we propose a text-based framework for video content classification of online video-sharing web sites. Different types of user-generated data (e.g., titles, descriptions, and comments) were used as proxies for online videos, and

22 Botnets and Cyber Criminals

Investigating the Botnet World

Fig. 22.4 A small component of the hierarchical clustering results showing the inherent structure in three of the discussed criminal gangs.

Fig. 22.5 Several interesting groups found based on concentrations of criminal events and membership size. The columns, from left to right, represent the criminal nicknames, the number of C&C channels controlled by the gang (C), the number of unique DDoS targets (D), estimated number of bots (B), and number of PSTORE compromises of victim passwords (P).

... users. Overall, the results show some of the largest gangs personally observed by the author to be prominent in the botnet scene. While the first groups were well known by the author, the last shown, as well as some 20
other groups not listed, were brought to light as potential future targets requiring investigative scrutiny. The overall hierarchy suggests that most criminals cooperate within private trusted syndicates; however, almost all individuals monitored are loosely connected through covert black market hub channels found sprawled among the underground economy.

Future Work

While the analysis techniques outlined above are an excellent start for obtaining a big-picture look at the botnet underground, several additional steps must be taken to better filter and condition the data. Furthermore, additional investigation is needed to better scrutinize the motivations behind the logged crimes.

To better profile the DDoS attack motivations of these gangs, the actual DDoS targets must be individually scrutinized. Current work is in progress to quantify the distribution of targets among geographic and national network ranges. Additionally, IP addresses could be correlated with latitude and longitude to determine any geopolitical or regional focus. Ideally, attacked web sites should be automatically text mined to determine the subject matter. If correlations between many attacked sites are found over a given time period, coordinated hacktivism could potentially be automatically detected and tied to the correct criminal gangs.

Work is being done to utilize graph-mining approaches to assist in the investigation of cybercrime data. By considering observed entities such as nicknames, IP addresses, DNS and whois records, and other classifiable features within botnet logs, sandnet analysis, and web page spidering, it is hoped that link analysis will discover hidden connections/causations between seemingly unrelated events. Additionally, we are interested in additional scrutiny of any links between extremist forums and botnet-related technologies.

Conclusions

The Cyber Underground has dramatically evolved in recent years to encompass world-influencing crimes with large sums of money at stake. Some
prolific players in the botnet scene have built, upon the resources of countless infected and unwilling drone machines, computerized armies with unbelievable power Behind these spreading networks is a community of cyber criminals engaging in serious crimes that influence people, markets, and entire nations References 439 As these communities dive deeper underground and as encryption technologies become more widespread, tracking these miscreants and their botnets will become more and more challenging Although the importance of computer security has finally entered the public consciousness, the problem of ubiquitous cybercrime will likely not be significantly combated in the coming years without substantial policy change The only defense we have against this new wave of cybercrime is for individuals to secure themselves at the personal level and for researchers and policy makers to keep an open mind concerning the complexity of the problems at hand Acknowledgments The author would like to thank the UA Dark Web team for their support of ongoing terrorism and cybercrime research, and the volunteer effort of The Shadowserver Foundation for helping to make the Internet a safer place References Abbasi, A and Chen, H (2008), “Analysis of Affect Intensities in Extremist Group Forums,” In: Terrorism Informatics, E Reid and H Chen, Eds., Springer, pp 285–307 Dagon, D., Zou, C and Lee, W (2006), “Modeling Botnet Propagation Using Time Zones.” In Proceedings of the 13th Network and Distributed System Security Symposium (NDSS) Holtz, Thorsten (2005), “A Short Visit to the Bot Zoo,” IEEE Security and Privacy, Vol 3, No pp 76–79 Krebs, Brian (2007) “Terrorism’s Hook Into Your Inbox.” Washington Post, July 5, 2007 http:// www.washingtonpost.com/wp-dyn/content/article/2007/07/05/AR2007070501153.html McCarthy, Bill (2003) “Botnets: Big and Bigger,” IEEE Security and Privacy, Vol 1, No 4, pp 15–23 Nazario, Jose (2007) “Botnet Tracking: Tools, Techniques, and Lessons Learned,” Black 
Hat DC 2007 Presentations, https://www.blackhat.com/presentations/bh-dc-07/ Nazario/Paper/bh-dc07-Nazario-WP.pdf.mwcollect Alliance, Nepenthes honeypot http://mwcollect.org/; http:// nepenthes.carnivore.it/ Smith, Brad (2008), “A Storm (Worm) Is Brewing,” IEEE Technology News, Vol 41, No 2, pp 20–22 Spitzner, L (2003), “The Honeynet Project: Trapping the Hackers,” IEEE Security and Privacy, Vol 1, No 2, pp 15–23 Sunbelt Software CWSandbox, http://www.cwsandbox.org The Shadowserver Foundation, http://www.shadowserver.org Thomas, Rob and Martin, Jerry (2006) “The Cyber Underground Economy: Priceless,”;login: The USENIX Magazine, Vol 31, No http://www.usenix.org/ publications/login/2006–12/openpdfs/ cymru.pdf Weimann, Gabriel (2006), Terror on the Internet: The New Arena, the New Challenges Washington, D.C.: United States Institute of Peace Press Xu, Jennifer, Chen, Hsinchun, Zhou, Yilu and Qin, Jialun (2006) “On the Topology of the Dark Web of Terrorist Groups,” Intelligence and Security Informatics, ISI 2006, LNCS 3975, pp 367–376 Index A Ablation testing, 217–218 Affect analysis affect intensities, 208–209 Al-Firdaws vs Montada, 220–222 evaluation ablation testing, 217–218 experimental design, 215 feature set comparison, 216 hypotheses results, 218–220 technique comparison, 217 test bed, 214–215 features for generic n-gram, 208 lexicons, 206–207 pointwise mutual information (PMI) scoring mechanism, 207 prior affect analysis, 204 research framework, 210–213 research gaps, 209 research hypotheses, 214 Al-Firdaws vs Montada affect intensities, 222 posting frequency, 221 statistics, 220, 221 Al-Qaeda Cluster, 85 Ambiguity score, 211 Anti-Microsoft network, 407 Arabic language characteristics diacritics, 157 inflection, 156–157 word length and elongation, 157–158 Archival videos, 298 Asynchronous CMC modes HTTP-based modes, 110 SMTP-based modes, 110 Authorship analysis arabic language characteristics diacritics, 157 inflection, 156–157 word length and elongation, 
157–158 authorship identification, 154 accuracy, 163 collection and extraction, 162–163 experimentation, 163 categories of, 154 comparison of classification techniques, 164–165 feature types, 164 English and Arabic group models decision tree analysis, 165–166 feature usage analysis, 166–168 multilingual issues, 156 online messages, 155–156 research analysis techniques, 159 arabic characteristics, 159–160 feature sets, 160–161 test bed, 158–159 techniques, 155 and writeprint, writing style features content-specific features, 155 lexical features, 154 structural features, 154–155 syntax, 154 Authorship characterization, 154 comparison of, 166 Authorship identification, 154 accuracy, 163 procedure, 162–163 H Chen, Dark Web: Exploring and Data Mining the Dark Side of the Web, Integrated Series in Information Systems 30, DOI 10.1007/978-1-4614-1557-2, © Springer Science+Business Media, LLC 2012 441 442 B Back-link search, 75–76 Bag-of-words (BOWs), 208 BFS See Breadth-first search (BFS) strategy Biomedical informatics, 22–23 Bioterrorism research 1996–2000, human agents/diseases literature, 364 2001–2005, human agents/diseases literature, 364–365 collaboration status, 362–364 design data acquisition, 360 data analysis, 361 data parsing and cleaning, 360 growth rate, 365 knowledge mapping techniques information visualization, 357–358 network analysis, 357 text mining, 356–357 literature analysis, 356 objectives, 355 productivity status, 361–362 test bed, 358–359 BlogPulse, 48 BoardPulse, 48 Border security, 20 Botnet attacker, 428 click-through fraud, 429 criminal social networks covert black market hub channels, 438 groups, criminal events and membership size, 436–437 hierarchical agglomerative clustering algorithm, 436–437 monitored botnet herders, 435–436 weights, 436 dataset processing entity resolution, 434–435 NICK event, 433 population estimation, 435 population modeling, 434 PRIVMSG events, 433 distributed denial of service (DDoS), 428 espionage, 428 proxies, 429 
Shadowserver Foundation honeypots, 431 malware analysis, 431–432 snooping, 432 spamming-related activities, 428 terrorism online, 430 truncated log snippet, 434 Index underground economy, 429–430 web-based infection mechanisms, 428 Breadth-first search (BFS) strategy, 58–59 Brute force approach, 237 C C4.5, 159, 309 Capability-accessibility-intent model, 344–345 Catastrophic terrorism, 21 Caucasian Cluster, 85 Caveats, 99–100 Character-based lexical features, 154 Chat Circles, 231 Civil liberties, 28–29 Clearguidance.com -Gate visualization forum author, 249 religious expert, 250 member interaction network, 248–249 CMC See Computer-mediated communication (CMC) CMI See Computer-mediated interaction (CMI) Common neighbor effect, 100 Communication Garden, 230, 231 Computer-mediated communication (CMC) categories of, 230 characteristics of, 227 Chat Circles, 231 computer-mediated interaction (CMI), 105–106 content analysis model, 229 definition, 105 design framework for decision support system, 238–239 ISDT design product, 239 meta-requirements and meta-design, 240, 241 hybrid interactional coherence (HIC) algorithm, 106–107 interactional coherence analysis, 106 experimental results, 122 hybrid interactional coherence (HIC) system design, 114–120 hypotheses, 122 hypotheses results, 122–123 linguistic features, 112 link-based techniques, 113 LNSG forum, 123 manual analysis techniques, 113 obstacles to, 107–108 previous studies, 109 Index research domains, 109–110 research gaps, 113–114 similarity-based techniques, 113 system features, 111–112 taxonomy, 108–109 test bed, 120–121 social translucence, 228 text analysis., 232–233 text-based modes of, 227 text mining features, 235–236 feature selection, 236–237 information types, 234–235 tasks, 233–234 visualization, 237–238 Computer-mediated interaction (CMI), 105–106 Content analysis, dark Web information collection and analysis, 75–77 extremist groups cluster activity scale, 402 ecoterrorism, 403 web communities, 403 
www.stormfront.org, 402 information services for studying terrorism, 73 information technologies for combating terrorism, 74–75 Jihad on Web case study Al-Qaeda Cluster, 85 Caucasian Cluster, 85 expert validation, 87–88 Hezbollah Cluster, 85 Hizb-Ut-Tahrir, 85 information analysis, 79–86 information collection, 78 information filtering, 78–79 Jihad Supporters, 85 multidimensional scaling visualization, 84–85 Palestinian Cluster, 85 snowflake visualization, 85–87 Tanzeem-e-Islami Cluster, 85 Web sites, 80–84 terrorists use of Web, 72–73 Content-specific features, 155, 306 Core Arabs, 93 Corpus-based approach, 259 Criminal network analysis, 26–27 Criminal social networks covert black market hub channels, 438 groups, criminal events and membership size, 436–437 443 hierarchical agglomerative clustering algorithm, 436–437 monitored botnet herders, 435–436 weights, 436 Critical infrastructure and key assets protection, 20–21 Cyber-archaeology framework extremist movements, 323 high-risk behavior, 324 phase 1, 324–325 phase cyber-artifact classification, 327–332 cyber-artifact collection, 325–327 phase betweenness, 337 distribution of discussion and materials pages, 333–334 key lexicon terms and levels of occurrence, 334–335 link network snapshots, 336 network analysis and visualization techniques, 332 site map featuring intensity, 333–334 structure of link network, 335 social movement research, 321–323 Cyber-artifacts analysis degree of node and betweenness, 416 forum messages, key Chinese phrases, 416–417 web content, 416 classification attack report page, 328–329 extended feature set, 330–331 general discussion page, 328–329 genre classification, 332 genres of communication, 327 recon report page, 328–329 stylistic features, 330 support vector machine (SVM) model, 330 syntactic structure, 328 tactics page, 328, 330 weapons page, 328–329 Web page genres, 328 collection articles, Clearwisdom.net, 414–416 crawling parameters, 326 description, core Web sites, 327 focused 
crawler, 325 forum content, 415–416 frequency distribution, 326–327 Web pages and sites, 326–327 web sites, 414–415 444 CyberGate visualization Clearguidance.com forum author, 249 member interaction network, 248–249 religious expert, 250 computer-mediated communication (see also Computer-mediated communication (CMC)) categories of, 230 Chat Circles, 231 content analysis model, 229 design framework for, 238–241 text analysis., 232–233 text mining (see Text mining) system design feature selection, 242–243 information types and features, 241 visualizations, 243–245 writeprints and ink blots, 245–248 D Dark network analysis caveats common neighbor effect, 100 preferential attachment effect, 100 random effect, 100 Core Arabs, 93 gang criminal network, 94 global Salafi jihad (GSJ) terrorist network, 93–94 Maghreb Arabs, 93 narcotics-trafficking criminal network, 94 network robustness, 100–102 scale-free properties, 98–99 small-world properties, 96–97 Southeast Asians, 93 terrorist Web site network, 94 topological analysis, 91–93 complex systems, 92 random networks, 92 scale-free network, 92–93 small-world network, 92 Dark networks, 27 Dark Web analysis and visualization authorship analysis and writeprint, content analysis, dark Web forum portal, IEDs, sentiment and affect analysis, social network analysis (SNA), video analysis, Web metrics analysis, Index Dark web attribute system (DWAS) content analysis, 132–133 content richness attributes, 137–138 dark web collection building approach terrorist/extremist group identification, 134–135 terrorist/extremist group URL identification, 135–136 terrorist/extremist web site content downloading, 136 terrorist URL expansion, link and forum analysis, 135–136 technical sophistication attributes, 137–138 terrorism and Internet, 128–129 terrorist/extremist web sites collection, 129–132 terrorists’ use of web, 129 web interactivity (WI) attributes, 137, 139 Dark Web collection forum spidering, multimedia spidering, Web site spidering, 
Dark Web forum accessibility experiment, 62–63 Dark Web forum collection statistics, 65–66 Dark Web forum collection update experiment, 63–65 Dark Web forum crawling system, 53, 55–56 Dark Web forum crawling system interface, 61–62 Dark Web forum portal, active participants, identification of, 268–269 functions forum browsing and searching, 265–266 social network visualization, 266–268 incremental forum spidering, 258–259 motivation and research, 260 multilingual translation, 259 SNA, 259–260 statistics of, 263, 264 system design data acquisition, 261–263 data preparation, 263 system functionality, 263 Dark Web forum spidering system See Dark Web forum crawling system Dark Web project funding and acknowledgments, 16–17 IEEE Intelligence and Security Informatics Conference, partnership acknowledgments, 17–18 press coverage and interest, 8–9 Index publications, 10–16 team members, Dartmouth Institute for Security Technology Studies (ISTS), 39–40 Data clustering, 77 Data mining dark Web analysis and visualization authorship analysis and writeprint, content analysis, dark Web forum portal, IEDs, sentiment and affect analysis, social network analysis (SNA), video analysis, Web metrics analysis, dark Web collection forum spidering, multimedia spidering, Web site spidering, Data visualization, 77 Decision support system (DSS), 239 Depth-first search (DFS) strategy, 58–59 Dictionary-based approach, 259 Document-level sentiment polarity categorization, 174 Domain spidering, 75–76 Domestic counterterrorism, 20 Domestic extremist videos, 311–312 Domestic security, 28–29 E Ecoterrorism, 403 Emergency preparedness and responses, 21 Emotional intelligence, 204 Entropy weighted genetic algorithm (EWGA) crossover, 187 evaluation structure and selection, 186–187 illustration of, 185 information gain, 186 mutation, 188 solution structure and initial population, 186 steps for, 185 EWGA See Entropy weighted genetic algorithm (EWGA) Extremist See Terrorism F Focused Web crawlers 
accessibility, 47 collection type, 47–48 collection update procedure, 50 content richness, 48 focused crawling, 50–51 URL ordering features, 48–49 URL ordering techniques, 49–50 Focused web crawling, WMD accessibility, 344 content richness, 344 dark web computer-mediated communication (CMC) sources, 347 content analysis, 348 interview, Iraqi nuclear scientists, 351–352 nuclear tutorial for the Mujahedeen (NTM), 352 system design, 349–350 nuclear web, 343 URL ordering techniques, 344 Forum spidering, accessibility, 53–54 content richness, 52 dark Web forum crawling/spidering system, 53 dark Web forum crawling system interface, 61–62 forum storage and analysis duplicate multimedia removal, 61 statistics generation, 60 focused crawling of hidden Web, 52 incremental crawling for collection updating, 54 log and parsed log, 59–60 system design, forum identification, 54–56 system design, forum preprocessing forum accessibility, 56–57 forum structure, 58–59 wrapper generation, 59 system evaluation forum accessibility experiment, 62–63 forum collection statistics, 65–66 forum collection update experiment, 63–65 Web crawlers accessibility, 47 collection type, 47–48 collection update procedure, 50 content richness, 48 focused crawling, 50–51 URL ordering features, 48–49 URL ordering techniques, 49–50 Web forum collection update strategies, 52 Fuzzy semantic typing, 206 G Gang criminal network, 94 Gaussian mixture model (GMM), 302 Gawaher forum, 266 Gender classification, web forums classification and evaluation, 380–381 feature generation feature extraction, 378 feature selection, 379–380 unigram/bigram preselection, 378–379 message acquisition, 377 Genre theory, 230 Global Salafi jihad (GSJ) terrorist network, 93–94 Group/personal profile search, 75–76 H Hezbollah Cluster, 85 HIC See Hybrid interactional coherence (HIC) system design Hidden Markov model (HMM), 302 Hizb-Ut-Tahrir, 85 Homeland security See National security HTTP-based CMC modes, 110 Hybrid interactional
coherence (HIC) system design data preparation, 115–116 linguistic feature match, HIC algorithm direct address match, 117–118 lexical match algorithm, 118–119 residual match, HIC algorithm, 119–120 system feature match, HIC algorithm header information match, 116 quotation match, 116 I ICPVTR See International Center for Political Violence and Terrorism Research (ICPVTR) ID3, 309 IEEE Intelligence and Security Informatics Conference, Improvised explosive devices (IED), cyber-archaeology framework phase 1, 324–325 phase 2, 325–332 phase 3, 332–337 social movement research, 321–323 cyber protest, 319 Zapatista model, 320 Incremental crawling, 50 Incremental forum spidering, 258–259 Incremental spidering, 262–263 Indexing, 77 Information analysis, 77, 79–86 Information classification, 77 Information collection, 78 Information extraction, 77 Information filtering, 76, 78–79 Information technology (IT) See National security Ink Blot technique principal component analysis, 245 process illustration, 247–248 steps for, 247 Intelligence and security informatics (ISI) vs biomedical informatics, 22–23 criminals and crime characteristics, 21 IT and national security (see National security) research framework civil liberties, 28–29 crime association mining, 26 crime classification and clustering, 26 crime, definition, 24 crime types and security concerns, 25 criminal network mining, 26 dark networks, 27 data mining, 27 domestic security, 28–29 information sharing and collaboration, 26 intelligence text mining, 26 knowledge discovery, 25–26, 28–29 research opportunities, 29 spatial and temporal crime mining, 26 research opportunities, 24 security and intelligence analysis techniques, 22 security- and intelligence-related data characteristics, 21–22 Intelligence and warning, 20 Interactional coherence analysis See Computer-mediated communication (CMC) Inter-coder reliability affect analysis, 215 interactional coherence analysis, 121 sentiment analysis, 190 videos, content
analysis of, 280 International Center for Political Violence and Terrorism Research (ICPVTR), 39–40 International Falun Gong (FLG) movement cyber-artifact analysis degree of node and betweenness, 416 forum messages, key Chinese phrases, 416–417 web content, 416 cyber-artifact collection articles, Clearwisdom.net, 414–416 forum content, 415–416 web sites, 414–415 forum content analysis author interaction, 422–423 threads, 419–422 functions, 413 link analysis, 417–419 organizational and doctrinal mechanisms, 413 practice and principle components, 412 research design, 413–414 social movement organizations (SMO) and Internet, 409–410 social movement theory, 408–409 social network analysis, 410–411 web content analysis, 419 writeprints, 411–412 Internet, 31–33 Internet Haganah, 34, 36 ISI See Intelligence and security informatics (ISI) ISTS See Dartmouth Institute for Security Technology Studies (ISTS) J Jihadi video analysis Afghani Mujahideen and Chechen, 274 content analysis inter-coder reliability, 280 multimedia coding tool (MCT), 279–280 sample collection, 279 extremist groups, 275 extremist group’s video categorization, 278 extremist group’s video collection, 276–277 groups identification, 284–285 group’s modus operandi and production features, 286–287 RPG attack, 286 video dissemination, 275–276 video types documentary videos, 280–282 individual-oriented vs group-oriented videos, 283–284 operational vs nonoperational videos, 283–284 suicide attack videos, 282–283 Jihad on Web case study Al-Qaeda Cluster, 85 Caucasian Cluster, 85 expert validation, 87–88 Hezbollah Cluster, 85 Hizb-Ut-Tahrir, 85 information analysis, 79–86 information collection, 78 information filtering, 78–79 Jihad Supporters, 85 multidimensional scaling visualization, 84–85 Palestinian Cluster, 85 snowflake visualization, 85–87 Tanzeem-e-Islami Cluster, 85 Web sites, 80–84 Jihad Supporters, 85 K Knowledge discovery, 25–26, 28–29 Knowledge mapping techniques bioterrorism research
information visualization, 357–358 network analysis, 357 text mining, 356–357 WMD information visualization, 343 network analysis, 343 text mining, 342–343 L Language resources, 235 Lexical features, 305 Lexical match algorithm (LMA), 118–119 Libertarian National Socialist Green Party (LNSG), 189 Link analysis See also Content analysis central/prominent nodes, 401 Militia and ecoterrorism clusters, 401 web community visualization, 400 Link analysis methods, sentiment analysis, 177 Link-based features, sentiment analysis, 176 LMA See Lexical match algorithm (LMA) Log, 59–60 M Machine learning algorithms authorship analysis, 155 sentiment analysis, 177 Machine translation-based approach, 259 Maghreb Arabs, 93 MCT See Multimedia coding tool (MCT) Memorial Institute for the Prevention of Terrorism (MIPT), 35, 37 MEMRI See Middle East Media Research Institute (MEMRI) Meta-searching, 76 Middle Eastern Terrorist Groups, case study benchmark comparison results content richness, 145–146 technical sophistication, 142–144 Web interactivity, 146–148 dark web research test bed terrorist/extremist web collection file types, 141 US government web collection file types, 142 Middle East Media Research Institute (MEMRI), 35, 36 MIPT See Memorial Institute for the Prevention of Terrorism (MIPT) Movies and movie previews, 298 Multidimensional scaling (MDS) algorithm, 398 Multidimensional scaling visualization, 84–85 Multilingual translation, 259 Multimedia coding tool (MCT), 279–280 Multimedia spidering, N Narcotics-trafficking criminal network, 94 National security border and transportation security, 20 catastrophic terrorism, 21 critical infrastructure and key assets protection, 20–21 domestic counterterrorism, 20 emergency preparedness and responses, 21 intelligence and warning, 20 Neural network ensemble, 212 Nuclear tutorial for the Mujahedeen (NTM), 352 Nuclear web accessibility, 345–346 capability, 345 intent, 346 model, 344–345 O Online social media text classification
authorship classification, 372, 374 content-specific features, 375 gender classification, 374 lexical features, 375 sentiment classification, 374 structural features, 375 syntactic features, 375 taxonomy of, 372, 373 types of, 376 Opaque proxy servers, 57 P Pairwise t tests, 164 affect analysis, 218–220 sentiment analysis, 191, 192 Palestinian Cluster, 85 Parsed log, 59–60 PeopleGarden, 230, 231 Periodic crawling, 50 Pointwise mutual information (PMI) scoring mechanism, 207 Preferential attachment effect, 100 Processing resources, 235–236 R Random effect, 100 Reply network, 260 Research opportunities, ISI, 24 S Sandbox analysis, 432 Score-based methods, sentiment analysis, 177 Search for International Terrorist Entities (SITE), 38 Semantic orientation (SO) method, 175, 207 Sentence-level polarity categorization, 174 Sentiment analysis and affect analysis, analysis tasks, 173–174 classification techniques, 176–177 domains, 177–178 extremist group, 172 features, 175–176 polarity classification, taxonomy of, 173 research design, 179–181 research gaps sentiment classification, feature reduction for, 179 stylistic features, 178–179 web forums, 178 system design classification, 188 EWGA (see Entropy weighted genetic algorithm (EWGA)) feature extraction, 181–184 system evaluation classification accuracy, 188 features, 190–191 feature selection techniques, 191–192 test bed, 189–190 US forum, EWGA, 192, 193 Shadowserver Foundation honeypots, 431 malware analysis, 431–432 snooping, 432 Simon Wiesenthal Center, 37–38 SITE See Search for International Terrorist Entities (SITE) SMTP-based CMC modes, 110 Social movement organizations (SMO), 409–410 Social movement theory, 408–409 Social network analysis (SNA), 5, 410–411 function interface, 266–267 on web forums, 259–260 Social networking sites, Southeast Asians, 93 Snowflake visualization, 85–87 Specific scenario videos, 298 START See Study of Terrorism and Responses to Terrorism (START) Structural features,
authorship analysis, 154–155 Study of Terrorism and Responses to Terrorism (START), 36, 37 Stylistic features, sentiment analysis, 176, 194–196 Stylometry, 154 Sub-forum list page spidering, 262 Subset selection methods, 237 Support vector machine (SVM), 302 Support vector regression correlation ensemble (SVRCE), 212–213 Syntactic features, 306 authorship analysis, 154 sentiment analysis, 175, 196 System design, YouTube data collection process, 304 feature generation process extraction, 305–306 selection, 307–308 sets, 306–307 text-based video classification, 304, 305 Systemic Functional Linguistic Theory, 234, 239 T Tanzeem-e-Islami Cluster, 85 Technical structure, authorship analysis, 155 Terrorism See also Jihadi video analysis; Middle Eastern Terrorist Groups, case study; National security social networking sites, videos and multimedia content, virtual worlds, Web blogs, Web forums, Web sites, Terrorism informatics definition, 31 Internet, 31–33 research centers and resources databases and online resources, 34–38 higher education research institutes, 39–40 Think Tanks and intelligence resources, 33–34 Terrorism online, 430 Terrorist Web site network, 94 Text mining See also Data mining features, 235–236 feature selection, 236–237 information types, 234–235 tasks, 233–234 visualization, 237–238 The Jihad and Terrorism Project, 73 The Project for Research of Islamist Movements, 73 Thread list page spidering, 262 Threads, 419–422 Translucent proxy servers, 57 Transparent proxy servers, 57 Transportation security, 20 TV programs, 298 U Ummah.com CyberGate visualization computer networking expert, 251, 252 women-issue expert, 251 member interaction network, 250, 251 URL traversal strategies breadth-first search (BFS), 58–59 depth-first search (DFS), 58–59 US domestic extremist groups approach architecture, 395–396 collection building, 395–397 content analysis, 398–399 link analysis, 397–398 content analysis cluster activity scale, 402 ecoterrorism, 403 US
domestic extremist groups (cont.) web communities, 403 www.stormfront.org, 402 link analysis central/prominent nodes, 401 Militia and ecoterrorism clusters, 401 web community visualization, 400 social movement research, 392–393 test bed, 399 web harvesting approaches, 393–394 web link and content analysis, 394–395 V Vector space model (VSM), 118 Video analysis, Video-sharing web sites See YouTube videos Violent affect lexicon, 212 Virtual worlds, VSM See Vector space model (VSM) W Weapons of mass destruction (WMD) case study dark web, 347–352 nuclear web, 346–347 focused web crawling, 343–344 knowledge mapping information visualization, 343 network analysis, 343 text mining, 342–343 nuclear web and dark web accessibility, 345–346 capability, 345 intent, 346 model, 344–345 Web blogs, Web crawlers accessibility, 47 collection type, 47–48 collection update procedure, 50 content richness, 48 definition, 45 focused crawling, 50–51 URL ordering features, 48–49 URL ordering techniques, 49–50 Web forum message acquisition, 377 Web forums, Web harvesting approaches, 393–394 Web link and content analysis, 394–395 Web metrics analysis, Web mining, 74–75 See also Data mining Web sites, 3, 80–84 Web site spidering, WMD See Weapons of mass destruction (WMD) Women’s forums international Islamic politics communication channel, 385 female- and male-preferred unigrams and bigrams, 383–384 gender and personality differences, blogs, 384 hypotheses, 382 number of features, feature set, 382, 383 performance measures, feature set, 382, 383 test bed, 381 online gender differences, 370–372 online text classification authorship classification, 372, 374 content-specific features, 375 gender classification, 374 lexical features, 375 sentiment classification, 374 structural features, 375 syntactic features, 375 taxonomy of, 372, 373 types of, social media texts, 376 research design classification and evaluation, 380–381 feature extraction, 378 feature selection, 379–380 unigram/bigram
preselection, 378–379 web forum message acquisition, 377 research gaps, 376–377 Word-based lexical features, 154 WordNet lexicons, 207–208 Wrapper generation, 59 Writeprints, 411–412 Writeprint technique principal component analysis, 245 process illustration, 246 steps for, 246 Writing style features, authorship analysis content-specific features, 155 lexical features, 154 structural features, 154–155 syntax, 154 Y YouTube videos accuracy, feature sets and techniques, 311 classification and evaluation, 308–309 feature counts, feature sets, 311 flag mechanism, 296 Gaussian mixture model (GMM), 302 hidden Markov model (HMM), 302 hypotheses, 310 research gaps, 302–304 semantic gap, 297 social network analysis, white supremacy groups, 311–312 support vector machine (SVM), 302 system design data collection process, 304 feature generation process, 305–308 text-based video classification, 304, 305 taxonomy of, 297 test bed, 309–310 types nontext features, 299–300 text features, 300–301 video domains, 298–299 Z Zapatista model, 320

Date posted: 23/10/2019, 15:17