How to install and use Apache Nutch to crawl data from websites, and how to integrate it with Apache Solr for indexing

Apache Nutch
Professor Kim Kyoung-Yun

Content
I. What is Apache Nutch?
II. How to install Apache Nutch on Windows 10?
III. How to crawl the web?

I. What is Apache Nutch?
• Apache Nutch is a highly extensible and scalable open-source web crawler software project
• Runs on top of Hadoop
• Mostly used to feed a search index, but also used for data mining
• Customizable/extensible plugin architecture

II. How to install Apache Nutch on Windows 10
Requirements:
+ A Windows-Cygwin environment
+ Java Runtime/Development Environment (JDK 11 / Java 11)
+ (Source build only) Apache Ant: https://ant.apache.org/

Installing Cygwin
• Download the Cygwin installer and run setup.exe: https://www.cygwin.com/install.html

Installing the Java Runtime/Development Environment
• Download the Java SE Development Kit 11 for Windows and run the .exe file: https://www.oracle.com/kr/java/technologies/javase/jdk11-archive-downloads.html
• Set the JAVA_HOME and PATH environment variables
• Check the installed Java with java -version

Installing Apache Ant
• Download and install Apache Ant (1.10.12): https://ant.apache.org/
• Set the ANT_HOME and PATH variables
• Check with ant -version to confirm a successful installation

Installing Nutch
• Download a binary package, apache-nutch-1.X-bin.zip (here 1.19): https://archive.apache.org/dist/nutch/
• Unzip the Nutch package; there should be a folder apache-nutch-1.X
• Move the folder apache-nutch-1.X (referred to below as {nutch_home}) into cygwin64/home
• Verify the Nutch installation:
+ Open a Cygwin terminal
+ Run bin/nutch from {nutch_home}: $ bin/nutch

III. How to crawl the web?
Crawler workflow (a condensed shell sketch of one round follows this list):
• initialize the crawldb, inject seed URLs
• generate a fetch list: select URLs from the crawldb for fetching
• fetch URLs from the fetch list
• parse documents: extract content, metadata and links
• update crawldb status, score and signature; add new URLs inlined or at the end of one crawler run
• invert links: map anchor texts to the documents the links point to
• calculate link rank and web graph, and update the crawldb
• deduplicate documents by signature
• index document content, metadata and anchor texts
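Each workflow step above maps onto a bin/nutch subcommand. The following is a minimal sketch of one manual round, assuming seeds in a urls/ folder, a crawl/ working directory, and (for the final step) an index writer such as Solr already configured as described later in this guide:

# one crawl round, run from {nutch_home} in a Cygwin terminal
bin/nutch inject crawl/crawldb urls                     # seed the crawldb
bin/nutch generate crawl/crawldb crawl/segments         # build a fetch list
s=`ls -d crawl/segments/2* | tail -1`                   # newest segment
bin/nutch fetch $s                                      # download the pages
bin/nutch parse $s                                      # extract content, metadata, links
bin/nutch updatedb crawl/crawldb $s                     # fold results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments  # map anchor texts to link targets
bin/nutch dedup crawl/crawldb                           # mark duplicates by signature
bin/nutch index crawl/crawldb -linkdb crawl/linkdb $s -deleteGone  # push to the index

Every later round repeats generate, fetch, parse and updatedb over the URLs discovered so far.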
Installing Apache Solr (8.11.2)
• Download and unzip Apache Solr: https://archive.apache.org/dist/lucene/solr/8.11.2
• Check the Solr installation from {APACHE_SOLR_HOME}:
+ Start: bin\solr.cmd start
+ Browse to http://localhost:8983/solr/#/
+ Status: bin\solr.cmd status
+ Stop: bin\solr.cmd stop

Crawl a site
• Customize your crawl properties in conf/nutch-site.xml of {nutch_home} (at minimum, http.agent.name must be set or Nutch refuses to fetch) and configure the regular-expression URL filters in conf/regex-urlfilter.txt. A minimal sketch of this configuration appears at the end of this section.
• Create a URL seed list:
+ Create a urls folder
+ Create a file seed.txt under the urls folder and add a site to crawl, for example: https://www.youtube.com/
• Open a Cygwin terminal and crawl with the wrapper script (the final argument is the number of crawl rounds): bin/crawl -i -s urls crawl 2

Or run the individual steps yourself:
• Seed the crawldb with the list of URLs:
+ bin/nutch inject crawl/crawldb urls
• Fetch:
+ bin/nutch generate crawl/crawldb crawl/segments
+ s1=`ls -d crawl/segments/2* | tail -1`
+ echo $s1
+ bin/nutch fetch $s1
+ bin/nutch parse $s1
+ bin/nutch updatedb crawl/crawldb $s1
• Run a second round, limited to the 1000 top-scoring URLs:
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ s2=`ls -d crawl/segments/2* | tail -1`
+ echo $s2
+ bin/nutch fetch $s2
+ bin/nutch parse $s2
+ bin/nutch updatedb crawl/crawldb $s2
• Invert links:
+ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
• Dump the link database to a file and take a look:
+ bin/nutch readlinkdb crawl/linkdb -dump out2
• Read the databases:
1. bin/nutch readdb crawl/crawldb -stats > stats.txt
2. bin/nutch readdb crawl/crawldb/ -dump db
3. bin/nutch readlinkdb crawl/linkdb/ -dump link
4. bin/nutch readseg -dump crawl/segments/20131216194731 crawl/segments/20131216194731_dump -nocontent -nofetch -noparse -noparsedata -noparsetext

Some errors and solutions
• Missing Hadoop native binaries on Windows (winutils.exe / hadoop.dll). Solution (a condensed sketch appears at the end of this section):
+ Download and extract a Hadoop package: https://archive.apache.org/dist/hadoop/common/
+ Set the %HADOOP_HOME% environment variable and add its bin folder to PATH
+ Download the winutils.exe binary from a Hadoop redistribution: https://github.com/steveloughran/winutils
+ Replace the bin folder of the Hadoop folder with the redistribution's bin folder, which contains winutils.exe
+ Copy hadoop.dll from the redistribution's bin into C:\Windows\System32
• Using apache-nutch-1.18 produces the above error. Solution: upgrade Apache Nutch to 1.19.

Indexing in Solr – integrating Nutch with Solr
+ Go to the solr folder (solr-8.11.2)
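From here, a typical Nutch 1.19 / Solr 8.11.2 integration proceeds roughly as follows. This is a hedged sketch rather than the slides' own steps: the core name nutch is an assumption, {nutch_home} and {APACHE_SOLR_HOME} are placeholders for your actual paths, and the index-writers.xml edit follows the standard Nutch tutorial:

# run in a Cygwin terminal from {APACHE_SOLR_HOME}
mkdir -p server/solr/configsets/nutch
cp -r server/solr/configsets/_default/* server/solr/configsets/nutch/
cp {nutch_home}/conf/schema.xml server/solr/configsets/nutch/conf/  # Nutch's Solr schema
rm server/solr/configsets/nutch/conf/managed-schema                 # let schema.xml take effect
cmd /c "bin\solr.cmd start"
cmd /c "bin\solr.cmd create -c nutch -d server\solr\configsets\nutch\conf"
# point Nutch at the core: in {nutch_home}/conf/index-writers.xml, set the
# "url" parameter of the Solr writer to http://localhost:8983/solr/nutch
# then index the crawled data, run from {nutch_home}:
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments

After indexing, the documents can be queried from the Solr admin UI at http://localhost:8983/solr/#/.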
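The crawl-property customization referenced at the start of "Crawl a site", as a minimal sketch. The agent name MyNutchCrawler is a hypothetical example; only http.agent.name is strictly required, and the filter rule shown narrows the crawl to the seed host:

# run from {nutch_home}; writes a minimal conf/nutch-site.xml
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>  <!-- hypothetical crawler name; any identifier works -->
  </property>
</configuration>
EOF
# in conf/regex-urlfilter.txt, replace the final catch-all line "+."
# with a host-specific rule so only the seed site is followed, e.g.:
#   +^https?://(www\.)?youtube\.com/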
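Finally, the condensed winutils fix promised in the errors list above. The paths and Hadoop version folder are assumptions; match the redistribution to the Hadoop version your Nutch build expects:

# Cygwin sketch; /cygdrive/c/hadoop is the extracted Hadoop package (path is an assumption)
export HADOOP_HOME=/cygdrive/c/hadoop
export PATH="$PATH:$HADOOP_HOME/bin"
# for a permanent setting, also define HADOOP_HOME in the Windows environment variables
# overwrite Hadoop's bin with the redistribution bin that ships winutils.exe
cp -r /cygdrive/c/winutils/hadoop-3.0.0/bin/* "$HADOOP_HOME/bin/"  # version folder is an assumption
# hadoop.dll must also sit next to the Windows system libraries
cp "$HADOOP_HOME/bin/hadoop.dll" /cygdrive/c/Windows/System32/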
