How to install Apache Nutch on Windows 10 (slides in English)
Apache Nutch
Professor: Kim Kyoung-Yun
Team 4: Nguyen Thi Phuong Hang, Park Minsoo, Ali Usman

Content
I What is Apache Nutch?
II How to install Apache Nutch on Windows 10?
III How to crawl a web?

I What is Apache Nutch?
• Apache Nutch is a highly extensible and scalable open-source web crawler software project
• Runs on top of Hadoop
• Mostly used to feed a search index, also for data mining
• Customizable/extensible plugin architecture

II How to install Apache Nutch on Windows 10
Requirements:
+ Windows-Cygwin environment
+ Java Runtime/Development Environment (JDK 11 / Java 11)
+ (Source build only) Apache Ant: https://ant.apache.org/

Installing Cygwin
• Download the Cygwin installer and run setup.exe: https://www.cygwin.com/install.html

Installing Java Runtime/Development Environment
• Download Java SE Development Kit 11 for Windows and run the exe file: https://www.oracle.com/kr/java/technologies/javase/jdk11-archive-downloads.html
• Set the environment variables JAVA_HOME and PATH
• Check the installed Java version

Installing Apache Ant
• Download and install Apache Ant (1.10.12): https://ant.apache.org/
• Set the environment variables ANT_HOME and PATH
• Check with ant -version that the installation was successful

Installing Nutch
• Download a binary package (apache-nutch-1.X-bin.zip, here 1.19): https://archive.apache.org/dist/nutch/
• Unzip the Nutch package. There should be a folder apache-nutch-1.X
• Move the folder apache-nutch-1.X (nutch_home) into cygwin64/home
• Verify the Nutch installation:
+ Open the cygwin64 terminal
+ Run bin/nutch from {nutch_home}: $ bin/nutch

III How to crawl a web
Crawler Workflow
• initialize crawldb, inject seed URLs
• generate fetch list: select URLs from crawldb for fetching
• fetch URLs from fetch list
• parse documents: extract content, metadata, and links
• update crawldb: status, score, and signature; add new URLs inline or at the end of one crawler run
• invert links: map anchor texts to the documents the links point to
• calculate link rank and web graph, and update crawldb
• deduplicate documents by signature
• index document content, metadata, and anchor texts

Installing Apache Solr (8.11.2)
• Download and unzip Apache Solr: https://archive.apache.org/dist/lucene/solr/8.11.2
• Check the Solr installation:
+ Start: bin\solr.cmd start
+ Go to: http://localhost:8983/solr/#/
+ Status: bin\solr.cmd status
+ Stop: bin\solr.cmd stop

Crawl a site
Customize your crawl properties and configure the regular expression filters in the conf folder of {nutch_home}:
+ conf/nutch-site.xml
+ conf/regex-urlfilter.txt

Create a URL seed list
+ Create a urls folder
+ Create a file seed.txt under the urls folder and add the site to crawl
For example: https://www.youtube.com/

+ Open the Cygwin terminal

Seeding the crawldb with a list of URLs
+ bin/nutch inject crawl/crawldb urls

Fetching
+ bin/nutch generate crawl/crawldb crawl/segments
+ s1=`ls -d crawl/segments/2* | tail -1`
  echo $s1
+ bin/nutch fetch $s1
+ bin/nutch parse $s1
+ bin/nutch updatedb crawl/crawldb $s1
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
  bin/nutch fetch $s2
  bin/nutch parse $s2
  bin/nutch updatedb crawl/crawldb $s2

Invertlinks
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
Dump to a file and take a look:
  bin/nutch readlinkdb crawl/linkdb -dump out2

Crawl a site: some errors and solutions
Solution:
+ Download and extract the Hadoop package: https://archive.apache.org/dist/hadoop/common/
+ Set the environment variables %HADOOP_HOME% and PATH
+ Download the winutils.exe binary from a Hadoop redistribution and extract it to a folder:
  https://github.com/steveloughran/winutils
+ Replace the bin folder of the Hadoop folder with the bin folder from the Hadoop redistribution, which contains winutils.exe
+ Copy hadoop.dll from the bin folder of the Hadoop redistribution into C:/Windows/System32

Using apache-nutch-1.18 produces the above error
Solution:
+ Upgrade Apache Nutch to version 1.19
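
The generate, fetch, parse, and updatedb commands from section III repeat for every crawl round, so they can be collected into a small helper script. The following is only a sketch under assumptions: the names NUTCH, latest_segment, and crawl_round are made up for illustration and are not part of Nutch, and the script is assumed to be run from {nutch_home} in the Cygwin terminal.

```shell
#!/usr/bin/env bash
# Sketch of one Nutch 1.x crawl round (generate -> fetch -> parse -> updatedb).
# NUTCH, latest_segment and crawl_round are illustrative names, not Nutch commands.

NUTCH="${NUTCH:-bin/nutch}"   # assumed path to the nutch launcher, relative to {nutch_home}

# Print the newest segment directory.
# Segment names are timestamps, so the last entry in sorted order is the newest.
latest_segment() {
  ls -d "$1"/2* | tail -1
}

# Run one crawl round.
#   $1 = crawl directory (contains crawldb and segments)
#   $2 = optional -topN value for the generate step
crawl_round() {
  local crawl_dir="$1" top_n="${2:-}"
  local segment
  if [ -n "$top_n" ]; then
    "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$top_n" || return 1
  else
    "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
  fi
  segment="$(latest_segment "$crawl_dir/segments")" || return 1
  echo "segment: $segment"
  "$NUTCH" fetch "$segment" &&
    "$NUTCH" parse "$segment" &&
    "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment"
}

# Example session (not executed here), mirroring the slides:
#   "$NUTCH" inject crawl/crawldb urls
#   crawl_round crawl
#   crawl_round crawl 1000
#   "$NUTCH" invertlinks crawl/linkdb -dir crawl/segments
```

Each step is chained with && so that a failed fetch or parse stops the round instead of updating the crawldb with an incomplete segment. Setting NUTCH=echo before calling crawl_round prints the commands without running them, which is a convenient way to check the paths first.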