Data Mining the Web Using Perl

Burt L. Monroe
Director, Quantitative Social Science Initiative
Department of Political Science
The Pennsylvania State University

Data-Mining the Web
• Examples
  • Election Returns in Luxembourg
    • Luxembourg Official Election Results, 2004
    • http://qssi.psu.edu/files/luxembourg.pl
  • Parliamentary Speech
    • The Congressional Record

How’d You Do That?
• There are several programming languages with “straightforward” facilities for doing this. Most notably,
  • Perl
  • Python
  • Java
• I’m going to talk about Perl, because
  • it’s the most established
  • it’s the one I know
• It appears that Python may be preferable, but that’s for someone else to say.

What’s Perl?
• Open source (free / flexible / extensible / a little wild and woolly – like Linux, R) programming language.
• It is very, very good at processing text.
  • Note: webpages are just texts.
  • Note: datasets (like a flat spreadsheet or Stata file) are just texts.
  • Social scientists might have some use for turning one into the other, no?
• It has very useful facilities for building
  • Spiders
  • Scrapers
  • (and “agents”, “robots”, “crawlers”, etc.)

What’s a Spider?
• A spider is a program designed to automatically gather webpages.
• If, for example, you want to automatically download all of the speeches delivered in Congress today – without manually clicking on every one, cutting and pasting, etc. – you might want to build a spider.

What’s a scraper?
• A scraper (or “screen-scraper”) extracts the information you want – whatever you consider to be data – from a given webpage.
• If you want to know who said “health” and how many times, you might want to build a scraper.

BEWARE!
• Spiders (and other similar types of programs – “robots”, “crawlers”) can be put to nefarious use:
  • appropriating copyrighted materials
  • extracting email addresses for spammers
  • overwhelming servers to create “denial of service”
  • generally violating a site’s “terms of service” or “acceptable use policy”
• If you are not careful to use legal and ethical good practices, you can
  • be denied access to a website altogether
  • get yourself or the university sued or even subjected to criminal penalties

Perl
• Open-source
• Cross-platform
  • (Windows – I recommend “ActivePerl” from http://www.activestate.com)
• There are many websites with resources:
  • http://www.cpan.org (Comprehensive Perl Archive Network)
  • http://www.perlmonks.org (PerlMonks)
  • http://www.perl.org
  • http://perl.oreilly.com (O’Reilly Publishing)
• Lots of mailing lists, etc.

Books
• Basics of Perl
  • The best books are put out by O’Reilly Publishing and are generally known by the animal on the cover.
  • Learning Perl (the Llama)
    • or, Learning Perl on Win32 Systems (the Gecko)
  • Programming Perl (the Camel)
• Web-mining
  • Perl & LWP (the Blesbok, apparently)
  • Spidering Hacks
• These books, and some others, are or will be available in the “QuaSSI Library” (in Pond 216).

Running Perl
• For machines with approved ActivePerl installations in Pond
  • Perl is located in c:/Perl/
• For today,
  • we will operate entirely in the directory c:/Perl/eg/
  • To get there,
    • open Programs -> Accessories -> Command Prompt
    • At the prompt, type c:
    • Type cd Perl/eg
• (In your particular installation, or on a Mac, or something like Unix on high performance computing, these details will be different.)

[...]

The First Perl Program
• Go to the QuaSSI Website for the example scripts for today’s workshop:
  • http://qssi.psu.edu/files/howdy.pl
• Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg\
• Open up the text-editor WinEdt (you could use almost anything) and then open howdy.pl
• That’s a complete Perl program
• Note: that’s all a program is – a text file

Running a Perl Program
• [...] command prompt
• Type perl -w howdy.pl
• (The -w tells perl to give you warnings about what might be wrong if the program is broken. It has to come before the script name; anything typed after the script name is passed to the script as an argument instead.)
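
The contents of howdy.pl are not reproduced in this excerpt. As a rough sketch of what a one-line greeting program of this kind might look like, assuming it simply prints a quoted string (the greeting text and the helper name greeting are invented here for illustration):

```perl
#!/usr/bin/perl
# Hypothetical stand-in for howdy.pl; the real workshop script is not shown here.
use strict;
use warnings;   # turns on warnings from inside the file, much like the -w switch

# Return the greeting text (hypothetical helper, so the text lives in one place).
sub greeting {
    return "Howdy, world!\n";   # double quotes, so \n becomes a newline
}

print greeting();
```

Saved as a plain text file and run from the command prompt with perl followed by the file name, this should print a single line.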
Modifying a program
• Go back to WinEdt
• Edit the text between the quotation marks to say something new
• Click File -> Save
• Go back to the command prompt
• Hit the up arrow (to get the last command, perl -w howdy.pl)
• Look at that – you’re a programmer!

Break the program
• [...] back to WinEdt
• Delete the semicolon at the end of the line
• Save the file
• Go back to the command prompt and run the program, with -w, again
• What happened?

Perl at 30,000 feet
• Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com
• (I’m skipping even more than he did.)

Some generalities about Perl
• Statements in Perl are, or usually [...]
• Hundreds of modules / packages available through CPAN
• ActivePerl gives a GUI for installing them in its “Perl Package Manager”

A basic Perl example
• Counting words
  • counter1.pl

Grabbing from the web
• The basic idea is simply to have Perl act as an “agent”, in the way a browser like Explorer or Firefox does, requesting and interpreting webpages
• There are a few basic modules that can do this
  • LWP::Simple [...]
• [...] Can then use to refer to the file
• The above would be [...]

Matching string patterns using regular expressions
• This is where much of the power of Perl lies
• m/pattern/ will check the default variable ($_) for pattern
• $var =~ m/pattern/; will check $var for pattern
• If the pattern is in $var, then
  • $var =~ m/pattern/ is TRUE
• If you “group” part of the pattern and it is present,
  • $var [...]

[...]
• [...] is a list
• %foo is a hash
• Variables default to global (they apply in all parts of your program). This can be problematic
  • local $var gives a global variable a temporary value for the current “block” of code (and anything called from it)
  • my $var creates a variable that exists only in the current block, and is the more usual construction
  • the very common use strict; at the beginning of code forces good practice by requiring variables to be declared (creates more syntax errors, but [...]
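
The scoping rules above (global by default, local, my, and use strict) can be sketched as follows. This is an illustration only, not code from the slides; the names $count, show_count, with_local and with_my are invented for the example.

```perl
use strict;
use warnings;

our $count = 10;            # a global (package) variable, declared so strict accepts it

sub show_count { return $count; }   # reads whatever the global holds right now

sub with_local {
    local $count = 99;      # temporary value, also seen by subs called from this block
    return show_count();    # sees 99
}

sub with_my {
    my $count = 5;          # a new variable, private to this block
    return show_count();    # still sees the global, 10
}

print with_local(), "\n";   # 99
print with_my(), "\n";      # 10
print $count, "\n";         # 10: the local value was undone when its block ended
```

Under use strict, using a variable that has not been declared with my or our is a compile-time error, which is why most scripts declare everything with my.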
[...]
• $foo = "myfile"; $datafile = "$foo.txt"; will result in the variable $datafile holding the string "myfile.txt"
• Another example
  • print 'Howdy\n'; will print:
    • Howdy\n
  • print "Howdy\n"; will print
    • Howdy
  • (\n is a control sequence, standing for “new line”)

Scalar operators
• Math
  • *, /, % (for modulo), ** (for exponentiation), etc.
• Strings
  • x to repeat the thing on the left
    • "b" x 10 gives [...]

[...]

What’s a “regular expression”?
• Combination of
    any literal   character, number, etc.
    .             any single character
    *             zero or more of the previous
    +             one or more of the previous
    ?             zero or one of the previous
    [aeiou]       character class – this is the vowels
    ^             beginning of the line
    $             end of the line
    \b            word boundary
    \d \D         digit / non-digit
    \s \S         space / non-space
    \w \W         word character / non-word character
    |             or – match this or that
    ()            grouping
• See handout [...]

[...]
• [...] scalar variables:
  • my @data = ($name, $address, $SSN)

Using Arrays
• Elements are indexed, from 0
  • my @animals = ("frog", "bear", "elephant");
  • print $animals[2]; # prints elephant
  • Note: element is a scalar, so $ rather than @
• Subsections are “slices”
  • my @mammals = @animals[1,2];
• Lots of functions for
  • using as a stack (moving things on and off the right or left side of the array)
  • sorting [...]

[...]
• [...] and it is present,
  • $var =~ m/(pattern)/ is true, AND, now a variable named $1 contains the first match it found
  • Group more pieces of the pattern and the matches are stored in $2, $3, etc.
• This only grabs the *first* match. To grab all, say
  • my @matches = ($var =~ m/(pattern)/g);
  • This will store every match in the array @matches
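
The capture behaviour described above ($1 for the first group, and the /g list form for every match) can be sketched as follows. The sample text and variable names are invented for illustration; in a real scraper, $var would hold the text of a downloaded page instead.

```perl
use strict;
use warnings;

# Made-up sample text standing in for a scraped speech.
my $var = "Health care costs rose. Public health matters; mental health too.";

# One group: on success, $1 holds the text matched inside the parentheses.
my $first;
if ($var =~ m/(\w+) health/) {
    $first = $1;                      # the word before the first lower-case " health"
}

# /g in list context returns every captured match, not just the first.
my @matches = ($var =~ m/(\w+) health/g);

# Counting matches: /i ignores case, so the leading "Health" is counted too.
my $count = () = $var =~ m/health/gi;

print "first: $first\n";              # first: Public
print "all: @matches\n";              # all: Public mental
print "count: $count\n";              # count: 3
```

The counting line is the kind of thing a scraper for “who said health, and how many times” might build on, once the speaker and the speech text have been separated.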

Date posted: 23/10/2014, 16:11

Contents

  • Data-Mining the Web Using Perl

  • Data-Mining the Web

  • How’d You Do That?

  • What’s Perl?

  • What’s a Spider?

  • What’s a scraper?

  • BEWARE!

  • Perl

  • Books

  • Running Perl

  • The First Perl Program

  • Running a Perl Program

  • Modifying a program

  • Break the program

  • Perl at 30,000 feet

  • Some generalities about Perl

  • Data Types

  • Scalars

  • Strings

  • Scalar operators
