Google Hacking for Penetration Testers - Part 20

allowing for 1,000 fresh results on this sub-domain (which might give us deeper sub-domains). Finally, we can have our script terminate when no new sub-domains are found.

Another surefire way of obtaining domains without having to perform the host/domain check is to post-process mined e-mail addresses. As almost all e-mail addresses are already at a domain (and not a host), the e-mail address can simply be cut after the @ sign and used in a similar fashion.

Telephone Numbers

Telephone numbers are very hard to parse with an acceptable rate of false positives (unless you limit it to a specific country). This is because there is no standard way of writing down a telephone number. Some people add the country code, but on regional sites (or mailing lists) it’s seldom done. And even if the country code is added, it could be added by using a plus sign (e.g., +44) or using the local international dialing method (e.g., 0044). It gets worse. In most cases, if the city code starts with a zero, the zero is omitted when the international dialing code is added (e.g., +27 12 555 1234 versus 012 555 1234). And then some people put the zero in parentheses to show it’s not needed when dialing from abroad (e.g., +27 (0)12 555 1234). To make matters worse, a lot of European nations like to split the last four digits in groups of two (e.g., 012 12 555 12 34). Of course, there are those people who remember numbers in certain patterns, thereby breaking all formats and making it almost impossible to determine which part is the country code (if present at all), the city, and the area within the city (e.g., +271 25 551 234). Then, as an added bonus, dates can look a lot like telephone numbers.
Consider the text “From 1823-1825 1520 people couldn’t parse telephone numbers.” Better still are time frames such as “Andrew Williams: 1971-04-01 – 2007-07-07.” And, while it’s not that difficult for a human to spot a false positive when dealing with e-mail addresses, you need to be a local to tell the telephone number of a plumber in Burundi from the ISBN number of “Stealing the Network.”

So, is all lost? Not quite. There are two solutions: the hard but cheap solution and the easy but costly solution. In the hard but cheap solution, we will apply all of the logic we can think of to telephone numbers and live with the false positives. In the easy (OK, it’s not even that easy) solution, we’ll buy a list of country, city, and regional codes from a provider.

Google’s Part in an Information Collection Framework • Chapter 5

Let’s look at the hard solution first. One of the most powerful principles of automation is that if you can figure out how to do something as a human being, you can code it. It is when you cannot write down what you are doing that automation fails. If we can code all the things we know about telephone numbers into an algorithm, we have a shot at getting it right. The following are some of the important rules that I have used to determine if something is a real telephone number.

■ Convert 00 to +, but only if the number starts with it.
■ Remove instances of (0).
■ Length must be between 9 and 13 digits.
■ Has to contain at least one space (optional for low tolerance).
■ Cannot contain two (or more) single digits (e.g., 2383 5 3 231 will be thrown out).
■ Should not look like a date (various formats).
■ Cannot have a plus sign if it’s not at the beginning of the number.
■ Fewer than four digits before the first space (unless it starts with a + or a 0).
■ Should not have the string “ISBN” in near proximity.
■ Rework the number from the last digit to the first digit and put it in +XX-XXX-XXX-XXXX format.

Finding numbers that comply with these rules is not easy. I ended up not using regular expressions but rather a nested loop, which counts the number of digits and accepted symbols (pluses, dashes, and spaces) in a sequence. Once it has reached a certain number of acceptable characters followed by a number of unacceptable symbols, the result is sent to the verifier (which uses the rules listed above). If verified, it is repackaged to try to get it into the right format.

Of course this method does not always work. In fact, approximately one in five numbers is a false positive. But the technique seldom fails to spot a real telephone number, and more importantly, it does not cost anything.

There are better ways to do this. If we have a list of all country and city codes we should be able to figure out the format as well as verify if a sequence of numbers is indeed a telephone number. Such a list exists but is not in the public domain. Figure 5.12 is a screen shot of the sample database (in CSV):

Figure 5.12 Telephone City and Area Code Sample

Not only did we get the number, we also got the country, the provider, whether it is a mobile or geographical number, and the city name. The numbers in Figure 5.12 are from Spain and go six digits deep. We now need to see which number in the list is the closest match for the number that we parsed.
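The rule-based verifier from the list above can be sketched in Perl, the language used for the other routine in this chapter. This is only an illustration and not the author's actual code: the name looks_like_phone and the two date patterns are assumptions, the ISBN proximity rule is left out because it needs the surrounding text, and the final repackaging into +XX-XXX-XXX-XXXX format is not attempted.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the book): returns the cleaned-up candidate
# if it passes the rules listed above, or undef if it is rejected.
sub looks_like_phone {
    my ($raw) = @_;
    my $num = $raw;

    $num =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    $num =~ s/^00/+/;          # convert 00 to +, but only at the start
    $num =~ s/\(0\)//g;        # remove instances of (0)
    $num =~ s/\s+/ /g;         # collapse runs of whitespace

    # only digits, pluses, dashes and spaces are accepted symbols
    return undef if $num =~ /[^0-9+\- ]/;

    # a plus sign is only allowed at the very beginning
    return undef if $num =~ /.\+/;

    # length must be between 9 and 13 digits
    my $digits = () = $num =~ /\d/g;
    return undef if $digits < 9 || $digits > 13;

    # has to contain at least one space
    return undef unless $num =~ / /;

    # cannot contain two (or more) single-digit groups
    my @groups  = split /[ \-]+/, $num;
    my $singles = grep { /^\d$/ } @groups;
    return undef if $singles >= 2;

    # should not look like a date (assumed patterns: ISO date, year range)
    return undef if $num =~ /\b\d{4}-\d{2}-\d{2}\b/;
    return undef if $num =~ /^\d{4}-\d{4}\b/;

    # fewer than four digits before the first space,
    # unless the number starts with a + or a 0
    if ($num !~ /^[+0]/) {
        my ($first) = split / /, $num;
        my $lead = () = $first =~ /\d/g;
        return undef if $lead >= 4;
    }
    return $num;
}

for my $candidate ('0027 12 555 1234', '+27 (0)12 555 1234',
                   '2383 5 3 231', '1823-1825 1520') {
    my $ok = looks_like_phone($candidate);
    print "$candidate => ", (defined $ok ? $ok : 'rejected'), "\n";
}
```

The checks run from cheapest to most specific, so obviously wrong candidates are thrown out early. This covers only the free, rule-based approach; matching against a purchased code list is a separate problem.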
Because I don’t have the complete database, I don’t have code for this matching step, but I suspect that you will need to write a program that measures the distance between the first couple of digits of the parsed number and those in the list. You will surely end up in a situation where there is more than one possibility. This will happen because the same number might exist in multiple countries, and if it is specified on the Web page without a country code it’s impossible to determine in which country it is located.

The database can be bought at www.numberingplans.com, but they are rather strict about selling the database to just anyone. They also provide a nifty lookup interface (limited to just a couple of lookups a day), which is not just for phone numbers. But that’s a story for another day.

Post Processing

Even when we get good data back from our data source there might be the need to do some form of post processing on it. Perhaps you want to count how many of each result you mined in order to sort it by frequency. In the next section we look at some things that you should consider doing.

Sorting Results by Relevance

If we parse an e-mail address when we search for “Andrew Williams,” that e-mail address would almost certainly be more interesting than the e-mail addresses we would get when searching for “A Williams.” Indeed, some of the expansions we’ve done in the previous section border on desperation. Thus, what we need is a method of assigning a “confidence” to a search. This is actually not that difficult. Simply assign this confidence index to every result you parse.

There are other ways of getting the most relevant result to bubble to the top of a result list. Another way is simply to look at the frequency of a result. If you parse the e-mail address andrew@syngress.com ten times more than any other e-mail address, the chances are that that e-mail address is more relevant than an e-mail address that only appears twice.
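The confidence index and the frequency count can be combined in a few lines of Perl. The sketch below is an illustration under stated assumptions: the name rank_results, the data layout (pairs of result and confidence), the confidence values, and the example.com address are all made up for the example; the text only says that each parsed result carries the confidence of the search that produced it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the book): takes a list of
# [result, confidence] pairs and returns the distinct results as
# [result, score] pairs, highest summed confidence first.
sub rank_results {
    my @parsed = @_;
    my %score;
    for my $hit (@parsed) {
        my ($result, $confidence) = @$hit;
        # repeated hits accumulate, so frequent results rise to the top
        $score{ lc $result } += $confidence;
    }
    return map  { [ $_, $score{$_} ] }
           sort { $score{$b} <=> $score{$a} || $a cmp $b }
           keys %score;
}

my @parsed = (
    [ 'andrew@syngress.com',   1.0 ],   # assumed: found via "Andrew Williams"
    [ 'andrew@syngress.com',   1.0 ],
    [ 'Andrew@Syngress.com',   0.3 ],   # assumed: found via "A Williams"
    [ 'awilliams@example.com', 0.3 ],
);

printf "%-25s %.1f\n", @$_ for rank_results(@parsed);
```

Summing the confidence values means that one hit from a precise search can outweigh several hits from a desperate expansion, which is one possible way of balancing the two signals.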
Yet another way is to look at how the result correlates back to the original search term. The result andrew@syngress.com looks a lot like the e-mail address for Andrew Williams. It is not difficult to write an algorithm for this type of correlation. An example of such a correlation routine looks like this:

    sub correlate{
      my ($org,$test)=@_;
      print " [$org] to [$test] : ";

      my $tester; my $beingtest;
      my $multi=1;

      #determine which is the longer string
      if (length($org) > length($test)){
        $tester=$org; $beingtest=$test;
      } else {
        $tester=$test; $beingtest=$org;
      }

      #loop for every 3 letters
      for (my $index=0; $index<=length($tester)-3; $index++){
        my $threeletters=substr($tester,$index,3);
        #note: $threeletters is used as a pattern, so a "." in it
        #matches any character, not only a literal dot
        if ($beingtest =~ /$threeletters/i){
          #double the score for every section that matches
          $multi=$multi*2;
        }
      }
      print "$multi\n";
      return $multi;
    }

This routine breaks the longer of the two strings into sections of three letters and compares these sections to the other (shorter) string. For every section that matches, the resultant return value is doubled. This is by no means a “standard” correlation function, but it will do the trick, because basically all we need is something that will recognize parts of an e-mail address as looking similar to the first name or the last name. Let’s give it a quick spin and see how it works. Here we will “weigh” the results of the following e-mail addresses against an original search of “Roelof Temmingh”:

    [Roelof Temmingh] to [roelof.temmingh@abc.co.za] : 8192
    [Roelof Temmingh] to [rtemmingh@abc.co.za] : 64
    [Roelof Temmingh] to [roeloft@abc.co.za] : 16
    [Roelof Temmingh] to [TemmiRoe882@abc.co.za] : 16
    [Roelof Temmingh] to [kosie@temmingh.org] : 64
    [Roelof Temmingh] to [kosie.kramer@yahoo.com] : 1
    [Roelof Temmingh] to [Tempest@yahoo.com] : 2

This seems to work, scoring the first address as the best, and the two addresses containing the entire last name as a distant second.
What’s interesting is that the algorithm does not know which part is the user name and which is the domain. You might want to change this by cutting the e-mail address at the @ sign and comparing only the first part. On the other hand, it might be interesting to see domains that look like the first name or last name.

There are two more ways of weighing a result. The first is by looking at the distance between the original search term and the parsed result on the resultant page. In other words, if the e-mail address appears right next to the term that you searched for, chances are that it is more relevant than when the e-mail address is 20 paragraphs away from the search term. The second is by looking at the importance (or popularity) of the site that gives the result. This means that results coming from a more popular site are more relevant than results coming from sites that only appear on page five of the Google results.

Luckily, by just looking at Google results, we can easily implement both of these requirements. A Google snippet only contains the text surrounding the term that we searched for, so we are guaranteed some proximity (unless the parsed result is separated from the search term by “...”, the ellipsis that joins snippet fragments). The importance or popularity of the site can be obtained from the PageRank of the site. By assigning a value to the site based on its position in the results (e.g., whether the site appears first in the results or only much later) we can get a fairly good approximation of the importance of the site.

A note of caution here. These different factors need to be carefully balanced. Things can go wrong really quickly. Imagine that Andrew’s e-mail address is whipmaster@midgets.com, and that he always uses the alias “WhipMaster” when posting from this e-mail address.
As a start, our correlation to the original term (assuming we searched for Andrew Williams) is not going to help: no three-letter section matches, so the correlation routine returns its minimum value of 1. And if the e-mail address does not appear many times in different places, that will also throw the algorithm off the trail. As such, we may choose to only increase the index by 10 percent for every three-letter section that matches, whereas the code as it stands applies a 100 percent increase. But that’s the nature of automation, and the reason why these types of tools ultimately assist but do not replace humans.

Beyond Snippets

There is another type of post processing we can do, but it involves lots of bandwidth and loads of processing power. If we expand our mining efforts to the actual page that is returned (i.e., not just the snippet) we might get many more results and be able to do some other interesting things. The idea here is to get the URL from the Google result, download the entire page, convert it to plain text (as best as we can), and perform our mining algorithms on the text. In some cases, this expansion would be worth the effort (imagine looking for e-mail addresses and finding a page that contains a list of employees and their e-mail addresses. What a gold mine!). It also allows for parsing words and phrases, something that has a lot less value when only looking at snippets.

Parsing and sorting words or phrases from entire pages is best left to the experts (think the PhDs at Google), but nobody says that we can’t try our hand at some very elementary processing. As a start we will look at the frequency of words across all pages. We’ll end up with common words right at the top (e.g., the, and, and friends). We can filter these words using one of the many lists that provide the top ten words in a specific language. The resultant text will give us a general idea of what words are common across all the pages; in other words, an idea of “what this is about.” We can extend the words to phrases by simply concatenating words together.
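The elementary word-frequency step described above might look something like the following Perl sketch. It is not the author's code: the name word_frequency, the tiny stop-word list, and the sample sentences are assumptions for illustration, and the pages are taken to be already downloaded and converted to plain text.

```perl
use strict;
use warnings;

# Assumed, deliberately tiny stop-word list; a real list would be longer.
my %stop = map { $_ => 1 } qw(the and a of to in is it for on);

# Hypothetical helper (not from the book): takes a list of plain-text pages
# and returns a hash reference mapping each remaining word to its total
# count across all pages.
sub word_frequency {
    my @pages = @_;
    my %count;
    for my $text (@pages) {
        # pull out runs of letters and lowercase them
        for my $word (map { lc $_ } $text =~ /[A-Za-z']+/g) {
            next if $stop{$word} || length($word) < 2;
            $count{$word}++;
        }
    }
    return \%count;
}

my $freq = word_frequency(
    'The security of the network is the topic.',
    'Network security and the people in it.',
);

# most frequent words first, ties broken alphabetically
for my $word (sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq) {
    print "$word: $freq->{$word}\n";
}
```

Extending this to phrases would mean counting adjacent word pairs instead of single words, at the cost of a much larger table.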
A next step would be looking at words or phrases that are not used with high frequency in a single page, but that have a high frequency when looking across many pages. In other words, what we are looking for are words that are only used once or twice in a document (or Web page), but that are used on all the different pages. The idea here is that these words or phrases will give specific information about the subject.

Presenting Results

As many of the searches will use expansion and thus result in multiple searches and the scraping of many Google pages, we’ll need to finally consolidate all of the sub-results into a single result. Typically this will be a list of results, and we will need to sort the results by their relevance.

Applications of Data Mining

Mildly Amusing

Let’s look at some basic mining that can be done to find e-mail addresses. Before we move to more interesting examples, let us first see if all the different scraping/parsing/weighing techniques actually work. The Web interface for Evolution at www.paterva.com basically implements all of the aforementioned techniques (and some other magic trade secrets). Let’s see how Evolution actually works.

As a start we have to decide what type of entity (“thing”) we are going to look for. Assuming we are looking for Andrew Williams’ e-mail address, we’ll need to set the type to “Person” and set the function (or transform) to “toEmailGoogle” as we want Evolution to search for e-mail addresses for Andrew on Google. Before hitting the submit button it looks like Figure 5.13:

Figure 5.13 Evolution Ready to Go

By clicking submit we get the results shown in Figure 5.14.
Figure 5.14 Evolution Results Page

There are a few things to notice here. The first is that Evolution is giving us the top 30 words found on resultant pages for this query. The second is that the results are sorted by their relevance index, and that moving your mouse over them gives the related snippets where they were found as well as populating the search box accordingly. And lastly, you should notice that there is no trace of Andrew’s Syngress address, which only tells you that there is more than one Andrew Williams mentioned on the Internet. In order to refine the search to look for the Andrew Williams that works at Syngress, we can add an additional search term. This is done by adding another comma (,) and specifying the additional term. Thus it becomes “Andrew,Williams,syngress.” The results look a lot more promising, as shown in Figure 5.15.

It is interesting to note that there are three different encodings of Andrew’s e-mail address that were found by Evolution, all pointing to the same address (i.e., andrew@syngress.com, Andrew at Syngress dot com, and Andrew (at) Syngress.com). His alternative e-mail address at Elsevier is also found.

Figure 5.15 Getting Better Results When Adding an Additional Search Term Evolution

Let’s assume we want to find lots of addresses at a certain domain such as ****.gov. We set the type to “Domain,” enter the domain ****.gov, set the results to 100, and select the “ToEmailAtDomain” transform. The resultant e-mail addresses all live at the ****.gov domain, as shown in Figure 5.16:

Figure 5.16 Mining E-mail Addresses with Evolution

As the mouse moves over the results, the interface automatically readies itself for the next search (e.g., updating the type and value).
Figure 5.16 shows the interface “pre-loaded” with the results of the previous search. In a similar way we can use Evolution to get telephone numbers; either lots of numbers or a specific number. It all depends on how it’s used.

Most Interesting

Up to now the examples used have been pretty boring. Let’s spice it up somewhat by looking at one of those three-letter agencies. You wouldn’t think that the cloak-and-dagger types working at xxx.gov (our cover name for the agency) would list their e-mail addresses. Let’s see what we can dig up with our tools. We will start by searching on the domain xxx.gov and see what telephone numbers we can parse from there. Using Evolution we supply the domain xxx.gov and set the transform to “ToPhoneGoogle.” The results do not look terribly exciting, but by looking at the area code and the city code we see a couple of numbers starting with 703 444. This is a fake extension we’ve used to cover up the real name of the agency, but these numbers correlate with the contact number on the real agency’s Web site. This is an excellent starting point. By no means are we sure that the entire exchange belongs to them, but let’s give it a shot.
As such we want to search for telephone numbers starting with 703 444 and then parse e-mail addresses, telephone numbers, and site names that are connected to those numbers. The hope is that one of the cloak-and-dagger types has listed his private e-mail address with his office number. The way to go about doing this is by setting the Entity type to “Telephone,” entering “+1 703 444” (omitting the latter four digits of the phone number), setting the results to 100, and using the combo “ToEmailPhoneSiteGoogle.” The results look like Figure 5.17:

Figure 5.17 Transforming Telephone Numbers to E-mail Addresses Using Evolution

This is not to say that Jean Roberts is working for the xxx agency, but the telephone number listed at the Tennis Club is in close proximity to that agency.

Staying on the same theme, let’s see what else we can find. We know that we can find documents at a particular domain by setting the filetype and site operators. Consider the following query, filetype:doc site:xxx.gov in Figure 5.18.

Figure 5.18 Searching for Documents on a Domain

While the documents listed in the results are not that exciting, the meta information within the document might be useful. The very handy ServerSniff.net site provides a useful page where documents can be analyzed for interesting meta data (www.serversniff.net/file-info.php). Running the 32CFR.doc through Tom’s script we get:

Figure 5.19 Getting Meta Information on a Document From ServerSniff.net

We can get a lot of information from this. The username of the original author is “Macuser” and he or she worked at Clator Butler Web Consulting, and the user “clator” clearly had a mapped drive that had a copy of the agency Web site on it. Had, because this was back in March 2003.

It gets really interesting once you take it one step further.
After a couple of clicks, Evolution found that Clator Butler Web Consulting is at www.clator.com, and that Mr. Clator Butler is the manager for David Wilcox’s (the artist) forum. When searching for “Clator Butler” on Evolution, and setting the transform to “ToAffLinkedIn,” we find a LinkedIn profile on Clator Butler as shown in Figure 5.20:
