Test your site for broken links

◆ Problem

You want to verify that all the links on your site lead somewhere.

◆ Background

A typical web application end user who is annoyed by clicking a link that leads nowhere will leave your site and never come back. You need to keep end users from seeing “404 File Not Found” at essentially any cost. Fortunately, it is quite simple to use JUnit to test your entire site.

◆ Recipe

Here, HtmlUnit can come to the rescue: a fairly simple recursive algorithm makes this test surprisingly easy to write. The key parts of the algorithm are:

1 Retrieve a page by invoking WebClient.getPage().

2 If the page is an HtmlPage, get all the anchors (<a> tags) on it and try to follow each one.

3 If you reach a page outside your domain, do not bother checking any further.

4 If something goes wrong when following a link, identify that link as broken.
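The four steps above can be sketched without any network access. The following is a minimal illustration, not the recipe's listing: it assumes a hypothetical in-memory map (page URL to the hrefs on that page) in place of WebClient.getPage(), treats a URL missing from the map as a link that cannot be followed, and remembers the pages it has checked so that the recursion terminates. The class name LinkWalkSketch and the example.test URLs are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Walks an in-memory stand-in for a site; no HtmlUnit or network involved. */
class LinkWalkSketch {
    // Assumption: page URL -> hrefs found on that page.
    // A URL that is absent from the map plays the role of a broken link.
    private final Map<String, List<String>> site;
    private final String domainName;
    private final Set<String> urlsChecked = new HashSet<>();
    private final Map<String, List<String>> failedLinks = new LinkedHashMap<>();

    LinkWalkSketch(Map<String, List<String>> site, String domainName) {
        this.site = site;
        this.domainName = domainName;
    }

    /** Returns the broken links, grouped by the page they appear on (from => to). */
    Map<String, List<String>> check(String startUrl) {
        List<String> links = site.get(startUrl);          // step 1: retrieve a page
        if (links == null) {
            throw new IllegalArgumentException("Cannot retrieve start page: " + startUrl);
        }
        walk(startUrl, links);
        return failedLinks;
    }

    private void walk(String url, List<String> links) {
        if (url.indexOf(domainName) < 0) {
            return;                                       // step 3: outside the domain
        }
        if (!urlsChecked.add(url)) {
            return;                                       // already checked this page
        }
        for (String href : links) {                       // step 2: follow each anchor
            List<String> nextLinks = site.get(href);
            if (nextLinks == null) {                      // step 4: record the broken link
                failedLinks.computeIfAbsent(url, k -> new ArrayList<>()).add(href);
            } else {
                walk(href, nextLinks);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        site.put("http://www.example.test/",
            Arrays.asList("http://www.example.test/a", "http://www.example.test/missing"));
        site.put("http://www.example.test/a", Arrays.asList("http://www.example.test/"));
        System.out.println(new LinkWalkSketch(site, "example.test").check("http://www.example.test/"));
    }
}
```

Unlike a map keyed only by page, this sketch keeps a list of broken links per page, so two broken links on the same page are both reported.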

This leads to a recursive algorithm; however, when we tried to execute the test, we ran into some specific details of which you need to be aware.

The Jakarta Commons HttpClient, which HtmlUnit uses to make its requests, does not handle mailto links, so we cannot check those. The best you can do automatically is verify that each one contains a well-formed e-mail address. We recommend you check them by hand.
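If you want at least a syntax check on mailto links, something like the following might do. It is only a sketch: the class and method names are hypothetical, the pattern is deliberately simple rather than a full implementation of the e-mail address grammar, and it handles a single address per link. It says nothing about whether the mailbox exists, which is why checking by hand remains worthwhile.

```java
import java.util.regex.Pattern;

/** Rough syntax check for mailto hrefs; not a full address validator. */
class MailtoCheckSketch {
    // Assumption: local part, "@", then a dotted domain name is "plausible enough".
    private static final Pattern SIMPLE_ADDRESS =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    static boolean isPlausibleMailto(String href) {
        if (href == null || !href.startsWith("mailto:")) {
            return false;
        }
        String address = href.substring("mailto:".length());
        int queryStart = address.indexOf('?');            // drop "?subject=..." and the like
        if (queryStart >= 0) {
            address = address.substring(0, queryStart);
        }
        return SIMPLE_ADDRESS.matcher(address.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleMailto("mailto:someone@example.com"));
        System.out.println(isPlausibleMailto("mailto:not-an-address"));
    }
}
```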

HtmlUnit 1.2.3 has a defect: it does not handle links to page targets (<a href="#Hello" />) correctly. We have submitted the issue to Mike Bowler with a fix, and more than likely by the time you read this sentence, it will already have been fixed. If not, lean on him a little.6

Many links lead back to a page the test has already checked. To avoid infinite recursion, keep track of every URL the test has checked so far, and then skip those URLs if they come up again.

There are occasional false failures: the test fails, you check the link, and it is not broken. Part of that is the nature of the web: sometimes a URL is unavailable for a few seconds. Other than that, we do not know why this would happen. You would have to run the test more often to notice a pattern. Because you will likely run this test, say, once per week, these false failures are not a pressing issue.
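One way to reduce false failures caused by a URL being unavailable for a few seconds is to try again before declaring the link broken. The helper below is a sketch of that idea, not part of the recipe's listing; the name RetrySketch, the number of attempts, and the pause between them are all assumptions you would tune.

```java
import java.util.concurrent.Callable;

/** Repeats an action a few times before giving up. */
class RetrySketch {
    static <T> T retry(Callable<T> action, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                lastFailure = e;                          // remember why this attempt failed
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);            // give the server time to recover
                }
            }
        }
        throw lastFailure;                                // every attempt failed
    }
}
```

In a test like this one, the call that follows a link is the natural candidate to wrap in such a helper. Retrying makes an already slow test slower, so a small number of attempts is probably enough.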

Also be aware that we are not checking form submission, which would be very complex to do in general. Instead, see chapter 12, “Testing Web Components,” for a discussion on how to verify web forms, one by one, in isolation.

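Let us look at the code in listing 13.7. Simply change domainName to the domain you would like to check; the test starts at the home page of that domain. The listing uses yahoo.com only as an example. We do not recommend actually running this test against yahoo.com, because that would take an awfully long time.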

6 No need. We received e-mail from Mike that our fix was checked in and will be part of the next release of HtmlUnit. Ah, open source!

Listing 13.7 LinksTest

package junit.cookbook.applications.test;

import java.io.IOException;
import java.net.URL;
import java.util.*;

import junit.framework.TestCase;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

public class LinksTest extends TestCase {
    private WebClient client;
    private List urlsChecked;
    private Map failedLinks;
    private String domainName;


    protected void setUp() throws Exception {
        client = new WebClient();
        client.setJavaScriptEnabled(false);
        client.setRedirectEnabled(true);
        urlsChecked = new ArrayList();
        failedLinks = new HashMap();
    }

    public void testFindABrokenLink() throws Exception {
        domainName = "yahoo.com";
        URL root = new URL("http://www." + domainName + "/");
        Page rootPage = client.getPage(root);
        checkAllLinksOnPage(rootPage);

        assertTrue(
            "Failed links (from => to): " + failedLinks.toString(),
            failedLinks.isEmpty());
    }

    private void checkAllLinksOnPage(Page page) throws IOException {
        // Only HTML pages have anchors to follow
        if (!(page instanceof HtmlPage))
            return;

        URL currentUrl = page.getWebResponse().getUrl();
        String currentUrlAsString = currentUrl.toExternalForm();

        // Skip pages the test has already checked
        if (urlsChecked.contains(currentUrlAsString)) {
            return;
        }

        // Do not check any further once outside the domain
        if (currentUrlAsString.indexOf(domainName) < 0) {
            return;
        }

        urlsChecked.add(currentUrlAsString);
        System.out.println("Checking URL: " + currentUrlAsString);

        HtmlPage rootHtmlPage = (HtmlPage) page;
        List anchors = rootHtmlPage.getAnchors();
        for (Iterator i = anchors.iterator(); i.hasNext();) {
            HtmlAnchor each = (HtmlAnchor) i.next();
            String hrefAttribute = each.getHrefAttribute();

            boolean isMailtoLink = hrefAttribute.startsWith("mailto:");
            boolean isHypertextLink = hrefAttribute.trim().length() > 0;

            if (!isMailtoLink && isHypertextLink) {
                try {
                    Page nextPage = each.click();
                    checkAllLinksOnPage(nextPage);
                }


                catch (Exception e) {
                    failedLinks.put(currentUrlAsString, each);
                }
            }
        }
    }
}

◆ Discussion

A few warnings about this test:

■ It is slow to execute, as it is checking URLs over a live network.

■ If one of your broken links is to a nonexistent domain or a domain with its server entirely down, the test’s network connection might have to time out before registering a failure, which makes it even slower. Although our version of the test only checks pages in the desired domain, you could certainly remove that restriction in your test. If you do, then this becomes an issue.

■ The whole thing executes as one big test, rather than as a test for each URL to check. We cannot see a way to get around this without implementing Test directly, which we could certainly do; however, the idea of creating a TestSuite in memory while the test is running gave us a headache, so we chose not to try it.

Still, we think it is a good starting point for common use.
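If you do remove the domain restriction, one way to bound the cost of unreachable servers is to check external links with explicit timeouts. The sketch below swaps in java.net.HttpURLConnection from the standard library rather than HtmlUnit, since we make no assumption here about what timeout settings your version of WebClient offers. The class name, the use of a HEAD request, and the timeout values are all illustrative choices.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Checks a single URL with bounded connect and read timeouts. */
class TimeoutSketch {
    /** Prepares, but does not yet open, a HEAD request with the given timeouts. */
    static HttpURLConnection openWithTimeouts(String url, int connectMillis, int readMillis)
            throws IOException {
        HttpURLConnection connection =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        connection.setConnectTimeout(connectMillis);      // give up on unreachable hosts
        connection.setReadTimeout(readMillis);            // give up on stalled responses
        connection.setRequestMethod("HEAD");              // headers only, no page body
        return connection;
    }

    /** Returns false if the request fails, times out, or answers with an error status. */
    static boolean isReachable(String url, int connectMillis, int readMillis) {
        try {
            HttpURLConnection connection = openWithTimeouts(url, connectMillis, readMillis);
            try {
                return connection.getResponseCode() < 400;
            } finally {
                connection.disconnect();
            }
        } catch (IOException e) {
            return false;
        }
    }
}
```

Some servers answer HEAD requests differently from GET requests, so treat a failure reported this way as a prompt to check the link by hand rather than as proof that it is broken.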

◆ Related

■ Chapter 12—Testing Web Components
