Part I: Meeting the Other Side of Google

Chapter 2
Getting into Google

In This Chapter
- Knowing the three steps to visibility in Google
- Meeting Google’s pet spider and understanding how the crawl works
- Keeping Google out of your site

This chapter is about getting your site to appear on Google search pages. I’m not talking about the Google Directory, submission to which is a simple matter also covered here. The challenge is to appear in search results based on keywords related to your site. Chapters 3 and 4 focus on becoming more prominently placed on those search results pages; this chapter is more elementary but no less crucial for new sites.

The Three-Step Process

Many of the suggestions, tactics, and concepts discussed in this chapter and Chapters 3 and 4 apply to both getting into Google (the first step) and improving a site’s status in Google (an ongoing project). Understanding the Google crawl (this chapter), networking your site (Chapter 3), and site optimization (Chapter 4) are important topics for newcomers and veterans alike. There’s no proper order in which to tackle these subjects. They are presented here in a certain order, but the topics in these three chapters add up to a single process that maximizes your site’s exposure in Google.

Here is a summary of the ground covered in these three chapters:

- Getting into Google (Chapter 2). Understand how the Google spider crawls the Web and what the spider looks at. Judge whether to submit a new page manually to the index or let the spider find it. Find out how to keep material out of Google.
- Networking your site (Chapter 3). Develop a matrix of incoming links, which is crucial for building a higher status in Google and effective for getting into the index at the start.
- Optimizing your site for Google (Chapter 4). Create content, optimize your page’s meta tags, and introduce keywords as the fundamental building blocks of a highly ranked site.
  These are golden topics for the serious Webmaster at all stages of business development, from conception to customer interaction.

First things first. New sites must get into Google and then work to raise their profiles. Getting into Google really means getting into the Google index, which is a database of Web content. Google builds the index by crawling through the Web collecting pages. When a user searches for keywords, Google doesn’t actually search the Web; it searches its index.

If your site already appears in Google search results, you might feel tempted to skip this chapter and head straight for Chapter 3. However, the next two sections contain useful information about Google’s behavior and ways for both new and existing sites to leverage its quirks.

Meet Google’s Pet Spider

All search engines operate in the same basic way: they crawl the Web with automatic software robots called spiders or crawlers, which create searchable indexes of Web content. Every engine allows visitors to search its index for keywords and groups of keywords. Search results come in a variety of list formats, but most display a bit of information about each Web page in the list and a link to that page.

Each engine’s index is unique, thanks to the programming of its spider. The main element of that programming is the engine’s algorithm, which ranks pages in an index. This ranking determines the order in which search results are presented.

Google’s central technology asset is its algorithm: the complex ranking formula that gives people good search results and often seems to be reading people’s minds when they Google something. The results of Google’s algorithm are summarized in a single ranking statistic called PageRank. Google is secretive about the software formula from which PageRank is derived, but the company does promote the importance of PageRank, and offers Webmasters broad hints for improving a site’s PageRank.
Google displays a general approximation of any page’s rank (on a 0-to-10 scale) in the Google Toolbar, which is shown in Figure 2-1. Although the exact formulation of PageRank is a well-protected secret, its basic ingredients are well-known (and discussed in Chapter 3).

The Google PageRank is like a carrot dangled before the ambitious gaze of Webmasters, who devote considerable energy to inching their pages up to a higher PageRank, thereby moving them up the search results list. Chapter 3 is devoted to improving your site’s ranking and position on search results pages.

Figure 2-1: The Google Toolbar affords a rough glimpse of any site’s PageRank, on a scale of 0 to 10.

Search engine integrity

One reason pre-Google search engines declined in usefulness and popularity as Web-content portals was the emergence of paid listings. Hungry for revenue, some engines sold positions on the search results page to advertisers. This dilution of objectivity polluted search results and undermined the essential democracy of the Web. The distinction blurred between search engines, which supposedly located what you wanted, and browser channels, which sent you to the browser’s business affiliates. Even though many search engines did not accept paid placement, distrust grew among users.

Google started a renaissance of utility and trust. Google’s integrity is symbolized by its gunk-free home page, the spartan design of which lures the user with the promise of search, and nothing but search. To be sure, Google accepts advertising, and Parts II and III of this book are all about Google ads. But Google’s paid content is clearly separated from search listings. Not everyone agrees with the ranking of search results in Google, but nobody thinks that a high rank can be bought.
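
To make the idea of searching an index rather than the live Web more concrete, here is a minimal Python sketch. It is not Google’s method: the three pages, their text, and the function names build_index and search are all invented for illustration, and no ranking is attempted. It shows only that a page published after the index was built stays invisible to searches until the next crawl.

```python
from collections import defaultdict

# Hypothetical miniature "Web": URL -> page text. All pages are invented.
PAGES = {
    "www.example.com/": "coffee beans roasted fresh daily",
    "www.example.com/tea.html": "green tea and black tea",
    "www.example.org/": "fresh coffee and tea reviews",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs containing every query word, consulting only the index."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(PAGES)            # the "crawl": done once, ahead of time
PAGES["www.example.net/"] = "coffee"  # a page published after the crawl
print(sorted(search(index, "coffee")))
```

The page at www.example.net/ is missing from the printed results because the search never looks at PAGES directly; rebuilding the index is the toy equivalent of a new crawl.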

Timing Google’s crawl

Google crawls the Web at varying depths and on more than one schedule. The so-called deep crawl occurs roughly once a month. This extensive reconnaissance of Web content requires more than a week to complete and an undisclosed length of time after completion to build the results into the index. For this reason, it can take up to six weeks for a new page to appear in Google. Brand new sites at new domain addresses that have never been crawled before might not even be indexed at first, depending on considerations explained later in this chapter.

If Google relied entirely on the deep crawl, its index would quickly become outdated in the rapidly shifting Web. To stay current, Google launches various supplemental fresh crawls that skim the Web more shallowly and frequently than the deep crawl. These supplementary spiders do not update the entire index, but they freshen it by updating the content of some sites. Google does not divulge its fresh-crawling schedules or targets, but Webmasters can get an indication of the crawl’s frequency through sharp observance.

Google has no obligation to touch any particular URL with a fresh crawl. Sites can increase their chance of being crawled often, however, by changing their content and adding pages frequently. Remember the shallowness aspect of the fresh crawl; Google might dip into the home page of your site (the front page, or index page) but not dive into a deep exploration of the site’s inner pages. (More than once I’ve observed a new index page of my site in Google within a day of my updating it, while a new inner page added at the same time was missing.) But Google’s spider can compare previous crawl results with the current crawl, and if it learns from the top navigation page that new content is added regularly, it might start crawling the entire site during its frequent visits.

The deep crawl is more automatic and mindlessly thorough than the fresh crawl. Chances are good that in a deep crawl cycle, any URL already in the main index will be reassessed down to its last page. However, Google does not necessarily include every page of a site. As usual, the reasons and formulas involved in excluding certain pages are not divulged. The main fact to remember is that Google applies PageRank considerations to every single page, not just to domains and top pages. If a specific page is important to you and is not appearing in Google search results, your task is to apply every networking and optimization tactic described in Chapter 3 to that page. You may also manually submit that specific page to Google (see the next section).

The terms deep crawl and fresh crawl are widely used in the online marketing community to distinguish between the thorough spidering of the Web that Google launches approximately monthly and various intermediate crawls run at Google’s discretion. Google itself acknowledges both levels of spider activity, but is secretive about exact schedules, crawl depths, and formulas by which the company chooses crawl targets. To a large extent, targets are determined by automatic processes built into the spider’s programming, but humans at Google also direct the spider to specific destinations for various reasons, some of which are discussed in this chapter.

Earlier, I said that the Google index remains static between crawls. Technically, that’s true. Google matches keywords against the index, not against live Web content, so any pages put online (or modified) between visits from Google’s spider remain excluded from (or out of date in) the search results until they are crawled again. But two factors work against the index remaining unchanged for long.
First, the frequency of fresh crawls keeps the index evolving in a state that Google-watchers call everflux. Second, some time is required to put crawl results into the index on Google’s thousands of servers. The irregular heaving and churning of the index that results from these two factors is called the Google dance.

To submit or not to submit

You can get your site into the Google index in two simple ways:

- Submit the site manually
- Let the crawl find it

Neither method offers a guarantee. Google accepts URL submissions, but it neither responds to them nor assures Webmasters that their submissions will be added to the index. When Google decides to manually add a site, it does so by sending the spider crawling to the submitted URL to take stock of the site’s various pages. Characteristically, Google doesn’t inform the Webmaster that the site has been accepted, and it doesn’t provide a schedule for crawling accepted sites.

Google’s hands-off operation

Google is a reasonably communicative company in certain departments, such as AdWords, AdSense, and enterprise solutions. And Google accepts URL submissions for the index, though it doesn’t acknowledge them. But asking humans at Google to interfere with the construction of its index is an exercise in futility. Google builds its index through robotic interaction, for the most part, and prides itself on these sophisticated automated processes. Google does not correct a Webmaster’s outdated listings or make any custom change to the index. The company counts on time and thorough crawling to solve problems. Google doesn’t want to hear from you about your index issues.

The key to attracting Google’s spider is getting your page linked on other sites. Google finds your content by following links to your pages. With no incoming links (also called backlinks), you are an unreachable island as far as the Google crawl is concerned.
This isolated condition is the natural state of any new site. Of course, anybody can reach you directly by entering the URL, but you won’t pluck the spider’s web until you get some other sites to link to you. See Chapter 3 for a detailed tutorial on creating a backlink network.

Submitting a site might not be a ticket to instant success, but at least it’s easy. Submit your URL at this address:

www.google.com/addurl.html

Fill in the form (see Figure 2-2) and click the Add URL button, keeping in mind that the button is misnamed. You are not adding the URL; you are submitting it. Only the spider can add your site, and only a Google human can tell it to.

If you add a page to a site already in the Google index, there’s no need to submit the new page. Under most circumstances, Google will find the new page the next time your site is crawled in its entirety.

Figure 2-2: Submitting a URL to Google could hardly be easier, but don’t expect acknowledgment or guaranteed results.

You don’t have to choose between submitting and not submitting; do both if you’re impatient. Submitting doesn’t stop the spider from visiting you in the normal course of events, but it doesn’t encourage the spider, either. Conversely, the spider’s failure to find you doesn’t affect the disposition of your submitted request. Are you getting the idea that gaining admission to Google’s index is a crapshoot? Not really. In fact, Google’s spider is so thorough that entering the index is practically inevitable if you follow the networking suggestions in the next chapter. Submitting a URL manually is a crapshoot, though. My best suggestion is to submit if you must, but don’t only submit. Get to work networking your site and implementing other optimization tactics in Chapter 3, which will get you inside the index more quickly and push your site to a higher PageRank.
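
The point made in this section, that the spider finds pages by following links and that a new site with no backlinks is an unreachable island, can be sketched as a toy crawl. This is only an illustration under stated assumptions: the link graph, the site names, and the crawl function are hypothetical, and Google’s real spider is far more elaborate.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "directory.example/": ["shop.example/", "blog.example/"],
    "shop.example/": ["shop.example/catalog.html"],
    "shop.example/catalog.html": [],
    "blog.example/": ["shop.example/"],
    "newsite.example/": ["shop.example/"],  # links out, but nothing links in
}


def crawl(links, seeds):
    """Follow links breadth-first from the seed pages; return pages in discovery order."""
    found = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        found.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return found


reached = crawl(LINKS, ["directory.example/"])
print("newsite.example/" in reached)
```

In this sketch the new site’s outgoing links do nothing for it. It becomes reachable only when a page the crawl already visits, such as the hypothetical blog.example/, adds a link pointing to it.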

The directory route

If submitting a URL seems too uncertain and networking seems too difficult, you can get into the Web index by getting your site listed in the Google Directory. The Google Directory is a categorized list of Web sites, built by hand. Google does not build its own directory, a fact that surprises many people. Instead, Google repurposes the large Web directory created by the Open Directory Project. The Open Directory Project (ODP) is a non-profit organization staffed by thousands of volunteer editors who accept URL submissions for their respective subject niches. Google applies a PageRank to the Open Directory (see the bars on the left in Figure 2-3), thereby reordering the directory listings, and presents the whole thing in familiar Google style. Naturally, the Google spider crawls the directory, so any new directory listing is automatically added to Google’s main Web index.

Submit a URL to the Open Directory Project at this address:

www.dmoz.org/add.html

When it comes to accepting submissions, the Open Directory Project does not guarantee your entry any more than Google does. With ODP, you are at the mercy of whichever editor is in charge of your most relevant category, and the chance of developing a companionable dialogue with that person is slim. Furthermore, the ODP URL-submission process is much more complicated than at Google. Finally, you can usually count on a long and indeterminate wait before your site is added. Keep checking by searching for your site in the Google Directory.

Checking your site’s status in Google

During the sometimes-long wait to be included in Google, you naturally want to know when you’ve succeeded. (So you can run through the streets yelling, “Google me! Google me!”)

How do you know whether your site is in the Google index? Don’t try searching for it with general keywords; that method is hit-or-miss.
You could search for an exact phrase located in your site’s text (by putting quotes around the phrase), but if the phrase is not unique you could get tons of other matches. The best bet is to simply search for your URL, as shown in Figure 2-4. Make it exact, and include the www prefix. If you’re searching for an inner page of the site, precision is likewise necessary, so remember to include the .htm or .html file extension if it exists.

When adding a page to a site already in Google, be prepared for a long wait for it to appear, especially if you don’t change your content often. If Google’s spider checks your site during only its deep crawl and the timing is off, you could tap your fingers for about six weeks before seeing the new page in search results.

Figure 2-3: Listing in the Google Directory assures being crawled for the Web index.

Figure 2-4: Search for your page or site address to see whether you’re in the Google index.

Indexing frustrations

Moving is hell, on land and in cyberspace. Moving your site from one URL to another (and especially from one domain to another) presents a vexing indexing problem. There’s a good chance that Google will continue to list your old site after you move, and even after it begins to list your new site. The Google spider is not dense. It trusts incoming links, many of which probably still point to your old location. From Google’s perspective, you haven’t really moved until you update your entire network of incoming links (which, if you take Chapter 3 seriously, you worked hard to establish), pointing them to your new location. Your PageRank will drop considerably, too, until you get those backlinks up to speed. Moving is a serious consideration for any site that depends on stature in Google, and it shouldn’t be undertaken lightly or without planning.

Partial listings can also spark frustration, for example, when Google’s spider locates your site and files the address but does not crawl all of its content. Because Google’s descriptions are quoted from the pages themselves, the listing for any uncrawled page appears on the search page bereft of a description. This situation bodes ill, for descriptions often provide the motivation to click on search results. Your only recourse is to build up your PageRank to the level at which Google sniffs out all your content and provides descriptions of your pages. See Chapter 3.

Keeping Google Out

This book is about partnering with Google: getting into the index, improving your PageRank, advertising on Google, distributing other people’s Google ads on your site, and other ways of building your online business through Google. So a section about rebuffing Google might seem counterproductive. But in the interest of covering all bases, here it is.

Sometimes even publicity-hungry Webmasters want to keep Google away from certain parts of their business. Private pages designed for friends and semiprivate pages created for select visitors shouldn’t be indexed for the world at large. Entire sites that are still under development while existing on the Web in a live state might best be excluded from Google.

It’s fairly easy to prevent Google from indexing an entire site or selected pages of a site even if the spider crawls your URL. You can also prevent Google from caching pages of your site, a process by which Google stores each indexed page on its servers. This section explains how to prevent Google from crawling and caching your site.

Deflecting the crawl

The key to deflecting Google’s spider is the robots.txt file, which implements the Robots Exclusion Protocol. Google’s spider understands and obeys this protocol. The robots.txt file is a short, simple text file that you place in the top-level directory (root directory) of your domain server.
(If you lease your Web space from your ISP, not from a dedicated Web host, you probably need administrative help in placing the robots.txt file.)

Create the robots.txt file in Notepad or another text editor, and transfer it as an ASCII text file. It’s best not to use Microsoft Word or another word processor to create the robots.txt file. But if you do, remember to save it as a plain text file with the .txt file extension. Then make sure you transfer it to your server in ASCII (text) mode rather than binary mode; check the transfer setting of your FTP (file transfer protocol) program, because defaults vary.

The robots.txt file contains two instructions:

- User-agent. This instruction specifies which search engine crawler must follow the robots.txt instructions. You may specify Google’s spider, multiple specific spiders, or all spiders. (The command works for all spiders that seek and acknowledge the robots.txt file.)
- Disallow. This line specifies which directories (Web page folders) or specific pages at your site are off-limits to the search engine. You must include a separate Disallow line for each excluded directory.

[...]... invisible to Google. Others are mere static welcome mats that force users to click again before getting into the site. Google does not like pointing its searchers to splash pages. In fact, these tedious welcome mats are bad site design by any standard, even if you don’t care about Google indexing, and I recommend getting rid of them. Give your visitors, and Google, meaningful...

... first prevents Google from indexing your page, and the second prevents Google from following links on that page. If you want the page to be excluded from the index but would like Google to follow its outgoing links, leave off the nofollow command, like this:

Make your command Google-specific by using the name of Google’s spider, Googlebot:
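
The two robots.txt instructions described in this section, User-agent and Disallow, can be tried out with a short sketch that uses Python’s standard urllib.robotparser module to read a sample file and report which pages a given spider may fetch. The directory names (/private/ and /drafts/), the www.example.com address, the OtherBot spider, and the allowed helper are all hypothetical; only the User-agent and Disallow lines themselves come from the text, and a real spider’s handling of the file may differ in detail from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the directory names are invented.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # read the rules from text, not from a server


def allowed(agent, path):
    """Report whether the named spider may fetch the given path on the example site."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


for agent, path in [
    ("Googlebot", "/index.html"),
    ("Googlebot", "/drafts/new.html"),
    ("OtherBot", "/drafts/new.html"),
    ("OtherBot", "/private/notes.html"),
]:
    print(agent, path, allowed(agent, path))
```

In this sample, Googlebot is kept out of both directories, each with its own Disallow line, while every other spider is kept out of /private/ only. Remember that robots.txt is a request that well-behaved spiders honor, not a lock on the door.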