How to Get Your Site Indexed (or Not)
Beyond understanding the basic rules of thumb for how the Google appliance crawls and indexes sites, it is useful to know how to structure your site so that you can control what gets indexed and what does not.
How to get your site pages indexed
Making your pages available to Web visitors in a Duke search is easier if you follow the Google-provided "Publishing Best Practices" guidelines and these basic steps:
- Make sure the site is hosted in a Web domain that is included in the list of crawled domains (What is the universe of domains?). If you suspect that your site is being excluded, please contact the OIT Service Desk.
- Make sure you haven't added a robots.txt file or meta tags that prevent the Google robot from indexing your page. A more detailed explanation of the various combinations of arguments for the robots meta tag is available at www.robotstxt.org.
- Make sure your site is linked from somewhere in the list of crawled domains.
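The check for blocking meta tags in the steps above can be automated. The following is a minimal sketch, not an official tool: the class name RobotsMetaParser and the function blocks_indexing are hypothetical, and the sketch only looks for a "noindex" directive in a robots meta tag. It does not inspect robots.txt or confirm that the page is linked from a crawled domain.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page.

    Hypothetical helper for illustration; not part of any Duke or Google tool.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (for valueless attributes), so normalize.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def blocks_indexing(html: str) -> bool:
    """Return True if the page carries a robots meta tag containing "noindex"."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

To use it, pass the HTML source of a page to blocks_indexing. If the result is True, the page is asking robots not to index it, and the meta tag should be removed before the page can appear in search results.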
How to keep your site out of the Google index
- To exclude an entire Web server from the Google appliance, place a robots.txt file at the top level of that server. The robots.txt file should contain something like the following:
User-agent: *
Disallow: /
(NOTE: If you want to restrict only the Duke appliance from crawling or indexing your site, use the appliance's user-agent name, "duke-crawler", in place of "*".)
- For individual directories or pages, insert this <meta> tag within the <head> element of each page you want to exclude:
<head>
<meta name="robots" content="noindex, nofollow">
</head>
NOTE: If the page has been indexed in the past and you add the meta tag, it will be removed from the index the next time Google crawls the page.
- Additionally, you can prevent the pages in a site or directory from being indexed by restricting access to the directory with WebAuth.
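Before deploying a robots.txt file, it can help to confirm that its rules do what you expect. The sketch below uses Python's standard urllib.robotparser module to evaluate two rule sets: the block-everyone example from this page, and a variant that names only the "duke-crawler" user-agent. The function name is_allowed, the host www.example.edu, and the agent name "some-other-robot" are placeholders chosen for illustration. The parser's matching may differ in small ways from the appliance's own behavior, so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# The rule set shown on this page: every robot is excluded from the whole server.
BLOCK_EVERYONE = """\
User-agent: *
Disallow: /
"""

# Variant that names only the Duke appliance's user-agent.
BLOCK_DUKE_ONLY = """\
User-agent: duke-crawler
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://www.example.edu/index.html"  # placeholder URL
    for label, rules in (("everyone", BLOCK_EVERYONE), ("duke only", BLOCK_DUKE_ONLY)):
        for agent in ("duke-crawler", "some-other-robot"):
            print(label, agent, is_allowed(rules, agent, page))
```

With the first rule set, no robot is allowed to fetch the page. With the second, only requests identifying as "duke-crawler" are refused, which matches the intent of the note about restricting just the Duke appliance.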