Google has provided the following tips and guidelines for appliance customers (and everybody else) to improve your users' search experience.
Create useful, information-rich content. Write pages that clearly and accurately describe your content. Don't load pages with irrelevant words. Think about the words users would type to find your pages, and make sure that your site includes those words.
Focus on the text on your site. Make sure that your TITLE tags and ALT attributes are descriptive and accurate. Because the Google crawler doesn't recognize text contained in images, avoid graphical text; instead, place the information in the alt text and anchor text of images. When linking to non-HTML documents, use strong descriptions in the anchor text that describe what each link points to.
Make a site with a clear hierarchy of hypertext links. Every page should be reachable from at least one hypertext link. Offer a site map to your users with hypertext links that point to the important parts of your site. Keep the links on a given page to a reasonable number (fewer than 100).
Ensure that your site is linked from all relevant sites within your network. Interlinking between sites and within sites gives the Google crawler additional ability to find content, and improves the quality of the search.
Validate all HTML content to ensure that the HTML is well-formed. Use a text browser such as Lynx to examine your site because most search engine spiders see your site much as Lynx would. If extra features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine crawlers may have trouble crawling your site.
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in multiple copies of the same document being indexed for your site, as crawl robots will see each unique URL (including session ID) as a unique document.
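As an illustration of why session IDs lead to duplicate documents, the following minimal Python sketch strips session-tracking parameters from URLs before comparing them. The parameter names in SESSION_PARAMS and the example.com URLs are assumptions made for this example; substitute whatever your application actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; adjust for your own application.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def strip_session_params(url):
    """Return url with session-tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# Two visits to the same document with different session IDs.
urls = [
    "http://intranet.example.com/docs/policy.html?sid=a1b2c3",
    "http://intranet.example.com/docs/policy.html?sid=z9y8x7",
]

# A crawler sees two distinct URLs; without the session ID there is only one.
canonical = {strip_session_params(u) for u in urls}
```

Serving bots the session-free form of each URL means each document is seen under one URL only.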
Ensure that your site's internal link structure provides a hypertext link path to all of your pages. The Google search engine follows hypertext links from one page to the next, so pages that are not linked to by others may be missed. Additionally, you should consult the administrator of your Google Search Appliance to ensure that your site's home page is accessible to the search engine.
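One way to check for pages that a link-following crawler would miss is to walk your own link graph from the home page. The sketch below is only an illustration in Python: the site dictionary is made-up data, and the breadth-first walk is a stand-in for crawling in general, not a description of how the appliance crawls.

```python
from collections import deque


def unreachable_pages(links, start):
    """Return the pages in links that cannot be reached from start.

    links maps each page to the list of pages it links to.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return set(links) - seen


# Hypothetical site: /orphan.html exists but nothing links to it.
site = {
    "/index.html": ["/products.html", "/about.html"],
    "/products.html": ["/index.html"],
    "/about.html": [],
    "/orphan.html": ["/index.html"],
}

missed = unreachable_pages(site, "/index.html")
```

Any page reported here needs at least one inbound hypertext link before a crawler that starts at the home page can find it.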
Make use of the robots.txt file on your Web server. This file tells crawlers which files and directories can or cannot be crawled, including various file types. If the search engine gets an error when getting this file, no content will be crawled on that server. The robots.txt file will be checked on a regular basis, but changes may not have immediate results. Each port (including HTTP and HTTPS) requires its own robots.txt file.
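To see how a set of robots.txt rules would be interpreted, you can test them locally before deploying them. The following sketch uses Python's standard urllib.robotparser module; the rules, the paths, and the "gsa-crawler" agent name passed to the check are assumptions chosen for illustration, and the appliance's own parser may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(path, agent="gsa-crawler"):
    """Return True if the rules above allow agent to fetch path."""
    return parser.can_fetch(agent, path)


public_ok = may_crawl("/public/index.html")
private_ok = may_crawl("/private/report.html")
```

Because each port needs its own robots.txt file, a check like this would have to be repeated against the file served on each port.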
Use robots meta tags to control whether individual documents are indexed, whether the links on a document should be crawled, and whether the document should be cached. The "NOARCHIVE" value for robots meta tags is supported by the Google search engine to block cached content, even though it is not mentioned in the robots standard.
For information on how robots.txt files and ROBOTS meta tags work, read the Robots Exclusion standard. If the search engine is generating too much traffic on your site during peak hours, contact your Google Search Appliance administrator to customize the traffic.
Make sure your Web server supports the If-Modified-Since HTTP header. This feature allows your Web server to tell the Google Search Appliance whether your content has changed since it last crawled your site. Supporting this feature saves you bandwidth and overhead.
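The decision a Web server makes when it receives an If-Modified-Since header can be sketched as follows. This is a simplified Python illustration, not the behavior of any particular Web server; the function name response_status and the sample dates are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def response_status(if_modified_since, last_modified):
    """Return 304 if the document is unchanged since the given date, else 200.

    if_modified_since is the raw header value (or None if absent);
    last_modified is a timezone-aware datetime for the document.
    """
    if not if_modified_since:
        return 200  # no conditional header: send the full document
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: send the full document
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # Not Modified: no body needs to be sent
    return 200


changed_at = datetime(2015, 10, 21, 7, 28, 0, tzinfo=timezone.utc)
unchanged = response_status("Wed, 21 Oct 2015 07:28:00 GMT", changed_at)
changed = response_status("Tue, 20 Oct 2015 07:28:00 GMT", changed_at)
```

A 304 response carries no document body, which is where the bandwidth saving comes from.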
Each time the Google Search Appliance updates its database of Web pages, the documents in the index can change. Here are a few examples of reasons why pages may not appear in the index.
If you still have questions, contact your Google Search Appliance administrator to get more information.
The Google search engine supports frames to the extent that it can. Frames tend to cause problems with search engines, bookmarks, email links and so on, because frames don't fit the conceptual model of the Web (where every document corresponds to a single URL).
Searches that return framed pages will most likely only produce hits against the "body" HTML page and present it back without the original framed "Menu" or "Header" pages. Google recommends that you use tables or dynamically generate content into a single page (using ASP, JSP, PHP, etc.), instead of using FRAME tags. This ultimately will maintain the content owner's originally intended look and feel, as well as allow most search engines to index your content properly.
Most search engines do not read any information found in SCRIPT tags within an HTML document. This means that content within script code will not be indexed, and hypertext links within script code will not be followed when crawling. When using a scripting language, make sure that your content and links are outside SCRIPT tags. Investigate alternative HTML technologies for dynamic web pages, such as HTML layers.
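To get a rough idea of what a crawler that ignores SCRIPT tags would see, you can parse a page and discard everything inside script elements. The sketch below uses Python's standard html.parser module; the class name TextBrowserView and the sample page are assumptions for this example, and it approximates the behavior described above rather than reproducing any search engine's parser.

```python
from html.parser import HTMLParser


class TextBrowserView(HTMLParser):
    """Collect the text and links that appear outside SCRIPT tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as raw data; skip them.
        if not self.in_script and data.strip():
            self.text.append(data.strip())


# Hypothetical page: one plain link, one link written by script code.
PAGE = """<html><body>
<a href="/products.html">Products</a>
<script>document.write('<a href="/hidden.html">Hidden</a>');</script>
</body></html>"""

view = TextBrowserView()
view.feed(PAGE)
view.close()
```

In this example only the plain link is collected; the link generated by script code is never seen.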
From software version 4.2.2 onwards, the Google Search Appliance supports googleon and googleoff tags embedded in the HTML of crawled documents.
The googleoff/googleon tags disable the indexing of a part of a Web page. The result is that those pages do not appear in search results when users search for the tagged word or phrase. For example, some customers use googleoff/googleon tags to comment out a navigation bar in static HTML pages.
You can use googleon/off to tell the Google Search Appliance to ignore portions of a page. Insert <!--googleoff: index--> at the point you want the Google Search Appliance to stop indexing, then insert <!--googleon: index--> where you want it to resume indexing the page.
You can also use the tags to avoid indexing anchor links leading to another web page.
You can use either of the following to prevent the words "chocolate pudding" from appearing in the snippets.
<!--googleoff: snippet--> chocolate pudding <!--googleon: snippet-->
<!--googleoff: all--> chocolate pudding <!--googleon: all-->
The googleon/googleoff tags are index, anchor, snippet, and all. Here's how they are used:
index tag
Words surrounded by the googleon/off tags will not be indexed as occurring on the current page.
A page containing:
fish <!--googleoff: index--> shark <!--googleon: index--> mackerel
has the terms "fish" and "mackerel" indexed for that page, but "shark" is not indexed for the page. The page could still be a search result for the search term "shark", however, since "shark" may occur elsewhere on the page or in anchortext for links to the page.
anchor tag
"Anchortext" surrounded by the googleon/off tags and occurring in links to other pages will not be indexed as words associated with the other linked-to pages. A page containing:
<!--googleoff: anchor--> <a href="linked_to_page.html"> shark </a> <!--googleon: anchor-->
will not cause the word "shark" to be associated with the page "linked_to_page.html". Otherwise, this hyperlink could cause the page "linked_to_page.html" to be a search result for the search term "shark".
snippet tag
The text surrounded by googleon/off tags will not be used to create snippets for search results.
<!--googleoff: snippet--> come to the fair! <!--googleon: snippet-->
all tag
Turns on all of the attributes: index, anchor, and snippet:
<!--googleoff: all--> come to the fair! <!--googleon: all-->
The text surrounded by googleon/off tags will not be indexed, associated as anchortext with a linked-to page, or used for a snippet.
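The effect of the googleoff/googleon tags can be modeled with a small text filter. The Python sketch below is a simulation of the behavior described above, not the appliance's implementation; the function name text_kept_for and the sample strings are made up for illustration.

```python
import re

# Matches <!--googleoff: X--> and <!--googleon: X--> comments.
TAG = re.compile(r"<!--\s*google(on|off):\s*(index|anchor|snippet|all)\s*-->")


def text_kept_for(html, attribute):
    """Return the text that is still used for attribute.

    attribute is "index", "anchor", or "snippet". Text between a
    googleoff and googleon pair for that attribute, or for "all",
    is dropped.
    """
    kept = []
    disabled = set()
    pos = 0
    for match in TAG.finditer(html):
        if attribute not in disabled and "all" not in disabled:
            kept.append(html[pos:match.start()])
        state, name = match.groups()
        if state == "off":
            disabled.add(name)
        else:
            disabled.discard(name)
        pos = match.end()
    if attribute not in disabled and "all" not in disabled:
        kept.append(html[pos:])
    # Collapse runs of whitespace left behind by the removed spans.
    return " ".join("".join(kept).split())


page = "fish <!--googleoff: index--> shark <!--googleon: index--> mackerel"
indexed = text_kept_for(page, "index")
snippet_source = text_kept_for(page, "snippet")
```

With the sample page, "shark" is dropped from the indexed text but remains available for snippets, because only the index attribute was turned off.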
The crawler obeys the noindex, nofollow, and noarchive meta-tags. If you place these tags in the head of your HTML document, you can cause the appliance to not index, not follow, and/or not archive particular documents on your site. The tags to include and their effects are:
<META NAME="robots" CONTENT="noindex">
The crawler will retrieve the document, but it will not index the document. The document will count towards the license limit.
<META NAME="robots" CONTENT="nofollow">
The crawler will not follow any links that are present on the page to other documents. The document will count towards the license limit.
<META NAME="robots" CONTENT="noarchive">
The appliance maintains a cache of all the documents that it fetches, to permit users to access the content that is indexed (in the event that the original host of the content is inaccessible, or the content has changed). If you do not wish to archive a document from your site, you can place this tag in the head of the document, and the appliance will not provide an archive copy for the document. The document will count towards the license limit.
You can also combine any or all of these tags into a single meta tag. For example:
<META NAME="robots" CONTENT="noarchive,nofollow">
Currently, it is not possible to set NAME="gsa-crawler" to specify some of these restrictions just for the appliance.
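To verify which robots directives a page actually declares, you can extract them from the document head. The sketch below uses Python's standard html.parser module; the class name RobotsMeta and the helper robots_directives are hypothetical names for this example, and it only reads the tags without reproducing how the crawler acts on them.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Collect the directives from META NAME="robots" tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html):
    """Return the set of robots directives declared in html."""
    parser = RobotsMeta()
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<html><head><META NAME="robots" CONTENT="noarchive,nofollow"></head></html>'
found = robots_directives(page)
```

A check like this can be run across a site to confirm that only the intended documents carry noindex, nofollow, or noarchive.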