Introducing your site structure to a search engine
To decide which pages to crawl, a crawler infers the site structure either from a sitemap or by following the links it finds, starting from the main page. You can generate a sitemap XML file that follows the specification at http://www.sitemaps.org/ with the tool at http://www.xml-sitemaps.com/
A short introduction to sitemaps (excerpted from http://www.sitemaps.org/):
What are Sitemaps?
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.
http://www.xml-sitemaps.com/ is a site with a free tool to generate the sitemap.xml.
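If you would rather build the file yourself than use an online generator, the structure described above (a list of URLs, each with optional last-modified date, change frequency and priority) can be sketched in a few lines of Python. This is only an illustration: the function name build_sitemap and the sample page data are made up for this example, and the element names and namespace are those of the Sitemap protocol published at sitemaps.org.

```python
import xml.etree.ElementTree as ET

# Namespace required by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML as a string (hypothetical helper).

    `pages` is an iterable of dicts. Each dict needs a "loc" key;
    "lastmod", "changefreq" and "priority" are optional hints.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        # Emit the child elements in the order used by the protocol examples.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Sample data, invented for the example.
    sample_pages = [
        {"loc": "http://www.example.com/", "lastmod": "2005-01-01",
         "changefreq": "monthly", "priority": 0.8},
        {"loc": "http://www.example.com/about"},
    ]
    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write(build_sitemap(sample_pages))
```

Only loc is mandatory for each URL; the other three elements are hints that a crawler may or may not act on.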
After generating the sitemap.xml document, you need to tell the crawlers where to find it (excerpted from http://www.sitemaps.org/):
Informing search engine crawlers
Once you have created the Sitemap file and placed it on your webserver, you need to inform the search engines that support this protocol of its location. You can do this by:
- submitting it to them via the search engine’s submission interface
- specifying the location in your site’s robots.txt file
- sending an HTTP request
The search engines can then retrieve your Sitemap and make the URLs available to their crawlers.
Submitting your Sitemap via the search engine’s submission interface
To submit your Sitemap directly to a search engine, which will enable you to receive status information and any processing errors, refer to each search engine’s documentation.
Specifying the Sitemap location in your robots.txt file
You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line:
Sitemap: <sitemap_location>
The <sitemap_location> should be the complete URL to the Sitemap, such as: http://www.example.com/sitemap.xml
This directive is independent of the user-agent line, so it doesn’t matter where you place it in your file. If you have a Sitemap index file, you can include the location of just that file. You don’t need to list each individual Sitemap listed in the index file.
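To check that a robots.txt file exposes the Sitemap line the way a crawler would read it, one option is Python's standard urllib.robotparser module (its site_maps() method needs Python 3.8 or later). The sketch below is an assumption-laden illustration: the robots.txt content and the helper name find_sitemaps are invented for the example. The Sitemap line is deliberately placed after a User-agent group to show that its position does not matter.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
"""


def find_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in robots.txt text (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(find_sitemaps(ROBOTS_TXT))
```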
You can specify more than one Sitemap file per robots.txt file.
Sitemap: <sitemap1_location>
Sitemap: <sitemap2_location>
Submitting your Sitemap via an HTTP request
To submit your Sitemap using an HTTP request (replace <searchengine_URL> with the URL provided by the search engine), issue your request to the following URL:
<searchengine_URL>/ping?sitemap=sitemap_url
For example, if your Sitemap is located at http://www.example.com/sitemap.gz, your URL will become:
<searchengine_URL>/ping?sitemap=http://www.example.com/sitemap.gz
URL encode everything after the /ping?sitemap=:
<searchengine_URL>/ping?sitemap=http%3A%2F%2Fwww.yoursite.com%2Fsitemap.gz
You can issue the HTTP request using wget, curl, or another mechanism of your choosing. A successful request will return an HTTP 200 response code; if you receive a different response, you should resubmit your request. The HTTP 200 response code only indicates that the search engine has received your Sitemap, not that the Sitemap itself or the URLs contained in it were valid. An easy way to do this is to set up an automated job to generate and submit Sitemaps on a regular basis.
Note: If you are providing a Sitemap index file, you only need to issue one HTTP request that includes the location of the Sitemap index file; you do not need to issue individual requests for each Sitemap listed in the index.
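The URL-encoding and request steps for the HTTP submission can be sketched with Python's standard library. Treat this as a rough illustration under stated assumptions: the function names are made up, http://searchengine.example is a placeholder rather than a real endpoint, and whether a given search engine still accepts ping requests should be checked against its own documentation.

```python
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen


def build_ping_url(search_engine_url, sitemap_url):
    """Build the ping URL, percent-encoding everything after /ping?sitemap=."""
    # safe="" makes quote() encode ":" and "/" as well.
    encoded = quote(sitemap_url, safe="")
    return f"{search_engine_url.rstrip('/')}/ping?sitemap={encoded}"


def submit_sitemap(search_engine_url, sitemap_url, timeout=10):
    """Send the ping and return True only on an HTTP 200 response.

    A 200 means the request was received, not that the Sitemap is valid.
    """
    try:
        with urlopen(build_ping_url(search_engine_url, sitemap_url),
                     timeout=timeout) as response:
            return response.status == 200
    except URLError:
        # Covers HTTP error statuses and network failures; the caller can retry.
        return False


if __name__ == "__main__":
    # Placeholder search engine URL, for illustration only.
    print(build_ping_url("http://searchengine.example",
                         "http://www.example.com/sitemap.gz"))
```

A scheduled job could call submit_sitemap after each regeneration of the Sitemap and retry later when it returns False.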