EJM Designs Limited Blog

Tuesday, May 26, 2009

DIY: XML Sitemap and Autodiscovery

Traditionally, a sitemap has been an HTML page that lays out your website's page hierarchy. Simple enough: use the unordered list tag set and categorize those pages. And for more SEO love, put a link to that page in your footer. Tasty!
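
If you've never built one, it's nothing fancier than a nested list of links. A minimal sketch (the page names are placeholders):

    <ul>
      <li><a href="/index.html">Home</a></li>
      <li><a href="/services.html">Services</a>
        <ul>
          <li><a href="/services/design.html">Design</a></li>
        </ul>
      </li>
      <li><a href="/contact.html">Contact</a></li>
    </ul>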

But then things got a little more complicated, with both sitemaps and SEO. Don't worry, though - not that much.

You may have heard the term "XML sitemap" and perhaps even the term "autodiscovery." And you thought "WTF? How do I do that?" It's really not as hard as you think. And if you have FTP access to your site, you can do it yourself.

The XML Sitemap

A couple of years ago (has it been that long?), Google spearheaded the idea of the XML sitemap, which is basically an XML file sitting in your root directory called, obviously enough, "sitemap.xml". It was soon adopted by MSN Search (now Live Search), Yahoo, and Ask. All the parameters and guidelines can be found at sitemaps.org, but it boils down to this: an XML file that lists every page on your site, each page's last change date, its frequency of change, and its importance on a 0.0 to 1.0 scale (1.0 highest).
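
To give you a sense of the format, here's a sitemap.xml with a single page entry, following the sitemaps.org protocol (the URL and values are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.yoursite.com/</loc>
        <lastmod>2009-05-26</lastmod>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
    </urlset>

Now picture writing an entry like that for every page on your site.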

Creating a sitemap.xml file from scratch is a daunting prospect, to say the least.

So how about a tool, Eric? Of course. The free software I personally use to create these sitemaps is GSiteCrawler, which you can download here.

The program is intuitive: you enter your URL, check some boxes specifying what you'd like logged, and let it run. (NOTE: this tool is for a live site; crawling a staging site will fill your sitemap with staging URLs, which will only confuse the robots/spiders and could hurt you in a "duplicate content" kind of way.)

Generate the sitemap and save it on your system. Take a look at it in a text editor to make sure everything looks good, then upload it to your site's root directory, the same place your home page lives.

Autodiscovery

Very soon after the sitemap.xml protocol was adopted, so was autodiscovery. It might be a fancy word to flaunt in front of people, but it's really quite basic. In your root directory - where your home page is, and where your sitemap.xml now is - there should be a "robots.txt" file. That file tells the robots/spiders which directories on your site should not be followed (disallowed).

Autodiscovery is a single additional line in that file that reads like this:

Sitemap: http://www.yoursite.com/sitemap.xml

That's it!
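
For context, a complete robots.txt with the autodiscovery line might look something like this (the disallowed directories are just examples):

    User-agent: *
    Disallow: /staging/
    Disallow: /cgi-bin/
    Sitemap: http://www.yoursite.com/sitemap.xml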

So why's it called "autodiscovery"? Simple enough: when the robots/spiders visit your site, the first thing they're "supposed to" check is the "robots.txt" file, so they know what they don't have to bother with, saving them time. This additional line increases their efficiency by immediately pointing them to the exact file they can use to guide them through your site.

Fancy words, easy results.

"But Eric, I've got my XML sitemap and my autodiscovery in place, so how do I make sure Google, Yahoo, etc. know that it's all cool and they need to take a look at my site again?"

Well, sir/madam, that would be tomorrow's post: DIY: How to Let the Engines Know You're There.

Tune on in.

Questions or suggestions are always welcome in the comments.

2 comments:

  1. Hi Eric-

    For WordPress users, there's a great free plugin that generates the XML file AND pings the search engines when it's updated.

    http://www.arnebrachhold.de/projects/wordpress-plugins/google-xml-sitemaps-generator/

    I guess it'd be moot on Blogger, but for any of your readers who use WP, I can say it's super easy to use.

  2. Brent- Thanks for the input. For the WP users out there, this looks like a great tool. Anything that can automate this process is a plus in my book!
