Search Engine Enhancement

Wednesday, September 12, 2007

Getting timely search engine coverage of a site means people can find things soon after you change or post them.

Linked pages get searched by most search engines following external links or manual URL submissions every few days or so, but they won’t find unlinked pages or broken links, and it is likely that the ranking and efficiency of the search is suboptimal compared to a site that is indexed for easy searching using a sitemap.

There are three basic steps to having a page optimally indexed:

Generating a Sitemap
Creating an appropriate robots.txt file
Informing search engines of the site’s existence

Sitemaps
It seems like the world has settled on sitemaps for making search engine’s lives easier. There is no indication that a sitemap actually improves rank or search rate, but it seems likely that it does, or that it will soon. The format was created by Google, and is supported by Google, Yahoo, Ask, and IBM, at least. The reference is at sitemaps.org.

Google has created a python script to generate a sitemap through a number of methods: walking the HTML path, walking the directory structure, parsing Apache-standard access logs, parsing external files, or direct entry. It seems to me that walking the server-side directory structure is the easiest, most accurate method. The script itself is on sourceforge . The directions are good, but if you’re only using directory structure, the config.xml file can be edited down to something like:

<?xml version="1.0" encoding="UTF-8"?>
<site
  base_url="http://www.your-site.com/"
  store_into="/www/data-dist/your_site_directory/sitemap.xml.gz"
  verbose="1"
  >

 <url href="http://www.your-site.com/" />
 <directory
    path="/www/data-dist/your_site_directory"
    url="http://www.your-site.com/"
    default_file="index.html"
 />

Note that this will index every file on the site, which can be large. If you use your site for media files or file transfer, you might not want to index every part of the site. In which case you can use filters to block the indexing of parts of the site or certain file types. If you only want to index web files you might insert the following:

 <filter  action="pass"  type="wildcard"  pattern="*.htm"           />
 <filter  action="pass"  type="wildcard"  pattern="*.html"          />
 <filter  action="pass"  type="wildcard"  pattern="*.php"           />
 <filter  action="drop"  type="wildcard"  pattern="*"               />

Running the script with

python sitemap_gen.py --config=config.xml

will generate the sitemap.xml.gz file and put it in the right place. If the uncompressed file size is over 10MB, you’ll need to pare down the files listed. This can happen if the filters are more inclusive than what I’ve given, particularly if you have large photo or media directories or something like that and index all the media and thumbnail files.

The sitemap will tend to get out of date. If you want to update it regularly , there are a few options: one is to use a wordpress sitemap generator (if that’s what you’re using and indexing) which does the right thing and generates a sitemap using relevant data available to wordpress and not to the file system (a good thing) and/or add a chron script to regenerate the sitemap regularly, for example

3  3  *  *  *  root python /path_to/sitemap_gen.py --config=/path_to/config.xml

will update the sitemap daily.

robots.txt

The robots.txt file can be used to exclude certain search engines, for example MSN if you don’t like Microsoft for some reason and are willing to sacrifice traffic to make a point, it also points search engines to your sitemap.txt file. There’s kind of a cool tool here that generates a robots.txt file for you but a simple one might look like:

User-agent: MSNBot                             % Agent I don't like for some reason
Disallow: /                                    % path it isn't allowed to traverse
User-agent: *                                  % For everything else
Disallow:                                      % Nothing is disallowed
Disallow: /cgi-bin/                            % Directory nobody can index
Sitemap: http://www.my_site.com/sitemap.xml.gz % Where my sitemap is.

Telling the world

Search engines are supposed to do the work, that’s their job, and they should find your robots.txt file eventually and then read the sitemap and then parse your site without any further assistance. But to expedite the process and possibly enhance search results there are some submission tools at Yahooo, Ask, and particularly Google that generally allow you to add meta information.
Ask
Ask.com allows you to submit your sitemap via URL (and that seems to be all they do)
http://submissions.ask.com/ping?sitemap=http://www.your_site.com/sitemap.xml.gz

Yahoo
Yahoo has some site submission tools and supports site authentication, which means putting a random string in a file they can find to prove you have write-access to the server. Their tools are at
https://siteexplorer.search.yahoo.com/mysites

with submissions at
https://siteexplorer.search.yahoo.com/submit.php

you can submit sites and feeds. I usually use the file authentication which means creating a file with some random string (y_key_random_string.html) with another random string as the only contents. They authenticate within 24 hours.
It isn’t clear that if you have a feed and submit it that it does not also add a site, it looks like it does. If you don’t have a feed you may not need to authenticate the site for submission.
Google
Google has a lot of webmaster tools at
https://www.google.com/webmasters/tools/siteoverview?hl=en

The verification process is similar but you don’t have to put data inside the verification file so

touch googlerandomstring.html

is all you need to get the verification file up. You submit the URL to the sitemap directly.
Google also offers blog tools at
http://blogsearch.google.com/ping

Where you can manually add the feed for the blog to Google’s blog search tool.

Posted at 13:25:13 GMT-0700

Category: FreeBSD • Technology

Recent Posts
- Managing a proxyware infection 2026 May 24
- Google gonna do evil. Resist the Droidpocolypse. 2026 April 15
- Generative Spongiform Encephalopathy (GSE): The AI Ouroboros 2026 January 11
- Free your Datas from The Tyranny of the Organizer 2026 January 09
- Does nano select the wrong syntax highlighter? But how to know which? 2025 December 16
- Serial dependencies are a catastrophe already starting to happen 2025 November 19
- Cleaning up poor quality laser scans with Meshlab 2025 September 12
- Disappearing checks in WordPress checkboxes 2025 July 29
- Apropos of nothing in particular…. 2025 March 26
- Treegraph.sh a tool for generating pretty file structure graphs 2025 February 28
Search
Categories
Categories
Links
Archives
Archives
Post History
September 2007

M T W T F S S

1 2

3 4 5 6 7 8 9

10 11 12 13 14 15 16

17 18 19 20 21 22 23

24 25 26 27 28 29 30

« Aug Oct »

Gessel On…

Search Engine Enhancement