isham research

Generating sitemaps

HTML Sitemaps

HTML sitemaps are fairly easy to generate, though some caution is necessary. Tools such as Xenu (ask for the report and reply "Cancel" to the password request) produce HTML code that can be cut and pasted to create a sitemap. For modest sites of up to around fifty pages this is quite adequate, and an XML sitemap might be unnecessary effort. HTML sitemaps have two main advantages: they allow anchor text to describe a page, and they can 'level' a site so that no page is more than two or three levels down from the home page. Problems arise with large sites, because search engines are said to limit the number of links they will follow on any one page. Google has suggested a hundred as a reasonable limit, but this is a competitive issue: if other search engines support more, Google will have to do the same.

The isham research system produces HTML sitemaps as above, using the <title> tag as link anchor text for each page, but instead of reflecting the tree structure, it sorts the URIs into descending priority order. This ensures that any search engine crawler that does in fact give up after an arbitrary number of URIs will have got the most important ones. The isham research system also prepends the ISO 8601 date to the title - this ensures that the anchor text for each page changes each time that page is updated.
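The ordering and anchor-text rules described above can be sketched in a few lines of Python. This is only an illustration, not the isham research implementation: the Page record, its field names and the html_sitemap function are all hypothetical, and the sketch assumes the prepended date is the page's last-edited date.

```python
from dataclasses import dataclass
from datetime import date
from html import escape


@dataclass
class Page:
    # Hypothetical record; the field names are assumptions for this sketch.
    uri: str
    title: str        # text of the page's <title> tag
    priority: float   # 0.0 to 1.0
    lastmod: date     # assumed to be the date the page was last edited


def html_sitemap(pages):
    """Return an HTML list of links, highest priority first."""
    ordered = sorted(pages, key=lambda p: p.priority, reverse=True)
    items = []
    for p in ordered:
        # Prepend the ISO 8601 date so the anchor text changes on each update.
        anchor = f"{p.lastmod.isoformat()} {p.title}"
        href = escape(p.uri, quote=True)
        items.append(f'<li><a href="{href}">{escape(anchor)}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

A crawler that stops after the first few links of such a list would still have seen the highest-priority pages.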

XML Sitemaps

Metadata

Sitemap metadata is a way of storing information about pages so that it does not have to be changed every time a page is updated or a new sitemap is generated. During the design of the isham research system, three approaches were considered:

The isham research system uses codified SGML comment statements such as:

<!-- priority 0.5 -->
<!-- changefreq never -->

These statements are added to the page directly after the !DOCTYPE statement. By far the most important is 'priority': no other mechanism can convey business goals correctly to a search engine. The idea that it can be mechanically generated is fatuous, and the search engines already do their own version of that, considering hundreds of parameters. Changefreq is a slightly dubious parameter and the isham research system does not insert a default; any value specified should be consistent with what the search engine crawlers observe. Lastmod is derived from the file date in the development environment, that is, the date the page was last edited and stored. This date is not available from the server.

One extra tag:

<!-- ignore -->

This is used on pages that shouldn't be indexed - such pages might also include a robots noindex meta tag.
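As a rough sketch of how such comment statements might be read back out of a page, the Python below scans the HTML for the three tags and takes lastmod from the local file date. The function names, the regular expression and the dictionary layout are assumptions made for illustration, not the actual isham research code; the fallback priority of 0.1 is the default quoted in the Management section.

```python
import re
from datetime import date
from pathlib import Path

# Matches <!-- priority 0.5 -->, <!-- changefreq never --> and <!-- ignore -->.
COMMENT = re.compile(r"<!--\s*(priority|changefreq|ignore)(?:\s+(\S+))?\s*-->", re.I)


def read_metadata(html, default_priority=0.1):
    """Collect sitemap metadata from the comment statements in a page."""
    # No changefreq default is inserted, matching the behaviour described above.
    meta = {"priority": default_priority, "changefreq": None, "ignore": False}
    for name, value in COMMENT.findall(html):
        name = name.lower()
        if name == "ignore":
            meta["ignore"] = True
        elif name == "priority" and value:
            meta["priority"] = float(value)
        elif name == "changefreq" and value:
            meta["changefreq"] = value.lower()
    return meta


def lastmod(path):
    """ISO 8601 date on which the local copy of the file was last modified."""
    return date.fromtimestamp(Path(path).stat().st_mtime).isoformat()
```

Pages for which read_metadata reports ignore as True would simply be left out of every generated sitemap.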

Mobile Sitemaps

Mobile sitemaps describe those pages on a site that are handheld friendly. Mobile devices are expected to dominate web access in a few years' time, and mobile platforms are rapidly converging with desktop capabilities. Although the USA lags significantly in the adoption of handheld browser technology - witness the raving about the very pedestrian Apple iPhone - cellular phone browsing is becoming pervasive in Europe, with over a quarter of cellphone users having surfed the web on the move. For over a tenth of young users, the cellphone is the main means of accessing the Internet. Various early attempts to adapt web content to handhelds (WAP, .mobi) are falling by the wayside as handheld browser technology catches up. Google now accepts the submission of mobile sitemaps. Rather than define its own metadata, isham research uses the AvantGo browser extension meta tag:

<meta name="HandHeldFriendly" content="true">

This permits the controlled construction of mobile sitemaps. It has little effect on traffic, since the AvantGo browser itself is not commonly used. For backward compatibility, the system also processes the equivalent PalmOS meta tag:

<meta name="palmcomputingplatform" content="true">
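Selecting pages for a mobile sitemap then comes down to checking for either meta tag. The following Python sketch uses the standard library's html.parser; the class and function names are hypothetical, and treating the two tag names case-insensitively is an assumption of this sketch rather than documented behaviour.

```python
from html.parser import HTMLParser

# The two meta names quoted above, lower-cased for comparison.
MOBILE_META_NAMES = {"handheldfriendly", "palmcomputingplatform"}


class MobileMetaFinder(HTMLParser):
    """Sets .mobile to True if the page carries either handheld meta tag."""

    def __init__(self):
        super().__init__()
        self.mobile = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lower-cases attribute names; values may be None.
        a = {k: (v or "") for k, v in attrs}
        name = a.get("name", "").lower()
        if name in MOBILE_META_NAMES and a.get("content", "").lower() == "true":
            self.mobile = True


def is_handheld_friendly(html):
    finder = MobileMetaFinder()
    finder.feed(html)
    return finder.mobile
```

Only pages for which is_handheld_friendly returns True would be written to the mobile sitemap.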

robots.txt

robots.txt (lower case is important) is used in two ways:

Management

As mentioned above, the ideal shape for a priority occurrence plot is asymptotic: a very few pages with high priority and a long tail (the bulk of the site) with low priority, because in general those pages will be reached hierarchically through the site structure rather than being the immediate target of a search. For this purpose Google's default of 0.5 is of little use, and the isham research system normally uses a default of 0.1, though this is easily changed. Sitemap entries are sorted by descending priority; this allows the most important pages to appear first in the HTML sitemap, just in case a crawler does impose a limit, and also permits the creation of a comma-separated values (CSV) file in descending priority order that can be read into any spreadsheet for further analysis.
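The CSV export might look something like the Python sketch below. The column names and the priority_csv function are assumptions for illustration only; the real file layout is not described here.

```python
import csv
import io


def priority_csv(entries):
    """entries: iterable of (uri, priority) pairs. Returns CSV text,
    highest priority first, ready to load into a spreadsheet."""
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["priority", "uri"])  # hypothetical column names
    for uri, priority in ordered:
        writer.writerow([f"{priority:.1f}", uri])
    return buf.getvalue()
```

Plotting the priority column of such a file is a quick way to check whether a site has the few-high, many-low shape described above.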

A Working Example

As an example, these techniques have been employed on a small client site:

Once the XML sitemap is created, don't forget to add it to your robots.txt file, leaving a blank line between it and the last User-agent: record.

User-agent: *
Disallow:

Sitemap: http://www.mydomain.com/sitemap.xml

This will help other search engines discover it without any further effort.
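For sites where robots.txt is produced by a build step, the same rule can be applied mechanically. The Python below is a minimal sketch under that assumption; add_sitemap_line is a hypothetical helper, not part of any existing tool.

```python
def add_sitemap_line(robots_txt, sitemap_url):
    """Return robots.txt text with a Sitemap line appended after a blank
    line, unless that sitemap is already listed."""
    for existing in robots_txt.splitlines():
        name, _, value = existing.partition(":")
        if name.strip().lower() == "sitemap" and value.strip() == sitemap_url:
            return robots_txt  # already present, leave the file alone
    # The blank line separates the Sitemap line from the last record.
    return robots_txt.rstrip("\n") + "\n\n" + f"Sitemap: {sitemap_url}\n"
```

Calling it a second time with the same URL returns the text unchanged, so it is safe to run on every build.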
