PCS SemiAutomatic Web Page Indexing


Semi-Automatic Indexing of Web Pages

Motivation

PCS pages are hard to navigate. However valid the web page organization is for the structure of a massive documentation project (and it ain't all that valid), a structure that makes sense for the documentation as a whole often does not make sense to someone looking for a quick tidbit of information.

For the web as a whole, the best solution is search engines like Google, combined with people who collect and publish good links. We use the campus Google search engine for searching the PCS documentation, but there are some drawbacks: it will not index password-protected pages, and despite the work put into the engine to determine relevancy, computers are not that good at it.

Printed media usually includes an index, hopefully generated by a human, pointing you to pages of interest for a given topic and subtopic chain. This project is only starting as this is written, but it is hoped that it will allow us to easily mimic the printed index in usefulness.

For example, we may have instructions for printing from the PNCE-Unix environment under PNCE-Unix Environment, Common Tasks, Printing. While this is a perfectly reasonable location, and perhaps more than reasonable in terms of the structure of the web site as a whole, an impatient user might not be thinking in those terms. The goal is to have a master index page which lists the instructions for printing under all the topic entries one might try (e.g. under "Printing, PNCE-Unix Environment"; under "MDQS, commands"; or under "Common Tasks, Printing, PNCE-Unix Environment").

No computerized system can do a good job of placing a page under the relevant entries automatically, but the author of the page can generally come up with a fair number of potential entries without much effort. The aim is to let the author simply list these potential entries in the template for the page, and have an automated system assemble all these hints into the master index pages. Since we are using templates which need compilation anyway, the compilation step is a good place to do much of that work.

Additional work for Content Authors

The author is, for the most part, the one who needs to enter the keys and subkeys under which the page should be indexed. Although others can add more keys at a later date, a human really needs to do this (if we could get the computers to do it well enough, this would be a fully automated system), and the author is the obvious choice. The goal, then, is to make the burden on the author as small as possible.

For the most part, I think we succeeded. The author just needs to include an addTopic command for each location in the index from which the page should be referenced. These can go anywhere after the standard header macro is called, and unless the setCurrentTitle command is also being used, their order and placement are immaterial. For references to the page as a whole, it is probably best practice to put the addTopic commands just below or near the standard header. If you think the page is a particularly good or poor match for a topic, you can alter the weight parameter, which controls the order in which multiple references under the same key are listed (50 is the default; higher numbers are better matches). This is not typically needed.

So for our printing example, the start of the document might look like:

[%
pnce_standard_header('PNCE-Unix Printing',
'Instructions for Printing on PNCE-Unix Systems');

addTopic('PNCE-Unix|common tasks|printing');
addTopic('PNCE-Unix|printing');
addTopic('PNCE-Unix|MDQS');
addTopic('printing|PNCE-Unix');
addTopic('MDQS|commands');
addTopic('common tasks|printing|PNCE-Unix',weight=100);
%]
(We altered the weight for one topic just as an example, though generally that would not be done.)
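
To illustrate what the weight does, here is a minimal sketch of how several references filed under the same key might be ordered. It is written in Python purely for illustration (the real scripts are Perl), the entries and URLs are made up, and breaking ties by title is an assumption rather than something the actual system is known to do.

```python
# Hypothetical references filed under one key: (weight, url, title).
entries = [
    (50, "/user-docs/printing.html", "PNCE-Unix Printing"),
    (100, "/user-docs/quick-print.html", "Quick Printing Guide"),
    (50, "/user-docs/mdqs.html", "MDQS Commands"),
]


def order_entries(entries):
    """Return entries with higher weights first.

    Ties are broken alphabetically by title (an assumption made for
    this sketch, not documented behaviour).
    """
    return sorted(entries, key=lambda e: (-e[0], e[2]))


ordered = order_entries(entries)
for weight, url, title in ordered:
    print(weight, title, url)
```

With these made-up entries, the weight-100 reference is listed first and the two default-weight references follow it.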

If you want to reference a portion of the page instead of the whole page, it is only slightly more complicated. I assume that there is already an <a name= ... anchor tag at the spot you want to reference. At some point (best practice would seem to be near the actual name anchor), you issue the command [% setCurrentTitle("new title", "#anchor_name"); %] and then issue more addTopic commands with the key lists for this location. All addTopic macros called after that (until the end of the file or another setCurrentTitle) will use the given title and URL. The URL should normally be absolute, but the macro detects a reference within the same page by its leading '#' mark and prepends the absolute path of the current page.
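
The handling of the leading '#' can be sketched as follows. This is an illustrative Python function, not the macro's actual code; the function name resolve_url and the example page URL are hypothetical.

```python
def resolve_url(url, page_url):
    """Turn a same-page anchor into a full URL; leave other URLs alone.

    url      -- the second argument given to setCurrentTitle
    page_url -- absolute URL path of the page being processed
    """
    if url.startswith("#"):
        # Same-page reference: prepend the current page's path.
        return page_url + url
    return url


print(resolve_url("#queues", "/user-docs/printing.html"))
print(resolve_url("/user-docs/mdqs.html#commands", "/user-docs/printing.html"))
```

The first call yields /user-docs/printing.html#queues; the second URL is returned unchanged because it does not start with '#'.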

Under the hood

This is a brief description of what is happening 'under the hood,' so to speak.

The make command uses a local script, /dept/phys/htdocs/tt-html-stuff/ttprocess.pl, to process templates and convert them into HTML. In this script, a Template::WebPageIndices (TWPI) object is created with the full path of the template file and handed to TT when the template is processed. The URL and similar details are determined from the path of the template file. In addition, we determine the class of the page: whether it is general user documentation or systems documentation (based on whether it is under user-docs or pcs-docs), and whether it is public or restricted access (based on whether restricted is in the path).

When the standard header macro is called, it calls the pageInit method on the TWPI object, giving it the title of the page (as in the HTML header).

The addTopic and setCurrentTitle macros invoke similarly named methods on the TWPI object and update its data store.
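
As a rough illustration of the classification step, the following Python sketch derives the class from a template path. It is not the TWPI code: the function name classify, the returned labels, the "unknown" fallback, and the example paths are all assumptions, as is treating restricted as a whole directory name rather than any substring of the path.

```python
from pathlib import PurePosixPath


def classify(template_path):
    """Return (docs, access) for a template path.

    docs is "user" or "system" depending on whether the path passes
    through user-docs or pcs-docs; access is "restricted" if a
    directory named restricted appears in the path, else "public".
    """
    parts = PurePosixPath(template_path).parts
    if "user-docs" in parts:
        docs = "user"
    elif "pcs-docs" in parts:
        docs = "system"
    else:
        docs = "unknown"  # fallback invented for this sketch
    access = "restricted" if "restricted" in parts else "public"
    return docs, access


# Hypothetical template locations, for illustration only.
print(classify("/dept/phys/htdocs/user-docs/printing.tt"))
print(classify("/dept/phys/htdocs/pcs-docs/restricted/backups.tt"))
```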

After the template is finished, the processing code calls the writeIdxFile method on the TWPI object, which creates a file specific to the template/HTML file in the index-files directory, making the directory if needed. This file has a number of field=value lines at the top (the most important being the one for Class), followed by a line indicating that topics follow, followed by one line for each topic entry. Each topic line has multiple fields, delimited by pipes (|). The last three fields are the weight, URL, and page description/title for the entry. The rest of the fields are the list of keys and subkeys to place it under.

Periodically (this should be in a cron job, maybe daily), a script /dept/phys/htdocs/tt-html-stuff/collect_masterindex_data.pl searches for and reads all of these page-specific index files, and consolidates everything into four files, based on whether the page is user or system documentation and whether it is public or restricted.
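
To make the topic line layout concrete, here is a minimal Python sketch that splits one such line into its key list, weight, URL, and title. It assumes that neither the URL nor the title contains a pipe character; the function name parse_topic_line and the sample line are made up for the example.

```python
def parse_topic_line(line):
    """Split a pipe-delimited topic line into (keys, weight, url, title).

    The last three fields are the weight, URL, and title; everything
    before them is the list of keys and subkeys.
    """
    fields = line.rstrip("\n").split("|")
    if len(fields) < 4:
        raise ValueError("need at least one key plus weight, URL and title")
    *keys, weight, url, title = fields
    return keys, int(weight), url, title


sample = "PNCE-Unix|common tasks|printing|50|/user-docs/printing.html|PNCE-Unix Printing"
keys, weight, url, title = parse_topic_line(sample)
print(keys, weight, url, title)
```

Because the number of keys varies from entry to entry, the sketch peels the three fixed fields off the end of the line rather than counting from the front.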
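
The consolidation step amounts to grouping topic lines by page class. The Python sketch below shows the idea on in-memory data only; it does not read or write any files, and the representation of a class as a (docs, access) pair, the function name consolidate, and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def consolidate(page_indices):
    """Group topic lines by page class.

    page_indices -- iterable of (page_class, topic_lines) pairs, one
                    pair per page-specific index file
    Returns a dict mapping each page class to all of its topic lines.
    """
    master = defaultdict(list)
    for page_class, topic_lines in page_indices:
        master[page_class].extend(topic_lines)
    return dict(master)


# Hypothetical data standing in for three page-specific index files.
pages = [
    (("user", "public"), ["printing|50|/user-docs/printing.html|Printing"]),
    (("system", "restricted"), ["backups|50|/pcs-docs/restricted/b.html|Backups"]),
    (("user", "public"), ["MDQS|commands|50|/user-docs/mdqs.html|MDQS"]),
]
master = consolidate(pages)
for page_class, lines in master.items():
    print(page_class, len(lines))
```

With all four combinations of user/system and public/restricted present, this grouping would produce the four sets of data that end up in the four master files.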

Another script (again, should be a cron job) /dept/phys/htdocs/tt-html-stuff/make_master_index_pages.pl reads the four master index data files created previously, and generates the actual HTML pages.
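
As a rough picture of what generating the index involves, the following Python sketch arranges entries into a tree of keys and subkeys and renders it as an indented plain-text outline. It is not the logic of make_master_index_pages.pl: the function names, the alphabetical ordering of keys, the output layout, and the sample entries are all assumptions, and the real script produces HTML rather than text.

```python
def build_index(entries):
    """Build a nested tree from (keys, weight, url, title) entries."""
    root = {"children": {}, "refs": []}
    for keys, weight, url, title in entries:
        node = root
        for key in keys:
            # Walk down the key chain, creating nodes as needed.
            node = node["children"].setdefault(
                key, {"children": {}, "refs": []}
            )
        node["refs"].append((weight, url, title))
    return root


def render(node, depth=0):
    """Return the tree as a list of indented text lines."""
    lines = []
    # References at this level, best match (highest weight) first.
    for weight, url, title in sorted(node["refs"], key=lambda r: -r[0]):
        lines.append("  " * depth + f"- {title} <{url}>")
    # Subkeys in alphabetical order (an assumption of this sketch).
    for key in sorted(node["children"]):
        lines.append("  " * depth + key)
        lines.extend(render(node["children"][key], depth + 1))
    return lines


# Hypothetical entries, for illustration only.
entries = [
    (["printing", "PNCE-Unix"], 50, "/p.html", "PNCE-Unix Printing"),
    (["printing"], 100, "/q.html", "Printing Overview"),
]
outline = render(build_index(entries))
print("\n".join(outline))
```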

