SHORTREF was one of the many SGML features that didn't make the initial - and, as it seems to have panned out, the final - cut for XML, back in 1996. There were two basic reasons: it was optional in SGML itself, and it would complicate parser implementations.
But, as this new article points out, it had and still has its attractions. Given parser support, it was a cheap way to have alternate syntaxes (such as wiki-style markup today). It was also a way to make conforming SGML instances out of documents that on the surface might not look like SGML at all.
The canonical example is a flat file with delimited fields, like /etc/passwd or a BCP dump of a relational table. Years ago - yikes, many years ago - I cooked up two DTDs to demonstrate this. Here is the first one in its entirety:
<!-- DTD for simplified tables from flat file: see Notes at end -->

<!-- what's a DTD without parameter entities? :-) -->
<!ENTITY % row   "tr" >
<!ENTITY % cell  "td" >
<!ENTITY % delim "~" >

<!-- bog-standard schema -->
<!ELEMENT table   o o (%row)+ >
<!ELEMENT (%row)  o o (%cell)+ >
<!ELEMENT (%cell) o o (#PCDATA) >
<!ATTLIST table border (border) #IMPLIED >

<!-- map tildes and line boundaries to appropriate tags -->
<!ENTITY start "<tr><td>" >
<!ENTITY more  "</td><td>" >
<!ENTITY end   "</td></tr>" >

<!SHORTREF tblmap "&#RS;"   start
                  "%delim;" more
                  "&#RE;"   end >
<!USEMAP tblmap table >

<!-- NOTES --

  -- This is basically a "joke" DTD to prove a concept about parsing
     delimited flat files with a SGML parser. The idea is to treat line
     boundaries and explicit delimiters as short references for the
     "required" tags. --

  -- This is not guaranteed to work, because enabling short references
     requires all markup to be recognized. Some fields could have the
     dreaded < or & characters. An actual data file should probably be
     preprocessed via sed or somesuch to replace '<' with '<<!>' and
     '&' with '&<!>'. This null declaration trick saves the need to
     declare entities such as '&lt;' and '&amp;'. One such filter is

         sed -e 's/\([&<]\)/\1<!>/g'
     --

  -- A problem occurs in tab delimited files. For reasons buried deep
     in ISO 8879, nsgmls will not recognize empty fields in this case,
     because two or more consecutive tabs also constitute a valid short
     reference! (Yes, this is a nasty gotcha.). A workaround is to
     change the tabs to a graphic like '~' or ':' if this is safe in
     relation to the data. Otherwise, it's best to redefine the
     shortref delimiter set in the SGML declaration to exclude the
     "contiguous whitespace" shortrefs. (Thanks to Joe English for this
     idea.) --

  -- Copyright 1994-8 Arjun Ray -->
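For anyone without sed handy, the escaping filter mentioned in the notes can be sketched in Python. This is only an illustration of the same substitution; the function name `escape_markup` is made up here and is not part of the DTD or of any SGML tool.

```python
import re

# Same substitution as the sed filter in the notes:
#   sed -e 's/\([&<]\)/\1<!>/g'
# Every '&' or '<' gets an empty markup declaration '<!>' appended,
# so the parser no longer sees the start of a tag or entity reference.
def escape_markup(line: str) -> str:
    return re.sub(r"([&<])", r"\1<!>", line)


if __name__ == "__main__":
    print(escape_markup("AT&T~x<y"))  # prints "AT&<!>T~x<<!>y"
```

The delimiter itself ('~' here) is left alone, so the output is still a flat record that the short reference map can chew on.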
The basic idea of SHORTREF is that contexts are set up where ordinary data characters (usually punctuation) can be recognized as "shorthand" for markup. The markup is then substituted into the document in a process like that of entity reference expansion, and parsing continues as if the markup had been there to begin with. So, three things are needed for this to work:
- A mapping between data characters and names of entities.
- Definitions of these entities, providing the replacement markup.
- A context specification for when this mapping will be active.
In the DTD above, the mapping is the SHORTREF declaration:
<!SHORTREF tblmap "&#RS;" start "%delim;" more "&#RE;" end >
which associates "&#RS;" (or "Record Start", SGMLese for "beginning of line") with an entity named "start", the delimiter character "~" with one named "more", and "&#RE;" (or "Record End") with one named "end".
Next, we have definitions of these entities:
<!ENTITY start "<tr><td>" >
<!ENTITY more  "</td><td>" >
<!ENTITY end   "</td></tr>" >
And finally we have a USEMAP declaration:
<!USEMAP tblmap table >
to say that the short reference mapping named "tblmap" should become active during the parse when a "table" element is opened.
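To make the substitution concrete, here is a rough Python imitation of what the three declarations accomplish together. It is a sketch only: it handles just the three short references in "tblmap", writes the `<table>` tags out explicitly (a real parser infers them), and ignores everything else an SGML parser does, such as name case folding and the other short reference delimiters. The names `ENTITIES` and `expand_shortrefs` are invented for this illustration.

```python
# Stand-ins for the three ENTITY declarations in the DTD.
ENTITIES = {
    "start": "<tr><td>",
    "more": "</td><td>",
    "end": "</td></tr>",
}


def expand_shortrefs(text: str, delim: str = "~") -> str:
    """Imitate the tblmap short reference map on a flat file."""
    out = []
    for record in text.splitlines():
        out.append(ENTITIES["start"])  # "&#RS;" maps to start
        # each delimiter inside the record maps to more
        out.append(ENTITIES["more"].join(record.split(delim)))
        out.append(ENTITIES["end"])  # "&#RE;" maps to end
    # The map is only active inside a table element, so wrap the rows.
    return "<table>" + "".join(out) + "</table>"


if __name__ == "__main__":
    print(expand_shortrefs("a~b\nc~~d"))
```

Note that the empty field in `c~~d` comes out as an empty cell, which is exactly the case the notes warn about for tab-delimited files under a real parser.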
Of course, getting the parser to recognize the table element needed to get this ball rolling calls for more SGML magic, the OMITTAG feature, which also was left on the cutting room floor. The cryptic "o o" in these element declarations (which may be familiar to those who have seen the "official" DTDs for HTML):
<!ELEMENT table   o o (%row)+ >
<!ELEMENT (%row)  o o (%cell)+ >
<!ELEMENT (%cell) o o (#PCDATA) >
says that start-tags and end-tags can be omitted when they can be deduced from the context. In this case, the parser can deduce that the document has to begin with a <table> start-tag, and lo, that's all we need to make a conforming SGML instance out of a flat file with tilde-delimited fields.
But what about the fields of /etc/passwd being delimited by ':'? No problem. We can exploit yet another SGML rule, which says that only the first declaration of an entity counts. This is the reason why the DTD above has a parameter entity named "delim". Consider the following "wrapper" file:
<!DOCTYPE table SYSTEM "table.dtd" [
<!ENTITY % delim ':' >
]>
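The override works because the internal subset (the part in square brackets) is read before the external DTD, so its declaration of "delim" gets there first. A toy Python model of the "first declaration wins" rule might look like this; the regular expression only understands simple quoted parameter entities, and `parameter_entities` is a hypothetical helper, not anything a real parser exposes.

```python
import re

# Matches simple declarations such as: <!ENTITY % delim "~" >
ENTITY_DECL = re.compile(r"<!ENTITY\s+%\s+(\w+)\s+(['\"])(.*?)\2\s*>")


def parameter_entities(*subsets: str) -> dict:
    """Collect parameter entities, keeping the first declaration of each name."""
    table = {}
    for subset in subsets:  # subsets are given in the order they are read
        for name, _quote, value in ENTITY_DECL.findall(subset):
            table.setdefault(name, value)  # later redeclarations are ignored
    return table


if __name__ == "__main__":
    internal = "<!ENTITY % delim ':' >"
    external = '<!ENTITY % row "tr" > <!ENTITY % cell "td" > <!ENTITY % delim "~" >'
    print(parameter_entities(internal, external)["delim"])  # prints ":"
```

With no internal subset at all, the same lookup falls through to the "~" declared in the DTD, which is the default behaviour described earlier.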
Assuming our DTD is in a file named "table.dtd", and the wrapper file is named "wrapper.drv", we can do this:
$ cat wrapper.drv /etc/passwd | /usr/bin/sgmlnorm
Try it yourself! :-)
(To be continued...)