Monday, August 20, 2012

SHORTREF Redux

How I came to google the term "shortref" yesterday doesn't really matter at this point.  But it led to a pleasant surprise when I found this.  Someone is trying to resurrect SHORTREF for XML?  Wow!

SHORTREF was one of the many SGML features that didn't make the initial - and, as it seems to have panned out, the final - cut for XML, back in 1996.  The basic reasons were two: it was optional in SGML itself, and it would complicate parser implementations.

But, as this new article points out, it had and still has its attractions.  Given parser support, it was a cheap way to have alternate syntaxes (such as Wiki style markup today).  It was also a way to make conforming SGML instances out of documents that on the surface might not look like SGML at all.

The canonical example is a flat file with delimited fields, like /etc/passwd or a BCP dump of a relational table.   Years ago - yikes, many years ago - I cooked up two DTDs to demonstrate this.  Here is the first one in its entirety:

<!-- DTD for simplified tables from flat file: see Notes at end -->

<!-- what's a DTD without parameter entities? :-) -->

<!ENTITY   %        row        "tr"   >
<!ENTITY   %        cell       "td"   >

<!ENTITY   %        delim      "~"    >

<!-- bog-standard schema -->

<!ELEMENT  table    o o        (%row)+   >
<!ELEMENT  (%row)   o o        (%cell)+  >
<!ELEMENT  (%cell)  o o        (#PCDATA) >

<!ATTLIST  table
           border   (border)   #IMPLIED  >

<!-- map tildes and line boundaries to appropriate tags  -->

<!ENTITY   start    "<tr><td>"       >
<!ENTITY   more     "</td><td>"      >
<!ENTITY   end      "</td></tr>"     >
<!SHORTREF tblmap   "&#RS;"    start
                    "%delim;"  more
                    "&#RE;"    end   >
<!USEMAP   tblmap              table >

<!-- NOTES
  -- --
  This is basically a "joke" DTD to prove a concept about parsing
  delimited flat files with a SGML parser. The idea is to treat
  line boundaries and explicit delimiters as short references for
  the "required" tags.
  -- --
  This is not guaranteed to work, because enabling short references
  requires all markup to be recognized. Some fields could have the
  dreaded < or & characters. An actual data file should probably be
  preprocessed via sed or somesuch to replace '<' with '<<!>' and
  '&' with '&<!>'. This null declaration trick saves the need to
  declare entities such as '&lt;' and '&amp;'. One such filter is

     sed -e 's/\([&<]\)/\1<!>/g'
  -- --
  A problem occurs in tab delimited files. For reasons buried deep
  in ISO 8879, nsgmls will not recognize empty fields in this case,
  because two or more consecutive tabs also constitute a valid
  short reference! (Yes, this is a nasty gotcha.)

  A workaround is to change the tabs to a graphic like '~' or ':'
  if this is safe in relation to the data.

  Otherwise, it's best to redefine the shortref delimiter set in the
  SGML declaration to exclude the "contiguous whitespace" shortrefs.
  (Thanks to Joe English for this idea.)
  -- --
  Copyright 1994-8  Arjun Ray
-->
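
For anyone without sed handy, the escaping filter from the notes can be sketched in Python. This is only my rough equivalent of the one-liner above, and the function name `escape_markup` is made up for the illustration. It appends the empty declaration `<!>` after every '<' and '&' so that, as the notes describe, the parser no longer sees them as the start of markup.

```python
import re

def escape_markup(line):
    """Append the empty declaration <!> after every '<' and '&'.

    Rough equivalent of: sed -e 's/\\([&<]\\)/\\1<!>/g'
    (escape_markup is a hypothetical name, not part of any tool).
    """
    return re.sub(r"([&<])", r"\1<!>", line)

print(escape_markup("AT&T <rocks>"))  # AT&<!>T <<!>rocks>
```

Characters other than '<' and '&' pass through untouched, so the field delimiters and line boundaries are still there for the short reference map to pick up.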


The basic idea of SHORTREF is that contexts are set up where ordinary data characters (usually punctuation) can be recognized as "shorthand" for markup. The markup is then substituted into the document in a process like that of entity reference expansion, and parsing continues as if the markup had been there to begin with. So, three things are needed for this to work:

  1. A mapping between data characters and names of entities.
  2. Definitions of these entities, providing the replacement markup.
  3. A context specification for when this mapping will be active.

Thus, in the DTD above we have a short reference map:

<!SHORTREF tblmap   "&#RS;"    start
                    "%delim;"  more
                    "&#RE;"    end   >

which associates "&#RS;" (or "Record start", SGMLese for "beginning of line") with an entity named "start", the character "~" with one named "more", and "&#RE;" (or "Record end", the end of a line) with one named "end".

Next, we have definitions of these entities:

<!ENTITY   start    "<tr><td>"       >
<!ENTITY   more     "</td><td>"      >
<!ENTITY   end      "</td></tr>"     >

And finally we have a USEMAP declaration:

<!USEMAP   tblmap              table >

to say that the short reference mapping named "tblmap" should become active during the parse when a "table" element is opened.

Of course, getting the parser to recognize the table element in the first place, which is what gets this ball rolling, calls for more SGML magic: the OMITTAG feature, which was also left on the cutting room floor.  The cryptic "o o" in these element declarations (which may be familiar to those who have seen the "official" DTDs for HTML):

<!ELEMENT  table    o o        (%row)+   >
<!ELEMENT  (%row)   o o        (%cell)+  >
<!ELEMENT  (%cell)  o o        (#PCDATA) >

says that start-tags and end-tags can be omitted when they can be deduced from the context.  In this case, at the beginning of the document, there will have to be a <table> start-tag, so the parser can supply it on its own, and lo, that's all we need to make a conforming SGML instance out of a flat file with tilde-delimited fields.
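
To make the net effect concrete without an SGML parser at hand, here is a rough Python sketch that mimics what the map, the entities and the implied table tags add up to for a small flat file. It is an illustration only: a real parser recognizes short references while parsing and applies the RS/RE and OMITTAG rules far more carefully than a string replace does. The names `ENTITIES` and `expand_shortrefs` are invented for this example.

```python
# Hypothetical stand-in for the three entities named in the tblmap map.
ENTITIES = {
    "start": "<tr><td>",   # mapped from RS (record start)
    "more": "</td><td>",   # mapped from the field delimiter
    "end": "</td></tr>",   # mapped from RE (record end)
}

def expand_shortrefs(text, delim="~"):
    """Mimic the markup a parser would see after short reference expansion."""
    rows = []
    for record in text.splitlines():
        # Every delimiter inside the record becomes the "more" markup.
        cells = record.replace(delim, ENTITIES["more"])
        # The line boundaries become the "start" and "end" markup.
        rows.append(ENTITIES["start"] + cells + ENTITIES["end"])
    # The table tags are the ones OMITTAG lets the parser infer.
    return "<table>" + "".join(rows) + "</table>"

print(expand_shortrefs("root~0\nbin~1"))
# <table><tr><td>root</td><td>0</td></tr><tr><td>bin</td><td>1</td></tr></table>
```

Note that the input file itself contains none of these tags; everything between the angle brackets comes from the DTD.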

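But what about the fields of /etc/passwd being delimited by ':'?  No problem.  We can exploit yet another SGML rule, which says that only the first declaration of an entity counts.  Since declarations in the internal subset (the part between the square brackets of a DOCTYPE declaration) are read before the external DTD, anything declared there wins.  This is why the DTD above has a parameter entity named "delim".  Consider the following "wrapper" file: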

<!DOCTYPE  table SYSTEM "table.dtd" [
<!ENTITY % delim  ':' >
]>

Assuming our DTD is in a file named "table.dtd", and the wrapper file is named "wrapper.drv", we can do this:

$ cat wrapper.drv /etc/passwd | /usr/bin/sgmlnorm

Try it yourself!  :-)
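
If you would rather drive this from a script, the same thing can be sketched in Python. This is a minimal sketch under some assumptions: `build_input` and `normalize` are hypothetical helper names, sgmlnorm is assumed to be on the PATH and to read the document from standard input as in the pipeline above, and "table.dtd" is assumed to sit in the current directory.

```python
import shutil
import subprocess

def build_input(data, delim=":", dtd="table.dtd"):
    """Prepend the wrapper prolog, overriding the delim parameter entity."""
    prolog = (
        f'<!DOCTYPE  table SYSTEM "{dtd}" [\n'
        f"<!ENTITY % delim  '{delim}' >\n"
        "]>\n"
    )
    return prolog + data

def normalize(data, delim=":"):
    """Pipe wrapper plus data through sgmlnorm, like the cat pipeline."""
    exe = shutil.which("sgmlnorm")
    if exe is None:
        raise RuntimeError("sgmlnorm not found on PATH")
    result = subprocess.run(
        [exe],
        input=build_input(data, delim),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

# Show what would be fed to the parser for a one-line data file.
print(build_input("root:x:0:0:root:/root:/bin/sh\n"))
```

Calling `normalize(open("/etc/passwd").read())` would then play the role of the shell pipeline. The delimiter is dropped into a single-quoted literal as-is, so a delimiter that is itself a single quote would need different quoting.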

(To be continued...)
