Friday, August 24, 2012

More on SHORTREF

As I mentioned earlier, I had two "joke" DTDs to demonstrate the use of SHORTREF.  Without further ado, here is the second:

<!-- DTD for UNIX /etc/passwd file: see Notes at end -->

<!ENTITY % acct "user,pwd,uid,gid,gcos,home,shell" >

<!ELEMENT passwd   o o  (account)+ >
<!ELEMENT account  - o  (%acct)    >
<!ELEMENT (%acct)  - o  (#PCDATA)  >

<!-- map colons and line boundaries to appropriate tags  --
  -- cascade of maps to avoid problems with empty fields -->

<!ENTITY   start    "<account><user>"   >
<!SHORTREF passmap  "&#RS;"    start    >
<!USEMAP   passmap             passwd   >

<!ENTITY   s.pwd    STARTTAG   "pwd"    >
<!SHORTREF usermap  ":"        s.pwd    >
<!USEMAP   usermap             user     >

<!ENTITY   s.uid    STARTTAG   "uid"    >
<!SHORTREF pwdmap   ":"        s.uid    >
<!USEMAP   pwdmap              pwd      >

<!ENTITY   s.gid    STARTTAG   "gid"    >
<!SHORTREF uidmap   ":"        s.gid    >
<!USEMAP   uidmap              uid      >

<!ENTITY   s.gcos   STARTTAG   "gcos"   >
<!SHORTREF gidmap   ":"        s.gcos   >
<!USEMAP   gidmap              gid      >

<!ENTITY   s.home   STARTTAG   "home"   >
<!SHORTREF gcosmap  ":"        s.home   >
<!USEMAP   gcosmap             gcos     >

<!ENTITY   s.shell  STARTTAG   "shell"  >
<!SHORTREF homemap  ":"        s.shell  >
<!USEMAP   homemap             home     >

<!ENTITY   end      ENDTAG    "account" >
<!SHORTREF shellmap "&#RE;"   end       >
<!USEMAP   shellmap           shell     >

<!-- NOTES
  -- --
  My first attempt tried to get away with a generic mapping of ":"
  to minimized end tags like this:

        <!ENTITY   start    STARTTAG  "account" >
        <!SHORTREF passmap  "&#RS;"    start    >
        <!USEMAP   passmap             passwd   >

        <!ENTITY   end      ENDTAG    "account" >
        <!ENTITY   delim    ENDTAG    ""        >
        <!SHORTREF acctmap  "&#RE;"   end
                            ":"       delim     >
        <!USEMAP   acctmap            account   >

  This works only so long as account lines don't have empty fields.
  However, a number of system accounts (bin, sys, etc.) don't have
  SHELLs assigned. The problem here is Clause 7.3.1.1:

     The start-tag can be omitted if the element is a contextually
     required element and any other elements that could occur are
     contextually optional elements, except if
       a) the element has a required attribute or declared content;
       or
       b) the content of the instance of the element is empty.

  Hence the series of shortrefs mapping ":" to various start-tags.
  -- --
  Copyright 1994-8  Arjun Ray
-->

This was also an attempt to make an SGML instance out of /etc/passwd, using a sequence of shortref maps, each triggered by the current element context. The DTD also uses another feature of entity declarations, where the replacement text can be identified directly as markup so that the parser doesn't even have to go looking for tags.
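For readers without an SGML parser to hand, here is a rough Python sketch of the effect of the cascade on a single line. It only imitates the result; it is not how a real parser works. The function name expand_passwd_line is made up for illustration, and the sketch assumes every line has all seven fields, so that the map for "shell" is the one in force at the end of the record.

```python
# Field names in the order given by the %acct parameter entity in the DTD.
FIELDS = ["user", "pwd", "uid", "gid", "gcos", "home", "shell"]

def expand_passwd_line(line):
    """Imitate the shortref cascade for one /etc/passwd line (hypothetical helper)."""
    out = ["<account><user>"]      # record start: passmap maps "&#RS;" to entity "start"
    current = 0                    # index of the element whose map is currently in force
    for ch in line:
        if ch == ":" and current < len(FIELDS) - 1:
            # The map for the current element turns ":" into the next start-tag.
            current += 1
            out.append("<" + FIELDS[current] + ">")
        else:
            # Ordinary data; shellmap has no ":" mapping, so a colon there stays data.
            out.append(ch)
    # Record end: shellmap maps "&#RE;" to entity "end" (assumes all seven fields).
    out.append("</account>")
    return "".join(out)

print(expand_passwd_line("bin:x:2:2:bin:/bin:"))
```

Note that the empty shell field still gets its own start-tag, which is exactly the case the generic mapping in the Notes could not handle.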

But SHORTREF isn't only for jokes.  As shorthand for otherwise verbose markup, it helps authors, albeit only those who already know the DTD that their document is expected to conform to.  The approach taken in this article goes one step further and considers the possibility of short references constituting an alternate syntax, and demonstrates the viability of this through RNG validation.

Unfortunately, this will not go down well with tag-heads, those who can't sleep at night without their daily diet of pointy brackets.  Anything that threatens to take their beloved tags away would be anathema.

Monday, August 20, 2012

SHORTREF Redux

How I came to google the term "shortref" yesterday doesn't really matter at this point.  But it led to a pleasant surprise when I found this.  Someone is trying to resurrect SHORTREF for XML?  Wow!

SHORTREF was one of the many SGML features that didn't make the initial - and, as it seems to have panned out, the final - cut for XML, back in 1996.  The basic reasons were two: it was optional in SGML itself, and it would complicate parser implementations.

But, as this new article points out, it had and still has its attractions.  Given parser support, it was a cheap way to have alternate syntaxes (such as Wiki style markup today).  It was also a way to make conforming SGML instances out of documents that on the surface might not look like SGML at all.

The canonical example is a flat file with delimited fields, like /etc/passwd or a BCP dump of a relational table.   Years ago - yikes, many years ago - I cooked up two DTDs to demonstrate this.  Here is the first one in its entirety:

<!-- DTD for simplified tables from flat file: see Notes at end -->

<!-- what's a DTD without parameter entities? :-) -->

<!ENTITY   %        row        "tr"   >
<!ENTITY   %        cell       "td"   >

<!ENTITY   %        delim      "~"    >

<!-- bog-standard schema -->

<!ELEMENT  table    o o        (%row)+   >
<!ELEMENT  (%row)   o o        (%cell)+  >
<!ELEMENT  (%cell)  o o        (#PCDATA) >

<!ATTLIST  table
           border   (border)   #IMPLIED  >

<!-- map tildes and line boundaries to appropriate tags  -->

<!ENTITY   start    "<tr><td>"       >
<!ENTITY   more     "</td><td>"      >
<!ENTITY   end      "</td></tr>"     >
<!SHORTREF tblmap   "&#RS;"    start
                    "%delim;"  more
                    "&#RE;"    end   >
<!USEMAP   tblmap              table >

<!-- NOTES
  -- --
  This is basically a "joke" DTD to prove a concept about parsing
  delimited flat files with an SGML parser. The idea is to treat
  line boundaries and explicit delimiters as short references for
  the "required" tags.
  -- --
  This is not guaranteed to work, because enabling short references
  requires all markup to be recognized. Some fields could have the
  dreaded < or & characters. An actual data file should probably be
  preprocessed via sed or somesuch to replace '<' with '<<!>' and
  '&' with '&<!>'. This null declaration trick saves the need to
  declare entities such as '&lt;' and '&amp;'. One such filter is

     sed -e 's/\([&<]\)/\1<!>/g'
  -- --
  A problem occurs in tab delimited files. For reasons buried deep
  in ISO 8879, nsgmls will not recognize empty fields in this case,
  because two or more consecutive tabs also constitute a valid
  short reference! (Yes, this is a nasty gotcha.)

  A workaround is to change the tabs to a graphic like '~' or ':'
  if this is safe in relation to the data.

  Otherwise, it's best to redefine the shortref delimiter set in the
  SGML declaration to exclude the "contiguous whitespace" shortrefs.
  (Thanks to Joe English for this idea.)
  -- --
  Copyright 1994-8  Arjun Ray
-->
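The sed filter in the Notes can be approximated in Python, should that be more convenient. This is only a sketch: the function name protect_markup is invented here, and it does nothing beyond what the sed one-liner does.

```python
import re

def protect_markup(text):
    """Append the null declaration '<!>' after every '&' and '<' (hypothetical helper).

    Mirrors: sed -e 's/\\([&<]\\)/\\1<!>/g'
    """
    return re.sub(r"([&<])", r"\1<!>", text)

print(protect_markup("AT&T <ops>"))
```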


The basic idea of SHORTREF is that contexts are set up where ordinary data characters (usually punctuation) can be recognized as "shorthand" for markup. The markup is then substituted into the document in a process like that of entity reference expansion, and parsing continues as if the markup had been there to begin with. So, three things are needed for this to work:
  1. A mapping between data characters and names of entities.
  2. Definitions of these entities, providing the replacement markup.
  3. A context specification for when this mapping will be active.
Thus, in the DTD above we have a short reference map:

<!SHORTREF tblmap   "&#RS;"    start
                    "%delim;"  more
                    "&#RE;"    end   >

which associates "&#RS;" (or "Record start", SGMLese for "beginning of line") with an entity named "start", the character "~" with one named "more", and "&#RE;" (or "Record end") with one named "end".

Next, we have definitions of these entities:

<!ENTITY   start    "<tr><td>"       >
<!ENTITY   more     "</td><td>"      >
<!ENTITY   end      "</td></tr>"     >

And finally we have a USEMAP declaration:

<!USEMAP   tblmap              table >

to say that the short reference mapping named "tblmap" should become active during the parse when a "table" element is opened.
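The three pieces can be mimicked in a few lines of Python. Again, this is a sketch of the effect and not of the parsing machinery: the names expand_table_shortrefs, ENTITIES and the delim parameter are assumptions made for illustration, and the "context" is simply taken to be the whole input.

```python
# Entity definitions, as in the DTD.
ENTITIES = {
    "start": "<tr><td>",
    "more": "</td><td>",
    "end": "</td></tr>",
}

def expand_table_shortrefs(text, delim="~"):
    """Expand record boundaries and delimiters into markup (hypothetical helper)."""
    rows = []
    for record in text.splitlines():
        rows.append(
            ENTITIES["start"]                           # "&#RS;" -> start
            + record.replace(delim, ENTITIES["more"])   # delimiter -> more
            + ENTITIES["end"]                           # "&#RE;" -> end
        )
    return "\n".join(rows)

print(expand_table_shortrefs("a~b\nc~~d"))
```

Passing delim=":" gives the same treatment to colon-delimited data.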

Of course, getting the parser to recognize the table element needed to get this ball rolling calls for more SGML magic, the OMITTAG feature, which also was left on the cutting room floor.  The cryptic "o o" in these element declarations (which may be familiar to those who have seen the "official" DTDs for HTML):

<!ELEMENT  table    o o        (%row)+   >
<!ELEMENT  (%row)   o o        (%cell)+  >
<!ELEMENT  (%cell)  o o        (#PCDATA) >

says that start-tags and end-tags can be omitted when they can be deduced from the context.  In this case, the parser can deduce that there has to be a <table> start-tag at the beginning of the document, and lo, that's all we need to make a conforming SGML instance out of a flat file with tilde-delimited fields.

But what about the fields of /etc/passwd being delimited by ':'?  No problem.  We can exploit yet another SGML rule, which says that only the first declaration of an entity counts.  This is the reason why the DTD above has a parameter entity named "delim".  Consider the following "wrapper" file:

<!DOCTYPE  table SYSTEM "table.dtd" [
<!ENTITY % delim  ':' >
]>

Assuming our DTD is in a file named "table.dtd", and the wrapper file is named "wrapper.drv", we can do this:

$ cat wrapper.drv /etc/passwd | /usr/bin/sgmlnorm

Try it yourself!  :-)
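The "first declaration counts" rule is what makes the wrapper work: the internal subset between the square brackets is read before table.dtd, so its "delim" wins. A toy Python model of that precedence, with a made-up declare helper, might look like this:

```python
def declare(entities, name, value):
    """Record an entity declaration; later redeclarations are ignored (hypothetical helper)."""
    entities.setdefault(name, value)

entities = {}
declare(entities, "delim", ":")   # from the wrapper's internal subset, read first
declare(entities, "delim", "~")   # from table.dtd, read second and therefore ignored

print(entities["delim"])
```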

(To be continued...)

Sunday, August 19, 2012

XML reconsidered: Part 0


Recently, my love-hate relationship with XML has once again intruded into my consciousness.  It was on hold for quite a while.  For some eight years I could happily dismiss XML as something I really didn't need to deal with just now, or even think about... maybe later.

My latest gig has changed all that.  I am now awash in Big Company XML.  The good news is that said Big Company has been sensible, on the whole, in their use of XML; they have not fallen prey to techno lock-in by plumping for the bells and whistles that have been available.  The bad news is that the use of XML in the enterprise is still headed for the rocks, as the "case" for using those bells and whistles, wisely eschewed so far, is getting stronger.

This is the Golden Hammer syndrome.  You may start out simple, using only the basic features of some tool, some enabling technology.  But, over time, your dependence on this particular tool grows, even if only by dint of familiarity, until you reach a stage where the tool, and how it "works", starts to determine how you even think of the data you're applying the tool to.  Your data - and their proper handling - are no longer your paramount consideration.  Instead, your acid reflux is over how to use your tool "more effectively", and meanwhile your data will simply have to conform to your tool's features - and its limitations.

So it is with XML.  Early on, there was a big push to sell XML as a data format.  While document-centric applications of XML - much truer to its origins in SGML - exist, it's safe to say that XML is being used, and rather overused, as a ridiculously verbose - and finicky - syntax for property sets, whether these be configuration files for control or data-sets shunted through middle-ware. This devolution of a syntax, designed for deep hierarchical structures, into glorified name-value pairs is, in turn, the long-term consequence of early decisions on how to process XML.

And therein lies the rub.  We are, in fact, already locked into certain ways of processing XML, to the exclusion of other ways.  And this, in a feedback loop, affects the ways in which XML itself is used. The taggery written out is now mostly if not entirely determined by how the stuff is going to be read back in.  What the processing tools make convenient now dictates everything. 

In the blog posts to follow, I hope to explore my dissatisfaction with XML in this regard.  Stay tuned.

How many squares?

The old "how many coconuts?" puzzle had some mathematical sophistication.  Not so this recent puzzle that made the rounds on Facebook (the link may not work, in which case you'll have to go to the home page of the original poster - a radio station, it seems - and look through their Wall photos album to find this):


The question, of course, was: how many squares?

Perhaps it's the apparent simplicity of the puzzle that prompted hordes of people - over 245,000 within a day or so, a veritable feeding frenzy - to comment with their answers.  Which, both predictably and sadly, were all over the place.  Even the OP had not much of a clue, claiming 25 as the "best answer" - as if this were indeed a matter of popular vote.  Shades of the Indiana legislature trying to mandate the value of pi!

Okay, lest anyone try their hand at the puzzle here, the answer is 40.  This is because the diagram is a composite of two small 2x2 grids, each of which has 5 squares, and one large 4x4 grid, which has 30 squares.

And how do we know this?  Because there is a general formula for an NxN grid: the number of squares is the sum of the squares of the numbers 1 to N.  Thus a 2x2 grid has 1 + 4 = 5 squares, and a 4x4 grid has 1 + 4 + 9 + 16 = 30 squares.
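For the skeptical, the formula is easy to check against a brute-force count. The function names below are invented for this sketch; the brute-force version simply enumerates every possible top-left corner for every possible size.

```python
def squares_by_formula(n):
    """Sum of the squares of 1..n."""
    return sum(k * k for k in range(1, n + 1))

def squares_by_counting(n):
    """Count every placement of every size of square in an n x n grid."""
    count = 0
    for size in range(1, n + 1):
        for row in range(n - size + 1):        # top edge positions
            for col in range(n - size + 1):    # left edge positions
                count += 1
    return count

# One 4x4 grid plus two 2x2 grids, as in the puzzle.
total = squares_by_formula(4) + 2 * squares_by_formula(2)
print(total)
```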

And how can we prove this formula?  By using a uniform procedure to count the squares of different sizes (1x1, 2x2, etc., up to NxN).

Imagine a unit 1x1 square superimposed on the square at the top left.  Shifting it one unit at a time to the right, we get N positions for it along the top row.  Similarly, shifting it one unit at a time down the left side yields N possible starting points for the rightward shifts along a row.  So, there are NxN 1x1 squares in an NxN grid.  Next, start with a 2x2 square at the top left.  Shifting it one unit at a time to the right yields N-1 positions.  A similar analysis down the left side allows us to conclude that there are (N-1)x(N-1) 2x2 squares in an NxN grid.  We can repeat this analysis with progressively larger squares (3x3, up to NxN), and each time the count is a perfect square, ending with 1 - which is the square of 1! - for the single NxN square.  The formula follows.
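Written out in symbols, the counting argument comes to the following. The closed form on the right is the standard identity for the sum of the first N squares; it is not needed for the puzzle and is added here only for completeness.

```latex
\[
\#\{\,k \times k \text{ squares in an } N \times N \text{ grid}\,\} = (N-k+1)^2,
\qquad
\sum_{k=1}^{N} (N-k+1)^2 \;=\; \sum_{j=1}^{N} j^2 \;=\; \frac{N(N+1)(2N+1)}{6}.
\]
```

For N = 4 this gives 4 x 5 x 9 / 6 = 30, and for N = 2 it gives 2 x 3 x 5 / 6 = 5, matching the counts above.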

So, knowing the formula - or even working it out on the fly - would have yielded the answer without even counting.  Nevertheless, it's somewhat disquieting that only a very small percentage of approximately 250,000 people on Facebook - surely a representative sample of reasonably educated people - got this puzzle right.