Arbortext Editor Parsers: XML vs. SGML

steve.thompson4 · ‎Mar 24, 2011

Good Day!
Hope someone here can shed some light on this situation. Thanks in advance for your attention to this somewhat long post.

We have a customer who wants their data in SGML format, rather than the XML of the majority. We do our input and publishing using XML, then change the DTD declaration (and remove , etc.) to point to an "identical" SGML DTD. Seemed to be working fine, until... Someone thought to click the completeness check in Editor this revision cycle with the SGML open. Context checking was still 'on', but the SGML was reporting missing content in a place the XML did not and does not.

Changing the PROCEDURE ELEMENT declaration in the SGML DTD to "...(topic | %text; |graphic)*))..." (use an '*' in place of the '+') eliminates the error message. My belief is that the XML parser sees the %text; entity being satisfied (%text;'s '*'s allow for zero content) as also satisfying the '+', but the SGML parser sees the '+' as overriding the entity's '*'s.

While it may seem the easy answer is, make both use '*' rather than '+', the question remains: why don't they both return the error? Not sure which one should be considered 'wrong', if either, regarding the intent behind the structure, but shouldn't both parsers return the same response with what are essentially identical inputs?

Error message with SGML:
"/procedure Unexpected end tag encountered. More content is required."

Here's a 'sanitized' example of the tagged data, and the relevant portions of the DTDs.

Data:
...<procedure...><title>Title Data</title><pfmatr><pretopic>...</pretopic><pretopic>...</pretopic></pfmatr>(cursor here as result of error)</procedure>...

XML DTD:

| table | revst | revend)*" >

mfmatr?, chapter*) | increv | tr+) >
(pfmatr?, (topic | %text; |graphic)+))
| %deleted;) >

SGML DTD:

| table)*" >

mfmatr?, chapter*) | increv | tr+)
+(revst|revend) >
(pfmatr?, (topic | %text; |graphic)+))
| %deleted;) >

Thanks for any insight,
Steve Thompson
TAD Technical
Boeing-IDS Technical Publications
+1(316)977-0515
MC K83-08
The truth is the truth even if nobody believes it, and a lie is a lie even if everyone believes it.

PaulGrosso · ‎Mar 24, 2011

ptc-953343 · ‎Mar 24, 2011

If I'm reading it right, I would say the SGML parser is correct. The
(...)+ says one or more of those things has to appear after the pfmatr. The
(...)* part of %text; doesn't really mean anything; it only says the content
inside is optional and repeatable. Not sure what the specs say, but I would
expect to see another element required. So your model sort of reads like:

(topic | null | graphic)+

with null meaning that no element from the %text; choices was used.

I don't know what parsers Arbortext uses these days. Have you tried
running your XML version through Saxon to see what it says? Could just be
a bug. Do you still have a copy of OmniMark around? I'd see what it said
for the SGML version.

..dan

steve.thompson4 · ‎Mar 24, 2011

I'm originally inclined to agree with you that the (...)+ says something has to be there. On the other hand, if you look at your (topic | null | graphic)+ in a slightly different way, it can be seen as saying the same thing the XML parser is doing.

If 'null' is one of the choices for 'something', then having nothing (null) there satisfies the '+'. That's the point I had reached when I decided to see whether someone from PTC would take an interest if I posted the question.
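The "null satisfies the '+'" reading can be checked mechanically. Below is a minimal Python sketch that treats a content model purely as a regular expression and asks whether it accepts empty content. The tuple representation and the `nullable` helper are made up for illustration, and the %text; stand-in lists only the members visible in the XML DTD fragment above (the leading members are elided in the post). It shows what the regular-expression reading implies; it does not say what either Arbortext parser does internally, or which one is right.

```python
# Hypothetical mini-representation of a content model (not from any parser):
#   ("elem", name)        a single element
#   ("choice", [models])  a | b | c
#   ("seq", [models])     a, b, c
#   ("opt", model)        model?
#   ("star", model)       model*
#   ("plus", model)       model+

def nullable(model):
    """Return True if the model, read as a regular expression, accepts empty content."""
    kind = model[0]
    if kind == "elem":
        return False
    if kind in ("opt", "star"):
        return True
    if kind == "plus":
        # x+ accepts empty content only if x itself does
        return nullable(model[1])
    if kind == "choice":
        return any(nullable(m) for m in model[1])
    if kind == "seq":
        return all(nullable(m) for m in model[1])
    raise ValueError(f"unknown model kind: {kind}")

# Stand-in for %text; using only the members visible in the post;
# the real entity has more (elided) members in front of these.
text_entity = ("star", ("choice", [("elem", "table"),
                                   ("elem", "revst"),
                                   ("elem", "revend")]))

topic = ("elem", "topic")
graphic = ("elem", "graphic")

# (topic | %text; | graphic)+
with_text = ("plus", ("choice", [topic, text_entity, graphic]))
# (topic | graphic)+  -- same group without the nullable entity
without_text = ("plus", ("choice", [topic, graphic]))
# (pfmatr?, (topic | %text; | graphic)+)  -- the visible tail of PROCEDURE
procedure_tail = ("seq", [("opt", ("elem", "pfmatr")), with_text])

print(nullable(with_text))       # True: the empty %text; branch satisfies the '+'
print(nullable(without_text))    # False: at least one topic or graphic is required
print(nullable(procedure_tail))  # True
```

Under this reading the '+' and the '*' on the outer group accept exactly the same content, because one of the choices can already match nothing. That is consistent with what the XML side reports, and with the error disappearing when the '+' is changed to '*'.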

Bottom line for us is, it would be much better if 'identical' XML and SGML DTDs resulted in the same parsing of 'identical' XML and SGML instances. Right now, that doesn't happen. Having to maintain two different DTD structures, or only one but different than intended, is not the best answer.

Thanks for your input!
Steve Thompson
+1(316)977-0515

ptc-953343 · ‎Mar 24, 2011

Agreed, it should be the same whichever one is correct. Can you try
other parsers to see what sort of agreement there might be with other
tools? Saxon is good for XML, and OmniMark was always a strong one for SGML
purposes.

ebenton · ‎Mar 25, 2011

Another way to look at it might be to treat the "null" element the same as #PCDATA which can either be typed-in characters or nothing (null). In this case there are rules or recommendations about "mixed content" elements in a DTD. The #PCDATA should occur first in the "or-group". Maybe putting the "null" stuff first in the or-group would fix this. On the other hand, there's a possibility I don't know what I'm talking about...

From Arbortext 5.2 Help:
'Following is a list of constructs that Xerces will not consider valid that MarkIt may have allowed. Xerces treats many of these errors as fatal and will abort its parsing of the DTD on the first such error. The error message will be written to the Parser Error Messages window and the Invalid Schema/DTD file error dialog will be displayed.

More than one element being defined at once.

Tag minimization defined in element definition.

Comments embedded in markup.

"&" connector in content model.

Mixed content model with #PCDATA ending with "+" instead of "*".

#PCDATA not the first choice of a repeatable or-group at the top level of a content model.
...'
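The last two items in that list correspond to the form XML 1.0 requires for mixed content: either `(#PCDATA)` alone, or `(#PCDATA | name | ...)*` with #PCDATA first and a closing '*'. As a rough illustration, here is a small Python sketch that checks a model string against that form. The `is_xml_mixed` helper and the simplified name pattern are assumptions made for this example; this is not how Xerces or Arbortext implements the check, and it only applies to models that actually contain #PCDATA.

```python
import re

# Simplified XML name pattern (an assumption; real XML names allow more characters).
NAME = r"[A-Za-z_:][\w.\-:]*"

# XML 1.0 mixed content: '(' #PCDATA ('|' Name)* ')*'  or  '(' #PCDATA ')'
MIXED_XML = re.compile(
    rf"\(\s*#PCDATA(?:\s*\|\s*{NAME})*\s*\)\*|\(\s*#PCDATA\s*\)"
)

def is_xml_mixed(model: str) -> bool:
    """Return True if the model string has the mixed-content form XML 1.0 allows."""
    return MIXED_XML.fullmatch(model.strip()) is not None

print(is_xml_mixed("(#PCDATA | topic | graphic)*"))  # True
print(is_xml_mixed("(#PCDATA | topic)+"))            # False: ends with '+'
print(is_xml_mixed("(topic | #PCDATA)*"))            # False: #PCDATA is not first
```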

ptc-953343 · ‎Mar 25, 2011

Those would be errors in the DTD, not the source document. The DTD is being
parsed OK; it is then the interpretation that is saying another
element is required by the content model.

Good idea though.

..dan

ebenton · ‎Mar 25, 2011

The help says that some of the Xerces errors may be fatal and some not. It doesn't say which ones, and I don't know what messages you would get for the non-fatal ones.

The fact that changing the "+" to a "*" eliminates the error kind of points to a content model issue involving mixed content.

steve.thompson4 · ‎Mar 25, 2011

Sorry for the delay in getting back with you all. Really appreciate all the feedback. Working two priority one tasks at the same time (no one else does that, right?), and the other one took first place most of yesterday and early today.

The error we get is during the XML/SGML completeness check, not a DTD error during initial load. Both parsers accept the DTD as valid. They just interpret its proper application differently at this specific point.

But the point about mixed content is well taken, plus this problem has resulted in some basic discussion about the real requirements behind this doctype. We are going to disambiguate both DTDs, at least in this particular structure.

I'll post further details when the DTDs are updated and working, since I'll need to write the documenting change comments anyway.

Thanks again,
Steve Thompson
+1(316)977-0515