Gathering text from XML

IlyaYeroshenko · Jul 09, 2012

Hello!

I use Java (the plugin is already written) to modify the content of opened XML/SGML files (Arbortext 5.2).

I need to collect the text from text nodes. Additionally, I have to add "\r\n" after parsing some nodes.

"\r\n" is used as a sentence terminator.

<heading>Some text</heading><content>Main article</content> will be transformed into "Some text\r\nMain article\r\n". So we have 2 sentences, even though they don't end with the usual periods.

In this example, "heading" and "content" are external tags, while "b" is an internal tag.

Can I get a list of such standard internal/external tags? Or should I make a settings dialog that lets the user add them to an ini file? The division is needed to control sentence boundaries.

naglists · Jul 09, 2012

Hi again, Ilya!
I do not fully understand this requirement either. Some additional
information would be helpful. What is your output? Are you displaying
something on screen? Or are you producing an HTML or XML file? Or are you
creating a PDF? Knowing the answer to that question will help suggest the
best strategies along the path you are on OR (perhaps) suggest an alternate
path.

For example, if you are outputting HTML, then you should probably be using
some flavor of XSL rather than using Java to directly manage node
traversal, interpretation, and expression.

With respect to getting a list of elements, I will answer as though you
want a list of the elements within the document you are working with and
make some assumptions about the structure of that document. I will assume
your document looks like this:

<document>
<heading> Some text</heading>
<content>Main article</content>
</document>

To get a list of the elements within <document> use the following command:
oid_find_children(oid_root(), root_children_arr,"document")

This will populate the array root_children_arr[] with the OIDs of each
child. In this case, those are heading and content, which in my
interpretation of your question are the "external" tags. You can then
traverse each of those elements using the same ACL function (but new
arrays) to find their children, or the "internal" tags.

But I can easily imagine other intended meanings for internal and external,
so ... again, hopefully this will give you some ideas of where to start or
help you clarify your question for us.

IlyaYeroshenko · Jul 11, 2012

Hello, Paul!

The output is plain text gathered from text nodes. But I need to add separators to the text so that the linguistic kernel DLL can count the sentences correctly and determine sentence boundaries. My kernel analyzer treats "\r\n" as a sentence terminator.

<heading>This is a heading</heading><contents>This is an article</contents> => this XML contains 2 sentences, though it does not contain any periods.

So, <b> is used to emphasize the text; it is an "internal" tag in my terminology, which is why we don't add the special sentence terminator "\r\n" after it. The <heading> and <contents> tags are external: we have to add "\r\n" after parsing them if the string buffer does not already end with "\r\n".

So, how should each node name be treated? Should I make a customizable list of internal tags?
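
The rule described in this post can be sketched in Java roughly as follows. This is only an illustration built on the standard org.w3c.dom API, not the plugin's actual code: the class name SentenceTextCollector and the hard-coded set of external tags are assumptions made for the example, and the Arbortext Java API is not used here.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SentenceTextCollector {
    // Assumption: these names are the "external" tags from the thread.
    // Any other element (for example <b>) is treated as "internal".
    static final Set<String> EXTERNAL_TAGS =
            new HashSet<>(Arrays.asList("heading", "content", "contents"));
    static final String TERMINATOR = "\r\n";

    // Parses the XML string and returns the gathered text with terminators.
    static String collect(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk: append text nodes, then close external elements
    // with the terminator unless the buffer already ends with it.
    private static void walk(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
        if (type == Node.ELEMENT_NODE
                && EXTERNAL_TAGS.contains(node.getNodeName())
                && !endsWithTerminator(out)) {
            out.append(TERMINATOR);
        }
    }

    private static boolean endsWithTerminator(StringBuilder out) {
        int n = out.length();
        int t = TERMINATOR.length();
        return n >= t && out.substring(n - t).equals(TERMINATOR);
    }

    public static void main(String[] args) {
        String xml = "<document><heading>Some text</heading>"
                + "<content>Main <b>article</b></content></document>";
        // Show the terminators visibly; prints: Some text\r\nMain article\r\n
        System.out.println(collect(xml).replace("\r\n", "\\r\\n"));
    }
}
```

Note that whitespace-only text nodes between elements (for example in a pretty-printed file) are appended as they are in this sketch; whether to trim them depends on what the analyzer expects.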

naglists · Jul 11, 2012

From what I can understand, then, yes, I think you will have to maintain a
list of tags that do not require the terminators. (Or, if it's a shorter
list, the ones that do.)
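
One possible way to keep such a list editable, sketched in Java under a few assumptions: the settings are stored as simple key=value lines that java.util.Properties can read (this is not a full ini parser and does not handle [sections]), and the key name internal.tags and the class name TagListSettings are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

class TagListSettings {
    // Hypothetical key holding a comma-separated list of "internal" tags.
    static final String KEY = "internal.tags";

    // Reads the tag names from key=value settings text.
    static Set<String> parseInternalTags(String settingsText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(settingsText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read settings", e);
        }
        Set<String> tags = new LinkedHashSet<>();
        for (String name : props.getProperty(KEY, "").split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                tags.add(trimmed);
            }
        }
        return tags;
    }

    // A tag needs the "\r\n" terminator unless it is listed as internal.
    static boolean needsTerminator(String tagName, Set<String> internalTags) {
        return !internalTags.contains(tagName);
    }

    public static void main(String[] args) {
        Set<String> internal = parseInternalTags("internal.tags = b, i, emphasis\n");
        System.out.println(needsTerminator("heading", internal)); // true
        System.out.println(needsTerminator("b", internal));       // false
    }
}
```

Listing the internal tags means that any unknown tag gets a terminator by default; if the list of external tags is the shorter one, the same code works with the test inverted.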

I have to say, though, that I can't help thinking that XSL might be a nice
technology to use. Although XSL doesn't output plain text, it might
simplify some of your Java requirements if you can pre-process your XML
with it. There's the fact that some of your content appears to be SGML,
too. I don't know what you'd have to do to convert your SGML to XML to
process with XSL.

GarethOakes · Jul 12, 2012

Sorry Paul, I can't help but point out that XSLT can in fact output plain text. Using <xsl:output method="text"/> will do the job.

Something else that may be of interest: certain XSLT processors will also accept HTML input and allow it to be processed as if it is an XML instance. I've found that feature handy in the past.

-G
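
As a rough illustration of the text output method, here is a minimal Java sketch that runs an XSLT 1.0 stylesheet through the JDK's javax.xml.transform API. The stylesheet, the element names it matches (heading, content, contents) and the class name XsltTextOutput are assumptions taken from the examples in this thread. Because serializers can differ in how they write line feeds on different platforms, the stylesheet emits a single line feed and the Java code normalizes line endings to "\r\n" afterwards; that also converts any line breaks inside the source text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltTextOutput {
    // Assumed stylesheet: the listed elements end a sentence, so a line
    // feed is written after their content. All other elements fall through
    // to the built-in rules, which just copy their text.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"heading | content | contents\">"
        + "<xsl:apply-templates/><xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String toText(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            // Normalize whatever line ending the serializer wrote to "\r\n".
            return out.toString().replaceAll("\r?\n", "\r\n");
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<document><heading>Some text</heading>"
                + "<content>Main <b>article</b></content></document>";
        // Print with the terminators made visible.
        System.out.println(toText(xml).replace("\r\n", "\\r\\n"));
    }
}
```

Unlike a hand-written traversal, this stylesheet does not check whether the output already ends with a terminator, so nested sentence-ending elements would produce consecutive terminators.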

naglists · Jul 12, 2012

Ooops. Thanks for reminding me.

I always get focused on some of the issues I've had trying to handle
character entities through XSL and the problems (from my perspective, not
from Michael Kay's 😉) expressing them the way I want.

However, Ilya, this is even more support for looking into whether you can
bake an XSL step into your process.