1-Visitor
July 9, 2012
Question

Gathering text from xml

  • 5 replies
  • 1260 views

Hello!


I use Java (the plugin is already written) to modify the content of opened XML/SGML files (Arbortext 5.2).


I need to collect the text from text nodes. Additionally, I have to add "\r\n" after parsing some nodes.


"\r\n" is used as a sentence terminator.


<heading>Some text</heading><content>Main article</content> will be transformed into "Some text\r\nMain article\r\n". So, we have 2 sentences even though they don't end with standard dots.


So, in this example "heading" and "content" are external tags, and "b" is an internal tag.


Can I get a list of such standard internal/external tags? Or should I make a settings dialog that allows adding them to an ini file? The division is required to control sentence boundaries.


    1-Visitor
    July 9, 2012
    Hi again, Ilya!
    I do not fully understand this requirement either. Some additional
    information would be helpful. What is your output? Are you displaying
    something on screen? Or are you producing an HTML or XML file? Or are you
    creating a PDF? Knowing the answer to that question will help suggest the
    best strategies along the path you are on OR (perhaps) suggest an alternate
    path.

    For example, if you are outputting HTML, then you should probably be using
    some flavor of XSL rather than using Java to directly manage node
    traversal, interpretation, and expression.

    With respect to getting a list of elements, I will answer as though you
    want a list of the elements within the document you are working with and
    make some assumptions about the structure of that document. I will assume
    your document looks like this:

    <document>
    <heading> Some text</heading>
    <content>Main article</content>
    </document>

    To get a list of the elements within <document> use the following command:
    oid_find_children(oid_root(), root_children_arr,"document")

    This will populate the array root_children_arr[] with the OIDs of each
    child. In this case, heading and content, which in my interpretation of
    your question are the "external" tags. You can then traverse each of those
    elements using the same ACL function (but new arrays) to find their
    children or the "internal" tags.

    But I can easily imagine other intended meanings for internal and external,
    so ... again, hopefully this will give you some ideas of where to start or
    help you clarify your question for us.


    1-Visitor
    July 11, 2012

    Hello, Paul!


    The output is plain text gathered from the text nodes. But I need to add additional separators to the text in order to calculate the correct number of sentences and define sentence boundaries in the linguistic kernel dll. My kernel analyzer treats "\r\n" as a sentence terminator.


    <heading>This is a heading</heading><contents>This is an article</contents> => this xml contains 2 sentences, though it does not contain any dots.


    So, <b> is used to emphasize the text; it is an "internal" tag in my terminology, which is why we don't add the special sentence terminator "\r\n" after it. The <heading> and <contents> tags are external: we have to add "\r\n" after parsing them if the string buffer does not already end with "\r\n".


    So, how should each node name be treated? Should I maintain a customizable list of internal tags?
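
The rule described in this post could be sketched in Java roughly as follows. This is only an illustration, not the plugin's actual code: it uses the JDK's standard org.w3c.dom parser rather than the Arbortext API, and the class name SentenceTextCollector and the set of "external" tag names are hypothetical. It appends "\r\n" after each external element unless the buffer already ends with it, and leaves internal tags such as "b" alone.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: gathers text and marks sentence boundaries.
final class SentenceTextCollector {
    private static final String TERMINATOR = "\r\n";

    // Assumption: the caller supplies the names of the "external" tags.
    private final Set<String> externalTags;

    SentenceTextCollector(Set<String> externalTags) {
        this.externalTags = externalTags;
    }

    String collect(Node root) {
        StringBuilder out = new StringBuilder();
        walk(root, out);
        return out.toString();
    }

    private void walk(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        // Visit children in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
        // After an external element, add a terminator unless one is already there.
        if (type == Node.ELEMENT_NODE
                && externalTags.contains(node.getNodeName())
                && out.length() > 0
                && !endsWith(out, TERMINATOR)) {
            out.append(TERMINATOR);
        }
    }

    private static boolean endsWith(StringBuilder sb, String suffix) {
        int start = sb.length() - suffix.length();
        return start >= 0 && sb.indexOf(suffix, start) == start;
    }

    // Convenience parser for trying the sketch outside Arbortext.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        SentenceTextCollector collector =
                new SentenceTextCollector(Set.of("heading", "contents"));
        Document doc = parse(
                "<doc><heading>This is a heading</heading>"
              + "<contents>This is an <b>article</b></contents></doc>");
        System.out.print(collector.collect(doc));
    }
}
```

Because the check looks at the end of the buffer, nested external elements produce a single terminator rather than one per closing tag.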

    1-Visitor
    July 11, 2012
    From what I can understand, then, yes, I think you will have to maintain a
    list of tags that do not require the terminators. (Or, if it's a shorter
    list, the ones that do.)
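
One possible way to keep such a list customizable is a small settings file read at startup. The sketch below is an assumption-laden illustration: it uses java.util.Properties (a key=value format, not a true INI parser), and the class name TagListSettings and the key "terminator.tags" are made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical settings reader for the list of tags that need a terminator.
final class TagListSettings {
    // Assumption: a single comma-separated entry holds the tag names.
    static final String KEY = "terminator.tags";

    static Set<String> load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        Set<String> tags = new LinkedHashSet<>();
        for (String name : props.getProperty(KEY, "").split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                tags.add(trimmed);
            }
        }
        return tags;
    }

    // Same as load, but for settings text already held in memory.
    static Set<String> parse(String settingsText) {
        try {
            return load(new StringReader(settingsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("terminator.tags = heading, content"));
    }
}
```

A settings dialog would then only need to rewrite that one entry; whichever list is shorter (tags that need the terminator, or tags that do not) can be stored the same way.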

    I have to say, though, that I can't help thinking that XSL might be a nice
    technology to use. Although XSL doesn't output plain text, it might
    simplify some of your Java requirements if you can pre-process your XML
    with it. There's the fact that some of your content appears to be SGML,
    too. I don't know what you'd have to do to convert your SGML to XML to
    process with XSL.

    16-Pearl
    July 12, 2012
    Sorry Paul, I can't help but point out that XSLT can in fact output plain text. Using <xsl:output method="text"/> will do the job.

    Something else that may be of interest: certain XSLT processors will also accept HTML input and allow it to be processed as if it is an XML instance. I've found that feature handy in the past.

    -G
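
For anyone who wants to try the text output method from Java, here is a minimal sketch using the JDK's built-in javax.xml.transform API. The stylesheet, the class name XsltTextExtractor, and the choice of "heading" and "content" as the elements that get a terminator are assumptions for illustration only. How the serializer writes line endings in text mode may depend on the processor and platform, so it is worth checking the actual output before feeding it to an analyzer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: runs a small XSLT 1.0 stylesheet with method="text".
final class XsltTextExtractor {
    // Assumption: heading and content are the elements that end a sentence.
    // All other elements fall through to the built-in templates, which
    // simply copy their text.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"heading | content\">"
        + "<xsl:apply-templates/>"
        + "<xsl:text>&#13;&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String extract(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(extract(
                "<document><heading>Some text</heading>"
              + "<content>Main <b>article</b></content></document>"));
    }
}
```

Unlike a hand-written traversal, this stylesheet does not check whether the output already ends with a terminator, so nested terminating elements would each emit their own.
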
    1-Visitor
    July 12, 2012
    Ooops. Thanks for reminding me.

    I always get focused on some of the issues I've had trying to handle
    character entities through XSL and the problems (from my perspective, not
    from Michael Kay's 😉) expressing them the way I want.

    However, Ilya, this is even more support for looking into whether you can
    bake an XSL step into your process.