Extracting Content

RobertWilliams · ‎Dec 19, 2011

I need to extract the content of <title>, <para>, and <trim.para> tags from a moderately sized (1100 pages) XML document. I've been looking into regex (but repeatedly see folks say it's not the best tool for this), and I don't know anything about XML DOM parsers or the like (I'm just a simple Arbortext user). Is there an easy way to do this, either within Arbortext or with a simple tool I can find online that parses an XML file and dumps content to text based on the tags I select? Request advice, over. (Bob, Bristol RI)

Thanks all for the quick replies. Suzanne's write -untagged command was perfect - quick and so simple, even a caveman could.... well, you know.

Happy holidays and thanks again. (Bob, Bristol RI)

LynnHales · ‎Dec 19, 2011

Bob,

There is some ACL coding that might work for you. I might have a script that I can post for you to get you started. Just not where I can check right now.

I know you can find and capture the content of various elements with oid_select() (see help 2157).

If you write a script that searches for an OID with oid_name() of <title>, <para>, or <trim.para> you can then capture and manipulate the content.

Lynn

RobertWilliams · ‎Dec 19, 2011

Thx - but note - I'm trying to extract it - in order (all sequential instances of title, trim.para, and para in the order they appear in the book).

Pulling them out separately won't help, as I'm having to try to re-purpose a large portion of the Tech Manual content over to a training doc that isn't done in XML, it's in MS Word, hence the difficulty.

If I can get it out as text in some manner, I can simply copy and paste it in to the other DOC.

Thanks for reading and any replies welcome. Happy Holidays (Bob, Bristol RI)

steve.thompson4 · ‎Dec 19, 2011

Changing the 'stylesheet' reference at the top of the XML to point to a relatively simple XSLT would allow opening the XML in a web browser. Structuring the XSLT to only pass the content of those elements into a bare-bones HTML document would allow copying that content from the browser window to wherever.

Steve Thompson
+1(316)977-0515

AndyEsslinger · ‎Dec 19, 2011

Sounds like you might be able to use the "write command" or the "save_as command" to spit out an XML doc with everything in order:

See help 9061 for write and help 9132 for save_as
write [-ct [original | changesapplied | changeshighlighted]] [-sgml | -untagged | -xml] [-encoding string] [-paste | -buffer buffername | -all] [-ok] [-header | -noheader] [-pi | -nopi] [-eoc | -noeoc] [-nonasciichar [entref | numref | char]] [-public pubident] [-sysid sysident] [-nobreakattag] [-flatten [file | text | both]] filename
-Andy

\ / Andy Esslinger LM Aero - Tech Order Data
_____-/\-_____ (817) 279-0442 1 Lockheed Blvd, MZ 4285
\_\/_/ (817) 777 3047 Fort Worth, TX 76108-3916

ptc-1908075 · ‎Dec 19, 2011

Bob,

Check out the write -untagged command (help 9061). Note there are limitations:

Writes a text-only version of the document which contains no tags. No entity expansion or substitution is attempted. write -untagged does not write equations. Tables will be written if set tabletags is set to on.

Just enter write -untagged filename at the command line of Arbortext Editor.

HTH.

Suzanne Napoleon
www.FOSIexpert.com
"WYSIWYG is last-century technology!"

JasonBuss · ‎Dec 19, 2011

I think using a combination of oid_find_children() and oid_content() will probably get you what you need in the order you want. If you make a txt file with the open(fid,path) func, walk the array from oid_find_children() and do a put() on the return from each oid_content on the array objects, this shouldn't be too bad. Setting the flag on oid_find_children to make the expression for the tagname a regex, you should be able to OR your tagnames to pull them into the array sequentially.
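For anyone who wants to try the same walk outside Arbortext, here is a minimal sketch in Python (not ACL, and it makes no claims about how oid_find_children() behaves). It uses only the standard library; the helper name dump_elements is hypothetical. It assumes the file parses on its own, so undeclared entity references would need to be resolved first. It ORs the tag names in a regex, visits elements in document order, and writes each element's text to a file.

```python
import re
import xml.etree.ElementTree as ET

# Tag names ORed into one pattern, mirroring the regex idea described above.
WANTED = re.compile(r"^(title|para|trim\.para)$")


def dump_elements(xml_path, txt_path):
    """Write the text of each matching element, in document order, one per line.

    Hypothetical helper for illustration. Returns the number of lines written.
    """
    tree = ET.parse(xml_path)
    count = 0
    with open(txt_path, "w", encoding="utf-8") as out:
        # iter() visits elements depth-first, which is document order.
        for elem in tree.iter():
            if isinstance(elem.tag, str) and WANTED.match(elem.tag):
                # itertext() gathers text from the element and its descendants;
                # split/join collapses runs of whitespace and line breaks.
                text = " ".join("".join(elem.itertext()).split())
                out.write(text + "\n")
                count += 1
    return count
```

Note that an element nested inside another matching element (say, a title inside a para) would be written twice with this sketch, once as part of its parent and once on its own.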

Of course this is all off the top of my head, sounds like it would work 🙂

ptc-953343 · ‎Dec 19, 2011

I would use XSLT to do this. It would be a pretty easy stylesheet, and then
you could use Kernow to provide an easy front end. There are many tutorials
on the web, or use Michael Kay's XPath and XSLT book, which is one of the
better tutorial and reference manuals.

Within Arbortext you could also use ACL to traverse the tree and pull the
content, but XSLT would be easier to get into just because there is more
information available. XSLT is also more portable if you need to share the
tools.

..dan


ClayHelberg · ‎Dec 19, 2011

Hi Bob--

As others have pointed out, there are a number of ways to skin this cat. For an Arbortext-specific solution, ACL gives a fairly direct method. You could use code that looks something like this:

# function to dump specific text to a message window for subsequent copying/pasting into
# another app (e.g. Word)

function dumptext() {

    # define an array indicating what elements we want to dump text from
    local elems[];
    elems["title"]=1;
    elems["para"]=1;
    elems["trim.para"]=1;

    # walk through document, check each element, and dump content
    # if it's one of the flagged entries
    goto_oid(oid_root(),0);
    oid = oid_root();
    while (oid_valid(oid)) {
        oid = oid_forward(oid);
        if (elems[oid_name(oid)]) {
            eval oid_content(oid) output=>*;
        }
    }
}

Just load the document you want to dump, call the dumptext() function (using the Command Window), and it should pop up a new window containing all the text you want to dump.

You could also do this easily with XSLT, using a very simple stylesheet something like this:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text" version="1.0" encoding="UTF-8"/>

<xsl:template match="title | para | trim.para">
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>

<xsl:template match="text()"/>

</xsl:stylesheet>

To apply this, you could load the document in Arbortext, select "Publish->Using XSL...", and select that stylesheet. This will produce a text file containing the content you want to dump.

These are only two approaches; you could also use just about any general-purpose programming language (Python, Perl, Ruby, Java, C++, etc.) to do this. I wouldn't recommend trying to use regular expressions, however, because regex has known problems handling nested (recursive) structures such as XML elements, where one element can contain another of the same name.
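To make the regex caveat concrete, here is a small Python sketch using only the standard library. The sample markup and both function names are invented for illustration, and nested paras may or may not be allowed by your DTD. A non-greedy pattern stops at the first closing tag, so a nested element gets cut off, while a parser returns each element whole.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: a para nested inside another para.
SAMPLE = "<doc><para>outer <para>inner</para> tail</para></doc>"


def paras_by_regex(xml_text):
    """Naive approach: pair each <para> with the nearest </para>."""
    return re.findall(r"<para>(.*?)</para>", xml_text, flags=re.DOTALL)


def paras_by_parser(xml_text):
    """Parser approach: full text of every para element, in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(p.itertext()) for p in root.iter("para")]


print(paras_by_regex(SAMPLE))   # ['outer <para>inner']
print(paras_by_parser(SAMPLE))  # ['outer inner tail', 'inner']
```

The regex result is truncated and still contains a stray tag; the parser result has the complete outer text followed by the inner para.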

--Clay

Clay Helberg
Senior Consultant
TerraXML

Extracting Content - SOLVED
