Hello!
I have written a Java plugin that modifies the content of open XML/SGML files (Arbortext 5.2).
I need to collect the text from text nodes. Additionally, I have to append "\r\n" after parsing some nodes.
"\r\n" is used as a sentence terminator.
<heading>Some text</heading><content>Main article</content> will be transformed into "Some text\r\nMain article\r\n". So we have 2 sentences, even though they don't end with the usual dots.
So, in this example, "heading" and "content" are external tags, and "b" is an internal tag.
Can I get a list of such standard internal/external tags? Or should I make a settings dialog that allows adding them to an ini file? The division is required to control sentence boundaries.
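To make the ini-file option concrete, here is a minimal sketch of reading a customizable list of internal tags with java.util.Properties. The class name TagSettings and the key "internal.tags" are my own placeholders, not anything defined by Arbortext, and a real plugin would read from a file rather than a string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper: parses settings text such as "internal.tags = b, i, emph".
class TagSettings {
    // Assumed key name; pick whatever fits the plugin's settings file.
    static final String KEY = "internal.tags";

    static Set<String> parseInternalTags(String settingsText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(settingsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Set<String> tags = new LinkedHashSet<>();
        // The value is a comma-separated list; blanks around names are ignored.
        for (String name : props.getProperty(KEY, "").split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                tags.add(trimmed);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(parseInternalTags("internal.tags = b, i, emph\n"));
    }
}
```

Any tag not in this set would then be treated as external, so only the (usually short) list of inline tags has to be maintained.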
Hello, Paul!
The output is plain text gathered from the text nodes. But I need to add extra separators to the text in order to calculate the correct number of sentences and define sentence boundaries in the linguistic kernel DLL. My kernel analyzer treats "\r\n" as a sentence terminator.
<heading>This is a heading</heading><contents>This is an article</contents> => this xml contains 2 sentences, though it does not contain any dots.
So, <b> is used to emphasize the text; it is an "internal" tag in my terminology, which is why we don't add the special sentence terminator "\r\n" after it. The <heading> and <contents> tags are external: we have to add "\r\n" after parsing them if the string buffer does not already end with "\r\n".
So, how should each node name be treated? Should I make a customizable list of internal tags?
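The rule described above can be sketched roughly like this. It is only an illustration written against the standard org.w3c.dom interfaces and the JDK parser, not against the Arbortext API; the class name SentenceTextCollector and the wrapping <doc> element are made up for the example, and the set of internal tags is assumed to come from the settings.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: gathers text and terminates external elements with "\r\n".
class SentenceTextCollector {
    static final String TERMINATOR = "\r\n";
    private final Set<String> internalTags;

    SentenceTextCollector(Set<String> internalTags) {
        this.internalTags = internalTags;
    }

    String collect(Node root) {
        StringBuilder out = new StringBuilder();
        walk(root, out);
        return out.toString();
    }

    private void walk(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
        // After an external element, add the terminator unless the buffer already ends with it.
        if (type == Node.ELEMENT_NODE && !internalTags.contains(node.getNodeName())
                && !endsWithTerminator(out)) {
            out.append(TERMINATOR);
        }
    }

    private static boolean endsWithTerminator(StringBuilder out) {
        int n = TERMINATOR.length();
        return out.length() >= n && out.substring(out.length() - n).equals(TERMINATOR);
    }

    // Test helper only: parses an XML string with the JDK's built-in parser.
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><heading>This is a heading</heading>"
                + "<contents>This is an <b>article</b></contents></doc>";
        String text = new SentenceTextCollector(Set.of("b")).collect(parse(xml));
        System.out.print(text); // two lines, each terminated by "\r\n"
    }
}
```

With "b" as the only internal tag, nested external elements do not produce duplicate terminators, because the check looks at the end of the buffer rather than at the element itself.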