ACL gsub command

aleslie · ‎Jun 11, 2010

I am using the gsub('<.*>', ' ', string) command to remove start tags within an ASCII string.

However, if there are nested elements within the string, it removes the whole section from the very first '<' to the very last '>', regardless of the tag.
Is there any way of getting it to remove individual start tags only, from '<' to the next occurrence of '>'?

I assumed this is what the gsub command should be doing anyway?

ajordin · ‎Jun 11, 2010

Andy,

Try the following regular expression in the gsub command; it seems to do what you require.

<[^<>]+>

Adrian Jordin
Senior Consultant

Mekon Ltd.
Mekon House, 31-35 St Nicholas Way, Sutton, Surrey SM1 1JN.

steve.thompson4 · ‎Jun 11, 2010

Your algorithm is wrong. What you have says "Remove any string starting with '<', ending with '>', and with any number of any character in between." What you want is "Remove any string starting with '<', ending with '>', and with any number of any character other than '>' or '/' in between."

The '>' exclusion causes the removed string to stop at the first '>' it finds after starting at a '<'. The '/' exclusion causes only the removal of start tags, since any end tag will contain a '/'.

Coding of the actual regex is left as an exercise for the student... 😄

No, seriously, I believe you need something like "<[^/>]*>".
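
A quick way to check an expression like this is to run it over a small sample. The sketch below uses Python's re module as a stand-in for ACL's gsub (an assumption: the two engines should agree on a simple bracket expression like this one, but that is not verified here), and the sample string is made up for illustration.

```python
import re

# Hypothetical sample: a start tag, a nested element, and an empty tag.
text = "<para>Some <b>bold</b> text<br/></para>"

# "[^/>]*" stops at the first ">" and cannot cross a "/",
# so end tags and empty tags are left untouched.
start_only = re.sub(r"<[^/>]*>", " ", text)

print(repr(start_only))
```

Two side effects worth knowing: a start tag whose attribute value contains a "/" (a path or URL, say) is skipped as well, and because the expression uses "*" rather than "+", it also matches a bare "<>".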

Steve Thompson
+1(316)977-0515

bibach · ‎Jun 11, 2010

The problem you're running into is that the "*" operator is "greedy",
meaning that it will match as many instances of the previous item as
possible. Since the previous item can match any character, including
">", it will plow right through the end of the start tag.

To avoid going past the ending ">", you could try an expression like
"<[^>]+>", which would probably work 99.99% of the time. However, it
is actually legal for a ">" to appear in an attribute value, so you
might be just a little safer with something like
"<([^=>]|="[^"]*"|='[^']*')+>", which separately matches an attribute
value wrapped with ", one wrapped with ', and any other character in
the tag (which will not be a = or >).
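
The three expressions discussed so far can be compared side by side on a tag whose attribute value contains a ">". This is only a sketch: it uses Python's re module rather than ACL (an assumption that the engines behave alike for these constructs), and the sample string is invented for the purpose.

```python
import re

# Hypothetical sample with a ">" inside a double-quoted attribute value.
text = '<if test="a > b">yes</if>'

# Greedy ".*" runs from the first "<" to the last ">" in the string.
greedy = re.sub(r"<.*>", " ", text)

# The negated class stops at the first ">", which here is inside the attribute.
simple = re.sub(r"<[^>]+>", " ", text)

# The attribute-aware form consumes each quoted value as a single unit.
aware = re.sub(r"""<([^=>]|="[^"]*"|='[^']*')+>""", " ", text)

print(repr(greedy))
print(repr(simple))
print(repr(aware))
```

Only the last form removes the whole start tag and nothing else. The simple form cuts the tag short at the ">" inside the attribute value and leaves the rest of it in the text.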

Mind you, this should work for XML. If you're dealing with SGML, the
rules would be a little different.

-Brandon 🙂

bibach · ‎Jun 11, 2010

Good point, Steve. Andy did actually say "start tags", though I
guessed that this was meant to include all tags, including end tags
and empty tags. If it should really only match start tags, then I
agree that adding "/" inside the first set of square brackets in
either of the expressions I suggested would accomplish this (the first
of which would then be almost identical to Steve's recommendation).

-Brandon 🙂

steve.thompson4 · ‎Jun 11, 2010

Great catch on the attribute value possibility, Brandon.

This is such a good example of not having the actual requirement defined in sufficient detail before trying to write the code. The algorithm just can't be right unless or until the details are all captured. Once they are, writing the algorithm (mentally or actually) makes the code almost write itself.

Regards,
Steve Thompson
+1(316)977-0515

steve.thompson4 · ‎Jun 11, 2010

Adrian's expression, "<[^<>]+>", is essentially the same as the first one suggested by Brandon. The '<' in "[^<>]" is unneeded, since the expression cannot reach one of the '<'s without first encountering a '>'. It will therefore terminate the match before it reaches that '<'. I don't believe having that '<' in the exclusion would hurt anything, but it makes the expression less readable, in my opinion.

The fewer characters to parse in trying to read a regex, the better the readability. Again, in my opinion. 🙂
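
The equivalence can be checked on a small sample. The sketch below again uses Python's re module in place of ACL, and both input strings are hypothetical. It suggests the two expressions agree on well-formed markup and differ only when a stray, unescaped '<' appears in the text, which well-formed XML does not allow.

```python
import re

with_lt = r"<[^<>]+>"     # "<" excluded as well as ">"
without_lt = r"<[^>]+>"   # only ">" excluded

# Well-formed markup: both expressions remove exactly the same tags.
well_formed = "<a>x<b>y</b></a>"
same_a = re.sub(with_lt, " ", well_formed)
same_b = re.sub(without_lt, " ", well_formed)

# Malformed input with an unescaped "<" in the text: the results differ.
malformed = "1 < 2 <b>y</b>"
strict = re.sub(with_lt, " ", malformed)
loose = re.sub(without_lt, " ", malformed)

print(same_a == same_b)
print(repr(strict))
print(repr(loose))
```

So the extra '<' only matters if the input may contain a literal '<' outside a tag; in that case the shorter expression removes the text between the stray '<' and the next tag along with the tag itself.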

My Two Cents (if that much),
Steve Thompson
+1(316)977-0515

steve.thompson4 · ‎Jun 11, 2010

BTW, here's a good online resource re' Regex:

ClayHelberg · ‎Jun 11, 2010

I also recommend the little O'Reilly book, Regular Expressions Pocket Reference (

GarethOakes · ‎Jun 11, 2010

Oooh I love regular expressions! Out of curiosity, does anyone know if
ACL implements any non-greedy versions of "+" or "*"?

-Gareth

ClayHelberg · ‎Jun 11, 2010

I don't think so. At least it doesn't seem to support the usual "?"
modifier.
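
For comparison, here is how the lazy "?" modifier behaves in an engine that does support it. Python's re module is used purely as an illustration; nothing here is ACL, and the sample string is made up. On single-line input like this, the negated class gives the same result as the lazy form, which is why it works as a substitute where "?" is unavailable.

```python
import re

# Hypothetical sample with nested elements.
text = "<para>Some <b>bold</b> text</para>"

# Lazy quantifier: ".*?" stops at the first ">" it can.
lazy = re.sub(r"<.*?>", " ", text)

# Negated class: same effect without relying on a lazy quantifier.
negated = re.sub(r"<[^>]*>", " ", text)

print(repr(lazy))
print(lazy == negated)
```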