1-Visitor
June 11, 2010
Question

ACL gsub command


I am using the gsub ('<.*>', ' ', string) command to remove start tags within an ASCII string.

However, if there are nested elements within the string, it removes the whole section from the very first '<' to the very last '>', regardless of the tag.
Is there any way of getting it to remove only individual start tags, from each '<' to the next occurrence of '>'?

I assumed this is what the gsub command should be doing anyway?
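
The behaviour described in the question can be reproduced outside ACL. The sketch below uses Python's `re` module purely as a stand-in, on the assumption that ACL's gsub treats `.*` the same greedy way; the sample string is invented for illustration.

```python
import re

# Hypothetical sample string with nested elements.
text = "Intro <para>Hello <b>world</b></para> outro"

# ".*" is greedy: it extends to the LAST ">" in the string.
matched = re.search(r"<.*>", text).group()
result = re.sub(r"<.*>", " ", text)

print(repr(matched))
print(repr(result))
```

The single match spans every tag and all the text between them, which is why the whole section disappears.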

    10 replies

    1-Visitor
    June 11, 2010
    Andy,

    try the following regular expression in the gsub command, it seems to do what you require.

    <[^<>]+>


    Adrian Jordin
    Senior Consultant

    Mekon Ltd.
    Mekon House, 31-35 St Nicholas Way, Sutton, Surrey SM1 1JN.
    1-Visitor
    June 11, 2010
    Your algorithm is wrong. What you have says "Remove any string starting with '<', ending with '>', and with any number of any character in between." What you want is "Remove any string starting with '<', ending with '>', and with any number of any character other than '>' or '/' in between."

    The '>' exclusion causes the removed string to stop at the first '>' it finds after starting at a '<'. The '/' exclusion causes only the removal of start tags, since any end tag will contain a '/'.

    Coding of the actual regex is left as an exercise for the student... 😄

    No, seriously, I believe you need something like "<[^/>]*>".
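
A quick way to check this expression is to run it in another regex engine. The sketch below uses Python's `re` as a stand-in for ACL's gsub (an assumption, since the two engines may differ), with invented sample strings.

```python
import re

START_TAG = r"<[^/>]*>"  # no ">" and no "/" allowed between "<" and ">"

text = "<para>Hello <b>world</b></para>"
stripped = re.sub(START_TAG, " ", text)

# Side effect of excluding "/": a start tag whose attribute value
# contains a slash is not matched either.
with_slash = '<a href="x/y">link</a>'
untouched = re.sub(START_TAG, " ", with_slash)

print(repr(stripped))
print(repr(untouched))
```

Start tags are replaced and end tags are left alone. Note that the "/" exclusion also skips empty tags such as `<br/>` and any start tag with a slash in an attribute value, which may or may not be what is wanted.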

    Steve Thompson
    +1(316)977-0515
    1-Visitor
    June 11, 2010
    The problem you're running into is that the "*" operator is "greedy",
    meaning that it will match as many instances of the previous item as
    possible.  Since the previous item can match any character, including
    ">", it will plow right through the end of the start tag.

    To avoid going past the ending ">", you could try an expression like
    "<[^>]+>", which would probably work 99.99% of the time.  However, it
    is actually legal for a ">" to appear in an attribute value, so you
    might be just a little safer with something like
    "<([^=>]|="[^"]*"|='[^']*')+>", which separately matches an attribute
    value wrapped with ", one wrapped with ', and any other character in
    the tag (which will not be a = or >).

    Mind you, this should work for XML.  If you're dealing with SGML, the
    rules would be a little different.
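
To see the difference between the two expressions, here is a minimal sketch in Python's `re`, used only as a stand-in for ACL's gsub (the engines may not behave identically). The attribute-aware pattern is written with `[^']*` in the single-quoted branch so that values of any length match; the sample string is made up.

```python
import re

SIMPLE = r"<[^>]+>"
# Quoted attribute values are consumed as a unit, so a ">" inside
# them cannot end the tag early.
ATTR_AWARE = r"""<([^=>]|="[^"]*"|='[^']*')+>"""

text = '<a title="x > y">link</a>'

simple_result = re.sub(SIMPLE, " ", text)
aware_result = re.sub(ATTR_AWARE, " ", text)

print(repr(simple_result))
print(repr(aware_result))
```

The simple pattern stops at the ">" inside the attribute value and leaves part of the tag behind; the attribute-aware one removes both tags cleanly.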

    -Brandon 🙂


    1-Visitor
    June 11, 2010
    Good point, Steve. Andy did actually say "start tags", though I
    guessed that this was meant to include all tags, including end tags
    and empty tags. If it should really only match start tags, then I
    agree that adding "/" inside the first set of square brackets in
    either of the expressions I suggested would accomplish this (the first
    of which would then be almost identical to Steve's recommendation).

    -Brandon 🙂


    1-Visitor
    June 11, 2010
    Great catch on the attribute value possibility, Brandon.

    This is such a good example of not having the actual requirement defined in sufficient detail before trying to write the code. The algorithm just can't be right unless or until the details are all captured. Once they are, writing the algorithm (mentally or actually) makes the code almost write itself.

    Regards,
    Steve Thompson
    +1(316)977-0515
    1-Visitor
    June 11, 2010
    Adrian's expression is essentially the same as the first one suggested by Brandon. The '<' in "[^<>]" is unneeded, since the expression cannot reach one of the '<'s without first encountering a '>'. It will therefore terminate the match before it reaches that '<'. I don't believe having that '<' in the exclusion would hurt anything, but it makes the expression less readable, in my opinion.

    The fewer characters to parse in trying to read a regex, the better the readability. Again, in my opinion. 🙂
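
The two forms can be compared directly. The sketch below again uses Python's `re` as a stand-in for ACL's gsub (an assumption) with invented inputs. On well-formed markup they agree; they differ only when a stray, unescaped '<' appears in the text, which well-formed XML does not allow.

```python
import re

WITHOUT_LT = r"<[^>]+>"
WITH_LT = r"<[^<>]+>"

well_formed = "<p>1 <b>2</b></p>"
stray = "a < b <c> d"  # hypothetical input with an unescaped "<"

same_a = re.sub(WITHOUT_LT, " ", well_formed)
same_b = re.sub(WITH_LT, " ", well_formed)

diff_a = re.sub(WITHOUT_LT, " ", stray)  # match starts at the stray "<"
diff_b = re.sub(WITH_LT, " ", stray)     # only "<c>" is matched

print(same_a == same_b)
print(repr(diff_a), repr(diff_b))
```

So the shorter class is enough for clean input, while the extra '<' only matters if the string may contain literal '<' characters.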

    My Two Cents (if that much),
    Steve Thompson
    +1(316)977-0515
    1-Visitor
    June 11, 2010
    BTW, here's a good online resource re' Regex:

    18-Opal
    June 11, 2010
    I also recommend the little O'Reilly book, Regular Expressions Pocket Reference (
    16-Pearl
    June 11, 2010
    Oooh I love regular expressions! Out of curiosity, does anyone know if
    ACL implements any non-greedy versions of "+" or "*"?

    -Gareth

    18-Opal
    June 12, 2010
    I don't think so. At least it doesn't seem to support the usual "?"
    modifier.