1-Visitor
March 11, 2015
Question

Anyone validating web links from Arbortext? Or "helping" somehow with Arbortext?

  • March 11, 2015
  • 7 replies
  • 2748 views
Just curious. Authors are asking if there is anything Arbortext can do to
help. Before I start thinking too hard, I thought I'd check with you all.

--
Paul Nagai

    7 replies

    16-Pearl
    March 11, 2015
    Hi Paul,

    There is nothing built-in for this, but there are a few approaches I've seen people take:
    1. Validate links at publish.
    2. Validate links at review.
    3. Validate links via redirector.

    #1 basically means PE would check the links as it publishes. This can slow the publish process down, unless you have a validation cache to check against.

    #2 means add a function that can be manually triggered by reviewers (editors?) to check all links in the current document while open in an Editor session.

    #3 means all links that you publish are actually links to your own link redirector service (think tinyurl.com). Your link redirector service is now a single point of failure for your users, however it does give you control over where their links end up. If a broken link is found, you change it on your redirector service and all users will automatically forward to the new location without having to re-publish to them. Also you can do sneaky things like check what users are clicking to what links etc.
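For what it's worth, the core of approach #3 can be tiny. A minimal sketch in Python (the paths and destination URLs are made up; a real redirector would persist the table and probably log clicks):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical table mapping published short paths to real destinations.
# Fixing a broken link means editing this table; nothing is re-published.
REDIRECTS = {
    "/docs/install": "https://example.com/guides/installation",
    "/docs/api": "https://example.com/reference/api-v2",
}

def resolve(path):
    """Return the destination for a published short path, or None if unknown."""
    return REDIRECTS.get(path)

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = resolve(self.path)
        if target:
            self.send_response(302)           # send the user onward
            self.send_header("Location", target)
        else:
            self.send_response(404)           # a genuinely broken short link
        self.end_headers()

# To run: HTTPServer(("localhost", 8080), RedirectHandler).serve_forever()
```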

    There might be other ways to crack this nut, but those are three common approaches I've seen.

    // Gareth Oakes
    // Chief Architect, GPSL
    // www.gpsl.co
    18-Opal
    March 12, 2015

Not sure this will help, but we create an RDS (resolved document for styling) of the document, then run a completeness check on it to verify that the links in the document function.

    naglists (1-Visitor, Author)
    March 14, 2015
    Thanks to all who responded. I've got some good ideas to work with here.
    For some of them, I understand all of the requirements from start to
    finish; I'll dive in on those.

    One idea I have ... not sure how hard it would be yet to do ... deals with
    web links that validate via doc_open(). Authors want to judge whether it
    opens successfully to the right content, not just that it opens from an
    html perspective. I'd love to present a two column report for web links
    that are successful from the doc_open() perspective. First column is a
    clickable link back to the link element within the source document while
    the second column is a thumbnail of the web page itself giving the author
    some visual feedback without having to open every link to see what "good"
    result is being returned.
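A rough sketch of that two-column report (the editor:// URI scheme and the thumbnail paths are invented for illustration; the real back-link would need to target the link element inside the Editor session, and the thumbnails would be captured separately, e.g. by a headless browser):

```python
from html import escape

def report_rows(links):
    """links: list of (element_id, url, thumb_path) tuples.
    Column 1 links back into the source document; column 2 shows a
    pre-captured thumbnail of the target page."""
    rows = []
    for elem_id, url, thumb in links:
        rows.append(
            "<tr>"
            f'<td><a href="editor://open?id={escape(elem_id)}">{escape(url)}</a></td>'
            f'<td><img src="{escape(thumb)}" width="200" alt="preview"></td>'
            "</tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```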

    By the way, I am definitely focused on #2, validating at author review,
    although I have contemplated looking at PDF link validators. But then the
    report/errors are so far from authoring that it seems of limited value,
    except maybe as a last-chance production-check sort of validation.

    And ... a redirector! Wow. Hadn't thought of anything along those lines. I
    don't think we have enough web links to make that worth the effort. Am I
    overestimating the overhead of constructing and maintaining that or do you
    agree that there is some largish number of web links where that becomes
    more efficient?

    Hey, if you haven't started already, have a good weekend everyone!

    Thanks again.

    16-Pearl
    March 15, 2015
    Hi Paul,

    Firstly, yes, I've seen workflows where you have a link validation step which has "screenshots" of the link targets for rapid checking. I think this works pretty well and is nice and easy for the reviewers. Something like PhantomJS might be the easiest way to implement this.
    naglists (1-Visitor, Author)
    March 25, 2015
    So I now have two web link validation strategies available for our authors.
    Well, it's in UAT, really, and one complication will require some deeper
    analysis of the validations' performance against real document links.

    Phase 1: The author runs an ACL-driven validation against web links in an
    Editor session. It is currently separate from a similar validation we were
    already doing for our intra-document links. But that may change in the
    future when we understand the web link testing better. The validation
    examines each web link and attempts a doc_open(). Anything that opens is
    set valid="yes". Anything that fails to open is set valid="no". Mailto:
    links are not tested (doc_open() against a mailto link opens an email in,
    for me, Outlook) and are set valid="manual". Any link set by an author (or
    a previous validation) as "manual" is ignored. I store the validation date
    on each link. (Those all refer to "setting" attributes on the link element,
    if that wasn't clear.) Oh yes, I store each URL in an array while I'm
    processing links and check to see if I've validated a URL already. If I
    have, I skip re-testing the subsequent occurrences of a URL (which are not
    set to any valid= state nor are they dated ... still thinking about this
    one). Finally, I open a window with the URLs that don't validate to allow
    authors double-click navigation to "bad" links in the Edit window.
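The Phase 1 loop, sketched in Python for illustration only (the real implementation is ACL inside an Editor session; check_url here stands in for the doc_open() attempt, and log for the crash-forensics log file):

```python
import datetime

def validate_links(links, check_url, log):
    """Validate web links in place.

    links: list of dicts with an 'href' key and an optional 'valid' key.
    check_url(url) -> bool stands in for the doc_open() attempt.
    log(url) records each URL *before* testing, so if the editor crashes
    the last logged URL identifies the culprit link.
    """
    seen = set()
    failures = []
    today = datetime.date.today().isoformat()
    for link in links:
        url = link["href"]
        if link.get("valid") == "manual":       # author opted out of testing
            continue
        if url.lower().startswith("mailto:"):   # doc_open() would launch email
            link["valid"] = "manual"
            continue
        if url in seen:                         # already tested this URL; skip,
            continue                            # leaving valid/date unset
        seen.add(url)
        log(url)
        if check_url(url):
            link["valid"] = "yes"
        else:
            link["valid"] = "no"
            failures.append(url)
        link["checkdate"] = today
    return failures
```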

    Things I have noted:
    1) The first validation in an Editor session takes a while as one would
    expect. Subsequent validations are very quick. Something is being omitted
    or cached or ... something. I haven't thought about this too deeply yet.
    2) Parser errors are sometimes thrown; some of them I can sometimes
    suppress.
    3) Some (well, one, at least) sites that are up do not validate. Amazon,
    specifically. Not sure what they're doing that fails the doc_open().
    Hopefully further testing of real links shows this to be rare and we can
    largely discount false negatives.
    4) Authors need the ability to set some links valid="manual" so that I do
    NOT attempt to validate them. I found one link that causes a signal 11
    (segmentation fault) Editor exit. Kerboom! Because of this, I force a
    save at the beginning of the
    validation AND I write each URL to a log before I test it. That way, the
    author will know exactly which link killed Editor, should they run into
    additional such links. (I'd like to submit this to PTC but since it is an
    internal URL and I'm not sure what it is about that page/site that's
    causing the problem, I'm not sure I will be able to.)

    Phase 2: Author requests an HTML document containing all of the links for a
    click-test. This allows authors to periodically validate that a URL is
    still contextually valid (not just that a host, page, or file is still
    there which is the most the previous validation can do). I end up having to
    chain XSL transformations to support sorting the URLs and removing the
    duplicates. The first puts all of the link tags into a single XML "topic."
    The second sorts by URL and removes duplicates. I couldn't (figure out how
    to) do the sort/duplicate removal in the first stylesheet since xsl:key (or
    my implementation of it) assumed the link tags were all siblings and the
    result was mostly document order, not alpha by URL. I'm still trying to see
    how much useful information I can pass all the way through that process.
    Contemplating some context (the para owning the link? not sure I can
    actually do this one with my current process). Contemplating an occurrence
    count (so authors have an idea of how many they'll have to find/replace if
    an update is needed).
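For comparison, the sort/dedupe/occurrence-count step is compact when sketched in Python (illustrative only; the production pipeline is the chained XSLT described above):

```python
from collections import Counter
from html import escape

def click_test_page(urls):
    """Return an HTML list of unique URLs, sorted alphabetically, each with
    an occurrence count so authors know how many places an update touches."""
    counts = Counter(urls)
    items = [
        f'<li><a href="{escape(u)}">{escape(u)}</a> ({counts[u]} occurrence(s))</li>'
        for u in sorted(counts)
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```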

    Once we get a better understanding of how likely Amazon-like false
    negatives are, what we find in our click-tests, etc., it's possible I will
    use the valid= state in some fashion to reduce the number of links output
    for a click-test.

    Anyhow ... thanks for all your thoughts. The feedback I'm getting from the
    authors is really positive. They're looking forward to using this in the
    wild.


    16-Pearl
    March 26, 2015
    Hi Paul,

    Thanks for sharing. It sounds like you are making some good progress.

    The one thing that stuck out to me was your Amazon problem. It occurs to me that doc_open() will be trying to download a file over HTTP. The HTTP protocol also supports responses like 301 Moved Permanently, where the web server needs the browser (or in this case Arbortext Editor) to do something further, such as follow the Location header, in order to find the requested resource.
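One way to probe that theory, sketched in Python (the BROWSER_UA string and the HEAD-request approach are assumptions on my part; some servers reject HEAD and would need a GET fallback):

```python
import urllib.error
import urllib.request

# Assumed browser-like User-Agent; some live sites appear to reject the
# default Python client string, which may explain false negatives.
BROWSER_UA = "Mozilla/5.0 (compatible; link-checker)"

def build_request(url):
    return urllib.request.Request(
        url, headers={"User-Agent": BROWSER_UA}, method="HEAD"
    )

def check_url(url, timeout=10):
    """Return (ok, status, final_url). 301/302 redirects are followed
    transparently by urlopen; final_url shows where the link ended up."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return True, resp.status, resp.geturl()
    except urllib.error.HTTPError as e:         # 4xx/5xx after any redirects
        return False, e.code, url
    except (urllib.error.URLError, ValueError, OSError):
        return False, None, url
```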
    naglists (1-Visitor, Author)
    March 26, 2015
    Good ideas, Gareth. Thanks. (Also, I really liked the web page image
    generator you shared in an earlier post. For reasons of National Security
    [joking, mostly], getting it approved for in-house use probably wouldn't
    be worth the effort.) I'll keep these ideas in the hopper, pending how
    happy the authors are (and how long that happiness lasts) with the
    zero-support-to-some-support moves we're making right now.
