anyone validating web links from Arbortext? Or "helping" somehow with Arbortext?

naglists
1-Newbie

Just curious. Authors are asking if there is anything Arbortext can do to
help. Before I start thinking too hard, I thought I'd check with you all.

--
Paul Nagai
7 REPLIES

Hi Paul,

There is nothing built-in for this, but there are a few approaches I've seen people take:
1. Validate links at publish.
2. Validate links at review.
3. Validate links via redirector.

#1 basically means PE would check the links as it publishes. This can slow the publish process down, unless you have a validation cache to check against.

#2 means adding a function that can be manually triggered by reviewers (editors?) to check all links in the current document while it is open in an Editor session.

#3 means all links that you publish are actually links to your own link redirector service (think tinyurl.com). Your link redirector service is now a single point of failure for your users; however, it does give you control over where their links end up. If a broken link is found, you change it on your redirector service and all users are automatically forwarded to the new location without having to re-publish to them. You can also do sneaky things like check which users are clicking which links.
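
For illustration only, here is a minimal sketch of what such a redirector could look like, using just the Python standard library (the port, the /r/ path convention, and the mapping contents are invented for the example, not from any real deployment):

```python
# Minimal link-redirector sketch: published links point at /r/<key> on this
# service, which 302-redirects to whatever the current target URL is.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Editable mapping from short keys to the current destination URLs.
REDIRECTS = {
    "widget-guide": "https://example.com/docs/widget-guide",
}

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.removeprefix("/r/")
        target = REDIRECTS.get(key)
        if target:
            # A logging call here would give the "who clicked what" analytics.
            self.send_response(302)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404, "Unknown link key")

if __name__ == "__main__":
    # Fixing a broken link only requires updating REDIRECTS, not re-publishing.
    HTTPServer(("", 8080), RedirectHandler).serve_forever()
```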

There might be other ways to crack this nut, but those are three common approaches I've seen.

// Gareth Oakes
// Chief Architect, GPSL
// www.gpsl.co

Not sure this will help, but we create an RDS of the document, then run a completeness check on it to verify that the links in the document work.

Thanks to all who responded. I've got some good ideas to work with here.
For some of them I can see all of the requirements from start to finish;
I'll dive in on those.

One idea I have ... not sure how hard it would be to do yet ... deals with
web links that validate via doc_open(). Authors want to judge whether a link
opens successfully to the right content, not just that it opens from an
HTML perspective. I'd love to present a two-column report for web links
that are successful from the doc_open() perspective: the first column is a
clickable link back to the link element within the source document, while
the second column is a thumbnail of the web page itself, giving the author
some visual feedback without having to open every link to see what "good"
result is being returned.

By the way, I am definitely focused on #2, validating at author review.
I have contemplated looking at PDF link validators, but the report/errors
land so far from authoring that they seem of limited value, except maybe
as a last-chance production check sort of validation.

And ... a redirector! Wow. I hadn't thought of anything along those lines. I
don't think we have enough web links to make that worth the effort. Am I
overestimating the overhead of constructing and maintaining one, or do you
agree that there is some largish number of web links at which that approach
becomes more efficient?

Hey, if you haven't started already, have a good weekend everyone!

Thanks again.


Hi Paul,

Firstly, yes I've seen workflows where you have a link validation step which has "screenshots" of the link targets for rapid checking. I think this works pretty well and is nice and easy for the reviewers. Something like PhantomJS might be the easiest way to implement this (
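
As a rough illustration of that screenshot step, here is a sketch using Selenium with headless Chrome as a stand-in for PhantomJS (the function name, URLs, and output directory are invented for the example):

```python
# Capture a thumbnail of each link target so reviewers can eyeball them.
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def capture_thumbnails(urls, out_dir="thumbs"):
    """Save one screenshot per link target for rapid visual checking."""
    os.makedirs(out_dir, exist_ok=True)
    opts = Options()
    opts.add_argument("--headless")
    driver = webdriver.Chrome(options=opts)
    try:
        for i, url in enumerate(urls):
            driver.get(url)
            driver.save_screenshot(os.path.join(out_dir, f"link_{i:03d}.png"))
    finally:
        driver.quit()

capture_thumbnails(["https://www.ptc.com", "https://www.example.com"])
```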

So I now have two web link validation strategies available for our authors.
Well, it's in UAT, really, and one complication will require some deeper
analysis of the validations' performance against real document links.

Phase 1: The author runs an ACL-driven validation against web links in an
Editor session. It is currently separate from a similar validation we were
already doing for our intra-document links. But that may change in the
future when we understand the web link testing better. The validation
examines each web link and attempts a doc_open(). Anything that opens is
set valid="yes". Anything that fails to open is set valid="no". Mailto:
links are not tested (doc_open() against a mailto link opens an email in,
for me, Outlook) and are set valid="manual". Any link set by an author (or
a previous validation) as "manual" is ignored. I store the validation date
on each link. (Those all refer to "setting" attributes on the link element,
if that wasn't clear.) Oh yes, I store each URL in an array while I'm
processing links and check to see if I've validated a URL already. If I
have, I skip re-testing the subsequent occurrences of a URL (which are not
set to any valid= state nor are they dated ... still thinking about this
one). Finally, I open a window with the URLs that don't validate to allow
authors double-click navigation to "bad" links in the Edit window.
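
The real implementation is ACL running inside an Editor session; the sketch below only mirrors that bookkeeping in Python, with a plain HTTP request standing in for doc_open() and dictionaries standing in for the link elements:

```python
import datetime
import urllib.request

def validate_links(links):
    """links: list of dicts with 'url' and 'valid' keys, standing in for link elements."""
    seen = set()        # URLs already tested in this run; duplicates are skipped
    failures = []
    today = datetime.date.today().isoformat()
    for link in links:
        url = link["url"]
        if link.get("valid") == "manual" or url.lower().startswith("mailto:"):
            link["valid"] = "manual"          # never attempt these
            continue
        if url in seen:                       # subsequent occurrence: skip re-testing
            continue
        seen.add(url)
        try:
            urllib.request.urlopen(url, timeout=10)   # stand-in for doc_open()
            link["valid"] = "yes"
        except Exception:
            link["valid"] = "no"
            failures.append(url)
        link["checked"] = today               # validation date stored on the link
    return failures                           # feeds the "bad links" window
```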

Things I have noted:
1) The first validation in an Editor session takes a while as one would
expect. Subsequent validations are very quick. Something is being omitted
or cached or ... something. I haven't thought about this too deeply yet.
2) Parser errors are sometimes thrown; some of them I can suppress, some of
the time.
3) Some (well, one, at least) sites that are up do not validate. Amazon,
specifically. Not sure what they're doing that fails the doc_open().
Hopefully further testing of real links shows this to be rare and we can
largely discount false negatives.
4) Authors need the ability to set some links valid="manual" so the
validation will NOT attempt them. I found one link that causes a signal 11
(segmentation fault) Editor exit. Kerboom! Because of this, I force a save at
the beginning of the validation AND I write each URL to a log before I test
it (see the sketch below). That way, the author will know exactly which link
killed Editor, should they run into additional such links. (I'd like to
submit this to PTC, but since it is an internal URL and I'm not sure what it
is about that page/site that's causing the problem, I'm not sure I will be
able to.)
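
The write-before-test pattern from note 4, reduced to a few lines of Python for clarity (the function and file names are placeholders, not the actual ACL code):

```python
import os

def test_with_crash_log(urls, check, log_path="link_check.log"):
    """Append each URL and flush it before testing, so if the test crashes
    the process, the last line of the log names the offending link."""
    with open(log_path, "a") as log:
        for url in urls:
            log.write(url + "\n")
            log.flush()
            os.fsync(log.fileno())   # make sure the line is on disk first
            check(url)               # the call that might take Editor down
```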

Phase 2: Author requests an HTML document containing all of the links for a
click-test. This allows authors to periodically validate that a URL is
still contextually valid (not just that a host, page, or file is still
there, which is the most the previous validation can do). I end up having to
chain XSL transformations to support sorting the URLs and removing the
duplicates. The first puts all of the link tags into a single XML "topic."
The second sorts by URL and removes duplicates. I couldn't (figure out how
to) do the sort/duplicate removal in the first stylesheet since xsl:key (or
my implementation of it) assumed the link tags were all siblings and the
result was mostly document order, not alpha by URL. I'm still trying to see
how much useful information I can pass all the way through that process.
I'm contemplating adding some context (the para owning the link? Not sure I
can actually do that one with my current process) and an occurrence count
(so authors have an idea of how many instances they'll have to find/replace
if an update is needed).
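
Just to make the transformation concrete, here is what the second stage does (sort by URL, drop duplicates), plus the contemplated occurrence count, expressed as a few lines of Python rather than XSLT:

```python
from collections import Counter

def dedupe_and_sort(urls):
    """Collapse duplicate URLs, keep a count per URL, and sort alphabetically."""
    counts = Counter(urls)
    return sorted(counts.items())     # [(url, occurrence_count), ...]

rows = dedupe_and_sort([
    "https://b.example.com/page",
    "https://a.example.com/page",
    "https://b.example.com/page",
])
for url, n in rows:
    print(f"{url}\t{n}")              # feeds the click-test HTML document
```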

Once we get a better understanding of how likely Amazon-like false
negatives are, what we find in our click-tests, etc., it's possible I will
use the valid= state in some fashion to reduce the number of links output
for a click-test.

Anyhow ... thanks for all your thoughts. The feedback I'm getting from the
authors is really positive. They're looking forward to using this in the
wild.



Hi Paul,

Thanks for sharing. It sounds like you are making some good progress.

The one thing that stuck out to me was your Amazon problem. It occurs to me that doc_open() will be trying to download a file over HTTP. The HTTP protocol also supports responses like 301 Moved Permanently, which are cases where a web link needs the browser (or, in this case, Arbortext Editor) to do something further in order to find the requested resource (see here:
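
As a quick way to see whether a redirect is involved for a given URL, something like the following can help (plain Python, not part of the Arbortext setup; urllib follows redirects automatically, so comparing the final URL with the requested one exposes a move):

```python
import urllib.request

def check_redirect(url):
    """Return the final status and URL after any redirects have been followed."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        final_url = resp.geturl()
        if final_url != url:
            print(f"{url} redirected to {final_url}")
        return resp.status, final_url
```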

Good ideas, Gareth. Thanks. (Also, I really liked the web page image
generator you shared in an earlier post. For reasons of National Security
[joking, mostly], getting that approved for in-house use probably wouldn't
be worth the effort.) I'll keep these ideas in the hopper, pending how happy
we are (and how long that happiness lasts) with the zero-support-to-some-support
moves we're making right now.
