
PDF2WORD - Does it worth?

unknown1
While our SGML to Word/RTF conversion process is under development, we're searching for short term alternatives.
Can somebody share any experience or comments regarding PDF to Word/RTF conversions?

Hello Antonio,


We are using PDF Converter Professional 3 from
Scansoft.




Results are VERY good although the styles that are created are not our
standard stylesheets.


Some scrubbing of the stylesheets is required for compliance with
our internal standards.


Hank Pelletier

General Dynamics




At 08:11 AM 6/20/2005, you wrote:

>While our SGML to Word/RTF conversion process is under development, we're
>searching for short term alternatives.
>Can somebody share any experience or comments regarding PDF to Word/RTF
>conversions?

Can you get to XML and XSL-FO output? I've found a new tool that takes FO
output and converts it to Word, and it seems to do a very good job. You don't
end up with styles but you do end up with an editable and useful Word
document. See xmlmind.com. This is for folks that don't want to maintain
two sets of stylesheets to convert their XML (one for XML and one for
MS-XML styles).

Many of the PDF converters that I tried gave you a document that looked and
printed correctly, but it was set in individual text blocks that were
impossible to work with.

..dan
---------------------------------------------
Danny Vint

Specializing in Panoramic Images of California and the West
http://www.dvint.com

voice: 510-522-4703





ABBYY FineReader and ABBYY Transport also work very
well.

Kerry Kavanaugh










To borrow from Dan's comment below.



"but you do end up with...(a) useful Word document."
Do I see a contradiction in terms here? "Useful"
and "Word" used together???? I think I'd prefer a stone
tablet and chisel thank you.



Lynn

Would
that be a *useful* stone tablet and *useful* chisel?




Hi,

One feature of Acrobat that many people miss is that when you have a
PDF open and you choose View/Continuous - Facing, you then see a dual
page layout. If you use the text selection tool you can select all of
the text in the document (Ctrl-A) and copy the text out to Word. As an
aside, while you do get the entire text rather than blocks, I noticed
that the export to RTF gives you pretty much the same content, adding
unwanted CRs in the middle of sentences and so on. It appears to
preserve bolding, some fonts, etc.
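
For anyone cleaning up text pasted out of a PDF, the unwanted CRs can often be removed mechanically. The following is only a minimal sketch in Python, and it assumes that real paragraph breaks survive as blank lines in the pasted text, which is not guaranteed for every PDF. The function name `unwrap_paragraphs` is made up for illustration.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines are treated as paragraph breaks."""
    # Normalise Windows (CR LF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate real paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    joined = []
    for para in paragraphs:
        lines = [line.strip() for line in para.split("\n") if line.strip()]
        if lines:
            joined.append(" ".join(lines))
    return "\n\n".join(joined)
```

If the pasted text has no blank lines at all, this collapses everything into one paragraph, so the result still needs a read-through.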

Greg

>>> dvint@dvint.com 06/20/05 12:28PM >>>


Well I think the chisel would be the
more 'useful' item. Without it, the stone tablet would only be 'useful'
as just another heavy weight to drop on someone's toe to get their attention.




Lynn


Greg,



That does work, but the other thing
the select all does is select and copy EVERYTHING, including headers, footers,
page numbers, footnotes, etc. So if there is a paragraph that breaks
across a page, you have to remove all the overhead. For page numbers,
header and footer, maybe something fairly easy to do. For footnotes
or other unique things, that aren't always there, a bit more problematic.
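
The page number, header and footer part of that cleanup can be roughed out in a script. This is a hedged sketch in Python, not a tested production tool: it assumes the text is available one string per page (for example, by splitting on form feed characters if your extractor emits them), and the function name `strip_running_lines` and the 60% threshold are arbitrary choices for illustration. Footnotes are not handled.

```python
import re
from collections import Counter

# Matches a line that is only a page number, e.g. "12" or "Page 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)


def strip_running_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on most pages and bare page numbers."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    # A line seen on at least this many pages is treated as a header/footer.
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [
            ln for ln in page.splitlines()
            if ln.strip() not in repeated and not PAGE_NUMBER.match(ln)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Note that a genuine content line consisting only of a number (a table cell, say) would also be dropped, so the output should be checked against the PDF.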




In addition, paragraphs copied this
way also have a CR at the end of each line (not paragraph). So in
that respect, the RTF and the select all are not much different. Now
fonts, bolds, underlines, paragraph and list numbers, and such do come across.
Indents, graphics (a text box labeled a figure did pop up in the
middle of the flowing text), tabs, and other formatting are lost.




I took a 139 page PDF file, a MIL-HDBK
(next worst thing to a TM for strange rules) and selected the entire document
(CTRL-A). I got all my headers, footers and page numbers. Page
numbers DID NOT appear in the copied Word file in the same place in the
text as they did in the PDF (???). Tables, well let's just say the
information is there and leave it at that. Text boxes labeled as
figures did come through, but with no formatting. Actual graphics did
not.




Now I will say that I do not have a
full blown Acrobat installation on the machine I did this test with. I
cannot save as RTF to see the differences. I will check this when
I get back to the home machine (really rough when you have better software
on your personal machine than your work machine).




From what I've seen, without going to
specific applications designed to reverse engineer PDF, there is no good,
simple, easy way to get there. Though PDF can do many things, to me
the basis of PDF is you get your paper without the paper.




Lynn


Hi Lynn,

Yes, I found the same ugly things when I looked at the output from pdf.
What you get from the pdf by copying text or saving to RTF is far from
completely usable unless you apply some effort to cleaning up the
results. What I would do is ask for the legacy source document rather
than start my production from a pdf.

Please take what I said in my previous message in that light: I consider
trying to get text out of a PDF an "act of desperation".

As I see it, I would only try extracting content from a PDF if I had a
small amount of text that wasn't convenient to paste into a single
element. I'd copy the text into a text file, and then import the text
file, tabbing down and selecting only what I wanted, and placing the
required content into the various elements. The awkward CRs at the ends
of lines are not so much an issue since you can add content from several
lines into a single element. It's a bit labour intensive, but you do get
"control" back and you don't have to use "Word".

As for large scale conversions, well I wouldn't start from a pdf.

Greg


Regarding the relative utility of PDF to Word Tools:

In a mixed environment where the Publications Support groups are required to provide emergency recovery of text from documents provided from outside company control, PDF to Word tools are very useful.

When inheriting legacy documents from other divisions who may have used arcane word processing applications but at least produced legacy PDF print files, the relative cost of recovering content is low.

The key COST question is always:

Is using a PDF to Word tool cheaper than rekeying from scratch?

PDF Conversion/extraction is ALWAYS cheaper.

Hank Pelletier
General Dynamics


Thanks for all the input. That was very kind of you.
Every "off-the-shelf" conversion I've tried was a disaster. I'll try some of the options you have listed.
I'll do some analysis on your answers and contact each one individually.
Thank you very much.


Hear hear. 🙂



Unfortunately, I have been involved
in a couple of major conversions where the only data available was the
PDF. Of course, there were other problems with the conversion:




1. The owners of the data were
NOT involved in (hell not even AWARE of) the conversion.


2. In at least two instances with
number 1, the TM being converted was being changed by the owner (DUH or
is that DUMB).


3. The conversion was changing
style, format AND structure (going from a chapter/section to work packages),
thus a revision.


4. With number 1, no original
source data available.

5. With number 1, the converted manuals, in XML, were not being updated
in the new format nor in XML after the conversion.


6. No subject matter expert to
deal with any problems or errors found during the conversion.




Basically if there was anything that
could have been done to make the conversion easier, it wasn't done. So
PDF was amongst the least of the problems.




Hank, your comments are also valid,
even in the case above. You need to have the owners involved and
aware of the conversion. There are several conversion houses that
do a fairly good job of converting PDF to other output. Internally,
unless you have lots of PDF to convert, brute force is probably the best
way to go.




Lynn

There's a (small) number of companies that supply technology *and* services that work. As you noted, "off-the-shelf" conversion is usually a disaster, but much better results can be achieved if you can justify the expense and effort required.

As an aside, PDF documents can be structured and tagged, i.e., the content can be tagged in a way similar to XML. Tagged PDF documents can be edited, reflowed and exported with fairly good results. The catch is how to get tagged PDFs in the first place. When you create PDFs from recent versions of Microsoft Office and Adobe FrameMaker, InDesign and PageMaker, the resulting PDFs are tagged. For unstructured PDFs, the export can often be somewhat improved by running the 'Make Accessible' command first.
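
A quick way to guess whether a given PDF is tagged is to look for the markers a tagged file carries in its document catalog: a `/StructTreeRoot` entry and a `/MarkInfo` dictionary with `/Marked true`. The Python sketch below is a rough heuristic only. It scans the raw bytes, so it will miss files whose catalog is stored in a compressed object stream, and it could be fooled by the same strings appearing elsewhere in the file. The function name `looks_tagged` is made up for illustration.

```python
import re

# "/Marked true" inside the catalog's MarkInfo dictionary flags a tagged PDF.
MARKED_TRUE = re.compile(rb"/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes show both tagged-PDF markers."""
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    has_marked_flag = MARKED_TRUE.search(pdf_bytes) is not None
    return has_struct_tree and has_marked_flag
```

To try it on a real file, read the file in binary mode and pass the bytes in. A `False` result only means the markers were not visible in the uncompressed bytes, not that the file is definitely untagged.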


I didn't see anyone mention BCL Drake: