12-Amethyst
February 11, 2015
Question

Configuration of PDF OCR with Adobe Livecycle Server ES3

We have scanned documents that are not searchable, so we would like to use the OCR functionality of Adobe Livecycle Server when creating viewables. When we enable the OCR capability we receive an error, and the cause is unclear. Does anyone have experience configuring OCR with Adobe Livecycle Server (ES3)?

1 reply

12-Amethyst
April 16, 2025

This would be very valuable to have and would make AEM/PDF Publisher a worthwhile investment. 

14-Alexandrite
April 16, 2025

If you are a bit handy (or search around a bit), you can use Python to extract the images from PDF files.
With PyTesseract (and Tesseract) you can do the OCR yourself and merge the resulting files back into a single PDF.
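
A rough sketch of that approach, not a tested solution. It assumes PyMuPDF, pytesseract and Pillow are installed and that the Tesseract binary is on the PATH. The choice of PyMuPDF for rendering and merging is an assumption (any PDF library would do), and the function names are made up for illustration. It only covers the stand-alone conversion, not hooking it into a publish workflow.

```python
import io
from typing import Callable, Iterable, List


def render_pages(pdf_path: str, dpi: int = 300) -> Iterable[bytes]:
    """Yield every page of a scanned PDF as PNG bytes."""
    import fitz  # PyMuPDF (assumed choice; `dpi=` needs a reasonably recent version)

    with fitz.open(pdf_path) as doc:
        for page in doc:
            yield page.get_pixmap(dpi=dpi).tobytes("png")


def ocr_page(png_bytes: bytes, lang: str = "eng") -> bytes:
    """Return a one-page PDF with a hidden text layer for one page image."""
    import pytesseract
    from PIL import Image

    image = Image.open(io.BytesIO(png_bytes))
    # Tesseract writes the page image plus an invisible, searchable text layer.
    return pytesseract.image_to_pdf_or_hocr(image, extension="pdf", lang=lang)


def merge_pdfs(pages: List[bytes], out_path: str) -> None:
    """Concatenate the one-page PDFs into a single output file."""
    import fitz  # PyMuPDF

    out = fitz.open()  # new, empty PDF
    for pdf_bytes in pages:
        with fitz.open(stream=pdf_bytes, filetype="pdf") as part:
            out.insert_pdf(part)
    out.save(out_path)
    out.close()


def make_searchable(
    src: str,
    dst: str,
    render: Callable[[str], Iterable[bytes]] = render_pages,
    ocr: Callable[[bytes], bytes] = ocr_page,
    merge: Callable[[List[bytes], str], None] = merge_pdfs,
) -> int:
    """OCR every page of `src`, write the result to `dst`, return the page count."""
    pages = [ocr(png) for png in render(src)]
    merge(pages, dst)
    return len(pages)
```

Calling `make_searchable("scan.pdf", "scan_ocr.pdf")` (both file names are placeholders) should leave you with a PDF whose text can be selected and searched. The three steps are passed in as parameters so that any one of them can be swapped for a different library.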

12-Amethyst
April 17, 2025

These things are all trivial in isolation. Doing them as part of the Windchill publish workflow is another matter, however.
PTC explicitly supports the use of AEM, which happens to have OCR out of the box. PTC does not support customisations and extensions, and provides very little documentation on how to modify the publication generation steps like that.

 

Fortunately, Chromium has Pdf-Searchify, which uses Tesseract to OCR PDF images in the browser, but this doesn't result in indexable data for SOLR Server or Windchill Search Preview.