Solved: Re: Is it Possible to extract the pdf content and ...

KSM · ‎Apr 19, 2022

Hello,

I have a requirement to match content from a PDF file against data from an SAP system. To do that, I first want to extract the content of the PDF and store it in variables using ThingWorx services.

If this is possible, please let me know how we can achieve it.

Thanks

KSM

VladimirRosu · ‎Apr 21, 2022

Hi @KSM , it is absolutely possible, but it is not exactly easy.

You have two options:

1. Create a ThingWorx extension that reads the PDF and gives you back the text. Apache PDFBox is one such library you can use, https://pdfbox.apache.org/download.html.

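From the information available online, creating such an extension should be fairly quick. Keep in mind, though, that the result depends on how the information is stored in the PDF: PDFBox reads an embedded text layer but does not perform OCR, so scanned (image-only) pages will not yield any text.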

2. If you have ThingWorx Flow, you can extend the Azure Computer Vision connector (or use it as inspiration) to call the Azure Computer Vision service with a PDF as input and have it perform OCR. Note that the standard connector can already extract text if you give it an image as input. https://support.ptc.com/help/thingworx/platform/r9/en/index.html#page/ThingWorx/Help/Integration_Orchestration/Azure/ComputerVision.html#
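
Whichever option produces the text, the matching step against SAP can be sketched in plain Java. This is only a minimal sketch under assumptions: the class name `PdfTextMatcher`, its method names, and the sample strings are made up for illustration and are not part of ThingWorx or PDFBox. Text extracted from a PDF often has irregular spacing and line breaks, so the sketch normalizes both sides before comparing.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: checks which SAP values occur in text extracted from a PDF.
class PdfTextMatcher {

    // Apply Unicode NFKC normalization (e.g. non-breaking space -> space),
    // collapse whitespace runs including line breaks, trim, and lower-case.
    static String normalize(String s) {
        String n = Normalizer.normalize(s, Normalizer.Form.NFKC);
        return n.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // For each SAP value, report whether it appears in the PDF text.
    // Blank values are reported as not found.
    static Map<String, Boolean> match(String pdfText, List<String> sapValues) {
        String haystack = normalize(pdfText);
        Map<String, Boolean> found = new LinkedHashMap<>();
        for (String value : sapValues) {
            String needle = normalize(value);
            found.put(value, !needle.isEmpty() && haystack.contains(needle));
        }
        return found;
    }

    public static void main(String[] args) {
        // Stand-in for extracted text. With Apache PDFBox 2.x on the classpath
        // it would typically come from something like:
        //   try (PDDocument doc = PDDocument.load(pdfFile)) {
        //       pdfText = new PDFTextStripper().getText(doc);
        //   }
        String pdfText = "Purchase Order:  4500012345\nVendor:\tACME  GmbH\n";

        // Made-up values standing in for data read from SAP.
        List<String> sapValues = Arrays.asList("4500012345", "acme gmbh", "9999");

        System.out.println(match(pdfText, sapValues));
    }
}
```

Inside an extension, the `match` result could be returned from a service as a JSON or InfoTable value. A plain substring check is the simplest possible rule; stricter matching (for example per field, or with regular expressions) may be needed depending on the PDF layout.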

KSM · ‎May 16, 2022

Hi Vladimir,

Thanks a lot for your suggestion.

Is there already a ThingWorx extension that can read a PDF and convert it to text?

Thanks

KSM

VladimirRosu · ‎May 16, 2022

Hi @KSM ,

I cannot say whether you have a ThingWorx extension (you asked if "we have" one), but I assume you wanted to know whether PTC provides such an extension out of the box (OOTB), and in this case the answer is no.

However, be aware that PTC can build such an extension through our professional services. If you want to do this, please contact your salesperson or CSM.

KSM · ‎May 17, 2022

Thanks, @VladimirRosu, for the information.

Is it Possible to extract the pdf content and store in a string variable using Thingworx?
