15-Moonstone

Solved

Is it possible to extract PDF content and store it in a string variable using ThingWorx?

April 19, 2022

Hello,

I have a requirement to match content from a PDF file against content from an SAP system. To do that, I first want to extract the content of the PDF and store it in variables using ThingWorx services.

If this is possible, could you please explain how we can achieve it?

Thanks

KSM

Best answer by VladimirRosu_116627

Hi @KSM, it is absolutely possible, but it is not exactly easy.

You have two options:

1. Create a ThingWorx extension that reads the PDF and returns its text. Apache PDFBox is one library you can use for this: https://pdfbox.apache.org/download.html

From what I found online, building this should be quick. Keep in mind, though, that the result depends on how the information is stored in the PDF, since this library does not perform OCR.

2. If you have ThingWorx Flow, you can extend the Azure Computer Vision connector (or use it as inspiration) to call the Azure Computer Vision service with a PDF as input and run OCR on it. Note that the standard connector can already extract text when you give it an image as input. https://support.ptc.com/help/thingworx/platform/r9/en/index.html#page/ThingWorx/Help/Integration_Orchestration/Azure/ComputerVision.html#
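
For option 1, the extraction step itself might look roughly like the sketch below. It is only an illustration under a few assumptions: it targets PDFBox 2.0.x (where documents are opened with PDDocument.load), the class and method names are hypothetical, and the ThingWorx extension wiring that would expose the method as a service returning a STRING is left out.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper; the class and method names are illustrative only.
public class PdfTextExtractor {

    /**
     * Returns the text layer of a PDF as a single string.
     * Assumes PDFBox 2.0.x; PDFBox 3.x opens documents through
     * org.apache.pdfbox.Loader.loadPDF instead of PDDocument.load.
     */
    public static String extractText(byte[] pdfBytes) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(pdfBytes)) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Order the extracted text by its position on the page
            stripper.setSortByPosition(true);
            // Scanned (image-only) pages have no text layer, so little or
            // nothing comes back for them: PDFBox does not perform OCR.
            return stripper.getText(document);
        }
    }
}
```

Taking the PDF as a byte array rather than a file path is a design choice for the sketch: it keeps the helper independent of where the file lives, so the caller decides how the bytes are read. The returned string can then be compared against the values coming from SAP.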



KSM

Author

15-Moonstone

Hi Vladimir,

Thanks a lot for your suggestion.

Do we already have a ThingWorx extension that can read a PDF and convert it to text?

Thanks

KSM


VladimirRosu_116627

19-Tanzanite

Hi @KSM,

I cannot say whether you have a ThingWorx extension (you asked if "we have"), but I assume you wanted to know whether PTC provides such an extension out of the box (OOTB). In this case, the answer is no.

However, be aware that PTC can build such an extension through our professional services. If you want to go that route, please contact your salesperson or CSM.
