15-Moonstone
April 19, 2022
Solved

Is it possible to extract PDF content and store it in a string variable using ThingWorx?


Hello,

I have a requirement to match content from a PDF file against data from an SAP system. So I first want to extract the content of the PDF and store it in variables using ThingWorx services.

 

If this is possible, could you please explain how to achieve it?

 

Thanks

KSM

 

Best answer by VladimirRosu_116627

Hi @KSM , it is absolutely possible, but it is not exactly easy.

You have two options:

1. Create a ThingWorx extension that reads the PDF and gives you back the text. Apache PDFBox is one such library you can use, https://pdfbox.apache.org/download.html.

From what I found online, creating such an extension should be fairly quick. Keep in mind that the result depends on how the information is stored in the PDF: this library does not do OCR, so it cannot read text that exists only as scanned images.

2. If you have ThingWorx Flow, you can extend the Azure Computer Vision connector (or use it as inspiration) to call the Azure Computer Vision service with a PDF as input and perform OCR on it. Note that the standard connector can already extract text if you give it an image as input. https://support.ptc.com/help/thingworx/platform/r9/en/index.html#page/ThingWorx/Help/Integration_Orchestration/Azure/ComputerVision.html#
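For the matching step from the original question, here is a minimal sketch in Java, the language ThingWorx extensions are written in. It assumes the PDF text has already been extracted into a single string (with option 1, for example via PDFBox's `PDFTextStripper.getText`) and that the SAP values are available as plain strings. The class name `PdfTextMatcher`, its methods, and the field names are hypothetical and only for illustration. Extracted PDF text often has irregular line breaks and spacing, so the sketch normalizes whitespace and case before comparing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: compares SAP values against text extracted from a PDF.
final class PdfTextMatcher {
    private PdfTextMatcher() {}

    // Collapse every run of whitespace (spaces, tabs, line breaks) into one
    // space, trim the ends, and lowercase the result.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // True if the normalized SAP value occurs somewhere in the normalized PDF text.
    // An empty SAP value never counts as a match.
    static boolean containsValue(String pdfText, String sapValue) {
        String needle = normalize(sapValue);
        return !needle.isEmpty() && normalize(pdfText).contains(needle);
    }

    // Returns the names of the SAP fields whose values were not found in the PDF text.
    static List<String> missingFields(String pdfText, Map<String, String> sapFields) {
        String haystack = normalize(pdfText);
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> field : sapFields.entrySet()) {
            String needle = normalize(field.getValue());
            if (needle.isEmpty() || !haystack.contains(needle)) {
                missing.add(field.getKey());
            }
        }
        return missing;
    }
}
```

Calling `PdfTextMatcher.missingFields(pdfText, sapFields)` with the extracted string and a map of SAP field names to values returns the fields that could not be found. This is plain substring matching, so a short value can match inside a longer one; stricter rules (for example word boundaries or per-field patterns) may be needed depending on the documents.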

1 reply

19-Tanzanite
April 21, 2022


KSM (Author)
15-Moonstone
May 16, 2022

Hi Vladimir,

Thanks a lot for your suggestion. 

Do we already have a ThingWorx extension that can read a PDF and convert it into text?

 

 

Thanks

KSM

 

19-Tanzanite
May 16, 2022

Hi @KSM ,

 

I cannot say whether you have such a ThingWorx extension (you asked if "we have" one), but I assume you wanted to know whether PTC provides such an extension out of the box (OOTB), and in this case the answer is no.

However, be aware that PTC can build such an extension through our professional services. If you want to do this, please contact your salesperson or CSM.