
Analytics: Load large (100GB+) data sets

Hello,

 

Loading analytics data with the TWA web GUI works really pleasantly for datasets of, say, 50MB/100K rows (it completes quickly, doesn't hang or crash, gives timely and meaningful feedback, etc.).

 

When you're loading a 100GB data set, things are not so pleasant. The UI stops being your friend, and the workaround of loading the files directly into TWA storage isn't much better: it uses up hundreds of GB of storage on the host OS (which is a pain with virtualisation/networking) and gives feedback only when you go diving through the API.

 

What I'm suggesting is this:

  1. Create UI and services to allow data (in the current import format, plus the attached example) to be loaded from network storage or databases, e.g. a SQL query, FTP, or SMB share (because we shouldn't be fiddling with files on the OS drive of an application server).
  2. Infer data types from a subset of the file to be loaded, e.g. only use the first 100K rows to infer data types, as it may take 12+ hours to scan through the entire file or query result.
  3. Provide status of the load:
    1. Is it working?
    2. Where's it up to?
    3. Can I stop it?
    4. Can I see a preview of the sort of data being loaded?
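Point 2 could look roughly like the sketch below (plain Python with pandas, standing in for whatever TWA does internally; `infer_dtypes`, `load_in_chunks`, and the sample size are all hypothetical, not anything TWA ships):

```python
import pandas as pd

SAMPLE_ROWS = 100_000  # infer types from the first 100K rows only

def infer_dtypes(path: str) -> dict:
    """Read only a sample of the file and return the inferred column dtypes."""
    sample = pd.read_csv(path, nrows=SAMPLE_ROWS)
    return sample.dtypes.apply(lambda dt: dt.name).to_dict()

def load_in_chunks(path: str, chunksize: int = 1_000_000):
    """Stream the full file with the pre-inferred dtypes, chunk by chunk,
    instead of scanning the whole 100GB file twice."""
    dtypes = infer_dtypes(path)
    for chunk in pd.read_csv(path, dtype=dtypes, chunksize=chunksize):
        yield chunk  # hand each chunk to the ingest step
```

The trade-off is the obvious one: if a value outside the sampled rows doesn't fit the inferred type, the load should fail fast (or widen the type) rather than silently corrupt data.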

Thanks!

Mike

5 Comments
MikeFR
8-Gravel

Part of this is a pivot in mindset from using the API to push data in, to having the server take over the task of pulling data in from the network and ingesting it. The GUI is just to set up the job in less than 5 minutes; after that, the GUI's role should just be to show the status of the load and give the option to cancel. This is a job that could run for days, and we shouldn't rely on having a web browser open that long (the browser logs you out with inactivity, corporate updates get pushed and reboot your machine, networks intermittently go down, and browsers throttle high-CPU tabs as a countermeasure to browser crypto-mining).
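The push-to-pull pivot above can be sketched in miniature (plain Python; `IngestJob` and everything in it are hypothetical illustrations of the pattern -- in TWA this would be a server-side service behind a REST status endpoint, not an in-process thread):

```python
import threading

class IngestJob:
    """Minimal sketch of a server-owned pull-ingest job: the client starts it,
    then only polls status or cancels -- no browser needs to stay connected."""

    def __init__(self, total_rows: int):
        self.total_rows = total_rows
        self.rows_done = 0
        self.state = "pending"  # pending -> running -> done/cancelled
        self._cancel = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.state = "running"
        self._thread.start()

    def _run(self):
        batch = 1000
        while self.rows_done < self.total_rows:
            if self._cancel.is_set():
                self.state = "cancelled"
                return
            # stand-in for pulling the next batch from network storage
            self.rows_done = min(self.rows_done + batch, self.total_rows)
        self.state = "done"

    def status(self) -> dict:
        """What the GUI would poll: is it working, and where is it up to?"""
        return {"state": self.state,
                "progress": self.rows_done / self.total_rows}

    def cancel(self):
        self._cancel.set()
        self._thread.join()
```

The point of the shape is that `status()` and `cancel()` are cheap, idempotent calls a GUI (or a script) can make at any time over days, while the long-running work lives entirely on the server side.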

kriswang
5-Regular Member

Just curious, what business problem are you trying to solve that entails a data set that big? Most factory data I have seen was far smaller than that, or could be broken down into smaller sets to focus on one problem.

MikeFR
8-Gravel

Hi Kris,

 

 Thanks for taking an interest in this.

 

I'm doing analytics applications involving: years of data collection x sub-second sampling x 100-5000 parameters per machine.
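To see why those factors multiply into 100GB+ territory, here is a quick back-of-envelope calculation (illustrative figures only, at the modest end of the ranges above, not any customer's numbers):

```python
# Back-of-envelope: years x sub-second sampling x 100s of parameters.
SECONDS_PER_YEAR = 365 * 24 * 3600  # ~3.15e7
years, hz, params = 1, 1, 500       # modest end of the stated ranges
bytes_per_value = 8                 # one float64 per reading

rows = years * SECONDS_PER_YEAR * hz
total_bytes = rows * params * bytes_per_value
print(f"{total_bytes / 1e9:.0f} GB")  # ~126 GB before any overhead
```

One year at 1 Hz with 500 parameters already lands around 126 GB of raw values; sub-second sampling or more parameters pushes well past that.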

 

These assets can have a purchase price above US$25M and/or downtime worth more than US$1M per hour. (I'm being vague and not mentioning any specific customer figures because of NDAs.)

 

I can use less data (and I have tried this), but it leads to less performant models, prediction, detection, analysis, etc.

 

TWA presently gives really good results once the model is built (like weeks of warning before an asset failure) with crazy RoIs. Most of the hard part is done in the platform; it just needs a little more user-friendliness to get data ingested.

 

Cordially,

Mike

olivierlp
Community Manager
Status changed to: Under Consideration
 
olivierlp
Community Manager
Status changed to: Archived

Hello,

We are archiving your idea as part of a general review. This action is based on the age of your idea and the total number of votes received, as per this announcement.

You can always post a new idea with all the details required in the form.

Thank you for your participation.