cancel
Showing results for 
Search instead for 
Did you mean: 
cancel
Showing results for 
Search instead for 
Did you mean: 

When to split a stream in two or more streams

Willie
Alexandrite

When to split a stream in two or more streams

I have a stream that has about 40 fields.  The data in each field is unique so the stream cannot be normalized.  About 2000 records are created each month.  Most of the time, when the stream is queried, not all of the fields are needed.  Should the stream be split into two or more streams?  Will this have any noticeable improvement in performance?  When performing queries, my understanding is that we cannot select the fields to be outputted as in the example below:

https://bi-bigdata.com/2012/09/02/select-vs-select-in-sql-server-query

 

 

11 REPLIES 11

Re: When to split a stream in two or more streams

Assuming you are using a regular Stream vs. ValueStream, I don't think Stream supports querying only specified columns unlike ValueStreams.

If you know that you need certain fields more often than others, splitting can help, especially if you use multiple persistence providers. 

ie one Stream is supported by schema "A" and the other Stream is supported by schema "B"

First advantage though is less total data retrieved and transferred, but if you use two or more persistence providers than the overall table size of the stored stream reduces significantly.

Re: When to split a stream in two or more streams

Would there be any significant performance improvement if there is only one persistence provider?  Would the change in performance be almost negligible?  Is 40 fields a lot for a stream?  There are times when all of the fields need to be displayed.

Re: When to split a stream in two or more streams

If you use one persistence provider, that means all the records will still reside in one and the same table. There is limited indexing on Streams so the advantage will be reduced to nil. I can't say for sure, you would have to test it.

When you do need all the fields, as long as you have a field to allow joining, you can use the very fast in memory infotable service operations to put the table(s) together. The only part that you would have to be aware of  is the amount of data it will represent that needs to be transferred from Server to Client and then displayed.

40 fields isn't necessarily a lot, but if you only need 10 fields all the time and 40 once a month, splitting would make sense. The other part to the calculation would be number of rows.

 

Note that setting up another persistence provider doesn't mean a separate Database server, you can just create a second schema within the same Database server.

Re: When to split a stream in two or more streams

I've been using ThingworxPersistenceProvider for everything.  Is this bad practice?  Should there be several persistence providers?  

Re: When to split a stream in two or more streams

Hi Willie,

 

In the absence of details around your use-case (very granular details around expected response times etc), it is not possible to answer your question with a yes or no. And generally speaking, discussing in so much detail your use-case sometimes is not really doable in a forum format.

 

I can say that generally speaking using a single persistence provider is not a bad practice, but I'm pretty sure that there are use-cases where having a single persistence provider will not offer the best performance in some applications.

 

My advise in your case: try to estimate in numbers your usecase: how many queries will be required to have all the data, how many just a data subset, and also their frequency (how many per hour for example). For example, if 80% of the queries require all the data, then probably splitting in two persistence providers does not make sense, right?

That is: try to understand the load your application generate, and as soon as you have that, I believe it will be much easier to decide if you must split or not the data.

 

Also, I see you are aware of how many rows you generate in one month. I believe testing how the system behaves by generating random data should be relatively simple to do in your use-case (that is you do not need to generate hundreds of millions of rows). In this case you can see how the system really behaves, and make even more informed decisions.

Tags (1)

Re: When to split a stream in two or more streams

Okay, makes sense.  Thank you.

Re: When to split a stream in two or more streams


@PaiChung wrote:

If you use one persistence provider, that means all the records will still reside in one and the same table.


Does this mean data is stored in one table with millions of columns when stored in a persistence provider?  

 

We have Tomcat and Postgres on the same machine.  Should they be on separate machines?  Would having them on the same machine affect performance?

 

Re: When to split a stream in two or more streams

It's millions of rows over time yes.

Streams are basically StreamName/TimeStamp/JSONObject (describes the data according to data shape with value)

 

For production environments, best practice is to indeed have the database server on a separate server.

Re: When to split a stream in two or more streams

Okay, thank you for the clarification.

I understand what streams, value streams, and data tables are.  In your reply, I thought you meant that all stream data are stored in one giant table with millions of columns, meaning a combination of different streams in one giant table with the timestamp as the key in the persistence provider.

 


@PaiChung wrote:

For production environments, best practice is to indeed have the database server on a separate server.

Okay, PTC cloud service put our Tomcat and Postgres on the same VM.  Been trying to figure out why the performance on our servers have been sluggish and how we can improve it.

Re: When to split a stream in two or more streams

Willie, you mentioned million of columns, but can you confirm that this is really what you meant, and not rows?

Regardless, the underlying SQL tables do not have a correspondence of 1:1 with the columns you define in ThingWorx.

If I may, I strongly suggest reading the deployment architecture guide from the ThingWorx Help Center  here. While you are there, you will also see a sizing section that I believe it would help answering your questions.

 

Re: When to split a stream in two or more streams

Yes, I meant columns and not rows.  

 


@VladimirRosu wrote:

Regardless, the underlying SQL tables do not have a correspondence of 1:1 with the columns you define in ThingWorx.


Okay, so a field in a datashape does not correspond to a field in an SQL table in the persistence provider.

 

I'll take a look at the deployment architecture guide.

Announcements

Check out the upcoming Expert Session: Understanding ThingWorx Navigate Licensing in Community "Customer Events" section.