Solved: How to find which stream consumed more space in Database

Velkumar · Aug 12, 2025

Hi all,

We have a ThingWorx instance that has been running for the last 4 years. In the database, streams have consumed 1 TB of data. Is there a way to find out exactly which stream has consumed the most space in the database?

Thanks,

VR

Rocko · Aug 12, 2025

How about naive approaches? They might take a while to run though...

select entity_id,count(*) from stream group by entity_id;

select entity_id,sum(pg_column_size(field_values)) from stream group by entity_id;

Rocko · Aug 12, 2025

The caveat is that this may not directly reflect disk space, because of how the DB organizes its tables, blocks, dead tuples and so on, but it should give an indication of the worst offenders.
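
To see how far the per-stream sums are from the real footprint, the table-level numbers can be read from PostgreSQL directly. This is a minimal sketch, assuming the table is named stream and is reachable through the current search_path (as in the queries above). It reports the whole table, not individual streams.

```sql
-- Assumption: the table is called "stream" and resolves via the search_path.
-- On-disk footprint of the whole table, split into table data and indexes.
SELECT pg_size_pretty(pg_total_relation_size('stream')) AS total,
       pg_size_pretty(pg_table_size('stream'))          AS table_incl_toast,
       pg_size_pretty(pg_indexes_size('stream'))        AS indexes;

-- Live vs. dead tuples as tracked by the statistics collector (estimates).
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'stream';
```

Note that pg_column_size(field_values) only measures the stored (possibly compressed) size of that one column, so row headers, the other columns, indexes and dead tuples all come on top of it.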

Velkumar · Aug 13, 2025

Hi @Rocko

I tried this, but it takes a lot of time. Is there a quicker way to find it?

/VR

Rocko · Aug 13, 2025

Not that I am aware of.

select entity_id,count(*) from stream group by entity_id;

is not using an index, which is why it takes so long.

You could create an index, but that will take time AND disk space. But it would help for the future.
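
Creating such an index might look roughly like the sketch below. The index name stream_entity_id_idx is made up for illustration, and whether the planner then actually uses an index-only scan depends on the PostgreSQL version and on how recently the table was vacuumed, so check the plan before relying on it.

```sql
-- Hypothetical index name; CONCURRENTLY avoids blocking writes while building,
-- but it cannot run inside a transaction block and takes longer than a plain build.
CREATE INDEX CONCURRENTLY IF NOT EXISTS stream_entity_id_idx
    ON stream (entity_id);

-- Refresh the visibility map and planner statistics so an index-only scan is possible.
VACUUM (ANALYZE) stream;

-- Check whether the count query now uses the index.
EXPLAIN
SELECT entity_id, count(*)
FROM stream
GROUP BY entity_id;
```

This can only speed up the count query. The pg_column_size(field_values) query still has to read the rows themselves, because field_values is not part of the index.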

nmutter · Aug 13, 2025

Most obvious ideas (most likely already checked):

- I assume you have a lot of streams in the platform? If you only have around 5, you could probably work out which one it has to be.

- Another idea: use the GetStreamEntryCount service on the Stream-Things; maybe this is faster than the SQL? But it may do the same thing internally (not sure). In any case it only gives a count, not the actual size.
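
If an approximate count per stream is enough, one option that avoids scanning the table is to read the planner statistics PostgreSQL already keeps. This is only a sketch, assuming the table is named stream, that it has been analyzed recently (ANALYZE stream; only reads a sample of the rows), and that no other schema contains a table of the same name.

```sql
-- Estimated rows per entity_id, taken from planner statistics instead of a table scan.
-- Assumption: statistics for stream.entity_id exist and are reasonably fresh.
SELECT m.entity_id,
       m.freq                                    AS fraction_of_rows,
       round((m.freq * c.reltuples)::numeric)    AS est_rows
FROM pg_stats AS s
JOIN pg_class AS c
  ON c.oid = 'stream'::regclass
CROSS JOIN LATERAL unnest(s.most_common_vals::text::text[],
                          s.most_common_freqs) AS m(entity_id, freq)
WHERE s.tablename = 'stream'
  AND s.attname   = 'entity_id'
ORDER BY est_rows DESC;
```

Only the most common values are listed (by default at most 100), so small streams will be missing, and if the table has never been analyzed the numbers are not meaningful. For finding the few biggest streams that is usually not a problem.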

Constantine · Aug 14, 2025

Hello @Velkumar,

You can also apply some heuristics to it. Assuming that all stream records for each given stream have similar structure, and you didn't evolve that structure too much, you can do some Monte-Carlo sampling:

For each Stream ID:
- Generate N random record IDs
- Run @Rocko's second query with a "WHERE id IN (<IDs>)" clause -- this won't take long
- Compute the average size per record from the result
- Multiply it by the count from @Rocko's first query

Sorry, I'm too lazy to write the SQL for that, but you should be able to do it all in a single SELECT.

This should give you a good approximation, and you can even estimate the quality of that approximation by checking how the result converges as you increase N, e.g. from 10 to 100 to 1000.
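
The sampling idea above could be sketched as follows. Note that this swaps in PostgreSQL's TABLESAMPLE clause (available since 9.5) instead of hand-generated random IDs, and scales the sampled sum by the sampling fraction instead of multiplying an average by the exact count, so the slow count query is not needed. Table and column names are taken from the queries earlier in the thread; the 0.1 percent sample size is an arbitrary starting point.

```sql
-- Read roughly 0.1% of the table's blocks and extrapolate per stream.
WITH sampled AS (
    SELECT entity_id,
           pg_column_size(field_values) AS bytes
    FROM stream TABLESAMPLE SYSTEM (0.1)
)
SELECT entity_id,
       count(*)                          AS sampled_rows,
       round(avg(bytes))                 AS avg_bytes_per_row,
       pg_size_pretty(sum(bytes) * 1000) AS est_total  -- 1000 = 100 / 0.1
FROM sampled
GROUP BY entity_id
ORDER BY sum(bytes) DESC;
```

To check convergence, rerun with a larger percentage (for example 1, then 10) and adjust the multiplier to 100 divided by that percentage. SYSTEM picks whole blocks, so rows written together tend to be sampled together; BERNOULLI samples individual rows, but it has to scan the entire table and would be about as slow as the original query.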

/ Constantine


vnamboodheri · Aug 19, 2025

Hello @Velkumar,

It looks like you have some responses from some community members. If any of these replies helped you solve your question please mark the appropriate reply as the Accepted Solution.
Of course, if you have more to share on your issue, please let the Community know so other community members can continue to help you.

Thanks,
Vivek N.
Community Moderation Team.

How to find which stream consumed more space in Database
