EventRouter over capacity

tcoufal · ‎Sep 09, 2019

We have been having problems with event handling lately.

We actively try to avoid subscriptions between different Things (we use me.Event subscriptions as much as we can, so the events do not enter the EventSubsystem at all).

If that is not possible, we use AsyncServices so that the services are handled in separate threads.

Normally we have around 100-200 events (submitedTaskCounts) per second, which I think is not a lot.

Active threads are between 1 and 8.

Lately our system has been suffering from:

2019-09-04 17:12:41.431+0200 [L: ERROR] [O: c.t.s.s.e.EventRouter] [I: ] [U: ] [S: ] [T: Timer-23] CRITICAL ERROR: EventRouter is over capacity - events being rejected Task com.thingworx.system.subsystems.eventprocessing.EventInstance@37cbc1e5 rejected from com.thingworx.common.utils.MonitoredThreadPoolExecutor@121b4a03[Running, pool size = 500, active threads = 500, queued tasks = 200000, completed tasks = 2816461]

Then the whole system basically crashes: no response, just error messages in the application log, and that is all.

There are no major errors leading up to this issue, as far as I can tell. Stopping Tomcat takes roughly half an hour.

I have built a small system that reads information about the event subsystem and logs it into an infotable.

For some reason our unfinished task count just keeps building up. This picture actually shows that the system could recover from whatever was happening. But sometimes it cannot. It reaches the max number of threads and then the whole system freezes. Strangely enough, CPU is at 0 percent, as if the machine were not even trying anymore.
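The rejection message quoted above has the same shape as the one produced by the standard java.util.concurrent.ThreadPoolExecutor when every thread is busy and the bounded queue is full. The sketch below reproduces that state in miniature, assuming (this is not confirmed by the log) that MonitoredThreadPoolExecutor behaves like the JDK class. The class name RejectionDemo, the 2-thread pool, and the 3-slot queue are hypothetical stand-ins for 500 threads and 200000 queue entries. The tasks block on a latch instead of computing, which is one way a pool can be fully "active" while the CPU sits idle.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {

    /** Fills a tiny pool with blocked tasks and returns the rejection message. */
    static String provokeRejection() {
        CountDownLatch release = new CountDownLatch(1);
        // 2 threads and 3 queue slots stand in for 500 threads and 200000 slots.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(3));
        Runnable stuck = () -> {
            try {
                release.await(); // simulates a handler waiting on a slow resource
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        String message = null;
        try {
            for (int i = 0; i < 6; i++) {
                // Tasks 1-2 occupy the threads, 3-5 fill the queue, 6 is rejected.
                pool.execute(stuck);
            }
        } catch (RejectedExecutionException e) {
            message = e.getMessage();
        } finally {
            release.countDown(); // unblock the workers so the pool can drain
            pool.shutdown();
        }
        return message;
    }

    public static void main(String[] args) {
        System.out.println(provokeRejection());
    }
}
```

The printed message reports the pool size, queued tasks, and completed tasks in the same bracketed form as the log line. If the real handlers are similarly waiting on something external rather than working, a full pool with near-zero CPU would be the expected picture.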

The machine is a Windows Server with 8 CPUs and 24 GB of RAM dedicated to the JVM.

PostgreSQL is running in an HA cluster controlled by EFM (on a separate machine).

Do you have any idea what could cause our system to build up its queue so rapidly? Have you ever had such problems?

Strangely enough, when I tried to increase the maximum number of threads, the whole Event Subsystem stopped working completely. No events at all. No error messages either. I changed it back to 500 and restarted Tomcat. Very weird.

Update:

I have increased CorePoolSize to 64 threads and decreased the queue size before opening a new thread to 20000.

Even so, our system is crashing every 24 hours.
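For context on how the core pool size and queue size interact, here is a minimal sketch of the standard java.util.concurrent.ThreadPoolExecutor growth rule: core threads are started first, further tasks are queued, and extra threads up to the maximum are only created once the queue is full. Whether the ThingWorx settings map onto these parameters in exactly this way is an assumption. The class name PoolGrowthDemo and the tiny values (core 1, max 3, queue 2) are hypothetical stand-ins for 64, 500, and 20000.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolGrowthDemo {

    /** Returns the pool size observed after each of four blocked submissions. */
    static int[] poolSizesWhileFilling() {
        CountDownLatch release = new CountDownLatch(1);
        // core = 1, max = 3, queue capacity = 2
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 3, 60L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2));
        Runnable stuck = () -> {
            try {
                release.await(); // keep every worker busy until we release it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        int[] sizes = new int[4];
        try {
            for (int i = 0; i < sizes.length; i++) {
                pool.execute(stuck);
                sizes[i] = pool.getPoolSize(); // threads alive after this submit
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return sizes;
    }

    public static void main(String[] args) {
        // prints [1, 1, 1, 2]: a second thread appears only once the queue is full
        System.out.println(Arrays.toString(poolSizesWhileFilling()));
    }
}
```

Under this rule, a large queue threshold means the pool stays at the core size for a long time while the backlog grows, so lowering the threshold makes extra threads start sooner but does not help if the threads themselves are stuck waiting.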

Thanks very much for anything that you can tell me.

raluca_edu · ‎Sep 09, 2019

Hi,

Please check this article.

Are there any other errors in Application.log?

If you set the Application log level to TRACE/DEBUG, what do you get?

Are there any related exceptions/errors in the Tomcat logs (stderr/stdout)?

Has any Timer been added recently that could trigger this issue?

Thank you,

Raluca Edu

tcoufal · ‎Oct 23, 2019

Hi,

We increased the RAM on our PP (Postgre).

So far so good.