Intro
Last week I spent some time figuring out why a few of our applications were crashing over the weekend, and even on some weeknights. Our status monitor was reporting failures nonstop, and the only remedy we could find was to restart the applications and let them live another day.
After a couple of days of this, we think we've finally found out what was happening.
What's happening?
Our applications use ActiveMQ to communicate with each other, and as part of our monitoring process, we check the connection between our apps and ActiveMQ.
The common way we do this is to send a message to a queue on ActiveMQ in a transacted session, then roll it back. If no exceptions are thrown during the process, we consider the ActiveMQ connection to be up.
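A check of this kind can be sketched roughly as follows. It is only an illustration, not our production monitor: the class name `BrokerHealthCheck`, the queue name `health.check`, and the broker URL are all made up for the example, and it assumes ActiveMQ 5.x with the `javax.jms` API and the `activemq-client` jar on the classpath.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;

public class BrokerHealthCheck {

    /**
     * Sends one message inside a transacted session and rolls it back,
     * so nothing is ever delivered. Returns true if no JMSException occurs.
     */
    public static boolean isBrokerUp(ConnectionFactory factory, String queueName) {
        Connection connection = null;
        try {
            connection = factory.createConnection();
            connection.start();

            // First argument 'true' makes the session transacted.
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue(queueName);
            MessageProducer producer = session.createProducer(queue);

            producer.send(session.createTextMessage("ping"));
            session.rollback(); // discard the message instead of committing it
            return true;
        } catch (JMSException e) {
            return false;
        } finally {
            if (connection != null) {
                try {
                    // Closing the connection also closes its sessions and producers.
                    connection.close();
                } catch (JMSException ignored) {
                    // Nothing useful to do if close itself fails.
                }
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical broker address and queue name; substitute your own.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        System.out.println("broker up: " + isBrokerUp(factory, "health.check"));
    }
}
```

Because the method takes a plain `ConnectionFactory`, the same check works unchanged with whichever factory implementation the application is wired to.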
The problem is, ActiveMQ was sometimes keeping sockets open after the status monitor sent a message and rolled it back. As a result, the "open file" count for the ActiveMQ process kept slowly climbing until it hit the limit, and the ActiveMQ log filled up with messages like this:
Database ~/activemq/data/kahadb/lock is locked... waiting 10 seconds for the database to be unlocked. Reason: java.io.IOException: File '~/activemq/data/kahadb/lock' could not be locked. | org.apache.activemq.store.kahadb.MessageDatabase
Could not accept connection : java.net.SocketException: Too many open files | org.apache.activemq.broker.TransportConnector | ActiveMQ Transport Server: tcp://0.0.0.0:61616
Could not accept connection : java.net.SocketException: Too many open files | org.apache.activemq.broker.TransportConnector | ActiveMQ Transport Server: tcp://0.0.0.0:61616
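If you want to watch the open-file count for a process yourself, here is a minimal sketch of one way to do it. It assumes a Linux host (it reads `/proc/<pid>/fd`), Java 9 or later, and permission to read that directory for the target process; the class name `OpenFileCount` is just for illustration. Sockets show up here too, since each one holds a file descriptor.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OpenFileCount {

    /**
     * Counts the entries in /proc/<pid>/fd (Linux only).
     * Each entry is one open file descriptor, including sockets.
     */
    public static long countOpenFiles(long pid) throws IOException {
        Path fdDir = Paths.get("/proc", Long.toString(pid), "fd");
        long count = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(fdDir)) {
            for (Path ignored : entries) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Pass the broker's PID as the first argument; defaults to this JVM.
        long pid = args.length > 0
                ? Long.parseLong(args[0])
                : ProcessHandle.current().pid();
        System.out.println("pid " + pid + " has " + countOpenFiles(pid) + " open file descriptors");
    }
}
```

Running this periodically against the broker's PID and comparing the numbers makes a steady climb easy to spot long before the limit is reached.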
Let's fix it
We had to find out why our open-file count was always increasing! After a long day of trying to reproduce the error in a test environment, followed by a few attempted solutions, we discovered the problem.
When we wired up the connection to ActiveMQ in our applications, we were just using a simple ActiveMQConnectionFactory to create the connections. For some reason, when we rolled back messages sent to ActiveMQ, the socket on the broker's end would sometimes stay open.
We quickly discovered that by using a PooledConnectionFactory in place of the ActiveMQConnectionFactory, the sockets were being properly released when messages were rolled back! Success!
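For reference, the swap can be sketched like this. It is an assumption-laden example rather than our actual wiring: it assumes ActiveMQ 5.x with the `activemq-pool` jar on the classpath, and the class name, broker URL, and pool size of 2 are purely illustrative.

```java
import javax.jms.Connection;
import javax.jms.JMSException;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.pool.PooledConnectionFactory;

public class PooledFactoryWiring {

    // Hypothetical broker address; substitute your own.
    static final String BROKER_URL = "tcp://localhost:61616";

    /** Wraps a plain ActiveMQConnectionFactory in a pooling factory. */
    public static PooledConnectionFactory createPooledFactory(String brokerUrl) {
        ActiveMQConnectionFactory plain = new ActiveMQConnectionFactory(brokerUrl);
        PooledConnectionFactory pooled = new PooledConnectionFactory(plain);
        pooled.setMaxConnections(2); // illustrative value, tune for your workload
        return pooled;
    }

    public static void main(String[] args) throws JMSException {
        PooledConnectionFactory pooled = createPooledFactory(BROKER_URL);
        try {
            Connection connection = pooled.createConnection();
            try {
                connection.start();
                // ... create sessions and producers as before ...
            } finally {
                // With a pool, close() hands the connection back for reuse
                // rather than tearing down the underlying socket.
                connection.close();
            }
        } finally {
            // Shut the pool down when the application exits.
            pooled.stop();
        }
    }
}
```

Since PooledConnectionFactory implements the standard JMS ConnectionFactory interface, code that only depends on that interface should not need to change.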
Conclusion
As always, the best way to figure out what's going on is to reproduce the error in a safe, closed environment (be it 'Dev' or 'Test' or 'Staging' or whatever you like). Once you can get it to happen reliably, you can try any number of solutions and verify the results consistently.
Hopefully, if you encounter an error like this, you'll be able to find the problem and remedy it quickly!