Chain of failures on blocking threads
I came back to Milano a little while ago and I've bumped into an API implementation at this new job. It will be a library that aims to interact with a remote application through a simple text-based protocol.
The typical process is a sequence of authorization – session initialization – command processing – session disposal, each step enclosed in an atomic request/response interaction. The simplest and most immediate approach is to write the protocol stubs and manage them through simple methods that process such commands at a low level, handling the TCP sockets and the client/server handshake with synchronous calls.
Sometimes the simplest way is the best one, but not this time, especially within multi-layered systems where every component depends on many others and any of them can fail.
This task rang an alarm bell for me because of a recent project that looked like this one; I can still remember the effects of hangs and missed responses in a SOA context. Fortunately the incident happened during a load test:
The application was a client interacting with Openfire through XMPP. The investigation uncovered a bug that, under certain conditions, caused a deadlock in a connection pool. The consequences were easy to predict: fast resource exhaustion, soon followed by an application breakdown. The application server was down, but the client side was also unrecoverable, since its non-resilient architecture made no provision for hanging requests.
What is unacceptable is the chain of failures that a problem like this can spread along the process path. And what about composed systems, where one side does not expect the other to hang when it stops responding?
Dominoes are a pleasant show: you watch all the pieces trace doodles as they fall. It's fun, but only as long as it doesn't look like your system at work.
Don’t play dominoes, be skeptical (and use the concurrent package)
Blocked threads may appear every time you attempt to get resources out of a connection pool, deal with caches or registered objects, or make calls to external systems, as in the unfortunate experience above. My point is to be distrustful of each component you query, decoupling systems as much as necessary to stop failure propagation. If your component is properly protected from its neighbours, the probability of failure clearly drops.
What does this mean in practice?
If you’re dealing with sockets you are unaware of the peer’s status except when you send or receive bytes, so check connectivity by polling with fake sends, and use setSoTimeout(int timeout) to prevent reads from blocking forever.
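A minimal sketch of the socket side of this, assuming a plain java.net.Socket; the host and port are placeholders for the remote application:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class BoundedRead {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            // bound the connect attempt as well (2 seconds)
            socket.connect(new InetSocketAddress("remote.example.com", 9000), 2000);
            // any read on this socket now gives up after 3 seconds instead of blocking forever
            socket.setSoTimeout(3000);

            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[1024];
            try {
                int read = in.read(buffer);
                System.out.println("read " + read + " bytes");
            } catch (SocketTimeoutException e) {
                // the peer is silent: fail fast instead of hanging the caller
                System.err.println("no response within 3s, aborting");
            }
        }
    }
}
```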
However, I find it much more effective to isolate the whole business unit in a single time-boxed job, because delays may also come from huge responses, such as unbounded result sets or file fetching.
If you allow clients to set timeouts, the requesting thread abandons the operation when the call does not complete in time. Easy?
Concurrent programming is hard, it requires real skill, and it is even discouraged when it means reinventing the wheel. The java.util.concurrent package helps you craft your code with timeout controls, as in the following example where I encapsulate a unit of work (a login) into an ExecutorService.
The Login action implements the Callable interface; unlike Runnable, it may throw checked exceptions when executed.
The tip is to launch the callable through ScheduledThreadPoolExecutor.submit and wait for the task to finish through Future.get(long timeout, TimeUnit unit). By specifying the timeout value, the operation either completes in time or a TimeoutException is thrown.
N.B.: in this last case, when the timeout occurs the ExecutorService doesn’t seem to take care of the still-running thread, so don’t forget to call Future.cancel(true) in the finally block.
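Here is a minimal sketch of the idea, with a hypothetical Login callable standing in for the real protocol stub:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeboxedLogin {

    // Hypothetical unit of work: unlike Runnable, call() may throw checked exceptions.
    static class Login implements Callable<String> {
        @Override
        public String call() throws Exception {
            // here the real code would open the socket, send the credentials
            // and wait for the server acknowledgement
            Thread.sleep(10_000); // simulate a peer that never answers
            return "session-id";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        Future<String> future = executor.submit(new Login());
        try {
            // wait at most 3 seconds for the login, then give up
            String sessionId = future.get(3, TimeUnit.SECONDS);
            System.out.println("logged in: " + sessionId);
        } catch (TimeoutException e) {
            System.err.println("login did not complete in time");
        } catch (ExecutionException e) {
            System.err.println("login failed: " + e.getCause());
        } finally {
            // the executor does not kill the still-running task by itself
            future.cancel(true);
            executor.shutdown();
        }
    }
}
```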