Giancarlo Frison Signals from the Noise

Chain of failures on blocking threads

I came back to Milano a short time ago, and in my new job I've bumped into an API implementation. It will be a library that interacts with a remote application through a simple text-based protocol. The typical process is a sequence of authorization, session initialization, command processing and session disposal, each of which is enclosed in an atomic request/response interaction. The simplest and most immediate approach is to write the protocol stubs and manage them through simple methods that process the commands at a low level, handling TCP sockets and the client/server handshake with synchronous calls. Sometimes the simplest way is the best, but not this time, especially within multi-layer systems where every component depends on many others and any of them can fail. This task rings an alarm bell for me because of a recent project that looked much like this one, and I can still remember the effects of hangs and missed responses in a SOA context; fortunately the event happened during a load test:

The application was a client interacting with Openfire through XMPP. The investigation uncovered a bug that caused a deadlock in a connection pool under certain conditions. The consequences were easy to predict: resources were exhausted quickly and the application soon broke down. The application server went down, but the client side was unrecoverable too, since the application's architecture was not resilient and did not account for hanging requests.

What is unacceptable is the chain of failures that a problem like this can spread along the processing path. What about combined systems where one side does not expect the other to stop responding? Dominoes are a pleasant show: you watch the pieces trace doodles as they fall. It's fun, but only as long as it doesn't look like your running system.

Don't play dominoes, be skeptical (and use the concurrent package)

Threads may block every time you try to get a resource out of a connection pool, deal with caches or registered objects, or call external systems, as in the unfortunate experience above. I mean that you should distrust every component you query, decoupling systems as much as necessary to avoid failure propagation. If your component is properly protected from its neighbours, the probability of failure clearly drops. What does this mean in practice? If you're dealing with sockets, you are unaware of the peer's status except when you send or receive bytes, so check connectivity by polling with fake sends and use setSoTimeout(int timeout) to prevent reads from blocking forever. However, I find it much more effective to isolate the whole business unit in a single timeboxed job, because delays may also come from huge responses, such as an unbounded result set or a file fetch. If you let clients set timeouts, the request thread abandons the operation when the call is not completed in time. Easy? Concurrent programming is hard, requires considerable skill, and writing it from scratch is even discouraged unless you want to reinvent the wheel. The java.util.concurrent package helps you craft code with timeout controls, as in the following example, where I encapsulate a unit of work (a login) in a job submitted to an ExecutorService.
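As a brief aside before the login example, the socket-level safeguard mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the class name SocketReadTimeout, the silent local server and the 200 ms value are mine, chosen so that the snippet runs on its own; only setSoTimeout comes from the text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of the library described in the post.
class SocketReadTimeout {

    /**
     * Connects to a local server that accepts the connection but never
     * answers, and reports whether the blocking read gave up in time.
     */
    static boolean readTimesOut(int timeoutMillis) {
        // Port 0 lets the OS pick a free port; this server never writes a byte.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // upper bound for each blocking read
            InputStream in = client.getInputStream();
            try {
                in.read(); // without the timeout this call would never return
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the read gave up; the socket itself is still open
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Note that the timeout bounds each single read, not the whole exchange, which is one reason why timeboxing the entire job is the more robust option.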

public class Login implements Callable {

The Login action implements the Callable interface; unlike Runnable, it may throw checked exceptions when executed.
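To make the difference concrete, here is a minimal sketch of how such a Login could be completed. The body is hypothetical: the original only shows the class declaration, so the Boolean result, the IOException and the placeholder check that stands in for the real protocol exchange are assumptions for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch: the real Login would talk to the remote application.
class Login implements Callable<Boolean> {
    private final String user;
    private final String password;

    Login(String user, String password) {
        this.user = user;
        this.password = password;
    }

    // call() may declare checked exceptions and return a value;
    // Runnable.run() can do neither.
    @Override
    public Boolean call() throws IOException {
        if (user == null || user.isEmpty()) {
            throw new IOException("no user name to send");
        }
        // Placeholder for the real request/response exchange (assumption).
        return password != null && !password.isEmpty();
    }
}
```

With this shape, a failure inside call() reaches the submitting thread wrapped in an ExecutionException, which is what the catch block in the snippet that follows relies on.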

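Login login = new Login(user, password);
// exe is the ExecutorService that runs the job on a worker thread
Future<?> res = exe.submit(login);
try {
  // wait for the login, but no longer than commandTimeout milliseconds;
  // TimeoutException and InterruptedException propagate to the caller
  res.get(commandTimeout, TimeUnit.MILLISECONDS);
}
catch (ExecutionException e) {
  // the login itself failed; the original exception is available as the cause
  log.error("error on login", e);
}
finally {
  // no effect if the task has completed, otherwise interrupts the worker thread
  res.cancel(true);
}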

The snippet submits the callable through ScheduledThreadPoolExecutor.submit and waits for the task to finish with Future.get(long timeout, TimeUnit timeUnit). With a timeout specified, the call either returns within that time or throws a TimeoutException.

N.B.: in this last case, when the timeout occurs, the ExecutorService doesn't seem to take care of the thread that is still running, so don't forget to call Future.cancel(true) in the finally block.
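The whole pattern, including the cancellation, can be tried out with a self-contained sketch like the one below. The names TimeBoxed, call and hangingJobIsStopped are hypothetical, and the sleeping job merely stands in for a peer that never answers; it is an illustration of the technique, not the code of the library.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, written only to illustrate the timeboxed job.
class TimeBoxed {

    /** Runs the job on the executor and waits at most timeoutMillis for its result. */
    static <T> T call(ExecutorService exe, Callable<T> job, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<T> res = exe.submit(job);
        try {
            return res.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // No effect on a completed task; otherwise interrupts the worker.
            res.cancel(true);
        }
    }

    /** Demo: true if a hanging job was timed out and its worker interrupted. */
    static boolean hangingJobIsStopped() {
        ExecutorService exe = Executors.newSingleThreadExecutor();
        CountDownLatch interrupted = new CountDownLatch(1);
        Callable<String> hangingJob = () -> {
            try {
                Thread.sleep(10_000); // stands in for a peer that never answers
                return "done";
            } catch (InterruptedException e) {
                interrupted.countDown(); // cancel(true) reached the worker
                throw e;
            }
        };
        try {
            try {
                call(exe, hangingJob, 100);
                return false; // not expected: the job cannot finish in 100 ms
            } catch (TimeoutException e) {
                // The caller is free again; check that the worker was stopped too.
                return interrupted.await(2, TimeUnit.SECONDS);
            }
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            exe.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("hanging job stopped: " + hangingJobIsStopped());
    }
}
```

Keep in mind that cancel(true) only delivers an interrupt: a job that ignores interruption, which is generally the case for a thread blocked in a classic java.net socket read, keeps running, so a read timeout on the socket is still worth setting.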

Let’s get Scrum

In my school days the teacher used to quote an important saying: culture is what remains after we forget the things we studied thoroughly. The concept is charming, but at that time the principle was often adopted to forget things even before they were studied.

The saying is also valid for software development methodologies, where best practices try to teach us the right path to something really good, shaping a product in the most efficient way and with the highest quality. The agile methodologies set a few general rules, but the result depends on you, your skills and of course your team, not on the methodology. Scrum doesn't produce good software products, but if you are smart, it might give you some hints that help you steer clear of failure.

What does Scrum say?

It declares that all activities are time-boxed and assigns each team member their own responsibilities based on workload estimation; the priority of the activities has to be agreed with your chief, most of the time the Scrum master. The daily meetings are essential: the team members explain what they did the day before, what they are going to do today and which blocking problems affect their tasks, along with the estimated time for a new task or a progress update. As in rugby, from which Scrum took its name, the goal is to get things done. PowerPoint presentations and docs are internal artifacts; the objective is to get the product shipped.

How? By setting objectives for the next iteration (sprint), and then incrementally for the one after. The first iterations tackle the most critical issues and the trivial ones come later, because you are mostly concerned with the software/system architecture and you want to know as soon as possible whether the chosen solution overcomes the issue.

As in rugby, the project team must be capable of thinking for itself. The coach doesn't have to enforce a defined set of steps to reach those objectives, and as in a real game the team has to learn to handle the chaos of changing requirements and emerging problems (while hoping that Italy will win the next Six Nations).

My considerations

In my Scrum practice, I've appreciated the duty of estimating the time for every single task, whether it is a discussion with colleagues or building a new feature. However, I don't find it very helpful during the meetings, because it is information that project managers need to monitor the process but it is of little use to the other developers. The time estimates are closely tied to the tasks, so why not handle them in the issue/bug tracking system, Jira for example? You might use it as a time monitor, so the Scrum master can automatically obtain all the required information about the development's progress. The meeting is an opportunity to get together and make it clear to the others where you are but, most importantly, to explain your problems first. Sharing difficulties among the members and getting proper tips back makes the team more integrated and helps to overcome problems quickly.

What impresses me about agile practices is the communication approach, the synchronization of the developers and the daily feedback. Quick stand-up meetings in the morning, before the activities start; maybe better with a coffee.