Fix Unportable Signal Disposition Establishment on Solaris
The rtorrent project uses signals for certain kinds of inter-thread communication. Some Solaris users reported that rtorrent crashed as soon as the signal SIGUSR1 was delivered. Being an avid reader of the POSIX standards, I was curious and felt that I might know what was wrong. I also figured that correctly identifying and solving the problem would be a huge testament to the POSIX standards, considering I had never used Solaris.
Some people in the issue figured that the culprit must be a non-POSIX-compliant implementation of pthread_kill(), which was preventing the application from sending signals to specific threads. I didn't think this was the case, as Solaris' manual page for pthread_kill() claims that it's implemented as intended. If it were indeed a non-compliant implementation, I figured someone else would've already encountered the issue and there would be some sort of note in the man page. In fact, Solaris is fully POSIX-compliant, which is more than can be said of Linux, and yet Linux didn't exhibit this behavior.
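For context, thread-directed signalling with pthread_kill() looks roughly like the sketch below. This is not rtorrent's code: the worker thread, the semaphore, and every name in it are made up for illustration, and the handler is installed with sigaction() so that the example does not depend on any platform's signal() semantics.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>     /* for callers that assert on the result */
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static pthread_t worker_tid;   /* written by the worker before it posts `ready` */
static sem_t ready;
static volatile sig_atomic_t handled_on_worker = -1;  /* -1: handler not run yet */

/* Records whether the handler is running on the thread we targeted. */
static void on_usr1(int signo)
{
    (void)signo;
    handled_on_worker = pthread_equal(pthread_self(), worker_tid) ? 1 : 0;
}

static void *worker(void *arg)
{
    struct timespec nap = { 0, 1000000 };  /* 1 ms */

    (void)arg;
    worker_tid = pthread_self();
    sem_post(&ready);                      /* tell main we can be signalled */
    while (handled_on_worker == -1)
        nanosleep(&nap, NULL);             /* interrupted early by the signal */
    return NULL;
}

/* Returns 1 if the handler ran on the targeted thread, 0 if it ran
 * elsewhere, and -1 if setup failed. */
static int signal_worker_thread(void)
{
    struct sigaction sa;
    pthread_t tid;

    handled_on_worker = -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0 || sem_init(&ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    while (sem_wait(&ready) != 0 && errno == EINTR)
        ;                                  /* retry if interrupted */
    pthread_kill(tid, SIGUSR1);            /* deliver to that thread only */
    pthread_join(tid, NULL);
    sem_destroy(&ready);
    return handled_on_worker;
}
```

Calling signal_worker_thread() from main() should yield 1 on a conforming implementation, since pthread_kill() directs the signal at one specific thread. Depending on the toolchain, the program may need to be built with -pthread.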
Instead, my suspicion was something else entirely. As soon as I recognized that something signal-related was causing crashes on certain platforms in particular, I thought of the one glaring system call that is well known to be unportable: signal(). In fact, the Linux man page for signal() says:
"The only portable use of signal() is to set a signal's disposition to SIG_DFL or SIG_IGN. The semantics when using signal() to establish a signal handler vary across systems (and POSIX.1 explicitly permits this variation); do not use it for this purpose."
The emphasis is theirs and goes to show how unpredictable the use of signal() could be. To complicate matters, the first line in the man page's notes section says:
"The effects of signal() in a multithreaded process are unspecified."
However, I believed that the real problem was that Solaris provided different semantics for signal() from the semantics that Linux provided:
"In the original UNIX systems, when a handler that was established using signal() was invoked by the delivery of a signal, the disposition of the signal would be reset to SIG_DFL, and the system did not block delivery of further instances of the signal. […] This was bad because the signal might be delivered again before the handler had a chance to reestablish itself. Furthermore, rapid deliveries of the same signal could result in recursive invocations of the handler."
This behavior is known as System V semantics. In other words, when a signal handler is established and then subsequently triggered, the signal disposition is reset to its default disposition, whatever that may be for the signal in question. If the handler isn't re-established, then a subsequent triggering of that signal will be handled based on the default disposition for that signal.
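The reset can be observed without a System V machine. The Linux man page equates System V semantics with calling sigaction() using the flags SA_RESETHAND | SA_NODEFER, so the following minimal sketch uses those flags and then checks the disposition after a single delivery. The function and variable names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>     /* for callers that assert on the result */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t sysv_calls = 0;

static void sysv_handler(int signo)
{
    (void)signo;
    sysv_calls++;
}

/* Returns 1 if SIGUSR1 is back at SIG_DFL after one delivery,
 * 0 if the handler is still installed, and -1 on error. */
static int sysv_resets_after_delivery(void)
{
    struct sigaction sa, now;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sysv_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND | SA_NODEFER;   /* System V style */
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    if (raise(SIGUSR1) != 0)                   /* first delivery: handled */
        return -1;

    /* Query the current disposition without changing it. */
    if (sigaction(SIGUSR1, NULL, &now) != 0)
        return -1;
    return now.sa_handler == SIG_DFL;
}
```

The sketch deliberately stops after one delivery: a second raise(SIGUSR1) at this point would trigger the default action and terminate the process.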
There is another behavior, referred to as BSD semantics, in which:
"the signal disposition is not reset, and further instances of the signal are blocked from being delivered while the handler is executing. Furthermore, certain blocking system calls are automatically restarted if interrupted by a signal handler."
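BSD semantics can be sketched the same way. Here the handler is installed through sigaction() with SA_RESTART and without SA_RESETHAND or SA_NODEFER, which is how the Linux man page characterizes BSD behavior. The names are made up for illustration, and the handler uses sigprocmask(), which assumes a single-threaded process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>     /* for callers that assert on the result */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t bsd_calls = 0;
static volatile sig_atomic_t blocked_in_handler = 0;

static void bsd_handler(int signo)
{
    sigset_t mask;

    bsd_calls++;
    /* A NULL `set` only queries the mask; nothing is changed. */
    if (sigprocmask(SIG_BLOCK, NULL, &mask) == 0 &&
        sigismember(&mask, signo) == 1)
        blocked_in_handler = 1;   /* the signal is blocked while we run */
}

/* Returns 1 if the handler is still installed after two deliveries,
 * 0 if it is not, and -1 on error. */
static int bsd_stays_installed(void)
{
    struct sigaction sa, now;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = bsd_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;     /* BSD style: no reset, restart syscalls */
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    if (raise(SIGUSR1) != 0 || raise(SIGUSR1) != 0)
        return -1;

    if (sigaction(SIGUSR1, NULL, &now) != 0)
        return -1;
    return now.sa_handler == bsd_handler;
}
```

Both deliveries reach the handler, and inside the handler the signal shows up in the thread's mask, which is what prevents the recursive invocations described in the System V case.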
The situation on Linux is such that the kernel's signal() system call provides System V semantics. However, glibc 2 and later expose a wrapper function for signal() which instead delegates its work to sigaction(), the system call preferred for portability and flexibility reasons, called in such a way as to provide BSD semantics. This wrapper function is exposed if the _BSD_SOURCE feature test macro is defined, which it is by default.
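To make the idea of such a wrapper concrete, here is a minimal sketch of a function with signal()'s signature that is built on sigaction(). It is not glibc's actual source; bsd_style_signal and sig_handler_fn are hypothetical names, and the only flag set is SA_RESTART, following the man page's description of BSD semantics.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>     /* for callers that assert on the result */
#include <signal.h>
#include <string.h>

typedef void (*sig_handler_fn)(int);   /* hypothetical alias for readability */

/* Hypothetical stand-in for a signal() wrapper: installs `handler` with
 * BSD semantics and returns the previous handler, or SIG_ERR on failure. */
static sig_handler_fn bsd_style_signal(int signo, sig_handler_fn handler)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;          /* no SA_RESETHAND, no SA_NODEFER */
    if (sigaction(signo, &sa, &old) != 0)
        return SIG_ERR;
    return old.sa_handler;
}
```

A call such as bsd_style_signal(SIGUSR1, handler) can then stand in for signal(SIGUSR1, handler) while behaving the same on every platform that implements sigaction().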
Solaris doesn't have such a wrapper for signal(), instead exposing its bare system call with System V semantics, as its man page documents:
void (*signal(int sig, void (*disp)(int)))(int);
"If signal() is used, disp is the address of a signal handler, and sig is not SIGILL, SIGTRAP, or SIGPWR, the system first sets the signal's disposition to SIG_DFL before executing the signal handler."
This clearly states that the signal disposition is reset to its default before the signal handler executes. Taking a look at the default signal disposition table for Solaris, we can see that SIGUSR1's default disposition is to exit the application. Presumably, Solaris users were crashing upon the second delivery of SIGUSR1, or of any other signal whose handler was established with signal() and whose default disposition was to exit or abort (core dump).
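The presumed failure mode can be reproduced on any POSIX system by emulating System V semantics in a child process and delivering SIGUSR1 twice. This is a sketch under that assumption, not a trace from an actual Solaris machine, and the names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>     /* for callers that assert on the result */
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_usr1(int signo)
{
    (void)signo;   /* nothing to do: surviving the delivery is the point */
}

/* Forks a child that takes SIGUSR1 twice under System V style semantics.
 * Returns the number of the signal that killed the child, 0 if the child
 * exited normally, and -1 on error. */
static int second_delivery_fate(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sigaction sa;
        sigset_t unblock;

        /* Make sure SIGUSR1 is deliverable even if the parent blocked it. */
        sigemptyset(&unblock);
        sigaddset(&unblock, SIGUSR1);
        sigprocmask(SIG_UNBLOCK, &unblock, NULL);

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_usr1;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESETHAND | SA_NODEFER;  /* System V style */
        if (sigaction(SIGUSR1, &sa, NULL) != 0)
            _exit(2);

        raise(SIGUSR1);   /* first delivery: handler runs, disposition resets */
        raise(SIGUSR1);   /* second delivery: default action terminates us */
        _exit(0);         /* only reached if the handler stayed installed */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The child is killed by the second SIGUSR1, so the function returns SIGUSR1 rather than 0, which matches the crash the Solaris users described.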
My patch simply consisted of switching out calls to
signal() for purposes of establishing signal handlers with calls to