A literal MPI corner case

I just spent a day tracking down a nasty MPI bug in some fluids code, and the cause is amusing enough to record. A process was interleaving message receives from two of its neighbors, one above and one to the right. Each message contained a band of data one grid cell thick. Both messages were being received into the same array, and their destination regions overlapped at the corner of the domain.

This worked fine under LAM/MPI, but broke when we switched to Open MPI. Presumably LAM was receiving one message at a time and returning control to the calling code, while Open MPI decided to receive both messages at once. Since the message destinations overlapped in memory, the second message to land clobbered the value written by the first at the single overlapping corner cell.
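To make the failure concrete, here is a minimal sketch in plain C. It is not the actual fluids code and makes no MPI calls: the two "receives" are simulated as ordinary writes into one small array, and the grid size, the function names, and the values 1.0 and 2.0 are all made up for illustration. It only shows that when both bands include the corner cell, the final corner value depends on which write happens last.

```c
#include <assert.h>

#define NX 4 /* columns; hypothetical grid size */
#define NY 4 /* rows; hypothetical grid size */

/* Stand-in for the message from the neighbor above:
   writes `value` into the top row, columns 0 .. ncols-1. */
static void write_top_band(double grid[NY][NX], double value, int ncols)
{
    assert(ncols >= 0 && ncols <= NX);
    for (int i = 0; i < ncols; i++)
        grid[NY - 1][i] = value;
}

/* Stand-in for the message from the neighbor to the right:
   writes `value` into the rightmost column, all rows. */
static void write_right_band(double grid[NY][NX], double value)
{
    for (int j = 0; j < NY; j++)
        grid[j][NX - 1] = value;
}

/* Apply the two writes in the requested order and return the value
   left in the top-right corner cell. `top_cols` is how many columns
   the top band covers: NX includes the corner, NX - 1 excludes it. */
double corner_after(int top_first, int top_cols)
{
    double grid[NY][NX] = {{0.0}};
    if (top_first) {
        write_top_band(grid, 1.0, top_cols);
        write_right_band(grid, 2.0);
    } else {
        write_right_band(grid, 2.0);
        write_top_band(grid, 1.0, top_cols);
    }
    return grid[NY - 1][NX - 1];
}
```

With overlapping bands (top_cols equal to NX), corner_after(1, NX) and corner_after(0, NX) disagree, which is the order dependence described above. Shrinking the top band by one cell (top_cols equal to NX - 1) gives the corner a single owner and the result no longer depends on arrival order. Receiving each band into its own scratch buffer and copying afterwards would be another way to get the same effect.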

For a while I thought the problem was a compiler bug, since switching to debug mode or adding print statements in the relevant area of code caused it to vanish. Fun.

Edit: (30may2009) I need more practice making figures, so I added one.
