This is interesting. Like a lot of people, I'd believed for a long time that FETs couldn't suffer from thermal runaway because they "had a negative temperature coefficient". A little while ago, however, someone here explained that thermal runaway can indeed occur with modern FETs with a very low Rds(on): even though Rds(on) increases with temperature, the increase is never great enough to outweigh the effect of the falling Vgs(threshold). That's slightly at odds with what Simon is saying above, though, which is that this runaway phenomenon can only occur when the FET is driven close to Vgs(threshold), because when it is driven "hard", variations in Vgs(threshold) are not significant.
I suppose the take-home message from this is that we need to read the datasheet very carefully with reference to how a device will behave across temperature under any particular drive conditions.
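To put rough numbers on the two drive regimes discussed above, here is a minimal sketch using the textbook square-law model, Id = k(T) * (Vgs - Vth(T))^2. Every constant in it (the 3.0 V threshold, the -5 mV/K threshold drift, the gain factor, the -1.5 mobility exponent) is an assumption chosen for illustration and not taken from any datasheet, and the square law is only a crude stand-in for a real power FET, especially at high gate drive.

```python
# Textbook square-law MOSFET model. All constants are illustrative
# assumptions, not values from any real device or datasheet.
T_REF_K = 300.0               # reference temperature
VTH_REF_V = 3.0               # assumed threshold voltage at T_REF_K
VTH_TEMPCO_V_PER_K = -0.005   # assumed threshold drift (falls as the die heats)
K_REF_A_PER_V2 = 10.0         # assumed gain factor at T_REF_K
MOBILITY_EXPONENT = -1.5      # assumed mobility roll-off with temperature


def drain_current_a(vgs_v, temp_k):
    """Saturation-region drain current for a given gate drive and die temperature."""
    vth = VTH_REF_V + VTH_TEMPCO_V_PER_K * (temp_k - T_REF_K)
    k = K_REF_A_PER_V2 * (temp_k / T_REF_K) ** MOBILITY_EXPONENT
    overdrive = max(vgs_v - vth, 0.0)  # no conduction below threshold
    return k * overdrive ** 2


# Compare a gate drive just above threshold with a "hard" drive.
for vgs in (3.5, 10.0):
    cold = drain_current_a(vgs, 300.0)
    hot = drain_current_a(vgs, 400.0)
    trend = "rises" if hot > cold else "falls"
    print(f"Vgs={vgs:4.1f} V: {cold:6.1f} A cold, {hot:6.1f} A hot -> current {trend} with temperature")
```

With these assumed numbers the current rises with temperature near threshold (the unstable case) and falls with temperature under hard drive (the self-balancing case). Where the crossover sits for a real part is exactly what the transfer-characteristic curves in the datasheet should show.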
With regard to Mosaic's specific problem, even in the absence of thermal runaway as such, is it possible that a FET with a lower Vgs(threshold) is simply turning on sooner and off later, and therefore just dissipating more power - eventually pushing it over its allowable die temperature?
It seems to me that in order for this to happen you'd have to be driving all the transistors very near their maximum temperatures - and I presume you're not doing that? Also, if this was happening, you should be able to measure a temperature difference between the transistors. I presume they are all running at a similar temperature?
It seems to me that uneven load sharing due to variations in Vgs(threshold) is going to be more significant with slower turn-on and turn-off times, since the devices then spend longer in the region near threshold where the mismatch matters. At 700Hz you've probably not worried too much about fast high-current driving, so perhaps a faster drive would help mitigate these device variations?
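As a rough feel for how much the edge speed matters, here is a small sketch using the common first-order estimate P_sw = 0.5 * V * I * (t_rise + t_fall) * f. Only the 700Hz figure comes from this thread; the bus voltage, load current and edge times are made-up assumptions, and the formula itself is only an approximation that ignores how the current divides between paralleled devices.

```python
def switching_loss_w(v_bus_v, i_load_a, t_rise_s, t_fall_s, f_sw_hz):
    """First-order switching loss: triangular V*I overlap on each edge."""
    return 0.5 * v_bus_v * i_load_a * (t_rise_s + t_fall_s) * f_sw_hz


F_SW_HZ = 700.0    # switching frequency mentioned in the thread
V_BUS_V = 24.0     # assumed supply voltage (not from the thread)
I_LOAD_A = 20.0    # assumed load current (not from the thread)

# Assumed edge times: a lazy gate drive versus a stiff one.
slow = switching_loss_w(V_BUS_V, I_LOAD_A, 10e-6, 10e-6, F_SW_HZ)
fast = switching_loss_w(V_BUS_V, I_LOAD_A, 100e-9, 100e-9, F_SW_HZ)

print(f"slow edges: {slow:.3f} W, fast edges: {fast:.4f} W")
```

The total is small either way at 700Hz, but during a slow edge the FET with the lowest threshold can be carrying most of the current on its own, so the time spent in transition is also the time during which the sharing is at its worst.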
If we're interested in determining whether this is (a) a thermal effect due to poor load sharing or (b) something else like back-EMF or ringing, then it seems to me that we'd be more likely to see (a) when running hot and/or heavily loaded - in contrast, if it fails shortly after turn-on, then (b) seems more likely. Probably stating the obvious here, but does that make sense?
I know (I'm sure we all know) that Mosaic is pretty competent at power switching circuits, so I presume that you're confident there is no parasitic oscillation or ringing? Good high-current gate drive, low-inductance paths, separate gate stoppers for each FET, maybe ferrite beads, appropriate snubber circuits etc?
As a last thought, if you don't mind the continued down time, it might be interesting to characterise but not pre-select the FETs - just mark them up before installing them. Then you'll be able to see, when they fail, if there is any common factor in the parameters of the failed devices.
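If it helps, the bookkeeping for that could be as simple as the sketch below. The field names (`vgs_th_v`, `rds_on_mohm`) and every value in the table are placeholders I've invented to show the layout; they are not measurements from anyone's board.

```python
from statistics import mean


def compare_groups(devices, params):
    """Return {param: (mean of failed, mean of survivors)} for each parameter."""
    result = {}
    for param in params:
        failed = [d[param] for d in devices if d["failed"]]
        survived = [d[param] for d in devices if not d["failed"]]
        if failed and survived:  # need both groups to compare
            result[param] = (mean(failed), mean(survived))
    return result


# Placeholder records: fill in real bench measurements and mark failures later.
devices = [
    {"id": "Q1", "vgs_th_v": 2.9, "rds_on_mohm": 4.1, "failed": True},
    {"id": "Q2", "vgs_th_v": 3.3, "rds_on_mohm": 4.0, "failed": False},
    {"id": "Q3", "vgs_th_v": 3.0, "rds_on_mohm": 4.3, "failed": True},
    {"id": "Q4", "vgs_th_v": 3.4, "rds_on_mohm": 4.2, "failed": False},
]

for param, (failed_avg, ok_avg) in compare_groups(devices, ["vgs_th_v", "rds_on_mohm"]).items():
    print(f"{param}: failed avg {failed_avg:.2f}, survivors avg {ok_avg:.2f}")
```

With only a handful of devices this won't prove anything statistically, but a consistent gap in one parameter between the failed and surviving parts would be a strong hint about where to look.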
Cheers!