I believe you will find the Parallel Guide we published today very useful (some of it is inspired by the discussions in this Forum thread)! It also answers your questions about which parts are parallelized with MPI/OpenMP.
I can't speculate at this point about why combining OpenMP and MPI doesn't work for you; we have certainly seen it work well on other systems. On the other hand, the two scenarios you outline (either full MPI without threading, or one MPI process per node with full threading) are still the most relevant ones in the majority of cases.
Thanks for all your input; it's very valuable!