Most analysis functions are indeed parallelized over k-points. I suppose you just used the number 4 as an example, which could correspond to 2x2x1 k-points, but it's a very low number for any accurate DOS results. So let's use a more realistic example of 5x5x5 points. We then have roughly 5*5*5/2=62 points, by simply halving the product due to time-reversal symmetry. So running this on 32 or 64 cores will work quite well. The actual number can be computed as
len(MonkhorstPackGrid(5,5,5).kpoints())
which comes out to 63.