Confabulation, and reliance on surface statistics more generally, often gets in the way of the meat and potatoes of critical reasoning in SotA models.
A good example of this is trying a slight variation of a common puzzle, or swapping its familiar tokens for unfamiliar representations, and watching the model repeat the same adjectives as it works through its CoT instead of engaging with the changed structure.
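For illustration, here's a minimal sketch of that kind of probe (the puzzle template and the swapped-in nonsense tokens are hypothetical, not from any specific benchmark): render a classic puzzle twice, once with its familiar surface tokens and once with novel ones, so the logical structure is held constant while the memorized phrasing is removed.

```python
# Hypothetical probe: hold a puzzle's logical structure fixed while
# swapping its surface tokens, so a model can't lean on memorized phrasing.

TEMPLATE = (
    "A {a} must carry a {b}, a {c}, and a {d} across a river, "
    "but the boat holds only the {a} and one item."
)

# Familiar tokens (the memorized version) vs. novel nonsense tokens.
ORIGINAL = {"a": "farmer", "b": "wolf", "c": "goat", "d": "cabbage"}
VARIANT = {"a": "courier", "b": "zorp", "c": "flim", "d": "quax"}


def render(slots: dict) -> str:
    """Fill the shared template, keeping structure identical across versions."""
    return TEMPLATE.format(**slots)


original = render(ORIGINAL)
variant = render(VARIANT)
```

If the model solves `original` but stumbles on `variant`, that's evidence it was matching surface statistics rather than reasoning over the constraint structure.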
Often, as soon as it makes a mistake and that mistake sits in its context, it has no way of correcting course. A lot of my current work relates to this: using a devil's advocate approach to self-correction.
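A minimal sketch of what such a devil's-advocate loop could look like (the `call_model` stub and the prompt wording are hypothetical placeholders for a real LLM API call, not the author's actual method): the critique is requested in a fresh prompt rather than by continuing the faulty chain, so the original mistake isn't anchoring the critic.

```python
# Sketch of a devil's-advocate self-correction loop.
# `call_model` is a hypothetical stand-in for a real LLM call;
# it is stubbed here so the control flow can run end to end.

def call_model(prompt: str) -> str:
    # Hypothetical stub: a real implementation would call an LLM API.
    if "devil's advocate" in prompt.lower():
        return "Objection: step 2 double-counts the overlap."
    if "revise" in prompt.lower():
        return "Revised answer: 12"
    return "Initial answer: 14"


def devils_advocate(question: str, rounds: int = 1) -> str:
    """Draft an answer, elicit an adversarial critique from a fresh
    prompt (so the draft's mistake doesn't anchor the critic), revise."""
    answer = call_model(question)
    for _ in range(rounds):
        critique = call_model(
            "Act as a devil's advocate. Find the strongest objection "
            f"to this answer.\nQuestion: {question}\nAnswer: {answer}"
        )
        answer = call_model(
            "Revise the answer given this objection.\n"
            f"Question: {question}\nAnswer: {answer}\nObjection: {critique}"
        )
    return answer
```

The design point is the separation of roles: the critic never sees the generation context that produced the error, only the question and the answer, which is one way to break the "mistake in context poisons everything downstream" failure mode.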
But realistically, we won't see a significant jump in things like a model's ability to recognize its own ignorance until the hardware shifts coming over the next few years.