Yeah, riddles work better than puns for what I'm talking about, since most popular puns were probably in the training set.
Like I said, I've had the best (or worst) results using cryptic crossword clues, since their solutions are almost certainly not in the training set. So it actually has to "think for itself," and you can see just how stupid it really is when it doesn't have some existing explanation buried somewhere in its training data.