Those mistakes would be easily solved by something that doesn’t even need to think. Just add a filter of acceptable orders, or hire a low-wage human who does not give a shit about the customers' special orders.
That wouldn't address the bulk of the issue, only the most egregious examples of it.
For every funny output like "I asked for 1 ice cream, it's giving me 200 burgers", there are likely tens, hundreds, or thousands of outputs like "I asked for 1 ice cream, it's giving me 1 burger" that sound sensible but are still the same problem.
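To make that concrete, here's a rough sketch of what a "filter of acceptable orders" might look like. The menu, the quantity cap, and the function name are all made up for illustration; a real system would differ. The point is only that a filter like this checks whether an order looks plausible, not whether it matches what the customer asked for.

```python
# Hypothetical menu and quantity cap, invented for this example.
MENU = {"ice cream", "burger", "fries"}
MAX_QTY = 10


def passes_filter(order):
    """Return True if every item is on the menu and within the quantity cap."""
    return all(
        item in MENU and 1 <= qty <= MAX_QTY
        for item, qty in order.items()
    )


requested = {"ice cream": 1}            # what the customer actually asked for
absurd = {"burger": 200}                # the funny failure
plausible_but_wrong = {"burger": 1}     # the boring failure

print(passes_filter(absurd))               # False: the filter catches this one
print(passes_filter(plausible_but_wrong))  # True: the filter lets this through
print(plausible_but_wrong == requested)    # False: it's still the wrong order
```

The filter has no access to what the customer said, so the second kind of error goes straight through.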
It's simply the wrong tool for the job. Using LLMs here is like hammering screws, or screwdriving nails. LLMs are a decent tool for things that you can supervise (not the case here), or where a large number of false positives and false negatives is not a big deal (not the case here either).