Can LLMs do accounting?

Despite promising results on synthetic benchmarks (e.g. Vending-Bench, SpreadsheetBench, DSBench), frontier models consistently underperform once they are deployed in complex, real-world situations.