Can LLMs do accounting?
Despite promising results on synthetic benchmarks (e.g. Vending-Bench, SpreadsheetBench, DSBench), frontier models consistently underperform once they are deployed in complex, real-world situations.