Codex helps build self-improving tax agents | Radar

OpenAI, Thrive Holdings and Crete present Tax AI as a tax agent that improves directly from expert work. In a pilot across more than 30 accounting firms, the system helped process 7,000 tax returns and targets work that can take up to eight hours of manual data entry on medium and large filings.

Tax AI turns practitioner failures into evals and Codex tasks

Tax AI automates parts of preparing 1040 and 1041 returns: reading source documents, filling fields, drafting a first version and flagging places where human judgment is needed. According to the published results, the system saves about one third of practitioner time, drafts returns with up to 97% accuracy and increases throughput by around 50%.
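The time-saving and throughput figures can be checked against each other with simple arithmetic. The sketch below assumes that practitioner time is the only limit on throughput, which the published results do not state; the function name `throughput_gain` is illustrative. Under that assumption, saving one third of the time per return lets the same hours cover 1.5 times as many returns.

```python
from fractions import Fraction


def throughput_gain(time_saved: Fraction) -> Fraction:
    """Relative increase in returns per hour if each return takes
    (1 - time_saved) of the original practitioner time.

    Assumption: practitioner time is the only bottleneck.
    """
    if not 0 <= time_saved < 1:
        raise ValueError("time_saved must be in [0, 1)")
    return 1 / (1 - time_saved) - 1


# One third of practitioner time saved, as reported for the pilot.
gain = throughput_gain(Fraction(1, 3))
print(gain)  # exact fraction, no floating-point rounding
```

So the reported gain of around 50% is about what a one-third time saving would imply, provided nothing else constrains the firm.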

The most interesting part is the closed loop between practice and engineering. The system collects practitioner feedback, product traces and review outcomes. When it fails on a case such as rental property income, that failure becomes a targeted eval and a Codex task that adjusts code or workflow so similar errors are caught earlier next time.
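The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual Tax AI pipeline: the record shapes `ReviewFailure` and `EvalCase`, the field name `rents_received` and the task wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewFailure:
    """One practitioner correction captured during review (hypothetical shape)."""
    return_id: str
    form: str          # e.g. "1040"
    field: str         # field the agent filled incorrectly
    topic: str         # e.g. "rental property income"
    agent_value: str
    expert_value: str


@dataclass(frozen=True)
class EvalCase:
    """A regression check derived from a single failure."""
    name: str
    form: str
    field: str
    expected: str

    def passes(self, drafted_value: str) -> bool:
        # The eval passes when a new draft matches the expert's value.
        return drafted_value.strip() == self.expected.strip()


def failure_to_eval(failure: ReviewFailure) -> EvalCase:
    """Turn a reviewed failure into a targeted eval case."""
    slug = failure.topic.lower().replace(" ", "_")
    return EvalCase(
        name=f"{failure.form}_{slug}_{failure.field}",
        form=failure.form,
        field=failure.field,
        expected=failure.expert_value,
    )


def failure_to_task(failure: ReviewFailure, case: EvalCase) -> str:
    """Draft a task description that a coding agent could pick up."""
    return (
        f"Fix {failure.topic} handling on form {failure.form}: "
        f"field '{failure.field}' was drafted as {failure.agent_value!r}, "
        f"expert entered {failure.expert_value!r}. "
        f"Done when eval '{case.name}' passes."
    )


failure = ReviewFailure(
    return_id="R-001",
    form="1040",
    field="rents_received",  # hypothetical field name
    topic="rental property income",
    agent_value="1200",
    expert_value="12000",
)
case = failure_to_eval(failure)
task = failure_to_task(failure, case)
print(task)
```

The point of the design is that the eval outlives the fix: once the case exists, any later change that breaks rental income handling again fails the same check.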

For accounting firms this changes who actually drives product improvement

At launch, only a quarter of returns reached the threshold of 75% correct field completion. Six weeks later, 86% did. Firms do not have to wait for a quarterly roadmap cycle to fix a bottleneck found in production.
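A metric of this kind could be computed roughly as follows. The published results do not define "correct field completion" precisely, so this sketch assumes it means the share of reviewed fields where the draft matches the practitioner's final value; the function names and sample data are illustrative.

```python
def field_completion(drafted: dict[str, str], reviewed: dict[str, str]) -> float:
    """Share of reviewed fields that the draft filled correctly.

    Assumption: a field counts as correct only on an exact match
    with the practitioner's final value.
    """
    if not reviewed:
        return 0.0
    correct = sum(1 for name, value in reviewed.items() if drafted.get(name) == value)
    return correct / len(reviewed)


def share_at_threshold(scores: list[float], threshold: float = 0.75) -> float:
    """Share of returns whose field completion meets the threshold."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


# Illustrative data only, not pilot results.
draft = {"a": "1", "b": "2", "c": "3", "d": "x"}
final = {"a": "1", "b": "2", "c": "3", "d": "4"}
score = field_completion(draft, final)                      # 3 of 4 fields match
batch_share = share_at_threshold([score, 0.5, 0.9, 0.2])    # 2 of 4 returns qualify
print(score, batch_share)
```

Tracking the second number week over week is what makes a jump from a quarter of returns to 86% visible as a product trend rather than an anecdote.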

For accounting firms, the value is twofold. They get faster return preparation and a system that learns from their specific practice. That may matter more than a broad AI promise, because tax work depends on details, exceptions and caution.

Tax AI still depends on human review and pilot numbers are not production data

Human review remains part of the loop. Practitioners check drafted returns, decide ambiguous cases and route uncertain situations back to product or engineering. That is the right division of labor: the agent speeds up preparation while accountability stays with the expert.

The published numbers come from a pilot, not from long-term production deployment. The accuracy figure of up to 97% depends on the specific case mix and on tax rules, which change every year. Performance in a tax season with a broader customer base may look different.

The agent-practitioner hybrid model will be tested in the next tax season

This hybrid model may prove the most realistic path for professional services. The agent does not become magic background automation. It becomes a working system that measures its own failures, learns from them and leaves humans to decide where context matters more than speed.

Watch whether Tax AI maintains its claimed metrics across a broader customer base in future tax seasons, and how the system handles changes to tax rules.

Lilith's verdict

The most important part is not tax form automation by itself, but the operating model. Tax AI turns real practitioner failures into evals and Codex tasks, so the product improves on the exact cases that slow firms down. That is a practical picture of agentic software: humans keep accountability, the system absorbs repeat work and the product team gets a faster path from failure to fix.