Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks

Jie-Jing Shao, Haiyan Yin, Yueming Lyu, et al.

Instead of fine-tuning on traces, this paper induces reusable, programmatic "skills" from an agent's own execution logs using a neuro-symbolic loop — and the skills transfer to new long-horizon tasks. A genuinely different answer to "how do agents get better over time."

Reviewed by Jithin Kumar Palepu | May 28, 2026 | 8 min read

Most attempts to make agents “learn from experience” quietly mean fine-tuning on their own logs. This paper does something more interesting: it reads the logs, extracts what worked as actual reusable programs, and hands those back to the agent as skills. The learning lives in code, not weights.

The problem it tackles

Long-horizon agentic tasks — the kind with twenty steps and branching decisions — are where LLM agents fall apart. They re-derive the same sub-procedures every run, make the same mistakes, and have no durable memory of “here is how you do this class of thing.” Fine-tuning helps a little but is expensive, opaque, and doesn't transfer cleanly to new tasks.

The key idea

The authors run a neuro-symbolic loop: the agent acts and produces execution traces, then a symbolic induction step “lifts” recurring patterns in those traces into small, named, parameterized programs — skills. Those skills become callable building blocks on the next task. Because they are programs, they are inspectable, composable, and transfer to problems the agent has never seen.
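To make "lifting" concrete, here is a minimal sketch of the idea, not the paper's induction procedure. It assumes a hypothetical trace format of (action, argument) steps and swaps in plain frequent n-gram counting for the symbolic induction step: any action sequence that recurs across enough traces becomes a named skill whose arguments are left open as parameters. The names `Step`, `Skill`, and `induce_skills` are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical trace format (an assumption, not from the paper):
# each step is an (action name, argument) pair.
Step = tuple[str, str]


@dataclass(frozen=True)
class Skill:
    """A named, parameterized program: a fixed action sequence with open argument slots."""
    name: str
    actions: tuple[str, ...]

    def expand(self, *args: str) -> list[Step]:
        """Fill the argument slots to get concrete steps the agent can execute."""
        if len(args) != len(self.actions):
            raise ValueError(f"{self.name} expects {len(self.actions)} arguments")
        return list(zip(self.actions, args))


def induce_skills(traces: list[list[Step]], length: int = 2, min_support: int = 2) -> list[Skill]:
    """Lift action n-grams that recur in at least `min_support` traces into skills.

    Frequent n-gram counting is a stand-in for the symbolic induction step.
    """
    support: Counter = Counter()
    for trace in traces:
        names = [action for action, _ in trace]
        # Count each n-gram once per trace so one repetitive trace cannot dominate.
        seen = {tuple(names[i:i + length]) for i in range(len(names) - length + 1)}
        support.update(seen)
    patterns = sorted(p for p, n in support.items() if n >= min_support)
    return [Skill(name="_then_".join(p), actions=p) for p in patterns]


# Toy traces: two share an open/search/download shape, one does not.
traces = [
    [("open", "inbox"), ("search", "invoice"), ("download", "invoice.pdf")],
    [("open", "drive"), ("search", "report"), ("download", "report.docx")],
    [("open", "calendar"), ("create", "meeting")],
]

skills = induce_skills(traces)
for skill in skills:
    print(skill.name, skill.actions)

# Reuse an induced skill on a task that appears in none of the traces.
print(skills[0].expand("wiki", "neuro-symbolic"))
```

The arguments differ across the toy traces but the action shape repeats, so the shape is what gets lifted and the arguments become parameters. A real induction step would also have to handle branching, variable binding between steps, and noisy traces, none of which this sketch attempts.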

The neat part is the division of labor: the neural model proposes and executes, the symbolic layer generalizes and remembers. Each does what it is good at, which is exactly the pitch neuro-symbolic methods have been making for years while delivering less than they promised.

What the results show

On long-horizon agentic benchmarks, induced skills improve both success rate and sample efficiency versus trace fine-tuning and pure prompting baselines — and crucially, the skills transfer to held-out task families rather than overfitting to the training tasks. The ablations matter here: remove the symbolic induction and most of the gain disappears, which is the evidence that the skills, not just more data, are doing the work.

Where I'm skeptical

The induction step is the whole ballgame, and inducing clean programs from messy real-world traces is hard outside of benchmark-shaped tasks. I'd want to see how it behaves when traces are noisy, when skills conflict, and how the skill library is pruned before it bloats. Promising direction, but the gap between “works on benchmarks” and “works in your messy agent” is exactly where these methods usually stumble.
