LLM agent engineering
Budget: ongoing.
Why this matters
You already use LangGraph in agent.py. The loop works. What is missing
is the set of production concerns that separate a demo agent from one
that is cheap, debuggable, and stable under prompt edits.
What to learn
| Topic | Time |
|---|---|
| Prompt caching: add cache_control to the system prompt in agent.py | 2 hours |
| Measure token cost before and after caching on a representative session | 1 hour |
| Streaming structured output so the UI can render partial tool progress | 1 day |
| Write 10 scripted eval cases at tests/agent_evals.jsonl | 1 day |
| Build a pytest-based eval harness that runs the agent against each case | 1-2 days |
| Wire LangSmith tracing behind an env flag | 0.5 day |
| Ongoing: review traces weekly, add eval cases when you find regressions | ongoing |
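The prompt-caching step in the table can be sketched as a payload builder. This is a hypothetical helper (`build_request` is not part of any SDK), assuming the Anthropic Messages API request shape; the actual network call is left out:

```python
# Sketch: mark the system prompt as cacheable via cache_control.
# build_request is a hypothetical helper; the dict it returns matches
# the kwargs you would pass to the Anthropic SDK's messages.create.

def build_request(system_prompt: str, user_turns: list[dict]) -> dict:
    """Build messages.create kwargs with the system prompt marked for
    prompt caching, so repeated sessions pay the cheaper cache-read
    rate for those tokens after the first request."""
    return {
        "model": "claude-sonnet-4-5",  # example id; pin your own model version
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Cache breakpoint: everything up to and including this
                # block is cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": user_turns,
    }

req = build_request("You are a helpful agent.", [{"role": "user", "content": "hi"}])
```

For the before/after measurement, the API response's usage object reports cache writes and reads separately (cache_creation_input_tokens and cache_read_input_tokens), which is what you compare across a representative session.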
Resources
- Anthropic prompt caching: the official guide, the only thing you need for step 1 of the exercise.
- Anthropic tool use: if you are reworking the tool layer.
- LangGraph docs: concepts and recipes.
- LangSmith: hosted tracing and eval platform.
- promptfoo: lighter-weight eval framework.
- The claude-api skill in this environment. Invoke it when you touch agent.py or system_prompt.md. It enforces caching and model-version hygiene.
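The LangSmith wiring mentioned above can be gated behind a flag so tracing is opt-in. A minimal sketch, assuming the standard LANGCHAIN_* environment variables LangSmith reads; AGENT_TRACING and the project name are made-up names for this example:

```python
# Sketch: enable LangSmith tracing only when an explicit flag is set,
# so staging can opt in without production exporting traces by default.
# AGENT_TRACING is a hypothetical flag name for this example.
import os

def configure_tracing() -> bool:
    """Turn LangSmith tracing on iff AGENT_TRACING=1 is set."""
    if os.environ.get("AGENT_TRACING") == "1":
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ.setdefault("LANGCHAIN_PROJECT", "agent-staging")
        # LANGCHAIN_API_KEY must already be present in the environment.
        return True
    # Default: make sure tracing stays off even if the var leaked in.
    os.environ.pop("LANGCHAIN_TRACING_V2", None)
    return False
```

Call this once at startup, before the agent graph is built, so the LangChain/LangGraph runtime picks the variables up.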
Exercise
- Turn on prompt caching for the system prompt in agent.py. Measure token cost on a representative session before and after.
- Add a JSONL eval file at tests/agent_evals.jsonl with 10 scripted user turns and the expected tool calls or substrings in the final response. Write a pytest that runs the agent against each entry and fails if the expected tool was not called.
- Wire LangSmith tracing behind an env flag so you can turn it on in staging without leaking traces from production by default.
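The eval-harness step above can be sketched as follows. `run_agent` is a stub standing in for the real entry point in agent.py (its actual signature may differ), and the two sample cases illustrate the assumed tests/agent_evals.jsonl layout:

```python
# Sketch of the JSONL eval harness: each case is one user turn plus the
# expected tool call or response substring. run_agent is a stub for the
# real agent loop in agent.py.
import json

SAMPLE_CASES = """\
{"user_turn": "What's the weather in Oslo?", "expected_tool": "get_weather"}
{"user_turn": "Say hello", "expected_substring": "hello"}
"""

def run_agent(user_turn: str) -> dict:
    # Stub: the real agent would run the LangGraph loop and record every
    # tool call it made. This fakes plausible output for the sketch.
    if "weather" in user_turn.lower():
        return {"tool_calls": ["get_weather"], "final_response": "5 C in Oslo."}
    return {"tool_calls": [], "final_response": "hello there"}

def check_case(case: dict) -> None:
    """Fail if the agent skipped the expected tool or response text."""
    result = run_agent(case["user_turn"])
    if "expected_tool" in case:
        assert case["expected_tool"] in result["tool_calls"], case
    if "expected_substring" in case:
        assert case["expected_substring"] in result["final_response"], case

for line in SAMPLE_CASES.splitlines():
    check_case(json.loads(line))
```

Under pytest, replace SAMPLE_CASES with the lines of tests/agent_evals.jsonl and drive check_case via @pytest.mark.parametrize so each case reports as its own test.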