LLM agent engineering
Budget: ongoing.
Why this matters
You already use LangGraph in agent.py. The loop works. What is missing
is the set of production concerns that separate a demo agent from one
that is cheap, debuggable, and stable under prompt edits.
What to learn
| Topic | Time |
|---|---|
| Prompt caching: add cache_control to the system prompt in agent.py | 2 hours |
| Measure token cost before and after caching on a representative session | 1 hour |
| Streaming structured output so the UI can render partial tool progress | 1 day |
| Write 10 scripted eval cases at tests/agent_evals.jsonl | 1 day |
| Build a pytest-based eval harness that runs the agent against each case | 1-2 days |
| Wire LangSmith tracing behind an env flag | 0.5 day |
| Ongoing: review traces weekly, add eval cases when you find regressions | ongoing |
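The prompt-caching step in the table can be sketched as a payload builder. This is a hypothetical helper (`build_request` is not part of any SDK), assuming the Anthropic Messages API request shape; the actual network call is left out:

```python
# Sketch: mark the system prompt as cacheable via cache_control.
# build_request is a hypothetical helper; the dict it returns matches
# the kwargs you would pass to the Anthropic SDK's messages.create.

def build_request(system_prompt: str, user_turns: list[dict]) -> dict:
    """Build messages.create kwargs with the system prompt marked for
    prompt caching, so repeated sessions pay the cheaper cache-read
    rate for those tokens after the first request."""
    return {
        "model": "claude-sonnet-4-5",  # example id; pin your own model version
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Cache breakpoint: everything up to and including this
                # block is cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": user_turns,
    }

req = build_request("You are a helpful agent.", [{"role": "user", "content": "hi"}])
```

For the before/after measurement, the API response's usage object reports cache writes and reads separately (cache_creation_input_tokens and cache_read_input_tokens), which is what you compare across a representative session.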
Resources
- Anthropic prompt caching: the official guide, the only thing you need for step 1 of the exercise.
- Anthropic tool use: if you are reworking the tool layer.
- LangGraph docs: concepts and recipes.
- LangSmith: hosted tracing and eval platform.
- promptfoo: lighter-weight eval framework.
- The claude-api skill in this environment. Invoke it when you touch agent.py or system_prompt.md. It enforces caching and model-version hygiene.
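The LangSmith wiring mentioned above can be gated behind a flag so tracing is opt-in. A minimal sketch, assuming the standard LANGCHAIN_* environment variables LangSmith reads; AGENT_TRACING and the project name are made-up names for this example:

```python
# Sketch: enable LangSmith tracing only when an explicit flag is set,
# so staging can opt in without production exporting traces by default.
# AGENT_TRACING is a hypothetical flag name for this example.
import os

def configure_tracing() -> bool:
    """Turn LangSmith tracing on iff AGENT_TRACING=1 is set."""
    if os.environ.get("AGENT_TRACING") == "1":
        os.environ["LANGCHAIN_TRACING_V2"] = "true"
        os.environ.setdefault("LANGCHAIN_PROJECT", "agent-staging")
        # LANGCHAIN_API_KEY must already be present in the environment.
        return True
    # Default: make sure tracing stays off even if the var leaked in.
    os.environ.pop("LANGCHAIN_TRACING_V2", None)
    return False
```

Call this once at startup, before the agent graph is built, so the LangChain/LangGraph runtime picks the variables up.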
Exercise
- Turn on prompt caching for the system prompt in agent.py. Measure token cost on a representative session before and after.
- Add a JSONL eval file at tests/agent_evals.jsonl with 10 scripted user turns and the expected tool calls or substrings in the final response. Write a pytest that runs the agent against each entry and fails if the expected tool was not called.
- Wire LangSmith tracing behind an env flag so you can turn it on in staging without leaking traces from production by default.
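The eval-harness step above can be sketched as follows. `run_agent` is a stub standing in for the real entry point in agent.py (its actual signature may differ), and the two sample cases illustrate the assumed tests/agent_evals.jsonl layout:

```python
# Sketch of the JSONL eval harness: each case is one user turn plus the
# expected tool call or response substring. run_agent is a stub for the
# real agent loop in agent.py.
import json

SAMPLE_CASES = """\
{"user_turn": "What's the weather in Oslo?", "expected_tool": "get_weather"}
{"user_turn": "Say hello", "expected_substring": "hello"}
"""

def run_agent(user_turn: str) -> dict:
    # Stub: the real agent would run the LangGraph loop and record every
    # tool call it made. This fakes plausible output for the sketch.
    if "weather" in user_turn.lower():
        return {"tool_calls": ["get_weather"], "final_response": "5 C in Oslo."}
    return {"tool_calls": [], "final_response": "hello there"}

def check_case(case: dict) -> None:
    """Fail if the agent skipped the expected tool or response text."""
    result = run_agent(case["user_turn"])
    if "expected_tool" in case:
        assert case["expected_tool"] in result["tool_calls"], case
    if "expected_substring" in case:
        assert case["expected_substring"] in result["final_response"], case

for line in SAMPLE_CASES.splitlines():
    check_case(json.loads(line))
```

Under pytest, replace SAMPLE_CASES with the lines of tests/agent_evals.jsonl and drive check_case via @pytest.mark.parametrize so each case reports as its own test.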