LLM agent engineering

Budget: ongoing.

Why this matters

You already use LangGraph in agent.py. The loop works. What is missing is the set of production concerns that separate a demo agent from one that is cheap, debuggable, and stable under prompt edits.

What to learn

| Topic | Time |
| --- | --- |
| Prompt caching: add cache_control to the system prompt in agent.py | 2 hours |
| Measure token cost before and after caching on a representative session | 1 hour |
| Streaming structured output so the UI can render partial tool progress | 1 day |
| Write 10 scripted eval cases at tests/agent_evals.jsonl | 1 day |
| Build a pytest-based eval harness that runs the agent against each case | 1-2 days |
| Wire LangSmith tracing behind an env flag | 0.5 day |
| Review traces weekly; add eval cases when you find regressions | ongoing |
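The first two rows can be sketched as request construction plus a measurement helper. This is a hedged sketch of the Anthropic Messages API request shape for prompt caching; SYSTEM_PROMPT and the model id are placeholders, not what agent.py actually uses:

```python
# Placeholder system prompt -- substitute whatever agent.py loads.
SYSTEM_PROMPT = "You are a helpful coding agent."


def build_request(user_msg: str) -> dict:
    """Build kwargs for client.messages.create() with a cached system prompt."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder; pin whatever agent.py pins
        "max_tokens": 1024,
        # cache_control on the final system block tells the API to cache the
        # whole prefix up to and including that block.
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }


def cached_fraction(usage: dict) -> float:
    """Fraction of input tokens served from cache, given usage as a plain dict."""
    read = usage.get("cache_read_input_tokens", 0)
    total = (
        read
        + usage.get("cache_creation_input_tokens", 0)
        + usage.get("input_tokens", 0)
    )
    return read / total if total else 0.0
```

On the first call the response's usage reports cache_creation_input_tokens (a cache write); subsequent calls within the TTL report cache_read_input_tokens (hits), which gives you the before/after number the second row asks you to record.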

Resources

  • Anthropic prompt caching: the official guide; it is all you need for step 1 of the exercise.
  • Anthropic tool use: if you are reworking the tool layer.
  • LangGraph docs: concepts and recipes.
  • LangSmith: hosted tracing and eval platform.
  • promptfoo: lighter-weight eval framework.
  • The claude-api skill in this environment. Invoke it when you touch agent.py or system_prompt.md. It enforces caching and model-version hygiene.

Exercise

  1. Turn on prompt caching for the system prompt in agent.py. Measure token cost on a representative session before and after.
  2. Add a JSONL eval file at tests/agent_evals.jsonl with 10 scripted user turns and the expected tool calls or substrings in the final response. Write a pytest that runs the agent against each entry and fails if the expected tool was not called.
  3. Wire LangSmith tracing behind an env flag so you can turn it on in staging without leaking traces from production by default.