What a year of running an AI agent in a restaurant taught me

For most of the past year, an AI agent has worked the night shift at a bar & grill in Bigfork, Montana. After close, a scheduled job fires and the agent reads through what the staff recorded that day: inventory counts, vendor orders, prep log entries. It digs into anything that looks off, and by the time the manager unlocks the door the next morning there's a short list of proposals waiting. Raise the par level on this. Look at the prep history on that. Nothing happens until a human taps approve.

I built the system, and it's been running in production since 2025. The platform underneath it is ordinary in the best way: recipes, scheduling, checklists, inventory, all used daily by real kitchen and bar staff. The agent is the unusual part, and running it against a real business for a year changed how I think about agents in general.

Triage, investigate, synthesize

The nightly run is a pipeline with three distinct phases, and the separation does more work than any individual prompt.

Phase one is triage. A mid-tier model (Sonnet) scans the day's operational observations — counts, orders, logs, prep entries — and flags what deserves a closer look. It isn't trying to be brilliant. It's trying to be cheap and hard to surprise: don't miss the weird thing, don't burn reasoning on routine data.

Phase two is investigation. Each flagged thread gets handed to an agent with tools: get_inventory_history, search_prep_items, get_prep_history, consumption-rate calculators. This is what makes it an agent rather than a report generator. It decides what to query, reads what comes back, and queries again. A typical thread looks something like:

investigate: "bar liquor usage vs. par"
→ get_inventory_history(location: bar, days: 21)
→ consumption_rate(item: well vodka)
→ note: usage trending up, order cadence unchanged
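
For a sense of what a consumption-rate calculator might do, here is a minimal sketch. The `CountRecord` shape and the formula (opening count plus deliveries minus closing count, divided by days) are my assumptions for illustration, not the platform's actual tool.

```python
from dataclasses import dataclass


@dataclass
class CountRecord:
    day: int                # days since the start of the window (hypothetical field)
    on_hand: float          # units counted that day
    received: float = 0.0   # units delivered since the previous count


def consumption_rate(history):
    """Average units used per day across the window covered by `history`."""
    if len(history) < 2:
        raise ValueError("need at least two counts to compute a rate")
    ordered = sorted(history, key=lambda r: r.day)
    days = ordered[-1].day - ordered[0].day
    if days <= 0:
        raise ValueError("counts must span at least one day")
    # Deliveries attached to the first count arrived before the window opened.
    received = sum(r.received for r in ordered[1:])
    used = ordered[0].on_hand + received - ordered[-1].on_hand
    return used / days


# Hypothetical well vodka counts over ten days.
history = [
    CountRecord(day=0, on_hand=12),
    CountRecord(day=5, on_hand=8, received=6),
    CountRecord(day=10, on_hand=4, received=6),
]
rate = consumption_rate(history)  # 12 + 12 - 4 = 20 units over 10 days
```

Comparing that rate against the order cadence is what turns a raw count into the "usage trending up, order cadence unchanged" note in the trace.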

Phase three is synthesis. The strongest model (Opus) takes the investigation threads and produces two kinds of output: insights, each carrying a confidence score, and proposed actions — concrete and specific, like "raise the par level on X." The proposals land in a manager-facing queue with approve and dismiss buttons and a full audit trail behind them.

Observe, investigate with tools, propose, wait for a human. That's the entire loop. Most of the discipline is in not letting any phase do another phase's job.
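
The separation is easier to see as code. This is a minimal sketch under stated assumptions: the `Observation`, `Thread`, and `Proposal` types are hypothetical, the below-par rule in `triage` stands in for a model call, and `investigate` runs each tool once where a real agent would choose its own queries.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    item: str
    location: str
    count: float
    par: float


@dataclass
class Thread:
    question: str
    notes: list = field(default_factory=list)


@dataclass
class Proposal:
    action: str
    confidence: float
    status: str = "pending"  # only a human decision changes this


def triage(observations):
    """Phase one: cheap scan. Stand-in rule: flag anything counted below par."""
    return [
        Thread(question=f"{o.location} {o.item} usage vs. par")
        for o in observations
        if o.count < o.par
    ]


def investigate(thread, tools):
    """Phase two: gather evidence with tools. Stubbed as one call per tool."""
    for name, tool in tools.items():
        thread.notes.append(f"{name}: {tool(thread.question)}")
    return thread


def synthesize(threads):
    """Phase three: turn evidence into proposals. It never applies anything."""
    return [
        Proposal(action=f"review par level ({t.question})", confidence=0.7)
        for t in threads
        if t.notes
    ]


def nightly_run(observations, tools):
    threads = [investigate(t, tools) for t in triage(observations)]
    return synthesize(threads)


# Hypothetical day of data and a single stubbed tool.
observations = [
    Observation("well vodka", "bar", count=2, par=6),
    Observation("limes", "bar", count=40, par=30),
]
tools = {"get_inventory_history": lambda question: "usage trending up"}
queue = nightly_run(observations, tools)
```

Each function has one input and one output, which is the point: triage cannot propose, synthesis cannot query, and nothing in the run can change a par level.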

The agent proposes, the manager disposes

This is the rule the whole system hangs on, and it's not a training-wheels phase I'm planning to remove. It's the design.

Approved proposals affect actual purchasing. If the agent talks the restaurant into over-ordering, that's real money sitting in a walk-in. And the manager holds context that never makes it into the database — the big party booked for Saturday, the vendor who's been unreliable lately, the menu change coming next week. The approval gate is where the agent's pattern-reading meets the human's situational knowledge, and you need both.

The gate also turns out to be a feedback channel. Watching which proposals get approved and which get dismissed tells you more about your agent's judgment than any eval suite I've written. A dismissal isn't a failure; it's a label.

Confidence should decay

Every insight the agent produces carries a confidence score, and that score decays over time. This sounded like a nice-to-have when I designed it. It turned out to be load-bearing.

A restaurant is not a stable system. Bigfork swells in the summer and empties out after; the platform literally models seasonal vendor preferences because ordering in July and ordering in January are different jobs. A consumption pattern the agent observed three weeks ago may simply no longer be true.

The failure mode that decay prevents is the stale-but-confident insight: something that was solid when it was minted, still wearing its original score long after the world moved on. An insight that can't get re-confirmed by fresh data should fade until it's quiet. Confidence in an operating business has a half-life, and your data model should say so.
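
One way to put a half-life in the data model is plain exponential decay. The sketch below is an assumption about the shape, not the production formula: the 14-day half-life, the 0.2 "quiet" threshold, and the `Insight` fields are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 14.0  # assumed; a real system would tune this per insight type
QUIET_BELOW = 0.2      # assumed threshold below which an insight stops surfacing


@dataclass
class Insight:
    text: str
    base_confidence: float
    confirmed_at: datetime

    def confidence(self, now):
        """Confidence halves every HALF_LIFE_DAYS since the last confirmation."""
        age_days = (now - self.confirmed_at).total_seconds() / 86400
        return self.base_confidence * 0.5 ** (age_days / HALF_LIFE_DAYS)

    def reconfirm(self, now, confidence):
        """Fresh data resets the clock and the score."""
        self.base_confidence = confidence
        self.confirmed_at = now

    def is_quiet(self, now):
        return self.confidence(now) < QUIET_BELOW


minted = datetime(2025, 7, 1)
insight = Insight("well vodka usage trending up", 0.8, minted)

two_weeks = insight.confidence(minted + timedelta(days=14))  # one half-life: 0.8 -> 0.4
faded = insight.is_quiet(minted + timedelta(days=42))        # three half-lives: 0.1
```

Storing the base score and the confirmation time, and computing the current value on read, means nothing has to run nightly just to age old insights.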

Use the cheapest model that can do the job

The pipeline is tiered on purpose: Sonnet does the scanning and the tool work, and Opus only shows up at the end, where judgment quality actually pays for itself.

The chat side of the platform is the cleanest example of the same idea. Haiku maintains a manifest of the restaurant's documents — a summary and keywords for each — and routes every question to the one to three documents that matter. Sonnet answers with that selective context plus live database data, like the actual recipe or the user's actual shifts, streamed back over SSE. The cheap model routes, the strong model answers, and nobody pays Opus prices to look up a prep list.
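
The routing step can be sketched as follows. To keep it runnable without a model, a keyword-overlap score stands in for the cheap model's judgment; that substitution, the `ManifestEntry` shape, and the sample documents are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class ManifestEntry:
    doc_id: str
    summary: str
    keywords: set


def route(question, manifest, max_docs=3):
    """Pick up to max_docs documents for a question.

    Stand-in for the routing model: score each entry by keyword overlap.
    """
    words = set(re.findall(r"[a-z]+", question.lower()))
    scored = [(len(words & entry.keywords), entry.doc_id) for entry in manifest]
    hits = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in hits[:max_docs]]


# Hypothetical manifest: one summary and keyword set per document.
manifest = [
    ManifestEntry("prep-guide", "Daily prep lists and station setup", {"prep", "station", "list"}),
    ManifestEntry("bar-manual", "Cocktail specs and bar close", {"bar", "cocktail", "close"}),
    ManifestEntry("handbook", "Time off and scheduling policy", {"schedule", "shift", "policy"}),
]

selected = route("Where is the prep list for the grill station?", manifest)
```

Only the selected documents, plus whatever live data the question needs, go into the answering model's context.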

Running an agent every night, indefinitely, for a small business makes cost a design constraint rather than a footnote. Tiering is how you stay welcome.

The boring parts of autonomy are the hard parts

This is the biggest lesson, and the least glamorous. The agent pipeline is the part people ask about, but it's a minority of the engineering. Most of the work is the stuff that makes autonomy survivable in the real world.

Auth for people who will never remember a password. Restaurant staff are not going to maintain one, so the system uses passwordless token auth with a role hierarchy (owner, manager, staff), device tokens that quietly re-authenticate, and login links delivered over WhatsApp. If logging in is annoying even once, the data the agent depends on stops getting entered.

Messaging that refuses to silently fail. Notifications go out over WhatsApp first. Delivery-status webhooks watch what happens; when a WhatsApp message fails, it goes into a pending-message queue and gets retried as SMS. An insight nobody reads might as well not exist, so delivery is part of the agent, not an afterthought.
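
The fallback path looks roughly like this. It is a sketch with injected sender functions, not a real provider integration: `send_whatsapp`, `send_sms`, the status strings, and the `Message` fields are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    msg_id: str
    to: str
    body: str
    channel: str = "whatsapp"
    status: str = "sent"


class Notifier:
    def __init__(self, send_whatsapp, send_sms):
        self.send_whatsapp = send_whatsapp  # injected callables: (to, body) -> None
        self.send_sms = send_sms
        self.outbox = {}        # msg_id -> Message, so webhooks can find the message
        self.pending = deque()  # failed WhatsApp messages waiting for SMS retry

    def notify(self, msg_id, to, body):
        msg = Message(msg_id, to, body)
        self.outbox[msg_id] = msg
        self.send_whatsapp(to, body)
        return msg

    def on_delivery_status(self, msg_id, status):
        """Called by the delivery-status webhook handler."""
        msg = self.outbox[msg_id]
        msg.status = status
        if status == "failed" and msg.channel == "whatsapp":
            self.pending.append(msg)

    def retry_pending(self):
        """Drain the pending queue, resending each message as SMS."""
        while self.pending:
            msg = self.pending.popleft()
            msg.channel, msg.status = "sms", "sent"
            self.send_sms(msg.to, msg.body)


whatsapp_log, sms_log = [], []
notifier = Notifier(
    send_whatsapp=lambda to, body: whatsapp_log.append((to, body)),
    send_sms=lambda to, body: sms_log.append((to, body)),
)
msg = notifier.notify("m1", "+15550100", "3 proposals waiting")
notifier.on_delivery_status("m1", "failed")
notifier.retry_pending()
```

The check on `msg.channel` keeps a failed SMS from looping back into the same queue; what to do when both channels fail is a separate decision this sketch leaves out.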

Audit trails. Every proposal, approval, and dismissal is recorded. When someone asks "why did we change that par level?", there's an answer with a timestamp and a name on it. An agent without an audit trail is a rumor.
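
A minimal sketch of how the approval gate and the audit trail fit together, assuming an in-memory queue. The `ApprovalQueue` class, the `ROLE_RANK` table, and the rule that only a manager or owner may decide are illustrative assumptions built on the role hierarchy described above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

ROLE_RANK = {"staff": 0, "manager": 1, "owner": 2}  # assumed ordering


@dataclass
class Proposal:
    proposal_id: str
    action: str
    status: str = "pending"


class ApprovalQueue:
    def __init__(self):
        self.proposals = {}
        self.audit_log = []  # append-only: entries are never edited or removed

    def propose(self, proposal):
        self.proposals[proposal.proposal_id] = proposal
        self._record(proposal.proposal_id, "proposed", actor="agent")

    def decide(self, proposal_id, decision, actor, role):
        if decision not in ("approved", "dismissed"):
            raise ValueError(f"unknown decision: {decision}")
        if ROLE_RANK[role] < ROLE_RANK["manager"]:
            raise PermissionError("only a manager or owner can decide")
        proposal = self.proposals[proposal_id]
        if proposal.status != "pending":
            raise ValueError("proposal already decided")
        proposal.status = decision
        self._record(proposal_id, decision, actor)

    def _record(self, proposal_id, event, actor):
        self.audit_log.append({
            "proposal_id": proposal_id,
            "event": event,
            "actor": actor,
            "at": datetime.now(timezone.utc),
        })


queue = ApprovalQueue()
queue.propose(Proposal("p1", "raise par level on well vodka"))
queue.decide("p1", "approved", actor="dana", role="manager")
events = [(e["event"], e["actor"]) for e in queue.audit_log]
```

Because dismissals are recorded the same way as approvals, the log doubles as the labeled history of the agent's judgment.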

The data the agent reads. The prep log is append-only. Inventory counts are per-location — kitchen, bar, liquor — because that's how counting actually happens. The agent's overnight reasoning is exactly as good as what tired humans were willing to record at close, which means the data-entry UX is, secretly, agent infrastructure.

None of this is the part demos show. All of it is the part that decides whether the system is still running a year later.

Where I've landed

If you're building an agent for an operating business, my advice is to aim the autonomy at investigation, not action. Let the agent read everything, question anything, and propose whatever it can defend — then make a human the only thing in the building that can say yes. The agent proposes, the manager disposes. A year in, I haven't found a reason to want it any other way.