Building a Shopify and DHL Shipping App With an Agentic AI Workflow

By Ibrahim Motani

The Brief

I picked up a project for a heritage Australian luxury retailer that has been trading since 1885. They sell premium goods sourced from family-owned ateliers in Italy, and their fulfilment model is genuinely complicated.

A single Shopify order can contain items from different Italian suppliers. Each ships from their own warehouse, with their own VAT number and their own signatory on the export paperwork. Every parcel leaving Italy needs a commercial invoice, a Dichiarazione Libera Esportazione (the Italian free export declaration), a DHL air waybill, and a booked pickup with the local courier. Before this app existed, all of that was happening in spreadsheets, Word templates, and the DHL portal in three browser tabs. A wrong HS code or a missing declaration means the parcel sits in customs for a week and a $400 leather bag becomes a refund.

What I shipped is a Shopify app that groups order line items by supplier, generates the invoice and export declaration as editable HTML documents, pulls live DHL Express rates, creates the shipment, captures the declaration as a PDF in the browser, uploads it to DHL as a Paperless Trade document, books the courier pickup, and creates a fulfilment in Shopify with the tracking number. One button, end to end.
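To make the shape of that concrete, here is a minimal TypeScript sketch of the per-supplier pipeline. Every name in it (the Deps interface, shipOrder, each helper) is an illustrative stand-in, not the app's real module structure.

```ts
// A minimal sketch of the one-button flow. All names here are illustrative
// stand-ins, not the app's real modules.

interface LineItem {
  sku: string
  quantity: number
  supplierId: string
}

interface Shipment {
  awb: string
  trackingNumber: string
}

interface Deps {
  // Commercial invoice rendering is elided; only the declaration matters here.
  renderDeclarationHtml(items: LineItem[]): string
  getRates(items: LineItem[]): Promise<{ productCode: string }[]>
  createShipment(items: LineItem[], rate: { productCode: string }): Promise<Shipment>
  captureDeclarationPdf(html: string): Promise<Uint8Array>
  uploadPaperlessDoc(awb: string, pdf: Uint8Array): Promise<void>
  bookPickup(awb: string): Promise<void>
  createFulfillment(orderId: string, items: LineItem[], tracking: string): Promise<void>
}

export async function shipOrder(orderId: string, items: LineItem[], deps: Deps) {
  // Group line items by supplier: one parcel, one document set, one booking each.
  const groups = new Map<string, LineItem[]>()
  for (const item of items) {
    groups.set(item.supplierId, [...(groups.get(item.supplierId) ?? []), item])
  }

  for (const group of groups.values()) {
    // Paperwork is generated as editable HTML so the user can correct it first.
    const declarationHtml = deps.renderDeclarationHtml(group)

    const [rate] = await deps.getRates(group) // rate selection elided
    const shipment = await deps.createShipment(group, rate)
    const declarationPdf = await deps.captureDeclarationPdf(declarationHtml)
    await deps.uploadPaperlessDoc(shipment.awb, declarationPdf)
    await deps.bookPickup(shipment.awb)
    await deps.createFulfillment(orderId, group, shipment.trackingNumber)
  }
}
```

The dependency-injection shape is deliberate: every external call is a swappable function, which is what makes the tests-first loop described below possible.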

The Part That Will Get the Most Attention

I did not write a single line of code for this app.

Every TypeScript file, every test, every GraphQL fragment, every Prisma migration was produced by GitHub Copilot in agent mode, driven by Claude Opus 4.6 inside VS Code. I scoped the work, designed the failure flow, reviewed every diff before it landed on a branch, and curated a small set of repository-scoped memory files that the agent reads at the start of every session. But the keystrokes that produced the code were not mine.

I know how that sounds. So before the agent sceptics close the tab and the vibe coders cheer, let me be specific about what "agent wrote it" actually means here, because it is not the same thing as "I asked Claude and got the result."

The Setup

Here is the exact configuration I used. All of it is reproducible by any developer with a Copilot subscription.

VS Code Copilot Chat, agent mode: the driver. Lets the model read, edit, run, and verify changes across the workspace.
Claude Opus 4.6: the model. Picked for long context, careful code edits, and strong test generation.
Shopify AI Toolkit (Dev MCP server): Shopify's official toolkit that exposes the Admin GraphQL schema, the Functions reference, and a code validator to the agent.
Repository memory: two short markdown files inside the project that the agent reads at the start of every session. More on these below.
Explore subagent: a read-only subagent used for "go scan the codebase and tell me X" tasks. Returns a structured report without polluting the main conversation.
Workspace instructions: a short file describing the project conventions and the rules of engagement: write tests first, never bypass git hooks, never disable a failing test to make it pass.

The Shopify AI Toolkit, documented at shopify.dev/docs/apps/build/ai-toolkit, is worth a one-line plug. You install it once into your editor and the agent stops hallucinating GraphQL fields. It validates mutations against the real Admin schema and checks Polaris component props. On a project where the schema changes per API version, that removed an entire class of "looks right but is not" bugs before any test had to catch them.
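For the curious, wiring the Dev MCP server into the editor is a one-file job. The snippet below is a sketch of the VS Code shape at the time of writing; treat the exact keys and package invocation as assumptions and defer to the docs page above.

```jsonc
// .vscode/mcp.json — sketch only; verify against shopify.dev/docs/apps/build/ai-toolkit
{
  "servers": {
    "shopify-dev-mcp": {
      "command": "npx",
      "args": ["-y", "@shopify/dev-mcp@latest"]
    }
  }
}
```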

How the Loop Actually Worked

Most blog posts about coding with agents either sell it as magic or dismiss it as a toy. Neither is true. The loop I settled on was a fairly boring four step cycle that I ran for almost every feature.

1. Write the spec. Before any code, I wrote a few paragraphs in a markdown file or directly in the chat, describing what should happen, the inputs, the outputs, the failure modes, and the user-visible behaviour for each.

2. Tests first. I asked the agent to translate that spec into Vitest tests, with realistic fixtures based on real DHL and Shopify payloads (a sketch of what one of these looked like follows this list). I read every test before letting it write any implementation. If a test described the wrong behaviour, the implementation would be wrong too.

3. Implementation, then run. Only after the tests existed did the agent write the implementation. Then it ran the test command itself. If anything was red, it iterated. If a test got "fixed" by being deleted or weakened, I caught it on review and reverted.

4. Memory if I learned something hard. Whenever a debugging session uncovered a non-obvious gotcha, I had the agent summarise the finding in a project memory file. The next session did not rediscover the same trap.
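Here is the sketch promised in step two: the shape of a test that lands before any implementation exists, written against the hypothetical shipOrder from earlier. The fixture is deliberately minimal, not a verbatim DHL payload.

```ts
import { describe, expect, it, vi } from 'vitest'
import { shipOrder } from '../ship-order' // hypothetical module path

describe('shipOrder', () => {
  it('books one DHL shipment per supplier group', async () => {
    // Fixture shaped loosely like a DHL createShipment response; the real
    // tests used captured payloads, this one is deliberately minimal.
    const shipment = { awb: '1234567890', trackingNumber: '1234567890' }

    const deps = {
      renderDeclarationHtml: vi.fn(() => '<html>declaration</html>'),
      getRates: vi.fn(async () => [{ productCode: 'P' }]),
      createShipment: vi.fn(async () => shipment),
      captureDeclarationPdf: vi.fn(async () => new Uint8Array()),
      uploadPaperlessDoc: vi.fn(async () => {}),
      bookPickup: vi.fn(async () => {}),
      createFulfillment: vi.fn(async () => {}),
    }

    // Three line items, two suppliers: milano twice, napoli once.
    const items = [
      { sku: 'BAG-1', quantity: 1, supplierId: 'milano' },
      { sku: 'BELT-2', quantity: 2, supplierId: 'napoli' },
      { sku: 'BAG-3', quantity: 1, supplierId: 'milano' },
    ]

    await shipOrder('gid://shopify/Order/1', items, deps)

    // Two supplier groups means two shipments, two pickups, two fulfilments.
    expect(deps.createShipment).toHaveBeenCalledTimes(2)
    expect(deps.bookPickup).toHaveBeenCalledTimes(2)
    expect(deps.createFulfillment).toHaveBeenCalledTimes(2)
  })
})
```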

The thing that makes this work is the test count. By the time I shipped the first production version, the app had close to 40 test files. The ratio of test code to application code sits around one to two. The tests are the leash. Without them, an agent on Opus 4.6 is still smart enough to write plausible looking code that ships a broken parcel.

There is one integration test file I am particularly fond of. It is named after the warehouse manager in Naples. Every time he sent a parcel back with a problem, I turned the problem into a failing test before the agent went anywhere near the fix. That rule, "the bug becomes a test before it becomes a fix," kept the app honest as the feature surface grew.

Repo Memory: The Bit That Actually Felt Like Magic

The single biggest accuracy upgrade I got from this workflow was not a smarter model. It was a folder of two markdown files that I keep in sync whenever the agent and I work through something hard together.

The first one is about the EORI number. EORI is the EU's Economic Operators Registration and Identification number, and it is a mess in a project like this. It is referenced in the supplier configs, it appears as a label in the rendered commercial invoice, and it shows up in the legal text of the export declaration. But, and this is the gotcha that took an afternoon to figure out, it is currently not sent in the DHL booking API request at all. The shipper payload carries the postal address and contact information, not the registration identifiers. The label "EORI:" renders in the invoice template but the value is blank, because nothing populates it.

That is the kind of thing that is invisible until it bites you. The first time the agent looked at the supplier config, it assumed EORI was being sent to DHL. It is not. So the memory file lays out where EORI is configured, where it appears, where it does not appear, and what the action items are if DHL ever starts enforcing it. Every subsequent session reads that file and stops rediscovering the same gap.
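The actual file stays private to the project, but the structure is worth showing. A sketch of how a memory file like this can be laid out, with every heading and field name illustrative:

```markdown
# EORI: where it lives and where it does not

## Configured
- Each supplier config entry carries that supplier's EORI value.

## Appears
- As the "EORI:" label in the rendered commercial invoice (value currently blank).
- In the legal text of the export declaration.

## Does NOT appear
- In the DHL booking request. The shipper payload carries postal address and
  contact details only, no registration identifiers.

## Action items if DHL starts enforcing EORI
- Populate the invoice label from the supplier config.
- Add the identifier to the booking payload and cover it with a test.
```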

The second file is about a real customs hold we had to dig out of. Three things were wrong, and all three are now permanent agent context.

First, the Italian export declaration was being uploaded to DHL with the wrong document type code. DHL's Paperless Trade upload uses short type codes, and we were classifying the declaration as a customs invoice when DHL expected the dedicated declaration type. The two codes look similar in a list of options, the agent had picked the more common one, and the parcel got held in Italian customs because the declaration was filed against the wrong slot.
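In code, the whole fix was one field. The sketch below uses assumed type codes ('CIN' for a customs invoice, 'DCL' for the dedicated declaration); check the exact values against DHL's current Paperless Trade reference rather than trusting this snippet.

```ts
// Illustrative sketch only. The payload shape and type codes below are
// assumptions; verify them against DHL's Paperless Trade documentation.
type PaperlessTypeCode = 'CIN' /* customs invoice */ | 'DCL' /* export declaration */

interface PaperlessDocument {
  typeCode: PaperlessTypeCode
  imageFormat: 'PDF'
  content: string // base64-encoded document
}

function declarationUpload(pdfBase64: string): PaperlessDocument {
  return {
    // The bug: this was the customs invoice code. The declaration has to be
    // filed under the dedicated declaration type, or customs sees an empty slot.
    typeCode: 'DCL',
    imageFormat: 'PDF',
    content: pdfBase64,
  }
}
```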

Second, the PDF the customs broker was opening was technically a PDF but practically unreadable. The declaration is captured client-side in the browser from a live HTML form. The form has editable input fields styled with yellow backgrounds and dashed blue borders so the user knows what is editable. The browser print stylesheet strips that styling on print. The canvas-to-PDF capture we use does not honour print media queries, so the captured document had yellow form field rectangles all over what was supposed to look like a signed declaration. The fix was to temporarily flatten input styling for the duration of the capture, then restore it afterwards.

Third, the captured PDF was using JPEG compression at a moderate quality setting. JPEG is fine for photos and terrible for thin black serifs on white paper. Switching to PNG at triple resolution fixed it.
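A sketch of the capture routine with both fixes folded in. I am using html2canvas and jsPDF here as stand-ins for whichever canvas-to-PDF pair you have; the principle is the same.

```ts
import html2canvas from 'html2canvas' // stand-ins: the app's actual capture
import { jsPDF } from 'jspdf'         // and PDF libraries are not named in this post

export async function captureDeclarationPdf(form: HTMLElement): Promise<jsPDF> {
  // Temporarily flatten the editable-field styling (yellow background,
  // dashed blue border) so the capture looks like a signed paper document.
  const inputs = Array.from(form.querySelectorAll<HTMLElement>('input, textarea'))
  const saved = inputs.map((el) => el.getAttribute('style'))
  inputs.forEach((el) => {
    el.style.background = 'transparent'
    el.style.border = 'none'
  })

  try {
    // Triple resolution so thin black serifs survive rasterisation.
    const canvas = await html2canvas(form, { scale: 3 })
    const pdf = new jsPDF({ unit: 'pt', format: 'a4' })
    // PNG, not JPEG: lossless, so no compression artefacts around text.
    pdf.addImage(canvas.toDataURL('image/png'), 'PNG', 0, 0, 595, 842)
    return pdf
  } finally {
    // Restore the editable styling whatever happens during capture.
    inputs.forEach((el, i) => {
      if (saved[i] === null) el.removeAttribute('style')
      else el.setAttribute('style', saved[i]!)
    })
  }
}
```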

None of those findings are things you can prompt your way to. They came out of a real parcel sitting in a real warehouse in Milan, a frustrated phone call, and a couple of hours of reading DHL's documentation cover to cover. Once they were in the memory file, the agent stopped re-suggesting JPEG capture and stopped reaching for the wrong document type code in fresh sessions. Future me also stopped rediscovering them.

This is what I mean when I say "agentic workflow." Not the model. The scaffolding around the model.

The Failure Spec That Kept Me Out of Trouble

Before I let the agent touch the DHL booking flow, I spent a long evening writing a failure handling specification. It runs to about six hundred lines and enumerates every realistic failure mode for the one click "Ship Order" path.

There is one subtle failure mode in there I want to call out, because it is exactly the kind of thing an agent will not invent on its own.

Scenario: shipment created, but the database write fails. DHL has allocated a real tracking number. Real money has been spent on a real label. But our application's write to record the shipment crashed. If we just throw a generic error and tell the user the shipment failed, they will click the button again and we will book a second courier and pay for a second label.

The spec for this scenario is explicit. Always show the user the AWB returned by DHL, even in the error banner. Tell them the parcel is booked and a courier will arrive. Tell them to contact support with the AWB so we can reconcile manually. Never present a retry button that re-runs the shipment creation step.

When I first asked the agent to implement the one click flow, the implementation it produced wrapped the DHL call and the database write in a single try/catch. That would have produced exactly the wrong behaviour: the user would see "shipment failed" while DHL had silently allocated a real tracking number on their account. The spec was explicit, so the agent's first cut got rejected on review. I pointed it at the relevant scenario and the next version split the two error surfaces and kept the AWB visible in the warning banner regardless of what happened to the database write.
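The resulting structure is easy to sketch. The names below are illustrative; the point is the two separate try/catch blocks, two different error surfaces, and the AWB surviving the second failure.

```ts
// Illustrative sketch of the split error surfaces; names are stand-ins.
type ShipResult =
  | { status: 'ok'; awb: string }
  | { status: 'booked_but_unrecorded'; awb: string } // DHL succeeded, DB write failed
  | { status: 'failed'; error: string }              // nothing was booked

async function shipAndRecord(deps: {
  createShipment(): Promise<{ awb: string }>
  recordShipment(awb: string): Promise<void>
}): Promise<ShipResult> {
  let shipment: { awb: string }
  try {
    shipment = await deps.createShipment()
  } catch (e) {
    // Safe to retry: DHL never allocated a tracking number.
    return { status: 'failed', error: String(e) }
  }

  try {
    await deps.recordShipment(shipment.awb)
  } catch {
    // NOT safe to retry: a label exists and a courier is coming.
    // The UI must show the AWB and must not offer a retry button.
    return { status: 'booked_but_unrecorded', awb: shipment.awb }
  }

  return { status: 'ok', awb: shipment.awb }
}
```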

That is not a "wow look at the AI" story. That is the boring, normal story of how senior engineers and agents collaborate when the senior engineer takes the spec seriously.

Three Things That Made This Workflow Actually Work

Agentic coding is the way forward; that part is settled. The interesting question is how to do it without shipping broken software. For me it came down to three rules.

Tests are the agent's leash. If you would not let a junior engineer commit code without tests, do not let an agent do it. The agent is faster than the junior engineer and equally capable of producing confident wrong code. Tests are how you cap the cost of being wrong in both cases.

Write the failure spec before the happy path. Most agent generated code is fine when everything works. The places it falls down are partial successes, network blips, and 4xx errors that look transient but are not. If you write down what should happen in each failure mode before any code exists, the agent has something concrete to implement. If you do not, you get a try/catch with a generic error message and a frustrated warehouse manager.

Memory is the cheapest accuracy upgrade you have access to. When you learn something non obvious, write it down where the agent will read it. Two markdown files in this project have probably saved me twenty hours of re debugging the same issues across different sessions.

What I Got Out of This as a Developer

I want to address the obvious worry, because I had it too. Does writing zero code make me less of a developer than I was a year ago?

Honestly, I think it makes me a more useful one. The things I spent my time on for this project were the things I am best at and the model is worst at: scoping the brief with the client, designing the failure flow, deciding the data model, reviewing diffs, writing the specs that the tests get generated from, and noticing when "that test got faster" was actually "that test got disabled." None of those tasks are getting easier with better models. They are getting more valuable.

The keystrokes were not the work. They never were.

Closing

If you read all the way down here and you are an engineering manager looking for someone who can both run an agent properly and ship a non-trivial integration to production, I would love to talk. The app behind this post is in production today, processing real orders for a real retailer that has been trading longer than most of the countries we ship to.

I will see you again with the next one.