This article is for developers already using LLM coding assistants who want to sharpen their workflow. I won't be explaining what an LLM is or how to set up your first AI coding tool. Instead, I'll walk through the tooling, habits, and hard-won lessons I've picked up over two years of daily LLM-assisted development.
The single biggest shift in that time: I stopped treating LLMs as code generators and started treating them as collaborators that need a plan.
The Tooling Stack
For a long time, Roo Code was my daily driver. Its appeal is flexibility: it works with anything, supports multiple modes, and is deeply customizable. Recently I switched to Kilo Code, which adds one critical feature Roo Code lacked: autocomplete. Kilo Code uses the Codestral model running locally via Ollama for this, which means fast completions with zero cloud cost.
On the context side, I use the Conport MCP server to persist long-term context across sessions — architectural decisions, patterns, conventions, and to-do items. This acts as a project's living memory that any mode can reference.
I also use custom Roo Flow prompts, which I've tuned to reduce token usage. One recent change worth noting: I had to modify the default Roo Flow prompts to remove XML-based tool usage directions and switch to native tool calling. If you're still on the XML-based prompts, it's worth making that switch.
The Architect-First Workflow
Early on, my process looked like this: write a prompt, get code back, iterate on the code, get frustrated, iterate more. The generated code was often close but structurally wrong in ways that took multiple rounds to fix. I was spending most of my time reacting to output instead of shaping it.
Now, I spend almost all of my time in Flow Architect mode. I generate a plan, iterate on that plan until I'm satisfied with every detail, and only then move to code generation. The result is that code generated in Flow Code mode — or the walkthrough generated in Flow Ask mode — is nearly flawless on the first pass.
This was the single biggest improvement in my entire workflow. Everything else I'll describe builds on top of it.
When a task is fully documented and follows the single responsibility principle, I feed the design document from Architect mode as initial context into a new Flow Code conversation. If there are multiple clearly defined tasks, I switch to Flow Orchestrator mode and let it manage parallel conversations. The key is that no code gets written until the plan is solid.
Model Selection
Not every task needs the most powerful model. Pinning a sensible default model to each mode has been one of the simplest ways to control cost without sacrificing quality.
My current breakdown: Opus for architecture work, where reasoning depth matters most. Sonnet for complex technical tasks that require nuance. Haiku for straightforward, well-known tasks — even high-volume code generation — which, honestly, is the majority of the work. Most coding tasks are not novel; they're well-established patterns that a lighter model handles perfectly.
Prompt Discipline
A few things I've learned about how to talk to these models effectively.
Be specific in your instructions, and tell the LLM what it should not do. This is just as important as telling it what you want. Without negative constraints, models will infer scope and often infer too broadly.
Front-load context. I load as much relevant context as I can think of in the initial prompt — files, architecture docs, prior decisions. The more the model knows upfront, the less it guesses.
Add rules incrementally, per project. I don't rely solely on what I've stored in Conport. Small, targeted rules in the .kilocode/rules directory accumulate over the life of a project and catch the specific behaviors I want to enforce or prevent. For example, every project gets a rule enforcing no emojis and requiring succinct, direct output. Recent models have defaulted to being irritatingly verbose and chatty, and this keeps them in check.
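As an illustration (the filename and exact wording here are mine, not a required format), one of these rule files can be as small as:

```
# .kilocode/rules/output-style.md

- Never use emojis in code, comments, commit messages, or chat replies.
- Keep explanations succinct and direct. No filler, no restating the request.
- Do not append celebratory summaries after finishing a task.
```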
Be explicit about file scope. I tell the model exactly which files to work on. I then watch the output as it's generated. If I see it touching other files with good justification, I let it continue. If it's drifting, I stop the work immediately and re-prompt with more context.
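Put together, a task prompt in this style looks roughly like the following (the file names are invented for the example):

```
Add retry logic with exponential backoff to the HTTP client in src/api/client.py.
Follow the error-handling conventions already used in src/api/errors.py.
Do not modify any other files, do not add new dependencies, and do not
refactor unrelated code along the way.
```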
Reviewing Generated Code
I check every single line of generated code, then evaluate it as a whole. This is non-negotiable, even for simple tasks.
What I consistently find: the code is often more verbose than necessary. It uses odd techniques the model was trained on but that don't fit the project's conventions. It's not always DRY. And it can lean on overly simplified patterns that technically work but aren't what you'd ship in production.
That said, reviewing code has also been one of the best ways to learn. The models have shown me techniques and approaches I hadn't considered before. It's a two-way street — you're checking their work, but you're also absorbing new ideas.
Debugging
For stack traces, I copy the output and paste it inside triple backticks. For UI issues involving layout, sizing, or placement, I send a screenshot — these models are vision-capable now, and a picture communicates a layout problem far more efficiently than a text description.
The screenshot workflow extends to wireframes. Even a rough hand-drawn wireframe, photographed and sent to the model, gives it enough to work with. This is one of the most underused capabilities I've seen — most people don't think to send visual input to a coding assistant.
Testing
Test strategy gets decided in Architect mode, alongside everything else. Ninety-five percent of the time, I let the LLM generate the tests. If it's something I'm uncertain about — an unfamiliar testing pattern or an edge case I want to understand deeply — I switch to Flow Ask mode and build the tests out myself, step by step.
Learning New Things
When I'm working with a new concept, tool, or language, I never let the model generate code for me. I switch to Flow Ask mode and have it walk me through implementation step by step. I write every line myself. This is slower, but it means I actually understand what I've built, and I can maintain it without the model's help later.
The Rules I Wish I'd Known Earlier
It's not Google. Don't treat it as a search engine. It's a reasoning tool that generates output based on patterns, and it needs to be directed, not queried.
Always review the code, even for simple tasks. You will always catch something. But you'll also learn something — those two things happen in roughly equal measure.
If you find yourself writing more than two prompts about the same block of code, stop and write it yourself. The model isn't going to get there, and you're burning tokens and time. Step in, write the code, and move on.
The shift from iterating on code to iterating on plans was the turning point. Once I internalized that the quality of the output is determined by the quality of the plan, everything got faster, cheaper, and more reliable.
Building Agents
Beyond using LLMs as tools, I've started building with them. As a personal project, I'm using Pydantic AI to build a D&D Dungeon Master agent that runs locally. It's been an education in a completely different set of concerns.
Model selection matters far more when you're running small local models. A model that excels at narrative generation might be terrible at structured output or tool calling. You can't just pick the "best" model — you have to pick the best model for the specific task your agent performs, and that often means testing several.
The parameters you'd normally ignore start to matter. Temperature, max tokens, and retrieval_k all have a direct, noticeable impact on output quality and consistency. Small changes in these values can be the difference between a coherent game session and nonsense.
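As a rough sketch of what that looks like in practice, assuming a recent pydantic-ai release and Ollama's OpenAI-compatible endpoint (the model name, prompt, and settings are placeholders, and import paths plus the result attribute have shifted between versions):

```python
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider
from pydantic_ai.settings import ModelSettings

# A local model served by Ollama, reached through its OpenAI-compatible API.
# The model name is an example; use whatever tested best for narration.
narrator_model = OpenAIModel(
    "mistral-nemo",
    provider=OpenAIProvider(base_url="http://localhost:11434/v1"),
)

dm_agent = Agent(
    narrator_model,
    system_prompt=(
        "You are a D&D Dungeon Master. Narrate scenes vividly, stay "
        "consistent with established facts, and end every turn by asking "
        "the players what they do."
    ),
    # Settings like these are noticeable on small local models:
    # higher temperature for livelier narration, a hard cap on length.
    model_settings=ModelSettings(temperature=0.9, max_tokens=400),
)

result = dm_agent.run_sync("We push open the crypt door. What do we see?")
print(result.output)  # older pydantic-ai versions expose this as result.data
```

A parameter like retrieval_k sits in the retrieval layer rather than the model settings, but it gets tuned the same way: change one value, replay a session, compare.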
History processing is deceptively tricky. Conversation history grows fast in a D&D session, and you have to make deliberate choices about when to summarize history versus when to truncate it outright. Summarization preserves context but costs tokens and can lose important details. Cutting history is cheaper but risks the agent forgetting something the player said three turns ago. There's no universal answer — it depends on what your agent needs to remember and what it can afford to forget.
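There's no library call that makes this decision for you. A bare-bones sketch of the kind of policy I mean, in plain Python with a summarize callable standing in for a cheap LLM summarization call, might look like:

```python
from typing import Callable

# Rough sketch of a history-budget policy, not tied to any framework.
MAX_TURNS = 40            # keep this many recent turns verbatim
SUMMARIZE_THRESHOLD = 60  # beyond this, condense instead of just cutting

def compact_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Return a history that fits the budget.

    Recent turns are always kept verbatim. Older turns are either dropped
    (cheap, lossy) or folded into a running summary (costs tokens, but
    preserves facts the players may rely on).
    """
    if len(turns) <= MAX_TURNS:
        return turns

    old, recent = turns[:-MAX_TURNS], turns[-MAX_TURNS:]

    if len(turns) > SUMMARIZE_THRESHOLD:
        # Too much to throw away safely: compress it instead.
        return [f"Summary of earlier play: {summarize(old)}"] + recent

    # Small overflow: truncating is cheaper and usually harmless.
    return recent
```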
Writing System Prompts
One skill that's grown significantly over the past two years is writing system prompts — not just for coding, but for any task where I need consistent, specialized behavior from a model.
The best example is outside of software entirely. I'm writing a book, and I've written a system prompt that instructs an LLM to act as my editor. I generate one to three chapters of content, then provide them as context in a custom Book Editor mode with that system prompt. It gives me structured feedback on weak verbs, passive voice, character inconsistencies, overly complex language that makes passages hard to read, world-building inconsistencies, and missed opportunities to build tension or develop characters.
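To give a sense of its shape, a condensed version of that kind of system prompt might read:

```
You are a developmental editor for a novel. Review the provided chapters and
return structured feedback under these headings: Weak Verbs, Passive Voice,
Character Inconsistencies, Readability, World-Building Inconsistencies, and
Missed Opportunities (tension, character development). Quote the passage you
are commenting on, explain the problem in one or two sentences, and suggest a
concrete revision. Do not rewrite whole scenes, do not praise, and do not
comment on plot direction unless it contradicts earlier chapters.
```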
This works because the system prompt is specific about what to look for and how to deliver the feedback. The same principles that make coding prompts effective — be explicit, define scope, state what not to do — apply directly to any domain. Once you've internalized how to write a good system prompt for code, you've learned a transferable skill.
Where I Am Now
Two years in, my relationship with LLM tooling has fundamentally changed. I spend the majority of my time thinking and planning, not prompting and fixing. The code generation step, which used to be where all the friction lived, is now the easiest part. The hard work — and the valuable work — happens before a single line of code is written.
The models will keep getting better. The tooling will keep evolving. But the meta-skill — knowing when to architect, when to generate, when to learn manually, and when to just write the code yourself — that's the part that compounds.