Why RapidNative Runs a 4-Step LLM Pipeline to Generate React Native Code
No hallucinations. Identical output every time.
By Suraj Ahmed
24th May 2026
Last updated: 23rd May 2026
The cleanest mental model of an AI code generator is one model, one prompt, one file. Send the user's request to Claude or GPT-4, take whatever comes back, render it. It works in a demo. It breaks the moment a real user types "add a profile screen that pulls the logged-in user from the database" into a project that already has thirty files, an auth stack, and a half-finished settings tab.
A single model now has to do four jobs at once: figure out which files to read, decide whether auth is in play, plan a database migration, and write production-quality React Native code that compiles on the first try. Asking one LLM to do all of that means you're either burning tokens on a frontier model for tasks a small model could nail, or you're starving the actual code-generation step of context. Neither produces an app worth shipping.
This is why the LLM pipeline for code generation inside RapidNative is split into four distinct steps, each routed to a different model chosen for the specific job. Below is a walkthrough of how that pipeline actually works in production, drawn from the codebase that powers it.
Modern AI code generation isn't a single model call — it's an orchestrated pipeline. Photo by David Pupăză on Unsplash
The 4-step LLM pipeline at a glance
Every chat message from a user routes through src/app/api/user/ai/generate-v2/route.ts. Inside that route, the request flows through four ordered steps:
- Context gathering: a fast, cheaper model equipped with file-reading tools figures out what the user is asking for and which existing files matter.
- Auth and screen-limit detection: semantic signals (AUTH: yes, NEW_SCREEN: yes) parsed out of Step 1's output decide whether the next step needs to scaffold authentication, and whether a free-tier user is about to hit their screen cap.
- Deterministic generation tools: pure code (no LLM) generates database schemas, migrations, and auth screens from a structured JSON description.
- Final code generation: the strongest model writes the actual React Native screens, with no tools attached, given all the context the first three steps gathered.
Each step that calls a model uses one selected for that step's specific tradeoff between speed, cost, and capability. The model assignments live in a Supabase table (ai_model_config), cached for five minutes, so the team can swap models without redeploying.
Here's why that split matters — and how each step works under the hood.
Step 1: Context gathering with a fast, tool-equipped model
When a prompt arrives, the very first thing the system needs is grounding. The user wrote "add a profile screen that shows the current user's stats." That sentence is meaningless without knowing:
- What files already exist in the project
- Whether there's already a users table in the database
- Whether the app has authentication wired up
- Whether the user means a brand-new screen or an edit to an existing one
A frontier model could answer all of this, but you'd waste the expensive context window on file reads and grep results. Instead, Step 1 runs on a smaller, cheaper model — typically something in the Llama 3.1 8B or Qwen-3 Coder tier on OpenRouter — with a tight, tool-equipped prompt.
The tools available in Step 1 are defined in src/modules/api/services/ai/providers/ToolsProvider.ts:
- get_files_content({ paths }): reads files (with optional line ranges) from the project's virtual file system
- list_dir(): lists every file in the project
- glob({ pattern }): finds files by pattern
- batch_grep({ pattern }): regex search across the project
- get_images_by_keywords({ keywords }): pulls image URLs from the project asset library
- list_skills, search_skills, read_skills: dynamic knowledge injection
Some models get a reduced tool set. src/modules/api/services/ai/llm/models.ts declares a MODEL_TOOLS map that disables batch_grep for models known to misuse regex — Qwen 3 Coder and Llama 3.3 70B both fall into this category.
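As a rough illustration of how a per-model tool restriction like this could work, here is a minimal sketch. Only the tool names and the fact that batch_grep is withheld from some models come from the text above; the map shape, the placeholder model IDs, and the toolsForModel helper are assumptions made for the example, not the contents of models.ts.

```typescript
// Tool names listed for Step 1 in the article.
const ALL_TOOLS = [
  "get_files_content",
  "list_dir",
  "glob",
  "batch_grep",
  "get_images_by_keywords",
  "list_skills",
  "search_skills",
  "read_skills",
] as const;

type ToolName = (typeof ALL_TOOLS)[number];

// Hypothetical shape: model ID -> tools withheld from that model.
// The model ID strings are placeholders, not real config values.
const DISABLED_TOOLS: Record<string, readonly ToolName[]> = {
  "example/qwen-3-coder": ["batch_grep"],
  "example/llama-3.3-70b": ["batch_grep"],
};

// Return the tools a given model is allowed to call.
// Models without an entry get the full tool set.
function toolsForModel(modelId: string): ToolName[] {
  const disabled = new Set<ToolName>(DISABLED_TOOLS[modelId] ?? []);
  return ALL_TOOLS.filter((tool) => !disabled.has(tool));
}
```

Expressing the restriction as data rather than as branches in the prompt code keeps the per-model quirks in one place.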
The Step 1 prompt is intentionally slim — around 3K tokens — and the model is instructed to output two structured signals inline before any prose: AUTH: yes|no and (for free-tier users near their screen limit) NEW_SCREEN: yes|no. Those signals get parsed out and fed to Step 2.
This step streams its tool calls and text deltas back to the client over SSE, so the user sees the "thinking" phase happen in real time. The code in generate-v2/route.ts uses streamText() from the Vercel AI SDK and pulls each chunk via for await (const chunk of step1Result.fullStream), dispatching text-delta, tool-call, and tool-result events as they arrive.
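To make the SSE side of this concrete, the sketch below formats one stream chunk as a Server-Sent Events frame. The chunk types are simplified stand-ins for the three event kinds named above (the real Vercel AI SDK chunks carry more fields), and using the chunk type as the SSE event name is an assumption about the wire format, not something the route is confirmed to do.

```typescript
// Simplified stand-ins for the chunks named in the text.
type StreamChunk =
  | { type: "text-delta"; textDelta: string }
  | { type: "tool-call"; toolName: string; args: unknown }
  | { type: "tool-result"; toolName: string; result: unknown };

// Format one chunk as an SSE frame: an "event:" line, a "data:" line
// holding the JSON payload, then a blank line that ends the frame.
function toSseFrame(chunk: StreamChunk): string {
  const { type, ...payload } = chunk;
  return `event: ${type}\ndata: ${JSON.stringify(payload)}\n\n`;
}
```

In a route handler, each frame would be written to the response stream from inside the for await loop as chunks arrive.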
Step 2: Cheap semantic gating before expensive work
Step 2 isn't a model call at all — it's a parser on top of Step 1's output. But it's worth naming as a distinct step because it's where the pipeline saves the most money.
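A minimal sketch of what such a parser could look like follows. The signal names come from the article; the function names, the result shape, and the choice to treat a missing AUTH signal as "no" are assumptions for illustration.

```typescript
interface Step1Signals {
  auth: boolean;
  // null when the signal was not emitted (for example, the user is
  // not near the screen limit, so NEW_SCREEN was never requested).
  newScreen: boolean | null;
}

// Find the first "<NAME>: yes|no" occurrence, case-insensitively.
function parseSignal(text: string, name: "AUTH" | "NEW_SCREEN"): boolean | null {
  const match = new RegExp(`\\b${name}:\\s*(yes|no)\\b`, "i").exec(text);
  if (!match) return null;
  return match[1]?.toLowerCase() === "yes";
}

function parseStep1Signals(text: string): Step1Signals {
  return {
    // Assumption: a missing AUTH signal is treated as "no".
    auth: parseSignal(text, "AUTH") ?? false,
    newScreen: parseSignal(text, "NEW_SCREEN"),
  };
}
```

Because the parser is plain string matching, it adds no model cost and behaves the same on every run.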
Two gates run here:
The auth gate. If AUTH: yes appeared in Step 1's output, Step 3 will scaffold a (auth)/_layout.tsx, a sign-in screen, and a sign-up screen — and the root layout will get patched to guard protected routes. If AUTH: no, none of that runs.
The screen-limit gate. Free-tier users get five screens per project. Before kicking off Step 4 (the expensive one), the route counts the existing screens in the project's app/ directory and, if the user is already at the limit, checks NEW_SCREEN: yes from Step 1. If both conditions are true, the route emits an SSE screen_limit_exceeded event and bails out — before burning any tokens on the main generation model.
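The screen-limit gate can be sketched as a pure function. The five-screen limit and the screen_limit_exceeded event name come from the text; the input and result shapes are hypothetical.

```typescript
const FREE_TIER_SCREEN_LIMIT = 5; // from the article: five screens per project

interface GateInput {
  isFreeTier: boolean;
  existingScreenCount: number; // screens already present under app/
  newScreen: boolean | null; // parsed NEW_SCREEN signal, null if absent
}

type GateResult =
  | { allowed: true }
  | { allowed: false; event: "screen_limit_exceeded" };

// Block only when all three conditions hold: free tier, already at the
// limit, and Step 1 said the request needs a brand-new screen.
function checkScreenLimit(input: GateInput): GateResult {
  const atLimit = input.existingScreenCount >= FREE_TIER_SCREEN_LIMIT;
  if (input.isFreeTier && atLimit && input.newScreen === true) {
    return { allowed: false, event: "screen_limit_exceeded" };
  }
  return { allowed: true };
}
```

Edits to existing screens pass through even at the limit, since they do not set NEW_SCREEN: yes.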
This kind of cheap, deterministic gate is the single biggest argument for splitting a code-generation pipeline across steps. A monolithic prompt to a frontier model would only discover the screen-limit problem after spending several thousand output tokens drafting a screen the user is never allowed to see.
A real AI code generator has to decide what to build before deciding how to build it. Photo by Rodion Kutsaiev on Unsplash
Step 3: Deterministic tools generate the boring (and risky) parts
Database schemas and auth boilerplate share a property that makes them dangerous to leave to an LLM: they're highly structured, and getting them subtly wrong breaks the app silently.
So Step 3 doesn't ask an LLM to write them. Instead, Step 1 has already (when relevant) emitted a structured JSON description like:
{
  "tables": [
    {
      "name": "posts",
      "columns": [
        { "name": "title", "type": "string", "required": true },
        { "name": "likes", "type": "integer" }
      ],
      "relationships": [
        { "name": "author", "type": "belongsTo", "target": "users", "foreignKey": "user_id" }
      ]
    }
  ]
}
Pure TypeScript code in tools/project-templates/fullstack/ai/tools/database.ts consumes that JSON and emits the actual files:
- src/db/schema.ts in the project's DB client format
- src/db/seeds/{table}.ts seed files
- pocketbase/migrations/{timestamp}_create_{table}.js migration files
- pocketbase/seeds/{table}.mjs PocketBase seed files
For follow-up messages, the tool parses the existing schema.ts, merges the AI's delta into it, and preserves unchanged tables — so adding a comments table to an app that already has posts and users doesn't accidentally drop the other two.
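The merge behavior can be illustrated with a small sketch. It assumes the delta replaces or adds whole tables by name; the real tool may merge at a finer granularity (for example per column), and the Table type and mergeTables helper here are hypothetical.

```typescript
interface Column {
  name: string;
  type: string;
  required?: boolean;
}

interface Table {
  name: string;
  columns: Column[];
}

// Merge a delta into the existing tables: tables named in the delta are
// replaced or appended, and every other existing table is kept as-is.
function mergeTables(existing: Table[], delta: Table[]): Table[] {
  const byName = new Map<string, Table>();
  for (const table of existing) byName.set(table.name, table);
  for (const table of delta) byName.set(table.name, table);
  // Map preserves insertion order, so existing tables keep their position.
  return Array.from(byName.values());
}
```

Merging a delta that contains only comments into a schema with posts and users yields all three tables, which is the property the paragraph above describes.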
The auth tool in tools/project-templates/fullstack/ai/tools/auth.ts works the same way: if Step 1 said AUTH: yes, the tool emits the standard sign-in/sign-up screens and the route guard. No LLM, no hallucination risk, identical output every time.
Letting deterministic code own the structured parts means Step 4 — the only step where you really want a creative model — never has to think about migrations or auth boilerplate. It just writes screens.
Step 4: The main code generation with no tools and full context
By the time Step 4 starts, every piece of context the model needs has been packed into the prompt:
- The full file content gathered in Step 1, formatted as --- FILE.tsx ---\n[content] blocks
- The image URLs the user referenced (if any), already validated
- The skills the model should follow — design patterns, layout docs, TanStack Query syntax, pre-output checklists — pulled from a compiled skill registry
- The current file the user has open (for point-and-edit prompts)
- A list of every file in the project, capped at 100 entries
- The selected code range, if the user used point-and-edit to highlight a region
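Two of the items above, the file blocks and the capped file list, are simple enough to sketch. The block format and the 100-entry cap come from the list; the helper names, the ProjectFile type, and the blank line between blocks are assumptions.

```typescript
interface ProjectFile {
  path: string;
  content: string;
}

const MAX_FILE_LIST_ENTRIES = 100; // cap stated in the article

// Render gathered files as "--- path ---" headers followed by content.
// Separating blocks with a blank line is an assumption.
function formatFileBlocks(files: ProjectFile[]): string {
  return files.map((f) => `--- ${f.path} ---\n${f.content}`).join("\n\n");
}

// Render the project file list, one path per line, truncated to the cap.
function formatFileList(paths: string[]): string {
  return paths.slice(0, MAX_FILE_LIST_ENTRIES).join("\n");
}
```

Capping the list keeps the prompt size bounded on large projects, at the cost of hiding files past the cutoff.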
The system prompt for Step 4 is set explicitly to instruct the model that it has no tools — that's a critical change from Step 1. Tool use during code generation is a footgun: the model burns tokens deliberating about which file to read instead of writing the screen the user asked for, and maxSteps can run out before it finishes the actual code. Step 4 is pure text-out.
This is where the strongest model in the lineup runs. In production, that's typically Claude Sonnet 4.5 for projects that need design polish, with GLM-5 as a cost-effective alternative for simpler asks. The model is selected by getAIModelAsync('MAIN_GENERATION'), which reads from the cached ai_model_config map and instantiates the right provider — OpenRouter, Bedrock, Vertex, Azure, Anthropic direct, or OpenAI.
The output streams straight to the user's editor. Code blocks are rendered into files in real time using the streaming live preview that powers RapidNative's instant feedback loop — covered in more depth in Under the Hood: How RapidNative Streams AI-Generated Components in Real Time.
The model router: configuration-driven, not hard-coded
A pipeline that uses three different models is only useful if you can swap those models without shipping a new build of the app. The model router lives in src/modules/api/services/ai/llm/index.ts and exposes a small set of async helpers keyed by purpose, not by model:
const [modelToUse, modelId, providerOptions, cacheType] = await Promise.all([
  getAIModelAsync(modelPurpose),
  getAIModelIdAsync(modelPurpose),
  getProviderOptionsAsync(modelPurpose),
  getPromptCacheTypeAsync(modelPurpose),
]);
Three purposes are defined: MAIN_GENERATION, VISION (used when a user uploads a screenshot or sketch), and CONTEXT_GATHERING. Each purpose maps, in the database, to a (provider, modelId) pair plus any provider-specific options.
The router supports six providers via the Vercel AI SDK ecosystem — versions pinned in package.json:
- @openrouter/ai-sdk-provider@^0.7.3: unified access to OpenRouter's catalog (primary in production for cost flexibility)
- @ai-sdk/anthropic@^1.2.12: direct Anthropic API
- @ai-sdk/amazon-bedrock@^1.1.6: Claude via AWS Bedrock
- @ai-sdk/google-vertex@^1.0.4: Gemini and Claude via Google Cloud
- @ai-sdk/azure@^1.3.25: OpenAI and Claude via Azure
- @fal-ai/client@^1.9.4: Flux Schnell for app icon generation (not an LLM, but lives in the same orchestration layer)
The active config is wrapped in unstable_cache() with a 5-minute TTL and a revalidateTag('ai-model-configs') hook so a config edit takes effect within seconds. The team can route CONTEXT_GATHERING to a Vertex Gemini Flash one day and a Llama 3.1 8B on OpenRouter the next without a code change.
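The combination of a TTL and an explicit invalidation hook can be sketched generically. This is a simplified, synchronous stand-in written for illustration, not Next.js's unstable_cache or the router's actual code; createTtlCache and its injectable clock are hypothetical.

```typescript
// A tiny time-based cache with manual invalidation. The clock is
// injectable so the expiry behavior can be exercised without waiting.
function createTtlCache<T>(
  load: () => T,
  ttlMs: number,
  now: () => number = Date.now,
) {
  let cached: { value: T; expiresAt: number } | null = null;
  return {
    // Return the cached value, reloading it if missing or expired.
    get(): T {
      const t = now();
      if (cached === null || t >= cached.expiresAt) {
        cached = { value: load(), expiresAt: t + ttlMs };
      }
      return cached.value;
    },
    // Analogous to revalidating a tag: force the next get() to reload.
    invalidate(): void {
      cached = null;
    },
  };
}
```

The TTL bounds how stale a config can get if nothing else happens, while the invalidation hook lets an edit propagate immediately instead of waiting out the window.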
Prompt caching: a quiet 90% cost saver on follow-up messages
Most multi-turn AI products waste an enormous amount of money re-sending the same system prompt with every message. RapidNative's pipeline opts into Anthropic prompt caching when the routed model supports it.
The detection lives in the router:
export async function getPromptCacheTypeAsync(purpose) {
  // Excerpt: providerType and modelId are resolved from the model config for `purpose` (lookup not shown).
  if (providerType === 'azure' && modelId.includes('claude')) return 'anthropic';
  if (providerType === 'openrouter' && modelId.includes('anthropic')) return 'anthropic';
  if (providerType === 'anthropic') return 'anthropic';
  return null;
}
When the cache type is 'anthropic', the Step 4 messages get an ephemeral cache marker attached to the system message:
(codeGenMessages[0] as any).experimental_providerMetadata = {
  anthropic: { cacheControl: { type: 'ephemeral' } },
};
The system prompt — around 15K tokens on the first message of a project (with the full first-message skill bundle) and ~9K on follow-ups — gets cached on Anthropic's side. Subsequent messages within the cache window hit the cache for the static portion and only pay full price for the dynamic context (gathered file content, the user's latest message, recent chat history).
Over a project with twenty back-and-forth messages, that means the static system prompt on every turn after the first (roughly 9K tokens, per the figures above) is billed at about 10% of the normal input price. That matters when Claude Sonnet input tokens cost roughly 30x what the context-gathering Llama model's do, going by the pricing table in the next section.
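The arithmetic behind that saving can be worked through in a few lines. The multipliers below are assumptions based on Anthropic's published pricing for the default cache (writes at about 1.25x the base input price, reads at about 0.1x) and should be checked against current pricing. The sketch also assumes every turn lands inside the cache window and that there is at least one turn.

```typescript
// Assumed multipliers relative to the base input price.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Cost of sending the same static prompt on every turn, with and
// without caching. Assumes turns >= 1.
function staticPromptCost(
  promptTokens: number,
  turns: number,
  inputPricePer1K: number,
): { uncached: number; cached: number } {
  const perTurn = (promptTokens / 1000) * inputPricePer1K;
  const uncached = perTurn * turns;
  // The first turn writes the cache; every later turn reads it.
  const cached =
    perTurn * CACHE_WRITE_MULTIPLIER +
    perTurn * CACHE_READ_MULTIPLIER * (turns - 1);
  return { uncached, cached };
}
```

With a 15K-token prompt, 20 turns, and $0.003 per 1K input tokens, this gives about $0.90 uncached versus about $0.14 cached for the static portion alone.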
Prompt caching only pays off when the static portion of the prompt is large enough to amortize the cache write. Photo by Carlos Muza on Unsplash
The cost math, in concrete numbers
Pricing data for the pipeline lives in src/modules/api/services/ai/llm/models.ts as a PROVIDER_PRICING map, broken down by provider and purpose. A representative production configuration looks roughly like:
| Step | Purpose | Typical model | Input $/1K | Output $/1K |
|---|---|---|---|---|
| 1 | CONTEXT_GATHERING | Llama 3.1 8B / Qwen 3 Coder | ~$0.0001 | ~$0.0003 |
| 1 (vision) | VISION | Claude 3.5 Haiku | ~$0.001 | ~$0.005 |
| 4 | MAIN_GENERATION | Claude Sonnet 4.5 | ~$0.003 | ~$0.015 |
A typical chat turn on an existing project might burn ~3,000 input + 800 output tokens in Step 1 (call it ~$0.0006) and ~12,000 input + 2,500 output tokens in Step 4 (call it ~$0.074). If Step 1 ran on the same model as Step 4, a common architecture in simpler builders, that ~$0.0006 step would balloon to ~$0.021 at the same token counts (3 × $0.003 + 0.8 × $0.015), adding roughly 28% to the cost of every message for almost no quality gain.
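The per-call figures can be reproduced from the table's approximate prices and the token counts quoted above. The Price type and callCost helper are hypothetical names for this sketch; the real PROVIDER_PRICING map may be structured differently.

```typescript
interface Price {
  inputPer1K: number; // USD per 1K input tokens
  outputPer1K: number; // USD per 1K output tokens
}

// Approximate prices taken from the table above.
const LLAMA_TIER: Price = { inputPer1K: 0.0001, outputPer1K: 0.0003 };
const SONNET: Price = { inputPer1K: 0.003, outputPer1K: 0.015 };

function callCost(price: Price, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1000) * price.inputPer1K +
    (outputTokens / 1000) * price.outputPer1K
  );
}

const step1OnSmallModel = callCost(LLAMA_TIER, 3000, 800); // about 0.00054
const step1OnLargeModel = callCost(SONNET, 3000, 800); // about 0.021
const step4Cost = callCost(SONNET, 12000, 2500); // about 0.0735
```

At these numbers Step 4 dominates the turn, and moving Step 1 onto the large model multiplies that step's cost by roughly 39 while leaving Step 4 unchanged.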
This is the unsexy reason multi-model pipelines win. The capability ceiling of the best model isn't what makes the product work; the routing decisions around it are.
Beyond the main pipeline: background AI tasks
Three other AI workloads run alongside the main pipeline, each with its own model selected for the job:
- App icon generation runs in parallel with Step 1 on the very first message of a project, via FAL's Flux Schnell endpoint. Three icon styles (flat modern, gradient glossy, 3D rendered) generate concurrently and land in the project's asset bucket. Code lives in src/modules/api/services/ai/IconGeneratorService.ts.
- Sentiment analysis runs after every user message as a fire-and-forget background task in SentimentService.ts. It uses Llama 3.1 8B, the cheapest classifier in the lineup, to score the user's emotional state on a -1.0 to 1.0 scale, persisted as a rolling average per (user, project) pair. Slack alerts fire on outliers.
- Free public tools under /api/tools/* (idea generators, color palette generators, etc.) call generateObject() with a Zod schema against Claude 3.5 Haiku via a separate OpenRouter API key, so usage on the marketing surface area never contends for quota with the in-app generation pipeline.
Each of these could in principle have been jammed into the main pipeline. Keeping them separate, with their own model selections, is what lets the team tune any one of them independently — bumping the icon model to a higher-fidelity Flux variant doesn't require redeploying anything that touches chat.
What this architecture actually buys you
A four-step LLM pipeline with per-step model selection looks like over-engineering on a whiteboard. Four steps instead of one model call. Three system prompts to maintain. A database table just to hold model assignments.
In practice, every one of those costs has paid for itself:
- Cost. Routing context gathering to a $0.0003-per-1K-output model instead of a $0.015 model is a 50x discount on a step that runs every single turn.
- Latency. A small model finishes its tool calls in 2–4 seconds; a frontier model with the same tools would take 8–15. The user sees code start streaming faster.
- Reliability. Deterministic code, not an LLM, owns database schemas and auth screens. Those files don't hallucinate. They're identical on the 1st generation and the 100th.
- Quality. Step 4's model only has to do one thing, write a screen, with all the context already prepared and no tool-call deliberation eating its maxSteps budget. The output quality goes up because the cognitive load goes down.
- Flexibility. Swapping the MAIN_GENERATION model to a newly released frontier model is a single database row update, taking effect within five minutes.
This is the architecture that turns a prompt like "build a fitness app with a leaderboard and Stripe payments" into a working Expo project you can scan with your phone and use. The full breakdown of what happens after the LLMs finish — the browser bundler, the live preview, the export pipeline — lives in Inside RapidNative's Export Pipeline and How We Built Team Collaboration into an AI App Builder.
People also ask
Why use multiple LLMs instead of one strong model?
Different stages of code generation have wildly different cost/capability needs. Context gathering needs speed and tool use; final code generation needs reasoning and design judgment. Routing each stage to a model sized for that stage cuts the cost of a typical turn by roughly a fifth at the prices and token counts quoted above, without measurably lowering output quality, because the strong model only spends its budget on the work that actually needs it.
What is the Vercel AI SDK used for in this pipeline?
The ai package (v4.3.19 in production) is the unified streaming and tool-use layer across all six providers. streamText() powers both Step 1's tool-equipped context gathering and Step 4's pure-text code generation. Provider-specific quirks — Anthropic prompt caching, Bedrock region routing, Azure endpoint config — are isolated behind a thin factory layer so the pipeline code itself is provider-agnostic.
How does prompt caching reduce cost on multi-turn AI products?
Anthropic prompt caching lets you mark static parts of the prompt (system instructions, design guidelines, code style rules) as cacheable. After the first call writes the cache, subsequent calls within the cache window read the cached tokens at roughly 10% of the normal input price. For a 15K-token system prompt over a 20-turn conversation, that's the difference between paying full price for all 300K tokens and paying for one 15K-token cache write plus 285K discounted cache-read tokens.
Try the pipeline yourself
The 4-step LLM pipeline isn't visible from the chat UI — it just looks like fast, accurate React Native code appearing on the screen. But every prompt you send to RapidNative is routing through it. You can try it now at rapidnative.com — start from a text prompt, upload a whiteboard sketch, feed it a screenshot, or paste in a PRD. The first prompt is free; pricing only kicks in if you want to keep building beyond the free tier.
If you want a side-by-side comparison with the previous generation of this architecture, the two-step pipeline and browser bundler post covers the version that ran in production before the four-step pipeline replaced it.