How We Test AI-Generated React Native Code at Scale

By Riya

24th Jun 2026

Last updated: 24th Jun 2026

When an LLM writes your codebase, the entire definition of "testing" changes.

Traditional software has a stable codebase. You write unit tests against known functions, integration tests against known endpoints, and end-to-end tests against known UI flows. At RapidNative, the codebase doesn't exist until a user types a prompt. Every project is a fresh React Native + Expo app generated on the fly — with screens, state, navigation, styling, and database schemas that no human has ever seen before.

So how do you test AI-generated code when you've never seen it before?

We've spent the last eighteen months building an answer. This post walks through the actual pipeline behind RapidNative — the validators, bundlers, retry loops, and alerts that turn raw LLM output into a working React Native app inside a preview iframe. None of this is theoretical. Every layer described here is in production today, processing thousands of generations a day.

Why testing AI-generated code is different

In a normal codebase, the failure modes are predictable. You wrote the function; you know its inputs; you can enumerate the edge cases. With AI-generated code, the failure surface is the entire API surface of React Native, every npm package version, every JSX nesting pattern, and every Hermes runtime quirk — multiplied by the temperature of the model.

The bugs you actually see fall into recurring buckets:

  • Hallucinated imports: import { SomeIcon } from 'lucide-react-native' when the icon doesn't exist
  • Mismatched JSX: an unclosed <View> or a stray }
  • Stale state references: a useEffect that depends on a variable defined two screens away
  • Forgotten exports: a default export missing because the model truncated mid-line
  • Drift from project conventions: importing components that the Expo template doesn't bundle

Conventional testing strategies (unit tests, snapshot tests, type checking before commit) are built for code you already have. Our problem moves faster: we need to know within one bundling cycle whether the model just shipped broken code, and then either fix it automatically or show the user a graceful error.

Our entire pipeline is designed around that real-time loop.

Layer 1: Pre-bundler validation

The first guardrail runs before a single byte hits the virtual file system (VFS) that backs each project's preview.

When the AI completes a file — isComplete=true on the streaming buffer — we hand the code to src/shared/utils/codeValidator.ts. It runs four cheap, deterministic checks:

  1. Syntax — brace matching, JSX attribute syntax, balanced parentheses
  2. JSX structure — tag nesting, fragment closure
  3. Runtime safety — variable declarations, optional chaining, list key props
  4. Security — flags eval(), innerHTML, and dangerouslySetInnerHTML

The output is a ValidationResult with { isValid, errors, warnings, score: 0-100 }. We don't block generation when the score dips below 100 — that would be too brittle, since some genuine JSX patterns trigger false positives. Instead, the score is a signal: low scores prime the downstream retry pipeline to expect a problem.
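To make the shape of this layer concrete, here is a minimal sketch of a validator that returns a ValidationResult. Only the result fields come from the description above. The function name validateCode, the scoring weights, and the deliberately naive bracket scan are assumptions for illustration, not the contents of codeValidator.ts.

```typescript
interface ValidationResult {
  isValid: boolean;
  errors: string[];
  warnings: string[];
  score: number; // 0-100
}

// Closing bracket -> the opener it must match.
const CLOSERS = new Map<string, string>([[')', '('], [']', '['], ['}', '{']]);
const RISKY_TOKENS = ['eval(', 'innerHTML', 'dangerouslySetInnerHTML'];

// Hypothetical validator: bracket balance plus a security scan.
function validateCode(code: string): ValidationResult {
  const errors: string[] = [];
  const warnings: string[] = [];
  const stack: string[] = [];

  // Naive scan: it does not skip strings or comments, which is one
  // reason a check like this can produce false positives.
  for (let i = 0; i < code.length; i++) {
    const ch = code.charAt(i);
    if (ch === '(' || ch === '[' || ch === '{') {
      stack.push(ch);
    } else if (CLOSERS.has(ch)) {
      if (stack.pop() !== CLOSERS.get(ch)) {
        errors.push(`Unmatched "${ch}" at offset ${i}`);
        break;
      }
    }
  }
  if (errors.length === 0 && stack.length > 0) {
    errors.push(`Unclosed "${stack[stack.length - 1]}"`);
  }

  for (const token of RISKY_TOKENS) {
    if (code.includes(token)) warnings.push(`Uses ${token}`);
  }

  // Assumed weights: errors cost more than warnings, floor at zero.
  const score = Math.max(0, 100 - 40 * errors.length - 10 * warnings.length);
  return { isValid: errors.length === 0, errors, warnings, score };
}
```

The score is returned alongside isValid rather than folded into it, which matches the idea that a low score is a signal for later stages and not a gate.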

For .ts and .js files that don't contain JSX, we go further. We parse the file with acorn plus the @sveltejs/acorn-typescript plugin. If parsing fails, the file is not persisted to the VFS at all. It stays in the streaming buffer and waits for the next AI iteration. This stops a half-streamed file from poisoning the bundle and confusing the user with a phantom error.
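The parse gate can be sketched as a small function that only writes to the VFS when parsing succeeds. In this sketch the parser is passed in as a function so the example stays self-contained; in practice it would wrap acorn with the TypeScript plugin. The names persistIfParsable, StreamedFile, and ParseFn are hypothetical, and the VFS is modelled as a plain Map.

```typescript
// Any function that throws on a syntax error, e.g. a wrapper around acorn.
type ParseFn = (source: string) => unknown;

interface StreamedFile {
  path: string;
  source: string;
}

// Returns true if the file was written to the VFS, false if it was held back.
function persistIfParsable(
  file: StreamedFile,
  parse: ParseFn,
  vfs: Map<string, string>,
): boolean {
  // Only plain .ts/.js files take the strict path; .tsx/.jsx are excluded here.
  const isPlainScript = /\.(ts|js)$/.test(file.path);
  if (isPlainScript) {
    try {
      parse(file.source);
    } catch {
      // Leave the file in the streaming buffer for the next AI iteration.
      return false;
    }
  }
  vfs.set(file.path, file.source);
  return true;
}
```

Returning a boolean instead of throwing keeps the caller's streaming loop simple: a false result just means "try again when more of the file has arrived".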

This layer is fast, predictable, and rejects roughly 8-12% of generations before they ever reach the bundler. Nothing exotic — just the cheapest checks running first, the way every production pipeline should.

Layer 2: Real-time bundling with graceful degradation

Most of the testing happens at bundle time. We don't run TypeScript ahead of time. We don't run ESLint on generated code. The bundler is the test.

The bundler that ships in our preview is browser-metro, an Expo-compatible incremental bundler that runs inside a Web Worker (src/modules/file/bundler.worker.ts). Every file write triggers an incremental rebuild. The worker reports the result to the main thread through watch-ready, watch-rebuild, and hmr messages.

When a file fails to compile, we do something most bundlers don't: we don't stop the build. The failing file is replaced with a BrokenComponentStub — a tiny React component that renders a visible error card at the route where the broken file would have mounted. Every other screen in the project continues to render normally. The user can keep navigating; they just see a clear "this screen broke" card in the one place that actually broke.
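A rough sketch of the stub substitution, assuming a per-file compile function that throws on failure. The names buildWithStubs and brokenComponentStubSource and the shape of the generated stub are illustrative guesses; the real BrokenComponentStub renders a richer error card.

```typescript
type CompileFn = (path: string, source: string) => string;

interface BuildResult {
  modules: Map<string, string>; // path -> compiled (or stub) module source
  failed: Map<string, string>;  // path -> error message
}

// Hypothetical stub: a tiny component that shows which file broke and why.
function brokenComponentStubSource(path: string, message: string): string {
  // JSON.stringify escapes quotes so the text is safe to embed in source.
  const text = JSON.stringify(`This screen broke: ${path}: ${message}`);
  return [
    "import React from 'react';",
    "import { Text, View } from 'react-native';",
    'export default function BrokenComponentStub() {',
    `  return React.createElement(View, null, React.createElement(Text, null, ${text}));`,
    '}',
  ].join('\n');
}

function buildWithStubs(files: Map<string, string>, compile: CompileFn): BuildResult {
  const modules = new Map<string, string>();
  const failed = new Map<string, string>();
  files.forEach((source, path) => {
    try {
      modules.set(path, compile(path, source));
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failed.set(path, message);
      // One bad file becomes one stub; every other module is untouched.
      modules.set(path, brokenComponentStubSource(path, message));
    }
  });
  return { modules, failed };
}
```

The key property is that modules always has an entry for every input path, so routing never sees a missing screen.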

The error itself is parsed in src/modules/file/bundler-error-parser.ts. We normalize the Sucrase and acorn stack into a structured object — { file, line, column, message } — and render five lines of code context around the error in the preview UI. Users don't need to read Metro's raw stderr to know what went wrong.
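As an illustration of that normalization, the sketch below assumes the raw message ends with a "(line:column)" suffix, the shape acorn uses for its syntax errors. The function name parseBuildError, the context field, and the fallback to line 1 are assumptions, not the actual parser.

```typescript
interface ParsedBuildError {
  file: string;
  line: number;      // 1-based
  column: number;    // 0-based
  message: string;
  context: string[]; // numbered source lines around the error
}

function parseBuildError(
  file: string,
  rawMessage: string,
  source: string,
  contextLines = 5,
): ParsedBuildError {
  // Example input: "Unexpected token (4:2)"
  const match = /^(.*?)\s*\((\d+):(\d+)\)\s*$/.exec(rawMessage);
  const message = match?.[1] ?? rawMessage;
  const line = Number(match?.[2] ?? 1);
  const column = Number(match?.[3] ?? 0);

  // Center the window on the failing line, clamped to the top of the file.
  const lines = source.split('\n');
  const half = Math.floor(contextLines / 2);
  const start = Math.max(0, line - 1 - half);
  const context = lines
    .slice(start, start + contextLines)
    .map((text, i) => `${start + i + 1} | ${text}`);

  return { file, line, column, message, context };
}
```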

There's a similar pattern for package failures. If a generated file imports victory-native@3.0.0 and the package fetch returns a 4xx, we stub the package with a BrokenPackageStub, flag every route file that depends on it, and surface the package and the affected routes in the chat panel. Package fetches are wrapped in a 15-second timeout so a single slow registry response can't hang the preview.
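The timeout wrapper is a standard pattern and can be sketched without any dependencies. Here fetchPackage stands in for whatever actually hits the registry, and withTimeout, loadPackage, and PackageResult are hypothetical names; only the 15-second default comes from the text above.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

type PackageResult =
  | { ok: true; code: string }
  | { ok: false; reason: string }; // caller substitutes a package stub

async function loadPackage(
  spec: string,
  fetchPackage: (spec: string) => Promise<string>,
  timeoutMs = 15000,
): Promise<PackageResult> {
  try {
    const code = await withTimeout(fetchPackage(spec), timeoutMs, spec);
    return { ok: true, code };
  } catch (err) {
    // HTTP failures and timeouts end up on the same path.
    return { ok: false, reason: err instanceof Error ? err.message : String(err) };
  }
}
```

Folding timeouts and fetch errors into one result type means the stubbing logic only has to handle a single failure case.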

The same pattern works for monorepo and flat templates. The bundler detects a rapidnative.json root prefix and flattens the FileMap so the bundler sees the right paths whether the project lives at apps/mobile/app/index.tsx or app/index.tsx.
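The flattening step might look something like the following, assuming the root prefix has already been detected and is passed in as a string. How the prefix is derived from rapidnative.json is not covered here, and flattenFileMap and the plain-object FileMap type are illustrative guesses.

```typescript
type FileMap = Record<string, string>; // path -> file contents

function flattenFileMap(files: FileMap, rootPrefix: string): FileMap {
  // Normalize so that 'apps/mobile' and 'apps/mobile/' behave the same.
  const prefix =
    rootPrefix === '' || rootPrefix.endsWith('/') ? rootPrefix : rootPrefix + '/';
  const flat: FileMap = {};
  for (const path of Object.keys(files)) {
    const content = files[path];
    if (content === undefined) continue;
    // Assumption: paths outside the prefix are kept unchanged.
    const key = path.startsWith(prefix) ? path.slice(prefix.length) : path;
    flat[key] = content;
  }
  return flat;
}
```

With a prefix of apps/mobile, the path apps/mobile/app/index.tsx becomes app/index.tsx, and a flat template with an empty prefix passes through untouched.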

This is the philosophical core of how we test: graceful degradation over hard failure. A broken file is a fact about one screen, not a verdict on the whole app. We dig deeper into the streaming side of this elsewhere in Streaming AI-Generated Code in Real Time.

Layer 3: Auto-retry loops with bounded recursion

Detecting an error is half the job. The other half is fixing it without making the user do anything.

Build errors are pushed from the bundler worker into a ProjectBundlerState.fileErrors map (src/modules/file/build-error-manager.ts), which then syncs into the Redux runtimeErrors slice. While the AI is actively streaming, we skip Redux dispatches and queue the errors — otherwise the UI flickers between partial states. After streaming completes, reloadIframesAfterAI() calls syncToRedux() once with the final error set.
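A simplified model of that queue-then-sync behavior, with the Redux dispatch replaced by a plain callback. The class name BuildErrorBuffer and its methods are hypothetical; the real logic lives across build-error-manager.ts and the Redux slice.

```typescript
interface BuildError {
  path: string;
  message: string;
}

class BuildErrorBuffer {
  private fileErrors = new Map<string, BuildError>();
  private streaming = false;
  private dispatch: (errors: BuildError[]) => void;

  constructor(dispatch: (errors: BuildError[]) => void) {
    this.dispatch = dispatch;
  }

  setStreaming(active: boolean): void {
    this.streaming = active;
  }

  // Latest error per file wins; while streaming we only record it.
  report(error: BuildError): void {
    this.fileErrors.set(error.path, error);
    if (!this.streaming) this.syncToStore();
  }

  // A file that now compiles drops out of the error set.
  clear(path: string): void {
    this.fileErrors.delete(path);
    if (!this.streaming) this.syncToStore();
  }

  // Push the full, final error set to the store in one dispatch.
  syncToStore(): void {
    this.dispatch(Array.from(this.fileErrors.values()));
  }
}
```

A caller would flip setStreaming(false) and call syncToStore() once when generation ends, so the UI sees a single update with the settled error set.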

That's the trigger for AutoRetryManager (src/modules/editor/client/managers/AutoRetryManager.ts). Its job is to detect "we just finished generating, and there are still build errors," and then automatically ask the AI to fix them. The flow:

  1. AI generation completes — lastAiGenerationTimestamp updates
  2. We wait a 2-second settling period to let all bundler errors land
  3. Each error is hashed: md5(type + '|||' + message + '|||' + path)
  4. We look up the retry count for that hash in sessionStorage
  5. If the count is less than 2, we dispatch a buildFixWithAiPrompt() with the error embedded as a chat message
  6. We increment the retry count
  7. If the count is 2, we stop — the error shows in the chat with a manual "Fix with AI" button
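The counting part of this flow can be sketched as below. Two substitutions are worth flagging: a small FNV-1a hash stands in for md5 so the example has no dependencies, and the store is an interface with getItem/setItem that sessionStorage would satisfy in a browser. The names nextAction, errorKey, and the 'retry:' key prefix are hypothetical, and the 2-second settling wait is left out.

```typescript
interface RuntimeError {
  type: string;
  message: string;
  path: string;
}

// Minimal subset of the Web Storage API.
interface CounterStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const MAX_AUTO_RETRIES = 2;

// Stand-in for md5: 32-bit FNV-1a, rendered as hex.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

function errorKey(error: RuntimeError): string {
  return 'retry:' + fnv1a([error.type, error.message, error.path].join('|||'));
}

// Decides what to do with one error and records the attempt if retrying.
function nextAction(error: RuntimeError, store: CounterStore): 'auto-retry' | 'manual' {
  const key = errorKey(error);
  const count = Number(store.getItem(key) ?? '0');
  if (count >= MAX_AUTO_RETRIES) return 'manual'; // show the "Fix with AI" button
  store.setItem(key, String(count + 1));
  return 'auto-retry'; // caller dispatches the fix prompt
}
```

Because the key is derived from the error and not the file, a fix that changes the message or moves the failure to another path starts a fresh count.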

Two retries per unique error is the cap we landed on after a lot of A/B testing. With fewer, you save tokens but push the fix back onto the user, who spends their own iterations on errors the model could have repaired. With more, the model starts going in circles, fixing one error and introducing another. The hash-keyed counter ensures we never auto-retry the same error more than twice; identical errors after manual edits are treated as new, so a user fix doesn't lock out future auto-retries.

This is the closest thing we have to a "test-fix-test" loop. The bundler is the test. The error is the assertion failure. The model is the fixer. The hash counter is the bound that keeps the loop honest.

The same buildFixWithAiPrompt plumbing handles QuickEdit patches — surgical before/after string edits that the model emits for incremental changes. We apply them sequentially, run Prettier on the result, and fall back to skipping the edit if the before block doesn't match the file exactly. The rule we enforce internally: each QuickEdit must leave the file syntactically valid. No "empty then fill" patterns where the file is broken between two patches.
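Sequential application with skip-on-mismatch might be sketched like this. The QuickEdit shape with before and after strings follows the description above; applyQuickEdits, the ApplyResult type, and the choice to replace only the first occurrence are assumptions, and the Prettier pass is omitted.

```typescript
interface QuickEdit {
  before: string; // exact text expected in the file
  after: string;  // replacement text
}

interface ApplyResult {
  content: string;
  applied: number;
  skipped: QuickEdit[];
}

function applyQuickEdits(content: string, edits: QuickEdit[]): ApplyResult {
  let current = content;
  let applied = 0;
  const skipped: QuickEdit[] = [];

  for (const edit of edits) {
    // Each edit is matched against the result of the previous ones.
    const index = edit.before === '' ? -1 : current.indexOf(edit.before);
    if (index === -1) {
      skipped.push(edit); // no exact match: leave the file as it is
      continue;
    }
    current =
      current.slice(0, index) + edit.after + current.slice(index + edit.before.length);
    applied++;
  }
  return { content: current, applied, skipped };
}
```

Using slice instead of String.prototype.replace avoids the special meaning of sequences like $& in replacement strings, which matters when the patch text is arbitrary code.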

Layer 4: Blank-screen detection and Slack alerts

Some failures don't produce a build error — they produce a white screen. A blank iframe is the worst outcome because there's nothing for the bundler to report, no stack trace to parse, no error to feed the model.

We catch these client-side. The preview iframe runs a route probe that records which routes rendered into white space (whitescreenRoutes) and which routes threw runtime errors (errorRoutes). When the union is non-empty for a sustained window after AI completion, the client posts to src/app/api/user/ai/broken-screen-alert/route.ts.

That endpoint does two things. First, it enforces a four-hour cooldown per project so a single broken project doesn't flood the alert channel. Second, it posts a structured Slack message with the project slug, the affected routes, the user's email and plan, and the billing status. Ops sees patterns across users that no individual support ticket would surface — for example, "every project on Expo SDK 50 with react-native-reanimated@4.0.0-rc.5 is blank-screening, here are seven projects from this week."
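The client-side union and the per-project cooldown are both small enough to sketch. The four-hour window and the whitescreenRoutes/errorRoutes names come from the text; the function names, the in-memory Map, and passing the clock in as a parameter are assumptions, and the real endpoint presumably persists the last-alert time somewhere durable.

```typescript
const COOLDOWN_MS = 4 * 60 * 60 * 1000; // four hours

// Union of blank and erroring routes, without duplicates.
function affectedRoutes(whitescreenRoutes: string[], errorRoutes: string[]): string[] {
  return Array.from(new Set([...whitescreenRoutes, ...errorRoutes]));
}

// Returns true (and records the time) if an alert may be sent for this project.
function shouldAlert(
  projectId: string,
  now: number,
  lastAlertAt: Map<string, number>,
): boolean {
  const last = lastAlertAt.get(projectId);
  if (last !== undefined && now - last < COOLDOWN_MS) return false;
  lastAlertAt.set(projectId, now);
  return true;
}
```

Passing now as an argument keeps the cooldown logic deterministic and easy to test without faking timers.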

This is the testing layer that catches the failures the bundler can't see. It's not a unit test. It's not even automated remediation. But it gives us a hit-list of repeatable AI failure modes that we can fix at the template, the prompt, or the model layer.

This is also where the AI safety story connects. We wrote about the upstream side of this in AI-Safe Code: How RapidNative Prevents AI Code Hallucinations. The downstream side is this Slack alert — it's how we know which hallucinations are escaping the upstream defenses, and where to harden next.

Layer 5: Durable error logs and end-to-end tests

Two more layers round out the system.

The first is durable error logging. The error_logs table in our Supabase schema stores every classified failure with the project ID, the error type (build-error, runtime-error, jsx-syntax, react-runtime, module-not-found, network-error), the file path, the line number, and a JSON metadata blob. We don't use this for retries — that's all in-memory and session-scoped. We use it for analytics. When a model rev ships, we want to know within a day whether module-not-found errors went up or down. The error log is the dataset behind that question.
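To show the kind of question this table answers, here is a sketch that counts rows by error type and compares two windows. The six error types are the ones listed above; the row field names, countByType, and typeDelta are hypothetical, and in practice this would more likely be a SQL aggregate than in-memory code.

```typescript
type ErrorType =
  | 'build-error'
  | 'runtime-error'
  | 'jsx-syntax'
  | 'react-runtime'
  | 'module-not-found'
  | 'network-error';

// Assumed row shape; the actual column names may differ.
interface ErrorLogRow {
  projectId: string;
  errorType: ErrorType;
  filePath: string;
  line: number;
  metadata: Record<string, unknown>;
}

function countByType(rows: ErrorLogRow[]): Map<ErrorType, number> {
  const counts = new Map<ErrorType, number>();
  for (const row of rows) {
    counts.set(row.errorType, (counts.get(row.errorType) ?? 0) + 1);
  }
  return counts;
}

// Positive result: the error type became more frequent after the change.
function typeDelta(before: ErrorLogRow[], after: ErrorLogRow[], type: ErrorType): number {
  return (countByType(after).get(type) ?? 0) - (countByType(before).get(type) ?? 0);
}
```

Raw counts only make sense when both windows cover a similar number of generations; otherwise the comparison should be normalized to a rate.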

The second is Playwright. Our E2E suite (/tests/e2e/project-creation.spec.ts, playwright.config.ts) runs the full RapidNative flow against localhost:3000: it signs in with a pre-saved storage state, posts a real prompt, waits for the animate-pulse indicator to disappear, and then asserts that the preview iframe has a non-blank src. Each test has a two-minute budget; the suite runs on every PR, and on CI each test gets two retries to absorb flakes.

The E2E tests aren't testing the AI's output. They're testing the testing pipeline. They confirm that prompt → generation → bundling → preview → assertions still works end-to-end after every code change. When we touch the bundler or the retry manager, the Playwright suite catches the case where we accidentally broke the loop and started showing white screens to real users.

We also bolt Sentry onto both the API and the client. Every uncaught error in a streaming endpoint gets captured with the project ID; every unhandled client exception is tagged with the same. Sentry isn't part of the auto-retry loop — by the time a Sentry event fires, the user has already seen the error — but it's the safety net that catches everything else.

What we deliberately don't test

A few things you might expect in this stack — and that we explicitly chose not to do:

  • TypeScript checks ahead of bundling. We tried it. The latency cost of running tsc on every file save dwarfed the value of catching errors a few hundred milliseconds earlier than the bundler. The bundler already catches the structural errors; types are a downstream consideration.
  • ESLint on generated code. Generated React Native code from a model trained on the open-source corpus is going to violate someone's preferred style. Linting noise doesn't help users build apps. We let Prettier format QuickEdit patches after they're applied and stop there.
  • Unit tests for generated components. A user who just typed "build me a fitness tracker" does not want to also receive twelve .test.tsx files. Auto-generating tests for AI-generated code is a fascinating research problem and a real product anti-feature.
  • Full AST-level semantic checks. We have @babel/parser as a dependency. We don't use it in the hot path. The cost of a full AST round-trip per file save isn't worth the marginal improvement over what acorn and the bundler already catch.

Every "test we don't run" is a deliberate choice to keep the feedback loop tight. Latency is a quality signal too.

This decision tree is also why we chose a four-step generation pipeline upstream — we wrote about it in Why RapidNative Runs a 4-Step LLM Pipeline. Putting expensive checks at generation time means the testing pipeline downstream can stay lean.

Frequently asked questions

How do you test AI-generated code that spans multiple files?

The bundler reports per-file errors, but the error message usually references the missing identifier and where it was expected. When the AutoRetryManager dispatches a fix prompt, it includes the file path of the erroring file plus the message. The model has enough context — from the conversation history and the project context gathered in the upstream generation step — to find the upstream source and edit both files in the same patch.

What happens if the auto-retry itself produces an error?

The retry counter is keyed by the error hash, not by the file. If the retry produces a new error with a different hash, we start counting from zero on the new error. If the retry produces the same error, the counter increments and eventually caps at two. We monitor both cases in error_logs and use the data to tune the system prompt for future generations.

Do you test AI-generated code on real devices?

The bundler runs in the browser, so what users see in the preview is what a Hermes runtime running the same JavaScript would see — minus the native module surface. For native-only paths (camera, geolocation, in-app purchases) we lean on Expo Go and EAS Build. You can scan a QR code from the editor to load any project on a real device via Expo, which is the fastest path to validation for native APIs. We compare these surfaces in Image to App and Whiteboard to App flows.

How does this compare to dedicated AI testing tools?

Tools like TestSprite or Panto focus on generating tests for AI-written code — Jest specs, Detox flows, Maestro scripts. That's a complementary problem. Our pipeline is upstream: we're catching errors before the user even sees a test failure. The two approaches stack — once you've shipped your app, you still want a real test suite for the long-running maintenance loop. We're solving the inner loop; those tools solve the outer one.

The shape of testing at scale

Testing AI-generated code isn't testing in the classical sense. There are no fixtures, no golden outputs, no asserting against a spec. The spec is the user's prompt, and we won't see it until generation time.

What works at scale is a layered system: cheap validation up front, real-time bundling with graceful degradation, bounded auto-retry loops, observability into the failures that escape, and durable logs for analytics. None of those layers alone is enough. Together they let us ship AI-generated React Native apps with the same confidence a traditional team ships hand-written code — and at a fraction of the iteration time.

If you've been waiting for AI app builders to be production-grade, this is what production-grade actually looks like under the hood. Start building on RapidNative and watch the pipeline run in real time — or see how teams use it to ship apps in days instead of months. More engineering deep-dives are on our blog.
