Inside the Vision Prompt: How RapidNative Reads a Screenshot to Generate React Native Code

By Riya

8th Jun 2026

Last updated: 8th Jun 2026

Most "screenshot to code" demos cheat. They re-render a single screen as static JSX with hard-coded values, slap a wrapper around it, and call that an "app." Try uploading a fuzzy phone photo of a competitor's checkout flow and the illusion breaks.

Turning a UI screenshot into a working mobile app is not really a code-generation problem. It's a reading problem. The model has to figure out what kind of image it's looking at (a polished mockup? a hand-drawn sketch? a screenshot of a live app?), what to copy literally vs. reinterpret, where assets belong, which theme to use, and how the new screen fits into the rest of the project. That reading happens in the vision prompt — the system prompt that frames a multimodal request to the LLM — long before any code starts streaming.

This post walks through RapidNative's image-to-app pipeline from upload to running app, with a focus on the prompt-engineering layer that turns raw pixels into a useful design brief. If you've ever wondered why a "convert this screenshot" workflow needs more than a vision-capable model, the answer lives here.

[Image: A developer reviewing UI screenshots on a laptop screen]
Screenshot-to-code is a reading problem before it's a code problem. Photo by Galymzhan Abdugalimov on Unsplash.

What "screenshot to React Native code" actually requires

Before walking through the pipeline, it's worth being explicit about what a useful screenshot-to-code feature has to solve.

A naive approach feeds the image to a vision model with a prompt like "generate React Native code for this UI" and prints whatever comes back. That works for a single isolated screen and almost nothing else. A production-grade pipeline has to handle four problems at once:

  1. Image classification. Is this a polished UI mockup the user wants reproduced faithfully, a wireframe they want interpreted into a styled design, or a literal asset (a logo, a hero image) they want embedded in the app?
  2. Project context. The new screen has to match the existing theme, typography, navigation, and component conventions of the surrounding project — not invent its own.
  3. Asset handling. If the image is meant to ship as an asset (a logo, an illustration), it needs to be stored, registered with the bundler, and referenced via require() rather than a remote URL.
  4. Multi-step generation. Vision-capable models are expensive. Calling one for every step of context-gathering, file-reading, schema design, and final code emission burns money and latency. The pipeline has to use vision only where it earns its cost.

RapidNative's image-to-app feature solves these with a four-step LLM pipeline, an intent classifier, and a system prompt that does a surprising amount of the heavy lifting. Let's open it up.

The 60-second data flow

Here's what happens between dropping a PNG into the chat box and seeing a working app on your phone:

  1. Client capture. You paste, drop, or upload an image into the prompt composer on the marketing page or inside an existing project. The file is stored in IndexedDB via idb-keyval for resilience against tab reloads, and a local blob URL is shown immediately as an optimistic preview.
  2. Compression and upload. The client compresses the image to keep payload size sane, then POSTs multipart/form-data to /api/upload. The route sanitizes the filename, detects the MIME type, handles collisions with a nanoid suffix, and writes the file to Supabase Storage under projects/{projectId}/fs/assets/. The image gets a public URL and a row in the project_assets table.
  3. Generate request. A second POST hits the generation endpoint (/api/user/ai/generate-v2) with the user's prompt, the project ID, the image URL, and the image filename.
  4. Four-step pipeline. The server runs context gathering → auth detection → optional schema generation → code generation. Only the first and last steps see the image (more on why in a moment).
  5. Streaming response. The model emits XML-tagged code blocks (<CodeProject>, <QuickEdit>, <ProjectMetadata>) over Server-Sent Events.
  6. File materialization. The client parses the stream, writes files into the project_files table, and the in-browser Expo Router preview hot-reloads. You scan a QR code to open the app on your real device.

The whole loop typically takes under a minute for a single-screen request. The interesting part, and the part most "screenshot to code" tools skip, is steps 3 and 4 of this list: the generate request and the four-step pipeline it triggers.
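The filename handling in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the actual /api/upload route: the function names and the exact sanitization rules are hypothetical, and randomSuffix is a dependency-free stand-in for the nanoid suffix mentioned above.

```typescript
// Hypothetical sketch of the filename handling described in step 2.
// The real route is not shown in this post; names and rules here are assumptions.

/** Keep a conservative character set and preserve the (lower-cased) extension. */
function sanitizeFilename(raw: string): string {
  // Drop any directory components a client might send.
  const base = raw.split(/[\\/]/).pop() ?? "";
  const dot = base.lastIndexOf(".");
  const stem = dot > 0 ? base.slice(0, dot) : base;
  const ext = dot > 0 ? base.slice(dot + 1) : "";
  const clean = (s: string) =>
    s.replace(/[^a-zA-Z0-9_-]+/g, "-").replace(/^-+|-+$/g, "");
  const safeStem = clean(stem) || "image";
  const safeExt = clean(ext).toLowerCase();
  return safeExt ? `${safeStem}.${safeExt}` : safeStem;
}

/** Stand-in for nanoid: a short random base-36 string. */
function randomSuffix(length = 6): string {
  let out = "";
  while (out.length < length) out += Math.random().toString(36).slice(2);
  return out.slice(0, length);
}

/** Append a suffix before the extension until the name is unused. */
function resolveCollision(
  name: string,
  existing: Set<string>,
  suffix: () => string = randomSuffix,
): string {
  const dot = name.lastIndexOf(".");
  const stem = dot > 0 ? name.slice(0, dot) : name;
  const ext = dot > 0 ? name.slice(dot) : "";
  let candidate = name;
  while (existing.has(candidate)) candidate = `${stem}-${suffix()}${ext}`;
  return candidate;
}

/** Storage key under the asset prefix quoted in step 2. */
function assetStoragePath(projectId: string, filename: string): string {
  return `projects/${projectId}/fs/assets/${filename}`;
}
```

Keeping the suffix generator injectable makes the collision path easy to test with a fixed value, while production code could pass nanoid instead.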

Why one model isn't enough: the two-tier vision pipeline

Image tokens are one of the most expensive line items in a vision-based LLM product. A high-resolution screenshot can cost 1,500–4,000 input tokens depending on the model. If you fan that image out to every step of a multi-step pipeline, costs balloon fast.

RapidNative splits work across two kinds of model tier: steps that receive the image and text-only steps that never see it. Each tier's model comes from a database-driven config (getAIModelConfigMap() with a 5-minute cache) and is routed through the Vercel AI SDK via provider packages: @ai-sdk/anthropic, @ai-sdk/amazon-bedrock, @ai-sdk/google-vertex, @ai-sdk/azure, and @openrouter/ai-sdk-provider.

| Tier | Purpose | Sees the image? |
|---|---|---|
| CONTEXT_GATHERING | Step 1: figures out which existing files to read, what to clarify, what tools to call | Yes |
| MAIN_GENERATION (vision-tier pricing) | Step 4: emits the actual React Native code | Yes |
| AUTH_DETECTION (text-only) | Step 2: parses Step 1's output for auth intent | No |
| SCHEMA_GENERATION (text-only) | Step 3: drafts database tables if needed | No |

In other words, only the steps that genuinely need to see the image pay for vision tokens. The intermediate parsing and schema work runs on cheaper text-only models. The pricing tier for vision models inside RapidNative's config sits at roughly $0.003 input / $0.015 output per 1K tokens — close to Claude 3.5 Sonnet's published pricing — so dropping two of the four steps to a text-only model meaningfully cuts cost per request.
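The saving can be put in rough numbers. The sketch below uses the 4,000-token upper bound and the $0.003 per 1K input price quoted above, and its step list mirrors the tier table. It counts only image input tokens and assumes, for simplicity, that every step that sees the image pays the same vision-tier input price, which may not hold for the context-gathering model. The identifiers are illustrative, not RapidNative's.

```typescript
// Back-of-envelope estimate of image-token input cost per request.
// Assumption: every step that sees the image is billed at the vision-tier input price.
// Text-token costs are ignored so only the effect of fanning the image out is compared.

interface PipelineStep {
  name: string;
  seesImage: boolean;
}

const STEPS: PipelineStep[] = [
  { name: "CONTEXT_GATHERING", seesImage: true },
  { name: "AUTH_DETECTION", seesImage: false },
  { name: "SCHEMA_GENERATION", seesImage: false },
  { name: "MAIN_GENERATION", seesImage: true },
];

const VISION_INPUT_USD_PER_1K = 0.003;

/** Image input cost summed over the steps that actually receive the image. */
function imageInputCostUsd(imageTokens: number, steps: PipelineStep[]): number {
  const stepsWithImage = steps.filter((s) => s.seesImage).length;
  return (imageTokens / 1000) * VISION_INPUT_USD_PER_1K * stepsWithImage;
}

const screenshotTokens = 4000;
const twoStepCost = imageInputCostUsd(screenshotTokens, STEPS);
const fanOutCost = imageInputCostUsd(
  screenshotTokens,
  STEPS.map((s) => ({ ...s, seesImage: true })),
);

console.log(`image in 2 steps: $${twoStepCost.toFixed(3)}`);
console.log(`image in 4 steps: $${fanOutCost.toFixed(3)}`);
```

Under these assumptions the image costs about $0.024 per request when two steps see it, against about $0.048 if all four did.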

There's a second optimization buried in Step 1. When an image is present, the context-gathering model is explicitly instructed to skip database and auth tool calls and to read only one existing screen plus theme.ts and _layout.tsx. That's the minimum project context needed to keep the new screen visually consistent with the rest of the app — no more, no less.

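// Step 1 only: extend the context-gathering prompt when an image is attached.
if (imageUrl) {
  step1SystemPrompt += `\n\nThe user attached a reference image.`;
  if (imageFileName) {
    // The upload is already stored as a project asset, so hand the model the exact require() path.
    step1SystemPrompt += ` Saved as project asset. Use: require('@/assets/${imageFileName}') in Image components.`;
  }
  // Cap how much of the project the model reads before generation.
  step1SystemPrompt += ` Read: ONE existing screen as reference + \`theme.ts\` + \`_layout.tsx\`. That's it.`;
}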

That single instruction keeps the model from spending half its context window grepping the project tree, and it narrows the surface area for downstream code generation.

The vision prompt: teaching the model to read a UI

This is the layer that distinguishes a "vision model wrapper" from a useful product. When the image hits Step 4, the system prompt includes a block of design interpretation guidelines that tell the model how to read what it's looking at.

The guidelines are short, opinionated, and worth quoting:

Design Interpretation Guidelines (Image Provided)

  • Use the provided image as a visual reference for the UI layout and structure.
  • For black-and-white prototypes, wireframes, or hand-drawn sketches:
    • Do NOT reproduce the black-and-white look.
    • ALWAYS apply vibrant, modern solid color palettes.
    • If building on an existing app: follow the established theme, colors, brand identity.
    • If starting fresh: infer the palette and theme from the wireframe's app type.
    • Treat wireframes as conceptual blueprints, not final designs.

That last bullet does more work than it looks like it does. Most multimodal models, given a wireframe, will faithfully reproduce its grayscale aesthetic — light gray boxes, dark gray text, no color. That's technically "correct" image reproduction but produces apps that look like prototypes. RapidNative's prompt explicitly inverts that behavior: read the structure of the wireframe, throw away the aesthetic, and apply a palette appropriate to the inferred app type.

For polished screenshots (a Figma export, a competitor's app), the same prompt works in reverse — preserve the color story but adapt it to the project's existing theme tokens rather than hard-coding hex values.

There's a second guideline about asset usage when the model decides the uploaded image should ship as an asset (a logo, a hero illustration, an avatar):

The image is saved as a project asset.
Use: <Image source={require('@/assets/{filename}')} />

The require() form matters because Expo's Metro bundler resolves it at build time, so the asset is bundled into the app rather than fetched at runtime. The model is told the exact import path, removing one of the most common ways AI-generated React Native code breaks: hallucinated asset paths.
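Hallucinated asset paths can also be caught after generation. The following is a minimal sketch of such a check, assuming the generated code is available as a string and the project's asset filenames are known. The function name and the check itself are hypothetical and are not described as part of RapidNative's pipeline.

```typescript
// Hypothetical post-generation check: list require('@/assets/...') targets
// that do not match any file actually stored for the project.

function findUnknownAssetPaths(code: string, knownAssets: Set<string>): string[] {
  const pattern = /require\(\s*['"]@\/assets\/([^'"]+)['"]\s*\)/g;
  const unknown: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(code)) !== null) {
    const file = match[1];
    // Report each missing file once, in order of first appearance.
    if (!knownAssets.has(file) && unknown.indexOf(file) === -1) {
      unknown.push(file);
    }
  }
  return unknown;
}

const generated = `
  <Image source={require('@/assets/logo.png')} />
  <Image source={require("@/assets/hero-banner.png")} />
`;
console.log(findUnknownAssetPaths(generated, new Set(["logo.png"])));
```

A non-empty result is a signal that the model invented a path, which is cheaper to detect here than as a bundler error in the preview.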

[Image: Wireframe and final colored UI side by side on a designer's desk]
Wireframes are conceptual blueprints, not final designs. The vision prompt teaches the model to skip the grayscale aesthetic. Photo by Hal Gatewood on Unsplash.

Reference vs asset: how the model classifies its own intent

The "design reference" vs "asset" distinction isn't just baked into the prompt. RapidNative tracks which one the model picked, after the fact.

After Step 4 finishes streaming, the server inspects the generated code:

const imageIntent = fullResponseText.includes(imageUrl)
  ? 'asset'             // the image URL appears in the generated output
  : 'design_reference'; // the URL is absent: the image served only as a layout/style hint

If the image URL shows up in the generated code, the model treated it as an asset — it embedded the file (or its public Supabase URL) directly into an <Image> component. If the URL doesn't appear anywhere in the output, the model used it as a design reference — it studied the layout and styled a new screen from scratch.

That classification is logged to the user_activity_logs table, which lets the team tune the prompt over time: are users uploading more logos and hero images (assets), or are they uploading UI inspiration (references)? The answer changes how the prompt should weight each case.

This is one of those product details that doesn't matter to a single user on a single request but matters enormously over thousands of requests. The intent classification is what lets the prompt evolve from generic "you see an image" instructions to behavior tuned to how people actually use the feature.
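One detail is worth noting about the check quoted earlier in this section: it looks only for the public URL, while the asset guideline steers the model toward require('@/assets/{filename}'). A variant that also counts the require() path could look like the sketch below. This is an illustrative assumption, not RapidNative's implementation, and the function name is hypothetical.

```typescript
type ImageIntent = "asset" | "design_reference";

// Hypothetical variant of the post-generation check.
// Treats either the public URL or the bundled asset path as evidence of asset use.
function classifyImageIntent(
  responseText: string,
  imageUrl: string,
  imageFileName?: string,
): ImageIntent {
  if (responseText.includes(imageUrl)) return "asset";
  if (imageFileName && responseText.includes(`@/assets/${imageFileName}`)) {
    return "asset";
  }
  return "design_reference";
}

const sampleOutput = `<Image source={require('@/assets/logo.png')} />`;
console.log(
  classifyImageIntent(sampleOutput, "https://example.com/storage/logo.png", "logo.png"),
);
```

With the URL-only check, this sample output would be logged as a design reference even though the image ships in the app.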

From wireframe to a vibrant app: a concrete example

Take a generic wireframe of a fitness tracker — three gray boxes labeled "Today," "History," "Stats," a header strip, and a tab bar. Drop it into RapidNative with the prompt "build me a fitness tracking app from this wireframe." Here's what the pipeline does:

  1. Step 1 (cheap context model with vision): Notes the image is a wireframe based on the grayscale palette and box-based layout. Decides not to fetch database tools (it's a UI-first request). Reads theme.ts to learn the project's color tokens.
  2. Step 2: Parses Step 1's output — no auth UI mentioned, so no auth pages needed.
  3. Step 3: Skipped — no schema work.
  4. Step 4 (vision-tier generation model): Receives the image plus the design-interpretation guidelines plus the theme tokens. Generates three screens (app/(tabs)/today.tsx, history.tsx, stats.tsx) plus a _layout.tsx with a tab navigator. The screens use the project's existing primary color (say, vibrant orange) instead of the wireframe's gray. Typography matches the theme. Spacing tokens are reused.

That's the difference between a model that reproduces an image and a model that interprets it. The grayscale wireframe becomes a styled, on-brand fitness app — not a black-and-white prototype.

The same pipeline handles the inverse case too. Drop in a screenshot of a competitor's polished checkout flow and the model preserves the structural layout (header, line items, sticky CTA) but maps the colors and typography to the active project's theme instead of copying the competitor's brand identity.

From AI text to running React Native files

When the model finishes Step 4, it streams a response that looks roughly like:

<CodeProject id="...">
  <File path="app/(tabs)/today.tsx">
    import { View, Text } from 'react-native';
    // ...
  </File>
  <File path="app/(tabs)/_layout.tsx">
    // ...
  </File>
</CodeProject>

The XML-tagged shape is a deliberate compromise. JSON wouldn't let the model freely stream JSX without escaping every quote. Plain markdown code fences would lose path information. XML tags give a streamable, parseable wrapper that survives token-by-token emission.
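To make the shape concrete, here is a minimal sketch that pulls file paths and contents out of a response like the one above. It assumes the whole response has been buffered, whereas a production parser would work incrementally on the stream. The function and type names are hypothetical.

```typescript
interface GeneratedFile {
  path: string;
  content: string;
}

/** Remove surrounding blank lines and the indentation shared by all non-empty lines. */
function dedent(block: string): string {
  const lines = block.replace(/^\s*\n|\s+$/g, "").split("\n");
  const indents = lines
    .filter((l) => l.trim() !== "")
    .map((l) => l.match(/^[ \t]*/)![0].length);
  const common = indents.length ? Math.min(...indents) : 0;
  return lines.map((l) => l.slice(common)).join("\n");
}

/** Extract every <File path="..."> block from a fully buffered response. */
function extractFiles(response: string): GeneratedFile[] {
  const pattern = /<File\s+path="([^"]+)"\s*>([\s\S]*?)<\/File>/g;
  const files: GeneratedFile[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(response)) !== null) {
    files.push({ path: match[1], content: dedent(match[2]) });
  }
  return files;
}

const sampleResponse = `
<CodeProject id="demo">
  <File path="app/(tabs)/today.tsx">
    import { View, Text } from 'react-native';
    // ...
  </File>
  <File path="app/(tabs)/_layout.tsx">
    // ...
  </File>
</CodeProject>`;

console.log(extractFiles(sampleResponse).map((f) => f.path));
```

Because the path travels as a tag attribute, the file body needs no escaping, which is the property the JSON alternative lacks.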

A subtle gotcha lives here. Some models — particularly GLM and a few open-source vision-capable ones — sometimes wrap a valid <CodeProject> block inside an outer <tool_call> or <function_calls> tag, hallucinating that they're calling a tool. The server runs a streaming sanitizer that detects these wrappers and unwraps them in flight, preventing silent code-drop bugs that would otherwise leave the user staring at a blank preview.
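The unwrapping idea can be sketched on a buffered string. The real sanitizer is described as working in flight on the stream, so this simplified version only shows the tag handling. The wrapper tag names come from the paragraph above; everything else is a hypothetical illustration.

```typescript
// Hypothetical sketch: strip a hallucinated tool-call wrapper around a <CodeProject> block.
// Operates on a complete string, unlike the streaming sanitizer described above.
const WRAPPER_TAGS = ["tool_call", "function_calls"];

function unwrapCodeProject(text: string): string {
  let result = text;
  for (const tag of WRAPPER_TAGS) {
    const wrapper = new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, "g");
    // Only unwrap when the wrapper actually contains generated code.
    result = result.replace(wrapper, (whole: string, inner: string) =>
      inner.includes("<CodeProject") ? inner.trim() : whole,
    );
  }
  return result;
}

const wrapped =
  '<tool_call>\n<CodeProject id="a"><File path="x.tsx">y</File></CodeProject>\n</tool_call>';
console.log(unwrapCodeProject(wrapped));
```

Wrappers that contain no <CodeProject> block are left untouched, so a genuine tool call would pass through unchanged.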

Once parsed, each file gets written into the project_files table. The in-browser Expo Router preview (a separate dev process spun up alongside Next.js via npm run dev:rapidnative-expo-router) picks up the changes through a sync service, recompiles, and hot-reloads. By the time you finish reading the streamed code, the app is running.

For a deeper look at the bundler side of this — how generated code becomes a running app without a server-side build step — see how RapidNative delivers real-time React Native live preview across every device.

When image-to-app falls short (and what to try instead)

Honest section. Image-to-app is not magic, and there are cases where it's the wrong tool.

Low-resolution photos of phone screens. A 480px-wide photo of someone's iPhone, taken at an angle, with glare across the screen — the model can usually still read it, but spacing and typography will be guessed rather than read. For best results, upload a screen capture (not a photo) at native resolution.

Highly custom UI primitives. If the screenshot includes a complex custom control — a circular gesture-based slider, an unusual chart type, a non-standard animation — the model will approximate the layout but the interaction logic won't come from the image. You'll need to follow up with a prompt that describes the interaction.

Multi-screen flows from a single image. Each uploaded image is read as one reference, not as a set of separate screens. If you have a Figma export that contains six screens side by side, RapidNative will currently treat it as one composite image. The cleaner approach is to upload each screen separately, or paste a PRD that describes the flow and reference the screens by name.

Pixel-perfect reproduction. If you need the output to match the input pixel-for-pixel, the answer is not AI generation — the answer is exporting from your design tool. RapidNative's image-to-app is built around interpretation, not reproduction. The wireframe-to-vibrant-app behavior described above is a feature, not a bug.

When in doubt, the whiteboard mode is often a better fit for rough ideas, and the PRD-to-app flow handles structured requirements better than a single image can.

How to try it

You can hit the image-to-app workflow three ways:

  1. From the home page prompt composer, drop or paste an image into the input. No signup is needed to start.
  2. From the dedicated image-to-app feature page, which includes the demo video and walkthrough.
  3. Inside an existing project — drop an image into the chat sidebar and reference it in your prompt. The same four-step pipeline runs with the project context already loaded.

Generation runs on the free tier. If you want to know what each request actually costs in tokens and credits, the credit system explainer breaks it down.

Why the prompt is the product

A common framing of AI app builders is that they're "wrappers around GPT" or "Claude with a UI." That framing is wrong, but not for the reason most defenders give. The model matters — but the prompt matters more, because the prompt is where the product's opinions about images, code style, asset handling, and design interpretation live.

A vision model on its own will turn a wireframe into a grayscale React Native screen. A vision model with the right system prompt will turn that same wireframe into a vibrant, themed, on-brand mobile app that bundles cleanly and runs on a real device. The model is interchangeable. The prompt is not.

That's the whole reason a feature like image-to-app needs to live inside an app builder rather than as a standalone "convert this screenshot" tool. The project context, the asset pipeline, the intent classification, the wireframe-to-color transformation — none of that comes from the model. All of it comes from the layer around it.

If you're curious how the rest of that layer works, the four-step LLM pipeline post covers the broader architecture, and the streaming post goes into how generated code reaches your phone in seconds.

Or you can just drop a screenshot into RapidNative and see what happens. That's faster.
