Inside Image-to-App: How RapidNative Turns a Screenshot Into Working React Native Code

By Suraj Ahmed

27th May 2026

Last updated: 26th May 2026

You drag a screenshot of a screen you like onto the canvas, type "build this," and twenty seconds later there's a working React Native version of it running on a phone in your browser. From the outside, that looks like a single AI call. It isn't. Turning a screenshot into React Native code is a four-step pipeline with three different models, a storage handoff, and a deliberate skip of two normally-mandatory tools, all stitched together so the round trip stays under a minute.

This post is a walkthrough of what actually happens inside RapidNative's image-to-app feature when you upload an image: the compression on the client, the Supabase storage handoff, the two-model architecture that keeps latency down, and the prompt engineering that makes a multimodal LLM produce code that compiles instead of code that just looks plausible.

[Image: person designing a mobile app screen on a phone next to a laptop. Caption: A screenshot is the most underrated input mode in app building. Photo by Hal Gatewood on Unsplash]

Why Screenshot-to-Code Is Harder Than It Looks

A naive screenshot-to-code implementation does one thing: shove the image into a vision model with a prompt like "convert this to React Native" and stream the output. That works for a demo. It breaks the moment you want the result to be a real screen inside a real app, because of three problems the naive approach ignores.

First, the model has no idea what your project's theme is. It will hardcode colors, spacing, and typography that don't match the rest of your app. Second, it has no idea how your other screens are structured — your navigation pattern, your component conventions, your layout primitives. A login screen generated cold doesn't know that the rest of your app uses SafeAreaView from a specific package or that your buttons live in components/ui/Button.tsx. Third — and this is the one most tools get wrong — the model has no way to use the image as an asset. If the screenshot includes a hero image you want to keep, a naive pipeline either describes it back in prose or invents a stock placeholder. Neither is what you wanted.

The image-to-app pipeline is built to solve all three. Here's how.

Step 0: The Client Compresses Before It Uploads

The flow starts in the editor's useSendMessage hook. When you drop an image into the chat input alongside your prompt, the hook doesn't upload the raw file. It compresses it first, in the browser, using the HTML5 Canvas API:

  • Max dimensions: 1920×1080
  • Quality: 80%
  • MIME type preserved (PNG stays PNG, JPEG stays JPEG)
  • Falls back to the original if compression fails

This matters more than it sounds. The raw screenshots people upload are routinely 4–8 MB straight off a Retina display. At 1920×1080 / 80% JPEG, the same image drops to 200–600 KB. That's the difference between a 3-second upload on a typical connection and a sub-second one — and the difference between a vision model that ingests the image quickly and one that times out.
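The sizing rule and the fallback described above can be sketched in a few lines. This is a minimal sketch, not RapidNative's actual hook: `fitWithin`, `compressOrFallback`, and the injected `encode` callback are hypothetical names, and the Canvas drawing itself is left to the caller so the logic stays runnable outside a browser.

```typescript
interface Size {
  width: number;
  height: number;
}

// Limits taken from the list above.
const MAX_SIZE: Size = { width: 1920, height: 1080 };
const QUALITY = 0.8;

// Scale the source down to fit inside `max`, preserving aspect ratio.
// Images that already fit are never upscaled.
function fitWithin(src: Size, max: Size = MAX_SIZE): Size {
  const scale = Math.min(1, max.width / src.width, max.height / src.height);
  return {
    width: Math.round(src.width * scale),
    height: Math.round(src.height * scale),
  };
}

// Hypothetical encoder signature. In a browser this would draw the image onto
// a canvas of `target` size and call canvas.toBlob(callback, mimeType, quality).
type Encode<T> = (original: T, target: Size, quality: number) => Promise<T>;

// Try to compress; if the encoder throws, fall back to the original file.
async function compressOrFallback<T>(
  original: T,
  srcSize: Size,
  encode: Encode<T>,
): Promise<T> {
  try {
    return await encode(original, fitWithin(srcSize), QUALITY);
  } catch {
    return original;
  }
}
```

With these limits, a 3840×2160 capture is halved to 1920×1080, while an 800×600 image is left at its original size. One caveat worth keeping in mind: the quality argument of `canvas.toBlob` only applies to lossy formats such as JPEG and WebP, so a PNG that keeps its MIME type shrinks through the downscale alone.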

The compressed file gets pushed to /api/upload as multipart form data. While that upload is in flight, the client creates a local URL.createObjectURL() so you see the image instantly in the chat — optimistic UI for the upload, not just the generation.

Step 1: Storage Is a Two-Table Handoff, Not Just an Upload

Where the image gets stored matters because the model later needs a URL it can fetch. RapidNative uses Supabase Storage, but the upload route does two things most tutorials skip.

First, it sanitizes the filename. Special characters get stripped, the extension is inferred from the MIME type if the file came in without one, and a 6-character nanoid suffix is appended on collision. The result is a deterministic, safe path:

projects/{projectId}/fs/assets/{sanitized-name}.{ext}
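A sanitizer along these lines could look like the sketch below. The function names, the MIME map, and the choice to replace special-character runs with a hyphen are all assumptions made for illustration; the suffix generator is injected so that a library such as nanoid can be plugged in without this example depending on it.

```typescript
// Small MIME-to-extension map; only a few common image types are listed.
const EXT_BY_MIME: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/webp": "webp",
};

// Hypothetical helper: lowercases the name, collapses special characters,
// and infers the extension from the MIME type when the file has none.
function sanitizeFileName(rawName: string, mimeType: string): string {
  const dot = rawName.lastIndexOf(".");
  const hasExt = dot > 0 && dot < rawName.length - 1;
  const base = hasExt ? rawName.slice(0, dot) : rawName;
  const rawExt = hasExt
    ? rawName.slice(dot + 1).toLowerCase().replace(/[^a-z0-9]/g, "")
    : "";
  const ext = rawExt || EXT_BY_MIME[mimeType] || "bin";
  const safeBase =
    base
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, "-")
      .replace(/^-+|-+$/g, "") || "image";
  return `${safeBase}.${ext}`;
}

// Build the storage path; append a suffix only when the path is already taken.
function assetPath(
  projectId: string,
  fileName: string,
  taken: ReadonlySet<string>,
  suffix: () => string, // e.g. a 6-character nanoid in the real route
): string {
  const prefix = `projects/${projectId}/fs/assets/`;
  if (!taken.has(prefix + fileName)) return prefix + fileName;
  const dot = fileName.lastIndexOf(".");
  return `${prefix}${fileName.slice(0, dot)}-${suffix()}${fileName.slice(dot)}`;
}
```

Under this sketch, `My Screen Shot!!.PNG` becomes `my-screen-shot.png`, and a second upload with the same name would get a path like `my-screen-shot-abc123.png` instead of overwriting the first.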

Second — and this is the part that makes the screenshot actually useful as an asset — the upload route writes to two database tables in addition to storage:

  • files — registers the asset in the project's virtual filesystem with is_external: true, so the bundler can resolve it
  • project_assets — registers the upload time, used by the bundler and the AI tooling

The reason for that second write is subtle but important. Without it, the generated React Native code would have to embed the full Supabase URL in <Image source={{ uri: '...' }} /> calls — fragile, ugly, and not how production apps reference local assets. With it, the LLM can generate code that uses the proper Expo/Metro convention:

<Image source={require('@/assets/screenshot-abc123.png')} />

That's a small detail, but it's the difference between code that works locally during preview and code that survives an eas build.

[Image: developer working on a laptop with a code editor open. Caption: The unglamorous infrastructure work (file sanitization, asset registration) is what makes the AI output usable. Photo by Headway on Unsplash]

Step 2: The 4-Step LLM Pipeline (And Why Image-to-App Skips Half of It)

The main AI route, /api/user/ai/generate-v2, runs a four-step pipeline for every request. Image uploads go through the same pipeline, but two steps behave differently to keep the round trip fast. Here's the full pipeline, then the image-specific deviations.

Step 1 — Context Gathering (fast model, with tools). A cheaper, faster model runs first with a small toolset (get_files_content, batch_grep). Its job is to figure out what the main model needs to see before it generates code. For a normal prompt that might mean reading three screens, the navigation config, and the theme. For an image request, the system prompt narrows the read aggressively — basically: "The user attached a reference image. Read one existing screen as reference plus theme.ts plus _layout.tsx. That's it." The output is a bundle of file contents the main generator can use as ground truth.

Step 2 — Auth Signal Detection. The pipeline scans Step 1's output for an AUTH: yes|no marker that tells the generator whether to wire up authentication. For image-to-screen requests this is short-circuited to no — when someone uploads a screenshot, they're recreating a UI, not asking for a sign-in flow.
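The marker scan is simple enough to sketch as a single function. This is an illustrative version only: `detectAuth` is a hypothetical name, and treating a missing marker as "no" is an assumption rather than documented behavior.

```typescript
// Returns true only when Step 1's output contains "AUTH: yes"
// and the request has no attached image.
function detectAuth(contextOutput: string, hasImage: boolean): boolean {
  // Image-to-screen requests short-circuit to "no" without parsing.
  if (hasImage) return false;

  const match = /\bAUTH:\s*(yes|no)\b/i.exec(contextOutput);
  // Assumption: no marker at all is treated the same as "AUTH: no".
  return match?.[1]?.toLowerCase() === "yes";
}
```

Because this is a plain string parse with no model call, it adds almost nothing to the request time.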

Step 3 — Generation Tools (database + auth scaffolding). Normally this step provisions Supabase tables and auth handlers based on what Step 1 found. For image requests, it's deliberately skipped:

if (imageUrl && (genTool.id === 'database' || genTool.id === 'auth')) {
  console.log(`⏭️ [Step 3] Skipping ${genTool.name} — image-to-screen request, no DB needed`);
  continue;
}

This is the single biggest latency win for screenshot-to-code. Database tool calls round-trip through the LLM with structured outputs and are the slowest part of a normal generation. Cutting them shaves 5–15 seconds off the request — and they were never needed in the first place, because you're recreating a UI, not designing a schema.
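The snippet above is a fragment from inside a loop. To show the surrounding control flow, here is a self-contained sketch with the same condition; `GenTool` and `selectGenerationTools` are hypothetical names, and the real route presumably runs each tool in place rather than collecting a list first.

```typescript
interface GenTool {
  id: string;
  name: string;
}

// Returns the generation tools that should actually run for this request.
function selectGenerationTools(tools: GenTool[], imageUrl?: string): GenTool[] {
  const toRun: GenTool[] = [];
  for (const genTool of tools) {
    // Same check as the original snippet: with an image attached,
    // database and auth scaffolding are skipped entirely.
    if (imageUrl && (genTool.id === "database" || genTool.id === "auth")) {
      continue;
    }
    toRun.push(genTool);
  }
  return toRun;
}
```

For a request with an image, both scaffolding tools drop out and Step 3 becomes a no-op; for a text-only request the list is returned unchanged.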

Step 4 — Code Generation (main model, no tools). This is where the vision happens. The main generation model — separate from the fast model used in Step 1 — gets called with streamText(). The user message is built as a multimodal payload:

finalUserMessage = {
  role: 'user',
  content: [
    {
      type: 'text',
      text: `${query}\n[UPLOADED IMAGE] Saved as project asset at: assets/${imageFileName}\nTo use: require('@/assets/${imageFileName}')`,
    },
    { type: 'image', image: imageUrl },
  ],
};

Three things are happening in that payload that matter. The user's prompt comes first, so the model knows the intent. The asset note tells the model the file already lives in the project — that's how it knows to emit a require() call instead of inlining the Supabase URL. And the image itself is passed as a remote URL, not base64-encoded. The Vercel AI SDK's standard ImagePart format handles the fetch on the provider side, which is cheaper than blowing up the request body with 800 KB of base64.

The call uses temperature: 0.6, maxTokens: 32000, and maxSteps: 1 — pure generation, no tool calls. The response streams back to the client over Server-Sent Events.
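The multimodal payload shown earlier in this step can be wrapped in a small typed helper. The sketch below uses local stand-in types that mirror the shape of the snippet instead of importing the SDK's own types, and `buildUserMessage` is a hypothetical name.

```typescript
// Local stand-ins for the text and image content parts used in the snippet.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; image: string };

interface UserMessage {
  role: "user";
  content: string | Array<TextPart | ImagePart>;
}

interface UploadedImage {
  url: string; // remote URL of the stored file
  fileName: string; // name under assets/ in the project
}

function buildUserMessage(query: string, image?: UploadedImage): UserMessage {
  // Text-only requests need no multimodal wrapper.
  if (!image) return { role: "user", content: query };

  // The asset note tells the model the file already lives in the project.
  const assetNote =
    `[UPLOADED IMAGE] Saved as project asset at: assets/${image.fileName}\n` +
    `To use: require('@/assets/${image.fileName}')`;

  return {
    role: "user",
    content: [
      { type: "text", text: `${query}\n${assetNote}` },
      { type: "image", image: image.url }, // passed by URL, not base64
    ],
  };
}
```

Keeping the text part first and the image part second matches the ordering described above: intent, then asset note, then the image itself.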

Why Two Models Instead of One

A single model could theoretically do both context gathering and code generation. RapidNative deliberately uses two, and the reason comes down to economics and token budgets.

The fast model in Step 1 needs to be able to call tools well — read files, run greps, pick what's relevant. That's a different optimization profile than code generation. It also needs to be cheap, because it's running on every request whether or not the user uploaded an image.

The main model in Step 4 needs to be strong at multimodal reasoning and React Native code synthesis. Those models — Claude Sonnet, GPT-4o, Gemini Pro Vision — are several times more expensive per token. Running them in Step 1 would waste budget on file-reading tool calls. Running the fast model in Step 4 would produce visually plausible but technically broken code.

The model selection is database-driven through an ai_model_configs table with a purpose column (MAIN_GENERATION, VISION, CONTEXT_GATHERING) and a provider column (openrouter, bedrock, anthropic, azure, vertex, openai). That means the vision model can be swapped without a code change: if Anthropic releases a better vision model on Tuesday, switching to it is a database update.
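A purpose-keyed lookup of this kind might look like the sketch below. Only the `purpose` and `provider` columns come from the description above; the `model_id` column, the `resolveModel` helper, and the placeholder model names are assumptions, and an in-memory array stands in for the database table.

```typescript
type Purpose = "MAIN_GENERATION" | "VISION" | "CONTEXT_GATHERING";
type Provider = "openrouter" | "bedrock" | "anthropic" | "azure" | "vertex" | "openai";

interface ModelConfig {
  purpose: Purpose;
  provider: Provider;
  model_id: string; // hypothetical column holding the provider's model name
}

// Look up the configured model for a pipeline stage.
function resolveModel(rows: ModelConfig[], purpose: Purpose): ModelConfig {
  const row = rows.find((r) => r.purpose === purpose);
  if (!row) throw new Error(`No model configured for ${purpose}`);
  return row;
}

// Placeholder rows standing in for the ai_model_configs table.
const configs: ModelConfig[] = [
  { purpose: "CONTEXT_GATHERING", provider: "openrouter", model_id: "fast-model-placeholder" },
  { purpose: "VISION", provider: "anthropic", model_id: "vision-model-a" },
];

// Swapping the vision model is a data change, not a code change.
const updated = configs.map((r) =>
  r.purpose === "VISION" ? { ...r, model_id: "vision-model-b" } : r,
);
```

The calling code asks for a purpose and never names a model, which is what makes the swap a pure data change.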

[Image: abstract neural network visualization. Caption: Two models, two budgets, two jobs. Separating context gathering from generation is the architectural decision that lets the pipeline scale. Photo by Possessed Photography on Unsplash]

Asset Versus Design Reference: A Subtle Classification

After the stream completes, the route logs the request to a user_activity_logs table with one of two values:

  • 'asset' — the generated code referenced the uploaded image (a require('@/assets/...') call appeared in the output)
  • 'design_reference' — the model used the image to understand layout but didn't include it in the code
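The classification can be approximated with a string check on the generated output. This is a sketch under assumptions: `classifyImageUsage` is a hypothetical name, and matching on the specific uploaded file name (rather than any asset path) is a guess at how the check works.

```typescript
type ImageUsage = "asset" | "design_reference";

// Classify by whether the generated code requires the uploaded file.
function classifyImageUsage(generatedCode: string, imageFileName: string): ImageUsage {
  // Accept either quote style in the emitted require() call.
  const single = `require('@/assets/${imageFileName}')`;
  const double = `require("@/assets/${imageFileName}")`;
  const referenced = generatedCode.includes(single) || generatedCode.includes(double);
  return referenced ? "asset" : "design_reference";
}
```

A screen that renders the upload through `require('@/assets/hero.png')` is logged as 'asset'; one that only borrows the layout is logged as 'design_reference'.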

That distinction is more than analytics. It's the model implicitly telling you what kind of screenshot you uploaded. If you dropped in your competitor's signup page, the model uses it as a design reference — copies the layout, doesn't embed their logo. If you dropped in a hero photo you want in your app, the model uses it as an asset — generates code that displays your image. The same pipeline handles both, but the downstream behavior is shaped by what the model actually emits.

What the System Prompt Actually Tells the Model

The system prompts for the vision step aren't public, but the shape is observable from the codebase. The key instruction in Step 1 is unusually short for an LLM pipeline — basically "read one screen, the theme, and the root layout, and stop." Most context-gathering systems try to read everything. For image-to-app, that's wrong: the screenshot itself is the dominant context, and over-reading the codebase risks the model copying patterns from screens that don't match the input visually.

In Step 4, the main system prompt steers the model toward:

  • Using existing components from the project's component library when their shape matches the screenshot
  • Pulling colors, spacing, and typography from theme.ts instead of hardcoding values from the image
  • Treating the image as a layout reference, not a pixel-perfect spec — apps need to be responsive

The result is generated code that looks like the screenshot but feels like the rest of your app.

How Long Does All of This Take?

For a typical screenshot-to-screen request:

Step                                        Time
Client-side compression                     ~100–300 ms
Upload to Supabase + asset registration     ~500 ms–1.5 s
Step 1: Context gathering (fast model)      3–8 s
Step 2: Auth detection                      <100 ms (string parse)
Step 3: Generation tools                    skipped for images
Step 4: Code generation streaming start     2–4 s to first token
Step 4: Full screen generated               15–30 s

End to end: usually 20–40 seconds from drop to a running preview. The Step 3 skip is what keeps this under a minute. Without it, the same request would round-trip through database tool calls and land closer to 45–60 seconds.
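The end-to-end figure follows from adding up the per-step ranges in the table. The short sketch below does that arithmetic; it assumes the time to first token is already included in the full-generation figure, so it is not counted twice.

```typescript
interface RangeMs {
  minMs: number;
  maxMs: number;
}

// Per-step ranges from the table, in milliseconds.
const stepRanges: RangeMs[] = [
  { minMs: 100, maxMs: 300 }, // client-side compression
  { minMs: 500, maxMs: 1500 }, // upload + asset registration
  { minMs: 3000, maxMs: 8000 }, // Step 1: context gathering
  { minMs: 0, maxMs: 100 }, // Step 2: auth detection (string parse)
  { minMs: 0, maxMs: 0 }, // Step 3: skipped for images
  { minMs: 15000, maxMs: 30000 }, // Step 4: full screen generated
];

// Sum the lower and upper bounds independently.
function totalRange(ranges: RangeMs[]): RangeMs {
  return ranges.reduce(
    (acc, r) => ({ minMs: acc.minMs + r.minMs, maxMs: acc.maxMs + r.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

const total = totalRange(stepRanges);
// total.minMs === 18600 and total.maxMs === 39900,
// i.e. roughly 19 to 40 seconds end to end.
```

The bounds land at about 18.6 and 39.9 seconds, which lines up with the "usually 20–40 seconds" estimate above.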

What This Means If You're Building With It

You don't need to know any of this to use the Image-to-App feature — you upload, you prompt, you iterate. But knowing the pipeline gives you a few practical levers:

Prompts matter even with images. The image is huge context, but the text prompt is still what tells the model your intent. "Recreate this" produces a different result than "use this as a reference for the login layout, but match our brand colors." The pipeline always gives the prompt to the model — use that.

Iterate on one screen at a time. Because Step 1 reads exactly one existing screen as reference, the closer your previous screen is to what you want next, the more consistent the result. Build the first screen carefully, then upload screenshots for follow-ups. The model is not learning between requests; each new generation simply picks up the patterns of the screens already in your project.

Treat uploads as assets when you want them embedded. If you want the screenshot's hero image, logo, or photo to actually appear in your app, say so in the prompt. The model defaults to layout reference; you have to ask for it as an asset.

It works equally well for non-screenshot images. The pipeline doesn't actually know that what you uploaded is a screenshot. A whiteboard photo, a Figma export, a competitor's marketing screen, a hand-drawn wireframe: they all go through the same vision pipeline. The Sketch-to-App and image-to-app flows share most of the same plumbing.

[Image: mobile app prototype on a phone in someone's hand. Caption: Twenty seconds from a screenshot to a running React Native preview. Photo by Daniel Romero on Unsplash]

Frequently Asked Questions

How does screenshot-to-code work in RapidNative?

It's a four-step LLM pipeline. A fast model first reads a small slice of your project (one screen, the theme, the root layout) to ground the generation. Two normally-mandatory tool calls — database and auth scaffolding — are skipped when an image is attached. Then a multimodal model receives your prompt plus the image and streams React Native code that matches both your project's conventions and the screenshot's layout.

What vision model does RapidNative use?

The vision model is configured in the database, not hardcoded, so it can change without a code release. Currently supported providers include Anthropic (Claude), OpenAI (GPT-4o), Google (Gemini via Vertex), AWS Bedrock, and OpenRouter as a multi-provider proxy. The vision model is purpose-tagged separately from the main generation model, so context gathering, vision, and code synthesis can each use the best-fit model.

Can the generated code use the uploaded image as a real asset?

Yes. The upload route registers the image in the project's virtual filesystem and in project_assets, so the generated code can reference it through the standard require('@/assets/your-image.png') pattern. That's what makes the output survive an Expo build instead of breaking the moment you go offline.

Does the screenshot get sent as base64 or a URL?

A URL. After upload, the file lives in Supabase Storage with a public URL. The LLM call uses the Vercel AI SDK's ImagePart format with a URL reference. URLs are cheaper than base64 in the request body and let the model provider handle the fetch.

Is image-to-app the same flow as sketch-to-app?

Largely the same plumbing — both upload an image, go through the same vision pipeline, and skip the database/auth generation tools. The main difference is intent and prompting: sketches are almost always design references, while screenshots can be either references or actual assets the user wants in their app.

Why Pipelines Beat Single Calls

If there's one takeaway from the architecture, it's that screenshot-to-code is not an LLM problem — it's a pipeline problem. The vision model is necessary, but the difference between a demo and a tool you can actually ship with is everything around the model: compression, storage, asset registration, two-step context gathering, the deliberate skip of database tools, and the choice to pass the image as a URL with an asset annotation.

You can build a screenshot-to-code prototype in an afternoon with any modern vision model. Building one that produces code that compiles, matches your existing project's conventions, and arrives in under a minute is what the pipeline is for.

Try the Image-to-App feature free — no signup needed. Drop in any UI screenshot and you'll get a working React Native screen back in under a minute. If you want to read more on how RapidNative's generation stack is built, the 4-step LLM pipeline post goes deeper on the non-image path, and the NativeWind styling deep-dive covers how the styling layer is produced once the layout is in place.

For more reading on multimodal models, see the Anthropic vision docs and the Vercel AI SDK multimodal guide. For React Native asset conventions, the Expo asset documentation is the canonical reference.
