Building PicMe: How LLMs Solved Multi-Lingual Input Without Writing a Single Parser

Introduction

PicMe is a flashcard generation app designed to help children with communication needs. Users describe what they want in any language, and the app generates a clean, child-friendly illustration with text-to-speech support. The app needed to be fast, globally accessible, and capable of supporting 60+ languages.

This post walks through the architecture decisions, technical challenges, and solutions we implemented to build PicMe.


Architecture Overview

PicMe follows an edge-first architecture, running entirely on Cloudflare's global network. There are no traditional servers—just Workers, KV storage, and R2 object storage.

flowchart TB
    subgraph Client["Client (React + Vite)"]
        UI[Material UI Components]
        TQ[TanStack Query]
        Auth[Auth Context]
    end
    subgraph Edge["Cloudflare Edge"]
        Workers[Hono Workers API]
        KV[(KV Store)]
        R2[(R2 Bucket)]
        ImgTransform[Images API]
    end
    subgraph OpenAI["OpenAI APIs"]
        GPT4[gpt-4o-mini<br/>Metadata Extraction]
        DALLE[gpt-image-1<br/>Image Generation]
        TTS[gpt-4o-mini-tts<br/>Speech Synthesis]
    end
    Client <-->|REST API| Workers
    Workers <--> KV
    Workers <--> R2
    Workers --> ImgTransform
    ImgTransform --> R2
    Workers <--> GPT4
    Workers <--> DALLE
    Workers <--> TTS

Why Cloudflare Workers?

  1. Low global latency: Code runs in 300+ data centers worldwide
  2. No cold starts: Workers start in under 5ms
  3. Integrated storage: KV and R2 are first-class citizens
  4. Cost efficient: Pay per request, not per hour

Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + Material UI 6 | Component library with accessibility built-in |
| Build | Vite 6 | Fast HMR and optimized production builds |
| Backend | Cloudflare Workers + Hono | Lightweight edge-native API framework |
| Auth | PBKDF2 + Session Tokens | Web Crypto API (no external dependencies) |
| Database | Cloudflare KV | User data, flashcard metadata, sessions |
| Object Storage | Cloudflare R2 | Images (WebP) and audio (MP3) |
| AI - Images | OpenAI gpt-image-1 | Child-friendly illustration generation |
| AI - Text | OpenAI gpt-4o-mini | Language detection and metadata extraction |
| AI - Speech | OpenAI gpt-4o-mini-tts | Multi-lingual text-to-speech |

Project Structure

picme/
├── web/                      # React frontend
│   └── src/
│       ├── components/       # FlashcardTile, FlashcardModal
│       ├── contexts/         # AuthContext, QuickChoicesContext
│       ├── hooks/            # useSpeech, useUsage
│       ├── pages/            # HomePage, CreatePage, SequencesPage
│       └── services/         # API client with TanStack Query
│
├── api/                      # Cloudflare Workers backend
│   └── src/
│       ├── routes/           # auth.ts, flashcards.ts, sequences.ts
│       ├── middleware/       # Session validation
│       ├── services/         # OpenAI integration, usage tracking
│       └── utils/            # Crypto (PBKDF2, token generation)
│
└── shared/                   # Shared TypeScript types
    └── types/                # User, Flashcard, API contracts

Challenge 1: Multi-Lingual Input Processing

The Problem

Users might type in any language—including transliterated text. A Hindi speaker might type "mujhe paani chahiye" (I want water) in Latin script, not Devanagari. The AI needs to understand this and generate appropriate images.

The Solution

We use a two-stage AI pipeline:

sequenceDiagram
    participant User
    participant API
    participant GPT4 as gpt-4o-mini
    participant DALLE as gpt-image-1
    User->>API: "mujhe paani chahiye"
    API->>GPT4: Extract metadata
    GPT4-->>API: {language: "hi", normalized: "drinking water", categories: ["needs", "food"]}
    API->>DALLE: Generate image for "drinking water"
    DALLE-->>API: Image URL
    API-->>User: Flashcard created

The metadata extraction prompt handles 60+ languages with priority for Indian languages:

const METADATA_EXTRACTION_PROMPT = `
You are a flashcard metadata extractor for children's AAC communication.

Given a user's input text (which may be in any language, including
transliterated text like Hindi written in Latin script), extract:

1. detected_language: ISO 639-1 code
2. normalized_sentence: Simple English sentence describing the image
3. suggested_categories: 1-4 categories from the predefined list
4. confidence: low/medium/high

Prioritized languages: Hindi, Tamil, Telugu, Kannada, Malayalam,
Bengali, Marathi, Gujarati, Punjabi, Urdu...
`;

This approach means a user in Tamil Nadu can type "தண்ணீர் வேண்டும்" or "thanni venum" and get the same result.
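
The structured reply still has to be parsed defensively — models occasionally return malformed JSON or drop a field. A minimal validation sketch (the field names follow the prompt above; the fallback values are our illustrative assumptions, not necessarily what PicMe ships):

```typescript
// Shape the extraction prompt asks gpt-4o-mini to return
interface ExtractedMetadata {
  detected_language: string;      // ISO 639-1 code
  normalized_sentence: string;    // simple English description for image generation
  suggested_categories: string[];
  confidence: 'low' | 'medium' | 'high';
}

// Parse the model's reply, falling back to safe defaults on bad output
function parseMetadata(raw: string): ExtractedMetadata {
  const fallback: ExtractedMetadata = {
    detected_language: 'en',
    normalized_sentence: raw.slice(0, 100),
    suggested_categories: [],
    confidence: 'low',
  };
  try {
    const parsed = JSON.parse(raw);
    if (
      typeof parsed.detected_language !== 'string' ||
      typeof parsed.normalized_sentence !== 'string' ||
      !Array.isArray(parsed.suggested_categories)
    ) {
      return fallback;
    }
    return {
      detected_language: parsed.detected_language,
      normalized_sentence: parsed.normalized_sentence,
      suggested_categories: parsed.suggested_categories.slice(0, 4), // prompt caps at 4
      confidence: ['low', 'medium', 'high'].includes(parsed.confidence)
        ? parsed.confidence
        : 'low',
    };
  } catch {
    return fallback;
  }
}
```

Treating a garbled reply as low-confidence English keeps the pipeline moving instead of failing the whole request.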


Challenge 2: Image Compression and Delivery

The Problem

OpenAI generates images at 1024×1024 pixels in PNG format (~500KB-1MB). For a mobile-first app with a grid of flashcards, this is too large.

The Solution

We use Cloudflare Images transformation API to compress on-the-fly:

async function transformAndStoreImage(
  env: Env,
  sourceUrl: string,
  userId: string,
  flashcardId: string
): Promise<string> {
  // Fetch the original image
  const response = await fetch(sourceUrl);
  const imageBuffer = await response.arrayBuffer();

  // Transform to two sizes using Cloudflare Images API
  const sizes = [
    { width: 256, suffix: '256' },  // Grid view
    { width: 512, suffix: '512' },  // Modal view
  ];

  for (const size of sizes) {
    // Cloudflare Images binding: chain input -> transform -> output
    const transformed = await env.IMAGES_TRANSFORM
      .input(new Blob([imageBuffer]).stream())
      .transform({
        width: size.width,
        height: size.width,
        fit: 'cover',
      })
      .output({ format: 'image/webp', quality: 82 });

    // Store in R2
    await env.IMAGES.put(
      `${userId}/${flashcardId}/${size.suffix}.webp`,
      transformed.image()
    );
  }
  }

  return `${userId}/${flashcardId}`;
}

Results:

  • 256px WebP: ~15-25KB (vs ~200KB PNG)
  • 512px WebP: ~30-50KB (vs ~500KB PNG)
  • Roughly 90% bandwidth reduction

The frontend requests the appropriate size:

// Grid view: load small image
<img src={`/image/${imagePath}/256`} />

// Modal view: load larger image
<img src={`/image/${imagePath}/512`} />
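
Since both sizes exist for every card, the browser can also be left to pick via srcSet. A small helper sketch (the /image/... URL shape mirrors the snippets above; the helper name is ours):

```typescript
// Build a srcSet string so the browser requests the smallest adequate size
function buildSrcSet(imagePath: string): string {
  return [256, 512]
    .map((width) => `/image/${imagePath}/${width} ${width}w`)
    .join(', ');
}

// Usage (sketch):
// <img src={`/image/${imagePath}/512`} srcSet={buildSrcSet(imagePath)} />
```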

Challenge 3: Child-Friendly Text-to-Speech

The Problem

Generic TTS sounds robotic and uses inconsistent pacing. Children with communication needs benefit from calm, predictable audio that's suitable for repetition.

The Solution

We crafted detailed TTS instructions that enforce a specific delivery style:

function getTTSInstructions(languageCode?: string): string {
  const baseInstructions = `
    Voice style requirements:
    - Calm, neutral, clear tone
    - No dramatic emphasis or emotion
    - Consistent moderate pacing
    - Suitable for repeated playback
    - Child-appropriate pronunciation
    - No background sounds or effects
  `;

  // Language-specific additions
  const languageInstructions: Record<string, string> = {
    hi: 'Use standard Hindi pronunciation. Avoid regional accents.',
    ta: 'Use clear Tamil pronunciation suitable for children.',
    // ... 58 more languages
  };

  return baseInstructions + ((languageCode && languageInstructions[languageCode]) || '');
}
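
These instructions ride along on the TTS request itself. A hedged sketch of the JSON body for OpenAI's /v1/audio/speech endpoint, which accepts an instructions field for gpt-4o-mini-tts (the voice choice and helper name are our assumptions, not necessarily what PicMe ships):

```typescript
// Assemble the body for POST https://api.openai.com/v1/audio/speech
function buildSpeechRequest(text: string, instructions: string) {
  return {
    model: 'gpt-4o-mini-tts',
    voice: 'alloy',            // assumed; any supported voice works
    input: text,
    instructions,              // the style guide from getTTSInstructions()
    response_format: 'mp3' as const,
  };
}
```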

Audio Caching Strategy

To avoid duplicate TTS generation costs, we cache audio during preview:

sequenceDiagram
    participant User
    participant CreatePage
    participant API
    participant TTS
    User->>CreatePage: Click "Preview Speech"
    CreatePage->>API: Generate TTS
    API->>TTS: Request audio
    TTS-->>API: MP3 stream
    API-->>CreatePage: Audio blob
    CreatePage->>CreatePage: Cache blob + convert to base64
    User->>CreatePage: Click "Save Card"
    CreatePage->>API: Save with cached audio (base64)
    Note over API: Reuses cached audio, skips TTS
    API-->>CreatePage: Card saved

// CreatePage.tsx - Audio caching logic
const [previewAudioCache, setPreviewAudioCache] = useState<{
  text: string;
  language: string;
  audioBase64: string;
} | null>(null);

const handleSave = async () => {
  // Reuse cached audio if text hasn't changed
  const audioToSend =
    previewAudioCache?.text === speechSentence &&
    previewAudioCache?.language === selectedLanguage
      ? previewAudioCache.audioBase64
      : undefined;

  await createFlashcard({
    prompt,
    speechSentence,
    language: selectedLanguage,
    cachedAudio: audioToSend,  // Avoids duplicate TTS call
  });
};
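
On the server side, the cached base64 audio has to be turned back into bytes before it can go into R2. A sketch using the atob available in Workers (the AUDIO bucket binding name and key shape are hypothetical):

```typescript
// Decode a base64 string into raw bytes for R2 storage
function base64ToBytes(b64: string): Uint8Array {
  const binary = atob(b64);
  return Uint8Array.from(binary, (c) => c.charCodeAt(0));
}

// In the save handler (sketch): reuse cached audio instead of calling TTS again
// await env.AUDIO.put(`${userId}/${flashcardId}.mp3`, base64ToBytes(cachedAudio));
```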

Challenge 4: Edge-Native Authentication

The Problem

Workers don't have access to Node.js crypto libraries like bcrypt. We needed secure password hashing using only Web Crypto APIs.

The Solution

PBKDF2 with 100,000 iterations of SHA-256, available through the Web Crypto API, provides protection comparable to bcrypt:

// api/src/utils/crypto.ts
export async function hashPassword(password: string): Promise<string> {
  const encoder = new TextEncoder();
  const salt = crypto.getRandomValues(new Uint8Array(16));

  const keyMaterial = await crypto.subtle.importKey(
    'raw',
    encoder.encode(password),
    'PBKDF2',
    false,
    ['deriveBits']
  );

  const hash = await crypto.subtle.deriveBits(
    {
      name: 'PBKDF2',
      salt,
      iterations: 100000,
      hash: 'SHA-256',
    },
    keyMaterial,
    256
  );

  // Store as: salt:hash (both base64)
  return `${base64Encode(salt)}:${base64Encode(hash)}`;
}

export async function verifyPassword(
  password: string,
  stored: string
): Promise<boolean> {
  const [saltB64, hashB64] = stored.split(':');
  const salt = base64Decode(saltB64);

  // Derive hash with same parameters
  const keyMaterial = await crypto.subtle.importKey(
    'raw',
    new TextEncoder().encode(password),
    'PBKDF2',
    false,
    ['deriveBits']
  );

  const hash = await crypto.subtle.deriveBits(
    {
      name: 'PBKDF2',
      salt,
      iterations: 100000,
      hash: 'SHA-256',
    },
    keyMaterial,
    256
  );

  return base64Encode(hash) === hashB64;
}
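
One nit worth noting: the final === compares the encoded hashes in non-constant time. The practical risk is small for random salted hashes, but a constant-time comparison is cheap (Workers also expose a non-standard crypto.subtle.timingSafeEqual); a portable sketch:

```typescript
// Compare two byte arrays in time independent of where they first differ
function timingSafeEqualBytes(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a[i] ^ b[i]; // accumulate differences without early exit
  }
  return diff === 0;
}
```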

Session tokens use cryptographically secure random bytes:

export function generateToken(): string {
  const bytes = crypto.getRandomValues(new Uint8Array(32));
  return base64UrlEncode(bytes);
}
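
The base64Encode, base64UrlEncode, and base64Decode helpers referenced above aren't shown in the post; one plausible Workers-compatible implementation using btoa/atob:

```typescript
// Encode raw bytes as standard base64
function base64Encode(data: ArrayBuffer | Uint8Array): string {
  const bytes = data instanceof Uint8Array ? data : new Uint8Array(data);
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// URL-safe variant for tokens: swap +/ for -_ and drop padding
function base64UrlEncode(bytes: Uint8Array): string {
  return base64Encode(bytes)
    .replace(/\+/g, '-')
    .replace(/\//g, '_')
    .replace(/=+$/, '');
}

// Decode standard base64 back into bytes (used for the stored salt)
function base64Decode(s: string): Uint8Array {
  const binary = atob(s);
  return Uint8Array.from(binary, (c) => c.charCodeAt(0));
}
```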

Challenge 5: Usage Limits and Quota Management

The Problem

AI image generation is expensive (~$0.04 per image). We needed a fair usage system that encourages quality over quantity.

The Solution

A two-tier system with smart quota management:

// api/src/services/usage.ts
interface UsageData {
  plan: 'free' | 'personal';
  imageAttempts: number;      // Monthly count
  savedCards: number;         // Total saved
  currentMonth: string;       // "2025-02" format
}

const LIMITS = {
  free: { monthlyAttempts: 10, maxSavedCards: 15 },
  personal: { monthlyAttempts: 100, maxSavedCards: Infinity },
};

export async function checkAndIncrementUsage(
  kv: KVNamespace,
  userId: string
): Promise<{ allowed: boolean; remaining: number }> {
  const usage = await getUsage(kv, userId);
  const limits = LIMITS[usage.plan];

  // Auto-reset on new month
  const currentMonth = getCurrentMonth();
  if (usage.currentMonth !== currentMonth) {
    usage.imageAttempts = 0;
    usage.currentMonth = currentMonth;
  }

  if (usage.imageAttempts >= limits.monthlyAttempts) {
    return { allowed: false, remaining: 0 };
  }

  usage.imageAttempts++;
  await saveUsage(kv, userId, usage);

  return {
    allowed: true,
    remaining: limits.monthlyAttempts - usage.imageAttempts
  };
}
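
The getCurrentMonth, getUsage, and saveUsage helpers are implied above; a minimal sketch (UTC month keys and the usage:{userId} KV key shape are our assumptions):

```typescript
interface UsageData {
  plan: 'free' | 'personal';
  imageAttempts: number;
  savedCards: number;
  currentMonth: string; // "2025-02" format
}

// Minimal KV surface, so the sketch stands alone
type KVLike = {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
};

// "2025-02"-style key for the current UTC month
function getCurrentMonth(now: Date = new Date()): string {
  const y = now.getUTCFullYear();
  const m = String(now.getUTCMonth() + 1).padStart(2, '0');
  return `${y}-${m}`;
}

// KV accessors; new users start on the free plan
async function getUsage(kv: KVLike, userId: string): Promise<UsageData> {
  const raw = await kv.get(`usage:${userId}`);
  return raw
    ? (JSON.parse(raw) as UsageData)
    : { plan: 'free', imageAttempts: 0, savedCards: 0, currentMonth: getCurrentMonth() };
}

async function saveUsage(kv: KVLike, userId: string, usage: UsageData): Promise<void> {
  await kv.put(`usage:${userId}`, JSON.stringify(usage));
}
```

Pinning the month to UTC avoids quota resets drifting with the user's timezone.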

Clever refund mechanism: When a user saves a generated image (accepts it), we refund one attempt. This encourages accepting good images rather than regenerating endlessly:

export async function refundAttempt(
  kv: KVNamespace,
  userId: string
): Promise<void> {
  const usage = await getUsage(kv, userId);

  // Decrement the attempt count, never below zero
  if (usage.imageAttempts > 0) {
    usage.imageAttempts--;
    await saveUsage(kv, userId, usage);
  }
}

Challenge 6: Backwards Compatibility

The Problem

We migrated from PNG to WebP images mid-project. Existing users had cards with the old format that needed to continue working.

The Solution

Support both formats with graceful fallback:

// api/src/routes/flashcards.ts
app.get('/image/:path{.+}', async (c) => {
  const path = c.req.param('path');

  // Try new WebP format first: {userId}/{cardId}/{size}.webp
  const webpKey = `${path}.webp`;
  let image = await c.env.IMAGES.get(webpKey);

  if (image) {
    return new Response(image.body, {
      headers: { 'Content-Type': 'image/webp' },
    });
  }

  // Fallback to legacy PNG: {userId}/{cardId}.png
  const legacyKey = path.replace(/\/\d+$/, '') + '.png';
  image = await c.env.IMAGES.get(legacyKey);

  if (image) {
    return new Response(image.body, {
      headers: { 'Content-Type': 'image/png' },
    });
  }

  return c.notFound();
});

The flashcard metadata tracks both formats:

interface Flashcard {
  id: string;
  // New format
  imagePath?: string;    // "{userId}/{cardId}" - size appended at request time
  // Legacy format
  imageKey?: string;     // "{userId}/{cardId}.png" - full path
}
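
On the frontend, resolving a card's URL from these two optional fields looks roughly like this (the helper name is ours; the legacy URL works because the fallback route above appends .png after stripping any trailing size segment):

```typescript
interface CardImageRef {
  imagePath?: string; // new: "{userId}/{cardId}"
  imageKey?: string;  // legacy: "{userId}/{cardId}.png"
}

// Pick the /image/ URL for a card; the size only applies to the new format
function imageUrl(card: CardImageRef, size: 256 | 512): string | null {
  if (card.imagePath) return `/image/${card.imagePath}/${size}`;
  // Legacy cards have a single full-size PNG; the route's PNG fallback finds it
  if (card.imageKey) return `/image/${card.imageKey.replace(/\.png$/, '')}`;
  return null;
}
```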

Challenge 7: Initial Load Flash

The Problem

On page load, the app would briefly show the login page before checking if the user was already authenticated, causing a jarring flash.

The Solution

Add a loading state that blocks rendering until auth is confirmed:

// web/src/contexts/AuthContext.tsx
export function AuthProvider({ children }: { children: React.ReactNode }) {
  const [user, setUser] = useState<User | null>(null);
  const [isLoading, setIsLoading] = useState(true);

  useEffect(() => {
    const token = localStorage.getItem('token');
    if (!token) {
      setIsLoading(false);
      return;
    }

    // Validate token with API
    api.get('/auth/me')
      .then((response) => {
        setUser(response.data.user);
      })
      .catch(() => {
        localStorage.removeItem('token');
      })
      .finally(() => {
        setIsLoading(false);
      });
  }, []);

  if (isLoading) {
    return <LoadingSpinner fullScreen />;
  }

  return (
    <AuthContext.Provider value={{ user, isLoading }}>
      {children}
    </AuthContext.Provider>
  );
}

Data Flow: Creating a Flashcard

Here's the complete flow when a user creates a new flashcard:

sequenceDiagram
    participant User
    participant React as React App
    participant Worker as Cloudflare Worker
    participant KV
    participant R2
    participant GPT4 as gpt-4o-mini
    participant DALLE as gpt-image-1
    participant TTS as gpt-4o-mini-tts
    participant ImgAPI as Images API
    User->>React: Enter prompt + language
    React->>Worker: POST /flashcards/generate
    Worker->>KV: Check usage limits
    KV-->>Worker: {attempts: 5, limit: 10}
    Worker->>GPT4: Extract metadata
    GPT4-->>Worker: {language, normalized, categories}
    Worker->>DALLE: Generate image
    DALLE-->>Worker: Temporary image URL
    Worker-->>React: {tempImageUrl, metadata}
    React->>React: Display preview
    User->>React: Click "Preview Speech"
    React->>Worker: POST /flashcards/preview-speech
    Worker->>TTS: Generate audio
    TTS-->>Worker: MP3 stream
    Worker-->>React: Audio blob
    React->>React: Play audio + cache
    User->>React: Click "Save"
    React->>Worker: POST /flashcards (with cached audio)
    Worker->>Worker: Fetch temp image
    Worker->>ImgAPI: Transform to WebP (256px, 512px)
    ImgAPI-->>Worker: Transformed images
    Worker->>R2: Store images
    Worker->>R2: Store audio
    Worker->>KV: Save flashcard metadata
    Worker->>KV: Refund 1 attempt
    Worker-->>React: {flashcard}
    React->>React: Add to grid

Deployment Architecture

flowchart LR
    subgraph Development
        DevWeb[localhost:5173]
        DevAPI[localhost:8787]
    end
    subgraph "Cloudflare Edge"
        Pages[Cloudflare Pages<br/>picme.scopecreeplabs.com]
        Workers[Cloudflare Workers<br/>picme-api.workers.dev]
        KV[(KV Namespace)]
        R2[(R2 Bucket<br/>picme-images)]
    end
    DevWeb --> DevAPI
    Pages --> Workers
    Workers --> KV
    Workers --> R2

Deployment commands:

# Deploy API
cd api && npx wrangler deploy

# Deploy frontend
cd web && npm run build
npx wrangler pages deploy dist

Environment configuration (wrangler.toml):

name = "picme-api"
main = "src/index.ts"
compatibility_date = "2024-12-01"
compatibility_flags = ["nodejs_compat"]

[[kv_namespaces]]
binding = "KV"
id = "f0ce751e..."

[[r2_buckets]]
binding = "IMAGES"
bucket_name = "picme-images"

[images]
binding = "IMAGES_TRANSFORM"

[vars]
CORS_ORIGIN = "https://picme.scopecreeplabs.com"

Key Takeaways

  1. LLMs excel at dynamic parsing and generation: Traditional approaches to multi-lingual input would require language detection libraries, translation APIs, and hand-crafted parsing rules for each language. LLMs collapse this complexity into a single prompt—they inherently understand context, handle transliterated text, normalize meaning across languages, and generate structured output. When your requirements involve "understand arbitrary user input and produce something useful," LLMs are the right tool.

  2. AI pipelines benefit from staging: Using a fast model (gpt-4o-mini) for metadata extraction before expensive image generation improves reliability and allows language normalization.

  3. Image optimization is critical: Transforming images at the edge (WebP, multiple sizes) dramatically improves mobile experience.

  4. Design for accessibility: Child-friendly TTS with specific instructions produces better results than generic voices.


What's Next

  • Thin-client iOS and Android app: Offline-first Flutter-based app for the communicator
  • Collaborative decks: Share card collections between users
  • Custom categories: User-defined category taxonomies
  • Print mode: Export cards for physical flashcard decks

PicMe can be tried out at https://picme.scopecreeplabs.com/. Built with ❤️ for people who communicate differently.