Building PicMe: How LLMs Solved Multi-Lingual Input Without Writing a Single Parser
Introduction
PicMe is a flashcard generation app designed to help children with communication needs. Users describe what they want in any language, and the app generates a clean, child-friendly illustration with text-to-speech support. The app needed to be fast, globally accessible, and able to support 60+ languages.
This post walks through the architecture decisions, technical challenges, and solutions we implemented to build PicMe.
Architecture Overview
PicMe follows an edge-first architecture, running entirely on Cloudflare's global network. There are no traditional servers—just Workers, KV storage, and R2 object storage.
Why Cloudflare Workers?
- Global latency: Code runs in 300+ data centers worldwide
- No cold starts: Workers start in under 5ms
- Integrated storage: KV and R2 are first-class citizens
- Cost efficient: Pay per request, not per hour
Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + Material UI 6 | Component library with accessibility built-in |
| Build | Vite 6 | Fast HMR and optimized production builds |
| Backend | Cloudflare Workers + Hono | Lightweight edge-native API framework |
| Auth | PBKDF2 + Session Tokens | Web Crypto API (no external dependencies) |
| Database | Cloudflare KV | User data, flashcard metadata, sessions |
| Object Storage | Cloudflare R2 | Images (WebP) and audio (MP3) |
| AI - Images | OpenAI gpt-image-1 | Child-friendly illustration generation |
| AI - Text | OpenAI gpt-4o-mini | Language detection and metadata extraction |
| AI - Speech | OpenAI gpt-4o-mini-tts | Multi-lingual text-to-speech |
Project Structure
picme/
├── web/ # React frontend
│ └── src/
│ ├── components/ # FlashcardTile, FlashcardModal
│ ├── contexts/ # AuthContext, QuickChoicesContext
│ ├── hooks/ # useSpeech, useUsage
│ ├── pages/ # HomePage, CreatePage, SequencesPage
│ └── services/ # API client with TanStack Query
│
├── api/ # Cloudflare Workers backend
│ └── src/
│ ├── routes/ # auth.ts, flashcards.ts, sequences.ts
│ ├── middleware/ # Session validation
│ ├── services/ # OpenAI integration, usage tracking
│ └── utils/ # Crypto (PBKDF2, token generation)
│
└── shared/ # Shared TypeScript types
└── types/ # User, Flashcard, API contracts
Challenge 1: Multi-Lingual Input Processing
The Problem
Users might type in any language—including transliterated text. A Hindi speaker might type "mujhe paani chahiye" (I want water) in Latin script, not Devanagari. The AI needs to understand this and generate appropriate images.
The Solution
We use a two-stage AI pipeline: gpt-4o-mini first extracts structured metadata, including a normalized English sentence, and that sentence then drives image generation with gpt-image-1.
The metadata extraction prompt handles 60+ languages, with priority for Indian languages:
const METADATA_EXTRACTION_PROMPT = `
You are a flashcard metadata extractor for children's AAC communication.
Given a user's input text (which may be in any language, including
transliterated text like Hindi written in Latin script), extract:
1. detected_language: ISO 639-1 code
2. normalized_sentence: Simple English sentence describing the image
3. suggested_categories: 1-4 categories from the predefined list
4. confidence: low/medium/high
Prioritized languages: Hindi, Tamil, Telugu, Kannada, Malayalam,
Bengali, Marathi, Gujarati, Punjabi, Urdu...
`;
This approach means a user in Tamil Nadu can type "தண்ணீர் வேண்டும்" or the transliterated "thanni venum" (both Tamil for "I want water") and get the same result.
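The two stages can be sketched as follows. This is a minimal sketch against OpenAI's REST API from a Worker, not the production code: `buildExtractionRequest`, `generateFlashcard`, and the system-prompt parameter are illustrative names.

```typescript
// Sketch of the two-stage pipeline. The system prompt is the
// METADATA_EXTRACTION_PROMPT shown above, passed in as a parameter
// here to keep the sketch self-contained.

interface ExtractedMetadata {
  detected_language: string;       // ISO 639-1 code
  normalized_sentence: string;     // simple English sentence for the image
  suggested_categories: string[];
  confidence: 'low' | 'medium' | 'high';
}

// Stage 1 request: fast, cheap metadata extraction with gpt-4o-mini.
export function buildExtractionRequest(systemPrompt: string, userInput: string) {
  return {
    model: 'gpt-4o-mini',
    response_format: { type: 'json_object' as const },
    messages: [
      { role: 'system' as const, content: systemPrompt },
      { role: 'user' as const, content: userInput },
    ],
  };
}

export async function generateFlashcard(
  apiKey: string,
  systemPrompt: string,
  userInput: string
) {
  const headers = {
    Authorization: `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  };

  // Stage 1: normalize the (possibly transliterated) input.
  const chatRes = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers,
    body: JSON.stringify(buildExtractionRequest(systemPrompt, userInput)),
  });
  const chat = (await chatRes.json()) as {
    choices: { message: { content: string } }[];
  };
  const meta: ExtractedMetadata = JSON.parse(chat.choices[0].message.content);

  // Stage 2: generate the illustration from the normalized English sentence.
  const imageRes = await fetch('https://api.openai.com/v1/images/generations', {
    method: 'POST',
    headers,
    body: JSON.stringify({
      model: 'gpt-image-1',
      prompt: `Simple, child-friendly flashcard illustration: ${meta.normalized_sentence}`,
      size: '1024x1024',
    }),
  });
  return { meta, image: await imageRes.json() };
}
```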
Challenge 2: Image Compression and Delivery
The Problem
OpenAI generates images at 1024×1024 pixels in PNG format (~500KB-1MB). For a mobile-first app with a grid of flashcards, this is too large.
The Solution
We use the Cloudflare Images binding to transform and compress on the fly:
async function transformAndStoreImage(
env: Env,
sourceUrl: string,
userId: string,
flashcardId: string
): Promise<string> {
// Fetch the original image
const response = await fetch(sourceUrl);
const imageBuffer = await response.arrayBuffer();
// Transform to two sizes using Cloudflare Images API
const sizes = [
{ width: 256, suffix: '256' }, // Grid view
{ width: 512, suffix: '512' }, // Modal view
];
  for (const size of sizes) {
    // Cloudflare Images binding: input stream -> transform -> output format
    const result = await env.IMAGES_TRANSFORM
      .input(new Blob([imageBuffer]).stream())
      .transform({ width: size.width, height: size.width, fit: 'cover' })
      .output({ format: 'image/webp', quality: 82 });
    // Store in R2
    await env.IMAGES.put(
      `${userId}/${flashcardId}/${size.suffix}.webp`,
      result.image()
    );
  }
return `${userId}/${flashcardId}`;
}
Results:
- 256px WebP: ~15-25KB (vs ~200KB PNG)
- 512px WebP: ~30-50KB (vs ~500KB PNG)
- Roughly 90% less bandwidth per image
The frontend requests the appropriate size:
// Grid view: load the small image (label is the card's text, for screen readers)
<img src={`/image/${imagePath}/256`} alt={label} loading="lazy" />
// Modal view: load the larger image
<img src={`/image/${imagePath}/512`} alt={label} />
Challenge 3: Child-Friendly Text-to-Speech
The Problem
Generic TTS sounds robotic and uses inconsistent pacing. Children with communication needs benefit from calm, predictable audio that's suitable for repetition.
The Solution
We crafted detailed TTS instructions that enforce a specific delivery style:
function getTTSInstructions(languageCode?: string): string {
const baseInstructions = `
Voice style requirements:
- Calm, neutral, clear tone
- No dramatic emphasis or emotion
- Consistent moderate pacing
- Suitable for repeated playback
- Child-appropriate pronunciation
- No background sounds or effects
`;
// Language-specific additions
const languageInstructions: Record<string, string> = {
hi: 'Use standard Hindi pronunciation. Avoid regional accents.',
ta: 'Use clear Tamil pronunciation suitable for children.',
// ... 58 more languages
};
return baseInstructions + (languageInstructions[languageCode] || '');
}
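These instructions accompany every synthesis request. A minimal sketch of that call, assuming OpenAI's `/v1/audio/speech` endpoint, which accepts an `instructions` field for gpt-4o-mini-tts (the voice name and helper names here are illustrative):

```typescript
// Hypothetical sketch of the TTS call; the instructions string is the
// output of getTTSInstructions above, passed in as a parameter.
export function buildSpeechRequest(text: string, instructions: string) {
  return {
    model: 'gpt-4o-mini-tts',
    voice: 'alloy',             // illustrative voice choice
    input: text,
    instructions,               // the style instructions built above
    response_format: 'mp3',     // matches the MP3 files stored in R2
  };
}

export async function generateSpeech(
  apiKey: string,
  text: string,
  instructions: string
): Promise<ArrayBuffer> {
  const response = await fetch('https://api.openai.com/v1/audio/speech', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(buildSpeechRequest(text, instructions)),
  });
  if (!response.ok) throw new Error(`TTS failed: ${response.status}`);
  return response.arrayBuffer(); // raw MP3 bytes, ready for R2
}
```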
Audio Caching Strategy
To avoid duplicate TTS generation costs, we cache audio during preview:
// CreatePage.tsx - Audio caching logic
const [previewAudioCache, setPreviewAudioCache] = useState<{
text: string;
language: string;
audioBase64: string;
} | null>(null);
const handleSave = async () => {
// Reuse cached audio if text hasn't changed
const audioToSend =
previewAudioCache?.text === speechSentence &&
previewAudioCache?.language === selectedLanguage
? previewAudioCache.audioBase64
: undefined;
await createFlashcard({
prompt,
speechSentence,
language: selectedLanguage,
cachedAudio: audioToSend, // Avoids duplicate TTS call
});
};
Challenge 4: Edge-Native Authentication
The Problem
Workers can't load native Node.js modules like bcrypt. We needed secure password hashing built only on the Web Crypto API.
The Solution
PBKDF2 with 100,000 iterations, available through the Web Crypto API, provides protection comparable to bcrypt:
// api/src/utils/crypto.ts
export async function hashPassword(password: string): Promise<string> {
const encoder = new TextEncoder();
const salt = crypto.getRandomValues(new Uint8Array(16));
const keyMaterial = await crypto.subtle.importKey(
'raw',
encoder.encode(password),
'PBKDF2',
false,
['deriveBits']
);
const hash = await crypto.subtle.deriveBits(
{
name: 'PBKDF2',
salt,
iterations: 100000,
hash: 'SHA-256',
},
keyMaterial,
256
);
// Store as: salt:hash (both base64)
return `${base64Encode(salt)}:${base64Encode(hash)}`;
}
export async function verifyPassword(
password: string,
stored: string
): Promise<boolean> {
const [saltB64, hashB64] = stored.split(':');
const salt = base64Decode(saltB64);
// Derive hash with same parameters
const keyMaterial = await crypto.subtle.importKey(
'raw',
new TextEncoder().encode(password),
'PBKDF2',
false,
['deriveBits']
);
const hash = await crypto.subtle.deriveBits(
{
name: 'PBKDF2',
salt,
iterations: 100000,
hash: 'SHA-256',
},
keyMaterial,
256
);
return base64Encode(hash) === hashB64;
}
Session tokens use cryptographically secure random bytes:
export function generateToken(): string {
const bytes = crypto.getRandomValues(new Uint8Array(32));
return base64UrlEncode(bytes);
}
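On each authenticated request, middleware resolves the token back to a user. A sketch of that lookup, assuming sessions live in KV under a `session:{token}` key (the key layout and helper names are illustrative, not the production schema):

```typescript
// Hypothetical session-resolution sketch; the "session:{token}" -> userId
// KV layout is an assumption, not the production schema.

// Pure helper: pull the bearer token out of an Authorization header.
export function parseBearer(header: string | null): string | null {
  if (!header || !header.startsWith('Bearer ')) return null;
  const token = header.slice('Bearer '.length).trim();
  return token.length > 0 ? token : null;
}

// Minimal KV surface so the sketch stays self-contained.
interface KVLike {
  get(key: string): Promise<string | null>;
}

// Resolve a request's session to a userId, or null if unauthenticated.
export async function resolveSession(
  kv: KVLike,
  authorizationHeader: string | null
): Promise<string | null> {
  const token = parseBearer(authorizationHeader);
  if (!token) return null;
  // Expired sessions vanish automatically via KV's per-key TTL.
  return kv.get(`session:${token}`);
}
```

The middleware itself just calls `resolveSession` and returns 401 when it yields null.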
Challenge 5: Usage Limits and Quota Management
The Problem
AI image generation is expensive (~$0.04 per image). We needed a fair usage system that encourages quality over quantity.
The Solution
A two-tier system with smart quota management:
// api/src/services/usage.ts
interface UsageData {
plan: 'free' | 'personal';
imageAttempts: number; // Monthly count
savedCards: number; // Total saved
currentMonth: string; // "2025-02" format
}
const LIMITS = {
free: { monthlyAttempts: 10, maxSavedCards: 15 },
personal: { monthlyAttempts: 100, maxSavedCards: Infinity },
};
export async function checkAndIncrementUsage(
kv: KVNamespace,
userId: string
): Promise<{ allowed: boolean; remaining: number }> {
const usage = await getUsage(kv, userId);
const limits = LIMITS[usage.plan];
// Auto-reset on new month
const currentMonth = getCurrentMonth();
if (usage.currentMonth !== currentMonth) {
usage.imageAttempts = 0;
usage.currentMonth = currentMonth;
}
if (usage.imageAttempts >= limits.monthlyAttempts) {
return { allowed: false, remaining: 0 };
}
usage.imageAttempts++;
await saveUsage(kv, userId, usage);
return {
allowed: true,
remaining: limits.monthlyAttempts - usage.imageAttempts
};
}
Clever refund mechanism: When a user saves a generated image (accepts it), we refund one attempt. This encourages accepting good images rather than regenerating endlessly:
export async function refundAttempt(
kv: KVNamespace,
userId: string
): Promise<void> {
  const usage = await getUsage(kv, userId);
  // Refunds only undo a spent attempt; the count never drops below zero
  if (usage.imageAttempts > 0) {
    usage.imageAttempts -= 1;
    await saveUsage(kv, userId, usage);
  }
}
Challenge 6: Backwards Compatibility
The Problem
We migrated from PNG to WebP images mid-project. Existing users had cards with the old format that needed to continue working.
The Solution
Support both formats with graceful fallback:
// api/src/routes/flashcards.ts
app.get('/image/:path{.+}', async (c) => {
const path = c.req.param('path');
// Try new WebP format first: {userId}/{cardId}/{size}.webp
const webpKey = `${path}.webp`;
let image = await c.env.IMAGES.get(webpKey);
if (image) {
return new Response(image.body, {
headers: { 'Content-Type': 'image/webp' },
});
}
// Fallback to legacy PNG: {userId}/{cardId}.png
const legacyKey = path.replace(/\/\d+$/, '') + '.png';
image = await c.env.IMAGES.get(legacyKey);
if (image) {
return new Response(image.body, {
headers: { 'Content-Type': 'image/png' },
});
}
return c.notFound();
});
The flashcard metadata tracks both formats:
interface Flashcard {
id: string;
// New format
imagePath?: string; // "{userId}/{cardId}" - size appended at request time
// Legacy format
imageKey?: string; // "{userId}/{cardId}.png" - full path
}
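On the frontend, a small helper can make the dual format invisible to components. This is a hypothetical sketch (`resolveImageSrc` is not the actual client code); it relies on the fallback route above accepting the same `{path}/{size}` URL shape for both formats, since the PNG fallback strips the trailing size segment:

```typescript
// Hypothetical client helper; mirrors the Flashcard metadata shape above.
interface Flashcard {
  id: string;
  imagePath?: string; // new format: "{userId}/{cardId}"
  imageKey?: string;  // legacy format: "{userId}/{cardId}.png"
}

export function resolveImageSrc(card: Flashcard, size: 256 | 512): string {
  if (card.imagePath) {
    // New WebP cards: the API appends ".webp" to "{path}/{size}"
    return `/image/${card.imagePath}/${size}`;
  }
  if (card.imageKey) {
    // Legacy PNG cards: strip the extension and use the same URL shape;
    // the API's PNG fallback removes the size segment and re-adds ".png"
    return `/image/${card.imageKey.replace(/\.png$/, '')}/${size}`;
  }
  throw new Error(`Flashcard ${card.id} has no image reference`);
}
```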
Challenge 7: Initial Load Flash
The Problem
On page load, the app would briefly show the login page before checking if the user was already authenticated, causing a jarring flash.
The Solution
Add a loading state that blocks rendering until auth is confirmed:
// web/src/contexts/AuthContext.tsx
export function AuthProvider({ children }: { children: React.ReactNode }) {
const [user, setUser] = useState<User | null>(null);
const [isLoading, setIsLoading] = useState(true);
useEffect(() => {
const token = localStorage.getItem('token');
if (!token) {
setIsLoading(false);
return;
}
// Validate token with API
api.get('/auth/me')
.then((response) => {
setUser(response.data.user);
})
.catch(() => {
localStorage.removeItem('token');
})
.finally(() => {
setIsLoading(false);
});
}, []);
if (isLoading) {
return <LoadingSpinner fullScreen />;
}
return (
<AuthContext.Provider value={{ user, isLoading }}>
{children}
</AuthContext.Provider>
);
}
Data Flow: Creating a Flashcard
Here's the complete flow when a user creates a new flashcard:
1. The user types a description in any language (or transliteration) on CreatePage.
2. gpt-4o-mini extracts metadata: detected language, a normalized English sentence, and suggested categories.
3. The usage check runs; if the user has attempts remaining, gpt-image-1 generates a 1024×1024 illustration.
4. The image is transformed into 256px and 512px WebP variants and stored in R2.
5. gpt-4o-mini-tts generates the audio (reusing the preview cache when the text is unchanged), stored as MP3 in R2.
6. The flashcard metadata is written to KV, and saving the card refunds one generation attempt.
Deployment Architecture
Deployment commands:
# Deploy API
cd api && npx wrangler deploy
# Deploy frontend
cd web && npm run build
npx wrangler pages deploy dist
Environment configuration (wrangler.toml):
name = "picme-api"
main = "src/index.ts"
compatibility_date = "2024-12-01"
compatibility_flags = ["nodejs_compat"]
[[kv_namespaces]]
binding = "KV"
id = "f0ce751e..."
[[r2_buckets]]
binding = "IMAGES"
bucket_name = "picme-images"
[images]
binding = "IMAGES_TRANSFORM"
[vars]
CORS_ORIGIN = "https://picme.scopecreeplabs.com"
Key Takeaways
LLMs excel at dynamic parsing and generation: Traditional approaches to multi-lingual input would require language detection libraries, translation APIs, and hand-crafted parsing rules for each language. LLMs collapse this complexity into a single prompt—they inherently understand context, handle transliterated text, normalize meaning across languages, and generate structured output. When your requirements involve "understand arbitrary user input and produce something useful," LLMs are the right tool.
AI pipelines benefit from staging: Using a fast model (gpt-4o-mini) for metadata extraction before expensive image generation improves reliability and allows language normalization.
Image optimization is critical: Transforming images at the edge (WebP, multiple sizes) dramatically improves mobile experience.
Design for accessibility: Child-friendly TTS with specific instructions produces better results than generic voices.
What's Next
- iOS and Android app: Offline-first, Flutter-based app for the communicator
- Collaborative decks: Share card collections between users
- Custom categories: User-defined category taxonomies
- Print mode: Export cards for physical flashcard decks
PicMe can be tried out at https://picme.scopecreeplabs.com/. Built with ❤️ for people who communicate differently.