How it works — Ostronaut

01 Capture: Rover device · far-field mic · always-on

A purpose-built device sits on the counter, with a wearable for the field. The hard constraint: zero behaviour change from staff. No app, no login, no badge to tap. Audio uploads the moment the device finds a network, including over the cellular fallback we keep on for stores with patchy wifi.

What we built

Far-field mic array tuned for the 60–90cm distance at a retail counter
Per-device acoustic calibration — no two stores sound the same
Cellular fallback so stores with patchy wifi do not drop hours of capture
Local encryption on the device; nothing leaves the floor unencrypted
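
The capture behaviour above (buffer on the device, upload as soon as any link is available, prefer wifi and fall back to cellular) can be sketched roughly as follows. This is an illustrative sketch, not the Rover firmware: `UploadQueue`, the link names, and the `is_up` and `send` callbacks are all hypothetical, and the chunks are assumed to be encrypted before they reach the queue.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class UploadQueue:
    """Buffers already-encrypted audio chunks until a network link is available."""

    # Hypothetical link names in priority order: wifi first, cellular as fallback.
    links: List[str] = field(default_factory=lambda: ["wifi", "cellular"])
    pending: Deque[bytes] = field(default_factory=deque)

    def enqueue(self, encrypted_chunk: bytes) -> None:
        self.pending.append(encrypted_chunk)

    def flush(
        self,
        is_up: Callable[[str], bool],
        send: Callable[[str, bytes], bool],
    ) -> int:
        """Send pending chunks over the first link that is up; return the count sent."""
        link = next((name for name in self.links if is_up(name)), None)
        if link is None:
            return 0  # no network at all: keep everything buffered
        sent = 0
        while self.pending:
            if not send(link, self.pending[0]):
                break  # failed send: keep the chunk for the next flush
            self.pending.popleft()
            sent += 1
        return sent
```

A chunk is only removed from the queue after `send` reports success, so a dropped connection delays the upload without losing audio.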

02 Assemble: Conversation windows · pause + proximity

Our AI infers the boundary of every customer conversation from pause and proximity in the signal — not from uniform clock windows. A 17-minute interaction at one counter and a 38-second one at the next stay separate, with their own speaker assignments and event tags, even if they share a shift.

What we built

Conversation boundaries inferred from the signal, not from clock buckets
Same-counter continuity across shifts and devices
Per-shift staff handoff that doesn’t lose speaker identity
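
To make the pause-plus-proximity idea concrete, here is a minimal sketch of boundary detection over speech segments. It is an assumption-laden illustration, not the production model: the `Segment` fields, the `proximity_lost` flag, and both thresholds are hypothetical, and the real system infers boundaries rather than applying fixed rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float          # seconds from the start of the shift
    end: float
    proximity_lost: bool  # hypothetical flag: customer moved away during the preceding silence


def split_conversations(
    segments: List[Segment],
    short_pause: float = 8.0,   # illustrative threshold, not a measured value
    long_pause: float = 45.0,   # illustrative threshold, not a measured value
) -> List[List[Segment]]:
    """Group segments into conversations using pause length and proximity."""
    conversations: List[List[Segment]] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if conversations:
            gap = seg.start - conversations[-1][-1].end
            # Short pauses always continue the conversation; longer ones only
            # continue it if the customer stayed at the counter.
            same = gap <= short_pause or (gap <= long_pause and not seg.proximity_lost)
            if same:
                conversations[-1].append(seg)
                continue
        conversations.append([seg])
    return conversations
```

Because the split depends on gaps in the signal rather than on clock buckets, a long interaction and a very short one end up as separate conversations of different lengths.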

03 Transcribe: Hindi · Marathi · English · code-mixed

Devanagari is preserved exactly as spoken. Multilingual code-switching inside a single utterance — “sir, paracetamol नहीं है, but ये देख लीजिए” — is handled in one pass. No language toggle. Our AI transcription pipeline routes between primary and fallback models on confidence, not on language hints.

What we built

Code-mixed Hindi · Marathi · English in a single forward pass
Devanagari preserved as Devanagari — never transliterated to Latin
Multi-stage routing: primary model + fallback + confidence gate
Per-segment confidence used downstream — uncertain turns are flagged, not silently shipped
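
The confidence gate described above might look something like this in outline. The names (`Transcript`, `transcribe`, `accept`, `review_below`) and the threshold values are assumptions made for illustration; the recognisers are passed in as plain callables returning text and a confidence score, and no language hint is used anywhere.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A recogniser takes raw audio and returns (text, confidence in [0, 1]).
Recogniser = Callable[[bytes], Tuple[str, float]]


@dataclass
class Transcript:
    text: str
    confidence: float
    model: str          # which model produced the final text
    needs_review: bool  # flagged for downstream stages instead of silently shipped


def transcribe(
    audio: bytes,
    primary: Recogniser,
    fallback: Recogniser,
    accept: float = 0.85,       # illustrative threshold
    review_below: float = 0.6,  # illustrative threshold
) -> Transcript:
    """Route on confidence: try the primary model, fall back only if it is unsure."""
    text, conf = primary(audio)
    model = "primary"
    if conf < accept:
        fb_text, fb_conf = fallback(audio)
        if fb_conf > conf:
            text, conf, model = fb_text, fb_conf, "fallback"
    return Transcript(text, conf, model, needs_review=conf < review_below)
```

The per-segment confidence travels with the transcript, so later stages can treat uncertain turns differently rather than discovering the uncertainty after the fact.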

What we see · spectrogram of one counter conversation

[Image: frequency spectrogram of a pharmacy counter conversation]

A frequency-vs-time view of a real conversation our pipeline ingested. The bright vertical bands are speech; the warm-toned high-frequency clusters are sibilants and consonants — the parts that carry word identity. The clarity of these bands under our denoising is what makes Devanagari transcription possible at the noise floor of an Indian retail counter.

Why we built our own transcription

We tested four major transcription vendors before deciding to build our own. One transliterated Devanagari into Latin script, losing the language. One missed Marathi entirely and defaulted to English. One hallucinated under noise, inventing words on long Hindi clips. None of the four exposed the confidence signal we need to know when to ask for human review.

We keep the head-to-head numbers internal so we don’t end up in a comparison war. Ask us on a call — we’ll show you the receipts.

04 Diarise: Staff vs. customer, per turn

Our diarisation AI attributes each turn to the right person. Staff are recognised by voice embedding across conversations and shifts. Customers stay anonymous. No enrolment ritual — staff don’t have to read a script into a microphone to be identified.

What we built

Our own diarisation pipeline — purpose-built for noisy, multi-speaker counter audio
Staff recognised by voice across conversations; customers stay anonymous
No enrolment ritual — staff change nothing about how they work
When the audio is ambiguous, we ask before we answer — your team only sees what we’re sure of
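
One common way to recognise a known speaker from a voice embedding is cosine similarity against stored reference vectors. The sketch below uses that technique as a stand-in; it is not a description of the actual diarisation pipeline. The function names, the thresholds, and the idea of one reference vector per staff member are assumptions.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def attribute_turn(
    embedding: Sequence[float],
    staff_profiles: Dict[str, Sequence[float]],  # hypothetical: staff id -> reference embedding
    match: float = 0.75,   # illustrative thresholds
    reject: float = 0.5,
    margin: float = 0.1,
) -> str:
    """Return a staff id, "customer" (kept anonymous), or "ambiguous"."""
    scores = sorted(
        ((cosine(embedding, ref), sid) for sid, ref in staff_profiles.items()),
        reverse=True,
    )
    if not scores or scores[0][0] < reject:
        return "customer"  # no staff voice is close: treat as an anonymous customer
    best, staff_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else -1.0
    if best >= match and best - runner_up >= margin:
        return staff_id
    return "ambiguous"  # close call: hold back rather than guess
```

Customers never get a profile in this sketch, so they can only ever be labelled "customer". The "ambiguous" outcome is what would route a turn to confirmation instead of showing a guess.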

Watch the pipeline split a conversation in real time

A real customer conversation, split into staff and customer turns as the audio plays. The red tag at the top marks where our retail-event taxonomy flagged a stockout with no recovery — the kind of moment that walks revenue out the door.

05 Extract: Business events, not topics

Our extraction AI doesn’t hand operators a topic cloud. It surfaces specific business events against a hand-built retail taxonomy: stockout uncoached · bounce unhandled · substitution offered · prescription check missed · return handled well · cross-sell success. Each one ships with the exact line of transcript that triggered it.

What we built

A taxonomy of retail conversation events, not generic NLP topic clusters
Every event ships with the exact line of evidence behind it — no ‘trust us’ signals
Lost-revenue valuation joined against real POS history
We don’t ship a moment to your team unless we’re sure
Where we’re unsure, we say so — your team confirms with one tap and the system gets better
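
As a rough illustration of an event that carries its own evidence and is valued against POS history, consider the sketch below. Everything specific in it is assumed: the `Event` fields, the `sku` join key, the use of a mean historical sale value as the valuation, and the confidence cut-off. The real valuation and taxonomy logic may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass
class Event:
    kind: str          # taxonomy label, e.g. "stockout uncoached"
    evidence: str      # the exact transcript line that triggered the event
    sku: str           # hypothetical product key used to join against POS history
    confidence: float


def value_at_risk(event: Event, pos_history: Dict[str, List[float]]) -> Optional[float]:
    """Estimate lost revenue as the mean historical sale value for the SKU (assumed method)."""
    sales = pos_history.get(event.sku)
    return round(mean(sales), 2) if sales else None


def triage(events: List[Event], min_confidence: float = 0.8) -> Tuple[List[Event], List[Event]]:
    """Split events into those shipped directly and those sent for one-tap confirmation."""
    shipped = [e for e in events if e.confidence >= min_confidence]
    to_confirm = [e for e in events if e.confidence < min_confidence]
    return shipped, to_confirm
```

For example, an event built from the line "sir, paracetamol नहीं है, but ये देख लीजिए" would keep that line in `evidence`, so whoever reviews it sees exactly what was said.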

06 Coach: Inbox → practice → measure

Every event becomes a 90-second practice rep that our AI generates from the actual conversation: the staff member’s own words, the customer’s real ask. The same evaluator runs on practice and production, so improvement on a rep can be checked against the following month’s real conversations.

What we built

Per-event remediation paths into practice scenarios
Same evaluator on practice and production — one standard, one feedback loop
Skill drift tracked across staff, stores, and cohorts
Built on six years of teaching 10,000+ professionals what actually changes behaviour in front of a customer
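
The "one evaluator, two settings" idea can be shown with a toy example. The keyword rubric below is a deliberately crude placeholder for the real evaluator, and the function names and cue phrases are hypothetical. The point is only that the same scoring function is applied to practice reps and to production conversations, which makes the scores comparable.

```python
from statistics import mean
from typing import Callable, Dict, List

Evaluator = Callable[[str], float]


def offered_alternative(transcript: str) -> float:
    """Toy rubric (placeholder): 1.0 if the staff turn offers an alternative, else 0.0."""
    cues = ("instead", "alternative", "we also have")
    return 1.0 if any(cue in transcript.lower() for cue in cues) else 0.0


def improvement(
    evaluate: Evaluator,
    practice: List[str],
    production_before: List[str],
    production_after: List[str],
) -> Dict[str, float]:
    """Score practice and production with the same evaluator and report the change."""
    before = mean([evaluate(t) for t in production_before])
    after = mean([evaluate(t) for t in production_after])
    return {
        "practice": mean([evaluate(t) for t in practice]),
        "production_delta": after - before,
    }
```

Swapping in a different evaluator changes what is measured but not the loop: practice and production are always scored against the same standard.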

Six stages between a real conversation and a recovery card.

Three output surfaces.

₹11,800 at risk across 3 stores today.

Pharmacy · Virar West · 14:32

A four-stop journey, every Monday morning.

Mumbai West

Priya Sharma

134 “stockouts with no alternative” this week.

Built in-house, end to end.

Want to hear it on your audio?