How it works

Six stages between a real conversation and a recovery card.

Each stage had to work in depth on its own before we could call the loop closed. The hard parts (Devanagari preserved exactly as spoken, code-mixed Hindi · Marathi · English in one pass, diarisation that works at the noise floor of an Indian retail counter) were our own builds, not wrappers around someone else’s API.

01 Capture Rover device · far-field mic · always-on

A purpose-built device on the counter and a wearable in the field. The hard constraint: zero behaviour change from staff. No app, no login, no badge to tap. Audio goes up the moment the device sees a network — including the cellular fallback we keep on for stores with patchy wifi.
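A minimal sketch of the store-and-forward behaviour described above, assuming a simple in-memory queue of already encrypted chunks. The names `pick_link`, `drain`, and `send` are hypothetical and used only for illustration; the device firmware itself is not shown in this document.

```python
from collections import deque
from typing import Callable, Deque, Optional


def pick_link(wifi_up: bool, cellular_up: bool) -> Optional[str]:
    """Prefer wifi, fall back to cellular, return None when fully offline."""
    if wifi_up:
        return "wifi"
    if cellular_up:
        return "cellular"
    return None


def drain(queue: Deque[bytes], link: Optional[str],
          send: Callable[[bytes, str], bool]) -> int:
    """Upload queued chunks in order; a failed chunk stays queued for later."""
    sent = 0
    while link is not None and queue:
        if not send(queue[0], link):
            break  # keep the chunk so no capture is dropped
        queue.popleft()
        sent += 1
    return sent


# Illustrative run: wifi is down, so the queue drains over cellular.
pending = deque([b"chunk-1", b"chunk-2", b"chunk-3"])
uploaded = []
drain(pending, pick_link(False, True),
      lambda chunk, link: uploaded.append((chunk, link)) is None)
```

The point of the design is that nothing is discarded while offline: chunks wait in order and go up on whichever link appears first.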

What we built

  • Far-field mic array tuned for the 60–90cm distance at a retail counter
  • Per-device acoustic calibration — no two stores sound the same
  • Cellular fallback so stores with patchy wifi do not drop hours of capture
  • Local encryption on the device; nothing leaves the floor unencrypted

02 Assemble Conversation windows · pause + proximity

Our AI infers the boundary of every customer conversation from pause and proximity in the signal — not from uniform clock windows. A 17-minute interaction at one counter and a 38-second one at the next stay separate, with their own speaker assignments and event tags, even if they share a shift.
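As a rough illustration of boundary inference from pause and proximity, the sketch below groups speech segments into conversations. The `Segment` fields, the `near` proximity flag, and both pause thresholds are assumptions made for the example, not the production logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds from start of shift
    end: float
    near: bool    # hypothetical proximity estimate: someone is at the counter


def split_conversations(segments: List[Segment],
                        max_pause_s: float = 20.0,
                        away_pause_s: float = 5.0) -> List[List[Segment]]:
    """Group segments; tolerate longer pauses only while proximity holds."""
    conversations: List[List[Segment]] = []
    for seg in segments:
        if conversations:
            prev = conversations[-1][-1]
            gap = seg.start - prev.end
            limit = max_pause_s if (prev.near and seg.near) else away_pause_s
            if gap <= limit:
                conversations[-1].append(seg)
                continue
        conversations.append([seg])
    return conversations


segments = [
    Segment(0, 10, True),
    Segment(15, 30, True),     # short pause, customer still present: same conversation
    Segment(38, 40, False),    # proximity lost and pause over the short limit: new one
    Segment(100, 138, True),   # long silence: new one
]
groups = split_conversations(segments)
```

Because the split depends on the signal rather than the clock, a long interaction and a short one end up as separate windows of very different lengths.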

What we built

  • Conversation boundaries inferred from the signal, not from clock buckets
  • Same-counter continuity across shifts and devices
  • Per-shift staff handoff that doesn’t lose speaker identity

03 Transcribe Hindi · Marathi · English · code-mixed

Devanagari is preserved exactly as spoken. Multilingual code-switching inside a single utterance — “sir, paracetamol नहीं है, but ये देख लीजिए” — is handled in one pass. No language toggle. Our AI transcription pipeline routes between primary and fallback models on confidence, not on language hints.

What we built

  • Code-mixed Hindi · Marathi · English in a single forward pass
  • Devanagari preserved as Devanagari — never transliterated to Latin
  • Multi-stage routing: primary model + fallback + confidence gate
  • Per-segment confidence used downstream — uncertain turns are flagged, not silently shipped
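The routing described above can be sketched as a confidence gate in front of a fallback model. This is a simplified illustration: the `Transcript` type, the model callables, and the 0.8 gate value are assumptions, and real models are replaced by stubs.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A model takes audio bytes and returns (text, confidence in [0, 1]).
Model = Callable[[bytes], Tuple[str, float]]


@dataclass
class Transcript:
    text: str
    confidence: float
    model: str
    needs_review: bool


def transcribe(audio: bytes, primary: Model, fallback: Model,
               gate: float = 0.8) -> Transcript:
    """Route on confidence only; no language hint is consulted."""
    text, conf = primary(audio)
    model = "primary"
    if conf < gate:
        fb_text, fb_conf = fallback(audio)
        if fb_conf > conf:
            text, conf, model = fb_text, fb_conf, "fallback"
    # Anything still under the gate is flagged rather than silently shipped.
    return Transcript(text, conf, model, needs_review=conf < gate)


# Stub models standing in for real ones.
weak_primary: Model = lambda audio: ("sir, paracetamol nahi hai", 0.55)
strong_fallback: Model = lambda audio: ("sir, paracetamol नहीं है, but ये देख लीजिए", 0.91)
result = transcribe(b"...", weak_primary, strong_fallback)
```

The confidence value travels with the transcript, so downstream stages can decide what to do with uncertain turns.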

What we see · spectrogram of one counter conversation

[Figure: frequency spectrogram of a pharmacy counter conversation]

A frequency-vs-time view of a real conversation our pipeline ingested. The bright vertical bands are speech; the warm-toned high-frequency clusters are sibilants and other consonants, the parts that carry word identity. Keeping these bands clear through our denoising is what makes Devanagari transcription possible at the noise floor of an Indian retail counter.

Why we built our own transcription

We tested four of the major transcription vendors before deciding to build our own. One transliterated Devanagari into Latin script, losing the language. One missed Marathi entirely, defaulting to English. One hallucinated under noise, inventing words on long Hindi clips. None of them surfaced the confidence signal we needed to know when to ask for human review.

We keep the head-to-head numbers internal so we don’t end up in a comparison war. Ask us on a call — we’ll show you the receipts.

04 Diarise Staff vs. customer, per turn

Our diarisation AI attributes each turn to the right person. Staff are recognised by voice embedding across conversations and shifts. Customers stay anonymous. No enrolment ritual — staff don’t have to read a script into a microphone to be identified.

What we built

  • Our own diarisation pipeline — purpose-built for noisy, multi-speaker counter audio
  • Staff recognised by voice across conversations; customers stay anonymous
  • No enrolment ritual — staff change nothing about how they work
  • When the audio is ambiguous, we ask before we answer — your team only sees what we’re sure of
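One way to picture voice-embedding recognition with an "ask before we answer" rule is nearest-match by cosine similarity with a match threshold and an ambiguity margin. The sketch below is an assumption-laden illustration: the embeddings are toy 2-D vectors, and `identify`, `match`, and `margin` are hypothetical names and values.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def identify(turn: Sequence[float], staff: Dict[str, Sequence[float]],
             match: float = 0.75, margin: float = 0.10) -> str:
    """Return a staff id, 'customer' (kept anonymous), or 'needs_review'."""
    scored = sorted(((cosine(turn, emb), name) for name, emb in staff.items()),
                    reverse=True)
    if not scored or scored[0][0] < match:
        return "customer"        # no confident staff match: stay anonymous
    if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
        return "needs_review"    # two staff voices too close to call
    return scored[0][1]


staff_voices = {"staff_a": [1.0, 0.0], "staff_b": [0.8, 0.6]}
clear = identify([1.0, 0.0], staff_voices)
ambiguous = identify([0.95, 0.31], staff_voices)
unknown = identify([-1.0, 0.0], staff_voices)
```

Only staff have stored embeddings in this sketch, which is why an unmatched voice can be labelled as a customer without ever being identified.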

Watch the pipeline split a conversation in real time

A real customer conversation, split into staff and customer turns as the audio plays. The red tag at the top marks where our retail-event taxonomy flagged a stockout with no recovery — the kind of moment that walks revenue out the door.

05 Extract Business events — not topics

Our extraction AI doesn’t hand operators a topic cloud. It surfaces specific business events against a hand-built retail taxonomy: stockout uncoached · bounce unhandled · substitution offered · prescription check missed · return handled well · cross-sell success. Each one ships with the exact line of transcript that triggered it.
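To make "events with evidence" concrete, here is a small sketch of an event record that cannot exist without its transcript line. The taxonomy values come from the list above; the `RetailEvent` fields, the `route` helper, and the 0.9 threshold are assumptions for illustration.

```python
from dataclasses import dataclass

TAXONOMY = {
    "stockout uncoached",
    "bounce unhandled",
    "substitution offered",
    "prescription check missed",
    "return handled well",
    "cross-sell success",
}


@dataclass(frozen=True)
class RetailEvent:
    kind: str
    conversation_id: str
    evidence_line: str   # the exact transcript line that triggered the event
    confidence: float

    def __post_init__(self) -> None:
        if self.kind not in TAXONOMY:
            raise ValueError(f"unknown event kind: {self.kind!r}")
        if not self.evidence_line.strip():
            raise ValueError("an event must carry its line of evidence")


def route(event: RetailEvent, ship_at: float = 0.9) -> str:
    """Ship confident events; send the rest for a one-tap confirmation."""
    return "ship" if event.confidence >= ship_at else "confirm"


event = RetailEvent(
    kind="substitution offered",
    conversation_id="c-7341",
    evidence_line="sir, paracetamol नहीं है, but ये देख लीजिए",
    confidence=0.93,
)
decision = route(event)
```

Rejecting events that lack evidence at construction time keeps "trust us" signals out of the pipeline by design rather than by convention.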

What we built

  • A taxonomy of retail conversation events, not generic NLP topic clusters
  • Every event ships with the exact line of evidence behind it — no ‘trust us’ signals
  • Lost-revenue valuation joined against real POS history
  • We don’t ship a moment to your team unless we’re sure
  • Where we’re unsure, we say so; your team confirms with one tap and the system gets better

06 Coach Inbox → practice → measure

Every event becomes a 90-second practice rep that our AI generates from the actual conversation — not a generic scenario. The same evaluator runs on practice and production, so improvement on the rep is measurable on the next month’s real conversations.

What we built

  • Per-event remediation paths into practice scenarios
  • Same evaluator on practice and production — one standard, one feedback loop
  • Skill drift tracked across staff, stores, and cohorts
  • Built on six years of teaching 10,000+ professionals what actually changes behaviour in front of a customer

Where the intelligence shows up

Three output surfaces.

The pipeline produces intelligence that lands in three places, depending on who needs it and when.

Web dashboard — operations manager

Opens Monday morning. Fleet view → store view → conversation view. Search by time, store, staff member, moment type, or language. Every number links back to a clip and a transcript line.

app.ostronaut.ai/today Live

Money first. Named accountability. Top-3 stores at risk, the one you can fix in 15 minutes.

MCP server — your AI talks to ours

Query our intelligence store from any AI assistant that supports the Model Context Protocol. Ask natural-language questions over structured retail data. Your audio never leaves Ostronaut.

# Ask your AI assistant, connected to Ostronaut MCP
Show me every conversation last week where a customer asked about a brand we don’t carry, ranked by store.

# Ostronaut MCP returns structured events, not text summaries
14 conversations · 4 stores · top brand gap: Dolo 650 (9 mentions) · Store 3 leads

Your data stays in our store. The MCP surface gives your existing AI assistant structured access — no new dashboards required.
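For a sense of what "structured events, not text summaries" could mean in practice, the sketch below aggregates brand-gap events into a ranked result. It does not use or describe the actual MCP interface; `BrandGapEvent`, `summarise`, and the sample records (including "Brand X" and the store labels) are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class BrandGapEvent:
    conversation_id: str
    store: str
    brand: str  # brand the customer asked for that the store does not carry


def summarise(events: List[BrandGapEvent]) -> Dict[str, Any]:
    """Count conversations, rank stores, and find the most requested brand."""
    by_store = Counter(e.store for e in events)
    by_brand = Counter(e.brand for e in events)
    return {
        "conversations": len({e.conversation_id for e in events}),
        "stores": len(by_store),
        "top_brand_gap": by_brand.most_common(1)[0] if by_brand else None,
        "stores_ranked": by_store.most_common(),
    }


sample = [
    BrandGapEvent("c-1", "Store 3", "Dolo 650"),
    BrandGapEvent("c-2", "Store 3", "Dolo 650"),
    BrandGapEvent("c-3", "Store 1", "Brand X"),
]
summary = summarise(sample)
```

Because the result is a structure rather than prose, an assistant can sort, filter, or chart it without re-parsing a summary.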

Coaching artifact + practice rep — staff member

The staff member who lived a missed moment gets a 90-second practice rep on their phone, tied to the exact conversation. Built from the real audio — not a generic objection-handling module — and delivered before the next shift.

app.ostronaut.ai/conversations/c-7341 Live

Every coaching rep ties back to a specific moment in a real conversation.

What the coaching loop actually looks like

A four-stop journey, every Monday morning.

Money at risk in the day → drill into the worst store → open the staff member’s scorecard for the 1:1 → send the practice rep tied to the recurring pattern. One loop, four stops, every morning.

app.ostronaut.ai/today Live
01 · See Open the day. ₹11,800 at risk, named to the staff who can fix it before the next shift.
app.ostronaut.ai/stores/mumbai-west Live
02 · Drill Open the worst store. Conversation by conversation, see exactly where revenue walked out — with the staff and the moment attached.
app.ostronaut.ai/staff/priya-sharma Live
03 · Prep Open the staff member’s scorecard — the same view the manager pulls up in the 1:1. Trend, top themes, what to focus this week.
app.ostronaut.ai/themes/stockout-no-alternative Live
04 · Coach Open the recurring pattern. Send the 90-second practice rep, built from the actual conversations — delivered before the staff’s next shift.

Coaching is one branch of the intelligence layer, shown last on purpose. The same platform that surfaces a missed moment can wrap it into a 90-second practice rep that lands with the staff member who lived it.

Why this matters

We did not wrap an API.

Off-the-shelf transcription doesn’t handle Devanagari without transliterating it. Generic diarisation tools fall apart when two voices overlap at a busy counter. And nobody else has a retail event taxonomy — they hand you topics, not moments. Each stage of our pipeline is its own build, joined to a real POS ledger and a real coaching loop. The numbers on the home page — 25,078 transcribed turns, 3,044 coachable moments, ₹1.64L caught — are what falls out of that stack running on real audio from a real Indian retail floor.

Keep exploring

Want to hear it on your audio?