Deep-tech AI research · Australia

Speech learned from physics — not datasets.

From random starts to stable vowel targets using a single adaptive controller.

ah 100% · ee 100% · oo 93%

From random initial states (research-stage)

What you’re seeing

From random initial conditions to a controlled acoustic outcome

Observed behaviour on our instrument—not an internal method diagram.

Side-by-side comparison: random acoustic state versus controlled, structured vowel production

Observed behaviour, not internal method.

What we proved

  • Convergence from random initial states
  • One controller, multiple vowel targets
  • No recorded speech datasets
  • Stable acoustic targets with closed-loop control

Representative gated scores

ah 100% · ee 100% · oo 93%

Full definitions and protocol under diligence.
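The gate definitions themselves stay under diligence. Purely as an illustration of the shape of such a metric, the sketch below accepts a run when its measured first two formants land within an assumed tolerance of an assumed vowel target, then reports the accepted fraction as a percentage. The targets, tolerance, and pass criterion are placeholders, not the internal protocol.

```python
# Hypothetical illustration of a "gated score". The real gate definitions and
# protocol are not public; targets, tolerance, and the pass criterion below
# are placeholder assumptions, not the production criteria.

from typing import Dict, List, Tuple

# Assumed reference formant targets (Hz) for "ah", "ee", "oo".
TARGETS: Dict[str, Tuple[float, float]] = {
    "ah": (730.0, 1090.0),
    "ee": (270.0, 2290.0),
    "oo": (300.0, 870.0),
}

def passes_gate(f1: float, f2: float, vowel: str, tol: float = 0.15) -> bool:
    """Accept a run if both measured formants are within a relative tolerance
    of the assumed target for that vowel."""
    t1, t2 = TARGETS[vowel]
    return abs(f1 - t1) / t1 <= tol and abs(f2 - t2) / t2 <= tol

def gated_score(runs: List[Tuple[float, float]], vowel: str) -> float:
    """Fraction of runs that pass the gate, reported as a percentage."""
    passed = sum(passes_gate(f1, f2, vowel) for f1, f2 in runs)
    return 100.0 * passed / len(runs)

# Example with made-up measurements:
# gated_score([(710, 1100), (735, 1080), (790, 1300)], "ah")
# -> 66.7 (two of three runs pass under these assumed tolerances)
```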

What you’re seeing

Vowel evidence (F1 / F2)

A minimal, high-signal view of target structure in the acoustic space we care about—without exposing implementation details.

F1 versus F2 vowel map showing learned productions against reference regions

Observed behaviour, not internal method.
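A map like this can be reproduced with generic tooling. As one possible route (not the project's perception stack), the sketch below estimates F1 and F2 for a short, steady vowel segment via LPC root-finding; the sample rate, pre-emphasis coefficient, and LPC order are assumptions.

```python
# Generic textbook formant estimation (LPC root-finding); not the project's
# perception stack. Sample rate, pre-emphasis, and LPC order are assumptions.

import numpy as np
import librosa

def estimate_f1_f2(y: np.ndarray, sr: int = 16000, order: int = 16):
    """Return (F1, F2) in Hz for a short, steady vowel segment."""
    # Pre-emphasis flattens the spectral tilt before the LPC fit.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    a = librosa.lpc(y.astype(float), order=order)
    # Poles of the all-pole model; keep one root per conjugate pair.
    roots = [r for r in np.roots(a) if np.imag(r) >= 0]
    freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
    # Keep plausible formant candidates and take the lowest two
    # (assumes at least two candidates survive the filter).
    candidates = [f for f in freqs if 90.0 < f < sr / 2 - 50.0]
    return candidates[0], candidates[1]

# Example use for an F1/F2 map (plotting conventions assumed):
# f1, f2 = estimate_f1_f2(segment, sr)
# plt.scatter(f2, f1); plt.gca().invert_xaxis(); plt.gca().invert_yaxis()
```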

What you’re seeing

Convergence from a random start

Representative run: formant error (Hz) versus step—lower is better.

Line chart: formant error in hertz versus step, decreasing from a random start on a representative run

Observed behaviour, not internal method.
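The error plotted above is a single number per step. One simple, assumed form of such a metric is sketched below; the internal objective and any weighting are not public.

```python
# Minimal assumed formant-error metric for a convergence plot; the actual
# objective, weighting, and logging used internally are not public.

def formant_error_hz(measured, target):
    """Mean absolute deviation of (F1, F2) from the target, in Hz."""
    (f1, f2), (t1, t2) = measured, target
    return 0.5 * (abs(f1 - t1) + abs(f2 - t2))

# One curve per run, e.g. toward an assumed "ah" target, using the
# estimate_f1_f2 sketch above:
# errors = [formant_error_hz(estimate_f1_f2(step_audio, sr), (730.0, 1090.0))
#           for step_audio in run_steps]
# plt.plot(errors); plt.xlabel("step"); plt.ylabel("formant error (Hz)")
```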

What you’re seeing

Repeatability across runs

A compact snapshot of consistency across recent internal runs.

Bar chart summarising representative success rates across vowels for recent internal runs

Observed behaviour, not internal method.

Why this matters (without the recipe)

The visuals above are the public bar: controlled, repeatable speech production from random initial states, learned without recorded speech datasets. Method detail stays diligence-gated.

Research progress

High-level milestones. Detailed methods and evaluation design are shared under appropriate agreements—we publish the story, not the recipe.

Phase 1 — Discovery at scale

Early 2026

From noise to structured acoustic behaviour

Large-scale search over a parametric vocal instrument, with only high-level acoustic feedback. Multiple independent seeds converged on structured, voice-like output—strong evidence that the landscape rewards real production physics, not a single lucky path.

Evolution of sound (representative run)

Random exploration sharpening into structured, voice-like output over many generations.

Recurrent discovered gesture

A short acoustic onset pattern that appeared across runs—a signature of convergent discovery rather than manual design.

Top composed sample (high objective score)

Primitives combined into one of the strongest-scoring utterances from the discovery phase.

Why we moved on

Search found compelling acoustic structure, but the next chapter required memory, generalisation, and controllable targets. Phase 2 is that chapter.

Phase 2 — Adaptive motor control

2026 — active

coarse → refine → stable

One learner, many vowels

A single adaptive controller steers the instrument toward different targets. Vowels and transitions are learned outcomes under the same closed-loop principle as Phase 1. Qualitative audio in Evidence is labelled for provenance—it is not assumed to match every newer internal run.

Recent internal milestone

Recent runs show reliable convergence toward target vowel structures from random initial states, using a single adaptive controller. Behaviour is fully closed-loop: coarse motor convergence, then finer acoustic stabilisation, with distinct vowel targets—including acoustically narrow basins—without dependence on recorded speech datasets for these outcomes.

Figures for this milestone appear near the top of this page; older clips in Evidence may not match the newest runs.

  • Multiple distinct steady vowels demonstrated
  • A unified controller across targets
  • Closed-loop, real-time synthesis & sensing

Multi-phoneme learning

Training on a single sound allows the system to settle on fixed motor patterns. Training across multiple sounds forces it to use feedback and produce different outputs for different targets.

Recent experiments include multi-sound vocal sequences and transitions (for example “ah” → “ee”) using a shared controller—still foundational motor-acoustic research, not conversational speech or text-to-speech.

A single controller learns multiple distinct sounds and transitions, rather than memorising separate trajectories; the headline behaviour story is summarised in the visuals above.

This marks the shift from isolated sound discovery toward controllable, repeatable motor-acoustic production under closed-loop feedback—without publishing the internal training interface here.
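The conditioning point can be made concrete without any internal detail. In the schematic sketch below, a single policy takes the target label as an input alongside the sensed state, so the same parameters must produce different motor output for different targets rather than one memorised trajectory per vowel. Everything here (dimensions, encoding, the tanh policy) is a placeholder, not the internal controller.

```python
# Schematic of "one controller, many vowels": the target label is an input to
# a single shared policy, not a key into separate memorised trajectories.
# Network shape and encoding are placeholder assumptions.

import numpy as np

VOWELS = ["ah", "ee", "oo"]

class ConditionedController:
    def __init__(self, state_dim=8, motor_dim=6, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix shared across all targets.
        self.w = rng.normal(scale=0.1, size=(motor_dim, state_dim + len(VOWELS)))

    def act(self, state: np.ndarray, target: str) -> np.ndarray:
        onehot = np.eye(len(VOWELS))[VOWELS.index(target)]
        x = np.concatenate([state, onehot])   # same parameters, different target
        return np.tanh(self.w @ x)            # motor command

# ctrl = ConditionedController()
# motor_ah = ctrl.act(np.zeros(8), "ah")     # identical weights,
# motor_ee = ctrl.act(np.zeros(8), "ee")     # different outputs per target
```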

Acoustic evidence

These audio clips predate the latest closed-loop milestone narrative.

Read the Research progress section for the current behaviour-level story.

Qualitative clips and spectrograms—useful for listening; quantitative claims are in the figures above.

Representative outputs from an earlier adaptive-controller phase. The latest behaviour-level visuals live near the top of this page.

Steady vowels

Same controller, different targets—each a learned production, not a separate baked-in model.

“ah”

“ee”

“oo”

Between vowels

Continuous motion between targets—evidence of control, not hard cuts.

For the latest behaviour-level summary—including repeatable convergence from random starts—see Research progress above.

“ah” → “ee”

“ah” → “oo”

Spectral validation

Human reference (top) versus learned vowels—distinct formant structure per vowel. Same qualitative demo phase as the clips above.

Spectrogram comparing human vowel reference to AI-learned vowels — formant structure

The audio examples above predate the latest closed-loop milestone narrative; use them as listening context alongside the figures at the top of the page.

A different path to speech

Every current speech system depends on recorded human data to function. We are exploring what happens when that dependency is removed entirely.

The current generation of systems is hitting structural limits around data, control, and interpretability.

If speech can be learned without data, the entire speech stack changes.

The voice AI market is dominated by systems trained to reproduce patterns in enormous speech corpora. That path works for products—but it bundles dataset risk, opacity, and a ceiling on how “understanding” is grounded.

Incumbent trajectory

  • Scale data, scale parameters, scale compute
  • Quality tied to licensing and coverage of voices
  • Harder to prove what the model “knows” about production

Our thesis

  • Ground learning in a real-time acoustic instrument and closed-loop feedback
  • Let structure emerge under tight scientific constraints—not by hand-authoring the answer
  • Build a moat around methodology and measurable progression, not a one-off demo

Positioning

Post-dataset speech systems

No text-in → waveform-out product narrative. This is foundational motor-acoustic research.

Evidence

Reproducible runs

Independent experiments converged on speech-like behaviour—then we moved to adaptive control.

Today

Multi-sound & transitions

Recent internal work emphasises repeatable convergence and stable holds on distinct vowels—including demanding targets.

Tomorrow

Richer inventory

Expanding the sound set, robustness, and real-world interfaces—still without giving up the core thesis.

The question we started with

Infants do not download a corpus. They explore, listen, and calibrate. We asked whether an artificial learner could follow a similarly grounded path: act → perceive → improve—with scientific discipline and without shortcutting the problem with a scripted phoneme machine.

Emergence, not theatre

If the system cannot discover a behaviour through the allowed learning interface, we treat that as a research result—not something to patch over with hand-tuned production rules.

Coaching, not puppetry

External evaluation shapes the objective landscape; it does not micromanage the motor solution. That separation is central to what we’re proving.

Physics-first audio

Sound is produced by a deterministic acoustic engine running in real time—so claims about “what the AI did” stay tied to measurable waveforms.

Interpretability by design

Motor controls map to production levers you can reason about—an advantage for science, safety storytelling, and long-term product paths.
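As generic context for the two cards above (deterministic synthesis and interpretable production levers), the sketch below is the classic textbook source-filter picture: a pulse-train source shaped by second-order formant resonators, with pitch, formant frequencies, and bandwidths as explicit, inspectable parameters. It is an illustration of the idea only, not the company's acoustic engine, and every value is assumed.

```python
# Textbook source-filter vowel synthesis, for context only; this is not the
# company's acoustic engine. Pitch, formant frequencies, and bandwidths are
# assumed illustrative values.

import numpy as np
from scipy.signal import lfilter

def synth_vowel(f0=120.0, formants=((730, 90), (1090, 110), (2440, 170)),
                sr=16000, dur=0.5):
    """Impulse-train glottal source filtered through second-order resonators."""
    n = int(sr * dur)
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0                       # simple glottal pulses
    y = source
    for freq, bw in formants:                          # cascade of resonators
        r = np.exp(-np.pi * bw / sr)
        theta = 2 * np.pi * freq / sr
        a = [1.0, -2.0 * r * np.cos(theta), r * r]     # complex pole pair
        y = lfilter([1.0 - r], a, y)
    return y / np.max(np.abs(y))
```

Swapping the formant values toward roughly (270, 2290, 3010) Hz shifts the output from an "ah"-like to an "ee"-like quality; that kind of explicit, inspectable lever is the interpretability point, shown here in toy form.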

How the loop works (conceptual)

No teacher ever provides the correct motor trajectory.

We stay intentionally high-level here. Implementation, loss design, and training contracts are part of our research package—not a webpage.

1. Act

The learner proposes motor commands; the instrument turns them into audio in real time.

2. Sense

A causal perception stack summarises what was actually produced—no peeking at the future waveform.

3. Align

External evaluation scores progress toward targets while preserving the integrity of the motor path.

4. Compound

Skills stack: new sounds build on stabilised behaviours rather than resetting the system.
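Restating those four steps in code-shaped, placeholder-only form (no component named here is real, and this is not the training contract): the update consumes sensed summaries and external scores, never a reference motor trajectory.

```python
# Conceptual restatement of the four steps; every component is a placeholder
# passed in by the caller. Note the update only ever sees sensed summaries and
# scores, never a teacher-provided motor trajectory.

def closed_loop_episode(controller, instrument, sense, evaluate, target, steps=100):
    state = sense(instrument.reset())                 # random start
    for _ in range(steps):
        motor = controller.act(state, target)         # 1. Act
        audio = instrument.step(motor)                #    real-time synthesis
        state = sense(audio)                          # 2. Sense (causal)
        score = evaluate(state, target)               # 3. Align (external score)
        controller.update(state, target, score)       #    no teacher trajectory
    controller.consolidate(target)                    # 4. Compound: keep the skill
    return state
```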

What we protect

The defensibility is in the integration: instrument design, perception, curriculum, and evaluation—tuned as one system.

Real-time acoustic core

Low-latency, deterministic synthesis so experiments are repeatable and claims stay falsifiable.

Closed-loop learning discipline

Training assumes the instrument is in the loop—mirroring how real vocal systems operate.

Rigorous evaluation

Human-in-the-loop listening, spectral checks, and automated gates—so “it sounds right” isn’t a vibe.

Cloud-scale experimentation

Phase 1 validated at serious compute breadth; Phase 2 pairs local iteration with elastic cloud for campaigns.

For investors & partners

Acoustic Intelligence is an Australian deep-tech research company pursuing a category-defining thesis that does not depend on access to proprietary speech datasets: voice as a learned motor skill on a controllable physical instrument, not merely a statistical mirror of scraped audio.

  • Timing: Regulators and enterprises are asking harder questions about data provenance, voice rights, and interpretability—our stack is aligned with that shift.
  • Traction shape: Convergent discovery at scale, then adaptive control with multi-sound sequences and transitions; public clips are labelled by provenance, while the latest internal milestone emphasises repeatable convergence from random starts and distinct vowel targets under closed-loop feedback (see Progress).
  • What we’re raising toward: Expanding the learned inventory, hardening robustness, and selective partnerships—without diluting the scientific bar.
  • What you get in diligence: Methodology walkthroughs and metrics under NDA—not a public playbook.

AWS

Large discovery campaigns run on cloud compute; adaptive training mixes efficient local iteration with scalable runs. We participate in programs that support serious infrastructure for deep research teams.

About

Founded by Richard James. Based in Brisbane, Australia.

Acoustic Intelligence exists to push a principled boundary: machines that earn speech through interaction with physics—measured honestly, improved iteratively, and positioned for the next decade of voice technology.

This site showcases outcomes and direction. It is not an instruction manual.

Contact

Investment conversations, strategic partnerships, and qualified research collaboration—we reply to serious inbound.

richard@acousticintelligence.ai