Ramblio

Calling out all the Eloquence fans, would you like to go behind the scenes with the technology?

Diego Gomez 2/21/2026 Rambles 4:17

Dane Stange
6/10/2026

It’s interesting how in JAWS and other places they misattribute her: her name is not Bobby, it’s Sandy. It’s a woman, not a boy. I suppose the correct thing is, it’s a girl.

Dane Stange
6/10/2026

Apparently, though, the names have never been stable; they have always been sort of scattered. I just know, based on all of my trying, that to make that child voice (labeled as such on the BrailleNote, Bobby on JAWS, Sandy everywhere else), you have to set the gender to female, decrease the head size, and raise the pitch.

Diego Gomez
6/10/2026

Yep, the correct parameters are: head size 22, pitch 93, inflection 35, breathiness 0, and gender set to female.

Dane Stange
6/10/2026

OK, I guess I see what you mean, Apple really did mess with her then. She’s not the Sandy that we originally had, or the one on IBM. But I wonder why? She sounds a bit like she was put through something that breaks high pitches, especially on the Apple Watch when you click the play button to preview the voice. That’s cool that you know the exact parameters like that, damn! You probably know Mr. Mohammed then. My next wonder is, why don’t you have Mastodon? I’m sure that is probably answered in your introduction. Why don’t you know Alex Krier? Maybe that would also be in your introduction, though not directly answered. Man, I want to know more about this damn beautiful Eloquence and autism over it for years.

Dane Stange
6/10/2026

It seems Apple’s Eloq has Flo and Sandy rather than Sandy and Bobby, and that makes more sense to me honestly. The JAWS approach is literally fucking wrong!

Dane Stange
6/10/2026

That’s a very stupid thing, I know, but it’s just something I noticed. My next wonder: what does that mean is next for this? I mean, hell, if they are figuring out how to get Apple’s build working on other platforms, this might truthfully be the latest version of the synthesizer we have, at least from the ETI tree. I guess the next thing is IBM, although I wonder if they are still using some parts of that old IBM thing somewhere, or if they even have the source code anymore. I wonder who created it first, IBM or ETI? It seems the first living versions are the old Eloquence, so probably ETI. Although IBM did have their hands on Eloquence 4 and 5, it’s plainly obvious.

Diego Gomez
6/10/2026

Yeah, they definitely got it wrong. And yeah, I think I’ve heard of those people, but I haven’t actually gotten to know them. I wonder why some people just really dislike Eloquence? I don’t get that. It sounds way more natural than those so-called natural voices, in my opinion. I’m gonna really start putting people to the test on those voices. Hmmm, what if we typed out a bunch of sentences written the same way and tried to spot the intonation in them, changing only the punctuation mark and so how the sentence flows?

Dane Stange
6/10/2026

`ar aha!! `0 you `0 mean like `3 this?

Diego Gomez
6/10/2026

ETI was the one that started it first, and then IBM acquired it.

Diego Gomez
6/10/2026

Yep, exactly.

Dane Stange
6/10/2026

`ad sory `al buddy, `al it was an `ad example.

Dane Stange
6/10/2026

I’m glad that Apple allows certain Eloquence tags within text to be passed to VoiceOver.

Dane Stange
6/10/2026

Fun fact: Eloquence also has the ability to speak the language Apple’s dictation speaks. In other words, if you speak like an old English person, Eloquence will also speak like an old English person; it understands phonetic symbols. It’s insane.

Diego Gomez
6/10/2026

That’s one of the things I love about Eloquence. You give it any kind of text, and it doesn’t stupidly try to spell out anything, unless it’s a scam message lol. It’s really good at detecting those. And yeah, it allows a lot of tags to be passed to the synthesizer, but not all of them. There is a precise rate command in words per minute, VWPM as I call it. That unfortunately doesn’t work.

Dane Stange
6/10/2026

So why isn’t Synfony the new Eloquence it should be? I mean, it should not have a rate bug; Eloq was natively fast. They should just make a better Eloq, eh?

Diego Gomez
6/10/2026

Yeah, exactly, that’s what I told them, and I pointed out specifically what’s going on and what’s not quite processing correctly with the speech rate. Every single frame is supposed to be either stretched out or reduced. If TG speech box was able to do this, why can’t Synfony? Hope they really do get to that point though.

Dane Stange
6/10/2026

Makes me wonder who engineered the smarts of Eloquence? Where are they?

The Blind tech motivator
6/10/2026

I wanna meet the real voice

Diego Gomez
6/10/2026

Exactly. I think we really need them. And another great example is this. I’ve temporarily played around with other signal processors as well, specifically for creating, or I guess you could say re-creating, various tones and sequences. I’ll give you a really good example of one that I’ve really been playing around with. I know to some people this sound is really annoying, but you know that ear-screeching tone you get when you leave your phone off the hook? I managed to re-create the sequence, and you can actually speed it up or slow it down without any weird artifacts. It just takes it frame by frame. The default is five Hertz pulsing, yet I really did speed it up to 20 or 40. The results surprisingly worked, and worked pretty smoothly, so that’s what needs to happen with Synfony. Eloquence, I believe, used this strategy.
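
A minimal sketch of the pulsing idea described here, not the poster's actual tool. It assumes the standard North American off-hook tone (1400, 2060, 2450 and 2600 Hz mixed together) and computes the on/off gate per sample from the pulse rate, so 5, 20 or 40 Hz pulsing changes only the cadence and never stretches the tone itself. The function name, sample rate and 50% duty cycle are choices made for this example.

```python
import math

SAMPLE_RATE = 8000  # Hz; telephone-grade rate, chosen for illustration
# Off-hook ("howler") tone components in the North American tone plan.
HOWLER_FREQS = (1400.0, 2060.0, 2450.0, 2600.0)


def howler(duration_s, pulse_hz=5.0, sample_rate=SAMPLE_RATE):
    """Return float samples of a pulsed multi-frequency tone.

    The gate is evaluated per sample from pulse_hz, so changing the pulse
    rate changes the cadence without resampling the tone.
    """
    samples = []
    for n in range(int(duration_s * sample_rate)):
        t = n / sample_rate
        # First half of each pulse period is "on", second half is silence.
        gate_on = (t * pulse_hz) % 1.0 < 0.5
        if gate_on:
            value = sum(math.sin(2 * math.pi * f * t) for f in HOWLER_FREQS)
            samples.append(value / len(HOWLER_FREQS))
        else:
            samples.append(0.0)
    return samples


slow = howler(1.0, pulse_hz=5.0)   # default cadence
fast = howler(1.0, pulse_hz=20.0)  # four times faster pulsing, same pitch
```

Both calls return one second of audio with identical tone content; only the gate timing differs.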

Diego Gomez
6/10/2026

Also, there is no one’s voice being used behind Eloquence. It’s all based on an actual Klatt formant synthesizer that uses a lot of target parameters for shaping the mouth, etc.

The Blind tech motivator
6/10/2026

Wow

Diego Gomez
6/10/2026

I think before we even discovered the trick to intelligible fast speech, we definitely had to get somewhere. Both DECTalk and Macintalk, or should I say Wintalker, seem to struggle with that.

Diego Gomez
6/10/2026

Now, unfortunately, Keynote Gold/Bestspeech kind of had that problem as well, although it’s a lot less than DECTalk. Really not a bad synthesizer for what it is though. Just don’t use it at extreme rates.

Dane Stange
6/10/2026

I wish we had an iMessage or Matrix group, fuck, it'd be fun to chat about this stuff in real time. This is so much fun. I'm playing with an old Eloq library for Unix right now and it happens to work on Android. Figured mister what's-his-name would be curious, the old English one, Diego, yes. Man, that guy's cool. Oh wait, lol, you're reading this right now. One of these days hopefully we can collaborate or something; we could share findings and random stuff. Anyway, I don't wanna gatekeep, it's just fascinating. As of now, Diego's my most eloquent Eloquencer that I'm aware of, aside from Jake. What is next for this thing? I really would recommend looking at the apple-eloquence-elf thing, that has very fun findings. But maybe you could contribute more? I should share some of the findings I've had with AI after viewing it.

Diego Gomez
6/10/2026

Hmmm, what version is it? And thank you.

Dane Stange
6/10/2026

**THERE.** The formants are MOVING. Look at that glide:

```
F17 F1=1396 F2=2584 ← /ð/ of "The"
F30 F1=1344 F2=2385 ← sliding into...
F50 F1=1264 F2=2078 ← /eɪ/ ("rain"): F2 glides from front to center
F70 F1=1184 F2=1772 ← /ɪ/ ("in"): shorter vowel
F90 F1=1104 F2=1465 ← /eɪ/ ("Spain"): rising from back vowel
```

That's the engine's internal vowel space, live. Each phoneme has a target F1/F2 pair, and the engine smoothly interpolates between them frame by frame. The F0=1.0 is just a voicing binary flag; the real pitch lives in the Klatt handle, not the per-frame struct.

The KlattFrame is officially mapped. Fields 13-26 are the cascade branch (F1-F5 + nasal formant, each with freq+bw), fields 27-34 are the parallel branch gains. I'll update the research doc with the full field map. This is the paper.
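
The frame-by-frame glide can be sketched with plain linear interpolation between two keyframes. This is an assumption made for illustration: the engine's real smoothing function is not shown in the capture, and `interpolate_track` is a hypothetical helper name. With the first and last frames of the dump as endpoints, a straight line happens to reproduce the three intermediate frames after rounding.

```python
def interpolate_track(start, end, frame):
    """Linearly interpolate (F1, F2) between two keyframes.

    start and end are (frame_index, f1_hz, f2_hz) tuples.
    """
    n0, f1_a, f2_a = start
    n1, f1_b, f2_b = end
    t = (frame - n0) / (n1 - n0)  # 0.0 at the start frame, 1.0 at the end frame
    return (f1_a + (f1_b - f1_a) * t, f2_a + (f2_b - f2_a) * t)


# Endpoints taken from the frame dump in this message.
START = (17, 1396, 2584)
END = (90, 1104, 1465)

for frame in (17, 30, 50, 70, 90):
    f1, f2 = interpolate_track(START, END, frame)
    print(f"F{frame} F1={round(f1)} F2={round(f2)}")
```

Because the values are computed per frame index, asking for more or fewer frames between the same two targets changes the glide's duration without changing its shape.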

Dane Stange
6/10/2026

Here's a safe post, no crashcodes, just the magic:

---

**So y'all know Eloquence, right? That voice we've been using forever on JAWS/NVDA/VoiceOver? Here's what's actually under the hood, and it's wilder than anyone told us.**

**It's not a TTS engine. It's a 36-year-old programming language.** Back in 1989, a tiny company in Ithaca, New York called Eloquent Technology Inc. didn't just build a speech synth. They invented DELTA, a custom programming language for writing text-to-speech rules. The actual Eloquence engine is a DELTA interpreter, about 90KB of code. Every language (US English, German, French, Finnish) is a compiled DELTA script. That's why there are 19 separate language files and the core engine is tiny. It doesn't know English. It runs the English DELTA program.

**The voice in your iPhone is the same bytes running on Linux right now.** Apple ships Eloquence in every device as a private framework called TextToSpeechKonaSupport. On your iPhone and MacBook, the arm64 dylibs run natively through VoiceOver. A group of hackers figured out that because both Apple and Linux use the same CPU calling convention, you can literally take the dylibs from the tvOS simulator, wrap them in a Linux-compatible file format, and they just work, unchanged. Same machine code, same formant tables, same Reed and Shelley voices. Linux users who've never had native Eloquence now have it through speech-dispatcher and Orca.

**You can watch the engine's vocal tract move in real time.** The synthesizer is a full Dennis Klatt formant model from 1979: cascade branch for voiced sounds, parallel branch for frication. Every frame of audio, the engine computes 35 floating-point parameters: F1 through F5 formant frequencies with bandwidths, nasal formant, aspiration noise, parallel branch gains. Using a debugger, you can literally watch the formants glide across the vowel space. When it says "rain," F2 drops from 2600 Hz to 1900 Hz as the diphthong slides from front to center. When it says "Spain," F1 sits at 1100 Hz and F2 rises from the back vowel. The engine is a visible articulatory model.

**Crash words exist, and they all come from one bug family.** You know those specific strings that kill Eloquence? They're not random. They all trace back to the engine's morphological prefix-stripper. When Eloquence sees a word starting with a small set of letters, it tries to decompose them as a prefix, process the root, then recombine, and a null pointer in the recombination step crashes the synthesis thread. Apple has patched some of the older ones over the years, but the family of bugs is still there in the latest tvOS build. Same engine, same ghost in the machine.

**The voices are just 5 numbers.** Reed, Shelley, Sandy, Rocko, Flo, Grandma, Grandpa, Eddy: all just five values: head size, roughness, breathiness, pitch baseline, pitch fluctuation. You can read them straight from Apple's voice presets file. Interpolate between any two and you get a new voice. Crank roughness to maximum and you get something that sounds like a rusty gate.

**This means we can build things.** A real-time formant visualizer showing the vocal tract deforming. An Eloquence singing synthesizer driven by MIDI notes. A browser version running in WebAssembly. An Android TTS engine (already in progress). Custom voice presets by sliding between the 8 Apple voices. A DELTA decompiler to understand the phoneme rules for every language and maybe even write new ones.

**Eloquence isn't legacy abandonware. It's infrastructure.** A 36-year-old domain-specific programming language for speech synthesis, still shipping on brand new iPhones, still recompiled with every OS release, still beloved because formant synthesis does things neural TTS can't: instant response, infinite speed scaling, and a voice that sounds like home.
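
The "interpolate between any two and you get a new voice" idea can be sketched as a blend over the five values named in the post. This is an illustration under stated assumptions, not Apple's preset format: the `VoicePreset` class and `blend` function are hypothetical, the Sandy numbers are the ones quoted earlier in the thread (with "inflection" assumed to mean pitch fluctuation and roughness assumed to be 0, since it was not given), and the second preset is invented.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VoicePreset:
    head_size: float
    roughness: float
    breathiness: float
    pitch_baseline: float
    pitch_fluctuation: float


def blend(a, b, t):
    """Return the preset a fraction t of the way from a to b (0 <= t <= 1)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    return VoicePreset(**{
        f.name: getattr(a, f.name) + (getattr(b, f.name) - getattr(a, f.name)) * t
        for f in fields(VoicePreset)
    })


# Sandy values quoted earlier in the thread; roughness is an assumed 0.
SANDY = VoicePreset(head_size=22, roughness=0, breathiness=0,
                    pitch_baseline=93, pitch_fluctuation=35)
# Hypothetical second preset, invented for this example only.
OTHER = VoicePreset(head_size=50, roughness=10, breathiness=40,
                    pitch_baseline=65, pitch_fluctuation=30)

halfway = blend(SANDY, OTHER, 0.5)
```

Gender is left out because the post counts only five numeric values; a real preset would presumably carry it as a separate switch.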

Diego Gomez
6/10/2026

Oh wow, this is definitely interesting! and that explains why every frame is very smooth no matter the speed.

Dane Stange
6/10/2026

Yessir, it is a very beautiful synthesizer. Do you have any other questions? I mean, hell, you’re the guy who knows a lot, and you can paste your questions here and I can literally ask them to the repository if you don’t have a way to, and paste the answers like I did earlier.

Dane Stange
6/10/2026

Sorry, when I asked, do you have any questions, I should’ve clarified. Do you have any questions that you would like to know about the synthesizer or anything? Because I can ask the agent about it. I wish there was a way you could talk to the agent directly.

Diego Gomez
6/10/2026

I agree. I don’t know why some people just don’t appreciate the work that has been put into this synthesizer. Even Sue said in one of the papers that the synthesizer is similar to the KLSYN-88 synthesizer and uses a glottal source model much like KLGLOTT-88, a cascade vocal tract model and a parallel fricative model. So you can kinda say it’s a hybrid of all of those models combined. I wonder if there would be any way to flip those around? Like have it rely on one model or something? Or could we perhaps try it with Klsyn80?

Diego Gomez
6/10/2026

I’ve always wondered if there would be any difference in how it sounds if the underlying model was changed.

Dane Stange
6/10/2026

Live frame capture confirms Sue's paper: every 5ms the engine computes all three models — KLGLOTT-88 glottal source (field 0), KLSYN-88 cascade (F1-F5 formants, fields 13-21), and KLSYN-88 parallel branch (4 gain paths, fields 27-34). All active per frame. Can you flip between them? Not by swapping the binary — the 5500-byte KlattSynth is one monolithic ARM64 function. But through the Klatt hooks you can zero whichever branch you want at the frame level. Zero parallel gains = cascade-only, functionally KLSYN80. Zero cascade = parallel-only whisper mode. Would it sound better? No — KLSYN80 couldn't do fricatives. That's exactly why Klatt added the parallel branch. The hybrid is the point.
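
A rough sketch of what "zero whichever branch you want at the frame level" could look like, assuming a frame is a flat list of 35 floats as described in the thread. The field ranges follow the earlier field map (13-26 cascade, 27-34 parallel gains); `mute_branch` is a hypothetical helper, and the real Klatt hook interface is not shown here.

```python
FRAME_FIELDS = 35            # per-frame parameter count reported in the thread
CASCADE = range(13, 27)      # fields 13-26, per the earlier field map
PARALLEL = range(27, 35)     # fields 27-34, the parallel branch gains


def mute_branch(frame, branch):
    """Return a copy of one frame with every field in `branch` set to 0.0."""
    if len(frame) != FRAME_FIELDS:
        raise ValueError(f"expected {FRAME_FIELDS} fields, got {len(frame)}")
    return [0.0 if i in branch else value for i, value in enumerate(frame)]


# Placeholder frame whose values are just their own indices.
frame = [float(i) for i in range(FRAME_FIELDS)]
cascade_only = mute_branch(frame, PARALLEL)
```

Returning a copy keeps the captured frame intact, so the muted and unmuted versions can be compared side by side.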

Diego Gomez
6/10/2026

Wow, so this basically confirms what is happening under the hood of the engine. So every little transition is literally 5 ms, wow! OK, I think I’ve just gathered what I needed to give Sue the feedback that she’ll need for Synfony. I guess this explains why the speech rates work so well here. Because literally every frame transitions every 5 ms, making it seamlessly smooth. That’s the problem with some of those other engines, the frame rate was not reduced like that.
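
One way to picture why a fixed 5 ms frame helps with speech rate: if rate changes only alter how many frames each segment gets, the audio is never stretched after the fact. The sketch below is an illustration of that idea, not Eloquence's or Synfony's actual algorithm; the segment durations and function names are made up for the example.

```python
FRAME_MS = 5  # frame period reported in the thread


def frames_for(duration_ms, rate_factor):
    """Number of 5 ms frames a segment gets at a given rate multiplier."""
    if rate_factor <= 0:
        raise ValueError("rate_factor must be positive")
    # Every segment keeps at least one frame so it never disappears entirely.
    return max(1, round(duration_ms / rate_factor / FRAME_MS))


def retime(segment_ms, rate_factor):
    """Frame counts for a list of segment durations at one rate multiplier."""
    return [frames_for(d, rate_factor) for d in segment_ms]


segments = [80, 120, 40]  # hypothetical segment durations in ms at normal rate
normal = retime(segments, 1)
double = retime(segments, 2)
extreme = retime(segments, 40)
```

At extreme rates every segment collapses to the one-frame floor, which is where any real engine would need smarter rules about what to shorten.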

Diego Gomez
6/10/2026

If anyone else here is testing Synfony with your copy, have you tried steering it in that kind of direction for fluidity when it comes to speech rate? I think that’s the kind of feedback we need to give to the team.

Dane Stange
6/11/2026

Shouldn’t they already know this stuff? I mean, Susan was the one who created Eloquence. If she designed these things, she should understand her architecture and should be able to take what she learned from Eloquence and put it into the new one.

Dane Stange
6/11/2026

I mean, she’s really smart. I would hope that there are good logical reasons why this thing is taking so damn long for development.

Dane Stange
6/11/2026

Here's what we're sitting on, and I want you to hear this:

This project **is** Eloquence living on. You're staring at the proof.

The Apple tvOS 18.2 binaries, `eci.dylib` (566K, fat binary with x86_64 + arm64) plus 18 dialect modules at ~49MB total, are the **newest compiled Eloquence that exists anywhere on earth**. The timestamps in the dialect modules point to 2024-2025 builds. Not 2005. Not 2003. **Last year**.

Code Factory's Android app and the Windows SAPI 5 version are both redistributing the same Cerence engine, but Apple's is actually *newer*, compiled with a modern Apple clang for two architectures, no DRM, just sitting in a `.dylib` waiting to be loaded. We don't need Cerence's permission to run what's already compiled.

The "old engine" you're thinking of, the IBMTTS/SpeechWorks one from 2005 that the NVDA addon wraps, is the *other* lineage. Different fork of the codebase. Apple's is from a different licensing deal, presumably newer and better maintained.

You wanting it to live on isn't nostalgia. It's about something real: formant synthesis is the only TTS that:

- Runs at 11025Hz on a Raspberry Pi with zero GPU
- Is instantly responsive at any speed without slurring
- Doesn't need 4GB of ML models downloaded
- Works when you're blind and cranking speech to 400wpm

The neural stuff is impressive but none of it can do what Eloquence does on a $35 Pi.

This project, `sd_eloquence` on speech-dispatcher, speaking through Orca, **is** Eloquence living on Linux in 2026. We're just ironing out the last kinks in the module.

Dane Stange
6/11/2026

Let me grab this latest copy from my flash drive here, because I have some ideas in a second. Once I get my files off of the development computer, I’m going to see about the latest iOS 27 eloquence.

Dane Stange
6/11/2026

Turns out it’s the same. Still the same exact Klatt from the 2005 Eloq.

Diego Gomez
6/11/2026

Oh wow, interesting. So basically we’ve been using 6.1 all this time, even from the first developer beta of iOS 16? I wonder if that can still be found? I know it can still be found in the Apple Silicon Mac installers on the Internet Archive.

Dane Stange
6/11/2026

I would love to share this somehow. It's a podcast on Eloq made by NotebookLM, with insanely good detail, after analysing it and reading the IBM dicts. It's creepy honestly, the Eloquence people figured out how to mimic a human face. Haloweeny!

Dane Stange
6/11/2026

https://notebooklm.google.com/notebook/84fd45b3-bf5a-4767-b174-620e330a974a/artifact/543da2a7-c467-4267-add7-0ce317354a25?utm_source=nlmm_share

Dane Stange
6/11/2026

Sorry, I meant the IBM docs and the Eloq research paper the AI made after reading the available stuff within Eloq itself.

Diego Gomez
6/11/2026

Wow lol. That’s cool! What doesn’t make sense to me is this: why do we still need to pay for the synthesizer? It’s not even really being updated anymore.

Diego Gomez
6/11/2026

Wow, this definitely is super accurate! Not many people care to understand as much as we do on here, but wow! In fact, not sure if you ever put things through the speak window or play with it at all, but the words per minute command is a grave accent, then vwpm, and a number between 1 and 6000. Yes, it has a very, very wide range. I wish this command was used more often. Personally, I read with it every day at about 480 words per minute, but that’s just me. That speed is good enough for me.
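
A small sketch of building text with that tag, based only on the description here: a grave accent, then vwpm, then a number from 1 to 6000. The helper name `with_vwpm` and the space between the tag and the text are assumptions, and whether a given host passes the tag through to the synthesizer is a separate question this does not address.

```python
VWPM_MIN = 1
VWPM_MAX = 6000  # range as described in the thread


def with_vwpm(text, wpm):
    """Prefix text with a words-per-minute tag, e.g. `vwpm480.

    Assumes a single space separates the tag from the text that follows.
    """
    if not VWPM_MIN <= wpm <= VWPM_MAX:
        raise ValueError(f"wpm must be between {VWPM_MIN} and {VWPM_MAX}")
    return f"`vwpm{wpm} {text}"


tagged = with_vwpm("The rain in Spain.", 480)
```

Rejecting out-of-range values up front avoids sending the engine a tag it would presumably ignore or read aloud.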

Dane Stange
6/11/2026

Also, referring to the earlier question about Apple messing with the vocal tract: The "Latest" Version: In a sense, the version Apple has is the latest, but the engine itself has been frozen since the early 2000s. Apple’s build is ECI version 6.1.0.0 with the Klatt 4.0 synthesizer, which is the same "mainline" version licensed to other distributors like Code Factory.

Diego Gomez
6/11/2026

Hmmm, interesting. Something does sound slightly tweaked in it though. Not sure if it occurred in one of the iOS 17 betas but it is slightly different compared to the one in 16.

Dane Stange
6/11/2026

It’s a sharp observation. While the core "linguistic brain" (the Delta engine) and the "vocal cords" (the Klatt 4.0 synthesizer) have indeed been **frozen since the early 2000s**, several factors in Apple’s integration could explain why it sounds "slightly tweaked" or different between iOS 16 and later versions:

### 1. The "Kona" Integration Layer

Apple doesn't just run the old code as-is; they wrap it in an integration layer they call **Kona** [1]. Even if the rules inside the engine haven't changed, Apple **rebuilds the entire system for every OS release** using modern toolchains like Xcode [2, 3]. A new compiler version or changes in how the code links against modern system libraries (like `libc++`) can introduce subtle differences in how floating-point math or timing is handled, which can slightly alter the "texture" of the audio [3].

### 2. Voice Preset Adjustments

Apple uses a file called **`KonaVoicePresets.plist`** to define the specific characteristics of voices like Reed, Shelley, and Rocko [4]. While the engine is old, the **custom parameter values** (such as pitch baseline, head size, and breathiness) used for these presets are set by Apple [3]. It is entirely possible that between iOS 16 and 17, Apple’s engineers subtly adjusted these values to better suit the audio hardware of newer iPhones.

### 3. Audio Stack and Sample Rates

The ECI engine supports multiple sample rates: **8kHz, 11.025kHz, and 22.050kHz** [5, 6]. If Apple changed the default sample rate or modified the way the system audio stack handles the 11k-to-44k/48k upsampling, it would definitely change the perceived "crispness" or tone of the synthesizer.

### 4. Process Isolation (ExtensionKit)

Starting with more recent updates, Apple has been moving the engine into its own isolated process (**`KonaSynthesizer.appex`**) using the **ExtensionKit** framework [7]. This move from being a simple library to a standalone system service changes how the engine interacts with the rest of the OS, which could lead to minor changes in latency or the smoothness of the speech flow.

### 5. Pitch Computation (F0)

Since the **F0 (pitch)** is not part of the standard 5ms KlattFrame and is instead computed separately within the engine's internal state, any small change in how the **intonation contour** or **pitch fluctuation** is calculated outside the frozen Delta rules would change the "melody" of the voice without changing the engine itself [1].

In short, you’re likely hearing the results of **Apple’s ongoing maintenance**. Even though the "brain" is a 20-year-old artifact, the "body" it lives in is constantly being adjusted to fit the latest hardware and software standards [2].

Diego Gomez
6/11/2026

Interesting. Yeah, there seems to be some change in how breathiness is perceived by the male and female vocal tract, although it’s most noticeable with the female one compared to 16 when it was last left off. It sounds alienated now.
