{{short description|Text-to-speech computer software package}}
{{Infobox software
| developer = Trillium Sound Research
| released = {{Start date and age|df=yes|2002}}
| latest release version = {{wikidata|property|reference|edit|P348}}
| latest release date = {{Start date and age|{{wikidata|qualifier|P348|P577}}}}
| platform = [[Cross-platform]]
| genre = [[Text-to-speech]]
| license = [[GNU General Public License]]
}}

'''Gnuspeech''' is an extensible [[text-to-speech]] computer [[Application software|software package]] that produces artificial speech output based on real-time [[articulatory synthesis|articulatory]] speech synthesis by rules. That is, it converts text strings into phonetic descriptions, aided by a pronouncing dictionary, letter-to-sound rules, and rhythm and intonation models; transforms the phonetic descriptions into parameters for a low-level articulatory [[speech synthesizer]]; uses these to drive an articulatory model of the human [[vocal tract]], producing output suitable for the normal sound output devices used by various computer [[operating system]]s; and, for adult speech, does this at the same rate as or faster than the speech is spoken.
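
The stages of this pipeline can be illustrated with a short sketch. The following Python toy is not Gnuspeech code: every function, rule, and data value in it is a simplified placeholder standing in for the dictionary, letter-to-sound, rhythm, and intonation components named above.

<syntaxhighlight lang="python">
# Toy sketch of a rule-driven text-to-speech pipeline of the kind described
# above. All names, rules, and values are illustrative placeholders, not
# Gnuspeech's actual modules or data.
DICTIONARY = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def letter_to_sound(word):
    # Fallback for words not in the dictionary: one pseudo-phone per letter.
    return list(word.upper())

def text_to_phones(text):
    # Stage 1: pronouncing dictionary first, letter-to-sound rules otherwise.
    phones = []
    for word in text.lower().split():
        phones += DICTIONARY.get(word, letter_to_sound(word))
    return phones

def add_rhythm_and_intonation(phones):
    # Stage 2: give each phone a duration (seconds) and a pitch target (Hz).
    # Real rhythm and intonation models are rule systems; this is a stand-in.
    return [(p, 0.08, 120.0 - 2.0 * i) for i, p in enumerate(phones)]

def phones_to_parameters(prosodic_phones, frames_per_phone=10):
    # Stage 3: expand per-phone targets into frame-rate control parameters
    # for the low-level synthesizer (here, only a pitch track).
    return [f0 for _, _, f0 in prosodic_phones for _ in range(frames_per_phone)]

def synthesize(text):
    # Stage 4 would feed these parameter tracks to the articulatory tube
    # model (see the Design section) to produce audio samples.
    return phones_to_parameters(add_rhythm_and_intonation(text_to_phones(text)))

print(synthesize("hello world")[:5])  # first few frames of the pitch track
</syntaxhighlight>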


== Design ==
The synthesizer is a tube resonance, or waveguide, model that simulates the behavior of the real [[vocal tract]] directly, and reasonably accurately, unlike formant synthesizers, which model the speech spectrum only indirectly.<ref>COOK, P.R. (1989) Synthesis of the singing voice using a physically parameterized model of the human vocal tract. International Computer Music Conference, Columbus, Ohio</ref> The control problem is solved by using René Carré's Distinctive Region Model,<ref>CARRE, R. (1992) Distinctive regions in acoustic tubes. Speech production modelling. Journal d'Acoustique, 5, 141–159</ref> which relates changes in the radii of eight longitudinal divisions of the vocal tract to corresponding changes in the three [[formant]] frequencies in the speech spectrum that convey much of the information of speech. The regions are, in turn, based on work by the Stockholm Speech Technology Laboratory<ref>Now the [http://www.speech.kth.se Department for Speech, Music and Hearing]</ref> of the Royal Institute of Technology ([[KTH]]) on "formant sensitivity analysis" – that is, how formant frequencies are affected by small changes in the radius of the vocal tract at various places along its length.<ref>FANT, G. & PAULI, S. (1974) Spatial characteristics of vocal tract resonance models. Proceedings of the Stockholm Speech Communication Seminar, [[KTH]], Stockholm, Sweden</ref>
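
The general idea of a waveguide tube model can be sketched briefly. The following Python fragment implements a minimal Kelly–Lochbaum-style scattering ladder: pressure waves travel in both directions along a chain of cylindrical sections, partially reflecting wherever the cross-sectional area changes. It is a simplified illustration only, not Gnuspeech's tube resonance model, which adds (among other things) nasal branching, losses, radiation characteristics, and the Distinctive Region Model control mapping; the area values and reflection coefficients below are assumed for the example.

<syntaxhighlight lang="python">
import numpy as np

def tube_waveguide(areas, source, glottal_refl=0.7, lip_refl=-0.85):
    """Minimal Kelly-Lochbaum waveguide: 'areas' lists the cross-sectional
    areas of the tube sections from glottis to lips; 'source' is the
    glottal excitation signal. Returns the pressure radiated at the lips."""
    a = np.asarray(areas, dtype=float)
    n = len(a)
    # Reflection coefficient at each of the n-1 junctions between sections.
    r = (a[:-1] - a[1:]) / (a[:-1] + a[1:])
    fwd = np.zeros(n)                  # right-going wave in each section
    bwd = np.zeros(n)                  # left-going wave in each section
    out = np.empty(len(source))
    for t, x in enumerate(source):
        f_new, b_new = np.empty(n), np.empty(n)
        f_new[0] = x + glottal_refl * bwd[0]    # inject and reflect at glottis
        for i in range(n - 1):                  # scatter at each junction
            f_new[i + 1] = (1 + r[i]) * fwd[i] - r[i] * bwd[i + 1]
            b_new[i] = r[i] * fwd[i] + (1 - r[i]) * bwd[i + 1]
        b_new[n - 1] = lip_refl * fwd[n - 1]    # partial reflection at lips
        out[t] = (1 + lip_refl) * fwd[n - 1]    # transmitted part radiates
        fwd, bwd = f_new, b_new
    return out

# Usage: an /a/-like area profile over eight regions and a crude 110 Hz
# pulse-train excitation (all values assumed purely for illustration).
fs = 22050
pulses = np.zeros(fs)
pulses[::fs // 110] = 1.0
audio = tube_waveguide([2.6, 1.2, 0.7, 0.5, 0.65, 1.6, 2.6, 4.0], pulses)
</syntaxhighlight>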


==History==
Gnuspeech was originally commercial software produced by the now-defunct Trillium Sound Research for the [[NeXT]] computer as various grades of "TextToSpeech" kit. Trillium Sound Research was a [[technology transfer]] spin-off company formed at the University of Calgary, Alberta, Canada, based on long-standing research in the computer science department on [[computer-human interaction]] using speech; the department maintains papers and manuals relevant to the system.<ref>[http://pages.cpsc.ucalgary.ca/~hill/gnuspeech/gnuspeech-index.htm Relevant University of Calgary website]</ref> The initial version in 1992 used a formant-based speech synthesizer. When NeXT ceased manufacturing hardware, the synthesizer software was completely rewritten<ref>[https://www.gnu.org/software/gnuspeech/trm-write-up.pdf The Tube Resonance Model Speech Synthesizer]</ref> and also ported to NSFIP (NextStep For Intel Processors) using the waveguide approach to acoustic tube modeling, based on research at the Center for Computer Research in Music and Acoustics ([[CCRMA]]) at Stanford University, especially the Music Kit. The synthesis approach is explained in more detail in a paper presented to the American Voice I/O Society in 1995.<ref>[http://pages.cpsc.ucalgary.ca/~hill/papers/avios95/index.htm HILL, D.R., MANZARA, L. & TAUBE-SCHOCK, C-R. (1995) Real-time articulatory speech-synthesis-by-rules. Proc. AVIOS '95 14th Annual International Voice Technologies Conf, San Jose, 12-14 September 1995, 27-44]</ref> The system used the onboard 56001 digital signal processor (DSP) on the NeXT computer, and a Turtle Beach add-on board with the same DSP on the NSFIP version, to run the waveguide (also known as the tube model). Speed limitations meant that the shortest vocal tract length that could be used for speech in real time (that is, generated at the same rate as or faster than it was "spoken") was around 15 centimeters, because the sample rate for the waveguide computations increases with decreasing vocal tract length. Faster processor speeds are progressively removing this restriction, an important advance for producing children's speech in real time.
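
The relationship between tract length and sample rate can be made concrete with a simplified calculation. If a tract of length <math>L</math> is simulated as <math>N</math> sections and each section contributes one sample of acoustic travel time, the sampling rate must equal the rate at which sound (speed <math>c</math>) crosses one section:

<math display="block">f_s = \frac{c}{L/N} = \frac{Nc}{L}.</math>

Taking <math>c \approx 350</math> m/s and, say, <math>N = 10</math> sections, a 15 cm tract needs <math>f_s \approx 23</math> kHz, while halving the tract length to a child-like 7.5 cm doubles the required rate to about 47 kHz, with a correspondingly higher computational load. (The figures are illustrative; the tube resonance model's actual section count and bookkeeping differ in detail.)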

Since [[NeXTSTEP]] is discontinued and NeXT computers are rare, one option for running the original code is a [[virtual machine]]. The [[Previous (software)|Previous]] emulator, for example, can emulate the DSP in NeXT computers, which the Trillium software can use.

[[File:Monet (Gnuspeech) in Nextstep 3.3 running inside Previous.png|thumb|MONET (Gnuspeech) in [[NeXTSTEP]] 3.3 running inside [[Previous (software)|Previous]].]]


Trillium ceased trading in the late 1990s, and the Gnuspeech project was first entered into the [[GNU Savannah]] repository under the terms of the [[GNU General Public License]] in 2002, becoming official [[GNU]] software.


Due to its [[Free and open-source software|free and open-source]] license, which allows customization of the code, Gnuspeech has been used in academic research.<ref>D'ESTE, F. (2010) Articulatory Speech Synthesis with Parallel Multi-Objective Genetic Algorithm. Master's thesis, Leiden Institute of Advanced Computer Science</ref><ref>XIONG, F. & BARKER, J. (2018) Deep Learning of Articulatory-Based Representations and Applications for Improving Dysarthric Speech Recognition. ITG Conference on Speech Communication, Germany</ref>
{{listen
| filename = The Chaos synthesized by Gnuspeech - DSP.ogg
| title = Synthesis example
| description = [[The Chaos]] synthesized by Trillium TTS (Gnuspeech) using the DSP vocal tract model.
}}


==References==

{{Reflist}}


== External links ==
* [http://savannah.gnu.org/projects/gnuspeech Gnuspeech on GNU Savannah]
* [https://www.gnu.org/software/gnuspeech/ Overview of the Gnuspeech system]
{{Speech synthesis}}

[[Category:Speech synthesis]]
[[Category:Cross-platform free software]]
[[Category:Free speech synthesis software]]
