“The human voice is the organ of the soul.” – Henry Wadsworth Longfellow
Until the last century, speech was considered an exclusively human ability: only Homo sapiens could talk and understand one or more languages. Now, however, mobile phones, companion robots, virtual assistants, video games and many other software applications use artificial voices and speak dozens of languages, often more fluently and grammatically than most of us. It seems that artificial voices have invaded every area of our lives and will continue to do so for many years to come.
Researchers created the first artificial voice in the 1950s. Because of the rudimentary computing systems of the time, it had a robotic tone and a male timbre. Over the next 40 years, computing systems did not evolve much, so artificial voices retained that robotic tone and male timbre. In the 1990s, however, high-powered data processors appeared on the market, and with their help researchers were able to rid artificial voices of their robotic tonality and create the first artificial voice with a female timbre. Computing systems developed exponentially in the years that followed, and with their help researchers created increasingly complex and high-performance artificial voices.

Artificial voices are produced by computing systems using databases that contain acoustic representations of each word, in all its possible meanings. When a computing system utters a sentence, it extracts from this database an acoustic representation of each word in that sentence, according to the meaning the word must convey within it.¹ The computing system then overlays a melodic line (prosody) on the selected acoustic representations, so that the artificial voice acquires a particular tonality and rhythm and thus conveys certain meanings and emotions to the human listener.

The high-performance computing systems that exist today have made possible very complex artificial voices, with a clarity of speech and tonalities similar to those of the human voice. However, artificial voices still fail to convey emotions as intense and varied as the human voice does [1] [2] [3] [4] [5] [6] [7] [8].
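To make the simplified pipeline above more concrete, here is a minimal sketch in Python. Every name in it, the AcousticUnit structure, the DATABASE, select_units and apply_prosody, is a hypothetical illustration of the lookup-and-prosody idea described in the text, not the actual software of any speech synthesis system.

```python
# Sketch of the simplified pipeline from the text: a database maps each
# (word, sense) pair to a stored acoustic unit; the system picks one unit
# per word according to its meaning in the sentence, then overlays a
# prosody contour (melodic line) on the concatenated units.
# All data structures and values here are hypothetical illustrations.

from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AcousticUnit:
    """A stored acoustic representation of one word in one sense."""
    word: str
    sense: str
    samples: List[float]   # placeholder for waveform samples
    base_pitch_hz: float   # neutral pitch of the stored recording


# Hypothetical database: (word, sense) -> acoustic unit.
DATABASE: Dict[Tuple[str, str], AcousticUnit] = {
    ("the", "article"):  AcousticUnit("the", "article",  [0.0] * 300, 120.0),
    ("bank", "river"):   AcousticUnit("bank", "river",   [0.0] * 800, 120.0),
    ("bank", "finance"): AcousticUnit("bank", "finance", [0.0] * 800, 120.0),
}


def select_units(words_with_senses: List[Tuple[str, str]]) -> List[AcousticUnit]:
    """Pick one acoustic unit per word, according to its meaning in the sentence."""
    return [DATABASE[(word, sense)] for word, sense in words_with_senses]


def apply_prosody(units: List[AcousticUnit], contour: List[float]) -> List[float]:
    """Concatenate the units and scale each one by a prosody value.

    Prosody is reduced here to a single gain factor per word.
    """
    voiced: List[float] = []
    for unit, factor in zip(units, contour):
        voiced.extend(sample * factor for sample in unit.samples)
    return voiced


# Usage: "the bank" in its financial sense, with a falling melodic line.
units = select_units([("the", "article"), ("bank", "finance")])
waveform = apply_prosody(units, contour=[1.0, 0.8])
print(len(waveform), "samples generated")
```

Real synthesis systems typically work with sub-word units and shape pitch, duration and loudness continuously over time, but the basic sequence, looking up stored sound and then imposing a melodic line, is the one described in the simplified account above.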
Some researchers believe that this emotional shortcoming of artificial voices is due to the way emotions are received by human listeners. We receive the emotions transmitted by a human or artificial voice with the help of mirror neurons in the brain. Mirror neurons duplicate in the listener’s brain only the authentic emotional experiences of the speaker, not those that are imitations. Computing systems are not alive, so they cannot have authentic human emotional experiences; they can only imitate them. Mirror neurons, therefore, cannot duplicate these imitations in the human listener’s brain, however good they may be.
Researchers in this field believe that this problem could be solved by investigating the brain processes that underlie the reception of emotions transmitted by the human voice and identifying those that could also mediate the transmission of emotions by artificial voices.
The brain mechanisms that underlie the reception of emotions transmitted by voice have been analysed in many scientific studies. A study from 2024 examined the brain mechanisms by which human voices and artificial voices transmit emotions to listeners [1]. The volunteers who participated in that study rated the emotional impact of artificial voices at an average of 3 on a scale of 1 to 10, compared with 8 for human voices. The MRI images showed that for artificial voices the transmission occurs through brain mechanisms related to memory, whereas for human voices it occurs through mechanisms related to mirror neurons. The results showed that artificial voices can also generate emotions in the brain of the human listener, but only emotions that are already known (and stored in memory), and only at minimal intensity.² The researchers concluded that, because of the way they are constructed, artificial voices cannot yet equal the emotional impact of the human voice.
Acknowledgment: This text was taken from the book The Power of Voice, with the consent of the author Eduard Dan Franti. The Power of Voice can be obtained from Memobooks, Apple Books, or Amazon.
Footnotes:
1. In this chapter, a simplified version of the process of generating artificial voices is presented, to make it easier to understand for readers who are not specialists in this field. The actual process is very complex and contains many more steps than those presented here.
2. Researchers attribute the limited intensity of the emotions transmitted by artificial voices to the brain’s memory mechanisms, which reactivate only some of the emotion-triggering stimuli.
Bibliography
[1] A. G. Andrei, C. A. Brătan, C. Tocilă-Mătăsel, B. Morosanu, B. Ionescu, A. V. Tebeanu, M. Dascălu, G. Bobes, I. Popescu, A. Neagu, G. Iana, E. Franti and G. Iorgulescu, “Mirror Neurons cannot be Fooled by Artificial Voices – a study using MRI and AI algorithms,” in Conference on Computing in Natural Sciences, Biomedicine and Engineering, Athens, 2024.
[2] J. Bachorowski, “Vocal expression and perception of emotion,” Curr. Dir. Psychol. Sci., vol. 8, pp. 53–57, 1999.
[3] Å. Abelin and J. Allwood, “Cross linguistic interpretation of emotional prosody,” in ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, 2000, pp. 110–113.
[4] A. V. Clark, Psychology of Moods, Nova Publishers, 2005.
[5] X. Chen, J. Yang, S. Gan and Y. Yang, “The contribution of sound intensity in vocal emotion perception: behavioral and electrophysiological evidence,” PLoS ONE, 2012.
[6] T. Bänziger, G. Hosoya and K. R. Scherer, “Path models of vocal emotion communication,” PLoS ONE, 2015.
[7] R. L. C. Mitchell and Y. Xu, “What is the Value of Embedding Artificial Emotional Prosody in Human–Computer Interactions? Implications for Theory and Design in Psychological Science,” Front. Psychol., 2015.
[8] K.-L. Huang, S.-F. Duan and X. Lyu, “Affective Voice Interaction and Artificial Intelligence: A Research Study on the Acoustic Features of Gender and the Emotional States of the PAD Model,” Front. Psychol., 2021.