|ARTIFICIAL INTELLIGENCE| HEALTHCARE| SPEECH|

Body language and tone of voice – not words – are our most powerful assessment tools. – Christopher Voss
Future medicine will allow diagnosis without invasive examinations, as well as remote patient monitoring. Artificial intelligence is an indispensable component of this future.
When one thinks of AI-enabled diagnosis, one thinks of medical imaging. One example is the diagnosis of potentially cancerous moles from photos captured with cell phones and analyzed by AI algorithms. Yet humans communicate primarily by voice, and inside the human voice is a treasure trove of information that goes far beyond the content of words.
How to analyze this hidden message? How to use it for diagnostic purposes?
This article discusses exactly that: why it is important, why it is difficult, what has already been done, which challenges have been encountered, the most recent developments, and what the future holds. The list of references is at the end.
The new frontier of AI in healthcare

As artificial intelligence models have improved, researchers’ attention has turned to potential applications, and medicine has received a good share of it. This is both because of the potential economic and social impact and because of the abundance of data: a medical investigation generates clinical notes as well as diagnostic results such as medical imaging (X-rays, CT scans, and so on) and laboratory tests (genomic sequencing, blood analysis, and so on). However, these data are complex and the acceptable margin of error is small, which has delayed deployment in hospitals.
Another reason for interest in artificial intelligence is that it can aid diagnosis without the need for invasive diagnostic tests. Blood draws and other medical tests still pose a risk to the patient, and the patient has to travel to a hospital or other specialized center.
We can highlight two main phases in the development of artificial intelligence models for analyzing patient data without an invasive examination. The first phase exploited convolutional networks to analyze images obtained from the patient. For example, Google and other research groups have focused on photos of the eye or ocular fundus to diagnose various diseases.
Through the Looking Glass, and What Google find there in the eye
The second phase exploits the success of large language models (LLMs) in analyzing human language, making it possible to diagnose patients from textual descriptions or to reason over medical records.
Google Med-PaLM: The AI Clinician
If humans communicate by voice, why not take advantage of speech data as well?
Indeed, speech is a rich source of data: beyond semantic content, it carries emotion, articulation, resonance, and phonation. For a diagnostic investigation, there is not only the semantic content (which could still be analyzed via transcription) but also a whole acoustic component that is rich in information.
These signals matter for diagnosis: for example, doctors use the whooping sound in children’s coughs to diagnose pertussis. In addition, speech or coughs can be recorded with a cell phone or other low-cost device, potentially giving access to diagnosis even in low-income countries or rural areas with limited access to healthcare. However, these data are difficult to process and require specific pre-processing that can lead to a loss of information.
Nevertheless, there are some examples of diagnostic models that exploit acoustic signals produced by the patient. For example, researchers developed an algorithm for detecting pertussis (a disease that causes around 200,000 deaths a year when left untreated). The model diagnoses the disease from audio recordings without any false diagnoses in their evaluation: after pre-processing, a logistic regression model identifies the whooping sound characteristic of pertussis.
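To make that recipe concrete, here is a minimal sketch of such a pipeline: summary acoustic features extracted from each recording, fed to a logistic regression. The MFCC features and the `load_cough_dataset` helper are my own assumptions for illustration, not the exact pre-processing used in the paper.

```python
# A minimal sketch of a cough-classification pipeline in the spirit of the
# pertussis study: hand-crafted acoustic features plus logistic regression.
# The feature set (MFCC summaries) and the data-loading helper are
# illustrative assumptions, not the exact choices of the original authors.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def cough_features(path: str, sr: int = 16000) -> np.ndarray:
    """Load an audio file and summarize it with mean/std of its MFCCs."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # (13, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])   # (26,)

# hypothetical helper returning audio paths and labels (1 = whooping cough)
paths, labels = load_cough_dataset()
X = np.stack([cough_features(p) for p in paths])
y = np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```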
This led researchers to extend the approach to various respiratory diseases, such as tuberculosis, COVID-19, asthma, bronchitis, and many others. Moreover, with the advent of deep learning, it became clear that one can extract both complex representations and meaningful features from speech data. Much of this work has been based on applying convolutional networks to spectrograms, an approach that has shown several successes in varied applications.
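A rough PyTorch sketch of this spectrogram-plus-CNN recipe is shown below; the architecture, input length, and mel-spectrogram settings are illustrative assumptions rather than a model from any of the cited studies.

```python
# Illustrative sketch of the "CNN on a (mel) spectrogram" recipe used by many
# respiratory-sound classifiers. Shapes and layer sizes are assumptions.
import torch
import torch.nn as nn
import torchaudio

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        # waveform -> log-mel spectrogram (treated as a 1-channel image)
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=64
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples)
        x = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (B, 1, mels, frames)
        x = self.conv(x).flatten(1)                          # (B, 32)
        return self.head(x)                                  # (B, n_classes)

model = SpectrogramCNN(n_classes=2)
logits = model(torch.randn(4, 32000))  # four 2-second clips at 16 kHz
```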
Although it seems less obvious, these models have also been used to detect early symptoms of degenerative diseases in patients’ voices. For example, in this study, the authors showed how it is possible to detect the symptoms of Alzheimer’s from recordings of patients’ voices. Again, they applied a convolutional network to the voice recordings.
These results are very promising because diseases such as Alzheimer’s normally require expensive tests such as amyloid PET brain scans or MRIs. In addition, these diagnoses are difficult and can only be truly confirmed postmortem.
The main problem with these models is that they are task-specific and fail to generalize out of distribution. They are often trained on small datasets and, when applied in different settings, they fail spectacularly. Obtaining medical datasets is expensive, and the annotation itself is laborious and requires specialized medical personnel.
For example, a model trained on data from one hospital often fails when applied to data from another hospital: the collection protocol, the equipment, and the care taken by the staff are different. This makes the published results hard to reproduce, and the model must be retrained.

In recent years, self-supervised learning (SSL) has shown success in learning a general representation of data when models (typically transformers) are trained on large unlabeled sources. The success of transformers lies in learning a universal representation of data such as images or text. The same principle can be applied here: a model (a transformer or otherwise) is trained on a large amount of data without supervision. So why not do the same for speech?
Of course, there are two main problems:
- Finding enough data to be able to train the model.
- Creating a task to be able to train the model in an unsupervised manner.
A foundation model for speech disease detection

The application of SSL to audio recordings has already been tested in previous studies, whose authors showed that a model can be used to extract features (a representation of the data). Compared with images and text, however, the main limitation they note is the lack of large, general datasets. This compromises the ability to generalize to unseen tasks and limits potential applications in the life sciences.
Recently, Google researchers introduced a model that is trained without supervision on a large corpus of human sound recordings and generalizes better to new distributions and new data.

While most of the models seen so far are trained on data with labels (supervised learning), this model is trained on a large corpus of sounds. Google collected and extracted about 300 million short (2-second) sound clips of coughing, breathing, throat clearing, and other human sounds from videos available on YouTube, amounting to more than 174,000 hours of audio.

Obviously, such a task would be impossible to do manually, so they built what the authors call a health acoustic event detector into the system. Given an audio clip, the sound is transformed into a spectrogram; then a multilabel convolutional neural network (CNN) classifier identifies whether a sound of interest is present (coughing, baby coughing, breathing, throat clearing, laughing, or speaking).
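The sketch below illustrates the idea of this filtering stage: a multilabel classifier scores each 2-second clip for several health sounds at once, and a clip is kept if any score passes a threshold. The stand-in model, spectrogram shape, and threshold are my own assumptions, not the actual detector.

```python
# Sketch of the multilabel filtering step: a classifier scores each 2-second
# clip for several health sounds at once, and a clip is kept if any score
# passes a threshold. Model, spectrogram shape, and threshold are assumptions.
import torch
import torch.nn as nn

EVENTS = ["cough", "baby_cough", "breathing", "throat_clearing", "laughing", "speaking"]

detector = nn.Sequential(            # stand-in for the CNN event detector
    nn.Flatten(),
    nn.Linear(64 * 200, 128), nn.ReLU(),
    nn.Linear(128, len(EVENTS)),     # one logit per event (multilabel)
)

def keep_clip(spectrogram: torch.Tensor, threshold: float = 0.5) -> bool:
    """Return True if the clip likely contains at least one health sound."""
    with torch.no_grad():
        probs = torch.sigmoid(detector(spectrogram.unsqueeze(0)))[0]
    return bool((probs > threshold).any())

# during training, a multilabel target vector is used with a per-event BCE loss:
loss_fn = nn.BCEWithLogitsLoss()     # sigmoid + binary cross-entropy per event
spec = torch.randn(64, 200)          # (mel bins, frames) of one 2-second clip
print(keep_clip(spec))
```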

The system is then trained using SSL. For an LLM, we train the model to predict the next word in a sequence of words (language modeling). By predicting the next word, the LLM learns the general rules of a language and the semantic meaning of each word (provided there is enough data to extract general rules). How can we adapt this to sound?
The system is trained to predict masked portions of a spectrogram: given a spectrogram, a part is hidden and the model has to reconstruct the missing portion. Similarly to LLM training, the model thereby learns hidden rules of speech.
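The following toy sketch shows what such a masked-spectrogram objective can look like: hide a contiguous chunk of the spectrogram and train the network to reconstruct it, computing the loss only on the hidden region. The masking scheme, loss, and tiny stand-in model are deliberate simplifications, not the actual training code.

```python
# Toy illustration of self-supervised masked-spectrogram training: hide part
# of the spectrogram and ask the model to reconstruct it. Masking scheme,
# loss, and model are deliberately simplified assumptions.
import torch
import torch.nn as nn

N_MELS, N_FRAMES, MASK = 64, 200, 50   # mask 50 contiguous frames

model = nn.Sequential(                 # stand-in for the real audio encoder
    nn.Flatten(),
    nn.Linear(N_MELS * N_FRAMES, 256), nn.ReLU(),
    nn.Linear(256, N_MELS * N_FRAMES),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(spec: torch.Tensor) -> torch.Tensor:
    """spec: (batch, mels, frames). Returns reconstruction loss on the mask."""
    start = torch.randint(0, N_FRAMES - MASK, (1,)).item()
    masked = spec.clone()
    masked[:, :, start:start + MASK] = 0.0       # hide a contiguous chunk
    pred = model(masked).view_as(spec)           # reconstruct the full spectrogram
    # the loss is computed only on the hidden region (what the model must guess)
    loss = nn.functional.mse_loss(
        pred[:, :, start:start + MASK], spec[:, :, start:start + MASK]
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss

batch = torch.randn(8, N_MELS, N_FRAMES)         # unlabeled spectrograms
print(training_step(batch).item())
```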
Once the filtered clips are obtained, the system uses an audio encoder to transform each of them into a vector representation (an embedding). This is done with a transformer that learns a representation of the sound. Once this representation is obtained, the authors use linear or logistic regressions for downstream tasks.
In other words, the model learns a representation of human sounds in an unsupervised manner, and this representation can then be used for downstream tasks: the model extracts features and, using linear probing (a linear model trained on those features), specific tasks can be carried out.
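In code, linear probing is as simple as it sounds: the pretrained encoder is frozen and used only as a feature extractor, and a small supervised model is fit on the embeddings. In the sketch below, `hear_embed` and `load_labeled_cough_dataset` are hypothetical placeholders for the frozen encoder and a small labeled dataset.

```python
# Linear probing sketch: freeze the pretrained audio encoder, use it only as a
# feature extractor, and fit a small supervised model on its embeddings.
# `hear_embed` and `load_labeled_cough_dataset` are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def hear_embed(clips: list) -> np.ndarray:
    """Placeholder: map each 2-second clip to a fixed-size embedding."""
    return np.stack([np.random.randn(512) for _ in clips])   # (n_clips, 512)

clips, labels = load_labeled_cough_dataset()   # hypothetical small labeled set
X = hear_embed(clips)                          # frozen features, no fine-tuning
y = np.array(labels)                           # e.g., 1 = tuberculosis-positive

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUROC:", scores.mean())
```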
In fact, the authors have collected specific datasets to evaluate their model. These datasets contain recordings of individuals with different diseases, such as COVID-19 or tuberculosis, but they can also be exploited for other tasks, such as identifying smoking status or BMI.

The authors set different types of tasks:
- 13 different health audio detection tasks.
- 14 cough inference tasks, which include the presence of abnormalities on X-rays, COVID-19 diagnosis, and identification of lifestyle and demographic factors (smoking status, sex, age, BMI).
- 6 spirometry tasks, such as estimation of forced expiratory volume and total exhale duration.
It is interesting to note that these include both classification tasks and regression tasks. Moreover, some of these targets (e.g., the spirometry values) are usually obtained with dedicated instruments, whereas here the authors use only audio recordings.
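For the regression-style targets, the recipe is the same with a regression probe instead of a classifier; in the sketch below, the precomputed embeddings and spirometer measurements are hypothetical files used only for illustration.

```python
# Same frozen embeddings, but with a regression probe for a continuous target
# such as forced expiratory volume. File names and data are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = np.load("exhalation_embeddings.npy")   # hypothetical precomputed embeddings
y = np.load("fev_measurements.npy")        # reference values from a spirometer

reg = Ridge(alpha=1.0)
mae = -cross_val_score(reg, X, y, cv=5, scoring="neg_mean_absolute_error")
print("mean absolute error:", mae.mean())
```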
The authors compare their system (HeAR) with state-of-the-art audio encoders. Especially for the cough and spirometry tasks, HeAR performs much better than the other models. Notice that an AUROC of 0.5 means random guessing, so previous state-of-the-art models performed very poorly on some health-related tasks.
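As a quick reminder of what that 0.5 baseline means, AUROC can be computed directly from predicted scores and true labels; an uninformative scorer hovers around 0.5, while anything useful moves toward 1.0:

```python
# AUROC sanity check: random scores give about 0.5, informative scores give more.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)              # binary ground-truth labels
random_scores = rng.random(1000)                    # uninformative predictions
noisy_signal = 0.3 * y_true + rng.random(1000)      # predictions with some signal

print(roc_auc_score(y_true, random_scores))         # close to 0.5 (chance level)
print(roc_auc_score(y_true, noisy_signal))          # clearly above 0.5
```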


Parting thoughts
Audio, together with video, is clearly the new frontier this year. On the one hand, OpenAI’s Sora showed that video-generation models are becoming more and more capable; on the other, there is renewed interest in speech analysis. So far, these types of data have lagged behind the SSL revolution because they are more complex to analyze and require difficult preprocessing, but this is now changing.
Until now, the application of models to speech analysis has been limited to specific tasks, and these models did not generalize well, which limited their usefulness. In this latest paper, Google shows how, through the use of SSL, a model capable of generalizing to unseen tasks can be obtained. This is possible because Google used a huge amount of data, which makes SSL effective.
This work opens up some very interesting perspectives. First, it shows that a model can generalize even to tasks for which little data is available (a common case in health research). Transfer learning and the use of foundation models have already transformed medical imaging research; foundation models for speech could have the same impact.
Second, audio recordings are not an invasive examination and can be collected anywhere, even with simple devices such as cell phones. This would give access to diagnosis to patients in remote areas or in developing countries.
"Acoustic science has existed for decades. What’s different is that now, with AI and machine learning, we have the means to collect and analyse a lot of data at the same time. There’s an immense potential not only for diagnosis, but also for screening. We can’t repeat scans or biopsies every week. So that’s why voice becomes a really important biomarker for disease monitoring. It’s not invasive, and it’s low resource." – Imran, Lead of a consortium for exploring voic as a biomarker, source
Noninvasive diagnosis will be a breakthrough for medicine, allowing rapid and continuous monitoring of patients, even in remote places. In addition, listening devices could pick up early signs of life-threatening events (respiratory crises, heart attacks, and so on) before they happen, giving the patient time to be treated. All this, however, requires artificial intelligence.
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.
Here is the link to my GitHub repository, where I am collecting code and many resources related to Machine Learning, artificial intelligence, and more.
GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…
or you may be interested in one of my recent articles:
Do Really Long-Context LLMs Exist
Tabula Rasa: Large Language Models for Tabular Data
Cosine Similarity and Embeddings Are Still in Love?
Think, Then Speak: How Researchers Gave AI an Inner Monologue
Reference
Here is the list of the principal references I consulted to write this article; only the first author of each paper is cited.
- Shor, 2021, Universal Paralinguistic Speech Representations Using Self-Supervised Conformers, link
- Pramono, 2016, A Cough-Based Algorithm for Automatic Diagnosis of Pertussis, link
- Laguarta, 2020, COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings, link
- Laguarta, 2021, Longitudinal Speech Biomarkers for Automated Alzheimer’s Detection, link
- D’amour, 2022, Underspecification Presents Challenges for Credibility in Modern Machine Learning, link
- Baur, 2024, HeAR – Health Acoustic Representations, link
- De la Fuente Garcia, 2020, Artificial Intelligence, Speech, and Language Processing Approaches to Monitoring Alzheimer’s Disease: A Systematic Review, link
- Haridas, 2018, A critical review and analysis on techniques of Speech Recognition: The road ahead, link