How to Build a Presentation Trainer Using Pretrained Models

Aug 15, 2023


Public speaking is an art that often takes years of practice to master. To assist in this endeavor, we've created a Presentation Trainer that uses various machine learning models to give you real-time feedback. This project combines Speech-to-Text (STT), Speech Emotion Recognition (SER), and Facial Emotion Recognition (FER) technologies to evaluate your presentation skills.

Key Features

  1. Speech-to-Text (STT): Transcribes your speech.
  2. Speech Emotion Recognition (SER): Identifies the emotional tone in your speech.
  3. Facial Emotion Recognition (FER): Recognizes facial expressions.
  4. Pitch Detection: Monitors the pitch of your voice.
  5. Words per Minute (WPM) Monitoring: Calculates your speaking speed.
  6. Loudness Monitoring: Monitors the loudness of your voice.
  7. Streamlit Web UI: Provides the interactive interface.
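Two of these metrics need no pretrained model at all. As a rough illustration (not the project's actual implementation), loudness can be measured as an RMS level in dBFS and pitch estimated with a naive autocorrelation:

```python
import math

def loudness_db(samples):
    """RMS level in dBFS for 16-bit PCM samples (0 = full scale).
    Note: the project later compares loudness against 65, which suggests an
    SPL-like reference; this dBFS sketch is illustrative only."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")

def estimate_pitch(samples, sample_rate, fmin=50, fmax=500):
    """Naive autocorrelation pitch estimate in Hz (illustrative only)."""
    best_lag, best_corr = 0, 0.0
    for lag in range(int(sample_rate / fmax), int(sample_rate / fmin) + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0
```

Production code would use vectorized routines (e.g. a dedicated pitch tracker), but the underlying math is the same.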

Preparing the Pretrained Models

First, we load pretrained models for FER, STT, and SER. The STT model is downloaded from the Hugging Face Hub together with an external scorer for better decoding.

from fer import FER
from huggingface_hub import hf_hub_download
from stt import Model  # Coqui STT

def get_fer_model():
    return FER()

def get_stt_model():
    REPO_ID = "mbarnig/lb-de-fr-en-pt-coqui-stt-models"
    en_stt_model_path = hf_hub_download(repo_id=REPO_ID, filename="english/model.tflite")
    en_stt_scorer_path = hf_hub_download(repo_id=REPO_ID, filename="english/huge-vocabulary.scorer")
    model = Model(en_stt_model_path)
    model.enableExternalScorer(en_stt_scorer_path)  # attach the downloaded scorer
    return model

def get_ser_model():
    return SpeechEmotionRecognition()  # SER wrapper defined elsewhere in the project

Streamlit-WebRTC Setup

We start by setting up our Streamlit interface and initial variables. The variables store various statistics and metrics to be calculated during the session.

st.header("Presentation Trainer")
frames_deque: deque = deque([])
stats = { ... }
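The contents of `stats` are elided above; one plausible shape, given purely as an assumption, is a plain dictionary of running metric histories:

```python
# Hypothetical layout of the elided `stats` dictionary; the real project
# may track different keys.
stats = {
    "transcript": "",
    "wpm_history": [],
    "pitch_history": [],
    "loudness_history": [],
    "speech_emotions": [],
    "facial_emotions": [],
    "start_time": None,
}
```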

Real-time Video and Audio Streaming with WebRTC

We use streamlit_webrtc to stream audio and video; the frame callbacks run in a separate thread. A representative call (argument values shown here are illustrative) looks like:

webrtc_ctx = webrtc_streamer(
    key="presentation-trainer",
    mode=WebRtcMode.SENDRECV,
    video_frame_callback=video_frame_callback,
    queued_audio_frames_callback=queued_audio_frames_callback,
    media_stream_constraints={"video": True, "audio": True},
)

Video Processing

For real-time video feedback, we use the FER model to identify and highlight facial emotions. Emojis are also used to give instant feedback.

def video_frame_callback(frame: av.VideoFrame) -> av.VideoFrame:
    img = frame.to_ndarray(format="bgr24")  # frame as a BGR numpy array
    detections = fer_model.detect_emotions(img)
    return av.VideoFrame.from_ndarray(img, format="bgr24")  # img annotated from detections
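The `detections` returned by FER's `detect_emotions` are a list of dictionaries, each with a bounding `box` and per-emotion scores. Picking the dominant emotion and mapping it to an emoji for instant feedback might look like this (the emoji mapping is our assumption, not the project's):

```python
# Hypothetical emotion-to-emoji mapping; the project's own choices may differ.
EMOJI = {"happy": "😄", "sad": "😢", "angry": "😠", "surprise": "😮",
         "fear": "😨", "disgust": "🤢", "neutral": "😐"}

def top_emotion(detection):
    """Return the dominant emotion label and its emoji for one FER detection."""
    emotions = detection["emotions"]
    label = max(emotions, key=emotions.get)
    return label, EMOJI.get(label, "?")

sample = {"box": [10, 20, 100, 100],
          "emotions": {"happy": 0.8, "neutral": 0.15, "sad": 0.05}}
print(top_emotion(sample))  # ('happy', '😄')
```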

Audio Processing

We collect audio frames and process them for STT, SER, pitch detection, and loudness calculation. The STT model transcribes your speech to evaluate your speech speed in terms of Words Per Minute (WPM).

async def queued_audio_frames_callback(frames: List[av.AudioFrame]) -> List[av.AudioFrame]:
    # ... accumulate samples, compute loudness and pitch ...
    if loudness >= 65:
        pred = ser_model.inference(sound_window_buffer)  # run SER only when speech is loud enough
    transcript = stream.intermediateDecodeWithMetadata().transcripts[0]
    return frames
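Given the running transcript and the elapsed session time, WPM is straightforward. A minimal sketch (the helper name is ours, not the project's):

```python
def words_per_minute(transcript: str, elapsed_seconds: float) -> float:
    """Speaking speed from a running transcript (hypothetical helper)."""
    if elapsed_seconds <= 0:
        return 0.0
    return len(transcript.split()) * 60.0 / elapsed_seconds

print(words_per_minute("this is a short test sentence", 12.0))  # 30.0 WPM
```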


Report View

After the session, we provide a report containing all the metrics and feedback for the presenter.

page = ReportView(st.session_state['stats'])
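A report page typically aggregates the raw histories into summary figures before rendering them. A sketch of such an aggregation (the key names follow our hypothetical `stats` layout, not necessarily the project's):

```python
def summarize(stats: dict) -> dict:
    """Aggregate raw metric histories into report figures (illustrative only)."""
    wpm = stats.get("wpm_history", [])
    loud = stats.get("loudness_history", [])
    return {
        "avg_wpm": round(sum(wpm) / len(wpm), 1) if wpm else 0.0,
        "avg_loudness": round(sum(loud) / len(loud), 1) if loud else 0.0,
        "transcript": stats.get("transcript", ""),
    }

print(summarize({"wpm_history": [100, 140], "loudness_history": [60, 70],
                 "transcript": "hello"})["avg_wpm"])  # 120.0
```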


While the application offers a comprehensive suite of machine learning models, one significant constraint is the hardware requirement for real-time inference. The models employed in this project, especially those for Speech Emotion Recognition (SER) and Facial Emotion Recognition (FER), are computationally intensive and need a robust setup for smooth, real-time feedback. If your local machine cannot handle the load efficiently, one possible solution is to host the models on separate servers as individual services. This distributes the computational burden and can improve the responsiveness of the application.
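To make the service-per-model idea concrete, a client for a remote SER service could be as small as the sketch below. The endpoint URL, payload shape, and helper names are all assumptions for illustration:

```python
import json
import urllib.request

# Hypothetical endpoint; in a distributed setup each model runs behind its own service.
SER_SERVICE_URL = "http://ser-service:8000/predict"

def build_ser_request(samples, url=SER_SERVICE_URL):
    """Package an audio window as a JSON POST request for the remote SER service."""
    payload = json.dumps({"samples": list(samples)}).encode("utf-8")
    return urllib.request.Request(url, data=payload,
                                  headers={"Content-Type": "application/json"})

def remote_ser_inference(samples, url=SER_SERVICE_URL, timeout=5):
    """Offload SER inference to the remote service and return its JSON reply."""
    with urllib.request.urlopen(build_ser_request(samples, url), timeout=timeout) as resp:
        return json.loads(resp.read())
```

The audio callback would then call `remote_ser_inference(sound_window_buffer)` instead of `ser_model.inference(...)`, keeping the heavy model off the machine that runs the UI.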


Combining various machine learning models, we've built a comprehensive Presentation Trainer that provides real-time metrics and feedback to hone your public speaking skills. By using this tool, you can get a multi-dimensional view of your performance and know where to improve.

Ready to take your public speaking skills to the next level? Give our Presentation Trainer a try!

The complete source code is available in this repository.