Skit.ai’s Post

Exciting news from our ML research team!

Shangeth Rajaa
Senior ML Scientist | Skit.ai | NTU Singapore | IBM Research Labs | INRIA

We at Skit.ai are thrilled to announce the release of our latest Multi-Modal LLM models for Speech Understanding on Hugging Face, along with a comprehensive GitHub repository containing the code to train these models and run inference with them!

Unlike traditional ASR + LLM pipelines, our multi-modal speech LLMs leverage the acoustic, semantic, prosodic, and speaker information in the speech signal to predict attributes such as Transcript, Speech Activity, Gender, Age, Accent, and Emotion of the speaker in a conversation directly from the speech signal. The models can be further trained end-to-end to generate responses conditioned on this speaker metadata for task-oriented dialogue (TOD) systems, e.g., apologetic responses when the speaker sounds frustrated. This is similar in spirit to our previous demo blog on Multi-Modal LLMs for Conversational Agents: https://lnkd.in/gRaSi99X. Because training the model is simple, new perception or generation tasks can be added to it, e.g., multi-speaker transcription, speech environment classification, or speech translation.

🔗 Check it out:

Hugging Face Models:
• speechllm-2B: https://lnkd.in/gdRAwj3U
• speechllm-1.5B: https://lnkd.in/gdGA6Jzj

GitHub Repository: https://lnkd.in/gu6DSvmc

#MultiModalLLM #LLM #HuggingFace #GitHub #ConversationalAI
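
For readers who want to try the models, here is a minimal inference sketch. It assumes the 2B checkpoint is published under the Hugging Face ID skit-ai/speechllm-2B (the shortened links above are not expanded here) and that the repository ships custom modeling code exposing a generate_meta helper via trust_remote_code; please consult the linked GitHub repository for the exact interface.

```python
# Minimal sketch of running a SpeechLLM checkpoint from Hugging Face.
# Assumptions (not verified against the shortened links above):
#   - the 2B checkpoint is published as "skit-ai/speechllm-2B"
#   - the repo ships custom modeling code exposing a generate_meta() helper
# Consult the linked GitHub repository for the authoritative usage.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "skit-ai/speechllm-2B",   # assumed model ID; swap for the 1.5B variant if preferred
    trust_remote_code=True,   # loads the custom multi-modal speech LLM code
)

# Ask the model for several speaker attributes directly from the audio signal.
output = model.generate_meta(
    audio_path="sample.wav",  # hypothetical 16 kHz mono speech clip
    instruction=(
        "Give me the following information about the audio "
        "[Transcript, Gender, Age, Accent, Emotion]"
    ),
    max_new_tokens=256,
)
print(output)  # expected: a text/JSON-like string with the requested attributes
```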
