February 2, 2024 / Self-developed Tools / Read Time: 8 Min

A Simple Encapsulated Application Based on Alibaba's Speech-to-Text AI Large Model

Introduces an open-source offline speech-to-text tool built on Alibaba's Tongyi speech recognition model, designed for law firms and confidential industries that cannot risk uploading sensitive audio data to cloud services.

I. Introduction

The AI era has produced a large number of large models that touch almost every aspect of life.

Unfortunately, the way these models are usually delivered conflicts with the needs of confidentiality-sensitive industries.

【Privacy Leakage】

These large models are usually provided by companies. To use them, you have to upload your data or materials to those companies.

Big companies always guarantee that they won’t peek at the content of that data.

But who can really be sure?

For legal professionals and others in confidential positions:

Leaking secrets = Ruining your career

Fortunately, we can deploy large models locally (even for offline use), minimizing the risk of data leakage.

With this in mind, the author built a simple application around the speech-to-text large model that Alibaba’s Tongyi Laboratory shares through the ModelScope community (http://modelscope.cn).

The tool is completely open source, in the hope that everyone can experience the charm of AI large models.

(No more buying expensive speech-to-text services from certain companies.)

II. Repository Address

You can git clone the repository or download all of its files directly:

https://github.com/ByronLeeeee/SimpleSpeechTranscription

III. Usage Instructions

The project uses the Gradio library for its WebUI. The basic workflow is built into the interface: put in the audio, click the convert button, and find the output text file in the local folder.

Required Environment

Python 3.10+
FFmpeg (for audio format conversion; you can install and configure it in advance, or the program will download and install it if it is not found)
The code has only been tested on Windows; please test on other systems yourself
Install dependencies via:

pip install -r requirements.txt
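
Before launching the tool, it can help to confirm that the two prerequisites listed above are in place. The following is a minimal sketch, not part of the project itself: the helper names python_ok and ffmpeg_path are hypothetical, and the check only looks for an ffmpeg executable on PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements above


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def ffmpeg_path():
    """Return the path of the ffmpeg executable found on PATH, or None."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("ffmpeg:", ffmpeg_path() or "not found on PATH")
```

If ffmpeg is reported as missing, the post says the program will try to download and install it on its own, so this check is only an early warning.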

Usage Steps

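The project contains 4 folders. The ones used in the steps below are the wav folder (audio ready for recognition), the input folder (audio waiting for format conversion), and the output folder (recognition results).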

【Audio Recognition】

Simply place the wav files to be converted into the wav folder. Multiple files can be processed simultaneously.

Open the program or refresh the webpage, and it will automatically read the list of recognizable audio files. Click the [Start Recognition] button to begin speech-to-text conversion.
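
The post does not show how the program builds its list of recognizable files. As a rough illustration only, scanning the wav folder could look like the sketch below; the function name list_wav_files and the case-insensitive extension match are assumptions, not the project's actual code.

```python
from pathlib import Path


def list_wav_files(folder="wav"):
    """Return the sorted names of .wav files directly inside `folder`.

    A missing folder yields an empty list instead of raising.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )


if __name__ == "__main__":
    for name in list_wav_files():
        print(name)
```

Because a scan like this runs when the page loads, files copied into the folder later only show up after a refresh, which matches the behavior described above.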

After successful conversion, the full text of the first file appears on the right.

Meanwhile, two txt files named after the audio file are generated in the output folder.

The author has pre-selected some models from ModelScope.

Other models can be found on ModelScope. To use one, paste the model link into the modellist.ini file and restart the program.

【Format Conversion】

Since the models usually support only wav input, audio files in MP3, FLAC, and other formats need to be converted first.

Simply place the audio file in the input folder and click convert.

After successful conversion, the file is automatically saved to the wav folder. Switch back to the “Audio Recognition” tab or refresh the webpage to see the converted file.
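
For readers who prefer to convert files by hand, the same step can be approximated by calling FFmpeg directly. This is a sketch under stated assumptions: the function names are hypothetical, and the 16 kHz mono output is a guess based on common speech recognition models, not a setting confirmed by the post. Check the model page on ModelScope for the sample rate it expects.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, wav_dir="wav", sample_rate=16000):
    """Build the ffmpeg argument list that converts `src` to a mono wav file."""
    src = Path(src)
    dst = Path(wav_dir) / (src.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite the target if it already exists
        "-i", str(src),           # input file (mp3, flac, ...)
        "-ar", str(sample_rate),  # output sample rate in Hz (assumed 16 kHz)
        "-ac", "1",               # one audio channel (mono, also an assumption)
        str(dst),
    ]


def convert_to_wav(src, wav_dir="wav"):
    """Run ffmpeg on `src`; raises CalledProcessError if ffmpeg fails."""
    Path(wav_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(src, wav_dir), check=True)
```

Calling convert_to_wav("input/meeting.mp3") would write wav/meeting.wav, mirroring the input-to-wav flow described above.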

IV. Additional Notes

【Slow Download Speed】

The model is downloaded on first use. ModelScope’s download speed can be unstable, so the download may take a long time. Please be patient.

You can also download the model directly from its ModelScope webpage to a local path.

【Do Not Update modelscope and funasr Libraries】

Alibaba recently updated the modelscope and funasr libraries, and the new calling methods are completely different from the old ones.

This tool’s code does not support the new calling methods. Please keep the version numbers pinned in requirements.txt:

funasr==0.8.7
modelscope==1.9.5
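
If the tool stops working after an environment change, one quick thing to rule out is an accidental upgrade of these two libraries. The sketch below compares the installed versions against the pins quoted above; the helper names are hypothetical and the check is not part of the project.

```python
from importlib import metadata

# Pins copied from the requirements.txt lines quoted above.
PINNED = {"funasr": "0.8.7", "modelscope": "1.9.5"}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=installed_version):
    """Return {package: (expected, found)} for every package that is off-pin."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches().items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means both libraries match the pinned versions; anything else can be fixed by reinstalling from requirements.txt.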

For any other usage issues, feel free to leave a message or submit issues on GitHub.

Author

Boyang Li

Chinese Attorney — Beijing Longan (Guangzhou) Law Firm

A lawyer focused on game law, AI regulation, data compliance, and digital content rights. I write about practical legal insights for innovative tech teams.
