The AI distinguishes human speech from other sounds and is widely used in voice-activated apps.
Beta version: The platform is still under development and may be less stable than the final release. Access to and use of the platform may be limited; for example, the platform might crash, some features might not work properly, or some data might be lost.
The speech segmentation service uses a deep learning model trained on Mel-frequency cepstral coefficients (MFCCs) extracted from short signal frames (less than 1 second). The model classifies speech and non-speech segments in both clean and noisy environments.
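For illustration, the sketch below extracts MFCCs from short frames using the librosa library. The backend's actual feature parameters (window size, hop size, number of coefficients) are not published, so the values here are assumptions.

    # Sketch of MFCC extraction over short windows (assumes librosa; the
    # backend's real feature pipeline and parameters may differ).
    import librosa

    # Load audio as mono at 16 kHz (any rate of 8 kHz or greater meets the input criteria).
    signal, sr = librosa.load("test.wav", sr=16000, mono=True)

    # Extract MFCCs with ~25 ms windows and 10 ms hops (well under 1 second),
    # i.e. the kind of short-frame features the model is trained on.
    mfccs = librosa.feature.mfcc(
        y=signal,
        sr=sr,
        n_mfcc=13,
        n_fft=int(0.025 * sr),
        hop_length=int(0.010 * sr),
    )
    print(mfccs.shape)  # (n_mfcc, n_frames)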
For speech segmentation, we use the F1 score, precision, true positive rate (TPR), false positive rate (FPR), AUROC, and the confusion matrix to evaluate the model. In general, higher F1 score, precision, TPR, and AUROC indicate better performance, while a lower FPR indicates better performance.
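As an illustration, these metrics can be computed per frame with scikit-learn. The labels and scores below are placeholders, not real evaluation data.

    # Illustrative per-frame evaluation (assumes scikit-learn; the label and
    # score arrays are placeholders rather than actual results).
    from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                                 recall_score, roc_auc_score)

    y_true = [1, 1, 0, 1, 0, 0, 1, 0]                       # 1 = speech, 0 = non-speech
    y_score = [0.9, 0.4, 0.3, 0.6, 0.2, 0.7, 0.8, 0.1]      # model probabilities
    y_pred = [1 if s >= 0.5 else 0 for s in y_score]        # thresholded decisions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("F1       :", f1_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("TPR      :", recall_score(y_true, y_pred))       # recall equals TPR
    print("FPR      :", fp / (fp + tn))
    print("AUROC    :", roc_auc_score(y_true, y_score))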
The speech segmentation model takes MFCCs as input. Users only need to send audio files (.wav, .mp3, .ogg, .flac, etc.); our backend converts the audio to MFCCs automatically. The API input is sent as multipart/form-data (the filename and file path) in an HTTP POST request.
Note that the input audio must meet the following criteria: mono-channel recording and a sampling rate of 8 kHz or greater.
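A minimal request sketch using the Python requests library is shown below; the endpoint URL and the form field name are placeholders, so substitute the platform's actual values.

    # Sketch of the multipart/form-data upload (assumes the Python requests
    # library; the endpoint URL and field name "file" are placeholders).
    import requests

    url = "https://example.com/api/speech-segmentation"  # placeholder endpoint

    with open("test.wav", "rb") as audio:
        # The audio file is sent as multipart/form-data in an HTTP POST request.
        response = requests.post(url, files={"file": ("test.wav", audio)})

    response.raise_for_status()
    print(response.json())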
The response of the speech segmentation model contains the time intervals of the speech and non-speech segments in each input file. The API response is in the following JSON format:
[
  {
    "filename": <file-name>,
    "interval": [
      {
        "start": <HH:mm:ss.sss>,   // time
        "end": <HH:mm:ss.sss>,     // time
        "label": <speech / non-speech>
      },
      ... // continue to next segment
    ]
  },
  ... // continue to next file
]
Example of JSON response:
[
  {
    "filename": "test.wav",
    "interval": [
      {
        "start": "00:00:00.120",
        "end": "00:00:01.147",
        "label": "non-speech"
      },
      {
        "start": "00:00:01.147",
        "end": "00:00:06.897",
        "label": "non-speech"
      }
    ]
  }
]
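For example, a client can read the intervals from such a response as follows; the payload below simply reuses the example response above.

    # Sketch of reading the JSON response; `payload` stands in for the body
    # returned by the API.
    import json

    payload = '''
    [
      {
        "filename": "test.wav",
        "interval": [
          {"start": "00:00:00.120", "end": "00:00:01.147", "label": "non-speech"},
          {"start": "00:00:01.147", "end": "00:00:06.897", "label": "non-speech"}
        ]
      }
    ]
    '''

    for item in json.loads(payload):
        print(item["filename"])
        for segment in item["interval"]:
            print(f'  {segment["start"]} to {segment["end"]}: {segment["label"]}')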