# How it works

## Process of transcribing audio <a href="#process-of-transcribing-audio" id="process-of-transcribing-audio"></a>

The transcription process is illustrated in the figure below.

![Work flow](https://paper-attachments.dropbox.com/s_AC291D565674465714122CB5F15A2F5BC24D128D6C807A99049FEB69358E12A2_1550999501169_SgDecoding_Offline.png)

The figure above illustrates how the offline decoding system works. The audio input is processed through the following steps:

* **Step 1**: Resample the audio file
  * The audio is split into mono channels and resampled to the sample rate that matches the trained model.
  * Tools used: SoX/FFmpeg
* **Step 2**: Detect the speech in the input
  * Speaker diarisation (or diarization) is the process of partitioning an input audio stream into homogeneous segments according to speaker identity.
  * The output of this step is a segment file (.seg) containing, for each speech segment, the speaker ID and the start/end times.
* **Step 3**: Convert the audio to the proper format
  * For further processing by the decoding scripts, the audio data and segment file are parsed into the appropriate format for the next step.
* **Step 4**: Extract features from the input
  * Acoustic features are extracted from the audio within the detected speech segments.
* **Step 5**: Decode/Generate the transcription
  * The features extracted in the previous step are decoded with our trained model to generate the transcription in CTM/STM format.
  * The transcription is also converted to other formats to support different user requests, such as TextGrid, CSV, and plain text.
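Step 1 above can be sketched as building an FFmpeg command that converts any supported input into a mono, 16-bit WAV at the model's sample rate. This is a minimal sketch, not the system's actual resampling script; the function name and default rate are illustrative assumptions:

```python
import subprocess  # used to run the built command, as in the example below


def build_resample_cmd(src, dst, rate=16000):
    """Build an ffmpeg command that converts `src` into a mono,
    16-bit PCM WAV at the given sample rate (Step 1 of the pipeline).

    Hypothetical helper for illustration; the real pipeline's scripts
    and target rate depend on the trained model.
    """
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src,             # any input format ffmpeg supports
        "-ac", "1",            # down-mix to a single (mono) channel
        "-ar", str(rate),      # resample to the model's sample rate
        "-sample_fmt", "s16",  # 16-bit signed PCM
        dst,
    ]


# Example usage (requires ffmpeg on the PATH):
# subprocess.run(build_resample_cmd("talk.mp3", "talk.wav"), check=True)
```

Building the argument list separately from running it makes the command easy to log or test without invoking FFmpeg.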

The output files are written to a public folder, where the user can apply further post-processing, such as converting to their required format or sending the output to other modules (language understanding, sentence-unit insertion, etc.).

The system processes input files sequentially. The ‘file\_name’ of each input audio file is normalized into a ‘file\_id’. The output folder has the following structure:

```
/path/to/the/output/folder/
.
├── <file-id-1>
│   ├── <file-id-1>.<model_name>.ctm
│   ├── <file-id-1>.<model_name>.srt
│   ├── <file-id-1>.<model_name>.stm
│   ├── <file-id-1>.<model_name>.TextGrid
│   └── <file-id-1>.<model_name>.txt
├── <file-id-2>
│   ├── <file-id-2>.<model_name>.ctm
│   ├── <file-id-2>.<model_name>.srt
│   ├── <file-id-2>.<model_name>.stm
│   ├── <file-id-2>.<model_name>.TextGrid
│   └── <file-id-2>.<model_name>.txt
└── <file-id-3, e.g. 8khz-testfile>
    ├── 8khz-testfile.<model_name>.ctm
    ├── 8khz-testfile.<model_name>.srt
    ├── 8khz-testfile.<model_name>.stm
    ├── 8khz-testfile.<model_name>.TextGrid
    └── 8khz-testfile.<model_name>.txt

* Other file types may also exist.
```
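The exact rule for normalizing a ‘file\_name’ into a ‘file\_id’ is not specified here. A hypothetical sketch, assuming the rule lowercases the base name and replaces non-alphanumeric characters with hyphens (as the `8khz-testfile` example suggests), might look like:

```python
import os
import re


def to_file_id(file_name):
    """Hypothetical normalization of an input file name into a file id.

    Assumed rule for illustration only: strip the directory and extension,
    lowercase, and collapse non-alphanumeric runs into single hyphens.
    The real system's normalization may differ.
    """
    base, _ext = os.path.splitext(os.path.basename(file_name))
    return re.sub(r"[^a-z0-9]+", "-", base.lower()).strip("-")


# Example: to_file_id("/data/8khz Testfile.WAV") -> "8khz-testfile"
```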

Information contained in the output files:

* \***.ctm**: each word with its start and end times.
* \***.srt**, \***.stm**, \***.TextGrid**: segments (sentences) with start and end times.
* \***.txt**: the whole transcription as plain text.
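As an illustration of reading the word timings, here is a minimal parser for one CTM line. It assumes the common `<file-id> <channel> <start> <duration> <word> [confidence]` column layout; the exact columns this system emits may differ:

```python
def parse_ctm_line(line):
    """Parse one CTM line into (word, start_time, end_time).

    Assumes the common layout: <file-id> <channel> <start> <duration> <word> [conf],
    where the end time is start + duration. The exact column layout produced
    by a given system may differ.
    """
    fields = line.split()
    start = float(fields[2])
    duration = float(fields[3])
    word = fields[4]
    return word, start, start + duration


# Example with a hypothetical CTM line:
# parse_ctm_line("8khz-testfile 1 0.55 0.32 hello 0.98")
```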

## File type and language supported <a href="#file-type-and-language-supported" id="file-type-and-language-supported"></a>

Currently, our offline system supports the following file types and language models (the list is not exhaustive):

<table data-header-hidden><thead><tr><th width="283">Languages</th><th>Description</th></tr></thead><tbody><tr><td>Languages</td><td>Description</td></tr><tr><td>Singapore Code Switch</td><td>"Code-switching" between different languages such as English-Mandarin-Malay</td></tr><tr><td>Mandarin</td><td>Monolingual ASR model</td></tr><tr><td>Singapore English</td><td>English ASR model with localised terms</td></tr></tbody></table>

> Users can upload files of up to 500 MB each.

For all language models, we support the following file types: **.wav, .mp3, .mp4, .flac, .ogg.**

Regardless of the input audio format, our offline system down- or up-samples the audio to a 16 kHz/8 kHz sampling rate, 16-bit depth, and a mono channel before processing. The system performs best with clear, clean speech recorded at a 16 kHz sampling rate and 16-bit depth in close-talk or telephony conditions.
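To check whether a WAV file already matches the preferred 16 kHz, 16-bit, mono format (so it needs no resampling), a small sketch using Python's standard `wave` module; the function name is an illustrative assumption:

```python
import wave


def matches_preferred_format(path, rate=16000):
    """Return True if the WAV file at `path` is mono, 16-bit PCM,
    at the given sample rate (the system's preferred input format)."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1      # mono channel
            and wav.getsampwidth() == 2  # 16-bit = 2 bytes per sample
            and wav.getframerate() == rate
        )
```

Files that fail this check would be converted in Step 1 before decoding.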
