Getting Started with Microsoft Speech Recognition Under Unix

UPDATE – May 7 2018:

Microsoft has released a new unified Speech platform SDK (preview). At the time of this writing, it supports Windows (C/C++/C#), Linux (C/C++), and Java.

Introduction

Recently, Microsoft announced a new speech recognition API as part of its Cognitive Services. Previously, Microsoft Speech Recognition (SR) offered two ways to perform SR:

  • Using a RESTful-like API

This simple API uses HTTP POST to send audio to the SR service; you then receive a response with the SR result. It is a simple request/response API that does not satisfy all modern SR-based scenarios.

  • Using opaque platform libraries

Microsoft also provided three platform libraries that you can link against to get more advanced SR functionality. For example, they offer streaming support, partial results, and other features. These libraries are only available for .NET, iOS, and Android. If your app ran on Unix, your only choice was the RESTful API, which has very limited functionality.

With its latest update, Microsoft now offers a revamped set of APIs for both transports; in addition, it has opened the underlying protocol so that you can implement your own rich client library for your platform of choice. A reference implementation in JavaScript is also provided to make it easy for developers to get started.

The three older SDKs are still available, but they do not take advantage of the latest and greatest in the recent APIs. For example, they do not provide support for recognition modes or profanity handling. More on that later.

In this blog post, I will show you how to use Microsoft Speech Recognition through an open-source C library that I wrote using the published WebSocket protocol specification.

In the following sections, I will show you how to use the library and give some pointers on how you can build your own. However, I will not go into detail about speech recognition concepts or how the Microsoft Speech Recognition service works. That may be the subject of a future post.

What you need

First of all, you'll need a Microsoft Cognitive Services Speech API key. You can obtain one for free by following this process.

Second, you need a working Unix build system with some prerequisite libraries. Refer to the library documentation for more information about prerequisites and the build process.

Finally, if you want to run some of the examples below yourself, you will need some form of command-line audio capture. In these examples, I'm using the Linux ALSA utilities arecord, aplay and amixer. If you're using Debian or one of its flavors such as Ubuntu, you can install them with:

sudo apt-get install alsa-utils

It is also a good idea to test that sound recording and playback work with these tools before proceeding.

Getting started

The library comes with an exampleProgram command line executable to test it. Its usage is self-explanatory:

Usage: exampleProgram [OPTION...]  
 -d Produce debug output.
 -f FILE Audio input file, stdin if omitted.
 -m MODE Recognition mode:
    {interactive|dictation|conversation}. Default is interactive.
 -p MODE Set profanity handling mode {raw|masked|removed}. Default is masked.
 -t Request detailed recognition output.

Hello-world!

Let’s begin with a very simple example. Assume we have an audio file that we want recognized as text. Here is a sample file that I recorded to get started; I’m saying “What’s the weather like?”:

./exampleProgram -f whatstheweatherlike.wav YOUR_COGNITIVE_SERVICES_KEY en-us
Connecting to: wss://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=en-us
============= RESPONSE: turn.start
 path: turn.start
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 serviceTag: e8aee48917594d4f97d0f3f2a10e04b5
============= RESPONSE: speech.startDetected
 path: speech.startDetected
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 offset: 0.170000
============= RESPONSE: speech.hypothesis
 path: speech.hypothesis
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 text: what
 time: offset: 0.170000, duration: 0.190000
============= RESPONSE: speech.hypothesis
 path: speech.hypothesis
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 text: what's
 time: offset: 0.170000, duration: 0.340000
============= RESPONSE: speech.hypothesis
 path: speech.hypothesis
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 text: what's the
 time: offset: 0.170000, duration: 0.590000
============= RESPONSE: speech.hypothesis
 path: speech.hypothesis
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 text: what's the weather
 time: offset: 0.170000, duration: 0.840000
============= RESPONSE: speech.hypothesis
 path: speech.hypothesis
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 text: what's the weather like
 time: offset: 0.170000, duration: 1.080000
============= RESPONSE: speech.endDetected
 path: speech.endDetected
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 offset: 1.310000
============= RESPONSE: speech.phrase
 path: speech.phrase
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8
 status: 0
 nbest: 1
 0
 display: What's the weather like?
 time: offset: 0.170000, duration: 1.140000
============= RESPONSE: turn.end
 path: turn.end
 request_id: f81d66d9bd424ec3b11e46b049c1d926
 content_type: application/json; charset=utf-8

If all goes well, you should see output like the above. The final SR result is in the speech.phrase message and, as expected, it says What's the weather like?.

Let’s take a quick look at the output to explain a few important things:

  • Connecting to wss://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=en-us

This is the URI you connect to in order to perform speech recognition. It takes one or more URL parameters that specify how the recognition should work, among other things. You can learn more about the URIs here. Here we specified that the language is United States English. Also notice the interactive part of the URI path. This indicates the speech recognition mode, or speaking style. We’ll talk a bit more about it later.

  • path: turn.start

The SR service is telling us that it is starting a new turn. You can think of a turn as a unit of interaction. This can be a command to your app or multiple sentences in dictation. More information here.

  • path: speech.startDetected
    offset: 0.170000

The SR service is telling us that it has detected speech in the audio stream at offset 0.17 seconds.

  • path: speech.hypothesis
    text: what
    time: offset: 0.170000, duration: 0.190000

As we stream the audio to the service, it is able to hypothesize what has been said so far. It also tells us that the chunk of audio containing what was found at offset 0.17 seconds with a duration of 0.19 seconds. The service will send zero or more speech.hypothesis messages as it receives audio.

Although hypothesis results might match parts of the final results, they should not be relied upon. They should only be used for UI purposes to give the user feedback about the progress.

  • path: speech.endDetected
    offset: 1.310000

The service is telling us that it has determined that the speech in the audio stream has stopped at offset 1.31 seconds.

  • path: speech.phrase
    display: What's the weather like?
    time: offset: 0.170000, duration: 1.140000

The service is telling us that it managed to recognize a phrase in the input audio stream that says What's the weather like? starting at offset 0.17 seconds for the duration of 1.14 seconds.

  • path: turn.end

The service is telling us that it now considers this speech turn as over. Any subsequent interaction will happen in a new turn.

Microphone streaming example

Let’s go one step further. The above example requires you to have the audio recorded beforehand. In many cases, this is not what you want. The library behaves the same either way, so you can send it audio as you capture it from the microphone. Here is a simple example of how to do that:

arecord -c 1 -r 16000 -f S16_LE | ./exampleProgram YOUR_COGNITIVE_SERVICES_KEY en-us

This uses arecord to capture raw audio from the microphone as 16 kHz mono signed 16-bit samples and pipes it to exampleProgram, which reads from stdin when -f is not specified. By default, arecord adds the RIFF/WAV header to the audio stream, as required by the Microsoft Speech Recognition service.

Speech recognition mode

Now that we have the basics, let’s take a look at the speech recognition mode. The Microsoft Speech Recognition service lets you choose one of three modes: interactive, dictation or conversation. There is a detailed description of the differences between them here.

If you take a look at the documentation, the difference between interactive and dictation should be clear enough; both are used when deliberately addressing an application, the former for short commands and the latter for long, multi-phrase dictation. While I will not go into any further details, I thought I’d try to illustrate the difference between commanded speech and spontaneous speech (i.e. human conversation).

Let’s try out this file where I speak “what’s the weather like” but with some hesitations and interjections:

./exampleProgram -f disfluency.wav -m interactive YOUR_COGNITIVE_SERVICES_KEY en-us

<snip>

============= RESPONSE: speech.phrase
 path: speech.phrase
 display: What's up the weather like?
 time: offset: 1.830000, duration: 2.480000
<snip>

As you can see, the recognition is What's up the weather like?, which is close but not quite what I said.

Now let’s try to run it again but this time we’ll set the mode to conversation:

./exampleProgram -f disfluency.wav -m conversation YOUR_COGNITIVE_SERVICES_KEY en-us

<snip>

============= RESPONSE: speech.phrase
 path: speech.phrase
 display: Um what's uh the weather like.
 time: offset: 0.790000, duration: 3.520000

<snip>

Here we get a better recognition, Um what's uh the weather like., which more closely resembles what I said.

Also notice two things:

  • The start offset is now earlier, since the SR service managed to recognize the disfluencies at the beginning.
  • There is no question mark at the end. The disfluencies inserted into the text might have confused the punctuation restoration process.

The key point here is that you can help the SR service recognize speech better, and give you more accurate results, by telling it about the nature of the speech in the audio.

Using the library

A lot of the functionality can be learned by looking at the source code of exampleProgram.c, but I will provide a quick introduction here.

All required definitions and declarations are in the include file:

#include <ms_speech.h>

The header file also includes a lot of documentation about each function and how it can be used.

General library usage concepts

The following sections give an overview of some of the basic concepts used in the library implementation.

Connections and connection contexts

The library abstracts each connection to the service using a connection object of type ms_speech_connection_t. One or more connections can be managed together using a connection context of type ms_speech_context_t. Each context can be managed independently. Generally you will not need to create more than one context unless you want to maintain multiple connections separately.

Library Callbacks

You interact with the library by registering a set of callbacks that the library invokes whenever an event happens. For example, you must register the following callback to provide the authorization header:

const char * (*provide_authentication_header)(ms_speech_connection_t connection, void *user_data, size_t max_len);

You also register a callback that will be invoked when a hypothesis is received from the service:

void (*speech_hypothesis)(ms_speech_connection_t connection, ms_speech_hypothesis_message_t *message, void *user_data);

Every callback receives the connection object and user_data, which points to any custom data you want to provide.
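
As an illustration, here is a minimal sketch of a hypothesis callback that just prints the partial text. The text field on ms_speech_hypothesis_message_t is an assumption based on the example output above, so check ms_speech.h for the actual structure:

#include <stdio.h>
#include <ms_speech.h>

void speech_hypothesis(ms_speech_connection_t connection,
                       ms_speech_hypothesis_message_t *message,
                       void *user_data)
{
  /* Partial result: show it to the user as progress feedback only,
     never treat it as final. The "text" field is assumed here. */
  printf("hypothesis: %s\n", message->text);
}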

Streaming audio

Once you request the start of streaming, the library uses a pull model to obtain audio from your app:

int ms_speech_start_stream(ms_speech_connection_t connection, ms_speech_audio_stream_callback stream_callback);

Here you tell the library that you want to start streaming audio, and you provide stream_callback, which the library will call to obtain the payload. Here is the signature of the callback:

typedef int (*ms_speech_audio_stream_callback)(ms_speech_connection_t connection, unsigned char *buffer, int buffer_len);

When your callback is invoked, you need to copy your audio bytes into buffer, up to buffer_len, then return the number of bytes you copied. If you have more audio than buffer_len, copy buffer_len bytes and wait for the next invocation. You can indicate that your app is done sending audio by returning 0.

If the callback is invoked and your app does not yet have audio ready, you can return -EAGAIN to indicate that. In this case, the library will wait for you to resume streaming by calling the function:

int ms_speech_resume_stream(ms_speech_connection_t connection);

After you make this call, the library will resume invoking your callback.

Note that the library does not perform any processing on the audio bytes you provide, nor will it prepend a valid RIFF header. That is the responsibility of your app.
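
To make the pull model concrete, here is a minimal sketch of a stream callback that feeds the service from an already-open file or pipe. The audio_file handle (and opening WAV data that already carries a RIFF header into it) is an assumption of this example, not part of the library:

#include <errno.h>
#include <stdio.h>
#include <ms_speech.h>

/* Assumed to be opened elsewhere on WAV data that already has a RIFF header. */
static FILE *audio_file;

int my_stream_callback(ms_speech_connection_t connection,
                       unsigned char *buffer, int buffer_len)
{
  /* Copy up to buffer_len bytes into the library-provided buffer. */
  size_t n = fread(buffer, 1, (size_t)buffer_len, audio_file);
  if (n > 0)
    return (int)n;      /* number of bytes provided */
  if (feof(audio_file))
    return 0;           /* no more audio: end the stream */
  /* No data ready yet; call ms_speech_resume_stream() once more arrives. */
  return -EAGAIN;
}

Streaming would then be started by passing &my_stream_callback to ms_speech_start_stream, typically from the client_ready callback described below.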

The runloop

To make it easy to integrate into multi-threaded apps, the library works in a single-threaded, runloop-based model. Your app must explicitly drive the library by repeatedly performing a single event-handling step, using the function:

void ms_speech_service_step(ms_speech_context_t context, int timeout_ms);

Your app must call this function in a loop. You may choose to run this loop in a separate thread, but your app will then need to handle the multi-threading issues that result from your callbacks being called from that thread rather than the main one.

It is up to you to choose the model that best suits your app.
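
For example, a minimal sketch of running the loop on a dedicated thread with pthreads might look like the following; the done flag and the assumption that ms_speech_context_t is a pointer-sized opaque handle are simplifications of this sketch:

#include <pthread.h>
#include <stdbool.h>
#include <ms_speech.h>

static volatile bool done = false;

static void *runloop_thread(void *arg)
{
  /* Assumes ms_speech_context_t is an opaque pointer type. */
  ms_speech_context_t context = (ms_speech_context_t)arg;

  /* Pump library events until the app decides it is finished.
     All callbacks will be invoked on this thread. */
  while (!done)
    ms_speech_service_step(context, 500);
  return NULL;
}

/* Started with: pthread_create(&tid, NULL, runloop_thread, context); */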

Initializing and using

The following are the steps to take in order to set up the connection and stream the audio.

1. Create and initialize your callback structure

Create a new callback structure and set the callbacks you’re interested in receiving. For example:

ms_speech_client_callbacks_t callbacks;
memset(&callbacks, 0, sizeof(callbacks));
callbacks.log = &connection_log;
callbacks.provide_authentication_header = &auth_token;
callbacks.connection_error = &connection_error;
callbacks.client_ready = &client_ready;
callbacks.speech_startdetected = &speech_startdetected;
callbacks.speech_enddetected = &speech_enddetected;
callbacks.speech_hypothesis = &speech_hypothesis;
callbacks.speech_result = &speech_result;
callbacks.turn_start = &turn_start;
callbacks.turn_end = &turn_end;

2. Determine and build the service URI

Based on the recognition mode that your app requires, in addition to the language and other recognition options you want, construct the service URI. You can find more information here. For example, this service URI uses interactive mode for the French (France) language, removes any profanity, and returns detailed results:

https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=fr-FR&profanity=remove&format=detailed
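
As a small illustration, a URI like the one above could be assembled in C as follows. The parameter values simply mirror the example, and the wss:// form matches the connection log shown earlier:

#include <stdio.h>

/* Builds the service URI; the values here mirror the example above. */
static void build_service_uri(char *full_uri, size_t len)
{
  const char *mode      = "interactive";  /* or "dictation" / "conversation" */
  const char *language  = "fr-FR";
  const char *profanity = "remove";       /* or "raw" / "masked" */
  const char *format    = "detailed";

  snprintf(full_uri, len,
           "wss://speech.platform.bing.com/speech/recognition/%s/cognitiveservices/v1"
           "?language=%s&profanity=%s&format=%s",
           mode, language, profanity, format);
}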

3. Create connection context

Create a new connection context to manage your connection to the service. For example:

ms_speech_context_t context = ms_speech_create_context();

4. Create and initiate connection to the service

Using the context, the constructed URI, and the callbacks struct, create and initiate a new connection to the service:

ms_speech_connection_t connection;
ms_speech_connect(context, full_uri, &callbacks, &connection);

5. Start your runloop

You must implement a loop that pumps and processes events for the context. For example:

while(!done) {
  ms_speech_service_step(context, 500);
}

6. Provide authorization

The library will invoke the provide_authentication_header callback so you can provide your Cognitive Services key. For example:

const char * auth_token(ms_speech_connection_t connection, void *user_data, size_t max_len)
{
  static char buffer[1024];
  snprintf(buffer, sizeof(buffer), "Ocp-Apim-Subscription-Key: %s", subscription_key);
  return buffer;
}

7. Wait for the client ready callback

When authentication succeeds and the client is ready for audio streaming, the library will invoke the client_ready callback. At this point, your app may request audio streaming.
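
For instance, a minimal client_ready callback could just kick off streaming. The exact parameter list here is an assumption (connection plus user_data, in line with the other callbacks), so check ms_speech.h for the real signature; my_stream_callback is the stream callback sketched earlier:

#include <ms_speech.h>

void client_ready(ms_speech_connection_t connection, void *user_data)
{
  /* The service accepted our authentication; start pulling audio
     through the stream callback. */
  ms_speech_start_stream(connection, &my_stream_callback);
}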

8. Manage your app state machine

The library will invoke multiple callbacks to inform you about the state of the connection and the currently active turn. Your app should implement these callbacks to stay in sync. For example, your app must know when it should stream, when it can stream again, and whether there are any problems with the connection.
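
As a rough, hypothetical sketch (the parameter lists below are assumptions, not the library's actual signatures), these callbacks could simply update flags that your runloop checks:

#include <stdbool.h>
#include <stdio.h>
#include <ms_speech.h>

extern volatile bool done;   /* the flag checked by the runloop in step 5 */

void turn_end(ms_speech_connection_t connection,
              ms_speech_turn_end_message_t *message, void *user_data)
{
  /* One full interaction is over; stop the runloop (or start a new turn). */
  done = true;
}

void connection_error(ms_speech_connection_t connection,
                      const char *error_message, void *user_data)
{
  /* Hypothetical signature: treat any connection problem as fatal here. */
  fprintf(stderr, "connection error: %s\n", error_message);
  done = true;
}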

Final thoughts

Now that the Microsoft Cognitive Services Speech Recognition WebSocket protocol has been opened up, a big barrier to adoption on many platforms has been removed. The protocol itself is not too hard to implement, and the documentation is clear and precise.

In my implementation, I chose a low-level approach and left a lot of the logic to the app. One idea I have is to build a higher-level library that would take care of the state-machine and audio-formatting heavy lifting for you.

Before diving in and implementing your own client library, you should first try the service using the RESTful API to make sure it suits your needs. Perhaps do a quick comparison with other available speech APIs. Then, once you’re satisfied, you can build your own platform library.

Finally, if you decide to build your own, all you need is a good WebSocket implementation for your platform; the rest is your code.

11 thoughts on “Getting Started with Microsoft Speech Recognition Under Unix”

  1. Great article. Thanks. I’ve been trying to get the Websocket implementation running in C# on .Net Core without success. I just can’t get the server to send me messages. Did you confront an issue of never getting a ‘turn.start’ message from the server?

    I’ve got a post on StackOverflow with all the c# code. I can’t help but think I’m missing something super-simple:

    https://stackoverflow.com/questions/45492964/bing-speech-to-text-api-communicate-via-websocket-in-c-sharp

    1. While I can’t help you with the C# code, I think you’re trying to receive as soon as the connection is established. What you need to do is immediately send speech.config, followed by your audio message(s). You will only receive turn.start when you start sending the audio.

      As per protocol spec:

      Clients must send a speech.config message immediately after establishing the connection to the Microsoft Speech Service and before sending any audio messages. You only need to send a speech.config message once per connection.

      Review the protocol spec here.

      1. Thanks for the reply. I do send the speech.config immediately, followed by an audio binary message. This message contains a header and the first 4096 bytes of the PCM WAV file.
        I then send the remaining bytes of the audio file in 4096-byte chunks with the same header. I still receive nothing back.

        1. Your code calls Receiving() as soon as the connection is established. You should send speech.config instead.

                      await cws.ConnectAsync(new Uri(url), new CancellationToken());
                      Console.WriteLine("Connected.");
                      /* #endregion*/
          
                      /* #region Receiving */
                      var receiving = Receiving(cws);
          

          1. Yes. That call just starts the WebSocket ‘Receive’ on a separate thread. The receive thread just sits there waiting until it receives data (which, in this case, it never does!).

            1. I feel like I’ve exhausted the options here. The only thing I can think of now is that I’m not sending the audio chunks correctly. Currently, I’m just sending the audio file in consecutive 4096-byte chunks. So the first audio message contains the RIFF header, which is only 36 bytes.

              In your implementation, do you have to do anything to the audio file, i.e. send it in byte segments of a specific size aligned to the sample size or anything?

  2. I tell you what, I’ve spotted another difference. I’m not using per message compression as it’s not enabled in my websocket implementation. Do you know if this is necessary?

    1. The problem seems to be a discrepancy in the specs. Although your implementation is correct and you’re not sending X-RequestId in the speech.config message, the service does not respond properly.

      To get yourself going, send the X-RequestId header in the speech.config message.
