
Real-Time Speech Translation API

Onboarding

Currently the API is available through AWS. You can contact us for other options. The AWS procurement process is as follows:

  1. Click View Purchase Options and follow the prompts to subscribe to the product.
  2. You will be taken to a signup page where you will be asked to create an account with your email address and password.
  3. You will then be asked to select the region and availability zone(s) where you want your API deployed; refer to this article for details. Your API runs in its own dedicated, isolated single-tenant environment.
  4. Once we have deployed the API, you will be notified of the next steps via email and provided a URL for accessing the API.
  5. You will also be given access to an admin portal from which you can start and stop the EC2 instance(s) hosting the API.

Once you have completed the onboarding process and obtained the URL to your API, you can access it by following the steps below.

Get socket.io

The only dependency you need is the Socket.IO client library (you do not need the server). Install the socket.io-client package for your programming language. If you are using JavaScript (in the browser or Node.js), run [1]:

npm i socket.io-client

If using Python run [2]:

pip install python-socketio

Tips:

  • For JavaScript, install version 4.x of the package (4.8.1 as of this writing); for Python, install version 5.x (5.12.1 as of this writing). Although the version numbers differ, both clients (JS and Python) speak the same versions of the Socket.IO and Engine.IO protocols. If you are unfamiliar with Socket.IO, it is worth spending 15 to 30 minutes getting basic familiarity with it.
  • Do not try to make a direct WebSocket connection to the API; it will not work. You need the Socket.IO client.

If you are using another programming language, download a client by following the instructions on this page.

Start Coding

Below is a complete client showing how to access the API from JavaScript:

//@ts-nocheck
import { io } from "socket.io-client";

const BASE_URL = ...API URL...;

export class SpeechTranslationClient {

  constructor({ speechCallback, textCallback, initCallback }) {
    this.socket = io(BASE_URL);
    const socket = this.socket;
    socket.on("speech", (speechSamples, args) => speechCallback && speechCallback(new Float32Array(speechSamples), args));
    socket.on("text", (arg) => textCallback && textCallback(arg));
    socket.on("connect", () => {
      initCallback && initCallback();
    });
    socket.on("connect_error", (error) => {
      if (socket.active) {
        // temporary failure; the socket will automatically try to reconnect
      } else {
        // the connection was denied by the server;
        // in that case, `socket.connect()` must be called manually to reconnect
      }
    });
    socket.on("disconnect", (reason, details) => {
      console.log(`disconnected ${reason} ${details}`);
    });
  }

  sendRequest({ audioData, targetLanguage, sampleRate, state }) {
    this.socket.emit("translate", { audioData: audioData, targetLanguage: targetLanguage, sampleRate: sampleRate, state: state });
  }

  shutdown() {
    this.socket.disconnect();
  }
}

In the above, the connect, connect_error, and disconnect events are provided by the socket.io library; see the Socket.IO docs for their details. The Translation API itself provides three events:

  • socket.emit('translate'): used to make an asynchronous request against the API.
  • socket.on('text'): an event (think of it as a callback) that provides the translated text.
  • socket.on('speech'): an event (think of it as a callback) that provides the translated audio data.

There is also a REST endpoint for getting the list of supported languages. We go deeper into each of these in the following sections.

API Reference

GET /languages

Get the list of supported languages. Returns a JSON-formatted string of the form {languages: [...]}, where languages is an array of three-letter ISO 639-3 language codes.
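A minimal sketch of calling this endpoint from JavaScript. The base URL stands in for the one you received during onboarding, and the fetchImpl parameter is only a seam for exercising the function without a live deployment; neither name is part of the API.

```javascript
// Hypothetical sketch: fetch the supported-language list.
// fetchImpl defaults to the global fetch (Node 18+ and all modern browsers).
async function getSupportedLanguages(baseUrl, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/languages`);
  const body = await res.json(); // { languages: ["eng", "fra", ...] }
  return body.languages;
}
```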

socket.emit('translate')

Used to submit a translation request to the server, e.g.: socket.emit("translate", { audioData: audioData, targetLanguage: targetLanguage, sampleRate: sampleRate, state: state }).

Arguments:

  • audioData: a Float32Array of audio samples between -1 and 1.
  • targetLanguage: the three-letter code of the language into which the audio should be translated.
  • sampleRate: the sampling rate of the audio signal. For best results provide audio sampled at 16 kHz; this avoids resampling the signal on the backend.
  • state: any optional state that you want passed back in the callbacks, e.g. a request ID that you can use to correlate a response with its request.
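Many capture APIs deliver 16-bit integer PCM rather than the [-1, 1] floats that translate expects. A sketch of the conversion follows; the sendRequest usage in the comment refers to the example client above, and the requestId field is purely illustrative.

```javascript
// Hypothetical helper: convert 16-bit PCM samples (e.g. from a recorder)
// into the Float32Array of [-1, 1] samples that `translate` expects.
function int16ToFloat32(int16Samples) {
  const out = new Float32Array(int16Samples.length);
  for (let i = 0; i < int16Samples.length; i++) {
    // Divide by 32768 and clamp, so -32768 maps to exactly -1.
    out[i] = Math.max(-1, Math.min(1, int16Samples[i] / 32768));
  }
  return out;
}

// Usage sketch (names are illustrative, not part of the API):
// client.sendRequest({
//   audioData: int16ToFloat32(pcmChunk),
//   targetLanguage: "fra",
//   sampleRate: 16000,
//   state: { requestId: 42 },
// });
```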

socket.on('speech')

Callback that provides the translated speech, e.g.: socket.on("speech", (speechSamples, args) => speechCallback && speechCallback(new Float32Array(speechSamples), args)).

Arguments:

  • speechSamples: a byte array of audio samples. This should be converted into a Float32Array on the client; how to do this varies from language to language.
  • args: a dictionary containing the following fields:
    • sampleRate: the sampling rate of the returned speech signal (16000).
    • targetLanguage: the language of the speech signal; same as the target language of the request.
    • state: the optional state parameter supplied when the request was made.
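In JavaScript, Socket.IO delivers binary payloads as an ArrayBuffer in the browser and as a Buffer under Node.js. A sketch of the conversion the example client performs inline, covering both cases:

```javascript
// Hypothetical sketch: turn the raw byte payload from the "speech" event
// into Float32 samples ready for playback or storage.
function toFloat32Samples(raw) {
  if (raw instanceof Float32Array) return raw;
  // Node.js Buffer: view its underlying bytes without copying.
  // This check must come before the ArrayBuffer fallback, because
  // new Float32Array(buffer) would treat a Buffer as an array of bytes.
  if (typeof Buffer !== "undefined" && Buffer.isBuffer(raw)) {
    return new Float32Array(raw.buffer, raw.byteOffset, raw.byteLength / 4);
  }
  return new Float32Array(raw); // ArrayBuffer
}

// Duration of a clip in seconds, given args.sampleRate from the event.
function durationSeconds(samples, sampleRate) {
  return samples.length / sampleRate;
}
```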

socket.on('text')

Callback that provides translated text in the target language. E.g.: socket.on("text", arg => textCallback && textCallback(arg)).

Arguments:

  • args: a dictionary containing the following fields:
    • text: the translated text in the target language.
    • targetLanguage: the language of the text; same as the target language of the request.
    • state: the optional state parameter supplied when the request was made.
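One way to put the state field to work is the request-ID correlation mentioned under socket.emit('translate'). The following is a sketch of that pattern; RequestTracker and requestId are illustrative names, not part of the API.

```javascript
// Hypothetical pattern: attach a request ID via `state` so responses
// arriving on the "text" or "speech" events can be matched to requests.
class RequestTracker {
  constructor() {
    this.pending = new Map();
    this.nextId = 1;
  }

  // Register a callback and return the state object to send with the request.
  track(onResult) {
    const requestId = this.nextId++;
    this.pending.set(requestId, onResult);
    return { requestId };
  }

  // Call from the event handler with the args the server returned.
  resolve(args) {
    const cb = args.state && this.pending.get(args.state.requestId);
    if (cb) {
      this.pending.delete(args.state.requestId);
      cb(args);
    }
  }
}
```

A client would pass tracker.track(cb) as the state argument of sendRequest, and call tracker.resolve(args) from its textCallback.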

Tips

  • Use a Voice Activity Detection (VAD) library on the client to segment the audio signal before sending it to the API. VAD libraries based on the Silero VAD algorithm are available for the browser as well as mobile and server environments.
  • Capture the audio signal at 16 kHz to avoid resampling it on the backend.
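If your capture device cannot record at 16 kHz directly, you can resample on the client before sending. Below is a naive linear-interpolation downsampler as a sketch only; a production client would more likely use the platform's resampler, e.g. a browser AudioContext constructed with { sampleRate: 16000 }.

```javascript
// Hypothetical sketch: linearly interpolate samples down to 16 kHz.
function resampleTo16k(samples, sourceRate, targetRate = 16000) {
  if (sourceRate === targetRate) return samples;
  const ratio = sourceRate / targetRate;
  const outLength = Math.floor(samples.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;                                // position in the source
    const left = Math.floor(pos);
    const right = Math.min(left + 1, samples.length - 1);
    const frac = pos - left;
    out[i] = samples[left] * (1 - frac) + samples[right] * frac;
  }
  return out;
}
```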

And that is all! Congratulations, you now know everything there is to know. We can't wait to see what you build with the Speech Translation API. If you have any questions, reach out to us via the admin portal or drop a comment in the Google Group.

Supported Languages

  • English
  • Arabic
  • Bengali
  • Catalan
  • Czech
  • Chinese
  • Welsh
  • Danish
  • German
  • Estonian
  • Finnish
  • French
  • Hindi
  • Indonesian
  • Italian
  • Japanese
  • Korean
  • Maltese
  • Dutch
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Slovak
  • Spanish
  • Swedish
  • Swahili
  • Telugu
  • Tagalog
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese

You can translate between any pair of the languages above.