Podcast Transcription with Amazon Transcribe

January 17, 2018

Tags: podcasting, transcription, AWS

Transcribing podcasts is one of the most painstaking and time-consuming parts of being a podcaster. I was delighted when I received the email saying I had been approved for the Amazon Transcribe Preview. I had been seeking a solution for transcribing episodes of my Hack for the Sea Podcast, and I’ve been a fan of AWS’ Machine Learning tools since they first launched a few years back.

This post serves half as my notepad for experimenting with the new API, and half as a helpful reference / tutorial for people looking to use these tools to do similar work.

Getting Set Up with the Bleeding-Edge SDK

First, make sure you have your credentials in order. That means you have a folder called ~/.aws and inside of it is a file called credentials that looks something like the following.

Replace the bracketed tokens with your own access key and secret key.

$ cat ~/.aws/credentials
[default]
aws_access_key_id = [ACCESS KEY]
aws_secret_access_key = [SECRET KEY]
region = us-east-1

The email came with a link to updated SDKs that had API support for the Transcribe service.

$ mkdir aws-sdk-go
$ unzip -d aws-sdk-go aws-sdk-go.zip
$ ls -alh aws-sdk-go

total 87M
drwxr-xr-x 3 ubuntu ubuntu 4.0K Jan 17 20:22 ./
drwxr-xr-x 9 ubuntu ubuntu 4.0K Jan 17 20:20 ../
-rw-r--r-- 1 ubuntu ubuntu  87K Dec 22 08:43 AWSTranscribeJavaClient-1.11.x.jar
-rw-r--r-- 1 ubuntu ubuntu  404 Dec 22 08:41 README.txt
drwxrwxr-x 2 ubuntu ubuntu 4.0K Dec 22 08:43 __MACOSX/
-rw-r--r-- 1 ubuntu ubuntu  58M Dec 22 08:29 aws-sdk-go.zip
-rw-r--r-- 1 ubuntu ubuntu 4.0M Dec 22 08:26 aws-sdk-js.tgz
-rw-r--r-- 1 ubuntu ubuntu  16M Dec 22 08:25 aws-sdk-php.zip
-rw-r--r-- 1 ubuntu ubuntu 9.9M Dec 22 08:27 aws-sdk-ruby.zip
-rw-r--r-- 1 ubuntu ubuntu 6.2K Dec 22 08:32 service-2.json

Even though the file is called aws-sdk-go.zip, it actually contains SDKs for PHP, Ruby, JavaScript, and Java as well as Go. I would have preferred Python, but since we’re all horny for JavaScript these days anyway, let’s just use that.

$ tar xf aws-sdk-go/aws-sdk-js.tgz
$ cd aws-sdk
$ npm install

Go ahead and clean up aws-sdk-go.zip and the aws-sdk-go folders if you want, and let’s continue.

$ cd ..
$ node
> const AWS = require('./aws-sdk')
> AWS.config.update({region:'us-east-1'})
> const ATS = new AWS.TranscribeService()
> Object.keys(ATS)
[ 'config', 'isGlobalEndpoint', 'endpoint', '_clientId' ]

Nice.

Using the AWS Transcribe APIs

Now, what can this thing do? From what I was able to piece together from the contents of the SDK file, it looks like we can:

  1. Start Transcription Jobs
  2. List Transcription Jobs
  3. Get a Single Transcription Job

So, let’s do that stuff! We can start a new transcription job, use the list function to monitor it and then finally get the finished transcription.

Starting a New Transcription Job

Taking another quick moment to plug my podcast, let’s head over to the SoundCloud page for the Hack for the Sea Podcast and download the first episode. I went ahead and put the mp3 file at a public S3 location, here.

Using that S3 URL, we can go ahead and run the following command to start a transcription job:

ATS.startTranscriptionJob({
  "TranscriptionJobName": "H4TSEpisode001",
  "LanguageCode": "en-US",
  "MediaFormat": "mp3",
  "Media": {
    "MediaFileUri": "https://s3.amazonaws.com/mrh-podcasts/hackforthesea/public/Hack+for+the+Sea+Episode+001_mixdown.mp3"
  }
}, (err,result) => {
  if(err) throw err;
  console.log(result);
});

/*
{
  TranscriptionJob: {
    TranscriptionJobName: 'H4TSEpisode001',
    TranscriptionJobStatus: 'IN_PROGRESS',
    LanguageCode: 'en-US',
    MediaFormat: 'mp3',
    Media:
      { MediaFileUri: 'https://s3.amazonaws.com/mrh-podcasts/hackforthesea/public/Hack+for+the+Sea+Episode+001_mixdown.mp3' },
     CreationTime: 2018-01-17T22:04:38.073Z } }
*/
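If you plan to transcribe more than one episode, it may be handy to build the request parameters from the S3 URL rather than typing them out each time. Here is a minimal sketch under some assumptions: buildJobParams is a hypothetical helper (not part of the SDK), and the list of accepted formats is my best understanding of what Transcribe supports, so check it against the docs.

```javascript
const path = require('path');

// Assumption: the media formats Transcribe accepts. Verify against the docs.
const KNOWN_FORMATS = ['mp3', 'mp4', 'wav', 'flac'];

// Hypothetical helper: builds the params object for startTranscriptionJob,
// inferring MediaFormat from the file extension in the URL.
function buildJobParams(jobName, mediaFileUri, languageCode = 'en-US') {
  const pathname = new URL(mediaFileUri).pathname;
  const format = path.extname(pathname).slice(1).toLowerCase();
  if (!KNOWN_FORMATS.includes(format)) {
    throw new Error(`Unrecognized media format: "${format}"`);
  }
  return {
    TranscriptionJobName: jobName,
    LanguageCode: languageCode,
    MediaFormat: format,
    Media: { MediaFileUri: mediaFileUri },
  };
}

// Example usage (needs the real client, so it is left commented out):
// ATS.startTranscriptionJob(buildJobParams('H4TSEpisode001', url), callback);
```

The returned object has the same shape as the literal passed to startTranscriptionJob above, so it can be dropped in as the first argument.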

Listing Transcription Jobs

Now we want to check on our job to see how it’s going. Here’s how to list the jobs you have in progress. You could theoretically write a script that uses setInterval or some such thing to automatically check the status, but I’ll just show how to check it once.

ATS.listTranscriptionJobs({
  "Status": "IN_PROGRESS"
}, (err,result) => {
  if(err) throw err;
  console.log(result)
})

/*
{
  Status: 'IN_PROGRESS',
  TranscriptionJobSummaries: [
    {
      TranscriptionJobName: 'H4TSEpisode001',
      CreationTime: 2018-01-17T22:04:38.073Z,
      LanguageCode: 'en-US',
      TranscriptionJobStatus: 'IN_PROGRESS'
    }
  ]
}
*/

Once that call returns an empty TranscriptionJobSummaries array, you’ll know the job is no longer in progress. Alternatively, you can change IN_PROGRESS to COMPLETED in the above command and it will show your completed job.

Either way, once you’re done, you can move to the last step and actually see your transcription.
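For anyone who does want the setInterval approach mentioned earlier, here is a rough sketch. It polls getTranscriptionJob rather than the list call, since that reports the status of one named job directly. waitForJob is a hypothetical helper, and the FAILED status is an assumption on my part (only IN_PROGRESS and COMPLETED show up in this post).

```javascript
// Hypothetical helper: poll a transcription job until it finishes.
// `client` is any object with a callback-style getTranscriptionJob method,
// such as the ATS object created earlier.
function waitForJob(client, jobName, intervalMs, done) {
  let finished = false;
  const timer = setInterval(() => {
    client.getTranscriptionJob({ TranscriptionJobName: jobName }, (err, result) => {
      if (finished) return; // ignore late responses once we have reported
      if (err) {
        finished = true;
        clearInterval(timer);
        return done(err);
      }
      const job = result.TranscriptionJob;
      // Assumption: COMPLETED and FAILED are the terminal statuses.
      if (['COMPLETED', 'FAILED'].includes(job.TranscriptionJobStatus)) {
        finished = true;
        clearInterval(timer);
        done(null, job);
      }
    });
  }, intervalMs);
}

// Example usage (needs the real client, so it is left commented out):
// waitForJob(ATS, 'H4TSEpisode001', 30000, (err, job) => {
//   if (err) throw err;
//   console.log(job.TranscriptionJobStatus, job.Transcript);
// });
```

Given that the episode above took roughly 40 minutes to process, a polling interval of 30 seconds or more is probably plenty.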

Getting the Result of a Transcription Job

Now take the TranscriptionJobName and pass it to getTranscriptionJob. The response includes the location of your finished transcription in the TranscriptionJob.Transcript.TranscriptFileUri key.

ATS.getTranscriptionJob({
  "TranscriptionJobName": "H4TSEpisode001"
}, (err, result) => {
  if(err) throw err;
  console.log(result)
});

/*
{
  TranscriptionJob: {
    TranscriptionJobName: 'H4TSEpisode001',
    TranscriptionJobStatus: 'COMPLETED',
    LanguageCode: 'en-US',
    MediaSampleRateHertz: 44100,
    MediaFormat: 'mp3',
    Media: {
      MediaFileUri: 'https://s3.amazonaws.com/mrh-podcasts/hackforthesea/public/Hack+for+the+Sea+Episode+001_mixdown.mp3'
    },
    Transcript: {
      TranscriptFileUri: 'https://s3.amazonaws.com/aws-transcribe-us-east-1-prod/686528633557/H4TSEpisode001/asrOutput.json?X-Amz-Security-Token=FQoDYXdzEK%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaDOb4UdAIeQ162U3ghSK3A8xERzGjbGK%2Bc2GjA1IfroDWt3zG%2FJnh6MlekMyhBIGwqSwy02FIvjD4B0KRfIvW7xVPR0hJzSPAU1xaRwneJHD8SsSy%2BEswlr5bv1CgVT7ejxXosy0TXJhytfDNDNjBjz47EyOJj5rjuZtKb%2FeVHX9ClC1BXqO%2Bf9Cw%2BuUv74ZhEWObcCudfRutYCx4H9b2yMszXS%2F2FZUQxU4MgadHJIxz0zv9RtZkOXN7tfEsbWT0P1FS04QRkTjPdMJ1n%2FbVBCBIMn61tt1qceXi2s7MjAwJLVeTbD%2Fx%2B2rTxEM%2B9lq6odhr9rZpZFmpMac48FT%2FVEaFBm7mk3EslUeE%2BLZp1WIV3KDDcPcQt8rGHLHJA6%2BxtmMczGXLq5ftVHOgksh76pKO8MFodixIl2znrGno8vvGlw5xcOqIEw7DjJm3FRLR1RZMQ3DCBT5VgWYQCcQJ85eArVfldimHySNb9d6T8ALKQ89AQZ27vtA4mbyKgKOE1s0ZvVj6gS8Uo75g2PsXFRORUJ8Vu8XiK6ruU3uSVSdK%2FkIDomrWYHgfQ6LcZTpX8Gqr6TGNhwrx6gl2JhZuBbQOw58NN5konob%2F0gU%3D&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20180117T225402Z&X-Amz-SignedHeaders=host&X-Amz-Expires=899&X-Amz-Credential=ASIAISX7NSCWWIWIBAYA%2F20180117%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=2016800618c94e9be0e078227a75cdbfa243605a05f455e8c355123aea6f69fa'
    },
    CreationTime: 2018-01-17T22:04:38.073Z,
    CompletionTime: 2018-01-17T22:42:42.232Z
  }
}
*/

Examining the Contents of the Transcription File

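Visit the URL from the result above in a browser to download the file. The URL is pre-signed and, judging by the X-Amz-Expires=899 parameter, only valid for about 15 minutes, so grab it promptly. Then, let’s have a look at it: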

{
  "jobName":"H4TSEpisode001",
  "accountId":"686528633557",
  "results":{
    "transcripts":[
      {"transcript":"Wait no so ten fifteen twenty seventeen we will be hosting the second annual had..."}
    ],
    "items":[
      {"start_time":"8.130","end_time":"12.730","alternatives":[{"confidence":"0.8163","content":"Wait"}],"type":"pronunciation"},
      {"start_time":"21.010","end_time":"21.070","alternatives":[{"confidence":"0.1488","content":"no"}],"type":"pronunciation"},
      {"start_time":"41.170","end_time":"41.470","alternatives":[{"confidence":"0.8094","content":"so"}],"type":"pronunciation"}
    ]
  },
  "status":"COMPLETED"
}

Cool. Some cursory metadata, along with a text transcript. The ‘items’ section is where it gets really interesting. You get a nice array with an entry for nearly every word, each carrying a start time, an end time, and a confidence score.
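One thing the items array makes easy is finding the words Transcribe was unsure about, which is where a human proofreader should look first. A minimal sketch, assuming the file has the shape shown above: lowConfidenceWords is a hypothetical helper, the 0.5 threshold is arbitrary, and the sample data is just the three items from the excerpt.

```javascript
// Hypothetical helper: list words whose top alternative scored below `threshold`.
// Numeric fields arrive as strings in the file, so they are parsed here.
function lowConfidenceWords(asrOutput, threshold) {
  return asrOutput.results.items
    .filter((item) => item.type === 'pronunciation')
    .map((item) => ({
      word: item.alternatives[0].content,
      start: parseFloat(item.start_time),
      confidence: parseFloat(item.alternatives[0].confidence),
    }))
    .filter((entry) => entry.confidence < threshold);
}

// Sample data copied from the excerpt above.
const sample = {
  results: {
    items: [
      { start_time: '8.130', end_time: '12.730', alternatives: [{ confidence: '0.8163', content: 'Wait' }], type: 'pronunciation' },
      { start_time: '21.010', end_time: '21.070', alternatives: [{ confidence: '0.1488', content: 'no' }], type: 'pronunciation' },
      { start_time: '41.170', end_time: '41.470', alternatives: [{ confidence: '0.8094', content: 'so' }], type: 'pronunciation' },
    ],
  },
};

console.log(lowConfidenceWords(sample, 0.5));
// [ { word: 'no', start: 21.01, confidence: 0.1488 } ]
```

With the real file you would read it from disk and pass the result of JSON.parse in place of sample.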

For the curious, here’s the full text transcription as Amazon Transcribe understands it.

My Initial Impressions

The first thing that I would want out of this is speaker detection, i.e. knowing how many different speakers there are and being able to tell their voices apart. Podcasts typically have more than one host, or a host and a guest for an interview, so that would be helpful.

Also, it would be great to be able to send back corrections on words somehow, to help with the training. I’m sure Amazon has a pretty good thing going, but maybe on an account level? Or for proper nouns? I still think it would be good for people to provide that feedback.


Mark Robert Henderson

This is the website of Mark Robert Henderson. He lives in Cape Ann, works in Cambridge, and plays with distributed apps and tech philosophy online.

Mark's social media presence is slowly and deliberately withering away, so the best way to reach him is via e-mail.