Subtitling Redux: Subtitles y Subtitulos

A few years ago I wrote a blog entry about subtitling and a little text-processing tool I wrote for preparing text so that it could be imported into DVD Studio Pro. I wrote it because the person translating for you often doesn't supply start and end timecodes, let alone break the text into lines short enough to fit on the screen. My program simply made up some rough timecodes, based on the time offset at which you'd like subtitles to start and a constant duration for each subtitle, then inserted them before each line in the STL format.

This week I ran into a problem with that. I was working with the Spanish translation of the English text for a project I'm subtitling, and my Perl script was having trouble properly outputting special characters like tildes and accents. This has to do with something called Unicode, a standard that aims to represent every writing system in use (Klingon was even proposed for inclusion at one point, though the proposal was rejected). I had heard about Unicode but never had to pay much attention to it. In fact, most English speakers never have to worry much about it, another way in which the hegemony of the U.S. and England is still in effect: UTF-8, the most common Unicode encoding, was designed so that "regular" English text is stored exactly as it was in plain ASCII and requires no changes or bother.
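
To see why plain English text never runs into trouble while Spanish does, here is a minimal sketch that prints the bytes UTF-8 uses for two short words. It assumes only Perl's core Encode module; `hex_bytes` is a helper name made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a string and show the resulting bytes as hex.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join(" ", map { sprintf("%02x", ord($_)) } split(//, $bytes));
}

# "ano" is pure ASCII, so UTF-8 stores it as one byte per letter.
print hex_bytes("UTF-8", "ano"), "\n";        # prints: 61 6e 6f

# "a\x{f1}o" is "año"; the ñ (U+00F1) needs two bytes in UTF-8.
print hex_bytes("UTF-8", "a\x{f1}o"), "\n";   # prints: 61 c3 b1 6f
```

A program that assumes one byte per character gets the first word right and mangles the second, which is roughly the kind of trouble I was seeing.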

But it took me several hours to figure out how to read the Unicode text in my translation (which was saved as UTF-8), process it properly, and print it out again in the Unicode encoding that DVD Studio Pro likes, which is called UTF-16. (Thank you to Perl Monks and Creative COW and all the other online resources I consulted during my research into this!)
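
Stripped of everything else, the conversion looks something like the sketch below: decode the UTF-8 bytes into Perl characters, then encode those characters as UTF-16. It assumes the core Encode module, and `utf8_to_utf16` is a name invented for this example. Note that Encode's plain "UTF-16" writes a byte order mark followed by big-endian data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw UTF-8 bytes into raw UTF-16 bytes.
sub utf8_to_utf16 {
    my ($utf8_bytes) = @_;
    my $text = decode("UTF-8", $utf8_bytes);   # bytes -> Perl characters
    return encode("UTF-16", $text);            # characters -> UTF-16 bytes, BOM first
}

# "Señor" as it sits in a UTF-8 file: the ñ is the two bytes C3 B1.
my $in  = "Se\xC3\xB1or";
my $out = utf8_to_utf16($in);

printf "%d bytes in, %d bytes out\n", length($in), length($out);
print join(" ", map { sprintf("%02x", ord($_)) } split(//, $out)), "\n";
# prints: 6 bytes in, 12 bytes out
# prints: fe ff 00 53 00 65 00 f1 00 6f 00 72
```

In a script that reads and writes files you rarely call these functions yourself; putting an `:encoding(...)` layer on the input and output handles does the same decoding and encoding for you.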

So for those of you who are also intrepid multilingual filmmakers or DVD authors, here is the code for my modified Perl script:

#!/usr/bin/perl
# Takes a UTF-8 text file and puts time codes in front of each line,
# in the STL format used for subtitling. Output is UTF-16 on STDOUT.
# (The -CS switch is dropped from the #! line: since Perl 5.10.1 it is
# rejected there as "Too late", and binmode below sets the output encoding.)
use strict;
use warnings;

my $input = shift @ARGV;
defined $input or die "usage: $0 input.txt > output.stl\n";
binmode(STDOUT, ":encoding(UTF-16)");
my $time      = 0;
my $offset    = 18;  # time code to start, in seconds
my $length    = 5;   # seconds each subtitle will last
my $max_chars = 58;  # rough maximum characters per line
print '$FontSize = 18' . "\n";
open(my $fh, '<:encoding(UTF-8)', $input) or die "couldn't open $input: $!";
while (my $line = <$fh>) {
    if ($line =~ /^(\S.+)$/) {
        my $text = $1;
        # Split the text up into 2 lines if it's too long; the top should be the shorter.
        # Working on the reversed word list fills the bottom line first.
        my $revwords = join(" ", reverse split(" ", $text));
        $revwords =~ s/(.{1,$max_chars}\s)/$1|/sg;
        $text = join(" ", reverse split(" ", $revwords));
        $text =~ s/^\|//;    # get rid of extra pipe at beginning
        my $start = $time == 0 ? $offset : $time;
        $time = $start + $length;
        # Pass $text as an argument, not inside the format, so a literal % is safe.
        printf("00:%02d:%02d:01,\t00:%02d:%02d:00,\t%s\n",
            int($start / 60), $start % 60, int($time / 60), $time % 60, $text);
    } else {
        print $line;    # pass other lines (e.g. blank ones) through unchanged
    }
}
close $fh;

Of course this little program isn't the end of the process. After importing the resulting STL file, you still have to adjust line lengths and subtitle positions and durations within DVDSP. But at least this gives a starting point.
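
If you want less line-length fiddling afterwards, one option is a more even two-line split. The sketch below is not what my script does; it is an alternative under simple assumptions: `split_two_lines` is a made-up helper that moves words onto the top line for as long as the top stays no longer than the bottom, and joins the two lines with the same "|" marker.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: break $text into at most two lines at a word
# boundary, keeping the top line no longer than the bottom one.
sub split_two_lines {
    my ($text, $max_chars) = @_;
    return $text if length($text) <= $max_chars;   # short enough already
    my @words = split(" ", $text);
    my @top;
    # Move words to the top line while it stays no longer than what is left.
    while (@words > 1) {
        my $candidate = join(" ", @top, $words[0]);
        my $rest      = join(" ", @words[1 .. $#words]);
        last if length($candidate) > length($rest);
        push @top, shift @words;
    }
    return join(" ", @words) unless @top;          # nothing could be moved up
    return join(" ", @top) . "|" . join(" ", @words);
}

print split_two_lines("uno dos tres cuatro cinco seis", 20), "\n";
# prints: uno dos tres|cuatro cinco seis
```

Text longer than two full lines will still overflow the bottom line, so it needs the same manual cleanup in DVDSP.
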
It was very annoying for a time, but I'm very happy to have figured this out eventually. It gives me a warm, happy feeling not only to have figured out something arcane and geeky but also (getting philosophical for a moment) to have done my part to understand something that furthers, in some small way, the understanding between peoples. Yay.