Convert time stamped data to timed-text (XML) subtitle format using Google Spreadsheet Script

Originally uploaded by

My post titles just get better and better. As part of my research into twitter subtitling I’ve focused on integrating with the twitter search and Twapper Keeper archive into the twitter subtitle generator tool, but I’m aware there is a wider world of timed data for subtitlizing. When Tony contacted me on Friday with some timed data he had as part of his F1 data junkie series it seemed like the ideal opportunity to see what I could do.

The data provided by Tony was in a *.csv spreadsheet format the first couple of lines included below:

2010-04-18 08:01:54,PIT,Lewis last car's coming into position now.,PW
2010-04-18 08:02:05,PIT,All cars in position.,PW
2010-04-18 08:02:59,COM,0802: The race has started,CM

My first thought was to just format it in Excel but quickly got frustrated with the way it handles dates/time, so instead uploaded it to Google Spreadsheet. Shown below is how the same data appears:

Google Spreadsheet of csv

Having played around with the timed-text XML format I knew the goal was to convert each row into something like (of course wrapping with the obligatory XML header and footer):

<p style="s1" begin="00:00:00" id="p1" end="00:00:11">PIT: Lewis last car's coming into position now.</p>

Previously I’ve played with Google Apps Script to produce an events booking systems, which uses various components of Google Apps (spreadsheet, calendar, contacts and site), so it made sense to use the power of Scripts for timed text. A couple of hours later I came up with this spreadsheet (once you open it click File –> Make a copy to allow you to edit).

On the first sheet you can import your timed data (it doesn’t have to be *.csv, it only has to be readable by Google Spreadsheet), and then clicking ‘Subtitle Gen –> Timed Data to XML’ on the XMLOut sheet it generates and timed text XML.

Below is the main function which is doing most of the work, the comments indicating what’s going on:

function writeTTXML() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var dataSheet = ss.getSheets()[0];
var data = getRowsData(dataSheet); // read data from first sheet into javascript object
var sheet = ss.getSheetByName("XMLOut") || ss.insertSheet("XMLOut"); // if there isn't a XMLOut sheet create one
sheet.clear(); // make sure it is blank
// Start the XMLOut sheet with tt-XML doc header
sheet.getRange(1, 1).setValue("<?xml version=\"1.0\" encoding=\"utf-8\"?><tt xmlns=\"\" xmlns:ttp=\"\" ttp:timeBase=\"media\" xmlns:tts=\"\" xml:lang=\"en\" xmlns:ttm=\"\"><head><metadata><ttm:title>Twitter Subtitles</ttm:title></metadata><styling><style id=\"s0\" tts:backgroundColor=\"black\" tts:fontStyle=\"normal\" tts:fontSize=\"16\" tts:fontFamily=\"sansSerif\" tts:color=\"white\" /></styling></head><body tts:textAlign=\"center\" style=\"s0\"><div>");
var startTime = data[0].timestamp; // collect start time from first data row, all subsequent relative to this
for (var i = 0; i < (data.length-1); ++i) { // looping through all the data one row at a time except last line (excluded because have no end date/time
var row = data[i];
var nextRow = data[i+1];
row.rowNumber = i + 1;
//calc begin and end for an entry converting to HH:mm:ss format.
var begin = Utilities.formatDate(new Date(row.timestamp-startTime), "GMT", "HH:mm:ss");
var end = Utilities.formatDate(new Date(nextRow.timestamp-startTime), "GMT", "HH:mm:ss");
// prepare string in tt-XML format. Conent is pulled by ref the column header in normalised format (e.g. if col headed 'Twitter status' normalsed = 'twitterStatus'
var str = "<p style=\"s1\" begin=\""+begin+"\" id=\"p"+row.rowNumber+"\" end=\""+end+"\">"": "+row.text+"</p>";;
// add line to XMLOut sheet
var out = sheet.getRange(row.rowNumber+1, 1).setValue(str);
var lastRow = sheet.getLastRow()+1;
//write tt-XML doc footer
var out = sheet.getRange(lastRow, 1).setValue("</div></body></tt>");

If your timed data has different headers you can tweak this by clicking ‘Tools –> Script –> Script editor …’ and changing how the str on line 18 is constructed.

I’m the first one to admit that this spreadsheet isn’t the most user friendly and it only includes the tt-XML format, but hopefully there is enough structure for you to go, play and expand (if you do please use the post comments to share your findings)

Comments are closed.