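Someone had already written a Python script that downloads subtitles from TED, but his website is blocked by the Great Firewall and the script cannot parse some TED talk URLs,
so I downloaded his Python script and modified it slightly.
Thanks to: http://tedtalksubtitledownload.appspot.com/
The source is as follows: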
#! /usr/bin/env python
# Python 2 script: downloads TED talk subtitles and saves them as .srt files.
import simplejson
import pdb
import urllib
import sys
import re


def getFormatedTime(intvalue):
    # Convert milliseconds to an SRT timestamp (HH:MM:SS,mmm).
    mils = intvalue % 1000
    segs = (intvalue // 1000) % 60
    mins = (intvalue // 60000) % 60
    hors = intvalue // 3600000
    return "%02d:%02d:%02d,%03d" % (hors, mins, segs, mils)


def availableSubs(subs):
    # Recursively collect every language code that follows "LanguageCode".
    a = subs.find("LanguageCode")
    if a == -1:
        return []
    subs = subs[a+len("LanguageCode"):]
    return [re.search("%22([^A-Z]+)%22", subs).group(1)] + availableSubs(subs)


def getVideoParameters(urldirection):
    # Fetch the talk page and parse its flashVars block into a dict.
    ht = urllib.urlopen(urldirection).read()
    var = re.search('flashVars = {\n([^}]+)}', ht)
    if var:
        var = var.group(1)
    else:
        return None
    var = [a.replace('\t', '') for a in var.split('\n')]
    # debug pdb.set_trace()
    for a in range(len(var)):
        if var[a]:
            # Cut each line at its last comma.
            var[a] = var[a][:var[a].rfind(',')]
    resultado = []
    for a in var:
        l = a.find(':')
        if l != -1:
            resultado.append((a[:l], a[l+1:]))
    return dict(resultado)


def downloadSub(idtalk, lang, timeIntro):
    # Download one language and write it out as numbered SRT entries.
    print("Downloading subtitles for language %s" % lang)
    c = simplejson.load(urllib.urlopen(
        'http://www.ted.com/talks/subtitles/id/%d/lang/%s' % (idtalk, lang)))
    salida = open('subs_%s_%s.srt' % (idtalk, lang), 'w')
    conta = 1
    c = c['captions']
    for linea in c:
        salida.write("%d\n" % conta)
        conta += 1
        salida.write("%s --> %s\n" % (
            getFormatedTime(timeIntro+linea['startTime']),
            getFormatedTime(timeIntro+linea['startTime']+linea['duration'])))
        salida.write("%s\n\n" % (linea['content'].encode('utf-8')))
    salida.close()


def main(tedurl):
    print("Loading information about TED talk number %s..." % tedurl)
    vidpar = getVideoParameters(tedurl)
    if not vidpar:
        print("There was a problem fetching information about that TED Talk")
        sys.exit(1)
    print("Download all subtitles (write 'all' when prompted) or only one (specify wich)?")
    a = raw_input()
    availables = availableSubs(vidpar['languages'])
    idtalk = vidpar['ti']
    idtalk = int(idtalk[1:3])
    if a == "all":
        for lang in availables:
            downloadSub(idtalk, lang, int(vidpar['introDuration']))
    else:
        # Keep asking until the user enters an available language code.
        while a not in availables:
            print("We're sorry, the only available languages are:")
            for a in availables:
                print("\t"+a)
            a = raw_input()
        downloadSub(idtalk, a, int(vidpar['introDuration']))


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: %s tedurl" % sys.argv[0])
    else:
        main(sys.argv[1])
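To make the timestamp arithmetic in the script concrete, here is a minimal, self-contained sketch of how millisecond offsets become SRT entries. The caption dictionaries are made-up sample data shaped like the `captions` list the script reads (`startTime`, `duration`, `content`), and the names `format_srt_time` and `captions_to_srt` are illustrative only; they are not part of the original script.

```python
def format_srt_time(ms):
    """Convert a millisecond offset into an SRT timestamp (HH:MM:SS,mmm)."""
    millis = ms % 1000
    seconds = (ms // 1000) % 60
    minutes = (ms // 60000) % 60
    hours = ms // 3600000
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def captions_to_srt(captions, intro_ms=0):
    """Build SRT text from caption dicts, shifting every cue by intro_ms."""
    blocks = []
    for index, cap in enumerate(captions, 1):
        start = intro_ms + cap["startTime"]
        end = start + cap["duration"]
        blocks.append("%d\n%s --> %s\n%s\n" % (
            index, format_srt_time(start), format_srt_time(end),
            cap["content"]))
    # A blank line separates consecutive SRT entries.
    return "\n".join(blocks)


# Hypothetical sample data, not taken from a real talk.
sample = [
    {"startTime": 0, "duration": 2500, "content": "Hello."},
    {"startTime": 2500, "duration": 3000, "content": "Welcome."},
]
srt_text = captions_to_srt(sample, intro_ms=15000)
print(srt_text)
```

The `intro_ms` shift plays the role of `introDuration` in the script: every cue is delayed by the length of the talk's intro so the subtitles line up with the video.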
To use it, you first need to install the simplejson package, available at: http://pypi.python.org/pypi/simplejson/
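If installing simplejson is inconvenient, one possible workaround is to fall back to the standard library `json` module, which offers the same `load`/`loads` calls. This is a sketch of that idea rather than something the script does; the JSON payload below is invented sample data shaped like the subtitle response.

```python
try:
    import simplejson as json  # third-party package, if installed
except ImportError:
    import json  # standard library since Python 2.6

# Hypothetical payload shaped like the subtitle response the script reads.
payload = '{"captions": [{"startTime": 0, "duration": 2500, "content": "Hello."}]}'

data = json.loads(payload)
first = data["captions"][0]
end_time = first["startTime"] + first["duration"]
print(first["content"], end_time)
```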
It also works in environments where you access the internet through an HTTP proxy.
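The proxy support most likely comes from urllib itself, which looks up the proxy in the `http_proxy` environment variable. The sketch below shows that lookup using Python 3's `urllib.request.getproxies()`; the script above targets Python 2, where `urllib.urlopen` consults the same variable. The proxy address is a placeholder, not a real server.

```python
import os
import urllib.request

# Hypothetical proxy address; replace with the real proxy host and port.
os.environ["http_proxy"] = "http://proxy.example.com:8080"

# getproxies() returns a scheme -> proxy URL mapping; proxies set through
# *_proxy environment variables take precedence over system settings.
proxies = urllib.request.getproxies()
print(proxies.get("http"))
```

On Windows the variable can be set for the current console with `set http_proxy=http://proxy.example.com:8080` before running the script.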
A concrete usage example:
D:\Document and Setting\test\My Documents\Downloads\TEDTalkSubtitles>TEDTalkSubtitles.py http://www.ted.com/talks/barry_schwartz_on_the_paradox_of_choice.html
Loading information about TED talk number http://www.ted.com/talks/barry_schwartz_on_the_paradox_of_choice.html...
Download all subtitles (write 'all' when prompted) or only one (specify wich)?
chi_hans
Downloading subtitles for language chi_hans

D:\Document and Setting\test\My Documents\Downloads\TEDTalkSubtitles>dir
 Volume in drive D is programe
 Volume Serial Number is 447B-7E2B

 Directory of D:\Document and Setting\test\My Documents\Downloads\TEDTalkSubtitles

2011/04/15  14:16    <DIR>          .
2011/04/15  14:16    <DIR>          ..
2011/04/15  14:34            31,879 subs_93_chi_hans.srt
2011/04/15  14:16            31,928 subs_93_eng.srt
2011/04/15  14:26             2,639 TEDTalkSubtitles.py
               3 File(s)         66,446 bytes
               2 Dir(s)  13,469,048,832 bytes free

D:\Document and Setting\test\My Documents\Downloads\TEDTalkSubtitles>
refs:
http://pythonconquerstheuniverse.wordpress.com/category/the-python-debugger/