Complete our OCR subsystem

Subtitles come in all shapes and colors. Some are text based (such as American closed captions, as specified in CEA-608 and CEA-708, or the old European teletext). Others are bitmap based, such as the European DVB subtitles. Bitmap subtitles are a lot more flexible, but also a lot harder to transcribe.

For Latin-script languages in DVB, what we have works quite well. Note that while DVB is bitmap based, at least those bitmaps are separate from the main image, so you only need to OCR the bitmap to get the text.

However, there are variants and cases that make things a lot harder (and more interesting):

- Burned-in subtitles, which are drawn directly over the actual TV image.
- Non-Latin languages, such as Chinese.
- Moving subtitles, such as the usual on-screen tickers that move from side to side.
- Subtitles with different colors, for example to distinguish between different speakers.

Believe it or not, some of these cases are already supported in CCExtractor, at least under “good” conditions. But the really hard ones are still a work in progress.

The heavy lifting (the OCR itself) is done by Tesseract. But selecting the area to process, prefiltering it so that Tesseract gets an input it likes, and so on, is done by our own code.
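To make the "select the area, then prefilter" idea concrete, here is a minimal sketch in Python. It is not CCExtractor's actual code: the function names, the default bottom-quarter crop, and the brightness threshold of 180 are all assumptions chosen for illustration. It assumes bright subtitle text near the bottom of the frame and produces dark text on a light background, which is the kind of input Tesseract generally handles best.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Frame = List[List[Pixel]]      # rows of pixels, top row first


def crop_band(frame: Frame, top_frac: float = 0.75) -> Frame:
    """Keep only the rows below top_frac of the frame height.

    Assumption: subtitles sit in the bottom part of the picture.
    """
    start = int(len(frame) * top_frac)
    return frame[start:]


def luminance(pixel: Pixel) -> float:
    """Perceived brightness of a pixel (ITU-R BT.601 weights)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(frame: Frame, threshold: float = 180.0) -> List[List[int]]:
    """Map bright pixels (assumed text) to 0 and everything else to 255.

    The result is black text on a white background.
    """
    return [
        [0 if luminance(p) >= threshold else 255 for p in row]
        for row in frame
    ]


def prepare_for_ocr(frame: Frame,
                    top_frac: float = 0.75,
                    threshold: float = 180.0) -> List[List[int]]:
    """Hypothetical pipeline: crop the subtitle band, then binarize it."""
    return binarize(crop_band(frame, top_frac), threshold)
```

The grayscale rows returned by `prepare_for_ocr` would then be handed to the OCR engine. A fixed threshold like this only works under "good" conditions; burned-in subtitles over a bright or busy background, or colored subtitles, need something smarter, which is exactly where the hard part of this project lies.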

We need someone who likes challenges to make the whole thing work.

We will provide all the samples. If a fast internet connection is not available to the student, we can also provide (optionally) access to a high-speed server that holds them, so the student can work on them there.

Related GitHub Issues
Extract cyrillic tickertape text in Russian from NTV
Extract subtitles in a Chinese newscast
GUI, Burned-in Subtitle Extraction not working

  • Last modified: 2018/01/29 19:05
  • by cfsmp3