Differences

This shows you the differences between two versions of the page.

public:gsoc:ocr [2018/01/29 19:27]
cfsmp3
public:gsoc:ocr [2018/02/14 23:32] (current)
abhinav95
Line 1: Line 1:
 +~~META: 
 +title = Google Summer of Code 2018 - Complete our OCR subsystem 
 +~~
 ====== Complete our OCR subsystem ======
  
-Subtitles come in all shapes are colors. Some are text based (such as American closed-captions, as specified in CEA-608 and CEA-708, or the old European teletext). Others are bitmap based such as the European DVB. When subtitles use bitmaps they are a lot more flexible, but also a lot harder to transcribe.
+=== Useful skills/interests: Image processing, Text Localization and Binarization, Tesseract API ===
 + 
 + 
 +Subtitles come in all shapes and colors. Some are text based (such as American closed-captions, as specified in CEA-608 and CEA-708, or the old European teletext). Others are bitmap based such as the European DVB. When subtitles use bitmaps they are a lot more flexible, but also a lot harder to transcribe.
  
 For the Latin languages in DVB what we have works quite well. Note that while DVB is bitmap based, at least those bitmaps are separate from the main image, so you only need to OCR the bitmap to get the text.
Line 15: Line 20:
 Believe it or not, some of these cases are already supported in CCExtractor, at least under "good" conditions. But the really hard ones are still a work in progress.
  
-The heavy lifting (the OCR itself) is done by tesseract. But selecting the area to process, prefilter it so tesseract gets an input it likes and so on, it's done by our own code.
+The heavy lifting (the OCR itself) is done by **tesseract**. But selecting the area to process, prefiltering it so tesseract gets an input it likes, and so on is done by our own code.
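
As a rough illustration of the kind of prefiltering described above, the sketch below binarizes a grayscale image with Otsu's threshold before it would be handed to an OCR engine. This is not CCExtractor's actual code (that lives in C under src/lib_ccx); the function names `otsu_threshold` and `binarize` are hypothetical, and the choice of Otsu's method is an assumption made for the example.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance.

    `pixels` is a flat list of grayscale values (ints in 0..255).
    Hypothetical helper for illustration only.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    weight_bg = 0      # number of pixels with value <= t
    sum_bg = 0         # sum of those pixel values
    best_var = -1.0
    best_t = 0

    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue               # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break                  # everything is background from here on
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var = between_var
            best_t = t
    return best_t


def binarize(image):
    """Map a 2D grayscale image (list of rows) to pure black (0) and white (255)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[255 if p > t else 0 for p in row] for row in image]
```

A real pipeline would also have to locate the subtitle region first and decide whether the text is light on dark or dark on light; the sketch only covers the thresholding step.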
  
 We need someone who likes challenges to make the whole thing work.
  
 We will provide all the samples, plus optional access to a high-speed server that hosts them, so the student can work there if a fast internet connection is not available to them.
 +
 +__**Qualification tasks**__\\
 +[[https://github.com/CCExtractor/ccextractor/issues/929|Terrible OCR results with Channel 5 (UK)]]\\
 +This task is ideal to get started, because you only need to deal with one function in one file: [[https://github.com/CCExtractor/ccextractor/blob/930ca716ca0bdae629ddd170abbcc2ad75472422/src/lib_ccx/ocr.c|quantize_map]]() in src/lib_ccx/ocr.c
 +
 +In addition to the samples that we already have, we would also like you to create a dataset of a few hardsubbed videos (videos with burned-in subtitles) with accurate timed transcripts, so that we can evaluate the performance of our code on a wide variety of real-world samples. For the qualification task, this does not have to be huge. A good representative set will do fine.
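
One possible way to score OCR output against such a reference transcript is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The sketch below is only an assumption about how evaluation could be done, not an existing CCExtractor tool, and the names `levenshtein` and `character_error_rate` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / reference length; 0.0 means a perfect match."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("Hello world", "Hel1o world")` counts one substituted character out of eleven. Timing accuracy would need a separate comparison of the subtitle start and end times.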
  
 __**Related GitHub Issues**__\\
Line 28: Line 39:
 [[https://github.com/CCExtractor/ccextractor/issues/726|Process closed captions and burned-in subtitles in one pass]]\\
 [[https://github.com/CCExtractor/ccextractor/issues/224|DVB subtitles from China]]\\
-[[https://github.com/CCExtractor/ccextractor/issues/243|Corrupt or empty subtitles]]
+[[https://github.com/CCExtractor/ccextractor/issues/243|Corrupt or empty subtitles]]\\
 +[[https://github.com/CCExtractor/ccextractor/issues/929|Terrible OCR results with Channel 5 (UK)]]
  
 +__**Mentor**__\\
 +Abhinav Shukla (@abhinav95 on Slack), the former Summer of Code student who worked on it last year and did an incredible job.
  
  
  • public/gsoc/ocr.1517254058.txt.gz
  • Last modified: 2018/01/29 19:27
  • by cfsmp3