Differences

This shows you the differences between two versions of the page.

public:gsoc:ocr [2018/02/14 23:26] cfsmp3
public:gsoc:ocr [2018/02/14 23:32] (current) abhinav95
Line 3: Line 3:
 ~~ ~~
 ====== Complete our OCR subsystem ======
 +
 +=== Useful skills/interests: Image processing, Text Localization and Binarization, Tesseract API ===
 +
  
 Subtitles come in all shapes and colors. Some are text-based (such as American closed captions, as specified in CEA-608 and CEA-708, or the old European teletext). Others are bitmap-based, such as the European DVB. Bitmap subtitles are a lot more flexible, but also a lot harder to transcribe.
Line 23: Line 26:
 We will provide all the samples. Students without a fast internet connection can optionally get access to a high-speed server that hosts the samples and work on them there.
  
-__**Qualification task**__\\
+__**Qualification tasks**__\\
 [[https://github.com/CCExtractor/ccextractor/issues/929|Terrible OCR results with Channel 5 (UK)]]\\
 This task is ideal to get started, because you only need to deal with one function in one file: [[https://github.com/CCExtractor/ccextractor/blob/930ca716ca0bdae629ddd170abbcc2ad75472422/src/lib_ccx/ocr.c|quantize_map]]() in src/lib_ccx/ocr.c
 +
 +In addition to the samples that we already have, we would also like a dataset of a few hardsubbed videos (videos with burned-in subtitles) with accurately timed transcripts, so that we can evaluate the performance of our code on a wide variety of real-world samples. For the qualification task, this does not have to be huge; a good representative set will do fine.
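As a rough illustration of how such a dataset could be used, here is a minimal Python sketch that scores OCR output against a timed reference transcript. It is not part of CCExtractor. The `Cue` type, the function names, and the 500 ms matching tolerance are all assumptions made for this example, and character error rate is just one possible metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    """One timed subtitle line (hypothetical structure for this sketch)."""
    start_ms: int
    end_ms: int
    text: str


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def evaluate(reference: List[Cue], ocr: List[Cue], tolerance_ms: int = 500) -> float:
    """Mean character error rate over all reference cues.

    Each reference cue is paired with the first unused OCR cue whose start
    time lies within tolerance_ms (an assumed threshold). Reference cues
    with no match count as a full error (1.0).
    """
    used = set()
    rates = []
    for ref in reference:
        match = None
        for idx, hyp in enumerate(ocr):
            if idx not in used and abs(hyp.start_ms - ref.start_ms) <= tolerance_ms:
                match = idx
                break
        if match is None:
            rates.append(1.0)
        else:
            used.add(match)
            rates.append(character_error_rate(ref.text, ocr[match].text))
    return sum(rates) / len(rates) if rates else 0.0
```

A real evaluation would probably also check end times and normalise whitespace and case before comparing, but even this simple score makes it possible to compare two versions of the OCR code on the same set of samples.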
  
 __**Related GitHub Issues**__\\
  • public/gsoc/ocr.1518650805.txt.gz
  • Last modified: 2018/02/14 23:26
  • by cfsmp3