
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as machine translation, text-to-speech, key data extraction and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Assignment

In this exercise we work with text files whose lines represent a number of handwritten characters. The manuscript is written in a font that meets the following conditions:

An empty column is a column that consists only of spaces in all lines that contain handwritten text.
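The definition of an empty column can be sketched in a few lines of Python. This is only an illustration: the helper name empty_columns is hypothetical, and we assume that a line counts as handwritten text as soon as it contains at least one character that is not whitespace, and that shorter lines may be padded with spaces on the right.

```python
def empty_columns(lines):
    """Return the indices of the columns that consist only of spaces."""
    # Assumption: a line contains handwritten text if it is not all whitespace.
    text_lines = [line.rstrip('\n') for line in lines if line.strip()]
    width = max((len(line) for line in text_lines), default=0)
    # Pad shorter lines so that every line can be indexed over the full width.
    padded = [line.ljust(width) for line in text_lines]
    return [col for col in range(width)
            if all(line[col] == ' ' for line in padded)]


rows = [
    '         ',   # blank line: ignored, it holds no handwritten text
    ' ##   ###',
    '#  # #   ',
    ' ##   ###',
]
print(empty_columns(rows))   # [4]
```

Only column 4 is reported here: every other column holds a # in at least one of the three lines with text.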

The following example displays the content of a text file that contains a handwritten version of the letters ocrrococo. In the graphical representation of the segmentation of this text file, the segments are indicated with a dark gray background, and the lines that are not included in the string representation of the segments are crossed out with a red line.

                                             
                                             
                                             
 ##   ### # ## # ##  ##   ###  ##   ###  ##  
#  # #    ## # ## # #  # #    #  # #    #  # 
#  # #    #    #    #  # #    #  # #    #  # 
#  # #    #    #    #  # #    #  # #    #  # 
#  # #    #    #    #  # #    #  # #    #  # 
 ##   ### #    #     ##   ###  ##   ###  ##  
                                             
                                             
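A grid like the one above could be cut into characters along its empty columns. The sketch below works on a list of lines rather than on a file, and the helper name split_segments is hypothetical. It assumes that consecutive characters are always separated by at least one empty column and that lines without handwritten text are left out of each segment.

```python
def split_segments(lines):
    """Split a character grid into one multi-line string per character."""
    # Assumption: lines that are all whitespace are not part of any segment.
    text_lines = [line.rstrip('\n') for line in lines if line.strip()]
    width = max((len(line) for line in text_lines), default=0)
    padded = [line.ljust(width) for line in text_lines]

    segments = []
    start = None  # first column of the segment being collected, if any
    for col in range(width + 1):
        # The position just past the last column is treated as empty,
        # so that a segment touching the right edge is closed as well.
        is_empty = col == width or all(line[col] == ' ' for line in padded)
        if not is_empty and start is None:
            start = col
        elif is_empty and start is not None:
            segments.append('\n'.join(line[start:col] for line in padded))
            start = None
    return segments


# The first three characters (o, c, r) of the grid shown above.
grid = [
    '               ',
    ' ##   ### # ## ',
    '#  # #    ## # ',
    '#  # #    #    ',
    '#  # #    #    ',
    '#  # #    #    ',
    ' ##   ### #    ',
    '               ',
]
segments = split_segments(grid)
print(len(segments))   # 3
print(segments[2])
```

Each segment keeps its internal spaces, so every line of a segment has the same width as the character it belongs to.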

The challenge is to convert the contents of the text file to the corresponding series of ASCII characters. To do this, just follow these steps:

Example

In the following example session we assume that the example files ocr.txt, sportmannen.txt and romanheld.txt are in the current directory.

>>> segment = segmentation('ocr.txt')
>>> segment[0]
' ## \n#  #\n#  #\n#  #\n#  #\n ## '
>>> print(segment[0])
 ## 
#  #
#  #
#  #
#  #
 ## 
>>> print(segment[1])
 ###
#   
#   
#   
#   
 ###
>>> print(segment[2])
# ##
## #
#   
#   
#   
#   
>>> print(segment[3])
# ##
## #
#   
#   
#   
#   
>>> print(segment[-1])
 ## 
#  #
#  #
#  #
#  #
 ## 

>>> OCR('ocr.txt')
'ocrrococo'
>>> OCR('sportmannen.txt')
'marmer'
>>> OCR('romanheld.txt')
'emerald'
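Once the segments are available, one possible way to turn them into ASCII characters is an exact lookup in a table of known glyphs. The sketch below is an assumption and not necessarily the intended method: the names GLYPHS and recognise are hypothetical, and the table holds only the three glyphs that appear in the session above, since the complete font is not given here.

```python
# Glyphs copied from the example session. A complete table for the font
# is not given in the text and would have to be supplied separately.
GLYPHS = {
    ' ## \n#  #\n#  #\n#  #\n#  #\n ## ': 'o',
    ' ###\n#   \n#   \n#   \n#   \n ###': 'c',
    '# ##\n## #\n#   \n#   \n#   \n#   ': 'r',
}


def recognise(segments, glyphs=GLYPHS, unknown='?'):
    """Map each segment to its character; unknown glyphs become `unknown`."""
    return ''.join(glyphs.get(segment, unknown) for segment in segments)


# Pick the three glyph strings back out of the table by their character.
by_char = {char: glyph for glyph, char in GLYPHS.items()}
word = [by_char['o'], by_char['c'], by_char['r']]
print(recognise(word))               # ocr
print(recognise(word + ['####']))    # ocr?
```

An exact lookup is only reliable when every character is always drawn in exactly the same way. Input with small variations between copies of the same character would need a more tolerant comparison.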
