Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded/computer-readable text. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as machine translation, text-to-speech, key data extraction and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Assignment

In this exercise we work with text files whose lines together depict a number of handwritten characters. The manuscript is printed in a font that meets the following conditions:

An empty column is a column that consists only of spaces in every line of the handwritten text.
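The definition of an empty column can be made concrete with a small sketch. The helper name `empty_columns` and the padding of short lines with spaces are assumptions made for illustration; they are not part of the assignment.

```python
def empty_columns(lines):
    """Return the indices of the columns that hold a space in every line.

    Hypothetical helper: the name and the padding behaviour are assumptions.
    """
    width = max((len(line) for line in lines), default=0)
    # Pad short lines with spaces so every line can be indexed at every column.
    padded = [line.ljust(width) for line in lines]
    return [
        col for col in range(width)
        if all(line[col] == ' ' for line in padded)
    ]
```

For the three lines `' ##   ###'`, `'#  # #   '` and `' ##   ###'`, only column 4 (counting from 0) is empty, so the helper returns `[4]`.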

The following example displays the content of a text file that contains a handwritten version of the letters ocrrococo. Click here to view a graphical representation of the segmentation of the text file. These segments are indicated with a dark gray background, and the lines that are not included in the string representation of segments are crossed out with a red line.

                                             
                                             
                                             
 ##   ### # ## # ##  ##   ###  ##   ###  ##  
#  # #    ## # ## # #  # #    #  # #    #  # 
#  # #    #    #    #  # #    #  # #    #  # 
#  # #    #    #    #  # #    #  # #    #  # 
#  # #    #    #    #  # #    #  # #    #  # 
 ##   ### #    #     ##   ###  ##   ###  ##  
                                             
                                             

The challenge is to convert the contents of the text file to the corresponding series of ASCII characters. To do this, just follow these steps:

Example

In the following example session we assume that the example files ocr.txt, sportmannen.txt and romanheld.txt are in the current directory.

>>> segment = segmentation('ocr.txt')
>>> segment[0]
' ## \n#  #\n#  #\n#  #\n#  #\n ## '
>>> print(segment[0])
 ## 
#  #
#  #
#  #
#  #
 ## 
>>> print(segment[1])
 ###
#   
#   
#   
#   
 ###
>>> print(segment[2])
# ##
## #
#   
#   
#   
#   
>>> print(segment[3])
# ##
## #
#   
#   
#   
#   
>>> print(segment[-1])
 ## 
#  #
#  #
#  #
#  #
 ## 

>>> OCR('ocr.txt')
'rococo'
>>> OCR('sportmannen.txt')
'marmer'
>>> OCR('romanheld.txt')
'emerald'
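The segments printed in the session above are consistent with splitting the text at empty columns. The following is a minimal sketch of a `segmentation` function under two assumptions that the text does not confirm: lines consisting only of whitespace are dropped, and every maximal run of non-empty columns forms one segment. It illustrates the idea and is not necessarily the intended solution.

```python
def segmentation(filename):
    """Split the text in the given file into character segments.

    Sketch under assumptions: whitespace-only lines are ignored and
    segments are separated by columns that contain only spaces.
    """
    with open(filename) as handle:
        lines = [line.rstrip('\n') for line in handle]

    # Assumption: lines without any ink do not belong to a segment.
    rows = [line for line in lines if line.strip()]
    if not rows:
        return []

    # Pad the rows to a common width so each column exists in each row.
    width = max(len(row) for row in rows)
    rows = [row.ljust(width) for row in rows]

    segments, start = [], None
    # Visit one position past the last column to close a trailing segment.
    for col in range(width + 1):
        empty = col == width or all(row[col] == ' ' for row in rows)
        if not empty and start is None:
            start = col          # first column of a new segment
        elif empty and start is not None:
            segments.append('\n'.join(row[start:col] for row in rows))
            start = None         # segment closed at an empty column
    return segments
```

Each segment is returned as a single string whose rows are joined by newline characters, which matches the representation of `segment[0]` shown in the session.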

Click on the links below to view a graphical representation of the segmentation of the text files. These segments are indicated with a dark gray background, and the lines that are not included in the string representation of segments are crossed out with a red line.