Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. It is a field of research in pattern recognition, artificial intelligence and machine vision. Though academic research in the field continues, the focus in character recognition has shifted to the implementation of proven techniques.
For many document-input tasks, character recognition is the most cost-effective and speedy method available, and each year the technology frees acres of storage space once given over to file cabinets and boxes full of paper documents.
Problem Description:
Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required.
The character recognition software then processes these scans to differentiate between images and text and determine what letters are represented in the light and dark areas.
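To make the light/dark distinction concrete, here is a minimal sketch (in Python with NumPy; the fixed threshold value is an assumption, since production engines choose it adaptively) of separating a scanned page's dark ink from its light background:

```python
import numpy as np

def binarize(gray, threshold=128):
    """Split a grayscale scan into dark (ink) and light (background) pixels.
    `gray` is a 2D array of 0-255 intensities; the fixed threshold is a
    simplification -- real engines pick it adaptively (e.g. Otsu's method).
    """
    return (gray < threshold).astype(np.uint8)  # 1 = ink, 0 = paper

# Toy 5x5 "scan": a dark vertical stroke on a light page.
page = np.full((5, 5), 230, dtype=np.uint8)
page[:, 2] = 40
print(binarize(page))
```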
Older OCR systems match these images against stored bitmaps based on specific fonts. The hit-or-miss results of such pattern-recognition systems helped establish OCR's reputation for inaccuracy.
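A minimal sketch of that older, font-specific approach follows; the hand-made templates and the pixel-difference (Hamming distance) criterion are illustrative assumptions, not any particular product's method:

```python
import numpy as np

# Hypothetical stored bitmaps for one specific font.
TEMPLATES = {
    "I": np.array([[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]], dtype=np.uint8),
    "L": np.array([[1, 0, 0],
                   [1, 0, 0],
                   [1, 1, 1]], dtype=np.uint8),
}

def match_bitmap(glyph):
    """Return the template character differing from `glyph` in the fewest
    pixels. Noise or an unknown typeface degrades the match, which is why
    pure bitmap matching proved hit-or-miss."""
    return min(TEMPLATES, key=lambda c: int(np.sum(TEMPLATES[c] != glyph)))

noisy_I = np.array([[0, 1, 0],
                    [0, 1, 1],   # one noisy pixel
                    [0, 1, 0]], dtype=np.uint8)
print(match_bitmap(noisy_I))  # -> "I"
```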
Today's OCR engines add the multiple algorithms of neural network technology to analyze the stroke edge, the line of discontinuity between the text characters and the background. Allowing for irregularities of printed ink on paper, each algorithm averages the light and dark along the side of a stroke, matches it to known characters and makes a best guess as to which character it is. The OCR software then averages or polls the results from all the algorithms to obtain a single reading.
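The final polling step can be sketched as a majority vote over the per-algorithm guesses; the stand-in recognizers below are placeholders for the independent stroke-analysis algorithms described above, not any engine's actual components:

```python
from collections import Counter

def poll(recognizers, glyph):
    """Combine the best guesses of several independent recognition
    algorithms into a single reading by majority vote (a production
    engine would also weight each vote by its confidence)."""
    votes = Counter(recognize(glyph) for recognize in recognizers)
    return votes.most_common(1)[0][0]

# Stand-in "algorithms" that disagree on an ambiguous glyph.
algorithms = [lambda g: "O", lambda g: "0", lambda g: "O"]
print(poll(algorithms, glyph=None))  # -> "O"
```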
OCR software can recognize a wide variety of fonts, but handwriting and script fonts that mimic handwriting are still problematic; this is where the additional power of neural networks is required. Developers are taking different approaches to improving script and handwriting recognition, and as mentioned above, one possible approach is the use of neural networks.
Neural networks can be used if we have a suitable dataset for training and learning purposes. The dataset is one of the most important considerations when constructing a new neural network: without a proper dataset, training will be useless. There is a saying about the pre-processing of data and the training of neural networks: "rubbish in, rubbish out". So how do we produce a proper dataset? First we scan the image. After the image is scanned, we define a processing algorithm which extracts the important attributes from the image and maps them into a database, or more precisely a dataset. The extracted attributes have numerical values and are usually stored in arrays. With these values the neural network can be trained, and we can get good end results. The problem of well-defined datasets also lies in carefully chosen algorithm attributes: attributes are important and can have a crucial impact on the end results.
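As an illustrative sketch of that pipeline, the per-row and per-column ink-density attributes below, and the small scikit-learn network, are assumptions chosen for brevity rather than the attributes or network of any particular system:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def extract_attributes(glyph):
    """Map a binary glyph bitmap to a fixed-length numeric array.
    Per-row and per-column ink densities are a deliberately simple
    choice of attributes; as noted above, this choice is crucial."""
    return np.concatenate([glyph.mean(axis=1),   # row ink densities
                           glyph.mean(axis=0)])  # column ink densities

# Tiny illustrative dataset: 3x3 bitmaps of "I" and "L" with labels.
I = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
L = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 1]])
X = np.stack([extract_attributes(g) for g in (I, L)])
y = ["I", "L"]

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(X))  # -> ['I' 'L'] once training has converged
```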