Digital Histories: What is OCR? Algorithm anybody?

I thought it might be helpful to explain why OCR, or Optical Character Recognition, is so important to history on the web, and why the underlying algorithm matters too. Basically, when you scan a document you get a picture of the page, and an OCR program then converts that picture into text. Believe it or not, your printer at home can probably do this: if it has a scanner built in, it will probably come with a program that can turn a sheet of typed text into a document (such as a Word file or a plain text file). As you can appreciate, though, the programs and scanners used for historical documents are a lot more complicated than the ones you have at home.

The algorithm is a set of logical rules for the computer to follow so that it can recognise, say, a letter of the alphabet, a number or another character. This generates the underlying text document that search engines use, probably including the site's own search engine. Algorithms can be programmed to become self-correcting, so if the algorithm doesn't recognise a long 'S', for example, and it is told that it is an 'S', the next time it sees a long 'S' it will automatically use the correct letter.

I believe this is what Tim has been arguing for for some time, whether through funding or a collective effort from different institutions. However, although I kind of understand the idea, you would need a programming whizz to set it all up, which would cost lots of £££s! Each mistake requires someone to go in and correct it, which also costs £££s! This is why some institutions use volunteers to correct the underlying text. The good thing is that once you have this 'Super-algorithm', it would be all-conquering and take over the world (!?), or perhaps just make looking for something with a lot of long S's in it that much easier.
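To make the self-correcting idea a bit more concrete, here is a very rough sketch in Python. It is not how a real OCR engine works (those learn from the shapes of letters in the image); it only shows the "tell it once, and it remembers" principle, working on the text after OCR has run. It assumes the long 'S' has been misread as an 'f', which is a common slip, and the `CorrectionMemory` class and the sample sentence are made up purely for illustration.

```python
class CorrectionMemory:
    """Toy 'self-correcting' step applied to OCR output (hypothetical example)."""

    def __init__(self):
        # Maps a misread word to the correction a human supplied for it.
        self.learned = {}

    def teach(self, misread, corrected):
        """Record a correction made by a volunteer or editor."""
        self.learned[misread] = corrected

    def fix(self, text):
        """Replace every word we have already been taught; leave the rest alone."""
        return " ".join(self.learned.get(word, word) for word in text.split())


memory = CorrectionMemory()
raw = "the houfe was fuch a mefs"  # invented OCR output with long-s errors

# Before anyone has corrected anything, the text comes back unchanged.
print(memory.fix(raw))

# A volunteer corrects each mistake once...
memory.teach("houfe", "house")
memory.teach("fuch", "such")
memory.teach("mefs", "mess")

# ...and from then on the same mistakes are fixed automatically.
print(memory.fix(raw))  # the house was such a mess
```

The point of the sketch is the cost argument: every call to `teach` stands for a person's time, but each correction only has to be paid for once.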

If this proves useful, next week I might explain how to use embedded code for things like the slide show below.

Thursday, 1 March 2012

Presentation