Scanned documents, such as academic papers or government reports, used to be off limits to Googlebot. That’s because a scanned paper appears as one giant image instead of text.
Now, using Optical Character Recognition (OCR), Google is able to turn these documents into text and will begin including these files in its search results.
Previously, Google could only search the filename and the limited metadata associated with these files when including them in search results. Google’s new technology turns the scanned “images of text” into computer-readable text.
As with traditional PDF files, when you encounter a scanned document you’ll be able to view either the original version or the text-only version Google has created. To see the technology in action, try searching for “repairing aluminum wiring” (the first result should be a scanned document).
This type of technology has been around for a while, but accuracy has always been a problem. Some words would get jumbled or misspelt, so it’s impressive that Google has found a solution that’s accurate enough to be used for its search results.
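To make “accurate enough” a little more concrete, here is a minimal sketch of how you might compare OCR output against a known-correct transcript. It uses only Python’s standard `difflib` module. The function names and the sample strings are made up for illustration, and this is not a description of how Google measures accuracy.

```python
from difflib import SequenceMatcher


def ocr_similarity(reference, recognized):
    """Return a similarity score between 0.0 and 1.0 for the two strings."""
    return SequenceMatcher(None, reference, recognized).ratio()


def misread_words(reference, recognized):
    """List (expected, got) pairs for words that differ.

    Assumes both texts contain the same number of words, which real OCR
    output does not guarantee.
    """
    pairs = zip(reference.split(), recognized.split())
    return [(expected, got) for expected, got in pairs if expected != got]


# Hypothetical example: typical OCR slips such as "m" read as "rn".
reference = "repairing aluminum wiring"
recognized = "repairing alurninum wirinq"

score = ocr_similarity(reference, recognized)
errors = misread_words(reference, recognized)
```

A score close to 1.0 means the recognized text is nearly identical to the original, and the list of misread words shows where the jumbled or misspelt terms crept in.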
What does this mean for SEO?
If you’ve got any scanned documents on your site, such as press releases, newspaper articles or research papers, this gives your business more chances to appear in Google’s results. By giving Google more content to index, you’ll improve your chances of showing up for queries related to these documents!
P.S. If you’re hiding any information on the web by keeping it as an image, you may want to consider removing those files now.