OCR Tools: Coping with PDF Files

As a medical translator, I work with a LOT of PDF files. I probably use my OCR tool up to 10 times per day and I’m fairly certain that at this point, I couldn’t work without it. However, it took some time before I figured out exactly how to get the most out of it and I’m certain that I haven’t even scratched the surface. In case you are not familiar with OCR, it stands for “Optical Character Recognition” and is basically used to turn “dead” (not editable) documents of all kinds (including pictures and PDFs) into editable Word documents preserving the formatting of the original. This sometimes works better in theory than in practice since a bad fax can ruin the OCR tool’s ability to properly recreate formatting.

Fixing the strange formatting produced by an OCR tool can be more difficult than recreating the formatting from scratch, so I usually take the text from the OCR file as unformatted text and format it myself. I find this to be the easiest way to get around the strange formatting while still taking advantage of the benefits. Formatting aside, OCR has plenty of uses:

Quality: When editing translations from PDF documents, I often find that translators omit text. Although this is an unacceptable translation error, it does happen. OCR helps ensure all of the text gets translated, just like using a Word file.

Computer-Assisted Translation tools: OCR enables you to use your favorite CAT tool with a dead PDF file. This helps speed up the translation process by taking advantage of the matches and repetitions that are generally inaccessible in PDF translations. You can also increase consistency by always ensuring that segments and terminology are translated the same way throughout a document.

Numbers, names and lists: Have you ever waded through pages and pages of a lab report? Ever painfully retyped tables full of numbers? An OCR tool will recreate all of those numbers for you. That means all you need to do is proofread them! Or, how about a list of names with phone numbers? Don’t type the whole list from scratch—OCR the list and proofread instead!

Tables: Although OCR tools can create strange formatting, they are great with simple tables and lines that they can read well. You may just need to correct the cell alignment and font.

Word counts: Most translators estimate how long a project will take based on the number of words in the document. With a PDF, the word count is usually estimated in a variety of ways, and the accuracy varies. I recently had a client ask me to translate a very technical medical document with 2,000 words in 24 hours. No problem, right? It looked a little longer than that to me, so I sent the file through my OCR tool and it turned out that the file was 7,000 words. No, I’m not kidding. That would have been a long night.

Flat rates: Having an accurate word count also allows you to give clients a flat rate if you so choose and/or helps provide a more accurate quote up front so no one is surprised.

Just remember that OCR tools only give an estimate. If you use one to check the word count of a document, be sure to scroll through and make sure that all or most of the text was picked up. If the tool can’t read something, it will insert it as a picture, and maybe a picture is worth a thousand words, but not to a translator!
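
If your OCR tool can export its output as plain text, a few lines of Python can double-check the count before you quote. This is only a rough sketch: the function names are made up for illustration, the 400 words-per-hour rate is an assumed placeholder you should replace with your own, and counting whitespace-separated words will not match every word processor exactly.

```python
from pathlib import Path


def count_words(text: str) -> int:
    """Count whitespace-separated words (a rough approximation)."""
    return len(text.split())


def count_words_in_file(path: str) -> int:
    """Count words in a plain-text export of the OCR result (assumed UTF-8)."""
    return count_words(Path(path).read_text(encoding="utf-8"))


def estimate_hours(word_count: int, words_per_hour: int = 400) -> float:
    """Rough turnaround estimate; words_per_hour is an assumed personal rate."""
    return word_count / words_per_hour


if __name__ == "__main__":
    # The client's figure versus what the OCR'd file actually contained.
    for words in (2000, 7000):
        print(f"{words} words: about {estimate_hours(words):.1f} hours")
```

At the assumed rate, the gap between the 2,000 words quoted and the 7,000 words actually in the file is the difference between a comfortable day and an impossible deadline. The count is still only as good as the OCR output, so the scroll-through check above still applies.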

How do you use your OCR tool?

WEBINAR: How to Win at PDF Files

Intro to Earning More in Less Time with OCR
[Wednesday, Jan. 18 @ 9 AM PST -- recording provided]
Posted in Technology.

7 Comments

  1. Pingback: Technology: Embrace It or Get Left Behind | Success by Rx

  2. I was just curious about whether you go through and spend the time to fix the OCR’d file before running it through a CAT tool if it doesn’t read it accurately.

    • It would depend on the file. If it was mostly text with little formatting, I would go ahead and translate the plain text OCR file in a CAT tool and fix it afterward. If it had a significant amount of formatting, I would go ahead and create that before using a CAT tool. However, unless there are lots of repetitions, it’s unlikely that I would use a CAT tool if there was a lot of formatting. In that case, I would OCR the file and use it as reference while translating in a clean file, copying and pasting sections at a time. If all the words are going to be “new” anyway, I find this to be the most productive method.

  3. I purchased Abbyy Finereader after your webinar. I didn’t have very high hopes but I’ve already used it for several quotes/projects. It’s worth it. I’ve used it mainly to translate/quote medical reports, bank statements, tax returns and payslips.

    From the article: “maybe a picture is worth a thousand words, but not to a translator!” Lol, good one.

    • Haha I have to admit I am particularly proud of that line. 🙂 I’m glad you found it useful. I’ll keep your feedback in mind for the next webinar. I was trying to ensure people’s expectations were set correctly because it can do some crazy things…but its advantages outweigh its disadvantages by far in my opinion!
