FLOSS tools make file preparation easy!

djvu and tcltext

I received a large project as a pdf file laid out with four pages to a page, but at least it was a text pdf. I exported the pdf to text, but the result was a mess, as one might expect. So I found another method: I converted the pdf to djvu, which let me select one page at a time, meaning one of the four pages within each page. Then I copy/pasted each page into a text file. I have 19 pages of this pdf file to translate, which meant copy/pasting 19×4 times (76 times, for those as bad at math as I am).

Or, to explain the process in greater detail:

First, of course, I used pdftk to cut out the 19 pages assigned to me.
Then, I used pdf2djvu to convert to djvu.
Then I opened the djvu file in djview4 and spent about 20 minutes doing the copy/paste thing. (pictured above, using djview4 and tcltext)
Then, because there were line numbers splitting workable segments, I used awk to remove all numerical characters from the file.
(It is a court transcript, so the line numbers weren’t relevant. All numerical data in the witness’s testimony are actually written out, like “one”, “sixty”, etc.)
Then, I went through the file in vim (an extremely powerful text editor; in fact, the BEST text editor EVER!), pressing “shift-j” to join separated lines that belong to the same segment
(i.e., where sentences had been split into separate lines, with numbers at their beginning), thus making “normal” paragraphs and sentences.
This gave me a more workable text file, which I then used in my CAT tool (OmegaT).
Afterwards, while doing a final review in OpenOffice, I will have to put the line numbers back.
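
For anyone who would rather script the cleanup than do it by hand, here is a rough Python sketch of the two text steps above: stripping the digits and joining the split lines. It is not what I actually ran (I used awk and vim). The function names and the sample text are made up for illustration, and it assumes that blank lines separate the paragraphs, which may not hold for every file.

```python
import re


def strip_digits(text: str) -> str:
    """Remove every ASCII digit (the line numbers, in this transcript)."""
    return re.sub(r"[0-9]", "", text)


def join_lines(text: str) -> str:
    """Join consecutive non-blank lines into single-line paragraphs.

    Assumption: a blank line marks a paragraph break.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        words = line.split()  # also drops stray spaces left by removed digits
        if words:
            current.append(" ".join(words))
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


# Invented sample in the style of a numbered transcript page.
sample = (
    "1 Q. Where were you on the night\n"
    "2 in question?\n"
    "\n"
    "3 A. I was at home with\n"
    "4 sixty other people.\n"
)

cleaned = join_lines(strip_digits(sample))
print(cleaned)
```

Note that stripping every digit is only safe here because the testimony spells its numbers out. On a file with real numeric data you would want to remove only the numbers at the start of each line.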
Now, this all sounds like a lot of work, but I’m certain the time I will save by being able to use OmegaT, which is incredibly efficient (and a lot better than just reconstructing the file), will more than make up for the time lost preparing the file. All told, I spent about an hour converting a mess of a pdf file into a usable text file, and probably saved myself several hours or more by translating it with OmegaT rather than reconstructing the file from scratch. Plus, I’ll be able to give the client a translation memory, which is always useful. It also makes clients like you and send more work, especially when they aren’t expecting it because they sent you a pdf and assumed you’d have to recreate the file.

People think you can’t work in this industry (professional translation) using only Free Software, but I have been proving that wrong for over 10 years now. I don’t think translators on proprietary platforms have all these great tools I have, or, at least, not without spending ridiculous amounts of money to have them. I’m constantly impressed with how much a few command line tools, a little sed or awk fu in a quick script, etc., can make my job so much easier.

I am grateful for Debian GNU/Linux, the Free Software Foundation, and all of these great FLOSS (Free/Libre Open Source Software) tools.

1 Comment

  1. Tony

    For more information on lots of great Free Open Source Software for translators, see http://baldwinlinguas.com/freesoftware/

