[Fwd: Re: [Color Computer][Coco] Rainbow on Disc - OCR]
John R. Hogerhuis
jhoger at pobox.com
Fri Jun 10 23:29:53 EDT 2005
ImageMagick is not the only tool that can generate PDF under Linux.
There is OpenOffice and various command-line tools, and don't
underestimate the ability of vi or emacs to generate PDF (conceptually,
anyway; one would probably just edit an existing script that generates
PostScript).
If we can standardize on open source tools as much as possible that
would be ideal, but we should be open to closed source stuff where there
is no open source tool, and the price is reasonable (or sufficient
numbers of people already have it). This is how we did Thinking Forth.
The book was scanned by me and run through OCR. Chapters were
distributed to the proofreaders with both the OCR text and the original
scans, images went to other folks to clean up, and LaTeX experts did the
template writing and programming.
There are no good free OCR tools for Linux. I've gone through it in the
past... Transym OCR on Windows is good enough to get raw text, is
scriptable in VB, works in batch mode and it's about $40, so it probably
wouldn't break anybody.
That gives us raw text easily. Personally I think raw text is the most
important thing here... because it is searchable and indexable across
issues with simple, free tools. If the ASCII text is locked up inside a
PDF, that isn't possible.
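
To illustrate how little tooling raw text needs, here is a minimal sketch
of a cross-issue word index in Python. The issue names and sample text are
made up for the example, and a real run would read the OCR output files
from disk rather than use an in-memory dictionary.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

def build_index(issues):
    """Map each lowercased word to the set of issue names containing it."""
    index = defaultdict(set)
    for name, text in issues.items():
        for word in WORD.findall(text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the sorted issue names that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

# Hypothetical sample data standing in for per-issue OCR text files.
issues = {
    "issue_a.txt": "Disk BASIC tips and a joystick interface project.",
    "issue_b.txt": "Assembly language graphics and a joystick game.",
}
index = build_index(issues)
print(search(index, "joystick game"))
```

The same thing could be done with grep alone; the point is only that plain
text keeps every option open.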
Now PDF is, roughly speaking, just PostScript. Anyone who can write
PostScript can produce just as fancy a document as any OCR program can.
It may be that there are already open source scripts out there that can
do what we need. I need to ask around. I won't say we can just write the
scripts ourselves, because I haven't done much beyond experimenting with
PostScript programming. But perhaps others have ideas on this.
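
As a rough idea of what such a script might look like, here is a small
Python sketch that turns lines of text into a bare-bones PostScript
program. The function names, font, margins and page geometry are all my
own assumptions, it handles plain ASCII only, and it does nothing about
page images or layout. A converter such as Ghostscript's ps2pdf could then
turn the result into a PDF.

```python
def escape_ps(text):
    """Escape the characters that are special inside a PostScript string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def text_to_postscript(lines, font="Courier", size=10,
                       left=72, top=720, bottom=72, leading=12):
    """Return a PostScript program that prints the lines, one per row."""
    set_font = f"/{font} findfont {size} scalefont setfont"
    out = ["%!PS", set_font]
    y = top
    for line in lines:
        if y < bottom:
            # Page is full: emit it and start a new one at the top margin.
            out.append("showpage")
            out.append(set_font)
            y = top
        out.append(f"{left} {y} moveto ({escape_ps(line)}) show")
        y -= leading
    out.append("showpage")
    return "\n".join(out) + "\n"

# Hypothetical sample text; real input would be the proofread OCR output.
sample = ["Rainbow on Disc", "Sample page (OCR text)"]
print(text_to_postscript(sample))
```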
I think a $299 tool is a non-starter. It is also a do-everything tool:
it would want to do the OCR and make the PDF. It's also something of a
black box. If there's something we don't like about how it does stuff,
there's nothing we can do to fix it. If we can separate out the tasks of
scanning, OCR, proofreading and composition, the work scales better, is
more efficient, and therefore gets done sooner.