[Coco][Color Computer] .djvu vs .pdf revisited

Michael Wayne Harwood michael at musicheadproductions.org
Wed Jul 13 15:18:42 EDT 2005


The main goal for the .pdf/.djvu files is to keep the average size per page
under 160 KB while maintaining excellent visual quality.  To do this I have
taken the original scans and resized them to 1/3 of their original resolution
before processing them with the document encapsulation/compression utilities
that produce the .pdf/.djvu files.
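
As a rough illustration of the arithmetic involved: scaling both dimensions
to 1/3 leaves about 1/9 of the pixels, and the 160 KB target is an average
over all pages, not a per-page cap.  The sketch below assumes that "kb" means
kilobytes of 1024 bytes; the scan dimensions and function names are made up
for the example and are not measurements from the actual scans.

```python
def resized_dimensions(width, height, factor=3):
    """Return (width, height) after dividing each dimension by `factor`."""
    return round(width / factor), round(height / factor)


def average_page_kb(page_sizes_bytes):
    """Average size per page in KB (1 KB = 1024 bytes is an assumption)."""
    return sum(page_sizes_bytes) / len(page_sizes_bytes) / 1024


def within_budget(page_sizes_bytes, budget_kb=160):
    """True if the average page size is under the budget."""
    return average_page_kb(page_sizes_bytes) < budget_kb


# Hypothetical full-size scan of 2550 x 3300 pixels.
full_w, full_h = 2550, 3300
small_w, small_h = resized_dimensions(full_w, full_h)
pixel_fraction = (small_w * small_h) / (full_w * full_h)  # roughly 1/9
```

A single large page (a colour cover, say) can exceed 160 KB as long as the
other pages pull the average back under the target.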

OCR does not work very well on the resized images, so OCR'd searchable text
will most likely not be included as a "feature" of the end products.  Instead,
there will be a folder on the DVD containing the full text of all of the
issues saved as .txt files.  I have attached a .zip of a "one pass" OCR of
the first 10 pages of the August 1989 issue - the same issue the test files
were generated from.  Page 1 is the cover, and its text file is empty because
the OCR software did not recognize anything on the page!
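
A folder of per-page .txt files can be searched with very little code.  The
following is a minimal sketch, assuming one .txt file per page in a single
folder; the function names are hypothetical and the actual layout of the DVD
folder may differ.  It also lists pages whose OCR output is empty, such as
the cover.

```python
from pathlib import Path


def read_text(path):
    """Read a text file, replacing any bytes that are not valid UTF-8."""
    return path.read_text(encoding="utf-8", errors="replace")


def empty_text_files(folder):
    """Names of .txt files with no non-whitespace content."""
    return sorted(
        p.name for p in Path(folder).glob("*.txt") if not read_text(p).strip()
    )


def search_text_files(folder, term):
    """Return (file name, line number, line) for case-insensitive matches."""
    needle = term.lower()
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        for number, line in enumerate(read_text(path).splitlines(), start=1):
            if needle in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits
```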

I have a few ideas I am testing out with regard to importing OCR'd text into
.djvu documents, and if there's a free (as in beer) way to do that in .pdf I
would love to know about it.  My OCR software will only export bounding box
and positional information as a PDF hidden text layer, so I have decided to
use ghostscript to export the text layer from a .pdf with OCR'd text,
massage the data, and import it into a .djvu file that uses the resized images
as pix sources.  My main frustration is that the OCR software I am using does
a cruddy job of finding bounding boxes correctly, but the results would be
good enough to allow for a full text search in a .djvu file.
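
The "massage the data" step might look something like the sketch below.  It
assumes the word boxes have already been pulled out of the PDF text layer
into (text, (x0, y0, x1, y1)) tuples measured in full-size scan pixels with
the origin at the top left; that input shape and the function names are
hypothetical.  It also assumes the djvused hidden-text syntax of nested
(type xmin ymin xmax ymax ...) expressions with the origin at the bottom
left, that word zones may sit directly under the page zone, and that the
text is plain ASCII.  The boxes are divided by 3 to match the resized images.

```python
def escape_djvu_string(text):
    """Escape backslashes and double quotes for a quoted string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_djvu_box(box, page_height, factor=3):
    """Convert a top-left-origin box on the full-size scan to a
    bottom-left-origin box on the resized page image.

    `page_height` is the height of the resized image in pixels.
    """
    x0, y0, x1, y1 = box
    return (
        round(x0 / factor),
        page_height - round(y1 / factor),  # bottom edge becomes ymin
        round(x1 / factor),
        page_height - round(y0 / factor),  # top edge becomes ymax
    )


def page_to_sexpr(words, page_width, page_height, factor=3):
    """Build a (page ... (word ...)) hidden-text expression for one page."""
    lines = [f"(page 0 0 {page_width} {page_height}"]
    for text, box in words:
        x0, y0, x1, y1 = to_djvu_box(box, page_height, factor)
        lines.append(f' (word {x0} {y0} {x1} {y1} "{escape_djvu_string(text)}")')
    return "\n".join(lines) + ")"


# Hypothetical word from a 2550 x 3300 scan, page resized to 850 x 1100.
print(page_to_sexpr([("COLOR", (300, 150, 600, 210))], 850, 1100))
```

The output for each page could then be saved to a file and loaded with
djvused, along the lines of
djvused issue.djvu -e 'select 1; set-txt page1.txt' -s
(issue.djvu and page1.txt are placeholder names; the exact invocation should
be checked against the djvused documentation).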
 
Ramble ramble ramble....


Regards,
Michael Harwood

-----Original Message-----
From: coco-bounces at maltedmedia.com [mailto:coco-bounces at maltedmedia.com] On
Behalf Of Mark Anderson
Sent: Wednesday, July 13, 2005 12:31 PM
To: coco at maltedmedia.com
Subject: [Coco][Color Computer] .djvu vs .pdf revisited

I agree with you, Neil.  ABBYY does a fine job with the "text under image".
As I pointed out in a previous post, check out test1.djvu that Michael
posted.  I like the capability to go into all B&W mode or to isolate the
foreground and background with the view options of the Lizardtech djvu
viewer.  This, unfortunately, is not available in any PDF viewer that I am
aware of.  I think it is native to the djvu format, since separating the
text from the background is part of the djvu technology; from what I gather,
it does this automatically.





-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0789-text.zip
Type: application/octet-stream
Size: 17291 bytes
Desc: not available
URL: <http://five.pairlist.net/pipermail/coco/attachments/20050713/d14f4dc2/attachment-0001.obj>

