[Coco] Rainbow archives in DjVu

Jeff Teunissen deek at d2dc.net
Tue Mar 17 17:41:38 EDT 2009


John W. Linville wrote:
> On Tue, Mar 17, 2009 at 10:46:44AM -0500, Joel Ewy wrote:

[snip]

>> I'm accustomed to PDF, but if DjVu readers are available under the GPL,  
>> have been ported to most popular operating systems, and if the DjVu  
>> format is being used by Google and Archive.org, why shouldn't we make  
>> use of it?
>>
>> That's my perspective on it.
> 
> I'm sure DjVu is a wonderful format.  However, given the fact that
> I've only ever seen it discussed or used in this mailing list I think
> claims of its immense popularity are suspect.  YMMV...

It's only "immensely popular" in the relatively narrow field of digital
document preservation.

Since the late 90s, commercial magazine-on-disc releases have pretty much
exclusively used the DjVu format. Most of them use it because, unless you
already have the true electronic source versions of the documents you're
trying to preserve (in which case you can use PDF), it's basically the ONLY
way to fit more than a couple of volumes onto a disc at a resolution at which
the text is actually readable.

Archivists use it for a different reason (though technically it's kinda the
same reason): DjVu lets you get much higher resolution for the same price in
bytes.

Here's a technical overview of how (and why) it actually works as well as it does:

Beware, incoming info dump -- I had to learn all this crap over the past week,
so now you all do too. ;)

A .djvu file itself is just a standard IFF-format compound document containing
as its first chunk a bz-compressed RLE-encoded 1bpp bitmap which acts as a
mask for two 24-bit color pixmaps. Where the mask is black, the image in the
FG chunk is displayed; where the mask is white, the image in the BG chunk is
displayed.
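
To make the mask idea concrete, here's a toy sketch in Python. It is not how
any real DjVu decoder is written: the composite() function, the nested-list
layout and the two colors are all made up for illustration, and it pretends
all three layers are the same size.

```python
# Sketch only: plain nested lists stand in for decoded image layers.
# composite(), BLUE and PAPER are illustrative names, not part of any DjVu library.

def composite(mask, fg, bg):
    """Pick the FG pixel where the mask is 1 (black), else the BG pixel."""
    return [
        [fg_px if m else bg_px for m, fg_px, bg_px in zip(m_row, fg_row, bg_row)]
        for m_row, fg_row, bg_row in zip(mask, fg, bg)
    ]

BLUE = (0, 0, 255)        # text color, stored in the FG layer
PAPER = (240, 232, 208)   # page texture, stored in the BG layer

mask = [[0, 1, 1, 0],
        [0, 1, 0, 0]]
fg = [[BLUE] * 4 for _ in range(2)]
bg = [[PAPER] * 4 for _ in range(2)]

page = composite(mask, fg, bg)
```

Every 1 in the mask comes out BLUE and every 0 comes out PAPER, which is all
the viewer has to do per pixel.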

The bit mask is always at the full resolution of the scan, but since it uses
only 1 byte to store 8 pixels (and is RLE-encoded and bz-compressed on top
of that), there's a pretty huge space savings: a page of The Rainbow, scanned
at 300DPI, uses about 23K to store the high-res mask.
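
As a rough sanity check on that number, here's the arithmetic in Python. The
8.5 x 11 inch page size is an assumption (the real trim size may differ a
bit), and "23K" is read as 23 * 1024 bytes.

```python
# Back-of-the-envelope sizes for one scanned page.
# Assumption: an 8.5 x 11 inch page; the actual page size may differ.
PAGE_W_IN, PAGE_H_IN = 8.5, 11
DPI = 300

width_px = int(PAGE_W_IN * DPI)        # 2550
height_px = int(PAGE_H_IN * DPI)       # 3300
pixels = width_px * height_px          # 8,415,000

raw_mask_bytes = (pixels + 7) // 8     # 1 bit per pixel -> 1,051,875 bytes
raw_rgb_bytes = pixels * 3             # 24-bit color -> 25,245,000 bytes

stored_mask_bytes = 23 * 1024          # assumption: "23K" means 23 KiB
ratio = raw_mask_bytes / stored_mask_bytes
```

Under those assumptions the raw 1bpp mask is about a megabyte, so 23K works
out to roughly 45:1 on top of the 24:1 you already get from dropping to one
bit per pixel.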

The other two layers (FG and BG), which describe what color is actually
displayed, are usually where the space/storage compromises take place. These
chunks are typically scaled down by an integer factor (anywhere from 1 to 12)
to around 100DPI and then wavelet-compressed using the mask as a template.
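
The scaling can be sketched like this. Assumptions: the helper names are
hypothetical, the page is 2550 x 3300 pixels (8.5 x 11 inches at 300DPI),
sizes round up, and the lookup is plain nearest-neighbor; a real viewer may
well interpolate instead.

```python
# Sketch of integer subsampling for the FG/BG layers.
# layer_size() and layer_pixel() are hypothetical helpers for illustration.

def layer_size(width, height, factor):
    """Size of a layer reduced by an integer factor, rounding up."""
    if not 1 <= factor <= 12:
        raise ValueError("factor must be in 1..12")
    return (-(-width // factor), -(-height // factor))

def layer_pixel(layer, x, y, factor):
    """Color for full-resolution pixel (x, y), nearest-neighbor lookup."""
    return layer[y // factor][x // factor]

full_w, full_h = 2550, 3300             # assumed 300DPI page
at_100dpi = layer_size(full_w, full_h, 3)    # (850, 1100)
at_25dpi = layer_size(full_w, full_h, 12)    # (213, 275)
```

A factor of 3 takes a 300DPI scan to 100DPI, and the maximum factor of 12
takes it to 25DPI, with the full-resolution mask still supplying the sharp
edges.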

Since the mask is there to tell the wavelet compressor exactly which pixels
are going to be displayed in the viewer, it can spend almost all of the
available space on the visible parts. The areas of a layer that are
masked out are still in the image, just encoded using the minimum possible bit
rate; if you look at a Rainbow table of contents page without the mask, it can
look pretty crazy -- wild colors strewn about apparently randomly, except
where you can see nice sharp letters and illustrations. With a perfect (that
is, hand-edited) mask, a block of blue text could be rendered in the
foreground color map as a solid blue rectangle. Obviously, this would also be
a huge space savings without actually sacrificing any quality at all -- a
megapixel in a few kilobytes.
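
Here's a toy version of the "solid blue rectangle" idea in Python. Real
encoders handle the hidden pixels inside the wavelet coder itself; the
mean-fill below is a much simpler stand-in for that, and fill_hidden() is a
hypothetical helper, not anything from an actual DjVu tool.

```python
# Sketch: pixels the viewer never shows can be set to whatever compresses best.
# Here they are simply replaced by the mean of the visible pixels.

def fill_hidden(layer, mask, visible_when):
    """Return a copy of layer with never-displayed pixels set to the visible mean."""
    visible = [px for row, m_row in zip(layer, mask)
               for px, m in zip(row, m_row) if m == visible_when]
    if not visible:
        return [row[:] for row in layer]
    mean = tuple(sum(channel) // len(visible) for channel in zip(*visible))
    return [[px if m == visible_when else mean for px, m in zip(row, m_row)]
            for row, m_row in zip(layer, mask)]

BLUE = (0, 0, 255)
PAPER = (240, 232, 208)

mask = [[1, 0, 1],
        [0, 1, 0]]
# FG as scanned: blue ink under the mask, paper everywhere else.
fg = [[BLUE, PAPER, BLUE],
      [PAPER, BLUE, PAPER]]

flat_fg = fill_hidden(fg, mask, visible_when=1)
```

After the fill, the FG layer is one flat blue block, which any image
compressor will squeeze down to almost nothing, and the displayed page is
unchanged because the mask hides every pixel that was altered.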

The background chunk, like a scanned blank sheet, contains what's left after
the page's detailed stuff has been removed. In automatically-processed
documents, this usually means only the texture of the page -- the watermarks,
the age discoloration, and so on. In paper documents that have been processed
by a human, it might be all that plus the large flood fills like those in the
Computer Plus ads that were in most issues of The Rainbow. Since these don't
need to be preserved in super-high resolution, they can be recorded with
resolution as low as 25DPI or with some pretty extreme compression. The files
I've created so far don't do this -- because they are automatically generated, the
only thing on the background is page texture. If we were to go forward with a
larger-scale project of mask editing (it's pretty easy), there might be good
reason to switch -- but a megapixel page-texture image is only about 5
kilobytes, so I haven't done that.

The technology is more common than you might think, although, since you're a
free software guy, its proprietary origins (not to mention its limited field)
might have kept it off your radar.

The tech was originally developed at AT&T Labs, and was later sold to a
company called LizardTech. LizardTech has since been
bought by what appears to be a Japanese GIS company called Celartem. I think
they wanted it because of its applicability to cartographic stuff.



