[Coco] Resolution, size and usability

Jeff Teunissen deek at d2dc.net
Fri May 22 12:26:42 EDT 2009


Dennis Bathory-Kitsz wrote:
> Hi all,
> 
> I've only been marginally paying attention to the PDF-vs-DJVU 
> discussion until I looked at my disk stats and saw that identical 
> stuff is being uploaded in competing formats.
> 
> In two weeks, CoCo archives have exploded from 9 GB to 45 GB -- more 
> space than all 40 domains there used in 11 years of operation:
>    <http://maltedmedia.com/images/du-apr-may-2009.jpg>
> 
> This is not a hosting complaint. It's an efficiency and usability 
> complaint. You all know I'm happy to host CoCo files, but I do 
> question an image-format experiment that duplicates large content 
> with more large content. Bill's work is much appreciated, but how 
> necessary is publication-quality, high-res, color rasterizing of 
> monochrome line art and text -- especially if it includes page-back 
> shadows? Some publications exceed 650MB.
> 
> This is an inefficient use of space, but more importantly, it's a 
> barrier to use. Aside from bots (which are blocked as quickly as I 
> find them), how many everyday CoCo enthusiasts are going to download 
> documents that large? Even some DJVU items exceed 100MB. (Yes, the 
> server does support Flashget-style multiple connections as well as 
> resume, but even so... dialup or capped bandwidth, anyone? That's why 
> this CoCo list has a 48KB message limit.)

Show me a 100MB DjVu document and I'll show you either a 2000-page book
or someone who doesn't know how to use the tools (and/or is using the
wrong tools). A 300-page full-color Rainbow at 300dpi (when compressed
properly) weighs in at 20-30 megs, and that's after OCR and bundling. A
single page runs between 40 and 180 kilobytes, full color.
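
(For anyone who wants to check the math: at 40-180 KB per page, a
300-page issue works out to roughly 12-54 MB, which brackets the 20-30
meg figure. The workflow is just "encode each page on its own, then
bundle". A rough sketch in Python, assuming DjVuLibre's c44 and djvm
are on the PATH and the scans already exist as 300dpi full-color PPM
files -- the filenames are made up:

#!/usr/bin/env python3
# Rough sketch: compress 300dpi full-color page scans with DjVuLibre's
# IW44 encoder (c44), bundle them with djvm, and report the sizes.
# Filenames are hypothetical.
import glob
import os
import subprocess

pages = sorted(glob.glob("page*.ppm"))      # assumed 300dpi color scans
djvu_pages = []

for ppm in pages:
    out = ppm.replace(".ppm", ".djvu")
    subprocess.run(["c44", "-dpi", "300", ppm, out], check=True)
    djvu_pages.append(out)
    print(f"{out}: {os.path.getsize(out) // 1024} KB")

# Bundle every page into one multi-page DjVu document.
subprocess.run(["djvm", "-c", "rainbow.djvu"] + djvu_pages, check=True)
print(f"bundled: {os.path.getsize('rainbow.djvu') / 1048576:.1f} MB")

Bitonal pages would go through cjb2 instead of c44, and the OCR layer
gets added afterward with something like ocrodjvu, but the wavelet
layer is where the size is decided.)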

Ideally, someone would put together a Web site on which the DjVu
documents would be presented as indirect files -- that is, each page
is compressed separately and index documents are auto-generated so
that the browser plugin can operate exactly as if the user had
downloaded the whole thing. It might be useful if the server could
bundle a document on the fly for download all at once, but that's not
that big a deal.
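
The conversion is already in DjVuLibre, for what it's worth -- if
memory serves, djvmcvt's indirect mode does exactly this split. A
rough sketch of the server-side step (file and directory names are
made up):

#!/usr/bin/env python3
# Rough sketch: convert a bundled DjVu document into "indirect" form,
# i.e. a small index file plus one file per page, so the browser
# plugin can fetch pages on demand. Paths are hypothetical.
import os
import subprocess

bundled = "rainbow-1984-07.djvu"     # the big single-file upload
outdir = "rainbow-1984-07"           # directory the web server exposes

os.makedirs(outdir, exist_ok=True)

# djvmcvt -i <doc_in.djvu> <dir_out> <index_name> writes the index
# document plus the individual page files into the output directory.
subprocess.run(["djvmcvt", "-i", bundled, outdir, "index.djvu"], check=True)

print("point the plugin at", os.path.join(outdir, "index.djvu"))

After that, each page sits on the server as its own small file, and
the plugin only pulls the pages the reader actually looks at.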

> Bill's done a lot of dedicated and much-appreciated work. For others 
> in the future, though, I recommend optimizing each page to a usable 
> size by using appropriate bit depth, color, brightness, contrast and 
> resolution settings -- before compiling the final publication.
> 
> I've been doing extensive archival scanning since the 3-pass days; I 
> still have (and use for 'art') my absurd 3-pass battleship-size 
> scanner from 1992. It had its own dedicated SCSI card running under 
> Windows 3.1 and took work to produce good documents.
> 
> On the other hand, modern scanner GUI control panels are excellent. 
> Though it takes a few extra seconds per page, the settings really 
> must be changed based on each page preview, producing a single PDF 
> for each page. For some pages, that means lower bit depth and 
> monochrome. For others, it means changing brightness and contrast so 
> the page itself is *white* (oh, yes!). Based on the results I've been 
> looking at, the document size could have been 1/4 to 1/3 its present 
> size simply by doing the latter -- avoiding the shadows of reverse 
> pages, which get rasterized into hundreds of megabytes of garbage 
> information. You can also tape on a black backing panel instead of a 
> white one to avoid this shadowing problem almost entirely from the start. 
> Once there are separate PDFs for each page, they can be compiled into 
> one (I use docPrint Pro).
> 
> Believe me, having scanned quite literally thousands of documents 
> (and worn out a dozen scanners in the process), I can tell you a 
> little extra work up front will produce a more legible, smaller, and 
> ultimately more usable document.
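
(As an aside, that last step doesn't need a commercial tool. A rough
sketch with the open-source pypdf library, assuming a reasonably
recent version and per-page files named page001.pdf, page002.pdf, and
so on:

#!/usr/bin/env python3
# Rough sketch: merge single-page PDFs into one document with pypdf.
# Filenames are hypothetical; install the library with "pip install pypdf".
import glob

from pypdf import PdfWriter

writer = PdfWriter()
for path in sorted(glob.glob("page*.pdf")):
    writer.append(path)              # append each single-page PDF in order

with open("publication.pdf", "wb") as fh:
    writer.write(fh)

That does the same stitching job on any platform Python runs on.)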

Except for choosing the white and black points and maybe the scanner's
gamma curve, all of that stuff is ALMOST completely useless these
days. A monochrome page of a given resolution compresses to pretty much
the same size as a full-color page, and often ends up larger than the
color scan because it can't be compressed as well (too much "redundant"
information has already been thrown away).

Throwing away page data by reducing bit depth is MUCH worse for the
content than lossy compression: the lossy compression in J2K (or even
JFIF) at least drops bits that the human brain doesn't readily notice,
while reducing bit depth loses bits across the whole tonal range at
arbitrary cutoff points.
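
This is easy to see for yourself. A rough sketch with the Pillow
imaging library -- the input filename is made up, and 4 bits per
channel stands in for whatever "lower bit depth" the scanner panel
offers:

#!/usr/bin/env python3
# Rough sketch: compare throwing away bit depth against plain lossy
# JPEG compression on the same scan. Requires Pillow; the input
# filename is hypothetical.
import os

from PIL import Image, ImageOps

scan = Image.open("page042.png").convert("RGB")   # a full-color page scan

# 1. Reduce bit depth: keep only 4 bits per channel (hard cutoffs
#    across the whole tonal range).
ImageOps.posterize(scan, 4).save("posterized.png")

# 2. Keep the full bit depth and let a lossy codec decide what to drop.
scan.save("lossy.jpg", quality=60)

for name in ("posterized.png", "lossy.jpg"):
    print(f"{name}: {os.path.getsize(name) // 1024} KB")

Compare the two outputs at 100% zoom: the posterized file typically
shows banding in every gradient and shadow, while the JPEG's losses
are much harder to point at.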

I've been doing this stuff for the better part of two decades myself,
and it took quite a while to learn just how wrongheaded it was to mess
with the bit depth. These days, encountering a publication that has
been bit-reduced means locating a new copy to rescan properly.


