[sebhc] archiving documents - an essay

Jack Rubin jack.rubin at ameritech.net
Wed Apr 21 22:14:18 CDT 2004


Thanks for today's lively discussion of archiving and document formats!

It's clear that we all share many of the same concerns - the biggest one
being the preservation and sharing of information about these old
systems. As Pat Swayne said earlier, truly the best way to preserve this
material, be it code or documentation, is to print it out with inert
inks on archival paper and store it in a controlled environment. Not
being the Library of Congress, I've tried to find the best balance
between the demands for access and accuracy.

I'll start with the easy stuff - long file names incorporating part and
revision numbers allow easy identification of docs in a swarm of similar
sounding titles, now and in the future. Sorry, you command-line folks,
but since the creators of dd, ls, rm and vi had the vision to allow us
really-long-and-descriptive-file.names, I will use them, always
following the conventions regarding legal characters. No spaces, not
even _underscores_.
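
As a rough illustration of the naming convention described above, here is a minimal Python sketch. The function name, the argument order, the "rev" tag and the sample part number are all hypothetical; the only rules taken from the text are that names are long and descriptive, carry part and revision numbers, and contain no spaces or underscores.

```python
import re

def archive_name(title, part_number, revision, ext="pdf"):
    """Build a long, descriptive file name with no spaces or underscores.

    The layout (title, part number, "rev", revision) is an assumption made
    for illustration, not a documented convention.
    """
    stem = f"{title}-{part_number}-rev-{revision}"
    # Replace every run of characters other than letters, digits and dots
    # (spaces, underscores, punctuation) with a single hyphen.
    stem = re.sub(r"[^A-Za-z0-9.]+", "-", stem).strip("-")
    return f"{stem}.{ext}"

# "000-0000" is a made-up part number used only as a placeholder.
print(archive_name("Sample Operation Manual", "000-0000", "A"))
# prints: Sample-Operation-Manual-000-0000-rev-A.pdf
```

Hyphens are used as the separator here because the example name in the paragraph above uses them; any other legal character would work the same way.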

As for file formats, if I have material in electronic text form, that's
the way I'll present it, assuming it is easily accessible (clearly a
value judgement). I'd argue that Lee's TMSI documents in Magic Wand
files are less useful than the same material in MS Word format. 

If I only have hard copy to scan, the .pdf format is clearly a common
and accessible format. I will admit that I've not always been too happy
with "progress" in our electronic world, but I like the current
iteration of Adobe Acrobat a lot. I can scan directly to a finished
document without any intermediate steps, and it knows about my scanner,
the attached automatic document feeder and double-sided printing. The
real clincher, as Patrick pointed out, is that the output is a single
self-contained file. No assembly needed, batteries included. 

Normally, I scan at 1 bit per pixel (two-level "line art") and 600 dpi,
but I thought I would re-examine my settings in light of today's
discussion.

I did a few quick tests using Adobe Acrobat 6 - I scanned three sample
pages from the PAM-37 manual at different compatibility levels (i.e.
version 6 only, versions 5 and 6, and versions 4, 5 and 6) and different
resolutions (300, 450, and 600 dpi). Here are the sizes of the output
files:

file with mixed text and line art -

version 4,5,6 @ 300 dpi - 35K
version 4,5,6 @ 450 dpi - 52K
version 4,5,6 @ 600 dpi - 69K
version 5,6 @ 600 dpi - 33K
version 6 @ 300 dpi - 17K
version 6 @ 600 dpi - 34K

file with text only ("average" page coverage) -

version 6 @ 300 dpi - 13K
version 5,6 @ 450 dpi - 16K
version 6 @ 450 dpi - 16K
version 5,6 @ 600 dpi - 23K
version 6 @ 600 dpi - 23K

file with text and line art with shading (not halftone) -

version 6 @ 300 dpi - 28K
version 5,6 @ 450 dpi - 70K
version 6 @ 450 dpi - 71K
version 5,6 @ 600 dpi - 100K
version 6 @ 600 dpi - 98K

The first thing that stands out is that version 4 compatibility is
unjustified in terms of file size overhead. It's just not worth doubling
file size to support a version two generations old, and using Acrobat 4.0
is not an available option. There is almost no penalty for adding
version 5 support to a version 6 scan. Beyond that, it looks like
Acrobat's file size scales roughly linearly with resolution, and it is
relatively smart about page composition and file compression.
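
For anyone who wants to check the arithmetic behind those conclusions, here is a small Python sketch that works out the ratios from the mixed text and line art figures listed above. The variable and function names are my own; the sizes are copied straight from the list.

```python
# File sizes in KB for the mixed text and line art page, copied from the
# figures above. Keys are (compatibility levels, dpi).
mixed_kb = {
    ("4,5,6", 300): 35,
    ("4,5,6", 450): 52,
    ("4,5,6", 600): 69,
    ("5,6", 600): 33,
    ("6", 300): 17,
    ("6", 600): 34,
}

def size_ratio(sizes, numerator, denominator):
    """Return how many times larger one scan is than another."""
    return sizes[numerator] / sizes[denominator]

# Cost of keeping version 4 compatibility, at the two resolutions where
# both a 4,5,6 scan and a version 6 scan were measured.
v4_cost_300 = size_ratio(mixed_kb, ("4,5,6", 300), ("6", 300))
v4_cost_600 = size_ratio(mixed_kb, ("4,5,6", 600), ("6", 600))
# Cost of adding version 5 support to a version 6 scan.
v5_cost_600 = size_ratio(mixed_kb, ("5,6", 600), ("6", 600))
# Growth in file size when the resolution doubles from 300 to 600 dpi.
dpi_growth = size_ratio(mixed_kb, ("6", 600), ("6", 300))

print(f"version 4 cost at 300 dpi: {v4_cost_300:.2f}x")  # 2.06x
print(f"version 4 cost at 600 dpi: {v4_cost_600:.2f}x")  # 2.03x
print(f"version 5 cost at 600 dpi: {v5_cost_600:.2f}x")  # 0.97x
print(f"300 -> 600 dpi growth:     {dpi_growth:.2f}x")   # 2.00x
```

The same 300 to 600 dpi comparison on the other two samples gives about 1.8x for the text-only page (13K to 23K) and 3.5x for the shaded page (28K to 98K), so "linear" is only a rough description.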

Regardless of file size, output quality is my first concern - artifacts,
dropout and rasterization are clearly visible in 300 dpi laser prints,
and I find that unacceptable for archival purposes. That certainly doesn't
mean it is useless - I think Dave's concept of draft-quality
scanning/printing for working documents and low-bandwidth situations is
on target (and I've tried to address his other issues below), but again
my concern is for developing an archive that is as true to the original
as possible. To my eyes (a bit old and bleary at this point), 600 dpi
scans printed on a laser printer (HP1100 or HP2200) are much better than
300 dpi, slightly better than 450 dpi and nearly indistinguishable from
the original. I only want to scan my documents once, and often I'm
scanning material that is on loan to me and not otherwise accessible. I
don't want to regret lack of information later on when the original is
no longer available.

In sum, my decision is to make a best guess about what will be useful in
the future and get on with the work at hand. My goal in setting up this
site echoes the goals of Lenny Geisler, Hank Lotz, Kirk Thompson,
Charlie Floto, Henry Fale and other "staunch 8/89ers" - support of 8-bit
Heathkit computers, computing and computerists now and in the future.

That means if you've got something to share, I want to help you share
it. If I can provide something you need, I want you to have it. I've
outlined the conventions I plan to follow and my rationale for the
decisions I've made. If what I've outlined makes sense to you, great. If
not, your contributions will be warmly welcomed in whatever format
works best for you. This really is about developing a shared information
resource.

I will also re-iterate Patrick's offer - if it is impossible or
impractical for you to access material in the archives, I will be glad
to provide whatever I can in an alternative format, electronic or
otherwise. Once things stabilize, I think issuing one or more CDs with
related manual sets, software, etc. makes a lot of sense. 

And again, thank you all very much for your participation and
involvement!

Jack

--
Delivered by the SEBHC Mailing List


