[sebhc] new postings to SEBHC archives
Patrick
patrick at vintagecomputermarketplace.com
Wed Apr 21 17:15:08 CDT 2004
> Aren't there forms of PDF that use plain text inside? Some PDF files are
> gigantic, and others of the same page are hardly any bigger than plain
> text.
Lee, yes, definitely. PDF has its roots in PostScript. The document itself
is represented as a tree of objects that can be traversed to build the
representation of a page. Properly laid out, the tree makes it possible for
you to view parts of the document without downloading it in its entirety,
which is what the Acrobat plug-in for your browser attempts to do (how well
it works depends partly on the web server serving the document).
The tree objects themselves are, broadly speaking, either graphical
elements such as lines or images, or text blocks. Typically, when you use
Acrobat Distiller or the PDF print driver, the text in the document you
are printing becomes text blocks in the resulting PDF. So when you make a
PDF from Word, for example, the text areas in the Word document can simply
be converted to PostScript/PDF text blocks, which is efficient. When
you scan pages to PDF, however, each page is stored as an image.
This accounts for most of the size differences. But even within that, as
Dave pointed out, some people will unwittingly leave their scanner set for
300 dpi 24-bit color, so the resulting scan of a single page is 15 MB. The
same page in 300 dpi 8-bit grayscale may be just 1 MB or less, and in Dave's
preferred 1-bit monochrome just a few hundred KB. When I started scanning
stuff, I made the same mistake--I just didn't know better. I've been
rescanning those documents or using Photoshop to pare them down since.
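As a rough check on how bit depth drives those sizes, here is a minimal sketch that computes the uncompressed size of one scanned page. It assumes a US Letter page (8.5 x 11 inches), which is not stated above, and the function name is just for illustration. The figures quoted above are smaller than these, presumably because the stored scans are compressed.

```python
# Uncompressed size of one scanned page at a given resolution and bit depth.
# The US Letter page size (8.5 x 11 in) is an assumption for illustration.

def raw_scan_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    modes = [("24-bit color", 24), ("8-bit grayscale", 8), ("1-bit monochrome", 1)]
    for label, bits in modes:
        size_mb = raw_scan_bytes(300, bits) / 1_000_000
        print(f"300 dpi {label}: {size_mb:.1f} MB uncompressed")
```

Going from 24-bit color to 1-bit monochrome cuts the raw data by a factor of 24 before any compression is applied.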
If I remember correctly, the preferred internal representation in a PDF is
JPEG, so there are also quality settings that can affect the result/size.
This is also the reason that PDFs of scanned documents don't compress
well--JPEG is already compressed. I believe you can also have TIFF or raw
bitmap in a PDF file, but I'm not sure how you choose this (I've never had
to; I think JPEG is fine). Acrobat Distiller will also downsample to
meet the target "device", and I think the PDF print driver does as well
(another elusive/unobvious setting). My version of Distiller has output
options for Screen, Print, Press, and eBook. Screen, for example,
downsamples the graphics to 72 dpi, which is tragic for printing. Press probably
assumes you're going to a 2540 dpi Linotronic, so you're likely to see no
downsampling at all. Anyway, there are a lot of settings, and not all of
them are in obvious places or have obvious uses, so I think it takes some
getting used to. I know I'm still learning.
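To put the 72 dpi Screen setting in perspective, here is a minimal sketch of what downsampling does to the pixel dimensions of a scan. The 2550 x 3300 source size (a Letter page at 300 dpi) and the function name are assumptions for illustration; this is not how Distiller itself is implemented.

```python
# Pixel dimensions of an image after downsampling to a lower resolution.
# The source dimensions used below are an assumed 300 dpi Letter-size scan.

def downsampled_size(width_px, height_px, source_dpi, target_dpi):
    """Return (width, height) in pixels after downsampling to target_dpi."""
    if target_dpi >= source_dpi:
        # Nothing to throw away: the image is left at its original size.
        return width_px, height_px
    new_width = round(width_px * target_dpi / source_dpi)
    new_height = round(height_px * target_dpi / source_dpi)
    return new_width, new_height


if __name__ == "__main__":
    w, h = downsampled_size(2550, 3300, source_dpi=300, target_dpi=72)
    kept = (w * h) / (2550 * 3300)
    print(f"72 dpi: {w} x {h} pixels, {kept:.1%} of the original pixels kept")
```

Only about one pixel in seventeen survives, which is why the result looks acceptable on screen but poor on paper.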
> Yes, Jack; my thanks, too. I am not complaining about your effort; only
> pointing out that in its present form, I can't use it.
Well, I just tried the Ghostscript conversion from PDF to JPEG (output one
file per page), and it seems to work. If you let me know what documents you
are interested in, I'll convert them and hand them off to Jack for posting.
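A conversion like this can be sketched as follows. The exact options used are not given above, so the resolution, JPEG quality, output pattern, and the file name manual.pdf are all assumptions; the sketch only builds the Ghostscript command line and does not run it.

```python
# Build a Ghostscript command that renders each PDF page to its own JPEG.
# Resolution, quality, and file names are assumed values for illustration.

def gs_pdf_to_jpeg_cmd(pdf_path, out_pattern="page-%03d.jpg", dpi=150, quality=85):
    """Return the argument list for a one-JPEG-per-page conversion."""
    return [
        "gs",
        "-dNOPAUSE",                      # do not pause between pages
        "-dBATCH",                        # exit when the input is processed
        "-dSAFER",                        # restrict file access
        "-sDEVICE=jpeg",                  # render pages as JPEG images
        f"-r{dpi}",                       # rendering resolution in dpi
        f"-dJPEGQ={quality}",             # JPEG quality, 0 to 100
        f"-sOutputFile={out_pattern}",    # %03d expands to the page number
        pdf_path,
    ]


if __name__ == "__main__":
    cmd = gs_pdf_to_jpeg_cmd("manual.pdf")  # hypothetical input file
    print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would perform the conversion, assuming the gs executable is on the PATH.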
Patrick
--
Delivered by the SEBHC Mailing List