[sebhc] new postings to SEBHC archives
Lee Hart
leeahart at earthlink.net
Wed Apr 21 18:08:30 CDT 2004
Patrick wrote:
> Lee, for schematics and other mostly- or purely-graphical documents
> of limited size, I agree that GIF or JPEG is a better choice than
> PDF, and I prefer to keep them this way myself.
>
> But I disagree on the text documents. I think few of us have the
> time to lovingly OCR most of these documents
I agree; it takes more time to do a better job. The person posting it
may not have the time. He can spend 10 seconds a page, but not 10
minutes.
> among the OCR'd documents I've been given, experience has taught me
> to mistrust them. I tried to OCR a couple of North Star manuals...
> The greatest difficulty was not the gross errors, but the subtle
> ones that are sprinkled throughout the document... turning .1uf
> capacitors into 1uf capacitors, 74LS257 into 74LS251, etc.
Again, I agree. You can't just OCR a document and use it without
proofreading. OCR software makes too many errors.
However, the error factor is precisely why I don't think scanned
documents, or scanned+OCR'd documents, are a good idea for critical data.
Your North Star schematic and MTR-90 listing are perfect examples. Errors
in such documents critically damage the information. The reader can no
longer tell if that was a 74LS257 or a 74LS251; or whether that opcode
was mov a,l or mov a,1.
Ideally, the MTR-90 listing would be scanned, OCR'ed, proofread, and
posted in .TXT form. If errors are discovered, they can be readily
corrected, and the updated file can replace the original. If it's a
schematic, the fuzzy text in the GIF file can be edited back into text.
But a multimegabyte PDF file is not editable, except by those few
individuals who have the right software and access to the original data
as a 'master' to know what needs to be changed. As a practical matter,
no one is likely to ever correct errors in these PDF files.
> You're asking a lot, I think. It takes a lot of work to reproduce
> documents in this form. Have you tried?
Yes, I have. I converted Tom Pittman's Tiny BASIC books and articles
into HTML and TXT files. You can see them at
http://www.IttyBittyComputers.com/index.htm
Proofreading isn't that bad if you have good software that does most of
the work. The OCR program that Tom used got the text 99.9+% correct (I
don't know what software he used). The text editor I used (Vedit)
spellchecked the resulting file, marking all opcodes, part numbers, etc.
as "misspelled" (which really meant those were the ones I should check
against the actual source document). So, proofreading was mainly a
matter of reading it on-screen in the text editor, making corrections
immediately as they were revealed.
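
The spellcheck trick described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not what
Vedit actually does: the word list, the sample text, and the function
name tokens_to_verify are all hypothetical. Anything not found in the
word list (opcodes, part numbers, component values) is reported as a
token to check against the paper original.

```python
# Hypothetical, tiny word list; a real run would load a full dictionary.
KNOWN_WORDS = {"a", "capacitor", "bypasses", "the", "then", "copies", "byte"}


def tokens_to_verify(text, known_words=KNOWN_WORDS):
    """Return (line_number, token) pairs that are not dictionary words."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for raw in line.split():
            # Trim only sentence punctuation. A leading "." is kept on
            # purpose, so ".1uf" is not silently turned into "1uf".
            token = raw.lstrip("(\"'").rstrip(".,;:!?)\"'")
            if token and token.lower() not in known_words:
                flagged.append((line_number, token))
    return flagged


# Made-up sample text, standing in for one page of OCR output.
sample = "A .1uf capacitor bypasses the 74LS257.\nThen mov a,l copies the byte."

for line_number, token in tokens_to_verify(sample):
    print(f"line {line_number}: check {token!r} against the original")
```

The point is that ordinary words drop out, so the proofreader's attention
goes to exactly the tokens where a wrong digit or letter would do damage.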
I am converting Pittman's "The First Book of Tiny BASIC" to HTML right
now. The original is proportionally spaced, with a very bad font choice
("0" and "O" look identical to his OCR program). What makes it tough is
that it is full of BASIC programs, which I am thoroughly checking to
make *sure* they are correct and actually run. I'm using HTML so it
maintains the basic appearance of the original, yet you can pull out the
BASIC programs and actually run them; no need to type them in!
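
As a rough sketch of "pulling out" the programs, the Python below
collects the text of every <pre> block in an HTML page. It assumes the
listings are marked up with <pre>, which may not match how the actual
pages are built; the sample page and the names ListingExtractor and
extract_listings are made up for the example.

```python
from html.parser import HTMLParser


class ListingExtractor(HTMLParser):
    """Collect the text found inside each <pre> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.listings = []   # one string per <pre> block
        self._depth = 0      # > 0 while inside a <pre>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1
            if self._depth == 1:
                self._chunks = []

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.listings.append("".join(self._chunks))

    def handle_data(self, data):
        # Entities such as &lt; arrive here already decoded.
        if self._depth > 0:
            self._chunks.append(data)


def extract_listings(html_text):
    """Return the contents of all <pre> blocks, in document order."""
    parser = ListingExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.listings


# Made-up page fragment with one BASIC listing.
page = """<p>Try this:</p>
<pre>10 PRINT "HELLO"
20 IF A&lt;B THEN GOTO 10
</pre>"""

for listing in extract_listings(page):
    print(listing)
```

Each extracted string can be saved as a .TXT file and loaded straight
into an interpreter, with "&lt;" and similar entities already turned
back into the characters the program needs.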
> What you end up with may be a good result, but it's also a complex
> form in that it's comprised of one or more HTML files and a number
> of separate graphics. You can ZIP it into a single file, of course,
> but it's still a BOM.
BOM?
I agree that having the text in one file, and the illustrations as
separate files is awkward. So far, I haven't had to deal with that
problem. All the illustrations in the Tiny BASIC books are composed with
plain text.
Aren't there forms of PDF that store plain text inside? Some PDF files
are gigantic, while others with the same number of pages are hardly any
bigger than plain text.
> And, THANK YOU, Jack.
Yes, Jack; my thanks, too. I am not complaining about your effort; only
pointing out that in its present form, I can't use it.
--
"Never doubt that a small group of committed people can change the
world. Indeed, it's the only thing that ever has!" -- Margaret Mead
--
Lee A. Hart 814 8th Ave N Sartell MN 56377 leeahart_at_earthlink.net
--
Delivered by the SEBHC Mailing List