NEITHER Clara nor Vernon appeared at the mid-day table. Pr. Middleton talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an une(lified audience might really suppose, upon - seeing her over the difficulty, she had done something for herself. Sir Willoughby was proud of her, and therefore anxious to settle hier l)uSifleSS while he was in the humour to lose her. lie hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from hirn~, had vaguely frightened even more than it offended his pri(le.
* * * * *
Scan 4—A Really Bad Case!
Scan 4 is a paragraph from Pope's translation of Homer's "Odyssey". This is a very, very tough one. It was obviously a cheap printing to begin with, using thin, poor-quality paper in a page size of 6" by 4.5", with capital letters about 1.5 mm high, a little bigger than Times New Roman size 8. Text this small really needs a higher-resolution scan. The book was falling apart when I got it, the ink was fading and flaking, and there was no point in even thinking about trying to scan it flat, so I cut the pages. To add an extra challenge, I scanned the sample with the cover open in a medium-lit room for the 300 and 400dpi scans, but closed the cover for the 600dpi to show the best quality I could possibly get. (I was pleased to note that Abbyy, while recognizing the page in the 300dpi and 400dpi images, flashed up a suggestion that I should lower the brightness of the scan.)
This particular book was one I sporadically tried to produce, without success, on an older scanner and a bundled OCR program over a period of two years, back in 98/99. Eventually, in 2000, it was the first book processed through Charles Franks' Distributed Proofreaders site. The initial text produced by the OCR was very poor, but the human volunteers made up for it! Thanks, guys! Today, just two years later, with a better scanner and better OCR, I could have done it myself, as you will see from the best of the results of the 600dpi scans. That's how much things have improved recently.
A separate point to note here is that you can see the "three-quarter space" effect before the exclamation mark and semi-colon that was discussed in [V.104].
The results of the OCR are:
Abbyy FineReader 6:
" Ah me ! on what inhospitable coast,On Tvh.it new region is Ulysses toss'd ;Possess'd by wild barbarians fierce in arms ;Or men. whose bosom tender pity warms ?What sounds are these that gather from the shores ?The voice of nymphs that haunt the sylvan bowers,The fair-hair'd Pryads of the shady wood ;Or azure daughters of the silver flood ;Or human voir-e? but issuing1 from the shades,AVhv cease I straight to learn what sound invades?"
" Ah me ! on what inhospitable coast,On what new region is Ulysses toss'd ;Possess'd by wild barbarians fierce in arms ;Or men, whose bosom tender pity warms '?"What sounds are these that gather from the shores ?The voice of nymphs that haunt the sylvan bowers,The fair-hair'd Dryads of the shady wood ;Or azure daughters of the silver flood ;Or human voice? but issuing from the shades,Why cease I straight to learn what sound invades?"
" Ah me ! on what inhospitable coast,On what new region is Ulysses toss'd ;Possess'd by wild barbarians fierce in arms ;Or men, whose bosom tender pity warms ?"What sounds are these that gather from the shores ?The voice of nymphs that haunt the sylvan bowers,The fair-hair'd*Dryads of the slrady wood ;Or azure daughters of the silver flood ;Or human voice? but issuing from the shades,Why cease I straight to learn what sound invades?"
gocr 0.3.6:
[The 300 and 400 dpi scans produced nothing recognizable. The result of the 600 dpi scan is below.]
'' _hh i_3e ! o_1 ___l_at_ i__l__sl__ it_nble CoaSt_ On ___l_,__ _)e_v i_e_io__ i__ ___ _._____ses toss'd ; _(3s3gs3_d l3.__ ___iiíi l3_3__b___i_c_i3_ fie_Ce in il__S- _ Or i11pn, __-i)c3se l_osonl te_1de_ _it____ __ai_n3__ ? ___l_at __o__i1ds Qre tlipse tliat g__tl_p_r fE_oi33 the shoTes ? '_ilie __oi__e of i)____ E1)l3l3s tl3nT 1i_n__nt the s__l__inn bo_Ye_5_ 3'l_e fni___i____ir'd _____-ads of' il_e sli__d__ i___oOd _ Op az(_pe da_____litc__s of _tlie sil __?r t1ood ; Or l___i31_nn ___)i___? l3__t i3____ii_6 fi_oi11 tlie __hiade__ _ __'!3.__ _ea___e _ s_rai__li.t to l_ar_i1- i_—li__t so_nd- in__ad_S___''
Recognita Standard 3.2.7AK:
.: lh nt"'. on w-hat inlu,;y:t, I,:e co;;~t, On ~cli^t ne~- re~ion i.. 1= 1-.-:.:e~ tm:'d ; Possea'd 1n- wil~l L;,rba~:c, .~ fierce in arm~ ; Or u.~u. w-Ln.e bossum tender pit~- warna'? ~l-u:lt .<,:~;;::;3s are tll~ce that ~atl:er from the shnre~ ? 'I'l.e -;;o'.re :,; nwtthil: tW ,t l:aa;nt the s~-l:c 1llJOR'er5, 'lhe :a,:~-h ~;r'd~It.wa~i~ ot' tl:e ~Il;;dv vood; Or az.lre dau~~l.ts~: oY tl:c ·:iv-~~r floo;:3 ; C?r humnn ~-<:i: e'? l,~:tt i~~; from tl:c· ~had~~, 11-lts- cea~e I ctrai rlit to learn ~s-l:, t socud incades %"
" ~h me ! ou "-Mat iuMospita~le coast,On ~i-lmt ne~c reyion is L 1~-~ses to~s'd ;Pos:e;s'd 1"~ w-iMl lrvrbaria:ns fiet~ce in arms ;Or m~ n, "-hose hosom tender pit~- warm5 ?~~~hat ~ounds are tlmse tMat ~;atMer from t:he shores ?~t'I~e ~-oi~~e of n~-Inhhs t.hat liaunt the s~-l~~a n howers.Tlie fair-hnir'd D~ vads ot tl:e shad~- "-ood ;Or aznre dau~liters of tMe sil~-~r fiood ;Or lmman ~-oi:~e'? but iauin~ frotn the shades, alVly cea.~e I straibht to learn "-Mat souud in~ad°s?"
" Ah me ! on what inhospitable coastOn ~~-hat new r e~ion is L;1 ~-sses toss'd ~,Possess'd 1J~- "-ilil I:OII'uai'la ils fierce in arms_ ·Or men, whose hosom tender pit~l ~varn~s ?~'G'l~at somnds are these tliat ~atl~er from the shores ?~I'Iie v oice of n~-mpl~S that ~munt the sy Ivan bowers,Tlie fair -hair'd D~~~-ads of tl~e slmdy wood ;Or azure daylltcrs of tlle silver flood ;Or lm:nan voice? uut issL~ing from the shades,~~'lm cea~e I strai~ht to Iearn ~~-lmt so~nd inv ades ?"
OmniPage Pro 10:
,.lh in- ' on "-hat inh-slit al.:e coast,On "M.^t new reion is 1=1;-a:e~ to-s'd ;P"::e:~'d hw "ild Larba.:an~ fierce in arms ;Or inn. "-hnse bo.,om tender pity warmsWhat
'Wh me ! on what inhospitable coast,On what new region is L fusses toss'd ;Possess'd br wild barbaric ns fierce in arms ;Or men, whose bosom tender pith- warmsAN-hat sounds are these that gather from the shores ?The voice of nymphs that Haunt the sylvan bowers,The fair-hair'd IWvads of the shady -wood ;Or azure daughters of the silver flood ;Or human voice? bat iauina from the shades,Why cease I straight to learn what sound invades?"
" Ah me! on what inhospitable coast,On what new region is Ll ysses toss'd ;Possess'd bv -wild barbarians fierce in arms ;Or men, whose bosom tender pity warnis ?AVlia± sounds are these that gatller from the shoresThe voice of nYI11pliS that haunt the -sylvan bowers,The fair -hair'd D.-yads of the shady wood ;Or azure daughters of the silver flood ;Or human voice? lout issuing from the shades,Why cease I straight to learn what sound invades?"
OmniPage Pro 11:
.` lh in-' on what inhospital,le co-st,On xclznt near region is t 1:-sse~ toss'(: ;Possess'd bY Mild barbarians fierce in aims ;Or inn. whose boson tender pity warmsWhat
''' :Ah me ! on what inhospitable coast,On iyhat new region is Ulysses toss'd ;Possess'd br wild barbarimis fierce in arms ;Or men, whose bosom tender pity warmsAN-hat sounds are tliese that gather from the shores ?The voice of nymphs that haunt the sylvan bowers,The fair-hair'd D~ yads of the shady -wood;Or azure dau.L-hters of the silver flood ;Or human voice? but issuing from the shades,Why cease I straight to learn what sound invades?"
" Ah me! on what inhospitable coast,On what new region is Ulysses toss'd ;Possess'd by -wild barbarians fierce in arms ;Or n1en, whose bosom tender pity warnis ?AVliat sounds are these that gather from the shoresThe voice of nyniplis that haunt the sylvan bowers,The fair-hair'd Dryads of the shady Wood ;Or azure daughters of the silver flood ;Or human voice? but issuing from the shades,Why cease I straight to learn what sound invades?"
TextBridge Millennium Pro:
no on what inhe~ptaEie coast,
On what new realun is hivs,e' to5sd
,s~s Ä-~d liv wild lie il)~m.ihI fir see in al-rn~
Or u~,-n. w'linse bo,uuiu tender pity warnls
Wl at ~ are t1ie~e that ~atler from the shores ?
'n.e a oro of imvntpirs tint he~nt the sad van bowers,
'flie tah'-ha~r'd D~vahs ct the shady wood
1)1' az Ire dauul~t ~ of tl,e shvr flood
Or liunian vi i 'I ? h'tt is- eng from the shades,
\VIiv cea-~e I straight to learn w hat sound invades 1"
Ah me on what inhospitable coast,On what new region is U vases toss'dPossess'd by wild barbarians fierce in armsOr men, whose bosom tender pity warms ~What sounds are these that gather from the shores?The voi'e of nymphs that haunt the sylvan bowers,The fair-baird Prvads of tl~e shady woodOr azure daughters of the silver floodOr human vuiae? but issuing fi'om the shades,Why cease I straigl~t to learn what sound invades?"
Ah me on what inhospitable coast,On what new region is Ulysses toss'dPossess'd by wild barbarians fierce in armsOr men, whose bosom tender pity warms?What sounds are these that gather from the shores?rfhe voice of nymphs that haunt the sylvan bowers,The fair-hair'd Dtyads of the shady wood;Or azure daughters of 'the silver floodOr human voice? but issuing from the shades,Why cease I straigl~t to learn what sOund invades?"

What can we conclude from this?

Small mistakes in scanning, like letting too much light in, getting
your scanner settings wrong for the page, or not pressing the paper
flat enough, can make a major difference to the final quality of the
text that you will have to correct.

Sometimes, no matter what you do with your scanner, problems with the
paper or the print will make it difficult for your OCR package to give
good output.

Generally, bigger is better within the range 300dpi-600dpi, but you
only need higher resolution with more difficult material.

Different OCR packages will produce widely differing texts from the
same images. Given a really good image, most OCR software will work
acceptably, but when you have lower quality material to work with, the
gap between OCR packages shows clearly.

S.18. I got an OCR package bundled with my scanner. Is it good enough
to use?

That depends on how well your package performs on the actual scans
that you do, and how much you value your time vs. money. Most scanners
are bundled with OCR software, but these OCR packages are often older
or "brain-damaged" versions, with their functionality deliberately
lowered. It's unlikely that you'll get a current-version,
top-of-the-line OCR package thrown in for free.

You may have to pay extra for better OCR, but it means that you spend
less time making corrections. The question is how much better you want
your OCR to be.

Save the images from the FAQ "Why am I getting a lot of mistakes in my
OCRed text?" [S.17] and try processing them with the OCR you have.
Compare the quality of the text produced with the quality of the
samples. This should give you some idea of how your OCR compares to
others.

Try a few pages from your book with your OCR. How many mistakes do you
see on each page? Do you find that acceptable?

S.19. I want to include some images with a HTML version. How should I
scan them?

We don't often see color prints in our books, but if you do have one,
then scan it in color. Otherwise, try both greyscale and B&W, and see
which gives you the best image.

It's usually better to scan images in a higher resolution than you're
going to use, and then use an image manipulation package to reduce
them [H.10] to a size appropriate for your HTML file. An initial scan
at 600dpi is often good. Image manipulation programs will also allow
you to "clean up" the pictures, by increasing contrast, despeckling,
or other filtering.

S.20. I want to include some images with a HTML version. What type of
image should I use?

GIF, JPEG and PNG images are supported by current browsers, and you
should stick with those unless you have a specific reason not to.

GIF and PNG tend to be more efficient (they provide better quality at a
given file size) for simple line-drawings; JPEG is usually better for
photographic images.

S.21. Will PG store scanned page images of my book?

No. Or, at least, not yet.

The idea has been kicked around a bit. There's no question of
replacing etexts with page images, but many volunteers who have
already scanned the book anyway like the idea of saving page images as
well—for general information, and as a means of checking future
correction suggestions against the original. Some volunteers already
keep their page images, stored for possible future use.

Working some back-of-the-napkin figures: a page of text might take up
1KB of space on a computer as plain text or HTML or XML. The same page
might take 70KB if stored as a black-and-white image, of just enough
quality to serve as a reliable guide to making corrections. Pages with
pictures, or stored with enough resolution to allow some future
researcher to write a paper on the changing shape of serifs in the
18th and 19th centuries, would start at around 350KB per page, and go
up from there.

A 300 page book thus becomes

  about 300KB as plain text (and around 150KB zipped)
  about 20,000KB as minimal-quality images
  about 100,000KB as high-quality images

and with the images, we won't save much space on the zipping, because
they're already compressed.

On a normal "56K" modem, getting about 4KB / second, it would take:

  75 seconds to download the text file (40 for the Zip)
  80 minutes to download the minimal images
  over 5 hours to download the high-res images.

Someday, the disk and bandwidth capacities that we will take for
granted will be such that uploading images, when we have them, will be
quite natural, just for the few people who will want them. But we're
not quite there yet.

Late flash! As of late 2002, the Internet Archive is providing space
to volunteers for storing page images. To see the images, and find
out more, go to

H.1. Can I submit a HTML version of my text?

Yes.

H.2. Why should I make a HTML version?

Well, you can make one just because you want to, but on some texts
there is special reason to.

If you want to preserve the pictures that accompany the text, making a
HTML version means that you can specify where and how those images
appear.

If there is particular meaningful information in the layout of the
text that can't be expressed in ASCII, like special characters or
complex tables or fonts, HTML may offer an open format alternative.

H.3. Can I submit a HTML version without a plain ASCII version?

You can submit it, but the Posting Team will then consider whether
we should also make an ASCII, or perhaps ISO-8859 or Unicode version
of it. We really do want our texts to be viewable by everybody, under
all circumstances, and we do not want to start posting texts that
are in any way inaccessible to anyone.

See also the FAQ [G.17] "Why is PG so set on using Plain Vanilla
ASCII?"

H.4. What are the PG rules for HTML texts?

1. The only absolute rule is that the HTML should be valid according
to one of the W3C HTML standards.

You can verify that your HTML is valid with the W3C's HTML Validator.

For a more convenient and friendly, though less official, check of the
correctness of your HTML, you should use Dave Raggett's Tidy program.

After that, we have some requirements and recommendations. Compliance
with the requirements might be waived if there is a really good reason
to make an exception in a particular case.

2. Requirement: File names and extensions

If you want your text to work within 8.3 filename conventions, you may
use .htm as the extension for your HTML files; otherwise, use .html as
the extension. If you are working to 8.3 conventions, all of your
images as well as your HTML files should have 8.3-compliant filenames.

All file names and extensions should be in lower-case throughout. Yes,
we know this is not strictly necessary, but we don't want to have to
correct every file that comes with "image.gif" referenced in the HTML
accompanied by a file IMAGE.GIF.

3. Requirement: HTML and plain-text

Project Gutenberg does publish well-formatted, standards compliant
HTML. However, we insist that a plain text version be available for
all HTML documents we publish (even if images or formatting are
absent), except when ASCII can't reasonably be used at all, for
example with Arabic, or mathematical texts.

4. Requirement: Archive format for posting

If the HTML book contains more than one file (including images), create
a ZIP (preferable) or TAR archive containing all of the files in the
book. The ZIP file may, if you wish, unzip to a subdirectory named for
the book. For example, a book called 'The Humour of Mark Twain' might
unzip in a directory called 'mthumor'. Make sure directory names
contain only alphabetic and numeric characters, no spaces, and are 8
characters or less, even if you're not sticking to 8.3 conventions for
filenames.

5. Recommendation: Simplicity

Make your HTML as simple as possible. HTML is an evolving standard,
and one that may be completely obsolete in the long term. Use of
advanced features may just mean that your version will be obsolete or
unreadable that much faster.

6. Recommendation: Images

Images included with your HTML should be in a format that Web browsers
can read: GIF, JPEG or PNG. Images should be edited for high quality
in a reasonably small file size. Make the best decision you can
concerning the image size and placement in the text. Every image
included must be linked into (referenced by) the HTML.

7. Recommendation: Line lengths

If it is reasonable to do so, try to wrap paragraphs of text at around
the normal PG margin of 70 characters. Ideally, your HTML should be as
near as possible identical to your text version except for the HTML
tags and entities. People who open your HTML won't all be using
browsers, people will need to make corrections, not all editors can
handle very long lines, and even with editors that can handle long
lines, it's easier to work with short lines.

Apart from these rules and recommendations, we also have a rule about
the PG header, but that will normally be handled by the Posting
Team. Where your HTML is all in one file, the header text will be
inserted within PRE tags in that file. Where the HTML is split into
multiple pages, the header will be put into a separate file named
index.htm or index.html, and will link to the first page of your HTML.

H.5. Can I use Javascript or other scripting languages in my HTML?

No. We don't want our readers to have to worry about any potential for
malicious or just plain buggy code.

H.6. Should I make my HTML edition all on one page, or split it into
multiple linked pages?

For a typical novel, one page or HTML file is appropriate, but when
that single HTML file gets up around 2 megabytes in size, it may be
worth considering a split because of the difficulty of loading it in
some browsers.

In some other cases, where the content requires different styles on
different pages, or different pages need different character sets, or
the page, with images, just gets too heavy, you may need to split the
HTML even if the HTML itself isn't technically too big.

When we post a HTML eBook containing multiple files, whether they
contain text or images, we post them only in zipped format, so if you
don't have images, and want your text to be directly accessible, you
should stick to one file where possible.

H.7. How can I check that I haven't made mistakes in coding my HTML?

There are two kinds of mistakes you can make in coding HTML:
you can produce invalid HTML, or you can produce HTML that
doesn't do what you want.

Checking for invalid HTML is straightforward. The W3C site
provides an HTML Validator, and Tidy offers a friendlier check
(see [H.4]).

Producing HTML that actually does what you want is equally
important. If you've converted the eBook from text, you may
have created inconsistencies, or closed an italics tag in the
wrong place, or used the wrong tag at some points. The only way
to check this is by reading through the HTML in a browser.

H.8. Can I submit a HTML or other format of somebody else's text?

Maybe.

This question has several complications. First, you must
understand that it is quite possible, even likely, that your
HTML file will eventually be overwritten by better information.

The value of a HTML file, as opposed to a plain text file,
lies in its ability to capture elements of the original that
have been lost in the plain text. A plain text file, using
extended character sets like ISO-8859 [V.76] or Unicode [V.77]
and underscores for italics, can capture all of the author's
intent in almost all cases. Sometimes, images and other important
features of the original cannot be captured in plain text alone,
but can be captured in HTML, or other markup.

When Michael Hart stopped posting books, in September 2001, we
had HTML formats of about 1.6% of all our eBooks. At the end of
2002, that has risen to nearly 11% of all our eBooks. If you
have a clearable copy of an existing posted book, with extra
features not included in the original plain text, we would
encourage you to make a new edition, or version, or format,
correcting any errors in the original, and adding any new
information not included there.

If, on the other hand, you just want to make a "blind format
change"—making your best guess at what the HTML, or other format,
layout should be for a book you've never seen, based on the original
producer's work—your best bet is to get in touch with the original
producer, and ask whether they can supply more material for you to
work with. Otherwise, you are at best just rearranging information
rather than contributing something new.

A blind format conversion can be done in anything from 2 minutes
[R.33] to an hour. It just doesn't make sense for us to keep posting
these files when they contain nothing new, and especially when two
people may want to convert the same text. It is likely that, at some
time in the next couple of years, we will start on a large-scale
conversion project, to add some form of markup to all of the existing
text files for ease of serving, and having a mish-mash of existing
markup styles to deal with at that point won't help either.

H.9. How big can the images be in a HTML file?

The images should be as big as necessary, and no bigger.

Sorry, but there is no clear number to give here. Web page designers
sweat blood to save an extra 20K on a page; so should you. If you're
an experienced HTML maker, you know this stuff; if you're not, take it
as a guideline that you should generally aim to keep your images in
the 30K to 50K size range, with occasional forays into 70-80K
territory. That's generally big enough for a clear picture, unless
you're reproducing fine artwork.

H.10. The images I've scanned are too big for inclusion in HTML.
What can I do about it?

This is a common problem, where images from the book occupy a full or
half page. Your images should be of an appropriate size for
downloading, and 2 megabytes of high-quality scan per image is not
really an appropriate size for most PG texts!

You should reduce the size, and maybe the quality, of the original
scan for simple viewing purposes. There is lots of image-manipulation
software to do this. For Windows, you might look at the freeware
Irfanview, and for both *nix and Windows there is ImageMagick [P.1].
Look for the words "resize" and "resample" in the Help.

Apart from simple converters, which do enough for this purpose, you
can also manipulate the images in full imaging creation and editing
packages like Paint Shop Pro, Adobe Photoshop and The Gimp [P.1].

Different image encoding methods can make a huge difference to the
filesize. Any of the packages mentioned above can encode images as
GIF, JPEG or PNG, and, particularly for black and white line drawings,
these can encode to very different sizes. So, for example, a 60K JPEG
may save as a 30K GIF, because the GIF encoding works better for that
particular image. Try your images out, and see what works.

When manipulating images, always work from your original. Don't
convert your original to a JPEG, and then shrink that and convert it
to a GIF. Depending on the format, images may lose definition as they
are converted (search for "lossy compression" in your favorite search
engine to find out more about this), and they certainly lose
definition as they are resized, and you end up with the "imperfect
copy of an imperfect copy of an . . ." effect. When you're
experimenting, take your original, resize and Save As GIF, then go
back to your original, resize and Save As JPG, and so on.

You can also use an image optimizer. These are specialist software
programs that try to make image files smaller without sacrificing
resolution or detail.

H.11. Can I include decorative images I've made or found?

No.

Please include only the images you got from the book. If you want to
make an edition of the book for your own web site, you can of course
use whatever you like there, but for PG purposes, we want the book,
the whole book, and nothing but the book.

H.12. How can I make a plain text version from a HTML file?

You can edit out the HTML by hand, of course, but there are several
easier ways to convert.

You can view the HTML in a browser, Select All text, and just Copy and
Paste into your editor. This is easiest, but doesn't handle formatting
like tables very well.

You can use the Lynx [P.1] browser to convert your text with the command

    lynx -dump myfile.html > myfile.txt

Bruce Guthrie's HTMSTRIP for MS-DOS [P.1] is very configurable.

H.13. How can I make a HTML version from my plain text file?

This is not a course in HTML, but, for most books, you don't really
need a course in HTML. Making a HTML format of most books is very
easy, and doesn't take long, once you have mastered basic HTML. Let's
assume you have your completed PG plain text file ready, and walk
through the steps commonly needed to make a HTML version. We'll do
this by successive approximation, doing the major things first, and
then dealing more and more with the detail.

There are lots of specialized HTML editors out there, but you don't
actually need any of them. The same editor that you used to create
your text will also create your HTML. HTML is just text, with two
types of special instructions added: tags and entities.

A tag is an instruction to the browser, usually to display something
with specific rules. Tags are shown within angled brackets: for
example, <p> is the instruction to start a new paragraph.

An entity is a named special character that might not be available
in your character set. Entities are shown starting with an ampersand
"&" and ending with a semi-colon ";" : for example, &mdash; is the
representation of an em-dash.

I'm marking up a made-up short text as I write these steps, loosely
based on the sample page from question [V.121]. You can see the
changes made at each stage by looking at the files

htmstep0.txt (text before starting)
htmstep1.htm (after adding the HTML header and footer)
htmstep2.htm (after adding paragraph marks)
htmstep3.htm (after marking main headings)
htmstep4.htm (after adding special line breaks and indents)
htmstep5.htm (after adding italics and bold)
htmstep6.htm (after adding accents and non-ASCII characters)
htmstep7.htm (after adding an image)
htmstep8.htm (showing some extra techniques)

Before you start, make sure that you can see these files both
in your browser and in your editor. In your editor, you should
see the HTML codes; in your browser, you should see the text
as it is intended to be viewed.

Note for people who already know HTML: yes, this example omits
lots of possible ways to do things, and lots of refinements. You
already know how to do what you want to do, so skip onwards, and
give the beginners room to learn in peace! :-)

Step 1. Add the HTML header and footer information

Add the following lines at the top of your text file:

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>Title of the Book, by Author Name</title>
  </head>
  <body>

Let's explain these one by one:

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
says that your file is HTML 4.01 Transitional, which is the
latest version, allowing the widest range of tags and entities.

  <html>
denotes the start of the HTML.

  <head>
denotes the start of the HTML header information.

  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
says that the characters are text, using ISO-8859-1 encoding. If you need to use a different character set, you should change ISO-8859-1 to whatever you intend to use. ISO-8859-1 is good for lots of PG books in English that use French or German words.

  <title>Title of the Book, by Author Name</title>
You should obviously change this to the actual title and author you're producing.

The
  </head>
denotes the end of the HTML header information and
  <body>
denotes the start of the actual text itself - the body of the book.

At the very end of the file, you should append these two lines

  </body>
  </html>

these denote the end of the body of the book, and the end of the HTML.
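If you have several texts to convert, you may prefer to let a small script add the header and footer for you. Here is a minimal sketch in Python, assuming the HTML 4.01 Transitional header and ISO-8859-1 character set described in this step. The names wrap_in_html, HEADER and FOOTER are made up for this illustration and are not part of any PG tool.

```python
import html

# Hypothetical header and footer templates, following the lines
# described in Step 1 (HTML 4.01 Transitional, ISO-8859-1 by default).
HEADER = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset={charset}">
<title>{title}</title>
</head>
<body>
"""
FOOTER = "</body>\n</html>\n"


def wrap_in_html(text, title, charset="ISO-8859-1"):
    """Return text with the HTML header before it and the footer after it."""
    # Escape the title so that an "&" or "<" in it cannot break the markup.
    safe_title = html.escape(title, quote=False)
    body = text.rstrip("\n") + "\n"
    return HEADER.format(charset=charset, title=safe_title) + body + FOOTER
```

You would call it with something like wrap_in_html(open("mybook.txt").read(), "My Book, by An Author") and save the result under a new name. The body text itself is left untouched here; the later steps still apply.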
At this point, you actually have a valid HTML file! OK, if you view it
with a browser, it doesn't look anything like the way it's supposed to,
but it is HTML. Save it with a name like MYFILE1.HTM or STEP1.HTM,
get a copy of Tidy for your DOS, Unix, Mac or Windows system, and run
it on your file to check for errors.
If it does say that there are errors, deal with them now, before you continue. Make sure, at each step, that you have cleaned up any errors; it's a lot easier now than later. Also, when you've finished each step, save your file with a number in its name, so that if you run into problems later and get confused, you can, at worst, drop back to the correct version at the end of the previous step.
The most likely error you might have at this point relates to the characters "<", ">", or "&". These are the characters used by HTML to indicate tags and entities. If these characters are used in the text of your file (and ampersand is likely to be), you should replace them with entities, so that HTML will know that they are to be displayed as characters, not interpreted as commands.
Replace & with &amp;, < with &lt;, and > with &gt;
There is an example of this in the file htmstep1.htm.
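If your editor can't do these replacements easily, a few lines of Python will. This is only a sketch, and the function name escape_reserved is my own invention. The one thing to get right is the order: the ampersand must be replaced first, otherwise the "&" at the start of the entities you have just added would itself be replaced.

```python
def escape_reserved(text):
    """Replace the three characters HTML reserves with their entities."""
    # Ampersand first: the entities added below start with "&" themselves.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text
```

Python's standard library does the same job with html.escape(text, quote=False). Either way, run this on the plain text before you add any tags, or the tags will be escaped as well.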
Step 2. Add paragraph marks.
For novels and general prose, paragraphs are the main logical and display unit. Paragraphs are marked in HTML with the sign <p> at the start, and </p> at the end. You don't actually need the </p> at the end, but adding these is a good habit to get into. You do, very much, need the <p> at the start.
The line-lengths within a <p> </p> pair are irrelevant; the browser in which the text is viewed will ignore extra spaces and line-ends, and will wrap text to fit the screen. This is bad for poetry and tables, but we will discuss those later. For this step, all you need to know is that you can leave your text exactly as it is, and just add the paragraph marks.
Put a <p> at the start of the line before the first letter of every paragraph, and a </p> just after the last letter or punctuation of every paragraph. If you can do macros in your editor, this will just take a minute; otherwise, it may be rather boring, but at least it is simple. For this step, put the paragraph marks around everything that has a blank line after it, even poetry or chapter titles. We'll come back and change that later.
Now save your text as something like MYFILE2.HTM or STEP2.HTM. Again, run Tidy to check for errors, and fix them before continuing.
If you now look at the file htmstep2.htm in your browser, you will see that it is starting to take shape. Look at it in your editor, and you will see the paragraph marks.
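If you don't have editor macros, the same job can be scripted. The sketch below assumes, as this step does, that every block of text separated by a blank line is a paragraph. The function name add_paragraph_marks is made up for the illustration, and it should be run on the body text only, not on the HTML header and footer.

```python
import re


def add_paragraph_marks(text):
    """Wrap every blank-line-separated block of text in <p> ... </p>."""
    # One or more blank lines (possibly containing spaces) end a block.
    blocks = re.split(r"\n\s*\n", text.strip())
    marked = ["<p>" + block.strip("\n") + "</p>" for block in blocks if block.strip()]
    # Keep a blank line between paragraphs so the source stays readable.
    return "\n\n".join(marked) + "\n"
```

Line breaks inside a paragraph are kept as they are, so the result should still match your text version line for line, apart from the added marks.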
Step 3. Add marks for headings.
We want to indicate to the reader that certain lines are for chapter or other headings. HTML provides the heading tags <h1> to <h6> for this purpose, each with a matching closing tag such as </h1>.
Since there won't be many headers, and most headers are only on one line, this is usually not hard. Look at the file htmstep3.htm to see how our sample is improving, and if you're working along with me, don't forget to save your file under a new name and check it.
In our example, we have marked some lines with paragraph marks where we now want to put headings, so we will change those <p>s into heading tags such as <h1> and <h2>.
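In a long book with many chapters, you might script this change too. The following is a rough sketch only. It assumes that chapter headings sit alone in a paragraph and start with the word CHAPTER in capitals, which will not be true of every book, and the choice of <h2> is just an example. The name mark_chapter_headings is hypothetical.

```python
import re

# Assumption: a chapter heading is a one-line paragraph starting "CHAPTER ".
CHAPTER_PARAGRAPH = re.compile(r"<p>(CHAPTER [^<\n]*)</p>")


def mark_chapter_headings(html_text, level=2):
    """Change <p>CHAPTER ...</p> paragraphs into heading tags of the given level."""
    tag = "h%d" % level
    return CHAPTER_PARAGRAPH.sub(
        lambda m: "<%s>%s</%s>" % (tag, m.group(1), tag), html_text
    )
```

Check the result by eye afterwards; a heading worded differently, such as "BOOK THE FIRST", will be missed and has to be changed by hand or by adjusting the pattern.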
Step 4. Line up verse, tables of contents, and other lists.
The HTML tag
tells the browser to force a line break without
starting a new paragraph. We use this when we don't want text all
wrapped together, but not separated with blank lines either, for
example in verse and tables of contents.
In our sample, we add the <br> tag to the end of each line in the
table of contents and the end of each line of the verse. If we were
working on a whole book of poetry, the same principle would apply,
but we'd be using the <br> tag a lot more.
Where we want to indent a line of poetry, we can use "&nbsp;" at the start of the line. Normally, however many spaces you leave between words, HTML condenses them to one space, so normal indentation doesn't work. But the "non-breaking space" entity will cause the browser to show one space for each entity, so that you can indent as much as you need.
The file htmstep4.htm shows the effect: this is now an entirely readable HTML text!
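For a long poem, adding the line breaks and indents by hand gets tedious. Here is a small sketch of how it could be scripted, assuming the verse is indented with ordinary spaces in your text file; mark_verse is a name I made up for the example.

```python
def mark_verse(lines):
    """Add <br> to each verse line and turn leading spaces into &nbsp; entities."""
    marked = []
    for line in lines:
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)  # number of leading spaces
        # One &nbsp; per leading space, so the browser keeps the indentation.
        marked.append("&nbsp;" * indent + stripped.rstrip() + "<br>")
    return "\n".join(marked)
```

For example, mark_verse(["Roses are red,", "  Violets are blue,"]) gives two lines, the second starting with two &nbsp; entities.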
Step 5. Add back in italics and bold.
The HTML tag <i> tells the browser to start displaying italics, and the </i> tells it to stop. Similarly, the <b> tag tells it to display bold, and </b> marks the end of the bold text. See htmstep5.htm for the changes.
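If your plain text already carries some marker for emphasis, the tags can be added by script. The sketch below assumes, and this is only an assumption about your text, that italics were marked _like this_ and bold *like this*; the function name restore_emphasis is hypothetical.

```python
import re

# Assumed plain-text conventions: _italic_ and *bold*.
# The \S stops a row of spaced asterisks ("* * * * *") from matching.
ITALIC = re.compile(r"_(\S[^_\n]*?)_")
BOLD = re.compile(r"\*(\S[^*\n]*?)\*")


def restore_emphasis(text):
    """Replace _word_ with <i>word</i> and *word* with <b>word</b>."""
    text = ITALIC.sub(r"<i>\1</i>", text)
    text = BOLD.sub(r"<b>\1</b>", text)
    return text
```

Check the result by eye afterwards; underscores and asterisks sometimes mean other things.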
Step 6. Restore accents and special characters.
Since we declared our HTML file to use ISO-8859-1 back at the start, we can use any of the common accented characters for Western European languages, but we may also use HTML entities. For example, for the "a circumflex" in "flâneur", we can use either the ISO-8859 character directly, or the HTML entity name "&acirc;" or number "&#226;".
There is a trade-off between characters and entities: entities do not limit you to any particular character set, but characters are directly readable when looking at the HTML source.
Within entities, there is also a trade-off between entity names and numbers: older browsers may not recognize some of the entity names, but the entities do make the text work in multiple character sets. Which you choose is entirely up to you, but it's best to be consistent; if you like entities, use them everywhere. Entities can be represented by their names (for example, &mdash;) or by their number, derived from their ISO-10646 (see Unicode) number (for example, &#8212;).
There are other special character entities you may choose, to replace the ASCII equivalents in the main text. Here are some of the common ones:
We've already seen
&amp;    &#38;    ampersand      replaces "&"
&lt;     &#60;    less than      replaces "<"
&gt;     &#62;    greater than   replaces ">"
&nbsp;   &#160;   space          replaces a space when you want to indent
and these are also very useful for many PG texts:
&mdash;  &#8212;  em-dash        replaces "--"
&deg;    &#176;   degree         replaces "deg." or "degrees"
&pound;  &#163;   British pound  replaces "L" or "l" or "pounds"
There are many others.
I've made a couple of entity changes in htmstep6.htm.
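If you decide to use entities everywhere, the conversion of accented characters can be automated. This is a sketch rather than a finished tool: it uses the table of HTML 4.01 entity names that ships with Python (html.entities.codepoint2name), and the function name to_entities is my own.

```python
from html.entities import codepoint2name


def to_entities(text, use_names=True):
    """Replace every non-ASCII character with an HTML entity.

    ASCII characters, including &, < and >, are passed through unchanged.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif use_names and code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")  # e.g. &acirc;
        else:
            out.append("&#" + str(code) + ";")  # e.g. &#226;
    return "".join(out)
```

Calling it with use_names=False gives numbers throughout, which is the safer choice if you are worried about older browsers.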
Step 7. Link Images into the text.
First, you need to have your image ready. You should already have resized your image to the size you want it to be viewed at. You should also have saved it as a GIF, JPG, or PNG image, since those are the formats most supported by current browsers.
If your image is named front.gif, and it is a picture of the frontispiece of the book, you should add a line like <img src="front.gif" alt="Frontispiece"> to your HTML at the place where you want it displayed.
The "alt" text gives a label to the image, and is displayed if the image can't be shown, or in the case of a browser for visually impaired people.
You don't have to add images with your HTML file unless you want to. In many older books, there are no images at all to be added.
My final HTML text is now in htmstep7.htm. You need to have the image front.gif in the same directory in order to see it. When your HTML text is posted, the images will be zipped with it, so that future readers can see them.
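If a book has many illustrations, you may prefer to generate the image lines rather than type them. A minimal sketch follows, assuming you have the file names and labels to hand; image_tag is a hypothetical helper, and it escapes quotation marks so that a label containing them cannot break the tag.

```python
from html import escape


def image_tag(filename, alt_text):
    """Build an <img> tag with a src and an alt label."""
    # escape(..., quote=True) turns " into &quot; so the attribute stays intact.
    return '<img src="%s" alt="%s">' % (
        escape(filename, quote=True),
        escape(alt_text, quote=True),
    )
```

For the frontispiece in our sample, image_tag("front.gif", "Frontispiece") produces the tag described in this step.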
Step 8. Over to you!
This is enough to make a reasonable HTML format of most PG texts, but it doesn't begin to cover everything that can be done in HTML. If you've gone this far, I recommend the W3C's tutorials, which cover the ground we've just crossed, and go a bit further.
Here are a few more things you might want to know, but don't go nuts adding tags just because you can! Use them only when you really need them. The file htmstep8.htm shows some of these techniques. Personally, I think that this is a bit overdone, and I prefer the effect of htmstep7, with left-aligned chapter headings, but that's a matter of taste.
Once you're used to the basic HTML needed for most PG eBooks, you'll probably be able to convert one in under an hour.
How do I force more space between specific paragraphs?
Insert a blank paragraph like this: <p>&nbsp;</p> or use an extra <br>.
How do I make text, or an image, or headings centered?
Put the <center> and </center> tags around it.
How do I make some text bigger or smaller?
Put the <big> and </big>, or <small> and </small> tags around it.
How do I lay out tabular information?
The simplest way to do it is with the <pre> and </pre> tags. These will cause whatever is within them to be displayed as plain text, just as it was in the original, so that spaces separate the entries just as they did in the text version. You can also use this for poetry, though you usually won't need to. It's not entirely satisfactory, but it will work.
Making a full HTML table requires you to use the <table>, <tr> (table row) and <td> (table data) tags, among others, and a full exposition of tables is beyond the scope of this FAQ.
Briefly, you start a table with the <table> tag. For each row you want in the table, you open and close a table row with <tr> and </tr>, and then for each cell within a row, you specify a <td> tag and the contents of that cell.
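To make the row-and-cell structure concrete, here is a sketch that builds a bare table from a list of rows. The function name make_table and the sample data are hypothetical; a real table would usually want a border and column alignment as well, which this leaves out.

```python
from html import escape


def make_table(rows):
    """Build a plain HTML table: one <tr> per row, one <td> per cell."""
    lines = ["<table>"]
    for row in rows:
        lines.append("<tr>")
        for cell in row:
            # escape() protects any &, < or > inside the cell text.
            lines.append("<td>" + escape(str(cell)) + "</td>")
        lines.append("</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

For instance, make_table([["Chapter I", "1"], ["Chapter II", "15"]]) gives a two-row, two-column table suitable for a contents page.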
This only scratches the surface of tables. However, there are many
guides available on the Web, and they're easy to find, once you
know which tags you're looking for. A brief discussion of tables
is provided by the W3C as part of the HTML 4.01 spec.
Step 9. Some common problems
When you're just starting to code HTML, it may seem that errors are coming at you from all sides. Tidy may spew out a stream of complaints that you don't recognize or understand. If it's any consolation, this is normal! Just take the error list one line at a time, starting at the top. Often, one actual mistake, like not closing a tag, may cause many errors, since an unclosed tag can cause many subsequent tags to be reported as errors.
Common errors include:
1. Simple typos in tags, like a misspelled tag name around the text "This is centered."
One option for making an HTML version is to use GutenMark.
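To see why one unclosed or misspelled tag, as described under Step 9, produces more than one complaint, it can help to look at a toy checker. The sketch below is not how Tidy works and is no substitute for it; it only tracks open tags with Python's standard html.parser module. The names VOID_TAGS, OPTIONAL_CLOSE, TagChecker and check_tags are all my own.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input", "area", "base", "col", "param"}
# Tags whose closing tag may legally be left out, such as </p>.
OPTIONAL_CLOSE = {"p", "li", "dt", "dd", "tr", "td", "th", "html", "head", "body"}


class TagChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # (tag name, line number) of tags still open
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in [name for name, _ in self.stack]:
            self.problems.append(
                "line %d: </%s> has no matching opening tag" % (self.getpos()[0], tag))
            return
        # Pop back to the matching tag; anything above it was left open.
        while self.stack:
            name, line = self.stack.pop()
            if name == tag:
                break
            if name not in OPTIONAL_CLOSE:
                self.problems.append("line %d: <%s> is never closed" % (line, name))


def check_tags(html_text):
    """Return a list of messages about unclosed or unmatched tags."""
    checker = TagChecker()
    checker.feed(html_text)
    checker.close()
    for name, line in checker.stack:
        if name not in OPTIONAL_CLOSE:
            checker.problems.append("line %d: <%s> is never closed" % (line, name))
    return checker.problems
```

Feed it a misspelled tag such as <centre>This is centered</center> and it reports two problems for the one typo: a closing tag with no opener, and an opener that is never closed. That is the same multiplying effect you will see in a real error list.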
Programs and programmers FAQ
P.1. What useful programs are available for Project Gutenberg work?
These suggestions came largely from a poll of volunteers in June, 2002. The programs listed are a summary of the programs we actually use. There are many other programs out there that can do the same jobs, so don't limit your search just to these.
Abbyy
These are the three main commercial packages that volunteers bought specifically for the purpose. In a few cases, people had got older versions of these bundled with their scanners.
Clara OCR
These are Free Software packages. Some people who responded to the survey had tried them, but nobody had actually used them to produce a text.
DocMorph, a free, web-based OCR
This one is interesting: you can just submit your image through a web page, and the service will return OCRed text. However, the process of submission, waiting for your text, and then cutting and pasting into your document is slow.
Other volunteers use various OCR software that came bundled with their scanner.
2. Editing
The main answers, given by more than one person, were:
AbiWord
Other editors mentioned included:
Crisp for Windows
Programs recommended by Apple Macintosh users included:
AppleWorks
BBEdit Lite
3. Checking and proofing
For spelling, most people just use the spellchecker built into their editor or word-processor. The *nix users running emacs or vi tended to use variants of the standard Unix spell command, such as ispell or aspell. Mac users have the free spelling checker Excalibur.
Gutcheck
4. Working with HTML
In the survey, most volunteers preferred to handcraft their HTML using their normal editor. Those using a word processor edited the HTML as text, rather than composing a word processor file and then Saving As HTML. There was remarkable unanimity on this.
Specific HTML editors that were mentioned for occasional use were:
Adobe PageMill (no longer available)
Mozilla Composer
However, not all HTML work is about editing, and the following packages were honorably mentioned for other functions. Especially important is Tidy, which is pretty much necessary for all but the most experienced people for quick HTML checking.
GutenMark: Converts Project Gutenberg texts to HTML and TeX.
HTMSTRIP by Bruce Guthrie: MS-DOS. Converts HTML to text.
Lynx (lynx -dump): Converts HTML to text.
Dave Raggett's HTML Tidy: Checks HTML for correctness, reformats and fixes.
W3C html2txt (web-based): Converts HTML to plain text.
W3C Validator (web-based): The Last Word on the correctness of HTML.
wget: A very neat utility for getting web pages.
5. Working with images
There are two main applications of images in PG: images to be used within texts, like illustrations in HTML, and the management of page images for scanning. These packages are used by volunteers variously for both of those purposes. Their typical use within PG is indicated. "Advanced image processing" packages will permit you to edit and restore damaged images, but for PG work, we mostly just need to manage, convert, resize and crop them.
ACDSee for Windows: For image reviewing
Adobe Photoshop: For advanced image processing
ImageMagick for *nix, Mac and Windows: Resizing and format conversion
Irfanview for Windows: Image viewing, conversion, cropping and resizing
The Gimp: For advanced image processing
Picture Publisher: For advanced image processing
VuePrint Pro: For viewing images
Proofreaders' Toolkit (PRTK): For splitting batches of image files into individual pages
P.2. What programs could I write to help with PG work?
Look at the programs listed above in [P.1]. Can you write a better version of any of them? Improving OCR and editors constitutes a major challenge, unless you're a world-class expert, but checking and reformatting texts is an area not addressed by large scale programs, and you might contribute there.
Formats FAQ
F.1. What formats does Project Gutenberg publish?
In principle, there's no format that we won't publish, but, in practice, we prefer formats that are open and editable. An open format is one whose structure is publicly defined and documented, and not burdened with patent or trade secret or copy-protection (a.k.a. "DRM") restrictions.
Anyone can write a reader or creator for an open format, and in 500 years' time, anyone interested will still be able to write a program to display the file. Closed formats, by contrast, will almost certainly be unreadable in just a few decades, when the companies now promoting them disappear, or lose interest, or decide to stop supporting them because they want to sell a replacement.
Being able to edit the file is also important. We make corrections to our editions constantly, and it is important to us that we should be able to update our files easily. If adding one word to a sentence involves a complete re-marking of the whole text and a complete rebuild of the file, we have to ask ourselves whether this format is really necessary for this text. Further, the people who re-use our texts should also be allowed to copy and reformat them freely, and non-editable formats restrict their ability to do this in various ways.
F.2. What is, and how do I make or use:
[Note: Character sets and formats are both listed here. Character sets refer to the characters you can use; formats describe how those characters are put together. For non-text formats such as music files, there is no exact equivalent to a character set.]
ASCII (Character Set)
ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer. You can view or edit ASCII text using just about every text editor or viewer in the world.
Big-5 (Character Set)
Big-5 is a set of 13,494 traditional Chinese characters. You will need to use an editor or viewer that supports the character set.
Codepage 437, 850, 1252, etc. (Character Sets)
These codepages are Microsoft-specific character sets which allow the display of accented characters and other symbols. To view a text that uses one of these, you will have to use a Microsoft application that supports them. Many of the fonts supplied with Word for Windows will display and edit CP-1252 correctly. For Codepages 437 and 850, you may have to open a Command Prompt and use a DOS editor like EDIT.
DVI (Format)
DVI stands for DeVice Independent, and is commonly used to store text and instructions for displaying it involving complex mathematical symbols and expressions, though it can be used for any content. Given a DVI file, you need a viewer to render it on the specific device you're using. Specifically, DVI is used as the standard output format for TeX, discussed below.
HTML/HTM (Format)
HyperText Markup Language defines the standard format of web pages.
You should be able to view these with any web browser, and edit them with any text editor or a specialized HTML editor.
ISO-8859/ISO-Latin (Character Sets)
ISO-8859 is a series of character sets used to represent the accented characters most commonly used in European languages. There's ISO-8859-1, ISO-8859-2, and so on. ISO-Latin is just another name for the same thing.
LIT (Format for PDA-based eBooks)
This is a proprietary, closed format for files that can be displayed only by the Microsoft Reader.
MacRoman (Character Set)
MacRoman is an 8-bit Apple Mac-specific character set which allows the display of accented characters and other symbols. To view a text that uses MacRoman, you will have to use an application that supports it, and there are few outside the Apple fold. However, iconv and recode are programs that convert between many character sets, and MacRoman is supported by both.
MID/MIDI (Format for music)
Musical Instrument Digital Interface is a music description language,
encompassing not only file formats but definitions of interfaces. A
MIDI file contains instructions for sending messages to a musical
instrument to recreate the sounds.
MP3 (Format for any audio file)
MPEG-1 Audio Layer 3 was defined by the Moving Picture Experts Group as a means for encoding sounds. Many, many MP3 players exist for all platforms, and can be found easily with a Net search.
MPEG/MPG (Format for moving pictures)
The Moving Picture Experts Group have released a series of formats for encoding video and audio. MPEG (pronounced EM-peg) formats are published and widely used.
MUS (Format for music)
MUS from Coda Music
PDB (Format for PDA-based eBooks)
The Palm Data Base format can actually be used for purposes other than eBooks, and there are many possible variants of formats for Palm-based readers all using the extension PDB on PCs, and they're not all entirely compatible. Some of them are proprietary, and it may not be possible to edit them directly, or export files from these formats; they have to be made in another format and converted. Some can be converted back to text. The most common, though, is the "Palm-DOC" format, which is an open format and can be edited on the Palm itself.
PDF (Format for eBooks)
Portable Document Format is a format for storing texts, containing any fonts or graphics. It is copyrighted by Adobe.
PRC (Format for PDA-based eBooks)
This is a proprietary format for files that can be displayed only by the MobiPocket Reader.
PS (Format for text and graphics)
PostScript is technically a programming language, not just a format. It has conditional statements, procedures and program flow control. However, it is commonly referred to as a format.