HTML FAQ

NEITHER Clara nor Vernon appeared at the mid-day table. Pr. Middleton talked with Miss Dale on classical matters, like a good-natured giant giving a child the jump from stone to stone across a brawling mountain ford, so that an une(lified audience might really suppose, upon - seeing her over the difficulty, she had done something for herself. Sir Willoughby was proud of her, and therefore anxious to settle hier l)uSifleSS while he was in the humour to lose her. lie hoped to finish it by shooting a word or two at Vernon before dinner. Clara's petition to be set free, released from hirn~, had vaguely frightened even more than it offended his pri(le.

* * * * *

Scan 4—A Really Bad Case!

Scan4 is a paragraph from Pope's translation of Homer's "Odyssey". This is a very, very tough one. It was obviously a cheap printing to begin with, using thin, poor-quality paper in a page size of 6" by 4.5", with capital letters about 1.5 mm high, a little bigger than Times New Roman size 8. Text this small really needs a higher-resolution scan. The book was falling apart when I got it, the ink was fading and flaking, and there was no point in even thinking about trying to scan it flat, so I cut the pages. To add an extra challenge, I scanned the sample with the cover open in a medium-lit room for the 300 and 400dpi scans, but closed the cover for the 600dpi to show the best quality I could possibly get. (I was pleased to note that Abbyy, while recognizing the page in the 300dpi and 400dpi images, flashed up a suggestion that I should lower the brightness of the scan.)

This particular book was one I sporadically tried to produce, without success, on an older scanner and a bundled OCR program over a period of two years, back in 98/99. Eventually, in 2000, it was the first book processed through Charles Franks' Distributed Proofreaders site. The initial text produced by the OCR was very poor, but the human volunteers made up for it! Thanks, guys! Today, just two years later, with a better scanner and better OCR, I could have done it myself, as you will see from the best of the results of the 600dpi scans. That's how much things have improved recently.

A separate point to note here is that you can see the "three-quarter space" effect before the exclamation mark and semi-colon that was discussed in [V.104].

The results of the OCR are:

Abbyy FineReader 6:

" Ah me ! on what inhospitable coast,On Tvh.it new region is Ulysses toss'd ;Possess'd by wild barbarians fierce in arms ;Or men. whose bosom tender pity warms ?What sounds are these that gather from the shores ?The voice of nymphs that haunt the sylvan bowers,The fair-hair'd Pryads of the shady wood ;Or azure daughters of the silver flood ;Or human voir-e? but issuing1 from the shades,AVhv cease I straight to learn what sound invades?"

" Ah me ! on what inhospitable coast,On what new region is Ulysses toss'd ;Possess'd by wild barbarians fierce in arms ;Or men, whose bosom tender pity warms '?"What sounds are these that gather from the shores ?The voice of nymphs that haunt the sylvan bowers,The fair-hair'd Dryads of the shady wood ;Or azure daughters of the silver flood ;Or human voice? but issuing from the shades,Why cease I straight to learn what sound invades?"

" Ah me ! on what inhospitable coast,On what new region is Ulysses toss'd ;Possess'd by wild barbarians fierce in arms ;Or men, whose bosom tender pity warms ?"What sounds are these that gather from the shores ?The voice of nymphs that haunt the sylvan bowers,The fair-hair'd*Dryads of the slrady wood ;Or azure daughters of the silver flood ;Or human voice? but issuing from the shades,Why cease I straight to learn what sound invades?"

gocr 0.3.6:

[The 300 and 400 dpi scans produced nothing recognizable. The result of the 600 dpi scan is below.]

'' _hh i_3e ! o_1 ___l_at_ i__l__sl__ it_nble CoaSt_ On ___l_,__ _)e_v i_e_io__ i__ ___ _._____ses toss'd ; _(3s3gs3_d l3.__ ___iiíi l3_3__b___i_c_i3_ fie_Ce in il__S- _ Or i11pn, __-i)c3se l_osonl te_1de_ _it____ __ai_n3__ ? ___l_at __o__i1ds Qre tlipse tliat g__tl_p_r fE_oi33 the shoTes ? '_ilie __oi__e of i)____ E1)l3l3s tl3nT 1i_n__nt the s__l__inn bo_Ye_5_ 3'l_e fni___i____ir'd _____-ads of' il_e sli__d__ i___oOd _ Op az(_pe da_____litc__s of _tlie sil __?r t1ood ; Or l___i31_nn ___)i___? l3__t i3____ii_6 fi_oi11 tlie __hiade__ _ __'!3.__ _ea___e _ s_rai__li.t to l_ar_i1- i_—li__t so_nd- in__ad_S___''

Recognita Standard 3.2.7AK:

.: lh nt"'. on w-hat inlu,;y:t, I,:e co;;~t, On ~cli^t ne~- re~ion i.. 1= 1-.-:.:e~ tm:'d ; Possea'd 1n- wil~l L;,rba~:c, .~ fierce in arm~ ; Or u.~u. w-Ln.e bossum tender pit~- warna'? ~l-u:lt .<,:~;;::;3s are tll~ce that ~atl:er from the shnre~ ? 'I'l.e -;;o'.re :,; nwtthil: tW ,t l:aa;nt the s~-l:c 1llJOR'er5, 'lhe :a,:~-h ~;r'd~It.wa~i~ ot' tl:e ~Il;;dv vood; Or az.lre dau~~l.ts~: oY tl:c ·:iv-~~r floo;:3 ; C?r humnn ~-<:i: e'? l,~:tt i~~; from tl:c· ~had~~, 11-lts- cea~e I ctrai rlit to learn ~s-l:, t socud incades %"

" ~h me ! ou "-Mat iuMospita~le coast,On ~i-lmt ne~c reyion is L 1~-~ses to~s'd ;Pos:e;s'd 1"~ w-iMl lrvrbaria:ns fiet~ce in arms ;Or m~ n, "-hose hosom tender pit~- warm5 ?~~~hat ~ounds are tlmse tMat ~;atMer from t:he shores ?~t'I~e ~-oi~~e of n~-Inhhs t.hat liaunt the s~-l~~a n howers.Tlie fair-hnir'd D~ vads ot tl:e shad~- "-ood ;Or aznre dau~liters of tMe sil~-~r fiood ;Or lmman ~-oi:~e'? but iauin~ frotn the shades, alVly cea.~e I straibht to learn "-Mat souud in~ad°s?"

" Ah me ! on what inhospitable coastOn ~~-hat new r e~ion is L;1 ~-sses toss'd ~,Possess'd 1J~- "-ilil I:OII'uai'la ils fierce in arms_ ·Or men, whose hosom tender pit~l ~varn~s ?~'G'l~at somnds are these tliat ~atl~er from the shores ?~I'Iie v oice of n~-mpl~S that ~munt the sy Ivan bowers,Tlie fair -hair'd D~~~-ads of tl~e slmdy wood ;Or azure daylltcrs of tlle silver flood ;Or lm:nan voice? uut issL~ing from the shades,~~'lm cea~e I strai~ht to Iearn ~~-lmt so~nd inv ades ?"

OmniPage Pro 10:

,.lh in- ' on "-hat inh-slit al.:e coast,On "M.^t new reion is 1=1;-a:e~ to-s'd ;P"::e:~'d hw "ild Larba.:an~ fierce in arms ;Or inn. "-hnse bo.,om tender pity warmsWhat

'Wh me ! on what inhospitable coast,On what new region is L fusses toss'd ;Possess'd br wild barbaric ns fierce in arms ;Or men, whose bosom tender pith- warmsAN-hat sounds are these that gather from the shores ?The voice of nymphs that Haunt the sylvan bowers,The fair-hair'd IWvads of the shady -wood ;Or azure daughters of the silver flood ;Or human voice? bat iauina from the shades,Why cease I straight to learn what sound invades?"

" Ah me! on what inhospitable coast,On what new region is Ll ysses toss'd ;Possess'd bv -wild barbarians fierce in arms ;Or men, whose bosom tender pity warnis ?AVlia± sounds are these that gatller from the shoresThe voice of nYI11pliS that haunt the -sylvan bowers,The fair -hair'd D.-yads of the shady wood ;Or azure daughters of the silver flood ;Or human voice? lout issuing from the shades,Why cease I straight to learn what sound invades?"

OmniPage Pro 11:

.` lh in-' on what inhospital,le co-st,On xclznt near region is t 1:-sse~ toss'(: ;Possess'd bY Mild barbarians fierce in aims ;Or inn. whose boson tender pity warmsWhat

''' :Ah me ! on what inhospitable coast,On iyhat new region is Ulysses toss'd ;Possess'd br wild barbarimis fierce in arms ;Or men, whose bosom tender pity warmsAN-hat sounds are tliese that gather from the shores ?The voice of nymphs that haunt the sylvan bowers,The fair-hair'd D~ yads of the shady -wood;Or azure dau.L-hters of the silver flood ;Or human voice? but issuing from the shades,Why cease I straight to learn what sound invades?"

" Ah me! on what inhospitable coast,On what new region is Ulysses toss'd ;Possess'd by -wild barbarians fierce in arms ;Or n1en, whose bosom tender pity warnis ?AVliat sounds are these that gather from the shoresThe voice of nyniplis that haunt the sylvan bowers,The fair-hair'd Dryads of the shady Wood ;Or azure daughters of the silver flood ;Or human voice? but issuing from the shades,Why cease I straight to learn what sound invades?"

TextBridge Millennium Pro:

no on what inhe~ptaEie coast, On what new realun is hivs,e' to5sd ,s~s Ä-~d liv wild lie il)~m.ihI fir see in al-rn~ Or u~,-n. w'linse bo,uuiu tender pity warnls Wl at ~ are t1ie~e that ~atler from the shores ? 'n.e a oro of imvntpirs tint he~nt the sad van bowers, 'flie tah'-ha~r'd D~vahs ct the shady wood 1)1' az Ire dauul~t ~ of tl,e shvr flood Or liunian vi i 'I ? h'tt is- eng from the shades, \VIiv cea-~e I straight to learn w hat sound invades 1"

Ah me on what inhospitable coast,On what new region is U vases toss'dPossess'd by wild barbarians fierce in armsOr men, whose bosom tender pity warms ~What sounds are these that gather from the shores?The voi'e of nymphs that haunt the sylvan bowers,The fair-baird Prvads of tl~e shady woodOr azure daughters of the silver floodOr human vuiae? but issuing fi'om the shades,Why cease I straigl~t to learn what sound invades?"

Ah me on what inhospitable coast,On what new region is Ulysses toss'dPossess'd by wild barbarians fierce in armsOr men, whose bosom tender pity warms?What sounds are these that gather from the shores?rfhe voice of nymphs that haunt the sylvan bowers,The fair-hair'd Dtyads of the shady wood;Or azure daughters of 'the silver floodOr human voice? but issuing from the shades,Why cease I straigl~t to learn what sOund invades?"

What can we conclude from this?

Small mistakes in scanning, like letting too much light in, getting your scanner settings wrong for the page, or not pressing the paper flat enough, can make a major difference to the final quality of the text that you will have to correct.

Sometimes, no matter what you do with your scanner, problems with the paper or the print will make it difficult for your OCR package to give good output.

Generally, bigger is better within the range 300dpi-600dpi, but you only need higher resolution with more difficult material.

Different OCR packages will produce widely differing texts from the same images. Given a really good image, most OCR software will work acceptably, but when you have lower quality material to work with, the gap between OCR packages shows clearly.

S.18. I got an OCR package bundled with my scanner. Is it good enough to use?

That depends on how well your package performs on the actual scans that you do, and how much you value your time vs. money. Most scanners are bundled with OCR software, but these OCR packages are often older or "brain-damaged" versions, with their functionality deliberately lowered. It's unlikely that you'll get a current-version, top-of-the-line OCR package thrown in for free.

You may have to pay extra for better OCR, but it means that you spend less time making corrections. The question is how much better you want your OCR to be.

Save the images from the FAQ "Why am I getting a lot of mistakes in my OCRed text?" [S.17] and try processing them with the OCR you have. Compare the quality of the text produced with the quality of the samples. This should give you some idea of how your OCR compares to others.

Try a few pages from your book with your OCR. How many mistakes do you see on each page? Do you find that acceptable?

S.19. I want to include some images with a HTML version. How should I scan them?

We don't often see color prints in our books, but if you do have one, then scan it in color. Otherwise, try both greyscale and B&W, and see which gives you the best image.

It's usually better to scan images in a higher resolution than you're going to use, and then use an image manipulation package to reduce them [H.10] to a size appropriate for your HTML file. An initial scan at 600dpi is often good. Image manipulation programs will also allow you to "clean up" the pictures, by increasing contrast, despeckling, or other filtering.

S.20. I want to include some images with a HTML version. What type of image should I use?

GIF, JPEG and PNG images are supported by current browsers, and you should stick with those unless you have a specific reason not to.

GIF and PNG tend to be more efficient—provide better quality at a given file size—for simple line-drawings; JPEG is usually better for photographic images.

S.21. Will PG store scanned page images of my book?

No. Or, at least, not yet.

The idea has been kicked around a bit. There's no question of replacing etexts with page images, but many volunteers who have already scanned the book anyway like the idea of saving page images as well—for general information, and as a means of checking future correction suggestions against the original. Some volunteers already keep their page images, stored for possible future use.

Working some back-of-the-napkin figures: a page of text might take up 1KB of space on a computer as plain text or HTML or XML. The same page might take 70KB if stored as a black-and-white image, of just enough quality to serve as a reliable guide to making corrections. Pages with pictures, or stored with enough resolution to allow some future researcher to write a paper on the changing shape of serifs in the 18th and 19th centuries, would start at around 350KB per page, and go up from there.

A 300-page book thus becomes:

about 300KB as plain text (and around 150KB zipped)
about 20,000KB as minimal-quality images
about 100,000KB as high-quality images

and with the images, we won't save much space on the zipping, because they're already compressed.

On a normal "56K" modem, getting about 4KB / second, it would take:

75 seconds to download the text file (40 for the Zip)
80 minutes to download the minimal images
over 5 hours to download the high-res images.
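If you'd like to redo these sums for a book of a different length, here is a minimal Python sketch of the arithmetic. The per-page sizes and the 4KB per second rate are the estimates from the text; the function and variable names are only illustrative. It uses the unrounded figures, so it reports 21,000KB and 105,000KB where the text rounds down to 20,000KB and 100,000KB.

```python
# Rough per-page sizes in KB, taken from the estimates in the text.
KB_PER_PAGE = {
    "plain text": 1,
    "minimal images": 70,
    "high-quality images": 350,
}
PAGES = 300
MODEM_KB_PER_SEC = 4  # a "56K" modem, as assumed in the text


def book_size_kb(kb_per_page, pages=PAGES):
    """Total size of the book in KB for a given per-page size."""
    return kb_per_page * pages


def download_seconds(size_kb, rate=MODEM_KB_PER_SEC):
    """Seconds needed to transfer size_kb at rate KB per second."""
    return size_kb / rate


for label, per_page in KB_PER_PAGE.items():
    size = book_size_kb(per_page)
    minutes = download_seconds(size) / 60
    print(f"{label}: {size} KB, about {minutes:.1f} minutes")
```

Change PAGES or MODEM_KB_PER_SEC to see how quickly the picture changes with a bigger book or a faster connection.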

Someday, the disk and bandwidth capacities that we will take for granted will be such that uploading images, when we have them, will be quite natural, just for the few people who will want them. But we're not quite there yet.

Late flash! As of late 2002, the Internet Archive is providing space to volunteers for storing page images. To see the images, and find out more, go to

H.1. Can I submit a HTML version of my text?

Yes.

H.2. Why should I make a HTML version?

Well, you can make one just because you want to, but on some texts there is special reason to.

If you want to preserve the pictures that accompany the text, making a HTML version means that you can specify where and how those images appear.

If there is particular meaningful information in the layout of the text that can't be expressed in ASCII, like special characters or complex tables or fonts, HTML may offer an open format alternative.

H.3. Can I submit a HTML version without a plain ASCII version?

You can submit it, but the Posting Team will then consider whether we should also make an ASCII, or perhaps ISO-8859 or Unicode, version of it. We really do want our texts to be viewable by everybody, under all circumstances, and we do not want to start posting texts that are in any way inaccessible to anyone.

See also the FAQ [G.17] "Why is PG so set on using Plain Vanilla ASCII?"

H.4. What are the PG rules for HTML texts?

1. The only absolute rule is that the HTML should be valid according to one of the W3C HTML standards.

You can verify that your HTML is valid with the W3C's HTML Validator.

For a more convenient and friendly, though less official, check of the correctness of your HTML, you should use Dave Raggett's Tidy program, which not only points out any messiness in your HTML code, but also has some neat modes to clean it up and standardize the formatting.

After that, we have some requirements and recommendations. Compliance with the requirements might be waived if there is a really good reason to make an exception in a particular case.

2. Requirement: File names and extensions

If you want your text to work within 8.3 filename conventions, you may use .htm as the extension for your HTML files; otherwise, use .html as the extension. If you are working to 8.3 conventions, all of your images as well as your HTML files should have 8.3-compliant filenames.

All file names and extensions should be in lower-case throughout. Yes, we know this is not strictly necessary, but we don't want to have to correct every file that comes with "image.gif" referenced in the HTML accompanied by a file IMAGE.GIF.
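A quick script can catch these naming problems before you submit. The following is a minimal Python sketch, not an official PG tool: the function names are made up for illustration, and the set of characters allowed in an 8.3 name (lower-case letters, digits, underscore and hyphen) is an assumption of this sketch.

```python
import re

# 8.3 convention: up to 8 characters, a dot, then up to 3 characters.
# The allowed character set here is an assumption of this sketch.
EIGHT_DOT_THREE = re.compile(r"^[a-z0-9_-]{1,8}\.[a-z0-9]{1,3}$")


def is_lower_case(name):
    """True if the file name contains no upper-case letters."""
    return name == name.lower()


def is_8_3(name):
    """True if the file name fits the lower-case 8.3 pattern above."""
    return bool(EIGHT_DOT_THREE.match(name))


def case_mismatches(referenced, on_disk):
    """Return (reference, file) pairs that differ only by letter case."""
    by_lower = {name.lower(): name for name in on_disk}
    return [
        (ref, by_lower[ref.lower()])
        for ref in referenced
        if ref.lower() in by_lower and by_lower[ref.lower()] != ref
    ]


print(case_mismatches(["image.gif"], ["IMAGE.GIF"]))
```

The last line reports exactly the "image.gif" versus IMAGE.GIF situation described above.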

3. Requirement: HTML and plain-text

Project Gutenberg does publish well-formatted, standards-compliant HTML. However, we insist that a plain text version be available for all HTML documents we publish (even if images or formatting are absent), except when ASCII can't reasonably be used at all, for example with Arabic or mathematical texts.

4. Requirement: Archive format for posting

If the HTML book contains more than one file (including images), create a ZIP (preferable) or TAR archive containing all of the files in the book. The ZIP file may, if you wish, unzip to a subdirectory named for the book. For example, a book called 'The Humour of Mark Twain' might unzip in a directory called 'mthumor'. Make sure directory names contain only alphabetic and numeric characters, no spaces, and are 8 characters or fewer, even if you're not sticking to 8.3 conventions for filenames.
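Any ZIP program will do this job, but if you prefer a script, here is a minimal Python sketch using the standard zipfile module. The function names are hypothetical, and the choice to refuse an unsuitable directory name outright is this sketch's own decision, not a PG rule about tooling.

```python
import re
import zipfile
from pathlib import Path


def valid_directory_name(name):
    """True for names made only of letters and digits, 8 characters or fewer."""
    return re.fullmatch(r"[A-Za-z0-9]{1,8}", name) is not None


def make_book_zip(source_dir, zip_path):
    """Zip every file under source_dir so that the archive unzips into a
    subdirectory named after the book."""
    source = Path(source_dir)
    if not valid_directory_name(source.name):
        raise ValueError(f"unsuitable directory name: {source.name!r}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                arcname = Path(source.name) / path.relative_to(source)
                archive.write(path, arcname=arcname.as_posix())
    return Path(zip_path)
```

For the example in the text, you might call make_book_zip("mthumor", "mthumor.zip") from the directory that contains the mthumor folder. Keep the ZIP file itself outside the folder you are archiving.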

5. Recommendation: Simplicity

Make your HTML as simple as possible. HTML is an evolving standard, and one that may be completely obsolete in the long term. Use of advanced features may just mean that your version will be obsolete or unreadable that much faster.

6. Recommendation: Images

Images included with your HTML should be in a format that Web browsers can read: GIF, JPEG or PNG. Images should be edited for high quality in a reasonably small file size. Make the best decision you can concerning the image size and placement in the text. Every image included must be linked into (referenced by) the HTML.
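To check that every image you are sending is actually used, you can compare the files in your image folder with what the HTML refers to. This is a minimal Python sketch using the standard html.parser module; the class and function names are made up for illustration, and it assumes images are referenced either by an img tag's src or by a link's href.

```python
from html.parser import HTMLParser


class ReferenceCollector(HTMLParser):
    """Collects the targets of <img src=...> and <a href=...>."""

    def __init__(self):
        super().__init__()
        self.references = set()

    def handle_starttag(self, tag, attrs):
        wanted = {"img": "src", "a": "href"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.references.add(value)


def unreferenced_images(html_text, image_files):
    """Return the image file names that the HTML never mentions."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    return sorted(set(image_files) - collector.references)


page = '<p><img src="fig1.gif" alt="Figure 1"> <a href="map.jpg">map</a></p>'
print(unreferenced_images(page, ["fig1.gif", "map.jpg", "extra.png"]))
```

Anything the function lists is either a leftover file to delete or an image you forgot to place in the text.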

7. Recommendation: Line lengths

If it is reasonable to do so, try to wrap paragraphs of text at around the normal PG margin of 70 characters. Ideally, your HTML should be as near as possible identical to your text version except for the HTML tags and entities. People who open your HTML won't all be using browsers, people will need to make corrections, not all editors can handle very long lines, and even with editors that can handle long lines, it's easier to work with short lines.
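If your editor can't re-wrap long lines for you, a few lines of Python will. This is only a sketch: it treats every block separated by a blank line as a paragraph and re-fills it, so don't run it over verse, tables, or anything else where the line breaks matter. The function name is hypothetical.

```python
import textwrap


def wrap_paragraphs(text, width=70):
    """Re-wrap each blank-line-separated paragraph to at most `width` columns."""
    wrapped = []
    for paragraph in text.split("\n\n"):
        # Collapse the existing line breaks, then re-fill to the target width.
        flat = " ".join(paragraph.split())
        wrapped.append(
            textwrap.fill(
                flat,
                width=width,
                break_long_words=False,  # never cut a word (or tag) in half
                break_on_hyphens=False,
            )
        )
    return "\n\n".join(wrapped)
```

Because long words are never split, a single word longer than 70 characters will still produce an over-long line; that is rare in ordinary prose.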

Apart from these rules and recommendations, we also have a rule about the PG header, but that will normally be handled by the Posting Team. Where your HTML is all in one file, the header text will be inserted within PRE tags in that file. Where the HTML is split into multiple pages, the header will be put into a separate file named index.htm or index.html, and will link to the first page of your HTML.

H.5. Can I use Javascript or other scripting languages in my HTML?

No.

We don't want our readers to have to worry about any potential for malicious or just plain buggy code.

H.6. Should I make my HTML edition all on one page, or split it into multiple linked pages?

For a typical novel, one page or HTML file is appropriate, but when that single HTML file gets up around 2 megabytes in size, it may be worth considering a split because of the difficulty of loading it in some browsers.

In some other cases, where the content requires different styles on different pages, or different pages need different character sets, or the page, with images, just gets too heavy, you may need to split the HTML even if the HTML itself isn't technically too big.

When we post a HTML eBook containing multiple files, whether they contain text or images, we post them only in zipped format, so if you don't have images, and want your text to be directly accessible, you should stick to one file where possible.

H.7. How can I check that I haven't made mistakes in coding my HTML?

There are two kinds of mistakes you can make in coding HTML: you can produce invalid HTML, or you can produce HTML that doesn't do what you want.

Checking for invalid HTML is straightforward. The W3C site will formally validate your file and point out any mistakes, and this is the official standard. However, it is not always convenient to use, especially when you're in a cycle of fix-and-retest. For this, you should try the program Tidy, which runs on your computer, tells you about errors, and has other useful functions as well. Tidy is available for just about every operating system, and there are several Windows utilities that include Tidy. The links on the main Tidy page will lead you to the right version for you. Tidy is fast and friendly, compared to validation over the web, but it is not the last word. The W3C Validator may find formal errors, such as DOCTYPE mismatches with HTML tags or entities, that Tidy may not. The best solution is to complete your HTML tests using Tidy, and then, when Tidy finds nothing further to gripe about, submit it to the W3C Validator for the official seal of approval. Please run these checks before submitting your HTML; we can generally fix it for you, but it may take us a lot of work.

Producing HTML that actually does what you want is equally important. If you've converted the eBook from text, you may have created inconsistencies, or closed an italics tag in the wrong place, or used the wrong tag at some points. The only way to check this is by reading through the HTML in a browser.

H.8. Can I submit a HTML or other format of somebody else's text?

Maybe.

This question has several complications. First, you must understand that it is quite possible, even likely, that your HTML file will eventually be overwritten by better information.

The value of a HTML file, as opposed to a plain text file, lies in its ability to capture elements of the original that have been lost in the plain text. A plain text file, using extended character sets like ISO-8859 [V.76] or Unicode [V.77] and _underscores_ for italics, can capture all of the author's intent in almost all cases. Sometimes, images and other important features of the original cannot be captured in plain text alone, but can be captured in HTML, or other markup.

When Michael Hart stopped posting books, in September 2001, we had HTML formats of about 1.6% of all our eBooks. At the end of 2002, that has risen to nearly 11% of all our eBooks. If you have a clearable copy of an existing posted book, with extra features not included in the original plain text, we would encourage you to make a new edition, or version, or format, correcting any errors in the original, and adding any new information not included there.

If, on the other hand, you just want to make a "blind format change"—making your best guess at what the HTML, or other format, layout should be for a book you've never seen, based on the original producer's work—your best bet is to get in touch with the original producer, and ask whether they can supply more material for you to work with. Otherwise, you are at best just rearranging information rather than contributing something new.

A blind format conversion can be done in anything from 2 minutes [R.33] to an hour. It just doesn't make sense for us to keep posting these files when they contain nothing new, and especially when two people may want to convert the same text. It is likely that, at some time in the next couple of years, we will start on a large-scale conversion project, to add some form of markup to all of the existing text files for ease of serving, and having a mish-mash of existing markup styles to deal with at that point won't help either.

H.9. How big can the images be in a HTML file?

The images should be as big as necessary, and no bigger.

Sorry, but there is no clear number to give here. Web page designers sweat blood to save an extra 20K on a page; so should you. If you're an experienced HTML maker, you know this stuff; if you're not, take it as a guideline that you should generally aim to keep your images in the 30K to 50K size range, with occasional forays into 70-80K territory. That's generally big enough for a clear picture, unless you're reproducing fine artwork.

H.10. The images I've scanned are too big for inclusion in HTML. What can I do about it?

This is a common problem, where images from the book occupy a full or half page. Your images should be of an appropriate size for downloading, and 2 megabytes of high-quality scan per image is not really an appropriate size for most PG texts!

You should reduce the size, and maybe the quality, of the original scan for simple viewing purposes. There is lots of image-manipulation software to do this. For Windows, you might look at the freeware Irfanview, and for both *nix and Windows there is ImageMagick [P.1]. Look for the words "resize" and "resample" in the Help.

Apart from simple converters, which do enough for this purpose, you can also manipulate the images in full image creation and editing packages like Paint Shop Pro, Adobe Photoshop and The Gimp [P.1].

Different image encoding methods can make a huge difference to the filesize. Any of the packages mentioned above can encode images as GIF, JPEG or PNG, and, particularly for black and white line drawings, these can encode to very different sizes. So, for example, a 60K JPEG may save as a 30K GIF, because the GIF encoding works better for that particular image. Try your images out, and see what works.

When manipulating images, always work from your original. Don't convert your original to a JPEG, and then shrink that and convert it to a GIF. Depending on the format, images may lose definition as they are converted (search for "lossy compression" in your favorite search engine to find out more about this), and they certainly lose definition as they are resized, and you end up with the "imperfect copy of an imperfect copy of an . . ." effect. When you're experimenting, take your original, resize and Save As GIF, then go back to your original, resize and Save As JPG, and so on.

You can also use an image optimizer. These are specialist software programs that try to make image files smaller without sacrificing resolution or detail.

H.11. Can I include decorative images I've made or found?

No.

Please include only the images you got from the book. If you want to make an edition of the book for your own web site, you can of course use whatever you like there, but for PG purposes, we want the book, the whole book, and nothing but the book.

H.12. How can I make a plain text version from a HTML file?

You can edit out the HTML by hand, of course, but there are several easier ways to convert.

You can view the HTML in a browser, Select All text, and just Copy and Paste into your editor. This is easiest, but doesn't handle formatting like tables very well.

You can use the Lynx [P.1] browser to convert your text with the command:

lynx -dump myfile.html > myfile.txt

Bruce Guthrie's HTMSTRIP for MS-DOS [P.1] is very configurable.

has a list of other HTML to plain text converters.
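If none of those tools is to hand, a small script can do a rough conversion. The following is a minimal Python sketch using the standard html.parser module, not a replacement for the converters above: the names are made up for illustration, it puts a line break around a fixed list of block tags, and, like the copy-and-paste method, it does nothing clever with tables.

```python
from html.parser import HTMLParser

# Tags that should start or end a line in the text output (an assumption
# of this sketch; extend the set to suit your own HTML).
BLOCK_TAGS = {"p", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr", "div", "pre"}
SKIPPED_TAGS = {"title", "style", "script"}


class TextExtractor(HTMLParser):
    """Keeps the text of the body and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; etc. into characters
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skipping += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skipping = max(0, self.skipping - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def html_to_text(html_text):
    """Return the readable text of html_text with single blank lines
    between blocks."""
    extractor = TextExtractor()
    extractor.feed(html_text)
    extractor.close()
    lines = [" ".join(line.split()) for line in "".join(extractor.parts).splitlines()]
    kept = []
    for line in lines:
        # Keep text lines, and at most one blank line between them.
        if line or (kept and kept[-1]):
            kept.append(line)
    return "\n".join(kept).strip()
```

The result still needs the usual PG clean-up by hand: re-wrapping, restoring italics markers, and checking anything that was laid out as a table.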

H.13. How can I make a HTML version from my plain text file?

This is not a course in HTML, but, for most books, you don't really need a course in HTML. Making a HTML format of most books is very easy, and doesn't take long, once you have mastered basic HTML. Let's assume you have your completed PG plain text file ready, and walk through the steps commonly needed to make a HTML version. We'll do this by successive approximation, doing the major things first, and then dealing more and more with the detail.

There are lots of specialized HTML editors out there, but you don't actually need any of them. The same editor that you used to create your text will also create your HTML. HTML is just text, with two types of special instructions added: tags and entities.

A tag is an instruction to the browser, usually to display something with specific rules. Tags are shown within angled brackets: for example, <p> is the instruction to start a new paragraph.

An entity is a named special character that might not be available in your character set. Entities are shown starting with an ampersand "&" and ending with a semi-colon ";": for example, &mdash; is the representation of an em-dash.

I'm marking up a made-up short text as I write these steps, loosely based on the sample page from question [V.121]. You can see the changes made at each stage by looking at the files

htmstep0.txt (text before starting)
htmstep1.htm (after adding the HTML header and footer)
htmstep2.htm (after adding paragraph marks)
htmstep3.htm (after marking main headings)
htmstep4.htm (after adding special line breaks and indents)
htmstep5.htm (after adding italics and bold)
htmstep6.htm (after adding accents and non-ASCII characters)
htmstep7.htm (after adding an image)
htmstep8.htm (showing some extra techniques)

Before you start, make sure that you can see these files both in your browser and in your editor. In your editor, you should see the HTML codes; in your browser, you should see the text as it is intended to be viewed.

Note for people who already know HTML: yes, this example omits lots of possible ways to do things, and lots of refinements. You already know how to do what you want to do—skip onwards, and give the beginners room to learn in peace! :-)

Step 1. Add the HTML header and footer information

Add the following lines at the top of your text file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>The Project Gutenberg eBook of My Book, by A. N. Author</title>
</head>
<body>

Let's explain these one by one:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

says that your file is HTML 4.01 Transitional, which is the latest version at the time of writing, allowing the widest range of tags and entities.

<html>

denotes the start of the HTML.

<head>

denotes the start of the HTML header information.

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

says that the characters are text, using ISO-8859-1 encoding. If you need to use a different character set, you should change ISO-8859-1 to whatever you intend to use. ISO-8859-1 is good for lots of PG books in English that use French or German words.

<title>The Project Gutenberg eBook of My Book, by A. N. Author</title>

You should obviously change this to the actual title and author you're producing. The

</head>

denotes the end of the HTML header information and

<body>

denotes the start of the actual text itself - the body of the book.

At the very end of the file, you should append these two lines:

</body>
</html>

These denote the end of the body of the book, and the end of the HTML.

At this point, you actually have a valid HTML file! OK, if you view it with a browser, it doesn't look anything like the way it's supposed to, but it is HTML. Save it with a name like MYFILE1.HTM or STEP1.HTM and get a copy of Tidy for your DOS, Unix, Mac or Windows system. Run Tidy on your file, telling it just to look for errors (tidy -e if running from a command-line; if you're using a GUI version, there should be a menu option or tickbox for showing errors only). Tidy should tell you that there are no errors. Yay!

If it does say that there are errors, deal with them now, before you continue. Make sure, at each step, that you have cleaned up any errors; it's a lot easier now than later. Also, when you've finished each step, save your file with a number in its name, so that if you run into problems later and get confused, you can, at worst, drop back to the correct version at the end of the previous step.
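If you expect to do this for more than one book, you can let a script add the header and footer described in this step. The following is a minimal Python sketch under a few assumptions: the function name is made up, the header is the ordinary HTML 4.01 Transitional boilerplate with a Content-Type line for the character set, and the title and author are escaped so that an ampersand in a title doesn't trip up Tidy.

```python
import html

HEADER_TEMPLATE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset={charset}">
<title>The Project Gutenberg eBook of {title}, by {author}</title>
</head>
<body>
"""

FOOTER = "</body>\n</html>\n"


def wrap_in_html(text, title, author, charset="ISO-8859-1"):
    """Return text with the HTML header placed before it and the footer after."""
    header = HEADER_TEMPLATE.format(
        charset=charset,
        title=html.escape(title, quote=False),
        author=html.escape(author, quote=False),
    )
    return header + text.rstrip("\n") + "\n" + FOOTER


print(wrap_in_html("Some text.\n", "My Book", "A. N. Author"))
```

Whichever way you add the lines, run Tidy on the result before going on to the next step.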

The most likely error you might have at this point relates to the characters "<", ">", or "&". These are the characters used by HTML to indicate tags and entities. If these characters are used in the text of your file (and ampersand is likely to be), you should replace them with entities, so that HTML will know that they are to be displayed as characters, not interpreted as commands.

Replace

& with &amp;
< with &lt;
> with &gt;

There is an example of this in the file htmstep1.htm
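Your editor's search-and-replace will do this, but mind the order: replace the ampersands first, or you will mangle the entities you have just typed. In Python, the standard html.escape function does all three replacements in the right order. This is a minimal sketch, and the wrapper name is only illustrative.

```python
import html


def escape_for_html(text):
    """Replace &, < and > with their entities.

    quote=False leaves quotation marks alone, which is all this step needs.
    """
    return html.escape(text, quote=False)


print(escape_for_html("Smith & Jones <London>"))
```

Run it on the plain text only, and only once. Text that already contains tags or entities would be escaped a second time, so that &amp; would turn into &amp;amp;.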

Step 2. Add paragraph marks.

For novels and general prose, paragraphs are the main logical and display unit. Paragraphs are marked in HTML with the sign

at the start, and

at the end. You don't actually need the

at the end, but adding these is a good habit to get into. You do, very much, need the

at the start.

The line-lengths within a

pair are irrelevant; the browser in which the text is viewed will ignore extra spaces and line-ends, and will wrap text to fit the screen. This is bad for poetry and tables, but we will discuss those later. For this step, all you need to know is that you can leave your text exactly as it is, and just add the paragraph marks.

Put a <p> at the start of the line before the first letter of every paragraph, and a </p> just after the last letter or punctuation of every paragraph. If you can do macros in your editor, this will just take a minute; otherwise, it may be rather boring, but at least it is simple. For this step, put the paragraph marks around everything that has a blank line after it, even poetry or chapter titles. We'll come back and change that later.

Now save your text as something like MYFILE2.HTM or STEP2.HTM. Again, run Tidy to check for errors, and fix them before continuing.

If you now look at the file htmstep2.htm in your browser, you will see that it is starting to take shape. Look at it in your editor, and you will see the paragraph marks.
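If your editor has no macros, a short script can add the paragraph marks for you. This is a rough sketch in Python, not the method used to make the sample files; the function name add_paragraph_marks is invented here, and it assumes that paragraphs in your text are separated by at least one blank line.

```python
import re


def add_paragraph_marks(text: str) -> str:
    """Wrap every blank-line-separated block of text in <p> ... </p>."""
    # Split wherever one or more blank lines occur; a line holding
    # only spaces counts as blank.
    blocks = re.split(r"\n\s*\n", text.strip())
    marked = ["<p>" + block.strip() + "</p>" for block in blocks if block.strip()]
    return "\n\n".join(marked) + "\n"
```

Line ends inside a block are left exactly as they were, which is fine, because the browser ignores them anyway. Chapter titles and verse get marked as paragraphs too, just as this step asks.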

Step 3. Add marks for headings.

We want to indicate to the reader that certain lines are for chapter or other headings. HTML provides the tags <h1>, <h2>, and so on for this. <h1> is for the biggest heading, and usually, you will reserve this for the title, and use <h2> for chapter headings. If you find these too big, you could choose <h2> for main headings, and <h3> for chapters. Whenever you use one of these header tags, you must close it with its equivalent end tag. So a chapter heading might look like:

<h2>Chapter XI</h2>

Since there won't be many headers, and most headers are only on one line, this is usually not hard. Look at the file htmstep3.htm to see how our sample is improving, and if you're working along with me, don't forget to save your file under a new name and check it.

In our example, we have marked some lines with paragraph marks where we now want to put headings, so we will change those <p> tags into heading tags, since we don't need or want to mark a line as both.

Step 4. Line up verse, tables of contents, and other lists.

The HTML tag <br> tells the browser to force a line break without starting a new paragraph. We use this when we don't want text all wrapped together, but not separated with blank lines either, for example in verse and tables of contents.

In our sample, we add the <br> tag to the end of each line in the table of contents and the end of each line of the verse. If we were working on a whole book of poetry, the same principle would apply, but we'd be using the <br> tag a lot more.

Where we want to indent a line of poetry, we can use "&nbsp;" at the start of the line. Normally, however many spaces you leave between words, HTML condenses them to one space, so normal indentation doesn't work. But the "non-breaking space" entity will cause the browser to show one space for each entity, so that you can indent as much as you need.

The file htmstep4.htm shows the effect: this is now an entirely readable HTML text!
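For a long poem, adding the line breaks and indents by hand gets tedious. Here is a small Python sketch of the idea, with the assumption that indentation in your text is made of ordinary spaces, not tabs; the function name mark_verse is hypothetical.

```python
def mark_verse(lines):
    """Add <br> to the end of each line, turning leading spaces into &nbsp;."""
    marked = []
    for line in lines:
        body = line.lstrip(" ")
        indent = len(line) - len(body)  # number of leading spaces
        marked.append("&nbsp;" * indent + body.rstrip() + "<br>")
    return marked


for out in mark_verse(["First line of verse", "  indented second line"]):
    print(out)
```

Each leading space becomes one &nbsp;, so the indent in the browser matches the indent in the text version.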

Step 5. Add back in italics and bold.

The HTML tag <i> tells the browser to start displaying italics, and the </i> tells it to stop. Similarly, the <b> tag tells it to display bold, and </b> marks the end of the bold text. See htmstep5.htm for the changes.

Step 6. Restore accents and special characters.

Since we declared our HTML file to use ISO-8859-1 back at the start, we can use any of the common accented characters for Western European languages, but we may also use HTML entities. For example, for the "a circumflex" in "flaneur", we can use either the ISO-8859 character directly, or the HTML entity name "&acirc;" or number "&#226;".

There is a trade-off between characters and entities: entities do not limit you to any particular character set, but characters are directly readable when looking at the HTML source.

Within entities, there is also a trade-off between entity names and numbers: older browsers may not recognize some of the entity names, but the entities do make the text work in multiple character sets. Which you choose is entirely up to you, but it's best to be consistent; if you like entities, use them everywhere. Entities can be represented by their names (for example, &mdash;) or by their number, derived from their ISO-10646 (see Unicode) number (for example, &#8212;).

There are other special character entities you may choose, to replace the ASCII equivalents in the main text. Here are some of the common ones:

We've already seen

&amp;    &#38;    ampersand       replaces "&"
&lt;     &#60;    less than       replaces "<"
&gt;     &#62;    greater than    replaces ">"
&nbsp;   &#160;   space           replaces a space when you want to indent

and these are also very useful for many PG texts:

&mdash;  &#8212;  em-dash         replaces "--"
&deg;    &#176;   degree          replaces "deg." or "degrees"
&pound;  &#163;   British pound   replaces "L" or "l" or "pounds"

There are many others, and fuller lists are easy to find on the Web. Please note that you don't have to use these entities in your HTML; if you're happy with the text reading "500 pounds", there is no need to make that "&pound;500".

I've made a couple of entity changes in htmstep6.htm.
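If you decide to use entities everywhere, the conversion from characters can be automated. The sketch below, in Python, relies on the standard library's html.entities.codepoint2name table to look up entity names; the function name to_entities and its use_names switch are invented for this example, and any character without a known name falls back to its number.

```python
from html.entities import codepoint2name


def to_entities(text: str, use_names: bool = True) -> str:
    """Replace every non-ASCII character with an HTML entity."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # plain ASCII stays as it is
        elif use_names and code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")  # named entity
        else:
            out.append("&#" + str(code) + ";")  # numbered entity
    return "".join(out)


print(to_entities("fl\u00e2neur"))                   # prints: fl&acirc;neur
print(to_entities("fl\u00e2neur", use_names=False))  # prints: fl&#226;neur
```

Note that this leaves "&", "<" and ">" alone, since they are ASCII; those need the separate treatment described under Step 1.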

Step 7. Link Images into the text.

First, you need to have your image ready. You should already have resized your image to the size you want it to be viewed at. You should also have saved it as a GIF, JPG, or PNG image, since those are the formats most supported by current browsers.

If your image is named front.gif, and it is a picture of the frontispiece of the book, you should add the line

<img src="front.gif" alt="Frontispiece">

to your HTML at the place where you want it displayed.

The "alt" text gives a label to the image, and is displayed if the image can't be shown, or in the case of a browser for visually impaired people.

You don't have to add images with your HTML file, unless you want to. In many older books, there are no images at all to be added.

My final HTML text is now in htmstep7.htm. You need to have the image front.gif in the same directory in order to see it. When your HTML text is posted, the images will be zipped with it, so that future readers can see them.

Step 8. Over to you!

This is enough to make a reasonable HTML format of most PG texts, but it doesn't begin to cover everything that can be done in HTML. If you've gone this far, I recommend the W3C's tutorials, which cover the ground we've just crossed, and go a bit further.

Here are a few more things you might want to know, but don't go nuts adding tags just because you can! Use them only when you really need them. The file htmstep8.htm shows some of these techniques. Personally, I think that this is a bit overdone, and I prefer the effect of htmstep7, with left-aligned chapter headings, but that's a matter of taste.

Once you're used to the basic HTML needed for most PG eBooks, you'll probably be able to convert one in under an hour.

How do I force more space between specific paragraphs?

Insert a blank paragraph like this:

<p>&nbsp;</p>

or use an extra <br> tag.

How do I make text, or image, or headings centered?

Put the <center> and </center> tags around what you want centered, like:

<center>Chapter 12</center>

How do I make some text bigger or smaller?

Put the <big> and </big>, or <small> and </small> tags around it.

How do I lay out tabular information?

The simplest way to do it is with the <pre> and </pre> tags. These will cause whatever is within them to be displayed as plain text, just as it was in the original, so that spaces separate the entries just as they did in the text version. You can also use this for poetry, though you usually won't need to. It's not entirely satisfactory, but it will work.

Making a full HTML table requires you to use the <table>, <tr> (table row), and <td> (table detail) tags, among others, and a full exposition of tables is beyond the scope of this FAQ.

Briefly, you start a table with the <table> tag.

For each row you want in the table, you open and close a table row with <tr> and </tr>, and then for each cell within a row, you specify a <td> tag and the contents of that cell, like:

<table>
<tr><td>This is the Top Left cell</td><td>This is the Top Right cell</td></tr>
<tr><td>This is the Bottom Left cell</td><td>This is the Bottom Right cell</td></tr>
</table>

This only scratches the surface of tables. However, there are many guides available on the Web, and they're easy to find, once you know which tags you're looking for. A brief discussion of tables is provided by the W3C as part of the HTML 4.01 spec, and tutorials on the Web also show how to make HTML tables.
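If your tabular data is already in rows and columns, the table tags can be generated for you. This Python sketch shows one way it might be done; make_table is a made-up name, and it assumes each row is simply a list of cell texts. Cell contents are escaped so that a stray "&" or "<" does not break the HTML.

```python
import html


def make_table(rows):
    """Build a simple HTML table from a list of rows, each a list of cell strings."""
    lines = ["<table>"]
    for row in rows:
        # One <td> ... </td> pair per cell, all on one line per row.
        cells = "".join("<td>" + html.escape(cell, quote=False) + "</td>" for cell in row)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(make_table([["Top Left", "Top Right"], ["Bottom Left", "Bottom Right"]]))
```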

Step 9. Some common problems

When you're just starting to code HTML, it may seem that errors are coming at you from all sides. Tidy may spew out a stream of complaints that you don't recognize or understand. If it's any consolation, this is normal!

Just take the error list one line at a time, starting at the top. Often, one actual mistake, like not closing a tag, may cause many errors, since an unclosed tag can cause many subsequent tags to be reported as errors.

Common errors include:

1. Simple typos in tags, like <h2>Chapter 3</h3> instead of

<h2>Chapter 3</h2>

2. Unclosed tags, like forgetting to add the </h2> in the sample above, or forgetting the slash in the closing tag so that you type <i>italics<i> instead of <i>italics</i>.

3. Not nesting tags correctly. Get used to thinking of tags as brackets; the first one opened should be the last one closed. For example, you should type:

<center><i>This is centered.</i></center>

instead of

<center><i>This is centered.</center></i>
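The "tags as brackets" idea can be checked mechanically with a stack: push each tag as it opens, and expect each closing tag to match the one most recently opened. The following Python sketch uses the standard library's html.parser to do that. It is not a replacement for Tidy, the names NestingChecker and check_nesting are invented here, and it is stricter than HTML itself, since it expects every <p> to have its </p>.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they are not tracked.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class NestingChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags opened but not yet closed
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()  # closes the most recently opened tag
        else:
            line, _ = self.getpos()
            last = self.open_tags[-1] if self.open_tags else "nothing"
            self.problems.append(
                f"line {line}: </{tag}> found, but the last tag opened was {last}"
            )


def check_nesting(source):
    """Return a list of nesting problems; an empty list means none were found."""
    checker = NestingChecker()
    checker.feed(source)
    checker.close()
    problems = list(checker.problems)
    problems += [f"<{tag}> is never closed" for tag in checker.open_tags]
    return problems
```

As with Tidy, one real mistake can produce more than one complaint, so start with the first problem in the list.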

One option for making an HTML version is to use GutenMark to create the basic HTML straight from your text, and then edit the resulting HTML to add the features you want. If you're having a lot of problems with your main conversion, this is worth a try.

Programs and programmers FAQ

P.1. What useful programs are available for Project Gutenberg work?

These suggestions came largely from a poll of volunteers in June, 2002. The programs listed are a summary of the programs we actually use. There are many other programs out there that can do the same jobs, so don't limit your search just to these.

Abbyy
OmniPage
TextBridge

These are the three main commercial packages that volunteers bought specifically for the purpose. In a few cases, people had got older versions of these bundled with their scanners.

Clara OCR
Gocr

These are Free Software packages. Some people who responded to the survey had tried them, but nobody had actually used them to produce a text.

DocMorph — a free, web-based OCR

This one is interesting—you can just submit your image through a web page, and the service will return OCRed text. However, the process of submission, waiting for your text, and then cutting and pasting into your document is slow.

Other volunteers use various OCR software that came bundled with their scanner.

2. Editing

The main answers, given by more than one person, were:

AbiWord
emacs
Microsoft Word
vi
Windows WordPad
Word Perfect

Other editors mentioned included:

Crisp for Windows
EditPad
Editplus for Windows
Foxpro 2.6 for DOS
Metapad
Windows Notepad

Programs recommended by Apple Macintosh users included:

AppleWorks
BBEdit Lite
Microsoft Word
Nisus Writer
Text-Edit Plus
TextSpresso
Add/Strip

3. Checking and proofing

For spelling, most people just use the spellchecker built into their editor or word-processor. The *nix users running emacs or vi tended to use variants of the standard Unix spell command, such as ispell or aspell. Mac users have the free spelling checker Excalibur.

Gutcheck was used for format checking, and a few people had written some checking procedures of their own.

4. Working with HTML

In the survey, most volunteers preferred to handcraft their HTML using their normal editor. Those using a word processor edited the HTML as text, rather than composing a word processor file and then Saving As HTML. There was remarkable unanimity on this.

Specific HTML editors that were mentioned for occasional use were:

Adobe PageMill (no longer available)
Mozilla Composer
HTMLKit
HTMLPad

However, not all HTML work is about editing, and the following packages were honorably mentioned for other functions. Especially important is Tidy, which is pretty much necessary for all but the most experienced people for quick HTML checking. The Tidy home page has the original, and links to versions of Tidy for Windows (Tidy-GUI) and just about all other platforms.

GutenMark: Converts Project Gutenberg texts to HTML and TeX.

HTMSTRIP by Bruce Guthrie: MS-DOS. Converts HTML to text.

Lynx (lynx --dump): Converts HTML to text.

Dave Raggett's HTML Tidy: Checks HTML for correctness, reformats and fixes.

W3C html2txt (web-based): Converts HTML to plain text.

W3C Validator (web-based): The Last Word on the correctness of HTML.

wget: A very neat utility for getting web pages.

5. Working with images.

There are two main applications of images in PG—images to be used within texts, like illustrations in HTML, and the management of page images for scanning. These packages are used by volunteers variously for both of those purposes. Their typical use within PG is indicated. "Advanced image processing" packages will permit you to edit and restore damaged images, but for PG work, we mostly just need to manage, convert, resize and crop them.

ACDSEE for Windows: For image reviewing

Adobe Photoshop: For advanced image processing

ImageMagick for *nix, Mac and Windows: Resizing and format conversion

Irfanview for Windows: Image viewing, conversion, cropping and resizing

The Gimp: For advanced image processing

Picture Publisher: For advanced image processing

VuePrint Pro: For viewing images

Proofreaders' Toolkit (PRTK): For splitting batches of image files into individual pages

P.2. What programs could I write to help with PG work?

Look at the programs listed above in [P.1]. Can you write a better version of any of them? Improving OCR and editors constitutes a major challenge, unless you're a world-class expert, but checking and reformatting texts is an area not addressed by large scale programs, and you might contribute there.

Formats FAQ

F.1. What formats does Project Gutenberg publish?

In principle, there's no format that we won't publish, but, in practice, we prefer formats that are open and editable.

An open format is one whose structure is publicly defined and documented, and not burdened with patent or trade secret or copy-protection (a.k.a. "DRM") restrictions. Anyone can write a reader or creator for an open format, and in 500 years' time, anyone interested will still be able to write a program to display the file. Closed formats, by contrast, will almost certainly be unreadable in just a few decades, when the companies now promoting them disappear, or lose interest, or decide to stop supporting them because they want to sell a replacement.

Being able to edit the file is also important. We make corrections to our editions constantly, and it is important to us that we should be able to update our files easily. If adding one word to a sentence involves a complete re-marking of the whole text and a complete rebuild of the file, we have to ask ourselves whether this format is really necessary for this text. Further, the people who re-use our texts should also be allowed to copy and reformat them freely, and non-editable formats restrict their ability to do this in various ways.

F.2. What is, and how do I make or use:

[Note: Character sets and formats are both listed here. Character sets refer to the characters you can use; formats describe how those characters are put together. For non-text formats such as music files, there is no exact equivalent to a character set.]

ASCII (Character Set)

ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer.

You can view or edit ASCII text using just about every text editor or viewer in the world.
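A quick way to confirm that a text really is plain ASCII is to look for any character numbered above 127. This is a minimal Python sketch of such a check; the function name non_ascii_characters is made up for illustration.

```python
def non_ascii_characters(text):
    """Return (line number, column, character) for each non-ASCII character."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # ASCII characters are numbered 0 through 127
                found.append((line_no, col, ch))
    return found
```

An empty list means every character in the text is ASCII; otherwise the list tells you where to look.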

Big-5 (Character Set)

Big-5 is a set of 13,494 traditional Chinese characters. You will need to use an editor or viewer that supports the character set.

Codepage 437, 850, 1252, etc. (Character Sets)

These codepages are Microsoft-specific character sets which allow the display of accented characters and other symbols. To view a text that uses one of these, you will have to use a Microsoft application that supports them. Many of the fonts supplied with Word for Windows will display and edit CP-1252 correctly. For Codepages 437 and 850, you may have to open a Command Prompt and use a DOS editor like EDIT. A Web search should bring up information about the codepage you're interested in. For Unix users, iconv and recode provide translation facilities from one character set to another, and support many or all of the MS codepages.
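If you don't have iconv or recode to hand, the same kind of translation can be sketched in a few lines of Python, whose built-in codecs include cp437, cp850, cp1252 and mac_roman. The function name convert_codepage and its default arguments are assumptions made for this example, not part of any of the tools named above.

```python
def convert_codepage(data: bytes, source: str = "cp437", target: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from one character set to another."""
    text = data.decode(source)   # bytes in the old set -> characters
    return text.encode(target)   # characters -> bytes in the new set


# An e-acute is byte 0x82 in Codepage 437 and byte 0xE9 in ISO-8859-1.
print(convert_codepage(b"caf\x82"))  # prints: b'caf\xe9'
```

If the text holds a character that the target set does not contain (a box-drawing character from Codepage 437, say), the encode step raises a UnicodeEncodeError, so you find out instead of silently losing the character.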

DVI (Format)

DVI stands for DeVice Independent, and is commonly used to store text involving complex mathematical symbols and expressions, together with instructions for displaying it, though it can be used for any content. Given a DVI file, you need a viewer to render it on the specific device you're using. Specifically, DVI is used as the standard output format for TeX, discussed below.

HTML/HTM (Format)

HyperText Markup Language defines the standard format of web pages. You should be able to view these with any web browser, and edit them with any text editor or a specialized HTML editor. The W3C's HTML specification is the definitive reference.

ISO-8859/ISO-Latin (Character Sets)

ISO-8859 is a series of character sets used to represent the accented characters most commonly used in European languages. There's ISO-8859-1, ISO-8859-2, and so on. ISO-Latin is just another name for the same thing.

LIT (Format for PDA-based eBooks)

This is a proprietary, closed format for files that can be displayed only by the Microsoft Reader. Search for more information. It is not possible to edit or correct files in this format; it is not possible to export files from this format; they have to be made in another format and converted.

MacRoman (Character Set)

MacRoman is an 8-bit Apple Mac-specific character set which allows the display of accented characters and other symbols. To view a text that uses MacRoman, you will have to use an application that supports it, and there are few outside the Apple fold. However, iconv and recode are programs that convert between many character sets, and MacRoman is supported by both.

MID/MIDI (Format for music)

Musical Instrument Digital Interface is a music description language, encompassing not only file formats but definitions of interfaces. A MIDI file contains instructions for sending messages to a musical instrument to recreate the sounds.

MP3 (Format for any audio file)

MPEG-1 Audio Layer 3 was defined by the Moving Picture Experts Group as a means for encoding sounds. Many, many MP3 players exist for all platforms, and can be found easily with a Net search. Copies of the specification can be purchased from the ISO.

MPEG/MPG (Format for moving pictures)

The Moving Picture Experts Group have released a series of formats for encoding video and audio. MPEG (pronounced EM-peg) formats are published and widely used. You will find information about MPEG formats, and software to play MPEG files, all over the Net. You can also purchase the specifications from the ISO.

MUS (Format for music)

MUS from Coda Music is a proprietary, closed format for editing and replaying sheet music. However, we do post music files in this format because of its many features. We hope to be able to post these also in more open standards at some point in the future, but at the moment, there is no open format with similar capabilities.

PDB (Format for PDA-based eBooks)

The Palm Data Base format can actually be used for purposes other than eBooks, and there are many possible variants of formats for Palm-based readers all using the extension PDB on PCs, and they're not all entirely compatible. Some of them are proprietary, and it may not be possible to edit them directly, or export files from these formats; they have to be made in another format and converted. Some can be converted back to text. The most common, though, is the "Palm-DOC" format, which is an open format and can be edited on the Palm itself.

PDF (Format for eBooks)

Portable Document Format is a format for storing texts, containing any fonts or graphics. It is copyrighted by Adobe, but is well and publicly documented. It is sometimes referred to as a kind of compiled Postscript (see PS below). It is viewable using the Adobe Acrobat Reader. It is not possible to edit files in this format.

PRC (Format for PDA-based eBooks)

This is a proprietary format for files that can be displayed only by the MobiPocket Reader. It is not possible to edit or correct files in this format; it is not possible to export files from this format; they have to be made in another format and converted.

PS (Format for text and graphics)

Postscript is technically a programming language, not just a format. It has conditional statements, procedures and program flow control. However, it is commonly referred to as a format. Adobe holds copyright on the Postscript specifications (there have been three "levels" published) but Postscript is well and publicly documented and has wide support, not only in printing, but in screen display as well. Apart from Adobe's official version, you can also render Postscript files with Ghostscript, a Free Software package. Postscript can be edited directly, but any complex editing may present difficulties.

