Chapter 4

If you'd rather send it by e-mail, address the message to any or all of the Posting Team, include the Credits Line and Clearance Line as in the sample above, and attach your text. Again, ZIPped is better, since it avoids certain damage that can happen to a plain text e-mail along the way.

Do not add the Project Gutenberg header or footer to your file unless we specifically ask you to. If you do add it, we'll just have to strip it off again, since we add headers automatically when posting. There are times, perhaps when you're working in an unusual non-editable format, when we may give you a header and ask you to add it, but this is rare.

Please read section "4: Posting" of the FAQ "How does a text get produced?" [V.16] for more detail about what happens in posting. In particular, if you want to draw some peculiarities of this text to the Posting Team's attention, or want feedback on any minor edits done during posting, you should say so in the e-mail you send.

Don't assume that we know anything when you send the e-mail. We don't know what you want us to put on the Credits Line. We don't know that this is an unusual text, and needs some kind of special reformatting. We don't know that the text should be split into two volumes before posting. We don't know that you would really like us to check it closely before posting. You have to tell us, exactly and precisely, what you want on the Credits Line. If the text needs some specific work, you have to tell us exactly what that is. And please do that in your e-mail, not in the text itself. Remember that we could be dealing with five or ten other texts at the same time, and even if the poster you discussed it with two weeks ago is the same one who posts the book, he may not remember.

V.47. What is the "Credits Line"?

The Credits line is a line that the Posting Team can insert into each PG text naming the producer or producers of a particular text.

You should decide what you want on the credits line of your text; it's really not up to us.

Most credits lines are something like:

Produced by John Doe .

If you don't want to be mentioned by name at all, just say, in your e-mail:

Please omit the Credits Line for this text. I want to contribute it anonymously.

If you do want to be mentioned, please give the exact wording you want us to use. Some people want their name only; they don't want us to include their e-mail addresses. Others want to make their e-mail addresses public so that readers can contact them with comments. That is entirely up to you, but you do need to tell us. If you do want to include your e-mail, remember that having it permanently on the net is a spam-magnet, and we can't effectively remove or change it later.

Occasionally, a Credits Line can spill onto more than one line, for example:

This text was converted to HTML by Jane Roe
from an original ASCII text scanned by Jack Went
and proofed by Jill Hill

V.48. How soon after I send it will my text be posted?

First read the "Posting" section of the FAQ "How does a text get produced?" [V.16] to understand the process.

You should expect some response within three or four days. We try to get to all submissions within that time. In most cases, that response will be simply the official notification that it has been posted. If there is a query on your text, for example if we can't find the copyright clearance or if we have trouble converting or correcting your text, we will probably e-mail you back directly with questions.

If you don't hear from us within four days, send a follow-up e-mail; it could be that your original note never got to us, or just fell through the cracks.

If your file happens to arrive while one of us is logged in and working, it could get posted within the hour. Some frequent contributors who know our habits know just how to time their uploads!

V.49. I found a problem with my posted text. What do I do?

Most postings go smoothly, but problems can happen. Sometimes, one of the servers is down. Sometimes a file gets corrupted for some unknown reason. Sometimes, let's face it, we screw up.

Usually, one of the indexers will tell us about it, but if you catch it first, e-mail whoever sent out your notification e-mail and explain the problem. Don't worry; your original file will be quite safe, since we keep these long after posting them.

V.50. Someone has e-mailed me about my posted text, pointing out errors.

Great!

Since you're the original producer, you're in the best position to decide whether these are real errors. If they're right about it, tell the Posting Team and we'll correct the text.

V.51. Someone has e-mailed me about my posted text, thanking me.

Nice feeling, isn't it? :-)

About Proofing

V.52. What role does proofing play in Project Gutenberg?

A very big one!

Typists' work doesn't usually need many corrections, but unfortunately, scanners and OCR packages are far from perfect, and scanned text varies from "almost-right" down to "maybe I should consider typing instead of scanning". Proofing is the process that turns a scan into a readable e-text.

Proofing a typist's work is straightforward; you just read it, and keep an eye out for mistakes. Typists typically have few mistakes in their texts, but the errors that they do make tend to be hard to spot. Proofing OCRed text has its quirks, and you can expect many, many errors to correct.

The only thing that all proofers agree on is to differ in their methods. Some people scan and almost complete the proofing process within their OCR package; others do no editing at all within their OCR. Some spell-check first; others spell-check last. Some work through in one pass, doggedly line by line; others make several light passes. Some start at the end and work backwards! Some proofers mark all queries with special characters like asterisks (*) in the text; most just make all the obvious changes and mark only the dubious ones. Some people always send their texts out for proofing; others prefer to do it all themselves.

So this guide is not prescriptive; this is not the "only way" to do it. The only rule is that, at the end of the process, your e-text should be as error-free as you can make it, and should conform to Gutenberg's editing standards, which are mostly just common sense guidelines to make readable text.

The aim of this FAQ is to give you an understanding of what text looks like when it comes fresh off the scanner, and an overview of the whole process by which it becomes a publishable e-text.

V.53. What is Distributed Proofing?

It has always been common for volunteers to share proofing work among themselves—you take the first five chapters, I'll take the next, and so on.

When you're just starting as a PG volunteer, you should go to one of the Distributed Proofing sites [B.4] and do some work there to get a grounding in the basics and a feel for whether you would like to continue working in PG. In distributed proofing, you get a very short section, as little as a page of text at a time, and usually an image file of the page as it was scanned. You then make the text match the image. This is a great start, since all you have to do is read, compare and correct. However, other work also needs to be done, and will normally be done by the project managers of these sites. The samples below give you an idea of the whole process, and also some ideas of what proofing a whole book from start to finish is like.

V.54. What do I need to proof an e-text?

You actually need only two things: the e-text itself and a text editor or word-processor that can handle book-sized files and save them as text.

Nearly all word processors and text editors in current use will work. Volunteers today use many common programs, including WordPerfect, Microsoft Word, WordPad, DOS EDIT, vi, Brief, Crisp, EditPad, MetaPad, emacs, AbiWord, and the word processors from Open Office and AppleWorks. Since all of them contain the necessary basic functions, the best program is the one you're most comfortable with.

Be cautious with recent, powerful word-processors that "auto-correct" text, or use "smart quotes" or any other such automatic retyping or formatting feature, since they can Do Bad Things to your e-text without your consent! When using any such package, it is best to switch off any feature that makes changes without asking you.

Two utilities which may come in useful are a spell-checker and a version difference checker. These may be built into your word processor, or you may have them as separate packages.

A spell-checker is like a chain-saw: a powerful tool, but one to be used very carefully. It is very easy to say "Yes" to the wrong change, and make a really bad mess of the text. Spell-checkers have problems with proper names, foreign words, archaic usages, and dialects. Incautious use can leave you with a text such as that immortalized in the

Owed two a Spell in Chequer.

Eye half a spell in chequer,
It cane with my Pea Sea.
It plane lee marques four my revue
Miss steaks eye can knot sea.

Every e-text should pass through a spell-checker at some point, but the human half of the partnership needs a very light hand on the confirmations of change!

A difference checker, such as FC or COMP for MS-DOS, diff for Unix or ExamDiff for Windows, may also come in handy. A difference checker compares two versions of the text, and points out the changes. This is important when you've sent a text out for proofing, and you get it back with changes. Rather than re-reading the whole text, you can use a difference checker to highlight the changes so that you can verify them against the printed text. As a proofer, you can use it to compare the original text with what you're sending back to ensure that you've only changed what you meant to change.
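If you have none of these tools to hand, the idea is easy to try out. The following is only a minimal sketch using Python's standard difflib module; the function name changed_lines and the two sample texts are invented for this illustration, and any of the programs named above will do the same job.

```python
import difflib

def changed_lines(original: str, returned: str) -> list:
    """Return only the lines that differ between two versions of a text."""
    diff = difflib.ndiff(original.splitlines(), returned.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are dropped here.
    return [line for line in diff if line.startswith(("- ", "+ "))]

# Hypothetical sample: the proofer changed "he" to "be" on the first line only.
original = "There seemed to he no-one about.\nOnly the cat heard him.\n"
returned = "There seemed to be no-one about.\nOnly the cat heard him.\n"

for line in changed_lines(original, returned):
    print(line)
```

Only the first line is reported, which is the point: you check one change against the book instead of re-reading both versions.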

V.55. Do I need to have a paper copy of the book I'm proofing?

No.

Your job as proofer is to ensure that the e-text you're working on is readable in itself, and contains no obvious errors. Where you think there might be an error, but you're not sure, you mark the spot in the e-text, and let the volunteer who has the paper book look it up.

V.56. What's the difference between "first proof" and "second proof"?

These are fuzzy terms used to indicate how accurate the e-text is, and what type of work is needed to improve it. Quite commonly, the same volunteer who scans the book proofs the whole thing in one or two passes. Sometimes, given a good scan, the text can be sent out for "first proof" with little or no preparatory fixing-up. Often, the scanner makes quite a lot of corrections, then sends the text out for "second proof".

A text is ready for first proofing when it's obvious that there are plenty of errors, but it's possible to figure out, in almost every case, what the correct text should be without needing to refer to the book.

The objective of first proofing is to eliminate all the obvious errors, so that if you speed-read quickly through the text, you probably won't notice any.

Second proofing involves taking a text that has been first-proofed and correcting all the remaining, more subtle errors. Often, some simple errors such as incorrect spacing and quotes may be left for second proofing. Texts that have been typed instead of scanned will always be of at least second-proof quality.

V.57. What do I do with an e-text sent to me for proofing?

First, establish reasonable expectations. A typical book takes 10-15 hours of concentrated effort, and when you first start, you're climbing a learning curve. For your first session, decide to mark out a chapter or two—something like 500 to 1,000 lines—and work only on that. If you get through 1,000 lines in your first sitting, you have done extremely well! It's a good idea to send this first 1,000 lines or so back immediately. The volunteer who sent you the e-text will comment on it, and let you know about any style guidelines you may have breached or common errors you may have missed. Most beginning proofers do make mistakes, so don't worry about it—it's easier to correct these in 1,000 lines than to go back over them in 15,000 lines!

You will usually receive the e-text as an attachment to your e-mail. It's better to send e-texts as attachments than to paste them as text into the body of the e-mail to make sure that the text isn't changed by different e-mail clients. It's better to send e-mailed attachments as ZIP files [R.20], since e-mails sent as text can be damaged along the way. But whether you receive a TXT file or a ZIP file that you have to open, you should save the .TXT file to your hard disk and open it with your editor.

It may be that the text you see appears double-spaced—every second line is blank—or that all the text is on one incredibly long line. This is a familiar effect when moving between a DOS/Windows computer and a Mac or Unix system, but it can happen between any two editors. It is caused by the use of different characters to mark the end of a line. If you have this problem, ask whoever sent you the text to re-send it, telling them what kind of computer and editor you have.
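Asking for a re-send is the simplest cure. If you are curious about what is going on underneath, here is a minimal sketch in Python, not a PG tool: DOS/Windows files end each line with the two bytes CR LF, Unix uses LF alone, and older Mac systems used CR alone. The function names and the sample bytes are assumptions made for the illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count how many of each end-of-line style appear in the raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,                    # DOS/Windows
        "CR": data.count(b"\r") - crlf,  # lone carriage returns (old Mac)
        "LF": data.count(b"\n") - crlf,  # lone line feeds (Unix)
    }

def normalize_line_endings(data: bytes, newline: bytes = b"\r\n") -> bytes:
    """Rewrite every line ending in the data to one consistent style."""
    # Reduce everything to LF first, then expand to the requested ending.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)

# Hypothetical sample that mixes all three styles.
sample = b"one\r\ntwo\rthree\nfour"
print(count_line_endings(sample))
print(normalize_line_endings(sample))
```

A file that reports a mixture of styles, as the sample does, has usually passed through more than one system on its way to you.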

Now you make any changes that obviously need to be made, and mark any places where the text looks wrong, but you're not sure what the right text should be. You can usually use asterisks (*) to mark these dubious spots, but you might use other characters if the text already contains asterisks. When in doubt, mark them all, and let the volunteer with the text sort them out!
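Whoever receives the marked text will want to find every query quickly. Any editor's search function does this; purely as an illustration, the short Python sketch below lists the marked lines. The function name find_query_marks and the sample lines are invented for the example, and the mark is a parameter because, as noted above, asterisks are not always the character used.

```python
def find_query_marks(text: str, mark: str = "*") -> list:
    """Return (line_number, line) pairs for every line containing the mark."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if mark in line
    ]

# Hypothetical proofed text with two queried spots.
proofed = (
    "And I sprinkled white incal * thereon,\n"
    "and entreated with many prayers\n"
    "a barren heifer, * Teiresias alone\n"
)

for number, line in find_query_marks(proofed):
    print(number, line)
```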

It is usually best not to make global changes to line lengths by reformatting lots of paragraphs, since the person who sent you the e-text may want to use a difference checker when you return it, and changed line-lengths throughout mean that every line will be different.

When working on a long text, or when making a lot of changes, it may be wise to save several versions of the text with different filenames at different stages so that if something goes badly wrong, you can revert to the last good version. This applies especially to saving the text just before performing a spell-check.

When you're finished with the e-text, make sure you save it as a plain text file (.TXT) and send it back by zipping it if you can, and attaching it to an e-mail.

V.58. What kinds of errors will I have to correct?

Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.

Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. There are also usually a number of extra or missing spaces in the e-text.

The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.

The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and often, single or double quotes may be mistaken for one of these.

Lower-case m is often mistaken for rn or ni.

The letters h and b, and e and c, are commonly mis-read, and these are probably the hardest of all to catch, since ear/car, eat/cat, he/be, hear/bear, heard/beard are all common words which no spell-checker will flag as problems.

For example:

" Hello1' caIled jirnmy breczily. 11Anyone home ? "

There seemed to he no-oneabout. Only tbe eat beard him."

should read:

"Hello!" called Jimmy breezily, "Anyone home?"

There seemed to be no-one about. Only the cat heard him.
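Some of these confusions leave traces that a simple pattern search can pick up. The sketch below is an illustration only, written in Python with its standard re module; the pattern list and the function name flag_suspects are assumptions, and the list is far from complete. Note what it cannot do: "he" for "be" and "eat" for "cat" are real words, so only a careful human reading will catch them.

```python
import re

# Each pattern targets one confusion described above; illustrative only.
SUSPECT_PATTERNS = {
    "digit 1 touching a letter": re.compile(r"[A-Za-z]1|1[A-Za-z]"),
    "capital I inside a word": re.compile(r"[a-z]I[a-z]"),
    "space before punctuation": re.compile(r"\s[,;:.!?]"),
    "'tbe' for 'the'": re.compile(r"\btbe\b", re.IGNORECASE),
}

def flag_suspects(line: str) -> list:
    """Return the names of all suspect patterns found in one line of text."""
    return [
        name
        for name, pattern in SUSPECT_PATTERNS.items()
        if pattern.search(line)
    ]

# The damaged sample from above, followed by the corrected first line.
lines = [
    "\" Hello1' caIled jirnmy breczily. 11Anyone home ? \"",
    "There seemed to he no-oneabout. Only tbe eat beard him.\"",
    "\"Hello!\" called Jimmy breezily, \"Anyone home?\"",
]

for line in lines:
    print(flag_suspects(line), line)
```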

As well as scanner errors, which affect one letter at a time, you have to keep an eye out for editing mistakes by the volunteer who scanned the text or by previous proofers. These are typically cases where a whole line, paragraph or page has been omitted or misplaced. They show up as sentences that don't make sense, or paragraphs that don't follow from the previous one.

This means that you have to keep reading the flow of the text, so that you can spot context errors as well as typos.

V.59. How long does it take to proof an e-text?

This depends on how long the e-text is, how clean the text is when you start, and how thorough you're being, as well as how much time per day you can give it and how fast you can proof.

On a first proof, it can take a very long time to get the e-text to a readable condition if it scanned badly. As a beginner, you would be unlikely to be given such a difficult text to work with. First proofs are usually done by the same person who did the scanning, and are only given out in the context of established scanning/proofing teams.

You might expect to proof anywhere between 500 and 2,000 lines per hour during a second proof. A short novel or novella might have as few as 6,000 or 7,000 lines; War and Peace weighs in at about 54,000 lines. Most novels run to 10,000 to 15,000 lines. So you might spend anything between 5 and 30 hours second-proofing a standard book, with 10 to 15 hours being typical.
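The arithmetic behind those figures is just line count divided by proofing speed. As a small worked example in Python (the function name hours_range is made up for the illustration; the line counts and speeds are the ones quoted above):

```python
def hours_range(lines: int, slow: int = 500, fast: int = 2000) -> tuple:
    """Return (best_case, worst_case) hours for second-proofing a text.

    slow and fast are proofing speeds in lines per hour.
    """
    return lines / fast, lines / slow

books = [
    ("short novel", 6_000),
    ("typical novel, low end", 10_000),
    ("typical novel, high end", 15_000),
    ("War and Peace", 54_000),
]

for label, lines in books:
    best, worst = hours_range(lines)
    print(f"{label}: {best:.1f} to {worst:.1f} hours")
```

The 5-hour figure is a 10,000-line novel proofed at 2,000 lines per hour, and the 30-hour figure is a 15,000-line novel proofed at 500 lines per hour.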

For an average novel, a week or two for second proofing is good going. A month is reasonable.

Proofing an e-text is a significant amount of work, and you may find it psychologically more comfortable to take on a chunk at a time—say 1,000 lines per session—and send that proofed section back, rather than wait until the whole job is done before sending anything back. This helps to avoid the fairly common case where you keep falling behind where you expect to be until you dread the thought of getting back to the text, and finally just abandon it.

If you find after a while that you just don't want to continue, please tell the person who sent you the text that you're not going ahead with it. It's very frustrating for the volunteer who scanned the book, and who wants to get it posted, to wait for two or three months, only to have to start all over again with another proofer.

V.60. Are there any special techniques for proofing?

The classic way to proof is to open the text in your editor or word processor, and just start reading carefully.

This method has received a major boost since editors and word processors have added a feature of showing squiggly red underlines under words not in their dictionary. While this is very useful, you still need to read carefully, since not all errors produce misspelled words. The classic, and very common, example of this is scanning "he" for "be". These visual spellchecks also commonly do not check words beginning with capitals. Capitalized words are commonly names not in the dictionary, and when checking of capitalized words is switched off, they will not query "Tbe". Other errors that a spellchecker doesn't look for include missing spaces, mismatched quotes and misplaced punctuation. For these, you can try gutcheck [P.1]. And of course, no automatic check will find omitted lines or words. Worse, spellcheckers will query words not in their dictionary that might be quite correct, and this can be quite troublesome when dealing with older texts or dialect.
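To give a feel for the kind of check a spellchecker does not make, here is a minimal Python sketch that looks for paragraphs with an odd number of double quotes. It is not gutcheck and does not reproduce how gutcheck works; the function name and the sample text are assumptions for the illustration. A flagged paragraph is only a query, not necessarily an error: a speech that runs over several paragraphs legitimately opens each paragraph with a quote and closes only the last.

```python
def paragraphs_with_odd_quotes(text: str) -> list:
    """Return 1-based numbers of paragraphs holding an odd number of '"'."""
    # Paragraphs are assumed to be separated by one blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [
        number
        for number, paragraph in enumerate(paragraphs, start=1)
        if paragraph.count('"') % 2 == 1
    ]

# Hypothetical sample: the second paragraph has lost its closing quote.
sample = (
    '"Hello!" called Jimmy breezily.\n\n'
    '"Anyone home? he asked.\n\n'
    "There seemed to be no-one about."
)

print(paragraphs_with_odd_quotes(sample))
```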

Still, if your concentration is up to the job, scrolling through a text with non-dictionary words underlined in red is a fast and effective way of giving a text the final once-over.

Volunteers have also used other techniques for proofing. Some people can't sit at their screen and read for hours; many people don't want to.

Some people just use the good old-fashioned method of printing out the text to be proofed, and blue-pencilling the mistakes.

It is becoming fairly common now for people to load the text onto their PDA, and read it from that. Mistakes found can be bookmarked or jotted down and fixed when they go back to their PC.

Getting your computer to read the text aloud is a very effective way of achieving high accuracy. Modern PCs have audio capabilities built in, and it is possible to find free or cheap shareware "read-aloud" text-to-speech packages for just about everything. Some PDAs are also capable of doing text-to-speech.

The first time you try text-to-speech, it will probably sound and feel a little strange, but you will quickly learn to hear errors in words. This can be very effective, but you should have given the text at least a light proofing before you begin; it is hard to deal with a high number of errors using a text-to-speech method.

When proofing with a speech program, either set your text-to-speech program to pronounce all punctuation or, if that is not possible, make a special version of your text to feed it, first doing a global replace of "," with " comma ", ";" with " semi-colon ", and so on. Mark a block of 500 to 1,000 lines for reading aloud, and set the reading speed to whatever is comfortable for you. Then you sit down with the original book in front of you, and listen. When you hear an error, mark the place in the text with a light pencil. Stopping the reading at every error, editing the text and restarting is possible, but it breaks the flow, and ends up taking longer. When the reading is done, go to your keyboard and correct the errors found.
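The global replace can be done in any editor, one punctuation mark at a time. If you would rather do it in one step, the Python sketch below shows the idea. Only the comma and semi-colon entries come from the text above; the other entries, the wording chosen for them, and the function name are assumptions to adapt to your own speech program. Work on a copy, never on the e-text itself.

```python
# Spoken replacements; only "," and ";" are taken from the FAQ text.
SPOKEN = {
    ",": " comma ",
    ";": " semi-colon ",
    ":": " colon ",
    ".": " period ",
    "?": " question mark ",
    "!": " exclamation mark ",
    '"': " quote ",
}

def spell_out_punctuation(text: str) -> str:
    """Return a copy of the text with punctuation replaced by spoken words."""
    for mark, word in SPOKEN.items():
        text = text.replace(mark, word)
    # Collapse runs of whitespace, including line breaks, that are left over.
    return " ".join(text.split())

print(spell_out_punctuation("first with mead, and thereafter with wine; and water."))
```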

V.61. What actually happens during a proof?

Stage One—The original Scan

We start with a scanned e-text, in this case a paragraph from The Odyssey. The paragraph used as an example here has been "enhanced" with more errors than in the real scanned text, so that you can see samples of many problems all in one place.

We begin by looking at the original OCRed text, of which our sample section reads:

1There Periniedes and Eurylochus held the victims, but ldrew my sharp sword from my thigh, and dug a pit, as it werea cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter withsweet wine, and for the third time with water, And 1 sprink-BOOK XLODYSSEY X, 24-56.173

ODYSS.EY XI, %4-56. 173lef white incal thereon, and entreated with many prayersstrengthless beads of the dead, and prornised that on myreturn to Ithaea 1 would offer in my halls a barren heifer,the best 1 had, and fil the pyre with treasure, and apart untoTeiresias alone sacrifice a black rarn without spot, the fairestof my flock. But when 1 bad hesought the tribes of thed with vows and prayers, 1 took the sheep and cut theirs over the trench. and the dark blood flowed forth,he spirits of the dead that he departed gatheredfrom out of Erebus.

It's clear that we should tidy up the page headings and numbers that have been scanned in with the main text, and that we should separate the paragraphs and remove the spaces inserted by the scan at the start of some lines. We also need to restore some of the text that got lost in the scan. Since there isn't much of it, we just type it in. Having done this, we get to . . .

Stage Two—First pass through the scanned text

At this point, we have a complete text. All of the words are actually there, and we have eliminated page breaks and other extraneous artifacts of scanning. Again, mileage varies: some people like to preserve page breaks and numbering until much later, to make it easy to refer back from the e-text to the book.

Our job in this phase is to fix all of the obvious scanning errors and double-check that we really do have all the text. Our aim here is to create an e-text that is ready for First Proof. In fact, since it's fairly clear what all the words are, this text could be considered ready for first proof.

1There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And 1 sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea 1 would offer in my halls a barren heifer, the best 1 had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when 1 bad besought the tribes of the dead with vows and prayers, 1 took the sheep and cut their throats over the trench. and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.

Now we convert those numeral 1s to capital Is or to quotes, where appropriate, straighten up the quotes, and deal with other obvious scanning errors, which brings us to . . .

Stage Three—The First Proof

At this point, we could hand over the text to an experienced proofer who doesn't have a copy of the book. This would be called a "first proof". An e-text is at first proof stage when there are still plenty of errors, but in each case it's pretty obvious what the correct word is. The excerpt now looks like normal text.

Unfortunately, in stage two above, we accidentally deleted a line.

'There Periniedes and Eurylochus held the victims, but l drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink- offering to all the dead, first with mead and there after with sweet wine, and for the third time with water. And I sprink- led white incal thereon, and entreated with many prayers the strengthless beads of the dead, and prornised that on my return to Ithaea I would offer in my halls a barren heifer, Teiresias alone sacrifice a black rarn without spot, the fairest of my flock. But when I bad besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.

Stage Four—Corrections from First Proof

We receive the first proof back from the proofer, and find that it has been mostly corrected.

The corrections made were "l/I", "there after/thereafter", "prornised/promised", "bad/had", and "rarn/ram".

We have also wrapped the lines—at 60 characters in this case, but it is commonly as much as 70 characters per line. Sentences which look wrong, but where it isn't clear what the right text should be, have been marked with asterisks (*).

'There Periniedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white incal * thereon, and entreated with many prayers the strengthless beads of the dead, and promised that on my return to Ithaea I would offer in my halls a barren heifer, * Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that he departed gathered them from out of Erebus.
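Most editors can re-wrap a paragraph to a chosen width. For anyone who prefers a script, here is a minimal sketch using Python's standard textwrap module; the function name rewrap is invented for the illustration, and the sample is the opening of the excerpt above. Hyphen breaking is switched off so that a word like "drink-offering" is never split across two lines, which is exactly the kind of damage seen in the earlier stages.

```python
import textwrap

def rewrap(text: str, width: int = 60) -> str:
    """Re-wrap each blank-line-separated paragraph to at most width columns."""
    wrapped = []
    for paragraph in text.split("\n\n"):
        # Join the paragraph into one long line before wrapping it again.
        flat = " ".join(paragraph.split())
        wrapped.append(
            textwrap.fill(
                flat,
                width=width,
                break_on_hyphens=False,  # keep "drink-offering" in one piece
                break_long_words=False,  # never cut a word in the middle
            )
        )
    return "\n\n".join(wrapped)

sample = (
    "'There Periniedes and Eurylochus held the victims, but I drew my sharp "
    "sword from my thigh, and dug a pit, as it were a cubit in length and "
    "breadth, and about it poured a drink-offering to all the dead."
)

print(rewrap(sample))
```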

We look up the text where the first proofer has asterisked it, and make the corrections.

The text is now ready for second proofing. An e-text is ready for second proofing when you can skim through the text without noticing that there are errors.

We can either do a second proof ourselves, or send it out for second proofing.

Second proofing involves a very careful reading of the text, looking for small errors. In some ways, it's much harder than first proofing, since it's very easy to let your eyes run on auto-pilot and in doing so, miss subtle errors.

Having performed the second proof, which caught errors like "beads/heads", "Ithaea/Ithaca", "Periniedes/Perimedes" and "he/be", we now have our final e-text.

'There Perimedes and Eurylochus held the victims, but I drew my sharp sword from my thigh, and dug a pit, as it were a cubit in length and breadth, and about it poured a drink-offering to all the dead, first with mead and thereafter with sweet wine, and for the third time with water. And I sprinkled white meal thereon, and entreated with many prayers the strengthless heads of the dead, and promised that on my return to Ithaca I would offer in my halls a barren heifer, the best I had, and fill the pyre with treasure, and apart unto Teiresias alone sacrifice a black ram without spot, the fairest of my flock. But when I had besought the tribes of the dead with vows and prayers, I took the sheep and cut their throats over the trench, and the dark blood flowed forth, and lo, the spirits of the dead that be departed gathered them from out of Erebus.

Hooray! At long last we have an e-text to post, which can be downloaded, read and enjoyed by anyone in the world from now on.

About Net Searching

V.62. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Can I just submit it to PG?

You can submit it, but you can't "just" submit it.

We wish we could give a permanent home to all the etexts that people have produced and placed on the Net, but without proof of their public domain [C.10] status, we can't.

We need to be able to prove that the eBooks we publish are in the public domain, so, in order to use one of the many texts that are just floating around the Net, you need to find a matching paper edition that we can prove is eligible [V.18].

(By the way, please be sure that it isn't already in the PG archive. A lot of texts circulating on the Net originated at PG, and people quite often submit them back to us.)

Before you get into this, you should check whether the text you have found is likely to be in the public domain in the U.S. A quick way to verify this is to search the Library of Congress Catalog site for the title or author. If you find no publications before 1923, then you should probably move on. The Library of Congress doesn't list every book, and in particular doesn't list all books published outside the U.S., but if there isn't a pre-1923 copy there, it may be difficult to follow up on. If you're not dissuaded, do a search on the Net for used book shops that might have pre-1923 copies.

Sometimes, with a text on the Net, you know who typed it; it's on someone's website, or the transcriber is named in the text. Sometimes, the text has just been floating around Usenet or old gopher sites for years, with no attribution.

The first thing to remember is that we would like to give credit to the original transcriber if they want it, and if we can identify them.

The next thing to consider is that the original transcriber may well have an eligible copy of the book, and may be able to provide TP&V [V.25] for it.

So, if you can locate the original transcriber, it makes sense to e-mail them, explain what you propose to do, and ask them whether they can help with copyright clearance and whether they would like to be credited in the PG edition. Often, you will get no response, or a response but no prospect of material that will help with clearance, but sometimes you will get lucky.

If the transcriber can't help with TP&V, it's up to you to find a matching paper edition of the same book. This may not be as hard as it sounds. Libraries can help, and may get editions for you on interlibrary loan.

This is an ideal way for students, academics and librarians to contribute texts to PG, since you probably have access to a good library with stocks of old books to find matching paper editions.

If you find a matching paper edition, you then need to compare the etext you found with the book. Legally, what we're trying to prove here is that we have done "due diligence"—that we have done our best to prove that the etext is indeed a copy of a public domain work.

The minimum "due diligence" we can perform is to compare the first and last pages of each chapter (or every 20 pages where the book is not neatly divided into chapters of about that size). You should list all of the differences between the book and the etext that you find on those pages. It is to be expected that there will be some minor differences of punctuation, spacing and spelling, and even perhaps of wording. Minor differences are OK, but we do need to list them, to prove that we did the comparison. When you have your lists, you can send in the TP&V as normal, accompanied by your lists, for clearance.

Many texts floating round without attribution, and indeed many with attribution, could do with a thorough checking. Another option you have is "comparative retyping", where you go through the whole etext, proofing it carefully against the cleared paper book, and changing everything that is different in the etext to match the paper edition. If you do this, you don't need to produce a list of differences, since there won't be any by the time you've finished; you can just submit it as a normal text, and it may well be a lot cleaner! However, if you do take this path, please do a very thorough job on the proofing and comparison.

If the etext you find has been marked up, in HTML for example, you should remove all HTML for the PG edition, because, even though the text itself has been proved to be in the public domain, the original transcribers may hold copyright on the HTML markup, even if you can't find them. If you do want to make a HTML edition of it for PG, strip out all of the original markup and then re-add your own markup.

If you do find the producer and he or she wants to be identified, you may submit a double credits line like:

Transcribed by Sally Wright
Produced for PG by You

V.63. I've found an eligible text elsewhere on the Net, but it's not in the PG archives. Why should I submit it to PG?

The first reason is file safety.

Yes, we accept that the file is already available to everyone today, but it may not be safe in the long term. We've seen college students who put books on their personal site, and then lose that site when they graduate. We've seen individuals who transcribe several books, and later lose interest, or move, or die, and the work they've done is lost. We've seen small projects with a few volunteers who produce and post books for a few years, but then break up or run out of funds to maintain their site. We've seen large institutions drop their collections as part of a cost-cutting exercise. We've even seen organizations lock public domain works up behind licenses, requiring users to commit to registration and a "no copying" agreement before downloading them.

Whenever a set of etexts is published and distributed by only one person or organization, there is a danger that their etexts will disappear from the Net sometime. We want all etexts to be spread as widely as possible, copied as much as possible, so that no one event or loss, or whim of a sponsor, can obliterate them.

We think that the PG collection is, for that reason, the safest place to put a text for its long-term survival. There are copies of the PG archives all over the world, on public servers and private CDs. PG publications are widely converted, collected and read on PDAs. Other text projects copy works from PG.

The PG archive is so valuable, yet free and easily portable, that even if every current PG volunteer vanished overnight, people around the world would copy and preserve it. Even if PG itself decided to withdraw all our texts, we couldn't do it, because so many people have made copies.

The second reason is legal safety.

Unlike some other projects and individual efforts, PG retains documentary proof of the public domain status of its texts. This is more valuable than it might appear at first glance.

Publishers often claim a new copyright [C.17] on works that they republish, and as time goes on, it becomes harder and harder to prove that a particular book is in the public domain. Walk into your local bookstore and check out how many works by Shakespeare, Poe, Dickens, and Twain have copyright notices on them! People who want to translate these, or create derivative works like screenplays or lyrics or films must first prove that they are basing their work on a public domain edition, but the creeping copyright practices of commercial publishers make that difficult.

Here's a practical example: we were approached by a film student who wanted to make a short piece based on characters from James Joyce's "Ulysses". But before he could do that, he needed to confirm that the material on which he was basing his movie was in the public domain, and all the editions he could find were copyrighted. However, because PG had already established the public domain status of Ulysses, we could point him to our established PD version, and even tell him where to find a paper copy published in 1922. Without that evidence, he could not have made his project.

V.64. I have already scanned or typed a book; it's on my web site. How can I get it included in the Gutenberg archives?

Great! We get these a lot, but it's always nice to see another!

You need to send us the TP&V [V.25] so that we can prove that your edition is in the public domain. If you don't have the TP&V, you will need to find a matching paper book with eligible TP&V for us to be able to use it.

V.65. I have already scanned or typed a book; it's on my web site. The world can already access it. Why should I add it to the Gutenberg archives?

The Project Gutenberg archives are widely copied and searched, and much safer and more permanent than any individual website can possibly be. We aim to keep this collection together over not just years, but centuries. You took the trouble to transcribe this book. We can relate; that's what we do, as well. We know you want this work to survive you and your ISP, and we believe we can do that. And it's not as if you have to take it off your website when we make a copy; you're just using your candle to light another!

If you want to let readers know that your site has other related material, you can put that information in the Credits Line [V.47]. Taking a real-world example, you could ask us to add this to the Credits line for a C. M. Yonge text:

A web page for Charlotte M. Yonge will be found at www.menorot.com/cmyonge.htm

V.66. I have already scanned or typed a book, but it's not in plain text format. Can I submit it to PG?

Yes, of course. We'll be happy to discuss format options with you, and we're quite experienced in converting between multiple formats and deciding which formats work best and will have the longest life. All you need is to get us a copy of your TP&V [V.25].

About author-submitted eBooks:

V.67. I've written a book. Will PG publish it?

Maybe.

PG gets submissions from young people, for example, who just want to get a story they wrote published in PG. We wish them well with their writing, but that's not really why we're here.

If you are a published author, or perhaps an academic who wants to put a textbook into the archives, it's quite likely that we will publish it.

V.68. I have translated a classic book from one language to another. Will PG publish my translation?

Yes, if we can.

The book that you translated needs to be in the public domain, and we will need the same proof of eligibility that we would use if you were contributing the book in its original language.

For example, if you were translating Hesse's Siddhartha (published pre-1923 in German, but no pre-1923 English translation available), we would need to copyright clear [V.25] the original German edition from which you worked—it needs to be a pre-1923 or otherwise public domain edition. (We actually did this one, thanks to the hard work and scholarship of some volunteers.)

V.69. OK, this is one of the cases where PG will publish it. What do I do next?

You need to decide about copyright issues. Do you want to release your work to the public domain, or do you want to retain copyright? If you want to retain copyright, what terms do you want to release it under? The next few questions deal with those issues.

Having decided that you want PG to publish it, and decided what restrictions (if any) you want to place on further distribution, you just need to write the appropriate letter and send the text to us. [V.46]

V.70. I hold the copyright on a book. Can I release it to the public domain?

You can. All you need to do is put a statement into the released version of the text saying that you have.

If you want to release it into the public domain and distribute it through Project Gutenberg, you should send us a letter to that effect.

To: Michael S. Hart
Founder, Project Gutenberg
405 West Elm Street
Urbana IL, 61801-3231, USA

Dear Project Gutenberg:

I am the sole copyright holder for the book, "Wallaby Happiness." It gives me pleasure to release this work into the public domain, and I invite Project Gutenberg to publish this public domain edition.

Sincerely,

Gregory B. Newby

Once you have released it into the public domain, neither we nor anyone else needs your permission to publish it, but for us to be sure that it is a public domain version, we do need a signed letter.

V.71. I hold the copyright on a book. Do I have to release the book into the public domain for Project Gutenberg to publish it?

Absolutely not! For example, many contributors of copyrighted material want to share it with the world, but do not want it commercially republished by other companies.

You can grant Project Gutenberg perpetual, non-exclusive, world-wide rights to distribute your book on a royalty-free basis by sending a letter to Michael Hart. Your letter may be brief, but must be signed, and must include the name of the book and the assertion that you are the copyright holder or the agent for the copyright holder.

If you want some related information, like a link to your website, included in the text, we will be happy to oblige.

Once we have posted a text, many people will copy it. We have no effective mechanism for "recalling" texts that we have posted, so please be sure, before you commit to this, that you intend to follow through with it, because there is no way to change your mind later.

Here is a sample letter, including the address to send it to:

To: Michael S. Hart
Founder, Project Gutenberg
405 West Elm Street
Urbana IL, 61801-3231, USA

Dear Project Gutenberg:

I am the sole copyright holder for the book, "Wallaby Happiness." It gives me pleasure to grant Project Gutenberg perpetual, worldwide, non-exclusive rights to distribute this book in electronic form through Project Gutenberg Web sites, CDs or other current and future formats. No royalties are due for these rights.

Sincerely,

Gregory B. Newby

V.72. I hold the copyright on a book, and would like Project Gutenberg to publish it. Can I choose what rights to assign?

For PG to be in a position to copy it, we do need perpetual, worldwide, non-exclusive, royalty-free rights to distribute the book in electronic form. What rights you choose to assign to readers after that is a decision for you to make.

The Creative Commons site may give you some ideas of what practical use you can make of your copyright to see that the work is used in the ways you intended.

About what goes into the texts:

V.73. Why does PG format texts the way it does?

PG texts are formatted as plain ASCII, with 60-70 characters per line, with a hard return [CR/LF] at end of line, and some people ask "Why do it this way? You could omit the hard returns and let the reader's word processor or Reader software wrap the lines. You could use "8-bit" accented characters for non-English characters. You could use ' - ' instead of '—' for an em-dash." And so on, through a different choice we could make for every formatting feature. And the answer, of course, is that we could do it differently, and sometimes we do, but mostly we keep to one consistent style.

We'll be discussing each of the formatting decisions below, not only giving the summary PG answer, but also discussing the plusses and minuses of each, and the possible options.

Like any question beginning "Why does/doesn't PG . . . ?", the answer is "Because that's what the volunteers and readers want!". These conventions have been worked out over the years, largely by Michael Hart, our founder and chief volunteer, in conjunction with all of us volunteers, as the result of feedback from readers.

We are guided throughout by the principle that we want to produce texts in the simplest format that will adequately express the content. Quoting Michael Hart (1994):

Etext as developed and distributed by Project Gutenberg since 1971 was never intended to be a copy of a paper or a parchment [remember, first Project Gutenberg Etext was typed in from parchment replicas of the US Declaration of Independence].

The major purposes of Project Gutenberg have always been:

1. to encourage the creation and distribution of electronic texts for the general audience.

2. to provide these Etexts in a manner available to everyone in terms of price and accessibility [i.e. no special hardware or software], and no price tag attached to the Etexts themselves.

3. to make the Etexts as readily usable as possible, with no forms or other paperwork required, and as easily readable to the human eyes as to computer programs, and in fact, more readable than paper.

There is sometimes a conflict between "simplest format" and "adequately express the content"; further, different people have different views on what is "simple" or "adequate". You, the producer of the text, have spent the time and effort to make the eBook available to the world, you have thought more about it than anyone else, and we respect your informed judgment. However, please make sure that your judgment has been informed, by studying the precedents and reasons behind our guidelines.

Where a simple, standard PG-ASCII layout does not, in your view, "adequately express the content", you should think of making your text in another open format, perhaps HTML or XML or TeX, that allows you to use more characters, more formatting options, and images. We are always happy to accept these kinds of files. In these cases, you should also provide a standard PG-ASCII version, even if you feel it is unacceptably degraded, for those who cannot use your preferred format.

Just ten years ago, presentation as plain ASCII was not only a universal standard, it was effectively the only way that most people could view the books. The first version of the HTML specification had been drafted, but was unknown among the general public. XML did not exist. SGML was (as it still is) the province of specialists. Specialized eBook readers and PDAs had not yet appeared.

In 2002, plain vanilla ASCII is still readable everywhere, but people also want to convert our texts into other formats for more convenient loading on readers and web sites. We therefore have to keep in mind that our works will be processed by automatic conversion programs, none of which is perfect, and we have evolved some "defensive formatting" practices, which, while retaining the universality of plain text, also supply clues to automatic converters about how they should treat the layout. These do help to keep converters from making at least the worst mistakes. The most significant "defensive formatting" practices are indenting unwrappable text like quotations, and using _underscores_ rather than CAPITALS for italics. Different volunteers have different priorities: at one extreme, some people want to make the best plain text they can, giving no weight to conversion issues; at the other, some people emphasize the cues that will allow automatic reformatters to convert the texts well, even if that causes some ugliness in the plain text. Most of us operate somewhere between, making the choices we feel are best depending on the context. Getting a text on-line is the important thing; which choices you make in doing so is a matter of detail.
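To see why underscores are a more useful cue than CAPITALS, here is a rough sketch of what an automatic converter might do with them. The function name and the regular expression are assumptions for illustration, not a description of any actual PG conversion program; a real converter would need to cope with underscores that span lines or that are not italics at all.

```python
import html
import re

# Matches _some words_ on a single line; an assumed, simplified pattern.
ITALIC = re.compile(r"_([^_\n]+)_")

def underscores_to_html(line):
    """Escape a plain-text line for HTML and turn _text_ into <i>text</i>."""
    escaped = html.escape(line, quote=False)
    return ITALIC.sub(r"<i>\1</i>", escaped)

print(underscores_to_html("She had read _Ulysses_ twice."))
```

A word in CAPITALS gives the converter nothing to go on: it cannot tell italics from a shouted word or an acronym.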

About the characters you use:

V.74. What characters can I use?

a) You should use plain ASCII for straight English texts.

b) When producing a text partly or completely in a language that requires accents, you should use the appropriate ISO-8859 character set for the language, and specify which you are using, and also provide a 7-bit plain ASCII version with the accents stripped.

c) When producing a text in a language that doesn't use one of the ISO-8859 character sets, you should use the encoding most commonly used for that language. [e.g. Chinese—Big 5]

d) When producing a text containing more characters than can be found in any one of the ISO-8859 character sets, you should use Unicode.

You should use plain ASCII wherever possible—that is, the letters and numbers and punctuation available on a standard U.S. keyboard, without accented letters. The immediate and major exception to this is when you are typing a text written in a language like French or German that requires accents.

There is a problem with using non-ASCII characters. They do not display consistently on all computers; in fact, they do not even display consistently on the same computer! On my computer, for example, what looks like an e-acute in this editor just shows as a black box in another editor, or even using a different font in the same editor. And this is by no means confined to some theoretical minority; we have to deal with it all the time when posting texts.

Further, standards are changing: ten years ago, the character set Codepage 850 [MS-DOS] was very common; now it's rare except in some texts that have survived those ten years.

We want to preserve these texts over centuries, not just decades, and at the moment there is no single clear standard that we can use across all texts. Unicode may perhaps be a future standard, but, right now, it's not something that people use every day, and it's not supported by a lot of common software.

ASCII, while limited, is supported by almost all computers everywhere, so we make a point of always supplying an ASCII version where possible, even if the ASCII version is degraded when compared to the 8-bit original. When we get a text in, say, German, we post two versions of it—one with accents and one without.
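If you want to make the accent-stripped version yourself, one possible approach is sketched below in Python. It is an illustration only, not the procedure the Posting Team uses; the function name strip_accents is an assumption, and letters with no unaccented decomposition (the German sharp s, for example) come out as "?" and need handling by hand.

```python
import unicodedata

def strip_accents(text):
    """Return a 7-bit ASCII approximation of text with accents removed."""
    # Split each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is replaced with "?" so it is easy to spot.
    return base_only.encode("ascii", "replace").decode("ascii")

print(strip_accents("d\u00e9bris, fa\u00e7ade, caf\u00e9"))
```

Always read through the result: a stray "?" marks a character that needs a decision from you, not from the script.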

V.75. What is ASCII?

Don't get scared by the computer jargon; ASCII (pronounced ASS-key) is just a name for the set of unaccented letters, numbers and other symbols on a standard U.S. keyboard.

ASCII (American Standard Code for Information Interchange) is a set of common characters, including just about everything that you can type in on an English-language keyboard. It includes the letters A-Z, a-z, space, numbers, punctuation and some basic symbols. Every character in this document is an ASCII character, and each character is identified with a number from 0 through 127 internally in the computer.

Just about every computer in the world can show ASCII characters correctly, which makes it ideal for PG's purpose of providing texts that can be read by anyone, anywhere, but ASCII does not include accented characters, Greek letters, Arabic script and other non-English characters, which causes some problems when we produce texts that need non-ASCII characters.

V.76. So what is ISO-8859? What is Codepage 437? What is Codepage 1252? What is MacRoman?

Today's computers mostly work on the basis of dealing with one "byte" at a time. A byte is a unit of storage that can contain any number from 0 through 255, which is 256 values in all. It's very convenient for computers to associate one character with each of these numbers, so that we can have up to 256 "letters" viewable from the values stored in one byte. The first 128 values, zero through 127, are defined by ASCII; so, for example, in ASCII, the number 65 represents a capital "A", 97 represents a lowercase "a", 49 stands for the digit "1", 45 for the hyphen "-", and so on.
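You can check these numbers for yourself. The small Python sketch below uses the built-in ord function; the helper name ascii_code is an assumption for illustration.

```python
def ascii_code(character):
    """Return the ASCII number of one character, or raise ValueError."""
    code = ord(character)
    if code > 127:
        raise ValueError(f"{character!r} is not an ASCII character")
    return code

# The four examples given in the text above.
for character in ("A", "a", "1", "-"):
    print(repr(character), ascii_code(character))
```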

ASCII doesn't define characters for the values 128 through 255, and in early days computer manufacturers used these values to hold non-ASCII characters like accented letters and box-drawing lines. Of course, 128 wasn't nearly enough values to hold all of the characters that people needed to use for different languages, so they made the character sets switchable, so that a PC in France could use a different set of accented letters from a PC in Poland. Microsoft's version of this was called Codepages. Each Codepage held a different set of non-ASCII characters. Codepage 437, and later Codepage 850, were commonly used for English and some major Western European languages on MS-DOS.

MacRoman was Apple's first codepage, containing most of the accented letters in Latin-derived languages, and MacRoman is still in common use on Apple Macs today.

Later, the International Standards Organization ISO got around to looking at the problem, and defined ISO-8859-1, ISO-8859-2 and so on, as the standards for different language groups. These sets all define the characters 160 through 255 as accented letters and other symbols, and define the 32 characters from 128 through 159 as control characters.

Since Microsoft Windows has no use for the control characters 128 through 159, Windows fonts commonly use Codepage 1252, which has ASCII in the first 128 characters, ISO-8859-1 in characters 160 through 255, and other symbols in the characters 128 through 159. Just to make an already chaotic system worse, all characters can be defined differently in different fonts!

Of course, most of these codepages are incompatible with each other. For example, the byte value 232 shows as a lower-case "e" with a grave accent in ISO-8859-1 and CP1252, a capital letter "E" with diaeresis in MacRoman, a Latin capital letter "Thorn" in CP850, a Cyrillic lower-case "Sha" in ISO-8859-5, a Greek capital letter "Phi" in CP437, and so on. So if you view a text intended for one of these character sets with a program that assumes a different character set, you see gibberish.
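The byte 232 example can be reproduced with the codecs that ship with Python. This is just a sketch; the codec names are Python's own spellings for these character sets, and the helper name decode_byte is an assumption.

```python
import unicodedata

# Python's names for the character sets mentioned above.
CODECS = ["iso8859_1", "cp1252", "mac_roman", "cp850", "iso8859_5", "cp437"]

def decode_byte(value, codec):
    """Interpret a single byte value under the given character set."""
    return bytes([value]).decode(codec)

for codec in CODECS:
    character = decode_byte(232, codec)
    print(f"{codec:10} U+{ord(character):04X} {unicodedata.name(character)}")
```

The same byte comes out as a different letter under each assumption, which is exactly why a text needs to say which character set it uses.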

The good news, for mostly-English texts at least, is that ISO-8859-1, Codepage 1252 and Unicode agree on the numerical values of the accented characters and symbols to be represented by the values 160 through 255. And everybody accepts ASCII—a pure ASCII file is valid ISO-8859-anything, valid Codepage-anything, and valid Unicode UTF-8.
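Both halves of that good news can be checked mechanically. The sketch below again relies on Python's built-in codecs; the function names are assumptions for illustration.

```python
def upper_range_agrees():
    """Check that values 160-255 mean the same in ISO-8859-1, CP1252 and Unicode."""
    for value in range(160, 256):
        raw = bytes([value])
        if not (raw.decode("iso8859_1") == raw.decode("cp1252") == chr(value)):
            return False
    return True

def ascii_decodes_identically(data, codecs=("ascii", "iso8859_1", "cp1252", "utf_8")):
    """Check that pure-ASCII bytes decode to the same text under every codec."""
    decoded = {data.decode(codec) for codec in codecs}
    return len(decoded) == 1

print(upper_range_agrees())
print(ascii_decodes_identically(b"Plain vanilla ASCII, 1922."))
```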

For more detail about the mappings between Unicode and other formats, you can view Unicode<—>ISO-8859 mappings at ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/ Unicode<—>Windows mappings at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/ and Unicode<—>Apple mappings at ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/

If you're not confused enough by now, please read the excellent guide to the whole "alphabet soup" problem at .

V.77. What is Unicode?

Recognizing that no single set of 256 characters can hold all of the symbols necessary for true multi-lingual texts, ISO 10646 was created. This defined the Universal Character Set (UCS) using 31 bits, which has the potential for a staggering 2 billion characters.

The Unicode Consortium is a group of computer industry companies who agree the Unicode standard. Unicode accepts the ISO 10646 standards, and adds some restrictions and implementation processes. It plans for a modest million or so characters; however, this is enough for all living and extinct languages, and imaginable future ones too.

Using 4 bytes for each character is wasteful, though, when most characters need only one or two, and there are programming problems with implementing 4-byte characters, so Unicode provides Transformation Formats (UTF) which allow the characters to be encoded using fewer bytes where possible. UTF-8 and UTF-16 are common.

UTF-8, which is the most practical of these from the PG point of view, allows ASCII to be encoded normally, and usually uses two or three bytes for other non-ASCII characters.
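To see how many bytes UTF-8 actually spends on a character, you can simply encode it and count. A minimal sketch in Python follows; the helper name utf8_length and the choice of sample characters are assumptions for illustration.

```python
def utf8_length(character):
    """Return the number of bytes UTF-8 uses for one character."""
    return len(character.encode("utf-8"))

samples = {
    "Latin capital A": "A",
    "e-acute": "\u00e9",
    "Greek capital Omega": "\u03a9",
    "Devanagari letter A": "\u0905",
    "Ogham letter Beith": "\u1681",
}
for name, character in samples.items():
    print(f"{name}: {utf8_length(character)} byte(s)")
```

A pure ASCII text therefore takes exactly the same bytes in UTF-8 as it does in plain ASCII.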

Because of the extra work needed to support this extra space, and the fact that most people work mostly in one or maybe two languages, Unicode is being adopted only slowly, and most computer programs in 2002 do not fully support it. But when you need to mix Arabic, Greek, Ogham and Sanskrit in one text, it's the only possible answer!

For more about this, go straight to the source at .

V.78. What is Big-5?

Big 5 is an encoding of a set of 13,000+ traditional Chinese characters.

V.79. What are "8-bit" and "7-bit" texts?

For practical purposes, 7-bit texts are plain ASCII; 8-bit texts have accented letters.

This comes from computer jargon. You can represent the 128 characters of ASCII using 7 bits—binary digits—but to represent the 256 characters needed for the various codepages and ISO-8859 standards, like accented letters, you need 8 bits. Hence, we call a text that uses non-ASCII characters in a character set like Codepage 850 or ISO-8859-1 an "8-bit" text.
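In practice, telling a 7-bit file from an 8-bit one just means looking for any byte with a value of 128 or more. The following Python sketch shows the idea; the function name is_seven_bit is an assumption, and this is not a description of any tool the Posting Team uses.

```python
def is_seven_bit(data):
    """Return True if every byte fits in 7 bits, that is, the data is plain ASCII."""
    return all(byte < 128 for byte in data)

plain = b"Crime and Punishment"
accented = "caf\u00e9".encode("iso8859_1")  # e-acute is byte 233 in ISO-8859-1

print(is_seven_bit(plain))
print(is_seven_bit(accented))
```

To test a whole file, read it in binary mode, for example with open(path, "rb"), so that no character-set guess is made before the check.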

When we post a text as both 8-bit and 7-bit, as we do when ASCII is not enough to render the text acceptably, we name the file with an "8" or a "7" at the start. So, for example, Crime and Punishment by Dostoevsky is named 8crmp10 for the 8-bit version with accents, and 7crmp10 for the 7-bit version without accents.

See also FAQ [R.35]: "What do the filenames of the texts mean?"

V.80. I have an English text with some quotations from a language that needs accents—what should I do about the accents?

If stripping the accents would unacceptably degrade the book, then submit two versions, one "8-bit" with the accents included and one "7-bit" plain ASCII, and we will post both.

This is a hard choice. What constitutes "unacceptable degradation"?

Clearly this is a decision that all of us in PG have to make. It's a very common problem, and different people have different views. For that matter, different print publishers have different views; you will see the words "debris", "facade" and "cafe" printed with and without accents in different books, and even in different editions of the same book.

We don't want to post two versions when we don't have to. It doubles the posting work, doubles the disk space needed, potentially confuses downloaders, doubles the maintenance when we need to correct the text. On the other hand, we don't want to degrade the text.

There is no clear line, no definitive answer to what level of degradation is acceptable. Most producers feel that there is no point in making a separate version when dealing only with a few foreign words thrown in among the English, but when, for example, some significant dialog between the characters is in French or Spanish, it's harder to say that stripping the accents is acceptable. You, the producer, need to decide this on a case-by-case basis. If you're not sure, discuss it with one of the Directors of Production or one of the Posting Team.

If you have made the text with accents, you can choose to make your own 7-bit version and send it to us, or just send the 8-bit version and we'll make the 7-bit version from it. Some people prefer to make their own 7-bit editions; some don't. Whether you use a Microsoft Codepage, one of the ISO standards or MacRoman doesn't matter—we can convert any of them for you.

V.81. I have some Greek quotations in my book. How can I handle them?

There is no way to show Greek letters in ASCII. You have three options:

You can just replace the Greek words with [Greek] to indicate to the reader that you have omitted it.

You can "transliterate" the Greek to ASCII. Greek letters do have a correspondence to plain "Latin" letters—for example, the Greek letter "delta" can be represented by the letter "d". There is a simple PG guide to transliteration at . This practice has had a long and honorable history: words like "amphora" and "hubris", for example, are straight transliteration from the Greek. This is usually the best option.

If there is enough Greek to warrant it, and no other accented characters, you may be able to use the ISO-8859-7 character set, and submit both 7-bit and 8-bit versions [V.79]. ISO-8859-7 is for modern rather than classical Greek, but, if necessary, you will surely be able to express the Greek fully in Unicode. However accurate your Greek, that still leaves the issue of what to do with the 7-bit ASCII version, where transliteration is probably still your best bet.
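For the transliteration option, the mechanical part is a simple letter-for-letter lookup. The sketch below uses a deliberately tiny table that is an assumption for illustration only: it is NOT the PG transliteration guide, it covers only a handful of letters, and it ignores breathings and accents. Apart from delta, which is the example given above, the correspondences are common conventions rather than PG rulings.

```python
# Partial, illustrative table only; NOT the PG transliteration guide.
GREEK_TO_LATIN = {
    "\u03b1": "a",  # alpha
    "\u03b3": "g",  # gamma
    "\u03b4": "d",  # delta (the example given in the text above)
    "\u03bb": "l",  # lambda
    "\u03bf": "o",  # omicron
    "\u03c3": "s",  # sigma
    "\u03c2": "s",  # final sigma
}

def transliterate(text):
    """Replace each Greek letter found in the table; leave everything else alone."""
    return "".join(GREEK_TO_LATIN.get(character, character) for character in text)

# lambda, omicron, gamma, omicron, final sigma
print(transliterate("\u03bb\u03bf\u03b3\u03bf\u03c2"))
```

Any letter missing from the table passes through unchanged, so untransliterated Greek is easy to spot in the output.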

V.82. I want to produce a book in a language like Spanish or French with accented characters. What should I do?

Use the appropriate ISO-8859 Character set [V.76] for your 8-bit version.

About the formatting of a text file:

This section of the FAQ goes into great detail about all kinds of formatting questions. However, looked at from a higher level, the only real issue is that we want to render texts clearly, with formatting that reflects the original, so that readers of the plain text format can read them easily, and people converting them to other formats can do so reliably. When you come across a case that is not covered by the detailed guidelines below, keep this ultimate aim in mind, and make the best decision you can. Don't get hung up for hours or days over a question of formatting—if you want advice, look at how other people have handled the same situation in previous texts, or ask other volunteers for their ideas.

V.83. How long should I make my lines of text?

For normal prose, such as you find in a novel, your lines should mostly be 60 to 70 characters long, not shorter than 55, not longer than 75 except where it can't be helped. Never, ever longer than 80, except where you're trying to render a non-text structure, like a family tree.

For poetry, make the text look as much like the book as possible. This also applies to some plays where the lines are clearly intended to be broken at specific points, whether blank verse or not.
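For ordinary prose, the line-length limits given above are easy to check by script. Here is a sketch in Python using the standard textwrap module; the function names, the default thresholds passed as parameters, and the wording of the messages are assumptions for illustration. Do not run a rewrapper over poetry or other text whose line breaks matter.

```python
import textwrap

def check_line_lengths(lines, preferred_max=75, hard_max=80):
    """Return (line number, length, note) for each line that is too long."""
    problems = []
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\r\n"))
        if length > hard_max:
            problems.append((number, length, "over 80"))
        elif length > preferred_max:
            problems.append((number, length, "over 75"))
    return problems

def rewrap_paragraph(paragraph, width=70):
    """Rewrap one prose paragraph so that no line exceeds width characters."""
    return textwrap.fill(" ".join(paragraph.split()), width=width)

print(check_line_lengths(["a" * 60, "b" * 78, "c" * 81]))
print(rewrap_paragraph("word " * 40))
```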

V.84. Why should I break lines at all? Why not make the text as one line per paragraph, and let the reader wrap it?

We could either use 70-character lines and let readers unwrap them if they want to, or use infinite-length lines and let readers wrap them if they want to. We choose to wrap the lines so that they are readable on even the simplest of text editors and viewers.

V.85. Why use a CR/LF at end of line?

CR/LF can lead to double-spacing, notably on Mac and Unix, but at least there is a CR in there for Mac users, and there is an LF for *nix users.

If you don't know or care what this is about, please skip blithely on.

There are three differing standards for how to represent the end of a line of text. In brief, Apple Macs use the CR character. Unix and its variants use the LF character. Microsoft systems, from MS-DOS through Windows, use both together.
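If your file came from a Mac or Unix system, converting it to CR/LF is a matter of a few byte replacements. A minimal Python sketch follows; the function name to_crlf is an assumption for illustration, and the file should be read and written in binary mode so that Python does not translate line endings on its own.

```python
def to_crlf(data):
    """Convert CR, LF or mixed line endings in a bytes object to CR/LF."""
    # First reduce every style to a bare LF, taking CR/LF pairs before lone CRs
    # so that an existing pair is not turned into two line breaks.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # Then expand each LF to the CR/LF pair.
    return unified.replace(b"\n", b"\r\n")

mixed = b"Mac line\rUnix line\nDOS line\r\nlast line"
print(to_crlf(mixed))
```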

If you want the history behind these:

CR stands for Carriage Return, and comes from the old typewriter / teletype idea of a command to move the print head from the right of the page back to the left when it reaches the end;

