The initial transcription of Occom Circle documents produces a text document with a custom markup to indicate interesting aspects of the text. The elements of the document fall into several categories:
- Major structural elements such as pages, paragraphs, line breaks, and opening and closing sections.
- Typographic elements such as underlines, superscripts, large or small characters.
- Conceptual elements including names of people, places, organizations and dates.
The structure of a letter includes the pages of the letter, the paragraphs and line breaks, and key elements such as the opening, closing and postscript. Not all letters will include each of these elements, and the elements may be omitted from the transcription if they are not present in the letter; however, those appearing in the letter must appear in the order listed below. These elements are labeled in the document as follows:
|Page||== Page image_number ==||Must be placed at the start of each page, including the first line of the document. The image number is optional, and may be omitted if unknown. Each page in the document must be included in order in the transcription, even if the page is blank. If a page break occurs between paragraphs, a blank line must precede and follow the page marker. If the break occurs within a paragraph, the blank lines must be omitted.|
|Opening||Date: [[date July 3, 1897]]
Salutation: Dear Mr. Smith,||The opening to a letter includes the date and salutation. These should be the first elements in the document following the first page heading. The date and salutation are placed on separate lines preceded by the words "Date:" and "Salutation:".|
|Body||== body ==||The body marks the end of the opening and the start of the main body of the letter.|
|Closing||== closing ==
Signature: John Doe||The close to a letter includes the salutation and signature.|
|Postscript||== postscript ==||If any text appears after the closing of the letter, it should be included as a postscript. The contents of the postscript should be transcribed in the same manner as the main body of the letter.|
|Trailer||== trailer ==||Text that appears as a closing title or footer of the letter, after any postscript.|
|Address||== address ==||The address to which the letter was sent.|
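The ordering rule above can be checked mechanically. The following is a minimal sketch, not the project's validation script: the function names are hypothetical, and it assumes that section markers always sit on a line of their own and that page markers may recur anywhere. The opening (Date and Salutation lines) has no marker of its own, so it is not checked here.

```python
import re

# Section order taken from the table above; "page" markers may recur anywhere.
SECTION_ORDER = ["body", "closing", "postscript", "trailer", "address"]

# Matches lines such as "== body ==" or "== Page 764475-3_001 ==".
MARKER = re.compile(r"^==\s*(\w+)(?:\s+(\S+?))?\s*==\s*$")


def find_markers(text):
    """Return (line_number, name, argument) for every '== name ==' line."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = MARKER.match(line.strip())
        if match:
            found.append((number, match.group(1).lower(), match.group(2)))
    return found


def check_section_order(text):
    """Return a list of problems; an empty list means the order looks fine."""
    problems = []
    markers = find_markers(text)
    # The first line of the document must be a page marker.
    if not markers or markers[0][0] != 1 or markers[0][1] != "page":
        problems.append("document must start with a page marker")
    last_rank = -1
    for number, name, _argument in markers:
        if name == "page":
            continue
        if name not in SECTION_ORDER:
            problems.append(f"line {number}: unknown section '{name}'")
            continue
        rank = SECTION_ORDER.index(name)
        if rank <= last_rank:
            problems.append(f"line {number}: '{name}' is out of order or repeated")
        last_rank = max(last_rank, rank)
    return problems
```

An empty result only says that the markers found are in the listed order; it does not confirm that every section present in the letter was transcribed.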
Blocks of text (lines and paragraphs) should be typed as they appear in the document. A paragraph should be followed by one or more blank lines unless it is the last paragraph in a section.
|Line breaks are denoted by line breaks in the text.|
|Paragraphs are denoted by a blank line in the text.|
|Indented text||indented line||Lines that are indented should begin with a tab or space(s).|
Typographic markup is enclosed within [[ ... ]] brackets. Immediately following the opening brackets is a keyword indicating the kind of markup, then a space, and then the text that is marked up. The text may contain other markup, including structural markup such as lines, paragraphs or page breaks as described above. For each pair of opening brackets, there must be a corresponding pair of closing brackets.
Example: (note the use of spaces between ]] and ]] to improve legibility)
this is normal text [[bold this text will appear bold [[italic this will be bold italic text ]] ]] this is normal text
Rendered as: this is normal text this text will appear bold this will be bold italic text this is normal text
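To illustrate how nested [[ ... ]] markup pairs up, here is a small recursive parser sketch. It is an illustration only, not the validation script: the names parse_markup and plain_text are hypothetical, and the sketch handles only the double-bracket form, not the single-bracket guesses used for illegible text.

```python
def parse_markup(text):
    """Parse [[keyword text]] markup into a nested list.

    Plain text becomes str items; each marked-up span becomes a
    (keyword, children) tuple whose children use the same layout.
    """
    pos = 0

    def parse_run(nested):
        nonlocal pos
        items, buffer = [], []

        def flush():
            # Move any pending plain text into the item list.
            if buffer:
                items.append("".join(buffer))
                buffer.clear()

        while pos < len(text):
            if text.startswith("[[", pos):
                flush()
                pos += 2
                start = pos
                # The keyword runs up to the first whitespace character.
                while (pos < len(text) and not text[pos].isspace()
                       and not text.startswith("]]", pos)):
                    pos += 1
                keyword = text[start:pos]
                if pos < len(text) and text[pos].isspace():
                    pos += 1
                items.append((keyword, parse_run(True)))
            elif text.startswith("]]", pos):
                if not nested:
                    raise ValueError(f"unmatched ]] at offset {pos}")
                pos += 2
                flush()
                return items
            else:
                buffer.append(text[pos])
                pos += 1
        if nested:
            raise ValueError("missing closing ]]")
        flush()
        return items

    return parse_run(False)


def plain_text(items):
    """Strip all markup, keeping only the transcribed text."""
    return "".join(
        item if isinstance(item, str) else plain_text(item[1])
        for item in items
    )
```

Applied to the example above, the result contains a ("bold", [...]) tuple whose children include an ("italic", [...]) tuple, which mirrors the nesting in the transcription.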
[[blockletters block lettered text appears here]]
abbreviated form: [[block block lettered text appears here]]
|bold||[[bold bold text appears here]]||Rendered as bold text.|
|italic||[[italic italic text appears here]]||Rendered as italic text.|
|underline||[[underline underlined text appears here]]||Rendered as underlined text.|
|large letters||[[large large text appears here]] normal text here||Rendered as large text followed by normal text.|
normal text [[superscript superscript text appears here]]
abbreviated form: [[sup superscript text appears here]]
Markup of illegible text depends on the reason for the illegibility: gaps in the text, damage to the document, or merely illegible text are indicated with one of the tags listed below. If a reasonable guess about the original text can be made, it can be indicated within the tag by enclosing the guess within [ ... ] after the first ] bracket. More than one guess may be supplied if desired, by enclosing each within its own [ ... ].
Example: (note the use of spaces between ] and [ to improve legibility)
[[illegible ] [my first guess] [my second guess]]
|Gaps, missing or entirely unreadable text.||[[gap reason] [my first guess] [my second guess]]||
The kind of gap should be indicated by the reason. It must be one of the following statements:
|Blotted text||[[blotted ] [my guess]]||Obsolete. Use [[gap blotted_out]] instead.|
|Damaged text||[[damaged reason] [my guess]]||Obsolete. Use [[gap reason]] instead.|
|Illegible text||[[illegible] [my guess]]||Use for any illegible text that does not fall into the other categories.|
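The tag, reason and guesses in these forms can be pulled apart with a regular expression. The sketch below is an assumption about how such markup might be read, not the project's parser; the name find_unclear is hypothetical, and the pattern assumes that reasons and guesses contain no brackets of their own.

```python
import re

UNCLEAR = re.compile(
    r"\[\[(gap|illegible|blotted|damaged)\b"  # tag name
    r"\s*([^\[\]]*?)\s*\]"                    # optional reason, then the first ]
    r"((?:\s*\[[^\[\]]*\])*)"                 # zero or more [guess] groups
    r"\s*\]"                                  # final ]
)
GUESS = re.compile(r"\[([^\[\]]*)\]")


def find_unclear(text):
    """Return one dict per gap/illegible tag found in the text."""
    results = []
    for match in UNCLEAR.finditer(text):
        results.append({
            "tag": match.group(1),
            "reason": match.group(2),
            "guesses": GUESS.findall(match.group(3)),
        })
    return results
```

For example, find_unclear("[[illegible ] [my first guess] [my second guess]]") yields one entry with an empty reason and two guesses.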
Notes are informational text added by the transcriber. They may be used as markers for parts of the text that require further review, or to describe any other aspect of the transcription that the transcriber wishes to record.
[[note kind text of the note]]
e.g. [[note editorial check the transcription of this paragraph]]
The note kind should be one of the following words:
Areas where the author has crossed out or added text are covered by the changes. Multiple changes can be nested if warranted. If, for example, the author indicated text was to be added to a place in the text, but crossed out part of the added text, this should be indicated by a deletion within the addition.
|Added/Inserted text||[[add location this is inserted text]]||
The author indicated that text should be inserted into the main text at this point. The location
indicates where the author made the notation, and should be one of the following words:
|Deleted/Crossed out text||
[[delete this text was deleted]]
abbreviated form: [[del this text was deleted]]
Conceptual elements are dates, places, or things which are significant and may have other information attached to them in the final website. These items should be marked so that they can be indexed and cataloged. These items should be transcribed within the given tags exactly as the author wrote them.
|Dates||[[date July 3, 1884]]|
|Person||[[person Sampson Occom]]|
|Place||[[place Hanover, NH]]|
[[organization Dartmouth College]]
abbreviated form: [[org Dartmouth College]]
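Because conceptual tags may contain typographic markup (for example a superscript inside a name), collecting them for an index means matching brackets rather than using a flat pattern. The following is a minimal sketch under that assumption; extract_entities is a hypothetical name, and the sketch is not the project's indexing code.

```python
import re

# An opening "[[keyword " or a closing "]]".
TOKEN = re.compile(r"\[\[(\w+)\s?|\]\]")

# Tag keywords from the table above, with the abbreviated form mapped
# to the full name.
KINDS = {
    "date": "date",
    "person": "person",
    "place": "place",
    "organization": "organization",
    "org": "organization",
}


def extract_entities(text):
    """Return (kind, plain_text) pairs in the order the tags close."""
    found = []
    stack = [(None, [])]  # bottom frame collects top-level text
    cursor = 0
    for token in TOKEN.finditer(text):
        stack[-1][1].append(text[cursor:token.start()])
        cursor = token.end()
        if token.group(1) is not None:
            # Opening brackets: start collecting text for this tag.
            stack.append((token.group(1), []))
        else:
            if len(stack) == 1:
                raise ValueError(f"unmatched ]] at offset {token.start()}")
            keyword, parts = stack.pop()
            content = "".join(parts)
            if keyword in KINDS:
                found.append((KINDS[keyword], content.strip()))
            # The enclosing tag keeps the inner text without its markup.
            stack[-1][1].append(content)
    if len(stack) > 1:
        raise ValueError("missing closing ]]")
    return found
```

The inner markup is dropped from the collected text, so [[person M[[superscript r.]] Occom]] is reported as the person "Mr. Occom".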
Special symbols are enclosed in pairs of less-than (<) and greater-than (>) signs.
|"m bar" representing an abbreviation or repeated letter m (m with Unicode u305)|
|⅌||<<per>>||"Per" symbol (Unicode u214c)|
|Long 's' (Unicode u017f)|
|Swung Dash (Unicode u2053)|
|[||<<[>>||Left square bracket|
|]||<<]>>||Right square bracket|
|arbitrary character||<<uxxxx>>||Arbitrary Unicode character with code xxxx, where xxxx is the hexadecimal code value|
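As an illustration of how the special symbol markup maps to characters, the sketch below expands <<per>>, the two bracket symbols and <<uxxxx>> codes. The function name decode_symbols is hypothetical, and only the named symbols whose markup appears in the table above are included; any other name is left untouched.

```python
import re

# Named symbols whose markup is given in the table above.
NAMED_SYMBOLS = {
    "per": "\u214c",
    "[": "[",
    "]": "]",
}

SYMBOL = re.compile(r"<<(u[0-9a-fA-F]{4}|[^<>]+)>>")
CODE = re.compile(r"u[0-9a-fA-F]{4}")


def decode_symbols(text):
    """Replace <<name>> and <<uxxxx>> markup with the actual characters."""

    def replace(match):
        name = match.group(1)
        if name in NAMED_SYMBOLS:
            return NAMED_SYMBOLS[name]
        if CODE.fullmatch(name):
            # xxxx is the hexadecimal code value of the character.
            return chr(int(name[1:], 16))
        return match.group(0)  # unknown name: leave the markup as typed

    return SYMBOL.sub(replace, text)
```

For instance, decode_symbols("<<u00a3>>5") gives the text "£5".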
Accounting journal or ledger pages which consist of a table of transactions are most easily transcribed in Excel. After transcription the files must be saved as tab delimited files before being sent to the validation script for translation. To help the translation script recognize the documents, the first row of the document must contain only the text "== document table ==" in the first cell. Since Excel treats a leading "=" as the start of a formula, this must be entered in the cell with a leading apostrophe:
'== document table ==
The second row must contain the identifier for the first page:
'== page 764565_001 ==
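The two header rows can be checked in the exported tab-delimited file before it is sent for validation. This is a sketch only, with a hypothetical function name; it assumes the leading apostrophe is normally not part of the exported cell text, and strips one if it is.

```python
import csv
import io
import re

PAGE_ROW = re.compile(r"^==\s*page\s+\S+?\s*==$", re.IGNORECASE)


def check_table_header(tsv_text):
    """Return a list of problems with the first two rows of a table document."""
    rows = list(csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t"))
    if len(rows) < 2:
        return ["expected at least two rows"]

    def clean(cell):
        # Tolerate a leading apostrophe in case it was exported as text.
        return cell.strip().lstrip("'")

    problems = []
    first, second = rows[0], rows[1]
    first_ok = (
        bool(first)
        and clean(first[0]) == "== document table =="
        and not any(cell.strip() for cell in first[1:])
    )
    if not first_ok:
        problems.append("row 1 must contain only '== document table ==' in the first cell")
    if not second or not PAGE_ROW.match(clean(second[0])):
        problems.append("row 2 must start with the identifier for the first page")
    return problems
```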
The nature of the table layout and the way Excel functions places additional constraints on the document:
Tags may not span cells
IMS markup must be entirely contained within a cell. This is actually as much an XML restriction as an IMS restriction. XML tags must be properly nested, and dates, deletions, or other tags spanning cells would not be properly nested. When markup appears to cross cells, consider one of the following alternatives:
- Don't apply the markup. For example, if a date spans cells, it may not be necessary to apply a date tag to the element, especially if there is a similar or identical tag nearby.
- Reapply the markup to the contents of each cell individually. If text is crossed out across an entire row, apply the delete markup to the contents of each cell individually.
- Reformat the content so it fits in one cell and markup as usual. In some instances text spans cells in the original text out of convenience. Perhaps the text was too big for a column and overlapped the next column, or perhaps the text was never intended to fit a column. If so, consider combining it into one cell and applying the markup within that cell.
This restriction and the alternatives apply to all markup that might span cells in the table. For example, if an entire row is crossed out, the "del" markup must be repeated in each cell in the row. The IMS does not support starting the "del" markup at the beginning of the row and ending it at the end. TEI P5 does support this concept using the "delSpan" tag, but it cannot be entered directly via the IMS.
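The second alternative, reapplying the markup to each cell, is easy to express in code. The helpers below are hypothetical and only sketch the idea: one wraps every non-empty cell of a crossed-out row in "del" markup, the other flags cells whose opening and closing double brackets do not balance, which is a rough sign that markup spans cells.

```python
def mark_row_deleted(cells):
    """Wrap each non-empty cell of a crossed-out row in its own del markup."""
    return [f"[[del {cell}]]" if cell.strip() else cell for cell in cells]


def unbalanced_cells(cells):
    """Return the indexes of cells whose [[ and ]] counts differ."""
    return [
        index
        for index, cell in enumerate(cells)
        if cell.count("[[") != cell.count("]]")
    ]
```

For a row transcribed as ["3 July", "", "2 yards cloth"], mark_row_deleted returns ["[[del 3 July]]", "", "[[del 2 yards cloth]]"], leaving the empty cell alone.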
Cells may not span rows or columns
Cells may not span multiple rows or columns. This is a limitation of the export format as well as the IMS. Notes should be used as hints to the reviewers that the cell should be defined to span rows or columns, and the XML will need to be adjusted by the Text Markup Unit to add the appropriate attributes and remove the extra cells inserted by the IMS.
Cells are single lines
The text in individual cells may not contain any line breaks. This is an IMS restriction imposed by the export format. Again, notes may be used to clarify the text if desired.
Special characters must be entered using codes
Excel does not export UTF-8 encoded text. It uses a single byte for each character. Files that include non-ASCII symbols directly are incompatible with the UTF-8 character set required by the IMS system. To work around this, use the special character markup, using either named symbols or Unicode character codes. For example, the pound symbol (£) must be entered with its Unicode code value: <<u00a3>>.
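If text has already been typed with the real characters, it can be converted to the code form before it is pasted into Excel. This is a minimal sketch with a hypothetical function name; it always uses Unicode codes rather than named symbols, and it rejects characters whose code does not fit in four hexadecimal digits.

```python
def encode_symbols(text):
    """Replace every non-ASCII character with its <<uxxxx>> code."""
    pieces = []
    for character in text:
        code = ord(character)
        if code < 128:
            pieces.append(character)
        elif code <= 0xFFFF:
            pieces.append(f"<<u{code:04x}>>")
        else:
            raise ValueError(f"no four-digit code for U+{code:X}")
    return "".join(pieces)
```

With this sketch, encode_symbols("£5 10s") returns "<<u00a3>>5 10s".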
Mixed documents are not possible
Each document submitted to the IMS must be either a table style or letter style document. It is not possible to mix structures. If that's necessary, the individual parts should be transcribed into separate documents, and assembled into a single XML document later, perhaps by the TMU.
The following is a fictitious example document based on letter 764475-3 to demonstrate how a marked up document should appear.
== Page 764475-3_001 ==
Date: [[place Lebanon]] [[date 25[[superscript th]] Aug[[superscript t]] 1764]]
Salutation: [[person M[[superscript r.]] Occom]], Sir.
== body ==
Your time is so that, and your Business so crowding, that I can't desire such an Addition to your Bur— den, as your coming hither again would be: I therefore take this Way to hint to you what I would say more fully if you were here. And in the first place, I suspect you will miss of seeing [[person Mr. Kirtland]] on his Return from [[person Mr. Whitefield]], and also of seeing [[person Mr. Whitefield]], who I hear preached some weeks ago at [[place Philadelphia]], & consequently you will miss of receiving any supplys which he may have got for your journey; and if so, I advise you to represent the Case to some able Friends at [[place New York]], and if you can get Supply no other Way, hire the Money of Some good Friend till you return. I herewith Send you a Copy of our Commission from [[place Scotland]] in order that you may shew it, if you shall have occasion, to [[person Gen[[superscript l]] Gage]], [[person Gen[[superscript l]] Johnson]], or others. I would have you obtain 15 or 20 youth, if you can procure those [[delete that]][[add above which]] are likely, of remote Tribes of Indians. And if you hear that which is encouraging of good [[person Peter]] at [[place Onohoquagee]], and those two Boys there who were offered to the Comissioner at [[place Boston]], let them be of the Number. There was also an English Lad with the Mohawks to learn their Tongue, before this War, who I hear is very likely: if you can obtain such an one, do it. I shall leave the Proportion of Girls to you, & [[person Gen[[superscript l]] Johnson]], whose advice I would have you take in every Thing, when it may be had. And be sure, you let all the Children whom you bring, know that they don't come here to be without Government, nor to live a lazy sordid Life, but to be fitted for Business and lifefulness in the World.

And I am not afraid that you Should boast of my Mohawk Boys Proficiency in very strong Terms. And don't fail to write to me as your Progress, Success, and any Occurrence that may be entertaining, by every opportunity [[note this word is found at the right margin, continuing the previous line]]
== page 764475-3_002 ==
Opportunity, as you know Friends at Home will be glad to hear. Send me an Acco[[superscript t]] of what Labour you have or Shall hire upon my Credit [[add above at Mohegan]]; and what you desire me to do for your [[add above Family]] while you are gone. And may the God of all Grace be with you & [[person David]] in all the way whither you go, and inspire you with Wisdom, Prudence, Zeal, Courage, and holy Fortitude, and honour you to be the Instrument to spread the Saviour of his Name, and the Knowledge of the great Salvation, far among the Pagans.
== closing ==
Salutation: Remember me respectfully to Friends in your Way, espe— cially at [[place N. York]]. — which with Love &c is the needful from Yours affectionately
Signature: [[person Eleazar Wheelock]].
[[person Rev[[superscript d]] M[[superscript r]] Occom]]
== postscript ==
[[date August 27th]] P.S. [[person Mr. Kirtland]] returned last Evening has got no money. [[person Mr. Whitefield]] is at [[place N. York]]. talks of going to [[place Albany]] this Week if he can he will serve you, if he cant acquaint [[person M[[superscript r.]] Whitaker]] — do the best you can —
When you start a new transcription, use the template below to get started. Delete any parts of the template that are not applicable to the letter you are transcribing. Remember to fill in the image number for pages if you know the corresponding image number.
== page ==
Date:
Salutation:
== body ==
== closing ==
Salutation:
Signature:
== postscript ==
== trailer ==
== address ==
The transcriptions must be saved as plain text (.txt) documents without any application specific markup. Word (.doc or .docx) or RTF (.rtf) documents are not acceptable. Documents may be edited in Word or other text editors provided they are saved as "text-only" documents. Please review the instructions below that apply to the application you are using to ensure that you are saving the documents in the correct format.
Word is able to edit and save text documents, but the process is relatively complex compared to other editors. We recommend using Word only if no other editor is available.
Create a new document as follows:
File > New Document
File > Save As...
- Choose Format: Plain Text (.txt)
- Click Save (the File Conversion dialog box will appear)
- Choose Text Encoding: Other Encoding, and click Unicode 5.1 UTF-8
- Choose End Lines with
- Make sure "Insert line breaks" and "Allow character substitution" are not checked
- You will receive a warning each time you save the document indicating that some formatting may be lost. Word displays this warning even with empty documents and even if the document contains no formatting. You must click "Save" to save the document.
- Word will not remember the "Save" settings, so the next time you save the document, the Text Encoding and line endings will not be correct. You must use "Save As" each time you save the document starting with Step 2 above in order to save the document properly.
- While editing the file, you must take care not to use any Word formatting such as bold, italic or different fonts. This information will not be saved in the text file. Use only the markup described in this document.
TextEdit is an application supplied with the Macintosh operating system. It is capable of editing plain text (.txt) and RTF (.rtf) documents, and is much easier to use than Word for plain text documents. It is found in the "Applications" folder.
To create a new plain text (.txt) document in TextEdit:
File > New
Format > Make Plain Text (see notes)
File > Save to save the document.
- Choose Plain Text Encoding: Unicode (UTF-8) (this should be the default, and will only be necessary the first time you save the file)
- Using the TextEdit preferences you can choose whether the New command creates a plain text or RTF document by default. If a document is an RTF document, TextEdit displays a ruler at the top of the page. If you do not see the ruler and cannot find the "Make Plain Text" command, you already have a plain text document. While you are working on the Occom Circle project, it is recommended that you set this preference so that TextEdit automatically creates new plaintext documents.
- The font and type size for plaintext documents is changed in the TextEdit preferences. You may set this preference to any font and size you find convenient.
Notepad++ (free) http://notepad-plus-plus.org/
- Go to notepad-plus.org/news/notepad-6.1.8-release.html to download.
- When you first open Notepad++ you will get a new file by default (its name is usually “new 1”).
- Go to the “Encoding” in the top menu and change the setting to “Encode in UTF-8”.
- When you save the file go to the top menu “File” and then “Save As…” and name the file.
Notepad (comes with Windows)
All Programs --> Accessories --> Notepad
Notepad defaults to ANSI text, but you can choose Unicode or UTF-8 when doing a "Save As".
Word 2010 on Windows 7
- Choose the No Spacing style.
- Save As "Plain Text (*.txt)".
- You then get a dialog box with warnings about losing formatting, where you can choose to insert line breaks different than the default of CR|LF. The choices are CR|LF, CR only, LF only, and LF|CR. Choose LF only.
Other applications are available for the Macintosh and PC that are capable of editing plain text files. If you wish to use something other than one of the applications listed above, please consult with the project manager before beginning to ensure that your application is compatible with the system.
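Whichever editor is used, the saved file can be sanity-checked before submission. The sketch below is not part of the project's tooling and check_plain_text is a hypothetical name. It assumes that LF-only line endings are wanted everywhere, as in the Word 2010 instructions above, and it recognizes Word and RTF files only by their well-known leading bytes.

```python
def check_plain_text(data: bytes):
    """Return a list of problems found in the raw bytes of a saved file."""
    problems = []
    if data.startswith(b"{\\rtf"):
        problems.append("file looks like RTF, not plain text")
    # .docx files are ZIP archives ("PK"); .doc files start with D0 CF 11 E0.
    if data.startswith(b"PK") or data.startswith(b"\xd0\xcf\x11\xe0"):
        problems.append("file looks like a Word document, not plain text")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        problems.append(f"not valid UTF-8 at byte {error.start}")
    if b"\r" in data:
        problems.append("contains CR characters; expected LF-only line endings")
    return problems
```

A typical use would read the file in binary mode, for example with open(path, "rb"), and pass the bytes to check_plain_text.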
After the document transcription is complete, the document must be validated, translated to XML and sent to the project manager. This is accomplished through the Occom Circle Validation Form.
If a document is valid, the form displays a color-coded version of the document to facilitate proofreading the markup, and a list of key elements (people, places, organizations) found in the document. After the document is proofread, a link on the page emails the original transcription and translated XML document to the project manager.
If markup errors are found in the document, a list of problems is displayed with the line numbers and text surrounding each error. These must be corrected before the document can be submitted to the manager.
The translation to TEI markup leaves placeholders for certain elements that must be filled in manually:
- The "when" attribute of <date> tags requires the canonical date in year-month-day format.
- The "key" attribute of <persName>, <orgName>, and <placeName> tags requires the id of the item in the corresponding authority list.
- (?) The rend attribute of <del> tags must be supplied.
- Words containing the m-bar or long-s characters must be normalized using the following tags: <choice><orig></orig><reg></reg></choice>
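Two of these placeholders can be drafted with a few lines of code. The helpers below are a hypothetical sketch, not the project's tooling: when_value assumes the date is written as a full English month name, day and year, and choice_for_long_s handles only the long s, because expanding an m-bar depends on the word in question.

```python
from datetime import datetime
from xml.sax.saxutils import escape


def when_value(date_text, date_format="%B %d, %Y"):
    """Convert a date such as 'July 3, 1884' to year-month-day form."""
    return datetime.strptime(date_text, date_format).date().isoformat()


def choice_for_long_s(word):
    """Build a <choice> element that regularizes long s to a plain s."""
    regularized = word.replace("\u017f", "s")
    return (
        f"<choice><orig>{escape(word)}</orig>"
        f"<reg>{escape(regularized)}</reg></choice>"
    )
```

Dates the author wrote in another style, such as the "25th Augt 1764" form in the example letter, would still need to be entered by hand.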