ERCIM News No.26 - July 1996 - INRIA

Computational Linguistics is needed to check Typographic Conventions

by Jacques André and Hélène Richy

Little attention is given to the quality of electronic documents (especially to those using HTML) in terms of typography (eg abiding by the rules of the Chicago Manual of Style). The specification of 'typographic sheets' can help in checking typographic correctness in structured documents. However, such a checker requires tools from computational linguistics.

Today, thanks to electronic documents and worldwide networks, authors and readers communicate directly. Alas, this is quite often done without the help or the savoir faire underlying the traditional activities of typographers, editors, correctors, and printers. For most people, typography concerns visual aspects such as font and character design and the layout of pages. Although it is still related to legibility, another aspect of typography is concerned with the text itself rather than with its appearance: there are typographic conventions such as the rules given in the 'Chicago Manual of Style' or in the French 'Code typographique'. These rules refer not only to spacing before or after punctuation, but also to capitalization, use of italics, use of acronyms and abbreviations, composition of numbers, etc. While spelling checkers and even syntactic checkers are increasingly built into formatters, very little is done about typographic conventions (apart from naive tests such as balancing of parentheses). We are now working on developing such a typographic checker.

The purpose of a typographic corrector is to propose some corrections to the author when errors are found. The problem is that linguistic parsers analyse sentences with the assumption that the punctuation is correct, while a typographic checker is supposed to detect punctuation errors (among others).

Our first approach is to use the logical structure of a document. Indeed, many typographic rules are context dependent. For example, periods are omitted at the end of centred headings, signatures or legends; capitals are allowed in titles; in a bibliographic item, book titles are to be composed in italic, etc. A typical, more complex rule is the one describing the punctuation to be used at the end of a list item: it depends on the rank of the item in the list and on the context of the list, ie whether or not the list is within a sentence. Structured documents allow different levels of interest to be separated, for example by defining the description of a document type (eg SGML's DTD) separately from its physical description (DSSSL). A typographic checker has been added to the Thot editor and works with typographic sheets, based on the DTD. The word sheet implicitly refers to (cascading) style sheets, as they have the same spirit.
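To make the idea of a context-dependent rule concrete, here is a minimal sketch of how a typographic sheet might attach rules to logical element types. It is not the Thot implementation: the names Element, SHEET, no_final_period and check are hypothetical, and the only rule encoded is the one quoted above (no period at the end of headings, signatures or legends).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Element:
    """A node of the logical structure (hypothetical, flat model)."""
    kind: str  # e.g. "heading", "paragraph", "legend"
    text: str


Rule = Callable[[Element], Optional[str]]


def no_final_period(el: Element) -> Optional[str]:
    """Flag a trailing dot. Naive: every final dot is treated as a period."""
    if el.text.rstrip().endswith("."):
        return f"{el.kind}: omit the final period"
    return None


# Hypothetical "typographic sheet": which rules apply to which element type.
SHEET: Dict[str, List[Rule]] = {
    "heading": [no_final_period],
    "signature": [no_final_period],
    "legend": [no_final_period],
}


def check(document: List[Element]) -> List[str]:
    """Apply to each element only the rules registered for its type."""
    messages = []
    for el in document:
        for rule in SHEET.get(el.kind, []):
            message = rule(el)
            if message is not None:
                messages.append(message)
    return messages


doc = [
    Element("heading", "Typographic sheets."),
    Element("paragraph", "Many rules are context dependent."),
    Element("legend", "Figure 1"),
]
print(check(doc))  # ['heading: omit the final period']
```

The same trailing period is reported in the heading but accepted in the paragraph, because the rule is selected by the element type and not by the text alone.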

However, this first approach presents limitations with respect to linguistic structures. Let us take two examples. The Chicago Manual of Style says "Omit the period after ... running heads ...". However, linguistic tools (such as abbreviation dictionaries or morphematic analysis) are needed to decide whether a dot is a period or an abbreviation mark (eg after etc.). The same Manual of Style also says "The exclamation point should be placed inside the quotation marks... when it is part of the quoted ... matter; otherwise it should be placed outside." This implies that a typographic checker must be able to analyse semantically a sentence in which an exclamation point adjoins a closing quotation mark, in order to decide whether the exclamation belongs to the quoted matter.

Our research now consists in defining the lowest level of linguistic tools needed for such a typographic checker.
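The first example above, deciding whether a final dot is a period or an abbreviation mark, can be sketched with a small abbreviation dictionary. This is only an illustration under stated assumptions: the ABBREVIATIONS set is a tiny hypothetical sample, and the function names classify_final_dot and check_running_head are invented for this sketch.

```python
from typing import Optional

# Hypothetical, deliberately tiny abbreviation dictionary (lower case).
ABBREVIATIONS = {"etc.", "vol.", "no.", "ed."}


def classify_final_dot(text: str) -> str:
    """Return 'none', 'abbreviation' or 'period' for the end of text."""
    stripped = text.rstrip()
    if not stripped.endswith("."):
        return "none"
    # The last whitespace-separated token carries the final dot.
    last_token = stripped.split()[-1].lower()
    if last_token in ABBREVIATIONS:
        return "abbreviation"
    return "period"


def check_running_head(text: str) -> Optional[str]:
    """Report a final period in a running head, but keep abbreviation marks."""
    if classify_final_dot(text) == "period":
        return "omit the period after a running head"
    return None


print(check_running_head("Typographic conventions."))
print(check_running_head("Fonts, spacing, etc."))
```

The first call reports the period and the second returns None, since the dot belongs to the abbreviation. A dictionary lookup of this kind cannot handle unknown abbreviations or the second example (exclamation points and quotation marks), which is where deeper linguistic analysis is needed.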

Please contact:
Jacques André - Inria/Irisa
Tel: +33 99 84 73 50
or Hélène Richy - CNRS/Irisa
Tel: +33 99 84 73 71
