Introduction
The aim of this thesis is to develop an Open Source style and grammar checker for the English
language.
The many kinds of errors which may appear in written text can be grouped into four
categories:
Spelling errors: Spell checkers simply compare the words of a text with a large list of known words.
If a word is not in the list, it is considered incorrect. Similar words will then be suggested as
alternatives.
Example: *Gemran (Ispell will suggest, among others, German)
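The dictionary lookup described above can be sketched in a few lines of Python. This is only an illustration, not how Ispell works internally: the word list KNOWN_WORDS and the function check_word are hypothetical, and the similarity measure is the one provided by the standard difflib module.

```python
import difflib

# Hypothetical, tiny stand-in for the "large list of known words".
KNOWN_WORDS = ["German", "Germany", "checker", "grammar", "style"]

def check_word(word, known_words=KNOWN_WORDS, max_suggestions=3):
    """Return (is_correct, suggestions) for a single word."""
    if word in known_words:
        return True, []
    # Words not in the list are considered incorrect; propose similar known words.
    suggestions = difflib.get_close_matches(word, known_words,
                                            n=max_suggestions, cutoff=0.6)
    return False, suggestions

if __name__ == "__main__":
    print(check_word("Gemran"))  # "German" is the closest known word
    print(check_word("German"))  # known word, no suggestions needed
```

Note that a word which is in the list is always accepted, no matter where it appears. This is why such a checker cannot find errors that depend on context.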
Grammar errors: An error which causes a sentence not to comply with the English grammar
rules. Unlike spell checking, grammar checking needs to make use of context information, so
that it can find an error like this:
*Harry Potter bigger then Titanic? (correct: than)
Style errors: Using uncommon words and complicated sentence structures makes a text more
difficult to understand, which is seldom desired. These cases are thus considered style errors.
Unlike with grammar errors, which cases should be classified as style errors depends heavily
on the situation and the text type.
Example: But it [= human reason] quickly discovers that, in this way, its labours must remain
ever incomplete, because new questions never cease to present themselves; and thus it finds
itself compelled to have recourse to principles which transcend the region of experience, while
they are regarded by common sense without distrust.
This sentence stems from Kant’s Critique of Pure Reason. It is 48 words long and most people
will agree that it is very difficult to understand. The reason is its length, difficult vocabulary (like transcend), and use of double negation (without distrust). With today’s demand for easy to understand documents, this sentence can be considered to have a style problem.
Semantic errors: A sentence contains incorrect information, but the problem is neither a style
error, a grammar error, nor a spelling error. Since extensive world knowledge is required to
recognize semantic errors, these errors usually cannot be detected automatically.
Example: MySQL is a great editor for programming!
This sentence is neither true nor false – it simply does not make sense, as MySQL is not an
editor, but a database.
Part-of-Speech Tagging
Part-of-speech tagging (POS tagging, or just tagging) is the task of assigning each word its POS tag.
It is not strictly defined what POS tags exist, but the most common ones are noun, verb, determiner,
adjective and adverb. Nouns can be further divided into singular and plural nouns, verbs can be
divided into past tense verbs and present tense verbs and so on.
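As a minimal sketch of what a tagger produces, the following Python example looks up each word in a small lexicon. The lexicon and the function pos_tag are hypothetical, and the tag names (DT, JJ, NN, NNS, VBD, VBZ, RB) are an assumption that loosely follows the Penn Treebank convention. A real tagger also has to decide between several possible tags for the same word by looking at the context, which this sketch does not attempt.

```python
# Hypothetical toy lexicon mapping lower-cased words to one POS tag each.
LEXICON = {
    "the": "DT", "a": "DT",          # determiners
    "old": "JJ",                     # adjective
    "man": "NN", "men": "NNS",       # singular and plural noun
    "walks": "VBZ", "walked": "VBD", # present tense and past tense verb
    "slowly": "RB",                  # adverb
}

def pos_tag(sentence):
    """Assign each whitespace-separated word a tag; unknown words get 'UNK'."""
    return [(word, LEXICON.get(word.lower(), "UNK")) for word in sentence.split()]

if __name__ == "__main__":
    print(pos_tag("The old man walked slowly"))
    # [('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('walked', 'VBD'), ('slowly', 'RB')]
```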
Phrase Chunking
Phrase Chunking is situated between POS tagging and a full-blown grammatical analysis: whereas
POS tagging only works on the word level, and grammar analysis (i.e. parsing) is supposed to build a
tree structure of a sentence, phrase chunking assigns a tag to word sequences of a sentence.
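The difference to POS tagging can be illustrated with a short Python sketch that groups already tagged words into chunks. It is a strong simplification: the function chunk_noun_phrases is hypothetical, the tag names are assumed to follow the Penn Treebank convention, and every maximal run of determiner, adjective and noun tags is simply treated as one noun phrase (NP).

```python
# Tags that may appear inside a noun phrase in this simplified sketch.
NP_TAGS = {"DT", "JJ", "NN", "NNS"}

def chunk_noun_phrases(tagged_words):
    """Group maximal runs of NP_TAGS into ('NP', words) chunks.

    Words outside such a run form a one-word chunk labelled with their POS tag.
    """
    chunks = []
    current = []
    for word, tag in tagged_words:
        if tag in NP_TAGS:
            current.append(word)
            continue
        if current:
            chunks.append(("NP", current))
            current = []
        chunks.append((tag, [word]))
    if current:
        chunks.append(("NP", current))
    return chunks

if __name__ == "__main__":
    tagged = [("The", "DT"), ("old", "JJ"), ("man", "NN"),
              ("walked", "VBD"), ("slowly", "RB")]
    print(chunk_noun_phrases(tagged))
    # [('NP', ['The', 'old', 'man']), ('VBD', ['walked']), ('RB', ['slowly'])]
```

The result is a flat sequence of labelled word groups, not a tree, which is what separates chunking from parsing.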
Grammar Checking
Three approaches to grammar checking can be distinguished:
Syntax-based checking, as described in [Jensen et al., 1993]. In this approach, a text is completely parsed, i.e. the sentences are analyzed and each sentence is assigned a tree structure.
The text is considered incorrect if the parsing does not succeed.
Statistics-based checking, as described in [Attwell, 1987]. In this approach, a POS-annotated
corpus is used to build a list of POS tag sequences. Some sequences will be very common (for
example determiner, adjective, noun as in the old man), others will probably not occur at all
(for example determiner, determiner, adjective). Sequences which occur often in the corpus can
be considered correct in other texts, too, whereas uncommon sequences might be errors.
Rule-based checking, as it is used in this project. In this approach, a set of rules is matched
against a text which has at least been POS tagged. This approach is similar to the statistics-based
approach, but all the rules are developed manually.
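The statistics-based and the rule-based approach can be contrasted in a short Python sketch. Everything in it is an assumption made for illustration: the three-sentence corpus, the use of tag trigrams, the Penn Treebank style tag names, and the single hand-written rule are hypothetical and are not taken from [Attwell, 1987] or from this project's rule set.

```python
from collections import Counter

def tag_trigrams(tags):
    """Return all consecutive triples of POS tags in a sentence."""
    return list(zip(tags, tags[1:], tags[2:]))

# Hypothetical POS-annotated corpus: each sentence is given as a list of tags.
CORPUS = [
    ["DT", "JJ", "NN", "VBD", "RB"],
    ["DT", "JJ", "NN", "VBZ", "DT", "NN"],
    ["DT", "NN", "VBD", "DT", "JJ", "NN"],
]
TRIGRAM_COUNTS = Counter(t for sentence in CORPUS for t in tag_trigrams(sentence))

def uncommon_trigrams(tags, counts=TRIGRAM_COUNTS, min_count=1):
    """Statistics-based check: tag trigrams seen fewer than min_count times."""
    return [t for t in tag_trigrams(tags) if counts[t] < min_count]

def rule_then_after_comparative(tagged_words):
    """Rule-based check: flag 'then' directly after a comparative adjective (JJR)."""
    errors = []
    for i in range(len(tagged_words) - 1):
        _, tag = tagged_words[i]
        next_word, _ = tagged_words[i + 1]
        if tag == "JJR" and next_word.lower() == "then":
            errors.append((i + 1, "did you mean 'than'?"))
    return errors

if __name__ == "__main__":
    print(uncommon_trigrams(["DT", "JJ", "NN"]))  # [] (sequence occurs in the corpus)
    print(uncommon_trigrams(["DT", "DT", "JJ"]))  # [('DT', 'DT', 'JJ')]
    sentence = [("Harry", "NNP"), ("Potter", "NNP"), ("bigger", "JJR"),
                ("then", "RB"), ("Titanic", "NNP")]
    print(rule_then_after_comparative(sentence))  # [(3, "did you mean 'than'?")]
```

In the first check the knowledge about correct sequences comes from counting, so its quality depends on the corpus. In the second check it comes from a rule that a person has written, so it can carry a specific error message but covers only the case the rule describes.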