Part-of-speech (POS) tagging is the process of assigning a part of speech or other lexical class marker to each word in a sentence. It is similar in spirit to tokenization for computer languages.
Sunday, November 20, 2016
Wednesday, June 15, 2016
A Rule-Based Style and Grammar Checker
Introduction
The aim of this thesis is to develop an Open Source style and grammar checker for the English
language.
The many different kinds of errors which may appear in written text can be categorized in several
different ways.
Spelling errors: Spell checkers simply compare the words of a text with a large list of known words.
If a word is not in the list, it is considered incorrect. Similar words will then be suggested as
alternatives.
Example: *Gemran (Ispell will suggest, among others, German)
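To make this concrete, here is a minimal sketch of list-based checking in Python; the word list is a toy stand-in for the large dictionaries real checkers such as Ispell use, and difflib's similarity matching approximates their suggestion mechanism.

```python
# A minimal sketch of list-based spell checking. The word list is a toy
# stand-in for the large dictionaries real checkers use; difflib's
# similarity matching approximates their suggestion mechanism.
import difflib

KNOWN_WORDS = ["german", "grammar", "checker", "language"]

def check_spelling(word):
    """Return None if the word is known, otherwise similar known words."""
    if word.lower() in KNOWN_WORDS:
        return None
    return difflib.get_close_matches(word.lower(), KNOWN_WORDS, n=3)

print(check_spelling("Gemran"))  # -> ['german']
```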
Grammar errors: An error which causes a sentence not to comply with the rules of English grammar. Unlike spell checking, grammar checking needs to make use of context information so that it can find an error like this:
*Harry Potter bigger then Titanic?
Style errors: Using uncommon words and complicated sentence structures makes a text more
difficult to understand, which is seldom desired. These cases are thus considered style errors.
Unlike with grammar errors, which cases should be classified as style errors depends heavily
on the situation and text type.
Example: But it [= human reason] quickly discovers that, in this way, its labours must remain
ever incomplete, because new questions never cease to present themselves; and thus it finds
itself compelled to have recourse to principles which transcend the region of experience, while
they are regarded by common sense without distrust.
This sentence stems from Kant's Critique of Pure Reason. It is 48 words long, and most people will agree that it is very difficult to understand. The reasons are its length, its difficult vocabulary (like transcend), and its use of double negation (without distrust). With today's demand for easy-to-understand documents, this sentence can be considered to have a style problem.
Semantic errors: A sentence which contains incorrect information that is neither a style
error, a grammar error, nor a spelling error. Since extensive world knowledge is required to
recognize semantic errors, these errors usually cannot be detected automatically.
Example: MySQL is a great editor for programming!
This sentence is neither true nor false – it simply does not make sense, as MySQL is not an
editor, but a database.
Part-of-Speech Tagging
Part-of-speech tagging (POS tagging, or just tagging) is the task of assigning each word its POS tag.
It is not strictly defined what POS tags exist, but the most common ones are noun, verb, determiner,
adjective and adverb. Nouns can be further divided into singular and plural nouns, verbs can be
divided into past tense verbs and present tense verbs and so on.
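As a concrete illustration (not part of the thesis), NLTK's default tagger can be used to tag a sentence with Penn Treebank tags such as NN (singular noun) and VBD (past tense verb):

```python
# Illustration with NLTK's default tagger (pip install nltk; the model
# names below are NLTK's standard resources for tokenizing and tagging).
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The old man saw her duck")
print(nltk.pos_tag(tokens))
# Typical output:
# [('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('saw', 'VBD'),
#  ('her', 'PRP$'), ('duck', 'NN')]
```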
Phrase Chunking
Phrase chunking is situated between POS tagging and a full-blown grammatical analysis: whereas
POS tagging works only on the word level and grammar analysis (i.e. parsing) is supposed to build a
tree structure of a sentence, phrase chunking assigns tags to word sequences within a sentence.
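A minimal illustration of chunking, again with NLTK: the hypothetical grammar below groups a determiner, any adjectives, and a noun into an NP chunk, without building a full parse tree.

```python
# A minimal noun-phrase chunker using NLTK's RegexpParser. The grammar
# groups an optional determiner, any adjectives, and a noun into an NP
# chunk; the input is an already POS-tagged sentence.
import nltk

tagged = [("the", "DT"), ("old", "JJ"), ("man", "NN"), ("sleeps", "VBZ")]
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print(chunker.parse(tagged))
# (S (NP the/DT old/JJ man/NN) sleeps/VBZ)
```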
Grammar Checking
Syntax-based checking, as described in [Jensen et al., 1993]. In this approach, a text is completely parsed, i.e. the sentences are analyzed and each sentence is assigned a tree structure.
The text is considered incorrect if the parsing does not succeed.
Statistics-based checking, as described in [Atwell, 1987]. In this approach, a POS-annotated
corpus is used to build a list of POS tag sequences. Some sequences will be very common (for
example determiner, adjective, noun as in the old man), others will probably not occur at all
(for example determiner, determiner, adjective). Sequences which occur often in the corpus can
be considered correct in other texts, too; uncommon sequences might be errors.
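A rough sketch of this idea in Python, with a toy two-sentence corpus standing in for a real POS-annotated corpus:

```python
# A rough sketch of statistics-based checking: collect POS-tag bigrams
# from a (toy) tagged corpus and flag bigrams in new text that were
# never observed. Real systems use large corpora and better smoothing.
from collections import Counter

corpus_tags = [
    ["DT", "JJ", "NN", "VBZ"],        # e.g. "the old man sleeps"
    ["DT", "NN", "VBD", "DT", "NN"],  # e.g. "the dog saw a cat"
]

bigram_counts = Counter()
for sent in corpus_tags:
    bigram_counts.update(zip(sent, sent[1:]))

def suspicious_bigrams(tags):
    """Return tag pairs that never occur in the training corpus."""
    return [pair for pair in zip(tags, tags[1:]) if bigram_counts[pair] == 0]

print(suspicious_bigrams(["DT", "DT", "JJ", "NN"]))  # -> [('DT', 'DT')]
```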
Rule-based checking, as it is used in this project. In this approach, a set of rules is matched
against a text which has at least been POS tagged. This approach is similar to the statistics-based
approach, but all the rules are developed manually.
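The following sketch shows the general flavour of such manually written rules; the rule patterns below are illustrative inventions, not the actual rule format developed in the thesis.

```python
# A sketch of rule-based checking over POS-tagged text. Each rule pairs a
# hand-written tag pattern with an error message; the patterns here are
# illustrative inventions, not the thesis' actual rule format.
RULES = [
    (("DT", "DT"), "two consecutive determiners"),
    (("JJR", "RB"), "comparative followed by an adverb: 'then' for 'than'?"),
]

def find_errors(tagged):
    """Slide each rule's tag pattern over the sentence's tag sequence."""
    tags = [tag for _, tag in tagged]
    hits = []
    for pattern, message in RULES:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                snippet = " ".join(word for word, _ in tagged[i:i + n])
                hits.append((snippet, message))
    return hits

tagged = [("Harry", "NNP"), ("Potter", "NNP"), ("bigger", "JJR"),
          ("then", "RB"), ("Titanic", "NNP")]
print(find_errors(tagged))  # -> [('bigger then', "comparative ...")]
```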
Sunday, May 29, 2016
Contextual spell checking for the Tamil language
This paper proposes a contextual spell-correction feature for an existing spell checker that the authors developed using a hybrid approach. The existing spell checker integrates a dictionary check, a canti (sandhi) check, crowd sourcing, and suggestion generation. In addition, the paper proposes a contextual spelling-correction approach for mistakes that arise due to confusion between the following sets of letters: {ḻakaram, ḷakaram, lakaram}, {ṇakaram, nakaram, ṉakaram}, and {ṟakaram, rakaram}. A bigram language model is used to make suggestions for words containing the confusable letters.
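A sketch of how such a bigram-based choice among confusion-set candidates might look; the romanized confusion sets and the toy counts below are placeholders, not data from the paper:

```python
# A sketch of the contextual step: for a word containing a confusable
# letter, generate candidates by substituting the other letters of its
# confusion set, then rank the candidates by word-bigram counts. The
# romanized confusion sets and the toy counts are placeholders, not data
# from the paper.
from collections import Counter

CONFUSION_SETS = [{"l", "L", "zh"}, {"n", "N", "nn"}, {"r", "R"}]
BIGRAM_COUNTS = Counter({("nalla", "vazhi"): 5, ("nalla", "vali"): 1})

def candidates(word):
    """All spellings reachable by substituting within one confusion set."""
    out = {word}
    for conf_set in CONFUSION_SETS:
        for a in conf_set:
            if a in word:
                out |= {word.replace(a, b) for b in conf_set}
    return out

def best_in_context(prev_word, word):
    """Pick the candidate that is most frequent after prev_word."""
    return max(candidates(word), key=lambda c: BIGRAM_COUNTS[(prev_word, c)])

print(best_in_context("nalla", "vali"))  # -> 'vazhi' ("good way")
```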
Hybrid approach for Tamil spell checking
An application to check Tamil spelling using a hybrid approach has been proposed and developed. This approach integrates a dictionary check, a canti check, and crowd sourcing for new words. The Levenshtein distance algorithm is used to match words against the dictionary and flag the misspelled ones.
Grammatical rules have been written for Valliṉammikum and Valliṉammikā places to solve
the canti problems. For generating suggestions, an n-gram-based technique is used. Furthermore,
a crowd-sourcing feature has been added to the system to collect new words from
users. This is an important feature, as many colloquial words are used in Tamil.
A given word is first checked against the dictionary, using the Levenshtein algorithm, to see whether it exists there. If it is not available, appropriate suggestions are generated using letter-level n-gram analysis. The word is then checked for the joining letter, the canti letter. After these steps, the word is checked for any letters from the confusion sets. If such a letter is present, it is checked for appropriateness using a statistical approach with the aid of a bigram language model.
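For illustration, here is a standard dynamic-programming implementation of the Levenshtein distance used in the dictionary-matching step (a sketch; the paper gives no implementation details):

```python
# A standard dynamic-programming Levenshtein distance, as used in the
# dictionary-matching step described above (a sketch; the paper does not
# give implementation details).
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("nanri", "nandri"))  # -> 1 (one insertion)
```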