public class RussianLetterTokenizer
extends org.apache.lucene.analysis.CharTokenizer
A Tokenizer that extends LetterTokenizer by additionally looking up letters in a given "russian charset".
The problem with LetterTokenizer is that it uses the Character.isLetter(char) method, which doesn't know how to detect letters in encodings like CP1252 and KOI8 (well-known problems with the 0xD7 and 0xF7 chars).
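
The 0xD7 and 0xF7 issue mentioned above can be illustrated with a small sketch. It assumes that the raw bytes were turned into chars one-to-one (as an ISO-8859-1 decode does), which yields U+00D7 (multiplication sign) and U+00F7 (division sign). In the Cyrillic code page windows-1251 the same byte values stand for the letters U+0427 and U+0447. The class and method names here are hypothetical and are not part of Lucene.

```java
public class IsLetterDemo {
    // Hypothetical helper: treats a byte value as an ISO-8859-1 char
    // (byte value == code point) and asks Character.isLetter about it.
    static boolean isLetterWhenReadAsLatin1(int byteValue) {
        char c = (char) (byteValue & 0xFF);
        return Character.isLetter(c);
    }

    public static void main(String[] args) {
        // U+00D7 and U+00F7 are math symbols, so both checks report false.
        System.out.println(isLetterWhenReadAsLatin1(0xD7)); // false
        System.out.println(isLetterWhenReadAsLatin1(0xF7)); // false

        // Neighbouring byte values map to accented Latin letters.
        System.out.println(isLetterWhenReadAsLatin1(0xD6)); // true

        // Properly decoded Cyrillic letters are recognised without any lookup table.
        System.out.println(Character.isLetter('\u0427')); // true
        System.out.println(Character.isLetter('\u0447')); // true
    }
}
```

This suggests why a reader that decodes the input with the correct charset does not need an extra "russian charset" lookup, which is consistent with the charset constructor being deprecated.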
Constructor and Description |
---|
RussianLetterTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory, java.io.Reader in) |
RussianLetterTokenizer(org.apache.lucene.util.AttributeSource source, java.io.Reader in) |
RussianLetterTokenizer(java.io.Reader in) |
RussianLetterTokenizer(java.io.Reader in, char[] charset) Deprecated. Use RussianLetterTokenizer(Reader) instead. |
Modifier and Type | Method and Description |
---|---|
protected boolean | isTokenChar(char c) Collects only characters which satisfy Character.isLetter(char). |
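
As a rough illustration of what "collects only characters which satisfy Character.isLetter(char)" means for tokenization, the sketch below splits a string into maximal runs of letters. It is a plain-Java stand-in, not Lucene's CharTokenizer implementation: the class `LetterSplitSketch` and its `tokenize` method are hypothetical, and details such as normalization and token length limits are left out.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitSketch {
    // Stand-in for the isTokenChar(char) hook; mirrors the documented check.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    // Emits each maximal run of token characters as one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Cyrillic input written with Unicode escapes to avoid source-encoding issues.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440 42!";
        // Yields two tokens: the two Cyrillic words; punctuation and digits are dropped.
        System.out.println(tokenize(text).size());
    }
}
```

Digits and punctuation act purely as separators here, so "42" never becomes a token.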
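Methods inherited from class org.apache.lucene.analysis.CharTokenizer
end, incrementToken, next, next, normalize, reset
Methods inherited from class org.apache.lucene.analysis.TokenStream
getOnlyUseNewAPI, reset, setOnlyUseNewAPI
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString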
public RussianLetterTokenizer(java.io.Reader in, char[] charset)
Deprecated. Use RussianLetterTokenizer(Reader) instead.
public RussianLetterTokenizer(java.io.Reader in)
public RussianLetterTokenizer(org.apache.lucene.util.AttributeSource source, java.io.Reader in)
public RussianLetterTokenizer(org.apache.lucene.util.AttributeSource.AttributeFactory factory, java.io.Reader in)
Copyright © 2000-2016 Apache Software Foundation. All Rights Reserved.