Rodica Brudea - 10 months ago
Java Question

Create a Lucene Romanian stemmer in Java (NetBeans)

I need to build a simple search engine that can recognize and stem Romanian words, including those with diacritics. I used RomanianAnalyzer, but it does not stem consistently when the same word is written with and without diacritics.

Can you help me with code for extending or modifying an existing Romanian stemmer?

PS: I edited the question to make it clearer.


You can copy the RomanianAnalyzer source to create a custom analyzer and add a filter to the analysis chain in the createComponents method. ASCIIFoldingFilter is probably what you are looking for. I would add it at the end of the chain, after the stemmer, so that removing the diacritics doesn't interfere with stemming.

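// Imports assume Lucene 5.x; several of these classes live in other packages in later versions.
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.ro.RomanianAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.tartarus.snowball.ext.RomanianStemmer;

public final class RomanianASCIIAnalyzer extends StopwordAnalyzerBase {
  private final CharArraySet stemExclusionSet;

  public final static String DEFAULT_STOPWORD_FILE = "stopwords.txt";
  private static final String STOPWORDS_COMMENT = "#";

  public static CharArraySet getDefaultStopSet() {
    return DefaultSetHolder.DEFAULT_STOP_SET;
  }

  // Loads the default Romanian stopword list lazily, on first use.
  private static class DefaultSetHolder {
    static final CharArraySet DEFAULT_STOP_SET;

    static {
      try {
        DEFAULT_STOP_SET = loadStopwordSet(false, RomanianAnalyzer.class,
            DEFAULT_STOPWORD_FILE, STOPWORDS_COMMENT);
      } catch (IOException ex) {
        throw new RuntimeException("Unable to load default stopword set");
      }
    }
  }

  public RomanianASCIIAnalyzer() {
    this(DefaultSetHolder.DEFAULT_STOP_SET);
  }

  public RomanianASCIIAnalyzer(CharArraySet stopwords) {
    this(stopwords, CharArraySet.EMPTY_SET);
  }

  public RomanianASCIIAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionSet) {
    super(stopwords);
    this.stemExclusionSet = CharArraySet.unmodifiableSet(CharArraySet.copy(stemExclusionSet));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords);
    if (!stemExclusionSet.isEmpty())
      result = new SetKeywordMarkerFilter(result, stemExclusionSet);
    result = new SnowballFilter(result, new RomanianStemmer());
    // The following line is the only addition to the RomanianAnalyzer source.
    result = new ASCIIFoldingFilter(result);
    return new TokenStreamComponents(source, result);
  }
}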
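
To see what the folding step buys you without setting up an index, here is a minimal sketch that uses only the JDK. It is not ASCIIFoldingFilter itself: it approximates the idea with java.text.Normalizer, by decomposing each character and dropping the combining marks. The class name DiacriticFolding and the method fold are made up for this illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: a rough stand-in for diacritic folding.
final class DiacriticFolding {

    // Decompose characters (NFD), then strip the combining marks that carry the diacritics.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String withDiacritics = "ma\u0219in\u0103"; // "mașină", written with Unicode escapes
        String withoutDiacritics = "masina";

        // The raw strings differ, so an index keyed on them would not match.
        System.out.println(withDiacritics.equals(withoutDiacritics)); // false

        // After folding, both spellings collapse to the same form.
        System.out.println(fold(withDiacritics).equals(fold(withoutDiacritics))); // true
    }
}
```

ASCIIFoldingFilter covers more characters than this sketch does (including ones that have no Unicode decomposition), so treat the above only as a way to reason about the behaviour, not as a replacement for the filter.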