Ruby implementation of the UEA-Lite stemmer for conservative stemming in search and indexing workloads.

UEA-Lite uses a rule set to normalize suffixes while avoiding aggressive stemming.

The stemmer operates on a single token at a time and returns a stemmed token.
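
Because the stemmer works on one token at a time, splitting text into tokens is left to the caller. The following is a minimal sketch of that step, assuming plain whitespace splitting is good enough for the input. `stem_tokens` is a hypothetical helper, not part of the gem, and it accepts any object that responds to `stem` (for example a `UEAStemmer` instance).

```ruby
# Hypothetical helper (not part of uea-stemmer): split text on whitespace
# and stem each token independently.
# `stemmer` is any object responding to #stem, e.g. UEAStemmer.new.
def stem_tokens(text, stemmer)
  text.split.map { |token| stemmer.stem(token) }
end
```

With the gem loaded, this would be called as `stem_tokens("helpers dying", UEAStemmer.new)`. Punctuation handling and lowercasing are deliberately left out, since the right choices depend on the indexing pipeline.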

Notable behavior of this implementation:

  • possessive apostrophes are removed

  • contractions are expanded by default (for example, don't becomes do not)

  • tokens beginning with uppercase letters are preserved, and pluralized acronyms ending in a lowercase s are singularized

  • pure numbers, and tokens containing hyphens/underscores, are passed through unchanged
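
The pass-through and possessive behaviors listed above can be pictured as checks that run before any suffix rule is tried. The sketch below illustrates that idea only: `passthrough?` and `strip_possessive` are hypothetical names, the regular expressions are assumptions, and the gem's actual rules may differ in detail.

```ruby
# Illustrative only: these helpers are NOT the gem's implementation.

# True for tokens that, per the list above, are returned unchanged:
# pure numbers and tokens containing a hyphen or underscore.
# Assumption: "pure number" means ASCII digits only.
def passthrough?(token)
  token.match?(/\A\d+\z/) || token.match?(/[-_]/)
end

# Remove a trailing possessive marker ("dog's" -> "dog", "dogs'" -> "dogs").
# Assumption: only an ASCII apostrophe at the end of the token is handled,
# so contractions such as "don't" are left alone.
def strip_possessive(token)
  token.sub(/'s?\z/, "")
end
```

Keeping these checks separate from the suffix rules makes it easy to see why identifiers such as `snake_case` or `well-known` never reach the rule set.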

This is a port to Ruby from the Java port of the original Perl script by Marie-Claire Jenkins and Dr. Dan J. Smith at the University of East Anglia.

Install the gem:

gem install uea-stemmer

Install from source:

git clone https://github.com/ealdent/uea-stemmer.git
cd uea-stemmer
bundle install
bundle exec rake test
bundle exec rake install

Basic usage:

require "uea-stemmer"
stemmer = UEAStemmer.new

stemmer.stem("helpers")   # => "helper"
stemmer.stem("dying")     # => "die"
stemmer.stem("scarred")   # => "scar"

You can extract the matching rule with stem_with_rule:

result = stemmer.stem_with_rule("invited")
result.word      # => "invite"
result.rule_num  # => 22.3
result.rule      # => #<UEAStemmer::Rule ...>
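
One use for stem_with_rule is checking which rules fire across a vocabulary, for example when debugging unexpected stems. The sketch below relies only on the `rule_num` accessor shown above. `rule_histogram` is a hypothetical helper, not part of the gem, and the stemmer is passed in so that any object with a compatible `stem_with_rule` method works.

```ruby
# Hypothetical helper: count how often each rule number fires over a word list.
# Relies only on result.rule_num, as shown in the example above.
def rule_histogram(words, stemmer)
  words.each_with_object(Hash.new(0)) do |word, counts|
    counts[stemmer.stem_with_rule(word).rule_num] += 1
  end
end
```

With the gem loaded, `rule_histogram(%w[invited scarred helpers], UEAStemmer.new)` would return a hash keyed by rule number. The exact keys depend on the gem's rule table.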

Disable contraction expansion:

UEAStemmer.new(nil, nil, skip_contractions: true).stem("don't")
# => "don't"

Use the singleton instance:

DefaultUEAStemmer.instance.stem("running")  # => "run"

Contributing:

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add or update tests.

  • Run bundle exec rake test.

  • Send me a pull request. Bonus points for topic branches.

Copyright © 2005 by the University of East Anglia and authored by Marie-Claire Jenkins and Dr. Dan J. Smith. This port to Ruby was done by Jason Adams using the port to Java by Richard Churchill.

This project is distributed under the Apache 2.0 License. See LICENSE for details.