This document pertains to the Java version of the libstemmer distribution, available for download from:
https://snowballstem.org/download.html
Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
The stem is often a word itself, but not always, since text search systems (the intended field of use) do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form, or need a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.
The Java code generated by Snowball requires Java >= 7 (since Snowball 3.0.0). Java 7 was released in 2011, and Java 6's EOL was 2013 so we don't expect this to be a problematic requirement.
Simply run the Java compiler on all the Java source files under the java directory. For example, on Unix this can be done by changing directory into the java directory and running:
javac org/tartarus/snowball/*.java org/tartarus/snowball/ext/*.java
This will compile the library and also an example program "TestApp" which provides a command line interface to the library.
The stemming algorithms generally expect the input text to use composed accents (Unicode NFC or NFKC) and to have been folded to lower case already.
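As an illustration of the preparation described above, here is a minimal sketch that uses only the standard Java library. The class name StemmerInput and the method prepare are hypothetical names chosen for this example, and using Locale.ROOT for case folding is an assumption; some languages may need a language-specific locale instead.

```java
import java.text.Normalizer;
import java.util.Locale;

public class StemmerInput {
    // Hypothetical helper: compose accents (Unicode NFC) and fold to lower
    // case, so the word is in the form the stemmers generally expect.
    static String prepare(String word) {
        String composed = Normalizer.normalize(word, Normalizer.Form.NFC);
        // Assumption: locale-independent lower-casing is suitable here.
        return composed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The input spells the accented letter as 'e' plus a combining acute
        // accent (5 chars); NFC composes the pair into a single character.
        String prepared = prepare("Cafe\u0301");
        System.out.println(prepared.length()); // prints 4
    }
}
```

The prepared string would then be passed to a stemmer in place of the raw input.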
There is currently no formal documentation on the use of the Java version of the library. Additionally, its interface is not guaranteed to be stable.
The best documentation of the library is the source of the TestApp example program.
The stemmer code is re-entrant, but not thread-safe if the same stemmer object is used concurrently in different threads.
If you want to perform stemming concurrently in different threads, we suggest creating a new stemmer object for each thread. The alternative is to share stemmer objects between threads and protect access using a mutex or similar, but that's liable to slow your program down as threads can end up waiting for the lock.
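One way to give each thread its own stemmer object is a ThreadLocal. The sketch below is self-contained, so it uses a hypothetical stand-in class, ToyStemmer, which only strips a trailing "s". In real use you would substitute one of the generated stemmer classes. The setCurrent/stem/getCurrent calls are assumed to mirror how TestApp drives a stemmer, so check the TestApp source before relying on them. The code sticks to Java 7 syntax to match the stated requirement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadStemming {
    // Hypothetical stand-in for a generated stemmer. It holds mutable state
    // between calls, so one instance must not be used by two threads at once.
    static class ToyStemmer {
        private String current = "";
        void setCurrent(String word) { current = word; }
        boolean stem() {
            if (current.endsWith("s")) {
                current = current.substring(0, current.length() - 1);
            }
            return true;
        }
        String getCurrent() { return current; }
    }

    // Each thread lazily creates its own stemmer on first use.
    private static final ThreadLocal<ToyStemmer> STEMMER =
        new ThreadLocal<ToyStemmer>() {
            @Override protected ToyStemmer initialValue() {
                return new ToyStemmer();
            }
        };

    static String stem(String word) {
        ToyStemmer stemmer = STEMMER.get(); // this thread's private instance
        stemmer.setCurrent(word);
        stemmer.stem();
        return stemmer.getCurrent();
    }

    // Stems the words on a fixed-size pool; results keep the input order.
    static List<String> stemAll(List<String> words, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String word : words) {
                futures.add(pool.submit(new Callable<String>() {
                    @Override public String call() { return stem(word); }
                }));
            }
            List<String> stems = new ArrayList<String>();
            for (Future<String> future : futures) {
                stems.add(future.get());
            }
            return stems;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints [cat, dog, bird]
        System.out.println(stemAll(Arrays.asList("cats", "dogs", "bird"), 2));
    }
}
```

No locking is needed here because no stemmer object is ever visible to more than one thread.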
The TestApp example program allows you to run any of the stemmers compiled into the libstemmer library on a sample vocabulary. For details on how to use it, run it with no command line parameters.