Apache Solr is an open-source, Java-based enterprise search platform, typically used to provide full-text search capabilities. Even the vendor's website lists such standard applications of the product as distributed indexing, replication, load-balanced querying, automated failover and recovery, and centralized configuration. But is that all you can use Solr for? Of course not. Thanks to the platform's open architecture, developers can create custom extensions and add new functions.
In this article, however, we will take a closer look at one of the features available out of the box that requires no customization: Solr ships with some basic machine learning algorithms for simple text classification. We will use these algorithms to build a text classifier. But first, let's define our use case.
Text classification – example use case
Imagine an e-commerce search system. All text is indexed to provide full-text search functionality and labeled with categories. The users have to provide the labels manually, which consumes a significant amount of time and money. That process is begging for improvement.
As owners of the e-commerce platform, we'd like to automate the labeling. To do that, we'll use the already indexed documents with categories assigned as a reference: based on it, the system will assign labels to all new incoming text according to related keywords.
Typically, in a scenario like this, we'd use one of the many available classifier libraries, but we don't want to install additional software for now. Fortunately, the Solr package has built-in plugins for operating on text, such as:
- analysis (docs)
- filtering (docs)
- tokenization (docs)
- pre-processing input (docs)
- language-analysis (docs)
- phonetic matching (docs)
- text classification (docs): KNN and naive Bayes
This is sufficient to build a simple text classifier with pre-processing phases (clean, tokenization, stemming, etc.) and even to compare the results of the KNN algorithm and naive Bayes.
KNN vs. naive Bayes
The KNN (K-Nearest-Neighbors) classifier is based on the assumption that a document's classification is most similar to the classification of other documents with similar content. Compared to the naive Bayes classifier, KNN does not rely on prior probabilities, and it's computationally efficient. The main computation to be performed is the sorting of training documents to find the k nearest neighbors of the test document. This is a different approach from that of naive Bayes, which assumes that every class is distributed according to a simple distribution, regardless of its features.
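To make the KNN idea concrete, here is a minimal sketch (not Solr's implementation) of the voting step: given the k training documents most similar to a test document, the predicted class is simply the majority label among those neighbors. The class name and sample labels are hypothetical.

```java
import java.util.*;

public class KnnVote {

    /** Returns the most frequent class label among the k highest-scored training docs. */
    public static String predict(List<Map.Entry<String, Double>> scoredDocs, int k) {
        // Sort training docs by similarity score, descending.
        scoredDocs.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        // Count one vote per neighbor for its class label.
        Map<String, Integer> votes = new HashMap<>();
        for (Map.Entry<String, Double> doc : scoredDocs.subList(0, Math.min(k, scoredDocs.size()))) {
            votes.merge(doc.getKey(), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Each entry: (class label of a training document, similarity to the test document).
        List<Map.Entry<String, Double>> scored = new ArrayList<>(List.of(
            Map.entry("rec.autos", 0.9),
            Map.entry("sci.space", 0.8),
            Map.entry("rec.autos", 0.7),
            Map.entry("sci.space", 0.2)));
        System.out.println(predict(scored, 3)); // rec.autos (wins the vote 2 to 1)
    }
}
```

In Solr, the similarity scores come from the index itself (a "more like this" style query over the training documents), so no separate distance computation is needed.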
Sample data set
To build our model, we need text documents. This can be any collection of English text; we will use the "20 newsgroups" set.
The “20 newsgroups” data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. It was originally collected by Ken Lang, probably for the purpose of his “NewsWeeder: Learning to Filter Netnews” paper.
You can download the archive package from the original website; go to “Data” and find the file “20news-18828.tar.gz – 20 Newsgroups; duplicates removed, only “From” and “Subject” headers (18828 documents).” After that, extract the package to a suitable location.
Each article in a set has the same features:
- news number
- from, subject headers
- main text
Parser and indexer
In our classifier, we will need the 'main text' feature, plus a parser and indexer that extract the text and send it to Apache Solr. To manage that on my device, I created a simple project based on Spring Boot and spring-data-solr (there are many other ways to achieve this: bash scripts, converting the content to a .csv file and importing it from the Solr admin panel, and so on).
Below you can find an example of a parser/indexer using spring-data-solr; the most essential element is extracting the headers and main text from each raw article before indexing.
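As a rough illustration of the parsing step (a hypothetical helper, not the exact project code), the sketch below splits a raw "20 newsgroups" article into its "From"/"Subject" headers and the main text, which is the feature we index for classification:

```java
import java.util.HashMap;
import java.util.Map;

public class NewsArticleParser {

    /**
     * Parses the header block (everything up to the first blank line) into
     * lowercase keys ("from", "subject", ...) and stores the remaining
     * main text under the key "body".
     */
    public static Map<String, String> parse(String rawArticle) {
        Map<String, String> result = new HashMap<>();
        String[] parts = rawArticle.split("\n\n", 2); // headers vs. body
        for (String line : parts[0].split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                result.put(line.substring(0, colon).trim().toLowerCase(),
                           line.substring(colon + 1).trim());
            }
        }
        result.put("body", parts.length > 1 ? parts[1].trim() : "");
        return result;
    }

    public static void main(String[] args) {
        String article = "From: alice@example.com\nSubject: rec.autos question\n\nWhich engine oil should I use?";
        Map<String, String> doc = parse(article);
        System.out.println(doc.get("subject")); // rec.autos question
        System.out.println(doc.get("body"));    // Which engine oil should I use?
    }
}
```

From there, each parsed map can be turned into a Solr document (e.g., via a spring-data-solr repository or any other client) with the body going into the indexed content field.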
Remember that each data set should be split into training and testing subsets (approx. 70–90% for training and 10–30% for testing).
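A quick sketch of such a split (a hypothetical helper; the 80/20 ratio and the fixed seed are just example choices): shuffle the documents deterministically, then cut the list in two.

```java
import java.util.*;

public class TrainTestSplit {

    /** Splits docs into [training, testing] using the given training ratio. */
    public static List<List<String>> split(List<String> docs, double trainRatio, long seed) {
        List<String> shuffled = new ArrayList<>(docs);
        Collections.shuffle(shuffled, new Random(seed)); // deterministic shuffle
        int cut = (int) Math.round(shuffled.size() * trainRatio);
        return List.of(shuffled.subList(0, cut), shuffled.subList(cut, shuffled.size()));
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) docs.add("doc" + i);
        List<List<String>> parts = split(docs, 0.8, 42L);
        System.out.println(parts.get(0).size() + " training, " + parts.get(1).size() + " testing");
        // prints "8 training, 2 testing"
    }
}
```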
Creating a new collection in Apache Solr
Now it’s time to switch to Solr to create a new collection of indexed documents. First, run Solr instance and create a collection with the following commands:
./solr start -v
./solr create -c newsgroups
- These commands will also create the default schema, as well as the configuration and index directories for your documents.
- To delete a collection, run the same script with the delete option: ./solr delete -c newsgroups
If no errors have occurred, you will be able to open the admin panel of the default instance (http://localhost:8983/solr).
Solr classifier configuration
Let’s move on to configuration details. Here are some basic rules and guidelines:
- To configure the classifier, edit ‘managed-schema.xml.’
- The default path for the config files is: “$SOLR_HOME/server/solr/newsgroups/”. The file contains definitions of fields, analyzers, filters, and processors.
- The important thing is to focus on the tokenizer and stemmer. In this example, the chosen tokenizer and stemmer are WhitespaceTokenizerFactory (a simple tokenizer that splits a text stream on whitespace and returns sequences of non-whitespace characters as tokens) and SnowballPorterFilterFactory.
- Additionally, we configured two filters: StopFilterFactory and LowerCaseFilterFactory, just to clean the text of stopwords and add case-insensitive indexing.
- You can check the example of the configuration that we used here. This is the simplest version you need to run the text-classification algorithms, but you can also use it to try out different scenarios (for example, to check other stemmers and tokenizers). For testing, use the built-in analysis browser in the Solr admin panel.
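A field type matching the description above might look roughly like this in managed-schema (a minimal sketch, not the exact config from the article; the type name, field names, and stopword file are assumptions to adapt to your schema):

```xml
<!-- Minimal text field type for classification: whitespace tokenization,
     lowercasing, stopword removal, and Snowball (Porter) stemming. -->
<fieldType name="text_news" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- Fields used later by the classifier: the parsed content and the category. -->
<field name="content" type="text_news" indexed="true" stored="true"/>
<field name="group" type="string" indexed="true" stored="true"/>
```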
- The next configuration step is to define the classifier algorithm and change the request handler by adding our processor. The required settings are:
- inputFields: fields used by the algorithm for training
- classField: field with assigned category
- algorithm: “knn” or “bayes”
- additional algorithm options – you can find more information here
- Next, we need to add the classifier to the request processor chain for the "/update" endpoint. Configure this by adding an "updateProcessor" and extending the current chain. Here's an example of a configured KNN and a request handler, which you can copy & paste into the existing solrconfig.xml.
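For reference, a KNN classifier wired into the update chain could look roughly like this in solrconfig.xml (a sketch only; the field names content/group and the knn.* values are assumptions, not the article's exact settings):

```xml
<!-- Classifier: predicts the value of "group" from the tokens of "content". -->
<updateProcessor class="solr.ClassificationUpdateProcessorFactory" name="classification">
  <str name="inputFields">content</str>
  <str name="classField">group</str>
  <str name="algorithm">knn</str>
  <str name="knn.k">20</str> <!-- number of nearest neighbors that vote -->
  <str name="knn.minTf">1</str>
  <str name="knn.minDf">1</str>
</updateProcessor>

<!-- Extend the update chain so every /update request runs the classifier. -->
<updateProcessorChain name="classification" processor="classification" default="true">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateProcessorChain>
```

With this in place, documents indexed without a "group" value get one predicted at update time, while documents that already carry a label pass through unchanged and serve as training data.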
Running classifier configuration
Now that we’ve configured the classifier, it’s time to play! Run the entire configuration and check if our group prediction works for testing the set of news defined earlier.
Let’s recap: we have a data set, an instance of Apache Solr, and we implemented a simple parser with an indexer, so the steps of our pipeline are:
- Parsing the “20newsgroups” articles (the training data set)
- Checking if our analyzers work in Solr (via the built-in browser)
- Indexing training news with the following features:
- Number of news
- Group (category) of news
- Parsed content
- Indexing test news with the following features:
- Number of news
- Empty group (category)
- Parsed content
Our pipeline and classifier (except for the built-in KNN and naive Bayes algorithm implementations) are shown in the figures below:
Figure 1: Pipeline in our example. From a data set to Solr index.
Figure 2: Steps in our classifier. From a set of words to filtered tokens.
If everything goes according to plan, Solr should automatically fill in the 'group' field with the value of the predicted group.
This table presents the values in my case, and the results I obtained (correctness of predicted groups):
| Algorithm | Training set (number of news) | Testing set (number of news) | Correctness of predicted groups |
|-----------|-------------------------------|-------------------------------|---------------------------------|
Apache Solr – now it’s your turn to play
The primary goal of this article was to present Apache Solr's ability to run some basic machine learning algorithms. We hoped to demonstrate that the platform is capable of more than just providing the default full-text search functionality.
To prove our point, we have created a simple automated process of text classification using Solr. The results of our classifier could be much improved by applying different approaches to text indexing (e.g., by experimenting with various analyzers and tokenizers). But that was not our objective this time.
Feel free to experiment with our example using your own data sets. If you try out a different approach, maybe you’d like to share your results? This would provide valuable data to everyone. What do you think about this pipeline? Do you use Apache Solr for basic machine learning?
If you spot any errors in my example, please let me know, too. All feedback is welcome ✌
Share your insights in the comments section below and stay tuned for the next posts about Apache Solr.