Wednesday, 31 August 2016

Solr Is Learning To Rank Better - Part 3 - Ltr tools

Things Get Serious

The model has been trained, we are ready to deploy it to Solr, but first would be useful to have better understanding of what we just created.
A LambdaMART model in a real world scenario is a massive ensemble of regression trees, not the most readable structure for a human.
More we understand the model, easier will be to find anomalies and to fix/improve it.
But the most important benefit of having a clearer picture of the training set and the model is the fact that it can dramatically improves the communication with the business layer :

  • What are the most important features in our domain ?
  • What kind of document should score high according to the model ?
  • Why this document (feature vector) is scoring that high ?
These are only examples, but a lot of similar questions can rise, and we need the tools to answer.

Ltr Tools

This is how the Ltr tools project [1] was born.
Target of the project is to use the power of Solr to visualise and understand the model.
It is a set of simple tools specifically thought for LambdaMart models, represented in the Json format supported by the Bloomber Ltr Solr plugin.
Of course it is open source so feel free to extend it introducing additional models.
All the tools provided are meant to work with a Solr backend in order to index data that we can later search easily.
The tools currently available provide the support to :
- index the model  in a Solr collection
- index the training set in a Solr collection 
- print the top scoring leaves from a LambdaMART model

Preparation

To use the Ltr tools you must proceed with these simple steps :
  • set up the Solr backend - this will be a fresh Solr instance with 2 collections : models, trainingSet,  the simple configuration is available in : ltr-tools/configuration
  • gradle build - this will package the executable jar in : ltr-tools/ltr-tools/build/libs

Usage

Let's briefly take a look to the parameters of the executable command line interface : 


ParameterDescription
-helpPrint the help message
-tool 
<name>
The tool to execute (possible values):
- modelIndexer
- trainingSetIndexer
- topScoringLeavesViewer
-solrURL
<URL>
The Solr base URL to use for the search backend
-model 
<file>
The path to the model.json file
-topK
<int>
The number of top scoring leaves to return ( sorted by score descendant)
-trainingSet
<file>
The path to the training set file
-features
<file>
The path to the feature-mapping.json.
A file containing a mapping between the feature Id and the feature name.
-categoricalFeatures
<file>
The path to a file containing the list of categorical feature names.

N.B. all the following examples will assume the model in input is a LambdaMART model, in the json format the Bloomberg Solr Plugin expects.

Model Indexer

Requirement : Backend Solr collection <models> must be UP & RUNNING

The Model Indexer is a tool that indexes a lambdaMART model in Solr to better visualize the structure of the trees ensemble.
In particular the tool will index each branch split of the trees belonging to the lambdaMART ensemble as Solr documents.
Let's take a look the solr schema:

configuration/solr/models/conf
 <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>  
 <field name="modelName" type="string" indexed="true" stored="true"/>  
 <field name="feature" type="string" indexed="true" stored="true" docValues="true"/>  
 <field name="threshold" type="double" indexed="true" stored="true" docValues="true"/>  
 ...  


So giving in input a lambdaMART model :

e.g. lambdaMARTModel1.json
 {  
   "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",  
   "name":"lambdaMARTModel1",  
   "features":[  
    {  
      "name":"feature1"  
    },  
    {  
      "name":"feature2"  
    }  
   ],  
   "params":{  
    "trees":[  
      {  
       "weight":1,  
       "root":{  
         "feature":"feature1",  
         "threshold":0.5,  
         "left":{  
          "value":80  
         },  
         "right":{  
          "feature":"feature2",  
          "threshold":10.0,  
          "left":{  
            "value":50  
          },  
          "right":{  
            "value":75  
          }  
         }  
       }  
      }  
    ]  
   }  
 }  

N.B. a branching split is where the tree split in 2 branches:

 "feature":"feature2",   
      "threshold":10.0,   
      "left":{   
       "value":50   
      },   
      "right":{   
       "value":75   
      }   

A split happens on a threshold of the feature value.
We can use the tool to start the indexing process :

 java -jar ltr-tools-1.0.jar -tool modelIndexer -model /models/lambdaMARTModel1.json  -solrURL http://localhost:8983/solr/models

After the indexing process has finished we can access Solr and start searching !
e.g.
This query will return in response for each feature :

  • the number of times the feature appears at a branch split
  • the top 10 occurring thresholds for that feature
  • the number of unique thresholds that appear in the model for that feature

 http://localhost:8983/solr/models/select?indent=on&q=*:*&wt=json&facet=true&json.facet={  
      Features: {  
           type: terms,  
           field: feature,  
           limit: -1,  
           facet: {  
                Popular_Thresholds: {  
                     type: terms,  
                     field: threshold,  
                     limit: 10  
                },  
                uniques: "unique(threshold)"  
           }  
      }  
 }&rows=0&fq=modelName:lambdaMARTModel1  

Let's see how it is possible to interprete the Solr response :

 facets": {  
   "count": 3479, //number of branch splits in the entire model  
   "Features": {  
     "buckets": [  
       {  
         "val": "product_price",  
         "count": 317, //the feature "product_price" is occurring in the model in 317 splits  
         "uniques": 28, //the feature "product_price" is occurring in the splits with 28 unique threshold values  
         "Popular_Thresholds": {  
           "buckets": [  
             {  
               "val": "250.0", //threshold value  
               "count": 45 //the feature "product_price" is occurring in the splits 45 times with threshold "250.0"  
             },  
             {  
               "val": "350.0",  
               "count": 45  
             },  
             ...  



TrainingSet Indexer

Requirement : Backend Solr collection <trainingSet> must be UP & RUNNING

The Training set Indexer is a tool that indexes a Learning To Rank traning set (in RankLib format) in Solr to better visualize the data.
In particular the tool will index each training sample of the trainign set as a Solr document.
Let's see the Solr schema :

configuration/solr/models/conf
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="relevancy" type="tdouble" indexed="true" stored="true" docValues="true"/> 
<dynamicField name="cat_*" type="string" indexed="true" stored="true" docValues="true"/>
<dynamicField name="*" type="tdouble" indexed="true" stored="true" docValues="true"/> 

As you can notice the main point here is definition of dynamic fields.
Indeed we don't know beforehand the names of the features, but we can distinguish between categorical features ( which we can index as strings) and ordinal features (which we can index as double).

We require now 3 inputs :

1) the training set in the RankLib format: 

e.g. training1.txt
1 qid:419267 1:300 2:4.0 3:1 6:1
4 qid:419267 1:250 2:4.5 4:1 7:1
5 qid:419267 1:450 2:5.0 5:1 6:1
2 qid:419267 1:200 2:3.5 3:1 8:1 

2) the feature mapping to translate the feature Id to a human readable feature name

e.g. features-mapping1.json
 {"1":"product_price","2":"product_rating","3":"product_colour_red","4":"product_colour_green","5":"product_colour_blue","6":"product_size_S","7":"product_size_M","8":"product_size_L"}  
N.B. the mapping must be a json object on a single line

This input file is optional, it is possible to index directly the feature Ids as names.

3) the list of categorical features

e.g. categoricalFeatures1.txt
product_colour
product_size 

This list ( one feature per line) will clarify to the tool which features are categorical, to index the category as a string value for the feature.
This input file is optional, it is possible to index the categorical features as binary one hot encoded features.

To start the indexing process :

 java -jar ltr-tools-1.0.jar -tool trainingSetIndexer -trainingSet /trainingSets/training1.txt -features /featureMappings/feature-mapping1.json -categoricalFeatures /feature/categoricalFeatures1.txt -solrURL http://localhost:8983/solr/trainingSet

After the indexing process has finished we can access Solr and start searching !
e.g.
This query will return in response all the training samples filtered and then faceted on the relevancy field.

This can be an indication of the distribution of the relevancy score in specific subsets of the training set

 http://localhost:8983/solr/trainingSet/select?indent=on&q=*:*&wt=json&fq=cat_product_colour:red&rows=0&facet=true&facet.field=relevancy


N.B. this is a quick and dirty way to explore the training set. I deeply suggest you to use it as a quick resource. Advance data plotting is more suitable to visualize big data and identify patterns.

Top Scoring Leaves Viewer

The top scoring leaves viewer is a tool to print the path of the top scoring leaves in the model.
Thanks to this tool will be easier to answer to questions like :
" How a document (feature vector) should look like to get an high score?"
The tool will simply visit the ensemble of trees in the model and keep track of the scores of each leaf.

So giving in input a lambdaMART model : 

e.g. lambdaMARTModel1.json
 {  
   "class":"org.apache.solr.ltr.ranking.LambdaMARTModel",  
   "name":"lambdaMARTModel1",  
   "features":[  
    {  
      "name":"feature1"  
    },  
    {  
      "name":"feature2"  
    }  
   ],  
   "params":{  
    "trees":[  
      {  
       "weight":1,  
       "root":{  
         "feature":"feature1",  
         "threshold":0.5,  
         "left":{  
          "value":80  
         },  
         "right":{  
          "feature":"feature2",  
          "threshold":10.0,  
          "left":{  
            "value":50  
          },  
          "right":{  
            "value":75  
          }  
         }  
       }  
      }, ...  
    ]  
   }  
 }  

To start the process :

 java -jar ltr-tools-1.0.jar -tool topScoringLeavesViewer -model /models/lambdaMARTModel1.json -topK 10  

This will print the top scoring 10 leaves (with related path in the tree):

1000.0 -> feature2 > 0.8, feature1 <= 100.0
200.0 -> feature2 <= 0.8, 
80.0 -> feature1 <= 0.5, 
75.0 -> feature1 > 0.5, feature2 > 10.0, 
60.0 -> feature2 > 0.8, feature1 > 100.0, 
50.0 -> feature1 > 0.5, feature2 <= 10.0,  

Conclusion

The Learning to rank tools are quick and dirty solutions to help people understanding better and working better with Learning To Rank models.
They are far from being optimal but I hope they will be helpful for people working on similar problems.
Any contribution, improvement, bugfix is welcome !

[1] Learning To Rank tools

No comments:

Post a Comment