Modelling with Osmot
This tutorial briefly describes how to run Osmot to model user behavior. This is only a small part of Osmot's functionality (its main purpose is to run real-world search experiments and search engine learning). Please email me if there are any inaccuracies or confusing steps.
Please email me if you are using this software to run evaluations as this is a BETA RELEASE. This code has been completely rewritten and has not yet been fully re-tested. I'd like to be able to let you know if I find any significant problems.
Osmot can be used to generate a document collection and then model user behavior on it in order to evaluate algorithms for learning to rank rapidly and offline. This is a brief tutorial about how to run modelling experiments as described in [Radlinski & Joachims, ICML 2005].
- Download and unpack Osmot.
We assume you unpack in ~/osmot. All paths here are relative to the osmot root directory (which contains the directories 'lib', 'java', 'jsp', etc.). You need ant installed to build Osmot.
You also need a recent version of svm_light. Remember where you installed svm_learn (you'll need this later).
- Install the following (as JAR files) in the lib subdirectory
- Colt (i.e. colt.jar, I suggest version 1.0.3)
- servlet, commons-fileupload (from the Apache toolkit)
- Nutch (at least version nutch-0.7.jar, but before version 0.8)
- Lucene (at least version 1.9-rc1)
While Nutch is optional at runtime (you don't need to keep the jar files around to run Osmot if you're not using a NutchSearcher), you need it to compile the source. Also, testing is performed with the version of Lucene obtained from nutch-0.7.jar. Copy lucene-*.jar to the Osmot lib directory.
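As a quick sanity check, a snippet like the following can report which of the expected jars are in the lib directory. This is only a sketch: it matches on jar name prefixes, which are assumptions, so adjust them to your actual file names and versions.

```shell
# Sketch: report which of the expected jars are present in lib/.
# The name prefixes below are assumptions; adjust to your actual versions.
check_lib_jar() {
  ls lib/ 2>/dev/null | grep -qi "^$1" && echo "found $1" || echo "missing $1"
}
for jar in colt servlet commons-fileupload nutch lucene; do
  check_lib_jar "$jar"
done
```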
- If you do not want to use the default document collection settings, modify java/edu/cornell/cs/osmot/modelling/generation/GenerateData.java to describe the document collection you want to build. There are a number of constants describing the collection.
- Compile Osmot
Run the following in the Osmot root directory. This will create the file build/osmot-0.9.jar
cd ~/osmot/
ant jar
- Generate a document collection:
java -cp ~/osmot/build/osmot-0.9.jar:~/osmot/lib/colt.jar -Xmx256m edu.cornell.cs.osmot.modelling.generation.GenerateData
The option -Xmx256m sets the maximum heap size of the Java VM to 256 megabytes. Some Java virtual machines have a default that is much too low. The size needed depends on the size of the document collection you want to build. Don't worry if you have less RAM; the JVM will swap things in as it needs them. WARNING: If your JVM doesn't support this option, you need to work out the equivalent option for your JVM, otherwise you might experience strange crashes (e.g. class not found) caused by Java running out of space.
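Before starting a long generation run, you can check whether your JVM accepts the -Xmx flag. This is just a sketch; the flag is standard on Sun/Oracle-style JVMs, but other VMs may behave differently.

```shell
# Sketch: print "ok" if the local JVM accepts -Xmx256m, "unsupported" otherwise
# (also prints "unsupported" if no java is on the PATH).
check_xmx() {
  java -Xmx256m -version >/dev/null 2>&1 && echo "ok" || echo "unsupported"
}
check_xmx
```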
This process will generate 10 independent document collections so that you can evaluate the variance of your results. Be warned, these can be big files. The files generated are:
- documents.iterN.dat: The document vectors.
- document_labels.iterN.dat: A list of topics each document is relevant to.
- topics.iterN.dat: The topic vectors.
- idf_scores.iterN.dat: IDF scores for each term, though we don't use these.
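Since ten collections are generated, a quick way to confirm everything was written is to enumerate the 40 expected file names and check for each one. A minimal sketch, assuming you run it from the directory where GenerateData wrote its output:

```shell
# Sketch: list the 40 expected output files (4 per collection, 10 collections)
# and report any that are missing from the current directory.
expected_files() {
  for i in 1 2 3 4 5 6 7 8 9 10; do
    for f in documents document_labels topics idf_scores; do
      echo "$f.iter$i.dat"
    done
  done
}
expected_files | while read -r file; do
  [ -f "$file" ] || echo "missing: $file"
done
```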
- Plug your searching algorithm into Osmot.
You have to write a wrapper for your original ranking function that implements the Searcher interface. There are already a few examples in Osmot that should help. We will assume you are using the default searcher used for modelling, implemented in ~/osmot/java/edu/cornell/cs/osmot/searcher/ModelSearcher.java. In this case, you don't need to do anything in this step.
- Modify ~/osmot/osmot.conf if necessary
osmot.conf has many options, including a few key ones for modelling. Make sure the location of svm_learn on your machine matches what is specified in osmot.conf, the searcher is ModelSearcher, and the unique id field is "id":
UNIQ_ID_FIELD = id
...
SEARCHER_TYPE = ModelSearcher
...
SVM_LEARN_PATH = /usr/local/bin/svm_learn
...
RERANKER_FEATURES_FILE = features.out
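To double-check these values without opening the file, you could grep for them. A sketch, using the setting names above (show_model_settings is just an illustrative name):

```shell
# Sketch: print the modelling-related settings from a config file.
show_model_settings() {
  grep -E '^(UNIQ_ID_FIELD|SEARCHER_TYPE|SVM_LEARN_PATH|RERANKER_FEATURES_FILE)' "$1"
}
# e.g. show_model_settings osmot.conf
```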
- Determine the user parameters you want to use
You need to set all the parameters of user behavior. They are passed via the command line to the modelling code. For the rest of this tutorial we assume that you will use the default settings. The parameters are described in the paper.
- Run a simulation using Osmot
You are now ready to simulate user behavior on this document collection. Run the following:
cd ~/osmot
java -cp build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar -Xmx256m edu.cornell.cs.osmot.modelling.usage.GetResults --files documents.iter1.dat document_labels.iter1.dat topics.iter1.dat > log1.out
- Generate preferences from the output of this simulation
java -cp build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar -Xmx256m edu.cornell.cs.osmot.reranker.Learner generate --log log1.out --docs documents.iter1.dat --labels document_labels.iter1.dat --prefs prefs1.out --chains
prefs1.out contains the preferences for learning; features.out defines the features in the preferences. --chains indicates that we want to use query chains in generating preferences. If you want the document id pairs, use the option "--debug" to get additional output; the pairs are then written as part of the debug output in the log file.
- Run the learning algorithm on the preferences
Our learning algorithm is run as follows. This is the slow step.
java -cp build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.reranker.Learner learn allprefs.out model.out prefs1.out
The preferences files (only one in this case) are combined into allprefs.out, along with the additional hard constraints as described in the paper. The model that describes how the results are reranked is output into model.out.
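Since the simulate, generate, and learn commands differ only in the collection index, you may want to wrap them in a small helper when running all ten collections. This is a sketch: the paths and class names are taken from the commands above, while run_iteration is just an illustrative name.

```shell
# Sketch: run simulation, preference generation, and learning for collection $1.
# Assumes you are in the osmot root directory and the data files exist.
run_iteration() {
  n="$1"
  cp="build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar"
  java -cp "$cp" -Xmx256m edu.cornell.cs.osmot.modelling.usage.GetResults \
    --files "documents.iter$n.dat" "document_labels.iter$n.dat" "topics.iter$n.dat" \
    > "log$n.out"
  java -cp "$cp" -Xmx256m edu.cornell.cs.osmot.reranker.Learner generate \
    --log "log$n.out" --docs "documents.iter$n.dat" \
    --labels "document_labels.iter$n.dat" --prefs "prefs$n.out" --chains
  java -cp "$cp" edu.cornell.cs.osmot.reranker.Learner learn \
    allprefs.out model.out "prefs$n.out"
}
# e.g. run_iteration 1
```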
- Simulate user behavior again, this time using the reranker.
java -cp build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar -Xmx256m edu.cornell.cs.osmot.modelling.usage.GetResults --files documents.iter1.dat document_labels.iter1.dat topics.iter1.dat --use-reranker model.out features.out > log2.out
- You can now check how performance has changed!
We measure the maximum relevance of the results in the top 5 for the first query of each chain.
- You can repeat from the "Run a simulation using Osmot" step to run multiple iterations.