Osmot Search Engine Tutorial

Web Search with Osmot

This tutorial briefly describes how to set up Osmot as a search engine on a collection of documents. You will then be able to use Osmot to learn an improved search function from usage logs. This process is somewhat involved, so it may take some time.

Please email me if there are any inaccuracies or confusing steps. These instructions are currently in draft form and will be improved within a week or two.

Please email me if you are using this software to run evaluations as this is a BETA RELEASE. This code has been completely rewritten and has not yet been fully re-tested. I'd like to be able to let you know if I find any significant problems.

The sorts of things you can do with Osmot are described in [Radlinski & Joachims, KDD 2005].

Make sure you have the command line search working.
That setup is covered in the command line search tutorial.
Install Tomcat.
You need to make sure Tomcat is working, for example by running one of the Tomcat demos. We assume you can access your Tomcat server at http://127.0.0.1:8080/.
Compile Osmot
Assuming the JAR file compiles correctly, copy the plugins directory from Nutch to the Osmot directory. Then run "ant war" to generate a WAR file, and copy osmot-0.9.war to the webapps directory in the Tomcat root directory. In webapps, delete or rename ROOT.war and the directory ROOT (these initially contain Tomcat demo applications), then rename osmot-0.9.war to ROOT.war.
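The WAR installation described above can be scripted. The following is a minimal sketch, not part of Osmot: the function name install_root_war and its backup directory argument are assumptions made for illustration. Rather than deleting the Tomcat demo application, it moves it to a directory outside webapps so that Tomcat no longer deploys it.

```shell
#!/bin/sh
# Sketch only: install_root_war and its arguments are illustrative, not part of Osmot.
#   $1: WAR file produced by "ant war" (osmot-0.9.war)
#   $2: Tomcat's webapps directory
#   $3: fresh directory outside webapps in which to keep the Tomcat demo application
install_root_war() {
    war="$1"
    webapps="$2"
    backup="$3"

    [ -f "$war" ] || { echo "WAR file not found: $war" >&2; return 1; }
    [ -d "$webapps" ] || { echo "webapps directory not found: $webapps" >&2; return 1; }

    # Move the demo application out of webapps so Tomcat no longer deploys it.
    mkdir -p "$backup" || return 1
    if [ -e "$webapps/ROOT.war" ]; then
        mv "$webapps/ROOT.war" "$backup/" || return 1
    fi
    if [ -d "$webapps/ROOT" ]; then
        mv "$webapps/ROOT" "$backup/" || return 1
    fi

    # Installing the Osmot WAR under the name ROOT.war makes it the root application.
    cp "$war" "$webapps/ROOT.war"
}
```

For example, with Tomcat installed in a hypothetical /opt/tomcat, you would run: install_root_war osmot-0.9.war /opt/tomcat/webapps /opt/tomcat/demo-backup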
Restart Tomcat
You probably kept Tomcat running after installing and testing it. You need to restart it after copying the WAR file into the webapps directory. Once Tomcat has started, it will extract the WAR file into a ROOT directory in the webapps directory.
Run a search
Go to http://127.0.0.1:8080/. You should see a simple search page. To run a search, enter a query such as "document". This will search the set of documents you created in the command line search tutorial and return any that contain the word "document".
To make Osmot useful for your purposes you now need to:
- Tweak osmot.conf, the configuration file, to match your setup.
- Create a Lucene index of your collection, if you want to search using Lucene. The SimpleIndexer.java source will probably be useful to help you do this.
- Create your own Searcher as well as an index of your collection, if you wish to use something other than Lucene.
- Fix the redirector page (details.jsp) to redirect to the correct URL for the documents in your collection. This is currently controlled by osmot.conf, although you may want to use a more complex naming scheme than the current implementation allows. If this is the case, it should be sufficient to modify details.jsp and the section of search.jsp that displays the URL of each result.
- Personalize the header, footer and search page to match the look and feel you want.
Learn an improved ranking function
Once you have a working search engine that has generated a reasonable amount of log information, you may want to learn an improved ranking function. For this, you need to analyze the log files generated by Osmot. This is a two-step process. First, you need to generate preferences from the log files:
```
java -cp build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar -Xmx256m edu.cornell.cs.osmot.reranker.Learner multigen logs/Log_FILL_IN_HERE.log
```
You need to substitute the log file name into the command and run this in the root of your Osmot install directory. The log files you should run with are those with file names including .1a, .1b, .1c, .2a, .2b and .2c. Here is what the search modes mean:
- 1: EVALUATION
  - 1a: Mix chains reranker with nochains reranker
  - 1b: Mix chains reranker with no reranker
  - 1c: Mix nochains reranker with no reranker
- 2: DATA COLLECTION
  - 2a: Display results reranked with nochains reranker
  - 2b: Display results reranked with chains reranker
  - 2c: Display results without reranking
When evaluating with query chains, you can use the data generated from all six modes. If evaluating learning without query chains, you must use only modes 1c, 2a and 2c. The other modes include results influenced by query chains, so they would not give a valid evaluation. To change the ratio between the number of queries collected with each top-level mode, change the option SEARCHER_FRACTION_EVALUATION in osmot.conf. The distribution between the sub-modes is equal, though it can be changed in the source code in searcher/SearchBean.java:pickMode(). You probably don't want to run a no-chains reranker in a real-world setting, so you should make sure modes 1a, 1c and 2a never occur. If you simply want to use the search engine without running experiments, you want the mode to always be 2b.
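The log selection rule above (all six modes when evaluating with query chains, only 1c, 2a and 2c without) can be sketched as a small shell filter. This is an illustration rather than part of Osmot: the function name select_logs is made up, and it assumes only that the mode tag (.1a, .2c and so on) appears somewhere in the log file name.

```shell
#!/bin/sh
# Sketch only: select_logs is illustrative, not part of Osmot.
# Print the log files that are valid for a given kind of evaluation.
#   $1: "chains" or "nochains"
#   remaining arguments: candidate log file names containing a mode tag
select_logs() {
    kind="$1"
    shift
    for f in "$@"; do
        case "$kind:$f" in
            # With query chains, all six modes are usable.
            chains:*.1[abc]*|chains:*.2[abc]*) echo "$f" ;;
            # Without query chains, only modes 1c, 2a and 2c are valid.
            nochains:*.1c*|nochains:*.2[ac]*) echo "$f" ;;
        esac
    done
}
```

For example, select_logs nochains logs/*.log lists only the logs recorded in modes 1c, 2a and 2c.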

The analysis will create a file with the same name as the log file(s) except replacing .log with .CHprefs or .NCprefs as appropriate. All the other file names used will come from osmot.conf. If you want to switch between generating preferences using query chains and preferences without using query chains, you need to uncomment the correct block at around line 826 in java/edu/cornell/cs/osmot/reranker/Learner.java.
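If you have several log files, the preference generation step can be wrapped in a loop. The sketch below is an illustration, not part of Osmot: the function names and the DRY_RUN switch are made up, and the mapping of "chains" to .CHprefs and "nochains" to .NCprefs is an assumption based on the file extensions mentioned above.

```shell
#!/bin/sh
# Sketch only: the function names and the DRY_RUN switch are illustrative.
# Run from the root of the Osmot install directory.
OSMOT_CP="build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar"

# Generate preferences for each log file given as an argument.
# Set DRY_RUN=1 to print the commands instead of running them.
run_multigen() {
    for log in "$@"; do
        ${DRY_RUN:+echo} java -cp "$OSMOT_CP" -Xmx256m \
            edu.cornell.cs.osmot.reranker.Learner multigen "$log" || return 1
    done
}

# Print the preference file name expected for a log file.
#   $1: log file name ending in .log
#   $2: "chains" (.CHprefs) or "nochains" (.NCprefs); this mapping is an assumption
prefs_file() {
    case "$2" in
        chains)   ext=CHprefs ;;
        nochains) ext=NCprefs ;;
        *) echo "unknown kind: $2" >&2; return 1 ;;
    esac
    echo "${1%.log}.$ext"
}
```

Setting DRY_RUN=1 before calling run_multigen lets you check the generated command lines before any analysis is run.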

Next, you have to use the preferences to learn a new ranking function:
```
java -cp build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.reranker.Learner learn RERANKER.allCHprefs RERANKER.chainsModel logs/Log_FILL_IN_HERE*.CHprefs
```
Again, you need to substitute the right file names. The name of the reranker features file will be taken from osmot.conf, so make sure that the option RERANKER_FEATURE_FILE is set correctly.
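One pitfall with the learning command is that a shell wildcard that matches nothing is passed to Java as a literal string. The sketch below guards against that. It is an illustration, not part of Osmot: the function name and the DRY_RUN switch are made up, and RERANKER.allCHprefs and RERANKER.chainsModel are kept as the placeholder names used in the command above.

```shell
#!/bin/sh
# Sketch only: learn_chains_model and DRY_RUN are illustrative, not part of Osmot.
# Run from the root of the Osmot install directory.
OSMOT_CP="build/osmot-0.9.jar:lib/colt.jar:lib/lucene-1.4.3.jar"

# Learn a ranking function from every .CHprefs file that starts with a prefix.
#   $1: prefix of the preference files (substitute your own log file prefix)
# Set DRY_RUN=1 to print the command instead of running it.
learn_chains_model() {
    prefix="$1"
    # Replace the argument list with the matching preference files.
    set -- "$prefix"*.CHprefs
    # An unmatched wildcard is left as the literal pattern, so check the first entry.
    [ -f "$1" ] || { echo "no preference files match $prefix*.CHprefs" >&2; return 1; }
    ${DRY_RUN:+echo} java -cp "$OSMOT_CP" edu.cornell.cs.osmot.reranker.Learner \
        learn RERANKER.allCHprefs RERANKER.chainsModel "$@"
}
```

The function returns a non-zero status without starting Java when no preference files are found, which makes a missing multigen step easy to spot.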
Copy the model to the location specified in the Osmot configuration file, if necessary.
It won't be necessary if you haven't changed the file and directory locations set by default in osmot.conf.
Restart Tomcat, and run queries.
When you run queries, you will now sometimes see the ranking changed by the learned ranking function (depending on the mode randomly selected when you connect). You might also see slightly surprising results, where documents that do not contain the search terms appear in the search results. This is because documents may have had a score attached because they were clicked on in a query chain that includes the query you are running. Also, if the rankings returned by two ranking functions are of different lengths, random documents are appended to the shorter set of results.

To understand a bit better what is going on, try running searches with searchExplain.jsp instead of search.jsp.
Compare the two ranking functions
The web page evaluate.jsp is intended for this purpose. More instructions about interpreting the results will be posted here within a week or two.