Command Line Search with Osmot

This tutorial briefly describes how to set up Osmot as a search engine that can be used from the command line. While not terribly useful in the real world, this process is less involved than setting Osmot up as a web search engine, and it still shows Osmot's functionality. Please email me if there are any inaccuracies or confusing steps. The sorts of things you can do with Osmot are described in [Radlinski & Joachims, KDD 2005].

Please email me if you are using this software to run evaluations as this is a BETA RELEASE. This code has been completely rewritten and has not yet been fully re-tested. I'd like to be able to let you know if I find any serious problems.

  1. Download and unpack Osmot

    We assume you unpack in ~/osmot. All paths here are relative to the osmot root directory (which contains the directories 'lib', 'java', 'jsp', etc).

  2. Install the following (as JAR files) in the lib subdirectory

    • Colt (i.e. colt.jar, I suggest version 1.0.3)
    • servlet, commons-fileupload (from the Apache toolkit)
    • Nutch (at least version nutch-0.7.jar, but before version 0.8)
    • Lucene (at least version 1.9-rc1, you will find it included with Nutch 0.7)

    Using Nutch at run time is optional: you don't need to keep its jar files around to run Osmot unless you are using a NutchSearcher. You do, however, need them to compile the source. Testing is performed using the version of Lucene that comes with Nutch 0.7. Copy lucene-*.jar to the Osmot lib directory, and copy nutch-site.xml and common-terms.utf8 to the Osmot root directory.
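
    The copy step above might look like the following sketch. The helper name copy_nutch_files is hypothetical, and the layout it expects inside the Nutch directory (the Lucene jar under lib/, the two configuration files under conf/) is an assumption about your Nutch 0.7 download; adjust the paths if yours differs.

```shell
# Hypothetical helper: copy the files named above out of a Nutch tree.
# Assumed layout: <nutch_home>/lib/lucene-*.jar and <nutch_home>/conf/*.
# Usage: copy_nutch_files <nutch_home> <osmot_root>
copy_nutch_files() {
    nutch_home=$1
    osmot_root=$2
    mkdir -p "$osmot_root/lib" || return 1
    # The Lucene jar goes into Osmot's lib directory.
    cp "$nutch_home"/lib/lucene-*.jar "$osmot_root/lib/" || return 1
    # The two Nutch configuration files go into the Osmot root directory.
    cp "$nutch_home/conf/nutch-site.xml" \
       "$nutch_home/conf/common-terms.utf8" \
       "$osmot_root/" || return 1
}
```

    For example, if Nutch were unpacked in ~/nutch-0.7 (a made-up location), you would run: copy_nutch_files ~/nutch-0.7 ~/osmot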

  3. Compile Osmot

    You need ant to compile Osmot.

    cd ~/osmot
    ant jar
    
  4. Set up config file to use simple searcher

    Osmot is built so that you can plug in search engines, and it can then learn to improve the plugged-in search engine's ranking (subject to some constraints). For the purpose of this tutorial, we will use a simple searcher (the LuceneSearcher named in the options below), which runs a simple search using Lucene.

    Set the following options in osmot.conf:

    UNIQ_ID_FIELD = uniqId
    SEARCHER_TYPE = LuceneSearcher
    SNIPPETER_FIELDS = content
    OSMOT_ROOT = <Location of Osmot>
    INDEX_DIRECTORY = <Location where you want the index>
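
    Before moving on, it can help to confirm that all five options are actually present in the file. The sketch below is one way to do that; the helper name check_osmot_conf is hypothetical, and it assumes the file uses one "KEY = value" pair per line, as in the listing above. It only checks that each key appears, not that its value is sensible.

```shell
# Hypothetical helper: report which of the five tutorial options are
# missing from a config file. Returns 0 if all are present, 1 otherwise.
# Usage: check_osmot_conf <path-to-osmot.conf>
check_osmot_conf() {
    conf=$1
    missing=0
    for key in UNIQ_ID_FIELD SEARCHER_TYPE SNIPPETER_FIELDS OSMOT_ROOT INDEX_DIRECTORY; do
        # Allow optional whitespace before the key and around the '=' sign.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}
```

    Running check_osmot_conf osmot.conf from the Osmot root prints one "missing: ..." line per absent option and prints nothing when all five are set.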
    
  5. Create a document collection to search

    In order to demonstrate search, we need a simple document collection. For this purpose, we have created a simple indexer. Create a collection of seven tiny documents by running:

    cd ~/osmot
    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.indexer.SimpleIndexer new
    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.indexer.SimpleIndexer add 1 "Document 1" "This is a document"
    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.indexer.SimpleIndexer add 2 "Document 2" "Welcome to Osmot"
    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.indexer.SimpleIndexer add 3 "Document 3" "Welcome document"
    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.indexer.SimpleIndexer add 4 "Document 4" "Osmot searches documents"
    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.indexer.SimpleIndexer add 5 "Document 5" "This is a short document search"
    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.indexer.SimpleIndexer add 6 "Document 6" "Osmot documents rock"
    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.indexer.SimpleIndexer add 7 "Document 7" "Search for documents with Osmot"
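
    Since the eight commands above differ only in their last few arguments, they can be wrapped in a small script. The following is a sketch, not part of Osmot: the names OSMOT_CP, OSMOT_JAVA, add_doc and build_collection are made up for illustration. Keeping the classpath in one variable also makes it easy to change the Lucene jar name if the file in your lib directory is named differently.

```shell
# Classpath from the commands above; edit the Lucene jar name if yours differs.
OSMOT_CP=${OSMOT_CP:-build/osmot-0.9.jar:lib/lucene-1.4.3.jar}
# Java launcher; pointing this at another command (e.g. echo) gives a dry run.
OSMOT_JAVA=${OSMOT_JAVA:-java}
INDEXER=edu.cornell.cs.osmot.indexer.SimpleIndexer

# Hypothetical helper: add one document, titled "Document <id>".
# Usage: add_doc <id> <content>
add_doc() {
    "$OSMOT_JAVA" -cp "$OSMOT_CP" "$INDEXER" add "$1" "Document $1" "$2"
}

# Hypothetical helper: create a new index, then add the seven documents.
# Stops at the first command that fails.
build_collection() {
    "$OSMOT_JAVA" -cp "$OSMOT_CP" "$INDEXER" new &&
    add_doc 1 "This is a document" &&
    add_doc 2 "Welcome to Osmot" &&
    add_doc 3 "Welcome document" &&
    add_doc 4 "Osmot searches documents" &&
    add_doc 5 "This is a short document search" &&
    add_doc 6 "Osmot documents rock" &&
    add_doc 7 "Search for documents with Osmot"
}
```

    After defining these in your shell, run cd ~/osmot followed by build_collection. The relative classpath only resolves from the Osmot root directory.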
    
  6. Search over your collection

    java -cp build/osmot-0.9.jar:lib/lucene-1.4.3.jar edu.cornell.cs.osmot.searcher.LuceneSearcher osmot