Python

Warning: the python API doesn't allow you to plug in user-defined distance metric written in Python. Because distance evaluation is extremely frequently invoked, a callback written in Python will make the computation very slow. For matrix data, it's easy to modify the Python wrapper code to add custom distances(see below).

Building

The python binding is in the "python" directory of the SDK. You'll need python development library and boost python to build (libpython-dev and libboost-python-dev in Ubuntu).

Constraints

Only float and double matrices are supported (need uint8?). Right now only L2 distance is supported. By default, each vector has to occupy a multiple of 16 bytes in memory. If that doesn't work for you, you'll have to plug in your own distance computation code as explained in here.

Sample Session

>>> from numpy import random
>>> import pykgraph
>>> dataset = random.rand(1000000, 16)
>>> query = random.rand(1000, 16)
>>> index = pykgraph.KGraph(dataset, 'euclidean')
>>> index.build()
>>> index.save("index_file");
>>> # load with index.load("index_file");
......
>>> result = index.search(query, K=10)                        # this uses all CPU threads, set prune=1 to make index smaller (no accuracy loss)
>>> result = index.search(query, K=10, threads=1)             # one thread, slower
>>> result.shape             # the k-nn IDs.
>>> result = index.search(query, K=1000, P=100)                # search for 1000-nn, no need to recompute index.
(1000, 10)

Tweaking

Full building and searching interface and default parameters:

>>> index.build(iterations=100, L=50, K=10, S=10, controls=100, delta=0.005, recall=0.98, prune=0)
>>> result = index.search(query, K=10, P=100, threads=1)       # set threads=0 to automatically use all cpu threads.

Checkout Tutorial and Document for the exact meaning of the parameters. If you find result quality is too low, try the following:

  • Enlarge search P (for probe) to a big number like 1000. (You can also reduce that to 20 to make it run extremely fast.)
  • Enlarge L in indexing. L should not be much larger than 100 or you won't be able to finish indexing in tolerable time. (L should not be < 50.)
  • The index K doens't have to be the same as the search K, don't change that.

User-Defined Distance Metric

Write up something in C++ like

    struct mydistance {
        template <typename T>
        static float apply (T const *t1, T const *t2, unsigned dim) {
            // replace code below with your own distance computing code.
            float r = 0;
            for (unsigned i = 0; i < dim; ++i) {
                float v = float(t1[i]) - float(t2[i]);
                v *= v;
                r += v;
            }  
            return r;
        }  
    };  

Put that above the "KGraph" class in pykgraph.cpp. Search and replace all appearances of "kgraph::metric::l2sqr" with your own distance class.