Tutorial

Run K-NN Search Demo with the Provided Tools


$ cd kgraph-1.1-x86_64/bin
$
$ wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
$ tar zxvf sift.tar.gz
sift/
sift/sift_base.fvecs
sift/sift_groundtruth.ivecs
sift/sift_learn.fvecs
sift/sift_query.fvecs
$
$ # convert format
$ ./fvec2lshkit sift/sift_base.fvecs sift.data
0 16 35 5 32 31 14 10 11 78 55 10 45 83 11 6 14 57 102 75 20 8 3 5 67 17 19 26 5 0 1 22 60 26 7 1 18 22 84 53 85 119 119 4 24 18 7 7 1 81 106 102 72 30 6 0 9 1 9 119 72 1 4 33 119 29 6 1 0 1 14 52 119 30 3 0 0 55 92 111 2 5 4 9 22 89 96 14 1 0 1 82 59 16 20 5 25 14 11 4 0 0 1 26 47 23 4 0 0 4 38 83 30 14 9 4 9 17 23 41 0 0 2 8 19 25 23 1
$ ./fvec2lshkit sift/sift_query.fvecs sift.query
1 3 11 110 62 22 4 0 43 21 22 18 6 28 64 9 11 1 0 0 1 40 101 21 20 2 4 2 2 9 18 35 1 1 7 25 108 116 63 2 0 0 11 74 40 101 116 3 33 1 1 11 14 18 116 116 68 12 5 4 2 2 9 102 17 3 10 18 8 15 67 63 15 0 14 116 80 0 2 22 96 37 28 88 43 1 4 18 116 51 5 11 32 14 8 23 44 17 12 9 0 0 19 37 85 18 16 104 22 6 2 26 12 58 67 82 25 12 2 2 25 18 8 2 19 42 48 11
$
$ # build index, default target K = 10
$ ./index sift.data sift.kgraph
Generating control...
Initializing...
iteration: 1 recall: 0 accuracy: 2.48257 cost: 0.00038 M: 10 delta: 1 time: 5.1086 one-recall: 0 one-ratio: 3.51503
iteration: 2 recall: 0.008 accuracy: 1.38832 cost: 0.000637268 M: 10 delta: 0.869008 time: 8.00719 one-recall: 0 one-ratio: 2.74464
iteration: 3 recall: 0.035 accuracy: 0.865508 cost: 0.00109259 M: 11.4989 delta: 0.846425 time: 11.7051 one-recall: 0.04 one-ratio: 2.26229
iteration: 4 recall: 0.195 accuracy: 0.473838 cost: 0.00162679 M: 11.8251 delta: 0.798 time: 15.445 one-recall: 0.18 one-ratio: 1.78477
iteration: 5 recall: 0.559 accuracy: 0.133067 cost: 0.00222684 M: 12.5367 delta: 0.65767 time: 19.3702 one-recall: 0.59 one-ratio: 1.27996
iteration: 6 recall: 0.816 accuracy: 0.0298136 cost: 0.00294478 M: 14.8595 delta: 0.359162 time: 23.7699 one-recall: 0.92 one-ratio: 1.06548
iteration: 7 recall: 0.93 accuracy: 0.008966 cost: 0.00376176 M: 19.6205 delta: 0.130743 time: 28.6716 one-recall: 0.96 one-ratio: 1.02683
iteration: 8 recall: 0.964 accuracy: 0.00221385 cost: 0.00439122 M: 23.3132 delta: 0.0428131 time: 32.75 one-recall: 0.98 one-ratio: 1.00215
iteration: 9 recall: 0.972 accuracy: 0.0016227 cost: 0.00472564 M: 25.0266 delta: 0.0156585 time: 35.5824 one-recall: 0.99 one-ratio: 1.00177
iteration: 10 recall: 0.977 accuracy: 0.00120891 cost: 0.00487024 M: 25.707 delta: 0.00632707 time: 37.4416 one-recall: 1 one-ratio: 1
iteration: 11 recall: 0.979 accuracy: 0.00111887 cost: 0.0049299 M: 25.9814 delta: 0.00262671 time: 38.7585 one-recall: 1 one-ratio: 1
1
 39.776979s wall, 394.170000s user + 1.300000s system = 395.470000s CPU (994.2%)
$
$ # search for 10-NN with brute force; save results to sift.10nn as the gold standard.
$ ./search --linear sift.data sift.kgraph sift.query sift.10nn  
 100.017852s wall, 1100.200000s user + 1.170000s system = 1101.370000s CPU (1101.2%)
Time: 100.018 Recall: 0 Cost: 1

$ # it shows Recall: 0 because we didn't provide a gold standard.
$
$ # now search with index, evaluate with sift.10nn, save results to sift.10ann.
$ ./search sift.data sift.kgraph sift.query sift.10ann --eval sift.10nn
Searching...
 0.781659s wall, 7.880000s user + 0.030000s system = 7.910000s CPU (1012.0%)
Time: 0.721185 Recall: 0.915304 Cost: 0.00167574
 

Now you can tweak the search by specifying a "K" parameter (default is 10). A larger "P" (for "probe", default is 100) can be used to improve recall; it's safe to grow P up to 10,000 or maybe even larger, but cost will increase as well. It's also practical to make the search "K" as large as tens of thousands.
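
For example, to trade time for recall in the demo above (the -K and -P flag spellings are assumptions here, named after the parameters just described; check the tool's help output to confirm):

$ ./search sift.data sift.kgraph sift.query sift.10ann --eval sift.10nn -K 10 -P 1000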

There's also an indexing parameter "K", which is independent of the search-time parameter. It is perfectly OK to serve a 10000-NN query with an index created with K = 10; the gap can be compensated for by using a large P. The index is essentially a nearest-neighbor graph, and its quality is measured by the recall of the K-NN (at the indexing K); that recall is reported after every iteration of the index-construction stage. The number of nearest neighbors actually kept for each point is somewhere between the indexing parameters K and L, depending on the local dimensionality around the point. The average number of neighbors kept is shown as the value "M" reported after each iteration (M <= L). For a difficult dataset, you might want to tune L (rather than K, which is only used for measuring recall). Here are some tips, followed by a small tuning sketch.

  • If you want high quality and are willing to sacrifice some time, make L bigger (100, or even 200), so the algorithm is allowed to spend more time and memory fighting the curse of dimensionality. It doesn't make much sense, though, to make L much bigger than the final iteration's M + 10. The real effort (the number of distance computations) is determined automatically from local geometric information, but a big L does come with some time and space overhead.
  • If time is more important, use L to put an upper limit on the scale of the computation. Space cost is always O(LN); time cost, in number of distance computations, is O(M^2 N log N), and is therefore upper-bounded by O(L^2 N log N).
  • If a dataset calls for L >= 100, its intrinsic dimensionality is probably too high for online nearest-neighbor search to be practical, and for the search results to be stable and useful. That is the complexity of the problem itself, and NO INDEXING ALGORITHM CAN SAVE YOU. The only meaningful investment is probably improving your data representation (or feature extraction).
  • You can also increase recall by using a slightly larger indexing parameter "-S", say 15 or 20.
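
A small tuning sketch for indexing (likewise, the -L flag spelling is an assumption named after the parameter discussed above; -S is from the last tip):

$ ./index sift.data sift.kgraph -L 100 -S 20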

The search result is saved as a matrix of "unsigned" values (point IDs) in the LSHKIT data format; see the LSHKIT documentation for more information about the format.
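
If you want to consume the result file from your own code, here is a minimal sketch, assuming the usual LSHKIT matrix layout (a header of three unsigned 32-bit integers: element size in bytes, number of rows, number of columns, followed by row-major data):

#include <cstdio>
#include <vector>

// Read a matrix of unsigned point IDs written in LSHKIT format.
// Sketch only: no error handling; assumes the header layout described above.
std::vector<unsigned> readResults (char const *path, unsigned &rows, unsigned &cols) {
    unsigned header[3];  // element size, #rows, #columns
    FILE *fp = fopen(path, "rb");
    fread(header, sizeof(unsigned), 3, fp);
    rows = header[1];
    cols = header[2];
    std::vector<unsigned> ids(size_t(rows) * cols);
    fread(&ids[0], header[0], ids.size(), fp);
    fclose(fp);
    return ids;
}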

Program with the KGraph Library

Suppose you have a bunch of objects of type "MyType" and you measure similarity with a function "myDist", which takes two objects and returns a smaller number when similarity is higher. Here's how you program with KGraph. C++11 users can fast-forward to the simplified API section below.

#include <vector>
#include <kgraph.h>

// the two lines below are what you already have.
class MyType;
float myDist (MyType const &o1, MyType const &o2);

// suppose your data is stored in a vector -- any container that can be indexed by an integral ID will do.
typedef std::vector<MyType> MyDataset;

// First implement the index oracle
using namespace kgraph;

class MyIndexOracle: public IndexOracle {
  MyDataset const &data;
public:
  MyIndexOracle(MyDataset const &d): data(d) {}
  virtual unsigned size () const {
    return data.size();
  }
  virtual float operator () (unsigned i, unsigned j) const {
    return myDist(data[i], data[j]);
  }
};

// And the search oracle

class MySearchOracle: public SearchOracle {
  MyDataset const &data;
  MyType const &query;
public:
  MySearchOracle(MyDataset const &d, MyType const &q): data(d), query(q) {}
  virtual unsigned size () const {
    return data.size();
  }
  virtual float operator () (unsigned i) const {
    return myDist(data[i], query);
  }
};
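
For concreteness, MyType and myDist might look like this (purely illustrative; this version assumes dense float vectors compared with squared Euclidean distance):

// hypothetical concrete types: a dense float vector with squared L2 distance
typedef std::vector<float> MyType;

float myDist (MyType const &o1, MyType const &o2) {
    // smaller value = more similar, as KGraph expects
    float sum = 0;
    for (size_t i = 0; i < o1.size(); ++i) {
        float d = o1[i] - o2[i];
        sum += d * d;
    }
    return sum;
}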
 

Now you can go ahead and build the index.


MyDataset data;

MyIndexOracle indexOracle(data);
KGraph::IndexParams params;  // safe to use the default values; see the documentation for details.

KGraph *index = KGraph::create();
index->build(indexOracle, params, NULL);

// optionally you can save the index to disk.
index->save("some_path");

delete index;
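
If you want programmatic access to the statistics printed during construction, pass an info object instead of NULL in the build call. This is a sketch only; it assumes KGraph::IndexInfo exposes fields matching the per-iteration log shown earlier (recall, M, etc.):

KGraph::IndexInfo info;           // assumption: fields mirror the iteration log
index->build(indexOracle, params, &info);
printf("final recall: %g, average degree M: %g\n", info.recall, info.M);  // needs <cstdio>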
 

And use the index to search.


MyDataset data;
MyType query;

MySearchOracle searchOracle(data, query);

unsigned K = 10;            // the number of neighbors you want
KGraph::SearchParams params;
params.K = K;
std::vector<unsigned> knn(K);    // to save K-NN ids.

KGraph *index = KGraph::create();
index->load("where you saved index");
index->search(searchOracle, params, &knn[0], NULL);

delete index;
 

Simplified API with C++11

The C++11 interface is based on function objects, which add a little bit of overhead compared to the approach above.

So long as your MyDataset type supports a "size()" method that returns the number of objects and an "operator []" that returns a reference to an object, the above code can be simplified as follows using the "VectorOracle" wrapper.

#include <vector>
#include <kgraph.h>

// the two lines below are what you already have.
class MyType;
float myDist (MyType const &o1, MyType const &o2);
typedef std::vector<MyType> MyDataset;

using namespace kgraph;

MyDataset data;

VectorOracle<MyDataset, MyType> oracle(data, myDist);
KGraph *index = KGraph::create();

{
  KGraph::IndexParams params;  // safe to use the default values; see the documentation for details.
  index->build(oracle, params, NULL);
}
// optionally you can save the index to disk.
index->save("some_path");

// use index->load("some_path"); to load a previously built index.
MyType query;

unsigned K = 10;           // the number of neighbors you want
KGraph::SearchParams params;
params.K = K;
std::vector<unsigned> knn(K);   // to save K-NN ids.

index->search(oracle.query(query), params, &knn[0], NULL);

delete index;
 

Generalized Usage of SearchOracle

Our SearchOracle interface doesn't explicitly involve the query object, so it can be used to minimize any function that is defined on the points of the graph and that can be optimized by hill-climbing the nearest-neighbor graph.
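
For example, here is a minimal sketch of an oracle that minimizes an arbitrary per-point function rather than the distance to a query (the scoring function "myScore" is hypothetical; any function mapping a point to a float will do):

// hypothetical: any function to be minimized over the dataset
float myScore (MyType const &o);

class MyFunctionOracle: public SearchOracle {
  MyDataset const &data;
public:
  MyFunctionOracle(MyDataset const &d): data(d) {}
  virtual unsigned size () const {
    return data.size();
  }
  virtual float operator () (unsigned i) const {
    // no query object involved: search() hill-climbs the graph
    // toward points where myScore is small.
    return myScore(data[i]);
  }
};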