K Graph On Wikipedia Data

Various K-NN search algorithms were benchmarked by Radim Řehůřek in his Performance Shootout of Nearest Neighbors. Radim has made his code available, so I was able to test my KGraph code on the Wikipedia dataset.

I used Radim's code to preprocess the Wikipedia dataset, generating 3566145 vectors of 500 dimensions. I then randomly picked 1000 vectors to use as queries, and used the rest as the data to be searched. (In Radim's benchmark, the queries appear not to be removed from the dataset. That is fine for other index algorithms, but since KGraph offline finds the K-NN of every vector in the dataset, keeping the queries in would amount to answering them offline.) For each query I searched for the 10-NN and measured the recall (called accuracy/precision in Radim's code/page).
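Recall here is measured the standard way: the fraction of the true 10 nearest neighbors that the index actually returns, averaged over queries. A minimal sketch of that computation (pure Python; the ID lists are placeholders, not data from this benchmark):

```python
def recall_at_k(true_ids, found_ids, k=10):
    """Average fraction of the true k-NN recovered per query.

    true_ids[i]  -- exact k-NN IDs of query i (from brute force)
    found_ids[i] -- IDs returned by the index for query i
    """
    hits = 0
    for t, f in zip(true_ids, found_ids):
        hits += len(set(t[:k]) & set(f[:k]))
    return hits / (k * len(true_ids))

# toy example: 2 queries, k=2; the second query is answered perfectly
true_ids = [[0, 1], [5, 6]]
found_ids = [[1, 3], [5, 6]]
print(recall_at_k(true_ids, found_ids, k=2))  # 0.75
```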

Here are the results.

  • # vectors in data: 3565145
  • data size: 6.7G
  • query size: 2.0M

KGraph

(Run with parameter --prune=1 to remove unnecessary data from the index.)

  • Index time: 262.7s.
  • Index size: 300M

default
  • Index search time: 0.106s
  • Index search recall: 0.778

P=500
  • Index search time: 0.302s
  • Index search recall: 0.900

P=1500
  • Index search time: 1.093s
  • Index search recall: 0.956

Update: the search algorithm has been improved. Creating the index with --prune=0 (the default) and searching with -P 200 -T 2 -M 50, I can now achieve 0.961 recall in 0.392s. The index size grows to 708M. The new search parameters T and M will be available starting from version 1.2. The issue is that this text data has a high intrinsic dimensionality, so we need to probe a larger neighborhood than the default M inferred when constructing the graph. The M actually used is the larger of the M given on the command line and the M inferred for each point (which is why the graph cannot be pruned in this setup).
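That rule for the effective neighborhood size can be restated as a one-liner (a toy sketch with made-up values, not KGraph's actual code):

```python
def effective_M(cmdline_M, inferred_M):
    """KGraph probes the larger of the user-supplied -M and the per-point
    M inferred at construction time -- so a point's inferred neighborhood
    must be kept around, which is why the graph cannot be pruned here."""
    return max(cmdline_M, inferred_M)

# with -M 50, a point whose inferred M is 70 still probes 70 neighbors;
# a point whose inferred M is 30 is widened to 50
print(effective_M(50, 70))  # 70
print(effective_M(50, 30))  # 50
```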

Update: running a level-2 prune on the original graph with "prune wiki.data wiki.kgraph wiki.kgraph_l2 --prune 2" (which takes ~250 seconds), and then searching the pruned graph "wiki.kgraph_l2" with parameters "-P 200 -T 2 -M 50", I get 0.951 recall in 0.290s. The pruned graph is only 284M. The prune program will also be available from version 1.2.

According to my benchmark, FLANN's hierarchical k-means seems to be the best algorithm when targeting high recall; its randomized KD-trees, on the other hand, are very quick to construct. So I decided to benchmark against these two algorithms. Other than the number of trees, I used default values for all parameters.

FLANN brute-force search time: 187s.
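The brute-force baseline simply scans every data vector per query, which is also how the exact ground truth for the recall numbers is obtained. The same computation on tiny random data (numpy, illustrative only; the real benchmark uses the 3.5M-vector Wikipedia set, and the distance metric here is squared Euclidean purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 500)).astype(np.float32)     # stand-in dataset
queries = rng.standard_normal((5, 500)).astype(np.float32)     # stand-in queries

# squared Euclidean distance from every query to every data vector
d2 = ((queries[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)

# exact 10-NN per query: indices of the 10 smallest distances
knn = np.argsort(d2, axis=1)[:, :10]
print(knn.shape)  # (5, 10)
```

At the benchmark's scale this scan is what costs 187s per query batch; the whole point of KGraph and FLANN is to approximate it in a fraction of that time.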

FLANN Hierarchical Kmeans

  • # trees: 12
  • Index time: 1607.5s
  • Index size: 439M

default
  • Index search time: 0.190s
  • Index search recall: 0.258

C=1000
  • Index search time: 0.332s
  • Index search recall: 0.900

C=2500
  • Index search time: 0.495s
  • Index search recall: 0.953

Hierarchical k-means is very good at steadily improving recall at near-linear cost, even above 0.9 in this case. Although its default setting was outperformed by KGraph, it catches up quickly: it reaches 0.95 recall in about half the search time of KGraph 1.1. KGraph 1.2, with the improved search algorithm and graph pruning, beats FLANN again.

FLANN KD-Trees

  • # trees: 8
  • index time: 214.7s
  • index size: 490M

default
  • Index search time: 0.081s
  • Index search recall: 0.220

C=7500
  • Index search time: 1.25s
  • Index search recall: 0.899

It is already hard for the KD-trees to reach 0.9 recall, so there was little point in pushing them to 0.95.