Benchmark

1. Benchmarking Fairly

Software performance usually involves several aspects. For nearest neighbor search, the relevant aspects include index construction time, index size, online search speed, and search accuracy. To fairly compare two programs, we need to tune them such that:

  • The performance is acceptable on aspects we don't care about as much, like index construction time -- as long as it finishes within a reasonable amount of time, people can usually wait for offline indexing to be done. We still need to report the number, so it can be considered when the programs are to be deployed. (Depending on the scenario, index construction time can be the most critical factor, and the benchmarking scheme has to be changed accordingly.)
  • They perform equivalently on aspects that are hard to control and that we don't care about the most, like index size: they have to produce indices of similar size. For example, we consider 4GB and 5GB to be similar sizes. It is usually not possible to make a program generate an index of exactly a given size, but it is not fair to compare a 4GB index against a 40GB one.
  • Fix all critical factors except one. For similarity search, we need to make their recall/accuracy the same, say 95%.
  • Compare the one remaining factor, the online search speed (see the sketch after this list).
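In code, the procedure looks roughly like the sketch below. This is a hedged illustration, not FLANN's or KGraph's API: search() and measure_recall() are hypothetical helpers, and the settings list stands in for whichever search-time knob each library exposes.

    # Hypothetical sketch: tune a method's search setting until it reaches the
    # target recall, then report the time it takes at that setting.
    # search() is assumed to return (elapsed_seconds, neighbor_ids);
    # measure_recall() compares the returned ids against exact ground truth.
    TARGET_RECALL = 0.95

    def time_at_recall(index, queries, ground_truth, settings):
        for s in settings:                     # sweep from cheap to expensive settings
            elapsed, results = search(index, queries, s)
            if measure_recall(results, ground_truth) >= TARGET_RECALL:
                return elapsed                 # first setting that reaches the target
        return None                            # this method cannot reach the target recall

Whichever method returns the smaller time at the same recall wins the comparison.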

The most common mistake people make (including me) when writing a paper is to benchmark an approximate search algorithm against brute force without accounting for recall. For example, if brute force finishes in 100 seconds, and the index finishes searching in 10 seconds with 50% recall (the percentage of exact nearest neighbors found, so 50% == 0.5; the recall of brute force scanning the whole dataset is always 1), they sometimes go ahead and claim that the index is 10x the speed of brute force. But that is unfair, because the recall is not the same for the two methods. We can safely assume that brute force would take about 50 seconds to scan half the dataset and achieve the same 50% recall, so the index is only about 5x the speed of brute force at a recall of 50%.
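To make the adjustment explicit, here is the same arithmetic written out in Python, using the numbers from the example above:

    # Recall-adjusted speedup for the example above: brute force takes 100 s at
    # recall 1.0, the index takes 10 s at recall 0.5.
    naive_speedup    = 100.0 / 10.0            # 10x, but it ignores the recall gap
    adjusted_speedup = (100.0 * 0.5) / 10.0    # 5x: the fair baseline is ~50 s of
                                               # brute force scanning half the dataset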

All the benchmarks on this page are performed on a desktop machine with a 6-core 3930K CPU (12 threads) and 64GB of main memory. Only SSE2 instructions are used; AVX instructions, which are not as commonly available, are not enabled.

The dataset I use is the one-million SIFT feature dataset with 10,000 queries. Each SIFT feature is represented as a 128-dimensional vector of 32-bit floating point numbers. Similarity is squared L2 distance. Recall is measured on the first 10 nearest neighbors.
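For reference, the sketch below shows one way to compute recall@10 and the squared L2 distance on this dataset; the array names are illustrative and not part of any library's API.

    import numpy as np

    # exact:  (10000, 10) array with the ids of the true 10 nearest neighbors
    #         of each query (obtained by brute force)
    # approx: (10000, 10) array with the ids returned by the index under test
    def recall_at_10(exact, approx):
        hits = sum(len(set(e) & set(a)) for e, a in zip(exact, approx))
        return hits / float(exact.size)        # fraction of true 10-NN recovered

    # Squared L2 distance used as the similarity measure (smaller = more similar).
    def l2sqr(x, y):
        return np.sum((x - y) ** 2)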

2. Brute Force

This shows the implementation-dependent speedup of KGraph's brute force over FLANN's. The difference is purely due to code quality.

I use FLANN's brute force as the baseline, and compute the speedup using

speedup = (flann_brute_force_time * index_search_recall) / (index_search_time)

All online searches are parallelized by query. The time reported is the total time spent running 10,000 queries in parallel.
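As a sanity check, plugging the numbers from the table below into the formula (both brute-force runs have recall 1.0, so the recall adjustment is a no-op here):

    def speedup(flann_brute_force_time, index_search_time, index_search_recall):
        return flann_brute_force_time * index_search_recall / index_search_time

    print(speedup(135.1, 87.3, 1.0))   # ~1.55 with these rounded timings;
                                       # the table reports 1.54x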

                   brute force, flann   brute force, kgraph
  search time/s    135.1                87.3
  speedup          1.0x                 1.54x

3. Comparison of Various Methods

Index construction time and size. (FLANN index building is not parallelized, while KGraph's is. It should be rather straightforward to parallelize FLANN index building, achieve a ~5x speedup, and beat KGraph in index building; it would also make FLANN's auto-tuning run much faster.) I've tuned the indices to have comparable sizes (for kmeans it appears there's nothing I can do to tune the size). For the same algorithm, a bigger index typically allows it to achieve higher recall at the same online search cost.

  method                   index time                index size   parameters
  kdtree                   21.7s                     138M         trees=8
  kmeans                   156.2s                    250M         default
  hierarchical             114.3s                    120M         trees=12
  kgraph                   40s (252.7s in serial)    107M         default, with --prune=1
  kgraph, level-2 pruned   112.8s                    97M          default, with --prune=2, search with -M 50

Online speed comparison. The time reported is the total amount of time spent running 10,000 queries in parallel. The curves are generated by varying the C parameter of FLANN and the P parameter of KGraph.
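The sweep that produces these curves looks roughly like the sketch below. flann_search and kgraph_search are hypothetical wrappers around each library's search call, recall_at_10 is the measurement sketched earlier, and the listed C/P values are only illustrative.

    # Sweep one search-time parameter per library and record (recall, total time)
    # at each setting; plotting the pairs gives the speed/recall curves.
    def sweep(search_fn, settings, queries, exact_knn):
        curve = []
        for s in settings:
            elapsed, results = search_fn(queries, s)   # all 10,000 queries, in parallel
            curve.append((recall_at_10(exact_knn, results), elapsed))
        return curve

    flann_curve  = sweep(flann_search,  [32, 64, 128, 256, 512], queries, exact_knn)  # C values (illustrative)
    kgraph_curve = sweep(kgraph_search, [20, 40, 80, 160],       queries, exact_knn)  # P values (illustrative)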