API

1. The Core KGraph API

1.1 Index and Search Oracles

Our definition of distance is very generic. We only make the following two assumptions:

  1. Smaller distance value means higher similarity. The distance can be any valid float number. 0.1 is smaller than 1 and -200 is smaller than -100.
  2. For the same two points, the distance computing code must always return exactly the same value.

KGraph uses user-provided "oracle" classes to compute the distance between two points. We assume that there are N points in the dataset, and the points are numbered from 0 to N-1. The user needs to subclass the following two classes in order to build index and perform online search.

namespace kgraph {
    class IndexOracle {
    public:
        virtual unsigned size () const = 0;    // returns N, the dataset size.
        virtual float operator () (unsigned i, unsigned j) const = 0;  // compute distance between objects i and j, for any i, j < N.
    };  

    class SearchOracle {
    public:
        virtual unsigned size () const = 0;    // returns N, the dataset size.
        virtual float operator () (unsigned i) const = 0;  // compute distance between the query and object i, for any i < N.
        unsigned search (unsigned K, float epsilon, unsigned *ids) const; // brutal force search, return the actual number of points found.
    };
}

Without index, the Search

1.2 Indexing and Searching

Once the oracles are ready, we can do indexing and searching. Following is the Kgraph interface.

namespace kgraph {
    class KGraph {
    public:
        struct IndexParams;  // index parameters, automatically initialized to default values.
        struct IndexInfo;    // returns index construction information.

        struct SearchParams;  // search parameters, automatically initialized to default values.
        struct SearchInfo;    // returns search information.

        static KGraph *create ();  // creates a KGraph instance.
        virtual void load (char const *path);  // load index from file.
        virtual void save (char const *path);  // save index file
        virtual void build (IndexOracle const &oracle, IndexParams const &params, IndexInfo *info);  // build index.
        virtual unsigned search (SearchOracle const &oracle, SearchParams const &params, unsigned *ids, SearchInfo *info);  // online search -- either build or load has to be invoked before conducting search.
        virtual void get_nn (unsigned id, unsigned *nns, unsigned *M, unsigned *L) const;   // obtain nearest neighbor graph infomation
}

For now, it is only important to understand that the user must set a search parameter SearchParams::K (as in K nearest neighbors), and optionally provide a buffer "ids" that can hold at least K unsigned IDs for search result. All pointers in the build and index methods can be NULL if the information is not needed, including the search result (for speed test, one might not need the actual results). But if ids is not NULL, it must be large enough to hold K IDs.

Suppose the user oracle classes "MyIndexOracle" and "MySearchOracle" are implemented, we can use the following code to do the basic K-NN search:


using namespace kgraph;

KGraph *index = KGraph::create();

{   // indexing
    MyIndexOracle oracle(dataset,...);
    KGraph::IndexParams params;          // use default value.
    index->build(oracle, params, NULL);  //
    index->save("path_to_index_file");   // optionally save the index for future use.
}

{    // searching
     MySearchOracle oracle(dataset, query, ...);
     KGraph::SearchParams params;
     vector<unsigned> results(K);
     params.K = K;
     index->search(oracle, params, &results[0], NULL);
}

delete index;

Always construct a KGraph object with the static method "KGraph::create". Never use "new KGraph" as it is an abstract class.

The "get_nn" method can be used to obtain nearest neighbors of a particular point identified by "id" in the dataset. It's nearest neighbors are returned in the buffer "nns", which must be large enough to hold IndexParam::L elements. The valid number of IDs returned is stored in *L, which might be smaller than IndexParam::L (in the current version, it is equal to IndexParams::L, but this might change in the future). *M is a number <= *L, which is the number of nearest neighbor that KGraph think is important. One can reconstruct the K-NN graph by invoking "get_nn" once for each point in the dataset.

1.3 Parameter Tuning

Following are the definition of indexing parameters.

namespace kgraph;
    class KGraph {
        struct IndexParams {
        public:
            IndexParams ();         // constructor that initialize every parameter to default value.

            unsigned iterations;    // maximal number of iterations.
            unsigned controls;      // # control points, default = 100.
            unsigned K;             // compute recall for K-NN in indexing stage, default = 10.
            float delta;            // stop when recall improvement becomes smaller than this.
            float recall;           // targeted recall.
            unsigned prune;         // prune level, default = 0, supports 0, 1 and 2.

            unsigned seed;          // default is 1998, use 0 to use time(NULL) as seed.
                                    // because the algorithm is parallelized, same seed can result in different results.

            unsigned L;             // default = 50. Good value is 30~100, must be > K. Increase when dimension of data is high.
            unsigned S;             // default = 10, slightly increase when dimension is high. shouldn't go above 20.
            unsigned R;             // undocumented, don't touch.
        };
    };
}

The indexing algroithm is iterative. We start by generating a number of control points (random sample from the dataset), and find the exact K-NN of these points by brutal force search. Then after each iteration, we measure the average recall if we use the index to perform K-NN search for the control points. We stop iteration when we reach IndexParams::recall, or the maximal number of iterations, or the recall improvement becomes smaller than IndexParams::delta.

The stopping status of indexing algorithm is returned with the following structure.

namespace kgraph;
    class KGraph {
    public:
         struct IndexInfo {
            enum StopCondition {
                ITERATION = 0,
                DELTA,
                RECALL
            } stop_condition;               # which condition leads to the stop of the iteration.
            unsigned iterations;            # number of iterations.
            float cost;                     # number of distance computations divided by N(N-1)/2.
            float recall;                   # average of recall @ K for the control points.
            float accuracy;                 # average relative error of K-NN distances. Doesn't make much sense for negative distances.
            float delta;                    # recall improvement of last iteration.
            float M;                        # guideline for setting L -- L should be larger than returned M.
        };
    };
}

We use the "relative scan rate" as a machine-independent cost measure. It is the actual number of distance computation used divided by N(N-1)/2, the total number of all possible distance computations, or the cost of a imaginary brutal-force approach.

If you don't turn off screen output, the library will print statistic data at the end of each iteration. There's a value "M" that increases in each iteration but remains smaller than IndexParams::L. The guideline for setting L is that it should not be to much bigger than M when iteration stops. Using a too big L (L > M + 10) makes the algorithm unnecessarily slow. Think M as the intrinsic dimensionality of the data, and L the maximal dimensionality the algorithm is capable of handling so as to produce good recall.

The parameters for search is simpler.

namespace kgraph;
    class KGraph {
    public:
        struct SearchParams {
            SearchParams ();     // constructor that initialize every parameter to default value.
            unsigned K;          // interested K, default = 10.
            float epsilon;       // only return points with distance <= epsilon, default 1e30.
            unsigned P;          // cost parameter.  default is 100. increase for higher recall. It's good to use 1000s, but probably not beyond 10,000.
            unsigned M;          // default is 0.  If the larger of this M and the M of a data point is used for search.
            unsigned T;          // run T independent trials with different randomization and merge the results.
            unsigned seed;       // random number seed. The same as in indexing parameters.
            unsigned init;       // default is 0.  See below for usage.
        };

        struct SearchInfo {
            float cost;          // number of distance computations divided by N -- the cost of brutal force search.
        };

        virtual unsigned search (SearchOracle const &oracle, SearchParams const &params, unsigned *ids, SearchInfo *info);
    };
}

The search process try to find the K-nearest neighbors with distance smaller than epsilon. There are two typical use cases:

  • K-NN search. Set a fixed K (typically < 100), and a huge epsilon (default should be OK.)
  • radius search. Set epsilon to radius, and set a big K.

The search function fills the buffer "ids" and returns the actual number of IDs found. The return value is always <= K. For radius search, it might be tricky to figure out a proper radius. Even with the same radius, the number of points falling within that radius can be drastically different from different query points. In order not to return too many results when too large an epsilon is used, we ask the user to still provided a K (maybe ~1000), so there can be a cutoff. It is legitimate for the user to set K = N, so the result set can be as large as the whole dataset -- at the cost of a huge cost though. The space cost of search algorithm is theoretically O(K+P).

KGraph uses a graph-waling algorithm for search. It allows the user to provide a number of starting points. So the user can use KGraph to incrementally improve the search results returned by other indexing methods. To do so, set SearchParams::init to the number of starting point, and put the IDs of these points to the buffer "ids". If no starting points are provided (default behavior), KGraph samples a number of random points to start with. Providing starting points is not recommended, as it usually takes extra costs to generate such points, and they typically don't work as well as pure random starting points (for example, candidates generated by LSH are likely to be trapped from local optimal).

1.3.1 Index Pruning

By default, KGraph saves everything that is computed during the index construction stage. The index size is therefore larger than what is actually needed for the search stage. By setting KGraph::IndexParams::prune = 1, only what's used in the search stage will be saved to the index file to save space.

Smart edge compression with higher prune levels will be supported in future releases.

2. The KGraph Data API

2.1 Matrices Handling

KGraph provides a matrix class to handle the common case where vector data are densely stored as matrices.

namespace kgraph {
    template <typename T, unsigned A = KGRAPH_MATRIX_ALIGN>
    class Matrix {
    public:
        Matrix ();                                // construct an empty matrix with 0 rows and 0 cols.                
        Matrix (unsigned row, unsigned col);
        ~Matrix ();
        unsigned size () const;                   // return rows
        unsigned dim () const;                    // return cols
        size_t step () const;                     // size of each row in number of bytes, including padding
        void resize (unsigned r, unsigned c);
        T const *operator [] (unsigned i) const;  // starting address of i-th row
        T *operator [] (unsigned i);
        void zero ();                             // set the whole matrix to 0
        void load (const std::string &path, unsigned dim, unsigned skip = 0, unsigned gap = 0); // load raw data from file
        void load_lshkit (std::string const &path); // load a file stored in LSHKIT format.
        void save_lshkit (std::string const &path); // save data in LSHKIT format.
    };
}

LSHKIT format Raw Format

AVX support is still problematic at the current release

To take full advantage of SSE/AVX instructions, the matrix data are automatically aligned to multiples of KGRAPH_MATRIX_ALIGN bytes per row with padding of 0s. Specifically, KGRAPH_MATRIX_ALIGN=32 if AVX is enabled ("-mavx" or "-march=corei7") or 16 if SSE2 is enabled ("-msse2"), or 4 otherwise.

The Matrix class manages its own memory. KGraph also provides a MatrixProxy class for read-only access to user-managed data. MatrixProxy doesn't manage it's own memory. A MatrixProxy object can be directly constructed from a OpenCV matrix or a FLANN matrix.

namespace kgraph {
    template <typename DATA_TYPE>
    class MatrixProxy {
    public:
        MatrixProxy (Matrix<DATA_TYPE> const &m);

#ifndef __AVX__
#ifdef FLANN_DATASET_H_
        MatrixProxy (flann::Matrix<DATA_TYPE> const &m);
#endif
#ifdef __OPENCV_CORE_HPP__
        MatrixProxy (cv::Mat const &m);
#endif
#endif
        unsigned size () const;
        unsigned dim () const;
        DATA_TYPE const *operator [] (unsigned i) const;
    };
}

Note that AVX instructions have a stronger requirement on the alignment of starting address than that is guaranteed by malloc or new, and there's no way for us to ensure that on user-provided data, so wrapping of OpenCV and FLANN matrices are disabled when AVX support is enabled. (SSE2 also have an alignment requirement, but that is typically guaranteed by malloc or new, so it's usually safe to assume the user data is already aligned.)

2.2 Matrix Oracles

Matrix and MatrixProxy are provided to manage dataset as a whole. The user is not encouraged to access the matrix elements. Rather, a MatrixOracle should be used to generate oracle instances for indexing and querying.


namespace kgraph {
    template <typename DATA_TYPE, typename DIST_TYPE>
    class MatrixOracle: public kgraph::IndexOracle {
    public:
        template <typename MATRIX_TYPE>
        MatrixOracle (MATRIX_TYPE const &m);

        virtual unsigned size () const;
        virtual float operator () (unsigned i, unsigned j) const;

        class SearchOracle;
        SearchOracle query (DATA_TYPE const *query) const;
    };
}
 

MatrixOracle itself is an IndexOracle, and the query method, upon accepting the address of a query vector, returns a SearchOracle. Note that it is the user's job to ensure that query is properly aligned and padded.

The template parameter DATA_TYPE is typically float, double, uint8_t, ....

The template parameter DIST_TYPE can be "kgraph::metric::l2" or "kgraph:metric::l2sqr" (square of L2 distance, essentially the same as L2). More distance metrics will be implemented in the future. SSE2 and AVX optimizations are provided for float vectors; SSE2 is provided for uint8_t (to deal with SIFT features.)

For example, the following snippet can be used to construct index for L2 distance.

using namespace kgraph;

......

Matrix<float> data;
data.load_lshkit("path...");

MatrixOracle<float, metric::l2sqr> oracle(data);

KGraph *index = KGraph::create();

KGraph::IndexParams params;

index->build(oracle, params, NULL);
index->save("path...");
delete index;

And for online search.

using namespace kgraph;

......

Matrix<float> data;
data.load_lshkit("path...");

Matrix<float> queries;
queries.load_lshkit("path...");

MatrixOracle<float, metric::l2sqr> oracle(data);

KGraph *index = KGraph::create();

index->load("path...");

KGraph::SearchParams params;

for (unsigned i = 0; i < queries.size(); ++i) {
    vector<unsigned> result(params.K);
    index->search(oracle.query(queries[i]), params, &results[0], NULL);
    // do something with search result.
}

delete index;
 

2.3 Distance Computation Routines

The following distance computation routines are also exposed. The user has to ensure that the pointers are aligned and the buffers are padded.

namespace kgraph {
    float float_l2sqr_avx (float const *t1, float const *t2, unsigned dim);
    float float_l2sqr_sse2 (float const *t1, float const *t2, unsigned dim);
    float uint8_l2sqr_sse2 (uint8_t const *t1, uint8_t const *t2, unsigned dim);
}