performance - Making C++ Eigen LU faster (my tests show 2x slower than GSL) -


i'm comparing lu decomposition/solve of eigen gsl, , find eigen on order of 2x slower -o3 optimizations on g++/osx. isolated timing of decompose , solve, find both substantially slower gsl counterparts. doing silly, or eigen not perform use case (e.g. small systems?) built eigen 3.2.8 , older gsl 1.15. test case contrived, mirrors results in nonlinear-fitting software i'm writing - eigen being anywhere 1.5x - 2x+ slower total lu/solve operation.

#define ndebug  #include "sys/time.h" #include "gsl/gsl_linalg.h" #include <eigen/lu>  // ax=b 3x3 system soln x=[8,2,3] // double avals_col[9] = { 4, 2, 2, 7, 5, 5, 7, 5, 9 };     // col major double avals_row[9] = { 4, 7, 7, 2, 5, 5, 2, 5, 9 };     // row major double bvals[9] = { 67, 41, 53 };  //----------- helpers  void print_solution( double *x, int dim, char *which ) {     printf( "%s solve x:\n", );     for( int j=0; j<3; j++ ) {         printf( "%g ", x[j] );     }     printf( "\n" ); }  struct timeval tv; struct timezone tz; double timenow() {     gettimeofday( &tv, &tz );     int _mils = tv.tv_usec/1000;     int _secs = tv.tv_sec;     return (double)_secs + ((double)_mils/1000.0); }  //-----------  void run_gsl( double *a, double *b, double *x, int dim, int reps ) {     gsl_matrix_view gsla;     gsl_vector_view gslb;     gsl_vector_view gslx;     gsl_permutation *gslp;     int sign;      gsla = gsl_matrix_view_array( a, dim, dim );     gslp = gsl_permutation_alloc( dim );     gslb = gsl_vector_view_array( b, dim );     gslx = gsl_vector_view_array( x, dim );      int err;     double t, elapsed;     t = timenow();     for( int i=0; i<reps; i++ ) {         // gsl overwrites during decompose, must copy initial each time.         memcpy( a, avals_row, sizeof(avals_row) );         err = gsl_linalg_lu_decomp( &gsla.matrix, gslp, &sign );     }     elapsed = timenow() - t;     printf( "gsl decompose (%dx) time = %g\n", reps, elapsed );      t = timenow();     for( int i=0; i<reps; i++ ) {         err = gsl_linalg_lu_solve( &gsla.matrix, gslp, &gslb.vector, &gslx.vector );     }     elapsed = timenow() - t;     printf( "gsl solve (%dx) time = %g\n", reps, elapsed );      gsl_permutation_free( gslp ); }  void run_eigen( double *a, double *b, double *x, int dim, int reps ) {     eigen::partialpivlu<eigen::matrixxd> eigena_lu;      eigen::map< eigen::matrix < double, eigen::dynamic, eigen::dynamic, eigen::colmajor > > ma( a, dim, dim );     eigen::map<eigen::matrixxd> mb( b, dim, 1 );      int err;     double t, elapsed;     t = timenow();     for( int i=0; i<reps; i++ ) {         // memcpy not necessary eigen, not overwrite in         // decompose, time directly comparable gsl.          memcpy( a, avals_col, sizeof(avals_col) );         eigena_lu.compute( ma );     }     elapsed = timenow() - t;     printf( "eigen decompose (%dx) time = %g\n", reps, elapsed );      t = timenow();     eigen::vectorxd _x;     for( int i=0; i<reps; i++ ) {          _x = eigena_lu.solve( mb );     }     elapsed = timenow() - t;     printf( "eigen solve (%dx) time = %g\n", reps, elapsed );      // copy soln passed x     for( int i=0; i<dim; i++ ) {         x[i] = _x(i);     } }  int main() {     // solve 3x3 system many times      double a[9], b[3], x[3];     int dim = 3;     int reps = 1000000;      memcpy( b, bvals, sizeof(bvals) );         // init b vector, copied multiple times in run_gsl/run_eigen      run_eigen( a, b, x, dim, reps );     print_solution( x, dim, "eigen" );        run_gsl( a, b, x, dim, reps );     print_solution( x, dim, "gsl" );      return 0; } 

this produces, example:

eigen decompose (1000000x) time = 0.242 eigen solve (1000000x) time = 0.108 eigen solve x: 8 2 3  gsl decompose (1000000x) time = 0.049 gsl solve (1000000x) time = 0.075 gsl solve x: 8 2 3  

your benchmark not fair doing copy of input matrix twice in eigen version: 1 manually through memcpy, , 1 within partialpivlu. let eigen knowns mb vector declaring map<eigen::vectord>. (gcc5,-o3,eigen3.3):

eigen decompose (1000000x) time = 0.087 eigen solve (1000000x) time = 0.036 eigen solve x: 8 2 3 gsl decompose (1000000x) time = 0.032 gsl solve (1000000x) time = 0.062 gsl solve x: 8 2 3 

moreover, eigen's partialpivlu not designed such extremely tiny matrices (see below). 3x3 matrices, better explicitly compute inverse (for matrices 4x4 usually, ok, not larger ones!). in case must fix sizes @ compile-time:

eigen::partialpivlu<eigen::matrix3d> eigena_lu; eigen::map<eigen::matrix3d> ma(avals_col); eigen::map<eigen::vector3d> mb(b); eigen::matrix3d inv; eigen::vector3d _x; double t, elapsed; t = timenow(); for( int i=0; i<reps; i++ ) {     inv = ma.inverse(); } elapsed = timenow() - t; printf( "eigen decompose (%dx) time = %g\n", reps, elapsed );  t = timenow(); for( int i=0; i<reps; i++ ) {   _x.noalias() = inv * mb; } elapsed = timenow() - t; printf( "eigen solve (%dx) time = %g\n", reps, elapsed ); 

which gives me:

eigen inverse , solve (1000000x) time = 0.0209999 eigen solve (1000000x) time = 0.000999928 

so faster.

now if try larger problem, 3000 x 3000, more 1 order of magnitude of difference in favor of eigen:

eigen decompose (1x) time = 0.411 gsl decompose (1x) time = 6.073 

this typically optimizations allows such performance large problems introduces overhead tiny matrices.


Comments