zoujyjs - 2 months ago
C Question

Performance hit on a bitwise operation

// nx_ and ny_ are each about 350 (a 350 * 350 image)
#define IJ_REF(_i, _j) ((_j)*nx_+(_i))
#define HAS_BIT(_v, _bit) (((_v) & (_bit)) == (_bit))

for (int ix = 0; ix < nx_; ++ix) { // 0.019s
    for (int iy = 0; iy < ny_; ++iy) { // 0.716s
        int32 r = IJ_REF(ix, iy); // 0.548s
        if (!HAS_BIT(image_[r], FLAG)) { // 3.016s
            int32 k = r * 4; // 0.242s
            pTex[k] = pTex[k + 1] = pTex[k + 2] = pTex[k + 3] = 255; // 1.591s
        }
    }
}


The assembly of the HAS_BIT line is:
(screenshot of the disassembly omitted)

I guess the and instruction is the & operation, so is it supposed to be so costly?

PS: FLAG is 0x2, so I guess the compiler did some optimization and generated a single instruction for HAS_BIT. I use VTune to profile.

Answer

The hit is not because you are using a bitwise instruction, but because that instruction reads from memory, which is a far more expensive operation than the offset computation done entirely in registers.

The problem with the code is that it does not read memory sequentially: according to IJ_REF, the image is stored row by row, but the loops traverse it column by column, so consecutive iterations of the inner loop touch addresses that are nx_ elements apart.
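To make the stride concrete, here is a small back-of-the-envelope sketch (the 4-byte pixel size and the 64-byte cache line are assumptions about a typical x86 setup, not something stated in the question):

/* With IJ_REF(_i, _j) == (_j) * nx_ + (_i) and nx_ == 350:
 *
 *   original inner loop (over iy): r grows by nx_ == 350 per iteration,
 *       i.e. 1400 bytes for 4-byte pixels, so every read of image_[r]
 *       lands on a different cache line;
 *
 *   swapped inner loop (over ix):  r grows by 1 per iteration, so about
 *       16 consecutive 4-byte pixels are served from the same 64-byte
 *       cache line before the next one has to be fetched.
 */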

You should be able to improve the performance by increasing the number of cache hits if you swap the order of your loops:

for (int iy = 0; iy < ny_; ++iy) {
    for (int ix = 0; ix < nx_; ++ix) {
        int32 r = IJ_REF(ix, iy);
        if (!HAS_BIT(image_[r], FLAG)) {
            int32 k = r * 4;
            pTex[k] = pTex[k + 1] = pTex[k + 2] = pTex[k + 3] = 255;
        }
    }
}
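If this loop still shows up as hot after the swap, a further micro-optimization is to keep a running index instead of recomputing the multiplication in IJ_REF, and to write the four texture bytes with a single 32-bit store. The sketch below is only an illustration under assumptions about the surrounding code (image_ as an array of 4-byte integer pixels, pTex as an unsigned char RGBA buffer, and a made-up fill_unflagged wrapper); it is not taken from the question:

#include <stdint.h>
#include <string.h>

/* Fill pTex with opaque white wherever the corresponding pixel of image_
 * does not have FLAG set. image_, pTex, nx_, ny_ and FLAG are the names
 * from the question; the function signature and types are assumptions. */
static void fill_unflagged(const int32_t *image_, unsigned char *pTex,
                           int nx_, int ny_, int32_t FLAG)
{
    int32_t r = 0; /* running row-major index, avoids the multiply in IJ_REF */
    for (int iy = 0; iy < ny_; ++iy) {
        for (int ix = 0; ix < nx_; ++ix, ++r) {
            if ((image_[r] & FLAG) != FLAG) {
                const uint32_t white = 0xFFFFFFFFu; /* four 255 bytes */
                /* one 32-bit store instead of four byte stores; a fixed-size
                 * memcpy like this typically compiles to a single mov */
                memcpy(&pTex[(size_t)r * 4], &white, sizeof white);
            }
        }
    }
}

Whether the merged store actually helps depends on how well the compiler already combines the four byte writes, so it is worth re-profiling with VTune after each change.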