makeapp - 2 months ago
C++ Question

Why do the same gcc compile options behave differently on different computer architectures?

I use the following two build commands from my makefiles to compile my Gaussian blur program.


  1. g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp

  2. g++ -O3 -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp



My two testing environments are:


  • i7 4710HQ 4 cores 8 threads

  • E5 2650



However, the binary built with the first command runs about 2x as fast on the E5 but only about 0.5x as fast on the i7.
The binary built with the second command is faster on the i7 but slower on the E5.

Can anyone explain why?

This is the source code: https://github.com/makeapp007/interpolateFloatImg

I will add more details as soon as possible.

On the i7 the program runs with 8 threads.
I don't know how many threads the program spawns on the E5.

==== Update ====

I am the teammate of the original author on this project, and here are the results.

Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
Kernel kernelSize : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height 8533 = 172921245
Micro seconds: 211199093
Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':
1423026.281358 task-clock:u (msec) # 6.516 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
2,604 page-faults:u # 0.002 K/sec
4,167,572,543,807 cycles:u # 2.929 GHz (46.79%)
6,713,517,640,459 instructions:u # 1.61 insn per cycle (59.29%)
725,873,982,404 branches:u # 510.092 M/sec (57.28%)
23,468,237,735 branch-misses:u # 3.23% of all branches (56.99%)
544,480,682,764 L1-dcache-loads:u # 382.622 M/sec (37.00%)
545,000,783,842 L1-dcache-load-misses:u # 100.10% of all L1-dcache hits (31.44%)
38,696,703,292 LLC-loads:u # 27.193 M/sec (26.68%)
1,204,703,652 LLC-load-misses:u # 3.11% of all LL-cache hits (35.70%)
218.384387536 seconds time elapsed


And these are the results from the workstation:

workstation:~/mossCAP3/repos/liuyh1_liujzh/12$ perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
Kernel kernelSize : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height 8533 = 172921245
Micro seconds: 133661220
Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':
2035379.528531 task-clock (msec) # 14.485 CPUs utilized
7,370 context-switches # 0.004 K/sec
273 cpu-migrations # 0.000 K/sec
3,123 page-faults # 0.002 K/sec
5,272,393,071,699 cycles # 2.590 GHz [49.99%]
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
7,425,570,600,025 instructions # 1.41 insns per cycle [62.50%]
370,199,835,630 branches # 181.882 M/sec [62.50%]
47,444,417,555 branch-misses # 12.82% of all branches [62.50%]
591,137,049,749 L1-dcache-loads # 290.431 M/sec [62.51%]
545,926,505,523 L1-dcache-load-misses # 92.35% of all L1-dcache hits [62.51%]
38,725,975,976 LLC-loads # 19.026 M/sec [50.00%]
1,093,840,555 LLC-load-misses # 2.82% of all LL-cache hits [49.99%]
140.520016141 seconds time elapsed


==== Update ====
The specification of the E5:

workstation:~$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
20 Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
workstation:~$ dmesg | grep cache
[ 0.041489] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[ 0.047512] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
[ 0.050088] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
[ 0.050121] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
[ 0.558666] PCI: pci_cache_line_size set to 64 bytes
[ 0.918203] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
[ 1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
[ 1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
[ 1.549796] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.552711] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.552955] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Answer

Your program has a very high cache miss ratio. Is that good for the program or bad for it?

545,000,783,842 L1-dcache-load-misses:u # 100.10% of all L1-dcache hits

545,926,505,523 L1-dcache-load-misses # 92.35% of all L1-dcache hits

Cache sizes may differ between the i7 and the E5, which is one source of the difference. Others are different generated assembly code, different gcc versions, and different gcc options.
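
The question also notes that the thread count on the E5 is unknown, so it is worth ruling that out before comparing the two machines. The following is a minimal sketch, not code from the linked repository: `reportedThreadCount` is a hypothetical helper, and it assumes the program is built with -fopenmp as in both commands above. Without OpenMP it simply reports 1.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not part of the original project): returns the upper
// bound on the number of threads a default OpenMP parallel region would use.
// The value is affected by the OMP_NUM_THREADS environment variable.
int reportedThreadCount() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;  // built without -fopenmp: everything runs on one thread
#endif
}
```

Printing this value once at startup on both machines shows whether the runs being compared actually use the same number of threads.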

You should look inside the code, find the hot spot, and analyze how many pixels each instruction processes and which processing order would suit the CPU and memory better. Rewriting the hot spot (the part of the code where most of the running time is spent) is the key to solving the task http://shtech.org/course/ca/projects/3/.

You can use the perf profiler in record / report / annotate mode to find the hot spot (this is easier if you recompile the project with the -g option added):

# Profile program using cpu cycle performance counter; write profile to perf.data file
perf record ./test test_arg1 test_arg2
# Read perf.data file and report functions where time was spent 
#  (Do not change ./test file, or recompile it after record and before report)
perf report
# Find the hotspot in the top functions by annotation
#  you may use Arrows and Enter to do "annotate" action from report; or:
perf annotate -s top_function_name
perf annotate -s top_function_name > annotate_func1.txt

I was able to get a 7x speedup for the small bin file with the arguments 277 10 on my mobile i5-4* (Intel Haswell) with 2 cores (4 virtual cores with HT enabled) and AVX2+FMA.

Some loops and loop nests need to be rewritten. You should understand how the CPU cache works and what is easier for it: missing often or missing rarely. Also, gcc may be dumb and may not always detect the pattern in which the data is read; it needs to detect that pattern in order to work on several pixels in parallel.
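
To illustrate the kind of loop rewrite meant here, the sketch below computes the same vertical 1-D convolution in two loop orders. It is not taken from the linked repository: the row-major layout, the function names `blurColumnsStrided` and `blurColumnsContiguous`, and the "valid border" handling are all assumptions made for the example. Whether the real hot spot looks like this should be checked with the profiler first.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed layout: row-major float image, pixel (x, y) at index y * width + x.
// Both functions produce height - taps + 1 output rows with identical values.

// Strided order: each output pixel walks down a column, so consecutive loads
// are `width` floats apart and rarely share a cache line.
std::vector<float> blurColumnsStrided(const std::vector<float>& in,
                                      std::size_t width, std::size_t height,
                                      const std::vector<float>& weights) {
    const std::size_t taps = weights.size();
    assert(taps >= 1 && height >= taps && in.size() == width * height);
    const std::size_t outHeight = height - taps + 1;
    std::vector<float> out(width * outHeight, 0.0f);
    for (std::size_t x = 0; x < width; ++x) {
        for (std::size_t y = 0; y < outHeight; ++y) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < taps; ++k) {
                acc += weights[k] * in[(y + k) * width + x];
            }
            out[y * width + x] = acc;
        }
    }
    return out;
}

// Contiguous order: the innermost loop runs along a row, so every load and
// store touches adjacent floats and the compiler has a simple pattern it can
// vectorize (several pixels per instruction).
std::vector<float> blurColumnsContiguous(const std::vector<float>& in,
                                         std::size_t width, std::size_t height,
                                         const std::vector<float>& weights) {
    const std::size_t taps = weights.size();
    assert(taps >= 1 && height >= taps && in.size() == width * height);
    const std::size_t outHeight = height - taps + 1;
    std::vector<float> out(width * outHeight, 0.0f);
    for (std::size_t y = 0; y < outHeight; ++y) {
        float* dst = out.data() + y * width;
        for (std::size_t k = 0; k < taps; ++k) {
            const float w = weights[k];
            const float* src = in.data() + (y + k) * width;
            for (std::size_t x = 0; x < width; ++x) {
                dst[x] += w * src[x];
            }
        }
    }
    return out;
}

// Largest absolute difference between two equally sized result buffers.
float maxAbsDiff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

Both versions do the same arithmetic; only the memory access order changes. On an image as wide as the one in the question (20265 floats per row), the strided version jumps roughly 80 KB between consecutive loads, while the contiguous version reads and writes neighbouring addresses.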
