How to understand the estimated speedup in the optimization report from the icc compiler?
Environment:
icc version 19.0.0.117 (gcc version 5.4.0 compatibility)
Intel Parallel Studio XE Cluster Edition 2019
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Ubuntu 16.04
Compiler flags:
-std=gnu11 -Wall -xHost -xCORE-AVX2 -O2 -fma -qopenmp -qopenmp-simd -qopt-report=5 -qopt-report-phase=all
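For completeness, the full compile line looks like this (the source file name is just the one that appears in the report; the flag set is exactly the one above):

    icc -std=gnu11 -Wall -xHost -xCORE-AVX2 -O2 -fma -qopenmp -qopenmp-simd \
        -qopt-report=5 -qopt-report-phase=all -c get_forces.c

With -qopt-report=5 the report is written to get_forces.optrpt next to the object file, and that is where the remarks quoted below come from.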
I use OpenMP simd (#pragma omp simd) or Intel-specific pragmas to vectorize my loops and gain a speedup. In the optimization report generated by icc, I usually see a result like the following:
LOOP BEGIN at get_forces.c(3668,3)
remark #15389: vectorization support: reference mon->fricforce[n1][d] has unaligned access [ get_forces.c(3669,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3669,36) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3669,51) ]
remark #15389: vectorization support: reference mon->drag[n1][d] has unaligned access [ get_forces.c(3671,4) ]
remark #15389: vectorization support: reference mon->vel[n1][d] has unaligned access [ get_forces.c(3671,40) ]
remark #15389: vectorization support: reference vel[n1][d] has unaligned access [ get_forces.c(3671,57) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 0.773
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 3
remark #15451: unmasked unaligned unit stride stores: 2
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 21
remark #15477: vector cost: 11.000
remark #15478: estimated potential speedup: 1.050
remark #15488: --- end vector cost summary ---
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
remark #25015: Estimate of max trip count of loop=1
LOOP END
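For context, the loop at get_forces.c(3668,3) is of roughly this shape. This is a hypothetical, self-contained sketch: the struct layout, the array sizes, the coefficient names gamma and zeta, and the trip count of 3 are my assumptions, not the actual source; only the array references match the report.

    #include <stddef.h>

    /* Hypothetical monomer layout; only the member names come from the report. */
    typedef struct {
        double fricforce[128][3];
        double vel[128][3];
        double drag[128][3];
    } monomer_t;

    void update_forces(monomer_t *mon, const double vel[][3],
                       size_t n1, double gamma, double zeta)
    {
        /* Vector length 2 together with an estimated max trip count of 1
           is consistent with a short loop over the 3 spatial components. */
    #pragma omp simd
        for (int d = 0; d < 3; d++) {
            mon->fricforce[n1][d] = -gamma * (mon->vel[n1][d] - vel[n1][d]);
            mon->drag[n1][d]      = -zeta  * (mon->vel[n1][d] - vel[n1][d]);
        }
    }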
My question is: I do not understand how the estimated potential speedup of 1.050 is calculated from these three numbers:
normalized vectorization overhead 0.773
scalar cost: 21
vector cost: 11.000
Another, more extreme and more puzzling, case is this one:
LOOP BEGIN at get_forces.c(2690,8)
<Distributed chunk3>
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,19) ]
remark #15388: vectorization support: reference q12[j] has aligned access [ get_forces.c(2694,26) ]
remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 1.857
remark #15448: unmasked aligned unit stride loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 7
remark #15477: vector cost: 3.500
remark #15478: estimated potential speedup: 0.770
remark #15488: --- end vector cost summary ---
remark #25436: completely unrolled by 3
LOOP END
Now, 3.5 + 1.857 = 5.357 < 7, so if the normalized vectorization overhead is simply added to the vector cost, the vectorized version still looks cheaper than the scalar cost of 7.
So could I still apply simd to this loop and gain a speedup, or should I trust the reported estimated speedup of 0.770 and leave it scalar?
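For reference, this is how I would override the heuristic for just this loop instead of using -vec-threshold0 globally. It is only a sketch: the reduction form, the bound of 3, and the function name are guesses based on the two q12[j] references and the "completely unrolled by 3" remark, not the real code.

    /* Force vectorization of the small q12 loop regardless of the
       cost-model estimate; everything except the q12[j] accesses
       is a placeholder. */
    double q12_norm2(const double q12[3])
    {
        double sum = 0.0;
        /* alternatively: #pragma vector always, as the report itself suggests */
    #pragma omp simd reduction(+:sum)
        for (int j = 0; j < 3; j++)
            sum += q12[j] * q12[j];
        return sum;
    }

Either pragma makes icc vectorize the loop despite its efficiency estimate; whether that is actually faster is something I would still have to measure.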
Tags: vectorization, intel, compiler-optimization, simd, icc