In my assembly program, I am trying to calculate the equation of (((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)











up vote
2
down vote

favorite












In my 80x86 assembly program, I am trying to calculate the equation of
(((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)...(2^n), where each even exponent is preceded by a multiplication and each odd exponent is preceded by a plus. I have code, but my result is continuously off from the desired result. When 5 is put in for n, the result should be 354, however I get 330.



Any and all advice will be appreciated.



.586
.model flat

include io.h

.stack 4096

.data
number dword ?
prompt byte "enter the power", 0
string byte 40 dup (?), 0
result byte 11 dup (?), 0
lbl_msg byte "answer", 0
bool dword ?
runtot dword ?

.code
_MainProc proc
input prompt, string, 40
atod string
push eax


call power



add esp, 4

dtoa result, eax
output lbl_msg, result

mov eax, 0
ret

_MainProc endp

power proc
push ebp
mov ebp, esp

push ecx

mov bool, 1 ;initial boolean value
mov eax, 1
mov runtot, 2 ;to keep a running total
mov ecx, [ebp + 8]

jecxz done

loop1:
add eax, eax ;power of 2
test bool, ecx ;test case for whether exp is odd/even
jnz oddexp ;if boolean is 1
add runtot, eax ;if boolean is 0
loop loop1

oddexp:
mov ebx, eax ;move eax to seperate register for multiplication
mov eax, runtot ;move existing total for multiplication
mul ebx ;multiplication of old eax to new eax/running total
loop loop1

done:
mov eax, runtot ;move final runtotal for print
pop ecx
pop ebp
ret




power endp



end









share|improve this question






















  • Your code falls through into oddexp. You need a jmp done just before oddexp:. You might have other problems too. Learn to use a debugger.
    – Jester
    Nov 10 at 0:28










  • The code automatically jumps to done once ECX is at 0. The debugger has been used thoroughly.
    – ironclad.monarc
    Nov 10 at 0:33






  • 1




    Using the code I provided, when n = 1, the result is 2. The result should be 3, and when the code is ran the result is 330. I feel like there is a factor off somewhere, but don't exactly know where I should add or adjust it.
    – ironclad.monarc
    Nov 10 at 0:55






  • 1




    I am not sure I follow your intended algorithm correctly but as I said, I would make sure eax is preserved if you want to keep the powers of two in there, and write the result of the multiplication back into runtot.
    – Jester
    Nov 10 at 1:08






  • 3




    These are powers of 2, you could (and should) just left-shift by 1. (Or for multiply by 2^n, left-shift by a count in cl, so count cl up towards n). I also don't see the point of having a static variable called bool which you seem to only use as an operand for test. You're just making your code more complicated and harder to read vs. writing test ecx,1 to check the low bit for zero / non-zero. You also don't need static storage for runtot, just use a register. 32-bit x86 has 7 registers (not including the stack pointer).
    – Peter Cordes
    Nov 10 at 2:16

















up vote
2
down vote

favorite












In my 80x86 assembly program, I am trying to calculate the equation of
(((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)...(2^n), where each even exponent is preceded by a multiplication and each odd exponent is preceded by a plus. I have code, but my result is continuously off from the desired result. When 5 is put in for n, the result should be 354, however I get 330.



Any and all advice will be appreciated.



.586
.model flat

include io.h

.stack 4096

.data
number dword ?
prompt byte "enter the power", 0
string byte 40 dup (?), 0
result byte 11 dup (?), 0
lbl_msg byte "answer", 0
bool dword ?
runtot dword ?

.code
_MainProc proc
input prompt, string, 40
atod string
push eax


call power



add esp, 4

dtoa result, eax
output lbl_msg, result

mov eax, 0
ret

_MainProc endp

power proc
push ebp
mov ebp, esp

push ecx

mov bool, 1 ;initial boolean value
mov eax, 1
mov runtot, 2 ;to keep a running total
mov ecx, [ebp + 8]

jecxz done

loop1:
add eax, eax ;power of 2
test bool, ecx ;test case for whether exp is odd/even
jnz oddexp ;if boolean is 1
add runtot, eax ;if boolean is 0
loop loop1

oddexp:
mov ebx, eax ;move eax to seperate register for multiplication
mov eax, runtot ;move existing total for multiplication
mul ebx ;multiplication of old eax to new eax/running total
loop loop1

done:
mov eax, runtot ;move final runtotal for print
pop ecx
pop ebp
ret




power endp



end









share|improve this question






















  • Your code falls through into oddexp. You need a jmp done just before oddexp:. You might have other problems too. Learn to use a debugger.
    – Jester
    Nov 10 at 0:28










  • The code automatically jumps to done once ECX is at 0. The debugger has been used thoroughly.
    – ironclad.monarc
    Nov 10 at 0:33






  • 1




    Using the code I provided, when n = 1, the result is 2. The result should be 3, and when the code is ran the result is 330. I feel like there is a factor off somewhere, but don't exactly know where I should add or adjust it.
    – ironclad.monarc
    Nov 10 at 0:55






  • 1




    I am not sure I follow your intended algorithm correctly but as I said, I would make sure eax is preserved if you want to keep the powers of two in there, and write the result of the multiplication back into runtot.
    – Jester
    Nov 10 at 1:08






  • 3




    These are powers of 2, you could (and should) just left-shift by 1. (Or for multiply by 2^n, left-shift by a count in cl, so count cl up towards n). I also don't see the point of having a static variable called bool which you seem to only use as an operand for test. You're just making your code more complicated and harder to read vs. writing test ecx,1 to check the low bit for zero / non-zero. You also don't need static storage for runtot, just use a register. 32-bit x86 has 7 registers (not including the stack pointer).
    – Peter Cordes
    Nov 10 at 2:16















up vote
2
down vote

favorite









up vote
2
down vote

favorite











In my 80x86 assembly program, I am trying to calculate the equation of
(((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)...(2^n), where each even exponent is preceded by a multiplication and each odd exponent is preceded by a plus. I have code, but my result is continuously off from the desired result. When 5 is put in for n, the result should be 354, however I get 330.



Any and all advice will be appreciated.



.586
.model flat

include io.h

.stack 4096

.data
number dword ?
prompt byte "enter the power", 0
string byte 40 dup (?), 0
result byte 11 dup (?), 0
lbl_msg byte "answer", 0
bool dword ?
runtot dword ?

.code
_MainProc proc
input prompt, string, 40
atod string
push eax


call power



add esp, 4

dtoa result, eax
output lbl_msg, result

mov eax, 0
ret

_MainProc endp

power proc
push ebp
mov ebp, esp

push ecx

mov bool, 1 ;initial boolean value
mov eax, 1
mov runtot, 2 ;to keep a running total
mov ecx, [ebp + 8]

jecxz done

loop1:
add eax, eax ;power of 2
test bool, ecx ;test case for whether exp is odd/even
jnz oddexp ;if boolean is 1
add runtot, eax ;if boolean is 0
loop loop1

oddexp:
mov ebx, eax ;move eax to seperate register for multiplication
mov eax, runtot ;move existing total for multiplication
mul ebx ;multiplication of old eax to new eax/running total
loop loop1

done:
mov eax, runtot ;move final runtotal for print
pop ecx
pop ebp
ret




power endp



end









share|improve this question













In my 80x86 assembly program, I am trying to calculate the equation of
(((((2^0 + 2^1) * 2^2) + 2^3) * 2^4) + 2^5)...(2^n), where each even exponent is preceded by a multiplication and each odd exponent is preceded by a plus. I have code, but my result is continuously off from the desired result. When 5 is put in for n, the result should be 354, however I get 330.



Any and all advice will be appreciated.



.586
.model flat

include io.h

.stack 4096

.data
number dword ?
prompt byte "enter the power", 0
string byte 40 dup (?), 0
result byte 11 dup (?), 0
lbl_msg byte "answer", 0
bool dword ?
runtot dword ?

.code
_MainProc proc
input prompt, string, 40
atod string
push eax


call power



add esp, 4

dtoa result, eax
output lbl_msg, result

mov eax, 0
ret

_MainProc endp

power proc
push ebp
mov ebp, esp

push ecx

mov bool, 1 ;initial boolean value
mov eax, 1
mov runtot, 2 ;to keep a running total
mov ecx, [ebp + 8]

jecxz done

loop1:
add eax, eax ;power of 2
test bool, ecx ;test case for whether exp is odd/even
jnz oddexp ;if boolean is 1
add runtot, eax ;if boolean is 0
loop loop1

oddexp:
mov ebx, eax ;move eax to seperate register for multiplication
mov eax, runtot ;move existing total for multiplication
mul ebx ;multiplication of old eax to new eax/running total
loop loop1

done:
mov eax, runtot ;move final runtotal for print
pop ecx
pop ebp
ret




power endp



end






math assembly x86






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 10 at 0:23









ironclad.monarc

132




132












  • Your code falls through into oddexp. You need a jmp done just before oddexp:. You might have other problems too. Learn to use a debugger.
    – Jester
    Nov 10 at 0:28










  • The code automatically jumps to done once ECX is at 0. The debugger has been used thoroughly.
    – ironclad.monarc
    Nov 10 at 0:33






  • 1




    Using the code I provided, when n = 1, the result is 2. The result should be 3, and when the code is ran the result is 330. I feel like there is a factor off somewhere, but don't exactly know where I should add or adjust it.
    – ironclad.monarc
    Nov 10 at 0:55






  • 1




    I am not sure I follow your intended algorithm correctly but as I said, I would make sure eax is preserved if you want to keep the powers of two in there, and write the result of the multiplication back into runtot.
    – Jester
    Nov 10 at 1:08






  • 3




    These are powers of 2, you could (and should) just left-shift by 1. (Or for multiply by 2^n, left-shift by a count in cl, so count cl up towards n). I also don't see the point of having a static variable called bool which you seem to only use as an operand for test. You're just making your code more complicated and harder to read vs. writing test ecx,1 to check the low bit for zero / non-zero. You also don't need static storage for runtot, just use a register. 32-bit x86 has 7 registers (not including the stack pointer).
    – Peter Cordes
    Nov 10 at 2:16




















  • Your code falls through into oddexp. You need a jmp done just before oddexp:. You might have other problems too. Learn to use a debugger.
    – Jester
    Nov 10 at 0:28










  • The code automatically jumps to done once ECX is at 0. The debugger has been used thoroughly.
    – ironclad.monarc
    Nov 10 at 0:33






  • 1




    Using the code I provided, when n = 1, the result is 2. The result should be 3, and when the code is ran the result is 330. I feel like there is a factor off somewhere, but don't exactly know where I should add or adjust it.
    – ironclad.monarc
    Nov 10 at 0:55






  • 1




    I am not sure I follow your intended algorithm correctly but as I said, I would make sure eax is preserved if you want to keep the powers of two in there, and write the result of the multiplication back into runtot.
    – Jester
    Nov 10 at 1:08






  • 3




    These are powers of 2, you could (and should) just left-shift by 1. (Or for multiply by 2^n, left-shift by a count in cl, so count cl up towards n). I also don't see the point of having a static variable called bool which you seem to only use as an operand for test. You're just making your code more complicated and harder to read vs. writing test ecx,1 to check the low bit for zero / non-zero. You also don't need static storage for runtot, just use a register. 32-bit x86 has 7 registers (not including the stack pointer).
    – Peter Cordes
    Nov 10 at 2:16


















Your code falls through into oddexp. You need a jmp done just before oddexp:. You might have other problems too. Learn to use a debugger.
– Jester
Nov 10 at 0:28




Your code falls through into oddexp. You need a jmp done just before oddexp:. You might have other problems too. Learn to use a debugger.
– Jester
Nov 10 at 0:28












The code automatically jumps to done once ECX is at 0. The debugger has been used thoroughly.
– ironclad.monarc
Nov 10 at 0:33




The code automatically jumps to done once ECX is at 0. The debugger has been used thoroughly.
– ironclad.monarc
Nov 10 at 0:33




1




1




Using the code I provided, when n = 1, the result is 2. The result should be 3, and when the code is ran the result is 330. I feel like there is a factor off somewhere, but don't exactly know where I should add or adjust it.
– ironclad.monarc
Nov 10 at 0:55




Using the code I provided, when n = 1, the result is 2. The result should be 3, and when the code is ran the result is 330. I feel like there is a factor off somewhere, but don't exactly know where I should add or adjust it.
– ironclad.monarc
Nov 10 at 0:55




1




1




I am not sure I follow your intended algorithm correctly but as I said, I would make sure eax is preserved if you want to keep the powers of two in there, and write the result of the multiplication back into runtot.
– Jester
Nov 10 at 1:08




I am not sure I follow your intended algorithm correctly but as I said, I would make sure eax is preserved if you want to keep the powers of two in there, and write the result of the multiplication back into runtot.
– Jester
Nov 10 at 1:08




3




3




These are powers of 2, you could (and should) just left-shift by 1. (Or for multiply by 2^n, left-shift by a count in cl, so count cl up towards n). I also don't see the point of having a static variable called bool which you seem to only use as an operand for test. You're just making your code more complicated and harder to read vs. writing test ecx,1 to check the low bit for zero / non-zero. You also don't need static storage for runtot, just use a register. 32-bit x86 has 7 registers (not including the stack pointer).
– Peter Cordes
Nov 10 at 2:16






These are powers of 2, you could (and should) just left-shift by 1. (Or for multiply by 2^n, left-shift by a count in cl, so count cl up towards n). I also don't see the point of having a static variable called bool which you seem to only use as an operand for test. You're just making your code more complicated and harder to read vs. writing test ecx,1 to check the low bit for zero / non-zero. You also don't need static storage for runtot, just use a register. 32-bit x86 has 7 registers (not including the stack pointer).
– Peter Cordes
Nov 10 at 2:16














1 Answer
1






active

oldest

votes

















up vote
5
down vote



accepted










You're overcomplicating your code with static variables and branching.



These are powers of 2, you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.



add eax,eax is the best way to multiply by 2 (aka left shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that left-shifted by 1 after an even iteration.



If you were trying to make a 2^i variable (with a strength reduction optimization to shift by 1 every iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.



As Jester points out, if the first loop loop1 falls through, it will fall through into oddexp:. When you're doing loop tail duplication, make sure you consider where fall-through will go from each tail if the loop does end there.





There's also no point in having a static variable called bool which holds a 1, which you only use as an operand for test. That implies to human readers that the mask sometimes needs to change; test ecx,1 is a lot clearer as a way to check the low bit for zero / non-zero.



You also don't need static storage for runtot, just use a register (like EAX where you want the result eventually anyway). 32-bit x86 has 7 registers (not including the stack pointer).





This is how I'd do it. Untested, but I simplified a lot by unrolling by 2. Then the test for odd/even goes away because that alternating pattern is hard-coded into the loop structure.



We increment and compare/branch twice in the loop, so unrolling didn't get rid of the loop overhead, just changed one of the loop branches into an an if() break that can leave the loop from the middle.



This is not the most efficient way to write this; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n, and leaving the loop if there are less than 2 steps left. (Then sort it out in the epilogue)



;; UNTESTED
power proc ; fastcall calling convention: arg: ECX = unsigned int n
; clobbers: ECX, EDX
; returns: EAX

push ebx ; save a call-preserved register for scratch space

mov eax, 1 ; EAX = 2^0 running total / return value
test ecx,ecx
jz done

mov edx, ecx ; EDX = n
mov ecx, 1 ; ECX = i=1..n loop counter and shift count

loop1: ; do{ // unrolled by 2
; add 2^odd power
mov ebx, 1
shl ebx, cl ; 2^i ; xor ebx, ebx; bts ebx, ecx
add eax, ebx ; total += 2^i

inc ecx
cmp ecx, edx
jae done ; if (++i >= n) break;

; multiply by 2^even power
shl eax, cl ; total <<= i; // same as total *= (1<<i)

inc ecx ; ++i
cmp ecx, edx
jb loop1 ; }while(i<n);

done:
pop ebx
ret


I didn't check if the adding-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (setting bit i). Effectively an OR instead of an ADD, but those are equivalent as long as the bit was previously cleared.



To make the asm look more like the source and avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of a more-efficient-on-Intel xor ebx,ebx / bts ebx, ecx. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: flags have to be untouched if count=0). But that's worse on AMD Ryzen, where bts reg,reg is 2 uops but shl reg,cl is 1.





Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit for that case. But optimizations are possible with more branching.



Using calc:



; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
shiftadd_power(n) defined
; base2(2)

; shiftadd_power(0)
1 /* 1 */
...


The first few outputs are:



n          shiftadd(n) (base2)

0 1
1 11
2 1100
3 10100 ; 1100 + 1000 carries
4 101000000
5 101100000 ; 101000000 + 100000 set a bit that was previously 0
6 101100000000000
7 101100010000000 ; increasing amounts of trailing zero around the bit being flipped by ADD


Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.



Instead of just peeling them, we can just hard-code the starting point for i=3 for larger n, and optimize the code that figures out a return value for the n<3 case. I came up with a branchless formula for that based on right-shifting the 0b1100 bit-pattern by 3, 2, or 0.



Also note that for n>=18, the last shift count is strictly greater than half the width of the register, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.



For n=14..17, there are at most 2 bits set. Starting with result=0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)



;; UNTESTED
;; special cases for n<3, and for n>=18
;; enabling an optimization in the main loop (BTS instead of add)
;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
power_optimized proc
; fastcall calling convention: arg: ECX = unsigned int n <= 31
; clobbers: ECX, EDX
; returns: EAX

mov eax, 14h ; 0b10100 = power(3)
cmp ecx, 3
ja n_gt_3 ; goto main loop or fall through to hard-coded low n
je early_ret
;; n=0, 1, or 2 => 1, 3, 12 (0b1, 0b11, 0b1100)

mov eax, 0ch ; 0b1100 to be right-shifted by 3, 2, or 0
cmp ecx, 1 ; count=0,1,2 => CF,ZF,neither flag set
setbe cl ; count=0,1,2 => cl=1,1,0
adc cl, cl ; 3,2,0 (cl = cl+cl + (count<1) )
shr eax, cl
early_ret:
ret

large_n: ; odd n: result = 1<<n. even n: result = 0
mov eax, ecx
and eax, 1 ; n&1
shl eax, cl ; n>31 will wrap the shift count so this "fails"
ret ; if you need to return 0 for all n>31, add another check

n_gt_3:
;; eax = running total for i=3 already
cmp ecx, 18
jae large_n

mov edx, ecx ; EDX = n
mov ecx, 4 ; ECX = i=4..n loop counter and shift count

loop1: ; do{ // unrolled by 2
; multiply by 2^even power
shl eax, cl ; total <<= i; // same as total *= (1<<i)

inc edx
cmp ecx, edx
jae done ; if (++i >= n) break;

; add 2^odd power. i>3 so it won't already be set (thus no carry)
bts eax, edx ; total |= 1<<i;

inc ecx ; ++i
cmp ecx, edx
jb loop1 ; }while(i<n);

done:
ret


By using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. So that's a minor bonus saving.



Notice that this time the main loop is entered with i=4, which is even, instead of i=1. So I swapped the add vs. shift.



I still didn't get around to pulling the cmp/jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set the loop-exit condition, but would require a check to not run the loop at all for i=4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, not creating a worse bottleneck than the loop-carried dep chains (through eax and ecx). But branch-prediction will be different, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.






share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53234921%2fin-my-assembly-program-i-am-trying-to-calculate-the-equation-of-20-21%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    5
    down vote



    accepted










    You're overcomplicating your code with static variables and branching.



    These are powers of 2, you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.



    add eax,eax is the best way to multiply by 2 (aka left shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that left-shifted by 1 after an even iteration.



    If you were trying to make a 2^i variable (with a strength reduction optimization to shift by 1 every iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.



    As Jester points out, if the first loop loop1 falls through, it will fall through into oddexp:. When you're doing loop tail duplication, make sure you consider where fall-through will go from each tail if the loop does end there.





    There's also no point in having a static variable called bool which holds a 1, which you only use as an operand for test. That implies to human readers that the mask sometimes needs to change; test ecx,1 is a lot clearer as a way to check the low bit for zero / non-zero.



    You also don't need static storage for runtot, just use a register (like EAX where you want the result eventually anyway). 32-bit x86 has 7 registers (not including the stack pointer).





    This is how I'd do it. Untested, but I simplified a lot by unrolling by 2. Then the test for odd/even goes away because that alternating pattern is hard-coded into the loop structure.



    We increment and compare/branch twice in the loop, so unrolling didn't get rid of the loop overhead, just changed one of the loop branches into an an if() break that can leave the loop from the middle.



    This is not the most efficient way to write this; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n, and leaving the loop if there are less than 2 steps left. (Then sort it out in the epilogue)



    ;; UNTESTED
    power proc ; fastcall calling convention: arg: ECX = unsigned int n
    ; clobbers: ECX, EDX
    ; returns: EAX

    push ebx ; save a call-preserved register for scratch space

    mov eax, 1 ; EAX = 2^0 running total / return value
    test ecx,ecx
    jz done

    mov edx, ecx ; EDX = n
    mov ecx, 1 ; ECX = i=1..n loop counter and shift count

    loop1: ; do{ // unrolled by 2
    ; add 2^odd power
    mov ebx, 1
    shl ebx, cl ; 2^i ; xor ebx, ebx; bts ebx, ecx
    add eax, ebx ; total += 2^i

    inc ecx
    cmp ecx, edx
    jae done ; if (++i >= n) break;

    ; multiply by 2^even power
    shl eax, cl ; total <<= i; // same as total *= (1<<i)

    inc ecx ; ++i
    cmp ecx, edx
    jb loop1 ; }while(i<n);

    done:
    pop ebx
    ret


    I didn't check if the adding-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (setting bit i). Effectively an OR instead of an ADD, but those are equivalent as long as the bit was previously cleared.



    To make the asm look more like the source and avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of a more-efficient-on-Intel xor ebx,ebx / bts ebx, ecx. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: flags have to be untouched if count=0). But that's worse on AMD Ryzen, where bts reg,reg is 2 uops but shl reg,cl is 1.





    Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit for that case. But optimizations are possible with more branching.



    Using calc:



    ; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
    shiftadd_power(n) defined
    ; base2(2)

    ; shiftadd_power(0)
    1 /* 1 */
    ...


    The first few outputs are:



    n          shiftadd(n) (base2)

    0 1
    1 11
    2 1100
    3 10100 ; 1100 + 1000 carries
    4 101000000
    5 101100000 ; 101000000 + 100000 set a bit that was previously 0
    6 101100000000000
    7 101100010000000 ; increasing amounts of trailing zero around the bit being flipped by ADD


    Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.



    Instead of just peeling them, we can just hard-code the starting point for i=3 for larger n, and optimize the code that figures out a return value for the n<3 case. I came up with a branchless formula for that based on right-shifting the 0b1100 bit-pattern by 3, 2, or 0.



    Also note that for n>=18, the last shift count is strictly greater than half the width of the register, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.



    For n=14..17, there are at most 2 bits set. Starting with result=0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)



    ;; UNTESTED
    ;; special cases for n<3, and for n>=18
    ;; enabling an optimization in the main loop (BTS instead of add)
    ;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
    power_optimized proc
    ; fastcall calling convention: arg: ECX = unsigned int n <= 31
    ; clobbers: ECX, EDX
    ; returns: EAX

    mov eax, 14h ; 0b10100 = power(3)
    cmp ecx, 3
    ja n_gt_3 ; goto main loop or fall through to hard-coded low n
    je early_ret
    ;; n=0, 1, or 2 => 1, 3, 12 (0b1, 0b11, 0b1100)

    mov eax, 0ch ; 0b1100 to be right-shifted by 3, 2, or 0
    cmp ecx, 1 ; count=0,1,2 => CF,ZF,neither flag set
    setbe cl ; count=0,1,2 => cl=1,1,0
    adc cl, cl ; 3,2,0 (cl = cl+cl + (count<1) )
    shr eax, cl
    early_ret:
    ret

    large_n: ; odd n: result = 1<<n. even n: result = 0
    mov eax, ecx
    and eax, 1 ; n&1
    shl eax, cl ; n>31 will wrap the shift count so this "fails"
    ret ; if you need to return 0 for all n>31, add another check

    n_gt_3:
    ;; eax = running total for i=3 already
    cmp ecx, 18
    jae large_n

    mov edx, ecx ; EDX = n
    mov ecx, 4 ; ECX = i=4..n loop counter and shift count

    loop1: ; do{ // unrolled by 2
    ; multiply by 2^even power
    shl eax, cl ; total <<= i; // same as total *= (1<<i)

    inc edx
    cmp ecx, edx
    jae done ; if (++i >= n) break;

    ; add 2^odd power. i>3 so it won't already be set (thus no carry)
    bts eax, edx ; total |= 1<<i;

    inc ecx ; ++i
    cmp ecx, edx
    jb loop1 ; }while(i<n);

    done:
    ret


    By using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. So that's a minor bonus saving.



    Notice that this time the main loop is entered with i=4, which is even, instead of i=1. So I swapped the add vs. shift.



    I still didn't get around to pulling the cmp/jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set the loop-exit condition, but would require a check to not run the loop at all for i=4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, not creating a worse bottleneck than the loop-carried dep chains (through eax and ecx). But branch-prediction will be different, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.






    share|improve this answer



























      up vote
      5
      down vote



      accepted










      You're overcomplicating your code with static variables and branching.



      These are powers of 2, you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.



      add eax,eax is the best way to multiply by 2 (aka left shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that left-shifted by 1 after an even iteration.



      If you were trying to make a 2^i variable (with a strength reduction optimization to shift by 1 every iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.



      As Jester points out, if the first loop loop1 falls through, it will fall through into oddexp:. When you're doing loop tail duplication, make sure you consider where fall-through will go from each tail if the loop does end there.





      There's also no point in having a static variable called bool which holds a 1, which you only use as an operand for test. That implies to human readers that the mask sometimes needs to change; test ecx,1 is a lot clearer as a way to check the low bit for zero / non-zero.



      You also don't need static storage for runtot, just use a register (like EAX where you want the result eventually anyway). 32-bit x86 has 7 registers (not including the stack pointer).





      This is how I'd do it. Untested, but I simplified a lot by unrolling by 2. Then the test for odd/even goes away because that alternating pattern is hard-coded into the loop structure.



      We increment and compare/branch twice in the loop, so unrolling didn't get rid of the loop overhead, just changed one of the loop branches into an an if() break that can leave the loop from the middle.



      This is not the most efficient way to write this; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n, and leaving the loop if there are less than 2 steps left. (Then sort it out in the epilogue)



      ;; UNTESTED
      power proc ; fastcall calling convention: arg: ECX = unsigned int n
      ; clobbers: ECX, EDX
      ; returns: EAX

      push ebx ; save a call-preserved register for scratch space

      mov eax, 1 ; EAX = 2^0 running total / return value
      test ecx,ecx
      jz done

      mov edx, ecx ; EDX = n
      mov ecx, 1 ; ECX = i=1..n loop counter and shift count

      loop1: ; do{ // unrolled by 2
      ; add 2^odd power
      mov ebx, 1
      shl ebx, cl ; 2^i ; xor ebx, ebx; bts ebx, ecx
      add eax, ebx ; total += 2^i

      inc ecx
      cmp ecx, edx
      jae done ; if (++i >= n) break;

      ; multiply by 2^even power
      shl eax, cl ; total <<= i; // same as total *= (1<<i)

      inc ecx ; ++i
      cmp ecx, edx
      jb loop1 ; }while(i<n);

      done:
      pop ebx
      ret


      I didn't check if the adding-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (setting bit i). Effectively an OR instead of an ADD, but those are equivalent as long as the bit was previously cleared.



      To make the asm look more like the source and avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of a more-efficient-on-Intel xor ebx,ebx / bts ebx, ecx. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: flags have to be untouched if count=0). But that's worse on AMD Ryzen, where bts reg,reg is 2 uops but shl reg,cl is 1.





      Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit for that case. But optimizations are possible with more branching.



      Using calc:



      ; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
      shiftadd_power(n) defined
      ; base2(2)

      ; shiftadd_power(0)
      1 /* 1 */
      ...


      The first few outputs are:



      n          shiftadd(n) (base2)

      0 1
      1 11
      2 1100
      3 10100 ; 1100 + 1000 carries
      4 101000000
      5 101100000 ; 101000000 + 100000 set a bit that was previously 0
      6 101100000000000
      7 101100010000000 ; increasing amounts of trailing zero around the bit being flipped by ADD


      Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.



      Instead of just peeling them, we can just hard-code the starting point for i=3 for larger n, and optimize the code that figures out a return value for the n<3 case. I came up with a branchless formula for that based on right-shifting the 0b1100 bit-pattern by 3, 2, or 0.



      Also note that for n>=18, the last shift count is strictly greater than half the width of the register, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.



      For n=14..17, there are at most 2 bits set. Starting with result=0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)



      ;; UNTESTED
      ;; special cases for n<3, and for n>=18
      ;; enabling an optimization in the main loop (BTS instead of add)
      ;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
      power_optimized proc
      ; fastcall calling convention: arg: ECX = unsigned int n <= 31
      ; clobbers: ECX, EDX
      ; returns: EAX

      mov eax, 14h ; 0b10100 = power(3)
      cmp ecx, 3
      ja n_gt_3 ; goto main loop or fall through to hard-coded low n
      je early_ret
      ;; n=0, 1, or 2 => 1, 3, 12 (0b1, 0b11, 0b1100)

      mov eax, 0ch ; 0b1100 to be right-shifted by 3, 2, or 0
      cmp ecx, 1 ; count=0,1,2 => CF,ZF,neither flag set
      setbe cl ; count=0,1,2 => cl=1,1,0
      adc cl, cl ; 3,2,0 (cl = cl+cl + (count<1) )
      shr eax, cl
      early_ret:
      ret

      large_n: ; odd n: result = 1<<n. even n: result = 0
      mov eax, ecx
      and eax, 1 ; n&1
      shl eax, cl ; n>31 will wrap the shift count so this "fails"
      ret ; if you need to return 0 for all n>31, add another check

      n_gt_3:
      ;; eax = running total for i=3 already
      cmp ecx, 18
      jae large_n

      mov edx, ecx ; EDX = n
      mov ecx, 4 ; ECX = i=4..n loop counter and shift count

      loop1: ; do{ // unrolled by 2
      ; multiply by 2^even power
      shl eax, cl ; total <<= i; // same as total *= (1<<i)

      inc edx
      cmp ecx, edx
      jae done ; if (++i >= n) break;

      ; add 2^odd power. i>3 so it won't already be set (thus no carry)
      bts eax, edx ; total |= 1<<i;

      inc ecx ; ++i
      cmp ecx, edx
      jb loop1 ; }while(i<n);

      done:
      ret


      By using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. So that's a minor bonus saving.



      Notice that this time the main loop is entered with i=4, which is even, instead of i=1. So I swapped the add vs. shift.



      I still didn't get around to pulling the cmp/jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set the loop-exit condition, but would require a check to not run the loop at all for i=4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, not creating a worse bottleneck than the loop-carried dep chains (through eax and ecx). But branch-prediction will be different, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.






      share|improve this answer

























        up vote
        5
        down vote



        accepted







        up vote
        5
        down vote



        accepted






        You're overcomplicating your code with static variables and branching.



        These are powers of 2, you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.



        add eax,eax is the best way to multiply by 2 (aka left shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that left-shifted by 1 after an even iteration.



        If you were trying to make a 2^i variable (with a strength reduction optimization to shift by 1 every iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.



        As Jester points out, if the first loop loop1 falls through, it will fall through into oddexp:. When you're doing loop tail duplication, make sure you consider where fall-through will go from each tail if the loop does end there.





        There's also no point in having a static variable called bool which holds a 1, which you only use as an operand for test. That implies to human readers that the mask sometimes needs to change; test ecx,1 is a lot clearer as a way to check the low bit for zero / non-zero.



        You also don't need static storage for runtot, just use a register (like EAX where you want the result eventually anyway). 32-bit x86 has 7 registers (not including the stack pointer).





        This is how I'd do it. Untested, but I simplified a lot by unrolling by 2. Then the test for odd/even goes away because that alternating pattern is hard-coded into the loop structure.



        We increment and compare/branch twice in the loop, so unrolling didn't get rid of the loop overhead, just changed one of the loop branches into an an if() break that can leave the loop from the middle.



        This is not the most efficient way to write this; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n, and leaving the loop if there are less than 2 steps left. (Then sort it out in the epilogue)



        ;; UNTESTED
        power proc ; fastcall calling convention: arg: ECX = unsigned int n
        ; clobbers: ECX, EDX
        ; returns: EAX

        push ebx ; save a call-preserved register for scratch space

        mov eax, 1 ; EAX = 2^0 running total / return value
        test ecx,ecx
        jz done

        mov edx, ecx ; EDX = n
        mov ecx, 1 ; ECX = i=1..n loop counter and shift count

        loop1: ; do{ // unrolled by 2
        ; add 2^odd power
        mov ebx, 1
        shl ebx, cl ; 2^i ; xor ebx, ebx; bts ebx, ecx
        add eax, ebx ; total += 2^i

        inc ecx
        cmp ecx, edx
        jae done ; if (++i >= n) break;

        ; multiply by 2^even power
        shl eax, cl ; total <<= i; // same as total *= (1<<i)

        inc ecx ; ++i
        cmp ecx, edx
        jb loop1 ; }while(i<n);

        done:
        pop ebx
        ret


        I didn't check if the adding-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (setting bit i). Effectively an OR instead of an ADD, but those are equivalent as long as the bit was previously cleared.



        To make the asm look more like the source and avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of a more-efficient-on-Intel xor ebx,ebx / bts ebx, ecx. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: flags have to be untouched if count=0). But that's worse on AMD Ryzen, where bts reg,reg is 2 uops but shl reg,cl is 1.





        Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit for that case. But optimizations are possible with more branching.



        Using calc:



        ; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
        shiftadd_power(n) defined
        ; base2(2)

        ; shiftadd_power(0)
        1 /* 1 */
        ...


        The first few outputs are:



        n          shiftadd(n) (base2)

        0 1
        1 11
        2 1100
        3 10100 ; 1100 + 1000 carries
        4 101000000
        5 101100000 ; 101000000 + 100000 set a bit that was previously 0
        6 101100000000000
        7 101100010000000 ; increasing amounts of trailing zero around the bit being flipped by ADD


        Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.



        Instead of just peeling them, we can just hard-code the starting point for i=3 for larger n, and optimize the code that figures out a return value for the n<3 case. I came up with a branchless formula for that based on right-shifting the 0b1100 bit-pattern by 3, 2, or 0.



        Also note that for n>=18, the last shift count is strictly greater than half the width of the register, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.



        For n=14..17, there are at most 2 bits set. Starting with result=0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)



        ;; UNTESTED
        ;; special cases for n<3, and for n>=18
        ;; enabling an optimization in the main loop (BTS instead of add)
        ;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
        power_optimized proc
        ; fastcall calling convention: arg: ECX = unsigned int n <= 31
        ; clobbers: ECX, EDX
        ; returns: EAX

        mov eax, 14h ; 0b10100 = power(3)
        cmp ecx, 3
        ja n_gt_3 ; goto main loop or fall through to hard-coded low n
        je early_ret
        ;; n=0, 1, or 2 => 1, 3, 12 (0b1, 0b11, 0b1100)

        mov eax, 0ch ; 0b1100 to be right-shifted by 3, 2, or 0
        cmp ecx, 1 ; count=0,1,2 => CF,ZF,neither flag set
        setbe cl ; count=0,1,2 => cl=1,1,0
        adc cl, cl ; 3,2,0 (cl = cl+cl + (count<1) )
        shr eax, cl
        early_ret:
        ret

        large_n: ; odd n: result = 1<<n. even n: result = 0
        mov eax, ecx
        and eax, 1 ; n&1
        shl eax, cl ; n>31 will wrap the shift count so this "fails"
        ret ; if you need to return 0 for all n>31, add another check

        n_gt_3:
        ;; eax = running total for i=3 already
        cmp ecx, 18
        jae large_n

        mov edx, ecx ; EDX = n
        mov ecx, 4 ; ECX = i=4..n loop counter and shift count

        loop1: ; do{ // unrolled by 2
        ; multiply by 2^even power
        shl eax, cl ; total <<= i; // same as total *= (1<<i)

        inc edx
        cmp ecx, edx
        jae done ; if (++i >= n) break;

        ; add 2^odd power. i>3 so it won't already be set (thus no carry)
        bts eax, edx ; total |= 1<<i;

        inc ecx ; ++i
        cmp ecx, edx
        jb loop1 ; }while(i<n);

        done:
        ret


        By using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. So that's a minor bonus saving.



        Notice that this time the main loop is entered with i=4, which is even, instead of i=1. So I swapped the add vs. shift.



        I still didn't get around to pulling the cmp/jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set the loop-exit condition, but would require a check to not run the loop at all for i=4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, not creating a worse bottleneck than the loop-carried dep chains (through eax and ecx). But branch-prediction will be different, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.






        share|improve this answer














        You're overcomplicating your code with static variables and branching.



        These are powers of 2, you can (and should) just left-shift by n instead of actually constructing 2^n and using a mul instruction.



        add eax,eax is the best way to multiply by 2 (aka left shift by 1), but it's not clear why you're doing that to the value in EAX at that point. It's either the multiply result (which you probably should have stored back into runtot after mul), or it's that left-shifted by 1 after an even iteration.



        If you were trying to make a 2^i variable (with a strength reduction optimization to shift by 1 every iteration instead of shifting by i), then your bug is that you clobber EAX with mul, and its setup, in the oddexp block.



        As Jester points out, if the first loop loop1 falls through, it will fall through into oddexp:. When you're doing loop tail duplication, make sure you consider where fall-through will go from each tail if the loop does end there.





        There's also no point in having a static variable called bool which holds a 1, which you only use as an operand for test. That implies to human readers that the mask sometimes needs to change; test ecx,1 is a lot clearer as a way to check the low bit for zero / non-zero.



        You also don't need static storage for runtot, just use a register (like EAX where you want the result eventually anyway). 32-bit x86 has 7 registers (not including the stack pointer).





        This is how I'd do it. Untested, but I simplified a lot by unrolling by 2. Then the test for odd/even goes away because that alternating pattern is hard-coded into the loop structure.



        We increment and compare/branch twice in the loop, so unrolling didn't get rid of the loop overhead, just changed one of the loop branches into an an if() break that can leave the loop from the middle.



        This is not the most efficient way to write this; the increment and early-exit check in the middle of the loop could be optimized away by counting another counter down from n, and leaving the loop if there are less than 2 steps left. (Then sort it out in the epilogue)



        ;; UNTESTED
        power proc ; fastcall calling convention: arg: ECX = unsigned int n
        ; clobbers: ECX, EDX
        ; returns: EAX

        push ebx ; save a call-preserved register for scratch space

        mov eax, 1 ; EAX = 2^0 running total / return value
        test ecx,ecx
        jz done

        mov edx, ecx ; EDX = n
        mov ecx, 1 ; ECX = i=1..n loop counter and shift count

        loop1: ; do{ // unrolled by 2
        ; add 2^odd power
        mov ebx, 1
        shl ebx, cl ; 2^i ; xor ebx, ebx; bts ebx, ecx
        add eax, ebx ; total += 2^i

        inc ecx
        cmp ecx, edx
        jae done ; if (++i >= n) break;

        ; multiply by 2^even power
        shl eax, cl ; total <<= i; // same as total *= (1<<i)

        inc ecx ; ++i
        cmp ecx, edx
        jb loop1 ; }while(i<n);

        done:
        pop ebx
        ret


        I didn't check if the adding-odd-power step ever produces a carry into another bit. I think it doesn't, so it could be safe to implement it as bts eax, ecx (setting bit i). Effectively an OR instead of an ADD, but those are equivalent as long as the bit was previously cleared.



        To make the asm look more like the source and avoid obscure instructions, I implemented 1<<i with shl to generate 2^i for total += 2^i, instead of a more-efficient-on-Intel xor ebx,ebx / bts ebx, ecx. (Variable-count shifts are 3 uops on Intel Sandybridge-family because of x86 flag-handling legacy baggage: flags have to be untouched if count=0). But that's worse on AMD Ryzen, where bts reg,reg is 2 uops but shl reg,cl is 1.





        Update: i=3 does produce a carry when adding, so we can't OR or BTS the bit for that case. But optimizations are possible with more branching.



        Using calc:



        ; define shiftadd_power(n) { local res=1; local i; for(i=1;i<=n;i++){ res+=1<<i; i++; if(i>n)break; res<<=i;} return res;}
        shiftadd_power(n) defined
        ; base2(2)

        ; shiftadd_power(0)
        1 /* 1 */
        ...


        The first few outputs are:



        n          shiftadd(n) (base2)

        0 1
        1 11
        2 1100
        3 10100 ; 1100 + 1000 carries
        4 101000000
        5 101100000 ; 101000000 + 100000 set a bit that was previously 0
        6 101100000000000
        7 101100010000000 ; increasing amounts of trailing zero around the bit being flipped by ADD


        Peeling the first 3 iterations would enable the BTS optimization, where you just set the bit instead of actually creating 2^n and adding.



        Instead of just peeling them, we can just hard-code the starting point for i=3 for larger n, and optimize the code that figures out a return value for the n<3 case. I came up with a branchless formula for that based on right-shifting the 0b1100 bit-pattern by 3, 2, or 0.



        Also note that for n>=18, the last shift count is strictly greater than half the width of the register, and the 2^i from odd i has no low bits. So only the last 1 or 2 iterations can affect the result. It boils down to either 1<<n for odd n, or 0 for even n. This simplifies to (n&1) << n.



        For n=14..17, there are at most 2 bits set. Starting with result=0 and doing the last 3 or 4 iterations should be enough to get the correct total. In fact, for any n, we only need to do the last k iterations, where k is enough that the total shift count from even i is >= 32. Any bits set by earlier iterations are shifted out. (I didn't add a branch for this special case.)



        ;; UNTESTED
        ;; special cases for n<3, and for n>=18
        ;; enabling an optimization in the main loop (BTS instead of add)
        ;; funky overflow behaviour for n>31: large odd n gives 1<<(n%32) instead of 0
        power_optimized proc
        ; fastcall calling convention: arg: ECX = unsigned int n <= 31
        ; clobbers: ECX, EDX
        ; returns: EAX

        mov eax, 14h ; 0b10100 = power(3)
        cmp ecx, 3
        ja n_gt_3 ; goto main loop or fall through to hard-coded low n
        je early_ret
        ;; n=0, 1, or 2 => 1, 3, 12 (0b1, 0b11, 0b1100)

        mov eax, 0ch ; 0b1100 to be right-shifted by 3, 2, or 0
        cmp ecx, 1 ; count=0,1,2 => CF,ZF,neither flag set
        setbe cl ; count=0,1,2 => cl=1,1,0
        adc cl, cl ; 3,2,0 (cl = cl+cl + (count<1) )
        shr eax, cl
        early_ret:
        ret

        large_n: ; odd n: result = 1<<n. even n: result = 0
        mov eax, ecx
        and eax, 1 ; n&1
        shl eax, cl ; n>31 will wrap the shift count so this "fails"
        ret ; if you need to return 0 for all n>31, add another check

        n_gt_3:
        ;; eax = running total for i=3 already
        cmp ecx, 18
        jae large_n

        mov edx, ecx ; EDX = n
        mov ecx, 4 ; ECX = i=4..n loop counter and shift count

        loop1: ; do{ // unrolled by 2
        ; multiply by 2^even power
        shl eax, cl ; total <<= i; // same as total *= (1<<i)

        inc edx
        cmp ecx, edx
        jae done ; if (++i >= n) break;

        ; add 2^odd power. i>3 so it won't already be set (thus no carry)
        bts eax, edx ; total |= 1<<i;

        inc ecx ; ++i
        cmp ecx, edx
        jb loop1 ; }while(i<n);

        done:
        ret


        By using BTS to set a bit in EAX avoids needing an extra scratch register to construct 1<<i in, so we don't have to save/restore EBX. So that's a minor bonus saving.



        Notice that this time the main loop is entered with i=4, which is even, instead of i=1. So I swapped the add vs. shift.



        I still didn't get around to pulling the cmp/jae out of the middle of the loop. Something like lea edx, [ecx-2] instead of mov would set the loop-exit condition, but would require a check to not run the loop at all for i=4 or 5. For large-count throughput, many CPUs can sustain 1 taken + 1 not-taken branch every 2 clocks, not creating a worse bottleneck than the loop-carried dep chains (through eax and ecx). But branch-prediction will be different, and it uses more branch-order-buffer entries to record more possible roll-back / fast-recovery points.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Nov 13 at 22:28

























        answered Nov 10 at 2:53









        Peter Cordes

        116k16176302




        116k16176302






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53234921%2fin-my-assembly-program-i-am-trying-to-calculate-the-equation-of-20-21%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Schultheiß

            Verwaltungsgliederung Dänemarks

            Liste der Kulturdenkmale in Wilsdruff