Improve C snippet for mikroc

tubos · May 23, 2012

following piece of code takes about instruction 4700 cycles
and it should be a lot less than 4096
Any ideas on how to improve on that?

I guess the generated asm code could be improved
with some inline assembly but I'm not that great at it.

Code:

// Makro's
#define setLo(pin)  (pin=0)
#define setHi(pin)  (pin=1)
#define pulse(pin)  do { setHi(pin); setLo(pin); } while (0)

Code:

for(ii=0;ii<16;ii++) {  // 16 bits
         bitmask = 0x800;
         currentdata=gsData[ii];
         for(jj=0;jj<12;jj++) {
            if(currentdata & bitmask)  setHi(TLC_SIN);   // 1
            else setLo(TLC_SIN); // 0
            pulse(TLC_SCLK);
            bitmask >>=1;       // Next bit
          }
       }

The code is for sending GSdata to a TLC5940 chip within an ISR

misterT · May 23, 2012

I don't know about mikroc, but you should be able to do "loop unrolling".. either by hand or somehow tell the compiler to do it for you. In GCC compiler you would give the flag "-funroll" and the compiler unrolls the loop. Apparently unrolling is also called unwinding: https://en.wikipedia.org/wiki/Loop_unwinding

EDIT: Here is one "optimization". Anyway, if this doesn't help, I think the "loop unrolling" is the next thing you can do before resorting to inline ASM.

Code:

for(ii=0;ii<16;ii++) {  // 16 bits
     currentdata=gsData[ii];
     for(bitmask = 0x800; bitmask != 0; bitmask >>= 1) {
        TLC_SIN = (currentdata & bitmask) ? 1 : 0;
        pulse(TLC_SCLK);
     }
}

Even if the above code is shorter, it does not mean that it is more efficient. I'm not familiar with your compiler so I can't say for sure. Sometimes it is better not to try to be too clever and leave the optimization to the compiler. At least I managed to get rid of one variable.. and usually compilers optimize the ternary ?:-operator better than if-else statement.

Mr RB · May 24, 2012

You have 2 loops of 16 and 12 which multiply to give a total of 192 iterations. So at your measured 4700 inst cycles it is only taking <25 instructions per each cycle which is already quite good considering it needs to access an array of 16bit values and do bit testing and 16bit right shifts.

How fast did you expect it to be?

misterT · May 24, 2012

Mr RBs post gave me an idea that could greatly improve the performance. The idea is to cast the array of 16 x 16bit values to 32 x 8 bit values and unroll the second loop:

Code:

unsigned char* castedData; // declare another variable, pointer to 8-bit array

castedData = (unsigned char*)gsData;  // Cast 16-bit array to 8-bit array
for(ii=0; ii<32; ii+=2) {

     currentdata=castedData[ii+1];
     TLC_SIN = (currentdata & 0b00000001) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b00000010) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b00000100) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b00001000) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b00010000) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b00100000) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b01000000) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b10000000) ? 1 : 0;
     pulse(TLC_SCLK);

     currentdata=castedData[ii];
     TLC_SIN = (currentdata & 0b00000001) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b00000010) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b00000100) ? 1 : 0;
     pulse(TLC_SCLK);
     TLC_SIN = (currentdata & 0b00001000) ? 1 : 0;
     pulse(TLC_SCLK);
}

EDIT: not sure if I got the byte and bit order right, but the idea is the same.

Depending on how TLC_SIN is defined and depending on your compiler, you could be able to replace all the "TLC_SIN = (currentdata & 0b00001000) ? 1 : 0;" lines with simple "TLC_SIN = (currentdata & 0b00001000);"

tubos · May 24, 2012

Here is the corresponding asm code generated by Mikroc compiler
I want it a lot faster because I want to try multiplexing 8 rows if possible.
@MisterT I like your idea for casting it to 8bit array I'll probably give it a try tomorrow tx

Code:

;TLC5940.c,191 ::                 for(ii=0;ii<16;ii++) {
        CLRF        R3 
L_interrupt16:
        MOVLW       16
        SUBWF       R3, 0 
        BTFSC       STATUS+0, 0 
        GOTO        L_interrupt17
;TLC5940.c,192 ::                 bitmask = 0x800;
        MOVLW       0
        MOVWF       interrupt_bitmask_L1+0 
        MOVLW       8
        MOVWF       interrupt_bitmask_L1+1 
;TLC5940.c,193 ::                 currentdata=gsData[ii];
        MOVF        R3, 0 
        MOVWF       R0 
        MOVLW       0
        MOVWF       R1 
        RLCF        R0, 1 
        BCF         R0, 0 
        RLCF        R1, 1 
        MOVLW       _gsData+0
        ADDWF       R0, 0 
        MOVWF       FSR0L 
        MOVLW       hi_addr(_gsData+0)
        ADDWFC      R1, 0 
        MOVWF       FSR0H 
        MOVF        POSTINC0+0, 0 
        MOVWF       R5 
        MOVF        POSTINC0+0, 0 
        MOVWF       R6 
;TLC5940.c,194 ::                 for(jj=0;jj<12;jj++) {
        CLRF        R4 
L_interrupt19:
        MOVLW       12
        SUBWF       R4, 0 
        BTFSC       STATUS+0, 0 
        GOTO        L_interrupt20
;TLC5940.c,195 ::                 if(currentdata & bitmask)  setHi(TLC_SIN);
        MOVF        interrupt_bitmask_L1+0, 0 
        ANDWF       R5, 0 
        MOVWF       R0 
        MOVF        R6, 0 
        ANDWF       interrupt_bitmask_L1+1, 0 
        MOVWF       R1 
        MOVF        R0, 0 
        IORWF       R1, 0 
        BTFSC       STATUS+0, 2 
        GOTO        L_interrupt22
        BSF         LATC+0, 3 
        GOTO        L_interrupt23
L_interrupt22:
;TLC5940.c,196 ::                 else setLo(TLC_SIN); // 0
        BCF         LATC+0, 3 
L_interrupt23:
;TLC5940.c,197 ::                 pulse(TLC_SCLK);
        BSF         LATC+0, 2 
        BCF         LATC+0, 2 
;TLC5940.c,198 ::                 bitmask >>=1;       // Next bit
        RRCF        interrupt_bitmask_L1+1, 1 
        RRCF        interrupt_bitmask_L1+0, 1 
        BCF         interrupt_bitmask_L1+1, 7 
;TLC5940.c,194 ::                 for(jj=0;jj<12;jj++) {
        INCF        R4, 1 
;TLC5940.c,199 ::                 }
        GOTO        L_interrupt19
L_interrupt20:
;TLC5940.c,191 ::                 for(ii=0;ii<16;ii++) {
        INCF        R3, 1 
;TLC5940.c,200 ::                 }

Mike - K8LH · May 24, 2012

I don't know anything about Mikro C but I got 64 words of program memory and 890 cycles with a quick-n-dirty (untested) assembly language routine in BoostC for 16 * 12 = 192 bits...

Mr RB · May 25, 2012

misterT said:
...
TLC_SIN = (currentdata & 0b00000001) ? 1 : 0;

MikroC allows direct bit access, so you could use
TLC_SIN = currentdata.F0
to copy bit0, the compiler will efficiently use bitwise ASM instructions like BTFSS.

To the OP, thanks for posting the ASM dump, I made a mistake too in my post earlier as the data array read is only done 16 times not 192 times!

MisterT looks to on the right solution to hard code the 12 bits and then just loop it 16 times. That will save a ton of insts compared to your existing code.

Another way to speed loops in C is to do them like this;

Code:

ii = 16;
while(ii)
{
  blah here;
  ii--;
}

which can compile to faster and less code than a typical for() loop as it will use a single DECFSZ instruction for the for loop instead of loading and testing a value in the W reg.

Mike - K8LH · May 25, 2012

Taking a closer look at the compiler output can help a lot. It's amazing to see the difference between sets of instructions that essentially do the same thing. Take these two sets of instructions for instance;

Code:

   #define clk porta.0          // index for RA0
   #define dat porta.1          // index for RA1

Code:

     dat = data.12;
00C6  1A48  	BTFSC main_1_data+D'1',4
00C7  148C  	BSF gbl_porta,1
00C8  1E48  	BTFSS main_1_data+D'1',4
00C9  108C  	BCF gbl_porta,1

Code:

     dat = 0;
00C6  108C  	BCF gbl_porta,1
     if(data.12) dat = 1;
00C7  1A48  	BTFSC main_1_data+D'1',4
00C8  148C  	BSF gbl_porta,1

That's a 25% savings in both memory and processing overhead.

Same thing with loop constructs. In BoostC the while() loop branches from the bottom of the loop up to the top of the loop where the loop condition test is performed. When the loop condition is met there's a branch to just past the loop. A do/while loop is more efficient because it performs the loop condition test at the bottom of the loop which saves one branch instruction;

Code:

     char ii = 16;
     while(ii)
     { nop();
       ii--;
     }

Code:

     char ii = 16;
00C9  3010  	MOVLW 0x10
00CA  00C9  	MOVWF main_1_ii

     while(ii)
00CB        label12
00CB  08C9  	MOVF main_1_ii, F
00CC  1903  	BTFSC STATUS,Z
00CD  28D1  	GOTO	label13
     { nop();
00CE  0000  	NOP
       ii--;
00CF  03C9  	DECF main_1_ii, F
     }
00D0  28CB  	GOTO	label12
00D1        label13

Code:

     ii = 16;
     do
     {
       nop();
     } while(--ii);

Code:

     ii = 16;
00D1  3010  	MOVLW 0x10
00D2  00C9  	MOVWF main_1_ii
     do
00D3        label14
     {
       nop();
00D3  0000  	NOP
     } while(--ii);
00D4  03C9  	DECF main_1_ii, F
00D5  1D03  	BTFSS STATUS,Z
00D6  28D3  	GOTO	label14

Notice the do/while loop construct uses two words less memory but even more important is that loop time has gone from 7 cycles in the first example down to 5 cycles in the do/while loop. That's a difference of ~32 cycles for two different loops that essentially do the same thing. You may also notice that a decfsz instruction would have saved another word of memory and one more loop cycle but, short of using in-line assembly, this is a typical example of the inefficiencies in most high level languages.

Using a variable for an array index also has huge cost in an HLL.

Good luck on your project...

shingwell · May 25, 2012

I'm not familiar with either your processor or C compiler, but if you can avoid indexing an array with a variable in a loop that's generally well worth doing! Try

Code:

         gsPtr = gsData;
         for(ii=0;ii<16;ii++) {  // 16 bits
         bitmask = 0x800;
         currentdata=*gsPtr++;
         for(jj=0;jj<12;jj++) {
            if(currentdata & bitmask)  setHi(TLC_SIN);   // 1
            else setLo(TLC_SIN); // 0
            pulse(TLC_SCLK);
            bitmask >>=1;       // Next bit
          }
       }

with an appropriate local variable declaration for gsPtr of course. It's only in the outer loop but your compiler seems to make a real meal of it!

tubos · May 25, 2012

I've Changed the loop as proposed by MisterT but keeping the 16 bit
array and it came down from 4755 cycles to 2566 cycles which
is really amazing.

Thx all for your suggestions , I'll do some more tests in the following days.

Code:

for(ii=0;ii<16;ii++) {  // 16 bits
         currentdata=gsData[ii];
         TLC_SIN = (currentdata & 0x800) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x400) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x200) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x100) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x080) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x040) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x020) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x010) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x008) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x004) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x002) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (currentdata & 0x001) ? 1 : 0;  pulse(TLC_SCLK);
       }

tubos · May 26, 2012

I also tried with an 8 bit pointer but the resulting code is now 2700 cycles
so slower than the 16 bit code from above. strange ???
There is also a bitorder fault in there but i wont look into it further.
as that wont change the speed.

Code:

unsigned char* ptr8data;
ptr8data=(unsigned char*)gsData; // 8bit Pointer to our 16bit array

for(ii=0;ii<32;ii+=2) {  // 16 outputs
         curdata=ptr8data[ii+1];
         TLC_SIN = (curdata & 0x80) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x40) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x20) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x10) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x08) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x04) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x02) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x01) ? 1 : 0;  pulse(TLC_SCLK);
         curdata=ptr8data[ii];
         TLC_SIN = (curdata & 0x08) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x04) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x02) ? 1 : 0;  pulse(TLC_SCLK);
         TLC_SIN = (curdata & 0x01) ? 1 : 0;  pulse(TLC_SCLK);
       }

tubos · May 28, 2012

I've optimized it for speed even more by using inline assembly
(yes I've spent a lot of time in the datasheets)
and got it down to 1189 cycles.It still uses 16 bit data.

Now I can add another TLC5940 in series for 32x8 led matrix serviced by my ISR
which was my main goal.

I'll post my code here for future reference.

Code:

for(ii=0;ii<16;ii++) {  // 16 outputs at 12bit/Ch -> now 1189 cycles
     
         asm incf FSR2L            // point to Hi Byte ( 0xHHLL )
         asm movFF INDF2,_curdat8   // Get Hi Byte
         
         TLC_SIN=1;
         asm BTFSS _curdat8,3      // HiByte Bit 3 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);
         
         TLC_SIN=1;
         asm BTFSS _curdat8,2      // HiByte Bit 2 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);
         
         TLC_SIN=1;
         asm BTFSS _curdat8,1      // HiByte Bit 1 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);

         TLC_SIN=1;
         asm BTFSS _curdat8,0      // HiByte Bit 0 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);
         
         asm decf FSR2L            // point to Lo Byte in array ( 0xHHLL )
         asm movFF INDF2,_curdat8  // And save in curdat8
         
         TLC_SIN=1;
         asm BTFSS _curdat8,7      // LoByte Bit 7 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);

         TLC_SIN=1;
         asm BTFSS _curdat8,6      // LoByte Bit 6 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);

         TLC_SIN=1;
         asm BTFSS _curdat8,5      // LoByte Bit 5 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);

         TLC_SIN=1;
         asm BTFSS _curdat8,4      // LoByte Bit 4 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);
         
         TLC_SIN=1;
         asm BTFSS _curdat8,3      // LoByte Bit 3 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);

         TLC_SIN=1;
         asm BTFSS _curdat8,2      // LoByte Bit 2 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);

         TLC_SIN=1;
         asm BTFSS _curdat8,1      // LoByte Bit 1 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);

         TLC_SIN=1;
         asm BTFSS _curdat8,0      // LoByte Bit 0 == 1  ?
         TLC_SIN=0;
         pulse(TLC_SCLK);
         
         asm incf FSR2L            // We move to the next channel
         asm incf FSR2L            // point to Lo Byte of next channel Data

       }

Mike - K8LH · May 28, 2012

Bravo! Way to go! That's very close to the ~880 cycles I had come up with.

Regards...

Code:

   #define clk 0                // index for RA0
   #define dat 1                // index for RA1
  /*
   void test()
   { u16 data; u08 lpctr;       //
     asm movlw  high(_owbuff)   // for 16F enhanced mid-range
     asm movwf  _fsr0h          //  "
     asm movlw  low(_owbuff)    //  "
     asm movwf  _fsr0l          //  "
   //asm lfsr   0,_owbuff       // for 18F (2 words, 2 cycles)
     asm movlw  16              //  
     asm movwf  _lpctr          // loop counter = 16
   lp1:
     asm moviw  _fsr0++         // for 16F enhanced mid-range
     asm movwf  _data+0         //  "
     asm moviw  _fsr0++         //  "
     asm movwf  _data+1         //  "
   //asm movf   _postinc0,W     // for 18F
   //asm movwf  _data+0         //  "
   //asm movf   _postinc0,W     //  "
   //asm movwf  _data+1         //  "
     asm movlw  ~(1<<clk|1<<dat)
     asm btfsc  _data+1,3       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b11
     asm andwf  _porta,F        // clk = 0, dat = 0
     asm btfsc  _data+1,2       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b10
     asm andwf  _porta,F        // clk = 0, dat = 0
     asm btfsc  _data+1,1       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b9
     asm andwf  _porta,F        //
     asm btfsc  _data+1,0       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b8
     asm andwf  _porta,F        //
     asm btfsc  _data+0,7       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b7
     asm andwf  _porta,F        //
     asm btfsc  _data+0,6       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b6
     asm andwf  _porta,F        //
     asm btfsc  _data+0,5       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b5
     asm andwf  _porta,F        //
     asm btfsc  _data+0,4       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b4
     asm andwf  _porta,F        //
     asm btfsc  _data+0,3       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b3
     asm andwf  _porta,F        //
     asm btfsc  _data+0,2       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b2
     asm andwf  _porta,F        //
     asm btfsc  _data+0,1       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b1
     asm andwf  _porta,F        //
     asm btfsc  _data+0,0       //
     asm bsf    _porta,dat      //
     asm bsf    _porta,clk      // shift out b0
     asm andwf  _porta,F        //
     asm decfsz _lpctr,F        // all 16 words? yes, skip, else
     asm bra    lp1             //
   }

tubos · May 28, 2012

@K8LH thx It felt very good when I finally got it to work with inline asm

Mr RB · May 28, 2012

Congrats on a job well done you have earned it!

Of course you'll probably get roasted later by someone who comes along and sees your code and says "That's not proper C syntax! He should have used 2 for loops and no assembler!"...

Welcome to our site!

Electro Tech is an online community (with over 170,000 members) who enjoy talking about and building electronic circuits, projects and gadgets. To participate you need to register. Registration is free. Click here to register now.

Improve C snippet for mikroc

tubos

New Member

misterT

Well-Known Member

Mr RB

Well-Known Member

misterT

Well-Known Member

tubos

New Member

Mike - K8LH

Well-Known Member

Mr RB

Well-Known Member

Mike - K8LH

Well-Known Member

shingwell

New Member

tubos

New Member

tubos

New Member

tubos

New Member

Mike - K8LH

Well-Known Member

tubos

New Member

Mr RB

Well-Known Member

Similar threads

Latest threads

New Articles From Microcontroller Tips