Z80 Optimizations
Written by Chris



Optimization Techniques on the TiCalc's Z80

Optimization on the Z80 reminds me of the old days on the 8088 family of processors. It was cruel, but a good programmer could get the most out of it with a few optimization techniques. The purpose of this text is to sharpen your Z80 optimization skills so that you CAN put all the nice effects you wish into your game/application/etc.

BUS, just about the worst transportation method

The BUS in a chip can be seen as a lazy way to build its circuitry! The idea is to use the BUS as a shared path from the memory to the registers and to load the operands/opcodes. It's kind of like a highway =). So if you want to load something from memory, you first put it on the bus, and the second step consists of moving it from the bus into the register. Honestly, its purpose is to use fewer wires/logic gates so that the chip costs less; it would really look bad having a 4 Meg machine without a BUS. Anyways, you'll see that if you get yourself a book on the inner circuitry of computers ;).

For optimization purposes, you may consider the BUS as an intermediary register (getting the information from one location to the other). The Z80 has to fetch every instruction it executes. Having an 8-bit bus means it takes a total of 4 cycles to get one single byte into the queue to be processed. Fortunately, the fetching can happen at the same time as an execution. But the problem shows up when, for example, you have a 4-cycle instruction that is more than 1 byte long sitting on an empty queue.

Suppose the instruction ld a, 1. That instruction is 2 bytes long and takes 7 cycles according to Zilog. 7 cycles? Not quite: since the instruction is 2 bytes, it requires 2x4 cycles just to fetch it, which means you're really looking at 8 cycles.

Figure 1.1

Now consider the following pair of instructions:

        ld      a, 4
        push    af

It takes 8 cycles to load the first command, as we've seen, and then a total of 11 cycles to do the push. Since the push is only 1 byte long, there's plenty of time to fetch it. And because there's more execution time than fetching time, the fetching process keeps going after the 4th cycle and fetches the following instruction. So in all, this set of commands takes 8 + 11 = 19 cycles when the prefetch queue starts empty. Now consider the case where we reverse the operations: we do the push first and then the load. In that case, we use 11 cycles to run the push and only 7 cycles for the load, that's 18 in total. Here's what happens:

Figure 1.2

So did Zilog screw up when counting their instruction speeds? Not at all. As you can see, if the prefetch queue is empty, then you have to count knowing that 4 cycles per byte are needed for the fetch. If the prefetch queue is not empty, and all the bytes needed for the instruction have already been loaded, then you can count Zilog's cycles as published.
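
To make the bookkeeping concrete, here is that same pair written out both ways with the fetch overhead counted as above (these are just the numbers from the paragraphs above, assuming the queue starts out empty):

; Queue empty, load first:
        ld   a, 4               ; 2 bytes to fetch = 2 x 4 = 8 cycles
        push af                 ; 11 cycles; the next fetch overlaps the execution
                                ; total: 8 + 11 = 19 cycles

; Queue empty, push first:
        push af                 ; 11 cycles, plenty of time to fetch what follows
        ld   a, 4               ; already sitting in the queue: Zilog's 7 cycles
                                ; total: 11 + 7 = 18 cycles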

There's also another BUS effect you have to take into consideration. Consider the case "ld a, (hl)". This command is 1 byte in length and, according to Zilog, takes 7 cycles. But because of our good friend the data BUS, it actually takes 8 cycles. The reason is the following: it takes 4 cycles to prefetch the instruction, as we saw, but it also takes 4 cycles to access the memory at (hl) over the BUS. So in total you're counting 8 cycles. Of course, if you already have the instruction in the queue, then you're really looking at 7 cycles as usual.

So basically, the moral here is not to rely on cycle counting alone, because that's not going to give you a good overview of the speed of the program at all. There are many more things that affect the performance of a program besides the BUS itself. To really know the speed of a program you're always better off timing it with a counter and using that: at least it will give you a true relative speed comparison. One last note on the prefetch queue: as soon as you jump somewhere, you empty your queue.

Self Modifying Code

A chameleon is a very intriguing animal. It's rather well known for changing colors depending on its environment; it's actually one of its defense mechanisms. If you look from a distance while it hides, you'll notice the chameleon adapts very well to the colors of its surroundings. Take that same chameleon and place it in another colorful environment and you'll see it change once again to match the background. Self-modifying code does just about the same thing, except of course you have to program the possibilities it can change into. This is a very nice technique and can do wonders for a program if it's well applied. But there's also a nasty trap in it if the programmer isn't fully knowledgeable about it.

Let's take case number 1. You're out of registers and you need somewhere to store a value, and most programmers' choice at this point is to make a typical variable. Although this is an acceptable method, if you are referencing that value many times, you're reducing the speed of your code. The solution? Intertwine the variable with your code so that you only need to store the value, therefore accessing memory indirectly once instead of twice (saving and retrieving). How can we do that? Suppose the following piece of code:

        ld      (VarName), a
        ;(Do some stuff)
        ld      a, (VarName)

Instead, you would reduce the cost of the last instruction by turning it into an immediate load, as in:

        ld      (Patch1 + 1), a
        ;(Do some stuff)
Patch1: ld      a, 12h

Note that the value 12h doesn't matter: whatever you put in there gets written over before it's read. Also note that this technique uses fewer cycles than push/pop.
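
For reference, here is roughly how the three ways of keeping that value compare, counting plain Zilog cycles (not counting the BUS effects from earlier):

        ; 1) Plain variable:  ld (VarName), a ... ld a, (VarName)    ; 13 + 13 = 26 cycles
        ; 2) push/pop:        push af         ... pop af             ; 11 + 10 = 21 cycles
        ; 3) Self-modifying:  ld (Patch1+1),a ... Patch1: ld a, 12h  ; 13 +  7 = 20 cycles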

Now suppose another case: you have 2 functions, the first one erases the screen from top to bottom and the second one erases it from bottom to top.

Start:  ld      a, (hl)
        ; (Do some math on "a")
        ld      (bc), a
Patch1: inc     hl
        inc     bc
        dec     d
        jr      nz, Start

The only thing you have to do is patch the word value 0323h at Patch1 to increment, or patch the word value 0B2Bh to decrement. You may obtain these values from the .TAB file you use to assemble your Z80 files. If the math on a is very long, you get to save quite a bundle of bytes in size, since you don't need two copies of the routine.
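
As a quick sketch, here is what the calling side could look like. Patch1 is the label from the loop above, and the opcode words are the ones just mentioned; "ld (Patch1), hl" stores the low byte first, so 23h (inc hl) lands at Patch1 and 03h (inc bc) right after it:

        ; before running the routine from the top:
        ld      hl, 0323h       ; l = 23h (inc hl), h = 03h (inc bc)
        ld      (Patch1), hl    ; patch the loop to walk forward

        ; ...or before running it from the bottom:
        ld      hl, 0B2Bh       ; l = 2Bh (dec hl), h = 0Bh (dec bc)
        ld      (Patch1), hl    ; patch the loop to walk backward

After patching, you set up hl, bc and d as usual and jump to Start.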

You also have to make sure that whatever you modified isn't currently sitting in the prefetch queue, or else the change simply won't appear to have been applied. There's a lot of potential in this technique.

Math VS Tables

A few days back, I was working on the init screen of a small game. I wanted a horizontal line which would alternate colors every 3 pixels. Since the screen packs 8 pixels into every byte (1 bit per pixel), that's not an easy task. I had a nice long formula which worked very well, but it had a good deal of shifts and similar commands. The final version was actually a simple 3-byte string which was dumped to the screen. It only used copy commands and didn't require math at all. Needless to say, there's no equation which will do that in 3 bytes, so in that case, using a table to look up the graphic was the best idea. The whole idea of drawing that pattern was an optimization in itself too: since the whole screen was made of patterns, by removing the 1K image and replacing it with math, I was able to reduce the size from 1K down to 0.3K, which is quite an improvement on the size side. (This is the Bomb screen init, BTW.)
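
Here is a rough sketch of what such a table-driven version could look like. The 3 bytes come from the pattern's period: 3 on / 3 off repeats every 6 pixels, and lcm(6, 8) = 24 bits = 3 bytes. The RowDest label and the 16-byte row width are just assumptions for the sake of the example:

Pattern: .db  %11100011, %10001110, %00111000   ; 3 pixels on, 3 off, repeating

DrawRow:
        ld   de, RowDest        ; (hypothetical) first byte of the screen row
        ld   b, 5               ; 5 copies x 3 bytes = 15 of the 16 row bytes
RowLoop:
        push bc
        ld   hl, Pattern
        ld   bc, 3
        ldir                    ; copy one 3-byte period; de keeps moving forward
        pop  bc
        djnz RowLoop
        ld   a, (Pattern)
        ld   (de), a            ; the 16th byte completes the row
        ret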

So when you come upon a piece of graphics you have to produce, make sure you think carefully about whether it's preferable to use an equation, a simple table, or a combination of both like I had up there. You can also have the program compute a table at startup if you find that takes less space than storing it directly.

Unrolled Loops

In the old days, there used to be sailors' competitions. There were many kinds, but the one I'm most interested in is the weight-carrying one. For this competition, each sailor had a cart full of heavy, sailing-related things: heavy anchors, chains, pieces of wood and much more. The winner was the one who could move every object from that cart to a second one in the least time possible. The first sailor took one object at a time, carried it to the other cart very fast and set a good time. The second sailor took 2 objects at a time, reducing his walking distance but also his walking speed, since the weight was much more considerable. The last one, also the winner with the fastest time, took every object on his back and walked at a turtle's pace, but didn't have to come back to the first cart at all.

This is exactly the effect of an unrolled loop. The idea is that you take your typical "FOR/DO/WHILE" loop and repeat its body so that you do more than one of the same thing per pass, reducing the number of iterations. This will obviously increase the size of your code, but it will also speed it up. Here's why. Consider the cost of a jump: it's around 10-12 cycles. Now add that of a comparison, which is about 4 cycles. If you run that loop 100 times, you're wasting about 16 cycles of overhead per iteration, therefore 1600 cycles for the whole loop. That's a great amount. If you copy/paste the loop body just once, you only need about 50 iterations instead, so you cut the looping overhead roughly in half, to about 800 cycles. If your loop is small enough, you can even consider unrolling the whole thing, which means you don't need any check at all.

        ld   b, 100d
Loop1:  add  hl, de
        djnz Loop1                      ; We're wasting about 13*100 = 1300 cycles on the djnz

        ld   b, 50d
Loop2:  add  hl, de
        add  hl, de
        djnz Loop2                      ; Here it's only about 13*50 = 650 cycles
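
And as mentioned, if the count is small and fixed, the loop can disappear completely. A sketch for a fixed count of 4 (many assemblers also have a repeat or macro directive that will generate this for you):

        add  hl, de                     ; fully unrolled: no counter, no djnz,
        add  hl, de                     ; zero looping overhead
        add  hl, de
        add  hl, de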

Smart direct RAM usage

A few years ago, PCs didn't have much RAM, and the conventional-memory requirements of some games were really high. Back in DOS, we could only access 640K directly, and some games would request as much as 620K of it to run. That meant you only had 20K left to load your drivers, and it was really hard to squeeze everything in. That's the point where most people decided Bill Gates' famous sentence "640K should be enough for everyone" wasn't so true. And look who's using 25 times that amount for an OS, eh? =) Anyhow, as the PC got poked at more and more, some ingenious people found free holes in the memory which could be used to run programs. One such hole was the monochrome text mode segment (B000-B7FF), and depending on your PC, other holes were available too. Intel, Lotus and Microsoft also came up with a spec, and the HIMEM.SYS driver made extended memory accessible to DOS. People also found a quirk in the addressing that lets the first 64K just above the 1MB mark be used almost like conventional memory, also known as the HMA.

I don't have to spell out where I'm heading. The calculator has holes too. Not every part of memory is used and has to be preserved. There are many holes you can use as temporary variables, and you should find out where they are when you need them. For starters, at address $4000 there is a nice 16K window which is reserved for ROM paging. This flipping page is used by the calculator to carry out the ROM calls you're making. If you're not making any ROM calls in a stretch of code, why not put that window to good use: just make sure you load the original page back when you're done with it. You can read the current page with the command "in a, (5)". Following that window, at $8000, is another 16K window reserved for RAM flipping, so if you're accessing anything outside your direct memory, it gets paged in there to be read. Again, if you're not using that space, and most programs really don't need to, then use it yourself for sprites and such data.
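
The borrow-and-give-back idea looks something like this. A minimal sketch: which page you actually map into the window is up to you, so SomePage here is only a hypothetical placeholder:

        in   a, (5)             ; remember which page is currently at $4000
        push af
        ld   a, SomePage        ; (hypothetical) the page you want to borrow
        out  (5), a             ; map it into the $4000-$7FFF window

        ; ... use $4000-$7FFF as scratch space here ...

        pop  af
        out  (5), a             ; put the original page back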

The biggest problem is probably how to get external data into that RAM window for you to use. Well, the concept is a bit tricky, but it works pretty well once it's in place. First, get the full 3-byte absolute address of your variable via the VAT (check other references for that). Once you have the address, calculate which page the data sits on; it's a really simple calculation: Page = (Address / 16K) - 3. Load that page into the ROM window ($4000) and also load the next page into the $8000 window, so that if your data crosses the 16K boundary, the rest of it will still be there! Now, what you're probably interested in is freeing up the ROM window again and keeping only the data you need in the RAM window. To do that, get the data's offset within the page with the formula (Address AND (16K-1)), then copy the buffer up toward $BFFF. But you have to copy it from the end to the beginning, because if you copy it the usual way, from beginning to end, you might overwrite a portion of the data that still lives at addresses >= $8000. So, all that said, here's the code to do just that (assuming ix points to the data's 3-byte address):

        in   a, (5)
        ex   af, af'            ; save the page currently mapped at $4000

        ld   a, (ix + 2)
        ld   l, (ix + 1)
        sla  l
        rla
        sla  l
        rla                     ; a = data page (this is AHL / 16K)

        add  a, %01000000 - 3   ; bit 6 selects RAM; the -3 comes from the page formula
        out  (5), a             ; show that page at $4000
        inc  a
        out  (6), a             ; the data is now visible between $4000 and $C000

        ld   bc, Length
        ld   a, (ix + 1)
        and  %00111111
        ld   h, a
        ld   l, (ix)            ; hl = offset of the data within its page
        ld   de, $4000 + Length - 1
        add  hl, de             ; hl = last byte of the data in the $4000 window
        ld   de, $BFFF
        lddr                    ; copy backwards into the RAM swap page at $8000

        ex   af, af'
        out  (5), a             ; give back the ROM page

Note here that we're copying the data so that it ends at $BFFF. There's no need for it to end exactly there; it's just a personal preference!

For the ones doing grayscale, you may also use the holes in the system as one of your virtual screens. There's a nice hole at $CA00, commonly used by grayscale programmers, since it has a full screen-length buffer free. For temporary variables, you may also want to look at $C100. There's a wealth of space all around the calculator: check your .INC/.H files, see which addresses aren't in use, and test that touching them doesn't affect anything; if not, you can use them =) After all, the more we get out of this calculator, the better. You might also be able to reuse areas belonging to variables or ROM calls you never use; just make sure you give their values back afterwards if they matter at all to keeping the calculator working well.

BTW: the holes described above only apply to the TI-86 (and maybe some others as well?).

Branched Jump Size Optimization

This technique will really serve the coders who optimize size to the MAX. The idea is simple: you save 1 byte every time you can use a "jr" instead of a "jp". Suppose you wish to jump to an address whose delta is > 127; in other words, a "jp" instruction seems to be the only option here! Not quite. If somewhere within range you already have a jump that goes where you want, you simply jump to that jump. That might sound a tad cryptic at first, so here's an example. Be aware, though, that this method really doesn't optimize for speed, since it takes 2 jumps or more to get to your destination.

Start:
; Some code goes here!
Jp1:  jr  Start
; Lots of code goes here
Jp2:  jr  Jp1
; Lots more code lies in here (there's more than 127 bytes between Jp2
; and Start, so instead we'll jump to Jp1, which will in turn jump to
; Start.  We're saving one byte here!)
          jr  Jp2

Working on 16-bit registers might really mean 8-bit

This is a trick many car repair shops could use. When you go to see them and they find a problem with, say, your alternator, they usually change the whole block, when instead they could try to isolate exactly what's wrong inside the alternator and change only that part. Of course, they're after money and time savings, but we're here to make our games better and faster, so we'll use techniques that change only what needs to change and nothing extra.

Many times you operate on a 16-bit register, maybe because you're addressing some place in memory, and you increment it with "inc hl". That's 6 cycles! You could increment only the lower half, "inc l", if you were sure the increment would never cross a 256-byte boundary; that's 4 cycles. If you lay your data out in memory in a well-thought-out manner, you can make sure you never cross such a boundary and therefore get away with the 8-bit increment. Another case for this is addition. Sometimes you have multiple buffers to update at once (grayscale, for example), so you need to write a byte at (hl) and then another at (hl + Page2 - Page1). Most coders would use "add hl, bc" with the delta of the page in bc. A smarter, leaner way is to modify only "h" and leave "l" alone. Since the Z80 has no "add h, n" instruction, you go through a ("ld a, h / add a, n / ld h, a", about 15 cycles) or, if the two buffers differ by a single bit of the high byte, a lone "set"/"res" on h (8 cycles), instead of something near 21[23] cycles for loading the delta and adding it. And BTW, you also get to free up the temporary addition register =). But here again, you must make sure your buffers are a multiple of 256 bytes apart in memory.
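
A sketch of the dual-buffer case, assuming two hypothetical planes, Plane1 and Plane2, laid out an exact multiple of 256 bytes apart so that only h has to change between them:

        ; b = byte for the first plane, c = byte for the second,
        ; hl = destination inside Plane1
        ld   (hl), b                        ; write the first plane
        ld   a, h
        add  a, (Plane2 - Plane1) / 256     ; the delta only touches the high byte
        ld   h, a                           ; hl now points into Plane2
        ld   (hl), c                        ; write the second plane
        ; and if the planes differ by a single bit of h, a lone "set"/"res"
        ; on that bit of h does the same job in one 8-cycle instruction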

Afterwards

Well, that's pretty much all the optimization techniques I can think of for now. There's undoubtedly much more than what I've written here, but that'll be the subject of another piece of writing later on. I haven't covered the obvious optimizations either, like "xor a" instead of "ld a, 0", since I assume they're obvious and not really material for advanced/intermediate programmers =). In other words, I assume you already know these things.

Liked this tut? Want more? Have suggestions, fixes, bugs in the code, anything? Write to me: Ti_chris@yahoo.com