See the top level README for information on where to find the schematic and programmers reference manual for the ARM processor on the raspberry pi, as well as information on how to load and run these programs.

This example serves many purposes. First, it is a real-world library/application demonstrated in a bare-metal embedded format. Like extest, it demonstrates the use of the mmu so that the data cache can be safely turned on. One of the main goals, though, comes from something I have seen all too often over the years, based on questions in person and in forums: many programmers live under the myth that there is a one to one relationship between high level code and the compiled binary, that if you want to change the performance of your code you have to change your code. There is no truth of any kind to that. Likewise the belief that any two compilers can or should be expected to produce the same machine code from the same high level source code; there is no reason to expect such a thing either. Think about it this way: take 100 programmers, 100 students, whatever, and give them the same programming assignment, a homework project or a semester project. Do all 100 produce the exact same source code? No, of course not. You might have solutions that fall into definable categories, but you are going to see many solutions. It is no different here. Each compiler is created by different people, for different reasons, with different skills and different goals, and the output is not exactly the same. Sure, for some simple programs there may be an ideal solution and two compilers might converge on it, but on average, especially as the project gets large, the solutions will vary.

What I have here is the zlib library, which in the Linux world is on a par with gcc as far as being a cornerstone holding the whole thing up in the air. Not unlike gcc, you really do not want to actually look at the code. The jumping around from longs to ints, and the way variables are declared, is not consistent; basically it makes your compiler work at least a little harder. I have some text that I am going to compress and then uncompress, making a nice self-checking test, and I am going to compare compilers using different compiler options while turning the caches on and off.

With respect to the zlib license: this zlib library is not mine, though I have made modifications to it. I have commented out the includes so that they will not interfere with compiling bare-metal (it just makes it easier to know what I do and do not have to manage). I do not use a C library, so a few C library functions had to be implemented: malloc, calloc, memset, memcpy, free. The malloc is very, very simple and free does nothing; this is a once-through test, so there is no need to actually manage memory allocation, just give the code what it asks for. Fortunately the raspberry pi is well endowed with memory.
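To give an idea of how little is needed, here is a minimal sketch of that kind of once-through allocator. This is my own illustration, assuming a static pool; the names and the pool size are not necessarily what the sample uses.

#include <stddef.h>

#define POOL_SIZE (32*1024*1024)            /* the pi has memory to burn */
static unsigned char pool[POOL_SIZE];
static size_t pool_used;

void *malloc ( size_t nbytes )
{
    void *ptr;
    nbytes = (nbytes + 7) & (~(size_t)7);   /* keep returned pointers aligned */
    if ((pool_used + nbytes) > POOL_SIZE) return NULL;
    ptr = &pool[pool_used];
    pool_used += nbytes;
    return ptr;
}

void free ( void *ptr )
{
    (void)ptr;                              /* once-through test, never reclaim */
}

void *memset ( void *dst, int c, size_t n )
{
    unsigned char *d = dst;
    while (n--) *d++ = (unsigned char)c;
    return dst;
}

void *memcpy ( void *dst, const void *src, size_t n )
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

void *calloc ( size_t nmemb, size_t size )
{
    void *ptr = malloc(nmemb*size);
    if (ptr) memset(ptr, 0, nmemb*size);
    return ptr;
}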
I have several versions of gcc: 4.6.1 from the CodeSourcery folks (now Mentor Graphics), plus 4.7.0 and 4.8.1 built from sources using the build script in the buildgcc directory. On top of the gcc versions there is clang/llvm, also built using the build script in the buildgcc directory. This is a completely separate compiler from gcc, with different goals and different construction. Thanks to Apple and the iPhone, the llvm compiler tools have had a big boost in quality; the code produced is approaching the speed of gcc. In tests I did a while back, other compilers, pay-for compilers, blew gcc out of the water.

Gcc is a good average compiler for many targets, but not great at any particular target; other compilers are better. For gcc I am going to play with the obvious optimization levels, -Ox, a few of them, to show that they actually do something. I am also going to both leave the default arm architecture setting and specify the proper architecture explicitly, to see what that does. There are many, many knobs you can play with; I do not even need this many combinations to demonstrate that the same compiler, or different compilers, can create binaries that execute at different speeds from the same high level source code. So the gcc knobs are -O1, -O2 and -O3, plus commenting and uncommenting defines in the code that enable the instruction cache, mmu and data cache (the mmu has to be on to enable the data cache).

COPS0 = -Wall -O1 -nostdlib -nostartfiles -ffreestanding
COPS1 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding
COPS2 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding
COPS3 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s
COPS4 = -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s
COPS5 = -Wall -O3 -nostdlib -nostartfiles -ffreestanding -mcpu=arm1176jzf-s -mtune=arm1176jzf-s

arm-none-eabi-gcc --version
arm-none-eabi-gcc (Sourcery CodeBench Lite 2011.09-69) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the 4.6.1 compiler, I am going to just dive in. The code compresses some known data, then decompresses it again. The data has been pre-computed on the host computer so the expected compressed result is known. A system timer is used to time how long the compress and decompress take, and the results are checked after the timer has been sampled; so in addition to looking at the time you also need to make sure the test actually passed.

cops   icache  mmu  dcache  time
COPS0  no      no   no      0xD0D08A
COPS1  no      no   no      0xB92BA4
COPS2  no      no   no      0xA410E1
COPS3  no      no   no      0xB4DC8A
COPS4  no      no   no      0xB4D11B
COPS5  no      no   no      0xA450DB
COPS0  yes     no   no      0x9DC2DF
COPS1  yes     no   no      0x8F5ECF
COPS2  yes     no   no      0x8832F2
COPS3  yes     no   no      0x8F9C79
COPS4  yes     no   no      0x8F9ED4
COPS5  yes     no   no      0x8AE077
COPS3  yes     yes  no      0x174FA4
COPS4  yes     yes  no      0x175336
COPS5  yes     yes  no      0x162750
COPS3  yes     yes  yes     0x176068
COPS4  yes     yes  yes     0x175CB0
COPS5  yes     yes  yes     0x162590

The first interesting thing we see is that even though the data cache is not enabled in the control register, it is obviously on in some form or fashion. Another interesting thing is that at -O3 the generic arm binary (ARMv4/ARMv5 or whatever the default target is) came out a little faster than the -O3 binary built for the actual processor. The reason is more complicated than just the architecture; I will get to that in a bit. Note that these results were all collected by hand, so there is a possibility of human error. Ideally you will use this information to learn from rather than caring about the specific numbers.
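For reference, each of those hex numbers comes from a measurement shaped roughly like the sketch below. It assumes the pi's free-running system timer (CLO at 0x20003004, counting at 1MHz); the helper names are illustrative stand-ins, not the sample's actual functions.

#define SYSTIMER_CLO ((volatile unsigned int *)0x20003004)

void do_compress ( void );                 /* stand-ins for the real test code */
void do_uncompress ( void );
int  check_against_precomputed ( void );   /* nonzero on mismatch */

unsigned int run_test ( void )
{
    unsigned int start, stop;

    start = *SYSTIMER_CLO;
    do_compress();                         /* deflate the known text */
    do_uncompress();                       /* inflate it back again */
    stop = *SYSTIMER_CLO;
    /* verify only after the timer has been sampled so the check
       does not pollute the measurement */
    if (check_against_precomputed()) return 0;   /* 0 flags a failed test */
    return stop - start;                   /* one hex number in the tables */
}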
On to gcc 4.7.0:

arm-none-eabi-gcc (GCC) 4.7.0
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cops   icache  mmu  dcache  time
COPS0  no      no   no      0xD1360B
COPS1  no      no   no      0xB7683E
COPS2  no      no   no      0xA83E6F
COPS3  no      no   no      0xBB1436
COPS4  no      no   no      0xBB0B60
COPS5  no      no   no      0xA6CF10
COPS0  yes     no   no      0x9CFAAD
COPS1  yes     no   no      0x8DA3D1
COPS2  yes     no   no      0x84B7C5
COPS3  yes     no   no      0x8E6FDB
COPS4  yes     no   no      0x8E73A6
COPS5  yes     no   no      0x86156D
COPS3  yes     yes  no      0x17EA80
COPS4  yes     yes  no      0x17FA4B
COPS5  yes     yes  no      0x15B210
COPS3  yes     yes  yes     0x17F0C8
COPS4  yes     yes  yes     0x17FB53
COPS5  yes     yes  yes     0x15B55E

And gcc 4.8.1:

arm-none-eabi-gcc (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

cops   icache  mmu  dcache  time
COPS0  no      no   no      0xD48A9F
COPS1  no      no   no      0xB994DC
COPS2  no      no   no      memset
COPS3  no      no   no      0xC4CEBC
COPS4  no      no   no      0xC4D880
COPS5  no      no   no      memset
COPS0  yes     no   no      0x9D6225
COPS1  yes     no   no      0x90D014
COPS2  yes     no   no      memset
COPS3  yes     no   no      0x921680
COPS4  yes     no   no      0x920FF1
COPS5  yes     no   no      memset
COPS3  yes     yes  no      0x194A48
COPS4  yes     yes  no      0x19379E
COPS5  yes     yes  no      memset
COPS3  yes     yes  yes     0x1948C2
COPS4  yes     yes  yes     0x19409E
COPS5  yes     yes  yes     memset

memset means the build failed like this:

twain.o: In function `xmemset':
twain.c:(.text+0x1dc): undefined reference to `memset'
twain.o: In function `xcalloc':
twain.c:(.text+0x23c): undefined reference to `memset'
trees.o: In function `build_tree':
trees.c:(.text+0x1028): undefined reference to `memset'
make: *** [twain.gcc.elf] Error 1

(the newer gcc at -O3 apparently recognizes zeroing loops and replaces them with calls to memset, which does not exist in a -nostdlib build) and I do not want to deal with that right now.

Some things to talk about at this point. First off, I was using jtag to load and run these programs. Think about what that means when the instruction cache is on and the data cache is off: if you stop the ARM and use data transactions to load a new program into ram, those writes are not cached, but whatever instructions were in the instruction cache when you stopped are still sitting there. If you have changed the program, the instructions in the instruction cache no longer match the new program, and when you start up you are mixing two programs; nothing good comes from that. Although my start_l1cache init invalidates the whole cache, and I stop the instruction cache when the program ends, there is evidence when running the same program over and over that the icache is still playing a role. Ideally, when loading with some sort of bootloader, be it serial or jtag (using data cycles to place instructions), you want to completely stop the icache and clean it before starting again; if that does not appear to be working, just power cycle between each test so you come up the same way every time. The mmu and data cache are worse than the icache in this respect: just power cycle between each test.

This stop, reload, stop, reload is not normal use of a system. Normal development, sure, but how much do you put into your code for something that is not runtime? Another factor is that during development you need to make sure you are actually initializing everything. Say the 12th build of the afternoon set some bit somewhere to make something work. You did not think that code was working (because some other bit somewhere else was not set), so you removed it. Later, without power cycling, you find that other bit, and now you think you have it figured out. Eventually you power cycle and it does not work. One thing your code must do is come up out of a reset. Being able to re-run hot is not as important; it might save development time, sure, and may have some value, but nowhere near as important as power-on initialization.
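Going back to stopping and cleaning the icache between runs: on the arm1176 that is a handful of cp15 operations. Here is a sketch in C with inline assembly, based on the ARM1176JZF-S TRM; it is my illustration, not necessarily what the sample's start_l1cache or shutdown code does.

static void icache_off_and_invalidate ( void )
{
    unsigned int ctrl;
    unsigned int zero = 0;

    asm volatile ("mrc p15, 0, %0, c1, c0, 0" : "=r"(ctrl));   /* read control register */
    ctrl &= ~0x1000;                                           /* clear bit 12, the I bit */
    asm volatile ("mcr p15, 0, %0, c1, c0, 0" :: "r"(ctrl));   /* icache now off */
    asm volatile ("mcr p15, 0, %0, c7, c5, 0" :: "r"(zero));   /* invalidate entire icache */
    asm volatile ("mcr p15, 0, %0, c7, c5, 4" :: "r"(zero));   /* flush the prefetch buffer */
}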
If you read up on the raspberry pi you know that the memory is SDRAM, which at its core is DRAM, and the first thing to know about DRAM is that it has to be refreshed. Think of each bit as a rechargeable battery: if you want to remember that the bit is charged you have to give it a little boost every so often, because if it discharges too much you might not notice that the bit has changed. You cannot access that memory from the processor while the DRAM controller is refreshing it. Basically, DRAM performance is not deterministic; it changes a little. That means if you run the same benchmark several times you should not be surprised if the results are not the same, even bare metal like this where there are no interrupts and no other code running. You can make DRAM deterministic if you access it slowly enough to ensure you get the result in the same number of clock cycles every time. And for all we know the gpu or other parts of the chip may be sharing a bus or otherwise interfering with each other, affecting performance and certainly making it a bit random. If you run the same binary over and over you will see that the time varies some, but not by a huge amount, so at least for this code and task it is reasonably consistent. If you were to run on, say, a microcontroller, or even a gameboy advance, or other systems where the flash and ram access times are a known number of clocks every time, then at worst you should only see a difference of one count either way. For some of the runs below I ran the test multiple times to show that the times are not exactly the same.

Another topic related to this kind of benchmarking is how caches work. Quite simply, programs tend to run a number of instructions in a row before they branch, and when you work with data you tend to either reuse the same data or access it in chunks or in some linear order. The cache reads a number of words at a time from slow memory into fast memory (the fast memory is very deterministic, by the way), so that if you read that instruction again in the relatively near future you do not suffer a slow memory cycle but get a relatively fast one. If you have a few instructions in a row, the first one in the cache line (the chunk of ram the cache fetches in one shot) causes the line to be read, which is really, really slow, but the second and third are really, really fast; and if that code is used again, say in a loop, before it is kicked out of the cache to make room for other code, it remains really fast until it finally gets replaced with other code.
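Before the example, the arithmetic of cache lines is worth two lines of code. The line an address falls in is just the address with the low bits masked off; this sketch assumes the pretend 16 byte (4 instruction) line used in the example that follows.

#define LINE_SIZE 16

unsigned int line_base ( unsigned int addr )
{
    return addr & (~(LINE_SIZE - 1));        /* 0x1000..0x100C all give 0x1000 */
}

int same_line ( unsigned int a, unsigned int b )
{
    return line_base(a) == line_base(b);     /* same_line(0xFFC,0x1000) is 0 */
}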
So let's pretend that our cache has a cache line of 4 instructions, aligned on a 4 word boundary, so that for example 0x1000, 0x1004, 0x1008 and 0x100C are all in the same cache line. Now say we have two instructions that are used often: something branches to the first one, it does something, and the second one is a branch elsewhere. Maybe not the most efficient code, but it happens. If the first instruction is at address 0x1000, then when you branch to it and miss the cache, the four instructions are read from main memory, really slow, and then the second instruction is really fast; but we basically read 4 instructions to execute two (let's not think about the prefetch right now). Now suppose I remove one instruction somewhere just before this code, say I optimize one instruction out of the startup code. That can actually cause many changes when you link, but let's assume what it does to these two instructions is put one at 0xFFC and the other at 0x1000. Now every time I hit these two instructions I have to read two cache lines, one at 0xFF0 and one at 0x1000; I have to read 8 instructions to execute 2. If those two instructions are used often enough, and they happen to line up with other heavily used code that evicts one or both of those cache lines, then for these two instructions the cache can be making life worse, and if we were to put a nop back somewhere to move the two of them into the same cache line, then focusing on those two alone our performance would improve. Of course it is not at all that simple. In a big program there are tons of cache lines evicting each other, and this alignment problem is all over the place; by aligning one group of instructions optimally you may move many other groups so they are no longer optimally aligned.

The bottom line to this very long subject is that simply adding and removing nops at the beginning of the program and re-compiling and re-linking moves the instructions relative to their addresses, changes this cache alignment, and as a result changes the performance. Even if the instructions are all position independent and identical, just moved over one address, the performance of your program will vary, even with a deterministic memory system. Let's try it. The fifth column below is the number of nops added to vectors.s; this sample ships with three, and you can add or remove them at will. With these numbers the changes are actually less than one percent, but you can still see that something is going on, and that something has to do with how things line up relative to the instruction cache lines.

cops   icache  mmu  dcache  nops  time(s)
COPS4  yes     no   no      0     0x8E7665 0x8E711F 0x8E73CB
COPS4  yes     no   no      1     0x8E735E
COPS4  yes     no   no      2     0x8E29A6 0x8E2DB9
COPS4  yes     no   no      3     0x8E220D
COPS4  yes     no   no      4     0x8E2859
COPS4  yes     no   no      5     0x8E6691 0x8E68CB
COPS4  yes     no   no      6     0x8E6ACC
COPS4  yes     no   no      7     0x8E7713 0x8E7786
COPS4  yes     no   no      8     0x8E735A

Another topic is the mmu. How does an mmu work? Ideally an mmu is there to take a virtual address, the address the processor thinks it is using, and convert it to a physical address, the real address in memory. You can, for example, compile all your linux programs for the same address; every linux program thinks it starts at the same address, but that address is virtual. One program may think it is running code at 0x9000 when it is really at 0x12345678, while another running at 0x9000 might really be at 0x200100. The mmu also helps protect programs from each other, among other things. You can also tune what is cached and what is not: in our case we do not want hardware registers like the timer and uart to be cached, so we use the mmu to mark that address space as non-cached and our program's memory space as cached. In this case the virtual and physical addresses are the same, to keep things easy.

Now how does the mmu do what it does? It has tables; think of them as nested arrays (an array whose index comes from another array with some other index). On top of all the cache business going on during a memory cycle, there is another small cache in the mmu (the tlb) that remembers some small number of virtual to physical conversions. If your new address is not in that list, the mmu has to compute an offset into the first level table and perform that memory cycle. Waiting... waiting... Then some bits in that first table entry tell you where to go in the next table, so you compute another address and do another slow memory read. Now, finally, three memory cycles later, you can actually go after the thing you were first looking for. This is repeated for almost everything. Just as the repetitive nature of programs makes the cache's extra reads (fetching a whole cache line even when only one item is needed) a win overall, the cache-like table in the mmu cuts down on the constant table lookups and smooths things out.
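To make the walk concrete, here is a software model of what the hardware does on a tlb miss, using the ARMv6 short descriptor format. This is an illustration of the mechanism, not code from the sample; each table read stands for one of those slow memory cycles.

unsigned int translate ( unsigned int ttb, unsigned int va )
{
    volatile unsigned int *first = (volatile unsigned int *)ttb;
    volatile unsigned int *second;
    unsigned int desc;

    desc = first[va >> 20];                     /* slow memory cycle number one */
    if ((desc & 3) == 2)                        /* 1MB section, only one level */
        return (desc & 0xFFF00000) | (va & 0x000FFFFF);
    if ((desc & 3) == 1)                        /* points at a second level table */
    {
        second = (volatile unsigned int *)(desc & 0xFFFFFC00);
        desc = second[(va >> 12) & 0xFF];       /* slow memory cycle number two */
        if (desc & 2)                           /* 4KB small page */
            return (desc & 0xFFFFF000) | (va & 0x00000FFF);
    }
    return 0xFFFFFFFF;                          /* translation fault, illustrative */
}

Two dependent reads from slow memory before the access you actually wanted even starts; that is the overhead the tlb is there to hide.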
As the results show, with the mmu enabled and the data cache on, a program like this, which is loop heavy and data heavy, runs quite a bit faster.

clang version 3.0 (branches/release_30 152644)
Target: x86_64-unknown-linux-gnu
Thread model: posix

cops     icache  mmu  dcache  time
LLCOPS0  no      no   no      0xDF11BB
LLCOPS1  no      no   no      0xDEF420
LLCOPS0  yes     no   no      0xABA6AC
LLCOPS1  yes     no   no      0xAB97C6
LLCOPS0  yes     yes  no      0x1A49FE
LLCOPS1  yes     yes  no      0x19F911

clang version 3.3 (branches/release_33 189603)
Target: x86_64-unknown-linux-gnu
Thread model: posix

cops     icache  mmu  dcache  time
LLCOPS0  no      no   no      0xE6EF36
LLCOPS1  no      no   no      0xF550AC
LLCOPS0  yes     no   no      0xAC25D7
LLCOPS1  yes     no   no      0xAC2B1C
LLCOPS0  yes     yes  no      0x1CA6C5
LLCOPS1  yes     yes  no      0x1C4F53
LLCOPS0  yes     yes  yes     0x1CA16C
LLCOPS1  yes     yes  yes     0x1C5B56

A simple experiment shows the mmu overhead. Changing the code from this

if(add_one(0x00000000,0x0000|8|4)) return(1);
if(add_one(0x00100000,0x0000|8|4)) return(1);
if(add_one(0x00200000,0x0000|8|4)) return(1);

to this

if(add_one(0x00000000,0x0000)) return(1);
if(add_one(0x00100000,0x0000)) return(1);
if(add_one(0x00200000,0x0000)) return(1);

disables the data cache for our program space, so basically no data cache anywhere. The time goes from 0x19Fxxx to 0xE5Bxxx, slower than the slowest clang time by quite a bit, 885% slower.
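About those magic numbers: the 8|4 looks like the C (cacheable) and B (bufferable) bits of an ARMv6 first level section descriptor. Purely as an illustration, here is my own reconstruction of what a 1MB section mapper like add_one might do; the table base, access permission bits and error handling are assumptions, not the sample's actual code.

#define TTB_BASE 0x00004000       /* hypothetical 16KB aligned first level table */

unsigned int add_one ( unsigned int va, unsigned int flags )
{
    volatile unsigned int *table = (volatile unsigned int *)TTB_BASE;
    unsigned int entry;

    entry  = va & 0xFFF00000;     /* section base; physical == virtual here */
    entry |= 0xC00;               /* AP bits: read/write (assumed) */
    entry |= flags;               /* 8|4 sets C and B, 0 leaves the MB uncached */
    entry |= 2;                   /* descriptor type: 1MB section */
    table[va >> 20] = entry;      /* one word per MB of virtual space */
    return(0);
}

With flags of zero the section is marked non-cacheable and non-bufferable, which is why the change above kills the data cache for the program's address space.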
Llvm and clang are getting better. They have not quite caught up with gcc, but compare, say, version 2.7 against whatever gcc was current at the time and it is converging. The best clang time is 20% slower than the best gcc time for this particular benchmark. As mentioned at the beginning, this demonstrates that the same source code compiled with different compiler options on the same compiler, with different versions of the same compiler, and with different compilers shows dramatically different results. The worst clang time is 858% slower (8.5 times slower) than the fastest clang time. The worst gcc time is 964% slower than the fastest gcc time. A newer version of gcc does not automatically produce faster code, nor code of the same speed; something changes, and it is not always better. Newer does not mean better. We also saw what the mmu does, what the cache does, what tiny changes to the location of the same set of instructions can do, and so on. Notice that we did not need any disassembly and comparison to see that there are definite differences when running the same source code on the same computer.

When you go to tomshardware and look at benchmarks, those are a single binary run one time, one way, on top of an operating system. Yes, the hard drive may be different and everything else held the same, but did they run that test 100 times and average, or run it once? Had they run it more than once, what is the run-to-run difference on the same system? Is that difference greater than, say, the same system with a different hard drive? It probably says somewhere in the fine print; the point is that you need to understand the nature of benchmarking.

Another thing that is hopefully obvious by now: take any other library or program, basically change the source code, and the performance changes again. There are no doubt programs that llvm/clang compiles better, and a benchmark like this would show it: take the same code, compile it several different ways with different compilers, and see how they play out. You will find code llvm is good at and code gcc is good at, and one program may run at X instructions per second on average while another runs at Y instructions per second on average on the same hardware. What this benchmark has shown is that the same source code compiled with different compilers and different settings produces dramatically different performance results, which implies that the generated machine code is not the same; there is no one to one relationship between a high level program and the machine code generated. We also got to play with the caches and mmu in a simplified fashion.

Based on new discoveries, the start address for the whole thing has been changed to 0x8000. That messes with the cache lines, so current results will differ somewhat from the numbers above (for the reasons described above).