No effect of PLD on Cortex-A9 (ARM)

I am using the following program to check the effect of PLD on performance. However, I couldn't measure any difference in performance between running the C code below with and without PLD. Is there anything I am missing, or any compiler option I need to add?
#include <arm_neon.h>
int arra[6144] = {0}; /* 24 KB (6144 * 4 bytes) */
int arrb[6144] = {0}; /* 24 KB */
int arrc[6144] = {0}; /* 24 KB */
int arrd[2048] = {0}; /* 8 KB (2048 * 4 bytes) */
int arre[2048] = {0}; /* 8 KB */
int arrf[2048] = {0}; /* 8 KB */
int arrg[2048] = {0}; /* 8 KB */
int arrh[2048] = {0}; /* 8 KB */
int arri[2048] = {0}; /* 8 KB */
int arrj[2048] = {0}; /* 8 KB */
int arrk[2048] = {0}; /* 8 KB */
int arrl[2048] = {0}; /* 8 KB */
int main()
{
int csize;
int i,z = 3;
int loop_i;
int32x4_t viarrd,viarre,viarrf;
int32x4_t viarrg,viarrh,viarri;
int32x4_t viarrj,viarrk,viarrl;
asm("LDR r1, =arrd");
asm("LDR r2, =arre");
asm("LDR r3, =arrf");
asm("LDR r4, =arrg");
asm("LDR r5, =arrh");
asm ("PLD [r1]");
asm ("PLD [r2]");
asm ("PLD [r3]");
asm ("PLD [r4]");
asm ("PLD [r5]");
for(loop_i=0;loop_i<100;loop_i++)
{
for(i=0;i<2048;i++)
{
arrd[i] = 5;
arre[i] = 5;
arrf[i] = 5;
arrg[i] = 5;
arrh[i] = 5;
arri[i] = 5;
arrj[i] = 5;
arrk[i] = 5;
arrl[i] = 5;
}
for(i=0;i<2048;i+=4)
{
viarrf = vld1q_s32(&arrf[i]);
viarre = vld1q_s32(&arre[i]);
viarrd = vmulq_s32(viarrf,viarre);
vst1q_s32(&arrd[i],viarrd);
}
for(i=0;i<2048;i+=4)
{
viarrg = vld1q_s32(&arrg[i]);
viarrh = vld1q_s32(&arrh[i]);
viarri = vmulq_s32(viarrg,viarrh);
vst1q_s32(&arri[i],viarri);
}
for(i=0;i<2048;i+=4)
{
viarrj = vld1q_s32(&arrj[i]);
viarrk = vld1q_s32(&arrk[i]);
viarrl = vmulq_s32(viarrj,viarrk);
vst1q_s32(&arrl[i],viarrl);
}
for(i=0;i<2048;i+=4)
{
viarrd = vld1q_s32(&arrd[i]);
viarrf = vld1q_s32(&arrf[i]);
viarre = vmulq_s32(viarrd,viarrf);
vst1q_s32(&arre[i],viarre);
}
for(i=0;i<2048;i+=4)
{
viarrg = vld1q_s32(&arrg[i]);
viarri = vld1q_s32(&arri[i]);
viarrh = vmulq_s32(viarrg,viarri);
vst1q_s32(&arrh[i],viarrh);
}
}
return 0;
}

Your description says Cortex-A9 but the tag says Cortex-A8 - which is it? On Cortex-A8 pld only loads to L2 and your data set already fits in L2, so if it's already there it won't benefit from preloading.
That said, your code wouldn't accomplish much on either Cortex-A8 or A9, because a single pld only loads a single cache line (32-64 bytes); it doesn't tell the CPU to keep prefetching the lines after that. An effective way to use the pld instruction is to issue it inside your loop so that it points multiple cache lines ahead of where you're currently loading from. Ideally you'd structure your loop so that one cache line's worth of loads is done per pld, in order to avoid redundant ones. You'd also align your data sets to the cache line width.
However, Cortex-A9 has an automatic prefetcher that detects strides. If you are on Cortex-A9 and this feature is turned on, the pld might not help much or at all, and will instead just waste time going through the pipeline.
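As a rough illustration of the "prefetch a few lines ahead, once per cache line" pattern described above, here is a minimal sketch in plain C. It uses the GCC/Clang builtin __builtin_prefetch, which on ARM targets is typically lowered to a pld instruction. The function name mul_with_prefetch, the 32-byte line size, and the distance of 4 lines are assumptions chosen for illustration, not measured values; the right distance has to be found by benchmarking on the actual core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only; tune them for the real hardware. */
#define CACHE_LINE_BYTES     32u  /* assumption: 32-byte cache lines */
#define PREFETCH_LINES_AHEAD 4u   /* assumption: prefetch distance in lines */
#define INTS_PER_LINE (CACHE_LINE_BYTES / sizeof(int32_t))

/* Hypothetical helper: dst[i] = a[i] * b[i] for i in [0, n).
 * n must be a multiple of INTS_PER_LINE so the inner loop covers whole lines. */
void mul_with_prefetch(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    const size_t ahead = PREFETCH_LINES_AHEAD * INTS_PER_LINE;

    assert(n % INTS_PER_LINE == 0);

    for (size_t i = 0; i < n; i += INTS_PER_LINE) {
        /* One prefetch per input stream per cache line, pointing several
         * lines ahead of the data the inner loop is about to read. */
        if (i + ahead < n) {
            __builtin_prefetch(&a[i + ahead], 0, 3);
            __builtin_prefetch(&b[i + ahead], 0, 3);
        }
        /* Consume exactly one cache line's worth of each input. */
        for (size_t j = 0; j < INTS_PER_LINE; j++)
            dst[i + j] = a[i + j] * b[i + j];
    }
}
```

Whether this helps at all still depends on the points above: if the data already sits in cache, or the hardware prefetcher is already tracking the stride, the extra instructions may cost more than they save.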

Related

Difficulty understanding how to translate between a virtual address and a physical address

I had a lab today that I could not complete because I cannot understand the basic process of going from a virtual address to a physical address. My understanding so far is that virtual memory is described by a page table, which consists of page entries (each containing an address and an indication of whether the page is present), and that to get the physical address you need to offset by 12 bits and use the 4 higher-order bits as an index(?). That's really where I get confused.
This is my code currently. It's not cohesive, but it generally shows what I understand so far. I would really appreciate any help with my understanding of this process. It seems strangely straightforward and simple, and I don't know what isn't clicking for me.
`
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
struct Map{
unsigned int frame_Num : 3;
unsigned int valid : 1;
};
struct Map PageTable[16];
void createPageTable(){
PageTable[0].frame_Num = 0x2;
PageTable[0].valid = 1;
PageTable[1].frame_Num = 0x1;
PageTable[1].valid = 1;
PageTable[2].frame_Num = 0x6;
PageTable[2].valid = 1;
PageTable[3].frame_Num = 0x0;
PageTable[3].valid = 0;
PageTable[4].frame_Num = 0x4;
PageTable[4].valid = 1;
PageTable[5].frame_Num = 0x3;
PageTable[5].valid = 1;
PageTable[6].frame_Num = 0x0;
PageTable[6].valid = 0;
PageTable[7].frame_Num = 0x0;
PageTable[7].valid = 0;
PageTable[8].frame_Num = 0x0;
PageTable[8].valid = 0;
PageTable[9].frame_Num = 0x5;
PageTable[9].valid = 1;
PageTable[10].frame_Num = 0x0;
PageTable[10].valid = 0;
PageTable[11].frame_Num = 0x7;
PageTable[11].valid = 1;
PageTable[12].frame_Num = 0x0;
PageTable[12].valid = 0;
PageTable[13].frame_Num = 0x0;
PageTable[13].valid = 0;
PageTable[14].frame_Num = 0x0;
PageTable[14].valid = 0;
PageTable[15].frame_Num = 0x0;
PageTable[15].valid = 0;
}
int translateToPhysicalAddress(int virtualAddress){
//int offset = 12;
//int physicalAddress = offset<<virtualAddress;
//printf("0x%x\n", physicalAddress);
}
int main(){
createPageTable();
//translateToPhysicalAddress(5);
return 0;
}
`
People keep explaining it to me with the same wording I described above. I've tried googling, going over my professor's slides, etc. I was out for a week, which does not help, but I just don't understand the basic process.
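For what it's worth, the process described in the question can be sketched as follows. This is only a sketch under assumptions the lab handout may not share: a 16-bit virtual address whose top 4 bits select one of the 16 page table entries and whose low 12 bits are the offset within a 4 KB page. The function name translate and the -1 "page fault" return value are made up for illustration. The key point is that the offset is copied unchanged, and only the page number is replaced by the frame number from the table.

```c
#include <assert.h>
#include <stddef.h>

#define OFFSET_BITS 12u                          /* assumption: 4 KB pages */
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)   /* 0xFFF */
#define NUM_PAGES   16u                          /* 4-bit page number */

struct Map {
    unsigned int frame_Num : 3;
    unsigned int valid : 1;
};

/* Hypothetical helper: returns the physical address, or -1 if the page
 * number is out of range or the entry is not valid (a page fault). */
long translate(const struct Map *table, unsigned int virtualAddress)
{
    assert(table != NULL);

    unsigned int page   = virtualAddress >> OFFSET_BITS;  /* index into table */
    unsigned int offset = virtualAddress & OFFSET_MASK;   /* kept as-is */

    if (page >= NUM_PAGES || !table[page].valid)
        return -1;

    /* Physical address = frame number in the high bits, same offset below. */
    return (long)(((unsigned int)table[page].frame_Num << OFFSET_BITS) | offset);
}
```

With the table from the question, virtual address 0x2ABC splits into page 2 and offset 0xABC; page 2 maps to frame 6, so the physical address is 0x6ABC. Virtual address 0x3000 selects page 3, which is not valid, so the lookup fails.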

What am I doing wrong with this for loop in C?

I'm trying to implement the below code in a for loop, to avoid needing to have every XOR term written out separately.
unsigned int check_0 = P0^en[2]^en[4]^en[6]^en[8]^en[10]^en[12]^en[14]^en[16]^en[18]^en[20]^en[22]^en[24]^en[26]^en[28]^en[30];
This is what I've written, but it doesn't work. Can someone please let me know what I'm doing wrong?
unsigned int check_0z = P0;
unsigned int check_0 = 0;
int i = 2;
for (i = 2; i > 30; i += 2){
check_0 = check_0z^en[i];
check_0z = check_0;
}
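One likely culprit is the loop condition: i starts at 2, so i > 30 is false on the very first test, the body never runs, and check_0 keeps its initial value of 0. A minimal corrected sketch follows; it assumes en holds unsigned int values with at least 31 elements, and the function name parity_check_even is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: XOR p0 with en[2], en[4], ..., en[30].
 * Assumes en points to at least 31 elements. */
unsigned int parity_check_even(unsigned int p0, const unsigned int *en)
{
    assert(en != NULL);

    unsigned int check = p0;
    /* Continue while i is still in range: "<= 30", not "> 30". */
    for (int i = 2; i <= 30; i += 2)
        check ^= en[i];
    return check;
}
```

Accumulating into a single variable with ^= also removes the need for the separate check_0z copy.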

Why does this array not get fouled up?

I'm studying code to learn about Extended USB Controls and I came across the bit of code shown below. The function reverses an array's order. It's pretty straightforward, except for one thing: why doesn't the code corrupt the array? Using the same variable as both source and destination should corrupt it, shouldn't it?
/*
* Convert an array of bytes from big endian to little endian and vice versa by inverting it
*/
static
uint8_t *raw_inv(uint8_t *data, int size) {
int ai = 0;
int bi = size - 1;
uint8_t a = 0;
uint8_t b = 0;
while (ai < bi) {
a = data[ai];
b = data[bi];
data[ai] = b;
data[bi] = a;
ai++;
bi--;
}
return data;
}
Ah: It's the 'static' declaration, isn't it?
It uses a and b as temporaries to hold the values it's exchanging. Only one temporary is needed -- this could be rewritten as:
while (ai < bi) {
a = data[ai];
data[ai] = data[bi];
data[bi] = a;
ai++;
bi--;
}
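To make the single-temporary version concrete, here is a self-contained sketch; the name raw_inv_one_temp is made up for illustration. The array is not corrupted because the left element is saved before it is overwritten, and because the two indices move toward each other and stop when they meet, so each pair is swapped exactly once. The static keyword on the original function only gives it internal linkage and plays no part in this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of raw_inv: reverse data[0..size-1] in place
 * using one temporary instead of two. */
uint8_t *raw_inv_one_temp(uint8_t *data, int size)
{
    assert(data != NULL || size <= 0);

    int ai = 0;
    int bi = size - 1;
    while (ai < bi) {
        uint8_t tmp = data[ai];  /* save the left value before overwriting it */
        data[ai] = data[bi];
        data[bi] = tmp;
        ai++;
        bi--;
    }
    return data;
}
```

For an odd-length array the middle element is never touched, since the loop ends as soon as ai and bi are equal.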

U-Boot: Unexpected problems porting code

I want to extend the U-Boot SPL code with some fuzzy extractor logic by adding code to {u-boot_sources}/arch/arm/cpu/armv7/omap-common/hwinit-common.c. U-Boot will be used on a PandaBoard ES (OMAP4460 SoC).
I first implemented the code successfully on my x86 PC and am now porting it to the ARM-based PandaBoard. The complete code can be found here (as a side note, the "main" function is s_init()):
http://pastebin.com/iaz13Yn9
However, I am experiencing dozens of unexpected effects, which result in execution stopping partway through the code, U-Boot stopping after reading u-boot.img, or no output being sent (and thus no boot) at all.
For example, I want to call two functions (computeSyndrome, decodeErrors) inside a for-loop, which is part of another function golayDecode.
For my first problem, please ignore the code below the multiline comment starting with /* >>>> These lines of code below totally break u-boot. Also, only the function computeSyndrome, in conjunction with the calling function golayDecode, is important.
The issue: If I comment out both functions computeSyndrome and decodeErrors, everything works fine and the OS (Android) boots. However, if computeSyndrome is not commented out and thus gets executed, U-Boot hangs after displaying reading u-boot.img.
The funny thing about it: even if I replace computeSyndrome with a bogus function which does nothing but iterate over some values or print output, U-Boot hangs as well.
Furthermore, if I remove the multiline comment further below to also include the remaining code, U-Boot doesn't display any character. (1*)
I am a beginner at microprocessor programming, and I cannot find a possible error in these 12 lines of the computeSyndrome function, or explain the general behaviour of U-Boot at all. (2*)
Does anyone have a clue what I am missing?
Thanks,
P.
1* I am using minicom to display the output of U-Boot, which I receive over a serial-to-USB converter.
2* I am using the following compiler flags to make sure there are no errors at compile time: -Wall -Wstrict-prototypes -Wdisabled-optimization -W -pedantic
void golayDecode(volatile int x[12], volatile int y[12], volatile unsigned int golayEncodedSecret[30], volatile unsigned int s, volatile unsigned char repetitionDecodedSecretBits[360]){
printf("\n[I] - Performing Golay decoding\r\n");
volatile unsigned char secret[22] = {0};
volatile unsigned char currentByte = 0, tmpByte = 0;
volatile unsigned int golayDecodedSecret[30] ={0};
volatile int twelveBitCounter = 0;//, j = 0, k = 0, q = 0, aux = 0, found = 0, bitCounter = 0, i_2 = 7, currentSecretEncByte = 0x00;
volatile int c_hat[2] = {0}, e[2] = {0};
e[0] = s;
e[1] = 0;
for(twelveBitCounter = 0; twelveBitCounter < 30; twelveBitCounter+=2){
printf("Computing syndrome and decoding errors for bytes %03x & %03x\n", golayEncodedSecret[twelveBitCounter], golayEncodedSecret[twelveBitCounter+1]);
computeSyndrome(golayEncodedSecret[twelveBitCounter], golayEncodedSecret[twelveBitCounter+1], x, y, s);
decodeErrors(golayEncodedSecret[twelveBitCounter], golayEncodedSecret[twelveBitCounter+1], x, y, s);
}
printf("\n[D] - Reconstructing secret bytes\r\n");
/* >>>> These lines of code below totally break u-boot
for(i = 0; i < 30; i+=2){
currentSecretEncByte = golayDecodedSecret[i];
volatile int j = 11;
// Access each source bit
for(; 0<=j; j--){
volatile int currentSourceBit = (currentSecretEncByte >> j) & 0x01;
repetitionDecodedSecretBits[bitCounter] = currentSourceBit;
bitCounter++;
}
}
k = 0;
for(i = 0; i<176; i++){
tmpByte = repetitionDecodedSecretBits[i] << i_2;
currentByte = currentByte | tmpByte;
i_2--;
if(i_2==0){ // We collected 8 bits and created a byte
secret[k] = currentByte;
i_2 = 7;
tmpByte = 0x00;
currentByte = 0x00;
k++;
}
}
SHA256_CTX ctx;
unsigned char hash[32];
printf("\n[I] - Generating secret key K\n");
sha256_init(&ctx);
sha256_update(&ctx,secret,strlen((const char*)secret));
sha256_final(&ctx,hash);
printf("\n[I] - This is our secret key K\n\t==================================\n\t");
print_hash(hash);
printf("\t==================================\n");
*/
}
/* Function for syndrome computation */
void computeSyndrome(int r0, int r1, volatile int x[12], volatile int y[12], volatile unsigned int s){
unsigned int syndromeBitCounter, syndromeMatrixCounter, syndromeAux;
s = 0;
for(syndromeMatrixCounter=0; syndromeMatrixCounter<12; syndromeMatrixCounter++){
syndromeAux = 0;
for(syndromeBitCounter=0; syndromeBitCounter<12; syndromeBitCounter++){
syndromeAux = syndromeAux^((x[syndromeMatrixCounter]&r0)>>syndromeBitCounter &0x01);
}
for(syndromeBitCounter=0; syndromeBitCounter<12; syndromeBitCounter++){
syndromeAux = syndromeAux^((y[syndromeMatrixCounter]&r1)>>syndromeBitCounter &0x01);
}
s = (s<<1)^syndromeAux;
}
}
/* Function to recover original byte */
void decodeErrors(int r0, int r1, volatile int x[12], volatile int y[12], volatile unsigned int s){
//printf("\n[D] - Starting to decode errors for %3x | %3x\n", r0, r1);
volatile unsigned int c_hat[2] = {0xaa}, e[2] = {0xaa};
volatile unsigned int q;
unsigned int i, j, aux, found;
//printf("Step 2\n");
if(weight(s)<=3){
e[0] = s;
e[1] = 0;
}else{
/******* STEP 3 */
//printf("Step 3\n");
i = 0;
found = 0;
do{
if (weight(s^y[i]) <=2){
e[0] = s^y[i];
e[1] = x[i];
found = 1;
printf("\ntest 2\n");
}
i++;
}while ((i<12) && (!found));
if (( i==12 ) && (!found)){
/******* STEP 4 */
//printf("Step 4\n");
q = 0;
for (j=0; j<12; j++){
aux = 0;
for (i=0; i<12; i++)
aux = aux ^ ( (y[j]&s)>>i & 0x01 );
q = (q<<1) ^ aux;
}
/******* STEP 5 */
//printf("Step 5\n");
if (weight(q) <=3){
e[0] = 0;
e[1] = q;
}else{
/******* STEP 6 */
//printf("Step 6\n");
i = 0;
found = 0;
do{
if (weight(q^y[i]) <=2){
e[0] = x[i];
e[1] = q^y[i];
found = 1;
}
i++;
}while((i<12) && (!found));
if ((i==12) && (!found)){
/******* STEP 7 */
printf("\n[E] - uncorrectable error pattern! (%3x | %3x)\n", r0, r1);
/* You can raise a flag here, or output the vector as is */
//exit(1);
}
}
}
}
c_hat[0] = r0^e[0];
c_hat[1] = r1^e[1];
//printf("\t\tEstimated codeword = %x%x\n", c_hat[0], c_hat[1]);
}
Indeed, the code was a little too complex to be executed at this point in the boot process. At this stage there is no real C runtime (CRT), and I only have a minimal stack.
Thus, I moved the code to board_init_f(), which is still part of the SPL. This gave more stable results, and my algorithm now works as expected.

Difference between two C snippets

I am at a loss to explain why these two C snippets do not behave the same way. I am trying to serialize two structs, eh and ah, as a single buffer of bytes (uint8_t). The first code block works, the second does not. I can't figure out the difference. If anyone can explain it to me it would be greatly appreciated.
Block 1:
uint8_t arp_reply_buf[sizeof(eh) + sizeof(ah)];
uint8_t *eh_ptr = (uint8_t*)&eh;
for (int i = 0; i < sizeof(eh); i++)
{
arp_reply_buf[i] = eh_ptr[i];
}
uint8_t *ah_ptr = (uint8_t*)&ah;
int index = 0;
for (int i = sizeof(eh); i < (sizeof(eh) + sizeof(ah)); i++)
{
arp_reply_buf[i] = ah_ptr[index++];
}
Block 2:
uint8_t arp_reply_buf[sizeof(eh) + sizeof(ah)];
arp_reply_buf[0] = *(uint8_t *)&eh;
arp_reply_buf[sizeof(eh)] = *(uint8_t *)&ah;
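In the second example you only set two elements of the buffer. Dereferencing a uint8_t * yields a single byte, so each assignment copies just the first byte of the struct, and every other element of arp_reply_buf is left uninitialized: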
arp_reply_buf[0]:
arp_reply_buf[0] = *(uint8_t *)&eh;
arp_reply_buf[sizeof(eh)]:
arp_reply_buf[sizeof(eh)] = *(uint8_t *)&ah;
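If the goal is a compact equivalent of Block 1, memcpy copies every byte of each struct in one call. The sketch below assumes stand-in struct layouts, since the question does not show the real definitions of eh and ah; the type names eth_hdr and arp_hdr and the function name serialize are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the types of eh and ah in the question. */
struct eth_hdr { uint8_t dst[6]; uint8_t src[6]; uint16_t type; };
struct arp_hdr { uint16_t htype; uint16_t ptype; uint8_t hlen; uint8_t plen; uint16_t oper; };

/* Copy eh followed by ah into buf. Returns the number of bytes written,
 * or 0 if buf is too small to hold both structs. */
size_t serialize(uint8_t *buf, size_t buf_len,
                 const struct eth_hdr *eh, const struct arp_hdr *ah)
{
    assert(buf != NULL && eh != NULL && ah != NULL);

    if (buf_len < sizeof *eh + sizeof *ah)
        return 0;

    memcpy(buf, eh, sizeof *eh);               /* bytes 0 .. sizeof(eh)-1 */
    memcpy(buf + sizeof *eh, ah, sizeof *ah);  /* the rest of the buffer */
    return sizeof *eh + sizeof *ah;
}
```

Like the loops in Block 1, this copies the in-memory representation, including any padding the compiler inserts and the host byte order, so it is only a wire format if the struct layouts already match the protocol.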
