Is there a bug in the nestest rom? - c

I am currently making an emulator for the NES (like many others), and while testing my emulation against the nestest rom by Kevtris (found here: https://wiki.nesdev.com/w/index.php/Emulator_tests), I've encountered a weird bug at instruction 877 of the nestest log (this one: http://www.qmtpro.com/~nes/misc/nestest.log, at line CE42).
The instruction is a PLA, which pulls the accumulator from the stack, with the stack pointer at $7E at the beginning. (I'm using a 1-byte value for the stack pointer, since the stack goes from 0x0100 to 0x01FF, so when I write $7E talking about the stack, it's 0x017E, not zero page ;) )
So, when PLA is executed at line 877, the stack pointer moves to $7F, and the first byte is retrieved and stored in the accumulator.
The problem is here: in the nestest log, this byte is 0x39; then, on instruction 878, which is also a PLA, the byte retrieved at $80 (stack pointer incremented by 1) is 0xCE, so the low byte and high byte are inverted compared to what I get.
The values written on the stack (0xCE39) have their origin in the JSR instruction at line CE37, and here is my implementation of the JSR opcode:
uint8_t JSR(){
    get();                          // fetch the data of the opcode, like an absolute address operand or a value
    uint16_t newPC = PC - 1;        // the program counter is decremented by 1
    uint8_t low = newPC & 0x00FF;
    uint8_t high = (newPC & 0xFF00) >> 8;
    write_to_stack(SP--, low);      // we store the PC, highest address in stack takes the low byte
    write_to_stack(SP--, high);     // lower address on the stack takes the high byte
    PC = new_address;               // the address we read that points to the subroutine
    return 0;
}
Here are the logs from nestest :
CE37 20 3D CE JSR $CE3D A:69 X:80 Y:01 P:A5 SP:80 PPU:233, 17 CYC:2017
CE3D BA TSX A:69 X:80 Y:01 P:A5 SP:7E PPU:251, 17 CYC:2023
CE3E E0 7E CPX #$7E A:69 X:7E Y:01 P:25 SP:7E PPU:257, 17 CYC:2025
CE40 D0 19 BNE $CE5B A:69 X:7E Y:01 P:27 SP:7E PPU:263, 17 CYC:2027
CE42 68 PLA A:69 X:7E Y:01 P:27 SP:7E PPU:269, 17 CYC:2029
CE43 68 PLA A:39 X:7E Y:01 P:25 SP:7F PPU:281, 17 CYC:2033
CE44 BA TSX A:CE X:7E Y:01 P:A5 SP:80 PPU:293, 17 CYC:2037
With my code, I have 0xCE at $7F and 0x39 at $80.
So with my code the first PLA stores 0xCE in the accumulator and the second PLA stores 0x39, which is the inverse of what the nestest log shows.
I don't know if my JSR code is wrong; it has succeeded until now.
I tried swapping the low and high bytes of the program counter when they are stored on the stack, but, as expected, the instructions become invalid at the first JSR of the rom.
So, what do you guys think I'm missing?

The mistake is not in nestest; the mistake is in your implementation of JSR and RTS!
You need to push the high byte first, and then the low byte. (This is so that the low byte can be retrieved first, and incremented while the high byte is being fetched)
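A minimal sketch of the corrected push order, assuming a hypothetical Cpu struct that only models the stack page; the names Cpu, push, pull, jsr and rts are made up for illustration and are not from the question's emulator. It follows the question's convention that PC has already moved past the 3-byte JSR when the return address is computed.

```c
#include <stdint.h>

/* Hypothetical CPU state: only what JSR/RTS need. */
typedef struct {
    uint16_t pc;
    uint8_t  sp;               /* offset into page $01 */
    uint8_t  stack_page[0x100]; /* models $0100-$01FF */
} Cpu;

void push(Cpu *c, uint8_t value)
{
    c->stack_page[c->sp] = value; /* write first, then decrement */
    c->sp--;
}

uint8_t pull(Cpu *c)
{
    c->sp++;                      /* increment first, then read */
    return c->stack_page[c->sp];
}

/* c->pc is assumed to point just past the 3-byte JSR instruction. */
void jsr(Cpu *c, uint16_t target)
{
    uint16_t ret = (uint16_t)(c->pc - 1); /* address of the last JSR byte */
    push(c, (uint8_t)(ret >> 8));         /* high byte first */
    push(c, (uint8_t)(ret & 0xFF));       /* then low byte */
    c->pc = target;
}

void rts(Cpu *c)
{
    uint16_t lo = pull(c);                /* low byte comes back first */
    uint16_t hi = pull(c);
    c->pc = (uint16_t)(((hi << 8) | lo) + 1);
}
```

With PC = $CE3A and SP = $80 before the call, this leaves $CE at $0180 and $39 at $017F, so two PLAs return $39 and then $CE, as in the nestest log.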

Related

Unpack Dec from Hex - via bit offsets

I have a block of hex data which includes the settings of a sensor. I will include the beginning snippet of the hex (LSB first):
F501517C 8150D4DE 04010200 70010101
05F32A04 F4467000 00000AFF 0502D402
This comes straight from the documentation to decode this hex to dec:
3.1. Full identifier and settings record (0x7C)
Offset Length (bytes) Field description
0x00 6 Full identifier
0x06 40 Settings
3.1.1 Full identifier
Offset Field description
0x00 Product Type
0x01 Device Type
0x02 Software Major Version
0x03 Software Minor Version
0x04 Hardware Major Version
0x05 Hardware Minor Version
3.1.2 Settings
Offset  Length(bit)  Offset(bit)  Default value  Min  Max   Field Description
0x00    8            0            0              0    255   Country number
0x01    8            0            0              0    255   District number
0x02    16           0            0              0    9999  Sensor number
...
0x27
This being the only information I have to decode this. The offset column must be the trick to understanding this.
What are the hex values offset from?
I see 7C in the first hex string.
The Settings section goes to 0x27 = 39 in decimal which is stated in the 3.1 section as the length being 40.
The offsets given in the documentation are byte offsets from the beginning of the data.
Assuming that your given dump is little endian 32-bit, let's have a look:
Value in dump - separated in bytes - bytes in memory
F501517C - F5 01 51 7C - 7C 51 01 F5
8150D4DE - 81 50 D4 DE - DE D4 50 81
04010200 - 04 01 02 00 - 00 02 01 04
Now let's assign them to the fields. The next list has both records concatenated.
Byte   Offset  Field description
7C     0x00    Product Type
51     0x01    Device Type
01     0x02    Software Major Version
F5     0x03    Software Minor Version
DE     0x04    Hardware Major Version
D4     0x05    Hardware Minor Version
Byte   Offset  Length(bit)  Offset(bit)  Default value  Min  Max   Field Description
50     0x00    8            0            0              0    255   Country number
81     0x01    8            0            0              0    255   District number
00,02  0x02    16           0            0              0    9999  Sensor number
Whether the result makes sense, is your decision:
Product Type = 0x7C
Device Type = 0x51 = 81 decimal (could also be ASCII 'Q')
Software Major.Minor Version = 0x01.0xF5 = 1.245 decimal
Hardware Major.Minor Version = 0xDE.0xD4 = 222.212
Country number = 0x50 = 80 decimal (could also be ASCII 'P')
District number = 0x81 = 129 decimal (perhaps 0x01 = 1 with bit 7 set?)
Sensor number = 0x0002 = 2 decimal (big endian assumed)
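The decoding above can be sketched in C. This is only an illustration under the same assumptions as the answer (the dump is 32-bit little-endian words, and the 16-bit sensor number is big-endian); the struct SensorRecord and the function names are hypothetical and not taken from the sensor documentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand 32-bit dump words into the byte order they have in memory,
   assuming the dump shows little-endian words. */
void words_to_bytes_le(const uint32_t *words, size_t count, uint8_t *out)
{
    for (size_t i = 0; i < count; i++) {
        out[4 * i + 0] = (uint8_t)(words[i]);
        out[4 * i + 1] = (uint8_t)(words[i] >> 8);
        out[4 * i + 2] = (uint8_t)(words[i] >> 16);
        out[4 * i + 3] = (uint8_t)(words[i] >> 24);
    }
}

/* Hypothetical container for the first few documented fields. */
typedef struct {
    uint8_t  product_type, device_type;
    uint8_t  sw_major, sw_minor;
    uint8_t  hw_major, hw_minor;
    uint8_t  country, district;
    uint16_t sensor_number;
} SensorRecord;

/* b must hold at least 10 bytes: 6 identifier bytes, then the settings. */
SensorRecord parse_record(const uint8_t *b)
{
    SensorRecord r;
    r.product_type  = b[0];
    r.device_type   = b[1];
    r.sw_major      = b[2];
    r.sw_minor      = b[3];
    r.hw_major      = b[4];
    r.hw_minor      = b[5];
    r.country       = b[6];   /* settings offset 0x00 */
    r.district      = b[7];   /* settings offset 0x01 */
    r.sensor_number = (uint16_t)((b[8] << 8) | b[9]); /* big endian assumed */
    return r;
}
```

Feeding it the first three words of the dump (0xF501517C, 0x8150D4DE, 0x04010200) gives the same field values as listed above. If the sensor number turns out to be little-endian, only the last assignment changes.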

How do AVR Assembly BRNE delay loops work?

An online delay loop generator gives me this delay loop of runtime of 0.5s for a chip running at 16MHz.
The questions on my mind are:
Do the branches keep branching if the register becomes negative?
How exactly does one calculate the values that are loaded in the beginning?
ldi r18, 41
ldi r19, 150
ldi r20, 128
L1: dec r20
brne L1
dec r19
brne L1
dec r18
brne L1
To answer your questions exactly:
1: The DEC instruction doesn't know about 'signed' numbers, it just decrements an 8-bit register. The miracle of twos complement arithmetic makes this work at the wraparound (0x00 -> 0xFF, is the same bit pattern as 0 -> -1). The DEC instruction also sets the Z flag in the status register, which BRNE uses to determine if branching should happen.
2: You can see from the AVR manual that DEC is a single cycle instruction. BRNE is also a single cycle when not branching, and 2 cycles when branching. Therefore, to compute the time of your loop, you need to count the number of times each path will be taken.
Consider a single DEC/BRNE loop:
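ldi r18, 0
L1: dec r18
brne L1
This loop will execute exactly 256 times (LDI only works on r16..r31, hence r18 here): that is 256 cycles of DEC, plus 511 cycles of BRNE (2 cycles for each of the 255 taken branches and 1 cycle for the final fall-through), for a total of 767 cycles. At 16MHz, that's about 48us.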
Wrapping that in an outer delay loop:
ldi r17, 10
ldi r18, 0
L1: dec r18
brne L1
dec r17
brne L1
You can see that the outer loop counter will decrement every time the inner loop counter hits 0. Thus in our example the outer loop DEC/BRNE will happen 10 times (3 cycles each, 2 on the last pass), and the inner loop will run 10 x 256 times, so the total time for this loop is roughly 10 x 48us, about 481us (7699 cycles, not counting the two LDI). Similarly for 3 nested loops.
From here, it's trivial to figure out how many times each loop should execute to achieve the desired delay. It's the largest number of iterations the outer loop can do less than the desired time, then taking that time out, do the same for the next nested loop, and so on until the inner most loop fills up the tiny amount left.
How exactly does one calculate the values that are loaded in the beginning?
Calculate total amount of cycles => 0.5s * 16000000 = 8000000
Know the total cycles of r20 and r19 loops (from zero to zero), AVR registers are 8 bit, so a full loop is 256 times (dec 0 = 255). dec is 1 cycle. brne is 2 cycles when condition (branch) happens, 1 cycle when not.
So the most inner loop:
L1: dec r20
brne L1
Is from zero to zero (r20=0): 255 * (1+2) + 1 * (1+1) = 767 cycles (255 times the branch is taken, 1 time it goes through).
The second wrapping loop working with r19 is then: 255 * (767+1+2) + 1 * (767+1+1) = 197119 cycles
The single r18 loop when branch is taken is then 197119+1+2 = 197122 cycles. (197121 when branch is not taken = final exit of delay loop, I will avoid this -1 by a trick in next step).
Now this is almost enough to calculate initial r18, let's adjust the total cycles first by the O(1) code, that's three times ldi instruction, which takes 1 cycle: total2 = 8000000 - (1+1+1) + 1 = 7999998 ... wait, what is the last +1 there? That's fake additional cycle to delay, to make the final r18 loop pretend it costs same as non-final, i.e. 197122 cycles.
And that's it, the initial r18 must be enough to wait at least 7999998 cycles: r18 = (7999998 + 197122 - 1) div 197122 = 41. The " + 197122 - 1" part will make sure the abundant cycles fits constraint: 0 <= abundant_cycles < 197122 (remainder by 197122 division).
41 * 197122 = 8082002 ... this is too much, but now we can shave the extra cycles down by also setting r19 and r20 to particular values, to fine-tune the delay. So how much is to be shaved off? 8082002 - 7999998 = 82004 cycles.
The single r19 loop takes 770 cycles when branching and 769 when exiting, so again let's avoid the 769 by adjusting 82004 to only 82003 to be shaved off. 82003 div 770 = 106: 106 r19 loops can be skipped, r19 = 256 - 106 = 150. Now this will shave 81620 cycles, so 82003 - 81620 = 383 cycles more to be shaved off.
The single r20 loop takes 3 cycles when branching and 2 when exiting. Again I will take into account the exiting loop being only 2 cycles -> 383 => 382 to shave off. And 382 div 3 = 127, remainder 1. r20 = 256 - 127 = 129 and do one less to shave additional 3 cycles (to cover that remainder) = 128. Then 2 cycles (3-1) wait is missing to make it a full 8mil.
So:
ldi r18, 41
ldi r19, 150
ldi r20, 128
L1: dec r20
brne L1
dec r19
brne L1
dec r18
brne L1
According to my calculations should wait exactly 8000000-2 cycles (if not interrupted by something else).
Let's try to verify:
Initial r20: 127*3 + 1*2 = 383 cycles
Initial r19: 1*(383+1+2) + 148*(767+1+2) + 1*(767+1+1) = 115115 cycles
(that's initial r20 incomplete cycle one time, then 149 times full time r20 cycle with the final one being -1 due to exiting brne)
The r18 total: 1*(115115+1+2) + 39*(197119+1+2) + 1*(197119+1+1) = 7999997 cycles.
And the three ldi are +3 cycles = 7999997+3 = 8000000.
And the missing 2 cycles are nowhere to be seen, so I made somewhere a mistake.
As you can see, the math behind is reasonably simple, but very mundane to do by hand, and prone to mistakes...
Ah, I think I know where I made the mistake. When I'm shaving off the abundant cycles, the termination loop is not involved (that's part of the actual delay process), so I shouldn't have adjusted the to_shave_off cycles by -1. Then, after skipping 106 r19 loops (r19 = 150), I would still have 82004 - 81620 = 384 cycles to shave off, and that's exactly 384/3 = 128 loops to shave off from r20, so r20 = 256-128 = 128. No remainder, no missing cycle, perfect 8mil.
If you have trouble to follow this reverse calculation, try it other way, imagine 2 bit registers (0..3 values only), and do on paper similar loop with r18=r19=r20=2, and count the cycles manually to see how it is evolving. .. i.e. 3x ldi = +3, dec r20,brne,dec r20,brne(skip) = +5 cycles, dec r19, brne = +3, ... etc.
Edit: and this was explained before by Jester in his links. And I'm too lazy to clean this up down to some simple formula to create your own online calculator.
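The hand count can be double-checked by simulating the loop instead of deriving a formula. This is a minimal sketch, assuming the cycle costs stated above (LDI 1, DEC 1, BRNE 2 when taken and 1 when not) and no interrupts; delay_cycles is a made-up name.

```c
#include <stdint.h>

/* Count the cycles of:
       ldi r18,a / ldi r19,b / ldi r20,c
   L1: dec r20 / brne L1 / dec r19 / brne L1 / dec r18 / brne L1 */
uint32_t delay_cycles(uint8_t r18, uint8_t r19, uint8_t r20)
{
    uint32_t cycles = 3;          /* three LDI, 1 cycle each */
    for (;;) {
        r20--;                    /* dec r20 (wraps 0 -> 255) */
        cycles += 1;
        if (r20 != 0) { cycles += 2; continue; } /* brne taken, back to L1 */
        cycles += 1;              /* brne not taken */

        r19--;                    /* dec r19 */
        cycles += 1;
        if (r19 != 0) { cycles += 2; continue; }
        cycles += 1;

        r18--;                    /* dec r18 */
        cycles += 1;
        if (r18 != 0) { cycles += 2; continue; }
        cycles += 1;
        break;                    /* fell through the last brne */
    }
    return cycles;
}
```

delay_cycles(41, 150, 128) comes out at 8000000, i.e. 0.5s at 16MHz, which agrees with the corrected calculation. Searching the three start values for a given target is then a matter of trying candidates against this function.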

MSP430 microcontroller - how to check addressing modes

I'm writing a simulation of an MSP430 microcontroller in C. I got stuck on the addressing modes (https://en.wikipedia.org/wiki/TI_MSP430#MSP430_CPU), especially:
Addressing modes using R0 (PC)
Addressing modes using R2 (SR) and R3 (CG), special-case decoding
I don't understand what 0(PC), 2(SR) and 3(CG) mean. What are they?
How to check these values?
so for the source if the as bits are 01 and the source register bits are a 0 which is the pc for reference then
ADDR Symbolic. Equivalent to x(PC). The operand is in memory at address PC+x.
if the ad bit is a 1 and the destination is a 0 then also
ADDR Symbolic. Equivalent to x(PC). The operand is in memory at address PC+x.
x is going to be another word that follows this instruction so the cpu will fetch the next word, add it to the pc and that is the source
if the as bits are 11 and the source is register 0, the source is an immediate value which is in the next word after the instruction.
if the as bits are 01 and the source is a 2 which happens to be the SR register for reference then the address is x the next word after the instruction (&ADDR)
if the ad bit is a 1 and the destination register is a 2 then it is also an &ADDR
if the as bits are 10 the source bits are a 2, then the source is the constant value 4 and we dont have to burn a word in flash after the instruction for that 4.
it doesnt make sense to have a destination be a constant 4 so that isnt a real combination.
repeat for the rest of the table.
you can have both of these addressing modes at the same time
mov #0x5A80,&0x0120
generates
c000: b2 40 80 5a mov #23168, &0x0120 ;#0x5a80
c004: 20 01
which is
0x40b2 0x5a80 0x0120
0100000010110010
0100 opcode mov
0000 source
1 ad
0 b/w
11 as
0010 destination
so we have an as of 11 with source of 0, the immediate #x, and an ad of 1 with a destination of 2, so the destination is &ADDR. this is an important experiment because when you have 2 x values, a three word instruction basically, which one goes with the source and which with the destination?
0x40b2 0x5a80 0x0120
so the immediate 0x5a80, which is the source, is the first x to follow the instruction, then the destination address 0x0120 comes after that.
if it were just an immediate and a register then
c006: 31 40 ff 03 mov #1023, r1 ;#0x03ff
0x4031 0x03FF
0100000000110001
0100 mov
0000 source
0 ad
0 b/w
11 as
0001 dest
as of 11 and source of 0 is #immediate the X is 0x03FF in this case the word that follows. the destination is ad of 0
Register direct. The operand is the contents of Rn
where destination in this case is r1
so the first group Rn, x(Rn), @Rn and @Rn+ are the normal cases, the ones below that that you are asking about are special cases, if you get a combination that fits into a special case then you do that otherwise you do the normal case like the mov immediate to r1 example above. the destination of r1 was a normal Rn case.
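The field split used in the two examples above can be written as a small C helper. This is only a sketch for two-operand (format I) instruction words; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Fields of a two-operand (format I) MSP430 instruction word. */
typedef struct {
    unsigned opcode; /* bits 15..12 */
    unsigned src;    /* bits 11..8, source register */
    unsigned ad;     /* bit 7, destination addressing mode */
    unsigned bw;     /* bit 6, byte/word */
    unsigned as;     /* bits 5..4, source addressing mode */
    unsigned dst;    /* bits 3..0, destination register */
} Msp430Format1;

Msp430Format1 msp430_decode_format1(uint16_t word)
{
    Msp430Format1 f;
    f.opcode = (word >> 12) & 0xFu;
    f.src    = (word >> 8)  & 0xFu;
    f.ad     = (word >> 7)  & 0x1u;
    f.bw     = (word >> 6)  & 0x1u;
    f.as     = (word >> 4)  & 0x3u;
    f.dst    =  word        & 0xFu;
    return f;
}
```

For 0x40b2 this gives opcode 4 (mov), src 0, ad 1, bw 0, as 3, dst 2, and for 0x4031 it gives src 0, ad 0, as 3, dst 1, matching the breakdowns above. A simulator would then check (src, as) and (dst, ad) against the special cases before falling back to the normal modes.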
As=01, Ad=1, R0 (ADDR): This is exactly the same as x(Rn), i.e., the operand is in memory at address R0+x.
This is used for data that is stored near the code that uses it, when the compiler does not know at which absolute address the code will be located, but it knows that the data is, e.g., twenty words behind the instruction.
As=11, R0 (#x): This is exactly the same as @R0+, and is used for instructions that need a word of data from the instruction stream. For example, this assembler instruction:
MOV #1234, R5
is actually encoded and implemented as:
MOV @PC+, R5
.dw 1234
After the CPU has read the MOV instruction word, PC points to the data word. When reading the first MOV operand, the CPU reads the data word, and increments PC again.
As=01, Ad=1, R2 (&ADDR): this is exactly the same as x(Rn), but the R2 register reads as zero, so what you end up with is the value of x.
Using the always-zero register allows absolute addresses to be encoded without needing a special addressing mode for this (just a special register).
constants -1/0/1/2/4/8: it would not make sense to use the SR and CG registers with most addressing modes, so these encodings are used to generate special values without a separate data word, to save space:
encoding:          what actually happens:
MOV @SR, R5        MOV #4, R5
MOV @SR+, R5       MOV #8, R5
MOV CG, R5         MOV #0, R5
MOV x(CG), R5      MOV #1, R5 (no word for x)
MOV @CG, R5        MOV #2, R5
MOV @CG+, R5       MOV #-1, R5
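In a simulator, the constant cases can be handled before the normal source decoding. A minimal sketch, assuming reg is the source register number and as the 2-bit As field; the function name is made up.

```c
#include <stdint.h>

/* Returns 1 and stores the constant if (reg, as) selects a generated
   constant, otherwise returns 0 and the normal addressing mode applies. */
int msp430_constant_generator(unsigned reg, unsigned as, uint16_t *value)
{
    static const uint16_t cg_table[4] = {0, 1, 2, 0xFFFF}; /* R3, As=00..11 */

    if (reg == 2 && as == 2) { *value = 4; return 1; }     /* R2 (SR), As=10 */
    if (reg == 2 && as == 3) { *value = 8; return 1; }     /* R2 (SR), As=11 */
    if (reg == 3) { *value = cg_table[as & 3u]; return 1; } /* R3 (CG) */
    return 0;
}
```

Note that R2 with As=00 is still the real status register and R2 with As=01 is the &ADDR case, so both return 0 here and are left to the regular decoding path.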

Detecting I-frame data in an MPEG-4 transport stream

I am testing a project. I need to corrupt the payload data (zeroing some bytes) of MPEG-4 TS packets by a percentage given by the user. I am doing it by reading the ".ts" file packet by packet (188 bytes). But the video becomes really muddy after processing. (By the way, I'm writing the program in C.)
So I decided to find the data/packets that belong to I-frames, leave them untouched, and scramble the other data by the given percentage. I could find the following
(in hex)
00 00 00 01 E0 start of video PES packet
..
..
00 00 01 B8 start of group of pictures header
..
..
00 00 01 00 the picture start code. This is 32 bits. The 10 bits immediately following this is called as the temporal reference. So temporal reference will include the byte following the picture start code and the first two bits of the second byte after the picture start code ie one byte(8 bits) + 2 bits. These we need to skip. Now the three bits present(3, 4 and 5th bits of the second byte from the picture start code) will indicate the Frame type ie I, B or P. So to get this simply logical AND & the second byte from the picture start code with 0x38 and right shift >> with 3.
For example the data is like that;
00 00 01 00 00 0F FF F8 00 00 01 B5........... and so on.
Here the first four bytes 00 00 01 00 is the picture start code.
The fifth byte and the first two bits of the sixth byte is the temporal reference.
So our concern is in the sixth byte --> 0F
((0F & 38)>>3)
Frame type = 1 ==> I Frame
Frame type 000 forbidden
Frame type 001 intra-coded (I) - iframe
Frame type 010 predictive-coded (P) - p frame
Frame type 011 bidirectionally-predictive-coded (B) - b frame
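The MPEG-2 extraction described above, written out as a small C helper. This is just a sketch of the bit operation from the text; the function name is made up, and p is assumed to point at the first byte of the picture start code with at least 6 readable bytes.

```c
#include <stdint.h>

/* Returns the 3-bit frame type (1 = I, 2 = P, 3 = B), or -1 if p does
   not point at a picture start code 00 00 01 00. */
int mpeg2_frame_type(const uint8_t *p)
{
    if (p[0] != 0x00 || p[1] != 0x00 || p[2] != 0x01 || p[3] != 0x00)
        return -1;
    /* p[4] and the top two bits of p[5] are the temporal reference;
       the next three bits of p[5] are the frame type. */
    return (p[5] & 0x38) >> 3;
}
```

For the example bytes 00 00 01 00 00 0F this returns 1, i.e. an I frame.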
But this is for MPEG-2. Is there a similar pattern that lets me recognize the frame type with bitwise operations in an MPEG-4 transport stream (the extension is ".ts")?
I also need to find out how many bytes or packets belong to that frame.
Thanks a lot for your help
I would parse the complete TS packet. So first determine what PID your video stream belongs to (by parsing the PAT and PMT). Then find keyframes by looking for the 'Random Access indicator' bit in the Adaptation Field.
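#include <stdbool.h>
#include <stdint.h>

/* pkt points to one 188-byte TS packet; video_pid comes from the PAT/PMT */
bool is_keyframe_start(const uint8_t *pkt, uint16_t video_pid)
{
    if (pkt[0] != 0x47)
        return false;                       /* not a sync byte */
    uint16_t pid = ((pkt[1] & 0x1F) << 8) | pkt[2];
    if (pid != video_pid)
        return false;                       /* not the video stream */
    if ((pkt[3] & 0x20) && (pkt[4] > 0))    /* have AF */
        return (pkt[5] & 0x40) != 0;        /* random access indicator set: found keyframe */
    return false;
}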
If you are using H.264 there should be specific byte stream for I and P frame ..
Like 0x0000000165 for I frame and 0x00000001XX for P frame ..
So just parse and look for continuous such byte stream in such a way you can identify I or P frame..
Again above byte stream is codec implementation dependent ..
For more information you can look into FFMPEG..
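To make the H.264 byte pattern concrete: after a 00 00 01 start code, the low 5 bits of the next byte are the NAL unit type, and type 5 is an IDR slice (0x65 & 0x1F == 5). The following is a rough sketch with a made-up function name; it assumes buf already holds contiguous elementary stream bytes, i.e. the TS and PES headers have been stripped.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the nal_unit_type of the first NAL unit found after a
   00 00 01 start code, or -1 if no start code is found. */
int h264_first_nal_type(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 3 < len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01)
            return buf[i + 3] & 0x1F; /* low 5 bits of the NAL header */
    }
    return -1;
}
```

Type 5 can be treated as a keyframe. Type 1 (a non-IDR slice, e.g. header byte 0x41) may still carry I, P or B slices, so telling those apart requires parsing the slice header, which this sketch does not do.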

Decoding BLX instruction on ARM/Thumb(Android)

I want to decode a blx instruction on arm, and I have found a good answer here:
Decoding BLX instruction on ARM/Thumb (IOS)
But in my case, I followed this tip step by step and got the wrong result. Can anyone tell me why?
This is my test:
.plt: 000083F0 sub_83F0 ...
...
.text:00008436 FF F7 DC EF BLX sub_83F0
I parse the machine code 'FF F7 DC EF' as follows:
F7 FF EF DC
11110 1 1111111111 11 1 0 1 1111101110 0
S imm10H J1 J2 imm10L
I1 = NOT(J1 EOR S) = 1
I2 = NOT(J2 EOR S) = 1
imm32 = SignExtend(S:I1:I2:imm10H:imm10L:00)
= SignExtend(1111111111111111110111000)
= SignExtend(0x1FFFFB8)
= ?
So the offset is 0xFFB8?
But 0x83F0-0X8436-4=0xFFB6
I need your help!!!
When the target of a BLX is 32-bit ARM code, the immediate value encoded in the BLX instruction is added to align(PC,4), not the raw value of PC.
PC during execution of the BLX instruction is 0x8436 + 4 == 0x843a due to the ARM pipeline
align(0x843a, 4) == 0x8438
So:
0x00008438 + 0xffffffb8 == 0x83f0
The ARM ARM mentions this in the assembler syntax for the <label> part of the instruction:
For BLX (encodings T2, A2), the assembler calculates the required value of the offset from the Align(PC,4) value of the BLX instruction to this label, then selects an encoding that sets imm32 to that offset.
The alignment requirement can also be found by careful reading of the Operation pseudocode in the ARM ARM:
if ConditionPassed() then
    EncodingSpecificOperations();
    if CurrentInstrSet == InstrSet_ARM then
        next_instr_addr = PC - 4;
        LR = next_instr_addr;
    else
        next_instr_addr = PC;
        LR = next_instr_addr<31:1> : '1';
    if toARM then
        SelectInstrSet(InstrSet_ARM);
        BranchWritePC(Align(PC,4) + imm32); // <--- alignment of the current PC when BLX to non-Thumb ARM code
    else
        SelectInstrSet(InstrSet_Thumb);
        BranchWritePC(PC + imm32);
F7FF
1111011111111111
111 10 11111111111 h = 10 offset upper = 11111111111
EFDC
1110111111011100
111 01 11111011100 h = 01 blx offset lower = 11111011100
offset = 1111111111111111011100<<1
sign extended = 0xFFFFFFB8
0x00008436 + 2 + 0xFFFFFFB8 = 0x1000083F0
clip to 32 bits 0x000083F0
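Putting both answers together, here is a minimal C sketch of the target calculation for the Thumb-2 BLX (immediate) encoding. It follows the S/J1/J2/imm10H/imm10L layout from the question and the Align(PC,4) rule quoted above; the function name is hypothetical, and it does not check that the two halfwords really are a BLX.

```c
#include <stdint.h>

/* insn_addr: address of the first halfword.
   hw1, hw2: the two halfwords as 16-bit values (0xF7FF, 0xEFDC for
   the bytes FF F7 DC EF). */
uint32_t thumb2_blx_target(uint32_t insn_addr, uint16_t hw1, uint16_t hw2)
{
    uint32_t s      = (hw1 >> 10) & 1u;
    uint32_t imm10h = hw1 & 0x3FFu;
    uint32_t j1     = (hw2 >> 13) & 1u;
    uint32_t j2     = (hw2 >> 11) & 1u;
    uint32_t imm10l = (hw2 >> 1) & 0x3FFu;

    uint32_t i1 = (j1 ^ s) ^ 1u;  /* I1 = NOT(J1 EOR S) */
    uint32_t i2 = (j2 ^ s) ^ 1u;  /* I2 = NOT(J2 EOR S) */

    /* S:I1:I2:imm10H:imm10L:00 is a 25-bit value */
    uint32_t imm32 = (s << 24) | (i1 << 23) | (i2 << 22)
                   | (imm10h << 12) | (imm10l << 2);
    if (s)
        imm32 |= 0xFE000000u;     /* sign-extend from bit 24 */

    uint32_t pc = insn_addr + 4;  /* PC as seen by a Thumb instruction */
    return (pc & ~3u) + imm32;    /* Align(PC,4) + imm32, wraps mod 2^32 */
}
```

For the instruction at 0x8436 with halfwords 0xF7FF and 0xEFDC this yields 0x000083F0, the address of sub_83F0.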
