C read() produces wrong byte order compared to hexdump? [duplicate] - c

I am playing with the Unix hexdump utility. My input file is UTF-8 encoded, containing a single character ñ, which is C3 B1 in hexadecimal UTF-8.
hexdump test.txt
0000000 b1c3
0000002
Huh? This shows B1 C3, the reverse of what I expected! Can someone explain?
To get the expected output I do:
hexdump -C test.txt
00000000 c3 b1 |..|
00000002
I thought I understood encoding systems.

This is because hexdump defaults to 16-bit words and you are running on a little-endian architecture. The file's byte sequence c3 b1 is therefore read as the 16-bit word 0xb1c3. The -C option forces hexdump to work with single bytes instead of words.
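To see the same interpretation in plain C, here is a minimal sketch (assuming a little-endian host) that reads the two file bytes into a 16-bit word the way hexdump's default format does:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* The two bytes of "ñ" in UTF-8, in file order. */
    unsigned char bytes[2] = { 0xc3, 0xb1 };
    uint16_t word;

    /* Reinterpret them as one 16-bit word, as hexdump's default format does. */
    memcpy(&word, bytes, sizeof word);

    /* On a little-endian machine this prints b1c3, matching hexdump's output. */
    printf("%04x\n", word);
    return 0;
}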

I found two ways to avoid that:
hexdump -C file
or
od -tx1 < file
I think it is a poor default that hexdump treats files as sequences of 16-bit little-endian words. Very confusing IMO.

Related

What is the meaning of this special character '^?' in an ASCII table?

I'm trying to print non-printable characters from files in C, but I can't print this particular character because I don't know what it means. I noticed it when I compared my program's output with cat's.
cat -nt file.txt:
134 is 127 7f
cat -n file.txt:
134 ^? is 127 7f
In caret notation (which cat uses when -v or -t is specified), ^? represents the DEL character (Unicode U+007F, ASCII 127 decimal, 0x7f in hex).
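If the goal is to print non-printable characters the way cat -v does, a minimal sketch of caret notation in C might look like this (the helper name is mine, and bytes >= 0x80, which cat shows in M- notation, are left out for brevity):
#include <stdio.h>

/* Print one byte the way cat -v does for ASCII: control characters
 * 0x00-0x1f become ^@ .. ^_, DEL (0x7f) becomes ^?, and everything
 * else (including tab and newline) passes through unchanged. */
static void put_caret(unsigned char c)
{
    if (c < 0x20 && c != '\n' && c != '\t')
        printf("^%c", c + 0x40);
    else if (c == 0x7f)
        printf("^?");
    else
        putchar(c);
}

int main(void)
{
    int c;
    while ((c = getchar()) != EOF)
        put_caret((unsigned char)c);
    return 0;
}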

How to read objdump in C

Code
#define TEXT_LEN 256
#define NUM_NUMBERS (2*65536)
int numNumbers = NUM_NUMBERS;
Command (run in PuTTY) to inspect the global variable numNumbers:
objdump -s -j .data assign1-0
Output of command
602070 00000000 00000200
Hello,
Can someone help me understand this output, or tell me if I used the wrong command?
I'm trying to find the global variable numNumbers using objdump.
I'm pretty sure the output should be 00020000, because numNumbers is 131072 (2*65536), but it comes out as 00000200, which is 512 when converted from hexadecimal to decimal.
Am I reading it wrong and the output is correct or is the command wrong to find a global variable?
You are probably on a little-endian computer, so the bytes that make up your int are not in the order in which you're used to reading digits or bits as a human. Familiarize yourself with the concept of endianness.
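A quick way to convince yourself is to print the bytes of numNumbers in memory order, which mirrors what objdump -s shows. A minimal sketch:
#include <stdio.h>

#define NUM_NUMBERS (2 * 65536)   /* 131072 == 0x00020000 */

int numNumbers = NUM_NUMBERS;

int main(void)
{
    const unsigned char *p = (const unsigned char *)&numNumbers;

    /* Print the bytes of numNumbers in memory order, as objdump -s does. */
    for (size_t i = 0; i < sizeof numNumbers; i++)
        printf("%02x", p[i]);
    printf("\n");   /* little-endian host: 00000200, big-endian host: 00020000 */
    return 0;
}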

Writing out a binary structure to file, unexpected results?

I'm trying to write out a packed data structure to a binary file; however, as you can see from the od -x output below, the ordering is not what I expected. I'm using gcc on a 64-bit Intel system. Does anyone know why the ordering is wrong? It doesn't look like an endianness issue.
#include <stdio.h>
#include <stdlib.h>

struct B {
    char a;
    int b;
    char c;
    short d;
} __attribute__ ((packed));

int main(int argc, char *argv[])
{
    FILE *fp;
    fp = fopen("temp.bin", "w");
    struct B b = {'a', 0xA5A5, 'b', 0xFF};
    if (fwrite(&b, sizeof(b), 1, fp) != 1)
        printf("Error fwrite\n");
    exit(0);
}
Hex 61 is ASCII 'a', so that's the b.a member. Hex 62 is ASCII 'b', so that's the b.c member. It's odd how 0xA5A5 is spread out over the sequence.
$ od -x temp.bin
0000000 a561 00a5 6200 00ff
0000010
od -x groups the input into 2-byte units and prints each unit in the host's byte order, which on a little-endian machine swaps the bytes. It's a confusing output format. Use -t x1 to leave the bytes alone.
$ od -t x1 temp.bin
0000000 61 a5 a5 00 00 62 ff 00
0000010
Or, easier to remember, use hd (hex dump) instead of od (octal dump). hd's default format doesn't need adjusting, plus it shows both a hex and ASCII dump.
$ hd temp.bin
00000000 61 a5 a5 00 00 62 ff 00 |a....b..|
00000008
od -x interprets the input as 2-byte units read in your machine's (little-endian) byte order. Per the od man page:
-x same as -t x2, select hexadecimal 2-byte units
So
0000000 a561 00a5 6200 00ff
is, on disk:
0000000 61a5 a500 0062 ff00
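If you'd rather not fight od's default grouping at all, you can also dump the struct's bytes yourself. A minimal sketch based on the struct from the question:
#include <stdio.h>

struct B {
    char a;
    int b;
    char c;
    short d;
} __attribute__ ((packed));

int main(void)
{
    struct B b = {'a', 0xA5A5, 'b', 0xFF};
    const unsigned char *p = (const unsigned char *)&b;

    /* Print each byte in memory order; on a little-endian machine this
     * prints 61 a5 a5 00 00 62 ff 00, matching od -t x1 and hd. */
    for (size_t i = 0; i < sizeof b; i++)
        printf("%02x ", p[i]);
    printf("\n");
    return 0;
}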

Command line input to a C program (using 'print' command of Perl)

I am having difficulty understanding the way Perl's print command interprets hexadecimal values. I am using a very simple program of just 8 lines to demonstrate my question. The following gdb session explains it in detail:
anil#anil-Inspiron-N5010:~/Desktop$ gcc -g code.c
anil#anil-Inspiron-N5010:~/Desktop$ gdb -q ./a.out
Reading symbols from ./a.out...done.
(gdb) list
1 #include <stdio.h>
2
3 int main(int argc, char* argv[])
4 {
5 int i;
6 for (i =0; i<argc; ++i)
7 printf ("%p\n", argv[i]);
8 return 0;
9 }
(gdb) break 8
Breakpoint 1 at 0x40057a: file code.c, line 8.
(gdb) run $(perl -e 'print "\xdd\xcc\xbb\xaa"') $(perl -e 'print "\xcc\xdd\xee\xff"')
Starting program: /home/anil/Desktop/a.out $(perl -e 'print "\xdd\xcc\xbb\xaa"') $(perl -e 'print "\xcc\xdd\xee\xff"')
0x7fffffffe35d
0x7fffffffe376
0x7fffffffe37b
Breakpoint 1, main (argc=3, argv=0x7fffffffdfe8) at code.c:8
8 return 0;
(gdb) x/2x argv[1]
0x7fffffffe376: 0xaabbccdd 0xeeddcc00
In the lines shown above I used gdb to debug the program. As command line arguments, I passed two (hexadecimal) arguments (excluding the name of the program itself): \xdd\xcc\xbb\xaa and \xcc\xdd\xee\xff. Owing to the little-endian architecture, those arguments should be interpreted as 0xaabbccdd and 0xffeeddcc, but as you can see, the last line of the debugging session shows 0xaabbccdd and 0xeeddcc00. Why is this so? What am I missing? This has happened with some other arguments too. I would appreciate any help with this.
PS: 2^32 = 4294967296 and 0xffeeddcc = 4293844428 (2^32 > 0xffeeddcc). I don't know if still there is any connection.
Command-line arguments are NUL-terminated strings.
argv[1] is a pointer to the first character of a NUL-terminated string.
7FFFFFFFE376 DD CC BB AA 00
argv[2] is a pointer to the first character of a NUL-terminated string.
7FFFFFFFE37B CC DD EE FF 00
If you pay attention, you'll notice they happen to be located immediately one after the other in memory.
7FFFFFFFE376 DD CC BB AA 00 CC DD EE FF 00
You asked to print two (32-bit) integers starting at argv[1]
7FFFFFFFE376 DD CC BB AA 00 CC DD EE FF 00
----------- -----------
0xAABBCCDD 0xEEDDCC00
For x/2x to be correct, you would have needed to use
perl -e'print "\xdd\xcc\xbb\xaa\xcc\xdd\xee\xff"'
-or-
perl -e'print pack "i*", 0xaabbccdd, 0xffeeddcc'
For the arguments you passed, you need to use
(gdb) x argv[1]
0x3e080048cbd: 0xaabbccdd
(gdb) x argv[2]
0x3e080048cc2: 0xffeeddcc
You are confusing yourself by printing strings as numbers. The string's bytes are stored left-to-right from the starting address, and on a little-endian architecture the byte at the lowest address becomes the least significant byte of the value, so the bytes dd cc bb aa read back as the number 0xAABBCCDD.
So let's take a look at the output of your debugger command:
(gdb) x/2x argv[1]
0x7fffffffe376: 0xaabbccdd 0xeeddcc00
Looking at that byte by byte, it would be:
0x7fffffffe376: dd
0x7fffffffe377: cc
0x7fffffffe378: bb
0x7fffffffe379: aa
0x7fffffffe37a: 00 # This NUL terminates argv[1]
0x7fffffffe37b: cc # This address corresponds to argv[2]
0x7fffffffe37c: dd
0x7fffffffe37d: ee
Which is not unexpected, no?
You might want to use something like this to display arguments in hex:
x/8bx argv[1]
(which will show 8 bytes in hexadecimal)
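To reproduce the reinterpretation outside gdb, here is a minimal sketch that copies the first argument's bytes into a 32-bit integer and prints it (the byte values are hard-coded instead of taken from argv):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* The bytes that perl -e 'print "\xdd\xcc\xbb\xaa"' writes, in order. */
    const unsigned char arg[] = { 0xdd, 0xcc, 0xbb, 0xaa };
    uint32_t value;

    /* Reading them as a 32-bit integer on a little-endian machine gives
     * 0xaabbccdd, exactly what gdb's x/x showed. */
    memcpy(&value, arg, sizeof value);
    printf("0x%08x\n", (unsigned)value);
    return 0;
}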

MIPS Assembly - String (ASCII) Instructions

I am writing an assembler in C for MIPS assembly (so it converts MIPS assembly to machine code).
Now MIPS has three different instruction formats: R-type, I-type and J-type. However, in the .data section, we might have something like message: .asciiz "hello world". In this case, how would we convert an ASCII string into machine code for MIPS?
Thanks
ASCII text is not converted to machine code. It is stored as raw bytes using the ASCII encoding (see the ASCII table on Wikipedia).
MIPS uses this encoding to store ASCII strings. As for .asciiz in particular, it is the string plus the NUL terminator. So, according to the table, A is 41 in hexadecimal, which is 0100 0001 in binary. But don't forget the NUL character, so the whole thing is 0100 0001 followed by 0000 0000.
When storing the string, I'd take a cue from the Mars MIPS simulator: start the data section at a known address in memory and resolve any references to the label message to that location.
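In an assembler written in C, handling .asciiz can be as simple as appending the string's bytes plus a trailing NUL to a data-segment image and binding the label to the resulting offset. A minimal sketch (the buffer and function names here are mine, not from any particular assembler; bounds checking is omitted for brevity):
#include <stddef.h>
#include <string.h>

/* Hypothetical data-segment image being built by the assembler. */
static unsigned char data_seg[65536];
static size_t data_len;

/* Emit an .asciiz string: copy its bytes verbatim, then the NUL terminator.
 * Returns the offset within the data segment, which the label (e.g.
 * "message") can be bound to. */
static size_t emit_asciiz(const char *s)
{
    size_t off = data_len;
    size_t n = strlen(s) + 1;          /* include the trailing '\0' */
    memcpy(data_seg + off, s, n);
    data_len += n;
    return off;
}
Calling emit_asciiz("hello world") would store the 11 characters followed by 0x00, and the label message would then resolve to the returned offset plus the data segment's base address.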
Please note that everything in the data section is neither R-type, I-type, nor J-type. It is just raw data.
Data is not executable and should not be converted to machine code. It should be encoded in the proper binary representation of the data type for your target.
As other answers have noted, the ASCII text contained in a .ascii "string" directive is encoded in its raw binary form in the data segment of the object file. As to what happens from there, that depends on the binary format the assembler is encoding into. Ordinarily data is not encoded into machine code; however, GNU as will happily assemble this:
.text
start:
    .ascii "Hello, world"
    addi $t1, $zero, 0x1
end:
If you disassemble the output in objdump (I'm using the mips-img-elf toolchain here) you'll see this:
Disassembly of section .text:
00000000 <message>:
0: 48656c6c 0x48656c6c
4: 6f2c2077 0x6f2c2077
8: 6f726c64 0x6f726c64
c: 20090001 addi t1,zero,1
The hexadecimal sequence 48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 spells out "Hello, world".
I came here while looking for an answer as to why GAS behaves like this. Mars won't assemble the above program, giving an error that data directives can't be used in the text segment.
Does anyone have any insight here?
