I'm trying to print non-printable characters from files in C, but I can't print this particular character because I don't know what it means. I noticed the difference when I compared my program's output with cat's.
cat -n file.txt:
134 is 127 7f
cat -nt file.txt:
134 ^? is 127 7f
In caret notation (which cat uses when -v or -t is specified), ^? represents the DEL character (Unicode U+007F, ASCII 127 decimal, 0x7f hex).
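For illustration, here is a minimal sketch (not the asker's program, and simplified: it ignores cat's M- notation for bytes above 127) of how caret notation can be produced in C. A control byte is shown as ^ followed by the byte XOR 0x40, which turns DEL (0x7f) into ^?:

#include <stdio.h>

/* Render one byte roughly the way `cat -v` does: control characters
   become ^@ .. ^_, DEL becomes ^?, newlines and tabs are passed
   through, everything else is printed as-is. */
static void put_visible(unsigned char c)
{
    if (c == '\n' || c == '\t') {
        putchar(c);
    } else if (c < 32 || c == 127) {
        putchar('^');
        putchar(c ^ 0x40);   /* 0x7f ^ 0x40 == '?', 0x01 ^ 0x40 == 'A' */
    } else {
        putchar(c);
    }
}

int main(void)
{
    int ch;
    while ((ch = getchar()) != EOF)
        put_visible((unsigned char) ch);
    return 0;
}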
Related
I have a .echo_colors file containing some variables for colors in the following format:
export red="\033[0;31m"
This works fine with echo -e, but I want to use these environment variables in C code. I'm getting the variable via getenv and printing it with printf:
#include <stdlib.h>
#include <stdio.h>
int main(){
    char* color = getenv("red");
    printf("%s", color);
    printf("THIS SHOULD BE IN RED\n");
    return 0;
}
With this program, I get
\033[0;31mTHIS SHOULD BE IN RED
The string is just being printed, not interpreted as a color code. printf("\033[0;31m") works and prints output in red, as I want. Any ideas how to correct this problem?
By default, Bash doesn't interpret \033 as ESC, but as the four characters "backslash, zero, three, three", as hex-dumping the variable shows:
bash-3.2$ export red="\033[0;31m"
bash-3.2$ echo $red | xxd
00000000: 5c30 3333 5b30 3b33 316d 0a \033[0;31m.
You'll need to use a different Bash syntax to export the variable to have it interpret the escape sequence (instead of echo -e doing it):
export red=$'\033[0;31m'
i.e.
bash-3.2$ export red=$'\033[0;31m'
bash-3.2$ echo $red | xxd
00000000: 1b5b 303b 3331 6d0a .[0;31m.
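If you want to check from the C side what bytes the variable actually contains (the same thing the xxd dump above shows), a minimal sketch along these lines could help; the NULL check also covers the case where the variable is not exported at all:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *color = getenv("red");
    if (color == NULL) {
        fprintf(stderr, "red is not set\n");
        return 1;
    }
    /* Dump each byte in hex: a correctly exported variable starts
       with 1b (ESC), the broken one with 5c (backslash). */
    for (const unsigned char *p = (const unsigned char *) color; *p != '\0'; p++)
        printf("%02x ", *p);
    printf("\n");
    return 0;
}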
Use ^[ (Control-left-bracket) in your shell script to key in the export command, as in
$ export red="<ctrl-[>[0;31m"
In your shell script, the Ctrl-[ is an actual escape character. When typing it at the terminal, you may need to prefix it with Ctrl-V so it is not interpreted as an editing character.
If you do
$ echo "<ctrl-[>[31mTHIS SHOULD BE IN RED"
THIS SHOULD BE IN RED (in red letters)
$ _
you will see the effect.
I have an assignment, in C, which requires some Greek sentences to be printed in the terminal.
In the code template that is given to us there is this line of code:
system("chcp 1253>nul");
This is supposed to make the Greek characters print correctly (chcp is a Windows command that switches the console code page).
In my Ubuntu Terminal I see:
�������� ����� �� ����� ����� ��� �������� ���� ������
So, how can I print Greek characters in my terminal?
This is supported out of the box on most Linux systems. The only thing you must do is call
setlocale(LC_ALL, "");
at the beginning of the program. This relies on the fact that UTF-8 is the default encoding for users' locales. The standard says that this call switches to the user's current locale; the default is the "C" locale, which may or may not support national characters.
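A minimal example of that approach (assuming the source file itself is saved as UTF-8 and the terminal uses a UTF-8 locale):

#include <stdio.h>
#include <locale.h>

int main(void)
{
    /* Switch from the default "C" locale to the user's own locale,
       which on most Linux systems uses UTF-8. */
    setlocale(LC_ALL, "");

    /* The Greek text is stored as UTF-8 bytes in an ordinary
       narrow string and written out unchanged. */
    printf("Γειά σου Κόσμε!\n");
    return 0;
}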
By default gcc interprets the source code as encoded in UTF-8. Compile-time options exist to change that, but it is recommended to keep everything in UTF-8 on Linux. Sources that come from Windows are probably not encoded in UTF-8 and need to be recoded. Use the iconv utility for that. If the source is associated with a particular legacy code page, try that code page name as the source encoding.
A C program (conforming to ISO C99 or later, or POSIX.1 or later) that inputs or outputs non-ASCII text should use wide strings, wide I/O, and localization.
For example:
#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main(void)
{
    /* Tell the C library to use the current locale settings. */
    setlocale(LC_ALL, "");

    /* Standard output is used with the wide I/O functions. */
    fwide(stdout, 1);

    /* Print some Greek text. */
    wprintf(L"Γειά σου Κόσμε!\n");

    return EXIT_SUCCESS;
}
Note that wide string literals are written as L"..." whereas normal (ASCII or narrow) string literals are written as "...". Similarly, wide character constants (of type wchar_t) are written with the L prefix; for example, L'€'.
When compiling, you do need to tell the compiler what character set the source code uses. In Linux, GCC uses the locale settings, but also provides an option -finput-charset=windows-1252 to change it to Windows Western European, for example.
Rather than fiddle with the flags, I recommend you write a Bash helper script, say to-utf8:
#!/bin/bash
if [ $# -lt 2 ] || [ ":$1" = ":-h" ] || [ ":$1" = ":--help" ]; then
    printf '\n'
    printf 'Usage: %s [ -h | --help ]\n' "$0"
    printf '       %s windows-1252 file.c [ ... ]\n' "$0"
    printf '\n'
    exit 0
fi
charset="$1"
shift 1
Work=$(mktemp) || exit 1
trap "rm -f '$Work'" EXIT
for src in "$#" ; do
iconv -f "$charset//TRANSLIT" -t UTF-8 "$src" > "$Work" || exit $?
sed -e 's|\r$||' "$Work" > "$src" || exit $?
printf '%s: Converted successfully.\n' "$src"
done
exit 0
If you want, you can install that system-wide using
sudo install -o 0 -g 0 -m 0755 to-utf8 /usr/bin/
The first command-line parameter is the source character set (use iconv --list to see them all), followed by a list of files to fix.
The script creates an automatically deleted temporary file. The iconv line converts the character set of each file to UTF-8, saving the result into the temporary file. The sed command then changes any CRLF (\r\n) newlines to LF (\n), overwriting the contents of the original file.
(Rather than using a second temporary file to hold the contents, having sed write its output back to the original file means the original file keeps its owner and group intact.)
I am playing with the Unix hexdump utility. My input file is UTF-8 encoded, containing a single character ñ, which is C3 B1 in hexadecimal UTF-8.
hexdump test.txt
0000000 b1c3
0000002
Huh? This shows B1 C3 - the inverse of what I expected! Can someone explain?
To get the expected output I do:
hexdump -C test.txt
00000000 c3 b1 |..|
00000002
I was thinking I understood encoding systems.
This is because hexdump defaults to using 16-bit words and you are running on a little-endian architecture. The byte sequence c3 b1 is thus interpreted as the 16-bit word 0xb1c3. The -C option forces hexdump to work with bytes instead of words.
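To make the grouping concrete, here is a small sketch (hypothetical, assuming a little-endian machine) that rebuilds the same 16-bit word from the two bytes of ñ:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* The two UTF-8 bytes of 'ñ', in file order. */
    unsigned char bytes[2] = { 0xC3, 0xB1 };

    /* Default hexdump groups bytes into 16-bit words in host byte
       order; on a little-endian CPU the first byte is the low one. */
    uint16_t word;
    memcpy(&word, bytes, sizeof word);
    printf("%04x\n", word);   /* prints b1c3 on little-endian */
    return 0;
}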
I found two ways to avoid that:
hexdump -C file
or
od -tx1 < file
I think it is stupid that hexdump decided files are usually made of 16-bit little-endian words. Very confusing IMO.
I am having difficulty understanding the way Perl's print command interprets hexadecimal values. I am using a very simple program of just 8 lines to demonstrate my question. The following gdb session explains my question in detail:
anil#anil-Inspiron-N5010:~/Desktop$ gcc -g code.c
anil#anil-Inspiron-N5010:~/Desktop$ gdb -q ./a.out
Reading symbols from ./a.out...done.
(gdb) list
1 #include <stdio.h>
2
3 int main(int argc, char* argv[])
4 {
5 int i;
6 for (i =0; i<argc; ++i)
7 printf ("%p\n", argv[i]);
8 return 0;
9 }
(gdb) break 8
Breakpoint 1 at 0x40057a: file code.c, line 8.
(gdb) run $(perl -e 'print "\xdd\xcc\xbb\xaa"') $(perl -e 'print "\xcc\xdd\xee\xff"')
Starting program: /home/anil/Desktop/a.out $(perl -e 'print "\xdd\xcc\xbb\xaa"') $(perl -e 'print "\xcc\xdd\xee\xff"')
0x7fffffffe35d
0x7fffffffe376
0x7fffffffe37b
Breakpoint 1, main (argc=3, argv=0x7fffffffdfe8) at code.c:8
8 return 0;
(gdb) x/2x argv[1]
0x7fffffffe376: 0xaabbccdd 0xeeddcc00
In the lines shown above I have used gdb to debug the program. As command-line arguments I passed two (hexadecimal) arguments (excluding the name of the program itself): \xdd\xcc\xbb\xaa and \xcc\xdd\xee\xff. Owing to the little-endian architecture, those arguments should be interpreted as 0xaabbccdd and 0xffeeddcc, but as you can see, the last line of the debugging session shows 0xaabbccdd and 0xeeddcc00. Why is this so? What am I missing? This has happened with some other arguments too. I am requesting you to help me with this.
PS: 2^32 = 4294967296 and 0xffeeddcc = 4293844428 (2^32 > 0xffeeddcc). I don't know if there is any connection.
Command-line arguments are NUL-terminated strings.
argv[1] is a pointer to the first character of a NUL-terminated string.
7FFFFFFFE376 DD CC BB AA 00
argv[2] is a pointer to the first character of a NUL-terminated string.
7FFFFFFFE37B CC DD EE FF 00
If you pay attention, you'll notice they happen to be located immediately one after the other in memory.
7FFFFFFFE376 DD CC BB AA 00 CC DD EE FF 00
You asked to print two (32-bit) integers starting at argv[1]
7FFFFFFFE376 DD CC BB AA 00 CC DD EE FF 00
----------- -----------
0xAABBCCDD 0xEEDDCC00
For x/2x to be correct, you would have needed to use
perl -e'print "\xdd\xcc\xbb\xaa\xcc\xdd\xee\xff"'
-or-
perl -e'print pack "i*", 0xaabbccdd, 0xffeeddcc'
For the arguments you passed, you need to use
(gdb) x argv[1]
0x3e080048cbd: 0xaabbccdd
(gdb) x argv[2]
0x3e080048cc2: 0xffeeddcc
You are confusing yourself by printing strings as numbers. The bytes of the string "\xdd\xcc\xbb\xaa" are stored left to right at increasing addresses; on a little-endian architecture, reading them back as a four-byte integer makes the byte at the lowest address the least significant, which is why it comes out as 0xaabbccdd.
So let's take a look at the output of your debugger command:
(gdb) x/2x argv[1]
0x7fffffffe376: 0xaabbccdd 0xeeddcc00
Looking at that byte by byte, it would be:
0x7fffffffe376: dd
0x7fffffffe377: cc
0x7fffffffe378: bb
0x7fffffffe379: aa
0x7fffffffe37a: 00 # This NUL terminates argv[1]
0x7fffffffe37b: cc # This address corresponds to argv[2]
0x7fffffffe37c: dd
0x7fffffffe37d: ee
Which is not unexpected, no?
You might want to use something like this to display arguments in hex:
x/8bx argv[1]
(which will show 8 bytes in hexadecimal)
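To see the same effect outside the debugger, here is a short sketch (hypothetical, assuming a little-endian machine) that reinterprets the bytes perl printed for the first argument as a 32-bit integer:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>

int main(void)
{
    /* The same bytes perl printed for the first argument, plus the
       terminating NUL added when it was passed as an argument. */
    const unsigned char arg1[] = { 0xdd, 0xcc, 0xbb, 0xaa, 0x00 };

    /* Copy the first four bytes into an integer, just as the
       debugger's x/2x read them from memory. */
    uint32_t value;
    memcpy(&value, arg1, sizeof value);

    /* On a little-endian machine the byte at the lowest address is
       the least significant, so this prints aabbccdd. */
    printf("%08" PRIx32 "\n", value);
    return 0;
}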
#include <stdio.h>

int main()
{
    printf("He %c llo", 65);
}
Output: He A llo
#include <stdio.h>

int main()
{
    printf("He %c llo", 13);
}
Output: llo. It doesn't print He.
I can understand that 65 is the ASCII value for A and hence A is printed in the first case, but why llo in the second case?
Thanks
ASCII 13 is carriage return, which on some systems simply moves the cursor to the beginning of the line you were just on.
Further characters then wipe out the earlier text.
Man ascii:
Oct Dec Hex Char
015 13 0D CR '\r'
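As a small hypothetical demonstration (not from the question) of how characters printed after a carriage return overwrite what was already on the line:

#include <stdio.h>

int main(void)
{
    printf("Hello");
    printf("\rJell");   /* '\r' returns to column 0; "Jell" overwrites "Hell" */
    printf("\n");       /* the terminal now shows "Jello" */
    return 0;
}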
Character 13 is a carriage return, so it prints He, then returns to the beginning of the line and prints the remaining llo over it.
It's just being rendered strangely due to the nature of a carriage return*. You can see the characters that are actually output by piping to another tool such as xxd:
$ gcc b.c && ./a.out | xxd
0000000: 4865 200d 206c 6c6f He . llo
$ gcc c.c && ./a.out | xxd
0000000: 4865 2041 206c 6c6f He A llo
* See Wikipedia:
On printers, teletypes, and computer terminals that were not capable of displaying graphics, the carriage return was used without moving to the next line to allow characters to be placed on top of existing characters to produce character graphics, underlines, and crossed out text.