Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 5 years ago.
I am trying to print the characters from a text file using C in the Code::Blocks terminal. I use getc and printf, but the terminal shows unwanted characters as well. For example,
when I read,
CAAAAATATAAAAACAGGTTTATGATATAAGGTAAAGTATGGGAGATGGGGACAAAAGT
It shows,
CΘA A A A A T A T A A A A A C A G G T T T A T G A T A T A A G GT A A A G T A T$GhGêG╝A G<AöT G#GñG<G AxC A A A A G T
Can anyone please state what can be done to avoid this situation?
Your text file obviously uses a two-byte character encoding. If this is on Windows, it's very likely UTF-16.
char in C is a single byte, so a single-byte encoding is assumed. There are many ways to solve this; e.g. you could use iconv. On Windows, you can use wchar_t (*) to read the characters of this file (together with functions for wide characters like getwc()), and if you need it in an 8-bit encoding, Windows API functions like WideCharToMultiByte() can help.
(*) wchar_t is a type for "wide" characters, but it's implementation-defined how many bytes a wide character has. On Windows, wchar_t has 16 bits and typically holds UTF-16 encoded characters. On many other systems, wchar_t has 32 bits and typically holds UCS-4 encoded characters.
Closed 2 years ago.
I'm working in C and need to receive a string from the user in the format of "abcd", and to parse it to retrieve it as the string abcd (in the code).
For some reason, when I try to check whether the first char in the string (which I've read using sscanf) is ", it doesn't report that it is. The debugger watch says that data[0] is '"', but that data[0] == '"' is false, which is absurd.
The character in data[0] is probably a special quotation mark with the ASCII (or rather Windows-1252) code 147/0x93. It is a number in which the highest bit is 1, and as such it is outside the 7-bit ASCII range. While the 7-bit ASCII codes are interpreted identically across many character sets, this is not so for 8-bit values (> 127). The "glyph" a given terminal or printer will show for an 8-bit value depends on the character set it assumes (in your case, as mentioned, Windows-1252).
Last but not least, because chars are signed on your system, the debugger interprets the highest bit as a sign bit and shows a negative value. I think you can cast it to unsigned char in the debugger watch expression to obtain the positive equivalent.
That character cannot be entered directly with the keyboard; on Windows you can try the Alt+number-pad trick. When you enter the normal quotation mark you create a char with the ASCII code 34/0x22, which the compiler and debugger correctly claim is not identical.
Closed 4 years ago.
I've read somewhere that we should always open a file in C as a binary file (even if it's a text file). At the time (a few years ago) I didn't care too much about it, but now I really need to understand whether that's the case and why.
I've been trying to search for info on this, but the most I find is the difference in how they are opened, not their structural difference.
So I guess my question is: why should we always open a file as binary even if we guess beforehand that it's a text file? My second question concerns the structure of each file itself: is a binary file like an "encrypted" text file?
The names "text" vs. "binary", while quite mnemonic, can sometimes leave you wondering which one to apply. It's best to translate them to their underlying mechanics, and choose based on which one of those you need.
"Binary" could also be called "verbatim" opening mode. Each byte in the file will be read exactly as-is on disk. Which means that if it's a Windows file containing the text "ABC" on one line (including the line terminator), the bytes read from the file will be 65 66 67 13 10.
"Text" mode could also be called "line-terminator translating" opening mode. When the file contains a sequence of 1 or more characters which is defined by the platform on which you're running as "line terminator"(1), the entire sequence will be read from the file, but the runtime will make it appear as if only the character '\n' (10 when using ASCII) was read. For the same Windows-file above, if it was opened as a text file on Windows, the bytes read from the file would be 65 66 67 10.
The same applies when writing: a file opened as "binary" for writing will write exactly the bytes you give it. A file opened as "text" will translate the byte '\n' (10 in ASCII) to whatever the platform defines as the line-terminating character sequence.
I don't think an "always do this" rule can be distilled from the above, but perhaps you can use it to make an informed decision for each case.
(1) On Unix-style systems, the line-terminating character sequence is LF (ASCII 10). On Windows, it's the two-character sequence CR LF (ASCII 13 10). On old pre-X Mac OS, it was just the single-character CR (ASCII 13).
Closed 8 years ago.
Consider this thread: What is EOF in the C programming language?
The answer was that EOF (Ctrl-D) causes getchar to return -1.
My question is: what do Ctrl-J and Ctrl-M represent in C on OS X, and why does getchar return 10 for both, using the same code as in the link above?
What other shortcuts (Ctrl-something / Cmd-something) result in getchar returning a static predefined number?
Ctrl-J is the shortcut for the line feed control character, which has character code 10. Here is a page with other control characters.
As of this time I do not know why Ctrl-M (ASCII value 13) returns 10, but I assume it is because it is similar in function to the line feed.
getchar returns -1 at end of file because EOF is defined as -1 on most systems.
Some other defined characters:
Ctrl-G: 7
Ctrl-I: 9
...
Ctrl-V: 22
stdin is typically in text mode. Various conversions occur, per OS, concerning line endings when reading/writing in text mode. Ctrl-M is one of them: the carriage return (13) is converted to 10. Had the I/O been in binary mode, no conversion would be expected.
Consoles map various keyboard combinations to various chars and actions (like Ctrl-D --> EOF). The chars created certainly include most of the values 0 to 127. As these values are typically mapped to ASCII, the first 32 values (Ctrl-@, Ctrl-A, Ctrl-B, ... Ctrl-_) may have no graphical representation.
Note: Notice what is returned when getchar() is called again after it returned EOF. Expect it to immediately return EOF again without waiting for any additional key presses. Ctrl-D sets a condition, not a char.
Closed 9 years ago.
What is UTF-8 encoding? I googled it but was not able to understand what it is. Please explain in simple words with an example.
Next, I need to encode one string in UTF-8. I found OpenSSL, but it only converts to Base64 format.
#include <stdio.h>
#include <string.h>

struct some
{
    char string[40];
};

int main(void)
{
    struct some s;
    char str[9];

    /* gets() is unsafe (and removed in C11); fgets() bounds the read */
    if (fgets(str, sizeof str, stdin) == NULL)
        return 1;
    str[strcspn(str, "\n")] = '\0';  /* drop the trailing newline */

    strcpy(s.string, str);  /* copy into the struct, not through an uninitialized pointer */
    /* Now how to get the encoded form of "Hello" in UTF-8? */
    printf("encoded data\n");
    return 0;
}
Those strings are only available at runtime, so I don't know anything in advance about what is coming, and after encoding I need to store them in a DB.
I checked on SO itself but could not find any source in C; it is available for .NET, Java and C#. I am using Red Hat Linux.
Encodings describe what byte or sequence of bytes corresponds to what character. ASCII is the simplest encoding: a single byte value corresponds to a single character. Unfortunately, there are far more than 256 characters in the world. UTF-8 is probably the most common encoding format because it is compatible with English ASCII but also allows international characters. If you write a standard English string in C, it is already valid UTF-8: "Hello" == "Hello".
Joel has a fantastic article about this subject called: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
It does a good job of explaining ASCII, unicode, and UTF8 string encodings.
In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or in fact up to 4 bytes (not 6, as originally stated; corrected by R.).
Closed 9 years ago.
I have a weird, or rather stupid, question.
When I open a binary file using a text editor, it doesn't seem to be represented in binary 0s and 1s, or in hex, so what representation is that?
IHDR\00\00k\00\00\C3\00\00\00\A2\B6\8D$\00\00\00sBIT|d\88\00\00 \00IDATx\9C̽Y\AC-\CBy\DF\F7\FB\AA\AA\BBװ\87\B3\CFtϹ
The hard disk (as well as any other digital device in your computer) transmits data as 0 and 1. And all files are just sequences of numbers, and they are all 'binary' in the sense that they all are bunch of bits. But some of the files can be read by a human (after a simple decoding that is performed by text viewers), and we call those 'text' files; and others are in machine-oriented language and are not targeted to human's perception at all or at least without a special software (those are called 'binary').
A text editor tries to display these data as text. As "plain" text files usually contain a text encoded by 8 bits per 1 character, your editor interprets each binary octet (each byte) as an integer number containing a character's code, and displays the appropriate character. For some codes, there are no printable characters in the encoding table; these characters are usually displayed with squares, question marks or (as in your case) with their numerical (hexadecimal) codes.
Some editors can show a pure hexadecimal representation of a file, and it's a rather convenient feature for low-level data analysis, since hexadecimals are compact and can be quite easily converted to a binary representation.
What you pasted is such a mixed view: the ASCII representation of the characters your software is able to display, with hexadecimal escapes for the rest.