Use special characters other than "\n" and "\0" in C

I have one question.
I'm writing some code in C, on UNIX.
I need to write a special character in a file, because I need to divide my file in small sections.
Example:
'SPECIAL_CHARACTER'
section 1 with some text
'SPECIAL_CHARACTER'
section 2 with some text
etc..
I was thinking of using the character '\1'. It seems to work, but is it OK? Or is it wrong?
What should I do to accomplish this without using characters like "\0" or "\n"?

I hear two different questions where you ask "Or is it wrong?"
I hear you asking "how can I designate a separator byte in my code?", and I hear you asking "what is a good choice for a separator byte?"
First, fundamentally, what you are asking about is covered in section 6.4.4.4 of the C language specification, which covers "C Character Constants". There are various places you can look up the formal C language spec, or you can search for "C Character Constants" for perhaps a friendlier description, etc.
In detail, a handful of letters can be used in escape sequences to stand in for single bytes of specific values; e.g., \n is one of those, as a stand-in for 0x0a (decimal 10), a byte designated (in ASCII) as a newline. Here are the legal ones:
\a \b \f \n \r \t \v
The escape sequences \0 and \1 work because C supports \ followed by one to three octal digits as an escape for the byte with that octal value. So that will also work with, say, \3 and \35, but not \9 (9 is not an octal digit), and note that \35 has a decimal value of 29. (Google "octal values" if you don't immediately see why that's the case.)
There are other legal escape sequences:
\' \" \\ \? : ' " \ and ?, respectively
\xNNNN... : each 'N' can be a hexadecimal digit
And, of course, escape sequences are just one aspect of C character constants.
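For instance, here is a minimal illustration of letter, octal and hex escapes in character constants (the variable names are just for demonstration):
#include <stdio.h>

int main(void) {
    char newline = '\n';   /* letter escape: byte 0x0a (decimal 10) */
    char one     = '\1';   /* octal escape: byte 0x01 */
    char rs      = '\35';  /* octal 35 == decimal 29 */
    char esc     = '\x1b'; /* hex escape: byte 0x1b (decimal 27) */
    printf("%d %d %d %d\n", newline, one, rs, esc);  /* prints 10 1 29 27 */
    return 0;
}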
Second, whether or not you should use a given byte value as your file's section separator depends entirely on how your program will be used. As others have pointed out in the comments, there are commonplace prevailing practices on what sort of byte value to use for this sort of thing.
I personally agree that 0x1e makes perhaps the most sense since in ASCII it is the "record separator". Conforming to ASCII can matter if the data will need to be understood by other programs, or if your program will need to be understood by other people.
On the other hand, a simple code comment can make it clear to anyone reading your code what byte value you are using for separating sections of your data file, and any program that needs to understand your data files needs to 'know' a lot more about the file format than just what the record separator is. There is nothing magical about 0x1e : it is merely a convention, and a reserved spot on the ASCII table to facilitate a common need -- that is, record separation of text that could contain normal text separators like space, newline, and null.
Broadly, any byte value that won't show up in the contents of your sections would make a fine section separator. Since you say those contents will be text, there are well over 100 choices, even if you exclude \0 (0x00) and \n (0x0a). In ASCII, a handful of byte values have been set aside for this sort of purpose, so that helps reduce the choice from several dozen to just several. Even among those several, there are only a few commonly used as separators.
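For example, here is a minimal sketch (the file name and section text are invented for illustration) that writes two sections separated by 0x1e and then counts the separators back:
#include <stdio.h>

#define SECTION_SEP '\x1e'   /* ASCII "record separator" */

int main(void) {
    FILE *f = fopen("data.txt", "w");
    if (!f) return 1;
    fputc(SECTION_SEP, f);
    fputs("section 1 with some text\n", f);
    fputc(SECTION_SEP, f);
    fputs("section 2 with some text\n", f);
    fclose(f);

    f = fopen("data.txt", "r");
    if (!f) return 1;
    int c, sections = 0;
    while ((c = fgetc(f)) != EOF)
        if (c == SECTION_SEP)
            sections++;
    fclose(f);
    printf("found %d sections\n", sections);  /* prints: found 2 sections */
    return 0;
}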

Related

Restrictions to Unicode escape sequences in C11

Why is there a restriction for Unicode escape sequences (\unnnn and \Unnnnnnnn) in C11 such that only those characters outside of the basic character set may be represented? For example, the following code results in the compiler error: \u000A is not a valid universal character. (Some Unicode "dictionary" sites even give this invalid format as canon for the C/C++ languages, though admittedly these are likely auto-generated):
static inline int test_unicode_single() {
    return strlen(u8"\u000A") > 1;
}
While I understand that it's not exactly necessary for these basic characters to be supported, is there a technical reason why they're not? Something like not being able to represent the same character in more than one way?
It's precisely to avoid alternative spellings.
The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:
allow identifiers to include letters outside of the basic source character set (like ñ, for example).
allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.
Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.
That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:
The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.
The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.
In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.
What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00F1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.
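A small C11 sketch of that equivalence (assuming a UTF-8 source file and a compiler that supports u8 string literals):
#include <stdio.h>
#include <string.h>

int main(void) {
    /* In a UTF-8 source file both literals end up as the same two-byte
       sequence 0xC3 0xB1, whether spelled directly or as a UCN. */
    const char *direct = u8"ñ";
    const char *ucn    = u8"\u00F1";
    printf("%d\n", strcmp(direct, ucn) == 0);  /* prints 1 */
    return 0;
}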
Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.
Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:
What happens if I put \u0022 inside a string literal:
const char* quote = "\u0022";
If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.
Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the character literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).
And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?
And so on, for a large number of similar issues.
All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.
Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, nor does it contain the ASCII characters $, @ and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.
Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:
const char* s = "\\
n";
Because continuation lines are dealt with in Phase 1, prior to lexical analysis, and Phase 1 only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as
const char* s = "\n";
But that might not have been obvious looking at the original code.

How can I print a string with the same length with or without multicharacters?

I am trying to do exercise 1-22 in the K&R book. It asks you to fold long lines (i.e. continue them on a new line) after a predefined number of characters.
While testing the program it seemed to work well, but I saw that some lines were "folding" earlier than they should. I noticed that it happened on the lines containing special characters, such as:
ö ş ç ğ
So, my question is, how do I ensure that lines are printed with the same maximum length with or without multicharacters?
What happens in your code?
K&R was written at a time when every character was encoded in one single char. Examples of such encoding standards are ASCII and ISO 8859.
Nowadays the leading encoding standard is Unicode, which comes in several flavors. The UTF-8 encoding is used to represent the thousands of Unicode characters on 8-bit bytes, using a variable-length scheme:
the ASCII characters (i.e. 0x00 to 0x7F) are encoded on a single byte.
all other characters are encoded on 2 to 4 bytes.
So the letter ö and the others in your list are encoded as 2 consecutive bytes. Unfortunately, the standard C library and the algorithms of K&R do not handle variable-length encodings, so each of your special characters is counted as two and your algorithm is tricked.
How to solve it?
There is no easy way. You must make a distinction between the length of the strings in memory, and the length of the strings when they are displayed.
I can propose a trick that uses a property of the encoding scheme: whenever you count the display length of a string, just ignore the bytes c that satisfy the condition (c & 0xC0) == 0x80 (these are UTF-8 continuation bytes).
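A minimal sketch of that trick (the function name is mine):
#include <stddef.h>

/* Count display columns of a UTF-8 string by skipping continuation
   bytes, i.e. bytes of the form 10xxxxxx ((c & 0xC0) == 0x80).
   This assumes one column per code point, which is good enough for
   characters like ö, ş, ç, ğ but not for combining marks or wide glyphs. */
size_t utf8_display_length(const char *s)
{
    size_t len = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            len++;
    return len;
}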
Another way would be to use wide characters wchar_t/wint_t (requires the header wchar.h) instead of char/int, and getwc()/putwc() instead of getc()/putc(). If sizeof(wchar_t) is 4 in your environment, you will be able to work with Unicode just by using the wide characters and wide library functions instead of the normal ones mentioned in K&R. If sizeof(wchar_t) is smaller (for example 2), you can still work correctly with a large subset of Unicode, but could run into issues with characters outside that subset in some cases.
As in the comment, your string is probably encoded in UTF-8. That means that some characters, including the ones you mention, use more than one byte. If you simply count bytes to determine the width of your output, your computed value may be too large.
To properly determine the number of characters in a string with multibyte characters, use a function such as mbrlen(3).
You can use mbrtowc(3) to find out the number of bytes of the first character in a string, if you're counting character for character.
This of course goes way beyond the scope of the K&R book, which was written before multibyte encodings were in common use.
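For reference, a sketch of counting characters with mbrtowc(3); it assumes setlocale has selected a locale matching the string's encoding (e.g. a UTF-8 locale), and the function name is mine:
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Count the multibyte characters in s using mbrtowc; returns (size_t)-1
   on an invalid or truncated sequence. */
size_t mb_char_count(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    size_t count = 0, n = strlen(s);
    while (n > 0) {
        size_t len = mbrtowc(NULL, s, n, &st);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        s += len;
        n -= len;
        count++;
    }
    return count;
}

int main(void) {
    setlocale(LC_CTYPE, "");
    printf("%zu\n", mb_char_count("öşçğ"));  /* prints 4 in a UTF-8 locale */
    return 0;
}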

Why does this scanf() conversion actually work?

Ah, the age old tale of a programmer incrementally writing some code that they aren't expecting to do anything more than expected, but the code unexpectedly does everything, and correctly, too.
I'm working on some C programming practice problems, and one was to redirect stdin to a text file that had some lines of code in it, then print it to the console with scanf() and printf(). I was having trouble getting the newline characters to print as well (since scanf typically eats up whitespace characters) and had typed up a jumbled mess of code involving multiple conditionals and flags when I decided to start over and ended up typing this:
(where c is a character buffer large enough to hold the entirety of the text file's contents)
scanf("%[a-zA-Z -[\n]]", c);
printf("%s", c);
And, voila, this worked perfectly. I tried to figure out why by creating variations on the character class (between the outside brackets), such as:
[\w\W -[\n]]
[\w\d -[\n]]
[. -[\n]]
[.* -[\n]]
[^\n]
but none of those worked. They all ended up reading either just one character or producing a jumbled mess of random characters. '[^\n]' doesn't work because the text file contains newline characters, so it only prints out a single line.
Since I still haven't figured it out, I'm hoping someone out there would know the answer to these two questions:
Why does "[a-zA-Z -[\nn]]" work as expected?
The text file contains letters, numbers, and symbols (':', '-', '>', maybe some others); if 'a-z' is supposed to mean "all characters from unicode 'a' to unicode 'z'", how does 'a-zA-Z' also include numbers?
It seems like the syntax for what you can enter inside the brackets is a lot like regex (which I'm familiar with from Python), but not exactly. I've read up on what can be used from trying to figure out this problem, but I haven't been able to find any info comparing whatever this syntax is to regex. So: how are they similar and different?
I know this probably isn't a good usage for scanf, but since it comes from a practice problem, real world convention has to be temporarily ignored for this usage.
Thanks!
You are picking up numbers because you have " -[" in your character set. This means all characters from space (32) to open-bracket (91), which includes numbers in ASCII (48-57).
Your other examples include this as well, but they are missing the "a-zA-Z", which lets you pick up the lower-case letters (97-122). Sequences like '\w' are treated as unknown escape sequences in the string itself, so \w just becomes a single w. . and * are taken literally. They don't have a special meaning like in a regular expression.
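A small demonstration of that range effect (this relies on '-' denoting a range inside a scanset, which, as the next answer notes, is implementation-defined but common, e.g. in glibc):
#include <stdio.h>

int main(void) {
    char buf[100];
    /* The scanset a-zA-Z -[ also covers every byte from space (32)
       through '[' (91) in ASCII, so digits and symbols like ':' and '>'
       are matched, but '\n' is not. */
    if (sscanf("line -> 42: ok\nnext", "%99[a-zA-Z -[]", buf) == 1)
        printf("matched: \"%s\"\n", buf);   /* stops at the newline */
    return 0;
}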
If you include - inside the [ (other than at the beginning or end) then the behaviour is implementation-defined.
This means that your compiler documentation must describe the behaviour, so you should consult that documentation to see what the defined behaviour is, which would explain why some of your code worked and some didn't.
If you want to write portable code then you can't use - as anything other than matching a hyphen.

C Removing Newlines in a Portable and International Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix and Mac, all which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing trailing n bytes and replacing final \0. Seems like a more brittle way of doing the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers from Weather Vane and Loic, respectively, for my final solution. What worked was to use the handy strcspn function to break on the first newline character, as selected from Loic's provided links. Thus, I can select delimiters based on a number of supported locales. It is a good point that there are too many to support generically at this level; I didn't even know that there were several competing encodings for Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.
The best one I know is
buffer [ strcspn(buffer, "\r\n") ] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
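A hedged usage sketch with fgets (the buffer size is arbitrary):
#include <stdio.h>
#include <string.h>

int main(void) {
    char buffer[256];
    while (fgets(buffer, sizeof buffer, stdin)) {
        /* Truncate at the first '\r' or '\n'; if neither is present,
           strcspn returns strlen(buffer) and this just rewrites the final '\0'. */
        buffer[strcspn(buffer, "\r\n")] = 0;
        printf("[%s]\n", buffer);
    }
    return 0;
}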
I suggest replacing one or more whitespace characters with one standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters and no-break space). Notice that US-ASCII is a subset of ISO-8859-1.
But take into account that Windows-1251 (https://en.wikipedia.org/wiki/Windows-1251) assigns different, visible (non-control) characters to the range 0x80..0x9F. In this case, those bytes cannot be replaced by spaces without loss of textual information.
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Also take into account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-Western countries (for instance Russia or Japan), other character encodings are also common. Numerous encodings exist, but it probably does not make sense to try to support each and every known encoding.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
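For what it's worth, here is a minimal sketch of the whitespace-collapsing idea above, assuming the buffer really is ISO-8859-1 (the function name is mine):
#include <stddef.h>

/* Collapse runs of ISO-8859-1 "whitespace" (C0 controls, space, DEL,
   C1 controls, no-break space) into a single ASCII space, in place.
   Not suitable for Windows-1251 or UTF-8 input, where bytes 0x80..0x9F
   carry other meanings. */
static void collapse_whitespace_latin1(char *s)
{
    size_t out = 0;
    int in_space = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        int is_ws = (c <= 0x20) || (c >= 0x7F && c <= 0xA0);
        if (is_ws) {
            if (!in_space)
                s[out++] = ' ';
            in_space = 1;
        } else {
            s[out++] = (char)c;
            in_space = 0;
        }
    }
    s[out] = '\0';
}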
This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>

template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
    // Translate character into default locale and character type.
    // Then, test against '\n', which is the only newline character there.
    return std::use_facet< std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
#include <string>

void remove_trailing_newlines(std::string & str) {
    while (!str.empty() && is_newline(*str.rbegin()))
        str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.

What is the name given to those 3 character constructs that represent another character or characters

Apologies for the vagueness; I barely know how to pose this question.
Can anyone tell me the name of that family of 3 character constructs that represent another character or characters?
I think they were used in the old VT100 terminal days.
I know C supports them.
They are called trigraphs. There are also two-character codes called digraphs.
They are called trigraph sequences. E.g. ??/ maps to \. You have to take care to remember this when building regular expression-type parsers for C code.
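As an illustrative sketch (trigraph replacement is disabled by default in many compilers, e.g. gcc needs -trigraphs or a strict pre-C23 -std mode, and the feature was removed in C23):
#include <stdio.h>

int main(void)
{
    /* ??( and ??) are the trigraphs for [ and ]; ??< and ??> stand for { and }. */
    int a??(3??) = ??< 1, 2, 3 ??>;   /* same as: int a[3] = { 1, 2, 3 }; */
    printf("%d\n", a??(1??));         /* prints 2 */
    return 0;
}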
