Removing Newlines in C in a Portable and Internationalization-Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix and Mac, all of which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing the trailing n bytes and writing a new final \0. Seems like a more brittle way of doing the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers from Weather Vane and Loic, respectively, for my final solution. What worked was to use the handy strcspn function to break on the first newline character, as selected from Loic's provided links. Thus, I can select delimiters based on a number of supported locales. It is a good point that there are too many encodings to support generically at this level; I didn't even know that there were several competing encodings for Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.

The best one I know is
buffer[strcspn(buffer, "\r\n")] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
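Putting the fgets-plus-strcspn idiom together, a minimal helper might look like this (the name chomp is my own, not from the answer):

```c
#include <stdio.h>
#include <string.h>

/* Strip a trailing newline in place, whatever its flavour: "\r\n",
   "\n", "\r", or none at all.  strcspn returns the index of the first
   '\r' or '\n', or strlen(s) if neither occurs, so writing the
   terminator at that index is always safe. */
static void chomp(char *s)
{
    s[strcspn(s, "\r\n")] = 0;
}

/* Typical use after fgets:
 *
 *   char buffer[256];
 *   if (fgets(buffer, sizeof buffer, stdin) != NULL)
 *       chomp(buffer);
 */
```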

I suggest replacing one or more whitespace characters with a single standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters and no-break space). Notice that US-ASCII is a subset of ISO-8859-1.
But take into account that Windows-1251 (https://en.wikipedia.org/wiki/Windows-1251) assigns different, visible (non-control) characters to the range 0x80..0x9F. In this case, those bytes cannot be replaced by spaces without loss of textual information.
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Also take into account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-Western countries (for instance Russia or Japan), other character encodings are also common. Numerous encodings exist, but it probably does not make sense to try to support each and every known one.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
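As a sketch of the replacement described above, restricted to the ISO-8859-1 case only (the function name and the in-place signature are my own choices):

```c
#include <stddef.h>

/* Replace every ISO-8859-1 whitespace/control byte with a plain space,
   in place.  The ranges follow the answer above: 0x00..0x20 (C0
   controls and space) and 0x7F..0xA0 (DEL, C1 controls, no-break
   space).  Only valid for ISO-8859-1 input: in Windows-1251 the
   0x80..0x9F range carries printable characters and must not be
   touched. */
static void normalize_ws_latin1(unsigned char *s)
{
    for (; *s != '\0'; s++)
        if (*s <= 0x20 || (*s >= 0x7F && *s <= 0xA0))
            *s = ' ';
}
```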

This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>
template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
    // Translate character into default locale and character type.
    // Then, test against '\n', which is the only newline character there.
    return std::use_facet<std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
void remove_trailing_newlines(std::string & str) {
    while (!str.empty() && is_newline(*str.rbegin()))
        str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.

Related

Splitting string in C by blank spaces, besides when said blank space is within a set of quotes

I'm writing a simple Lisp in C without any external dependencies (please do not link the BuildYourOwnLisp), and I'm following this guide as a basis to parse the Lisp. It describes two steps in tokenising an S-exp, those steps being:
Put spaces around every parenthesis
Split on white space
The first step is easy enough, I wrote a trivial function that replaces certain substrings with other substrings, but I'm having problems with the second step. In the article it only uses the string "Lisp" in its examples of S-exps; if I were to use strtok() to blindly split by whitespace, any string in a S-exp that had a space within it would become fragmented and interpreted incorrectly by my Lisp. Obviously, a language limited to single-word strings isn't very useful.
How would I write a function that splits a string by white space, besides when the text is in between two double quotes?
I've tried using regex, but from what I can see of the POSIX regex.h library and PCRE, just extracting the matches would be incredibly laborious in terms of the amount of auxiliary code I'd have to write, which would only serve to bloat my codebase. Besides, one of my goals with this project was to use only ANSI C, or, if need be, C99, solely for the sake of portability - fiddling with the POSIX library and the Win32 API would just fatten my code and make moving my Lisp around a nightmare.
When researching this problem I came across this StackOverflow answer; but the approved answer only sends the tokenised string onto stdout, which isn't useful for me; I'd ideally have the tokens in a char** so that I could then parse them into useful in memory data structures.
As well as this, the approved answer on the aforementioned SO question is written to be restricted to specifically my problem - ideally, I'd have a general-purpose function that would allow me to tokenise a string, except when a substring is between two of character x. This isn't a huge deal; it's just that I'd like my codebase to be clean and composable.
You have two delimiters: the space and double quotes.
You can use the strcspn (or with example: cppreference - strcspn) function for that.
Iterate over the string and look for the delimiters (space and quotes); strcspn returns the index of the first such delimiter found. If a space was found, continue looking for both. If a double quote was found, the delimiter set changes from " \"" (space and quotes) to "\"" (double quotes only). If you then hit the quotes again, change the delimiter set back to " \"" (space and quotes).
Based on your comment:
Let's say you have a string like
This is an Example.
The output would be
This
is
an
Example.
If the string would look like
This "is an" Example.
The output would be
This
is an
Example.
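A minimal sketch of that delimiter-switching idea using strspn/strcspn could look as follows. The name split_quoted is hypothetical; it tokenises in place and deliberately ignores escaped quotes and unterminated strings:

```c
#include <stddef.h>
#include <string.h>

/* Split `s` in place on spaces, treating anything between a pair of
   double quotes as a single token.  Pointers to the tokens are stored
   in `out` (at most `max` of them); the token count is returned. */
static size_t split_quoted(char *s, char **out, size_t max)
{
    size_t n = 0;

    while (*s != '\0' && n < max) {
        s += strspn(s, " ");          /* skip leading spaces            */
        if (*s == '\0')
            break;
        if (*s == '"') {              /* quoted: scan to closing quote  */
            out[n++] = ++s;
            s += strcspn(s, "\"");
        } else {                      /* plain: scan to space or quote  */
            out[n++] = s;
            s += strcspn(s, " \"");
        }
        if (*s != '\0')
            *s++ = '\0';              /* terminate token, step past it  */
    }
    return n;
}
```

On the input `This "is an" Example.` this yields the three tokens `This`, `is an`, and `Example.`, matching the output shown above.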

Why does this scanf() conversion actually work?

Ah, the age-old tale of a programmer incrementally writing some code that they expect to do nothing more than one small part of the job, only for the code to unexpectedly do everything, and correctly, too.
I'm working on some C programming practice problems, and one was to redirect stdin to a text file that had some lines of code in it, then print it to the console with scanf() and printf(). I was having trouble getting the newline characters to print as well (since scanf typically eats up whitespace characters) and had typed up a jumbled mess of code involving multiple conditionals and flags when I decided to start over and ended up typing this:
(where c is a character buffer large enough to hold the entirety of the text file's contents)
scanf("%[a-zA-Z -[\n]]", c);
printf("%s", c);
And, voila, this worked perfectly. I tried to figure out why by creating variations on the character class (between the outside brackets), such as:
[\w\W -[\n]]
[\w\d -[\n]]
[. -[\n]]
[.* -[\n]]
[^\n]
but none of those worked. They all ended up reading either just one character or producing a jumbled mess of random characters. '[^\n]' doesn't work because the text file contains newline characters, so it only prints out a single line.
Since I still haven't figured it out, I'm hoping someone out there would know the answer to these two questions:
Why does "[a-zA-Z -[\n]]" work as expected?
The text file contains letters, numbers, and symbols (':', '-', '>', maybe some others); if 'a-z' is supposed to mean "all characters from unicode 'a' to unicode 'z'", how does 'a-zA-Z' also include numbers?
It seems like the syntax for what you can enter inside the brackets is a lot like regex (which I'm familiar with from Python), but not exactly. I've read up on what can be used from trying to figure out this problem, but I haven't been able to find any info comparing whatever this syntax is to regex. So: how are they similar and different?
I know this probably isn't a good usage for scanf, but since it comes from a practice problem, real world convention has to be temporarily ignored for this usage.
Thanks!
You are picking up numbers because you have " -[" in your character set. This means all characters from space (32) to open-bracket (91), which includes numbers in ASCII (48-57).
Your other examples include this as well, but they are missing the "a-zA-Z", which lets you pick up the lower-case letters (97-122). Sequences like '\w' are treated as unknown escape sequences in the string itself, so \w just becomes a single w. . and * are taken literally. They don't have a special meaning like in a regular expression.
If you include - inside the [ (other than at the beginning or end) then the behaviour is implementation-defined.
This means that your compiler documentation must describe the behaviour, so you should consult that documentation to see what the defined behaviour is, which would explain why some of your code worked and some didn't.
If you want to write portable code then you can't use - as anything other than matching a hyphen.
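A portable way to get the same effect, echoing input including the newlines, is to rely only on the well-defined scanset %[^\n] and consume each newline separately. This helper (copy_lines and its buffer size are my own choices) sketches the idea:

```c
#include <stdio.h>

/* Copy `in` to `out` line by line using only the portable scanset
   %[^\n] ("everything except newline").  The newline itself is read
   with fgetc, because %[...] fails rather than matching an empty
   sequence when it immediately hits a '\n' (i.e. on blank lines). */
static void copy_lines(FILE *in, FILE *out)
{
    char line[1024];
    int c;

    for (;;) {
        if (fscanf(in, "%1023[^\n]", line) == 1)
            fputs(line, out);
        c = fgetc(in);               /* the '\n' the scanset refused */
        if (c == EOF)
            break;
        fputc(c, out);
    }
}
```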

Style to write code in c (UTF-8)

In my code I use names of people. For example one of them is:
const char *translators[] = {"Jörgen Adam <adam#***.de>", NULL};
which contains ö, 'LATIN SMALL LETTER O WITH DIAERESIS'.
When I write code, which format is right to use?
UTF-8:
Jörgen Adam
or
UTF-8(hex):
J\xc3\xb6rgen Adam
UPDATE:
Text with name will be print in GTK About Dialog (name of translators)
The answer depends a lot on whether this is in a comment or a string.
If it's in a comment, there's no question: you should use raw UTF-8, so it should appear as:
/* Jörgen Adam */
If the user reading the file has a misconfigured/legacy system that treats text as something other than UTF-8, it will appear in some other way, but this is just a comment so it won't affect code generation, and the ugliness is their problem.
If on the other hand the UTF-8 is in a string, you probably want the code to be interpreted correctly even if the compile-time character set is not UTF-8. In that case, your safest bet is probably to use:
"J\xc3\xb6rgen Adam"
It might actually be safe to use the UTF-8 literal there too; I'm not 100% clear on C's specification of the handling of non-wide string literals and compile-time character set. Unless you can convince yourself that it's formally safe and not broken on a compiler you care to support, though, I would just stick with the hex.
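To see why the hex escapes are the robust spelling: 'ö' in UTF-8 is exactly the two bytes 0xC3 0xB6, and the escapes pin those bytes down regardless of the compile-time character set. The helper below is purely illustrative:

```c
#include <stddef.h>
#include <string.h>

/* "Jörgen Adam" spelled with hex escapes occupies 12 bytes for its
   11 visible characters, because ö is the two-byte UTF-8 sequence
   0xC3 0xB6.  utf8_name_len() just exposes that count. */
static size_t utf8_name_len(void)
{
    const char *name = "J\xc3\xb6rgen Adam";
    return strlen(name);   /* 12: two bytes for ö plus 10 ASCII bytes */
}
```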

C CSV API for unicode

I need a C API for manipulating CSV data that can work with Unicode. I am aware of libcsv (sourceforge.net/projects/libcsv), but I don't think that will work for Unicode (please correct me if I'm wrong) because I don't see wchar_t being used.
Please advise.
It looks like libcsv does not use the C string functions to do its work, so it almost works out of the box, in spite of its mbcs/ws ignorance. It treats the string as an array of bytes with an explicit length. This might mostly work for certain wide character encodings that pad out ASCII bytes to fill the width (so newline might be encoded as "\0\n" and space as "\0 "). You could also encode your wide data as UTF-8, which should make things a bit easier. But both approaches might founder on the way libcsv identifies space and line terminator tokens: it expects you to tell it on a byte-to-byte basis whether it's looking at a space or terminator, which doesn't allow for multibyte space/term encodings. You could fix this by modifying the library to pass a pointer into the string and the length left in the string to its space/term test functions, which would be pretty straightforward.
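The "encode your wide data as UTF-8" route works because no byte of a UTF-8 multibyte sequence falls in the ASCII range, so a byte-oriented scan for delimiters such as ',' can never split a character in half. A tiny illustration (count_fields is my own name, not part of libcsv):

```c
#include <stddef.h>

/* Count the fields of one CSV row by scanning bytes for ','.  Safe on
   UTF-8 input: every byte of a multibyte UTF-8 sequence is >= 0x80,
   so an ASCII delimiter byte can only ever be a real delimiter.
   (Quoting rules are ignored here; this only shows the byte-scan
   property that lets byte-oriented parsers handle UTF-8.) */
static size_t count_fields(const char *row)
{
    size_t fields = 1;
    for (; *row != '\0'; row++)
        if (*row == ',')
            fields++;
    return fields;
}
```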

Removing diacritic symbols from UTF8 string in C

I am writing a C program to search a large number of UTF-8 strings in a database. Some of these strings contain English characters with diacritics, such as accents, etc. The search string is entered by the user, so it will most likely not contain such characters. Is there a way (function, library, etc.) to remove these characters from a string, or just perform a diacritic-insensitive search? For example, if the user enters the search string "motor", it should match the string "motörhead".
My first attempt was to manually strip out the combining diacritical marks described here:
http://en.wikipedia.org/wiki/Combining_character
This worked in some cases, but it turns out many of these characters also have specific Unicode values. For example, the character "ö" above can be represented by an "o" followed by the combining diaeresis U+0308, but it can also be represented by the single Unicode character U+00F6, and my method only filters the former.
I have also looked into iconv, which can convert from UTF8 to ASCII. However, I may want to localize my program at a future date, and this would no doubt cause problems for languages with non-English characters. Is there a way I can simply strip/convert these accented characters?
Edit: removed typo in question title.
Convert to one of the decomposed normalizations -- probably NFD, but you might want NFKD even -- that makes all diacritics into combining characters that can be stripped.
You will want a library for this. I hear good things about ICU.
Use ICU: create a collator over "root" with a strength of PRIMARY (L1), which only uses base letters (it cares only about 'o' and ignores 'ö'); then you can use ICU's search functions to match. There is newer functionality, the search collator, that provides special collators designed for this case, but primary strength will handle this specific one.
Example: "motor == mötor" in the 'collated' section.
