What is this character ` called? - c

I never noticed the character ` (the one on the same key as tilde ~). There is another single quote character ' on the same key as ". I see that the characters ` and ' aren't interchangeable, whereas ' and " are.
I lost a lot of time to this when compiling GTK programs: I got an error (file not found) and finally figured out that it's not a single quote.
What is the purpose of this ` character and when is it (or when should it) be used?
Thanks.

It's typically called a "backtick", and in bash, it is used for command substitution (although the $(cmd) construct is usually preferred due to easier nesting).
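C itself has no backtick operator. As a rough illustration of what command substitution does, the sketch below uses POSIX popen to run a command through the shell and capture the first line of its output. The function name capture_first_line is made up for this example, and unlike the shell's backticks it deliberately keeps only the first line.

```c
#define _POSIX_C_SOURCE 200809L /* exposes popen/pclose in strict modes */
#include <stdio.h>
#include <string.h>

/* Run cmd through the shell and copy the first line of its output into out,
   without the trailing newline. Returns 0 on success, -1 on failure.
   capture_first_line is a hypothetical helper, not a standard function. */
int capture_first_line(const char *cmd, char *out, size_t outsize)
{
    FILE *pipe;
    int got_line;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    pipe = popen(cmd, "r");           /* roughly what `cmd` does in a shell */
    if (pipe == NULL)
        return -1;

    got_line = fgets(out, (int)outsize, pipe) != NULL;
    pclose(pipe);
    if (!got_line) {
        out[0] = '\0';
        return -1;
    }

    out[strcspn(out, "\r\n")] = '\0'; /* drop the line terminator */
    return 0;
}
```

Calling capture_first_line("echo hello", buf, sizeof buf) should leave "hello" in buf on a system with a POSIX shell.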

` is variously known as a backtick or a grave accent (see http://en.wikipedia.org/wiki/Grave_accent).
In UNIX shells, as well as some scripting languages (Ruby, Perl...), it introduces some input to be executed in a subshell. In C and C++, it has no special purpose but can appear as a character literal or as part of a string literal. One reason it's not used for something more interesting is that the extremely wide portability of these languages spans machines where the character can't be expected to be on the keyboard, and where it may not display very differently from the single right quote "'" on screen and in printouts, making for extremely hard-to-see bugs.
In some word-processing and similar application programs, typing a backtick will insert a single left quote character "‘". Keyboard input software commonly lets a user type, say, "`e" to enter the character "è", or "`a" for "à", as used in some languages' alphabets.

I call it a "grave", as in a grave accent

In MySQL, it's used to surround identifiers when they might otherwise be ambiguous (such as using a reserved word as a table or column name). There are going to be lots of different uses of that character in lots of different pieces of software, just as there are for the other keys on the keyboard.

In a few languages, including PHP, Perl, and I think Ruby, backticks execute shell commands.
http://php.net/manual/en/language.operators.execution.php
The SQL thing mentioned is another use, which unfortunately I am well aware of because of co-workers who decided 'Desc' was a good name for a field

Related

Parsing shell commands in c: string cutting with respect to its contents

I'm currently creating a Linux shell to learn more about system calls.
I've already figured out most of the things. Parser, token generation, passing appropriate things to appropriate system calls: all of that works.
The thing is that, even before I start making tokens, I split the whole command string into separate words. It's based on an array of separators, and it works surprisingly well. Except that I'm struggling with adding additional functionality to it, like escape sequences or quotes. I can't really live without it, since even people using basic grep commands use arguments with quotes. I'll need to add functionality for:
' ' - ignore every other separator, operator or double quotes found between those two, pass this as one string, don't include these quotation marks into resulting word,
" "- same as above, but ignore single quotes,
\\ - escape this into single backslash,
\(space) - escape this into space, do not parse resulting space as separator
\", \' - analogously to the above.
Many other things that I haven't figured out I need yet
and every single one of them seems like an exception on its own. Each of them must operate on diversity of possible positions in commands, being included into result or not, having influence on the rest of the parsing. It makes my code look like big ball of mud.
Is there a better approach to do this? Is there a more general algorithm for that purpose?
You are trying to solve a classic problem in program analysis (of lexing and parsing) using a nontraditional structure for lexer ( I split whole command string into separate words... ). OK, then you will have non-traditional troubles with getting the lexer "right".
That doesn't mean that way is doomed to failure, and without seeing specific instances of your problem, (you list a set of constructs you want to handle, but don't say why these are hard to process), it is hard to provide any specific advice. It also doesn't mean that way will lead to success; splitting the line may break tokens that shouldn't be broken (usually by getting confused about what has been escaped).
The point of using a standard lexer (such as Flex or any of the 1000 variants you can get) is that they provide a proven approach to complex lexing problems, based generally on the idea that one can use regular expressions to describe the shape of individual lexemes. Thus, you get one regexp per lexeme type, thus an ocean of them but each one is pretty easy to specify by itself.
I've done ~40 languages using strong lexers and parsers (using one of the ones in that list). I assure you the standard approach is empirically pretty effective. The types of surprises are well understood and manageable. A nonstandard approach always has the risk that it will surprise you in a bad way.
Last remark: shell languages for Unix have had people adding crazy stuff for 40 years. Expect the job to be at least medium hard, and don't expect it to be pretty like Wirth's original Pascal.
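For a sense of what a lexer-style approach looks like without a generator such as Flex, here is a minimal hand-written sketch: a character-by-character state machine covering only the quoting and escaping rules listed in the question. The names split_words and MAX_WORD are made up for illustration. It is also an assumption of this sketch that a backslash escapes the next character inside double quotes as well as outside them; real shells are more selective.

```c
#include <stddef.h>

#define MAX_WORD 64 /* hypothetical per-word size limit for this sketch */

enum state { PLAIN, IN_SINGLE, IN_DOUBLE };

/* Split line into words, honoring '...', "..." and backslash escapes.
   Returns the number of words, or -1 on an unterminated quote, a trailing
   backslash, or when a limit is exceeded. */
int split_words(const char *line, char words[][MAX_WORD], int max_words)
{
    enum state st = PLAIN;
    int count = 0;   /* completed words */
    int in_word = 0; /* nonzero while a word is being built */
    size_t len = 0;  /* length of the word being built */
    const char *p;

    for (p = line; *p != '\0'; p++) {
        char c = *p;
        int literal = 0; /* nonzero if c must be copied into the word */

        /* Unquoted blanks end the current word, if any. */
        if (st == PLAIN && (c == ' ' || c == '\t')) {
            if (in_word) {
                words[count++][len] = '\0';
                in_word = 0;
            }
            continue;
        }

        /* Anything else starts or continues a word. */
        if (!in_word) {
            if (count >= max_words)
                return -1;
            in_word = 1;
            len = 0;
        }

        if (st == IN_SINGLE) {
            if (c == '\'')
                st = PLAIN;      /* closing quote, not copied */
            else
                literal = 1;     /* everything else is taken as-is */
        } else if (c == '\\') {
            c = *++p;            /* take the next character literally */
            if (c == '\0')
                return -1;       /* trailing backslash */
            literal = 1;
        } else if (c == '"') {
            st = (st == IN_DOUBLE) ? PLAIN : IN_DOUBLE;
        } else if (c == '\'' && st == PLAIN) {
            st = IN_SINGLE;
        } else {
            literal = 1;         /* ordinary character, or ' inside "..." */
        }

        if (literal) {
            if (len + 1 >= MAX_WORD)
                return -1;
            words[count][len++] = c;
        }
    }

    if (st != PLAIN)
        return -1;               /* unterminated quote */
    if (in_word)
        words[count++][len] = '\0';
    return count;
}
```

Because every rule is a transition of one small state variable, adding a new construct means adding a branch rather than another special case in the splitting code. A generated lexer gives the same structure with the rules written as regular expressions.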

C Removing Newlines in a Portable and International Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix and Mac, all which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing trailing n bytes and replacing final \0. Seems like a more brittle way of doing the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers from Weather Vane and Loic, respectively, for my final solution. What worked was to use the handy strcspn function to break on the first newline character as selected from Loic's provided links. Thus, I can select delimiters based on a number of supported locales. It is a good point that there are too many to support generically at this level; I didn't even know that there were several competing encodings for Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.
The best one I know is
buffer [ strcspn(buffer, "\r\n") ] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
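Wrapped in a function, the idiom above might look like the following sketch. The name chomp is made up here. Note that it cuts the string at the first \r or \n it finds, so it suits a buffer holding a single line (for example one filled by fgets), not text with embedded newlines that must be kept.

```c
#include <string.h>

/* Truncate buffer at the first '\r' or '\n', if there is one.
   strcspn returns the length of the leading run containing neither
   character, which is the index of the terminator when none is found,
   so the assignment is always within the string. Returns buffer. */
char *chomp(char *buffer)
{
    buffer[strcspn(buffer, "\r\n")] = '\0';
    return buffer;
}
```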
I suggest replacing one or more whitespace characters with one standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters and no-break space). Notice that US-ASCII is a subset of ISO-8859-1.
But take into account that Windows-1251 (https://en.wikipedia.org/wiki/Windows-1251) assigns different, visible (non-control) characters to the range 0x80..0x9F. In this case, those bytes cannot be replaced by spaces without loss of textual information.
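A minimal sketch of that suggestion, assuming the input really is ISO-8859-1 (or plain US-ASCII): every run of bytes in the ranges given above is collapsed to a single 0x20. The function names are made up for illustration. Since 0x00 terminates a C string, it never appears inside one. Do not apply this to UTF-8 or Windows-1251 text, where bytes in 0x80..0xA0 carry visible characters or parts of them.

```c
#include <stddef.h>

/* Whitespace as defined above for ISO-8859-1: C0 controls and space,
   then DEL, the C1 controls and no-break space. */
static int is_latin1_space(unsigned char b)
{
    return b <= 0x20 || (b >= 0x7F && b <= 0xA0);
}

/* Collapse each run of whitespace bytes in s to one space, in place.
   Returns the new length of the string. */
size_t collapse_whitespace(char *s)
{
    size_t in, out = 0;
    int in_run = 0; /* nonzero while inside a whitespace run */

    for (in = 0; s[in] != '\0'; in++) {
        if (is_latin1_space((unsigned char)s[in])) {
            if (!in_run)
                s[out++] = ' '; /* emit one space per run */
            in_run = 1;
        } else {
            s[out++] = s[in];
            in_run = 0;
        }
    }
    s[out] = '\0';
    return out;
}
```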
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Also take into account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-western countries (for instance Russia, Japan), further character encodings are also usual. Numerous encodings exist, but it probably does not make sense to try to support each and every known encoding.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>
#include <string>

template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
    // Translate character into default locale and character type.
    // Then, test against '\n', which is the only newline character there.
    return std::use_facet<std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
void remove_trailing_newlines(std::string & str) {
    while (!str.empty() && is_newline(*str.rbegin()))
        str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.

What is the name given to those 3 character constructs that represent another character or characters

Apologies for the vagueness; I barely know how to pose this question.
Can anyone tell me the name of that family of 3 character constructs that represent another character or characters?
I think they were used in the old VT100 terminal days.
I know C supports them.
They are called trigraphs. There are also two-character codes called digraphs.
They are called trigraph sequences. E.g. ??/ maps to \. You have to take care to remember this when building regular expression-type parsers for C code.
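To make the mapping concrete, here is a small sketch of the replacement a C translator performs on trigraphs, covering all nine: ??= ??/ ??' ??( ??) ??! ??< ??> ??- stand for # \ ^ [ ] | { } ~ respectively. The function name replace_trigraphs is made up, and the code avoids spelling any trigraph in its own source so that it compiles the same whether or not the compiler has trigraph support switched on.

```c
#include <string.h>

/* Replace every trigraph sequence in s with the character it stands for,
   in place. Returns s. */
char *replace_trigraphs(char *s)
{
    /* Third character of each trigraph, and the character it maps to. */
    static const char third[]   = "=/'()!<>-";
    static const char meaning[] = "#\\^[]|{}~";
    size_t in = 0, out = 0;

    while (s[in] != '\0') {
        const char *hit = NULL;

        /* Two question marks followed by one of the nine known characters. */
        if (s[in] == '?' && s[in + 1] == '?' && s[in + 2] != '\0')
            hit = strchr(third, s[in + 2]);

        if (hit != NULL) {
            s[out++] = meaning[hit - third];
            in += 3;
        } else {
            s[out++] = s[in++];
        }
    }
    s[out] = '\0';
    return s;
}
```

A parser for C code that matches on raw text before this step can be fooled, for instance by a line that ends in ??/ and is therefore spliced onto the next one.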

Removing diacritic symbols from UTF8 string in C

I am writing a C program to search a large number of UTF-8 strings in a database. Some of these strings contain English characters with diacritics, such as accents, etc. The search string is entered by the user, so it will most likely not contain such characters. Is there a way (function, library, etc) which can remove these characters from a string, or just perform a diacritic-insensitive search? For example, if the user enters the search string "motor", it should match the string "motörhead".
My first attempt was to manually strip out the combining diacritic modifiers described here:
http://en.wikipedia.org/wiki/Combining_character
This worked in some cases, but it turns out many of these characters also have specific Unicode values. For example, the character "ö" above can be represented by an "o" followed by the combining diacritic U+0308, but it can also be represented by the single Unicode character U+00F6, and my method only filters the former.
I have also looked into iconv, which can convert from UTF8 to ASCII. However, I may want to localize my program at a future date, and this would no doubt cause problems for languages with non-English characters. Is there a way I can simply strip/convert these accented characters?
Edit: removed typo in question title.
Convert to one of the decomposed normalizations -- probably NFD, but you might want NFKD even -- that makes all diacritics into combining characters that can be stripped.
You will want a library for this. I hear good things about ICU.
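The stripping step that follows normalization can be sketched in plain C. This assumes the string is valid UTF-8 that has already been converted to NFD by a library such as ICU (that step is not shown), and it removes only the Combining Diacritical Marks block U+0300..U+036F, which UTF-8 encodes as 0xCC 0x80..0xBF and 0xCD 0x80..0xAF. Other combining blocks exist and are ignored here. The function name is made up for illustration.

```c
#include <stddef.h>

/* Remove code points U+0300..U+036F from a UTF-8 string, in place.
   Assumes valid UTF-8, already in a decomposed form such as NFD.
   Returns the new length in bytes. */
size_t strip_combining_marks(char *s)
{
    const unsigned char *in = (const unsigned char *)s;
    char *out = s;

    while (*in != '\0') {
        if ((in[0] == 0xCC && in[1] >= 0x80 && in[1] <= 0xBF) ||
            (in[0] == 0xCD && in[1] >= 0x80 && in[1] <= 0xAF)) {
            in += 2;               /* skip the two-byte combining mark */
        } else {
            *out++ = (char)*in++;  /* keep everything else */
        }
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Given "o" followed by U+0308 the function leaves a plain "o", but a precomposed U+00F6 (bytes 0xC3 0xB6) passes through untouched, which is exactly why the normalization step has to come first.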
Use ICU: create a collator over "root" with strength PRIMARY (L1), which only uses base letters (it only cares about 'o' and ignores the difference from 'ö'), then use ICU's search functions to match. There's also a newer "search collator" functionality that provides special collators designed for this case, but primary strength will handle this specific case.
Example: "motor == mötor" in the 'collated' section.

Non-english alpha-numerics in a text file

C# WinForm application
EDIT: It appears there's concern about foreign language compatibility.
This is a non-issue.
The card game I'm making this utility for is primarily in English. In the future I may support other languages, but everything will still be keyed off the English names, which are a primary key in both the program and the rules of the game.
I can simply add additional tables with the English name, followed by the translated text, and everything should be fine.
.
Part of my program reads input from a text file containing names, and compares it to another list of names.
Sometimes these names have non-english letters, particularly accented "o" and the Latin AE in the input file.
When this text input is compared to names, those non-english characters are causing problems.
I'd like to find a way to overlay these characters with the english counterpart in most cases, such as "[accented o]" -> "o"
.
I'm perfectly content to code a find/replace table (I only expect 12-30 problem characters), but I've got some roadblocks.
1) Hardcoding the find/replace table (in the ".cs" file) gives me errors, because the compiler doesn't like the characters.
Anyone know a trick to fix this, or do I just have to create a Find/Replace text file that would be read before this process?
2) Identifying the letters is frustrating, but I'll only reach the replace logic if a match isn't found.
This occurs when the non-english characters cause a mismatch, or it isn't in the list yet.
I'm not too worried about the inefficiency of a char-by-char check of each unmatched string, as this is a manual update process triggered every three months.
Presumably getting down to the binary-code level of a single character should work, but I haven't gotten this to work.
3) The aforementioned [AE] character is used often, and it would be nice to at least allow the use of this character within the program, as I don't intend to replace it like I do the others.
I've loaded [AE] characters into my database with no problems, and searches using "Ae," "AE," and "[AE]" have posed no problem at the SQL-level, so I'm fine with that functionality.
It's just that searching for other non-english characters is less intuitive.
.
So there's my problem, which is actually more of a nuisance than anything serious. Still, any help or advice would be greatly appreciated.
Are you sure these names aren't meant to be different? Are you sure that you want all of "è", "é", "ê", and "ë" to mean the same thing?
Especially in "foreign" names, characters with different diacritical marks are likely intended to be different. After all, to the people whose names those are, these characters are not foreign.
