I need help filtering bad words in C?

I am trying to filter various bad words in a GTK application written in C. Here is the code I have so far:
char LowerEnteredUsername[EnteredUsernameLen + 1];
for (unsigned int i = 0; i < EnteredUsernameLen; i++) {
    LowerEnteredUsername[i] = tolower(EnteredUsername[i]);
}
LowerEnteredUsername[EnteredUsernameLen] = '\0';
if (strstr(LowerEnteredUsername, (char[]){LetterF, LetterU, LetterC, LetterK}) ||
    strstr(LowerEnteredUsername, (char[]){LetterF, LetterC, LetterU, LetterK})) {
gtk_message_dialog_set_markup((GtkMessageDialog*)Dialog, "This username seems to be inappropriate.");
UsernameErr = 1;
}
My issue is that it will only filter the last bad word specified in the if statement, in this example "fcuk". If I input "fuck", the code passes it as clean. How can I fix this?

(char[]){LetterF, LetterU, LetterC, LetterK}
(char[]){LetterF, LetterC, LetterU, LetterK}
You’ve forgotten to terminate your strings with a '\0'. That said, this approach doesn’t seem very effective at keeping bad words out of the source code, so I’d really suggest just writing ordinary string literals:
if (strstr(LowerEnteredUsername, "fuck") || strstr(LowerEnteredUsername, "fcuk")) {
Much clearer. If this is really, truly a no-go, then some other indirect but less error-prone ways are:
"f" "u" "c" "k"
or
#define LOWER_F "f"
#define LOWER_U "u"
#define LOWER_C "c"
#define LOWER_K "k"
and
LOWER_F LOWER_U LOWER_C LOWER_K
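Either way, the terminator has to be there. A quick sanity check of both forms (the function name looks_inappropriate is mine, and the macros are repeated here so the snippet stands alone):

```c
#include <assert.h>
#include <string.h>

#define LOWER_F "f"
#define LOWER_U "u"
#define LOWER_C "c"
#define LOWER_K "k"

/* Adjacent string literals are concatenated at compile time, and a
   string literal always carries its own terminating '\0'.  A compound
   literal, by contrast, needs the terminator written explicitly. */
static int looks_inappropriate(const char *lowered)
{
    return strstr(lowered, LOWER_F LOWER_U LOWER_C LOWER_K) != NULL
        || strstr(lowered, (char[]){'f', 'c', 'u', 'k', '\0'}) != NULL;
}
```

Without the explicit '\0' in the compound literal, strstr reads past the four characters into whatever happens to sit next on the stack, which is why the original code's behavior varied between the two words.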

Doing human-language text processing in C is painful because C's string types (char*/char[] and wchar_t*/wchar_t[]) are very low-level and not expressive enough to easily represent Unicode text, let alone locate word boundaries in text and match words against a known dictionary (also consider inflection, declension, plurals, and the use of diacritics to evade naive string matching).
For example, your program would need to handle George Carlin's famous "Seven Dirty Words" routine:
https://www.youtube.com/watch?v=vbZhpf3sQxQ
Someone was quite interested in these words. They kept referring to them: they called them bad, dirty, filthy, foul, vile, vulgar, coarse, in poor taste, unseemly, street talk, gutter talk, locker room language, barracks talk, bawdy, naughty, saucy, raunchy, rude, crude, lude, lascivious, indecent, profane, obscene, blue, off-color, risqué, suggestive, cursing, cussing, swearing... and all I could think of was: shit, piss, fuck, cunt, cocksucker, motherfucker, and tits!
This could be slightly modified to evade a naive filter, like so:
Someone was quite interested in these words. They kept referring to them: they called them bad, dirty, filthy, foul, vile, vulgar, coarse, in poor taste, unseemly, street talk, gutter talk, locker room language, barracks talk, bawdy, naughty, saucy, raunchy, rude, crude, lude, lascivious, indecent, profane, obscene, blue, off-color, risqué, suggestive, cursing, cussing, swearing... and all I could think of was: shít, pis$, phuck, c​unt, сocksucking, motherfúcker, and títs!
Above, some of the words have simple replacements (s to $), others have diacritics added (u to ú), and some are merely sound-alike respellings; however, some of the other words look identical to the originals but actually contain homographs or "invisible" characters such as Unicode's zero-width space, so they would evade naive text-matching systems.
So in short: avoid doing this in C. If you must, then use a robust and fully-featured Unicode handling library (i.e. do not use the C Standard Library's string functions like strstr, strtok, strlen, etc).
Here's how I would do it:
Read in input to a binary blob containing Unicode text (presumably UTF-8).
Use a Unicode library to:
Normalize the encoded Unicode text data (see https://en.wikipedia.org/wiki/Unicode_equivalence )
Identify word boundaries (assuming we're dealing with European-style languages that use sentences comprised of words).
Use a linguistics library and database (English alone is full of special-cases) to normalize each word to some singular canonical form.
Then look up each canonical form in a case-insensitive hash set of known "bad words".
Now, there are a few shortcuts you can take:
You can use regular-expressions to identify word-boundaries.
There exist Unicode-aware regular-expression libraries for C, for example PCRE2: http://www.pcre.org/current/doc/html/pcre2unicode.html
You can skip normalizing each word's inflections/declensions if you're happy with having to list those in your "bad word" list.
I would write working code for this example, but I'm short on time tonight (and it would be a lot of code); hopefully this answer provides you with enough information to figure out the rest yourself.
(Pro-tip: don't match strings in a list by checking each character - it's slow and inefficient. This is what hashtables and hashsets are for!)
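For illustration only, a minimal chained hash set in C might look like the sketch below (the names add_banned/is_banned are mine; it is ASCII-only and case-folds with tolower, so real input would still need the Unicode normalization described above):

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define NBUCKETS 64

struct node { const char *word; struct node *next; };
static struct node *buckets[NBUCKETS];

/* djb2 hash, case-folded so "FUCK" and "fuck" land in the same bucket. */
static unsigned long hash_ci(const char *s)
{
    unsigned long h = 5381;
    for (; *s; s++)
        h = h * 33 + (unsigned long)tolower((unsigned char)*s);
    return h;
}

/* Case-insensitive string equality, portable (no strcasecmp needed). */
static int ci_equal(const char *a, const char *b)
{
    while (*a && tolower((unsigned char)*a) == tolower((unsigned char)*b))
        a++, b++;
    return tolower((unsigned char)*a) == tolower((unsigned char)*b);
}

static void add_banned(const char *w)
{
    unsigned long b = hash_ci(w) % NBUCKETS;
    struct node *n = malloc(sizeof *n);
    if (!n)
        return;                       /* sketch: real code should report this */
    n->word = w;
    n->next = buckets[b];
    buckets[b] = n;
}

static int is_banned(const char *w)
{
    struct node *n;
    for (n = buckets[hash_ci(w) % NBUCKETS]; n; n = n->next)
        if (ci_equal(n->word, w))
            return 1;
    return 0;
}
```

Lookup cost stays roughly constant as the word list grows, which is the point of the pro-tip above.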

Related

Space character management in the Playfair cipher algorithm

I'm trying to implement this algorithm; you can find a good description here:
LINK - Chapter 11.6 - Playfair Cipher
I have some doubts about the decryption phase.
After following the encryption instructions I get:
35VRX2NZDCR25885
Then, to decrypt, I follow the instructions in the opposite direction, but I'm stuck: the decrypted message comes out as
LETUSMEETATNOON
How could I pass from "LETUSMEETATNOON" to "LET US MEET AT NOON"?
Should I treat spaces in a different way?
Spaces are allowed neither in the plaintext nor in the ciphertext of your cryptosystem as it is defined in the attached document.
You could use a larger definition matrix and add all the useful symbols you need, probably . , ; - ? ! : ' \ / and maybe a newline.
There is one more approach, which touches a general issue: how do you add spaces to a text to obtain human-readable (English) sentences? How do you recognize a valid English sentence? This is a very hard problem in computational linguistics that has not been fully solved.
In your case you could omit syntactical analysis and check the validity of words only. You could easily check all possible splits and check if all resulting words are valid English words. All you need is a good English dictionary (list of all English words), which can be found for example on Linux in folders /usr/share/dict/ or /var/lib/dict/ and many others can be downloaded from the Internet.
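A brute-force split checker along those lines might be sketched as follows (the names is_word/can_split are mine, and a toy in-memory dictionary stands in for /usr/share/dict/words):

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Toy stand-in dictionary; a real program would load a word list
   such as /usr/share/dict/words instead. */
static const char *dict[] = { "let", "us", "meet", "at", "no", "noon", "on" };
static const size_t ndict = sizeof dict / sizeof dict[0];

/* Case-insensitive: is s[0..len) a dictionary word? */
static int is_word(const char *s, size_t len)
{
    for (size_t i = 0; i < ndict; i++) {
        if (strlen(dict[i]) != len)
            continue;
        size_t j;
        for (j = 0; j < len; j++)
            if (tolower((unsigned char)s[j]) != (unsigned char)dict[i][j])
                break;
        if (j == len)
            return 1;
    }
    return 0;
}

/* Can s be split entirely into dictionary words?  Exponential in the
   worst case; memoize positions for longer inputs. */
static int can_split(const char *s)
{
    size_t n = strlen(s);
    if (n == 0)
        return 1;
    for (size_t len = 1; len <= n; len++)
        if (is_word(s, len) && can_split(s + len))
            return 1;
    return 0;
}
```

Extending can_split to record the chosen split points would recover "LET US MEET AT NOON" from "LETUSMEETATNOON"; note that ambiguous splits ("NOON" vs "NO ON") are possible and need a tie-breaking rule.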
On Playfair algorithm:
Do not use it if you need any real security - it could be very easily broken using frequency analysis.

Parsing shell commands in C: splitting a string based on its contents

I'm currently creating a Linux shell to learn more about system calls.
I've already figured out most of the things. Parser, token generation, passing appropriate things to appropriate system calls - works.
The thing is, even before I start making tokens, I split the whole command string into separate words. It's based on an array of separators, and it works surprisingly well. Except that I'm struggling with adding additional functionality to it, like escape sequences or quotes. I can't really live without that, since even people running basic grep commands use arguments with quotes. I'll need to add functionality for:
' ' - ignore every other separator, operator or double quotes found between those two, pass this as one string, don't include these quotation marks into resulting word,
" "- same as above, but ignore single quotes,
\\ - escape this into a single backslash,
\(space) - escape this into space, do not parse resulting space as separator
\", \' - analogously to the above.
many other things I probably haven't yet realized I need,
and every single one of them seems like an exception of its own. Each must work at a variety of possible positions in a command, may or may not be included in the result, and influences the rest of the parsing. It makes my code look like a big ball of mud.
Is there a better approach to do this? Is there a more general algorithm for that purpose?
You are trying to solve a classic problem in program analysis (lexing and parsing) using a nontraditional structure for the lexer ("I split whole command string into separate words..."). OK, then you will have non-traditional troubles getting the lexer right.
That doesn't mean your approach is doomed to failure, but without seeing specific instances of your problem (you list the constructs you want to handle, but not why they are hard to process), it is hard to give specific advice. Nor does it mean the approach will succeed: splitting the line first may break tokens that shouldn't be broken (usually by getting confused about what has been escaped).
The point of using a standard lexer (such as Flex or any of its 1000 variants) is that it provides a proven approach to complex lexing problems, based generally on the idea that regular expressions can describe the shape of individual lexemes. You get one regexp per lexeme type, so you may end up with an ocean of them, but each one is pretty easy to specify by itself.
I've done about 40 languages using strong lexers and parsers (using one of the ones in that list). I assure you the standard approach is empirically pretty effective. The kinds of surprises it produces are well understood and manageable. A nonstandard approach always carries the risk of surprising you in a bad way.
Last remark: shell languages for Unix have had people adding crazy stuff for 40 years. Expect the job to be at least medium hard, and don't expect it to be pretty like Wirth's original Pascal.
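To make the state-machine idea concrete, here is a hand-rolled sketch of the quoting rules listed in the question (the function name split_words is mine; it is simplified: no variable expansion, and an unterminated quote is silently closed at end of line):

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Split `in` into words, honoring '...', "...", and backslash escapes.
   Words are copied NUL-terminated into `buf` (which must hold at least
   strlen(in) + 1 bytes); argv[] receives pointers into buf.  Returns
   the word count. */
static int split_words(const char *in, char *buf, char **argv, int maxargs)
{
    int argc = 0;
    char *o = buf;
    while (*in) {
        while (isspace((unsigned char)*in))        /* skip separators */
            in++;
        if (!*in || argc == maxargs)
            break;
        argv[argc++] = o;
        while (*in && !isspace((unsigned char)*in)) {
            if (*in == '\\' && in[1]) {            /* \x -> literal x  */
                *o++ = in[1];
                in += 2;
            } else if (*in == '\'' || *in == '"') { /* quoted run      */
                char q = *in++;
                while (*in && *in != q) {
                    /* inside "...", backslash still escapes; inside
                       '...' everything is literal */
                    if (q == '"' && *in == '\\' && in[1]) {
                        *o++ = in[1];
                        in += 2;
                    } else {
                        *o++ = *in++;
                    }
                }
                if (*in)
                    in++;                          /* drop closing quote */
            } else {
                *o++ = *in++;
            }
        }
        *o++ = '\0';
    }
    return argc;
}
```

Each quoting rule becomes one branch of the inner loop rather than a special case scattered through the code, which is essentially what a generated lexer would give you in table form.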

C Removing Newlines in a Portable and International Friendly Way

Simple question here with a potentially tricky answer: I am looking for a portable and localization friendly way to remove trailing newlines in C, preferably something standards-based.
I am already aware of the following solutions:
Parsing for some combination of \r and \n. Really not pretty when dealing with Windows, *nix, and Mac, all of which use different sequences to represent a new line. Also, do other languages even use the same escape sequence for a new line? I expect this will blow up in languages that use different glyphs from English (say, Japanese or the like).
Removing the trailing n bytes and replacing the final \0. Seems like an even more brittle version of the above.
isspace looks tempting but I need to only match newlines. Other whitespace is considered valid token text.
C++ has a class to do this but it is of little help to me in a pure-C world.
locale.h seems like what I am after but I cannot see anything pertinent to extracting newline tokens.
So, with that, is this an instance that I will have to "roll my own" functionality or is there something that I have missed? Thanks!
Solution
I ended up combining both answers, from Weather Vane and Loic respectively, for my final solution. What worked was to use the handy strcspn function to cut the string at the first newline character, choosing the delimiters from the links Loic provided. Thus, I can select delimiters based on a number of supported locales. It is a good point that there are too many to support generically at this level; I didn't even know that there were several competing encodings for Cyrillic.
In this way, I can achieve "good enough" multinational support while still using standard library functions.
Since I can only accept one answer, I am selecting Weather Vane's as his was the final invocation I used. That being said, it was really the two answers together that worked for me.
The best one I know is
buffer[strcspn(buffer, "\r\n")] = 0;
which is a safe way of dealing with all the combinations of \r and \n - both, one or none.
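Wrapped in a helper (the name chomp is mine), the idiom would typically be applied right after fgets(line, sizeof line, stdin):

```c
#include <assert.h>
#include <string.h>

/* Cut the buffer at the first '\r' or '\n'.  Handles "\n", "\r\n",
   a bare "\r", and lines with no newline at all: in that last case
   strcspn returns the full length, so the existing '\0' is simply
   rewritten in place.  Safe on an empty string too. */
static void chomp(char *s)
{
    s[strcspn(s, "\r\n")] = '\0';
}
```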
I suggest replacing one or more whitespace characters with one standard space (US-ASCII 0x20). Considering only ISO-8859-1 characters (https://en.wikipedia.org/wiki/ISO/IEC_8859-1), whitespace consists of any byte in 0x00..0x20 (C0 control characters and space) and 0x7F..0xA0 (delete, C1 control characters, and no-break space). Notice that US-ASCII is a subset of ISO-8859-1.
But take into account that Windows-1251 (https://en.wikipedia.org/wiki/Windows-1251) assigns different, visible (non-control) characters to the range 0x80..0x9F. In this case, those bytes cannot be replaced by spaces without loss of textual information.
Resources for an extensive definition of whitespace characters:
https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace
http://unicode.org/reports/tr23/
http://www.unicode.org/Public/8.0.0/charts/CodeCharts.pdf
Also take into account that different encodings may be used, most commonly:
ISO-8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
UTF-8 (https://en.wikipedia.org/wiki/UTF-8)
Windows 1251 (https://en.wikipedia.org/wiki/Windows-1251)
But in non-Western countries (for instance, Russia or Japan), further character encodings are also common. Numerous encodings exist, and it probably does not make sense to try to support each and every known one.
Thus try to define and restrict your use-cases, because implementing it in full generality means a lot of work.
This answer is for C++ users with the same problem.
Matching a newline character for any locale and character type can be done like this:
#include <locale>
template<class Char>
bool is_newline(Char c, std::locale const & loc = std::locale())
{
// Translate character into default locale and character type.
// Then, test against '\n', which is the only newline character there.
return std::use_facet< std::ctype<Char>>(loc).narrow(c, ' ') == '\n';
}
Now, removing all trailing newlines can be done like this:
void remove_trailing_newlines(std::string & str) {
    while (!str.empty() && is_newline(*str.rbegin()))
        str.pop_back();
}
This should be absolutely portable, as it relies only on standard C++ functions.

Is there a simple example on iconv transliteration of from-language-to-language for c?

Say we have the simple scenario, a string of a language, say French.
And we want that French to be converted to ASCII in a transliterated form.
How can it be done in C in the simplest way?
Also, is there a completely different way, unrelated to iconv, that is ideally multiplatform?
If you want multiplatform, iconv is not the right tool. Transliteration is a GNU-specific extension. In general, transliteration is a hard problem, and the GNU iconv implementation is only sufficient for trivial cases. How a non-ASCII character gets transliterated is not a property of the character but of the language of the text and how it's being used. For instance, should "日" become "ri" or "ni" or something else entirely? Or if you want to stick with Latin-based languages, should "ö" become "o" or "oe"? Expanding to other non-Latin scripts, transliterating most Indic languages is fairly straightforward, but transliterating Thai requires some reordering of characters and transliterating Tibetan requires parsing whole syllables and identifying which letters are in root/prefix/suffix/etc. roles.
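For the trivial cases where the GNU extension does suffice, a minimal sketch looks like this (the helper name to_ascii_translit is mine; "//TRANSLIT" is glibc-specific, POSIX iconv has no transliteration, and the result for any given character depends on glibc and the current locale):

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/* Convert UTF-8 `in` to ASCII, asking glibc's iconv to approximate
   unrepresentable characters instead of failing.  Returns 0 on
   success, -1 on error.  `out` must hold at least `outsize` bytes. */
static int to_ascii_translit(const char *in, char *out, size_t outsize)
{
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;                    /* conversion pair unsupported */
    char *inp = (char *)in;           /* iconv's prototype is non-const */
    char *outp = out;
    size_t inleft = strlen(in), outleft = outsize - 1;
    size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
    *outp = '\0';
    iconv_close(cd);
    return r == (size_t)-1 ? -1 : 0;
}
```

On glibc this turns, say, "é" into "'e" or "e" depending on the locale's transliteration tables, which illustrates the answer's point: the mapping is not a property of the character alone.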
In my opinion, the best answer to "How do I transliterate to ASCII?" for most software programs is: don't. Instead fix whatever bugs or intentionally-English-centric policies made you want ASCII in the first place. The only software that should really be doing transliteration is highly-linguistically-aware software facilitating search or interpretation of texts not in the user's own native language.

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So, for example, the data contains ASCII codes escaped as '$', some number of digits, then ';'. If the data looked like this: "quick $33;brown $126;fox $a $12a", the string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double-buffer approach work, and if so, how does one manage the current locations, etc.?
If I've followed what you're asking, this is called lexical analysis, or tokenization, or regular-expression matching. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions on the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data the buffer can hold has been processed, do another read to refill it. Your buffer size should be at least twice as large as the largest "word" you need to recognize. Indexing into this buffer uses the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then knowing that you are in a different state because you saw a $ you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well formatted number followed by an ; wasn't entirely clear in your question -- what to do if there are a million numbers before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar-sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a whole run of non-$ characters at a time, but such a run could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in brackets because $ is special in many regular-expression dialects, where at the end of a pattern it means "match only at end of line"; bracketing it makes it match a literal dollar sign.
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
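If you'd rather not depend on lex, the recognizer above is small enough to hand-code; here is a sketch (the function name decode_escapes is mine; it assumes up to three decimal digits, matching the patterns above, and that the resulting code fits in a byte):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Decode "$<1-3 digits>;" escapes into the ASCII character they name.
   Invalid sequences (no digits, too many digits, or no terminating
   ';') are passed through unchanged.  `out` must have room for
   strlen(in) + 1 bytes; output is never longer than the input. */
static void decode_escapes(const char *in, char *out)
{
    size_t o = 0;
    for (size_t i = 0; in[i] != '\0'; ) {
        if (in[i] == '$') {
            size_t j = i + 1, code = 0, ndigits = 0;
            while (isdigit((unsigned char)in[j]) && ndigits < 3) {
                code = code * 10 + (size_t)(in[j] - '0');
                j++;
                ndigits++;
            }
            if (ndigits > 0 && in[j] == ';') {
                out[o++] = (char)code;   /* valid escape: emit the char */
                i = j + 1;
                continue;
            }
        }
        out[o++] = in[i++];              /* ordinary byte, or bad escape */
    }
    out[o] = '\0';
}
```

For the streaming case, the same three states (literal text, saw '$', collecting digits) carry over directly; you just keep the state across buffer refills instead of restarting it, which is exactly what makes the buffer-boundary problem disappear.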
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.
