Non-English alphanumerics in a text file - WinForms

C# WinForm application
EDIT: It appears there's concern about foreign language compatibility.
This is a non-issue.
The card game I'm making this utility for is primarily in English. In the future I may support other languages, but everything will still be keyed off the English names, which are a primary key in both the program and the rules of the game.
I can simply add additional tables with the English name, followed by the translated text, and everything should be fine.
Part of my program reads input from a text file containing names, and compares it to another list of names.
Sometimes these names have non-English letters in the input file, particularly accented "o"s and the Latin AE.
When this text input is compared to the names, those non-English characters cause problems.
I'd like to find a way to map these characters onto their English counterparts in most cases, such as "[accented o]" -> "o".
I'm perfectly content to code a find/replace table (I only expect 12-30 problem characters), but I've got some roadblocks.
1) Hardcoding the find/replace table (in the ".cs" file) gives me errors, because the compiler doesn't like the characters.
Does anyone know a trick to fix this, or do I just have to create a find/replace text file that is read in before this process?
2) Identifying the letters is frustrating, but I'll only reach the replace logic if a match isn't found.
This occurs when the non-English characters cause a mismatch, or when the name isn't in the list yet.
I'm not too worried about the inefficiency of a char-by-char check of each unmatched string, as this is a manual update process triggered every three months.
Presumably getting down to the binary encoding of a single character should work, but I haven't gotten this to work.
3) The aforementioned [AE] character is used often, and it would be nice to at least allow the use of this character within the program, as I don't intend to replace it like I do the others.
I've loaded [AE] characters into my database with no problems, and searches using "Ae," "AE," and "[AE]" have posed no problem at the SQL-level, so I'm fine with that functionality.
It's just that searching for other non-English characters is less intuitive.
So there's my problem, which is actually more of a nuisance than anything serious. Still, any help or advice would be greatly appreciated.

Are you sure these names aren't meant to be different? Are you sure that you want all of "è", "é", "ê", and "ë" to mean the same thing?
Especially in "foreign" names, characters with different diacritical marks are likely intended to be different. After all, to the people whose names those are, these characters are not foreign.

Related

Using C I would like to format my output such that the output in the terminal stops once it hits the edge of the window

If you type ps aux into your terminal and make the window really small, the output of the command will not wrap and the format is still very clear.
When I use printf and output my 5 or 6 strings, sometimes the length of my output exceeds that of the terminal window and the strings wrap to the next line which totally screws up the format. How can I write my program such that the output continues to the edge of the window but no further?
I've tried searching for an answer to this question, but I'm having trouble narrowing it down, so my search results never seem to have anything to do with it.
Thanks!
There are functions that can tell you about the terminal window, and others that let you manipulate it. Look up the "ncurses" or "termcap" libraries.
A simple approach to your problem is to get the terminal window size (especially the width) and then format your output accordingly.
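For example, on Linux and BSD you can ask the kernel for the window size directly with an ioctl, without pulling in a full curses library. A sketch, assuming stdout really is a terminal:

#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    struct winsize ws;
    /* TIOCGWINSZ fills in the terminal's current row/column counts. */
    if (ioctl(STDOUT_FILENO, TIOCGWINSZ, &ws) == -1) {
        perror("ioctl");   /* fails if stdout is not a terminal */
        return 1;
    }
    printf("rows: %d, cols: %d\n", ws.ws_row, ws.ws_col);
    return 0;
}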
There are two possible approaches to fix your problem.
Turn off line wrapping in your terminal emulator (if it supports it).
Look into the curses library. Applications like top or vim use curses for screen formatting.
You can find, or at least guess, the width of the terminal using methods that other answers describe. That's only part of the problem, however -- the tricky bit is formatting the output to fit the console. I don't believe there's any alternative to reading the text word by word and moving the output to the next line when a word would overflow the width. You'll need to implement a method to detect where the whitespace is, allowing for the fact that there could be multiple whitespace characters in a row. You'll need to decide how to handle line-breaking whitespace, like CR/LF, if you have any. You'll need to decide whether you can break a word on punctuation (e.g., a hyphen). My approach is to use a simple finite-state machine, where the states are "at start of line", "in a word", "in whitespace", etc., and the characters (or, rather, character classes) encountered are the events that change the state.
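A much simpler variant of that idea, skipping the full state machine, is a greedy fill: emit words until the next one would cross the width, then break the line. A sketch for single-byte text only (the function name print_wrapped is mine; a word longer than the width is printed unbroken):

#include <stdio.h>
#include <string.h>

/* Print the words of `text` greedily wrapped at `width` columns.
   Collapses runs of spaces/newlines; single-byte characters only. */
static void print_wrapped(const char *text, int width) {
    int col = 0;
    const char *p = text;
    while (*p != '\0') {
        p += strspn(p, " \n");              /* skip whitespace run */
        size_t len = strcspn(p, " \n");     /* length of next word */
        if (len == 0)
            break;
        if (col > 0 && col + 1 + (int)len > width) {
            putchar('\n');                  /* word won't fit: break line */
            col = 0;
        }
        if (col > 0) {
            putchar(' ');
            col++;
        }
        fwrite(p, 1, len, stdout);
        col += (int)len;
        p += len;
    }
    putchar('\n');
}

As the next paragraph notes, byte counts and display columns diverge once multi-byte characters appear, so this only holds for single-byte text.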
A particular complication when working in C is that there is little-to-no built-in support for multi-byte characters. That's fine for text which you are certain will only ever be in English, and use only the ASCII punctuation symbols, but with any kind of internationalization you need to be more careful. I've found that it's easiest to convert the text into some wide format, perhaps UTF-32, and then work with arrays of 32-bit integers to represent the characters. If your text is UTF-8, there are various tricks you can use to avoid having to do this conversion, but they are a bit ugly.
I have some code I could share, but I don't claim it is production quality, or even comprehensible. This simple-seeming problem is actually far more complicated than first impressions suggest. It's easy to do badly, but difficult to do well.

Why does this scanf() conversion actually work?

Ah, the age-old tale of a programmer incrementally writing some code they expect to do nothing special, only to find that the code unexpectedly does everything, and correctly, too.
I'm working on some C programming practice problems, and one was to redirect stdin to a text file that had some lines of code in it, then print it to the console with scanf() and printf(). I was having trouble getting the newline characters to print as well (since scanf typically eats up whitespace characters) and had typed up a jumbled mess of code involving multiple conditionals and flags when I decided to start over and ended up typing this:
(where c is a character buffer large enough to hold the entirety of the text file's contents)
scanf("%[a-zA-Z -[\n]]", c);
printf("%s", c);
And, voila, this worked perfectly. I tried to figure out why by creating variations on the character class (between the outside brackets), such as:
[\w\W -[\n]]
[\w\d -[\n]]
[. -[\n]]
[.* -[\n]]
[^\n]
but none of those worked. They all ended up either reading just one character or producing a jumbled mess of random characters. '[^\n]' doesn't work because the text file contains newline characters, so it only prints out a single line.
Since I still haven't figured it out, I'm hoping someone out there would know the answer to these two questions:
Why does "[a-zA-Z -[\nn]]" work as expected?
The text file contains letters, numbers, and symbols (':', '-', '>', maybe some others); if 'a-z' is supposed to mean "all characters from unicode 'a' to unicode 'z'", how does 'a-zA-Z' also include numbers?
It seems like the syntax for what you can enter inside the brackets is a lot like regex (which I'm familiar with from Python), but not exactly. I've read up on what can be used from trying to figure out this problem, but I haven't been able to find any info comparing whatever this syntax is to regex. So: how are they similar and different?
I know this probably isn't a good usage for scanf, but since it comes from a practice problem, real world convention has to be temporarily ignored for this usage.
Thanks!
You are picking up numbers because you have " -[" in your character set. This means all characters from space (32) to open-bracket (91), which includes numbers in ASCII (48-57).
Your other examples include this as well, but they are missing the "a-zA-Z", which lets you pick up the lower-case letters (97-122). Sequences like '\w' are treated as unknown escape sequences in the string itself, so \w just becomes a single w. . and * are taken literally. They don't have a special meaning like in a regular expression.
If you include - inside the [ (other than at the beginning or end) then the behaviour is implementation-defined.
This means that your compiler documentation must describe the behaviour, so you should consult that documentation to see what the defined behaviour is, which would explain why some of your code worked and some didn't.
If you want to write portable code then you can't use - as anything other than matching a hyphen.
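As a portable alternative for the original exercise (echoing a file, newlines included), you can avoid implementation-defined scanset ranges entirely and handle the newline with a separate suppressed conversion. A sketch, assuming no line exceeds the buffer:

#include <stdio.h>

int main(void) {
    char line[1024];
    for (;;) {
        /* %[^\n] is fully portable: match everything up to a newline.
           It fails (returns 0) on an empty line, consuming nothing. */
        if (scanf("%1023[^\n]", line) == 1)
            printf("%s", line);
        /* %*c consumes the newline itself; scanf returns EOF at end of input. */
        if (scanf("%*c") == EOF)
            break;
        putchar('\n');
    }
    return 0;
}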

Validation for user input on file system

I have written a bunch of web apps and know how to protect against MySQL injections and such. I am writing a log storage system for a project in C and I was advised to make sure that it was hack-free, in the sense that the user could not supply bad data like foo\b\b\b and try to hack into the OS with some rm -rf /* kind of crud. I looked online and found a similar question here: how to check for the "backspace" character in C
This is at least what I thought of, but I know there are probably other things I need to protect against. Can someone who has a bit more experience help me list out the things I need to validate when I am saving files onto a server using user input as part of the hierarchical file naming system?
Example file: /home/webapp/data/{User input}/{Machine-ID}/{hostname}/{tag} where all of these fields could be "faked" when submitted to our log storing system.
Instead of checking for bad characters, turn the problem on its head and specify the good characters. E.g., require that {User input} be a single directory name made up of [[:alnum:]_] characters; {Machine-ID} must be made of [[:xdigit:]] characters to your liking, etc. That gets rid of all the injection stuff quickly.
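A sketch of that whitelist check in C (the name is_safe_component is mine; note that isalnum is locale-sensitive, and in the "C" locale it matches only ASCII letters and digits):

#include <ctype.h>
#include <stddef.h>

/* Return 1 if `name` is safe to use as a single path component:
   non-empty and made only of letters, digits, and '_'. */
static int is_safe_component(const char *name) {
    if (name == NULL || name[0] == '\0')
        return 0;
    for (size_t i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '_')
            return 0;   /* anything else is rejected outright */
    }
    return 1;
}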
If you're only ever using these inputs as file names inside your program, and you're storing them on a native Linux filesystem, then the critical things to watch for are:
Absolutely proscribe any file name starting with ../, containing /../, or ending with /.. (such file names could allow the user to reach files outside the directory tree that you're working in).
Be wary of any file name containing / as these allow the user to name subdirectories, possibly with unintended consequences.
Other things that could cause trouble include:
Non-ASCII characters that may have a different meaning if used in a different locale.
Some ASCII punctuation characters may have a special meaning in parts of your processing system or may be invalid in some filesystems.
Some parts of your system may be case-sensitive with other parts being case-insensitive. Consider normalizing the case.
If applicable, restrict each field to something that isn't going to cause any trouble. For example:
A machine ID should probably consist of only ASCII lowercase letters and digits (or only ASCII uppercase letters and digits).
A hostname should consist of only ASCII lowercase letters and digits, plus - but not in an initial position (use Punycode for non-ASCII host names). If these are fully qualified host names, as opposed to host names in a network, then . is also valid, but not in initial position.
No field should be empty or contain a / or start with a . (an initial . could be . or .. — see above — and would be a dot file that ls doesn't show by default and isn't included in the pattern * in shells, so they're best avoided).
While control characters such as backspace aren't directly harmful, they can be indirectly harmful in that if you're investigating an issue on the command line, they can cause you to make mistakes. Do not allow them.

Removing diacritic symbols from a UTF-8 string in C

I am writing a C program to search a large number of UTF-8 strings in a database. Some of these strings contain English characters with diacritics, such as accents, etc. The search string is entered by the user, so it will most likely not contain such characters. Is there a way (function, library, etc.) to remove these characters from a string, or just to perform a diacritic-insensitive search? For example, if the user enters the search string "motor", it should match the string "motörhead".
My first attempt was to manually strip out the combining diacritical marks described here:
http://en.wikipedia.org/wiki/Combining_character
This worked in some cases, but it turns out many of these characters also have precomposed Unicode values. For example, the character "ö" above can be represented by an "o" followed by the combining diaeresis U+0308, but it can also be represented by the single precomposed character U+00F6, and my method only filters the former.
I have also looked into iconv, which can convert from UTF-8 to ASCII. However, I may want to localize my program at a future date, and this would no doubt cause problems for languages with non-English characters. Is there a way I can simply strip/convert these accented characters?
Edit: removed typo in question title.
Convert to one of the decomposed normalization forms -- probably NFD, but perhaps even NFKD -- which turn all diacritics into combining characters that can then be stripped.
You will want a library for this. I hear good things about ICU.
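A sketch of that approach with ICU's C API (the function name strip_marks and the fixed 256-unit buffer are mine, and error handling is minimal):

#include <unicode/unorm2.h>
#include <unicode/uchar.h>
#include <unicode/utf16.h>

/* Decompose `src` to NFD, then copy it to `dst` dropping every
   combining mark (general category Mn), so "o" + U+0308 and the
   precomposed U+00F6 both come out as plain "o". Returns the
   output length, or -1 on error. */
int32_t strip_marks(const UChar *src, int32_t srcLen,
                    UChar *dst, int32_t dstCap) {
    UErrorCode err = U_ZERO_ERROR;
    UChar nfd[256];
    const UNormalizer2 *norm = unorm2_getNFDInstance(&err);
    int32_t nfdLen = unorm2_normalize(norm, src, srcLen, nfd, 256, &err);
    if (U_FAILURE(err))
        return -1;

    int32_t i = 0, out = 0;
    while (i < nfdLen) {
        UChar32 c;
        U16_NEXT(nfd, i, nfdLen, c);
        if (u_charType(c) == U_NON_SPACING_MARK)
            continue;                     /* skip the diacritic */
        if (out + U16_LENGTH(c) > dstCap)
            return -1;                    /* destination too small */
        U16_APPEND_UNSAFE(dst, out, c);
    }
    return out;
}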
Use ICU: create a collator over the root locale with a strength of PRIMARY (L1), which compares only base letters (it cares about 'o' and ignores 'ö'), then use ICU's search functions to match. There is newer functionality, a dedicated "search" collator type, that provides collators designed for exactly this case, but primary strength will handle this specific one.
Example: "motor == mötor" in the 'collated' section.

How do I check if a file is text-based?

I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like a BOM, a byte-order mark)
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for a go-ahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long - that is, assuming you're not performing Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check that every character is alphanumeric.
So basically the first thing to do is check the file's encoding. If it's pure ASCII, you have an easy task: just read the whole file into a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with the functions isalpha and isdigit... of course you have to allow for special exceptions like the tab '\t', the space ' ', and the newline ('\n' on Linux, "\r\n" on Windows).
With a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric... also note that for UTF-16 or wider, a plain char array is simply too small; but if you are doing it in, say, C#, you don't have to worry about the size :)
You can write a function that tries to determine whether a file is text-based. While this will not be 100% accurate, it may be good enough for you. Such a function does not need to go through the whole file; about a kilobyte should be enough (or even less). One thing to do is to count how many whitespace characters and newlines there are. Another is to look at individual bytes and check whether they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.
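A sketch of that heuristic (the 1 KB sample size and the 95% threshold are arbitrary choices of mine, and UTF-16 text would fail the NUL-byte test):

#include <stdio.h>
#include <ctype.h>

/* Heuristic: sample up to 1 KB and call the file "text" if it has
   no NUL bytes and is almost entirely printable or whitespace. */
int looks_like_text(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    unsigned char buf[1024];
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    if (n == 0)
        return 1;                      /* treat an empty file as text */
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\0')
            return 0;                  /* NUL almost always means binary */
        if (isprint(buf[i]) || isspace(buf[i]))
            ok++;
    }
    return ok * 100 / n >= 95;         /* arbitrary 95% threshold */
}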
