Unicode normalization through ICU4C

I want to normalize a string using the ICU C interface.
Looking at unorm2_normalize, I have some questions.
The UNormalizer2 instance -- how do I dispose of it after I'm done with it?
What if the buffer isn't large enough for decomposition or recomposition? Is the normal way to check if the error code is U_BUFFER_OVERFLOW_ERROR? Does U_STRING_NOT_TERMINATED_WARNING apply? Is the resulting string null-terminated? If an error is returned, do I reallocate memory and try again? It seems like a waste of time to start all over again.

See unorm2_close(). Note that you should not close instances acquired via unorm2_getInstance(): those are cached singletons owned by ICU. Only instances you create yourself with unorm2_open() need to be closed.
In general, most ICU APIs can be passed a NULL buffer and 0 capacity as input, which results in U_BUFFER_OVERFLOW_ERROR and the return value giving the required length (this is known as preflighting). If you get U_STRING_NOT_TERMINATED_WARNING it means just that: the data is populated but not NUL-terminated.
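To make that concrete, here is a minimal sketch of the preflighting pattern with unorm2_normalize(), assuming an NFC normalizer and UTF-16 input; the function name normalize_nfc is made up and error handling is abbreviated:

#include <stdlib.h>
#include <unicode/unorm2.h>

/* Normalize src (UTF-16, srcLen units) to NFC. Returns a malloc'd,
 * NUL-terminated buffer (caller frees), or NULL on failure. */
UChar *normalize_nfc(const UChar *src, int32_t srcLen)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfc = unorm2_getNFCInstance(&status);
    if (U_FAILURE(status))
        return NULL;

    /* Preflight: NULL buffer and 0 capacity yield U_BUFFER_OVERFLOW_ERROR,
     * and the return value is the required length. */
    int32_t needed = unorm2_normalize(nfc, src, srcLen, NULL, 0, &status);
    if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
        return NULL;

    status = U_ZERO_ERROR;  /* the overflow "error" must be reset before the real call */
    UChar *dest = malloc((needed + 1) * sizeof *dest);  /* +1 so the result is terminated */
    if (dest == NULL)
        return NULL;

    unorm2_normalize(nfc, src, srcLen, dest, needed + 1, &status);
    if (U_FAILURE(status)) {
        free(dest);
        return NULL;
    }
    /* No unorm2_close(nfc) here: instances from unorm2_getNFCInstance()
     * are cached singletons owned by ICU. */
    return dest;
}

Only the cheap preflight call is repeated; you do not start the whole job over, since the second call writes directly into a buffer that is known to be large enough.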

Related

Difficulties understanding how to take elements from a file and store them in C

I'm working on an assignment that is supposed to go over the basics of reading a file and storing the information from that file. I'm personally new to C and struggling with the lack of a "String" variable.
The file that the program is supposed to work with contains temperature values, but we are supposed to account for "corrupted data". The assignment states:
"Every input item read from the file should be treated as a stream of characters (string); you can use the function atof() to convert a string value into a floating point number (invalid data can be set to a value lower than the lowest minimum to identify it as corrupt)."
The number of elements in the file is undetermined but an example given is:
37.8, 38.a, 139.1, abc.5, 37.9, 38.8, 40.5, 39.0, 36.9, 39.8
After reading the file we're supposed to allow a user to query these individual entries, but as mentioned if the data entry contains a non-numeric value, we are supposed to state that the specific data entry is corrupted.
Overall, I understand how to functionally write a program that can fulfill those requirements. My issue is not knowing what data structure to use and/or how to store the information to be called upon later.
The closest thing to an actual string datatype in C is a sequence of chars terminated by a '\0' value. That is used for most things you'd expect to do with strings.
Storing them just requires sufficient memory, as offered by a sufficiently large array of char, or by malloc().
I think the requirements of your assignment would be met by using a char array as a buffer, then reading into it with fgets(), making sure not to read more than fits into the array and making sure that there is a '\0' at the end.
Then you can use atof() on the content of the array and, if the conversion fails, do the handling of corrupted input. Though I would prefer sscanf() for its better feedback via a separate return value.
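As a concrete illustration, a minimal sketch along those lines; the file name, the array capacity, and the sentinel value are assumptions, and the trailing %c is the trick that makes sscanf() reject entries like "38.a":

#include <stdio.h>

#define MAX_READINGS 100
#define CORRUPT (-1000.0)  /* sentinel below any plausible temperature (assumption) */

int main(void)
{
    double readings[MAX_READINGS];
    char token[64];
    int count = 0;

    FILE *fp = fopen("temperatures.txt", "r");  /* hypothetical file name */
    if (fp == NULL)
        return 1;

    /* Read each comma-separated item as a string first... */
    while (count < MAX_READINGS && fscanf(fp, " %63[^,\n]", token) == 1) {
        double value;
        char junk;
        /* ...then convert: exactly one conversion and no trailing
         * garbage means the entry is a valid number. */
        if (sscanf(token, " %lf %c", &value, &junk) == 1)
            readings[count] = value;
        else
            readings[count] = CORRUPT;  /* e.g. "38.a" or "abc.5" */
        count++;
        (void)fscanf(fp, ",");  /* consume the separator, if present */
    }
    fclose(fp);

    for (int i = 0; i < count; i++) {
        if (readings[i] == CORRUPT)
            printf("entry %d: corrupted\n", i);
        else
            printf("entry %d: %.1f\n", i, readings[i]);
    }
    return 0;
}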

Why do the functions in std::io::Read take a buffer?

Why do the methods in std::io::Read, namely read_to_end, read_to_string, and read_exact take a buffer rather than returning the result? The current return value is a Result<usize> (or Result<()>), but could that not be made into a tuple instead, also containing the result?
RFC 517 discusses these functions and describes two reasons for why the functions take buffers over returning values:
Performance. When it is known that reading will involve some large number of bytes, the buffer can be preallocated in advance.
"Atomicity" concerns. For read_to_end, it's possible to use this API to retain data collected so far even when a read fails in the middle. For read_to_string, this is not the case, because UTF-8 validity cannot be ensured in such cases; but if intermediate results are wanted, one can use read_to_end and convert to a String only at the end.
For the first point, a string can be pre-allocated using the associated function String::with_capacity. A very similar function exists for vectors: Vec::with_capacity.

Prevent crash in string manipulation crashing whole application

I created a program which, at regular intervals, downloads a text file in CSV format from a website and parses it, extracting relevant data which is then displayed.
I have noticed that occasionally, every couple of months or so, it crashes. The crash is rare, considering the cycle of downloading and parsing can happen every 5 minutes or even less. I am pretty sure it crashes inside the function that parses the string and extracts the data. When it crashes it happens during a congested internet connection, i.e. heavy downloads and/or a slow connection. Occasionally the remote site may be serving corrupt or incomplete data.
I used a test application which saves the data to be processed before processing it, and it indeed shows the data was not complete when a crash happened.
I have adapted the function to accommodate a number of cases of invalid or incomplete data, as well as checking all return values. I also check the return values of the various functions used to connect to the remote site and download the data, and will not go further when a return value indicates no success.
The core of the function uses strsep() to walk through the data and extract information out of it:
/*
 * delimiters typically contains: <;>, <">, < >
 * strsep() is used to split part of the string using a delimiter
 * and copy it into token, which is then copied into the array.
 * Normally the loop stops well before ARRAYSIZE, which is just a safeguard;
 * it normally stops when the end of the data is reached, i.e. '\0'.
 */
for (n = 0; n < ARRAYSIZE; n++)
{
    token = strsep(&copy_of_downloaded_data, delimiters);
    if (token == NULL)
        break;
    data->array[n].example = strndup(token, strlen(token));
    if (data->array[n].example != NULL)
    {
        token = strsep(&copy_of_downloaded_data, delimiters);
        if (token == NULL)
            break;
        (..)
        copy_of_downloaded_data = strchr(copy_of_downloaded_data, '\n'); /* find newline */
        if (copy_of_downloaded_data == NULL)
            break;
        copy_of_downloaded_data = copy_of_downloaded_data + 1;
        if (*copy_of_downloaded_data == '\0') /* end of text reached */
            break;
    }
}
Since I suspect I cannot account for all the ways in which the data can be corrupted, I would like to know whether there is a way to program this so that the function, when run, does not crash the whole application on corrupted data.
If that is not possible, what could I do to make it more robust?
Edit: One possible instance of a crash is when the data ends abruptly, with the middle of a field cut off, i.e.
"test","example","this data is brok
At least that is what I noticed by looking through the saved data; however, I found it was not consistent. I will have to stress test it, as was suggested below.
The best thing to do would be to figure out what input causes the function to crash, and fix the function so that it does not crash. Since the function is doing string processing, this should be possible to do by feeding it lots of dummy/test data (or feeding it the "right" test data if it's a particular input that causes the crash). You basically want to torture-test the function until you find out how to make it crash on demand; at that point you can start investigating exactly where and why it crashes, and once you understand that, the necessary changes to fix the crash will probably become obvious to you.
Running the program under valgrind might also point you to the bug.
If for some reason you can't fix the bug, the other option is to spawn a child process and run the buggy code inside the child process. That way if it crashes, only the child process is lost and not the parent. (You can spawn the child process under most OS's by calling fork(); you'll need to come up with some way for the child process to communicate its results back to the parent process, of course). (Note that doing it this way is a kludge and will likely not be very efficient, and could also introduce a security hole into your application if someone malicious who has the ability to send your program input can figure out how to manipulate the bug in order to take control of the child process -- so I don't recommend this approach!)
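For what it's worth, a minimal sketch of that child-process isolation, assuming a hypothetical parse_data() function and a result small enough to come back in a single pipe read:

#include <sys/wait.h>
#include <unistd.h>

/* parse_data() stands in for the crash-prone parsing function (hypothetical). */
extern size_t parse_data(const char *raw, char *out, size_t outsize);

/* Run the parser in a child process so a crash cannot take down the parent.
 * Returns 0 on success, -1 if the child failed or crashed. */
int parse_in_child(const char *raw, char *out, size_t outsize)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                 /* child: do the risky work */
        close(fds[0]);
        char buf[4096];
        size_t n = parse_data(raw, buf, sizeof buf);
        if (write(fds[1], buf, n) < 0)
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                  /* parent: collect the result */
    ssize_t n = read(fds[0], out, outsize - 1);
    close(fds[0]);

    int status;
    waitpid(pid, &status, 0);
    if (n < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;                  /* child was killed (e.g. SIGSEGV) or failed */
    out[n] = '\0';
    return 0;
}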
What does the coredump point to?
strsep() does not have memory synchronization mechanisms, so if threads are involved, protect it as a critical section (lock while you do the strsep() loop).
See if strsep() can handle a big chunk of data (ARRAYSIZE is not going to help you here).
Check the stack size of the thread/program that receives copy_of_downloaded_data (I know you are only referencing it, so look at the function that receives it).
I would suggest that one should try to write code that keeps track of string lengths deliberately and doesn't care whether strings are zero-terminated or not. Even though null pointers have been termed the "billion dollar mistake"(*), I think zero-terminated strings are far worse. While there may be some situations where code using zero-terminated strings might be "simpler" than code that tracks string lengths, the extra effort required to make sure that nothing can cause string-handling code to exceed buffer boundaries exceeds the effort required when working with known-length strings.
If, for example, one wants to store the concatenation of strings of length length1 and length2 into a buffer of length BUFF_SIZE, one can easily test whether length1+length2 <= BUFF_SIZE if one isn't expecting strings to be null-terminated, or length1+length2 < BUFF_SIZE if one expects a gratuitous null byte to follow every string. When using zero-terminated strings, one would have to determine the length of the two strings before concatenation, and having done so one could just as well use memcpy() rather than strcpy() or the useless strcat().
(*) There are many situations where it's much better to have a recognizably-invalid pointer than to require that pointers which can't point to anything meaningful must instead point to something meaningless. Many null-pointer related problems actually stem from a failure of implementations to trap arithmetic with null pointers; it's not fair to blame null pointers for problems that could have been, but weren't avoided.
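To make the concatenation example concrete, a minimal sketch under the convention that lengths are tracked separately and no terminating zero is stored (all names are made up):

#include <string.h>

#define BUFF_SIZE 64

/* Concatenate two counted (not zero-terminated) strings into dest.
 * Returns the combined length, or 0 if the result does not fit. */
size_t concat_counted(char dest[BUFF_SIZE],
                      const char *s1, size_t length1,
                      const char *s2, size_t length2)
{
    if (length1 + length2 > BUFF_SIZE)   /* the entire bounds check */
        return 0;
    memcpy(dest, s1, length1);
    memcpy(dest + length1, s2, length2);
    return length1 + length2;
}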

MalformedInputException when trying to read entire file

I have a 132 KB file (you can't really say it's big) and I'm trying to read it from the Scala REPL, but I can't read past 2048 chars because it gives me a java.nio.charset.MalformedInputException
These are the steps I take:
val it = scala.io.Source.fromFile("docs/categorizer/usig_calles.json") // this is ok
it.take(2048).mkString // this is ok too
it.take(1).mkString // BANG!
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
Any idea what could be wrong?
--
Apparently the problem was that the file was not UTF-8 encoded.
I saved it as UTF-8 and everything works; I just issue mkString on the iterator and it retrieves the whole contents of the file.
The strange thing is that the error only arose after passing the first 2048 chars...
Cannot be certain without the file, but the documentation on the exception indicates it is thrown "when an input byte sequence is not legal for given charset, or an input character sequence is not a legal sixteen-bit Unicode sequence." (MalformedInputException javadoc)
I suspect that character 2049 is the first one encountered that is not valid in whatever the default JVM character encoding is in your environment. Consider explicitly stating the character encoding of the file using one of the overloads to fromFile.
If the application will be cross-platform, you should know that the default character encoding on the JVM varies by platform, so if you operate with a specific encoding you either want to explicitly set it as a command line parameter when launching your application, or specify it at each call using the appropriate overload.
Any time you call take twice on the same iterator, all bets are off. Iterators are inherently imperative, and mixing them with functional idioms is dicey at best. Most of the iterators you come across in the standard library tend to be fairly well-behaved in this respect, but once you've used take, or drop, or filter, etc., you're in undefined-behavior land, and in principle anything could happen.
From the docs:
It is of particular importance to note that, unless stated otherwise,
one should never use an iterator after calling a method on it. The two
most important exceptions are also the sole abstract methods: next
and hasNext ...
def take(n: Int): Iterator[A] ...
Reuse: After calling this method, one should discard the iterator it
was called on, and use only the iterator that was returned. Using the
old iterator is undefined, subject to change, and may result in
changes to the new iterator as well.
So it's probably not worth trying to track down exactly what went wrong here.
If you just wish to read the bytes as plain Latin-1 data:
// File:
io.Source.fromFile(file)(io.Codec.ISO8859).mkString
// InputStream:
io.Source.fromInputStream(System.in)(io.Codec.ISO8859).mkString

Is it possible to read in a string of unknown size in C, without having to put it in a pre-allocated fixed length buffer?

Is it possible to read in a string in C, without allocating an array of fixed size ahead of time?
Every time I declare a char array of some fixed size, I feel like I'm doing it wrong. I'm always taking a guess at what I think would be the maximum for my use case, but this isn't always easy.
Also, I don't like the idea of having a smaller string sitting in a larger container. It doesn't feel right.
Am I missing something? Is there some other way I should be doing this?
At the point that you read the data, your buffer is going to have a fixed size -- that's unavoidable.
What you can do, however, is read the data using fgets, check whether the last character is a '\n' (or whether you've reached end of file), and if not, realloc your buffer and read some more.
I rarely find that necessary, but what I usually do is allocate a single fixed buffer for the reading, read data into it, and then dynamically allocate space for a copy, allocating only as much space as the string actually occupies, not the whole size of the buffer I originally used.
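A minimal sketch of that fgets()-plus-realloc approach, assuming the goal is to read one whole line of unknown length (the function name is made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of arbitrary length from fp.
 * Returns a malloc'd string (caller frees), or NULL on EOF/error. */
char *read_line(FILE *fp)
{
    size_t size = 128;              /* initial guess, grown as needed */
    size_t len = 0;
    char *buf = malloc(size);
    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(size - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            break;                  /* got the whole line */
        size *= 2;                  /* partial line: grow and keep reading */
        char *tmp = realloc(buf, size);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
    }
    if (len == 0) {                 /* nothing read at all: EOF */
        free(buf);
        return NULL;
    }
    return buf;
}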
When you say "ahead of time", do you mean at runtime, or at compile time?
At compile time you do this:
char str[1000];
at runtime you do this:
char *str = malloc(size);
The only way to get exactly the right size is to know how many characters you are going to read in. If you're reading from a file, you can seek to the nearest newline (or some other condition) and then you know exactly how big the array needs to be, i.e.:
int numChars = computeNeededSpace(someFileHandle);
char *readBuffer = malloc(numChars);
fread(readBuffer, 1, numChars, someFileHandle); /* fread(ptr, size, count, stream) */
There is no other way to do this. Put yourself in the program's perspective: how is it supposed to know how many keys the user is going to press? The best thing you can do is limit the user, or whatever the input is.
There are some more complex approaches, like creating a linked list of buffers, allocating chunks of buffers and then linking them afterwards. But I think that's not the answer you wanted here.
EDIT: Most languages have string/inputbuffer classes that hide this from you.
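One concrete way to get something like the computeNeededSpace() from the snippet above, for the whole-file case, is to ask the file for its size; a sketch, assuming a seekable file:

#include <stdio.h>
#include <stdlib.h>

/* Read an entire (seekable) file into a malloc'd, '\0'-terminated buffer. */
char *read_whole_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    fseek(fp, 0, SEEK_END);         /* find out how big the file is... */
    long size = ftell(fp);
    rewind(fp);                     /* ...then go back to the start */
    if (size < 0) {
        fclose(fp);
        return NULL;
    }

    char *buf = malloc((size_t)size + 1);
    if (buf != NULL) {
        size_t got = fread(buf, 1, (size_t)size, fp);
        buf[got] = '\0';            /* terminate so it can be used as a string */
    }
    fclose(fp);
    return buf;
}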
You must allocate a fixed buffer. If it becomes too small, then realloc() it to a bigger size and continue.
There's no way of determining the string length until you've read it in, so reading it into a fixed-size buffer is pretty much your only choice.
I suppose you have the alternative of reading the string in small chunks, but depending on your application that might not give you enough information at a time to work with.
Perhaps the easiest way to handle this dilemma is by defining maximum lengths for certain string input (#define-ing a constant for this value helps). Use a buffer of this pre-determined size whenever you are reading in a string, but make sure to use the strncpy() form of the string commands so you can specify a maximum number of characters to read. Some commonly-used types of strings (for example, filenames or paths) may have system-defined maximum lengths.
There's nothing inherently 'wrong' about declaring a fixed-size array as long as you use it properly (do proper bounds checking and handle the case where input will overflow the array). It may result in unused memory being allocated, but unfortunately the C language doesn't give us much to work with when it comes to strings.
There is the concept of String Ropes, which lets you create trees of fixed-size buffers. You still have to have fixed-size buffers, there is no getting around that really, but this is a pretty neat way of building up strings dynamically.
You can use Chuck Falconer's public domain ggets function to do the buffer management and reallocation for you: http://cbfalconer.home.att.net/download/index.htm
Edit:
Chuck Falconer's website is no longer available. archive.org still has a copy, and I'm hosting a copy too.
