Why do the functions in std::io::Read take a buffer?

Why do the methods in std::io::Read, namely read_to_end, read_to_string, and read_exact, take a buffer rather than returning the result? The current return value is a Result<usize> (or Result<()>), but could that not be made into a tuple instead, also containing the result?

RFC 517 discusses these functions and describes two reasons why they take buffers rather than returning values:
Performance. When it is known that reading will involve some large number of bytes, the buffer can be preallocated in advance.
"Atomicity" concerns. For read_to_end, it's possible to use this API to retain data collected so far even when a read fails in the middle. For read_to_string, this is not the case, because UTF-8 validity cannot be ensured in such cases; but if intermediate results are wanted, one can use read_to_end and convert to a String only at the end.
For the first point, a string can be pre-allocated using the associated function String::with_capacity. A very similar function exists for vectors: Vec::with_capacity.

Related

fgetc vs getline or fgets - which is most flexible

I am reading data from a regular file and I was wondering which would allow for the most flexibility.
I have found that fgets and getline both read in a line (one with a maximum number of characters, the other with dynamic memory allocation). In the case of fgets, if the length of the line is bigger than the given size, the rest of the line is not read but remains buffered in the stream. With getline, I am worried that it may attempt to allocate a large block of memory for an obscenely long line.
The obvious solution for me seems to be turning to fgetc, but this comes with the problem that there will be many calls to the function, thereby resulting in the read process being slow.
Is this compromise in either case between flexibility and efficiency unavoidable, or can it be worked through?
The three functions you mention do different things:
fgetc() reads a single character from a FILE * descriptor. It buffers input, so you can process the file character by character without the overhead of making a system call for each one. When your problem can be handled in a character-oriented way, it is the best choice.
fgets() reads a single line from a FILE * descriptor; it is like calling fgetc() repeatedly to fill the character array you pass to it, in order to read line by line. It has the drawback of making a partial read when an input line is longer than the buffer size you specify. This function also buffers input data, so it is very efficient. If you know that your lines will be bounded, it is the best way to read your data line by line. Sometimes you need to process data with no bound on line size, and you must redesign your problem around the available memory; in that case the function below is probably a better choice.
getline() is relatively new and is not ANSI C, so it is possible you will port your program to some platform that lacks it. It is the most flexible, at the price of being the least efficient. It requires a reference to a pointer that is realloc()ated to hold more and more data, so it does not bound the line length, at the cost of possibly consuming all the memory available on the system. Both the buffer pointer and the buffer size are passed by reference so they can be updated, letting you know where the new string is located and its new size. The buffer must be free()d after use; a minimal usage sketch follows.
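For illustration, here is a minimal sketch of line-by-line reading with getline(), assuming a POSIX system; the file name and the printing are just placeholders:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("input.txt", "r");   /* example file name */
    if (f == NULL)
        return 1;

    char *line = NULL;   /* getline() allocates and grows this buffer itself */
    size_t cap = 0;      /* current capacity, updated by getline() */
    ssize_t len;

    while ((len = getline(&line, &cap, f)) != -1) {
        /* len is the number of bytes read, including the newline if any */
        printf("%ld bytes: %s", (long)len, line);
    }

    free(line);          /* one free() for the buffer getline() allocated */
    fclose(f);
    return 0;
}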
The reason for having three functions and not just one is that different cases have different needs, and selecting the most efficient one for the job is normally the best choice.
If you plan to use only one of them, you will probably end up in situations where the function you selected as the most flexible is not the best choice.
Much is case dependent.
getline() is not part of the standard C library. Its functionality may differ, depending on the implementation and which other standards it follows, which is an advantage for the standard fgetc()/fgets().
... case between flexibility and efficiency unavoidable, ...
OP is missing the higher priorities.
Functionality - If code cannot function right with the selected function, why use it? Example: fgets() and reading null characters create issues.
Clarity - without clarity, feel the wrath of the poor soul who later has to maintain the code.
would allow for the most flexibility. (?)
fgetc() allows for the most flexibility at the low level - yet helper functions using it to read lines tend to fail in corner cases.
fgets() allows for the most flexibility at the mid level - you still have to deal with long lines and those with embedded null characters, but at least the low-level slogging in the weeds is avoided.
getline() is useful when high portability is not needed and the risk of letting the user overwhelm resources is not a concern.
For robust handling of user/file input to read a line, create a wrapping function (e.g. int my_read_line(size_t size, char *buf, FILE *f)) and call that, and only that, in user code. Then when issues arise, they can be handled locally, regardless of the low-level input function selected.
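A minimal sketch of such a wrapper is below; the name and the truncation policy (discard the remainder of an over-long line) are illustrative choices, not a fixed API:

#include <stdio.h>
#include <string.h>

/* Read one line into buf (at most size-1 characters plus the terminator).
 * Returns 1 on success, 0 on end-of-file.  If the line did not fit, the
 * excess characters are discarded so the next call starts on a fresh line. */
int my_read_line(size_t size, char *buf, FILE *f)
{
    if (fgets(buf, (int)size, f) == NULL)
        return 0;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';                  /* strip the trailing newline */
    } else {
        int c;                                /* line was longer than buf */
        while ((c = fgetc(f)) != EOF && c != '\n')
            ;                                 /* drop the rest of the line */
    }
    return 1;
}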

What is the purpose of using memory stream in the C standard library?

In the C standard library, what is the purpose of using a memory stream (as created for an array via fmemopen())? How is it compared to manipulating the array directly?
This is very similar to using the std::stringstream in C++, which allows you to write to a string (including '\0' characters) and then use the string the way you'd like.
The idea is that we have many functions at our disposal, such as fprintf(), which can write data to a stream in a formatted way. All those functions can be used with a memory-based file without any change other than replacing the fopen() with fmemopen().
So if you want to build a string that requires many fprintf() calls, using those functions to generate the string in memory is extremely useful. snprintf() could also be used if you just need one quick conversion.
Similarly, you can of course use fread() and fwrite() and the like. If you need to create a file that requires a lot of seeking and is not so big that it cannot easily fit in memory, building it in memory is going to be a lot faster. Once done, you can save the results to disk.
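As a small illustration, here is a sketch that builds a formatted string entirely in memory with fmemopen() and fprintf(); note that fmemopen() is POSIX, not ISO C:

#include <stdio.h>

int main(void)
{
    char buf[256];
    FILE *mem = fmemopen(buf, sizeof buf, "w");   /* stream backed by buf */
    if (mem == NULL)
        return 1;

    fprintf(mem, "x = %d, y = %.2f\n", 42, 3.14); /* any stdio call works */
    fprintf(mem, "name = %s\n", "example");
    fclose(mem);            /* flushes and writes the terminating '\0' */

    fputs(buf, stdout);     /* the formatted text is now an ordinary string */
    return 0;
}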

Unicode normalization through ICU4C

I want to normalize a string using the ICU C interface.
Looking at unorm2_normalize, I have some questions.
The UNormalizer2 instance -- how do I dispose of it after I'm done with it?
What if the buffer isn't large enough for decomposition or recomposition? Is the normal way to check if the error code is U_BUFFER_OVERFLOW_ERROR? Does U_STRING_NOT_TERMINATED_WARNING apply? Is the resulting string null-terminated? If an error is returned, do I reallocate memory and try again? It seems like a waste of time to start all over again.
See unorm2_close(). Note that you should not free instances acquired via unorm2_getInstance(); those are shared instances owned by ICU.
In general, most ICU APIs can be passed a NULL buffer and 0 length as input, which should result in U_BUFFER_OVERFLOW_ERROR and the required length being reported back. If you get U_STRING_NOT_TERMINATED_WARNING it means just that: the data is populated but not terminated.
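Putting that together, here is a sketch of the usual preflight-then-allocate pattern with unorm2_normalize(), assuming ICU4C is available and with error handling kept short:

#include <stdlib.h>
#include <unicode/unorm2.h>

/* Returns a newly allocated NFC-normalized copy of src, or NULL on failure. */
UChar *normalize_nfc(const UChar *src, int32_t src_len, int32_t *out_len)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfc = unorm2_getNFCInstance(&status); /* shared; do not close */
    if (U_FAILURE(status))
        return NULL;

    /* Preflight: NULL buffer, 0 capacity -> U_BUFFER_OVERFLOW_ERROR plus length. */
    int32_t needed = unorm2_normalize(nfc, src, src_len, NULL, 0, &status);
    if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
        return NULL;

    status = U_ZERO_ERROR;                       /* reset before the real call */
    UChar *dest = malloc(((size_t)needed + 1) * sizeof(UChar));
    if (dest == NULL)
        return NULL;

    *out_len = unorm2_normalize(nfc, src, src_len, dest, needed + 1, &status);
    if (U_FAILURE(status)) {
        free(dest);
        return NULL;
    }
    return dest;   /* capacity exceeds the length, so the result is NUL-terminated */
}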

How do you read a file until you hit a certain string in C?

I wanted to know how, in C, you can read a file until the reading hits a certain string, or character array. What I want is that, once the reading hits that string, the position is set at that point. I am going to use fseek for that, and that's not a problem. It's just the reading until a certain string is hit that I am not able to do. I've been reading up on some of the functions, but there doesn't seem to be anything that covers this. fgets is the closest thing, but I don't want to give a fixed number of characters to read, as I don't know how many there are. Can you give me some tips on how to do this?
Thanks!
There are many efficient string searching algorithms, each of which can be implemented in C.
http://en.wikipedia.org/wiki/String_searching_algorithm
If you're looking for a string of length N, easiest is to keep a circular buffer of length N and read 1 byte at a time from the file adding it to the circular buffer. At each step you compare your buffer with the string you're searching for. It's highly inefficient but easy to code.
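A sketch of that idea is below; to keep the code short, the window is maintained by shifting rather than with a true circular index, and the search string is assumed to fit in a 256-byte buffer:

#include <stdio.h>
#include <string.h>

/* Returns the file offset of the first occurrence of needle, or -1 if absent. */
long find_in_file(FILE *f, const char *needle)
{
    size_t n = strlen(needle);
    char window[256];
    size_t filled = 0;
    long pos = 0;
    int c;

    if (n == 0 || n > sizeof window)
        return -1;

    while ((c = fgetc(f)) != EOF) {
        if (filled == n) {               /* slide the window left by one byte */
            memmove(window, window + 1, n - 1);
            filled = n - 1;
        }
        window[filled++] = (char)c;
        pos++;
        if (filled == n && memcmp(window, needle, n) == 0)
            return pos - (long)n;        /* offset usable directly with fseek() */
    }
    return -1;
}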
There's no built-in function to do exactly what you want, but there are a few options.
Option one: Read data in chunks. You don't know exactly where your data is, so read in a few kbs of data at a time, and search within these chunks. Make sure you deal with the case where the string you're looking for straddles a chunk boundary! Once you've located the string, use fseek() to position yourself at the start of it.
Option two: Memory map the file and use memmem() on the entire file (as mapped into memory). This requires unportable calls to set up the memory mapping, so you'll need to know your OS (or use a portability wrapper library like glib). On 32-bit machines, it will also limit the size of files you can search in to a few hundred megabytes. It is, however, a very simple and efficient approach when it's an option.
If you go with option one, the trickiest part will be dealing with the chunk-straddling case. One option is to always keep two chunks in memory, and restart the search so it begins (length of target string) - 1 bytes before the end of the previous block. The actual search could then be done using memmem() or any other string searching algorithm. You could also convert your search into a DFA (since it is a regular language) and keep the current state across blocks.
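Here is a sketch of option one along those lines, using a plain memcmp() loop as the inner search (memmem() could replace it where available) and keeping strlen(needle)-1 bytes of overlap between chunks; the 64-byte limit on the search string is an arbitrary assumption:

#include <stdio.h>
#include <string.h>

#define CHUNK 4096

/* Returns the file offset of the first occurrence of needle, or -1 if absent. */
long find_in_file_chunked(FILE *f, const char *needle)
{
    size_t n = strlen(needle);
    char buf[CHUNK + 64];
    size_t carry = 0;        /* bytes kept from the end of the previous chunk */
    long base = 0;           /* file offset corresponding to buf[0] */

    if (n == 0 || n > 64)
        return -1;

    for (;;) {
        size_t got = fread(buf + carry, 1, CHUNK, f);
        if (got == 0)
            return -1;                       /* EOF (or error) without a match */
        size_t total = carry + got;

        for (size_t i = 0; i + n <= total; i++)
            if (memcmp(buf + i, needle, n) == 0)
                return base + (long)i;       /* pass this to fseek(f, ..., SEEK_SET) */

        carry = n - 1;                       /* keep the tail for straddling matches */
        if (carry > total)
            carry = total;
        memmove(buf, buf + total - carry, carry);
        base += (long)(total - carry);
    }
}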

Hash function for short strings

I want to send function names from a weak embedded system to the host computer for debugging purposes. Since the two are connected by RS232, which is short on bandwidth, I don't want to send the functions' names literally. Some function names are around 15 characters long, and I sometimes want to send those names at a pretty high rate.
The solution I thought of was to find a hash function that would hash those function names to a single byte, and send that byte only. The host computer would scan all the functions in the source, compute their hashes using the same function, and then translate each hash back to the original string.
The hash function must be:
Collision-free for short strings.
Simple (since I don't want too much code in my embedded system).
Fit in a single byte.
Obviously, it does not need to be secure by any means, only collision-free, so I don't think cryptography-related hash functions are worth their complexity.
An example code:
int myfunc() {
    sendToHost(hash("myfunc"));
}
The host would then be able to present me with a list of the times at which the myfunc function was executed.
Is there some known hash function which holds the above conditions?
Edit:
I assume I will use far fewer than 256 function names.
I can use more than a single byte; two bytes would have me pretty well covered.
I prefer to use a hash function instead of using the same function-to-byte map on the client and the server, because (1) I have no map implementation on the client, and I'm not sure I want to add one for debugging purposes, and (2) it requires another tool in my build chain to inject the function-name table into my embedded system code. A hash is better in this regard, even if it means I'll have a collision once in a while.
Try minimal perfect hashing:
Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.
C code is included.
Hmm, with only 256 possible values, and since you will parse your source code to know all the possible functions, maybe the best way to do it would be to assign a number to each of your functions?
A real hash function probably won't work, because you have only 256 possible hashes but you want to map at least 26^15 possible values (assuming letter-only, case-insensitive function names).
Even if you restricted the number of possible strings (by applying some mandatory formatting) you would be hard pressed to get both meaningful names and a valid hash function.
You could use a Huffman tree to abbreviate your function names according to the frequency they are used in your program. The most common function could be abbreviated to 1 bit, less common ones to 4-5, very rare functions to 10-15 bits etc. A Huffman tree is not very hard to implement but you will have to do something about the bit alignment.
No, there isn't.
You can't make a collision free hash code, or even close to it, with just an eight bit hash. If you allow strings that are longer than one character, you have more possible strings than there are possible hash codes.
Why not just extract the function names and give each function name an id? Then you only need a lookup table on each side of the wire.
(As others have shown you can generate a hash algorithm without collisions if you already have all the function names, but then it's easier to just assign a number to each name to make a lookup table...)
If you have a way to track the functions within your code (i.e. a text file generated at run-time) you can just use the memory locations of each function. Not exactly a byte, but smaller than the entire name and guaranteed to be unique. This has the added benefit of low overhead. All you would need to 'decode' the address is the text file that maps addresses to actual names; this could be sent to the remote location or, as I mentioned, stored on the local machine.
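A rough sketch of that approach is below; send_u32() is a hypothetical transport routine, and a 32-bit address space is assumed. The address-to-name table would come from a separately generated file, as described above:

#include <stdint.h>

void send_u32(uint32_t v);   /* hypothetical RS232 send routine */

void myfunc(void)
{
    /* Send this function's address; the host translates it back to a name
     * using the generated address-to-name table. */
    send_u32((uint32_t)(uintptr_t)&myfunc);
    /* ... rest of the function ... */
}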
In this case you could just use an enum to identify functions. Declare function IDs in some header file:
typedef enum
{
    FUNC_ID_main,
    FUNC_ID_myfunc,
    FUNC_ID_setled,
    FUNC_ID_soundbuzzer
} FUNC_ID_t;
Then in functions:
int myfunc(void)
{
    sendFuncIDToHost(FUNC_ID_myfunc);
    ...
}
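On the host side, decoding is then a plain table lookup; a small sketch, assuming the table is kept in the same order as FUNC_ID_t:

static const char *const func_names[] = {
    "main",
    "myfunc",
    "setled",
    "soundbuzzer"
};

/* const char *name = func_names[received_id]; */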
If sender and receiver share the same set of function names, they can build identical hash tables from them. You can then communicate the path taken to reach a hash element, for example {starting position + number of hops}, which would take 2 bytes of bandwidth. For a fixed-size table (linear probing) only the final index is needed to address an entry.
NOTE: when building the two "synchronous" hash tables, the order of insertion is important ;-)
Described here is a simple way of implementing it yourself: http://www.devcodenote.com/2015/04/collision-free-string-hashing.html
Here is a snippet from the post:
It derives its inspiration from the way binary numbers are decoded and converted to decimal format: each binary string representation maps uniquely to a number in decimal.
If, say, we have a character set of capital English letters, then the size of the character set is 26, where A can be represented by the number 0, B by the number 1, C by the number 2, and so on up to Z, represented by the number 25. Now, whenever we want to map a string over this character set to a unique number, we perform the same conversion as in the binary case.
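A minimal sketch of that scheme in C, assuming upper-case letters only; note that it is only collision-free while the value fits in the integer type, so it cannot compress 15-character names into one or two bytes:

#include <stdint.h>

uint64_t base26_code(const char *s)
{
    uint64_t code = 0;
    while (*s)
        code = code * 26 + (uint64_t)(*s++ - 'A');   /* A -> 0, B -> 1, ..., Z -> 25 */
    return code;
}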

Resources