When would you use strings instead of characters? - c

When is it appropriate to use strings instead of characters? What about vice-versa?

Strings and characters represent fundamentally different concepts.
A character is a single, indivisible unit representing some sort of glyph. When working with a character, you are guaranteed to have a single character, no more or no less. Functions that work with characters are best suited for cases where you know this to be true. For example, if you were writing "Hangman" and wanted to process a user's guess, it would make sense for the function that processes the guess to take a character rather than a string, since you know for a fact that the input to that function should always be a single letter.
A string is a composite type formed by taking zero or more characters and putting them together. Strings are typically used to represent text, which can have an arbitrary length, including having zero length. Functions that work on strings are best suited for cases where the input is known to be made of letters, but it's unclear how many letters there are going to be.
One other option is to use a fixed-length array of characters, which is ideal for the situation where you know that you have exactly k characters for some k. This does not come up very much, but it's another option.
In short, use characters when you know that you need to work on a piece of text that is just one glyph long. Use strings when you don't know the length of the input in advance. Use fixed-sized arrays when you know for a fact that the input has some particular length.
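As a rough illustration of the Hangman case, here is a minimal sketch in which the guess is a single char while the secret word is a string of unknown length. The function name reveal_guess and the '_' placeholder convention are assumptions made for this example, not part of the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reveal every occurrence of `guess` in `secret`.
   `shown` must be a writable buffer at least as long as `secret`.
   Returns how many new positions were revealed. */
size_t reveal_guess(const char *secret, char *shown, char guess)
{
    assert(secret != NULL && shown != NULL);
    size_t hits = 0;
    for (size_t i = 0; secret[i] != '\0'; i++) {
        if (secret[i] == guess && shown[i] != guess) {
            shown[i] = guess;
            hits++;
        }
    }
    return hits;
}
```

The signature itself documents the contract: the caller cannot pass more than one letter as the guess, but the word can be any length.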

Related

Is it possible to know how many characters long text read from a file will be in C?

I know in C++, you can check the length of the string, but in C, not so much.
Is it possible knowing the file size of a text file, to know how many characters are in the file?
Is it one byte per character or are other headers secretly stored whether or not I set them?
I would like to avoid performing a null check on every character as I iterate through the file for performance reasons.
Thanks.
You can open the file and read all the characters and count them.
Besides that, there's no fully portable method to check how long a file is -- neither on disk, nor in terms of how many characters will be read. This is true for text files and binary files.
The question "How do you determine the size of a file in C?" goes over some of the pitfalls. Perhaps one of the solutions there will suit a subset of systems that you run your code on; or you might like to use a POSIX or operating system call.
As mentioned in comments; if the intent behind the question is to read characters and process them on the fly, then you still need to check for read errors even if you knew the file size, because reading can fail.
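The "read everything and count" approach, including the read-error check, might look like the following sketch. The function name count_chars is made up for this example; fgetc and ferror are standard library calls.

```c
#include <assert.h>
#include <stdio.h>

/* Counts the characters read from `fp` until end of file.
   Returns -1 if a read error occurred, otherwise the count. */
long count_chars(FILE *fp)
{
    assert(fp != NULL);
    long n = 0;
    int c;                      /* int, not char, so EOF stays distinguishable */
    while ((c = fgetc(fp)) != EOF)
        n++;
    return ferror(fp) ? -1 : n;
}
```

Note that the loop tests for EOF rather than for a NUL byte; a NUL inside the file is counted like any other character.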
Characters (of type char) are single byte values, as defined in the C standard (see CHAR_BIT). A NUL character is also a character, and so it, too, takes up a single byte.
Thus, if you are working with an ASCII text file, the file size will be the number of bytes and therefore equivalent to the number of characters.
If you are asking how long individual strings are inside the file, then you will indeed need to look for NUL and other extended character bytes and calculate string lengths on that basis. You might not be able to safely assume that there is only one NUL character and that it is at the end of the file, depending on how that file was made. There can also be newlines and other extended characters you would want to exclude. You have to decide on a character set and do counting from that set.
Further, if you are working with a file containing multibyte characters in a Unicode encoding such as UTF-8, then this will be a different answer: the number of bytes and the number of characters no longer match. You would use different functions to read a text file using a multibyte encoding.
So the answer will depend on what type of encoding your text file uses, and whether you are calculating characters or string lengths, which are two different measures.
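To make the byte-versus-character distinction concrete, here is a small sketch that counts code points in a string, under the assumption that the input is valid UTF-8. The function name utf8_length is hypothetical; the code does no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a NUL-terminated string, assuming valid UTF-8.
   Continuation bytes have the bit pattern 10xxxxxx and are not counted. */
size_t utf8_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For pure ASCII input the result equals the byte count; for anything else it is smaller, which is why the file size alone does not give the character count.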

What are the reasons for the decisions behind implementing strings as an array of chars with a null marker vs other approaches in C?

Looking into C for the first time, I found that a string is actually a char[] - and I was wondering how many different ways there are of implementing a string datatype as a result?
A comment on this question (Why does a string of N chars require initializing an array of N + 1 chars in C?)
For a string datatype, you need to know the length. You can either have a struct that has a length field (and the char array), or you need a special marker to signal the end of the string. In C the special marker method has been chosen and the marker is a null character
Implies there are just two means of achieving a string structure?
1. A char[] with a null marker
2. An object of sorts that provides a pointer to the start of a char[] and other necessary metadata
Are there other means of implementing a string datatype? Why did C take approach (1)?
Why did C take approach (1)?
According to The Development of the C Language, it was to avoid fixing the maximum length of a string, and that their personal experience led them to believe a terminator was more convenient.
None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.
Are there other means of implementing a string datatype?
Nothing significantly different, as long as a string is defined to consist of contiguous bytes.
Remember that C was developed primarily to implement the UNIX operating system - text processing was not going to be its focus.
Mapping strings and string operations onto arrays makes sense since, at their core, strings are sequences of character values. Existing operations on arrays (such as they are) can be applied to strings fairly easily. Some operations like concatenation become dead easy.
Using a terminator instead of a leading length byte means there’s no upper limit on string length.
There are times when it would be nice to have a real string data type, distinct from an array of char. However, in most C programming, those times are few enough and far enough between that this method is good enough.
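The trade-off between a leading count and a terminator can be sketched as follows. The struct counted_string and the function to_counted are hypothetical names invented for this illustration; the one-byte count mirrors the BCPL-style layout quoted above and is what caps the length at 255.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: a one-byte length followed by the
   characters. The 8-bit count limits the text to 255 characters. */
struct counted_string {
    unsigned char len;
    char data[255];
};

/* Converts a NUL-terminated C string to the counted form.
   Returns 0 on success, -1 if the text does not fit in the count. */
int to_counted(struct counted_string *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t n = strlen(src);         /* O(n): scans for the terminator */
    if (n > sizeof dst->data)
        return -1;
    dst->len = (unsigned char)n;    /* length lookups are O(1) from here on */
    memcpy(dst->data, src, n);      /* no terminator is stored */
    return 0;
}
```

The terminated form has no such limit but pays for it with a scan every time the length is needed.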

Finding the number of occurrences of each character in a String or character array

I am going over some interview preparation material and I was wondering what the best way to solve this problem would be if the characters in the String or array can be Unicode characters. If they were strictly ASCII, you could make an int array of size 256 and map each ASCII character to an index, and that position in the array would represent the number of occurrences. If the string has Unicode characters, is it still possible to do so, i.e. does a Unicode character have a reasonable size such that you could represent it using the indexes of an integer array? Since Unicode characters can be more than 1 byte in size, what data type would you use to represent them? What would be the optimal solution for this case?
Since Unicode only defines code points in the range [0, 2^21), you only need an array of 2^21 (i.e. about 2 million) elements, which should fit comfortably into memory.
An array wouldn't be practical when using Unicode. This is because Unicode defines (less than) 2^21 characters.
Instead, consider using two parallel vectors, one for the character and one for the count. The setup would look something like this:
<'c', '$', 'F', '¿', '¤'> //unicode characters
< 1 , 3 , 1 , 9 , 4 > //number of times each character has appeared.
EDIT
After seeing Kerrek's answer, I must admit, an array of size 2 million would be reasonable. The amount of memory it would take up would be in the Megabyte range.
But as it's for an interview, I wouldn't recommend having an array 2 million elements long, especially if many of those slots will be unused (not all Unicode characters will appear, most likely). They're probably looking for something a little more elegant.
SECOND EDIT
As per the comments here, Kerrek's answer does indeed seem to be more efficient as well as easier to code.
While others here are focusing on data structures, you should also know that the notion of "Unicode character" is somewhat ill-defined. That's a potential interview trap. Consider: are å and å the same character? The first one is a "latin small letter a with ring above" (codepoint U+00E5). The second one is a "latin small letter a" (codepoint U+0061) followed by a "combining ring above" (U+030A). Depending on the purpose of the count, you might need to consider these as the same character.
You might want to look into Unicode normalization forms. It's great fun.
Convert string to UTF-32.
Sort the 32-bit characters.
Getting character counts is now trivial.
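The sort-then-count steps above can be sketched like this, assuming the text has already been converted to an array of UTF-32 code points. The names cmp_u32 and count_code_points and the output-array interface are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sorts `cp` in place, then walks the runs of equal values.
   Writes up to `cap` (code point, count) pairs and returns the
   number of distinct code points found. */
size_t count_code_points(uint32_t *cp, size_t n,
                         uint32_t *out_cp, size_t *out_count, size_t cap)
{
    if (n == 0)
        return 0;
    assert(cp != NULL && out_cp != NULL && out_count != NULL);
    qsort(cp, n, sizeof *cp, cmp_u32);
    size_t distinct = 0;
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j < n && cp[j] == cp[i])
            j++;                        /* j stops at the end of the run */
        if (distinct < cap) {
            out_cp[distinct] = cp[i];
            out_count[distinct] = j - i;
        }
        distinct++;
        i = j;
    }
    return distinct;
}
```

This uses memory proportional to the input rather than to the size of the code point range, at the cost of an O(n log n) sort. It counts code points as they are, so the precomposed and decomposed forms of å are still counted separately unless the text is normalized first.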

How can I parse text input and convert strings to integers?

I have an input file containing the following data.
1 1Apple 2Orange 10Kiwi
2 30Apple 4Orange 1Kiwi
and so on. I have to read this data from the file and work on it, but I don't know how to retrieve the data. I want to store 1 (of 1Apple) as an integer and then Apple as a string.
I thought of reading the whole 1Apple as a string and then doing something with the stoi function.
Or I could read the whole thing character by character, and if the ASCII value of a character lies between 48 and 57, I would combine those characters into an integer and save the rest as a string. Which one should I do? Also, how do I check the ASCII value of a char? (Should I convert the char to int and then compare, or is there a built-in function?)
How about using the fscanf() function if and only if your input pattern is not going to change. Otherwise you should probably use fgets() and perform checks if you want to separate the number from the string such as you suggested.
There is one easy right way to do this with standard C library facilities, one rather more difficult right way, and a whole lot of wrong ways. This is the easy right way:
Read an entire line into a char[] buffer using fgets.
Extract numbers from this line using strtol or strtoul.
It is very important to understand why the easier-looking alternatives (*scanf and atoi) should never be used. You might write less code initially, but once you start thinking about how to handle even slightly malformed input, you will discover that you should have used strtol.
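As a sketch of the strtol step, the following function splits one token such as "10Kiwi" into its number and its name. It assumes the line has already been read with fgets and split on whitespace; the function name parse_item and its error convention are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Splits a token such as "10Kiwi" into its leading count and trailing name.
   Returns 0 on success, -1 if there is no leading number, the number is
   out of range, or the name is missing or does not fit in `name`. */
int parse_item(const char *token, long *count, char *name, size_t name_size)
{
    assert(token != NULL && count != NULL && name != NULL);
    char *end;
    errno = 0;
    long value = strtol(token, &end, 10);
    if (end == token || errno == ERANGE)
        return -1;                  /* no digits consumed, or overflow */
    size_t len = strlen(end);       /* `end` points just past the digits */
    if (len == 0 || len >= name_size)
        return -1;
    memcpy(name, end, len + 1);     /* copies the terminator too */
    *count = value;
    return 0;
}
```

The end pointer is what makes strtol suitable here: it reports both whether any digits were found and where the rest of the token begins, which atoi cannot do.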
The "rather more difficult right way" is to use lex and yacc. They are much more complicated but also much more powerful. You shouldn't need them for this problem.

A C style string file format conundrum

I'm very confused by this wee little problem I have. I have a non-indexed file format header (more specifically, the ID3 header). This header stores a string, or rather three bytes, as confirmation that the data is actually an ID3 tag (TAG is the string, by the way). The point is that this TAG in the file format is not null-terminated. So there are two things that can be done:
Load the entire file with fread and for non-terminated string comparison, use strncmp. But:
This sounds hacky
What if someone opens it up and tries to manipulate the string w/o prior knowledge of this?
The other option is that the file be loaded, but the C struct shouldn't exactly map to the file format, but include proper null-terminators, and then each member should be loaded using a unique call. But, this too feels hacky and is tedious.
Help, especially from people who have practical experience with dealing with such stuff, is appreciated.
The first thing to consider when parsing anything is: Are the lengths of these fields either fixed in size, or prefixed by counts (that are themselves fixed in size, for example, nearly every graphics file has a fixed size/structure header followed by a variable sized sequence of the pixels)? Or, does the format have completely variable length fields that are delimited somehow (for example, MPEG4 frames are delimited by the bytes 0x00, 0x00, 0x01)? Usually the answer to this question will go a long way toward telling you how to parse it.
If the file format specification says a certain three bytes have the values corresponding to 'T', 'A', 'G' (84, 65, 71), then you should compare just those three bytes.
For this example, strncmp() is OK. In general, memcmp() is better because it doesn't have to worry about string termination, so even if the byte stream (tag) you are comparing contains ASCII NUL '\0' characters, memcmp() will work.
You also need to recognize whether the file format you are working with is primarily printable data or whether it is primarily binary data. The techniques you use for printable data can be different from the techniques used for binary data; the techniques used for binary data sometimes (but not always) translate for use with printable data. One big difference is that the lengths of values in binary data are known in advance, either because the length is embedded in the file or because the structure of the file is known. With printable data, you are often dealing with variable-length encodings with implicit boundaries on the fields, and no length information ahead of them.
For example, the Unix password file format is a text encoding with variable length fields; it uses a ':' to separate fields. You can't tell how long a field is until you come across the next ':' or the end of the line. This requires different handling from a binary format encoded using ASN.1 [1], where fields can have a type indicator value (usually a byte) and a length (can be 1, 2 or 4 bytes, depending on type) before the actual data for the field.
[1] ASN.1 is (justifiably) regarded as very complex; I've given a very simple example of roughly how it is used that can be criticized on many levels. Nevertheless, the basic idea is valid: length (and with ASN.1, usually type too) precedes the (binary) data. This is also known as TLV (type, length, value) encoding.
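A heavily simplified TLV reader might look like the sketch below. It assumes a made-up layout of one type byte and one length byte followed by the value; this is not real ASN.1, and the names struct tlv and tlv_read are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical TLV record: one type byte, one length byte,
   then `length` bytes of value. Real ASN.1 encodings are more involved. */
struct tlv {
    unsigned char type;
    unsigned char length;
    const unsigned char *value;     /* points into the caller's buffer */
};

/* Reads one record from `buf`. Returns the number of bytes consumed,
   or 0 if the buffer is too short to hold a complete record. */
size_t tlv_read(const unsigned char *buf, size_t size, struct tlv *out)
{
    assert(buf != NULL && out != NULL);
    if (size < 2 || (size_t)buf[1] > size - 2)
        return 0;
    out->type = buf[0];
    out->length = buf[1];
    out->value = buf + 2;
    return 2 + (size_t)buf[1];
}
```

Because the length arrives before the data, the reader never has to scan for a delimiter, and the value may contain any byte, including NUL.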
If you are just learning something, you can find the ID3v1 tag in an MP3 file by reading the last 128 bytes of the file and checking whether the first 3 characters of the block are TAG.
For a real application, use TagLib.
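For the learning case, the check on the last 128 bytes could be sketched as follows. The function name has_id3v1_tag is made up for this example, and it assumes the stream was opened in binary mode and supports seeking relative to the end of the file, which holds on common platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ID3V1_SIZE 128

/* Returns 1 if the last 128 bytes of `fp` start with "TAG", 0 if they do
   not, and -1 if the file is too short or cannot be read.
   `fp` should be opened in binary mode ("rb"). */
int has_id3v1_tag(FILE *fp)
{
    assert(fp != NULL);
    unsigned char block[ID3V1_SIZE];
    if (fseek(fp, -(long)ID3V1_SIZE, SEEK_END) != 0)
        return -1;
    if (fread(block, 1, sizeof block, fp) != sizeof block)
        return -1;
    /* memcmp compares exactly three bytes; no terminator is needed. */
    return memcmp(block, "TAG", 3) == 0;
}
```

Comparing with memcmp keeps the in-memory block identical to the on-disk layout, so no NUL terminators have to be added to the struct.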
Keep three bytes and compare each byte with the characters 'T', 'A' and 'G'. This may not be very smart, but gets the job done well and more importantly correctly.
And don't forget that the genre has two different meanings in ID3v1 and ID3v1.1.

Resources