Own strcmp function - non standard chars - c

I am currently writing a little sort function. I can only use the stdio library, so I wrote my 'own strcmp' function.
int ownstrcmp(char a[], char b[])
{
int i = 0;
while( a[i] == b[i] )
{
if( a[i] == '\0' )
return 0;
++i;
}
return ( a[i] < b[i]) ? 1 : -1;
}
This works great for me. But there is one little problem: what can I do about 'non-standard' characters like ä, ü and ß? Their decimal character code is greater than that of the normal characters, so it sorts the string 'example' behind 'ääää'.
I have already read about locale, but the only library that I can use is stdio.h. Is there a 'simple' solution for this problem?

Your question is somewhat vague. First of all, how characters with umlaut are represented depends on your encoding. For example, my computer's locale is set to Greek, meaning that in place of those special Latin characters I have Greek characters. You can't assume anything like that, as far as I can tell.
Second, the answer to your question depends on your representation. Are you still using a "one char per character" representation? If that's so, the above code might still work.
If you're using multi char representation, for example two chars per character, you should change your code so that it exits when two consecutive chars are \0.
Generally, you may want to look into how wchar_t and its family of functions (specifically wcscmp) are implemented.

In German, the umlauts ä, ö, ü and the ß are sorted as if they were written in their 'expanded' form:
ä -> ae
ö -> oe
ü -> ue
ß -> ss
In order to get the collation according to the standard you could expand the strings before comparing.
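The expansion can also be done on the fly while comparing, so no extra buffer is needed. Below is a minimal sketch, assuming the strings are stored in ISO 8859-1 (one byte per character) and handling only the four lowercase cases listed above. The names expand_char and expanded_strcmp are made up for this illustration, and the byte values 0xE4, 0xF6, 0xFC and 0xDF are the ISO 8859-1 codes for ä, ö, ü and ß.

```c
/* Hypothetical helper: writes the expansion of one byte into out
   (at most two chars plus terminator). Byte values assume ISO 8859-1. */
static void expand_char(unsigned char c, char out[3])
{
    out[1] = '\0';
    out[2] = '\0';
    switch (c) {
    case 0xE4: out[0] = 'a'; out[1] = 'e'; break; /* a-umlaut */
    case 0xF6: out[0] = 'o'; out[1] = 'e'; break; /* o-umlaut */
    case 0xFC: out[0] = 'u'; out[1] = 'e'; break; /* u-umlaut */
    case 0xDF: out[0] = 's'; out[1] = 's'; break; /* sharp s */
    default:   out[0] = (char)c; break;
    }
}

/* Compares a and b as if the four characters above were expanded.
   Returns a negative value, 0, or a positive value like strcmp(). */
int expanded_strcmp(const char a[], const char b[])
{
    char ea[3] = "", eb[3] = "";  /* current expansions */
    int ia = 0, ib = 0;           /* positions inside the expansions */
    int pa = 0, pb = 0;           /* positions in the input strings */

    for (;;) {
        /* Refill an expansion once it is used up and input remains. */
        if (ea[ia] == '\0' && a[pa] != '\0') {
            expand_char((unsigned char)a[pa++], ea);
            ia = 0;
        }
        if (eb[ib] == '\0' && b[pb] != '\0') {
            expand_char((unsigned char)b[pb++], eb);
            ib = 0;
        }
        unsigned char ca = (unsigned char)ea[ia];
        unsigned char cb = (unsigned char)eb[ib];
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == '\0')
            return 0;             /* both strings ended together */
        ++ia;
        ++ib;
    }
}
```

With this sketch, "ääää" (as ISO 8859-1 bytes) compares like "aeaeaeae" and therefore sorts before "example". Uppercase umlauts and other encodings would need their own cases.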

You need to know the encoding the characters are in, and make sure you treat the strings properly. If the encoding is multi-byte, you must start reading (and comparing) individual characters, not bytes.
Also, the way to compare characters internationally varies with the locale; there's no single solution. In some languages, 'ä' sorts after 'z', in some it sorts right next to 'a'.
One simple way of implementing this is of course to create a table which holds the relative order for each character, like so:
unsigned char character_order[256];
character_order[(unsigned char) 'a'] = 1;
character_order[(unsigned char) 'ä'] = character_order[(unsigned char) 'a'];
/* ... and so on ... */
Then instead of subtracting the character's encoded value (which no longer can be used as a "proxy" for the sorting order of the character), you compare the character_order values.
The above assumes single-byte encoding, i.e. Latin-1 or something, since the array size is only 256.
Also note casts to unsigned char when indexing with character literals.
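Put together, a table-driven comparison might look like the sketch below. It assumes ISO 8859-1, where 'ä' is the byte 0xE4, and it fills the table with a deliberately simple rule (rank equals byte value, except that 'ä' shares the rank of 'a'). The function names init_character_order and table_strcmp are invented for this example; a real table would assign ranks for every character you care about.

```c
/* Sort rank of every byte value; rank 0 is left to the terminator. */
static unsigned char character_order[256];

/* Hypothetical setup: ranks follow the byte value, except that the
   ISO 8859-1 byte for a-umlaut (0xE4) gets the same rank as 'a'. */
void init_character_order(void)
{
    for (int c = 0; c < 256; ++c)
        character_order[c] = (unsigned char)c;
    character_order[0xE4] = character_order[(unsigned char)'a'];
}

/* Like strcmp(), but compares ranks from the table instead of bytes. */
int table_strcmp(const char a[], const char b[])
{
    int i = 0;
    while (character_order[(unsigned char)a[i]] ==
           character_order[(unsigned char)b[i]]) {
        if (a[i] == '\0')
            return 0;   /* only the terminator has rank 0 */
        ++i;
    }
    return character_order[(unsigned char)a[i]] <
           character_order[(unsigned char)b[i]] ? -1 : 1;
}
```

Call init_character_order() once before sorting. Because 'ä' and 'a' share a rank here, strings that differ only in those characters compare as equal; give 'ä' its own rank just above 'a' if you need a strict order.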

If you are using a single-byte ISO/IEC 8859 encoding (8859-1 and 8859-15 are the usual ones for German), it's enough to convert your char to unsigned char.
This way the characters are represented in the interval 0-255, which suits these encodings.
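As a rough sketch of that idea, the only change to the function from the question is the cast in the final comparison. The name ownstrcmp_unsigned is made up here, and the return value follows the usual strcmp() convention (negative if a sorts first), which differs from the sign used in the question.

```c
/* Compares byte by byte, treating every byte as a value in 0..255. */
int ownstrcmp_unsigned(const char a[], const char b[])
{
    int i = 0;
    while (a[i] == b[i]) {
        if (a[i] == '\0')
            return 0;
        ++i;
    }
    /* The casts keep bytes above 127 from comparing as negative. */
    return (unsigned char)a[i] < (unsigned char)b[i] ? -1 : 1;
}
```

This orders the strings by raw byte value, so in ISO 8859-1 'ä' (0xE4) still sorts after 'z'; it only removes the surprise caused by a signed char.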

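Under UTF-8 this can help, following your code (it relies on plain char being signed, so that the bytes of multi-byte sequences compare as negative):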
if ((a[i] > 0) ^ (b[i] > 0))
return a[i] > 0 ? 1 : -1;
else
return a[i] < b[i] ? 1 : -1;
But you have to check cases like ownstrcmp("ab", "abc").
Furthermore, your code doesn't work like strcmp() in <string.h>, which is documented like this:
A value greater than zero indicates that the first character that does not match has a greater value in str1 than in str2; a value less than zero indicates the opposite.
I would do it like this:
int ownstrcmp(char a[], char b[])
{
int i = 0;
while(a[i] == b[i]) {
if (a[i] == 0) return 0;
++i;
}
if ((a[i] == 0) || (b[i] == 0))
return a[i] != 0 ? 1 : -1;
if ((a[i] > 0) ^ (b[i] > 0))
return a[i] < 0 ? 1 : -1;
else
return a[i] > b[i] ? 1 : -1;
}

Related

Condition for checking whether a letter is between 'j' and 'p'

The code should accept a character and it should check whether its between 'j' and 'p'.
If it is between 'j' and 'p' it should print yes or else it should print no.
I have tried to do something about it but the only ideas I got is this:
if (a=='j' || a=='k' || a=='l' || a=='m' || a=='n' || a=='o' || a=='p')
{
printf("YES");
}
else
{
printf("NO");
}
You can avoid all the alternative tests by using a function like strchr():
if (strchr("jklmnop", a)) {
puts("YES");
} else {
puts("NO");
}
The obvious approach is to do something like
if (a >= 'j' && a <= 'p') {
// ...
}
but that has a problem if you want to write portable code.
The C standard only requires that the characters '0' through '9' appear consecutively and in order. If you're following the standard to a t, you shouldn't assume that 'j' through 'p' appear together and can be used with a pair of >= and <= tests. If you add additional qualifications like requiring an ASCII compatible character set, it's a different story.
It depends on what you mean by "between j and p".
If you mean "Only lowercase English letters between j and p", then one portable way of writing it down is
if (strchr("jklmnop", a)) ...
If you mean "Character codes between that of 'j' and that of 'p', in whatever encoding is used by the machine", then one portable way of writing it down is
if (a >= 'j' && a <= 'p') ...
If your encoding is ASCII, then the two notions above strictly coincide for any range of English letters.
If your encoding is EBCDIC, then they coincide for the range j..p, but not say for the range i..p.
It is guaranteed that all English letters between j and p are included in the range of the codes in any standards-compliant encoding, but there might be additional, non-English-letter characters in the same range.
Finally, for completeness, if by "between j and p" you mean "letters of the user's language, whatever it is, that are between j and p", then one correct way of writing it down is probably
setlocale (LC_ALL, ""); // first statement of the program
...
if (strcoll(a, "j") >= 0 && strcoll(a, "p") <= 0) ...
Note that here, a is not a character as above, but a string. It is up to you to ensure that it contains a single character of the user's language (which is not the same thing as a single char element). Ensuring this is very non-trivial.
TL;DR
if (a >= 'j' && a <= 'p') will probably work for whatever task you currently have, but don't assume it will always work.
Try this code:
#include <stdio.h>
int main()
{
char a;
printf("enter the letter :");
scanf("%c",&a);
if(a>='j' && a<='p')
printf("yes the letter is between j and p");
else
printf("No the given letter is not between j and p");
}
Every letter has an ASCII code, which is an integer, so you can do it like this:
char a;
scanf("%c",&a);
if(a>='j'&&a<='p')
{
printf("YES");
}
else
{
printf("NO");
}
Assuming lowercase letters form a contiguous block in the execution set, which is true for ASCII, you can write:
if (a >= 'j' && a <= 'p') ...
Assuming a is an int containing a char value, you can write this as a single test, but a good compiler should be able to generate the same code for the more readable test above:
if ((unsigned)(a - 'j') <= (unsigned)('p' - 'j')) ...
You could also test if a is in a set of characters, which will work regardless of the target encoding:
if (a != 0 && strchr("jklmnop", a)) ...
The test for a != 0 can be removed if you know a cannot be a null byte.
Character literals (this is a character literal: 'a') are just numbers. Almost all computers use an ASCII-compatible encoding (there are very few exceptions).
Assuming ASCII, 'a' is 97 for your computer and 'j' is 106. If you write a=='j' you basically write a==106. Using character literals is just syntactic sugar: it makes the code a lot easier for humans to read, but the computer does not care.
This means you have to check whether a is between 106 and 112. You probably know a better way to do that than your current way. But instead of 106 and 112 write 'j' and 'p', because it is easier to read.
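As a small sketch of that last point, the whole check fits into one comparison. The function name is_j_to_p is made up for illustration, and the code assumes an encoding such as ASCII in which 'j' through 'p' are contiguous.

```c
/* Returns 1 if a lies between 'j' and 'p' inclusive, 0 otherwise.
   Assumes an encoding such as ASCII where 'j'..'p' are contiguous. */
int is_j_to_p(int a)
{
    return a >= 'j' && a <= 'p';
}
```

You would then write if (is_j_to_p(a)) puts("YES"); else puts("NO");.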

Most efficient way to extract season and episode from tv show filename in C

I want to extract season and episode from a filename in C. For example, if the input string is "Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv", then I want to extract the substring "S05E02" out of it.
At the moment, I'm using a very naive approach for matching characters one at a time. Concretely, I am finding 'S' and then checking if the next two characters are both numbers between '0' and '9' and then the subsequent character is 'E' and the next two characters to 'E' are also between '0' and '9'.
// Return index if pattern found. Return -1 otherwise
int get_tvshow_details(const char filename[])
{
unsigned short filename_len = strlen(filename);
for (int i = 0; i < filename_len-5; ++i) {
char season_prefix = filename[i];
char episode_prefix = filename [i+3];
char season_left_digit = filename[i+1];
char season_right_digit = filename[i+2];
char episode_left_digit = filename[i+4];
char episode_right_digit = filename[i+5];
if ((season_prefix == 'S' || season_prefix == 's')
&& (episode_prefix == 'E' || episode_prefix == 'e')
&& (season_left_digit >= '0' && season_left_digit <= '9')
&& (season_right_digit >= '0' && season_right_digit <= '9')
&& (episode_left_digit >= '0' && episode_left_digit <= '9')
&& (episode_right_digit >= '0' && episode_right_digit <= '9')) {
printf("match found at %d\n", i);
return i;
}
}
return -1;
}
Is there a more efficient way in C to extract the following pattern: S<2_digit_number>E<2_digit_number> from any tv show filename?
I'd like to propose another solution, very similar to regex, but not dependent on a separate library for regex. C's format strings are quite powerful, though primitive. I think they could actually work in this case.
The format string we'll need is- %*[^.].%*[^.].%*[^.].%*1[Ss]%d%*1[Ee]%d.
Let's compare this to a string like Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv
The first %*[^.]. will consume Game. but not capture it.
The second %*[^.]. will consume of. but not capture it.
The third %*[^.]. will consume Thrones. but not capture it.
Now the fun part, %*1[Ss]%d%*1[Ee]%d. is designed to capture S05E02., and also extract the 05 and 02 into integer variables. Let's discuss this.
%*1[Ss] will consume only 1 letter that is either S or s but not capture it
%d will consume the digits afterwards (05 in this case) and store it into an integer
%*1[Ee] will consume only 1 letter that is either E or e but not capture it
Finally, %d. will consume the digits afterwards, store them inside an integer and consume the . right after.
If used properly, it should look like-
// Just a dummy string literal for testing
char s[] = "Game.of.Thrones.S05E02.720p.HDTV.x264-IMMERSE.mkv";
// Variables to store the numbers in
int seas, ep;
printf("%d\n", sscanf(s, "%*[^.].%*[^.].%*[^.].%*1[Ss]%d%*1[Ee]%d.", &seas, &ep));
You may notice, we're also printing the return value of sscanf (you don't have to print it, you can just store it). This is very important. If sscanf returns 2 (that is, the number of captured variables), you know that it was a successful match and the provided string is indeed valid. If it returns anything else, it indicates either non-complete match or a complete failure (in case of negative values).
If you run this piece of code, you get-
2
Which is correct. If you print seas and ep later, you get-
5 2
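Note that the format string above expects exactly three dot-separated fields before the tag, so a show whose title has a different number of words would not match. If that matters, a position-by-position scan like the one in the question still works. The sketch below is one way to write it; find_season_episode is a hypothetical name, and isdigit() from <ctype.h> replaces the manual '0'..'9' comparisons.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the index of the first S<dd>E<dd> tag in filename, or -1.
   On success the two numbers are stored through season and episode. */
int find_season_episode(const char *filename, int *season, int *episode)
{
    for (size_t i = 0; filename[i] != '\0'; ++i) {
        const unsigned char *p = (const unsigned char *)filename + i;
        /* && short-circuits, so nothing past the terminator is read:
           a '\0' fails its test and stops the chain. */
        if ((p[0] == 'S' || p[0] == 's') && isdigit(p[1]) && isdigit(p[2]) &&
            (p[3] == 'E' || p[3] == 'e') && isdigit(p[4]) && isdigit(p[5])) {
            *season  = (p[1] - '0') * 10 + (p[2] - '0');
            *episode = (p[4] - '0') * 10 + (p[5] - '0');
            return (int)i;
        }
    }
    return -1;
}
```

For the example filename this returns 16 and stores 5 and 2. The digit arithmetic is portable because C guarantees that '0' through '9' are consecutive.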

Do char's in C have pre-assigned zero indexed values?

Sorry if my title is a little misleading, I am still new to a lot of this but:
I recently worked on a small cipher project where the user passes an argument on the command line, and it must be alphabetical. (Ex: ./file abc)
This argument is then used in a formula to encipher a message of plain text you provide. I got the code to work, thanks to my friend's help, but I'm not 100% sure about a specific part of this formula.
#include <stdio.h>
#include <cs50.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
int main (int argc, string argv[])
{ //Clarify that the argument count is not larger than 2
if (argc != 2)
{
printf("Please Submit a Valid Argument.\n");
return 1;
}
//Store the given argument (our key) inside a string var 'k' and check if it is alpha
string k = (argv[1]);
//Store how long the key is
int kLen = strlen(k);
//Tell the user we are checking their key
printf("Checking key validation...\n");
//Pause the program for 2 seconds
sleep(2);
//Check to make sure the key submitted is alphabetical
for (int h = 0, strlk = strlen(k); h < strlk; h++)
{
if isalpha(k[h])
{
printf("Character %c is valid\n", k[h]);
sleep(1);
}
else
{ //Telling the user the key is invalid and returning them to the console
printf("Key is not alphabetical, please try again!\n");
return 0;
}
}
//Store the users soon to be enciphered text in a string var 'pt'
string pt = get_string("Please enter the text to be enciphered: ");
//A prompt that the encrypted text will display on
printf("Printing encrypted text: ");
sleep(2);
//Encipher Function
for(int i = 0, j = 0, strl = strlen(pt); i < strl; i++)
{
//Get the letter 'key'
int lk = tolower(k[j % kLen]) - 'a';
//If the char is uppercase, run the V formula and increment j by 1
if isupper(pt[i])
{
printf("%c", 'A' + (pt[i] - 'A' + lk) % 26);
j++;
}
//If the char is lowercase, run the V formula and increment j by 1
else if islower(pt[i])
{
printf("%c", 'a' + (pt[i] - 'a' + lk) % 26);
j++;
}
//If the char is a symbol just print said symbol
else
{
printf("%c", pt[i]);
}
}
printf("\n");
printf("Closing Script...\n");
return 0;
}
The Encipher Function:
Uses 'A' as a char for the placeholder but does 'A' hold a zero indexed value automatically? (B = 1, C = 2, ...)
In C, character literals like 'A' are of type int, and represent whatever integer value encodes the character A on your system. On the 99.999...% of systems that use ASCII character encoding, that's the number 65. If you have an old IBM mainframe from the 1970s using EBCDIC, it might be something else. You'll notice that the code is subtracting 'A' to make 0-based values.
This does make the assumption that the letters A-Z occupy 26 consecutive codes. This is true of ASCII (A=65, B=66, etc.), but not of all codes, and not guaranteed by the language.
does 'A' hold a zero indexed value automatically? (B = 1, C = 2, ...)
No. Strictly conforming C code can not depend on any character encoding other than the numerals 0-9 being represented consecutively, even though the common ASCII character set does represent them consecutively.
The only guarantee regarding character sets is per 5.2.1 Character sets, paragraph 3 of the C standard:
... the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous...
Character sets such as EBCDIC don't represent all the letters consecutively.
char is a numeric type that happens to also often be used to represent visible characters (or special non-visible pseudo-characters). 'A' is a value (with actual type int) that can be converted to a char without overflow or underflow. That is, it's really some number, but you usually don't need to know what number, since you generally use a particular char value either as just a number or as just a character, not both.
But this program is using char values in both ways, so it somewhat does matter what the numeric values corresponding to visible characters are. One way it's very often done, but not always, is using the ASCII values which are numbered 0 to 127, or some other scheme which uses those values plus more values outside that range. So for example, if the computer uses one of those schemes, then 'A'==65, and 'A'+1==66, which is 'B'.
This program is assuming that all the lowercase Latin-alphabet letters have numeric values in consecutive order from 'a' to 'z', and all the uppercase Latin-alphabet letters have numeric values in consecutive order from 'A' to 'Z', without caring exactly what those values are. This is true of ASCII, so it will work on many kinds of machines. But there's no guarantee it will always be true!
C does guarantee the ten digit characters from '0' to '9' are in consecutive order, which means that if n is a digit number from zero to nine inclusive, then n + '0' is the character for displaying that digit, and if c is such a digit character, then c - '0' is the number from zero to nine it represents. But that's the only guarantee the C language makes about the values of characters.
For one counter-example, see EBCDIC, which is not in much use now, but was used on some older computers, and C supports it. Its alphabetic characters are arranged in clumps of consecutive letters, but not with all 26 letters of each case all together. So the program would give incorrect results running on such a computer.
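If the program had to work regardless of the encoding, the subtraction pt[i] - 'A' could be replaced by a lookup in an alphabet string. This is only a sketch of that idea; letter_index is a made-up helper and not part of the original program.

```c
#include <string.h>

/* Returns 0 for 'a' or 'A', 1 for 'b' or 'B', ..., 25 for 'z' or 'Z',
   and -1 for any other character. Does not depend on the encoding. */
int letter_index(char c)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *p;

    if (c == '\0')
        return -1;              /* strchr would match the terminator */
    if ((p = strchr(lower, c)) != NULL)
        return (int)(p - lower);
    if ((p = strchr(upper, c)) != NULL)
        return (int)(p - upper);
    return -1;
}
```

The enciphered letter can then be read back from the same kind of alphabet string at position (letter_index(pt[i]) + lk) % 26, instead of adding the result to 'A' or 'a'.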
Sequentiality is only one aspect of concern.
Proper use of isalpha(ch) is another, not quite implemented properly in OP's code.
isalpha(ch) expects ch to be in the range of unsigned char or to equal EOF. k[h] is a char, so its value could be negative. Ensure a non-negative value with:
// if isalpha(k[h])
if (isalpha((unsigned char) k[h]))

Condition to limit between 2 characters

I'm writing code that needs to limit the user to entering only characters from A to H. Anything greater than H should not be accepted.
I saw that with numbers I can do something like:
if (input == 0 - 9) return 1;
But how do I do that for A to H (char)?
The C Standard does not specify that character encoding should be ASCII, though it is likely. Nonetheless, it is possible for the encoding to be other (EBCDIC, for example), and the characters of the Latin alphabet may not be encoded in a contiguous sequence. This would cause problems for solutions that compare char values directly.
One solution is to create a string that holds valid input characters, and to use strchr() to search for the input in this string in order to validate:
#include <stdio.h>
#include <string.h>
int main(void)
{
char *valid_input = "ABCDEFGH";
char input;
printf("Enter a letter from 'A' - 'H': ");
if (scanf("%c", &input) == 1) {
if (input == '\0' || strchr(valid_input, input) == NULL) {
printf("Input '%c' is invalid\n", input);
} else {
puts("Valid input");
}
}
return 0;
}
This approach is portable, though solutions which compare ASCII values are likely to work in practice. Note that in the original code that I posted, an edge case was missed, as pointed out by @chux. It is possible to enter a '\0' character from the keyboard (or to obtain one by other methods), and since a string contains the '\0' character, this would be accepted as valid input. I have updated the validation code to check for this condition.
Yet there is another advantage to using the above solution. Consider the following comparison-style code:
if (input >= 'A' && input <= 'H') {
puts("Valid input");
} else {
puts("Invalid input");
}
Now, suppose that conditions for valid input change, and the program must be modified. It is simpler to modify a validation string, for example to change to:
char *valid_input = "ABCDEFGHIJ";
With the comparison code, which may occur in more than one location, each comparison must be found in the code. But with the validation string, only one line of code needs to be found and modified.
Further, the validation string is simpler for more complex requirements. For example, if valid input is a character in the range 'A' - 'I' or a character in the range '0' - '9', the validation string can simply be changed to:
char *valid_input = "ABCDEFGHI0123456789";
The comparison method begins to look unwieldy:
if ((input >= 'A' && input <= 'I') || (input >= '0' && input <= '9')) {
puts("Valid input");
} else {
puts("Invalid input");
}
Do note that one of the few requirements placed on character encoding by the C Standard is that the characters '0', ..., '9' be encoded in a contiguous sequence. This does allow for portable direct comparison of decimal digit characters, and also for reliably finding the integer value associated with a decimal digit character through subtraction:
char ch = '3';
int num;
if (ch >= '0' && ch <= '9') {
printf("'%c' is a decimal digit\n", ch);
num = ch - '0';
printf("'%c' represents integer value %d\n", ch, num);
}
The if statement you present here is equal to:
if (input == -9) return 1;
which will return 1 in the case of an input equal to -9, so there is no range checking at all.
To allow numbers from 0 to 9 you have to compare like:
if (input >= 0 && input <= 9) /* range valid */
or with the characters that you want (A to H) [1]:
if (input >= 'A' && input <= 'H') /* range valid */
If you want to return 1 if the input is not in a valid range just put the logical not operator (!) in front of the condition:
if (!(input >= 'A' && input <= 'H')) return 1; /* range invalid */
[1] Take care with conditions that use character ranges, because a range test needs an encoding that assigns the letters increasing codes without any gaps inside the range (ASCII, for example: A = 65, B = 66, C = 67, ..., Z = 90).
There are encodings where this rule breaks. As the other answer by @DavidBowling states, EBCDIC is one example (A = 193, B = 194, ..., I = 201, J = 209, ..., Z = 233): it has some gaps in the range from A to Z. Nevertheless, the condition (input >= 'A' && input <= 'H') works with both encodings.
I have never come across an implementation where it fails, and one is very unlikely. Most implementations use ASCII, for which the condition works.
Nevertheless, his answer provides a solution that works in every case.
It's as simple as:
if(input >='A' && input<='H') return 1;
C doesn't let you specify ranges like 0 - 9.
In fact that's an arithmetic expression "zero minus nine" and evaluates to minus nine (of course).
Nerd Corner:
As others point out this is not guaranteed by the C standard because it doesn't specify a character encoding though in practice all modern platforms encode these characters the same as ASCII. So it's very unlikely you will come unstuck and if you're working in an environment where it won't work you'd have been told!
A truly portable implementation could be:
#include <string.h>//contains strchr()
const char* alpha="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const char* pos=strchr(alpha,input);
if(pos!=NULL&&(pos-alpha)<8) return 1;
This tries to find the character in an alphabet string then determines if the character (if any) pointed to is before 'I'.
This is total over engineering and not the answer you're looking for.

Common character to indicate error has occurred

Typically, in a function that returns an integer, -1 is used to indicate to the user/programmer that an error has occurred in the program. For example, a program may return -1 instead of the index where a word starts in a string if the word cannot be found in the string.
So my question is, what should you return from a function that returns a character instead?
In other words, is there a common character that most programmers use like -1 to detect when an error has occurred?
It may depend on the scenario, of course, so let's say you created a program that converts a digit into its corresponding character:
char digitToCh (int digit)
{
if ( digit < 0 ) // If digit is negative, return ?
return '?';
int n = intLen (digit);
if ( n > 1 )
digit /= power (10, n - 1);
return (char) (((int) '0') + digit);
}
If the digit is negative, what error character code would seem appropriate? This example is quite arbitrary, but I'm just trying to make the question simpler to express.
No, there is no such reserved character. In fact, it is the absence of such common character that caused character-oriented I/O functions to change their return type to int.
The trick with returning an int is important when you must retain the full range of character return values. In your case, however, only the ten digit characters are valid, so returning character zero '\0' is a valid option:
char digitToCh (int digit) {
if (digit < 0) return '\0';
...
}
You can check the error like this:
char nextDigit = digitToCh(someInt);
if (!nextDigit) {
// The next digit is invalid
}
Make the function return an int instead of a char and use -1 to indicate an error.
fgetc() works this way, for example: it returns an int so that EOF can be distinguished from every valid character.
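A minimal sketch of that approach follows. It is simplified compared to the function in the question: the hypothetical digit_to_char below only accepts a single digit from 0 to 9 and does not extract the leading digit of a larger number.

```c
/* Hypothetical variant: returns the character for a single digit 0..9,
   or -1 if the argument is out of range. */
int digit_to_char(int digit)
{
    if (digit < 0 || digit > 9)
        return -1;
    return '0' + digit;   /* '0'..'9' are consecutive in C */
}
```

The caller checks the result before converting it, for example int r = digit_to_char(n); if (r == -1) { /* handle error */ } else { char c = (char) r; }.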