How to replace multi-byte characters with other ones in C?

I've got a UTF-8 text file containing several symbols that I'd like to replace with other ones (only those between |( and |) ), but the problem is that some of these symbols are not single chars but multi-byte characters. (By this I mean they can't be written as a character constant like '∞' but only as a string like "∞", so char * ?)
Here is my text file:
Text : |(abc∞∪v=|)
For example:
∞ should be replaced with ¤c
∪ with ¸!
= with "
Since some symbols (∞ and ∪) are multi-byte, I decided to use fscanf to read the text word by word. The problem with this method is that I have to put a space between each character. My file would have to look like this:
Text : |( a b c ∞ ∪ v = |)
fgetc can't be used because characters like ∞ are not read as one single char. If I use it, I can't strcmp a char with each symbol (char *); I tried to convert my char to char * but strcmp still returns a non-zero value.
Here is my code in C to help you understand my problem:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main(void){
char *carac[]={"∞","=","∪"}; //array with our signs
FILE *flot,*flot3;
flot=fopen("fichierdeTest2.txt","r"); // input text file
flot3=fopen("resultat.txt","w"); //output file
int i=0,j=0;
char a[1024]; //array that will contain each read word.
while(!feof(flot))
{
fscanf(flot,"%s",&a[i]);
if (strstr(&a[i], "|(") != NULL){ // if the word read contains |( then j=1
j=1;
fprintf(flot3,"|(");
}
if (strcmp(&a[i], "|)") == 0)
j=0;
if(j==1) { //it means we are between |( and |) so the conversion can begin
if (strcmp(carac[0], &a[i]) == 0) { fprintf(flot3, "¤c"); }
else if (strcmp(carac[1], &a[i]) == 0) { fprintf(flot3,"\"" ); }
else if (strcmp(carac[2], &a[i]) == 0) { fprintf(flot3, " ¸!"); }
else fprintf(flot3,"%s",&a[i]); // when it's a letter, number or sign that doesn't need to be converted
}
else { // when we are not between |( and |) just copy the word to the output file with a space after it
fprintf(flot3, "%s", &a[i]);
fprintf(flot3, " ");
}
i++;
}
}
Thanks a lot for any help!
EDIT: Every symbol is replaced correctly if I put a space between each of them, but without the spaces it doesn't work. That's what I'm trying to solve.

First of all, get the terminology right. Proper terminology is a bit confusing, but at least other people will understand what you are talking about.
In C, char is the same as byte. However, a character is something abstract like ∞ or ¤ or c. One character may contain a few bytes (that is a few chars). Such characters are called multi-byte ones.
Converting a character to a sequence of bytes (encoding) is not trivial. Different systems do it differently; some use UTF-8, while others may use UTF-16 big-endian, UTF-16 little-endian, an 8-bit codepage or any other encoding.
When your C program has something inside quotes, like "∞" - it's a C-string, that is, several bytes terminated by a zero byte. When your code uses strcmp to compare strings, it compares each byte of both strings, to make sure they are equal. So, if your source code and your input file use different encodings, the strings (byte sequences) won't match, even though you will see the same character when examining them!
So, to rule out any encoding mismatches, you might want to use a sequence of bytes instead of a character in your source code. For example, if you know that your input file uses the UTF-8 encoding:
char *carac[]={
"\xe2\x88\x9e", // ∞
"=",
"\xe2\x88\xaa"}; // ∪
Alternatively, make sure the encodings (of your source code and your program's input file) are the same.
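To make the byte/character distinction concrete, here is a minimal sketch that counts characters rather than bytes. It assumes the text is valid UTF-8; the helper name utf8_char_count is hypothetical and not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Counts characters (code points) in a UTF-8 string by skipping
   continuation bytes, which always have the bit pattern 10xxxxxx.
   Assumes s is valid UTF-8; no validation is performed. */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not a continuation byte */
            count++;
    }
    return count;
}
```

With this, the string "\xe2\x88\x9e" (∞) occupies three chars but counts as one character, which is why it cannot be stored in a single char.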
Another, less subtle, problem: when comparing strings, you actually have a big string and a small string, and you want to check whether the big string starts with the small string. Here strcmp does the wrong thing! You must use strncmp here instead:
if (strncmp(carac[0], &a[i], strlen(carac[0])) == 0)
{
fprintf(flot3, "\xC2\xA4""c"); // ¤c
}
Another problem (actually, a major bug): the fscanf function reads a word (text delimited by spaces) from the input file. If you only examine the first byte in this word, the other bytes will not be processed. To fix, make a loop over all bytes:
fscanf(flot,"%s",a);
for (i = 0; a[i] != '\0'; )
{
if (strncmp(&a[i], "|(", 2) == 0) // start pattern (strncmp returns 0 on a match)
{
now_replacing = 1;
i += 2;
continue;
}
if (now_replacing && strncmp(&a[i], whatever, strlen(whatever)) == 0)
{
fprintf(...); // write the replacement
i += strlen(whatever); // skip all bytes of the matched character
}
else
{
fputc(a[i], output); // no match: copy the byte unchanged
i += 1; // processed just one char
}
}
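The loop above can be sketched as a complete function. This is only an illustration under some assumptions: the input is UTF-8, it works on an in-memory string instead of a file, the |( and |) markers are copied to the output, and the names repl, table and replace_between_markers are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replacement table as UTF-8 byte sequences (assumes UTF-8 input). */
struct repl { const char *from; const char *to; };

static const struct repl table[] = {
    { "\xe2\x88\x9e", "\xc2\xa4" "c" },  /* ∞ -> ¤c */
    { "\xe2\x88\xaa", "\xc2\xb8!"    },  /* ∪ -> ¸! */
    { "=",            "\""           },  /* = -> "  */
};

/* Copies in to out, applying the table only between "|(" and "|)".
   Returns 0 on success, -1 if out is too small. */
int replace_between_markers(const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    int replacing = 0;

    if (outsz == 0)
        return -1;
    while (*in != '\0') {
        const char *src = in;            /* bytes to emit in this step */
        size_t srclen = 1, consumed = 1; /* default: copy one byte */

        if (strncmp(in, "|(", 2) == 0) {
            replacing = 1;
            srclen = consumed = 2;
        } else if (strncmp(in, "|)", 2) == 0) {
            replacing = 0;
            srclen = consumed = 2;
        } else if (replacing) {
            for (size_t k = 0; k < sizeof table / sizeof table[0]; k++) {
                size_t n = strlen(table[k].from);
                if (strncmp(in, table[k].from, n) == 0) {
                    src = table[k].to;
                    srclen = strlen(src);
                    consumed = n;
                    break;
                }
            }
        }
        if (o + srclen >= outsz)         /* keep room for the terminator */
            return -1;
        memcpy(out + o, src, srclen);
        o += srclen;
        in += consumed;
    }
    out[o] = '\0';
    return 0;
}
```

Because the scan advances by the byte length of whatever matched, no spaces are needed between the characters in the input.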

You're on the right track, but you need to look at characters differently than strings.
strcmp(carac[0], &a[i])
(Pretending i = 2) As you know this compares the string "∞" with &a[2]. But you forget that &a[2] is the address of the third character of the string, and strcmp works by scanning the entire string until it hits a null terminator. So "∞" actually ends up getting compared with "abc∞∪v=|)" because a is only null terminated at the very end.
What you should do is not use strings, but read two chars (8 bits each) as one short (16 bits), which you can then compare with UTF-16 characters. Note that this only matches if the input really is encoded as UTF-16; in a UTF-8 file, ∞ is stored as the three bytes E2 88 9E.
if( 8734 == *((short *)&a[i])) { /* character is infinity */ }
The reason for 8734 is that it is the UTF-16 value of infinity (U+221E).
VERY IMPORTANT NOTE:
Whether your machine is big-endian or little-endian matters in this case. If 8734 (0x221E) does not work, give 7714 (0x1E22) a try.
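If the file is UTF-8, as the question states, the same value 8734 can also be obtained by decoding the bytes explicitly, which does not depend on the machine's byte order. The following is a rough sketch; utf8_decode is a hypothetical helper, and it does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Decodes one UTF-8 sequence starting at s and stores its byte length
   in *len. Returns the code point, or -1 if the sequence is malformed
   or truncated. Overlong forms are not rejected (sketch only). */
long utf8_decode(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    const unsigned char *p = (const unsigned char *)s;
    long cp;
    size_t n;

    if (p[0] < 0x80)                { cp = p[0];        n = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; n = 4; }
    else return -1;                 /* stray continuation or invalid byte */

    for (size_t k = 1; k < n; k++) {
        if ((p[k] & 0xC0) != 0x80)  /* also stops at the terminating NUL */
            return -1;
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    *len = n;
    return cp;
}
```

For the bytes E2 88 9E this yields 0x221E (8734) with a length of 3, so the caller knows how many chars to skip.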
Edit: Something else I overlooked is that you're scanning an entire word at once. "%s: String of characters. This will read subsequent characters until a whitespace is found (whitespace characters are considered to be blank, newline and tab)."
//feof = false.
fscanf(flot,"%s",&a[i]);
//feof = true.
That means you never actually iterate. You need to go back and rethink your scanning procedure.
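One common way to structure the read loop is to test the return value of fscanf itself instead of calling feof beforehand. With while(!feof(...)), the body typically runs one extra time after the last successful read, so the last word is processed twice. A small sketch, where count_words is a made-up helper:

```c
#include <assert.h>
#include <stdio.h>

/* Reads whitespace-delimited words until fscanf stops converting
   (end of file or read error) and returns how many were read. */
size_t count_words(FILE *fp)
{
    assert(fp != NULL);
    char word[1024];
    size_t n = 0;

    /* %1023s bounds the read so word cannot overflow */
    while (fscanf(fp, "%1023s", word) == 1)
        n++;
    return n;
}
```

The body of the while loop is where each word would be scanned byte by byte for the replacement patterns.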

Related

Scanf for a word in C language

Hey I got this code where I need to scanf the input of the user (ANO/NE) and store this input into variable "odpoved". How to do that? What I have now looks like it is scanning just the first letter of the input.
char odpoved;
printf("Je vše v pořádku? (ANO/NE)");
scanf("%s", &odpoved);
if(odpoved == "ANO" || odpoved == "ano"){
printf("Super, díky mockrát");
}
else if(odpoved == "NE" || odpoved == "ne"){
printf("To mě mrzí, ale ani já nejsem dokonalý");
}
else{
printf("Promiň, ale zmátl jsi mě. Takovou odpověď neznám!!!");
return 0;
}
The first thing you should know is that char is a data type consisting of 1 byte and is used to store a single character such as 'a', 'b', '1', '2' etc. With 8-bit bytes a char can hold 256 different values; the first 128 are defined by the ASCII table ( https://www.ascii-code.com/).
As odpoved is a string, you need to declare it as an array of char, for example char odpoved[16], so that scanf has room to store the input. A plain char holds only one character, and an uninitialized char * points to no storage at all. The last char in a string is always a terminator byte '\0' used to indicate the end of the string. The null terminator is automatically inserted when double quotes are used, e.g. "sometext", or when %s is used to get input.
The other mistake is comparing strings with the == or != operators. This does not work because for strings these operators compare addresses, not contents. To compare the contents you need the strcmp function, declared in string.h, which returns 0 when the two strings are equal. There are many other useful string functions such as strlen, which tells you the length of a string.
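A minimal sketch of the comparison part, assuming the answer has already been read into a char array. The function name classify_answer and its return codes are invented for this example.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 for "ANO"/"ano", 0 for "NE"/"ne", -1 for anything else.
   The caller would read the input with something like:
       char odpoved[16];
       if (scanf("%15s", odpoved) == 1) { ... classify_answer(odpoved) ... }
   where %15s leaves room for the terminating '\0'. */
int classify_answer(const char *odpoved)
{
    assert(odpoved != NULL);
    if (strcmp(odpoved, "ANO") == 0 || strcmp(odpoved, "ano") == 0)
        return 1;
    if (strcmp(odpoved, "NE") == 0 || strcmp(odpoved, "ne") == 0)
        return 0;
    return -1;
}
```

Note that mixed-case input such as "Ano" falls into the "unknown" branch here, just as in the original code.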

Stdin + Dictionary Text Replacement Tool -- Debugging

I'm working on a project in which I have two main files. Essentially, the program reads in a text file defining a dictionary with key-value mappings. Each key has a unique value and the file is formatted like this where each key-value pair is on its own line:
ipsum i%##!
fubar fubar
IpSum XXXXX24
Ipsum YYYYY211
Then the program reads in input from stdin, and if any of the "words" match the keys in the dictionary file, they get replaced with the value. There is a slight thing about upper and lower cases -- this is the order of "match priority"
The exact word is in the replacement set
The word with all but the first character converted to lower case is in the replacement set
The word converted completely to lower case is in the replacement set
Meaning if the exact word is in the dictionary, it gets replaced, but if not the next possibility (2) is checked and so on...
My program passes the basic cases we were provided but then the terminal shows
that the output vs reference binary files differ.
I went into both files (not c files, but binary files), and one was super long with tons of numbers and the other just had a line of random characters. So that didn't really help. I also reviewed my code and made some small tests but it seems okay? A friend recommended I make sure I'm accounting for the null operator in processInput() and I already was (or at least I think so, correct me if I'm wrong). I also converted getchar() to an int to properly check for EOF, and allocated extra space for the char array. I also tried vimdiff and got more confused. I would love some help debugging this, please! I've been at it all day and I'm very confused.
There are multiple issues in the processInput() function:
the loop should not stop when the byte read is 0, you should process the full input with:
while ((ch = getchar()) != EOF)
the test for EOF should actually be done differently so the last word of the file gets a chance to be handled if it occurs exactly at the end of the file.
the cast in isalnum((char)ch) is incorrect: you should pass ch directly to isalnum. Casting as char is actually counterproductive because it will turn byte values beyond CHAR_MAX to negative values for which isalnum() has undefined behavior.
the test if(ind >= cap) is too loose: if word contains cap characters, setting the null terminator at word[ind] will write beyond the end of the array. Change the test to if (cap - ind < 2) to allow for a byte and a null terminator at all times.
you should check that there is at least one character in the word to avoid calling checkData() with an empty string.
char key[ind + 1]; is useless: you can just pass word to checkData().
checkData(key, ind) is incorrect: you should pass the size of the buffer for the case conversions, which is at least ind + 1 to allow for the null terminator.
the cast in putchar((char)ch); is useless and confusing.
There are some small issues in the rest of the code, but none that should cause a problem.
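The point about the cast can be illustrated separately. getchar() already returns the byte as an unsigned char value converted to int, so it can be passed to isalnum directly. When the byte comes from a char array instead, it should be converted to unsigned char first. A small sketch with a hypothetical helper:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Counts the alphanumeric bytes in s. */
size_t count_alnum(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* char may be signed; a negative value other than EOF is
           undefined behavior for isalnum, hence the conversion */
        if (isalnum((unsigned char)*s))
            n++;
    }
    return n;
}
```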
Start by testing your tokeniser with:
$ ./a.out <badhash2.c >zooi
$ diff badhash2.c zooi
$
Does it work for binary files, too?:
$ ./a.out <./a.out > zooibin
$ diff ./a.out zooibin
$
Yes, it does!
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
void processInput(void);
int main(int argc, char **argv) {
processInput();
return 0;
}
void processInput() {
int ch;
char *word;
int len = 0;
int cap = 60;
word = malloc(cap);
while(1) {
ch = getchar(); // (1)
if( ch != EOF && isalnum(ch)) { // (2)
if(len+1 >= cap) { // (3)
cap += cap/2;
word = realloc(word, cap);
}
word[len++] = ch;
} else {
if (len) { // (4)
#if 0
char key[len + 1];
memcpy(key, word, len); key[len] = 0;
checkData(key, len);
#else
word[len] = 0;
fputs(word, stdout);
#endif
len = 0;
}
if (ch == EOF) break; // (5)
putchar(ch);
}
}
free(word);
}
I only repaired your tokeniser, leaving out the hash table and the search & replace stuff. It is now supposed to generate a verbatim copy of the input. (which is silly, but great for testing)
(1) If you want to allow binary input, you cannot use while((ch = getchar()) ...) : a NUL in the input would cause the loop to end. You must postpone testing for EOF, because there could still be a final word in your buffer ...&& ch != EOF)
(2) treat EOF just like a space here: it could be the end of a word
(3) you must reserve space for the NUL ('\0'), too.
(4) if (len==0) there would be no word, so no need to look it up.
(5) we treated EOF just like a space, but we don't want to write it to the output. Time to break out of the loop.

Using sscanf to validate a string input

I have just started learning C after coding for some while in Java and Python.
I was wondering how I could "validate" a string input (if it stands in a certain criteria) and I stumbled upon the sscanf() function.
I had the impression that it acts kind of similarly to regular expressions, however I didn't quite manage to tell how I can create rather complex queries with it.
For example, lets say I have the following string:
char str[]={"Santa-monica 123"}
I want to use sscanf() to check if the string has only letters, numbers and dashes in it.
Could someone please elaborate?
The fact that sscanf allows something that looks a bit like a character class by no means implies that it is anything at all like a regular expression library. In fact, Posix doesn't even require the scanf functions to accept character ranges inside character classes, although I suspect that it will work fine on any implementation you will run into.
But the scanning problem you have does not require regular expressions, either. All you need is a repeated character class match, and sscanf can certainly do that:
#include <stdbool.h>
bool check_string(const char* s) {
int n = 0;
sscanf(s, "%*[-a-zA-Z0-9]%n", &n);
return s[n] == 0;
}
The idea behind that scanf format is that the first conversion will match and discard the longest initial sequence consisting of valid characters. (It might fail if the first character is invalid. Thanks to @chux for pointing that out.) If it succeeds, it will then set n to the current scan point, which is the offset of the next character. If the next character is a NUL, then all the characters were good. (This version returns OK for the empty string, since it contains no illegal characters. If you want the empty string to fail, change the return condition to return n && s[n] == 0;)
You could also do this with the standard regex library (or any more sophisticated library, if you prefer, but the Posix library is usually available without additional work). This requires a little bit more code in order to compile the regular expression. For efficiency, the following attempts to compile the regex only once, but for simplicity I left out the synchronization to avoid data races during initialization, so don't use this in a multithreaded application.
#include <regex.h>
#include <stdbool.h>
bool check_string(const char* s) {
static regex_t* re_ptr = NULL;
static regex_t re;
if (!re_ptr) regcomp((re_ptr = &re), "^[[:alnum:]-]*$", REG_EXTENDED);
return regexec(re_ptr, s, 0, NULL, 0) == 0;
}
I want to use sscanf() to check if the string has only letters, numbers and dashes in it.
A variation of @rici's good answer.
Create a scanset for letters, numbers and dashes.
//  v      The * indicates to scan, but not save the result.
//    v    Dash (or minus sign), best to list first.
  "%*[-0-9A-Za-z]"
//        ^^^^^^  Letters a-z, both cases
//     ^^^        Digits
Use "%n" to detect how far the scan went.
Now we can determine whether
scanning stopped due to a null character (the whole string is valid), or
scanning stopped due to an invalid character.
int n = 0;
sscanf(str, "%*[-0-9A-Za-z]%n", &n);
bool success = (str[n] == '\0');
sscanf does not have this functionality; the argument you are referring to is a format specifier and is not meant for validation. See here: https://www.tutorialspoint.com/c_standard_library/c_function_sscanf.htm
As also mentioned, sscanf is for a different job. You can loop over the string using isalpha and isdigit to check whether each char is a letter or a digit.
char str[] = "Santa-monica 123";
for (int i = 0; str[i] != '\0'; i++)
{
unsigned char c = (unsigned char)str[i]; // the ctype functions expect unsigned char values
if (!isalpha(c) && !isdigit(c) && c != '-')
printf("wrong character %c\n", c); // this will be printed for spaces too
}
I want to ... check if the string has only letters, numbers and dashes in it.
In C that's traditionally done with isalnum(3) and friends.
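#include <ctype.h>
#include <stdbool.h>
bool valid( const char str[] ) {
for( const char *p = str; *p != '\0'; p++ ) { // stop at the terminator instead of calling strlen on every pass
if( ! (isalnum((unsigned char)*p) || *p == '-') )
return false;
}
return true;
}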
You can also use your friendly neighborhood regex(3), but you'll find that requires a surprising amount of code for a simple scan.
After retrieving the value with sscanf(), you may use a regular expression to validate it.
Please see Regular Expression in C

in C I want to read in line by line from a file a certain way with the end length of the file changing

OK, I need to read information in from a file. I have to take certain parts of each line apart and do different things with each part. I know the maximum and minimum length of a line, but I am doing something wrong when I read in the file and then split it up, as I am getting really funny values when I try to compare strings. The maximum length of any line is 80 characters.
The format for each line will be as follows: (I will write them in column form as they would appear in a character array)
0-7 _ 9 10-16 _ 18 19-28_ _31-79
spots 0-7 will contain a string(any being under 8 will have blank spaces)
spots 8,17,29,30 are all blank spaces (Marked by underscores)
spots 10-16 will contain a string (again any being under the max length will have blank spaces at the end)
spot 18 will contain a blank space or a character
spot 19-28 will contain another string (Same as other cases)
spot 31-79 can be filled with a string or may not exist at all depends on the users input.
Right now I am using a buffer of size 82 and then doing strncpy to take certain parts from the buffer to break it up. It appears to be working fine but when I do strcmp I am getting funky answers and the strlen is not giving the char arrays I declared the right length.
(I have declared them as having a max length of 8,9,etc. but strlen has been returning weird numbers like 67)
So if I could just read it in broken up it should completely resolve the issue.
I was hoping there would be a way to do this but am currently unsure.
Any help would be greatly appreciated. I have attached the part of the code where I think the error is.
(I know it isn't good to have the size hardcoded in there but I want to get it working first and then I'll get rid of the magic numbers)
while (fgets(buffer, sizeof buffer, fp) != NULL) /* read a line from a file */
{
if (buffer[0] == '.') //If it is a comment line just echo it do not increase counter
{
printf("%s", buffer);
}
else if (buffer[0] == ' ' && buffer[10] == ' ') // If it is a blank line print blank line do not increase counter
{
printf("\n");
}
else //it is an actual instruction perform the real operations
{
//copy label down
strncpy(label, &buffer[0], 8);
//copy Pnemonic into command string
strncpy(command, &buffer[9], 8);
//copy symbol down
symbol = buffer[syLoc];
//copy operand down
strncpy(operand, &buffer[19], 9);
Funky characters and overlong string lengths are a sign that the strings aren't null-terminated, as C (or at least most of C's library functions) expects them.
strncpy will yield a null-terminated string only if the source string is shorter than the count you pass, so that the source's terminator falls within the bytes copied. In your case, you copy substrings out of the middle of a string, so the copies won't have a null terminator.
You could add the null-terminator by hand:
char label[9];
strncpy(label, &buffer[0], 8);
label[8] = '\0';
But given that you have spaces after the substrings you want anyway, you could also use strtok's approach to make your substrings pointers into the line you have read and overwrite the spaces with the null character:
char *label;
char *command;
label = &buffer[0];
buffer[8] = '\0';
command = &buffer[9];
buffer[9 + 8] = '\0';
This approach has the advantage that you don't need extra memory for the substrings. It has the drawback that your substrings become invalid when you read the next line. If your substrings don't need to "live" beyond the current line, that approach might be good for you.
Warning: the strncpy function does not add a null terminator (\0) after the copied chars when the source is at least as long as the count.
To protect the target char array you have to manually add a \0 after each strncpy call, like this:
//copy label down
strncpy(label, &buffer[0], 8);
label[8]='\0';
//copy mnemonic into command string
strncpy(command, &buffer[9], 8);
command[8]='\0';
//copy symbol down
symbol = buffer[syLoc]; //Ok just a single char
//copy operand down
strncpy(operand, &buffer[19], 9);
operand[9]='\0';
If no '\0' is added, functions such as strlen and strcmp keep reading past the end of the char array until they happen to find a '\0' somewhere in the following memory (a buffer over-read).
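The copy-and-terminate pattern can be wrapped in one helper. The sketch below is an assumption-laden illustration: copy_field is a made-up name, the column offsets in the usage note come from the question's code, and trimming the trailing blanks is an extra step added here so that strcmp against unpadded strings works.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies up to width bytes of line, starting at column start, into dst.
   Always null-terminates and strips trailing blanks and the newline that
   fgets may leave. dst must have room for width + 1 bytes.
   Returns the length of the resulting string. */
size_t copy_field(char *dst, const char *line, size_t start, size_t width)
{
    assert(dst != NULL && line != NULL);
    size_t len = strlen(line);
    size_t n = 0;

    if (start < len) {               /* short lines yield an empty field */
        n = len - start;
        if (n > width)
            n = width;
        memcpy(dst, line + start, n);
    }
    while (n > 0 && (dst[n - 1] == ' ' || dst[n - 1] == '\n'))
        n--;
    dst[n] = '\0';
    return n;
}
```

For example, copy_field(label, buffer, 0, 8) with char label[9] gives a terminated label, and the optional part of the line simply comes back as an empty string when the line is short.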

RLE algorithm decoding - escape character

I have to implement an RLE algorithm (with an escape character) that is able to encode and decode any file.
I did the first part (encoding), and even before starting the decoding part I can already see some problems. Example:
If I have a file containing: AAAAABBBBBBCCCCCDDD
my encode function gives output like this: QA5QB6QC5DDD
But I have to work with real files, which contain not just letters but also digits and symbols.
So, after encoding, what should I do if the encoded file contains something like QA55?
Should the output be AAAAA5 or fifty-five A's?
Another example: if I read QA5,
what is the final output? AAAAA or just QA5?
I mean that I don't know how to recognize whether the block I'm reading is encoded or not.
This is my encode function:
void encode (FILE *source, FILE *destination) {
char currentChar;
char seqChar = 'Z'; //could be any character
int count = 0;
while(1) {
int endFile = (fread(&currentChar, sizeof(char),1, source) == 0);
if(endFile || seqChar!=currentChar) {
if(count>3) {
char escape = 'Q';
int k = count;
char str[100];
int digits = sprintf(str,"%d",count);
fwrite(&escape, sizeof(escape), 1, destination);
fwrite(&seqChar, sizeof(escape),1, destination);
fwrite(&str, sizeof(char), digits, destination);
}
else {
for(int i=0;i<count;i++)
fwrite(&seqChar,sizeof(char),1,destination);
}
seqChar = currentChar;
count =1;
}
else count++;
if(endFile)
break;
}
fclose(source);
fclose(destination);
}
I hope you know what I mean.
I think I have to invent some convention to solve this problem, but I can't figure out which one.
How do you place a literal backslash in a C string? How do you write a percent sign with printf? You have to find an escape sequence that represents the escape character itself.
Your escape character is Q (strange choice, by the way). Then Q + character + count could mean: that character, count times. And QQ could mean the escape character itself.
You'll see that you cannot compress sequences of Q's that way, because QQ already means "Q". There are two possibilities to fix this: get rid of the QQ special meaning and always encode "Q" as a run of Q's, even a run of one, i.e. QQ1. Or place the count in front of the character to encode and have Q not be a valid count.
(By the way, that's not so much a C question, it's more about the design of your compression algorithm. You might want to re-tag it and remove the code.)
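As one possible convention, here is a sketch of the first option: every Q in the input is always written as an escaped run, even a run of one. It departs from the question's format in one respect, which is an assumption of this example: the count is stored as a single raw byte (1 to 255) instead of decimal digits, so that QA55 can never be confused with fifty-five A's. The functions work on memory buffers, and all names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RLE_ESC   'Q'
#define RLE_ERROR ((size_t)-1)

/* Encoded form of a run: ESC, byte, count (one raw byte, 1..255).
   Runs longer than 3 and every run of ESC itself are escaped.
   Returns the encoded length, or RLE_ERROR if out is too small. */
size_t rle_encode(const unsigned char *in, size_t n,
                  unsigned char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        unsigned char c = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == c && run < 255)
            run++;
        if (run > 3 || c == RLE_ESC) {
            if (cap - o < 3) return RLE_ERROR;
            out[o++] = RLE_ESC;
            out[o++] = c;
            out[o++] = (unsigned char)run;
        } else {
            if (cap - o < run) return RLE_ERROR;
            memset(out + o, c, run);   /* short run: copy verbatim */
            o += run;
        }
        i += run;
    }
    return o;
}

/* Inverse of rle_encode. Returns the decoded length, or RLE_ERROR on
   truncated input or if out is too small. */
size_t rle_decode(const unsigned char *in, size_t n,
                  unsigned char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] == RLE_ESC) {        /* ESC is never a literal here */
            if (n - i < 3) return RLE_ERROR;
            size_t run = in[i + 2];
            if (cap - o < run) return RLE_ERROR;
            memset(out + o, in[i + 1], run);
            o += run;
            i += 3;
        } else {
            if (o == cap) return RLE_ERROR;
            out[o++] = in[i++];
        }
    }
    return o;
}
```

Because a literal Q never appears unescaped in the encoded stream, the decoder can decide from the current byte alone whether it is looking at a run or at plain data.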
