I've got a UTF-8 text file containing several signs that I'd like to replace with other ones (only those between |( and |) ), but the problem is that some of these signs are not single chars but multi-byte signs. (By this I mean they can't be written as a character constant like '∞', only as a string like "∞", so char * ?)
Here is my textfile :
Text : |(abc∞∪v=|)
For example :
∞ should be changed by ¤c
∪ by ¸!
= changed by "
Since some signs (∞ and ∪) are multi-byte, I decided to use fscanf to read the text word by word. The problem with this method is that I have to put a space between each character, so my file would have to look like this :
Text : |( a b c ∞ ∪ v = |)
fgetc can't be used because characters like ∞ can't be read as one single char. If I use it, I won't be able to strcmp a char with each sign (char *). I tried to convert my char to char *, but strcmp still returns a non-zero value.
Here is my code in C to help you understanding my problem :
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main(void){
char *carac[]={"∞","=","∪"}; //array with our signs
FILE *flot,*flot3;
flot=fopen("fichierdeTest2.txt","r"); // input text file
flot3=fopen("resultat.txt","w"); //output file
int i=0,j=0;
char a[1024]; //array that will contain each read word.
while(!feof(flot))
{
fscanf(flot,"%s",&a[i]);
if (strstr(&a[i], "|(") != NULL){ // if the word read contains |( then j=1
j=1;
fprintf(flot3,"|(");
}
if (strcmp(&a[i], "|)") == 0)
j=0;
if(j==1) { //it means we are between |( and |) so the conversion can begin
if (strcmp(carac[0], &a[i]) == 0) { fprintf(flot3, "¤c"); }
else if (strcmp(carac[1], &a[i]) == 0) { fprintf(flot3,"\"" ); }
else if (strcmp(carac[2], &a[i]) == 0) { fprintf(flot3, " ¸!"); }
else fprintf(flot3,"%s",&a[i]); // when it's a letter, number or sign that doesn't need to be converted
}
else { // when we are not between |( and |) just copy the word to the output file with a space after it
fprintf(flot3, "%s", &a[i]);
fprintf(flot3, " ");
}
i++;
}
}
Thanks a lot for the future help !
EDIT : Every sign is changed correctly if I put a space between each of them, but without the spaces it doesn't work. That's what I'm trying to solve.
First of all, get the terminology right. Proper terminology is a bit confusing, but at least other people will understand what you are talking about.
In C, char is the same as byte. However, a character is something abstract like ∞ or ¤ or c. One character may contain a few bytes (that is a few chars). Such characters are called multi-byte ones.
Converting a character to a sequence of bytes (encoding) is not trivial. Different systems do it differently; some use UTF-8, while others may use UTF-16 big-endian, UTF-16 little-endian, an 8-bit codepage or any other encoding.
When your C program has something inside quotes, like "∞" - it's a C-string, that is, several bytes terminated by a zero byte. When your code uses strcmp to compare strings, it compares each byte of both strings, to make sure they are equal. So, if your source code and your input file use different encodings, the strings (byte sequences) won't match, even though you will see the same character when examining them!
So, to rule out any encoding mismatches, you might want to use a sequence of bytes instead of a character in your source code. For example, if you know that your input file uses the UTF-8 encoding:
char *carac[]={
"\xe2\x88\x9e", // ∞
"=",
"\xe2\x88\xaa"}; // ∪
Alternatively, make sure the encodings (of your source code and your program's input file) are the same.
Another, less subtle, problem: when comparing strings, you actually have a big string and a small string, and you want to check whether the big string starts with the small string. Here strcmp does the wrong thing! You must use strncmp here instead:
if (strncmp(carac[0], &a[i], strlen(carac[0])) == 0)
{
fprintf(flot3, "\xC2\xA4""c"); // ¤c
}
Another problem (actually, a major bug): the fscanf function reads a word (text delimited by spaces) from the input file. If you only examine the first byte in this word, the other bytes will not be processed. To fix, make a loop over all bytes:
fscanf(flot,"%s",a);
for (i = 0; a[i] != '\0'; )
{
    if (strncmp(&a[i], "|(", 2) == 0) // start pattern (strncmp returns 0 on a match)
    {
        now_replacing = 1;
        i += 2;
        continue;
    }
    if (now_replacing && strncmp(&a[i], whatever, strlen(whatever)) == 0)
    {
        fprintf(...); // write the replacement for whatever
        i += strlen(whatever);
    }
    else
    {
        fputc(a[i], output); // nothing matched: copy the byte unchanged
        i += 1; // processed just one char
    }
}
You're on the right track, but you need to look at characters differently than strings.
strcmp(carac[0], &a[i])
(Pretending i = 2) As you know this compares the string "∞" with &a[2]. But you forget that &a[2] is the address of the third character of the string, and strcmp works by scanning the entire string until it hits a null terminator. So "∞" actually ends up getting compared with "abc∞∪v=|)" because a is only null terminated at the very end.
What you should do is not use strings, but read two bytes at a time as a short (16 bits). And then you can compare them with your UTF-16 characters. Note that this only works if the text is actually stored as UTF-16; in a UTF-8 file, ∞ is stored as the three bytes E2 88 9E.
if (8734 == *((short *)&a[i])) { /* character is infinity */ }
The reason for that 8734 is that it is the UTF-16 value of infinity (0x221E).
VERY IMPORTANT NOTE:
Depending if your machine is big-endian or little-endian matters for this case. If 8734 (0x221E) does not work, give 7714 (0x1E22) a try.
Edit Something else I overlooked is you're scanning the entire string at once. "%s: String of characters. This will read subsequent characters until a whitespace is found (whitespace characters are considered to be blank, newline and tab)." (source)
//feof = false.
fscanf(flot,"%s",&a[i]);
//feof = true.
That means you never actually iterate. You need to go back and rethink your scanning procedure.
I wrote a C program for a lexical analyzer (a small piece of code) that will identify keywords, identifiers and constants. I am taking a string (C source code as a string) and then splitting it into words.
#include <stdio.h>
#include <conio.h>
#include <string.h>
char symTable[5][7] = { "int", "void", "float", "char", "string" };
int main() {
int i, j, k = 0, flag = 0;
char string[7];
char str[] = "int main(){printf(\"Hello\");return 0;}";
char *ptr;
printf("Splitting string \"%s\" into tokens:\n", str);
ptr = strtok(str, " (){};""");
printf("\n\n");
while (ptr != NULL) {
printf ("%s\n", ptr);
for (i = k; i < 5; i++) {
memset(&string[0], 0, sizeof(string));
for (j = 0; j < 7; j++) {
string[j] = symTable[i][j];
}
if (strcmp(ptr, string) == 0) {
printf("Keyword\n\n");
break;
} else
if (string[j] == 0 || string[j] == 1 || string[j] == 2 ||
string[j] == 3 || string[j] == 4 || string[j] == 5 ||
string[j] == 6 || string[j] == 7 || string[j] == 8 ||
string[j] == 9) {
printf("Constant\n\n");
break;
} else {
printf("Identifier\n\n");
break;
}
}
ptr = strtok(NULL, " (){};""");
k++;
}
_getch();
return 0;
}
With the above code, I am able to identify keywords and identifiers but I couldn't obtain the result for numbers. I've tried using strspn() but to no avail. I even replaced 0,1,2...,9 with '0','1',....,'9'.
Any help would be appreciated.
Here are some problems in your parser:
The test string[j] == 0 does not test if string[j] is the digit 0. The characters for digits are written '0' through '9', their values are 48 to 57 in ASCII and UTF-8. Furthermore, you should be comparing *ptr instead of string[j] to test if you have a digit in the token indicating the start of a number (after the inner loop j is 7, so string[j] is even outside the array).
Splitting the string with strtok() is not a good idea: it modifies the string and overwrites the first separator character with '\0': this will prevent matching operators such as (, )...
The string " (){};""" is exactly the same as " (){};". In order to escape " inside strings, you must use \".
To write a lexer for C, you should switch on the first character and check the following characters depending on the value of the first character:
if you have white space, skip it
if you have //, it is a line comment: skip all characters up to the newline.
if you have /*, it is a block comment: skip all characters until you get the pair */.
if you have a ', you have a character constant: parse the characters, handling escape sequences until you get a closing '.
if you have a ", you have a string literal: do the same as for character constants.
if you have a digit, consume all subsequent digits, you have an integer. Parsing the full number syntax requires much more code: leave that for later.
if you have a letter or an underscore: consume all subsequent letters, digits and underscores, then compare the word with the set of predefined keywords. You have either a keyword or an identifier.
otherwise, you have an operator: check if the next characters are part of a 2 or 3 character operator, such as == and >>=.
That's about it for a simple C parser. The full syntax requires more work, but you will get there one step at a time.
When you're writing a lexer, always create a dedicated function that finds your tokens (the name yylex is the one used by the Lex tool, which is why I used it). Writing the lexer in main is not a smart idea, especially if you want to do syntax and semantic analysis later on.
From your question it is not clear whether you just want to recognize number tokens, or whether you also want to fetch the number's value. I will assume the first.
This is example code, that finds whole numbers:
#include <ctype.h>
#include <stdio.h>
int yylex(void){
    /* We read one char from standard input. getchar returns an int,
       so that EOF can be told apart from real characters */
    int c = getchar();
    /* If we read a new line (or reach end of file), we return the end of input token */
    if(c == '\n' || c == EOF)
        return EOI;
    /* If we see a digit on input, we can not return the number token right away.
       For example the input could be 123a and that is a lexical error */
    if(isdigit(c)){
        while(isdigit(c = getchar()))
            ;
        ungetc(c,stdin); /* put back the first char that is not a digit */
        return NUM;
    }
    /* Additional code for keywords, identifiers, errors, etc. */
}
Tokens EOI, NUM, etc. should be defined at the top. Later on, when you want to write syntax analysis, you use these tokens to figure out whether the code conforms to the language syntax or not. In lexical analysis, single-character tokens usually need no definition at all: your lexer function simply returns the character itself, ')' for example. Because of that, the named tokens should be given values above 255. For example:
#define EOI 256
#define NUM 257
If you have any further questions, feel free to ask.
string[j]==1
This test is wrong(1) (on all C implementations I heard of), since string[j] is some char e.g. using ASCII (or UTF-8, or even the old EBCDIC used on IBM mainframes) encoding and the encoding of the char digit 1 is not the number 1. On my Linux/x86-64 machine (and on most machines using ASCII or UTF-8, e.g. almost all of them) using UTF-8, the character 1 is encoded as the byte of code 49 (that is (char)49 == '1')
You probably want
string[j]=='1'
and you should consider using the standard isdigit (and related) function.
Be aware that UTF-8 is practically used everywhere but is a multi-byte encoding (of displayable characters). See this answer.
Note (1): the string[j]==1 test is probably misplaced too! Perhaps you might test isdigit(*ptr) at some better place.
PS. Please get into the habit of compiling with all warnings and debug info (e.g. with gcc -Wall -Wextra -g if using GCC...)
and use the debugger (e.g. gdb). You would have found your bug in less time than it took you to get an answer here.
I am previously a java programmer, but I'm now doing a C course at university (computer science major).
I need the user to be able to enter 3 chars, the first 2 being numbers, and the last one being either 'v' or 'h'.
For example "1 2 v".
I need the user to be able to enter it with the spaces in between each character.
This is my current code:
void manageInput(char box[][width]){
char move[4];
char input[16];
while(1){
scanf("%s", input);
int i = 0;
while(input[i] != 0){
if(input[i] != ' ' && input[i] != "\n"){
move[i] = input[i];
}
i++;
}
printf("%s\n", move);
makeMove(box, move);
printBox(box, height, width);
// TODO
if(move[0] == 'x'){
exit(0);
}
}
}
However, if I run it, it works fine when I enter the chars without spaces like "12v", but if I enter "1 2 v", it will print out "1", call printBox, then print out "2", then print the box again, and so on.
If someone could explain what I'm doing wrong here, I would appreciate it.
If someone could explain what I'm doing wrong here, I would appreciate it.
The short story is: Your code doesn't fulfill your requirements. It simply doesn't do what you want it to do.
Your requirements are:
All fields must be one character. This requirement isn't fulfilled by your code. Your code will mistakenly accept multiple characters per field.
There must be one space (exactly one space?) between the fields. This requirement isn't fulfilled by your code. There might be multiple spaces between the fields, and your code will mistakenly accept that.
In fact, your code invokes undefined behaviour by accessing the move array out of bounds. Consider that as a consequence of one of the above scenarios i might become some value higher than 3. What might happen in this code: move[i] = input[i];?
Your code is also way too complex. All of your functionality can be performed by scanf alone. It's a very powerful function, when you know how to use it correctly... I suggest reading and understanding the manual multiple times, when you have an opportunity. You'll learn a lot!
I notice something you neglected to mention from within the logic you have presented: It's expected that the first field might also be 'x', which corresponds to an exit usecase. This is a bad design; the caller has no opportunity to clean up... but I'll run with it. You really should use return (and return an int value or something, corresponding to error/success) instead.
Let us cast that last paragraph aside, because we can simply consider 'x' to be invalid input (and exit as a result), and I don't want to change the contracts of your functions; I'll leave that to you. The expression described so far appears to be int x = scanf("%1[0123456789]%*1[ ]%1[0123456789]%*1[ ]%1[vh]", a, b, c);.
Note that it is expected that a, b and c will have enough space to store a string of one byte in length. That is, their declaration should look like: char a[2], b[2], c[2];.
Make sure you check the return value (x, in the example)! If x is 3, it's safe to assume that the three variables a, b and c are safe to use. If x is 2, it's safe to assume that a and b are safe to use, and so on... If x is EOF or 0, none of them are safe to use.
By checking the return value, you can reject input that doesn't match that precise pattern, that is:
Fields that aren't exactly one byte in width will be rejected.
Too many or too few spaces will be rejected.
Something else popped up that you have neglected to mention, and it's also present within your code: Chux mentioned that you'll likely be expecting the input to be terminated with a '\n' (newline) character. This can also be implemented in a number of ways using scanf:
scanf("%*1[\n]"); will attempt to read and discard precisely one '\n' character, but there's no way to ensure that was successful. getchar would be more appropriate for that purpose; something along the lines of if (getchar() != '\n') { exit(EXIT_FAILURE); } might make sense, if you wish to ensure that the lines of input are perfectly formed and bomb out when they aren't... #define BOMB_OUT?
scanf("%*[^\n]"); scanf("%*c"); makes more sense; If you're interested in reading one item per line, then it makes sense to discard everything remaining on the line, and then the newline character itself. Note that your program should always tell the user when it's discarding or truncating input. You could also use getchar for this.
void manageInput(char box[][width]){
for (;;) {
char a[2], b[2], c[2];
int x = scanf("%1[0123456789]%*1[ ]%1[0123456789]%*1[ ]%1[vh]", a, b, c);
if (x != 3) {
/* INVALID INPUT should cause an error value to be returned!
* However, this function has no return value (which makes it
* poorly designed)... Calling `exit` gives no opportunity for
* calling code to clean up :(
*/
exit(EXIT_FAILURE);
}
if (getchar() != '\n') {
# ifdef BOMB_OUT
exit(EXIT_FAILURE);
# else
scanf("%*[^\n]");
getchar();
puts("NOTE: Excess input has been discarded.");
# endif
}
char move[4] = { a[0], b[0], c[0] };
printf("%s\n", move);
makeMove(box, move);
printBox(box, height, width);
// TODO
if(move[0] == 'x'){
exit(0);
}
}
}
%s reads a whitespace-delimited string with scanf, so if that's not what you want, it's not the thing to use. %c reads a single character, but does not skip whitespace, so you probably also want a space (' ') in your format to skip whitespace:
char input[3];
scanf(" %c %c %c", input, input+1, input+2);
will read 3 non-whitespace characters and skip any whitespace before or between them. You should also check the return value of scanf to make sure that it is 3 -- if not, there were fewer than 3 characters in your input before end-of-file was reached.
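It's usually a bad idea to read a string via scanf with a bare %s because of the potential buffer overflow. Consider giving %s a maximum field width or, better, using fgets as in
fgets(input, sizeof input, stdin);
fgets stores at most one byte less than the size it is given and always adds the terminating '\0', so passing the full size of the array is safe.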
Also, you're comparing char to string here: input[i] != "\n". It should be input[i] != '\n' instead.
And btw you can just use something like
int x, y;
char d;
scanf("%d %d %c", &x, &y, &d); // the space before %c skips the separator
This looks like two simple bugs.
You need to use separate indexes for move[] and input[]
int i = 0;
while(input[i] != 0){
if(input[i] != ' ' && input[i] != "\n"){
move[i] = input[i];
}
i++;
}
Imagine input of 1 2 v
input[0] != 0, so we enter the loop
it's not ' ' or '\n' either, so we copy input[0] to move[0]
so far so good
You increment i, and discover that input[1] == ' '
But then you increment i again
You discover that you are interested in input[2] (2) - so you copy it to move[2], rather than move[1]. Oops!
Then to make things worse, you never put an end-of-string character after the last valid character of move[].
I currently have a finite state machine which analyzes a long string, separates the long string by white space, and analyzes each token to either octal, hex, float, error, etc.
Here is a brief overview of how I analyze each token:
enum state mystate = start_state;
while (current_index <= end_index - 1) { // iterate through whole token
switch (mystate) {
case 0:
// analyze first character and move to appropriate state
// cases 1-5 represent the valid states, if error set mystate = 6
case 6: // this is the error state
current_index = end_index - 1; // end loop
break;
}
current_index++;
}
At the end of this loop, I analyze what state my token fell under, for example if the token didn't fit into any category and it went to state 6 (the error state):
if (mystate == 6) {
// token is char pointer to string token
fprintf(stdout, "Error: \" %s \" is invalid\n", token);
}
Now, I am supposed to print out unprintable characters from 0x20 and under, such as start-of-text, start-of-header, etc. in their hex form, such as [0x02] and [0x01]. I found a good list of the ASCII unprintable characters from 0x20 and under here: http://www.theasciicode.com.ar/ascii-control-characters/start-of-header-ascii-code-1.html
Firstly, I am confused how to even type the unprintable characters into the command line. How does one type an unprintable character as a command line argument for my program to analyze?
After that hurdle, I know that the unprintable characters will fall into state 6, my error state. So I have to modify my error state if statement slightly. Here is my thought process of how to do so in pseudo code:
if (mystate == 6) {
if (token is equal to unprintable character) {
// print hex form, use 0x%x for formatting
} else {
// still error, but not unprintable so just have original error statement
fprintf(stdout, "Error: \" %s \" is invalid\n", token);
}
}
Another thought I had was:
if (mystate == 6) {
if (the token's hex value is between 0x01 and 0x20) {
// print hex form, use 0x%x for formatting
} else {
// still error, but not unprintable so just have original error statement
fprintf(stdout, "Error: \" %s \" is invalid\n", token);
}
}
With a sane libc you would use
#include <ctype.h>
...
if (!isprint((unsigned char)ch)) {
    unsigned x = (unsigned char)ch;
    printf ("[0x%02x]", x);
}
...
to find non-printable ASCII characters, assuming that char ch is your current input character.
To use them in a command line you could use printf(1) from the command line.
printf '\x02'|xxd
0000000: 02
There you see the STX character. BTW. There is an excellent manual page about ascii (ascii(7))!
So as a complete command line:
YOUR_Program "`printf '\x02\x03\x18\x19'`"
(The xxd was just to show what comes out of printf, as it is non-printable). xxd is just a hexdump utility, similar to od.
Note: When you really want unprintable input, it is more convenient to take the input either from a file, or from stdin. That simplifies your program call:
printf '\x02\x03\x18\x19'|YOUR_Program
One piece of your puzzle is printing in hex.
printf("%02x", 7);
This prints the two digit hex value 07.
Another piece is detecting non printable.
if (c < 0x20)
This tests whether the character has a value less than a space (0x20).
You might research the isprint function as there are some unprintable characters that are greater than space.
Good luck. Welcome to c.
#include <stdio.h>
int main()
{
char line[80];
int count;
// read the line of charecter
printf("Enter the line of text below: \n");
scanf("%[ˆ\n]",line);
// encode each individual charecter and display them
for(count = 0; line[count]!= '\0'; ++ count){
if(((line[count]>='0')&& (line [count]<= '9')) ||
((line[count]>= 'A')&& (line[count]<='Z')) ||
((line[count]>= 'a')&& (line[count]<='z')))
putchar(line[count]+1);
else if (line[count]=='9')putchar('0');
else if (line [count]== 'A')putchar('Z');
else if (line [count]== 'a') putchar('z');
else putchar('.');
}
}
The problem with the above code is the encoding conversion. Whenever I compile the code, the compiler automatically converts the encoding and then I am unable to get the required output.
My target output should look like:
enter the string
Hello World 456
Output
Ifmmp.uif.tusjof
Every letter is replaced by the next letter, and every space is replaced by '.'.
This is suspect:
scanf("%[ˆ\n]",line);
It should be:
scanf("%79[^\n]",line);
Your version has a multibyte character that looks a bit like ^, instead of the ^. This would cause your scans to malfunction. Your symptoms sound as if the text that has been input is actually multi-byte characters.
BTW you could make your code easier to read by using isalnum( (unsigned char)line[count] ). That test replaces your a-z, A-Z, 0-9 tests.
You are not checking your conditions correctly:
if ((line[count]>= 'A')&& (line[count]<='Z'))
..
already converts the character 'Z'. The next check,
if (line [count]== 'A')putchar('Z');
is never executed. But that is not the only thing wrong here. The character 'A' should be translated to 'B', not 'Z'. You probably want
if (line[count]>= 'A' && line[count] < 'Z')
(< instead of <=) and
if (line [count]== 'Z')putchar('A');
and the same for lowercase and digits.
The problem is your format string for scanf. Be careful with %s here: it stops at the first whitespace character, so it reads a single word rather than a whole line of text.
To read a line that may contain spaces, use a scanset with a maximum field width of 79 characters (because your line array has a length of 80).
So you should replace your scanf with this:
scanf("%79[^\n]", line);