I currently have a finite state machine that takes a long string, splits it on whitespace, and classifies each token as octal, hex, float, error, etc.
Here is a brief overview of how I analyze each token:
enum state mystate = start_state;
while (current_index <= end_index - 1) { // iterate through whole token
switch (mystate) {
case 0:
// analyze first character and move to appropriate state
// cases 1-5 represent the valid states, if error set mystate = 6
case 6: // this is the error state
current_index = end_index - 1; // end loop
break;
}
current_index++;
}
At the end of this loop, I analyze what state my token fell under, for example if the token didn't fit into any category and it went to state 6 (the error state):
if (mystate == 6) {
// token is char pointer to string token
fprintf(stdout, "Error: \" %s \" is invalid\n", token);
}
Now, I am supposed to print out unprintable characters from 0x20 and under, such as start-of-text, start-of-header, etc. in their hex form, such as [0x02] and [0x01]. I found a good list of the ASCII unprintable characters from 0x20 and under here: http://www.theasciicode.com.ar/ascii-control-characters/start-of-header-ascii-code-1.html
Firstly, I am confused how to even type the unprintable characters into the command line. How does one type an unprintable character as a command line argument for my program to analyze?
After that hurdle, I know that the unprintable characters will fall into state 6, my error state. So I have to modify my error state if statement slightly. Here is my thought process of how to do so in pseudo code:
if (mystate == 6) {
if (token is equal to unprintable character) {
// print hex form, use 0x%x for formatting
} else {
// still error, but not unprintable so just have original error statement
fprintf(stdout, "Error: \" %s \" is invalid\n", token);
}
}
Another thought I had was:
if (mystate == 6) {
if (the token's hex value is between 0x01 and 0x20) {
// print hex form, use 0x%x for formatting
} else {
// still error, but not unprintable so just have original error statement
fprintf(stdout, "Error: \" %s \" is invalid\n", token);
}
}
With a sane libc you would use
#include <ctype.h>
...
if (!isprint((unsigned char)ch)) {
    /* non-printable: show the byte value in hex, e.g. [0x02] */
    printf("[0x%02x]", (unsigned char)ch);
}
...
to find non-printable ASCII characters, assuming that char ch is your current input character.
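As a minimal sketch of how that test could be applied to a whole token, the function below copies a token into a buffer and replaces each non-printable byte with its bracketed hex form. The name escape_token and the buffer-based interface are assumptions made for illustration; they are not part of the original program.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: writes `token` into `out`, replacing every
   non-printable byte with its bracketed hex form, e.g. [0x02].
   Returns the number of characters written (excluding the '\0'). */
size_t escape_token(const char *token, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return 0;
    for (; *token != '\0'; token++) {
        unsigned char c = (unsigned char)*token;
        int n;

        if (isprint(c))
            n = snprintf(out + used, out_size - used, "%c", c);
        else
            n = snprintf(out + used, out_size - used, "[0x%02x]", c);
        if (n < 0 || (size_t)n >= out_size - used)
            break;              /* output buffer full: stop here */
        used += (size_t)n;
    }
    out[used] = '\0';           /* drop any partially written piece */
    return used;
}
```

With a 64-byte buffer buf, calling escape_token("\x02\x03", buf, sizeof buf) leaves "[0x02][0x03]" in buf, which could then be printed in the error state in place of the raw token.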
To pass them on a command line, you can use printf(1) in the shell.
printf '\x02'|xxd
0000000: 02
There you see the STX character. By the way, there is an excellent manual page about ASCII: ascii(7).
So as a complete command line:
YOUR_Program "`printf '\x02\x03\x18\x19'`"
(The xxd was just to show what comes out of printf, as it is non-printable). xxd is just a hexdump utility, similar to od.
Note: When you really want unprintable input, it is more convenient to take the input either from a file, or from stdin. That simplifies your program call:
printf '\x02\x03\x18\x19'|YOUR_Program
One piece of your puzzle is printing in hex.
printf("%02x", 7);
This prints the two-digit hex value 07.
Another piece is detecting non-printable characters.
if (c < 0x20)
This tests whether the character has a value less than a space (0x20).
You might research the isprint function as there are some unprintable characters that are greater than space.
Good luck. Welcome to c.
Related
I'm trying to display an HTTP frame. The problem is that some characters are not recognized. I use the isprint function.
Here is the function I created:
void printAscii(const int dataLength, const char *data){
if (dataLength <= 0) {
printf("No data (data length <= 0)\n");
} else {
printf("Warning: Unsupported characters are not displayed.\n\n");
size_t i;
printf("|- ");
for (i = 0; i < dataLength; i++) {
if (isprint(data[i])) {
printf("%c", data[i]);
}
if (data[i] == '\n') {
printf("|- ");
}
}
}
}
The problem is that characters like "\n" and "\t" are not displayed either.
I thought of adding additional conditions in my function
if (isprint(data[i]) || data[i] == '\n' || data[i] == '\t')
But I was wondering whether there is a cleaner way?
I started learning C not long ago, so do not hesitate to tell me if I made mistakes in my function.
EDIT
I may not have been clear enough in my question.
My project is a frame analyzer (pcap), and I have reached the HTTP part. The frame contains only ASCII, so it is relatively easy to display this type of frame. The problem is that some characters cannot be displayed directly (encoded image data, for example), so I decided to ignore these characters. The problem with isprint() is that characters like '\n', '\t', etc. are not displayed either, and so my display is less "beautiful".
For example, this HTTP frame:
<ul>
<li>Foo</li>
<li>Bar</li>
</ul>
becomes:
<ul><li>Foo</li><li>Bar</li></ul>
which is less understandable.
Edit 2
I found it. This code works as desired.
if (isprint(data[i]) || isspace(data[i]))
Thanks anyway.
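A minimal sketch of that filter as a standalone function follows. The name keep_displayable and the output-buffer interface are assumptions for illustration. The cast to unsigned char matters for binary payloads: passing a negative char value to isprint or isspace is undefined behavior.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copies the bytes of `data` that are printable or
   whitespace into `out`, which must hold at least length + 1 bytes.
   Returns the number of bytes kept. */
size_t keep_displayable(const char *data, size_t length, char *out)
{
    size_t kept = 0;

    for (size_t i = 0; i < length; i++) {
        /* convert to unsigned char before calling the <ctype.h> functions */
        unsigned char c = (unsigned char)data[i];

        if (isprint(c) || isspace(c))
            out[kept++] = (char)c;
    }
    out[kept] = '\0';
    return kept;
}
```

In the default "C" locale this keeps newlines and tabs, so the HTML example above retains its line breaks, while bytes such as 0x01 or 0xFF are dropped.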
Since these characters are whitespace, you will not see them as visible glyphs with the "%c" format specifier. To make them visible, print their hexadecimal values with printf instead:
printf("%02x", (unsigned char)data[i]);
Note: "%x" (lowercase) and "%X" (uppercase) display the hexadecimal value of the character instead of the character itself.
I should also note that formatting the data in this fashion gives you the raw bytes, as they appear in memory or on the wire. So a \n would be 0x0A and a \t would be 0x09.
See the following link for a great reference on what all the specifiers mean.
http://www.cplusplus.com/reference/cstdio/printf/
Also, here is another link to an ASCII Table.
https://www.techonthenet.com/ascii/chart.php
Hope this helps!
I've got a UTF-8 text file containing several signs that I'd like to replace with other ones (only those between |( and |) ), but the problem is that some of these signs are not single characters but multi-byte signs. (By this I mean they can't be written as a character constant like '∞', only as a string like "∞", so char * ?)
Here is my textfile :
Text : |(abc∞∪v=|)
For example :
∞ should be changed by ¤c
∪ by ¸!
= changed by "
So as some signs(∞ and ∪) are multicharacters, i decided to use fscanf to get all the text word by word. The problem with this method is that I have to put space between each character ... My file should look like this :
Text : |( a b c ∞ ∪ v = |)
fgetc can't be used because characters like ∞ can't be read as one single char. If I use it, I won't be able to strcmp a char with each sign (char *); I tried to convert my char to char *, but strcmp still returned a non-zero value.
Here is my code in C to help you understanding my problem :
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main(void){
char *carac[]={"∞","=","∪"}; //array with our signs
FILE *flot,*flot3;
flot=fopen("fichierdeTest2.txt","r"); // input text file
flot3=fopen("resultat.txt","w"); //output file
int i=0,j=0;
char a[1024]; //array that will contain each read word.
while(!feof(flot))
{
fscanf(flot,"%s",&a[i]);
if (strstr(&a[i], "|(") != NULL){ // if the word read contains |( then j=1
j=1;
fprintf(flot3,"|(");
}
if (strcmp(&a[i], "|)") == 0)
j=0;
if(j==1) { //it means we are between |( and |) so the conversion can begin
if (strcmp(carac[0], &a[i]) == 0) { fprintf(flot3, "¤c"); }
else if (strcmp(carac[1], &a[i]) == 0) { fprintf(flot3,"\"" ); }
else if (strcmp(carac[2], &a[i]) == 0) { fprintf(flot3, " ¸!"); }
else fprintf(flot3,"%s",&a[i]); // when it's a letter, number or sign that doesn't need to be converted
}
else { // when we are not between |( and |) just copy the word to the output file with a space after it
fprintf(flot3, "%s", &a[i]);
fprintf(flot3, " ");
}
i++;
}
}
Thanks a lot for the future help !
EDIT : Every sign will be changed correctly if i put a space between each them but without ,it won't work, that's what i'm trying to solve.
First of all, get the terminology right. Proper terminology is a bit confusing, but at least other people will understand what you are talking about.
In C, char is the same as byte. However, a character is something abstract like ∞ or ¤ or c. One character may contain a few bytes (that is a few chars). Such characters are called multi-byte ones.
Converting a character to a sequence of bytes (encoding) is not trivial. Different systems do it differently; some use UTF-8, while others may use UTF-16 big-endian, UTF-16 little-endian, an 8-bit code page or any other encoding.
When your C program has something inside quotes, like "∞" - it's a C-string, that is, several bytes terminated by a zero byte. When your code uses strcmp to compare strings, it compares each byte of both strings, to make sure they are equal. So, if your source code and your input file use different encodings, the strings (byte sequences) won't match, even though you will see the same character when examining them!
So, to rule out any encoding mismatches, you might want to use a sequence of bytes instead of a character in your source code. For example, if you know that your input file uses the UTF-8 encoding:
char *carac[]={
"\xe2\x88\x9e", // ∞
"=",
"\xe2\x88\xaa"}; // ∪
Alternatively, make sure the encodings (of your source code and your program's input file) are the same.
Another, less subtle, problem: when comparing strings, you actually have a big string and a small string, and you want to check whether the big string starts with the small string. Here strcmp does the wrong thing! You must use strncmp here instead:
if (strncmp(carac[0], &a[i], strlen(carac[0])) == 0)
{
fprintf(flot3, "\xC2\xA4""c"); // ¤c
}
Another problem (actually, a major bug): the fscanf function reads a word (text delimited by spaces) from the input file. If you only examine the first byte in this word, the other bytes will not be processed. To fix, make a loop over all bytes:
fscanf(flot,"%s",a);
for (i = 0; a[i] != '\0'; )
{
    if (strncmp(&a[i], "|(", 2) == 0) // start pattern
    {
        now_replacing = 1;
        i += 2;
        continue;
    }
    // strncmp returns 0 on a match, so compare against 0 explicitly
    if (now_replacing && strncmp(&a[i], whatever, strlen(whatever)) == 0)
    {
        fprintf(...);
        i += strlen(whatever); // skip the whole multi-byte sign
    }
    else
    {
        fputc(a[i], output);
        i += 1; // processed just one char
    }
}
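Putting these pieces together, here is a self-contained sketch of the replacement step for one string. It assumes UTF-8 for both input and output; the function name convert, the table layout, and the choice to copy the |( and |) markers through unchanged are assumptions for illustration, not the original poster's code.

```c
#include <stddef.h>
#include <string.h>

/* Replacement table; the byte values assume UTF-8. */
struct replacement { const char *from; const char *to; };
static const struct replacement table[] = {
    { "\xe2\x88\x9e", "\xc2\xa4" "c" },   /* ∞ -> ¤c */
    { "\xe2\x88\xaa", "\xc2\xb8" "!" },   /* ∪ -> ¸! */
    { "=",            "\"" },             /* = -> "  */
};

/* Hypothetical helper: copies `in` to `out`, applying the table only
   between "|(" and "|)". Returns the number of bytes written. */
size_t convert(const char *in, char *out, size_t out_size)
{
    const size_t n = sizeof table / sizeof table[0];
    int replacing = 0;
    size_t i = 0, used = 0;

    while (in[i] != '\0') {
        const char *piece = NULL;   /* replacement text, if any */
        size_t consumed = 1, piece_len, k;

        if (strncmp(&in[i], "|(", 2) == 0) {
            replacing = 1; piece = "|("; consumed = 2;
        } else if (strncmp(&in[i], "|)", 2) == 0) {
            replacing = 0; piece = "|)"; consumed = 2;
        } else if (replacing) {
            for (k = 0; k < n; k++) {
                size_t len = strlen(table[k].from);
                if (strncmp(&in[i], table[k].from, len) == 0) {
                    piece = table[k].to; consumed = len;
                    break;
                }
            }
        }
        piece_len = piece ? strlen(piece) : 1;
        if (used + piece_len + 1 > out_size)
            break;                  /* output buffer full */
        if (piece)
            memcpy(out + used, piece, piece_len);
        else
            out[used] = in[i];      /* ordinary byte: copy as is */
        used += piece_len;
        i += consumed;
    }
    if (out_size > 0)
        out[used] = '\0';
    return used;
}
```

Because the comparison works on byte sequences, no spaces are needed between the signs in the input file; each line read with fgets could be passed through convert and written to the output file.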
You're on the right track, but you need to look at characters differently than strings.
strcmp(carac[0], &a[i])
(Pretending i = 2) As you know, this compares the string "∞" with &a[2]. But you forget that &a[2] is the address of the third character of the string, and strcmp works by scanning the entire string until it hits a null terminator. So "∞" actually ends up being compared with "abc∞∪v=|)", because a is only null-terminated at the very end.
What you could do instead of using strings is read two bytes at a time as a 16-bit value and compare that with the UTF-16 code of the character. Note that this only works if the text is actually encoded as UTF-16; in a UTF-8 file, ∞ is the three-byte sequence E2 88 9E.
if (8734 == *((short *)&a[i])) { /* character is infinity */ }
The reason for that 8734 is that it is the UTF-16 value of infinity (0x221E).
VERY IMPORTANT NOTE:
Whether your machine is big-endian or little-endian matters in this case. If 8734 (0x221E) does not work, give 7714 (0x1E22) a try.
Edit: Something else I overlooked is that you're scanning the entire string at once. "%s: String of characters. This will read subsequent characters until a whitespace is found (whitespace characters are considered to be blank, newline and tab)." (source)
//feof = false.
fscanf(flot,"%s",&a[i]);
//feof = true.
That means you never actually iterate. You need to go back and rethink your scanning procedure.
#include <stdio.h>
int main()
{
char line[80];
int count;
// read the line of charecter
printf("Enter the line of text below: \n");
scanf("%[ˆ\n]",line);
// encode each individual charecter and display them
for(count = 0; line[count]!= '\0'; ++ count){
if(((line[count]>='0')&& (line [count]<= '9')) ||
((line[count]>= 'A')&& (line[count]<='Z')) ||
((line[count]>= 'a')&& (line[count]<='z')))
putchar(line[count]+1);
else if (line[count]=='9')putchar('0');
else if (line [count]== 'A')putchar('Z');
else if (line [count]== 'a') putchar('z');
else putchar('.');
}
}
In the above code problem is converting encoding. Whenever I compile the code, the compiler automatically converts the encoding and then I am unable to get required output.
My target output should look like:
enter the string
Hello World 456
Output
Ifmmp.uif.tusjof
Every letter is replaced by the next letter, and a space is replaced by '.'.
This is suspect:
scanf("%[ˆ\n]",line);
It should be:
scanf("%79[^\n]",line);
Your version has a multibyte character that looks a bit like ^, instead of the ^. This would cause your scans to malfunction. Your symptoms sound as if the text that has been input is actually multi-byte characters.
BTW you could make your code easier to read by using isalnum( (unsigned char)line[count] ). That test replaces your a-z, A-Z, 0-9 tests.
You are not checking your conditions correctly:
if ((line[count]>= 'A')&& (line[count]<='Z'))
..
already converts the character 'Z'. The next check,
if (line [count]== 'A')putchar('Z');
is never executed. But that is not the only thing wrong here. The character 'A' should be translated to 'B', not 'Z'. You probably want
if (line[count]>= 'A' && line[count] < 'Z')
(< instead of <=) and
if (line [count]== 'Z')putchar('A');
and the same for lowercase and digits.
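A compact way to express the corrected logic is to test the wrap-around cases first and only then apply the general "next character" rule. The sketch below is one possible arrangement; the name encode_char is hypothetical, and it assumes an ASCII-compatible character set in which letters and digits are contiguous.

```c
#include <ctype.h>

/* Hypothetical helper: returns the encoded form of one character.
   Assumes ASCII, where 'A'..'Z', 'a'..'z' and '0'..'9' are contiguous. */
int encode_char(int c)
{
    /* wrap-around cases must be checked before the general rule */
    if (c == 'Z') return 'A';
    if (c == 'z') return 'a';
    if (c == '9') return '0';
    if (isalnum((unsigned char)c))
        return c + 1;           /* any other letter or digit: next one */
    return '.';                 /* everything else, including space */
}
```

Inside the loop, the whole if/else chain then reduces to putchar(encode_char(line[count])). For instance, encode_char('Z') returns 'A' and encode_char(' ') returns '.'.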
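The problem is your format string for scanf. If you want to read a single whitespace-delimited word from the console, you can use %s. Note that %s stops at the first whitespace character, so it will not read a whole line such as "Hello World 456".
If you want to make sure that you read a maximum of 79 characters, you should use %79s (because your line array has a length of 80).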
So you should replace your scanf with this:
scanf("%79s", line);
I've got to parse a .txt file like this
autore: sempronio, caio; titolo: ; editore: ; luogo_pubblicazione: ; anno: 0; prestito: 0-1-1900; collocazione: ; descrizione_fisica: ; nota: ;
with fscanf in C code.
I tried with some formats in fscanf call, but none of them worked...
EDIT:
a = fscanf(fp, "autore: %s");
This is the first try I did; the patterns 'autore', 'titolo', 'editore', etc. must not be caught by fscanf().
Generally speaking, trying to parse input with fscanf is not a good idea, as it is difficult to recover gracefully if the input does not match expectations. It is generally better to read the input into an internal buffer (with fread or fgets), and parse it there (with sscanf, strtok, strtol etc.). Details on which functions are best depend on the definition of the input format (which you did not give us; example input is no replacement for a formal specification).
The following shows how to use strtok:
char* item;
char* input; // fill it with fgets
for (item = strtok(input, ";"); item != NULL; item = strtok(NULL, ";"))
{
// item loops through the following:
// "autore: sempronio, caio"
// " titolo: "
// " editore: "
// ...
}
The following shows how to use sscanf:
char tag[20];
int chars = -1;
if (sscanf(item, " %19[^:]: %n", tag, &chars) == 1 && chars >= 0)
{
printf("%s is %s\n", tag, item + chars);
}
Here, the format string consists of the following:
(space) - tells the parser to discard whitespace
19 - maximum number of bytes/chars in the tag
[^:] - tells the parser to read until it meets the colon character
: - tells the parser to discard the colon character
(whitespace) - as above
%n - tells the parser to report the number of bytes it read (check &chars)
If there was an unexpected input, the number of chars is not updated, so you have to set it to -1 before parsing each item.
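The two fragments can be combined into one routine. The following is a sketch under assumptions: the name parse_record, the 32-byte tag buffer, and the "tag=[value]" output format are all illustrative choices rather than anything the input format requires.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits `line` (modified in place by strtok) into
   "tag: value" items and prints one "tag=[value]" line per item to `out`.
   Returns the number of items parsed. */
int parse_record(char *line, FILE *out)
{
    int count = 0;

    for (char *item = strtok(line, ";"); item != NULL;
         item = strtok(NULL, ";")) {
        char tag[32];
        int chars = -1;         /* reset before every sscanf call */

        /* chars stays -1 if the colon is missing, so the item is skipped */
        if (sscanf(item, " %31[^:]: %n", tag, &chars) == 1 && chars >= 0) {
            fprintf(out, "%s=[%s]\n", tag, item + chars);
            count++;
        }
    }
    return count;
}
```

A line read with fgets can be passed in directly; the trailing newline after the last ';' forms an item without a tag and is ignored. Note that strtok skips empty items and would also split a value that itself contains ';'.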
I am trying to create a program which, given an input file, returns the count of all the lines of code in the input file, excluding blank lines and comment lines. I have written the following code, however I need help with how to exclude lines containing comments and blank lines.
#include<stdio.h>
int main()
{
int count;
char ch;
FILE *fptr;
clrscr();
fp=fopen("test.cpp","r");
if(fp==EOF)
{
perror("Error:");
}
else
{
while(ch!=EOF)
{
ch=fgetc(fptr);
if(ch=='\n')
count++;
if(ch=='\\')
count--;
if(ch=='\*')
{
while(ch!='*\')
{
ch=fgetc(fptr);
}
}
}
printf("the lines in the code are %d\n",count);
fclose (fptr)
}
getchar();
return 0;
}
How can I modify the above code so that blank lines and comment lines are not counted?
If you read the input file character by character, you'll have a lot more work than if you read it line by line. After all you're counting lines ...
pseudocode
1. initialize line count to 0
2. read a line
3. end of file? yes: goto 7
4. is it a good line? yes: goto 5; no: goto 2
5. increment line count
6. repeat from 2
7. output line count
Now you ask ... what is a good line?
As a first approximation, I suggest you count as a good line every line that is not composed entirely of whitespace. This approximation will count comments, but you can develop your program from here.
The next version could ignore lines that hold only a // comment.
Version 3 could ignore lines containing both /* and */
and version 4 would deal with multi-line comments.
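The first two versions could be sketched in C as follows. The function names and the 1024-byte line buffer are assumptions; a physical line longer than the buffer is read in several pieces by fgets and may therefore be counted more than once.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the line contains something other than whitespace and
   does not start (after leading whitespace) with "//". */
int is_good_line(const char *line)
{
    while (isspace((unsigned char)*line))
        line++;
    if (*line == '\0')
        return 0;               /* blank line */
    if (strncmp(line, "//", 2) == 0)
        return 0;               /* line holding only a // comment */
    return 1;
}

/* Counts the good lines in an already opened stream. */
int count_good_lines(FILE *fp)
{
    char line[1024];
    int count = 0;

    while (fgets(line, sizeof line, fp) != NULL)
        if (is_good_line(line))
            count++;
    return count;
}
```

A line such as x = 1; // Set value of x still counts as code, because the comment test only applies when // is the first non-blank text on the line.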
Above all, have fun!
C comments are // and /* */. The following lines are where your problem is:
if(ch=='\\')
count--;
if(ch=='\*')
while(ch!='*\')
ch=fgetc(fptr);
The other problem is that you can't match a two-character comment delimiter by reading a character at a time without some sort of state machine.
Also, your code should cater for the case where comments are embedded in real lines of code. eg.
x = 1; // Set value of x
You'd be far better off reading the file a line at a time, and checking whether or not each line is blank or a comment, and incrementing a counter if not.
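If you do want to stay with character-by-character reading, the state machine mentioned above could look roughly like this. It is a sketch, not a full C lexer: the state names and the function name count_code_lines are made up for illustration, and comment markers inside string or character literals are not handled.

```c
#include <ctype.h>
#include <stdio.h>

enum state { CODE, SLASH, LINE_COMMENT, BLOCK_COMMENT, BLOCK_STAR };

/* Counts lines that contain at least one non-blank character outside
   of // and block comments. */
int count_code_lines(FILE *fp)
{
    enum state st = CODE;
    int has_code = 0, count = 0, ch;

    while ((ch = fgetc(fp)) != EOF) {
        switch (st) {
        case CODE:
            if (ch == '/') st = SLASH;
            else if (!isspace(ch)) has_code = 1;
            break;
        case SLASH:             /* saw one '/', decide what it starts */
            if (ch == '/') st = LINE_COMMENT;
            else if (ch == '*') st = BLOCK_COMMENT;
            else { has_code = 1; st = CODE; }   /* lone '/', e.g. division */
            break;
        case LINE_COMMENT:      /* ends at the newline handled below */
            break;
        case BLOCK_COMMENT:
            if (ch == '*') st = BLOCK_STAR;
            break;
        case BLOCK_STAR:        /* saw '*' inside a block comment */
            if (ch == '/') st = CODE;
            else if (ch != '*') st = BLOCK_COMMENT;
            break;
        }
        if (ch == '\n') {
            count += has_code;
            has_code = 0;
            if (st == LINE_COMMENT) st = CODE;
        }
    }
    if (has_code || st == SLASH)
        count++;                /* last line without a trailing newline */
    return count;
}
```

This handles comments embedded in real lines of code: a line with code followed by a // comment is counted, while the continuation lines of a multi-line block comment are not.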
You mean //, /* and */ instead of \\, \* and *\.
The \ is used as an escape character, which changes the "meaning" of the character after it.
\n gives you a newline. With \\ you get a single \, and with \' you get something that doesn't close the opening '.
If you replace those comment characters with the correct ones, you should get code that will compile.
But it won't count correctly.
Imagine a line like this:
doSomething(); // foo
Apart from your problems with character constants, you have errors in the way you deal with fgetc. fgetc returns an int. It returns either EOF, which is a negative integer constant, if there were no remaining characters to read or there was an error, or the value of the character read as an unsigned char converted to an int.
If you convert the return value of fgetc to char before comparing it to EOF, then a valid character might compare equal to EOF, causing premature termination of your loop.
Also, note that the while loop starts before the first call to fgetc, so you are using the uninitialized value of ch in the first iteration. This could cause anything to happen.
The idiomatic way to form the loop would be:
int ch;
while ((ch = fgetc(fptr)) != EOF)
{
/* ... */
}
Inside the loop you need to be careful when comparing the returned value, because ch holds an unsigned char converted to an int.
On most platforms the simplest thing to do would be to create a char variable for comparison purposes, although you could put your character constants through the same unsigned char to int conversion.
E.g.
char c = ch;
if (c == '\n')
or
if (ch == (unsigned char)'\n')
Others have pointed out the problems with your character literals.
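For reference, here is the idiomatic loop as a complete function that counts newline characters in a stream. The function name count_newlines is an assumption; the point of the sketch is only that ch is declared as int.

```c
#include <stdio.h>

/* Counts the newline characters in an already opened stream. */
long count_newlines(FILE *fp)
{
    long count = 0;
    int ch;     /* int, not char, so that EOF stays distinct from any byte */

    while ((ch = fgetc(fp)) != EOF) {
        if (ch == '\n')
            count++;
    }
    return count;
}
```

Because ch is an int, a byte with value 0xFF in the file is returned as 255 and cannot be mistaken for EOF, so the loop does not stop early on binary data.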
Well, part of the problem is that your ch variable is only a character in length, but when you test for comments, such as \\ and \*, these are two characters long, and thus need to use string comparison.
Another problem is that one-line comments in C/C++ actually start with //, and multi-line comments start with /* and end with */.
You can write something like this in Python:
def goodline(l: str) -> int:
    if l.lstrip().startswith("/*") and l.rstrip().endswith("*/"):
        # single line
        return 0
    elif l.lstrip().startswith("/*") and not l.rstrip().endswith("*/"):
        # multi line start
        return 1
    elif not l.lstrip().startswith("/*") and l.rstrip().endswith("*/"):
        # multi line end
        return 2
    elif l.strip() == "":
        # empty line
        return 3
    elif l.lstrip().startswith("//"):
        # single line
        return 0
    else:
        # good line
        return 4
If the function above returns 1, keep iterating over lines until it returns 2.