Just to give you some background: we have a school project where we need to write our own compiler in C, and my task is to write the lexical analyzer. So far so good, but I am having some difficulties with escape sequences.
When I find an escape sequence and it is valid, I have it saved in a string that looks like \xAF; otherwise it is a lexical error.
My problem is: how do I convert a string containing only the escape sequence into a single char, so I can append it to the "buffer" containing the rest of the string?
I had an idea about a massive table containing every escape sequence and comparing against it one by one, but that does not seem elegant.
This solution can be used for numerical escape sequences of all lengths and types: octal, hexadecimal, and others.
What you do when you see a '\' is check the next character. If it's an 'x' (or 'X'), you read one character; if that is a hexadecimal digit (isxdigit), you read another. If the last character read is not a hexadecimal digit, put it back into the stream (an "unget" operation) and use only the first digit you read.
Put each digit you read into a string, and then you can use e.g. strtol to convert that string into a number. Put that number directly into the token value.
For octal sequences, read up to three octal digits instead.
For an example of a similar method, see this old lexer I made many years ago; search for the lex_getescape function. Note that it uses direct arithmetic instead of strtoul to convert the escape code into a number, and it doesn't use the standard isxdigit etc. functions either.
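For illustration, here is a minimal sketch of that approach, assuming the lexer reads from a FILE * and that the leading '\' and 'x' have already been consumed; the function name is made up:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Decode the digits of a \xHH escape. Returns the character value,
   or -1 as a lexical error when no hex digit follows. */
int read_hex_escape(FILE *in)
{
    char digits[3] = {0};
    int c = fgetc(in);

    if (!isxdigit(c))
        return -1;               /* \x with no digit at all */
    digits[0] = (char)c;

    c = fgetc(in);
    if (isxdigit(c))
        digits[1] = (char)c;     /* second digit, e.g. \xAF */
    else if (c != EOF)
        ungetc(c, in);           /* not a digit: put it back, keep one digit */

    return (int)strtol(digits, NULL, 16);
}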
You can use the following code; call xString2char with your string.
char x2char(const char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    //if we got here it's an error - handle it as you like...
    return 0;
}
char xString2char(const char* buf)
{
    char ans;
    ans = x2char(buf[2]);
    ans <<= 4;
    ans += x2char(buf[3]);
    return ans;
}
This should work; just add the error checking and handling (in case you didn't already validate the input elsewhere in your code).
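For example, a quick usage sketch, assuming the two functions above are in scope and the lexeme really holds the four characters backslash, x, A, F:

#include <stdio.h>

int main(void)
{
    char c = xString2char("\\xAF");           /* the literal holds \, x, A, F */
    printf("0x%02X\n", (unsigned char)c);     /* prints 0xAF */
    return 0;
}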
flex has start conditions, which enable contextual analysis.
For instance, there is an example of C comment analysis (between /* and */) in the flex manual:
<INITIAL>"/*" BEGIN(IN_COMMENT);
<IN_COMMENT>{
"*/" BEGIN(INITIAL);
[^*\n]+ /* eat comment in chunks */
"*" /* eat the lone star */
\n yylineno++;
}
Start conditions also enable string-literal analysis. There is an example of how to match C-style quoted strings using start conditions under the Start Conditions item, and there is also a FAQ item titled "How do I expand backslash-escape sequences in C-style quoted strings?" in the flex manual.
This will probably answer your question.
Related
I'm trying to write a program that counts all the characters in a Turkish-language string. I can't see why this does not work. I included the library and called setlocale(LC_ALL,"turkish"), but it still doesn't work. Thank you. Here is my code:
My file's character encoding is UTF-8.
int main(){
    setlocale(LC_ALL,"turkish");
    char string[9000];
    int c = 0, count[30] = {0};
    int bahar = 0;
    ...
    if ( string[c] >= 'a' && string[c] <= 'z' ){
        count[string[c]-'a']++;
        bahar++;
    }
my output:
a 0.085217
b 0.015272
c 0.022602
d 0.035736
e 0.110263
f 0.029933
g 0.015272
h 0.053146
i 0.071167
k 0.010996
l 0.047954
m 0.025046
n 0.095907
o 0.069334
p 0.013745
q 0.002443
r 0.053451
s 0.073916
t 0.095296
u 0.036958
v 0.004582
w 0.019243
x 0.001527
y 0.010996
This is the English alphabet, but I need these characters counted too: "ğ, ü, ç, ı, ö".
setlocale(LC_ALL,"turkish");
First: "turkish" isn't a locale.
The proper name of a locale will typically look like xx_YY.CHARSET, where xx is the ISO 639-1 code for the language, YY is the ISO 3166-1 Alpha-2 code for the country, and CHARSET is an optional character set name (usually ISO8859-1, ISO8859-15, or UTF-8). Note that not all combinations are valid; the computer must have locale files generated for that specific combination of language code, country code, and character set.
What you probably want here is setlocale(LC_ALL, "tr_TR.UTF-8").
if ( string[c] >= 'a' && string[c] <= 'z' ){
Second: Comparison operators like >= and <= are not locale-sensitive. This comparison will always be performed on bytes, and will not include characters outside the ASCII a-z range.
To perform a locale-sensitive comparison, you must use a function like strcoll(). However, note additionally that some letters (including the ones you're trying to include here!) are composed of multi-byte sequences in UTF-8, so looping over bytes won't work either. You will need to use a function like mblen() or mbtowc() to separate these sequences.
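For illustration, a minimal sketch of walking a UTF-8 string character by character with mbtowc; it assumes a UTF-8 locale is available, and the sample text is just a placeholder:

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *s = "ağaç";                 /* UTF-8 encoded sample text */
    wchar_t wc;
    int len;

    if (setlocale(LC_ALL, "tr_TR.UTF-8") == NULL)
        return 1;                           /* no suitable UTF-8 locale */

    mbtowc(NULL, NULL, 0);                  /* reset the conversion state */
    while (*s != '\0') {
        len = mbtowc(&wc, s, MB_CUR_MAX);
        if (len <= 0)
            break;                          /* invalid or incomplete sequence */
        printf("character %lc uses %d byte(s)\n", wc, len);
        s += len;
    }
    return 0;
}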
Since you are apparently working with a UTF-8 file, the answer will depend upon your execution platform:
If you're on Linux, setlocale(LC_CTYPE, "en_US.UTF-8") or something similar should work, but the important part is the UTF-8 at the end! The language shouldn't matter. You can verify it worked by using
if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
    abort();
}
That will stop the program if the locale could not be set; any code that runs after that check can assume the locale was set correctly.
If you're on Windows, you can instead open the file using fopen("myfile.txt", "rt, ccs=UTF-8"). However, this isn't entirely portable to other platforms. It is a lot cleaner than the alternatives, though, which is likely more important in this particular case.
If you're using FreeBSD or another system that doesn't allow you to use either approach (e.g. there are no UTF-8 locales), you'd need to parse the bytes manually or use a library to convert them for you. If your implementation has an iconv() function, you might be able to use it to convert from UTF-8 to ISO-8859-9 to use your special characters as single bytes.
Once you're ready to read the file, you can use fgetws with a wchar_t array.
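A minimal sketch of that reading step on the locale-based path described above; the file name is only a placeholder:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t line[9000];
    FILE *f;

    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
        return 1;                        /* no suitable UTF-8 locale */

    f = fopen("myfile.txt", "r");        /* placeholder file name */
    if (f == NULL)
        return 1;

    while (fgetws(line, 9000, f) != NULL) {
        /* each element of line is now a whole character (wchar_t), so */
        /* indexing like string[c] sees letters such as ğ as one unit  */
    }

    fclose(f);
    return 0;
}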
Another problem is checking if one of your non-ASCII characters was detected. You could do something like this:
// lower = "abcdefghijklmnopqrstuvwxyzçöüğı"
// upper = "ABCDEFGHİJKLMNOPQRSTUVWXYZÇÖÜĞI"
const wchar_t lower[] = L"abcdefghijklmnopqrstuvwxyz\u00E7\u00F6\u00FC\u011F\u0131";
const wchar_t upper[] = L"ABCDEFGH\u0130JKLMNOPQRSTUVWXYZ\u00C7\u00D6\u00DC\u011EI";
const wchar_t *lchptr = wcschr(lower, string[c]);
const wchar_t *uchptr = wcschr(upper, string[c]);
if (lchptr) {
    count[(size_t)(lchptr-lower)]++;
    bahar++;
} else if (uchptr) {
    count[(size_t)(uchptr-upper)]++;
    bahar++;
}
That code assumes you're counting characters without regard for case (case insensitive). That is, ı (\u0131) and I are considered the same character (count[30]++), just like İ (\u0130) and i are considered the same (count[8]++). I won't claim to know much about the Turkish language, but I used what little I understand about Turkish casing rules when I created the uppercase and lowercase strings.
Edit
As @JonathanLeffler mentioned in the question's comments, a better solution would be to use something like isalpha (or in this case, iswalpha) on each character in string, instead of my lower and upper strings of valid characters. That, however, would only tell you that the character is alphabetic; it wouldn't tell you which index of your count array to use. There is no universal way to derive that index, because some languages use only a few characters with diacritic marks rather than an entire contiguous range where you could just do string[c] >= L'à' && string[c] <= L'ç'. In other words, even once you have read the data, you still need to map it onto your solution, and that requires knowledge of the character set you're working with to build a mapping from characters to integer values. My code does this by using strings of valid characters, where the index of each character in the string is the index used for the count array (i.e. lower[29] means count[29]++ is executed, and upper[18] means count[18]++ is executed).
The solution depends on the character encoding of your files.
If the file is in ISO 8859-9 (Latin-5), then each special character is still encoded in a single byte, and you can modify your code easily: you already have a distinction between upper case and lower case, so just add more branches for the special characters.
If the file is in UTF-8, or some other unicode encoding, you need a multi-byte capable string library.
I'm making a program that converts a character entered by the user from lowercase to uppercase, and the other way around too. I'm using a function to convert the character to lowercase or uppercase based on the ASCII table. Lowercase to uppercase is being converted correctly, but uppercase to lowercase is not.
char changeCapitalization(char n)
{
    //uppercase to lowercase
    if(n>=65 && n<=90)
        n=n+32;
    //lowercase to uppercase
    if(n>= 97 && n<=122)
        n=n-32;
    return n;
}
What the others are essentially saying is you want something like this ('else if' instead of 'if' on the lower to upper logic):
char changeCapitalization(char n)
{
    if(n>=65 && n<=90) //uppercase to lowercase
        n=n+32;
    else if(n>= 97 && n<=122) //lowercase to uppercase
        n=n-32;
    return n;
}
Change the line
if(n>= 97 && n<=122)
to
else if(n>= 97 && n<=122)
because otherwise both conditions are checked in sequence, and a character that was just converted to lowercase is immediately converted back to uppercase.
Two if statements in sequence are executed - well - in sequence. So if you have an uppercase character, it will first be converted to lowercase, and afterwards the next if statement will convert it back to uppercase. When you want to check the second condition only if the first one wasn't true, put else in front of the second if.
Also, rather than using the ASCII codes directly, you can compare characters to each other: if (n >= 'A' && n <= 'Z').
Later, when you're more comfortable with programming and start doing bigger projects, you should use the language's built-in functions for working with strings and characters, such as islower() and isupper() - and if you need to support any non-English characters, you should read this great article on the intricacies of encoding international characters.
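For reference, here is a minimal sketch of the same function written with the ctype.h routines; it is one possible version, not the only way to do it:

#include <ctype.h>

/* Swap the case of an ASCII letter; any other character is returned unchanged. */
char changeCapitalization(char n)
{
    if (isupper((unsigned char)n))
        return (char)tolower((unsigned char)n);
    if (islower((unsigned char)n))
        return (char)toupper((unsigned char)n);
    return n;
}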
I'm currently reading in a file and storing it in a C string. I'm using strtok to parse out the first few strings I'm interested in. After that, the substrings could be numbers (500, 150, 30) or character combinations (P(4), K(5)). Is there an easy method in the string library to differentiate between numbers and letters?
Thanks for the answers guys!
If you are sure that there are no other symbols (##$%^%&*^), you can use the isalpha() function.
Usage:
isalpha(p); // returns nonzero if p is alphabetic and zero otherwise
Also note that you should include ctype.h.
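For instance, a small sketch that classifies each strtok token by its first character; the input line is just a placeholder:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[] = "500 150 P(4) K(5)";          /* placeholder input */
    char *tok = strtok(line, " ");

    while (tok != NULL) {
        if (isdigit((unsigned char)tok[0]))
            printf("number: %s\n", tok);
        else if (isalpha((unsigned char)tok[0]))
            printf("letters: %s\n", tok);
        tok = strtok(NULL, " ");
    }
    return 0;
}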
You are probably looking for the isalpha and isdigit library functions.
Well, if you're reading a stream of bytes and want to differentiate between numbers and letters, the following can be done:
#include <stdbool.h>

// returns true if the given char is a letter, false otherwise
bool is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
which is easy enough to implement where it is needed. If you really want a library function, you can still use isalpha() or isdigit() from ctype.h, which basically should do the same thing.
N.B.: you might want to choose between bool or unsigned short. I won't enter into that debate.
I am pulling information from a binary file in C, and one of my strings is coming out as \\b\\3777\\375\\v\\177 in GDB. I want to be able to filter this sort of useless data out of my output in a non-specific way, i.e. anything that doesn't start with a number or letter should be kicked out. How can this be achieved?
The data is being buffered into a struct n bytes at a time, and I am sure that this information is correct based on how data later in the file is being read correctly.
if (isalnum(buf[0])) {
    printf("%s", buf);
}
It sounds a bit like you're reimplementing the Linux utility strings:
For each file given, GNU strings prints the printable character sequences that are at least 4 characters long (or the number given with the options below) and are followed by an unprintable character. By default, it only prints the strings from the initialized and loaded sections of object files; for other types of files, it prints the strings from the whole file.
As the printable ASCII characters are exactly the range 0x20 (' ', space) to 0x7E ('~', tilde), you can use this test:
if( (buf[0] >= 0x20) && ( buf[0] <= 0x7E ) )
{
    printf( "%s", buf );
}
This will accept any string that starts with a printable ASCII character.
Iterate over your bytes, and check the value of each one to see if it is one of the characters that you consider to be valid. I don't know what you consider to be "a integer or char" (i.e. valid values), but you can try comparing the characters to (for example) ensure that:
(c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
The above condition will ensure that the character's ASCII value is either a number (0 through 9) or a capital or lowercase English letter. Then you have to decide what to do when you encounter a character that you don't want. You can either replace the "bad" character with something "safe" (like a space) or you can build up a new string in a separate buffer, containing only the "good" characters.
Note that the above condition will only work for English, doesn't work for accented characters, and all punctuation and whitespace is also excluded. Another possible test would be to see if the character is a printable ASCII character ((c >= 0x20 && c <= 0x7e) || c == 0xa || c == 0xd which also includes punctuation, space and CR/LF). And this doesn't even get started trying to deal with encodings that aren't ASCII-compatible.
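As an illustration of the second option (building a cleaned copy), here is a minimal sketch that keeps only printable ASCII plus CR/LF; the buffer contents are made-up placeholders:

#include <stdio.h>

/* Copy src into dst, keeping only printable ASCII characters, CR and LF. */
void keep_printable(const char *src, char *dst)
{
    while (*src != '\0') {
        unsigned char c = (unsigned char)*src++;
        if ((c >= 0x20 && c <= 0x7e) || c == '\n' || c == '\r')
            *dst++ = (char)c;
    }
    *dst = '\0';
}

int main(void)
{
    char raw[] = "P(4)\b\377K(5)\177";   /* placeholder input with junk bytes */
    char clean[sizeof raw];

    keep_printable(raw, clean);
    printf("%s\n", clean);               /* prints P(4)K(5) */
    return 0;
}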
If I want to convert a single numeric char to its numeric value, for example, if:
char c = '5';
and I want c to hold 5 instead of '5', is it 100% portable doing it like this?
c = c - '0';
I heard that all character sets store the digits in consecutive order, so I assume it is safe, but I'd like to know if there is a library function to do this conversion, and how it is done conventionally. I'm a real beginner :)
Yes, this is a safe conversion. C requires it to work. This guarantee is in section 5.2.1 paragraph 2 of the latest ISO C standard, a recent draft of which is N1570:
Both the basic source and basic execution character sets shall have the following
members:
[...]
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
[...]
In both the source and execution basic character sets, the
value of each character after 0 in the above list of decimal digits shall be one greater than
the value of the previous.
Both ASCII and EBCDIC, and character sets derived from them, satisfy this requirement, which is why the C standard was able to impose it. Note that letters are not contiguous in EBCDIC, and C doesn't require them to be.
There is no library function to do it for a single char; you would need to build a string first:
#include <stdlib.h>

int digit_to_int(char d)
{
    char str[2];
    str[0] = d;
    str[1] = '\0';
    return (int) strtol(str, NULL, 10);
}
You could also use the atoi() function to do the conversion, once you have a string, but strtol() is better and safer.
As commenters have pointed out though, it is extreme overkill to call a function to do this conversion; your initial approach to subtract '0' is the proper way of doing this. I just wanted to show how the recommended standard approach of converting a number as a string to a "true" number would be used, here.
Try this:
char c = '5';
int i = c - '0';
You should be aware that this doesn't perform any validation against the character - for example, if the character was 'a' then you would get 97 - 48 = 49. Especially if you are dealing with user or network input, you should probably perform validation to avoid bad behavior in your program. Just check the range:
if ('0' <= c && c <= '9') {
    i = c - '0';
} else {
    /* handle error */
}
Note that if you want your conversion to handle hex digits you can check the range and perform the appropriate calculation.
if ('0' <= c && c <= '9') {
    i = c - '0';
} else if ('a' <= c && c <= 'f') {
    i = 10 + c - 'a';
} else if ('A' <= c && c <= 'F') {
    i = 10 + c - 'A';
} else {
    /* handle error */
}
That will convert a single hex character, upper or lowercase independent, into an integer.
You can use atoi, which is part of the standard library.
Since you're only converting one character, the function atoi() is overkill. atoi() is useful if you are converting string representations of numbers; the other posts have given examples of this. If I read your post correctly, you are only converting one numeric character, i.e. a character in the range '0' to '9'. In that case, your suggestion to subtract '0' will give you the result you want. The reason this works is that the ASCII values are consecutive (like you said). So, subtracting the ASCII value of '0' (ASCII value 48 - see an ASCII table for values) from a numeric character gives the value of the number. In your example of c = c - '0' where c = '5', what really happens is 53 (the ASCII value of '5') - 48 (the ASCII value of '0') = 5.
When I first posted this answer, I didn't take into consideration your comment about being 100% portable between different character sets. I did some further looking around, and it seems like your answer is still mostly correct. The problem is that you are using a char, which is an 8-bit data type that wouldn't work with all character sets. Read this article by Joel Spolsky for a lot more information on Unicode. In this article, he says that he uses wchar_t for characters. This has worked well for him, and he publishes his web site in 29 languages. So, you would need to change your char to a wchar_t. Other than that, he says that characters with values 127 and below are basically the same. This would include the characters that represent numbers. This means the basic math you proposed should work for what you were trying to achieve.
Yes. This is safe as long as you are using standard ASCII characters, as you are in this example.
Normally, if there's no guarantee that your input is in the '0'..'9' range, you'd have to perform a check like this:
if (c >= '0' && c <= '9') {
    int v = c - '0';
    // safely use v
}
An alternative is to use a lookup table. You get simple range checking and conversion with less (and possibly faster) code:
// one-time setup of an array of 256 integers;
// all slots set to -1 except for ones corresponding
// to the numeric characters
static const int CHAR_TO_NUMBER[] = {
    -1, -1, -1, ...,
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, // '0'..'9'
    -1, -1, -1, ...
};

// Now, all you need is:
int v = CHAR_TO_NUMBER[c];
if (v != -1) {
    // safely use v
}
P.S. I know that this is overkill. I just wanted to present it as an alternative solution that may not be immediately evident.
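If writing out all 256 slots by hand is unappealing, a hedged alternative is to fill the table once at startup; the names below are made up for illustration (the table can no longer be const this way):

static int char_to_number[256];

/* Call once at startup: mark every slot invalid, then map '0'..'9' to 0..9. */
static void init_char_to_number(void)
{
    for (int i = 0; i < 256; i++)
        char_to_number[i] = -1;
    for (int d = 0; d < 10; d++)
        char_to_number['0' + d] = d;
}

/* Lookup is then the same idea: int v = char_to_number[(unsigned char)c]; */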
As others have suggested, but wrapped in a function:
int char_to_digit(char c) {
    return c - '0';
}
Now just use the function. If, down the line, you decide to use a different method (performance, charset differences, whatever), you just need to change the implementation; you won't need to change the callers.
This version assumes that c contains a char which represents a digit. You can check that before calling the function, using ctype.h's isdigit function.
Since the ASCII codes for '0', '1', '2', ... run from 48 to 57, they are contiguous. The arithmetic operations promote the char values to int, so what you are basically doing is:
53 - 48, which stores the value 5, and you can then do any integer operations with it. Note that when converting back from int to char, the compiler gives no error; it just performs a modulo-256 operation to put the value into the char's acceptable range.
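For instance, a tiny sketch of that wrap-around when going back from int to char; exactly how an out-of-range value is reduced is implementation-defined, but on typical 8-bit-char, two's-complement systems it behaves as modulo 256:

#include <stdio.h>

int main(void)
{
    int big = 261;          /* does not fit in an 8-bit char */
    char c = (char)big;     /* typically reduced modulo 256, giving 5 */
    printf("%d\n", c);      /* usually prints 5 */
    return 0;
}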
You can simply use the atol() function:
#include <stdio.h>
#include <stdlib.h>
int main()
{
    const char *c = "5";
    int d = atol(c);
    printf("%d\n", d);
}