Convert a string to the character that it represents - c

const char *lex1 = "\'a\'"; // prints 'a'
const char *lex2 = "\'\\t\'"; // prints '\t'
const char *lex3 = "\'0\'"; // prints '0'
How would I convert from a char * to the representative char? For the above example, the conversion is as follows:
char lex1_converted = 'a';
char lex2_converted = '\t';
char lex3_converted = '0';
I am currently thinking of chains of if-else blocks, but there might be a better approach.
This is a homework problem related to lexical analysis with flex. Please provide the minimum hint required to comprehend the solution. Thanks!
Edit: Looks like I misinterpreted the tokenized strings in my first post :)

const char* lex1="your sample text";
char lex1_converted=*lex1;
This code takes the first char of lex1. To take a different one, offset the pointer:
char lex1_converted=*(lex1+your_desired_index);
//example
char lex1_converted=*(lex1+1);
Alternatively, index it like an array:
char lex1_converted=lex1[your_desired_index];

Please provide the minimum hint required to comprehend the solution.
A key attribute of the conversion is to detect that the source string may lack the expected format.
Below is some untested sample code to get OP started.
Sorry, did not see "homework problem related to lexical analysis with flex" until late.
Non-flex, C only approach:
Pseudo code
// Return 0-255 on success
// Else return -1.
int swamp_convert(s)
Find string length
If length too small or string missing the two '
return fail
if s[1] char is not \
if length is 3, return middle char
return fail
if length is 4 & character after \ is a common escaped character, (search a string)
return corresponding value (table lookup)
Maybe consider octal, hexadecimal, other escaped forms
else return fail
// Return 0-255 on success
// Else return -1.
int swamp_convert(const char *s) {
size_t len = strlen(s);
if (len < 3 || s[0] != '\'' || s[len - 1] != '\'') {
return -1;
}
if (s[1] != '\\') {
return (len == 3) ? s[1] : -1;
}
// Common escaped single characters
const char *escape_chars = "abtnvfr\\\'\"";
const char *e = strchr(escape_chars, s[2]);
if (e && len == 4) {
return "\a\b\t\n\v\f\r\\\'\""[e - escape_chars];
}
// Additional code to handle \0-7, \x, \X, maybe \e ...
return -1;
}
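The "additional code" left open above could be filled in along the following lines. This is only a sketch: the name char_literal_value is made up for illustration, it assumes octal escapes have one to three digits and hex escapes start with a lowercase x, and it rejects any value above 255.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a quoted character literal such as "'a'",
 * "'\\t'", "'\\101'" or "'\\x41'" to its value (0-255), or -1 on failure. */
int char_literal_value(const char *s)
{
    size_t len = strlen(s);
    if (len < 3 || s[0] != '\'' || s[len - 1] != '\'') {
        return -1;
    }
    if (s[1] != '\\') {
        return (len == 3) ? (unsigned char)s[1] : -1;
    }
    if (len < 4) {
        return -1;              /* nothing between the backslash and the quote */
    }
    /* Simple one-character escapes: look up the value by position. */
    static const char names[]  = "abtnvfr\\'\"";
    static const char values[] = "\a\b\t\n\v\f\r\\'\"";
    const char *e = strchr(names, s[2]);
    if (e != NULL && len == 4) {
        return (unsigned char)values[e - names];
    }
    /* Numeric escapes: \ooo (octal, 1-3 digits) or \xhh (hexadecimal). */
    const char *digits = s + 2;
    int base = 8;
    if (*digits == 'x') {
        digits++;
        base = 16;
    }
    size_t ndigits = (size_t)((s + len - 1) - digits);
    if (ndigits == 0 || (base == 8 && ndigits > 3)) {
        return -1;
    }
    for (size_t i = 0; i < ndigits; i++) {
        int c = (unsigned char)digits[i];
        if (base == 8 ? (c < '0' || c > '7') : !isxdigit(c)) {
            return -1;
        }
    }
    long v = strtol(digits, NULL, base);   /* stops at the closing quote */
    return (v >= 0 && v <= 255) ? (int)v : -1;
}
```

The digits are validated by hand before strtol is called, so a sign or leading white space inside the quotes is rejected rather than silently accepted.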

Related

How to check string matches format "printf like - %d/..."

I have a dynamic string like "/users/5/10/fnvfnvdjvndfvjvdklchsh" and a dynamic format like "/users/%u/%d/%s". How can I check whether the string matches the format?
By string I mean char[255] or char* str = malloc(x).
I tried to use sscanf, but I don't know the number and types of the arguments. If I do:
int res = sscanf(input, format);
I get a stack overflow. Can I allocate the stack to prevent this?
Example like this:
void* buffer = malloc(1024);
int res = sscanf(input, format, buffer);
I would like to have a function like this:
bool stringMatches(const char* format, const char* input);
stringMatches("/users/%u/%d/%s", "/users/5/10/fnvfnvdjvndfvjvdklchsh"); //true
stringMatches("/users/%u/%d/%s", "/users/5/10"); //false
stringMatches("/users/%u/%d/%s", "/users/-10/10/aaa"); //false %u is unsigned
Do you see any solution?
Thanks in advance.
I don't think that there is a scanf-like matching function in the standard lib, so you will have to write your own. Replicating all details of the scanf behaviour is difficult, but it's probably not necessary.
If you allow only % and a limited selection of single format identifiers without size, width and precision information, the code isn't terribly complex:
bool stringMatches(const char *format, const char *input)
{
while (*format) {
if (*format == '%') {
format++;
switch(*format++) {
case '%': {
if (*input++ != '%') return false;
}
break;
case 'u':
if (*input == '-') return false;
// continue with 'd' case
case 'd': {
char *end;
strtol(input, &end, 0);
if (end == input) return false;
input = end;
}
break;
case 's': {
if (isspace((uint8_t) *input)) return false;
while (*input && !isspace((uint8_t) *input)) input++;
}
break;
default:
return false;
}
} else {
if (*format++ != *input++) return false;
}
}
return (*input == '\0');
}
Some notes:
I've parsed the numbers with strtol. If you want to include floating-point number formats, you could use strtod for that, if your embedded system provides it. (You could also parse stretches of isdigit() chars as valid numbers.)
The 'u' case falls through to the 'd' case here. Like strtol, the function strtoul (which parses an unsigned long) accepts a minus sign, so that case is caught explicitly. (But the way it is caught, it won't allow leading white space.)
You could implement your own formats or re-interpret existing ones. For example you could decide that you don't want leading white space for numbers or that a string ends with a slash.
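As an illustration of the last note, here is a sketch of a variant with re-interpreted formats. The name path_matches and the exact rules are assumptions for this example: numbers are plain runs of digits with no leading white space (plus an optional minus sign for %d), and %s is one or more characters up to the next slash.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns the position after a run of one or more digits, or NULL if none. */
static const char *skip_digits(const char *p)
{
    const char *q = p;
    while (isdigit((unsigned char)*q)) q++;
    return (q == p) ? NULL : q;
}

/* Hypothetical matcher: %u = digits, %d = optional '-' then digits,
 * %s = one or more characters other than '/', %% = literal '%'. */
bool path_matches(const char *format, const char *input)
{
    while (*format) {
        if (*format != '%') {
            if (*format++ != *input++) return false;
            continue;
        }
        format++;                           /* skip the '%' */
        switch (*format++) {
        case '%':
            if (*input++ != '%') return false;
            break;
        case 'd':
            if (*input == '-') input++;
            input = skip_digits(input);
            if (input == NULL) return false;
            break;
        case 'u':
            input = skip_digits(input);
            if (input == NULL) return false;
            break;
        case 's':
            if (*input == '\0' || *input == '/') return false;
            while (*input && *input != '/') input++;
            break;
        default:
            return false;                   /* unsupported or missing specifier */
        }
    }
    return *input == '\0';
}
```

With these rules "/users/5/10/a/b" no longer matches "/users/%u/%d/%s", because %s stops at the slash and the rest of the input is left over.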
It's a rather tricky one. I don't think C has very useful built in functions that will help you.
What you could do is using a regex. Something like this:
#include <sys/types.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
regex_t regex;
/* REG_EXTENDED: in a POSIX basic regex, + would be a literal character */
if (regcomp(&regex, "/users/[[:digit:]]+", REG_EXTENDED)) {
fprintf(stderr, "Error\n");
exit(1);
}
const char *mystring = "/users/5/10/fnvfnvdjvndfvjvdklchsh";
if (regexec(&regex, mystring, 0, NULL, 0) == 0)
printf("Match\n");
regfree(&regex);
return 0;
}
The regex in the code above does not suit your example. I just used something to show the idea. I think it would correspond to the format string "/users/%u" but I'm not sure. Nevertheless, I think this is one of the easiest ways to tackle this problem.
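For the exact format in the question, a hand-written pattern might look like the sketch below. The function name and the pattern are my own assumptions about what "/users/%u/%d/%s" should accept: an unsigned decimal number, an optionally negative decimal number, and a run of non-space characters, anchored at both ends. Translating an arbitrary format string into a regex automatically is not shown.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if input looks like "/users/%u/%d/%s". */
bool matches_users_path(const char *input)
{
    regex_t regex;
    const char *pattern = "^/users/[0-9]+/-?[0-9]+/[^[:space:]]+$";

    /* REG_NOSUB: we only need a yes/no answer, not the match positions. */
    if (regcomp(&regex, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return false;
    }
    bool ok = (regexec(&regex, input, 0, NULL, 0) == 0);
    regfree(&regex);
    return ok;
}
```

Compiling the pattern on every call keeps the example short; real code that matches many strings would compile once and reuse the regex_t.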
The easiest is to just try parsing it with sscanf, and see if the scan succeeded.
char * str = "/users/5/10/fnvfnvdjvndfvjvdklchsh";
unsigned int tmp_u;
int tmp_d;
char tmp_s[256];
int n = sscanf (str, "/users/%u/%d/%255s", &tmp_u, &tmp_d, tmp_s); /* the width keeps %s inside tmp_s */
if (n!=3)
{
/* Match failed */
}
Just remember that you don't have to match everything in one go. You can use the %n format specifier to get the number of bytes parsed, and increment the string for the next parse.
This example abuses the fact that bytes_parsed will not be modified if the parsing doesn't reach the %n specifier:
char * str = "/users/5/10/fnvfnvdjvndfvjvdklchsh";
int bytes_parsed = 0;
/* parse prefix */
sscanf(str, "/users/%n", &bytes_parsed);
if (bytes_parsed == 0)
{
/* Parse error */
}
str += bytes_parsed; /* str = "5/10/fnvfnvdjvndfvjvdklchsh"; */
bytes_parsed = 0;
/* Parse next num */
unsigned int tmp_u;
sscanf(str, "%u%n", &tmp_u, &bytes_parsed);
if (bytes_parsed)
{
/* Number was an unsigned, do something */
}
else
{
/* First number was not an `unsigned`, so we try parsing it as signed */
int tmp_d;
sscanf(str, "%d%n", &tmp_d, &bytes_parsed);
if (bytes_parsed)
{
/* Number was a signed int, do something */
}
}
if (!bytes_parsed)
{
/* failed parsing number */
}
str += bytes_parsed; /* str = "/10/fnvfnvdjvndfvjvdklchsh"; */
......
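The same %n idea can also confirm that the whole input was consumed, which a bare check of the return value does not. The sketch below is for the fixed format from the question only; the function name is invented, and the explicit digit check is there because %u on its own accepts a leading minus sign.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if input matches "/users/%u/%d/%s" exactly. */
bool matches_users_format(const char *input)
{
    static const char prefix[] = "/users/";
    const size_t plen = sizeof prefix - 1;
    unsigned int u;
    int d;
    char s[256];
    int end = -1;           /* stays -1 unless sscanf reaches the final %n */

    if (strncmp(input, prefix, plen) != 0) {
        return false;
    }
    if (!isdigit((unsigned char)input[plen])) {
        return false;       /* rejects "-10" and leading white space for %u */
    }
    if (sscanf(input + plen, "%u/%d/%255s%n", &u, &d, s, &end) != 3) {
        return false;
    }
    /* Everything must have been consumed, so the next byte is the terminator. */
    return end >= 0 && input[plen + end] == '\0';
}
```

Note that %d still skips leading white space here, so "/users/5/ 10/aaa" would match; tighten that by hand if it matters.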

how to cut out a chinese words & english words mixture string in c language

I have a string that contains both Mandarin and English words in UTF-8:
char *str = "你a好测b试";
If you use strlen(str), it will return 14, because each Mandarin character uses three bytes, while each English character uses only one byte.
Now I want to copy the leftmost 4 characters ("你a好测"), and append "..." at the end, to give "你a好测...".
If the text were in a single-byte encoding, I could just write:
strncpy(buf, str, 4);
strcat(buf, "...");
But 4 characters in UTF-8 isn't necessarily 4 bytes. For this example, it will be 10 bytes: three each for 你, 好 and 测 and one for a. So, for this specific case, I would need
strncpy(buf, str, 10);
strcat(buf, "...");
If I had a wrong value for the length, I could produce a broken UTF-8 stream with an incomplete character. Obviously I want to avoid that.
How can I compute the right number of bytes to copy, corresponding to a given number of characters?
First you need to know your encoding. By the sound of it (3 byte Mandarin) your string is encoded with UTF-8.
What you need to do is convert the UTF-8 back to Unicode code points (integers). You can then have an array of integers rather than bytes, so each element of the array will be 1 character, regardless of the language.
You could also use a library of functions that already handle utf8 such as http://www.cprogramming.com/tutorial/utf8.c
http://www.cprogramming.com/tutorial/utf8.h
In particular this function: int u8_toucs(u_int32_t *dest, int sz, char *src, int srcsz); might be very useful, it will create an array of integers, with each integer being 1 character. You can then modify the array as you see fit, then convert it back again with int u8_toutf8(char *dest, int sz, u_int32_t *src, int srcsz);
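If pulling in a library is not an option, the decoding step can be sketched by hand. The function below is my own illustration, not the u8_toucs implementation from the linked files; it only checks the shape of each sequence (lead byte plus continuation bytes) and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes up to max code points from the NUL-terminated
 * UTF-8 string src into dest. Returns the number decoded, or -1 if a byte
 * sequence is malformed. */
int utf8_to_codepoints(const char *src, uint32_t *dest, int max)
{
    const unsigned char *p = (const unsigned char *)src;
    int count = 0;

    while (*p && count < max) {
        uint32_t cp;
        int extra;              /* number of continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1;         /* stray continuation byte or invalid lead */
        p++;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80) return -1;     /* also catches early NUL */
            cp = (cp << 6) | (uint32_t)(*p++ & 0x3F);
        }
        dest[count++] = cp;
    }
    return count;
}
```

Once the text is an array of code points, "the first 4 characters" is simply the first 4 elements; they still have to be encoded back to UTF-8 before printing.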
I would recommend dealing with this at a higher level of abstraction: either convert to wchar_t or use a UTF-8 library. But if you really want to do it at the byte level, you could count characters by skipping over the continuation bytes (which are of the form 10xxxxxx):
#include <stddef.h>
size_t count_bytes_for_chars(const char *s, int n)
{
const char *p = s;
n += 1; /* we're counting up to the start of the subsequent character */
while (*p && (n -= (*p & 0xc0) != 0x80))
++p;
return p-s;
}
Here's a demonstration of the above function:
#include <string.h>
#include <stdio.h>
int main()
{
const char *str = "你a好测b试";
char buf[50];
int truncate_at = 4;
size_t bytes = count_bytes_for_chars(str, truncate_at);
strncpy(buf, str, bytes);
strcpy(buf+bytes, "...");
printf("'%s' truncated to %d characters is '%s'\n", str, truncate_at, buf);
}
Output:
'你a好测b试' truncated to 4 characters is '你a好测...'
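The demonstration trusts that buf is large enough. A bounds-checked wrapper around the same continuation-byte test could look like this sketch; the name truncate_utf8 and its return convention are assumptions made for the example, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies at most max_chars UTF-8 characters of src into
 * dst and appends "..." only if something was cut off. Returns 0 on success,
 * -1 if dst_size is too small (dst is then left untouched). */
int truncate_utf8(char *dst, size_t dst_size, const char *src, size_t max_chars)
{
    const char *p = src;
    size_t chars = 0;

    while (*p) {
        if (((unsigned char)*p & 0xC0) != 0x80) {   /* start of a character */
            if (chars == max_chars) break;
            chars++;
        }
        p++;
    }

    size_t bytes = (size_t)(p - src);
    const char *suffix = (*p != '\0') ? "..." : "";
    if (bytes + strlen(suffix) + 1 > dst_size) {
        return -1;
    }
    memcpy(dst, src, bytes);
    strcpy(dst + bytes, suffix);    /* also writes the terminating '\0' */
    return 0;
}
```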
The Basic Multilingual Plane was designed to contain characters for almost all modern languages. In particular, it does contain Chinese.
So you just have to convert your UTF8 string to a UTF16 one to have each character using one single position. That means that you can just use a wchar_t array or even better a wstring to be allowed to use natively all string functions.
Starting with C++11, the <codecvt> header declares a dedicated converter std::codecvt_utf8 to specifically convert UTF8 narrow strings to wide Unicode ones. I must admit it is not very easy to use, but it should be enough here. Code could be like:
char str[] = "你a好测b试";
std::codecvt_utf8<wchar_t> cvt;
std::mbstate_t state = std::mbstate_t();
wchar_t wstr[sizeof(str)] = {0}; // there will be unused space at the end
const char *end;
wchar_t *wend;
auto cr = cvt.in(state, str, str+sizeof(str), end,
wstr, wstr+sizeof(str), wend);
*wend = 0;
Once you have the wstr wide string, you can convert it to a wstring and use all the C++ library tools, or if you prefer C strings you can use the ws... counterparts of the str... functions.
Pure C solution:
Every byte of a multibyte UTF8 character has its most significant bit set to 1, and the leading bits of the first byte indicate how many bytes make up the codepoint.
The question is ambiguous about the criterion used for cutting; either:
a fixed number of codepoints followed by three dots, which will require a variable size output buffer
a fixed size output buffer, which will impose "whatever you can fit inside"
Both the solutions will require a helper function telling how many chars make the next codepoint:
// Note: the function does NOT fully validate a
// UTF8 sequence, only looks at the first char in it
int codePointLen(const char* c) {
if(NULL==c) return -1;
if( (*c & 0xF8)==0xF0 ) return 4; // 4 ones and one 0
if( (*c & 0xF0)==0xE0 ) return 3; // 3 ones and one 0
if( (*c & 0xE0)==0xC0 ) return 2; // 2 ones and one 0
if( (*c & 0x7F)==*c ) return 1; // no ones on msb
return -2; // invalid UTF8 starting character
}
So, solution for the criterion 1 (fixed number of code points, variable output buff size) - does not append ... to the destination, but you can ask "how many chars I need" upfront and if it is longer than you can afford, reserve yourself the extra space.
// returns the number of chars used from the output
// If not enough space or the dest is null, does nothing
// and returns the length required for the output buffer
// Returns negative val if the source is not a valid UTF8
int copyFirstCodepoints(
int codepointsCount, const char* src,
char* dest, int destSize
) {
if(NULL==src) {
return -1;
}
// do a cold run to see if size of the output buffer can fit
// as many codepoints as required
const char* walker=src;
for(int cnvCount=0; cnvCount<codepointsCount && *walker; cnvCount++) { // stop at the end of the source
int chCount=codePointLen(walker);
if(chCount<0) {
return chCount; // err
}
walker+=chCount;
}
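if(walker-src < destSize && NULL!=dest) {
// enough space at destination (including the terminating '\0')
strncpy(dest, src, walker-src);
dest[walker-src]='\0'; // strncpy does not terminate a partial copy
}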
// else do nothing
return walker-src;
}
Second criterion (limited buffer size): just use the first one with the number of codepoints returned by this one
// Returns how many whole codepoints fit in a buffer of maxBufflen chars,
// keeping one char free for the terminating '\0'.
// On a UTF encoding error returns a negative value, -(offset+1),
// where offset is the position of the error in src
int howManyCodepointICanFitInOutputBufferOfLen(const char* src, int maxBufflen) {
if(NULL==src) {
return -1;
}
int ret=0;
const char* walker=src;
while(*walker) {
int advance=codePointLen(walker);
if(advance<0) {
return -(int)(walker-src)-1; // err because negative, but indicating the err pos
}
// look on all the chars after the first one in this codepoint
// if any is 0, we have a premature end of the source
for(int i=1; i<advance; i++) {
if(0==walker[i]) {
return -(int)(walker-src+i)-1; // err because negative, but indicating the err pos
}
}
if((walker-src)+advance >= maxBufflen) {
break; // this codepoint plus the terminator would not fit
}
walker+=advance; // walker is set on the correct position for the next attempt
ret++;
}
return ret;
}
static char *CutStringLength(char *lpszData, int nMaxLen)
{
if (NULL == lpszData || 0 >= nMaxLen)
{
return "";
}
int len = strlen(lpszData);
if(len <= nMaxLen)
{
return lpszData;
}
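static char strTemp[1024]; // static: the returned pointer must stay valid after the call
strncpy(strTemp, lpszData, sizeof(strTemp) - 1);
strTemp[sizeof(strTemp) - 1] = '\0';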
char *p = strTemp;
p = p + (nMaxLen-1);
if ((unsigned char)(*p) < 0xA0)
{
*(++p) = '\0'; // if the last byte is Mandarin character
}
else if ((unsigned char)(*(--p)) < 0xA0)
{
*(++p) = '\0'; // if the last but one byte is Mandarin character
}
else if ((unsigned char)(*(--p)) < 0xA0)
{
*(++p) = '\0'; // if the last but two byte is Mandarin character
}
else
{
int i = 0;
p = strTemp;
while(*p != '\0' && i+2 <= nMaxLen)
{
if((unsigned char)(*p++) >= 0xA0 && (unsigned char)(*p) >= 0xA0)
{
p++;
i++;
}
i++;
}
*p = '\0';
}
printf("str = %s\n",strTemp);
return strTemp;
}

How to check if a particular string is a numeric value or character value?

I would like to understand how to validate a string input and check whether the entered string is numeric or not. I believe the isdigit() function is the right way to do it, but I can only get it to work with a single char; when it comes to a string, the function isn't helping me. This is what I have got so far. Could anyone please guide me on how to validate a full string like
char *var1 = "12345" and char *var2 = "abcd"
#include <stdio.h>
#include <ctype.h>
int main()
{
char *var1 = "hello";
char *var2 = "12345";
if( isdigit(var1) )
{
printf("var1 = |%s| is a digit\n", var1 );
}
else
{
printf("var1 = |%s| is not a digit\n", var1 );
}
if( isdigit(var2) )
{
printf("var2 = |%s| is a digit\n", var2 );
}
else
{
printf("var2 = |%s| is not a digit\n", var2 );
}
return(0);
}
The program seems to be working fine when the variables are declared and initialized as below,
int var1 = 'h';
int var2 = '2';
But i would like to understand how to validate a full string like *var =" 12345";
Try to make a loop on each string and verify each char alone
isdigit takes a single char, not a char*. If you want to use isdigit, add a loop to do the checking. Since you are planning to use it in several places, make it into a function, like this:
int all_digits(const char* str) {
while (*str) {
if (!isdigit((unsigned char) *str++)) {
return 0;
}
}
return 1;
}
The loop above will end when null terminator of the string is reached without hitting the return statement in the middle, in other words, when all characters have passed the isdigit test.
Note that passing all_digits does not mean that the string represents a value of any supported numeric type, because the length of the string is not taken into account. Therefore, a very long string of digits would return true for all_digits, but if you try converting it to int or long long you would get an overflow.
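If the goal is a string that really fits in an int, the range check can be folded into the conversion itself. A sketch, with a made-up helper name; unlike all_digits it also accepts a sign and leading white space, because strtol does.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parses text as a decimal int. Returns 1 and stores
 * the value in *out on success; returns 0 if text is empty, has trailing
 * junk, or does not fit in an int. */
int parse_int(const char *text, int *out)
{
    char *end;

    errno = 0;
    long v = strtol(text, &end, 10);
    if (end == text || *end != '\0') {
        return 0;               /* no digits at all, or leftover characters */
    }
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX) {
        return 0;               /* overflowed long, or too big for int */
    }
    *out = (int)v;
    return 1;
}
```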
Use this
int isNumber(const char *const text)
{
char *endptr;
if (text == NULL)
return 0;
strtol(text, &endptr, 10);
return (endptr != text && *endptr == '\0'); /* endptr == text: nothing was converted, e.g. "" */
}
then
if (isNumber(var1) == 0)
printf("%s is NOT a number\n", var1);
else
printf("%s is number\n", var1);
The strtol() function will ignore leading whitespace characters.
If a character that cannot be converted is found, the conversion stops, and endptr will point to that character after return, thus checking for *endptr == '\0' will tell you if you are at the end of the string, meaning that all characters were successfully converted.
If you want to consider leading whitespace as invalid characters too, then you could just write this instead
int isNumber(const char *text)
{
char *endptr;
if (text == NULL)
return 0;
if ((*text == '\0') || (isspace((unsigned char) *text) != 0))
return 0; /* reject empty input and leading whitespace */
strtol(text, &endptr, 10);
return (*endptr == '\0');
}
depending on what you need, but skipping leading whitespace characters is to interpret the numbers as if a human is reading them, since humans "don't see" whitespace characters.

Unencoding data from POST (CGI and C)

I was just reading this page http://www.cs.tut.fi/~jkorpela/forms/cgic.html about getting started with CGI in C. I had a question about the code in the unencoding part.
#include <stdio.h>
#include <stdlib.h>
#define MAXLEN 80
#define EXTRA 5
/* 4 for field name "data", 1 for "=" */
#define MAXINPUT MAXLEN+EXTRA+2
/* 1 for added line break, 1 for trailing NUL */
#define DATAFILE "../data/data.txt"
void unencode(char *src, char *last, char *dest)
{
for(; src != last; src++, dest++)
if(*src == '+')
*dest = ' ';
else if(*src == '%') {
int code;
if(sscanf(src+1, "%2x", &code) != 1) code = '?';
*dest = code;
src +=2; }
else
*dest = *src;
*dest = '\n';
*++dest = '\0';
}
int main(void)
{
char *lenstr;
char input[MAXINPUT], data[MAXINPUT];
long len;
printf("%s%c%c\n",
"Content-Type:text/html;charset=iso-8859-1",13,10);
printf("<TITLE>Response</TITLE>\n");
lenstr = getenv("CONTENT_LENGTH");
if(lenstr == NULL || sscanf(lenstr,"%ld",&len)!=1 || len > MAXLEN)
printf("<P>Error in invocation - wrong FORM probably.");
else {
FILE *f;
fgets(input, len+1, stdin);
unencode(input+EXTRA, input+len, data);
f = fopen(DATAFILE, "a");
if(f == NULL)
printf("<P>Sorry, cannot store your data.");
else
fputs(data, f);
fclose(f);
printf("<P>Thank you! Your contribution has been stored.");
}
return 0;
}
I was wondering exactly how these lines:
else if(*src == '%') {
int code;
if(sscanf(src+1, "%2x", &code) != 1) code = '?';
*dest = code;
src +=2; }
convert something like %21 back into the exclamation mark?
Thanks!
else if(*src == '%') {
int code;
if(sscanf(src+1, "%2x", &code) != 1) code = '?';
*dest = code;
src +=2;
}
If the current character is a %, sscanf() is used to parse the hexadecimal characters that follow it. The "%x" format converts hexadecimal characters to an integer value (in this case, a character code), and the 2 specifies a maximum field width, so that it consumes at most 2 characters.
The return value of sscanf() indicates the number of successful conversions, so if it doesn't return 1, it didn't find a valid hexadecimal number.
Then the character code is assigned to *dest, and the src pointer is advanced to point to the next character after the %xx sequence.
There are actually three bugs here:
The "%x" format specifier expects an argument of type unsigned int *. A signed int * was passed which, I believe, invokes undefined behaviour. Variadic functions (such as sscanf()) have unusual ways of passing the arguments, and it is required that the format specifier matches the type of the argument.
However, the two types are similar enough that it will probably work just fine in practice.
It also accepts signed hexadecimal numbers (with a + or - character), which is probably not what the author intended.
For example, "%-ffText" would result in code == -15.
The src pointer is advanced by 2 bytes, but sscanf() doesn't necessarily consume 2 characters.
"%fText" would result in code == 15, and consume only one character (other than the % character). The example above would consume 3 characters.
The sscanf function translates 2 hex chars into a single int value. This value is equal to the ASCII value, and is therefore stored in 'dest'. Since 2 chars have been decoded, src has to advance two positions.
So '%21' -> 0x21 -> char '!'
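The three problems listed earlier go away if the two hex digits are checked and converted by hand instead of with sscanf. The following is a sketch rather than a drop-in replacement for unencode: url_decode is a name I made up, it takes an explicit length, it writes '?' for a malformed % sequence (as the original does), and it assumes an ASCII-compatible character set.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int hex_value(unsigned char c)
{
    if (isdigit(c)) return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical helper: decodes src[0..len) into dest, which must have room
 * for len + 1 bytes. Returns the decoded length. */
size_t url_decode(const char *src, size_t len, char *dest)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c == '+') {
            dest[out++] = ' ';
        } else if (c != '%') {
            dest[out++] = (char)c;
        } else if (i + 2 < len
                   && isxdigit((unsigned char)src[i + 1])
                   && isxdigit((unsigned char)src[i + 2])) {
            /* exactly two hex digits: no sign, no short match */
            dest[out++] = (char)(hex_value((unsigned char)src[i + 1]) * 16
                                 + hex_value((unsigned char)src[i + 2]));
            i += 2;
        } else {
            dest[out++] = '?';  /* malformed escape: only the '%' is replaced */
        }
    }
    dest[out] = '\0';
    return out;
}
```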

C : using strlen for string including \0

What I need to do is when given a text or string like
\0abc\n\0Def\n\0Heel\n\0Jijer\n\tlkjer
I need to sort this string using qsort, based on a comparison of the ROT13-encoded characters.
int my_rot_conv(int c) {
if ('a' <= tolower(c) && tolower(c) <= 'z')
return tolower(c)+13 <= 'z' ? c+13 : c-13;
return c;
}
int my_rot_comparison(const void *a, const void *b) {
char* ia = (char*) a;
char* ib = (char*) b;
int i=0;
ia++, ib++;
while (i<strlen(ia)) {
if (ia[i] == '\0' || ia[i] == '\n' || ia[i] == '\t' || ib[i] == '\0' || ib[i] == '\n' || ib[i] == '\t') {
i++;
}
if (my_rot_conv(ia[i]) > my_rot_conv(ib[i])) {
return 1;
} else if (my_rot_conv(ia[i]) < my_rot_conv(ib[i]))
return -1;
}
return 0;
}
I got to the point where I can compare two strings that start with \0, getting -1 in the following example.
printf("%d \n", my_rot_comparison("\0Abbsdf\n", "\0Csdf\n"));
But this wouldn't work for a string with qsort, because ia++, ib++; only works for a one-word comparison.
char *my_arr;
my_arr = malloc(sizeof(\0abc\n\0Def\n\0Heel\n\0Jijer\n\tlkjer));
strcpy(my_arr, \0abc\n\0Def\n\0Heel\n\0Jijer\n\tlkjer);
qsort(my_arr, sizeof(my_arr), sizeof(char), my_rot_comparison);
and the array should be sorted like \0Def\n\0Heel\n\0Jijer\n\0\n\tlkjer
My question is how do I define the comparison function that works for the string that includes \0 and \t and \n characters?
strlen simply cannot operate properly on a string which embeds \0 bytes, since by definition of the function strlen considers the end of the string to be the first encountered \0 byte at or after the beginning of the string.
The rest of the standard C string functions are defined in the same way.
This means that you have to use a different set of functions to manipulate string(-like) data that can include \0 bytes. You will perhaps have to write these functions yourself.
Note that you will probably have to define a structure which has a length member in it, since you won't be able to rely on a particular sentinel byte (such as \0) to mark the end of the string. For example:
typedef struct {
unsigned int length;
char bytes[];
} MyString;
If there is some other byte (other than \0) which is forbidden in your input strings, then (per commenter #Sinn) you can swap it and \0, and then use normal C string functions. However, it is not clear whether this would work for you.
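Applied to the question, the length-carrying idea might look like the sketch below. It is only one possible reading of the task: I assume each record ends at '\n', that records are compared byte by byte after ROT13 (case preserved), and that a pointer plus length pair (the hypothetical Slice type) is acceptable instead of the flexible array member shown above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type: a view of bytes that may contain '\0'. */
typedef struct {
    const char *bytes;
    size_t length;
} Slice;

static int rot13(int c)
{
    if (c >= 'a' && c <= 'z') return 'a' + (c - 'a' + 13) % 26;
    if (c >= 'A' && c <= 'Z') return 'A' + (c - 'A' + 13) % 26;
    return c;
}

/* qsort comparator: orders two Slices by their ROT13-encoded bytes. */
int compare_rot13(const void *a, const void *b)
{
    const Slice *sa = a;
    const Slice *sb = b;
    size_t n = sa->length < sb->length ? sa->length : sb->length;

    for (size_t i = 0; i < n; i++) {
        int ca = rot13((unsigned char)sa->bytes[i]);
        int cb = rot13((unsigned char)sb->bytes[i]);
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    return (sa->length > sb->length) - (sa->length < sb->length);
}

/* Splits buf[0..len) into records that end at '\n' (the last one may not).
 * Returns the number of records written to out (at most max_out). */
size_t split_records(const char *buf, size_t len, Slice *out, size_t max_out)
{
    size_t count = 0, start = 0;

    for (size_t i = 0; i < len && count < max_out; i++) {
        if (buf[i] == '\n' || i + 1 == len) {
            out[count].bytes = buf + start;
            out[count].length = i + 1 - start;
            count++;
            start = i + 1;
        }
    }
    return count;
}
```

The buffer length has to come from sizeof or from the code that read the data, never from strlen; qsort then sorts the array of Slice values, not the characters themselves.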
assuming you use an extra \0 at the end to terminate
int strlenzz(char*s)
{
int length =0;
while(!(*s==0 && *(s+1) == 0))
{
s++;
length++;
}
return length+1;
}
Personally I'd prefer something like danfuzz's suggestion, but for the sake of listing an alternative...
You could use an escaping convention, writing functions to:
"escape" / encode, expanding embedded (but not the terminating) '\0'/NUL to say '\' and '0' (adopting the convention used when writing C source code string literals), and
another to unescape.
That way you can still pass them around as C strings, your qsort/rot comparison code above will work as is, but you should be very conscious that strlen(escaped_value) will return the number of bytes in the escaped representation, which won't equal the number of bytes in the unescaped value when that value embeds NULs.
For example, something like:
void unescape(char* p)
{
char* escaped_p = p;
for ( ; *escaped_p; ++escaped_p)
{
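if (*escaped_p == '\\' && escaped_p[1] == '0')
{
++escaped_p; // step onto the '0' so the loop increment skips it
*p++ = '\0';
continue;
}
*p++ = *escaped_p;
}
*p = '\0'; // terminate at the write position, not the read position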
}
Escaping is trickier, as you need some way to ensure you have enough memory in the buffer, or to malloc a new buffer - either of the logical size of the unescaped_value * 2 + 1 length as an easy-to-calculate worst-case size, or by counting the NULs needing escaping and sizing tightly to logical-size + #NULs + 1....
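A sketch of the escaping direction, using the easy worst-case size. The name escape_nuls is invented, and the sketch assumes the data contains no backslashes of its own; if it can, the backslash needs escaping too, otherwise the result is ambiguous.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd C string in which every '\0' in
 * data[0..len) is written as a backslash followed by '0'. The caller frees
 * the result. Returns NULL if the allocation fails. */
char *escape_nuls(const char *data, size_t len)
{
    char *out = malloc(len * 2 + 1);    /* worst case: every byte is a NUL */
    if (out == NULL) {
        return NULL;
    }

    char *p = out;
    for (size_t i = 0; i < len; i++) {
        if (data[i] == '\0') {
            *p++ = '\\';
            *p++ = '0';
        } else {
            *p++ = data[i];
        }
    }
    *p = '\0';
    return out;
}
```

The length of the original data must be passed in explicitly, since strlen would stop at the first embedded NUL.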
