I have written a program in C that works well; it converts non-printable ASCII characters to their control-character names. I would appreciate it if a C master would show me a better way of doing it than what I currently have, mainly this section:
if (isascii(ch)) {
    switch (ch) {
    case 0:
        printControl("NUL");
        break;
    case 1:
        printControl("SOH");
        break;
    /* .. etc (32 in total) */
    default:
        putchar(ch);
        break;
    }
}
Is it normal to make a switch that big? Or should I be using some other method (a lookup driven by an ASCII table?)
If you're always doing the same operation (e.g., putchar), you can just statically initialize an array that maps each character to whatever it should map to. You can then look up the proper mapping by indexing the array with the incoming character.
For example (in pseudo-code -- it's been a while since I wrote C), you would define:
const char *map[] = {"NUL", "SOH", /* ... */};
and then index into it via something like:
const char *val = map[(unsigned char)ch];
to get your value.
You would not be able to use this if your "from" values are not sequential; in that case, you would need to have some conditional blocks. But if you can leverage the sequentiality, you should.
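To make that concrete, here is a minimal runnable sketch; print_char and the <...> output format are my own stand-ins for the question's dispatch logic, not anything prescribed:

#include <stdio.h>

/* Names for the 33 sequential control codes 0x00..0x20 ("SP" for space). */
static const char *control_names[33] = {
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS",  "HT",  "LF",  "VT",  "FF",  "CR",  "SO",  "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM",  "SUB", "ESC", "FS",  "GS",  "RS",  "US",
    "SP"
};

void print_char(int ch)
{
    if (ch >= 0 && ch <= 0x20)
        printf("<%s>", control_names[ch]);  /* codes are sequential, so direct indexing works */
    else
        putchar(ch);
}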
Too many years ago when assembly languages for 8-bit micros were how I spent my time, I would have written something like
printf("%3.3s",
("NULSOHSTXETXEOTENQACKBELBS HT LF VT FF CR SO SI "
"DLEDC1DC2DC3DC4NAKSYNETBCANEM SUBESCFS GS RS US ")[3*ch]);
but not because it's particularly better. And the multiply by three is annoying, because 8-bit micros don't multiply, so it would have required both a shift and an add, as well as a spare register.
A much more C-like result would be to use a table with four bytes per control, with the NUL bytes included. That allows each entry to be referred to as a string constant, but saves the extra storage for 32 pointers.
#include <stdio.h>   /* for EOF */

const char *charname(int ch) {
    if (ch >= 0 && ch <= 0x20)
        return ("NUL\0" "SOH\0" "STX\0" "ETX\0"      /* 00..03 */
                "EOT\0" "ENQ\0" "ACK\0" "BEL\0"      /* 04..07 */
                "BS\0\0" "HT\0\0" "LF\0\0" "VT\0\0"  /* 08..0B */
                "FF\0\0" "CR\0\0" "SO\0\0" "SI\0\0"  /* 0C..0F */
                "DLE\0" "DC1\0" "DC2\0" "DC3\0"      /* 10..13 */
                "DC4\0" "NAK\0" "SYN\0" "ETB\0"      /* 14..17 */
                "CAN\0" "EM\0\0" "SUB\0" "ESC\0"     /* 18..1B */
                "FS\0\0" "GS\0\0" "RS\0\0" "US\0\0"  /* 1C..1F */
                "SP\0\0") + (ch << 2);               /* 20 */
    if (ch == 0x7f)
        return "DEL";
    if (ch == EOF)
        return "EOF";
    return NULL;
}
I've tried to format the main table so its organization is clear. The function returns NULL for characters that name themselves, or are not 7-bit ASCII. Otherwise, it returns a pointer to a NUL-terminated ASCII string containing the conventional abbreviation of that control character, or "EOF" for the non-character EOF returned by C standard IO routines on end of file.
Note the effort taken to pad each character name slot to exactly four bytes. This is a case where building this table with a scripting language or a separate program would be a good idea. In that case, the simple answer is to build a 129-entry table (or 257-entry) containing the names of all 7-bit ASCII (or 8-bit extended in your preferred code page) characters with an extra slot for EOF.
See the sources to the functions declared in <ctype.h> for a sample of handling the extra space for EOF.
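For illustration, here is a minimal sketch of that "extra slot" technique: the table gets one extra leading entry and is indexed with ch + 1, so that EOF lands on slot 0. The name charname_eof is hypothetical, and the sketch assumes EOF == -1, which holds on virtually every implementation (the standard only guarantees it is negative):

#include <stdio.h>

static const char *const names[34] = {
    "EOF",                                   /* slot 0: EOF maps here via ch + 1 */
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS",  "HT",  "LF",  "VT",  "FF",  "CR",  "SO",  "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM",  "SUB", "ESC", "FS",  "GS",  "RS",  "US",
    "SP"
};

const char *charname_eof(int ch)
{
    if (ch == EOF || (ch >= 0 && ch <= 0x20))
        return names[ch + 1];     /* EOF == -1 lands on slot 0 */
    return NULL;                  /* characters that name themselves */
}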
You can make a switch this big but it does become a bit difficult to manage.
The way I would approach this is to build an array of structs holding a char c; and a char *ctrl; for each item. Then you can just loop through the array, which makes the data a little easier to maintain.
Note that if you use every character in a particular range (for example, character 0 through 32), then your array would only need the name and it wouldn't be necessary to store the character value.
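As an illustration, here is a minimal sketch of that approach; the entries shown are examples only (extend from an ASCII table), and lookup_ctrl is a name I made up:

#include <stddef.h>

struct ctrl_entry { char c; const char *ctrl; };

/* One {code, name} pair per entry; a linear search keeps the data easy to
   maintain, and non-sequential codes (like DEL) are no problem here. */
static const struct ctrl_entry ctrl_table[] = {
    { 0x00, "NUL" },
    { 0x01, "SOH" },
    { 0x7f, "DEL" },
};

const char *lookup_ctrl(char ch)
{
    for (size_t i = 0; i < sizeof ctrl_table / sizeof ctrl_table[0]; i++)
        if (ctrl_table[i].c == ch)
            return ctrl_table[i].ctrl;
    return NULL;   /* not a named control character */
}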
I would say build a table with the values (0-32) and their corresponding control strings ("NUL", "SOH", ...). (In this case the table requires just an array.)
Then you can simply check whether the character is in range and index into the table to get the string to pass to your printControl() function.
What I am trying to ask is: when you are coding a cipher and asking for user input, how would you go about changing the characters in the string to numbers, so that you can plug them into a formula and then get another letter out?
string s = get_string("Plain Text:");  // string and get_string come from CS50's <cs50.h>
int a = 0, c, e;
while (s[a] != '\0')
{
    if (isalpha(s[a]))
    {
        for (c = 0, e = strlen(s); c < e; c++)
        {
            if (isupper(s[a]))
            {
                printf("C\n");
                a++;
            }
            if (islower(s[a])) // (x=(?+(argv[1][i]))%26) This is the formula; the ? is where I'm trying to figure out how to change characters into numbers and then back into characters
            {
                printf("c\n");
                a++;
            }
        }
    }
}
The printf calls were there to make sure the code was checking for lower or upper case properly. If I'm not mistaken, I'd need two formulas with different ranges, one for upper case and one for lower case; with those I would be able to plug them both into one cipher formula. I'm pretty sure you could also do without the whole upper/lower check, but I don't really understand how you would go about creating the range or the loop so that it always stays within alphabetical characters, whether capital or lower case.
The char type is an 8-bit integer (on virtually every platform), so a string is an array of 8-bit integers.
Because you mention a formula that applies c % 26 to the character values, I'm assuming that you are only interested in ASCII text for this exercise. If you care about non-ASCII encodings, the whole approach will need to change. You can't do byte-by-byte analysis with strings in general. For example, in UTF8, some glyphs will span more than one byte, and in UTF16, each glyph is at least two bytes.
The upper-case A–Z characters are in the range 65..90 and the lower-case a-z characters are in the range 97..122. So you will want to get those values in the range 0..25 before applying an arithmetic formula to them.
You can do this like this:
When isupper(s[a]) is true: int c = s[a] - 65;
When islower(s[a]) is true: int c = s[a] - 97;
Then apply your formula.
As an aside: normally, when doing arithmetic with chars, you need to pay special attention to values between 128 and 255, because char may be a signed type and those values can come out negative. And with Unicode strings, some glyphs may span more than one byte each. But because we're only considering ASCII strings, none of that applies here.
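Putting it together, here is a minimal sketch of that mapping; shift_letter and the key value 3 are illustrative stand-ins (the real key would come from argv[1], and key is assumed non-negative):

#include <stdio.h>
#include <ctype.h>

/* Shift one ASCII letter by `key` positions, preserving case.
   Non-letters pass through unchanged. */
char shift_letter(char ch, int key)
{
    if (isupper((unsigned char)ch))
        return (char)('A' + (ch - 'A' + key) % 26);   /* 65..90  -> 0..25 -> back */
    if (islower((unsigned char)ch))
        return (char)('a' + (ch - 'a' + key) % 26);   /* 97..122 -> 0..25 -> back */
    return ch;
}

int main(void)
{
    const char *s = "Hello, World!";
    for (int i = 0; s[i] != '\0'; i++)
        putchar(shift_letter(s[i], 3));
    putchar('\n');   /* prints "Khoor, Zruog!" */
    return 0;
}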
I am making a game where the answer is stored in client_challenges->answer while the client inputs the answer (which is stored in buffer) in the following format:
A: myanswer
If the answer starts with the letter A, then I need to compare myanswer with the pre-stored answer. Using the code below, I get the correct buffer and answer lengths, but if I print out my store array and the answer array, the results differ. For example, if I input A: color, my store gives colo instead of color. However, store-2 works in some cases. How can I fix this?
if (buffer[0] == 'A')
{
    printf("ans len %zu, buff len %zu\n", strlen(client_challenges->answer), strlen(buffer)-4);
    if(strlen(client_challenges->answer) == (strlen(buffer)-4))
    {
        char store[100];
        for (int i = 1; i <= strlen(client_challenges->answer); i++)
        {
            store[i-1] = buffer[2+i];
        }
        store[strlen(store)-2] = '\0';
        //store[strlen(client_challenges->answer)+1]='\0';
        printf("Buffer: <%s>\n", buffer);
        printf("STORE: %s\n", store);
        printf("ANSWER: %s\n", client_challenges->answer);
        if(strcmp(store, client_challenges->answer) == 0)
        {
            send(file_descriptor, correct, strlen(correct), 0);
        }
    }
}
Example:
Client enters
A: Advancement
ans len 11, buff len 11
But when I print out store, it is Advancemen while the answer is Advancement. However, in my previous attempt the answer was soon and I entered "soon"; it worked then.
Although I cannot pinpoint the exact reason for this bug from the given input, I can share my experience on how to find the right spot efficiently.
Always verify your input.
Never trust an input. You only printed the lengths of the inputs; what about the content? You'd better check every byte (preferably in hex) to spot non-printable characters. Some IDEs provide an integrated debugger that can show buffer contents.
Use defines, constants, or other human-readable names instead of magic numbers like 4 or 2. This makes life much easier. For instance,
/* what is 4 here? */
strlen(buffer)-4
should have been:
/* remove "A: " (A, colon, and whitespace; I do not know what the 4th byte is) */
strlen(buffer) - USER_ADDED_HEADERS
Get more familiar with the C library
You actually did not need the store array here. C provides the strncmp function to compare two strings up to size "n", and memcmp to compare two buffers. That would save a copy operation (CPU cycles) and stack memory.
A clearer version of your code fragment (without error checks) could have been written as:
if (buffer[0] == 'A') {
    /* verify input here */
    /* #define ANSWER_START 4 // I do not know what the 4 is */
    /* compare lengths here, if they are not equal return sth accordingly */
    /* supplied answer correct? */
    if (memcmp(client_challenges->answer,
               buffer + ANSWER_START,
               strlen(client_challenges->answer)) == 0) {
        /* do whatever you want here */
    }
}
Consistent code formatting
Code formatting DOES matter. Be consistent with indents, curly braces, tabs vs. spaces, spaces before/after tokens, etc. You do not have to stick to any one style, but you do have to be consistent.
Use a debugger
A debugger is your best friend; learn to use it. The cause of this bug can be identified with a debugger very easily.
How do I check in C if an array of uint8 contains only ASCII elements?
If possible, please point me to the condition that checks whether an element is ASCII or not.
Your array elements are uint8, so they must be in the range 0-255.
The standard ASCII character set uses bytes 0-127, so you can use a for loop to iterate through the array, checking whether each element is <= 127.
If you're treating the array as a string, be aware of the 0 byte (null character), which marks the end of the string.
From your example comment, this could be implemented like this:
int checkAscii (uint8 *array) {
    for (int i = 0; i < LEN; i++) {     /* LEN: the array length from your example comment */
        if (array[i] > 127) return 0;   /* non-ASCII byte found */
    }
    return 1;
}
It breaks out early at the first element greater than 127.
All valid ASCII characters have values 0 to 127, so the test is simply a value check or a 7-bit mask. For example, given the inclusion of stdbool.h:
bool is_ascii = (ch & ~0x7f) == 0 ;
Possibly, however, you intended only printable ASCII characters (excluding control characters). In that case, given the inclusion of ctype.h:
bool is_printable_ascii = (ch & ~0x7f) == 0 &&
                          (isprint(ch) || isspace(ch));
Your intent may be slightly different in terms of which characters you want to include in your set; in that case other functions in ctype.h may be applied, or you can simply test values or ranges to include/exclude.
Note also that the ASCII set is very restricted in international terms. The ANSI or "extended ASCII" sets use locale-specific codepages to define the glyphs associated with codes 128 to 255; that is to say, the set changes depending on language/locale settings to accommodate different characters, accents, and alphabets. In modern systems it is common instead to use a multi-byte Unicode encoding (of which there are several, with either fixed- or variable-length codes). UTF-8 is a variable-width encoding in which every single-byte code is also an ASCII code. As such, while it is trivial to determine whether data is entirely within the ASCII set, it does not follow that the data is therefore text. If the test is intended to distinguish binary data from text, it will fail in a great many scenarios unless you can guarantee a priori that all text is restricted to the ASCII set, and that is application-specific.
You cannot check whether something is "ASCII" with standard C, because C does not specify which character set a compiler uses. Various other, more or less exotic, character sets exist or have existed.
UTF-8, for example, is a superset of ASCII. Older, dysfunctional 8-bit character sets have also existed, such as EBCDIC and "extended ASCII". Telling whether something is, say, ASCII or EBCDIC cannot be done trivially, without a long line of value checks.
With standard C, you can only do the following:
You can check if a character is printable, with the function isprint() from ctype.h.
Or you can check whether it only has its low 7 bits set: if ((ch & 0x7F) == ch).
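One caveat when combining these with the ctype.h functions: their argument must be representable as an unsigned char (or be EOF), so a plain char with the high bit set must be cast first. A minimal sketch combining both checks (the helper name is my own):

#include <ctype.h>
#include <stdbool.h>

/* Passing a plain char with the high bit set to isprint() is undefined
   behaviour on platforms where char is signed; cast through unsigned char. */
bool is_printable_ascii_char(char ch)
{
    unsigned char u = (unsigned char)ch;
    return u <= 0x7F && isprint(u);
}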
In C, a character variable holds the character's ASCII value (an integer between 0 and 127) rather than the character itself.
The ASCII values of the lowercase letters are 97 to 122, and the ASCII values of the uppercase letters are 65 to 90.
Instead of giving the actual code, here is an example.
You can assign int to char directly.
int a = 47;
char c = a;
printf("%c", c);
And this will also work.
printf("%c", a); // a is in valid range
Another approach:
An integer can be assigned directly to a character. A character is different mostly in how it is interpreted and used.
char c = atoi("47");
Try to implement this once you understand the logic above properly.
I have a file encoded in UTF-8, as it is shown by the following command :
file -i D.txt
D.txt: text/plain; charset=utf-8
I just want to display each character one after one, so I have done this :
FILE * F_entree = fopen("D.txt", "r");
if (! F_entree) usage("cannot open the input file");
char ligne[TAILLE_MAX];
while (fgets(ligne, TAILLE_MAX, F_entree))
{
    string mot = strtok(strdup(ligne), "\t");
    while (*mot++) { printf("%c \n", *mot); }
}
But the special characters aren't displayed properly (a <?> is displayed instead) in the terminal (on Ubuntu 12). I think the problem is that only an ASCII code fits in a %c, but how can I display those special characters?
And what's the good way to keep those characters in memory (in order to implement a tree index)? (I'm aware that this last question is unclear, don't hesitate to ask for clarifications.)
It does not work because your code splits up the multi-byte characters into separate ones. As your console expects a valid multi-byte code, after seeing a first one, and it does not receive the correct codes, you get your <?> -- translated freely, "whuh?". It does not receive a correct code because you are stuffing a space and newline in there.
Your console can only correctly interpret UTF8 characters if you send the right codes and in the correct sequence. The algorithm is:
Is the next character the start code for a UTF-8 sequence? If not, print it and continue.
If it is, print it and print all "next" codes for this character. See Wikipedia on UTF8 for the actual encoding; I took a shortcut in my code below.
Only then print your space (..?) and newline.
The procedure to recognize the start and length of a UTF8 multibyte character is this:
"Regular" (ASCII) characters never have their 7th bit set. Testing against 0x80 is enough to differentiate them from UTF8.
Each UTF-8 character sequence starts with one of the bit patterns 110xxxxx, 1110xxxx, 11110xxx, 111110xx, or 1111110x. Each of those unique bit patterns has an associated number of extra bytes; the first one, for example, expects one additional byte. The x bits are combined with bits from the following byte(s) to form the Unicode code point. (After all, that is what UTF-8 is all about.)
Each next byte -- no matter how many! -- has the bit pattern 10xxxxxx. Important: none of the previous patterns start with this code!
Therefore, as soon as you see any UTF8 character, you can immediately display it and all 'next' codes, as long as they start with the bit pattern 10....... This can be tested efficiently with a bit-mask: value & 0xc0, and the result should be 0x80. Any other value means it's not a 'next' byte anymore, so you're done then.
All of this only works if your source file is valid UTF8. If you get to see some strange output, it most likely is not. If you need to check the input file for validity, you do need to implement the entire table in the Wikipedia page, and check if each 110xxxxx byte is in fact followed by a single 10xxxxxx byte, and so on. The pattern 10xxxxxx appearing on itself would indicate an error.
A definitive must-read is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). See also UTF-8 and Unicode FAQ for Unix/Linux for more background information.
My code below addresses a few other issues with yours. I've used English variable names (see Meta Stack Overflow, "Foreign variable names etc. in code"). It appears to me strdup is not necessary. Also, string is a C++ type, not a C one.
My code does not "fix" or handle anything beyond the UTF-8 printing. Because of your use of strtok, the code only prints the text before the first \t Tab character on each line in your input file. I assume you know what you are doing there ;-)
Add.: Ah, forgot to address Q2, "what's the good way to keep those characters in memory". UTF-8 is designed to be maximally compatible with C-type char strings, so you can safely store them as such. You don't need to do anything special to print them on a UTF-8-aware console -- well, except when you are doing stuff as you do here, printing them as separate characters. printf ought to work just fine for whole words.
If you need UTF8-aware equivalents of strcmp, strchr, and strlen, you can roll your own code (see the Wikipedia link above) or find yourself a good pre-made library. (I left out strcpy intentionally!)
#include <stdio.h>
#include <string.h>

#define MAX_LINE_LENGTH 1024

int main (void)
{
    char line[MAX_LINE_LENGTH], *word;
    FILE *entry_file = fopen("D.txt", "r");
    if (!entry_file)
    {
        printf ("not possible to open entry_file\n");
        return -1;
    }
    while (fgets(line, MAX_LINE_LENGTH, entry_file))
    {
        word = strtok(line, "\t");
        if (word == NULL)   /* line contained only tabs */
            continue;
        while (*word)
        {
            /* print UTF8 encoded characters as a single entity */
            if (*word & 0x80)
            {
                do
                {
                    printf("%c", *word);
                    word++;
                } while ((*word & 0xc0) == 0x80);
                printf ("\n");
            } else
            {
                /* print low ASCII characters as-is */
                printf("%c \n", *word);
                word++;
            }
        }
    }
    return 0;
}
Let's say I have a string:
char theString[] = "你们好āa";
Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and the 'a' is one byte):
strlen(theString) == 12
How can I count the number of characters? How can I do the equivalent of subscripting, so that:
theString[3] == "好"
How can I slice and concatenate such strings?
You count only the bytes whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).
That's because all bytes with the top two bits set to 10 are UTF-8 continuation bytes.
See here for a description of the encoding and how strlen can work on a UTF-8 string.
For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.
Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:
void utf8left(char *destbuff, char *srcbuff, size_t sz);
void utf8mid(char *destbuff, char *srcbuff, size_t pos, size_t sz);
void utf8rest(char *destbuff, char *srcbuff, size_t pos);
to get, respectively:
the left sz UTF-8 bytes of a string.
the sz UTF-8 bytes of a string, starting at pos.
the rest of the UTF-8 bytes of a string, starting at pos.
This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.
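As an illustration, here is a minimal sketch of the first helper. It interprets sz as a count of UTF-8 characters rather than raw bytes (which seems to be the intent, since a raw byte copy could split a character), and it assumes destbuff is large enough:

#include <stddef.h>

/* Copy the first sz UTF-8 characters (not bytes) of srcbuff into destbuff.
   A byte starts a character unless its top two bits are 10. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    size_t chars = 0;
    while (*srcbuff) {
        if (((unsigned char)*srcbuff & 0xC0) != 0x80) {  /* start byte */
            if (chars == sz)
                break;
            chars++;
        }
        *destbuff++ = *srcbuff++;
    }
    *destbuff = '\0';
}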
Try this for size:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    return NULL;
}

// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }
    p[len] = '\0';   // read() does not NUL-terminate; utf8len() needs a terminator

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}
Sample run:
matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā
Note that your example has an off-by-one error: theString[2] == "好".
The easiest way is to use a library like ICU
Depending on your notion of "character", this question can get more or less involved.
First off, you should transform your byte string into a string of Unicode codepoints. You can do this with iconv() or ICU, though if this is the only thing you need, iconv() is a lot simpler, and it's part of POSIX.
Your string of Unicode codepoints could be something like a null-terminated uint32_t[], or, if you have C11, an array of char32_t. The size of that array (i.e., its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.
However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints. For instance, an a with a circumflex can be expressed as two Unicode codepoints, or as a single precomposed codepoint â; both are valid, and both are required by the Unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.
That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into Unicode codepoints is a must; everything beyond that is at your discretion.
Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.
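To make the iconv() route concrete, here is a minimal sketch that counts codepoints by converting UTF-8 to UTF-32. The encoding name "UTF-32LE" is accepted by glibc but is implementation-specific, the output buffer size is an assumption, and error handling is kept minimal:

#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "你们好āa";
    uint32_t out[64];                       /* assumed big enough for this demo */
    char *inp = in, *outp = (char *)out;
    size_t inleft = strlen(in), outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    iconv_close(cd);

    /* each UTF-32 unit is one codepoint */
    size_t codepoints = (sizeof out - outleft) / sizeof out[0];
    printf("%zu codepoints\n", codepoints);   /* 5 for this string */
    return 0;
}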
In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.
Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.
For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat if you insist...) and strstr are all you need.
If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).
Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.
size_t utf8len(const char *s)
{
    size_t i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)   /* count only non-continuation bytes */
            j++;
        i++;
    }
    return j;
}
This will count the characters in a UTF-8 string (found in this article: Even faster UTF-8 character counting).
However I'm still stumped on slicing and concatenating?!?
In general you should use a different data type for Unicode characters.
For example, you can use the wide char data type
wchar_t theString[] = L"你们好āa";
Note the L prefix, which indicates that the string is composed of wide chars.
The length of that string can be calculated using the wcslen function, which behaves like strlen.
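A minimal usage sketch follows. Note that wide output generally requires a setlocale call first, and that wchar_t is only 16 bits on Windows, so characters outside the BMP would still occupy two wchar_t units there; this approach is not fully portable:

#include <wchar.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* enable the environment's locale for wide output */
    wchar_t theString[] = L"你们好āa";
    wprintf(L"%ls has %zu wide characters\n", theString, wcslen(theString));
    return 0;
}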
One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another (it doesn't have to be UTF-8, for example), and each character may have multiple encodings, with varying ways to handle the combining of accents, etc. The rules are really complicated and vary by encoding (e.g., UTF-8 vs. UTF-16).
This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.
I did a similar implementation years back, but I do not have the code with me.
For each Unicode character, the first byte describes the number of bytes that follow it to construct the character, so based on the first byte you can determine the length of each Unicode character.
A sequence of code points can constitute a single syllable/letter/character in many non-Western-European languages (e.g., all Indic languages).
So, when you are counting the length or finding a substring (there are definitely use cases for finding substrings, say, playing a hangman game), you need to advance syllable by syllable, not code point by code point.
So the definition of a character/syllable, and where you actually break the string into "chunks of syllables", depends on the nature of the language you are dealing with.
For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following
V (a vowel in its primary form, appearing at the beginning of a word)
C (consonant)
C + V (a consonant plus a vowel in its secondary form)
C + C + V
C + C + C + V
You need to parse the string and look for the above patterns to break the string and to find the substrings.
I do not think it is possible to have a general-purpose method that can magically break strings in the above fashion for any Unicode string (or sequence of code points), as the pattern that works for one language may not be applicable to another.
I guess there may be methods/libraries that can take some definition or configuration parameters as input to break Unicode strings into such syllable chunks. I'm not sure, though! I would appreciate it if someone could share how they solved this problem using any commercially available or open-source methods.