sscanf adding in 0's to my string - c

I'm writing some C code for an assembler for a virtual computer designed for our textbook. The goal is to make the binary output match what the program that accompanies the textbook produces from the same assembly. I was on the last instruction to convert to binary, BR (for branch), and ran into trouble with sscanf. The function is:
char* br(char* line) {
int num, i, l, n = 0, z = 0, p = 0;
char bin[17] = "0000";
char word[20], arg1[20];
sscanf(line, "%S%S", word, arg1);
l = strlen(word);
for (i = 2; i < l; i++) {
if (word[i] == 'N') {
n = 1;
} else if (word[i] == 'Z') {
z = 1;
} else if (word[i] == 'P') {
p = 1;
}
}
bin[4] = n + '0';
bin[5] = z + '0';
bin[6] = p + '0';
while (label[i] != 0) {
if (strcmp(label[i], arg1) == 0) {
num = address[i] - currentAddress - 1;
decToBinary(num, arg1);
break;
}
i++;
}
for (i = 7; i < 16; i++) {
bin[i] = arg1[i];
}
return bin;
}
The problem I'm having is that sscanf is adding 0's between every character placed in word and arg1, so the strings are terminated early. The incoming string "BRZP START" is broken into "B" for word and "S" for arg1 respectively. I've used sscanf in this way a bunch already and don't know why it's not working now.

If you look at the man page for sscanf, %S (capital S) is not a standard conversion. Based on the printf man page, it looks as if it is equivalent to %ls, which reads wide characters from the string, and it should not be used here. When ordinary characters are stored as wide characters and the buffer is then read as a plain char string, the bytes between the characters are zero, so the string appears to end after its first character. Try this:
sscanf(line, "%s%s", word, arg1);

You don't specify the host OS or development platform you're using, but I'm not in the least bit surprised you're getting two bytes for every character considering that %S tells sscanf to read wide chars in some implementations. If your input is ASCII, you'll get ASCII chars plus a null byte, just as you're seeing. The solution is simple: use %s not %S.

Related

Weird characters randomly appearing when converting int arr to string in C (with valid ASCII)

This is a simple Caesar cipher implementation in C.
I take a pin, make a key from it, take a message, shift the ASCII value w.r.t the key, and output the hex value of each character in hex.
With (as far as I can tell) any key, some (but not all) messages produce an extra character when decoded.
Example case:
Weird char appearing:
Message : "hexa !"
Pin : 454545 & Key : 23
Ciphertext : (in hexa) " 51 4e 61 4a 9 a fffffff3 4e -6f " (-6f is simply used to terminate input)
Text given when decoding :
hexa !
e
Other keys generate other weird chars, like '+', for example. The weird char always appears on the next line.
The entire code is ~100 lines, so I won't paste it here, but it's available on GitHub.
Don't use the Windows .exe in this repo; it is an older version. I'm trying to fix this issue before releasing this version.
The code where the issue is likely appearing is the encrypt() and decrypt() functions:
void encrypt() {
char msg[3001], ch;
int i,key, en[3001], count = 0;
printf("\n");
key = pin();
getchar();
printf("\nType Message - \n\n");
fgets(msg,sizeof(msg),stdin);
for (i=0; msg[i] != '\0'; i++) {
ch=msg[i];
int d = (ch - key);
en[i] = d;
count++;
}
printf("\nEncrypted message -\n\n");
for (i=0; i <= count; i++)
printf("%x ", en[i]);
printf("-6f");
}
void decrypt() {
char msg[3001], ch;
int i,key, en[3001],d;
printf("\n");
key = pin();
printf("\nEnter encrypted message -\n\n");
getchar();
for (i=0; i <= 3001; i++) {
scanf("%x",&d);
if (d == -111) {
msg[i] = '\0';
break;
} else {
ch = d + key;
msg[i] = ch;
}
}
printf("\nDecrypted message -\n\n");
puts(msg);
}
In your second for loop:
for(i = 0; i <= count; i++)
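You are running off the end of the data you stored: array indices start at zero, not one, so the valid entries are en[0] through en[count - 1], and en[count] was never written. Whatever value happens to be there is printed as the extra character.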
Change it to:
for(i = 0; i < count; i++)
Put each variable declaration on its own line; combined declarations are easy to misread:
int a, b, c = 0; // only one variable is initialized.
char *src = 0, c = 0; // c is of type char, not char*
fgets includes the trailing newline. If you don't want it, you have to strip it off.
size_t len = strlen(msg); // strlen returns size_t, not int
if(len > 0 && msg[len - 1] == '\n')
msg[len - 1] = '\0';

What do the format specifiers "%02x " and "%3o " do in this code?

While searching for the meaning of the statement of K&R C Exercise 7-2, I found this answer on https://clc-wiki.net/wiki/K%26R2_solutions:Chapter_7:Exercise_2
The Exercise asks to
Write a program that will print arbitrary input in a sensible way. As a minimum, it should print non-graphic characters in octal or hexadecimal according to local custom, and break long text lines.
I am also unable to understand the wording of the exercise, specifically the part "As a minimum, it should print non-graphic characters in octal or hexadecimal according to local custom".
What does exercise 7-2 ask for? Please explain the meaning of its statement.
The code below uses the format specifiers "%02x " and "%3o " at the end to print non-printable characters, but what exactly do these format specifiers do to non-printables?
else
{
if(textrun > 0 || binaryrun + width >= split)
{
printf("\nBinary stream: ");
textrun = 0;
binaryrun = 15;
}
printf(format, ch);
binaryrun += width;
}
The rest of the code breaks long lines into smaller ones and prints all printable characters as they are.
The complete program is below:
#include <stdio.h>
#define OCTAL 8
#define HEXADECIMAL 16
void ProcessArgs(int argc, char *argv[], int *output)
{
int i = 0;
while(argc > 1)
{
--argc;
if(argv[argc][0] == '-')
{
i = 1;
while(argv[argc][i] != '\0')
{
if(argv[argc][i] == 'o')
{
*output = OCTAL;
}
else if(argv[argc][i] == 'x')
{
*output = HEXADECIMAL;
}
else
{
/* Quietly ignore unknown switches, because we don't want to
* interfere with the program's output. Later on in the
* chapter, the delights of fprintf(stderr, "yadayadayada\n")
* are revealed, just too late for this exercise.
*/
}
++i;
}
}
}
}
int can_print(int ch)
{
char *printable = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 !\"#%&'()*+,-./:;<=>?[\\]^_{|}~\t\f\v\r\n";
char *s;
int found = 0;
for(s = printable; !found && *s; s++)
{
if(*s == ch)
{
found = 1;
}
}
return found;
}
int main(int argc, char *argv[])
{
int split = 80;
int output = HEXADECIMAL;
int ch;
int textrun = 0;
int binaryrun = 0;
char *format;
int width = 0;
ProcessArgs(argc, argv, &output);
if(output == HEXADECIMAL)
{
format = "%02X ";
width = 4;
}
else
{
format = "%3o ";
width = 4;
}
while((ch = getchar()) != EOF)
{
if(can_print(ch))
{
if(binaryrun > 0)
{
putchar('\n');
binaryrun = 0;
textrun = 0;
}
putchar(ch);
++textrun;
if(ch == '\n')
{
textrun = 0;
}
if(textrun == split)
{
putchar('\n');
textrun = 0;
}
}
else
{
if(textrun > 0 || binaryrun + width >= split)
{
printf("\nBinary stream: ");
textrun = 0;
binaryrun = 15;
}
printf(format, ch);
binaryrun += width;
}
}
putchar('\n');
return 0;
}
You have found out by yourself what "%02x " and "%3o " mean. That is good!
So your question boils down to "What are non-printable characters?" and "How are they printed with the mentioned formats?"
A non-printable character is defined (in the source shown) by the string printable in function can_print(). All characters not in this string are deliberately defined to be non-printable. We can reason about the selection, but this is out of scope here. Another note: " " and "\t\f\v\r\n" are in this set of printable characters and have a value of <= 0x20 in ASCII.
BTW, the standard library has isprint() that checks for printability.
As you seem to know each character is encoded as an assigned value. This value can be interpreted as you like, as a character, as a number, as an instruction, as a colour, as a bit pattern, anything. Actually all digital computers are just working on bit patterns, it's up to the interpretation what they mean.
So a non-printable character can be interpreted as an int number, and this is what printf() does with the mentioned formats. Let's say that the character read is '\b', known as backspace. (Note: It is not in printable.) In ASCII this character is encoded as 0x08. So the output will be "08 " in hexadecimal and " 10 " in octal, because "%3o" pads to a width of three with spaces, not zeros.
You might like to change the program in such way that all characters are considered non-printable. Then you'll see all characters output as hex or octal.

Input of varying format in C

I am currently trying to figure out how to process an input of such format: [int_1,...,int_N] where N is any number from interval <1, MAX_N> (for example #define MAX_N 1000). What I have right now is fgets to get it as string which I then, using some loops and sscanf, save into an int array.
My solution is, IMO, not the most elegant and functional, but that's because of how I implement it. So what I'm asking, I guess, is how you guys would solve this problem, because I've run out of ideas.
Edit: adding the code for string -> int array
int digit_arr[MAX_N];
char input[MAX_N];
//MAX_N is a constant set at 1000
//Brackets and spaces have been removed at this point
for (i = 0; i < strlen(input); i++) {
if(sscanf(&input[i+index_count],"%d,", &digit_arr[i]) == 1){
while (current_char != ',') {
current_char = input[i+index_count+j];
index_count++;
j++;
if ((index_count+j+i) == strlen(input)-1){
break;
}
}
}
My personal variant:
char const* data = input; // if input is NOT a pointer or you yet need it unchanged
for(;;)
{
int offset = 0;
if(sscanf(data, "%d,%n", digit_arr + i, &offset) == 1)
{
++i;
if(offset != 0)
{
data += offset;
continue;
}
}
break;
}
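You might finally check if all characters in the text are consumed. Note that when the last number has no trailing comma, the "," in the format fails to match, %n is never stored, and data still points at that last number, so this check expects the input to end with a comma: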
if(*data)
{
// not all characters consumed, input most likely invalid
}
else
{
// we reached terminating null character -> fine
}
Note that my code as is does not cover trailing whitespace; you could do so by changing the format string to "%d, %n" (note the added space character).

Program runs too slowly with large input - C

The goal for this program is for it to count the number of instances that two consecutive letters are identical and print this number for every test case. The input can be up to 1,000,000 characters long (thus the size of the char array to hold the input). The website which has the coding challenge on it, however, states that the program times out at a 2s run-time. My question is, how can this program be optimized to process the data faster? Does the issue stem from the large char array?
Also: I get a compiler warning "assignment makes integer from pointer without a cast" for the line str[1000000] = "". What does this mean and how should it be handled instead?
Input:
number of test cases
strings of capital A's and B's
Output:
Number of duplicate letters next to each other for each test case, each on a new line.
Code:
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
int main() {
int n, c, a, results[10] = {};
char str[1000000];
scanf("%d", &n);
for (c = 0; c < n; c++) {
str[1000000] = "";
scanf("%s", str);
for (a = 0; a < (strlen(str)-1); a++) {
if (str[a] == str[a+1]) { results[c] += 1; }
}
}
for (c = 0; c < n; c++) {
printf("%d\n", results[c]);
}
return 0;
}
You don't need the line
str[1000000] = "";
scanf() adds a null terminator when it parses the input and writes it to str. This line is also writing beyond the end of the array, since the last element of the array is str[999999].
The reason you're getting the warning is that the type of str[1000000] is char, but a string literal is an array of char that decays to a char* in this context.
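To speed up the program, take the call to strlen() out of the loop. As written, the condition is evaluated on every iteration, which can make the loop quadratic in the length of the string.
size_t len = strlen(str);
for (a = 0; a + 1 < len; a++) { // a + 1 < len also avoids wrapping around when len is 0
...
}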
str[1000000] = "";
This does not do what you think it does, and you're overflowing the buffer, which results in undefined behaviour. Valid indices run from 0 to sizeof(str) - 1, so index sizeof(str) itself is out of bounds. You either add one to the 1000000 when declaring the array or use 999999 to access it instead. To get rid of the compiler warning and produce cleaner code use:
str[1000000] = '\0';
Or
str[999999] = '\0';
Depending on what you did to fix it.
As to optimizing, you should look at the assembly and go from there.
count the number of instances that two consecutive letters are identical and print this number for every test case
For efficiency, the code needs a new approach, as suggested by @john bollinger & @molbdnilo
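#include <ctype.h>  /* isalpha */
#include <stdio.h>  /* EOF, printf */
#include <string.h> /* strlen */
void ReportPairs(const char *str, size_t n) {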
int previous = EOF;
unsigned long repeat = 0;
for (size_t i=0; i<n; i++) {
int ch = (unsigned char) str[i];
if (isalpha(ch) && ch == previous) {
repeat++;
}
previous = ch;
}
printf("Pair count %lu\n", repeat);
}
char *testcase1 = "test1122a33";
ReportPairs(testcase1, strlen(testcase1));
or directly from input and "each test case, each on a new line."
int ReportPairs2(FILE *inf) {
int previous = EOF;
unsigned long repeat = 0;
int ch;
while ((ch = fgetc(inf)) != '\n') {
if (ch == EOF) return ch;
if (isalpha(ch) && ch == previous) {
repeat++;
}
previous = ch;
}
printf("Pair count %lu\n", repeat);
return ch;
}
while (ReportPairs2(stdin) != EOF);
Unclear how OP wants to count "AAAA" as 2 or 3. This code counts it as 3.
One way to dramatically improve the run-time of your code is to limit the number of times you read from stdin (basically, process input in bigger chunks). You can do this in a number of ways, but probably one of the most efficient would be with fread. Even reading in 8-byte chunks can provide a big improvement over reading a character at a time. One example of such an implementation, considering capital letters [A-Z] only, would be:
#include <stdio.h>
#define RSIZE 8
int main (void) {
char qword[RSIZE] = {0};
char last = 0;
size_t i = 0;
size_t nchr = 0;
size_t dcount = 0;
/* read up to 8-bytes at a time */
while ((nchr = fread (qword, sizeof *qword, RSIZE, stdin)))
{ /* compare each byte to byte before */
for (i = 1; i < nchr && qword[i] && qword[i] != '\n'; i++)
{ /* if not [A-Z] continue, else compare */
if (qword[i-1] < 'A' || qword[i-1] > 'Z') continue;
if (i == 1 && last == qword[i-1]) dcount++;
if (qword[i-1] == qword[i]) dcount++;
}
last = qword[i-1]; /* save last for comparison w/next */
}
printf ("\n sequential duplicated characters [A-Z] : %zu\n\n",
dcount);
return 0;
}
Output/Time with 868789 chars
$ time ./bin/find_dup_digits <dat/d434839c-d-input-d4340a6.txt
sequential duplicated characters [A-Z] : 434893
real 0m0.024s
user 0m0.017s
sys 0m0.005s
Note: the string was actually a string of '0's and '1's run with a modified test of if (qword[i-1] < '0' || qword[i-1] > '9') continue; rather than the test for [A-Z]...continue, but your results with 'A's and 'B's should be virtually identical. 1000000 would still be significantly under .1 seconds. You can play with the RSIZE value to see if there is any benefit to reading a larger (suggested 'power of 2') size of characters. (note: this counts AAAA as 3) Hope this helps.

Iterating over string/strlen with umlauted characters

This is a follow-up to my previous question . I succeeded in implementing the algorithm for checking umlauted characters. The next problem comes from iterating over all characters in a string. I do this like so:
int main()
{
char* str = "Hej du kalleåäö";
printf("length of str: %d", strlen(str));
for (int i = 0; i < strlen(str); i++)
{
printf("%s ", to_morse(str[i]));
}
putchar('\n');
return 0;
}
The problem is that, because of the umlauted characters, it prints 18, and the to_morse function also fails (ignoring these characters). The to_morse function accepts an unsigned char as a parameter. What would be the best way to solve this? I know I can check for the umlaut character here instead of in the letterNr function, but I don't know if that would be a pretty/logical solution.
Normally, you'd store the string as wchar_t and use something like wcslen to get its length - that would give you the number of wide characters as opposed to the number of bytes you stored.
You really shouldn't be implementing UTF or Unicode or whatever multibyte character handling yourself - there are libraries for that sort of thing.
On OS X, Cocoa is a solution - note the use of "%C" in NSLog - that's an unichar (16-bit Unicode character):
#import <Cocoa/Cocoa.h>
int main()
{
NSAutoreleasePool * pool = [NSAutoreleasePool new];
NSString * input = @"Hej du kalleåäö";
printf("length of str: %d", [input length]);
int i=0;
for (i = 0; i < [input length]; i++)
{
NSLog(@"%C", [input characterAtIndex:i]);
}
[pool release];
}
You could do something like
for (int i = 0; str[i]!='\0'; ++i){
//do something with str[i]
}
Strings in C are terminated with '\0'. So it is possible to check for the end of the string like that.
EDIT: What locale are you using?
If you are going to iterate over a string, don't bother with getting its length with strlen. Just iterate until you see a NUL character:
char *p = str;
while(*p != '\0') {
printf("%c\n", *p);
++p;
}
As for the umlauted characters and such, are they UTF-8? If the string is multi-byte, you could do something like this (call setlocale(LC_ALL, "") first so that mbtowc decodes according to the environment's encoding):
size_t n = strlen(str);
char *p = str;
char *e = p + n;
while(*p != '\0') {
wchar_t wc;
int l = mbtowc(&wc, p, e - p);
if(l <= 0) break;
p += l;
/* do whatever with wc which is now in wchar_t form */
}
I honestly don't know if mbtowc will simply return -1 if it encounters a NUL in the middle of a MB character. If it does, you could just pass MB_CUR_MAX instead of e - p and do away with the strlen call. But I have a feeling this is not the case.
