How to process a string char by char in the XS code - c

Let's suppose there is a piece of code like this:
my $str = 'some text';
my $result = my_subroutine($str);
and my_subroutine() should be implemented as Perl XS code. For example it could return the sum of bytes of the (unicode) string.
In the XS code, how can I process a string (a) char by char, as a general method, and (b) byte by byte, if the string consists only of ASCII characters? Is there a built-in function to convert from the native data structure of a string to char[]?

At the XS layer, you'll get byte strings or UTF-8 strings. In the general case, your code will likely contain a char * that points at the next item in the string and is incremented as it goes. For a useful set of UTF-8 support functions to use in XS, read the "Unicode Support" section of perlapi.
An example of mine from http://cpansearch.perl.org/src/PEVANS/Tickit-0.15/lib/Tickit/Utils.xs
int textwidth(str)
SV *str
INIT:
STRLEN len;
const char *s, *e;
CODE:
RETVAL = 0;
if(!SvUTF8(str)) {
str = sv_mortalcopy(str);
sv_utf8_upgrade(str);
}
s = SvPV_const(str, len);
e = s + len;
while(s < e) {
UV ord = utf8n_to_uvchr(s, e-s, &len, (UTF8_DISALLOW_SURROGATE
|UTF8_WARN_SURROGATE
|UTF8_DISALLOW_FE_FF
|UTF8_WARN_FE_FF
|UTF8_WARN_NONCHAR));
int width = wcwidth(ord);
if(width == -1)
XSRETURN_UNDEF;
s += len;
RETVAL += width;
}
OUTPUT:
RETVAL
In brief, this function iterates the given string one Unicode character at a time, accumulating the width as given by wcwidth().

If you're expecting bytes:
STRLEN len;
char* buf = SvPVbyte(sv, len);
while (len--) {
char byte = *(buf++);
... do something with byte ...
}
If you're expecting text or any non-byte characters:
STRLEN len;
U8* buf = (U8*)SvPVutf8(sv, len);
while (len) {
STRLEN ch_len;
UV ch = utf8n_to_uvchr(buf, len, &ch_len, 0);
buf += ch_len;
len -= ch_len;
... do something with ch ...
}
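For the sum-of-bytes example from the question, the loop body is ordinary C once SvPVbyte has produced the buffer and its length. Here is a minimal sketch in plain C, outside XS, so it can be tried without Perl. sum_bytes is a hypothetical helper name, and buf and len stand in for the values SvPVbyte would give you:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: adds up the byte values of buf[0..len-1].
 * In XS, buf and len would come from SvPVbyte(sv, len). */
unsigned long sum_bytes(const char *buf, size_t len)
{
    unsigned long sum = 0;

    assert(buf != NULL || len == 0);
    while (len--) {
        /* Go through unsigned char: plain char may be signed, which
         * would make bytes >= 0x80 count as negative values. */
        sum += (unsigned char)*buf++;
    }
    return sum;
}
```

For instance, sum_bytes("AB", 2) yields 131 (65 + 66).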

Related

I encounter a memory issue whenever I try to split a string

I am writing a kernel in C. I am trying to write a split-string function that takes a string (char s[]), a delimiter (char d), and a pointer to a two-dimensional output array (char** outp), and returns the number of items written to outp. For reference, I terminated all strings with '\n'. The problem occurs when I read from outp after calling split: it contains unexpected values. However, when I read outp inside the split function, the split strings are correct.
For demo purposes, I am going to split "The quick brown fox jumps over the lazy dog" on the space character.
EDIT: I tried allocating memory for it.
Here's the output from inside the split function
The
quick
brown
fox
jumped
over
the
lazy
dog
Here's the output from printing outp after calling split.
9
lazy
the
the
quick
the
the
brown
the
Edit: Here's the memory allocator function
u32 free_mem_addr = 0x10000;
u32 kmalloc(u32 size, int align, u32 *phys_addr) {
/* Pages are aligned to 4K, or 0x1000 */
if (align == 1 && (free_mem_addr & 0xFFFFF000)) {
free_mem_addr &= 0xFFFFF000;
free_mem_addr += 0x1000;
}
/* Save also the physical address */
if (phys_addr) *phys_addr = free_mem_addr;
u32 ret = free_mem_addr;
free_mem_addr += size; /* Remember to increment the pointer */
return ret;
}
Edit: Here's where I allocated the memory
output = (char**)kmalloc(MAX_ARG_COUNT * MAX_ARG_SIZE, 1, (u32*)&output);
Here's the input handler in the kernel.
void handle_input(char* input)
{
char** output;
int arg_len = split(input, ' ', output);
kprint("\n");
char str[5];
int_to_ascii(arg_len, str);
kprint(str);
int i;
for (i = 0; i < arg_len; i++)
{
kprint(output[i]);
}
kprint("\n$> ");
}
Here's the split string function
int split(char s[], char d, char** outp)
{
int i;
int size = 0;
int buffer_size = 0;
for (i = 0; i < strlen(s); i++)
{
if (s[i] == d)
{
outp[size][buffer_size] = '\0';
kprint(outp[size]);
kprint("\n");
size++;
buffer_size = 0;
}
else
{
outp[size][buffer_size] = s[i];
buffer_size++;
}
}
kprint(outp[size]);
return size+1;
}
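Judging only from the code shown, one plausible cause is that the row pointers are never set: handle_input passes an uninitialised char** to split, and even after the kmalloc call output points at one flat block, so output[i] is whatever bytes happen to sit there rather than a pointer to a row. Below is a minimal sketch that avoids the pointer table altogether by using a true two-dimensional array. split_fixed is a hypothetical name, and the values of MAX_ARG_COUNT and MAX_ARG_SIZE are assumptions:

```c
#include <stddef.h>

#define MAX_ARG_COUNT 16   /* assumed value, not given in the question */
#define MAX_ARG_SIZE  32   /* assumed value, not given in the question */

/* Splits s on d into the rows of out and returns the number of rows used.
 * Overlong items are truncated; items beyond MAX_ARG_COUNT are dropped. */
int split_fixed(const char *s, char d, char out[MAX_ARG_COUNT][MAX_ARG_SIZE])
{
    int count = 0;   /* index of the row being filled */
    size_t n = 0;    /* write position inside that row */

    for (; *s != '\0'; s++) {
        if (*s == d) {
            out[count][n] = '\0';
            if (++count == MAX_ARG_COUNT)
                return count;          /* every row is in use */
            n = 0;
        } else if (n < MAX_ARG_SIZE - 1) {
            out[count][n++] = *s;      /* leave room for the terminator */
        }
    }
    out[count][n] = '\0';
    return count + 1;
}
```

In handle_input this would be called with a local char output[MAX_ARG_COUNT][MAX_ARG_SIZE]; kprint(output[i]) then works because output[i] decays to a pointer to row i.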

Size problem while decoding ciphered text in C

[EDIT]
This is the ciphered text that needs to be decoded:
bURCUE}__V|UBBQVT
I have a decoder that successfully reads the ciphered text but only decodes it up to a certain point; the rest of the decoded message is gibberish. I checked the size of the buffer and the char pointer, and both seem correct. I couldn't find the flaw.
Message I expect to see:
SecretLongMessage
Decrypted message on the screen looks like this:
SecretLong|drs`fe
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BUZZ_SIZE 1024
char* encryptDecrypt(const char* toEncrypt, int length)
{
char key[] = "1011011011";
char* output = malloc(length + 1);
output[length] = '\0'; //buffer
for (int i = 0; i < length; i++)
{
output[i] = toEncrypt[i] ^ key[i % (sizeof(key)/sizeof(char))];
}
return output;
}
int main(int argc, char* argv[])
{
char buff[BUZZ_SIZE];
FILE *f;
f = fopen("C:\\Users\\Dell\\source\\repos\\XOR\\XOR\\bin\\Debug\\cipher.txt", "r"); // read mode
fgets(buff, BUZZ_SIZE, f);
printf("Ciphered text: %s, size = %zu\n", buff, sizeof(buff));
fclose(f);
char* sourceString = buff;
//Decrypt
size_t size = strlen(sourceString);
char* decrypted = encryptDecrypt(buff, size);
//printf("\nsize = %d\n",size);
printf("\nDecrypted is: ");
printf("%s", decrypted);
// Free the allocated buffer
free(decrypted);
return 0;
}
Here is my C# code that gives cipher
String szEncryptionKey = "1011011011";
public Form1()
{
InitializeComponent();
}
string EncryptOrDecrypt(string text, string key)
{
var result = new StringBuilder();
for (int c = 0; c < text.Length; c++)
{
// take next character from string
char character = text[c];
// cast to a uint
uint charCode = (uint)character;
// figure out which character to take from the key
int keyPosition = c % key.Length; // use modulo to "wrap round"
// take the key character
char keyChar = key[keyPosition];
// cast it to a uint also
uint keyCode = (uint)keyChar;
// perform XOR on the two character codes
uint combinedCode = charCode ^ keyCode;
// cast back to a char
char combinedChar = (char)combinedCode;
// add to the result
result.Append(combinedChar);
}
return result.ToString();
}
private void Button1_Click(object sender, EventArgs e)
{
String str = textBox1.Text;
var cipher = EncryptOrDecrypt(str, szEncryptionKey);
System.IO.File.WriteAllText(@"C:\\Users\\Dell\\source\\repos\\XOR\\XOR\\bin\\Debug\\cipher.txt", cipher);
}
You want to use all characters from
char key[] = "1011011011";
for your encryption. But the array key includes a terminating '\0' which is included in the calculation when you use
key[i % (sizeof(key)/sizeof(char))]
because sizeof(key) includes the terminating '\0'.
You could either use strlen to calculate the string length, or use key[i % (sizeof(key)/sizeof(char) - 1)], or initialize the array as
char key[] = {'1', '0', '1', '1', '0', '1', '1', '0', '1', '1' };
to omit the terminating '\0'. In the latter case you can use sizeof to calculate the key index as in your original code.
After the C# code was added to the question it is clear that the encryption doesn't include a '\0' in the key. key.Length is comparable to strlen(key) in C, not sizeof(key).
BTW: The variable name String szEncryptionKey = "1011011011"; in C# is misleading because it is not a zero terminated string as it would be in C.
Note: strlen(key) is the same as sizeof(key)-1 in your case because you don't specify the array size and initialize the array to a string. It might not be the same in other cases.
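To make the effect concrete: with sizeof(key) the key repeats with period 11 instead of 10, so the character at index 10 is XORed with '\0' and left unchanged (the '|' in the garbled output), and every later character uses a key digit that is out of phase. Here is a minimal sketch of the corrected routine, using strlen for the key length. xor_with_key is a hypothetical name, and passing the key as a parameter is a change from the original signature:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* XORs length bytes of in with a repeating key and returns a newly
 * allocated, null-terminated buffer (NULL if allocation fails). */
char *xor_with_key(const char *in, size_t length, const char *key)
{
    size_t key_len = strlen(key);   /* excludes '\0', like key.Length in C# */
    char *out;

    assert(key_len > 0);
    out = malloc(length + 1);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < length; i++)
        out[i] = (char)(in[i] ^ key[i % key_len]);
    out[length] = '\0';
    return out;
}
```

With the ciphertext from the question, xor_with_key("bURCUE}__V|UBBQVT", 17, "1011011011") returns "SecretLongMessage"; the caller frees the result. Because XOR can produce '\0' bytes (for example '1' ^ '1'), real code should carry the length separately rather than rely on strlen of the ciphertext.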

How to word-wrap using specific delimiters, without dynamic allocation

I have a program that displays UTF-8 encoded strings with a size limitation (say MAX_LEN).
Whenever I get a string with a length > MAX_LEN, I want to find out where I could split it so it would be printed gracefully.
For example:
#define MAX_LEN 30U
const char big_str[] = "This string cannot be displayed on one single line: it must be splitted";
Without processing, the output looks like:
"This string cannot be displaye" // Truncated because of size limitation
"d on one single line: it must "
"be splitted"
The client would be able to choose eligible delimiters for the split, but for now I have defined a default list of delimiters:
#define DEFAULT_DELIMITERS " ;:,)]" // Delimiters to track in the string
So I am looking for an elegant and lightweight way of handling this issue without using malloc: my API should not return the sub-strings, I just want the positions of the sub-strings to display.
I already have some ideas that I will propose in an answer: any feedback (e.g. pros and cons) would be appreciated, but most of all I am interested in alternative solutions.
I just want the positions of the sub-strings to display.
So all you need is one function analysing your input returning the positions where a delimiter was found.
A possible approach using strpbrk(), assuming at least C99:
#include <unistd.h> /* for ssize_t */
#include <string.h>
#define DELIMITERS (" ;.")
void find_delimiter_positions(
const char * input,
const char * delimiters,
ssize_t * delimiter_positions)
{
ssize_t dp_current = 0;
const char * p = input;
while (NULL != (p = strpbrk(p, delimiters)))
{
delimiter_positions[dp_current] = p - input;
++dp_current;
++p;
}
}
int main(void)
{
char input[] = "some randrom data; more.";
size_t input_length = strlen(input);
ssize_t delimiter_positions[input_length];
for (size_t s = 0; s < input_length; ++s)
{
delimiter_positions[s] = -1;
}
find_delimiter_positions(input, DELIMITERS, delimiter_positions);
for (size_t s = 0; s < input_length && -1 != delimiter_positions[s]; ++s)
{
/* print out positions */
}
}
Why C99: C99 introduces variable-length arrays (VLAs), which are needed here to get around the restriction on dynamic memory allocation.
If VLAs may not be used either, one needs to fall back to defining a maximum number of possible occurrences of delimiters per string. That should be feasible, as the maximum length of the string to be parsed is given, which in turn implies the maximum number of possible delimiters per string.
For the latter case those lines from the example above
char input[] = "some randrom data; more.";
size_t input_length = strlen(input);
ssize_t delimiter_positions[input_length];
could be replaced by
char input[MAX_INPUT_LEN] = "some randrom data; more.";
size_t input_length = strlen(input);
ssize_t delimiter_positions[MAX_INPUT_LEN];
An approach that doesn't require additional storage is to make the wrapping function call a callback function for each substring. In the example below, the string is just printed with plain old printf, but the callback could call any other API function.
Things to note:
There is a function next that should advance a pointer to the next UTF-8 character. The encoding width of a UTF-8 character can be seen from its first byte.
The space and punctuation delimiters are treated slightly differently: spaces are kept neither at the end nor at the beginning of a line (as long as there are no consecutive spaces in the text). Punctuation is retained at the end of a line.
Here's an example implementation:
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define DELIMITERS " ;:,)]"
/*
* Advance to next character. This should advance the pointer by
* one to four bytes, depending on the UTF-8 encoding. (But at the
* moment, it doesn't.)
*/
static const char *next(const char *p)
{
return p + 1;
}
typedef struct {
const char *begin;
const char *end;
} substr_t;
/*
* Wraps the text and stores the found substrings' ranges into
* the lines struct. Return the number of word-wrapped lines.
*/
int wrap(const char *text, int width, substr_t *lines, uint32_t max_num_lines)
{
const char *begin = text;
const char *split = NULL;
uint32_t num_lines = 1;
int l = 0;
while (*text) {
if (strchr(DELIMITERS, *text)) {
split = text;
if (*text != ' ') split++;
}
if (l++ == width) {
if (split == NULL) split = text;
lines[num_lines - 1].begin = begin;
lines[num_lines - 1].end = split;
//write(fileno(stdout), begin, split - begin);
text = begin = split;
while (*begin == ' ') begin++;
split = NULL;
l = 0;
num_lines++;
if (num_lines > max_num_lines) {
//abort();
return -1;
}
}
text = next(text);
}
lines[num_lines - 1].begin = begin;
lines[num_lines - 1].end = text;
//write(fileno(stdout), begin, split - begin);
return num_lines;
}
int main()
{
const char *text = "I have a program that displays UTF-8 encoded strings "
"with a size limitation (say MAX_LEN). Whenever I get a string with a "
"length > MAX_LEN, I want to find out where I could split it so it "
"would be printed gracefully.";
substr_t lines[100];
const uint32_t max_num_lines = sizeof(lines) / sizeof(lines[0]);
const int num_lines = wrap(text, 48, lines, max_num_lines);
if (num_lines < 0) {
fprintf(stderr, "error: can't split into %d lines\n", max_num_lines);
return EXIT_FAILURE;
}
//printf("num_lines = %d\n", num_lines);
for (int i=0; i < num_lines; i++) {
FILE *stream = stdout;
const ptrdiff_t line_length = lines[i].end - lines[i].begin;
write(fileno(stream), lines[i].begin, line_length);
fputc('\n', stream);
}
return EXIT_SUCCESS;
}
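The next stub above advances by one byte only. Here is a minimal sketch of a UTF-8 aware version, assuming the input is valid UTF-8: it steps over the lead byte and then over any continuation bytes (those of the form 10xxxxxx). The names next_utf8 and utf8_length are hypothetical, chosen so they do not clash with next:

```c
#include <stddef.h>

/* Returns a pointer to the start of the next UTF-8 character,
 * or p itself when p already points at the terminating '\0'. */
const char *next_utf8(const char *p)
{
    if (*p == '\0')
        return p;
    p++;                                        /* lead byte */
    while (((unsigned char)*p & 0xC0) == 0x80)  /* continuation bytes */
        p++;
    return p;
}

/* Counts characters (code points) rather than bytes. */
size_t utf8_length(const char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        s = next_utf8(s);
        n++;
    }
    return n;
}
```

Skipping continuation bytes instead of decoding the length from the lead byte keeps the function from running past the terminator on a truncated sequence. Note that the wrap function would then count characters, not display columns.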
Addendum: Here's another approach that builds loosely on the strtok pattern, but without modifying the string. It requires a state and that state must be initialised with the string to print and the maximum line width:
struct wrap_t {
const char *src;
int width;
int length;
const char *line;
};
int wrap(struct wrap_t *line)
{
const char *begin = line->src;
const char *split = NULL;
int l = 0;
if (begin == NULL) return -1;
while (*begin == ' ') begin++;
if (*begin == '\0') return -1;
while (*line->src) {
if (strchr(DELIMITERS, *line->src)) {
split = line->src;
if (*line->src != ' ') split++;
}
if (l++ == line->width) {
if (split == NULL) split = line->src;
line->line = begin;
line->length = split - begin;
line->src = split;
return 0;
}
line->src = next(line->src);
}
line->line = begin;
line->length = line->src - begin;
return 0;
}
All definitions not shown (DELIMITERS, next) are as above and the basic algorithm hasn't changed. I think this method is easy to use for the client:
int main()
{
const char *text = "I have a program that displays UTF-8 encoded strings "
"with a size limitation (say MAX_LEN). Whenever I get a string with a "
"length > MAX_LEN, I want to find out where I could split it so it "
"would be printed gracefully.";
struct wrap_t line = {text, 60};
while (wrap(&line) == 0) {
printf("%.*s\n", line.length, line.line);
}
return 0;
}
Solution1
A function that is called repeatedly until the whole string has been processed; each call returns the number of bytes to copy to create the next sub-string:
The API:
/**
* Return the length between the beginning of the string and the
* last delimiter (such that returned length <= max_length)
*/
size_t get_next_substring_length(
const char * str, // The string to be split
const char * delim, // String of eligible delimiters for a split
size_t max_length); // The maximum length of resulting substring
On the client side:
size_t shift = 0;
for(;;)
{
// Where do we start within big_str ?
const char * tmp = big_str + shift;
size_t count = get_next_substring_length(tmp, DEFAULT_DELIMITERS, MAX_LEN);
if(count)
{
// Allocate a sub-string and copy "count" bytes
// Display the sub-string
shift += count;
}
else // End Of String (or error)
{
// Handle potential error
// Exit the loop
}
}
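The function itself is not shown above. Here is a minimal sketch of one possible implementation, assuming that a delimiter stays at the end of the line it terminates and that a line without any delimiter is cut at max_length bytes (which may split a multi-byte UTF-8 character):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns how many bytes of str belong on the next line:
 * - 0 at the end of the string,
 * - the whole remainder if it fits into max_length,
 * - otherwise the length up to and including the last delimiter found
 *   within the first max_length bytes (max_length if there is none). */
size_t get_next_substring_length(
    const char *str,
    const char *delim,
    size_t max_length)
{
    size_t len;

    assert(str != NULL && delim != NULL);
    len = strlen(str);
    if (len <= max_length)
        return len;

    /* Scan backwards; str[i - 1] is never '\0' here because i <= max_length < len. */
    for (size_t i = max_length; i > 0; i--) {
        if (strchr(delim, str[i - 1]) != NULL)
            return i;
    }
    return max_length;   /* no delimiter in range: hard split */
}
```

With big_str, DEFAULT_DELIMITERS and MAX_LEN from the question, successive calls return 22, 30, 19 and then 0, so the client loop terminates on the 0.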
Solution2
Define a custom structure to store positions and lengths of sub-strings:
const char * str = "This is a long test string";
struct substrings
{
const char * str; // Beginning of the substring
size_t length; // Length of the substring
} sub[] = { {&str[0], 4},
{&str[5], 2},
{&str[8], 1},
{&str[10], 4},
{&str[15], 4},
{&str[20], 6},
{NULL, 0} };
The API:
size_t find_substrings(
struct substrings * substr, // Caller-provided array to fill
size_t max_length, // Number of elements in that array
const char * delimiters,
const char * str);
On the client side:
#define ARRAY_LENGTH 20U
struct substrings substr[ARRAY_LENGTH];
// Fill the structure
find_substrings(
substr,
ARRAY_LENGTH,
DEFAULT_DELIMITERS,
big_str);
// Browse the structure
for (struct substrings * sub = &substr[0]; sub->str; sub++)
{
// Display sub->length bytes of sub->str
}
Some things are bothering me though:
in Solution1 I don't like the infinite loop; such loops are often bug-prone
in Solution2 I fixed ARRAY_LENGTH arbitrarily, but it should vary depending on the input string length

Extracting key=value with scanf in C

I need to extract a value for a given key from a string. I made this quick attempt:
char js[] = "some preceding text with\n"
"new lines and spaces\n"
"param_1=123\n"
"param_2=321\n"
"param_3=string\n"
"param_2=321\n";
char* param_name = "param_2";
char *key_s, *val_s;
char buf[32];
key_s = strstr(js, param_name);
if (key_s == NULL)
return 0;
val_s = strchr(key_s, '=');
if (val_s == NULL)
return 0;
sscanf(val_s + 1, "%31s", buf);
printf("'%s'\n", buf);
And it in fact works ok (printf gives '321'). But I suppose the scanf/sscanf would make this task even easier but I have not managed to figure out the formatting string for that.
Is that possible to pass a content of a variable param_name into sscanf so that it evaluates it as a part of a formatting string? In other words, I need to instruct sscanf that in this case it should look for a pattern param_2=%s (the param_name in fact comes from a function argument).
Not directly, no.
In practice, there's of course nothing stopping you from building the format string for sscanf() at runtime, with e.g. snprintf().
Something like:
void print_value(const char **js, size_t num_js, const char *key)
{
char tmp[32], value[32];
snprintf(tmp, sizeof tmp, "%s=%%31s", key);
for(size_t i = 0; i < num_js; ++i)
{
if(sscanf(js[i], tmp, value) == 1)
{
printf("found '%s'\n", value);
break;
}
}
}
OP has a good first step:
char *key_s = strstr(js, param_name);
if (key_s == NULL)
return 0;
The rest may be simplified to
if (sscanf(&key_s[strlen(param_name)], "=%31s", buf) != 1) {
return 0;
}
printf("'%s'\n", buf);
Alternatively one could use " =%31s" to allow spaces before =.
OP's approach gets fooled by "param_2 321\n" "param_3=string\n".
Note: a weakness of all answers so far is that they do not parse an empty value.
One issue that bears consideration is the difference between finding a 'key=value' setting in the string for a specific key value (such as param_2 in the question), and finding any 'key=value' setting in the string (with no specific key in mind a priori). The techniques to be used are rather different.
Another issue that has not self-evidently been considered is the possibility that you're looking for a key param_2 but the string also contains param_22=xyz and t_param_2=abc. The simple-minded approaches using strstr() to hunt for param_2 will pick up either of those alternatives.
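One way to guard a strstr() search against such partial matches is to check the characters on both sides of each hit. Here is a minimal sketch, assuming keys consist of letters, digits and underscores. find_exact_key and is_key_char are hypothetical names:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_key_char(int c)
{
    return isalnum(c) || c == '_';
}

/* Returns a pointer to the value following "key=", or NULL if the key
 * only occurs as part of a longer name (param_22, t_param_2, ...). */
const char *find_exact_key(const char *str, const char *key)
{
    size_t klen;

    assert(str != NULL && key != NULL);
    klen = strlen(key);
    if (klen == 0)
        return NULL;

    for (const char *p = strstr(str, key); p != NULL; p = strstr(p + 1, key)) {
        /* The hit must not be the tail of a longer key ... */
        int starts_word = (p == str) || !is_key_char((unsigned char)p[-1]);
        /* ... and must be followed directly by '='. */
        if (starts_word && p[klen] == '=')
            return p + klen + 1;
    }
    return NULL;
}
```

The returned pointer can be handed to sscanf(value, "%31s", buf) as before. This sketch would still accept a key that appears inside another value, as in param_4=param_2=confusion.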
In the sample data, there is a collection of characters that are not in the 'key=value' format and have to be skipped before any 'key=value' parts. In the general case, we should assume that such data appears before, in between, and after the 'key=value' pairs. It appears that the values do not need to support complications such as quoted strings and metacharacters, and the value is delimited by white space. There is no comment convention visible.
Here's some workable code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
enum { MAX_KEY_LEN = 31 };
enum { MAX_VAL_LEN = 63 };
int find_any_key_value(const char *str, char *key, char *value);
int find_key_value(const char *str, const char *key, char *value);
int find_any_key_value(const char *str, char *key, char *value)
{
char junk[256];
const char *search = str;
while (*search != '\0')
{
int offset;
if (sscanf(search, " %31[a-zA-Z_0-9]=%63s%n", key, value, &offset) == 2)
return(search + offset - str);
int rc;
if ((rc = sscanf(search, "%255s%n", junk, &offset)) != 1)
return EOF;
search += offset;
}
return EOF;
}
int find_key_value(const char *str, const char *key, char *value)
{
char found[MAX_KEY_LEN + 1];
int offset;
const char *search = str;
while ((offset = find_any_key_value(search, found, value)) > 0)
{
if (strcmp(found, key) == 0)
return(search + offset - str);
search += offset;
}
return offset;
}
int main(void)
{
char js[] = "some preceding text with\n"
"new lines and spaces\n"
"param_1=123\n"
"param_2=321\n"
"param_3=string\n"
"param_4=param_2=confusion\n"
"m= x\n"
"param_2=987\n";
const char p2_key[] = "param_2";
int offset;
const char *str;
char key[MAX_KEY_LEN + 1];
char value[MAX_VAL_LEN + 1];
printf("String being scanned is:\n[[%s]]\n", js);
str = js;
while ((offset = find_any_key_value(str, key, value)) > 0)
{
printf("Any found key = [%s] value = [%s]\n", key, value);
str += offset;
}
str = js;
while ((offset = find_key_value(str, p2_key, value)) > 0)
{
printf("Found key %s with value = [%s]\n", p2_key, value);
str += offset;
}
return 0;
}
Sample output:
$ ./so24490410
String being scanned is:
[[some preceding text with
new lines and spaces
param_1=123
param_2=321
param_3=string
param_4=param_2=confusion
m= x
param_2=987
]]
Any found key = [param_1] value = [123]
Any found key = [param_2] value = [321]
Any found key = [param_3] value = [string]
Any found key = [param_4] value = [param_2=confusion]
Any found key = [m] value = [x]
Any found key = [param_2] value = [987]
Found key param_2 with value = [321]
Found key param_2 with value = [987]
$
If you need to handle different key or value lengths, you need to adjust the format strings as well as the enumerations. If you pass the size of the key buffer and the size of the value buffer to the functions, then you need to use snprintf() to create the format strings used by sscanf(). There is an outside chance that you might have a single 'word' of 255 characters followed immediately by the target 'key=value' string. The chances are ridiculously small, but you might decide you need to worry about that (it prevents this code from being bomb-proof).

How to retrieve n characters from char array

I have a char array in a C application that I have to split into parts of 250 so that I can send it along to another application that doesn't accept more at one time.
How would I do that? Platform: win32.
From the MSDN documentation:
The strncpy function copies the initial count characters of strSource to strDest and returns strDest. If count is less than or equal to the length of strSource, a null character is not appended automatically to the copied string. If count is greater than the length of strSource, the destination string is padded with null characters up to length count. The behavior of strncpy is undefined if the source and destination strings overlap.
Note that strncpy doesn't check for valid destination space; that is left to the programmer. Prototype:
char *strncpy(
char *strDest,
const char *strSource,
size_t count
);
Extended example:
void send250(char *inMsg, int msgLen)
{
char block[250];
while (msgLen > 0)
{
int len = (msgLen>250) ? 250 : msgLen;
strncpy(block, inMsg, 250);
// send block to other entity
msgLen -= len;
inMsg += len;
}
}
I can think of something along the lines of the following:
char *somehugearray;
char chunk[251] ={0};
int k;
int l;
for(l=0; somehugearray[l]!=0;){
for(k=0; k<250 && somehugearray[l]!=0; k++){
chunk[k] = somehugearray[l];
l++;
}
chunk[k] = '\0';
dohandoff(chunk);
}
If you strive for performance and you're allowed to touch the string a bit (i.e. the buffer is not const, no thread safety issues etc.), you could momentarily null-terminate the string at intervals of 250 characters and send it in chunks, directly from the original string:
char *str_end = str + strlen(str);
char *chunk_start = str;
while (true) {
char *chunk_end = chunk_start + 250;
if (chunk_end >= str_end) {
transmit(chunk_start);
break;
}
char hijacked = *chunk_end;
*chunk_end = '\0';
transmit(chunk_start);
*chunk_end = hijacked;
chunk_start = chunk_end;
}
jvasak's answer is basically correct, except that he hasn't null-terminated 'block'. The code should be this:
void send250(char *inMsg, int msgLen)
{
char block[250];
while (msgLen > 0)
{
int len = (msgLen>249) ? 249 : msgLen;
strncpy(block, inMsg, 249);
block[249] = 0;
// send block to other entity
msgLen -= len;
inMsg += len;
}
}
So now the block is 250 characters including the terminating null. strncpy will null-terminate the last block itself if there are fewer than 249 characters remaining.
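If the length of the message is already known, memcpy avoids the strncpy corner cases discussed above. Here is a minimal sketch, assuming a chunk may carry up to 250 payload bytes plus a terminating null. send_in_chunks, chunk_fn and count_bytes are hypothetical names, and the callback stands in for whatever actually transmits a block:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_MAX 250   /* payload bytes per chunk (assumption, see above) */

/* Callback invoked once per chunk; chunk is null-terminated. */
typedef void (*chunk_fn)(const char *chunk, size_t len, void *ctx);

/* Splits msg into pieces of at most CHUNK_MAX bytes and hands each one
 * to send. Returns the number of chunks produced. */
size_t send_in_chunks(const char *msg, size_t msg_len, chunk_fn send, void *ctx)
{
    char block[CHUNK_MAX + 1];
    size_t sent = 0;

    assert(msg != NULL && send != NULL);
    while (msg_len > 0) {
        size_t len = msg_len > CHUNK_MAX ? CHUNK_MAX : msg_len;

        memcpy(block, msg, len);   /* copies exactly len bytes, no padding */
        block[len] = '\0';
        send(block, len, ctx);
        msg += len;
        msg_len -= len;
        sent++;
    }
    return sent;
}

/* Example callback: only adds up the bytes it was given. */
void count_bytes(const char *chunk, size_t len, void *ctx)
{
    (void)chunk;
    *(size_t *)ctx += len;
}
```

A 600-byte message is delivered as chunks of 250, 250 and 100 bytes. If the receiver counts the terminating null towards its 250-byte limit, set CHUNK_MAX to 249.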
