I have a char array in a C application that I have to split into parts of 250 so that I can send it along to another application that doesn't accept more at one time.
How would I do that? Platform: win32.
From the MSDN documentation:
The strncpy function copies the initial count characters of strSource to strDest and returns strDest. If count is less than or equal to the length of strSource, a null character is not appended automatically to the copied string. If count is greater than the length of strSource, the destination string is padded with null characters up to length count. The behavior of strncpy is undefined if the source and destination strings overlap.
Note that strncpy doesn't check for valid destination space; that is left to the programmer. Prototype:
char *strncpy(
char *strDest,
const char *strSource,
size_t count
);
Extended example:
void send250(char *inMsg, int msgLen)
{
char block[250];
while (msgLen > 0)
{
int len = (msgLen>250) ? 250 : msgLen;
strncpy(block, inMsg, 250);
// send block to other entity
msgLen -= len;
inMsg += len;
}
}
I can think of something along the lines of the following:
char *somehugearray;
char chunk[251] ={0};
int k;
int l;
for(l=0; somehugearray[l]!=0; ){   /* stop once the whole string is consumed */
    for(k=0; k<250 && somehugearray[l]!=0; k++){
        chunk[k] = somehugearray[l];
        l++;
    }
    chunk[k] = '\0';
    dohandoff(chunk);
}
If you strive for performance and you're allowed to touch the string a bit (i.e. the buffer is not const, no thread safety issues etc.), you could momentarily null-terminate the string at intervals of 250 characters and send it in chunks, directly from the original string:
char *str_end = str + strlen(str);
char *chunk_start = str;
while (1) { /* plain C: 'true' would need <stdbool.h> */
char *chunk_end = chunk_start + 250;
if (chunk_end >= str_end) {
transmit(chunk_start);
break;
}
char hijacked = *chunk_end;
*chunk_end = '\0';
transmit(chunk_start);
*chunk_end = hijacked;
chunk_start = chunk_end;
}
jvasaks's answer is basically correct, except that he hasn't null terminated 'block'. The code should be this:
void send250(char *inMsg, int msgLen)
{
char block[250];
while (msgLen > 0)
{
int len = (msgLen>249) ? 249 : msgLen;
strncpy(block, inMsg, 249);
block[249] = 0;
// send block to other entity
msgLen -= len;
inMsg += len;
}
}
So now the block is 250 characters including the terminating null. Because strncpy pads the destination with null characters, it terminates the last block by itself if fewer than 249 characters remain.
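As a minimal sketch of the same idea with explicit lengths, the variant below uses memcpy and always writes the terminator itself, so it does not depend on strncpy's padding behaviour. The names send_chunks, chunk_sink and count_bytes are hypothetical; the callback merely stands in for "send block to other entity", and the limit of 250 payload characters per chunk is an assumption about what the receiver accepts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_MAX 250  /* assumed payload limit per chunk, excluding the NUL */

/* Hypothetical callback type standing in for "send block to other entity". */
typedef void (*chunk_sink)(const char *chunk, size_t len, void *ctx);

/* Splits msg (msg_len bytes) into pieces of at most CHUNK_MAX bytes and
   hands each piece to sink as a NUL-terminated string.
   Returns the number of chunks handed over. */
size_t send_chunks(const char *msg, size_t msg_len, chunk_sink sink, void *ctx)
{
    char block[CHUNK_MAX + 1];
    size_t sent = 0;

    assert(msg != NULL && sink != NULL);
    while (msg_len > 0) {
        size_t len = msg_len > CHUNK_MAX ? CHUNK_MAX : msg_len;
        memcpy(block, msg, len);   /* copies exactly len bytes */
        block[len] = '\0';         /* every chunk is terminated explicitly */
        sink(block, len, ctx);
        msg += len;
        msg_len -= len;
        sent++;
    }
    return sent;
}

/* Example sink: adds up the chunk lengths in the size_t that ctx points to. */
void count_bytes(const char *chunk, size_t len, void *ctx)
{
    assert(chunk[len] == '\0');
    *(size_t *)ctx += len;
}
```

With a 600-character message this hands over three chunks of 250, 250 and 100 characters.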
I'm using an array of strings in C to hold arguments given to a custom shell. I initialize the array of buffers using:
char *args[MAX_CHAR];
Once I parse the arguments, I send them to the following function to determine the type of IO redirection if there are any (this is just the first of 3 functions to check for redirection and it only checks for STDIN redirection).
int parseInputFile(char **args, char *inputFilePath) {
char *inputSymbol = "<";
int isFound = 0;
for (int i = 0; i < MAX_ARG; i++) {
if (strlen(args[i]) == 0) {
isFound = 0;
break;
}
if ((strcmp(args[i], inputSymbol)) == 0) {
strcpy(inputFilePath, args[i+1]);
isFound = 1;
break;
}
}
return isFound;
}
Once I compile and run the shell, it crashes with a SIGSEGV. Using GDB I determined that the shell is crashing on the following line:
if (strlen(args[i]) == 0) {
This is because the address of args[i] (the first empty string after the parsed commands) is inaccessible. Here is the error from GDB and all relevant variables:
(gdb) next
359 if (strlen(args[i]) == 0) {
(gdb) p args[0]
$1 = 0x7fffffffe570 "echo"
(gdb) p args[1]
$2 = 0x7fffffffe575 "test"
(gdb) p args[2]
$3 = 0x0
(gdb) p i
$4 = 2
(gdb) next
Program received signal SIGSEGV, Segmentation fault.
parseInputFile (args=0x7fffffffd570, inputFilePath=0x7fffffffd240 "") at shell.c:359
359 if (strlen(args[i]) == 0) {
I believe that the p args[2] returning $3 = 0x0 means that because the index has yet to be written to, it is mapped to address 0x0 which is out of the bounds of execution. Although I can't figure out why this is because it was declared as a buffer. Any suggestions on how to solve this problem?
EDIT: Per Kaylum's comment, here is a minimal reproducible example
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
#include<unistd.h>
#include<sys/types.h>
#include<sys/wait.h>
#include <sys/stat.h>
#include<readline/readline.h>
#include<readline/history.h>
#include <fcntl.h>
// Defined values
#define MAX_CHAR 256
#define MAX_ARG 64
#define clear() printf("\033[H\033[J") // Clear window
#define DEFAULT_PROMPT_SUFFIX "> "
char PROMPT[MAX_CHAR], SPATH[1024];
int parseInputFile(char **args, char *inputFilePath) {
char *inputSymbol = "<";
int isFound = 0;
for (int i = 0; i < MAX_ARG; i++) {
if (strlen(args[i]) == 0) {
isFound = 0;
break;
}
if ((strcmp(args[i], inputSymbol)) == 0) {
strcpy(inputFilePath, args[i+1]);
isFound = 1;
break;
}
}
return isFound;
}
int ioRedirectHandler(char **args) {
char inputFilePath[MAX_CHAR] = "";
// Check if any redirects exist
if (parseInputFile(args, inputFilePath)) {
return 1;
} else {
return 0;
}
}
void parseArgs(char *cmd, char **cmdArgs) {
int na;
// Separate each argument of a command to a separate string
for (na = 0; na < MAX_ARG; na++) {
cmdArgs[na] = strsep(&cmd, " ");
if (cmdArgs[na] == NULL) {
break;
}
if (strlen(cmdArgs[na]) == 0) {
na--;
}
}
}
int processInput(char* input, char **args, char **pipedArgs) {
// Parse the single command and args
parseArgs(input, args);
return 0;
}
int getInput(char *input) {
char *buf, loc_prompt[MAX_CHAR] = "\n";
strcat(loc_prompt, PROMPT);
buf = readline(loc_prompt);
if (strlen(buf) != 0) {
add_history(buf);
strcpy(input, buf);
return 0;
} else {
return 1;
}
}
void init() {
char *uname;
clear();
uname = getenv("USER");
printf("\n\n \t\tWelcome to Student Shell, %s! \n\n", uname);
// Initialize the prompt
snprintf(PROMPT, MAX_CHAR, "%s%s", uname, DEFAULT_PROMPT_SUFFIX);
}
int main() {
char input[MAX_CHAR];
char *args[MAX_CHAR], *pipedArgs[MAX_CHAR];
int isPiped = 0, isIORedir = 0;
init();
while(1) {
// Get the user input
if (getInput(input)) {
continue;
}
isPiped = processInput(input, args, pipedArgs);
isIORedir = ioRedirectHandler(args);
}
return 0;
}
Note: If I forgot to include any important information, please let me know and I can get it updated.
When you write
char *args[MAX_CHAR];
you allocate room for MAX_CHAR pointers to char. You do not initialise the array. If it were a global variable, all the pointers would be initialised to NULL, but you declare it inside a function, so the elements of the array can point anywhere. You should not dereference them before you have set them to point at something you are allowed to access.
You do set them, though, in parseArgs(), with this line:
cmdArgs[na] = strsep(&cmd, " ");
There are two potential issues here, but let's deal with the one you hit first. When strsep() is through the tokens you are splitting, it returns NULL. You test for that to get out of parseArgs() so you already know this. However, where your program crashes you seem to have forgotten this again. You call strlen() on a NULL pointer, and that is a no-no.
There is a difference between NULL and the empty string. An empty string is a pointer to a buffer that has the zero char first; the string "" is a pointer to a location that holds the character '\0'. The NULL pointer is a special value for pointers, often address zero, that means that the pointer doesn't point anywhere. Obviously, the NULL pointer cannot point to an empty string. You need to check if an argument is NULL, not if it is the empty string.
If you want to check both for NULL and the empty string, you could do something like
if (!args[i] || strlen(args[i]) == 0) {
If args[i] is NULL then !args[i] is true, so you will enter the if body if you have NULL or if you have a pointer to an empty string.
(You could also check the empty string with !(*args[i]); *args[i] is the first character that args[i] points at. So *args[i] is zero if you have the empty string; zero is interpreted as false, so !(*args[i]) is true if and only if args[i] is the empty string. Not that this is more readable, but it shows again the difference between empty strings and NULL).
I mentioned another issue with the parsed arguments. Whether it is a problem or not depends on the application. But when you parse a string with strsep(), you get pointers into the parsed string. You have to be careful not to free that string (it is input in your main() function) or to modify it after you have parsed the string. If you change the string, you have changed what all the parsed strings look at. You do not do this in your program, so it isn't a problem here, but it is worth keeping in mind. If you want your parsed arguments to survive longer than they do now, after the next command is passed, you need to copy them. The next command that is passed will change them as it is now.
In main
char input[MAX_CHAR];
char *args[MAX_CHAR], *pipedArgs[MAX_CHAR];
are all uninitialized. They contain indeterminate values. This could be a potential source of bugs, but is not the reason here, as
getInput modifies the contents of input to be a valid string before any reads occur.
pipedArgs is unused, so raises no issues (yet).
args is modified by parseArgs to (possibly!) contain a NULL sentinel value, without any indeterminate pointers being read first.
Firstly, in parseArgs it is possible to completely fill args without setting the NULL sentinel value that other parts of the program should rely on.
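One way to guarantee the sentinel is to reserve the last slot for it. The following is only a sketch under that assumption: parse_args is a hypothetical replacement for parseArgs, and it uses the standard strspn/strcspn functions instead of strsep so that it does not depend on a BSD/glibc extension.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits cmd in place on spaces. Stores at most max_args - 1 tokens in
   argv_out and always writes a NULL sentinel after the last one.
   Returns the number of tokens stored. */
size_t parse_args(char *cmd, char **argv_out, size_t max_args)
{
    size_t n = 0;

    assert(cmd != NULL && argv_out != NULL && max_args > 0);
    while (n + 1 < max_args) {          /* keep one slot for the sentinel */
        cmd += strspn(cmd, " ");        /* skip leading spaces */
        if (*cmd == '\0')
            break;
        argv_out[n++] = cmd;
        cmd += strcspn(cmd, " ");       /* move to the end of the token */
        if (*cmd == '\0')
            break;
        *cmd++ = '\0';                  /* terminate the token */
    }
    argv_out[n] = NULL;
    return n;
}
```

Because the function returns the token count, callers can loop up to that count instead of scanning for the sentinel.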
Looking deeper, in parseInputFile the following
if (strlen(args[i]) == 0)
contradicts the limits imposed by parseArgs that disallows empty strings in the array. More importantly, args[i] may be the sentinel NULL value, and strlen expects a non-NULL pointer to a valid string.
This termination condition should simply check if args[i] is NULL.
With
strcpy(inputFilePath, args[i+1]);
args[i+1] might also be the NULL sentinel value, and strcpy also expects non-NULL pointers to valid strings. You can see this in action when inputSymbol is a match for the final token in the array.
args[i+1] may also evaluate as args[MAX_ARG], which would be out of bounds.
Additionally, inputFilePath has a string length limit of MAX_CHAR - 1, and args[i+1] is (possibly!) a dynamically allocated string whose length might exceed this.
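Put together, a defensive version of the search could look like the sketch below. The name find_input_redirect and its return codes are assumptions made for illustration, not part of the original program; the point is that the NULL sentinel, the array bound and the destination size are all checked before anything is dereferenced or copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 and copies the path into out if "<" is followed by a file name,
   0 if no "<" is present, and -1 if "<" is the last token or the path does
   not fit into out. Scanning stops at the NULL sentinel or at max_args. */
int find_input_redirect(char *const *args, size_t max_args,
                        char *out, size_t out_size)
{
    assert(args != NULL && out != NULL && out_size > 0);
    for (size_t i = 0; i < max_args && args[i] != NULL; i++) {
        if (strcmp(args[i], "<") != 0)
            continue;
        /* The token after "<" must exist and must not be the sentinel. */
        if (i + 1 >= max_args || args[i + 1] == NULL)
            return -1;
        if (strlen(args[i + 1]) >= out_size)
            return -1;
        strcpy(out, args[i + 1]);   /* safe: length was checked above */
        return 1;
    }
    return 0;
}
```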
Some edge cases found in getInput:
Both arguments to
strcat(loc_prompt, PROMPT);
are of the size MAX_CHAR. Since loc_prompt has a length of 1, if PROMPT has the length MAX_CHAR - 1, the resulting string will have the length MAX_CHAR. This would leave no room for the NUL terminating byte.
readline can return NULL in some situations, so
buf = readline(loc_prompt);
if (strlen(buf) != 0) {
can again pass the NULL pointer to strlen.
A similar issue as before, on success readline returns a string of dynamic length, and
strcpy(input, buf);
can cause a buffer overflow by attempting to copy a string greater in length than MAX_CHAR - 1.
buf is a pointer to data allocated by malloc. It's unclear what add_history does, but this pointer must eventually be passed to free.
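The NULL and length checks can be isolated in a small helper. This is a sketch only: copy_input and its return codes are hypothetical, and the caller would still be responsible for passing the pointer returned by readline to free afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies line into dest if it is non-NULL, non-empty and fits.
   Returns 0 on success, 1 if there is nothing to copy (NULL or empty),
   and -1 if line is too long for dest. */
int copy_input(char *dest, size_t dest_size, const char *line)
{
    size_t len;

    assert(dest != NULL && dest_size > 0);
    if (line == NULL || *line == '\0')
        return 1;
    len = strlen(line);
    if (len >= dest_size)           /* no room for the terminating NUL */
        return -1;
    memcpy(dest, line, len + 1);    /* copy including the NUL */
    return 0;
}
```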
Some considerations.
Firstly, it is a good habit to initialize your data, even if it might not matter.
Secondly, using constants (#define MAX_CHAR 256) might help to reduce magic numbers, but they can lead you to design your program too rigidly if used in the same way.
Consider building your functions to accept a limit as an argument, and return a length. This allows you to more strictly track the sizes of your data, and prevents you from always designing around the maximum potential case.
A slightly contrived example of designing like this. We can see that find does not have to concern itself with possibly checking MAX_ARGS elements, as it is told precisely how long the list of valid elements is.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_ARGS 100
char *get_input(char *dest, size_t sz, const char *display) {
char *res;
if (display)
printf("%s", display);
if ((res = fgets(dest, sz, stdin)))
dest[strcspn(dest, "\n")] = '\0';
return res;
}
size_t find(char **list, size_t length, const char *str) {
for (size_t i = 0; i < length; i++)
if (strcmp(list[i], str) == 0)
return i;
return length;
}
size_t split(char **list, size_t limit, char *source, const char *delim) {
size_t length = 0;
char *token;
while (length < limit && (token = strsep(&source, delim)))
if (*token)
list[length++] = token;
return length;
}
int main(void) {
char input[512] = { 0 };
char *args[MAX_ARGS] = { 0 };
puts("Welcome to the shell.");
while (1) {
if (get_input(input, sizeof input, "$ ")) {
size_t argl = split(args, MAX_ARGS, input, " ");
size_t redirection = find(args, argl, "<");
puts("Command parts:");
for (size_t i = 0; i < redirection; i++)
printf("%zu: %s\n", i, args[i]);
puts("Input files:");
if (redirection == argl)
puts("[[NONE]]");
else for (size_t i = redirection + 1; i < argl; i++)
printf("%zu: %s\n", i, args[i]);
}
}
}
I am trying to concatenate a few strings to a buffer. However, if I call the function repeatedly, the size of my buffer will keep growing.
void print_message(char *str) {
char message[8196];
sender *m = senderlist;
while(m) {
/* note: stricmp() is a case-insensitive version of strcmp() */
if(stricmp(m->sender,str)==0) {
strcat(message,m->sender);
strcat(message,", ");
}
m = m->next;
}
printf("strlen: %i",strlen(message));
printf("Message: %s\n",message);
return;
}
The size of message keeps growing until the length reaches 3799.
Example:
1st. call: strlen = 211
2nd call: strlen = 514
3rd call: strlen = 844
...
nth call: strlen = 3799
nth +1 call: strlen = 3799
nth +2 call: strlen = 3799
My understanding was that statically allocated variables like char[] are automatically freed upon exiting the function, and I'm not dynamically allocating anything on the heap.
And why does it suddenly stop growing at 3799 bytes? Thanks for any pointers.
Add one more statement after the buffer definition
char message[8196];
message[0] = '\0';
Or initialize the buffer when it is defined
char message[8196] = { '\0' };
or
char message[8196] = "";
which is fully equivalent to the previous initialization.
The problem with your code is that the compiler does not initialize the buffer unless you specify an initializer explicitly. So the array message contains garbage, yet strcat first searches the buffer for the terminating zero in order to append the new string after it. Your program therefore has undefined behaviour.
What you are seeing is either the senderlist growing or, more likely, leftover garbage in message; fortunately it has not exceeded 8196 bytes.
The message array must start out as the empty string. At the moment, strcat appends to garbage.
char message[8196];
sender *m = senderlist;
int len = 0;
*message = '\0';
while(m) {
/* note: stricmp() is a case-insensitive version of strcmp() */
if(stricmp(m->sender,str)==0) {
int sender_len = strlen(m->sender);
if (len + sender_len + 2 + 1 < sizeof(message)) {
strcpy(message + len, m->sender);
len += sender_len;
strcpy(message + len, ", ");
len += 2;
} else {
// Maybe appending "..." instead (+ 3 + 1 < ...).
break;
}
}
m = m->next;
}
printf("strlen: %i",(int)strlen(message)); /* strlen() returns size_t, so cast it for %i */
printf("Message: %s\n",message);
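The bookkeeping in the loop above can also be moved into a small helper. The sketch below is one possible shape for it; append_bounded is a hypothetical name, and it assumes the caller starts with an empty string and a length of 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Appends s to buf if it fits. *len tracks the current string length, so
   the buffer is never rescanned the way strcat rescans it.
   Returns 1 on success and 0 (leaving buf unchanged) if s does not fit. */
int append_bounded(char *buf, size_t cap, size_t *len, const char *s)
{
    size_t n = strlen(s);

    assert(buf != NULL && len != NULL && *len < cap);
    if (n >= cap - *len)            /* need room for n bytes plus the NUL */
        return 0;
    memcpy(buf + *len, s, n + 1);   /* copy including the NUL */
    *len += n;
    return 1;
}
```

A caller would declare `char message[8196] = ""; size_t len = 0;` and stop appending as soon as the function returns 0.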
"Deallocation" is not the same as wiping the data; in fact, C generally leaves the data unerased for performance reasons.
I have a program that displays UTF-8 encoded strings with a size limitation (say MAX_LEN).
Whenever I get a string with a length > MAX_LEN, I want to find out where I could split it so it would be printed gracefully.
For example:
#define MAX_LEN 30U
const char big_str[] = "This string cannot be displayed on one single line: it must be splitted";
Without processing, the output looks like:
"This string cannot be displaye" // Truncated because of size limitation
"d on one single line: it must "
"be splitted"
The client will be able to choose the eligible delimiters for the split, but for now I have defined a default list of delimiters:
#define DEFAULT_DELIMITERS " ;:,)]" // Delimiters to track in the string
So I am looking for an elegant and lightweight way of handling this issue without using malloc: my API should not return the sub-strings, I just want the positions of the sub-strings to display.
I already have some ideas that I will propose in an answer: any feedback (e.g. pros and cons) would be appreciated, but most of all I am interested in alternative solutions.
I just want the positions of the sub-strings to display.
So all you need is one function analysing your input returning the positions where a delimiter was found.
A possible approach using strpbrk(), assuming at least C99:
#include <unistd.h> /* for ssize_t */
#include <string.h>
#define DELIMITERS (" ;.")
void find_delimiter_positions(
const char * input,
const char * delimiters,
ssize_t * delimiter_positions)
{
ssize_t dp_current = 0;
const char * p = input;
while (NULL != (p = strpbrk(p, delimiters)))
{
delimiter_positions[dp_current] = p - input;
++dp_current;
++p;
}
}
int main(void)
{
    char input[] = "some randrom data; more.";
    size_t input_length = strlen(input);
    /* One extra slot so that a -1 sentinel always terminates the list,
       even if every character is a delimiter. */
    ssize_t delimiter_positions[input_length + 1];
    for (size_t s = 0; s <= input_length; ++s)
    {
        delimiter_positions[s] = -1;
    }
    find_delimiter_positions(input, DELIMITERS, delimiter_positions);
    for (size_t s = 0; -1 != delimiter_positions[s]; ++s)
    {
        /* print out positions */
    }
}
Why C99: C99 introduced variable-length arrays (VLAs), which are necessary here to get around the restriction on dynamic memory allocation.
If VLAs may not be used either, one needs to fall back to defining a maximum number of possible occurrences of delimiters per string. That might well be feasible, as the maximum length of the string to be parsed is given, which in turn implies the maximum number of possible delimiters per string.
In that case these lines from the example above
char input[] = "some randrom data; more.";
size_t input_length = strlen(input);
ssize_t delimiter_positions[input_length + 1];
could be replaced by
char input[MAX_INPUT_LEN] = "some randrom data; more.";
size_t input_length = strlen(input);
ssize_t delimiter_positions[MAX_INPUT_LEN];
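A variation on the same strpbrk() loop is to pass the capacity of the output array and return the number of positions found, which removes the need for a sentinel value. This is a sketch under that assumption; find_delimiter_positions_n is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stores the offsets of up to cap delimiters found in input into pos.
   Returns the number of offsets stored. */
size_t find_delimiter_positions_n(const char *input, const char *delimiters,
                                  size_t *pos, size_t cap)
{
    size_t n = 0;
    const char *p = input;

    assert(input != NULL && delimiters != NULL && pos != NULL);
    while (n < cap && (p = strpbrk(p, delimiters)) != NULL) {
        pos[n++] = (size_t)(p - input);
        ++p;    /* continue the search after the delimiter just found */
    }
    return n;
}
```

For the input "some randrom data; more." and the delimiters " ;." this stores the offsets 4, 12, 17, 18 and 23.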
An approach that doesn't require additional storage is to make the wrapping function call a callback function for each substring. In the example below, the string is just printed with plain old printf, but the callback could call any other API function.
Things to note:
There is a function next that should advance a pointer to the next UTF-8 character. The encoding width of a UTF-8 character can be seen from its first byte.
The space and punctuation delimiters are treated slightly differently: spaces are appended neither to the end nor to the beginning of a line. (If there aren't any consecutive spaces in the text, that is.) Punctuation is retained at the end of a line.
Here's an example implementation:
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#define DELIMITERS " ;:,)]"
/*
* Advance to next character. This should advance the pointer by
* one to four bytes, depending on the UTF-8 encoding. (But at the
* moment, it always advances by a single byte.)
*/
static const char *next(const char *p)
{
return p + 1;
}
typedef struct {
const char *begin;
const char *end;
} substr_t;
/*
* Wraps the text and stores the ranges of the substrings found into
* the lines array. Returns the number of word-wrapped lines, or -1
* if more than max_num_lines lines would be needed.
*/
int wrap(const char *text, int width, substr_t *lines, uint32_t max_num_lines)
{
const char *begin = text;
const char *split = NULL;
uint32_t num_lines = 1;
int l = 0;
while (*text) {
if (strchr(DELIMITERS, *text)) {
split = text;
if (*text != ' ') split++;
}
if (l++ == width) {
if (split == NULL) split = text;
lines[num_lines - 1].begin = begin;
lines[num_lines - 1].end = split;
//write(fileno(stdout), begin, split - begin);
text = begin = split;
while (*begin == ' ') begin++;
split = NULL;
l = 0;
num_lines++;
if (num_lines > max_num_lines) {
//abort();
return -1;
}
}
text = next(text);
}
lines[num_lines - 1].begin = begin;
lines[num_lines - 1].end = text;
//write(fileno(stdout), begin, split - begin);
return num_lines;
}
int main()
{
const char *text = "I have a program that displays UTF-8 encoded strings "
"with a size limitation (say MAX_LEN). Whenever I get a string with a "
"length > MAX_LEN, I want to find out where I could split it so it "
"would be printed gracefully.";
substr_t lines[100];
const uint32_t max_num_lines = sizeof(lines) / sizeof(lines[0]);
const int num_lines = wrap(text, 48, lines, max_num_lines);
if (num_lines < 0) {
fprintf(stderr, "error: can't split into %u lines\n", (unsigned)max_num_lines);
return EXIT_FAILURE;
}
//printf("num_lines = %d\n", num_lines);
for (int i=0; i < num_lines; i++) {
FILE *stream = stdout;
const ptrdiff_t line_length = lines[i].end - lines[i].begin;
/* fwrite() shares the stdio buffer with fputc(), so the output order is preserved */
fwrite(lines[i].begin, 1, (size_t)line_length, stream);
fputc('\n', stream);
}
return EXIT_SUCCESS;
}
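The next function above is a placeholder that steps one byte at a time. A UTF-8 aware version could look like the following sketch, which assumes the input is valid, NUL-terminated UTF-8 and that the pointer starts at the first byte of a character. The name utf8_next is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Advances p by one UTF-8 encoded character by skipping the lead byte and
   any continuation bytes that follow it. */
const char *utf8_next(const char *p)
{
    assert(p != NULL);
    if (*p == '\0')
        return p;                   /* never step past the terminator */
    p++;
    /* Continuation bytes have the bit pattern 10xxxxxx. */
    while (((unsigned char)*p & 0xC0) == 0x80)
        p++;
    return p;
}
```

Used in place of next, this makes the width counter in wrap count characters rather than bytes; it still does not account for combining characters or double-width glyphs.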
Addendum: Here's another approach that builds loosely on the strtok pattern, but without modifying the string. It requires a state and that state must be initialised with the string to print and the maximum line width:
struct wrap_t {
const char *src;
int width;
int length;
const char *line;
};
int wrap(struct wrap_t *line)
{
const char *begin = line->src;
const char *split = NULL;
int l = 0;
if (begin == NULL) return -1;
while (*begin == ' ') begin++;
if (*begin == '\0') return -1;
while (*line->src) {
if (strchr(DELIMITERS, *line->src)) {
split = line->src;
if (*line->src != ' ') split++;
}
if (l++ == line->width) {
if (split == NULL) split = line->src;
line->line = begin;
line->length = split - begin;
line->src = split;
return 0;
}
line->src = next(line->src);
}
line->line = begin;
line->length = line->src - begin;
return 0;
}
All definitions not shown (DELIMITERS, next) are as above and the basic algorithm hasn't changed. I think this method is easy to use for the client:
int main()
{
const char *text = "I have a program that displays UTF-8 encoded strings "
"with a size limitation (say MAX_LEN). Whenever I get a string with a "
"length > MAX_LEN, I want to find out where I could split it so it "
"would be printed gracefully.";
struct wrap_t line = {text, 60};
while (wrap(&line) == 0) {
printf("%.*s\n", line.length, line.line);
}
return 0;
}
Solution1
A function that will be called successively until the whole string is processed: it would return the count of bytes to recopy to create the sub-strings:
The API:
/**
* Return the length between the beginning of the string and the
* last delimiter (such that returned length <= max_length)
*/
size_t get_next_substring_length(
const char * str, // The string to be split
const char * delim, // String of eligible delimiters for a split
size_t max_length); // The maximum length of resulting substring
On the client side:
size_t shift = 0;
for(;;)
{
// Where do we start within big_str ?
const char * tmp = big_str + shift;
size_t count = get_next_substring_length(tmp, DEFAULT_DELIMITERS, MAX_LEN);
if(count)
{
// Allocate a sub-string and recopy "count" bytes
// Display the sub-string
shift += count;
}
else // End Of String (or error)
{
// Handle potential error
// Exit the loop
}
}
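A possible implementation of get_next_substring_length is sketched below. It is only one interpretation of the API comment: it counts bytes rather than UTF-8 characters, keeps the delimiter at the end of the returned range, and falls back to a hard split at max_length when no delimiter is found (which could cut a multi-byte character in half).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the length of the next piece of str: the whole rest of the string
   if it fits into max_length, otherwise the length up to and including the
   last delimiter within the first max_length bytes. Returns 0 at the end of
   the string. */
size_t get_next_substring_length(const char *str, const char *delim,
                                 size_t max_length)
{
    size_t len;

    assert(str != NULL && delim != NULL && max_length > 0);
    len = strlen(str);
    if (len <= max_length)
        return len;
    /* Search backwards from max_length for the last delimiter. */
    for (size_t i = max_length; i > 0; i--)
        if (strchr(delim, str[i - 1]) != NULL)
            return i;
    return max_length;  /* no delimiter found: hard split */
}
```

With big_str and MAX_LEN from the question, successive calls return 22, 30 and 19, and then 0 once the string is exhausted.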
Solution2
Define a custom structure to store positions and lengths of sub-strings:
const char * str = "This is a long test string";
struct substrings
{
const char * str; // Beginning of the substring
size_t length; // Length of the substring
} sub[] = { {&str[0], 4},
{&str[5], 2},
{&str[8], 1},
{&str[10], 4},
{&str[15], 4},
{&str[20], 6},
{NULL, 0} };
The API:
size_t find_substrings(
struct substrings * substr, // Array to fill, terminated by a {NULL, 0} entry
size_t max_length, // Number of elements in substr
const char * delimiters,
const char * str);
On the client side:
#define ARRAY_LENGTH 20U
struct substrings substr[ARRAY_LENGTH];
// Fill the structure
find_substrings(
substr,
ARRAY_LENGTH,
DEFAULT_DELIMITERS,
big_str);
// Browse the structure
for (struct substrings * sub = &substr[0]; sub->str; sub++)
{
// Display sub->length bytes of sub->str
}
Some things are bothering me though:
in Solution1 I don't like the infinite loop, as such loops are often bug-prone
in Solution2 I fixed ARRAY_LENGTH arbitrarily, but it should vary depending on the input string length
I'd like to know what's the most memory efficient way to read & store a list of strings in C.
Each string may have a different length, so pre-allocating a big 2D array would be wasteful.
I also want to avoid a separate malloc for each string, as there may be many strings.
The strings will be read from a large buffer into this list data-structure I'm asking about.
Is it possible to store all strings separately with a single allocation of exactly the right size?
One idea I have is to store them contiguously in a buffer, then have a char * array pointing to the different parts in the buffer, which will have '\0's in it to delimit. I'm hoping there's a better way though.
struct list {
char *index[32];
char buf[];
};
The data-structure and strings will be strictly read-only.
Here's a mildly efficient format, assuming you know the length of all the strings in advance:
|| total size | string 1 | string 2 | ........ | string N | len(string N) | ... | len(string 2) | len(string 1) ||
You can store the lengths either in fixed-width integers or in variable-width integers, but the point is that you can jump to the end and scan all the lengths relatively efficiently, and from the length sum you can compute the offset of the string. You know when you reached the last string when there is no remaining space.
You can create your single buffer and store them contiguously, expanding the buffer as needed by using realloc(). But then you would need a second array to store string positions and maybe realloc() it as well, so I might simply create a dynamically allocated array and malloc() each string separately.
Find the number and total-length of all strings:
int num = 0;
int len = 0;
char* string = GetNextString(input);
while (string)
{
num += 1;
len += strlen(string);
string = GetNextString(input);
}
Rewind(input);
Then, allocate the following two buffers:
int* indexes = malloc(num*sizeof(int));
char* strings = malloc((num+len)*sizeof(char));
Finally, fill these two buffers:
int index = 0;
for (int i=0; i<num; i++)
{
indexes[i] = index;
string = GetNextString(input);
strcpy(strings+index,string);
index += strlen(string)+1;
}
After that, you can simply use &strings[indexes[i]] (or strings + indexes[i]) in order to access the ith string.
The most time- and memory-efficient way is a two-pass solution. In the first pass you calculate the total size of all strings, then you allocate one memory block of that size. In the second pass you read all strings using large buffers.
You can create a pointer array for the strings and calculate the difference between adjacent pointers to get the string sizes. This way you save the null byte as end marker.
Here is a complete example:
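#include <stdio.h>
#include <string.h> /* memcpy, memset */
#include <stdlib.h>
/* The typedef lets the code below say "StringMap" without "struct" in C */
typedef struct StringMap
{
char *data;
char **ptr;
long cPos;
} StringMap;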
void initStringMap(StringMap *stringMap, long numberOfStrings, long totalCharacters)
{
stringMap->data = (char*)malloc(sizeof(char)*(totalCharacters+1));
stringMap->ptr = (char**)malloc(sizeof(char*)*(numberOfStrings+2));
memset(stringMap->ptr, 0, sizeof(char*)*(numberOfStrings+1));
stringMap->ptr[0] = stringMap->data;
stringMap->ptr[1] = stringMap->data;
stringMap->cPos = 0;
}
void extendString(StringMap *stringMap, char *str, size_t size)
{
memcpy(stringMap->ptr[stringMap->cPos+1], str, size);
stringMap->ptr[stringMap->cPos+1] += size;
}
void endString(StringMap *stringMap)
{
stringMap->cPos++;
stringMap->ptr[stringMap->cPos+1] = stringMap->ptr[stringMap->cPos];
}
long numberOfStringsInStringMap(StringMap *stringMap)
{
return stringMap->cPos;
}
size_t stringSizeInStringMap(StringMap *stringMap, long index)
{
return stringMap->ptr[index+1] - stringMap->ptr[index];
}
char* stringinStringMap(StringMap *stringMap, long index)
{
return stringMap->ptr[index];
}
void freeStringMap(StringMap *stringMap)
{
free(stringMap->data);
free(stringMap->ptr);
}
int main()
{
// The interesting values
long numberOfStrings = 0;
long totalCharacters = 0;
// Scan the input for required information
FILE *fd = fopen("/path/to/large/textfile.txt", "r");
int bufferSize = 4096;
char *readBuffer = (char*)malloc(sizeof(char)*bufferSize);
int currentStringLength = 0;
size_t readBytes; /* fread() returns size_t */
while ((readBytes = fread(readBuffer, sizeof(char), bufferSize, fd))>0) {
for (int i = 0; i < readBytes; ++i) {
const char c = readBuffer[i];
if (c != '\n') {
++currentStringLength;
} else {
++numberOfStrings;
totalCharacters += currentStringLength;
currentStringLength = 0;
}
}
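}
// The second pass always closes one more string after its loop (the text
// after the last '\n', possibly empty), so count it here as well.
++numberOfStrings;
totalCharacters += currentStringLength;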
// Display the found results
printf("Found %ld strings with total of %ld bytes\n", numberOfStrings, totalCharacters);
// Allocate the memory for the resource
StringMap stringMap;
initStringMap(&stringMap, numberOfStrings, totalCharacters);
// read all strings
rewind(fd);
while ((readBytes = fread(readBuffer, sizeof(char), bufferSize, fd))>0) {
char *stringStart = readBuffer;
for (int i = 0; i < readBytes; ++i) {
const char c = readBuffer[i];
if (c == '\n') {
extendString(&stringMap, stringStart, &readBuffer[i]-stringStart);
endString(&stringMap);
stringStart = &readBuffer[i+1];
}
}
if (stringStart < &readBuffer[readBytes]) {
extendString(&stringMap, stringStart, &readBuffer[readBytes]-stringStart);
}
}
endString(&stringMap);
fclose(fd);
// Ok read the list
numberOfStrings = numberOfStringsInStringMap(&stringMap);
printf("Number of strings in map: %ld\n", numberOfStrings);
for (long i = 0; i < numberOfStrings; ++i) {
size_t stringSize = stringSizeInStringMap(&stringMap, i);
char *buffer = (char*)malloc(stringSize+1);
memcpy(buffer, stringinStringMap(&stringMap, i), stringSize);
buffer[stringSize] = '\0'; /* terminate after the last copied byte */
printf("string %05ld size=%8zu : %s\n", i, stringSize, buffer);
free(buffer);
}
// free the resource
freeStringMap(&stringMap);
}
This example reads a very large text file, splits it into lines and creates an array with one string per line. The string storage itself needs only two malloc calls: one for the pointer array and one for the string block.
If it's strictly read-only as you've described, you can store the entire list of strings and their offsets in a single chunk of memory and read the whole thing with a single read.
The first sizeof(long) bytes stores the number of strings, n. The next n longs store the offsets into each string from the start of the string buffer which starts at position (n+1)*sizeof(long). You don't have to store the trailing zero for each string, but if you do, you can access each string with &str_buffer[offset[i]]. If you don't store the trailing '\0' then you would have to copy into a temporary buffer and append it yourself.
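A minimal in-memory sketch of such a layout follows. It uses size_t rather than long for the count and offsets, stores the trailing '\0' of each string, and takes the strings from an array; all names (packed_strings, pack_strings, packed_get) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Layout of the single block: [ count | offsets[count] | string bytes ]. */
typedef struct {
    size_t count;
    size_t offsets[];   /* C99 flexible array member; string data follows */
} packed_strings;

/* Copies count strings into one malloc'ed block. Returns NULL on failure;
   the caller releases everything with a single free(). */
packed_strings *pack_strings(const char *const *src, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += strlen(src[i]) + 1;            /* include each NUL */

    size_t header = sizeof(packed_strings) + count * sizeof(size_t);
    packed_strings *ps = malloc(header + total);
    if (ps == NULL)
        return NULL;

    char *data = (char *)(ps->offsets + count); /* first byte after offsets */
    size_t off = 0;
    ps->count = count;
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(src[i]) + 1;
        ps->offsets[i] = off;
        memcpy(data + off, src[i], n);
        off += n;
    }
    return ps;
}

/* Returns the i-th string; valid as long as ps has not been freed. */
const char *packed_get(const packed_strings *ps, size_t i)
{
    assert(ps != NULL && i < ps->count);
    return (const char *)(ps->offsets + ps->count) + ps->offsets[i];
}
```

Because the block contains offsets instead of pointers, it can be written to disk and read back with a single read, as long as the reader uses the same size_t width and byte order.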
Let's suppose there is a piece of code like this:
my $str = 'some text';
my $result = my_subroutine($str);
and my_subroutine() should be implemented as Perl XS code. For example it could return the sum of bytes of the (unicode) string.
In the XS code, how do I process a string (a) char by char, as a general method, and (b) byte by byte, if the string consists only of ASCII characters (is there a built-in function to convert from the native data structure of a string to char[])?
At the XS layer, you'll get byte or UTF-8 strings. In the general case, your code will likely contain a char * to point at the next item in the string, incrementing it as it goes. For a useful set of UTF-8 support functions to use in XS, read the "Unicode Support" section of perlapi.
An example of mine from http://cpansearch.perl.org/src/PEVANS/Tickit-0.15/lib/Tickit/Utils.xs
int textwidth(str)
SV *str
INIT:
STRLEN len;
const char *s, *e;
CODE:
RETVAL = 0;
if(!SvUTF8(str)) {
str = sv_mortalcopy(str);
sv_utf8_upgrade(str);
}
s = SvPV_const(str, len);
e = s + len;
while(s < e) {
UV ord = utf8n_to_uvchr(s, e-s, &len, (UTF8_DISALLOW_SURROGATE
|UTF8_WARN_SURROGATE
|UTF8_DISALLOW_FE_FF
|UTF8_WARN_FE_FF
|UTF8_WARN_NONCHAR));
int width = wcwidth(ord);
if(width == -1)
XSRETURN_UNDEF;
s += len;
RETVAL += width;
}
OUTPUT:
RETVAL
In brief, this function iterates the given string one Unicode character at a time, accumulating the width as given by wcwidth().
If you're expecting bytes:
STRLEN len;
char* buf = SvPVbyte(sv, len);
while (len--) {
char byte = *(buf++);
... do something with byte ...
}
If you're expecting text or any non-byte characters:
STRLEN len;
U8* buf = SvPVutf8(sv, len);
while (len) {
STRLEN ch_len;
UV ch = utf8n_to_uvchr(buf, len, &ch_len, 0);
buf += ch_len;
len -= ch_len;
... do something with ch ...
}