Our programming assignment asked us to break a text file into a set of smaller files with names of the form (filename)partx.txt. For example, if the argument passed to the program is a text file named stack.txt, then the output should be stackpart1.txt, stackpart2.txt, etc., where each file is at most 250 bytes in size.
What is the best way to achieve the part_x naming?
I learned about using a macro with ## to do that. What are the drawbacks of this method, and is there a better way?
Is it good practice to generate variable names this way?
Don't confuse variable names with their content; macros and variable names have nothing to do with your assignment. ## is used to join tokens at compile time (a typical usage is to build identifiers, or in general to build code parametrically in macros), which is a relatively rare and very specialized task.
What you want to do, instead, is to generate strings at runtime based on a pattern (=> you'll have the same string variable that you'll fill with different stuff at each iteration); the right function for this is snprintf.
It's perfectly simple, I'd say: You open a file (fopen returns a FILE *) which you can then read in a loop, using fread to specify the max amount of bytes to read on each iteration. Given the fact you're using a loop anyway, you can increment a simple int to keep track of the chunk-file names, using snprintf to create the name, write the characters read by fread to each file, and continue until you're done.
Some details on fread that might be useful to you
A basic example (needs some work, still):
#include <stdio.h>

int main( void )
{
    int chunk_count = 0;
    size_t chunk_size = 256, bytes_read;
    char buffer[256];
    char chunk_name[50];
    FILE *src_fp, *target_fp;

    src_fp = fopen("stack.txt", "rb");//open source file (a real program must check this for NULL)

    while (chunk_size == (bytes_read = fread(buffer, 1, chunk_size, src_fp)))
    {//read a full chunk
        ++chunk_count;//increase chunk count
        snprintf(chunk_name, sizeof chunk_name, "chunk_part%d.txt", chunk_count);
        target_fp = fopen(chunk_name, "w");
        //write to chunk file
        fwrite(buffer, 1, chunk_size, target_fp);
        fclose(target_fp);//close chunk file
    }
    //don't forget to write the last chunk, if it's not 0 in length
    if (bytes_read > 0)
    {
        ++chunk_count;//increase chunk count
        snprintf(chunk_name, sizeof chunk_name, "chunk_part%d.txt", chunk_count);
        target_fp = fopen(chunk_name, "w");
        //write to chunk file
        fwrite(buffer, 1, bytes_read, target_fp);
        fclose(target_fp);//close chunk file
    }
    fclose(src_fp);
    printf("Written %d files, each of max 256 bytes\n", chunk_count);
    return 0;
}
Note that this code is not exactly safe to use as it stands. You'll need to check the return values of fopen (it can, and at some point will, return NULL). The fread-based loop simply assumes that, if its return value is less than the chunk size, we've reached the end of the source file, which isn't always the case. You'll still have to handle NULL pointers and ferror checks yourself. Either way, the functions to look into are:
fread
fopen
fwrite
fclose
ferror
snprintf
That should do it.
Update, just for the fun of it.
You might want to pad the numbers of your chunk file names (chunk_part0001.txt). To do this, you can try to predict how big the source file is, divide that by 256 to work out how many chunks you're actually going to end up with, and use that many padding zeroes. How to get the file size is explained here, but here's some code I wrote some time ago:
long file_size = 0,
     factor = 10;
int padding_cnt = 1;//at least 1, ensures correct padding
fseek(src_fp, 0, SEEK_END);//go to end of file
file_size = ftell(src_fp);
file_size /= 256;//divided by chunk size => total chunk count
rewind(src_fp);//return to beginning of file
while (1 <= (file_size / factor))
{//count the digits of the chunk count
    factor *= 10;
    ++padding_cnt;
}
//padded chunk file names:
snprintf(chunk_name, sizeof chunk_name, "chunk_part%0*d.txt", padding_cnt, chunk_count);
If you want, I could explain every single statement, but the gist of it is this:
fseek + ftell gets the size of the file (in bytes); dividing that by the chunk size (256) gives the total number of chunks you'll create (+1 for any remainder; padding_cnt starts at 1 so there is always at least one digit)
The while loop divides that total count by the factor; each time the factor is multiplied by 10, the padding count increases by one
the format passed to snprintf changed to %0*d, which means "print an int, zero-padded to the given width". If you end up with 123 chunks, the first chunk file will be called chunk_part001.txt, the tenth file will be chunk_part010.txt, all the way up to chunk_part123.txt.
Refer to the linked question: the accepted answer uses sys/stat.h to get the file size, which is more reliable (though it can pose some minor portability issues). Check the stat wiki for alternatives.
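A rough sketch of that stat-based alternative (POSIX only; it assumes the same 256-byte chunk size and reuses the file_size variable from the snippet above):
#include <sys/stat.h>

struct stat st;
if (stat("stack.txt", &st) == 0)
    file_size = st.st_size / 256;//total chunk count, without seeking around in the stream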
Why? Because it's fun, and it makes the output files easier to sort by name. It also enables you to predict how big the char array that holds the target file name should be, so if you have to allocate that memory using malloc, you know exactly how much memory you'll need, and don't have to allocate 100 chars (which should be enough either way), and hope that you don't run out of space.
Lastly: the more you know, the better IMO, so I thought I'd give you some links and refs you might want to check.
You can either:
Use a macro as suggested (compile time). This requires the file size (and hence the number of sub-files) to be known while writing the code.
Use snprintf() in a loop to generate the filenames (runtime). This works dynamically, based on the actual file size.
That said, the best way is to use snprintf().
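To make the difference concrete, here's a minimal sketch contrasting the two (the CHUNK_VAR identifier and the chunk count of 3 are purely illustrative):
#include <stdio.h>

/* ## pastes tokens at compile time: it can build identifiers, but the part
   number has to be spelled out in the source, so it cannot follow a runtime count */
#define CHUNK_VAR(n) chunk_file_##n

int main(void)
{
    FILE *CHUNK_VAR(1) = NULL; /* expands to: FILE *chunk_file_1 = NULL; */

    /* snprintf builds the name at run time, so one buffer serves every chunk */
    char name[64];
    for (int i = 1; i <= 3; i++) {
        snprintf(name, sizeof name, "stackpart%d.txt", i);
        printf("%s\n", name); /* stackpart1.txt, stackpart2.txt, stackpart3.txt */
    }

    (void)chunk_file_1; /* only here to show what the macro produced */
    return 0;
}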
Related
I have a binary file read buffer that reads structures of variable length. Near the end of the buffer there will always be an incomplete struct. I want to move such a tail of the buffer to its beginning and then read buffer_size - tail_len bytes during the next file read. Something like this:
char buf[8192];
size_t cur = 0, rcur = 0;
while (1) {
    read(fd, &buf[rcur], 8192 - rcur);//fd: descriptor for "file"
    while (cur + sizeof(mystruct) < 8192) {
        mystruct_ptr = (mystruct *)&buf[cur];
        if (mystruct_ptr->tailsize + cur >= 8192) break; //incomplete
        //do stuff
        cur += sizeof(mystruct) + mystruct_ptr->tailsize;
    }
    memcpy(buf, &buf[cur], 8192 - cur);//move the tail to the front
    rcur = 8192 - cur;
    cur = 0;
}
It should be okay if the tail is small and the buffer is big, because then memcpy most likely won't overlap the copied memory segment during a single copy iteration. However, it sounds slightly risky when the tail becomes big: bigger than 50% of the buffer.
If the buffer is really huge and the tail is also huge, then it should still be okay, since there's a physical limit to how much data can be copied in a single operation, which, if I remember correctly, is 512 bytes for modern x86_64 CPUs using vector units. I thought about adding a condition that checks the length of the tail and, if it's too big compared to the size of the buffer, performs a naive byte-by-byte copy, but the question is:
How big is too big for such an overlapping memcpy to be considered more or less safe? tail > buffer size - 2 kB?
Per the standard, memcpy() has undefined behavior if the source and destination regions overlap. It doesn't matter how big the regions are or how much overlap there is. Undefined behavior cannot ever be considered safe.
If you are writing to a particular implementation, and that implementation defines behavior for some such copying, and you don't care about portability, then you can rely on your implementation's specific behavior in this regard. But I recommend not. That would be a nasty bug waiting to bite people who decide to use the code with some other implementation after all. Maybe even future you.
And in this particular case, having the alternative of using memmove(), which is dedicated to this exact purpose, makes gambling with memcpy() utterly reckless.
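For illustration, a minimal sketch of the tail shift using memmove (the shift_tail helper is just a name for this example):
#include <string.h>

/* shift the unconsumed tail of the buffer to the front; memmove is specified
   to handle overlapping source and destination, no matter how large the tail is */
size_t shift_tail(char *buf, size_t buf_size, size_t consumed)
{
    size_t tail_len = buf_size - consumed;
    memmove(buf, buf + consumed, tail_len);
    return tail_len; /* the next read should start at buf + tail_len */
}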
I have a FASTA file that contains up to 2,000,000 sequence strings [lines]. I wrote code that works well with smaller files, but as the file size grows it gets slower (even slower than the per-iteration speed for the smaller files). I am confused about why it takes more time when the file size is, say, 100,000 lines, even for the first iteration, which runs very efficiently in the case of 10,000.
For example: I put a printf statement in each iteration. In the case of 10,000 lines the first iteration takes 2 ms, whereas in the case of 100,000 strings even the first iteration takes more than 2 ms to print, and so on. Why could it be slow like that?
Can you please help me make it efficient, or even make it work at the same speed as it does with a smaller file? I am reading it line by line.
My code is
#include <zlib.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int z = 0;
gzFile fp = gzopen(dbFile, "r"); //Read database Fasta file into host memory
kseq_t *seq_d = kseq_init(fp);
int d;
while ((d = kseq_read(seq_d)) >= 0) {
    unsigned char *b = (unsigned char *)malloc(sizeof(unsigned char) * 256);
    memcpy(b, seq_d->seq.s, 256);
    //... do work with b ...
    z++;
    free(b);
}
kseq_destroy(seq_d);
gzclose(fp);
I have found the issue. I didn't notice before, but in my code there were two loops that actually ran to the size of the file and weren't needed (that's why I got variable times for each iteration, too). I just eliminated them and now it works perfectly.
To improve the speed you can also move the malloc line before the while, and the free after the end of the while.
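In other words, a sketch of that suggestion, keeping the rest of the loop as in the question:
unsigned char *b = (unsigned char *)malloc(sizeof(unsigned char) * 256);//allocate once
while ((d = kseq_read(seq_d)) >= 0) {
    memcpy(b, seq_d->seq.s, 256);//reuse the same buffer for every record
    //... do work with b ...
    z++;
}
free(b);//release it once, after the loop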
I'm working on a homework assignment for my Intro to C course (don't worry, I don't need you guys to solve anything for me!) and I have a question about design. I'm trying to figure out how to safely set the size of an array by reading input from a file.
Initially I wrote it out like this:
fscanf(ifp, "%d", &number_of_pizzas);
float pizza_cost[number_of_pizzas];
I'm pretty sure this will build fine, but I know that it's unwise to declare an array with a variable size. My assignment specifies the array can be no bigger than 100, so I know I can just write "pizza_cost[100]", but I'd rather do it precisely instead of wasting the memory.
Java is the language I'm most familiar with, and I believe the solution to the problem would be written out like this:
Scanner s = new Scanner(System.in);
final int i = s.nextInt();
int[] array = new int[i];
I know C doesn't have a final keyword, so I'm assuming "const" would be the way to go... Is there any way to replicate that code in C?
I'm pretty sure this will build fine, but I know that it's unwise to declare an array with a variable size.
That is true only in situations when there is no upper limit on the size. If you know that number_of_pizzas is 100 or less, your declaration would be safe on all but the most memory-constrained systems.
If you change your code to validate number_of_pizzas before declaring a variable-size array, you would be safe. However, this array would be limited in scope to a function, so you wouldn't be able to return it to your function's caller.
An analogy to Java code would look as follows:
float *pizza_cost = malloc(sizeof(float)*number_of_pizzas);
Now your array can be returned from a function, but you would be responsible for freeing it at some point in your program by calling free(pizza_cost).
As far as making number_of_pizzas a const goes, it is not going to work with scanf: it would be illegal to modify a const through a pointer. It is of very little utility even in Java, because you can get the same value by accessing the array's length.
Any dynamic expression can have a limit placed upon its value easily enough:
fscanf(ifp, "%d", &number_of_pizzas);
float pizza_cost[number_of_pizzas > 100 ? 100 : number_of_pizzas];
This is not going to be any less safe than using a constant value as long as the bound(s) is/are constant, and has the potential to be smaller should the required number be less.
Making the variable const/final/anything gains you nothing in this scenario because whether it is modified after being used to create the buffer, doesn't affect the size of the buffer in any way.
Your code will build fine. Running well depends on input values.
I would add just one improvement to this: test the limit of your array size before using it:
fscanf(ifp, "%d", &number_of_pizzas);
if((number_of_pizzas > MIN_SIZE) &&(number_of_pizzas < MAX_SIZE))//add this test (or something similar)
{
float pizza_cost[number_of_pizzas];
//do stuff
}
Pick values for MIN_SIZE and MAX_SIZE that make sense for your application...
Doing the same this using dynamic allocation:
fscanf(ifp, "%d", &number_of_pizzas);
if((number_of_pizzas > MIN_SIZE) &&(number_of_pizzas < MAX_SIZE))//add this test (or something similar)
{
float *pizza_cost = malloc(sizeof(float)*number_of_pizzas);
//do stuff
}
Don't forget to use free(pizza_cost); when you are done.
According to the GNU Lib C documentation on getcwd()...
The GNU C Library version of this function also permits you to specify a null pointer for the buffer argument. Then getcwd allocates a buffer automatically, as with malloc (see Unconstrained Allocation). If the size is greater than zero, then the buffer is that large; otherwise, the buffer is as large as necessary to hold the result.
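In other words, with that extension the call can be as simple as this sketch (assuming <unistd.h>, <stdio.h>, and <stdlib.h> are included; freeing the buffer is still the caller's job):
char *cwd = getcwd(NULL, 0); /* GNU extension (also POSIX.1-2008): getcwd allocates the buffer */
if (cwd != NULL) {
    printf("%s\n", cwd);
    free(cwd);
}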
I now draw your attention to the implementation using the standard getcwd(), described in the GNU documentation:
char* gnu_getcwd ()
{
size_t size = 100;
while (1)
{
char *buffer = (char *) xmalloc (size);
if (getcwd (buffer, size) == buffer)
return buffer;
free (buffer);
if (errno != ERANGE)
return 0;
size *= 2;
}
}
This seems great for portability and stability but it also looks like a clunky compromise with all that allocating and freeing memory. Is this a possible performance concern given that there may be frequent calls to the function?
It's easy to say "profile it", but profiling can't account for every possible system, present or future.
The initial size is 100, holding a 99-char path, longer than most of the paths that exist on a typical system. This means in general that there is no “allocating and freeing memory”, and no more than 98 bytes are wasted.
The heuristic of doubling at each try means that at a maximum, a logarithmic number of spurious allocations take place. On many systems, the maximum length of a path is otherwise limited, meaning that there is a finite limit on the number of re-allocations caused.
This is about the best one can do as long as getcwd is used as a black box.
This is not a performance concern because it's the getcwd function. If that function is in your critical path then you're doing it wrong.
Joking aside, there's none of this code that could be removed. The only way you could improve this with profiling is to adjust the magic number "100" (it's a speed/space trade-off). Even then, you'd only have optimized it for your file system.
You might also think of replacing free/malloc with realloc, but that would result in an unnecessary memory copy, and with the error checking wouldn't even be less code.
Thanks for the input, everyone. I have recently concluded what should have been obvious from the start: define the value ("100" in this case) and the increment formula to use (x2 in this case) to be based on the target platform. This could account for all systems, especially with the use of additional flags.
I was wondering if there's a really good (performant) solution for converting a whole file to lower case in C.
I use fgetc, convert the char to lower case, and write it to another temp file with fputc. At the end I remove the original and rename the temp file to the original's name. But I think there must be a better solution for it.
This doesn't really answer the question (community wiki), but here's an (over?)-optimized function to convert text to lowercase:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
int fast_lowercase(FILE *in, FILE *out)
{
char buffer[65536];
size_t readlen, wrotelen;
char *p, *e;
char conversion_table[256];
int i;
for (i = 0; i < 256; i++)
conversion_table[i] = tolower(i);
for (;;) {
readlen = fread(buffer, 1, sizeof(buffer), in);
if (readlen == 0) {
if (ferror(in))
return 1;
assert(feof(in));
return 0;
}
for (p = buffer, e = buffer + readlen; p < e; p++)
*p = conversion_table[(unsigned char) *p];
wrotelen = fwrite(buffer, 1, readlen, out);
if (wrotelen != readlen)
return 1;
}
}
This isn't Unicode-aware, of course.
I benchmarked this on an Intel Core 2 T5500 (1.66GHz), using GCC 4.6.0 and i686 (32-bit) Linux. Some interesting observations:
It's about 75% as fast when buffer is allocated with malloc rather than on the stack.
It's about 65% as fast using a conditional rather than a conversion table.
I'd say you've hit the nail on the head. Using a temp file means that you don't delete the original until you're sure that you're done processing it, which means that upon error the original remains. I'd say that's the correct way of doing it.
As suggested by another answer (if the file size permits), you can memory-map the file via the mmap function and have it readily available in memory (there's no real performance difference if the file is less than the size of a page, as it's probably going to get read into memory once you do the first read anyway).
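A rough sketch of that idea (POSIX only, error handling trimmed; lowercase_inplace is just a name for this example), converting the mapped file to lower case in place:
#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int lowercase_inplace(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }
    /* MAP_SHARED writes the changes back to the file */
    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }
    for (off_t i = 0; i < st.st_size; i++)
        p[i] = (char)tolower((unsigned char)p[i]);
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}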
You can usually get a little bit faster on big inputs by using fread and fwrite to read and write big chunks of the input/output. Also, you should probably read a bigger chunk (the whole file, if possible) into memory, convert it, and then write it all at once.
Edit: I just remembered one more thing. Sometimes programs can be faster if you select a prime number (or at the very least not a power of 2) as the buffer size. I seem to recall this has to do with specifics of the caching mechanism.
If you're processing big files (big as in, say, multi-megabytes) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than using SIMD instructions.
I.e. if you used SSE2, you could code the tolower_parallel loop like this (pseudocode):
for (cur_parallel_word = begin_of_block;
cur_parallel_word < end_of_block;
cur_parallel_word += parallel_word_width) {
/*
* in SSE2, parallel compares are either about 'greater' or 'equal'
* so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
* The 'ALL' macro is supposed to replicate into all parallel bytes.
*/
mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
mask2 = parallel_compare_greater_than(ALL('Z' + 1), *cur_parallel_word);
/*
* vector op - and all bytes in two vectors, 'PAND'
*/
mask = mask1 & mask2;
/*
* vector op - add a vector of bytes. Would use 'PADDB'.
*/
new = parallel_add(*cur_parallel_word, ALL('a' - 'A'));
/*
* vector op - zero bytes in the original vector that will be replaced
*/
*cur_parallel_word &= ~mask; // that'd become 'PANDN'
/*
* vector op - extract characters from new that replace old, then or in.
*/
*cur_parallel_word |= (new & mask); // PAND / POR
}
I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both the original value and the 'lowercased' version (one with the mask, the other with its inverse) before you OR them together to form the result.
If you use mmap'ed file access, this could even be performed in-place, saving on the bounce buffer, and saving on many function and/or system calls.
There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.
But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to lowercase) then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, or ARM NEON, or PPC Altivec), is likely to make a significant speedup possible over existing general-purpose utilities.
Well, you can definitely speed this up a lot, if you know what the character encoding is. Since you're using Linux and C, I'm going to go out on a limb here and assume that you're using ASCII.
In ASCII, we know A-Z and a-z are contiguous and always 32 apart. So, what we can do is ignore the safety checks and locale checks of the tolower() function and do something like this:
(pseudo code)
foreach (int) char c in the file:
c += 32.
Or, since there may be both upper and lowercase letters, do a check like
if (c > 64 && c < 91) // the upper case ASCII range
and only then do the addition and write the result out to the file.
Also, batch writes are faster, so I would suggest first writing to an array, then all at once writing the contents of the array to the file.
This should be considerably faster.
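A minimal sketch of that combination (assumes plain ASCII input and already-opened streams; ascii_lowercase_file is just a name for this example):
#include <stdio.h>

int ascii_lowercase_file(FILE *in, FILE *out)
{
    unsigned char buf[8192];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; i++)
            if (buf[i] > 64 && buf[i] < 91) /* the uppercase ASCII range 'A'..'Z' */
                buf[i] += 32;               /* shift into 'a'..'z' */
        if (fwrite(buf, 1, n, out) != n)    /* write the whole batch at once */
            return 1;
    }
    return ferror(in) ? 1 : 0;
}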