Best way to convert a whole file to lowercase in C

I was wondering if there's a really good (performant) solution for converting a whole file to lowercase in C.
Currently I use fgetc, convert each char to lowercase, and write it to a temp file with fputc. At the end I remove the original and rename the temp file to the original's name. But I think there must be a better solution for it.

This doesn't really answer the question (community wiki), but here's an (over?)-optimized function to convert text to lowercase:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
int fast_lowercase(FILE *in, FILE *out)
{
    char buffer[65536];
    size_t readlen, wrotelen;
    char *p, *e;
    char conversion_table[256];
    int i;

    for (i = 0; i < 256; i++)
        conversion_table[i] = tolower(i);
    for (;;) {
        readlen = fread(buffer, 1, sizeof(buffer), in);
        if (readlen == 0) {
            if (ferror(in))
                return 1;
            assert(feof(in));
            return 0;
        }
        for (p = buffer, e = buffer + readlen; p < e; p++)
            *p = conversion_table[(unsigned char) *p];
        wrotelen = fwrite(buffer, 1, readlen, out);
        if (wrotelen != readlen)
            return 1;
    }
}
This isn't Unicode-aware, of course.
I benchmarked this on an Intel Core 2 T5500 (1.66GHz), using GCC 4.6.0 and i686 (32-bit) Linux. Some interesting observations:
It's about 75% as fast when buffer is allocated with malloc rather than on the stack.
It's about 65% as fast using a conditional rather than a conversion table.
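For reference, the table-driven conversion above can be exercised on an in-memory buffer with a minimal sketch like this (the buffer contents here are made up):

```c
#include <ctype.h>
#include <stddef.h>

/* Build the same 256-entry table fast_lowercase uses and apply it
 * to a buffer in place. */
static void table_tolower(char *buf, size_t len)
{
    unsigned char table[256];
    size_t i;

    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)tolower((int)i);
    for (i = 0; i < len; i++)
        buf[i] = (char)table[(unsigned char)buf[i]];
}
```

Building the table once per call is cheap relative to a large buffer; for repeated small calls you'd hoist it out.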

I'd say you've hit the nail on the head. Using a temp file means that you don't delete the original until you're sure you're done processing it, which means that upon error the original remains intact. I'd say that's the correct way of doing it.
As suggested by another answer (if file size permits), you can do a memory mapping of the file via the mmap function and have it readily available in memory. (There's no real performance difference if the file is smaller than a page, as it's probably going to be read into memory once you do the first read anyway.)
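A sketch of what the mmap route might look like (POSIX only; error handling trimmed, and the in-place conversion assumes you're allowed to modify the original file rather than a temp copy):

```c
#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Lowercase a file in place by mapping it into memory (POSIX). */
static int lowercase_mmap(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (off_t i = 0; i < st.st_size; i++)
        p[i] = (char)tolower((unsigned char)p[i]);
    munmap(p, st.st_size);          /* MAP_SHARED writes back to the file */
    close(fd);
    return 0;
}
```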

You can usually get a little more speed on big inputs by using fread and fwrite to read and write big chunks of the input/output. Ideally, read a bigger chunk (the whole file, if possible) into memory, convert it, and then write it all at once.
edit: I just remembered one more thing. Sometimes programs can be faster if you select a prime number (or at the very least not a power of 2) as the buffer size. I seem to recall this has to do with specifics of the caching mechanism.
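A sketch of the whole-file variant, sizing the buffer via fseek/ftell (assumes a seekable stream and that the file fits in memory; the caller frees the result):

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Read an entire stream into memory, lowercase it, and return the
 * buffer (caller frees). Returns NULL on failure. */
static char *slurp_tolower(FILE *in, long *out_len)
{
    long len;
    char *buf;

    if (fseek(in, 0, SEEK_END) != 0 || (len = ftell(in)) < 0)
        return NULL;
    rewind(in);
    buf = malloc(len);
    if (buf == NULL || fread(buf, 1, len, in) != (size_t)len) {
        free(buf);
        return NULL;
    }
    for (long i = 0; i < len; i++)
        buf[i] = (char)tolower((unsigned char)buf[i]);
    *out_len = len;
    return buf;                       /* write with one fwrite() call */
}
```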

If you're processing big files (big as in, say, multi-megabytes) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than using SIMD instructions.
I.e. if you used SSE2, you could code a tolower_parallel like this (pseudocode):
for (cur_parallel_word = begin_of_block;
     cur_parallel_word < end_of_block;
     cur_parallel_word += parallel_word_width) {
    /*
     * in SSE2, parallel compares are either about 'greater' or 'equal',
     * so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
     * The 'ALL' macro is supposed to replicate a byte into all parallel bytes.
     */
    mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
    mask2 = parallel_compare_greater_than(ALL('Z' + 1), *cur_parallel_word);
    /*
     * vector op - and all bytes in two vectors, 'PAND'
     */
    mask = mask1 & mask2;
    /*
     * vector op - add a vector of bytes. Would use 'PADDB'.
     */
    new = parallel_add(cur_parallel_word, ALL('a' - 'A'));
    /*
     * vector op - zero the bytes in the original vector that will be replaced
     */
    *cur_parallel_word &= ~mask; // that'd become 'PANDN'
    /*
     * vector op - extract characters from new that replace old, then or in.
     */
    *cur_parallel_word |= (new & mask); // PAND / POR
}
I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both the original value and the lowercased version (one with the mask, the other with its inverse) before you or them together to form the result.
If you use mmap'ed file access, this could even be performed in-place, saving on the bounce buffer, and saving on many function and/or system calls.
There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.
But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to uppercase) then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, or ARM NEON, or PPC Altivec), is likely to make a significant speedup possible over existing general-purpose utilities.
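For the curious, the pseudocode above translates to SSE2 intrinsics roughly as follows (x86 only; this sketch uses an add-under-mask instead of the andn/or pair, which is equivalent, and converts to lowercase as the question asks):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Lowercase ASCII bytes, 16 at a time, with a scalar tail. */
static void sse2_tolower(unsigned char *buf, size_t len)
{
    size_t i = 0;
    const __m128i lo  = _mm_set1_epi8('A' - 1);   /* replicate into all bytes */
    const __m128i hi  = _mm_set1_epi8('Z' + 1);
    const __m128i add = _mm_set1_epi8('a' - 'A');

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        /* mask = (v > 'A'-1) & ('Z'+1 > v), i.e. bytes in 'A'..'Z' */
        __m128i m = _mm_and_si128(_mm_cmpgt_epi8(v, lo),
                                  _mm_cmpgt_epi8(hi, v));
        /* add 32 only to the masked (uppercase) bytes */
        v = _mm_add_epi8(v, _mm_and_si128(m, add));
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }
    for (; i < len; i++)                          /* scalar tail */
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] += 'a' - 'A';
}
```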

Well, you can definitely speed this up a lot, if you know what the character encoding is. Since you're using Linux and C, I'm going to go out on a limb here and assume that you're using ASCII.
In ASCII, we know A-Z and a-z are contiguous and always 32 apart. So we can skip the safety and locale checks of the tolower() function and do something like this:
(pseudo code)
foreach (int) char c in the file:
    c += 32
Or, if the input may contain both upper- and lowercase letters, do a check like
    if (c > 64 && c < 91) // the uppercase ASCII range
and only then do the addition and write it out to the file.
Also, batched writes are faster, so I would suggest first writing to an array and then writing the contents of the array to the file all at once.
This should be considerably faster.

Related

Hashing a timestamp into a sha256 checksum in c

Quick question for those more experienced in C...
I want to compute a SHA256 checksum using the functions from openssl for the current time an operation takes place. My code consists of the following:
time_t cur_time = 0;
char t_ID[40];
char obuf[40];
char * timeBuf = malloc(sizeof(char) * 40 + 1);
sprintf(timeBuf, "%s", asctime(gmtime(&cur_time)));
SHA256(timeBuf, strlen(timeBuf), obuf);
sprintf(t_ID, "%02x", obuf);
And yet, when I print out the value of t_ID in a debug statement, it looks like 'de54b910'. What am I missing here?
Edited to fix my typo around malloc and also to say I expected to see the digest form of a sha256 checksum, in hex.
Since obuf is an array, printing its value causes it to decay to a pointer and prints the value of the memory address that the array is stored at. Write sensible code to print a 256-bit value.
Maybe something like:
for (int i = 0; i < 32; ++i)
    printf("%02X", obuf[i]);
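If you want the digest in a string such as t_ID rather than on stdout, the same loop can write into a buffer instead (note the destination needs room for 64 hex digits plus a terminator, i.e. at least 65 bytes, not 40):

```c
#include <stdio.h>

/* Render a 32-byte digest as a 64-character hex string (plus NUL).
 * 'out' must point to at least 65 bytes. */
static void digest_to_hex(const unsigned char *digest, char *out)
{
    for (int i = 0; i < 32; i++)
        sprintf(out + 2 * i, "%02x", digest[i]);
}
```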
This is not really intended as an answer, I'm just sharing a code fragment with the OP.
To hash the binary time_t directly without converting the time to a string, you could use something like (untested):
time_t cur_time;
char t_ID[40];
unsigned char obuf[40];
time(&cur_time);  /* note: time(), not gmtime(), is what sets cur_time */
SHA256((unsigned char *)&cur_time, sizeof(cur_time), obuf);
// You know this doesn't work:
// sprintf(t_ID, "%02x", obuf);
// Instead see https://stackoverflow.com/questions/6357031/how-do-you-convert-buffer-byte-array-to-hex-string-in-c
This doesn't address byte order. You could use network byte order functions, see:
htons() function in socket programing
http://beej.us/guide/bgnet/output/html/multipage/htonsman.html
One complication: the size of time_t is not specified, it can vary by platform. It's traditionally 32 bits, but on 64-bit machines it can be 64 bits. It's also usually the number of seconds since the Unix epoch, midnight, January 1, 1970.
If you're willing to live with assumption that the resolution is seconds and don't have to worry about the code working in 20 years (see: https://en.wikipedia.org/wiki/Year_2038_problem) then you might use (untested):
#include <netinet/in.h>
time_t cur_time;
uint32_t net_cur_time; // cur_time converted to network byte order
unsigned char obuf[40];
time(&cur_time);  /* sets cur_time to the current time */
net_cur_time = htonl((uint32_t)cur_time);
SHA256((unsigned char *)&net_cur_time, sizeof(net_cur_time), obuf);
I'll repeat what I mentioned in a comment: it's hard to understand what you possibly hope to gain from this hash, or why you can't use the timestamp directly. Cryptographically secure hashes such as SHA256 go through a lot of work to ensure the hash is not reversible. You can't benefit from that because the input data is from a limited known set. At the very least, why not use CRC32 instead because it's much faster.
Good luck.

Automatically appending numbers to variables in C

Our programming assignment asked us to break a text file into a set of smaller files having names (filename)partx.txt. For example if the argument passed to the program is a text file named stack.txt then the output should be stackpart1.txt, stackpart2.txt etc, where each file is of size 250 bytes max.
What is the best way to attain the part_x thing ?
I learned about using a macro with ## to achieve that. What are the drawbacks of this method, and is there a better way?
Is it good practice to generate variable names this way?
Don't confuse variable names with their content; macros and variable names have nothing to do with your assignment. ## is used to join strings to be used in your code at compile-time (a typical usage is to build identifiers or in general code parametrically in macros), which is a relatively rare and very specialized task.
What you want to do, instead, is to generate strings at runtime based on a pattern (=> you'll have the same string variable that you'll fill with different stuff at each iteration); the right function for this is snprintf.
It's perfectly simple, I'd say: You open a file (fopen returns a FILE *) which you can then read in a loop, using fread to specify the max amount of bytes to read on each iteration. Given the fact you're using a loop anyway, you can increment a simple int to keep track of the chunk-file names, using snprintf to create the name, write the characters read by fread to each file, and continue until you're done.
Some details on fread that might be useful to you
A basic example (needs some work, still):
int main( void )
{
    int chunk_count = 0;
    size_t chunk_size = 256, bytes_read;
    char buffer[256];
    FILE *src_fp,
         *target_fp;
    char chunk_name[50];

    src_fp = fopen("stack.txt", "rb");//or take the name from argv
    if (src_fp == NULL)
        return 1;
    while ((bytes_read = fread(buffer, 1, chunk_size, src_fp)) > 0)
    {//read chunk (a short read means we've reached the last chunk)
        ++chunk_count;//increase chunk count
        snprintf(chunk_name, 50, "chunk_part%d.txt", chunk_count);
        target_fp = fopen(chunk_name, "wb");
        //write to chunk file
        fwrite(buffer, 1, bytes_read, target_fp);
        fclose(target_fp);//close chunk file
    }
    fclose(src_fp);
    printf("Written %d files, each of max 256 bytes\n", chunk_count);
    return 0;
}
Note that this code is not exactly safe to use as it stands. You'll need to check the return values of fopen (it can, and at some point will, return NULL). The fread-based loop simply assumes that, if its return value is less than the chunk size, we've reached the end of the source-file, which isn't always the case. you'll have to handle NULL pointers and ferror stuff yourself, still. Either way, the functions to look into are:
fread
fopen
fwrite
fclose
ferror
snprintf
That should do it.
Update, just for the fun of it.
You might want to pad the numbers of your chunk file names (chunk_part0001.txt). To do this, you can predict how big the source file is, divide that by 256 to work out how many chunks you're actually going to end up with, and use that many padding zeroes. How to get the file size is explained here, but here's some code I wrote some time ago:
long file_size = 0,
     chunks = 0;
int padding_cnt = 1;//at least 1, ensures correct padding

fseek(src_fp, 0, SEEK_END);//go to end of file
file_size = ftell(src_fp);
rewind(src_fp);//return to beginning of file
chunks = file_size / 256 + 1;//divided by chunk size, +1 for a partial last chunk
while (chunks >= 10)
{
    chunks /= 10;
    ++padding_cnt;
}
//padded chunk file names:
snprintf(chunk_name, sizeof chunk_name, "chunk_part%0*d.txt", padding_cnt, chunk_count);
If you want, I could explain every single statement, but the gist of it is this:
fseek + ftell gets the size of the file (in bytes); divided by the chunk size (256), that gives you the total number of chunks you'll create (+1 to account for a possible partial last chunk)
The while loop repeatedly divides the chunk count by 10; each division bumps the padding count by one, so you end up with the number of digits needed
the format passed to snprintf changed to %0*d, which means: "print an int, padded with zeroes to a width of padding_cnt" (i.e. to a fixed width). If you end up with 123 chunks, the first chunk file will be called chunk_part001.txt, the tenth file will be chunk_part010.txt, all the way up to chunk_part123.txt.
Refer to the linked question: the accepted answer uses sys/stat.h to get the file size, which is more reliable (though it can pose some minor portability issues). Check the stat wiki for alternatives.
Why? Because it's fun, and it makes the output files easier to sort by name. It also enables you to predict how big the char array that holds the target file name should be, so if you have to allocate that memory using malloc, you know exactly how much memory you'll need, and don't have to allocate 100 chars (which should be enough either way), and hope that you don't run out of space.
Lastly: the more you know, the better IMO, so I thought I'd give you some links and refs you might want to check.
You can either:
Use a macro as suggested (compile-time). This requires knowing the file size (and the numbers for the sub-files) when you write the code.
Use snprintf() in a loop to generate the filenames (runtime). This works dynamically, based on the measured file size.
That said, best way: use snprintf().
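For completeness, the snprintf pattern is a one-liner wrapped in a helper (the base name here is made up):

```c
#include <stdio.h>
#include <stddef.h>

/* Build "(base)part(n).txt" into 'out' at runtime. */
static void make_chunk_name(char *out, size_t cap, const char *base, int part)
{
    snprintf(out, cap, "%spart%d.txt", base, part);
}
```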

Self contained C routine to print string

I would like to make a self contained C function that prints a string. This would be part of an operating system, so I can't use stdio.h. How would I make a function that prints the string I pass to it without using stdio.h? Would I have to write it in assembly?
Assuming you're doing this on an X86 PC, you'll need to read/write directly to video memory located at address 0xB8000. For color monitors you need to specify an ASCII character byte and an attribute byte, which can indicate color. It is common to use macros when accessing this memory:
#define VIDEO_BASE_ADDR 0xB8000
#define VIDEO_ADDR(x,y) (unsigned short *)(VIDEO_BASE_ADDR + 2 * ((y) * SCREEN_X_SIZE + (x)))
Then, you write your own IO routines around it. Below is a simple function I used to write from a screen buffer. I used this to help implement a crude scrolling ability.
void c_write_window(unsigned int x, unsigned int y, unsigned short c)
{
    if ((win_offset + y) >= BUFFER_ROWS) {
        int overlap = ((win_offset + y) - BUFFER_ROWS);
        *VIDEO_ADDR(x,y) = screen_buffer[overlap][x] = c;
    } else {
        *VIDEO_ADDR(x,y) = screen_buffer[win_offset + y][x] = c;
    }
}
To learn more about this, and other osdev topics, see http://wiki.osdev.org/Printing_To_Screen
You will probably want to look at, or possibly just use, the source to the stdio functions in the FreeBSD C library, which is BSD-licensed.
To actually produce output, you'll need at least some function that can write characters to your output device. To do this, the stdio routines end up calling write, which performs a syscall into the kernel.
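In a hosted environment, that bottom layer might look like the sketch below; in your own OS you'd replace write with whatever output primitive your kernel provides (for instance, the video-memory routine from the other answer):

```c
#include <string.h>
#include <unistd.h>

/* A puts-like helper built directly on the write() syscall wrapper,
 * with no stdio involvement. Returns the number of bytes written. */
static long print_str(const char *s)
{
    return (long)write(STDOUT_FILENO, s, strlen(s));
}
```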

Faster than scanf?

I was doing massive parsing of positive integers using scanf("%d", &someint). As I wanted to see if scanf was a bottleneck, I implemented a naive integer parsing function using fread, just like:
int result;
char c;

while (fread(&c, sizeof c, 1, stdin), c == ' ' || c == '\n')
    ;
result = c - '0';
while (fread(&c, sizeof c, 1, stdin), c >= '0' && c <= '9') {
    result *= 10;
    result += c - '0';
}
return result;
But to my astonishment, the performance of this function is (even with inlining) about 50% worse. Shouldn't it be possible to improve on scanf for specialized cases? Isn't fread supposed to be fast (additional hint: the integers are (edit: mostly) 1 or 2 digits)?
The overhead I was encountering was not the parsing itself but the many calls to fread (same with fgetc, and friends). For each call, the libc has to lock the input stream to make sure two threads aren't stepping on each other's feet. Locking is a very costly operation.
What we're looking for is a function that gives us buffered input (reimplementing buffering is just too much effort) but avoids the huge locking overhead of fgetc.
If we can guarantee that there is only a single thread using the input stream, we can use the functions from unlocked_stdio(3), such as getchar_unlocked(3). Here is an example:
static int parseint(void)
{
    int c, n;

    n = getchar_unlocked() - '0';
    while (isdigit((c = getchar_unlocked())))
        n = 10*n + c-'0';
    return n;
}
The above version doesn't check for errors. But it's guaranteed to terminate. If error handling is needed it might be enough to check feof(stdin) and ferror(stdin) at the end, or let the caller do it.
My original purpose was submitting solutions to programming problems at SPOJ, where the input is only whitespace and digits. So there is still room for improvement, namely the isdigit check.
static int parseint(void)
{
    int c, n;

    n = getchar_unlocked() - '0';
    while ((c = getchar_unlocked()) >= '0')
        n = 10*n + c-'0';
    return n;
}
It's very, very hard to beat this parsing routine, both performance-wise and in terms of convenience and maintainability.
You'll be able to improve significantly on your example by buffering - read a large number of characters into memory, and then parse them from the in-memory version.
If you're reading from disk you might get a performance increase by your buffer being a multiple of the block size.
Edit: You can let the kernel handle this for you by using mmap to map the file into memory.
Here's something I use.
char _;
#define scan(x) do { while ((x = getchar()) < '0'); for (x -= '0'; '0' <= (_ = getchar()); x = (x << 3) + (x << 1) + _ - '0'); } while (0)
However, this only works with non-negative integers.
From what you say, I derive the following facts:
numbers are in the range of 0-99, which accounts for 10+100 different strings (including leading zeros)
you trust that your input stream adheres to some sort of specification and won't contain any unexpected character sequences
In that case, I'd use a lookup table to convert strings to numbers. Given a string s[2], the index to your lookup table can be computed by s[1]*10 + s[0], swapping the digits and making use of the fact that '\0' equals 0 in ASCII.
Then, you can read your input in the following way:
// given our lookup method, this table may need padding entries
int lookup_table[] = { /*...*/ };
// no need to call superfluous functions
#define str2int(x) (lookup_table[(x)[1]*10 + (x)[0]])
while (read_token_from_stream(stdin, buf))
    next_int = str2int(buf);
On today's machines, it will be hard to come up with a faster technique. My guess is that this method will likely run 2 to 10 times faster than any scanf()-based approach.
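One way to populate such a table, following the s[1]*10 + s[0] indexing described above (a sketch; read_token_from_stream is still assumed to exist elsewhere, and the table is sized for the largest reachable index, '9'*10 + '9'):

```c
/* Table large enough for the maximum index '9'*10 + '9' = 627. */
static int lookup_table[('9' * 10) + '9' + 1];

static void init_lookup_table(void)
{
    /* one-digit strings "d\0": index is '\0'*10 + 'd', i.e. just 'd' */
    for (int d = 0; d < 10; d++)
        lookup_table['0' + d] = d;
    /* two-digit strings "tu" (leading zeros included): index 'u'*10 + 't' */
    for (int n = 0; n < 100; n++)
        lookup_table[('0' + n % 10) * 10 + ('0' + n / 10)] = n;
}

#define str2int(x) (lookup_table[(x)[1] * 10 + (x)[0]])
```

The one- and two-digit index ranges don't collide, since a single digit indexes at most '9' = 57 while any two-digit string indexes at least '0'*10 + '0' = 528.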

C file checksum

How can I make a checksum of a file using C? I don't want to use any third-party code, just plain C, and speed is very important (the files are less than 50 MB, but anyway).
thanks
I would suggest starting with the simple one and then only worrying about introducing the fast requirement if it turns out to be an issue.
Far too much time is wasted on solving problems that do not exist (see YAGNI).
By simple, I mean: start a checksum character (all characters here are unsigned) at zero, read in every character, and subtract each one from the checksum character until the end of the file is reached, relying on unsigned arithmetic wrapping around.
Something like in the following program:
#include <stdio.h>
unsigned char checksum (unsigned char *ptr, size_t sz) {
    unsigned char chk = 0;
    while (sz-- != 0)
        chk -= *ptr++;
    return chk;
}

int main(int argc, char* argv[])
{
    unsigned char x[] = "Hello_";
    unsigned char y = checksum (x, 5);
    printf ("Checksum is 0x%02x\n", y);
    x[5] = y;
    y = checksum (x, 6);
    printf ("Checksum test is 0x%02x\n", y);
    return 0;
}
which outputs:
Checksum is 0x0c
Checksum test is 0x00
That checksum function actually does both jobs. If you pass it a block of data without a checksum on the end, it will give you the checksum. If you pass it a block with the checksum on the end, it will give you zero for a good checksum, or non-zero if the checksum is bad.
This is the simplest approach and will detect most random errors. It won't detect edge cases like two swapped characters so, if you need even more veracity, use something like Fletcher or Adler.
Both of those Wikipedia pages have sample C code you can either use as-is, or analyse and re-code to avoid IP issues if you're concerned.
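For a sense of scale, Fletcher-16 is only a few lines more than the subtracting checksum above (a sketch of the textbook definition, using the modulo-255 form):

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16: two running sums mod 255, packed into one 16-bit value.
 * Catches reorderings that a plain sum misses. */
uint16_t fletcher16(const uint8_t *data, size_t len)
{
    uint16_t sum1 = 0, sum2 = 0;

    while (len--) {
        sum1 = (sum1 + *data++) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t)((sum2 << 8) | sum1);
}
```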
Determine which algorithm you want to use (CRC32 is one example)
Look up the algorithm on Wikipedia or other source
Write code to implement that algorithm
Post questions here if/when the code doesn't correctly implement the algorithm
Profit?
Simple and fast
FILE *fp = fopen("yourfile", "rb");
unsigned char checksum = 0;
int c;

while ((c = fgetc(fp)) != EOF) {
    checksum ^= (unsigned char)c;
}
fclose(fp);
Generally, CRC32 with a good polynomial is probably your best choice for a non-cryptographic-hash checksum. See here for some reasons: http://guru.multimedia.cx/crc32-vs-adler32/ Click on the error correcting category on the right-hand side to get a lot more crc-related posts.
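If you go the CRC32 route without third-party code, a compact (if slow) bit-at-a-time version looks like this; 0xEDB88320 is the reflected polynomial used by zlib and PNG:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, init and final XOR of 0xFFFFFFFF).
 * A table-driven version is ~8x faster but the result is identical. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *data++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int)(crc & 1));
    }
    return ~crc;
}
```

The standard check value for the input "123456789" is 0xCBF43926, which is a handy self-test for any CRC-32 implementation.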
I would recommend using a BSD implementation. For example, http://www.freebsd.org/cgi/cvsweb.cgi/src/usr.bin/cksum/