C file checksum - c

how can i make a checksum of a file using C? i dont want to use any third party, just default c language and also speed is very important (its less the 50mb files but anyway)
thanks

I would suggest starting with the simple one and then only worrying about introducing the fast requirement if it turns out to be an issue.
Far too much time is wasted on solving problems that do not exist (see YAGNI).
By simple, I mean simply starting a checksum character (all characters here are unsigned) at zero, reading in every character and subtracting it from the checksum character until the end of the file is reached, assuming your implementation wraps intelligently.
Something like in the following program:
#include <stdio.h>
unsigned char checksum (unsigned char *ptr, size_t sz) {
unsigned char chk = 0;
while (sz-- != 0)
chk -= *ptr++;
return chk;
}
int main(int argc, char* argv[])
{
unsigned char x[] = "Hello_";
unsigned char y = checksum (x, 5);
printf ("Checksum is 0x%02x\n", y);
x[5] = y;
y = checksum (x, 6);
printf ("Checksum test is 0x%02x\n", y);
return 0;
}
which outputs:
Checksum is 0x0c
Checksum test is 0x00
That checksum function actually does both jobs. If you pass it a block of data without a checksum on the end, it will give you the checksum. If you pass it a block with the checksum on the end, it will give you zero for a good checksum, or non-zero if the checksum is bad.
This is the simplest approach and will detect most random errors. It won't detect edge cases like two swapped characters so, if you need even more veracity, use something like Fletcher or Adler.
Both of those Wikipedia pages have sample C code you can either use as-is, or analyse and re-code to avoid IP issues if you're concerned.

Determine which algorithm you want to use (CRC32 is one example)
Look up the algorithm on Wikipedia or other source
Write code to implement that algorithm
Post questions here if/when the code doesn't correctly implement the algorithm
Profit?

Simple and fast
FILE *fp = fopen("yourfile","rb");
unsigned char checksum = 0;
while (!feof(fp) && !ferror(fp)) {
checksum ^= fgetc(fp);
}
fclose(fp)

Generally, CRC32 with a good polynomial is probably your best choice for a non-cryptographic-hash checksum. See here for some reasons: http://guru.multimedia.cx/crc32-vs-adler32/ Click on the error correcting category on the right-hand side to get a lot more crc-related posts.

I would recommend using a BSD implementation. For example, http://www.freebsd.org/cgi/cvsweb.cgi/src/usr.bin/cksum/

Related

Hashing a timestamp into a sha256 checksum in c

Quick question for those more experienced in c...
I want to compute a SHA256 checksum using the functions from openssl for the current time an operation takes place. My code consists of the following:
time_t cur_time = 0;
char t_ID[40];
char obuf[40];
char * timeBuf = malloc(sizeof(char) * 40 + 1);
sprintf(timeBuf, "%s", asctime(gmtime(&cur_time)));
SHA256(timeBuf, strlen(timeBuf), obuf);
sprintf(t_ID, "%02x", obuf);
And yet, when I print out the value of t_ID in a debug statement, it looks like 'de54b910'. What am I missing here?
Edited to fix my typo around malloc and also to say I expected to see the digest form of a sha256 checksum, in hex.
Since obuf is an array, printing its value causes it to decay to a pointer and prints the value of the memory address that the array is stored at. Write sensible code to print a 256-bit value.
Maybe something like:
for (int i = 0; i < 32; ++i)
printf("%02X", obuf[i]);
This is not really intended as an answer, I'm just sharing a code fragment with the OP.
To hash the binary time_t directly without converting the time to a string, you could use something like (untested):
time_t cur_time;
char t_ID[40];
char obuf[40];
gmtime(&cur_time);
SHA256(&cur_time, sizeof(cur_time), obuf);
// You know this doesn't work:
// sprintf(t_ID, "%02x", obuf);
// Instead see https://stackoverflow.com/questions/6357031/how-do-you-convert-buffer-byte-array-to-hex-string-in-c
How do you convert buffer (byte array) to hex string in C?
This doesn't address byte order. You could use network byte order functions, see:
htons() function in socket programing
http://beej.us/guide/bgnet/output/html/multipage/htonsman.html
One complication: the size of time_t is not specified, it can vary by platform. It's traditionally 32 bits, but on 64 bit machines it can be 64 bits. It's also usually the number of seconds since Unix epoc, midnight, January 1, 1970.
If you're willing to live with assumption that the resolution is seconds and don't have to worry about the code working in 20 years (see: https://en.wikipedia.org/wiki/Year_2038_problem) then you might use (untested):
#include <netinet/in.h>
time_t cur_time;
uint32_t net_cur_time; // cur_time converted to network byte order
char obuf[40];
gmtime(&cur_time);
net_cur_time = htonl((uint32_t)cur_time);
SHA256(&net_cur_time, sizeof(net_cur_time), obuf);
I'll repeat what I mentioned in a comment: it's hard to understand what you possibly hope to gain from this hash, or why you can't use the timestamp directly. Cryptographically secure hashes such as SHA256 go through a lot of work to ensure the hash is not reversible. You can't benefit from that because the input data is from a limited known set. At the very least, why not use CRC32 instead because it's much faster.
Good luck.

Most Efficient Way To Count How Many Times A Character Occurs within a String

I am writing a very simple function that counts how many times a certain character occurs within a given string. I have a working function but was wondering if there was a more efficient or preferred method of doing this.
Here is the function:
size_t strchroc(const char *str, const char ch)
{
int c = 0, i = 0;
while(str[i]) if(str[i++] == ch) c++;
return c;
}
I personally cannot think of any way to get this code more efficient. And was wondering (just for the sake of learning) if anybody knew of a way to make this function more efficient.
(efficient in the sense of speed and using minimal resources).
First of all, unless your function is really time-sensitive, do not try to over-optimize. Just use the one you provided, as it is easy to verify for correctness and it doesn't try to be smart for just the heck of it.
If the function needs really to be fast then there are many ways to optimize it more. Many, really many ways. Some of them either expect or assume specific memory layout of the strings you have (for example, that they are allocated on word boundaries and the allocation is also always padded up to word boundary). So you'd need to be careful, as the algorithm might work on some combination of processor, compiler and memory allocator and fail miserably on others.
Just for the heck of it, I'll list some possible ways to speed up the character counter:
Read the string a word (32 or 64 bit integer) at a time. Not necessarily much of a help thanks to L1 caching and speculative/out-of-order execution. This needs end-of-loop adjustment for the last word (miscounting bytes after NUL terminator). Use only with word-aligned and padded memory allocators.
Remove the conditional, and instead calculate counts for all characters (to an array) and return the count for the wanted character. (This will remove the conditional and if you know string length in advance it makes for excellent loop unrolling plus removes one point of conditional branching.)
If you know the length of the string beforehand (calculated somewhere else) you can use that to unroll the loop. Or better, write it as a for-loop and apply a suitable #pragma and compiler options to make the compiler do loop unrolling for you.
Write the routine in assembler. Before going this way, crank up all compiler optimizations and disassemble the routine first -- you are likely to find out that the compiler already used all potential tricks you knew and several you didn't.
If your string is potentially very large (megabytes) -- and here I am speculating -- using a graphics card via OpenCL/CUDA might offer some potential.
And so on.
But I really, really suggest you stick with the one you have if you have a real-world problem. If this is a toy problem and you are optimizing for the fun of it, go ahead.
Cycle-shaving is a fun way to learn CPUs and instructions sets, but for 99.999999...% of programming tasks it is not worth the effort.
You can use the pointer to iterate the string, and with a little effort use the * only once per character:
size_t strchroc(const char *str, const char ch)
{
size_t c = 0;
char n;
while ((n=*str++), ((n==ch)? ++c : 0), n)
;
return c;
}
Not that the compiler couldn't optimize yours to exactly the same code, but just for fun.
You should use strchr() (or memchr() if you know the length) before using your function. If there is a match, you can start from the position of the first matching character and then go from there.
This should be much faster unless your strings are very short, or it matches very early.
you can get rid of the variable i.
size_t strchroc(const char *str, const char ch){
size_t c = 0;
while(*str != '\0') {
if(*str == ch) c++;
str++;
}
return c;
}
size_t count_the_string(const char *str, const char ch){
size_t cnt ;
for(cnt=0; *str; ) {
cnt += *str++ == ch;
}
return cnt;
}
For the equivalent do { ...} while(); variant, GCC generates code without the conditional jump (except for the loop's jump, of course) , comparable to #hakattack 's solution.
size_t count_the_string2(const char *str, const char ch){
size_t cnt=0 ;
do {
cnt += *str == ch;
} while (*str++);
return cnt;
}
After a quick low quality benchmark I ended up with this for strings of arbitrary length.
On huge strings (100M+) it did not show too much of a difference, but on shorter strings (sentences, normal text files etc.) the improvement was about 25%.
unsigned int countc_r(char *buf, char c)
{
unsigned int k = 0;
for (;;) {
if (!buf[0]) break;
if ( buf[0] == c) ++k;
if (!buf[1]) break;
if ( buf[1] == c) ++k;
if (!buf[2]) break;
if ( buf[2] == c) ++k;
if (!buf[3]) break;
if ( buf[3] == c) ++k;
buf += 4;
}
return k;
}

Best way to convert whole file to lowercase in C

I was wondering if theres a realy good (performant) solution how to Convert a whole file to lower Case in C.
I use fgetc convert the char to lower case and write it in another temp-file with fputc. At the end i remove the original and rename the tempfile to the old originals name. But i think there must be a better Solution for it.
This doesn't really answer the question (community wiki), but here's an (over?)-optimized function to convert text to lowercase:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
int fast_lowercase(FILE *in, FILE *out)
{
char buffer[65536];
size_t readlen, wrotelen;
char *p, *e;
char conversion_table[256];
int i;
for (i = 0; i < 256; i++)
conversion_table[i] = tolower(i);
for (;;) {
readlen = fread(buffer, 1, sizeof(buffer), in);
if (readlen == 0) {
if (ferror(in))
return 1;
assert(feof(in));
return 0;
}
for (p = buffer, e = buffer + readlen; p < e; p++)
*p = conversion_table[(unsigned char) *p];
wrotelen = fwrite(buffer, 1, readlen, out);
if (wrotelen != readlen)
return 1;
}
}
This isn't Unicode-aware, of course.
I benchmarked this on an Intel Core 2 T5500 (1.66GHz), using GCC 4.6.0 and i686 (32-bit) Linux. Some interesting observations:
It's about 75% as fast when buffer is allocated with malloc rather than on the stack.
It's about 65% as fast using a conditional rather than a conversion table.
I'd say you've hit the nail on the head. Temp file means that you don't delete the original until you're sure that you're done processing it which means upon error the original remains. I'd say that's the correct way of doing it.
As suggested by another answer (if file size permits) you can do a memory mapping of the file via the mmap function and have it readily available in memory (no real performance difference if the file is less than the size of a page as it's probably going to get read into memory once you do the first read anyway)
You can usually get a little bit faster on big inputs by using fread and fwrite to read and write big chunks of the input/output. Also you should probably convert a bigger chunk (whole file if possible) into memory and then write it all at once.
edit: I just rememberd one more thing. Sometimes programs can be faster if you select a prime number (at the very least not a power of 2) as the buffer size. I seem to recall this has to do with specifics of the cacheing mechanism.
If you're processing big files (big as in, say, multi-megabytes) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than using SIMD instructions.
I.e. if you'd use SSE2, you could code the toupper_parallel like (pseudocode):
for (cur_parallel_word = begin_of_block;
cur_parallel_word < end_of_block;
cur_parallel_word += parallel_word_width) {
/*
* in SSE2, parallel compares are either about 'greater' or 'equal'
* so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
* The 'ALL' macro is supposed to replicate into all parallel bytes.
*/
mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
mask2 = parallel_compare_greater_than(ALL('Z'), *cur_parallel_word);
/*
* vector op - and all bytes in two vectors, 'PAND'
*/
mask = mask1 & mask2;
/*
* vector op - add a vector of bytes. Would use 'PADDB'.
*/
new = parallel_add(cur_parallel_word, ALL('a' - 'A'));
/*
* vector op - zero bytes in the original vector that will be replaced
*/
*cur_parallel_word &= !mask; // that'd become 'PANDN'
/*
* vector op - extract characters from new that replace old, then or in.
*/
*cur_parallel_word |= (new & mask); // PAND / POR
}
I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both original value and 'uppercased' version (one with the mask, the other with the inverse) before you or them together to form the result.
If you use mmap'ed file access, this could even be performed in-place, saving on the bounce buffer, and saving on many function and/or system calls.
There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.
But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to uppercase) then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, or ARM NEON, or PPC Altivec), is likely to make a significant speedup possible over existing general-purpose utilities.
Well, you can definitely speed this up a lot, if you know what the character encoding is. Since you're using Linux and C, I'm going to go out on a limb here and assume that you're using ASCII.
In ASCII, we know A-Z and a-z are contiguous and always 32 apart. So, what we can do is ignore the safety checks and locale checks of the toLower() function and do something like this:
(pseudo code)
foreach (int) char c in the file:
c -= 32.
Or, if there may be upper and lowercase letters, do a check like
if (c > 64 && c < 91) // the upper case ASCII range
then do the subtract and write it out to the file.
Also, batch writes are faster, so I would suggest first writing to an array, then all at once writing the contents of the array to the file.
This should be considerable faster.

Can I please get some feedback for this strcmp() function I implemented in C?

I'm learning C.
I find I learn programming well when I try things and received feedback from established programmers in the language.
I decided to write my own strcmp() function, just because I thought I could :)
int strcompare(char *a, char *b) {
while (*a == *b && *a != '\0') {
a++;
b++;
}
return *a - *b;
}
I was trying to get it to work by incrementing the pointer in the condition of the while but couldn't figure out how to do the return. I was going for the C style code, of doing as much as possible on one line :)
Can I please get some feedback from established C programmers? Can this code be improved? Do I have any bad habits?
Thanks.
If you want to do everything in the while statement, you could write
while (*a != '\0' && *a++ == *b++) {}
I'm not personally a huge fan of this style of programming - readers need to mentally "unpack" the order of operations anyway, when trying to understand it (and work out if the code is buggy or not). Memory bugs are particularly insidious in C, where overwriting memory one byte beyond or before where you should can cause all sorts of inexplicable crashes or bugs much later on, away from the original cause.
Modern styles of C programming emphasize correctness, consistency, and discipline more than terseness. The terse expression features, like pre- and post-increment operations, were originally a way of getting the compiler to generate better machine code, but optimizers can easily do that themselves these days.
As #sbi writes, I'd prefer const char * arguments instead of plain char * arguments.
The function doesn't change the content of a and b. it should probably announce that by taking pointers to const strings.
Most C styles are much terser than many other languages' styles, but don't try to be too clever. (In your code, with several conditions ANDed in the loop conditions, I don't think there's way to put incrementing in there, so this isn't even a question of style, but of correctness.)
I don't know since when putting as much as possible is considered as C-style... I rather associate (obfuscated) Perl with that..
Please DO NOT do this. The best thing to do is one command per line. You will understand why when you try to debug your code :)
To your implementation: Seems quite fine to me, but I would put in the condition that *b is not '\0' either, because you can't know that a is always bigger than b... Otherwise you risk reading in unallocated memory...
You may find this interesting, from eglibc-2.11.1. It's not far different to your own implementation.
/* Compare S1 and S2, returning less than, equal to or
greater than zero if S1 is lexicographically less than,
equal to or greater than S2. */
int
strcmp (p1, p2)
const char *p1;
const char *p2;
{
register const unsigned char *s1 = (const unsigned char *) p1;
register const unsigned char *s2 = (const unsigned char *) p2;
unsigned reg_char c1, c2;
do
{
c1 = (unsigned char) *s1++;
c2 = (unsigned char) *s2++;
if (c1 == '\0')
return c1 - c2;
}
while (c1 == c2);
return c1 - c2;
}
A very subtle bug: strcmp compares bytes interpreted as unsigned char, but your function is interpreting them as char (which is signed on most implementations). This will cause non-ascii characters to sort before ascii instead of after.
This function will fail if limits of (insigned) char is equal to or greater than limits of int because of integer overflow.
For example if you compile it on DSP which have 16 bit char with limits 0...65536 and 16 bit int with limits of -32768...32767, then if you try to compare strings like
"/uA640" and "A" the result will be negative, which is not true.
This is exotic and weird problem but it appear when you write universal implementation.

STM32 printf and RTC

* UPDATE *
Here is what I found. Whenever I had that function in there it wouldn't actually make the code lock up. It would actually make the read RTC I²C function very slow to execute, but the code would still run properly, but I had to wait a really long time to get past every time I read the RTC.
So there is an alarm interrupt for the RTC and this was triggering other I²C interactions inside the ISR, so it looks like it was trying to do two I²C communications at the same time, therefore slowing down the process. I removed the functions in the ISR and it's working now. I will keep investigating.
I am having this problem when programming an STM32F103 microcontroller using IAR 5.40. I have this function that if I try to printf a local variable it causes the code to freeze at another point way before it even gets to that function in question.
What could possibly be causing this?
This is the function:
u8 GSM_Telit_ReadSms(u8 bSmsIndex)
{
char bTmpSms[3] = {0};
itoa(bSmsIndex, bTmpSms, 10); // Converts the smsindex into a string
printf("index = %s\n", bTmpSms); // This printf caused the code to get stuck in the RTC // byte read function!
GSM_Telit_RequestModem("AT+CMGR=""1", 10, "CMGR", 5, 0);
return 1;
}
I tried this as well and this does not cause the lock I experienced:
u8 GSM_Telit_ReadSms(u8 bSmsIndex)
{
char bTmpSms[3] = {0};
itoa(bSmsIndex, bTmpSms, 10);
printf("index = 2\n");
GSM_Telit_RequestModem("AT+CMGR=""1", 10, "CMGR", 5, 0);
return 1;
}
There is no optimization enabled whatsoever and the code gets stuck when trying to read a byte out of my I²C RTC, but as soon as I remove this printf("index = %s\n", bTmpSms); or use this one instead printf("index = 2\n"); then everything is happy. Any ideas?
The bSmsIndex will never be more than 30 actually and even then the lock up happens wayyyy before this function gets called.
char bTmpSms[3] only has space for "99". If your bSmsIndex is 100 or greater, you will be trying to write to memory that doesn't belong to you.
Edit after the update
I don't have a reference to itoa on my local machine, but I found this one ( http://www.cplusplus.com/reference/clibrary/cstdlib/itoa/ ). According to that reference, the destination array MUST BE LONG ENOUGH FOR ANY POSSIBLE VALUE. Check your documentation: your specific itoa might be different.
Or use sprintf, snprintf, or some function described by the Standard.
Some ideas:
If itoa() is not properly NUL-terminating the string, then the call to printf may result in the machine looking for the NUL forever.
pmg has a very good point.
Also, consider what type the first argument to itoa() is. If it's signed and you're passing in an unsigned integer, then you may be getting an unexpected minus sign in bTmpSms. Try using sprintf() instead.
The change in code is moving the rest of your code around in memory. My guess is that some other part of the code, not listed here, is bashing some random location; in the first case that location contains something critical, in the second case it does not.
These are the worst kinds of problems to track down*. Good luck.
*Maybe not the worst - it could be worse if it were a race condition between multiple threads that only manifested itself once a week. Still not my favorite kind of bug.
It seems that if I don't initialize the variable bTmpSms to something the problem occurs.
I also realized that it is not the printf that is the problem. It is the itoa function. It got me to check that even though I didn't think that was the problem, when I commented the itoa function then the whole code worked.
So I ended up doing this:
u8 GSM_Telit_ReadSms(u8 bSmsIndex)
{
char bTmpSms[4] = "aaa"; // I still need to find out why this is !!!
itoa(bSmsIndex, bTmpSms, 10); // Converts the smsindex into a string
printf("index = %s\n", bTmpSms); // This printf caused the code to get stuck in the RTC // byte read function!
GSM_Telit_RequestModem("AT+CMGR=""1", 10, "CMGR", 5, 0);
return 1;
}
This is the itoa function I got:
char itoa(int value, char* result, int base)
{
// Check that the base if valid
if (base < 2 || base > 36) {
*result = '\0';
return 0;
}
char* ptr = result, *ptr1 = result, tmp_char;
int tmp_value;
do
{
tmp_value = value;
value /= base;
*ptr++ = "zyxwvutsr
qponmlkjihgfedcba9876543210123456789abcdefghijklmnopqrstuvwxyz" [35 + (tmp_value - value * base)];
} while (value);
// Apply negative sign
if (tmp_value < 0)
*ptr++ = '-';
*ptr-- = '\0';
while(ptr1 < ptr)
{
tmp_char = *ptr;
*ptr--= *ptr1;
*ptr1++ = tmp_char;
}
return 1;
}
What's the value of bSmsIndex you're trying to print?
If it's greater than 99 then you're overrunning the bTmpSms array.
If that doesn't help, then use IAR's pretty good debugger - I'd drop into the assembly window at the point where printf() is being called and single step until things went into the weeds. That'll probably make clear what the problem is.
Or as a quick-n-dirty troubleshoot, try sizing the array to something large (maybe 8) and see what happens.
What's the value of bSmsIndex?
If more than 99 it will be three digits when converted to a string. When zero terminated, it will be four characters, but you've allocated only three to bTmpSms so the null may get overwritten and the printf will try to print whatever is after bTmpSms until the next null. That could access anything, really.
Try to disassemble this area with index = 2 vs. index = %s.

Resources