Quick-fixing 32-bit (2GB limited) fseek/ftell on freebsd 7 - c

I have an old 32-bit C/C++ program on FreeBSD, which is used remotely by hundreds of users, and whose author will not fix it. It was written in an unsafe way: all file offsets are stored internally as unsigned 32-bit values, and the ftell/fseek functions were used. In FreeBSD 7 (the host platform for the software), this means that ftell and fseek use a 32-bit signed long:
int fseek(FILE *stream, long offset, int whence);
long ftell(FILE *stream);
I need to do a quick fix of the program, because some internal data files suddenly hit 2^31 bytes in size (2 147 483 7yy bytes) after 13 years of collecting data, and the internal fseek/ftell asserts now fail for every request.
In the FreeBSD 7 world there is the fseeko/ftello hack for 2GB+ files.
int
fseeko(FILE *stream, off_t offset, int whence);
off_t
ftello(FILE *stream);
The off_t type here is not well-defined; all I know now is that it is 8 bytes in size and looks like long long OR unsigned long long (I don't know which one).
Is it enough (to work with files up to 4 GB) and safe to search-and-replace all ftell with ftello, and all fseek with fseeko (sed -i 's/ftell/ftello/g', and the same for fseek), if the possible usages of them are:
unsigned long offset1,offset2; //32bit
offset1 = (compute + it) * in + some - arithmetic;
fseek(file, 0, SEEK_END);
fseek(file, 4, SEEK_END); // or other small int constant
offset2 = ftell(file);
fseek(file, offset1, SEEK_SET); // No usage of SEEK_CUR
and combinations of such calls.
What is the signedness of off_t?
Is it safe to assign a 64-bit off_t to an unsigned 32-bit offset? Will it work for bytes in the range from 2 GB up to 4 GB?
Which functions may be used for working with offsets besides ftell/fseek?

FreeBSD's fseeko() and ftello() are documented as POSIX.1-2001 compatible, which means off_t is a signed integer type.
On FreeBSD 7, you can safely do:
off_t actual_offset;
unsigned long stored_offset;
if (actual_offset >= (off_t)0 && actual_offset < (off_t)4294967296.0)
stored_offset = (unsigned long)actual_offset;
else
some_fatal_error("Unsupportable file offset!");
(On LP64 architectures, the above would be silly, as off_t and long would both be 64-bit signed integers. It would be safe even then; just silly, since all possible file offsets could be supported.)
The thing that people do get bitten by often with this, is that the offset calculations must be done using off_t. That is, it is not enough to cast the result to off_t, you must cast the values used in the arithmetic to off_t. (Technically, you only need to make sure each arithmetic operation is done at off_t precision, but I find it easier to remember the rules if I just punt and cast all the operands.) For example:
off_t offset;
unsigned long some, values, used;
offset = (off_t)some * (off_t)values + (off_t)used;
fseeko(file, offset, SEEK_SET);
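To see why the widening has to happen before the arithmetic, here is a minimal sketch. It uses uint32_t and int64_t from <stdint.h> as stand-ins for the 32-bit unsigned long and the 64-bit off_t of 32-bit FreeBSD 7, and it assumes int is no wider than 32 bits; the function names are made up for illustration.

```c
#include <stdint.h>

/* Arithmetic done at 32-bit precision: the product wraps modulo 2^32
   before the cast, so the cast only widens an already-wrong value. */
static int64_t offset_cast_result(uint32_t some, uint32_t values, uint32_t used)
{
    return (int64_t)(some * values + used);
}

/* Every operand widened first: the arithmetic is done at 64-bit precision. */
static int64_t offset_cast_operands(uint32_t some, uint32_t values, uint32_t used)
{
    return (int64_t)some * (int64_t)values + (int64_t)used;
}
```

With some = 5000000, values = 1000 and used = 16, the first function yields 705032720, while the second yields 5000000016, which a range check can reject instead of silently seeking to the wrong place.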
Usually the offset calculations are used to find a field in a specific record; the arithmetic tends to stay the same. I truly recommend you move the seek operations to a helper function, if possible:
int fseek_to(FILE *const file,
const unsigned long some,
const unsigned long values,
const unsigned long used)
{
const off_t offset = (off_t)some * (off_t)values + (off_t)used;
if (offset < (off_t)0 || offset >= (off_t)4294967296.0)
fatal_error("Offset exceeds 4GB; I must abort!");
return fseeko(file, offset, SEEK_SET);
}
Now, if you happen to be in a lucky position where you know all your offsets are aligned (to some integer, say 4), you can give yourself a couple of years of more time to rewrite the application, by using an extension of the above:
#define BIG_N 4
int fseek_to(FILE *const file,
const unsigned long some,
const unsigned long values,
const unsigned long used)
{
const off_t offset = (off_t)some * (off_t)values + (off_t)used;
if (offset < (off_t)0)
fatal_error("Offset is negative; I must abort!");
if (offset >= (off_t)(BIG_N * 2147483648.0))
fatal_error("Offset is too large; I must abort!");
if ((offset % BIG_N) && (offset >= (off_t)2147483648.0))
fatal_error("Offset is not a multiple of BIG_N; I must abort!");
return fseeko(file, offset, SEEK_SET);
}
int fseek_big(FILE *const file, const unsigned long position)
{
off_t offset;
if (position >= 2147483648UL)
offset = (off_t)2147483648UL
+ (off_t)BIG_N * (off_t)(position - 2147483648UL);
else
offset = (off_t)position;
return fseeko(file, offset, SEEK_SET);
}
unsigned long ftell_big(FILE *const file)
{
off_t offset;
offset = ftello(file);
if (offset < (off_t)0)
fatal_error("Offset is negative; I must abort!");
if (offset < (off_t)2147483648UL)
return (unsigned long)offset;
if (offset % BIG_N)
fatal_error("Offset is not a multiple of BIG_N; I must abort!");
if (offset >= (off_t)(BIG_N * 2147483648.0))
fatal_error("Offset is too large; I must abort!");
return (unsigned long)2147483648UL
+ (unsigned long)((offset - (off_t)2147483648UL) / (off_t)BIG_N);
}
The logic is simple: if the offset is less than 2^31, it is used as-is. Otherwise, it is represented by the value 2^31 + (offset - 2^31) / BIG_N, which fseek_big() maps back to the file offset 2^31 + BIG_N × (value - 2^31). The only requirement is that offsets of 2^31 and above are always multiples of BIG_N.
Obviously, you must then use only the above three functions, plus whatever variants of fseek_to() you need (as long as they do the same checks and differ only in the parameters and the formula for the offset calculation). With that, you can support file sizes of up to 2147483648 + BIG_N × 2147483647 bytes. For BIG_N==4, that is 10 GiB (less 4 bytes; 10,737,418,236 bytes to be exact).
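The mapping between stored positions and real file offsets can be sketched without any file I/O. This is only an illustration of the arithmetic: uint32_t and int64_t stand in for the 32-bit unsigned long and the 64-bit off_t, and position_to_offset(), offset_to_position() and TWO_GIB are hypothetical names, not part of the code above.

```c
#include <stdint.h>

#define BIG_N 4                       /* alignment of all offsets >= 2^31 */
#define TWO_GIB INT64_C(2147483648)   /* 2^31 */

/* Stored 32-bit position -> real file offset. */
static int64_t position_to_offset(uint32_t position)
{
    if (position < TWO_GIB)
        return (int64_t)position;
    return TWO_GIB + (int64_t)BIG_N * ((int64_t)position - TWO_GIB);
}

/* Real file offset -> stored 32-bit position.
   Returns 0 on success, -1 if the offset cannot be represented. */
static int offset_to_position(int64_t offset, uint32_t *position)
{
    if (offset < 0)
        return -1;
    if (offset < TWO_GIB) {
        *position = (uint32_t)offset;
        return 0;
    }
    if (offset % BIG_N != 0)
        return -1;                    /* not a multiple of BIG_N */
    int64_t scaled = TWO_GIB + (offset - TWO_GIB) / BIG_N;
    if (scaled > (int64_t)UINT32_MAX)
        return -1;                    /* beyond the largest stored position */
    *position = (uint32_t)scaled;
    return 0;
}
```

The largest stored position, 4294967295, maps to offset 10,737,418,236, matching the limit stated above.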
Questions?
Edited to clarify:
Start with replacing your fseek(file, position, SEEK_SET) with calls to fseek_pos(file, position),
static inline void fseek_pos(FILE *const file, const unsigned long position)
{
if (fseeko(file, (off_t)position, SEEK_SET))
fatal_error("Cannot set file position!");
}
and fseek(file, position, SEEK_END) with calls to fseek_end(file, position) (for symmetry -- I'm assuming the position for this one is usually a literal integer constant),
static inline void fseek_end(FILE *const file, const off_t relative)
{
if (fseeko(file, relative, SEEK_END))
fatal_error("Cannot set file position!");
}
and finally, ftell(file) with calls to ftell_pos(file):
static inline unsigned long ftell_pos(FILE *const file)
{
off_t position;
position = ftello(file);
if (position == (off_t)-1)
fatal_error("Lost file position!");
if (position < (off_t)0 || position >= (off_t)4294967296.0)
fatal_error("File position outside the 4GB range!");
return (unsigned long)position;
}
Since on your architecture and OS unsigned long is a 32-bit unsigned integer type and off_t is a 64-bit signed integer type, this gives you the full 4GB range.
For the offset calculations, define one or more functions similar to
static inline void fseek_to(FILE *const file, const off_t term1,
const off_t term2,
const off_t term3)
{
const off_t position = term1 * term2 + term3;
if (position < (off_t)0 || position >= (off_t)4294967296.0)
fatal_error("File position outside the 4GB range!");
if (fseeko(file, position, SEEK_SET))
fatal_error("Cannot set file position!");
}
For each offset calculation algorithm, define one fseek_to variant. Name the parameters so that the arithmetic makes sense. Make the parameters const off_t, as above, so you don't need extra casts in the arithmetic. Only the parameters and the const off_t position = line defining the calculation algorithm vary between the variant functions.
Questions?

Related

What are the ramifications of returning the value -1 as a size_t return value in C?

I am reading a textbook and one of the examples does this. Below, I've reproduced the example in abbreviated form:
#include <stdio.h>
#define SIZE 100
size_t linearSearch(const int array[], int searchVal, size_t size);
int main(void)
{
int myArray[SIZE];
int mySearchVal;
size_t returnValue;
// populate array with data & prompt user for the search value
// call linear search function
returnValue = linearSearch(myArray, mySearchVal, SIZE);
if (returnValue != -1)
puts("Value Found");
else
puts("Value Not Found");
}
size_t linearSearch(const int array[], int key, size_t size)
{
for (size_t i = 0; i < size; i++) {
if (key == array[i])
return i;
}
return -1;
}
Are there any potential problems with this? I know size_t is defined as an unsigned integral type so it seems as if this might be asking for trouble at some point if I'm returning -1 as a size_t return value.
There are a few APIs that come to mind which use the maximum signed or unsigned integer value as a sentinel value. For example, C++'s std::string::find() method returns std::string::npos if the value given to find() could not be found within the string, and std::string::npos is equal to (std::string::size_type)-1.
Similarly, on iOS and OS X, NSArray's indexOfObject: method returns NSNotFound when the object cannot be found in the array. Surprisingly, NSNotFound is actually defined as NSIntegerMax, which is either INT_MAX for 32-bit platforms or LONG_MAX for 64-bit platforms, even though NSArray indexes are typically NSUInteger (which is either unsigned int for 32-bit platforms or unsigned long for 64-bit platforms).
It does mean that there will be no distinction between “not found” and “element number 18,446,744,073,709,551,615” (for 64-bit systems), but whether that is an acceptable trade off is up to you.
An alternative is to have the function return the index through a pointer argument and have the function's return value indicate success or failure, e.g.
#include <stdbool.h>
bool linearSearch(const int array[], int val, size_t size, size_t *index)
{
// find value and then
if (found)
{
*index = indexOfFoundItem;
return true;
}
else
{
*index = 0; // optional, in some cases, better to leave *index untouched
return false;
}
}
Your compiler may decide to complain about comparing signed with unsigned (GCC or Clang will if provoked*), but otherwise "it works". On two's-complement machines (most machines these days), (size_t)-1 is the same as SIZE_MAX. Indeed, as discussed in extenso in the comments, it is the same for one's-complement or sign-magnitude machines too, because of the wording in §6.3.1.3 of the C99 and C11 standards.
Using (size_t)-1 to indicate 'not found' means that you can't distinguish between the last entry in the biggest possible array and 'not found', but that's seldom an actual problem.
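One way to make the sentinel explicit is to give it a name and compare against that name rather than a bare -1. The sketch below is only an illustration; NOT_FOUND and linear_search are hypothetical names, not part of the textbook example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the sentinel; (size_t)-1 equals SIZE_MAX. */
#define NOT_FOUND ((size_t)-1)

/* Returns the index of the first element equal to key, or NOT_FOUND. */
static size_t linear_search(const int array[], size_t size, int key)
{
    for (size_t i = 0; i < size; i++) {
        if (array[i] == key)
            return i;
    }
    return NOT_FOUND;
}
```

Because both sides of a comparison such as result != NOT_FOUND are size_t, no signed/unsigned comparison is involved at the call site.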
So, it's just the one edge case where I could end up having a problem?
The array would have to be an array of char, though, to be big enough to cause trouble — and while you could have 4 GiB memory with a 32-bit machine, it's pretty implausible to have all that memory committed to a character array (and it's very much less likely to be an issue with 64-bit machines; most don't run to 16 exbibytes of memory). So it isn't a practical edge case.
In POSIX, there is a ssize_t type, the signed type of the same size of size_t. You could consider using that instead of size_t. However, it causes the same angst that (size_t)-1 causes, in my experience. Plus on a 32-bit machine, you could have a 3 GiB chunk of memory treated as an array of char, but with ssize_t as a return type, you couldn't usefully use more than 2 GiB — or you'd need to use SSIZE_MIN (if it existed; I'm not sure it does) instead of -1 as the signal value.
* GCC or Clang has to be provoked fairly hard. Simply using -Wall is not sufficient; it takes -Wextra (or the specific -Wsign-compare option) to trigger a warning. Since I routinely compile with -Wextra, I'm aware of the issue; not everyone is as vigilant.
Comparing signed and unsigned quantities is fully defined by the standard, but can lead to counter-intuitive results (because small negative numbers appear very large when converted to unsigned values), which is why the compilers complain if requested to do so.
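A small sketch of that counter-intuitive result. The first function spells out, with an explicit cast, the conversion the compiler applies implicitly when an int is compared with an unsigned int; the second compares by mathematical value. Both function names are made up for illustration.

```c
/* What `a < b` does for int a and unsigned int b: a is converted to
   unsigned first, so -1 becomes UINT_MAX and is no longer "less". */
static int less_after_conversion(int a, unsigned int b)
{
    return (unsigned int)a < b;
}

/* Comparison by mathematical value: any negative a is below any b. */
static int less_by_value(int a, unsigned int b)
{
    return a < 0 || (unsigned int)a < b;
}
```

For a = -1 and b = 1, the first function returns 0 and the second returns 1.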
Normally if you want to return negative values and still have some notion of a size type you use ssize_t. gcc and clang both complain but the following compiles. Note, some of the following is undefined behavior...
#include <stdio.h>
#include <stdint.h>
size_t foo() {
return -1;
}
void print_bin(uint64_t num, size_t bytes);
void print_bin(uint64_t num, size_t bytes) {
int i = 0;
for(i = bytes * 8; i > 0; i--) {
(i % 8 == 0) ? printf("|") : 1;
(num & 1) ? printf("1") : printf("0");
num >>= 1;
}
printf("\n");
}
int main(void){
long int x = 0;
printf("%zu\n", foo());
printf("%ld\n", foo());
printf("%zu\n", ~(x & 0));
printf("%ld\n", ~(x & 0));
print_bin((~(x & 0)), 8);
}
The output is
18446744073709551615
-1
18446744073709551615
-1
|11111111|11111111|11111111|11111111|11111111|11111111|11111111|11111111
I'm on a 64-bit machine. The following bit pattern
|11111111|11111111|11111111|11111111|11111111|11111111|11111111|11111111
can mean -1 or 18446744073709551615; it depends on context, i.e. on how the type that has that binary representation is being used.

fseek - fails skipping a large amount of bytes?

I'm trying to skip a large amount of bytes before using fread to read the next bytes.
When size is small #define size 6404168 - it works:
long int x = ((long int)size)*sizeof(int);
fseek(fincache, x, SEEK_CUR);
When size is huge #define size 649218227, it doesn't :( The next fread reads garbage, can't really understand which offset is it reading from.
Using fread instead as a workaround works in both cases but its really slow:
temp = (int *) calloc(size, sizeof(int));
fread(temp,1, size*sizeof(int), fincache);
free(temp);
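Assuming sizeof(int) is 4 and you are on a 32-bit system (where sizeof(long) is 4),
649218227*4 = 2,596,872,908 is more than a long can hold (LONG_MAX is 2,147,483,647). Because sizeof yields an unsigned size_t, the multiplication itself is carried out in unsigned arithmetic, and converting the out-of-range result back to long is implementation-defined (typically it becomes a negative offset). So it works only for smaller values (byte counts no larger than LONG_MAX).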
You can use a loop instead to fseek() necessary bytes.
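/* needs <stdint.h> for intmax_t and <limits.h> for LONG_MAX */
long x;
/* total number of bytes to skip, computed in a type wide enough to hold it */
intmax_t len = (intmax_t)size * (intmax_t)sizeof(int);
while (len > 0) {
    /* seek at most LONG_MAX bytes per call */
    x = (long) (len > LONG_MAX ? LONG_MAX : len);
    if (fseek(fincache, x, SEEK_CUR) != 0)
        break; /* seek failed; stop */
    len = len - x;
}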
The offset argument of fseek is required to be a long, not a long long. So x must fit into a long, else don't use fseek.
Since your platform's int is most likely 32-bit, multiplying 649,218,227 by sizeof(int) results in a number that exceeds INT_MAX and LONG_MAX, which are both 2**31-1 on 32-bit platforms. Since fseek accepts a long int, the offset cannot be represented and the seek lands in the wrong place, which is why the next fread returns garbage.
You should consult your compiler's documentation to find if it provides an extension for 64-bit seeking. On POSIX systems, for example, you can use fseeko, which accepts an offset of type off_t.
Be careful not to introduce overflow before even calling the 64-bit seeking function. Careful code could look like this:
off_t offset = (off_t) size * (off_t) sizeof(int);
fseeko(fincache, offset, SEEK_CUR);
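The arithmetic for the two sizes in the question can be checked with a small helper. This is only a sketch: INT32_MAX stands in for LONG_MAX on a platform with a 32-bit long, and fits_in_32bit_long is a hypothetical name.

```c
#include <stdint.h>

/* Returns 1 if count * elem_size fits in a 32-bit signed long
   (INT32_MAX stands in for LONG_MAX on a 32-bit platform), else 0. */
static int fits_in_32bit_long(uint64_t count, uint64_t elem_size)
{
    if (elem_size != 0 && count > UINT64_MAX / elem_size)
        return 0;                      /* would overflow even 64 bits */
    return count * elem_size <= (uint64_t)INT32_MAX;
}
```

With a 4-byte int, 6404168 elements (25,616,672 bytes) fit, while 649218227 elements (2,596,872,908 bytes) do not, so the second case needs fseeko with an off_t offset or several smaller seeks.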
Input guidance for fseek:
http://www.tutorialspoint.com/c_standard_library/c_function_fseek.htm
int fseek(FILE *stream, long int offset, int whence)
offset − This is the number of bytes to offset from whence.
You are invoking undefined behavior by passing a long long (whose value is bigger than the maximum of long int) to fseek rather than the required long.
As is known, UB can do anything, including not work.
Try this. You may have to read the data out if it's such a large number:
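size_t toseek = 6404168;
//change the number to increase it
char buffer[4096];
while (toseek > 0)
{
    // read at most one buffer's worth per iteration
    size_t toread = toseek < sizeof(buffer) ? toseek : sizeof(buffer);
    size_t nread = fread(buffer, 1, toread, stdin);
    if (nread == 0)
        break; // EOF or read error: stop instead of looping forever
    toseek = toseek - nread;
}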

size_t used as a value in a formula

Here is a short snippet of a function reading lines.
How is that possible that it compares bufsize with ((size_t)-1)/2 ?
I imagined comparing a variable to eg. int - that is just impossible; to INT_MAX on the contrary it is correct, I think.
So how can that code actually work and give no errors?
int c;
size_t bufsize = 0;
size_t size = 0;
while((c=fgetc(infile)) != EOF) {
if (size >= bufsize) {
if (bufsize == 0)
bufsize = 2;
else if (bufsize <= ((size_t)-1)/2)
bufsize = 2*size;
else {
free(line);
exit(3);
}
newbuf = realloc(line,bufsize);
if (!newbuf) {
free(line);
abort();
}
line = newbuf;
}
/* some other operations */
}
(size_t)-1
This is casting the value -1 to a size_t. (type)value is a cast in C.
Since size_t is an unsigned type, this is actually the maximum value that size_t can hold, so it's used to make sure that the buffer size can actually be safely doubled (hence the subsequent division by two).
The code uses a well-known idiom for obtaining the maximum size_t value.
Converting -1 to an unsigned type is defined by the C standard to wrap around modulo one more than the type's maximum value, so (size_t)-1 is the largest number that fits in a size_t, regardless of how the machine represents negative numbers.
After you have that, it divides by two to get half of that number, and does the comparison to see whether it is safe to double the size without going over the maximum size_t. By then, it is dividing a size_t and comparing two size_t values (a type-safe operation).
If you really wanted to remove this bit-wizardry (OK, it's not the worst example of bit wizardry I've seen), consider that the following snippet
else if (bufsize <= ((size_t)-1)/2)
bufsize = 2*size;
could be replaced with
else if (bufsize <= SIZE_MAX/2)
bufsize = 2*size;
and be type safe without casting and more readable.
(size_t)-1 casts -1 to the type size_t, which results in SIZE_MAX (a macro defined in stdint.h), the maximum value that the size_t type can hold.
So the comparison is checking whether bufsize is less than or equal to one half the maximum value that can be contained in a size_t.
size_t isn't being interpreted as a value, it's being used to cast the value of negative one to the type size_t.
((size_t)-1)/2
is casting -1 to a size_t and then dividing by 2.
The size_t in ((size_t)-1)/2 is simply being used as a cast: casting -1 to size_t.
The trick here is that size_t is unsigned, so the cast (size_t) -1 will be converted to the maximum value of size_t, or SIZE_MAX. This is useful in the context of the loop. However, I'd prefer to see SIZE_MAX used directly rather than this trick.
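Written with SIZE_MAX, the overflow guard from the question could look like the following sketch. Note one deliberate difference: it doubles the capacity itself, whereas the question's code assigns 2*size. grow_capacity is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Doubles *capacity (starting from 2 when it is 0).
   Returns 0 on success, or -1 (leaving *capacity unchanged) if the
   doubled value would not fit in a size_t. */
static int grow_capacity(size_t *capacity)
{
    if (*capacity == 0) {
        *capacity = 2;
        return 0;
    }
    if (*capacity > SIZE_MAX / 2)
        return -1;
    *capacity *= 2;
    return 0;
}
```

The caller would then pass the new capacity to realloc and treat a -1 return the same way the original code treats its final else branch.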

How the condition to check whether the link's size in a symbolic link file is too big, works in this code?

Here is a piece of code from the lib/xreadlink.c file in GNU Coreutils..
/* Call readlink to get the symbolic link value of FILENAME.
+ SIZE is a hint as to how long the link is expected to be;
+ typically it is taken from st_size. It need not be correct.
Return a pointer to that NUL-terminated string in malloc'd storage.
If readlink fails, return NULL (caller may use errno to diagnose).
If malloc fails, or if the link value is longer than SSIZE_MAX :-),
give a diagnostic and exit. */
char * xreadlink (char const *filename)
{
/* The initial buffer size for the link value. A power of 2
detects arithmetic overflow earlier, but is not required. */
size_t buf_size = 128;
while (1)
{
char* buffer = xmalloc(buf_size);
ssize_t link_length = readlink(filename, buffer, buf_size);
if(link_length < 0)
{
/*handle failure of system call*/
}
if((size_t) link_length < buf_size)
{
buffer[link_length] = 0;
return buffer;
}
/*size not sufficient, allocate more*/
free (buffer);
buf_size *= 2;
/*Check whether increase is possible*/
if (SSIZE_MAX < buf_size || (SIZE_MAX / 2 < SSIZE_MAX && buf_size == 0))
xalloc_die ();
}
}
The code is understandable except I could not understand how the check for whether the link's size is too big works, that is the line:
if (SSIZE_MAX < buf_size || (SIZE_MAX / 2 < SSIZE_MAX && buf_size == 0))
Further, how can
(SIZE_MAX / 2 < SSIZE_MAX)
condition be true on any system???
SSIZE_MAX is the maximum value of the signed variety of size_t. For instance, if size_t is only 16 bits (very unlikely these days), SIZE_MAX is 65535 while SSIZE_MAX is 32767. More likely it is 32 bits (giving 4294967295 and 2147483647 respectively), or even 64 bits (giving numbers too big to type here :-) ).
The basic problem to solve here is that readlink returns a signed value (ssize_t) even though the buffer size it is given is an unsigned one (size_t). So once buf_size exceeds SSIZE_MAX, it's impossible to read the link, because a length that large could not be returned as a non-negative ssize_t; it would show up as a negative return value.
As for the "furthermore" part: it quite likely can't, i.e., you're right. At least on any sane system, anyway. (It is theoretically possible to have, e.g., a 32-bit SIZE_MAX but a 33-bit signed integer so that SSIZE_MAX is also 4294967295. Presumably this code is written to guard against theoretically-possible, but never-actually-seen, systems.)
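The same guard can also be expressed as a check made before doubling, which avoids having to detect wraparound afterwards. This is a sketch, not the Coreutils code: next_buf_size is a hypothetical helper, and the fallback definition of SSIZE_MAX is an assumption for builds where <limits.h> does not provide it.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#ifndef SSIZE_MAX
/* Assumed fallback: ssize_t is taken to be the signed twin of size_t. */
#define SSIZE_MAX (SIZE_MAX / 2)
#endif

/* Computes the next buffer size for a readlink() retry loop.
   Returns 0 and stores the doubled size in *next, or returns -1 if the
   doubled size would exceed SSIZE_MAX (readlink could not report a
   link length that large). */
static int next_buf_size(size_t current, size_t *next)
{
    if (current > (size_t)SSIZE_MAX / 2)
        return -1;
    *next = current * 2;
    return 0;
}
```

Because the check happens before the multiplication, buf_size can never wrap to 0, so the second half of the original condition is not needed in this form.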

How to convert from integer to unsigned char in C, given integers larger than 256?

As part of my CS course I've been given some functions to use. One of these functions takes a pointer to unsigned chars to write some data to a file (I have to use this function, so I can't just make my own purpose built function that works differently BTW). I need to write an array of integers whose values can be up to 4095 using this function (that only takes unsigned chars).
However am I right in thinking that an unsigned char can only have a max value of 256 because it is 1 byte long? I therefore need to use 4 unsigned chars for every integer? But casting doesn't seem to work with larger values for the integer. Does anyone have any idea how best to convert an array of integers to unsigned chars?
Usually an unsigned char holds 8 bits, with a max value of 255. If you want to know this for your particular compiler, print out CHAR_BIT and UCHAR_MAX from <limits.h>. You could extract the individual bytes of a 32-bit int like this:
#include <stdint.h>
void
pack32(uint32_t val,uint8_t *dest)
{
dest[0] = (val & 0xff000000) >> 24;
dest[1] = (val & 0x00ff0000) >> 16;
dest[2] = (val & 0x0000ff00) >> 8;
dest[3] = (val & 0x000000ff) ;
}
uint32_t
unpack32(uint8_t *src)
{
uint32_t val;
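val = (uint32_t)src[0] << 24; /* cast first: src[0] would otherwise be promoted to signed int */
val |= (uint32_t)src[1] << 16;
val |= (uint32_t)src[2] << 8;
val |= (uint32_t)src[3] ;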
return val;
}
An unsigned char generally has a size of 1 byte, therefore you can decompose any other type into an array of unsigned chars (e.g. for a 4-byte int you can use an array of 4 unsigned chars). Your exercise is probably about generics. You should write the file as a binary file using the fwrite() function, and just write byte after byte to the file.
The following example should write a number (of any data type) to the file. I am not sure if it works since you are forcing the cast to unsigned char * instead of void *.
int homework(unsigned char *foo, size_t size)
{
    // open file for binary writing
    FILE *f = fopen("work.txt", "wb");
    if(f == NULL)
        return 1;
    // write all size bytes, starting at foo, to the file
    size_t written = fwrite(foo, sizeof(char), size, f);
    fclose(f);
    // report failure if not every byte was written
    return written == size ? 0 : 1;
}
I hope the given example at least gives you a starting point.
Yes, you're right; a char/byte only allows up to 8 distinct bits, so that is 2^8 distinct numbers, which is zero to 2^8 - 1, or zero to 255. Do something like this to get the bytes:
int x = 0;
unsigned char* p = (unsigned char*)&x; // unsigned, so bytes >= 0x80 are not read as negative
for (size_t i = 0; i < sizeof(x); i++)
{
//Do something with p[i]
}
(This isn't officially C because of the order of declaration but whatever... it's more readable. :) )
Do note that this code may not be portable, since it depends on the processor's internal storage of an int.
If you have to write an array of integers then just convert the array into a pointer to char then run through the array.
int main()
{
int data[] = { 1, 2, 3, 4 ,5 };
size_t size = sizeof(data)/sizeof(data[0]); // Number of integers.
unsigned char* out = (unsigned char*)data;
for(size_t loop =0; loop < (size * sizeof(int)); ++loop)
{
MyProfSuperWrite(out + loop); // Write 1 unsigned char
}
}
Now, people have mentioned that values up to 4095 will fit in fewer bits than a normal integer. Probably true. Thus you can save space and not write out the top bits of each integer. Personally I think this is not worth the effort. The extra code to write the values and process the incoming data is not worth the savings you would get (maybe if the data was the size of the Library of Congress). Rule one: do as little work as possible (it's easier to maintain). Rule two: optimize if asked (but ask why first). You may save space, but it will cost in processing time and maintenance.
The part of the assignment that says "integers whose values can be up to 4095 using this function (that only takes unsigned chars)" should be giving you a huge hint. 4095 unsigned is 12 bits.
You can store the 12 bits in a 16-bit short, but that is somewhat wasteful of space: you are only using 12 of the 16 bits of the short. Since you are dealing with more than 1 byte in the conversion of characters, you may need to deal with the endianness of the result. This is the easiest option.
You could also use a bit field or some packed binary structure if you are concerned about space. That is more work.
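A minimal sketch of the two-byte option, with the byte order fixed explicitly so that the result does not depend on the machine's endianness. The names pack12 and unpack12 are made up for illustration, and the sketch assumes 8-bit bytes.

```c
/* Store a value in the range 0..4095 in two bytes, most significant first. */
static void pack12(unsigned int value, unsigned char out[2])
{
    out[0] = (unsigned char)((value >> 8) & 0x0Fu);  /* top 4 bits */
    out[1] = (unsigned char)(value & 0xFFu);         /* low 8 bits */
}

/* Rebuild the value from the two bytes written by pack12(). */
static unsigned int unpack12(const unsigned char in[2])
{
    return ((unsigned int)in[0] << 8) | (unsigned int)in[1];
}
```

For example, 4095 becomes the bytes 0x0F 0xFF, and the resulting unsigned char buffer can be handed to the function supplied by the course.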
It sounds like what you really want to do is call sprintf to get a string representation of your integers. This is a standard way to convert from a numeric type to its string representation. Something like the following might get you started:
char num[5]; // Room for 4095
// Array is the array of integers, and arrayLen is its length
for (i = 0; i < arrayLen; i++)
{
sprintf (num, "%d", array[i]);
// Call your function that expects a pointer to chars
printfunc (num);
}
Without information on the function you are directed to use regarding its arguments, return value and semantics (i.e. the definition of its behaviour) it is hard to answer. One possibility is:
Given:
void theFunction(unsigned char* data, int size);
then
int array[SIZE_OF_ARRAY];
theFunction((unsigned char*)array, sizeof(array));
or
theFunction((unsigned char*)array, SIZE_OF_ARRAY * sizeof(*array));
or
theFunction((unsigned char*)array, SIZE_OF_ARRAY * sizeof(int));
All of which will pass all of the data to theFunction(), but whether that makes any sense will depend on what theFunction() does.
