Integer overflow for looping over large vector in C?

We can use a for loop to iterate over a vector in C, for example:
int len = length(x);
for (int i = 0; i < len; i++) {
    double val = REAL(x)[i];
}
This seems to work fine, but I don't understand why. According to Wikipedia the default int type ranges from −32767 to +32767. So why does this still work for vectors longer than that?
Does R somehow override int to always be long int? Is there a maximum length of the vector that this code will support?

You are misreading Wikipedia. Here is a more complete quote (emphasis mine):
At least in the [−32767,+32767] range
On most modern platforms (at least those powerful enough to run R), int is at least 32 bits wide, which gives a range of [−2147483647,+2147483647] or more.
Additionally, R's ?integer has the following to say:
Note that current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.
Finally, ?.Machine says:
integer.max - the largest integer which can be represented. Always 2147483647.
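If you want to verify this on your own machine, here is a minimal, self-contained check using <limits.h> (plain standard C, nothing R-specific); on any platform capable of building R it will typically report INT_MAX as 2147483647:
#include <stdio.h>
#include <limits.h>

int main(void) {
    /* Print the actual width and range of int on this platform. */
    printf("sizeof(int) = %zu bytes\n", sizeof(int));
    printf("INT_MIN = %d, INT_MAX = %d\n", INT_MIN, INT_MAX);
    return 0;
}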


What's the role of L in modulo 1000000007L?

When you're a total noob at programming, a lot of things look like magic,
so while solving some classic problems at SPOJ in C, I found one called DIAGONAL.
After some attempts I gave up and went searching for solutions and I found this one:
#include <stdio.h>

int main() {
    int num_cases, i;
    long long mod_by = 24*1000000007L;
    scanf("%d", &num_cases);
    long long n;
    long long answer;
    for (i = 0; i < num_cases; i++) {
        scanf("%lld", &n);
        long long x = (n*(n-1)) % (mod_by);
        long long y = (x*(n-2)) % (mod_by);
        long long z = (y*(n-3)) % (mod_by);
        answer = z / 24;
        printf("%lld\n", answer);
    }
    return 0;
}
At first glance I thought that L with the modulo was some kind of mistake made by the user who posted the code (haha who would mix numbers with letters this way?! nonsense! -thought the noob-), but when I fixed the (many) wrongs in my code and used this modulo, it didn't work without the magic L (I got Wrong Answer).
Then I substituted the L with the ASCII code number (because well, maybe it was that!) and it also didn't work.
Since then I'm trying to understand what is the logic behind this. How can I get the same results removing this L?
It's not like one just woke up in the morning and "uhm, maybe if I add this L it will work", but I just couldn't find other examples of this (random letter added to a large number for calculations) googling.
long long mod_by = 24*1000000007L; is a problem.
The L suffix ensures the constant 1000000007 is at least of type long.1 Without the L, the type may have been int or long.
Yet since long may be only 32-bit, the ~36-bit product can readily overflow long math, leading to undefined behavior (UB).
The assignment to a long long happens afterwards and does not affect the type or range of the multiplication.
Code should use LL to form the product and protect against overflow.
long long mod_by = 24LL*1000000007;
// or
long long mod_by = 24*1000000007LL;
In general, make certain the calculation occurs at least with the width of the destination.
See also Why write 1,000,000,000 as 1000*1000*1000 in C?.
1 In OP's case, apparently int is 32-bit and long is 64-bit, so the code "worked" with an L and failed without it. Yet porting this code to an implementation with a 32-bit long, the code fails with or without an L. Ensure long long math with an LL constant.
The L suffix when applied to an integer constant means the constant has type long.
This is done to prevent overflow when the value is multiplied by 24. Without the suffix, two constants of type int are multiplied giving an int result. Assuming int is 32 bits, the result will overflow the range of an int (2^31 - 1), causing undefined behavior.
By making the constant have type long, assuming a long is 64 bits, it allows the multiplication to be done in that type and therefore not overflow, giving you the correct value.
L is a suffix that means the constant is of type long; otherwise it defaults to type int, which means the result of the operation 24 * 1000000007 will also be of type int.
As an int is usually 4 bytes in size, the result will overflow. This happens before it is assigned to mod_by, and for that reason it invokes undefined behavior.
The operands of the arithmetic operation are converted to the larger type, e.g.:
int * int = int
int * long = long
For the result of the operation to be of type long one of the operands must also be of type long.
Note that long is not guaranteed to be 8 bytes in size; the minimum allowed size for a long is 4 bytes, so you can invoke the same undefined behavior depending on the platform where you compile your program.
Using LL for long long will be more portable.
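To make the width issue concrete, here is a minimal sketch (the name mod_by is borrowed from the question) that forms the product in long long and prints it next to INT_MAX; the value 24000000168 clearly cannot fit in a 32-bit int:
#include <stdio.h>
#include <limits.h>

int main(void) {
    /* 24 * 1000000007 = 24000000168, which is larger than a 32-bit INT_MAX
       (2147483647), so the multiplication must be done in a 64-bit type. */
    long long mod_by = 24LL * 1000000007;  /* LL forces long long arithmetic */

    printf("INT_MAX = %d\n", INT_MAX);
    printf("mod_by  = %lld\n", mod_by);    /* prints 24000000168 */
    return 0;
}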

Should I use the stdint.h integer types on 32/64 bit machines?

One thing that bugs me about the regular C integer declarations is that their names are strange, "long long" being the worst. I am only building for 32 and 64 bit machines, so I do not necessarily need the portability that the library offers; however, I like that the name for each type is a single word of similar length with no ambiguity in size.
// multiple word types are hard to read
// long integers can be 32 or 64 bits depending on the machine
unsigned long int foo = 64;
long int bar = -64;
// easy to read
// no ambiguity
uint64_t foo = 64;
int64_t bar = -64;
On 32 and 64 bit machines:
1) Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
2) If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
for (int8_t i = 0; i < 10; i++) {
}
3) Whenever I use an integer that I know will never be negative, is it ok to prefer using the unsigned version even if I do not need the extra range it provides?
// instead of the one above
for (uint8_t i = 0; i < 10; i++) {
}
4) Is it safe to use a typedef for the types included from stdint.h
typedef int32_t signed_32_int;
typedef uint32_t unsigned_32_int;
edit: both answers were equally good and I couldn't really lean towards one so I just picked the answerer with lower rep
1) Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
Yes, it can be slower. Use int_fast16_t instead. Profile the code as needed. Performance is very implementation-dependent. A prime benefit of int16_t is its small, well-defined size (it must also be 2's complement) as used in structures and arrays, not so much its speed.
The typedef name int_fastN_t designates the fastest signed integer type with a width of at least N. C11 §7.20.1.3 2
2) If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
Yes, but the savings in code and speed are questionable. Suggest int instead. Emitted code tends to be optimal in speed/size with the native int size.
3) Whenever I use an integer that I know will never be negative, is it OK to prefer using the unsigned version even if I do not need the extra range it provides?
Using some unsigned type is preferred when the math is strictly unsigned (such as array indexing with size_t), yet code needs to watch for careless application like
for (unsigned i = 10 ; i >= 0; i--) // infinite loop
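A common way to write that count-down correctly with an unsigned index (a sketch, not part of the original answer) is to decrement inside the condition, so the loop sees i = 9 down to 0 and still terminates:
#include <stdio.h>

int main(void) {
    /* Visits i = 9, 8, ..., 0 and terminates correctly with an unsigned type:
       the decrement happens in the condition, so i never needs to go below 0. */
    for (unsigned i = 10; i-- > 0; ) {
        printf("%u\n", i);
    }
    return 0;
}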
4) Is it safe to use a typedef for the types included from stdint.h
Almost always. Types like int16_t are optional. Maximum portability uses the required types uint_least16_t and uint_fast16_t for code to run on rare platforms that use bit widths like 9, 18, etc.
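As a sketch of what the least/fast families look like in practice (the exact sizes reported will vary by implementation):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint_least16_t a = 40000;  /* smallest type with at least 16 value bits; always provided */
    uint_fast16_t  b = 40000;  /* "fastest" type with at least 16 bits; always provided */

    /* uint16_t itself is optional: it only exists where the platform has an
       exact 16-bit type with no padding bits. */
    printf("sizeof(uint_least16_t) = %zu\n", sizeof a);
    printf("sizeof(uint_fast16_t)  = %zu\n", sizeof b);
    return 0;
}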
Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
Yes. Some CPUs do not have dedicated 16-bit arithmetic instructions; arithmetic on 16-bit integers must be emulated with an instruction sequence along the lines of:
r1 = r2 + r3
r1 = r1 & 0xffff
The same principle applies to 8-bit types.
Use the "fast" integer types in <stdint.h> to avoid this -- for instance, int_fast16_t will give you an integer that is at least 16 bits wide, but may be wider if 16-bit types are nonoptimal.
If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
Don't bother; just use int. Using a narrower type doesn't actually save any space, and may cause you issues down the line if you decide to increase the number of iterations to over 127 and forget that the loop variable is using a narrow type.
Whenever I use an integer that I know will never be negative, is it ok to prefer using the unsigned version even if I do not need the extra range it provides?
Best avoided. Certain C idioms do not work properly on unsigned integers; for instance, you cannot write a loop of the form:
for (i = 100; i >= 0; i--) { … }
if i is an unsigned type, because i >= 0 will always be true!
Is it safe to use a typedef for the types included from stdint.h
Safe from a technical perspective, but it'll annoy other developers who have to work with your code.
Get used to the <stdint.h> names. They're standardized and reasonably easy to type.
Absolutely possible, yes. On my laptop (Intel Haswell), in a microbenchmark that counts up and down between 0 and 65535 on two registers 2 billion times, this takes
1.313660150s - ax dx (16-bit)
1.312484805s - eax edx (32-bit)
1.312270238s - rax rdx (64-bit)
Minuscule but repeatable differences in timing. (I wrote the benchmark in assembly, because C compilers may optimize it to a different register size.)
It will work, but you'll have to keep it up to date if you change the bounds, and the C compiler will probably optimize it to the same assembly code anyway.
As long as it's correct C, that's totally fine. Keep in mind that unsigned overflow is defined and signed overflow is undefined, and compilers do take advantage of that for optimization. For example,
void foo(int start, int count) {
    for (int i = start; i < start + count; i++) {
        // With unsigned arithmetic, this will execute 0 times if
        // "start + count" overflows to a number smaller than "start".
        // With signed arithmetic, that may happen, or the compiler
        // may assume this loop always runs "count" times.
        // For defined behavior, avoid signed overflow.
    }
}
Yes. Also, C99's inttypes.h extends stdint.h with some useful functions and format macros.
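For example (a small sketch; the exact expansions of the PRI macros differ between platforms), the format macros save you from guessing whether int64_t maps to long or long long:
#include <inttypes.h>   /* includes <stdint.h> and adds the format macros */
#include <stdio.h>

int main(void) {
    int64_t big = 9000000000;   /* does not fit in 32 bits */

    /* PRId64 expands to the correct printf conversion for int64_t
       ("ld" on some platforms, "lld" on others). */
    printf("big = %" PRId64 "\n", big);

    /* strtoimax (declared in <inttypes.h>) parses the widest integer type. */
    intmax_t parsed = strtoimax("123456789012345", NULL, 10);
    printf("parsed = %jd\n", parsed);
    return 0;
}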

Fibonacci Sequence in C generating negatives?

I'm new to programming and need help in C. I am writing a program to generate a Fibonacci sequence for values with up to 1000 digits.
Here is my code:
#include <stdio.h>

int main(void)
{
    int seq[1000];
    int i, n;

    printf("How many Fibonacci numbers do you want?: ");
    scanf("%d", &n);

    seq[0] = 0;
    seq[1] = 1;
    for (i = 2; i < n; i++)
        seq[i] = seq[i-1] + seq[i-2];

    for (i = 1; i < n; i++)
        printf("%d: %d\n", i, seq[i]);

    return 0;
}
Now the problem is, the numbers are all correct up until the 47th number. Then it just goes crazy: there are negative numbers and it's all wrong. Can anyone see the error in my code? Any help is greatly appreciated.
I am writing a program to generate a Fibonacci sequence for values with up to 1000 digits.
Not yet you aren't. You are storing the values in variables of type int. Commonly such variables are 32 bit values and have a maximum possible value of 2^31 - 1. That equals 2,147,483,647 which is some way short of your goal of reaching 1,000 digits.
The 47th Fibonacci number is the first number to exceed 2,147,483,647. According to Wolfram Alpha, the value is 2,971,215,073.
When your program attempts to calculate such a number it suffers from integer overflow, because the true value cannot be stored in an int. You could try to analyse exactly what happens when you overflow, why you see negative values, but it really doesn't get you very far. Simply put, what you are attempting is clearly impossible with int.
In order to reach 1,000 digits you need to use a big integer type. None of the built-in types can handle numbers as large as you intend to handle.
The comment I posted above has the simple answer, but here's a more complete version: C often represents integers with a sequence of 32 bits, and the range of values they can take on is from -2,147,483,648 to 2,147,483,647.
Notice what the 47th Fibonacci number is? 2,971,215,073
After they overflow, they wrap around to the smallest integer possible; see 2's complement notation for more information!
For a solution, I might suggest a BigInteger structure. But Fibonacci numbers get huge really fast, so I'm not sure you'd really want to calculate that many.
You are not using the correct data type; Fibonacci numbers tend to grow really fast, so you are probably going beyond the limit for int: 2^31 - 1.
Since int and long are both 32-bit integers in most cases (e.g. GCC and Visual Studio on common platforms), try using long long.
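For perspective, here is a minimal sketch (not taken from any answer above) of how far a plain 64-bit unsigned type gets you: F(93) is the last Fibonacci number that fits in unsigned long long, which is nowhere near 1000 digits, so a big-integer library (e.g. GMP) or a hand-rolled digit array is needed for the original goal.
#include <stdio.h>

int main(void) {
    /* Iteratively compute Fibonacci numbers in a 64-bit unsigned type.
       F(93) = 12200160415121876738 is the last value that fits in
       unsigned long long (at least 64 bits); F(94) would wrap around. */
    unsigned long long a = 0, b = 1;   /* F(0), F(1) */
    for (int i = 2; i <= 93; i++) {
        unsigned long long next = a + b;
        a = b;
        b = next;
    }
    printf("F(93) = %llu\n", b);
    return 0;
}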

How to find the largest prime factor of 600851475143?

#include <stdio.h>

main()
{
    long n = 600851475143;
    int i, j, flag;
    for (i = 2; i <= n/2; i++)
    {
        flag = 1;
        if (n % i == 0)  // finds factors backwards
        {
            for (j = 2; j <= (n/i)/2; j++)  // checks if factor is prime
            {
                if ((n/i) % j == 0)
                    flag = 0;
            }
            if (flag == 1)
            {
                printf("%d\n", n/i);  // displays largest prime factor and exits
                exit(0);
            }
        }
    }
}
The code above works for n = 6008514751. However, it doesn't work for n = 600851475143, even though that number still is within the range of a long.
What can I do to make it work?
One potential problem is that i and j are int, and could overflow for large n (assuming int is narrower than long, which it often is).
Another issue is that for n=600,851,475,143 your program does quite a lot of work (the largest factor is 6857). It is not unreasonable to expect it to take a long time to complete.
Use longs in place of ints. Better still, use uint64_t which has been defined since C99 (acknowledge Zaibis). It is a 64 bit unsigned integral type on all platforms. (The code as you have it will overflow on some platforms).
And now we need to get your algorithm working more quickly:
Your test for primality is inefficient; you don't need to iterate over all the numbers. Just iterate over primes, up to and including the square root of the number you're testing (not halfway, which is what you currently do).
Where do you get the primes from? Well, call your function recursively. Although in reality I'd be tempted to cache the primes up to, say, 65536.
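A hedged sketch of that advice (this variant divides each factor out instead of literally recursing over primes, which is simpler and finishes almost instantly for n = 600851475143, printing 6857):
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void) {
    uint64_t n = 600851475143ULL;   /* the suffix avoids overflow where long is 32-bit */
    uint64_t largest = 1;

    /* Divide out each factor as it is found; any d that divides what is left
       of n must be prime, because all smaller factors are already gone. */
    for (uint64_t d = 2; d * d <= n; d++) {   /* only need to test up to sqrt(n) */
        while (n % d == 0) {
            largest = d;
            n /= d;
        }
    }
    if (n > 1) {        /* whatever remains is itself a prime factor */
        largest = n;
    }
    printf("%" PRIu64 "\n", largest);
    return 0;
}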
From ISO/IEC 9899:TC3
5.2.4.2.1 Sizes of integer types
[...]
Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
[...]
— minimum value for an object of type long int
LONG_MIN -2147483647 // -(2^31 - 1)
— maximum value for an object of type long int
LONG_MAX +2147483647 // 2^31 - 1
EDIT:
Sorry I forgot to add what this should tell you.
The point is that long doesn't even need to be able to hold the value you mentioned: the standard only requires it to cover at least a 32-bit signed range, so it may well be that your machine can only hold values up to 2147483647 in a variable of type long.
On a 32-bit machine long typically ranges from -2,147,483,648 to 2,147,483,647, and on a 64-bit machine its range is typically -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (NOTE: this is not mandated by the C standard and may vary from one compiler to another).
As the OP said in a comment that he is on 32-bit, 600851475143 goes out of range, as it does not fit in the range of long.
Try changing n to long long int, and change i and j to long.
EDIT: define n like this:
long long int n = 600851475143LL;
LL is a suffix that forces the constant to have type long long.

Why is int rather than unsigned int used for C and C++ for loops?

This is a rather silly question but why is int commonly used instead of unsigned int when defining a for loop for an array in C or C++?
for (int i = 0; i < arraySize; i++) {}
for (unsigned int i = 0; i < arraySize; i++) {}
I recognize the benefits of using int when doing something other than array indexing and the benefits of an iterator when using C++ containers. Is it just because it does not matter when looping through an array? Or should I avoid it all together and use a different type such as size_t?
Using int is more correct from a logical point of view for indexing an array.
The unsigned semantics in C and C++ don't really mean "not negative"; they're more like "bitmask" or "modular integer".
To understand why unsigned is not a good type for a "non-negative" number please consider these totally absurd statements:
Adding a possibly negative integer to a non-negative integer you get a non-negative integer
The difference of two non-negative integers is always a non-negative integer
Multiplying a non-negative integer by a negative integer you get a non-negative result
Obviously none of the above statements makes any sense... but it's how C and C++ unsigned semantics indeed work.
Actually using an unsigned type for the size of containers is a design mistake of C++ and unfortunately we're now doomed to use this wrong choice forever (for backward compatibility). You may like the name "unsigned" because it's similar to "non-negative" but the name is irrelevant and what counts is the semantic... and unsigned is very far from "non-negative".
For this reason when coding most loops on vectors my personally preferred form is:
for (int i=0, n=v.size(); i<n; i++) {
    ...
}
(of course assuming the size of the vector is not changing during the iteration and that I actually need the index in the body as otherwise the for (auto& x : v)... is better).
This running away from unsigned as soon as possible and using plain integers has the advantage of avoiding the traps that are a consequence of the unsigned size_t design mistake. For example, consider:
// draw lines connecting the dots
for (size_t i=0; i<pts.size()-1; i++) {
    drawLine(pts[i], pts[i+1]);
}
the code above will have problems if the pts vector is empty because pts.size()-1 is a huge nonsense number in that case. Dealing with expressions where a < b-1 is not the same as a+1 < b even for commonly used values is like dancing in a minefield.
Historically the justification for having size_t unsigned is for being able to use the extra bit for the values, e.g. being able to have 65535 elements in arrays instead of just 32767 on 16-bit platforms. In my opinion even at that time the extra cost of this wrong semantic choice was not worth the gain (and if 32767 elements are not enough now then 65535 won't be enough for long anyway).
Unsigned values are great and very useful, but NOT for representing container size or for indexes; for size and index, regular signed integers work much better because the semantics are what you would expect.
Unsigned values are the ideal type when you need the modulo arithmetic property or when you want to work at the bit level.
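A small sketch of that modular behavior (well defined for unsigned types, and exactly the trap behind pts.size()-1 on an empty vector):
#include <stdio.h>
#include <stddef.h>

int main(void) {
    unsigned int u = 0u;
    u = u - 1;          /* well defined: wraps modulo 2^N, giving UINT_MAX */
    printf("0u - 1 = %u\n", u);

    size_t n = 0;                     /* e.g. the size of an empty container */
    printf("n - 1  = %zu\n", n - 1);  /* a huge value, not -1 */
    return 0;
}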
This is a more general phenomenon: often people don't use the correct types for their integers. Modern C has semantic typedefs that are much preferable to the primitive integer types. E.g. everything that is a "size" should just be typed as size_t. If you use the semantic types systematically for your application variables, loop variables come much more easily with these types, too.
And I have seen several bugs that were difficult to detect that came from using int or so: code that all of a sudden crashed on large matrices and the like. Just coding correctly with correct types avoids that.
It's purely laziness and ignorance. You should always use the right types for indices, and unless you have further information that restricts the range of possible indices, size_t is the right type.
Of course if the dimension was read from a single-byte field in a file, then you know it's in the range 0-255, and int would be a perfectly reasonable index type. Likewise, int would be okay if you're looping a fixed number of times, like 0 to 99. But there's still another reason not to use int: if you use i%2 in your loop body to treat even/odd indices differently, i%2 is a lot more expensive when i is signed than when i is unsigned...
Not much difference. One benefit of int is that it is signed. Thus int i < 0 makes sense, while unsigned i < 0 doesn't.
If indexes are calculated, that may be beneficial (for example, you might get cases where you will never enter a loop if some result is negative).
And yes, it is less to write :-)
Using int to index an array is legacy, but still widely adopted. int is just a generic number type and does not correspond to the addressing capabilities of the platform. In case it happens to be shorter or longer than that, you may encounter strange results when trying to index a very large array that goes beyond its range.
On modern platforms, off_t, ptrdiff_t and size_t guarantee much more portability.
Another advantage of these types is that they give context to someone who reads the code. When you see the above types you know that the code will do array subscripting or pointer arithmetic, not just any calculation.
So, if you want to write bullet-proof, portable and context-sensible code, you can do it at the expense of a few keystrokes.
GCC even supports a typeof extension which relieves you from typing the same typename all over the place:
typeof(arraySize) i;
for (i = 0; i < arraySize; i++) {
...
}
Then, if you change the type of arraySize, the type of i changes automatically.
It really depends on the coder. Some coders prefer type perfectionism, so they'll use whatever type they're comparing against. For example, if they're iterating through a C string, you might see:
size_t sz = strlen("hello");
for (size_t i = 0; i < sz; i++) {
...
}
While if they're just doing something 10 times, you'll probably still see int:
for (int i = 0; i < 10; i++) {
...
}
I use int because it requires less physical typing and it doesn't matter - they take up the same amount of space, and unless your array has a few billion elements you won't overflow, as long as you're not using a 16-bit compiler, which I'm usually not.
Because unless you have an array with a size bigger than two gigabytes of type char, or 4 gigabytes of type short, or 8 gigabytes of type int, etc., it doesn't really matter if the variable is signed or not.
So, why type more when you can type less?
Aside from the issue that it's shorter to type, the reason is that it allows negative numbers.
Since we can't say in advance whether a value can ever be negative, most functions that take integer arguments take the signed variety. Since most functions use signed integers, it is often less work to use signed integers for things like loops. Otherwise, you have the potential of having to add a bunch of typecasts.
As we move to 64-bit platforms, the non-negative range of a signed integer should be more than enough for most purposes. In these cases, there's not much reason not to use a signed integer.
Consider the following simple example:
int max = some_user_input; // or some_calculation_result
for (unsigned int i = 0; i < max; ++i)
    do_something;
If max happens to be a negative value, say -1, the -1 will be regarded as UINT_MAX (when two integers with the same rank but different signedness are compared, the signed one is converted to unsigned). On the other hand, the following code would not have this issue:
int max = some_user_input;
for (int i = 0; i < max; ++i)
    do_something;
Given a negative max input, the loop will be safely skipped.
Using a signed int is - in most cases - a mistake that could easily result in potential bugs as well as undefined behavior.
Using size_t typically matches the system's word size (64 bits on 64-bit systems and 32 bits on 32-bit systems), always allowing the correct range for the loop and minimizing the risk of an integer overflow.
The int recommendation comes to solve an issue where reverse for loops were often written incorrectly by inexperienced programmers (of course, int might not have the correct range for the loop):
/* a correct reverse for loop */
for (size_t i = count; i > 0;) {
    --i; /* note that this is not part of the `for` statement */
    /* code for the loop, where i is the zero-based `index` */
}

/* an incorrect reverse for loop (bug when count == 0) */
for (size_t i = count - 1; i > 0; --i) {
    /* i might have wrapped around to SIZE_MAX, and indexing with it
       causes undefined behavior */
}
In general, signed and unsigned variables shouldn't be mixed together, so at times using an int is unavoidable. However, the correct type for a for loop is, as a rule, size_t.
There's a nice talk about this misconception that signed variables are better than unsigned variables, you can find it on YouTube (Signed Integers Considered Harmful by Robert Seacord).
TL;DR: Signed variables are more dangerous and require more code than unsigned variables (which should be preferred in almost all cases, and definitely whenever negative values aren't logically expected).
With unsigned variables the only concern is the overflow boundary which has a strictly defined behavior (wrap-around) and uses clearly defined modular mathematics.
This allows a single edge case test to catch an overflow and that test can be performed after the mathematical operation was executed.
However, with signed variables the overflow behavior is undefined (UB) and the negative range is actually larger than the positive range - things that add edge cases that must be tested for and explicitly handled before the mathematical operation can be executed.
i.e., what is INT_MIN * -1? (For a literal constant expression the compiler will typically warn you, but at runtime you're in a jam.)
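A sketch of that asymmetry, using hypothetical helper functions (not from the talk): the unsigned check can run after the addition, while the signed check has to run before it:
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Unsigned addition: wrap-around is well defined, so overflow can be
   detected after the fact. */
static bool uadd_overflows(unsigned a, unsigned b, unsigned *sum) {
    *sum = a + b;        /* always defined */
    return *sum < a;     /* wrapped exactly when the result shrank */
}

/* Signed addition: overflow is undefined behavior, so the test must be
   done before performing the operation. */
static bool sadd_overflows(int a, int b, int *sum) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return true;     /* a + b would overflow; never compute it */
    }
    *sum = a + b;
    return false;
}

int main(void) {
    unsigned us; int ss;
    printf("%d\n", uadd_overflows(4000000000u, 500000000u, &us)); /* 1 where unsigned is 32-bit */
    printf("%d\n", sadd_overflows(INT_MAX, 1, &ss));              /* 1 */
    return 0;
}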
P.S.
As for the example offered by #6502 in their answer, the whole thing is again an issue of trying to cut corners and a simple missing if statement.
When a loop assumes at least 2 elements in an array, this assumption should be tested beforehand. i.e.:
// draw lines connecting the dots - forward loop
if (pts.size() > 1) { // first make sure there's enough dots
    for (size_t i = 0; i < pts.size()-1; i++) { // then loop
        drawLine(pts[i], pts[i+1]);
    }
}

// or test against i + 1 : which tests the desired pts[i+1]
for (size_t i = 0; i + 1 < pts.size(); i++) { // then loop
    drawLine(pts[i], pts[i+1]);
}

// or start i as 1 : but note that `-` is slower than `+`
for (size_t i = 1; i < pts.size(); i++) { // then loop
    drawLine(pts[i - 1], pts[i]);
}
