How to interpret results for a moderation model? - logistic-regression

I have a moderation model, but the results look a bit weird.
A is the DV, a binary variable.
B is the moderator (M), also a binary variable.
C is the IV, a continuous variable.
If I regress A only on B, the coefficient on B is positive and significant (e.g., 0.1418). When I add C and the interaction term of B and C to the model, the coefficient on B is still significantly positive, but smaller (0.1376), and the interaction term between B and C is significantly positive (0.0222).
Can I explain the results in this way?
The higher C is, the more likely A is to occur. However, this tendency is weaker when M is present.
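For reference, a minimal sketch of the model this setup implies, in standard notation (the β names are mine, not from your output):

    \operatorname{logit} \Pr(A = 1) \;=\; \beta_0 + \beta_1 B + \beta_2 C + \beta_3 (B \times C)

Here the slope of C on the log-odds scale is β2 when B = 0 and β2 + β3 when B = 1, so a positive interaction coefficient means the estimated association between C and A is stronger, not weaker, when the moderator is present; the coefficient on B alone (0.1376) is the effect of B at C = 0.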


Optimal frequency of modulo operation in finite field arithmetic implementation

I'm trying to implement finite field arithmetic to use it in Elliptic Curve calculations. Since all that's ever used are arithmetic operations that commute with the modulo operator, I don't see a reason not to delay that operation till the very end. One thing that may happen is that the numbers involved might become (way) too big and impractical/inefficient to work with, but I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.
I'm coding in C.
To avoid the complexity of elliptic curve crypto (as I'm unfamiliar with its algorithms), let's assume you're doing temp = (a * b) % M; result = (temp * c) % M, and you're thinking about just doing result = (a * b * c) % M instead.
Let's also assume that you're doing this a lot with the same modulo M; so you've precomputed "multiples of M" lookup tables, so that your modulo code can use the table to find the highest multiple of "M shifted left by N" that is not greater than the dividend and subtract it from the dividend, and repeat that with decreasing values of N until you're left with the remainder.
If your lookup table has 256 entries, the dividend is 4096 bits and the divisor is 2048 bits, then you'd reduce the size of the dividend by 8 bits per iteration, so the dividend would become smaller than the divisor (and you'd have the remainder) after no more than 256 "search and subtract" operations.
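As a toy illustration of that "search and subtract" loop, here is a 64-bit sketch (assumptions: plain machine words standing in for multi-precision numbers, and a shift search in place of the lookup table):

    #include <stdint.h>

    /* Reduce x modulo m (m > 0) by repeatedly subtracting the largest
       power-of-two multiple of m that still fits: binary "search and
       subtract" long division, keeping only the remainder. */
    static uint64_t mod_by_subtraction(uint64_t x, uint64_t m)
    {
        while (x >= m) {
            uint64_t t = m;
            while (t <= x >> 1)   /* while 2*t <= x, without overflow */
                t <<= 1;
            x -= t;               /* subtract the highest multiple found */
        }
        return x;
    }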
For multiplication, it's almost purely "multiply and add digits" for each pair of digits. E.g., using uint64_t as a digit, multiplying 2048-bit numbers means multiplying 32-digit numbers, which involves 32 * 32 = 1024 of those "multiply and add digits" operations.
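For concreteness, a sketch of that digit-by-digit schoolbook multiply with uint64_t digits (the function name is mine; uses the GCC/Clang __uint128_t extension for the double-width product):

    #include <stdint.h>

    /* Multiply two n-digit numbers (uint64_t digits, least significant
       first) into a 2n-digit result: exactly n*n "multiply and add
       digits" operations, e.g. 32*32 = 1024 for 2048-bit operands. */
    static void bigmul(const uint64_t *a, const uint64_t *b,
                       uint64_t *r, int n)
    {
        for (int i = 0; i < 2 * n; i++)
            r[i] = 0;
        for (int i = 0; i < n; i++) {
            uint64_t carry = 0;
            for (int j = 0; j < n; j++) {
                __uint128_t t = (__uint128_t)a[i] * b[j] + r[i + j] + carry;
                r[i + j] = (uint64_t)t;        /* low 64 bits stay here  */
                carry = (uint64_t)(t >> 64);   /* high 64 bits carry on  */
            }
            r[i + n] = carry;
        }
    }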
Now we can make comparisons. Specifically, assuming a, b, c, M are 2048-bit numbers:
a) the original temp = (a * b) % M; result = (temp * c) % M would be 1024 "multiply and add", then 256 "search and subtract", then 1024 "multiply and add", then 256 "search and subtract". For totals it'd be 2048 "multiply and add" and 512 "search and subtract".
b) the proposed result = (a * b * c) % M would be 1024 "multiply and add", then 2048 "multiply and add" (as the result of a*b is a twice-as-big, 4096-bit number), then 512 "search and subtract" (as a*b*c will be twice as big as a*b). For totals it'd be 3072 "multiply and add" and 512 "search and subtract".
In other words (under all of these assumptions), the proposed result = (a * b * c) % M would be worse, with 50% more "multiply and add" and exactly the same number of "search and subtract" operations.
Of course none of this (the operations you need for elliptic curve crypto, the sizes of your variables, etc.) can be assumed to apply to your specific case.
I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.
Yes; the way to determine the optimal conditions/frequency is to do something similar to what I did above: determine the true costs (in terms of lower-level operations, like my "search and subtract" and "multiply and add") and compare them.
In general (regardless of how modulo is implemented, etc.) I'd expect you'll find that doing modulo as often as possible is the fastest option (as it reduces the cost of multiplications and also the cost of the later/final modulo) for all cases that don't involve addition or subtraction and that don't fit in simple integers.
If M is a constant, then an alternative to modulo is to multiply by the logical inverse of M. Following Polk's comment about 256 bits being a common case, and assuming M is a polynomial of degree 256 with 1-bit coefficients, define the inverse of M to be x^512 / M, which results in a 256-bit "inverse". Name this inverse I. Then for a multiply modulo M:
C = A * B                            ; 512-bit product
Q = (upper 256 bits of C * I) >> 256 ; Q = C / M = 256-bit quotient
P = M * Q                            ; 512-bit product
R = lower 256 bits of (C xor P)      ; (A * B) % M
So this requires 3 extended-precision multiplies and one xor.
If the processor for this code has a carryless multiply, such as X86 PCLMULQDQ, which multiplies two 64-bit operands to produce a 128-bit result, then that could be used as the basis for an extended-precision multiply. A basic implementation would need 16 multiplies for a 256-bit by 256-bit multiply producing a 512-bit product. This could be improved using something like Karatsuba:
https://en.wikipedia.org/wiki/Karatsuba_algorithm
but on current X86, PCLMULQDQ is fast, taking 1 to 3 cycles, so the main issue would be loading the data into the XMM registers, and I'm not sure Karatsuba would save much time.
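For reference, the 64x64-to-128-bit carryless multiply building block via the compiler intrinsic (GCC/Clang, compile with -mpclmul; the wrapper name is mine):

    #include <stdint.h>
    #include <wmmintrin.h>   /* _mm_clmulepi64_si128 (PCLMULQDQ) */

    /* Carryless (GF(2)[x]) multiply of two 64-bit polynomials,
       giving a 128-bit product; chain 16 of these (or fewer, with
       Karatsuba) for a 256-bit by 256-bit multiply. */
    static __m128i clmul64(uint64_t a, uint64_t b)
    {
        __m128i va = _mm_set_epi64x(0, (long long)a);
        __m128i vb = _mm_set_epi64x(0, (long long)b);
        return _mm_clmulepi64_si128(va, vb, 0x00);  /* low lane * low lane */
    }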
optimal conditions/frequency which should trigger a modulo operation in the calculations
Standard practice is to replace all actual modulo operations with something else. So the frequency is never. There are different ways to accomplish that:
Choose the modulus to be a Mersenne prime or pseudo-Mersenne prime. There is a large repertoire of mathematical tricks to implement arithmetic modulo a (pseudo-)Mersenne prime efficiently, without doing any actual modulo operations. In the context of elliptic curves, the prime-modulus NIST curves are chosen this way and for this reason.
Use Barrett reduction (a minimal sketch appears after this answer). This has the same effect as a real modulo operation, but relies on some precomputation and a precondition on the range of the input to reduce the cost of a modulo-like operation to the cost of a couple of multiplications (plus some supporting operations). Also applicable to polynomial fields.
Do arithmetic in Montgomery form.
Additionally, and perhaps more in the spirit of your question, a common technique is to do various additions without reducing every time (addition does not significantly change the size of a number). It takes a lot of additions before you need an extra limb in your integers, so a lot of them can be done before it starts to make sense to reduce. For multiplications, unless it's by a small constant it almost always makes sense to reduce immediately afterwards to prevent the numbers from getting much physically larger than they need to be (which would be especially bad if the result was fed into another multiplication).
Another technique especially associated with Barrett reductions is to work, most of the time, in a slightly larger range than [0 .. N), e.g. [0 .. 2N). This enables skipping the conditional subtraction that Barrett reduction needs in order to fully reduce to the range [0 .. N), while still using the most important part, the reduction from the range [0 .. N²) to the range [0 .. 2N).
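To make the Barrett option concrete, here is a toy single-limb sketch (assumptions: a 32-bit modulus m that is not a power of two, inputs x < 2^64, and the GCC/Clang __int128 extension; real ECC code does the same with multi-precision limbs):

    #include <stdint.h>

    typedef unsigned __int128 u128;

    /* Precompute mu = floor(2^64 / m) once per modulus; for m not a
       power of two this equals (2^64 - 1) / m. */
    static uint64_t barrett_mu(uint32_t m)
    {
        return UINT64_MAX / m;
    }

    /* x mod m with two multiplies and one conditional subtract, no
       division at runtime. The quotient estimate q is off by at most 1,
       so r lands in [0 .. 2m) before the final correction. */
    static uint32_t barrett_mod(uint64_t x, uint32_t m, uint64_t mu)
    {
        uint64_t q = (uint64_t)(((u128)x * mu) >> 64);  /* q ~ x / m */
        uint64_t r = x - q * (uint64_t)m;
        return (uint32_t)(r >= m ? r - m : r);
    }

Returning r without the final conditional subtract is exactly the "work in [0 .. 2N)" trick described above.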

What is the answer for this computational analysis problem?

Two algorithms compute the same function; algorithm A has computational complexity O(2^N) and algorithm B has computational complexity O(N^10). Suppose a real computer can run continuously for 10^7 seconds, performing 10^3 basic operations per second.
In this computer environment, please answer the following questions.
A) What is the approximate range of N for algorithms A and B, respectively?
B) Which algorithm is more suitable in the environment? Why?
The question is defective.
The fact that A has complexity O(2^N) means (presumably modeling each basic operation as taking the same amount of time) that A takes at most some constant times 2^N steps for N at least some threshold N0. Similarly, the fact that B has complexity O(N^10) means B takes at most some constant times N^10 steps for N at least some threshold N1. However, they may be different constants; the number of steps for A is at most C0·2^N and the number of steps for B is at most C1·N^10, and they may have different thresholds N0 and N1.
In asking about a computer that can perform 10^3 basic operations per second for 10^7 seconds, the question asks for which N the number of steps of A or B is known to be at most 10^10. In other words, it asks to solve for N in C0·2^N ≤ 10^10 and in C1·N^10 ≤ 10^10.
These are clearly unsolvable without knowing C0 and C1, about which the question gives no information.
Further, we do not know the thresholds N0 and N1 where these bounds are known to apply. So even if we knew C0 and C1, we would not know any bound on how many steps the algorithms take for any particular N.
The question is also defective in that it neglects that the O notation puts only an upper bound on the algorithm. The algorithm may run in fewer steps than the values of the formulae. So it may be that, even with N for which C0·2^N ≤ C1·N^10, algorithm B is better, or vice versa.
Possibly some simplifying assumptions are intended, such as C0 = C1 = 1, N0 = N1 = 0, and each algorithm taking exactly the number of steps of its formula. Then it is easy to solve 2^N ≤ 10^10 (N is at most about 33.22) and N^10 ≤ 10^10 (N ≤ 10). However, if these assumptions are intended, then the author has missed the point of O notation: it characterizes a fundamental nature of an algorithm; it does not quantify its actual number of steps.
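Under those simplifying assumptions, the arithmetic is a one-liner each way; a quick check (the 10^10 budget is 10^3 operations/second times 10^7 seconds):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double budget = 1e10;                        /* 10^3 ops/s * 10^7 s  */
        printf("A: N <= %.2f\n", log2(budget));      /* solves 2^N  <= 10^10 */
        printf("B: N <= %.2f\n", pow(budget, 0.1));  /* solves N^10 <= 10^10 */
        return 0;
    }

This prints N ≤ 33.22 for A and N ≤ 10.00 for B.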

How do you find the largest subset of an array of integers that xor to zero

To clarify, the largest subset of the array: [0,1,4,5,6,8] that xors to 0 would be [0,1,4,5] since 0^1^4^5=0 (where ^ is xor). I know this can be done in exponential time by brute force, but I'd like to know what the lower bound is, and what algorithm solves it in that time.
I'm trying to implement the rational sieve algorithm. Beyond the Wikipedia article, resources on the algorithm are fairly scarce. To complete the rational sieve you attempt to find a subset of a group of arrays such that, when adding up corresponding elements, the resulting array has only even numbers. For example:
[2,3,4,5] + [4,3,4,3] = [6,6,8,8]. This would be a valid solution, provided that these arrays exist in the larger set.
According to the Wikipedia article, this can be solved using linear algebra, but I don't know enough linear algebra to solve it.
For the purpose of the algorithm, an empty subset isn't useful.
I simplified the problem by saying that the arrays can only have 0s and 1s, and by packing each array into a single number so that the sum can be computed with a single operator, but otherwise it is the same problem.
Yes, it can be formulated as a linear optimization problem. Assuming the integers are k bits and there are n of them, you can represent them as a k × n matrix A, where the columns represent the integers, and row r of column i is the r-th bit of integer i.
Then the selection and XOR-ing of integers can be represented as A·x, where x is a vector of size n that has 1s at the positions of the selected integers. This has to be over GF(2), so multiplication is the standard one and addition is XOR. So you are solving: maximize |x| subject to A·x = 0.
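As a sketch of the linear-algebra step, here is Gaussian elimination over GF(2) (names are mine; assumptions: at most 64 integers of at most 64 bits each, and the GCC/Clang __builtin_clzll builtin). Note it finds one nonempty subset whose XOR is zero, which is what the sieve step needs; maximizing the subset size is a harder problem than finding a dependency:

    #include <stdint.h>

    /* Returns 1 and sets *subset_mask (bit j set => a[j] selected) if a
       nonempty subset of a[0..n-1] XORs to zero; returns 0 if the inputs
       are linearly independent over GF(2). */
    static int find_xor_zero_subset(const uint64_t *a, int n,
                                    uint64_t *subset_mask)
    {
        uint64_t basis_val[64] = {0};  /* pivot rows, indexed by top bit */
        uint64_t basis_who[64] = {0};  /* which inputs XOR to that pivot */
        for (int i = 0; i < n; i++) {
            uint64_t v = a[i];
            uint64_t w = (uint64_t)1 << i;
            while (v) {
                int b = 63 - __builtin_clzll(v);   /* highest set bit of v */
                if (!basis_val[b]) {               /* new pivot: store it  */
                    basis_val[b] = v;
                    basis_who[b] = w;
                    break;
                }
                v ^= basis_val[b];                 /* eliminate bit b      */
                w ^= basis_who[b];
            }
            if (!v) {            /* a[i] reduced to zero: dependency found */
                *subset_mask = w;
                return 1;
            }
        }
        return 0;
    }

On [0,1,4,5,6,8] this returns immediately with {0}, since 0 alone XORs to zero; with the 0 removed it finds {1,4,5}.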

Is there any fast algorithm to compute log2 for numbers that are all power of 2?

Is there any fast algorithm to compute log2 for numbers that are all powers of 2, e.g.:
log2(1), log2(2), log2(4), log2(1024), log2(4096)...
I'm considering using it to implement bit set iteration.
Assuming you know the number must be a power of 2: in binary, it is a 1 followed by n 0s, where n is the number you are looking for.
If you are using gcc or clang, you can use a builtin function:
— Built-in Function: int __builtin_ctz (unsigned int x)
Returns the number of trailing 0-bits in x, starting at the least
significant bit position. If x is 0, the result is undefined.
For a pure C implementation, it is already answered here:
Finding trailing 0s in a binary number
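For example (gcc/clang; the result is undefined for x == 0):

    #include <stdio.h>

    int main(void)
    {
        unsigned int x = 4096;                           /* a power of two */
        printf("log2(%u) = %d\n", x, __builtin_ctz(x));  /* prints 12      */
        return 0;
    }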
Three more algorithms that are possibly efficient in theory, in addition to the ones already given or linked (n is the number of bits, N = 2^n):
Big LUT: one lookup
Simple binary search: log2(n) comparisons
LUT[N % k] with k-position LUT: one modulo, one lookup (k=37 for 32-bit and 67 for 64-bit numbers)
In practice, #1 is great with small n, and #2 may be fastest on certain hardware (something without a fast multiplier), but the code looks ugly. #3 probably never beats the De Bruijn approach on a real machine, but it has fewer operations.
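A sketch of #3 for 32-bit values (the residues 2^n mod 37 are all distinct for n = 0..31, so a 37-entry table recovers n; the table is filled at first use here for clarity, and the function name is mine):

    #include <stdint.h>

    /* One modulo, one lookup; x must be a power of two. */
    static int log2_pow2(uint32_t x)
    {
        static int lut[37];
        static int ready = 0;
        if (!ready) {
            for (int n = 0; n < 32; n++)
                lut[((uint32_t)1 << n) % 37] = n;
            ready = 1;
        }
        return lut[x % 37];
    }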

Fast inverse square of double in C/C++

Recently I was profiling a program in which the hotspot is definitely this
double d = somevalue();
double d2 = d * d;
double c = 1.0 / d2;  // HOT SPOT
The value d2 is not used afterwards, because I only need the value c. Some time ago I read about the Carmack method of fast inverse square root; this is obviously not the same case, but I'm wondering if a similar algorithm can help me compute 1/x^2.
I need quite high precision: I've checked that my program doesn't give correct results with gcc's -ffast-math option (g++-4.5).
The tricks for doing fast square roots and the like get their performance by sacrificing precision. (Well, most of them.)
Are you sure you need double precision? You can sacrifice precision easily enough:
double d = somevalue();
float c = 1.0f / ((float) d * (float) d);
The 1.0f is absolutely mandatory in this case; if you use 1.0 instead, you will get double precision.
Have you tried enabling "sloppy" math on your compiler? On GCC you can use -ffast-math, there are similar options for other compilers. The sloppy math may be more than good enough for your application. (Edit: I did not see any difference in the resulting assembly.)
If you are using GCC, have you considered using -mrecip? There is a "reciprocal estimate" function which only has about 12 bits of precision, but it is much faster. You can use the Newton-Raphson method to increase the precision of the result. The -mrecip option will cause the compiler to automatically generate the reciprocal estimate and Newton-Raphson steps for you, although you can always write the assembly yourself if you want to fine tune the performance-precision trade-off. (Newton-Raphson converges very quickly.) (Edit: I was unable to get GCC to generate RCPSS. See below.)
I found a blog post (source) discussing the exact problem you are going through, and the author's conclusion is that the techniques like the Carmack method are not competitive with the RCPSS instruction (which the -mrecip flag on GCC uses).
The reason why division can be so slow is because processors generally only have one division unit and it's often not pipelined. So, you can have a few multiplications in the pipe all executing simultaneously, but no division can be issued until the previous division finishes.
Tricks that don't work
Carmack's method: It is obsolete on modern processors, which have reciprocal estimation opcodes. For reciprocals, the best version I've seen only gives one bit of precision -- nothing compared to the 12 bits of RCPSS. I think it is a coincidence that the trick works so well for reciprocal square roots; a coincidence that is unlikely to be repeated.
Relabeling variables. As far as the compiler is concerned, there is very little difference between 1.0/(x*x) and double x2 = x*x; 1.0/x2. I would be surprised if you found a compiler that generates different code for the two versions with optimizations turned on even to the lowest level.
Using pow. The pow library function is a total monster. With GCC's -ffast-math turned off, the library call is fairly expensive. With GCC's -ffast-math turned on, you get the exact same assembly code for pow(x, -2) as you do for 1.0/(x*x), so there is no benefit.
Update
Here is an example of a Newton-Raphson approximation for the inverse square of a double-precision floating-point value.
#define RECIP_ITER 1   /* number of Newton-Raphson refinement steps */

static double invsq(double x)
{
    double y;
    int i;
    /* Get a ~12-bit reciprocal estimate of x with RCPSS, by
       converting to single precision and back. */
    __asm__ (
        "cvtpd2ps %1, %0\n\t"
        "rcpss %0, %0\n\t"
        "cvtps2pd %0, %0"
        : "=x"(y)
        : "x"(x));
    /* Newton-Raphson for the reciprocal: y <- y * (2 - x*y). */
    for (i = 0; i < RECIP_ITER; ++i)
        y *= 2 - x * y;
    return y * y;   /* (1/x)^2 = 1/x^2 */
}
Unfortunately, with RECIP_ITER=1 benchmarks on my computer put it slightly slower (~5%) than the simple version 1.0/(x*x). It's faster (2x as fast) with zero iterations, but then you only get 12 bits of precision. I don't know if 12 bits is enough for you.
I think one of the problems here is that this is too small of a micro-optimization; at this scale the compiler writers are on nearly equal footing with the assembly hackers. Maybe if we had the bigger picture we could see a way to make it faster.
For example, you said that -ffast-math caused an undesirable loss of precision; this may indicate a numerical stability problem in the algorithm you are using. With the right choice of algorithm, many problems can be solved with float instead of double. (Of course, you may just need more than 24 bits. I don't know.)
I suspect the RCPSS method shines if you want to compute several of these in parallel.
Yes, you can certainly try and work something out. Let me just give you some general ideas, you can fill in the details.
First, let's see why Carmack's root works:
We write x = M × 2^E in the usual way. Now recall that the IEEE float stores the exponent offset by a bias: if e denotes the exponent field, we have e = Bias + E ≥ 0. Rearranging, we get E = e − Bias.
Now for the inverse square root: x^(−1/2) = M^(−1/2) × 2^(−E/2). The new exponent field is:
e' = Bias − E/2 = (3/2) Bias − e/2
With bit fiddling, we can get the value e/2 from e by shifting, and (3/2) Bias is just a constant.
Moreover, the mantissa M is stored as 1.0 + x with x < 1, and we can approximate M^(−1/2) as 1 − x/2. Again, the fact that only x is stored in binary means that we get the division by two by simple bit shifting.
Now we look at x^(−2): this is equal to M^(−2) × 2^(−2E), and we are looking for an exponent field:
e' = Bias − 2E = 3 Bias − 2e
Again, 3 Bias is just a constant, and you can get 2e from e by bit shifting. As for the mantissa, you can approximate (1 + x)^(−2) by 1 − 2x, and so the problem reduces to obtaining 2x from x.
Note that Carmack's magic floating-point fiddling doesn't actually compute the result right away: rather, it produces a remarkably accurate estimate, which is used as the starting point for a traditional, iterative computation. But because the estimate is so good, you only need very few rounds of subsequent iteration to get an acceptable result.
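To make that concrete, here is a rough sketch of the x^(−2) bit trick for doubles (the constant and function name are mine: 0xBFD0000000000000 is 3 · Bias = 3 · 1023 placed in the exponent field; it assumes x is positive and of moderate magnitude, and the result is only a starting guess to be refined with Newton-Raphson, as with Carmack's constant):

    #include <stdint.h>
    #include <string.h>

    /* Estimate 1/(x*x) by forming e' = 3*Bias - 2e directly on the raw
       bits; the doubled mantissa bits that spill during the shift and
       subtract provide a crude first-order mantissa correction. */
    static double invsq_estimate(double x)
    {
        uint64_t i;
        double y;
        memcpy(&i, &x, sizeof i);                   /* portable type-pun     */
        i = 0xBFD0000000000000ull - (i << 1);       /* exponent: 3*Bias - 2e */
        memcpy(&y, &i, sizeof y);
        return y;
    }

For powers of two the estimate is exact (invsq_estimate(4.0) gives 0.0625); for other inputs it is within roughly 10–20% of 1/x², good enough as a Newton-Raphson starting point.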
For your current program you have identified the hotspot - good. As an alternative to speeding up 1/d^2, you have the option of changing the program so that it does not compute 1/d^2 so often. Can you hoist it out of an inner loop? For how many different values of d do you compute 1/d^2? Could you precompute all the values you need and then look up the results? This is a bit cumbersome for 1/d^2 on its own, but if 1/d^2 is part of some larger chunk of code, it might be worthwhile applying this trick to that. You say that if you lower the precision, you don't get good enough answers. Is there any way you can rephrase the code so that it behaves better numerically? Numerical analysis is subtle enough that it is worth trying a few things and seeing what happens.
Ideally, of course, you would find some optimised routine that draws on years of research - is there anything in lapack or linpack that you could link to?
