Regression loop and store coefficients - loops

I want to (1) loop a regression over a certain criterion many times, and (2) store a particular coefficient from each regression. Here is an example:
clear
sysuse auto.dta
local x = 2000
while `x' < 5000 {
xi: regress price mpg length gear_ratio i.foreign if weight < `x'
est sto model_`x'
local x = `x' + 100
}
est dir
I just care about one predictor, say mpg here. I want to extract the coefficient on mpg from each result into one separate file (any file format is OK; .dta would be great) to see whether there is a trend as the threshold for weight increases. What I am doing now is to use estout to export the results, something like:
esttab * using test.rtf, replace se stats(r2_a N, labels(R-squared)) star(* 0.10 ** 0.05 *** 0.01) nogap onecell title(regression tables)
estout exports everything, and I then have to edit the results by hand. This works well for regressions with few predictors, but my real dataset has more than 30 variables and the regression will loop at least 100 times (I have a variable Distance with a range from 0 to 30,000; it plays the role of weight in the example). It is therefore really difficult for me to edit the results without making mistakes.
Is there a more efficient way to solve my problem? Since my case loops over a threshold criterion rather than over a group variable, the statsby command does not seem to fit well here.

As #Todd has already suggested, you can just choose the particular results you care about and use postfile to store them as new variables in a new dataset. Note that a forval loop is more direct than your while code, and that xi: is superseded by factor-variable notation in recent versions of Stata. (I have not changed that, just in case you are using an older version.) Note also that saved results such as _b[_cons] are evaluated on the fly, and that the parentheses () around each posted value stop negative signs from being misread. Some code examples elsewhere store results temporarily in local macros or scalars, which is quite unnecessary.
sysuse auto.dta, clear
tempname myresults
postfile `myresults' threshold intercept gradient se using myresults.dta
quietly forval x = 2000(200)4800 {
xi: regress price mpg length gear_ratio i.foreign if weight < `x'
post `myresults' (`x') (`=_b[_cons]') (`=_b[mpg]') (`=_se[mpg]')
}
postclose `myresults'
use myresults, clear
list
+---------------------------------------------+
| thresh~d intercept gradient se |
|---------------------------------------------|
1. | 2000 -3699.55 -296.8218 215.0348 |
2. | 2200 -4175.722 -53.19774 54.51251 |
3. | 2400 -3918.388 -58.83933 42.19707 |
4. | 2600 -6143.622 -58.20153 38.28178 |
5. | 2800 -11159.67 -49.21381 44.82019 |
|---------------------------------------------|
6. | 3000 -6636.524 -51.28141 52.96473 |
7. | 3200 -7410.392 -58.14692 60.55182 |
8. | 3400 -2193.125 -57.89508 52.78178 |
9. | 3600 -1824.281 -103.4387 56.49762 |
10. | 3800 -1192.767 -110.9302 51.6335 |
|---------------------------------------------|
11. | 4000 5649.41 -173.9975 74.51212 |
12. | 4200 5784.363 -147.4454 71.89362 |
13. | 4400 6494.47 -93.81158 80.81586 |
14. | 4600 6494.47 -93.81158 80.81586 |
15. | 4800 5373.041 -95.25342 82.60246 |
+---------------------------------------------+
statsby (a command, not a function) is just not designed for this problem at all, so it is not a question of whether it works well.

I would suggest you look at help postfile for an example of how to aggregate the results. I agree that statsby may not be the best approach. Evaluating the interaction between mpg and weight on price may help address what would seem to be a classic question of interaction.

Related

Excel: Problems w. INDIRECT, Arrays, and Aggregate Functions (SUM, MAX, etc.)

Objective
I have a Microsoft Excel spreadsheet containing a price list that may change over time (B2:B6 in the example). Separately, I have a budget that may also change over time (D2). I am attempting to construct a formula for E2 to output the number of items that can be purchased with the budget in D2. Thereafter, I'll attempt to construct formulas to output the change that would remain (F2) and a comma-delimited list of purchasable items (G2).
Note: It unfortunately isn't possible to add an intermediate calculation column to the list, such as a running total. As such, I'm trying for formulas for single cells (i.e., E2, F2, and G2).
Note: I'm using Excel for Mac 2019.
A B C D E F G
+---------+---------+-----+---------+-------+---------+---------------------------+
1 | Label | Price | | Budget | Items | Change | Item(s) |
+---------+---------+-----+---------+-------+---------+---------------------------+
2 | Item #1 | $ 10.00 | | $ 40.00 | 3 | $ 4.50 | Item #1, Item #2, Item #3 |
+---------+---------+-----+---------+-------+---------+---------------------------+
3 | Item #2 | $ 20.00 | | | | | |
+---------+---------+-----+---------+-------+---------+---------------------------+
4 | Item #3 | $ 5.50 | | | | | |
+---------+---------+-----+---------+-------+---------+---------------------------+
5 | Item #4 | $ 25.00 | | | | | |
+---------+---------+-----+---------+-------+---------+---------------------------+
6 | Item #5 | $ 12.50 | | | | | |
+---------+---------+-----+---------+-------+---------+---------------------------+
For E2, I've attempted:
{=MAX(N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1)}
Though, the above values and this formula result in an output of -1.
Note: The formulas for F2 and G2 seem to follow easily from E2; e.g. {=$D2-SUM(IF((ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1)<=$E2,$B$2:$B$6,0))} and {=TEXTJOIN(", ",TRUE,INDIRECT("$A$2:$A$"&(MIN(ROW($B$2:$B$6))+$E2-1)))} seem to work well, respectively.
Observations
{="$B$2:$B$"&ROW($B$2:$B$6)} evaluates to {"$B$2:$B$2";"$B$2:$B$3";...;"$B$2:$B$6"} (as desired);
{=INDIRECT("$B$2:$B$"&ROW($B$2:$B$6))} should evaluate to the equivalent of {{$B$2:$B$2},{$B$2:$B$3},...,{$B$2:$B$6}}; though, as a 1x5 multi-cell array formula, evaluates to the equivalent of {#VALUE!,#VALUE!,#VALUE!,#VALUE!,#VALUE!} and, with F9, does to {10;#N/A;#N/A;#N/A;12.5};
{=SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2}, as a 1x5 multi-cell array formula, evaluates to the equivalent of {TRUE;TRUE;TRUE;FALSE;FALSE} (as desired); though, with F9 does to #VALUE!;
{=N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)}, as a 1x5 multi-cell array formula, evaluates to the equivalent of {1;1;1;0;0} (as desired); though, with F9 does again to #VALUE!;
{=N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)}, as a 1x5 multi-cell array formula, evaluates to the equivalent of {2,3,4,0,0} (as desired); though, with F9 does to {#VALUE!,#VALUE!,#VALUE!,#VALUE!,#VALUE!};
{=N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1}, as a 1x5 multi-cell array formula, evaluates to the equivalent of {1,2,3,-1,-1} (as desired); though, with F9 does again to {#VALUE!,#VALUE!,#VALUE!,#VALUE!,#VALUE!}; and,
{=MAX(N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1)} evaluates to -1
Interestingly:
If {=N(SUM(INDIRECT("$B$2:$B$"&ROW($B$2:$B$6)))<=$D2)*ROW($B$2:$B$6)-MIN(ROW($B$2:$B$6))+1} is placed as the multi-cell array formula in, say, E10:E14, a =MAX($E$10:$E$14) results in 3 (as desired).
Speculation
At present, I'm speculating that, when entered as a single cell array formula, the INDIRECT is not being assessed to be array producing and/or the SUM, as part of a single cell array formula, is not producing an array result.
Please assist. And, thank you in advance.
Solutions (Thanks to Contributors Below)
For E2, {=IF($B$2<=$D2,MATCH(1,0/(MMULT(N(ROW($B$2:$B$6)>=TRANSPOSE(ROW($B$2:$B$6))),$B$2:$B$6)<=$D2)),0)} (thank you Jos Woolley);
For F2, =IF($E2=0,MAX(0,$D2),$D2-SUM($B$2:INDEX($B$2:$B$6,$E2))) (thank you P.b); and,
For G2, =IF($E2=0,"",TEXTJOIN(", ",TRUE,$A$2:INDEX($A$2:$A$6,$E2))) (thank you P.b).
The first point to make, as I mentioned in the comments, is that it must be understood that piecemeal evaluation of a formula - via highlighting subsections of that formula and committing with F9 within the formula bar - will not necessarily correspond to the actual evaluation.
Evaluation via F9 in the formula bar always forces that part to be evaluated as an array. This can be misleading, since the overall construction may not actually evaluate that part as an array.
The second point to make is that SUM cannot iterate over an array of ranges, though SUBTOTAL, for example, can, so replacing SUM( with SUBTOTAL(9, in your current formula should work.
However, you would still be left with a construction which is volatile, so I would recommend this non-volatile alternative:
=MATCH(1,0/(MMULT(N(ROW(B2:B6)>=TRANSPOSE(ROW(B2:B6))),B2:B6)<=D2))
In E2 you can use:
=MATCH(TRUE,--SUBTOTAL(9,OFFSET(B2:B6,,,ROW(B2:B6)))>=D2,0)
In F2 you can use:
=D2-SUM(B2:INDEX(B2:B6,E2))
In G2 you can use:
=TEXTJOIN(", ",1,A2:INDEX(A2:A6,E2))

Running Delta Issue in Google Data Studio

Data
Datapull | product | value
8/30 | X | 2
8/30 | Y | 3
8/30 | Y | 4
9/30 | X | 5
9/30 | Y | 6
Report
date range dimension: datapull
dimensions: data pull & product
metric: running delta of record count
Chart
For 8/30, the starting totals for product Y are right, but product X shows nothing even though the first row of data has an entry for product X on 8/30.
The variances for 9/30 are wrong too.
Can someone please let me know if there is a way to do running deltas with 2 dimensions? This is not calculating correctly.
Using the combination of a Breakdown dimension and the Running delta calculation breaks the chart!
If you don't mind creating a field for each Breakdown dimension value (product), this will work:
case when product="X" then 1 else 0 end
Create a matching field for each product (e.g. case when product="Y" then 1 else 0 end for Y) and put each of these fields into 'Metric'.

Excel: create an array with n occurences of a value x

I'm looking for a way to create an Excel array with n occurrences of a value x, n and x being vectors.
Desired behaviour :
|-------------|-------|-------------|
| occurrences | value | result      |
|-------------|-------|-------------|
| 3           | 4     | {4;4;4;1;1} |
| 2           | 1     |             |
|-------------|-------|-------------|
This is a question similar to this one, except that I want one more dimension. I'm not interested in a VBA answer, I'm looking for a formula.
I've tried playing around with index and concatenation like in the answer to the previously linked question but with no luck until now.
This result will be used in a bigger formula that will sum the m greatest values (I already have that part figured out and working; the m value is irrelevant here). You can consider this question as if the occurrences are storage amounts, and I want the sum of the m greatest individual values.
Here's another approach in O365:
=INDEX(B:B,MATCH(SEQUENCE(SUM(A1:A3),1,0),
MMULT(N(ROW(A1:A3)>=TRANSPOSE(ROW(A1:A3))),A1:A3)-A1:A3))
where you're looking up the row number of the output array in the running total of the input counts.
I think it could be modified to work over an arbitrary range but would then be a fairly long formula.
If the inputs aren't in the sheet but coming from an array formula, then still possible but it would be a very long formula.
=FILTERXML("<t><s>" & TEXTJOIN("</s><s>",TRUE,SEQUENCE(3,,4,0),SEQUENCE(2,,1,0)) & "</s></t>","//s")
will return: {4;4;4;1;1} which can be used as part of a larger formula.

make x in a cell equal 8 and total

I need an Excel formula that will look at a cell and, if it contains an x, treat it as an 8 and add it to the total at the bottom of the table. I have done these in the past and I am so rusty that I cannot remember how I did it.
Generally, I try and break this sort of problem into steps. In this case, that'd be:
Determine if a cell is 'x' or not, and create new value accordingly.
Add up the new values.
If your values are in column A (for example), in column B, fill in:
=if(A1="x", 8, 0) (or, in R1C1 mode, =if(RC[-1]="x", 8, 0)).
Then just sum those values (e.g. =sum(B1:B3)) for your total.
A | B
+---------+---------+
| VALUES | TEMP |
+---------+---------+
| 0 | 0 <------ '=if(A1="x", 8, 0)'
| x | 8 |
| fish | 0 |
+---------+---------+
| TOTAL | 8 <------ '=sum(B1:B3)'
+---------+---------+
If you want to be tidy, you could also hide the column with your intermediate values in.
(I should add that the way your question is worded, it almost sounds like you want to 'push' a value into the total; as far as I've ever known, you can really only 'pull' values into a total.)
Try this one for total sum:
=SUMIF(<range you want to sum>, "<>" & <x>, <range you want to sum>)+ <x> * COUNTIF(<range you want to sum>, <x>)

Fast Hypotenuse Algorithm for Embedded Processor?

Is there a clever/efficient algorithm for determining the hypotenuse of a right triangle (i.e. sqrt(a² + b²)), using fixed-point math on an embedded processor without hardware multiply?
If the result doesn't have to be particularly accurate, you can get a crude
approximation quite simply:
Take absolute values of a and b, and swap if necessary so that you have a <= b. Then:
h = ((sqrt(2) - 1) * a) + b
To see intuitively how this works, consider the way that a shallow angled line is plotted on a pixel display (e.g. using Bresenham's algorithm). It looks something like this:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | | | | | | | | | | | | | | |*|*|*| ^
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | | | | | | | | | | |*|*|*|*| | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | | | | | | |*|*|*|*| | | | | | | | a pixels
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| | | | |*|*|*|*| | | | | | | | | | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
|*|*|*|*| | | | | | | | | | | | | | | | v
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
<-------------- b pixels ----------->
For each step in the b direction, the next pixel to be plotted is either immediately to the right, or one pixel up and to the right.
The ideal line from one end to the other can be approximated by the path which joins the centre of each pixel to the centre of the adjacent one. This is a series of a segments of length sqrt(2) and b-a segments of length 1 (taking a pixel to be the unit of measurement), for a total length of a*sqrt(2) + (b - a) = b + (sqrt(2) - 1)*a. Hence the above formula.
This clearly gives an accurate answer for a == 0 and a == b; but gives an over-estimate for values in between.
The error depends on the ratio b/a; the maximum error occurs when b = (1 + sqrt(2)) * a and turns out to be 2/sqrt(2+sqrt(2)), or about 8.24% over the true value. That's not great, but if it's good enough for your application, this method has the advantage of being simple and fast. (The multiplication by a constant can be written as a sequence of shifts and adds.)
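As an illustration of that last remark, here is a minimal C sketch (my own, not part of the original answer; the name hypot_rough and the particular shift combination are illustrative) that evaluates the estimate using only shifts and adds, replacing sqrt(2) - 1 ≈ 0.4142 with 1/4 + 1/8 + 1/32 + 1/128 ≈ 0.4141:
#include <stdint.h>

/* Rough hypotenuse: h ~= b + (sqrt(2) - 1) * a, with the smaller magnitude in a.
 * The constant is approximated by 1/4 + 1/8 + 1/32 + 1/128, so no multiply
 * instruction is needed. Over-estimates by up to about 8 %; inputs are assumed
 * small enough that the sum does not overflow. */
static uint32_t hypot_rough(uint32_t a, uint32_t b) {
    if (a > b) {                 /* swap so that a <= b */
        uint32_t t = a;
        a = b;
        b = t;
    }
    return b + (a >> 2) + (a >> 3) + (a >> 5) + (a >> 7);
}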
For the record, here are a few more approximations, listed in roughly
increasing order of complexity and accuracy. All these assume 0 ≤ a ≤ b.
h = b + 0.337 * a // max error ≈ 5.5 %
h = max(b, 0.918 * (b + (a>>1))) // max error ≈ 2.6 %
h = b + 0.428 * a * a / b // max error ≈ 1.04 %
Edit: to answer Ecir Hana's question, here is how I derived these
approximations.
First step. Approximating a function of two variables can be a
complex problem. Thus I first transformed this into the problem of
approximating a function of one variable. This can be done by choosing
the longest side as a “scale” factor, as follows:
h = √(b² + a²)
  = b · √(1 + (a/b)²)
  = b · f(a/b),   where f(x) = √(1 + x²)
Adding the constraint 0 ≤ a ≤ b means we are only concerned with
approximating f(x) in the interval [0, 1].
Below is the plot of f(x) in the relevant interval, together with the
approximation given by Matthew Slattery (namely (√2−1)x + 1).
Second step. Next step is to stare at this plot, while asking
yourself the question “how can I approximate this function cheaply?”.
Since the curve looks roughly parabolic, my first idea was to use a
quadratic function (third approximation). But since this is still
relatively expensive, I also looked at linear and piecewise linear
approximations. Here are my three solutions: a linear fit f(x) ≈ 1 + 0.337·x, a piecewise linear fit f(x) ≈ max(1, 0.918·(1 + x/2)), and a quadratic fit f(x) ≈ 1 + 0.428·x².
The numerical constants (0.337, 0.918 and 0.428) were initially free
parameters. The particular values were chosen in order to minimize the
maximum absolute error of the approximations. The minimization could
certainly be done by some algorithm, but I just did it “by hand”,
plotting the absolute error and tuning the constant until it is
minimized. In practice this works quite fast. Writing the code to
automate this would have taken longer.
Third step is to come back to the initial problem of approximating a
function of two variables:
h ≈ b · (1 + 0.337 · (a/b))            = b + 0.337 · a
h ≈ b · max(1, 0.918 · (1 + (a/b)/2))  = max(b, 0.918 · (b + a/2))
h ≈ b · (1 + 0.428 · (a/b)²)           = b + 0.428 · a²/b
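For reference, the three formulas might look like this in C (a sketch of my own, not from the original answer; written in floating point here, with the understanding that on the target they would be recast in fixed point):
/* The three approximations above; all assume 0 <= a <= b. */
static double hypot_lin(double a, double b) {        /* max error ~5.5 %  */
    return b + 0.337 * a;
}
static double hypot_piecewise(double a, double b) {  /* max error ~2.6 %  */
    double h = 0.918 * (b + 0.5 * a);
    return h > b ? h : b;
}
static double hypot_quad(double a, double b) {       /* max error ~1.04 % */
    if (b == 0.0)
        return 0.0;        /* 0 <= a <= b implies a == 0 as well */
    return b + 0.428 * a * a / b;
}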
Consider using CORDIC methods. Dr. Dobb's has an article and associated library source here. Square-root, multiply and divide are dealt with at the end of the article.
One possibility looks like this:
#include <math.h>
/* Iterations Accuracy
* 2 6.5 digits
* 3 20 digits
* 4 62 digits
* assuming a numeric type able to maintain that degree of accuracy in
* the individual operations.
*/
#define ITER 3
double dist(double P, double Q) {
    /* A reasonably robust method of calculating `sqrt(P*P + Q*Q)'
     *
     * Transliterated from _More Programming Pearls, Confessions of a Coder_
     * by Jon Bentley, pg. 156.
     */
    double R;
    int i;

    P = fabs(P);
    Q = fabs(Q);
    if (P < Q) {            /* swap so that P >= Q */
        R = P;
        P = Q;
        Q = R;
    }
    /* The book has this as:
     *     if P = 0.0 return Q   # in AWK
     * However, this makes no sense to me - we've just ensured that P >= Q, so
     * P == 0 only if Q == 0; OTOH, if Q == 0, then distance == P...
     */
    if (Q == 0.0)
        return P;
    for (i = 0; i < ITER; i++) {
        R = Q / P;
        R = R * R;
        R = R / (4.0 + R);
        P = P + 2.0 * R * P;
        Q = Q * R;
    }
    return P;
}
This still does a couple of divides and four multiples per iteration, but you rarely need more than three iterations (and two is often adequate) per input. At least with most processors I've seen, that'll generally be faster than the sqrt would be on its own.
For the moment it's written for doubles, but assuming you've implemented the basic operations, converting it to work with fixed point shouldn't be terribly difficult.
Some doubts have been raised by the comment about "reasonably robust". At least as originally written, this was basically a rather backhanded way of saying that "it may not be perfect, but it's still at least quite a bit better than a direct implementation of the Pythagorean theorem."
In particular, when you square each input, you need roughly twice as many bits to represent the squared result as you did to represent the input value. After you add (which needs only one extra bit) you take the square root, which gets you back to needing roughly the same number of bits as the inputs. Unless you have a type with substantially greater precision than the inputs, it's easy for this to produce really poor results.
This algorithm doesn't square either input directly. It is still possible for an intermediate result to underflow, but it's designed so that when it does so, the result still comes out as well as the format in use supports. Basically, the situation in which it happens is that you have an extremely acute triangle (e.g., something like 90 degrees, 0.000001 degrees, and 89.99999 degrees). If it's close enough to 90, 0, 90, we may not be able to represent the difference between the two longer sides, so it'll compute the hypotenuse as being the same length as the other long side.
By contrast, when the Pythagorean theorem fails, the result will often be a NaN (i.e., tells us nothing) or, depending on the floating point format in use, quite possibly something that looks like a reasonable answer, but is actually wildly incorrect.
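To make that concrete, here is a small test harness (my own sketch, assuming it is compiled in the same file as the dist() function above): with IEEE doubles, P = Q = 1e200 makes P*P overflow to infinity, so the direct formula returns inf while dist() returns roughly 1.41421e+200.
#include <math.h>
#include <stdio.h>

/* Demonstration only: the direct formula squares its inputs and overflows;
 * the iterative dist() above never squares either input directly. */
int main(void) {
    double P = 1e200, Q = 1e200;
    printf("naive: %g\n", sqrt(P * P + Q * Q));  /* inf - P*P overflows        */
    printf("dist : %g\n", dist(P, Q));           /* approximately 1.41421e+200 */
    return 0;
}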
You can start by reevaluating if you need the sqrt at all. Many times you are calculating the hypotenuse just to compare it to another value - if you square the value you're comparing against you can eliminate the square root altogether.
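For example (a sketch with hypothetical names; the 64-bit intermediates are there so the squares of 32-bit inputs cannot overflow, and note that the multiplies themselves remain):
#include <stdint.h>

/* "Is the point (a, b) within radius r of the origin?" with no square root:
 * compare a*a + b*b against r*r instead. */
static int within_radius(int32_t a, int32_t b, int32_t r) {
    int64_t aa = (int64_t)a * a;
    int64_t bb = (int64_t)b * b;
    int64_t rr = (int64_t)r * r;
    return aa + bb <= rr;
}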
Unless you're doing this at >1 kHz, multiplication even on an MCU without hardware MUL isn't terrible. What's much worse is the sqrt. I would try to modify my application so it doesn't need to calculate it at all.
Standard libraries would probably be best if you actually need it, but you could look at using Newton's method as a possible alternative. It would require several multiply/divide cycles to perform, however.
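A minimal sketch of that idea (my own illustration, not a particular library routine): an integer square root by Newton's iteration, which costs one divide per step.
#include <stdint.h>

/* floor(sqrt(n)) via Newton's method; converges in a handful of iterations,
 * but each one needs a divide. */
static uint32_t isqrt32(uint32_t n) {
    uint32_t x, y;
    if (n < 2)
        return n;
    x = n;
    y = (x + 1) / 2;
    while (y < x) {          /* estimates decrease until they settle on the floor */
        x = y;
        y = (x + n / x) / 2;
    }
    return x;
}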
AVR resources
Atmel App note AVR200: Multiply and Divide Routines (pdf)
This sqrt function on AVR Freaks forum
Another AVR Freaks post
Maybe you could use some of Elm Chan's assembler libraries and adapt the ihypot function to your ATtiny. You would need to replace the MUL and maybe (I haven't checked) some other instructions.
