Loop and calculate average in Stata - loops

I am trying to calculate the Gini coefficient as the average over five repetitions. My code doesn't correctly work, and I cannot find a way to do it.
inequal7 is a user-written command.
gen gini=.
forval i=1/5 {
mi xeq `i' : inequal7 income [aw=hw0010]
gen gini_`i'=.
scalar gini_`i' = r(gini)
replace gini_`i'= r(gini)
if `i' ==5 {
replace gini = sum(gini_1+gini_2+gini_3+gini_4+gini_5)/5
}
}
Can someone help me?

There is no context on or example of the dataset you're using. This may not work but it's likely to be closer to legal and correct than what you have.
scalar gini = 0
forval i=1/5 {
mi xeq `i' : inequal7 income [aw=hw0010]
scalar gini = scalar(gini) + r(gini)
}
scalar gini = scalar(gini) / 5
Notes:
Using variables to hold constants is legal, but not necessarily good style.
sum() gives the running or cumulative sum; applied to a variable that's a constant it does far more work than you need, and at best the correct answer will just be that in observation 1. As you're feeding it the sum of 5 values, it's redundant any way.
Watch out: names for scalars and variables occupy the same namespace.
If this is a long way off what you want, and you get no better answer, you're likely to need to give much more detail.

Related

Find possible solutions for a matrix with known row/column sums and maximum cell values

I am trying to find solutions to a matrix where I know the row and column sums and the maximum value a cell can have. I want to find possible solutions that are within the constraints. I've already tried various things like constructing an array of all cell values and picking picking from each cell in sequence but whatever I try I always run into the problem where I run out of values for a cell.
I also tried a recursive algorithm but that I only managed to get the first result or it failed to get any solution. I think I have to do this with a backtracking algorithm? Not sure...
Any help or pointers would be appreciated.
Row sums A, B, C, column sums X, Y, Z as well as the maximum value for each ? are known. All values are are positive integers.
C1 | C2 | C3
-----------------
R1 | ? | ? | ? | A
-----------------
R2 | ? | ? | ? | B
-----------------
R3 | ? | ? | ? | C
-----------------
X | Y | Z
If you heard about linear programming (LP) and its 'cousins' (ILP, MILP), that could be a good approach to help you solve your problem with a great efficiency.
A linear program consists in a set of variables (your matrix unknowns), constraints (maximum values, sum of rows and columns), and an objective function (here none) to minimize or maximize.
Let's call x[i][j] the values you are looking for.
With the following data:
NxM the dimensions of your matrix
max_val[i][j] the maximum value for the variable x[i][j]
row_val[i] the sum of the values on the row i
col_val[j] the sum of the values on the column j
Then a possible linear program that could solve your problem is:
// declare variables
int x[N][M] // or eventually float x[N][M]
// declare constaints
for all i in 1 .. N, j in 1 .. M, x[i][j] <= max_val[i][j]
for all i in 1 .. N, sum[j in 1 .. M](x[i][j]) == row_val[i]
for all j in 1 .. M, sum[i in 1 .. N](x[i][j]) == col_val[j]
// here the objective function is useless, but you still will need one
// for instance, let's minimize the sum of all variables (which is constant, but as I said, the objective function does not have to be useful)
minimize sum[i in 1 .. N](sum[j in 1 .. M](x[i][j]))
// you could also be more explicit about the uselessness of the objective function
// minimize 0
Solvers such as gurobi or Cplex (but there are much more of them, see here for instance) can solve this kind of problems incredibly fast, especially if your solutions do not need to be integer, but can be float (that makes the problem much, much easier). It also have the advantage to not only be faster t execute, but faster and simpler to code. They have APIs in several common programming languages to ease their use.
For example, you can reasonably expect to solve this kind of problem in less than a minute, with hundreds of thousands of variables in the integer case, millions in the real variables case.
Edit:
In response to the comment, here is a piece of code in OPL (the language Cplex and other LP solvers use) that would solve your problem. We consider a 3x3 case.
// declare your problem input
int row_val[1..3] = [7, 11, 8];
int col_val[1..3] = [14, 6, 6];
int max_val[1..3][1..3] = [[10, 10, 10], [10, 10, 10], [10, 10, 10]];
// declare your decision variables
dvar int x[1..3][1..3];
// objective function
minimize 0;
// constraints
subject to {
forall(i in 1..3, j in 1..3) x[i][j] <= max_val[i][j];
forall(i in 1..3) sum(j in 1..3) x[i][j] == row_val[i];
forall(j in 1..3) sum(i in 1..3) x[i][j] == col_val[j];
}
The concept of a LP solver is that you only describe the problem you want to solve, then the solver solves it for you. The problem must be described according to a certain set of rules. In the current case (Integer Linear Programming, or ILP), the variables must all be integers, and the constraints and objective function must be linear equalities (or inequalities) with regards to the decision variables.
The solver will then work as a black box. It will analyse the problem, and run algorithms that can solve it, with a ton of optimizations, and output the solution.
As you wrote in a comment, that you want to come up an own solution, here's some guideline:
Use a Backtrack algorithm to find a solution. Your value-space consists of 3*3=9 independent values, each of them are between 1 and maxval[i][j]. Your constraints will be the row and column sums (all of them must match)
Intitalize your space with all 1s, then increment them, until they reach the maxval. Evaluate the conditions only after each value is covered for that condition (particularly, after 3 values you can evaluate the first row, after 6 the second row, after 7 the first col, after 8 the second col, and after 9 the third row and the third col)
If you reach the 9th, with all conditions passing, you've got a solution. Otherwise try the values from 1 till maxval, if neither matches, step back. If the first value was iterated through, then there's no solution.
That's all.
More advanced backtracking:
Your moving values are only the top-left 2*2=4 values. The third column is calculated, the condition is that it must be between 1 and the maxval for that particular element.
After defining the [1][1] element, you need to calculate the [2][2] index by using the column sum, and validate its value by the row sum (or vica versa). The same processing rules apply as above: iterate through all possible values, step back if none matches, and check rules only if they can be applied.
It is a way faster method, since you have 5 bound variables (the bottom and right rows), and only 4 unbound. These are optimizations from your particular rules. A bit more complex to implement, though.
PS: 1 is used because you have positive integers. If you have non-negative integers, you need to start with 0.

arrayfun to bsxfun possible

I know that bsxfun(which works fast!) and arrayfun(as far as I could understand, uses loops internally which is expected to be slow) are intended for different uses, at least, at the most basic level.
Having said this, I am trying
to sum up all the numbers in a given array, say y, before a certain index
add the number at the specific location(which is the number at the above index location) to the above sum.
I could perform this with the below piece of example code easily:
% index array
x = [ 1:6 ]; % value array
y = [ 3 3 4 4 1 1 ];
% arrayfun version
o2 = arrayfun(#(a) ...
sum(y(1:(a-1)))+...
y(a), ...
x)
But it seems to be slow on large inputs.
I was wondering what would be a good way to convert this to a version that works with bsxfun, if possible.
P.S. the numbers in y do not repeat as given above, this was just an example, it could also be [3 4 3 1 4 ...]
Is x always of the form 1 : n? Assuming the answer is yes, then you can get the same result with the much faster code:
o2 = cumsum(y);
Side note: you don't need the brackets in the definition of x.
if you have a supported GPU device, you can define your variables as gpuArray type since arrayfun, bsxfun and pagefun are compatible with GPUs.
GPU computing is supposed to be faster for large data.

MATLAB solve array

I've got multiple arrays that you can't quite fit a curve/equation to, but i do need to solve them for a lot of values. Simplified it looks like this when i plot it, but the real ones have a lot more points:
So say i would like to solve for y=22,how would i do that? As you can see there'd be three solutions to this, but i only need the most left one.
Linear is okay, but i'd rather us a non-linear method.
The only way i found is to fit an equation to a set of points and solve that equation, but an equation can't approximate the array accurately enough.
This implementation uses a first-order interpolation- if you're looking for higher accuracy and it feels appropriate, you can use a similar strategy for another order estimator.
Assuming data is the name of your array containing data with x values in the first column and y values in the second, that the columns are sorted by increasing or decreasing x values, and you wanted to find all data at the value y = 22;
searchPoint = 22; %search for all solutions where y = 22
matchPoints = []; %matrix containing all values of x
for ii = 1:length(data)-1
if (data(ii,2)>searchPoint)&&(data(ii+1,2)<searchPoint)
xMatch = data(ii,1)+(searchPoint-data(ii,2))*(data(ii+1,1)-data(ii,1))/(data(ii+1,2)-data(ii,2)); %Linear interpolation to solve for xMatch
matchPoints = [matchPoints xMatch];
elseif (data(ii,2)<searchPoint)&&(data(ii+1,2)>searchPoint)
xMatch = data(ii,1)+(searchPoint-data(ii,2))*(data(ii+1,1)-data(ii,1))/(data(ii+1,2)-data(ii,2)); %Linear interpolation to solve for xMatch
matchPoints = [matchPoints xMatch];
elseif (data(ii,2)==searchPoint) %check if data(ii,2) is equal
matchPoints = [matchPoints data(ii,1)];
end
end
if(data(end,2)==searchPoint) %Since ii only goes to the rest of the data
matchPoints = [matchPoints data(end,1)];
end
This was written sans-compiler, but the logic was tested in octave (in other words, sorry if there's a slight typo in variable names, but the math should be correct)

Matlab: average each element in 2D array based on neighbors [duplicate]

I've written code to smooth an image using a 3x3 averaging filter, however the output is strange, it is almost all black. Here's my code.
function [filtered_img] = average_filter(noisy_img)
[m,n] = size(noisy_img);
filtered_img = zeros(m,n);
for i = 1:m-2
for j = 1:n-2
sum = 0;
for k = i:i+2
for l = j:j+2
sum = sum+noisy_img(k,l);
end
end
filtered_img(i+1,j+1) = sum/9.0;
end
end
end
I call the function as follows:
img=imread('img.bmp');
filtered = average_filter(img);
imshow(uint8(filtered));
I can't see anything wrong in the code logic so far, I'd appreciate it if someone can spot the problem.
Assuming you're working with grayscal images, you should replace the inner two for loops with :
filtered_img(i+1,j+1) = mean2(noisy_img(i:i+2,j:j+2));
Does it change anything?
EDIT: don't forget to reconvert it to uint8!!
filtered_img = uint8(filtered_img);
Edit 2: the reason why it's not working in your code is because sum is saturating at 255, the upper limit of uint8. mean seems to prevent that from happening
another option:
f = #(x) mean(x(:));
filtered_img = nlfilter(noisy_img,[3 3],f);
img = imread('img.bmp');
filtered = imfilter(double(img), ones(3) / 9, 'replicate');
imshow(uint8(filtered));
Implement neighborhood operation of sum of product operation between an image and a filter of size 3x3, the filter should be averaging filter.
Then use the same function/code to compute Laplacian(2nd order derivative, prewitt and sobel operation(first order derivatives).
Use a simple 10*10 matrix to perform these operations
need matlab code
Tangentially to the question:
Especially for 5x5 or larger window you can consider averaging first in one direction and then in the other and you save some operations. So, point at 3 would be (P1+P2+P3+P4+P5). Point at 4 would be (P2+P3+P4+P5+P6). Divided by 5 in the end. So, point at 4 could be calculated as P3new + P6 - P2. Etc for point 5 and so on. Repeat the same procedure in other direction.
Make sure to divide first, then sum.
I would need to time this, but I believe it could work a bit faster for larger windows. It is sequential per line which might not seem the best, but you have many lines where you can work in parallel, so it shouldn't be a problem.
This first divide, then sum also prevents saturation if you have integers, so you might use the approach even in 3x3 case, as it is less wrong (though slower) to divide twice by 3 than once by 9. But note that you will always underestimate final value with that, so you might as well add a bit of bias (say all values +1 between the steps).
img=imread('camraman.tif');
nsy-img=imnoise(img,'salt&pepper',0.2);
imshow('nsy-img');
h=ones(3,3)/9;
avg=conv2(img,h,'same');
imshow(Unit8(avg));

Generating sums of variables with loops in Stata

I'm having some problems with a loop that I'm trying to perform and probably with the syntax for generating the variable that I want.
Putting in words, what I am trying to do make is a sum of a particular set of observations and storing each sum in a cell for a new variable. Here is an example of syntax that I used:
forvalues j=1/50 {
replace x1 = sum(houses) if village==j'& year==2010
}
gen x2=.
forvalues j=1/50 {
replace x2 = sum(houses) if village==j' & year==2011
}
gen x3 =.
forvalues j=1/50 {
replace x3 = sum(houses) if village==j' & year==2012
}
This is from a dataset with more than 4000 observations. So, for each particular j, if I were successful with the code above, I would get an unique observation for each j (what I want to obtain), but I'm not obtaining this -- which is a sum of all houses, conditioned with the year and village; the total sum of houses per village in each year. I would greatly appreciate if someone could help me obtain one particular observation for each j in each variable.
sum() will return a running sum, so that is probably not what you want. This type of problem is usually much easier to solve with the by: prefix in combination with the egen command. The one line command below will give you the total number of houses per village and year:
bys village year: egen Nhouses = total(houses)

Resources