Nested loop with increasing number of elements - loops

I have a longitudinal dataset of 18 time periods. For reasons not to be discussed here, this dataset is in the wide shape, not in the long one. More precisely, time-varying variables have an alphabetic prefix which identifies the time it belongs to. For the sake of this question, consider a quantity of interest called pay. This variable is denoted apay in the first period, bpay in the second, and so on, until rpay.
Importantly, different observations have missing values in this variable in different periods, in an unpredictable way. In consequence, running a panel for the full number of periods will reduce my number of observations considerably. Hence, I would like to know precisely how many observations a panel with different lengths will have. To evaluate this, I want to create variables that, for each period and for each number of consecutive periods count how many respondents have the variable with that time sequence. For example, I want the variable b_count_2 to count how many observations have nonmissing pay in the first period and the second. This can be achieved with something like this:
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
Now, since I want to do this automatically, this has to be in a loop. Moreover, there are different numbers of sequences for each period. For example, for the third period, there are two sequences (those with pay in period 2 and 3, and those with sequences in period 1, 2 and 3). Thus, the number of variables to create is 1+2+3+4+...+17 = 153. This variability has to be reflected in the loop. I propose a code below, but there are bits that are wrong, or of which I'm unsure, as highlighted in the comments.
local list b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
local counter = 1 // counter to update; reflects sequence length
while `counter' > 0 { // loop over sequence lengths
gen _`var'_counter_`counter' = 0 // generate variable with counter
if `var'pay != . { // HERE IS PROBLEM 1. NEED TO MAKE THIS TO CHECK CONDITIONS WITH INCREASING NUMBER OF ELEMENTS
recode _`var'_counter_`counter' (0 = 1) // IM NOT SURE THIS IS HOW TO UPDATE SPECIFIC OBSERVATIONS.
local counter = `counter' - 1 // update counter to look for a longer sequence in the next iteration
}
}
local counter = `counter' + 1 // HERE IS PROBLEM 2. NEED TO STOP THIS LOOP! Otherwise counter goes to infinity.
}
An example of the result of the above code (if right) is the following. Consider a dataset of five observations, for four periods (denoted a, b, c, and d):
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
where 1 means value is observed in that period, and . is not. The objective of the code is to create 1+2+3=6 new variables such that the new dataset is:
Obs a b c d b_count_2 c_count_2 c_count_3 d_count_2 d_count_3 d_count_4
1 1 1 . 1 1 0 0 0 0 0
2 1 1 . . 1 0 0 0 0 0
3 . . 1 1 0 0 0 1 0 0
4 . 1 1 . 0 1 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1
Now, why is this helpful? Well, because now I can run a set of summarize commands to get a very nice description of the dataset. The code to print this information in one go would be something like this:
local list a b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
local list `var'_counter_* // group of sequence variables for each period
foreach var2 of local list { // loop over each element of the list
quietly sum `var'_counter_`var2' if `var'_counter_`var2' == 1 // sum the number of individuals with value = 1 with sequence of length var2 in period var
di as text "Wave `var' has a sequence of length `var2' with " as result r(N) as text " observations." // print result
}
}
For the above example, this produces the following output:
"Wave 'b' has a sequence of length 2 with 3 observations."
"Wave 'c' has a sequence of length 2 with 2 observations."
"Wave 'c' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 2 with 2 observations."
"Wave 'd' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 4 with 1 observations."
This gives me a nice summary of the trade-offs I'm having between a wider panel and a longer panel.

If you insist on doing this with data in wide form, it is very inefficient to create extra variables just to count patterns of missing values. You can create a single string variable that contains the pattern for each observation. Then, it's just a matter of extracting from this pattern variable what you are looking for (i.e. patterns of consecutive periods up to the current wave). You can then loop over lengths of the matching patterns and do counts. Something like:
* create some fake data
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
gen `pre'pay = runiform() if runiform() < .8
}
* build the pattern of missing data
gen pattern = ""
foreach pre in a b c d e f g {
qui replace pattern = pattern + cond(mi(`pre'pay), " ", "`pre'")
}
list
qui foreach pre in b c d e f g {
noi dis "{hline 80}" _n as res "Wave `pre'"
// the longest substring without a space up to the wave
gen temp = regexs(1) if regexm(pattern, "([^ ]+`pre')")
noi tab temp
// loop over the various substring lengths, from 2 to max length
gen len = length(temp)
sum len, meanonly
local n = r(max)
forvalues i = 2/`n' {
count if length(temp) >= `i'
noi dis as txt "length = " as res `i' as txt " obs = " as res r(N)
}
drop temp len
}
If you are open to working in long form, then here is how you would identify spells with contiguous data and how to loop to get the info you want (the data setup is exactly the same as above):
* create some fake data in wide form
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
gen `pre'pay = runiform() if runiform() < .8
}
* reshape to long form
gen id = _n
reshape long @pay, i(id) j(wave) string
* identify spells of contiguous periods
egen wavegroup = group(wave), label
tsset id wavegroup
tsspell, cond(pay < .)
drop if mi(pay)
foreach pre in b c d e f g {
dis "{hline 80}" _n as res "Wave `pre'"
sum _seq if wave == "`pre'", meanonly
local n = r(max)
forvalues i = 2/`n' {
qui count if _seq >= `i' & wave == "`pre'"
dis as txt "length = " as res `i' as txt " obs = " as res r(N)
}
}
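For readers who want to check the counting logic outside Stata, here is a minimal Python sketch. It is not part of either Stata approach above. It assumes each observation is a list with None marking a missing wave, and the function name run_counts is made up for illustration. For each wave it tracks the run of consecutive non-missing values ending at that wave and counts the observations whose run is at least each length from 2 upward.

```python
from collections import defaultdict

def run_counts(rows, waves):
    """Count, per wave and run length, the observations with an unbroken
    run of non-missing values of at least that length ending at the wave."""
    counts = defaultdict(int)
    for row in rows:
        run = 0  # length of the current run of non-missing values
        for wave, value in zip(waves, row):
            run = run + 1 if value is not None else 0
            # a run of length `run` also contains every shorter run ending here
            for length in range(2, run + 1):
                counts[(wave, length)] += 1
    return dict(counts)

# The five-observation example from the question (None = missing)
rows = [
    [1, 1, None, 1],
    [1, 1, None, None],
    [None, None, 1, 1],
    [None, 1, 1, None],
    [1, 1, 1, 1],
]
result = run_counts(rows, "abcd")
for (wave, length), n in sorted(result.items()):
    print(f"Wave {wave}: length = {length}, obs = {n}")
```

On the question's example this reproduces the six counts listed there (3 for wave b; 2 and 1 for wave c; 2, 1 and 1 for wave d), without creating any per-sequence variables.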

I echo @Dimitriy V. Masterov in genuine puzzlement that you are using this dataset shape. It can be convenient for some purposes, but for panel or longitudinal data such as you have, working with it in Stata is at best awkward and at worst impracticable.
First, note specifically that
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
will only ever be evaluated in terms of the first observation, i.e. as if you had coded
if apay[1] != . & bpay[1] != .
This is documented here. Even if it is what you want, it is not usually a pattern for others to follow.
Second, and more generally, I haven't tried to understand all of the details of your code, as what I see is the creation of a vast number of variables even for tiny datasets as in your sketch. For a series T periods long, you would create a triangular number [(T - 1)T]/2 of new variables; in your example (17 x 18)/2 = 153. If someone had series 100 periods long, they would need 4950 new variables.
Note that because of the first point just made, these new variables would pertain with your strategy only to individual variables like pay and individual panels. Presumably that limitation to individual panels could be fixed, but the main idea seems singularly ill-advised in many ways. In a nutshell, what strategy do you have to work with these hundreds or thousands of new variables except writing yet more nested loops?
Your main need seems to be to identify spells of non-missing and missing values. There is easy machinery for this long since developed. General principles are discussed in this paper and an implementation is downloadable from SSC as tsspell.
On Statalist, people are asked to provide workable examples with data as well as code. See this FAQ. That's entirely equivalent to long-standing requests here for an MCVE.
Despite all that advice, I would start by looking at the Stata command xtdescribe and associated xt tools already available to you. These tools do require a long data shape, which reshape will provide for you.

Let me add another answer based on the example now added to the question.
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
The aim of this answer is not to provide what the OP asks but to indicate how many simple tools are available to look at patterns of non-missing and missing values, none of which entail the creation of large numbers of extra variables or writing intricate code based on nested loops for every new question. Most of those tools require a reshape long.
. clear
. input a b c d
a b c d
1. 1 1 . 1
2. 1 1 . .
3. . . 1 1
4. . 1 1 .
5. 1 1 1 1
6. end
. rename (a b c d) (y1 y2 y3 y4)
. gen id = _n
. reshape long y, i(id) j(time)
(note: j = 1 2 3 4)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 5 -> 20
Number of variables 5 -> 3
j variable (4 values) -> time
xij variables:
y1 y2 ... y4 -> y
-----------------------------------------------------------------------------
. xtset id time
panel variable: id (strongly balanced)
time variable: time, 1 to 4
delta: 1 unit
. preserve
. drop if missing(y)
(7 observations deleted)
. xtdescribe
id: 1, 2, ..., 5 n = 5
time: 1, 2, ..., 4 T = 4
Delta(time) = 1 unit
Span(time) = 4 periods
(id*time uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
2 2 2 2 3 4 4
Freq. Percent Cum. | Pattern
---------------------------+---------
1 20.00 20.00 | ..11
1 20.00 40.00 | .11.
1 20.00 60.00 | 11..
1 20.00 80.00 | 11.1
1 20.00 100.00 | 1111
---------------------------+---------
5 100.00 | XXXX
* ssc inst xtpatternvar
. xtpatternvar, gen(pattern)
* ssc inst groups
. groups pattern
+------------------------------------+
| pattern Freq. Percent % <= |
|------------------------------------|
| ..11 2 15.38 15.38 |
| .11. 2 15.38 30.77 |
| 11.. 2 15.38 46.15 |
| 11.1 3 23.08 69.23 |
| 1111 4 30.77 100.00 |
+------------------------------------+
. restore
. egen nmissing = total(missing(y)), by(time)
. tabdisp time, c(nmissing)
----------------------
time | nmissing
----------+-----------
1 | 2
2 | 1
3 | 2
4 | 2
----------------------


Rank and unrank fibonacci bitsequence with k ones

For positive integers n and k, let a "k-fibonacci-bitsequence of n" be a bit sequence with exactly k ones in which a 1 at index i stands not for Math.pow(2,i) but for Fibonacci(i), and in which the Fibonacci numbers selected by the ones add up to n. Let the "rank" of a given k-fibonacci-bitsequence of n be its position in the sorted list of all k-fibonacci-bitsequences of n in lexicographic order, starting at 0.
For example, for the number 39 we have the following valid k-fibonacci-bitsequences with k <= 4. The first row shows the Fibonacci number behind each bit position:
34 21 13 8 5 3 2 1
10001000 k = 2 rank = 0
01101000 k = 3 rank = 0
10000110 k = 3 rank = 1
01100110 k = 4 rank = 0
So, I want to be able to do two things:
Given n, k, and a k-fibonacci-bitsequence of n, I want to find the rank of that k-fibonacci-bitsequence of n.
Given n, k, and a rank, I want to find the k-fibonacci-bitsequence of n with that rank.
Can I do this without having to compute all the k-fibonacci-bitsequences of n that come before the one of interest?
Preliminaries
For brevity, let's say »k-fbs of n« instead of »k-fibonacci-bitsequence of n«.
Question
Can I do this without having to compute all the k-fbs of n that come before the one of interest?
I'm not sure. So far I still have to compute some of the fbs. However, you might have thought we had to start from 00…0 and count up. This is not the case: we can go the other way around, start from the highest fbs, and work our way down very efficiently.
This is not a complete answer. However, there are some observations that could help you:
Zeckendorf
In the following pseudo-code we use the data type fbs, which is basically an array of bools. We can read and write individual bits using myFbs[i], where bit i represents the Fibonacci number fib(i). Just as in your question, the bits myFbs[0] and myFbs[1] do not exist. All bits are initialized to 0 by default. An fbs can be used without [] to read the represented number (n). The helper function #(fbs) returns the number of set bits (k) inside an fbs. Example for n = 7:
fbs meaning representation helper functions
1 0 1 0
| | | `— 0·fib(2) = 0·1 ——— myFbs[2] = 0 #(myFbs) == 2
| | `——— 1·fib(3) = 1·2 ——— myFbs[3] = 1 myFbs == 7
| `————— 0·fib(4) = 0·3 ——— myFbs[4] = 0
`——————— 1·fib(5) = 1·5 ——— myFbs[5] = 1
For any given n we can easily compute the lexicographical maximum (across all k) fbs of n, as this fbs happens to be the Zeckendorf representation of n.
function zeckendorf(int n) returns (fbs z):
int i := any (ideally the smallest) number such that fib(i) > n
while n-z > 0
| if fib(i) <= n-z
| | z[i] := 1
| i := i - 1
zeckendorf(n) is unique and the only fbs of n with k=#(zeckendorf(n)). Therefore zeckendorf(n) has rank=0. Also, there exists no k'-fbs of n with k'<#(zeckendorf(n)).
Transformation
Any k-fbs of n can be transformed into a (k+1)-fbs of n by replacing the bit sequence 100 by 011 anywhere inside the fbs. This works because fib(i)=fib(i-1)+fib(i-2).
If our input k-fbs of n has rank=0 and we replace the right-most 100 then our resulting (k+1)-fbs of n also has rank=0. If we replace the second-right-most 100 our resulting (k+1)-fbs has rank=1 and so on.
You should be able to answer both of your questions using repeated transformations starting at zeckendorf(n). For the first question it might even be sufficient to look only at the k-stable transformations 011…100→100…011 and 100…011→011…100 of the given fbs (think about what these transformations do to the rank).
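To experiment with these observations, here is a small Python sketch. It is an illustration under stated assumptions, not the efficient ranking algorithm the question asks for: the helper names fib_upto, zeckendorf and all_kfbs are invented here, and all_kfbs is a brute-force enumeration that is only sensible for small n, as a reference against which a smarter rank/unrank routine could be checked.

```python
from itertools import combinations

def fib_upto(n):
    """Fibonacci numbers 1, 2, 3, 5, ... (fib(2) onward) that are <= n."""
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    return [f for f in fibs if f <= n]

def zeckendorf(n):
    """Greedy Zeckendorf representation of n, largest term first."""
    terms = []
    for f in reversed(fib_upto(n)):
        if f <= n:
            terms.append(f)
            n -= f
    return terms

def all_kfbs(n, k):
    """Brute force: every k-fbs of n as a bit string, in lexicographic order.
    The leftmost bit is the largest Fibonacci number <= n."""
    fibs = fib_upto(n)[::-1]
    found = []
    for chosen in combinations(fibs, k):
        if sum(chosen) == n:
            found.append("".join("1" if f in chosen else "0" for f in fibs))
    return sorted(found)

print(zeckendorf(39))
for k in (2, 3, 4):
    for rank, bits in enumerate(all_kfbs(39, k)):
        print(bits, "k =", k, "rank =", rank)
```

For n = 39 the greedy representation is 34 + 5, and the enumeration gives two sequences for k = 3 (01101000, then 10000110) and a single sequence for k = 4, namely 01100110 (21 + 13 + 3 + 2).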

End Loop when significant value found : Stata?

Could you help me figure out how to tell Stata to end a loop over iterations when it finds the first positive and significant value of a particular coefficient in a regression?
Here is a small sample using a publicly available dataset that shows what I am trying to do. In the following case, I want Stata to stop looping when it finds the "year" coefficient to be positive and significant.
set more off
clear all
clear matrix
use http://www.stata-press.com/data/r13/abdata
forvalues i=1/8{
xtabond n w k ys year, lags(`i') noconstant
matrix b = e(b)'
mat byear = b["year",1]
if `i'==1 matrix byear=b["year",1]
else matrix byear=(byear\ b["year",1])
}
Could you please help in figuring out how to tell stata to stop looping when it finds a condition is met.
Thank you
Here is some code that seems to do what you want. I had to set the confidence level to 80 (from the default of 95) so it would terminate before it exceeded the maximum number of lags.
set more off
clear all
clear matrix
set level 80
use http://www.stata-press.com/data/r13/abdata
forvalues i=1/8{
quietly xtabond n w k ys year, lags(`i') noconstant
matrix t = r(table)
scalar b = t[rownumb(t,"b"),colnumb(t,"year")]
scalar p = t[rownumb(t,"pvalue"),colnumb(t,"year")]
scalar r = 1-r(level)/100
scalar q = (b>0) & (p<=r)
if q {
display "success with `i' lags"
display "b: " b " p: " p " r: " r " q: " q
xtabond
continue, break
}
else {
display "no luck with `i' lags"
}
}
which yields
no luck with 1 lags
success with 2 lags
b: .00759529 p: .18035747 r: .2 q: 1
Arellano-Bond dynamic panel-data estimation Number of obs = 611
Group variable: id Number of groups = 140
Time variable: year
Obs per group:
min = 4
avg = 4.364286
max = 6
Number of instruments = 31 Wald chi2(6) = 1819.55
Prob > chi2 = 0.0000
One-step results
------------------------------------------------------------------------------
n | Coef. Std. Err. z P>|z| [80% Conf. Interval]
-------------+----------------------------------------------------------------
n |
L1. | .3244849 .0774312 4.19 0.000 .1727225 .4762474
L2. | -.0266879 .0363611 -0.73 0.463 -.0979544 .0445785
|
w | -.5464779 .0562155 -9.72 0.000 -.6566582 -.4362975
k | .360622 .0330634 10.91 0.000 .2958189 .4254252
ys | .5948084 .0818672 7.27 0.000 .4343516 .7552652
year | .0075953 .0056696 1.34 0.180 -.0035169 .0187075
------------------------------------------------------------------------------
Instruments for differenced equation
GMM-type: L(2/.).n
Standard: D.w D.k D.ys D.year
.
end of do-file

Split vector in MATLAB

I'm trying to elegantly split a vector. For example,
vec = [1 2 3 4 5 6 7 8 9 10]
According to another vector of 0's and 1's of the same length where the 1's indicate where the vector should be split - or rather cut:
cut = [0 0 0 1 0 0 0 0 1 0]
Giving us a cell output similar to the following:
[1 2 3] [5 6 7 8] [10]
Solution code
You can use cumsum & accumarray for an efficient solution -
%// Create ID/labels for use with accumarray later on
id = cumsum(cut)+1
%// Mask to get valid values from cut and vec corresponding to ones in cut
mask = cut==0
%// Finally get the output with accumarray using masked IDs and vec values
out = accumarray(id(mask).',vec(mask).',[],@(x) {x})
Benchmarking
Here are some performance numbers when using a large input on the three most popular approaches listed to solve this problem -
N = 100000; %// Input Datasize
vec = randi(100,1,N); %// Random inputs
cut = randi(2,1,N)-1;
disp('-------------------- With CUMSUM + ACCUMARRAY')
tic
id = cumsum(cut)+1;
mask = cut==0;
out = accumarray(id(mask).',vec(mask).',[],@(x) {x});
toc
disp('-------------------- With FIND + ARRAYFUN')
tic
N = numel(vec);
ind = find(cut);
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
out = arrayfun(@(x,y) vec(x:y), ind_after, ind_before, 'uni', 0);
toc
disp('-------------------- With CUMSUM + ARRAYFUN')
tic
cutsum = cumsum(cut);
cutsum(cut == 1) = NaN; %Don't include the cut indices themselves
sumvals = unique(cutsum); % Find the values to use in indexing vec for the output
sumvals(isnan(sumvals)) = []; %Remove NaN values from sumvals
output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);
toc
Runtimes
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.068102 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.117953 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 12.560973 seconds.
Special case scenario: In cases where you might have runs of 1's, you need to modify few things as listed next -
%// Mask to get valid values from cut and vec corresponding to ones in cut
mask = cut==0
%// Setup IDs differently this time. The idea is to have successive IDs.
id = cumsum(cut)+1
[~,~,id] = unique(id(mask))
%// Finally get the output with accumarray using masked IDs and vec values
out = accumarray(id(:),vec(mask).',[],@(x) {x})
Sample run with such a case -
>> vec
vec =
1 2 3 4 5 6 7 8 9 10
>> cut
cut =
1 0 0 1 1 0 0 0 1 0
>> celldisp(out)
out{1} =
2
3
out{2} =
6
7
8
out{3} =
10
For this problem, a handy function is cumsum, which can create a cumulative sum of the cut array. The code that produces an output cell array is as follows:
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];
cutsum = cumsum(cut);
cutsum(cut == 1) = NaN; %Don't include the cut indices themselves
sumvals = unique(cutsum); % Find the values to use in indexing vec for the output
sumvals(isnan(sumvals)) = []; %Remove NaN values from sumvals
output = {};
for i=1:numel(sumvals)
output{i} = vec(cutsum == sumvals(i)); %#ok<SAGROW>
end
As another answer shows, you can use arrayfun to create a cell array with the results. To apply that here, you'd replace the for loop (and the initialization of output) with the following line:
output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);
That's nice because it doesn't end up growing the output cell array.
The key feature of this routine is the variable cutsum, which ends up looking like this:
cutsum =
0 0 0 NaN 1 1 1 1 NaN 2
Then all we need to do is use it to create indices to pull the data out of the original vec array. We loop from zero to max and pull matching values. Notice that this routine handles some situations that may arise. For instance, it handles 1 values at the very beginning and very end of the cut array, and it gracefully handles repeated ones in the cut array without creating empty arrays in the output. This is because of the use of unique to create the set of values to search for in cutsum, and the fact that we throw out the NaN values in the sumvals array.
You could use -1 instead of NaN as the signal flag for the cut locations to not use, but I like NaN for readability. The -1 value would probably be more efficient, as all you'd have to do is truncate the first element from the sumvals array. It's just my preference to use NaN as a signal flag.
The output of this is a cell array with the results:
output{1} =
1 2 3
output{2} =
5 6 7 8
output{3} =
10
There are some odd conditions we need to handle. Consider the situation:
vec = [1 2 3 4 5 6 7 8 9 10 11 12 13 14];
cut = [1 0 0 1 1 0 0 0 0 1 0 0 0 1];
There are repeated 1's in there, as well as a 1 at the beginning and end. This routine properly handles all this without any empty sets:
output{1} =
2 3
output{2} =
6 7 8 9
output{3} =
11 12 13
You can do this with a combination of find and arrayfun:
vec = [1 2 3 4 5 6 7 8 9 10];
N = numel(vec);
cut = [0 0 0 1 0 0 0 0 1 0];
ind = find(cut);
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
out = arrayfun(@(x,y) vec(x:y), ind_after, ind_before, 'uni', 0);
We thus get:
>> celldisp(out)
out{1} =
1 2 3
out{2} =
5 6 7 8
out{3} =
10
So how does this work? Well, the first line defines your input vector, the second line finds how many elements are in this vector and the third line denotes your cut vector which defines where we need to cut in our vector. Next, we use find to determine the locations that are non-zero in cut which correspond to the split points in the vector. If you notice, the split points determine where we need to stop collecting elements and begin collecting elements.
However, we need to account for the beginning of the vector as well as the end. ind_after tells us the locations of where we need to start collecting values and ind_before tells us the locations of where we need to stop collecting values. To calculate these starting and ending positions, you simply take the result of find and add and subtract 1 respectively.
Each corresponding position in ind_after and ind_before tell us where we need to start and stop collecting values together. In order to accommodate for the beginning of the vector, ind_after needs to have the index of 1 inserted at the beginning because index 1 is where we should start collecting values at the beginning. Similarly, N needs to be inserted at the end of ind_before because this is where we need to stop collecting values at the end of the array.
Now for ind_after and ind_before, there is a degenerate case where the cut point may be at the end or beginning of the vector. If this is the case, then subtracting or adding 1 will generate a start or stop position that's out of bounds. We check for this in the 5th and 6th lines of code and simply set these to 1 or N depending on whether we're at the beginning or end of the array.
The last line of code uses arrayfun and iterates through each pair of ind_after and ind_before to slice into our vector. Each result is placed into a cell array, and our output follows.
We can check for the degenerate case by placing a 1 at the beginning and end of cut and some values in between:
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [1 0 0 1 0 0 0 1 0 1];
Using this example and the above code, we get:
>> celldisp(out)
out{1} =
1
out{2} =
2 3
out{3} =
5 6 7
out{4} =
9
out{5} =
10
Yet another way, but this time without any loops or accumulating at all...
lengths = diff(find([1 cut 1])) - 1; % assuming a row vector
lengths = lengths(lengths > 0);
data = vec(~cut);
result = mat2cell(data, 1, lengths); % also assuming a row vector
The diff(find(...)) construct gives us the distance from each marker to the next - we append boundary markers with [1 cut 1] to catch any runs of zeros which touch the ends. Each length is inclusive of its marker, though, so we subtract 1 to account for that, and remove any which just cover consecutive markers, so that we won't get any undesired empty cells in the output.
For the data, we mask out any elements corresponding to markers, so we just have the valid parts we want to partition up. Finally, with the data ready to split and the lengths into which to split it, that's precisely what mat2cell is for.
Also, using @Divakar's benchmark code:
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.272810 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.436276 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 17.112259 seconds.
-------------------- With mat2cell
Elapsed time is 0.084207 seconds.
...just sayin' ;)
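As a language-neutral reference for what the answers above are aiming at (drop the flagged elements, split at them, and produce no empty pieces), here is a short sketch in plain Python rather than MATLAB. The function name split_at_cuts is hypothetical, and the sketch makes no claim about relative performance.

```python
def split_at_cuts(vec, cut):
    """Split vec into runs of elements whose cut flag is 0, dropping the
    elements flagged 1 and skipping empty runs."""
    pieces, current = [], []
    for value, flag in zip(vec, cut):
        if flag:
            # a cut marker closes the current run, if there is one
            if current:
                pieces.append(current)
            current = []
        else:
            current.append(value)
    if current:  # trailing run after the last marker
        pieces.append(current)
    return pieces

vec = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(split_at_cuts(vec, [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]))
print(split_at_cuts(vec, [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]))
```

The first call corresponds to the question's example; the second has a marker at the start and a run of consecutive markers, which is the special case some of the answers treat separately.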
Here's what you need:
function spl = Splitting(vec,cut)
n=1;
j=1;
for i=1:1:length(cut)
if cut(i)==0
spl{n}(j)=vec(i);
j=j+1;
else
n=n+1;
j=1;
end
end
end
Despite how simple my method is, it's in 2nd place for performance:
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.264428 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.407963 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 18.337940 seconds.
-------------------- SIMPLE
Elapsed time is 0.271942 seconds.
Unfortunately there is no 'inverse concatenate' in MATLAB. If you wish to solve a problem like this, you can try the code below. It will give you what you are looking for in the case where you have two split points, producing three vectors at the end. If you want more splits you will need to modify the code after the loop.
The results are in vector form. To make them into cells, use num2cell on the results.
pos_of_one = 0;
% The loop finds the split points and puts their positions into a vector.
for kk = 1 : length(cut)
if cut(1,kk) == 1
pos_of_one = pos_of_one + 1;
A(1,pos_of_one) = kk;
end
end
F = vec(1 : A(1,1) - 1);
G = vec(A(1,1) + 1 : A(1,2) - 1);
H = vec(A(1,2) + 1 : end);

Replace values of a variable for values of other variables Stata 13

I am in my first bout with Stata. I have never used it until this week and am trying to work through some examples. I have the following set of data:
contruse | educ_none | educ_prim | educ_secabove
1 | 0 | 1 | 0
0 | 1 | 0 | 0
...
I created the following variable with corresponding data set so that I could tab contruse with all different educations.
gen education=0
replace education=1 if educ_none==1
replace education=2 if educ_prim==1
replace education=3 if educ_secabove==1
replace education=. if educ_none==. | educ_prim==. | educ_secabove==.
tab education, missing
contruse | educ_none | educ_prim | educ_secabove | education
1 | 0 | 1 | 0 | 2
0 | 1 | 0 | 0 | 1
Basically is there a better way of doing this: for instance my varlist could be arbitrarily large and doing the above is painful. Is there a way of say reversing the following to work through multiple variables and give a single variable a value?
foreach x of varlist educ_none educ_prim educ_secabove {
replace `x' = . if var > 3
}
Can you automate this process? The answer is "No", because each component variable will have a unique suffix. So if you have "race_black" "race_hisp_nonw" "race_white", for example, you can't process the "education" and "race" variables in the same way. You also will have unique value labels to assign to each variable. See second answer below.
Two other issues:
Reading your example, it seems that for education there are exactly three categories. So you are initializing to a non-existent category.
Your treatment of the missings is possibly incorrect. You've set education to missing if any of its components is missing. It's possible that an interviewer correctly coded one of the component variables as "1" and left the other values blank (missing) when they should have been coded "0". Education for that observation should not be set to missing.
Here's my idea of code:
set linesize 100
clear
input id educ_none educ_prim educ_secabove
1 0 1 0
2 1 0 0
3 0 0 1
4 . 1 . /* Okay */
5 . . . /* Really Missing */
6 0 0 0 /* Really Missing */
7 . 1 1 /* Illegal */
end
egen etot = rowtotal(educ_*) /* = 1 for valid values */
foreach x of varlist educ_* {
/* Tentatively fix incorrect missings */
replace `x'= 0 if `x'==. & etot==1
}
list
gen education = 1 if educ_none==1
replace education=2 if educ_prim==1
replace education=3 if educ_secabove==1
/* Assign extended missing for illegal values*/
replace education = .a if etot >1 & etot<.
#delim ;
label define educl
1 "None"
2 "Primary"
3 "Secondary+"
.a ">1 indicator is 1"
;
#delim cr
label values education educl
list
tab education, missing
In addition to Steve Samuels' excellent suggestions, three standard devices in this territory are
A. Using recode. Check out its help.
B.
gen education = educ_none + 2 * educ_prim + 3 * educ_secabove
(which works if and only if at most one indicator is 1)
C.
gen education = cond(educ_secabove == 1, 3,
cond(educ_prim == 1, 2,
cond(educ_none == 1, 1)))
Notes:
C1. The code just above is one statement. The layout is just to help make the structure visible.
C2. Just as in elementary algebra, each left parenthesis ( implies a promise to match it by a right parenthesis ). Nesting calls to cond() doesn't change that.
C3. There is more on cond() at http://www.stata-journal.com/sjpdf.html?articlenum=pr0016
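To make the arithmetic behind device B concrete outside Stata, here is a minimal Python sketch. The function name combine_indicators is invented for illustration, and the sketch makes one simplifying assumption that differs from the first answer: any row containing a missing value (None here) is treated as missing rather than repaired.

```python
def combine_indicators(indicators):
    """Return the 1-based position of the single indicator equal to 1,
    or None when the row has a missing value or has zero or several 1s."""
    if any(v is None for v in indicators):
        return None
    if sum(indicators) != 1:
        return None
    # weighted sum 1*x1 + 2*x2 + ...: valid only because exactly one x is 1
    return sum(pos * v for pos, v in enumerate(indicators, start=1))

# columns: educ_none, educ_prim, educ_secabove
rows = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 1]]
for row in rows:
    print(row, "->", combine_indicators(row))
```

The explicit check that the indicators sum to 1 is what guards the weighted sum: with two indicators set, say educ_none and educ_prim, the bare sum would give 3 and be indistinguishable from educ_secabove.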
Automated approach 2014-06-02
After stating that the process of creating and labeling new variables can't be automated, I decided to try. I found two commands on SSC that help: Roger Newson's varlabdef and Daniel Klein's labvalch3. Both can be downloaded from within Stata, e.g. ssc install varlabdef.
I assume, as in the original example, that each 0-1 variable name is of the form "root_suffix", and that exactly one of the variables with the same root has value 1. The goal is to create a new variable for each root with a value that corresponds to the order of the indicator variable (if any) with value 1. The user first creates a local macro that contains all the roots. The program loops through the roots, with one variable created in each pass ; an inner loop implements Nick's solution (B); varlabdef creates value labels from the names of the original indicators; and labvalch3 strips off all but the suffix and capitalizes each item. This value label is then assigned to the new variable with a label values statement. Outside the loop, the new variables are given variable labels with label variable.
In the example that follows, there are two "roots", educ and gender. The variables with root "gender", for example, are gender_male and gender_female. A new variable gender is initialized, then assigned values 1 for males and 2 for females. A corresponding value label (also named "gender") is defined and associated with the new variable, and the variable itself is labeled "Gender".
clear
input id educ_none educ_prim educ_secabove gender_male gender_female
1 0 1 0 1 0
2 1 0 0 1 0
3 0 0 1 0 1
4 0 1 0 1 0
end
/* Create local macro to hold root names */
local roots educ gender
/* Loop over each root */
foreach v of local roots {
    qui gen `v' = 0 /* Initialize new variable from root */
    /* Get number of component variables */
    qui ds `v'_*
    local wc : word count `r(varlist)'
    /* Create new variables */
    forvalues k = 1/`wc' {
        /* z`k' is the k-th component variable */
        local z`k' : word `k' of `r(varlist)' /* extended macro */
        qui replace `v' = `v'+`k'*`z`k''
    }
    /* Total components to check for missing/illegal values */
    egen `v'tot = rowtotal(`v'_*)
    replace `v' = . if `v'tot != 1
    replace `v' = .a if `v'tot>1 & `v'tot<.
    /* Create value labels from variable names. Note that
       value labels can have the same names as
       the variables they label */
    /* Create a value label consisting of the component variable names */
    varlabdef `v', vlist(`v'_*) from(name)
    label define `v' .a "Illegal", add
    /* Remove the roots from the labels and capitalize */
    labvalch3 `v', subst("`v'_" "")
    labvalch3 `v', strfcn(proper("#"))
    /* Assign the value labels to the new variables */
    label values `v' `v'
}
/* Give nice labels to the new variables */
label var educ "Education"
label var gender "Gender"
label list
tab educ
tab gender
The results are:
. label list
gender:
           1 Male
           2 Female
          .a Illegal
educ:
           1 None
           2 Prim
           3 Secabove
          .a Illegal
. tab educ
  Education |      Freq.     Percent        Cum.
------------+-----------------------------------
       None |          1       25.00       25.00
       Prim |          2       50.00       75.00
   Secabove |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00
. tab gender
     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |          3       75.00       75.00
     Female |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00

Carrying out of loop so as to do the following operation

Consider an area of size m*n, where m and n are unknown. I am extracting data from each point in the area. I scan the area first in the x direction up to point m, then return to m=0 and n=1, i.e. the second row, and scan along the x direction again until the end of m. An example of the data is shown below; the scan gives a value for each x,y coordinate. I can carry out an operation between the first two points in the x direction with
p1 = A{1}; %%reading the data from the text file
p2 = A{2};
LA=[p1 p2];
for m=1:length(y)
p= LA(m,1);
t= LA(m,2);
%%and
q=LA(m+1,1)
r=LA(m+1,2)
I want to do the same for the y axis. That is, I want to operate between the first point at x=0 and y=1, then between x=2 and y=1, and so on. I hope that is clear.
g x y
2 0 0
3 1 0
2 2 0
4 3 0
1 4 0
2 m 0
3 0 1
2 1 1
4 2 1
5 3 1
.
.
.
.
2 m 1
Now I was thinking of a logic where I first find the size of n by counting the number of zeros
NUMX = 0;
while y((NUMX+1),:) == 0
NUMX = NUMX + 1;
end
NU= NUMX;
And then I was thinking of applying the following loop
for m=1:NU:n-1
%%and
p= LA(m,1);
t= LA(m,2);
%%and
q=LA(m+1,1)
r=LA(m+1,2)
But it's showing an error. Please help!
??? Attempted to access del2(99794,:); index out of bounds because
size(del2)=[99793,1].
Here NUMX=198
Comment: The nomenclature in your question is inconsistent, making it difficult to understand what you are doing. The variable del2 you mention in the error message is nowhere to be seen.
1.) Let's start off by creating a minimal working example that illustrates the data structure and provides knowledge of the dimensions we want to retrieve later. Your matrix is not m x n but m*n x 3.
The following example will set up a matrix with data similar to what you have shown in your question:
M = zeros(8,3);
for J=1:4
    for I=1:2
        M((J-1)*2+I,1) = rand(1);
        M((J-1)*2+I,2) = I;
        M((J-1)*2+I,3) = J-1;
    end
end
M =
0.469 1 0
0.012 2 0
0.337 1 1
0.162 2 1
0.794 1 2
0.311 2 2
0.529 1 3
0.166 2 3
2.) Next, let's determine the number of x and y, to use the nomenclature of your question:
NUMX = 0;
while M(NUMX+1,3) == 0
    NUMX = NUMX + 1;
end
NUMY = size(M,1)/NUMX;
NUMX =
2
NUMY =
4
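For what it's worth, the same counts can be obtained without an explicit while loop. The following is a minimal sketch, assuming (as in the example above) that y is stored in the third column and that all rows sharing the first y value belong to one scan line. The matrix is retyped with fixed numbers in place of rand so that the result is reproducible.

```matlab
% Example matrix from above with fixed values; columns are value, x, y
M = [0.469 1 0; 0.012 2 0; 0.337 1 1; 0.162 2 1; ...
     0.794 1 2; 0.311 2 2; 0.529 1 3; 0.166 2 3];

% Count the rows whose y equals the y of the first row (one scan line)
NUMX = sum(M(:,3) == M(1,3));

% The remaining dimension follows from the total number of rows
NUMY = size(M,1) / NUMX;
```

Unlike the while loop, this does not require the first y value to be 0, but it does assume that each y value occurs in exactly one scan line.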
3.) The data processing you want to do is still unclear, but here are two approaches that can be used for different purposes:
(a)
COUNT = 1;
for K=1:NUMX:size(M,1)
    A(COUNT,1) = M(K,1);
    COUNT = COUNT + 1;
end
In this case, you step through the first column of M with a step-size corresponding to NUMX. This will result in all the values for x=1:
A =
0.469
0.337
0.794
0.529
(b) You can also use NUMX and NUMY to reorder M:
for J=1:NUMY
    for I=1:NUMX
        NEW_M(I,J) = M((J-1)*NUMX+I,1);
    end
end
NEW_M =
0.469 0.337 0.794 0.529
0.012 0.162 0.311 0.166
The matrix NEW_M now is of size m x n, with the values of constant y in the columns and the values of constant x in the rows.
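Both loops in (a) and (b) can also be written as single indexing operations. This is only a sketch under the same assumptions as the example (rows ordered with x running fastest, NUMX and NUMY already known, fixed numbers in place of rand). The diff calls are an illustration that assumes the operation between neighbouring points is a simple difference, which the question does not actually specify.

```matlab
% Example matrix from above with fixed values; columns are value, x, y
M = [0.469 1 0; 0.012 2 0; 0.337 1 1; 0.162 2 1; ...
     0.794 1 2; 0.311 2 2; 0.529 1 3; 0.166 2 3];
NUMX = 2;
NUMY = 4;

% (a) every NUMX-th row of the first column, i.e. all values for x=1
A = M(1:NUMX:end, 1);

% (b) reshape fills column by column, so each column holds one y value
NEW_M = reshape(M(:,1), NUMX, NUMY);

% Assumed operation: differences between neighbouring grid points
DX = diff(NEW_M, 1, 1);   % along x (down the rows)
DY = diff(NEW_M, 1, 2);   % along y (across the columns)
```

Here DX has one row fewer than NEW_M and DY has one column fewer, since each entry combines a point with its next neighbour.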
Concluding remark: It is unclear how you define m and n in your code, so your specific error message cannot be resolved here.
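That said, although the exact cause cannot be confirmed from the question, an index that is one past the array length is what a loop produces when it reads row m+1 while m runs up to the last row. A minimal sketch of the usual guard follows. LA here is a small hypothetical stand-in for the data, and the subtraction is a placeholder for whatever operation is intended.

```matlab
% Hypothetical stand-in for the data: columns are g and x
LA = [2 0; 3 1; 2 2; 4 3; 1 4];
N = size(LA, 1);

D = zeros(N-1, 1);                 % one result per pair of neighbours
for m = 1:N-1                      % stop at N-1 so that m+1 never exceeds N
    D(m) = LA(m+1,1) - LA(m,1);    % placeholder operation on rows m and m+1
end
```

Running the loop up to N instead would attempt to read LA(N+1,1) and fail with an index out of bounds error.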

Resources