Reordering dataset in stata - database

We have a dataset of 1222x20 in Stata.
There are 611 individuals, such that each individual is in 2 rows of the dataset. There is only one variable of interest in each second row of each individual that we would like to use.
This means that we want a dataset of 611x21 that we need for our analysis.
It might also help if we could discard each odd/even row, and merge it later.
However, my Stata skills let me down at this point and I hope someone can help us.
Maybe someone knows any command or menu option that we might give a try.
If someone knows such a code, the individuals are characterized by the variable rescode, and the variable of interest on the second row is called enterprise.
Below, the head of our dataset is given. There is a binary time variable followup, where we want to regress the enterprise(yes/no) as dependent variable at time followup = followup onto enterprise as independent variable at time followup = baseline
We have tried something like this:
reg enterprise(if followup="Folowup") i.aimag group loan_baseline eduvoc edusec age16 under16 marr_cohab age age_sq buddhist hahl sep_f nov_f enterprise(if followup ="Baseline"), vce(cluster soum)

followup is a numeric variable with value labels, as its colouring in the Data Editor makes clear, so you can't test its values directly for equality or inequality with literal strings. (And if you could, the match needs to be exact, as Folowup would not be read as implying Followup.)
There is a syntax for identifying observations by value labels: see [U] 13.11 in the pdf documentation or https://www.stata-journal.com/article.html?article=dm0009.
However, it is usually easiest just to use the numeric value underneath the value label. So if the variable followup had numeric values 0 and 1, you would test for equality with 0 or 1.
You must use == not = for testing for equality here:
... if followup == 1
For any future Stata questions, please see the Stata tag wiki for detailed advice on how to present data. Screenshots are usually difficult to read and impossible to copy, and leave many details about the data obscure.

Related

(SPSS) Assign values to remaining time point based on value on another variable, and looping for each case

I am currently working on analyzing a within-subject dataset with 8 time-ordered assessment points for each subject.
The variables of interest in this example is ID, time point, and accident.
I want to create two variables: accident_intercept and accident_slope, based on the value on accident at a particular time point.
For the accident_intercept variable, once a participant indicated the occurrence of an accident (e.g., accident = 1) at a specific time point, I want the values for that time point and the remaining time points to be 1.
For the accident_slope variable, once a participant indicated the occurrence of an accident (e.g., accident = 1) at a specific time point, I want the value of that time point to be 0, but count up by 1 for the remaining time points until the end time point, for each subject.
The main challenge here is that the process stated above need to be repeated/looped for each participant that occupies 8 rows of data.
Please see how the newly created variables would look like:
I have looked into the instruction for different SPSS syntax, such as loop, the lag/lead functions. I also tried to break my task into different components and google each one. However, I have not made any progress :)
I would be really grateful of any helps and directions that you provide.
Here is one way to do what you need using aggregate to calculate "accident time":
if accident=1 accidentTime=TimePoint.
aggregate out=* mode=addvariables overwrite=yes /break=ID/accidentTime=max(accidentTime).
if TimePoint>=accidentTime Accident_Intercept=1.
if TimePoint>=accidentTime Accident_Slope=TimePoint-accidentTime.
recode Accident_Slope accidentTime (miss=0).
Here is another approach using the lag function:
compute Accident_Intercept=0.
if accident=1 Accident_Intercept=1.
if $casenum>1 and id=lag(id) and lag(Accident_Intercept)=1 Accident_Intercept=1.
compute Accident_Slope=0.
if $casenum>1 and id=lag(id) and lag(Accident_Intercept)=1 Accident_Slope=lag(Accident_Slope) +1.
exe.

What is an efficient way to cross-compare two different data pairs in Excel to spot differences?

Summary
I am looking to compare two data sets within Excel, and produce an output depending on which has changed, and what to.
More info
I hold two databases, which are updated independently. I cross compare these databases monthly, to see which database(s) have changed, and who holds the most accurate data. The other database is then amended to reflect the correct value. I am trying to automate the process of deciding which database needs to be updated. I'm comparing not just data change, but data change over time.
Example
On month 1, database 1 contains the value "Foo". Database 2 also contains the value "Foo". On month 2, database 1 now contains the value "Bar", but database 2 still contains the value "Foo". I can ascertain that because database 1 holds a different value, but last month they held the same value, database 1 has been updated, and database 2 should be updated to reflect this.
Table Example
Data1 Month1
Data2 Month1
Data1 Month2
Data2 Month2
Database to update
Reason
Foo
Foo
Foo
Foo
None
All match
Apple
Apple
Orange
Apple
Data2
Data1 has new data when they did match previously. Data2 needs to be updated with the new info.
Cat
Dog
Dog
Dog
None
They mismatched previously, but both databases now match.
1
1
1
2
Data1
Data2 has new data when they did match previously. Data1 needs to be updated with the new info.
AAA
BBB
AAA
BBB
CHECK
Both databases should match, but you cannot ascertain which should be updated.
ABC
ABC
DEF
GHI
CHECK
Both databases changed, but you cannot tell if Data1 or Data2 is correct as they were updated at the same time.
Current logic
Currently, I'm trying to get this to work using multiple nested =IF statements, combined with some =AND and =NOT statements. Essentially, an example part of the statement would be (database 1, month 1 = DB1M1, etc.): =IF(AND(DB1M1=DB2M1,DB2M1=DB2M2),"None",IF(AND(DB1M1=DB2M1,DB1M1=DB2M2,NOT(DB2M1=DB1M2)),"Data2",IF(ETC,ETC,ETC).
I've had some success with this, but due to the length of the statement, it is very messy and I'm struggling to make it work, as it becomes unreadable for me trying to calculate the possible outcomes in just =IF clauses. I also have no doubt it's incredibly inefficient, and I'd like to make it more efficient, especially considering the size of the database is around 10,000 lines.
Final Notes / Info
I'd appreciate any help with getting this to work. I'm keen to learn, so any tips and advice are always welcomed.
I'm using MSO 365, version 2202 (I cannot update beyond this). This will be run in the Desktop version of Excel. I would prefer this is done exclusively using formulas, but I am open to using Visual Basic if it would be otherwise impossible or incredibly inefficient. Thanks!
In previous similar scenarios, it sounds me familiar using bitwise operations or binary numbers. The main idea behind a binary number is that each digit can act as flag indicating if certain property is present or not.
The goal is to identify if two databases (DB1, DB2) are on sync based on a given value over two periods (M1, M2). If one database is out of sync we would like to know which action to carry out to have it on sync with respect to the other database. Similarly we would like to know when both databases are out of sync at the end of the period.
Here is the Excel solution in cell M2, then extend down the formula:
=LET(dec,
BIN2DEC(IF(B2=C2,0,1)&IF(D2=E2,0,1)&IF(B2=D2,0,1)&IF(C2=E2,0,1)),
DBsOnSync, ISNUMBER(FIND(dec, "0;10;3;9;11")),
DBsOutOfSync, ISNUMBER(FIND(dec, "7;12;13;14;15")),
IFERROR(IFS(dec=5,"Update DB1", dec=6,"Update DB2", DBsOnSync=TRUE,
"DBs on Sync", DBsOutOfSync=TRUE, "DBs out of Sync"), "Case not defined")
)
The input table tries to consider all possible combinations, so we can build the logic. The highlighted columns are not really necessary, it is just for illustrative or testing purpose. In red combinations already defined previously so it is no really necessary to take into account.
Explanation
We build a binary number based on the following conditions for each binary digit. This is just an intermediate result to convert it to a decimal number via BIN2DEC and determine the case for each possible value.
BIN2DEC(IF(B2=C2,0,1)&IF(D2=E2,0,1)&IF(B2=D2,0,1)&IF(C2=E2,0,1))
We have four conditions, so we build a binary number of length 4, where each digit represent a flag condition (0-equal, 1-not equal).
We build the binary number that will be the input for BIN2DEC via concatenation of the logical conditions we are looking for. Each IF condition represents a binary digit from left to right:
IF(B2=C2,0,1) check for DB1, DB2 are consistent in M1 (intermediate calculation shown in column M1).
IF(D2=E2,0,1) check for DB1, DB2 are consistent in M2 (intermediate calculation shown in column M2).
IF(B2=D2,0,1) DB1 keeps consistency over time (intermediate calculation shown in column DB1).
IF(C2=E2,0,1) DB2 keeps consistency over time (intermediate calculation shown in column DB2).
Converting the binary number to decimal, we can identify each case, assigning a set of decimal numbers. The following decimal number or set of decimal numbers represent each case:
dec
Scenario
0,10,3,9,11
DBs on Sync
5
Update DB1
6
Update DB2
7,12,13,14,15
DBs out of sync
We use IFS and FIND to identify each case based on dec value. We use FIND to find dec in the string that represents the set of possible numbers for each case. We use ISNUMBER to check whether the number was found or not. We include as last resource, for testing purpose, if some case was not defined yet, it returns Case not defined.
Notes
Columns F:I give a hint about maximum number of possible combinations. We have four columns, with only two possible values: Sync, NotSync. Which represents 2*2*2*2=16 combinations, which represents the maximum possible binary numbers of size 4 we can have (we have four conditions).
As you can see from the screenshot we have less numbers of unique combinations (12). The reason is because the way we build the binary numbers they have dependencies, so some combination are impossible.

Looping and selectig different value in each line

I have a problem that I think would be solved relatively quickly with a loop. I have to work with SPSS and I think it can only be solved in syntax.
Unfortunately I am not good with loops, so I hope that one of you can help me.
I have done a study on reasons for abortions. Now I would like to present the distribution of reasons.
The problem is that each person was first asked about all their pregnancies (because this is also relevant for the later analysis), then the pregnancy was determined to which the questionnaire will further refer.
So the further questionnaire was only about one of the pregnancies, whereas the first questions (f.ex. year of pregnancy, reason for abortion) were answered for each pregnancy. For the reasons I only need the information that refers to the pregnancy that was also used for the further questionnaire.
I have an index variable that determines the loop at which pass the relevant pregnancy is asked ("index"). Then I have the variable "Loop_1_R" to "Loop_5_R" which queries the reasons for each up to 5 abortions (of course, for each woman, only the number of pregnancies that she also indicated). In between there are some missing data, for ex. it could be that a woman said that she had 5 pregnancies, but only two of them were abortions (f.ex. the third and fifth). So then she would only give reasons for an abortion in loop3 and loop5.
Now I want to create a new variable which contains only the reason which refers to the relevant pregnancy. So for each woman only one value. I was thinking, you could build a loop in the sense of calculate new variable in such a way that loop i is taken at index i.
I could of course do it by hand, but with a VPN count of over 3000 it will obviously take considerably longer.
I hope someone can help me! This is an example dataset with less loops and VPN:
You can use do repeat to loop and catch the value you need this way:
do repeat vr=Loop_1_R to Loop_5_R/vl=1 to 5.
if Index=vl reason=vr.
end repeat.

Is there a way to turn my data from string variables to numeric?

I imported my data into Stata, and the program is reading some of the variables as strings, but not all of them. And I cannot understand what I did wrong, as some variables are being read as numbers. Is there a way in Stata to turn the string into numeric?
destring is intended for this situation, but the real question is why Stata read your variables as string when you think they should be numeric.
Some of the reasons commonly met are
There is metadata in your data, especially if the data were read in from a file that has spent time in a spreadsheet. Rows of header information or endnotes can cause this problem.
A missing data code has been used that Stata doesn't recognise, say NA for missing.
Decimal points are indicated by say commas, not stops or periods.
The options of destring are often critical, as you may need to spell out what should be done. So, study the help for destring.
If a variable to you should be numeric, but it's not clear why not, something like
tab myvar if missing(real(myvar))
shows the kinds of values of myvar that can't be converted easily. Very often it becomes clear quickly that there is one repeated problem for which there is one overall fix.

Ive got a pipe that consists of 5 pieces, each including 5 properties

Inlet -> front -> middle -> rear -> outlet
Those five properties have a value anything between 4 - 40. Now i want to calculate a specific match for each of those values that is either a full 10 or a 5 when a single property is summed from each pipe piece. There might be hundreds of different pipe pieces all with different properties.
So if i have all 5 pieces and when summed, their properties go like 54,51,23,71,37. That is not good and not what im looking.
Instead 55,50,25,70,40. That would be perfect.
My trouble is there are so many of the pieces that it would be insane to do the miss'matching manually, and new ones come up frequently.
I have manually inserted about 100 of these already into SQLite, but should be easy to convert into any excel or other database formats, so answer can be related to anything like mysql or googlesheets.
I need the calculation that takes every piece in account and results either in "no match" or tells me the id of each piece that is required for a match and if multiple matches are available, it separates them.
Edit: Even just the math needed to do this kind of calculation would be a lot of help here, not much of a math guy myself. I guess there should be a reference piece i need to use and then that gets checked against every possible scenario.
If the value you want to verify is in A1, use: =ROUND(A1/5,0)*5
If the pipes may not be shorter than the given values, use =CEILING(A1,5)

Resources