How to get the number of "yes" values across a group of variables

I have a dataset in Stata with information on whether or not a participant has experienced a specific side effect. The dataset contains about 900 participants and there are 20 side effects captured for each participant. The values for each side effect are "0" denoting the participant did not experience the side effect and "1" denoting the participant did experience the side effect. I want to count the number of "1"s for each participant and ultimately summarize the number of participants that experienced one side effect, two side effects, three side effects, etc.
I tried the count if command but that does not allow me to get the number of "1"s for each participant across the 20 side effects.

If your (0, 1) variables are all numeric then the number of values of 1 is identical to the row sum across variables. Hence
egen wanted = rowtotal(sideeffect*)
where sideeffect* is a wild guess on variable names, as you don't give a data example or mention any names.
Putting 0 and 1 within " " in your question may indicate that you are holding these as string variables. If so, you need to destring those variables first.
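As a language-neutral check of the logic above, here is a minimal Python sketch of the same idea: for 0/1 indicators the row sum is the count of 1s, and a frequency table of those sums gives the number of participants with one, two, three side effects, and so on. The data and the names rows, n_effects and distribution are made up for illustration; in Stata itself, tabulating the new variable (for example with tabulate wanted) would give the same kind of summary.

```python
from collections import Counter

# Hypothetical data: one row per participant, one 0/1 flag per side effect.
rows = [
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
]

# For 0/1 indicators, the row sum equals the number of 1s in that row.
n_effects = [sum(row) for row in rows]

# Tabulate how many participants experienced 0, 1, 2, ... side effects.
distribution = Counter(n_effects)

for k in sorted(distribution):
    print(f"{k} side effect(s): {distribution[k]} participant(s)")
```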

(SPSS) Assign values to remaining time point based on value on another variable, and looping for each case

I am currently working on analyzing a within-subject dataset with 8 time-ordered assessment points for each subject.
The variables of interest in this example are ID, time point, and accident.
I want to create two variables: accident_intercept and accident_slope, based on the value on accident at a particular time point.
For the accident_intercept variable, once a participant indicated the occurrence of an accident (e.g., accident = 1) at a specific time point, I want the values for that time point and the remaining time points to be 1.
For the accident_slope variable, once a participant indicated the occurrence of an accident (e.g., accident = 1) at a specific time point, I want the value of that time point to be 0, but count up by 1 for the remaining time points until the end time point, for each subject.
The main challenge here is that the process stated above needs to be repeated/looped for each participant, and each participant occupies 8 rows of data.
Please see what the newly created variables would look like:
I have looked into the documentation for different SPSS syntax, such as LOOP and the lag/lead functions. I also tried to break my task into different components and google each one. However, I have not made any progress :)
I would be really grateful for any help and direction that you can provide.

Here is one way to do what you need using aggregate to calculate "accident time":
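* Record the time point at which an accident is reported.
if accident=1 accidentTime=TimePoint.
* Spread the earliest accident time to every row of the same ID.
aggregate out=* mode=addvariables overwrite=yes /break=ID/accidentTime=min(accidentTime).
if TimePoint>=accidentTime Accident_Intercept=1.
if TimePoint>=accidentTime Accident_Slope=TimePoint-accidentTime.
* Rows before the accident, or with no accident at all, are still missing, so set them to 0.
recode Accident_Intercept Accident_Slope (miss=0).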
Here is another approach using the lag function:
compute Accident_Intercept=0.
if accident=1 Accident_Intercept=1.
if $casenum>1 and id=lag(id) and lag(Accident_Intercept)=1 Accident_Intercept=1.
compute Accident_Slope=0.
if $casenum>1 and id=lag(id) and lag(Accident_Intercept)=1 Accident_Slope=lag(Accident_Slope) +1.
exe.
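
To make the intended result concrete, here is a small Python sketch of the same logic (not SPSS syntax): find each subject's first accident time point, then derive both variables from the distance to it. The sample records, with five time points instead of eight, and the names records, accident_time and result are assumptions made up for illustration.

```python
# Hypothetical long-format data: (ID, TimePoint, accident).
records = [
    (1, 1, 0), (1, 2, 0), (1, 3, 1), (1, 4, 0), (1, 5, 0),
    (2, 1, 0), (2, 2, 0), (2, 3, 0), (2, 4, 0), (2, 5, 0),
]

# Step 1: earliest time point at which each subject reports an accident.
accident_time = {}
for subject, time_point, accident in records:
    if accident == 1:
        earliest = accident_time.get(subject, time_point)
        accident_time[subject] = min(earliest, time_point)

# Step 2: from that time point onward, intercept is 1 and slope counts up from 0.
result = []
for subject, time_point, accident in records:
    start = accident_time.get(subject)
    if start is not None and time_point >= start:
        intercept, slope = 1, time_point - start
    else:
        intercept, slope = 0, 0
    result.append((subject, time_point, intercept, slope))

for row in result:
    print(row)
```

Subjects who never report an accident, such as ID 2 here, keep 0 on both variables for every time point.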

What is an efficient way to cross-compare two different data pairs in Excel to spot differences?

Summary
I am looking to compare two data sets within Excel, and produce an output depending on which has changed, and what to.
More info
I hold two databases, which are updated independently. I cross compare these databases monthly, to see which database(s) have changed, and who holds the most accurate data. The other database is then amended to reflect the correct value. I am trying to automate the process of deciding which database needs to be updated. I'm comparing not just data change, but data change over time.
Example
On month 1, database 1 contains the value "Foo". Database 2 also contains the value "Foo". On month 2, database 1 now contains the value "Bar", but database 2 still contains the value "Foo". I can ascertain that because database 1 holds a different value, but last month they held the same value, database 1 has been updated, and database 2 should be updated to reflect this.
Table Example
Data1 Month1 | Data2 Month1 | Data1 Month2 | Data2 Month2 | Database to update | Reason
Foo | Foo | Foo | Foo | None | All match
Apple | Apple | Orange | Apple | Data2 | Data1 has new data when they did match previously. Data2 needs to be updated with the new info.
Cat | Dog | Dog | Dog | None | They mismatched previously, but both databases now match.
1 | 1 | 1 | 2 | Data1 | Data2 has new data when they did match previously. Data1 needs to be updated with the new info.
AAA | BBB | AAA | BBB | CHECK | Both databases should match, but you cannot ascertain which should be updated.
ABC | ABC | DEF | GHI | CHECK | Both databases changed, but you cannot tell if Data1 or Data2 is correct as they were updated at the same time.
Current logic
Currently, I'm trying to get this to work using multiple nested =IF statements, combined with some =AND and =NOT statements. Essentially, an example part of the statement would be (database 1, month 1 = DB1M1, etc.): =IF(AND(DB1M1=DB2M1,DB2M1=DB2M2),"None",IF(AND(DB1M1=DB2M1,DB1M1=DB2M2,NOT(DB2M1=DB1M2)),"Data2",IF(ETC,ETC,ETC).
I've had some success with this, but due to the length of the statement, it is very messy and I'm struggling to make it work, as it becomes unreadable for me trying to calculate the possible outcomes in just =IF clauses. I also have no doubt it's incredibly inefficient, and I'd like to make it more efficient, especially considering the size of the database is around 10,000 lines.
Final Notes / Info
I'd appreciate any help with getting this to work. I'm keen to learn, so any tips and advice are always welcomed.
I'm using MSO 365, version 2202 (I cannot update beyond this). This will be run in the Desktop version of Excel. I would prefer this is done exclusively using formulas, but I am open to using Visual Basic if it would be otherwise impossible or incredibly inefficient. Thanks!

In similar scenarios I have found it useful to think in terms of bitwise operations or binary numbers. The main idea behind a binary number is that each digit can act as a flag indicating whether a certain property is present or not.
The goal is to identify whether two databases (DB1, DB2) are in sync based on a given value over two periods (M1, M2). If one database is out of sync, we would like to know which one to update to bring it in line with the other. Similarly, we would like to know when both databases are out of sync at the end of the period.
Here is the Excel formula for cell M2; fill it down for the remaining rows:
=LET(dec,
BIN2DEC(IF(B2=C2,0,1)&IF(D2=E2,0,1)&IF(B2=D2,0,1)&IF(C2=E2,0,1)),
DBsOnSync, ISNUMBER(FIND(dec, "0;10;3;9;11")),
DBsOutOfSync, ISNUMBER(FIND(dec, "7;12;13;14;15")),
IFERROR(IFS(dec=5,"Update DB1", dec=6,"Update DB2", DBsOnSync=TRUE,
"DBs on Sync", DBsOutOfSync=TRUE, "DBs out of Sync"), "Case not defined")
)
The input table tries to cover all possible combinations, so that we can build the logic. The highlighted columns are not strictly necessary; they are there only for illustration and testing. Combinations shown in red repeat cases already defined earlier, so they do not need to be considered again.
Explanation
We build a binary number based on the following conditions for each binary digit. This is just an intermediate result to convert it to a decimal number via BIN2DEC and determine the case for each possible value.
BIN2DEC(IF(B2=C2,0,1)&IF(D2=E2,0,1)&IF(B2=D2,0,1)&IF(C2=E2,0,1))
We have four conditions, so we build a binary number of length 4, where each digit represents a flag condition (0 = equal, 1 = not equal).
We build the binary number that is the input for BIN2DEC by concatenating the logical conditions we are looking for. Each IF condition represents a binary digit, from left to right:
IF(B2=C2,0,1) checks whether DB1 and DB2 are consistent in M1 (intermediate calculation shown in column M1).
IF(D2=E2,0,1) checks whether DB1 and DB2 are consistent in M2 (intermediate calculation shown in column M2).
IF(B2=D2,0,1) checks whether DB1 stays consistent over time (intermediate calculation shown in column DB1).
IF(C2=E2,0,1) checks whether DB2 stays consistent over time (intermediate calculation shown in column DB2).
Converting the binary number to decimal, we can identify each case, assigning a set of decimal numbers. The following decimal number or set of decimal numbers represent each case:
dec | Scenario
0,10,3,9,11 | DBs on Sync
5 | Update DB1
6 | Update DB2
7,12,13,14,15 | DBs out of sync
We use IFS and FIND to identify each case based on the value of dec. FIND looks for dec in the string that lists the possible numbers for each case, and ISNUMBER checks whether it was found. As a last resort, for testing purposes, a value that matches none of the defined cases returns Case not defined.
Notes
Columns F:I give a hint about the maximum number of possible combinations. We have four columns, each with only two possible values (Sync, NotSync), which gives 2*2*2*2=16 combinations: the maximum number of distinct 4-digit binary numbers we can have (one digit per condition).
As you can see from the screenshot, we have fewer unique combinations (12). The reason is that the four conditions depend on one another, so some combinations are impossible. For example, if the databases match in both months and DB1 did not change, then DB2 cannot have changed either, so dec=1 never occurs.
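
To check the mapping outside Excel, the flag logic can be sketched in Python. This is only an illustration under assumptions: the function name classify and its parameter names are made up here, and the column layout (B = DB1 month 1, C = DB2 month 1, D = DB1 month 2, E = DB2 month 2) is inferred from the explanation above. The example rows are the ones from the question's table.

```python
from itertools import product

IN_SYNC = {0, 3, 9, 10, 11}
OUT_OF_SYNC = {7, 12, 13, 14, 15}

def classify(db1_m1, db2_m1, db1_m2, db2_m2):
    """Return (dec, label) for one row, mirroring the four IF flags."""
    flags = [
        db1_m1 != db2_m1,  # B2 vs C2: databases differ in month 1
        db1_m2 != db2_m2,  # D2 vs E2: databases differ in month 2
        db1_m1 != db1_m2,  # B2 vs D2: DB1 changed between months
        db2_m1 != db2_m2,  # C2 vs E2: DB2 changed between months
    ]
    dec = 0
    for flag in flags:  # the leftmost flag is the most significant bit
        dec = dec * 2 + int(flag)
    if dec == 5:
        label = "Update DB1"
    elif dec == 6:
        label = "Update DB2"
    elif dec in IN_SYNC:
        label = "DBs on Sync"
    elif dec in OUT_OF_SYNC:
        label = "DBs out of Sync"
    else:
        label = "Case not defined"
    return dec, label

# Rows from the question's example table.
examples = [
    ("Foo", "Foo", "Foo", "Foo"),
    ("Apple", "Apple", "Orange", "Apple"),
    ("Cat", "Dog", "Dog", "Dog"),
    (1, 1, 1, 2),
    ("AAA", "BBB", "AAA", "BBB"),
    ("ABC", "ABC", "DEF", "GHI"),
]
for row in examples:
    print(row, classify(*row))

# Enumerate which codes can occur at all, trying every row built from three values.
reachable = sorted({classify(*cells)[0] for cells in product("xyz", repeat=4)})
print(len(reachable), reachable)
```

Testing membership in a set, rather than searching for dec inside a delimited string, avoids accidental substring matches such as 1 being found inside 10. The enumeration at the end lists 12 codes; 1, 2, 4 and 8 never occur.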

Reordering dataset in stata

We have a dataset of 1222x20 in Stata.
There are 611 individuals, such that each individual is in 2 rows of the dataset. There is only one variable of interest in each second row of each individual that we would like to use.
This means that we want a dataset of 611x21 that we need for our analysis.
It might also help if we could discard each odd/even row, and merge it later.
However, my Stata skills let me down at this point and I hope someone can help us.
Maybe someone knows any command or menu option that we might give a try.
If someone knows such a code, the individuals are characterized by the variable rescode, and the variable of interest on the second row is called enterprise.
Below, the head of our dataset is given. There is a binary time variable followup. We want to regress enterprise (yes/no) at the follow-up time point, as the dependent variable, on enterprise at the baseline time point, as an independent variable.
We have tried something like this:
reg enterprise(if followup="Folowup") i.aimag group loan_baseline eduvoc edusec age16 under16 marr_cohab age age_sq buddhist hahl sep_f nov_f enterprise(if followup ="Baseline"), vce(cluster soum)

followup is a numeric variable with value labels, as its colouring in the Data Editor makes clear, so you can't test its values directly for equality or inequality with literal strings. (And if you could, the match needs to be exact, as Folowup would not be read as implying Followup.)
There is a syntax for identifying observations by value labels: see [U] 13.11 in the pdf documentation or https://www.stata-journal.com/article.html?article=dm0009.
However, it is usually easiest just to use the numeric value underneath the value label. So if the variable followup had numeric values 0 and 1, you would test for equality with 0 or 1.
You must use == not = for testing for equality here:
... if followup == 1
For any future Stata questions, please see the Stata tag wiki for detailed advice on how to present data. Screenshots are usually difficult to read and impossible to copy, and leave many details about the data obscure.

Single storage value that can be interpreted to mean different things

I want to see if there is a way to store a single value in one column in a database, but when read it could be interpreted as four columns in a database.
I have a square, which has four sides. I would like to store a value that tells me which sides should have a dashed line. When this value is read, I should be able to decipher it easily: for example, the left and right sides should be dashed, or just the top side, or all sides, or none.
I know I could have say 17 options I could store, but is there an easier way with a number? I will have four buttons in the interface showing which sides they want dotted and will store a value that way.
Is there a way to show me this in some pseudo-code in the storing and reinterpreting parts?

There are 16 possibilities for the sides--each side has two possibilities, and each decision is independent from the others, so you multiply the numbers together, getting 2*2*2*2=16.
You could just store an integer between 0 and 15. To get the choice for the left side, calculate n & 1 where n is the stored number. (The "&" symbol means "bitwise and" in many computer languages.) A nonzero result means dashed, while 0 means solid. To get the right side, calculate n & 2; the top is n & 4, and the bottom is n & 8. Note that the "magic numbers" in those formulas are 2**0, 2**1, 2**2, and 2**3.
You can easily go the other way. If your choices expressed as 0 or 1 are a,b,c,d, then n = 1*a + 2*b + 4*c + 8*d. Note the same magic numbers as in the previous paragraph.
If you know the binary number system, those formulas are obvious. If you want to learn more, do a web search on "binary number".
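The two formulas above can be sketched in Python as a pair of helper functions. The names encode, decode and the constants LEFT, RIGHT, TOP, BOTTOM are hypothetical, chosen only for this illustration; any language with a bitwise "and" operator would work the same way.

```python
# One bit per side: 2**0, 2**1, 2**2, 2**3.
LEFT, RIGHT, TOP, BOTTOM = 1, 2, 4, 8

def encode(left, right, top, bottom):
    """Pack four dashed/solid choices (True means dashed) into one integer 0..15."""
    return (LEFT * int(left) + RIGHT * int(right)
            + TOP * int(top) + BOTTOM * int(bottom))

def decode(n):
    """Unpack the stored integer; a nonzero result of n & mask means dashed."""
    return {
        "left": bool(n & LEFT),
        "right": bool(n & RIGHT),
        "top": bool(n & TOP),
        "bottom": bool(n & BOTTOM),
    }

stored = encode(left=True, right=True, top=False, bottom=False)
print(stored)          # value to write to the database column
print(decode(stored))  # choices recovered when the value is read back
```

Here the left and right sides are dashed, so the stored value is 1 + 2 = 3, and decoding 3 gives back the same four choices.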
Of course, another way to solve your problem is to store a string value rather than an integer. For example, you could store
"left=dashed, right=solid, top=solid, bottom=dashed"
and use the string-handling facilities of your language to parse the string. This is a more lengthy process than my first method, and uses more memory for the storage, but it has the advantage of being more transparent and easy to understand for anyone examining the data.
Your idea in a comment of using a short string like "1101" is a good one, and depending on the implementation it may even use less memory than my first solution. It shows the same amount of information as my first solution and less than my second. The implementation of your idea and the others depends on the language and on how you store the decision for each side in a variable. In Python, if n is the stored (string) value and the decisions for each side are Boolean variables a,b,c,d (True means dashed, False means solid), your code could be
a = (n[0] == '1')
b = (n[1] == '1')
c = (n[2] == '1')
d = (n[3] == '1')
There are shorter ways to do this in Python but they are less understandable to non-Python programmers.

Maximal Number of Functional Dependencies (including trivial)

In a relationship R of N attributes how many functional dependencies are there (including trivial)?
I know that trivial dependencies are those where the right hand side is a subset of the left hand side but I'm not sure how to calculate the upper bound on the dependencies.
Any information regarding the answer and the approach to it would be greatly appreciated.

The maximum possible number of functional dependencies is
the number of possible left-hand sides * the number of possible right-hand sides
We're including trivial functional dependencies, so the number of possible left-hand sides equals the number of possible right-hand sides. So this simplifies to
(the number of possible left-hand sides)^2
Let's say you have R{∅AB}. There are three attributes [1]. The number of possible left-hand sides is
combinations of 3 attributes taken 1 at a time, plus
combinations of 3 attributes taken 2 at a time, plus
combinations of 3 attributes taken 3 at a time
which equals 3+3+1, or 7. So there are at most 7^2 possible functional dependencies for any R having three attributes: 49. The order of attributes doesn't matter, so we use formulas for combinations, not for permutations.
If you start with R{∅ABC}, you have
combinations of 4 attributes taken 1 at a time, plus
combinations of 4 attributes taken 2 at a time, plus
combinations of 4 attributes taken 3 at a time, plus
combinations of 4 attributes taken 4 at a time
which equals 4+6+4+1, or 15. So there are at most 15^2 possible functional dependencies for any R having four attributes: 225.
Once you know this formula, these calculations are simple using a spreadsheet. It's also pretty easy to write a program to generate every possible functional dependency using a scripting language like Ruby or Python.
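As a rough sketch of such a program in Python, the following enumerates every non-empty combination of attributes and pairs each possible left-hand side with each possible right-hand side. The function names nonempty_subsets and all_functional_dependencies are made up for this example, and ∅ is treated as an ordinary attribute, following the convention used in this answer.

```python
from itertools import combinations

def nonempty_subsets(attributes):
    """All combinations of the attributes taken 1, 2, ..., n at a time."""
    return [
        combo
        for size in range(1, len(attributes) + 1)
        for combo in combinations(attributes, size)
    ]

def all_functional_dependencies(attributes):
    """Every (left-hand side, right-hand side) pair, trivial ones included."""
    sides = nonempty_subsets(attributes)
    return [(lhs, rhs) for lhs in sides for rhs in sides]

three = ["A", "B", "∅"]
four = ["A", "B", "C", "∅"]

print(len(nonempty_subsets(three)), len(all_functional_dependencies(three)))
print(len(nonempty_subsets(four)), len(all_functional_dependencies(four)))

# Print the dependencies for three attributes in the form A->AB.
for lhs, rhs in all_functional_dependencies(three):
    print("".join(lhs) + "->" + "".join(rhs))
```

For n attributes there are 2**n - 1 non-empty combinations, so the counts printed here are 7 and 49 for three attributes, and 15 and 225 for four.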
The Wikipedia article on combinations has examples of how to count the combinations, with and without using factorials.
All possible combinations from R{∅AB}:
A->A A->B A->∅ A->AB A->A∅ A->B∅ A->AB∅
B->A B->B B->∅ B->AB B->A∅ B->B∅ B->AB∅
∅->A ∅->B ∅->∅ ∅->AB ∅->A∅ ∅->B∅ ∅->AB∅
AB->A AB->B AB->∅ AB->AB AB->A∅ AB->B∅ AB->AB∅
A∅->A A∅->B A∅->∅ A∅->AB A∅->A∅ A∅->B∅ A∅->AB∅
B∅->A B∅->B B∅->∅ B∅->AB B∅->A∅ B∅->B∅ B∅->AB∅
AB∅->A AB∅->B AB∅->∅ AB∅->AB AB∅->A∅ AB∅->B∅ AB∅->AB∅
[1] Most people ignore the empty set. They'd say R{∅AB} has only two attributes, A and B, and they'd write it as R{AB}.
