SPSS Matching Case Control 1:n - matching

We want to match our cases and controls using SPSS 23.
We already matched our cases and controls on age in a 1:3 ratio with a tolerance of 1 month, as follows:
DATASET ACTIVATE DataSet1.
FUZZY BY=Age SUPPLIERID=Databasenr NEWDEMANDERIDVARS=MatchID1 MatchID2 MatchID3 GROUP=Case FUZZ=1
EXACTPRIORITY=TRUE
MATCHGROUPVAR=Matchgroupvariable
/OPTIONS SAMPLEWITHREPLACEMENT=FALSE MINIMIZEMEMORY=FALSE SHUFFLE=TRUE.
Now we have two questions:
We want to use different tolerances for our cases. For example, cases aged under 1 year should be matched with a tolerance of 1 month, and cases older than 1 year with a tolerance of 6 months. How can we do that?
We want to distribute the controls equally among the cases. We have 60 cases and 300 controls. First we want every case to have a control if possible; then we want to distribute the remaining controls equally among the cases, so that every case has at least one control and as many controls as possible.
Thanks for your help.

To have a fuzz that is more complex than a simple difference, you need to use a custom fuzz function written in Python that computes the match criteria. I have shown a function for this below. Save it in a file named customfuzz.py somewhere that Python can find it, such as the python\lib\site-packages directory under your Statistics installation.
Then use CUSTOMMATCH='customfuzz.custommatch' in the FUZZY syntax instead of FUZZ.
Here is the function. The indentation shown is important. The code assumes that the ages are in years (which can include a fractional part).
def custommatch(demander, supplier):
    """Calculate the match for one variable and return 0 or 1.

    demander and supplier are assumed to be lists of ages in years."""
    # check for missing values
    if demander[0] is None or supplier[0] is None:
        return 0  # no match
    delta = abs(demander[0] - supplier[0])  # difference in years
    if demander[0] < 1:  # demander age less than 1 year
        if delta <= .08333:  # difference of at most 1 month
            return 1  # match
        else:
            return 0
    else:
        if delta <= .5:  # difference of at most half a year
            return 1  # match
        else:
            return 0
For the second problem, what you would have to do is the first match round and then remove all the controls that were actually used and repeat the process with the reduced dataset of controls.
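The round-by-round idea described above can be illustrated outside SPSS. The following is a minimal plain-Python sketch, not the FUZZY command itself: match_in_rounds, within_tolerance, and the sample IDs are hypothetical names chosen for illustration. Each round gives every case at most one unused control, so controls are spread evenly before any case receives an extra one.

```python
def within_tolerance(case_age, control_age):
    """Age-dependent tolerance: 1 month under age 1, otherwise 6 months."""
    limit = 1 / 12 if case_age < 1 else 0.5
    return abs(case_age - control_age) <= limit


def match_in_rounds(case_ages, control_ages, max_rounds=3):
    """Assign controls to cases one round at a time, without replacement.

    case_ages and control_ages map an ID to an age in years.
    """
    unused = dict(control_ages)  # controls that are still available
    matches = {case_id: [] for case_id in case_ages}
    for _ in range(max_rounds):
        for case_id, case_age in case_ages.items():
            # pick the first unused control that satisfies the tolerance
            chosen = next(
                (cid for cid, age in unused.items()
                 if within_tolerance(case_age, age)),
                None,
            )
            if chosen is not None:
                matches[case_id].append(chosen)
                del unused[chosen]  # a used control leaves the pool
    return matches


cases = {"c1": 0.5, "c2": 3.0}
controls = {"k1": 0.55, "k2": 0.52, "k3": 3.2, "k4": 0.9}
print(match_in_rounds(cases, controls))
```

In SPSS itself, each round would correspond to one FUZZY run with a single demander ID variable, followed by removing the matched controls before the next run.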

Calculate working days between two dates: null value returned

I'm trying to figure out the number of working days between two dates. The table (dfDates) is laid out as follows:
Key  StartDateKey  EndDateKey
1    20171227      20180104
2    20171227      20171229
I have another table (dfDimDate) with all the relevant date keys and whether the date key is a working day or not:
DateKey   WorkDayFlag
20171227  1
20171228  1
20171229  1
20171230  0
20171231  0
20180101  0
20180102  1
20180103  1
20180104  1
I'm expecting a result as so:
Key  WorkingDays
1    6
2    3
So far (I realise this isn't complete to get me the above result), I've written this:
workingdays = []
for i in range(0, len(dfDates)):
    value = dfDimDate.filter((dfDimDate.DateKey >= dfDates.collect()[i][1]) & (dfDimDate.DateKey <= df.collect()[i][2])).agg({'WorkDayFlag': 'sum'})
    workingdays.append(value.collect())
However, only null values are being returned. Also, I've noticed this is very slow and took 54 seconds before it errored.
I think I understand what the error is about but I'm not sure how to fix it. Also, I'm not sure how to optimise the command so it runs faster. I'm looking for a solution in pyspark or spark SQL (whichever is easiest).
Many thanks,
Carolina
Edit: The error below was resolved thanks to a suggestion from @samkart, who said to put the agg after the filter:
AnalysisException: Resolved attribute(s) DateKey#17075 missing from sum(WorkDayFlag)#22142L in operator !Filter ((DateKey#17075 <= 20171228) AND (DateKey#17075 >= 20171227)).;
A possible and simple solution:
from pyspark.sql import functions as F
dfDates \
    .join(dfDimDate, dfDimDate.DateKey.between(dfDates.StartDateKey, dfDates.EndDateKey)) \
    .groupBy(dfDates.Key) \
    .agg(F.sum(dfDimDate.WorkDayFlag).alias('WorkingDays'))
That is, first join the two datasets in order to link each date with all the dimDate rows in its range (dfDates.StartDateKey <= dfDimDate.DateKey <= dfDates.EndDateKey).
Then simply group the joined dataset by Key and sum WorkDayFlag to count the working days in each range.
In the solution you proposed, you are performing the calculation directly on the driver, so you are not taking advantage of the parallelism that Spark offers. This should be avoided when possible, especially for large datasets.
Apart from that, you are requesting repeated collects in the for-loop, even for the same data, resulting in a further slowdown.
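The join-and-sum logic can be checked by hand on the sample data from the question. The following is a plain-Python sketch of the same range join, not Spark code; working_days is a hypothetical helper name, and the nested loop only mirrors what the join and groupBy do on a cluster.

```python
# Sample rows from the question: (Key, StartDateKey, EndDateKey)
date_ranges = [(1, 20171227, 20180104), (2, 20171227, 20171229)]

# DateKey -> WorkDayFlag, as in dfDimDate
work_day_flags = {
    20171227: 1, 20171228: 1, 20171229: 1,
    20171230: 0, 20171231: 0, 20180101: 0,
    20180102: 1, 20180103: 1, 20180104: 1,
}


def working_days(ranges, flags):
    """Sum WorkDayFlag over every DateKey inside each [start, end] range."""
    totals = {}
    for key, start, end in ranges:
        totals[key] = sum(
            flag for date_key, flag in flags.items()
            if start <= date_key <= end  # same condition as between()
        )
    return totals


print(working_days(date_ranges, work_day_flags))  # {1: 6, 2: 3}
```

The totals agree with the expected WorkingDays values of 6 and 3 given in the question.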

Calculate and update a score according to selected answer

I am new to ODK and XLSForms.
I have several questions and based on the answers, I need to calculate a score.
I have 17 questions; each time a person answers yes, I need to add 2 points to an integer field.
So I have:
type name label appearance required
select_one yes_no1 q1 //question here //appearance quick //required yes
...
select_one yes_no17 q17 ...
And here is the score field:
type name label
calculate total Total
This is my first assignment in my job, and I can't figure out how to calculate and change the value according to the selected answer.
EDIT
I added a calculation expression, but I don't know how to get the result because it didn't work:
if ((${q8} = 'yes' or ${q9} = 'yes' or ${q11} = 'yes'), 2, 0)
So if question 8, 9 or 11 is answered yes, 2 points should be added to the current value, but the field didn't appear at all. I also still need to add 1 point each for questions 10, 12, 13 and 14 when they are answered yes.
The simple but tedious way to do this is to create a calculation for each question with an if() statement that sets the calculated value to either 2 or 0. The final result can be obtained by adding up the calculated items.
The cool way to do this is to do it all in one XPath expression. Basically you want to create a nodeset that contains all 17 questions, filter these to the ones with value 'yes', count the filtered nodeset and multiply by 2. You can do this in XLSForm, but I'm not sure if you can use the ${node} shorthand.
You'd have to put all your questions in a group (it doesn't need a group label), after which you can do something like:
count(${grp}/*[text() = 'yes']) * 2
or without ${node} shortcut (check the XForm for the correct path):
count(/myform/grp/*[text() = 'yes']) * 2
I'm not sure if the use of text() will pass ODK Validate though. If not there is probably an expression that will pass and does the same. (However, the above syntax would work in Enketo).
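The arithmetic behind the count-and-multiply expression can be checked outside the form. This is a small Python sketch of the scoring rule only, not XLSForm or XPath; yes_score and the sample answers are hypothetical.

```python
def yes_score(answers, points_per_yes=2):
    """Count the answers equal to 'yes' and multiply by points_per_yes."""
    yes_count = sum(1 for value in answers.values() if value == "yes")
    return yes_count * points_per_yes


# Unanswered questions (None) and 'no' answers contribute nothing.
answers = {"q1": "yes", "q2": "no", "q3": "yes", "q4": None}
print(yes_score(answers))  # 2 yes answers * 2 points = 4
```

Questions that should be worth 1 point could be scored with a second call using points_per_yes=1 on that subset, with the two results added together.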

SQL Invoice Query Performance converting Credits to negative numbers

I have a 3rd party database that contains Invoice data I need to report on. The Quantity and Amount Fields are stored as Positive numbers regardless of whether the "invoice" is a Credit Memo or actual Invoice. There is a single character field that contains the Type "I" = Invoice, "R" = Credit.
In a report that covers 1.4 million records, I need to sum this data so that Credits subtract from the total and Invoices add to the total, and I need to do this for 8 different columns in the report (CurrentYear, PreviousYear, etc.).
My problem is the performance of the many different ways to achieve this.
The best-performing approach seems to be a CASE statement within the expression, like so:
Case WHEN ARH.AccountingYear - 2 = @iCurrentYear THEN ARL.ShipQuantity * (CASE WHEN InvoiceType = 'R' THEN -1 ELSE 1 END) ELSE 0 END as PPY_INVOICED_QTY
In terms of readability this is super ugly, since I have to repeat it for 8 different columns. Performance is good, though: it runs against all 1.4M records in 16 seconds.
Using a Scalar UDF kills performance
Case WHEN ARH.AccountingYear - 2 = @iCurrentYear THEN ARL.ShipQuantity * dbo.fn_GetMultiplier(ARH.InvoiceType) ELSE 0 END as PPY_INVOICED_QTY
Takes almost 5 minutes. So can't do that.
Other options I can think of would be:
Multiple levels of Views, use a new view to add a Multiplier column, then SELECT from that and do the multiplication using the new column
Build a table that has 2 columns and 2 records, R, -1 and I, 1, and join it based on InvoiceType, but this seems excessive.
Any other ideas I am missing, or suggestions on best practice for this sort of thing? I cannot change the stored data, that is established by the 3rd party application.
I decided to go with the multiple views as Igor suggested, actually using the nested version. Even though readability is lower, maintenance is easier because there is only 1 named view instead of 2. Performance is similar to the 8 different CASE statements, so overall it runs in just under 20 seconds.
Thanks for the insights.
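The view-based approach described above can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the original schema or SQL Server code: the table invoice_lines, the view signed_lines, and the sample rows are hypothetical, and SQLite timings say nothing about SQL Server performance. The point is that the sign is applied once in the view, so each yearly column only needs a plain CASE on the year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoice_lines (
        AccountingYear INTEGER, InvoiceType TEXT, ShipQuantity REAL);
    INSERT INTO invoice_lines VALUES
        (2023, 'I', 10), (2023, 'R', 4), (2022, 'I', 7), (2022, 'R', 2);
    -- Apply the credit sign once; 'R' rows become negative.
    CREATE VIEW signed_lines AS
        SELECT AccountingYear,
               ShipQuantity * CASE WHEN InvoiceType = 'R' THEN -1 ELSE 1 END
                   AS SignedQuantity
        FROM invoice_lines;
""")

current_year = 2023
row = conn.execute("""
    SELECT SUM(CASE WHEN AccountingYear = :cy
                    THEN SignedQuantity ELSE 0 END) AS CY_INVOICED_QTY,
           SUM(CASE WHEN AccountingYear = :cy - 1
                    THEN SignedQuantity ELSE 0 END) AS PY_INVOICED_QTY
    FROM signed_lines
""", {"cy": current_year}).fetchone()
print(row)  # (6.0, 5.0): 10 - 4 for 2023 and 7 - 2 for 2022
```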

oracle and user inputs [duplicate]

So I am kind of stuck on this query I need to write; I have no idea what it even means.
I need to be able to find a vehicle number from a table entitled Vehicle_Details and check whether it is in use for the period of time I would like to use it, as described below.
When a new trip is being arranged, the administrator has to find a vehicle that is not already in use for the trip duration. Because this query will be needed frequently, it is important that it can be run easily for an arbitrary start and end date. It should therefore use substitution variables so that the trip start and end dates can be provided at run time.
Make sure that when it is run the user is only prompted to supply the start and end dates once. Also make sure that any vehicles displayed are available for the entire period specified - you will have to include more than one test in the where clause
Any code would be helpful, but even links to things that can help me write it myself would be welcome, as I have no idea where to start.
Example data from the three tables:
Trip_ID  Departure  Return_Date  Duration  Registration
73180    07-FEB-12  08-FEB-12    1         PY09 XRH
73181    07-FEB-12  08-FEB-12    1         PY10 OPM
73182    07-FEB-12  10-FEB-12    3         PY56 BZT
73183    07-FEB-12  08-FEB-12    1         PY56 BZU
73184    07-FEB-12  09-FEB-12    2         PY58 UHF

Registration  Make      Model        Year
4585 AW       ALBION    RIEVER       1963
SDU 567M      ATKINSON  N/A          1974
P525 CAO      DAF       FT85.400     1996
PY55 CGO      DAF       FTGCF85.430  2005
PY06 BYP      DAF       FTGCF85.430  2006

Weight  Registration  Body  Vehicle ID
20321   4585 AW       N/A   1
32520   SDU 567M      N/A   2
40000   P525 CAO      N/A   3
40000   PY55 CGO      N/A   4
40000   PY06 BYP      N/A   5
You need to find if a particular date range intersects with any reservation. This is interval intersection arithmetic. Consider the following intervals [A,B] and [x,y]:
-----------[xxxxxxxxxxx]-------------
           A           B
---------------------[xxxxxx]--------
                     x      y
An interval [x,y] will intersect with [A,B] if and only if:
B >= x
And A <= y
So your query will look like:
SELECT *
  FROM registrations reg
 WHERE reg.registration = :searched_vehicle
   AND NOT EXISTS (SELECT NULL
                     FROM reservations res
                    WHERE res.registration = reg.registration
                      AND res.return_date >= :interval_start
                      AND res.departure <= :interval_end)
This is for one vehicle. If the query returns a row, this vehicle is available for the given interval [:interval_start, :interval_end].
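The intersection test can be tried on its own before writing the query. The following Python sketch mirrors the two conditions (B >= x and A <= y) and the NOT EXISTS check; overlaps and is_available are hypothetical helper names, and the trip dates are taken from the sample row for PY56 BZT.

```python
from datetime import date


def overlaps(a_start, a_end, x_start, x_end):
    """Closed intervals [A,B] and [x,y] intersect iff B >= x and A <= y."""
    return a_end >= x_start and a_start <= x_end


def is_available(trips, start, end):
    """A vehicle is free if none of its trips overlaps [start, end]."""
    return not any(overlaps(dep, ret, start, end) for dep, ret in trips)


# Trip 73182: PY56 BZT departs 07-FEB-12 and returns 10-FEB-12.
trips = [(date(2012, 2, 7), date(2012, 2, 10))]

print(is_available(trips, date(2012, 2, 9), date(2012, 2, 12)))   # False
print(is_available(trips, date(2012, 2, 11), date(2012, 2, 12)))  # True
```

Because the intervals are closed, a request that starts on the return date counts as a clash; whether that is desired depends on how the assignment defines availability.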

SPSS :Loop through the values of variable

I have a dataset of patient data organised by the site at which each patient visited our mobile clinic. I have now written a series of commands such as freqs and crosstabs to produce the analyses I need; however, I would like this to be done for patients at each site, rather than for the dataset as a whole.
If I had only one site, a mere filter command with the variable that specifies a patient's site would suffice, but alas I have 19 sites, so I would like to find a way to loop through my code to produce these outputs for each site. That is to say for i in 1 to 19:
1. Take the i th site
2. Compute a filter for this i th site
3. Run the tables using this filtered data of patients at ith site
Here is my first attempt using DO REPEAT. I also tried using LOOP earlier.
However, it does not work: I keep getting an error even though these are closed loops.
Is there a way to do this in SPSS syntax? Bear in mind I do not know Python well enough to do this using that plugin.
*LOOP #ind= 1 TO 19 BY 1.
DO REPEAT #ind= 1 TO 20.
****8888888888888888888888888888888888888888888888888888888 Select the Site here.
COMPUTE filter_site=(RCDSITE=#ind).
USE ALL.
FILTER BY filter_site.
**********************Step 3: Apply the necessary code for tables
*********Participation in the wellness screening, we actually do not care about those who did FP as we are not reporting it.
COUNT BIO= CheckB (1).
* COUNT FPS=CheckF(1).
* COUNT BnF= CheckB CheckF(1).
VAL LABEL BIO
1 ' Has the Wellness screening'
0 'Does not have the wellness screening'.
*VAL LABEL FPS
1 'Has the First patient survey'.
* VAL LABEL BnF
1 'Has either Wellness or FPS'
2 'Has both surveys done'.
FREQ BIO.
*************************Use simple math to calcuate those who only did the Wellness/First Patient survey FUB= F+B -FnB.
*******************************************************Executive Summary.
***********Blood Pressure.
FREQ BP.
*******************BMI.
FREQ BMI.
******************Waist Circumference.
FREQ OBESITY.
******************Glucose.
FREQ GLUCOSE.
*******************Cholesterol.
FREQ TC.
************************ Heamoglobin.
FREQ HAEMOGLOBIN.
*********************HIV.
FREQ HIV.
******************************************************************************I Lifestyle and General Health.
MISSING VALUES Gender GroupDep B8 to B13 ('').
******************Graphs 3.1
Are you producing only frequencies? Try SPLIT FILE by the variable RCDSITE. That should be enough.
SPLIT FILE allows you to partition your data by up to eight variables. Each procedure then automatically iterates over each group.
If you need to group the results at a higher level than the procedure, that is, to run a bunch of procedures for each group before moving on to the next one so that all the output for a group will be together, you can use the SPSSINC SPLIT DATASET and SPSSINC PROCESS FILES extension commands to do this.
These commands require the Python Essentials. That and the commands can be downloaded from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) if you have at least version 18.
HTH,
Jon Peck
A simple but perhaps not very elegant way is to select from the menu: Data/Select Cases/If condition, there you enter the filter for site 1 and press Paste, not OK.
This will give the used filter as syntax code.
So with some copy/paste/replace/repeat you can get the freqs and all other results based on the different sites.
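The copy/paste/replace step can also be automated by generating the repeated syntax as text. The sketch below is ordinary Python that only builds strings; it does not use the SPSS Python plugin, and site_syntax is a hypothetical helper. The commands it emits are the ones from the question, plus FILTER OFF to reset the filter after each site.

```python
def site_syntax(site, procedures):
    """Build the filter-and-run block of SPSS syntax for one site."""
    lines = [
        f"COMPUTE filter_site=(RCDSITE={site}).",
        "USE ALL.",
        "FILTER BY filter_site.",
        *procedures,          # the tables to run for this site
        "FILTER OFF.",
    ]
    return "\n".join(lines)


procedures = ["FREQ BIO.", "FREQ BP.", "FREQ BMI."]

# One block per site, 1 through 19, separated by blank lines.
syntax = "\n\n".join(site_syntax(site, procedures) for site in range(1, 20))
print(syntax)
```

The printed text can be pasted into a syntax window and run as usual.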
