Grouping tuples of 2 numbers based on the difference between the tuples - arrays

Serial Number     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25    26
Parameter 1   22.95 23.46 22.71 23.41 23.36 23.18 23.52 23.35 22.86 22.47 22.63 23.72 23.22 23.17  22.8 23.18 23.15 23.12 23.16 23.22 23.58 23.68 23.33 23.52 23.54 23.48
Parameter 2   19.97 20.83 19.22 20.39 20.33 20.44 20.62 20.26 21.22 20.31 20.53 20.62 20.65 20.14 19.43 20.66 20.09 20.52 20.41 20.63 20.98 21.15 19.97 20.72 20.71 20.32
We have to divide this table into groups of 4 columns such that:
1. The maximum difference between the Parameter 1 values of the columns in a group is less than 1.
2. The maximum difference between the Parameter 2 values of the columns in a group is less than 0.5.
For example, columns 4, 5, 6 and 7 form a valid group.
I need an algorithm that returns the maximum number of groups this table can be divided into, along with the groups themselves.
For example, in this case the maximum number of groups possible is 6, and the remaining 2 columns do not fall into any group. The algorithm should return these 6 groups.
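As a starting point, the two conditions can be sketched in Python as a validity check, together with a simple first-fit heuristic. The names `is_valid_group` and `greedy_groups` are hypothetical, and the heuristic is only an illustration: it is not guaranteed to reach the maximum number of groups, which would need an exhaustive or backtracking search.

```python
from itertools import combinations

# Parameter values copied from the table in the question; index 0 is column 1.
PARAM1 = [22.95, 23.46, 22.71, 23.41, 23.36, 23.18, 23.52, 23.35, 22.86,
          22.47, 22.63, 23.72, 23.22, 23.17, 22.8, 23.18, 23.15, 23.12,
          23.16, 23.22, 23.58, 23.68, 23.33, 23.52, 23.54, 23.48]
PARAM2 = [19.97, 20.83, 19.22, 20.39, 20.33, 20.44, 20.62, 20.26, 21.22,
          20.31, 20.53, 20.62, 20.65, 20.14, 19.43, 20.66, 20.09, 20.52,
          20.41, 20.63, 20.98, 21.15, 19.97, 20.72, 20.71, 20.32]


def is_valid_group(cols, max_d1=1.0, max_d2=0.5):
    """Return True if the 1-based column numbers satisfy both spread limits."""
    p1 = [PARAM1[c - 1] for c in cols]
    p2 = [PARAM2[c - 1] for c in cols]
    # Plain float comparison: a spread exactly at the limit may round either way.
    return max(p1) - min(p1) < max_d1 and max(p2) - min(p2) < max_d2


def greedy_groups(size=4):
    """First-fit heuristic: repeatedly take the first valid group of unused columns.

    ASSUMPTION: this is an illustrative heuristic, not an optimal algorithm.
    """
    remaining = list(range(1, len(PARAM1) + 1))
    groups = []
    while len(remaining) >= size:
        found = next((g for g in combinations(remaining, size)
                      if is_valid_group(g)), None)
        if found is None:
            break
        groups.append(found)
        remaining = [c for c in remaining if c not in found]
    return groups, remaining


groups, leftover = greedy_groups()
print(is_valid_group((4, 5, 6, 7)))
print(len(groups), groups, leftover)
```

The validity check can also be reused to verify any candidate answer, for instance the 6 groups that an exact search would return.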

Related

Issues Regarding SAS

I was working on a homework problem regarding using arrays and looping to create a new variable to identify the date of when the maximum blood lead value was obtained but got stuck. For context, here is the homework problem:
In 1990 a study was done on the blood lead levels of children in Boston. The following variables for twenty-five children from the study have been entered on multiple lines per subject in the file lead_sum2018.txt in a list format:
Line 1
ID Number (numeric, values 1-25)
Date of Birth (mmddyy8. format)
Day of Blood Sample 1 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 1 (numeric, initial possible range: -9 to 12)
Line 2
ID Number (numeric, values 1-25)
Day of Blood Sample 2 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 2 (numeric, initial possible range: -9 to 12)
Line 3
ID Number (numeric, values 1-25)
Day of Blood Sample 3 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 3 (numeric, initial possible range: -9 to 12)
Line 4
ID Number (numeric, values 1-25)
Blood Lead Level Sample 1 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 2 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 3 (numeric, possible range: 0.01 – 20.00)
Sex (character, ‘M’ or ‘F’)
All blood samples were drawn in 1990. However, during data entry the order of blood samples was scrambled, so the first blood sample in the data file (blood sample 1) may not correspond to the first blood sample taken on a subject; it could be the first, second or third. In addition, some of the months and days of blood sampling were not written on the forms. At data entry, missing month and missing day values were each coded as -9.
The team of investigators for this project has made the following decisions regarding the missing values. Any missing days are to be set equal to 15, and any missing months are to be set equal to 6. Any analyses that are done on this data set need to follow those decisions. Be sure to implement the SAS syntax as indicated for each question. For example, use SAS arrays and loops if the item states that these must be used.
Here is the data that the HW references (it is in list format and was contained in a separate file called lead_sum2018.txt):
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 6 6
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
5 12/26/80 21 5
5 3 7
5 -9 12
5 4.35 4.79 5.14 M
6 06/20/81 7 10
6 11 3
6 22 1
6 1.24 1.16 0.71 F
7 06/22/81 19 6
7 3 12
7 29 8
7 3.1 3.21 3.58 F
8 05/24/82 26 7
8 31 1
8 9 10
8 2.99 2.37 2.4 M
9 10/11/82 2 7
9 25 5
9 28 3
9 2.4 1.96 2.71 F
10 . 10 8
10 30 12
10 28 2
10 2.72 2.87 1.97 F
11 11/16/83 19 4
11 15 11
11 7 -9
11 4.8 4.5 4.96 M
12 03/02/84 17 6
12 11 2
12 17 11
12 2.38 2.6 2.88 F
13 04/19/84 2 12
13 -9 6
13 1 7
13 1.99 1.20 1.21 M
14 02/07/85 4 5
14 17 5
14 21 11
14 1.61 1.93 2.32 F
15 07/06/85 5 2
15 16 1
15 14 6
15 3.93 4 4.08 M
16 09/10/85 12 10
16 11 -9
16 23 6
16 3.29 2.88 2.97 M
17 11/05/85 12 7
17 18 1
17 11 11
17 1.31 0.98 1.04 F
18 12/07/85 16 2
18 18 4
18 -9 6
18 2.56 2.78 2.88 M
19 03/02/86 19 4
19 11 3
19 19 2
19 0.79 0.68 0.72 M
20 08/19/86 21 5
20 15 12
20 -9 4
20 0.66 1.15 1.42 F
21 02/22/87 16 12
21 17 9
21 13 4
21 2.92 3.27 3.23 M
22 10/11/87 7 6
22 1 12
22 -9 3
22 1.43 1.42 1.78 F
23 05/12/88 12 2
23 21 4
23 17 12
23 0.55 0.89 1.38 M
24 08/07/88 17 6
24 27 11
24 6 2
24 0.31 0.42 0.15 F
25 01/12/89 4 7
25 15 -9
25 23 1
25 1.69 1.58 1.53 M
A) Input the data and in the data step:
1) make sure that Date of Birth variable is recorded as a SAS date;
2) use SAS arrays and looping to create a SAS date variable for each of the three blood samples and to address the missing data in accordance to the decisions of the investigators. Hint: use a single array and do loop to recode the missing values for day and month, separately, and an array/do loop for creating the SAS date variable;
3) use a SAS function to create a variable for the highest, i.e., maximum, blood lead value for each child;
4) use SAS arrays and looping to identify the date on which this largest value was obtained and create a new variable for the date of the largest blood lead value;
5) determine the age of the child in years when the largest blood lead value was obtained (rounded to two decimal places);
6) create a new variable based on the age of the child in years when the largest lead value was obtained (call it, “agecat”) that takes on three levels: for children less than 4 years old, agecat should equal 1; for children at least 4 years old, but less than 8, agecat should equal 2; and for children at least 8 years of age, agecat should be 3.;
7) print out the variables for the date of birth, date of the largest lead level, age at blood sample for the largest blood lead level, agecat, sex, and the largest blood lead level (Only print out these requested variables). All dates should be formatted to use the mmddyy10. format on the output.
The code I used in response to this was:
libname HW3 'C:\Users\johns\Desktop\SAS';
filename HW3new 'C:\Users\johns\Desktop\SAS\lead_sum2018.txt';
data one;
infile HW3new;
informat dob mmddyy8.;
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
array dbs{3} dbs1 dbs2 dbs3;
array mbs{3} mbs1 mbs2 mbs3;
do i=1 to 3;
if dbs{i}=-9 then dbs{i}=15;
end;
do i=4 to 6;
if mbs{i}=-9 then mbs{i}=6;
end;
array date{3} mdy1 mdy2 mdy3;
do i=1 to 3;
date{i}=mdy(mbs{i}, dbs{i}, 1990);
end;
maxbls=max(of bls1-bls3);
array bls{3} bls1 bls2 bls3;
array maxdte{3} maxdte1 maxdte2 maxdte3;
do i=1 to i=3;
if bls{i}=maxbls then maxdte=i;
end;
agemax=maxdte-dob;
ageest=round(agemax/365.25,2);
if agemax=. then agecat=.;
else if agemax < 4 then agecat=1;
else if 4 <= agemax < 8 then agecat=2;
else if agemax ge 8 then agecat=3;
run;
I received this error:
22 maxbls=max(of bls1-bls3);
23 array bls{3} bls1 bls2 bls3;
24 array maxdte{3} maxdte1 maxdte2 maxdte3;
25 do i=1 to i=3;
26 if bls{i}=maxbls then maxdte=i;
ERROR: Illegal reference to the array maxdte.
27 end;
Does anyone have any tips in regard to this issue? What did I do wrong? Was I supposed to create an additional array for the date on which the maximum blood lead sample value was collected? Thanks!
**I'm stuck on #4 of Part A, but I included the other parts for context. Thanks!
**Edits: I included the data that I had to read into SAS and the file name of the file it came from
Just from looking at the code immediately prior to the error, you have a problem on this line:
26 if bls{i}=maxbls then maxdte=i;
You are getting the error because you are attempting to assign a value to the array maxdte. Arrays cannot be assigned values like that (unless you are using the deprecated do over syntax...) Instead, choose an element of the array and assign the value to the element. E.g. you could do:
26 if bls{i}=maxbls then maxdte{1}=i;
Or instead of a literal 1, you could use a variable containing the relevant array index.
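To make the intended logic of steps 2 to 6 concrete, here is a minimal sketch in Python rather than SAS. It only illustrates the element-by-element lookup that the array loop is meant to perform; the function name `max_lead_summary` is hypothetical, two-digit birth years are assumed to fall in the 1900s, and a missing date of birth (as for ID 10) is not handled.

```python
from datetime import date

MISSING = -9  # code used at data entry for a missing day or month


def max_lead_summary(dob, days, months, levels, year=1990):
    """Mirror steps 2-6 of the assignment for one child (illustrative only)."""
    # Recode missing days to 15 and missing months to 6, per the investigators.
    days = [15 if d == MISSING else d for d in days]
    months = [6 if m == MISSING else m for m in months]
    sample_dates = [date(year, m, d) for m, d in zip(months, days)]

    max_level = max(levels)
    # Walk the samples and keep the date whose level equals the maximum.
    # If several samples tie, the last one wins.
    max_date = None
    for level, sample_date in zip(levels, sample_dates):
        if level == max_level:
            max_date = sample_date

    # Age in years at the date of the maximum, rounded to two decimals.
    age = round((max_date - dob).days / 365.25, 2)
    agecat = 1 if age < 4 else 2 if age < 8 else 3
    return max_level, max_date, age, agecat


# Child 1 from the data file (born 04/30/78, assumed to mean 1978).
result = max_lead_summary(date(1978, 4, 30), [6, -9, 14], [10, 7, 1],
                          [1.62, 1.35, 1.47])
print(result)
```

The key point carried over to SAS is that the loop assigns a single date value taken from one element of the date array, rather than assigning to an array name as a whole.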
You are not properly handling the ID field on lines #2-4:
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
For example, you need to skip field 1 on lines 2-4, or perhaps read the IDs into separate variables (or an array) to check that they are all the same:
input #1 id dob dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex;
This example shows how to check that you have 4 lines with the same ID; if you do, it reads the rest of the variables, otherwise it executes LOSTCARD. In the sample data, ID 3 has a missing record.
353 data ex;
354 infile cards n=4 stopover;
355 input #1 id #2 id2 #3 id3 #4 id4 #;
356 if id eq id2 eq id3 eq id4
357 then input #1 id dob:mmddyy. dbs1 mbs1
358 #2 id2 dbs2 mbs2
359 #3 id3 dbs3 mbs3
360 #4 id4 bls1 bls2 bls3 sex :$1.;
361 else lostcard;
362 format dob mmddyy.;
363 cards;
NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
372 3 01/03/80 11 7
373 3 27 2
374 3 3.24 3.4 3.83 M
375 4 08/01/80 5 12
NOTE: LOST CARD.
376 4 28 -9
NOTE: LOST CARD.
377 4 3 4
NOTE: The data set WORK.EX has 3 observations and 15 variables.
data ex;
infile cards n=4 stopover;
input #1 id #2 id2 #3 id3 #4 id4 #;
if id eq id2 eq id3 eq id4
then input #1 id dob:mmddyy. dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex :$1.;
else lostcard;
format dob mmddyy.;
cards;
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
;;;;
run;
proc print;
run;

How to find the size of smallest and largest bin in MATLAB?

I need to find the size of the bins with the maximum and minimum number of elements. I am using the histc function in MATLAB.
Here is what I am doing:
A=[1 2 3 11 22 3 4 55 6 7 2 33 44 5 22]
edges = [10 inf];
N = histc(A,edges)
It gives N=[6,0], which means there are 6 elements with values greater than 10. Now I want to find the maximum number of consecutive elements that satisfy my condition.
Here it should be 2, as there are two instances where two consecutive integers satisfy my condition: 11 22 and 33 44.
How can I count this in MATLAB?
Here you go;
A=[1 2 3 11 22 3 4 55 6 7 2 33 44 5 22]
arr=diff([0 (find(~(A>10))) numel(A)+1]) -1;
arr(find(arr(1,:)==0))=[];
largest=max(arr); % length of the longest run of consecutive values > 10
smallest=min(arr); % length of the shortest run of consecutive values > 10
Cheers!!
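For comparison, the same run-length idea can be sketched in Python with `itertools.groupby`. This is only an illustration of the technique, not a MATLAB replacement; the variable name `runs` is an assumption.

```python
from itertools import groupby

A = [1, 2, 3, 11, 22, 3, 4, 55, 6, 7, 2, 33, 44, 5, 22]

# Group consecutive elements by whether they exceed 10 and keep the
# lengths of the groups where the condition holds.
runs = [len(list(g)) for above, g in groupby(A, key=lambda x: x > 10) if above]

print(runs)       # lengths of each run of values > 10
print(max(runs))  # longest run
print(min(runs))  # shortest run
```

For the vector in the question the runs are 11 22, then 55, then 33 44, then 22, so the longest run has length 2 and the shortest has length 1.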

Summing data in rows based on horizontal and vertical criteria

I have a dataset in the below format:
            Date 1  Date 1  Date 1  Date 2  Date 2  Date 3  Date 3
Product 1   10      20      10      5       10      20      30
Product 2   5       5       10      10      10      5       30
Product 3   30      10      5       10      30      30      40
Product 4   5       10      10      20      5       10      20
and I am trying to sum the sales of the products by the date, to create the below:
            Date 1  Date 2  Date 3
Product 1   40      15      50
Product 3   45      40      70
Product 4   25      25      30
Product 2   20      20      35
The products in the second table will often be in a different order, so a simple SUMIF will not suffice.
I've attempted a combination of SUM, INDEX and MATCH, as well as SUM with nested IF function, but no amount of Googling or trial and error is getting me there. I keep just bringing back the values in one cell, but not managing to sum.
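With the source data in A1:H5 (dates in B1:H1, product names in column A) and the summary table starting in row 10 (dates in B10:D10, product names from A11 down), I used the following formula:
=SUMIF($B$1:$H$1,B$10,INDIRECT("$B" & MATCH($A11,$A$1:$A$5,0) & ":$H" &MATCH($A11,$A$1:$A$5,0)))
This gives the wanted result. I put the formula in B11 and then copied it across and down.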
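The lookup that the formula performs (find the product's row, then add up the cells whose date header matches) can be sketched in Python as follows. The names `dates`, `sales` and `totals_by_date` are hypothetical and only mirror the sample data from the question.

```python
from collections import defaultdict

# Date header of each data column, left to right.
dates = ["Date 1", "Date 1", "Date 1", "Date 2", "Date 2", "Date 3", "Date 3"]

# Sales per product, in the same column order as `dates`.
sales = {
    "Product 1": [10, 20, 10, 5, 10, 20, 30],
    "Product 2": [5, 5, 10, 10, 10, 5, 30],
    "Product 3": [30, 10, 5, 10, 30, 30, 40],
    "Product 4": [5, 10, 10, 20, 5, 10, 20],
}


def totals_by_date(row):
    """Sum the values of one product row, grouped by the column's date."""
    totals = defaultdict(int)
    for d, value in zip(dates, row):
        totals[d] += value
    return dict(totals)


# The summary may list the products in any order, as in the question.
for product in ["Product 1", "Product 3", "Product 4", "Product 2"]:
    print(product, totals_by_date(sales[product]))
```

Because each product is looked up by name, the order of the rows in the summary does not matter, which is the role MATCH plays in the formula.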

Adding and multiplying tables' data by values in another table

Say I have a table of subtractions and divisions sorted by date:
tblFactors
dt sub divide
2014-07-01 1 1
2014-06-01 0 5
2014-05-01 2 1
2014-05-01 0 3
I have another table of values, sorted by date:
tblValues
dt val
2014-07-05 4
2014-06-15 5
2014-05-15 21
2014-04-14 31
2014-03-15 71
I need to perform some sequential calculations. For the first value in tblFactors, I need to subtract 1 from every val where tblValues.dt < '2014-07-01'.
Next, I need to process the second row in tblFactors. There is nothing to subtract. However, the divide = 5 means that I need to divide every val by 5 where tblValues.dt < '2014-06-01'. The tricky thing is that I need to do this on the modified val from the row before (divide 20 / 5, not 21 / 5).
Each row in tblFactors would process in this manner, giving a sequence like this:
tblFactors:               Row 1       Row 2        Row 3       Row 4
Dt         Original Val   Subtract 1  Divide by 5  Subtract 2  Divide by 3
7/5/2014   4
6/15/2014  5              4
5/15/2014  21             20          4
4/14/2014  31             30          6            4
3/15/2014  71             70          14           12          4
This would leave me with:
qryValues
dt val
2014-07-05 4
2014-06-15 4
2014-05-15 4
2014-04-14 4
2014-03-15 4
Right now I'm doing vector multiplications over loops in R. I was wondering if there was a clever way to accomplish this in the native sql. I tried doing some aggregations but I've had limited success.
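To pin down the calculation being asked for, here is a minimal Python sketch of the sequential processing. It is not a SQL solution, and `apply_factors` is a hypothetical name. It also rests on an assumption: the worked sequence above only makes sense if the fourth factor row takes effect after 2014-03-15 but no later than 2014-04-14, so the sketch uses 2014-04-01 for that row, whereas tblFactors as posted repeats 2014-05-01.

```python
from datetime import date

# (cutoff date, amount to subtract, divisor), processed in this order.
# ASSUMPTION: the fourth cutoff is 2014-04-01, chosen to match the worked
# sequence in the question; the posted table shows 2014-05-01 instead.
factors = [
    (date(2014, 7, 1), 1, 1),
    (date(2014, 6, 1), 0, 5),
    (date(2014, 5, 1), 2, 1),
    (date(2014, 4, 1), 0, 3),
]

values = {
    date(2014, 7, 5): 4,
    date(2014, 6, 15): 5,
    date(2014, 5, 15): 21,
    date(2014, 4, 14): 31,
    date(2014, 3, 15): 71,
}


def apply_factors(values, factors):
    """Apply each factor row in turn to every value dated before its cutoff."""
    result = dict(values)
    for cutoff, sub, div in factors:
        for dt in result:
            if dt < cutoff:
                # Each step works on the value left by the previous step.
                result[dt] = (result[dt] - sub) / div
    return result


for dt, val in apply_factors(values, factors).items():
    print(dt, val)
```

Under that assumption every value ends up as 4, matching qryValues.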

transact SQL, sum each row and insert into another table

For a table on MS SQL Server 2000 containing the following columns and values:
S_id  J_id  Se_id  B_id  Status  Count  multiply
63    1000  16     12    1       10     2
64    1001  12     16    1       9      3
65    1002  17     12    1       10     2
66    1003  16     12    1       6      3
67    1004  12     16    1       10     2
I want to write a classic ASP script which will do the following for each row
where status=1:
- multiply: answer = column 'count' multiplied by column 'multiply'
Then:
sum the answers for each se_id, like:
se_id total
12 47
16 38
17 20
and display the result on screen like:
Rank se_id total
1 12 47
2 16 38
3 17 20
Condition:
if there are multiple equal total values, give the lower numbered se_id priority in the ranking
and give the next higher numbered se_id the next rank number.
Any sample code in classic ASP or advice on how to accomplish this is welcome.
In the query below, 'score' is the source table.
if (EXISTS (select * from INFORMATION_SCHEMA.TABLES where TABLE_NAME = 'result_table'))
begin
drop table result_table;
end
select
rank = IDENTITY(INT,1,1),
se_id, sum(multiply * count) as total
into result_table
from score
where status = 1
group by se_id
order by total desc, se_id;
[Edit] Change query as answer on first comment
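The expected totals and ranking can be cross-checked with a short Python sketch. It reproduces the same logic as the query (filter on status, sum count times multiply per se_id, order by total descending and then by se_id) on the sample rows; the names `rows` and `ranking` are hypothetical.

```python
from collections import defaultdict

# Sample rows: (S_id, J_id, Se_id, B_id, Status, Count, multiply)
rows = [
    (63, 1000, 16, 12, 1, 10, 2),
    (64, 1001, 12, 16, 1, 9, 3),
    (65, 1002, 17, 12, 1, 10, 2),
    (66, 1003, 16, 12, 1, 6, 3),
    (67, 1004, 12, 16, 1, 10, 2),
]

# Sum count * multiply per se_id, considering only rows with status = 1.
totals = defaultdict(int)
for _s_id, _j_id, se_id, _b_id, status, count, multiply in rows:
    if status == 1:
        totals[se_id] += count * multiply

# Highest total first; on equal totals the lower se_id gets the better rank.
ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
ranking = [(rank, se_id, total)
           for rank, (se_id, total) in enumerate(ordered, start=1)]

for rank, se_id, total in ranking:
    print(rank, se_id, total)
```

The sort key plays the role of `order by total desc, se_id`, and the enumeration plays the role of the IDENTITY column.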
