Total values of makes some litter difference how can i get exact value in sql? - sql-server

Speed A b c Total
129 11 4 2.75 354.75
91 9 6 1.5 136.5
166 19 8 2.375 395.08
164 26 12 2.16666666666667 355.88
146 11 6 1.83333333333333 267.18
147 16 8 2 294
201 8 3 2.66666666666667 536.67
164 4 2 2 328
186 8 6 1.33333333333333 247.38
165 7 4 1.75 288.75
171 10 4 2.5 427.5
104 5 4 1.25 130
1834 134 67 2 3668
Iam using total=Speed*c
But the value is not getting equal if i add all values for last column 1834 .I am getting 100 ampunt less.

The first:
Total = Speed * c, this expression about your output I suppose is rounded at the second number after comma.
The second:
You have put 2 as value of c in the last row. Why? This is the average of all previous rows?
So, if you take the forst and the second, you'll have cause about you have a difference with the last row and the other rows

Approximate-number data types for use with floating point numeric data.
Floating point data is approximate; therefore, not all values in the data
type range can be represented exactly.
Float has a default 15 digit precision. So when your number has more digits it will round off the rest of the digits.
You are showing total with only 2 decimals. You could consider permanently changing c to decimal(20,14). Here is the result if you convert it to decimal, this should be more exact:
select cast(c as decimal(20, 14)) * speed as Total
from table

Related

Grouping tuples of 2 numbers based on the difference between the tuples

Serial Number
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Parameter 1
22.95
23.46
22.71
23.41
23.36
23.18
23.52
23.35
22.86
22.47
22.63
23.72
23.22
23.17
22.8
23.18
23.15
23.12
23.16
23.22
23.58
23.68
23.33
23.52
23.54
23.48
Parameter 2
19.97
20.83
19.22
20.39
20.33
20.44
20.62
20.26
21.22
20.31
20.53
20.62
20.65
20.14
19.43
20.66
20.09
20.52
20.41
20.63
20.98
21.15
19.97
20.72
20.71
20.32
We have to divide this table into groups of 4 columns such that:-
1.Max difference between the values of parameter 1 of the columns in a group is less than 1
2.Max difference between the values parameter 2 of the columns in a group is less than 0.5
for eg. The columns 4,5,6,7 form a valid group
I need an algorithm that will return the maximum number of groups that this table can be divided into and will also return those groups.
for eg. In this case the maximum number of groups possible are 6, and the remaining 2 columns do not fall under any group. The algorithm should return these 6 groups.

Repeat certain pandas series values, so that it has an entry for all index values between 1 and 100

I have created a list of pandas series, with each series indexed by numbers between 1 and 100 eg
Index Value
1 62.99
4 64.39
37 75.225
65 88.12
74 89.89
79 93.30
88 94.30
92 95.83
100 100.00
What I want to do, either while it is a Series, or as an array after calling .to_numpy() on it, is to fill it out so that my series has 100 values (1 to 100), with any new entries having the previous existing value ie
Index Value
1 62.99
2 62.99
3 62.99
4 64.39
5 64.39
6 64.39
...
...
36 64.39
37 75.225
38 75.225
and so on.
I can do this programmatically the long-winded way by iterating through each series and checking for a change in value; my question is, is there a version of Series.repeat() which could do this in one hit, or a numpy function which can 'pad out' my array in this manner with my 100 values?
Thanks in advance for reading, and for any suggestions. This isn't homework; it's a genuine question so please don't attack me if my style of asking isn't as you expect.
What you need yo do is to frontfill the values in a series:
This code
series = pd.Series([33.2, 36, 39, 55], index=[3, 6, 12, 14], name='series')
indices = range(100)
df = pd.DataFrame(indices)
series = df.join(series).ffill()['series']
produces
0 NaN
1 NaN
2 NaN
3 33.2
4 33.2
...
95 55.0
96 55.0
97 55.0
98 55.0
99 55.0
First values ar NaN because there are no values to fill them in the series
So here's the solution I went with - an ffill() with fillna(0), joining to range(1,101). I had to iterate through a larger dataset which needed grouping by ID first / taking the maximum 'Pct' per 'Bucket' :-
j=df[['ID','Bucket','Pct']].groupby(['ID','Bucket']).max()
for i in df['ID'].unique():
index=pd.DataFrame(range(1,101))
index.columns=['Bucket']
k=pd.merge(index,j.loc[i],how='left',on='Bucket').ffill().fillna(0)
In:
Bucket Pct
3 0.03
3 0.1
3 0.26
3 0.42
3 0.45
3 0.59
3 0.69
3 0.83
3 0.86
3 0.91
3 0.94
3 0.98
4 1.1
... ...
91 98.89
93 99.08
94 99.17
94 99.26
94 99.43
94 99.48
94 99.63
100 100.0
Out:
Bucket Pct
1 0.00
2 0.00
3 0.98
4 1.83
5 22.83
... ...
91 98.89
92 98.89
93 99.08
94 99.63
95 99.63
96 99.63
97 99.63
98 99.63
99 99.63
100 100.00
Many, many thanks once again to you both!

Issues Regarding SAS

I was working on a homework problem regarding using arrays and looping to create a new variable to identify the date of when the maximum blood lead value was obtained but got stuck. For context, here is the homework problem:
In 1990 a study was done on the blood lead levels of children in Boston. The following variables for twenty-five children from the study have been entered on multiple lines per subject in the file lead_sum2018.txt in a list format:
Line 1
ID Number (numeric, values 1-25)
Date of Birth (mmddyy8. format)
Day of Blood Sample 1 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 1 (numeric, initial possible range: -9 to 12)
Line 2
ID Number (numeric, values 1-25)
Day of Blood Sample 2 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 2 (numeric, initial possible range: -9 to 12)
Line 3
ID Number (numeric, values 1-25)
Day of Blood Sample 3 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 3 (numeric, initial possible range: -9 to 12)
Line 4
ID Number (numeric, values 1-25)
Blood Lead Level Sample 1 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 2 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 3 (numeric, possible range: 0.01 – 20.00)
Sex (character, ‘M’ or ‘F’)
All blood samples were drawn in 1990. However, during data entry the order of blood samples was scrambled so that the first blood sample in the data file (blood sample 1) may not correspond to the first blood sample taken on a subject, it could be the first, second or third. In addition, some of the months and days and days of blood sampling were not written on the forms. At data entry, missing month and missing day values were each coded as -9.
The team of investigators for this project has made the following decisions regarding the missing values. Any missing days are to set equal to 15, any missing months are to be set equal to 6. Any analyses that are done on this data set need to follow those decisions. Be sure to implement the SAS syntax as indicated for each question. For example, use SAS arrays and loops if the item states that these must be used.
Here is the data that the HW references (it is in list format and was contained in a separate file called lead_sum2018.txt):
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 6 6
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
5 12/26/80 21 5
5 3 7
5 -9 12
5 4.35 4.79 5.14 M
6 06/20/81 7 10
6 11 3
6 22 1
6 1.24 1.16 0.71 F
7 06/22/81 19 6
7 3 12
7 29 8
7 3.1 3.21 3.58 F
8 05/24/82 26 7
8 31 1
8 9 10
8 2.99 2.37 2.4 M
9 10/11/82 2 7
9 25 5
9 28 3
9 2.4 1.96 2.71 F
10 . 10 8
10 30 12
10 28 2
10 2.72 2.87 1.97 F
11 11/16/83 19 4
11 15 11
11 7 -9
11 4.8 4.5 4.96 M
12 03/02/84 17 6
12 11 2
12 17 11
12 2.38 2.6 2.88 F
13 04/19/84 2 12
13 -9 6
13 1 7
13 1.99 1.20 1.21 M
14 02/07/85 4 5
14 17 5
14 21 11
14 1.61 1.93 2.32 F
15 07/06/85 5 2
15 16 1
15 14 6
15 3.93 4 4.08 M
16 09/10/85 12 10
16 11 -9
16 23 6
16 3.29 2.88 2.97 M
17 11/05/85 12 7
17 18 1
17 11 11
17 1.31 0.98 1.04 F
18 12/07/85 16 2
18 18 4
18 -9 6
18 2.56 2.78 2.88 M
19 03/02/86 19 4
19 11 3
19 19 2
19 0.79 0.68 0.72 M
20 08/19/86 21 5
20 15 12
20 -9 4
20 0.66 1.15 1.42 F
21 02/22/87 16 12
21 17 9
21 13 4
21 2.92 3.27 3.23 M
22 10/11/87 7 6
22 1 12
22 -9 3
22 1.43 1.42 1.78 F
23 05/12/88 12 2
23 21 4
23 17 12
23 0.55 0.89 1.38 M
24 08/07/88 17 6
24 27 11
24 6 2
24 0.31 0.42 0.15 F
25 01/12/89 4 7
25 15 -9
25 23 1
25 1.69 1.58 1.53 M
A) Input the data and in the data step:
1) make sure that Date of Birth variable is recorded as a SAS date;
2) use SAS arrays and looping to create a SAS date variable for each of the three blood samples and to address the missing data in accordance to the decisions of the investigators. Hint: use a single array and do loop to recode the missing values for day and month, separately, and an array/do loop for creating the SAS date variable;
3) use a SAS function to create a variable for the highest, i.e., maximum, blood lead value for each child;
4) use SAS arrays and looping to identify the date on which this largest value was obtained and create a new variable for the date of the largest blood lead value;
5) determine the age of the child in years when the largest blood lead value was obtained (rounded to two decimal places);
6) create a new variable based on the age of the child in years when the largest lead value was obtained (call it, “agecat”) that takes on three levels: for children less than 4 years old, agecat should equal 1; for children at least 4 years old, but less than 8, agecat should equal 2; and for children at least 8 years of age, agecat should be 3.;
7) print out the variables for the date of birth, date of the largest lead level, age at blood sample for the largest blood lead level, agecat, sex, and the largest blood lead level (Only print out these requested variables). All dates should be formatted to use the mmddyy10. format on the output.
The code I used in response to this was:
libname HW3 'C:\Users\johns\Desktop\SAS';
filename HW3new 'C:\Users\johns\Desktop\SAS\lead_sum2018.txt';
data one;
infile HW3new;
informat dob mmddyy8.;
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
array dbs{3} dbs1 dbs2 dbs3;
array mbs{3} mbs1 mbs2 mbs3;
do i=1 to 3;
if dbs{i}=-9 then dbs{i}=15;
end;
do i=4 to 6;
if mbs{i}=-9 then mbs{i}=6;
end;
array date{3} mdy1 mdy2 mdy3;
do i=1 to 3;
date{i}=mdy(mbs{i}, dbs{i}, 1990);
end;
maxbls=max(of bls1-bls3);
array bls{3} bls1 bls2 bls3;
array maxdte{3} maxdte1 maxdte2 maxdte3;
do i=1 to i=3;
if bls{i}=maxbls then maxdte=i;
end;
agemax=maxdte-dob;
ageest=round(agemax/365.25,2);
if agemax=. then agecat=.;
else if agemax < 4 then agecat=1;
else if 4 <= agemax < 8 then agecat=2;
else if agemax ge 8 then agecat=3;
run;
I received this error:
22 maxbls=max(of bls1-bls3);
23 array bls{3} bls1 bls2 bls3;
24 array maxdte{3} maxdte1 maxdte2 maxdte3;
25 do i=1 to i=3;
26 if bls{i}=maxbls then maxdte=i;
ERROR: Illegal reference to the array maxdte.
27 end;
Does anyone have any tip is regards to this issue? What did I do wrong? Was I supposed to create an additional array for the date of when the maximum blood lead sample value was collected? Thanks!
**I'm stuck on #4 of Part A, but I included the other parts for context. Thanks!
**Edits: I included the data that I had to read into SAS and the file name of the file it came from
Just from looking at the code immediately prior to the error, you have a problem on this line:
26 if bls{i}=maxbls then maxdte=i;
You are getting the error because you are attempting to assign a value to the array maxdte. Arrays cannot be assigned values like that (unless you are using the deprecated do over syntax...) Instead, choose an element of the array and assign the value to the element. E.g. you could do:
26 if bls{i}=maxbls then maxdte{1}=i;
Or instead of a literal 1, you could use a variable containing the relevant array index.
You are not properly handling ID field from lines #2-4
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
For example you need to skip field 1 on line 2-3 or read the ids into array perhaps to check they are all the same.
input #1 id dob dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex;
This example show how to check that you have 4 lines with the same ID and if you do read the rest of the variables or execute LOSTCARD. ID 3 has a missing record;
353 data ex;
354 infile cards n=4 stopover;
355 input #1 id #2 id2 #3 id3 #4 id4 #;
356 if id eq id2 eq id3 eq id4
357 then input #1 id dob:mmddyy. dbs1 mbs1
358 #2 id2 dbs2 mbs2
359 #3 id3 dbs3 mbs3
360 #4 id4 bls1 bls2 bls3 sex :$1.;
361 else lostcard;
362 format dob mmddyy.;
363 cards;
NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
372 3 01/03/80 11 7
373 3 27 2
374 3 3.24 3.4 3.83 M
375 4 08/01/80 5 12
NOTE: LOST CARD.
376 4 28 -9
NOTE: LOST CARD.
377 4 3 4
NOTE: The data set WORK.EX has 3 observations and 15 variables.
data ex;
infile cards n=4 stopover;
input #1 id #2 id2 #3 id3 #4 id4 #;
if id eq id2 eq id3 eq id4
then input #1 id dob:mmddyy. dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex :$1.;
else lostcard;
format dob mmddyy.;
cards;
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
;;;;
run;
proc print;
run;

Assigning a single value to all cells within a specified time period, matrix format

I have the following example dataset which consists of the # of fish caught per check of a net. The nets are not checked at uniform intervals. The day of the check is denoted in julian days as well as the number of days the net had been fishing since last checked (or since it's deployment in the case of the first check)
http://textuploader.com/9ybp
Site_Number Check_Day_Julian Set_Duration_Days Fish_Caught
2 5 3 100
2 10 5 70
2 12 2 65
2 15 3 22
100 4 3 45
100 10 6 20
100 18 8 8
450 10 10 10
450 14 4 4
In any case, I would like to turn the raw data above into the following format:
http://textuploader.com/9y3t
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
2 0 0 100 100 100 70 70 70 70 70 65 65 22 22 22 0 0 0
100 0 45 45 45 20 20 20 20 20 20 8 8 8 8 8 8 8 8
450 10 10 10 10 10 10 10 10 10 10 4 4 4 4 0 0 0 0
This is a matrix which assigns the # of fish caught during the period to EACH of the days that were within that period. The columns of the matrix are Julian days, the rows are site numbers.
I have tried to do this with some matrix functions but I have had much difficulty trying to populate all the fields that are within the time period, but I do not necessarily have a row of data for?
I had posted my small bit of code here, but upon reflection, my approach is quite archaic and a bit off point. Can anyone suggest a method to convert the data into the matrix provided? I've been scratching my head and googling all day but now I am stumped.
Cheers,
C
Two answers, the second one is faster but a bit low level.
Solution #1:
library(IRanges)
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
cov <- coverage(split(ir, Site_Number),
weight=split(Fish_Caught, Site_Number),
width=max(end(ir)))
do.call(rbind, lapply(cov, as.vector))
})
Solution #2:
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
site <- factor(Site_Number, unique(Site_Number))
m <- matrix(0, length(levels(site)), max(end(ir)))
ind <- cbind(rep(site, width(ir)), as.integer(ir))
m[ind] <- rep(Fish_Caught, width(ir))
m
})
I don't see a super obvious matrix transformation here. This is all i've got assuming the raw data is in a data.frame called dd
dd$Site_Number<-factor(dd$Site_Number)
mm<-matrix(0, nrow=nlevels(dd$Site_Number), ncol=18)
for(i in 1:nrow(dd)) {
mm[as.numeric(dd[i,1]), (dd[i,2]-dd[i,3]):dd[i,2] ] <- dd[i,4]
}
mm

Trouble reading a data set into R

I am new to R and I am trying to read in a data set. The data set is here:
http://petitlien.fr/myfiles
(The above link will expand to a GMX File storage folder link and click on Guest access to retrieve the file.)
The file named mydata.log has 32 entries with no header and it consists of 2 columns which are delimited by spaces.
I am trying the powerful command scan
test.frame<-scan(file="mydata.log",sep= "", nlines=32,blank.lines.skip=TRUE)
The above just read the first 3 rows:
head(test.frame)
[1] 0.0000 0.0000 144.3210 0.3400 159.4070 0.8925
I have tried also read.table:
test.frame<-read.table(file="mydata.log",sep= "", nrows=32,blank.lines.skip=TRUE)
This one reads the first 6 lines only as shown below:
names(test.frame)
[1] "V1" "V2"
> head(test.frame)
V1 V2
1 0.000 0.0000
2 144.321 0.3400
3 159.407 0.8925
4 198.413 0.9450
5 222.557 0.9975
6 235.464 1.0500
Does someone know how to read this data set properly?
A related question: Can I control the number of significant digits or perhaps decimal places in the data being read in?
Thanks a lot...
This line of your code works perfectly:
test.frame<-read.table(file="mydata.log",sep= "", nrows=32,blank.lines.skip=TRUE)
The reason why you only get 6 lines in your output is because you are using head. To view all lines, just enter the name of your object:
> test.frame
V1 V2
1 0.000 0.0000
2 144.321 0.3400
3 159.407 0.8925
4 198.413 0.9450
5 222.557 0.9975
6 235.464 1.0500
7 296.918 1.1025
8 346.773 1.1550
9 442.955 1.2075
10 694.879 1.2600
11 892.436 1.3125
12 1492.970 1.3650
13 2916.960 1.4175
14 3596.060 1.4700
15 5278.950 1.5225
16 7480.730 1.5750
17 12259.800 1.6275
18 14032.600 1.6800
19 19565.600 1.7325
20 31427.700 1.7850
21 58221.400 1.8375
22 92283.900 1.9900
23 165601.000 1.9425
24 165703.000 1.9950
25 213925.000 2.8750
26 260381.000 2.1000
27 312701.000 2.1525
28 370853.000 2.2050
29 479303.000 2.2575
30 487265.000 2.3100
31 545225.000 2.3625
32 703186.000 2.4150
Here is an easy way to see how many rows you have (useful when you have many observations):
nrow(test.frame)
[1] 32
As for the number of digits, see the round command. To look at the documentation for a command, enter a ? and then the command, in this case a function: ?round
#note that you do not have to put "digits=2", you can just put "2", but this way is clearer
> rounded_test.frame <- round(test.frame, digits=2)
> rounded_test.frame
V1 V2
1 0.00 0.00
2 144.32 0.34
3 159.41 0.89
4 198.41 0.94
5 222.56 1.00
6 235.46 1.05
7 296.92 1.10
8 346.77 1.16
9 442.95 1.21
10 694.88 1.26
11 892.44 1.31
12 1492.97 1.36
13 2916.96 1.42
14 3596.06 1.47
15 5278.95 1.52
16 7480.73 1.57
17 12259.80 1.63
18 14032.60 1.68
19 19565.60 1.73
20 31427.70 1.78
21 58221.40 1.84
22 92283.90 1.99
23 165601.00 1.94
24 165703.00 2.00
25 213925.00 2.88
26 260381.00 2.10
27 312701.00 2.15
28 370853.00 2.21
29 479303.00 2.26
30 487265.00 2.31
31 545225.00 2.36
32 703186.00 2.42
Note in the above I created a new object instead of replacing the current one. If you want to replace the current one and lose the data forever (until you reload the dataset of course!), then you can use this line instead:
test.frame <- round(test.frame, digits=2)
If you don't really want to compress your numbers, you might just be interested in viewing the rounded numbers. You can do this the following command:
print(test.frame,digits=2)
Instead of nrow() as suggested, I would recommend str() ("structure") that gives you more useful information about your data set (class of variables etc). It's also a bit less cryptic....:)

Resources