Concat strings with different numbers of values - database

I have five strings, each with a different number of entries (RangeIndex). String 1, for example, could contain the following data:
20160101 1.08526
20160102 1.08535
20160103 1.09052
20160104 1.08659
String 2 is as follows:
20160101 150.0659
20160103 145.8063
20160104 143.5892
With the normal concat function I would get the following result:
20160101 1.08526 20160101 150.0659
20160102 1.08535 20160103 145.8063
20160103 1.09052 20160104 143.5892
20160104 1.08659
The result I am looking for is as follows:
20160101 1.08526 20160101 150.0659
20160102 1.08535 20160102 150.0659
20160103 1.09052 20160103 145.8063
20160104 1.08659 20160104 143.5892
This means that combining the two strings should align the right data with the right date. If a field is empty, the system should choose the previous value to fill the gap.
How can that be done?
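If these "strings" are pandas Series indexed by date (the mention of RangeIndex suggests pandas), one way to get the desired result is to concatenate on the union of dates and forward-fill the gaps; a minimal sketch, with made-up column names:

```python
import pandas as pd

# Hypothetical reconstruction of the two "strings" as date-indexed Series
s1 = pd.Series([1.08526, 1.08535, 1.09052, 1.08659],
               index=pd.to_datetime(["20160101", "20160102", "20160103", "20160104"]))
s2 = pd.Series([150.0659, 145.8063, 143.5892],
               index=pd.to_datetime(["20160101", "20160103", "20160104"]))

# Aligning on the union of dates leaves a NaN where s2 has no entry;
# ffill() copies the previous value down to fill the gap.
df = pd.concat([s1, s2], axis=1, keys=["eur", "jpy"]).ffill()
print(df)
```

On 20160102, where String 2 has no entry, the previous value 150.0659 is carried forward, matching the desired output.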


Using the window function "last_value", when the values of the sorted field are the same, the value Snowflake returns is not the last value

As we all know, the window function "last_value" returns the last value within an ordered group of values.
In the following example, I group by field "A" and sort by field "B" in ascending order.
In the group where A = 1, the last value is returned, that is, the C value 4 where B = 2.
However, in the group where A = 2, the values of field "B" are the same.
Here, instead of the last value (the C value 4 in row 6), the first C value 1 where B = 2 is returned.
It puzzles me that the last value within an ordered group is not returned when there are ties in the sorting column.
Example
row_number | A | B | C | LAST_VALUE(C) IGNORE NULLS OVER (PARTITION BY A ORDER BY B ASC)
-----------|---|---|---|------------------------------------------------------------------
1          | 1 | 1 | 2 | 4
2          | 1 | 1 | 1 | 4
3          | 1 | 1 | 3 | 4
4          | 1 | 2 | 4 | 4
5          | 2 | 2 | 1 | 1
6          | 2 | 2 | 4 | 1
For partition A equals 2 and column B, there is a tie:
The sort is NOT stable. To achieve stable sort a column or a combination of columns in ORDER BY clause must be unique.
To illustrate it:
SELECT C
FROM tab
WHERE A = 2
ORDER BY B
LIMIT 1;
It could return either 1 or 4.
If you sort by B within A then any duplicate rows (same A and B values) could appear in any order and therefore last_value could give any of the possible available values.
If you want a specific row, based on some logic, then you need to sort by all columns within the group that reflect that logic. So in your case you would need to sort by B and C.
Good day Bill!
Right, the sorting is not stable and it will return different output each time.
To get stable results, we can run something like the following:
select
    column1,
    column2,
    column3,
    last_value(column3) over (partition by column1 order by column2, column3) as column3_last
from values
    (1,1,2), (1,1,1), (1,1,3),
    (1,2,4), (2,2,1), (2,2,4)
order by column1;
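The same tie-breaking idea can be sketched outside SQL. Assuming the example table as a small pandas frame, adding C to the sort key makes "last value per group" deterministic:

```python
import pandas as pd

# The six rows from the question's example table
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2],
                   "B": [1, 1, 1, 2, 2, 2],
                   "C": [2, 1, 3, 4, 1, 4]})

# Sorting by B alone leaves the order of tied rows unspecified in SQL;
# including C as a tie-breaker pins down which row is "last" per group.
last_c = (df.sort_values(["A", "B", "C"])
            .groupby("A")["C"]
            .last())
print(last_c)
```

With the tie-breaker in place, both groups now return C = 4 as the last value.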

Calculate MTTF with SQL query

I'm trying to calculate the expected average time between the end of a fault and the beginning of the next one (two separate columns) [MTTF].
MTTF = Mean Time To Failure: the average time from the end of a fault to the beginning of the next.
I have already asked a similar question and I have been answered very professionally. I love this forum.
I have to calculate the difference between the dates in the Failure column and in the End_Of_Repair column for each row -1 (another column), transforming it into hours and dividing it by the number of intervals to calculate the average time.
Intervals = number of rows - 1
This is my table:
Failure |Start_Repair |End_Of_Repair |Line |Operator
------------------------|------------------------|------------------------|---------|--------
2019-06-26 06:30:00 |2019-06-26 10:40:00 |2019-06-27 12:00:00 |A |Mike
2019-06-28 00:10:00 |2019-06-28 02:40:00 |2019-06-29 01:12:00 |A |Loty
2019-06-30 10:10:00 |2019-06-30 02:40:00 |2019-07-01 00:37:00 |B |Judy
2019-07-02 12:01:00 |2019-07-02 14:24:00 |2019-07-05 00:35:00 |B |Judy
2019-07-06 07:08:00 |2019-07-06 15:46:00 |2019-07-07 02:30:00 |A |Mike
2019-07-07 08:22:00 |2019-07-08 05:19:00 |2019-07-08 08:30:00 |B |Loty
2019-07-29 04:10:00 |2019-07-29 07:40:00 |2019-07-29 14:00:05 |A |Judy
So I have to take the difference between the Failure and End_Of_Repair columns: the second Failure minus the first End_Of_Repair, the third minus the second, and so on, divided by the calculated number of intervals (which is the number of rows - 1, since I start from row 2 minus row 1).
In a nutshell, a staggered average across the two columns.
I attach an image to illustrate the idea.
So I'll take the seventh row of the Failure column minus the sixth row of the End_Of_Repair column, the sixth row of Failure minus the fifth row of End_Of_Repair, and so on down to the first.
I thought:
SELECT line,
DATEDIFF(hour, min(End_Of_Repair), max (Failure)) / nullif(count(*) - 1, 0) as 'intervals'
from Test_Failure
group by line
But the results are:
A = 253
B = 76
The result should be:
A = (12,1+173,93+529,6) / 3 = 238h
B = (35,4 + 55,78) / 2 = 45,59h
One method would be to use LAG to get the value from the previous row; then you can average the difference in hours between the 2 times:
WITH CTE AS(
SELECT V.Failure,
V.Start_Repair,
V.End_Of_Repair,
V.Line,
V.Operator,
LAG(V.End_Of_Repair) OVER (PARTITION BY V.Line ORDER BY V.Failure) AS LastRepair
FROM (VALUES('2019-06-26T06:30:00','2019-06-26T10:40:00','2019-06-27T12:00:00','A ','Mike'),
('2019-06-28T00:10:00','2019-06-28T02:40:00','2019-06-29T01:12:00','A ','Loty'),
('2019-06-30T10:10:00','2019-06-30T02:40:00','2019-07-01T00:37:00','B ','Judy'),
('2019-07-02T12:01:00','2019-07-02T14:24:00','2019-07-05T00:35:00','B ','Judy'),
('2019-07-06T07:08:00','2019-07-06T15:46:00','2019-07-07T02:30:00','A ','Mike'),
('2019-07-07T08:22:00','2019-07-08T05:19:00','2019-07-08T08:30:00','B ','Loty'),
('2019-07-29T04:10:00','2019-07-29T07:40:00','2019-07-29T14:00:05','A ','Judy'))V(Failure, Start_Repair, End_Of_Repair, Line, Operator))
SELECT CTE.Line,
AVG(DATEDIFF(HOUR, CTE.LastRepair, CTE.Failure)) AS FaultHours
FROM CTE
GROUP BY CTE.Line;
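As a cross-check of the LAG-then-average logic, here is a rough pandas equivalent (shift plays the role of LAG; seconds dropped from the last timestamp for brevity):

```python
import pandas as pd

# The question's failure log, reconstructed
df = pd.DataFrame({
    "Failure": pd.to_datetime(["2019-06-26 06:30", "2019-06-28 00:10",
                               "2019-06-30 10:10", "2019-07-02 12:01",
                               "2019-07-06 07:08", "2019-07-07 08:22",
                               "2019-07-29 04:10"]),
    "End_Of_Repair": pd.to_datetime(["2019-06-27 12:00", "2019-06-29 01:12",
                                     "2019-07-01 00:37", "2019-07-05 00:35",
                                     "2019-07-07 02:30", "2019-07-08 08:30",
                                     "2019-07-29 14:00"]),
    "Line": ["A", "A", "B", "B", "A", "B", "A"],
})

df = df.sort_values(["Line", "Failure"])
# shift() fetches the previous End_Of_Repair within each line,
# like LAG(End_Of_Repair) OVER (PARTITION BY Line ORDER BY Failure)
df["LastRepair"] = df.groupby("Line")["End_Of_Repair"].shift()
mttf = ((df["Failure"] - df["LastRepair"])
        .dt.total_seconds().div(3600)
        .groupby(df["Line"]).mean())
print(mttf.round(2))
```

This reproduces the hand-computed averages (about 238.59h for line A and 45.59h for line B), keeping the fractional hours that integer DATEDIFF(HOUR, ...) truncates.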

Removing duplicates from one field where the rest of the row isn't duplicated

I have a problem where the data set sometimes returns 2 rows for a serial number. This occurs when a serial number has been removed and has one row where the removal date is NULL and one where it's populated.
I've managed to get a query where the NULLs are removed by using a min() and group by clause, but this also removes the NULLs where the meter hasn't been removed.
SELECT [MeterSerialNumber]
,[EquipmentType]
,[InstallDate]
,min ([Removaldate] ) as REM_DATE
,round (DATEDIFF(DAY,InstallDate,case when Removaldate IS null then convert (date,GETDATE()) else Removaldate end)/30.42,0) as Age_M
FROM [DOCDPT].[main].[Tbl_Device_ISU]
where EquipmentType in ('S1','NS','NSS') or EquipmentType like ('%S2%')
Group by MeterSerialNumber,EquipmentType,InstallDate,Removaldate having COUNT(distinct removaldate) =1
order by MeterSerialNumber
,Removaldate desc
These are the results prior to adding the min() and group by clause. I would like to remove row 2, as the meter has been removed, but leave the bottom 2 rows. The code above seems to just remove all the NULLs. I only want to remove the NULLs where the MeterSerialNumber appears more than once.
MeterSerialNumber I EquipmentType I InstallDate I Removaldate I Age_M
000009501794462 I S1 I 2017-06-18 I 2018-01-22 I 7.000000
000009501794462 I S1 I 2017-06-18 I NULL I 23.000000
000009999203079 I S1 I 2017-06-18 I NULL I 23.000000
000009995553079 I S1 I 2017-06-18 I NULL I 23.000000
I presume the issue is that COUNT does not count NULLs.
If I understand this correctly, I think you just need to remove [Removaldate] from the GROUP BY, get rid of the HAVING and use MIN([Removaldate]) in the calculation of Age_M and the ORDER BY like this:
SELECT
[MeterSerialNumber]
,[EquipmentType]
,[InstallDate]
,MIN([Removaldate]) as REM_DATE
,ROUND(DATEDIFF(DAY, InstallDate, case when MIN(Removaldate) IS null then CONVERT (date,GETDATE()) else MIN(Removaldate) end)/30.42,0) as Age_M
FROM
Tbl_Device_ISU
WHERE
EquipmentType in ('S1','NS','NSS') or EquipmentType like ('%S2%')
GROUP BY
MeterSerialNumber,
EquipmentType,
InstallDate
ORDER BY
MeterSerialNumber,
REM_DATE desc
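For illustration only, the core "earliest removal date wins, NULLs ignored" behaviour of MIN can be sketched in pandas with the question's rows:

```python
import pandas as pd

# Hypothetical reconstruction of the duplicated rows from the question
df = pd.DataFrame({
    "MeterSerialNumber": ["000009501794462", "000009501794462",
                          "000009999203079", "000009995553079"],
    "Removaldate": pd.to_datetime(["2018-01-22", None, None, None]),
})

# min() ignores NaT, so a serial with both a dated and a NULL removal row
# keeps the dated one; serials with only NULL keep NaT (still installed).
rem = df.groupby("MeterSerialNumber")["Removaldate"].min()
print(rem)
```

The duplicated serial collapses to its 2018-01-22 removal date while the never-removed serials keep their NULL, which is exactly what the SQL MIN/GROUP BY achieves once the HAVING clause is dropped.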

SAS: How can I filter for (multiple) entries which are closest to the last day of month (for each month)

I have a large Dataset and want to filter it for all rows with date entry closest to the last day of the month, for each month. So there could be multiple entries for the day closest to the last day of month.
So for instance:
original Dataset
date price name
05-01-1995 1,2 abc
06-01-1995 1,5 def
07-01-1995 1,8 ghi
07-01-1995 1,7 mmm
04-02-1995 1,9 jkl
27-02-1995 2,1 mno
goal:
date price name
07-01-1995 1,8 ghi
07-01-1995 1,7 mmm
27-02-1995 2,1 mno
I had 2 ideas, but I am failing with implementing it within a loop (for traversing the months) in SAS.
1. idea: create a new column which indicates the last day of the current month (intnx() function); then filter for all entries closest to the last day of their month:
date price name last_day_of_month
05-01-1995 1,2 abc 31-01-1995
06-01-1995 1,5 def 31-01-1995
07-01-1995 1,8 ghi 31-01-1995
04-02-1995 1,9 jkl 28-02-1995
27-02-1995 2,1 mno 28-02-1995
2. idea: simply filter, for each month, the entries with the highest date (maybe using the max function?!)
I would be very glad if you were able to help me, as I am used to ordinary programming languages and just started with SAS for research purposes.
proc sql is one way to solve this kind of situation. I'll break down your original requirements with explanations of how to interpret them in SQL.
Since you want to group your observations on date, you can use the having clause to filter on the max date per month.
data work.have;
input date DDMMYY10. price name $;
format date date9.;
datalines;
05-01-1995 1.2 abc
07-01-1995 1.8 ghi
06-01-1995 1.5 def
07-01-1995 1.7 mmm
04-02-1995 1.9 jkl
27-02-1995 2.1 mno
;
data work.want;
input date DDMMYY10. price name $;
format date date9.;
datalines;
07-01-1995 1.8 ghi
07-01-1995 1.7 mmm
27-02-1995 2.1 mno
;
proc sql ;
create table work.want as
select *
/*, max(date) as max_date format=date9.*/
/*, intnx('month',date,0,'end') as monthend format=date9.*/
from work.have
group by intnx('month',date,0,'end')
having max(date) = date
order by date, name
;
If you uncomment the comments, the actual filters used are shown in the output table.
Comparing the requirements against the solution:
proc compare base=work.want compare=work.solution;
results in
NOTE: No unequal values were found. All values compared are exactly equal.
1) create a new variable periode = put(date,yymmn6.) /* gives you yyyymm*/
2) sort the table on periode and date
3) now a periode.last logic will select the record you need per periode.
Something like...
data tab2;
set your_table;
periode = put(date,yymmn6.);
run;
proc sort data= tab2;
by periode date;
run;
data tab3;
set tab2;
by periode;
if last.periode then output;
run;
You can use two SAS functions called intnx and intck to do this with proc sql:
proc sql ;
create table want as
select *, put(date,yymmn6.) as month, intck('days',date,intnx('month',date,0,'end')) as DaysToEnd
from have
group by month
having (DaysToEnd=min(DaysToEnd))
;quit ;
Intnx() adjusts dates by intervals. In the above case, the four parameters used are:
What size 'step' you want to add/subtract the intervals in.
The date that is being referenced
How many interval steps to make
How to 'round' the step (eg round it to the start/end/middle of the resultant day/week/year)
Intck() simply counts interval steps between two dates
This will give you all records which fall on the day closest to the end of the month
Another approach is by using proc rank;
data mid;
retain yrmth date;
set have;
format date yymmddn8.;
yrmth = put(date,yymmn6.);
run;
proc sort data = mid;
by yrmth descending date;
run;
proc rank data = mid out = want descending ties=low;
by yrmth;
var date;
ranks rankdt;
run;
data want1;
set want;
where rankdt = 1;
run;
HTH
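For readers coming from other tools, the "keep rows matching the max date per month" filter used in the proc sql answer maps to a short pandas sketch (not SAS):

```python
import pandas as pd

# The question's sample data, with comma decimals converted to points
df = pd.DataFrame({
    "date": pd.to_datetime(["1995-01-05", "1995-01-06", "1995-01-07",
                            "1995-01-07", "1995-02-04", "1995-02-27"]),
    "price": [1.2, 1.5, 1.8, 1.7, 1.9, 2.1],
    "name": ["abc", "def", "ghi", "mmm", "jkl", "mno"],
})

# Keep every row whose date equals the max date within its month,
# which mirrors the SQL "having max(date) = date" group filter.
month = df["date"].dt.to_period("M")
want = df[df["date"] == df.groupby(month)["date"].transform("max")]
print(want)
```

As in the goal table, both 07-01-1995 rows survive (ties on the latest date are kept), plus the single 27-02-1995 row.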

Merging Data to Run Specific Individual Analysis

I have two data sets. FIRST is a list of products and their daily prices from a supplier and SECOND is a list of start and end dates (as well as other important data for analysis). How can I tell Stata to pull the price at the beginning date and then the price at the end date from FIRST into SECOND for the given dates. Please note, if there is no exact matching date I would like it to grab the last date available. For example, if SECOND has the date 1/1/2013 and FIRST has prices on ... 12/30/2012, 12/31/2012, 1/2/2013, ... it would grab the 12/31/2012 price.
I would usually do this with Excel, but I have millions of observations, and it is not feasible.
I have put an example of FIRST and SECOND as well as what the optimal solution would give as an output POST_SECOND
FIRST
Product Price Date
1 3 1/1/2010
1 3 1/3/2010
1 4 1/4/2010
1 2 1/8/2010
2 1 1/1/2010
2 5 2/5/2010
3 7 12/26/2009
3 2 1/1/2010
3 6 4/3/2010
SECOND
Product Start Date End Date
1 1/3/2010 1/4/2010
2 1/1/2010 1/1/2010
3 12/26/2009 4/3/2010
POST_SECOND
Product Start Date End Date Price_Start Price_End
1 1/3/2010 1/4/2010 3 4
2 1/1/2010 1/1/2010 1 1
3 12/26/2009 4/3/2010 7 6
Here's a merge/keep/sort/collapse* solution that relies on using the last date. I altered your example data slightly.
/* Make Fake Data & Convert Dates to Date Format */
clear
input byte Product byte Price str12 str_date
1 3 "1/1/2010"
1 3 "1/3/2010"
1 4 "1/4/2010"
1 2 "1/8/2010"
2 1 "1/1/2010"
2 5 "2/5/2010"
3 7 "12/26/2009"
3 7 "12/28/2009"
3 2 "1/1/2010"
3 6 "4/3/2010"
4 8 "12/30/2012"
4 9 "12/31/2012"
4 10 "1/2/2013"
4 10 "1/3/2013"
end
gen Date = date(str_date,"MDY")
format Date %td
drop str_date
save "First.dta", replace
clear
input byte Product str12 str_Start_Date str12 str_End_Date
1 "1/3/2010" "1/4/2010"
2 "1/1/2010" "1/1/2010"
3 "12/27/2009" "4/3/2010"
4 "1/1/2013" "1/2/2013"
end
gen Start_Date = date(str_Start_Date,"MDY")
gen End_Date = date(str_End_Date,"MDY")
format Start_Date End_Date %td
drop str_*
save "Second.dta", replace
/* Data Transformation */
use "First.dta", clear
merge m:1 Product using "Second.dta", nogen
bys Product: egen ads = min(abs(Start_Date-Date))
bys Product: egen ade = min(abs(End_Date - Date))
keep if (ads==abs(Date - Start_Date) & Date <= Start_Date) | (ade==abs(Date - End_Date) & Date <= End_Date)
sort Product Date
collapse (first) Price_Start = Price (last) Price_End = Price, by(Product Start_Date End_Date)
list, clean noobs
*Some people are reshapers. Others are collapsers. Often both can get the job done, but I think collapse is easier in this case.
In Stata, I've never been able to get something like this to work nicely in one step (something you can do in SAS via a SQL call). In any case, I think you'd be better off creating an intermediate file from FIRST.dta and then merging that 2x on each of your StartDate and EndDate variables in SECOND.dta.
Say you have data for price adjustments from Jan 1, 2010 to Dec 31, 2013 (specified with varied intervals as you have shown above). I assume all the date variables are already in date format in FIRST.dta & SECOND.dta, and that variable names in SECOND do not have spaces in them.
tempfile prod prices
use FIRST.dta, clear
keep Product
duplicates drop
save `prod'
clear
set obs 1096
g Date=date("12-31-2009","MDY")+_n
format Date %td
cross using `prod'
merge 1:1 Product Date using FIRST.dta, assert(1 3) nogen
gsort +Product +Date /*this ensures the data are sorted properly for the next step */
replace Price=Price[_n-1] if Price==. & Product==Product[_n-1]
save `prices'
use SECOND.dta, clear
foreach i in Start End {
rename `i'Date Date
merge 1:1 Product Date using `prices', assert(2 3) keep(3) nogen
rename Price Price_`i'
rename Date `i'Date
}
This should work if I understand your data structures correctly, and it should address the issue being discussed in the comments to #Dimitriy's answer. I'm open to critiques on how to make this nicer, as it's something I've had to do a few times and this is how I usually go about it.
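Outside Stata, this "last available date at or before the lookup date" rule is exactly what pandas merge_asof with direction="backward" does; a sketch using the question's FIRST/SECOND data (the helper function name is made up):

```python
import pandas as pd

# FIRST: product price history; merge_asof requires sorting by the "on" key
first = pd.DataFrame({
    "Product": [1, 1, 1, 1, 2, 2, 3, 3, 3],
    "Price":   [3, 3, 4, 2, 1, 5, 7, 2, 6],
    "Date": pd.to_datetime(["1/1/2010", "1/3/2010", "1/4/2010", "1/8/2010",
                            "1/1/2010", "2/5/2010",
                            "12/26/2009", "1/1/2010", "4/3/2010"]),
}).sort_values("Date")

# SECOND: start/end dates to look up
second = pd.DataFrame({
    "Product": [1, 2, 3],
    "Start_Date": pd.to_datetime(["1/3/2010", "1/1/2010", "12/26/2009"]),
    "End_Date": pd.to_datetime(["1/4/2010", "1/1/2010", "4/3/2010"]),
})

# direction="backward" takes the most recent price at or before the given
# date, per product -- the "grab the last date available" requirement.
def asof_price(dates, col):
    m = pd.merge_asof(
        second.assign(Date=dates).sort_values("Date"),
        first.rename(columns={"Price": col}),
        on="Date", by="Product", direction="backward")
    return m.sort_values("Product")[col].to_numpy()

second["Price_Start"] = asof_price(second["Start_Date"], "Price_Start")
second["Price_End"] = asof_price(second["End_Date"], "Price_End")
print(second)
```

This reproduces the POST_SECOND table, and it also handles the 1/1/2013-falls-back-to-12/31/2012 case from the question without building an intermediate daily-price file.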
