MATLAB - array2table nesting - arrays

For the purpose of simplicity I'll try to take an example from everyday life. Let's say I have a table in CSV file loaded in a table called dataOriginal with 3 columns - names, jobs , dates.
Let's take a closer look at the column "date":
date
____
'13.01.2014 20:34'
'22.03.2014 11:17'
...
I want to split each date into a date vector and add this vector (along with the variable names for each of its columns; since we have multiple dates, we de facto have a matrix) as a column in a new table, again named "date", but with all the naming goodies in it such as year, month etc.
Here is what I have done so far (sorry for the poor code quality but I've just started learning MATLAB :-/):
I split each date in a date-vector and also add names to each element like this:
dateFormat = 'dd.mm.yyyy HH:MM'; % four-digit year, matching '13.01.2014 20:34'
[year,month,day,hour,minute,second] = datevec(datesRaw, dateFormat);
so that I have this:
year(1) % returns '2014' since this is the first date in my column
year % returns all years in my entire column
Then I converted the above to a table:
dates = array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month',...,'second'});
so I get a nice output like this
year month second
____ _____ ... ______
2014 1 0
2014 3 0
... ... ... ...
This allows me an easy-to-read access to each column by simply calling for example:
year % returns all years
year(1) % returns first entry's year (here: '2014' from '13.01.2014 20:34')
I've processed my other columns too doing various operations on those and at the end I'm trying to horizontally concatenate all like this:
name job date
____ _____________________ _____________________
year month ... second
____ _____ ______
"Bob" "Construction worker" 2014 1 ... 0
"Alice" "Waitress" 2014 3 ... 0
... ... ... ... ... ...
I'm struggling exactly with the part with the nesting of year,month etc. in a single column named "date". I'd like to address a date's element in the table above as follows:
myData.name(1) % will return 'Bob'
myData.job(1) % will return 'construction worker'
myData.date(1).year(1) % should return '2014' for Bob, the construction worker
Currently I'm having the following code after some sweating and swearing:
dataFinal = horzcat( ...
    array2table([dataProcessed(:,1), dataProcessed(:,2)], 'VariableNames', [dataOriginal.Properties.VariableNames(1), dataOriginal.Properties.VariableNames(2)]), ...
    array2table([year,month,day,hour,minute,second], 'VariableNames', {'year','month','day','hour','minute','second'}))
where
dataProcessed(:,1) are my processed names
dataProcessed(:,2) are my processed jobs
dataOriginal.Properties.VariableNames(1) is the name of the first column in my original table - "name"
dataOriginal.Properties.VariableNames(2) is the name of the second column in my original table - "job"
I do not know how to insert
array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month','day','hour','minute','second'})
in a named column "date" in order to accomplish my goal.
Thanks!

Try the following, it may be what you're looking for:
data = table(names, jobs, table(years, months, ...), 'VariableNames', {'name', 'job', 'date'})
Though you will address as follows, which is slightly different from what you said you want; it may still work for your purposes:
data.name(1);
data.job(1);
data.date.year(1);
EDIT: To see your output, do
disp([data(:, ~strcmp(data.Properties.VariableNames, 'date')), data.date])
names ids years months
_____ ___ _____ ______
'Bob' 1 2014 4
'Max' 2 2013 8
(when editing the comment I didn't exactly replicate the data and fields from the answer, but I think you should get the point here).
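For readers who want to see the target shape outside MATLAB, here is a minimal Python sketch of the same idea: each date string is split into named components, and those components are nested under a single "date" field per row. The names raw_rows, split_date and my_data are hypothetical; the sample rows are taken from the question.

```python
from datetime import datetime

# Sample rows from the question; the layout of raw_rows is an assumption.
raw_rows = [
    ("Bob", "Construction worker", "13.01.2014 20:34"),
    ("Alice", "Waitress", "22.03.2014 11:17"),
]

def split_date(text):
    """Split a 'dd.mm.yyyy HH:MM' string into named components."""
    parsed = datetime.strptime(text, "%d.%m.%Y %H:%M")
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
        "minute": parsed.minute,
        "second": parsed.second,
    }

# Nest the date components under one "date" key per row.
my_data = [
    {"name": name, "job": job, "date": split_date(date_text)}
    for name, job, date_text in raw_rows
]

print(my_data[0]["name"])          # Bob
print(my_data[0]["date"]["year"])  # 2014
```

As in the MATLAB answer, the nested components are reached through the "date" field first and the component name second.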


Merging time series with different number of observations where variables have the same name (SAS)

I have a bunch of time series data (sas-files) which I like to merge / combine up to a larger table (I am fairly new to SAS).
Filenames:
cq_ts_SYMBOL, where SYMBOL is the respective symbol for each file
with the following structure:
cq_ts_AAA.sas7bdat: file1
SYMBOL DATE TIME BID ASK MID
AAA 20100101 9:30:00 10.375 10.4 .
AAA 20100101 9:31:00 10.38 10.4 .
.
.
AAA 20150101 15:59:00 15 15.1 .
cq_ts_BBB.sas7bdat: file2
SYMBOL DATE TIME BID ASK MID
BBB 20120101 9:30:00 12.375 12.4 .
BBB 20120102 9:31:00 12.38 12.4 .
.
.
BBB 20170101 15:59:00 20 20.1 .
Key characteristics:
- They have the same variable name
- They have different number of observations
- They are all saved in the same folder
So what I want to do is:
- Create 3 tables: BID-table, ASK-table, Mid-table with the following structure, ie for bid-table, cq_ts_bid.sas7bdat:
DATE TIME AAA BBB ...
20100101 9:30:00 10.375 .
20100102 9:31:00 10.38 .
.
.
20120101 9:30:00 9.375 12.375
20120102 9:31:00 9.38 12.38
.
.
20150101 15:59:00 15 17
.
.
20170101 15:59:00 . 20
It is not all too difficult to do it for 2 stock time series; however, I was wondering whether it is possible to do the following:
From data set cq_ts_AAA take DATE TIME BID and rename BID to AAA (either from the values in symbol? does this make sense? or get the name from the filename).
Do the same for cq_ts_BBB.
In fact, loop through the folder to get the number of files and filenames (this part I got more or less, see below).
Merge cq_ts_AAA and cq_ts_BBB having DATE TIME AAA (former bid price of AAA) BBB (former bid price of BBB), for all the files in the folder.
Do this for BID, then for ASK and finally MID (actually I couldn't get the midpoint variable from bid and ask (i.e. mid= (bid + ask) / 2;) just gives me the "." in the previous data steps when creating the files).
I think a macro to first get each single file then rename (when should this step take place?) it and merge them together - like a double loop.
Here the renaming and merging part:
data ALDW_short (rename=(iprice = ALDW));
set output.cq_ts_aldw;
retain date time ALJ;
run;
data ALJ_short (rename= (iprice = ALJ));
set output.cq_ts_alj;
retain date time datetime ALJ;
run;
data ALDW_ALJ_merged (keep= date itime ALDW ALJ);
merge ALDW_short ALJ_short;
by datetime;
run;
This is the part to loop through the folder and get a list of names:
proc contents data = output._all_ out = outputcont(keep = memname) noprint;
run;
proc sort data = outputcont nodupkey;
by memname;
run;
data _null_;
set outputcont end = last;
by memname;
i+1;
call symputx('name'||trim(left(put(i,8.))),memname);
if last then call symputx('count',i);
run;
Would it make sense to extract the symbol (and how? they have different length) from the filename or just to take them from the variable SYMBOL (and how can I get the one value to rename my column?)?
Somehow I have difficulty changing the order of columns, ie. I tried with retain and format.
Looks like you could do this easily with PROC TRANSPOSE. Combine your datasets into a single dataset.
data all ;
set output.cq_ts_: ;
by date time;
run;
Then use PROC TRANSPOSE for each of your source variables/target tables.
proc transpose data=all out=bid ;
by date time ;
id symbol;
var bid;
run;
Given your example data, a formula for MID of
mid = (bid + ask)/2 ;
should work. Most likely, if you got all missing values, you put the assignment statement before the SET or INPUT statement. In other words, you were trying to calculate using values that had not been read in yet.
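The stack-then-transpose approach can be sketched outside SAS as follows. This is only an illustration in Python under stated assumptions: the rows are a small made-up sample in the question's layout (the 9.4 ask for AAA is invented), and pivot, rows and with_mid are hypothetical names.

```python
from collections import defaultdict

# Hypothetical long-format rows: (symbol, date, time, bid, ask).
rows = [
    ("AAA", "20100101", "9:30:00", 10.375, 10.4),
    ("AAA", "20120101", "9:30:00", 9.375, 9.4),
    ("BBB", "20120101", "9:30:00", 12.375, 12.4),
]

def pivot(rows, value_index):
    """Return {(date, time): {symbol: value}} for one value column."""
    wide = defaultdict(dict)
    for row in rows:
        symbol, date, time = row[0], row[1], row[2]
        wide[(date, time)][symbol] = row[value_index]
    return dict(wide)

# Compute MID only after BID and ASK have been read for the row.
with_mid = [row + ((row[3] + row[4]) / 2,) for row in rows]

bid_table = pivot(with_mid, 3)
mid_table = pivot(with_mid, 5)

print(bid_table[("20120101", "9:30:00")])  # {'AAA': 9.375, 'BBB': 12.375}
print(bid_table[("20100101", "9:30:00")])  # {'AAA': 10.375}
```

A symbol that has no observation for a given date and time is simply absent from that row, which corresponds to the missing value "." in the desired SAS table.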

Calculated column in DAX to show current BusinessArea

I have a table in my SSAS-model with SCD-type2 functionality.
CustNr StartDate EndDate BusinessArea
123 2014-12-01 2015-01-01 Norway
123 2015-01-01 - Sweden
I need a calculated column (DAX) which shows the current BusinessArea (based on customer number). How do I approach it? I've heard about the "Earlier" function but I am new to DAX and cannot get my head around it.
The desired output would be like this:
CustNr StartDate EndDate BusinessArea CurrentBA
123 2014-12-01 2015-01-01 Norway Sweden
123 2015-01-01 - Sweden Sweden
All answers are welcome! Cheers!
LOOKUPVALUE()
(edit: note original left for continuity - correct measure below in edit section)
CurrentBusinessArea =
LOOKUPVALUE(
DimCustomer[BusinessArea] // Lookup column - will return value
// matching search criteria below.
,DimCustomer[CustNr] // Search column 1.
,DimCustomer[CustNr] // Value to match to search column 1 -
// this is evaluated in row context.
,DimCustomer[EndDate] // Search column 2.
,"-" // Literal value to match for search
// column 2.
)
I doubt that your [EndDate] is actually a text field, so I also doubt that the literal value for [EndDate] for the row that represents the current business area is actually "-". If it's blank, use the BLANK() function rather than a literal "-".
Edit based on comment, corrected measure definition with some discussion:
CurrentBusinessArea =
LOOKUPVALUE(
DimCustomer[BusinessArea]
,DimCustomer[CustNr]
,DimCustomer[CustNr]
,DimCustomer[EndDate]
,DATE(BLANK(),BLANK(),BLANK())
)
Normally in DAX you can test directly for equality to BLANK(). It tends not to act similarly to NULL in SQL. In fact, you can create a column to test this. If you do any of these, they return true for the row with a blank [EndDate]:
=DimCustomer[EndDate] = BLANK()
=ISBLANK(DimCustomer[EndDate])
=DimCustomer[EndDate] = 0 //implicit conversion 0 = blank
For some reason there is an issue in the conversion from Date/Time to BLANK(). The construction above, using the DATE() function fed with all BLANK()s works for me. I had assumed that LOOKUPVALUE() would work with a literal blank (fun fact, if data type is Integer, LOOKUPVALUE() works with a BLANK()). Apologies on that.
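The logic of the lookup (find the row of the same customer whose EndDate is blank and return its BusinessArea) can be sketched in Python. This is not DAX and says nothing about how LOOKUPVALUE() handles blanks internally; None stands in for a blank EndDate, and dim_customer and current_area are hypothetical names.

```python
# Rows from the question; None stands in for a blank EndDate.
dim_customer = [
    {"CustNr": 123, "StartDate": "2014-12-01", "EndDate": "2015-01-01", "BusinessArea": "Norway"},
    {"CustNr": 123, "StartDate": "2015-01-01", "EndDate": None, "BusinessArea": "Sweden"},
]

# Index the open-ended (current) row per customer.
current_area = {
    row["CustNr"]: row["BusinessArea"]
    for row in dim_customer
    if row["EndDate"] is None
}

# Add the calculated column to every row of that customer.
for row in dim_customer:
    row["CurrentBA"] = current_area.get(row["CustNr"])

print([row["CurrentBA"] for row in dim_customer])  # ['Sweden', 'Sweden']
```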

Report Builder 3.0 - grouping rows by time of day

I am trying to create a table within a report that appears as follows:
The data set is based on this query:
SELECT
DATENAME(dw, CurrentReadTime) AS 'DAY',
DATEPART(dw, CurrentReadTime) AS 'DOW',
CAST(datename(HH, CurrentReadTime) as int) AS 'HOD',
AVG([Difference]) AS 'AVG'
FROM
Consumption
INNER JOIN Readings ON Readings.[RadioID-Hex] = Consumption.[RadioID-Hex]
WHERE
CONCAT([Building], ' ', [Apt]) = @ServiceLocation
GROUP BY
CurrentReadTime
ORDER BY
DATEPART(DW, CurrentReadTime),
CAST(DATENAME(HH, CurrentReadTime) AS INT)
The data from this table returns as follows:
In report builder, I have added this code to the report properties:
Function GetRangeValueByHour(ByVal Hour As Integer) As String
Select Case Hour
Case 6 To 12
GetRangeValueByHour = "Morning"
Case 12 to 17
GetRangeValueByHour = "Afternoon"
Case 17 to 22
GetRangeValueByHour = "Evening"
Case Else
GetRangeValueByHour = "Overnight"
End Select
Return GetRangeValueByHour
End Function
And this code to the "row group":
=Code.GetRangeValueByHour(Fields!HOD.Value)
When I execute the report, selecting the parameter for the target service location, I get this result:
As you will notice, the "Time of Day" is displaying the first result that meets the CASE expression in the Report Properties code; however, I confirmed that ALL "HOD" (stored as an integer) are being grouped together by doing a SUM on this result.
Furthermore, the actual table values (.05, .08, etc) are only returning the results for the HOD that first meets the requirements of the CASE statement in the VB code.
These are the things I need resolved, but can't figure out:
Why isn't the Report Properties VB code displaying "Morning", "Afternoon", "Evening", and "Overnight" in the Time of Day column?
How do I group together the values in the table? So that the AVG would actually be the sum of each AVG for all hours within the designated range and day of week (6-12, 12-18, etc on Monday, Tuesday etc).
To those still reading, thanks for your assistance! Please let me know if you need additional information.
I'm still not sure if I have a clear picture of your table design, but I'm imagining this as a single row group that's grouped on this expression: =Code.GetRangeValueByHour(Fields!HOD.Value). Based on this design and the dataset above, here's how I would solve your two questions:
Use the grouping expression for the value of the Time of Day cell, like: =Code.GetRangeValueByHour(Fields!HOD.Value)
Add a SUM with a conditional for the values on each day of the week. Example: the expression for Sunday would be =SUM(IIF(Fields!DOW.Value = 1, Fields!AVG.Value, CDec(0))). This uses CDec(0) instead of 0 because the AVG values are decimals and SSRS will otherwise throw an aggregate of mixed data types error by interpreting 0 as an int.
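The two steps (label each hour with a time-of-day group, then sum conditionally per day of week) can be sketched in Python. The rows below are made up, DOW 1 is taken to be Sunday as in the expression above, and the hour ranges are treated as half-open, which is an assumption on my part since the question's Case ranges share their boundary hours (12 and 17).

```python
from collections import defaultdict

def time_of_day(hour):
    """Map an hour (0-23) to a label; ranges are half-open by assumption."""
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 22:
        return "Evening"
    return "Overnight"

# Hypothetical dataset rows: (DOW, HOD, AVG), with DOW 1 = Sunday.
rows = [
    (1, 7, 0.05),
    (1, 9, 0.08),
    (1, 13, 0.10),
    (2, 7, 0.02),
    (2, 23, 0.04),
]

# One total per (time-of-day group, day of week) cell.
totals = defaultdict(float)
for dow, hod, avg in rows:
    totals[(time_of_day(hod), dow)] += avg

print(round(totals[("Morning", 1)], 2))  # 0.13
```

Each key of totals corresponds to one cell of the report table: the row group is the label, the column is the day of week.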

Merging Data to Run Specific Individual Analysis

I have two data sets. FIRST is a list of products and their daily prices from a supplier and SECOND is a list of start and end dates (as well as other important data for analysis). How can I tell Stata to pull the price at the beginning date and then the price at the end date from FIRST into SECOND for the given dates. Please note, if there is no exact matching date I would like it to grab the last date available. For example, if SECOND has the date 1/1/2013 and FIRST has prices on ... 12/30/2012, 12/31/2012, 1/2/2013, ... it would grab the 12/31/2012 price.
I would usually do this with Excel, but I have millions of observations, and it is not feasible.
I have put an example of FIRST and SECOND as well as what the optimal solution would give as an output POST_SECOND
FIRST
Product Price Date
1 3 1/1/2010
1 3 1/3/2010
1 4 1/4/2010
1 2 1/8/2010
2 1 1/1/2010
2 5 2/5/2010
3 7 12/26/2009
3 2 1/1/2010
3 6 4/3/2010
SECOND
Product Start Date End Date
1 1/3/2010 1/4/2010
2 1/1/2010 1/1/2010
3 12/26/2009 4/3/2010
POST_SECOND
Product Start Date End Date Price_Start Price_End
1 1/3/2010 1/4/2010 3 4
2 1/1/2010 1/1/2010 1 1
3 12/26/2009 4/3/2010 7 6
Here's a merge/keep/sort/collapse* solution that relies on using the last date. I altered your example data slightly.
/* Make Fake Data & Convert Dates to Date Format */
clear
input byte Product byte Price str12 str_date
1 3 "1/1/2010"
1 3 "1/3/2010"
1 4 "1/4/2010"
1 2 "1/8/2010"
2 1 "1/1/2010"
2 5 "2/5/2010"
3 7 "12/26/2009"
3 7 "12/28/2009"
3 2 "1/1/2010"
3 6 "4/3/2010"
4 8 "12/30/2012"
4 9 "12/31/2012"
4 10 "1/2/2013"
4 10 "1/3/2013"
end
gen Date = date(str_date,"MDY")
format Date %td
drop str_date
save "First.dta", replace
clear
input byte Product str12 str_Start_Date str12 str_End_Date
1 "1/3/2010" "1/4/2010"
2 "1/1/2010" "1/1/2010"
3 "12/27/2009" "4/3/2010"
4 "1/1/2013" "1/2/2013"
end
gen Start_Date = date(str_Start_Date,"MDY")
gen End_Date = date(str_End_Date,"MDY")
format Start_Date End_Date %td
drop str_*
save "Second.dta", replace
/* Data Transformation */
use "First.dta", clear
merge m:1 Product using "Second.dta", nogen
bys Product: egen ads = min(abs(Start_Date-Date))
bys Product: egen ade = min(abs(End_Date - Date))
keep if (ads==abs(Date - Start_Date) & Date <= Start_Date) | (ade==abs(Date - End_Date) & Date <= End_Date)
sort Product Date
collapse (first) Price_Start = Price (last) Price_End = Price, by(Product Start_Date End_Date)
list, clean noobs
*Some people are reshapers. Others are collapsers. Often both can get the job done, but I think collapse is easier in this case.
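The "last available date" rule itself is easy to state independently of Stata: for a target date, take the price on the latest date that is not after it. Here is a small Python sketch using the product 4 rows from the fake data above; price_as_of is a hypothetical helper, and it uses a binary search rather than the merge/keep/collapse steps of the answer.

```python
from bisect import bisect_right
from datetime import date

# Prices for one product, sorted by date (product 4 from the example data).
price_dates = [date(2012, 12, 30), date(2012, 12, 31), date(2013, 1, 2), date(2013, 1, 3)]
prices = [8, 9, 10, 10]

def price_as_of(target):
    """Return the price on the latest date <= target, or None if none exists."""
    position = bisect_right(price_dates, target)
    if position == 0:
        return None
    return prices[position - 1]

print(price_as_of(date(2013, 1, 1)))  # 9
print(price_as_of(date(2013, 1, 2)))  # 10
```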
In Stata, I've never been able to get something like this to work nicely in one step (something you can do in SAS via a SQL call). In any case, I think you'd be better off creating an intermediate file from FIRST.dta and then merging that 2x on each of your StartDate and EndDate variables in SECOND.dta.
Say you have data for price adjustments from Jan 1, 2010 to Dec 31, 2013 (specified with varied intervals as you have shown above). I assume all the date variables are already in date format in FIRST.dta & SECOND.dta, and that variable names in SECOND do not have spaces in them.
tempfile prod prices
use FIRST.dta, clear
keep Product
duplicates drop
save `prod'
clear
set obs 1461 /* one observation per day from Jan 1, 2010 to Dec 31, 2013 */
g Date=date("12-31-2009","MDY")+_n
format Date %td
cross using `prod'
merge 1:1 Product Date using FIRST.dta, assert(1 3) nogen
gsort +Product +Date /*this ensures the data are sorted properly for the next step */
replace Price=Price[_n-1] if Price==. & Product==Product[_n-1]
save `prices'
use SECOND.dta, clear
foreach i in Start End {
rename `i'Date Date
merge 1:1 Product Date using `prices', assert(2 3) keep(3) nogen
rename Price Price_`i'
rename Date `i'Date
}
This should work if I understand your data structures correctly, and it should address the issue being discussed in the comments to @Dimitriy's answer. I'm open to critiques on how to make this nicer, as it's something I've had to do a few times and this is how I usually go about it.
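The intermediate file built above is a daily calendar in which each gap is filled with the last known price. A minimal Python sketch of that fill step for one product, using product 1 from the question's FIRST table; forward_fill is a hypothetical helper and the date range is shortened for illustration.

```python
from datetime import date, timedelta

# Observed prices for product 1 from the question's FIRST table.
observed = {
    date(2010, 1, 1): 3,
    date(2010, 1, 3): 3,
    date(2010, 1, 4): 4,
    date(2010, 1, 8): 2,
}

def forward_fill(observed, start, end):
    """Build a daily {date: price} map, carrying the last known price forward."""
    filled = {}
    last_price = None
    day = start
    while day <= end:
        last_price = observed.get(day, last_price)
        filled[day] = last_price
        day += timedelta(days=1)
    return filled

daily = forward_fill(observed, date(2010, 1, 1), date(2010, 1, 8))
print(daily[date(2010, 1, 2)])  # 3
print(daily[date(2010, 1, 7)])  # 4
```

Once every day has a price, the start and end dates can be matched exactly, which is what the two merges in the loop rely on.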

Merging datasets based on 2 variables in SAS

I'm working with different databases. All of them contain information about 1000+ companies. A company is defined by its ticker code (the short version of the name (i.e. Ford as F) usually seen on stock quotation boards).
Aside from the ticker code to merge on I also have to merge on the time. I used month as a count variable throughout my time series. The final purpose is to have a regression in the kind of
Y(jt) = c + X(jt) +X1(jt) etc with j = company (ticker) and t = time (month).
So imagine I have 2 databases, one of which is the base database with variables such as Tickers, months, betas of a company (risk measure) etc. and a second database which has an extra variable (let's say market capitalisation).
What I want to do then is to merge these 2 databases based on the ticker and the month.
Example: Base database:
Ticker ____ Month ____ Betas
AA ____ 4 ____ 1.2
BB ____ 8 ____ 1.18
Second database:
Ticker ____ Month ____ MCAP
AA ____ 4 ____ 8542
BB ____ 6 ____ 1245
Then after merge I would like to have something like this:
Ticker ____ Month ____ Betas ____ MCAP
AA ____ 4 ____ 1.2 ____ 8542
So all observations that do not match BOTH the date and ticker have to be dropped. I'm sure this is possible, just can't find the right type of code.
PS: I'm guessing the underscores have something to do with font layout but both the bold as italic is supposed to be normal :)
Agree with Jonathan... after sorting both datasets independently by ticker and time, the data step of merging is what I would use..... little modification
data want;
merge base(in = b) mcap(in = m);
by ticker time;
if m & b;
run;
Records that don't have a common ticker and time in both datasets are dropped by the subsetting if.
Calling the two datasets base and mcap, and assuming that they have both been sorted by ticker and month, you can do it this way:
data want;
merge base(in = b)
mcap(in = m);
by ticker month; /* without a BY statement, MERGE pairs rows by position */
if m & b;
run;
The subsetting if will not accept any row that does not match in both datasets.
Ok so it appears you can just do it very easily by:
proc sort data=work;
by ticker month;
run;
proc sort data=wsize;
by ticker month;
run;
data test;
merge work(in=a) wsize(in=b);
by ticker month;
frommerg=a;
fromwtvol=b;
run;
data test;
set test;
if frommerg=0 then delete;
run;
data test;
set test;
if fromwtvol = 0 then delete;
run;
data test;
set test;
drop frommerg fromwtvol;
run;
That's the code I used, I tried this before posting because I didn't want to look like a leecher but it so happens that the 2 databases i tried had nothing in common (what are the odds with 70.000 observations :D), I retried it and it works (for now!)
Thanks anyway!
proc sort data=database1;
by ticker month;
run;
proc sort data=database2;
by ticker month;
run;
data gh;
merge database1(in=a) database2(in=b);
by ticker month;
if a and b;
run;
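What all of these answers compute is an inner join on the pair (ticker, month). As a language-neutral illustration, here is a short Python sketch using the sample rows from the question; the dictionary layout and the names base, mcap and merged are assumptions made for the example.

```python
# Sample rows from the question, keyed by (ticker, month).
base = {("AA", 4): {"beta": 1.2}, ("BB", 8): {"beta": 1.18}}
mcap = {("AA", 4): {"mcap": 8542}, ("BB", 6): {"mcap": 1245}}

# Keep only keys present in both datasets, like `if a and b;`.
merged = {
    key: {**base[key], **mcap[key]}
    for key in base.keys() & mcap.keys()
}

print(merged)  # {('AA', 4): {'beta': 1.2, 'mcap': 8542}}
```

BB is dropped because its month differs between the two datasets (8 versus 6), so the pair never matches.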
