I'm working with a database in SAS that updates every so often. I want the macro to automatically load the most recent dataset of a given year. The datasets cover the years 2015-2018, and each year has a different updated version, which is stated in the name of the dataset, e.g. 2015_version9. With my current code you need to update the macro manually every time a dataset changes its version and name.
You can scan through each library and find the max version number, then save those to a single macro variable string that you can supply to a set statement. Here are the assumptions of this solution:
Your libraries are named lib_2015, lib_2016, etc. and follow 8-char libname requirements
Your libraries are static for years 2015-2018
Your datasets are named _version1, _version2, etc.
Here's how we'll do it.
%let libraries = "LIB_2015", "LIB_2016", "LIB_2017", "LIB_2018";
proc sql noprint;
select cats(libname, '.', memname)
, input(compress(memname,,'KD'), 8.) as version
into :data separated by ' '
from dictionary.members
where upcase(libname) IN(&libraries.)
AND upcase(memname) LIKE "^_VERSION%" escape '^'
group by libname
having version = max(version)
;
quit;
data want;
set &data. indsname=name;
dsn = name;
run;
This code does the following:
Gets all dataset names from each library that start with _VERSION. The ^ in the like clause is an escape character that we defined so that we can match _ literally.
Removes all non-digits from the dataset name and converts it to a version number, version. The KD option in the compress() function says to keep only digits from the string.
Keeps only names in each library where version is the highest value
Saves all the dataset names to a single macro variable, &data
&data will store a string of all the relevant datasets you want with the highest version number for each library. For example:
%put &data.;
LIB_2015._VERSION9 LIB_2016._VERSION19 LIB_2017._VERSION12 LIB_2018._VERSION8
The indsname option in the data step will store the full dataset name of each observation. We're saving that to a variable named dsn. This shows where each observation comes from so you can split them out to individual datasets as needed.
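To make the selection logic concrete, here is a minimal sketch of the same "highest version per library" idea in Python rather than SAS. The function name and the sample catalog rows are hypothetical, and the regular expression is stricter than the compress(memname,,'KD') call above because it only accepts names of the exact form _VERSIONn. The point it illustrates is why the version is converted to a number: compared as text, _VERSION3 would sort after _VERSION19.

```python
import re
from typing import Dict, Iterable, Tuple


def latest_versions(members: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Return the highest-numbered _VERSIONn member for each library."""
    best: Dict[str, Tuple[int, str]] = {}
    for libname, memname in members:
        match = re.fullmatch(r"_VERSION(\d+)", memname.upper())
        if match is None:
            continue  # not a versioned dataset, skip it
        version = int(match.group(1))  # numeric, so 19 beats 3
        lib = libname.upper()
        if lib not in best or version > best[lib][0]:
            best[lib] = (version, memname.upper())
    return {lib: f"{lib}.{mem}" for lib, (_, mem) in best.items()}


# Hypothetical (libname, memname) rows, standing in for dictionary.members.
members = [
    ("LIB_2015", "_VERSION2"),
    ("LIB_2015", "_VERSION9"),
    ("LIB_2016", "_VERSION19"),
    ("LIB_2016", "_VERSION3"),
    ("LIB_2016", "NOTES"),
]
result = latest_versions(members)
print(" ".join(result[lib] for lib in sorted(result)))
```

The joined string plays the role of &data in the SAS code: one library.member reference per library.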
Requirement: I need to fetch only the latest file every day. In this example, that is the 20200902 file.
Example Files in S3:
@stagename/2020/09/reporting_2020_09_20200902000335.gz
@stagename/2020/09/reporting_2020_09_20200901000027.gz
Code:
select distinct metadata$filename from
@stagename/2020/09/
(file_format=>APP_SKIP_HEADER,pattern=>'.*/reporting_*20200902*.gz');
This will work no matter what the naming conventions of the files. Since your files appear to have a naming convention based on date and are one per point in time, you may not need to use the date to do this as you could use the name. You'll still want to use the result_scan approach.
I haven't found a way to get the date for a file in a stage other than using the LIST command. The docs say that METADATA$FILENAME and METADATA$FILE_ROW_NUMBER are the only available metadata in a select query. In any case, that approach reads the data, and we only want to read the metadata.
Since a LIST command is a metadata query, you'll need to query the result_scan to use a where clause.
One final issue that I ran into while working on a project: the last_modified date in the LIST command is in a format that requires a somewhat long conversion expression to convert to a timestamp. I made a UDF to do the conversion so that it's more readable. If you'd prefer putting the expression directly in the SQL, that's fine too.
First, create the UDF.
create or replace function LAST_MODIFIED_TO_TIMESTAMP(LAST_MODIFIED string)
returns timestamp_tz
as
$$
to_timestamp_tz(left(LAST_MODIFIED, len(LAST_MODIFIED) - 4) || ' ' || '00:00', 'DY, DD MON YYYY HH:MI:SS TZH:TZM')
$$;
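For readers who want to see what the conversion does step by step, here is a rough Python equivalent of the UDF. It assumes the last_modified value looks like 'Wed, 2 Sep 2020 00:03:35 GMT' (that sample value is an assumption, not taken from the question) and that day and month names are in English. Like the UDF, it drops the trailing 4 characters (" GMT") and substitutes an explicit zero offset before parsing.

```python
from datetime import datetime, timezone


def last_modified_to_timestamp(last_modified: str) -> datetime:
    """Parse a LIST-style last_modified string into a timezone-aware datetime."""
    # Drop the trailing " GMT" (4 characters), as the UDF does with left(...).
    trimmed = last_modified[:-4]
    # Append an explicit UTC offset so the result carries a time zone.
    return datetime.strptime(trimmed + " +0000", "%a, %d %b %Y %H:%M:%S %z")


# Assumed sample value in the format LIST is expected to return.
ts = last_modified_to_timestamp("Wed, 2 Sep 2020 00:03:35 GMT")
print(ts.isoformat())
```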
Next, list the files in your stage or subdirectory of the stage.
list @stagename/2020/09/;
Before running any other query in the session, run this one on the last query ID. You can of course run it at any time within 24 hours if you specify the query ID explicitly.
select "name",
"size",
"md5",
"last_modified",
last_modified_to_timestamp("last_modified") LAST_MOD
from table(result_scan(last_query_id()))
order by LAST_MOD desc
limit 1
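Because the file names in the question embed a 14-digit timestamp, the "use the name instead of the date" idea can be sketched without any date conversion. The snippet below is an illustration in Python, not Snowflake SQL. The helper name is hypothetical and the input list stands in for the "name" column of the LIST result. Fixed-width digit strings sort in chronological order, so a plain max() is enough.

```python
import re
from typing import Iterable, Optional

# Matches the trailing _YYYYMMDDHHMMSS.gz part of the file name.
STAMP = re.compile(r"_(\d{14})\.gz$")


def latest_by_name(names: Iterable[str]) -> Optional[str]:
    """Return the name with the greatest embedded timestamp, or None."""
    stamped = []
    for name in names:
        match = STAMP.search(name)
        if match:
            stamped.append((match.group(1), name))
    return max(stamped)[1] if stamped else None


# The two example files from the question, as they might appear in a listing.
names = [
    "stagename/2020/09/reporting_2020_09_20200901000027.gz",
    "stagename/2020/09/reporting_2020_09_20200902000335.gz",
]
latest = latest_by_name(names)
print(latest)
```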
I have a permanent data set called Branch(Branch code, Branch description)
I want to create a format from that dataset (a permanent one)
I can see that this gives me more or less what I want, but how do I put it into a permanent dataset?
proc format library = Home.Branch fmtlib;
Run;
What I've tried
proc print data=Home.DataSetToApply;
format B_Code $B_CODE_FORMAT.;
RUN;
This works if I manually create the format. I can't seem to create a permanent format directly from a data set.
Could you point me in the right direction?
Resources
Creating a Format from Raw Data or a SAS® Dataset
SAS has an autoexec.sas file which executes when you start SAS.
Of course, whether this is a valid option depends on your access rights + the OS you're running.
Have a look here: http://support.sas.com/documentation/cdl/en/hostwin/63285/HTML/default/viewer.htm#win-sysop-autoexec.htm
You could just drop the format code in the auto-executing script then to have your format always available when using SAS.
This will create a dataset with formats in the current library.
proc format cntlout=myfmtdataset lib=mylibname;
select myformatname; *if you want to just pick one or some - leave out select for all;
quit;
This will import that back into formats (later):
proc format cntlin=myfmtdataset lib=myotherlibname;
quit;
That could of course be in your autoexec, or in your regular code.
If you are trying to take a dataset to make a permanent format, you need to set it up like this:
Required:
fmtname = name of format
start = starting value (or, single value)
end = ending value (this can be missing if only single values)
label = formatted value
Optional:
type = type of format (n=numeric, c=character, i=informat, j=character informat)
hlo = various options (h=end is highest value, l = start is lowest value,
o=other, m=multilabel, etc.)
Then use the CNTLIN option to load it. SAS documentation has more detail if you need it.
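To show the shape of such a control dataset, here is a small sketch in Python that builds one row per branch code and prints the rows as CSV. It is only an illustration of the layout, not SAS code. The branch codes and descriptions are made up, the format name B_CODE_FORMAT is borrowed from the question, and the end column is left out on the assumption that every entry is a single value.

```python
import csv
import io


def control_rows(fmtname, pairs, fmt_type="C"):
    """Build one control-dataset row per (start, label) pair."""
    return [
        {"fmtname": fmtname, "start": start, "label": label, "type": fmt_type}
        for start, label in pairs
    ]


# Hypothetical branch lookup: (branch code, branch description).
branches = [("001", "Head Office"), ("002", "North Branch")]
rows = control_rows("B_CODE_FORMAT", branches)

# Print the rows as CSV to make the column layout visible.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["fmtname", "start", "label", "type"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In SAS the equivalent rows would come from a data step over the Branch dataset, which is then passed to CNTLIN.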
I'm doing an Excel loop through fifty or more Excel files. The loop goes through each Excel file, grabs all the data and inputs it into the database without error. This is the typical process of setting delay validation to true, and making sure that the expression for the Excel Connection is a string variable called EFile that is set to nothing (in the loop).
What is not working: trying to input the name of the Excel file into the database.
What's been tried (edit; SO changed my 2 to 1 - don't know why):
Add a derived column between the Excel file and database input, and add a column using the EFile expression (so under Expression in the Derived Column it would be @[User::EFile]). However, this inputs a blank (nothing).
One suggestion was to add ANOTHER string variable and set its properties EvaluateAsExpression to True and set the Expression to the EFile variable (@[User::EFile]). The funny thing is that this does the same thing - inputs a blank into the database.
Numerous people on blogs claim they can do this, yet I haven't seen one actually address this (I have a blog and I will definitely be showing people how to do this when I get an answer because, so far, these others have fallen short). How do I grab an Excel file's name and input it in a database during a loop?
Added: Forgot to add, no scripts; the claim is that it can be done without them, so I want to see the solution without them.
Note: I already have the ability to import the data from the Excel files - that's easy (see my GitHub account, as I have two different projects for importing all sorts of txt, csv, xls, xlsx data). I am trying to also get the actual name of the file being imported also into the database. So, if there are fifty Excel files, along with the data in each file, the database will have the fifty file names alongside that data (so if each file has 1000 rows of data, each 1000 rows would also have the name of the file they came from next to them as an additional column). This point seems to cause a lot of confusion, as people assume I'm having trouble importing data in files - NOPE, see my GitHub; again that's easy. It's the FILENAME that needs to also be imported.
Test package: https://github.com/tmmtsmith/SSISLoopWithFileName
Solution: @jaimet pointed out that the Derived Column needed to be the @[User::CurrentFile] (see the test package). When I first ran the package, I still got a blank value in my database. But when we originally set up the connection, we do point it to an actual file (I call this "fooling the package"), then change the expression on the connection later to the @[User::CurrentFile], which is blank. The Derived Column, using the variable @[User::CurrentFile], showed a string of length 0. So, I removed the Derived Column, put the full file path and name in the variable, then added the variable to the Derived Column (which made it think the string was 91 characters long), then went back and set the variable to nothing (English teacher would hate the THENs about right now). When I ran the package, it inputted the full file path. Maybe, like the connection, it needs to initially think that a file exists in order for it to input the full amount of characters?
Appreciate all the help.
The issue is caused by the blank value in the variable @[User::FileNameInput], which made the SSIS package assume that the value of this variable will always be of zero length in the Derived Column transformation.
Change the expression on the Derived Column transformation from @[User::FileNameInput] to (DT_STR, 2000, 1252)@[User::FileNameInput].
Type casting the derived column to a length of 2000 sets the column length to that maximum value. The value 1252 represents the code page. I assumed that you are using the ANSI code page. I took the value 2000 from your table definition because the FilePath column was defined as VARCHAR(2000). If the column data type had been NVARCHAR(2000), then the expression would be (DT_WSTR, 2000)@[User::FileNameInput]
Tim,
You're using the wrong variable in your Derived Column component. You are storing the filename in @[User::CurrentFile] but the variable that you're using in your Derived Column component is @[User::FileNameInput]
Change your Derived Column component to use @[User::CurrentFile] and you'll be good.
Hope that helps.
JT
If you are using a ForEach loop to process the files in a folder, then I have used the technique described in SSIS Junkie's blog to get the filename into an SSIS variable: SSIS: Enumerating files in a Foreach loop
You can use the variable later in your flow to write it to the database.
To all intents and purposes your method #1 should work. That's exactly how I would attempt to do it. I am baffled as to why it is not working. Could you perhaps share your package?
Tony, thanks very much for the link. Much appreciated.
Regards
Jamie
I am exporting a SAS data set to an xpt file using the following code, but the variable names are truncated to length 8. Is there anything I can do to keep the full names?
libname target xport 'C:\temp\test.xpt';
proc copy in=work out=target;
select data;
run;
XPort files have length 8 maximum for variable names - they're intended to be highly compatible with earlier versions of SAS as well as other software, and in both cases 8 is a safe maximum.
See http://support.sas.com/documentation/cdl/en/movefile/59598/HTML/default/viewer.htm#a001027644.htm for more details on the limitations of the XPORT feature.
What are you trying to do with your data? There might be a safer/easier way to get it from SAS to what you want while preserving variable names.
I'm trying to create an SSIS package to import some dataset files, however given that I seem to be hitting a brick
wall every time I achieve a small part of the task I need to take a step back and perform a sanity check on what I'm
trying to achieve, and if you good people can advise whether SSIS is the way to go about this then I would
appreciate it.
These are my questions from this morning :-
debugging SSIS packages - debug.writeline
Changing an SSIS dts variables
What I'm trying to do is have a For..Each container enumerate over the files in a share on the SQL Server. For each
file it finds a script task runs to check various attributes of the filename, such as looking for a three letter
code, a date in CCYYMM, the name of the data contained therein, and optionally some comments. For example:-
ABC_201007_SalesData_[optional comment goes here].csv
I'm looking to parse the name using a regular expression and put the values of 'ABC', '201007', and
'SalesData' in variables.
I then want to move the file to an error folder if it doesn't meet certain criteria :-
Three character code
Six character date
Dataset name (e.g. SalesData, in this example)
CSV extension
I then want to lookup the Character code, the date (or part thereof), and the Dataset name against a lookup table
to mark off a 'checklist' of received files from each client.
Then, based on the entry in the checklist, I want to kick off another SSIS package.
So, for example I may have a table called 'Checklist' with these columns :-
Client code    Dataset      SSIS_Package
ABC            SalesData    NorthSalesData.dtsx
DEF            SalesData    SouthSalesData.dtsx
If anyone has a better way of achieving this I am interested in hearing about it.
Thanks in advance
That's an interesting scenario, and should be relatively easy to handle.
First, your choice of the Foreach Loop is a good one. You'll be using the Foreach File Enumerator. You can restrict the files you iterate over to be just CSVs so that you don't have to "filter" for those later.
The Foreach File Enumerator puts the filename (full path or just file name) into a variable - let's call that "FileName". There are (at least) two ways you can parse that - expressions or a Script Task. It depends which one you're more comfortable with. Either way, you'll need to create three variables to hold the "parts" of the filename - I'll call them "FileCode", "FileDate", and "FileDataset".
To do this with expressions, you need to set the EvaluateAsExpression property on FileCode, FileDate, and FileDataset to true. Then in the expressions, you need to use FINDSTRING and SUBSTRING to carve up FileName as you see fit. Expressions don't have Regex capability.
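As a rough illustration of that carving logic, here is the same find-then-slice idea written in Python rather than SSIS expression syntax. str.find plays the role of FINDSTRING (though it is 0-based, where FINDSTRING is 1-based) and slicing plays the role of SUBSTRING. The function name is hypothetical, and the sketch assumes the file name has at least two underscores.

```python
def split_by_underscores(file_name: str):
    """Return (code, date, dataset) by locating underscores and slicing."""
    first = file_name.find("_")
    second = file_name.find("_", first + 1)
    if first == -1 or second == -1:
        raise ValueError(f"unexpected file name: {file_name!r}")
    # The dataset name ends at the next underscore (start of the optional
    # comment) or, failing that, at the extension dot.
    third = file_name.find("_", second + 1)
    end = third if third != -1 else file_name.rfind(".")
    if end == -1:
        end = len(file_name)
    return file_name[:first], file_name[first + 1:second], file_name[second + 1:end]


parts = split_by_underscores("ABC_201007_SalesData_[optional comment goes here].csv")
print(parts)
```

This only locates the pieces. It does not check that the code has three letters or that the date has six digits, so those checks would need further expressions.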
To do this in a Script Task, pass the FileName variable in as a ReadOnly variable, and the other three as ReadWrite. You can use the Regex capabilities of .Net, or just manually use IndexOf and Substring to get what you need.
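A regular expression for the naming convention in the question might look like the sketch below. It is written in Python for brevity, and the same pattern should carry over to .NET's Regex with named groups spelled (?<name>...). The pattern and the function name are assumptions based on the single example file name given. It checks for a three-letter code, six digits, an alphanumeric dataset name, an optional comment and a .csv extension, but it does not verify that the six digits form a real CCYYMM date.

```python
import re

# code _ date _ dataset [ _ comment ] .csv
FILE_PATTERN = re.compile(
    r"^(?P<code>[A-Za-z]{3})_(?P<date>\d{6})_(?P<dataset>[A-Za-z0-9]+)"
    r"(?:_(?P<comment>.*))?\.csv$",
    re.IGNORECASE,
)


def parse_file_name(file_name: str):
    """Return a dict of the name parts, or None if the name is malformed."""
    match = FILE_PATTERN.match(file_name)
    return match.groupdict() if match else None


parsed = parse_file_name("ABC_201007_SalesData_[optional comment goes here].csv")
print(parsed)
```

A None result is the signal to move the file to the error folder. Otherwise the code, date and dataset values map onto FileCode, FileDate and FileDataset.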
Unfortunately, you have just missed the SQLLunch livemeeting on the ForEach loop: http://www.bidn.com/blogs/BradSchacht/ssis/812/sql-lunch-tomorrow
They are recording the session, however.