Adding observations and variables to a dataset with .csv files in Stata

I am using Stata 17. I want to add observations and variables to a dataset, which I'll call dataset1.
Dataset1 has the following structure:
Date Year urbanname urbancode etc.
2010m1 2010 Beijing 1029 ...
2010m2 2010 Beijing 1029 ...
2010m3 2010 Beijing 1029 ...
...
2015m1 2015 Paris 1030 etc
The data cover different cities and different time periods.
I would like to add observations for other cities (which are not in the rows of dataset1) that I have in separate .csv files (dataset2.csv, dataset3.csv, and so on); each city has its own file.
Each .csv file I want to add contains the following variables:
the dates
the urbanname
the urbancode
other variables which I do not yet have in dataset1 but that I want to add
What would be your advice on how to proceed? I thought of doing it in R, but dataset1 does not open well in RStudio and the Date variable is not imported correctly.

You do not describe what you have tried so far or what issues you are encountering, but you can do something like this:
use dataset1, clear

* Store the data in a temporary file
tempfile appendfile
save `appendfile'

foreach dataset in dataset2.csv dataset3.csv {
    import delimited "`dataset'", clear
    append using `appendfile'
    save `appendfile', replace
}
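Note that append automatically adds variables that exist in the using data but not in the master data, filling the non-overlapping cells with missing values, so the extra variables in your .csv files will come along. A minimal sketch of the finishing steps, assuming the dates arrive from the .csv files as strings such as "2010m1" (the names date, mdate, and dataset_combined are assumptions; if monthly() does not parse your strings, strip the separator first with subinstr()):

* After the loop, the combined data are in memory and in `appendfile'
* Convert the string date to a Stata monthly date
gen mdate = monthly(date, "YM")
format mdate %tm
save dataset_combined, replace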

Related

Macro to open, recode and stack several .csv files in SPSS

I am trying to code a macro that:
Imports the columns year, month, id, value, and motive from several sequential .csv files into SPSS. The files are named DATA_JAN_2010, DATA_FEB_2010 [...], up to DATA_DEC_2019. These are the first variables of the .csv files (the code I am using to import these variables is provided at the end).
Alters the type of column id to (a11) and motive to (a32), if necessary (needed to stack all the files).
Stacks all these datasets into a new dataset named DATA_2010_2019.
For now, I am importing each file separately, then stacking and saving them two by two. But this is repetitive and inefficient. Moreover, if in the future I need to import additional variables, I would have to rewrite the code for each file. That is why I believe a loop or a macro would be the smartest way of dealing with this repetitive code. Any help is really appreciated.
A sample of my code so far:
GET DATA /TYPE=TXT
/FILE="C:\Users\luizz\DATA\DATA_JAN_2010.csv"
/ENCODING='Locale'
/DELCASE=LINE
/DELIMITERS=";"
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
YEAR F4.0
MONTH F1.0
ID A11
VALUE F4.0
MOTIVE A8.
CACHE.
EXECUTE.
DATASET NAME JAN_2010 WINDOW=FRONT.
ALTER TYPE MOTIVE (a32).
GET DATA /TYPE=TXT
/FILE="C:\Users\luizz\DATA\DATA_FEB_2010.csv"
/ENCODING='Locale'
/DELCASE=LINE
/DELIMITERS=";"
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
YEAR F4.0
MONTH F1.0
ID A11
VALUE F4.0
MOTIVE A8.
CACHE.
EXECUTE.
DATASET NAME FEB_2010 WINDOW=FRONT.
DATASET ACTIVATE FEB_2010.
ALTER TYPE MOTIVE (a32).
DATASET ACTIVATE JAN_2010.
ADD FILES /FILE=*
/FILE='FEB_2010'.
EXECUTE.
SAVE OUTFILE='C:\Users\luizz\DATA\DATA_JAN_FEV_2010.sav'
/COMPRESSED.
Assuming the parameters for all the files are the same, you can use a macro like this:
define !getfiles ()
!do !yr=2010 !to 2019
!do !mn !in("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC")
GET DATA
/TYPE=TXT /FILE=!concat('"C:\Users\luizz\DATA\DATA_', !mn, '_', !yr, '.csv"')
/ENCODING='Locale' /DELCASE=LINE /DELIMITERS=";" /ARRANGEMENT=DELIMITED
/FIRSTCASE=2 /IMPORTCASE=ALL /VARIABLES=
YEAR F4.0
MONTH F1.0
ID A11
VALUE F4.0
MOTIVE A8.
CACHE.
EXECUTE.
ALTER TYPE id (a11) MOTIVE (a32).
dataset name tmp.
dataset activate gen.
add files /file=* /file=tmp.
exe.
!doend !doend
!enddefine.
The macro as defined will read each of the files and add it to a main file. Before we call the macro, we will create the main file:
data list list/YEAR (F4) MONTH (F1) ID (A11) VALUE (F4) MOTIVE (A8).
begin data
end data.
exe.
dataset name gen.
* now we can call the macro.
!getfiles .
* now the data is all combined and we can save it.
SAVE OUTFILE='C:\Users\luizz\DATA\DATA_2010_2019.sav' /COMPRESSED.
NOTE: I used your code from the original post in the macro. Please make sure all the definitions are right.
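If the macro misbehaves, it helps to see the syntax it actually generates; SPSS can echo macro expansions with the MPRINT setting (a standard SET option, shown here as a sketch around the macro call):

SET MPRINT ON.
!getfiles .
SET MPRINT OFF.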

Looping exports from SAS to .dta format using year month name

I am a Stata user and am therefore not familiar with SAS. However, all of the files I require for my current project are stored in SAS format, so I would like to convert them from SAS to .dta format using SAS code.
The files are stored as monthly sets like so:
1976 - x1976M1, x1976M2, x1976M3.... x1976M12
where 1976 is the folder, and each month, eg. x1976M1, is a file containing the observations for that month and year.
I would like to export those files to .dta format, keeping the same file structure, so that I can easily read them into Stata.
I am not picky about whether I can loop over all the folders at once or will have to handle each folder individually; there are forty folders with 12 files in each.
At a minimum, I will need a loop that goes from m1 to m12 and appends the month to the end of the filename, e.g. filename1976 + mY, where Y = 1, ..., 12. Ideally, I will be able to create a nested loop that also goes from one folder to the next.
I hope this is satisfactorily clear! If not, please comment and I will adjust my question accordingly.
Some code given to me by a coworker. Hope this helps anybody with the same issue. This will need to be updated for each individual folder, as it does not loop.
Cheers!
libname name 'G:\folder\';
%macro subset1976(month=);
data subset1976_&month;
set name.file1976_&month;
keep xyz /*varnames*/
;
if age>=15;
noc2011 = soc4+0;
run;
%mend;
%subset1976(month=jan);
%subset1976(month=feb);
....
%macro export1976(month=);
proc export data=subset1976_&month outfile="G:\lfs\subset1976_&month..dta" replace dbms=stata; run;
%mend;
%export1976(month=jan);
%export1976(month=feb);
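The nested loop the question asks for can also be written as a single macro, so nothing has to be edited per folder. A minimal sketch, assuming the folders are named by year under G:\ and the files follow the x1976M1 pattern described above (all paths, the libref, and the target folder are assumptions; PROC EXPORT with DBMS=STATA requires SAS/ACCESS Interface to PC Files, and the output folder must already exist):

%macro export_all(firstyr=1976, lastyr=2015);
  %do yr = &firstyr. %to &lastyr.;
    /* point a libref at this year's folder */
    libname yrlib "G:\&yr.";
    %do m = 1 %to 12;
      /* write each monthly SAS dataset out as a Stata .dta file */
      proc export data=yrlib.x&yr.M&m.
          outfile="G:\dta\x&yr.M&m..dta"
          dbms=stata replace;
      run;
    %end;
    libname yrlib clear;
  %end;
%mend export_all;

%export_all;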

Loop to create multiple reports dependent on a date

I need to create multiple reports at once taking into consideration the date column.
For example:
INVOICE COMMENT DATE
------------------------------------
1111 example1 14/04/2018
2222 example2 14/04/2018
3333 example3 15/04/2018
4444 example4 18/04/2018
For day 14/04/2018 I would need to generate two PDFs with this data:
1111-example1-14/04/2018
2222-example2-14/04/2018
So basically one for each row with today's date. On 15/04/2018 only one report would be created.
I need SSRS "to loop" over the dates, creating a PDF file for each one. Obviously the real query would be larger, but this is just an example.
Is this even possible with SSRS, or is there another way to do it?
You can do this with a data-driven subscription. You would need to write a small query that returns all the parameter values you want to use. When it runs, it will create a copy of the report for each value you specify. You can have the resulting PDF emailed or stored in a directory.
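A minimal sketch of such a subscription query, assuming the rows live in a table called dbo.Invoices (table and column names are assumptions); the subscription then maps each returned row to the report parameters and to the delivery settings such as the output file name:

-- One row returned = one PDF generated by the subscription
SELECT INVOICE,
       COMMENT,
       CONCAT(INVOICE, '-', COMMENT) AS OutputFileName
FROM dbo.Invoices
WHERE [DATE] = CAST(GETDATE() AS date);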

Copy & Paste Errors When Moving Data From SSMS to Excel

I'm attempting to copy a query result from SSMS to Excel. I selected "copy with headers" and pasted it into Excel. My data set has 9 columns. When I paste the data into Excel, information from column 9 ends up spread across columns 9, 10, and 1. It looks like this:
A B C D E F G H I -Column Heading
A B C D E F G H I I -Data
I
(blank row)
I've reviewed the query results in SSMS and this is not occurring in the original data. When the value in column F is NULL the additional row and information in column 10 do not occur. Thus far I have tried the following:
-When I remove column 9 from the query then copy & paste, column 8 is spread across 8, 9, and 1.
-I've also created a brand new spreadsheet, made sure to clear any formatting and tried the copy & paste.
-I saved the query results as a .csv file and imported it into Excel. I still got the same result.
-I copied the columns individually, one at a time. The information in the 8th column ends up on two lines paired with the other columns of the next row. So each item in column 8 becomes another row, offset downwards, until there are many more values in column 8 than in the other columns. Where the value in column 8 is NULL, this does not occur.
-I removed all the other items from the query result so that only the values of columns 8 and 9 are returned. All information from column 9 ends up in column 8 followed by a blank row.
Returning 8 alone, each item returned ends up on two rows.
Returning 9 alone, the data is pasted correctly.
The headers are always in the right place. From what I can surmise, the data in column 8 is the culprit here. The data type is a varchar(max) which allows nulls. The information included is in the following format,
(TC Date & Time, Last Name, First Name) Comments
Moving SSMS query results into Excel to make tables is something I do frequently. However, I have never encountered this result before. Hopefully my explanation is thorough enough that someone can tell me how to correct this error. Thanks!
Replace line feeds and carriage returns in your dataset before you paste into Excel. Try something like this on the columns you are having issues with, then paste the result into Excel:
SELECT REPLACE(REPLACE(yourcolumnname, CHAR(13), ' '), CHAR(10), ' ')
FROM yourtable
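The stray rows come from line breaks embedded in the free-text column itself, so the cleanup belongs in the original query. Applied to a full result set it might look like this (table and column names are placeholders):

SELECT InvoiceId,
       CustomerName,
       REPLACE(REPLACE(Comments, CHAR(13), ' '), CHAR(10), ' ') AS Comments
FROM dbo.YourTable;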
This is probably due to having used 'Text to Columns' recently in Excel. Excel remembers the last delimiter settings and applies them to pasted text, so the columns need to be set back to tab-delimited.
For the offending column:
Data → Text to Columns
Original Data Type: Check Delimited
Click Next
Delimiters: check 'Tab', uncheck anything else.
Click Next
Click Finish
SSMS copy-paste does not preserve data types. Excel tries to parse the string and splits it into additional columns or even lines.
I develop the SSMSBoost add-in, and we have covered this in our video, which explains three different ways of exporting data into Excel without data loss (data type information is preserved): copy-paste in native Excel format, XML export, and .dqy query. https://youtu.be/waDCukeXeLU

Importing Excel to SQL Server from the backend

I have a ton of Excel files which I'm trying to import to SQL Server from the backend, automating the process with a batch file.
I know that we can use OPENROWSET inside a T-SQL script to load the Excel files. I'm also aware of the SQLCMD and BCP options. All of these will work for Excel sheets that are straightforward grids.
However, the challenge is, I only need to load a specific region/range of excel cells from a sheet.
For example: if the sheet has the below info, I need to only load the below columns:
Date, Group1, Group2 and Group 3
until it hits the "Blank Row" and ignore everything below it.
Date Group1 Group2 Group 3
Jan-13 25 26 27
Jan-18 35 29 19
20 15
<empty row> <empty row>
Y/Y % YTD % Group %
15 20 40
So, my question is: is it possible to implement this functionality using OPENROWSET in T-SQL? If so, can you please point me to any links/examples showing how? I tried digging around on the MSDN site but couldn't find any.
If this cannot be done in T-SQL, any ideas on how I could implement it from the backend?
Thanks in advance,
Bee
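For what it's worth, the ACE OLE DB provider that OPENROWSET uses does accept an explicit cell range after the sheet name, which covers the fixed-range case. A sketch (the provider string, file path, sheet name, column names, and range are assumptions, and the provider must be installed and enabled on the server); if the blank row moves between files, you would still need to compute the range or filter the trailing summary rows after import:

SELECT [Date], Group1, Group2, Group3
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                'Excel 12.0;Database=C:\data\report.xlsx;HDR=YES',
                'SELECT * FROM [Sheet1$A1:D3]');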
