Exporting data files using loop in Stata - loops

I have a large dataset of 20 cities and I'd like to split it into smaller ones for each city. Each variable in the dataset will be exported into a text file.
foreach i in Denver Blacksburg {
use "D:\Data\All\AggregatedCount.dta", clear
drop if MetroArea != `i'
export delimited lnbike using "D:\Data/`"`i'"'/DV/lnbike.txt", delimiter(tab) replace
export delimited lnped using "D:\Data/`"`i'"'/DV/lnped.txt", delimiter(tab) replace
}
I tried `i' and `"`i'"' in the export commands but neither worked. The error is
"Denver not found."
I also have cities with a space in the name, such as Los Angeles. I tried
local city `" "Blacksburg" "Los Angeles" "Denver" "'
foreach i of city {
use "D:\Data\All\AggregatedCount.dta", clear
drop if MetroArea != `i'
export delimited lnbike using "D:/Data/`"`i'"'/DV/lnbike.txt", delimiter(tab) replace
export delimited lnped using "D:/Data/`"`i'"'/DV/lnped.txt", delimiter(tab) replace
}
This didn't work either. Do you have any suggestions?

If you want to continue with Stata, the only thing you would need to change in your first code snippet is
`"`i'"'
to
\`i'
Note the \ so that your code looks like:
export delimited lnbike using "D:\Data\\`i'/DV/lnbike.txt", delimiter(tab) replace
(I would personally change all of the forward slashes (/) to backslashes (\) anyway.) The extra backslash is needed because a single backslash before a left single quote in a string evaluates to just the left single quote; the second backslash tells Stata that you actually want the local macro i to be evaluated.
Your second code snippet could work if you also changed
foreach i of city {
to
foreach i of local city {
It might be helpful to read up on local macros: they can definitely be confusing, but are powerful if you know how to use them.
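Putting the two changes together, a minimal sketch of the second snippet might look like this; it assumes the same file paths and variable names as in the question, that the per-city DV folders already exist, uses forward slashes throughout, and quotes `i' in the drop condition, for the reason explained in the next answer:
local city `" "Blacksburg" "Los Angeles" "Denver" "'
foreach i of local city {
    use "D:/Data/All/AggregatedCount.dta", clear
    * keep only the current city; note the double quotes around the macro
    drop if MetroArea != "`i'"
    export delimited lnbike using "D:/Data/`i'/DV/lnbike.txt", delimiter(tab) replace
    export delimited lnped using "D:/Data/`i'/DV/lnped.txt", delimiter(tab) replace
}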

This answer overlaps with the helpful answer by @Eric HB.
Given 20 (or more) cities you should not want to type those city names, which is tedious and error-prone, and not needed. Nor do you need to read in the dataset again and again, because you can just export the part you want. This should get you closer.
use "D:/Data/All/AggregatedCount.dta", clear
* result is integers 1 up, with names as value labels
egen which = group(MetroArea), label
* how many cities: r(max), the maximum, is the number
su which, meanonly
forval i = 1/`r(max)' {
* look up city name for informative filename
local where : label (which) `i'
export delimited lnbike if which == `i' using "D:/Data/`where'/DV/lnbike.txt", delimiter(tab) replace
export delimited lnped if which == `i' using "D:/Data/`where'/DV/lnped.txt", delimiter(tab) replace
}
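If you want to see how egen group() has numbered the cities before exporting, a quick optional check (using the same variable name which as above) is
* show each numeric group code with its city value label and its row count
tabulate which
which lists each integer code alongside the city name attached as its value label.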
The principles involved that have not yet been discussed:
-- When testing for literal strings, you need " " or compound double quotes to delimit such strings. Otherwise Stata thinks you mean a variable or scalar name. This was your first bug: as given,
drop if MetroArea != `i'
is interpreted as
drop if MetroArea != Denver
Stata can't find a variable Denver. As you found, you need
drop if MetroArea != "`i'"
-- Windows uses the backslash as a separator in file and directory names, but Stata also uses the backslash as an escape character. If you use local macro references after such file separators, the result can be quite wrong. This is documented at [U] 18.3.11 in the manual chapter on programming Stata. Forward slashes are never a problem, and Stata understands them as you intend, even with Windows.
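You can see the escape behaviour for yourself with a throwaway local macro (the name where below is only for illustration):
* hypothetical macro used only to demonstrate the backslash issue
local where "Denver"
* the \` pair is read as a literal left quote, so the macro is NOT expanded
display "D:\Data\`where'/DV"
* doubling the backslash, or using forward slashes, lets the macro expand
display "D:\Data\\`where'/DV"
display "D:/Data/`where'/DV"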
All that said, it is difficult to believe that you will be better off with lots of little files, but that depends on what you want to do with them.

Related

Using input_file_name, take substring between underscores without file extension

I want to take a substring from the filename every time a new file comes to us for processing and load that value into the file. The task is this: we receive many files from company X for a cleansing process, and the first thing we need to do is take a substring from the file name.
For example: the file name is 'RV_NETWORK_AXN TECHNOLOGY_7737463273272635'. From this I want to take 'AXN TECHNOLOGY', create a new column named 'COMPANY NAME' in the same file, and load the 'AXN TECHNOLOGY' value into it. The file names change, but the company name will always be after the second underscore.
In the comments, you said that using df_1 = df_1.withColumn('COMPANY', F.split(F.input_file_name(), '_')[3]) extracts AXN TECHNOLOGY.csv.
I'll suggest 2 options to you:
You could use one more split on \. and, using element_at, get the second-to-last element. In this case, splitting on \. works and splitting on . doesn't, because this argument of the split function is not a plain string but a regex pattern, and an unescaped dot . in regex means "any character".
df = df.withColumn(
'COMPANY',
F.element_at(F.split(F.split(F.input_file_name(), '_')[3], r'\.'), -2)
)
The following regex pattern extracts only what's after the 3rd _ (up to a potential 4th _), not including the file extension (e.g. .csv).
df = df.withColumn(
'COMPANY',
F.regexp_extract(F.input_file_name(), r'^.+?_.*?_.*?_([^_]+)\w*\.\w+$', 1)
)

mssql search a varchar field for invalid typo

I have a field with names in it. They can be last name, first name middle name/initial
Basically I want to find all names that aren't normal spellings so I can tell someone to fix their names in the system.
I don't want to select and find this guy
O'Leary-Smith, Timothy L.
But I would want to find this guy
<>[]}{##$%^&*()/?=+_!|";:~`1234567890
I can just keep coming up with special characters to search for but then I'm just making this huge query and having to bracket wildcards... it's like 50+ lines long just to say one thing.
Is there something (not some custom function)
that lets me say
where name not like
A-Z
a-z
,
.
'
-
possibly something that is
where name contains anything but these ascii characters
Hopefully this is a one-off fix-up; a negated character class:
where patindex('%[^ A-Za-z,.''-]%', name) > 0
Although more letters than A-Z can appear in names ...
If it's just odd characters you're looking for:
WHERE name like '%[^A-Za-z]%'
The ^ acts as a NOT operator.

Properly Using String Functions in Stored Procedures

I have an SSIS package that imports data into SQL Server. I have a field that I need to cut everything after and including "-". The following is my script:
SELECT LEFT(PartNumber, CHARINDEX('-', PartNumber) - 1)
FROM ExtensionBase
My question is where in my stored procedure I should use this script so that it cuts the value before entering data into ExtensionBase. Can I do this in a scalar-valued function?
You have two routes available to you. You can use Derived Columns and the Expressions to generate this value or use a Script Transformation. Generally speaking, reaching for a script first is not a good habit for maintainability and performance in SSIS but the other rule of thumb is that if you can't see the entire expression without scrolling, it's too much.
Dataflow
Here's a sample data flow illustrating both approaches.
OLE_SRC
I used a simple query to simulate your source data. I checked for empty strings, part numbers with no dashes, multiple dashes and NULL.
SELECT
D.PartNumber
FROM
(
VALUES
('ABC')
, ('def-jkl')
, ('mn-opq-rst')
, ('')
, ('Previous line missing')
, (NULL)
) D(PartNumber);
DER Find dash position
I'm going to use FINDSTRING to determine the starting point. FINDSTRING returns zero if the searched item doesn't exist or the input is NULL. I use the ternary operator to return either the position of the first dash, less one to account for the dash itself, or the length of the source string.
(FINDSTRING(PartNumber,"-",1)) > 0
? (FINDSTRING(PartNumber,"-",1)) - 1
: LEN(PartNumber)
I find it helpful in these situations to first compute the positions before trying to use them later. That way if you make an error, you don't have to fix multiple formulas.
DER Trim out dash plus
The 2012 release of SSIS provided us with the LEFT function, while users of previous editions had to make do with SUBSTRING calls.
The LEFT expression would be
LEFT(PartNumber,StartOrdinal)
whilst the SUBSTRING is simply
SUBSTRING(PartNumber,1,StartOrdinal)
SCR Use .NET libraries
This approach uses the basic string capabilities of .NET to make life easier. The String.Split method returns an array of strings, less what we split upon. Since you only want the first piece, the zeroth element, we assign that to the SCRPartNumber column created in this task. Note that we check whether PartNumber is null and, if so, set the null flag on our new column.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
if (!Row.PartNumber_IsNull)
{
string[] partSplit = Row.PartNumber.Split('-');
Row.SCRPartNumber = partSplit[0];
}
else
{
Row.SCRPartNumber_IsNull = true;
}
}
Results
You can see the results are the same however you compute them.

SSMS add comma delimiter - shortcut

Is there a shortcut for adding commas to values for an IN clause? e.g. if my query is SELECT * FROM Persons WHERE Id IN(.....)
and I've copied
356
9874
5975
9771
4166
....
Person's Id values, let's say from some Excel file. How can I quickly add a ',' comma to the end of each line, so I can paste them into the IN(....) clause?
Here's what you need to do:
Put your cursor one space after the 356 (you'll see why you need that extra space in step 2).
Hold down ALT + SHIFT and press the down arrow key for as many numbers as you have. You should see a blue line just to the right of your numbers. This is why you needed the extra space after 356 - so that you can arrow down the entire list without having to move left or right.
Release the keys, and press the comma key. A comma should appear in all of the lines.
You can use this method to add quotes or other characters to your SQL queries as well.
I use this in SSMS 2014;
I am not sure if this can be done in previous versions
Yeah, that's always a pain. There are a few things you can do:
insert commas in the cells to the right of the numbers in Excel and copy them with the list into SSMS
in Notepad++, paste the list of values and press CTRL+H (find and replace); in the Search Mode box select Extended. In the Find what box type "\n" (without quotations) and in the Replace with box type a comma ","
Hope this helps!
Since SSMS 2012 you can "draw" a vertical line at the end of the code using the mouse while pressing the ALT key.
After that, just press the comma key and that's it.
I have resolved this issue by applying this query
select Concat(Id,',') from user
If you want to concatenate all rows into one, you can apply the query below:
Select SUBSTRING(
(
SELECT ',' + Cast(id as varchar) AS 'data()'
FROM users FOR XML PATH('')
), 2 , 9999) As users
Write a little program, like the one below, and fire it off from Launchy.
I wrote mine in C# - called it Commander... probably both the best name and best program ever written.
using System;
using System.Windows.Forms;
namespace Commander
{
internal class Program
{
[STAThread]
private static void Main()
{
var clipboardText = Clipboard.GetText(TextDataFormat.Text);
Clipboard.SetText(clipboardText.Contains(",")
? clipboardText.Replace(",", Environment.NewLine)
: clipboardText.Replace(Environment.NewLine, ",").TrimEnd(','));
}
}
}
Compile the above and reference the resulting Commander.exe from Launchy.
Once referenced:
Highlight a column of characters, and cut or copy the text
Summon Launchy (Alt-Enter, if you're using the default shortcut)
Type Commander
Paste
Enjoy your comma-separated list; use it in an IN clause somewhere, for example
Run Commander again from Launchy with a comma-separated list on the clipboard, and it will reverse the operation. Read the code... it's kind of obvious :)
Some good answers here already but here's some more:
... Person's Id values let's say from some Excel file ...
If you're copying from Excel, it's sometimes easier to add the commas (or quote marks or whatever) in Excel before copying.
e.g. in the cell to the right do
=A1 & ","
Then copy that formula all the way down the list.
Also, Notepad++ is great for this sort of thing: you can record a macro to do one line and then run it N times:
In Notepad++ go to the start of the first line
Select Macro - Start Recording
Do the right keypresses - in this case: End, Comma, Down, Home
Select Macro - Stop Recording
Select 'Run a Macro Multiple Times ...'
It will by default show 'current recorded macro' (the one you just recorded)
Tell it how many times you want it, then off you go

Fix CSV file with new lines

I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I chose to save the result as a csv, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't, and I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they only contain one new line; thanks Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines, it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
The problem with this is that it matches the last field of one line and the first field of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
'"$0"',
$matches[0]);
}
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result=preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
See an online demo
If you want a programmatic solution in Java, open the file using the OpenCSV library. If it is a manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl one-liner to clean up the CRLFs.
