Looping in Stata through pairs of strings

I am trying to set a variable equal to the state FIPS code that matches each state abbreviation. Is there a shorter way to do this than:
replace fips = "[fips code]" if other_variable=="[state_abbrev]"
I currently have 50 lines like this. I would like to create a loop, but given that I have two changing values, I don't know how to avoid looping through every permutation.

Here is an example of the strategy covered in the FAQ.
1) Create a dataset containing two variables: the state name and the associated fips code. To make this slightly more flexible, I include common semi-abbreviations for the state name. In the future, you could add a third variable that includes the two-letter state abbreviation.
clear
input fips str20 state
1 "alabama"
2 "alaska"
4 "arizona"
5 "arkansas"
6 "california"
8 "colorado"
9 "connecticut"
10 "delaware"
11 "district of columbia"
12 "florida"
13 "georgia"
15 "hawaii"
16 "idaho"
17 "illinois"
18 "indiana"
19 "iowa"
20 "kansas"
21 "kentucky"
22 "louisiana"
23 "maine"
24 "maryland"
25 "massachusetts"
26 "michigan"
27 "minnesota"
28 "mississippi"
29 "missouri"
30 "montana"
31 "nebraska"
32 "nevada"
33 "new hampshire"
34 "new jersey"
35 "new mexico"
36 "new york"
37 "north carolina"
37 "n. carolina"
38 "north dakota"
38 "n. dakota"
39 "ohio"
40 "oklahoma"
41 "oregon"
42 "pennsylvania"
44 "rhode island"
45 "south carolina"
45 "s. carolina"
46 "south dakota"
46 "s. dakota"
47 "tennessee"
48 "texas"
49 "utah"
50 "vermont"
51 "virginia"
53 "washington"
54 "west virginia"
54 "w. virginia"
55 "wisconsin"
56 "wyoming"
72 "puerto rico"
end
save statefips, replace
2) Load your primary dataset that holds a variable with state names and perform a many-to-one merge using statefips.dta.
sysuse census, clear
// Convert the state names to lowercase to ensure
// consistency with the statefips dataset
replace state = lower(state)
merge m:1 state using statefips.dta
drop if _merge == 2
drop _merge
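If you wanted to preserve the case of the state names in your master data set, you could keep the original under another name and merge on a lowercase copy. The key variable must have the same name in both data sets, and statefips.dta calls it state, i.e.
rename state state_orig
gen state = lower(state_orig)
merge m:1 state using statefips.dta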
Also, once you've created the statefips.dta data set, there's no need to recreate it every time you want to perform a merge. You could simply bundle it along with your project's files and use it when necessary. If you find you want to add two-letter state abbreviations or make some other change, then it's practically instantaneous to recreate it.

There is no obvious shortcut, but in Stata you can type
. search merge, faq
to find a relevant FAQ by Kit Baum.


What is a correct name for an operation that turns 3-column long table into a compact "2D" table with variable number of columns?

For example, from this table
row  col  val
0    A    32
0    B    31
0    C    35
1    A    30
1    B    29
1    C    29
2    A    15
2    B    14
2    D    18
3    A    34
3    B    39
3    C    34
3    D    35
it should produce this table:
     A    B    C    D
0    32   31   35
1    30   29   29
2    15   14        18
3    34   39   34   35
Is there an official, canonical (or at least popular and unambiguous) term for such an operation (or its reverse)?
I am trying to find (or implement & publish) a tool that transforms CSV this way, but am unsure what to search for (or how to name it).
The term is pivot.
Some databases have native support for pivot, e.g. SQL Server's PIVOT (and even UNPIVOT) keywords.
For most databases you must craft a query that does the job.
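For a CSV tool rather than a database, the pivot can be done by hand. The following is a minimal Python sketch, not a reference implementation: the function name `pivot` and its parameters are made up for illustration, and it assumes each (row, col) pair occurs at most once. Missing combinations become empty cells.

```python
import csv
import io


def pivot(records, index="row", column="col", value="val"):
    """Turn long (index, column, value) records into a wide table (list of lists)."""
    # Collect every distinct column label, sorted for a stable header.
    columns = sorted({r[column] for r in records})
    wide = {}
    for r in records:
        # One dict of cells per index key; dicts keep first-seen order.
        wide.setdefault(r[index], {})[r[column]] = r[value]
    header = [index] + columns
    body = [[key] + [cells.get(c, "") for c in columns] for key, cells in wide.items()]
    return [header] + body


# The long table from the question, as CSV text.
long_csv = """row,col,val
0,A,32
0,B,31
0,C,35
1,A,30
1,B,29
1,C,29
2,A,15
2,B,14
2,D,18
3,A,34
3,B,39
3,C,34
3,D,35
"""

records = list(csv.DictReader(io.StringIO(long_csv)))
table = pivot(records)
for line in table:
    print(",".join(line))
```

Values stay as strings because they come straight from the CSV reader; the reverse operation (unpivot) would loop over the wide cells and emit one record per non-empty cell.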

Add Countif to Array Formula (Subtotal) in Excel

I am new to array formulae and have noticed that while SUBTOTAL includes many functions, it does not feature COUNTIF (only COUNT and COUNTA).
I'm trying to figure out how I can integrate a COUNTIF-like feature to my array formula.
I have a matrix, a small subset of which looks like:
A B C D E
48 53 46 64 66
48 66 89
40 38 42 49 44
37 33 35 39 41
Thanks to the help of @Tom Sharpe in this post, I (he) was able to average the sum of each row in the matrix provided it had complete data (so rows 2 and 4 in the example above would not be included).
Now I would like to count the number of rows with complete data (so rows 2 and 4 would be ignored) which include at least one value above a given threshold (say 45).
In the current example, the result would be 2, since row 1 has 5/5 values > 45, and row 3 has 1 value > 45. Row 5 has only values < 45, and rows 2 and 4 have partially or fully missing data, respectively.
I have recently discovered the SUMPRODUCT function and think that perhaps SUMPRODUCT(--(A1:E1>45)) could be useful, but I'm not sure how to integrate it within Tom Sharpe's elegant code, e.g.,
=AVERAGE(IF(SUBTOTAL(2,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1)))=COLUMNS(A1:E1),SUBTOTAL(9,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1))),""))
Remember, I am no longer looking for the average: I want to filter rows for whether they have full data, and if they do, I want to count rows with at least 1 entry > 45.
Try the following. Enter as array formula.
=COUNT(IF(SUBTOTAL(4,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1)))>45,IF(SUBTOTAL(2,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1)))=COLUMNS(A1:E1),SUBTOTAL(9,OFFSET(A1,ROW(A1:A5)-ROW(A1),0,1,COLUMNS(A1:E1))))))
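The logic of that formula (keep rows whose count of numbers equals the column count, then test the row maximum against the threshold) can be cross-checked outside Excel. This is a minimal Python sketch under assumptions: the function name is hypothetical, blanks are represented as `None`, and the positions of the blanks in row 2 are guessed because the sample data does not show them.

```python
def count_complete_rows_above(matrix, threshold):
    """Count rows with no missing cells whose largest value exceeds threshold."""
    count = 0
    for row in matrix:
        is_complete = all(value is not None for value in row)
        # max() is only evaluated for complete rows, so None never reaches it.
        if is_complete and max(row) > threshold:
            count += 1
    return count


matrix = [
    [48, 53, 46, 64, 66],
    [48, None, 66, None, 89],  # partial row; blank positions are an assumption
    [40, 38, 42, 49, 44],
    [None, None, None, None, None],  # fully missing row
    [37, 33, 35, 39, 41],
]

result = count_complete_rows_above(matrix, 45)
print(result)
```

With the sample data this counts rows 1 and 3, which matches the result of 2 expected in the question.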

Pulling ordered list using array functions in Excel

I have a report in Excel that displays the sales results for each employee. The columns are Location, Region, Username & Sales. It is sorted by Sales descending, showing which employee has the best sales in the company.
I am attempting to have an additional sheet per region that displays the results for all employees in that region, also sorted by Sales (to avoid sorting the results of the many regions myself every day).
An example version of the first 12 rows of the Data Sheet:
G    H         I           J      K       X
Row  Location  Username    Sales  Region  Region
1    38        John.Doe    85     North1  North1
2    154       John.Smith  83     South2
3    23        E.Williams  83     North1
4    210       M.Williams  79     East5
5    139       Joe.Dawn    77     North2
6    22        Kay.Smith   69     South2
7    51        Jay.Smith   69     South2
8    125       L.Smith     69     East2
9    51        L.Day       69     South2
10   23        23.Guest2   67     North1
11   92        U.Goode     65     North4
I have successfully created an array function that pulls the Sales column of only the results in the specified region.
{=LARGE(SMALL(IF(IF(ISERROR(K:K),"",K:K)=$X$2,J:J),
ROW(INDIRECT("1:"&COUNTIF(K:K,$X$2)))),F2)}
I am attempting now for an array function that pulls the Username that matches the corresponding sales amount in the original array, and also matches the region. I am having trouble when a single region has 'ties' or more than one employee with the same sales that month. Here is what I started with for that function:
=INDEX(I:I,MATCH(1,(Y2=J:J)*($X$1=K:K),0))
That formula fails when a single region has multiple users with the same sales, so I am trying a conditional to accommodate ties, falling back to the function that I know works when a sales value occurs only once in that region.
{=IF(COUNTIF($AB$2:AB2,AB2)>1,
INDEX(I:I,
SMALL(IF(J:J=AB2,
IF(K:K=$AB$2,ROW(K:K)-ROW(INDEX(K:K,1,1))+1)),
COUNTIF($AB$2:AB2,AB2))),
INDEX(I:I,MATCH(1,(AC2=J:J)*($AB$2=K:K),0)))}
The inner piece may be sufficient if it worked, excluding the need for the conditional:
{=INDEX(I:I,
SMALL(IF(J:J=AB2,
IF(K:K=$AB$2,ROW(K:K)-ROW(INDEX(K:K,1,1))+1)),
COUNTIF($AB$2:AB2,AB2)))}
I'll use the same function for Username.
Expected results for two regions:
X       Y      Z           AA        AB      AC     AD          AE
Region  Sales  Username    Location  Region  Sales  Username    Location
North1  85     John.Doe    38        South2  83     John.Smith  154
        83     E.Williams  23                69     Kay.Smith   22
        67     23.Guest2   23                69     Jay.Smith   51
                                             69     L.Day       51
Since beginning to type this question I have found a workaround that uses a few additional columns to complete the calculation, but I still wanted to ask to see whether it is possible, for knowledge's sake.
With North1 in X2, these are the formulas for Y2, Z2 and AA2 (Sales, Username and Location, respectively).
=IFERROR(AGGREGATE(14, 6, ($J$2:$J$999)/($K$2:$K$999=X$2), ROW(1:1)), "")
=IFERROR(INDEX($I:$I, AGGREGATE(15, 6, ROW($2:$999)/(($K$2:$K$999=X$2)*($J$2:$J$999=Y2)), COUNTIF(Y$2:Y2, Y2))), "")
=IFERROR(INDEX($H:$H, AGGREGATE(15, 6, ROW($2:$999)/(($K$2:$K$999=X$2)*($J$2:$J$999=Y2)), COUNTIF(Y$2:Y2, Y2))), "")
Fill down as necessary.
With South2 in AB2, copy Y2:AA2 to AC2:AE2 and fill down as necessary.
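To check what the per-region sheets should contain, the same selection can be sketched in Python. This is an illustration only, not part of the Excel solution: the `Sale` record and `region_report` function are hypothetical names, and ties are resolved by keeping the original sheet order, which is what taking the k-th smallest matching row number does in the formulas above.

```python
from collections import namedtuple

Sale = namedtuple("Sale", "location username sales region")

# The sample rows from the data sheet, already sorted by sales descending.
data = [
    Sale(38, "John.Doe", 85, "North1"),
    Sale(154, "John.Smith", 83, "South2"),
    Sale(23, "E.Williams", 83, "North1"),
    Sale(210, "M.Williams", 79, "East5"),
    Sale(139, "Joe.Dawn", 77, "North2"),
    Sale(22, "Kay.Smith", 69, "South2"),
    Sale(51, "Jay.Smith", 69, "South2"),
    Sale(125, "L.Smith", 69, "East2"),
    Sale(51, "L.Day", 69, "South2"),
    Sale(23, "23.Guest2", 67, "North1"),
    Sale(92, "U.Goode", 65, "North4"),
]


def region_report(rows, region):
    """Return the rows for one region, highest sales first."""
    in_region = [r for r in rows if r.region == region]
    # sorted() is stable, also with reverse=True, so tied sales keep sheet order.
    return sorted(in_region, key=lambda r: r.sales, reverse=True)


for region in ("North1", "South2"):
    print(region)
    for r in region_report(data, region):
        print(" ", r.sales, r.username, r.location)
```

The stable sort is the design choice that matters here: it gives each tied employee a distinct position without any helper columns.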

comparing multiple column files using python3

input file1:
a 1 33
a 34 67
a 68 78
b 1 99
b 100 140
c 1 70
c 71 100
c 101 190
input file2:
a 5 23
a 30 72
a 76 78
b 5 30
c 23 88
c 92 98
I want to compare these two files so that, for every entry of 'a' in file2, I can tell whether its two integers (the boundaries) fall within one of the ranges (boundaries) of 'a' in file1 or between two ranges.
Instead of storing values like 'a 1 33', you could write each record to the file in a single delimited form such as 'a:1:33', which also makes the data easier to read back.
You can then read each line, split it on the ':' separator, and compare it with the other file.
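The read, split and compare steps can be sketched as follows. This is a minimal Python 3 example under assumptions: it splits on whitespace so it works with the files as shown, it reads from strings instead of real files to stay self-contained, and because the exact matching rule in the question is ambiguous it simply reports which file1 ranges each file2 interval overlaps. The function names are hypothetical.

```python
from collections import defaultdict


def parse_ranges(text):
    """Map each key (first column) to its list of (start, end) tuples."""
    ranges = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, start, end = line.split()
        ranges[key].append((int(start), int(end)))
    return ranges


def overlapping(reference, key, start, end):
    """Return the reference ranges for key that share any point with [start, end]."""
    return [(lo, hi) for lo, hi in reference.get(key, []) if lo <= end and start <= hi]


file1_text = """a 1 33
a 34 67
a 68 78
b 1 99
b 100 140
c 1 70
c 71 100
c 101 190
"""

file2_text = """a 5 23
a 30 72
a 76 78
b 5 30
c 23 88
c 92 98
"""

reference = parse_ranges(file1_text)
results = []
for key, intervals in parse_ranges(file2_text).items():
    for start, end in intervals:
        hits = overlapping(reference, key, start, end)
        results.append((key, start, end, hits))
        kind = "within one range" if len(hits) == 1 else f"touches {len(hits)} ranges"
        print(key, start, end, kind, hits)
```

With real files, `parse_ranges(open(path).read())` would replace the inline strings; an interval with exactly one hit may still need a containment check (`lo <= start and end <= hi`) if partial overlap should not count.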

Distinguish Keywords from string

I have an array of strings:
ID: 12 34 56 78 class:C
SEX:M EYES:BRN HT:5-09'
Now I want my output to look like this:
ID: 12 34 56 78
class:C
SEX:M
EYES:BRN
HT:5-09'
You may find the solution at link:string split
Use objectAtIndex: to get each string that you want.
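The splitting rule itself is language independent: break the string at whitespace that is immediately followed by a keyword and a colon. Here is a minimal Python sketch of that rule, offered as an illustration rather than as the Objective-C answer; it assumes keywords consist of word characters only and that values never contain a word followed by a colon.

```python
import re


def split_fields(line):
    """Split a line before every 'KEYWORD:' token that follows whitespace."""
    # The lookahead keeps the keyword in the next field instead of consuming it.
    return re.split(r"\s+(?=\w+:)", line.strip())


lines = [
    "ID: 12 34 56 78 class:C",
    "SEX:M EYES:BRN HT:5-09'",
]

fields = [field for line in lines for field in split_fields(line)]
for field in fields:
    print(field)
```

The numbers after "ID:" stay together because none of them is followed by a colon, so the pattern does not split between them.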
