Matching observations across two string variables


I have two continuous indicators that are measured at the country-level:
GDP per capita
democracy score
Each dataset identifies countries with a string variable, and both use essentially the same country coding system, such as AFG for Afghanistan. However, the country variable for the GDP data has only 184 observations, while the code variable for the democracy_score data has 249.
I would like to match GDP and democracy score data for observations where the data for both continuous indicators are complete. For instance, the data in the first row below is
"AFG" 2079.9219 "ABW" "0.813"
I would like to match it with the democracy score from the third row, where the code variable holds the same country code, "AFG":
"ALB" 13655.665 "AFG" "0.174"
And the correct data structure would be as follows for AFG:
country gdp_adj democracy_score
"AFG" 2079.9219 "0.174"
Here is a data example:
dataex country gdp_adj code democracy_score
output:
* Example generated by -dataex-. For more info, type help dataex
clear
input str3 country float gdp_adj str3 code str5 democracy_score
"AFG" 2079.9219 "ABW" "0.813"
"AGO" 6602.424 "ADO" "#N/A"
"ALB" 13655.665 "AFG" "0.174"
"ARE" 71782.16 "AIA" "#N/A"
"ARG" 22071.75 "ALB" "0.576"
"ARM" 14317.553 "ANT" "#N/A"
"ATG" 23035.66 "ARE" "0.232"
"AUS" 49379.09 "ARG" "0.632"
"AUT" 55806.44 "ARM" "0.496"
"AZE" 14442.04 "ASM" "#N/A"
"BDI" 729.6584 "ATG" "#N/A"
"BEL" 51977.18 "AUS" "0.861"
"BEN" 3156.439 "AUT" "0.852"
"BFA" 2110.0623 "AZE" "0.200"
"BGD" 5467.208 "BDI" "0.170"
"BGR" 23270.225 "BEL" "0.820"
"BHR" 49768.98 "BEN" "0.473"
"BHS" 35161.832 "BFA" "0.358"
"BIH" 14634.738 "BGD" "0.388"
"BLR" 19279.209 "BGR" "0.602"
"BLZ" 9028.552 "BHR" "0.190"
"BOL" 8528.749 "BHS" "0.688"
"BRA" 14685.128 "BIH" "0.399"
end

You can do it by stacking and reshaping back to wide:
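* convert the score to numeric; "#N/A" entries become missing
destring democracy_score, replace ignore("#N/A")
* stack the (country, gdp_adj) and (code, democracy_score) pairs into long form
stack country gdp_adj code democracy_score , into(country outcome) clear
* back to one row per country: outcome1 is GDP, outcome2 is the democracy score
reshape wide outcome, i(country) j(_stack)
rename (outcome1 outcome2) (gdp_adj democracy_score)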
I converted the score from string to double under the assumption that you would want to do some analysis on it. If not, then you can tostring it back.
I also had to tweak the GDP storage to double to avoid some precision issues:
input str3 country double gdp_adj str3 code str5 democracy_score
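
For readers who want to see the matching logic outside Stata, here is a minimal Python sketch of the same idea. It is not the stack/reshape approach itself: it builds one lookup per indicator and keeps only the codes found in both, so unlike the reshape above it drops countries with a missing value. The function name match_by_code is made up for illustration, and the rows are a few copied from the question's example.

```python
# A few rows copied from the question's dataex output:
# (country, gdp_adj, code, democracy_score)
rows = [
    ("AFG", 2079.9219, "ABW", "0.813"),
    ("AGO", 6602.424, "ADO", "#N/A"),
    ("ALB", 13655.665, "AFG", "0.174"),
    ("ARE", 71782.16, "AIA", "#N/A"),
    ("ARG", 22071.75, "ALB", "0.576"),
]


def match_by_code(rows):
    """Pair each GDP value with the democracy score sharing its country code."""
    # Lookup of GDP keyed by the first code column.
    gdp = {country: value for country, value, _, _ in rows}
    # Lookup of scores keyed by the second code column; "#N/A" is skipped.
    score = {code: float(s) for _, _, code, s in rows if s != "#N/A"}
    # Keep only codes present in both lookups (complete cases).
    return {c: (gdp[c], score[c]) for c in sorted(gdp.keys() & score.keys())}


matched = match_by_code(rows)
for code, (gdp_adj, democracy_score) in matched.items():
    print(code, gdp_adj, democracy_score)
```

With these five rows only AFG and ALB appear in both columns with a usable score, so only those two are kept.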

Related

Difference Function

I don't understand how I could get a result of 4 for difference on the following:
col_a col_b
201 E. Rudisill 2535 E 10th St.
6039 Bunt Drive 408 W. Petit Ave.
difference(upper(a), upper(b)) returned 4 for both rows.
How is this possible? They do not sound anything alike.

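SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken, and DIFFERENCE compares the SOUNDEX codes of its two arguments, returning 4 for the closest match. Because the strings above all start with digits, soundex() may be returning '0000' for each of them, which would make the codes identical.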
Similar Question : Soundex with numbers as String parameter
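
To illustrate the suspected mechanism, here is a rough Python sketch. It assumes a simplified Soundex that returns '0000' when the string does not start with a letter and stops at the first non-letter, and a difference() that just counts matching code positions. Both functions are stand-ins written for this example, not the database's actual implementation.

```python
# Letter-to-digit table used by Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(text):
    """Simplified Soundex; assumes '0000' for strings not starting with a letter."""
    text = text.upper()
    if not text or not text[0].isalpha():
        return "0000"
    code = text[0]
    prev = CODES.get(text[0], "")
    for ch in text[1:]:
        if not ch.isalpha():
            break  # assumption: encoding stops at the first non-letter
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def difference(a, b):
    """Assumed scoring: number of positions where the two codes agree (0 to 4)."""
    return sum(x == y for x, y in zip(soundex(a), soundex(b)))


print(soundex("201 E. Rudisill"), soundex("2535 E 10th St."))
print(difference("201 E. Rudisill", "2535 E 10th St."))
```

Under these assumptions both addresses encode to '0000', so every position matches and the score is 4 even though the strings have nothing in common.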

Standardizing Heterogeneous Age Data in SPSS or Excel

I'm trying to standardize a column of Age data (i.e. into years old / months old) using SPSS / SPSS Syntax / Excel. My intuition is to use a series of DO IF loops i.e.:
DO IF CHAR.INDEX(Age, "y")>1... for years
DO IF CHAR.INDEX(Age, "m")>1... for months
DO IF CHAR.INDEX(Age, "d")>1... for days
and have the program reference the number(s) immediately preceding the string as a quantity of years / months / days and add it to a total in a new variable which could be in days (the smallest unit) which could later be converted to years.
For example, for a cell "3 yr 5 mo": add 3*365 + 5*30.5 = 1247.5 (about 1248) days old to a new variable (something like "DaysOld").
Examples of Cell contents (numbers without any strings assumed to be years):
2
5 months
11 days
1.7
13 yr
22 yrs
13 months
10 mo
6/19/2016
3y10m
10m
12y
3.5 years
3 years
11 mos
1 year 10 months
1 year, two months
20 Y
13 y/o
3 years in 2014

The following syntax will solve a lot of cases, but definitely not all of them (e.g. "1.7" or "3 years in 2014"). You'll need to do more work on it, but this should get you started nicely.
First I recreate your sample data to work with:
data list list/age (a30).
begin data
"2"
"5 months"
"11 days"
"1.7"
"13 yr"
"22 yrs"
"13 Months"
"10 mo"
"6/19/2016"
"3y10m"
"10m"
"12y"
"3.5 years"
"3 YEARS"
"11 mos"
"1 year 10 months"
"1 year, two months"
"20 Y"
"13 y/o"
"3 years in 2014"
end data.
Now to work:
* some necessary definitions.
string ageCleaned (a30) chr (a1) nm d m y (a5).
compute ageCleaned="".
* my first step is to create a "cleaned" age variable (it's possible to
manage without this variable but using this is better for debugging and
improving the method).
* in the `ageCleaned` variable I only keep digits, periods (for decimal
point) and the characters "d", "m", "y".
do if CHAR.INDEX(lower(age),'ymd',1)>0.
loop #chrN=1 to char.length(age).
compute chr=lower(char.substr(age,#chrN,1)).
if CHAR.INDEX(chr,'0123456789ymd.',1)>0 ageCleaned=concat(rtrim(ageCleaned),chr).
end loop.
end if.
* the following line accounts for the word "days" which in the `ageCleaned`
variable has turned into the characters "dy".
compute ageCleaned=replace(ageCleaned,"dy","d").
exe.
* now I can work through the `ageCleaned` variable, accumulating digits
until I meet a character, then assigning the accumulated number to the
right variable according to that character ("d", "m" or "y").
compute nm="".
loop #chrN=1 to char.length(ageCleaned).
compute chr=char.substr(ageCleaned,#chrN,1).
do if CHAR.INDEX(chr,'0123456789.',1)>0.
compute nm=concat(rtrim(nm),chr).
else.
if chr="y" y=nm.
if chr="m" m=nm.
if chr="d" d=nm.
compute nm="".
end if.
end loop.
exe.
* we now have the numbers in string format, so after turning them into
numbers they are ready for use in calculations.
alter type d m y (f8.2).
compute DaysOld=sum(365*y, 30.5*m, d).
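
The same number-then-unit idea can be sketched with a regular expression. This Python version is only an illustration, not a translation of the syntax above: it assumes 365 days per year and 30.5 days per month as in the question, treats a bare number as years, and returns None when it finds nothing it can parse. The name days_old is made up for the example.

```python
import re

# Conversion factors taken from the question (365 days/year, 30.5 days/month).
DAYS_PER_UNIT = {"y": 365.0, "m": 30.5, "d": 1.0}

# A number followed (optionally after spaces) by a word starting with y, m or d.
TOKEN = re.compile(r"(\d+(?:\.\d+)?)\s*([ymd])", re.IGNORECASE)
BARE_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def days_old(age):
    """Return the age in days, or None if nothing could be parsed."""
    text = age.strip()
    if BARE_NUMBER.fullmatch(text):
        # Numbers without any unit are assumed to be years.
        return float(text) * DAYS_PER_UNIT["y"]
    pairs = TOKEN.findall(text)
    if not pairs:
        return None  # e.g. a date such as "6/19/2016"
    return sum(float(num) * DAYS_PER_UNIT[unit.lower()] for num, unit in pairs)


for cell in ["2", "5 months", "3y10m", "3 yr 5 mo", "6/19/2016", "1 year, two months"]:
    print(repr(cell), days_old(cell))
```

Spelled-out numbers such as "two months" are silently skipped here, so cells like that still need manual review.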

Calculated column in DAX to show current BusinessArea

I have a table in my SSAS-model with SCD-type2 functionality.
CustNr StartDate EndDate BusinessArea
123 2014-12-01 2015-01-01 Norway
123 2015-01-01 - Sweden
I need a calculated column (DAX) which shows the current BusinessArea (based on customer number). How do I approach it? I've heard about the "Earlier" function, but I am new to DAX and cannot get my head around it.
The desired output would be like this:
CustNr StartDate EndDate BusinessArea CurrentBA
123 2014-12-01 2015-01-01 Norway Sweden
123 2015-01-01 - Sweden Sweden
All answers are welcome! Cheers!

LOOKUPVALUE()
(edit: note original left for continuity - correct measure below in edit section)
CurrentBusinessArea =
LOOKUPVALUE(
DimCustomer[BusinessArea] // Lookup column - will return value
// matching search criteria below.
,DimCustomer[CustNr] // Search column 1.
,DimCustomer[CustNr] // Value to match to search column 1 -
// this is evaluated in row context.
,DimCustomer[EndDate] // Search column 2.
,"-" // Literal value to match for search
// column 2.
)
I doubt that your [EndDate] is actually a text field, so I also doubt that the literal value for [EndDate] for the row that represents the current business area is actually "-". If it's blank, use the BLANK() function rather than a literal "-".
Edit based on comment, corrected measure definition with some discussion:
CurrentBusinessArea =
LOOKUPVALUE(
DimCustomer[BusinessArea]
,DimCustomer[CustNr]
,DimCustomer[CustNr]
,DimCustomer[EndDate]
,DATE(BLANK(),BLANK(),BLANK())
)
Normally in DAX you can test directly for equality to BLANK(); it does not behave like NULL in SQL. You can create a column to check this. Each of the following returns true for the row with a blank [EndDate]:
=DimCustomer[EndDate] = BLANK()
=ISBLANK(DimCustomer[EndDate])
=DimCustomer[EndDate] = 0 //implicit conversion 0 = blank
For some reason there is an issue in the conversion from Date/Time to BLANK(). The construction above, using the DATE() function fed with all BLANK()s works for me. I had assumed that LOOKUPVALUE() would work with a literal blank (fun fact, if data type is Integer, LOOKUPVALUE() works with a BLANK()). Apologies on that.
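
Outside DAX, the lookup amounts to "for each row, find the row with the same customer number whose end date is blank, and return its business area". A minimal Python sketch of that logic follows, using the two rows from the question and None as a stand-in for the blank EndDate. The function name is hypothetical and this is not how LOOKUPVALUE() is implemented.

```python
# The two rows from the question; None stands in for the blank EndDate.
rows = [
    {"CustNr": 123, "StartDate": "2014-12-01", "EndDate": "2015-01-01",
     "BusinessArea": "Norway"},
    {"CustNr": 123, "StartDate": "2015-01-01", "EndDate": None,
     "BusinessArea": "Sweden"},
]


def add_current_business_area(rows):
    """Add a CurrentBA field holding the business area of the open-ended row."""
    # The current record per customer is the one whose EndDate is blank.
    current = {r["CustNr"]: r["BusinessArea"] for r in rows if r["EndDate"] is None}
    # Every row of a customer gets that customer's current business area.
    return [{**r, "CurrentBA": current.get(r["CustNr"])} for r in rows]


result = add_current_business_area(rows)
for r in result:
    print(r["CustNr"], r["BusinessArea"], r["CurrentBA"])
```

A customer with no open-ended row gets None here, which roughly corresponds to the lookup finding no match.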

Matching different variables for the same observation

I am encountering some difficulty with a dataset that I am analyzing with Stata. The dataset I have is a repeated cross section of the following form:
Individual Year Age VarA VarB VarC
Variable C has been calculated for each individual by year, using the egen command. As a result, this variable is year specific. I now want to match the value of this variable corresponding to the year when each individual was x years old. (I create this new variable with the transformation variableD=Year-Age+x.)
I want to match the value of Variable C that was obtained in the year "variableD" for each individual.

Here's an example of how to do this with a user-written xfill:
net install xfill, from("http://www.sealedenvelope.com/")
webuse nlswork, clear
duplicates drop idcode age, force
gen x=20 if mod(idcode,2)==1
replace x=25 if mod(idcode,2)!=1
bys idcode year: egen var_c = mean(ln_wage)
bys idcode: gen var_c_at_x = var_c if age == x
xfill var_c_at_x, i(idcode)
edit idcode ln_wage year age x var_c*
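
The two key steps (keep var_c only where age equals x, then spread that value to all rows of the same individual) can be sketched in Python as follows. The rows are invented for illustration and fill_var_c_at_x is a hypothetical name. This mirrors the logic of the gen and xfill lines only, not the egen step.

```python
# Made-up panel rows for illustration; x is the target age for each individual.
rows = [
    {"id": 1, "year": 1990, "age": 19, "x": 20, "var_c": 1.5},
    {"id": 1, "year": 1991, "age": 20, "x": 20, "var_c": 1.7},
    {"id": 1, "year": 1992, "age": 21, "x": 20, "var_c": 1.9},
    {"id": 2, "year": 1990, "age": 24, "x": 25, "var_c": 2.1},
    {"id": 2, "year": 1991, "age": 25, "x": 25, "var_c": 2.4},
]


def fill_var_c_at_x(rows):
    """Attach to every row the var_c observed when the individual was x years old."""
    # Step 1: pick var_c from the row where age equals x (one per individual).
    at_x = {r["id"]: r["var_c"] for r in rows if r["age"] == r["x"]}
    # Step 2: copy that value to all rows of the same individual.
    return [{**r, "var_c_at_x": at_x.get(r["id"])} for r in rows]


filled = fill_var_c_at_x(rows)
for r in filled:
    print(r["id"], r["year"], r["age"], r["var_c"], r["var_c_at_x"])
```

Individuals never observed at age x end up with None, just as the Stata variable would stay missing for them.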

MATLAB - array2table nesting

For the purpose of simplicity I'll try to take an example from everyday life. Let's say I have a table in CSV file loaded in a table called dataOriginal with 3 columns - names, jobs , dates.
Let's take a closer look at the column "date":
date
____
'13.01.2014 20:34'
'22.03.2014 11:17'
...
I want to split each date into a date vector and add these vectors (along with the variable names for each of their columns; since we have multiple dates, we de facto have a matrix) to a column in a new table, again named "Date", but with all the naming goodies in it such as year, month etc.
Here is what I have done so far (sorry for the poor code quality but I've just started learning MATLAB :-/):
I split each date in a date-vector and also add names to each element like this:
dateFormat = 'dd.mm.yyyy HH:MM'; % four-digit year, matching '13.01.2014 20:34'
[year,month,day,hour,minute,second] = datevec(datesRaw, dateFormat);
so that I have this:
year(1) % returns '2014' since this is the first date in my column
year % returns all years in my entire column
Then I converted the above to a table:
dates = array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month',...,'second'});
so I get a nice output like this
year month second
____ _____ ... ______
2014 1 0
2014 3 0
... ... ... ...
This allows me an easy-to-read access to each column by simply calling for example:
year % returns all years
year(1) % returns first entry's year (here: '2014' from '13.01.2014 20:34')
I've processed my other columns too doing various operations on those and at the end I'm trying to horizontally concatenate all like this:
name job date
____ _____________________ _____________________
year month ... second
____ _____ ______
"Bob" "Construction worker" 2014 1 ... 0
"Alice" "Waitress" 2014 3 ... 0
... ... ... ... ... ...
I'm struggling exactly with the part with the nesting of year,month etc. in a single column named "date". I'd like to address a date's element in the table above as follows:
myData.name(1) % will return 'Bob'
myData.job(1) % will return 'construction worker'
myData.date(1).year(1) % should return '2014' for Bob, the construction worker
Currently I'm having the following code after some sweating and swearing:
dataFinal = horzcat( ...
array2table([dataProcessed(:,1),dataProcessed(:,2)],'VariableNames',[dataOriginal.Properties.VariableNames(1),dataOriginal.Properties.VariableNames(2)]), ...
array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month','day','hour','minute','second'}))
where
dataProcessed(:,1) are my processed names
dataProcessed(:,2) are my processed jobs
dataOriginal.Properties.VariableNames(1) is the name of the first column in my original table - "name"
dataOriginal.Properties.VariableNames(2) is the name of the second column in my original table - "job"
I do not know how to insert
array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month','day','hour','minute','second'})
in a named column "date" in order to accomplish my goal.
Thanks!

Try the following, it may be what you're looking for:
data = table(names, jobs, table(years, months, ...), 'VariableNames', {'name', 'job', 'date'})
Though you will address as follows, which is slightly different from what you said you want; it may still work for your purposes:
data.name(1);
data.job(1);
data.date.year(1);
EDIT: To see your output, do
disp([data(:, ~strcmp(data.Properties.VariableNames, 'date')), data.date])
names ids years months
_____ ___ _____ ______
'Bob' 1 2014 4
'Max' 2 2013 8
(when editing the comment I didn't exactly replicate the data and fields from the answer, but I think you should get the point here).
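
If it helps to see the nesting idea outside MATLAB, here is a small Python sketch of the same layout: an outer column-oriented table whose "date" column is itself a table of year, month and so on. The sample values come from the question, build_nested_table is a made-up name, and plain dicts of lists stand in for MATLAB tables.

```python
from datetime import datetime

# Sample values taken from the question.
names = ["Bob", "Alice"]
jobs = ["Construction worker", "Waitress"]
dates_raw = ["13.01.2014 20:34", "22.03.2014 11:17"]


def build_nested_table(names, jobs, dates_raw):
    """Return a column-oriented table whose 'date' column is a nested table."""
    # Day.month.four-digit-year, then hours:minutes.
    parsed = [datetime.strptime(d, "%d.%m.%Y %H:%M") for d in dates_raw]
    date = {
        "year": [p.year for p in parsed],
        "month": [p.month for p in parsed],
        "day": [p.day for p in parsed],
        "hour": [p.hour for p in parsed],
        "minute": [p.minute for p in parsed],
        "second": [p.second for p in parsed],
    }
    return {"name": names, "job": jobs, "date": date}


data = build_nested_table(names, jobs, dates_raw)
print(data["name"][0], data["job"][0], data["date"]["year"][0])
```

As in the answer above, the row index goes on the innermost column (the counterpart of data.date.year(1)), not on the date column itself.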
