Standardizing Heterogeneous Age Data in SPSS or Excel - database

I'm trying to standardize a column of Age data (i.e. into years old / months old) using SPSS / SPSS Syntax / Excel. My intuition is to use a series of DO IF blocks, i.e.:
DO IF CHAR.INDEX(Age, "y")>1... for years
DO IF CHAR.INDEX(Age, "m")>1... for months
DO IF CHAR.INDEX(Age, "d")>1... for days
and have the program read the number(s) immediately preceding each unit string as a quantity of years / months / days, then add it to a total in a new variable. The total could be in days (the smallest unit) and later be converted to years.
For example, for a cell "3 yr 5 mo": add 3*365 + 5*30.5 = 1247.5 (about 1248) days old to a new variable (something like "DaysOld").
Examples of Cell contents (numbers without any strings assumed to be years):
2
5 months
11 days
1.7
13 yr
22 yrs
13 months
10 mo
6/19/2016
3y10m
10m
12y
3.5 years
3 years
11 mos
1 year 10 months
1 year, two months
20 Y
13 y/o
3 years in 2014

The following syntax will solve a lot of cases, but definitely not all of them (e.g. "1.7" or "3 years in 2014"). You'll need to do more work on it, but this should get you started nicely...
First I recreate your sample data to work with:
data list list/age (a30).
begin data
"2"
"5 months"
"11 days"
"1.7"
"13 yr"
"22 yrs"
"13 Months"
"10 mo"
"6/19/2016"
"3y10m"
"10m"
"12y"
"3.5 years"
"3 YEARS"
"11 mos"
"1 year 10 months"
"1 year, two months"
"20 Y"
"13 y/o"
"3 years in 2014"
end data.
Now to work:
* some necessary definitions.
string ageCleaned (a30) chr (a1) nm d m y (a5).
compute ageCleaned="".
* my first step is to create a "cleaned" age variable (it's possible to
manage without this variable but using this is better for debugging and
improving the method).
* in the `ageCleaned` variable I only keep digits, periods (for decimal
point) and the characters "d", "m", "y".
do if CHAR.INDEX(lower(age),'ymd',1)>0.
loop #chrN=1 to char.length(age).
compute chr=lower(char.substr(age,#chrN,1)).
if CHAR.INDEX(chr,'0123456789ymd.',1)>0 ageCleaned=concat(rtrim(ageCleaned),chr).
end loop.
end if.
* the following line accounts for the word "days" which in the `ageCleaned`
variable has turned into the characters "dy".
compute ageCleaned=replace(ageCleaned,"dy","d").
exe.
* now I can work through the `ageCleaned` variable, accumulating digits
until I meet a character, then assigning the accumulated number to the
right variable according to that character ("d", "m" or "y").
compute nm="".
loop #chrN=1 to char.length(ageCleaned).
compute chr=char.substr(ageCleaned,#chrN,1).
do if CHAR.INDEX(chr,'0123456789.',1)>0.
compute nm=concat(rtrim(nm),chr).
else.
if chr="y" y=nm.
if chr="m" m=nm.
if chr="d" d=nm.
compute nm="".
end if.
end loop.
exe.
* we now have the numbers in string format, so after turning them into
numbers they are ready for use in calculations.
alter type d m y (f8.2).
compute DaysOld=sum(365*y, 30.5*m, d).
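For readers who want to prototype the same idea outside SPSS, here is a rough Python sketch of the approach above: keep only digits, periods and the letters y/m/d, then accumulate digits until a unit letter appears. The function name `days_old` is made up for illustration, and unlike the syntax above it adds repeated units together instead of overwriting them, so treat it as an approximation rather than a line-by-line port.

```python
def days_old(age):
    """Return an approximate age in days, or None if nothing could be parsed."""
    text = age.lower()
    if not any(unit in text for unit in "ymd"):
        return None  # e.g. "2", "1.7", "6/19/2016" stay unhandled, as above
    # Keep only digits, periods and the unit letters.
    cleaned = "".join(c for c in text if c in "0123456789ymd.")
    cleaned = cleaned.replace("dy", "d")  # the word "days" collapses to "dy"
    days_per_unit = {"y": 365.0, "m": 30.5, "d": 1.0}
    total, number, found = 0.0, "", False
    for c in cleaned:
        if c in "0123456789.":
            number += c  # still accumulating the current number
            continue
        try:
            total += float(number) * days_per_unit[c]
            found = True
        except ValueError:
            pass  # unit letter without a usable number, e.g. "two months"
        number = ""
    return total if found else None


print(days_old("3 yr 5 mo"))         # 1247.5
print(days_old("1 year 10 months"))  # 670.0
```

Values with no unit letter return None here, so the "numbers without any strings are years" rule would still need a separate branch.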

Related

Keep format time in array (VBA Excel)

I put several data (Names, dates, times and values) into an array. It goes wrong with the date and time.
Where do I go wrong? Here is a piece of my code:
For i = 1 To LastRow + 13
For j = 1 To 10
strArray(i, j) = Cells(i, j).Value2
Next j
Next i
So 0,000983796 should become 0:01:25.
In Excel, a date is the number of days since January 1, 1900 starting
with January 1, 1900 being “1”. Each date after that, Excel adds one
more number to that sequence. So August 26, 2013 is 41512, or 41,512
days since January 1, 1900.
The integer part of the number is used for the days. The decimal part
of the number is the fractional part of the day — or the time. So .5
would be 50% of the way thru the day, or 12:00 noon. That makes
41,512.5 to be equivalent to 12:00 noon on August 26, 2013.
From DATE VALUES IN EXCEL EXPLAINED
You can convert this number value back into something more pretty and readable.
dim pretty as String
pretty = Format(Cells(i, j).Value2, "h:mm:ss")
More examples on vba formatting
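To make the serial-number arithmetic concrete, here is a small Python sketch (not VBA) of the same conversion. The helper names are invented for illustration, and the 1899-12-30 epoch is the usual workaround for Excel's 1900 date system, so it is only reliable for serials from 1 March 1900 onward.

```python
from datetime import date, timedelta

# Assumed epoch for Excel's 1900 date system (valid from 1 March 1900 onward).
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial):
    """Integer part of the serial number -> calendar date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


def fraction_to_hms(serial):
    """Fractional part of the serial number -> h:mm:ss string."""
    seconds = round((serial % 1) * 86400)  # 86400 seconds per day
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


print(fraction_to_hms(0.000983796))  # 0:01:25
print(serial_to_date(41512.5))       # 2013-08-26
print(fraction_to_hms(41512.5))      # 12:00:00
```

A fraction that rounds up to a full day would print as 24:00:00 in this sketch; that edge case is left unhandled.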

SQL/postgres number of days in interval with filter

(Everything I have is in postgres and I can only use db stuff and no external help)
I have the following tables:
Project (title, id)
Tasks (planned_start_date, planned_end_date, contact_id, project_id)
Contact (id, contact_name)
Basically, a contact is assigned to a task and each task is assigned to a project.
I have to calculate the number of WORKING days which each contact has assigned per project. For example:
Projects:
P1(proj1, 1)
P2(proj2, 2)
Tasks:
T1('12/12/12', '12/14/12', 1, 1)
T2('12/12/12', '12/13/12', 1, 1)
T3('12/12/12', '12/13/12', 1, 2)
T4('12/12/12', '12/13/12', 2, 1)
T5('12/12/12', '12/13/12', 2, 2)
Contacts:
C1(1, Jack)
C2(2, Carla)
The trick is that I also have an evaluation period which should act as a filter for the tasks start/end dates.
So I should somehow sum up all these assigned working days for a project only for the interval for which the task is inside the evaluation period.
In this example we consider an evaluation time period from 14th of
January to 19th of February.
The available working days are 3 + 5 + 5 + 5 + 5 + 4 = 27 days for
this evaluation time period.
In this time period some tasks are "cut" for the evaluation. This
means that values are used pro rata:
Task 1: not to be considered; it's outside the evaluation time period
Task 2: to be considered in part. planned end date (due date) -
planned start date = 16 days (without weekends) 3 + 5 + 1 = 9 working
days are in the evaluation phase for this Task 2. planned time effort
pte = 6 to be done in this 16 days, i.e. 6/16 planned work effort per
day (pte ratio) to be done (assumption of an equal distribution of
work). for the 9 working days during the evaluation period we have a
pro rata planned time effort of pte = 9 * (6 / 16).
Task 3: planned time effort pte = 3
Task 4: planned end date (due date) - planned start date = 11 days
(without weekends) 5 + 4 = 9 working days are in the evaluation phase
for this Task 4. planned time effort pte = 4 to be done in this 11
days, i.e. 4/11 planned work effort per day (pte ratio) to be done
(assumption of an equal distribution of work). for the 9 working days
during the evaluation period we have a pro rata planned time effort
of pte = 9 * (4 / 11).
I've used
SELECT count(*)
FROM generate_series(0, (end_date::date - start_date::date)) i
WHERE date_part('dow', start_date::date + i) NOT IN (0,6)
To get only the working days (no weekend) for an interval, but I have no idea how to sum everything up and combine everything..
Any help would be so awesome
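Not a Postgres answer, but the pro rata logic described above can be checked with a short Python sketch before it is translated to SQL. Everything here is an assumption for illustration: the function names, counting both end dates as inclusive (as the generate_series query does), and the year 2015, chosen only because 14 January to 19 February then contains the 27 working days from the example.

```python
from datetime import date, timedelta


def working_days(start, end):
    """Count Monday-Friday dates from start to end, both inclusive."""
    if end < start:
        return 0
    span = (end - start).days + 1
    return sum(1 for i in range(span)
               if (start + timedelta(days=i)).weekday() < 5)


def pro_rata_effort(task_start, task_end, pte, eval_start, eval_end):
    """Planned effort falling inside the evaluation period, spread evenly."""
    total = working_days(task_start, task_end)
    if total == 0:
        return 0.0
    # Clip the task to the evaluation period before counting again.
    overlap = working_days(max(task_start, eval_start),
                           min(task_end, eval_end))
    return overlap * (pte / total)


eval_start, eval_end = date(2015, 1, 14), date(2015, 2, 19)
print(working_days(eval_start, eval_end))  # 27
# Hypothetical task: 16 working days, 9 of them inside the evaluation period.
print(pro_rata_effort(date(2015, 1, 5), date(2015, 1, 26), 6,
                      eval_start, eval_end))  # 3.375
```

Summing `pro_rata_effort` per contact and project would then give the totals asked for; in SQL the clipping step corresponds to GREATEST and LEAST on the start and end dates.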

MATLAB - array2table nesting

For the purpose of simplicity I'll try to take an example from everyday life. Let's say I have a table in CSV file loaded in a table called dataOriginal with 3 columns - names, jobs , dates.
Let's take a closer look at the column "date":
date
____
'13.01.2014 20:34'
'22.03.2014 11:17'
...
I want to split date into a date-vector and add this vector (along with the variable names for each of its columns (since we have multiple dates we have de facto a matrix)) to a column in a new table, again named "Date", but with all the naming goodies in it such as year, month etc.
Here is what I have done so far (sorry for the poor code quality but I've just started learning MATLAB :-/):
I split each date in a date-vector and also add names to each element like this:
dateFormat = 'dd.mm.yyyy HH:MM';
[year,month,day,hour,minute,second] = datevec(datesRaw, dateFormat);
so that I have this:
year(1) % returns '2014' since this is the first date in my column
year % returns all years in my entire column
Then I converted the above to a table:
dates = array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month',...,'second'});
so I get a nice output like this
year month second
____ _____ ... ______
2014 1 0
2014 3 0
... ... ... ...
This allows me an easy-to-read access to each column by simply calling for example:
year % returns all years
year(1) % returns first entry's year (here: '2014' from '13.01.2014 20:34')
I've processed my other columns too doing various operations on those and at the end I'm trying to horizontally concatenate all like this:
name job date
____ _____________________ _____________________
year month ... second
____ _____ ______
"Bob" "Construction worker" 2014 1 ... 0
"Alice" "Waitress" 2014 3 ... 0
... ... ... ... ... ...
I'm struggling exactly with the part with the nesting of year,month etc. in a single column named "date". I'd like to address a date's element in the table above as follows:
myData.name(1) % will return 'Bob'
myData.job(1) % will return 'construction worker'
myData.date(1).year(1) % should return '2014' for Bob, the construction worker
Currently I'm having the following code after some sweating and swearing:
dataFinal = horzcat( ...
array2table([dataProcessed(:,1),dataProcessed(:,2)],'VariableNames',[dataOriginal.Properties.VariableNames(1),dataOriginal.Properties.VariableNames(2)]), ...
array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month','day','hour','minute','second'}))
where
dataProcessed(:,1) are my processed names
dataProcessed(:,2) are my processed jobs
dataOriginal.Properties.VariableNames(1) is the name of the first column in my original table - "name"
dataOriginal.Properties.VariableNames(2) is the name of the second column in my original table - "job"
I do not know how to insert
array2table([year,month,day,hour,minute,second],'VariableNames',{'year','month','day','hour','minute','second'})
in a named column "date" in order to accomplish my goal.
Thanks!
Try the following, it may be what you're looking for:
data = table(names, jobs, table(years, months, ...), 'VariableNames', {'name', 'job', 'date'})
Though you will address as follows, which is slightly different from what you said you want; it may still work for your purposes:
data.name(1);
data.job(1);
data.date.year(1);
EDIT: To see your output, do
disp([data(:, ~strcmp(data.Properties.VariableNames, 'date')), data.date])
names ids years months
_____ ___ _____ ______
'Bob' 1 2014 4
'Max' 2 2013 8
(when editing the comment I didn't exactly replicate the data and fields from the answer, but I think you should get the point here).

Matlab: Number of observations per year for very large array

I have a large array with daily data from 1926 to 2012. I want to find out how many observations are in each year (it varies from year-to-year). I have a column vector which has the dates in the form of:
19290101
19290102
.
.
.
One year here is going to be July through June of the next year.
So 19630701 to 19640630
I would like to use this vector to find the number of days in each year. I need the number of observations to use as inputs into a regression.
I can't tell whether the dates are stored numerically or as a string of characters; I'll assume they're numbers. What I suggest doing is to convert each value to the year and then using hist to count the number of dates in each year. So try something like this:
year = floor(date/10000);
obs_per_year = hist(year,1926:2012);
This will give you a vector holding the number of observations in each year, starting from 1926.
Series of years starting July 1st:
bin = datenum(1926:2012,7,1);
Bin your vector of dates (as serial date numbers, so yyyymmdd values need converting with datenum first) within each year with bin(1) <= x < bin(2), bin(2) <= x < bin(3), ...
count = histc(dates,bin);
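For comparison, the July-to-June grouping can also be done directly on the yyyymmdd numbers without converting to date numbers first. The following is a Python sketch rather than MATLAB; the function name is made up, and labelling each period by the calendar year in which it starts is my own choice.

```python
from collections import Counter


def observations_per_year(dates):
    """Count yyyymmdd integers per July-June year, keyed by the starting year."""
    counts = Counter()
    for d in dates:
        year, month = d // 10000, (d // 100) % 100
        # January-June belong to the period that started the previous July.
        counts[year if month >= 7 else year - 1] += 1
    return dict(sorted(counts.items()))


print(observations_per_year([19630630, 19630701, 19640630, 19640701]))
# {1962: 1, 1963: 2, 1964: 1}
```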

Utilizing SQL datepart to identify consecutive periods of time

I have a stored procedure that works correctly, but I don't understand the theory behind why it works. I'm identifying a consecutive period of time by utilizing a datepart and dense rank (found the solution through help elsewhere).
select
c.bom
,h.x
,h.z
,datepart(year, c.bom) * 12 + datepart(month, c.bom) -- this returns an integer value for the year and month, so the number increments by one for each month
- dense_rank() over ( partition by h.x order by datepart(year, c.bom) * 12 + datepart(month, c.bom)) as grp -- this subtracts the dense rank from the integer month value, so that consecutive months (i.e. consecutive integers) get the same grp value
from
#c c
inner join test.vw_info_h h
on h.effective_date <= c.bom
and (h.expiration_date is null or h.expiration_date > c.bom)
I understand in theory what is happening with the grouping functionality.
How does multiplying year * 12 + month work? Why do we multiply the year? What is happening in the backend?
The year component of a date is an integer value. Since there are 12 months in a year, multiplying the year value by 12 provides the total number of months that have passed to get to the first of that year.
Here's an example. Take the date February 11, 2012 (20120211 in CCYYMMDD format)
2012 * 12 = 24144 months from the start of time itself.
24144 + 2 months (february) = 24146.
Multiplying the year value by the number of months in a year allows you to establish month-related offsets without having to do any coding to handle the edge cases between the end of one year and the start of another. For example:
11/2011 -> 24143
12/2011 -> 24144
01/2012 -> 24145
02/2012 -> 24146
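To see why subtracting the dense rank groups consecutive months, here is a small Python sketch of the same trick (not T-SQL; the function name is invented for illustration). Within a run of consecutive months both the month index and the rank grow by one, so their difference stays constant; a gap bumps the index but not the rank, so the difference changes.

```python
def consecutive_month_groups(months):
    """Group (year, month) pairs into runs of consecutive months."""
    # Distinct, sorted month indexes play the role of the dense_rank ordering.
    indexes = sorted({year * 12 + month for year, month in months})
    groups = {}
    for rank, index in enumerate(indexes, start=1):
        # index - rank is constant inside a run of consecutive months.
        groups.setdefault(index - rank, []).append(index)
    return list(groups.values())


print(consecutive_month_groups([(2011, 11), (2011, 12), (2012, 1), (2012, 3)]))
# [[24143, 24144, 24145], [24147]]
```

Here 11/2011 through 01/2012 share the difference 24142, while 03/2012 (after the missing February) gets 24143 and so lands in its own group.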

Resources