Calculating all possible "time difference" combinations in R - difftime

I'm using camera trapping data which contains two columns:
"datetime" of when the photo was taken
"species" species that appears in the photo
I want to calculate time difference between all possible pairs of species in R.
I have used the difftime and diff functions in R, but the result isn't what I'm after: R only calculates the time between consecutive rows, i.e. "datetime2-datetime1", "datetime3-datetime2", "datetime4-datetime3", etc.
An example of my data is:
datetime (POSIXct format): "2018-10-06 08:39:00", "2018-10-07 04:09:00", "2018-10-14 00:47:00"
species: "deer", "horse", "fox"
If I use diff function:
diff(datetime)
Time differences in hours
[1]  19.5000 164.6333  # time between the first and second, and between the second and third datetimes
I have also tried:
base_time <- datetime[1]
later_times <- datetime[2:3]
later_times - base_time
diff(later_times)
This option combines all possible datetimes, but it doesn't scale if my data set has more than 3 rows.
As I need to calculate the time difference between all pairs of photos, the result should be:
"datetime2-datetime1", "datetime3-datetime1", "datetime4-datetime1",
"datetime3-datetime2", "datetime4-datetime2", "datetime4-datetime3", etc.
I'm still learning R, so any help would be greatly appreciated!
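To make the target output concrete, here is a minimal sketch of the "all pairs" calculation. It is written in Python rather than R purely for illustration, and the `records` list and the `pairwise_hours` helper are hypothetical names built from the three example rows above. The idea is to enumerate every unordered pair of rows once and subtract the earlier datetime from the later one.

```python
from datetime import datetime
from itertools import combinations

# Example rows from the question: (species, datetime of the photo).
records = [
    ("deer", datetime(2018, 10, 6, 8, 39)),
    ("horse", datetime(2018, 10, 7, 4, 9)),
    ("fox", datetime(2018, 10, 14, 0, 47)),
]


def pairwise_hours(rows):
    """Return (species_a, species_b, hours) for every pair of rows."""
    result = []
    # combinations(rows, 2) yields each unordered pair exactly once,
    # in the order (1,2), (1,3), ..., (2,3), ...
    for (species_a, time_a), (species_b, time_b) in combinations(rows, 2):
        hours = (time_b - time_a).total_seconds() / 3600
        result.append((species_a, species_b, hours))
    return result


pairs = pairwise_hours(records)
for species_a, species_b, hours in pairs:
    print(f"{species_a} -> {species_b}: {hours:.4f} h")
```

With n rows this produces n*(n-1)/2 differences, so the output grows quadratically with the number of photos.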

Related

Calculating distance between GPS data using SQL Server

I have tables in SQL Server Management Studio containing the locations of individuals by date/time over several months. The tables have the following fields: AnimalID, Date/Time, Lat, Long, Global ID. I am trying to calculate and return the distance between each pair of points in order of movement, without manually entering the lat and long each time. There are many posts here about calculating the distance between two points, but I'm trying to run a query that will calculate the distance between each pair in consecutive order. Some of my tables have hundreds of locations.
My values might look like:
`MD001 10/9/2019 1:00:00PM 40.73995 -111.8739
MD001 10/9/2019 6:00:00PM 40.75068 -111.8782
MD001 10/9/2019 10:00:00PM 40.74900 -111.89100`
I want to know the distance between 1:00PM and 6:00PM and then from 6:00PM and 10:00PM, and so forth. I want to accomplish this in SQL Server so that I can query out outliers in the data. Your insight is much appreciated. I also do not want to create a new field in this table.
The algorithm to calculate the distance between points is called the haversine formula.
To calculate the distance between 2 points in SQL Server you have 2 options:
POINT 1 = 151.209030,-33.877814
POINT 2 = 144.971431, -37.808694
Option 1. You can do your own implementation of the haversine formula:
select
2 * 6371 * asin(sqrt(POWER((sin(radians((-37.808694 - -33.877814) / 2))),2) + cos(radians(-33.877814)) * cos(radians(-37.808694)) * POWER((sin(radians((144.971431 - 151.209030) / 2))),2)))
Note this will give you the distance in kilometers. This is defined by the multiplier 6371 (the Earth's mean radius in km). To get the distance in miles, replace 6371 with 3959.
If you search for "haversine formula sql" you can find more in-depth details about this implementation.
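As a cross-check on the arithmetic, the same formula can be sketched outside SQL. The snippet below is an illustrative Python version, not a SQL Server solution; `haversine_km` and `consecutive_distances` are hypothetical helper names. It also shows how consecutive fixes can be paired up, using the three sample rows from the question.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371  # same multiplier as in the SQL above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def consecutive_distances(points):
    """Distance between each point and the next one, in the given order."""
    return [
        haversine_km(lat1, lon1, lat2, lon2)
        for (lat1, lon1), (lat2, lon2) in zip(points, points[1:])
    ]


# POINT 1 and POINT 2 from the answer (given there as lon, lat).
sydney_melbourne = haversine_km(-33.877814, 151.209030, -37.808694, 144.971431)

# The three MD001 fixes from the question, already sorted by time.
track = [(40.73995, -111.8739), (40.75068, -111.8782), (40.74900, -111.89100)]
steps = consecutive_distances(track)
print(round(sydney_melbourne, 1), [round(d, 3) for d in steps])
```

The `zip(points, points[1:])` pairing assumes the rows are already sorted by date/time for a single animal.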
Option 2. Use SQL Server built-in functions.
In order to do that you'll need to convert your lat and long columns to geography datatype and then use the STDistance function to calculate the actual distance.
The statement below should give you an idea to get started:
select
cast('POINT(151.209030 -33.877814)' as geography).STDistance(cast('POINT(144.971431 -37.808694)' as geography)) as distance_in_meters,
cast('POINT(151.209030 -33.877814)' as geography).STDistance(cast('POINT(144.971431 -37.808694)' as geography)) / 1000 as distance_in_km
The default result will be in meters.
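Note that the two options give slightly different results when applied to the same coordinates. A likely cause is that the haversine formula treats the Earth as a sphere of radius 6371 km, whereas the geography type measures along an ellipsoidal Earth model. If you need precision, investigate which model suits your data.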

How do I create a default everyday date dimension?

I am trying to create a line chart counting all the opt-ins per date, but the only dimensions it will allow me to choose from have to be date columns in my source. The problem is that the chart then only uses dates that are populated in those fields with an opt-in date.
For example: I have 5 opt-ins on 1/1/2019, 0 on 1/2/2019, and 3 on 1/3/2019.
If I use this series and want to include another metric, 1/2/2019 will not show anything for that other metric.
I just want a standard everyday series that counts every metric on a given day. The Google Analytics connection source has a generic Date dimension, but I cannot figure out how it was done.
I've tried creating a new column with every date in it and using that as a dimension, without any luck.
You should be able to use a Time Series graph (of which there are 3 types) instead of a Line graph.
A Time Series will keep the days where no data is available unlike the Line Graph which only presents labels for those which have values in the data.
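If the zero days need to exist in the data itself (for instance when preparing the source before it is loaded into the charting tool), one option is to build a complete daily calendar and fill in missing days with 0. The following is a minimal sketch under that assumption; the `optins` mapping and the `fill_daily` helper are hypothetical and use the example counts from the question.

```python
from datetime import date, timedelta

# Opt-in counts keyed by day; 2019-01-02 is absent because nothing happened.
optins = {date(2019, 1, 1): 5, date(2019, 1, 3): 3}


def fill_daily(counts):
    """Return a dict with one entry per calendar day, using 0 for missing days."""
    start, end = min(counts), max(counts)
    filled = {}
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        filled[day] = counts.get(day, 0)
    return filled


filled = fill_daily(optins)
for day, count in filled.items():
    print(day.isoformat(), count)
```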

iterating over multindex - a groupby.value_counts() object is only through values and not through original date index

I want to know the percentage of males in the ER (emergency room) on days that I defined as overcrowded.
I have a DataFrame named eda with rows representing each entry to the ER. One column states whether the entry occurred on an overcrowded day (1 means overcrowded) and another column states the gender of the person who entered.
So far I have managed to get a series with the overcrowded days as the index and a sub-index representing gender, holding the number of entries for that gender.
i used this code :
eda[eda.over_crowd==1].groupby(eda[eda.over_crowd==1].index.date).gender.value_counts()
and got the following result:
My question is: what is the most 'pandas-ian' way to get the percentage of males/females in general? Or, how should I continue from the point where I stopped?
As can be seen at the bottom of the screenshot, when I iterate over the elements, each value is alternately the male or the female count. I want to iterate over dates so that I can write a cleaner loop that produces another column with the male percentage.
I found a pretty elegant solution. I'm sure there are others, but maybe it can help someone else.
I defined a multi-index series with all dates and the counts of females and males. Then I used .loc to operate on the counts for all dates to get the percentage of males on each day. Finally, I extract only the days for which over_crowd==1.
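import numpy as np
# gender is assumed to be coded 1 = male, 2 = female; crowding is a per-date DataFrame
temp=eda.groupby(eda.index.date).gender.value_counts()
crowding['male_percent']=np.divide(100*temp.loc[:,1],temp.loc[:,2]+temp.loc[:,1])
crowding.male_percent[crowding.over_crowd==1]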
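An alternative that avoids the intermediate multi-index is to turn gender into a boolean and take its mean per day. This is only a sketch: the small `eda` frame below is made-up stand-in data, and it assumes gender is coded 1 = male, 2 = female as the answer's code suggests.

```python
import pandas as pd

# Hypothetical stand-in for `eda`: one row per ER entry, indexed by timestamp.
# Assumption: gender 1 = male, 2 = female; over_crowd 1 = overcrowded day.
eda = pd.DataFrame(
    {
        "gender": [1, 2, 1, 1, 2, 2],
        "over_crowd": [1, 1, 1, 0, 0, 1],
    },
    index=pd.to_datetime(
        [
            "2019-01-01 08:00",
            "2019-01-01 09:30",
            "2019-01-01 17:45",
            "2019-01-02 10:00",
            "2019-01-02 11:15",
            "2019-01-03 07:20",
        ]
    ),
)

# Keep only entries on overcrowded days.
crowded = eda[eda["over_crowd"] == 1]

# The mean of a boolean column is the fraction of True values,
# so this gives the share of male entries per calendar day.
male_percent = crowded["gender"].eq(1).groupby(crowded.index.date).mean().mul(100)
print(male_percent)
```

Days with only one gender present are handled without special cases, since the mean of an all-False or all-True group is simply 0 or 1.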

Vlookup from multiple criteria to display nearest answer

I was hoping someone can help me. I have hit a solid wall.
I have a table with product information, and I am building a calculator that should return a number of options based on set criteria held in the table. I am failing at just pulling through a code. I feel rather embarrassed asking how to do a vlookup here, but basically I need a vlookup that depends on multiple criteria and returns the nearest match (if applicable) based on those criteria.
Criteria 1 = Product
Criteria 2 = Type
Criteria 3 = Height
Criteria 4 = Min
I have created a search key in the table that concatenates all of these columns and then done a vlookup, which is =Vlookup(Criteria1 & Criteria2 & Criteria3 & Criteria4, Table Data, Code Required). But this does not give me the right results: it returns either an error or the incorrect product. Below is my data and the calc I am hoping to complete. Can someone please help?
Here is an example looking for a closest match on Min. It demonstrates the principle so you can extend.
The closest match formula part is:
MATCH(MIN(ABS(E2:E4-K2)),ABS(E2:E4-K2),0)
Column E holds the Min values and K2 holds the target Min. This is an array formula, entered with Ctrl+Shift+Enter. Adjust the range E2:E4 to fit your data.
The multiple criteria part is using:
=MATCH(lookup_value_1&lookup_value_2&lookup_value_3, lookup_array_1&lookup_array_2&lookup_array_3, match_type)
Here you concatenate your parameters and search for a match against the concatenation of the corresponding columns in the table (you could match against the key column instead if the key is made up of the same parameters).
Overall formula with some test data (using one estimate figure):
=INDEX(F:F,MATCH(K1&K5&J5&INDEX(E2:E4,MATCH(MIN(ABS(E2:E4-K2)),ABS(E2:E4-K2),0)),B:B&C:C&D:D&E:E,0))
Remember that the combined formula above is an array formula, so enter it with Ctrl+Shift+Enter. You can reduce the ranges from entire columns to only the rows holding data.
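The logic of a multi-criteria closest match can also be sketched outside Excel. The Python snippet below is illustrative only: the `Row` fields, the sample rows, and `closest_code` are hypothetical. Note that it swaps the order of the steps compared with the formula above: it first filters on the exact criteria and only then picks the nearest Min, so the nearest value is always taken from rows that match the other criteria.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    product: str
    kind: str        # the "Type" criterion
    height: int
    min_value: float
    code: str


def closest_code(rows: List[Row], product: str, kind: str, height: int,
                 target_min: float) -> Optional[str]:
    """Code of the row matching product/kind/height whose Min is nearest the target."""
    candidates = [
        r for r in rows
        if (r.product, r.kind, r.height) == (product, kind, height)
    ]
    if not candidates:
        return None  # comparable to the formula returning an error
    best = min(candidates, key=lambda r: abs(r.min_value - target_min))
    return best.code


# Made-up sample data for illustration.
rows = [
    Row("A", "X", 10, 100, "A-100"),
    Row("A", "X", 10, 200, "A-200"),
    Row("B", "X", 10, 140, "B-140"),
]
print(closest_code(rows, "A", "X", 10, 160))
```

Although product B has the Min value nearest to 160 overall, it is filtered out before the nearest-value step, which is the behaviour a per-product lookup usually wants.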
Data:
I am not typing all that out from the picture, so here is a quick and dirty sample.
I tried QHarr's solution, but it didn't work for all the rows.
My solution is:
Add a helper column (referenced as G2:G19 in the formula below) with:
=IF(E2 < $K$2, E2, 0) and copy it down for all rows
In L5 create the formula:
{=INDEX(F2:F19,MATCH($K$1&K5&$J$5&INDEX(E2:E19,MATCH(MAX(IF(B2:B19=$K$1,1,0)*IF(C2:C19=K5,1,0)*IF(D2:D19=$J$5,1,0)*G2:G19,0),E2:E19,0)),B2:B19&C2:C19&D2:D19&E2:E19,0))}
Copy the formula to L6 and L7
I originally marked this as answered, and it did work initially, but as I added more products it began to fail. After much trial and error I found a simple solution: {=INDEX(Calc!$I$2:$I$189,MATCH(Output!$H$7,IF(Calc!$B$2:$B$189=Output!A12,Calc!$H$2:$H$189),1))}

Efficient Excel formula for returning multiple matches from a large number of rows

I'm stumped by a major issue. I have a data set consisting of about 16000 rows (could be more in future). This list is basically a price list containing products and their corresponding installation fees. Now the products are classified by the following hierarchy: City -> Category -> Rating/Type. Before I was using named ranges to refer to each set by concatenating City & Category & Rating (_XYZ_SPC_9.5). This resulted in about 1500 named ranges which inflated the size of the Excel file. So I decided to calculate the products on-the-fly using inputs from the user. I have tried array formulas and simple formulas but they take some time to calculate (16000 rows!!) which is not acceptable from a usability perspective; our sales people are very particular about how much time they have to spend on the tool.
I have uploaded a sample file at:
Price List Sample
Formulas that I have used so far are:
=IFERROR(INDEX($H$6:$H$15000, SMALL(INDEX(($AE$9=$R$6:$R$15000)*(MATCH(ROW($R$6:$R$15000), ROW($R$6:$R$15000)))+($AE$9<>$R$6:$R$15000)*15000, 0, 0), AC3)),"Not Available")
{=IFERROR(INDEX(ref_PRICE_LIST!$H$6:$H$16074,MATCH(INDEX(ref_PRICE_LIST!$H$6:$H$16074,(SMALL(IF(IF(RIGHT($AE$3,3)="All",ref_PRICE_LIST!$Z$6:$Z$16074,ref_PRICE_LIST!$R$6:$R$16074)=$AE$3,ROW(ref_PRICE_LIST!$H$6:$H$16074)-ROW(ref_PRICE_LIST!$H$6)+1),$AC3))),ref_PRICE_LIST!$H$6:$H$16074,0),1),"Not Available")}
I would really appreciate it if someone could help me out.
Thank you so much!
I think the best way to speed this up is to split the formula into a helper column K and a result column L.
Helper Column (copy down for all 16,000 data rows)
=IF($D:$D=$O$2,ROW(),"")
Result column (starting at L2, copy down as many as you need)
=IFERROR(INDEX($F:$F,SMALL($K:$K,ROW()-1)),"Not available")
I've tested this with about 150,000 rows and it updates in < 1s
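The reason the split helps can be sketched in a few lines of Python (illustrative only; `matching_rows`, `kth_match`, and the sample lists are hypothetical). The helper step scans the data once and records the positions of matching rows; each result cell then only needs to pick the k-th recorded position, instead of rescanning all rows as the array formulas do.

```python
def matching_rows(categories, wanted):
    """Helper column: positions of the rows whose category equals `wanted`."""
    return [i for i, category in enumerate(categories) if category == wanted]


def kth_match(prices, helper, k, default="Not available"):
    """Result column: price at the k-th matching row (k starts at 1)."""
    if 1 <= k <= len(helper):
        return prices[helper[k - 1]]
    return default


# Made-up sample data standing in for columns D (category) and F (price).
categories = ["XYZ", "ABC", "XYZ", "XYZ", "ABC"]
prices = [10.0, 20.0, 30.0, 40.0, 50.0]

helper = matching_rows(categories, "XYZ")   # computed once per user input
results = [kth_match(prices, helper, k) for k in range(1, 6)]
print(results)
```

The scan costs one pass over the data per change of input, and every result lookup after that is a direct index.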
