AQL Optimization - traversal by month - query-optimization

In ArangoDB, I have a graph which contains messages for profiles. For reporting purposes, I need to aggregate the number of messages for specific categories of message from each month for the last 12 months.
The date and category of each message are on every edge between the profile and the message.
Currently, I have:
for mo in 0..11
let month = DATE_MONTH(DATE_SUBTRACT(DATE_NOW(),mo,"month"))
let msgCount = count(
for v, e in outbound "profileId" graph 'MyGraph'
filter e.category == "myCategory" && DATE_MONTH(e.date) == month
return 1
)
return {month: month, msgCount: msgCount}
This query executes in over one second, which is a bit slow for the purposes of this webapp, as I need to create and visualize this report for several categories.
Is there any optimization that could produce the same results faster?

Analysing the query
Using db._explain() shows the following:
Execution plan:
 Id   NodeType            Est.   Comment
  1   SingletonNode          1   * ROOT
  2   CalculationNode        1     - LET #7 = 0 .. 11   /* range */   /* simple expression */
  3   EnumerateListNode     12     - FOR mo IN #7   /* list iteration */
  4   CalculationNode       12       - LET month = DATE_MONTH(DATE_SUBTRACT(DATE_NOW(), mo, "month"))   /* v8 expression */
 11   SubqueryNode          12       - LET #4 = ...   /* subquery */
  5   SingletonNode          1         * ROOT
  9   CalculationNode        1           - LET #11 = 1   /* json expression */   /* const assignment */
  6   TraversalNode          1           - FOR v /* vertex */, e /* edge */ IN 1..1 /* min..maxPathDepth */ OUTBOUND 'profileId' /* startnode */ GRAPH 'MyGraph'
  7   CalculationNode        1           - LET #9 = ((e.`category` == "myCategory") && (DATE_MONTH(e.`date`) == month))   /* v8 expression */
  8   FilterNode             1           - FILTER #9
 10   ReturnNode             1           - RETURN #11
 13   CalculationNode       12       - LET #13 = { "month" : month, "msgCount" : COUNT(#4) }   /* simple expression */
 14   ReturnNode            12       - RETURN #13

Indexes used:
 none

Traversals on graphs:
 Id   Depth   Vertex collections   Edge collections   Filter conditions
  6   1..1    persons              knows

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   remove-unnecessary-calculations
This tells us the following:
The traversal uses V8 expressions, which are expensive.
Because the filter is a complex expression, it cannot be evaluated inside the traverser, so filtering is applied only after the traversal has finished.
What you could do to improve the situation
You should find a way to move the complex calculations out of the iteration.
While working with the DATE_* functions may be convenient, it's not fast. You should calculate the YEAR:MONTH strings you want to filter for in advance, and then do a range comparison on the calculated strings inside the query:
FILTER e.date > '2016:04:' && e.date < '2016:05:'
Nate's implementation:
for mo in 0..11
let month = date_subtract(concat(left(date_iso8601(date_now()),7),'-01T00:00:00.000Z'), mo, "month")
let nextMonth = date_subtract(concat(left(date_iso8601(date_now()),7),'-01T00:00:00.000Z'), mo-1, "month")
let monthNumber = date_month(month)
let msgCount = count(
for v, e in outbound "profileId" graph "MyGraph"
filter e.category == 'myCategory' && e.date >= month && e.date < nextMonth
return 1
)
return {month: monthNumber, msgCount: msgCount}
This significantly speeds up the query (5x faster)!
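The month boundaries could also be computed in the application and passed to the query as bind parameters, which keeps all date arithmetic out of AQL. The following is a minimal Python sketch of that idea, assuming e.date is stored as an ISO 8601 UTC string (as the string comparison above implies). The helper name month_windows is hypothetical and not part of ArangoDB or any driver.

```python
from datetime import datetime, timezone

# Same shape as the strings built in the AQL above (assumed storage format).
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"


def month_windows(now, months=12):
    """Return (month_number, start_iso, end_iso) for the last `months`
    calendar months, newest first. Each window is half-open:
    start <= date < end."""
    windows = []
    year, month = now.year, now.month
    for _ in range(months):
        start = datetime(year, month, 1, tzinfo=timezone.utc)
        # First day of the following month (rolls over into the next year).
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        end = datetime(next_year, next_month, 1, tzinfo=timezone.utc)
        windows.append((month, start.strftime(ISO_FORMAT), end.strftime(ISO_FORMAT)))
        # Step back one calendar month for the next iteration.
        year, month = (year - 1, 12) if month == 1 else (year, month - 1)
    return windows
```

Each (start, end) pair can then be compared against e.date with a plain string range filter, because ISO 8601 strings in a fixed format sort in chronological order.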

Related

Complicated SAS Loop & Permutation Problem: Need to Calculate # of Days Between All Date Variables and Find Set That Matches Criteria

I have a wide dataset with 6 dates and 6 types. The dataset has between 2 - 4 million rows so I need the most efficient way to do this. Each type number corresponds to the date number. I need to compare each date and type to each other (2 at a time) and find the earliest date that these requirements are met:
Criteria
If both types = P, the difference between the dates must be 17 or more.
All other combinations must be 24 or more days apart
I have to find the earliest date match. So in my example, my "have" dataset has all 6 entries. My want dataset has the earliest two dates that meet my criteria which are 2 & 4. I can calculate this manually but can't figure out how to do it in SAS. I would love to have this in some sort of iterative macro program instead of taking up hundreds of lines of code.
My logic on paper
Calculate # of days between each admin_date (all permutations)
Ignore any that are not at least 17 days apart
Among those that are >= 17 and <24, check to see if both types are P. If so, take the combination with the earliest 2nd date.
If not, check those that are >= 24. Take the combination with the earliest 2nd date.
My calculations for Day Differences Between:
1 & 2: 1 day (ignore, less than 17)
1 & 3: 7 days (ignore, less than 17)
1 & 4: 18 days (ignore, not both P)
1 & 5: 20 days (ignore, not both P)
1 & 6: 34 days (consider, ge 24)
2 & 3: 6 days (ignore, less than 17)
2 & 4: 17 days (consider, ge 17, both P)
2 & 5: 19 days (consider, ge 17, both P)
2 & 6: 33 days (consider, ge 24)
and so forth
Among all of the sets that met the criteria (1 and 6, 2 and 4, 2 and 5, 2 and 6), the pair 2 and 4 has the earliest second date, so that's what I want.
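To make the selection rule concrete, here is a minimal Python sketch of the logic described above (it is not SAS and not the approach used in the answers). It assumes that ties on the second date are broken by the earliest first date, which the question does not specify; the function name earliest_valid_pair is hypothetical.

```python
from datetime import date
from itertools import combinations


def earliest_valid_pair(dates, types):
    """Return (date1, type1, date2, type2) for the qualifying pair with the
    earliest second date, or None if no pair qualifies."""
    # Sort by date so that in every pair the first entry is the earlier one.
    entries = sorted((d, t) for d, t in zip(dates, types) if d is not None)
    best = None
    for (d1, t1), (d2, t2) in combinations(entries, 2):
        # Two P types need at least 17 days; every other combination needs 24.
        min_gap = 17 if t1 == "P" and t2 == "P" else 24
        if (d2 - d1).days < min_gap:
            continue
        # Prefer the earliest second date; break ties on the earliest first
        # date (an assumption, see the note above).
        if best is None or (d2, d1) < (best[2], best[0]):
            best = (d1, t1, d2, t2)
    return best
```

For the JohnDoe row, this picks 01/13/2021 and 01/30/2021, both of type P, matching the want dataset.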
data have;
input person $
admin_date1 : ?? mmddyy10.
admin_date2 : ??mmddyy10.
admin_date3 : ?? mmddyy10.
admin_date4 : ?? mmddyy10.
admin_date5 : ?? mmddyy10.
admin_date6 : ?? mmddyy10.
type1 $
type2 $
type3 $
type4 $
type5 $
type6 $;
format admin_date1 mmddyy10.
admin_date2 mmddyy10.
admin_date3 mmddyy10.
admin_date4 mmddyy10.
admin_date5 mmddyy10.
admin_date6 mmddyy10.;
datalines;
JohnDoe 01/12/2021 01/13/2021 01/19/2021 01/30/2021 02/01/2021 02/15/2021 M P M P P J
;
run;
data want;
input person $
admin_date1 : ?? mmddyy10.
admin_date2 : ??mmddyy10.
type1 $
type2 $
;
format admin_date1 mmddyy10.
admin_date2 mmddyy10.
;
datalines;
JohnDoe 01/13/2021 01/30/2021 P P
;
run;
This is pretty easy if you transpose to long first. Then use a second, nested SET statement with the POINT= option that starts at the next row and iterates over the remaining records for the same person. It's not the fastest approach, but it's quite simple; with millions of rows you may want something faster.
data have_long( rename=(_admin_date = admin_date _type = type));
set have;
array admin_date[6];
array type[6];
do _i = 1 to dim(admin_date);
_admin_date = admin_date[_i];
_type = type[_i];
output;
end;
keep person _admin_date _type;
format _admin_date mmddyy10.;
run;
data want;
set have_long nobs=nobs;
if type = 'P';
do _i = _n_ + 1 to nobs until (person ne _person);
set have_long(rename=(person=_person type=_type admin_date=admin_date2)) point=_i;
if person eq _person and admin_date2 ge (admin_date + 17) and _type = 'P' then do;
output;
leave;
end;
end;
drop _:;
run;
Another similar solution, which works so long as have_long is able to be loaded into memory, is to use a hash table. This might be faster, or might not be, depending on various details of the data.
data want;
if _n_ eq 1 then do; *declare the hash table;
declare hash hl(dataset:'have_long', ordered:'a', multidata:'y'); *ordered ascending and multiple records per key is allowed;
hl.defineKey('person','type');
hl.defineData('admin_date');
hl.defineDone();
end;
format admin_date1 mmddyy8.;
set have_long;
if type = 'P';
admin_date1 = admin_date; *set the original admin_date aside;
rc = hl.find(); *find the first match;
do while (rc eq 0 and admin_date1 gt (admin_date - 17)); *loop unless we either run out of matches or find a valid date;
rc = hl.find_next();
end;
if rc eq 0 then output; *if we exited due to a valid date then output;
run;

how to create a loop for a macro in stata?

I have a very large dataset but to cut it short I demonstrated the data with the following example:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patid death dateofdeath)
1 0 .
2 0 .
3 0 .
4 0 .
5 1 15007
6 0 .
7 0 .
8 1 15526
9 0 .
10 0 .
end
format %d dateofdeath
I am trying to sample for a case-control study based on date of death. At this stage, I need to first create a variable with each date of death repeated for all the participants (hence we end up with a dataset with 20 participants) and a pairid equivalent to the patient id patid of the corresponding case.
I created a macro for one case (which works) but I am finding it difficult to have it repeated for all cases (where death==1) in a loop.
The successful macro is as follows:
local i "5" //patient id who died
gen pairid= `i'
gen matchedindexdate = dateofdeath
replace matchedindexdate=0 if pairid != patid
gsort matchedindexdate
replace matchedindexdate= matchedindexdate[_N]
format matchedindexdate %d
save temp`i'
and the loop I attempted is:
* (min and max patid id)
forval j = 1/10 {
count if patid == `j' & death==1
if r(N)=1 {
gen pairid= `j'
gen matchedindexdate = dateofdeath
replace matchedindexdate=0 if pairid != patid
gsort matchedindexdate
replace matchedindexdate= matchedindexdate[_N]
save temp/matched`j'
}
}
use temp/matched1, clear
forval i=2/10 {
capture append using temp/matched`i'
save matched, replace
}
but I get:
invalid syntax
How can I do the loop?
I finally had it solved, please check:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1591811-how-to-create-a-loop-for-a-macro
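For readers who want to see the target data shape, the following minimal Python sketch (not Stata, and not the solution from the linked thread) builds the expanded dataset described in the question: one full copy of all participants per case, tagged with that case's patid and date of death. The function name expand_by_case and the dictionary layout are assumptions made for illustration.

```python
def expand_by_case(rows):
    """rows: list of dicts with keys patid, death, dateofdeath.
    Returns one copy of every row per case (death == 1), with pairid set to
    the case's patid and matchedindexdate set to the case's date of death."""
    cases = [r for r in rows if r["death"] == 1]
    expanded = []
    for case in cases:
        for r in rows:
            expanded.append({
                **r,
                "pairid": case["patid"],
                "matchedindexdate": case["dateofdeath"],
            })
    return expanded


# The example data from the question (dates kept as Stata day numbers).
EXAMPLE = [
    {"patid": i, "death": 0, "dateofdeath": None} for i in range(1, 11)
]
EXAMPLE[4].update(death=1, dateofdeath=15007)   # patid 5
EXAMPLE[7].update(death=1, dateofdeath=15526)   # patid 8
```

With the example data, two cases and ten participants give the 20 rows mentioned in the question.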

Sum amount base from previous value on slicer power bi

I have a table like this:
Year Month Code Amount
---------------------------------------
2017 11 a 7368
2017 11 b 3542
2017 12 a 4552
2017 12 b 7541
2018 1 a 6352
2018 1 b 8376
2018 2 a 1287
2018 2 b 3625
I built a slicer on Year and Month (ignoring Code), and I want to show the SUM of Amount for the month before the selected one:
If I select Year 2017 and Month 12 on the slicer, the value shown should be the SUM of Amount for 2017-11; selecting Year 2018 and Month 1 should show the SUM of Amount for 2017-12.
I tried the following as a test, but it is not allowed:
Last Month = CALCULATE(SUM(Table[Amount]); Table[Month] = SELECTEDVALUE(Table[Month]) - 1)
How to do it right?
I want something like this
NB: I use direct query to SQL Server
Update: At this far, I added Last_Amount column in SQL Server Table by sub-query, maybe you guys have a better way for my issue
The filters in a CALCULATE statement are only designed to take simple statements that don't have further calculations involved. There are a couple of possible remedies.
1. Use a variable to compute the previous month number before you use it in the CALCULATE function.
Last Month =
VAR PrevMonth = SELECTEDVALUE(Table[Month]) - 1
RETURN CALCULATE(SUM(Table[Amount]), Table[Month] = PrevMonth)
2. Use a FILTER() function. This is an iterator that allows more complex filtering.
Last Month = CALCULATE(SUM(Table[Amount]),
FILTER(ALL(Table),
Table[Month] = SELECTEDVALUE(Table[Month]) - 1))
Edit: Since you are using year and month, you need to have a special case for January.
Last Month =
VAR MonthFilter = MOD(SELECTEDVALUE(Table[Month]) - 2, 12) + 1
VAR YearFilter = IF(MonthFilter = 12,
SELECTEDVALUE(Table[Year]) - 1,
SELECTEDVALUE(Table[Year]))
RETURN CALCULATE(SUM(Table[Amount]),
Table[Month] = MonthFilter,
Table[Year] = YearFilter)
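The January special case relies on modular arithmetic: MOD(month - 2, 12) + 1 maps 1 to 12 and every other month to month - 1. The Python sketch below only illustrates that arithmetic on the sample table from the question; it is not DAX, and the names previous_period and last_month_amount are hypothetical.

```python
# Sample table from the question: (Year, Month, Code, Amount).
ROWS = [
    (2017, 11, "a", 7368), (2017, 11, "b", 3542),
    (2017, 12, "a", 4552), (2017, 12, "b", 7541),
    (2018, 1, "a", 6352), (2018, 1, "b", 8376),
    (2018, 2, "a", 1287), (2018, 2, "b", 3625),
]


def previous_period(year, month):
    """Return (year, month) of the calendar month before the given one."""
    prev_month = (month - 2) % 12 + 1  # 1 -> 12, otherwise month - 1
    # Landing on December means we crossed a year boundary.
    prev_year = year - 1 if prev_month == 12 else year
    return prev_year, prev_month


def last_month_amount(rows, year, month):
    """Sum Amount over all rows belonging to the previous month."""
    target = previous_period(year, month)
    return sum(amount for (y, m, _code, amount) in rows if (y, m) == target)
```

Selecting 2018 and month 1 therefore sums the 2017-12 rows, and selecting 2017 and month 12 sums the 2017-11 rows.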

Select max value from huge matrix of 30 years daily data

Let's say I have daily data for a 30-year period in a matrix. To keep it simple, assume it has only 1 column and 10957 rows, one per day for the 30 years. The data starts in 2010. I want to find the maximum value for every year, so that the output is 1 column and 30 rows. Is there an automated way to do this in MATLAB? Currently I'm doing it manually:
%for the first year
max(RAINFALL(1:365));
.
.
%for the 30th year
max(RAINFALL(10593:10957));
Doing it manually is exhausting, and I have quite a few similar data sets. I used the code below to calculate the mean and standard deviation for the 30 years. I tried to modify it for the task above but couldn't get it to work. I hope someone can modify the code or suggest a new approach.
data = rand(32872,100); % replace with your data matrix
[nDays,nData] = size(data);
% let MATLAB construct the vector of dates and worry about things like leap
% year.
dayFirst = datenum(2010,1,1);
dayStamp = dayFirst:(dayFirst + nDays - 1);
dayVec = datevec(dayStamp);
year = dayVec(:,1);
uniqueYear = unique(year);
K = length(uniqueYear);
a = nan(1,K);
b = nan(1,K);
for k = 1:K
% use logical indexing to pick out the year
currentYear = year == uniqueYear(k);
a(k) = mean2(data(currentYear,:));
b(k) = std2(data(currentYear,:));
end
One possible approach:
Create a column containing the year of each data value, using datenum and datevec to take care of leap years.
Find the maximum for each year, with accumarray.
Code:
%// Example data:
RAINFALL = rand(10957,1); %// one column
start_year = 2010; %// data starts on January 1st of this year
%// Computations:
[year, ~] = datevec(datenum(start_year,1,1) + (0:size(RAINFALL,1)-1)); %// step 1
result = accumarray(year.'-start_year+1, RAINFALL.', [], @max); %// step 2
As a bonus: if you change @max in step 2 to either @mean or @std, guess what you get... much simpler than your code.
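The same two steps (label each day with its calendar year, then reduce per year) can be sketched outside MATLAB as well. The Python version below is only an illustration of the grouping idea, using the standard library's date arithmetic to handle leap years; yearly_max is a hypothetical name, and the data is assumed to start on January 1st of start_year with one value per day.

```python
from datetime import date, timedelta


def yearly_max(values, start_year):
    """Return the maximum of `values` for each calendar year, in order.
    values[0] is assumed to belong to January 1st of start_year."""
    first_day = date(start_year, 1, 1)
    maxima = {}
    for offset, value in enumerate(values):
        # Step 1: find the calendar year of this day (leap years included).
        year = (first_day + timedelta(days=offset)).year
        # Step 2: keep the running maximum for that year.
        if year not in maxima or value > maxima[year]:
            maxima[year] = value
    return [maxima[year] for year in sorted(maxima)]
```

With 10957 daily values starting in 2010, this returns 30 entries, one per year from 2010 to 2039.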
This may help you:
RAINFALL = rand(1,10957); % - Your data here
firstYear = 2010;
numberOfYears = 4;
cum = 0; % - cumulative factor
yearlyData = zeros(1,numberOfYears); % - this isnt really necessary
for i = 1 : numberOfYears
yearLength = datenum(firstYear+i,1,1) - datenum(firstYear + i - 1,1,1);
yearlyData(i) = max(RAINFALL(1 + cum : yearLength + cum));
cum = cum + yearLength;
end

Changing indices and order in arrays

I have a struct mpc with the following structure:
num type col3 col4 ...
mpc.bus = 1 2 ... ...
2 2 ... ...
3 1 ... ...
4 3 ... ...
5 1 ... ...
10 2 ... ...
99 1 ... ...
to from col3 col4 ...
mpc.branch = 1 2 ... ...
1 3 ... ...
2 4 ... ...
10 5 ... ...
10 99 ... ...
What I need to do is:
1: Re-order the rows of mpc.bus so that all rows of type 1 come first, followed by type 2 and finally type 3. There is only one element of type 3, and there are no other types (4 / 5 etc.).
2: Make the numbering (column 1 of mpc.bus) consecutive, starting at 1.
3: Change the numbers in the to/from columns of mpc.branch to correspond to the new numbering in mpc.bus.
4: After running simulations, reverse the steps above to end up with the same order and numbering as above.
It is easy to update mpc.bus using find.
type_1 = find(mpc.bus(:,2) == 1);
type_2 = find(mpc.bus(:,2) == 2);
type_3 = find(mpc.bus(:,2) == 3);
mpc.bus(:,:) = mpc.bus([type_1; type_2; type_3],:);
mpc.bus(:,1) = 1:nb % Where nb is the number of rows of mpc.bus
The numbers in the to/from columns in mpc.branch corresponds to the numbers in column 1 in mpc.bus.
It's OK to update the numbers on the to, from columns of mpc.branch as well.
However, I'm not able to find a non-messy way of retracing my steps. Can I update the numbering using some simple commands?
For the record: I have deliberately not included my code for re-numbering mpc.branch, since I'm sure someone has a smarter, simpler solution (that will make it easier to redo when the simulations are finished).
Edit: It might be easier to create normal arrays (to avoid working with structs):
bus = mpc.bus;
branch = mpc.branch;
Edit #2: The order of things:
Re-order and re-number.
Columns (3:end) of bus and branch are changed. (Not part of this question)
Restore original order and indices.
Thanks!
I'm proposing this solution. It generates an n x 2 matrix (PAIRS), where n is the number of rows in mpc.bus, and a temporary copy of mpc.branch:
function [mpc_1, mpc_2, mpc_3] = minimal_example
mpc.bus = [ 1 2;...
2 2;...
3 1;...
4 3;...
5 1;...
10 2;...
99 1];
mpc.branch = [ 1 2;...
1 3;...
2 4;...
10 5;...
10 99];
mpc.bus = sortrows(mpc.bus,2);
mpc_1 = mpc;
mpc_tmp = mpc.branch;
for I=1:size(mpc.bus,1)
PAIRS(I,1) = I;
PAIRS(I,2) = mpc.bus(I,1);
mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = I;
mpc.bus(I,1) = I;
end
mpc_2 = mpc;
% (a) the following mpc_tmp is only needed if you want to truly reverse the operation
mpc_tmp = mpc.branch;
%
% do some stuff
%
for I=1:size(mpc.bus,1)
% (b) you can decide not to use the following line, then comment the line below (a)
mpc.branch(mpc_tmp(:,1:2)==mpc.bus(I,1)) = PAIRS(I,2);
mpc.bus(I,1) = PAIRS(I,2);
end
% uncomment the following line, if you commented (a) and (b) above:
% mpc.branch = mpc_tmp;
mpc.bus = sortrows(mpc.bus,1);
mpc_3 = mpc;
The minimal example above can be executed as is. The three outputs (mpc_1, mpc_2 & mpc_3) are just in place to demonstrate the workings of the code but are otherwise not necessary.
1.) mpc.bus is ordered using sortrows, simplifying the approach and not using find three times. It targets the second column of mpc.bus and sorts the remaining matrix accordingly.
2.) The original contents of mpc.branch are stored.
3.) A loop is used to replace the entries in the first column of mpc.bus with ascending numbers while at the same time replacing them correspondingly in mpc.branch. Here, the reference to mpc_tmp is necessary to ensure that elements are replaced correctly.
4.) Afterwards, mpc.branch can be reverted analogously to (3.) - here, one might argue, that if the original mpc.branch was stored earlier on, one could just copy the matrix. Also, the original values of mpc.bus are re-assigned.
5.) Now, sortrows is applied to mpc.bus again, this time with the first column as reference to restore the original format.
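The core of the approach is a lookup from old bus numbers to new ones, which can be inverted to undo the renumbering. The Python sketch below illustrates that idea on the two-column example data; it is not MATLAB and not the code above. The function names renumber and restore are hypothetical, and it assumes the original bus rows are in ascending order of bus number, as in the example.

```python
def renumber(bus, branch):
    """Sort bus rows by type, number them 1..n, and remap branch endpoints.
    Returns the new bus, the new branch, and the old -> new number mapping."""
    ordered = sorted(bus, key=lambda row: row[1])  # stable sort by type
    old_to_new = {row[0]: new for new, row in enumerate(ordered, start=1)}
    new_bus = [(old_to_new[num], typ) for num, typ in ordered]
    new_branch = [(old_to_new[a], old_to_new[b]) for a, b in branch]
    return new_bus, new_branch, old_to_new


def restore(bus, branch, old_to_new):
    """Undo renumber(): map numbers back and sort bus by original number."""
    new_to_old = {new: old for old, new in old_to_new.items()}
    old_bus = sorted((new_to_old[num], typ) for num, typ in bus)
    old_branch = [(new_to_old[a], new_to_old[b]) for a, b in branch]
    return old_bus, old_branch


BUS = [(1, 2), (2, 2), (3, 1), (4, 3), (5, 1), (10, 2), (99, 1)]
BRANCH = [(1, 2), (1, 3), (2, 4), (10, 5), (10, 99)]
```

Keeping the mapping around plays the same role as the PAIRS matrix: the simulation can change other columns freely, and the original numbering is recovered by applying the inverse mapping.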

Resources