When is join tree selectivity = 1?

I'm trying to understand some examples from a textbook regarding join trees, their cardinality, selectivity, and cost.
The cost function is given as follows:
The statistics for the example are
|R_1| = 10
|R_2| = 100
|R_3| = 1000
f_{1,2} = 0.1
f_{2,3} = 0.2
What trips me up is that they then say: assume f_{i,j} = 1 for all other combinations.
What does this say about those other combinations? Does this mean that joining R_1 and R_3 won't produce any results because they don't share any attributes? If they don't share any attributes, wouldn't that make the result an empty set?
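For reference, my understanding is that the book estimates join cardinality as
|R_i ⋈ R_j| = f_{i,j} · |R_i| · |R_j|
so R_2 ⋈ R_3 would be estimated at 0.2 · 100 · 1000 = 20,000 tuples, while f_{1,3} = 1 would put R_1 ⋈ R_3 at the full cross-product size 10 · 1000 = 10,000 tuples. That last estimate is what I can't square with my intuition that relations sharing no attributes should join to an empty set.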
I appreciate the help!

Related

Receiving different measured values from crossK and lohboot

I have a marked ppp dataset looking at crimes and their relation to locations.
I am performing an inhomogeneous cross-K using Kcross.inhom, and am using lohboot to bootstrap confidence intervals around the inhomogeneous cross-K. However, I am getting different values of the iso estimate from the two, when I would anticipate identical values.
The crime dataset is 26k rows, and I am unsure how to subset it to create a reproducible example.
library(sf)        # st_coordinates
library(spatstat)  # ppp, owin, rescale, ppm, Kcross.inhom, lohboot

#creating the ppp
crime.coords = as.data.frame(st_coordinates(crime))    #coordinates of crimes
center.coords = as.data.frame(st_coordinates(center))  #coordinates of locations
temp = rbind(data.frame(x=crime.coords$X, y=crime.coords$Y, type='crime'),
             data.frame(x=center.coords$X, y=center.coords$Y, type='center')) #df for marked ppp
temp = ppp(temp[,1], temp[,2], window=owin(border.coords),  #border.coords defined elsewhere
           marks=relevel(as.factor(temp$type), 'crime'))    #creating marked ppp
#creating an intensity model of the crimes
temp = rescale(temp, 10000) #rescaling for polynomial model coefficients
crime.ppp = unmark(split(temp)$crime)
model.crime = ppm(crime.ppp ~ polynom(x, y, 2), Poisson())
ck = Kcross.inhom(temp, i='crime', j='center', lambdaI=model.crime) #cross-K with intensity function
ckenv = lohboot(temp, fun='Kcross.inhom', i='crime', j='center',
                lambdaI=model.crime) #bootstrapped CIs for cross-K with intensity function
Here are the values plotted, showing different curves:
A few things I've noted are that the r values are different for the two functions, and setting the lohboot r does not in fact make them identical. I am unsure of where to go from here, having exhausted all my resources in finding a solution. Thank you in advance.
These curves are not guaranteed to be equal. lohboot subdivides the data, randomly resamples the subdivisions, computes the contributions from these randomly selected subdivisions, and averages them. If you repeat the experiment you should get a slightly different answer from lohboot each time. See the help file for lohboot.
It would be desirable for the two curves to be close. Unfortunately, the default behaviour of lohboot does not often achieve that. For consistency, the default behaviour follows the original implementation of the method, which was not very good. Try setting block = TRUE for better performance, e.g. lohboot(temp, fun='Kcross.inhom', i='crime', j='center', lambdaI=model.crime, block=TRUE). Also try the other options basicboot and Vcorrection.

Efficiently calculating weighted distance in MATLAB

Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically less than 1000 pairs). Within the grander scheme of the program I am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be euclidean OR city-block.
The dimensions of the data are weighted.
As far as I can tell, no solution to this particular problem has been posted. The statistics toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.
Ideally, I would like to avoid using functions from the statistics toolbox (I am not certain that the user of the function will have access to those toolboxes).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
% format weights for multiplication
wts = repmat(wts,[numA,1,numB]);
% get featural differences between A and B pairs
A = repmat(A,[1 1 numB]);
B = repmat(permute(B,[3,2,1]),[numA,1,1]);
differences = abs(A-B).^r;
% weigh difference values before combining them
differences = differences.*wts;
differences = differences.^(1/r);
% combine features to get distance
D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
% use for-loops to generate differences
D = zeros(numA,numB);
for i = 1:numA
    for j = 1:numB
        differences = abs(A(i,:) - B(j,:)).^r; % .^r (not .^(1/r)), matching pairdist1
        differences = differences.*wts;
        differences = differences.^(1/r);
        D(i,j) = sum(differences,2);
    end
end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
It's clear that the repmat-and-permute version runs much more quickly than the double-for-loop version, at least for smaller datasets. However, I also know that calls to repmat can often slow things down. So I am wondering if anyone in the SO community has any advice to offer to improve the efficiency of either function!
EDIT
@Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on github [link]. I ended up finding a pretty good vectorized method for computing euclidean distance [link], so I use that method in the euclidean case, and I took @Divakar's advice for city-block. It is still not as fast as pdist2, but it's much faster than either of the approaches I laid out earlier in this post, and it easily accepts weightings.
You can replace repmat with bsxfun. Doing so avoids explicit repetition, so it's more memory-efficient and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
if strcmp(distancemetric,'cityblock')
    r = 1;
elseif strcmp(distancemetric,'euclidean')
    r = 2;
else
    error('Function only accepts "cityblock" and "euclidean" distance')
end
differences = abs(bsxfun(@minus, A, permute(B, [3 2 1]))).^r;
differences = bsxfun(@times, differences, wts).^(1/r);
D = permute(sum(differences,2),[1,3,2]);
end
For r = 1 (the "cityblock" case), you can use bsxfun to get elementwise subtractions and then use matrix multiplication, which should speed things up. The implementation would look something like this -
%// Calculate absolute elementwise subtractions
absm = abs(bsxfun(@minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplication with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);

I really can't figure out where to start

Using the nine digits 1 to 9, kept in that order, you should find the number of ways to get N using multiplication and addition (adjacent digits may also be combined into multi-digit numbers such as 12).
For example, if 100 is given, you would answer 7.
The reason is that there are 7 possible ways.
100 = 1*2*3*4+5+6+7*8+9
100 = 1*2*3+4+5+6+7+8*9
100 = 1+2+3+4+5+6+7+8*9
100 = 12+3*4+5+6+7*8+9
100 = 1+2*3+4+5+67+8+9
100 = 1*2+34+5+6*7+8+9
100 = 12+34+5*6+7+8+9
If this question is given to you, how would you start?
Are we allowed to use parentheses? That would expand the number of possibilities by a lot.
I would try to find the first additive term, say 1×23. There are a limited number of those, and since we can't subtract, we know that any term above our target can be pruned from the search. That leaves us looking for the solution to 23 + f = 100, where f is another formula of exactly the same form. But that is exactly the same as solving the original problem for the digits 4–9 and target 77! So call your algorithm recursively and add the solutions of that subproblem to the solutions of the original problem. That is, if we have 23 + 4, are there any solutions to the subproblem with digits 5–9 and target 73? Divide and conquer.
You might benefit from a dynamic table of partial solutions, since it's possible you might get the same subproblem in different ways: 1+2+3 = 1×2×3, so solving the subproblem with numbers 4–9 and target 94 twice duplicates work.
You are probably better off going from right to left than from left to right, on the principle of most-constrained first. 89, 8×9, or 78+9 leave much less room for possible solutions than 1+2+3, 1×2×3, 12×3, 12+3 or 1×23.
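Here is a minimal Python sketch of that recursion (all names are mine, and it omits the memoization and right-to-left ordering suggested above):
def term_values(s):
    # all values a product of combined chunks of the digit string s can take,
    # e.g. "123" -> 1*2*3, 12*3, 1*23, 123
    vals = []
    for i in range(1, len(s) + 1):
        head, rest = int(s[:i]), s[i:]
        if not rest:
            vals.append(head)
        else:
            vals.extend(head * v for v in term_values(rest))
    return vals

def count_ways(s, target):
    # count expressions over the digit string s (+, * and combining) equal to target
    if not s:
        return 1 if target == 0 else 0
    total = 0
    for i in range(1, len(s) + 1):        # the first additive term uses digits s[:i]
        for v in term_values(s[:i]):
            if v <= target:               # no subtraction, so overshoots are pruned
                total += count_ways(s[i:], target - v)
    return total

print(count_ways("123456789", 100))  # prints 7, matching the list above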
There are three possible operations
addition
multiplication
combine, for example combine 1 and 2 to make 12
There are 8 gaps between the digits, and each gap takes one of the three operators. Hence, there are a total of 3^8 = 6561 possible equations. So I would start with
for ( i = 0; i < 6561; i++ )
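Fleshing that loop out in Python rather than C-style pseudocode (a sketch; eval is safe here because the generated strings contain only digits, '+' and '*'):
from itertools import product

DIGITS = "123456789"

def count_by_enumeration(target):
    count = 0
    # one choice of operator ('' = combine, '+', '*') for each of the 8 gaps
    for ops in product(("", "+", "*"), repeat=len(DIGITS) - 1):
        expr = DIGITS[0]
        for op, digit in zip(ops, DIGITS[1:]):
            expr += op + digit  # '' simply concatenates: '1' + '' + '2' -> '12'
        if eval(expr) == target:
            count += 1
    return count

print(count_by_enumeration(100))  # prints 7, matching the expressions above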

Genetic algorithm - shift planning

I have some basic genetic algorithm knowledge; I've programmed a simple application that finds the X maximizing the value of some function. What I'm struggling with now is what the chromosome, individual, population etc. should look like for more complex problems like shift planning. Let's say we have some employees and some shifts, and we want to assign them to each other. What should the key parts of the genetic algorithm look like to make it work for such data?
Let me make a few assumptions to showcase an example of how you can make a genetic algorithm model for this problem.
Assumptions
Let there be n employees labelled e_1, e_2, ..., e_n and n shifts labelled s_1, s_2, ..., s_n
Let n be an even number for simplicity of explanation
Chromosome of an individual
Let each chromosome consist of n 2D tuples. Each tuple is a pair of a shift and an employee, (s_x, e_y); the n tuples together are a mapping of all the employees to the shifts. Thus a chromosome will look something like this:
{ {s_x1, e_y1}, {s_x2, e_y2}, ..., {s_xn, e_yn} }
where x_i, y_i ∈ {1, 2, ..., n} for all i,
s_xi != s_xj for i != j,
e_yi != e_yj for i != j
Population
Depending on n we can have a population of D individuals, each with the above chromosome configuration. You can start by randomizing the chromosome configurations of the D organisms (although there can be better ways to do this).
Reproduction
Given a generation of D individuals pick any two individuals, say d_i and d_j, for crossover. Let us obtain 2 children in the next generation, say c_i and c_j. It should look like this (considering n as 4 for simplicity):
d_i = { {s_i1, e_i1}, {s_i2, e_i2}, {s_i3, e_i3}, {s_i4, e_i4} }
d_j = { {s_j1, e_j1}, {s_j2, e_j2}, {s_j3, e_j3}, {s_j4, e_j4} }
crosses over to reproduce,
c_i = { {s_i1, e_j3}, {s_i2, e_j2}, {s_i3, e_j1}, {s_i4, e_j4} }
c_j = { {s_j1, e_i3}, {s_j2, e_i2}, {s_j3, e_i1}, {s_j4, e_i4} }
How you do this computationally for larger n is something I will let you think about.
I also have some ideas on how one can apply mutation to the model, but I will let you think that through as well (moreover this is just an example model to get you started).
Thoughts on a fitness function
Let's say you have a fitness function called employee satisfaction, which is a summation of the happiness (say an integer between -10 and 10) of all employees within an individual of the population.
Now, say that when an employee (e_1) is given a certain shift (s_1) his happiness is -4, but when given another shift (s_4) his happiness is 10.
Then the fitness (employee satisfaction) of an individual in the population can simply be the sum of the happiness of all n employees (there can be more complicated mathematical functions for this). The ideal best-fitness scenario would be that every employee has happiness 10, so employee satisfaction sums to n × 10; the worst fitness (least employee satisfaction) is n × -10.
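To make this concrete, here is a minimal Python sketch under the assumptions above; the permutation encoding, the made-up happiness table, and order crossover are my own illustrative choices, not the only valid model:
import random

def random_chromosome(n):
    # shift s is worked by employee chromosome[s]; a permutation makes the
    # "all shifts distinct, all employees distinct" constraints hold by construction
    perm = list(range(n))
    random.shuffle(perm)
    return perm

def fitness(chromosome, happiness):
    # employee satisfaction: sum of happiness[e][s] over all assignments,
    # where each happiness value is an integer in [-10, 10]
    return sum(happiness[e][s] for s, e in enumerate(chromosome))

def order_crossover(p1, p2):
    # copy a random slice from parent 1, then fill the remaining slots with
    # parent 2's employees in their original order; the child stays a permutation
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]
    kept = set(p1[a:b])
    fill = (e for e in p2 if e not in kept)
    return [c if c is not None else next(fill) for c in child]

# tiny usage example: n = 4 employees/shifts and a made-up happiness table
n = 4
happiness = [[random.randint(-10, 10) for _ in range(n)] for _ in range(n)]
population = [random_chromosome(n) for _ in range(10)]
best = max(population, key=lambda c: fitness(c, happiness))
Encoding the chromosome as a permutation means the distinctness constraints cannot be violated, which is also why the crossover must be permutation-preserving: a naive position-wise swap of employees between parents could assign the same employee to two shifts.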

How to create initial population and obtain best solution

This would be a trivial question for the pros, but it is difficult for a newbie like me.
Evolutionary algorithm - I am trying to generate the initial population for a system with a given objective function and 3 bound constraints.
The issue is that I don't know which solver to use or how to write the corresponding code. I have gone through some of the MATLAB documentation but cannot find much info.
Any help or procedure whatsoever would be appreciated.
OBJECTIVE FUNCTION:
ZETA(i) = -SIG(i) ./(sqrt(SIG(i).^2 + OMG(i).^2));
Constraints:
1 <= K <= 50
0.1 <= T1 <= 1
0.01 <= T2 <= 0.2
K, T1, T2 are the variables used in calculating ZETA, SIG, and OMG, which are MATRICES.
I have created a function that does the matrix computation to obtain the real and imaginary parts of the matrix, which are SIG and OMG. Now I am stuck on how to proceed to creating an EA.
Thanks.
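Not knowing which solver you have in mind, here is a minimal NumPy sketch (my own variable names; pop_size is an arbitrary choice) of drawing a random initial population inside your three box constraints, with each row being one candidate (K, T1, T2):
import numpy as np

rng = np.random.default_rng(seed=42)

pop_size = 50
lower = np.array([1.0, 0.1, 0.01])  # lower bounds for K, T1, T2
upper = np.array([50.0, 1.0, 0.2])  # upper bounds for K, T1, T2

# uniform sampling inside the box constraints; each row is one individual
population = lower + rng.random((pop_size, 3)) * (upper - lower)

# fitness evaluation would compute SIG and OMG from each candidate and then
# ZETA = -SIG / np.sqrt(SIG**2 + OMG**2), as in the objective function above
for K, T1, T2 in population:
    pass  # call your matrix computation with (K, T1, T2) here
If you do have MATLAB's Global Optimization Toolbox, its ga solver accepts lower and upper bound vectors directly (the lb and ub arguments), so box constraints like these need no special handling there.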
