I have up to 16 datasets (only 8 are used in the example here), and I'm trying to sort them into groups of 4, where the datasets within each group are as closely matched as possible (using a VBA macro in Excel).
My plan was to iterate through every possible combination of groups of 4, compare how well matched each candidate is against the previous "best match", and overwrite that best match whenever a candidate beats it.
Comparing how well matched the groups are isn't the problem; the problem is that the code below won't iterate through every possible combination.
My question is: why doesn't this code work? And if there is a better approach, please let me know.
For a = 1 To UBound(Whitelist) - 3
    For b = a + 1 To UBound(Whitelist) - 2
        For c = b + 1 To UBound(Whitelist) - 1
            For d = c + 1 To UBound(Whitelist)
                TempGroups(1, 1) = a: TempGroups(1, 2) = b: TempGroups(1, 3) = c: TempGroups(1, 4) = d
                For e = 1 To UBound(Whitelist) - 3
                    If InArray(TempGroups, e) = False Then
                        For f = e + 1 To UBound(Whitelist) - 2
                            If InArray(TempGroups, f) = False Then
                                For g = f + 1 To UBound(Whitelist) - 1
                                    If InArray(TempGroups, g) = False Then
                                        For h = g + 1 To UBound(Whitelist)
                                            If InArray(TempGroups, h) = False Then
                                                TempGroups(2, 1) = e: TempGroups(2, 2) = f: TempGroups(2, 3) = g: TempGroups(2, 4) = h
                                                If HowClose(Differences, TempGroups, 1) + HowClose(Differences, TempGroups, 2) < HowClose(Differences, Groups, 1) + HowClose(Differences, Groups, 2) Then
                                                    For x = 1 To 4
                                                        For y = 1 To 4
                                                            Groups(x, y) = TempGroups(x, y)
                                                        Next y
                                                    Next x
                                                End If
                                            End If
                                        Next h
                                    End If
                                Next g
                            End If
                        Next f
                    End If
                Next e
            Next d
        Next c
    Next b
Next a
For reference, UBound(Whitelist) can be taken as 8 (the number of datasets I have to match).
TempGroups is the array I write each candidate grouping to, so that it can be compared to...
Groups, the array which will contain the data sorted into the best-matched groups.
The InArray function checks whether a value is already allocated to a group, as each dataset can only be in one group.
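The InArray helper isn't shown in the post; for completeness, a minimal sketch of what such a function might look like (an assumption of its shape, not the poster's actual code; TempGroups is assumed to be a 2-D array of numeric indices):

Function InArray(arr As Variant, v As Long) As Boolean
    'Returns True if v already appears anywhere in the 2-D array arr
    Dim i As Long, j As Long
    For i = LBound(arr, 1) To UBound(arr, 1)
        For j = LBound(arr, 2) To UBound(arr, 2)
            If arr(i, j) = v Then
                InArray = True
                Exit Function
            End If
        Next j
    Next i
    InArray = False
End Function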
Thanks in advance!
Images: Datasets; Relatively Well Matched Data; Fairly Poorly Matched Data
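As an aside on the enumeration itself: with 8 datasets, the second group of 4 is completely determined by the first (it is simply the four leftover indices), so the inner e-h loops and the InArray checks aren't strictly necessary. A sketch of that shortcut, assuming the same TempGroups/Groups/HowClose names described above:

'Sketch: enumerate each possible first group; the second group is the
'complement, so no inner loops or InArray checks are needed.
Dim used(1 To 8) As Boolean, m As Long, n As Long
For a = 1 To 5
    For b = a + 1 To 6
        For c = b + 1 To 7
            For d = c + 1 To 8
                Erase used
                used(a) = True: used(b) = True: used(c) = True: used(d) = True
                TempGroups(1, 1) = a: TempGroups(1, 2) = b
                TempGroups(1, 3) = c: TempGroups(1, 4) = d
                n = 0
                For m = 1 To 8
                    If Not used(m) Then
                        n = n + 1
                        TempGroups(2, n) = m
                    End If
                Next m
                'score TempGroups against Groups exactly as before
            Next d
        Next c
    Next b
Next a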
I have VBA code that generates a transition matrix from data in Excel. I now have access to a huge SQL DB with ~2MM rows of information and would like to generate the same kind of transition matrix from it. In any case, here is the VBA code:
Option Explicit
Function COHORT(id, dat, rat, _
                Optional classes As Integer, Optional ystart, Optional yend)
'Function is written for data sorted according to issuers and rating dates (ascending),
'rating classes numbered from 1 to classes, last rating class = default, not rated has number "0"
    If IsMissing(ystart) Then ystart = Year(Application.WorksheetFunction.Min(dat))
    If IsMissing(yend) Then yend = Year(Application.WorksheetFunction.Max(dat)) - 1
    If classes = 0 Then classes = Application.WorksheetFunction.Max(rat)
    Dim obs As Long, k As Long, kn As Long, i As Integer, j As Integer, t As Integer
    Dim ni() As Long, nij() As Long, pij() As Double, newrat As Integer
    ReDim nij(1 To classes - 1, 0 To classes), ni(0 To classes)
    obs = id.Rows.Count
    For k = 1 To obs
        'Earliest cohort to which observation can belong is from year:
        t = Application.Max(ystart, Year(dat(k)))
        'Loop through cohorts to which observation k can belong
        Do While t < yend
            'Is there another rating from the same year?
            If id(k + 1) = id(k) And Year(dat(k + 1)) <= t And k <> obs Then
                Exit Do
            End If
            'Is the issuer in default or not rated?
            If rat(k) = classes Or rat(k) = 0 Then Exit Do
            'Add to number of issuers in cohort
            ni(rat(k)) = ni(rat(k)) + 1
            'Determine rating from end of next year (=y+1)
            'rating stayed constant
            If id(k + 1) <> id(k) Or Year(dat(k + 1)) > t + 1 Or k = obs Then
                newrat = rat(k)
            'rating changed
            Else
                kn = k + 1
                Do While Year(dat(kn + 1)) = Year(dat(kn)) And id(kn + 1) = id(kn)
                    If rat(kn) = classes Then Exit Do 'Default is absorbing!
                    kn = kn + 1
                Loop
                newrat = rat(kn)
            End If
            'Add to number of transitions
            nij(rat(k), newrat) = nij(rat(k), newrat) + 1
            'Exit if observation k cannot belong to cohort of y+1
            If newrat <> rat(k) Then Exit Do
            t = t + 1
        Loop
    Next k
    ReDim pij(1 To classes - 1, 1 To classes + 1)
    'Compute transition frequencies pij = Nij/Ni
    For i = 1 To classes - 1
        For j = 1 To classes
            If ni(i) > 0 Then pij(i, j) = nij(i, j) / ni(i)
        Next j
    Next i
    'NR category to the end
    For i = 1 To classes - 1
        If ni(i) > 0 Then pij(i, classes + 1) = nij(i, 0) / ni(i)
    Next i
    COHORT = pij
End Function
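For context on usage: COHORT returns a matrix, so it would typically be entered in Excel as an array formula over a range of (classes - 1) rows by (classes + 1) columns, with the three arguments given as worksheet ranges, e.g. (ranges assumed) =COHORT(A2:A1000, B2:B1000, C2:C1000).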
I am very new to SQL Server; can anyone help convert this to SQL? I made an attempt based on searching the internet, but it didn't come close:
create table trans as
select a.group as from, b.group as to, count(*) as nTrans
from haveRanks as a inner join haveRanks as b
on a.ID=b.ID and a.Date+1 = b.Date
group by a.group, b.group;
create table probs as
select from, to, nTrans/sum(nTrans) as prob
from trans
group by from;
My data looks something like this:
ID Date Value
1 1/10/14 5
1 5/10/14 5
1 6/23/16 7
2 3/10/00 12
2 6/10/01 4
Edit: answering questions from the comments:
1) Correct: if a highest rating is not supplied, the program takes the maximum number as the "default" class.
2) If there is no rating for the year, the rating has not changed.
3) Not sure I understand this question. The Do While t < yend loop runs through the cohorts an observation can belong to and checks whether the next observations fall in the same cohort; if the issuer is in default/Not Rated, it kicks out and goes to the next one.
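On the SQL side, here is only a simplified sketch as a starting point, not a translation of the full cohort logic above (it pairs each issuer's last rating in a year with the next year's last rating, and does not carry ratings across gap years or treat default as absorbing). The table and column names — Ratings(ID, RatingDate, Rating) — are assumptions:

-- Simplified year-end to year-end transition matrix in T-SQL (names assumed)
WITH YearEnd AS (
    SELECT ID, YEAR(RatingDate) AS yr, Rating,
           ROW_NUMBER() OVER (PARTITION BY ID, YEAR(RatingDate)
                              ORDER BY RatingDate DESC) AS rn
    FROM Ratings
),
Pairs AS (
    SELECT a.Rating AS FromRating, b.Rating AS ToRating
    FROM YearEnd AS a
    JOIN YearEnd AS b
      ON a.ID = b.ID AND b.yr = a.yr + 1
    WHERE a.rn = 1 AND b.rn = 1
)
SELECT FromRating, ToRating,
       COUNT(*) AS nTrans,
       1.0 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY FromRating) AS prob
FROM Pairs
GROUP BY FromRating, ToRating;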
Suppose I have two arrays ordered in an ascending order, i.e.:
A = [1 5 7], B = [1 2 3 6 9 10]
I would like to create from B a new vector B' that contains, for each element of A, the closest value from B (one for each).
I also need the indexes. So, in my example I would like to get:
B' = [1 6 9], Idx = [1 4 5]
Note that the third value is 9: indeed, 6 is closer to 7, but it is already 'taken' since it was matched to 5.
Any idea for suitable code?
Note: my true arrays are much larger and contain real (not integer) values. Also, it is given that B is longer than A.
Thanks!
Assuming you want to minimize the overall discrepancies between elements of A and matched elements in B, the problem can be written as an assignment problem of assigning to every row (element of A) a column (element of B) given a cost matrix C. The Hungarian (or Munkres') algorithm solves the assignment problem.
I assume that you want to minimize cumulative squared distance between A and matched elements in B, and use the function [assignment,cost] = munkres(costMat) by Yi Cao from https://www.mathworks.com/matlabcentral/fileexchange/20652-hungarian-algorithm-for-linear-assignment-problems--v2-3-:
A = [1 5 7];
B = [1 2 3 6 9 10];
[Bprime,matches] = matching(A,B)
function [Bprime, matches] = matching(A, B)
    C = (repmat(A', 1, length(B)) - repmat(B, length(A), 1)).^2;
    [matches, ~] = munkres(C);
    Bprime = B(matches);
end
Assuming instead you want to find matches recursively, as suggested by your question, you could either walk through A, for each element in A find the closest remaining element in B and discard it (sortedmatching below); or you could iteratively form and discard the distance-minimizing match between remaining elements in A and B until all elements in A are matched (greedymatching):
A = [1 5 7];
B = [1 2 3 6 9 10];
[~,~,Bprime,matches] = sortedmatching(A,B,[],[])
[~,~,Bprime,matches] = greedymatching(A,B,[],[])
function [A, B, Bprime, matches] = sortedmatching(A, B, Bprime, matches)
    [~, ix] = min((A(1) - B).^2);
    matches = [matches ix];
    Bprime = [Bprime B(ix)];
    A = A(2:end);
    B(ix) = Inf;
    if not(isempty(A))
        [A, B, Bprime, matches] = sortedmatching(A, B, Bprime, matches);
    end
end

function [A, B, Bprime, matches] = greedymatching(A, B, Bprime, matches)
    C = (repmat(A', 1, length(B)) - repmat(B, length(A), 1)).^2;
    [minrows, ixrows] = min(C);
    [~, ixcol] = min(minrows);
    ixrow = ixrows(ixcol);
    matches(ixrow) = ixcol;
    Bprime(ixrow) = B(ixcol);
    A(ixrow) = -Inf;
    B(ixcol) = Inf;
    if max(A) > -Inf
        [A, B, Bprime, matches] = greedymatching(A, B, Bprime, matches);
    end
end
While producing the same results in your example, all three methods potentially give different answers on the same data.
Normally I would run screaming from for and while loops in Matlab, but in this case I cannot see how the solution could be vectorized. At least it is O(N) (or near enough, depending on how many equally-close matches to each A(i) there are in B). It would be pretty simple to code the following in C and compile it into a mex file, to make it run at optimal speed, but here's a pure-Matlab solution:
function [out, ind] = greedy_nearest(A, B)
if nargin < 1, A = [1 5 7]; end
if nargin < 2, B = [1 2 3 6 9 10]; end
ind = A * 0;
walk = 1;
for i = 1:numel(A)
    match = 0;
    lastDelta = inf;
    while walk <= numel(B)
        delta = abs(B(walk) - A(i));
        if delta < lastDelta, match = walk; end
        if delta > lastDelta, break, end
        lastDelta = delta;
        walk = walk + 1;
    end
    ind(i) = match;
    walk = match + 1;
end
out = B(ind);
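Called with the arrays from the question (which are also the defaults wired into the nargin checks), this returns the expected matches:

[out, ind] = greedy_nearest([1 5 7], [1 2 3 6 9 10])
% out = [1 6 9], ind = [1 4 5]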
You could first compute the absolute distance from each value in A to each value in B, sort each column of distances, and then, looking down each column, take the first index that has not already been used.
% Distance from each value in A to each value in B, sorted by closeness
[~, minIdx] = sort(abs(bsxfun(@minus, A, B.')));
% Looking down each column, take the first index not already used
idx = zeros(size(A));
for iCol = 1:numel(A)
    for iRow = 1:iCol
        if ~ismember(minIdx(iRow,iCol), idx)
            idx(iCol) = minIdx(iRow,iCol);
            break
        end
    end
end
The resulting idx, and the result of applying it to B:
>> idx
1 4 5
>> B(idx)
1 6 9
I'm trying to build an array for a school project. The scenario: you have a city that is 30 miles (y) by 20 miles (x) with roads every mile, and you have to write code that gives the best location for a distribution center based on the business locations and the cost to deliver goods to each.
So far I have code that records the number of client businesses, an array that records the locations of the businesses, and an array that stores the volume of product each client buys.
Where I'm stuck: I think I should build an array that is 0 to 30 by 0 to 20 (the size of the city), but I have to evaluate the cost at a user-defined resolution (0.25, 0.5, 1, or 2 miles), so the array must be able to store the calculated values for 120 by 80 cells.
Not sure if that makes sense; here are the requirements:
Specifically, your program should be able to complete the following tasks:
Read the customer data, and user input for analysis resolution. Your program must be designed to accommodate changes in the number of customers, their locations, and their product delivery volumes. User options for analysis resolution will be: 0.25 miles, 0.5 miles, 1 mile, and 2 miles.
Validate that user input is numeric and valid.
Analyze the costs at all possible distribution center locations (which may be collocated with customer locations), and determine the optimums.
Display the X and Y values of the optimum distribution center locations and the corresponding minimum weekly cost. There may be multiple locations with the same minimum cost.
Display the costs at all locations adjacent to the optimums, in order to show the sensitivity of the result.
The formulas to use are:
Distance to customer i: Di = |x - xi| + |y - yi|
Cost for customer i: Ci = 1/2 * Di * Vi * Ft, where Vi is the volume of the customer's product
Ft = 0.03162*y + 0.04213*x + 0.4462
cost = sum of Ci for i = 1 to the number of clients
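To make the formulas concrete with assumed numbers: a candidate site at (x, y) = (10, 5) and a customer at (xi, yi) = (12, 8) buying Vi = 100 units gives Di = |10 - 12| + |5 - 8| = 5, Ft = 0.03162*5 + 0.04213*10 + 0.4462 = 1.0256, and Ci = 1/2 * 5 * 100 * 1.0256 = 256.4.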
Below is my code so far. I started by having it build the array and display it on a second sheet so I can visualize it, though I can't have that in the final product. In debugging, it gets to the di = line and gives me a "subscript out of range" error.
Option Explicit
Option Base 1
Sub load()
    Dim Cust!, x#, y#, volume#(), n!, loc#(), i%, vol#, xi#, ci#, di#, j#
    Dim choices#, val#(), nptsx!, nptsy!, Ft#, lowx!, lowy!, low!, M!
    Dim costmatrix#(), npts!, cost!()
    'find number of customers
    Cust = 0
    Do While Cells(8 + Cust, 1).Value <> ""
        Cust = Cust + 1
    Loop
    If Cust < 2 Then
        MsgBox "number of customers must be greater than 1."
        Exit Sub
    End If
    'establish array of customer locations
    ReDim loc#(1, Cust)
    For j = 1 To Cust
        x = Cells(7 + j, 2)
        y = Cells(7 + j, 3)
    Next j
    ReDim volume#(Cust, 1)
    For i = 1 To Cust
        vol = Cells(7 + i, 4)
    Next i
    choices = Cells(3, 7).Value
    nptsx = 30 / choices + 1
    nptsy = 20 / choices + 1
    '30x20 grid
    ReDim costmatrix(x To nptsx, y To nptsy)
    For x = 0 To nptsx - 1
        Sheet3.Cells(1, x + 2) = x * choices
    Next x
    For y = 0 To nptsy - 1
        Sheet3.Cells(2 + y, 1) = y * choices
    Next y
    For x = 0 To nptsx - 1
        For y = 0 To nptsy - 1
            For k = 1 To Cust
                di = Abs(x * choices - Sheet1.Cells(7 + j, 2)) + Abs(y * choices - Sheet1.Cells(7 + j, 3))
                Ft = 0.03162 * Sheet1.Cells(7 + j, 3) + 0.4213 * Sheet1.Cells(7 + j, 2) + 0.4462
                ci = 1 / 2 * di * vol * Ft
                Sheet3.Cells(x + 2, 2 + y) = ci
            Next k
        Next y
    Next x
    lowx = x
    lowy = y
    Range("I9:J:3").Clear
    Range("I9") = "optimum"
    Range("J9") = lowx * choices
    Range("K9") = lowy * choices
    Range("L9") = low
    i = 9
    If lowy < npts - 1 Then
        i = i + 1
        Cells(1, "I") = "Increment North"
        Cells(1, "L") = cost(lowx, lowy + 1)
    End If
    If lowy > 0 Then
        i = i + 1
        Cells(1, "I") = "Increment South"
        Cells(1, "L") = cost(lowx, lowy - 1)
    End If
    If lowx < npts - 1 Then
        i = i + 1
        Cells(1, "I") = "Increment East"
        Cells(1, "L") = cost(lowx, lowy + 1)
    End If
    If lowx > 0 Then
        i = i + 1
        Cells(1, "I") = "Increment West"
        Cells(1, "L") = cost(lowx, lowy - 1)
    End If
End Sub
Update: I have built the array, but I need to figure out how to do the calculations for all clients for one cell at a time, adding the results for each client together, putting the sum into the cell, and then moving on to the next cell.
When you dimension loc# via
ReDim loc#(Cust, 2)
The first index must be between 1 and Cust
Then you have the loop
For k = 1 To Cust
    x = Cells(7 + k, 2)
    y = Cells(7 + k, 3)
Next k
After this loop runs, k has value Cust + 1, not Cust, since for-loops in VBA first increment the counter and then test whether it has exceeded the limit.
You don't use k again until the line
di = Abs(Sheet3.Cells(1, x + 2) - loc(k, 1))
At this stage k is Cust + 1 -- which is one more than the highest permissible subscript, hence the subscript out of range error.
In context, I think that you meant to use j rather than k in both that line and the next line. I don't know if your code has other issues, but getting rid of k in those lines should help.
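On the update about summing over all clients for one cell at a time: a minimal sketch of that inner loop, assuming loc is dimensioned (1 To Cust, 1 To 2) and volume (1 To Cust, 1 To 1) and both are actually filled inside the read loops, and reading the x and y in the Ft formula as the candidate location's coordinates (adjust if they are meant to be the customer's):

Dim total As Double, k As Long  'k must be declared under Option Explicit
ReDim costmatrix(0 To nptsx - 1, 0 To nptsy - 1)
For x = 0 To nptsx - 1
    For y = 0 To nptsy - 1
        total = 0
        For k = 1 To Cust
            'Manhattan distance from candidate point to customer k
            di = Abs(x * choices - loc(k, 1)) + Abs(y * choices - loc(k, 2))
            Ft = 0.03162 * (y * choices) + 0.04213 * (x * choices) + 0.4462
            total = total + 0.5 * di * volume(k, 1) * Ft
        Next k
        costmatrix(x, y) = total
    Next y
Next x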
I have two cell arrays, "celldata" and "data", both of which store strings. I would like to check whether each element of "celldata" appears in each element of "data". For example, with celldata = {'AB'; 'BE'; 'BC'} and data = {'ABCD' 'BCDE' 'ACBE' 'ADEBC '}, the expected output would be s = 3 and v = 1 for AB, s = 2 and v = 2 for BE, and s = 2 and v = 2 for BC, because I just need to count the occurrences of each character sequence from 'celldata'.
The code I wrote is shown below. Any help would certainly be appreciated.
My code:
s = 0;  % support counter
v = 0;  % violate counter
SV = []; % array to store the support
VV = []; % array to store the violate
pairs = ['AB'; 'BE'; 'BC']
%celldata = cellstr(pairs)
celldata = {'AB'; 'BE'; 'BC'}
data = {'ABCD' 'BCDE' 'ACBE' 'ADEBC '} % 3 AB, 2 BE, 2 BC
for jj = 1:length(data)
    for kk = 1:length(celldata)
        res = regexp(data(jj), celldata(kk))
        m = cell2mat(res);
        e = isempty(m) % check whether the match result is empty
        if e == 0
            s = s + 1;
            SV(jj) = s;
            v = v;
        else
            s = s;
            v = v + 1;
            VV(jj) = v;
        end
    end
end
If I am understanding your variables correctly, s is the number of cells in which the substring ('AB', 'BE' or 'BC') does not appear and v is the number in which it does. If this is accurate, then
v = cellfun(@(x) length(cell2mat(strfind(data, x))), celldata);
s = numel(data) - v;
gives
v = [1;1;3];
s = [3;3;1];
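If "sequence" is instead meant as an ordered subsequence rather than a contiguous substring — which would explain the expected count of 3 for AB — a regexp with a gap between the two letters can be used instead of strfind. A sketch (it reproduces the stated counts for AB and BE, i.e. 3 and 2, but finds 3 cells rather than 2 for BC, since 'ABCD', 'BCDE' and 'ADEBC ' all contain B followed by C):

data = {'ABCD' 'BCDE' 'ACBE' 'ADEBC '};
celldata = {'AB'; 'BE'; 'BC'};
% support: cells containing the pair's letters in order, gaps allowed
s = cellfun(@(p) nnz(~cellfun(@isempty, ...
        regexp(data, [p(1) '.*' p(2)], 'once'))), celldata);
v = numel(data) - s;   % violations: cells without the pair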
I am trying to create two data sets. The first summarizes the data by two groups, which I have done using the following code:
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)
aggregate(x, list(g1, g2), mean)
The second needs to summarize the data by the first group and NOT the second group.
If we consider the possible pairs from the previous example:
A - X B - X C - X
A - Y B - Y C - Y
A - Z B - Z C - Z
The second dataset should summarize the data as the average of the outgroup:
A - not X
A - not Y
A - not Z etc.
Is there a way to manipulate aggregate functions in R to achieve this?
Or I also thought there could be dummy variable that could represent the data in this way, although I am unsure how it would look.
I have found this answer here:
R using aggregate to find a function (mean) for "all other"
I think this indicates that a dummy variable for each pairing is necessary. However, if anyone can offer a better or more efficient way, that would be appreciated, as there are many pairings in the true data set.
Thanks in advance
First let us generate the data reproducibly (using set.seed):
# same as question but added set.seed for reproducibility
set.seed(123)
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)
Now we have two solutions both of which use aggregate:
1) ave
# x equals the sums over the groups and n equals the counts
ag = cbind(aggregate(x, list(g1, g2), sum),
           n = aggregate(x, list(g1, g2), length)[, 3])
ave.not <- function(x, g) ave(x, g, FUN = sum) - x  # group total minus the row's own value
transform(ag,
          x = NULL,  # don't need x any more
          n = NULL,  # don't need n any more
          mean = x/n,
          mean.not = ave.not(x, Group.1) / ave.not(n, Group.1))
This gives:
Group.1 Group.2 mean mean.not
1 A X 0.3155084 -0.091898832
2 B X -0.1789730 0.332544353
3 C X 0.1976471 0.014282465
4 A Y -0.3644116 0.236706489
5 B Y 0.2452157 0.099240545
6 C Y -0.1630036 0.179833987
7 A Z 0.1579046 -0.009670734
8 B Z 0.4392794 0.033121335
9 C Z 0.1620209 0.033714943
To double check the first value under mean and under mean.not:
> mean(x[g1 == "A" & g2 == "X"])
[1] 0.3155084
> mean(x[g1 == "A" & g2 != "X"])
[1] -0.09189883
2) sapply. Here is a second approach which gives the same answer:
ag <- aggregate(list(mean = x), list(g1, g2), mean)
f <- function(i) mean(x[g1 == ag$Group.1[i] & g2 != ag$Group.2[i]])
ag$mean.not = sapply(1:nrow(ag), f)
ag
REVISED: based on comments from the poster, added a second approach and made some minor improvements.
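For larger data, the same "all the others" mean can also be computed directly from group totals, since mean(x[g1 == a & g2 != b]) = (S_a - S_ab) / (N_a - N_ab), where S and N denote sums and counts over g1 levels and over (g1, g2) cells. A base-R sketch of that identity (variable names are illustrative):

S1  <- tapply(x, g1, sum)               # sum of x within each g1 level
N1  <- tapply(x, g1, length)            # count within each g1 level
S12 <- tapply(x, list(g1, g2), sum)     # sums per (g1, g2) cell
N12 <- tapply(x, list(g1, g2), length)  # counts per (g1, g2) cell
mean.not <- (S1[rownames(S12)] - S12) / (N1[rownames(S12)] - N12)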