Combine multiple rows in dataset - database

I have a dataset with about 20 million rows with the following format:
Userid attributeid timeid
1 -1 0
1 -2 0
1 -3 0
1 -4 0
1 -5 0
...
and another index that maps each attributeid to one of the four attribute types:
attributeid attributetype
-1 A
-2 B
-3 C
-4 D
-5 B
I would like to batch import the dataset into neo4j by converting it into the following format:
UserID A B C D timeid
1 -1 -2,-5 -3 -4 0
I tried this in R by ordering the data frame by userid and scanning through it, but it was too slow. What is the most time-efficient way to do this, or is there anything I can do to optimize my code? Here is my code:
names(node_attri)[1] = 'UserID'
names(node_attri)[2] = 'AttriID'
names(node_attri)[3] = 'TimeID'
names(attri_type)[1] = 'AttriID'
names(attri_type)[2] = 'AttriType'
#attri_type <- attri_type[order(attri_type),]
#node_attri <- node_attri[order(node_attri),]
N = length(unique(node_attri$TimeID))*length(unique(node_attri$UserID))
new_nodes = data.frame(UserID=rep(NA,N), employer=rep(NA,N), major=rep(NA,N),
places_lived=rep(NA,N), school=rep(NA,N), TimeID=rep(NA,N))
row = 0
start = 1
end = 1
M =nrow(node_attri)
while(start <= M) {
row = row + 1
em = ''
ma = ''
pl = ''
sc = ''
while(node_attri[start,1] == node_attri[end,1]) {
if (attri_type[abs(node_attri[end,2]),2] == 'employer')
em = paste(em, node_attri[end,2], sep=',')
else if (attri_type[abs(node_attri[end,2]),2] == 'major')
ma = paste(ma, node_attri[end,2], sep=',')
else if (attri_type[abs(node_attri[end,2]),2] == 'places_lived')
pl = paste(pl, node_attri[end,2], sep=',')
else if (attri_type[abs(node_attri[end,2]),2] == 'school')
sc = paste(sc, node_attri[end,2], sep=',')
end = end + 1
if (end > M) break
}
new_nodes[row,] = list(UserID=node_attri[start,1], employer=substring(em,2),
major=substring(ma,2), places_lived=substring(pl,2),
school=substring(sc,2), TimeID=node_attri[start,3])
start = end
end = start
}
new_nodes = new_nodes[1:row,]

You need to merge, aggregate and then reshape. Assuming your data frames are DF and DF2 respectively:
x <- merge(DF, DF2)
y <- aggregate(attributeid~., data=x, FUN=function(x)paste(x, collapse=","))
z <- reshape(y, direction="wide", idvar=c("Userid","timeid"), timevar="attributetype")
Result:
> z
Userid timeid attributeid.A attributeid.B attributeid.C attributeid.D
1 1 0 -1 -5,-2 -3 -4
Renaming and rearranging columns is trivial.
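For readers who want to see the merge / aggregate / reshape idea outside R, here is a minimal Python sketch using only the standard library. The function name combine_rows and the dictionary-based layout are assumptions made for illustration; they are not part of the answer above. It makes one pass over the rows, groups attribute ids by (Userid, timeid) and attribute type, and then joins each group with commas.

```python
from collections import defaultdict

def combine_rows(rows, attr_types):
    """rows: iterable of (userid, attributeid, timeid) tuples.
    attr_types: dict mapping attributeid -> attribute type (hypothetical layout)."""
    # (userid, timeid) -> attribute type -> list of attribute ids, in input order
    grouped = defaultdict(lambda: defaultdict(list))
    for userid, attributeid, timeid in rows:
        grouped[(userid, timeid)][attr_types[attributeid]].append(attributeid)

    type_names = sorted(set(attr_types.values()))
    result = []
    for (userid, timeid), by_type in sorted(grouped.items()):
        record = {"Userid": userid, "timeid": timeid}
        for name in type_names:
            # Empty string when a user has no attribute of this type
            record[name] = ",".join(str(a) for a in by_type.get(name, []))
        result.append(record)
    return result

rows = [(1, -1, 0), (1, -2, 0), (1, -3, 0), (1, -4, 0), (1, -5, 0)]
attr_types = {-1: "A", -2: "B", -3: "C", -4: "D", -5: "B"}
wide = combine_rows(rows, attr_types)
print(wide)
```

Because the grouping is a single pass over the input, the cost grows linearly with the number of rows, which is the main difference from the row-by-row data frame updates in the question.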

Here is a solution using the reshape2 package and match.
library(reshape2)
##Create some sample data
dat1 <- data.frame(Userid=rep(1:4,each=5),attributeid=rep(-1:-5,4),timeid=rep(0:3,each=5))
index <- data.frame(attributeid=-1:-5,attributetype=c("A","B","C","D","B"))
##Merge the two using match
dat1$attributetype = index$attributetype[match(dat1$attributeid,index$attributeid)]
##Reformat using aggregate and dcast
dat2 <- aggregate(attributeid~attributetype+timeid+Userid,function(x){paste(x,collapse=",")},data=dat1)
dat3 <- dcast(formula=Userid+timeid~attributetype,value.var="attributeid",data=dat2)
> dat3
Userid timeid A B C D
1 1 0 -1 -2,-5 -3 -4
2 2 1 -1 -2,-5 -3 -4
3 3 2 -1 -2,-5 -3 -4
4 4 3 -1 -2,-5 -3 -4

Related

Expression values are not inserted into array cell in R

I have to create n (say 2) matrices of size, say, 5*5, each with different cell values, like this:
,,1
-1 0 1 -1 0
1 -1 0 1 -1
0 1 -1 0 1
-1 0 1 -1 0
1 -1 0 1 -1
,,2
0 -1 0 1 -1
1 0 0 1 0
0 0 1 0 1
1 0 -1 0 -1
0 -1 -1 -1 -1
For this I have tried following way:
r = 5
c = 5
a = 2
m = array(2, dim = c(5,5,2))
for(i in 1:dim(m))
{ for (j in 1:dim(m)) {
for (k in 1:dim(m)){
m[i,j,k] = sample(c(-1,1,0),replace =T, 1)
}}}
m
and got following error and warnings :
Error in `[<-`(`*tmp*`, i, j, k, value = -1) : subscript out of bounds
In addition: Warning messages:
1: In 1:dim(m) : numerical expression has 3 elements: only the first used
2: In 1:dim(m) : numerical expression has 3 elements: only the first used
3: In 1:dim(m) : numerical expression has 3 elements: only the first used
But when I tried the matrix function it gave me the correct result, so what is wrong with the array version?
r = 5
c = 5
m = matrix(2,r,c)
for(i in 1:r)
{ for (j in 1:c) {
m[i,j] = sample(c(0,1,-1),replace =T,prob=c(.33,.33,.34),1)
}}
Can anyone please tell me how to resolve this issue?

Matlab: fill a zeros matrix based on an array

So I have this data:
F =
1
1
2
3
1
2
1
1
and a zeros matrix:
NM =
0 0 0
0 0 0
0 0 0
I have a rule: each pair of consecutive elements in the array forms a connection. From the data in F, the connections should be
1&1, 1&2, 2&3, 3&1, 1&2, 2&1, 1&1
Each connection gives the row and column of a cell in the NM matrix, and every occurrence of a connection adds 1 to that cell.
So from the connections above, the new matrix should be
NNM=
2 2 0
1 0 1
1 0 0
I'm trying to code it like this:
[G H]=size(NM)
for i=1:G
j=2:G
if F(i)==A(j)
(NM(i,j))+1
else
NM(i,j)=0
end
end
NNM=NM
but the NM matrix does not change. What should I do?
Is this what you are trying to do?
F = [1 1 2 3 1 2 1 1];
NM = zeros(3, 3);
for i=1:(numel(F)-1)
NM(F(i), F(i+1))=NM(F(i), F(i+1))+1;
end
You can use sparse (and then convert to full) as follows:
NM = full(sparse(F(1:end-1), F(2:end), 1));
list = [1 1 ; 1 2 ; 2 3 ; 3 1 ; 1 2 ; 2 1 ; 1 1 ] ; % one row per connection
nx = size(list,1) ;
NM = zeros(3) ;
for i = 1:nx
NM(list(i,1),list(i,2)) = NM(list(i,1),list(i,2)) + 1 ; % count each connection once
end
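The counting rule behind these answers (pair each element with its successor and increment the matching cell) can also be sketched in Python. This is an illustrative translation, not code from any of the answers; the function name transition_counts is an assumption, and the values in F are taken to be 1-based labels as in the question.

```python
def transition_counts(seq, size):
    """Count consecutive pairs (seq[i], seq[i+1]) in a size x size matrix.
    Values in seq are assumed to be 1-based labels in the range 1..size."""
    counts = [[0] * size for _ in range(size)]
    # zip pairs every element with its successor: (F1,F2), (F2,F3), ...
    for a, b in zip(seq, seq[1:]):
        counts[a - 1][b - 1] += 1  # first value selects the row, second the column
    return counts

F = [1, 1, 2, 3, 1, 2, 1, 1]
NM = transition_counts(F, 3)
for row in NM:
    print(row)
```

For the F in the question this reproduces the NNM matrix the asker expects.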

How can I generate this matrix (containing only 0s and ±1s)?

I would like to generate a matrix of size (n(n-1)/2, n) that looks like this (n=5 in this case):
-1 1 0 0 0
-1 0 1 0 0
-1 0 0 1 0
-1 0 0 0 1
0 -1 1 0 0
0 -1 0 1 0
0 -1 0 0 1
0 0 -1 1 0
0 0 -1 0 1
0 0 0 -1 1
This is what I, quickly, came up with:
G = [];
for i = 1:n-1;
for j = i+1:n
v = sparse(1,i,-1,1,n);
w = sparse(1,j,1,1,n);
vw = v+w;
G = [G; vw];
end
end
G = full(G);
It works, but is there a faster/cleaner way of doing it?
Use nchoosek to generate the indices of the columns that will be nonzero:
n = 5; %// number of columns
ind = nchoosek(1:n,2); %// ind(:,1): columns with "-1". ind(:,2): with "1".
m = size(ind,1);
rows = (1:m).'; %'// row indices
G = zeros(m,n);
G(rows + m*(ind(:,1)-1)) = -1;
G(rows + m*(ind(:,2)-1)) = 1;
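The same idea (enumerate all column pairs, then put -1 in the first column of the pair and 1 in the second) can be sketched in Python with itertools.combinations, which yields pairs in the same lexicographic order as nchoosek(1:n,2). The function name pair_difference_matrix is an assumption for this illustration, and column indices are 0-based here.

```python
from itertools import combinations

def pair_difference_matrix(n):
    """Return the n(n-1)/2 x n matrix with one row per column pair (i, j), i < j."""
    rows = []
    for i, j in combinations(range(n), 2):  # 0-based column indices
        row = [0] * n
        row[i] = -1  # first column of the pair
        row[j] = 1   # second column of the pair
        rows.append(row)
    return rows

G = pair_difference_matrix(5)
for row in G:
    print(row)
```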
You have two nested loops, which leads to O(N^2) non-vectorized operations, which is too many for this task. Notice that your matrix actually has a recursive pattern:
G(n+1) = [ -1 I(n)]
[ 0 G(n)];
where I(n) is the identity matrix of size n. This is how you can express the pattern in MATLAB:
function G = mat(n)
% Treat original call as G(n+1)
n = n - 1;
% Non-recursive branch for trivial case
if n == 1
G = [-1 1];
return;
end
RT = eye(n); % Right-top: I(n)
LT = repmat(-1, n, 1); % Left-top: -1
RB = mat(n); % Right-bottom: G(n), recursive
LB = zeros(size(RB, 1), 1); % Left-bottom: 0
G = [LT RT; LB RB];
end
This gives us O(N) non-vectorized operations. It will probably waste some memory during recursion and matrix composition if MATLAB is not smart enough to optimize these away. If that is critical, you can unroll the recursion into a loop and iteratively fill in the corresponding places in a pre-allocated matrix.
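A minimal Python sketch of that unrolled variant follows, assuming plain nested lists stand in for the pre-allocated matrix. The name pair_difference_matrix_blocks is hypothetical. Each pass of the outer loop writes one [-1 I] block of the recursive pattern; in pure Python the block is still filled element by element, so the sketch shows the layout rather than any speed-up.

```python
def pair_difference_matrix_blocks(n):
    """Fill a pre-allocated n(n-1)/2 x n matrix one [-1 | I] block at a time."""
    m = n * (n - 1) // 2
    G = [[0] * n for _ in range(m)]  # pre-allocated, all zeros
    row = 0                          # first row of the current block
    for i in range(n - 1):           # block i has n-1-i rows
        for offset in range(1, n - i):
            G[row][i] = -1           # the column of -1s on the left of the block
            G[row][i + offset] = 1   # the identity part of the block
            row += 1
    return G

G_blocks = pair_difference_matrix_blocks(5)
```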

Select n elements in matrix left-wise based on certain value

I have a logical matrix A, and I would like to select all the elements to the left of each of my 1 values within a fixed distance. Say my distance is 4: I would like to (for instance) replace all 4 cells to the left of each 1 in A with a fixed value (say 2).
A= [0 0 0 0 0 1 0
0 1 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 1 0 1]
B= [0 2 2 2 2 1 0
2 1 0 0 0 0 0
0 0 0 0 0 0 0
2 2 2 2 2 2 1]
B is what I would like to have. Note that it also covers overwriting (last row of B) and cases where there are fewer than 4 cells to the left of a 1 (second row, where only 1 cell is available).
How about this lovely one-liner?
n = 3;
const = 5;
A = [0 0 0 0 0 1 0;
0 1 0 0 0 0 0;
0 0 0 0 0 0 0;
0 0 0 0 1 0 1]
A(bsxfun(@ne,fliplr(filter(ones(1,1+n),1,fliplr(A),[],2)),A)) = const
results in:
A =
0 0 5 5 5 1 0
5 1 0 0 0 0 0
0 0 0 0 0 0 0
0 5 5 5 5 5 1
here some explanations:
Am = fliplr(A); %// mirrored input required
Bm = filter(ones(1,1+n),1,Am,[],2); %// moving sum over 1+n elements along the 2nd dimension
B = fliplr(Bm); %// back mirrored
mask = bsxfun(@ne,B,A) %// mask for constants
A(mask) = const
Here is a simple solution you could have come up with:
w=4; % Window size
v=2; % Desired value
B = A;
for r=1:size(A,1) % Go over all rows
for c=2:size(A,2) % Go over all columns
if A(r,c)==1 % If we encounter a 1
B(r,max(1,c-w):c-1)=v; % Set the four spots before this point to your value (if possible)
end
end
end
d = 4; %// distance
v = 2; %// value
A = fliplr(A).'; %'// flip matrix, and transpose to work along rows.
ind = logical( cumsum(A) ...
- [ zeros(d+1,size(A,2)); cumsum(A(1:end-d-1,:)) ] - A ); %// d+1 zero rows so the sizes match for any d < size(A,1)-1
A(ind) = v;
A = fliplr(A.');
Result:
A =
0 2 2 2 2 1 0
2 1 0 0 0 0 0
0 0 0 0 0 0 0
2 2 2 2 2 2 1
Approach #1 One-liner using imdilate available with Image Processing Toolbox -
A(imdilate(A,[ones(1,4) zeros(1,4+1)])==1)=2
Explanation
Step #1: Create a morphological structuring element to be used with imdilate -
morph_strel = [ones(1,4) zeros(1,4+1)]
This basically represents a window of ones extending window_length places (4 here) to the left, with zeros at the origin and the window_length places to its right.
Step #2: Use imdilate that will modify A such that we would have 1 at all four places to the left of each 1 in A -
imdilate_result = imdilate(A,morph_strel)
Step #3: Select all four indices for each 1 of A and set them to 2 -
A(imdilate_result==1)=2
Thus, one can write a general form for this approach as -
A(imdilate(A,[ones(1,window_length) zeros(1,window_length+1)])==1)=new_value
where window_length would be 4 and new_value would be 2 for the given data.
Approach #2 Using bsxfun-
%// Parameters
window_length = 4;
new_value = 2;
B = A' %//'
[r,c] = find(B)
extents = bsxfun(@plus,r,-window_length:-1)
valid_ind1 = extents>0
jump_factor = (c-1)*size(B,1)
extents_valid = extents.*valid_ind1
B(nonzeros(bsxfun(@plus,extents_valid,jump_factor).*valid_ind1))=new_value
B = B' %// B is the desired output
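To make the expected behaviour concrete, including the clipped window in the second row and the overwritten 1 in the last row, here is a short Python sketch of the straightforward loop version. The function name fill_left is an assumption for this illustration, and indices are 0-based.

```python
def fill_left(A, distance, value):
    """Return a copy of A where up to `distance` cells to the left of every 1
    are set to `value`. A is a list of rows; the input itself is not modified."""
    B = [row[:] for row in A]
    for r, row in enumerate(A):          # scan the original, write into the copy
        for c, cell in enumerate(row):
            if cell == 1:
                # max(0, ...) clips the window at the left edge of the row
                for k in range(max(0, c - distance), c):
                    B[r][k] = value
    return B

A = [[0, 0, 0, 0, 0, 1, 0],
     [0, 1, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 0, 1]]
B = fill_left(A, 4, 2)
for row in B:
    print(row)
```

Like the B in the question, a 1 that falls inside the window of another 1 further right is overwritten.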

SQL Server: Error while using TSQL variables

I have a user function called dbo.IsPartReady that returns a BIT.
I am trying to use the function inside of a trigger as follows:
SET @railReady = dbo.IsPartReady(1,@curPartiId);
SET @frameReady = dbo.IsPartReady(2,@curPartiId);
SET @dryAReady = dbo.IsPartReady(3,@curPartiId);
SET @dryBReady = dbo.IsPartReady(4,@curPartiId);
IF ( (@railReady AND @frameReady ) OR ( @dryAReady AND @dryBReady ) )
I'm getting the following error in the IF statement:
An expression of non-boolean type specified in a context where a condition is expected, near 'AND'.
What am I doing wrong ?
The BIT data type in SQL Server is not a boolean; it is an integer type. You have to compare the value of the variable with something to get a boolean expression. A BIT can have the value 0, 1 or NULL.
http://msdn.microsoft.com/en-us/library/ms177603.aspx
declare @B bit = 1
if @B = 1
begin
print 'Yes'
end
Use the following:
IF ((@railReady = 1 AND @frameReady = 1) OR (@dryAReady = 1 AND @dryBReady = 1))
or alternatively,
IF ((@railReady & @frameReady) | (@dryAReady & @dryBReady)) = 1
More information:
To verify this we can use a truth table containing all combinations of four bit values:
WITH B(x) AS (SELECT CAST(0 AS bit) UNION ALL SELECT CAST(1 AS bit))
, AllSixteenCombinations(a,b,c,d) AS
(SELECT * FROM B B1 CROSS JOIN B B2 CROSS JOIN B B3 CROSS JOIN B B4)
SELECT a,b,c,d
, CASE WHEN ((a = 1 AND b = 1) OR (c = 1 AND d = 1)) THEN 'Y' ELSE 'N' END [Logic]
, CASE WHEN ((a & b) | (c & d)) = 1 THEN 'Y' ELSE 'N' END [Bitwise]
FROM AllSixteenCombinations
Output:
a b c d Logic Bitwise
----- ----- ----- ----- ----- -------
0 0 0 0 N N
0 1 0 0 N N
0 0 1 0 N N
0 1 1 0 N N
1 0 0 0 N N
1 1 0 0 Y Y
1 0 1 0 N N
1 1 1 0 Y Y
0 0 0 1 N N
0 1 0 1 N N
0 0 1 1 Y Y
0 1 1 1 Y Y
1 0 0 1 N N
1 1 0 1 Y Y
1 0 1 1 Y Y
1 1 1 1 Y Y
(16 row(s) affected)
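The same equivalence check can be sketched in Python by enumerating all sixteen combinations of 0 and 1. This is only an illustration of the truth table above, with hypothetical function names; it does not model NULL, which a T-SQL BIT can also hold.

```python
from itertools import product

def logical_form(a, b, c, d):
    # Mirrors: (a = 1 AND b = 1) OR (c = 1 AND d = 1)
    return (a == 1 and b == 1) or (c == 1 and d == 1)

def bitwise_form(a, b, c, d):
    # Mirrors: ((a & b) | (c & d)) = 1
    return ((a & b) | (c & d)) == 1

# Collect every combination on which the two forms disagree
mismatches = [bits for bits in product((0, 1), repeat=4)
              if logical_form(*bits) != bitwise_form(*bits)]
print(mismatches)
```

An empty list means the two conditions agree on every non-NULL input.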
