Sum by two groups in R and create a panel - database

I have the following dataset:
id_municipio ano CLARO OI TIM VIVO
1 1300029 1999 0 5 0 0
2 1300029 2006 0 2 0 0
3 1300029 2012 0 0 0 2
4 1300029 2015 0 2 0 0
5 1300029 2016 2 0 0 2
6 1300029 2022 0 0 1 0
I want the cumulative sum of "CLARO", "OI", "TIM", and "VIVO" within each "id_municipio", ordered by "ano". So, I would have the following panel:
id_municipio ano CLARO OI TIM VIVO
1 1300029 1999 0 5 0 0
2 1300029 2006 0 7 0 0
3 1300029 2012 0 7 0 2
4 1300029 2015 0 9 0 2
5 1300029 2016 2 9 0 4
6 1300029 2022 2 9 1 4
I know it should be something like this, but I keep getting the original dataset. What am I doing wrong?
teste_3 <- teste_2 %>%
  group_by(id_municipio, ano) %>%
  arrange(ano) %>%
  mutate(
    CLARO = cumsum(CLARO),
    OI = cumsum(OI),
    TIM = cumsum(TIM),
    VIVO = cumsum(VIVO)
  )

I managed to solve this issue by dropping "ano" from group_by(). With both columns in the grouping, every group holds a single row, so cumsum() returns the original values.
teste_3 <- teste_2 %>%
  group_by(id_municipio) %>%
  arrange(ano) %>%
  mutate(
    CLARO = cumsum(CLARO),
    OI = cumsum(OI),
    TIM = cumsum(TIM),
    VIVO = cumsum(VIVO)
  )
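For readers who think more easily in Python, here is a minimal sketch of the same idea, assuming the rows are held as plain dictionaries (the `running_totals` helper is hypothetical and not part of dplyr). The running totals are keyed by "id_municipio" only, and "ano" is used just for ordering.

```python
from collections import defaultdict

OPERATORS = ["CLARO", "OI", "TIM", "VIVO"]

def running_totals(rows):
    """Cumulative sum of each operator within id_municipio, ordered by ano."""
    # One accumulator per municipality; "ano" is deliberately not part of the key.
    totals = defaultdict(lambda: dict.fromkeys(OPERATORS, 0))
    out = []
    for row in sorted(rows, key=lambda r: (r["id_municipio"], r["ano"])):
        acc = totals[row["id_municipio"]]
        for op in OPERATORS:
            acc[op] += row[op]
        # Copy the accumulator so later updates do not change earlier rows.
        out.append({"id_municipio": row["id_municipio"], "ano": row["ano"], **acc})
    return out

# The six rows from the question.
rows = [
    {"id_municipio": 1300029, "ano": 1999, "CLARO": 0, "OI": 5, "TIM": 0, "VIVO": 0},
    {"id_municipio": 1300029, "ano": 2006, "CLARO": 0, "OI": 2, "TIM": 0, "VIVO": 0},
    {"id_municipio": 1300029, "ano": 2012, "CLARO": 0, "OI": 0, "TIM": 0, "VIVO": 2},
    {"id_municipio": 1300029, "ano": 2015, "CLARO": 0, "OI": 2, "TIM": 0, "VIVO": 0},
    {"id_municipio": 1300029, "ano": 2016, "CLARO": 2, "OI": 0, "TIM": 0, "VIVO": 2},
    {"id_municipio": 1300029, "ano": 2022, "CLARO": 0, "OI": 0, "TIM": 1, "VIVO": 0},
]

panel = running_totals(rows)
for r in panel:
    print(r)
```

If the accumulator were keyed by both "id_municipio" and "ano", each key would see exactly one row and the output would equal the input.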


Summing across SAS arrays based on intervals i.e. start date and end date

I am trying to sum variables in an array based on a start and end date. For each ID there is one row (if the start and end dates are within the same year), two rows (if the start and end dates are in consecutive years), or multiple rows for different periods of start and end dates. There are 12 variables with counts for each month, v1-v12, where v1 is January and v12 is December. The two rows for some IDs contain monthly values for the 2 consecutive years, i.e. the start year and the end year. I am trying to get the sum of the array variables, but only from the start date to the end date for each ID. For example, for ID 1 the start date is 07/23/2007 and the end date is 06/07/2008, so I would like to sum V7 (July, the start month) to V12 in 2007 and V1 to V6 (June, the end month) in 2008, i.e. the second row. Here's what I have:
ID STARTDATE ENDDATE YR V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 07/23/2007 06/07/2008 2007 3 5 2 6 3 2 1 3 4 1 2 3
1 07/23/2007 06/07/2008 2008 0 4 2 2 3 0 1 3 1 0 2 3
2 02/01/2002 07/27/2002 2002 1 0 2 3 1 0 1 2 3 0 0 2
3 05/26/2008 03/07/2009 2008 2 0 2 3 1 2 1 1 3 0 0 1
3 05/26/2008 03/07/2009 2009 4 1 4 3 1 0 2 3 3 1 0 3
3 10/17/2011 08/17/2012 2011 3 3 0 1 0 1 1 5 3 1 0 1
3 10/17/2011 08/17/2012 2012 1 3 2 3 1 0 1 2 3 2 0 2
4 02/27/2004 01/22/2005 2004 2 0 2 3 1 2 1 1 3 0 0 1
4 02/27/2004 01/22/2005 2005 0 4 2 2 3 0 1 3 1 0 2 3
and this is what I want :
ID STARTDATE ENDDATE YR V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 sum
1 07/23/2007 06/07/2008 2007 3 5 2 6 3 2 [1 3 4 1 2 3] 25
1 07/23/2007 06/07/2008 2008 [0 4 2 2 3 0] 1 3 1 0 2 3 25
2 02/01/2002 07/27/2002 2002 1 [0 2 3 1 0 1] 2 3 0 0 2 7
3 05/26/2008 03/07/2009 2008 2 0 2 3 [1 2 1 1 3 0 0 1] 18
3 05/26/2008 03/07/2009 2009 [4 1 4] 3 1 0 2 3 3 1 0 3 18
3 10/17/2011 08/17/2012 2011 3 3 0 1 0 1 1 5 3 [1 0 1] 15
3 10/17/2011 08/17/2012 2012 [1 3 2 3 1 0 1 2] 3 2 0 2 15
4 02/27/2004 01/22/2005 2004 2 [0 2 3 1 2 1 1 3 0 0 1] 14
4 02/27/2004 01/22/2005 2005 [0] 4 2 2 3 0 1 3 1 0 2 3 14
Here's the code I tried
data want;
  set have;
  array vars(*) V1-V12;
  DT_CHECK=intnx('month',ENDDATE,-12);
  start=intck('month',STARTDATE,DT_CHECK)+1;
  if start<1 then do;
    error 'Start date out of range';
    delete;
  end;
  else if start>dim(vars)-12 then do;
    error 'End date out of range';
    delete;
  end;
  do _N_=start to start+12;
    sum_n+vars(_N_);
  end;
  format DT_CHECK mmddyy10.;
run;
But I am having problems. Any help is appreciated. Thank you.
A DOW (serial) loop technique can compute a value over a group and then apply that value to each row in the group.
Example:
This requires the start-to-end date intervals within an id to be mutually exclusive (i.e. they do not overlap) and the data to be sorted by id, startdate, enddate.
data want;
  * [sum] variable is implicitly reset to missing at the top of the step.;
  do _n_ = 1 by 1 until (last.enddate);
    set have;
    by id startdate enddate;
    array v(12);
    _month1 = intnx('month', startdate, 0);
    _month2 = intnx('month', enddate, 0);
    do _index = 1 to 12;
      if _month1 <= mdy(_index,1,yr) <= _month2 then sum = sum(sum,v(_index));
    end;
  end;
  do _n_ = 1 to _n_;
    set have;
    output;
  end;
  format sum 4.;
  drop _:;
run;
This answer does not address the scenario of startdate to enddate intervals that overlap within an id.
Since each observation represents one year, a straightforward approach is to loop over the months from Jan to Dec and check whether each month falls within your date range.
data want;
  do until(last.startdate);
    set have;
    by id startdate;
    array v v1-v12;
    do month=1 to 12 ;
      if intnx('month',startdate,0,'b')<=mdy(month,1,yr)<=intnx('month',enddate,0,'e')
        then sum=sum(sum,v[month])
      ;
    end;
  end;
  keep id startdate enddate sum;
run;
Results:
Obs ID STARTDATE ENDDATE sum
1 1 2007-07-23 2008-06-07 25
2 2 2002-02-01 2002-07-27 7
3 3 2008-05-26 2009-03-07 18
4 3 2011-10-17 2012-08-17 15
5 4 2004-02-27 2005-01-22 14
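As a cross-check of the month-by-month test used in both SAS answers, here is a minimal Python sketch under the same assumption of non-overlapping intervals. The helper names `months_in_range` and `interval_sum` are hypothetical; the data are the nine rows from the question.

```python
from datetime import date

def months_in_range(start, end, year):
    """Months (1-12) of `year` lying between the start month and the end month."""
    first = (start.year, start.month)
    last = (end.year, end.month)
    return [m for m in range(1, 13) if first <= (year, m) <= last]

def interval_sum(rows):
    """Total the monthly counts per (id, start, end) interval.

    Each row is (id, start, end, year, [v1..v12]).
    """
    sums = {}
    for id_, start, end, year, v in rows:
        key = (id_, start, end)
        picked = sum(v[m - 1] for m in months_in_range(start, end, year))
        sums[key] = sums.get(key, 0) + picked
    return sums

rows = [
    (1, date(2007, 7, 23), date(2008, 6, 7), 2007, [3, 5, 2, 6, 3, 2, 1, 3, 4, 1, 2, 3]),
    (1, date(2007, 7, 23), date(2008, 6, 7), 2008, [0, 4, 2, 2, 3, 0, 1, 3, 1, 0, 2, 3]),
    (2, date(2002, 2, 1), date(2002, 7, 27), 2002, [1, 0, 2, 3, 1, 0, 1, 2, 3, 0, 0, 2]),
    (3, date(2008, 5, 26), date(2009, 3, 7), 2008, [2, 0, 2, 3, 1, 2, 1, 1, 3, 0, 0, 1]),
    (3, date(2008, 5, 26), date(2009, 3, 7), 2009, [4, 1, 4, 3, 1, 0, 2, 3, 3, 1, 0, 3]),
    (3, date(2011, 10, 17), date(2012, 8, 17), 2011, [3, 3, 0, 1, 0, 1, 1, 5, 3, 1, 0, 1]),
    (3, date(2011, 10, 17), date(2012, 8, 17), 2012, [1, 3, 2, 3, 1, 0, 1, 2, 3, 2, 0, 2]),
    (4, date(2004, 2, 27), date(2005, 1, 22), 2004, [2, 0, 2, 3, 1, 2, 1, 1, 3, 0, 0, 1]),
    (4, date(2004, 2, 27), date(2005, 1, 22), 2005, [0, 4, 2, 2, 3, 0, 1, 3, 1, 0, 2, 3]),
]

sums = interval_sum(rows)
for (id_, start, end), total in sums.items():
    print(id_, start, end, total)
```

Comparing (year, month) pairs plays the role of the `intnx('month', ...)` truncation: only the month of each boundary date matters, not the day.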

Summarizing dataframe and sorting with inserting 0 if the target words don't exist

I have a tab-delimited data frame like this:
Sample 1
A 2
B 1
D 2
Sample 2
C 1
D 1
E 2
Sample 3
D 1
E 3
Sample 4
A 1
E 3
Sample 5
A 2
B 3
Sample 6
C 1
D 2
Sample 7
D 1
E 3
Sample 8
A 3
D 2
Sample 9
A 1
Sample 10
A 1
C 2
E 3
and I would like to transpose it so that each letter becomes a column holding its associated value. If a sample does not contain a letter, I want "0" inserted.
So the modified data frame I want is
Sample A B C D E
1 2 1 0 2 0
2 0 0 1 1 2
3 0 0 0 1 3
4 1 0 0 0 3
5 2 3 0 0 0
6 0 0 1 2 0
7 0 0 0 1 3
8 3 0 0 2 0
9 1 0 0 0 0
10 1 0 2 0 3
I tried to summarize the data frame and then transpose it. When I used
awk 'NR==1{print} NR>1{a[$1]=a[$1]" "$2}END{for (i in a){print i " " a[i]}}' TEST
the output data was
Sample 1
A 2 1 2 3 1 1
B 1 3
C 1 1 2
D 2 1 1 2 1 2
E 2 3 3 3 3
Sample 2 3 4 5 6 7 8 9 10
Sample 1 was isolated, and no placeholder was inserted where a letter was missing from a sample.
I hope that makes sense.
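One possible approach, sketched in Python rather than awk, is to remember the current sample while reading and fill in 0 for every letter a sample lacks. This is only an illustration: the `pivot_samples` helper is hypothetical, and the input is the first three samples from the data above.

```python
def pivot_samples(lines):
    """Turn 'Sample n' blocks of 'letter value' lines into one row per sample."""
    samples = {}      # sample name -> {letter: value}
    current = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        key, value = fields[0], fields[1]
        if key == "Sample":
            current = value
            samples[current] = {}
        else:
            samples[current][key] = int(value)
    # Every letter seen anywhere becomes a column.
    letters = sorted({k for counts in samples.values() for k in counts})
    table = [["Sample"] + letters]
    for name, counts in samples.items():
        table.append([name] + [str(counts.get(letter, 0)) for letter in letters])
    return table

lines = [
    "Sample\t1", "A\t2", "B\t1", "D\t2",
    "Sample\t2", "C\t1", "D\t1", "E\t2",
    "Sample\t3", "D\t1", "E\t3",
]

table = pivot_samples(lines)
for row in table:
    print(" ".join(row))
```

The original awk attempt keyed everything on the first field, so all "Sample" lines collapsed into one entry. Tracking the current sample separately avoids that.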

How can I build a decreasing lower triangular matrix? [closed]

Do you know if it is possible to get the following triangular matrix
[ N:-1:1; (N-1):-1:0; (N-2):-1:0 0; (N-3):-1:0 0 0; ....] without writing every line with horzcat and without using a loop?
thanks all
Fred
Is this what you want?
N = 8;
result = flipud(tril(toeplitz(1:N)));
This gives
result =
8 7 6 5 4 3 2 1
7 6 5 4 3 2 1 0
6 5 4 3 2 1 0 0
5 4 3 2 1 0 0 0
4 3 2 1 0 0 0 0
3 2 1 0 0 0 0 0
2 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0
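A quick way to sanity-check the matrix printed above is its closed form: with zero-based row i and column j, each entry equals max(N - i - j, 0). A minimal Python sketch (the function name `decreasing_triangle` is made up for this illustration):

```python
def decreasing_triangle(n):
    """n x n matrix whose entry (i, j), zero-based, is max(n - i - j, 0)."""
    return [[max(n - i - j, 0) for j in range(n)] for i in range(n)]

matrix = decreasing_triangle(8)
for row in matrix:
    print(" ".join(str(x) for x in row))
```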
Maybe something like this:
N=10;
M=triu(gallery('circul',N)).'
M =
1 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0
3 2 1 0 0 0 0 0 0 0
4 3 2 1 0 0 0 0 0 0
5 4 3 2 1 0 0 0 0 0
6 5 4 3 2 1 0 0 0 0
7 6 5 4 3 2 1 0 0 0
8 7 6 5 4 3 2 1 0 0
9 8 7 6 5 4 3 2 1 0
10 9 8 7 6 5 4 3 2 1
Or did you want this:
M=fliplr(triu(gallery('circul',N)))
M =
10 9 8 7 6 5 4 3 2 1
9 8 7 6 5 4 3 2 1 0
8 7 6 5 4 3 2 1 0 0
7 6 5 4 3 2 1 0 0 0
6 5 4 3 2 1 0 0 0 0
5 4 3 2 1 0 0 0 0 0
4 3 2 1 0 0 0 0 0 0
3 2 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
I couldn't really tell from your code sample which direction you wanted this to go.
The power of bsxfun compels you!
[[N:-1:1]' reshape(repmat([N-1:-1:1]',1,N).*bsxfun(@ge,[1:N-1]',1:N),N,[])]
Sample run -
>> N = 8;
>> [[N:-1:1]' reshape(repmat([N-1:-1:1]',1,N).*bsxfun(@ge,[1:N-1]',1:N),N,[])]
ans =
8 7 6 5 4 3 2 1
7 6 5 4 3 2 1 0
6 5 4 3 2 1 0 0
5 4 3 2 1 0 0 0
4 3 2 1 0 0 0 0
3 2 1 0 0 0 0 0
2 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0
This is basically inspired by another bsxfun-based solution to a very similar question, "Replicate vector and shift each copy by 1 row down without for-loop". There you can see similar solutions and related benchmarks, as it seems performance is a concern here.

Summing rows in a file

I want to add up the rows based on one column field. Is it possible to do this with an awk command or some other simple way?
Date Hour Requests Success Error
10-Apr 11 1 1 0
10-Apr 13 1 1 0
10-Apr 14 1 1 0
10-Apr 18 1 1 0
10-Apr 9 1 1 0
10-Apr 11 1 1 0
10-Apr 12 3 3 0
10-Apr 13 2 1 1
10-Apr 14 2 2 0
10-Apr 15 1 1 0
10-Apr 16 1 1 0
10-Apr 12 3 3 0
10-Apr 13 4 1 3
10-Apr 14 1 1 0
10-Apr 16 2 2 0
10-Apr 18 1 1 0
10-Apr 10 3 3 0
10-Apr 11 1 1 0
10-Apr 12 3 3 0
10-Apr 13 1 1 0
10-Apr 14 2 2 0
10-Apr 15 2 2 0
10-Apr 16 2 2 0
10-Apr 17 2 2 0
From the above table I want to add up the rows (requests, successes, errors) for each hour, and the output should look like this:
Date Hour Requests Success Error
10-Apr 9 1 1 0
10-Apr 10 3 3 0
10-Apr 11 3 3 0
10-Apr 12 9 9 0
10-Apr 13 8 4 4
10-Apr 14 6 6 0
10-Apr 15 3 3 0
10-Apr 16 5 5 0
10-Apr 17 2 2 0
10-Apr 18 2 2 0
Using GNU awk for true multi-dimensional arrays and sorted_in:
$ cat tst.awk
NR==1 { print; next }
!seen[$1]++ { dates[++numDates] = $1 }
{ for (i=3;i<=NF;i++) sum[$1][$2][i] += $i }
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (dateNr=1; dateNr<=numDates; dateNr++) {
        date = dates[dateNr]
        for (hr in sum[date]) {
            printf "%s %s ", date, hr
            for (i=3;i<=NF;i++) {
                printf "%s%s", sum[date][hr][i], (i<NF?OFS:ORS)
            }
        }
    }
}
$ awk -f tst.awk file | column -t
Date Hour Requests Success Error
10-Apr 9 1 1 0
10-Apr 10 3 3 0
10-Apr 11 3 3 0
10-Apr 12 9 9 0
10-Apr 13 8 4 4
10-Apr 14 6 6 0
10-Apr 15 3 3 0
10-Apr 16 5 5 0
10-Apr 17 2 2 0
10-Apr 18 2 2 0
I wasn't sure if your fields were space or tab separated so made no attempt to format the output within awk.
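For comparison, the same grouping can be sketched in Python. This is an illustration only: `sum_by_hour` is a hypothetical helper, and the input is a shortened excerpt of the table in the question, not the full data.

```python
from collections import defaultdict

def sum_by_hour(lines):
    """Add up the numeric columns for every (date, hour) pair."""
    header, *records = [line.split() for line in lines if line.strip()]
    totals = defaultdict(lambda: [0] * (len(header) - 2))
    date_order = {}  # remembers the order in which dates first appear
    for date, hour, *counts in records:
        date_order.setdefault(date, len(date_order))
        acc = totals[(date, int(hour))]
        for i, count in enumerate(counts):
            acc[i] += int(count)
    # Dates keep first-seen order; hours are sorted numerically within a date.
    keys = sorted(totals, key=lambda k: (date_order[k[0]], k[1]))
    return [header] + [[d, str(h), *map(str, totals[(d, h)])] for d, h in keys]

# Shortened excerpt of the input, for illustration.
lines = [
    "Date Hour Requests Success Error",
    "10-Apr 11 1 1 0",
    "10-Apr 13 1 1 0",
    "10-Apr 9 1 1 0",
    "10-Apr 11 1 1 0",
    "10-Apr 13 2 1 1",
    "10-Apr 13 4 1 3",
]

result = sum_by_hour(lines)
for row in result:
    print(" ".join(row))
```

Converting the hour to an integer before sorting matters: sorted as strings, "9" would come after "18".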

how to add a factor to a sequence?

I'm analysing a dataset with some data-mining tools. The response variable has ten levels and I'm trying to create a classifier.
Here comes the problem. When using the nnet and bagging functions, the result is not that good, and the 5th level does not even appear in the predictions.
I want to use a confusion matrix to analyse the classifier, but as the 5th level is not shown in the predictions I can't get a well-formed matrix. So how can I get a well-formed matrix, i.e. a 10*10 matrix?
The confusion matrix:
library("mda")#This is where **confusion** comes from
> confusion(pre.bag$class,CLASS)#here confusion acts like table
true
predicted 1 2 3 4 6 7 8 9 10 5
1 338 9 6 0 5 12 10 1 15 46
2 9 549 1 59 18 0 3 0 0 6
3 18 1 44 0 0 0 2 0 0 4
4 0 1 0 21 0 0 0 0 0 0
6 2 13 0 1 299 2 9 0 0 0
7 5 2 1 0 10 231 6 0 1 0
8 0 0 0 0 0 5 76 0 0 0
9 5 1 0 0 0 0 0 62 0 0
10 7 3 1 0 0 2 1 6 181 16
attr(,"error")
[1] 0.1231743
attr(,"mismatch")
[1] 0.03386642
Try this:
pred <- factor(pre.bag$class, levels=levels(CLASS) )
confusion(pred, CLASS)
(Tested with an fda-object.)
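The underlying idea, fixing the full set of levels up front so that a never-predicted class still gets its own row, can be sketched in Python. The `confusion_matrix` helper and the label vectors below are made up for illustration; they are not the data from the question.

```python
def confusion_matrix(predicted, actual, levels):
    """Square count matrix with one row and column per level, even unused ones."""
    index = {level: i for i, level in enumerate(levels)}
    matrix = [[0] * len(levels) for _ in levels]
    for p, a in zip(predicted, actual):
        matrix[index[p]][index[a]] += 1  # rows: predicted, columns: true
    return matrix

levels = list(range(1, 11))      # the ten response levels
actual = [1, 5, 5, 10, 2]        # made-up true labels
predicted = [1, 1, 10, 10, 2]    # level 5 is never predicted

m = confusion_matrix(predicted, actual, levels)
for level, row in zip(levels, m):
    print(level, row)
```

The row for level 5 is all zeros rather than missing, which is what setting `levels=levels(CLASS)` on the predicted factor achieves in R.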
