SAS, assigning the same numbers to specific observations - loops

I want to assign the same id number to every four observations. For example, if I have the following data
age marital gender id
45 1 0 1
33 1 1 1
68 0 1 1
27 1 0 1
43 0 0 2
37 0 1 2
19 1 1 2
40 1 1 2
25 1 0 3
38 1 1 3
57 0 0 3
50 1 0 3
51 1 1 4
44 0 1 4
69 1 0 4
39 0 1 4
The last column id is something I want to produce.
Plus, the dataset have 500,000+ observations.
Thanks in advance.

Slightly more compact:
id = ceil(_n_/4);

Use the integer function and the built-in _n_ variable (which increments for each observation):
id = int( (_n_-4)/4 )+1;

Related

Aggregate function with window function filtered by time

I have a table with data about buses while making their routes. There are columns for:
bus trip id (different each time a bus starts the route from the first stop)
bus stop id
datetime column that indicates the moment that the bus leaves each bus stop
integer that indicates how many passengers entered the bus in that stop
There is no information about how many passengers get off the bus on each stop, so I have to make an estimation supposing that once they get on the bus, they stay on it for 30 minutes. The trip lasts about 70 minutes from the first to the last stop.
I am trying to aggregate results on each stop using
SUM(iPassengersIn) OVER (
PARTITION BY tripDate, tripId
ORDER BY busStopOrder
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) total_passengers
The problem is that I can add passengers since the beginning of the trip, but not since "30 minutes ago" on each stop. How could I limit the aggregation to "the last 30 minutes" on each row in order to estimate the occupation between stops?
This is a subset of my data:
trip_date trip_id bus_stop_order minutes_since_trip_start passengers_in trip_total_passengers
2020-06-08 374910 0 0 0 0
2020-06-08 374910 1 3 0 0
2020-06-08 374910 2 5 1 1
2020-06-08 374910 3 8 0 1
2020-06-08 374910 4 9 0 1
2020-06-08 374910 5 12 0 1
2020-06-08 374910 6 13 0 1
2020-06-08 374910 7 13 0 1
2020-06-08 374910 8 15 0 1
2020-06-08 374910 9 16 0 1
2020-06-08 374910 10 16 0 1
2020-06-08 374910 11 17 0 1
2020-06-08 374910 12 18 2 3
2020-06-08 374910 13 20 0 3
2020-06-08 374910 14 22 0 3
2020-06-08 374910 15 24 0 3
2020-06-08 374910 16 25 0 3
2020-06-08 374910 17 28 2 5
2020-06-08 374910 18 30 1 6
2020-06-08 374910 19 31 0 6
2020-06-08 374910 20 33 0 6
2020-06-08 374910 21 41 3 9
2020-06-08 374910 22 44 3 12
2020-06-08 374910 23 45 4 16
2020-06-08 374910 24 48 2 18
2020-06-08 374910 25 48 2 20
2020-06-08 374910 26 50 0 20
2020-06-08 374910 27 51 0 20
2020-06-08 374910 28 51 0 20
2020-06-08 374910 29 53 0 20
2020-06-08 374910 30 55 0 20
2020-06-08 374910 31 58 0 20
For the row with bus_stop_order 21 (41 minutes into the bus trip), where 3 passengers enter the bus, I have to sum only the passengers that entered the bus between minute 11 and 41. Thus, the passenger that entered the bus in the 2nd bus stop (5 minutes into the trip) should be excluded.
That should be applied for every row.
The only thing I can think of is:
select
trip_date,
trip_id,
minutes_since_trip_start,
v.total_passengers
from
#t t1
outer apply (
select sum(passengers_in)
from #t t2
where
t1.trip_date = t2.trip_date
and t1.trip_id = t2.trip_id
and t2.bus_stop_order <= t1.bus_stop_order
and t2.minutes_since_trip_start >= t1.minutes_since_trip_start - 30
) v(total_passengers)
order by
trip_date,
trip_id,
minutes_since_trip_start
;

Replacing specific elements in a table with a specific element from a range in APLX

I'm learning a spread of programming languages in a class, and we're working on an APLX project at the moment. A restriction we have to work around is we cannot use If, For, While, etc. No loops or conditionals. I have to be able to take a plane of numbers, ranging 0-7, and replace each number 2 or greater into the depth of that number, and, ideally, change the 1's to 0's. For example:
0100230 => 0000560
I have no idea how I'm supposed to do the replacement with depth aspect, though the change from ones to zeros is quite simple. I'm able to produce the set of integers in a table and I understand how to replace specific values, but only with other specific values, not values that would have to be determined during the function. The depth should be the row depth, rather than the multi-dimensional depth.
For the record this is not the whole of the program, the program itself is a poker dealing and scoring program. This is a specific aspect of the scoring methodology that my professor recommended I use.
TOTALS„SCORE PHAND;TYPECOUNT;DEPTH;ISCOUNT;TEMPS;REPLACE
:If (½½PHAND) = 0
PHAND„DEAL PHAND
:EndIf
TYPECOUNT„CHARS°.¹PHAND
DEPTH„2Þ(½TYPECOUNT)
REPLACE „ 2 3 4 5 6 7
ISCOUNT „ +/ TYPECOUNT
ISCOUNT „ ³ISCOUNT
((1=,ISCOUNT)/,ISCOUNT)„0
©((2=,ISCOUNT)/,ISCOUNT)„1
©TEMPS „ ISCOUNT
Œ„ISCOUNT
Œ„PHAND
You may have missed the first lessons of your prof and it might help to look at at again to learn about vectors and how easy you can work with them - once you unlearned the ideas of other programming languages ;-)
Assume you have a vector A with numbers from 1 to 7:
A←⍳7
A
1 2 3 4 5 6 7
Now, if you wanted to search for values > 3, you'd do:
A>3
0 0 0 1 1 1 1
The result is a vector, too, and you can easily combine the two in lots of operations:
multiplication to only keep values > 0 and replace others with 0:
A×A>3
0 0 0 4 5 6 7
or add 500 to values >3
A+500×A>3
1 2 3 504 505 506 507
or, find the indices of values > 3:
(A>3)×⍳⍴A
0 0 0 4 5 6 7
Now, looking at your q again, the word 'depth' has a specific meaning in APL and I guess you meant something different. Do I understand correctly that you want to replace values > 2 with the ' indices' of these values?
Well, with what I've shown before, this is easy:
A←0 1 0 0 2 3 0
(A≥2)×⍳⍴A
0 0 0 0 5 6 0
edit: looking at multi-dimensional arrays:
let's look into this example:
A←(⍳5)∘.×⍳10
A
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
Now, which numbers are > 20 and < 30?
z←(A>20)∧A<30
z
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 1 1 0 0 0
0 0 0 0 1 0 0 0 0 0
Then, you can multiply the values with that boolean result to filter out only the ones satisfying the condition:
A×z
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 21 24 27 0
0 0 0 0 0 24 28 0 0 0
0 0 0 0 25 0 0 0 0 0
Or, perhaps you're interested in the column-index of the values?
z×[2]⍳¯1↑⍴z
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 7 8 9 0
0 0 0 0 0 6 7 0 0 0
0 0 0 0 5 0 0 0 0 0
NB: this statement might not work in all APL-dialects. Here's another way to formulate this:
z×((1↑⍴z)⍴0)∘.+⍳¯1↑⍴z
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 7 8 9 0
0 0 0 0 0 6 7 0 0 0
0 0 0 0 5 0 0 0 0 0
I hope this gives you some ideas to play with. In general, using booleans to manipulate arrays in mathematical operations is an extremely powerful idea in APL which will take you loooooong ways ;-)
Also, if you'd like to see more of the same, have a look at the FinnAPL Idioms - some useful shorties grown over the years ;-)
edit re. "maintaining untouched values":
going back to example array A:
A←(⍳5)∘.×⍳10
A
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
Replacing values between 20 and 30 with the power 2 of these values, keeping all others unchanged:
touch←(A>20)∧A<30
(touch×A*2)+A×~touch
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 441 576 729 30
4 8 12 16 20 576 784 32 36 40
5 10 15 20 625 30 35 40 45 50
I hope you get the idea...
Or better: ask a new q, as otherwise this would truly take epic dimensions, whereas the idea of stackoverflow is more like "one issue - one question"...

Why FOR /F sets empty values for repeated numbers in the rest of tokens?

Not sure if the question is clear enough so here's an example:
:::this prints - 1:[i] 2:[] 3:[] 4:[] 5:[] 6:[] 7:[]
for /f "tokens=1,1,1,1,1,1,1" %%a in ("i ii iii iv v vi vii") do (
#echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
:::this prints - 1:[i] 2:[ii] 3:[iii] 4:[iv] 5:[] 6:[] 7:[%g]
for /f "tokens=2,3,1-4" %%a in ("i ii iii iv v vi vii") do (
#echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
:::this prints - 1:[i] 2:[ii] 3:[iii] 4:[] 5:[] 6:[] 7:[%g]
for /f "tokens=1-3,1-3," %%a in ("i ii iii iv v vi vii") do (
#echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
In brief if there's a repeated numbers in the list of tokens (doesn't matter if they are in the ranges like n-m or set one by one with commas ) the same number of the left accessed tokens have empty values.
Nowhere this behavior is documented (or at least I didn't found such thing).Here's FOR help that concerns tokens:
tokens=x,y,m-n - specifies which tokens from each line are to
be passed to the for body for each iteration.
This will cause additional variable names to
be allocated. The m-n form is a range,
specifying the mth through the nth tokens. If
the last character in the tokens= string is an
asterisk, then an additional variable is
allocated and receives the remaining text on
the line after the last token parsed.
This is testes on Win8x64 so I'm not even sure this will happen on all the range of Windows machines.
EDIT: Despite the accesible tokens are limited to 31 with this I can create more empty tokens :
setlocal disableDelayedExpansion
for /f "tokens=1-31,1-31,1-31" %%! in (
"33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 "
) do (
echo 1:[%%!-!] 30:[%%?-?] 31:[%%#-#] 32:[%%A-A] 33:[%%B-B] 34:[%%C-C] 35:[%%D-D] 36:[%%E-E] 37:[%%F-F] 38:[%%G-G] 90:[%%{-{]
)
edit. the maximum of the empty tokens is 250 (not sure how the extended ascii characters will be displayed between 0x02 and 0xFB):
#echo off
for /f "tokens=1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31" %% in (
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1") do (
echo 0x02-%%- 0x07-%%- 0xFE-%%ю- 0xFB-%%ы- 0xFA-%%ъ-
)
While i have no real idea of why the for command behaves as it does, there are some simple rules that match the for behaviour. And here we are talking only about the token clause. delims, eol, skip and usebackq other day
Step 1 - The tokens clause is found. The clause is parsed, and for each range requested (only one, start-end, *) it is determined if it is valid. It is discarded if it is not a valid request (not in the range 1-31 or not *) but if it is a valid request, for each element requested a "variable" is allocated (probably a table) to later hold the data retrieved for this token. At the same time, a "set" is defined (maybe a bitmap mask), setting that the token number x (the number used to identify the token in the tokens clause) will be retrieved. The same token can be requested several times, but in the "set" (or bitmask, ...) the only effect is to mark again that the token x will be retrieved.
Now the "set" contains the position of the valid (1-31, *) tokens that were requested.
Once after the parser ends to process the for configuration, the input file is readed into memory, or the command is executed to retrieve all its output into memory or the literal string is declared as the input buffer.
Step 2 - Prepare line parse. The table to hold the token data is initialized to blanks and a pointer set to the first position in the table (the first token). If the line has not been discarded by skip, eol or because it is empty, the tokenizer will scan the input buffer for tokens, else, search the end of the line and repeat step 2 for the new line found.
Step 3 - Parse the input buffer. Until the end of a line is reached, for each token found in the line its position, if in range (1-31 or * token), is checked against the "set" to determine if it has been requested or not (if this token number is in the set or if the * token is being handled). If it has been requested, its data is included in the "table"? in the position indicate by the table pointer, the pointer incremented and the tokenizer continues repeating step 3 until the end of the line is reached.
Step 4 - The end of the line has been reached. If any token has been retrieved or if the only token requested was * (test for /f "tokens=*" %a in (" ") do echo %a), execute the code in the do clause.
Step 5 - If the excution of the for has not been canceled and the end of the buffer has not been reached, there are more lines to process, back to step 2.
This set of steps reproduce all the observed behaviours in the question, but does not prove if this is the way the for command is coded.
Now, let's check it against the code in the question
:::this prints - 1:[i] 2:[] 3:[] 4:[] 5:[] 6:[] 7:[]
for /f "tokens=1,1,1,1,1,1,1" %%a in ("i ii iii iv v vi vii") do (
#echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
7 requested tokens, so 7 positions in the table that will be passed to the do code, but the only token that matches the "set" is the number 1
:::this prints - 1:[i] 2:[ii] 3:[iii] 4:[iv] 5:[] 6:[] 7:[%g]
for /f "tokens=2,3,1-4" %%a in ("i ii iii iv v vi vii") do (
#echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
6 requested tokens, 6 position in the table of tokens, and the "set" will only match 1,2,3,4
:::this prints - 1:[i] 2:[ii] 3:[iii] 4:[] 5:[] 6:[] 7:[%g]
for /f "tokens=1-3,1-3," %%a in ("i ii iii iv v vi vii") do (
#echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
6 requested tokens, 6 positions in the table of tokens, and the "set" will only match 1,2,3
setlocal disableDelayedExpansion
for /f "tokens=1-31,1-31,1-31" %%! in (
"33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 "
) do (
echo 1:[%%!-!] 30:[%%?-?] 31:[%%#-#] 32:[%%A-A] 33:[%%B-B] 34:[%%C-C] 35:[%%D-D] 36:[%%E-E] 37:[%%F-F] 38:[%%G-G] 90:[%%{-{]
)
93 requested tokens, 93 positions allocated in the table of tokens, the "set" will only match elements 1-31
edited more cases added to the question
the maximum of the empty tokens is 250
#echo off
for /f "tokens=1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31" %% in (
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1") do (
echo 0x02-%%- 0x07-%%- 0xFE-%%ю- 0xFB-%%ы- 0xFA-%%ъ-
)
No, you can request as much tokens as you can. I tested with 1625 1-30 and an aditional 31 (to ensure the parser keeps working), and it is handled without problems. Probably the limit is the line lengh. You can request up to 50530 (aprox) tokens (repeating 1-31,... to reach the line limit), but you are limited to get valid data for the 31 first tokens and blank data for the rest of the elements in the storage table, having to retrieve elements using a single character in the for replaceable parameter. Using %%^A (0x01, Alt-001) as the for replaceable parameter, you can request up to %%ÿ (0xFF, Alt-255)
I also don't have an explanation, but I do have an additional effect.
The * "token" is still accepted, but it will always be empty (dysfunctional) if there is at least one duplicate token request.
#echo off
for /f "tokens=1,1,2*" %%a in ("1 2 3 4") do (
echo a=%%a
echo b=%%b
echo c=%%c
echo d=%%d
echo e=%%e
)
-- OUTPUT --
a=1
b=2
c=
d=
e=%e

From matrix to array [J]

I'm working on J.
How can I convert this matrix:
(i.10)*/(i.10)
0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18
0 3 6 9 12 15 18 21 24 27
0 4 8 12 16 20 24 28 32 36
0 5 10 15 20 25 30 35 40 45
0 6 12 18 24 30 36 42 48 54
0 7 14 21 28 35 42 49 56 63
0 8 16 24 32 40 48 56 64 72
0 9 18 27 36 45 54 63 72 81
in array?
0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 . . .
I tried
(i.10)*/(i.10)"0
and then I've added
~.(i.10)*/(i.10)"0
to eliminate doubles, but it doesn't work.
If you want to turn a 2-dimensional table (matrix) into a 1-dimensional list (vector or "array", though in the J world "array" usually means "rectangle with any number [N] of dimensions"), you can use ravel (,):
matrix =: (i.10)*/(i.10)
list =: , matrix
list
0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 ...
Now using nub (~.) to remove duplicates should work:
~. list
0 1 2 3 4 5 6 7 8 9 10 12 ...
Note that, in J, the shape of an array usually carries important information, so flattening a matrix like this would be fairly unusual. Still, nothing stopping you.
BTW, you can save yourself some keystrokes by using the adverb ~, which will copy the left argument of a dyad to the right side as well, so you could just say:
matrix =: */~ i. 10
and get the same result as (i.10) */ (i.10).

how to add a factor to a sequence?

I'm analysing a dataset with some data-mining tools.The response variable has ten levels and I'm trying to create a classifier.
Here comes the problem.When using nnet and bagging function,the result is not that good and the 5th level is even not in the prediction.
I want to use a confusion matrix to analyse the classifier.but as the 5th level is not shown in the prediction I can't get a well-formed matrix.So how can I get a well-formed matrix?i.e. I want a 10*10 matrix.
The confusion matrix:
library("mda")#This is where **confusion** comes from
> confusion(pre.bag$class,CLASS)#here confusion acts like table
true
predicted 1 2 3 4 6 7 8 9 10 5
1 338 9 6 0 5 12 10 1 15 46
2 9 549 1 59 18 0 3 0 0 6
3 18 1 44 0 0 0 2 0 0 4
4 0 1 0 21 0 0 0 0 0 0
6 2 13 0 1 299 2 9 0 0 0
7 5 2 1 0 10 231 6 0 1 0
8 0 0 0 0 0 5 76 0 0 0
9 5 1 0 0 0 0 0 62 0 0
10 7 3 1 0 0 2 1 6 181 16
attr(,"error")
[1] 0.1231743
attr(,"mismatch")
[1] 0.03386642
Try this:
pred <- factor(pre.bag$class, levels=levels(CLASS) )
confusion(pre.bag$class, CLASS)
(Tested with an fda-object.)

Resources