Why does FOR /F set empty values for the remaining tokens when token numbers are repeated? - batch-file

Not sure if the question is clear enough so here's an example:
:::this prints - 1:[i] 2:[] 3:[] 4:[] 5:[] 6:[] 7:[]
for /f "tokens=1,1,1,1,1,1,1" %%a in ("i ii iii iv v vi vii") do (
@echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
:::this prints - 1:[i] 2:[ii] 3:[iii] 4:[iv] 5:[] 6:[] 7:[%g]
for /f "tokens=2,3,1-4" %%a in ("i ii iii iv v vi vii") do (
@echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
:::this prints - 1:[i] 2:[ii] 3:[iii] 4:[] 5:[] 6:[] 7:[%g]
for /f "tokens=1-3,1-3," %%a in ("i ii iii iv v vi vii") do (
@echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
In brief, if there are repeated numbers in the list of tokens (it doesn't matter whether they are in ranges like n-m or listed one by one with commas), the same number of the remaining accessed tokens have empty values.
This behavior is not documented anywhere (or at least I couldn't find it). Here's the part of the FOR help that concerns tokens:
tokens=x,y,m-n - specifies which tokens from each line are to
be passed to the for body for each iteration.
This will cause additional variable names to
be allocated. The m-n form is a range,
specifying the mth through the nth tokens. If
the last character in the tokens= string is an
asterisk, then an additional variable is
allocated and receives the remaining text on
the line after the last token parsed.
This was tested on Win8 x64, so I'm not sure whether it happens on every version of Windows.
EDIT: Although the accessible tokens are limited to 31, with this I can create more empty tokens:
setlocal disableDelayedExpansion
for /f "tokens=1-31,1-31,1-31" %%! in (
"33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 "
) do (
echo 1:[%%!-!] 30:[%%?-?] 31:[%%#-#] 32:[%%A-A] 33:[%%B-B] 34:[%%C-C] 35:[%%D-D] 36:[%%E-E] 37:[%%F-F] 38:[%%G-G] 90:[%%{-{]
)
Edit: the maximum number of empty tokens is 250 (not sure how the extended ASCII characters between 0x02 and 0xFB will be displayed):
@echo off
for /f "tokens=1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31" %% in (
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1") do (
echo 0x02-%%- 0x07-%%- 0xFE-%%ю- 0xFB-%%ы- 0xFA-%%ъ-
)

While I have no real idea why the for command behaves as it does, there are some simple rules that match its behaviour. Here we are talking only about the tokens clause; delims, eol, skip and usebackq are for another day.
Step 1 - The tokens clause is found. The clause is parsed, and each request (a single number, a start-end range, or *) is checked for validity. A request that is not valid (not in the range 1-31 and not *) is discarded; for a valid request, a "variable" is allocated for each element requested (probably a slot in a table) to later hold the data retrieved for this token. At the same time, a "set" is maintained (maybe a bitmask) recording that token number x (the number used to identify the token in the tokens clause) will be retrieved. The same token can be requested several times, but in the "set" the only effect is to mark again that token x will be retrieved.
Now the "set" contains the positions of the valid (1-31, *) tokens that were requested.
Once the parser has finished processing the for configuration, the input file is read into memory, or the command is executed and all its output is retrieved into memory, or the literal string is declared as the input buffer.
Step 2 - Prepare to parse a line. The table that holds the token data is initialized to blanks and a pointer is set to the first position in the table (the first token). If the line has not been discarded by skip, by eol, or because it is empty, the tokenizer scans the input buffer for tokens; otherwise it searches for the end of the line and repeats step 2 for the next line.
Step 3 - Parse the input buffer. Until the end of the line is reached, for each token found in the line, its position, if in range (1-31 or the * token), is checked against the "set" to determine whether it has been requested (whether this token number is in the set or the * token is being handled). If it has been requested, its data is stored in the "table" at the position indicated by the table pointer, the pointer is incremented, and the tokenizer keeps repeating step 3 until the end of the line is reached.
Step 4 - The end of the line has been reached. If any token has been retrieved, or if the only token requested was * (test with for /f "tokens=*" %a in (" ") do echo %a), the code in the do clause is executed.
Step 5 - If the execution of the for has not been cancelled and the end of the buffer has not been reached, there are more lines to process, so go back to step 2.
This set of steps reproduces all the behaviours observed in the question, but it does not prove that this is how the for command is actually coded.
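The steps above can be sketched as a small Python simulation. This is only a model of the hypothesis, not the real cmd.exe implementation: the function name `simulate_for_f` is made up for illustration, tokens are split on whitespace only, and the `*` token, skip, eol and delims are deliberately not modeled.

```python
def simulate_for_f(tokens_clause, line):
    """Model of the hypothesized FOR /F token handling (not the real cmd.exe code)."""
    slots = 0          # step 1: one table position per requested element
    requested = set()  # step 1: the "set"; duplicates only mark the same number again
    for part in tokens_clause.split(","):
        if not part:
            continue   # empty element, e.g. the trailing comma in "1-3,1-3,"
        start, _, end = part.partition("-")
        low = int(start)
        high = int(end) if end else low
        for number in range(low, high + 1):
            if 1 <= number <= 31:      # only tokens 1-31 are valid requests
                slots += 1
                requested.add(number)

    table = [""] * slots               # step 2: table initialized to blanks
    position = 0                       # step 2: pointer to the first table position
    for number, token in enumerate(line.split(), start=1):   # step 3
        if number in requested and position < slots:
            table[position] = token
            position += 1
    return table


if __name__ == "__main__":
    text = "i ii iii iv v vi vii"
    for clause in ("1,1,1,1,1,1,1", "2,3,1-4", "1-3,1-3,"):
        print(clause, simulate_for_f(clause, text))
```

Each returned list has one entry per allocated variable, so anything beyond its length corresponds to an unassigned variable such as %%g printing as %g.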
Now, let's check it against the code in the question
:::this prints - 1:[i] 2:[] 3:[] 4:[] 5:[] 6:[] 7:[]
for /f "tokens=1,1,1,1,1,1,1" %%a in ("i ii iii iv v vi vii") do (
@echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
7 requested tokens, so 7 positions in the table that will be passed to the do code, but the only token that matches the "set" is the number 1
:::this prints - 1:[i] 2:[ii] 3:[iii] 4:[iv] 5:[] 6:[] 7:[%g]
for /f "tokens=2,3,1-4" %%a in ("i ii iii iv v vi vii") do (
@echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
6 requested tokens, 6 positions in the table of tokens, and the "set" will only match 1,2,3,4
:::this prints - 1:[i] 2:[ii] 3:[iii] 4:[] 5:[] 6:[] 7:[%g]
for /f "tokens=1-3,1-3," %%a in ("i ii iii iv v vi vii") do (
@echo 1:[%%a] 2:[%%b] 3:[%%c] 4:[%%d] 5:[%%e] 6:[%%f] 7:[%%g]
)
6 requested tokens, 6 positions in the table of tokens, and the "set" will only match 1,2,3
setlocal disableDelayedExpansion
for /f "tokens=1-31,1-31,1-31" %%! in (
"33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 "
) do (
echo 1:[%%!-!] 30:[%%?-?] 31:[%%#-#] 32:[%%A-A] 33:[%%B-B] 34:[%%C-C] 35:[%%D-D] 36:[%%E-E] 37:[%%F-F] 38:[%%G-G] 90:[%%{-{]
)
93 requested tokens, 93 positions allocated in the table of tokens, the "set" will only match elements 1-31
Edited: more cases were added to the question.
the maximum of the empty tokens is 250
@echo off
for /f "tokens=1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31,1-31" %% in (
"1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1") do (
echo 0x02-%%- 0x07-%%- 0xFE-%%ю- 0xFB-%%ы- 0xFA-%%ъ-
)
No, you can request as many tokens as you want. I tested with 1625 repetitions of 1-30 and an additional 31 (to ensure the parser keeps working), and it is handled without problems. The limit is probably the line length. You can request up to approximately 50530 tokens (repeating 1-31,... until the line limit is reached), but you only get valid data for the first 31 tokens and blank data for the rest of the elements in the storage table, and you have to retrieve the elements using a single character as the for replaceable parameter. Using %%^A (0x01, Alt-001) as the for replaceable parameter, you can reach up to %%ÿ (0xFF, Alt-255).

I also don't have an explanation, but I do have an additional effect.
The * "token" is still accepted, but it will always be empty (dysfunctional) if there is at least one duplicate token request.
@echo off
for /f "tokens=1,1,2*" %%a in ("1 2 3 4") do (
echo a=%%a
echo b=%%b
echo c=%%c
echo d=%%d
echo e=%%e
)
-- OUTPUT --
a=1
b=2
c=
d=
e=%e

Related

Changing snps in 00, 11, 20 in a file to biallelic letter allele using another file which has the nucleotides as map file

I have a raw.txt file:
FID IID FA MO SEX PHENO SNP1 SNP2 SNP3 SNP4
1 1 0 0 1 1 20 00 20 11
1 2 0 0 1 1 11 00 20 20
1 3 0 0 1 1 11 20 11 20
1 4 0 0 1 1 00 11 11 20
A snp.txt file:
1 SNP1 20 A G
1 SNP2 45 T C
1 SNP3 56 A G
1 SNP4 80 C G
My output file should look like this (after converting the numbers from column 7 onward in raw.txt to letters, based on columns 4 and 5 of snp.txt):
FID IID FA MO SEX PHENO SNP1 SNP2 SNP3 SNP4
1 1 0 0 1 1 AA CC AA CG
1 2 0 0 1 1 AG CC AA CC
1 3 0 0 1 1 AG TT AG CC
1 4 0 0 1 1 GG TC AG CC
Column 2 of snp.txt holds the headers of raw.txt starting from column 7. Columns 4 and 5 of snp.txt are the minor and major alleles of the SNP named in column 2. I want the columns under SNP1, SNP2, SNP3 and SNP4, which are in 0/1/2 format, to be converted to ACGT format using columns 4 and 5 as the map.
The columns SNP1, SNP2,SNP3 and SNP4 of raw.txt represent 0,1 or 2 copies of minor allele (4th column of snp.txt file). Column 5 is the major allele. If SNP1 is 20 as shown in raw.txt, there are 2 copies of the minor allele, which according to snp.txt is A. Therefore 20 should change to AA (The 2 in 20 is a count of the minor allele A). SNP1 11 indicates that there is 1 copy of the minor allele. Therefore 11 should be AG. SNP1 00 indicates that there is no copy of the minor allele but only major alleles. Therefore 00 should be GG (2 copies of the letter in column 5) of file snp.txt.
In reality, I have over 65,000 SNPs, which means raw.txt has that many columns. I have the code below (code I found on Stack Overflow and edited a bit):
awk 'NR==FNR {a[$2,20]=$4$4; a[$2,11]=$4$5; a[$2,"00"]=$5$5; next} $7~/^[0-2]/ {
$7=a["SNP1",$7]; $8=a["SNP2",$8]; $9=a["SNP3",$9]; $10=a["SNP4",$10]}1' snp.txt raw.txt > output.txt
This does what I want if raw.txt has only 4 SNPs. I do not know how to make it loop through the fields from column 7 of raw.txt onward when I have over 65,000 SNPs. I want code (preferably awk) that can loop through the many columns of raw.txt and change the SNPs from the 00, 11, 20 format to bi-allelic letter format. Thank you.
Your awk is good! Here is how to make it work for a variable number of SNPs.
> cat tst.awk
NR==FNR {
    snp[$2 "20"] = $4 $4
    snp[$2 "11"] = $4 $5
    snp[$2 "00"] = $5 $5
    next
}
FNR==1 { # read the columns/snps
    for (i=7;i<=NF;i++) col[i] = $i
    print
    next
}
{
    for (i=7;i<=NF;i++) $i = snp[col[i] $i]
    print
}
Usage:
> awk -f tst.awk snp.txt raw.txt
FID IID FA MO SEX PHENO SNP1 SNP2 SNP3 SNP4
1 1 0 0 1 1 AA CC AA CG
1 2 0 0 1 1 AG CC AA CC
1 3 0 0 1 1 AG TT AG CC
1 4 0 0 1 1 GG TC AG CC
The modification is that we read the header and save the SNP names, and later use them for the mapping. Both actions are done with a typical for loop, from the first column we want to the last column (NF). The rest is what you were already doing, with somewhat clearer syntax.
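For readers who want to check the logic outside awk, here is a rough Python sketch of the same two-pass idea: build the map from snp.txt, then use the header of raw.txt to pick the map for each column. The function name `convert_genotypes` is hypothetical, and the sketch assumes whitespace-separated fields and only the codes 20, 11 and 00 (any other code raises a KeyError).

```python
def convert_genotypes(snp_lines, raw_lines):
    """Translate 20/11/00 genotype codes into allele letters, column by column."""
    mapping = {}
    for line in snp_lines:
        _chrom, name, _pos, minor, major = line.split()
        mapping[name, "20"] = minor + minor   # two copies of the minor allele
        mapping[name, "11"] = minor + major   # one copy of each allele
        mapping[name, "00"] = major + major   # no copy of the minor allele

    header = raw_lines[0].split()
    output = [" ".join(header)]
    for line in raw_lines[1:]:
        fields = line.split()
        # Columns 7 and up (index 6 and up) hold genotype codes.
        for i in range(6, len(fields)):
            fields[i] = mapping[header[i], fields[i]]
        output.append(" ".join(fields))
    return output


if __name__ == "__main__":
    snp = ["1 SNP1 20 A G", "1 SNP2 45 T C", "1 SNP3 56 A G", "1 SNP4 80 C G"]
    raw = [
        "FID IID FA MO SEX PHENO SNP1 SNP2 SNP3 SNP4",
        "1 1 0 0 1 1 20 00 20 11",
        "1 2 0 0 1 1 11 00 20 20",
    ]
    print("\n".join(convert_genotypes(snp, raw)))
```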

Aggregate function with window function filtered by time

I have a table with data about buses while making their routes. There are columns for:
bus trip id (different each time a bus starts the route from the first stop)
bus stop id
datetime column that indicates the moment that the bus leaves each bus stop
integer that indicates how many passengers entered the bus in that stop
There is no information about how many passengers get off the bus on each stop, so I have to make an estimation supposing that once they get on the bus, they stay on it for 30 minutes. The trip lasts about 70 minutes from the first to the last stop.
I am trying to aggregate results on each stop using
SUM(iPassengersIn) OVER (
PARTITION BY tripDate, tripId
ORDER BY busStopOrder
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) total_passengers
The problem is that I can add passengers since the beginning of the trip, but not since "30 minutes ago" on each stop. How could I limit the aggregation to "the last 30 minutes" on each row in order to estimate the occupation between stops?
This is a subset of my data:
trip_date trip_id bus_stop_order minutes_since_trip_start passengers_in trip_total_passengers
2020-06-08 374910 0 0 0 0
2020-06-08 374910 1 3 0 0
2020-06-08 374910 2 5 1 1
2020-06-08 374910 3 8 0 1
2020-06-08 374910 4 9 0 1
2020-06-08 374910 5 12 0 1
2020-06-08 374910 6 13 0 1
2020-06-08 374910 7 13 0 1
2020-06-08 374910 8 15 0 1
2020-06-08 374910 9 16 0 1
2020-06-08 374910 10 16 0 1
2020-06-08 374910 11 17 0 1
2020-06-08 374910 12 18 2 3
2020-06-08 374910 13 20 0 3
2020-06-08 374910 14 22 0 3
2020-06-08 374910 15 24 0 3
2020-06-08 374910 16 25 0 3
2020-06-08 374910 17 28 2 5
2020-06-08 374910 18 30 1 6
2020-06-08 374910 19 31 0 6
2020-06-08 374910 20 33 0 6
2020-06-08 374910 21 41 3 9
2020-06-08 374910 22 44 3 12
2020-06-08 374910 23 45 4 16
2020-06-08 374910 24 48 2 18
2020-06-08 374910 25 48 2 20
2020-06-08 374910 26 50 0 20
2020-06-08 374910 27 51 0 20
2020-06-08 374910 28 51 0 20
2020-06-08 374910 29 53 0 20
2020-06-08 374910 30 55 0 20
2020-06-08 374910 31 58 0 20
For the row with bus_stop_order 21 (41 minutes into the bus trip), where 3 passengers enter the bus, I have to sum only the passengers that entered the bus between minute 11 and 41. Thus, the passenger that entered the bus in the 2nd bus stop (5 minutes into the trip) should be excluded.
That should be applied for every row.
The only thing I can think of is:
select
    trip_date,
    trip_id,
    minutes_since_trip_start,
    v.total_passengers
from
    #t t1
    outer apply (
        select sum(passengers_in)
        from #t t2
        where
            t1.trip_date = t2.trip_date
            and t1.trip_id = t2.trip_id
            and t2.bus_stop_order <= t1.bus_stop_order
            and t2.minutes_since_trip_start >= t1.minutes_since_trip_start - 30
    ) v(total_passengers)
order by
    trip_date,
    trip_id,
    minutes_since_trip_start
;
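To sanity-check what the query should return, the same filter can be sketched in plain Python for a single trip. The function name `occupancy_last_30` and the tuple layout are assumptions for illustration; the two conditions mirror the query (stops up to the current one, boarded no more than 30 minutes earlier). Like the correlated subquery, this is quadratic in the number of stops.

```python
def occupancy_last_30(rows, window=30):
    """rows: (minutes_since_trip_start, passengers_in) tuples for one trip, in stop order."""
    totals = []
    for i, (minute, _passengers) in enumerate(rows):
        # Sum boardings at this stop and earlier stops that fall inside the window.
        total = sum(p for m, p in rows[: i + 1] if m >= minute - window)
        totals.append(total)
    return totals


if __name__ == "__main__":
    # A few rows taken from the sample data (stops 0, 2, 12, 17, 18, 21 and 22).
    sample = [(0, 0), (5, 1), (18, 2), (28, 2), (30, 1), (41, 3), (44, 3)]
    for (minute, _), total in zip(sample, occupancy_last_30(sample)):
        print(minute, total)
```

For the stop at minute 41, the passenger who boarded at minute 5 falls outside the window, as the question requires.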

Insert a space to separate a database

Good morning, I have the following data set, but with thousands more rows:
215 22221121110110110101
212 22221121110110110101
468 22221121110110110101
1200 22221121110110110101
400 22221121110110110101
100 22221121110110110101
200 22221121110110110101
And I need to separate it into columns this way:
215 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
212 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
468 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
1200 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
400 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
100 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
200 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
I tried to use a simple sed, but it doesn't work:
sed -i -e 's// /g'
Perl to the rescue!
perl -lane 'push @F, split //, pop @F; print "@F"'
-n reads the input line by line
-l removes newlines from input and adds them back to output
-a splits each line on whitespace into the @F array
pop removes the last element of an array and returns it, in this case it returns the second "word"
split turns a string into a list, with // it splits the string into individual characters
push is dual to pop, it adds the elements to the end of an array (in this case, it adds individual characters to the array currently containing only the first column)
when printing an array in double quotes, by default the members are separated by spaces.
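As a cross-check of what the one-liner does, the same pop/split/push/join sequence can be written as a short Python function. The name `space_out` is made up for this sketch, which assumes every line has exactly two whitespace-separated columns.

```python
def space_out(line):
    """Keep the first column and split the second column into single characters."""
    first, digits = line.split()
    # Unpacking a string yields its characters, like Perl's split //.
    return " ".join([first, *digits])


if __name__ == "__main__":
    print(space_out("215 22221121110110110101"))
    print(space_out("1200 22221121110110110101"))
```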
You can use the GNU awk gensub function.
gawk '{$2=gensub(/./, "& ", "g", $2)}1' file
To avoid the extra space at the end of the line that some other solutions produce, you can use this:
$ awk '{print $1 gensub(/./," &","g",$2)}'
Could you please try following with GNU awk and do let me know if this helps you.
awk '{num=split($2,a,"");printf $1;for(i=0;i<=num;i++){printf("%s%s",a[i],i==num?RS:FS)};}' Input_file
Using awk's gsub(regexp, replacement [, target])
awk '{gsub(/./," &",$2); print $1 $2}' infile
Explanation:
gsub(/./," &",$2) matches any character (except line terminators) in the second column of the current record and replaces it with a single space followed by the same character.
The Dot Matches (Almost) Any Character. In regular expressions, the dot or period is one of the most commonly used metacharacters. The dot matches a single character, without caring what that character is. The only exception are line break characters.
If the special character & appears in replacement, it stands for the precise substring that was matched by regexp.
Test Results:
$ cat infile
215 22221121110110110101
212 22221121110110110101
468 22221121110110110101
1200 22221121110110110101
400 22221121110110110101
100 22221121110110110101
200 22221121110110110101
$ awk '{gsub(/./," &",$2); print $1 $2}' infile
215 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
212 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
468 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
1200 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
400 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
100 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
200 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
speed comparison of some of the answers
$ perl -0777 -ne 'print $_ x 1000000' ip.txt > f1
$ du -h f1
169M f1
time given for two consecutive runs
$ time perl -lane 'push @F, split //, pop @F; print "@F"' f1 > t1
real 0m34.004s
real 0m33.729s
$ time perl -lane 'print join " ",$F[0],split //,$F[1]' f1 > t2
real 0m23.291s
real 0m23.935s
$ time LC_ALL=C awk '{gsub(/./," &",$2); print $1 $2}' f1 > t3
real 0m30.834s
real 0m30.723s
$ diff -s t1 t2
Files t1 and t2 are identical
$ diff -s t1 t3
Files t1 and t3 are identical
Another approach with bash
while read a b;do
printf "%s" $a
while read -n1 c;do
printf " %c" "$c"
done<<<$b
echo
done<lefile
This might work for you (GNU sed):
sed 's/ /\n/;h;s/\B/ /g;H;g;s/\n.*\n/ /' file
Replace the first space by a newline, copy the line, replace all non-word boundaries with a space, append the changed line to the copy and then rearrange the line.
How about coreutils:
paste -d '' \
<(cut -d' ' -f1 infile ) \
<(cut -d' ' -f2 infile | sed 's/./ &/g')
Output:
215 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
212 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
468 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
1200 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
400 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
100 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
200 2 2 2 2 1 1 2 1 1 1 0 1 1 0 1 1 0 1 0 1
Try
sed -i -e 's/\(.\)/\1 /g'
That is, capture character by character, then replace the capture with itself, plus a space.

MATLAB: Remove sub-arrays from a multidimensional array into an array of ones

I would like to construct a function
[B, ind] = extract_ones(A)
which removes some sub-arrays from a binary array A of arbitrary dimension, such that the remaining array B is the largest possible array containing only 1's. I would also like to record in ind where each of the 1's in B comes from.
Example 1
Assume A is a 2-D array as shown
A =
1 1 0 0 0 1
1 1 1 0 1 1
0 0 0 1 0 1
1 1 0 1 0 1
1 1 0 1 0 1
1 1 1 1 1 1
After removing A(3,:) and A(:,3:5), we have the output B
B =
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
which is the largest array with only ones by removing rows and columns of A.
The fifteen 1's of B correspond to
A(1,1) A(1,2) A(1,6)
A(2,1) A(2,2) A(2,6)
A(4,1) A(4,2) A(4,6)
A(5,1) A(5,2) A(5,6)
A(6,1) A(6,2) A(6,6)
respectively, or equivalently
A(1) A(7) A(31)
A(2) A(8) A(32)
A(4) A(10) A(34)
A(5) A(11) A(35)
A(6) A(12) A(36)
so, the output ind looks like (of course ind's shape does not matter):
ind = [1 2 4 5 6 7 8 10 11 12 31 32 34 35 36]
Example 2
If the input A is constructed by
A = ones(6,3,4,3);
A(2,2,2,2) = 0;
A(4,1,3,3) = 0;
A(1,1,4,2) = 0;
A(1,1,4,1) = 0;
Then, by deleting the minimum cuboids containing A(2,2,2,2), A(4,1,3,3), A(1,1,4,2) and A(1,1,4,1), i.e. after deleting these entries
A(2,:,:,:)
A(:,1,:,:)
Then the remaining array B will be composed of 1's only, and the ones in B correspond to
A([1,3:6],2:3,1:4,1:3)
So, the output ind lists the subscripts transformed into indices, i.e.
ind = [7 9 10 11 12 13 15 16 17 18 25 27 28 29 30 31 33 34 35 36 43 45 46 47 48 49 51 52 53 54 61 63 64 65 66 67 69 70 71 72 79 81 82 83 84 85 87 88 89 90 97 99 100 101 102 103 105 106 107 108 115 117 118 119 120 121 123 124 125 126 133 135 136 137 138 139 141 142 143 144 151 153 154 155 156 157 159 160 161 162 169 171 172 173 174 175 177 178 179 180 187 189 190 191 192 193 195 196 197 198 205 207 208 209 210 211 213 214 215 216]
Since the array to be processed this way is 8-D and has to be processed more than once, can anyone suggest how to write a program that does this task fast?
My work so far [Added at 2 am (GMT-4), 2nd Aug 2017]
My idea was to delete, one by one, the sub-arrays with the largest proportion of zeros. Here is my work so far:
Inds = reshape(1:numel(A),size(A)); % Keep track on which 1's survive.
cont = true;
while cont
sz = size(A);
zero_percentage = 0;
Test_location = [];
% This nested for loops are for determining which sub-array of A has the
% maximum proportion of zeros.
for J = 1 : ndims(A)
for K = 1 : sz(J)
% Location is in the form of (_,_,_,...,_)
% where the J-th blank is K, the other blanks are colons.
Location = strcat('(',repmat(':,',1,(J-1)),int2str(K),repmat(',:',1,(ndims(A)-J)),')');
Test_array = eval(strcat('A',Location,';'));
N = numel(Test_array);
while numel(Test_array) ~= 1
Test_array = sum(Test_array);
end
test_zero_percentage = 1 - (Test_array/N);
if test_zero_percentage > zero_percentage
zero_percentage = test_zero_percentage;
Test_location = Location;
end
end
end
% Delete the array with maximum proportion of zeros
eval(strcat('A',Test_location,'= [];'))
eval(strcat('Inds',Test_location,'= [];'))
% Determine if there are still zeros in A. If there are, continue the while loop.
cont = A;
while numel(cont) ~= 1
cont = prod(cont);
end
cont = ~logical(cont);
end
But I encountered two problems:
1) It may not be efficient to check all sub-arrays in all dimensions one by one.
2) The result does not contain the largest possible rectangular block of ones. For example, I tested my work using a 2-dimensional binary array A
A =
0 0 0 1 1 0
0 1 1 0 1 1
1 0 1 1 1 1
1 0 0 1 1 1
0 1 1 0 1 1
0 1 0 0 1 1
1 0 0 0 1 1
1 0 0 0 0 0
It should return me the result as
B =
1 1
1 1
1 1
1 1
1 1
1 1
Inds =
34 42
35 43
36 44
37 45
38 46
39 47
But, instead, the code returned me this:
B =
1 1 1
1 1 1
1 1 1
Inds =
10 34 42
13 37 45
14 38 46
My work so far 2 [Added at 12 noon (GMT-4), 2nd Aug 2017]
Here is my current amendment. It may not provide the best result.
It may give a reasonable approximation to the problem, and it does not return an empty Inds. But I am still hoping that there is a better solution.
function [B, Inds] = Finding_ones(A)
Inds = reshape(1:numel(A),size(A)); % Keep track on which 1's survive.
sz0 = size(A);
cont = true;
while cont
sz = size(A);
zero_percentage = 0;
Test_location = [];
% This nested for loops are for determining which sub-array of A has the
% maximum proportion of zeros.
for J = 1 : ndims(A)
for K = 1 : sz(J)
% Location is in the form of (_,_,_,...,_)
% where the J-th blank is K, the other blanks are colons.
Location = strcat('(',repmat(':,',1,(J-1)),int2str(K),repmat(',:',1,(ndims(A)-J)),')');
Test_array = eval(strcat('A',Location,';'));
N = numel(Test_array);
Test_array = sum(Test_array(:));
test_zero_percentage = 1 - (Test_array/N);
if test_zero_percentage > zero_percentage
eval(strcat('Testfornumel = numel(A',Location,');'))
if Testfornumel < numel(A) % Preventing the A from being empty
zero_percentage = test_zero_percentage;
Test_location = Location;
end
end
end
end
% Delete the array with maximum proportion of zeros
eval(strcat('A',Test_location,'= [];'))
eval(strcat('Inds',Test_location,'= [];'))
% Determine if there are still zeros in A. If there are, continue the while loop.
cont = A;
while numel(cont) ~= 1
cont = prod(cont);
end
cont = ~logical(cont);
end
B = A;
% command = 'i1, i2, ... ,in'
% here, n is the number of dimensions of A.
command = 'i1';
for J = 2 : length(sz0)
command = strcat(command,',i',int2str(J));
end
Inds = reshape(Inds,numel(Inds),1); %#ok<NASGU>
eval(strcat('[',command,'] = ind2sub(sz0,Inds);'))
% Reform Inds into a 2-D matrix, which each column indicate the location of
% the 1 originated from A.
Inds = squeeze(eval(strcat('[',command,']')));
Inds = reshape(Inds',length(sz0),numel(Inds)/length(sz0));
end
It seems a difficult problem to solve, since the order of deletion can change a lot in the final result. If in your first example you start with deleting all the columns that contain a 0, you don't end up with the desired result.
The code below removes the row or column with the most zeros and keeps going until it's only ones. It keeps track of the rows and columns that are deleted to find the indexes of the remaining ones.
function [B,ind] = extract_ones( A )
if ~islogical(A),A=(A==1);end
if ~any(A(:)),B=[];ind=[];return,end
B=A;cdel=[];rdel=[];
while ~all(B(:))
[I,J] = ind2sub(size(B),find(B==0));
ih=histcounts(I,[0.5:1:size(B,1)+0.5]); %zero's in rows
jh=histcounts(J,[0.5:1:size(B,2)+0.5]); %zero's in columns
if max(ih)>max(jh)
idxr=find(ih==max(ih),1,'first');
B(idxr,:)=[];
%store deletion
rdel(end+1)=idxr+sum(rdel<=idxr);
elseif max(ih)==max(jh)
idxr=find(ih==max(ih),1,'first');
idxc=find(jh==max(jh),1,'first');
B(idxr,:)=[];
B(:,idxc)=[];
%store deletions
rdel(end+1)=idxr+sum(rdel<=idxr);
cdel(end+1)=idxc+sum(cdel<=idxc);
else
idxc=find(jh==max(jh),1,'first');
B(:,idxc)=[];
%store deletions
cdel(end+1)=idxc+sum(cdel<=idxc);
end
end
A(rdel,:)=0;
A(:,cdel)=0;
ind=find(A);
Second try: Start with a seed point and try to grow the matrix in all dimensions. The result is the start and finish point in the matrix.
function [ res ] = seed_grow( A )
if ~islogical(A),A=(A==1);end
if ~any(A(:)),res={};return,end
go = true;
dims=size(A);
ind = cell([1 length(dims)]); %cell to store find results
seeds=A;maxmat=0;
while go %main loop to remove all posible seeds
[ind{:}]=find(seeds,1,'first');
S = [ind{:}]; %the seed
St = [ind{:}]; %the end of the seed
go2=true;
val_dims=1:length(dims);
while go2 %loop to grow each dimension
D=1;
while D<=length(val_dims) %add one to each dimension
St(val_dims(D))=St(val_dims(D))+1;
I={};
for ct = 1:length(S),I{ct}=S(ct):St(ct);end %generate indices
if St(val_dims(D))>dims(val_dims(D))
res=false;%outside matrix
else
res=A(I{:});
end
if ~all(res(:)) %invalid addition to dimension
St(val_dims(D))=St(val_dims(D))-1; %undo
val_dims(D)=[]; D=D-1; %do not try again
if isempty(val_dims),go2=false;end %end of growth
end
D=D+1;
end
end
%evaluate the result
mat = prod((St+1)-S); %size of matrix
if mat>maxmat
res={S,St};
maxmat=mat;
end
%tried to expand, now remove seed option
for ct = 1:length(S),I{ct}=S(ct):St(ct);end %generate indices
seeds(I{:})=0;
if ~any(seeds(:)),go=false;end
end
end
I tested it using your matrix:
A = [0 0 0 1 1 0
0 1 1 0 1 1
1 0 1 1 1 1
1 0 0 1 1 1
0 1 1 0 1 1
0 1 0 0 1 1
1 0 0 0 1 1
1 0 0 0 0 0];
[ res ] = seed_grow( A );
for ct = 1:length(res{1}),I{ct}=res{1}(ct):res{2}(ct);end %generate indices
B=A(I{:});
idx = reshape(1:numel(A),size(A));
idx = idx(I{:});
And got the desired result:
B =
1 1
1 1
1 1
1 1
1 1
1 1
idx =
34 42
35 43
36 44
37 45
38 46
39 47
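Because the greedy deletion order can change the result, it helps to have an exact reference for small 2-D inputs when checking a heuristic. The sketch below is a brute force, not a fast method: it tries every subset of rows and keeps the columns that contain only ones in those rows, so it is exponential in the number of rows and does not extend to the 8-D case. The function name `best_all_ones` is hypothetical, and the returned indices follow MATLAB's 1-based column-major linear indexing.

```python
from itertools import combinations


def best_all_ones(A):
    """Exhaustively find the rows/columns to keep so that the remainder is all ones."""
    n_rows, n_cols = len(A), len(A[0])
    best_area, best_rows, best_cols = 0, (), ()
    for k in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), k):
            # A column survives only if it holds a 1 in every kept row.
            cols = tuple(c for c in range(n_cols) if all(A[r][c] for r in rows))
            area = len(rows) * len(cols)
            if area > best_area:
                best_area, best_rows, best_cols = area, rows, cols
    # Convert 0-based (row, col) pairs to 1-based column-major linear indices.
    ind = sorted(c * n_rows + r + 1 for r in best_rows for c in best_cols)
    return best_area, ind


if __name__ == "__main__":
    A = [
        [1, 1, 0, 0, 0, 1],
        [1, 1, 1, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [1, 1, 0, 1, 0, 1],
        [1, 1, 0, 1, 0, 1],
        [1, 1, 1, 1, 1, 1],
    ]
    print(best_all_ones(A))
```

Unlike the seed-growing approach, this reference allows the kept rows and columns to be non-contiguous, which matches Example 1 in the question.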

SAS, assigning the same numbers to specific observations

I want to assign the same id number to every four observations. For example, if I have the following data
age marital gender id
45 1 0 1
33 1 1 1
68 0 1 1
27 1 0 1
43 0 0 2
37 0 1 2
19 1 1 2
40 1 1 2
25 1 0 3
38 1 1 3
57 0 0 3
50 1 0 3
51 1 1 4
44 0 1 4
69 1 0 4
39 0 1 4
The last column id is something I want to produce.
Plus, the dataset has 500,000+ observations.
Thanks in advance.
Slightly more compact:
id = ceil(_n_/4);
Use the integer function and the built-in _n_ variable (which increments for each observation):
id = int( (_n_-1)/4 )+1;
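The grouping arithmetic is easy to verify outside SAS. In this small Python sketch, `n` stands in for the 1-based observation counter _n_, and `group_id` is a made-up name; it checks that ceil(n/4) assigns 1 to observations 1-4, 2 to observations 5-8, and so on, and that the integer-division form (n-1)//4 + 1 gives the same ids.

```python
import math


def group_id(n, size=4):
    """Return the 1-based group number for the n-th observation (n starts at 1)."""
    return math.ceil(n / size)


if __name__ == "__main__":
    for n in range(1, 17):
        # Both columns should agree for every observation number.
        print(n, group_id(n), (n - 1) // 4 + 1)
```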
