Assigning a single value to all cells within a specified time period, matrix format - arrays

I have the following example dataset, which consists of the number of fish caught per check of a net. The nets are not checked at uniform intervals. Each row gives the day of the check in Julian days and the number of days the net had been fishing since it was last checked (or since its deployment, in the case of the first check).
http://textuploader.com/9ybp
Site_Number Check_Day_Julian Set_Duration_Days Fish_Caught
2 5 3 100
2 10 5 70
2 12 2 65
2 15 3 22
100 4 3 45
100 10 6 20
100 18 8 8
450 10 10 10
450 14 4 4
In any case, I would like to turn the raw data above into the following format:
http://textuploader.com/9y3t
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
2 0 0 100 100 100 70 70 70 70 70 65 65 22 22 22 0 0 0
100 0 45 45 45 20 20 20 20 20 20 8 8 8 8 8 8 8 8
450 10 10 10 10 10 10 10 10 10 10 4 4 4 4 0 0 0 0
This is a matrix which assigns the # of fish caught during the period to EACH of the days that were within that period. The columns of the matrix are Julian days, the rows are site numbers.
I have tried to do this with some matrix functions, but I have had much difficulty populating the cells for days that fall within a fishing period but for which I do not have a row of data.
I had posted my small bit of code here, but upon reflection, my approach is quite archaic and a bit off point. Can anyone suggest a method to convert the data into the matrix provided? I've been scratching my head and googling all day but now I am stumped.
Cheers,
C

Two answers, the second one is faster but a bit low level.
Solution #1:
library(IRanges)
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
cov <- coverage(split(ir, Site_Number),
weight=split(Fish_Caught, Site_Number),
width=max(end(ir)))
do.call(rbind, lapply(cov, as.vector))
})
Solution #2:
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
site <- factor(Site_Number, unique(Site_Number))
m <- matrix(0, length(levels(site)), max(end(ir)))
ind <- cbind(rep(site, width(ir)), as.integer(ir))
m[ind] <- rep(Fish_Caught, width(ir))
m
})

I don't see a super obvious matrix transformation here. This is all I've got, assuming the raw data is in a data.frame called dd:
dd$Site_Number <- factor(dd$Site_Number)
mm <- matrix(0, nrow=nlevels(dd$Site_Number), ncol=max(dd$Check_Day_Julian))
for(i in 1:nrow(dd)) {
  # a check on day d after n days of fishing covers days (d-n+1) through d
  mm[as.numeric(dd[i,1]), (dd[i,2]-dd[i,3]+1):dd[i,2]] <- dd[i,4]
}
mm
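For readers who want to check the interval logic outside R, here is a minimal Python sketch of the same fill. It assumes, as the expected output in the question implies, that a check on day d after n days of fishing covers days d-n+1 through d. The function name fill_catch_matrix and the tuple layout are made up for this illustration.

```python
def fill_catch_matrix(rows):
    """rows: (site, check_day, set_duration, fish_caught) tuples."""
    # Keep sites in order of first appearance.
    sites = []
    for site, _, _, _ in rows:
        if site not in sites:
            sites.append(site)
    n_days = max(check_day for _, check_day, _, _ in rows)
    matrix = {site: [0] * n_days for site in sites}
    for site, check_day, duration, fish in rows:
        # Days check_day-duration+1 .. check_day; list index is day - 1.
        for day in range(check_day - duration + 1, check_day + 1):
            matrix[site][day - 1] = fish
    return matrix


rows = [
    (2, 5, 3, 100), (2, 10, 5, 70), (2, 12, 2, 65), (2, 15, 3, 22),
    (100, 4, 3, 45), (100, 10, 6, 20), (100, 18, 8, 8),
    (450, 10, 10, 10), (450, 14, 4, 4),
]
for site, counts in fill_catch_matrix(rows).items():
    print(site, *counts)
```

Each printed line is a site number followed by one value per Julian day, which can be compared directly with the target matrix in the question.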

Related

Stat2Data package doesn't allow me to use the objects in it

Stat2Data package contains a lot of exemplary datasets. I get no errors when installing it or using the library function to call it. However, it doesn't allow me to work with the objects.
Anyone familiar with this package and know what I can do about it? Here's the code that I used:
install.packages("Stat2Data")
library(Stat2Data)
# Attempt 1
MedGPA_ds <- ggplot(MedGPA, aes(x = GPA, y = Acceptance))
# Attempt 2
MedGPA_ds <- ggplot(Stat2Data::MedGPA, aes(x = GPA, y = Acceptance))
You have missed just one step, the use of the data() function to call the specific dataset (MedGPA) contained within the Stat2Data package.
Try the following:
library(Stat2Data)
data(MedGPA)
head(MedGPA)
  Accept Acceptance Sex BCPM  GPA VR PS WS BS MCAT Apps
1      D          0   F 3.59 3.62 11  9  9  9   38    5
2      A          1   M 3.75 3.84 12 13  8 12   45    3
3      A          1   F 3.24 3.23  9 10  5  9   33   19
4      A          1   F 3.74 3.69 12 11  7 10   40    5
5      A          1   F 3.53 3.38  9 11  4 11   35   11
6      A          1   M 3.59 3.72 10  9  7 10   36    5
Happy coding!

2D MATRIX statement for 5 x 5 Grid

Given a 5 x 5 grid comprising tiles numbered from 1 to 25 and a set of 5 start-end point pairs.
For each pair, find a path from the start point to the end point.
The paths should meet the conditions below:
a) Only horizontal and vertical moves are allowed.
b) No two paths should overlap.
c) The paths should cover the entire grid.
Input consists of 5 lines.
Each line contains two space-separated integers, the starting and ending point.
Output: Print 5 lines, each consisting of space-separated integers giving the path for the corresponding start-end pair. Assume that such a path always exists. In case of multiple solutions, print any one of them.
Sample Input
1 22
4 17
5 18
9 13
20 23
Sample Output
1 6 11 16 21 22
4 3 2 7 12 17
5 10 15 14 19 18
9 8 13
20 25 24 23
I think there should be a restriction on the input, or some information about the start and end points is missing,
because with the following input covering the whole grid is not possible:
1 22,
6 7,
11 12,
16 17,
8 9
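The three conditions in the problem statement can be checked mechanically. The sketch below is a validator, not a solver: it assumes tiles are numbered row by row starting at 1, as the sample output suggests, and the name validate_paths is made up for this example.

```python
def validate_paths(pairs, paths, size=5):
    """Return True if the paths satisfy conditions a), b) and c)."""
    if len(pairs) != len(paths):
        return False
    seen = set()
    for (start, end), path in zip(pairs, paths):
        if not path or path[0] != start or path[-1] != end:
            return False
        for tile in path:
            # b) no tile may be used twice, within or across paths
            if not 1 <= tile <= size * size or tile in seen:
                return False
            seen.add(tile)
        for a, b in zip(path, path[1:]):
            # a) consecutive tiles must be horizontal or vertical neighbours
            row_a, col_a = divmod(a - 1, size)
            row_b, col_b = divmod(b - 1, size)
            if abs(row_a - row_b) + abs(col_a - col_b) != 1:
                return False
    # c) every tile of the grid must be covered
    return len(seen) == size * size


sample_pairs = [(1, 22), (4, 17), (5, 18), (9, 13), (20, 23)]
sample_paths = [
    [1, 6, 11, 16, 21, 22],
    [4, 3, 2, 7, 12, 17],
    [5, 10, 15, 14, 19, 18],
    [9, 8, 13],
    [20, 25, 24, 23],
]
print(validate_paths(sample_pairs, sample_paths))
```

Running a candidate answer through a check like this is a quick way to test a solver's output against all three conditions at once.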

check if ALL elements of a vector are in another vector

I need to loop through column 1 of a matrix and return (i) when I have come across ALL of the elements of another vector, which I can predefine.
check_vector = [1:43] %% I don't actually need to predefine this - I know I am looking for the numbers 1 to 43.
matrix_a column 1 (which is the only column I am interested in) looks like this, for example:
1
4
3
5
6
7
8
9
10
11
12
13
14
16
15
18
17
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
1
3
4
2
6
7
8
We want to loop through matrix_a and return the value of (i) when we have hit all of the numbers in the range 1 to 43.
In the above example we are looking for all the numbers from 1 to 43 and the iteration will end round about position 47 in matrix_a because it is at this point that we hit number '2' which is the last number to complete all numbers in the sequence 1 to 43.
It doesn't matter if we hit the same number several times on the way; we just want to know when we have reached all the numbers from the check vector, or in this example all the numbers in the sequence 1 to 43.
I've tried something like:
completed = []
for i = 1:43
complete(i) = find(matrix_a(:,1) == i,1,'first')
end
but not working.
Assuming A as the input column vector, two approaches could be suggested here.
Approach #1
With arrayfun -
check_vector = [1:43]
idx = find(arrayfun(@(n) all(ismember(check_vector,A(1:n))),1:numel(A)),1)+1
gives -
idx =
47
Approach #2
With customary bsxfun -
check_vector = [1:43]
idx = find(all(cumsum(bsxfun(@eq,A(:),check_vector),1)~=0,2),1)+1
To find the first entry at which all unique values of matrix_a have already appeared (that is, if check_vector consists of all unique values of matrix_a): the unique function almost gives the answer:
[~, ind] = unique(matrix_a, 'first');
result = max(ind);
Someone might have a more compact answer, but is this what you're after?
maxIndex = 0;
for ii=1:length(a)
[f,index] = ismember(ii,a);
maxIndex=max(maxIndex,max(index));
end
maxIndex
Here is one solution without a loop and without any conditions on the vectors to be compared. Given two vectors a and b, this code will find the smallest index idx where a(1:idx) contains all elements of b. idx will be 0 when b is not contained in a.
a = [ 1 4 3 5 6 7 8 9 10 11 12 13 14 16 15 18 17 19 20 21 22 23 24 25 26 ...
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 1 3 4 2 6 7 8 50];
b = 1:43;
[~, Loca] = ismember(b,a);
idx = max(Loca) * all(Loca);
Some details:
ismember(b,a) checks if all elements of b can be found in a and the output Loca lists the indices of these elements within a. The index will be 0, if the element cannot be found in a.
idx = max(Loca) then is the highest index in this list of indices, so the smallest one where all elements of b are found within a(1:idx).
all(Loca) finally checks if all indices in Loca are nonzero, i.e. if all elements of b have been found in a.
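The same idea can be written as a single pass that stops as soon as nothing is missing. This is a minimal Python sketch rather than a translation of any of the MATLAB answers; the name first_complete_index is made up, and it returns a 1-based index (0 if some element of b never appears) to match the convention described above.

```python
def first_complete_index(a, b):
    """Smallest 1-based idx such that a[:idx] contains every element of b."""
    needed = set(b)
    for idx, value in enumerate(a, start=1):
        needed.discard(value)   # duplicates on the way are harmless
        if not needed:
            return idx
    return 0                    # some element of b never appears in a


print(first_complete_index([3, 1, 3, 2, 1], [1, 2, 3]))  # 4
print(first_complete_index([1, 2], [1, 2, 3]))           # 0
```

Because it stops early, this variant does not need to scan the rest of the column once every required value has been seen.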

VTK Structured Point file

I am trying to parse a VTK file in C by extracting its point data and storing each point in a 3D array. However, the file I am working with has 9 shorts per point and I am having difficulty understanding what each number means.
I believe I understand most of the header information (please correct me if I have misunderstood):
ASCII: Type of file (ASCII or Binary)
DATASET: Type of dataset
DIMENSIONS: dims of voxels (x,y,z)
SPACING: Volume of each voxel (w,h,d)
ORIGIN: Unsure
POINT DATA: Total number of points/voxels (dimx.dimy.dimz)
I have looked at the documentation and I am still not getting an understanding on how to interpret the data. Could someone please help me understand or point me to some helpful resources
# vtk DataFile Version 3.0
vtk output
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 256 256 130
SPACING 1 1 1.3
ORIGIN 86.6449 -133.929 116.786
POINT_DATA 8519680
SCALARS scalars short
LOOKUP_TABLE default
0 0 0 0 0 0 0 0 0
0 0 7 2 4 5 3 3 4
4 5 5 1 7 7 1 1 2
1 6 4 3 3 1 0 4 2
2 3 2 4 2 2 0 2 6
...
thanks.
You are correct regarding the meaning of fields in the header.
ORIGIN corresponds to the coordinates of the 0-0-0 corner of the grid.
An example of a DATASET STRUCTURED_POINTS can be found in the documentation.
Starting from this, here is a small file with 6 shorts per point. Each line represents a point.
# vtk DataFile Version 2.0
Volume example
ASCII
DATASET STRUCTURED_POINTS
DIMENSIONS 3 4 2
ASPECT_RATIO 1 1 1
ORIGIN 0 0 0
POINT_DATA 24
SCALARS volume_scalars char 6
LOOKUP_TABLE default
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
0 2 2 3 4 5
1 2 2 3 4 5
2 2 2 3 4 5
0 3 2 8 9 10
1 3 2 8 9 10
2 3 2 8 9 10
0 4 2 8 9 10
1 4 2 8 9 10
2 4 2 8 9 10
0 1 3 18 19 20
1 1 3 18 19 20
2 1 3 18 19 20
0 2 3 18 19 20
1 2 3 18 19 20
2 2 3 18 19 20
0 3 3 24 25 26
1 3 3 24 25 26
2 3 3 24 25 26
0 4 3 24 25 26
1 4 3 24 25 26
2 4 3 24 25 26
The first 3 values of each point can be read as its x, y and z position, which shows the data layout: x changes faster than y, which changes faster than z in the file.
If you wish to store the data in an array a[2][4][3][6], read it in nested loops:
for(k=0;k<2;k++){ //z loop
    for(j=0;j<4;j++){ //y loop : y changes faster than z
        for(i=0;i<3;i++){ //x loop : x changes faster than y
            for(l=0;l<6;l++){
                fscanf(file,"%d",&a[k][j][i][l]);
            }
        }
    }
}
To read the header, fscanf() may be used as well :
int sizex,sizey,sizez;
char headerpart[100];
fscanf(file,"%s",headerpart);
if(strcmp(headerpart,"DIMENSIONS")==0){
fscanf(file,"%d%d%d",&sizex,&sizey,&sizez);
}
Note that fscanf() needs a pointer to the data (&sizex, not sizex). Since headerpart is an array of char, it decays to a pointer to its first element, so "%s",headerpart works fine and is equivalent to "%s",&headerpart[0]. The function strcmp() compares two strings and returns 0 if they are identical.
As your grid seems large, smaller files can be obtained using the BINARY kind instead of ASCII, but watch for endianness as specified here.
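To make the index arithmetic concrete, here is a small Python sketch of the same reading order. It is only an illustration under several assumptions: the file is ASCII, the scalars are integers, a LOOKUP_TABLE line precedes the data, and the optional fourth field of the SCALARS line is the number of components per point (1 when absent). Values are read as one stream of whitespace-separated tokens, so how many numbers sit on a line does not matter. The function names are made up.

```python
def parse_structured_points(text):
    """Return (dims, ncomp, values) from an ASCII STRUCTURED_POINTS file."""
    lines = text.splitlines()
    dims, ncomp, data_start = None, 1, None
    for n, line in enumerate(lines):
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "DIMENSIONS":
            dims = tuple(int(p) for p in parts[1:4])
        elif parts[0] == "SCALARS" and len(parts) > 3:
            ncomp = int(parts[3])          # components per point, if given
        elif parts[0] == "LOOKUP_TABLE":
            data_start = n + 1             # data begins on the next line
            break
    if dims is None or data_start is None:
        raise ValueError("missing DIMENSIONS or LOOKUP_TABLE line")
    values = [int(tok) for line in lines[data_start:] for tok in line.split()]
    nx, ny, nz = dims
    if len(values) != nx * ny * nz * ncomp:
        raise ValueError("unexpected number of values")
    return dims, ncomp, values


def point_values(dims, ncomp, values, i, j, k):
    """Components of point (i, j, k); x varies fastest, then y, then z."""
    nx, ny, _ = dims
    start = (i + nx * (j + ny * k)) * ncomp
    return values[start:start + ncomp]
```

The flat index i + nx * (j + ny * k) is the same ordering as the nested C loops above, with z outermost and x innermost.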

XOR File Decryption

So I have to decrypt a .txt file that is encrypted with XOR using a repeated password that is unknown, and the goal is to discover the message.
Here are the things that I already know because of the professor:
First I need to find the length of the unknown password
The message has been altered and it doesn't have spaces (this may add a bit more difficulty because the space character has the highest frequency in a message)
Any ideas on how to solve this?
Thanks in advance :)
First you need to find out the length of the password. You do this with the Index of Coincidence, or Kappa test. XOR the ciphertext with itself shifted 1 step and count the number of positions where the result is 0, i.e. where the two characters are the same. You get the Kappa value by dividing that count by the total number of characters minus 1. Shift one more time and again calculate the Kappa value. Shift the ciphertext as many times as needed until you discover the password length. If the length is 4 you should see something similar to this:
Offset Hits
-------------------------
1 2.68695%
2 2.36399%
3 3.79009%
4 6.74012%
5 3.6953%
6 1.81582%
7 3.82744%
8 6.03504%
9 3.60273%
10 1.98052%
11 3.83241%
12 6.5627%
As you see the Kappa value is significantly higher on multiples of 4 (4, 8 and 12) than the others. This suggests that the length of the password is 4.
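The shift-and-count step can be sketched in a few lines of Python. This is an illustration, not the exact procedure used to produce the table above: it divides by the number of compared positions (length minus shift) rather than length minus 1, and guess_key_length simply returns the first shift with the highest Kappa value, which on real data may turn out to be a multiple of the true length. All function names are made up.

```python
def xor_with_key(data, key):
    """XOR data with a repeating key (encrypts and decrypts)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def kappa(ciphertext, shift):
    """Fraction of positions where the text matches itself shifted."""
    compared = len(ciphertext) - shift
    hits = sum(
        1 for i in range(compared) if ciphertext[i] == ciphertext[i + shift]
    )
    return hits / compared


def guess_key_length(ciphertext, max_shift):
    """First shift in 1..max_shift with the highest Kappa value."""
    return max(range(1, max_shift + 1), key=lambda s: kappa(ciphertext, s))
```

In practice it is safer to print kappa for every shift and look for the periodic peaks by eye, as in the table, than to trust a single maximum.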
Now that you have the password length you should again XOR the cipher text with itself but now you shift by multiples of the length. Why? Since the ciphertext looks like this:
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
------------------
EJKELDOSOSKDOWQLAG <- Ciphertext
When two values which are the same are XOR:ed the result is 0:
EJKELDOSOSKDOWQLAG <- Ciphertext
EJKELDOSOSKDOWQLAG <- Ciphertext shifted 4.
Is in reality:
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
Which is:
THISISTHEPLAINTEXT <- Plaintext
THISISTHEPLAINTEXT <- Plaintext
As you see the password "disappears" and the plaintext is XOR:ed with itself.
So what can we do now then? You wrote that the spaces are removed. This makes it a bit harder to get the plaintext or password. But not at all impossible.
The following table shows the XOR value for every pair of uppercase English letters:
    A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
A   0
B   3  0
C   2  1  0
D   5  6  7  0
E   4  7  6  1  0
F   7  4  5  2  3  0
G   6  5  4  3  2  1  0
H   9 10 11 12 13 14 15  0
I   8 11 10 13 12 15 14  1  0
J  11  8  9 14 15 12 13  2  3  0
K  10  9  8 15 14 13 12  3  2  1  0
L  13 14 15  8  9 10 11  4  5  6  7  0
M  12 15 14  9  8 11 10  5  4  7  6  1  0
N  15 12 13 10 11  8  9  6  7  4  5  2  3  0
O  14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
P  17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  0
Q  16 19 18 21 20 23 22 25 24 27 26 29 28 31 30  1  0
R  19 16 17 22 23 20 21 26 27 24 25 30 31 28 29  2  3  0
S  18 17 16 23 22 21 20 27 26 25 24 31 30 29 28  3  2  1  0
T  21 22 23 16 17 18 19 28 29 30 31 24 25 26 27  4  5  6  7  0
U  20 23 22 17 16 19 18 29 28 31 30 25 24 27 26  5  4  7  6  1  0
V  23 20 21 18 19 16 17 30 31 28 29 26 27 24 25  6  7  4  5  2  3  0
W  22 21 20 19 18 17 16 31 30 29 28 27 26 25 24  7  6  5  4  3  2  1  0
X  25 26 27 28 29 30 31 16 17 18 19 20 21 22 23  8  9 10 11 12 13 14 15  0
Y  24 27 26 29 28 31 30 17 16 19 18 21 20 23 22  9  8 11 10 13 12 15 14  1  0
Z  27 24 25 30 31 28 29 18 19 16 17 22 23 20 21 10 11  8  9 14 15 12 13  2  3  0
What does this mean then? If an A and a B is XOR:ed then the resulting value is 3. E and P will result in 21. Etc. OK but how will this help you?
Remember that the plaintext is XOR:ed with itself shifted by multiples of the password length. For each value you can check the above table and determine what combinations that position could have. Lets say the value is 25 then the two characters that resulted in the value 25 could be one of the following combinations:(I-P), (H-Q), (K-R), (J-S), (M-T), (L-U), (O-V), (N-W), (A-X) or (C-Z). But which one? Now you do more shifts and look up the corresponding values in the table again for each position. Next time the value might be 7 and since you already have a list of possible character combinations you only check against them. At the next two shifts the values are 3 and 1. Now you can determine that the character is W since that is the only common character in each shift, (N-W), (P-W), (T-W), (V-W). You can do this for most positions.
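The table lookup is easy to automate. The following Python sketch lists every pair of uppercase letters whose ASCII codes XOR to a given value; it assumes the plaintext consists of uppercase ASCII letters only, and the name pairs_for is made up for this example.

```python
from itertools import combinations
from string import ascii_uppercase


def pairs_for(xor_value):
    """All pairs of uppercase letters whose codes XOR to xor_value."""
    return [
        (a, b)
        for a, b in combinations(ascii_uppercase, 2)
        if ord(a) ^ ord(b) == xor_value
    ]


print(pairs_for(25))
```

For the value 25 this yields the same ten combinations listed above, which gives a quick way to double-check the table.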
You will not get all the plaintext, but you will get enough characters to discover the password. Take the known characters and XOR them with the ciphertext at the corresponding positions. This will yield the password. The minimum number of known characters you need is the number of characters in the password, provided they fall at the "correct" positions with regard to the password, i.e. one for each password position.
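Once a few plaintext characters are known, recovering the password bytes is a single XOR per position. A minimal Python sketch, assuming the password length is already known; the function name recover_key and the dictionary layout are made up for this illustration.

```python
def recover_key(ciphertext, known, key_length):
    """known maps a plaintext position to the plaintext byte at that position.

    Returns a list of key bytes, with None for key positions that no
    known character covers.
    """
    key = [None] * key_length
    for pos, plain_byte in known.items():
        # ciphertext = plaintext XOR key, so key = ciphertext XOR plaintext
        key[pos % key_length] = ciphertext[pos] ^ plain_byte
    return key
```

Any entry still set to None marks a password position for which another known character is needed.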
Good luck!
You should look at cracking a Vigenère cipher, especially at auto-correlation. The latter will help you find the length of the password, and the rest is usually just brute-forcing based on the usual frequency distribution of letters (where the most common one is the letter e in English).
Although spaces are the most common characters and make decryptions like this easy, the other characters also have different frequencies. For example, see this Wikipedia article. If you've got enough encrypted text and the password length isn't too large, it might just be enough to find the most common bytes in the encrypted text. They will most likely be the encrypted versions of e, which has the highest frequency in English texts.
This alone won't give you the decrypted text, but it's very likely you can find out the password length and (part of) the password itself with it. For example, let's assume the most frequent encrypted bytes are
w x m z y
with almost the same frequency and there's a significant drop in frequency after the last one. This will tell you two things:
The password length most likely is 5, because statistically, all encrypted e will be equally likely. EDIT: OK, this isn't correct, it will be 5 or above because the password can contain the same character multiple times.
The password will be some permutation of (w x m z y XOR e e e e e) - you can use the byte offsets modulo the password length to get the correct permutation.
EDIT: The same character occurring in the password multiple times makes things a bit harder, but you'll most likely be able to identify those cases because, as I said, encrypted versions of e will cluster around frequency f - now if the character occurs n times, it will have a frequency near n*f.
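If the password length is already known, a variant of this idea is to count bytes separately for each password position, which also fixes the permutation. The Python sketch below assumes that the most frequent plaintext byte at every position is a lowercase "e"; that is a strong assumption that only holds for long enough English text, and the function name is made up.

```python
from collections import Counter


def guess_key_by_frequency(ciphertext, key_length, common=ord("e")):
    """Guess each key byte from the most frequent byte in its column."""
    key = []
    for offset in range(key_length):
        # Every key_length-th byte was encrypted with the same key byte.
        column = ciphertext[offset::key_length]
        most_common_byte, _ = Counter(column).most_common(1)[0]
        key.append(most_common_byte ^ common)
    return bytes(key)
```

If the message is uppercase, pass common=ord("E") instead; positions where the guess produces garbage can be retried with the next most common letters.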
The most common trigram in English (assuming the language is probably English) is "the". Place "the" at all possible points on your ciphertext to derive a possible 3 characters of the key. Try each possible key fragment at all other possible positions on the ciphertext and see what you get. For example, "qzg" is unlikely to be correct, but "fen" could be. Look at the spacing between possible positions to derive the key length. With a key length and a key fragment you can place a lot more of the key.
As Lars said, look at ways of decrypting Vigenère, which is effectively what you have here.
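Placing "the" at every position is often called crib dragging. A minimal Python sketch follows; it only produces the candidate key fragments and leaves the judgement of which ones look plausible to the reader. The name crib_drag is made up for this example.

```python
def crib_drag(ciphertext, crib):
    """Return (position, key_fragment) for every placement of the crib."""
    results = []
    for pos in range(len(ciphertext) - len(crib) + 1):
        # If the crib really sits at pos, this is the key at that offset.
        fragment = bytes(c ^ p for c, p in zip(ciphertext[pos:], crib))
        results.append((pos, fragment))
    return results
```

A fragment that shows up at several positions a fixed distance apart is a good hint at both the key length and part of the key.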
