Issues Regarding SAS

Issues Regarding SAS - arrays

I was working on a homework problem regarding using arrays and looping to create a new variable to identify the date of when the maximum blood lead value was obtained but got stuck. For context, here is the homework problem:
In 1990 a study was done on the blood lead levels of children in Boston. The following variables for twenty-five children from the study have been entered on multiple lines per subject in the file lead_sum2018.txt in a list format:
Line 1
ID Number (numeric, values 1-25)
Date of Birth (mmddyy8. format)
Day of Blood Sample 1 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 1 (numeric, initial possible range: -9 to 12)
Line 2
ID Number (numeric, values 1-25)
Day of Blood Sample 2 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 2 (numeric, initial possible range: -9 to 12)
Line 3
ID Number (numeric, values 1-25)
Day of Blood Sample 3 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 3 (numeric, initial possible range: -9 to 12)
Line 4
ID Number (numeric, values 1-25)
Blood Lead Level Sample 1 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 2 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 3 (numeric, possible range: 0.01 – 20.00)
Sex (character, ‘M’ or ‘F’)
All blood samples were drawn in 1990. However, during data entry the order of blood samples was scrambled so that the first blood sample in the data file (blood sample 1) may not correspond to the first blood sample taken on a subject, it could be the first, second or third. In addition, some of the months and days and days of blood sampling were not written on the forms. At data entry, missing month and missing day values were each coded as -9.
The team of investigators for this project has made the following decisions regarding the missing values. Any missing days are to set equal to 15, any missing months are to be set equal to 6. Any analyses that are done on this data set need to follow those decisions. Be sure to implement the SAS syntax as indicated for each question. For example, use SAS arrays and loops if the item states that these must be used.
Here is the data that the HW references (it is in list format and was contained in a separate file called lead_sum2018.txt):
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 6 6
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
5 12/26/80 21 5
5 3 7
5 -9 12
5 4.35 4.79 5.14 M
6 06/20/81 7 10
6 11 3
6 22 1
6 1.24 1.16 0.71 F
7 06/22/81 19 6
7 3 12
7 29 8
7 3.1 3.21 3.58 F
8 05/24/82 26 7
8 31 1
8 9 10
8 2.99 2.37 2.4 M
9 10/11/82 2 7
9 25 5
9 28 3
9 2.4 1.96 2.71 F
10 . 10 8
10 30 12
10 28 2
10 2.72 2.87 1.97 F
11 11/16/83 19 4
11 15 11
11 7 -9
11 4.8 4.5 4.96 M
12 03/02/84 17 6
12 11 2
12 17 11
12 2.38 2.6 2.88 F
13 04/19/84 2 12
13 -9 6
13 1 7
13 1.99 1.20 1.21 M
14 02/07/85 4 5
14 17 5
14 21 11
14 1.61 1.93 2.32 F
15 07/06/85 5 2
15 16 1
15 14 6
15 3.93 4 4.08 M
16 09/10/85 12 10
16 11 -9
16 23 6
16 3.29 2.88 2.97 M
17 11/05/85 12 7
17 18 1
17 11 11
17 1.31 0.98 1.04 F
18 12/07/85 16 2
18 18 4
18 -9 6
18 2.56 2.78 2.88 M
19 03/02/86 19 4
19 11 3
19 19 2
19 0.79 0.68 0.72 M
20 08/19/86 21 5
20 15 12
20 -9 4
20 0.66 1.15 1.42 F
21 02/22/87 16 12
21 17 9
21 13 4
21 2.92 3.27 3.23 M
22 10/11/87 7 6
22 1 12
22 -9 3
22 1.43 1.42 1.78 F
23 05/12/88 12 2
23 21 4
23 17 12
23 0.55 0.89 1.38 M
24 08/07/88 17 6
24 27 11
24 6 2
24 0.31 0.42 0.15 F
25 01/12/89 4 7
25 15 -9
25 23 1
25 1.69 1.58 1.53 M
A) Input the data and in the data step:
1) make sure that Date of Birth variable is recorded as a SAS date;
2) use SAS arrays and looping to create a SAS date variable for each of the three blood samples and to address the missing data in accordance to the decisions of the investigators. Hint: use a single array and do loop to recode the missing values for day and month, separately, and an array/do loop for creating the SAS date variable;
3) use a SAS function to create a variable for the highest, i.e., maximum, blood lead value for each child;
4) use SAS arrays and looping to identify the date on which this largest value was obtained and create a new variable for the date of the largest blood lead value;
5) determine the age of the child in years when the largest blood lead value was obtained (rounded to two decimal places);
6) create a new variable based on the age of the child in years when the largest lead value was obtained (call it, “agecat”) that takes on three levels: for children less than 4 years old, agecat should equal 1; for children at least 4 years old, but less than 8, agecat should equal 2; and for children at least 8 years of age, agecat should be 3.;
7) print out the variables for the date of birth, date of the largest lead level, age at blood sample for the largest blood lead level, agecat, sex, and the largest blood lead level (Only print out these requested variables). All dates should be formatted to use the mmddyy10. format on the output.
The code I used in response to this was:
libname HW3 'C:\Users\johns\Desktop\SAS';
filename HW3new 'C:\Users\johns\Desktop\SAS\lead_sum2018.txt';
data one;
infile HW3new;
informat dob mmddyy8.;
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
array dbs{3} dbs1 dbs2 dbs3;
array mbs{3} mbs1 mbs2 mbs3;
do i=1 to 3;
if dbs{i}=-9 then dbs{i}=15;
end;
do i=4 to 6;
if mbs{i}=-9 then mbs{i}=6;
end;
array date{3} mdy1 mdy2 mdy3;
do i=1 to 3;
date{i}=mdy(mbs{i}, dbs{i}, 1990);
end;
maxbls=max(of bls1-bls3);
array bls{3} bls1 bls2 bls3;
array maxdte{3} maxdte1 maxdte2 maxdte3;
do i=1 to i=3;
if bls{i}=maxbls then maxdte=i;
end;
agemax=maxdte-dob;
ageest=round(agemax/365.25,2);
if agemax=. then agecat=.;
else if agemax < 4 then agecat=1;
else if 4 <= agemax < 8 then agecat=2;
else if agemax ge 8 then agecat=3;
run;
I received this error:
22 maxbls=max(of bls1-bls3);
23 array bls{3} bls1 bls2 bls3;
24 array maxdte{3} maxdte1 maxdte2 maxdte3;
25 do i=1 to i=3;
26 if bls{i}=maxbls then maxdte=i;
ERROR: Illegal reference to the array maxdte.
27 end;
Does anyone have any tip is regards to this issue? What did I do wrong? Was I supposed to create an additional array for the date of when the maximum blood lead sample value was collected? Thanks!
**I'm stuck on #4 of Part A, but I included the other parts for context. Thanks!
**Edits: I included the data that I had to read into SAS and the file name of the file it came from

Just from looking at the code immediately prior to the error, you have a problem on this line:
26 if bls{i}=maxbls then maxdte=i;
You are getting the error because you are attempting to assign a value to the array maxdte. Arrays cannot be assigned values like that (unless you are using the deprecated do over syntax...) Instead, choose an element of the array and assign the value to the element. E.g. you could do:
26 if bls{i}=maxbls then maxdte{1}=i;
Or instead of a literal 1, you could use a variable containing the relevant array index.

You are not properly handling ID field from lines #2-4
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
For example you need to skip field 1 on line 2-3 or read the ids into array perhaps to check they are all the same.
input #1 id dob dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex;
This example show how to check that you have 4 lines with the same ID and if you do read the rest of the variables or execute LOSTCARD. ID 3 has a missing record;
353 data ex;
354 infile cards n=4 stopover;
355 input #1 id #2 id2 #3 id3 #4 id4 #;
356 if id eq id2 eq id3 eq id4
357 then input #1 id dob:mmddyy. dbs1 mbs1
358 #2 id2 dbs2 mbs2
359 #3 id3 dbs3 mbs3
360 #4 id4 bls1 bls2 bls3 sex :$1.;
361 else lostcard;
362 format dob mmddyy.;
363 cards;
NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
372 3 01/03/80 11 7
373 3 27 2
374 3 3.24 3.4 3.83 M
375 4 08/01/80 5 12
NOTE: LOST CARD.
376 4 28 -9
NOTE: LOST CARD.
377 4 3 4
NOTE: The data set WORK.EX has 3 observations and 15 variables.
data ex;
infile cards n=4 stopover;
input #1 id #2 id2 #3 id3 #4 id4 #;
if id eq id2 eq id3 eq id4
then input #1 id dob:mmddyy. dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex :$1.;
else lostcard;
format dob mmddyy.;
cards;
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
;;;;
run;
proc print;
run;

Related

Grouping tuples of 2 numbers based on the difference between the tuples

Serial Number
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Parameter 1
22.95
23.46
22.71
23.41
23.36
23.18
23.52
23.35
22.86
22.47
22.63
23.72
23.22
23.17
22.8
23.18
23.15
23.12
23.16
23.22
23.58
23.68
23.33
23.52
23.54
23.48
Parameter 2
19.97
20.83
19.22
20.39
20.33
20.44
20.62
20.26
21.22
20.31
20.53
20.62
20.65
20.14
19.43
20.66
20.09
20.52
20.41
20.63
20.98
21.15
19.97
20.72
20.71
20.32
We have to divide this table into groups of 4 columns such that:-
1.Max difference between the values of parameter 1 of the columns in a group is less than 1
2.Max difference between the values parameter 2 of the columns in a group is less than 0.5
for eg. The columns 4,5,6,7 form a valid group
I need an algorithm that will return the maximum number of groups that this table can be divided into and will also return those groups.
for eg. In this case the maximum number of groups possible are 6, and the remaining 2 columns do not fall under any group. The algorithm should return these 6 groups.

reading and printing a .csv file like a 2D matrix with both integer and float values in c

Reading a file in c with .csv as extension. The file consisting of both integer and float type data values. Is there any way to read the csv file. Any help is appreciated.
The data is as follows:
Application_No. Actual_Effort (in PM) No of Processes No of Tasks No of partnerLinks Task Variables Element Variables Event Variables Script Developer's Skills Developer's Confidence TPSS TS TCC
1 918.28 1 3 5 33 7 2 3 3.5 1 8 135 143
2 8891.513 3 9 3 100 15 6 12 3 1 36 1197 1233
3 22479.261 5 15 23 125 25 10 20 3 1 190 2700 2890
4 2961.131 2 4 9 70 13 4 17 2 0 72 416 488
5 19650.198 7 14 19 130 28 12 5 2.5 0 231 2450 2681
6 377.75 1 2 4 22 8 2 2 3 1 6 68 74
7 2671.93 1 5 12 55 12 6 4 2 0 17 385 402
8 966.15 3 3 6 31 8 5 7 2.5 0 27 153 180
9 3765.81 2 6 17 73 14 2 3 3.5 1 46 552 590
10 7467.11 4 8 21 87 19 13 1 2 0 116 960 1076

Assigning a single value to all cells within a specified time period, matrix format

I have the following example dataset which consists of the # of fish caught per check of a net. The nets are not checked at uniform intervals. The day of the check is denoted in julian days as well as the number of days the net had been fishing since last checked (or since it's deployment in the case of the first check)
http://textuploader.com/9ybp
Site_Number Check_Day_Julian Set_Duration_Days Fish_Caught
2 5 3 100
2 10 5 70
2 12 2 65
2 15 3 22
100 4 3 45
100 10 6 20
100 18 8 8
450 10 10 10
450 14 4 4
In any case, I would like to turn the raw data above into the following format:
http://textuploader.com/9y3t
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
2 0 0 100 100 100 70 70 70 70 70 65 65 22 22 22 0 0 0
100 0 45 45 45 20 20 20 20 20 20 8 8 8 8 8 8 8 8
450 10 10 10 10 10 10 10 10 10 10 4 4 4 4 0 0 0 0
This is a matrix which assigns the # of fish caught during the period to EACH of the days that were within that period. The columns of the matrix are Julian days, the rows are site numbers.
I have tried to do this with some matrix functions but I have had much difficulty trying to populate all the fields that are within the time period, but I do not necessarily have a row of data for?
I had posted my small bit of code here, but upon reflection, my approach is quite archaic and a bit off point. Can anyone suggest a method to convert the data into the matrix provided? I've been scratching my head and googling all day but now I am stumped.
Cheers,
C

Two answers, the second one is faster but a bit low level.
Solution #1:
library(IRanges)
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
cov <- coverage(split(ir, Site_Number),
weight=split(Fish_Caught, Site_Number),
width=max(end(ir)))
do.call(rbind, lapply(cov, as.vector))
})
Solution #2:
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
site <- factor(Site_Number, unique(Site_Number))
m <- matrix(0, length(levels(site)), max(end(ir)))
ind <- cbind(rep(site, width(ir)), as.integer(ir))
m[ind] <- rep(Fish_Caught, width(ir))
m
})

I don't see a super obvious matrix transformation here. This is all i've got assuming the raw data is in a data.frame called dd
dd$Site_Number<-factor(dd$Site_Number)
mm<-matrix(0, nrow=nlevels(dd$Site_Number), ncol=18)
for(i in 1:nrow(dd)) {
mm[as.numeric(dd[i,1]), (dd[i,2]-dd[i,3]):dd[i,2] ] <- dd[i,4]
}
mm

Trouble reading a data set into R

I am new to R and I am trying to read in a data set. The data set is here:
http://petitlien.fr/myfiles
(The above link will expand to a GMX File storage folder link and click on Guest access to retrieve the file.)
The file named mydata.log has 32 entries with no header and it consists of 2 columns which are delimited by spaces.
I am trying the powerful command scan
test.frame<-scan(file="mydata.log",sep= "", nlines=32,blank.lines.skip=TRUE)
The above just read the first 3 rows:
head(test.frame)
[1] 0.0000 0.0000 144.3210 0.3400 159.4070 0.8925
I have tried also read.table:
test.frame<-read.table(file="mydata.log",sep= "", nrows=32,blank.lines.skip=TRUE)
This one reads the first 6 lines only as shown below:
names(test.frame)
[1] "V1" "V2"
> head(test.frame)
V1 V2
1 0.000 0.0000
2 144.321 0.3400
3 159.407 0.8925
4 198.413 0.9450
5 222.557 0.9975
6 235.464 1.0500
Does someone know how to read this data set properly?
A related question: Can I control the number of significant digits or perhaps decimal places in the data being read in?
Thanks a lot...

This line of your code works perfectly:
test.frame<-read.table(file="mydata.log",sep= "", nrows=32,blank.lines.skip=TRUE)
The reason why you only get 6 lines in your output is because you are using head. To view all lines, just enter the name of your object:
> test.frame
V1 V2
1 0.000 0.0000
2 144.321 0.3400
3 159.407 0.8925
4 198.413 0.9450
5 222.557 0.9975
6 235.464 1.0500
7 296.918 1.1025
8 346.773 1.1550
9 442.955 1.2075
10 694.879 1.2600
11 892.436 1.3125
12 1492.970 1.3650
13 2916.960 1.4175
14 3596.060 1.4700
15 5278.950 1.5225
16 7480.730 1.5750
17 12259.800 1.6275
18 14032.600 1.6800
19 19565.600 1.7325
20 31427.700 1.7850
21 58221.400 1.8375
22 92283.900 1.9900
23 165601.000 1.9425
24 165703.000 1.9950
25 213925.000 2.8750
26 260381.000 2.1000
27 312701.000 2.1525
28 370853.000 2.2050
29 479303.000 2.2575
30 487265.000 2.3100
31 545225.000 2.3625
32 703186.000 2.4150
Here is an easy way to see how many rows you have (useful when you have many observations):
nrow(test.frame)
[1] 32
As for the number of digits, see the round command. To look at the documentation for a command, enter a ? and then the command, in this case a function: ?round
#note that you do not have to put "digits=2", you can just put "2", but this way is clearer
> rounded_test.frame <- round(test.frame, digits=2)
> rounded_test.frame
V1 V2
1 0.00 0.00
2 144.32 0.34
3 159.41 0.89
4 198.41 0.94
5 222.56 1.00
6 235.46 1.05
7 296.92 1.10
8 346.77 1.16
9 442.95 1.21
10 694.88 1.26
11 892.44 1.31
12 1492.97 1.36
13 2916.96 1.42
14 3596.06 1.47
15 5278.95 1.52
16 7480.73 1.57
17 12259.80 1.63
18 14032.60 1.68
19 19565.60 1.73
20 31427.70 1.78
21 58221.40 1.84
22 92283.90 1.99
23 165601.00 1.94
24 165703.00 2.00
25 213925.00 2.88
26 260381.00 2.10
27 312701.00 2.15
28 370853.00 2.21
29 479303.00 2.26
30 487265.00 2.31
31 545225.00 2.36
32 703186.00 2.42
Note in the above I created a new object instead of replacing the current one. If you want to replace the current one and lose the data forever (until you reload the dataset of course!), then you can use this line instead:
test.frame <- round(test.frame, digits=2)
If you don't really want to compress your numbers, you might just be interested in viewing the rounded numbers. You can do this the following command:
print(test.frame,digits=2)

Instead of nrow() as suggested, I would recommend str() ("structure") that gives you more useful information about your data set (class of variables etc). It's also a bit less cryptic....:)

XOR File Decryption

So I have to decrypt a .txt file that is crypted with XOR code and with a repeated password that is unknown, and the goal is to discover the message.
Here are the things that I already know because of the professor:
First I need to find the length of the unknown password
The message has been altered and it doesn't have spaces (this may add a bit more difficulty because the space character has the highest frequency in a message)
Any ideas on how to solve this?
thx in advanced :)

First you need to find out the length of the password. You do this by assessing the Index of Coincidence or Kappa-test. XOR the ciphertext with itself shifted 1 step and count the number of characters that are the same (value 0). You get the Kappa value by dividing the result with the total number of characters minus 1. Shift one more time and again calculate the Kappa value. Shift the ciphertext as many times as needed until you discover the password length. If the length is 4 you should see something similar to this:
Offset Hits
-------------------------
1 2.68695%
2 2.36399%
3 3.79009%
4 6.74012%
5 3.6953%
6 1.81582%
7 3.82744%
8 6.03504%
9 3.60273%
10 1.98052%
11 3.83241%
12 6.5627%
As you see the Kappa value is significantly higher on multiples of 4 (4, 8 and 12) than the others. This suggests that the length of the password is 4.
Now that you have the password length you should again XOR the cipher text with itself but now you shift by multiples of the length. Why? Since the ciphertext looks like this:
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
------------------
EJKELDOSOSKDOWQLAG <- Ciphertext
When two values which are the same are XOR:ed the result is 0:
EJKELDOSOSKDOWQLAG <- Ciphertext
EJKELDOSOSKDOWQLAG <- Ciphertext shifted 4.
Is in reality:
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
THISISTHEPLAINTEXT <- Plaintext
PASSPASSPASSPASSPA <- Password
Which is:
THISISTHEPLAINTEXT <- Plaintext
THISISTHEPLAINTEXT <- Plaintext
As you see the password "disappears" and the plaintext is XOR:ed with itself.
So what can we do now then? You wrote that the spaces are removed. This makes it a bit harder to get the plaintext or password. But not at all impossible.
The following table shows the ciphertext values for all english characters:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 0
B 3 0
C 2 1 0
D 5 6 7 0
E 4 7 6 1 0
F 7 4 5 2 3 0
G 6 5 4 3 2 1 0
H 9 10 11 12 13 14 15 0
I 8 11 10 13 12 15 14 1 0
J 11 8 9 14 15 12 13 2 3 0
K 10 9 8 15 14 13 12 3 2 1 0
L 13 14 15 8 9 10 11 4 5 6 7 0
M 12 15 14 9 8 11 10 5 4 7 6 1 0
N 15 12 13 10 11 8 9 6 7 4 5 2 3 0
O 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
P 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0
Q 16 19 18 21 20 23 22 25 24 27 26 29 28 31 30 1 0
R 19 16 17 22 23 20 21 26 27 24 25 30 31 28 29 2 3 0
S 18 17 16 23 22 21 20 27 26 25 24 31 30 29 28 3 2 1 0
T 21 22 23 16 17 18 19 28 29 30 31 24 25 26 27 4 5 6 7 0
U 20 23 22 17 16 19 18 29 28 31 30 25 24 27 26 5 4 7 6 1 0
V 23 20 21 18 19 16 17 30 31 28 29 26 27 24 25 6 7 4 5 2 3 0
W 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24 7 6 5 4 3 2 1 0
X 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0
Y 24 27 26 29 28 31 30 17 16 19 18 21 20 23 22 9 8 11 10 13 12 15 14 1 0
Z 27 24 25 30 31 28 29 18 19 16 17 22 23 20 21 10 11 8 9 14 15 12 13 2 3 0
What does this mean then? If an A and a B is XOR:ed then the resulting value is 3. E and P will result in 21. Etc. OK but how will this help you?
Remember that the plaintext is XOR:ed with itself shifted by multiples of the password length. For each value you can check the above table and determine what combinations that position could have. Lets say the value is 25 then the two characters that resulted in the value 25 could be one of the following combinations:(I-P), (H-Q), (K-R), (J-S), (M-T), (L-U), (O-V), (N-W), (A-X) or (C-Z). But which one? Now you do more shifts and look up the corresponding values in the table again for each position. Next time the value might be 7 and since you already have a list of possible character combinations you only check against them. At the next two shifts the values are 3 and 1. Now you can determine that the character is W since that is the only common character in each shift, (N-W), (P-W), (T-W), (V-W). You can do this for most positions.
You will not get all the plaintext but you will get enough characters to discover the password. Take the known characters and XOR them in the correct position in the ciphertext. This will yield the password. The number of known characters you need atleast is the number of characters in the password if they are at the "correct" positions in regards to the password.
Good luck!

you should look at cracking a vigenere chiffre, especially at auto-correlation. The latter will help you finding out the length of the password and the rest is usually just bruteforcing on the normal distribution of letters (where the most common one is the letter e in the english language).

Although spaces are the most common characters and make decryptions like this easy, the other character also have different frequencies. For example, see this Wikipedia article. If you've got enough encrypted text and the password length isn't too large, it might just be enough to find out the most common bytes in the encrypted text. They will most likely be the encrypted versions of e that has the highest frequency in english texts.
This alone won't give you the decrypted text, but it's very likely you can find out the password length and (part of) the password itself with it. For example, let's assume the most frequent encrypted bytes are
w x m z y
with almost the same frequency and there's a significant drop in frequency after the last one. This will tell you two things:
The password length most likely is 5, because statistically, all encrypted e will be equally likely. EDIT: OK, this isn't correct, it will be 5 or above because the password can contain the same character multiple times.
The password will be some permutation of (w x m z y XOR e e e e e) - you can use the byte offsets modulo the password length to get the correct permutation.
EDIT: The same character occuring in the password multiple times makes things a bit harder, but you'll most likely be able to identify those because as I said, encrypted versions of e will cluster around frequency f - now if the character occurs n times, it will have a frequency near n*f.

The most common three letter trigram in English (assuming the language is probably English) is "the". Place "the" at all possible points on your cyphertext to derive a possible 3 characters of the key. Try each possible key fragment at all other possible positions on the cyphertext and see what you get. For example, "qzg" is unlikely to be correct, but "fen" could be. Look at the spacing between possible positions to derive the key length. With a key length and a key fragment you can place a lot more of the key.
As Lars said, look at ways of decrypting Vigenère, which is effectively what you have here.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Issues Regarding SAS - arrays

Related

Grouping tuples of 2 numbers based on the difference between the tuples

reading and printing a .csv file like a 2D matrix with both integer and float values in c

Assigning a single value to all cells within a specified time period, matrix format

Trouble reading a data set into R

XOR File Decryption

Categories

Resources