I'm a Julia beginner (scripting beginner too).
I have a text file which consists of 4 columns:
1 5.4 9.5 19.5
2 5.4 9.4 20.6
2 6.2 9.6 18.3
1 9.1 0.5 17.2
2 8.5 1.4 19.6
2 8.4 0.6 24.1
etc.
I have no idea how to replace certain values in the rows in Julia, or how to add a new column according to an existing column pattern (1 2 2, 1 2 2). For example, I would like to add a column with the letters C and O (C when the first column is 1 and O when it is 2). I would also like to add a new column after the one with C and O, in which the first 1 2 2 pattern is designated by the number 4, the next one by the number 5, and so on. This is how I imagine the result:
C 4 1 5.4 9.5 19.5
O 4 2 5.4 9.4 20.6
O 4 2 6.2 9.6 18.3
C 5 1 9.1 0.5 17.2
O 5 2 8.5 1.4 19.6
O 5 2 8.4 0.6 24.1
Thank you for your help in advance.
Kasia.
String processing is fairly straightforward in Julia. You might write a function that takes an input and output filename as follows:
function munge_file(in::AbstractString, out::AbstractString)
# open the output file for writing
open(out, "w") do out_io
# open the input file for reading
open(in, "r") do in_io
# and process the contents
munge_file(in_io, out_io)
end
end
end
Now, the inner call to munge_file has to do the actual work (this isn't particularly optimized, but it should be very straightforward):
function munge_file(input::IO, io::IO = IOBuffer())
# initialize the pattern index
pattern_index = 3
# iterate over each line of the input
for line in eachline(input)
# skip empty lines
isempty(line) && continue
# split the current line into parts
parts = split(line, ' ')
# this line doesn't conform to the specified input pattern
# might be better to throw an error here
length(parts) == 4 || continue
# this line starts a new pattern if the first character is a 1
is_start = parse(Int, parts[1]) == 1
# increment the counter (for the second output column)
pattern_index += is_start
# first column depends on whether a 1 2 2 pattern starts here or not
print(io, is_start ? 'C' : 'O')
print(io, ' ')
# print the pattern counter
print(io, pattern_index)
print(io, ' ')
# print the original line
println(io, line)
end
return io
end
Using the code in the REPL produces the expected output:
shell> cat input.txt
1 5.4 9.5 19.5
2 5.4 9.4 20.6
2 6.2 9.6 18.3
1 9.1 0.5 17.2
2 8.5 1.4 19.6
2 8.4 0.6 24.1
julia> munge_file("input.txt", "output.txt")
IOStream(<file output.txt>)
shell> cat output.txt
C 4 1 5.4 9.5 19.5
O 4 2 5.4 9.4 20.6
O 4 2 6.2 9.6 18.3
C 5 1 9.1 0.5 17.2
O 5 2 8.5 1.4 19.6
O 5 2 8.4 0.6 24.1
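For comparison, here is a minimal Python sketch of the same line-by-line logic. The function name munge_lines and the start_index parameter are my own choices for illustration and are not part of the Julia code above; it assumes the columns are separated by whitespace.

```python
def munge_lines(lines, start_index=3):
    """Prefix each 4-column line with a letter (C/O) and a group counter."""
    out = []
    group = start_index
    for line in lines:
        line = line.rstrip("\n")
        # skip empty lines
        if not line:
            continue
        parts = line.split()
        # skip lines that do not have exactly 4 columns
        if len(parts) != 4:
            continue
        # a 1 in the first column starts a new 1 2 2 group
        is_start = int(parts[0]) == 1
        if is_start:
            group += 1
        out.append(f"{'C' if is_start else 'O'} {group} {line}")
    return out


sample = ["1 5.4 9.5 19.5", "2 5.4 9.4 20.6", "2 6.2 9.6 18.3", "1 9.1 0.5 17.2"]
for row in munge_lines(sample):
    print(row)
```

As in the Julia version, the counter starts at 3 so that the first group gets the number 4.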
Assuming your file is input.txt you could do:
open("output.txt","w") do f
println.(Ref(f),replace.(replace.(readlines("input.txt"),r"^1 "=>"C "), r"^2 "=>"O "))
end;
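The dots (.) in the above code broadcast the functions so that they work element-wise on vectors rather than on scalars. The replace function takes a String and a pair consisting of a regular expression and the new value. ^ in a regular expression means "the string starts with". Note that this replaces the leading 1 or 2 with a letter; it does not keep the original first column or add the pattern counter.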
No matter how I try to code this in R, I still cannot drop my columns so that I can build my logistic regression model. I tried it two different ways:
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that if you want to drop columns, you should put your code inside [ on the right side of the comma, not on the left side.
So [, your_code] not [your_code, ].
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
Edit to Reproduce the Error
The error message you got indicates that there is a column that has only one, identical value in all rows.
To show this, let's try a logistic regression using a subset of the mtcars data in which the cyl column has a single, identical value in every row, and then use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now, compare it with the same logistic regression using the full mtcars data, which has several distinct values in the cyl column.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have dropped three columns that have a single, identical value in all rows, there is another column in Trainingmodel1 that also has only one value. Such a column probably resulted from filtering the data frame and splitting the data into training and test groups. It is better to check with summary(Trainingmodel1).
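The check itself is simple in any language. Below is a minimal sketch in plain Python (not R) of how single-valued columns can be detected. The function name constant_columns and the toy data are hypothetical, and the table is assumed to be stored as a dict mapping column names to lists of values.

```python
def constant_columns(table):
    """Return the names of columns whose values are all identical."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


# Hypothetical toy data: cyl has the same value in every row.
toy = {
    "cyl": [4, 4, 4, 4],
    "mpg": [22.8, 24.4, 22.8, 32.4],
    "am": [1, 0, 0, 1],
}
print(constant_columns(toy))  # ['cyl']
```

Any column reported this way would trigger the contrasts error if it were used as a factor predictor.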
Further edit
I have checked the summary(Trainingmodel1) result, and it is clear that EmployeeNumber has one identical value (called a "level" for a factor) in all rows. To run your regression properly, either drop it from your model or, if EmployeeNumber has other levels and you want to include it in your model, make sure that it contains at least two levels in the training data. You can achieve that during splitting by repeating the random sampling until the randomly selected EmployeeNumber samples contain at least two levels. This can be done in a loop using for, while, or repeat. It is possible, but I don't know how appropriate repeated sampling is for your study.
As for your question about subsetting more than one variable, you can use subset and conditionals. For example, you want to get a subset of mtcars that has cyl == 4 and mpg > 20 :
mtcars |> subset(cyl == 4 & mpg > 20 )
If you want a subset that has cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20 )
You can also subset by using more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl <8) | (mpg > 20 & gear > 4 ))
I am trying to calculate the mean value of 10 matrix entries at a time (1:10, 2:11, 3:12 and so on) and then build a new matrix out of these mean values. However, it always gives me an "Invalid index" error.
A=rand(150,1);
number_of_rows=size(A,1);
for i=1:number_of_rows
B=mean(A(i:i+9,1),1);
C(i,:)=B;
end
The following simple code does it for any window length n:
A = grand(1,40,"uin",0,9)
n = 10;
C = [0 cumsum(A)];
C = (C(n+1:$)-C(1:$-n))/n
Result (sample):
--> A = grand(1,40,"uin",0,9)
A =
column 1 to 24
3. 2. 2. 9. 2. 6. 1. 5. 1. 5. 7. 2. 5. 8. 3. 0. 3. 8. 3. 3. 4. 8. 8. 1.
column 25 to 40
4. 2. 0. 8. 5. 8. 5. 3. 7. 3. 1. 8. 8. 0. 0. 4.
--> n = 10;
--> C = [0 cumsum(A)];
--> C = (C(n+1:$)-C(1:$-n))/n
C =
column 1 to 20
3.6 4. 4. 4.3 4.2 4.3 3.7 3.9 4.2 4.4 4.2 3.9 4.5 4.8 4.1 4.2 4.4 4.1 4.1 4.3
column 21 to 31
4.8 4.9 4.4 4.3 4.5 4.2 4.8 5.6 4.8 4.3 3.9
--> mean(A(1:n))
ans =
3.6
However, cumsum() would propagate any Inf or NaN value belonging to A.
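To make the prefix-sum idea explicit, here is a minimal Python sketch of the same computation (not Scilab). The function name moving_average is my own; the data are the sample values of A shown above.

```python
from itertools import accumulate


def moving_average(values, n):
    """Mean of every window of length n, computed from prefix sums."""
    # prefix[k] holds the sum of the first k values, so prefix[0] == 0
    prefix = [0, *accumulate(values)]
    # the sum of values[i:i+n] is prefix[i+n] - prefix[i]
    return [(prefix[i + n] - prefix[i]) / n for i in range(len(values) - n + 1)]


A = [3, 2, 2, 9, 2, 6, 1, 5, 1, 5, 7, 2, 5, 8, 3, 0, 3, 8, 3, 3,
     4, 8, 8, 1, 4, 2, 0, 8, 5, 8, 5, 3, 7, 3, 1, 8, 8, 0, 0, 4]
C = moving_average(A, 10)
print(len(C), C[0], C[1])  # 31 3.6 4.0
```

Only len(A) - n + 1 windows exist, which is why the result has 31 entries for 40 values. The caveat above applies here as well: once a NaN enters the running sum, every later prefix sum is NaN, so all windows after it become NaN even if they do not contain the NaN themselves.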
I have a data file with 4 columns:
x y u v
such that x and y are the coordinate positions associated with the values u and v.
The data is structured such that
x y u v
1 1 # #
2 1 # #
3 1 # #
...
However, I would like to restructure the file such that
x y u v
1 1 # #
1 2 # #
1 3 # #
...
Is there a function in fortran which can achieve this?
Well, I never make claims about "pretty," but it should do the job. Obviously, you will need to check your FORMAT statements:
PROGRAM TEST
REAL*8 :: U(4,4)
REAL*8 :: V(4,4)
INTEGER :: X, Y
DO
READ(*,'(2I2)',ADVANCE='NO',END=10) X,Y
READ(*,'(2F6.1)',ADVANCE='YES',END=10) U(X,Y),V(X,Y)
END DO
10 CONTINUE
WRITE(*,'(2I4,2F10.2)') ((I,J,U(I,J),V(I,J),J=1,4),I=1,4)
END
I'm assuming that your arrays are already allocated properly.
Here's my input file:
$ cat test.in
1 1 5.0 10.0
2 1 1.3 -0.2
3 1 5.1 0.0
4 1 -9.1 3.0
1 2 4.0 2.0
2 2 14.0 -8.0
3 2 -8.0 8.0
4 2 4.0 9.6
1 3 2.0 1.1
2 3 3.4 8.0
3 3 4.0 7.0
4 3 4.0 4.1
1 4 5.5 8.4
2 4 34.1 23.0
3 4 -4.1 4.0
4 4 6.0 8.4
And the output:
$ cat test.in | ./a.out
1 1 5.0 10.0
1 2 4.0 2.0
1 3 2.0 1.1
1 4 5.5 8.4
2 1 1.3 -0.2
2 2 14.0 -8.0
2 3 3.4 8.0
2 4 34.1 23.0
3 1 5.1 0.0
3 2 -8.0 8.0
3 3 4.0 7.0
3 4 -4.1 4.0
4 1 -9.1 3.0
4 2 4.0 9.6
4 3 4.0 4.1
4 4 6.0 8.4
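If the grid size is not known in advance, the same reordering can be seen as a sort on the (x, y) pair. Here is a minimal Python sketch of that alternative (it is not the array-based Fortran approach above); the function name reorder_rows is hypothetical, and the rows are assumed to be already parsed into (x, y, u, v) tuples.

```python
def reorder_rows(rows):
    """Sort (x, y, u, v) rows so that y varies fastest within each x."""
    return sorted(rows, key=lambda row: (row[0], row[1]))


# A few rows taken from the input file above, in their original order.
rows = [
    (1, 1, 5.0, 10.0),
    (2, 1, 1.3, -0.2),
    (1, 2, 4.0, 2.0),
    (2, 2, 14.0, -8.0),
]
for x, y, u, v in reorder_rows(rows):
    print(x, y, u, v)
```

Sorting costs more than filling a preallocated array, but it needs no assumption about the number of x and y values.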
I would like to plot data points included inside a script file.
This should be done multiple times (plotting to different files).
Therefore, I am using a do-for-loop.
This loop makes gnuplot freeze on execution.
Could you please point me to the cause?
This is my MWE:
reset
set autoscale
do for [index=1:1] {
plot "-" with lines ls 2 notitle
0.500 5
1.000 6
1.500 7
e
}
Yes, it seems that the combination of do for with inline data isn't supported. It also wouldn't be very convenient, since it would require a separate data block for every iteration, as in
set style data linespoints
plot '-' using 1:2, '-' using 1:3
1 2 3
4 5 6
e
1 2 3
4 5 6
e
With version 5.0 inline data blocks were introduced which allow reusing inline data:
$data <<EOD
1 2 3
4 5 6
EOD
do for [i=2:3] {
plot $data using 1:i w l
pause -1
}
I have a sorted (ascending) array:
[1 1 1 1 1 1.2 1.6 2 2 2 2.4 2.4 2.4 2.6 3 3.5 3.6 3.8 3.9 4 4.3 4.3 4.6 5 5.02 6 7]
I want to check and print how many times a number is repeated between each pair of consecutive natural numbers.
For example:
between 1 and 2: 0 (no repeated)
between 2 and 3: 3 repeated with 2.4
between 3 and 4: 0
between 4 and 5: 2 repeated with 4.3
between 5 and 6: 0
between 6 and 7: 0
Is there any function in MATLAB to do this task?
You can use tabulate, and the array need not even be sorted for that.
Then just select the proper elements using logical conditions. For example:
A=[1 1 1 1 1 1.2 1.6 2 2 2 2.4 2.4 2.4 2.6 3 3.5 3.6 3.8 3.9 4 4.3 4.3 4.6 5 5.02 6 7]
M=tabulate(A) % get frequency table
id1=mod(M(:,1),1)>0; % get indices for non integer values
id2=M(:,2)>1; % get indices for more than one occurrence
idx=id1 & id2; % get indices that combines the two above
ans=[M(idx,1) , M(idx,2)] % show value , # of repeats
ans =
2.4000 3.0000
4.3000 2.0000
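The same frequency-table idea can be sketched in Python with collections.Counter. This is only an illustration of the logic, not MATLAB code, and the function name repeated_non_integers is my own.

```python
from collections import Counter


def repeated_non_integers(values):
    """Map each non-integer value that occurs more than once to its count."""
    counts = Counter(values)  # frequency table, like tabulate
    return {v: c for v, c in sorted(counts.items()) if v % 1 != 0 and c > 1}


A = [1, 1, 1, 1, 1, 1.2, 1.6, 2, 2, 2, 2.4, 2.4, 2.4, 2.6, 3, 3.5, 3.6,
     3.8, 3.9, 4, 4.3, 4.3, 4.6, 5, 5.02, 6, 7]
print(repeated_non_integers(A))  # {2.4: 3, 4.3: 2}
```

As with tabulate, the input does not have to be sorted, because the counting does not depend on order.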
The alternative is to use histc. So if your vector is stored in a, then:
h = histc(a,a); % count how many times the number is there, the a should be sorted
natNumbers = (mod(a,1)==0) .* h;
nonnatNum = (mod(a,1)>0).*h;
indNN = find(natNumbers>0);
indNNN = find(nonnatNum>1);
resultIndex = sort([indNN indNNN]);
result = [a(resultIndex);h(resultIndex)]
Then you can work with the result matrix by checking whether there are any repeated numbers between the natural numbers.