Creating a series of 2-dimensional arrays from a text file in Julia - arrays

I'm trying to write a Sudoku solver, which is the fun part. The un-fun part is actually loading the puzzles into Julia from a text file. The text file consists of a series of puzzles comprising a label line followed by 9 lines of digits (0s being used to denote blank squares). The following is a simple example of the sort of text file I am using (sudokus.txt):
Easy 7
000009001
008405670
940000032
034061800
070050020
002940360
890000056
061502700
400700000
Medium 95
000300100
800016070
000009634
001070000
760000015
000020300
592400000
030860002
007002000
Hard 143
000003700
305061000
000200004
067002100
400000003
003900580
200008000
000490308
008100000
What I want to do is strip out the label lines and store the 9x9 grids in an array. File input operations are not my specialist subject, and I've tried various methods such as read(), readcsv(), readlines() and readline(). I don't know whether there is any advantage to storing the digits as characters rather than integers, but leading zeros have to be maintained (a problem I have encountered with some input methods and with abortive attempts to use parse()).
I've come up with a solution, but I suspect it's far from optimal:
function main()
    open("Text Files\\sudokus.txt") do file
        grids = Vector{Matrix{Int}}()
        grid = Matrix{Int}(0,9)
        row_no = 0
        for line in eachline(file)
            if !(all(i -> isnumber(i), line))
                continue
            else
                row_no += 1
                squares = split(line, "")
                row = transpose([parse(Int, square) for square in squares])
                grid = vcat(grid, row)
                if row_no == 9
                    push!(grids, grid)
                    grid = Matrix{Int}(0,9)
                    row_no = 0
                end
            end
        end
        return grids
    end
end
@time main()
I initially ran into @code_warntype problems from the closure, but I seem to have solved those by moving my grids, grid and row_no variables from the main() function to the open block.
Can anyone come up with a more efficient way to achieve my objective or improve my code? Is it possible, for example, to load 10 lines at a time from the text file? I am using Julia 0.6, but solutions using 0.7 or 1.0 will also be useful going forward.

I believe your file is well-structured; by that I mean lines 1, 11, 21, ... contain the difficulty information and the lines between them contain the sudoku rows. Therefore, if we know the number of lines, we know the number of sudokus in the file. The code below uses this information to pre-allocate an array of exactly the size needed.
If your file is too big, you can use eachline instead of readlines. readlines reads all the lines of the file into RAM, while eachline creates an iterable that reads lines one by one.
function readsudoku(file_name)
    lines = readlines(file_name)
    sudokus = Array{Int}(undef, 9, 9, div(length(lines),10)) # the last dimension is for each sudoku
    for i in 1:length(lines)
        if i % 10 != 1 # if i % 10 == 1 you have difficulty line
            sudokus[(i - 1) % 10, :, div(i-1, 10) + 1] .= parse.(Int, collect(lines[i])) # collect is used to create an array of `Char`s
        end
    end
    return sudokus
end
This should run on 1.0 and 0.7, but I do not know if it runs on 0.6. You should probably remove the undef argument in the Array allocation to make it run on 0.6.
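Since the question also asks about loading 10 lines at a time: here is a minimal sketch (my addition, assuming Julia 1.0 and the strict label-plus-9-rows layout above) that combines eachline with Iterators.partition so the whole file never needs to be held in memory at once:
function readsudoku_lazy(file_name)
    grids = Vector{Matrix{Int}}()
    open(file_name) do io
        for block in Iterators.partition(eachline(io), 10)
            rows = block[2:end]  # drop the label line
            # each row becomes a 1x9 matrix; vcat stacks the nine of them into 9x9
            grid = vcat((permutedims(parse.(Int, collect(r))) for r in rows)...)
            push!(grids, grid)
        end
    end
    return grids
end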

Similar to Hckr's (faster) approach, my first idea is:
s = readlines("sudoku.txt")
smat = reshape(s, 10,3)
sudokus = Dict{String, Matrix{Int}}()
for k in 1:3
    sudokus[smat[1,k]] = parse.(Int, permutedims(hcat(collect.(Char, smat[2:end, k])...), (2,1)))
end
which produces
julia> sudokus
Dict{String,Array{Int64,2}} with 3 entries:
"Hard 143" => [0 0 … 0 0; 3 0 … 0 0; … ; 0 0 … 0 8; 0 0 … 0 0]
"Medium 95" => [0 0 … 0 0; 8 0 … 7 0; … ; 0 3 … 0 2; 0 0 … 0 0]
"Easy 7" => [0 0 … 0 1; 0 0 … 7 0; … ; 0 6 … 0 0; 4 0 … 0 0]

Related

How can I create a new matrix from another matrix's elements?

I want to pick out the elements which are equal to
(2*pi*k),
where k = 0, 1, 2, 3, ... (i.e. an integer), and fill them (from i1) into another matrix.
My problem is that I don't know how to make "k" determine the row. (By the way, the dividends and divisors are floats, so I need to match them approximately and treat them as 2*pi*k.)
My code can only find the elements which are (2*pi*k), but can't order them so that if k=1 the element goes into the k=1 row, and if k=2 the element goes into the k=2 row.
For example,
A = [2*pi 6 3 4;0.5*pi 0 2;3.1 7 4 8;2*pi 7 2 9;2.6 4*pi 6*pi 0]
I want the output to be
B = [0 2*pi 4*pi 6*pi;0 2*pi NaN NaN;NaN 2*pi NaN NaN]
This is my code:
k=0;
for m=380:650;
    for n=277:600;
        if abs((rem(abs(i(m,n)),(2*pi)))-(pi))>=3.11;
            k=k+1;
            B(m,k)=i1(m,n);
        end
    end
    k=0;
end
It can find what I want but they seem not to be ordered the way I want.
Like others, I'm a bit unsure what you want. Here's how I understood it and would code it:
check whether (2*pi*k) is contained in A, you want a numerical approach
output binary result
here's the code:
testPI=@(k) (2*pi*k); %generates 2*pi*k, where k is up to the user
A = [2*pi 6 3 4;0.5*pi 0 2 0;3.1 7 4 8;2*pi 7 2 9;2.6 4*pi 6*pi 0]; %A from example (fixed dimension error)
ismember(A,testPI(1:10)) %test if 2*pi*k for k=1:10 is contained in A
ans =
5×4 logical array
1 0 0 0
0 0 0 0
0 0 0 0
1 0 0 0
0 1 1 0
Adapt 1:10 to whatever range of k you'd like. Of course this only works if k is within a reasonable range; otherwise this approach is suboptimal.
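Since the question notes that the values are floats and only approximately equal to 2*pi*k, here is a hedged follow-up sketch (my addition, not part of the original answer; ismembertol requires a reasonably recent MATLAB release) that does the same membership test with a tolerance:
testPI = @(k) 2*pi*k;
A = [2*pi 6 3 4; 0.5*pi 0 2 0; 3.1 7 4 8; 2*pi 7 2 9; 2.6 4*pi 6*pi 0];
tol = 1e-9;                          % tolerance for the approximate comparison
ismembertol(A, testPI(0:10), tol)    % logical array, same size as A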

Matlab One Hot Encoding - convert column with categoricals into several columns of logicals

CONTEXT
I have a large number of columns with categoricals, all with different, unrankable choices. To make my life easier for analysis, I'd like to take each of them and convert it to several columns with logicals. For example:
1 GENRE
2 Pop
3 Classical
4 Jazz
...would turn into...
1 Pop Classical Jazz
2 1 0 0
3 0 1 0
4 0 0 1
PROBLEM
I've tried using ind2vec but this only works with numericals or logicals. I've also come across this but am not sure it works with categoricals. What is the right function to use in this case?
If you want to convert from a categorical vector to a logical array, you can use the unique function to generate column indices, then perform your encoding using any of the options from this related question:
% Sample data:
data = categorical({'Pop'; 'Classical'; 'Jazz'; 'Pop'; 'Pop'; 'Jazz'});
% Get unique categories and create indices:
[genre, ~, index] = unique(data)
genre =
Classical
Jazz
Pop
index =
3
1
2
3
3
2
% Create logical matrix:
mat = logical(accumarray([(1:numel(index)).' index], 1))
mat =
6×3 logical array
0 0 1
1 0 0
0 1 0
0 0 1
0 0 1
0 1 0
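If you also want the columns labeled, a small follow-up sketch (my addition, reusing the mat and genre variables above; cellstr turns the categorical categories into names for array2table):
T = array2table(mat, 'VariableNames', cellstr(genre))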
ind2vec does work with cell strings, and you can call the cellstr function to get such a cell string.
This code may help (I only changed it a little from the original):
data = categorical({'Pop'; 'Classical'; 'Jazz';});
GENRE = cellstr(data); %change categorical data into cell strings
[~, loc] = ismember(GENRE, unique(GENRE));
genre = ind2vec(loc')';
Gen=full(genre);
array2table(Gen, 'VariableNames', unique(GENRE))
Running this code returns:
ans =
Classical Jazz Pop
_________ ____ ___
0 0 1
1 0 0
0 1 0
You can call unique(GENRE) to check the categories (as cell strings). Meanwhile, logical(Gen) (or logical(full(genre))) contains the columns of logicals that you need.
P.S. A categorical array might be faster than cell strings, but the ind2vec function doesn't work with it; unique and accumarray might be better.
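One more hedged alternative (my addition, assuming the Statistics and Machine Learning Toolbox is available and that your release's dummyvar accepts categorical input, which is worth verifying):
data = categorical({'Pop'; 'Classical'; 'Jazz'; 'Pop'});
D = dummyvar(data);   % numeric 0/1 matrix, one column per category
L = logical(D);       % convert to logicals if needed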

Haskell file reading and finding values

I have recently started learning Haskell and I'm having a hard time figuring out how to interpret text files.
I have following .txt file:
ncols 5
nrows 5
xllcorner 809970
yllcorner 169790
cellsize 20
NODATA_value -9999
9 0 0 0 0
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0
0 2 0 0 3
The first 6 lines just display some information I need when working with the file in a GIS software. The real deal starts when I try to work with the numbers below in Haskell.
I want to tell Haskell to look up where the numbers 9, 1, 2 and 3 are and print back the number of the row and column where those numbers actually are. In this case Haskell should print:
The value 9 is in row 1 and column 1
The value 1 is in row 2 and column 2
The value 2 is in row 5 and column 2
The value 3 is in row 5 and column 5
I tried finding the solution (or at least similar methods for interpreting files) in tutorials and other Haskell scripts without any success, so any help would be greatly appreciated.
Here is an example of a script to do what you want. Note that in its current form it does not fail gracefully (but given this is a script, I doubt that is a concern). Make sure there is a trailing newline at the end of your file!
import Control.Monad (replicateM, when)
import Data.Traversable (for)
import System.Environment (getArgs)

main = do
  -- numbers we are looking for
  numbers <- getArgs

  -- get the key-value metadata
  metadata <- replicateM 6 $ do
    [key,value] <- words <$> getLine
    return (key,value)

  let Just rows = read <$> lookup "nrows" metadata
      Just cols = read <$> lookup "ncols" metadata

  -- loop over all the entries
  for [1..rows] $ \row -> do
    rawRow <- words <$> getLine
    for (zip [1..cols] rawRow) $ \(col,cell) ->
      when (cell `elem` numbers)
           (putStrLn ("The value " ++ cell ++ " is in row " ++ show row ++ " and column " ++ show col))
To use it, pass it as command line arguments the numbers you are looking for and then feed as input your data file.
$ ghc script.hs
$ ./script 9 1 2 3 < data.txt
Let me know if you have any questions!
I wasn't really sure if you wanted to look up just a fixed set of numbers, or any non-zero number. As your question asked for the former, that is what I did.
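Since the answer mentions the other possible reading of the question (report every non-zero cell), here is a minimal standalone sketch of that variant (my addition; it assumes the data is in a file named data.txt and simply skips the 6 metadata lines):
main :: IO ()
main = do
  rows <- drop 6 . lines <$> readFile "data.txt"
  sequence_
    [ putStrLn ("The value " ++ cell ++ " is in row " ++ show r ++ " and column " ++ show c)
    | (r, row)  <- zip [1 :: Int ..] (map words rows)
    , (c, cell) <- zip [1 :: Int ..] row
    , cell /= "0"
    ]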

Split vector in MATLAB

I'm trying to elegantly split a vector. For example,
vec = [1 2 3 4 5 6 7 8 9 10]
According to another vector of 0's and 1's of the same length where the 1's indicate where the vector should be split - or rather cut:
cut = [0 0 0 1 0 0 0 0 1 0]
Giving us a cell output similar to the following:
[1 2 3] [5 6 7 8] [10]
Solution code
You can use cumsum & accumarray for an efficient solution -
%// Create ID/labels for use with accumarray later on
id = cumsum(cut)+1
%// Mask to get valid values from cut and vec corresponding to ones in cut
mask = cut==0
%// Finally get the output with accumarray using masked IDs and vec values
out = accumarray(id(mask).',vec(mask).',[],@(x) {x})
Benchmarking
Here are some performance numbers when using a large input on the three most popular approaches listed to solve this problem -
N = 100000; %// Input Datasize
vec = randi(100,1,N); %// Random inputs
cut = randi(2,1,N)-1;
disp('-------------------- With CUMSUM + ACCUMARRAY')
tic
id = cumsum(cut)+1;
mask = cut==0;
out = accumarray(id(mask).',vec(mask).',[],@(x) {x});
toc
disp('-------------------- With FIND + ARRAYFUN')
tic
N = numel(vec);
ind = find(cut);
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
out = arrayfun(@(x,y) vec(x:y), ind_after, ind_before, 'uni', 0);
toc
disp('-------------------- With CUMSUM + ARRAYFUN')
tic
cutsum = cumsum(cut);
cutsum(cut == 1) = NaN; %Don't include the cut indices themselves
sumvals = unique(cutsum); % Find the values to use in indexing vec for the output
sumvals(isnan(sumvals)) = []; %Remove NaN values from sumvals
output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);
toc
Runtimes
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.068102 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.117953 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 12.560973 seconds.
Special case scenario: In cases where you might have runs of 1's, you need to modify a few things, as listed next -
%// Mask to get valid values from cut and vec corresponding to ones in cut
mask = cut==0
%// Setup IDs differently this time. The idea is to have successive IDs.
id = cumsum(cut)+1
[~,~,id] = unique(id(mask))
%// Finally get the output with accumarray using masked IDs and vec values
out = accumarray(id(:),vec(mask).',[],@(x) {x})
Sample run with such a case -
>> vec
vec =
1 2 3 4 5 6 7 8 9 10
>> cut
cut =
1 0 0 1 1 0 0 0 1 0
>> celldisp(out)
out{1} =
2
3
out{2} =
6
7
8
out{3} =
10
For this problem, a handy function is cumsum, which can create a cumulative sum of the cut array. The code that produces an output cell array is as follows:
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];
cutsum = cumsum(cut);
cutsum(cut == 1) = NaN; %Don't include the cut indices themselves
sumvals = unique(cutsum); % Find the values to use in indexing vec for the output
sumvals(isnan(sumvals)) = []; %Remove NaN values from sumvals
output = {};
for i=1:numel(sumvals)
output{i} = vec(cutsum == sumvals(i)); %#ok<SAGROW>
end
As another answer shows, you can use arrayfun to create a cell array with the results. To apply that here, you'd replace the for loop (and the initialization of output) with the following line:
output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);
That's nice because it doesn't end up growing the output cell array.
The key feature of this routine is the variable cutsum, which ends up looking like this:
cutsum =
0 0 0 NaN 1 1 1 1 NaN 2
Then all we need to do is use it to create indices to pull the data out of the original vec array. We loop from zero to max and pull matching values. Notice that this routine handles some situations that may arise. For instance, it handles 1 values at the very beginning and very end of the cut array, and it gracefully handles repeated ones in the cut array without creating empty arrays in the output. This is because of the use of unique to create the set of values to search for in cutsum, and the fact that we throw out the NaN values in the sumvals array.
You could use -1 instead of NaN as the signal flag for the cut locations to not use, but I like NaN for readability. The -1 value would probably be more efficient, as all you'd have to do is truncate the first element from the sumvals array. It's just my preference to use NaN as a signal flag.
The output of this is a cell array with the results:
output{1} =
1 2 3
output{2} =
5 6 7 8
output{3} =
10
There are some odd conditions we need to handle. Consider the situation:
vec = [1 2 3 4 5 6 7 8 9 10 11 12 13 14];
cut = [1 0 0 1 1 0 0 0 0 1 0 0 0 1];
There are repeated 1's in there, as well as a 1 at the beginning and end. This routine properly handles all this without any empty sets:
output{1} =
2 3
output{2} =
6 7 8 9
output{3} =
11 12 13
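As the answer notes, NaN could be swapped for -1 as the signal flag; a minimal sketch of that variant (my addition, reusing the vec and cut variables above):
cutsum = cumsum(cut);
cutsum(cut == 1) = -1;         % flag the cut positions with -1 instead of NaN
sumvals = unique(cutsum);      % -1, if present, sorts to the front
sumvals(sumvals == -1) = [];   % drop the flag value
output = arrayfun(@(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);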
You can do this with a combination of find and arrayfun:
vec = [1 2 3 4 5 6 7 8 9 10];
N = numel(vec);
cut = [0 0 0 1 0 0 0 0 1 0];
ind = find(cut);
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
out = arrayfun(@(x,y) vec(x:y), ind_after, ind_before, 'uni', 0);
We thus get:
>> celldisp(out)
out{1} =
1 2 3
out{2} =
5 6 7 8
out{3} =
10
So how does this work? Well, the first line defines your input vector, the second line finds how many elements are in this vector and the third line denotes your cut vector which defines where we need to cut in our vector. Next, we use find to determine the locations that are non-zero in cut which correspond to the split points in the vector. If you notice, the split points determine where we need to stop collecting elements and begin collecting elements.
However, we need to account for the beginning of the vector as well as the end. ind_after tells us the locations of where we need to start collecting values and ind_before tells us the locations of where we need to stop collecting values. To calculate these starting and ending positions, you simply take the result of find and add and subtract 1 respectively.
Each corresponding position in ind_after and ind_before tell us where we need to start and stop collecting values together. In order to accommodate for the beginning of the vector, ind_after needs to have the index of 1 inserted at the beginning because index 1 is where we should start collecting values at the beginning. Similarly, N needs to be inserted at the end of ind_before because this is where we need to stop collecting values at the end of the array.
Now for ind_after and ind_before, there is a degenerate case where the cut point may be at the end or beginning of the vector. If this is the case, then subtracting or adding by 1 will generate a start and stopping position that's out of bounds. We check for this in the 4th and 5th line of code and simply set these to 1 or N depending on whether we're at the beginning or end of the array.
The last line of code uses arrayfun and iterates through each pair of ind_after and ind_before to slice into our vector. Each result is placed into a cell array, and our output follows.
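A quick worked trace for the sample inputs (my addition; the values in the comments follow directly from the code above):
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];
ind        = find(cut);            % [4 9]
ind_before = [ind-1 numel(vec)];   % [3 8 10]  (stop positions)
ind_after  = [1 ind+1];            % [1 5 10]  (start positions)
% vec(1:3), vec(5:8), vec(10:10)  ->  {[1 2 3], [5 6 7 8], [10]}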
We can check for the degenerate case by placing a 1 at the beginning and end of cut and some values in between:
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [1 0 0 1 0 0 0 1 0 1];
Using this example and the above code, we get:
>> celldisp(out)
out{1} =
1
out{2} =
2 3
out{3} =
5 6 7
out{4} =
9
out{5} =
10
Yet another way, but this time without any loops or accumulating at all...
lengths = diff(find([1 cut 1])) - 1; % assuming a row vector
lengths = lengths(lengths > 0);
data = vec(~cut);
result = mat2cell(data, 1, lengths); % also assuming a row vector
The diff(find(...)) construct gives us the distance from each marker to the next - we append boundary markers with [1 cut 1] to catch any runs of zeros which touch the ends. Each length is inclusive of its marker, though, so we subtract 1 to account for that, and remove any which just cover consecutive markers, so that we won't get any undesired empty cells in the output.
For the data, we mask out any elements corresponding to markers, so we just have the valid parts we want to partition up. Finally, with the data ready to split and the lengths into which to split it, that's precisely what mat2cell is for.
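For the question's sample inputs, the intermediate values look like this (a worked check added here, not part of the original answer):
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];
find([1 cut 1])                   % [1 5 10 12]  (marker positions, boundaries appended)
diff(find([1 cut 1])) - 1         % [3 4 1]      (group lengths, all > 0 here)
mat2cell(vec(~cut), 1, [3 4 1])   % {[1 2 3], [5 6 7 8], [10]}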
Also, using @Divakar's benchmark code;
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.272810 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.436276 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 17.112259 seconds.
-------------------- With mat2cell
Elapsed time is 0.084207 seconds.
...just sayin' ;)
Here's what you need:
function spl = Splitting(vec,cut)
    n=1;
    j=1;
    for i=1:1:length(cut)
        if cut(i)==0
            spl{n}(j)=vec(i);
            j=j+1;
        else
            n=n+1;
            j=1;
        end
    end
end
Despite how simple my method is, it's in 2nd place for performance:
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.264428 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.407963 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 18.337940 seconds.
-------------------- SIMPLE
Elapsed time is 0.271942 seconds.
Unfortunately there is no 'inverse concatenate' in MATLAB. If you wish to solve a question like this you can try the code below. It will give you what you are looking for in the case where there are two split points, producing three vectors at the end. If you want more splits you will need to modify the code after the loop.
The results are returned as plain vectors; to make them into cells, collect them into a cell array (see the short example after the code).
pos_of_one = 0;
% The loop finds the split points and puts their positions into a vector.
for kk = 1 : length(cut)
    if cut(1,kk) == 1
        pos_of_one = pos_of_one + 1;
        A(1,pos_of_one) = kk;
    end
end
F = vec(1 : A(1,1) - 1);
G = vec(A(1,1) + 1 : A(1,2) - 1);
H = vec(A(1,2) + 1 : end);
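A small follow-up sketch (my addition, reusing the F, G and H pieces above): collecting the three vectors into one cell array so the result matches the other answers:
out = {F, G, H};
celldisp(out)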

Using bsxfun with an anonymous function

after trying to understand the bsxfun function I have tried to implement it in a script to avoid looping. I am trying to check if each individual element in an array is contained in one matrix, returning a matrix the same size as the initial array containing 1 and 0's respectively. The anonymous function I have created is:
myfunction = @(x,y) (sum(any(x == y)));
x is the matrix which will contain the 'accepted values' per say. y is the input array. So far I have tried using the bsxfun function in this way:
dummyvar = bsxfun(myfunction,dxcp,X)
I understand that myfunction is equal to the handle of the anonymous function and that bsxfun can be used to accomplish this I just do not understand the reason for the following error:
Non-singleton dimensions of the two input arrays must match each other.
I am using the following test data:
dxcp = [1 2 3 6 10 20];
X = [2 5 9 18];
and hope for the output to be:
dummyvar = [1,0,0,0]
Cheers, NZBRU.
EDIT: Reached 15 rep so I have updated the answer
Thanks again guys, I thought I would update this as I now understand how the solution provided by Divakar works. This might clear up confusion for others who have read my initial question and are unsure how bsxfun() works; I think writing it out helps me understand it better too.
Note: The following may be incorrect, I have just tried to understand how the function operates by looking at this one case.
The input into the bsxfun function was dxcp and X transposed. The function handle used was @eq so each element was compared.
%%// Given data
dxcp = [1 2 3 6 10 20];
X = [2 5 9 18];
The following code:
bsxfun(@eq,dxcp,X')
compared every value of dxcp, the first input variable, to every row of X'. The following matrix is the output of this:
dummyvar =
0 1 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
The first element was found by comparing the first element of dxcp (1) with the first element of X' (2), where dxcp = [1 2 3 6 10 20] and X' = [2;5;9;18].
The next element along the first row was found by comparing the second element of dxcp (2) with that same element of X' (2).
This was repeated until all of the values of dxcp were compared to the first element of X'. Following this logic, the first element in the second row was calculated by comparing the first element of dxcp (1) with the second element of X' (5).
The final solution provided was any(bsxfun(@eq,dxcp,X'),2) which is equivalent to: any(dummyvar,2). http://nf.nci.org.au/facilities/software/Matlab/techdoc/ref/any.html seems to explain the any function in detail well. Basically, say:
A = [1,2;0,0;0,1]
If the following code is run:
result = any(A,2)
Then the function any will check if each row contains one or several non-zero elements and return 1 if so. The result of this example would be:
result = [1;0;1];
This is because the second input parameter (the dimension) is 2. If the above line were changed to result = any(A,1), then it would check each column instead.
Using this logic,
result = any(A,2)
was used to obtain the final result.
1
0
0
0
which if needed could be transposed to equal
[1,0,0,0]
Performance: after running the following code:
tic
dummyvar = any(bsxfun(@eq,dxcp,X'),2)'
toc
It was found that the duration was:
Elapsed time is 0.000085 seconds.
The alternative below:
tic
arrayfun(@(el) any(el == dxcp),X)
toc
using the arrayfun() function (which applies a function to each element of an array) resulted in a runtime of:
Elapsed time is 0.000260 seconds.
^The above run times are averages over 5 runs of each meaning that in this case bsxfun() is faster (on average).
You don't want every combination of elements thrown into your any(x == y) test, you want each element from dxcp tested to see if it exists in X. So here is the short version, which also needs no transposes. Vectorization should also be a bit faster than bsxfun.
arrayfun(@(el) any(el == X), dxcp)
The result is
ans =
0 1 0 0 0 0
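As a final hedged aside (my addition, not from either answer): MATLAB's ismember performs this membership test directly and returns a logical array the size of its first input:
dxcp = [1 2 3 6 10 20];
X = [2 5 9 18];
ismember(X, dxcp)     % [1 0 0 0]      -- is each element of X in dxcp?
ismember(dxcp, X)     % [0 1 0 0 0 0]  -- is each element of dxcp in X?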
