How do I write a line of Julia code that sums the values of col2 for the rows where the values of col1 are in list? I'm pretty new to Julia, and the following line raises the error: Exception has occurred: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 10 and 3
total_sum = sum(df[ismember(df[:, :col1], list), :col2])
One way could be:
julia> df = DataFrame(reshape(1:12,4,3),:auto)
4×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 5 9
2 │ 2 6 10
3 │ 3 7 11
4 │ 4 8 12
julia> list = [2,3]
2-element Vector{Int64}:
2
3
julia> sum(df.x2[df.x1 .∈ Ref(list)])
13
This uses broadcasting over in (the Julia counterpart of ismember), which can also be written as ∈. Ref(list) wraps the list so that broadcasting treats it as a single value instead of iterating over its elements.
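For context, a minimal sketch (reusing the df above) of why the original attempt hits DimensionMismatch: without Ref, broadcasting tries to pair the column element-by-element with the list.
julia> df.x1 .∈ [2, 3]          # lengths 4 and 2 don't match, so broadcasting fails
ERROR: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 4 and 2
julia> df.x1 .∈ Ref([2, 3])     # Ref makes the list act as a single value
4-element BitVector:
 0
 1
 1
 0
julia> in.(df.x1, Ref([2, 3]))  # the same mask written with `in` instead of ∈
4-element BitVector:
 0
 1
 1
 0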
Depending on what you want to do, filter! is also worth knowing (using code from Dan Getz's answer); keep in mind that filter! modifies df in place, while filter returns a new data frame:
julia> sum(filter!(:x1 => x1 -> x1 ∈ [2,3], df).x2)
13
Not exactly sure if this is what you're asking, but try intersect:
julia> using DataFrames
julia> df = DataFrame(a = 1:5, b = 2:6)
5×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 2 3
3 │ 3 4
4 │ 4 5
5 │ 5 6
julia> list = collect(3:10);
julia> sum(df.b[intersect(df.a, list)])
15
Note that this indexes df.b with the intersected values themselves, so it gives the intended sum only because df.a happens to coincide with the row numbers 1:5.
Let's say
df = DataFrame(a=[1])
Row │ a
│ Int64
─────┼───────
1 │ 1
We have:
1. Tried to combine data and make a new column holding arrays:
combine(df, :a => x->[1,2])
Row │ a_function
│ Int64
─────┼────────────
1 │ 1
2 │ 2
2. Tried to combine data and make a new column holding tuples:
combine(df, :a => x->(1,2))
Row │ a_function
│ Tuple…
─────┼────────────
1 │ (1, 2)
Why doesn't 1 work as intended, i.e. hold the whole [1,2] array in one cell instead of creating 2 rows?
I converted the array into a tuple and it worked, but I wonder why they work in such different ways.
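For reference, the difference comes from how combine treats the return value: an AbstractVector is spread into one row per element, while a Tuple (or any other non-vector value) is stored as a single value in one row. A hedged sketch of the usual workarounds for keeping the whole array in one cell (the :arr output name is only for illustration):
julia> combine(df, :a => (x -> Ref([1, 2])) => :arr)   # Ref is unwrapped and the vector stays in one cell
julia> combine(df, :a => (x -> [[1, 2]]) => :arr)      # a one-element outer vector gives a single row holding [1, 2]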
I need to find arrays that contain all of the values of another array, in the presented order - similar to 'hasSubstr' or 'hasAll' - or an idea of how to go about this.
Example Data
dataCol = [1,2,3,4]
hasSubstr
hasSubstr is close; however, it only matches when the values appear as an exact contiguous sequence.
hasSubstr(dataCol, [1,2,4]) will return 0
But I need a 1 here because 1, 2 and 4 are in dataCol in the order of 1 then 2 then 4.
hasAll
hasAll is also close; however, it doesn't care about the order.
hasAll(dataCol, [4,2,1]) will return 1
But I need a 0 here because the order is incorrect.
Function or Query?
Something equivalent to the 'imaginary' function: hasAllOrdered(dataCol, [1,3,4]) = 1
Or an idea of how to construct a query for this. Maybe a combination of hasAll and some query logic magic?
Edit: To clarify my intended result, I need to run a query that selects multiple columns, as could be done with such a function.
SELECT
path AS dataCol,
track
FROM tracks
WHERE time_start > 1645232556
AND { magic here returning rows containing [276,277,279] in dataCol }
LIMIT 10
Query id: cac9b576-193e-475f-98e4-84354bf13af4
┌─dataCol───────────────────────────────────┬──track─┐
│ [211,210,207,205,204] │ 413354 │
│ [211,210,207,205,204] │ 413355 │
│ [73,74,142,209,277,276,208] │ 413356 │
│ [73,74,142,209,277,276,208] │ 413357 │
│ [280,279] │ 413358 │
│ [280,279] │ 413359 │
│ [272,208,276,277,278,346,347,273,206,207] │ 413360 │
│ [208,276,277,278,346,272,273,206,207,347] │ 413361 │
│ [276,277,278,279,348,208,209,141] │ 413362 │
│ [141,276,208,209,277,278,279,348] │ 413363 │
└───────────────────────────────────────────┴────────┘
10 rows in set. Elapsed: 0.007 sec. Processed 13.59 thousand rows, 273.88 KB (1.86 million rows/s., 37.49 MB/s.)
Ref: https://clickhouse.com/docs/en/sql-reference/functions/array-functions/
You have 2 arrays: the search array [a, b] and the row's array [a, d, b].
Build the positions of the row's elements within the search array (indexOf + arrayMap): [a, d, b] --> [1, 0, 2], then remove the zeros (elements not in the search array) with arrayFilter(<> 0): [1, 0, 2] --> [1, 2].
The row only matches if these indexes are increasing; otherwise the elements are in the wrong order.
arrayDifference([1, 2]) -> [0, 1]. If this array has a negative element, the indexes are not increasing, hence the final check:
not arrayExists(j -> j < 0, ...)
create table tracks( dataCol Array(UInt64), track UInt64 ) Engine = Memory;
insert into tracks values
( [211,210,207,205,204], 413354),
( [211,210,207,205,204], 413355),
( [280,279], 413358),
( [280,279], 413359),
( [272,208,276,277,278,346,347,273,206,207], 413360),
( [208,276,277,278,346,272,273,206,207,347], 413361),
( [276,277,278,279,348,208,209,141], 413362),
( [141,276,208,209,277,278,279,348], 413363);
select *
from tracks
where hasAll(dataCol, [276,277,279] as x)
  and not arrayExists(j -> j < 0,
      arrayDifference(arrayFilter(p -> p <> 0, arrayMap(e -> indexOf(x, e), dataCol))))
┌─dataCol───────────────────────────┬──track─┐
│ [276,277,278,279,348,208,209,141] │ 413362 │
│ [141,276,208,209,277,278,279,348] │ 413363 │
└───────────────────────────────────┴────────┘
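If you want something that reads like the imaginary hasAllOrdered, one possible convenience is a SQL user-defined function. This is only a sketch: it assumes a ClickHouse version that supports create function, and hasAllOrdered is simply the name borrowed from the question.
create function hasAllOrdered as (haystack, needle) ->
    hasAll(haystack, needle)
    and not arrayExists(j -> j < 0,
        arrayDifference(arrayFilter(p -> p <> 0, arrayMap(e -> indexOf(needle, e), haystack))));
select * from tracks where hasAllOrdered(dataCol, [276,277,279]);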
My data structure looks similar to
tdata = Array{Int64,1}[]
# After 1st collection, push the first batch of data
push!(tdata, [1, 2, 3, 4, 5])
# After 2nd collection, push this batch of data
push!(tdata, [11, 12, 13, 14, 15])
Therefore, my data is
> tdata
2-element Array{Array{Int64,1},1}:
[1, 2, 3, 4, 5]
[11, 12, 13, 14, 15]
When I tried to convert this to a DataFrame,
> convert(DataFrame, tdata)
ERROR: MethodError: Cannot `convert` an object of type Array{Array{Int64,1},1} to an object of type DataFrame
while I was hoping for similar to
2×5 DataFrame
Row │ c1 c2 c3 c4 c5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
Alternatively, I tried to save it to .CSV, but
> CSV.write("",tdata)
ERROR: ArgumentError: 'Array{Array{Int64,1},1}' iterates 'Array{Int64,1}' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
Clearly, I have some misunderstanding of the data structure I have. Any suggestion is appreciated!
Either do this:
julia> using SplitApplyCombine
julia> DataFrame(invert(tdata), :auto)
2×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
or this:
julia> DataFrame(transpose(hcat(tdata...)), :auto)
2×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
or this:
julia> DataFrame(vcat(transpose(tdata)...), :auto)
2×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
or this:
julia> df = DataFrame(["c$i" => Int[] for i in 1:5])
0×5 DataFrame
julia> foreach(x -> push!(df, x), tdata)
julia> df
2×5 DataFrame
Row │ c1 c2 c3 c4 c5
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 1 2 3 4 5
2 │ 11 12 13 14 15
The challenge with your data is that you want vectors to be rows of the data frame, and normally vectors are treated as columns of a data frame.
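A quick sketch of that asymmetry, reusing the tdata from above: passing the vector of vectors directly to DataFrame makes each inner vector a column, which is exactly why the invert/transpose step is needed to turn them into rows.
julia> DataFrame(tdata, :auto)   # each inner vector becomes a column, not a row
5×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15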
I have two large (several thousand values) arrays with floats, and would like to combine them in an xy point array for further processing, eg to plot.
So right now in Xcode playground I am doing this:
let xArray = // read from datafile, fast
let yArray = // read from another datafile, fast
struct xyPoint {
let x: Float
let y: Float
}
var spectrum: [xyPoint] = []
for i in 0..<xArray.count {
let xy = xyPoint(x: xArray[i], y: yArray[i])
spectrum.append(xy)
}
Now when I run the playground, this takes a really long time to do.
Any ideas how I can speed this up?
I checked the performance for various solutions to your problem. You can download my tests from this link to github
A) Your code
var spectrum: [XYPoint] = []
for i in 0..<xArray.count {
let xy = XYPoint(x: xArray[i], y: yArray[i])
spectrum.append(xy)
}
B) Zip + map (Martin R's answer)
let spectrumB = zip(xArray, yArray).map(XYPoint.init)
C) Range + map (My solution)
let spectrum = (0 ..< xArray.count).map { i in
return XYPoint(x: xArray[i], y: yArray[i])
}
D) ReserveCapacity + append (Duncan C's answer)
var spectrum: [XYPoint] = []
spectrum.reserveCapacity(xArray.count)
for (index, value) in xArray.enumerated() {
spectrum.append(XYPoint(x: xArray[index], y: yArray[index]))
}
My results (in seconds)
╭──────────────┬──────────────┬──────────────┬──────────────╮
│ A │ B │ C │ D │
╭───────────╬══════════════╪══════════════╪══════════════╪══════════════╡
│ 100 ║ 0.000009426 │ 0.000002401 │ 0.000000571 │ 0.000000550 │
│ 200 ║ 0.000003356 │ 0.000002629 │ 0.000000911 │ 0.000000866 │
│ 500 ║ 0.000005610 │ 0.000007288 │ 0.000002236 │ 0.000002012 │
│ 1000 ║ 0.000010638 │ 0.000009181 │ 0.000003905 │ 0.000005030 │
│ 2000 ║ 0.000019377 │ 0.000013316 │ 0.000007116 │ 0.000008732 │
│ 5000 ║ 0.000023430 │ 0.000019304 │ 0.000019809 │ 0.000019092 │
│ 10000 ║ 0.000050463 │ 0.000031669 │ 0.000035121 │ 0.000035420 │
│ 20000 ║ 0.000087040 │ 0.000058664 │ 0.000069300 │ 0.000069456 │
│ 50000 ║ 0.000272357 │ 0.000204213 │ 0.000176962 │ 0.000192996 │
│ 100000 ║ 0.000721436 │ 0.000459551 │ 0.000415024 │ 0.000437604 │
│ 200000 ║ 0.001114534 │ 0.000924621 │ 0.000816374 │ 0.000896202 │
│ 500000 ║ 0.002576687 │ 0.002094998 │ 0.001860833 │ 0.002060462 │
│ 1000000 ║ 0.007063596 │ 0.005924892 │ 0.004319181 │ 0.004869024 │
│ 2000000 ║ 0.014474969 │ 0.013594134 │ 0.008568550 │ 0.009388957 │
│ 5000000 ║ 0.038348767 │ 0.035136008 │ 0.021276415 │ 0.023855382 │
│ 10000000 ║ 0.081750925 │ 0.078742713 │ 0.043578664 │ 0.047700495 │
│ 20000000 ║ 0.202616669 │ 0.199960563 │ 0.148141266 │ 0.145360923 │
│ 50000000 ║ 0.567078563 │ 0.552158644 │ 0.370327555 │ 0.397115294 │
│ 100000000 ║ 1.136993625 │ 1.101725386 │ 0.713406642 │ 0.740150322 │
└───────────╨──────────────┴──────────────┴──────────────┴──────────────┘
The easiest way to create the array of points would be
let spectrum = zip(xArray, yArray).map(XYPoint.init)
(I have taken the liberty of calling the struct XYPoint, as Swift types
should start with uppercase letters.) This also allows the result array
to be defined as a constant.
However, it is not the fastest with respect to execution time.
Reasons may be:
zip() operates on general sequences and does not take advantage
of the input being arrays.
zip() returns a Sequence and therefore map()
does not know the number of elements which are to be created.
As a consequence, the destination array will be reallocated several
times.
Therefore an explicit loop is faster if you reserve the needed
capacity in advance:
var spectrum: [XYPoint] = []
spectrum.reserveCapacity(xArray.count)
for i in 0..<xArray.count {
let xy = XYPoint(x: xArray[i], y: yArray[i])
spectrum.append(xy)
}
In my test (on a 1.2 GHz Intel Core m5 MacBook, compiled in Release
mode) with two arrays of 10,000 elements, the first method took
about 0.65 milliseconds and the second method about 0.42 milliseconds.
For 1,000,000 elements I measured 12 milliseconds vs 6 milliseconds.
Once you have 2 separate arrays, combining them is a little awkward, and there isn't a neat "Swifty" way to do it. If you had an array of structs, where each struct contained an x and a y value, you could use a map statement to transform that array into an array of CGPoint objects (which is actually another struct type).
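For example, a minimal sketch of that map step (XYSample is a hypothetical name standing in for any struct that already holds one x/y pair):
import CoreGraphics

struct XYSample {
    let x: CGFloat
    let y: CGFloat
}

let samples = [XYSample(x: 0.1, y: 0.4), XYSample(x: 0.2, y: 0.3)]

// One map call turns the array of structs into an array of CGPoint values.
let points = samples.map { CGPoint(x: $0.x, y: $0.y) }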
You start out by telling us:
let xArray = // read from datafile, fast
let yArray = // read from another datafile, fast
It might be better to rework that code you don't show, so that instead of reading all of the x points data file, then reading all of the y points data file, you:
read an x point
read a y point
create a CGPoint for that X/Y pair
Add the new CGPoint to your output array of CGPoint values
Or even, restructure your code that creates the data files so that it writes a file containing an array of X/Y pairs rather than 2 separate files.
If you have 2 separate arrays, you might use a variant of for... in that gives you both an index and a value for each array entry:
let xArray: [CGFloat] = [0.1, 0.2, 0.3, 0.4]
let yArray: [CGFloat] = [0.4, 0.3, 0.2, 0.1]
var output = [CGPoint]()
output.reserveCapacity(xArray.count)
for (index, value) in xArray.enumerated() {
let yValue = yArray[index]
let aPoint = CGPoint (x: value, y: yValue)
output.append(aPoint)
}
The code above will crash if yArray has fewer values than xArray, and will miss the last values in yArray if it contains more values than xArray. A complete implementation should really do error checking first and handle the cases where the arrays have different numbers of values.
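One hedged way to handle that, reusing xArray and yArray from the snippet above: zip only pairs elements while both arrays still have values, so it cannot go out of bounds, and an explicit count check can be added if silently dropping trailing values is not acceptable.
// zip stops at the end of the shorter array, so there is no out-of-bounds access.
let safeOutput = zip(xArray, yArray).map { CGPoint(x: $0.0, y: $0.1) }

// If truncation is not acceptable, validate the counts up front instead.
precondition(xArray.count == yArray.count, "xArray and yArray must have the same length")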
When you run code in the main playground file, you will likely have logging enabled. This adds a huge performance hit to the code.
I tried out your code in the question as a function. Putting the function in the main swift file for arrays of size 10000 took over 10 minutes!
I moved the function to a separate swift file in the sources folder of the playground with the same size arrays, and it finished instantly.
The code I used was from your question (within a func) rather than the optimized versions.