I am trying to group the below dataset based on the column "id" and sum the arrays in the column "values" element-wise. How do I do it in Spark using Scala?
Input: (dataset of 2 columns, column1 of type String and column2 of type Array[Int])
| id | values |
---------------
| A | [12,61,23,43]
| A | [43,11,24,45]
| B | [32,12,53,21]
| C | [11,12,13,14]
| C | [43,52,12,52]
| B | [33,21,15,24]
Expected Output: (dataset or dataframe)
| id | values |
---------------
| A | [55,72,47,88]
| B | [65,33,68,45]
| C | [54,64,25,66]
Note:
The result has to be flexible and dynamic. That is, even if there are 1000s of columns or even if the file is several TBs or PBs, the solution should still hold good.
I'm a little unsure what you mean when you say it has to be flexible, but off the top of my head I can think of a couple of ways. The first (and in my opinion the prettiest) uses a udf:
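// Needed outside spark-shell (spark is the active SparkSession)
import org.apache.spark.sql.functions._
import spark.implicits._
// Creating a small test example
val testDF = spark.sparkContext.parallelize(Seq(("a", Seq(1,2,3)), ("a", Seq(4,5,6)), ("b", Seq(1,3,4)))).toDF("id", "arr")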
val sum_arr = udf((list: Seq[Seq[Int]]) => list.transpose.map(arr => arr.sum))
testDF
.groupBy('id)
.agg(sum_arr(collect_list('arr)) as "summed_values")
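To make the transpose-and-sum step concrete, here is a minimal plain-Ruby sketch of the same logic, run on the data from the question. It is an illustration only, not Spark code: ROWS and sum_by_id are hypothetical names, and it assumes Ruby 2.4+ and that all arrays for a given id have the same length (transpose raises otherwise).

```ruby
# Plain-Ruby illustration (not Spark) of grouping by id and summing arrays element-wise.
# ROWS mirrors the question's input; sum_by_id is a hypothetical helper name.
ROWS = [
  ["A", [12, 61, 23, 43]],
  ["A", [43, 11, 24, 45]],
  ["B", [32, 12, 53, 21]],
  ["C", [11, 12, 13, 14]],
  ["C", [43, 52, 12, 52]],
  ["B", [33, 21, 15, 24]]
]

def sum_by_id(rows)
  rows
    .group_by { |id, _values| id }                 # id => list of [id, values] pairs
    .transform_values do |pairs|
      arrays = pairs.map { |_id, values| values }  # keep only the arrays
      arrays.transpose.map(&:sum)                  # turn columns into rows, then sum each
    end
end

sum_by_id(ROWS).sort.each { |id, values| puts "#{id} | #{values.inspect}" }
```

For the sample data this yields the sums listed under Expected Output in the question.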
If you have billions of identical ids, however, the collect_list will of course be a problem. In that case you could do something like this:
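// Row must be imported for the pattern match below
import org.apache.spark.sql.Row
testDF
.flatMap{case Row(id: String, list: Seq[Int]) => list.indices.map(index => (id, index, list(index)))}
.toDF("id", "arr_index", "arr_element")
.groupBy('id, 'arr_index)
.agg(sum("arr_element") as "sum")
.groupBy('id)
// collect_list does not guarantee element order, so sort the (index, sum) pairs before extracting the sums
.agg(sort_array(collect_list(struct('arr_index, 'sum))) as "pairs")
.select('id, $"pairs.sum" as "summed_values")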
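The single-line solution below worked for me. Here n is the length of the arrays, which has to be known in advance; replace Country and Values with your own column names (id and values in the question):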
ds.groupBy("Country").agg(array((0 until n).map(i => sum(col("Values").getItem(i))) :_* ) as "Values")
I have a postgresql field that stores a 4-element array. I want to select the value of that field, but it's coming back as a string:
{43.690916,-79.396774,43.700845,-79.37125}
I assumed the gem would know the format of that field and return an array, but I was wrong.
How can I get this into an array without going through string methods? That would seem like a hack. I thought that moving from four individual float fields to a single array field, with associated methods, would make records easier to access.
There was no migration, and it is too restrictive to assume this is Rails, which it is not. Here is the structure:
Column | Type | Collation | Nullable | Default
-----------+------------------------+-----------+----------+-------------------------------------------
loc_id | integer | | not null | nextval('mjtable_loc_id_seq'::regclass)
locname | character varying(255) | | |
locbounds | double precision[] | | |
By default the pg gem returns every column value as a String, because PostgreSQL sends results in text format. An array column therefore arrives as its text representation, and you would need to deal with it as such:
require 'pg'
conn = PG.connect(dbname: 'pgarray_development') # or whatever db name
data = conn.exec('SELECT * FROM foos').entries
=> [{"id"=>"1", "coords"=>"{1.0,2.0,3.0,4.0}"}]
data.first['coords'].class
=> String
But you can register a type map on the connection so that the gem decodes the values for you:
conn.type_map_for_results = PG::BasicTypeMapForResults.new conn
conn.exec("select coords::float[] from foos").values
=> [[[1.0, 2.0, 3.0, 4.0]]]
There are probably other ways to use type casts, see https://bitbucket.org/ged/ruby-pg/wiki/Home
I have a Mysql2::Result object @available_items and
@available_items.each do |row| puts row.values.join("\t ") end
gives me something that looks like this:
+------------+--------------+----------------------------------+
| tdate | whatDay | items |
+------------+--------------+----------------------------------+
| 2018-01-02 | Tuesday | OL,BD,DM,WW,DG |
| 2018-01-03 | Wednesday | KP,LW |
| 2018-01-04 | Thursday | LW,WW,FS,DG |
| 2018-01-05 | Friday | OL,KP,BD,SB,LW,DM,AS,WW,FS,DG |
| 2018-01-06 | Saturday | OL,KP,BD,SB,LW,DM,AS,WW,FS,DG |
+------------+--------------+----------------------------------+
Well, actually it looks like something else, but hopefully you get the idea.
I know that the default MySQL2 results output is as a hash, but I cannot figure out how to be able to refer to the items in column 3 by reference to the date in column 1 (i.e. use tdate as key to get items as value.)
So I wrote what feels like "dirty" code, using the pluck method to build an array:
@available_items.each do |row|
@available_array[0] = @available_items.pluck("tdate")
@available_array[1] = @available_items.pluck("whatDay")
@available_array[2] = @available_items.pluck("items")
end
Now I have an array that I can index into, but what I really want is a hash where tdate is the key and items is the value. Later on I want to pull one of the (abbreviated) items at random out of the comma-separated list for any given date, put that single item into a new hash (pseudocode below), and then examine that hash using some other code.
@final_list = Hash.new()
@final_list[:tdate] = items(randomSelection)
If I try to create a hash as follows:
@available_hash = Hash.new()
@available_items.each do |row|
@keyis = @available_items.pluck("tdate")
@valueis = @available_items.pluck("available")
@available_hash[@keyis] = @valueis
end
and then do
@available_hash.each_with_index do |k, v| puts "#{k} : #{v}" end
I get:
[[#<Date: 2018-01-02 ((2458121j,0s,0n),+0s,2299161j)>, #<Date: 2018-01-03 ((2458122j,0s,0n),+0s,2299161j)>, #<Date: 2018-01-04 ((2458123j,0s,0n),+0s,2299161j)>, #<Date: 2018-01-05 ((2458124j,0s,0n),+0s,2299161j)>, #<Date: 2018-01-06 ((2458125j,0s,0n),+0s,2299161j)>], nil] : 0
which looks like it has just fed everything into one line.
I have a feeling I am trying to be too complex, and also that I have misunderstood how to append to a hash.
So the question is: how do I create a hash where each {key, value} pair looks like {tdate: items}, with one new pair for each date?
Thanks in advance.
Try this.
@available_hash = @available_items.map{|item| [item["tdate"], item["items"]] }.to_h
Hope this helps.
Adding it as a more obvious answer than being buried in comments:
@available_hash = @available_items.each_with_object({}) do |h, acc| acc[h["tdate"]] = h["available"] end
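As a rough end-to-end sketch of what the question is after, the following builds the date-keyed hash and then picks one item at random per date. AVAILABLE_ROWS is a hypothetical stand-in for the Mysql2 result (an array of hashes keyed by column name), the helper names are made up for illustration, the column names are taken from the table in the question, and Ruby 2.6+ is assumed for to_h with a block.

```ruby
require "date"

# Hypothetical stand-in for the Mysql2 result: an array of hashes keyed by column name.
AVAILABLE_ROWS = [
  { "tdate" => Date.new(2018, 1, 2), "whatDay" => "Tuesday",   "items" => "OL,BD,DM,WW,DG" },
  { "tdate" => Date.new(2018, 1, 3), "whatDay" => "Wednesday", "items" => "KP,LW" },
  { "tdate" => Date.new(2018, 1, 4), "whatDay" => "Thursday",  "items" => "LW,WW,FS,DG" }
]

# One { tdate => [item, ...] } pair per row; splitting the list here makes later picks easy.
def items_by_date(rows)
  rows.to_h { |row| [row["tdate"], row["items"].split(",")] }
end

# Choose one item at random for every date.
def pick_one_per_date(by_date)
  by_date.transform_values(&:sample)
end

AVAILABLE_HASH = items_by_date(AVAILABLE_ROWS)
FINAL_LIST = pick_one_per_date(AVAILABLE_HASH)
FINAL_LIST.each { |date, item| puts "#{date} : #{item}" }
```

Splitting the comma-separated string once, when the hash is built, keeps the random pick to a single sample call per date.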
I am trying to find a formula for column A that will check an IP address in column B and find if it falls into a range (or between) 2 addresses in two other columns C and D.
E.G.
A B C D
+---------+-------------+-------------+------------+
| valid? | address | start | end |
+---------+-------------+-------------+------------+
| yes | 10.1.1.5 | 10.1.1.0 | 10.1.1.31 |
| Yes | 10.1.3.13 | 10.1.2.16 | 10.1.2.31 |
| no | 10.1.2.7 | 10.1.1.128 | 10.1.1.223 |
| no | 10.1.1.62 | 10.1.3.0 | 10.1.3.127 |
| yes | 10.1.1.9 | 10.1.4.0 | 10.1.4.255 |
| no | 10.1.1.50 | … | … |
| yes | 10.1.1.200 | | |
+---------+-------------+-------------+------------+
This is supposed to represent an Excel table with 4 columns, a heading row and 7 data rows, as an example.
I can do a lateral check with
=IF(AND((B3>C3),(B3 < D3)),"yes","no")
which only checks 1 address against the range next to it.
I need something that will check the 1 IP address against all of the ranges. i.e. rows 1 to 100.
This is checking access list rules against routes to see if I can eliminate redundant rules... but has other uses if I can get it going.
To make it extra special I can not use VBA macros to get it done.
I'm thinking some kind of index match to look it up in an array but not sure how to apply it. I don't know if it can even be done. Good luck.
Ok, so I've been tracking this problem since my initial comment, but have not taken the time to answer because just like Lana B:
I like a good puzzle, but it's not a good use of time if i have to keep guessing
+1 to Lana for her patience and effort on this question.
However, IP addressing is something I deal with regularly, so I decided to tackle this one for my own benefit. Also, no offense, but getting the MIN of the start and the MAX of the end is wrong. This will not account for gaps in the IP white-list. As I mentioned, this required 15 helper columns and my result is simply 1 or 0 corresponding to In or Out respectively. Here is a screenshot (with formulas shown below each column):
The formulas in F2:J2 are:
=NUMBERVALUE(MID(B2,1,FIND(".",B2)-1))
=NUMBERVALUE(MID(B2,FIND(".",B2)+1,FIND(".",B2,FIND(".",B2)+1)-1-FIND(".",B2)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2)+1)+1,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)-1-FIND(".",B2,FIND(".",B2)+1)))
=NUMBERVALUE(MID(B2,FIND(".",B2,FIND(".",B2,FIND(".",B2)+1)+1)+1,LEN(B2)))
=F2*256^3+G2*256^2+H2*256+I2
Yes, I used formulas instead of "Text to Columns" to automate the process of adding more information to a "living" worksheet.
The formulas in L2:P2 are the same, but replace B2 with C2.
The formulas in R2:V2 are also the same, but replace B2 with D2.
The formula for X2 is
=SUMPRODUCT(--($P$2:$P$8<=J2)*--($V$2:$V$8>=J2))
I also copied your original "valid" set in column A, which you'll see matches my result.
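For readers who want to check the logic outside Excel, here is a small Ruby sketch of the same idea: convert each dotted address to a single integer, then test it against every range. The names ip_to_i, in_any_range?, RANGES and ADDRESSES are hypothetical, the data is copied from the table in the question, and the bounds are treated as inclusive, as in the SUMPRODUCT formula.

```ruby
# Convert a dotted-quad string to one integer (same idea as F2*256^3+G2*256^2+H2*256+I2).
def ip_to_i(ip)
  ip.split(".").map(&:to_i).reduce(0) { |acc, octet| acc * 256 + octet }
end

# True when the address lies inside at least one [start, end] range (bounds inclusive).
def in_any_range?(ip, ranges)
  value = ip_to_i(ip)
  ranges.any? { |first, last| ip_to_i(first) <= value && value <= ip_to_i(last) }
end

# Start/end pairs from columns C and D of the question's table.
RANGES = [
  ["10.1.1.0",   "10.1.1.31"],
  ["10.1.2.16",  "10.1.2.31"],
  ["10.1.1.128", "10.1.1.223"],
  ["10.1.3.0",   "10.1.3.127"],
  ["10.1.4.0",   "10.1.4.255"]
]

# Addresses from column B.
ADDRESSES = %w[10.1.1.5 10.1.3.13 10.1.2.7 10.1.1.62 10.1.1.9 10.1.1.50 10.1.1.200]

ADDRESSES.each { |ip| puts "#{ip}: #{in_any_range?(ip, RANGES) ? 'yes' : 'no'}" }
```

Comparing the integers rather than the text avoids the problem with the original =IF(AND(B3>C3, ...)) attempt, where the addresses are compared as strings.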
You will need helper columns.
Organise your data as outlined in the picture.
Split address, start and end into columns, using the dot as the delimiter (ribbon menu Data=>Text To Columns).
Above the start/end parts, calculate MIN FOR START and MAX FOR END for all split text parts (e.g. MIN(K5:K1000)).
FORMULAS:
VALIDITY formula - copy into cell D5, and drag down:
=IF(AND(B6>$I$1,B6<$O$1),"In",
IF(OR(B6<$I$1,B6>$O$1),"Out",
IF(B6=$I$1,
IF(C6<$J$1, "Out",
IF( C6>$J$1, "In",
IF( D6<$K$1, "Out",
IF( D6>$K$1, "In",
IF(E6>=$L$1, "In", "Out"))))),
IF(B6=$O$1,
IF(C6>$P$1, "Out",
IF( C6<$P$1, "In",
IF( D6>$Q$1, "Out",
IF( D6<$Q$1, "In",
IF(E6<=$R$1, "In", "Out") )))) )
)))
I have three tables offers, sports and the join table offers_sports.
class Offer < ActiveRecord::Base
has_and_belongs_to_many :sports
end
class Sport < ActiveRecord::Base
has_and_belongs_to_many :offers
end
I want to select offers that include a given array of sport names. They must contain all of the sports but may have more.
Let's say I have these three offers:
light:
- "Yoga"
- "Bodyboarding"
medium:
- "Yoga"
- "Bodyboarding"
- "Surfing"
all:
- "Yoga"
- "Bodyboarding"
- "Surfing"
- "Parasailing"
- "Skydiving"
Given the array ["Bodyboarding", "Surfing"] I would want to get medium and all but not light.
I have tried something along the lines of this answer but I get zero rows in the result:
Offer.joins(:sports)
.where(sports: { name: ["Bodyboarding", "Surfing"] })
.group("sports.name")
.having("COUNT(distinct sports.name) = 2")
Translated to SQL:
SELECT "offers".*
FROM "offers"
INNER JOIN "offers_sports" ON "offers_sports"."offer_id" = "offers"."id"
INNER JOIN "sports" ON "sports"."id" = "offers_sports"."sport_id"
WHERE "sports"."name" IN ('Bodyboarding', 'Surfing')
GROUP BY sports.name
HAVING COUNT(distinct sports.name) = 2;
An ActiveRecord answer would be nice but I'll settle for just SQL, preferably Postgres compatible.
Data:
offers
======================
id | name
----------------------
1 | light
2 | medium
3 | all
4 | extreme
sports
======================
id | name
----------------------
1 | "Yoga"
2 | "Bodyboarding"
3 | "Surfing"
4 | "Parasailing"
5 | "Skydiving"
offers_sports
======================
offer_id | sport_id
----------------------
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
3 | 1
3 | 2
3 | 3
3 | 4
3 | 5
4 | 3
4 | 4
4 | 5
Group by offers.id, not by sports.name (or sports.id):
SELECT o.*
FROM sports s
JOIN offers_sports os ON os.sport_id = s.id
JOIN offers o ON os.offer_id = o.id
WHERE s.name IN ('Bodyboarding', 'Surfing')
GROUP BY o.id -- !!
HAVING count(*) = 2;
Assuming the typical implementation:
offers.id and sports.id are defined as primary keys.
sports.name is defined unique.
(sport_id, offer_id) in offers_sports is defined unique (or PK).
Under those assumptions you don't need DISTINCT in the count, and count(*) is a bit cheaper still.
Related answer with an arsenal of possible techniques:
How to filter SQL results in a has-many-through relation
Added by @max (the OP) - this is the above query rolled into ActiveRecord:
class Offer < ActiveRecord::Base
has_and_belongs_to_many :sports
def self.includes_sports(*sport_names)
joins(:sports)
.where(sports: { name: sport_names })
.group('offers.id')
.having("count(*) = ?", sport_names.size)
end
end
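To see why grouping by the offer and counting matches works, here is a plain-Ruby walk-through of the same filter, group and count steps on the sample data from the question. This is an in-memory sketch, not ActiveRecord; OFFERS, SPORTS, OFFERS_SPORTS and offers_including are hypothetical names. It de-duplicates the requested names first, which the scope above does not do.

```ruby
# Sample data from the question, as plain Ruby structures.
OFFERS = { 1 => "light", 2 => "medium", 3 => "all", 4 => "extreme" }
SPORTS = { 1 => "Yoga", 2 => "Bodyboarding", 3 => "Surfing", 4 => "Parasailing", 5 => "Skydiving" }
OFFERS_SPORTS = [
  [1, 1], [1, 2],
  [2, 1], [2, 2], [2, 3],
  [3, 1], [3, 2], [3, 3], [3, 4], [3, 5],
  [4, 3], [4, 4], [4, 5]
]

# Names of the offers that include every requested sport (and possibly more).
def offers_including(sport_names)
  wanted = sport_names.uniq
  OFFERS_SPORTS
    .select { |_offer_id, sport_id| wanted.include?(SPORTS[sport_id]) } # WHERE s.name IN (...)
    .group_by { |offer_id, _sport_id| offer_id }                         # GROUP BY o.id
    .select { |_offer_id, matches| matches.size == wanted.size }        # HAVING count(*) = n
    .keys
    .map { |offer_id| OFFERS[offer_id] }
end

p offers_including(["Bodyboarding", "Surfing"])
```

Grouping by sports.name instead would create one group per sport, each containing a single distinct name, so a count of 2 could never be reached; that is why the original query returned zero rows.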
One way to do it is using arrays and the array_agg aggregate function.
SELECT "offers".*, array_agg("sports"."name") as spnames
FROM "offers"
INNER JOIN "offers_sports" ON "offers_sports"."offer_id" = "offers"."id"
INNER JOIN "sports" ON "sports"."id" = "offers_sports"."sport_id"
GROUP BY "offers"."id"
HAVING array_agg("sports"."name")::text[] @> ARRAY['Bodyboarding','Surfing']::text[];
returns:
id | name | spnames
----+--------+---------------------------------------------------
2 | medium | {Yoga,Bodyboarding,Surfing}
3 | all | {Yoga,Bodyboarding,Surfing,Parasailing,Skydiving}
(2 rows)
The @> operator means that the array on the left must contain all the elements from the one on the right, but may contain more. The spnames column is just for show, so you can remove it safely.
There are two things you must be very mindful of with this.
Even with Postgres 9.4 (I haven't tried 9.5 yet), type conversion for comparing arrays is sloppy and often errors out, telling you it can't find a way to convert them to comparable values, so as you can see in the example I've manually cast both sides using ::text[].
I have no idea what the level of support for array parameters is in Ruby or in the RoR framework, so you may end up having to manually escape the strings (if input by the user) and form the array using the ARRAY[] syntax.
For example:
Array
ID | Primary | Data2
------------------
1 | N | Something 1
2 | N | Something 2
3 | Y | Something 3
I'm trying to sort it based on the Primary column, and I want the "Y" row to show first. The other columns of that row should move to the top along with it.
The end result would be:
Sorted Array
ID | Primary | Data2
------------------
3 | Y | Something 3
1 | N | Something 1
2 | N | Something 2
Is there a pre-made function for that? If not, how do we do this?
It is declared like this:
Dim Array(,) As String
regards,
I like using LINQ's OrderBy and ThenBy to order collections of objects. You just pass in a selector function to use to order the collections. For example:
orderedObjs = objs.OrderByDescending(function(x) x.isPrimary).ThenBy(function(x) x.id).ToList()
This code orders a collection first by the .isPrimary boolean, then by the id. Finally, it immediately evaluates the query into a List and assigns it to some variable.
There's a similar C# question whose solution applies just as well to VB. In short, you can use an overload of Array.Sort if you first split your 2D array into separate (1D) arrays:
Dim Primary() As String
Dim Data2() As String
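' ...
Array.Sort(Primary,Data2)
This reorders Data2 to follow the sort of Primary, after which you can recombine them into a 2D array. Note that Array.Sort sorts ascending, so "N" comes before "Y"; reverse both arrays afterwards (or pass a descending comparer) to put the "Y" rows first. This overload takes only two arrays, so the ID column has to be reordered separately to stay aligned.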
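The ordering rule itself (primary rows first, then by ID) can be sketched in a few lines of Ruby, shown here only to illustrate the sort key and not as a VB replacement. ROWS_2D and primary_first are hypothetical names, and each row is assumed to be [ID, Primary, Data2] as in the question.

```ruby
# Rows in the question's layout: [ID, Primary, Data2], all stored as strings.
ROWS_2D = [
  ["1", "N", "Something 1"],
  ["2", "N", "Something 2"],
  ["3", "Y", "Something 3"]
]

# Sort key: 0 for primary rows so they come first, then the numeric ID as a tie-breaker.
def primary_first(rows)
  rows.sort_by { |id, primary, _data| [primary == "Y" ? 0 : 1, id.to_i] }
end

primary_first(ROWS_2D).each { |row| puts row.join(" | ") }
```

Keeping each row together as one unit avoids having to split the 2D array into parallel arrays and recombine it afterwards.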