using .where with an array of parameters [duplicate] - arrays

I have three tables offers, sports and the join table offers_sports.
class Offer < ActiveRecord::Base
has_and_belongs_to_many :sports
end
class Sport < ActiveRecord::Base
has_and_belongs_to_many :offers
end
I want to select offers that include a given array of sport names. They must contain all of the sports but may have more.
Lets say I have these three offers:
light:
- "Yoga"
- "Bodyboarding"
medium:
- "Yoga"
- "Bodyboarding"
- "Surfing"
all:
- "Yoga"
- "Bodyboarding"
- "Surfing"
- "Parasailing"
- "Skydiving"
Given the array ["Bodyboarding", "Surfing"] I would want to get medium and all but not light.
I have tried something along the lines of this answer but I get zero rows in the result:
Offer.joins(:sports)
.where(sports: { name: ["Bodyboarding", "Surfing"] })
.group("sports.name")
.having("COUNT(distinct sports.name) = 2")
Translated to SQL:
SELECT "offers".*
FROM "offers"
INNER JOIN "offers_sports" ON "offers_sports"."offer_id" = "offers"."id"
INNER JOIN "sports" ON "sports"."id" = "offers_sports"."sport_id"
WHERE "sports"."name" IN ('Bodyboarding', 'Surfing')
GROUP BY sports.name
HAVING COUNT(distinct sports.name) = 2;
An ActiveRecord answer would be nice but I'll settle for just SQL, preferably Postgres compatible.
Data:
offers
======================
id | name
----------------------
1 | light
2 | medium
3 | all
4 | extreme
sports
======================
id | name
----------------------
1 | "Yoga"
2 | "Bodyboarding"
3 | "Surfing"
4 | "Parasailing"
5 | "Skydiving"
offers_sports
======================
offer_id | sport_id
----------------------
1 | 1
1 | 2
2 | 1
2 | 2
2 | 3
3 | 1
3 | 2
3 | 3
3 | 4
3 | 5
4 | 3
4 | 4
4 | 5

Group by offer.id, not by sports.name (or sports.id):
SELECT o.*
FROM sports s
JOIN offers_sports os ON os.sport_id = s.id
JOIN offers o ON os.offer_id = o.id
WHERE s.name IN ('Bodyboarding', 'Surfing')
GROUP BY o.id -- !!
HAVING count(*) = 2;
Assuming the typical implementation:
offer.id and sports.id are defined as primary key.
sports.name is defined unique.
(sport_id, offer_id) in offers_sports is defined unique (or PK).
You don't need DISTINCT in the count. And count(*) is even a bit cheaper, yet.
Related answer with an arsenal of possible techniques:
How to filter SQL results in a has-many-through relation
Added by #max (the OP) - this is the above query rolled into ActiveRecord:
class Offer < ActiveRecord::Base
has_and_belongs_to_many :sports
def self.includes_sports(*sport_names)
joins(:sports)
.where(sports: { name: sport_names })
.group('offers.id')
.having("count(*) = ?", sport_names.size)
end
end

One way to do it is using arrays and the array_agg aggregate function.
SELECT "offers".*, array_agg("sports"."name") as spnames
FROM "offers"
INNER JOIN "offers_sports" ON "offers_sports"."offer_id" = "offers"."id"
INNER JOIN "sports" ON "sports"."id" = "offers_sports"."sport_id"
GROUP BY "offers"."id" HAVING array_agg("sports"."name")::text[] #> ARRAY['Bodyboarding','Surfing']::text[];
returns:
id | name | spnames
----+--------+---------------------------------------------------
2 | medium | {Yoga,Bodyboarding,Surfing}
3 | all | {Yoga,Bodyboarding,Surfing,Parasailing,Skydiving}
(2 rows)
The #> operator means that the array on the left must contain all the elements from the one on the right, but may contain more. The spnames column is just for show, but you can remove it safely.
There are two things you must be very mindful of with this.
Even with Postgres 9.4 (I haven't tried 9.5 yet) type conversion for comparing arrays is sloppy and often errors out, telling you it can't find a way to convert them to comparable values, so as you can see in the example I've manually cast both sides using ::text[].
I have no idea what the level of support for array parameters is Ruby, nor the RoR framework, so you may end-up having to manually escape the strings (if input by user) and form the array using the ARRAY[] syntax.

Related

Alternative solutions to an array search in PostgreSQL

I am not sure if my database design is good for this tricky case and I also ask for help how the query for this could look like.
I plan a query with the following table:
search_array | value | id
-----------------------+-------+----
{XYa,YZb,WQb} | b | 1
{XYa,YZb,WQb,RSc,QZa} | a | 2
{XYc,YZa} | c | 3
{XYb} | a | 4
{RSa} | c | 5
There are 5 main elements in the search_array: XY, YZ, WQ, RS, QZ and 3 Values: a, b, c that are concardinated to each element.
Each row has also one value: a, b or c.
My aim is to find all rows that fit to a specific row in this sense: At first it should be checked if they have any same main elements in their search_arrays (yellow marked in the example).
As example:
Row id 4 an row id 5 wouldnt match because XY != RS.
Row id 1, 2 and 3 would match two times because they have all XY and YZ.
Row id 1 and 2 would even match three times because they have also WQ in common.
And second: if there is a Main Element match it should be 'crosschecked' if the lowercase letters after the Main Elements fit to the value of the other row.
As example: The only match for Row id 1 in the table would be Row id 4 because they both search for XY and the low letters after the elements match each value of the two rows.
Another match would be ROW id 2 and 5 with RS and search c to value c and search a to value a (green and orange marked).
My idea was to cut the search_array elements in the query in two parts with the RIGHT and LEFT command for strings. But I dont know how to combine the subqueries for this search.
Or would be a complete other solution faster? Like splitting the search array into another table with the columns 'foregin key' to the maintable, 'main element' and 'searched_value'. I am not sure if this is the best solution because the program would all the time switch to the main table to find two rows out of 3 million rows to compare their searched_values to the values?
Thank you very much for your answers and your time!
You'll have to represent the data in a normalized fashion. I'll do it in a WITH clause, but it would be better to store the data in this fashion to begin with.
WITH unravel AS (
SELECT t.id, t.value,
substr(u.val, 1, 2) AS arr_main,
substr(u.val, 3, 1) AS arr_val
FROM mytable AS t
CROSS JOIN LATERAL unnest(t.search_array) AS u(val)
)
SELECT a.id AS first_id,
a.value AS first_value,
b.id AS second_id,
b.value AS second_value,
a.arr_main AS main_element
FROM unravel AS a
JOIN unravel AS b
ON a.arr_main = b.arr_main
AND a.arr_val = b.value
AND b.arr_val = a.value;

Element-wise sum of array across rows of a dataset - Spark Scala

I am trying to group the below dataset based on the column "id" and sum the arrays in the column "values" element-wise. How do I do it in Spark using Scala?
Input: (dataset of 2 columns, column1 of type String and column2 of type Array[Int])
| id | values |
---------------
| A | [12,61,23,43]
| A | [43,11,24,45]
| B | [32,12,53,21]
| C | [11,12,13,14]
| C | [43,52,12,52]
| B | [33,21,15,24]
Expected Output: (dataset or dataframe)
| id | values |
---------------
| A | [55,72,47,88]
| B | [65,33,68,45]
| C | [54,64,25,66]
Note:
The result has to be flexible and dynamic. That is, even if there are 1000s of columns or even if the file is several TBs or PBs, the solution should still hold good.
I'm a little unsure about what you mean when you say it has to be flexible, but just on top of my head, I can think of a couple of ways. The first (and in my opinion the prettiest) one uses a udf:
// Creating a small test example
val testDF = spark.sparkContext.parallelize(Seq(("a", Seq(1,2,3)), ("a", Seq(4,5,6)), ("b", Seq(1,3,4)))).toDF("id", "arr")
val sum_arr = udf((list: Seq[Seq[Int]]) => list.transpose.map(arr => arr.sum))
testDF
.groupBy('id)
.agg(sum_arr(collect_list('arr)) as "summed_values")
If you have billions of identical ids, however, the collect_list will of course be a problem. In that case you could do something like this:
testDF
.flatMap{case Row(id: String, list: Seq[Int]) => list.indices.map(index => (id, index, list(index)))}
.toDF("id", "arr_index", "arr_element")
.groupBy('id, 'arr_index)
.agg(sum("arr_element") as "sum")
.groupBy('id)
.agg(collect_list('sum) as "summed_values")
The below single-line solution worked for me
ds.groupBy("Country").agg(array((0 until n).map(i => sum(col("Values").getItem(i))) :_* ) as "Values")

Use excel to summarise data from a column by identifier

I have a spreadsheet with a column called MRN (the identifier) and the drugs administered next to them. There are duplicates of the MRN in column A that correspond to different courses of drugs. What I'm hoping to do is to summarise all the drugs administered associated with one MRN in one line, removing all duplicates. It looks something like this.
| | A | B |
| 1 | MRN Item
| 2 | 1 cefoTAXime
| 3 | 1 ampicillin
| 4 | 1 cefoTAXime
| 5 | 1 vancomycin
| 6 | 1 cefTRIaxone
| 7 | 2 ampicillin
| 8 | 2 vancomycin
| 9 | 2 vancomycin
I have 3 different formulas. The first is to produce a list of MRNs that are all unique. The second is to pull all drugs by MRN and list them in one line. The third is to remove duplicates from this list. They are below (in order).
{=IFERROR(INDEX($A$2:$A$2885, MATCH(0,COUNTIF(D$1:$D1, $A$2:$A$2885),0 )),"")}
{=INDEX($A$2:$B$2885,SMALL(IF($A$2:$A$2885=$D2,ROW($A$2:$A$2885)),COLUMN(D:D))-4,2)}
{=IFERROR(INDEX($E$2:$AE$2, MATCH(0,COUNTIF(D$3:$D3, $E$2:$AE$2),0 )),"")}
*I know that I can edit the second one by adding IF(ISERROR ...) to remove NA and print blanks if drug not found, but want to keep the formulas as simple as possible at this time.
My problem is that second formula isn't pulling all the drugs by MRN, and in an ideal world I would be able to combine the second and third formula into one, but I am not sure how to. Here is a link to a test file that shows my issue and the formulas in action.
https://1drv.ms/x/s!ApoCMYBhswHzhooXnumW2iV7yx-JaA
I appreciate that there may be a better way to do this using python/R, and if that's possible then I'm more than happy to try, but I couldn't make any headway. Thanks for your help and suggestions.
If you could deal with a count of the number of courses per drug per MRN, you can do this with Power Query (aka Get & Transform in Excel 2016)
Starting with the data you provided on your worksheet, the results would look like:
M-Code
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Changed Type" = Table.TransformColumnTypes(Source,{{"MRN", Int64.Type}, {"Item", type text}}),
#"Grouped Rows" = Table.Group(#"Changed Type", {"MRN"}, {{"Count", each _, type table}}),
#"Expanded Count" = Table.ExpandTableColumn(#"Grouped Rows", "Count", {"MRN", "Item"}, {"Count.MRN", "Count.Item"}),
#"Pivoted Column" = Table.Pivot(#"Expanded Count", List.Distinct(#"Expanded Count"[Count.Item]), "Count.Item", "Count.MRN", List.NonNullCount)
in
#"Pivoted Column"

Can I set rules for string comparison in SQL? (or do I need to hardcode using CASE WHEN)

I need to make a comparison for ratings in two points in time and indicate if the change was upwards,downwards or stayed the same.
For example:
This would be a table with four columns:
ID T0 T0+1 Status
1 AAA AA Lower
2 BB A Higher
3 C C Same
However, this does not work when applying regular string comparison, because in SQL
A<B
B<BBB
I need
A>B
B<BBB
So my order(highest to lowest): AAA,AA,A,BBB,BB,B
SQL order(highest to lowest): BBB,BB,B,AAA,AA,A
Now I have 2 options in mind, but I wonder if someone know a better one:
1) Use CASE WHEN statements for all the possibilities of ratings going up and down ( I have more values than indictaed above)
CASE WHEN T0=T0+1 then 'Same'
WHEN T0='AAA' and To+1<>'AAA' then 'Lower'
....adress all other options for rating going down
ELSE 'Higher'
However, this generates a very large number of CASE WHEN statements.
2) My other option requires generating 2 tables. In table 1 I use case when statements to assign values/rank to the ratings.
For example:
CASE WHEN T0='AAA' then 6
CASE WHEN T0='AA' then 5
CASE WHEN T0='A' then 4
CASE WHEN T0='BBB' then 3
CASE WHEN T0='BB' then 2
CASE WHEN T0='B' then 1
The same for T0+1.
Then in table 2 I use a regular compariosn between column T0 and Column T0+1 on the numeric values.
However, I am looking for a solution where I can do it in one table (with as little lines as possible), and optimally never really show the ranking column.
I think a nested statement would be the best option, but it did now work for me.
Anybody has suggestions?
I use SQL Server 2008.
If you are using Credit Rating, this is very likely that this is not just about AAA > AA or BBB > BB.
Whether you are using one agency or another, it could also be AA+ or Aa1 for long term, F1+ for short term or something else in different contexts or with other agencies.
It is also often requiered to convert data from one agency to other agencies Rating.
Therefore it is better to use a mapping table such as:
Id | Rating
0 | AAA
1 | AA+
2 | AA
3 | AA-
4 | A+
5 | A
6 | A-
7 | BBB+
Using this table, you only have to join the rating in your data table with the rating in the mapping table:
SELECT d.Rating_T0, d.Rating_T1
CASE WHEN d.Rating_T0 = d.Rating_T1 THEN '='
WHEN m0.id < m1.id THEN '<'
WHEN m0.id > m1.id THEN '>'
END
FROM yourData d
INNER JOIN RatingMapping m0
ON m0.Rating= d.Rating_T0
INNER JOIN RatingMapping m1
ON m1.Rating= d.Rating_T1
If you only store the Rating id in you data table, you will not only save space (1 byte for tinyint versus up to 4 chars) but will also be able to compare without the JOIN to the mapping table.
SELECT d.Rating_Id0, d.Rating_Id1
CASE WHEN d.Rating_Id0 = d.Rating_Id1 THEN '='
WHEN d.Rating_Id0 < d.Rating_Id1 THEN '<'
WHEN d.Rating_Id0 > d.Rating_Id1 THEN '>'
END
FROM yourData d
The JOIN would only be requiered when you want to display the actual Rating value such as AAA for Rating_ID = 0.
You could also add an agency_Id to the Mapping table. This way, you can easily choose which Notation agency you want to display and easily convert between Agency 1 and Agency 2 or Agency 3 (ie. Id 1 => S&P and Id 2 => Fitch, Id 3 => ...)

A single MySQL query for 'bouncing' table selects

So, say for the sake of simplicity, I have a master table containing two fields - The first is an attribute and the second is the attributes value. If the second field is set to reference a value in another table it is denoted in parenthesis.
Example:
MASTER_TABLE:
Attr_ID | Attr_Val
--------+-----------
1 | 23(table1) --> 23rd value from `table1`
2 | ...
1 | 42 --> the number 42
1 | 72(table2) --> 72nd value from `table2`
3 | ...
1 | txt --> string "txt"
2 | ...
4 | ...
TABLE 1:
Val_Id | Value
--------+-----------
1 | some_content
2 | ...
. | ...
. | ...
. | ...
23 | some_content
. | ...
Is it possible to perform a single query in SQL (without parsing the results inside the application and requerying the db) that would iterate trough master_table and for the given <attr_id> get only the attributes that reference other tables (e.g. 23(table1), 72(table2), ...), then parse the tables names from the parenthesis (e.g. table1, table2, ...) and perform a query to get the (23rd, 72nd, ...) value (e.g. some_content) from that referenced table?
Here is something I've done, and it parses the Attr_Val for the table name, but I don't know how to assign it to a string and then do a query with that string.
PREPARE pstmt FROM
"SELECT * FROM information_schema.tables
WHERE TABLESCHEMA = '<my_db_name>' AND TABLE_NAME=?";
SET #str_tablename =
(SELECT table.tablename FROM
(SELECT #string:=(SELECT <string_column> FROM <table> WHERE ID=<attr_id>) as String,
#loc1:=length(#string)-locate("(", reverse(#string))+2 AS from,
#loc2:=length(#string)-locate(")", reverse(#string))+1-#loc1 AS to,
substr(#string,#loc1, #loc2) AS tablename
) table
); <--this returns 1 rows which is OK
EXECUTE pstmt USING #str_tablename; <--this then returns 0 rows
Any thoughts?
I love the purity of this approach, if pulled off. But I'm thinking you're creating a maintenance bomb. With a cure like this, who needs to be sick?
No one has ever said of a web site "Man, their data sure is pure!" They compliment what is being done with the data. I don't recommend you keep your hands tied behind your back on this one. I guarantee your competitors aren't.

Resources