Pyspark array preserving order - arrays

I have a structure along these lines: an invoice table and an invoice lines table. I want to output the lines as a JSON array in a mandated schema, ordered by line number, but the line number isn't in the schema (it is assumed to be implicit in the array order). As I understand it, both PySpark and JSON will preserve the array order once created. Please see the rough example below. How can I make sure the invoice lines preserve the line number order? I could do it using a list comprehension, but this means dropping out of Spark, which I think would be inefficient.
from pyspark.sql.functions import collect_list, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
invColumns = StructType([
StructField("invoiceNo",StringType(),True),
StructField("invoiceStuff",StringType(),True)
])
invData = [("1", "stuff"), ("2", "other stuff"), ("3", "more stuff")]
invLines = StructType([
StructField("lineNo",IntegerType(),True),
StructField("invoiceNo",StringType(),True),
StructField("detail",StringType(),True),
StructField("quantity",IntegerType(),True)
])
lineData = [(1,"1","item stuff",3),(2,"1","new item stuff",2),(3,"1","old item stuff",5),(1,"2","item stuff",3),(1,"3","item stuff",3),(2,"3","more item stuff",7)]
invoice_df = spark.createDataFrame(data=invData,schema=invColumns)
#in reality read from a spark table
invLine_df = spark.createDataFrame(data=lineData,schema=invLines)
#in reality read from a spark table
invoicesTemp_df = (invoice_df.select('invoiceNo',
'invoiceStuff')
.join(invLine_df.select('lineNo',
'InvoiceNo',
'detail',
'quantity'
),
on='invoiceNo'))
invoicesOut_df = (invoicesTemp_df.withColumn('invoiceLines',struct('detail','quantity'))
.groupBy('invoiceNo','invoiceStuff').agg(collect_list('invoiceLines').alias('invoiceLines'))
.select('invoiceNo',
'invoiceStuff',
'invoiceLines'
))
display(invoicesOut_df)
3 -- more stuff -- array -- 0: -- {"detail": "item stuff", "quantity": 3}
-- 1: -- {"detail": "more item stuff", "quantity": 7}
1 -- stuff -- array -- 0: -- {"detail": "new item stuff", "quantity": 2}
-- 1: -- {"detail": "old item stuff", "quantity": 5}
-- 2: -- {"detail": "item stuff", "quantity": 3}
2 -- other stuff -- array -- 0: -- {"detail": "item stuff", "quantity": 3}
The following, as requested, is the input data:
Invoice Table
"InvoiceNo", "InvoiceStuff",
"1","stuff",
"2","other stuff",
"3","more stuff"
Invoice Lines Table
"LineNo","InvoiceNo","Detail","Quantity",
1,"1","item stuff",3,
2,"1","new item stuff",2,
3,"1","old item stuff",5,
1,"2","item stuff",3,
1,"3","item stuff",3,
2,"3","more item stuff",7
The output should look like this, with the arrays ordered by the line number from the invoice lines table, even though the line number isn't in the output.
Output
"1","stuff","[{"detail": "item stuff", "quantity": 3},{"detail": "new item stuff", "quantity": 2},{"detail": "old item stuff", "quantity": 5}]",
"2","other stuff","[{"detail": "item stuff", "quantity": 3}]"
"3","more stuff","[{"detail": "item stuff", "quantity": 3},{"detail": "more item stuff", "quantity": 7}]"

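collect_list does not respect the data's order. From the documentation:
Note: The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.
One possible way around that is to apply collect_list over a window, where you can control the order. With an ordered window, each row carries a growing prefix of the ordered list, so taking the max per invoice keeps the complete list.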
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(invoice_df
.join(invLine_df, on='invoiceNo')
.withColumn('invoiceLines', F.struct('lineNo', 'detail','quantity'))
.withColumn('a', F.collect_list('invoiceLines').over(W.partitionBy('invoiceNo').orderBy('lineNo')))
.groupBy('invoiceNo')
.agg(F.max('a').alias('invoiceLines'))
.show(10, False)
)
+---------+--------------------------------------------------------------------+
|invoiceNo|invoiceLines |
+---------+--------------------------------------------------------------------+
|1 |[{1, item stuff, 3}, {2, new item stuff, 2}, {3, old item stuff, 5}]|
|2 |[{1, item stuff, 3}] |
|3 |[{1, item stuff, 3}, {2, more item stuff, 7}] |
+---------+--------------------------------------------------------------------+
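Note that the result above still carries lineNo inside each struct, whereas the question asks for only detail and quantity. The idea of "keep the sort key first, sort, then drop the key" can be modelled in plain Python as below. This is only a sketch of the logic, not Spark code; the names line_data and ordered_lines are made up for illustration. In Spark, the corresponding step would presumably be F.sort_array over the collected structs followed by F.transform to drop lineNo, which is an assumption on my part and not part of the answer above.

```python
from collections import defaultdict

# Rows copied from the question: (lineNo, invoiceNo, detail, quantity),
# deliberately shuffled to mimic arbitrary row order after a shuffle.
line_data = [
    (2, "1", "new item stuff", 2),
    (3, "1", "old item stuff", 5),
    (1, "1", "item stuff", 3),
    (1, "2", "item stuff", 3),
    (2, "3", "more item stuff", 7),
    (1, "3", "item stuff", 3),
]

def ordered_lines(rows):
    """Group lines per invoice, order them by lineNo, then drop lineNo."""
    grouped = defaultdict(list)
    for line_no, invoice_no, detail, quantity in rows:
        # Keep the sort key first, like struct('lineNo', 'detail', 'quantity').
        grouped[invoice_no].append((line_no, detail, quantity))
    return {
        invoice_no: [
            {"detail": detail, "quantity": quantity}
            for _, detail, quantity in sorted(lines)  # tuples sort by lineNo first
        ]
        for invoice_no, lines in grouped.items()
    }

invoice_lines = ordered_lines(line_data)
```

Because the sort key is the first tuple element, the arrival order of the rows no longer matters, and the key never reaches the output.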

Related

how to query nested jsonb array for a given key

I have a table 'animals' to store information
row 1 -> [{"type": "rabbit", "value": "carrot"}, {"type": "cat", "value": "milk"}]
row 2 -> [{"type": "cat", "value": "fish"}, {"type": "rabbit", "value": "leaves"}]
I need to query the value for type rabbit from all the rows.
I tried to use the #> operator with [{"type": "rabbit"}]:
select * from data where data #> '[{"type":"rabbit"}]';
but it doesn't work.

How can I modify the following code to get the array element with the shortest length?

How can I modify the following code:
c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), r'\s+')
arr_of_struct = F.transform(
    c1_arr,
    lambda x: F.struct(
        F.size(F.array_intersect(c2_arr, F.split(F.trim(x), r'\s+'))).alias('cnt'),
        x.alias('val'),
    )
)
top_val = F.sort_array(arr_of_struct, False)[0]
top_val gives me the element of col1 that has the most tokens in common with col2. I want the element of col1 that has the most tokens in common with col2 and, among those, the shortest length (the fewest tokens).
For example, consider the following data:
col1 col2
["come and get", "computer", "come and get more"] "come for good"
["summer is hot", "summer is too hot", "hot weather"] "hot tea"
["summer is hot", "summer is too hot", "hot weather"] "hot summer"
Desired output:
col1 col2 match
["come and get", "computer", "come and get more"] "come for good" "come and get"
["summer is hot", "summer is too hot", "hot weather"] "hot tea" "hot weather"
["summer is hot", "summer is too hot", "hot weather"] "hot summer" "summer is hot"
Then I use the following code to get my desired result:
df = df.select(
    '*',
    F.when(
        ((top_val['cnt'] > 1) & (F.size(c2_arr) > 1))
        | ((top_val['cnt'] > 0) & (F.size(c2_arr) == 1))
        | ((F.size(F.split(F.trim(top_val['val']), r'\s+')) == 1) & (top_val['cnt'] > 0)),
        top_val['val']
    ).alias('match'))
The first condition says that the match is selected if the intersection is larger than 1 and the size of col2 is also larger than 1.
The second condition says that if the size of col2 equals 1, then the intersection needs to be non-zero.
And the third condition says that if the intersection is 1 and the element of col1 also consists of a single token, the element is selected.
First of all, input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(["come and get", "computer", "come and get more"], "come for good"),
(["summer is hot", "summer is too hot", "hot weather"], "hot tea"),
(["summer is hot", "summer is too hot", "hot weather"], "hot summer")],
['col1', 'col2'])
While working on the answer, I came up with 2 options.
Option #1 - you'll want to use this one, because it is much cleaner. In order to make both struct fields sort in the same direction, I've added the - sign before the token count of each col1 element (the len field). When both fields sort in the same direction, the code does not need many alterations:
c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), r'\s+')
arr_of_struct = F.transform(
    c1_arr,
    lambda x: F.struct(
        F.size(F.array_intersect(c2_arr, F.split(F.trim(x), r'\s+'))).alias('cnt'),
        # negated token count, so that a descending sort prefers shorter elements
        (-F.size(F.split(F.trim(x), r'\s+'))).alias('len'),
        x.alias('val'),
    )
)
top_val = F.sort_array(arr_of_struct, False)[0]
df = df.withColumn('match', top_val.val)
df.show(truncate=0)
# +-----------------------------------------------+-------------+-------------+
# |col1 |col2 |match |
# +-----------------------------------------------+-------------+-------------+
# |[come and get, computer, come and get more] |come for good|come and get |
# |[summer is hot, summer is too hot, hot weather]|hot tea |hot weather |
# |[summer is hot, summer is too hot, hot weather]|hot summer |summer is hot|
# +-----------------------------------------------+-------------+-------------+
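The negation trick can be checked outside Spark with a plain-Python model of the same struct ordering. This is only a sketch: best_match is a hypothetical helper, Python tuples stand in for the (cnt, len, val) structs, and a set intersection stands in for array_intersect.

```python
def best_match(candidates, text):
    """Pick the candidate sharing the most tokens with text; ties go to the fewest tokens."""
    text_tokens = set(text.split())
    scored = []
    for candidate in candidates:
        tokens = candidate.split()
        cnt = len(text_tokens & set(tokens))
        # Negate the token count so a single descending sort prefers shorter candidates.
        scored.append((cnt, -len(tokens), candidate))
    # Tuples compare field by field, like the structs in sort_array.
    return sorted(scored, reverse=True)[0][2]

rows = [
    (["come and get", "computer", "come and get more"], "come for good"),
    (["summer is hot", "summer is too hot", "hot weather"], "hot tea"),
    (["summer is hot", "summer is too hot", "hot weather"], "hot summer"),
]
matches = [best_match(col1, col2) for col1, col2 in rows]
```

With the sample data this yields the same three matches as the table above.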
Option #2 - for academic use ;)
Make results sorted on:
one struct's field - descending
other struct's field - ascending
Since order directions are different, I've created the custom sorting function (so-called comparator function). It looks at "cnt" field of the struct (and sorts descending). But if "cnt" field values are equal, then the function looks at "len" field (and sorts ascending).
c1_arr = F.col('col1')
c2_arr = F.split(F.trim('col2'), r'\s+')
df = df.withColumn(
    'match',
    F.transform(
        c1_arr,
        lambda x: F.struct(
            F.size(F.array_intersect(c2_arr, F.split(F.trim(x), r'\s+'))).alias('cnt'),
            F.size(F.split(F.trim(x), r'\s+')).alias('len'),
            x.alias('val'),
        )
    )
)
df = df.withColumn('match', F.expr("""
    array_sort(
        match,
        (l, r) -> case when l.cnt > r.cnt then -1
                       when l.cnt < r.cnt then 1
                       when l.len < r.len then -1
                       when l.len > r.len then 1
                       else 0
                  end)
""")[0].val)
df.show(truncate=0)
# +-----------------------------------------------+-------------+-------------+
# |col1 |col2 |match |
# +-----------------------------------------------+-------------+-------------+
# |[come and get, computer, come and get more] |come for good|come and get |
# |[summer is hot, summer is too hot, hot weather]|hot tea |hot weather |
# |[summer is hot, summer is too hot, hot weather]|hot summer |summer is hot|
# +-----------------------------------------------+-------------+-------------+
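At the time of writing, this can only be done through the SQL API (F.expr), because array_sort in the Python API does not accept a comparator function. Newer PySpark releases (3.4 and later, as far as I know) add an optional comparator argument to F.array_sort.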
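For readers less familiar with comparator functions, the same ordering can be sketched in plain Python with functools.cmp_to_key. This is an illustration of the comparator logic only, not Spark code; the dictionaries mimic the (cnt, len, val) structs for the "hot tea" row, and the function name is made up.

```python
from functools import cmp_to_key

def by_cnt_desc_then_len_asc(l, r):
    """Comparator mirroring the SQL lambda: a negative result means l sorts before r."""
    if l["cnt"] != r["cnt"]:
        return -1 if l["cnt"] > r["cnt"] else 1  # higher cnt first
    if l["len"] != r["len"]:
        return -1 if l["len"] < r["len"] else 1  # then fewer tokens first
    return 0

# Structs for col2 = "hot tea": every element shares exactly one token ("hot").
structs = [
    {"cnt": 1, "len": 3, "val": "summer is hot"},
    {"cnt": 1, "len": 4, "val": "summer is too hot"},
    {"cnt": 1, "len": 2, "val": "hot weather"},
]
ranked = sorted(structs, key=cmp_to_key(by_cnt_desc_then_len_asc))
best = ranked[0]["val"]
```

Since all three elements tie on cnt, the comparator falls through to len and the shortest element comes first.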

How to merge 2 arrays where value in one matches a value in another with different key in Ruby

I have an array that contains other arrays of items with prices, but when one has a sale, a new item is created. How do I merge or pull a value from one to the other to make one array, so that the sale item replaces the non-sale one but still contains the original price?
Example:
items=[{"id": 123, "price": 100, "sale": false},{"id":456,"price":25,"sale":false},{"id":678, "price":75, "sale":true, "parent_price_id":123}]
Transform into:
items=[{"id":456,"price":25,"sale":false},{"id":678, "price":75, "sale":true, "parent_price_id":123, "original_price": 100}]
It's not the prettiest solution, but here's one way you can do it. I added a minitest spec to check it against the values you provided and it gives the answer you're hoping for.
require "minitest/autorun"
def merge_prices(prices)
  # Create a hash that maps the ID to the values
  price_map =
    prices
      .map do |price|
        [price[:id], price]
      end
      .to_h
  # Create a result array of copied hashes, so the caller's data is left untouched
  result = prices.map(&:dup)
  # Iterate over a snapshot: deleting from an array while iterating it can skip elements
  result.dup.each do |price|
    if price.key?(:parent_price)
      price[:original_price] = price_map[price[:parent_price]][:price]
      # Delete the original
      result.delete_if { |x| x[:id] == price[:parent_price] }
    end
  end
  result
end
describe "Merge prices" do
it "should work" do
input = [
{"id":123, "price": 100, "sale": false},
{"id":456,"price":25,"sale": false},
{"id":678, "price":75, "sale": true, "parent_price":123}
].freeze
expected_output = [
{"id":456,"price":25,"sale": false},
{"id":678, "price":75, "sale": true, "parent_price":123, "original_price": 100}
].freeze
assert_equal(expected_output, merge_prices(input))
end
end
Let's begin by defining items in an equivalent, but more familiar, way:
items = [
[{:id=>123, :price=>100, :sale=>false}],
[{:id=>456, :price=>25, :sale=>false}],
[{:id=>678, :price=>75, :sale=>true, :parent_price=>123}]
]
with the desired return value being:
[
{:id=>456, :price=>25, :sale=>false},
{:id=>678, :price=>75, :sale=>true, :parent_price=>123,
:original_price=>100}
]
I assume that h[:sale] #=> false for every hash h in items that is referenced as a parent, that is, for which some hash g has g[:parent_price] == h[:id].
A convenient first step is to create the following hash.
h = items.map { |(h)| [h[:id], h] }.to_h
#=> {123=>{:id=>123, :price=>100, :sale=>false},
# 456=>{:id=>456, :price=>25, :sale=>false},
# 678=>{:id=>678, :price=>75, :sale=>true, :parent_price=>123}}
Then:
h.keys.each { |k| h[k][:original_price] =
h.delete(h[k][:parent_price])[:price] if h[k][:sale] }
#=> [123, 456, 678] (not used)
h #=> {456=>{:id=>456, :price=>25, :sale=>false},
# 678=>{:id=>678, :price=>75, :sale=>true, :parent_price=>123,
# :original_price=>100}}
Notice that Hash#delete returns the value of the deleted key.
The last two steps are to extract the values from this hash and replace items with the resulting array of hashes:
items.replace(h.values)
#=> [{:id=>456, :price=>25, :sale=>false},
# {:id=>678, :price=>75, :sale=>true, :parent_price=>123,
# :original_price=>100}]
See Array#replace.
If desired we could combine these steps as follows.
items.replace(
items.map { |(h)| [h[:id], h] }.to_h.tap do |h|
h.keys.each { |k| h[k][:original_price] =
h.delete(h[k][:parent_price])[:price] if h[k][:sale] }
end.values)
#=> [{:id=>456, :price=>25, :sale=>false},
# {:id=>678, :price=>75, :sale=>true, :parent_price=>123,
# :original_price=>100}]
See Object#tap.

lucene solr - how to know numCount of each word in query

I have a query string with 5 words, for example "cat dog fish bird animals".
I need to know how many matches each word has.
At this point I create 5 queries:
/q=name:cat&rows=0&facet=true
/q=name:dog&rows=0&facet=true
/q=name:fish&rows=0&facet=true
/q=name:bird&rows=0&facet=true
/q=name:animals&rows=0&facet=true
and get the match count of each word from its query.
But this method takes too much time.
So is there a way to get the numCount of each word with one query?
Any help appreciated!
In this case, function queries are your friends. In particular:
termfreq(field,term) returns the number of times the term appears in the field for that document. Example syntax: termfreq(text,'memory')
totaltermfreq(field,term) returns the number of times the term appears in the field in the entire index. ttf is an alias of totaltermfreq. Example syntax: ttf(text,'memory')
The following query for instance:
q=*%3A*&fl=cntOnSummary%3Atermfreq(summary%2C%27hello%27)+cntOnTitle%3Atermfreq(title%2C%27entry%27)+cntOnSource%3Atermfreq(source%2C%27activities%27)&wt=json&indent=true
returns the following results:
"docs": [
{
"id": [
"id-1"
],
"source": [
"activities",
"activities"
],
"title": "Ajones3 Activity Entry 1",
"summary": "hello hello",
"cntOnSummary": 2,
"cntOnTitle": 1,
"cntOnSource": 1,
"score": 1
},
{
"id": [
"id-2"
],
"source": [
"activities",
"activities"
],
"title": "Common activity",
"cntOnSummary": 0,
"cntOnTitle": 0,
"cntOnSource": 1,
"score": 1
}
]
Please note that while this works well on single-valued fields, for multivalued fields the functions seem to consider just the first entry. For instance, in the example above, termfreq(source%2C%27activities%27) returns 1 instead of 2.
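Applied to the five words from the question, the single request could be assembled along these lines. This is a minimal sketch: build_count_query and the cnt_ aliases are hypothetical names, the field name comes from the question, and the words are assumed to contain no quotes (no escaping is done). It uses ttf, which counts occurrences across the index; if what you need is the number of matching documents, Solr's docfreq function may be the closer fit, which you should verify against your Solr version.

```python
from urllib.parse import parse_qs, urlencode

def build_count_query(field, words, func="ttf"):
    """Build one query string with a pseudo-field per word, e.g. cnt_cat:ttf(name,'cat')."""
    pseudo_fields = [f"cnt_{word}:{func}({field},'{word}')" for word in words]
    params = {
        "q": "*:*",
        # Index-wide counts are identical on every document, so one row is enough.
        "rows": 1,
        # Space-separated field list, as in the example request above.
        "fl": " ".join(pseudo_fields),
        "wt": "json",
    }
    return urlencode(params)

query = build_count_query("name", ["cat", "dog", "fish", "bird", "animals"])
decoded = parse_qs(query)  # decoded view, handy for checking what will be sent
```

The returned string is URL-encoded and can be appended to whatever select endpoint you already use for the five separate queries.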

Angular.js Select with ngOptions: Label the optgroup

I just started to play with Angular.js and have a question about ngOptions: Is it possible to label the optgroup?
Let's assume 2 objects - cars and garages.
cars = [
{"id": 1, "name": "Diablo", "color": "red", "garageId": 1},
{"id": 2, "name": "Countach", "color": "white", "garageId": 1},
{"id": 3, "name": "Clio", "color": "silver", "garageId": 2},
...
]
garages = [
{"id": 1, "name": "Super Garage Deluxe"},
{"id": 2, "name": "Toms Eastside"},
...
]
With this code I got nearly the result I want:
ng-options = "car.id as car.name + ' (' + car.color + ')' group by car.garageId for car in cars"
Result in the select:
-----------------
1
Diablo (red)
Countach (white)
Firebird (red)
2
Clio (silver)
Golf (black)
3
Hummer (silver)
-----------------
But I want to label the optgroups like "Garage 1", "Garage 2", ... or even better display the name of the garage and not just the garageId.
The angularjs.org documentation for select says nothing about labels for the optgroup, but I would like to extend the group by part of ngOptions like group by car.garageId as 'Garage ' + car.garageId or group by car.garageId as getGarageName(car.garageId) - which sadly is not working.
My only solution so far is to add a new property "garageDisplayName" to the car objects and store there the id + garage name and use that as group by parameter. But I don't want to update all cars whenever a garage name is changed.
Is there a way to label the optgroups with ngOptions, or should I use ngRepeat in that case?
You can just call getGarageName() in the group by without using an as...
ng-options="car.id as car.name + ' (' + car.color + ')' group by getGarageName(car.garageId) for car in cars"
Instead of storing the garage id in each car, you might want to consider storing a reference to the garage object in the garages array. That way you can change the garage name and there is no need to change each car. And the group by simply becomes...
group by car.garage.name
