How to query a nested jsonb array for a given key

I have a table 'animals' with a jsonb column 'data' that stores information:
row 1 -> [{"type": "rabbit", "value": "carrot"}, {"type": "cat", "value": "milk"}]
row 2 -> [{"type": "cat", "value": "fish"}, {"type": "rabbit", "value": "leaves"}]
I need to query the value for type "rabbit" from all the rows. I tried to use the #> operator:
select * from data where data #> '[{"type":"rabbit"}]';
but it doesn't work.
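For reference, a minimal sketch of what should work here (assuming PostgreSQL 9.4+, a table animals, and a jsonb column data as above): #> extracts the value at a path rather than testing membership, so it can't filter rows this way; the usual tools are the @> containment operator and jsonb_array_elements.

-- rows whose array contains an object with type = rabbit
SELECT *
FROM animals
WHERE data @> '[{"type": "rabbit"}]';

-- the rabbit values themselves, one row per match
SELECT elem->>'value' AS value
FROM animals,
     jsonb_array_elements(data) AS elem
WHERE elem->>'type' = 'rabbit';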

Related

PySpark array preserving order

I have a structure along these lines: an invoice table and an invoice lines table. I want to output the lines as a JSON array in a mandated schema, ordered by line number, but the line number isn't in the schema (it is assumed to be implicit in the array). As I understand it, both PySpark and JSON will preserve the array order once it is created. Please see the rough example below. How can I make sure the invoice lines preserve the line-number order? I could do it using a list comprehension, but this means dropping out of Spark, which I think would be inefficient.
from pyspark.sql.functions import collect_list, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

invColumns = StructType([
    StructField("invoiceNo", StringType(), True),
    StructField("invoiceStuff", StringType(), True)
])
invData = [("1", "stuff"), ("2", "other stuff"), ("3", "more stuff")]

invLines = StructType([
    StructField("lineNo", IntegerType(), True),
    StructField("invoiceNo", StringType(), True),
    StructField("detail", StringType(), True),
    StructField("quantity", IntegerType(), True)
])
lineData = [(1, "1", "item stuff", 3), (2, "1", "new item stuff", 2), (3, "1", "old item stuff", 5),
            (1, "2", "item stuff", 3), (1, "3", "item stuff", 3), (2, "3", "more item stuff", 7)]

invoice_df = spark.createDataFrame(data=invData, schema=invColumns)    # in reality read from a spark table
invLine_df = spark.createDataFrame(data=lineData, schema=invLines)     # in reality read from a spark table

invoicesTemp_df = (invoice_df.select('invoiceNo', 'invoiceStuff')
                   .join(invLine_df.select('lineNo', 'invoiceNo', 'detail', 'quantity'),
                         on='invoiceNo'))
invoicesOut_df = (invoicesTemp_df.withColumn('invoiceLines', struct('detail', 'quantity'))
                  .groupBy('invoiceNo', 'invoiceStuff')
                  .agg(collect_list('invoiceLines').alias('invoiceLines'))
                  .select('invoiceNo', 'invoiceStuff', 'invoiceLines'))
display(invoicesOut_df)
invoiceNo | invoiceStuff | invoiceLines
3 | more stuff  | [{"detail": "item stuff", "quantity": 3}, {"detail": "more item stuff", "quantity": 7}]
1 | stuff       | [{"detail": "new item stuff", "quantity": 2}, {"detail": "old item stuff", "quantity": 5}, {"detail": "item stuff", "quantity": 3}]
2 | other stuff | [{"detail": "item stuff", "quantity": 3}]
The following, as requested, is the input data.
Invoice Table
"InvoiceNo","InvoiceStuff"
"1","stuff"
"2","other stuff"
"3","more stuff"
Invoice Lines Table
"LineNo","InvoiceNo","Detail","Quantity"
1,"1","item stuff",3
2,"1","new item stuff",2
3,"1","old item stuff",5
1,"2","item stuff",3
1,"3","item stuff",3
2,"3","more item stuff",7
The output should look like this, with the arrays ordered by the line number from the invoice lines table, even though the line number itself isn't in the output.
Output
"1","stuff","[{"detail": "item stuff", "quantity": 3},{"detail": "new item stuff", "quantity": 2},{"detail": "old item stuff", "quantity": 5}]",
"2","other stuff","[{"detail": "item stuff", "quantity": 3}]"
"3","more stuff","[{"detail": "item stuff", "quantity": 3},{"detail": "more item stuff", "quantity": 7}]"
collect_list does not respect the data's order. From the docs:
Note: "The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle."
One possible way to do that is applying collect_list with a window function, where you can control the order.
from pyspark.sql import functions as F
from pyspark.sql import Window as W

(invoice_df
    .join(invLine_df, on='invoiceNo')
    .withColumn('invoiceLines', F.struct('lineNo', 'detail', 'quantity'))
    .withColumn('a', F.collect_list('invoiceLines').over(W.partitionBy('invoiceNo').orderBy('lineNo')))
    .groupBy('invoiceNo')
    .agg(F.max('a').alias('invoiceLines'))
    .show(10, False)
)
+---------+--------------------------------------------------------------------+
|invoiceNo|invoiceLines |
+---------+--------------------------------------------------------------------+
|1 |[{1, item stuff, 3}, {2, new item stuff, 2}, {3, old item stuff, 5}]|
|2 |[{1, item stuff, 3}] |
|3 |[{1, item stuff, 3}, {2, more item stuff, 7}] |
+---------+--------------------------------------------------------------------+
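The F.max('a') trick works because an ordered window's default frame grows row by row, and array comparison is element-wise with a prefix sorting before a longer array, so the fully collected list is the maximum of its own prefixes. If you prefer to avoid the window, here is a sketch of an alternative (assuming Spark 2.4+; the column names tmp and ordered_df are illustrative): collect structs with lineNo as the first field, sort the array (struct comparison goes field by field, so lineNo drives the order), then strip lineNo out.

from pyspark.sql import functions as F

ordered_df = (invoice_df
    .join(invLine_df, on='invoiceNo')
    .groupBy('invoiceNo', 'invoiceStuff')
    # lineNo is the first struct field, so array_sort orders by it
    .agg(F.array_sort(F.collect_list(F.struct('lineNo', 'detail', 'quantity'))).alias('tmp'))
    # drop lineNo from each struct; named_struct keeps the mandated field names
    .withColumn('invoiceLines',
                F.expr("transform(tmp, x -> named_struct('detail', x.detail, 'quantity', x.quantity))"))
    .drop('tmp'))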

Split JSON data into separate rows in BigQuery

I am writing a BigQuery query to split a JSON data set into a more structured table.
The JSON data set looks something like this:
Row | createdon | result
1 | 24022020 | {"searchResult": {"searchAccounts": [{"chainName": "xyxvjw", "address": {"name": "xyxvjw - ythji", "combined_city": "uptown", "combined_address": "1 downtown, uptown, 09728", "city": "uptown"}, "products": ["pin", "needle", "cloth"]}, {"chainName": "pwiewhds", "address": {"name": "pwiewhds - oujsus", "combined_city": "over the river", "combined_address": "100 under bridge, over the river, 19920", "city": "over the river"}, "products": ["tape", "stapler"]}], "searchID": "3abci832832o0"}}
2 | 25020202 | {"searchResult": {"searchAccounts": [{"chainName": "xyxvjw2029", "address": {"name": "xyxvjw2029 - ythji", "combined_city": "uptown", "combined_address": "1 downtown, uptown, 09728", "city": "uptown"}, "products": ["pin", "needle", "cloth"]}, {"chainName": "pwiewhds8972", "address": {"name": "pwiewhds8972 - oujsus", "combined_city": "over the river", "combined_address": "100 under bridge, over the river, 19920", "city": "over the river"}, "products": ["tape", "stapler"]}], "searchID": "3abci832832o0"}}
There are many subsequent account details in each row of the result column.
I am able to unnest the data using the code below to get columns such as chain name and address. However, when I try to reference the broken-out field columns, I get the error: Cannot access field _field_1 on a value with type ARRAY<STRUCT<STRING, STRING>>.
How can I separate the columns created from the JSON data into individual columns and rows without being tied to the JSON row column?
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  if (json !== null) {
    return JSON.parse(json).map(x => JSON.stringify(x));
  }
""";
SELECT * EXCEPT(chains),
  ARRAY(
    SELECT AS STRUCT
      JSON_EXTRACT_SCALAR(x, '$.chainName'),
      JSON_EXTRACT_SCALAR(x, '$.address.combined_address')
    FROM UNNEST(chains) x
    WHERE JSON_EXTRACT_SCALAR(x, '$.chainName') IS NOT NULL
  ) chain_names
FROM (
  SELECT *,
    json2array(JSON_EXTRACT(result, '$.searchResult.searchAccounts')) chains
  FROM json_data_set
)
I just needed to write the query a different way to get individual columns:
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  if (json !== null) {
    return JSON.parse(json).map(x => JSON.stringify(x));
  }
""";
WITH chain_name AS (
  SELECT *,
    json2array(JSON_EXTRACT(result, '$.searchResult.searchAccounts')) chains
  FROM json_data_set
)
SELECT AS STRUCT
  JSON_EXTRACT_SCALAR(x, '$.chainName') chainName,
  JSON_EXTRACT_SCALAR(x, '$.address.combined_address') combined_address
FROM chain_name, UNNEST(chains) x
WHERE JSON_EXTRACT_SCALAR(x, '$.chainName') IS NOT NULL
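Incidentally, the Cannot access field _field_1 error in the first attempt likely comes from the unnamed expressions inside the ARRAY(SELECT AS STRUCT ...) subquery: BigQuery assigns them placeholder names like _field_1 that can't usefully be referenced from outside. As a sketch, aliasing them in that fragment should also have worked:

ARRAY(
  SELECT AS STRUCT
    JSON_EXTRACT_SCALAR(x, '$.chainName') AS chainName,
    JSON_EXTRACT_SCALAR(x, '$.address.combined_address') AS combined_address
  FROM UNNEST(chains) x
  WHERE JSON_EXTRACT_SCALAR(x, '$.chainName') IS NOT NULL
) AS chain_names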

How can I create an external table in Hive from an HDFS file that contains a JSON array?

My JSON looks like this:
[
  {
    "blocked": 1,
    "object": {
      "ip": "abc",
      "src_ip": "abc",
      "lan_initiated": true,
      "detection": "abc",
      "src_port": "abc",
      "src_mac": "abc",
      "dst_mac": "abc",
      "dst_ip": "abc",
      "dst_port": "abc"
    },
    "object_type": "url",
    "threat": "",
    "threat_type": "abc",
    "device_id": "abc",
    "app_id": "abc",
    "user_id": "abc",
    "timestamp": 1520268249657,
    "date": {
      "$date": "Mon Mar 05 2018 16:44:09 GMT+0000 (UTC)"
    },
    "expire": {
      "$date": "Fri May 04 2018 16:44:09 GMT+0000 (UTC)"
    },
    "_id": "abc"
  }
]
I have tried:
CREATE EXTERNAL TABLE `table_name`(
  reports array<struct<
    user_id:string,
    device_id:string,
    app_id:string,
    blocked:string,
    object:struct<ip:string, src_ip:string, lan_initiated:string, detection:string,
                  src_port:string, src_mac:string, dst_mac:string, dst_ip:string, dst_port:string>,
    object_type:string,
    threat:string,
    threat_type:string,
    servertime:string,
    date_t:struct<dat:string>,
    expire:struct<dat:string>,
    id:string>>)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json'='false',
  'mapping.dat'='$date',
  'mapping.servertime'='timestamp',
  'mapping.date_t'='date',
  'mapping.id'='_id')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'abc'
and after that:
SELECT * FROM table_name
LATERAL VIEW OUTER explode(reports) exploded_table AS rep;
but I get: Vertex did not succeed due to OWN_TASK_FAILURE - killed/failed due to: null.
I have read that the JSON cannot be parsed because it starts with '['. Any ideas? Does the structure of the JSON need to be changed?
I believe you have made a mistake in specifying the location. In the code, you have mentioned:
LOCATION
'abc'
LOCATION is expected to be a folder in which your JSON file is present; the JSON file itself can have any name.
You also need to make sure that the JSON SerDe jar is on the classpath. If it is not, run the command below before trying to create the table.
hive> add jar /path/to/<json-serde-jar>;
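For example (path hypothetical), if the JSON file sits at /data/reports_json/report1.json on HDFS, the DDL should end with:

LOCATION
  '/data/reports_json/'

i.e. it should point at the directory, not at the file itself.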

Lucene/Solr: how to get the match count of each word in a query

I have a query string with 5 words, for example "cat dog fish bird animals".
I need to know how many matches each word has.
At this point I create 5 queries:
/q=name:cat&rows=0&facet=true
/q=name:dog&rows=0&facet=true
/q=name:fish&rows=0&facet=true
/q=name:bird&rows=0&facet=true
/q=name:animals&rows=0&facet=true
and get the match count of each word from each query,
but this method takes too much time.
So, is there a way to get the count of each word with one query?
Any help appreciated!
In this case, functionQueries are your friends. In particular:
termfreq(field,term) returns the number of times the term appears in the field for that document. Example syntax: termfreq(text,'memory')
totaltermfreq(field,term) returns the number of times the term appears in the field in the entire index. ttf is an alias of totaltermfreq. Example syntax: ttf(text,'memory')
The following query, for instance (URL-decoded here for readability):
q=*:*&fl=cntOnSummary:termfreq(summary,'hello') cntOnTitle:termfreq(title,'entry') cntOnSource:termfreq(source,'activities')&wt=json&indent=true
returns the following results:
"docs": [
{
"id": [
"id-1"
],
"source": [
"activities",
"activities"
],
"title": "Ajones3 Activity Entry 1",
"summary": "hello hello",
"cntOnSummary": 2,
"cntOnTitle": 1,
"cntOnSource": 1,
"score": 1
},
{
"id": [
"id-2"
],
"source": [
"activities",
"activities"
],
"title": "Common activity",
"cntOnSummary": 0,
"cntOnTitle": 0,
"cntOnSource": 1,
"score": 1
}
}
]
Please notice that while this works well on single-valued fields, it seems that for multivalued fields the functions consider just the first entry; for instance, in the example above, termfreq(source,'activities') returns 1 instead of 2.
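Applied to the original question, a single request along these lines should return all five counts at once (a sketch, assuming the name field from the question; note that ttf counts total term occurrences across the index, which can exceed the number of matching documents when a term repeats inside one document):

q=*:*&rows=1&fl=cntCat:ttf(name,'cat'),cntDog:ttf(name,'dog'),cntFish:ttf(name,'fish'),cntBird:ttf(name,'bird'),cntAnimals:ttf(name,'animals')

If exact document match counts are what you need, another single-request option is facet queries:

q=*:*&rows=0&facet=true&facet.query=name:cat&facet.query=name:dog&facet.query=name:fish&facet.query=name:bird&facet.query=name:animals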

Solr query for not equal to text value and number greater than 0

I have solr documents with two fields, one is a string and one is an integer. Both fields are allowed to be null. I am attempting to write a query that will eliminate documents with the following properties:
textField = "badValue" AND (numberField is null OR numberField = 0)
I added the following fq:
((NOT textField=badValue) OR numberField=[1 TO *])
This does not seem to have worked properly, because I am getting a document with textField = badValue and numberField = 0. What did I do wrong with my fq?
The full query response header, containing the parsed query is:
"responseHeader": {
"status": 0,
"QTime": 245,
"params": {
"q": "(numi) AND (solr_specs:[* TO ] OR full_description:[ TO ])",
"defType": "edismax",
"bf": "log(sum(popularity,1))",
"indent": "true",
"qf": "categories^3.0 manufacturer^1.0 sku^0.2 split_sku^0.2 upc^1.0 invoice_description^2.6 full_description solr_specs^0.8 solr_spec_values^1.7 legacyid legacy_altcode id",
"fl": "distributor_status,QOH_estimate,id,score",
"start": "0",
"fq": "((:* NOT distributor_status=VENDORDISC) OR QOH_estimate=[1 TO *])",
"sort": "score desc,id desc",
"rows": "20",
"wt": "json",
"_": "1441220051438"
}
}
QOH_estimate is numberField and distributor_status is textField.
Please try the following in your fq parameter: ((*:* NOT textField:badValue) OR numberField:[1 TO *]). With your field names:
((*:* NOT distributor_status:VENDORDISC) OR QOH_estimate:[1 TO *])
Here you are first selecting the documents that do not contain textField:badValue, then ORing them with the documents coming from the numberField:[1 TO *] condition. Two details matter: Solr separates field and value with ':' (not '='), and the *:* in front of NOT is needed because a purely negative clause has no document set of its own to subtract from.
