array_agg in Postgres selectively quotes array elements

I have a complex database with keys and values stored in different tables. It is useful for me to aggregate them when pulling out the values for the application:
SELECT array_agg(key_name), array_agg(vals)
FROM (
SELECT
id,
key_name,
array_agg(value)::VARCHAR(255) AS vals
FROM factor_key_values
WHERE id=20
GROUP BY key_name, id
) f;
This particular query, in my case, gives the following invalid JSON:
-[ RECORD 1 ]-----------------------------------------------------------------------
array_agg | {"comparison method","field score","field value"}
array_agg | {"{\"text category\"}","{100,70,50,0,30}","{A,B,C,F,\"No Experience\"}"}
Notice that an element in the array of varchars is only quoted if the string contains a space. I have narrowed this down to the behaviour of ARRAY_AGG. For completeness, here is an example:
BEGIN;
CREATE TABLE test (txt VARCHAR(255));
INSERT INTO test(txt) VALUES ('one'),('two'),('three'), ('four five');
SELECT array_agg(txt) FROM test;
The result will be:
{one,two,three,"four five"}
This is why my JSON is breaking. I can handle either unquoted or quoted strings in the application code, but having a mix of the two is nuts.
Is there any solution to this?

Can't you use json_agg?
select json_agg(txt) from test;
json_agg
--------------------------------------
["one", "two", "three", "four five"]

Unfortunately, this is the inconsistent standard that PostgreSQL uses for formatting arrays. See "Array Input and Output Syntax" for more information.
Clodoaldo's answer is probably what you want, but as an alternative, you could also build your own result:
SELECT '{'||array_to_string(array_agg(txt::text), ',')||'}' FROM test;

Related

Create dynamic query from the Dataframe present in Spark Scala

I have a dataframe DF as below. Based on the Issue and Datatype columns I want to create a dynamic query.
If the Issue column is YES, check the Datatype: if it is StringType, add trim(DiffColumnName) to the query; if it is IntegerType, do some other operation such as round(COUNT,2).
For the columns where Issue is NO, do nothing and select the column itself.
The query should look like this:
Select DEST_COUNTRY_NAME, trim(ORIGIN_COUNTRY_NAME), round(COUNT,2)
+-------------------+-----------+-----+
| DiffColumnName| Datatype|Issue|
+-------------------+-----------+-----+
| DEST_COUNTRY_NAME| StringType| NO|
|ORIGIN_COUNTRY_NAME| StringType| YES|
| COUNT|IntegerType| YES|
+-------------------+-----------+-----+
I am not sure whether I should use an if/else condition here, a case statement, or a UDF. Also, my dataframe (i.e. its columns) is dynamic and will change every time.
I need some suggestions on how to proceed. Thanks.
This can be accomplished using the following piece of code.
Derive the new column by applying the required operations
Use collect_list to aggregate the values to an array
Format the output using concat_ws and concat
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark (as in spark-shell)

val origDF = Seq(("DEST_COUNTRY_NAME","StringType","NO"),
  ("ORIGIN_COUNTRY_NAME","StringType","YES"),
  ("COUNT","IntegerType","YES"),
  ("TESTCOL","StringType","NO")
).toDF("DiffColumnName","Datatype","Issue")

// Chain the when() clauses with '.' so each condition is checked in turn
val finalDF = origDF.withColumn("newCol",
  when(col("Issue") === "YES" && col("Datatype") === "StringType",
    concat(lit("trim("), col("DiffColumnName"), lit(")")))
  .when(col("Issue") === "YES" && col("Datatype") === "IntegerType",
    concat(lit("round("), col("DiffColumnName"), lit(",2)")))
  .when(col("Issue") === "NO", col("DiffColumnName")))

finalDF.agg(collect_list("newCol").alias("queryout"))
  .select(concat(lit("select "), concat_ws(",", col("queryout")))).show(false)
I included an additional column in the data for testing, and it gives the desired output.
+-------------------------------------------------------------------------+
|concat(select , concat_ws(,, queryout)) |
+-------------------------------------------------------------------------+
|select DEST_COUNTRY_NAME,trim(ORIGIN_COUNTRY_NAME),round(COUNT,2),TESTCOL|
+-------------------------------------------------------------------------+

XPath 'contains()' requires a singleton (or empty sequence)

Given the XML:
<Dial>
<DialID>
24521
</DialID>
<DialName>
Base Price
</DialName>
</Dial>
<Dial>
<DialID>
24528
</DialID>
<DialName>
Rush Options
</DialName>
<DialValue>
1.5
</DialValue>
</Dial>
<Dial>
<DialID>
24530
</DialID>
<DialName>
Bill Rush Charges
</DialName>
<DialValue>
School
</DialValue>
</Dial>
I can use the contains() function in my xpath:
//Dial[DialName[contains(text(), 'Bill')]]/DialValue
To retrieve the values I'm after:
School
The above XML is stored in a field in my SQL database so I'm using the .value method to select from that field.
SELECT Dials.DialDetail.value('(//Dial[DialName[contains(text(), "Bill")]]/DialValue)[1]','VARCHAR(64)') AS BillTo
FROM CampaignDials Dials
I can't seem to get the syntax right though... the xpath works as expected (tested in Oxygen and elsewhere) but when I use it in the XQuery argument of the .value() method, I get an error:
Started executing query at Line 1
Msg 2389, Level 16, State 1, Line 36
XQuery [Dials.DialDetail.value()]: 'contains()' requires a singleton (or empty sequence), found operand of type 'xdt:untypedAtomic *'
Total execution time: 00:00:00.004
I've tried different variations of single and double quotes with no effect. The error refers to an XPath data type for attributes, but I'm not retrieving an attribute; I'm getting the text value. I receive the same error if I type the response with //Dial[DialName[contains(text(), 'Bill')]]/DialValue/text() instead.
What is the correct way to use contains() in an XQuery when it's used in the XML.value() method? Or is this the wrong approach to begin with?
You nearly have it right; you just need [1] on the text() function to guarantee a single value.
You should also use text() on the actual node you are pulling out, for performance reasons.
Also, // can be inefficient, so only use it if you really need recursive descent. You can instead use /*/ to get the first node of any name.
SELECT
Dials.DialDetail.value(
'(//Dial[DialName[contains(text()[1], "Bill")]]/DialValue/text())[1]',
'VARCHAR(64)') AS BillTo
FROM CampaignDials Dials
As Yitzhak Kabinsky notes, this only gets you one value per row of the table; you need .nodes() if you want to shred the XML itself into rows.
The difference between your actual database case that fails and your reduced sample case that works is likely one of different data.
The error,
contains() requires a singleton (or empty sequence)
indicates that one of your DialName elements has multiple text node children rather than a single text node child as you're expecting.
You can abstract away such variations by testing the string-value of DialName rather than its text node children:
//Dial[contains(DialName, 'Bill')]/DialValue
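Inside the .value() call, that string-value test might look like this (a sketch against the same CampaignDials table):
SELECT
    Dials.DialDetail.value(
        '(//Dial[contains((DialName)[1], "Bill")]/DialValue/text())[1]',
        'VARCHAR(64)') AS BillTo
FROM CampaignDials Dials;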
See also
Testing text() nodes vs string values in XPath
Here is how to do XML shredding in MS SQL Server correctly.
You need to apply the filter in the XQuery .nodes() method.
The .value() method is just for the actual value retrieval.
It is possible to pass a SQL Server variable as a parameter instead of hard-coding the "Bill" value.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, DialDetail XML);
INSERT INTO @tbl (DialDetail) VALUES
(N'<Dial>
<DialID>24521</DialID>
<DialName>Base Price</DialName>
</Dial>
<Dial>
<DialID>24528</DialID>
<DialName>Rush Options</DialName>
<DialValue>1.5</DialValue>
</Dial>
<Dial>
<DialID>24530</DialID>
<DialName>Bill Rush Charges</DialName>
<DialValue>School</DialValue>
</Dial>');
-- DDL and sample data population, end
SELECT ID
, c.value('(DialID/text())[1]', 'INT') AS DialID
, c.value('(DialName/text())[1]', 'VARCHAR(30)') AS DialName
, c.value('(DialValue/text())[1]', 'VARCHAR(30)') AS DialValue
FROM @tbl AS tbl CROSS APPLY tbl.DialDetail.nodes('/Dial[contains((DialName/text())[1], "Bill")]') AS t(c);
Output
+----+--------+-------------------+-----------+
| ID | DialID | DialName | DialValue |
+----+--------+-------------------+-----------+
| 1 | 24530 | Bill Rush Charges | School |
+----+--------+-------------------+-----------+

Create a Nested/Repeating field using SQL in BigQuery which can be queried with dot notation (without UNNEST)

I am trying to build a data structure in BigQuery using SQL which exactly reflects the data structure which I obtain when uploading JSON. This will enable me to query the view using SQL with dot notation instead of having to UNNEST, which I do understand but many of my clients find extremely confusing and unintuitive.
If I build a really simple dummy dataset with a couple of rows and then nest using the ARRAY_AGG(STRUCT([field list])) pattern:
WITH
flat_table AS (
SELECT "BigQuery" AS name, 23 AS user_count, "Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description UNION ALL
SELECT "MySQL" AS name, 12 AS user_count, "Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description
)
SELECT
name, user_count,
ARRAY_AGG(STRUCT(data_thing, ease_of_use, description)) AS attributes
FROM flat_table
GROUP BY name, user_count
Then saving and viewing the schema shows that the attributes field is Type = RECORD and Mode = REPEATED. Schema field names are:
name
user_count
attributes
attributes.data_thing
attributes.ease_of_use
attributes.description
If I look at the COLUMN information in the INFORMATION_SCHEMA.COLUMNS query I can see that the attributes field is_nullable = NO and data_type = ARRAY<STRUCT<data_thing STRING, ease_of_use INT64, description STRING>>
If I want to query this structure I need to use the UNNEST pattern as below:
SELECT
name,
user_count
FROM
nested_table,
UNNEST(attributes)
WHERE
ease_of_use > 3
However when I upload the following JSON representation of the same data to BigQuery with automatic schema detection:
{"attributes":{"description":"Awesome","ease_of_use":5,"data_thing":"Data Warehouse"},"user_count":23,"name":"BigQuery"}
{"attributes":{"description":"Solid","ease_of_use":3,"data_thing":"Database"},"user_count":12,"name":"MySQL"}
The schema looks nearly identical once loaded, except for the attributes field is Mode = NULLABLE (it is still Type = RECORD). The INFORMATION_SCHEMA.COLUMNS shows me that the attributes field is now is_nullable = YES and data_type = STRUCT<data_thing STRING, ease_of_use INT64, description STRING>, i.e. now nullable and not in an array.
However the most interesting thing for me is that I can now query this table using dot notation instead of the UNNEST pattern, so the query above becomes:
SELECT
name,
user_count
FROM
nested_table_json
WHERE
attributes.ease_of_use > 3
Which is arguably easier to read, even in this trivial case. However once we get to more complex data structures with multiple nested fields and multi-level nesting, the UNNEST pattern becomes extremely difficult to write, QA and debug. The dot notation pattern appears to be much more intuitive and scalable.
So my question is: is it possible to build a data structure equivalent to the loaded JSON by writing queries in SQL, enabling us to build Standard SQL queries using dot notation and not requiring complex UNNEST patterns?
If you know that your ARRAY_AGG will produce exactly one element, you can drop the ARRAY notation like this:
SELECT
name, user_count,
ARRAY_AGG(STRUCT(data_thing, ease_of_use, description))[offset(0)] AS attributes
Notice the use of OFFSET(0); this way the returned output will be:
[
{
"name": "BigQuery",
"user_count": "23",
"attributes": {
"data_thing": "Data Warehouse",
"ease_of_use": "5",
"description": "Awesome"
}
}
]
which can be queried using dot notation.
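For example, reusing the flat_table CTE from the question, the same filter can then be written with dot notation (a sketch):
WITH flat_table AS (
  SELECT "BigQuery" AS name, 23 AS user_count, "Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description UNION ALL
  SELECT "MySQL" AS name, 12 AS user_count, "Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description
),
nested_table AS (
  SELECT
    name, user_count,
    ARRAY_AGG(STRUCT(data_thing, ease_of_use, description))[OFFSET(0)] AS attributes
  FROM flat_table
  GROUP BY name, user_count
)
SELECT name, user_count
FROM nested_table
WHERE attributes.ease_of_use > 3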
In case you just want to group the result into a STRUCT, you don't need ARRAY_AGG at all.
WITH
flat_table AS (
SELECT "BigQuery" AS name, 23 AS user_count, struct("Data Warehouse" AS data_thing, 5 AS ease_of_use, "Awesome" AS description) as attributes UNION ALL
SELECT "MySQL" AS name, 12 AS user_count, struct("Database" AS data_thing, 3 AS ease_of_use, "Solid" AS description)
)
SELECT
*
FROM flat_table

Unnest multiple arrays in parallel

My last question, Passing an array to a stored procedure in Postgres, was a bit unclear. Now, to clarify my objective:
I want to create a Postgres stored procedure which will accept two input parameters. One will be a list of amounts, for instance (100, 40.5, 76), and the other will be a list of invoices ('01-2222-05','01-3333-04','01-4444-08'). After that I want to use these two lists of numbers and strings and do something with them. For example, I want to take each amount from the numeric array and assign it to the corresponding invoice.
Something like that in Oracle would look like this:
SOME_PACKAGE.SOME_PROCEDURE (
789,
SYSDATE,
SIMPLEARRAYTYPE ('01-2222-05','01-3333-04','01-4444-08'),
NUMBER_TABLE (100,40.5,76),
'EUR',
1,
P_CODE,
P_MESSAGE);
Of course, the two types SIMPLEARRAYTYPE and NUMBER_TABLE are defined earlier in DB.
You will love this new feature of Postgres 9.4:
unnest(anyarray, anyarray [, ...])
unnest() with the much anticipated (at least by me) capability to unnest multiple arrays in parallel cleanly. The manual:
expand multiple arrays (possibly of different types) to a set of rows. This is only allowed in the FROM clause;
It's a special implementation of the new ROWS FROM feature.
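For comparison, the explicit ROWS FROM spelling of the same parallel unnest would be (a sketch using the literals from the question):
SELECT *
FROM ROWS FROM (
       unnest('{100, 40.5, 76}'::numeric[])
     , unnest('{01-2222-05,01-3333-04,01-4444-08}'::text[])
     ) AS t(amount, invoice);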
Your function can now just be:
CREATE OR REPLACE FUNCTION multi_unnest(_some_id int
, _amounts numeric[]
, _invoices text[])
RETURNS TABLE (some_id int, amount numeric, invoice text) AS
$func$
SELECT _some_id, u.* FROM unnest(_amounts, _invoices) u;
$func$ LANGUAGE sql;
Call:
SELECT * FROM multi_unnest(123, '{100, 40.5, 76}'::numeric[]
, '{01-2222-05,01-3333-04,01-4444-08}'::text[]);
Of course, the simple form can be replaced with plain SQL (no additional function):
SELECT 123 AS some_id, *
FROM unnest('{100, 40.5, 76}'::numeric[]
, '{01-2222-05,01-3333-04,01-4444-08}'::text[]) AS u(amount, invoice);
In earlier versions (Postgres 9.3-), you can use the less elegant and less safe form:
SELECT 123 AS some_id
, unnest('{100, 40.5, 76}'::numeric[]) AS amount
, unnest('{01-2222-05,01-3333-04,01-4444-08}'::text[]) AS invoice;
Caveats of the old shorthand form: besides it being non-standard to have a set-returning function in the SELECT list, the number of rows returned would be the lowest common multiple of the arrays' numbers of elements (with surprising results for unequal lengths). Details in these related answers:
Parallel unnest() and sort order in PostgreSQL
Is there something like a zip() function in PostgreSQL that combines two arrays?
This behavior has finally been sanitized with Postgres 10. Multiple set-returning functions in the SELECT list produce rows in "lock-step" now. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
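To actually do something with the paired values, such as the stated goal of assigning each amount to its invoice, the parallel unnest can feed an UPDATE directly. A sketch, assuming a hypothetical payments table with invoice_no and amount columns:
-- payments(invoice_no, amount) is a hypothetical table, for illustration only.
UPDATE payments p
SET    amount = u.amount
FROM   unnest('{100, 40.5, 76}'::numeric[]
            , '{01-2222-05,01-3333-04,01-4444-08}'::text[]) AS u(amount, invoice)
WHERE  p.invoice_no = u.invoice;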
Arrays are declared by adding [] to the base datatype. You declare them as a parameter the same way you declare regular parameters:
The following function accepts an array of integers and an array of strings and will return some dummy text:
create function array_demo(p_data integer[], p_invoices text[])
returns text
as
$$
select p_data[1] || ' => ' || p_invoices[1];
$$
language sql;
select array_demo(array[1,2,3], array['one', 'two', 'three']);
SQLFiddle demo: http://sqlfiddle.com/#!15/fdb8d/1

Parse json arrays using HIVE

I have many json arrays stored in a table (jt) that looks like this:
[{"ts":1403781896,"id":14,"log":"show"},{"ts":1403781896,"id":14,"log":"start"}]
[{"ts":1403781911,"id":14,"log":"press"},{"ts":1403781911,"id":14,"log":"press"}]
Each array is a record.
I would like to parse this table in order to get a new table (logs) with 3 fields: ts, id, log.
I tried to use the get_json_object method, but it seems that method is not compatible with json arrays because I only get null values.
This is the code I have tested:
CREATE TABLE logs AS
SELECT get_json_object(jt.value, '$.ts') AS ts,
get_json_object(jt.value, '$.id') AS id,
get_json_object(jt.value, '$.log') AS log
FROM jt;
I tried to use other functions but they seem really complicated.
Thank you! :)
Update!
I solved my issue by performing a regexp:
CREATE TABLE jt_reg AS
select regexp_replace(regexp_replace(value,'\\}\\,\\{','\\}\\\n\\{'),'\\[|\\]','') as valuereg from jt;
CREATE TABLE logs AS
SELECT get_json_object(jt_reg.valuereg, '$.ts') AS ts,
get_json_object(jt_reg.valuereg, '$.id') AS id,
get_json_object(jt_reg.valuereg, '$.log') AS log
FROM jt_reg;
I just ran into this problem, with the JSON array stored as a string in the hive table.
The solution is a bit hacky and ugly, but it works and doesn't require serdes or external UDFs
SELECT
get_json_object(single_json_table.single_json, '$.ts') AS ts,
get_json_object(single_json_table.single_json, '$.id') AS id,
get_json_object(single_json_table.single_json, '$.log') AS log
FROM ( SELECT explode (
         split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
               '"}","', '"}",,,,"'), ',,,,')
       ) AS single_json
       FROM src_table) single_json_table;
I broke the lines up so that it would be a little easier to read.
I'm using substr() to strip the first and last characters, removing [ and ]. I'm then using regexp_replace to match the separator between records in the JSON array and change it to something unique that can easily be used with split(), turning the string into a Hive array of JSON objects which can then be used with explode(), as described in the previous solution.
Note: the separator regex used here ( "}"," ) wouldn't work with the original data set; the regex would have to be ( "},\{" ) and the replacement would then need to be "},,,,{", e.g.:
split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
'"},\\{"', '"},,,,{"'), ',,,,')
Use the explode() function:
hive (default)> CREATE TABLE logs AS
> SELECT get_json_object(single_json_table.single_json, '$.ts') AS ts,
> get_json_object(single_json_table.single_json, '$.id') AS id,
> get_json_object(single_json_table.single_json, '$.log') AS log
> FROM
> (SELECT explode(json_array_col) as single_json FROM jt) single_json_table ;
Automatically selecting local only mode for query
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
hive (default)> select * from logs;
OK
ts id log
1403781896 14 show
1403781896 14 start
1403781911 14 press
1403781911 14 press
Time taken: 0.118 seconds, Fetched: 4 row(s)
hive (default)>
where json_array_col is the column in jt which holds your array of JSON strings.
hive (default)> select json_array_col from jt;
json_array_col
["{"ts":1403781896,"id":14,"log":"show"}","{"ts":1403781896,"id":14,"log":"start"}"]
["{"ts":1403781911,"id":14,"log":"press"}","{"ts":1403781911,"id":14,"log":"press"}"]
Because get_json_object doesn't support a JSON array string, you can concat it into a JSON object, like this:
SELECT
get_json_object(concat(concat('{"root":', jt.value), '}'), '$.root')
FROM jt;
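If individual elements then need to be addressed, get_json_object's bracket indexing on the wrapped root should also work (a sketch, assuming the jt.value column from the question):
-- [0] picks the first object in the wrapped array.
SELECT
  get_json_object(concat('{"root":', jt.value, '}'), '$.root[0].ts')  AS first_ts,
  get_json_object(concat('{"root":', jt.value, '}'), '$.root[0].log') AS first_log
FROM jt;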
