Create a dynamic query from a DataFrame in Spark Scala

I have a dataframe DF as below. Based on the Issue column and the Datatype column I want to build a dynamic query.
If the Issue column is YES, check the Datatype: if it is StringType, add trim(DiffColumnName) to the query; if the Datatype is IntegerType, do some other operation such as round(COUNT,2).
For columns where the Issue column is NO, do nothing and select the column itself.
The query should look like this:
Select DEST_COUNTRY_NAME, trim(ORIGIN_COUNTRY_NAME), round(COUNT,2)
+-------------------+-----------+-----+
| DiffColumnName| Datatype|Issue|
+-------------------+-----------+-----+
| DEST_COUNTRY_NAME| StringType| NO|
|ORIGIN_COUNTRY_NAME| StringType| YES|
| COUNT|IntegerType| YES|
+-------------------+-----------+-----+
I am not sure whether I should use an if/else condition here, a case statement, or create a UDF. Also, my dataframe (i.e. its columns) is dynamic and will change every time.
I need some suggestions on how to proceed here. Thanks

This can be accomplished with the following piece of code:
Derive the new column by applying the required operations.
Use collect_list to aggregate the values into an array.
Format the output using concat_ws and concat.
import org.apache.spark.sql.functions._
import spark.implicits._ // provided by the SparkSession (e.g. in spark-shell)

val origDF = Seq(("DEST_COUNTRY_NAME","StringType","NO"),
  ("ORIGIN_COUNTRY_NAME","StringType","YES"),
  ("COUNT","IntegerType","YES"),
  ("TESTCOL","StringType","NO")
).toDF("DiffColumnName","Datatype","Issue")

// Build the expression string for each column based on Issue and Datatype
val finalDF = origDF.withColumn("newCol",
  when(col("Issue") === "YES" && col("Datatype") === "StringType", concat(lit("trim("), col("DiffColumnName"), lit(")")))
  .when(col("Issue") === "YES" && col("Datatype") === "IntegerType", concat(lit("round("), col("DiffColumnName"), lit(",2)")))
  .when(col("Issue") === "NO", col("DiffColumnName")))

// Aggregate the expressions into an array and format them as a single SELECT
finalDF.agg(collect_list("newCol").alias("queryout"))
  .select(concat(lit("select "), concat_ws(",", col("queryout"))))
  .show(false)
I included an additional column in the test data, and it produces the desired output:
+-------------------------------------------------------------------------+
|concat(select , concat_ws(,, queryout)) |
+-------------------------------------------------------------------------+
|select DEST_COUNTRY_NAME,trim(ORIGIN_COUNTRY_NAME),round(COUNT,2),TESTCOL|
+-------------------------------------------------------------------------+
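If you also want to run the generated statement, one option is to collect the expression strings to the driver, assemble the query, and pass it to spark.sql. Below is a rough sketch of that idea, shown in PySpark purely for illustration; the SparkSession named spark and the temp view name flights are assumptions, not part of the original answer.

# Hypothetical sketch: build the same expression strings and execute the resulting query.
# Assumes a SparkSession named `spark` and that the source data is registered as a view `flights`.
from pyspark.sql import functions as F

meta = spark.createDataFrame(
    [("DEST_COUNTRY_NAME", "StringType", "NO"),
     ("ORIGIN_COUNTRY_NAME", "StringType", "YES"),
     ("COUNT", "IntegerType", "YES")],
    ["DiffColumnName", "Datatype", "Issue"])

# Same logic as above: wrap the column in trim(...) or round(..., 2) when Issue is YES
expr_col = (F.when((F.col("Issue") == "YES") & (F.col("Datatype") == "StringType"),
                   F.concat(F.lit("trim("), F.col("DiffColumnName"), F.lit(")")))
             .when((F.col("Issue") == "YES") & (F.col("Datatype") == "IntegerType"),
                   F.concat(F.lit("round("), F.col("DiffColumnName"), F.lit(",2)")))
             .otherwise(F.col("DiffColumnName")))

# Collect the expressions, build the SELECT, and run it against the (assumed) view
query = "select " + ",".join(r[0] for r in meta.withColumn("newCol", expr_col)
                                               .select("newCol").collect())
spark.sql(query + " from flights").show()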


How to convert regexp_substr(Oracle) to SQL Server?

I have a data table with a column Acctno; the expected output is shown in a separate column:
|Acctno              | expected_output |
|ABC:BKS:1023049101  | 1023049101      |
|ABC:UWR:19048234582 | 19048234582     |
|ABC:UEW:1039481843  | 1039481843      |
In Oracle SQL I used the query below:
select regexp_substr(acctno, '[^:]+', 1, 3) as expected_output
from temp_mytable
but in Microsoft SQL Server I am getting an error that regexp_substr is not a built-in function.
How can I resolve this issue?
We can use PATINDEX with SUBSTRING here:
SELECT SUBSTRING(acctno, PATINDEX('%:[0-9]%', acctno) + 1, LEN(acctno)) AS expected_output
FROM temp_mytable;
Note that this answer assumes the third component always starts with a digit and that the first two components contain no digits. If that were not true, we would have to do more work.
Just another option, if the desired value is the last portion of the string and there are no more than 4 segments:
Select *
,NewValue = parsename(replace(Acctno,':','.'),1)
from YourTable

How to convert a CharField to a DateTimeField in peewee on the fly?

I have a model I created on the fly for peewee. Something like this:
class TestTable(PeeweeBaseModel):
    whencreated_dt = DateTimeField(null=True)
    whenchanged = CharField(max_length=50, null=True)
I load data from a text file into a table using peewee; the column "whenchanged" contains dates in the format '%Y-%m-%d %H:%M:%S' as a varchar column. Now I want to convert the text field "whenchanged" into a datetime value in "whencreated_dt".
I tried several things... I ended up with this:
# Initialize table to TestTable
to_execute = "table.update({table.%s : datetime.strptime(table.%s, '%%Y-%%m-%%d %%H:%%M:%%S')}).execute()" % ('whencreated_dt', 'whencreated')
which fails with "TypeError: strptime() argument 1 must be str, not CharField". I'm trying to convert "whencreated" to a datetime and then assign it to "whencreated_dt".
I tried a variation... the following, for example, works without a hitch:
# Initialize table to TestTable
to_execute = "table.update({table.%s : datetime.now()}).execute()" % (self.name)
exec(to_execute)
But this is of course just the current datetime, and not another field.
Anyone knows a solution to this?
Edit... I did find a workaround eventually... but I'm still looking for a better solution... The workaround:
all_objects = table.select()
for o in all_objects:
    datetime_str = getattr(o, 'whencreated')
    setattr(o, 'whencreated_dt', datetime.strptime(datetime_str, '%Y-%m-%d %H:%M:%S'))
    o.save()
Loop over all rows in the table, get the "whencreated". Convert "whencreated" to a datetime, put it in "whencreated_dt", and save each row.
Regards,
Sven
Your example:
to_execute = "table.update({table.%s : datetime.strptime(table.%s, '%%Y-%%m-%%d %%H:%%M:%%S')}).execute()" % ('whencreated_dt', 'whencreated')
Will not work. Why? Because datetime.strptime is a Python function and operates in Python. An UPDATE query works in database-land. How the hell is the database going to magically pass row values into "datetime.strptime"? How would the db even know how to call such a function?
Instead you need to use a SQL function -- a function that is executed by the database. For example, Postgres:
TestTable.update(whencreated_dt=TestTable.whenchanged.cast('timestamp')).execute()
This is the equivalent SQL:
UPDATE test_table SET whencreated_dt = CAST(whenchanged AS timestamp);
That should populate the column for you using the correct data type. For other databases, consult their manuals. Note that SQLite does not have a dedicated date/time data type, and the datetime functionality uses strings in the Y-m-d H:M:S format.
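For context, here is a minimal self-contained sketch of that cast-based update; the database name, credentials, and connection setup are hypothetical, and it assumes PostgreSQL as in the example above.
# Hypothetical, minimal peewee sketch of the CAST-based update (PostgreSQL assumed).
from peewee import PostgresqlDatabase, Model, CharField, DateTimeField

db = PostgresqlDatabase('mydb', user='me', password='secret')  # hypothetical connection

class TestTable(Model):
    whencreated_dt = DateTimeField(null=True)
    whenchanged = CharField(max_length=50, null=True)

    class Meta:
        database = db

# Runs the SQL shown above: SET whencreated_dt = CAST(whenchanged AS timestamp)
TestTable.update(whencreated_dt=TestTable.whenchanged.cast('timestamp')).execute()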

Array_agg in postgres selectively quotes

I have a complex database with keys and values stored in different tables. It is useful for me to aggregate them when pulling out the values for the application:
SELECT array_agg(key_name), array_agg(vals)
FROM (
    SELECT
        id,
        key_name,
        array_agg(value)::VARCHAR(255) AS vals
    FROM factor_key_values
    WHERE id = 20
    GROUP BY key_name, id
) f;
This particular query, in my case, gives the following invalid JSON:
-[ RECORD 1 ]-----------------------------------------------------------------------
array_agg | {"comparison method","field score","field value"}
array_agg | {"{\"text category\"}","{100,70,50,0,30}","{A,B,C,F,\"No Experience\"}"}
Notice that an element of the varchar array is only quoted if the string contains a space. I have narrowed this down to the behaviour of ARRAY_AGG. For completeness, here is an example:
BEGIN;
CREATE TABLE test (txt VARCHAR(255));
INSERT INTO test(txt) VALUES ('one'),('two'),('three'), ('four five');
SELECT array_agg(txt) FROM test;
The result will be:
{one,two,three,"four five"}
This is why my JSON is breaking. I can handle unquoted or quoted strings in the application code, but having a mix is nuts.
Is there any solution to this?
Can't you use json_agg?
select json_agg(txt) from test;
json_agg
--------------------------------------
["one", "two", "three", "four five"]
Unfortunately, this is the inconsistent standard that PostgreSQL uses for formatting arrays. See "Array Input and Output Syntax" for more information.
Clodoaldo's answer is probably what you want, but as an alternative, you could also build your own result:
SELECT '{'||array_to_string(array_agg(txt::text), ',')||'}' FROM test;
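To illustrate why the mixed quoting breaks application-side parsing while the json_agg output does not, here is a small sketch (in Python, using the two outputs shown above):
import json

array_agg_text = '{one,two,three,"four five"}'          # array literal produced by array_agg
json_agg_text = '["one", "two", "three", "four five"]'  # output produced by json_agg

print(json.loads(json_agg_text))      # works: ['one', 'two', 'three', 'four five']

try:
    json.loads(array_agg_text)        # fails: bare words like `one` are not valid JSON strings
except json.JSONDecodeError as err:
    print("not JSON:", err)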

Parse json arrays using HIVE

I have many json arrays stored in a table (jt) that looks like this:
[{"ts":1403781896,"id":14,"log":"show"},{"ts":1403781896,"id":14,"log":"start"}]
[{"ts":1403781911,"id":14,"log":"press"},{"ts":1403781911,"id":14,"log":"press"}]
Each array is a record.
I would like to parse this table in order to get a new table (logs) with 3 fields: ts, id, log.
I tried to use the get_json_object method, but it seems that method is not compatible with json arrays because I only get null values.
This is the code I have tested:
CREATE TABLE logs AS
SELECT get_json_object(jt.value, '$.ts') AS ts,
get_json_object(jt.value, '$.id') AS id,
get_json_object(jt.value, '$.log') AS log
FROM jt;
I tried to use other functions but they seem really complicated.
Thank you! :)
Update!
I solved my issue by performing a regexp:
CREATE TABLE jt_reg AS
select regexp_replace(regexp_replace(value,'\\}\\,\\{','\\}\\\n\\{'),'\\[|\\]','') as valuereg from jt;
CREATE TABLE logs AS
SELECT get_json_object(jt_reg.valuereg, '$.ts') AS ts,
       get_json_object(jt_reg.valuereg, '$.id') AS id,
       get_json_object(jt_reg.valuereg, '$.log') AS log
FROM jt_reg;
I just ran into this problem, with the JSON array stored as a string in the hive table.
The solution is a bit hacky and ugly, but it works and doesn't require serdes or external UDFs
SELECT
    get_json_object(single_json_table.single_json, '$.ts') AS ts,
    get_json_object(single_json_table.single_json, '$.id') AS id,
    get_json_object(single_json_table.single_json, '$.log') AS log
FROM (SELECT explode(
          split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
                               '"}","', '"}",,,,"'), ',,,,')
      ) AS single_json FROM src_table) single_json_table;
I broke the lines up so that it would be a little easier to read.
I'm using substr() to strip the first and last characters, removing [ and ]. I'm then using regexp_replace to match the separator between records in the JSON array and change the separator to something unique, which can then easily be used with split() to turn the string into a Hive array of JSON objects, which in turn can be used with explode() as described in the previous solution.
Note: the separator regex used here ( "}"," ) wouldn't work with the original data set; there the regex would have to be ( "},\{" ) and the replacement would then need to be "},,,,{" , e.g.:
split(regexp_replace(substr(json_array_col, 2, length(json_array_col)-2),
'"},\\{"', '"},,,,{"'), ',,,,')
Use the explode() function:
hive (default)> CREATE TABLE logs AS
> SELECT get_json_object(single_json_table.single_json, '$.ts') AS ts,
> get_json_object(single_json_table.single_json, '$.id') AS id,
> get_json_object(single_json_table.single_json, '$.log') AS log
> FROM
> (SELECT explode(json_array_col) as single_json FROM jt) single_json_table ;
Automatically selecting local only mode for query
Total MapReduce jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
hive (default)> select * from logs;
OK
ts id log
1403781896 14 show
1403781896 14 start
1403781911 14 press
1403781911 14 press
Time taken: 0.118 seconds, Fetched: 4 row(s)
hive (default)>
where json_array_col is the column in jt which holds your array of JSON strings.
hive (default)> select json_array_col from jt;
json_array_col
["{"ts":1403781896,"id":14,"log":"show"}","{"ts":1403781896,"id":14,"log":"start"}"]
["{"ts":1403781911,"id":14,"log":"press"}","{"ts":1403781911,"id":14,"log":"press"}"]
Because get_json_object doesn't support a JSON array string, you can concat it into a JSON object, like this:
SELECT
get_json_object(concat(concat('{"root":', jt.value), '}'), '$.root')
FROM jt;

SSIS "Enumerator failed to retrieve element at index" Error

In my SSIS package I am using a data flow task to extract data from SQL Server and put it into a dataset with the following schema:
Column1 Int32
Column2 Object
Column3 Object
Column4 String
Column5 Double
That step seems to work well. In the foreach editor I mapped the columns to variables like this:
VARIABLE | INDEX
User::Column1 | 0
User::Column2 | 1
User::Column3 | 2
User::Column4 | 3
User::Column5 | 4
When I run the package I get the following error on the foreach task:
Error: The enumerator failed to retrieve element at index "4".
Error: ForEach Variable Mapping number 5 to variable "User::Column5" cannot be applied.
There are no null values in Column5 and I can clearly see all 5 columns in the query when I run it against the database. Any assistance is greatly appreciated!
I finally found the problem: the target dataset in the data flow task was dropping the last column for some reason. Once I recreated the dataset destination, everything worked.
