Scala Spark - Get String row values from a WrappedArray

I'm writing a Scala script for Spark and I have a DataFrame named "specialArray", as follows:
specialArray = ...
specialArray.show(6)
__________________________ console __________________________________
specialArray: org.apache.spark.sql.DataFrame = [_VALUE: array<string>]
+--------------+
| _VALUE|
+--------------+
| [fullForm]|
| [fullForm]|
| [fullForm]|
| [fullForm]|
| [fullForm]|
| [fullForm]|
| [fullForm]|
+--------------+
only showing top 6 rows
But I would like to see the content of those "fullForm" sub-arrays. How would you do this, please?
Thank you very much in advance!
I have already tried to get the first value this way:
val resultTest = specialArray.map(s => s.toString).toDF().collect()(0)
__________________________ console __________________________________
resultTest: org.apache.spark.sql.Row = [[WrappedArray(fullForm)]]
So I don't know how to deal with that, and I haven't found anything useful in the docs: https://www.scala-lang.org/api/current/scala/collection/mutable/WrappedArray.html.
If you have any ideas or questions, feel free to leave a message, thanks :).

Here specialArray is a DataFrame. To see its schema, use specialArray.printSchema, which shows the data types of its columns.
If you just want to see the data inside the DataFrame, you can use
specialArray.show(6, false); the false parameter disables truncation of long values.
Next, you can use select or withColumn to turn the WrappedArray into a comma-separated (or any other separator) String:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"..." column syntax (already in scope in spark-shell)
specialArray.select(concat_ws(",", $"_VALUE")).show(false)
specialArray.withColumn("_VALUE", concat_ws(",", $"_VALUE")).show(false)
Hope this helps!
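To make the effect of concat_ws on an array<string> cell concrete, here is a minimal plain-Python analogue (not Spark code). The function name mirrors the Spark one only for readability, and every row except the first is hypothetical sample data, since the question only shows [fullForm]. Like the Spark function, it skips null elements.

```python
from typing import List, Optional


def concat_ws(sep: str, values: List[Optional[str]]) -> str:
    """Join the non-null elements of one array cell with a separator."""
    return sep.join(v for v in values if v is not None)


# Hypothetical cell contents; only the first one appears in the question.
rows = [["fullForm"], ["a", "b"], ["x", None, "z"]]

# One flat string per row, which is what the select/withColumn calls produce.
flattened = [concat_ws(",", cell) for cell in rows]
print(flattened)
```

Each array becomes a single string, so a one-element array such as [fullForm] simply loses its brackets.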

Related

Store a list of values as a string when creating a table in snowflake

I am trying to create a table with 5 columns. Column #2 (PROGRESS) is a comma-separated list (i.e. 1,2,3,4 etc.), but when I try to create this table with that column as a string, variant or varchar, Snowflake refuses to allow it. Any advice on how I can create a comma-separated list column from a CSV? I tried to import the data as TSV, XML, and JSON files, but with no success.
create or replace TABLE AD_HOC.TEMP.NEW_DATA (
VISITOR_ID VARCHAR(16777216),
PROGRESS VARCHAR(16777216),
DATE DATETIME,
ROLE VARCHAR(16777216),
FIRST_VISIT DATETIME
)COMMENT='Interaction data'
;
Goal:
VISITOR_ID | PROGRESS | DATE | ROLE | FIRST_VISIT
111 | [1,2,3] | 1/1/2022 | OWNER | 1/1/2021
123 | [1] | 1/2/2022 | ADMIN | 2/2/2021
23321 | [1,2,3,4] | 2/22/2022 | USER | 3/12/2021

I encoded the column in Python and loaded the data into Snowflake!
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# doc_data is the DataFrame holding the CSV rows; each PROGRESS cell is expected to be an iterable of labels
mlb = MultiLabelBinarizer()
df = doc_data.join(pd.DataFrame(mlb.fit_transform(doc_data.pop('PROGRESS')),
                                columns=mlb.classes_,
                                index=doc_data.index))
df
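For readers unfamiliar with this kind of encoding, the following is a rough standard-library sketch of the same idea: one 0/1 column per distinct PROGRESS value. It is not the scikit-learn implementation, and the input dictionary is hypothetical, built from the "Goal" table in the question with PROGRESS assumed to arrive as comma-separated text.

```python
# Hypothetical rows keyed by VISITOR_ID; PROGRESS is comma-separated text.
raw_progress = {"111": "1,2,3", "123": "1", "23321": "1,2,3,4"}

# Step 1: split each cell into a list of integer labels.
parsed = {
    visitor: [int(item) for item in cell.split(",")]
    for visitor, cell in raw_progress.items()
}

# Step 2: collect the sorted set of labels (one output column per label).
classes = sorted({label for labels in parsed.values() for label in labels})

# Step 3: build one 0/1 indicator row per visitor.
encoded = {
    visitor: [1 if label in labels else 0 for label in classes]
    for visitor, labels in parsed.items()
}

print(classes)
for visitor, row in encoded.items():
    print(visitor, row)
```

The split in step 1 matters: if the cells are left as raw strings, each character (including the commas) would be treated as a separate label.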

Create a new column based on Condition in Spark Dataframe

How can I create a new column in DataFrame DF based on a given condition? I have an array of Strings and want to compare it with an existing DataFrame.
dataframe DF
+-------------------+-----------+
| DiffColumnName| Datatype|
+-------------------+-----------+
| DEST_COUNTRY_NAME| StringType|
|ORIGIN_COUNTRY_NAME| StringType|
| COUNT|IntegerType|
+-------------------+-----------+
and an Array of Strings holding column names (this is not constant and can change):
val diffcolarray = Array("ORIGIN_COUNTRY_NAME", "COUNT")
I want to create a new column in DF based on a condition: if the value in the DataFrame's DiffColumnName column is also present in diffcolarray, then yes, else no.
I have tried the options below, but I am getting errors:
val newdf = df.filter(when(col("DiffColumnName") === df.columns.filter(diffcolarray.contains(_)), "yes").otherwise("no")).as("issue")
val newdf = valdfe.filter(when(col("DiffColumnName") === df.columns.map(diffcolarray.contains(_)), "yes").otherwise("no")).as("issue")
It looks like there is a datatype mismatch in the comparison. The output should be something like this. Any suggestion would be helpful. Thank you.
+-------------------+-----------+----------+
| DiffColumnName| Datatype| Issue |
+-------------------+-----------+----------+
| DEST_COUNTRY_NAME| StringType| NO |
|ORIGIN_COUNTRY_NAME| StringType| NO |
| COUNT|IntegerType| YES |
+-------------------+-----------+----------+

This can give you the desired output.
df.withColumn("Issue",when(col("DiffColumnName").isin(diffcolarray: _*),"YES").otherwise("NO")).show(false)
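The isin check above is a per-row membership test. A small plain-Python sketch of the same logic (not Spark code), using the rows and array from the question, shows what to expect. Note that with the array as given, ORIGIN_COUNTRY_NAME is flagged YES as well as COUNT, which differs from the sample output in the question.

```python
# Rows of the DataFrame from the question: (DiffColumnName, Datatype).
rows = [
    ("DEST_COUNTRY_NAME", "StringType"),
    ("ORIGIN_COUNTRY_NAME", "StringType"),
    ("COUNT", "IntegerType"),
]
diffcolarray = ["ORIGIN_COUNTRY_NAME", "COUNT"]

# A set gives constant-time membership checks, mirroring isin(...).
wanted = set(diffcolarray)

# Append the Issue flag to every row.
flagged = [
    (name, datatype, "YES" if name in wanted else "NO")
    for name, datatype in rows
]

for row in flagged:
    print(row)
```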

How can I write a query that returns two different columns for two different conditions?

I have a Microsoft SQL Server table that has three columns in it: one for Locations, one for the associated Images, and one indicating whether a given image is meant to be used as the main Image for a given location (this is done because historically multiple images were uploaded for each location, and this column indicated which image is the one that actually gets used).
Now we want to be able to pick and assign a second image as a logo for our locations, resulting in a fourth column added to indicate which image becomes that logo.
So I have a table that looks something like this:
+----------+----------+-----------------+-----------+
| filename | location | IsMainImage | IsLogo |
+----------+----------+-----------------+-----------+
| img1 | 10 | True | Null |
| img2 | 10 | Null | True |
| img3 | 10 | Null | Null |
| img4 | 20 | True | Null |
| img5 | 20 | NULL | True |
+----------+----------+-----------------+-----------+
My goal is to write a query that returns img1 and img2 as different columns within the same row, followed by img4 and img5 in another row. From the table above, I need the output to look like this:
+-----------+-------------+
| filename1 | filename2 |
+-----------+-------------+
| img1 | img2 |
| img4 | img5 |
+-----------+-------------+
Please note that my description is an oversimplification. I am modifying an SSIS package that is consumed by another process that I cannot modify. This is the reason why I need the output in this format.
What became Filename1 and Filename2 used to be the same file (logo was a resized version of the main image) and now I need to differentiate between the two.
It is crucial that only the rows flagged as IsMainImage show under filename1 and only the rows flagged as IsLogo show under filename2.
I would appreciate any help with this. Thank you!

Using Case expression:
SELECT max(CASE WHEN [IsMainImage] = 'True' then filename end) as filename1,
max(CASE WHEN [IsLogo] = 'True' then filename end) as filename2
from Table_testcase group by location;

You can join the table with itself, something like this:
SELECT im1.filename as filename1,im2.filename as filename2
FROM images im1
JOIN images im2
ON im1.location = im2.location
AND im2.IsLogo = 1
WHERE im1.IsMainImage = 1
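Both approaches in this thread (conditional aggregation and the self-join) can be checked against the sample table. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server, which is an assumption made only so the example is self-contained; the flags are stored as 1/NULL and the table is named images, as in the join answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    "filename TEXT, location INTEGER, IsMainImage INTEGER, IsLogo INTEGER)"
)
# Sample rows from the question; None stands for NULL.
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?, ?)",
    [
        ("img1", 10, 1, None),
        ("img2", 10, None, 1),
        ("img3", 10, None, None),
        ("img4", 20, 1, None),
        ("img5", 20, None, 1),
    ],
)

# Conditional aggregation: one row per location, one CASE per output column.
case_rows = conn.execute(
    """
    SELECT MAX(CASE WHEN IsMainImage = 1 THEN filename END) AS filename1,
           MAX(CASE WHEN IsLogo = 1 THEN filename END) AS filename2
    FROM images
    GROUP BY location
    ORDER BY location
    """
).fetchall()

# Self-join: pair each main image with the logo of the same location.
join_rows = conn.execute(
    """
    SELECT im1.filename AS filename1, im2.filename AS filename2
    FROM images im1
    JOIN images im2
      ON im1.location = im2.location
     AND im2.IsLogo = 1
    WHERE im1.IsMainImage = 1
    ORDER BY im1.location
    """
).fetchall()

print(case_rows)
print(join_rows)
```

On this data the two queries agree. They differ when a location has no logo: the inner join drops that location, while the aggregation keeps it with NULL in filename2.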

importing json array into hive

I'm trying to import the following json in hive
[{"time":1521115600,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115659,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115720,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115779,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]
CREATE TABLE json_serde (
s array<struct<time: timestamp, latitude: string, longitude: string, pm1: string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.value' = 'value'
)
STORED AS TEXTFILE
location '/user/hduser';
The import works, but if I run
Select * from json_serde;
it returns, for every document under hadoop/user/hduser, only the first element per file.
Is there good documentation on working with JSON arrays?

May I suggest another approach: load the whole JSON string into a single String column of an external table. The only restriction is to define LINES TERMINATED BY properly. For example, if each JSON document is on one line, you can create the table as below:
CREATE EXTERNAL TABLE json_data_table (
json_data String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION '/path/to/json';
Use Hive's get_json_object to extract individual columns. This function supports basic JSONPath-like queries against a JSON string.
For example, if the json_data column holds the JSON string below:
{"store":
{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy@only_for_json_udf_test.net",
"owner":"amy"
}
The query below
SELECT get_json_object(json_data, '$.owner') FROM json_data_table;
returns amy.
In this way you can extract each JSON element as a column from the table.
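To illustrate what a path such as '$.owner' selects, here is a small plain-Python sketch that walks the same sample document. It is only an analogue of the lookup, not Hive's implementation; the helper get_path is a hypothetical name, and it takes the path as separate keys and indexes rather than parsing a path string.

```python
import json

json_data = (
    '{"store": {"fruit": [{"weight": 8, "type": "apple"}, '
    '{"weight": 9, "type": "pear"}], '
    '"bicycle": {"price": 19.95, "color": "red"}}, '
    '"email": "amy@only_for_json_udf_test.net", "owner": "amy"}'
)


def get_path(doc, *keys):
    """Walk nested dicts/lists by key or index; return None if a step is missing."""
    current = doc
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return None
    return current


doc = json.loads(json_data)
owner = get_path(doc, "owner")                            # like '$.owner'
first_fruit = get_path(doc, "store", "fruit", 0, "type")  # like '$.store.fruit[0].type'
missing = get_path(doc, "store", "car")                   # no such key

print(owner, first_fruit, missing)
```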

You have an array of structs. What you pasted is only one line.
If you want to see all the elements, you need to use inline:
SELECT inline(s) FROM json_serde;
Alternatively, you need to rewrite your files so that each object within that array is a single JSON object on its own line of the file.
Also, I don't see a value field in your data, so I'm not sure what you're mapping in the serde properties.
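Rewriting a file so that each object sits on its own line can be done with a few lines of Python. This is a minimal sketch using only the standard library; the input string is a shortened, hypothetical version of the array in the question, and reading from and writing to real files is left out.

```python
import json

# Shortened stand-in for the question's file: a single top-level JSON array.
array_text = (
    '[{"time":1521115600,"latitude":44.3959,"pm1":21.70905},'
    '{"time":1521115659,"latitude":44.3959,"pm1":24.34045}]'
)

# Parse the array, then emit one compact JSON object per line.
records = json.loads(array_text)
ndjson = "\n".join(
    json.dumps(record, separators=(",", ":")) for record in records
)

print(ndjson)
```

Each output line is then a complete JSON object, which is the layout a line-oriented table definition expects.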

The JSON that you provided does not match what the table expects. A top-level array is valid JSON in general, but the table declares a named column, so each record needs to be a JSON object, starting with "{" and ending with "}", whose key matches that column.
So, the first thing to look at here is the shape of your JSON.
Your JSON should have been like the one below:
{"key":[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"}]}
And the second thing is that you have declared the data type of the "time" field as timestamp, but the data (1521115600) is a Unix epoch number. The timestamp data type needs data in the format YYYY-MM-DD HH:MM:SS[.fffffffff]. (Read as seconds, 1521115600 is 2018-03-15 12:06:40 UTC; the sample below uses 1970-01-18 20:01:55, which is what you get by reading it as milliseconds in a UTC+05:30 time zone.)
So, your data should ideally be in the below format:
{"myjson":[{"time":"1970-01-18 20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18 20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18 20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18 20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]}
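Converting the epoch numbers into the YYYY-MM-DD HH:MM:SS form can be scripted before loading. The sketch below is one way to do it with Python's standard library; the function name and the unit parameter are hypothetical, and the output is in UTC, so a local time zone would shift the result.

```python
from datetime import datetime, timezone


def epoch_to_hive_timestamp(value: float, unit: str = "s") -> str:
    """Format an epoch value as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    seconds = value / 1000 if unit == "ms" else value
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# The first "time" value from the question, read both ways.
as_seconds = epoch_to_hive_timestamp(1521115600)
as_millis = epoch_to_hive_timestamp(1521115600, unit="ms")

print(as_seconds)
print(as_millis)
```

Which unit is correct depends on how the sensor data was produced; reading the value as seconds gives a date in March 2018, while reading it as milliseconds gives a date in January 1970.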
Now you can query the table to select the records (here the array column is named myjson, matching the key in the data).
hive> select * from json_serde;
OK
[{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"21.70905"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"24.34045"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"23.6826"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"25.65615"}]
Time taken: 0.069 seconds, Fetched: 1 row(s)
hive>
If you want each value separately displayed in tabular format, you can use the below query.
select b.* from json_serde a lateral view outer inline (a.myjson) b;
The result of the above query would be like this:
+------------------------+-------------+--------------+-----------+--+
| b.time | b.latitude | b.longitude | b.pm1 |
+------------------------+-------------+--------------+-----------+--+
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 21.70905 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 24.34045 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 23.6826 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 25.65615 |
+------------------------+-------------+--------------+-----------+--+
Beautiful, isn't it?
Happy Learning.

If you cannot change your input file format, you can read the file directly in Spark and use it there; once the data is finalized, write it back to a Hive table.
scala> val myjs = spark.read.format("json").option("path","file:///root/tmp/test5").load()
myjs: org.apache.spark.sql.DataFrame = [altitude: bigint, gas1: bigint ... 13 more fields]
scala> myjs.show()
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
|altitude|gas1|gas2|gas3|gas4|humidity|latitude|longitude|noise| pm1| pm10|pm25|pressure|temperature| time|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
| 53| 0|0.12| 0| 0| 0| 44.3959| 26.1025| 0|21.70905|14.60085|16.5| 0| null|1521115600|
| 53| 0|0.08| 0| 0| 0| 44.3959| 26.1025| 0|24.34045|16.37065|18.5| 0| null|1521115659|
| 53| 0| 0.0| 0| 0| 0| 44.3959| 26.1025| 0| 23.6826| 15.9282|18.0| 0| null|1521115720|
| 53| 0|0.04| 0| 0| 0| 44.3959| 26.1025| 0|25.65615|17.25555|19.5| 0| null|1521115779|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
scala> myjs.write.json("file:///root/tmp/test_output")
Alternatively, you can write it directly to a Hive table:
scala> myjs.createOrReplaceTempView("myjs")
scala> spark.sql("select * from myjs").show()
scala> spark.sql("create table tax.myjs_hive as select * from myjs")

Looping on checkbox array values inside Laravel Controller is not working

I have multiple checkboxes on a form, generated from a model and rendered in the view this way:
{{Form::open(array('action'=>'LaboratoryController@store'))}}
@foreach (Accounts::where('accountclass',$i)->get() as $accounttypes)
{{ Form::checkbox('accounttype[]', $accounttypes->id)}}
@endforeach
{{Form::submit('Save')}}
{{Form::close()}}
When I return the Input::all() from my controller store method, it outputs like this:
{"client":"1","accounttype":["2","3","5","12","13","14","16","31","32","33"]}
Now I want to store the accounttype array values in the accounts table by looping through the array, storing each value in its own row with the same client id.
The same accounttype will be inserted into the second table but with different data.
So, my accounts table:
+-------------+---------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------------+------+-----+---------+----------------+
| accountno | int(11) unsigned zerofill | NO | PRI | NULL | auto_increment |
| accounttype | int(11) | NO | | NULL | |
| client | int(11) | NO | | NULL | |
| created_at | datetime | NO | | NULL | |
+-------------+---------------------------+------+-----+---------+----------------+
My controller store method:
public function store()
{
$accounttypes = Input::get('accounttype');
if(is_array($accounttypes))
{
for($i=0;$i < count($accounttypes);$i++)
{
// insert data on first table (accounts table)
$accountno = DB::table('accounts')->insertGetId(array('client'=>Input::get('client'),'accounttype',$accounttypes[$i]));
// insert data on the second table (account summary table) using the account no above
// DB::table('accountsummary')...blah blah
}
}
return Redirect::to('some/path');
}
The function seems to work, but only for the first array value, which is "2". I don't know what's wrong with the code, but it seems that the loop doesn't go through the rest of the values. I tested other loop constructs like while and foreach, but the looping variable ($i) still returns zero.
I was wondering whether a Laravel controller doesn't allow loops on POST methods.
Your input is greatly appreciated. Thanks.

foreach and DB::insert() work for me.
foreach ($accounttypes as $accounttype) {
    DB::insert('INSERT INTO tb_accounts (accounttype,client) VALUES (?,?)', array($accounttype, Input::get('client')));
}
I just need to create a separate query to get the last insert id, because DB::insertGetId doesn't work the way I want it to. But that's another issue. Anyway, thanks.
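The pattern of inserting one row per checkbox value and capturing each new id can be sketched independently of Laravel. The example below uses Python's built-in sqlite3 module purely as an illustration; it is not Laravel code, and the simplified table (created_at is omitted) and the payload variable are assumptions based on the question. One thing worth checking in the question's own code is the array passed to insertGetId, where 'accounttype' and its value are separated by a comma rather than =>.

```python
import sqlite3

# Payload shaped like the Input::all() output shown in the question.
payload = {
    "client": "1",
    "accounttype": ["2", "3", "5", "12", "13", "14", "16", "31", "32", "33"],
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "accountno INTEGER PRIMARY KEY AUTOINCREMENT, "
    "accounttype INTEGER NOT NULL, "
    "client INTEGER NOT NULL)"
)

account_numbers = []
for accounttype in payload["accounttype"]:
    # One insert per checkbox value, always with the same client id.
    cursor = conn.execute(
        "INSERT INTO accounts (accounttype, client) VALUES (?, ?)",
        (int(accounttype), int(payload["client"])),
    )
    # lastrowid is the generated key, usable for the second table's insert.
    account_numbers.append(cursor.lastrowid)
conn.commit()

print(account_numbers)
```

Reading the generated key from the same statement that performed the insert avoids a separate query for the last insert id.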
