Spark dataset from List - arrays

I need to create a Spark Dataset for ML. I have an array of 100 Double values and I want to add them to a dataset of 100 columns (each column for one value).
How can I do it?
Thanks
EDIT: CODE
import org.apache.spark.sql.Row
import org.apache.spark.sql.RowFactory
import sess.implicits._
val values = new ListBuffer[Double]()
// Values population process ....
val ds = values.toDS()
ds.show()
And the output shows as:
+--------+
| value|
+--------+
| 27242.0|
| 33883.0|
| 69727.0|
| 20851.0|
| 27740.0|
| 18747.0|

There are several ways to meet your requirement. One of them is to build a schema with one DoubleType column per value, wrap the 100 doubles in a single Row, parallelize it into an RDD[Row], and finally use the createDataFrame API to form a dataframe.
// necessary imports
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}
import org.apache.spark.sql.SQLContext
// forming array of 100 doubles
var values = new ListBuffer[Double]()
for (x <- 1 to 100) {
  values = values :+ x.toDouble
}
//creating schema for the 100 doubles
val schema = StructType(values.map(value => StructField(("col"+value).replace(".", "_"), DoubleType, true)))
// finally creating the dataframe of 100 doubles, one value per column
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row.fromSeq((values.toSeq)))), schema)
df.show(false)
which should give you
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
|col1_0|col2_0|col3_0|col4_0|col5_0|col6_0|col7_0|col8_0|col9_0|col10_0|col11_0|col12_0|col13_0|col14_0|col15_0|col16_0|col17_0|col18_0|col19_0|col20_0|col21_0|col22_0|col23_0|col24_0|col25_0|col26_0|col27_0|col28_0|col29_0|col30_0|col31_0|col32_0|col33_0|col34_0|col35_0|col36_0|col37_0|col38_0|col39_0|col40_0|col41_0|col42_0|col43_0|col44_0|col45_0|col46_0|col47_0|col48_0|col49_0|col50_0|col51_0|col52_0|col53_0|col54_0|col55_0|col56_0|col57_0|col58_0|col59_0|col60_0|col61_0|col62_0|col63_0|col64_0|col65_0|col66_0|col67_0|col68_0|col69_0|col70_0|col71_0|col72_0|col73_0|col74_0|col75_0|col76_0|col77_0|col78_0|col79_0|col80_0|col81_0|col82_0|col83_0|col84_0|col85_0|col86_0|col87_0|col88_0|col89_0|col90_0|col91_0|col92_0|col93_0|col94_0|col95_0|col96_0|col97_0|col98_0|col99_0|col100_0|
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+
|1.0 |2.0 |3.0 |4.0 |5.0 |6.0 |7.0 |8.0 |9.0 |10.0 |11.0 |12.0 |13.0 |14.0 |15.0 |16.0 |17.0 |18.0 |19.0 |20.0 |21.0 |22.0 |23.0 |24.0 |25.0 |26.0 |27.0 |28.0 |29.0 |30.0 |31.0 |32.0 |33.0 |34.0 |35.0 |36.0 |37.0 |38.0 |39.0 |40.0 |41.0 |42.0 |43.0 |44.0 |45.0 |46.0 |47.0 |48.0 |49.0 |50.0 |51.0 |52.0 |53.0 |54.0 |55.0 |56.0 |57.0 |58.0 |59.0 |60.0 |61.0 |62.0 |63.0 |64.0 |65.0 |66.0 |67.0 |68.0 |69.0 |70.0 |71.0 |72.0 |73.0 |74.0 |75.0 |76.0 |77.0 |78.0 |79.0 |80.0 |81.0 |82.0 |83.0 |84.0 |85.0 |86.0 |87.0 |88.0 |89.0 |90.0 |91.0 |92.0 |93.0 |94.0 |95.0 |96.0 |97.0 |98.0 |99.0 |100.0 |
+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+

Related

How to create a new array of substrings from string array column in a spark dataframe

I have a spark dataframe. One of the columns is an array type consisting of an array of text strings of varying lengths. I am looking for a way to add a new column that is an array of the unique left 8 characters of those strings.
df.printSchema()
root
(...)
|-- arr_agent: array (nullable = true)
| |-- element: string (containsNull = true)
example data from column "arr_agent":
["NRCANL2AXXX", "NRCANL2A"]
["UTRONL2U", "BKRBNL2AXXX", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "UTRONL2UXXX", "BKRBNL2A"]
["MQBFDEFFYYY", "MQBFDEFFZZZ", "MQBFDEFF" ]
What I need to have in the new column:
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "BKRBNL2A"]
["MQBFDEFF" ]
I already tried to define a udf that does it for me.
from pyspark.sql import functions as F
from pyspark.sql import types as T
def make_list_of_unique_prefixes(text_array, prefix_length=8):
    out_arr = set(t[0:prefix_length] for t in text_array)
    return(out_arr)
make_list_of_unique_prefixes_udf = F.udf(lambda x,y=8: make_list_of_unique_prefixes(x,y), T.ArrayType(T.StringType()))
But then calling:
df.withColumn("arr_prefix8s", F.collect_set( make_list_of_unique_prefixes_udf(F.col("arr_agent") )))
Throws an error
AnalysisException: grouping expressions sequence is empty,
Any tips would be appreciated.
thanks
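You can solve this with the higher-order functions available from Spark 2.4 onwards: use transform with substring to take the first n characters of each element, then array_distinct to drop the duplicates (Spark SQL's substring is 1-based, and a start position of 0 is treated like 1):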
from pyspark.sql import functions as F
n = 8
out = df.withColumn("New",F.expr(f"array_distinct(transform(arr_agent,x->substring(x,0,{n})))"))
out.show(truncate=False)
+-----------------------------------------------------+----------------------------------------+
|arr_agent |New |
+-----------------------------------------------------+----------------------------------------+
|[NRCANL2AXXX, NRCANL2A] |[NRCANL2A] |
|[UTRONL2U, BKRBNL2AXXX, BKRBNL2A] |[UTRONL2U, BKRBNL2A] |
|[NRCANL2A] |[NRCANL2A] |
|[UTRONL2U, REUWNL2A002, BKRBNL2A, REUWNL2A, REUWNL2N]|[UTRONL2U, REUWNL2A, BKRBNL2A, REUWNL2N]|
|[UTRONL2U, UTRONL2UXXX, BKRBNL2A] |[UTRONL2U, BKRBNL2A] |
|[MQBFDEFFYYY, MQBFDEFFZZZ, MQBFDEFF] |[MQBFDEFF] |
+-----------------------------------------------------+----------------------------------------+
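To check what that expression computes per row without a Spark session, here is a minimal plain-Python sketch of the same logic. The helper name unique_prefixes is hypothetical, and the sketch only mirrors the semantics (first n characters, duplicates dropped, first occurrence kept); it is not how Spark evaluates the expression.

```python
def unique_prefixes(strings, n=8):
    """Plain-Python analogue of array_distinct(transform(arr, x -> substring(x, 1, n)))."""
    if strings is None:
        # Assumption: a missing array is passed through unchanged
        return None
    # dict.fromkeys keeps the first occurrence of each prefix, preserving order
    return list(dict.fromkeys(s[:n] for s in strings))


rows = [
    ["NRCANL2AXXX", "NRCANL2A"],
    ["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"],
    ["MQBFDEFFYYY", "MQBFDEFFZZZ", "MQBFDEFF"],
]
for row in rows:
    print(unique_prefixes(row))
```

For the second row this yields UTRONL2U, REUWNL2A, BKRBNL2A, REUWNL2N, the same order as in the Spark output above.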

Pyspark Array Column - Replace Empty Elements with Default Value

I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
-------------|-------------------------------
ID |array_list
---------------------------------------------
38292786 |[AAA,, JLT] |
38292787 |[DFG] |
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |
38292790 |[] |
38292791 |[] |
38292792 |[,,, HKJ] |
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which is using a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
def transform(self, f):
    return f(self)
DataFrame.transform = transform
df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error but it is producing nonsense. I'm thinking I just don't know the correct way to identify the missing elements of the array - can anyone help me out?
In production our data has around 10 million rows and I am trying to avoid using explode or a UDF (not sure if it's possible to avoid using both though, just need the code run as efficiently as possible). I'm using Spark 2.4.4
This is what I would like my output to look like:
-------------|-------------------------------|-------------------------------
ID |array_list | array_list2
---------------------------------------------|-------------------------------
38292786 |[AAA,, JLT] |[AAA, ZZZ, JLT]
38292787 |[DFG] |[DFG]
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |[SHJ, QKJ, AAA, YTR, CBM]
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |[DUY, ANK, QJK, POI, CNM, ADD]
38292790 |[] |[ZZZ]
38292791 |[] |[ZZZ]
38292792 |[,,, HKJ] |[ZZZ, ZZZ, ZZZ, HKJ]
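regexp_replace works at the character level within each string, so an empty pattern matches between every character rather than matching an empty element.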
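The effect is easy to reproduce outside Spark. The sketch below uses Python's re module as a stand-in for Spark's Java regular expressions (an assumption; the two engines agree on this simple case as far as I know) to show why the empty pattern produces "nonsense", and how an anchored pattern limits the replacement to empty strings. Null elements are a separate case that a regex cannot handle.

```python
import re

elements = ["AAA", "", "JLT"]

# An empty pattern matches at every position, so the replacement is
# inserted between all characters instead of only filling empty elements.
naive = [re.sub("", "ZZZ", x) for x in elements]
print(naive)  # ['ZZZAZZZAZZZAZZZ', 'ZZZ', 'ZZZJZZZLZZZTZZZ']

# Anchoring the pattern to "the whole string is empty" touches only empty strings.
anchored = [re.sub(r"^$", "ZZZ", x) for x in elements]
print(anchored)  # ['AAA', 'ZZZ', 'JLT']
```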
I could not get it to work with transform either, but with help from the first answerer I used a UDF, which was not that easy.
Here is my example with my own data, which you can tailor to yours.
%python
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col, lit
# Replace None elements with con_str; a None or empty array becomes [con_str]
concat_udf = udf(
    lambda con_str, arr: [
        x if x is not None else con_str for x in arr or [None]
    ],
    ArrayType(StringType()),
)
arrayData = [
('James',['Java','Scala']),
('Michael',['Spark','Java',None]),
('Robert',['CSharp','']),
('Washington',None),
('Jefferson',['1','2'])]
df = spark.createDataFrame(data=arrayData, schema = ['name','knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
df.show()
returns:
+----------+------------------+
| name| knownLanguages|
+----------+------------------+
| James| [Java, Scala]|
| Michael|[Spark, Java, ZZZ]|
| Robert| [CSharp, ]|
|Washington| [ZZZ]|
| Jefferson| [1, 2]|
+----------+------------------+
This was quite tough; I had some help from the first answerer.
I have an idea, but I'm not sure how efficient it is.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values into a string with the delimiter , (hopefully it does not occur in your strings; otherwise use another delimiter). The null_replacement option of array_join fills in the null values. Then I split on the same delimiter.
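The caveats of this join-then-split approach can be seen in a plain-Python sketch of the same idea. The helper name join_then_split is hypothetical, and Python's str.join and str.split only approximate array_join and split, so the edge cases should be confirmed on real data in Spark.

```python
def join_then_split(arr, default="ZZZ", sep=","):
    """Plain-Python analogue of split(array_join(arr, sep, default), sep)."""
    # Only None is replaced; empty strings pass through unchanged
    joined = sep.join(default if x is None else x for x in arr)
    return joined.split(sep)


print(join_then_split(["AAA", None, "JLT"]))  # ['AAA', 'ZZZ', 'JLT']
# An element containing the delimiter is torn apart by the split
print(join_then_split(["a,b", None]))         # ['a', 'b', 'ZZZ']
# An empty array comes back as one empty string, not as the default value
print(join_then_split([]))                    # ['']
```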
EDIT: Based on @thebluephantom's comment, you can try this solution:
df.withColumn(
    "array_list2", F.expr("transform(array_list, x -> coalesce(x, 'ZZZ'))")
).show()
The SQL built-in transform is not working for me, so I couldn't test it, but hopefully it gives you the result you wanted.

convert JSON text string to Pandas, but each row cell ends up as an array of values inside

I managed to extract a time series of prices from a web portal. The data arrives in JSON format, and I convert it into a pandas DataFrame.
Unfortunately, the data for the different bands come in a text string, and I can't seem to extract them out properly.
The below is the json data I extract
I convert them into a pandas dataframe using this code
data = pd.DataFrame(r.json()['prices'])
and get them like this
I need to extract (for example) the data in the column closePrice, so that I can do data analysis and cleansing on it.
I tried using
data['closePrice'].str.split(',', expand=True).rename(columns = lambda x: "string"+str(x+1))
but it doesn't really work.
Is there any way to either
a) when I convert the json to dataFrame, such that the prices within the closePrice, bidPrice etc are extracted in individual columns OR
b) if they were saved in the dataFrame, extract the text strings within them, such that I can extract the prices (e.g. the bid, ask and lastTraded) within the text string?
A relatively brute-force way, based on other Stack Overflow answers.
import requests
import pandas as pd
# load and extract the json data (url and data are not shown here)
s = requests.Session()
r = s.post(url + '/session', json=data)
loc = <url>
dat1 = s.get(loc)
dat1 = pd.DataFrame(dat1.json()['prices'])
# convert the object list into individual columns
dat2 = pd.DataFrame()
dat2[['bidC','askC', 'lastP']] = pd.DataFrame(dat1.closePrice.values.tolist(), index= dat1.index)
dat2[['bidH','askH', 'lastH']] = pd.DataFrame(dat1.highPrice.values.tolist(), index= dat1.index)
dat2[['bidL','askL', 'lastL']] = pd.DataFrame(dat1.lowPrice.values.tolist(), index= dat1.index)
dat2[['bidO','askO', 'lastO']] = pd.DataFrame(dat1.openPrice.values.tolist(), index= dat1.index)
dat2['tStamp'] = pd.to_datetime(dat1.snapshotTime)
dat2['volume'] = dat1.lastTradedVolume
get the equivalent below
Use pandas.json_normalize to extract the data from the dict
import pandas as pd
data = r.json()
# print(data)
{'prices': [{'closePrice': {'ask': 1.16042, 'bid': 1.16027, 'lastTraded': None},
'highPrice': {'ask': 1.16052, 'bid': 1.16041, 'lastTraded': None},
'lastTradedVolume': 74,
'lowPrice': {'ask': 1.16038, 'bid': 1.16026, 'lastTraded': None},
'openPrice': {'ask': 1.16044, 'bid': 1.16038, 'lastTraded': None},
'snapshotTime': '2018/09/28 21:49:00',
'snapshotTimeUTC': '2018-09-28T20:49:00'}]}
df = pd.json_normalize(data['prices'])
Output:
| | lastTradedVolume | snapshotTime | snapshotTimeUTC | closePrice.ask | closePrice.bid | closePrice.lastTraded | highPrice.ask | highPrice.bid | highPrice.lastTraded | lowPrice.ask | lowPrice.bid | lowPrice.lastTraded | openPrice.ask | openPrice.bid | openPrice.lastTraded |
|---:|-------------------:|:--------------------|:--------------------|-----------------:|-----------------:|:------------------------|----------------:|----------------:|:-----------------------|---------------:|---------------:|:----------------------|----------------:|----------------:|:-----------------------|
| 0 | 74 | 2018/09/28 21:49:00 | 2018-09-28T20:49:00 | 1.16042 | 1.16027 | | 1.16052 | 1.16041 | | 1.16038 | 1.16026 | | 1.16044 | 1.16038 | |

Remove garbage (#,$) values from any string and drop records that contain only garbage (#,$) values, with multiple occurrences in multiple columns

I tried the code below to drop records that contain garbage values with multiple occurrences in multiple columns, but I also want to remove the garbage values from strings with multiple occurrences in multiple columns.
Sample Code :-
filter_list = ['$','#','%','#','!','^','&','*','null']
def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in
                                   filter_list] for elt in x]))
    return(reduce(lambda x,y: x and y, remove_garbage, True))
filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
In this example, "original" is my original dataframe and "compulsory_fields" is my array (it holds multiple columns).
Sample Input :-
id name salary
# Yogita 1000
2 Neha ##
3 #Jay$deep## 8000
4 Priya 40$00&
5 Bhavana $$%&^
6 $% $$&&
Sample Output :-
id name salary
3 Jaydeep 8000
4 priya 4000
Your requirements are not completely clear to me, but it seems you want to output records that are valid after removing the "garbage" characters. You can achieve this by adding a clean_special_characters udf that removes the special characters before running your filter_udf:
import pyspark.sql.functions as f
from functools import reduce
from itertools import chain
from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import BooleanType,StringType
rdd = sc.parallelize((
('#','Yogita','1000'),
('2', 'Neha', '##'),
('3', '#Jay$deep##','8000'),
('4', 'Priya', '40$00&'),
('5', 'Bhavana', '$$%&^'),
('6', '$%','$$&&'))
)
original = rdd.toDF(['id','name','salary'])
filter_list = ['$','#','%','#','!','^','&','*','null']
compulsory_fields = ['id','name','salary']
def clean_special_characters(input_string):
    cleaned_input = input_string.translate({ord(c): None for c in filter_list if len(c)==1})
    if cleaned_input == '':
        return 'null'
    return cleaned_input
clean_special_characters_udf = f.udf(clean_special_characters, StringType())
original = original.withColumn('name', clean_special_characters_udf(original.name))
original = original.withColumn('salary', clean_special_characters_udf(original.salary))
def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in
                                   filter_list] for elt in x]))
    return(reduce(lambda x,y: x and y, remove_garbage, True))
filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
This outputs:
+---+-------+------+
| id| name|salary|
+---+-------+------+
| 3|Jaydeep| 8000|
| 4| Priya| 4000|
+---+-------+------+
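The clean-then-filter logic can also be checked without a Spark session. The following is a minimal plain-Python sketch on the sample rows; the names FILTER_LIST, DELETE_TABLE, clean and is_valid are hypothetical, and, as in the answer, only name and salary are cleaned while id is checked as-is.

```python
FILTER_LIST = ['$', '#', '%', '!', '^', '&', '*', 'null']
# Translation table that deletes every single-character garbage token
DELETE_TABLE = {ord(c): None for c in FILTER_LIST if len(c) == 1}


def clean(value):
    """Strip garbage characters; a value that becomes empty is marked as 'null'."""
    cleaned = value.translate(DELETE_TABLE)
    return cleaned if cleaned else 'null'


def is_valid(row):
    """A row is kept only if no field contains any token from FILTER_LIST."""
    return all(token not in field for field in row for token in FILTER_LIST)


rows = [
    ('#', 'Yogita', '1000'),
    ('2', 'Neha', '##'),
    ('3', '#Jay$deep##', '8000'),
    ('4', 'Priya', '40$00&'),
    ('5', 'Bhavana', '$$%&^'),
    ('6', '$%', '$$&&'),
]
cleaned_rows = [(i, clean(n), clean(s)) for i, n, s in rows]
result = [r for r in cleaned_rows if is_valid(r)]
print(result)  # [('3', 'Jaydeep', '8000'), ('4', 'Priya', '4000')]
```

Note that the 'null' token is matched as a substring, so a legitimate value that happens to contain "null" would also be dropped.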

Spark 2.0.x dump a csv file from a dataframe containing one array of type string

I have a dataframe df that contains one column of type array
df.show() looks like
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 | [A,B,D] |22 | F |
|2 | [A,Y] |42 | M |
|3 | [X] |60 | F |
+--+-------------+---+------+
I try to dump that df to a csv file as follows:
val dumpCSV = df.write.csv(path="/home/me/saveDF")
It is not working because of the column ArrayOfString. I get the error:
CSV data source does not support array string data type
The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!
What would be the best way to dump the dataframe to csv including the column ArrayOfString (ArrayOfString should be dumped as one column in the CSV file)?
You get this error because the CSV file format doesn't support array types; you need to express the array as a string to be able to save it.
Try the following:
import org.apache.spark.sql.functions._
val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _ => s"""[${vs.mkString(",")}]"""
})
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
or
import org.apache.spark.sql.Column
def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
PySpark implementation.
In this example, the array column column_as_array is converted into the string column column_as_str before saving.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def array_to_string(my_list):
    # Keep nulls as nulls instead of failing on a None array
    if my_list is None:
        return None
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))
Then you can drop the old column (array type) before saving.
df.drop("column_as_array").write.csv(...)
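To illustrate why stringifying helps, here is a small sketch using Python's standard csv module rather than Spark's CSV writer (an assumption made so the example runs without a cluster; Spark's quoting options may differ). Once the array is a single string, it occupies one CSV field, and the field is quoted because it contains commas. The sample rows are taken from the question.

```python
import csv
import io


def array_to_string(items):
    """Render a list as '[a,b,c]'; None stays None."""
    return None if items is None else "[" + ",".join(str(x) for x in items) + "]"


rows = [(1, ["A", "B", "D"], 22, "F"), (2, ["A", "Y"], 42, "M"), (3, ["X"], 60, "F")]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row_id, arr, age, gender in rows:
    writer.writerow([row_id, array_to_string(arr), age, gender])
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: the array is still one field, recoverable by stripping the brackets
parsed = [r[1].strip("[]").split(",") for r in csv.reader(io.StringIO(csv_text))]
print(parsed)  # [['A', 'B', 'D'], ['A', 'Y'], ['X']]
```

This round trip only works when the elements themselves contain no commas or brackets.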
No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:
import org.apache.spark.sql.functions._
val dumpCSV = df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
  .write
  .csv(path="/home/me/saveDF")
Hope that helps.
Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}
def stringifyArrays(dataFrame: DataFrame): DataFrame = {
  val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)
  colsToStringify.foldLeft(dataFrame)((df, c) => {
    df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
  })
}
Also, it doesn't use a UDF.
CSV is not the ideal export format, but if you just want to visually inspect your data, this will work [Scala]. Quick and dirty solution.
case class example ( id: String, ArrayOfString: String, Age: String, Gender: String)
df.rdd.map{line => example(line(0).toString, line(1).toString, line(2).toString , line(3).toString) }.toDF.write.csv("/tmp/example.csv")
To answer DreamerP's question (from one of the comments) :
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def array_to_string(my_list):
    # Keep nulls as nulls instead of failing on a None array
    if my_list is None:
        return None
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
df = df.drop("Consequent")
df = df.drop("Antecedent")
df.write.csv("foldername")
