Trying to convert string values to array in pandas dataframe - arrays

I have a dataframe that looks like this
Company CompanyDetails
A [{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]
B [{"companyId": 1437, "companyAddress": 'sampleaddress2', "numOfEmployees": 50}]
C [{"companyId": 1452, "companyAddress": 'sampleaddress3', "numOfEmployees": 10000}]
When I execute df.dtypes I find that both the Company and CompanyDetails columns are objects.
df[['CompanyDetails']].iloc[0, :] would return '[{"companyId": 1482, "companyAddress": 'sampleaddress1', numOfEmployees: 500}]' (note the quotes ' ' around my array, i.e. it is stored as a string).
I am trying to extract the details within the dictionary in the CompanyDetails column so that I can add new columns to my dataframe to look like this:
Company CompanyId CompanyAddress numOfEmployees
A 1482 'sampleaddress1' 500
B 1437 'sampleaddress2' 50
C 1452 'sampleaddress3' 10000
I tried something like this, as I wanted the CompanyDetails column to contain actual arrays for all my values so that I can easily extract each property in the object:
import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
However, the above code caused this error
ValueError: malformed node or string: <ast.Name object at 0x000002D73D0C13A0>
Would appreciate any help on this, thanks!

You're getting the error because it's actually not valid JSON: numOfEmployees is not quoted, and JSON requires ALL keys to be double-quoted (for the same reason, ast.literal_eval chokes on the bare name).
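For illustration, a minimal reproduction of the difference (the literal strings below are made up to mirror the question):
import ast
try:
    ast.literal_eval("[{'companyId': 1482, numOfEmployees: 500}]")  # bare, unquoted key
except ValueError as e:
    print(e)  # malformed node or string: <ast.Name object at 0x...>
# With every key quoted, the same kind of string parses into a list of dicts
print(ast.literal_eval("[{'companyId': 1482, 'numOfEmployees': 500}]"))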
The easiest, safest (in terms of likelihood of breaking) way I can think of to fix this is to repair the JSON with a regular-expression replace:
df['CompanyDetails'] = df['CompanyDetails'].str.replace(r',\s*(\w+)\s*:', r', "\1":', regex=True)
Then do your other stuff:
import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
df = pd.concat([df.drop('CompanyDetails', axis=1), pd.json_normalize(df['CompanyDetails'].explode())], axis=1)
...or whatever you have in mind.
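For reference, here is a self-contained sketch of the whole pipeline on made-up data mirroring the question (the raw strings are my assumption of what the CompanyDetails column actually contains):
import ast
import pandas as pd
df = pd.DataFrame({
    'Company': ['A', 'B', 'C'],
    'CompanyDetails': [
        '[{"companyId": 1482, "companyAddress": "sampleaddress1", numOfEmployees: 500}]',
        '[{"companyId": 1437, "companyAddress": "sampleaddress2", numOfEmployees: 50}]',
        '[{"companyId": 1452, "companyAddress": "sampleaddress3", numOfEmployees: 10000}]',
    ],
})
# Quote the bare key, parse the string, then expand the single-element lists of dicts
df['CompanyDetails'] = df['CompanyDetails'].str.replace(r',\s*(\w+)\s*:', r', "\1":', regex=True)
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
df = pd.concat([df.drop('CompanyDetails', axis=1),
                pd.json_normalize(df['CompanyDetails'].explode())], axis=1)
print(df)
#   Company  companyId  companyAddress  numOfEmployees
# 0       A       1482  sampleaddress1             500
# 1       B       1437  sampleaddress2              50
# 2       C       1452  sampleaddress3           10000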

You can use
import pandas as pd
# Test dataframe
df = pd.DataFrame({'Company':['A'], 'CompanyDetails':[[{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]]})
df['CompanyDetails'] = df['CompanyDetails'].str[0]
df = pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1)
# >>> df
#   Company  companyId  companyAddress  numOfEmployees
# 0       A       1482  sampleaddress1             500
Note:
df['CompanyDetails'] = df['CompanyDetails'].str[0] gets the first item from each list since each of them only contains one item
pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1) does the actual expansion and merging with the current dataframe.
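As a side note, pd.json_normalize is often faster than .apply(pd.Series) for this kind of expansion; a rough equivalent on the same test dataframe (a sketch, not a drop-in replacement for the code above):
import pandas as pd
df = pd.DataFrame({'Company':['A'], 'CompanyDetails':[[{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]]})
# Take the single dict out of each list and normalize it into columns
expanded = pd.json_normalize(df['CompanyDetails'].str[0].tolist())
df = pd.concat([df.drop(['CompanyDetails'], axis=1), expanded], axis=1)
#   Company  companyId  companyAddress  numOfEmployees
# 0       A       1482  sampleaddress1             500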

Related

Pyspark Array Column - Replace Empty Elements with Default Value

I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
---------|-------------------------------
ID       | array_list
---------|-------------------------------
38292786 | [AAA,, JLT]
38292787 | [DFG]
38292788 | [SHJ, QKJ, AAA, YTR, CBM]
38292789 | [DUY, ANK, QJK, POI, CNM, ADD]
38292790 | []
38292791 | []
38292792 | [,,, HKJ]
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which uses a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
def transform(self, f):
    return f(self)
DataFrame.transform = transform
df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error but it is producing nonsense. I'm thinking I just don't know the correct way to identify the missing elements of the array - can anyone help me out?
In production our data has around 10 million rows and I am trying to avoid using explode or a UDF (not sure if it's possible to avoid both, but I need the code to run as efficiently as possible). I'm using Spark 2.4.4.
This is what I would like my output to look like:
---------|--------------------------------|-------------------------------
ID       | array_list                     | array_list2
---------|--------------------------------|-------------------------------
38292786 | [AAA,, JLT]                    | [AAA, ZZZ, JLT]
38292787 | [DFG]                          | [DFG]
38292788 | [SHJ, QKJ, AAA, YTR, CBM]      | [SHJ, QKJ, AAA, YTR, CBM]
38292789 | [DUY, ANK, QJK, POI, CNM, ADD] | [DUY, ANK, QJK, POI, CNM, ADD]
38292790 | []                             | [ZZZ]
38292791 | []                             | [ZZZ]
38292792 | [,,, HKJ]                      | [ZZZ, ZZZ, ZZZ, HKJ]
The regexp_replace works at the character level.
I could not get it to work with transform either, but with help from the first answerer I used a UDF - not that easy.
Here is my example with my data, you can tailor.
%python
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col, lit
concat_udf = udf(
    lambda con_str, arr: [x if x is not None else con_str for x in arr or [None]],
    ArrayType(StringType()),
)
arrayData = [
    ('James', ['Java', 'Scala']),
    ('Michael', ['Spark', 'Java', None]),
    ('Robert', ['CSharp', '']),
    ('Washington', None),
    ('Jefferson', ['1', '2'])]
df = spark.createDataFrame(data=arrayData, schema=['name', 'knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
df.show()
returns:
+----------+------------------+
| name| knownLanguages|
+----------+------------------+
| James| [Java, Scala]|
| Michael|[Spark, Java, ZZZ]|
| Robert| [CSharp, ]|
|Washington| [ZZZ]|
| Jefferson| [1, 2]|
+----------+------------------+
Quite tough this, had some help from the first answerer.
I'm thinking of something, but I'm not sure if it is efficient.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values into a single string with the delimiter , (hoping you don't have that character in your strings, but you can use something else), using the null_replacement option of array_join to fill the null values. Then I split on the same delimiter.
EDIT: Based on #thebluephantom's comment, you can try this solution:
df.withColumn(
    "array_list_2", F.expr("transform(array_list, x -> coalesce(x, 'ZZZ'))")
).show()
The SQL built-in transform is not working for me, so I couldn't try it, but hopefully it gives you the result you wanted.
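For reference, a quick sketch of what that transform/coalesce expression should produce on a toy frame (treat it as an assumption, since it wasn't runnable in the answerer's environment); note that a genuinely empty array stays empty, because coalesce only replaces null elements:
from pyspark.sql import functions as F
df = spark.createDataFrame([(38292786, ["AAA", None, "JLT"]), (38292790, [])],
                           "ID long, array_list array<string>")
df.withColumn("array_list2",
              F.expr("transform(array_list, x -> coalesce(x, 'ZZZ'))")).show(truncate=False)
# +--------+-----------+---------------+
# |ID      |array_list |array_list2    |
# +--------+-----------+---------------+
# |38292786|[AAA,, JLT]|[AAA, ZZZ, JLT]|
# |38292790|[]         |[]             |
# +--------+-----------+---------------+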

parsing a JSON string Pyspark dataframe column that has string of array in one of the columns

I am trying to read a JSON file and parse 'jsonString', including its underlying fields (one of which is an array), into a PySpark dataframe.
Here are the contents of the JSON file:
[{"jsonString": "{\"uid\":\"value1\",\"adUsername\":\"value3\",\"courseCertifications\":[{\"uid\":\"value2\",\"courseType\":\"TRAINING\"},{\"uid\":\"TEST\",\"courseType\":\"TRAINING\"}],\"modifiedBy\":\"value4\"}","transactionId": "value5", "tableName": "X"},
{"jsonString": "{\"uid\":\"value11\",\"adUsername\":\"value13\",\"modifiedBy\":\"value14\"}","transactionId": "value15", "tableName": "X1"},
{"jsonString": "{\"uid\":\"value21\",\"adUsername\":\"value23\",\"modifiedBy\":\"value24\"}","transactionId": "value25", "tableName": "X2"}]
I am able to parse the contents of the string 'jsonString' and select the required columns using the logic below:
df = spark.read.json('path.json',multiLine=True)
df = df.withColumn('courseCertifications', explode(array(get_json_object(df['jsonString'],'$.courseCertifications'))))
Now my end goal is to parse the field "courseType" from "courseCertifications" and create one row per instance.
I am using the logic below to get "courseType":
df = df.withColumn('new',get_json_object(df.courseCertifications, '$[*].courseType'))
I am able to get the contents of "courseType", but only as a string, as shown below:
[Row(new=u'["TRAINING","TRAINING"]')]
My end goal is to create a dataframe with columns transactionId, jsonString.uid, jsonString.adUsername, jsonString.courseCertifications.uid, jsonString.courseCertifications.courseType
I need to retain all the rows and create multiple rows one per array instances of courseCertifications.uid/courseCertifications.courseType.
An elegant way to resolve your question is to create the schema of the JSON string and then parse it using the from_json function:
import pyspark.sql.functions as f
from pyspark.shell import spark
from pyspark.sql.types import ArrayType, StringType, StructType, StructField
df = spark.read.json('your_path', multiLine=True)
schema = StructType([
    StructField('uid', StringType()),
    StructField('adUsername', StringType()),
    StructField('modifiedBy', StringType()),
    StructField('courseCertifications', ArrayType(
        StructType([
            StructField('uid', StringType()),
            StructField('courseType', StringType())
        ])
    ))
])
df = df \
    .withColumn('tmp', f.from_json(df.jsonString, schema)) \
    .withColumn('adUsername', f.col('tmp').adUsername) \
    .withColumn('uid', f.col('tmp').uid) \
    .withColumn('modifiedBy', f.col('tmp').modifiedBy) \
    .withColumn('tmp', f.explode(f.col('tmp').courseCertifications)) \
    .withColumn('course_uid', f.col('tmp').uid) \
    .withColumn('course_type', f.col('tmp').courseType) \
    .drop('jsonString', 'tmp')
df.show()
Output:
+-------------+------+----------+----------+----------+-----------+
|transactionId|uid |adUsername|modifiedBy|course_uid|course_type|
+-------------+------+----------+----------+----------+-----------+
|value5 |value1|value3 |value4 |value2 |TRAINING |
|value5 |value1|value3 |value4 |TEST |TRAINING |
+-------------+------+----------+----------+----------+-----------+
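One caveat: f.explode drops rows whose courseCertifications is null (which is why only the value5/TRAINING rows appear above). If you also need to retain rows like X1 and X2 that have no courseCertifications, swapping explode for explode_outer should keep them, with null course_uid/course_type:
.withColumn('tmp', f.explode_outer(f.col('tmp').courseCertifications)) \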

using lookup tables to plot a ggplot and table

I'm creating a Shiny app and I'm letting the user choose what data should be displayed in a plot and a table. This choice is made through 3 different input variables that contain 14, 4 and 2 choices respectively.
ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(
    selectInput(inputId = "DataSource", label = "Data source",
                choices = c("Restoration plots", "all semi natural grasslands")),
    selectInput(inputId = "Variabel", label = "Variable",
                choices = choicesVariables),
    # choicesVariables definition is omitted here because it's very long, but it
    # contains 14 string values
    selectInput(inputId = "Factor", label = "Factor",
                choices = c("Company type", "Region and type of application",
                            "Approved or not approved applications", "Age group"))
  ),
  dashboardBody(
    plotOutput("thePlot"),
    tableOutput("theTable")
  ))
This adds up to 73 choices (yes, I know the math doesn't add up there, but some choices are invalid). I would like to do this using a lookup table, so I created one with every valid combination of choices, like this:
rad1 <- c(rep("Company type", 20), rep("Region and type of application", 20),
          rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2 <- choicesVariable[c(1:14, 1, 4, 5, 9, 10, 11, 1:14, 1, 4, 5, 9, 10, 11,
                          1:7, 9:14, 1:14, 1, 4, 5, 9, 10, 11)]
rad3 <- c(rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 27), rep("all semi natural grasslands", 6))
rad4 <- 1:73
letaLista <- data.frame(rad1, rad2, rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now it's easy to use subset to get only the choice that the user made. But how do I use this information to plot the plot and the table without a 73-line-long ifelse statement?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots), but I couldn't make it work. My experience with this kind of array is limited and this might be a simple issue, but any hints would be helpful!
My dataset, which is the foundation for the plots and table, consists of a dataframe with 23 variables, factors and numerical. The plots and tables are then created using the following code for all 73 combinations:
s_A1 <- summarySE(Samlad_info, measurevar = "Dist_brukcentrum",
                  groupvars = "Companytype")
s_A1 <- s_A1[2:6, ]
p_A1 <- ggplot(s_A1, aes(x = Companytype, y = Dist_brukcentrum)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = Dist_brukcentrum - se, ymax = Dist_brukcentrum + se),
                width = .2, position = position_dodge(.9)) +
  scale_y_continuous(name = "") + scale_x_discrete(name = "")
where summarySE is the following function, borrowed from Cookbook for R:
summarySE <- function(data = NULL, measurevar, groupvars = NULL, na.rm = TRUE,
                      conf.interval = .95, .drop = TRUE) {
  # New version of length which can handle NA's: if na.rm == T, don't count them
  length2 <- function(x, na.rm = FALSE) {
    if (na.rm) sum(!is.na(x))
    else length(x)
  }
  # This does the summary. For each group's data frame, return a vector with
  # N, mean, and sd
  datac <- ddply(data, groupvars, .drop = .drop,
                 .fun = function(xx, col) {
                   c(N    = length2(xx[[col]], na.rm = na.rm),
                     mean = mean(xx[[col]], na.rm = na.rm),
                     sd   = sd(xx[[col]], na.rm = na.rm))
                 },
                 measurevar
  )
  # Rename the "mean" column
  datac <- rename(datac, c("mean" = measurevar))
  datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
  # Confidence interval multiplier for standard error
  # Calculate t-statistic for confidence interval:
  # e.g., if conf.interval is .95, use .975 (above/below), and use df = N-1
  ciMult <- qt(conf.interval / 2 + .5, datac$N - 1)
  datac$ci <- datac$se * ciMult
  return(datac)
}
The code in its entirety is a bit too large to include, but I hope this clarifies what I'm trying to do.
Well, thanks to Florian's comment I think I might have found a solution myself. I'll present it here but leave the question open, as there are probably far neater ways of doing it.
I put the plots (which ggplot creates as list objects) into a list, and did the same for the tables:
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching list id and select the right plot and table:
output$thePlot <- renderPlot({
  plotValue <- subset(letaLista, letaLista$Factor == input$Factor &
                        letaLista$Variabel == input$Variabel &
                        letaLista$rest_alla == input$DataSource)
  plotList[as.integer(plotValue[1, 4])]
})
output$theTable <- renderTable({
  plotValue <- subset(letaLista, letaLista$Factor == input$Factor &
                        letaLista$Variabel == input$Variabel &
                        letaLista$rest_alla == input$DataSource)
  skriva <- tableList[as.integer(plotValue[4])]
  print(skriva)
})

Pyspark: filter contents of array inside row

In Pyspark, one can filter an array using the following code:
lines.filter(lambda line: "some" in line)
But I have read data from a json file and tokenized it. Now it has the following form:
df=[Row(text=u"i have some text", words=[u'I', u'have', u"some'", u'text'])]
How can I filter out "some" from words array ?
You can use array_contains; it has been available since 1.4:
from pyspark.sql import Row
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([Row(text=u"i have some text", words=[u'I', u'have', u'some', u'text'])])
df.withColumn("keep", F.array_contains(df.words,"some")) \
.filter(F.col("keep")==True).show()
# +----------------+--------------------+----+
# | text| words|keep|
# +----------------+--------------------+----+
# |i have some text|[I, have, some, t...|true|
# +----------------+--------------------+----+
If you want to filter out 'some', like I said in the comment, you can use the StopWordsRemover API:
from pyspark.ml.feature import StopWordsRemover
StopWordsRemover(inputCol="words", stopWords=["some"]).transform(df)

Spark 2.0.x dump a csv file from a dataframe containing one array of type string

I have a dataframe df that contains one column of type array
df.show() looks like
+--+-------------+---+------+
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 |[A,B,D]      |22 |F     |
|2 |[A,Y]        |42 |M     |
|3 |[X]          |60 |F     |
+--+-------------+---+------+
I try to dump that df to a CSV file as follows:
val dumpCSV = df.write.csv(path="/home/me/saveDF")
It is not working because of the column ArrayOfString. I get the error:
CSV data source does not support array string data type
The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!
What would be the best way to dump the dataframe to a CSV file, including the column ArrayOfString (ArrayOfString should be dumped as one column in the CSV file)?
The reason you are getting this error is that the CSV file format doesn't support array types; you'll need to express the array as a string to be able to save it.
Try the following:
import org.apache.spark.sql.functions._
val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _ => s"""[${vs.mkString(",")}]"""
})
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
or
import org.apache.spark.sql.Column
def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
PySpark implementation. In this example, the field column_as_array is converted to column_as_str before saving.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))
Then you can drop the old column (array type) before saving.
df.drop("column_as_array").write.csv(...)
No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:
import org.apache.spark.sql.functions._
val dumpCSV = df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
  .write
  .csv(path="/home/me/saveDF")
Hope that helps.
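For what it's worth, the equivalent cast-based approach in PySpark should be a one-liner along these lines (a sketch; the column name and path mirror the question):
from pyspark.sql.functions import col
# Cast the array column to its string representation before writing
df.withColumn("ArrayOfString", col("ArrayOfString").cast("string")).write.csv("/home/me/saveDF")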
Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:
def stringifyArrays(dataFrame: DataFrame): DataFrame = {
  val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)
  colsToStringify.foldLeft(dataFrame)((df, c) => {
    df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
  })
}
Also, it doesn't use a UDF.
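A rough PySpark translation of the same idea, also UDF-free (a sketch, not part of the original answer):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType
def stringify_arrays(df):
    # Render every ArrayType column as "[elem1, elem2, ...]" so the CSV writer accepts it
    array_cols = [fld.name for fld in df.schema.fields if isinstance(fld.dataType, ArrayType)]
    for c in array_cols:
        df = df.withColumn(c, F.concat(F.lit("["),
                                       F.concat_ws(", ", F.col(c).cast("array<string>")),
                                       F.lit("]")))
    return df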
CSV is not the ideal export format, but if you just want to visually inspect your data, this will work [Scala]. Quick and dirty solution.
case class example ( id: String, ArrayOfString: String, Age: String, Gender: String)
df.rdd.map{line => example(line(0).toString, line(1).toString, line(2).toString , line(3).toString) }.toDF.write.csv("/tmp/example.csv")
To answer DreamerP's question (from one of the comments):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'
array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
df = df.drop("Consequent")
df = df.drop("Antecedent")
df.write.csv("foldername")
