Export array<string> type to CSV using PySpark without specifying the columns one by one?

I have a DataFrame with a lot of columns. Some of these columns are of type array<string>.
I need to export a sample to CSV, and CSV doesn't support arrays.
Right now I'm doing this for every array column (and sometimes I miss one or more):
df_write = df\
    .withColumn('col_a', F.concat_ws(',', 'col_a'))\
    .withColumn('col_g', F.concat_ws(',', 'col_g'))\
    ....
Is there a way to use a loop and do this for every array column without specifying them one by one?

You can check the type of each column and do a list comprehension:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType

arr_col = [
    i.name
    for i in df.schema
    if isinstance(i.dataType, ArrayType)
]

df_write = df.select([
    F.concat_ws(',', c).alias(c)  # alias keeps the original column name
    if c in arr_col
    else F.col(c)
    for c in df.columns
])
Actually, you don't need to use concat_ws. You can just cast all columns to string type before writing to CSV, e.g.
df_write = df.select([F.col(c).cast('string') for c in df.columns])

You can also check the types using df.dtypes:
from pyspark.sql import functions as F

array_cols = [c for c, t in df.dtypes if t == "array<string>"]

df_write = df.select(*[
    F.array_join(c, ",").alias(c) if c in array_cols else F.col(c)
    for c in df.columns
])
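Either way, once every column is a plain string you can write the sample out with Spark's CSV writer. A minimal sketch; the path, coalesce(1), and options here are illustrative choices, not part of the original question:
# df_write comes from either snippet above; the output path is a placeholder
df_write.coalesce(1).write.csv('/tmp/sample_csv', header=True, mode='overwrite')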

Related

Pandas DataFrame from Numpy Array - column order

I'm trying to read data from a .csv file using Pandas, smooth it with a Savitzky-Golay filter, filter it, and then use Pandas again to write an output csv file. The data must be converted from a DataFrame to an array to perform the smoothing, and then back to a DataFrame to create the output file.
I found a topic on creating a DataFrame from numpy arrays (Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?) and used the line dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]}) to create mine.
The problem is that when I rename the column names to 'time' for the first column and 'angle' for the second one, the order in the final DataFrame changes. It seems as if alphabetical order matters, which seems weird.
Can someone help me with an explanation?
My complete code:
import scipy as sp
from scipy import signal
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Specify the input file
in_file = '0_chunk0_test.csv'
# Define min and max angle values
alpha_min = 35
alpha_max = 45
# Define Savitsky-Golay filter parameters
window_length = 15
polyorder = 1
# Read input .csv file, but only time and pitch values using usecols argument
data = pd.read_csv(in_file,usecols=[0,2])
# Replace ":" with "" in time values
data['time'] = data['time'].str.replace(':','')
# Convert pandas dataframe to a numpy array, use .astype to convert
# string to float
data_arr = data.to_numpy(dtype=np.dtype,copy=True)
data_arr = data_arr.astype(np.float)
# Perform a Savitsky-Golay filtering with signal.savgol_filter
data_arr_smooth = signal.savgol_filter(data_arr[:,1],window_length,polyorder)
# Convert smoothed data array to dataframe and rename Pitch: to angle
data_fr = pd.DataFrame({'time': data_arr[:,0],'angle': data_arr_smooth})
print(data_fr)
Your question is essentially: why does this code result in a column order that is alphabetical, rather than the order that I provided?
data_fr = pd.DataFrame({'time': data_arr[:,0],'angle': data_arr_smooth})
Recent versions of pandas (0.23 and later, on Python 3.6+) actually do what you want: the columns come out as ['time', 'angle'] rather than ['angle', 'time'].
Before Python 3.6, dictionaries did not preserve the order of their keys, so by sorting the keys alphabetically pandas would at least give a reproducible column order. Pandas 0.23 (released May 2018) changed the constructor to respect the dict's insertion order when one is available.
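If you are on an older pandas/Python combination, you can also pin the order explicitly with the columns argument, which sidesteps dictionary ordering entirely (a minimal sketch reusing the arrays from the question):
import pandas as pd

# Explicit column order, independent of how the dict happens to be ordered
data_fr = pd.DataFrame(
    {'time': data_arr[:, 0], 'angle': data_arr_smooth},
    columns=['time', 'angle'],
)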
If your data is already in a dataframe, it's much easier to just pass the values of the Pitch column to savgol_filter:
data_arr_smooth = signal.savgol_filter(data.Pitch.values, window_length, polyorder)
data_fr = pd.DataFrame({'time': data.time.values,'angle': data_arr_smooth})
There's no need to explicitly convert your data to float as long as it is numeric; savgol_filter will do this for you:
If x is not a single or double precision floating point array, it will be converted to type numpy.float64 before filtering.
If you want both the original and the smoothed data in your original DataFrame, just assign a new column to it:
data['angle'] = signal.savgol_filter(data.Pitch.values, window_length, polyorder)
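Since the question also mentions writing an output csv with Pandas, the last step is just to_csv; a one-line sketch with a placeholder filename:
# Write the smoothed result to a new CSV; the filename is illustrative
data.to_csv('0_chunk0_test_smoothed.csv', index=False)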

Apache-spark: Equal column data structure, different outcome in UDF function

I have two array columns:
arrayColumns1: org.apache.spark.sql.Column = array("col1","col2")
arrayColumns2: org.apache.spark.sql.Column = array("col1","col2")
Both seem equal, but they come from different sources.
arrayColumns1 comes from converting Array("col1","col2") to a column, using this function:
def asLitArray[T](xs: Seq[T]) = array(xs map lit: _*)
arrayColumns2 comes from writing the array expression out literally.
Now when I try to use arrayColumns1 as input to a UDF:
.withColumn("udfFunction", udfFunction(arrayColumns1))
where
val udfFunction = udf(
  { xs: Seq[Double] =>
    DO_SOMETHING
    (output)
  }
)
It throws this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(col1,col2))' due to data type mismatch: argument 1 requires array<double> type, however, 'array('col1','col2')' is of array<string> type.;;
But when I use arrayColumns2 it works fine. What did I do wrong?
I'm using Spark 2.1 with Scala 2.11.
It does not make much sense to pass an array of literals to a UDF, because what you want to pass are the names of the columns, not literal values. Your first case fails because asLitArray creates columns of string type: lit("col1") is a literal column whose content is the string "col1"; it does not reference the column col1.
I would do it like this:
def asColArray(xs: Seq[String]) = array((xs.map(x => col(x))): _*)
val arrayColumns = asColArray(Array("col1","col2"))
df.withColumn("udfFunction",udfFunction(arrayColumns))
If you really want to use literal values, you would need to do something like this:
val arrayColumns = asLitArray(Array(1.0,2.0))
But this would give you a constant output from your UDF.

Converting all arrays in a PyTables/HDF5 file from float64 to float32

I have a PyTables file with a considerable number of subdirectories. I have a way of iterating through all the array datatypes in the file; they are float64, and I want to convert the file in place, casting all data points from float64 to float32.
According to this question, one way to overwrite arrays is to assign values. The following code snippet tries to take the "count" value/array in the file, convert it to float32, and assign it back:
import h5py
import numpy as np

# filehead is a string for a file
with h5py.File(filehead, 'r+') as f:
    # Lots of stuff here ... e.g. `head` is a string
    print("/obsnorm/Standardizer/count {}".format(f[head+'/obsnorm/Standardizer/count']))
    print("count value: {}".format(f[head+'/obsnorm/Standardizer/count'].value))
    f[head+'/obsnorm/Standardizer/count'][...] = (f[head+'/obsnorm/Standardizer/count'].value).astype('float32')
    print("/obsnorm/Standardizer/count {}".format(f[head+'/obsnorm/Standardizer/count']))
    print("count value: {}".format(f[head+'/obsnorm/Standardizer/count'].value))
Unfortunately, the result of the printing is:
/obsnorm/Standardizer/count <HDF5 dataset "count": shape (), type "<f8">
count value: 512364.0
/obsnorm/Standardizer/count <HDF5 dataset "count": shape (), type "<f8">
count value: 512364.0
In other words, before the assignment, the type of count is f8, or float64. After casting it, the type is still float64.
How do I modify this in-place so that the data is truly understood as float32?
As suggested by hpaulj in the comments, I decided to simply recreate the HDF5 file as a duplicate, except with the datasets created as type f4 (the same as float32), and that achieved my goal.
The pseudocode is as follows:
import h5py
import numpy as np

# Open the original file jointly with a new file, with `_float32` appended to the name.
with h5py.File(oldfile, 'r') as f, h5py.File(newfile[:-3]+'_float32.h5', 'w') as newf:
    # `head` is some directory structure
    # Create groups to follow the same directory structure
    newf.create_group(head)
    # When it comes time to create a dataset, make the cast here.
    newdata = (f[head+'/name_here'].value).astype('float32')
    newf.create_dataset(head+'/name_here', data=newdata, dtype='f4')
    # Proceed for all other datasets.
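For a file with many nested groups, the same idea can be automated with h5py's visititems, which walks every group and dataset under the root. This is only a sketch; it assumes every dataset in the file is numeric, and the function name copy_as_float32 is made up for illustration:
import h5py

def copy_as_float32(oldfile, newfile):
    """Copy oldfile to newfile, casting every dataset to float32 (assumes numeric data)."""
    with h5py.File(oldfile, 'r') as f, h5py.File(newfile, 'w') as newf:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                # Read the data, cast it, and store it as f4 in the new file.
                newf.create_dataset(name, data=obj[()].astype('float32'), dtype='f4')
            else:
                # Mirror the group structure.
                newf.create_group(name)
        f.visititems(visit)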

CSV to Array of Objects

I am using a third-party package for Spark that utilizes a "PointFeature" object. I am trying to take a CSV file and put elements from each row into an Array of these PointFeature objects.
The PointFeature constructor for my implementation looks like this:
Feature(Point( _c1, _c2), _c3)
where _c1, _c2, and _c3 are the columns of my csv and represent doubles.
Here is my current attempt:
val points: Array[PointFeature[Double]] = for {
  line <- sc.textFile("file.csv")
  point <- Feature(Point(line._c1, line._c2), line._c3)
} yield point
My error shows up when referencing the columns
<console>:36: error: value _c1 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
<console>:36: error: value _c2 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
<console>:36: error: value _c3 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
This is obviously because I'm referencing a String as if it were an element of a DataFrame. I'm wondering if there is a way to work with DataFrames in this loop format, or a way to split each line into a List of doubles. Maybe I need an RDD?
Also, I'm not certain that this will yield an Array. Actually, I suspect it will return an RDD...
I am using Spark 2.1.0 on Amazon EMR
Here are some other Questions I have drawn from:
How to read csv file into an Array of arrays in scala
Splitting strings in Apache Spark using Scala
How to iterate records spark scala?
You could set up a Dataset[Feature] this way:
import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._  // needed for the symbol syntax and .as[Feature]

case class Feature(x: Double, y: Double, z: Double)

sparkSession.read.csv("file.csv")
  .toDF("x", "y", "z")
  .withColumn("x", 'x.cast(DoubleType))
  .withColumn("y", 'y.cast(DoubleType))
  .withColumn("z", 'z.cast(DoubleType))
  .as[Feature]
Then you can consume your strongly-typed Dataset[Feature] as you see fit.
I suggest taking this on in smaller steps.
Step One
Get your rows as an Array/List/whatever of Strings.
val lines = sc.textFile("file.csv") (which gives you an RDD[String]), or something like that.
Step Two
Break your lines into their own lists of columns.
val splits = lines.map {l => l.split(",")}
Step Three
Extract your columns as vals that you can use:
val res = splits.map {
  case Array(col1, col2, col3) => // convert to doubles, put into the Feature/Point structure
  case _ => // handle the case where your CSV is malformed
}
This can all be done in one go; I only split it up to show the logical steps: file → list of strings → list of lists of strings → list of Features.

Creating a typed array column from an empty array

I just want to solve the following problem: I want to filter out of a DataFrame all rows in which the string contained in one column appears in a blacklist, which is given as a (potentially empty) array of strings.
For example: if the blacklist contains "fourty two" and "twenty three", all rows whose respective column contains either "fourty two" or "twenty three" are filtered out of the DataFrame.
The following code executes successfully if the blacklist is not empty (for example Array("fourty two")) and fails otherwise (Array.empty[String]):
// HELPERS
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))
def containsString(array: mutable.WrappedArray[String], value: String) = {array.contains(value)}
def arrayCol[T](arr: Array[T]) = {array(arr map lit: _*)}
df.filter(!containsStringUDF(arrayCol[String](blacklist),$"theStringColumn"))
The error message is:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type
It seems that empty arrays appear typeless to Spark. Is there a nice way to deal with this?
You are overthinking the problem. What you really need here is isin:
val blacklist = Seq("foo", "bar")
$"theStringColumn".isin(blacklist: _*)
Moreover, don't depend on the local type for ArrayType being a WrappedArray; just use Seq.
Finally, to answer your question directly, you can use either:
array().cast("array<string>")
or:
import org.apache.spark.sql.types.{ArrayType, StringType}
array().cast(ArrayType(StringType))
