Converting all arrays in a PyTables/HDF5 file from float64 to float32 - arrays

I have a PyTables/HDF5 file with a considerable number of nested groups (subdirectories). I have a way of iterating through all the array datasets in the file. They are float64; I want to convert the file in place, casting every data point from float64 to float32.
According to this question, one way to overwrite arrays is to assign values to them. The following code snippet takes the "count" value/array in the file, casts it to float32, and assigns it back:
import h5py
import numpy as np
# filehead is a string for a file
with h5py.File(filehead, 'r+') as f:
    # Lots of stuff here ... e.g. `head` is a string
    print("/obsnorm/Standardizer/count {}".format(f[head+'/obsnorm/Standardizer/count']))
    print("count value: {}".format(f[head+'/obsnorm/Standardizer/count'].value))
    f[head+'/obsnorm/Standardizer/count'][...] = (f[head+'/obsnorm/Standardizer/count'].value).astype('float32')
    print("/obsnorm/Standardizer/count {}".format(f[head+'/obsnorm/Standardizer/count']))
    print("count value: {}".format(f[head+'/obsnorm/Standardizer/count'].value))
Unfortunately, the result of the printing is:
/obsnorm/Standardizer/count <HDF5 dataset "count": shape (), type "<f8">
count value: 512364.0
/obsnorm/Standardizer/count <HDF5 dataset "count": shape (), type "<f8">
count value: 512364.0
In other words, before the assignment, the type of count is f8, or float64. After casting it, the type is still float64.
How do I modify this in-place so that the data is truly understood as float32?

As suggested by hpaulj in the comments, I decided to simply recreate a duplicate HDF5 file, this time creating the datasets with type f4 (i.e. float32), and that achieved my goal.
The pseudocode is as follows:
import h5py
import numpy as np
# Open the original file jointly with a new file whose name ends in `_float32`.
with h5py.File(oldfile, 'r') as f, h5py.File(oldfile[:-3]+'_float32.h5', 'w') as newf:
    # `head` is some directory structure
    # Create groups to follow the same directory structure
    newf.create_group(head)
    # When it comes time to create a dataset, make the cast here.
    newdata = (f[head+'/name_here'].value).astype('float32')
    newf.create_dataset(head+'/name_here', data=newdata, dtype='f4')
    # Proceed for all other datasets.
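For completeness, here is a fuller sketch of that approach (untested, and assuming every object is reachable via visititems and that HDF5 attributes do not need to be copied), which walks the whole file and rewrites every float64 dataset as float32:
import h5py
import numpy as np

def copy_as_float32(src_path, dst_path):
    # Mirror the group/dataset structure of src_path into dst_path,
    # casting any float64 dataset to float32 along the way.
    with h5py.File(src_path, 'r') as f, h5py.File(dst_path, 'w') as newf:
        def visitor(name, obj):
            if isinstance(obj, h5py.Dataset):
                data = obj[()]  # works for scalar and array datasets alike
                if getattr(data, 'dtype', None) == np.float64:
                    data = data.astype(np.float32)
                newf.create_dataset(name, data=data)
            else:
                newf.require_group(name)  # plain group: just mirror it
        f.visititems(visitor)

copy_as_float32('original.h5', 'original_float32.h5')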

Related

Pandas DataFrame from Numpy Array - column order

I'm trying to read data from a .csv file using Pandas, smooth it with a Savitzky-Golay filter, filter it, and then use Pandas again to write an output csv file. The data must be converted from a DataFrame to an array to perform the smoothing, and then back to a DataFrame to create the output file.
I found a topic on creating a dataframe from numpy arrays (Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?) and I used the dataset = pd.DataFrame({'Column1': data[:, 0], 'Column2': data[:, 1]}) line to create mine.
The problem is that when I rename the columns to 'time' for the first one and 'angle' for the second one, the order in the final dataframe changes. It seems as if the alphabetical order is what matters, which seems weird.
Can someone help me with an explanation?
My complete code:
import scipy as sp
from scipy import signal
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Specify the input file
in_file = '0_chunk0_test.csv'
# Define min and max angle values
alpha_min = 35
alpha_max = 45
# Define Savitzky-Golay filter parameters
window_length = 15
polyorder = 1
# Read input .csv file, but only time and pitch values using usecols argument
data = pd.read_csv(in_file,usecols=[0,2])
# Replace ":" with "" in time values
data['time'] = data['time'].str.replace(':','')
# Convert pandas dataframe to a numpy array, use .astype to convert
# string to float
data_arr = data.to_numpy(copy=True)
data_arr = data_arr.astype(float)
# Perform Savitzky-Golay filtering with signal.savgol_filter
data_arr_smooth = signal.savgol_filter(data_arr[:,1],window_length,polyorder)
# Convert smoothed data array to dataframe and rename Pitch: to angle
data_fr = pd.DataFrame({'time': data_arr[:,0],'angle': data_arr_smooth})
print(data_fr)
Your question is essentially: why does this code result in a column order that is alphabetical, rather than the order that I provided?
data_fr = pd.DataFrame({'time': data_arr[:,0],'angle': data_arr_smooth})
Recent versions of pandas (0.23+, running on Python 3.6+) actually do what you want and give you the columns ['time', 'angle'] rather than ['angle', 'time'].
Up to Python 3.5, dictionaries did not preserve the insertion order of their keys, so by sorting column names alphabetically pandas at least produced a reproducible column order. This behaviour changed in pandas 0.23 (released May 2018), which keeps the dict order when running on Python 3.6 or newer.
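If you want the column order to be explicit regardless of the pandas or Python version, you can pin it yourself. A minimal sketch (with made-up values) using the columns argument:
import numpy as np
import pandas as pd

time = np.array([0.0, 0.1, 0.2])      # hypothetical values
angle = np.array([44.9, 45.1, 45.0])

# An explicit column list fixes the order on any pandas version.
data_fr = pd.DataFrame({'time': time, 'angle': angle}, columns=['time', 'angle'])
print(data_fr.columns.tolist())       # ['time', 'angle']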
If your data is already in a dataframe, it's much easier to just pass the values of the Pitch column to savgol_filter:
data_arr_smooth = signal.savgol_filter(data.Pitch.values, window_length, polyorder)
data_fr = pd.DataFrame({'time': data.time.values,'angle': data_arr_smooth})
There's no need to explicitly convert your data to float as long as the values are numeric; savgol_filter will do this for you:
If x is not a single or double precision floating point array, it will be converted to type numpy.float64 before filtering.
If you want both the original and the smoothed data in your original dataframe, just assign a new column to it:
data['angle'] = signal.savgol_filter(data.Pitch.values, window_length, polyorder)
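Putting those pieces together, the whole pipeline from the question reduces to something like the following sketch (assuming the pitch column read from the csv really is named Pitch, as in the snippets above; the output file name is made up):
import pandas as pd
from scipy import signal

window_length = 15
polyorder = 1

data = pd.read_csv('0_chunk0_test.csv', usecols=[0, 2])
data['time'] = data['time'].str.replace(':', '')
# Smooth the pitch values directly and keep the result alongside the original data.
data['angle'] = signal.savgol_filter(data.Pitch.values, window_length, polyorder)
data[['time', 'angle']].to_csv('0_chunk0_test_smoothed.csv', index=False)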

Extract Data From NetCDF4 File Using List

I am using a list of integers corresponding to the x,y indices of a gridded NetCDF array to extract specific values; the initial code was derived from here. My NetCDF file has a single gridded variable, named TMAX2M, at a single timestep. The code written to do this is as follows (please note that I have not shown the netCDF4 import at the top of the script):
# grid point lists
lat = [914]
lon = [2141]
# Open netCDF File
fh = Dataset('/pathtofile/temperaturedataset.nc', mode='r')
# Variable Extraction
point_list = zip(lat,lon)
dataset_list = []
for i, j in point_list:
    dataset_list.append(fh.variables['TMAX2M'][i,j])
print(dataset_list)
The code executes, and the result is as follows:
[masked_array(data=73, mask=False, fill_value=999999, dtype=int16)]
The data value here is correct; however, I would like the output to contain only the integer held in "data". The goal is to pass a number of x,y points, as in the example linked above, and join the values into a single list.
Any suggestions on what to add to the code to make this achievable would be great.
Pulling out the value for a particular x,y pair from the single-timestep dataset can be done as follows:
dataset_list = []
for i, j in point_list:
    dataset_list.append(fh.variables['TMAX2M'][:][i,j])
The previously linked example used [0,16] to index the variable; [:] can be used in this case.
I suggest converting to a NumPy array like this:
for i, j in point_list:
    dataset_list.append(np.array(fh.variables['TMAX2M'][i,j]))
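If the goal is a plain list of Python integers rather than masked scalars, one option (a sketch reusing fh and point_list from the question; note that in Python 3 you may want point_list = list(zip(lat, lon)) so it can be iterated more than once) is to unwrap each masked value as you collect it:
dataset_list = []
for i, j in point_list:
    value = fh.variables['TMAX2M'][i, j]   # 0-d masked array, e.g. masked_array(data=73, ...)
    dataset_list.append(int(value))        # or value.item() for the equivalent Python scalar
print(dataset_list)                        # e.g. [73]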

CSV to Array of Objects

I am using a third-party package for Spark that utilizes a "PointFeature" object. I am trying to take a csv file and put elements from each row into an Array of these PointFeature objects.
The PointFeature constructor for my implementation looks like this:
Feature(Point( _c1, _c2), _c3)
where _c1, _c2, and _c3 are the columns of my csv and represent doubles.
Here is my current attempt:
val points: Array[PointFeature[Double]] = for {
  line <- sc.textFile("file.csv")
  point <- Feature(Point(line._c1,line._c2),line._c3)
} yield point
My error shows up when referencing the columns
<console>:36: error: value _c1 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
<console>:36: error: value _c2 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
<console>:36: error: value _c3 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
This is obviously because I'm referencing a String as if it were an element of a DataFrame. I'm wondering if there is a way to work with DataFrames in this loop format, or a way to split each line into a List of doubles. Maybe I need an RDD?
Also, I'm not certain that this will yield an Array. Actually, I suspect it will return an RDD...
I am using Spark 2.1.0 on Amazon EMR
Here are some other Questions I have drawn from:
How to read csv file into an Array of arrays in scala
Splitting strings in Apache Spark using Scala
How to iterate records spark scala?
You could set up a Dataset[Feature] this way:
import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._

case class Feature(x: Double, y: Double, z: Double)

sparkSession.read.csv("file.csv")
  .toDF("x", "y", "z")
  .withColumn("x", 'x.cast(DoubleType))
  .withColumn("y", 'y.cast(DoubleType))
  .withColumn("z", 'z.cast(DoubleType))
  .as[Feature]
Then you can consume your strongly-typed DataSet[Feature] as you see fit.
I suggest taking this on in smaller steps.
Step One
Get your rows as an Array/List/whatever of Strings.
val lines = sc.textFile("file.csv") (an RDD[String] with one element per line), or something like that.
Step Two
Break your lines in to their own lists of columns.
val splits = lines.map {l => l.split(",")}
Step Three
Extract your columns as vals that you can use:
val res = splits.map {
  case Array(col1, col2, col3) => // convert to doubles, put into the Feature/Point structure
  case _ => // handle the case where your csv is malformatted
}
This can all be done in one go; I only split it up to show the logical steps from file → list of strings → list of lists of strings → list of Features.

How to structure multiple python arrays for sorting

A fourier analysis I'm doing outputs 5 data fields, each of which I've collected into 1-d numpy arrays: freq bin #, amplitude, wavelength, normalized amplitude, %power.
How best to structure the data so I can sort by descending amplitude?
When testing with just one data field, I was able to use a dict as follows:
fourier_tuples = zip(range(len(fourier)), fourier)
fourier_map = dict(fourier_tuples)
import operator
fourier_sorted = sorted(fourier_map.items(), key=operator.itemgetter(1))
fourier_sorted = np.argsort(-fourier)[:3]
My intent was to add the other arrays to the zip on the first line, but this doesn't work since a dict entry can only hold a key and a value. (That's why this post doesn't solve my issue.)
Stepping back, is this a reasonable approach, or are there better ways to combine & sort separate arrays? Ultimately, I want to take the data values from the top 3 freqs and associated other data, and write them to an output data file.
Here's a snippet of my data:
fourier = np.array([1.77635684e-14, 4.49872050e+01, 1.05094837e+01, 8.24322470e+00, 2.36715913e+01])
freqs = np.array([0. , 0.00246951, 0.00493902, 0.00740854, 0.00987805])
wavelengths = np.array([inf, 404.93827165, 202.46913583, 134.97942388, 101.23456791])
amps = np.array([4.33257766e-16, 1.09724890e+00, 2.56328871e-01, 2.01054261e-01, 5.77355886e-01])
powers = np.array([4.8508237956526163e-32, 0.31112370227749603, 0.016979224022185751, 0.010445983875848858, 0.086141014686372669])  # % power
The last 4 arrays are other fields corresponding to 'fourier'. (Actual array lengths are 42, but pared down to 5 for simplicity.)
You appear to be using numpy, so here is the numpy way of doing this. You have the right function np.argsort in your post, but you don't seem to use it correctly:
order = np.argsort(amplitudes)
This is similar to your dictionary trick, only it computes the inverse shuffle compared to your procedure. (By the way, why go through a dictionary and not simply a list of tuples?)
The contents of order are now indices into amplitudes: the first cell of order contains the position of the smallest element of amplitudes, the second cell contains the position of the next smallest, and so on. Therefore
top5 = order[:-6:-1]
Provided your data are 1d numpy arrays, you can use top5 to extract the elements corresponding to the top 5 amplitudes by using advanced indexing:
freq_bin[top5]
amplitudes[top5]
wavelength[top5]
If you want you can group them together in columns and apply top5 to the resulting n-by-5 array:
np.c_[freq_bin, amplitudes, wavelength, ...][top5, :]
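Tying this back to the data in the question, here is a short sketch using the sample arrays posted above to pull out the top 3 peaks:
import numpy as np

fourier = np.array([1.77635684e-14, 4.49872050e+01, 1.05094837e+01,
                    8.24322470e+00, 2.36715913e+01])
freqs = np.array([0.0, 0.00246951, 0.00493902, 0.00740854, 0.00987805])
amps = np.array([4.33257766e-16, 1.09724890e+00, 2.56328871e-01,
                 2.01054261e-01, 5.77355886e-01])

top3 = np.argsort(fourier)[:-4:-1]   # indices of the 3 largest fourier values, largest first
print(top3)                          # [1 4 2]
print(freqs[top3])                   # frequencies of the top 3 peaks
print(amps[top3])                    # corresponding amplitudes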
If I understand correctly, you have 5 separate lists of the same length and you are trying to sort all of them based on one of them. To do that you can use either numpy or vanilla Python. Here are two examples off the top of my head (sorting is based on the 2nd list).
a = [11,13,10,14,15]
b = [2,4,1,0,3]
c = [22,20,23,25,24]
#numpy solution
import numpy as np
my_array = np.array([a,b,c])
my_sorted_array = my_array[:,my_array[1,:].argsort()]
#vanilla python solution
from operator import itemgetter
my_list = zip(a,b,c)
my_sorted_list = sorted(my_list,key=itemgetter(1))
You can then flip the array with my_sorted_array = np.fliplr(my_sorted_array) if you wish; if you are working with lists, you can reverse the list in place with my_sorted_list.reverse().
EDIT:
To get only the first n values, simply slice the array, similar to what @Paul suggests. Slicing works much like classic list slicing, by specifying start:stop:step arguments (the step can be omitted). In your case, the top 5 columns would be [:,-5:]. So in the example above you can take the top 2 columns from each row like this:
my_sliced_sorted_array = my_sorted_array[:,-2:]
result will be:
array([[15, 13],
       [ 3,  4],
       [24, 20]])
Hope it helps.

Creating a typed array column from an empty array

I just want to solve the following problem: I want to filter a data frame so that only those rows remain in which the string contained in one column is not contained in a blacklist, which is given as a (potentially empty) array of strings.
For example: if the blacklist contains "fourty two" and "twenty three", all rows are filtered out from the dataframe in which the respective column contains either "fourty two" or "twenty three".
The following code will successfully execute, if the blacklist is not empty (for example Array("fourty two")) and fail else (Array.empty[String]):
//HELPERs
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))
def containsString(array: mutable.WrappedArray[String], value: String) = {array.contains(value)}
def arrayCol[T](arr: Array[T]) = {array(arr map lit: _*)}
df.filter(!containsStringUDF(arrayCol[String](blacklist),$"theStringColumn"))
The error message is:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type
It seems that empty arrays appear typeless to Spark. Is there a nice way to deal with this?
You are overthinking the problem. What you really need here is isin:
val blacklist = Seq("foo", "bar")
$"theStringColumn".isin(blacklist: _*)
Moreover, don't depend on the local type for ArrayType being a WrappedArray. Just use Seq.
Finally, to answer your question, you can use either:
array().cast("array<string>")
or:
import org.apache.spark.sql.types.{ArrayType, StringType}
array().cast(ArrayType(StringType))
