Shuffle Dask array chunks from HDF5 file

I have a very large array stored in an HDF5 file. I am trying to load it and manage it as a Dask array.
At the moment my challenge is that I need to shuffle this array from time to time during a process, which is a challenge in itself for an array bigger than memory.
So what I am trying to do, without success, is to shuffle the Dask array chunks.
# Prepare data
import h5py
import dask.array as da

f = h5py.File('Data.hdf5', 'r')
dset = f['/Data']
dk_array = da.from_array(dset, chunks=dset.chunks)

So, given the context above, how can I shuffle the chunks?

If your array is tabular in nature then you might consider adding a column of random data (see da.concatenate and da.random), turning it into a dask.dataframe, and setting that column as the index.
As a warning, this will be somewhat slow as it will need to do an on-disk shuffle.
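For concreteness, here is a minimal sketch of that approach, assuming dk_array is the 2-D, numeric array loaded above; the column names ('c0', ..., 'rand') are only illustrative:

import dask.array as da
import dask.dataframe as dd

# Append a column of random values, convert to a dask.dataframe, and set that
# column as the index; set_index performs the (on-disk) shuffle.
rand = da.random.random((dk_array.shape[0], 1), chunks=(dk_array.chunks[0], 1))
combined = da.concatenate([dk_array, rand], axis=1)
names = [f'c{i}' for i in range(dk_array.shape[1])] + ['rand']
df = dd.from_dask_array(combined, columns=names)
shuffled = df.set_index('rand')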

Related

Parsing NumPy arrays from pandas data frame cells

I'm rather new to Pandas, and I think I have messed up with my data files.
I have stored some pandas data frames to CSV files. The data frames contained NumPy arrays stored in a single column. I know that it is not recommended to do so. However, because the arrays have an indefinite number of elements (varying row by row), I stored them in a single column; column names and column order were getting a bit tedious otherwise. Initially, my notion was that I would not need those arrays for my data analysis because they contain raw data stored only for completeness. It was only later that I realized that I would have to go back to the raw data to extract some relevant values. Lucky for me that I saved it initially, but reading it back from the CSV files proved to be difficult.
Everything works fine, as long as I have the original data frame, but when I read the data frame back from CSV, the columns that contain the arrays are read back as strings instead of NumPy arrays.
I have used the pandas.read_csv converters option and numpy.fromstring with some regular expressions to parse the NumPy arrays from the strings. However, it is slow (the data frames contain approximately 400k rows).
So, preferably, I would like to convert the data once and save the data frames to a file format that maintains the NumPy arrays in the cells and can be read back directly as NumPy arrays. What would be the best file format to use if it is possible? Or what would be the best way to do it otherwise?
Your suggestions would be appreciated.
For completeness, here is my converter code:
import re
import numpy as np
import pandas as pd

def parseArray(s):
    # Strip the brackets, turn runs of spaces into commas, then parse the numbers.
    s = re.sub(r'\[', '', s)
    s = re.sub(r'\]', '', s)
    s = re.sub(r' +', ',', s)
    s = np.fromstring(s, sep=',')
    return s

testruns = pd.read_csv("datafiles/parse_test.csv", converters={'swarmBest': parseArray})
Without the converter, the 'swarmBest' column is read back as a string:
'[1095.56629 52.32807 8.43377 122.19014 75.42834 8.43377]'
With the converter I can do for example:
testarray = swarmFits[0]
print(testarray)
print(testarray[0])
Output:
[1095.56629 52.32807 8.43377 122.19014 75.42834 8.43377]
1095.56629
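Not from the original thread, but as a point of reference: pandas' pickle format round-trips object columns, including NumPy arrays stored in cells, unchanged. A minimal sketch, with an illustrative file path:

import numpy as np
import pandas as pd

# Build a small frame with arrays of varying length in one column,
# write it with to_pickle, and read it back; the cells come back as ndarrays.
df = pd.DataFrame({'swarmBest': [np.array([1095.56629, 52.32807, 8.43377]),
                                 np.array([122.19014, 75.42834])]})
df.to_pickle("datafiles/parse_test.pkl")   # illustrative path
restored = pd.read_pickle("datafiles/parse_test.pkl")
print(type(restored['swarmBest'][0]))      # <class 'numpy.ndarray'>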

Efficiently indexing into a 1-dimensional array representing sparse 2-dimensional data

Given a 2D array sparsely filled with objects containing position and size data (X, Y, W, H), representing objects in 2D space, which has been transformed into a 1D array with the excess data (any empty spaces) removed: is it possible to index into the one-dimensional array, given a starting point (X, Y) corresponding to a location in the original array, using some minimal set of metadata acquired from the original array?
Essentially I'm searching for a way to represent large sets of 2D spatial data without consuming an excess of memory and computing power (a 16,000 by 16,000 array would contain 256,000,000 objects), so getting rid of potentially hundreds of millions of "empty spaces" would represent a large performance gain, both in terms of memory usage and in cycles wasted looping over empty space. Computations are being made in real time, so keeping data in memory is preferred.
Linked below is an example of what I'm trying to accomplish. Hope it makes sense.
example
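No answer was recorded here, but a sketch of one standard way to do this is a CSR-style layout (the names below are mine, not the poster's): keep the non-empty objects in a flat array ordered row by row, plus per-row offsets and the column index of each stored entry.

import numpy as np

# objects:   flat array of the stored items, ordered row by row
# row_start: row_start[y] .. row_start[y+1] bounds the slice of `objects` for row y
# col_idx:   column (X) of each entry in `objects`, sorted within each row
def lookup(x, y, objects, row_start, col_idx):
    """Return the object stored at grid position (x, y), or None if that cell is empty."""
    lo, hi = row_start[y], row_start[y + 1]
    i = lo + np.searchsorted(col_idx[lo:hi], x)
    if i < hi and col_idx[i] == x:
        return objects[i]
    return None

The metadata costs one integer per row plus one per stored object, so empty cells cost nothing, and each lookup is a binary search within a single row.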

Cython construct sorted array

Using Cython, I'm building a program in which I am constructing an array of known length, say of length 10e5. I have to calculate each element separately and then add it to the array, which I initialize as np.empty(int(10e5), dtype=np.float64) with the help of the NumPy package.
However, I want this array to be sorted, and the np.ndarray.sort operation takes about 17% of my program's total runtime, so I would like to eliminate this step.
Is there a fast Cython-esque way to construct the array in a way where it is kept sorted as more values are added?
I tried something using a TreeSet but this object-oriented approach generates way too much overhead.
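For reference, a plain NumPy sketch of the workflow the question describes (compute_element stands in for the asker's per-element calculation, which isn't shown):

import numpy as np

def compute_element(i):
    # Placeholder for the real per-element calculation.
    return ((i * 2654435761) % 1000003) / 1000.0

n = int(10e5)
arr = np.empty(n, dtype=np.float64)
for i in range(n):
    arr[i] = compute_element(i)
arr.sort()   # the final sort is the ~17% of runtime the asker wants to eliminate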

How to efficiently append data to an HDF5 table in C?

I'm failing to save a large dataset of float values in an HDF5 file efficiently.
The data acquisition works as follows:
A fixed array of 'ray data' (coordinates, directions, wavelength, intensity, etc.) is created and sent to an external ray-tracing program (it's about 2500 values).
In return I get the same array but with changed data.
I now want to save the new coordinates in an HDF5 file, as a simple table, for further processing.
These steps are repeated many times (about 80,000).
I followed the example of the HDF5 Group (http://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/examples/h5_extend_write.c), but unfortunately the solution is quite slow.
Before writing the data directly into an HDF5 file I used a simple CSV file; it takes about 80 seconds for 100 repetitions, whereas appending to the HDF5 file takes 160 seconds.
The 'pseudo' code looks like this:
// n is a large number, e.g. 80000
for (i = 0; i < n; ++i)
{
    /* create an array of rays for tracing */
    rays = createArray(i);

    /* trace the rays */
    traceRays(&rays);

    /* write results to the HDF5 file; m is a number around 2500 */
    for (j = 0; j < m; j++)
    {
        buffer.x = rays[j].x;
        buffer.y = rays[j].y;

        /* this seems to be slow: */
        H5TBappend_records(h5file, tablename, 1, dst_size, dst_offset, dst_sizes, &buffer);

        /* this is fast: */
        sprintf(szBuffer, "%15.6E,%14.6E\n", rays[j].x, rays[j].y);
        fputs(szBuffer, outputFile);
    }
}
I could imagine that it has something to do with the overhead of extending the table at each step?
Any help would be appreciated.
cheers,
Julian
You can get very good performance using the low level API of HDF5. I explain how to do it in this detailed answer.
Basically you need to either use a fixed-size dataset if you know its final size in advance (best case scenario), or use a chunked dataset which you can extend at will (a bit more code, more overhead, and choosing a good chunk size is critical for performance). In any case, then you can let the HDF5 library buffer the writes for you. It should be very fast.
In your case you probably want to create a compound datatype to hold each record of your table. Your dataset would then be a 1D array of your compound datatype.
NB: The methods used in the example code you linked to are correct. If it didn't work for you, that might be because your chunk size was too small.
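For illustration only (in Python/h5py rather than the C API, and not the code from the linked answer), the buffered, block-wise append pattern looks roughly like this; the file name, dataset name, and block size are assumptions:

import numpy as np
import h5py

record_t = np.dtype([('x', 'f8'), ('y', 'f8')])  # compound datatype for one record

with h5py.File('rays.h5', 'w') as f:
    # Chunked, extendable 1-D dataset of the compound type.
    table = f.create_dataset('rays', shape=(0,), maxshape=(None,),
                             chunks=(2500,), dtype=record_t)
    block = np.empty(2500, dtype=record_t)       # buffer one trace worth of records
    for i in range(80000):
        # ... fill `block` from the traced rays ...
        old = table.shape[0]
        table.resize((old + len(block),))
        table[old:] = block                      # one write per block, not per record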

How can I efficiently copy 2-dimensional arrays of bytes into a larger 2D array?

I have a structure called Patch that represents a 2D array of data.
import qualified Data.ByteString as Strict

type Size = (Int, Int)
data Patch = Patch Size Strict.ByteString
I want to construct a larger Patch from a set of smaller Patches and their assigned positions. (The Patches do not overlap.) The function looks like this:
type Position = (Int, Int)
combinePatches :: [(Position, Patch)] -> Patch
combinePatches plan = undefined
I see two sub-problems. First, I must define a function to translate 2D array copies into a set of 1D array copies. Second, I must construct the final Patch from all those copies.
Note that the final Patch will be around 4 MB of data. This is why I want to avoid a naive approach.
I'm fairly confident that I could do this horribly inefficiently, but I would like some advice on how to efficiently manipulate large 2D arrays in Haskell. I have been looking at the "vector" library, but I have never used it before.
Thanks for your time.
If the spec is really just a one-time creation of a new Patch from a set of previous ones and their positions, then this is a straightforward single-pass algorithm. Conceptually, I'd think of it as two steps: first, combine the existing patches into a data structure with reasonable lookup for any given position; next, write your new structure lazily by querying the compound structure. This should be roughly O(n log(m)), n being the size of the new array you're writing and m being the number of patches.
This is conceptually much simpler if you use the Vector library instead of a raw ByteString. But it is simpler still if you simply use Data.Array.Unboxed. If you need arrays that can interop with C, then use Data.Array.Storable instead.
If you ditch purity, at least locally, and work with an ST array, you should be able to trivially do this in O(n) time. Of course, the constant factors will still be worse than using fast copying of chunks of memory at a time, but there's no way to keep that code from looking low-level.
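As a language-agnostic illustration of that mutable, single-pass idea (written in Python/NumPy here rather than Haskell, with made-up names), the whole operation is just one block copy per patch into a pre-allocated destination:

import numpy as np

def combine_patches(plan, out_height, out_width):
    # plan: list of ((x, y), patch) pairs, where each patch is a 2-D byte array.
    out = np.zeros((out_height, out_width), dtype=np.uint8)
    for (x, y), patch in plan:
        h, w = patch.shape
        out[y:y + h, x:x + w] = patch   # one 2-D block copy per (non-overlapping) patch
    return out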
