MODIS (MYD06_L2) file concatenation using xarray and dask

I am trying to open multiple MODIS files (MYD06_L2) using xarray (xr.open_mfdataset).
I can open a single file, or even a few, but I cannot open many files (for example a whole day) because they have different dimensions.
d06 = xr.open_mfdataset(M06_2040, concat_dim='None', parallel=True)['Cloud_Mask_1km'][:,:,:,0].values
Here M06_2040 is the directory containing the files.
I end up with the following error:
ValueError: arguments without labels along dimension
'Cell_Along_Swath_1km:mod06' cannot be aligned because they have
different dimension sizes: {2040, 2030}

Correct: I believe xarray.open_mfdataset expects every dimension other than the one being concatenated to have the same size in all files.
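One workaround, since you can already open individual granules, is to open them one at a time and concatenate along the swath dimension yourself. A minimal sketch, assuming your backend can read the MYD06_L2 HDF files and with a placeholder glob pattern:

import glob
import xarray as xr

# Placeholder pattern; point it at the MYD06_L2 granules for the day.
files = sorted(glob.glob("MYD06_L2.A2008*.hdf"))

# Open each granule separately so the differing swath lengths never have to align.
masks = [xr.open_dataset(f)["Cloud_Mask_1km"] for f in files]

# Stitch them together along the dimension whose size varies (2030 vs 2040).
cloud_mask = xr.concat(masks, dim="Cell_Along_Swath_1km:mod06")

On newer xarray versions, open_mfdataset(files, combine='nested', concat_dim='Cell_Along_Swath_1km:mod06') should give the same result in one call, but check the dimension names in a single open dataset first, because they must match exactly.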

Related

Fortran90 : read file names sequentially

I am working with Fortran 90. I have 50 .dat files that correspond to 50 time steps. The files have similar names, for instance tstep01.dat, tstep02.dat, tstep03.dat, etc. I have to read the names of these files sequentially. The files are located in the output directory, which sits in the same directory as my script. I want to get each file name so I can pass it to a subroutine that generates an animation; the subroutine uses the name to read the data and then creates the .png frames. I have already tried this:
character(len = 14) :: data_name !data name
nframes = 50 !number of timesteps
do i = 1, nframes
write(data_name, '(output/("/tstep", I2.2, ".dat"))') i
end do
but I got this error :
write(data_name, ('output/("/tstep", I2.2, ".dat")')) i
1
Error: Nonnegative width required in format string at (1)
I think the problem is with the output/ part, but I don't know the correct way to include the directory in the file name. Your help would be appreciated.

Split of XML files

I am working with an XML file that has unfortunately become very large, so now I want to split it into multiple smaller XML files. Is it possible to split one large XML file into multiple smaller XML files?
For example, if we write a project in C we create multiple C files, but the main function is always present in one of them; all the other functions or subprograms live in different C files, and whenever we need to call a function we call it from the file that contains main.
I want something similar for my XML: one main XML file, with all the other XML files dependent on it.
In simple words, I want to split my large XML file into smaller XML files, but I have no idea how. Please share an example, or a link to an example, of this kind of thing.
Thanks
If you just want to split the file into smaller parts, you can use the split command in a terminal.
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic to standard error just
before each output file is opened
--help display this help and exit
--version output version information and exit

Reading many (1000+) files with dlmread - Loop with varying filenames?

I'm very new to MATLAB, or coding for that matter.
I'm running a simulation which outputs thousands of files. These files are .vtk and are read correctly by dlmread.
I tried reading one of them, defining it as a matrix and extracting column vectors out of this matrix. This works fine. What I need now is to read not just one of them, but all of them. The filenames vary by a number, for example cover1000.vtk, cover2000.vtk, ...., cover1200000.vtk.
I want all of them to be read with dlmread and stored as a different matrix. How do i do that? Here is what i have right now, working with one file at a time:
A_1000 = dlmread ('cover1000.vtk') %matrix a containing values from vtk file_in_loadpath
fx_1000 = A(1:20,1) %extracting vector with specific values
fx_ave_1000 = sum(fx_1000)/length(fx_1000) % average of the values in extracted vector
I'm thinking of a loop, but how do I create a loop with varying file names?
Also, I've read that a loop is not the best idea and that cell arrays would be better, but I have absolutely no idea how to implement any of this.
Thanks for your help!
cheers
You can use the function dir to list all the vtk files in your directory then loop over those files.
filename = dir('*.vtk'); % list all the .vtk files in the current directory
for ii = 1:length(filename)
    A = dlmread(filename(ii).name);              % matrix containing the values from one .vtk file
    fx{ii} = A(1:20,1);                          % extract the vector of interest
    fx_ave{ii} = sum(fx{ii})/length(fx{ii});     % average of the values in the extracted vector
end
The results are now stored in two cell arrays: fx and fx_ave.

Read matrices from multiple .csv files and print matrices in .csv files

So I have to write a C program that reads data from .csv files supplied by multiple users into matrices, performs some operations on them (like matrix addition, or multiplication with the necessary conditions on dimensions), and prints these matrices (or the output data) into .csv files again.
I also need to dynamically allocate memory to my matrices.
Now, I have zero background in dealing with .csv files. I do not know the code required to read from or write to a .csv file. I have searched the Internet for a long time but, surprisingly, have not found any program that teaches how to deal with .csv files at an elementary level.
I am lost on this and need a lot of guidance, maybe a sample, fully well-written C program as I need a comprehensive example to begin with.
A CSV file is just a plain ASCII text file that contains a grid of values. Think of the file as a set of rows in a database table where each line in the file represents one record and the order of the data in each line is identical. Each item of data is separated using a comma character (hence the name). So to read the file:-
open file
until the end of the file
    read line into a string
    split the string into substrings where ',' is the delimiter
    parse each substring
Since there is no formatting information in a CSV file, string values are a problem: what do you do if a value contains a comma? For reading numbers, that is not an issue for you.
You could read the file in several passes, the first to determine the amount of data there is (number of columns, number of rows, etc) and the second to actually read the data.
Writing the CSV is quite simple:-
open file
for each record to write
    for each element to write
        write element
        if not last element
            write a comma
    write a new line
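Not C, but as a compact illustration of that read/write control flow, here is the same loop sketched in Python (the file names are placeholders); a C version would follow the same shape using fopen, fgets, strtok and sscanf:

rows = []
with open("input.csv") as f:                      # open file
    for line in f:                                # until the end of the file, read line into a string
        fields = line.rstrip("\n").split(",")     # split the string on ',' as the delimiter
        rows.append([float(x) for x in fields])   # parse each substring as a number

with open("output.csv", "w") as f:                # writing: commas between elements, one record per line
    for row in rows:
        f.write(",".join(str(x) for x in row) + "\n")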

Concatenate a large number of HDF5 files

I have about 500 HDF5 files each of about 1.5 GB.
Each of the files has the same exact structure, which is 7 compound (int,double,double) datasets and variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets and unlimited maximum size,
opens all the files in sequence,
checks the number of samples (as it is variable),
resizes the global file,
and appends the data.
This obviously takes many hours; would you have a suggestion for improving it?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
I found that most of the time was spent in resizing the file, as I was resizing at each step, so I am now first going through all my files to get their lengths (which are variable).
Then I create the global h5file, setting the total length to the sum over all the files.
Only after this phase do I fill the h5file with the data from all the small files.
Now it takes about 10 seconds per file, so it should take less than 2 hours, while before it was taking much more.
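For reference, a minimal h5py sketch of that two-pass approach (the file paths and dataset names below are placeholders; the real files contain 7 compound (int, double, double) datasets):

import h5py

# Placeholder input paths and dataset names.
files = ["part_0001.h5", "part_0002.h5", "part_0003.h5"]
dataset_names = ["data1", "data2"]

# First pass: add up the (variable) number of samples per dataset.
totals = {name: 0 for name in dataset_names}
for path in files:
    with h5py.File(path, "r") as f:
        for name in dataset_names:
            totals[name] += f[name].shape[0]

# Grab the compound dtypes from the first file.
with h5py.File(files[0], "r") as f:
    dtypes = {name: f[name].dtype for name in dataset_names}

with h5py.File("concatenated.h5", "w") as out:
    # Create the output datasets once, already at their final size, so no per-file resize is needed.
    for name in dataset_names:
        out.create_dataset(name, shape=(totals[name],), dtype=dtypes[name])

    # Second pass: copy each file's samples into the right slice.
    offsets = {name: 0 for name in dataset_names}
    for path in files:
        with h5py.File(path, "r") as f:
            for name in dataset_names:
                n = f[name].shape[0]
                out[name][offsets[name]:offsets[name] + n] = f[name][...]
                offsets[name] += n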
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the hdf5 file paths (you can use bash to do this in one go if there are lots)
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
Concatenate all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
Same can be done if you have groups or more complicated hdf5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make a text file listing the files to concatenate, using bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a Julia script to concatenate the files into one:
# concatenate_HDF5.jl
using HDF5

inputfilepath = ARGS[1]
outputfilepath = ARGS[2]

f = open(inputfilepath)
firstit = true
data = []
for line in eachline(f)
    global data, firstit              # needed at top-level scope in recent Julia versions
    r = strip(line, ['\n'])
    println(r)
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(data, datai; dims=4)   # in this case, concatenating on the 4th dimension
    end
end
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5

Resources