Importing data using NumPy genfromtxt and formatting a column with datetime - arrays

I have a long text file that I am importing with numpy genfromtxt:
00:00:01 W 348 18.2 55.9 049 1008.8 0.000
00:00:02 W 012 12.5 55.9 049 1008.8 0.000
00:00:03 W 012 12.5 55.9 049 1008.8 0.000
00:00:04 W 357 18.2 55.9 049 1008.8 0.000
00:00:05 W 357 18.2 55.9 049 1008.8 0.000
00:00:06 W 339 17.6 55.9 049 1008.8 0.000
testdata = np.genfromtxt(itertools.islice(f_in, 0, None, 60),\
names=('time','ew','d12','s12','t12','p12'.....)
time = (testdata['time'])
This organizes all the data into an array. The first column of the file is a timestamp for each row, formatted as 00:00:00, i.e. %H:%M:%S. However, in the array that is generated it becomes 1900-01-01 00:00:00. When plotting my data against time, I cannot get it to drop the Y-m-d part.
I have tried time = time.strftime('%H:%M:%S')
and
dt.datetime.strptime(time.decode('ascii'), '%H:%M:%S')
Both do nothing. How can I convert my whole array of times so that it keeps the original %H:%M:%S format without the %Y-%m-%d being added?

EDIT: based on data provided, you can import your file like this:
str2date = lambda x: dt.datetime.strptime(x.decode("utf-8"), '%H:%M:%S').time()
data = np.genfromtxt(itertools.islice(f_in, 0, None, 60), dtype=None,names=('time','ew','d12','s12','t12','p12'.....), delimiter=' ', converters = {0: str2date})
print(data['time'])
output:
00:00:01
Note that you need to call .decode("utf-8") on the input inside str2date, since the converter receives bytes in this setup. You can set dtype in np.genfromtxt() according to your specific file content.
If your value is already a string in the right format, you can also use this:
dt.datetime.strptime(time,"%H:%M:%S").time()
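To see where the 1900-01-01 comes from and how .time() removes it, here is a minimal standard-library sketch. The sample rows are copied from the question; the names `rows`, `full`, `times`, and `labels` are illustrative and not part of the original code, and a step of 2 stands in for the question's step of 60.

```python
import datetime as dt
import itertools

# Sample rows copied from the question's file.
rows = [
    "00:00:01 W 348 18.2 55.9 049 1008.8 0.000",
    "00:00:02 W 012 12.5 55.9 049 1008.8 0.000",
    "00:00:03 W 012 12.5 55.9 049 1008.8 0.000",
    "00:00:04 W 357 18.2 55.9 049 1008.8 0.000",
]

# strptime with only time fields fills in the default date 1900-01-01.
full = dt.datetime.strptime("00:00:01", "%H:%M:%S")
print(full)         # 1900-01-01 00:00:01

# .time() keeps only the time-of-day part.
print(full.time())  # 00:00:01

# Take every 2nd row (the question uses a step of 60) and parse column 0.
times = [
    dt.datetime.strptime(row.split()[0], "%H:%M:%S").time()
    for row in itertools.islice(rows, 0, None, 2)
]

# Format back to plain strings, e.g. for axis tick labels.
labels = [t.strftime("%H:%M:%S") for t in times]
print(labels)       # ['00:00:01', '00:00:03']
```

The same strftime call works on datetime.time objects as on full datetimes, so the list of labels can be built from either.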


Python Impute using BayesianRidge() sklearn impute.IterativeImputer regression impute analysis value error

PROBLEM
I am using IterativeImputer from sklearn.impute with a BayesianRidge() estimator to fit a regression model and impute missing data in the variable 'Frontage'.
The fit step, interative_imputer_fit = interative_imputer.fit(data), runs. But when I invoke my function imputer_bay_ridge(data), the call interative_imputer_fit.transform(X) raises a ValueError. I passed in two variables, Frontage and Area, but only Frontage ended up inside the numpy array given to transform().
Python CODE using sklearn
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge
def imputer_bay_ridge(data):
    data_array = data.to_numpy()
    data_array.reshape(1, -1)
    interative_imputer = IterativeImputer(BayesianRidge())
    interative_imputer_fit = interative_imputer.fit(data_array)
    X = data['LotFrontage']
    data_imputed = interative_imputer_fit.transform(X)
train_data[['Frontage', 'Area']]
INVOKE FUNCTION
fit_tranformed_imputed = imputer_bay_ridge(train_data[['Frontage', 'Area']])
DATA EXAMPLE
train_data[['Frontage', 'Area']]
Frontage Area
0 65.0 8450
1 80.0 9600
2 68.0 11250
3 60.0 9550
4 84.0 14260
... ... ...
1455 62.0 7917
1456 85.0 13175
1457 66.0 9042
1458 68.0 9717
1459 75.0 9937
1460 rows × 2 columns
ERROR
ValueError Traceback (most recent call last)
Cell In[243], line 1
----> 1 fit_tranformed_imputed = imputer_bay_ridge(train_data[['LotFrontage', 'LotArea']])
Cell In[242], line 12, in imputer_bay_ridge(data)
10 interative_imputer_fit = interative_imputer.fit(data_array)
11 X = data['LotFrontage']
---> 12 data_imputed = interative_imputer_fit.transform(X)
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/impute/_iterative.py:724, in IterativeImputer.transform(self, X)
707 """Impute all missing values in `X`.
708
709 Note that this is stochastic, and that if `random_state` is not fixed,
(...)
720 The imputed input data.
721 """
722 check_is_fitted(self)
--> 724 X, Xt, mask_missing_values, complete_mask = self._initial_imputation(X)
726 X_indicator = super()._transform_indicator(complete_mask)
728 if self.n_iter_ == 0 or np.all(mask_missing_values):
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/impute/_iterative.py:514, in IterativeImputer._initial_imputation(self, X, in_fit)
511 else:
512 force_all_finite = True
--> 514 X = self._validate_data(
515 X,
516 dtype=FLOAT_DTYPES,
517 order="F",
518 reset=in_fit,
519 force_all_finite=force_all_finite,
520 )
521 _check_inputs_dtype(X, self.missing_values)
523 X_missing_mask = _get_mask(X, self.missing_values)
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py:566, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
564 raise ValueError("Validation should be done on X, y or both.")
565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
567 out = X
568 elif no_val_X and not no_val_y:
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py:769, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
767 # If input is 1D raise error
768 if array.ndim == 1:
--> 769 raise ValueError(
770 "Expected 2D array, got 1D array instead:\narray={}.\n"
771 "Reshape your data either using array.reshape(-1, 1) if "
772 "your data has a single feature or array.reshape(1, -1) "
773 "if it contains a single sample.".format(array)
774 )
776 # make sure we actually converted to numeric:
777 if dtype_numeric and array.dtype.kind in "OUSV":
ValueError: Expected 2D array, got 1D array instead:
array=[65. 80. 68. ... 66. 68. 75.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
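The message points at the shape of the argument to transform(): a single column selected with data['LotFrontage'] is one-dimensional, while the imputer was fitted on a two-column array. The sketch below uses only NumPy to show the shapes involved. The small array is made up for illustration (values borrowed from the first rows of the data example, with one hypothetical NaN), and the names `frontage` and `frontage_2d` are not from the original code. It also shows that reshape returns a new array, so the bare data_array.reshape(1, -1) call in the function has no effect.

```python
import numpy as np

# Illustrative stand-in for train_data[['Frontage', 'Area']].to_numpy();
# the NaN marks a hypothetical missing Frontage value.
data_array = np.array([
    [65.0, 8450.0],
    [np.nan, 9600.0],
    [68.0, 11250.0],
])

# reshape returns a new array; the original is left unchanged.
data_array.reshape(1, -1)
print(data_array.shape)      # (3, 2)

# Selecting one column gives a 1D array, the case check_array rejects.
frontage = data_array[:, 0]
print(frontage.ndim)         # 1

# reshape(-1, 1) makes it 2D with one feature, as the message suggests...
frontage_2d = frontage.reshape(-1, 1)
print(frontage_2d.shape)     # (3, 1)

# ...but an estimator fitted on two columns expects two columns back,
# and the array used for fit() already has that shape.
print(data_array.shape[1])   # 2
```

Under that reading, passing the same two-column array to interative_imputer_fit.transform() that was passed to fit() would presumably avoid the error, though this has not been run against the original data.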

Dealing with large grib files using xarray and dask

I'm reading some (apparently) large grib files using xarray. I say 'apparently' because they're ~100MB each, which doesn't seem too big to me. However, running
import xarray as xr
ds = xr.open_dataset("gribfile.grib", engine="cfgrib")
takes a good 5-10 minutes. Worse, reading one of these takes up almost 4GB RAM - something that surprises me given the lazy-loading that xarray is supposed to do. Not least that this is 40-odd times the size of the original file!
This reading time and RAM usage seems excessive and isn't scalable to the 24 files I have to read.
I've tried using dask and xr.open_mfdataset, but this doesn't seem to help when the individual files are so large. Any suggestions?
Addendum:
dataset looks like this once opened:
<xarray.Dataset>
Dimensions:     (latitude: 10, longitude: 10, number: 50, step: 53, time: 45)
Coordinates:
  * number      (number) int64 1 2 3 4 5 6 7 8 9 ... 42 43 44 45 46 47 48 49 50
  * time        (time) datetime64[ns] 2011-01-02 2011-01-04 ... 2011-03-31
  * step        (step) timedelta64[ns] 0 days 00:00:00 ... 7 days 00:00:00
    surface     int64 0
  * latitude    (latitude) float64 56.0 55.0 54.0 53.0 ... 50.0 49.0 48.0 47.0
  * longitude   (longitude) float64 6.0 7.0 8.0 9.0 10.0 ... 12.0 13.0 14.0 15.0
    valid_time  (time, step) datetime64[ns] 2011-01-02 ... 2011-04-07
Data variables:
    u100        (number, time, step, latitude, longitude) float32 6.389208 ... 1.9880934
    v100        (number, time, step, latitude, longitude) float32 -13.548858 ... -3.5112982
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    history:                 GRIB to CDM+CF via cfgrib-0.9.4.2/ecCodes-2.9.2 ...
I've temporarily got around the issue by reading in the grib files, one-by-one, and writing them to disk as netcdf. xarray then handles the netcdf files as expected. Obviously it would be nice to not have to do this because it takes ages - I've only done this for 4 so far.
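As a rough sanity check on the figures above, the uncompressed size implied by the printed dimensions can be worked out directly. This sketch only does arithmetic on the numbers in the dataset summary (two float32 variables over the five listed dimensions); it ignores coordinates and any library overhead, so treat the result as an estimate. The variable names are illustrative.

```python
import math

# Dimension sizes copied from the dataset summary above.
dims = {"number": 50, "time": 45, "step": 53, "latitude": 10, "longitude": 10}

bytes_per_value = 4   # float32
n_variables = 2       # u100 and v100

values_per_variable = math.prod(dims.values())
total_bytes = values_per_variable * bytes_per_value * n_variables

print(values_per_variable)          # 11925000
print(round(total_bytes / 1e6, 1))  # 95.4 (MB)
```

That is close to the ~100MB file size, which would suggest the 4GB peak comes from the decoding step rather than from the decoded arrays themselves.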

Convert Array to dataframe

I have the following array and I would like to convert it to a dataframe with the column names Time and DepressionCount. I am pretty much a complete beginner in R, and any help is greatly appreciated.
2004 2004.25 2004.5 2004.75 2005 2005.25 2005.5 2005.75
875 820 785 857 844 841 726 766
This is what you want:
dd=data.frame(t(array))
colnames(dd) = c("Time", "DepressionCount")
t(array) transposes the two-row array into a two-column array, data.frame converts that array to a data.frame, and colnames sets the column names of the data.frame.
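For readers more familiar with Python, the same transpose-then-name idea can be sketched with the standard library. This is only an illustration of what the R answer does, not R code; the values are copied from the question and the names `times`, `counts`, and `rows` are made up here.

```python
# The two "rows" shown in the question.
times = [2004, 2004.25, 2004.5, 2004.75, 2005, 2005.25, 2005.5, 2005.75]
counts = [875, 820, 785, 857, 844, 841, 726, 766]

# zip pairs the i-th time with the i-th count, i.e. a transpose
# from 2 rows x 8 columns to 8 rows x 2 columns.
rows = [
    {"Time": t, "DepressionCount": c}
    for t, c in zip(times, counts)
]

print(len(rows))   # 8
print(rows[0])     # {'Time': 2004, 'DepressionCount': 875}
```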

Reading a text file into MATLAB and storing as two separate arrays

I need to be able to take this:
2.8 7.23 3.64 5.91 9.14 4.17 3.63
2.2 7.53 2.20 10.00 3.28 3.09 7.22
1.1 3.64 7.85 5.15 2.78 7.39 9.15
3.6 3.49 9.99 2.40 7.68 4.53 4.97
2.8 2.60 8.82 5.46 10.00 10.00 7.93
3.5 6.33 4.98 10.00 8.11 2.99 10.00
2.5 6.90 7.35 10.00 10.00 9.93 10.00
1.0 2.05 3.75 5.28 2.34 7.61 9.80
3.8 4.61 7.32 10.00 8.19 2.01 4.19
2.2 5.43 4.12 8.29 5.61 7.33 8.33
3.2 2.13 8.84 2.72 3.40 4.12 9.13
1.4 9.01 5.88 8.79 3.28 7.87 2.03
This is saved in a text file, say nums.txt, and I need to read it into an array in MATLAB. However, I have to save the first column as an [x by 1] array, and the rest as a separate array. Sorry if that doesn't make sense, but I just can't figure it out. Thanks in advance.
EDIT:
I managed to get the column array saved by using this:
diveData = fopen('dive_data.txt');
degDiff = textscan(diveData, '%f %*[^\n]');
degDiff = degDiff{:};
However, I can't get the rest of the array to work so I'm not sure what to do.
First, read in all of your data at once using textread. Done this way, all of your data comes back as a single 1D array in file order. To fix this, reshape the array into 7 rows, one per column of the file, and then transpose the result to get a 7-column matrix. The transpose is needed because MATLAB fills arrays column by column.
After this, extract the first column to put this into one array, then use the rest of the content and place this into another array. In other words:
f = textread('nums.txt', '%f');
vals = reshape(f, 7, []).';
oneColumn = vals(:,1);
otherStuff = vals(:,2:end);
oneColumn will be the first column you want, while otherStuff will be the other columns. When I run this code on your data, this is what I get:
oneColumn =
2.8000
2.2000
1.1000
3.6000
2.8000
3.5000
2.5000
1.0000
3.8000
2.2000
3.2000
1.4000
otherStuff =
7.2300 3.6400 5.9100 9.1400 4.1700 3.6300
7.5300 2.2000 10.0000 3.2800 3.0900 7.2200
3.6400 7.8500 5.1500 2.7800 7.3900 9.1500
3.4900 9.9900 2.4000 7.6800 4.5300 4.9700
2.6000 8.8200 5.4600 10.0000 10.0000 7.9300
6.3300 4.9800 10.0000 8.1100 2.9900 10.0000
6.9000 7.3500 10.0000 10.0000 9.9300 10.0000
2.0500 3.7500 5.2800 2.3400 7.6100 9.8000
4.6100 7.3200 10.0000 8.1900 2.0100 4.1900
5.4300 4.1200 8.2900 5.6100 7.3300 8.3300
2.1300 8.8400 2.7200 3.4000 4.1200 9.1300
9.0100 5.8800 8.7900 3.2800 7.8700 2.0300
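The reshape-and-split step can be easier to follow outside MATLAB. The sketch below is a plain-Python illustration (not MATLAB code) of the same idea on the first two lines of the data: read everything as one flat list, regroup into rows of 7, then split off the first column. The names `text`, `flat`, `n_cols`, `one_column`, and `other_stuff` are illustrative.

```python
# First two lines of nums.txt, copied from the question.
text = """2.8 7.23 3.64 5.91 9.14 4.17 3.63
2.2 7.53 2.20 10.00 3.28 3.09 7.22"""

n_cols = 7

# Like textread with '%f': every number in file order, as one flat list.
flat = [float(tok) for tok in text.split()]

# Regroup into rows of n_cols values each. Slicing a list is row-major,
# so no transpose is needed here; MATLAB fills column by column, which
# is why the answer uses reshape(f, 7, []).' instead.
vals = [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]

one_column = [row[0] for row in vals]
other_stuff = [row[1:] for row in vals]

print(one_column)      # [2.8, 2.2]
print(other_stuff[0])  # [7.23, 3.64, 5.91, 9.14, 4.17, 3.63]
```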

Split file in columns, filter and print them

I have been given an assignment to do. Here are the instructions:
Write a Perl program to accomplish each of the following on the file solar.txt (see link at the class homepage)
1. Print all records that do not list a discoverer in the eighth field.
2. Print every record after erasing the second field. Note: It would be better to say "print every record" omitting the second field.
3. Print the records for satellites that have negative orbital periods. (A negative orbital period simply means that the satellite orbits in a counterclockwise direction.)
4. Print the data for the objects discovered by the Voyager2 space probe.
5. Print each record with the orbital period given in seconds rather than days.
About solar.txt file:
This file contains lines of 9 items, the first being:
Adrastea XV Jupiter 129000 0.30 0.00 0.00 Jewitt 1979
in alphabetical order by the name of the planet or moon (first field).
The text in [] is the corresponding field from the line above.
The fields in this file are:
Name of planet or moon [Adrastea]
Number of moon or planet (roman numerals) [XV]
Name of the object around which the satellite orbits [Jupiter]
Orbital radius (semimajor axis) in kilometers [129000]
Orbital period in days [0.30]
Orbital inclination in degrees [0.00]
Orbital eccentricity [0.00]
Discoverer [Jewitt]
Year of discovery [1979]
I am stuck on the first instruction. I can read in the "solar.txt" file, but after that I can't do it or can't figure it out. Splitting the array seems like the best option, but is not working for me at the moment. Here's the code:
#!/usr/bin/perl
use warnings;
use strict;
open (SOLAR_FILE, "C:/perl_tests/solar.txt") or die "Could not open the file!";
my @array = (<SOLAR_FILE>);
close (SOLAR_FILE);
for (my $i = 0; $i < 8; $i++) {
    my @tempArray = split(/ /, $array[$i]);
    if ($tempArray[$i] eq "-") {
        print "@tempArray";
    }
}
open (SOLAR_FILE, "C:/perl_tests/solar.txt") or die "Could not open the file!";
my @array = (<SOLAR_FILE>);
close (SOLAR_FILE);
for my $record (@array) {
    my @tempArray = split(/ /, $record);
    if ($tempArray[2] eq qw(Jupiter, Uranus, Saturn, Pluto, Mars, Sun, Neptune, Earth)
    s//???/" "/g;
    #I know something goes where the (???) are, but I'm not sure how to do it.
    {
        print "@tempArray";
    }
}
Also, I'm not sure how to start the other 4. If anyone could point me in the right direction, that would be helpful.
EDIT: Here's the info from the file:
Adrastea XV Jupiter 129000 0.30 0.00 0.00 Jewitt 1979
Amalthea V Jupiter 181000 0.50 0.40 0.00 Barnard 1892
Ananke XII Jupiter 21200000 -631 147.00 0.17 Nicholson 1951
Ariel I Uranus 191000 2.52 0.00 0.00 Lassell 1851
Atlas XV Saturn 138000 0.60 0.00 0.00 Terrile 1980
Belinda XIV Uranus 75000 0.62 0.03 0.00 Voyager2 1986
Bianca VIII Uranus 59000 0.43 0.16 0.00 Voyager2 1986
...
Leda XIII Jupiter 11094000 238.72 27.00 0.15 Kowal 1974
Lysithea X Jupiter 11720000 259.22 29.00 0.11 Nicholson 1938
Mars IV Sun 227940000 686.98 1.85 0.09 - -
Megaclite XIX Jupiter 23911000 ? ? ? Sheppard 2000
Mercury I Sun 57910000 87.97 7.00 0.21 - -
Metis XVI Jupiter 128000 0.29 0.00 0.00 Synnott 1979
Mimas I Saturn 186000 0.94 1.53 0.02 Herschel 1789
Miranda V Uranus 130000 1.41 4.22 0.00 Kuiper 1948
Moon I Earth 384000 27.32 5.14 0.05 - -
Naiad III Neptune 48000 0.29 0.00 0.00 Voyager2 1989
Neptune VIII Sun 4504300000 60190.00 1.77 0.01 Adams 1846
...
1. Check 8th field for -: if ($fields[7] eq '-') { ... }
2. Delete 2nd field: splice(@fields, 1, 1);
3. Check 5th field for negative: if ($fields[4] < 0) { ... }
4. Check 8th field for Voyager2: if ($fields[7] eq 'Voyager2') { ... }
5. Impossible. The number of seconds in a day is not the same for every day. However, the approximate result given by $fields[4]*24*60*60 will probably be within tolerance.
Your outer loop is iterating over the fields; it should iterate over the entire array of lines:
for my $record (@array) {
    my @tempArray = split(/ /, $record);
    if ($tempArray[7] eq "-") # test 8th field
    {
        . . .
    }
}
This assumes that you are splitting each line correctly; that is, that the delimiter between fields is a space character.
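The assignment asks for Perl, but the five checks are language-independent. As a cross-check of the logic, here is a Python sketch over a few records copied from the question. It assumes whitespace-separated fields, treats '?' as an unknown period, and uses 24*60*60 seconds per day as the approximation mentioned above. All names (`records`, `period`, `rows`, and so on) are illustrative.

```python
records = """Adrastea XV Jupiter 129000 0.30 0.00 0.00 Jewitt 1979
Ananke XII Jupiter 21200000 -631 147.00 0.17 Nicholson 1951
Belinda XIV Uranus 75000 0.62 0.03 0.00 Voyager2 1986
Mars IV Sun 227940000 686.98 1.85 0.09 - -
Megaclite XIX Jupiter 23911000 ? ? ? Sheppard 2000""".splitlines()

SECONDS_PER_DAY = 24 * 60 * 60  # approximation, as noted in the answer


def period(fields):
    """Return the orbital period in days, or None if it is unknown ('?')."""
    try:
        return float(fields[4])
    except ValueError:
        return None


rows = [line.split() for line in records]

# 1. Records with no discoverer listed ('-' in the eighth field).
no_discoverer = [r[0] for r in rows if r[7] == "-"]

# 2. Every record with the second field omitted.
without_second = [r[:1] + r[2:] for r in rows]

# 3. Satellites with a negative orbital period.
negative = [r[0] for r in rows if period(r) is not None and period(r) < 0]

# 4. Objects discovered by Voyager2.
voyager = [r[0] for r in rows if r[7] == "Voyager2"]

# 5. Orbital period in seconds (left as None where unknown).
in_seconds = {
    r[0]: None if period(r) is None else period(r) * SECONDS_PER_DAY
    for r in rows
}

print(no_discoverer)                # ['Mars']
print(negative)                     # ['Ananke']
print(voyager)                      # ['Belinda']
print(" ".join(without_second[0]))  # Adrastea Jupiter 129000 0.30 0.00 0.00 Jewitt 1979
```

Splitting once per record and indexing the resulting list mirrors the split-then-test structure of the Perl loop above.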
