Appending data in Python - arrays

I have time data with shape (95,). I wrote the following code to extract the year, month, and day, aiming to create an array of shape (95, 3). However, the code below produces an array of shape (285,). How can I create the new time array with shape (95, 3), where the first column holds the year, the second the month, and the last the day?
newtime = np.array([])
for i in range(len(time)):
    a = seconds_since_jan_1_1993_to_datetime(time[i])
    time_year = float(a.strftime("%Y"))
    time_mon = float(a.strftime("%m"))
    time_day = float(a.strftime("%d"))
    newtime = np.append(newtime, np.array([time_year, time_mon, time_day]))
For example, I have an input array with elements array([725696054.99044609, 725696056.99082708, 725696058.99119401, ...])
I want an output of the following form:
Col1 Col2 Col3
2015.0 12.0 31.0
2015.0 12.0 31.0
2015.0 12.0 31.0
I look forward to your suggestions or help.

My suggestion would be to work with a DataFrame.
An easy fix to your code would be:
import pandas as pd

newtime = pd.DataFrame([], columns=['year', 'month', 'day'])
for i in range(len(time)):
    a = seconds_since_jan_1_1993_to_datetime(time[i])
    time_year = float(a.strftime("%Y"))
    time_mon = float(a.strftime("%m"))
    time_day = float(a.strftime("%d"))
    newtime.loc[len(newtime)] = [time_year, time_mon, time_day]
hope that helps!

The DataFrame is a good option. However, if you want to keep a NumPy array, you can simply use NumPy's reshape() function. Here is an example:
import numpy as np

newtime = np.array([])
for i in range(12):
    # Dummy data generated here, using floats as in the original post
    time_year = float(2015.0)
    time_mon = float(1.0 * i)
    time_day = float(31.0)
    newtime = np.append(newtime, np.array([time_year, time_mon, time_day]))
newtime = newtime.reshape((-1, 3))
Note the argument to reshape: (-1, 3) tells NumPy to make the second dimension 3 and compute the first dimension automatically. Now, if you print newtime, you should see:
[[ 2.01500000e+03 0.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 1.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 2.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 3.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 4.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 5.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 6.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 7.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 8.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 9.00000000e+00 3.10000000e+01]
[ 2.01500000e+03 1.00000000e+01 3.10000000e+01]
[ 2.01500000e+03 1.10000000e+01 3.10000000e+01]]
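As a footnote, the repeated np.append calls can be avoided entirely by collecting rows in a plain Python list and converting once at the end. A minimal runnable sketch: the converter seconds_since_jan_1_1993_to_datetime from the question is not shown there, so a plain datetime stand-in is used here.

```python
import numpy as np
from datetime import datetime, timedelta

def seconds_since_jan_1_1993_to_datetime(s):
    # Stand-in for the converter used in the question.
    return datetime(1993, 1, 1) + timedelta(seconds=s)

time = np.array([725696054.99044609, 725696056.99082708, 725696058.99119401])

rows = []
for t in time:
    a = seconds_since_jan_1_1993_to_datetime(float(t))
    rows.append([a.year, a.month, a.day])

# One conversion at the end yields the (N, 3) shape directly, no reshape needed.
newtime = np.array(rows, dtype=float)
print(newtime.shape)          # (3, 3)
print(newtime[0].tolist())    # [2015.0, 12.0, 31.0]
```

Building the list first is also much faster than np.append, which copies the whole array on every call.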

Related

I'm trying to convert a Pandas DataFrame to a HuggingFace DatasetDict

I have a pandas DataFrame with 20k rows and two columns named English and te. I changed the English column name to en. I am trying to split the dataset into train, validation, and test sets, and I want to convert that dataset into
raw_datasets
The output I'm expecting is:
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 18000
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1000
    })
})
When I write code like raw_datasets["train"][0], it should return output like the following:
{'translation': {'en': 'Membership of Parliament: see Minutes',
'to': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}
The data must be a DatasetDict, the same type we get when loading a dataset from the Hugging Face Hub. Below is the code I've written, but it isn't working:
import pandas as pd
from collections import namedtuple

Dataset = namedtuple('Dataset', ['features', 'num_rows'])
DatasetDict = namedtuple('DatasetDict', ['train', 'validation', 'test'])

def create_dataset_dict(df):
    # Rename the column
    df = df.rename(columns={'English': 'en'})
    # Split the data into train, validation and test
    train_df = df.iloc[:18000, :]
    validation_df = df.iloc[18000:19000, :]
    test_df = df.iloc[19000:, :]
    # Create the dataset dictionaries
    train = Dataset(features=['translation'], num_rows=18000)
    validation = Dataset(features=['translation'], num_rows=1000)
    test = Dataset(features=['translation'], num_rows=1052)
    # Create the final dataset dictionary
    datasets = DatasetDict(train=train, validation=validation, test=test)
    return datasets

def preprocess_dataset(df):
    df = df.rename(columns={'English': 'en'})
    train_df = df.iloc[:18000, :]
    validation_df = df.iloc[18000:19000, :]
    test_df = df.iloc[19000:, :]
    train_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in train_df.iterrows()]
    validation_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in validation_df.iterrows()]
    test_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in test_df.iterrows()]
    return DatasetDict(train=train_dict, validation=validation_dict, test=test_dict)

df = pd.read_csv('eng-to-te.csv')
raw_datasets = preprocess_dataset(df)
The above code is not working. Can anyone help me with this?
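In practice this conversion is usually done with the datasets library itself (datasets.Dataset.from_pandas on each split, then wrapping the three in a datasets.DatasetDict), since namedtuples will never behave like real Dataset objects. As a dependency-light sketch of just the record structure being targeted (the three-row DataFrame and its values are hypothetical; column names en and te are taken from the question):

```python
import pandas as pd

def to_translation_records(df, src='en', tgt='te'):
    # Build the [{'translation': {src: ..., tgt: ...}}, ...] records
    # that raw_datasets["train"][i] is expected to return.
    return [{'translation': {src: s, tgt: t}}
            for s, t in zip(df[src], df[tgt])]

df = pd.DataFrame({'English': ['hello', 'thanks', 'yes'],
                   'te': ['te-hello', 'te-thanks', 'te-yes']})
df = df.rename(columns={'English': 'en'})  # rename as in the question

records = to_translation_records(df)
print(records[0])  # {'translation': {'en': 'hello', 'te': 'te-hello'}}
```

Feeding such records into Dataset.from_list (or building the splits with Dataset.from_pandas) would then give the DatasetDict shape shown above.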

How can I get list values that meet specific conditions?

I'm a Python beginner.
I am writing code to organize data with Python.
I have to extract the values that meet certain conditions from numerous lists.
It seems very simple, but it feels too difficult for me.
First, let me explain with the simplest example:
solutions
Out[73]:
array([[ 2.31350622e-04, -1.42539948e-02, -7.17361833e-02,
2.17545418e-01, -3.38251827e-01, 1.88254191e-01],
[ 4.23523963e-82, -9.48255372e-81, 5.22018863e-80,
-1.11271010e-79, 1.03507672e-79, -3.55573390e-80],
[ 2.31350597e-04, -1.42539951e-02, -7.17361800e-02,
2.17545409e-01, -3.38251817e-01, 1.88254187e-01],
[ 2.58309722e-02, -6.21550000e-01, 3.41867505e+00,
-7.53828444e+00, 7.09091365e+00, -2.39409614e+00],
[ 2.31350606e-04, -1.42539950e-02, -7.17361809e-02,
2.17545411e-01, -3.38251820e-01, 1.88254188e-01],
[ 1.14525725e-02, -3.25174709e-01, 2.11632584e+00,
-5.16113713e+00, 5.12508331e+00, -1.78380602e+00],
[ 9.75839726e-03, -3.08729919e-01, 2.26983591e+00,
-6.16462170e+00, 6.76409438e+00, -2.55992476e+00],
[ 1.13190092e-03, -6.72042220e-02, 7.10413638e-01,
-2.39952623e+00, 2.94849402e+00, -1.18046338e+00],
[ 5.24406689e-03, -1.86240596e-01, 1.36500589e+00,
-3.61106144e+00, 3.75606312e+00, -1.34699295e+00]])
coeff
Out[74]:
array([[ 1.03177808e-04, -6.35700011e-03, -3.19929208e-02,
9.70209594e-02, -1.50853634e-01, 8.39576506e-02,
4.45980248e-01],
[ 5.13911499e-83, -1.15062991e-81, 6.33426960e-81,
-1.35018220e-80, 1.25598048e-80, -4.31459067e-81,
1.21341776e-01],
[ 1.03177797e-04, -6.35700027e-03, -3.19929194e-02,
9.70209556e-02, -1.50853630e-01, 8.39576490e-02,
4.45980249e-01],
[ 4.26209161e-03, -1.02555298e-01, 5.64078896e-01,
-1.24381145e+00, 1.16999559e+00, -3.95024121e-01,
1.64999272e-01],
[ 1.03177801e-04, -6.35700023e-03, -3.19929198e-02,
9.70209566e-02, -1.50853631e-01, 8.39576495e-02,
4.45980248e-01],
[ 2.27512838e-03, -6.45980810e-02, 4.20421959e-01,
-1.02529362e+00, 1.01813129e+00, -3.54364724e-01,
1.98656535e-01],
[ 1.42058482e-03, -4.49435521e-02, 3.30432790e-01,
-8.97418681e-01, 9.84687293e-01, -3.72662657e-01,
1.45575629e-01],
[ 2.46722650e-04, -1.46486353e-02, 1.54850246e-01,
-5.23029411e-01, 6.42688990e-01, -2.57307904e-01,
2.17971950e-01],
[ 1.30617191e-03, -4.63880878e-02, 3.39990392e-01,
-8.99429225e-01, 9.35545685e-01, -3.35503798e-01,
2.49076135e-01]])
In a NumPy array called 'solutions', each row is 'solutions[0]', 'solutions[1]', 'solutions[i]', ... In addition, 'coeff' is also a NumPy array, and 'coeff[0]', 'coeff[1]', 'coeff[i]', ... are matched to 'solutions[0]', 'solutions[1]', 'solutions[i]', ...
What I want is to find the specific 'solutions[i]' and 'coeff[i]' where all elements of solutions[i] are less than 10^-10 and all elements of coeff[i] are greater than 10^-3.
I wonder if there is an appropriate way to extract the arrays that meet more than one condition. I'm a Python beginner, so please excuse me.
This can be accomplished using advanced indexing:
solution_valid = np.all(solutions < 1e-10, axis=1)
coeff_valid = np.all(coeff > 1e-3, axis=1)
both_valid = coeff_valid & solution_valid
valid_solutions = solutions[both_valid]
valid_coeffs = coeff[both_valid]
(Note that 10^-10 is written 1e-10 in Python; 10e-10 would be 1e-9.) But perhaps you mean that the absolute values should be below or above those thresholds:
solution_valid = np.all(np.abs(solutions) < 1e-10, axis=1)
coeff_valid = np.all(np.abs(coeff) > 1e-3, axis=1)
both_valid = coeff_valid & solution_valid
valid_solutions = solutions[both_valid]
valid_coeffs = coeff[both_valid]
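A minimal runnable sketch of the same masking on toy data (thresholds 1e-10 and 1e-3 from the question; two rows, one passing and one failing):

```python
import numpy as np

solutions = np.array([[1e-12, -2e-11, 3e-13],     # all |x| < 1e-10
                      [2.5e-2, -6.2e-1, 3.4e0]])  # fails the test
coeff = np.array([[1.0e-1, -6.3e-2, 4.4e-1],      # all |x| > 1e-3
                  [4.2e-3, -1.0e-1, 1.6e-1]])     # also all |x| > 1e-3

# One boolean per row: does every element in the row satisfy the condition?
solution_valid = np.all(np.abs(solutions) < 1e-10, axis=1)
coeff_valid = np.all(np.abs(coeff) > 1e-3, axis=1)
both_valid = solution_valid & coeff_valid

print(both_valid)             # [ True False]
print(solutions[both_valid])  # only the first row survives
```

axis=1 collapses each row to a single True/False, so the mask lines up row-for-row with both arrays and selects matching pairs of solutions[i] and coeff[i].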

use np.where to subset 3d array

I am currently working on a satellite image, and I have a 3D array of shape (6464, 4064, 3) like this:
[[[ 3.61944046e+01 -6.91377335e+01 -1.50000001e-09]
[ 3.61942863e+01 -6.91287460e+01 1.32471696e-08]
[ 3.61941681e+01 -6.91197662e+01 9.53853174e-09]
...,
[ 3.11809139e+01 -3.63661194e+01 6.60078259e-09]
[ 3.11785698e+01 -3.63582687e+01 6.60078259e-09]
[ 3.11762199e+01 -3.63504028e+01 6.40588294e-09]]
[[ 3.61873817e+01 -6.91379166e+01 -1.50000001e-09]
[ 3.61872635e+01 -6.91289215e+01 1.43964334e-08]
[ 3.61871490e+01 -6.91199493e+01 1.12178125e-08]
...,
[ 3.11743488e+01 -3.63688583e+01 6.63846089e-09]
[ 3.11720028e+01 -3.63610077e+01 7.23354443e-09]
[ 3.11696529e+01 -3.63531456e+01 7.43190709e-09]]
[[ 3.61803589e+01 -6.91380997e+01 -1.50000001e-09]
[ 3.61802444e+01 -6.91291122e+01 1.69292687e-08]
[ 3.61801338e+01 -6.91201324e+01 1.33426239e-08]
...,
[ 3.11677856e+01 -3.63715935e+01 7.35317940e-09]
[ 3.11654358e+01 -3.63637428e+01 6.95529767e-09]
[ 3.11630821e+01 -3.63558846e+01 7.15423853e-09]]
...,
[[ -5.02645159e+00 -7.61433792e+01 -1.50000001e-09]
[ -5.02774668e+00 -7.61361847e+01 3.38870656e-08]
[ -5.02903891e+00 -7.61290054e+01 3.38870656e-08]
...,
[ -9.27992916e+00 -4.86378708e+01 9.09282427e-09]
[ -9.28078461e+00 -4.86308556e+01 9.09282427e-09]
[ -9.28179646e+00 -4.86225281e+01 7.49361462e-09]]
[[ -5.03337288e+00 -7.61447067e+01 -1.50000001e-09]
[ -5.03466558e+00 -7.61375122e+01 3.04580183e-08]
[ -5.03595591e+00 -7.61303253e+01 3.48006957e-08]
...,
[ -9.28699970e+00 -4.86376190e+01 8.94025476e-09]
[ -9.28782177e+00 -4.86308937e+01 8.15083290e-09]
[ -9.28873920e+00 -4.86233711e+01 8.34818881e-09]]
[[ -5.04029608e+00 -7.61460190e+01 -1.50000001e-09]
[ -5.04158545e+00 -7.61388321e+01 3.18825499e-08]
[ -5.04287243e+00 -7.61316452e+01 3.26812319e-08]
...,
[ -9.29387188e+00 -4.86390038e+01 8.31999980e-09]
[ -9.29480457e+00 -4.86313744e+01 8.51963478e-09]
[ -9.29572582e+00 -4.86238594e+01 8.71926975e-09]]]
which is [latitude, longitude, radiance] over 6464 rows * 4064 columns.
I want to subset my area of interest according to latitude and longitude,
so I use
new = np.where((hh[:,:,0]<=13) & (hh[:,:,0]>=7) & (hh[:,:,1]>=-76) & (hh[:,:,1]<=-64))
(hh is my 3D array)
and the shape of the resulting array is (1156142, 3),
which means it has become a 2D array and lost its rows and columns.
I don't understand why, and I don't know how to plot the radiance figure with unknown rows and columns.
When np.where is given only one argument, it returns the tuple condition.nonzero(): the indices where the condition is True.
If you only want the data points selected by latitude and longitude, boolean array indexing is enough; see the NumPy indexing documentation for more detail.
Like this:
import numpy as np

hh = np.random.randn(6464, 4064, 3)
# boolean mask
new_idx = (hh[:, :, 0] <= 0.13) & (hh[:, :, 0] >= 0.07) & \
          (hh[:, :, 1] >= -0.76) & (hh[:, :, 1] <= -0.64)
print(new_idx.shape)  # (6464, 4064)
# Boolean array indexing
# data points (num, [latitude, longitude, radiance])
new_arr = hh[new_idx]
print(new_arr.shape)  # (23175, 3)
# rows, cols
new_where = np.where(new_idx)
print(type(new_where))  # <class 'tuple'>
print([x.shape for x in new_where])  # [(23175,), (23175,)]
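If the end goal is to plot the radiance for that region, one sketch keeps the 2D grid instead of flattening: use the boolean mask to blank everything outside the region with NaN, so the rows and columns survive. (The grid is shrunk here so the example runs quickly; the thresholds are the illustrative ones from the answer above.)

```python
import numpy as np

hh = np.random.randn(64, 40, 3)  # small stand-in for the (6464, 4064, 3) grid
mask = (hh[:, :, 0] <= 0.13) & (hh[:, :, 0] >= 0.07) & \
       (hh[:, :, 1] >= -0.76) & (hh[:, :, 1] <= -0.64)

radiance = hh[:, :, 2].copy()
radiance[~mask] = np.nan  # rows/cols preserved, out-of-region pixels blanked
print(radiance.shape)     # (64, 40)
# plt.imshow(radiance) would now show only the selected area;
# matplotlib leaves NaN cells unpainted.
```

This avoids the "unknown rows and columns" problem entirely, because the image keeps its original shape.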

Importing data from multiple .csv files into single DataFrame

I'm having trouble getting data from several .csv files into a single array. I can get all of the data from the .csv files fine, I just can't get everything into a simple numpy array. The name of each .csv file is important to me so in the end I'd like to have a Pandas DataFrame with the columns labeled by the initial name of the .csv file.
import glob
import numpy as np
import pandas as pd
files = glob.glob("*.csv")
temp_dict = {}
wind_dict = {}
for file in files:
    data = pd.read_csv(file)
    temp_dict[file[:-4]] = data['HLY-TEMP-NORMAL'].values
    wind_dict[file[:-4]] = data['HLY-WIND-AVGSPD'].values
temp = []
wind = []
name = []
for word in temp_dict:
    name.append(word)
    temp.append(temp_dict[word])
for word in wind_dict:
    wind.append(wind_dict[word])
temp = np.array(temp)
wind = np.array(wind)
When I print temp or wind I get something like this:
[array([ 32.1, 31.1, 30.3, ..., 34.9, 33.9, 32.9])
array([ 17.3, 17.2, 17.2, ..., 17.5, 17.5, 17.2])
array([ 41.8, 41.1, 40.6, ..., 44.3, 43.4, 42.6])
...
array([ 32.5, 32.2, 31.9, ..., 34.8, 34.1, 33.7])]
when what I really want is:
[[ 32.1, 31.1, 30.3, ..., 34.9, 33.9, 32.9]
[ 17.3, 17.2, 17.2, ..., 17.5, 17.5, 17.2]
[ 41.8, 41.1, 40.6, ..., 44.3, 43.4, 42.6]
...
[ 32.5, 32.2, 31.9, ..., 34.8, 34.1, 33.7]]
This does not work but is the goal of my code:
df = pd.DataFrame(temp, columns=name)
And when I try to build a Pandas DataFrame, each row is its own array, which isn't helpful because Pandas thinks every row has only one element in it. I know the problem is with "array(...)"; I just don't know how to get rid of it. Thank you in advance for your time and consideration.
I think you can use:
files = glob.glob("*.csv")
#read each file to list of DataFrames
dfs = [pd.read_csv(fp) for fp in files]
#create names for each file
lst4 = [x[:-4] for x in files]
#create one big df with MultiIndex by files names
df = pd.concat(dfs, keys=lst4)
If you want separate DataFrames, change the last row of the solution above, reshaping with unstack:
df = pd.concat(dfs, keys=lst4).unstack()
df_temp = df['HLY-TEMP-NORMAL']
df_wind = df['HLY-WIND-AVGSPD']
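Sticking closer to the asker's two dictionaries: as long as every file yields a series of the same length, pandas can build the labeled frame directly from the dict, and NumPy will stack the values into a plain 2D array. A sketch with hypothetical file names standing in for the real .csv names:

```python
import numpy as np
import pandas as pd

# Stand-ins for temp_dict as built in the question (equal-length series per file).
temp_dict = {'station_a': np.array([32.1, 31.1, 30.3]),
             'station_b': np.array([17.3, 17.2, 17.2])}

df_temp = pd.DataFrame(temp_dict)          # columns labeled by file name
temp = np.array(list(temp_dict.values()))  # the desired plain 2-D float array
print(df_temp.columns.tolist())  # ['station_a', 'station_b']
print(temp.shape)                # (2, 3)
```

If np.array still produces rows wrapped in array(...), the series almost certainly differ in length, and they need to be aligned (e.g. truncated or reindexed) before stacking.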

Graphite summarize() function results are inconsistent for different 'from' values

I am using Graphite to record user login information.
When I run the following:
render?target=summarize(stats_counts.login.success,"1day")&format=json&from=-1days
I get this result:
[
{
"target": "summarize(stats_counts.login.success, \"1day\", \"sum\")",
"datapoints": [
[
5,
1435708800
],
[
21,
1435795200
]
]
}
]
But for the following query:
render?target=summarize(stats_counts.login.success,"1day")&format=json&from=-7days
I get this result:
[
{
"target": "summarize(stats_counts.login.success, \"1day\", \"sum\")",
"datapoints": [
[
0,
1435190400
],
[
1,
1435276800
],
[
0,
1435363200
],
[
0,
1435449600
],
[
5,
1435536000
],
[
16,
1435622400
],
[
6,
1435708800
],
[
21,
1435795200
]
]
}
]
Notice the value for the bucket 1435708800 in both results.
In one result it is 5, and in the other it is 6.
In the first query I am trying to get the number of user logins per day for yesterday and today, and in the second, the number of user logins per day over the last week.
What is the reason for this difference?
UPDATE
Graphite Version : 0.9.10
Retention Settings :
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[real_time]
priority = 200
pattern = ^stats.*
retentions = 1:34560000
[stats]
priority = 110
pattern = .*
retentions = 1s:24h,1m:7d,10m:1y
Try setting alignToFrom to true; otherwise the number of data points that land in each bucket varies with the from time.
By default, buckets are calculated by rounding to the nearest interval. This works well for intervals smaller than a day. For example, 22:32 will end up in the bucket 22:00-23:00 when the interval=1hour.
Passing alignToFrom=true will instead create buckets starting at the from time. In this case, the bucket for 22:32 depends on the from time. If from=6:30 then the 1hour bucket for 22:32 is 22:30-23:30.
"summarize(ex.cpe.ex.xxx,'30s','avg', true)"
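Applied to the queries above, that would look like the following sketch (summarize takes the aggregation function and alignToFrom as its third and fourth arguments):

```
render?target=summarize(stats_counts.login.success,"1day","sum",true)&format=json&from=-7days
```

With both requests bucketing from their from time in the same way, the per-day sums should line up.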