Importing data from multiple .csv files into single DataFrame - arrays

I'm having trouble getting data from several .csv files into a single array. I can get all of the data from the .csv files fine, I just can't get everything into a simple numpy array. The name of each .csv file is important to me so in the end I'd like to have a Pandas DataFrame with the columns labeled by the initial name of the .csv file.
import glob
import numpy as np
import pandas as pd
files = glob.glob("*.csv")
temp_dict = {}
wind_dict = {}
for file in files:
    data = pd.read_csv(file)
    temp_dict[file[:-4]] = data['HLY-TEMP-NORMAL'].values
    wind_dict[file[:-4]] = data['HLY-WIND-AVGSPD'].values
temp = []
wind = []
name = []
for word in temp_dict:
    name.append(word)
    temp.append(temp_dict[word])
for word in wind_dict:
    wind.append(wind_dict[word])
temp = np.array(temp)
wind = np.array(wind)
When I print temp or wind I get something like this:
[array([ 32.1, 31.1, 30.3, ..., 34.9, 33.9, 32.9])
array([ 17.3, 17.2, 17.2, ..., 17.5, 17.5, 17.2])
array([ 41.8, 41.1, 40.6, ..., 44.3, 43.4, 42.6])
...
array([ 32.5, 32.2, 31.9, ..., 34.8, 34.1, 33.7])]
when what I really want is:
[[ 32.1, 31.1, 30.3, ..., 34.9, 33.9, 32.9]
[ 17.3, 17.2, 17.2, ..., 17.5, 17.5, 17.2]
[ 41.8, 41.1, 40.6, ..., 44.3, 43.4, 42.6]
...
[ 32.5, 32.2, 31.9, ..., 34.8, 34.1, 33.7]]
This does not work, but it is the goal of my code:
df = pd.DataFrame(temp, columns=name)
When I try to build a Pandas DataFrame this way, each row is its own array, which isn't helpful because Pandas thinks every row has only one element in it. I know the problem is with "array(...)"; I just don't know how to get rid of it. Thank you in advance for your time and consideration.

I think you can use:
files = glob.glob("*.csv")
# read each file into a list of DataFrames
dfs = [pd.read_csv(fp) for fp in files]
# create a name for each file (file name without the .csv extension)
lst4 = [x[:-4] for x in files]
# create one big df with a MultiIndex keyed by the file names
df = pd.concat(dfs, keys=lst4)
If you want a separate DataFrame for each measured column, change the last row of the solution above and reshape with unstack:
df = pd.concat(dfs, keys=lst4).unstack()
df_temp = df['HLY-TEMP-NORMAL']
df_wind = df['HLY-WIND-AVGSPD']
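If the goal is exactly the layout from the question, one DataFrame per variable with one column per file, here is a minimal sketch of an alternative. It assumes all files have the same number of rows; column names are taken from the question, and the column order simply follows glob's file order.
import glob
import pandas as pd

files = glob.glob("*.csv")
names = [f[:-4] for f in files]
dfs = [pd.read_csv(f) for f in files]
# one column per file; rows are aligned by position
df_temp = pd.DataFrame({n: d['HLY-TEMP-NORMAL'].values for n, d in zip(names, dfs)})
df_wind = pd.DataFrame({n: d['HLY-WIND-AVGSPD'].values for n, d in zip(names, dfs)})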

Related

I'm trying to convert a Pandas dataframe to a HuggingFace DatasetDict

I have a pandas dataframe with 20k rows containing 2 columns named English and te. I changed the English column name to en. I am trying to split the dataset into train, validation and test, and I want to convert that dataset into
raw_datasets
The output I'm expecting is
DatasetDict({
train: Dataset({
features: ['translation'],
num_rows: 18000
})
validation: Dataset({
features: ['translation'],
num_rows: 1000
})
test: Dataset({
features: ['translation'],
num_rows: 1000
})
})
I'm trying to write code so that raw_datasets["train"][0] returns output like the below:
{'translation': {'en': 'Membership of Parliament: see Minutes',
'to': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}
The data must be in a DatasetDict, similar to the DatasetDict type we get when loading a dataset from Hugging Face. Below is the code I've written, but it's not working.
import pandas as pd
from collections import namedtuple
Dataset = namedtuple('Dataset', ['features', 'num_rows'])
DatasetDict = namedtuple('DatasetDict', ['train', 'validation', 'test'])
def create_dataset_dict(df):
    # Rename the column
    df = df.rename(columns={'English': 'en'})
    # Split the data into train, validation and test
    train_df = df.iloc[:18000, :]
    validation_df = df.iloc[18000:19000, :]
    test_df = df.iloc[19000:, :]
    # Create the dataset dictionaries
    train = Dataset(features=['translation'], num_rows=18000)
    validation = Dataset(features=['translation'], num_rows=1000)
    test = Dataset(features=['translation'], num_rows=1052)
    # Create the final dataset dictionary
    datasets = DatasetDict(train=train, validation=validation, test=test)
    return datasets

def preprocess_dataset(df):
    df = df.rename(columns={'English': 'en'})
    train_df = df.iloc[:18000, :]
    validation_df = df.iloc[18000:19000, :]
    test_df = df.iloc[19000:, :]
    train_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in train_df.iterrows()]
    validation_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in validation_df.iterrows()]
    test_dict = [{'translation': {'en': row['en'], 'te': row['te']}} for _, row in test_df.iterrows()]
    return DatasetDict(train=train_dict, validation=validation_dict, test=test_dict)
df = pd.read_csv('eng-to-te.csv')
raw_datasets = preprocess_dataset(df)
The above code is not working. Can anyone help me with this?
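For what it's worth, the usual route is to build real Dataset objects rather than namedtuples. A minimal sketch, assuming the Hugging Face datasets library is installed and reusing the file name and split sizes from the question:
import pandas as pd
from datasets import Dataset, DatasetDict

df = pd.read_csv('eng-to-te.csv').rename(columns={'English': 'en'})
# pack each row into the nested {'en': ..., 'te': ...} layout under a 'translation' feature
packed = pd.DataFrame({'translation': [{'en': en, 'te': te} for en, te in zip(df['en'], df['te'])]})
# reset_index(drop=True) keeps from_pandas from carrying the sliced index along as a column
raw_datasets = DatasetDict({
    'train': Dataset.from_pandas(packed.iloc[:18000].reset_index(drop=True)),
    'validation': Dataset.from_pandas(packed.iloc[18000:19000].reset_index(drop=True)),
    'test': Dataset.from_pandas(packed.iloc[19000:].reset_index(drop=True)),
})
print(raw_datasets['train'][0])  # {'translation': {'en': ..., 'te': ...}}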

Using glob to import txt files to an array for interpolation

Currently I am using data (wavelength, flux) in txt format and have six txt files. The wavelengths are the same but the fluxes are different. I have imported the txt files using pd.read_csv (as can be seen in the code) and assigned each flux a different name. These differently named fluxes are placed in an array. Finally, I interpolate the fluxes against a temperature array. The code works, and because I currently only have six files, writing it this way is OK. The problem moving forward is that when I have hundreds of txt files I will need a better method.
How can I use glob to import the txt files, assign a different name to each flux (if that is necessary) and finally interpolate? Any help would be appreciated. Thank you.
import pandas as pd
import numpy as np
from scipy import interpolate
fcf = 0.0000001 # flux conversion factor
wcf = 10 #wave conversion factor
temperature = np.array([725,750,775,800,825,850])
# import files and assign column headers; blank to ignore spaces
c1p = pd.read_csv("../c/725.txt",sep=" ",header=None)
c1p.columns = ["blank","0","blank","blank","1"]
c2p = pd.read_csv("../c/750.txt",sep=" ",header=None)
c2p.columns = ["blank","0","blank","blank","1"]
c3p = pd.read_csv("../c/775.txt",sep=" ",header=None)
c3p.columns = ["blank","0","blank","blank","1"]
c4p = pd.read_csv("../c/800.txt",sep=" ",header=None)
c4p.columns = ["blank","0","blank","blank","1"]
c5p = pd.read_csv("../c/825.txt",sep=" ",header=None)
c5p.columns = ["blank","0","blank","blank","1"]
c6p = pd.read_csv("../c/850.txt",sep=" ",header=None)
c6p.columns = ["blank","0","blank","blank","1"]
wave = np.array(c1p['0']/wcf)
c1fp = np.array(c1p['1']*fcf)
c2fp = np.array(c2p['1']*fcf)
c3fp = np.array(c3p['1']*fcf)
c4fp = np.array(c4p['1']*fcf)
c5fp = np.array(c5p['1']*fcf)
c6fp = np.array(c6p['1']*fcf)
cfp = np.array([c1fp,c2fp,c3fp,c4fp,c5fp,c6fp])
flux_int = interpolate.interp1d(temperature,cfp,axis=0,kind='linear',bounds_error=False,fill_value='extrapolate')
My attempts so far... I think I have loaded the files into a list using glob, as follows:
import pandas as pd
import numpy as np
from scipy import interpolate
import glob
c_list=[]
path = "../c/*.*"
for file in glob.glob(path):
    print(file)
    c = pd.read_csv(file,sep=" ",header=None)
    c.columns = ["blank","0","blank","blank","1"]
    c_list.append
I am still unsure how to extract just the fluxes into an array in order to interpolate. I will continue to post my attempts.
My updated code
fcf = 0.0000001
import pandas as pd
import numpy as np
from scipy import interpolate
import glob
c_list=[]
path = "../c/*.*"
for file in glob.glob(path):
    print(file)
    c = pd.read_csv(file,sep=" ",header=None)
    c.columns = ["blank","0","blank","blank","1"]
    c = c['1']*fcf
    c_list.append(c)
fluxes = np.array(c_list)
temperature = np.array([7250,7500,7750,8000,8250,8500])
flux_int =interpolate.interp1d(temperature,fluxes,axis=0,kind='linear',bounds_error=False,fill_value='extrapolate')
When I run this code I get the following error
raise ValueError("x and y arrays must be equal in length along "
ValueError: x and y arrays must be equal in length along interpolation axis.
I think the part of the code that needs correcting is fluxes = np.array(c_list). This is one list of all fluxes, but I need a list of fluxes from each file. How is this done?
Final attempt
import pandas as pd
import numpy as np
from scipy import interpolate
import glob
c_list=[]
path = "../c/*.*"
for file in glob.glob(path):
    print(file)
    c = pd.read_csv(file,sep=" ",header=None)
    c.columns = ["blank","0","blank","blank","1"]
    c = c['1']* 0.0000001
    c_list.append(c)
c1=np.array(c_list[0])
c2=np.array(c_list[1])
c3=np.array(c_list[2])
c4=np.array(c_list[3])
c5=np.array(c_list[4])
c6=np.array(c_list[5])
fluxes = np.array([c1,c2,c3,c4,c5,c6])
temperature = np.array([7250,7500,7750,8000,8250,8500])
flux_int = interpolate.interp1d(temperature,fluxes,axis=0,kind='linear',bounds_error=False,fill_value='extrapolate')
This code works, but I am still not sure about
c1=np.array(c_list[0])
c2=np.array(c_list[1])
c3=np.array(c_list[2])
c4=np.array(c_list[3])
c5=np.array(c_list[4])
c6=np.array(c_list[5])
Is there a better way to write this?
Here are two things that you can do:
Instead of
c = c['1']* 0.0000001
try doing c = c['1'].to_numpy()* 0.0000001
This will build a list of numpy arrays rather than a list of pandas Series.
When constructing fluxes, you can just do
fluxes = np.array(c_list)
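Putting both suggestions together with the questioner's own loop, a minimal sketch (paths, column layout and temperatures are taken from the question; sorting the glob results so that the file order matches the temperature array is an assumption about the file names):
import glob
import numpy as np
import pandas as pd
from scipy import interpolate

fcf = 0.0000001  # flux conversion factor
temperature = np.array([7250, 7500, 7750, 8000, 8250, 8500])

c_list = []
# sorted() keeps the file order aligned with the temperature array,
# assuming exactly one file per temperature and names that sort in temperature order
for file in sorted(glob.glob("../c/*.*")):
    c = pd.read_csv(file, sep=" ", header=None)
    c.columns = ["blank", "0", "blank", "blank", "1"]
    c_list.append(c["1"].to_numpy() * fcf)  # one numpy array of fluxes per file

fluxes = np.array(c_list)  # shape (n_files, n_wavelengths)
flux_int = interpolate.interp1d(temperature, fluxes, axis=0, kind='linear',
                                bounds_error=False, fill_value='extrapolate')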

Problem with saving pickle object into arrays from images in python

I have the following class for loading and converting my images into train and test arrays for a deep learning model in Tensorflow 2.
The images are in three folders, named 'Car', 'Cat' and 'Man', which are within the Train and Test main folders. Each image is 300 x 400 pixels.
import os
import pickle
import cv2
import numpy as np
os.getcwd()
out: 'C:\\Users\\me\\Jupiter_Notebooks\\Dataset_Thermal\\SeekThermal'
path_train = "../SeekThermal/Train"
path_test = "../SeekThermal/Test"
class MasterImage(object):

    def __init__(self, PATH='', IMAGE_SIZE=50):
        self.PATH = PATH
        self.IMAGE_SIZE = IMAGE_SIZE
        self.image_data = []
        self.x_data = []
        self.y_data = []
        self.CATEGORIES = []
        # This will get List of categories
        self.list_categories = []

    def get_categories(self):
        for path in os.listdir(self.PATH):
            if '.DS_Store' in path:
                pass
            else:
                self.list_categories.append(path)
        print("Found Categories ", self.list_categories, '\n')
        return self.list_categories

    def process_image(self):
        try:
            """
            Return Numpy array of image
            :return: X_Data, Y_Data
            """
            self.CATEGORIES = self.get_categories()
            for categories in self.CATEGORIES:  # Iterate over categories
                train_folder_path = os.path.join(self.PATH, categories)  # Folder Path
                class_index = self.CATEGORIES.index(categories)  # this will get index for classification
                for img in os.listdir(train_folder_path):  # This will iterate in the Folder
                    new_path = os.path.join(train_folder_path, img)  # image Path
                    try:  # if any image is corrupted
                        image_data_temp = cv2.imread(new_path)  # Read Image as numbers
                        image_temp_resize = cv2.resize(image_data_temp, (self.IMAGE_SIZE, self.IMAGE_SIZE))
                        self.image_data.append([image_temp_resize, class_index])
                        random.shuffle(self.image_data)
                    except:
                        pass
            data = np.asanyarray(self.image_data)  # or, data = np.asanyarray(self.image_data, dtype=object)
            # Iterate over the Data
            for x in data:
                self.x_data.append(x[0])  # Get the X_Data
                self.y_data.append(x[1])  # get the label
            X_Data = np.asarray(self.x_data) / (255.0)  # Normalize Data
            Y_Data = np.asarray(self.y_data)
            # reshape x_Data
            X_Data = X_Data.reshape(-1, self.IMAGE_SIZE, self.IMAGE_SIZE, 3)
            return X_Data, Y_Data
        except:
            print("Failed to run Function Process Image ")

    def pickle_image(self):
        """
        :return: None Creates a Pickle Object of DataSet
        """
        # Call the Function and Get the Data
        X_Data, Y_Data = self.process_image()
        # Write the Entire Data into a Pickle File
        pickle_out = open('X_Data', 'wb')
        pickle.dump(X_Data, pickle_out)
        pickle_out.close()
        # Write the Y Label Data
        pickle_out = open('Y_Data', 'wb')
        pickle.dump(Y_Data, pickle_out)
        pickle_out.close()
        print("Pickled Image Successfully ")
        return X_Data, Y_Data

    def load_dataset(self):
        try:
            # Read the Data from Pickle Object
            X_Temp = open('..\SeekThermal\X_Data', 'rb')
            X_Data = pickle.load(X_Temp)
            Y_Temp = open('..\SeekThermal\Y_Data', 'rb')
            Y_Data = pickle.load(Y_Temp)
            print('Reading Dataset from Pickle Object')
            return X_Data, Y_Data
        except:
            print('Could not Found Pickle File ')
            print('Loading File and Dataset ..........')
            X_Data, Y_Data = self.pickle_image()
            return X_Data, Y_Data
I don't understand what the problem is with the pickle file, because just last week I was able to create these arrays successfully with the same code.
Is there an easier way to load images in TensorFlow rather than through a custom-defined class?
a = MasterImage(PATH = path_train,IMAGE_SIZE = 224)
a.process_image()
out:
it produces an array with a warning.
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
data = np.asanyarray(self.image_data)
a.pickle_image()
out:
TypeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_1692\507657192.py in <cell line: 1>()
----> 1 a.pickle_image()
~\AppData\Local\Temp\ipykernel_1692\1410849712.py in pickle_image(self)
71 """
72 # Call the Function and Get the Data
---> 73 X_Data,Y_Data = self.process_image()
74
75 # Write the Entire Data into a Pickle File
TypeError: cannot unpack non-iterable NoneType object
a.load_dataset()
out:
Could not Found Pickle File
Loading File and Dataset ..........
Found Categories ['Car', 'Cat', 'Man', 'Car', 'Cat', 'Man']
Pickled Image Successfully
I'm running Python 3.8.8 via anaconda on Windows 10. Thank you for any advice.
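Regarding the follow-up question about an easier way to load the images, here is a minimal sketch assuming TensorFlow 2.x and the Train folder layout described above (one subfolder per class); the image size of 224 matches the call in the question:
import tensorflow as tf

# Builds a tf.data.Dataset of (image, label) batches directly from the folder tree,
# inferring the 'Car', 'Cat' and 'Man' labels from the subfolder names.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "../SeekThermal/Train",
    image_size=(224, 224),
    batch_size=32,
)
print(train_ds.class_names)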

Python, face_recognition convert string to array

I want to convert a variable to a string and then to an array that I can use to compare, but I don't know how to do that.
my code:
import face_recognition
import numpy as np
a = face_recognition.load_image_file('C:\\Users\zivsi\OneDrive\תמונות\סרט צילום\WIN_20191115_10_32_24_Pro.jpg') # my picture 1
b = face_recognition.load_image_file('C:\\Users\zivsi\OneDrive\תמונות\סרט צילום\WIN_20191115_09_48_56_Pro.jpg') # my picture 2
c = face_recognition.load_image_file(
'C:\\Users\zivsi\OneDrive\תמונות\סרט צילום\WIN_20191115_09_48_52_Pro.jpg') # my picture 3
d = face_recognition.load_image_file('C:\\Users\zivsi\OneDrive\תמונות\סרט צילום\ziv sion.jpg') # my picture 4
e = face_recognition.load_image_file(
'C:\\Users\zivsi\OneDrive\תמונות\סרט צילום\WIN_20191120_17_46_40_Pro.jpg') # my picture 5
f = face_recognition.load_image_file(
'C:\\Users\zivsi\OneDrive\תמונות\סרט צילום\WIN_20191117_16_19_11_Pro.jpg') # my picture 6
a = face_recognition.face_encodings(a)[0]
b = face_recognition.face_encodings(b)[0]
c = face_recognition.face_encodings(c)[0]
d = face_recognition.face_encodings(d)[0]
e = face_recognition.face_encodings(e)[0]
f = face_recognition.face_encodings(f)[0]
Here I tried to convert the variable to a string
str_variable = str(a)
array_variable = np.array(str_variable)
my_face = a, b, c, d, e, f, array_variable
while True:
    new = input('path: ')
    print('Recognizing...')
    unknown = face_recognition.load_image_file(new)
    unknown_encodings = face_recognition.face_encodings(unknown)[0]
The program cannot use the variable:
results = face_recognition.compare_faces(array_variable, unknown_encodings, tolerance=0.4)
print(results)
recognize_times = int(results.count(True))
if (3 <= recognize_times):
    print('hello boss!')
    my_face = *my_face, unknown_encodings
please help me
The error shown:
Traceback (most recent call last):
File "C:/Users/zivsi/PycharmProjects/AI/pytt.py", line 37, in <module>
results = face_recognition.compare_faces(my_face, unknown_encodings, tolerance=0.4)
File "C:\Users\zivsi\AppData\Local\Programs\Python\Python36\lib\site-
packages\face_recognition\api.py", line 222, in compare_faces
return list(face_distance(known_face_encodings, face_encoding_to_check) <= tolerance)
File "C:\Users\zivsi\AppData\Local\Programs\Python\Python36\lib\site-packages\face_recognition\api.py", line 72, in face_distance
return np.linalg.norm(face_encodings - face_to_compare, axis=1)
ValueError: operands could not be broadcast together with shapes (7,) (128,)
First of all, the array_variable should actually be a list of the known encodings and not a numpy array.
Also, you do not need str.
Now, in your case, if the input images (i.e., a, b, c, d, e, f) do NOT have the same dimensions, the error will persist. You cannot compare images that have different sizes using this function. The reason is that the comparison is based on distance, and distance is defined on vectors of the same length.
Here is a working simple example using the photos from https://github.com/ageitgey/face_recognition/tree/master/examples:
import face_recognition
import numpy as np
from PIL import Image, ImageDraw
from IPython.display import display
# Load a sample picture and learn how to recognize it.
obama_image = face_recognition.load_image_file("obama.jpg")
obama_face_encoding = face_recognition.face_encodings(obama_image)[0]
# Load a second sample picture and learn how to recognize it.
biden_image = face_recognition.load_image_file("biden.jpg")
biden_face_encoding = face_recognition.face_encodings(biden_image)[0]
array_variable = [obama_face_encoding,biden_face_encoding] # list of known encodings
# compare the list with the biden_face_encoding
results = face_recognition.compare_faces(array_variable, biden_face_encoding, tolerance=0.4)
print(results)
[False, True] # True means match, False mismatch
# False: coming from obama_face_encoding VS biden_face_encoding
# True: coming from biden_face_encoding VS biden_face_encoding
To run it go here: https://beta.deepnote.com/project/09705740-31c0-4d9a-8890-269ff1c3dfaf#
Documentation: https://face-recognition.readthedocs.io/en/latest/face_recognition.html
EDIT
To save the known encodings you can use numpy.save
np.save('encodings',biden_face_encoding) # save
load_again = np.load('encodings.npy') # load again
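To persist all known encodings in one file rather than one per face, a small sketch building on the example above (the file name known_encodings.npy is illustrative):
known = np.array(array_variable)  # shape (n_people, 128)
np.save('known_encodings.npy', known)  # save every known encoding at once
known_again = list(np.load('known_encodings.npy'))  # back to a list of 128-d encodings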

Sub Value and Add new column pandas

I am trying to read a few files from a path, as an extension to my previous question. The answer given by Jianxun definitely makes sense, but I am getting a KeyError. I am very new to pandas and not able to fix the error.
Note: I use Python 2.7 and Pandas 0.16
File_1.csv
Ids,12:00:00
2341,9865
7352,8969
File_2.csv
Ids,12:45:00
1234,9865
8435,8969
Master.csv
Ids,00:00:00,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600
Programs:
import pandas as pd
import numpy as np
from StringIO import StringIO
# your csv file contents
csv_file1 = 'Path/Transition_Data/Test_1.csv '
csv_file2 = 'Path/Transition_Data/Test_2.csv '
csv_file_all = [csv_file1, csv_file2]
# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
# do the subtraction
master_csv_file = 'Path/Data_repository/Master1_Test.csv'
df_master = pd.read_csv(io.StringIO(master_csv_file), index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
Error:
Traceback (most recent call last):
File "distribute_count.py", line 18, in <module>
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 2583, in set_index
level = frame[col].values
File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1787, in __getitem__
return self._getitem_column(key)
File "/usr/lib/pymodules/python2.7/pandas/core/frame.py", line 1794, in _getitem_column
return self._get_item_cache(key)
File "/usr/lib/pymodules/python2.7/pandas/core/generic.py", line 1079, in _get_item_cache
values = self._data.get(item)
File "/usr/lib/pymodules/python2.7/pandas/core/internals.py", line 2843, in get
loc = self.items.get_loc(item)
File "/usr/lib/pymodules/python2.7/pandas/core/index.py", line 1437, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3786)
File "index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3664)
File "hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11943)
File "hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11896)
KeyError: 'Ids'
import pandas as pd
import numpy as np
# your csv file contents
csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Transition_Data/Test_1.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Transition_Data/Test_2.csv'
master_csv_file = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file_all = [csv_file1, csv_file2]
# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(csv_file) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
# do the subtraction
df_master = pd.read_csv(master_csv_file, index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
print(df_matched)
      00:00:00  00:30:00  00:45:00  12:00:00  12:45:00
Ids
1234      1000      -500      -900       NaN      8865
2341       563      -163      -163      9302       NaN
7352       345       155       255      8624       NaN
8435      5243     -4943     -5043       NaN      3726
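For what it's worth, the KeyError in the question most likely comes from wrapping a file path in StringIO: read_csv then parses the path string itself as CSV text, so the resulting frame has no 'Ids' column to set as the index. A minimal sketch of the difference, using a path name from the question (in Python 3 StringIO lives in io; the question's Python 2.7 imports it from the StringIO module):
import pandas as pd
from io import StringIO

path = 'Path/Transition_Data/Test_1.csv'
df_bad = pd.read_csv(StringIO(path))  # parses the path string, not the file: no 'Ids' column
df_ok = pd.read_csv(path)             # reads the actual file, so 'Ids' is a real column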
