How to create a pandas DataFrame by combining a list of column_names and a numpy array, and then adding more column(s)?

I have a list of names and a numpy array, as shown below. How can I combine these two to make a pandas DataFrame? (My actual problem is larger than this: I have more than 700 column names and hundreds of thousands of rows in the array.) Your help will be invaluable to me. Thank you.
column_names = [u'Bars', u'Burgers', u'Dry Cleaning & Laundry', u'Eyewear & Opticians', u'Local Services', u'Restaurants', u'Shopping']
values = np.array([[1, 1, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0, 0, 1],
                   [0, 0, 0, 0, 0, 1, 0]], dtype=np.int64)
UPDATE
Thank you very much for the quick inputs. I am sorry that I did not fully explain the final goal I would like to achieve: I would like to add another column, score, which is a list [4, 4.5, 5, 5.5, 3], to the pandas DataFrame. Then I would like to use all columns except score as predictors to predict score in a linear regression model. I think the essential part here is how to add a new column in an efficient way. I know that I can do
data = pd.DataFrame({"Bars": Bars, "Burgers": Burgers, "Dry Cleaning & Laundry": Dry Cleaning & Laundry, ..., "score": score})
However, this is impractical, as I have far too many columns.
I also tried dd = pd.DataFrame(values, columns=column_names) and ddd = pd.DataFrame(dd, scores) (note that the second positional argument of pd.DataFrame is the index, which is why the output below is mostly NaN).
This yields:
Out[185]:
Bars Burgers Dry Cleaning & Laundry Eyewear & Opticians Local Services \
3 0.0 0.0 0.0 0.0 0.0
5 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Restaurants Shopping
3 1.0 0.0
5 NaN NaN
5 NaN NaN
4 NaN NaN
Once again thank you very much!!

import pandas as pd
import numpy as np

column_names = [u'Bars', u'Burgers', u'Dry Cleaning & Laundry', u'Eyewear & Opticians', u'Local Services', u'Restaurants', u'Shopping']
values = np.array([[1, 1, 0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0, 0, 1],
                   [0, 0, 0, 0, 0, 1, 0]], dtype=np.int64)
score = [4, 4.5, 5, 5.5]  # must supply one entry per row; the question's five-element list doesn't match the four rows
df = pd.DataFrame(data=values, columns=column_names)
df.loc[:, 'Scores'] = pd.Series(score, index=df.index)

I think I figured it out. I can make scores another data frame, then concatenate the first data frame dd = pd.DataFrame(values, columns=column_names) with the second data frame scores:
pd.concat([dd, scores], axis=1)
This generates a new data frame.
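For completeness, a minimal sketch of the concat approach together with the regression split from the update (it reuses column_names and values from above; the score values are hypothetical and must provide one entry per row):

import numpy as np
import pandas as pd

dd = pd.DataFrame(values, columns=column_names)
scores = pd.DataFrame({'score': [4, 4.5, 5, 5.5]})  # hypothetical: one score per row
data = pd.concat([dd, scores], axis=1)

# Every column except score is a predictor; score is the regression target.
X = data.drop(columns='score')
y = data['score']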

Most efficient way to forward fill a bit array

Imagine you have a bit array (any data type is okay, e.g. a list, np.array, bitarray, bitmap, etc. of booleans) that is randomly filled. What is the fastest way to “forward fill” (left to right, or 0th index to nth index) that array in Python, such that n bits get set to 1 following each bit already set to 1?
For example, take the array below:
[01000100000]
Given n=2, the forward-filled array would be:
[01110111000]
edit
Assume that the input is a bit array of 10,000 elements, of which a random 20% are True, and n=25. This can be represented as a Python list with 10,000 boolean elements, of which 20% are True. It could also be represented as a set of 2,000 int elements between 0 and 10,000.
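For reference, a hedged sketch of constructing such a test input (the set-of-indices form is what the examples below assume for existing):

import random

N, pct = 10_000, 0.20
existing = set(random.sample(range(N), int(N * pct)))  # 2,000 random indices set to True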
edit 2
To get things started, here are some examples using the parameters above:
new = set()
new.update(*[range(i, i+25) for i in existing])
# 2.34 ms ± 56.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

new = BitMap()  # This is a pyroaring BitMap
for e in existing:
    new.add_range(e, e+25)
# 461 µs ± 6.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have addressed several datatypes below. No timings are given; you might want to time the statement setting ans, or refactor into functions, to time at the granularity that makes sense to you.
# -*- coding: utf-8 -*-
"""
Created on Sun Dec 19 09:08:56 2021
for: https://stackoverflow.com/questions/70397220/most-efficient-way-to-forward-fill-a-bit-array

#author: paddy
"""
from random import sample

n = 2              # bits to the right of set bits to also set
elements = 17
true_percent = 20.0

#%% Using arbitrary precision int
print("\nUsing arbitrary precision int.\n".upper())
from operator import or_
from functools import reduce

# Set some random bits True
bits = sum(1 << r
           for r in sample(range(elements), int(true_percent/100 * elements)))
# Set n right-adjacent bits.
ans = reduce(or_, (bits >> x for x in range(n+1)), 0)
# Print
print(f"Random bits = {bits:0{elements}b}")
if 1:
    print()
    for x in range(n+1):
        print(f"              {bits >> x:0{elements}b}")
    print()
print(f"Answer      = {ans:0{elements}b}\n")

#%% Using list.
print("\nUsing list.\n".upper())
bits = [0] * elements
# Set some random bits to 1
for r in sample(range(elements), int(true_percent/100 * elements)):
    bits[r] = 1
# Set n right-adjacent bits.
# [0]*x is padding bits on the left.
# zip(*(list1, list2, ...)) returns tuples of the n'th elements of list1, list2, ...
# int(any(...)) or's them.
ans = [int(any(shifts))
       for shifts in zip(*([0]*x + bits for x in range(n+1)))]
# Print
print(f"Random bits = {bits}")
if 1:
    print()
    for x in range(n+1):
        print(f"              {[0]*x + bits}")
    print()
print(f"Answer      = {ans}\n")

#%% Using numpy.
# Adapt the list solution to use numpy operators on numpy arrays (see the sketch below).

#%% Using other ordered collections such as str.
# Convert to and from the int solution.
Sample Output:

USING ARBITRARY PRECISION INT.

Random bits = 01000000010000010

              01000000010000010
              00100000001000001
              00010000000100000

Answer      = 01110000011100011

USING LIST.

Random bits = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

              [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
              [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
              [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

Answer      = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0]
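As a hedged sketch of the numpy adaptation mentioned above (assuming a 1-D 0/1 array and n smaller than its length):

import numpy as np

def forward_fill_bits(bits, n):
    # OR the array with x-right-shifted copies of itself, for x = 1..n.
    out = bits.astype(bool)  # astype copies, so the input is not modified
    for x in range(1, n + 1):
        out[x:] |= bits[:-x].astype(bool)
    return out.astype(bits.dtype)

bits = np.array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
print(forward_fill_bits(bits, 2))  # [0 1 1 1 0 1 1 1 0 0 0]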

Convert numpy array of size n into a scalar with sub-item index numbers as scalar values

I want to convert a numpy array
a = np.array([[1, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0, 0]])
into an array where each entry is the index number of the 1 in that row.
Desired output:
a1 = np.array([[1],[1],[3],[3],[6],[4]])
I tried this method:
a1 = []
for item in a:
    a1.append(np.where(item == 1))
and I get this output:
a1 = [(array([0]),),
      (array([0]),),
      (array([2]),),
      (array([2]),),
      (array([6]),),
      (array([4]),)]
Is there a more pythonic way to achieve it?
After reading the question more carefully, I think this one-liner should solve your issue in a more pythonic way:
np.where(a==1)[1].reshape((a.shape[0],1))
If you prefer starting indexing with 1 instead of 0 (as is evident in your desired output), then you just have to add a 1 to the line above, i.e.
np.where(a==1)[1].reshape((a.shape[0],1)) + 1
Note that there seems to be an error in your example output above for the last two elements.
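Putting it together as a minimal runnable sketch (np.argmax(a, axis=1) would work just as well here, since each row contains exactly one 1):

import numpy as np

a = np.array([[1, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0, 0]])

# Column index of the single 1 in each row, as a column vector, 1-indexed.
a1 = np.where(a == 1)[1].reshape((a.shape[0], 1)) + 1
print(a1.ravel())  # [1 1 3 3 6 4]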

How to feed a DataGenerator for a KERAS multilabel issue?

I am working on a multilabel classification problem with KERAS.
When I execute the code as shown below, I get the following error:
ValueError: Error when checking target: expected activation_19 to have 2 dimensions, but got array with shape (32, 6, 6)
This is because the labels dictionary is full of lists of "0"s and "1"s, which don't fit keras.utils.to_categorical in the return statement, as I learned recently. Softmax can't handle more than one "1" either.
I guess I first need a LabelEncoder and afterwards one-hot encoding for the labels, to avoid multiple "1"s per label, which don't go together with softmax.
I hope someone can give me a hint on how to preprocess or transform the labels data to get the code fixed. I would appreciate it a lot.
Even a code snippet would be awesome.
csv looks like this:
Filename label1 label2 label3 label4 ... ID
abc1.jpg 1 0 0 1 ... id-1
def2.jpg 0 1 0 1 ... id-2
ghi3.jpg 0 0 0 1 ... id-3
...
import numpy as np
import keras
from keras.layers import *
from keras.models import Sequential
class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, batch_size=32, dim=(224,224), n_channels=3,
                 n_classes=21, shuffle=True):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size, self.n_classes), dtype=int)
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('Folder with npy files/' + ID + '.npy')
            # Store class
            y[i] = self.labels[ID]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
-----------------------
# Parameters
params = {'dim': (224, 224),
          'batch_size': 32,
          'n_classes': 21,
          'n_channels': 3,
          'shuffle': True}
# Datasets
partition = partition
labels = labels
# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)
# Design model
model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=(224, 224, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
...
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(21))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# Train model on dataset
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator)
Since you already have the labels as vectors of 21 elements of 0s and 1s, you shouldn't use keras.utils.to_categorical in the function __data_generation(self, list_IDs_temp). Just return X and y.
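As a minimal sketch of that change at the end of __data_generation:

# y already holds multi-hot 0/1 vectors of length n_classes, so skip to_categorical:
return X, y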
OK, I have a solution, but I'm not sure it's the best one:
from sklearn import preprocessing  # for LabelEncoder
from keras.utils import to_categorical

labels_list = [x[1] for x in labels.items()]  # get the list of all sequences

def convert(seq):
    return int("".join(map(str, seq)))

label_int = [convert(i) for i in labels_list]  # convert each sequence to an int
print(label_int)  # e.g. [1, 2, 3] becomes 123

le = preprocessing.LabelEncoder()
le.fit(label_int)
labels = le.classes_  # keep only the unique ints
print(labels)

d = dict([(y, x) for x, y in enumerate(labels)])  # map each unique sequence to a label like 0, 1, 2, 3 ...
print(d)

labels_encoded = [d[i] for i in label_int]  # encode all the sequences with the labels obtained
print(labels_encoded)

labels_encoded = to_categorical(labels_encoded)  # one-hot encode
print(labels_encoded)
This is not really clean, I think, but it works.
You need to change your last Dense layer to have a number of neurons equal to the length of the labels_encoded vectors.
For the predictions, you will have the dict d that maps each predicted value back to your original sequence style.
Tell me if you need clarifications!
For a few test sequences, it gives you this:
labels = {'id-0': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
'id-1': [0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
'id-2': [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
'id-3': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
'id-4': [0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
[100100001100000001011, 10100001100000000001, 100001100010000001, 100100001100000001011, 10100001100000000001]
[100001100010000001 10100001100000000001 100100001100000001011]
{100001100010000001: 0, 10100001100000000001: 1, 100100001100000001011: 2}
[2, 1, 0, 2, 1]
[[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
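To map a prediction back to the original sequence style, a hedged sketch using the dict d from above (it assumes numpy as np, the trained model, and a hypothetical batch X_test; note that leading zeros of a sequence are lost in the int encoding):

inv = {v: k for k, v in d.items()}   # class index -> int-encoded sequence

pred = model.predict(X_test)         # X_test is a hypothetical batch
class_idx = int(np.argmax(pred[0]))  # most probable class for the first sample
print(inv[class_idx])                # e.g. 100100001100000001011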
EDIT after clarification:
OK, I read a little more about the subject; once more, the problem with softmax is that it tries to maximize one class while minimizing the others.
So I would suggest keeping your arrays of 21 ones and zeros, but instead of using softmax, use sigmoid (to predict a probability between 0 and 1 for each class) with binary_crossentropy.
Then use a threshold for your predictions:
preds = model.predict(X_test)
preds[preds>=0.5] = 1
preds[preds<0.5] = 0
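A minimal sketch of the corresponding change to the model head (assuming the same 21-label setup as above):

model.add(Dense(21))
model.add(Activation('sigmoid'))  # an independent 0..1 probability per label
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])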
Keep me posted on the results!

ValueError: setting an array element with a sequence - passing a list in a dictionary to DataGenerator

I am working on a keras multilabel problem. In order to work with a big amount of data and avoid memory issues, I implemented a custom data generator.
So far I work with a csv file with IDs, filenames and their corresponding labels (21 in total), which looks like this:
Filename label1 label2 label3 label4 ... ID
abc1.jpg 1 0 0 1 ... id-1
def2.jpg 1 0 0 1 ... id-2
ghi3.jpg 1 0 0 1 ... id-3
...
I put the ids and the labels in dictionaries, which look like this:
partition: {'train': ['id-1','id-2','id-3',...], 'validation': ['id-7','id-14','id-21',...]}
labels: {'id-0': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
'id-1': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
'id-2': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
...}
All my images are converted to arrays and saved in single npy files. id-1.npy, id-2.npy...
Then I am executing my code:
import numpy as np
import keras
from keras.layers import *
from keras.models import Sequential

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, batch_size=32, dim=(224,224), n_channels=3,
                 n_classes=21, shuffle=True):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('Folder with npy files/' + ID + '.npy')
            # Store class
            y[i] = self.labels[ID]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
# Parameters
params = {'dim': (224, 224),
          'batch_size': 32,
          'n_classes': 21,
          'n_channels': 3,
          'shuffle': True}
# Datasets
partition = partition
labels = labels
# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)
# Design model
model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=(224, 224, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
...
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(21))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# Train model on dataset
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator)
and the following error is raised:
ValueError: setting an array element with a sequence
The following part of the traceback seems to be crucial:
<ipython-input-58-fedc63607310> in __getitem__(self, index)
31
32 # Generate data
---> 33 X, y = self.__data_generation(list_IDs_temp)
34
35 return X, y
<ipython-input-58-fedc63607310> in __data_generation(self, list_IDs_temp)
53
54 # Store class
---> 55 y[i] = self.labels[ID]
56
57 return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
As soon as I replace labels from the beginning with the following, the code executes:
labels = {'id-0': 0,
'id-1': 2,
'id-2': 1,
...}
I still want to pass multiple labels to the DataGenerator, which is why I put lists in the dictionary as shown at the beginning, but this gives me the ValueError. How can I pass multiple values for a single ID to the DataGenerator, as suggested? What do I have to adjust?
A hint or a snippet of code would be much appreciated.
If I understand your code correctly, here is the problem:
y = np.empty((self.batch_size), dtype=int)
You are creating an empty 1-D array, but here:
y[i] = self.labels[ID]
you are filling it with a sequence:
'id-0': [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
For this to work, you need to create your label array with the shape of your batch_size by the length of your sequence (here the sequence length equals self.n_classes, i.e. 21):
y = np.empty((self.batch_size, self.n_classes), dtype=int)
EDIT
to_categorical is for encoding categorical features into arrays like [0, 0, 0, 1], [0, 0, 1, 0], etc. But you are feeding sequences, not categorical features.
Since you feed sequences to your network, you don't want to one-hot encode them, so replace:
return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
with:
return X, y
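Putting both fixes together, a minimal sketch of the corrected __data_generation (assuming every label vector has self.n_classes entries, 21 here):

def __data_generation(self, list_IDs_temp):
    'Generates data containing batch_size samples'
    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size, self.n_classes), dtype=int)  # 2-D now
    for i, ID in enumerate(list_IDs_temp):
        X[i,] = np.load('Folder with npy files/' + ID + '.npy')
        y[i] = self.labels[ID]  # each row holds the full multilabel vector
    return X, y  # no to_categorical: the labels are already 0/1 vectors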
Recommendation from the last comment
The problem is that your softmax activation will try to give the best score to a single correct class, but here you give it a sequence array that softmax will interpret as having multiple "correct classes":
For example: if you have 3 labels [1, 2, 3], one-hot encoding gives you [1, 0, 0], [0, 1, 0], [0, 0, 1]; there is only one "1" per encoded label array, one correct class, and softmax will try to make that class's score as big as possible.
But in your case you are giving it arrays with multiple 1s:
with something like [1, 0, 1], softmax doesn't know which class to give the best score to.
So I would recommend that you start with your 21 labels [0, 1, 2, 3, ...], then one-hot encode this array and give it to your network.
If you really need the full sequence, you have to find another solution!
Hope I'm clear!

Creating a dataframe with vector entries

I am trying to create a pandas dataframe where the entry in a single cell is a numpy array. For example, given a list of chemical compounds, A2B3C4, D1A2J3, etc., I create a numpy array for each of them so that:
firstium - A2B3C4 - [2,3,4,0,0,0,0.....]
secondium - D1A2J3 - [2,0,0,1,......3....]
I would like to create a dataframe with just two columns, 'name' and 'vec', where name is the string with the name of the compound and vec holds the array for the formula. Let's say that vec has dimension 1 x 100.
Name vec
firstium [2,3,4,0,0,0...]
secondium [2,0,0,1,.....3.]
etc.
What I have been doing so far is to create a dictionary {'name': vec} and convert it to a dataframe:
Min_dict = {}
for ....:
    ..
    Min_dict[min_name] = vec
Min_Dataframe = pd.DataFrame.from_dict(Min_dict, orient='index')
However, this gives me a dataframe with as many columns as the dimension of the array, plus one. So my dataframe has dimensions data x 101; I need it to be data x 2.
This makes it inconvenient to process the data, as I would like to treat each array as one unit of information. Does anyone know how to do what I just described?
Thanks!
IIUC:
Setup
import numpy as np
import pandas as pd

data = {
    'firstium': np.array([2, 3, 4, 0, 0, 0]),
    'secondium': np.array([2, 0, 0, 1, 0, 3])
}
Option 1
pd.Series(data).rename_axis('Name').reset_index(name='Vec')
Name Vec
0 firstium [2, 3, 4, 0, 0, 0]
1 secondium [2, 0, 0, 1, 0, 3]
Option 2
pd.DataFrame(dict(zip(('Name', 'Vec'), zip(*data.items()))))
Name Vec
0 firstium [2, 3, 4, 0, 0, 0]
1 secondium [2, 0, 0, 1, 0, 3]
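Option 2 is fairly dense; as a sketch, an equivalent and arguably clearer construction would be:

pd.DataFrame({'Name': list(data), 'Vec': list(data.values())})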
