Convert a multidimensional array to a data frame in Python

I have a data set as below:
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

data = {'StoreID': ['a', 'b', 'c', 'd'],
        'Sales': [1000, 200, 500, 800],
        'Profit': [600, 100, 300, 500]}
data = pd.DataFrame(data)
data.set_index(['StoreID'], inplace=True, drop=True)

X = data.values
dist = euclidean_distances(X)
Now I get an array as below:
array([[0. ,943,583,223],
[943, 0.,360,721],
[583,360,0., 360],
[223,721,360, 0.]])
My purpose is to get the unique combinations of stores and their corresponding distances. I would like the end result to be a data frame like the one below:
Store NextStore Dist
a b 943
a c 583
a d 223
b c 360
b d 721
c d 360
Thank you for your help!

You probably want pandas.melt, which will "unpivot" the distance matrix into tall-and-skinny format.
m = pd.DataFrame(dist)
m.columns = list('abcd')
m['Store'] = list('abcd')
...which produces:
a b c d Store
0 0.000000 943.398113 583.095189 223.606798 a
1 943.398113 0.000000 360.555128 721.110255 b
2 583.095189 360.555128 0.000000 360.555128 c
3 223.606798 721.110255 360.555128 0.000000 d
Melt data into tall-and-skinny format:
pd.melt(m, id_vars=['Store'], var_name='nextStore')
Store nextStore value
0 a a 0.000000
1 b a 943.398113
2 c a 583.095189
3 d a 223.606798
4 a b 943.398113
5 b b 0.000000
6 c b 360.555128
7 d b 721.110255
8 a c 583.095189
9 b c 360.555128
10 c c 0.000000
11 d c 360.555128
12 a d 223.606798
13 b d 721.110255
14 c d 360.555128
15 d d 0.000000
Remove redundant rows, convert dist to int, and sort:
df2 = pd.melt(m, id_vars=['Store'],
              var_name='NextStore',
              value_name='Dist')
df3 = df2[df2.Store < df2.NextStore].copy()
df3.Dist = df3.Dist.astype('int')
df3.sort_values(by=['Store', 'NextStore'])
Store NextStore Dist
4 a b 943
8 a c 583
12 a d 223
9 b c 360
13 b d 721
14 c d 360
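As an aside, if all you need are the unique pairs, another option is to index the strictly upper triangle of the distance matrix with NumPy. This is only a sketch, reusing the data and dist objects built above:
import numpy as np

# row/column indices of the strictly upper triangle: each unordered pair appears once
i, j = np.triu_indices(len(data.index), k=1)
pairs = pd.DataFrame({'Store': data.index[i],
                      'NextStore': data.index[j],
                      'Dist': dist[i, j].astype(int)})
This skips the square intermediate frame and the filtering step, which can matter when the number of stores is large.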

I have a 2D array with shape (14576, 24): the rows are the elements and the columns are the features that are going to constitute my dataframe.
data_from_2Darray = {}
for i in range(name_2Darray.shape[1]):
    data_from_2Darray["df_column_name{}".format(i)] = name_2Darray[:, i]

# visualize the result
pd.DataFrame(data=data_from_2Darray).plot()

# the converted array in a dataframe
df = pd.DataFrame(data=data_from_2Darray)
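A simpler route, assuming name_2Darray is an ordinary NumPy array as in the snippet above, is to pass it straight to the DataFrame constructor and supply the column names in one go; a minimal sketch:
import pandas as pd

# the constructor accepts a 2D array directly; columns follow the second dimension
df = pd.DataFrame(name_2Darray,
                  columns=["df_column_name{}".format(i)
                           for i in range(name_2Darray.shape[1])])
df.plot()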


awk PROCINFO["sorted_in"] Multidimensional array sorting problem

[root@rocky ~]# cat c
a b 1
a c 4
a r 6
a t 2
b a 89
b c 76
a d 45
b z 9
[root@rocky ~]# awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a b 1
a t 2
a c 4
a r 6
a d 45
b z 9
b c 76
b a 89
[root@rocky ~]# awk '{a[$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a)print i,a[i]}' c
b 1
t 2
r 6
z 9
d 45
c 76
a 89
[root@rocky ~]# awk '{a[$2]=$3}END{PROCINFO["sorted_in"]="@val_num_desc";for(i in a)print i,a[i]}' c
a 89
c 76
d 45
z 9
r 6
t 2
b 1
[root@rocky ~]# awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_desc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a d 45
a r 6
a c 4
a t 2
a b 1
b a 89
b c 76
b z 9
[root@rocky ~]# awk --version
GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
There is a problem when sorting multidimensional arrays with PROCINFO["sorted_in"]="@val_num_asc" or PROCINFO["sorted_in"]="@val_num_desc": the output is not really sorted. One-dimensional arrays sort fine. What is the problem? Is it that sorted_in does not support multidimensional arrays?
awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a b 1
a t 2
a c 4
a r 6
a d 45
b z 9
b c 76
b a 89
This is not a bug; this is how it is supposed to work. Look closely: you are using a nested loop here.
for(i in a)
This is the outer loop, which iterates over the first-dimension indices a and b in two iterations.
for(x in a[i])
This is the inner loop, which iterates over the $2 sub-indices of a["a"] on the first pass and of a["b"] on the second.
"@val_num_asc" sorts numerically in ascending order by value, which here is $3, and it is applied to each for (... in ...) traversal separately. If you look closely, the printed values 1, 2, 4, 6, 45 for $1=a are sorted by value, and so are 9, 76, 89 for $1=b.
If you want the output sorted by value across the whole file, use this workaround with a single flattened array:
awk '{a[$1 OFS $2]=$3} END {PROCINFO["sorted_in"]="@val_num_asc"; for(x in a) print x, a[x]}' c
a b 1
a t 2
a c 4
a r 6
b z 9
a d 45
b c 76
b a 89

Common Lisp: Why does lparallel have problems with assigning array elements?

I've written a function that copies many elements from one array to another. I wanted to speed it up using the (pdotimes) function from lparallel. The code looks like this:
(pdotimes (i (size output))
  (setf (row-major-aref output i)
        (row-major-aref input (dostuff i))))
The (dostuff) function does arithmetic on the row-major output index i to convert it to the row-major input index. When I run this function, the results tend to look like this:
#2A((9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 0 9 9 9 9 9 9 9 9 5 5 0 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 0 0 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5))
The function is supposed to concatenate a matrix of 9s on the left and a matrix of 5s on the right. But notice that there are a few 0s in there too. Zero is the initial value of the output matrix, so those elements never got assigned.
The non-assignment of elements is seemingly random; run the function many times and zeroes will appear in different places. For some reason, those elements are being missed.
I've tried wrapping the function in a future, like this:
(let ((f (future (pdotimes ...))))
  (force f))
But that doesn't work either. One thing I've noticed is that the larger the number of threads and the smaller the size of the array, the more elements get missed. It suggests that the array element assignments are clobbering each other somehow.
I've also tried using (pmap-into) to map the function's results into a vector that's displaced to the output, but that fails in a different way: instead of 0s showing up where elements weren't assigned, elements get assigned in the wrong places. If the array contains repeating "1 2 3 4" sub-vectors, sometimes a "1 2 2" sequence will appear, for example.
AFAIK it should be possible for threads to concurrently assign different elements in the same array, but does Common Lisp have problems with this? Do I need to implement a lock so assignments are guaranteed to happen synchronously? If simultaneous assignments were a problem, I'd expect to see more unassigned elements. Any help appreciated.
Edit: I seem to have found how to prevent this, but not the root cause. Try running this in SBCL:
(let ((output (make-array '(20 20) :initial-element 0 :element-type '(unsigned-byte 7))))
  (check-type output simple-array)
  (pdotimes (i (array-total-size output) output)
    (setf (row-major-aref output i)
          (random-elt '(1 2 3 4 5 6)))))
No zeroes will appear in the output. Now try this in SBCL:
(let ((output (make-array '(20 20) :initial-element 0 :element-type '(unsigned-byte 4))))
  (check-type output simple-array)
  (pdotimes (i (array-total-size output) output)
    (setf (row-major-aref output i)
          (random-elt '(1 2 3 4 5 6)))))
And see zeroes aplenty. I just tested this with CCL and the output was fine. I'm going to try some other CLs but it seems like this is an SBCL problem so far. For some reason, SBCL has problems doing concurrent assignments to arrays with elements smaller than 7 bits. Character arrays are fine, as are floats and t-type arrays.
This is a slightly speculative answer, but I'm reasonably sure it's correct.
If an implementation supports arrays whose element size (in bits) is smaller than the smallest object the machine can read from and write to memory, and if it stores those arrays without wasted space (which is, really, the only purpose of having them), then the only approach to updating an array element is:
1. read the smallest object containing the element from memory;
2. update that object with the new element value;
3. write the object back.
Since writes to different array elements can end up reading and writing the same smallest object in memory, this is not safe in the presence of multiple threads without interlocking, which would generally have catastrophic performance effects.
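To make the hazard concrete, here is a small Python sketch (purely illustrative; it is not lparallel or SBCL internals) of storing 4-bit elements packed two per byte. Writing one nibble forces a read-modify-write of the whole byte, so two unsynchronized writers touching the two halves of the same byte can lose each other's update:
def set_nibble(buf, index, value):
    """Store a 4-bit value at nibble position `index` of a bytearray."""
    byte_index, high = divmod(index, 2)
    old = buf[byte_index]                 # read the smallest addressable unit
    if high:
        new = (old & 0x0F) | ((value & 0x0F) << 4)   # update the high nibble
    else:
        new = (old & 0xF0) | (value & 0x0F)          # update the low nibble
    buf[byte_index] = new                 # write the whole byte back

buf = bytearray(10)            # 20 nibbles, all zero
set_nibble(buf, 3, 0x9)
set_nibble(buf, 2, 0x5)
print(buf.hex())               # -> '00950000000000000000'
If a second writer read the old byte between another writer's read and write-back, its write would silently erase the other writer's nibble, which matches the scattered zeroes seen above.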
Probably all CL implementations have such arrays, in the form of bit arrays, since modern machines can't write single bits to memory. SBCL also has arrays of element types with 2 and 4 bits, which, assuming machines can read and write no object smaller than 8 bits, are also in this area. It's also possible that arrays with very large element types could suffer from the same problem, if multiple reads and writes are required to load and store an object.
It should be possible to look at the disassembly of code that uses such arrays to see the behaviour. It's probably also the case that such arrays have lower performance than ones with larger element types (experimentally this is true for SBCL on x64: code which initialises an (unsigned-byte 4) array is 2.5 times slower than that which initialises an (unsigned-byte 8) array).
As a note, I suspect strongly the right approach to getting good performance out of array-bashing code is to partition the arrays amongst the cores in a fairly smart way.
That being said, here's a way to initialize an array of nibbles ((unsigned-byte 4)s) which I think should be safe, on the assumption that the smallest object that can be written atomically is a byte. The trick is to write pairs of even/odd addresses at once:
(defun initialize-nibble-array (a)
  ;; the idea is to put some pattern in it I can see if it has holes
  (declare (type (array (unsigned-byte 4) *) a))
  (let ((s (array-total-size a)))
    (pdotimes (i (truncate s 2))
      (let ((rmi (* i 2)))
        (setf (row-major-aref a rmi) (mod rmi 8)
              (row-major-aref a (1+ rmi)) (mod (1+ rmi) 8))))
    (when (oddp s)
      ;; if the array has an odd number of elements we've missed one
      ;; at the end
      (setf (row-major-aref a (- s 1)) (mod (- s 1) 8)))
    a))
I wrote a minimal example as follows (it uses lparallel and alexandria):
(let ((output (make-array '(20 20) :initial-element '_)))
  (check-type output simple-array)
  (pdotimes (i (array-total-size output) output)
    (setf (row-major-aref output i)
          (random-elt '(a b c d e f g h)))))
And it consistently fills the whole output grid; here is one run:
#2A((B G C D H A F E D C F D F G D F A C G G)
(C E D D F A H A F D G E G A C C F G E G)
(H C A E C F E H E D F G D B H B B A H D)
(D H G H H A E B G D E G D E G C E A B B)
(B E H G E E C D A H F A E C F D D A H H)
(C B D D G D H H D G H C A A H G B G C C)
(H H D D C F D B H B H G B C F G H F D E)
(F B C C A H D H G H C D G G D F E G A B)
(A E G C C H F C F C E F H H D E C H H D)
(H G H C D F G E D E C E A H C E A H H H)
(E C B E E C A D B G A F C B G A D G F D)
(H D D H A E A A G D H B H D A G A G C F)
(C D F H D G A D E C F C C D F A F F C H)
(H H D E C B C B E B B G G H H B A A E H)
(G F C C B F C D D D H F A B C F F C A B)
(D A H B B F H B B B F F H B G B H C F E)
(A G H C D H A H C H B F D D A G A E B G)
(G H A D H G B E A A B F C E G G G D E D)
(C E G F H F A A A H D D F B F C H B G B)
(H E H D D F F H E G G A A E D G C H H B))
But 3.6 Traversal Rules and Side Effects says that the consequences are undefined if you modify a fill pointer (impossible for non-vectors) or adjust the array (?). Your example does not look like the array is being adjusted, though.
Sorry for the question, but does it work with dotimes? Does my example work on your machine?

Python 3.x - Append data to Pandas dataframe using for loop

I have an empty pandas DataFrame:
aqi_df = pd.DataFrame(columns = ["IMEI","Date","pm10conc_24hrs","pm25conc_24hrs","sdPm10","sdPm25","aqi","windspeed","winddirection","severity","health_impact"] )
I want to add elements one by one to each column -
for i in range(1, 10):
    aqi_df.IMEI.append("a")
    aqi_df.Date.append("b")
    aqi_df.pm10conc_24hrs.append("c")
    ...
But append throws an error
TypeError: cannot concatenate a non-NDFrame object
How can I append elements to a pandas DataFrame one by one?
IIUC you can use:
aqi_df = pd.DataFrame(columns=["IMEI", "Date", "pm10conc_24hrs"])
print(aqi_df)

for i in range(1, 10):
    aqi_df.loc[i] = ['a', 'b', 'c']

print(aqi_df)
IMEI Date pm10conc_24hrs
1 a b c
2 a b c
3 a b c
4 a b c
5 a b c
6 a b c
7 a b c
8 a b c
9 a b c
But it is better to create the DataFrame from Series or dicts:
IMEI = pd.Series(['aa','bb','cc'])
Date = pd.Series(['2016-01-03','2016-01-06','2016-01-08'])
pm10conc_24hrs = pd.Series(['w','e','h'])
aqi_df = pd.DataFrame({'a':IMEI,'Date':Date,'pm10conc_24hrs':pm10conc_24hrs})
print (aqi_df)
Date a pm10conc_24hrs
0 2016-01-03 aa w
1 2016-01-06 bb e
2 2016-01-08 cc h
aqi_df = pd.DataFrame({'a': ['aa', 'bb', 'cc'],
                       'Date': ['2016-01-03', '2016-01-06', '2016-01-08'],
                       'pm10conc_24hrs': ['w', 'e', 'h']})
print (aqi_df)
Date a pm10conc_24hrs
0 2016-01-03 aa w
1 2016-01-06 bb e
2 2016-01-08 cc h
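If the rows really are produced one at a time in a loop, another common pattern is to collect them in a plain Python list and build the DataFrame once at the end, since growing a DataFrame row by row is slow. A minimal sketch with placeholder values:
import pandas as pd

rows = []
for i in range(1, 10):
    # each row is a dict keyed by column name
    rows.append({"IMEI": "a", "Date": "b", "pm10conc_24hrs": "c"})

aqi_df = pd.DataFrame(rows, columns=["IMEI", "Date", "pm10conc_24hrs"])
print(aqi_df)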

r - Constructing a row index based on arrays of starting point and length [duplicate]

I have a table as below
product=c("a","b","c")
min=c(1,5,3)
max=c(1,7,7)
dd=data.frame(product,min,max)
> dd
product min max
1 a 1 1
2 b 5 7
3 c 3 7
I want to create a table that looks like the one below, with one row for each value between and including min and max for each product:
product mm
a 1
b 5
b 6
b 7
c 3
c 4
c 5
c 6
c 7
How can I do it using R? Is there any package which would give quick results?
Try
library(data.table)
setDT(dd)[, list(mm=min:max), by = product]
# product mm
#1: a 1
#2: b 5
#3: b 6
#4: b 7
#5: c 3
#6: c 4
#7: c 5
#8: c 6
#9: c 7
Or a faster option would be seq.int(min, max, 1L), as suggested by @David Arenburg:
setDT(dd)[, list(mm = seq.int(min, max, 1L)), by = product]
Benchmarks
library(stringi)
set.seed(24)
product <- unique(stri_rand_strings(1e5,4))
min1 <- sample(1:10, length(product), replace=TRUE)
max1 <- sample(11:15, length(product), replace=TRUE)
dd <- data.frame(product, min1, max1)
dd2 <- copy(dd)
josilber <- function() {
  res1 <- data.frame(product = rep(dd$product, dd$max1 - dd$min1 + 1),
                     mm = unlist(mapply(seq, dd$min1, dd$max1)))
}

akrun <- function() {
  as.data.table(dd2)[, list(mm = seq.int(min1, max1, 1L)), by = product]
}

Ananda <- function() {
  stack(lapply(split(dd[-1], dd[1]), function(x) seq(x[[1]], x[[2]])))
}

jiber <- function() {
  res <- by(dd[, -1], dd[, 1], function(x) seq(x$min1, x$max1))
  res <- as.data.frame(unlist(res))
  data.frame(product = gsub("[0-9]", "", rownames(res)), mm = res[, 1])
}
system.time(akrun())
# user system elapsed
# 0.129 0.001 0.129
system.time(josilber())
# user system elapsed
# 0.762 0.002 0.764
system.time(Ananda())
# user system elapsed
#45.449 0.191 45.636
system.time(jiber())
# user system elapsed
# 48.013 8.218 56.291
library(microbenchmark)
microbenchmark(josilber(), akrun(), times=20L, unit='relative')
#Unit: relative
# expr min lq mean median uq max neval cld
#josilber() 6.39757 6.713236 5.570836 5.901037 5.603639 3.970663 20 b
# akrun() 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
With base R, you could do something like:
data.frame(product = rep(dd$product, dd$max - dd$min + 1),
           mm = unlist(mapply(seq, dd$min, dd$max)))
# product mm
# 1 a 1
# 2 b 5
# 3 b 6
# 4 b 7
# 5 c 3
# 6 c 4
# 7 c 5
# 8 c 6
# 9 c 7
You could also consider split + lapply + stack:
stack(lapply(split(dd[-1], dd[1]), function(x) seq(x[[1]], x[[2]])))
## values ind
## 1 1 a
## 2 5 b
## 3 6 b
## 4 7 b
## 5 3 c
## 6 4 c
## 7 5 c
## 8 6 c
## 9 7 c
Just another approach using R base functions
> res <- by(dd[,-1], dd[,1], function(x) seq(x$min, x$max) )
> res <- as.data.frame(unlist(res))
> data.frame(product=gsub("[0-9]", "", rownames(res)), mm=res[,1])
product mm
1 a 1
2 b 5
3 b 6
4 b 7
5 c 3
6 c 4
7 c 5
8 c 6
9 c 7

Get all values in one vector in Matlab

I have two arrays, A (500 x 128 integer values) and B (500 x 64 real values). I want to concatenate both to get C. The problem is that MATLAB seems to mangle all the values in B because they are small. Is there any way to keep all the values intact?
Thanks.
I think this can simulate your problem:
A = int8(randi(4,4)*10);
B = rand(4,4)*10;
C = [A B]
C =
10 20 20 30 3 0 8 3
40 10 40 40 2 6 1 2
30 20 10 30 2 1 6 6
40 20 40 30 9 9 5 5
To achieve the result you want, you have to convert the integer matrix to double before concatenating:
C = [double(A) B]
C =
Columns 1 through 7:
10.00000 20.00000 20.00000 30.00000 2.92979 0.31162 7.73694
40.00000 10.00000 40.00000 40.00000 1.71392 5.82900 1.08936
30.00000 20.00000 10.00000 30.00000 1.83903 0.84160 5.75773
40.00000 20.00000 40.00000 30.00000 8.81039 9.31400 4.60636
Column 8:
3.10192
1.75853
5.75013
5.39383
So by converting A to double, the concatenated result is double as well, and B keeps its original values instead of being cast down to A's integer type. You can check the other numeric types that exist in MATLAB in the documentation.
