pandas values change with the numpy array, but their memory locations are different

I created an array from a dataframe. When I changed a value in the array, the dataframe also changed, which suggests both use the same memory address, but when I use id() to check, the ids are different.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'column1': [11, 22, 33],
    'column2': [44, 55, 66]
})

col1_arr = df['column1'].to_numpy()
col1_arr[0] = 100

col1_arr
array([100, 22, 33], dtype=int64)

df
   column1  column2
0      100       44
1       22       55
2       33       66
When I changed the value in the array, the dataframe also changed to 100. If their values stay in sync, they should be using the same memory address, but the output below shows that their addresses are different.
for i in df['column1']:
    print(i)
    print(hex(id(i)))

# 100
# 0x21c795a0d50
# 22
# 0x21c795a0390
# 33
# 0x21c795a04f0

for i in col1_arr:
    print(i)
    print(hex(id(i)))

# 100
# 0x21c00e36c70
# 22
# 0x21c00e36d10
# 33
# 0x21c00e36c70
Another strange thing is that the id printed for col1_arr[0] is equal to the one printed for col1_arr[2].

One column of the frame is a Series:
In [675]: S = df['column1']
In [676]: type(S)
Out[676]: pandas.core.series.Series
While the storage details of a DataFrame, or Series, may vary, here it looks like the values are stored in a numpy array:
In [677]: S._values
Out[677]: array([11, 22, 33], dtype=int64)
In [678]: id(S._values)
Out[678]: 2737259230384
And that array is exactly the same one that you get with to_numpy():
In [679]: id(col1_arr)
Out[679]: 2737259230384
So when you change an element of col1_arr, you see that change in S, and df.
Data in an array is not stored by reference (unlike a list):
col1_arr[0] creates a new numpy.int64 object with the same value, but it is not in any way the "address" of the stored value. Note the different id values below:
In [683]: id(col1_arr[0])
Out[683]: 2737348615568
In [684]: id(col1_arr[0])
Out[684]: 2737348615280
To find "where" values are stored you have to look at something like:
In [686]: col1_arr.__array_interface__
Out[686]:
{'data': (2737080083824, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3,),
'version': 3}
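A more direct way to confirm the sharing is to compare the underlying buffers rather than Python object ids. A small sketch, reusing df and col1_arr from above (np.shares_memory and __array_interface__ are the relevant checks; nothing here is pandas-specific):

import numpy as np

# id() compares the Python wrapper objects; these checks compare the actual buffer.
print(np.shares_memory(col1_arr, df['column1']._values))         # True here
print(col1_arr.__array_interface__['data'][0] ==
      df['column1']._values.__array_interface__['data'][0])      # True here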

Related

pyspark: From an array of structs, extract a scalar from the struct for even or odd index, then postprocess the array

I have a dataframe row that contains an ArrayType column named moves. moves is an array of StructType with a few fields in it. I can use a dotpath to fetch just one field from that struct and make a new array of just that field e.g. .select(df.moves.other) will create a same-length array as moves but only with the values of the other field. This is the result:
[null, [{null, null, [0:10:00]}], null, null, [{null, null, [0:10:00]}], [{null, null, [0:09:57]}], [{null, null, [0:09:56]}], [{null, null, [0:09:54]}], ...
So clearly other is not simple. Each element in the array is either null (idx 0, 2, and 3 above) if other is not in the struct (which is permitted), or an array of structs where each struct contains a field clk which is itself an array (note that plain Spark output does not list the field names, just the values; the nulls in the struct are unset fields). This is a two-player alternating move sequence; we need to do two things:
Extract the even idx elements and the odd idx elements.
From each, "simplify" the array where entries are either null or the value of the zeroeth entry in the clk field.
This is the target:
even list: [null, null, "0:10:00", "0:09:56", ...
odd list: ["0:10:00", null, "0:09:57", ...
Lastly, we wish to walk these arrays (individually) and compute the delta time (n+1 minus n) iff both n+1 and n are not null.
This is fairly straightforward in "regular" Python using slicing, e.g. [::2] for evens and [1::2] for odds, plus map and list comprehensions, etc. But I cannot seem to assemble the right functions in pyspark to create the simplified arrays (forget about converting 0:10:00 to something for the moment). For example, unlike regular Python, pyspark's slice does not accept a step argument, and pyspark needs more conditional logic around nulls. transform is promising but I cannot get it to skip entries to arrive at a shorter list.
I tried going the other direction with a UDF. To start, my UDF returned the array that was passed to it:
moves_udf = F.udf(lambda z: z, ArrayType(StructType()))
df.select( moves_udf(df.moves.other) )
But this yielded a grim exception, possibly because the other array contains nulls:
py4j.protocol.Py4JJavaError: An error occurred while calling o55.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (192.168.0.5 executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773)
...
I know the UDF machinery works for simple scalars. I tested a toUpper() function on a different column and the UDF worked fine.
Almost all of the other move data is much more "SPARK friendly". It is the other field and the array-of-array substructure that is vexing.
Any guidance most appreciated.
P.S. All my pyspark logic is pipelined functions; no SQL. I would greatly prefer not to mix and match.
The trick is to use transform (together with filter) as a general loop, exploiting the two-argument form of the lambda that also passes the current index into the array. Here is a solution:
# Translation:
# select 1: Get only even moves and call the output a "temporary" column 'X'
# select 2: X will look like this [ [{null,null,["0:03:00"]},null,{null,null,["0:02:57"]},...]
#           This is because dfg.moves.a is an array in the moves
#           array. In this example, we do not further filter on the
#           entries in the 'a' array; we know we want a[0] (the first one).
#           We just want ["0:03:00",null,"0:02:57",...]
#           x.clk will get the clk field but the value there is *also*
#           an array so we must use subscript [0] *twice* to dig thru
#           to the string we seek
# select 3: Convert all the "h:mm:ss" strings into timestamps
# select 4: Walk the deltas and return the diff to the next neighbor.
#           The last entry is always null.
dfx = dfg\
    .select( F.filter(dfg.moves.a, lambda x, i: i % 2 == 0).alias('X'))\
    .select( F.transform(F.col('X'), lambda x: x.clk[0][0]).alias('X'))\
    .select( F.transform(F.col('X'), lambda x: F.to_timestamp(x, "H:m:s").cast("long")).alias('X'))\
    .select( F.transform(F.col('X'), lambda x, i: x - F.col('X')[i+1]).alias('delta'))

dfx.show(truncate=False)
dfx.printSchema()
+---------------------------------------------------------------+
|delta |
+---------------------------------------------------------------+
|[2, 1, 2, 1, 4, 9, 16, 0, 6, 3, 8, 5, 2, 12, 4, 4, 10, 0, null]|
+---------------------------------------------------------------+
root
|-- delta: array (nullable = true)
| |-- element: long (containsNull = true)
If you want to make it more compact, you can combine the steps:
dfx = dfg\
    .select( F.transform(F.filter(dfg.moves.a, lambda x, i: i % 2 == 0),
                         lambda x: F.to_timestamp(x.clk[0][0], "H:m:s").cast("long")).alias('X') )\
    .select( F.transform(F.col('X'), lambda x, i: x - F.col('X')[i+1]).alias('delta'))
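For the odd-index moves, the same pipeline should work with the parity test flipped; a sketch under the same assumed schema (dfg.moves.a with a nested clk array):

dfx_odd = dfg\
    .select( F.transform(F.filter(dfg.moves.a, lambda x, i: i % 2 == 1),
                         lambda x: F.to_timestamp(x.clk[0][0], "H:m:s").cast("long")).alias('X') )\
    .select( F.transform(F.col('X'), lambda x, i: x - F.col('X')[i+1]).alias('delta'))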

How to convert PySpark dataframe columns into list of dictionary based on groupBy column

I'm converting dataframe columns into a list of dictionaries.
The input dataframe has 3 columns:

ID  accounts  pdct_code
1   100       IN
1   200       CC
2   300       DD
2   400       ZZ
3   500       AA
I need to read this input dataframe and convert it into 3 output rows. The output should look like this:
ID  arrayDict
1   [{"accounts": 100, "pdct_cd": "IN"}, {"accounts": 200, "pdct_cd": "CC"}]

Similarly, for ID 2 there should be one row with two dictionaries of key/value pairs.
I tried this:
Df1 = df.groupBy("ID").agg(collect_list(struct(col("accounts"), ("pdct_cd"))).alias("array_dict"))
But the output is not quite what I wanted, which is a list of dictionaries.
What you described (a list of dictionaries) doesn't exist in Spark. Instead of lists we have arrays, and instead of dictionaries we have structs or maps. Since you didn't use these terms, this will be a loose interpretation of what I think you need.
The following will create arrays of strings. Those strings will have the structure which you probably want.
df.groupBy("ID").agg(F.collect_list(F.to_json(F.struct("accounts", "pdct_code")))
struct() puts your column inside a struct data type.
to_json() creates a JSON string out of the provided struct.
collect_list() is an aggregation function which moves all the strings of the group into an array.
Full example:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 100, "IN"),
     (1, 200, "CC"),
     (2, 300, "DD"),
     (2, 400, "ZZ"),
     (3, 500, "AA")],
    ["ID", "accounts", "pdct_code"])

df = df.groupBy("ID").agg(F.collect_list(F.to_json(F.struct("accounts", "pdct_code"))).alias("array_dict"))

df.show(truncate=0)
# +---+----------------------------------------------------------------------+
# |ID |array_dict |
# +---+----------------------------------------------------------------------+
# |1 |[{"accounts":100,"pdct_code":"IN"}, {"accounts":200,"pdct_code":"CC"}]|
# |3 |[{"accounts":500,"pdct_code":"AA"}] |
# |2 |[{"accounts":300,"pdct_code":"DD"}, {"accounts":400,"pdct_code":"ZZ"}]|
# +---+----------------------------------------------------------------------+
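If you would rather end up with an array of real key/value pairs instead of JSON strings, one alternative is an array of maps. A sketch applied to the ungrouped input df from above; note that map values must share a single type, so accounts is cast to string here:

df2 = df.groupBy("ID").agg(
    F.collect_list(
        F.create_map(
            F.lit("accounts"), F.col("accounts").cast("string"),
            F.lit("pdct_code"), F.col("pdct_code"))
    ).alias("array_dict"))
df2.show(truncate=0)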

Get output of numpy loadtxt as a single array rather than multiple arrays

I have a CSV file:
8.84,17.22,13.22,3.84
3.99,11.73,19.66,1.27
import numpy as np

def jo(x):
    data = np.loadtxt(x, delimiter=',')
    return data

print(jo('data.csv'))
The code returns:
[[ 8.84 17.22 13.22  3.84]
 [ 3.99 11.73 19.66  1.27]]
But I want all these elements in a single array, because I want to find their mean and median.
How do I combine these 2 arrays into 1?
Use numpy.reshape:
# data is your array
>>> data.reshape(-1)
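With the two-row array from the question, that looks roughly like this (a sketch; reshape(-1) gives a flattened view when possible):

>>> data.reshape(-1)
array([ 8.84, 17.22, 13.22,  3.84,  3.99, 11.73, 19.66,  1.27])
>>> data.reshape(-1).mean()
9.971250000000001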
In [245]: txt="""8.84,17.22,13.22,3.84
...: 3.99,11.73,19.66,1.27"""
In [246]: data = np.loadtxt(txt.splitlines(), delimiter=',')
In [247]: data
Out[247]:
array([[ 8.84, 17.22, 13.22, 3.84],
[ 3.99, 11.73, 19.66, 1.27]])
In [248]: data.shape
Out[248]: (2, 4)
That is one array, just 2d.
There are various ways of turning that into a 1d array:
In [259]: arr = data.ravel()
In [260]: arr
Out[260]: array([ 8.84, 17.22, 13.22, 3.84, 3.99, 11.73, 19.66, 1.27])
But there's no need to do that. mean (and median) without an axis parameter act on the raveled array. Check the docs:
In [261]: np.mean(data)
Out[261]: 9.971250000000001
In [262]: np.mean(arr)
Out[262]: 9.971250000000001
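And if per-column or per-row statistics are the real goal, the axis argument handles that without any reshaping. A short sketch with the same data (values in the comments are approximate):

np.median(data)          # ~10.285, computed over all 8 values
np.mean(data, axis=0)    # column means, roughly [6.415, 14.475, 16.44, 2.555]
np.mean(data, axis=1)    # row means, roughly [10.78, 9.1625]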

Is it possible to save, access and search an array or list containing elements of different lengths in Python?

Is it possible to save and access an array or list containing elements of different lengths? For instance, I want to save data = [s, r, a, se], where r and a are scalars but s and se are arrays with 4 elements (in Python).
For instance, at one time step (the values of s, r, a, se differ at different times):
s=[1,3,4,6] r=5 a=7 se=[11,12,13,14]
data=[s,r,a,se]=[[1,3,4,6],5,7,[11,12,13,14]]
How can I define the container holding them so that I can access them like in the following code:
s=[data[0] for data in minibatch]
r=[data[1] for data in minibatch]
a=[data[2] for data in minibatch]
se=[data[3] for data in minibatch]
Also, how can I check (find) whether a particular [stest, rtest, atest, setest] is already in data (stest and setest have 4 elements)?
For instance: I want to see whether I can find [[1,2,3,4],5,6,[7,8,9,10]] in data, where data is something like: [ [[1,2,3,4],5,6,[7,8,9,10]], [[...,...,...,...],...,..., [...,...,...,...]], [[18,20,31,42],53,666,[27,48,91,120]] ]
If it is not found, I append it; otherwise nothing happens.
You can put them in a new list:
new_list = [s, r, a, se]
But you'll have to be careful managing this list.
# This is a great opportunity to explore dictionaries.
# Let's take the examples you have given and put them in variables
s = [1, 3, 4, 6]
r = 5
a = 7
se = [11, 12, 13, 14]

# Make a dictionary out of them, with keys which are
# the same as the variable names
my_dict = {'s': s,
           'r': r,
           'a': a,
           'se': se}

# Let's see what that looks like
print(my_dict)
print()

# To retrieve the 2nd (index 1) element of s,
# the format is simple:
# dictionary_variable_name['the string which is the key'][index_of_the_list]
s2 = my_dict['s'][1]
print(s2)
print()

# To retrieve all of se
se_retrieved = my_dict['se']
print(se_retrieved)
# You can further experiment with this
Output of the sample:
{'s': [1, 3, 4, 6], 'r': 5, 'a': 7, 'se': [11, 12, 13, 14]}
3
[11, 12, 13, 14]
In order to write to this, you need to do something like this:
my_dict['se'] = [15, 16, 17, 18]
or
my_dict['se'][2] = 19
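The search-and-append part of the question is not covered by the answers above. Since Python list equality compares nested elements by value, a plain membership test with in already does the search; a minimal sketch with made-up data:

minibatch = [
    [[1, 3, 4, 6], 5, 7, [11, 12, 13, 14]],
    [[18, 20, 31, 42], 53, 666, [27, 48, 91, 120]],
]

candidate = [[1, 2, 3, 4], 5, 6, [7, 8, 9, 10]]

# "in" uses == element by element, so nested lists are compared by value
if candidate not in minibatch:
    minibatch.append(candidate)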

How to write binary data into SQLite with R DBI's dbWriteTable()?

For instance, how do I execute the equivalent of the following SQL (which inserts into a BINARY(16) field)
INSERT INTO Table1 (MD5) VALUES (X'6717f2823d3202449201145073ab871A'),(X'6717f2823d3202449301145073ab371A')
using dbWriteTable()? Doing
dbWriteTable(db, "Table1", data.frame(MD5 = "X'6717f2823d3202449201145073ab871A'", ...), append = T, row.names = F)
doesn't seem to work - it writes the values as text.
In the end, I'm going to have a big data.frame of hashes that I want to write, so it's perfect for dbWriteTable. But I just can't figure out how to INSERT the data.frame into binary database fields.
So here are two possibilities that seem to work. The first uses dbSendQuery(...) in a loop (you've probably thought of this already...).
db.WriteTable = function(con, table, df) {   # no error checking whatsoever...
  require(DBI)
  field <- colnames(df)[1]
  for (i in 1:nrow(df)) {
    query <- sprintf("INSERT INTO %s (%s) VALUES (X'%s')", table, field, df[i, 1])
    rs <- dbSendQuery(con, statement = query)
  }
  return(nrow(df))
}
library(DBI)
drv <- dbDriver("SQLite")
con <- dbConnect(drv)
rs <- dbSendQuery(con, statement="CREATE TABLE hash (MD5 BLOB)")
df <- data.frame(MD5=c("6717f2823d3202449201145073ab871A",
                       "6717f2823d3202449301145073ab371A"))
rs <- db.WriteTable(con,"hash",df)
result.1 <- dbReadTable(con,"hash")
result.1
# MD5
# 1 67, 17, f2, 82, 3d, 32, 02, 44, 92, 01, 14, 50, 73, ab, 87, 1a
# 2 67, 17, f2, 82, 3d, 32, 02, 44, 93, 01, 14, 50, 73, ab, 37, 1a
If your data frame of hashes is very large, then db.WriteFast(...) does the same thing as db.WriteTable(...), but it should be faster.
db.WriteFast = function(con, table, df) {
  require(DBI)
  field <- colnames(df)[1]
  lapply(unlist(df[, 1]), function(x) {
    dbSendQuery(con,
                statement = sprintf("INSERT INTO %s (%s) VALUES (X'%s')",
                                    table, field, x))})
}
Note that result.1 is a data frame, and if we use it in a call to dbWriteTable(...) we can successfully write the hashes to a BLOB. So it is possible.
str(result.1)
# 'data.frame': 2 obs. of 1 variable:
# $ MD5:List of 2
# ..$ : raw 67 17 f2 82 ...
# ..$ : raw 67 17 f2 82 ...
The second approach takes advantage of R's raw data type to create a data frame structured like result.1, and passes that to dbWriteTable(...). You'd think this would be easy, but no.
h2r = function(x) {
  bytes <- substring(x, seq(1, nchar(x)-1, 2), seq(2, nchar(x), 2))
  return(list(as.raw(as.hexmode(bytes))))
}
hash2raw = Vectorize(h2r)
df.raw=data.frame(MD5=list(1:nrow(df)))
colnames(df.raw)="MD5"
df.raw$MD5 = unname(hash2raw(as.character(df$MD5)))
dbWriteTable(con, "newHash",df.raw)
result.2 <- dbReadTable(con,"newHash")
result.2
all.equal(result.1$MD5,result.2$MD5)
# [1] TRUE
In this approach, we create a data frame df.raw which has one column, MD5, wherein each element is a list of raw bytes. The utility function h2r(...) takes a character representation of the hash, breaks it into a vector of char(2) (the bytes), then interprets each of those as hex (as.hexmode(...)), converts the result to raw (as.raw(...)), and finally returns the result as a list. Vectorize(...) is a wrapper that allows hash2raw(...) to take a vector as its argument.
Personally, I think you're better off using the first approach: it takes advantage of SQLite's internal mechanism for writing hex to BLOBs, and it's much easier to understand.
