Snowpark with Python: AttributeError: 'NoneType' object has no attribute 'join'

I'm trying to use Snowpark & Python to transform and prep some data ahead of using it for some ML models. I've been able to easily use session.table() to access the data and select(), col(), filter(), and alias() to pick out the data I need. I'm now trying to join data from two different DataFrame objects, but running into an error.
My code to get the data:
import pandas as pd

df1 = read_session.table("<SCHEMA_NAME>.<TABLE_NAME>").select(
    col("ID"),
    col("<col_name1>"),
    col("<col_name2>"),
    col("<col_name3>")
).filter(col("<col_name2>") == 'A1').show()

df2 = read_session.table("<SCHEMA_NAME>.<TABLE_NAME2>").select(
    col("ID"),
    col("<col_name1>"),
    col("<col_name2>"),
    col("<col_name3>")
).show()
Code to join:
df_joined = df1.join(df2, ["ID"]).show()
Error: AttributeError: 'NoneType' object has no attribute 'join'
I have also used this method (from the Snowpark Python API documentation) and get the same error:
df_joined = df1.join(df2, df1.col("ID") == df2.col("ID")).select(df1["ID"], "<col_name1>", "<col_name2>").show()
I get similar errors when trying to convert to a pandas DataFrame using pd.DataFrame and then trying to write it back to Snowflake into a new database and schema.
What am I doing wrong? Am I misunderstanding what Snowpark can do? Isn't part of the appeal that all these transformations can be done easily on the Snowpark objects rather than on a full local DataFrame? How can I get this to work?

The primary issue is that you are assigning the output of the .show() method call to the variable, not the Snowpark DataFrame itself. It is best practice to assign the Snowpark DataFrame itself to a variable, and then call .show() on that variable when you need to see the results.
Snowpark DataFrame transformations are lazily executed; calling .show() forces execution and prints the results rather than returning a reference to the underlying data and transformations. So, for example:
df1 = read_session.table("<SCHEMA_NAME>.<TABLE_NAME>").select(
    col("ID"),
    col("<col_name1>"),
    col("<col_name2>"),
    col("<col_name3>")
).filter(col("<col_name2>") == 'A1')
df1.show()

df2 = read_session.table("<SCHEMA_NAME>.<TABLE_NAME2>").select(
    col("ID"),
    col("<col_name1>"),
    col("<col_name2>"),
    col("<col_name3>")
)
df2.show()

df_joined = df1.join(df2, df1.col("ID") == df2.col("ID")).select(df1["ID"], "<col_name1>", "<col_name2>")
df_joined.show()
Otherwise, you are assigning the return value of .show(), which is None, to your variable, hence the error you are seeing: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.html#snowflake.snowpark.DataFrame.show
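The same principle applies to the follow-up point about writing results back to Snowflake: keep working with the Snowpark DataFrame and only pull data to the client when you actually need it locally. A minimal sketch, assuming a recent snowflake-snowpark-python version and placeholder names for the target table:

# df_joined is the Snowpark DataFrame built above; the table name below is a placeholder.
# save_as_table materializes the lazily-built query as a table inside Snowflake,
# so no data has to pass through pandas on the client.
df_joined.write.mode("overwrite").save_as_table("<NEW_DB>.<NEW_SCHEMA>.<NEW_TABLE>")

# Only convert to pandas if you really need the rows on the client side.
local_pdf = df_joined.to_pandas()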

Related

How to use multiprocessing to create a gzip file from a dataframe in Python

I have a process that's becoming IO-bound, where I pull a large dataset from a database into a pandas dataframe and then try to do some line-by-line processing and persist the result to a gzip file. I'm trying to find a way to use multiprocessing to split the creation of the gzip file across multiple processes and then merge them into one file, or to process in parallel without overwriting a previous thread's output. I found the package p_tqdm, but I'm running into EOF issues, probably because the threads overwrite each other. Here's a sample of my current solution:
from p_tqdm import p_map
import gzip
import pandas as pd

df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)

def process(row):
    with gzip.open("final.gz", "wb") as f:
        value = do_somthing(row)
        f.write(value.encode())

p_map(process, things)
I don't know about p_tqdm, but if I understand your question correctly, it can be done easily with multiprocessing.
Something like this:
import gzip
import multiprocessing

import pandas as pd

def process(row):
    # take care that do_somthing must return an object with an encode() method (e.g. a string)
    return do_somthing(row)

df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)

with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
    # only this parent process writes to the file, so the workers cannot overwrite each other
    for processed_row in pool.imap(process, things):
        f.write(processed_row.encode())
Just a few side notes:
The pandas iterrows method is slow; avoid it if possible (see Does pandas iterrows have performance issues?).
Also, you don't need to create things; just pass an iterable to imap (even passing df.iterrows() directly should be possible) and save yourself some memory.
And finally, since it appears you are reading SQL data, why not connect to the database directly and iterate over the cursor from the SELECT ... query, skipping pandas altogether? A sketch of that idea follows below.
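Here is a minimal sketch of that last suggestion, assuming a DB-API style connection (sqlite3 is used only as a stand-in for your actual driver) and reusing the some_sql and do_somthing placeholders from the question:

import gzip
import multiprocessing
import sqlite3  # stand-in; any DB-API driver with an iterable cursor works similarly

def process(row):
    # row arrives as a plain tuple straight from the cursor
    return do_somthing(row)

if __name__ == "__main__":
    conn = sqlite3.connect("mydata.db")  # hypothetical database file
    cursor = conn.execute(some_sql)      # the cursor yields rows one by one
    with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
        # workers transform rows in parallel; only the parent writes, so nothing is overwritten
        for value in pool.imap(process, cursor, chunksize=100):
            f.write(value.encode())
    conn.close()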

Turn Multiple Columns into a JSON Array in R?

Say I have the mtcars dataset, and I want to take three columns and turn them into a JSON array. How do I convert this to a JSON array, and is it possible to pass it into a PostgreSQL database?
library(jsonlite)
df <- mtcars
attach(mtcars)
json.column <- cbind(mpg,cyl,disp)
Do I use toJSON() ?
mtcars.json <- toJSON(json.column)
See the jsonlite quick-start vignette: https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html
An array of objects, e.g. [{"name":"Erik", "age":43}, {"name":"Anna", "age":32}], simplifies to a data frame (via the simplifyDataFrame argument).
Keep your data as a data.frame, not a matrix. Use
json.column <- data.frame(mpg,cyl,disp)
toJSON(json.column)
# [{"mpg":21,"cyl":6,"disp":160},{"mpg":21,"cyl":6,"disp":160}, ...
Also, you should avoid the use of attach(). It can cause lots of problems if you forget detach(). Plus, you can often use with() to avoid it:
json.column <- with(mtcars, data.frame(mpg,cyl,disp))
(For starters, never use attach! It's dangerous! Use with instead, typically.)
There are a bunch of ways to do it. Here's how to create the values using dplyr:
library(dplyr)
qq <- rowwise(mtcars) %>%
  mutate(newcol = as.character(jsonlite::toJSON(list(mpg = mpg, cyl = cyl, disp = disp))))
> qq$newcol
[1] "{\"mpg\":[21],\"cyl\":[6],\"disp\":[160]}" "{\"mpg\":[21],\"cyl\":[6],\"disp\":[160]}"
[3] "{\"mpg\":[22.8],\"cyl\":[4],\"disp\":[108]}" "{\"mpg\":[21.4],\"cyl\":[6],\"disp\":[258]}"
...
From there, if your Postgres database is set up with newcol as a JSON type, I think just writing that table as usual should work.

How to indicate the database in SparkSQL over Hive in Spark 1.3

I have some simple Scala code that retrieves data from a Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark that I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, which is slow. Unfortunately, I can't force it to use "spark".
I tried to use SQLContext instead, replacing the HiveContext with hc = new SQLContext(sc), to see if performance would improve. With this change, the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database is supported in later Spark versions:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the statement in two separate spark.sql calls like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The Hive context gives you the ability to create a dataframe using Hive's metastore. Spark only uses the metastore from Hive; it doesn't use Hive as a processing engine to retrieve the data. So when you create the df using your SQL query, it's really just asking Hive's metastore "Where is the data, and what's the format of the data?"
Spark takes that information and runs the processing against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext, you remove the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
I have not been able to implement the use database command, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

SparkR - extracting dataframe's array<int> for an R function

I have thousands of sensors. I need to partition the data (i.e. per sensor, per day), then submit each list of data points to an R algorithm. Using Spark, a simplified sample looks like:
// Spark
val rddData = List(
  ("1:3", List(1,1,456,1,1,2,480,0,1,3,425,0)),
  ("1:4", List(1,4,437,1,1,5,490,0)),
  ("1:6", List(1,6,500,0,1,7,515,1,1,8,517,0,1,9,522,0,1,10,525,0)),
  ("1:11", List(1,11,610,1))
)
case class DataPoint(
  key: String,
  value: List[Int]) // 4-value pattern: sensorID:seq#, seq#, value, state
I convert this to a Parquet file and save it.
Loading the Parquet file in SparkR is no problem; the schema says:
#SparkR
df <- read.df(sqlContext, filespec, "parquet")
schema(df)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value", type = "ArrayType(IntegerType,true)", nullable = TRUE
So in SparkR, I have a dataframe where each record has all of the data I want (df$value). I want to extract that array into something R can consume, then mutate my original dataframe (df) with a new column holding the resultant array. Logically, something like results = function(df$value). Then I need to get results (for all rows) back into a SparkR dataframe for output.
How do I extract an array from the SparkR dataframe and then mutate the dataframe with the results?
Let the Spark data frame be df and the R data frame be df_r.
To convert the SparkR df to an R df, use:
df_r <- collect(df)
With the R data frame df_r, you can do all the computations you want in R.
Let's say you have the result in the column df_r$result.
Then, to convert back to a SparkR data frame, use:
#this is a new SparkR data frame, df_1
df_1 <- createDataFrame(sqlContext, df_r)
To add the result back to the SparkR data frame `df`, use:
#this adds the df_1$result to a new column df$result
#note that number of rows should be same in df and `df_1`, if not use `join` operation
df$result <- df_1$result
Hope this solves your problem
I had this problem too. The way I got around it was by adding a row index to the Spark DataFrame and then using explode inside a select statement. Make sure to select the index and then the row you want in your select statement. That will get you a "long" dataframe. If each of the nested lists in the DataFrame column has the same amount of information in it (for example, if you are exploding a list-column of x,y coordinates), you would expect each row index in the long DataFrame to occur twice.
After doing the above, I typically do a groupBy(index) on the exploded DataFrame, filter where the n() of each index is not equal to the expected number of items in the list, and proceed with additional groupBy, merge, join, filter, etc. operations on the Spark DataFrame.
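In case a concrete example helps, here is a minimal sketch of that index-and-explode step written in PySpark syntax (assumptions on my part: Spark 2.x, and filespec being the same Parquet path as in the question; SparkR exposes analogous functions such as explode and monotonically_increasing_id):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet(filespec)  # columns: key (string), value (array<int>)

# add a row index, then explode the array column so each element gets its own row
df_indexed = df.withColumn("idx", F.monotonically_increasing_id())
df_long = df_indexed.select("idx", "key", F.explode("value").alias("value"))

# sanity check: count elements per original row so unexpected sizes can be filtered out
counts = df_long.groupBy("idx").count()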
There are some excellent guides on the Urban Institute's GitHub page. Good luck. -nate

Storing R Objects in a relational database

I frequently create nonparametric statistics (loess, kernel densities, etc) on data I pull out of a relational database. To make data management easier I would like to store R output back inside my DB. This is easy with simple data frames of numbers or text, but I have not figured out how to store R objects back in my relational database. So is there a way to store a vector of kernel densities, for example, back into a relational database?
Right now I work around this by saving the R objects to a network drive space so others can load the objects as needed.
Use the serialization feature to turn any R object into a (raw or character) string, then store that string. See help(serialize).
Reverse this for retrieval: get the string, then unserialize() it into an R object.
An example R variable that's fairly complex:
library(nlme)
model <- lme(uptake ~ conc + Treatment, CO2, random = ~ 1 | Plant / Type)
The best storage database method for R variables depends upon how you want to use it.
I need to do in-database analytics on the values
In this case, you need to break the object down into values that the database can handle natively. This usually means converting it into one or more data frames. The easiest way to do this is to use the broom package.
library(broom)
coefficients_etc <- tidy(model)
model_level_stats <- glance(model)
row_level_stats <- augment(model)
I just want storage
In this case you want to serialize your R variables; that is, convert them to a string or a binary blob. There are several methods for this.
My data has to be accessible by programs other than R, and needs to be human-readable
You should store your data in a cross-platform text format; probably JSON or YAML. JSON doesn't support some important concepts like Inf; YAML is more general but the support in R isn't as mature. XML is also possible, but is too verbose to be useful for storing large arrays.
library(RJSONIO)
model_as_json <- toJSON(model)
nchar(model_as_json) # 17916
library(yaml)
# yaml package doesn't yet support conversion of language objects,
# so preprocessing is needed
model2 <- within(
  model,
  {
    call <- as.character(call)
    terms <- as.character(terms)
  }
)
model_as_yaml <- as.yaml(model2)
nchar(model_as_yaml) # 14493
My data has to be accessible by programs other than R, and doesn't need to be human-readable
You could write your data to an open, cross-platform binary format like HDF5. Currently, support for HDF5 files (via rhdf5) is limited, so complex objects are not supported. (You'll probably need to unclass everything.)
library(rhdf5)
h5save(rapply(model2, unclass, how = "replace"), file = "model.h5")
bin_h5 <- readBin("model.h5", "raw", 1e6)
length(bin_h5) # 88291 not very efficient in this case
The feather package lets you save data frames in a format readable by both R and Python. To use this, you would first have to convert the model object into data frames, as described in the broom section earlier in the answer.
library(feather)
library(broom)
write_feather(augment(model), "co2_row.feather") # 5474 bytes
write_feather(tidy(model), "co2_coeff.feather") # 2093 bytes
write_feather(glance(model), "co2_model.feather") # 562 bytes
Another alternative is to save a text version of the variable (see previous section) to a zipped file and store its bytes in the database.
writeLines(model_as_json, "model.txt")
tar("model.tar.bz", "model.txt", compression = "bzip2")
bin_bzip <- readBin("model.tar.bz", "raw", 1e6)
length(bin_bzip) # only 42 bytes!
My data only needs to be accessible by R, and needs to be human-readable
There are two options for turning a variable into a string: serialize and deparse.
p <- function(x)
{
  paste0(x, collapse = "\n")
}
serialize needs to be sent to a text connection; rather than writing to a file, you can write to the console and capture the output.
model_serialized <- p(capture.output(serialize(model, stdout())))
nchar(model_serialized) # 23830
Use deparse with control = "all" to maximise the reversibility when re-parsing later.
model_deparsed <- p(deparse(model, control = "all"))
nchar(model_deparsed) # 22036
My data only needs to be accessible by R, and doesn't need to be human-readable
The same sorts of techniques shown in the previous sections can be applied here. You can zip a serialized or deparsed variable and re-read it as a raw vector.
serialize can also write variables in a binary format. In this case, it is most easily used with its wrapper saveRDS.
saveRDS(model, "model.rds")
bin_rds <- readBin("model.rds", "raw", 1e6)
length(bin_rds) # 6350
For SQLite (and possibly others):
CREATE TABLE data (blob BLOB);
Now in R:
RSQLite::dbGetQuery(db.conn, 'INSERT INTO data VALUES (:blob)',
  params = list(blob = list(serialize(some_object, NULL))))
Note the list wrapper around some_object. The output of serialize is a raw vector. Without list, the INSERT statement would be executed for each vector element. Wrapping it in a list allows RSQLite::dbGetQuery to see it as one element.
To get the object back from the database:
some_object <- unserialize(RSQLite::dbGetQuery(db.conn, 'SELECT blob FROM data LIMIT 1')$blob[[1]])
What happens here is that you take the field blob (which is a list, since RSQLite doesn't know how many rows the query will return). Since LIMIT 1 ensures only one row is returned, we take it with [[1]], which gives the original raw vector. Then you need to unserialize the raw vector to get your object back.
Using textConnection / saveRDS / readRDS is perhaps the most versatile and high-level approach:
zz<-textConnection('tempConnection', 'wb')
saveRDS(myData, zz, ascii = T)
TEXT<-paste(textConnectionValue(zz), collapse='\n')
#write TEXT into SQL
...
closeAllConnections() #if the connection persists, new data will be appended
#reading back:
#1. pull from SQL into queryResult
...
#2. recover the object
recoveredData <- readRDS(textConnection(queryResult$TEXT))
Here are the steps to store your model in a Postgres table, then query it and load it back (working as of 27 Feb 2020). An important part is ascii = TRUE; without it, serializing would produce errors.
db <- pgsql_connect #connection to your database
serialized_model <- rawToChar(serialize(model_fit, NULL, ascii=TRUE))
insert_query <-'INSERT INTO table (model) VALUES ($1)'
rs <- dbSendQuery(db, insert_query, list(serialized_model))
dbClearResult(rs)
serialized_model <- dbGetQuery(db, "select model from table order by created_at desc limit 1")
model_fit2 <- unserialize(charToRaw(as.character(serialized_model[,c('model')])))
model_fit2
