how to convert binary file to pandas dataframe - amazon-sagemaker

I use Amazon SageMaker for model training and prediction, and I am having trouble with the data returned by the prediction call. I am trying to convert the prediction results to a pandas DataFrame.
After the model is deployed:
from sagemaker.serializers import CSVSerializer

xgb_predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',
    serializer=CSVSerializer()
)
I made a prediction on the test data:
predictions=xgb_predictor.predict(first_day.to_numpy())
The returned prediction results are a bytes object:
predictions
b'2.092024326324463\n10.584211349487305\n18.23127555847168\n2.092024326324463\n8.308058738708496\n32.35516357421875\n4.129155158996582\n7.429899215698242\n55.65376281738281\n116.5504379272461\n1.0734045505523682\n5.29403018951416\n1.0924320220947266\n1.9484598636627197\n5.29403018951416\n2.190509080886841\n2.085641860961914\n2.092024326324463\n7.674410343170166\n2.1198673248291016\n5.293967247009277\n7.088096618652344\n2.092024326324463\n10.410735130310059\n10.36008358001709\n2.092024326324463\n10.565692901611328\n15.495997428894043\n15.61841106414795\n1.0533703565597534\n6.262670993804932\n31.02411460876465\n10.43086051940918\n3.116995096206665\n3.2846100330352783\n108.82835388183594\n26.210166931152344\n1.0658172369003296\n10.55643367767334\n6.245237350463867\n15.951444625854492\n10.195240020751953\n1.0734045505523682\n48.720497131347656\n2.119992256164551\n9.41071605682373\n2.241959810256958\n3.1907501220703125\n10.415051460266113\n1.2154537439346313\n2.13691782951355\n31.1861515045166\n3.0827555656433105\n6.261478424072266\n5.279026985168457\n15.897627830505371\n20.483125686645508\n20.874958038330078\n53.2086296081543\n10.731611251831055\n2.115110397338867\n13.79739761352539\n2.1198673248291016\n26.628803253173828\n10.030998229980469\n15.897627830505371\n5.278475284576416\n45.371158599853516\n2.2791690826416016\n15.58777141571045\n15.947166442871094\n30.88138771057129\n10.388553619384766\n48.22294235229492\n10.565692901611328\n20.808977127075195\n10.388553619384766\n15.910200119018555\n8.252408981323242\n1.109586238861084\n15.58777141571045\n13.718815803527832\n3.1227424144744873\n32.171592712402344\n10.524396896362305\n15.897627830505371\n2.092024326324463\n14.52088737487793\n5.293967247009277\n57.61208724975586\n21.161712646484375\n14.173937797546387\n5.230247974395752\n16.257652282714844
How can I convert the prediction data to a pandas DataFrame?

You mean something like this? Decode the bytes, split on the newlines, and convert each value to a float:
import pandas as pd

# decode the bytes object and split into one value per line
values = [float(v) for v in predictions.decode("utf-8").strip().split("\n")]
df = pd.DataFrame(data=values, columns=["prediction"])
df.head()
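Alternatively, since the payload is just newline-separated numbers, you can let pandas parse the bytes directly. A minimal sketch (the column name is only illustrative):
import io
import pandas as pd

# treat the bytes as a one-column CSV with no header
df = pd.read_csv(io.BytesIO(predictions), header=None, names=["prediction"])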

Related

Snowpark with Python: AttributeError: 'NoneType' object has no attribute 'join'

I'm trying to use Snowpark & Python to transform and prep some data ahead of using it for some ML models. I've been able to easily use session.table() to access the data and select(), col(), filter(), and alias() to pick out the data I need. I'm now trying to join data from two different DataFrame objects, but running into an error.
My code to get the data:
import pandas as pd
df1 = read_session.table("<SCHEMA_NAME>.<TABLE_NAME>").select(
    col("ID"),
    col("<col_name1>"),
    col("<col_name2>"),
    col("<col_name3>")
).filter(col("<col_name2>") == 'A1').show()
df2 = read_session.table("<SCHEMA_NAME>.<TABLE_NAME2>").select(
    col("ID"),
    col("<col_name1>"),
    col("<col_name2>"),
    col("<col_name3>")
).show()
Code to join:
df_joined = df1.join(df2, ["ID"]).show()
Error: AttributeError: 'NoneType' object has no attribute 'join'
I have also used this method (from the Snowpark Python API documentation) and get the same error:
df_joined = df1.join(df2, df1.col("ID") == df2.col("ID")).select(df1["ID"], "<col_name1>", "<col_name2>").show()
I get similar errors when trying to convert to a DataFrame using pd.DataFrame and then trying to write it back to Snowflake to a new DB and Schema.
What am I doing wrong? Am I misunderstanding what Snowpark can do; isn't it part of the appeal that all these transformations can be easily done with the objects rather than as a full DataFrame? How can I get this to work?
The primary issue is that you are assigning the output of the .show() method call to the variable, not the Snowpark DataFrame itself. It is best practice to assign the Snowpark DataFrame to a variable and then call .show() on that variable when you need to see the results.
Snowpark DataFrame transformations are lazily executed: calling .show() forces execution and returns None instead of a reference to the underlying data and transformations. So, for example:
df1 = read_session.table("<SCHEMA_NAME>.<TABLE_NAME>").select(
    col("ID"),
    col("<col_name1>"),
    col("<col_name2>"),
    col("<col_name3>")
).filter(col("<col_name2>") == 'A1')
df1.show()

df2 = read_session.table("<SCHEMA_NAME>.<TABLE_NAME2>").select(
    col("ID"),
    col("<col_name1>"),
    col("<col_name2>"),
    col("<col_name3>")
)
df2.show()

df_joined = df1.join(df2, df1.col("ID") == df2.col("ID")).select(df1["ID"], "<col_name1>", "<col_name2>")
df_joined.show()
Otherwise, you are assigning a method call that returns NoneType to your variable, hence the error you are seeing: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.html#snowflake.snowpark.DataFrame.show
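The same applies to the pandas conversion and the write-back mentioned in the question: keep the lazy Snowpark DataFrame in the variable and only materialize it at the end. A minimal sketch, assuming df_joined holds the lazy DataFrame (the target table name is just an example):
# materialize the joined result into a pandas DataFrame (forces execution)
pdf = df_joined.to_pandas()

# or write it back to Snowflake as a new table
df_joined.write.save_as_table("<SCHEMA_NAME>.JOINED_RESULT", mode="overwrite")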

generate dataframe of model predictions by looping through dictionary of models

I would like to loop through a dictionary like this:
models = {'OLS': LinearRegression(),
          'Lasso': Lasso(),
          'LassoCV': LassoCV(n_alphas=300, cv=3)}
and then I want to generate a DataFrame of each model's predictions.
So far I have written this code, which only produces an array of results for each model:
predictions = []
for i in models:
    predictions.append(models[i].fit(X_train, y_train).predict(X_test))
As the final result, I want a DataFrame where each column is labelled with the model's key from the dictionary and contains that model's predicted values.
Thank you!
Instead of appending the predictions to the list, you can directly insert the predictions into a data frame.
Code:
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, LassoCV

models = {'OLS': LinearRegression(),
          'Lasso': Lasso(),
          'LassoCV': LassoCV(n_alphas=300, cv=3)}

df = pd.DataFrame()
for i in models:
    # each model's predictions become a column named after its dictionary key
    df[i] = models[i].fit(X_train, y_train).predict(X_test)
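Equivalently, the whole DataFrame can be built in one step with a dict comprehension (same idea, just more compact; X_train, y_train and X_test are assumed to exist as in the question):
# one column per model, keyed by the dictionary name
df = pd.DataFrame({name: model.fit(X_train, y_train).predict(X_test)
                   for name, model in models.items()})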

pyarrow parquet - encoding array into list of records

I am creating parquet files using Pandas and pyarrow and then reading schema of those files using Java (org.apache.parquet.avro.AvroParquetReader).
I found out that parquet files created using pandas + pyarrow always encode arrays of primitive types as an array of records with a single field.
I observed the same behaviour when using PySpark. There is a similar question here: Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery
Here is the Python script that creates the parquet file:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame(
    {
        'organizationId': ['org1', 'org2', 'org3'],
        'entityType': ['customer', 'customer', 'customer'],
        'entityId': ['cust_1', 'cust_2', 'cust_3'],
        'customerProducts': [['p1', 'p2'], ['p4', 'p5'], ['p1', 'p3']]
    }
)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')
When I try to read the Avro schema of that parquet file, I see the following for the 'customerProducts' field:
{"type":"array","items":{"type":"record","name":"list","fields":[{"name":"item","type":["null","string"],"default":null}]}}
but I would expect something like this:
{"type":"array","items":["null","string"]}
Does anyone know if there is a way to make sure that parquet files created with arrays of primitive types have the simplest schema possible?
Thanks
As far as I know, the Parquet data model follows the Dremel record-shredding model, which allows a column to have one of three repetition types:
Required
Optional
Repeated
In order to represent a list, a nested record is needed to add an additional level of indirection, so that a null list, an empty list, and a list containing null values can all be distinguished.
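So the extra record level cannot be removed entirely; it is how Parquet represents lists. What you can influence is the naming of the nested fields: recent pyarrow versions have a use_compliant_nested_type option on write_table that emits the spec-compliant list/element layout instead of list/item, which some readers handle better. A sketch under that assumption (check the option against your pyarrow version):
import pyarrow.parquet as pq

# write nested list fields using the Parquet-spec-compliant names;
# the extra level of nesting itself remains
pq.write_table(table, 'output.parquet', use_compliant_nested_type=True)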

How to generate training dataset from huge binary data using TF 2.0?

I have a binary dataset of over 15 GB. I want to extract the data for model training using TF 2.0. Currently, here is what I am doing:
import numpy as np
import tensorflow as tf

data1 = np.fromfile('binary_file1', dtype='uint8')
data2 = np.fromfile('binary_file2', dtype='uint8')
dataset = tf.data.Dataset.from_tensor_slices((data1, data2))
# then do something like batch, shuffle, prefetch, etc.
for sample in dataset:
    pass
but this loads everything into memory, and I don't think it's a good way to deal with such big files. What should I do instead?
You have to make use of FixedLengthRecordDataset to read binary files. Although this dataset accepts several files together, it extracts data from only one file at a time. Since you want data read from the two binary files simultaneously, you have to create two FixedLengthRecordDataset objects and then zip them.
Another thing to note is the record_bytes parameter, which states how many bytes you would like to read per iteration. The total size of each binary file has to be an exact multiple of record_bytes.
# each record is 1000 raw bytes; each file's size must be a multiple of this
ds1 = tf.data.FixedLengthRecordDataset('file1.bin', record_bytes=1000)
ds2 = tf.data.FixedLengthRecordDataset('file2.bin', record_bytes=1000)
# pair records from the two files element-wise, then batch
ds = tf.data.Dataset.zip((ds1, ds2))
ds = ds.batch(32)
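Each element produced this way is a raw byte string, so you will typically decode it back into numbers before training. A minimal sketch continuing from the snippet above, assuming the files contain uint8 values as in the question:
# decode each record's raw bytes back into uint8 tensors
# (works on the batched strings too, since every record has the same length)
ds = ds.map(lambda a, b: (tf.io.decode_raw(a, tf.uint8),
                          tf.io.decode_raw(b, tf.uint8)),
            num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.prefetch(tf.data.AUTOTUNE)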

My first BigQuery python script

I would like to know how to create a Python script to access a BigQuery database.
I found a lot of scripts, but not really a complete one.
So, I would like to have a standard script that connects to a project, runs a query on a specific table, and creates a CSV file from the result.
Thanks for your help.
Jérôme.
#!/usr/bin/python
from google.cloud import bigquery

def export_data_to_gcs(dataset_id, table_id, destination):
    bigquery_client = bigquery.Client(project='XXXXXXXX-web-data')
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    job = bigquery_client.extract_table(table_ref, destination)
    job.result()  # waits for the job to complete
    print('Exported {}:{} to {}'.format(dataset_id, table_id, destination))

export_data_to_gcs('2XXXX842', 'ga_sessions_201XXXXX', 'gs://analytics-to-deci/file-name.json')
Destination format
BigQuery supports exporting to CSV, JSON and Avro formats.
Nested or repeated data cannot be exported to CSV, but it can be exported to JSON or Avro format, as described in the Google documentation on exporting table data.
For such data, try one of the other formats, for example newline-delimited JSON:
from google.cloud import bigquery
from google.cloud.bigquery.job import DestinationFormat, ExtractJobConfig, Compression

def export_table_to_gcs(dataset_id, table_id, destination):
    """
    Exports data from BigQuery to an object in Google Cloud Storage.
    For more information, see the README.rst.
    Example invocation:
        $ python export_data_to_gcs.py example_dataset example_table \\
            gs://example-bucket/example-data.csv
    The dataset and table should already exist.
    """
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    job_config = ExtractJobConfig()
    job_config.destination_format = DestinationFormat.NEWLINE_DELIMITED_JSON
    job_config.compression = Compression.GZIP
    job = bigquery_client.extract_table(table_ref, destination, job_config=job_config)
    job.result(timeout=300)  # waits for the job to complete
    print('Exported {}:{} to {}'.format(dataset_id, table_id, destination))
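Since the question also asks for a CSV file produced from a query, here is a minimal sketch of that variant (project, dataset, table and file names are illustrative; to_dataframe() assumes pandas is installed):
from google.cloud import bigquery

client = bigquery.Client(project='XXXXXXXX-web-data')

# run a query against a specific table and save the result locally as CSV
sql = 'SELECT * FROM `XXXXXXXX-web-data.2XXXX842.ga_sessions_201XXXXX` LIMIT 1000'
df = client.query(sql).to_dataframe()
df.to_csv('ga_sessions_sample.csv', index=False)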
