Snowflake EXPLAIN query support with Snowflake JDBC Driver - snowflake-cloud-data-platform

Is there a way to run an EXPLAIN query against Snowflake through the JDBC driver with the Snowflake extension? I am running net.snowflake snowflake-jdbc 3.12.8 and it throws an error saying net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: syntax error line 1 at position 15 unexpected 'EXPLAIN'. I see there are more up-to-date versions, up to 3.12.16, but nothing in the release notes mentions this added capability. The exact same query I am running works successfully in the Snowflake UI.

I had no problem using EXPLAIN and the Snowflake JDBC driver 3.12.8:
print(sc._jvm.net.snowflake.spark.snowflake.Utils.getClientInfoString())
x = sc._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions, 'explain select * from numbers limit 10')
cols = x.getMetaData().getColumnNames()
print(cols)
while x.next():
    print([x.getString(i) for i in range(1, 1 + cols.size())])
The output shows that I'm using the specified JDBC version (through PySpark), followed by the results of the EXPLAIN query:
{
"spark.version" : "2.4.4",
"spark.snowflakedb.version" : "2.8.1",
"spark.app.name" : "Simple App",
"scala.version" : "2.11.12",
"java.version" : "1.8.0_242",
"snowflakedb.jdbc.version" : "3.12.8"
}
['step', 'id', 'parent', 'operation', 'objects', 'alias', 'expressions', 'partitionsTotal', 'partitionsAssigned', 'bytesAssigned']
[None, None, None, 'GlobalStats', None, None, None, '1', '1', '512']
['1', '0', None, 'Result', None, None, 'NUMBERS.X', None, None, None]
['1', '1', '0', 'Limit', None, None, 'rowCount: 10', None, None, None]
['1', '2', '1', 'TableScan', 'TEMP.PUBLIC.NUMBERS', None, 'X', '1', '1', '512']
For further community debugging, you'll need to paste your code so we can check what's happening.

The EXPLAIN query can be executed via the Snowflake JDBC connector:
Example:
ResultSet rs = stmt.executeQuery("explain SELECT top 5 * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.ORDERS where O_ORDERDATE between '1992-01-01' and '1992-12-31'");
ResultSetMetaData rsmd = rs.getMetaData();
int numberOfColumns = rsmd.getColumnCount();
for (int i = 1; i <= numberOfColumns; i++) {
    String name = rsmd.getColumnName(i);
    System.out.println("name :" + name + " size :" + rsmd.getColumnDisplaySize(i));
}
Output:
name :id size :10
name :parent size :10
name :operation size :16777216
name :objects size :16777216
name :alias size :16777216
name :expressions size :16777216
name :partitionsTotal size :39
name :partitionsAssigned size :39
name :bytesAssigned size :39
Thanks,
Sujan Ghosh

Related

P-values for glmer mixed effects logistic regression in Python

I have a dataset for one year for all employees with individual-level data (e.g. age, gender, promotions, etc.). Each employee is in a team of a certain manager. I have some variables on the team- and manager-levels as well (e.g. manager's tenure, team diversity, etc.). I want to explain the termination of employees (binary: left the company or not). I am running a multilevel logistic regression, where employees are grouped by their managers, therefore they share the same team- and manager-level characteristics.
So, my model looks like:
Termination ~ Age + Time in company + Promotions + Manager tenure + Percent of employees who completed training", data, groups=data[Manager_ID]
Dataset example:
data = {'Employee': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5', 'ID6', 'ID7', 'ID8'],
        'Manager_ID': ['MID1', 'MID2', 'MID2', 'MID1', 'MID3', 'MID3', 'MID3', 'MID1'],
        'Termination': ['0', '0', '0', '0', '1', '1', '1', '0'],
        'Age': ['35', '40', '50', '24', '33', '46', '44', '31'],
        'TimeinCompany': ['1', '3', '10', '20', '4', '0', '4', '9'],
        'Promotions': ['1', '0', '0', '0', '1', '1', '1', '0'],
        'Manager_Tenure': ['10', '5', '5', '10', '8', '8', '8', '10'],
        'PercentCompletedTrainingTeam': ['40', '20', '20', '40', '49', '49', '49', '40']}
columns = ['Employee', 'Manager_ID', 'Termination', 'Age', 'TimeinCompany', 'Promotions',
           'Manager_Tenure', 'PercentCompletedTrainingTeam']
data = pd.DataFrame(data, columns=columns)
I managed to run a mixed-effects logistic regression using the lme4 package from R in Python.
importr('lme4')
model1 = r.glmer(formula=Formula('Termination ~ Age + TimeinCompany + Promotions + Manager_Tenure + PercentCompletedTrainingTeam + (1 | Manager_ID)'),
                 data=data)
print(r.summary(model1))
I receive the following output for the full sample:
REML criterion at convergence: 54867.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.9075 -0.3502 -0.2172 -0.0929 3.9378
Random effects:
Groups Name Variance Std.Dev.
Manager_ID (Intercept) 0.005033 0.07094
Residual 0.072541 0.26933
Number of obs: 211974, groups: Manager_ID, 24316
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.14635573 0.00299341 48.893
Age -0.00112153 0.00008079 -13.882
TimeinCompany -0.00238352 0.00010314 -23.110
Promotions -0.01754085 0.00491545 -3.569
Manager_Tenure -0.00044373 0.00010834 -4.096
PercentCompletedTrainingTeam -0.00014393 0.00002598 -5.540
Correlation of Fixed Effects:
(Intr) Age TmnCmpny Promotions Mngr_Tenure
Age -0.817
TmnCmpny 0.370 -0.616
Promotions -0.011 -0.009 -0.033
Mngr_Tenure -0.279 0.013 -0.076 0.035
PrcntCmpltT -0.309 -0.077 -0.021 -0.042 0.052
But there are no p-values displayed. I have read in many places that lme4 does not provide p-values for a number of reasons; however, I need them for a work presentation.
I tried several possible solutions that I found, but none of them worked:
importr('lmerTest')
importr('afex')
print(r.anova(model1))
does not display any output
print(r.anova(model1, ddf="Kenward-Roger"))
only displays npar, Sum Sq, Mean Sq, F value
print(r.summary(model1, ddf="merModLmerTest"))
Provides the same output as with just summary
print(r.anova(model1, "merModLmerTest"))
only displays npar, Sum Sq, Mean Sq, F value
Any ideas on how to get p-values are much appreciated.
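One rough workaround I am considering (an approximation I am assuming is acceptable here, not something lme4 itself reports): compute two-sided Wald p-values directly from the estimates and standard errors that the summary above already prints, using the normal approximation.
from scipy.stats import norm

# Estimates and standard errors copied from the lme4 summary output above.
coefs = {
    '(Intercept)':                  (0.14635573, 0.00299341),
    'Age':                          (-0.00112153, 0.00008079),
    'TimeinCompany':                (-0.00238352, 0.00010314),
    'Promotions':                   (-0.01754085, 0.00491545),
    'Manager_Tenure':               (-0.00044373, 0.00010834),
    'PercentCompletedTrainingTeam': (-0.00014393, 0.00002598),
}

for name, (est, se) in coefs.items():
    z = est / se
    p = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value under the normal approximation
    print(f'{name:30s} z = {z:8.3f}  p = {p:.3g}')
With 211,974 observations the normal approximation should be very close, but it is still an approximation rather than the Kenward-Roger or Satterthwaite p-values that lmerTest would report.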

Sagemaker - batch transform: Internal server error 500

I am trying to do a batch transform on a training dataset in an S3 bucket. I have followed this link:
https://github.com/aws-samples/quicksight-sagemaker-integration-blog
The training data on which the transformation is being applied is ~35 MB.
I am getting these errors:
Bad HTTP status received from algorithm: 500
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
Process followed:
1. s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/rawtrain/'.format(bucket, prefix), content_type='csv')
2. from sagemaker.sklearn.estimator import SKLearn
   sagemaker_session = sagemaker.Session()
   script_path = 'preprocessing.py'
   sklearn_preprocessor = SKLearn(
       entry_point=script_path,
       role=role,
       train_instance_type="ml.c4.xlarge",
       framework_version='0.20.0',
       py_version='py3',
       sagemaker_session=sagemaker_session)
   sklearn_preprocessor.fit({'train': s3_input_train})
3. transform_train_output_path = 's3://{}/{}/{}/'.format(bucket, prefix, 'transformtrain-train-output')
   scikit_learn_inferencee_model = sklearn_preprocessor.create_model(env={'TRANSFORM_MODE': 'feature-transform'})
   transformer_train = scikit_learn_inferencee_model.transformer(
       instance_count=1,
       assemble_with='Line',
       output_path=transform_train_output_path,
       accept='text/csv',
       strategy="MultiRecord",
       max_payload=40,
       instance_type='ml.m4.xlarge')
4. Preprocess training input
   transformer_train.transform(s3_input_train.config['DataSource']['S3DataSource']['S3Uri'],
                               content_type='text/csv',
                               split_type="Line")
   print('Waiting for transform job: ' + transformer_train.latest_transform_job.job_name)
   transformer_train.wait()
   preprocessed_train_path = transformer_train.output_path + transformer_train.latest_transform_job.job_name
preprocessing.py
from __future__ import print_function
import time
import sys
from io import StringIO
import os
import shutil
import argparse
import csv
import json
import numpy as np
import pandas as pd
import logging
from sklearn.compose import ColumnTransformer
from sklearn.externals import joblib
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, StandardScaler, OneHotEncoder
from sagemaker_containers.beta.framework import (
    content_types, encoders, env, modules, transformer, worker)

# Specifying the column names here.
feature_columns_names = [
    'A',
    'B',
    'C',
    'D',
    'E',
    'F',
    'G',
    'H',
    'I',
    'J',
    'K'
]
label_column = 'ab'
feature_columns_dtype = {
    'A': str,
    'B': np.float64,
    'C': np.float64,
    'D': str,
    'E': np.float64,
    'F': str,
    'G': str,
    'H': np.float64,
    'I': str,
    'J': str,
    'K': str,
}
label_column_dtype = {'ab': np.int32}
def merge_two_dicts(x, y):
    z = x.copy()   # start with x's keys and values
    z.update(y)    # modifies z with y's keys and values & returns None
    return z

def _is_inverse_label_transform():
    """Returns True if it's running in inverse label transform mode."""
    return os.getenv('TRANSFORM_MODE') == 'inverse-label-transform'

def _is_feature_transform():
    """Returns True if it's running in feature transform mode."""
    return os.getenv('TRANSFORM_MODE') == 'feature-transform'
if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    args = parser.parse_args()

    input_files = [os.path.join(args.train, file) for file in os.listdir(args.train)]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))

    raw_data = [pd.read_csv(
        file,
        header=None,
        names=feature_columns_names + [label_column],
        dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype)) for file in input_files]
    concat_data = pd.concat(raw_data)

    numeric_features = list([
        'B',
        'C',
        'E',
        'H'
    ])
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = list(['A', 'D', 'F', 'G', 'I', 'J', 'K'])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)],
        remainder="drop")

    preprocessor.fit(concat_data)
    joblib.dump(preprocessor, os.path.join(args.model_dir, "model.joblib"))
    print("saved model!")
def input_fn(input_data, request_content_type):
    """Parse input data payload

    We currently only take csv input. Since we need to process both labelled
    and unlabelled data we first determine whether the label column is present
    by looking at how many columns were provided.
    """
    content_type = request_content_type.lower() if request_content_type else "text/csv"
    content_type = content_type.split(";")[0].strip()

    if isinstance(input_data, str):
        str_buffer = input_data
    else:
        str_buffer = str(input_data, 'utf-8')

    if _is_feature_transform():
        if content_type == 'text/csv':
            # Read the raw input data as CSV.
            df = pd.read_csv(StringIO(input_data), header=None)
            if len(df.columns) == len(feature_columns_names) + 1:
                # This is a labelled example, includes the label
                df.columns = feature_columns_names + [label_column]
            elif len(df.columns) == len(feature_columns_names):
                # This is an unlabelled example.
                df.columns = feature_columns_names
            return df
        else:
            raise ValueError("{} not supported by script!".format(content_type))

    if _is_inverse_label_transform():
        if (content_type == 'text/csv' or content_type == 'text/csv; charset=utf-8'):
            # Read the raw input data as CSV.
            df = pd.read_csv(StringIO(str_buffer), header=None)
            if len(df.columns) == len(feature_columns_names) + 1:
                # This is a labelled example, includes the label
                df.columns = feature_columns_names + [label_column]
            elif len(df.columns) == len(feature_columns_names):
                # This is an unlabelled example.
                df.columns = feature_columns_names
            return df
        else:
            raise ValueError("{} not supported by script!".format(content_type))
def output_fn(prediction, accept):
    """Format prediction output

    The default accept/content-type between containers for serial inference is JSON.
    We also want to set the ContentType or mimetype as the same value as accept so the next
    container can read the response payload correctly.
    """
    accept = 'text/csv'
    if type(prediction) is not np.ndarray:
        prediction = prediction.toarray()

    if accept == "application/json":
        instances = []
        for row in prediction.tolist():
            instances.append({"features": row})
        json_output = {"instances": instances}
        return worker.Response(json.dumps(json_output), mimetype=accept)
    elif accept == 'text/csv':
        return worker.Response(encoders.encode(prediction, accept), mimetype=accept)
    else:
        raise RuntimeException("{} accept type is not supported by this script.".format(accept))
def predict_fn(input_data, model):
    """Preprocess input data

    We implement this because the default predict_fn uses .predict(), but our model is a preprocessor
    so we want to use .transform().
    The output is returned in the following order:
        rest of features either one hot encoded or standardized
    """
    if _is_feature_transform():
        features = model.transform(input_data)
        if label_column in input_data:
            # Return the label (as the first column) and the set of features.
            return np.insert(features.toarray(), 0, pd.get_dummies(input_data[label_column])['True.'], axis=1)
        else:
            # Return only the set of features
            return features

    if _is_inverse_label_transform():
        features = input_data.iloc[:, 0] > 0.5
        features = features.values
        return features

def model_fn(model_dir):
    """Deserialize fitted model
    """
    if _is_feature_transform():
        preprocessor = joblib.load(os.path.join(model_dir, "model.joblib"))
        return preprocessor
Please help.
From what I can see, you are referring to a post that is pretty old and has open issues on GitHub here with regards to the input source and the configurations.
I encourage you to check out the latest examples here, which show how to visualize Amazon SageMaker machine learning predictions with Amazon QuickSight.
Additionally, if the problem persists, please open a service request with AWS Support, including the job ARN, so the Internal Server Error can be investigated further.
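In the meantime, here is a minimal debugging sketch (my assumption, not from the linked examples) for pulling the container logs of the failed transform job from CloudWatch, so you can see the actual traceback behind the HTTP 500. It assumes the default '/aws/sagemaker/TransformJobs' log group and that boto3 credentials are configured:
import boto3

# Hypothetical snippet: read the transform job's container logs from CloudWatch.
job_name = transformer_train.latest_transform_job.job_name  # from the code in the question
group = '/aws/sagemaker/TransformJobs'  # assumed default log group for batch transform jobs

logs = boto3.client('logs')
streams = logs.describe_log_streams(logGroupName=group, logStreamNamePrefix=job_name)
for stream in streams['logStreams']:
    events = logs.get_log_events(logGroupName=group, logStreamName=stream['logStreamName'])
    for event in events['events']:
        print(event['message'])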

Parsing pipe-delimited JSON data in Python

I am trying to parse an API response which is JSON. The JSON looks like this:
{
    'id': 112,
    'name': 'stalin-PC',
    'type': 'IP4Address',
    'properties': 'address=10.0.1.110|ipLong=277893412|state=DHCP Allocated|macAddress=41-1z-y4-23-dd-98|'
}
Its length is 1200; if I convert it, I should get 1200 rows. My goal is to parse this JSON like below:
id name type address iplong state macAddress
112 stalin-PC IP4Address 10.0.1.110 277893412 DHCP Allocated 41-1z-y4-23-dd-98
I am getting the first 3 elements, but I'm having an issue with the "properties" key, whose value is pipe-delimited. I have tried the code below:
for network in networks:  # here networks = response.json()
    network_id = network['id']
    network_name = network['name']
    network_type = network['type']
    print(network_id, network_name, network_type)
It works fine and gives me the result:
112 stalin-PC IP4Address
But when I tried to parse the properties key with the code below, it's not working.
for network in networks:
    network_id = network['id']
    network_name = network['name']
    network_type = network['type']
    for line in network['properties']:
        properties_value = line.split('|')
        network_address = properties_value[0]
    print(network_id, network_name, network_type, network_address)
How can I parse the pipe-delimited properties key? Would anyone help me, please?
Thank you
Using str methods
Ex:
network = {
    'id': 112,
    'name': 'stalin-PC',
    'type': 'IP4Address',
    'properties': 'address=10.0.1.110|ipLong=277893412|state=DHCP Allocated|macAddress=41-1z-y4-23-dd-98'
}

for n in network['properties'].split("|"):
    key, value = n.split("=")
    print(key, "-->", value)
Output:
address --> 10.0.1.110
ipLong --> 277893412
state --> DHCP Allocated
macAddress --> 41-1z-y4-23-dd-98
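If you want the flat tabular layout shown in the question, a possible follow-up sketch (assuming networks = response.json() as in your original loop) merges the fixed keys with the parsed properties pairs:
import pandas as pd

rows = []
for network in networks:  # networks = response.json(), as in the question
    row = {'id': network['id'], 'name': network['name'], 'type': network['type']}
    for pair in network['properties'].rstrip('|').split('|'):  # drop any trailing pipe first
        key, value = pair.split('=', 1)
        row[key] = value
    rows.append(row)

df = pd.DataFrame(rows)  # one row per network: id, name, type, address, ipLong, state, macAddress
print(df)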

CodeIgniter unknown SQL error

Here is my selection code from db:
$q = $this->db->like('Autor1' or 'Autor2' or 'Autor3' or 'Autor4', $vyraz)
              ->where('stav', 1)
              ->order_by('id', 'desc')
              ->limit($limit)
              ->offset($offset)
              ->get('knihy');
return $q->result();
Where $vyraz = "Zuzana Šidlíková";
And the error is:
Nastala chyba databázy (A database error occurred)
Error Number: 1054
Unknown column '1' in 'where clause'
SELECT * FROM (`knihy`) WHERE `stav` = 1 AND `1` LIKE '%Zuzana Šidlíková%' ORDER BY `id` desc LIMIT 9
Filename: C:\wamp\www\artbooks\system\database\DB_driver.php
Line Number: 330
Can you help me solve this problem?
Your syntax is wrong for what you're trying to do, but still technically valid, because this:
'Autor1' or 'Autor2' or 'Autor3' or 'Autor4'
...is actually a valid PHP expression which evaluates to TRUE (because all non-empty strings are "truthy"), which when cast to a string or echoed comes out as 1, so the DB class is looking to match on a column called "1".
Example:
function like($arg1, $arg2)
{
    return "WHERE $arg1 LIKE '%$arg2%'";
}

$vyraz = 'Zuzana Šidlíková';
echo like('Autor1' or 'Autor2' or 'Autor3' or 'Autor4', $vyraz);
// Output: WHERE 1 LIKE '%Zuzana Šidlíková%'
Anyways, here's what you need:
$q = $this->db
    ->like('Autor1', $vyraz)
    ->or_like('Autor2', $vyraz)
    ->or_like('Autor3', $vyraz)
    ->or_like('Autor4', $vyraz)
    ->where('stav', 1)
    ->order_by('id', 'desc')
    ->limit($limit)
    ->offset($offset)
    ->get('knihy');

CakePHP encoding problem: storing uppercase S with caron on top saves in the database but causes errors when processed by Cake

So I am working on a site that stores cuneiform tablet info. We use Semitic characters for transliteration.
In my script, I create a term list from the transliteration of a tablet.
My problem is that with the Š, my script creates two different terms, because it thinks there is a space in the word due to the way Cake treats the special character.
Example:
Partial contents of a tablet:
utu-DIŠ-nu-il2
Terms from the tablet when treated by my script:
utu-DIŠ, -nu-il2
It should be:
utu-DIŠ-nu-il2
When I print the contents of my array in the course of processing, I see this:
utu-DI� -nu-il2
So this means the incorrect parsing of the text creates a space, which my script interprets as 2 words instead of one.
In the database, the text is fine...
I also get these errors:
Warning (512): SQL Error: 1366: Incorrect string value: '\xC5' for column 'term' at row 1 [CORE\cake\libs\model\datasources\dbo_source.php, line 684]
Query: INSERT INTO terms (term, lft, rght) VALUES ('utu-DI�', 449, 450)
Query: INSERT INTO terms (term, lft, rght) VALUES ('A�', 449, 450)
Query: INSERT INTO terms (term, lft, rght) VALUES ('xDI�', 449, 450)
Does anybody know what I could do to make this work?
Thanks!
Added info:
$terms = $this->data['Tablet']['translit'];
$terms = str_replace(array('\r\n', '\r', '\n', '\n\r', '\t'), ' ', $terms);
$terms = trim($terms, chr(173));
print_r($terms);
$terms = preg_replace('/\s+/', ' ', $terms);
$terms = explode(" ", $terms);
$terms = array_map('trim', $terms);
$anti_terms = array('#tablet', '1.', '2.', '3.', '4.', '5.', '6.', '7.', '8.', '9.', '10.', '11.', '12.', '13.', '14.', '15.', '16.', '17.', '18.', '19.', '20.', 'Rev.',
    'Obv.', '#tablet', '#obverse', '#reverse', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', '\r', '\n', '\r\n', '\t', '' . ' ', null, chr(173), 'x', '[x]', '[...]');
foreach ($terms as $key => $term) {
    if (in_array($term, $anti_terms) || is_numeric($term)) {
        unset($terms[$key]);
    }
}
If I put my print_r before the preg_replace, the Š characters are good; if I do it after, they display with the black lozenge. So I guess the preg function is the problem!
Just found this:
http://www.php.net/manual/fr/function.preg-replace.php#84385
But it seems that mb_ereg_replace() causes the same problem as preg_replace()...
Solution:
mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");
$terms = mb_ereg_replace('\s+', ' ', $terms);
and the error is gone!
