Please let me know whether the Snowflake Spark connector has the ability to create a Snowflake external table.
The Spark connector does support DDL statements, and CREATE EXTERNAL TABLE is a DDL statement:
https://docs.snowflake.com/en/user-guide/spark-connector-use.html#executing-ddl-dml-sql-statements
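The DDL itself would just be a normal Snowflake CREATE EXTERNAL TABLE statement, which you could pass to the connector's Utils.runQuery helper described on that page. A rough sketch of such a statement (the table, stage, and column names are placeholders, and the CSV layout is an assumption):
-- Hypothetical external table over CSV files in an existing stage;
-- this string would be executed through the Spark connector
CREATE EXTERNAL TABLE my_ext_table (
  country VARCHAR AS (VALUE:c1::VARCHAR),
  visit_count NUMBER AS (VALUE:c2::NUMBER)
)
LOCATION = @my_stage/path/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
AUTO_REFRESH = FALSE;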
I'm not sure how you can create external tables with the Spark connector, but what I usually do is create a stage in Snowflake on top of Blob Storage or an S3 bucket, which you can then work with much like a local file. The stage can be used in your queries through the Spark connector, and any new file on the stage (Blob Storage/S3 bucket) will be available through a query such as:
SELECT $1 FROM @STAGE_NAME/example.json;
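As a rough sketch of that setup, assuming an Azure Blob Storage container and JSON files (the stage name, URL, and SAS token are placeholders):
-- Stage pointing at the container (credentials are placeholders)
CREATE OR REPLACE STAGE my_stage
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/data/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
  FILE_FORMAT = (TYPE = JSON);

-- Any new file landing in the container is immediately queryable
SELECT $1 FROM @my_stage/example.json;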
I'm not sure if this is of any help, since I don't know how you are trying to apply it, but if it is, I'll be glad to put a fuller example here.
I've created some Hive tables using JDBC in a Python notebook on Databricks, in the Data Science and Engineering UI. I'm able to query the tables in a Databricks notebook and use direct SQL with the %sql magic command.
When switching to the Databricks SQL UI, I'm still able to see the tables in the Hive metastore explorer. However, I'm not able to read the data; a very clear message says that only file-based formats such as CSV and Parquet are supported.
I found this surprising: since I can use the data in the Data Science and Engineering UI, why isn't that the case in Databricks SQL? Is there any solution to overcome that?
Yes, it's a known limitation that Databricks SQL currently supports only file-based formats. As I remember, it's related to the security model, plus the fact that DBSQL uses Photon under the hood, where JDBC integration might not be very performant. You may reach out to your solution architect or customer success engineer for information on whether it will be supported in the future.
The current workaround is to have a job that periodically reads all the data from the database via JDBC and dumps it into a Delta table; querying the Delta table can even be more performant than going through JDBC, the only real issue being the freshness of the data.
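A minimal sketch of that job in Spark SQL, assuming the JDBC source is reachable from the cluster (the view and table names, JDBC URL, and credentials are all placeholders):
-- Expose the JDBC source as a temporary view (connection details are placeholders)
CREATE TEMPORARY VIEW jdbc_source
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:mysql://db-host:3306/mydb',
  dbtable 'my_schema.my_table',
  user 'db_user',
  password 'db_password'
);

-- Snapshot it into a Delta table that Databricks SQL can query
CREATE OR REPLACE TABLE my_table_delta
USING DELTA
AS SELECT * FROM jdbc_source;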
You can import a Hive table from cloud storage into Databricks using an external table and query it using Databricks SQL.
Step 1: Show the CREATE TABLE statement
Issue a SHOW CREATE TABLE <tablename> command on your Hive command line to see the statement that created the table.
Refer to the example below:
hive> SHOW CREATE TABLE wikicc;
OK
CREATE TABLE `wikicc`(
`country` string,
`count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/hive/warehouse/wikicc'
TBLPROPERTIES (
'totalSize'='2335',
'numRows'='240',
'rawDataSize'='2095',
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'transient_lastDdlTime'='1418173653')
Step 2: Issue a CREATE EXTERNAL TABLE statement
If the statement that is returned uses a CREATE TABLE command, copy the statement and replace CREATE TABLE with CREATE EXTERNAL TABLE.
EXTERNAL ensures that Spark SQL does not delete your data if you drop the table.
You can omit the TBLPROPERTIES field.
DROP TABLE wikicc
CREATE EXTERNAL TABLE `wikicc`(
`country` string,
`count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/hive/warehouse/wikicc'
Step 3: Issue SQL commands on your data
SELECT * FROM wikicc
Source: https://docs.databricks.com/data/data-sources/hive-tables.html
I am trying to load data from a couple of Snowflake tables (200-300 columns) into Azure SQL Server. Is there a way to convert the data types automatically, or to convert the entire table creation script?
Paste the DDL into a text editor and use find-and-replace to change the data types.
There are tools online that can assist here (Your mileage may vary)
Example: https://www.jooq.org/translate/
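As a rough illustration of the kind of find-and-replace involved (the table is made up, and the type mappings shown are common choices rather than the only valid ones):
-- Snowflake DDL
CREATE TABLE customers (
  id         NUMBER(38,0),
  name       VARCHAR(255),
  created_at TIMESTAMP_NTZ,
  payload    VARIANT
);

-- Roughly equivalent Azure SQL Server DDL after replacing the data types
CREATE TABLE customers (
  id         DECIMAL(38,0),
  name       NVARCHAR(255),
  created_at DATETIME2,
  payload    NVARCHAR(MAX)
);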
I have a few questions regarding the process of copying tables from S3 to Snowflake.
The plan is to copy some data from AWS S3 into Snowflake and then perform some modeling with DataRobot.
We have some tables that contain PII data, and we would like to hide those columns from DataRobot. What suggestions do you have for this problem?
The schema in AWS needs to match the schema in Snowflake for the copying process.
Thanks,
Mali
Assuming you know the schema of the data you are loading, you have a few options for using Snowflake:
Use COPY INTO statements to load the data into the tables (see the sketch after this list)
Use SNOWPIPE to auto-load the data into the tables (this would be good for instances where you are regularly loading new data into Snowflake tables)
Use EXTERNAL TABLES to reference the S3 data directly as a table in Snowflake. You'd likely want to use MATERIALIZED VIEWS for this in order for the tables to perform better.
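A minimal sketch of the first option (the stage, table, and file-format details are placeholders, and key-based credentials are just one way to authorize the stage):
-- External stage pointing at the S3 location (credentials are placeholders)
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-bucket/path/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>');

-- Load the staged files into an existing table
COPY INTO my_table
  FROM @my_s3_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);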
As for hiding the PII data from DataRobot, I would recommend leveraging Snowflake DYNAMIC DATA MASKING to establish rules that obfuscate the data (or null it out) for the role that DataRobot is using.
All of these features are well-documented in Snowflake documentation:
https://docs.snowflake.com/
Regarding hiding your PII elements, you can use two different roles: one, say data_owner (the role that creates the table and loads the data into it), and another, say data_modelling (for use by DataRobot).
Create masking policies as the data_owner role so that the DataRobot role cannot see the column data.
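A minimal sketch of such a policy (the policy, table, column, and role names are placeholders):
-- Created while acting as the data owner role
CREATE MASKING POLICY mask_pii AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_OWNER') THEN val
    ELSE '***MASKED***'
  END;

-- Roles such as DATA_MODELLING (used by DataRobot) now see masked values
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY mask_pii;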
About your question on copying the data: there is no requirement that the AWS S3 folder structure be in sync with Snowflake. You can create the external stage with any name and point it at any S3 folder.
The Snowflake documentation has a good example that helps you get some hands-on experience:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
Is there any clever way to get my data from a MySQL database into Snowflake?
I found two possible ways so far:
Option 1: Put Snowpipe on top of the MySQL database so the pipeline converts the data automatically.
Option 2: Convert the tables manually into CSV, store them locally, and load them into Snowflake via a stage.
It seems strange to me to convert every table into a CSV first. Can I not just push a SQL dump file to Snowflake? Can I also schedule some reload task in Snowflake, so that either option 1 or 2 gets triggered automatically?
Best
NicBeC24
I found some very good information regarding MySQL-Snowflake-migrations here: https://hevodata.com/blog/mysql-to-snowflake-data-migration-steps/
The main steps from the webpage above are:
Exporting data from MySQL
Taking care of data types
Staging your files in Snowflake (internal/external stage)
Copying the staged files into the table (see the sketch below)
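A minimal sketch of the last two steps using a table's internal stage (the file path, table name, and CSV layout are placeholders; PUT is run from a client such as SnowSQL):
-- Upload the exported CSV to the table's internal stage
PUT file:///tmp/my_table.csv @%my_table;

-- Load the staged file into the table
COPY INTO my_table
  FROM @%my_table
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);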
If the SQL dump is just a ".sql" file in ANSI, then yes, of course, you can copy and paste it into your Snowflake worksheet and execute it there.
Regarding scheduling: yes, Snowflake has a feature called Tasks (https://docs.snowflake.com/en/user-guide/tasks-intro.html). You can use them to schedule your COPY INTO command.
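A minimal sketch of such a task (the warehouse, schedule, table, and stage names are placeholders):
-- Reload the table from the stage once per hour
CREATE OR REPLACE TASK reload_my_table
  WAREHOUSE = my_wh
  SCHEDULE = '60 MINUTE'
AS
  COPY INTO my_table
    FROM @my_stage
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Tasks are created suspended; resume to activate the schedule
ALTER TASK reload_my_table RESUME;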
I need to update a table in Snowflake using data from an Oracle database.
Is there a way to connect to an Oracle database from Snowflake?
If the answer is no, how can I update the table in Snowflake using data from Oracle?
Not sure exactly what you are looking for here. The best way to get data into Snowflake is via the COPY INTO command, which would then allow you to update the Snowflake table with that data. If you are looking for ways to keep the two systems in sync, then you might want to look into the various data replication tools in the marketplace. If this is a transactional update, then you can use a connector (ODBC, JDBC, Python, etc.) to update the data from one system to the other. I wouldn't recommend that for bulk updates, though.
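As a rough sketch of the bulk path: export the Oracle data to a file, stage and load it into a staging table, then MERGE into the target (all object names and the join key are placeholders):
-- Load the Oracle extract into a staging table via its internal stage
PUT file:///tmp/oracle_extract.csv @%my_staging_table;
COPY INTO my_staging_table
  FROM @%my_staging_table
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Apply the changes to the target table
MERGE INTO my_target t
USING my_staging_table s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);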
There are several ways you can integrate your data from Oracle into Snowflake. If you are familiar with an ETL tool, you can use any one of them, or you can use any programming language to extract and load the data.