Delta table linked to sql table - sql-server

I'm new to Databricks and Spark. We create Delta tables using data from SQL Server, and these tables are kind of mirrored: if I insert a new row in SQL Server it affects the Delta table, and I can even insert from Databricks and have SQL Server updated, but deleting is only allowed from SQL Server.
What I don't understand is how this works. If I create the table with this command, the Delta and SQL tables are linked:
spark.sql("""
create table IF NOT EXISTS dbname.delta_table
using org.apache.spark.sql.jdbc
OPTIONS (
url '""" + sql_url + """',
dbtable 'dbname.sql_table',
user '""" + sql_user + """',
password '""" + sql_password + """',
TRUNCATE true
)
""");
But if I try the same thing with the PySpark API, there's no link between the tables:
spark.read \
    .format("jdbc") \
    .option("url", url_sql) \
    .option("dbtable", sql_table) \
    .option("user", sql_user) \
    .option("password", sql_password) \
    .option("truncate", True) \
    .load() \
    .write \
    .saveAsTable(delta_table)
I would like to know how to get the same result with PySpark and where to find more documentation about it. I haven't found what I was looking for, and I don't know what kind of relationship there is between the tables or which keywords describe it.
Thanks for the help,
Sergio
I've been searching online all day for the right topic but haven't found anything.

You're doing different things:
The first SQL statement creates a metadata entry in the Hive metastore that points to the SQL Server database. So when you read from that table, Spark connects via the JDBC protocol under the hood and loads the data from SQL Server.
In the second approach you actually load the data from the database and create a managed table that is stored in Delta format (the default format). That table is a snapshot of the SQL Server table at the time of execution.
So if you want to create a table that behaves like the first one, you just need to keep using spark.sql.
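In other words, you can stay in a Python notebook but still issue the same CREATE TABLE ... USING org.apache.spark.sql.jdbc statement through spark.sql. A minimal sketch, assuming sql_url, sql_user and sql_password are defined as in your snippets (the helper name is just for illustration):

def create_jdbc_linked_table(spark, qualified_name, dbtable, sql_url, sql_user, sql_password):
    # Only writes a metastore entry that points at SQL Server via JDBC;
    # no data is copied, so reads and writes always hit the remote table.
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {qualified_name}
        USING org.apache.spark.sql.jdbc
        OPTIONS (
          url '{sql_url}',
          dbtable '{dbtable}',
          user '{sql_user}',
          password '{sql_password}',
          TRUNCATE true
        )
    """)

create_jdbc_linked_table(spark, "dbname.delta_table", "dbname.sql_table", sql_url, sql_user, sql_password)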

Related

Spark JDBC SQL connector "SELECT INTO" Statement

I am trying to read data from SQL Server. Because of some requirements I need to create a temp table using a SELECT INTO statement, which will be used further on in the query. But when I run the query I get the following error:
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'INTO'
My question is, is the SELECT INTO statement allowed with Spark SQL Connector?
Here is a sample query and code
drivers = {"mssql": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
sparkDf = spark.read.format("jdbc") \
.option("url", connectionString) \
.option("query", "SELECT * INTO #TempTable FROM Table1") \
.option("user", username) \
.option("password", password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
It's because the driver is doing extra work that gets in your way. More specifically, the behaviour is documented in the Spark JDBC data source documentation. Here is what that page says about the "query" option you're using:
A query that will be used to read data into Spark. The specified query will be parenthesized and used as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause. As an example, Spark will issue a query of the following form to the JDBC source:
SELECT <columns> FROM (<user_specified_query>) spark_gen_alias
Below are a couple of restrictions while using this option.
It is not allowed to specify dbtable and query options at the same time.
It is not allowed to specify query and partitionColumn options at the same time. When specifying the partitionColumn option is required, the subquery can be specified using the dbtable option instead, and partition columns can be qualified using the subquery alias provided as part of dbtable.
Example:
spark.read.format("jdbc")
.option("url", jdbcUrl)
.option("query", "select c1, c2 from t1")
.load()
Essentially, that wrapper that the driver is putting around your code is causing the problem. And because their wrapper starts with SELECT * FROM ( and ends with ) spark_generated_alias, you're very limited in how you can "break out" of it and still execute the statements you want.
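To see why the error points at the INTO keyword, this is roughly the statement that reaches SQL Server when you pass your query through the "query" option (a sketch; the exact alias Spark generates may differ):

user_query = "SELECT * INTO #TempTable FROM Table1"
# The JDBC source nests your text inside its own SELECT ... FROM ( ... ) wrapper,
# and SELECT INTO is not valid inside a derived table, hence the syntax error.
wrapped = "SELECT * FROM (" + user_query + ") spark_gen_alias"
print(wrapped)
# SELECT * FROM (SELECT * INTO #TempTable FROM Table1) spark_gen_alias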
Here's how I do it.
I break it into 3 separate queries, because normal (#) and global (##) temp-tables don't work (the driver disconnects after each query).
Query 1:
SELECT 1 AS col) AS tbl; --terminates the "SELECT * FROM (" the driver prepends
--Write whatever Sql you want, then select into a new "real" table.
--E.g. here's your example, but with a "real" table.
SELECT * INTO _TempTable FROM Table1;
SELECT 1 FROM (SELECT 1 AS col --and the driver will append ") spark_generated_alias". The driver ignores all result-sets but the first.
Query 2:
SELECT * FROM _TempTable;
Query 3 (you cannot run this until after you're done with the DataFrame):
SELECT 1 AS col) AS tbl; --terminates the "SELECT * FROM (" the driver prepends
DROP TABLE IF EXISTS _TempTable;
SELECT 1 FROM (SELECT 1 AS col --and the driver will append ") spark_generated_alias". The driver ignores all result-sets but the first.
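Wired up from PySpark, the three steps might look like the sketch below. It reuses the connection options from the question; run_query is just an illustrative helper, and since whether steps 1 and 3 execute at load time can vary by Spark version, count() is used to force an action:

setup_sql = """SELECT 1 AS col) AS tbl;
SELECT * INTO _TempTable FROM Table1;
SELECT 1 FROM (SELECT 1 AS col"""

read_sql = "SELECT * FROM _TempTable"

cleanup_sql = """SELECT 1 AS col) AS tbl;
DROP TABLE IF EXISTS _TempTable;
SELECT 1 FROM (SELECT 1 AS col"""

def run_query(sql):
    # Each call opens its own JDBC session, which is why a "real" table is
    # used instead of # or ## temp tables.
    return (spark.read.format("jdbc")
            .option("url", connectionString)
            .option("query", sql)
            .option("user", username)
            .option("password", password)
            .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
            .load())

run_query(setup_sql).count()      # 1: create and fill _TempTable server-side
df = run_query(read_sql)          # 2: the data you actually want
# ... work with df ...
run_query(cleanup_sql).count()    # 3: only after you're done with df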

Find all columns in a Pervasive database

I want to find all the columns with a name that includes a specific string using PSQL in a Pervasive database. How do I do that?
You can query the X$Field table for your string. Something like:
select file.xf$name, field.xe$name from x$field field
join x$file file on xe$file = xf$id
where xe$name like '%some string%'
This query should work for both original and v2 (long metadata) databases, but only if you have the DDFs (FILE.DDF, FIELD.DDF, and INDEX.DDF at a minimum) and a PSQL database set up pointing to those DDFs.
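If you want to run that search from code rather than from a query tool, here is a small sketch using pyodbc against an assumed ODBC DSN; the DSN name, credentials and search string are placeholders, not part of the original answer:

import pyodbc

# Assumes an ODBC DSN named "PSQL_DB" already points at the database's DDFs.
conn = pyodbc.connect("DSN=PSQL_DB;UID=<user>;PWD=<password>")
cursor = conn.cursor()

cursor.execute(
    "select f.xf$name, e.xe$name "
    "from x$field e join x$file f on e.xe$file = f.xf$id "
    "where e.xe$name like ?",
    "%some string%")

for table_name, column_name in cursor.fetchall():
    print(table_name, column_name)

conn.close()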

Loading data of one table into another residing on different databases - Netezza

I have a big file which I have loaded into a table in a Netezza database using an ETL tool; let's call this database Staging_DB. Now, after some verifications, the content of this table needs to be inserted into a similarly structured table residing in another Netezza DB; let's call this one PROD_DB. What is the fastest way to transfer the data from Staging_DB to PROD_DB?
Should I be using the ETL tool to load the data into PROD_DB? Or,
Should the transfer be done using the external tables concept?
If there is no transformation to be done, then the better approach is a cross-database data transfer. As described in the Netezza documentation, Netezza supports cross-database access as long as the user has object-level permissions on both databases.
You can check permissions with the following command:
dbname.schemaname(loggenin_username)=> \dpu username
Please find a working example below:
INSERT INTO PROD_DB..TBL1 SELECT * FROM Staging_DB..TBL1
If you want to do some transformation before inserting into the other database, you can write UDT procedures (also called resultset procedures).
Hope this helps.
One way you could move the data is by using Transient External Tables. Start by creating a flat file from your source table/db. Because you are moving from Netezza to Netezza you can save time and space by turning on compression and using internal formatting.
CREATE EXTERNAL TABLE 'C:\FileName.dat'
USING (
delim 167
datestyle 'MDY'
datedelim '/'
maxerrors 2
encoding 'internal'
Compress True
REMOTESOURCE 'ODBC'
logDir 'c:\' ) AS
SELECT * FROM source_table;
Then create the table in your target database using the same DDL as the source table and just load it up:
INSERT INTO target SELECT * FROM external 'C:\FileName.dat'
USING (
delim 167
datestyle 'MDY'
datedelim '/'
maxerrors 2
encoding 'internal'
Compress True
REMOTESOURCE 'ODBC'
logDir 'c:\' );
I would write a stored procedure on the production database and do a CTAS (CREATE TABLE AS SELECT) from the staging database to the production database. The beauty of a stored procedure is that you can add transformations as well.
One other option is the nz_migrate utility provided by Netezza, which I believe is the fastest route.
A simple SQL query like
INSERT INTO PROD_DB..TBL1 SELECT * FROM Staging_DB..TBL1
works great if you just need to do that.
Just be aware that you have to be connected to the destination database when executing the query, otherwise you will get an error code
HY0000: "Cross Database Access not supported for this type of command"
even if you have read/write access to both databases and tables.
In most cases you can simply change the catalog using a "Set Catalog" command
https://www-304.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_set_catalog.html
set catalog='database_name';
insert into target_db.target_schema.target_table select * from source_db.source_schema.source_table;

"IDENTITY_INSERT is set to off" sqoop error while exporting table to Sql Server

I am exporting a simple Hive table to SQL Server. Both tables have exactly the same schema. There is an identity column in the SQL Server table, and I have run "SET IDENTITY_INSERT table_name ON" on it.
But when I export from Sqoop to SQL Server, Sqoop gives me an error saying "IDENTITY_INSERT is set to OFF".
If I export to a SQL Server table with no identity column, everything works fine.
Any idea about this? Has anyone faced this issue while exporting from Sqoop to SQL Server?
Thanks
In short:
Append -- --identity-insert to your Sqoop export command.
In detail:
Here is an example for anyone searching (and possibly for my own later reference).
SQLSERVER_JDBC_URI="jdbc:sqlserver://<address>:<port>;username=<username>;password=<password>"
HIVE_PATH="/user/hive/warehouse/"
TABLENAME=<tablename>
sqoop-export \
-D mapreduce.job.queuename=<queuename> \
--connect $SQLSERVER_JDBC_URI \
--export-dir "$HIVE_PATH""$TABLENAME" \
--input-fields-terminated-by , \
--table "$TABLENAME" \
-- --schema <schema> \
--identity-insert
Note the particular bits at the end: -- --schema <schema> --identity-insert. You can omit the schema part, but leave in the extra --.
That allows you to set the identity insert ability for that table within your sqoop session. (source)
Tell SQL Server to let you insert into the table with the IDENTITY column. That's an auto-increment column that you normally can't write to, but you can change that with SET IDENTITY_INSERT (see the SQL Server documentation). It'll still fail if one of your values conflicts with one that already exists in that column.
The SET IDENTITY_INSERT statement is session-specific. So if you set it by opening a query window and executing the statement, and then run the export anywhere else, IDENTITY_INSERT was only set in that session, not in the export session. You need to modify the export itself if possible. If not, a direct export from Sqoop to MSSQL will not be possible; instead you will need to dump the data from Sqoop to a file that MSSQL can read (such as tab-delimited), and then write a statement that first does SET IDENTITY_INSERT ON, then BULK INSERTs the file, then does SET IDENTITY_INSERT OFF.
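A minimal sketch of that fallback path, assuming the Sqoop output has been landed as a tab-delimited file the SQL Server instance can read; the connection string, table name and file path are placeholders, and KEEPIDENTITY is added here so BULK INSERT keeps the identity values from the file:

import pyodbc

# All statements run on one connection, because SET IDENTITY_INSERT
# only applies to the current session.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<address>;"
    "DATABASE=<dbname>;UID=<username>;PWD=<password>",
    autocommit=True)
cursor = conn.cursor()

cursor.execute("SET IDENTITY_INSERT dbo.target_table ON")
cursor.execute("""
    BULK INSERT dbo.target_table
    FROM 'C:\\exports\\target_table.tsv'
    WITH (KEEPIDENTITY, FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n')
""")
cursor.execute("SET IDENTITY_INSERT dbo.target_table OFF")
conn.close()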

mysqldump partial database

I recently decided to switch the company through which I get my hosting, so to move my old DB into my new DB I have been trying to run this:
mysqldump --host=ipaddress --user=username --password=password db_name table_name | mysql -u username -ppassword -h new_url new_db_name
and this seemed to be working fine, but because my database is so freaking massive I would get timeout errors in the middle of my tables. So I was wondering if there is an easy way to do a mysqldump on just part of a table.
I would assume the workflow would look something like this:
create temp_table
move rows from old_table where id>2,500,000 into temp_table
somehow dump the temp table into the new db's table (which has the same name as old_table)
but I'm not exactly sure how to do those steps.
Add --where="id>2500000" at the end of the mysqldump command (see the MySQL 5.1 Reference Manual).
In your case the mysqldump command would look like:
mysqldump --host=ipaddress \
--user=username \
--password=password \
db_name table_name \
--where="id>2500000"
If you dump twice, the second dump will also contain the table creation info, but on the second pass you only want to add the new rows. So for the second dump, add the --no-create-info option to the mysqldump command line.
I've developed a tool for this job. It's called mysqlsuperdump and can be found here:
https://github.com/hgfischer/mysqlsuperdump
With it you can specify the full WHERE clause for each table, so it's possible to specify different rules for each table.
You can also replace the values of any column, per table, in the dump. This is useful, for example, when you want to export a database dump for use in a development environment.
