I am using a Scala notebook on Databricks. I need to perform an INSERT of data from a dataframe into a table in SQL Server. If a row already exists, there is no need to modify or insert it; only rows that do not yet exist should be inserted.
I tried the methods described at https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#write-data-to-jdbc, but they don't address my use case: SaveMode.Append creates duplicate entries of the data, SaveMode.Overwrite replaces the existing data (the whole table), and SaveMode.Ignore does not add any new data if the table already exists.
df.write.mode(SaveMode.Overwrite).jdbc(url=dbUrl, table=table_name, dbConnectionProperties)
How can I do an INSERT of new data only to the database?
Thanks a lot in advance for your help!
Assume your current dataframe is df1.
You should read the existing data in the SQL table into another dataframe (df2).
Then use subtract (called except on the Scala Dataset API; subtractByKey exists for pair RDDs): http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract
val dfFinal = df1.except(df2)
dfFinal will contain the remaining records to insert.
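Putting the pieces together, a minimal sketch might look like the following (assuming dbUrl, table_name, and dbConnectionProperties are defined as in the question, df1 and the target table have identical schemas, and spark is the active SparkSession; in the Scala Dataset API, subtract is spelled except):

```scala
import org.apache.spark.sql.SaveMode

// Read the rows that already exist in the SQL Server table
val df2 = spark.read.jdbc(dbUrl, table_name, dbConnectionProperties)

// Keep only rows of df1 that are not already in the table.
// except compares entire rows, so a row with a matching key but a
// changed non-key column still counts as "new".
val dfFinal = df1.except(df2)

// Append just the new rows
dfFinal.write.mode(SaveMode.Append).jdbc(dbUrl, table_name, dbConnectionProperties)
```

If you need key-based deduplication instead of whole-row comparison, a left anti join on the key column(s) is the usual alternative to except.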
NOTE: this is a workaround, not a foolproof solution.
There can be a workaround for this issue.
You need to maintain an auto-increment key/primary key in the SQL Server table, and the source data should carry this key before the insert.
The following conditions can arise:
New primary key == existing primary key -> the job will fail with a constraint-violation exception.
New primary key != existing primary key -> the row is inserted successfully.
The insert failure can then be handled at the program level.
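Handling the failure at the program level could be sketched like this (a rough illustration only; the exact exception type depends on the JDBC driver, and a mid-batch constraint violation can still leave a partial insert behind):

```scala
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.SaveMode

// Attempt the append and react to a constraint violation instead of crashing
Try(df1.write.mode(SaveMode.Append).jdbc(dbUrl, table_name, dbConnectionProperties)) match {
  case Success(_) =>
    println("Insert succeeded")
  case Failure(e) =>
    // A primary-key violation surfaces here wrapped in a Spark exception;
    // log it and decide whether to retry with deduplicated data
    println(s"Insert failed: ${e.getMessage}")
}
```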
To avoid bringing in the entire set for a comparison, you could put a unique index on the SQL table and use the option to ignore duplicates.
MS Document: Create Unique Indexes
CREATE [ UNIQUE ] [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON <object> ( column [ ASC | DESC ] [ ,...n ] )
[ INCLUDE ( column_name [ ,...n ] ) ]
[ WHERE <filter_predicate> ]
[ WITH ( <relational_index_option> [ ,...n ] ) ]
[ ON { partition_scheme_name ( column_name )
| filegroup_name
| default
}
]
[ FILESTREAM_ON { filestream_filegroup_name | partition_scheme_name | "NULL" } ]
[ ; ]
<object> ::=
{ database_name.schema_name.table_or_view_name | schema_name.table_or_view_name | table_or_view_name }
<relational_index_option> ::=
{
| IGNORE_DUP_KEY = { ON | OFF }
}
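For example, a unique index that silently discards duplicate inserts could look like this (table and column names are illustrative):

```sql
CREATE UNIQUE NONCLUSTERED INDEX UX_MyTable_BusinessKey
    ON dbo.MyTable (BusinessKey)
    WITH (IGNORE_DUP_KEY = ON);
```

With IGNORE_DUP_KEY = ON, rows that would violate the unique index are discarded with a warning instead of failing the whole INSERT, so SaveMode.Append only lands the genuinely new rows.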
Related
I am experimenting with PostgreSQL 15 logical replication.
I have a table called "test" in database "test1" with columns "id" int (primary) and "name" varchar
id int (primary) | name varchar
I also have a table called "test" in database "test0" with columns "tenant" int (primary), "id" int (primary) and "name" varchar
tenant int (primary/default=1) | id int (primary) | name varchar
I have the following publisher on database "test1"
CREATE PUBLICATION pb_test FOR TABLE test ("id", "name")
SELECT pg_create_logical_replication_slot('test_slot_v1', 'pgoutput');
I also have the following subscriber on database "test0"
CREATE SUBSCRIPTION sb_test CONNECTION 'dbname=test1 host=localhost port=5433 user=postgres password=*********' PUBLICATION pb_test WITH (slot_name = test_slot_v1, create_slot = false);
The result is that every time a new record is added in database "test1", the same record is inserted into database "test0" with tenant=1, since that is the default value.
The question: is there any way to have a custom expression for this additional column "tenant" during replication? For example, records coming from database "test1" should have tenant=1, but records coming from database "test2" should have tenant=2.
It seems that, currently, PostgreSQL 14 logical replication does not support adding extra columns with fixed values.
UPDATE
PostgreSQL 15 allows publishing and subscribing to a subset of a table's columns, but it still does not support a custom expression as a key or value column.
Is there a way to specify a storage location in MS SQL Server?
I have a relatively small table with XMLs (up to 2 GB of data) and primary and secondary XML indexes on this table. These indexes take several times more space than the underlying table.
Is there any way to physically separate the table and the indexes into two separate partitions?
I do not see such an option in MSDN (https://learn.microsoft.com/en-us/sql/t-sql/statements/create-xml-index-transact-sql):
CREATE [ PRIMARY ] XML INDEX index_name
ON <object> ( xml_column_name )
[ USING XML INDEX xml_index_name
[ FOR { VALUE | PATH | PROPERTY } ] ]
[ WITH ( <xml_index_option> [ ,...n ] ) ]
[ ; ]
But I'm hoping there is some trick.
From the Creating XML Indexes documentation:
The filegroup or partitioning information of the user table is applied
to the XML index. Users cannot specify these separately on an XML
index.
However, if I understand it correctly, you should be able to move the XML column and all indices based on it into a separate filegroup by using the TEXTIMAGE_ON clause of the create table statement:
create table dbo.MyTable (
Id int not null,
XMLData xml not null
)
on [PRIMARY]
textimage_on [LARGEDAT];
The only drawback I see here is that all LOB-typed columns from the table will be placed on that filegroup, not just XML one(s).
EDIT:
You can use the large value types out of row option of sp_tableoption to force LOB data types out of row even when their values are small enough to fit in the table page.
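For example, applied to the dbo.MyTable sketched above:

```sql
-- Store all LOB values (including the xml column) out of row,
-- leaving only a 16-byte pointer on the data page
EXEC sp_tableoption 'dbo.MyTable', 'large value types out of row', 1;
```

Note that this option, like TEXTIMAGE_ON, applies to all LOB-typed columns of the table, not just the XML one.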
I created two tables, marks and users, with a foreign key relation between them. When I delete a row in the marks table, I need to delete the corresponding user in the users table, based on the uid column that exists in both tables. Can anyone suggest how?
Use the ON DELETE CASCADE option if you want rows deleted in the child table when corresponding rows are deleted in the parent table.
But your case is the reverse, and there is no way to do the reverse automatically. You need an explicit delete trigger that fires whenever records are deleted from the child table.
By the way, it is not safe to do this in reverse: there may be many marks records for a single user, and if you delete just one of them, the user would be removed from the users table.
I suggest handling this logic in a stored procedure: check that all records for the user have been deleted from the marks table, and only then remove the user from the users table.
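A rough sketch of the trigger approach, which only removes a user once their last mark is gone (table and column names are assumptions based on the question):

```sql
CREATE TRIGGER trg_marks_after_delete
ON marks
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Delete users referenced by the deleted marks rows,
    -- but only if they have no remaining marks
    DELETE u
    FROM users AS u
    INNER JOIN deleted AS d
        ON d.uid = u.uid
    WHERE NOT EXISTS (SELECT 1 FROM marks AS m WHERE m.uid = u.uid);
END;
```

Note the foreign key from marks to users must not block this delete; since the trigger runs after the marks rows are gone and checks for remaining children, the user row is orphan-free at that point.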
Well for your case, I will recommend using on delete cascade
More about it :
A foreign key with cascade delete means that if a record in the parent table is deleted, then the corresponding records in the child table will automatically be deleted. This is called a cascade delete in SQL Server.
The syntax for creating a foreign key with cascade delete using a CREATE TABLE statement in SQL Server (Transact-SQL) is:
CREATE TABLE child_table
(
column1 datatype [ NULL | NOT NULL ],
column2 datatype [ NULL | NOT NULL ],
...
CONSTRAINT fk_name
FOREIGN KEY (child_col1, child_col2, ... child_col_n)
REFERENCES parent_table (parent_col1, parent_col2, ... parent_col_n)
ON DELETE CASCADE
[ ON UPDATE { NO ACTION | CASCADE | SET NULL | SET DEFAULT } ]
);
In your design, just use ON DELETE CASCADE:
CREATE TABLE child_table
(
column1 datatype [ NULL | NOT NULL ],
column2 datatype [ NULL | NOT NULL ],
...
CONSTRAINT fk_name
FOREIGN KEY (child_col1, child_col2, ... child_col_n)
REFERENCES parent_table (parent_col1, parent_col2, ... parent_col_n)
ON DELETE CASCADE
[ ON UPDATE { NO ACTION | CASCADE | SET NULL | SET DEFAULT } ]
);
Now when you delete the parent, the child will automatically be deleted; you don't need to do anything else.
See the documentation on ON DELETE CASCADE for details.
As I don't like to DELETE any rows from related tables, I suggest this solution:
Add a status field with default value of 1 to your table(s).
Create a VIEW that shows only rows with status <> 0 and use this VIEW to show your valid data.
For parent-child or related tables just show rows with status <> 0 for both of parent and child table like parent.status * child.status <> 0.
[Optional and additional] Create a log table or journal for your database, your tables, or just your important tables, and store actions like Create, Edit/Modify, Delete, Undelete and so on.
With this solution you can:
Support Undo and Redo.
Support an Undelete action!
Not worry about a child that has no parent.
Find old data, changes to data, and much other information.
And many other benefits; the only cost is storing more data, which is not a concern for a good RDBMS.
I use DELETE only for a table that is a leaf in the hierarchy and whose data is not so important.
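A minimal sketch of this soft-delete pattern (table, column, and view names are illustrative):

```sql
-- Step 1: add a status flag defaulting to "active"
ALTER TABLE users ADD status INT NOT NULL DEFAULT 1;
ALTER TABLE marks ADD status INT NOT NULL DEFAULT 1;
GO

-- Step 2: expose only valid rows; parent.status * child.status <> 0
-- means both the user and the mark must still be active
CREATE VIEW v_valid_marks AS
SELECT m.*
FROM marks AS m
INNER JOIN users AS u ON u.uid = m.uid
WHERE m.status * u.status <> 0;
GO

-- "Deleting" a mark is then just an update, so it can be undone:
-- UPDATE marks SET status = 0 WHERE uid = @some_uid;
```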
When generating a diff script between two dbschema files with vsdbcmd.exe, I sometimes obtain unexpected output containing a DROP CONSTRAINT without the name of the constraint:
GO
PRINT N'Dropping On column: ColumnName ...';
GO
ALTER TABLE TableName DROP CONSTRAINT ;
In our schema, this column has a default value constraint with an auto-generated name. I expected vsdbcmd.exe to generate a valid ALTER TABLE SQL statement, as specified in the MSDN library:
ALTER TABLE [ database_name . [ schema_name ] . | schema_name . ] table_name DROP { [ CONSTRAINT ] constraint_name | COLUMN column_name }
Do you have any idea what could prevent vsdbcmd.exe from generating a valid SQL statement?
This issue only occurs when the constraint has a generated name; explicitly named constraints are not affected. Therefore, the solution is to name every constraint.
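For example, instead of letting SQL Server generate a name when a column gets an inline DEFAULT, declare the constraint with an explicit name (the names and the default value here are illustrative):

```sql
ALTER TABLE TableName
    ADD CONSTRAINT DF_TableName_ColumnName
    DEFAULT (0) FOR ColumnName;
```

With a stable, explicit name, the schema comparison can emit DROP CONSTRAINT DF_TableName_ColumnName instead of an empty constraint reference.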
I have a table with the following schema:
CREATE TABLE MyTable
(
ID INTEGER IDENTITY(1,1),
FirstIdentifier INTEGER NULL,
SecondIdentifier INTEGER NULL,
--.... some other fields .....
)
Now each of FirstIdentifier and SecondIdentifier is unique but nullable. I want to put a unique constraint on each of these columns but cannot, because they are nullable and two rows with NULL values would violate the unique constraint. Any ideas how to address this at the schema level?
You can use a filtered index as a unique constraint.
create unique index ix_FirstIdentifier on MyTable(FirstIdentifier)
where FirstIdentifier is not null
As several have suggested, using a filtered index is probably the way to get what you want.
But the book answer to your direct question is that a column can be nullable and have a unique index; it will just only be able to hold one row with a NULL value in that field, since any more than one NULL would violate the index.
Your question is a little bit confusing. First, your schema definition suggests the columns are not allowed to hold null values, but your description says that they can be null.
Anyway, assuming the schema is wrong and you actually want the columns to allow null values, SQL Server lets you do this by adding a WHERE <column> IS NOT NULL filter to the index.
Something along the lines of:
CREATE UNIQUE NONCLUSTERED INDEX IDX_my_index
ON MyTable (firstIdentifier)
WHERE firstIdentifier IS NOT NULL
Create a filtered unique index on each of the fields (a single composite index would only enforce uniqueness of the pair, not of each column individually):
CREATE UNIQUE INDEX ix_FirstIdentifier ON MyTable (FirstIdentifier)
WHERE FirstIdentifier IS NOT NULL;
CREATE UNIQUE INDEX ix_SecondIdentifier ON MyTable (SecondIdentifier)
WHERE SecondIdentifier IS NOT NULL;
This will allow NULLs but still enforce uniqueness of the non-NULL values in each column.
You can use a Filter predicate on the CREATE INDEX
From CREATE INDEX
CREATE [ UNIQUE ] [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON <object> ( column [ ASC | DESC ] [ ,...n ] )
[ INCLUDE ( column_name [ ,...n ] ) ]
[ WHERE <filter_predicate> ]
[ WITH ( <relational_index_option> [ ,...n ] ) ]
[ ON { partition_scheme_name ( column_name )
| filegroup_name
| default
}
]
[ FILESTREAM_ON { filestream_filegroup_name |
partition_scheme_name | "NULL" } ] [ ; ]
WHERE <filter_predicate> Creates a filtered index by specifying which
rows to include in the index. The filtered index must be a
nonclustered index on a table. Creates filtered statistics for the
data rows in the filtered index.
The filter predicate uses simple comparison logic and cannot reference
a computed column, a UDT column, a spatial data type column, or a
hierarchyID data type column. Comparisons using NULL literals are not
allowed with the comparison operators. Use the IS NULL and IS NOT NULL
operators instead.
Here are some examples of filter predicates for the
Production.BillOfMaterials table:
WHERE StartDate > '20040101' AND EndDate <= '20040630'
WHERE ComponentID IN (533, 324, 753)
WHERE StartDate IN ('20040404', '20040905') AND EndDate IS NOT NULL
Filtered indexes do not apply to XML indexes and full-text indexes.
For UNIQUE indexes, only the selected rows must have unique index
values. Filtered indexes do not allow the IGNORE_DUP_KEY option.