Spark dataframe changes column values when writing to SQL Server

I'm facing a very specific problem. I'm working on a pyspark notebook on Databricks. I run the following command:
my_df.select("insert_date").distinct().show()
and get:
+--------------------+
| insert_date|
+--------------------+
|2021-12-22 09:52:...|
|2021-12-20 16:36:...|
+--------------------+
then, I run:
my_df.persist()
my_df.write.option("truncate", "true").mode("overwrite").jdbc(sqlDbUrl, "[dbo].[my_table]",properties = connectionProperties)
my_df.unpersist()
my_df.select("insert_date").distinct().show()
and get:
+--------------------+
| insert_date|
+--------------------+
|2021-12-22 09:52:...|
+--------------------+
If I query the SQL database, the result is consistent with the second show():
select distinct insert_date from [dbo].[my_table]
**result**
insert_date
2021-12-22 09:52:31.000
This is undesired: I would like to preserve both distinct values of insert_date in the SQL table, and I can't see why that is not happening.
Further information:
Databricks column type: string
python version: 3
SQL column type: NVARCHAR(MAX)
There are no constraints, such as a default value, on the SQL column
For any further information, please ask in the comments.
Thanks in advance!

Related

How to convert regexp_substr(Oracle) to SQL Server?

I have a data table with a column Acctno; the expected output is shown in a separate column:
| Acctno              | expected_output |
|---------------------|-----------------|
| ABC:BKS:1023049101  | 1023049101      |
| ABC:UWR:19048234582 | 19048234582     |
| ABC:UEW:1039481843  | 1039481843      |
In Oracle SQL I used the following:
select regexp_substr(acctno, '[^:]+', 1, 3) as expected_output
from temp_mytable
but in Microsoft SQL Server I am getting an error that regexp_substr is not a built-in function.
How can I resolve this issue?
We can use PATINDEX with SUBSTRING here:
SELECT SUBSTRING(acctno, PATINDEX('%:[0-9]%', acctno) + 1, LEN(acctno)) AS expected_output
FROM temp_mytable;
Note that this answer assumes that the third component would always start with a digit, and that the first two components would not have any digits. If this were not true, then we would have to do more work.
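To sanity-check the split logic outside SQL Server, the same third-segment extraction can be sketched in plain Python (str.split stands in for the regex here; the sample account numbers are the ones from the question, and the helper name is made up for illustration):

```python
# Extract the third colon-delimited segment, mirroring
# regexp_substr(acctno, '[^:]+', 1, 3) from the Oracle query.
def third_segment(acctno: str) -> str:
    parts = acctno.split(":")
    return parts[2] if len(parts) >= 3 else ""

samples = {
    "ABC:BKS:1023049101": "1023049101",
    "ABC:UWR:19048234582": "19048234582",
    "ABC:UEW:1039481843": "1039481843",
}
for acct, expected in samples.items():
    assert third_segment(acct) == expected
```

Unlike the PATINDEX approach, this does not assume anything about which components contain digits; it only assumes the colon is the delimiter.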
Just another option, if the desired value is the last portion of the string and there are no more than four segments (PARSENAME's limit).
SELECT *,
       NewValue = PARSENAME(REPLACE(Acctno, ':', '.'), 1)
FROM YourTable;

SQLalchemy append dataframe to existing SQL Server table

I'm trying to append two columns from a dataframe to an existing SQL server table. The code runs but when I query the SQL table, the additional rows are not present. What am I missing?
import sqlalchemy
engine = sqlalchemy.create_engine("mssql+pyodbc://user:pw@host:port/dbname?driver=ODBC+Driver+13+for+SQL+Server")
df.to_sql(name='database.tablename', con=engine, if_exists='append', index=False)
You cannot use dot notation in the name= parameter; just use name='tablename'. The other parts are fine.
If you need a non-default schema (the default is dbo), there is a schema= parameter for df.to_sql(). The database. prefix is redundant in any case, because you have already selected dbname in the engine URL.
Tested with SQL Server 2017 (latest docker image on debian 10) and anaconda python 3.7.
Test code
SQL Server part (create an empty table)
use testdb;
go
if OBJECT_ID('testdb..test') is not null
drop table test;
create table test (
    [Brand] varchar(max),
    [Price] money
);
Python part
from pandas import DataFrame
import sqlalchemy
# check your driver string
# import pyodbc
# pyodbc.drivers() # ['ODBC Driver 17 for SQL Server']
# connect
eng = sqlalchemy.create_engine("mssql+pyodbc://myid:mypw@localhost:1433/testdb?driver=ODBC+Driver+17+for+SQL+Server")
df = DataFrame(
    data={'Brand': ['A', 'B', 'C'],
          'Price': [10.00, 20.00, 30.00]},
    columns=['Brand', 'Price']
)
df.to_sql(name="test", schema="dbo", con=eng, if_exists="append", index=False)
Result
select * from [test]
| Brand | Price |
|-------|---------|
| A | 10.0000 |
| B | 20.0000 |
| C | 30.0000 |

Sqoop & Hadoop - How to join/merge old data and new data imported by Sqoop in lastmodified mode?

Background:
I have a table with the following schema on a SQL server. Updates to existing rows are possible, and new rows are also added to this table.
| unique_id | user_id | last_login_date       | count |
|-----------|---------|-----------------------|-------|
| 123-111   | 111     | 2016-06-18 19:07:00.0 | 180   |
| 124-100   | 100     | 2016-06-02 10:27:00.0 | 50    |
I am using Sqoop to add incremental updates in lastmodified mode. My --check-column parameter is the last_login_date column. In my first run, I got the above two records into Hadoop - let's call this current data. I noted that the last value (the max value of the check column from this first import) is 2016-06-18 19:07:00.0.
Assuming there is a change on the SQL server side, I now have the following changes on the SQL server side:
| unique_id | user_id | last_login_date       | count |
|-----------|---------|-----------------------|-------|
| 123-111   | 111     | 2016-06-25 20:10:00.0 | 200   |
| 124-100   | 100     | 2016-06-02 10:27:00.0 | 50    |
| 125-500   | 500     | 2016-06-28 19:54:00.0 | 1     |
I have the row 123-111 updated with a more recent last_login_date value and the count column has also been updated. I also have a new row 125-500 added.
On my second run, Sqoop looks at all rows with a last_login_date greater than the known last value from the previous import - 2016-06-18 19:07:00.0.
This gives me only the changed data, i.e. 123-111 and 125-500 records. Let's call this - new data.
Question
How do I do a merge join in Hadoop/Hive using the current data and the new data so that I end up with the updated version of 123-111, 124-100, and the newly added 125-500?
Loading changed data using Sqoop is a two-phase process:
1st phase - load the changed data into a temp (staging) table using the sqoop import utility.
2nd phase - merge the changed data with the old data using the sqoop-merge utility.
If the table is small (say, a few million records), then use a full load with sqoop import.
Sometimes it's possible to load only the latest partition. In that case, use the sqoop import utility to load the partition with a custom query; then, instead of merging, simply INSERT OVERWRITE the loaded partition into the target table, or copy the files. This will be faster than sqoop-merge.
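To illustrate the merge semantics only (not Sqoop itself), here is a small pandas sketch: rows from the new extract replace old rows with the same unique_id, and brand-new keys are appended. The column names and values come from the question; pandas is just a stand-in for the Hive/sqoop-merge step:

```python
import pandas as pd

# "Current data" from the first import and "new data" from the
# incremental import, both keyed by unique_id.
current = pd.DataFrame({
    "unique_id": ["123-111", "124-100"],
    "last_login_date": ["2016-06-18 19:07:00.0", "2016-06-02 10:27:00.0"],
    "count": [180, 50],
})
new = pd.DataFrame({
    "unique_id": ["123-111", "125-500"],
    "last_login_date": ["2016-06-25 20:10:00.0", "2016-06-28 19:54:00.0"],
    "count": [200, 1],
})

# New rows win on key collision: concatenate with new data last,
# then keep only the last occurrence of each unique_id.
merged = (pd.concat([current, new])
            .drop_duplicates(subset="unique_id", keep="last")
            .sort_values("unique_id")
            .reset_index(drop=True))
print(merged)
```

The result contains the updated 123-111, the unchanged 124-100, and the new 125-500, which is exactly what sqoop-merge produces when given a merge key.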
You can change the existing Sqoop query (by specifying a new custom query) to get ALL the data from the source table instead of only the changed data. Refer to using_sqoop_to_move_data_into_hive. This would be the simplest way to accomplish this, i.e., doing a full data refresh instead of applying deltas.

Execute stored procedure with DbSlim and FitNesse (Selenium, Xebium)

https://github.com/markfink/dbslim
I'd like to execute the stored procedures with DbSlim using Fitnesse (Selenium, Xebium)
now what I tried to do is:
!define dbQuerySelectCustomerbalance (
execute dbo.uspLogError
)
| script | Db Slim Select Query | !-${dbQuerySelectCustomerbalance}-! |
which gives a green indicator;
however, Microsoft SQL Server Profiler shows no actions/logging...
So what I'd like to know is: is it possible to use DbSlim to execute stored procedures,
and if yes, what is the correct way to do it?
By the way, I have the connection to the database on one page, and on the query page I included the connection to the database. (Is that OK?)
Take out the !- ... -!. It is used to escape wikified words. But in this case you want it to be translated to the actual query.
!define dbQuerySelectCustomerbalance ( execute dbo.uspLogError )
| script | Db Slim Select Query | ${dbQuerySelectCustomerbalance} |
| show | data by column index | 1 | and row index | 1 |
You can add the last line, which outputs the first column of the first row, for testing purposes if your SP returns a result (or you can create a simple SP just to test this).
Specifying the connection anywhere before this block is fine, be it on the same page or in a SetUp/SuiteSetUp/normal page included or executed before.

SSIS "Enumerator failed to retrieve element at index" Error

In my SSIS package I am using a data flow task to extract data from SQL Server and put it into a dataset with the following schema:
Column1 Int32
Column2 Object
Column3 Object
Column4 String
Column5 Double
That step seems to work well. In the foreach editor I mapped the columns to variables like this:
| VARIABLE      | INDEX |
|---------------|-------|
| User::Column1 | 0     |
| User::Column2 | 1     |
| User::Column3 | 2     |
| User::Column4 | 3     |
| User::Column5 | 4     |
When I run the package I get the following error on the foreach task:
Error: The enumerator failed to retrieve element at index "4".
Error: ForEach Variable Mapping number 5 to variable "User::Column5" cannot be applied.
There are no null values in Column5 and I can clearly see all 5 columns in the query when I run it against the database. Any assistance is greatly appreciated!
I finally found the problem. The target dataset in the data flow task was dropping the last column for some reason. Once I recreated the dataset destination everything worked.
