I'm trying to append two columns from a dataframe to an existing SQL Server table. The code runs, but when I query the SQL table, the additional rows are not present. What am I missing?
import sqlalchemy
engine = sqlalchemy.create_engine("mssql+pyodbc://user:pw@host:port/dbname?driver=ODBC+Driver+13+for+SQL+Server")
df.to_sql(name='database.tablename', con=engine, if_exists='append', index=False)
You cannot use dot notation (database.tablename) in the name= parameter; pass just the bare table name, name=tablename. The other parts are fine.
If you need to assign a schema other than the default (dbo), there is a schema= parameter for df.to_sql(). The database. prefix is redundant anyway, because you already assigned dbname in the engine URL.
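With the names from the question, the corrected call would look something like this (assuming the bare table name is tablename and it lives in the default dbo schema):
df.to_sql(name='tablename', schema='dbo', con=engine, if_exists='append', index=False)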
Tested with SQL Server 2017 (latest docker image on debian 10) and anaconda python 3.7.
Test code
SQL Server part (create an empty table)
use testdb;
go
if OBJECT_ID('testdb..test') is not null
    drop table test;
create table test (
    [Brand] varchar(max),
    [Price] money
);
Python part
from pandas import DataFrame
import sqlalchemy
# check your driver string
# import pyodbc
# pyodbc.drivers() # ['ODBC Driver 17 for SQL Server']
# connect
eng = sqlalchemy.create_engine("mssql+pyodbc://myid:mypw@localhost:1433/testdb?driver=ODBC+Driver+17+for+SQL+Server")
df = DataFrame(
    data={'Brand': ['A','B','C'],
          'Price': [10.00, 20.00, 30.00]},
    columns=['Brand', 'Price']
)
df.to_sql(name="test", schema="dbo", con=eng, if_exists="append", index=False)
Result
select * from [test]
| Brand | Price |
|-------|---------|
| A | 10.0000 |
| B | 20.0000 |
| C | 30.0000 |
I'm facing a very specific problem. I'm working on a pyspark notebook on Databricks. I run the following command:
my_df.select("insert_date").distinct().show()
and get:
+--------------------+
| insert_date|
+--------------------+
|2021-12-22 09:52:...|
|2021-12-20 16:36:...|
+--------------------+
then, I run:
my_df.persist()
my_df.write.option("truncate", "true").mode("overwrite").jdbc(sqlDbUrl, "[dbo].[my_table]",properties = connectionProperties)
my_df.unpersist()
my_df.select("insert_date").distinct().show()
and get:
+--------------------+
| insert_date|
+--------------------+
|2021-12-22 09:52:...|
+--------------------+
If I query the SQL database, the result is consistent with the second show():
select distinct insert_date from [dbo].[my_table]
**result**
insert_date
2021-12-22 09:52:31.000
This is undesired: I would like to preserve the two distinct values for the insert_date column in the SQL table, and I can't see why this is not happening.
Further information:
Databricks column type: string
python version: 3
SQL column type: NVARCHAR(MAX)
There are no constraints, such as a default value, on the SQL column
For any further information, please ask in the comments.
Thanks in advance!
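A minimal sketch of one thing to try, assuming my_df is lazily derived from a source that the overwrite itself can change (for example the same SQL table): force the persisted data to materialize before the JDBC write, so both the write and the later show() read the cached rows rather than re-evaluating against a changed source.
my_df.persist()
my_df.count()  # an action: forces the persisted plan to materialize before anything is overwritten
my_df.write.option("truncate", "true").mode("overwrite").jdbc(sqlDbUrl, "[dbo].[my_table]", properties=connectionProperties)
my_df.select("insert_date").distinct().show()  # now served from the cache
my_df.unpersist()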
I have a question about SQL Server's Transparent Data Encryption (TDE). I need to dump a database, which another DBA will restore remotely from the dumped data files. I was asked to make sure the dumped data files have no TDE so the DBA can restore them. I checked online and found a query to list the encryption status, as follows:
SELECT db_name(database_id), encryption_state
FROM sys.dm_database_encryption_keys;
My database is not in the result at all. I ran another query, as follows:
SELECT
db.name,
db.is_encrypted,
dm.encryption_state,
dm.percent_complete,
dm.key_algorithm,
dm.key_length
FROM
sys.databases db
LEFT OUTER JOIN sys.dm_database_encryption_keys dm
ON db.database_id = dm.database_id;
GO
My database has the value 0 for is_encrypted, and all the other values are NULL.
Does this mean my database is not encrypted at all?
If your output looks like this...
name | is_encrypted | encryption_state | percent_complete | key_algorithm | key_length
--------------------------------------------------------------------------------------------
MyDatabase | 0 | NULL | NULL | NULL | NULL
... your database, [MyDatabase], is NOT encrypted. Nor does it have a database encryption key configured.
If, however, any databases have non-NULLs in columns other than [is_encrypted] (e.g. [encryption_state] = 1), those databases are either encrypted, partially encrypted/decrypted or prepped for encryption.
Read up here for details on encryption states:
https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-database-encryption-keys-transact-sql?view=sql-server-ver15
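The same check can also be scripted from Python with pyodbc; a minimal sketch, assuming a trusted connection and the ODBC Driver 17 for SQL Server driver name:
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("""
    SELECT db.name, db.is_encrypted, dm.encryption_state
    FROM sys.databases db
    LEFT OUTER JOIN sys.dm_database_encryption_keys dm
        ON db.database_id = dm.database_id
""")
for name, is_encrypted, state in cursor.fetchall():
    # encryption_state is NULL when no database encryption key exists;
    # 1 = unencrypted, 2 = encryption in progress, 3 = encrypted, 5 = decryption in progress
    print(name, is_encrypted, state)
conn.close()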
I'm trying to save a dataframe to MS SQL Server using Windows authentication. I've tried passing engine, engine.connect(), and engine.raw_connection(), and they all throw errors:
'Engine' object has no attribute 'cursor', 'Connection' object has no attribute 'cursor', and Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ..., respectively.
import urllib
from sqlalchemy import create_engine

params = urllib.parse.quote('DRIVER={ODBC Driver 13 for SQL Server};'
                            'SERVER=server;'
                            'DATABASE=db;'
                            'TRUSTED_CONNECTION=Yes;')
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
df.to_sql(table_name, engine, index=False)
This will do exactly what you want.
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
    r'DRIVER={SQL Server Native Client 11.0};'
    r'SERVER=name_of_your_server;'
    r'DATABASE=name_of_your_database;'
    r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index, row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))
Here is an update to my original answer. Basically, the code above is the old-school way of doing things (row-by-row INSERT INTO). I recently stumbled upon a super-easy, scalable, and controllable way of pushing data from Python to SQL Server. Try the sample code and post back if you have additional questions.
import pyodbc  # pyodbc is the DBAPI driver SQLAlchemy uses under the hood
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://your_server_name/your_database_name"
    "?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes"
)
# ... build your dataframe here ...
dataframe.to_sql(x, engine, if_exists='append', index=True)
dataframe is pretty self-explanatory.
x is the name you want your table to have in SQL Server.
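Putting it together with the trusted-connection string from the original question (a sketch; the server, database, and table names are placeholders):
import urllib
import pandas as pd
from sqlalchemy import create_engine

params = urllib.parse.quote_plus(
    'DRIVER={ODBC Driver 13 for SQL Server};'
    'SERVER=server;'
    'DATABASE=db;'
    'Trusted_Connection=yes;'
)
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
df = pd.DataFrame({'Name': ['A'], 'Address': ['X'], 'Age': [30], 'Work': ['Y']})
# index=False here because dbo.Table_1 has no column for the dataframe index
df.to_sql('Table_1', engine, if_exists='append', index=False)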
Background:
I have a table with the following schema on a SQL Server instance. Updates to existing rows are possible, and new rows are also added to this table.
unique_id | user_id | last_login_date | count
123-111 | 111 | 2016-06-18 19:07:00.0 | 180
124-100 | 100 | 2016-06-02 10:27:00.0 | 50
I am using Sqoop to add incremental updates in lastmodified mode. My --check-column parameter is the last_login_date column. In my first run, I got the above two records into Hadoop - let's call this the current data. I noted that the last value (the max value of the check column from this first import) is 2016-06-18 19:07:00.0.
Assuming there is a change on the SQL server side, I now have the following changes on the SQL server side:
unique_id | user_id | last_login_date | count
123-111 | 111 | 2016-06-25 20:10:00.0 | 200
124-100 | 100 | 2016-06-02 10:27:00.0 | 50
125-500 | 500 | 2016-06-28 19:54:00.0 | 1
I have the row 123-111 updated with a more recent last_login_date value and the count column has also been updated. I also have a new row 125-500 added.
On my second run, Sqoop looks at all rows with a last_login_date value greater than my known last value from the previous import - 2016-06-18 19:07:00.0.
This gives me only the changed data, i.e. the 123-111 and 125-500 records. Let's call this the new data.
Question
How do I do a merge join in Hadoop/Hive using the current data and the new data so that I end up with the updated version of 123-111, 124-100, and the newly added 125-500?
Changed-data load using Sqoop is a two-phase process.
1st phase - load the changed data into a temp (staging) table using the sqoop import utility.
2nd phase - merge the changed data with the old data using the sqoop-merge utility.
If the table is small (say, a few million records), then do a full load using sqoop import.
Sometimes it's possible to load only the latest partition. In that case, use the sqoop import utility with a custom query to load the partition; then, instead of a merge, simply insert overwrite the loaded partition into the target table, or copy the files - this will work faster than sqoop-merge.
You can change the existing Sqoop query (by specifying a new custom query) to get ALL the data from the source table instead of getting only the changed data. Refer to using_sqoop_to_move_data_into_hive. This would be the simplest way to accomplish this, i.e. doing a full data refresh instead of applying deltas.
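If the reconciliation step itself is easier to express in Spark than with sqoop-merge, a rough alternative sketch (assuming the current data and the new data are both readable as DataFrames with matching columns, with unique_id as the merge key, and an illustrative target table name) is to union them and keep the most recent row per key:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# current_df = the previously imported data, new_df = the incremental (changed) data
w = Window.partitionBy("unique_id").orderBy(F.col("last_login_date").desc())
merged = (current_df.unionByName(new_df)
          .withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn"))
merged.write.mode("overwrite").saveAsTable("merged_table")  # table name illustrative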
I am currently working on a nightly import that I need to create, but I am not sure what the best route would be to update/insert into the current table. This is all done in MS SQL Server 2012, pulling the Excel file from another server. I am trying to figure out how I can loop through the columns and pull out the data I need. If I could rearrange the data, I would, but I am currently stuck with what I have.
In my current table tblHW I have columns such as PmpCount, NumberStages, Pmpmodel_pmp1, serialnum_Pmp1, Pmpmodel_pmp2, serialnum_pmp2, partnum_motor1, serialnumberMotor1, etc. I apologize in advance for not being able to post a real table or a picture.
Example:
|Name | PmpCount| numstages| pmpmodel_pmp1| stages_pmp1| Sn_pmp1|
|AN 91-23G | 4| 500| FX2347| 250| 354197|
|BR DN 895R| 5| 521| D2442| 45| 875164|
|ALN 1-60J | 5| 521| H21342| 95| 594126|
|pmpmodel_pmp2| stages_pmp2| sn_pmp2| Partnum_mtr1| sn_mtr1|
|FX2347 | 250| 354198| NULL| NULL|
|FX17500 | 143| 102547| M7544| 4512241|
|FX17500 | 143| 458790| M7544| 4512364|
The information I want to move into tblHW comes from the tbl Pull_Down. Here is the setup:
|Name | Run_ID | Part1| SN1 | Attribute1_7|
|AN 21-919G| Oct 08, 2013 / 100845| BOD| NA| 3RD U|
|FR 55-013A| Oct 17, 2013 / 100853| Pmp| 2EA3A022| 78|
|FR 55-013A| Oct 01, 2014 / 101383| Cbl| N/A| REDALEAD|
|FR 43-223J| Apr 03, 2013 / 100594| BOD| NA| 3RD U|
|VH 204 | May 17, 2014 / 101145| BOD| 3RD U|
|Part2| SN2 | Attribute2_7| Part3 | SN3 | Attribute3_7|
|Pmp | 2EA3F379| 78| Pmp| 2EA3N380| 117|
|Pmp | 2EA3C020| 117| Pmp| 2EA3Y021| 117|
|MLE | J14312161| 120| BOD| N/A| 3RD U|
|Other| NA| Pmp| 2EA2X774| 78|
|BOD | NULL| Pmp| 2EA4F075| 38|
A bit more information: I am receiving this information in the form of five Excel spreadsheets, each with over 400 columns. The columns giving me the biggest headache are the 20 part columns that I need to place into the SQL table.
I need to somehow move each row into tblHW, but need to do something like this:
The first row, AN 21-919G, needs SN1 inserted into sn_mtr1 since it is a BOD, SN2 into SN_pmp1 since it is a Pmp, and SN3 into sn_pmp2 since it is the second Pmp here. I also need to get the pump count (in this case 2), and then add the corresponding attribute columns (attribute2_7 and attribute3_7 here) to put into numstages when the parts are Pmp.
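Sketched very roughly in pandas, just to illustrate the per-row mapping I am after (the column names and the number of part groups are simplified placeholders):
import pandas as pd

def map_parts(row):
    # Walk the Part{i}/SN{i}/Attribute{i}_7 groups and build one tblHW-style record
    out = {"Name": row["Name"], "PmpCount": 0, "numstages": 0,
           "sn_pmp1": None, "sn_pmp2": None, "sn_mtr1": None}
    for i in (1, 2, 3):  # extend to however many part groups the sheet actually has
        part, sn, attr = row.get(f"Part{i}"), row.get(f"SN{i}"), row.get(f"Attribute{i}_7")
        if part == "Pmp":
            out["PmpCount"] += 1
            stages = pd.to_numeric(attr, errors="coerce")
            if pd.notna(stages):
                out["numstages"] += stages
            if out["PmpCount"] == 1:
                out["sn_pmp1"] = sn
            elif out["PmpCount"] == 2:
                out["sn_pmp2"] = sn
        elif part == "BOD" and out["sn_mtr1"] is None:
            out["sn_mtr1"] = sn
    return pd.Series(out)

# mapped = pull_down_df.apply(map_parts, axis=1)  # pull_down_df loaded from the spreadsheet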
Situations like this are the whole reason SSIS (Integration Services) exists!
First, one would question why the data you need is in Excel, and whether there is a more direct route one could exploit, such as a linked server (if the source is another RDBMS).
Based on the information you provide, we make the following assumptions:
A) We have no control over the source output and we must import the data from Excel.
B) The files always have consistent columns (probably created by an automated process).
In SSIS you can easily create a source connection for the Excel file. If the Excel file name is dynamic, you can create a script to modify the connection string for that connection before importing data. Then set the destination connection to the SQL Server. The last step is creating a Data Flow Task where you can map the source to the destination columns.
Example: