pandas.DataFrame.to_sql inserts data, but doesn't commit the transaction - sql-server

I have a pandas DataFrame that I'm trying to insert into MS SQL Server Express, as shown below:
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("mssql+pyodbc://user:password@testodbc")
connection = engine.connect()
data = {'Host': ['HOST1', 'HOST2', 'HOST3', 'HOST4'],
        'Product': ['Apache HTTP 2.2', 'RedHat 6.9', 'OpenShift 2', 'JRE 1.3'],
        'ITBS': ['Infrastructure', 'Accounting', 'Operations', 'Accounting'],
        'Remediation': ['Upgrade', 'No plan', 'Decommission', 'Decommission'],
        'TargetDate': ['2018-12-31', 'NULL', '2019-03-31', '2019-06-30']}
df = pd.DataFrame(data)
When I call:
df.to_sql(name='TLMPlans', con=connection, index=False, if_exists='replace')
and then:
print(engine.execute("SELECT * FROM TLMPlans").fetchall())
I can see the data alright, but it actually doesn't commit any transaction:
D:\APPS\Python\python.exe
C:/APPS/DashProjects/dbConnectors/venv/Scripts/readDataFromExcel.py
[('HOST1', 'Apache HTTP 2.2', 'Infrastructure', 'Upgrade', '2018-12-31'), ('HOST2', 'RedHat 6.9', 'Accounting', 'No plan', 'NULL'), ('HOST3', 'OpenShift 2', 'Operations', 'Decommission', '2019-03-31'), ('HOST4', 'JRE 1.3', 'Accounting', 'Decommission', '2019-06-30')]
Process finished with exit code 0
It says here I don't have to commit as SQLAlchemy does it:
Does the Pandas DataFrame.to_sql() function require a subsequent commit()?
and the suggestions below don't work:
Pandas to_sql doesn't insert any data in my table
I spent a good 3 hours looking for clues all over the Internet, but I'm not getting any relevant answers, or maybe I don't know how to ask the question.
Any guidance on what to look for would be highly appreciated.
UPDATE
I'm able to commit changes using a pyodbc connection and a full INSERT statement, but pandas.DataFrame.to_sql() with an SQLAlchemy engine doesn't work. It sends the data to memory instead of to the actual database, regardless of whether a schema is specified or not.
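For reference, a rough sketch of the raw-pyodbc path that does commit (hypothetical connection details, same table and columns as above):
import pyodbc

conn = pyodbc.connect("DSN=testodbc;UID=user;PWD=password")
cursor = conn.cursor()
cursor.execute(
    "INSERT INTO TLMPlans (Host, Product, ITBS, Remediation, TargetDate) VALUES (?, ?, ?, ?, ?)",
    ('HOST1', 'Apache HTTP 2.2', 'Infrastructure', 'Upgrade', '2018-12-31'))
conn.commit()  # without this, the insert is rolled back when the connection closes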
I would really appreciate help with this one, or is this possibly a pandas issue I need to report?

I had the same issue; I realised you need to tell pyodbc which database you want to use. For me the default was master, so my data ended up there.
There are two ways you can do this, either:
connection.execute("USE <dbname>")
Or define the schema in the df.to_sql():
df.to_sql(name='<TABLE_NAME>', con=connection, schema='<dbname>.dbo')
In my case the schema was <dbname>.dbo. I think .dbo is the default, so it could be something else if you have defined an alternative schema.
This was referenced in this answer; it just took me a bit longer to realise what the schema name should be.
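A minimal sketch putting the two options together (the <dbname> placeholder and table name are illustrative; only one of the two is needed):
# option 1: switch the connection to the right database first
connection.execute("USE <dbname>")
df.to_sql(name='TLMPlans', con=connection, index=False, if_exists='replace')

# option 2: qualify the schema in the to_sql call instead
df.to_sql(name='TLMPlans', con=connection, schema='<dbname>.dbo', index=False, if_exists='replace')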

Related

How to set spark.sql.shuffle.partitions when using the latest Spark version

I want to reset the spark.sql.shuffle.partitions configuration in my PySpark code, since I need to join two big tables. But the following code does not work in the latest Spark version; the error says that there is no method "setConf" in xxx:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
spark.sparkContext.setConf("spark.sql.shuffle.partitions", "1000")
spark.sparkContext.setConf("spark.default.parallelism", "1000")
# or using the following; neither works
spark.setConf("spark.sql.shuffle.partitions", "1000")
spark.setConf("spark.default.parallelism", "1000")
I would like to know how to reset the "spark.sql.shuffle.partitions" now.
SparkSession provides a RuntimeConfig interface to set and get Spark-related parameters. The answer to your question would be:
spark.conf.set("spark.sql.shuffle.partitions", 1000)
Refer: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RuntimeConfig
I missed that your question was about PySpark. PySpark has a similar interface, spark.conf.
Refer: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.conf
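A minimal sketch of the PySpark variant (assuming a standard SparkSession; spark.sql.shuffle.partitions is a runtime SQL config, so it can be changed on the fly):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "1000")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '1000'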
Please beware that we discovered a defect in the Spark SQL "Group By" / "Distinct" implementation when the number of shuffle partitions is set to greater than 2000. We tested with a dataset of around 3000 records with 38 columns, of which about 1800 records were unique.
When we ran the "Distinct" or "Group By" query over the 38 columns with "spark.sql.shuffle.partitions" set to 2001, the count of distinct records came back as less than 1800, say 1794. However, when we set it to 2000, the same query gave us a record count of 1800.
So basically, Spark is incorrectly dropping a few records when the number of shuffle partitions is greater than 2000.
We tested with Spark v2.3.1 and will file a bug in JIRA soon. I need to prepare test data to demonstrate it, but we have already confirmed it with our real-world dataset.

SQLAlchemy - cannot reflect a SQL Server DB running on Amazon RDS

My code is simple:
app = Flask(__name__)
app.config.from_object('config')
db = SQLAlchemy(app)
db.metadata.reflect()
And it throws no errors. However, when I inspect the metadata after this reflection, it returns an empty immutabledict object.
The parameters in my connection string are 100% correct, and the code works with non-RDS databases.
It seems to happen to others as well but I can't find a solution.
Also, I have tried to limit the reflection to specific tables using the "only" parameter in the metadata.reflect function, and this is the error I get:
sqlalchemy.exc.InvalidRequestError: Could not reflect: requested table(s) not available in mssql+pyodbc://{connection_string}: (users)
I've fixed it. The reflect() method of the SQLAlchemy class has a parameter named 'schema'. Setting this parameter, to "dbo" in my case, solved it.
I am using Flask-SQLAlchemy, which does not have the said parameter in its reflect() method. You can follow this post to gain access to that parameter and others, such as 'only'.
This error occurs when reflect is called without the schema name provided. For example, this will cause the error to happen:
metadata.reflect(only = [tableName])
It needs to be updated to use the schema of the table you are trying to reflect over like this:
metadata.reflect(schema=schemaName, only = [tableName])
You have to set schema='dbo' as a parameter for reflect:
db.Model.metadata.reflect(bind=engine, schema='dbo', only=['User'])
and then create model of your table:
class User(db.Model):
    __table__ = db.Model.metadata.tables['dbo.User']
and to access data from that table:
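something along these lines should work (a minimal sketch, assuming the reflected User model above and the default Flask-SQLAlchemy session):
# inspect the reflected columns, then fetch a few rows
print(User.__table__.columns.keys())
some_users = db.session.query(User).limit(10).all()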

How to indicate the database in SparkSQL over Hive in Spark 1.3

I have simple Scala code that retrieves data from the Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark that I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, which is slow. Unfortunately I can't force it to use "spark".
I tried to use SQLContext instead, replacing the HiveContext with hc = new SQLContext(sc), to see if performance would improve. With this change the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database
is supported in later Spark versions
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the statement in two separate spark.sql calls like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The Hive context gives you the ability to create a DataFrame using Hive's metastore. Spark only uses the metastore from Hive; it doesn't use Hive as a processing engine to retrieve the data. So when you create the df using your SQL query, it's really just asking Hive's metastore "Where is the data, and what's the format of the data?"
Spark takes that information and runs the process against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext, you remove the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
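To illustrate, a rough PySpark sketch of the same idea (assuming Spark 1.3-style contexts and the myDatabase/account names from the question):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local')
hc = HiveContext(sc)      # keeps the link to the Hive metastore
hc.sql("use myDatabase")  # HiveQL 'use' works here, unlike with a plain SQLContext
rdd = hc.sql("select PRODUCT_CODE, DATA_UNIT from account").rdd
# or qualify the table with the database name and skip 'use' entirely:
rdd2 = hc.sql("select PRODUCT_CODE, DATA_UNIT from myDatabase.account").rdd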
I have not been able to implement the use database command, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

Performance issue with django exclude

I have a Django 1.8 application, and I am using an MS SQL database with pyodbc as the DB backend (via the "django-pyodbc-azure" module).
I have the following models:
class Branch(models.Model):
    name = models.CharField(max_length=30)
    startTime = models.DateTimeField()

class Device(models.Model):
    uid = models.CharField(max_length=100, primary_key=True)
    type = models.CharField(max_length=20)
    firstSeen = models.DateTimeField()
    lastSeen = models.DateTimeField()

class Session(models.Model):
    device = models.ForeignKey(Device)
    branch = models.ForeignKey(Branch)
    start = models.DateTimeField()
    end = models.DateTimeField(null=True, blank=True)
I need to query the session model, and I want to exclude some records with specific device values. So I issue the following query:
sessionCount = (Session.objects.filter(branch=branch)
                .exclude(device__in=badDevices)
                .filter(end__gte=F('start') + timedelta(minutes=30))
                .count())
badDevices is a pre-filled list of device ids with around 60 items.
badDevices = ['id-1', 'id-2', ...]
This query takes around 1.5 seconds to complete. If I remove the exclude from the query, it takes around 250 milliseconds.
I printed the generated SQL for this queryset and tried it in my database client. There, both versions executed in around 250 milliseconds.
This is the generated SQL:
SELECT [session].[id], [session].[device_id], [session].[branch_id], [session].[start], [session].[end]
FROM [session]
WHERE ([session].[branch_id] = my-branch-id AND
NOT ([session].[device_id] IN ('id-1', 'id-2', 'id-3',...)) AND
DATEPART(dw, [session].[start]) = 1
AND [session].[end] IS NOT NULL AND
[session].[end] >= ((DATEADD(second, 600, CAST([session].[start] AS datetime)))))
So using the exclude at the database level doesn't seem to affect query performance, but in Django the query runs 6 times slower if I add the exclude part. What could be causing this?
The general issue seems to be that Django is doing some extra work to prepare the exclude clause. After that step, by the time the SQL has been generated and sent to the database, there isn't anything interesting happening on the Django side that could cause such a significant delay.
In your case, one thing that might be causing this is some kind of pre-processing of badDevices. If, for instance, badDevices is a QuerySet, then Django might be executing the badDevices query just to prepare the actual query's SQL. Something similar might possibly be happening in the case where device has a non-default primary key.
The other thing that might delay the SQL preparation is, of course, django-pyodbc-azure. Maybe it's doing something strange while compiling the query and it becomes a bottleneck.
This is all wild speculation though, so if you're still having this issue then post the Device and Branch models as well, the exact content of badDevices and the SQL generated from the queries. Then maybe some scenarios can at least be eliminated.
EDIT: I think it must be the Device.uid field. Possibly Django or pyodbc is getting confused by the non-default primary key and is fetching all the devices while generating the query. Try two things:
Replace device__in with device_id__in, device__pk__in and device__uid__in and check each one again (a short sketch of these variants follows below). Maybe a more explicit query will be easier for Django to translate into SQL. You can even try replacing branch with branch_id, just in case.
If the above doesn't work, try replacing the exclude expression with a raw SQL where clause:
# add quotes (because of the hyphens) & join
badDevicesIdString = ", ".join(["'%s'" % id for id in badDevices])
# Replaces .exclude()
... .extra(where=['device_id NOT IN (%s)' % badDevicesIdString])
If neither works, then most likely the problem is with the whole query and not just exclude. There are some more options in that case but try the above first and I will update my answer later if necessary.
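As a rough sketch of the first suggestion (reusing the models and badDevices list from the question; which lookup helps, if any, would need to be measured):
from datetime import timedelta
from django.db.models import F

base = Session.objects.filter(branch=branch)
# progressively more explicit lookups; one of them may be cheaper for Django to compile
q1 = base.exclude(device_id__in=badDevices)
q2 = base.exclude(device__uid__in=badDevices)
sessionCount = q1.filter(end__gte=F('start') + timedelta(minutes=30)).count()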
Just want to share a similar problem that I had with MySQL and exclude clause performance, and how it was fixed.
When running the exclude clause, the list for the "in" lookup was actually a QuerySet that I had obtained using the values_list method. Checking the exclude query executed by MySQL, the "in" objects were not values but actually another query (a subquery). This behaviour was hurting performance on certain large queries.
To fix that, instead of passing the QuerySet, I flattened it out into a Python list of values. By doing that, each value is passed as an argument inside the in lookup, and performance improved considerably.
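A rough sketch of that change (hypothetical model and field names):
# passing a values_list QuerySet embeds a subquery inside IN (...)
bad_ids = Device.objects.filter(type='legacy').values_list('uid', flat=True)

# evaluating it into a plain list first sends literal values instead
sessions = Session.objects.exclude(device_id__in=list(bad_ids))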

How to decode OLAP Query?

I am totally new to OLAP servers. I have an OLAP query that is working fine; I just want to know which tables are joined to produce the result, and how (I mean with which joins). Here is the query:
WITH
  MEMBER [Measures].[ThisYearMonthToDate] AS
    'Sum({[Time].[All Time].[2013].[Q1].[January], [Time].[All Time].[2013].[Q1].[February],
          [Time].[All Time].[2013].[Q1].[March], [Time].[All Time].[2013].[Q2].[April],
          [Time].[All Time].[2013].[Q2].[May]}, [Measures].[Main Temp Id])'
  MEMBER [Measures].[LastYearMonthToDate] AS
    'Sum({[Time].[All Time].[2012].[Q1].[January], [Time].[All Time].[2012].[Q1].[February],
          [Time].[All Time].[2012].[Q1].[March], [Time].[All Time].[2012].[Q2].[April],
          [Time].[All Time].[2012].[Q2].[May]}, [Measures].[Main Temp Id])'
SELECT
  {[Measures].[LastYearMonthToDate], [Measures].[ThisYearMonthToDate]} ON COLUMNS,
  {([PublicRegion].[All Regions].[USA]),
   ([PublicRegion].[All Regions].[USA].[Northeast]),
   ([PublicRegion].[All Regions].[USA].[Midwest]),
   ([PublicRegion].[All Regions].[USA].[Southeast]),
   ([PublicRegion].[All Regions].[USA].[Southwest]),
   ([PublicRegion].[All Regions].[USA].[West Coast]),
   ([PublicRegion].[All Regions].[USA].[Misc]),
   ([PublicRegion].[All Regions].[Europe]),
   ([PublicRegion].[All Regions].[Europe].[UK]),
   ([PublicRegion].[All Regions].[Europe].[France]),
   ([PublicRegion].[All Regions].[Europe].[Italy]),
   ([PublicRegion].[All Regions].[Europe].[Germany]),
   ([PublicRegion].[All Regions].[Europe].[Spain]),
   ([PublicRegion].[All Regions].[Canada]),
   ([PublicRegion].[All Regions].[Other])} ON ROWS
FROM Public
I don't understand how to decode this query. Please help me.
There are two pretty easy ways to find out:
Log of your OLAP server: I'm almost sure that all leading OLAP tools log the SQL queries they send to the database server.
Log of your database server: set your database to log all queries from all users. By the time of execution and the user name you declared in the metadata file, you can easily filter the queries sent by the OLAP tool.
Hope this helps,
Best regards
