How to set spark.sql.shuffle.partitions when using the latest Spark version - shuffle

I want to reset the spark.sql.shuffle.partitions configuration in my PySpark code, since I need to join two big tables. But the following code doesn't work in the latest Spark version; the error says there is no method "setConf" in xxx.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)
spark.sparkContext.setConf("spark.sql.shuffle.partitions", "1000")
spark.sparkContext.setConf("spark.default.parallelism", "1000")
# or using the following; neither works
spark.setConf("spark.sql.shuffle.partitions", "1000")
spark.setConf("spark.default.parallelism", "1000")
I would like to know how to reset the "spark.sql.shuffle.partitions" now.

SparkSession provides a RuntimeConfig interface to set and get Spark-related parameters. The answer to your question would be:
spark.conf.set("spark.sql.shuffle.partitions", 1000)
Refer: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RuntimeConfig
I've missed that your question was about pyspark. Pyspark has a similar interface spark.conf.
Refer: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.conf
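In PySpark that looks like the snippet below; a minimal sketch (local master, made-up partition count) showing the RuntimeConfig interface rather than anything specific to your job:
from pyspark.sql import SparkSession
# Configs can be set when the session is built...
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "1000")
         .getOrCreate())
# ...or changed afterwards through the RuntimeConfig interface.
spark.conf.set("spark.sql.shuffle.partitions", 1000)
print(spark.conf.get("spark.sql.shuffle.partitions"))  # -> '1000'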

Please beware that we discovered a defect in the Spark SQL "Group By" / "Distinct" implementation when spark.sql.shuffle.partitions is set to greater than 2000. We tested with a dataset of around 3000 records with 38 columns, of which about 1800 records were unique.
When we ran the "Distinct" or "Group By" query over the 38 columns with "spark.sql.shuffle.partitions" set to 2001, the count of distinct records came out as less than 1800, e.g. 1794. However, when we set it to 2000, the same query gave us a record count of 1800.
So basically, Spark incorrectly drops a few records when the number of shuffle partitions is greater than 2000.
We tested with Spark v2.3.1 and will file a bug in Jira soon. I still need to prepare test data to demonstrate it, but we have already confirmed it with our real-world dataset.
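For anyone who wants to check whether their own data is affected, a rough sketch of the comparison we ran (the input path is a placeholder for your real wide dataset):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/wide_dataset")  # placeholder for the real 38-column data
counts = {}
for n in (2000, 2001):
    spark.conf.set("spark.sql.shuffle.partitions", n)
    counts[n] = df.distinct().count()
# The two counts should be identical; a mismatch reproduces the behaviour described above.
print(counts)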

Related

Quickly update django model objects from pandas dataframe

I have a Django model that records transactions. I need to update only some of the fields (two) of some of the transactions.
In order to update, the user is asked to provide additional data and I use pandas to make calculations using this extra data.
I use the output from the pandas script to update the original model like this:
for i in df.tnsx_uuid:
    t = Transactions.objects.get(tnsx_uuid=i)
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
    t.save()
This is very slow. What is the best way to do this?
UPDATE:
After some more research, I found bulk_update and changed the code to:
transactions = Transactions.objects.select_for_update()\
    .filter(tnsx_uuid__in=list(df.tnsx_uuid)).only('start_bal', 'end_bal')
for t in transactions:
    i = t.tnsx_uuid
    t.start_bal = df.loc[df.tnsx_uuid==i].start_bal.values[0]
    t.end_bal = df.loc[df.tnsx_uuid==i].end_bal.values[0]
Transactions.objects.bulk_update(transactions, ['start_bal', 'end_bal'])
this has approximately halved the time required.
How can I improve performance further?
I have been looking for the answer to this question and haven't found any authoritative, idiomatic solutions. So, here's what I've settled on for my own use:
transactions = Transactions.objects.filter(tnsx_uuid__in=list(df.tnsx_uuid))
# Build a DataFrame of Django model instances
trans_df = pd.DataFrame([{'tnsx_uuid': t.tnsx_uuid, 'object': t} for t in transactions])
# Join the Django instances to the main DataFrame on the index
df = df.join(trans_df.set_index('tnsx_uuid'))
for obj, start_bal, end_bal in zip(df['object'], df['start_bal'], df['end_bal']):
    obj.start_bal = start_bal
    obj.end_bal = end_bal
Transactions.objects.bulk_update(df['object'], ['start_bal', 'end_bal'])
I don't know how DataFrame.loc[] is implemented, but it could be slow if it needs to search the whole DataFrame on each use rather than just do a hash lookup. For that reason, and to simplify things by doing a single iteration loop, I pulled all of the model instances into df and then used the recommendation from a Stack Overflow answer on iterating over a DataFrame to loop over the zipped columns of interest.
I looked at the documentation for select_for_update in Django and it isn't apparent to me that it offers a performance improvement, but you may be using it to lock the transaction and make all of the changes atomically. Per the documentation, bulk_update should be faster than saving each object individually.
In my case, I'm only updating 3500 items. I did some timing of the various steps and came up with the following:
3.05 s to query and build the DataFrame
2.79 ms to join the instances to df
5.79 ms to run the for loop and update the instances
1.21 s to bulk_update the changes
So, I think you would need to profile your code to see what is actually taking time, but it is likely a Django issue rather than a Pandas issue.
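A rough sketch of how those steps can be timed together, using the model and field names from the question (the batch_size argument is an optional extra, not part of my original timing):
import time
import pandas as pd
t0 = time.perf_counter()
transactions = Transactions.objects.filter(tnsx_uuid__in=list(df.tnsx_uuid))
# Build a DataFrame of model instances keyed by tnsx_uuid
trans_df = pd.DataFrame([{'tnsx_uuid': t.tnsx_uuid, 'object': t} for t in transactions])
t1 = time.perf_counter()
# Join the instances onto the calculation results and copy the new values over
merged = df.join(trans_df.set_index('tnsx_uuid'), on='tnsx_uuid')
for obj, start_bal, end_bal in zip(merged['object'], merged['start_bal'], merged['end_bal']):
    obj.start_bal = start_bal
    obj.end_bal = end_bal
t2 = time.perf_counter()
# Write everything back with a single (batched) bulk_update call
Transactions.objects.bulk_update(merged['object'], ['start_bal', 'end_bal'], batch_size=1000)
t3 = time.perf_counter()
print(f"query+build: {t1 - t0:.2f}s, join+assign: {t2 - t1:.3f}s, bulk_update: {t3 - t2:.2f}s")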
I faced more or less the same issue (almost the same quantity of records, ~3500), and I would like to add:
bulk_update seems to perform a lot worse than bulk_create; in my case deleting the objects was allowed, so instead of bulk_updating I delete all objects and then recreate them.
I used the same approach as you (thanks for the idea), but with some modifications:
a) I create the dataframe from the query itself:
all_objects_values = all_objects.values('id', 'date', 'amount')
self.df_values = pd.DataFrame.from_records(all_objects_values)
b) Then I create the column of objects without iterating (I make sure these are ordered):
self.df_values['object'] = list(all_objects)
c) To update the object values (after the operations done in my dataframe), I iterate over the rows (not sure about the performance difference):
for index, row in self.df_values.iterrows():
    row['object'].amount = row['amount']
d) At the end, I re-create all objects:
MyModel.objects.bulk_create(self.df_values['object'].tolist())
Conclusion:
In my case, the most time-consuming step was the bulk_update, so re-creating the objects solved it for me (from 19 seconds with bulk_update down to 10 seconds with delete + bulk_create).
In your case, using my approach may improve the time for all the other operations as well.
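Put together, the delete-and-recreate variant looks roughly like this; MyModel and the field names are the placeholders from my snippets above, and it only makes sense if deleting the rows really is acceptable (no protected foreign keys, etc.):
import pandas as pd
from django.db import transaction
# Fix the ordering so the values() frame and the instance list line up row by row
all_objects = MyModel.objects.order_by('pk')
# a) build the dataframe from the query, b) attach the instances as a column
df_values = pd.DataFrame.from_records(all_objects.values('id', 'date', 'amount'))
df_values['object'] = list(all_objects)
# ... pandas calculations that produce the new 'amount' values go here ...
# c) copy the results back onto the in-memory instances
for index, row in df_values.iterrows():
    row['object'].amount = row['amount']
# d) atomically swap the old rows for the recreated ones
with transaction.atomic():
    all_objects.delete()
    MyModel.objects.bulk_create(df_values['object'].tolist())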

pandas.DataFrame.to_sql inserts data, but doesn't commit the transaction

I have a pandas dataframe I'm trying to insert into MS SQL EXPRESS as per below:
import pandas as pd
import sqlalchemy
engine = sqlalchemy.create_engine("mssql+pyodbc://user:password@testodbc")
connection = engine.connect()
data = {'Host': ['HOST1','HOST2','HOST3','HOST4'],
        'Product': ['Apache HTTP 2.2','RedHat 6.9','OpenShift 2','JRE 1.3'],
        'ITBS': ['Infrastructure','Accounting','Operations','Accounting'],
        'Remediation': ['Upgrade','No plan','Decommission','Decommission'],
        'TargetDate': ['2018-12-31','NULL','2019-03-31','2019-06-30']}
df = pd.DataFrame(data)
When I call:
df.to_sql(name='TLMPlans', con=connection, index=False, if_exists='replace')
and then:
print(engine.execute("SELECT * FROM TLMPLans").fetchall())
I can see the data alright, but it actually doesn't commit any transaction:
D:\APPS\Python\python.exe
C:/APPS/DashProjects/dbConnectors/venv/Scripts/readDataFromExcel.py
[('HOST1', 'Apache HTTP 2.2', 'Infrastructure', 'Upgrade', '2018-12-31'), ('HOST2', 'RedHat 6.9', 'Accounting', 'No plan', 'NULL'), ('HOST3', 'OpenShift 2', 'Operations', 'Decommission', '2019-03-31'), ('HOST4', 'JRE 1.3', 'Accounting', 'Decommission', '2019-06-30')]
Process finished with exit code 0
It says here I don't have to commit as SQLAlchemy does it:
Does the Pandas DataFrame.to_sql() function require a subsequent commit()?
and the below suggestions don't work:
Pandas to_sql doesn't insert any data in my table
I spent a good 3 hours looking for clues all over the Internet, but I'm not getting any relevant answers, or maybe I don't know how to ask the question.
Any guidance on what to look for would be highly appreciated.
UPDATE
I'm able to commit changes using a pyodbc connection and a full insert statement; however, pandas.DataFrame.to_sql() with the SQLAlchemy engine doesn't work. It sends the data to memory instead of the actual database, regardless of whether the schema is specified or not.
I would really appreciate help with this one, or is it possibly a pandas issue I need to report?
I had the same issue; I realised you need to tell pyodbc which database you want to use. For me the default was master, so my data ended up there.
There are two ways you can do this, either:
connection.execute("USE <dbname>")
Or define the schema in the df.to_sql():
df.to_sql(name='<TABLENAME>', con=connection, schema='<dbname>.dbo')
In my case the schema was <dbname>.dbo. I think .dbo is the default, so it could be something else if you have defined an alternative schema.
This was referenced in this answer; it just took me a bit longer to realise what the schema name should be.
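A rough sketch of that second option, with a placeholder DSN and database name (and TLMPlans borrowed from the question); the point is that the fully-qualified schema pins the table to the intended database instead of master:
import pandas as pd
import sqlalchemy
# Placeholder DSN/credentials; 'mydb.dbo' is the fully-qualified schema of the target database
engine = sqlalchemy.create_engine("mssql+pyodbc://user:password@testodbc")
df = pd.DataFrame({'Host': ['HOST1', 'HOST2'],
                   'Product': ['Apache HTTP 2.2', 'RedHat 6.9']})
df.to_sql(name='TLMPlans', con=engine, schema='mydb.dbo',
          index=False, if_exists='replace')
# SQLAlchemy 1.x style string execution, matching the question's code
print(engine.execute("SELECT * FROM mydb.dbo.TLMPlans").fetchall())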

How to indicate the database in SparkSQL over Hive in Spark 1.3

I have a simple piece of Scala code that retrieves data from the Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark that I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, and that is slow. Unfortunately I can't force it to use "spark".
I tried to use SQLContext instead, replacing it with hc = new SQLContext(sc), to see if performance would improve. With this change the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database
is supported in later Spark versions
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the statement in two separate spark.sql calls like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The Hive context gives you the ability to create a dataframe using Hive's metastore. Spark only uses the metastore from Hive; it doesn't use Hive as a processing engine to retrieve the data. So when you create the df using your SQL query, it's really just asking Hive's metastore "Where is the data, and what's the format of the data?"
Spark takes that information and runs the process against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext, you are removing the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
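A minimal PySpark sketch of the same idea, using the 1.x-era HiveContext and the database and table names from the question; the metastore lookup works whether you switch databases with use or qualify the table name directly:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext('local')
hc = HiveContext(sc)  # backed by the Hive metastore, unlike a plain SQLContext
# Either switch the current database first...
hc.sql("use myDatabase")
rdd = hc.sql("select PRODUCT_CODE, DATA_UNIT from account").rdd
# ...or qualify the table with the database name and skip the 'use' statement.
rdd2 = hc.sql("select PRODUCT_CODE, DATA_UNIT from myDatabase.account").rdd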
I have not been able to implement the use database command, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

Performance issue with django exclude

I have a Django 1.8 application, and I am using an MS SQL database, with pyodbc as the db backend (using the "django-pyodbc-azure" module).
I have the following models:
class Branch(models.Model):
    name = models.CharField(max_length=30)
    startTime = models.DateTimeField()

class Device(models.Model):
    uid = models.CharField(max_length=100, primary_key=True)
    type = models.CharField(max_length=20)
    firstSeen = models.DateTimeField()
    lastSeen = models.DateTimeField()

class Session(models.Model):
    device = models.ForeignKey(Device)
    branch = models.ForeignKey(Branch)
    start = models.DateTimeField()
    end = models.DateTimeField(null=True, blank=True)
I need to query the session model, and I want to exclude some records with specific device values. So I issue the following query:
sessionCount = Session.objects.filter(branch=branch) \
    .exclude(device__in=badDevices) \
    .filter(end__gte=F('start')+timedelta(minutes=30)).count()
badDevices is a pre-filled list of device ids with around 60 items.
badDevices = ['id-1', 'id-2', ...]
This query takes around 1.5 seconds to complete. If I remove the exclude from the query, it takes around 250 milliseconds.
I printed the generated SQL for this queryset and tried it in my database client. There, both versions executed in around 250 milliseconds.
This is the generated SQL:
SELECT [session].[id], [session].[device_id], [session].[branch_id], [session].[start], [session].[end]
FROM [session]
WHERE ([session].[branch_id] = my-branch-id AND
NOT ([session].[device_id] IN ('id-1', 'id-2', 'id-3',...)) AND
DATEPART(dw, [session].[start]) = 1
AND [session].[end] IS NOT NULL AND
[session].[end] >= ((DATEADD(second, 600, CAST([session].[start] AS datetime)))))
So, using the exclude at the database level doesn't seem to affect the query performance, but in Django the query runs 6 times slower if I add the exclude part. What could be causing this?
The general issue seems to be that django is doing some extra work to prepare the exclude clause. After that step and by the time the SQL has been generated and sent to the database, there isn't anything interesting happening on the django side that could cause such a significant delay.
In your case, one thing that might be causing this is some kind of pre-processing of badDevices. If, for instance, badDevices is a QuerySet then django might be executing the badDevices query just to prepare the actual query's SQL. Possibly something similar might be happening in the case where device has a non-default primary key.
The other thing that might delay the SQL preparation is, of course, django-pyodbc-azure. Maybe it's doing something strange while compiling the query and it becomes a bottleneck.
This is all wild speculation though, so if you're still having this issue then post the Device and Branch models as well, the exact content of badDevices and the SQL generated from the queries. Then maybe some scenarios can be at least eliminated.
EDIT: I think it must be the Device.uid field. Possibly django or pyodbc is getting confused by the non-default primary key and is fetching all the devices while generating the query. Try two things:
Replace device__in with device_id__in, device__pk__in and device__uid__in and check each one again. Maybe a more explicit query will be easier for django to translate into SQL. You can even try replacing branch with branch_id, just in case.
If the above doesn't work, try replacing the exclude expression with a raw SQL where clause:
# add quotes (because of the hyphens) & join
badDevicesIdString = ", ".join(["'%s'" % id for id in badDevices])
# Replaces .exclude()
... .extra(where=['device_id NOT IN (%s)' % badDevicesIdString])
If neither works, then most likely the problem is with the whole query and not just exclude. There are some more options in that case but try the above first and I will update my answer later if necessary.
Just want to share a similar problem that I had with MySQL and exclude clause performance, and how it was fixed.
When running the exclude clause, the list used with the "in" lookup was actually a QuerySet that I had obtained using the values_list method. Checking the exclude query executed by MySQL, the "in" elements were not plain values but actually another query. This behaviour was hurting performance on certain large queries.
To fix that, instead of passing the queryset, I flattened it into a Python list of values. By doing that, each value is passed as an argument inside the in lookup, and performance improved significantly.
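In code the difference is roughly this (Session and Device are the models from the question above; the filter on type is just an example of how the bad-device queryset might be built):
# Passing the queryset straight through: the backend sees another query (a subquery),
# which was the slow path in my case.
bad_devices_qs = Device.objects.filter(type='bad').values_list('uid', flat=True)
slow_count = Session.objects.exclude(device__in=bad_devices_qs).count()
# Flattening it into a plain Python list first: the ids are passed as plain
# parameters of a NOT IN (...) clause, which performed much better.
bad_device_ids = list(bad_devices_qs)
fast_count = Session.objects.exclude(device__in=bad_device_ids).count()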

Django Query: Annotate with `count` of a *window*

I am searching for a query which is pretty similar to this one. But as an extension, I do not want to count all objects, but only the ones that are fairly recent.
In my case, there are two models. Let one be the Source and one be the Data. As a result I'd like to get a list of all Sources ordered by the number of data records that have been collected during the last week.
For me it is not interesting how many data records have been collected in total, but whether there is recent activity for that source.
Using the following code snippet from the above link, I cannot figure out how to restrict the Data table to the recent records first.
from django.db.models import Count
activity_per_source = Source.objects.annotate(count_data_records=Count('Data')) \
    .order_by('-count_data_records')
The only ways I came up with would be to write native SQL or to process this in a loop with individual queries. Is there a Django query version?
(I use a MySQL database and Django 1.5.4)
Check out the docs on the order of annotate and filter clauses: https://docs.djangoproject.com/en/1.5/topics/db/aggregation/#order-of-annotate-and-filter-clauses
Try something along the lines of:
activity_per_source = Source.objects.\
filter(data__date__gte=one_week_ago).\
annotate(count_data_records=Count('Data')).\
order_by('-count_data_records').distinct()
There is a way of doing that mixing Django queries with SQL via extra:
import datetime
from django.db.models import Count

start_date = datetime.date.today() - datetime.timedelta(days=7)
activity_per_source = (
    Source.objects
    .extra(where=["(select max(date) from app_data where source_id=app_source.id) >= '%s'"
                  % start_date.strftime('%Y-%m-%d')])
    .annotate(count_data_records=Count('Data'))
    .order_by('-count_data_records'))
The where part filters the Sources by the last date of their Data.
Note: replace table and field names with actual ones.
