Pandas Chunksize - sql-server

I have the following code which appends each chunksize to a master excel file.
df = pd.read_sql(query,cnxn,chunksize=10000)
for chunk in df:
chunk.to_excel(r'C:\File\Path\file.xlsx', mode='a', sep=',',encoding='utf-8')
The issue I'm running into is that I'm expecting a result of around 8 million rows. With excel's limit of about 1 mil rows, I am considering two options:
For every 1mil rows, add a new sheet and continue the process. This is the preferred method.
Creating a new workbook for each set of 1mil results
Any recommendations for which approach is best? I know I'll need to modify the mode='a' part of the code.
Here is the error message I'm receiving:
ProgrammingError: ('42000', '[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information. (8623) (SQLExecDirectW)')
Thank you in advance
EDIT: some further clarification
The SQL query is using a list of 200,000 strings as my key and pulling all rows that contain at least one of the strings. There are multiple rows with the same string identifier which is why i'm expecting a result of about 8 million. This is also why I believe I'm getting the programming error.

The query processor ran out of internal resources and could not produce a query plan.
The SQL query is using a list of 200,000 strings
To simplify the query, pass the 200,000 strings in a JSON document and parse them on the server as in this answer, or load them into a temp table and reference that in your query.
Embedding them in the query text is a bad idea.

Related

Simple SSIS from Oracle to SQL Server drops rows

I have a very simple SQL query in my SSIS (VS 2017) Data Flow. It connects to Oracle via Native OLE DB\Oracle Provider for OLE DB and uses SQL Command to query the Oracle view. The destination table is a SQL Server 2017 table. If I query only the first 20 columns or so (I am querying 57 columns), I get all 1,060,000ish records. As I start to add more columns, the rowcount drops. I have already removed any date fields from both tables, and have done quite a few data conversions (source table has several varchar2(4000) fields that need to be SUBSTR to reasonable lengths in the SQL destination table. All fields in the destination table are nullable. When I pull the SQL out of SSIS and run it in SQL Developer, I get the right row count. When I run it in SSIS, it drops from 1.06 M rows to around 28k. I already tried the SQLChick hack (https://www.sqlchick.com/entries/2012/9/2/resolving-missing-records-in-ssis-from-oracle-source.html) doesn't work and causes connection errors (I had to use VS Code to add that property to my Oracle connection, then when I went back to VS, the connection was broken. When opening it back up to re-enter connection credentials, the extra property gets dropped.) I have reduced and increased the Rows per Batch and Maximum insert commit size values to zero avail. I have also set the RetainSameConnection property to True for all the Connection Managers. I'm at a loss! (As you can see from the pics, both jobs finish "successfully".)
This code returns all records:
SELECT
PIDM,
STUDENT_ID,
LAST_NAME,
FIRST_NAME,
MIDDLE_NAME,
LFM_NAME,
FML_NAME,
SORT_NAME,
GENDER,
ETHNIC_CODE,
ETHNIC_CODE_DESC,
LEGACY_CODE,
LEGACY_CODE_DESC,
ADDR_STR_LINE1,
ADDR_STR_LINE2,
ADDR_STR_LINE3,
ADDR_CITY,
ADDR_COUNTY,
ADDR_STATE,
ADDR_NATION,
ADDR_ZIPCODE,
ADDR_AREA_CODE,
ADDR_PHONE
FROM <TABLE_NAME>
This code returns only 28k:
SELECT
PIDM,
STUDENT_ID,
LAST_NAME,
FIRST_NAME,
MIDDLE_NAME,
LFM_NAME,
FML_NAME,
SORT_NAME,
GENDER,
ETHNIC_CODE,
ETHNIC_CODE_DESC,
LEGACY_CODE,
LEGACY_CODE_DESC,
ADDR_STR_LINE1,
ADDR_STR_LINE2,
ADDR_STR_LINE3,
ADDR_CITY,
ADDR_COUNTY,
ADDR_STATE,
ADDR_NATION,
ADDR_ZIPCODE,
ADDR_AREA_CODE,
ADDR_PHONE,
ORIGIN_STR_LINE1,
ORIGIN_STR_LINE2,
ORIGIN_STR_LINE3,
ORIGIN_CITY,
ORIGIN_COUNTY,
ORIGIN_NATION,
ORIGIN_STATE,
ORIGIN_ZIPCODE,
EMAIL,
HIGH_SCHOOL_CODE,
HIGH_SCHOOL_CODE_DESC,
HIGH_SCHOOL_CITY,
HIGH_SCHOOL_STATE,
HIGH_SCHOOL_GPA,
HIGH_SCHOOL_RANK,
PRIOR_COLLEGE_CODE,
PRIOR_COLLEGE_CODE_DESC,
PRIOR_COLLEGE_DEGREE_CODE,
PRIOR_COLLEGE_DEGREE_CODE_DESC,
PRIOR_COLLEGE_CITY,
PRIOR_COLLEGE_STATE,
ADMIT_FLAG,
GENERAL_STUDENT_FLAG,
CURRENT_ENROLLMENT_FLAG,
LETTER_CODES,
CONTACT_CODES,
COMMENT_CODES,
DIRECTORY_EMAIL,
ADDR_DIVISION_CODE,
HIGH_SCHOOL_CLASS_SIZE,
ETHNICITY,
RACE_CODE,
REGULATORY_RACE,
INT_LANG
FROM <TABLE_NAME>
Troubleshooting steps from the comments
If you run the all column version of the query in sql developer (whatever the Oracle query tool is) using the same credentials as the SSIS package, do you get 28k rows or 1M?
1M records are returned in SQL Developer when I use the same credentials SSIS is using. –
As painful as it may be, I would add 1 column, run, observe results. The first time you see a drop in row count, interrogate the heck of the source data (data type, collation, whether some permission thing is at play). If nothing seems out of place, edit the question to include the full table definition and identify what the first source column is that is throwing the results off.
I've done that. Column by column. I've even added a column that already existed (ADDR_STR_LINE1) as ORIGIN_STR_LINE1 and just aliased it, knowing that ADDRR_STR_LINE1 had already worked and both fields shared the exact datatypes/lengths etc. I just ran it with this code:SELECT PIDM, ORIGIN_STR_LINE1, ORIGIN_STR_LINE2, ORIGIN_STR_LINE3, ORIGIN_CITY, ORIGIN_COUNTY, ORIGIN_NATION, ORIGIN_STATE, ORIGIN_ZIPCODE FROM ODSMGR.RECRUIT_PERSON_OSU and it returned 1m records.
While little, consolation, you hitting all the troubleshooting steps I'd employ. I suppose the next item I would try to rule out is some bizarre row width issue/bug. Add a new data flow. As your source query, take one of your varchar2(4000) fields and duplicate it 60 times i.e. SELECT ADDR_STR_LINE1 AS Col0, ADDR_STR_LINE1 AS Col1, ..., ADDR_STR_LINE1 As Col59 FROM Owner.Table and connect that to a Derived Column task (it doesn't need to do anything, just serve as an anchor point) and run it. Do you get 1M or 28k?
Adding more of my troubleshooting steps. 1) Created a view off the original table, casting all of the fields that would need to be truncated as VARCHAR(proper length based on dest table). 2) Added/substracted fields piecemeal, until I thought I had a stable query, knowing that if I added <this fields>, <this many rows> would be dropped. But, for instance, I added PRIOR_COLLEGE_CITY and the first time, my counts dropped from 1063202 to 952755, but then later, I ran it again, and the counts dropped from 1063202 to 953989, so even if it was a data issue (it's not) it's not a consistent one.
Once I got my 953989 rows into the destination table, I compared which PRIOR_COLLEGE_CITY records were missing. In the Source Data Flow, I explicitly queried for those records, and they loaded fine, so again, not a data issue.
According to the picture you provided, when Source component output the records, some records have lost, so we could determine that this problem occurs in Source component.
In this case, please try to check the following thing in your case.
1.Run the query(not the views but the query inside the view) in your Oracle environment when execute the query in Source component, then check whether the number of records(returned from Oracle environment)is equal to the number of records(returned from SSIS Source component). Do this on a separate data task.
2. Check if there are some changes on the source table.
3. If the returned results is correct when running the query in Oracle environment, please try to compare the correct results with the SSIS Source returned results, and analyze the missing data.
I had a similar problem, mostly with odbc driver for oracle!
The problem not only lies on the volume of or rows that returns but in my case, for some reason it grouped the values of the first column also.
The only solution I ve found is to use another driver besides odbc and oledb.
Using the native Oracle Destination and Oracle Source in VS2017 it worked perfect and also the performance was better than odbc and ole db.
enter image description here
I was having a similar issue: 1,470,491 rows in the Oracle view that I was querying, all would come across when I run the package in Visual Studio, but only 377,257 rows would be read when I ran the package from SQL Agent. I tried the SQLChick "UseSessionFormat" hack that you mentioned. While editing the connection string used by the job (it comes in from configuration) I noticed that the connection string in the package had a "USERNAME" parameter as well as a "user id" paramter, but the configuration used by SQL Agent only had "USERNAME". I added "user id" parameter to the configuration used by SQL Agent and after that, the job retrieves all 1,470,491 rows.

SQL Server Except operator - how to identify culprit columns

Just to give background, we have created new SSIS packages to get feeds into SQL Server tables.
To make sure new SSIS packages put data in same manner in which how current live process does, we have written comparison script to compare Live and Test tables which will be executed parallel for a month.
So have i have used EXCEPT command to get the differences and it returns data too which has differences.
Now problem is that, i have around 50 columns and i need to check data of each column and compare with other tables to find out the culprit.
In some cases, i am getting around 40000 rows as a difference and identifying correct columns is too cumbersome.
In most of the cases, it was because of NULL value in Live and blank value in Test. I have done following to identify columns
Limit Number of columns
Look at value for few columns and then change SQL statement to include ISNULL function & so on
This is time consuming.
Is there any good way to get the columns names because of which we are getting difference in rows.
It would be good if i can get better way to handle this.

Bulk Update with SQL Server based on a list of primary keys

I am processing data from a database with millions of rows. I am pulling a batch of 1000 items from the database and processing them without a problem. I am not loading the whole entity and I am just pulling down a few columns of data for the batch.
What I want to do is mark the 1000 rows as processed with a single SQL command.
Something like:
UPDATE dbo.Clients
SET HasProcessed = 1
WHERE ClientID IN (...)
The ... is just a list of integers.
Context:
Azure SQL Server 2012
Entity Framework
Database.ExecuteSqlCommand
Ideas:
I know I could build the command as a pure string, but this would mean not using any SqlParameters and not benefiting from the query plan optimization.
Also, I found some information about table-valued parameters, but this requires creating a table type and some overhead which I would like to avoid. This is just a list of integers after all.
Question:
Is there an easy (performant) way to do this that I am overlooking either with Entity Framework or ExecuteSqlCommand?
If not and using table-valued parameters is the best way, could you provide a complete example of how to convert an integer list into the simplest type and running that with the above query?

Combining MSSQL results with Oracle into CFSPREADSHEET

I have a report that generates an excel file daily with data extracted from a MS-SQL database. I now have to add additional columns to my spreadsheet from an Oracle database where the ID matches the ID in the MS-SQL query results.
My problem is that I have about 1200-1400+ unique IDs generated on this report from the first query. When I plug them into an IN list with the Oracle query and try to do a CFDUMP to see if the results will come out as it should, I receive a CF error saying that query cannot list more than 1000 results from the oracle query.
I basically set the values from the first query into a valuelist for the ID column and then put that into the IN clause for the Oracle query. I then do a cfdump on the Oracle where I receive that error. I've also tried wrapping cfloop query = "firstquery"> around the Oracle query and just placing #firstquery.columnIDname# but that does not work either.
So two questions I have here is ..
How do I handle the limit on Oracle with 1k limit and if I only have read only access to the Oracle database with ColdFusion?
After #1 is figured out, how could I combine the results from the Oracle Query with my MSSQL query or in other words, add the columns I'm pulling from the Oracle query to the spreadsheet for the matching ID.
Thanks.
For your quick, dirty, and sub-optimal approach, visit cflib.org and look for a function called ListSplit(). It converts a long list to an array of short lists.
You then loop through this array and run a query each time. Make sure the query name changes with each loop iteration.
After the loop, do a query of queries union query. Then do whatever you have to do to combine that data with what you got from sql server.
Note that you will probably have to use array notation to access your dynamically named query objects.

How to validate large data set from PowerShell against SQL server table

I have couple hundred rows of data of about 100 bytes each and need the field values validated against fairly large (milions of rows) table in SQL server. The query itself on SQL is very quick, but running single query per row from the script is not slow, probably because of the connection set-up and teardown. I can't add any stored procs to the server. Is there any way how to pass a dataset/table to a query from the script and return results rather then iterate the dataset row-by-row? Of course PowerShell code samples would be welcome ;)
There won't be a way via powershell (or any other method I am aware of) as SQL Server can't accept a table as an input parameter in a query. Any method you use for this will at some point be passing a row/specific values to SQL.
What you COULD do is create a table from your array and compare that temp table to your large table for validation. Determining if this is faster will need some testing, though.

Resources