camel jdbc out of memory exception - apache-camel

I am trying to ingest data from Postgres into another DB and I am using the camel-jdbc component to do it. I have a large table, so I want to read a few rows at a time instead of the whole table at once. So my route looks like the one below (for testing purposes only):
from(fromUri).setBody("select * from table limit 10").to("jdbc://myDataSource?resetAutoCommit=false&statement.fetchSize=2").split(body()).streaming().process(test)
As shown above, I am only selecting 10 rows at a time for testing purposes, and I have set fetchSize to 2 to only receive 2 rows at a time. However, I am still receiving all 10 rows at once. When I remove the "limit 10" from the query I get an Out of Memory error just before the split step, which tells me that it's trying to load the entire result set into memory.
What am I missing here or what am I doing wrong?
Thanks for the help.

I think fetchSize is more of a hint to the JDBC driver. You can use the maxRows option to put a hard cap on the number of rows returned, e.g. statement.maxRows=2. You can read more about these options in the JDBC Statement javadoc:
https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html
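For example, the endpoint from the question with maxRows applied would look something like this (just a sketch of the original URI with the extra option added):

jdbc://myDataSource?resetAutoCommit=false&statement.maxRows=2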

Related

Pandas Chunksize

I have the following code, which appends each chunk to a master Excel file:
df = pd.read_sql(query, cnxn, chunksize=10000)
for chunk in df:
    chunk.to_excel(r'C:\File\Path\file.xlsx', mode='a', sep=',', encoding='utf-8')
The issue I'm running into is that I'm expecting a result of around 8 million rows. With Excel's limit of about 1 million rows per sheet, I am considering two options:
For every 1 million rows, add a new sheet and continue the process. This is the preferred method.
Create a new workbook for each set of 1 million results.
Any recommendations for which approach is best? I know I'll need to modify the mode='a' part of the code.
Here is the error message I'm receiving:
ProgrammingError: ('42000', '[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information. (8623) (SQLExecDirectW)')
Thank you in advance
EDIT: some further clarification
The SQL query is using a list of 200,000 strings as my key and pulling all rows that contain at least one of the strings. There are multiple rows with the same string identifier, which is why I'm expecting a result of about 8 million rows. This is also why I believe I'm getting the programming error.
"The query processor ran out of internal resources and could not produce a query plan."
"The SQL query is using a list of 200,000 strings"
To simplify the query, pass the 200,000 strings in as a JSON document and parse them on the server (as in this answer), or load them into a temp table and reference that in your query.
Embedding them in the query text is a bad idea.
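A minimal sketch of the temp-table route via OPENJSON (assuming SQL Server 2016+; @json_keys, dbo.YourTable and string_id are placeholder names, not from the question):

DECLARE @json_keys NVARCHAR(MAX) = N'["key1","key2","key3"]';  -- in practice, passed in as a parameter

-- Parse the JSON array into a temp table once
CREATE TABLE #keys (key_value NVARCHAR(400) PRIMARY KEY);
INSERT INTO #keys (key_value)
SELECT [value] FROM OPENJSON(@json_keys);

-- Join against the temp table instead of embedding 200,000 literals in the query text
SELECT t.*
FROM dbo.YourTable AS t
JOIN #keys AS k ON t.string_id = k.key_value;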

Best way to handle large amount of inserts to Azure SQL database using TypeORM

I have an API created with Azure Functions (TypeScript). These functions receive arrays of JSON data, convert them to TypeORM entities, and insert them into an Azure SQL database. I recently ran into an issue where the array had hundreds of items, and I got an error:
The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request
I figured that saving all of the data at once using the entity manager causes the issue:
const connection = await createConnection();
connection.manager.save(ARRAY OF ENTITIES);
What would be the best scalable solution to handle this? I've got a couple of ideas, but I've no idea if they're any good, especially performance-wise.
Begin transaction -> Start saving the entities individually inside forEach loop -> Commit
Split the array into smaller arrays -> Begin transaction -> Save the smaller arrays individually -> Commit
Something else?
Right now the array sizes are in the tens or hundreds, but occasional arrays with 10k+ items are also a possibility.
One way you can massively scale is to let the DB deal with that problem, e.g. by using External Tables. The DB does the parsing; your code only orchestrates.
For example:
Make the data to be inserted available in ADLS (Data Lake):
Instead of calling your REST API with all the data (in the body or query params as an array), the caller writes the data to an ADLS location as a csv/json/parquet/... file, OR
The caller remains unchanged, and your Azure Function writes the data to a csv/json/parquet/... file in an ADLS location (instead of writing to the DB).
Make the DB read and load the data from ADLS:
First, CREATE EXTERNAL TABLE tmpExtTable LOCATION = ADLS-location
Then INSERT INTO actualTable (SELECT * from tmpExtTable); a fuller sketch is given below.
See formats supported by EXTERNAL FILE FORMAT.
You need not delete and re-create the external table each time; whenever you run a SELECT on it, the DB will go and parse the data in ADLS. But that's a choice.
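To make the steps above concrete, here is a rough sketch of the pattern using Synapse/PolyBase-style DDL. Every name, path, credential and the column list below is a placeholder, and the exact WITH options vary by SQL offering, so treat this as an outline rather than copy-paste DDL:

-- One-time setup: point the DB at the ADLS location and describe the file format
CREATE EXTERNAL DATA SOURCE MyAdls
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://container@account.dfs.core.windows.net',
      CREDENTIAL = MyAdlsCredential);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- External table over the file(s) the caller / Azure Function dropped into ADLS
CREATE EXTERNAL TABLE tmpExtTable (col1 INT, col2 NVARCHAR(100))
WITH (LOCATION = '/incoming/batch.csv', DATA_SOURCE = MyAdls, FILE_FORMAT = CsvFormat);

-- Load from the external table into the real table
INSERT INTO actualTable SELECT * FROM tmpExtTable;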
I ended up doing this the easy way, as TypeORM already provides the ability to save in chunks. It might not be the most optimal approach, but at least I got away from the "too many parameters" error.
// Save all data in chunks of 100 entities
connection.manager.save(ARRAY OF ENTITIES, { chunk: 100 });

SQL Query to read a text file and display only selected contents from that

I am working on something which requires me to run an SQL query to read a text file from a path, but it has to display only a few of the contents based on my conditions/requirements. I have read about using OPENROWSET/BULK copy, but that copies the entire file, whereas I need only certain data from the file.
Ex:
Line 1 - Hello, Good Morning.
Line 2 - Have a great day ahead.
Line 3 - Phone Number : 1112223333 and so on.
So, if I read this file and give the condition as "1112223333", it should display only the lines containing "1112223333".
NOTE: It should display the entire line of the matched case/condition.
Is it possible to achieve this using an SQL query? If so, please help me with this.
Thanks in advance.
Unfortunately, what you're trying to do doesn't work with OPENROWSET; there is no way to apply a filter at read time, so you're stuck with reading in the entire file. You can, of course, read it into a temp table and then delete the rows you don't want. This will give you the desired end result, but you still have to take the hit of reading in the entire file.
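A minimal sketch of that temp-table approach, using BULK INSERT rather than OPENROWSET and a WHERE filter rather than deleting rows (the path is a placeholder, and ROWTERMINATOR may need adjusting for Windows line endings):

-- Read every line of the file into a single-column temp table, then filter
CREATE TABLE #lines (line NVARCHAR(MAX));

BULK INSERT #lines
FROM 'C:\path\to\file.txt'
WITH (ROWTERMINATOR = '\n');

SELECT line
FROM #lines
WHERE line LIKE '%1112223333%';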
You may be able to generate a script to filter the file server-side and trigger that with xp_cmdshell, but you'd still need to take the performance hit somewhere. While this would put a lower load on the SQL Server, you'd just be pushing the processing elsewhere, and you'd still have to wait for the processing to happen before you could read the file. It may be worth doing if the file is on a separate server and network traffic is an issue; if the file is on the same server, unless SQL is completely bogged down, I can't see an advantage to this.
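For instance, something along these lines (a hypothetical sketch; it assumes xp_cmdshell is enabled, and the file path is a placeholder):

-- Filter the file on the server with findstr and capture only the matching lines
CREATE TABLE #matches (line NVARCHAR(4000));

INSERT INTO #matches (line)
EXEC master..xp_cmdshell 'findstr /C:"1112223333" "C:\path\to\file.txt"';

-- xp_cmdshell usually appends a trailing NULL row
SELECT line FROM #matches WHERE line IS NOT NULL;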

How can I minimize validation intervals when changing the SQL in ADO NET Source Tasks

Part of an SSIS package is a data import from an external database via a SQL command embedded in an ADO.NET Source (Data Flow source). Whenever I make even the slightest adjustment to the query (such as changing a column name), it takes ages (in this case 1-2 hours) until the program has finished validation. The query itself returns around 30,000 rows with 20 columns each.
Is there any way to cut these long intervals or is this something I have to live with?
I usually store the source queries in a table, and the first part of my package executes a select and stores the query returned from the table in a package variable, which is then used by the ADO.NET Source in the Data Flow. So in my package, for the default value of the variable, I usually have the query that is stored in the database along with a "where 1=2" at the end. Hence during design time it does execute the query, but it just returns the column metadata. Let me know if you have any questions.
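As an illustration, the design-time default value of the variable might look like this (table and column names are made up here; the point is the trailing predicate that returns metadata but no rows):

SELECT OrderID, CustomerID, OrderDate
FROM dbo.Orders
WHERE 1 = 2;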

SQL Query FOR XML runs fine in 2000, slow in 2008 R2

I'm converting a client's packages from DTS to SSIS. In one of their packages they have an Execute SQL task with a query similar to this:
SELECT * FROM [SOME_TABLE] AS ReturnValues
ORDER BY IDNumber
FOR XML AUTO, ELEMENTS
This query seems to return in a decent amount of time on the old system, but on the new box it takes up to 18 minutes to run in SSMS. Sometimes when I run it, it will generate an XML link, and if I click on it to view the XML it throws a 'System.OutOfMemoryException' and suggests increasing the number of characters retrieved from the server for XML data. I increased that option to unlimited and am still getting the error.
The table itself contains 220,500 rows, but the query shows 129,810 rows returned before it stops. Is this simply a matter of not having enough memory available to the system? The box has 48 GB (Win 2008 R2 EE x64), with the instance capped at 18 GB because it's a shared dev environment. Any help/insight would be greatly appreciated, as I don't really know XML!
When you run FOR XML queries in SSMS, it generates all the XML, puts it into the grid, and allows you to click on it. There are limits to how much data it can bring back, and 220,000 rows, depending on how wide the table is, is huge and produces a lot of text.
The out-of-memory error comes from trying to parse all of that XML, which is a lot of memory consumption for SSMS.
You can try executing to a file and see what size you get. But the major reason for running out of memory is that this is a lot of XML, and when returning it to the grid you will not always get all the results with this type of result set (size-wise).
DBADuck (Ben)
The out-of-memory exception you're hitting is due to the amount of text a .NET grid control can handle. 220k lines is huge! The setting in SSMS to show unlimited data is only as good as the .NET control's memory cap.
You could look at removing the ELEMENTS option and viewing the data in attribute format. That will decrease the amount of XML "string space" returned. Personally, I prefer attributes over elements for that reason alone. Context is king, so it depends on what you're trying to accomplish (look at the data or use the data). Could you pipe the data into an XML variable? When all is said and done, DBADuck is 100% correct in his statement.
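As a rough sketch of both suggestions, using the query from the question (attribute-centric output, and capturing the result into an XML variable instead of rendering it in the SSMS grid):

-- Attribute-centric XML (drop ELEMENTS); noticeably smaller than element-centric XML
SELECT * FROM [SOME_TABLE] AS ReturnValues
ORDER BY IDNumber
FOR XML AUTO;

-- Or assign the result to an XML variable rather than returning it to the grid
DECLARE @x XML =
(
    SELECT * FROM [SOME_TABLE] AS ReturnValues
    ORDER BY IDNumber
    FOR XML AUTO, TYPE
);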
SqlNightOwl
