Best way to handle a large number of inserts into an Azure SQL database using TypeORM - sql-server

I have an API created with Azure Functions (TypeScript). These functions receive arrays of JSON data, convert them to TypeORM entities, and insert them into an Azure SQL database. I recently ran into an issue where the array had hundreds of items, and I got an error:
The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request
I figured that saving all of the data at once using the entity manager causes the issue:
const connection = await createConnection();
await connection.manager.save(arrayOfEntities); // the whole array at once
What would be the best scalable solution to handle this? I've got a couple of ideas, but I have no idea if they're any good, especially performance-wise.
Begin a transaction -> start saving the entities individually inside a forEach loop -> commit
Split the array into smaller arrays -> begin a transaction -> save the smaller arrays individually -> commit
Something else?
Right now the array sizes are in the tens or hundreds, but occasional arrays with 10k+ items are also a possibility.

One way you can massively scale is to let the DB deal with that problem, e.g. by using External Tables. The DB does the parsing; your code only orchestrates.
For example:
Make the data to be inserted available in ADLS (Data Lake):
Instead of calling your REST API with all the data (in the body or query params as an array), the caller writes the data to an ADLS location as a csv/json/parquet/... file. OR
The caller remains unchanged, and your Azure Function writes the data to some csv/json/parquet/... file in an ADLS location (instead of writing to the DB).
Then make the DB read and load the data from ADLS:
First: CREATE EXTERNAL TABLE tmpExtTable ... WITH (LOCATION = '<ADLS location>', ...)
Then: INSERT INTO actualTable SELECT * FROM tmpExtTable
See the formats supported by CREATE EXTERNAL FILE FORMAT.
You need not delete and re-create the external table each time. Whenever you run a SELECT on it, the DB will go and parse whatever data is currently sitting in ADLS. But that's a choice you can make.
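A rough T-SQL sketch of that flow (untested; all object names and the ADLS path are placeholders, authentication via a database scoped credential is omitted, and the exact WITH options differ between Synapse and Azure SQL's data virtualization, so check the CREATE EXTERNAL FILE FORMAT docs for your flavor):
-- One-time setup: point the DB at the ADLS location the caller/function writes to
CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (LOCATION = 'abfss://ingest@mystorageaccount.dfs.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- External table over the folder that receives the uploaded files
CREATE EXTERNAL TABLE tmpExtTable
(
    Id      INT,
    Payload NVARCHAR(200)
)
WITH (LOCATION = '/incoming/orders/',
      DATA_SOURCE = AdlsSource,
      FILE_FORMAT = CsvFormat);

-- Per batch: let the DB parse the files and load the rows
INSERT INTO actualTable (Id, Payload)
SELECT Id, Payload
FROM tmpExtTable;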

I ended up doing this the easy way, as TypeORM already provides the ability to save in chunks. It might not be the most optimal way, but at least I got rid of the "too many parameters" error.
// Save all data in chunks of 100 entities
await connection.manager.save(arrayOfEntities, { chunk: 100 });
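If you ever need more control than the chunk option gives you, for example wrapping the whole batch in a single transaction as sketched in the question, something along these lines should work with TypeORM's query runner (untested sketch; entities and chunkSize are placeholders):
// Save the entities in chunks inside one transaction
const connection = await createConnection();
const queryRunner = connection.createQueryRunner();
await queryRunner.connect();
await queryRunner.startTransaction();
try {
  // keep chunkSize small enough that rows-per-chunk x columns stays below
  // SQL Server's 2100-parameter limit
  const chunkSize = 100;
  for (let i = 0; i < entities.length; i += chunkSize) {
    await queryRunner.manager.save(entities.slice(i, i + chunkSize));
  }
  await queryRunner.commitTransaction();
} catch (err) {
  await queryRunner.rollbackTransaction();
  throw err;
} finally {
  await queryRunner.release();
}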

Related

SQL to Excel - Max each sheet

I have a SQL table with close to 2 million rows and I am trying to export this data into an Excel file so the stakeholders can manipulate data, see charts, and so on...
The issue is, when I hit refresh, it fails after getting all the data, saying the number of rows exceeds the maximum row limitation in Excel. This table is going to keep growing every day.
What I am looking for here is a way to refresh data, then add rows to Sheet 1 until the max row limitation is reached. Once it is maxed out, I want the rows to start getting inserted into Sheet 2, then a third sheet, and so on, all from the single SQL table, from a single refresh.
This does not have to happen in Excel (Data -> Refresh option), I can have this as a part of the SSIS package that I am already using to populate rows in the SQL table.
I am also open to any alternate ways to export SQL table into a different format that can be used by said stakeholders to create charts, analyze data, and whatever else pleases them.
Without sounding too facetious, you are suggesting a very inefficient method.
The best way of approaching this method is not to use .xlsx files at all for the data storage.
Assuming your destination stakeholders don't have read access to the SQL Server, export the data to .csv and then use Power Query in some sort of 'Dashboard.xlsx'-type file to load the .csv into the data model, which can handle hundreds of millions of rows instead of just 1.05m.
This will allow for the use of Power Pivot and DAX for analysis, and the data will also be visible in the data model table view if users do want raw rows (or they can refer to the csv file).
If they do have SQL read access then you can query the server directly so you don't need to store any rows whatsoever as it will read directly.
Failing all that and you decide to do it your way, I would suggest the following.
Read your table into a pandas DataFrame and iterate over each row and cell of the DataFrame, writing to your xlsx [Sheet1] using openpyxl; then once the row number reaches 1,048,560, simply move on to xlsx [Sheet2].
In short: openpyxl allows you to create workbooks and worksheets and write to cells directly.
But depending on how many columns you have it could take incredibly long.
Product limitation: Excel 2007+ supports 1,048,576 rows by 16,384 columns per worksheet.
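If you do end up going the pandas route, you may not need to iterate cell by cell at all: reading from SQL Server in chunks and writing each chunk to its own worksheet gives you the roll-over behaviour directly. A rough, untested sketch (the connection string and table name are placeholders):
# Stream the table out of SQL Server and start a new worksheet per chunk,
# staying under Excel's 1,048,576-rows-per-sheet limit.
import pandas as pd
from sqlalchemy import create_engine

ROWS_PER_SHEET = 1_000_000  # a little headroom below the hard limit

engine = create_engine(
    "mssql+pyodbc://user:password@server/db?driver=ODBC+Driver+17+for+SQL+Server"
)

with pd.ExcelWriter("export.xlsx", engine="openpyxl") as writer:
    chunks = pd.read_sql("SELECT * FROM dbo.BigTable", engine, chunksize=ROWS_PER_SHEET)
    for sheet_no, chunk in enumerate(chunks, start=1):
        chunk.to_excel(writer, sheet_name=f"Sheet{sheet_no}", index=False)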
A challenge with your suggestion of filling a worksheet with the max number of rows and then splitting is "How are they going to work with that data?" and "Did you split data that should have been together to make an informed choice?"
If Excel is the tool the users want to use and they must have access to all the data, then you're going to need to put the data into a Power Pivot data model (and yes, that's going to impact the availability of some data visualizations). A Power Pivot model is an in-memory tabular data set. What that means is that the data engine, xVelocity, is going to use a bunch of memory but can get past the 1 million row limitation. Depending on how much memory is required, you might need to switch from the default 32-bit Office install to a 64-bit install (and I've seen clients have to max out the RAM on old, low-end desktops because they went cheap for business users).
Power Pivot will have a connection to your SQL Server (or other provider). When it refreshes data, it's going to fire off queries and determine the unique values in columns and then create a dictionary of unique values. This allows it to compress the data with low cardinality really well - sales dates are likely going to be repeated heavily within your set so the compression is good. Assuming your customers are typically not-repeat customers, a customer surrogate key would have high cardinality and thus not compress well since there's little to no repeat. The refresh is going to be dependent on your use case and environment. Maybe the user has to manually kick it off, maybe you have SharePoint with Excel services installed and then you can have it refresh the data on various intervals.
If they're good analysts, you might try turning them on to Power BI. It has the same-ish engine behind the scenes, but it was built from the ground up to be a responsive reporting tool. If they're just wading through tables of data, they're not ready for PBI. If they are making visuals out of the data, PBI is likely a better fit.

How to create 100 records in the database while simulating 100 users using Gatling?

While simulating the recorded Scala file, the data should be fetched from a CSV file and inserted into the DB.
For Eg:
If we simulate 100 users, will it create 100 records in the database, given that registration populates the database?
Thanks,
Mohanapriya
You may wish to specify a "feeder" to supply data to your Gatling scenarios. Feeders can load data from a variety of sources, or may generate the data themselves. Gatling has standard feeders for several use cases, or you can write your own.
More details at http://gatling.io/docs/2.1.7/session/feeder.html.
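A minimal sketch of a CSV feeder in a Gatling 2.x simulation (the base URL, endpoint and column names are made up; whether 100 simulated users produce exactly 100 rows still depends on your registration endpoint inserting one row per successful request):
// Each virtual user pulls the next row from users.csv and registers with it
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class RegistrationSimulation extends Simulation {

  val httpProtocol = http.baseURL("http://myapp.example.com") // placeholder

  val users = csv("users.csv").circular // e.g. columns: username,password

  val scn = scenario("Register users")
    .feed(users)
    .exec(
      http("register")
        .post("/register") // placeholder endpoint
        .formParam("username", "${username}")
        .formParam("password", "${password}")
    )

  setUp(scn.inject(atOnceUsers(100)).protocols(httpProtocol))
}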

Need a clever way to get orders from all stores while each store is in a different database

The setup
I have the following database setup:
CentralDB
Table: Stores
Table: Users
Store1DB
Table: Orders
Store2DB
Table: Orders
Store3DB
Table: Orders
Store4DB
Table: Orders
... etc
CentralDB contains the users, logging and a Stores table with the name of each store database and general information about each store such as address, name, description, image, etc...
All the StoreDB's use the same structure just different data.
It is important to know that the list of stores will shrink and increase in the future.
The main client communicating with this setup is a REST API service which gets passed a STOREID in the header of each request, telling it which database to connect to. This works flawlessly so far.
The reasoning
Whenever we need to do database maintenance on one store, we don't want all other stores to be down.
Backup management should be per store
Not having to write the WHERE storeID=x every time and for every table
Performance: each store could run on its own database server if the need arises
The goal
I need my REST API Service to somehow get all orders from all stores in one query.
Will you help me figure out a way to do this without hardcoding all storedb names? I was thinking about a stored procedure on the CentralDB but I was hoping there would be other solutions. In any case it has to be very efficient.
One option would be to have a list of databases stored in a "system" table in CentralDB.
Then you could create a stored procedure that reads the database names from that table, loops through them with a cursor, and generates a dynamic SQL statement that UNIONs the results from all the databases. This way you get a single recordset of results.
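A rough sketch of that stored procedure body, assuming the Stores table has a DbName column (here the UNION is built by string concatenation rather than an explicit cursor, but the idea is the same):
-- Build and run one UNION ALL query across every store database listed in CentralDB
DECLARE @sql NVARCHAR(MAX) = N'';

SELECT @sql = @sql
    + CASE WHEN @sql = N'' THEN N'' ELSE N' UNION ALL ' END
    + N'SELECT ' + QUOTENAME(DbName, '''') + N' AS StoreDb, o.* FROM '
    + QUOTENAME(DbName) + N'.dbo.Orders AS o'
FROM CentralDB.dbo.Stores;

EXEC sys.sp_executesql @sql;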
However, this database design is IMHO flawed. There is no reason to use multiple databases to store data that belongs to the same "domain". All the reasons you have mentioned can be addressed with a single database and proper database design. Having multiple databases will create multiple problems in the long term:
you will need to change structure of all the DBs when you modify your database model
you will need to create/drop new databases when new stores are added/removed from your system
you will need to have items and other entities that are "common" to all the stores duplicated in all the DBs
what about reporting requirements (e.g. get sales data for stores 1 and 2 together, etc.) - this will require creating complex union queries...
etc...
In the long term, managing and maintaining this model will be a big pain.
I'd maintain a set of views that UNION ALL all the data. Every time a store is added or deleted those views must be updated. This can be automated.
The views provide an illusion to the application that there is only one database.
What I would not do is have each SQL query or procedure query all the database names and create dynamic SQL. That would entail lots of code duplication and an unnecessary loss of performance, and the approach is error-prone. It is better to generate the code once in a central place and have all other SQL code reference that generated code.
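For example, the generated view might look like this on a single SQL Server instance (CREATE OR ALTER needs SQL Server 2016 SP1+; on Azure SQL Database you would need elastic query external tables instead of cross-database names):
-- Generated centrally whenever a store is added or removed;
-- the application only ever queries dbo.AllOrders
CREATE OR ALTER VIEW dbo.AllOrders AS
SELECT 'Store1' AS StoreDb, o.* FROM Store1DB.dbo.Orders AS o
UNION ALL
SELECT 'Store2', o.* FROM Store2DB.dbo.Orders AS o
UNION ALL
SELECT 'Store3', o.* FROM Store3DB.dbo.Orders AS o;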

ADO - Can I edit results of a complex query with multiple join statements?

I'm working on a data conversion utility which can push data from one master database out to a number of different databases. The utility itself will have no knowledge of how data is kept in the destination (table structure), but I would like to support writing a SQL statement to return data from the destination using a complex SQL query with multiple join statements, as long as the data is in a standardized format that the utility can recognize (field names) in an ADO query.
What I would like to do is then modify the live data in this ADO query. However, since there are multiple join statements, I'm not sure if it's possible to do this. I know that with BDE, at least (I've never used BDE), it was very strict and you had to return all fields (*) and such. ADO, I know, is more flexible, but I don't know quite how flexible it is in this case.
Is it supposed to be possible to modify data in a TADOQuery in this manner, when the results include fields from different tables? And even if so, suppose I want to append a new record to the end (TADOQuery.Append). Would it append to two different tables?
The actual primary table I'm selecting from has a complementary table which is joined by the same primary key field; one is a "Small" table (brief info) and the other is a "Detail" table (more info for each record in the Small table). So a typical statement would include something like this:
select ts.record_uid, ts.SomeField, td.SomeOtherField from table_small ts
join table_detail td on td.record_uid = ts.record_uid
There are also a number of other joins to records in other tables, but I'm not worried about appending to those ones. I'm only worried about appending to the "Small" and "Detail" tables - at the same time.
Is such a thing possible in an ADO Query? I'm willing to tweak and modify the SQL statement in any way necessary to make this possible. I have a bad feeling though that it's not possible.
Compatibility:
SQL Server 2000 through 2008 R2
Delphi XE2
Editing those fields which have no influence on the joins is usually no problem.
Appending is trickier ... you can limit the append to one of the tables with:
procedure TForm.ADSBeforePost(DataSet: TDataSet);
begin
  inherited;
  // Restrict inserts/updates to one base table via ADO's "Unique Table" dynamic property
  TCustomADODataSet(DataSet).Properties['Unique Table'].Value := 'table_small';
end;
but without a Requery you won't get much further.
A better way would be to set the values via a procedure, e.g. in BeforePost, then Requery and Abort.
If your view were persistent (a server-side view), you would be able to use INSTEAD OF triggers.
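For completeness, a sketch of that last idea using the tables from the question (assuming record_uid is supplied by the client rather than generated server-side):
-- A persistent server-side view over both tables...
CREATE VIEW dbo.small_detail AS
SELECT ts.record_uid, ts.SomeField, td.SomeOtherField
FROM table_small ts
JOIN table_detail td ON td.record_uid = ts.record_uid;
GO

-- ...plus an INSTEAD OF trigger that splits an insert across the two base tables
CREATE TRIGGER dbo.tr_small_detail_insert
ON dbo.small_detail
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO table_small (record_uid, SomeField)
    SELECT record_uid, SomeField FROM inserted;

    INSERT INTO table_detail (record_uid, SomeOtherField)
    SELECT record_uid, SomeOtherField FROM inserted;
END;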
Jerry,
I encountered the same problem on Firebird, and from experience I can tell you that it can be done (with a little added complexity) by using CachedUpdates. A very good resource is this one - http://podgoretsky.com/ftp/Docs/Delphi/D5/dg/11_cache.html. This article has the answers to all your questions.
I have abandoned the original idea of live ADO query updates, as it has become more complex than I can wrap my head around. The scope of the data push project has changed, and therefore this is no longer an issue for me, however still an interesting subject to know.
The new structure of the application consists of attaching multiple "Field Links" on various fields from the original set of data. Each of these links references the original field name and a SQL Statement which is to be executed when that field is being imported. Multiple field links can be on one single field, therefore can execute multiple statements, placing the value in various tables, etc. The end goal was an app which I can easily and repeatedly export a common dataset from an original source to any outside source with different data structures, without having to recompile the app.
However, the concept of cached updates was not appealing to me, simply because of the fact pointed out in the link in RBA's answer: data can be changed in the database in the meantime. So I will instead integrate my own method of customizable data pushes.

Growing MS Access File Size problem

I have a large MS Access application with a lot of computations in VBA code. When I run it, it eventually crashes due to excessive file size. There are a lot of intermediate tables and queries created and subsequently deleted, but Access does not reclaim the space. I have diligently closed all intermediate recordsets and set all temporary objects to nothing, but nothing helps. The only way I can get my code to run is to run part of it, stop, compact/repair the file, then restart the code.
Isn't there a better way?
Thanks
You should be able to run the compact function from within your VBA code.
I had the snippet below bookmarked from a long time ago when I was doing Access work.
Public Sub CompactDB()
CommandBars("Menu Bar").Controls("Tools").Controls("Database utilities").Controls("Compact and repair database...").accDoDefaultAction
End Sub
You can put that in your code to get around it.
NOTE: you might also consider growing to a larger db system if you are having these types of scaling issues.
What sizes are you dealing with? What is the error code when it crashes? I'd be surprised if it is simply because the file gets "too big", but I imagine there's a limit. It sounds from your description of all the temp stuff that there may be design improvements that would help.
EDIT: I expect you realize it's non-trivial to replace the database with something else - even if you try to keep whatever else is in the mdb besides the tables. Access querydefs are unique, Access SQL is non-standard and you'd be basically starting over.
Most Access applications I've seen have lots of opportunity for refactoring; and it's usually not that difficult if a) you understand the logic and the business rules, and b) you have a solid understanding of Access programming. But that would be more or less true for any alternatives. If I were you and you're a little short in either area, maybe you can get some help. But I'd try to rescue the Access app first.
There's also a suggestion from another poster about moving the tables into one or more attached MDBs. That's a solid, well-proven technique in general. But first I'd get a handle on what the real cause of the problem is.
I'd push the data over to MS SQL (the permanent data and the intermediate tables); and you can leave the code portion in MS Access for the time being.
This solves two big issues:
The data will be inherently more stable/dependable (I can't tell you how many times I've had a corrupt MS Access database).
Your Access database won't grow/change very much (it should reach an equilibrium once all the code in it has been run and compiled).
Both of these mean no more having to compress/repair the database; you can get a free version (the Express Edition) of MS SQL and it is not that hard to do.
If you do not want to switch to SQL Express or similar, you could consider the following ideas:
Open another 'external' Access database (mdb file) for all temporary tables, so you can put all temp data in the external file and throw the mdb file away when you close your app. In your code you would then manipulate both the 'CurrentDb' object and the other database that you build at startup and connect to through a Jet, OLEDB or ODBC connection (see the sketch after this list).
Separate your permanent tables from your code and, when needed, bring the data into your local client file to build your temporary tables. This can be done, for example, by linking the external database to the local/client file using "DoCmd.TransferDatabase acLink". It can also be done by connecting to the permanent data through an OLEDB connection, opening the needed recordset(s) and saving them locally as XML files. There are many other solutions that can be implemented here.
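A minimal VBA sketch of the first idea (file, table and column names are made up): create a scratch database at startup, link one of its tables into the current front end, and throw the file away on shutdown.
' Create a scratch database for temp tables and link a table from it
Public Sub CreateAndLinkScratchDb()
    Dim scratchPath As String
    Dim scratchDb As DAO.Database

    scratchPath = Application.CurrentProject.Path & "\scratch_temp.accdb"

    ' Build the scratch file and a temp table inside it
    Set scratchDb = DBEngine.Workspaces(0).CreateDatabase(scratchPath, dbLangGeneral)
    scratchDb.Execute "CREATE TABLE tmpResults (ID LONG, Amount CURRENCY)"
    scratchDb.Close
    Set scratchDb = Nothing

    ' Link the temp table into the current database so local queries can use it
    DoCmd.TransferDatabase acLink, "Microsoft Access", scratchPath, acTable, _
        "tmpResults", "tmpResults"

    ' On shutdown: delete the linked TableDef and Kill scratchPath to discard the temp data
End Sub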
The state of affairs with regard to Jet file sizes is interminably problematic for me.
I am currently watching a piece of my own VBA code from Access database A as it does a series of single-record field updates using ADO to a table in Access database B (via an updateable-query reference in database A). The single field is a CHAR(8). With every 4 updates that go by, database B grows by about 8 KB. There is no good excuse for that. The addition to the file size slows performance severely; with each file growth, updates slow from about one per second (in a table of about 30-40K records, using single-record SQL lookups and no indexes anywhere) to one per 5-10 seconds.
Now, I admit, I did compact/repair database B prior to running this update code; perhaps if I had not done that, the performance would not have been this bad. Had the target field for update been of, say, type Memo, then I would have expected this. But to carry out an update on a CHAR() field and get this result is simply not reasonable.
Most of the above (no particular criticism for any one solution intended) appear to be valid solutions for applications that use a relatively permanent business application arrangement (talk to the same target databases all of the time). Mine is not so . . . I cannot alter the target database (database B), as it is generated and consumed by a vendor's tool that we use to export and import data from their application.
I understand and commend the above writers for coming up with solutions to users' problems. However, I cannot let it stand when poor software design/implementation gets in the way of users using a product as the users expect it to function.
I'm not an MVP, but Google found these. Maybe they'll help you:
http://www.mvps.org/access/general/gen0041.htm
http://forums.devarticles.com/microsoft-access-development-49/compact-database-via-vba-24958.html
Unfortunately, MS Access has problems when the file gets too large - I think the max size is 2 GB for an Access DB.
You may consider moving to SQL Express, VistaDB, etc.
According to http://office.microsoft.com/en-us/access/HP051868081033.aspx, Access 2003 and 2007 have a 2 GB limit. However, it's easy to move some or all of the tables into a separate .mdb file and then link to those tables. It's good practice anyway to have two files, one for your data and one for all the macros, queries, and so on. You could even have multiple data files if your table file gets near the 2 GB limit.
I have encountered a similar issue where my database was bloating on raw data import. Instead of splitting the database and compacting the back end routinely, I decided to use the database object (DAO) to create a temp database, import the data, query/modify the data in that temp database, pull it over to the original database via SQL, and then delete it. The base code is shown below:
Sub tempAccessDatabaseImport()
    Dim mySQL As String
    Dim tempDBPath As String
    Dim tempPathArr() As String
    Dim i As Long
    Dim myWrk As DAO.Workspace
    Dim tempDB As DAO.Database
    Dim myObject As Object

    'Build the temp Access database path next to the current project
    tempPathArr = Split(Application.CurrentProject.Path, "\")
    For i = LBound(tempPathArr) To UBound(tempPathArr)
        tempDBPath = tempDBPath + tempPathArr(i) + "\"
    Next i
    tempDBPath = tempDBPath + "tempDB.accdb"

    'Delete the temp Access database if it already exists
    Set myObject = CreateObject("Scripting.FileSystemObject")
    If myObject.FileExists(tempDBPath) Then
        myObject.DeleteFile (tempDBPath)
    End If

    'Open the default workspace
    Set myWrk = DBEngine.Workspaces(0)

    'DAO: create the temp database
    Set tempDB = myWrk.CreateDatabase(tempDBPath, dbLangGeneral)

    'DAO: import the temp .xlsx into a temp Access table
    '(RAWDATAPATH and WORKSHEETNAME are constants defined elsewhere)
    mySQL = "SELECT * INTO tempTable FROM (SELECT vXLSX.* FROM [Excel 12.0;HDR=YES;DATABASE=" & RAWDATAPATH & "].[" & WORKSHEETNAME & "$] As vXLSX)"

    'DAO: execute the SQL
    Debug.Print mySQL
    Debug.Print
    tempDB.Execute mySQL, dbSeeChanges

    'Do something else: query/modify the data here, then pull it into the main DB via SQL

    'Close the DAO database objects
    tempDB.Close
    Set tempDB = Nothing
    myWrk.Close
    Set myWrk = Nothing

    'Delete the temp Access database when done (delete left commented out here)
    If myObject.FileExists(tempDBPath) Then
        'myObject.DeleteFile (tempDBPath)
    End If
End Sub
