Copy from S3 to Redshift

I am loading data into Redshift from S3, using a MANIFEST file to specify the load because I have to load about 8,000 files (total dataset size ~1 TB).
I am using SQLWorkbench to run the load and am setting MAXERROR = 100000, but the actual number of errors is greater than 100,000. I think SQLWorkbench caps MAXERROR at 100,000.
Is there a better way to do this? Any suggestions?

If you actually have more than 100,000 errors in the data being imported, I would suggest going back to the source and correcting the files. If that's not possible, you could try loading the data into a staging table with the problematic columns set to VARCHAR(MAX) and then converting them inside Redshift.
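For reference, a minimal sketch of that approach, with a permissive staging table and placeholder table, bucket, and IAM role names (the format options need to match the actual files):

CREATE TABLE staging_raw (
    col1 VARCHAR(65535),   -- 65535 is Redshift's maximum VARCHAR length
    col2 VARCHAR(65535)
);

COPY staging_raw
FROM 's3://my-bucket/loads/files.manifest'        -- points at the manifest, not the data files
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
MAXERROR 100000                                   -- 100000 appears to be the upper bound COPY accepts
CSV GZIP;                                         -- placeholder format options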

Related

Best way to handle large amount of inserts to Azure SQL database using TypeORM

I have an API created with Azure Functions (TypeScript). These functions receive arrays of JSON data, convert them to TypeORM entities and insert them into an Azure SQL database. I recently ran into an issue where the array had hundreds of items, and I got an error:
The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request
I figured that saving all of the data at once using the entity manager causes the issue:
const connection = await createConnection();
connection.manager.save(entities); // entities = the full array of TypeORM entities
What would be the best scalable solution to handle this? I've got a couple of ideas, but I've no idea if they're any good, especially performance-wise.
Begin transaction -> Start saving the entities individually inside forEach loop -> Commit
Split the array into smaller arrays -> Begin transaction -> Save the smaller arrays individually -> Commit
Something else?
Right now the array sizes are in the tens or hundreds, but occasional arrays with 10k+ items are also a possibility.
One way you can massively scale is to let the DB deal with that problem, e.g. by using External Tables. The DB does the parsing; your code only orchestrates.
For example:
Make the data to be inserted available in ADLS (Data Lake):
Instead of calling your REST API with all the data (in the body or as query params), the caller writes the data to an ADLS location as a csv/json/parquet/... file. OR
The caller remains unchanged. Your Azure Function writes the data to a csv/json/parquet/... file in an ADLS location (instead of writing to the DB).
Make the DB read and load the data from ADLS:
First: CREATE EXTERNAL TABLE tmpExtTable ... WITH (LOCATION = '<ADLS-location>', ...)
Then: INSERT INTO actualTable SELECT * FROM tmpExtTable
See the formats supported by CREATE EXTERNAL FILE FORMAT (a fuller sketch follows below).
You need not delete and re-create the external table each time. Whenever you run a SELECT against it, the DB will go and parse the data in ADLS. But that's a choice.
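A rough T-SQL sketch of that pattern, assuming a PolyBase-capable target (e.g. Synapse or SQL Server) and placeholder names for the data source, file format, tables and columns:

-- One-time setup: point the database at the ADLS location
CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (LOCATION = 'abfss://ingest@myaccount.dfs.core.windows.net');  -- some platforms also need TYPE and a credential here

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- The external table only describes the files; nothing is copied yet
CREATE EXTERNAL TABLE tmpExtTable (
    Id   INT,
    Name NVARCHAR(200)
)
WITH (LOCATION = '/incoming/',        -- the folder the caller/function writes files into
      DATA_SOURCE = AdlsSource,
      FILE_FORMAT = CsvFormat);

-- Loading becomes a single set-based statement; the engine does the parsing
INSERT INTO actualTable (Id, Name)
SELECT Id, Name
FROM tmpExtTable;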
I ended up doing this the easy way, as TypeORM already provides the ability to save in chunks. It might not be the most optimal way, but at least I got rid of the "too many parameters" error.
// Save all data in chunks of 100 entities
connection.manager.save(entities, { chunk: 100 }); // entities = the full array of TypeORM entities

How to create a SQL Server table with exactly 1 MB data in it

I'd like to try and gauge my users' internet speeds based on them downloading a dataset of known size (1 MB).
Using T-SQL only, how can I quickly create a table with exactly 1 MB of data in it?
I want to be able to run EXEC sp_spaceused N'dbo.myTableName' to verify the data size.
The SO search keywords ended up being "Numbers Table". Searching for this term, I found a great post.
Closing this as a duplicate of: What is the best way to create and populate a numbers table?
I think I'm going to move in a different direction for my use case, with:
fsutil file createnew C:\Desktop\testFile.png 1000000
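For anyone who still wants the pure T-SQL route, here is a minimal sketch (the table name is just the example from the question): a CHAR(8000) row fills an entire 8 KB page, so 128 rows should come out to 1,024 KB of data in sp_spaceused.

-- 128 rows x one 8 KB page per row = 1 MB of data
CREATE TABLE dbo.myTableName (filler CHAR(8000) NOT NULL);

INSERT INTO dbo.myTableName (filler)
SELECT TOP (128) 'x'
FROM sys.all_objects;                    -- any rowsource with at least 128 rows will do

EXEC sp_spaceused N'dbo.myTableName';    -- the data column should read about 1024 KB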

Write more than 100,000 records to a CSV or XLS sheet using Apex

We have a requirement:
To show object data on a Visualforce page using pagination.
An Export button to export all records to an XLS or CSV file.
The issue is that the data size is too large, i.e. more than 100,000 records.
How can we write more than 100,000 records to an XLS file using Apex?
I know for sure that writing to a Document could work. Maybe even the newer Files feature. You can write to and append to an existing document. Using the @ReadOnly annotation you can query more than 10,000 records. You might run into heap size errors though. Another option could be to use the Bulk API v2.

Loading data using a Lookup Transformation in SSIS is taking more time?

I am loading data from one server to another through a Data Flow Task in SSIS. I have used a Lookup Transformation on one column. Initially the load was very fast, but now that I run it every day it is taking longer. The outputs are an OLE DB Destination and an OLE DB Command. The data is around 90K rows. Any help on how I can make it run faster?
My initial thought would be to change the caching mode on the lookup. If that helps, you are golden. If not, I would also want to look at what other operations are being performed on the database while that lookup is running.
Also, would adding an index to the lookup column be of any benefit? These are just starters; a sketch of the index idea is below.
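On the index point, a minimal sketch with hypothetical table and column names, covering the column the Lookup joins on plus the column it returns:

-- Hypothetical names; adjust to the real lookup table
CREATE NONCLUSTERED INDEX IX_DimCustomer_BusinessKey
    ON dbo.DimCustomer (BusinessKey)     -- the column the Lookup matches on
    INCLUDE (CustomerKey);               -- the column the Lookup returns

In full cache mode the reference query is read once before the data flow starts, so keeping that query narrow (only the join and returned columns) also helps.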

Create and store large XML output in SQL

I have a query that returns approximately four hundred thousand records [400,000], up to 18 MB.
I shaped the output as XML using:
FOR XML PATH('url'), ROOT('urlset')
Now the query result doesn't show the complete XML, and when I try to view it in the SSMS XML window I am not able to export it to an XML file; it fails with an error.
I have already done the practices suggested in other posts:
Run the script to enable xp_cmdshell
Increase the XML data capacity to Unlimited under Options >> Results to Grid
Still the same error! How do I resolve this?
After a while, I tried what @gofr1 referred to in his comment.
The above setting does not take effect on opening a new query window or refreshing the database connection. It works ONLY AFTER restarting SSMS.
UPDATE NOTE: With this solution you can see the results in the grid, but if you try to export them to an XML file you will get the same error:
Exception of type 'System.OutOfMemoryException' was thrown.
UPDATE:
So, here I decided to pass the XML output to an application and let C# generate the XML file in the desired folder.
As the Google sitemap limit is 50,000 URLs per sitemap, you should create a number of sitemaps, each containing at most 50k records (a T-SQL sketch of that batching is below).
Note: Google allows 1,000 sitemap files for each domain.
EDIT:
Google has increased the maximum from 1,000 sitemaps to 50,000 sitemaps, and a sitemap file can now be up to 50 MB in size. This is a huge increase in capacity.
References:
seroundtable/archives/021559
searchengineland/google-bing-increase-file-size-limit-sitemaps-files
searchenginejournal/google-bing-increase-sitemap-file-size-limit-50
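To sketch the 50k-per-file batching in T-SQL (the dbo.Urls table and its columns here are hypothetical), each chunk can be built separately and handed to the application that writes the file:

DECLARE @batch INT = 0;      -- which 50,000-row chunk to build (0, 1, 2, ...)
DECLARE @sitemap XML;

SET @sitemap =
(
    SELECT u.Loc AS [loc]
    FROM dbo.Urls AS u                        -- hypothetical source table
    ORDER BY u.UrlId
    OFFSET @batch * 50000 ROWS FETCH NEXT 50000 ROWS ONLY
    FOR XML PATH('url'), ROOT('urlset'), TYPE
);

SELECT @sitemap;    -- the calling application writes this chunk out as its own sitemap file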
This is because SSMS is not able to show a large amount of data in the results window.
You can write a utility and dump the data into a file for better readability.
