Slow loading into ultra-wide tables on Redshift

I have a few ultra-wide tables (1500+ columns) which I am trying to load data into. I am loading GZIPped files from S3 using a manifest file.
The distkey of the table is 'date' and each file in S3 contains information for one particular date only. The columns are mostly floats, with a few dates and varchars.
Each file has approximately 16000 rows with 1500 columns, and is approximately 84 MiB gzipped. Even following best practices for loading, we are seeing very poor load performance: 100 records/s or approximately 300 kB/s.
Are there any suggestions for improving load speeds specifically for ultra-wide tables? I'm loading data into narrower tables using similar techniques with fairly reasonable speeds, so I have reason to believe that this is an artifact of the width of the table.

Having files separated by the DISTKEY field does not necessarily improve load speed. Amazon Redshift will use multiple nodes to import files in parallel. The node that reads one particular input file will not necessarily be the same node used to store the data. Therefore, the data will be sent between nodes (which is expected during a load process).
If the table has been newly created, then the load process will automatically use the first 100,000 rows to determine an optimal compression type for each column. It will then delete that data and restart the load process. To avoid this, either create the table with compression defined on each column or run the COPY command with the COMPUPDATE option set to OFF. If, on the other hand, there is already data in the table, then this automatic process will be skipped.
It is possible that the load process is consuming too much memory and is spilling to disk. Try increasing wlm_query_slot_count to increase the memory available to the COPY command. However, I'm not sure that this parameter applies to COPY commands (it is for 'queries', and the COPY command might not qualify as a query).
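As a minimal sketch of those two suggestions, assuming a psycopg2 connection; the table name, manifest path, and IAM role below are placeholders, not details from the question:

```python
# Hedged sketch: disable automatic compression analysis and widen the WLM slot
# allocation for the COPY session. All identifiers below are hypothetical.
import psycopg2

conn = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="loader", password="...")
conn.autocommit = True
cur = conn.cursor()

# Claim extra WLM slots for this session so the COPY gets more memory
# (with the caveat above that this may only apply to queries).
cur.execute("SET wlm_query_slot_count TO 4;")

# COMPUPDATE OFF skips the sample-and-reload pass that analyzes column
# compression on a freshly created table; declare column encodings in the
# CREATE TABLE instead.
cur.execute("""
    COPY wide_table
    FROM 's3://my-bucket/wide_table/2019-01-01.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MANIFEST GZIP
    COMPUPDATE OFF;
""")

cur.close()
conn.close()
```

The same COPY statement is also where the file-format options (JSON vs. CSV, mentioned below) are set.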

Adding for future reference:
One optimization that helped was switching from Gzipped JSON to CSV files. This reduced each file from 84 MiB to 11 MiB and tripled the loading speed.


Snowpipe Auto Ingest Commit Interval

I have a file which has 3 million records, and I am using the Snowpipe auto-ingest feature to load it automatically.
I want to know Snowpipe's behavior: will it load the 3 million records from the stage in one shot and then commit, or will it be some sort of incremental load like 10k, 20k, ... 3 million?
Since Snowpipe runs a COPY command under the hood, each file is a single transaction, and if there is any data issue, the copy will follow your file format and other properties to determine whether a partial load is allowed.
There are two factors essential to ingesting data faster via Snowpipe:
File size (better to have small files, under ~250 MB); the bigger the file, the slower the response and the higher the chance of failure
File format (in my experience, CSV works better)
The data latency will be around 15-30 seconds. I have simulated this and it works very well: 50-100 MB files load within ~20 seconds.
Alternatively, if the files are large, use an external table with auto-refresh, associate a task with it, and load the data via a COPY command. However, a task's minimum frequency is 1 minute, so your latency is always 1+ minute.
With Snowpipe, loads are combined or split into a single transaction or multiple transactions based on the number and size of the rows in each data file. This is different from COPY, where the load is always performed in a single transaction (all or nothing). So with Snowpipe you might start seeing some data before the entire file is loaded; therefore, do not rely on file-granularity transactions with Snowpipe. Files are a chunking mechanism for continuous data loading, and a file itself may be loaded in multiple chunks.
Snowpipe is designed to load new data typically within 1 minute after a file notification is sent, but loading can take longer for really large files.
The most efficient and cost-effective load with Snowpipe is achieved when file sizes are around 100-250 MB. If it takes longer than 1 minute to accumulate MBs of data in your application, consider aggregating the data to create a new data file within the 100-250 MB range. This leads to a good balance between cost and performance (load latency).
Ref: https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#continuous-data-loads-i-e-snowpipe-and-file-sizing
As per the recommendations, files should be in the range of 100-250 MB compressed for data loading. Loading very large files (e.g. 100 GB or larger) is not recommended.
Snowpipe works in parallel; it is used to load continuous streaming data in micro-batches in near real time (as soon as a file lands in S3, Azure Blob Storage, etc.). The number of data files that can be processed in parallel is determined by the number and capacity of servers/nodes in the warehouse.
For ad-hoc or one-time queries and loads, you can use the COPY INTO command.
Loading a single huge file via Snowpipe is not recommended.
If you try to ingest a single huge file with 3 million rows, Snowpipe will not be able to use parallel mode; even a large warehouse will not boost performance, because the number of load operations that run in parallel cannot exceed the number of data files to be loaded. A single-file load will use a single node of the warehouse, and the rest of the nodes will not be used.
So if you want to use Snowpipe auto-ingest, split the large file into smaller files (100-250 MB). Splitting larger data files allows the load to scale linearly.
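A minimal splitting sketch, assuming a single large CSV with a header row; the file names and the 200 MB target are placeholders. Each part can then be compressed and dropped into the stage for Snowpipe to pick up:

```python
# Hedged sketch: split one large CSV into ~200 MB parts, repeating the header
# in each part, so Snowpipe can load the parts in parallel.
SRC = "big_export.csv"            # hypothetical large source file
CHUNK_BYTES = 200 * 1024 * 1024   # target size per part, uncompressed

with open(SRC, "r", encoding="utf-8") as src:
    header = src.readline()
    part, out, written = 0, None, 0
    for line in src:
        if out is None or written >= CHUNK_BYTES:
            if out is not None:
                out.close()
            part += 1
            out = open(f"big_export_part{part:04d}.csv", "w", encoding="utf-8")
            out.write(header)
            written = 0
        out.write(line)
        written += len(line)   # character count; close enough for sizing
    if out is not None:
        out.close()
```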
Please refer to these links for more details
https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#general-file-sizing-recommendations
https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html

Given a 10^8-row, 10 GB import, is it better to import data as separate rows or to consolidate the rows and split them apart in the DB?

I'm doing a rather large import to a SQL Database, 10^8+ items and I am doing this with a bulk insert. I'm curious to know if the speed at which the bulk insert runs can be improved by importing multiple rows of data as a single row and splitting them once imported?
If the time to import data is determined by the sheer volume of data itself (i.e. 10 GB), then I'd expect that importing 10^6 rows vs. 10^2 rows with the data consolidated would take about the same amount of time.
If the time to import is instead limited by row operations and logging of each line rather than by the data itself, then I'd expect that consolidating data would have a performance benefit. I'm not sure, however, how this would carry over if one then had to break up the data in the DB later on.
Does anyone have experience with this and can shed some light on what specifically can be done to reduce bulk insert time without simply adding that time later to split the data in DB?
Given a 10 GB import, is it better to import data as separate rows, or to consolidate the rows and split them apart in the DB?
[EDIT] I'm testing this on a quad-core 2.5 GHz machine with 8 GB of RAM and 300 MB/sec of reads/writes to disk (striped array). The files are hosted on the same array, and the average row size varies, with some rows containing large amounts of data (> 100 KB) and many under 100 B.
I've chunked my data into 100 MB files and it takes about 40 seconds to import the file. Each file has 10^6 rows in it.
Where is the data that you are importing? If it is on another server, then the network might be the bottleneck. This then depends on the number of NICs and frame sizes.
If it is on the same server, things to play with are the batch size and the recovery model, which affect the log file. In the full recovery model, everything is written to the log file. The bulk-logged recovery model has a little less logging overhead.
Since this is staging data, taking a full backup before the process, changing the model to simple, and then importing might reduce the time. Of course, change the model back to full and take another backup afterward.
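A minimal sketch of that backup/simple/import/full sequence, assuming pyodbc and placeholder server, database, table, file, and delimiter choices (take the backups described above before and after):

```python
# Hedged sketch: flip the staging database to SIMPLE recovery for the bulk
# load, then back to FULL. All names and paths below are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=StagingDB;Trusted_Connection=yes;",
    autocommit=True)  # ALTER DATABASE must run outside a user transaction
cur = conn.cursor()

cur.execute("ALTER DATABASE StagingDB SET RECOVERY SIMPLE;")

# One 100 MB chunk per statement; BATCHSIZE and TABLOCK are the batch-size
# and logging knobs mentioned above.
cur.execute(r"""
    BULK INSERT dbo.ImportStaging
    FROM 'D:\import\chunk_0001.txt'
    WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n',
          BATCHSIZE = 100000, TABLOCK);
""")

cur.execute("ALTER DATABASE StagingDB SET RECOVERY FULL;")
cur.close()
conn.close()
```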
As for importing non-normalized data, multiple rows at a time, I usually stay away from the extra coding.
Most of the time, I use SSIS packages. More packages and threads mean a fuller NIC pipe. I usually have at least a 4 GB backbone that is seldom full.
Other things that come into play are your disks. Do you have multiple files (pathways) to the RAID 5 array? If not, you might want to think about it.
In short, it really depends on your environment.
Use a DMAIC process.
1 - Define what you want to do
2 - Measure the current implementation
3 - Analyze ways to improve.
4 - Implement the change.
5 - Control the environment by remeasuring.
Did the change go in the positive direction?
If not, rollback the change and try another one.
Repeat the process until the desired result (timing) is achieved.
Good luck, J
If this is a one-time thing done in an offline change window, you may want to consider putting the database in the simple recovery model prior to inserting the data.
Keep in mind, though, that this would break the log chain...

SSIS processing a large number of flat files is painfully slow

From one of our partners, I receive about 10,000 small tab-delimited text files with roughly 30 records in each file. It is impossible for them to deliver it as one big file.
I process these files in a Foreach Loop container. After reading a file, 4 column derivations are performed and the contents are then stored in a SQL Server 2012 table.
This process can take up to two hours.
I already tried merging the small files into one big file and then importing that into the same table. This process takes even more time.
Does anyone have any suggestions to speed up processing?
One thing that sounds counterintuitive is to replace your one Derived Column transformation with 4 and have each one perform a single task. The reason this can provide a performance improvement is that the engine can better parallelize operations if it can determine that these changes are independent.
Investigation: Can different combinations of components affect Dataflow performance?
Increasing Throughput of Pipelines by Splitting Synchronous Transformations into Multiple Tasks
You might be running into network latency since you are referencing files on a remote server. Perhaps you can improve performance by copying those remote files to the local box before you begin processing. The performance counters you'd be interested in are
Network Interface / Current Bandwidth
Network Interface / Bytes Total / sec
Network Interface / Transfers/sec
The other thing you can do is replace your destination and derived column with a Row Count transformation. Run the package a few times for all the files; that will determine your theoretical maximum speed, and you won't be able to go any faster than that. Then add your Derived Column back in and re-run. That should help you understand whether the drop in performance is due to the destination, to the derived column operation, or whether the package is simply running as fast as the IO subsystem can go.
Do your files offer an easy way (i.e. their names) of subdividing them into even (or mostly even) groups? If so, you could run your loads in parallel.
For example, let's say you could divide them into 4 groups of 2,500 files each.
Create a Foreach Loop container for each group.
For your destination for each group, write your records to their own staging table.
Combine all records from all staging tables into your big table at the end.
If the files themselves don't offer an easy way to group them, consider pushing them into subfolders when your partner sends them over, or inserting the file paths into a database so you can write a query to subdivide them and use the file path field as a variable in the Data Flow task.
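If nothing in the file names lends itself to grouping, a small pre-processing step can impose one. This is a hedged sketch (paths are placeholders) that round-robins the incoming files into four subfolders, so each Foreach Loop container can point at its own folder and write to its own staging table:

```python
# Hedged sketch: distribute ~10,000 flat files into 4 subfolders so four
# parallel Foreach Loop containers can each process one folder.
import shutil
from pathlib import Path

inbox = Path(r"D:\partner_drop")   # hypothetical landing folder
groups = 4

for i, f in enumerate(sorted(inbox.glob("*.txt"))):
    target = inbox / f"group_{i % groups + 1}"
    target.mkdir(exist_ok=True)
    shutil.move(str(f), str(target / f.name))
```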

How to instantly query a 64 GB database

OK everyone, I have an excellent challenge for you. Here is the format of my data:
ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP
ID is my primary key and index; I only ever query the database by ID. The data model is very simple.
My problem is as follows:
I have 64 GB+ of data in the format defined above, and in a real-time application I need to query the database and retrieve the data instantly. I was thinking about two solutions, but neither has been possible to set up.
The first is to use SQLite or MySQL: one table with an index on the ID column. The problem is that the database would be too large to have good performance, especially with SQLite.
The second is to store everything in memory in a huge hash table; RAM is the limit.
Do you have another suggestion? How about serializing everything to the filesystem and then, on each query, storing the queried data in a cache system?
When I say real-time, I mean about 100-200 queries/second.
A thorough answer would take into account data access patterns. Since we don't have these, we just have to assume an equal probability that any given row will be accessed next.
I would first try using a real RDBMS, either embedded or a local server, and measure the performance. If this gives 100-200 queries/sec then you're done.
Otherwise, if the format is simple, then you could create a memory mapped file and handle the reading yourself using a binary search on the sorted ID column. The OS will manage pulling pages from disk into memory, and so you get free use of caching for frequently accessed pages.
Cache use can be optimized more by creating a separate index, and grouping the rows by access pattern, such that rows that are often read are grouped together (e.g. placed first), and rows that are often read in succession are placed close to each other (e.g. in succession.) This will ensure that you get the most back for a cache miss.
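A minimal sketch of the memory-mapped lookup, assuming fixed-size records sorted by a 64-bit integer ID at the start of each record; the record size, layout, and file name are assumptions:

```python
# Hedged sketch: binary search over a memory-mapped file of fixed-size records
# sorted by ID; the OS page cache decides what stays resident in memory.
import mmap
import struct

RECORD_SIZE = 4096   # hypothetical fixed record size in bytes
ID_FMT = "<q"        # little-endian 64-bit ID at the start of each record

def lookup(mm, target_id):
    """Return the record bytes for target_id, or None if not present."""
    lo, hi = 0, len(mm) // RECORD_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        off = mid * RECORD_SIZE
        (rec_id,) = struct.unpack_from(ID_FMT, mm, off)
        if rec_id == target_id:
            return mm[off:off + RECORD_SIZE]
        if rec_id < target_id:
            lo = mid + 1
        else:
            hi = mid
    return None

with open("records.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    row = lookup(mm, 123456)
```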
Given the way the data is used, you should do the following:
Create a record structure (fixed size) that is large enough to contain one full row of data
Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (incremental)
Do a direct access on the file and leave caching to the OS. To get record number N (0-based), multiply N by the size of a record (in bytes) and read the record directly from that offset in the file.
Since you're in read-only mode, and assuming you're storing your file on random-access media, this scales very well and does not depend on the size of the data: each fetch is a single read from the file. You could try some fancy caching system, but I doubt it would gain you much in terms of performance unless you have a lot of requests for the same data rows (and the OS you're using is doing poor caching). Make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching mechanism.
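A minimal sketch of that direct-offset access, assuming IDs are dense so that record number N is the row with ID N; the record size, layout, and file name are illustrative:

```python
# Hedged sketch: one seek + one read per lookup into a flat file of
# fixed-size records; caching is left entirely to the OS.
import struct

RECORD_SIZE = 4096   # hypothetical fixed record size in bytes

def read_record(f, n):
    """Fetch record number n (0-based) with a single seek and read."""
    f.seek(n * RECORD_SIZE)
    raw = f.read(RECORD_SIZE)
    rec_id = struct.unpack_from("<q", raw, 0)[0]  # assumed 64-bit ID field
    return rec_id, raw

# Open read-only once and keep the handle for the life of the service
# rather than reopening the file on every query.
with open("records.dat", "rb") as data_file:
    rec_id, payload = read_record(data_file, 42)
```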

Storage of many log files

I have a system which is receiving log files from different places over HTTP (>10k producers, 10 logs per day, ~100 lines of text each).
I would like to store them so that I can compute miscellaneous statistics over them nightly and export them (ordered by date of arrival or by first-line content)...
My question is: what's the best way to store them?
Flat text files (with proper locking), one file per uploaded file, one directory per day/producer
Flat text files, one (big) file per day for all producers (problem here will be indexing and locking)
Database table with text (MySQL is preferred for internal reasons) (problem with DB purge, as DELETE can be very slow!)
Database Table with one record per line of text
Database with sharding (one table per day), allowing simple data purge (this is essentially partitioning; however, the version of MySQL I have access to, i.e. the one supported internally, does not support it)
Document-based DB à la CouchDB or MongoDB (concerns could be indexing / maturity / speed of ingestion)
Any advice?
(Disclaimer: I work on MongoDB.)
I think MongoDB is the best solution for logging. It is blazingly fast, as in, it can probably insert data faster than you can send it. You can do interesting queries on the data (e.g., ranges of dates or log levels) and index any field or combination of fields. It's also nice because you can add more fields to logs at any time ("oops, we want a stack trace field for some of these") and it won't cause problems (as it would with flat text files).
As far as stability goes, a lot of people are already using MongoDB in production (see http://www.mongodb.org/display/DOCS/Production+Deployments). We just have a few more features we want to add before we go to 1.0.
I'd pick the very first solution.
I don't see why you would need a DB at all. It seems like all you need is to scan through the data. Keep the logs in the most "raw" state, then process them, and then create a tarball for each day.
The only reason to aggregate would be to reduce the number of files. On some file systems, if you put more than N files in a directory, the performance decreases rapidly. Check your filesystem and if it's the case, organize a simple 2-level hierarchy, say, using the first 2 digits of producer ID as the first level directory name.
I would write one file per upload, and one directory/day as you first suggested. At the end of the day, run your processing over the files, and then tar.bz2 the directory.
The tarball will still be searchable, and will likely be quite small as logs can usually compress quite well.
For total data, you are talking about 1 GB [corrected from 10 MB] a day uncompressed. This will likely compress to 100 MB or less; I've seen 200x compression on my log files with bzip2. You could easily store the compressed data on a file system for years without any worries. For additional processing you can write scripts which search the compressed tarball and generate more stats.
Since you would like to store them to compute miscellaneous statistics over them nightly and export them (ordered by date of arrival or first-line content), and you're expecting 100,000 files a day at a total of 10,000,000 lines:
I'd suggest:
Store all the files as regular text files using the following path format: yyyymmdd/producerid/fileno.
At the end of the day, clear the database and load all the text files for the day.
After loading the files, it would be easy to get the stats from the database and post them in any format needed (maybe even into another "stats" database). You could also generate graphs.
To save space, you could compress the daily folder. Since they're text files, they would compress well.
So you would only be using the database to be able to easily aggregate the data. You could also reproduce the reports for an older day if the process didn't work, by going through the same steps.
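A minimal end-of-day sketch of this flow over the yyyymmdd/producerid/fileno layout. As a stand-in for the database-load step it just computes a per-producer line count, then compresses the day's folder; the paths and the statistic are illustrative:

```python
# Hedged sketch: nightly pass over one day's logs laid out as
# yyyymmdd/producerid/fileno, then compress the folder once processed.
import csv
import tarfile
from collections import Counter
from pathlib import Path

day_dir = Path("logs/20240101")          # hypothetical day folder
lines_per_producer = Counter()

for producer_dir in day_dir.iterdir():
    if not producer_dir.is_dir():
        continue
    for log_file in producer_dir.iterdir():
        with open(log_file, encoding="utf-8", errors="replace") as fh:
            lines_per_producer[producer_dir.name] += sum(1 for _ in fh)

# Export the nightly stats (in the scheme above, this is where you would
# bulk-load the day's lines into the scratch database instead).
with open(str(day_dir) + ".stats.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["producer_id", "lines"])
    for producer, count in sorted(lines_per_producer.items()):
        writer.writerow([producer, count])

# Compress the processed day to save space.
with tarfile.open(str(day_dir) + ".tar.bz2", "w:bz2") as tar:
    tar.add(day_dir, arcname=day_dir.name)
```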
In my experience, a single large table performs much faster than several linked tables if we are talking about a database solution, particularly for write and delete operations. For example, splitting one table into three linked tables can decrease performance 3-5 times. This is very rough, and of course it depends on the details, but generally this is the risk, and it gets worse as data volumes grow very large. The best way, IMO, to store log data is not as flat text but in a structured form, so that you can do efficient queries and formatting later. Managing log files can be a pain, especially when there are lots of them coming from many sources and locations. Check out our solution; IMO it can save you a lot of development time.
