I have a large cube where processing times have become too long, and I want to change my cube partitioning and processing options. I understand that Process Incremental will pull new records into the cube. My question is: is there an advantage to having multiple partitions and performing Process Incremental, rather than having just one partition and performing Process Incremental? I do not expect a large volume of new records each time I process.
The advantage of having multiple partitions is that you can load into each in parallel. If the volume of new records isn't very large and the processing time is quick, you could get away with just one partition.
The problem with having multiple partitions is that you will have to manage what data is exposed to each partition. If the same data is processed into multiple partitions, then you'll get duplicates in the cube.
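One common way to keep partition data disjoint is to bind each partition to a non-overlapping slice of the fact table, for example by date range. A minimal sketch of generating such partition source queries (the table and column names are illustrative, not from the question):

```python
# Sketch: generate non-overlapping source queries, one per yearly partition,
# so the same fact row can never be processed into two partitions.
# "FactSales" and "OrderDate" are hypothetical names.
def partition_queries(table, date_col, years):
    queries = []
    for year in years:
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {date_col} >= '{year}-01-01' AND {date_col} < '{year + 1}-01-01'"
        )
    return queries
```

Because each query covers a half-open interval, the boundaries cannot overlap or leave gaps between partitions.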
I have a stream of data, containing a key, that I need to mix and match with data associated with that key. Each key belongs to a partition, and each partition can be loaded from a database.
The data is quite big, and only a few hundred of the hundreds of thousands of partitions can fit in a single task manager.
My current approach is to use partitionCustom based on the key.partition and cache the partition data inside a RichMapFunction to mix and match without reloading the data of the partitions multiple times.
When the message rate on a single partition gets too high, I hit a hot-spot/performance bottleneck.
What tools do I have in Flink to improve the throughput in this case?
Are there ways to customize the scheduling and to optimize the job placements based on setup time on the machines, and maximum processing time history?
It sounds like (a) your DB-based data is also partitioned, and (b) you have skew in your keys, where one partition gets a lot more keys than other partitions.
Assuming the above is correct, and you've done code profiling on your "mix and match" code to make that reasonably efficient, then you're left with manual optimizations. For example, if you know that keys in partition X are much more common, you can put all of those keys in one partition, and then distribute the remaining keys amongst the other partitions.
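That manual scheme can be sketched as a custom partitioner; the hot-partition id and slot count below are assumptions for illustration, not from the question:

```python
import zlib

HOT_PARTITION = "X"   # the partition known to receive far more keys (assumed)
NUM_SLOTS = 8         # parallelism of the downstream operator (assumed)

def choose_slot(key_partition: str) -> int:
    """Give the hot partition a dedicated slot; hash the rest over the others."""
    if key_partition == HOT_PARTITION:
        return 0
    # crc32 is a stable hash, so the same partition always lands on the same slot
    return 1 + (zlib.crc32(key_partition.encode()) % (NUM_SLOTS - 1))
```

In Flink this logic would live in the Partitioner you pass to partitionCustom.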
Another approach is to add a "batcher" operator, which puts up to N keys for the same partition into a group (typically this also needs a timeout to flush, so data doesn't get stuck). If you can batch enough keys, then it might not be so bad to load the DB data on demand for the partition associated with each batch of keys.
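The batcher described above can be sketched as a small count-plus-timeout buffer; this is a minimal illustration, not Flink operator code:

```python
import time

# Sketch of a count-plus-timeout batcher: emit a batch when it reaches
# max_size, or when max_age_s has elapsed since the first buffered element,
# so keys never get stuck waiting for a full batch.
class Batcher:
    def __init__(self, max_size, max_age_s, emit):
        self.max_size = max_size
        self.max_age_s = max_age_s
        self.emit = emit            # callback receiving a list of keys
        self.buf = []
        self.first_ts = None

    def add(self, key, now=None):
        now = time.monotonic() if now is None else now
        if not self.buf:
            self.first_ts = now
        self.buf.append(key)
        if len(self.buf) >= self.max_size or now - self.first_ts >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.buf:
            self.emit(self.buf)
            self.buf = []
            self.first_ts = None
```

In a real job you would keep one such buffer per DB partition, so each flush loads that partition's data once and serves the whole batch.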
I find myself dealing with a Redshift cluster with 2 different types of tables: the ones that get fully replaced every day and the ones that receive a merge every day.
From what I understand so far, there are maintenance commands that should be run, since all these tables have millions of rows. The three commands I've found so far are:
vacuum table_name;
vacuum reindex table_name;
analyze table_name;
Which of these commands should be applied in which circumstances? I'm planning on running them daily after the loads, which happen in the middle of the night. The reason to do it daily is that after running some of them manually, there was a huge performance improvement.
After reading the documentation, I feel it's not very clear what the standard procedure should be.
All the tables have interleaved sortkeys regardless of the load type.
A quick summary of the commands, from the VACUUM documentation:
VACUUM: Sorts the specified table (or all tables in the current database) and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations. A full vacuum doesn't perform a reindex for interleaved tables.
VACUUM REINDEX: Analyzes the distribution of the values in interleaved sort key columns, then performs a full VACUUM operation.
ANALYZE: Updates table statistics for use by the query planner.
It is good practice to perform an ANALYZE when significant quantities of data have been loaded into a table. In fact, Amazon Redshift will automatically skip the analysis if less than 10% of data has changed, so there is little harm in running ANALYZE.
You mention that some tables get fully replaced every day. This should be done either by dropping and recreating the table, or by using TRUNCATE. Emptying a table with DELETE is less efficient and should not be used for this purpose.
VACUUM can take significant time. In situations where data is being appended in time-order and the table's SORTKEY is based on the time, it is not necessary to vacuum the table. This is because the table is effectively sorted already. This, however, does not apply to interleaved sorts.
Interleaved sorts are more tricky. From the sort key documentation:
An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order.
Basically, interleaved sorts use a fancy algorithm to sort the data so that queries based on any of the columns (individually or in combination) will minimize the number of data blocks that are required to be read from disk. Disk access always takes the most time in a database, so minimizing disk access is the best way to speed-up the database. Amazon Redshift uses Zone Maps to identify which blocks to read from disk and the best way to minimize such access is to sort data and then skip over as many blocks as possible when performing queries.
Interleaved sorts are less performant than normal sorts, but give the benefit that multiple fields are fairly well sorted. Only use interleaved sorts if you often query on many different fields. The overhead in maintaining an interleaved sort (via VACUUM REINDEX) is quite high and should only be done if the reindex effort is worth the result.
So, in summary:
ANALYZE after significant data changes
VACUUM regularly if you delete data from the table
VACUUM REINDEX if you use Interleaved Sorts and significant amounts of data have changed
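Those rules can be captured in a small helper that picks the nightly statements for each table; the load-type labels below are assumptions for illustration, and this is a simplification of the guidance above:

```python
# Sketch: choose maintenance statements per table based on the rules above.
# load_type is "replace" (rebuilt via TRUNCATE) or "merge" (updates/deletes).
def maintenance_sql(table, load_type, interleaved):
    stmts = []
    if load_type == "merge":
        # deleted rows need reclaiming; interleaved keys also need a reindex
        stmts.append(f"VACUUM REINDEX {table};" if interleaved
                     else f"VACUUM {table};")
    # ANALYZE is cheap: Redshift skips it if under 10% of the data changed
    stmts.append(f"ANALYZE {table};")
    return stmts
```

A scheduler would loop over the table catalog and run each list after the nightly load completes.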
I have an SSAS 2008 cube.
I've just inserted some more data (4 million transactions) into the fact table, and the dimensions are still good too. I accidentally refreshed my Excel pivot table and noticed that my new data is there - I thought I had to reprocess the cube for this!
That leaves me asking:
When do I need to process the cube? Is it ONLY structural changes?
When do I need to process dimensions?
If I don't need to process the cube on inserting new data into source tables, what happens if I insert bad data into the source i.e. something that does not have a matching dimension key?
@Warren, I know it has been a while, but I have to say the issue you mention here is a data-latency issue. It depends on the storage mode you choose for the measure groups within your multidimensional cube. If it is ROLAP, there is no data-latency issue and you do not need to re-process the cube. However, if it is MOLAP, everything (i.e. data, metadata, and aggregations) is stored in the cube, so every time you run some ETL you need to re-process the cube to show the updated data.
You can process a cube under 3 conditions.
If you are modifying the structure of the cube, you may be required to process the cube with the Full Process option.
If you are adding new data to the cube, you can process the cube with the Incremental update option.
To clear out and replace a cube's source data, you can use the Refresh data processing option.
Find more at:
http://technet.microsoft.com/en-us/library/aa933573%28v=sql.80%29.aspx
From one of our partners, I receive about 10,000 small tab-delimited text files with roughly 30 records in each file. It is impossible for them to deliver it in one big file.
I process these files in a ForEach Loop container. After reading a file, four column derivations are performed, and then the contents are stored in a SQL Server 2012 table.
This process can take up to two hours.
I already tried combining the small files into one big file and then importing that into the same table. That process takes even more time.
Does anyone have any suggestions to speed up processing?
One thing that sounds counterintuitive is to replace your one Derived Column Transformation with four, and have each one perform a single task. The reason this can improve performance is that the engine can better parallelize operations if it can determine that the changes are independent.
Investigation: Can different combinations of components affect Dataflow performance?
Increasing Throughput of Pipelines by Splitting Synchronous Transformations into Multiple Tasks
You might be running into network latency since you are referencing files on a remote server. Perhaps you can improve performance by copying those remote files to the local box before you begin processing. The performance counters you'd be interested in are:
Network Interface / Current Bandwidth
Network Interface / Bytes Total / sec
Network Interface / Transfers/sec
The other thing you can do is replace your destination and derived column with a Row Count transformation. Run the package a few times for all the files and that will determine your theoretical maximum speed - you won't be able to go any faster than that. Then add your Derived Column back in and re-run. That should help you understand whether the drop in performance is due to the destination, to the derived column operation, or whether the package is simply running as fast as the IO subsystem can go.
Do your files offer an easy way (i.e. their names) of subdividing them into even (or mostly even) groups? If so, you could run your loads in parallel.
For example, let's say you could divide them into 4 groups of 2,500 files each.
Create a Foreach Loop container for each group.
For your destination for each group, write your records to their own staging table.
Combine all records from all staging tables into your big table at the end.
If the files themselves don't offer an easy way to group them, consider pushing them into subfolders when your partner sends them over, or inserting the file paths into a database so you can write a query to subdivide them and use the file path field as a variable in the Data Flow task.
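Even without meaningful file names, a simple round-robin split gives balanced groups. A minimal sketch of dividing the file list for the parallel loops:

```python
# Sketch: split a list of file paths into n roughly even groups, one per
# parallel Foreach loop / staging table.
def split_groups(files, n):
    groups = [[] for _ in range(n)]
    for i, path in enumerate(sorted(files)):
        groups[i % n].append(path)
    return groups
```

The group assignments could be written to a database table up front, so each Data Flow task queries only the paths belonging to its own group.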
I am developing a job application which executes multiple parallel jobs. Every job pulls data from a third-party source and processes it, with a minimum of 100,000 records per job. So I am creating a new table for each job (like Job123, where 123 is the job ID) and processing it. When a job starts, it clears the old records, gets new records, and processes them. Now the problem is that I have 1,000 jobs, so the DB has 1,000 tables, and the DB size has drastically increased due to the large number of tables.
My question is whether it is OK to create a new table for each job, or to have only one table called Job with a jobId column, then enter the data and process it. The only problem is that every job will have 100,000+ records. If we have only one table, will DB performance be affected?
Please let me know which approach is better.
Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine; that's what databases are for. But... I suspect that you don't need 100 million persistent records, do you? It looks like you only process one job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing. But even after the delete it will take up as much space as 100k records.
I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.
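A minimal sketch of that single-table design (the table and column names are illustrative, not from the question):

```python
# Sketch: one table for all jobs, with job_id leading the clustered key so
# each job's rows are stored together; reloading a job becomes a ranged
# DELETE instead of dropping and recreating a per-job table.
DDL = """
CREATE TABLE JobRecord (
    job_id    INT            NOT NULL,
    record_id BIGINT         NOT NULL,
    payload   NVARCHAR(MAX)  NULL,
    CONSTRAINT PK_JobRecord PRIMARY KEY CLUSTERED (job_id, record_id)
);
"""

def clear_job_sql(job_id: int) -> str:
    """Statement a job would run at start-up to clear its old records."""
    return f"DELETE FROM JobRecord WHERE job_id = {job_id};"
```

With job_id first in the clustered key, both the delete and the per-job reads touch one contiguous range of the index rather than scanning the whole table.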