I have an SSAS 2008 cube.
I've just inserted some more data (4 million transactions) into the fact table, and the dimensions are still good too. I accidentally refreshed my Excel pivot table and noticed that my new data is there - I thought I had to reprocess the cube for this!!
That leaves me asking:
When do I need to process the cube? Is it ONLY structural changes?
When do I need to process dimensions?
If I don't need to process the cube when inserting new data into the source tables, what happens if I insert bad data into the source, i.e. something that does not have a matching dimension key?
@Warren, I know it has been a while, but I have to say the issue you mention here is a data latency issue. It depends on the storage mode you choose for the measure groups within your multidimensional cube. For example, if it is ROLAP, there is no data latency issue and you do not need to reprocess the cube. However, if it is MOLAP, everything (i.e. data, metadata, and aggregations) is stored in the cube, so every time you do some ETL you need to reprocess it to show the updated data.
You can process a cube under three conditions:
If you are modifying the structure of the cube, you may be required to process the cube with the Full Process option.
If you are adding new data to the cube, you can process the cube with the Incremental update option.
To clear out and replace a cube's source data, you can use the Refresh data processing option.
Find more here:
http://technet.microsoft.com/en-us/library/aa933573%28v=sql.80%29.aspx
Let me first say that I don't have any solid SQL knowledge and this is the first time I'm working with Tabular, so please pardon any inaccuracy or naive mistake.
I have a big dataset in Tabular linked to a server which receives entries every day. Whenever I process (update) the main table to which all the tables are linked, I'm basically reprocessing a lot of entries that I already have plus a few new ones.
I'd like to figure out a way to process only entries that I don't have yet and not the entire dataset.
We have 10 slightly different SSIS packages that transfer data from one database to another. Whenever we make a change to the first DB, such as adding a new field or changing a property of said field, such as extending a varchar's length, we have to update the packages as well.
Each of these packages has a long flow with multiple merge joins, sorts, conditional statements, etc. If the field that needs to be changed is at the beginning of the process, I have to go through each merge and update it with the new change, and each time I do so it takes a few minutes to process before I'm on to the next one. As I get near the end, the process takes longer and longer to compute for each merge join. Doing this for 10 different packages, even if they are done at the same time, still takes upwards of 3 hours. This is time-consuming and very monotonous. There's got to be a better way, right?
BIML is very good for this. BIML is an XML-based technology which translates to .dtsx packages. BIMLScript is BIML interleaved with C# or VB to provide control flow logic, so you can create multiple packages/package elements based on conditions. You can easily query the table structure or custom metadata, so that if you are only doing DB-to-DB transformations, you can make structural changes to the database(s) and regenerate your SSIS packages without having to do any editing.
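For example, since BIMLScript can read database metadata, the column list for each generated dataflow could be driven by a query along these lines (a minimal sketch; 'dbo' and 'SourceTable' are placeholder names, not from the original question):

    -- Metadata query a BIMLScript might loop over to emit the columns of a
    -- source table, so the generated package always matches the current schema.
    SELECT
        COLUMN_NAME,
        DATA_TYPE,
        CHARACTER_MAXIMUM_LENGTH,
        IS_NULLABLE
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'dbo'
      AND TABLE_NAME   = 'SourceTable'
    ORDER BY ORDINAL_POSITION;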
The short answer is no. The metadata that SSIS generates makes it very awkward when data sources change. You can go down the road of dynamically generated packages, but it's not ideal.
Your other option is damage reduction. Consider whether you could implement the Canonical Data Model pattern:
http://www.eaipatterns.com/CanonicalDataModel.html
It would involve mapping the data to some kind of internal format immediately on receiving it, possibly via a temporary table or cache, and then only using your internal format from then on. Then you map back to your output format at the end of processing.
While this does increase the overall complexity of your package, it means that an external data source changing will only affect the transforms at the beginning and end of your processing, which may well save you lots of time in the long run.
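As a rough sketch of the idea (all table and column names here are invented for illustration): the external feed lands in a staging table shaped like the source, and one mapping step converts it into the internal format that everything downstream uses.

    -- Staging table in the external (source) shape; only this table and the
    -- mapping below have to change when the external source changes.
    -- (Assumes a 'staging' schema already exists.)
    CREATE TABLE staging.CustomerFeed (
        CustFirstName varchar(50)  NULL,
        CustLastName  varchar(50)  NULL,
        CustEmail     varchar(255) NULL
    );

    -- Canonical, internal shape used by the rest of the processing.
    CREATE TABLE dbo.Customer (
        FirstName varchar(50)  NULL,
        LastName  varchar(50)  NULL,
        Email     varchar(255) NULL
    );

    -- Single mapping step from the external format to the canonical format.
    INSERT INTO dbo.Customer (FirstName, LastName, Email)
    SELECT CustFirstName, CustLastName, CustEmail
    FROM staging.CustomerFeed;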
I have a large cube where processing times have become too long. I want to change my cube partitioning and processing options. I understand that process incremental will pull new records into the cube. My question is: is there an advantage to having multiple partitions and performing process incremental, rather than just having one partition and performing process incremental? I do not expect a large volume of new records each time I process.
The advantage of having multiple partitions is that you can load into each in parallel. If the volume of new records isn't very large and the processing time is quick, you could get away with just one partition.
The problem with having multiple partitions is that you will have to manage what data is exposed to each partition. If the same data is processed into multiple partitions, then you'll get duplicates in the cube.
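To make the duplicate risk concrete: each partition in a measure group is typically bound to its own source query (or table slice), and those queries have to cut the fact table into non-overlapping ranges, for example (fact table and column names are illustrative only):

    -- Source query for partition "FactSales 2013"
    SELECT * FROM dbo.FactSales
    WHERE OrderDateKey BETWEEN 20130101 AND 20131231;

    -- Source query for partition "FactSales 2014"; if this range overlapped
    -- the one above, the shared rows would be processed into both partitions
    -- and counted twice in the cube.
    SELECT * FROM dbo.FactSales
    WHERE OrderDateKey BETWEEN 20140101 AND 20141231;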
Our application will store some information from a user that we do not want to be traced back to any other records in the database. For example (albeit a stupid one) - a user must pay to tell us anonymously what their favorite color is. We want to store each color record as a new row in the database and keep track of the transaction information.
If we stored the colors and transactions in separate tables, the rows could be correlated to one another if the server were hacked, by using the sequential IDs of the rows (because a color will always have a transaction) or by the creation time of the rows. So to solve this we won't have a sequential ID column or an update/modification time on the colors table.
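As an illustration of that design (names are invented, and a GUID key is just one way to avoid ordering): the colors table gets a non-sequential key and no timestamps, while the transactions table keeps its normal bookkeeping and has no link back.

    -- Colors table: no IDENTITY column, no created/modified timestamps.
    -- NEWID() values are non-sequential, so the key reveals no insert order.
    CREATE TABLE dbo.FavoriteColor (
        ColorKey uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
        Color    varchar(50)      NOT NULL
    );

    -- Transactions table: keeps its own sequential key and timestamp,
    -- with no foreign key pointing at dbo.FavoriteColor.
    CREATE TABLE dbo.PaymentTransaction (
        TransactionId int IDENTITY(1,1) PRIMARY KEY,
        Amount        decimal(10,2) NOT NULL,
        CreatedAt     datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
    );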
Now, the only way to associate a color with a transaction is to look at the files that are used to actually store the database information. While this may be difficult and tedious, I imagine it is still possible because the colors table information would probably be stored sequentially in the files.
How can I store database information in an unordered manner, so that this could never happen? I suppose a more general question is how do I store information anonymously and securely? (But that is way too broad)
Obviously, one answer is "don't let your database get hacked", but that's not a good one.
You can pre-generate millions of rows and randomly populate them.
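A sketch of that approach in T-SQL (table and column names are made up): create the pool once, then have each submission claim a random free row, so neither key order nor physical position reflects when a color actually arrived.

    -- Pool table: rows exist before any real data does.
    CREATE TABLE dbo.ColorPool (
        ColorKey uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
        Color    varchar(50)      NULL,
        IsUsed   bit              NOT NULL DEFAULT 0
    );

    -- One-time setup: pre-generate a million placeholder rows.
    INSERT INTO dbo.ColorPool (ColorKey)
    SELECT TOP (1000000) NEWID()
    FROM sys.all_objects a CROSS JOIN sys.all_objects b;

    -- Per submission: claim one unused row at random and fill it in.
    DECLARE @Color varchar(50) = 'blue';
    UPDATE dbo.ColorPool
    SET Color = @Color, IsUsed = 1
    WHERE ColorKey = (
        SELECT TOP (1) ColorKey
        FROM dbo.ColorPool
        WHERE IsUsed = 0
        ORDER BY NEWID()
    );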
If you need to analyze the data, you will need to understand it, and if you can, an attacker can too. No matter what clever solution you come up with, correlation will still be possible. Relational DB transaction logs will show what was inserted, updated, or deleted, and when and where. So you cannot provide 100% decoupling of the data if you want to use the same DB. You could encrypt the data with some HSM, which would render stolen data useless to an attacker. Or you can store the data on some other machine, with a random delay or some batch processing (wait and insert 20 records instead of one)... but it can be tricky and it can fail.
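The batch-processing idea might look roughly like this (invented names, reusing the illustrative dbo.FavoriteColor table from the sketch above): submissions land in a holding table first, and a scheduled job later moves them across in one batch, so individual insert times are lost in the final table.

    -- Holding table the application writes to in real time.
    CREATE TABLE dbo.ColorInbox (
        ColorKey uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
        Color    varchar(50)      NOT NULL
    );

    -- Scheduled job (e.g. hourly): move the whole batch at once into the real
    -- table, then clear the inbox, so the final table's rows can no longer be
    -- lined up against individual transactions by time.
    INSERT INTO dbo.FavoriteColor (ColorKey, Color)
    SELECT ColorKey, Color
    FROM dbo.ColorInbox;

    DELETE FROM dbo.ColorInbox;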
Consider leveraging a non-relational database, e.g. NoSQL.
I'm looking for some design help here.
I'm doing work for a client that requires me to store data about their tens of thousands of employees. The data is being given to me in Excel spreadsheets, one for each city/country in which they have offices.
I have a database that contains a spreadsheets table and a data table. The data table has a column spreadsheet_id which links it back to the spreadsheets table so that I know which spreadsheet each data row came from. I also have a simple shell script which uploads the data to the database.
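Concretely, that setup amounts to something like this (sketched in T-SQL just for illustration; the actual column list obviously depends on the spreadsheet contents):

    CREATE TABLE spreadsheets (
        id          int IDENTITY(1,1) PRIMARY KEY,
        file_name   varchar(255) NOT NULL,
        uploaded_at datetime2    NOT NULL DEFAULT SYSUTCDATETIME()
    );

    CREATE TABLE data (
        id             int IDENTITY(1,1) PRIMARY KEY,
        spreadsheet_id int NOT NULL REFERENCES spreadsheets (id),
        -- illustrative employee columns; the real sheets have more
        employee_no    int          NULL,
        employee_name  varchar(255) NULL,
        office         varchar(100) NULL
    );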
So far so good. However, there's some data missing from the original spreadsheets, and instead of giving me just the missing data, the client is giving me a modified version of the original spreadsheet with the new data appended to it. I cannot simply overwrite the original data since the data was already used and there are other tables that link to it.
The question is - how do I handle this? It seems to me that I have the following options:
Upload the entire modified spreadsheet, and mark the original as 'inactive'.
PROS: It's simple, straightforward, and easily automated.
CONS: There's a lot of redundant data being stored in the database unnecessarily, especially if the spreadsheet changes numerous times.
Do a diff on the spreadsheets and only upload the rows that changed.
PROS: Less data gets loaded into the database.
CONS: It's at least partially manual, and therefore prone to error. It also means that the database will no longer tell the entire story - e.g. if some data is missing at some later date, I will not be able to authoritatively say that I never got the data just by querying the database. And will doing diffs continue working even if I have to do it multiple times?
Write a process that compares each spreadsheet row with what's in the database, inserts the rows that have changed data, and sets the original data row to inactive. (I have to keep track of the original data also, so I can't overwrite it.)
PROS: It's automated.
CONS: It will take time to write and test such a process, and it will be very difficult for me to justify the time spent doing so.
I'm hoping to come up with a fourth and better solution. Any ideas as to what that might be?
If you have no way to be 100% certain you can avoid human error in option 2, don't do it.
Option 3: It should not be too difficult (or time-consuming) to write a VBA script that does the comparison for you. VBA is not fast, but you can let it run overnight. It should not take more than one or two hours to get it running error-free.
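If the modified spreadsheet can first be bulk-loaded into a staging table, the comparison itself could also be done in SQL instead of VBA. A rough sketch, assuming invented column names, an is_active flag on the data rows, and a stable natural key such as an employee number (NULL-safe comparisons left out for brevity):

    -- Assumes: staging_data holds the modified spreadsheet, data holds the
    -- existing rows with an is_active flag, and employee_no is a stable key.
    DECLARE @new_spreadsheet_id int = 42;  -- id of the newly registered spreadsheet row

    -- 1. Deactivate existing rows whose content differs in the new spreadsheet.
    UPDATE d
    SET d.is_active = 0
    FROM data d
    JOIN staging_data s ON s.employee_no = d.employee_no
    WHERE d.is_active = 1
      AND (s.employee_name <> d.employee_name OR s.office <> d.office);

    -- 2. Insert rows that are brand new, or that were just deactivated in step 1.
    INSERT INTO data (spreadsheet_id, employee_no, employee_name, office, is_active)
    SELECT @new_spreadsheet_id, s.employee_no, s.employee_name, s.office, 1
    FROM staging_data s
    LEFT JOIN data d ON d.employee_no = s.employee_no AND d.is_active = 1
    WHERE d.employee_no IS NULL;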
Option 1: This would be my preferred approach: fast, simple, and I can't think of anything that could go wrong right now. (Well, you should first mark the original as 'inactive', then upload the new data set, IMO.) Especially if this can happen more often in the future, having a stable and fast process to deal with it is important.
If you are really worried about all the inactive entries, you can also delete them after your update (delete from spreadsheets where status='inactive', or some such). But so far, all the databases I have seen in my work had lots of those. I wouldn't worry too much about it.
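In terms of the tables sketched earlier, the whole option 1 flow is only a couple of statements (again with hypothetical names, and assuming the status column from the delete example above sits on spreadsheets):

    -- 1. Mark the superseded upload as inactive first.
    UPDATE spreadsheets
    SET status = 'inactive'
    WHERE file_name = 'london_office.xlsx'   -- hypothetical identifier
      AND status = 'active';

    -- 2. Register the modified spreadsheet; the existing upload shell script
    -- then loads its rows into data with this new spreadsheet_id.
    INSERT INTO spreadsheets (file_name, status)
    VALUES ('london_office.xlsx', 'active');

    -- 3. Optional cleanup later (the linked rows in data would have to be
    -- removed first, or the foreign key defined with ON DELETE CASCADE).
    DELETE FROM spreadsheets WHERE status = 'inactive';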