Tabular incremental update - database

Let me first say that I don't have any solid SQL knowledge and this is the first time I'm working with Tabular, so please pardon me for any inaccuracies or naive mistakes.
I have a big dataset in Tabular linked to a server which receives entries every day. Whenever I process (update) the main table to which all the tables are linked, I'm basically reprocessing a lot of entries that I already have plus a few new ones.
I'd like to figure out a way to process only entries that I don't have yet and not the entire dataset.
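For what it's worth, one common way to do this is to split the big fact table into partitions in the Tabular model and only reprocess the partition that still receives rows. A date-bounded source query for a "current month" partition might look like the sketch below (the table and column names are made up):
    SELECT *
    FROM dbo.FactEntries
    WHERE EntryDate >= '20190101'
      AND EntryDate <  '20190201';
With that split in place, only the current partition needs a Process Data / Process Add after each day's load, instead of a full process of the whole table.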

Related

SSIS Select from Excel Worksheet with Where condition

We are seeing an SSIS process taking too long overnight, and running late into the morning.
The process queries a database table to return a list of files that have been changed since the last time the process ran (it runs daily), and this can be anything from 3 (over a weekend) to 40 during the working week. There are possibly 258 Excel files (all based on the same template) that could be imported.
As I stated, the process seems to be taking too long some days (it's not on a dedicated server), so we decided to look at performance improvement suggestions, e.g. increasing DefaultBufferMaxRows and DefaultBufferSize to 50,000 and 50 MB respectively for each Data Flow Task in the project. The other main suggestion was to always use a SQL command rather than a Table or View. Each of my nine Data Flow Tasks relies on a range name from the spreadsheet, so if it might help with performance, what I want to know is: is it possible to select from an Excel worksheet with a WHERE condition?
The nine import ranges vary from a single cell to a range of 10,000 rows by 230 columns. Each of these is imported into a staging table and then merged into the appropriate main table, but we have had issues with the import not correctly understanding data types (even with IMEX=1). It seems I might get a better import if I could select the data differently and restrict it to only the rows I'm interested in (rather than importing all 10,000 and then filtering them as part of the task), i.e. all rows where a specific column is not blank.
This is just initially an exercise to look into performance improvement, but also it's going to help me going forward with maintaining and improving the package as it's an important process for the business.
Whilst testing the merge stored procedure to look at other ways to improve the process, it came to light that one of the temporary tables populated via SSIS was corrupted. It was taking 27 seconds to query and return 500 records, and when looking at the table statistics, the table was much bigger than it should have been. After conversations with our DBA, we dropped and recreated the table, and now the process is running at its previous speed, i.e. for 5 spreadsheets the process takes 1 minute 13 seconds, and for 43 spreadsheets it's around 7 minutes!
Whilst I will still be revisiting the process at some point, everyone agrees to leave it unchanged for the moment.
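For reference on the WHERE-clause question itself: the Excel connection in SSIS will accept a SQL command as the data access mode, so a Data Flow source can filter at read time with something like the following (the range and column names are placeholders):
    SELECT *
    FROM [ImportRange]
    WHERE [CustomerCode] IS NOT NULL
The same works against a worksheet by using [Sheet1$] (or [Sheet1$A1:H10000] for an explicit range) as the table name, and the IS NOT NULL test is what drops the blank rows before they ever reach the staging table.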

Add conditional columns and custom value columns to database-linked query

TL;DR version: I have a query linked to a database. I need to add some columns to it for data that isn't linked to the database, but don't know how.
I'm quite new to SQL and Access (got a reasonable grasp of Excel and VBA though) and have a pretty complex reporting task. I've got halfway (I think) but am stuck.
Purpose
A report showing how many (or what percentage of) delivery lines were late in a time period, with reasons for lateness chosen from a set list, and an analysis of what the biggest cause of lateness is.
Vague Plan
Create a table/query showing delivery lines with customer, required date and delivery date, plus a column to show whether they were on time, plus another to detail lateness reason. Summaries can be done afterwards in Excel. I'd like to be able to cycle through said table in form view entering lateness reasons (they'll be from a linked table, maybe 4 or 5 options).
Sticking Point
I have the data, but not the analysis. I've created the original data output query; it's linked to a live SQL database (Orderwise), so it keeps updating. However, I don't know:
how to add extra columns and lookups to it, to work out / record whether each line is on time and what the lateness reason is
how I'll be able to cycle through the late ones in form view to add reasons
How do I structure the Access database so it can do this, please?
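A rough sketch of the sort of query that could sit on top of the linked data, assuming the linked table is called DeliveryLines and you add two small local tables, DeliveryLateness (one row per late line, holding the chosen reason) and LatenessReasons (the fixed list) - all of these names are invented:
    SELECT d.DeliveryLineID, d.Customer, d.RequiredDate, d.DeliveryDate,
           IIf(d.DeliveryDate <= d.RequiredDate, "On time", "Late") AS OnTimeStatus,
           r.ReasonText
    FROM (DeliveryLines AS d
          LEFT JOIN DeliveryLateness AS dl ON d.DeliveryLineID = dl.DeliveryLineID)
         LEFT JOIN LatenessReasons AS r ON dl.ReasonID = r.ReasonID;
One way to capture the reasons is a continuous form over the "Late" rows with a combo box bound to DeliveryLateness and looking up LatenessReasons; the Excel summaries can then read straight from this query. The exact wiring depends on how you set up DeliveryLateness.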

When do I need to process a cube?

I have a SSAS 2008 Cube.
I've just inserted some more data (4 million transactions) into the fact table and the dimensions are still good too. I've accidentally refreshed my Excel pivot table and noticed that my new data is there - I thought I had to reprocess the cube for this!!
That leaves me asking:
When do I need to process the cube? Is it ONLY for structural changes?
When do I need to process dimensions?
If I don't need to process the cube on inserting new data into source tables, what happens if I insert bad data into the source i.e. something that does not have a matching dimension key?
@Warren, I know it has been a while, but I have to say the issue you mentioned here is a data latency issue. It depends on the storage mode you choose for the measure groups within your multidimensional cube. For example, if it is ROLAP, there is no data latency issue and you do not need to reprocess the cube. However, if it is MOLAP, everything (i.e. data, metadata, and aggregations) is stored in the cube, so every time you do some ETL you need to reprocess it to show the updated data.
You can process a cube under three conditions:
If you are modifying the structure of the cube, you may be required to process the cube with the Full Process option.
If you are adding new data to the cube, you can process the cube with the Incremental update option.
To clear out and replace a cube's source data, you can use the Refresh data processing option.
Find more at:
http://technet.microsoft.com/en-us/library/aa933573%28v=sql.80%29.aspx
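On the last question (bad data with no matching dimension key): broadly speaking, MOLAP processing will either raise a key error or map the row to the unknown member, depending on the measure group's error configuration, so it is worth catching such rows before processing. A simple anti-join does it (the table and column names below are placeholders):
    -- Fact rows whose ProductKey has no match in the dimension (placeholder names).
    SELECT f.*
    FROM dbo.FactTransactions AS f
    LEFT JOIN dbo.DimProduct AS p
           ON p.ProductKey = f.ProductKey
    WHERE p.ProductKey IS NULL;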

Daily data generation and insertion

I'm facing a problem that perhaps someone around here can help me with.
I work in a business intelligence company and I'd like to simulate the whole usage cycle of our product the way our clients use it.
The short version is that our customers are inserting some 20 million records to their database on a daily basis, and our product crunches the new data at the end of the day.
I would like to automatically create around 20 million records and insert them into some database every day (MSSQL probably).
I should point out that the number of records should change from day to day, between 15 and 25 million. Other than that, the data is supposed to be inserted into 6 tables linked with foreign keys.
I usually use Redgate's SQL Generator to create data, but as far as I can tell it's good for one-time data generation as opposed to the ongoing data generation I'm looking for.
If anyone knows of methods/tools adequate to this situation, please let me know.
Thanks!
You could also write a small Java (or similar) program to get the starting ID from the database, pick a random number of rows to insert, and then execute the data-generation tool as a child process.
For example, see Runtime.exec():
http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html
You can then run your program as a scheduled task or cron job.
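If you would rather keep the generation inside SQL Server itself, a rough T-SQL sketch of the nightly load (runnable from a SQL Server Agent job) might look like the one below; dbo.FactEvents and its columns are invented, and in practice you would populate the parent tables first so the foreign keys resolve:
    -- Pick a random row count between 15 and 25 million for today's load.
    DECLARE @rows int = 15000000 + ABS(CHECKSUM(NEWID())) % 10000001;

    -- The cross-joined system views just supply @rows rows to select from.
    ;WITH n AS (
        SELECT TOP (@rows) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn
        FROM sys.all_objects AS a
        CROSS JOIN sys.all_objects AS b
        CROSS JOIN sys.all_objects AS c
    )
    INSERT INTO dbo.FactEvents (EventDate, CustomerId, Amount)
    SELECT CAST(GETDATE() AS date),
           ABS(CHECKSUM(NEWID())) % 100000 + 1,                 -- random key into an (assumed) customer table
           CAST(ABS(CHECKSUM(NEWID())) % 100000 AS money) / 100 -- random amount
    FROM n;
Whether you drive a generation tool from a small program as above or keep it all in SQL mostly comes down to where you want the randomness and scheduling to live.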

Database data upload design question

I'm looking for some design help here.
I'm doing work for a client that requires me to store data about their tens of thousands of employees. The data is being given to me in Excel spreadsheets, one for each city/country in which they have offices.
I have a database that contains a spreadsheets table and a data table. The data table has a column spreadsheet_id which links it back to the spreadsheets table so that I know which spreadsheet each data row came from. I also have a simple shell script which uploads the data to the database.
So far so good. However, there's some data missing from the original spreadsheets, and instead of giving me just the missing data, the client is giving me a modified version of the original spreadsheet with the new data appended to it. I cannot simply overwrite the original data since the data was already used and there are other tables that link to it.
The question is - how do I handle this? It seems to me that I have the following options:
Upload the entire modified spreadsheet, and mark the original as 'inactive'.
PROS: It's simple, straightforward, and easily automated.
CONS: There's a lot of redundant data being stored in the database unnecessarily, especially if the spreadsheet changes numerous times.
Do a diff on the spreadsheets and only upload the rows that changed.
PROS: Less data gets loaded into the database.
CONS: It's at least partially manual, and therefore prone to error. It also means that the database will no longer tell the entire story - e.g. if some data is missing at some later date, I will not be able to authoritatively say that I never got the data just by querying the database. And will doing diffs continue working even if I have to do it multiple times?
Write a process that compares each spreadsheet row with what's in the database, inserts the rows that have changed data, and sets the original data row to inactive. (I have to keep track of the original data also, so I can't overwrite it.)
PROS: It's automated.
CONS: It will take time to write and test such a process, and it will be very difficult for me to justify the time spent doing so.
I'm hoping to come up with a fourth and better solution. Any ideas as to what that might be?
If you have no way to be 100% certain you can avoid human error in option 2, don't do it.
Option 3: It should not be too difficult (or time-consuming) to write a VBA script that does the comparison for you. VBA is not fast, but you can let it run overnight. It should not take more than one or two hours to get it running error-free.
Option 1: This would be my preferred approach: Fast, simple, and I can't think of anything that could go wrong right now. (Well, you should first mark the original as 'inactive', then upload the new data set IMO). Especially if this can happen more often in the future, having a stable and fast process to deal with it is important.
If you are really worried about all the inactive entries, you can also delete them after your update (delete from spreadsheets where status='inactive' or some such). But so far, all the databases I have seen in my work have had lots of those. I wouldn't worry too much about it.
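For what it's worth, if option 3 ever does get written, the comparison itself can live in SQL rather than VBA: load the revised spreadsheet into a staging table with the existing shell script, then let the database work out what is new or changed. A rough T-SQL-flavoured sketch, with an invented natural key (employee_ref), invented data columns, and an added active flag on the data table (NULL-safe comparisons left out for brevity):
    -- Insert rows from the new upload that are new, or that differ from the active version.
    INSERT INTO data (spreadsheet_id, employee_ref, office, salary, active)
    SELECT s.spreadsheet_id, s.employee_ref, s.office, s.salary, 1
    FROM staging_data AS s
    LEFT JOIN data AS d
           ON d.employee_ref = s.employee_ref
          AND d.active = 1
    WHERE d.employee_ref IS NULL     -- brand-new row
       OR d.office <> s.office       -- or an existing row whose values changed
       OR d.salary <> s.salary;

    -- Retire the superseded versions, assuming spreadsheet_id increases with each upload.
    UPDATE d
    SET    d.active = 0
    FROM   data AS d
    JOIN   data AS newer
           ON newer.employee_ref   = d.employee_ref
          AND newer.spreadsheet_id > d.spreadsheet_id
    WHERE  d.active = 1;
That keeps the full history option 1 gives you (every version stays in the table, tied to the spreadsheet it came from) without re-storing the rows that didn't change.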
