Prelude: The design of this database is truly horrible - this isn't the first "crooked" question I've asked, and it won't be the last. The question is what it is, and I'm only asking because A) I only have a couple of years of experience with SQL Server, and B) I've already been pounding my face on my keyboard for a couple of days trying to find a viable solution.
Having said all of that...
We have a database with two tables relevant to this issue. The schema is ridiculous, so I'm going to paraphrase so that it can be understood:
T_Customer & T_Task
T_Task holds data about various work that has been performed on behalf of a customer in T_Customer.
In T_Customer, there is a field called "Sort_Type" (again, paraphrasing...). In this field, there is a concatenated string of various fields from T_Task which, in the order specified, determine how the customer's report is produced in the client program. There are a total of 73 possible fields in T_Task that can be chosen as Sort_Type in T_Customer, and the user can choose up to 5 of them in any given order. For example:
T_Customer
Customer_ID | Sort_Type
------------|-------------------------------
1 | 'Task_Date,Task_Type,Task_ID'
2 | 'Task_Type,Destination'
3 | 'Route,Task_Type,Task_ID'
T_Task
Task_ID | Customer_ID | Task_Type | Task_Date | Route | Destination
--------|-------------|-----------|-----------|---------|-------------
12345 | 1 | 1 | 01/01/2017| '1 to 2'| '2'
12346 | 1 | 1 | 01/02/2017| '3 to 4'| '4'
12347 | 2 | 2 | 12/31/2016| '6 to 2'| '2'
12348 | 3 | 3 | 01/01/2017| '4 to 1'| '1'
In this example, Customer #1's report would be sorted/totaled by Task_Date, then by Task_Type, then by Task_ID; but not simply by doing an ORDER BY. The reporting function requires one single value that can be ordered as a whole unit. As such...
Up until today, a field existed in T_Task called (paraphrasing....) 'MySort'. This field contained a concatenated string of fixed-width values filled in with zeroes and created according to the order and content of the values in T_Customer.Sort_Type. In this case:
Task_ID | Customer_ID | Task_Type | Task_Date | Route | Destination | MySort
--------|-------------|-----------|-----------|-------|-------------|-------
12345 | 1 | 1 | 01/01/2017| 1 to 2| 2 |'002017010100000000010000012345'
12346 | 1 | 1 | 01/02/2017| 3 to 4| 4 |'002017010200000000010000012346'
12347 | 2 | 2 | 12/31/2016| 6 to 2| 2 |'000000000000000000020000000002'
12348 | 3 | 3 | 01/01/2017| 4 to 1| 1 |'000040to0100000000030000012348'
During the printing phase of every single report, the program would search for the customer, find the values in T_Customer.Sort_Type, split them by commas, and then run an update on all of the tasks of that customer to update the value of MySort accordingly...
Can you guess what the problem is with this? Performance (not to mention chronic insanity)
I have been tasked with finding a more efficient way of performing this same task server-side, within SQL Server 2005 if possible, using whatever means will eventually allow me to return a result set including all of the details of the tasks requested, together with a concatenated string similar to the one used in the past (which the client program relies upon in order to sort and subtotal the report).
I've tried Views, UDFs in computed columns, and parameterized queries, but I know my limitations. I'm too inexperienced to know all of my options.
Question: Aside from quitting (not an option) or going berserk (considering it...), what methods might you use to solve this problem?
EDIT: Having received two questions about the MySort column already,
I'll explain a bit better.
T_Task.MySort =
    REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value1]))
    + CAST(T_Customer.Sort_Type[Value1] AS VARCHAR(10))
    + REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value2]))
    + CAST(T_Customer.Sort_Type[Value2] AS VARCHAR(10))
    + REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value3]))
    + CAST(T_Customer.Sort_Type[Value3] AS VARCHAR(10))
WHERE T_Customer.Customer_ID = T_Task.Customer_ID
...Up to T_Customer.Sort_Type[Value5].
Reminder: Those values are not constants at all, so the value of the
field MySort had to constantly be updated before printing a report.
The idea is to somehow remove the need to constantly update the field,
and instead return the string as part of the result set.
The resulting string should always be 50 characters in length. I
didn't do that here simply to save a bit of space and time - I chose
only 3 for the example. The real string would simply have another
twenty zeroes leading the value:
'00000000000000000000002017010100000000010000012345'
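One direction I can sketch (not a full solution): parse Sort_Type with CROSS APPLY and build the padded string with CASE expressions at query time, so it comes back as part of the result set instead of being stored. The sketch below is limited to the first two sort positions and three of the 73 possible fields, and the 112 date style and 10-character widths are only guesses at the old format; the same pattern (one more CROSS APPLY and one more CASE block) would repeat for positions 3 to 5 and the rest of the field list.

SELECT  t.Task_ID,
        t.Customer_ID,
        -- each piece is cast to text and left-padded with zeroes to 10 characters
        RIGHT(REPLICATE('0', 10)
              + CASE p1.FieldName
                     WHEN 'Task_Date' THEN CONVERT(CHAR(8), t.Task_Date, 112)
                     WHEN 'Task_Type' THEN CAST(t.Task_Type AS VARCHAR(10))
                     WHEN 'Task_ID'   THEN CAST(t.Task_ID   AS VARCHAR(10))
                     ELSE '' END, 10)
      + RIGHT(REPLICATE('0', 10)
              + CASE p2.FieldName
                     WHEN 'Task_Date' THEN CONVERT(CHAR(8), t.Task_Date, 112)
                     WHEN 'Task_Type' THEN CAST(t.Task_Type AS VARCHAR(10))
                     WHEN 'Task_ID'   THEN CAST(t.Task_ID   AS VARCHAR(10))
                     ELSE '' END, 10) AS MySort
FROM    T_Task t
JOIN    T_Customer c ON c.Customer_ID = t.Customer_ID
-- peel the comma-separated field names off Sort_Type one position at a time
CROSS APPLY (SELECT Rest = c.Sort_Type + ',') r
CROSS APPLY (SELECT FieldName = LEFT(r.Rest, CHARINDEX(',', r.Rest) - 1),
                    Rest      = STUFF(r.Rest, 1, CHARINDEX(',', r.Rest), '')) p1
CROSS APPLY (SELECT FieldName = CASE WHEN p1.Rest = '' THEN ''
                                     ELSE LEFT(p1.Rest, CHARINDEX(',', p1.Rest) - 1)
                                END) p2;

The CASE blocks get long with 73 candidate fields, but this removes the per-report UPDATE entirely; whether it performs acceptably is something you would have to test.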
Related
Let's say that I have the following SQL table where each value has a reference to the previous one:
ChainedTable
+------------------+--------------------------------------+------------+--------------------------------------+
| SequentialNumber | GUID | CustomData | LastGUID |
+------------------+--------------------------------------+------------+--------------------------------------+
| 1 | 792c9583-12a1-4c95-93a4-3206855d284f | OtherData1 | 0 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 2 | 1022ffd3-afda-4e20-9d45-eec884bc2a50 | OtherData2 | 792c9583-12a1-4c95-93a4-3206855d284f |
+------------------+--------------------------------------+------------+--------------------------------------+
| 3 | 83729ad4-2564-4146-b451-00d82585bd96 | OtherData3 | 1022ffd3-afda-4e20-9d45-eec884bc2a50 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 4 | d7197e87-d7d6-4175-8172-12656043a69d | OtherData4 | 83729ad4-2564-4146-b451-00d82585bd96 |
+------------------+--------------------------------------+------------+--------------------------------------+
| 5 | c1d3d751-ef34-4079-a73c-8952f93d17db | OtherData5 | d7197e87-d7d6-4175-8172-12656043a69d |
+------------------+--------------------------------------+------------+--------------------------------------+
If I were to insert the sixth row, I would retrieve the data of the last row using a query like this:
SELECT TOP 1 SequentialNumber, GUID FROM ChainedTable ORDER BY SequentialNumber DESC;
After that selection and before the insertion of the next row, an operation outside the database will take place.
That would suffice if it were guaranteed that only one entity uses the table at a time. However, if more entities can perform this same operation, there is a risk of a race condition: one entity could read the information of the last row and, before it performs its insert, a second entity could do the same.
At first, I thought of creating a new table with a value that indicates whether the table is being used or not (the value can be NULL or the identifier of the process that has access to the table). In that solution, an entity won't start reading the last row if the value indicates that the table is being used by another process. However, one of the things that can happen in this scenario is that the process using the table dies without releasing it, blocking the whole system.
I'm sure this is a "typical" computer science problem and that there are well known solutions to implement this. Can anyone point me in the right direction, please?
I think using a transaction in SQL may solve the problem. For example, if you create a transaction that adds the new row, no one else will be able to run the same transaction until the first one is completed.
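A minimal sketch of that idea against the example table, with the caveat that a plain transaction alone is not enough: the UPDLOCK, HOLDLOCK hints on the read are what keep a second session from picking up the same last row before the first one inserts. It assumes SequentialNumber is generated by the database (e.g. an IDENTITY column) and that the GUID columns are uniqueidentifier.

BEGIN TRANSACTION;

DECLARE @LastGUID uniqueidentifier;

-- lock the current last row (and the range above it) until COMMIT
SELECT TOP 1 @LastGUID = [GUID]
FROM ChainedTable WITH (UPDLOCK, HOLDLOCK)
ORDER BY SequentialNumber DESC;

-- the operation outside the database happens here; note the locks are
-- held until COMMIT, so this work should be kept as short as possible

INSERT INTO ChainedTable ([GUID], CustomData, LastGUID)
VALUES (NEWID(), 'OtherData6', @LastGUID);

COMMIT TRANSACTION;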
I have two source tables, one is basically an invoice, the other is a migrated invoice. The same object should probably have been used for both, but I have this instead. They contain most of the same data.
I had thought to combine both into a dimension table, however both will use the same natural keys. How should I approach this?
One potential solution I thought of was using negative numbers for the migrated table, but then the natural keys won't align exactly with the source.
Do I just combine them in the fact table? Then I can't link back to the dimension table for either due to NULLs.
Or do I add an additional column or information to indicate which type of invoice it is?
EDIT
Simple models of the current tables below.
The dimension currently only contains the non-migrated data and it has a primary key. However, if I merge the migrated invoice table into this, it will appear as if the changes are being made to the original invoices and not to a second set of invoices.
Dimension
surrogate_key| source_pk | Total | scd_from | scd_to
| | | |
1 | 1 | 100 | 01/01/2019 | 31/01/2019
2 | 1 | 150 | 01/02/2019 | 31/12/2019
3 | 2 | 50 | 01/01/2019 | 31/12/9999
source invoice table
pk | Total
___________________
1 | 150
2 | 50
source migrated invoice table
pk | total
___________________
1 | 200
2 | 300
If the invoice and the migrated invoice have the same natural key but some of the fields have different values (your example shows the Total differing between them), then you have one row per natural key in the Dim but two different columns to represent the two sources. Based on your example, you need invoice_Total and migrated_invoice_Total columns in your Dim.
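A rough sketch of that shape, joining the two sources on the natural key (source_invoice and source_migrated_invoice are placeholder names for the example tables above):

SELECT  COALESCE(i.pk, m.pk) AS source_pk,
        i.Total              AS invoice_Total,
        m.total              AS migrated_invoice_Total
FROM source_invoice i
FULL OUTER JOIN source_migrated_invoice m
    ON m.pk = i.pk;

The SCD handling then happens once, on the combined row, so changes are never mistaken for updates to the other set of invoices.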
I've a fact table that details individual line amounts for orders placed by my organisation. In this fact, at line level, I've included the total order amount to be used, as it's possible we might need that level of detail at some point.
Here's an example of what I've got:-
+------------+------------+---------------+------------+---------------------+
| BookingKey | Booking_ID | Category_FKey | Line_Value | Total_Booking_Value |
+------------+------------+---------------+------------+---------------------+
| 1 | 12 | 8 | 150 | 700 |
| 2 | 12 | 4 | 150 | 700 |
| 3 | 12 | 5 | 300 | 700 |
| 4 | 12 | 4 | 100 | 700 |
+------------+------------+---------------+------------+---------------------+
As you can see, the Total_Booking_Value here is the sum of the Line_Value for the booking in the example (Booking_ID = 12).
The Category_FKey looks up to a Categories dimension.
Using this structure I've created a simple cube and this works fine, mainly.
The issue I have is that I'd like to be able to view the Total Line_Value amount, and somehow include the Total_Booking_Value alongside it.
So, for example I might add the Categories dimension as a filter and want to filter by say Category_FKey = 4.
If this was the case I'd want the aggregates to tell me that the total Line_Value was 250 (for BookingKeys 2 and 4), and the Total_Booking_Value should be 700. Using normal aggregation (ie SUM) I'm getting the Total_Booking_Value as 1400 (obviously - because it's adding 700 * 2 for the two rows the cube would return).
So, the way I see it I'd like to create an MDX calculation that somehow takes the Total_Booking_Value and gives just the value for the Booking in question.
Should this be done using some kind of average, or division by the Distinct number of items? I can't figure this out. I tried something like this:-
create member currentcube.measures.[Calculated Booking Value]
as
[Measures].[Total_Booking_Value] / count(Measures.Booking_ID);
But this isn't working.
Hopefully this makes sense and you can point me in the right direction.
I find it strange that booking_ID is a measure - intuitively it strikes me as something that would be an attribute and therefore a hierarchy - in which case you'd be able to do the count like this:
[Measures].[Total_Booking_Value]
/
COUNT(EXISTING [Booking].[Booking_ID].[Booking_ID].members)
A straightforward solution would be to have two fact tables: one with granularity booking key and one with granularity booking id. The first would contain all columns except total booking value, and the second would contain columns booking id and total booking value.
Then each of the two measures would be easily summable.
The relationship type between the second fact table and the category dimension could be configured as many-to-many via the first fact table. Thus, you would see the full values of the involved bookings for each selected category, automatically eliminating double counting.
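For what it's worth, the booking-grain fact is then just the line-grain fact collapsed by booking; a sketch, with Fact_BookingLine and Fact_Booking as placeholder names for the tables involved:

SELECT  Booking_ID,
        MAX(Total_Booking_Value) AS Total_Booking_Value  -- one value per booking
INTO    Fact_Booking
FROM    Fact_BookingLine
GROUP BY Booking_ID;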
At work we're creating a form to allow property agents to submit their new developments. A simplified version of our form is the following:
Bedrooms: [Enter a number]
Quantity: [Enter a number]
Add Another | Save
We allow agents to add multiple rows. However at the moment we have absolutely zero validation for duplicates, which in my opinion allows our database to store identical data in two ways:
| development_id | bedrooms | quantity |
|----------------|----------|----------|
| 1 | 3 | 1 |
| 1 | 3 | 1 |
| 1 | 3 | 3 |
Clearly a row could represent either one unit or a group of units.
I'm arguing that we should store the developments either one way or the other, but certainly not both. Unfortunately the back-end developers (I'm mostly front-end) are arguing that it's not a big deal, and to me that seems absurd.
For a simple example, by storing it as the above, a COUNT to obtain how many developments are for sale that have 3 bedrooms requires a SELECT COUNT(*) and consideration of the quantity field.
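For instance (development_units is a placeholder name for the table shown), with the current model both of these queries have to account for the duplicate rows and the quantity column:

-- even "how many developments offer a 3-bed unit" needs DISTINCT,
-- because the same fact may be spread across several rows
SELECT COUNT(DISTINCT development_id) AS developments_with_3_bed
FROM development_units
WHERE bedrooms = 3;

-- and counting the units themselves has to sum the quantity column
SELECT SUM(quantity) AS three_bed_units
FROM development_units
WHERE bedrooms = 3;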
As a front-end developer, this seems largely to be presentation logic: transforming between rendering them as a list of single units or grouping them together should be a front-end/API task, and the business logic should store them one way or the other. Ultimately our table seems not to be normalised at all.
In my humble opinion there should be a unique index on development_id, bedrooms.
Am I right in my argument? Or horribly wrong?
Edit:
To clarify, all of the following are currently possible, all of which represent the same fact, and my argument is that there should be only one way:
| development_id | bedrooms | quantity |
|----------------|----------|----------|
| 1 | 3 | 1 |
| 1 | 3 | 1 |
| 1 | 3 | 1 |
Same as:
| development_id | bedrooms | quantity |
|----------------|----------|----------|
| 1 | 3 | 1 |
| 1 | 3 | 2 |
Same as:
| development_id | bedrooms | quantity |
|----------------|----------|----------|
| 1 | 3 | 3 |
You're right, there should be only one way to record each fact in a database and duplicate rows should not be allowed. If each row represents the quantity of units that have a certain number of bedrooms in a particular development, then a unique key on development_id, bedrooms makes sense, and will prevent multiple entries for the same kind of unit in each development.
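If the team agrees, enforcing that is a one-liner; a sketch, assuming the table is called development_units (the column names come from the examples above):

ALTER TABLE development_units
    ADD CONSTRAINT UQ_development_bedrooms UNIQUE (development_id, bedrooms);

With that in place, inserting a second row for the same (development_id, bedrooms) pair fails, and the application has to update the existing row's quantity instead.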
Funnily enough, you and your back-end colleagues/rivals are both right.
It's not a big deal, for real (in the shown circumstances).
Although it really violates DB normalization (in the shown circumstances).
From what you reveal, there's no need to split into multiple rows.
Although imagine that, from now on, it gains another attribute that distinguishes one three-bed unit from another. Say, an apartment plan. Or a timestamp, for whatever reason.
Splitting into multiple rows immediately starts making sense then.
Another thing here: reads are generally non-blocking, while writes are blocking.
That means, on a mature RDBMS with row-level locks, the inserts (and reads for COUNT) won't be competing, while updates to a counter would.
Although I'm far from thinking your realty agents combined would ever achieve even single-digit TPS with their additions, so you may consider the issue non-existent at your scale. :-)
I was wondering if there was a way to get a distinct count on a certain column based on the value of a second column while still getting a total count of the first column. This is an example of the issue I'm facing. I have a query that returns an i-Vent type, ID, Status, and linked medication orders for a pharmacy intervention system. The interventions are grouped by i-Vent type. The Status can be one of five values or NULL. I need to be able to count how many i-Vents were recorded as each of the six possible values for Status.
An example set may look similar to this:
________________________________________________________
Type | ID | Status | Linked Meds
________________________________________________________
IV2PO | 1234 | Accepted | pantoprazole IV
IV2PO | 1234 | Accepted | pantoprazole PO
IV2PO | 1235 | NULL | NULL
IV2PO | 1236 | Pending | metoclopramide IV
IV2PO | 1236 | Pending | metoclopramide PO
IV2PO | 1236 | Pending | Pharmacy Consult - IV2PO
Consult | 1237 | Rejected | NULL
________________________________________________________
The group summary should list IV2PO having a total count of 3 with a count of 1 for "Accepted", 1 for "NULL", and 1 for "Pending"; and Consult having a total count of 1 with a count of 1 for "Rejected".
Please take notice of the duplicate values caused by having more than one medication/order linked to an i-Vent.
Ultimately I'm building the final report in Crystal Reports, so if there is a way to get the correct counts there, that would be fine as well. I have a version of this which uses a subreport to get the linked medications/orders, but I'd like to find a better alternative that takes less time to run and uses fewer resources.
Does anyone know of a way to do this?
Thanks!
In Crystal Reports you can use the Distinct Count summary option.
When creating a "Summary", using the Count function may not be desirable. It is often the case that a report must only return the number of unique contact records, as other tables (i.e. History) may contain multiple rows for each customer.
Select Insert | Summary.
Select the fieldname you wish to summarize.
Make sure to select Distinct Count as the Summary Operation.
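If it turns out to be easier to do the counts on the SQL side before the data reaches Crystal, a COUNT(DISTINCT ...) grouped by Type and Status gives the same numbers; a sketch, with iVent_Report standing in for whatever query or view currently feeds the report:

SELECT  [Type],
        [Status],
        COUNT(DISTINCT ID) AS Status_Count  -- collapses the duplicate linked-med rows
FROM iVent_Report
GROUP BY [Type], [Status];

The per-type totals from the question are then just the sum of Status_Count within each Type.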