How to save Large relational data - database

I hope you are all well.
I have a scenario in which I have to write a very large set of relational/combinational data, and I am looking for an implementation technique that is very fast. It is something like an expert system in AI.
I have 4 entities, Questions, Options, Benefits and Scenarios:
Each question can have multiple options.
Each option belongs to a single question.
A benefit is allocated to any combination of options; this allocation is called a scenario.
A scenario can relate to any number of options.
A scenario can relate to any number of benefits.
Each benefit can be included in multiple scenarios.
For example:
We have 4 questions: q1, q2, q3, q4
q1 has 3 options: q1o1, q1o2, q1o3
q2 has 4 options: q2o1, q2o2, q2o3, q2o4
q3 has 5 options: q3o1, q3o2, q3o3, q3o4, q3o5
q4 has 2 options: q4o1, q4o2
scenario 1: for the combination [q1o1, q2o1], benefit b1 is allocated
scenario 2: for the combination [q1o1, q2o1, q3o3], benefit b2 is allocated
scenario 3: for the combination [q2o1, q3o4], benefit b3 is allocated
scenario 4: for the combination [q3o4, q4o1], benefit b4 is allocated
scenario 5: for the combination [q4o2], benefit b5 is allocated
scenario 6: for the combination [q1o2, q2o2, q3o1, q4o1], benefit b5 is allocated
So in this way
( (3+1) C 1 x (4+1) C 1 x (5+1) C 1 x (2+1) C 1 ) - 1
( 4 x 5 x 6 x 3 ) - 1
360 - 1
359
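scenarios can be built, where C denotes combination. The +1 for each question covers the case where no option of that question is part of the scenario, and the -1 removes the empty combination.
And if the number of questions grows to 25, each with 5 options: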
((5+1) ^ 25 - 1)
6 ^ 25 -1
28430288029929701375
scenarios can be built.
I am looking for the best way to store this relational/combinational data in a database and to read it back. Thanks in advance for your responses.
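The two counts above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the question; the function name `scenario_count` is made up for illustration.

```python
from math import prod


def scenario_count(options_per_question):
    """Count the non-empty scenarios.

    Each question contributes either one of its options or no option at all
    (hence n + 1 choices); the all-empty combination is excluded (hence - 1).
    """
    return prod(n + 1 for n in options_per_question) - 1


small = scenario_count([3, 4, 5, 2])   # the 4-question example
large = scenario_count([5] * 25)       # 25 questions with 5 options each

print(small)  # 359
print(large)  # 28430288029929701375
```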

The following set of tables will do it.
question:
id
...
option:
id
question_id
...
option_scenario:
option_id
scenario_id
scenario:
id
option_count
...
scenario_benefit:
scenario_id
benefit_id
benefit:
id
...
The one thing that is denormalized in the design is that scenario.option_count should be the count of things in option_scenario with that scenario_id.
To query it you'll need to use subqueries heavily. Suppose that person_option is another table with the options a specific person has. Then to find the benefits that that person has you'll need to:
SELECT b.*
FROM (
    SELECT so.scenario_id
    FROM person_option po
    JOIN option_scenario so
      ON so.option_id = po.option_id
    JOIN scenario s
      ON s.id = so.scenario_id
    WHERE po.person_id = ?
    GROUP BY so.scenario_id, s.option_count
    HAVING s.option_count = COUNT(DISTINCT po.option_id)
) ps
JOIN scenario_benefit sb
  ON sb.scenario_id = ps.scenario_id
JOIN benefit b
  ON b.id = sb.benefit_id
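To see the query at work, here is a minimal sketch that runs it against the six example scenarios from the question using Python's built-in sqlite3 module. Several things are assumptions made for illustration: SQLite as the engine, text ids such as 'q1o1' and 'b1', the person_option contents, and selecting only b.id (with DISTINCT) instead of b.*.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scenario (id INTEGER PRIMARY KEY, option_count INTEGER NOT NULL);
CREATE TABLE option_scenario (option_id TEXT, scenario_id INTEGER);
CREATE TABLE benefit (id TEXT PRIMARY KEY);
CREATE TABLE scenario_benefit (scenario_id INTEGER, benefit_id TEXT);
CREATE TABLE person_option (person_id INTEGER, option_id TEXT);
""")

# The six scenarios from the question: scenario id -> (options, benefit).
scenarios = {
    1: (["q1o1", "q2o1"], "b1"),
    2: (["q1o1", "q2o1", "q3o3"], "b2"),
    3: (["q2o1", "q3o4"], "b3"),
    4: (["q3o4", "q4o1"], "b4"),
    5: (["q4o2"], "b5"),
    6: (["q1o2", "q2o2", "q3o1", "q4o1"], "b5"),
}
for benefit_id in sorted({benefit for _, benefit in scenarios.values()}):
    conn.execute("INSERT INTO benefit (id) VALUES (?)", (benefit_id,))
for scenario_id, (options, benefit_id) in scenarios.items():
    # option_count is the denormalized count of rows in option_scenario.
    conn.execute("INSERT INTO scenario (id, option_count) VALUES (?, ?)",
                 (scenario_id, len(options)))
    conn.executemany(
        "INSERT INTO option_scenario (option_id, scenario_id) VALUES (?, ?)",
        [(option_id, scenario_id) for option_id in options])
    conn.execute(
        "INSERT INTO scenario_benefit (scenario_id, benefit_id) VALUES (?, ?)",
        (scenario_id, benefit_id))

# Hypothetical person 1 picked one option per question.
conn.executemany(
    "INSERT INTO person_option (person_id, option_id) VALUES (?, ?)",
    [(1, option_id) for option_id in ["q1o1", "q2o1", "q3o3", "q4o2"]])

query = """
SELECT DISTINCT b.id
FROM (
    SELECT so.scenario_id
    FROM person_option po
    JOIN option_scenario so ON so.option_id = po.option_id
    JOIN scenario s ON s.id = so.scenario_id
    WHERE po.person_id = ?
    GROUP BY so.scenario_id, s.option_count
    HAVING s.option_count = COUNT(DISTINCT po.option_id)
) ps
JOIN scenario_benefit sb ON sb.scenario_id = ps.scenario_id
JOIN benefit b ON b.id = sb.benefit_id
ORDER BY b.id
"""
benefits = [row[0] for row in conn.execute(query, (1,))]
print(benefits)  # ['b1', 'b2', 'b5']
```

Scenarios 1, 2 and 5 are the only ones whose options are all among the person's choices, so only their benefits come back.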

Related

How to combine group by, join, COUNT, SUM and subquery clauses in sql

I am not sure how to write the SQL query for the following problem:
There are two tables, Worker and Product (one worker can make many products) which I describe in this link:https://docs.google.com/spreadsheets/d/1Yk2vKKmUEyuN-QfgTEbmF4suHFtuDkkrsUf-wqvOoKQ/edit?fbclid=IwAR3ipjwNrfhGXg3fCyAri4tD1Q4WqWuKVAqagvbsZg9Sn1myDwkWbWcl_6E#gid=0
The calculation of the total salary of a worker at month x is as follows
totalSalary = salaryPerMonth + SUM(salaryPerProduct * COUNT(pid))
I want to use a join (whether INNER, LEFT, or RIGHT JOIN) combined with a GROUP BY clause to solve this problem, but my statements are wrong.
I am hoping for a specific SQL statement for this case.
I have tried to express my idea in a picture.
UPDATE: My picture quality is not good, so I will repost the picture at a link.
@phi nguyễn quốc - Welcome to Stack Overflow. What you posted has the makings of a good question. It contains:
Brief summary of the issue
Table structure, sample data
Explanation of expected results
Code you've tried
It just needs a few modifications to conform to the guidelines and avoid being closed. A few tips on posting:
Help others to help you by including a Minimal, Reproducible Example. (With SQL questions, include table definitions and sample data.) That way, folks who want to help can spend their time answering your question instead of writing set-up code to replicate your tables, environment, etc.
Make it easy for others to test your code. Always post code as text, not as an image.
Use collaborative tools like db<>fiddle for sharing.
One example of how you might improve the question and avoid it being closed:
Issue:
I am trying to write a SQL query to calculate the total salary for workers for a given month X. There are two tables: [Worker] and [Product]. One worker can make many products.
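Worker:
wid | wname | salaryPerMonth | salaryPerProduct | phoneNumber
----+-------+----------------+------------------+------------
1   | Mr A  | 500            | 5                |
2   | Mr B  | 100            | 30               |
3   | Mr C  | 200            | 20               |
Product:
pid | pname     | manufacturedDate | wid
----+-----------+------------------+----
1   | Product A | 2013-12-01       | 1
2   | Product B | 2013-12-09       | 1
3   | Product C | 2013-09-08       | 1
4   | Product D | 2013-01-30       | 2
5   | Product E | 2013-09-20       | 2
6   | Product F | 2013-12-23       | 3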
The "Total Salary" of a worker for month X is calculated as follows:
SalaryPerMonth + (SalaryPerProduct * Number of Products for Month)
Expected Results: (December 2013)
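wid | wname | salaryPerMonth | salaryPerProduct | totalSalary | Formula
----+-------+----------------+------------------+-------------+---------------
1   | Mr A  | 500            | 5                | 510         | = 500 + (5*2)
2   | Mr B  | 100            | 30               | 100         | = 100 + (30*0)
3   | Mr C  | 200            | 20               | 220         | = 200 + (20*1)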
Actual Results
I've tried this query
SELECT W.wid, W.wname, W.phoneNumber, W.salaryPerMonth, W.salaryPerProduct, (W.salaryPerMonth - SUM(W.salaryPerMonth*COUNT(p.pid))) AS Total
FROM Worker W INNER JOIN Product P ON p.Wid = W.wid
WHERE MONTH(P.manufacturedDate) = 12
GROUP BY W.wid, W.wname, W.phoneNumber, W.salaryPerMonth, W.salaryPerProduct
.. but am getting the error below:
Msg 130 Level 15 State 1 Line 1
Cannot perform an aggregate function on an expression containing an aggregate or a subquery.
Here is my db<>fiddle
CREATE TABLE Product (
pid int
, pname varchar(40)
, manufacturedDate date
, wid int
);
CREATE TABLE Worker (
wid int
, wname varchar(40)
, salaryPerMonth int
, salaryPerProduct int
, phoneNumber varchar(20)
);
INSERT INTO Product(pid, pname, manufacturedDate, wid)
VALUES
(1,'Product A','2013-12-01',1)
,(2,'Product B','2013-12-09',1)
,(3,'Product C','2013-09-08',1)
,(4,'Product D','2013-01-30',2)
,(5,'Product E','2013-09-20',2)
,(6,'Product F','2013-12-23',3)
;
INSERT INTO Worker (wid, wname, salaryPerMonth,salaryPerProduct)
VALUES
(1,'Mr A', 500, 5)
,(2, 'Mr B', 100, 30)
,(3,'Mr C', 200, 20)
;
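One possible shape for the query, sketched in Python with the built-in sqlite3 module so that it can be run as-is. The use of SQLite, the date-range parameters `first_day` and `next_month`, and putting the month filter in the ON clause of a LEFT JOIN are all assumptions made for illustration; the question targets SQL Server, where the same structure should carry over but has not been tested here. The idea is to apply COUNT once per worker group and do the arithmetic outside the aggregate, so no aggregate is nested inside another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (pid int, pname varchar(40), manufacturedDate date, wid int);
CREATE TABLE Worker (wid int, wname varchar(40), salaryPerMonth int,
                     salaryPerProduct int, phoneNumber varchar(20));
INSERT INTO Product (pid, pname, manufacturedDate, wid) VALUES
 (1,'Product A','2013-12-01',1), (2,'Product B','2013-12-09',1),
 (3,'Product C','2013-09-08',1), (4,'Product D','2013-01-30',2),
 (5,'Product E','2013-09-20',2), (6,'Product F','2013-12-23',3);
INSERT INTO Worker (wid, wname, salaryPerMonth, salaryPerProduct) VALUES
 (1,'Mr A',500,5), (2,'Mr B',100,30), (3,'Mr C',200,20);
""")

# LEFT JOIN keeps workers with no products in the month (COUNT gives 0).
# The date filter lives in ON, not WHERE, so it cannot drop those workers.
query = """
SELECT W.wid, W.wname,
       W.salaryPerMonth + W.salaryPerProduct * COUNT(P.pid) AS totalSalary
FROM Worker W
LEFT JOIN Product P
  ON P.wid = W.wid
 AND P.manufacturedDate >= :first_day
 AND P.manufacturedDate <  :next_month
GROUP BY W.wid, W.wname, W.salaryPerMonth, W.salaryPerProduct
ORDER BY W.wid
"""
rows = conn.execute(
    query, {"first_day": "2013-12-01", "next_month": "2014-01-01"}
).fetchall()
print(rows)  # [(1, 'Mr A', 510), (2, 'Mr B', 100), (3, 'Mr C', 220)]
```

Filtering on a date range rather than on the month number alone also keeps December of other years out of the result.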

Alternative solutions to an array search in PostgreSQL

I am not sure if my database design is good for this tricky case, and I would also like help with what the query could look like.
I plan a query with the following table:
search_array | value | id
-----------------------+-------+----
{XYa,YZb,WQb} | b | 1
{XYa,YZb,WQb,RSc,QZa} | a | 2
{XYc,YZa} | c | 3
{XYb} | a | 4
{RSa} | c | 5
There are 5 main elements in the search_array (XY, YZ, WQ, RS, QZ) and 3 values (a, b, c) that are concatenated to each element.
Each row also has one value: a, b or c.
My aim is to find all rows that fit a specific row in this sense: first, it should be checked whether they have any main elements in common in their search_arrays.
For example:
Row id 4 and row id 5 wouldn't match because XY != RS.
Row ids 1, 2 and 3 would match two times because they all have XY and YZ.
Row ids 1 and 2 would even match three times because they also have WQ in common.
Second: if there is a main element match, it should be cross-checked whether the lowercase letter after the main element fits the value of the other row.
For example: the only match for row id 1 in the table would be row id 4, because they both search for XY and the lowercase letter after the element matches the value of the other row in both directions.
Another match would be row ids 2 and 5 with RS: search c matches value c, and search a matches value a.
My idea was to cut the search_array elements in the query into two parts with the RIGHT and LEFT string functions, but I don't know how to combine the subqueries for this search.
Or would a completely different solution be faster? For example, splitting the search array into another table with the columns 'foreign key' to the main table, 'main element' and 'searched_value'. I am not sure this is the best solution, because the program would constantly have to go back to the main table to find two rows out of 3 million and compare their searched_values to the values.
Thank you very much for your answers and your time!
You'll have to represent the data in a normalized fashion. I'll do it in a WITH clause, but it would be better to store the data in this fashion to begin with.
WITH unravel AS (
   SELECT t.id, t.value,
          substr(u.val, 1, 2) AS arr_main,
          substr(u.val, 3, 1) AS arr_val
   FROM mytable AS t
      CROSS JOIN LATERAL unnest(t.search_array) AS u(val)
)
SELECT a.id AS first_id,
       a.value AS first_value,
       b.id AS second_id,
       b.value AS second_value,
       a.arr_main AS main_element
FROM unravel AS a
   JOIN unravel AS b
      ON a.arr_main = b.arr_main
         AND a.arr_val = b.value
         AND b.arr_val = a.value;
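The same unravel-then-self-join logic can be traced in plain Python on the five sample rows. This is only a sketch to make the matching rule concrete; the function names are invented, and, unlike the query above, it reports each pair once (lower id first) and never pairs a row with itself, which is an assumption about the intended result.

```python
# Sample rows from the question: (id, search_array, value).
rows = [
    (1, ["XYa", "YZb", "WQb"], "b"),
    (2, ["XYa", "YZb", "WQb", "RSc", "QZa"], "a"),
    (3, ["XYc", "YZa"], "c"),
    (4, ["XYb"], "a"),
    (5, ["RSa"], "c"),
]


def unravel(rows):
    """One tuple per array element: (id, value, main element, searched value)."""
    return [
        (row_id, value, item[:2], item[2:3])
        for row_id, items, value in rows
        for item in items
    ]


def cross_matches(rows):
    """Pairs sharing a main element whose searched values match each other's value."""
    flat = unravel(rows)
    return sorted(
        (a_id, b_id, a_main)
        for a_id, a_value, a_main, a_wanted in flat
        for b_id, b_value, b_main, b_wanted in flat
        if a_id < b_id              # each pair once, no self-pairs
        and a_main == b_main        # same main element
        and a_wanted == b_value     # a searches for b's value
        and b_wanted == a_value     # b searches for a's value
    )


matches = cross_matches(rows)
print(matches)  # [(1, 4, 'XY'), (2, 5, 'RS')]
```

The nested loop is quadratic and only meant for reading; on millions of rows the normalized table with an index on the main element is what does the work.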

Can I set rules for string comparison in SQL? (or do I need to hardcode using CASE WHEN)

I need to compare ratings at two points in time and indicate whether the rating went up, went down, or stayed the same.
For example:
This would be a table with four columns:
ID T0 T0+1 Status
1 AAA AA Lower
2 BB A Higher
3 C C Same
However, this does not work when applying regular string comparison, because in SQL
A<B
B<BBB
I need
A>B
B<BBB
So my order(highest to lowest): AAA,AA,A,BBB,BB,B
SQL order(highest to lowest): BBB,BB,B,AAA,AA,A
Now I have 2 options in mind, but I wonder if someone know a better one:
1) Use CASE WHEN statements for all the possibilities of ratings going up and down (I have more values than indicated above):
CASE WHEN T0=T0+1 then 'Same'
WHEN T0='AAA' and T0+1<>'AAA' then 'Lower'
.... address all other options for the rating going down
ELSE 'Higher'
END
However, this generates a very large number of CASE WHEN statements.
2) My other option requires generating 2 tables. In table 1 I use a CASE statement to assign a value/rank to each rating.
For example:
CASE WHEN T0='AAA' then 6
WHEN T0='AA' then 5
WHEN T0='A' then 4
WHEN T0='BBB' then 3
WHEN T0='BB' then 2
WHEN T0='B' then 1
END
The same for T0+1.
Then in table 2 I use a regular comparison between column T0 and column T0+1 on the numeric values.
However, I am looking for a solution where I can do it in one table (with as few lines as possible), and ideally never show the ranking column.
I think a nested statement would be the best option, but it did not work for me.
Does anybody have suggestions?
I use SQL Server 2008.
If you are working with credit ratings, it is very likely that this is not just about AAA > AA or BBB > BB.
Depending on the agency, it could also be AA+ or Aa1 for long term, F1+ for short term, or something else in different contexts or with other agencies.
It is also often required to convert ratings from one agency to another agency's scale.
Therefore it is better to use a mapping table such as:
Id | Rating
0 | AAA
1 | AA+
2 | AA
3 | AA-
4 | A+
5 | A
6 | A-
7 | BBB+
Using this table, you only have to join the rating in your data table with the rating in the mapping table:
SELECT d.Rating_T0, d.Rating_T1,
       CASE WHEN d.Rating_T0 = d.Rating_T1 THEN '='
            WHEN m0.id < m1.id THEN '<'
            WHEN m0.id > m1.id THEN '>'
       END
FROM yourData d
INNER JOIN RatingMapping m0
    ON m0.Rating = d.Rating_T0
INNER JOIN RatingMapping m1
    ON m1.Rating = d.Rating_T1
If you only store the rating id in your data table, you will not only save space (1 byte for tinyint versus up to 4 chars) but will also be able to compare without the JOIN to the mapping table.
SELECT d.Rating_Id0, d.Rating_Id1,
       CASE WHEN d.Rating_Id0 = d.Rating_Id1 THEN '='
            WHEN d.Rating_Id0 < d.Rating_Id1 THEN '<'
            WHEN d.Rating_Id0 > d.Rating_Id1 THEN '>'
       END
FROM yourData d
The JOIN would only be required when you want to display the actual rating value, such as AAA for Rating_Id = 0.
You could also add an agency_Id to the mapping table. This way, you can easily choose which agency's notation you want to display and easily convert between Agency 1 and Agency 2 or Agency 3 (i.e. Id 1 => S&P, Id 2 => Fitch, Id 3 => ...).
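The mapping idea comes down to comparing positions on an ordered scale instead of comparing strings. A minimal Python sketch follows, using the three example rows from the question; the scale below (in particular the CCC, CC and C entries) and the function name `rating_change` are assumptions made for illustration, not any agency's official scale.

```python
# Ordered from best to worst; the position in the list plays the role
# of the Id column in the mapping table. The scale itself is assumed.
SCALE = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C"]
RANK = {rating: position for position, rating in enumerate(SCALE)}


def rating_change(before, after):
    """Return 'Same', 'Higher' or 'Lower' for a move from before to after."""
    if RANK[after] == RANK[before]:
        return "Same"
    # A smaller position means a better rating.
    return "Higher" if RANK[after] < RANK[before] else "Lower"


examples = [("AAA", "AA"), ("BB", "A"), ("C", "C")]
statuses = [rating_change(t0, t1) for t0, t1 in examples]
print(statuses)  # ['Lower', 'Higher', 'Same']
```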

T-SQL Get Rows With Similar Company Name Using Levenshtein

I'm using this Levenshtein function for T-SQL which works well (I'm not worried about performance). Now I want to write a query that returns all rows where the Levenshtein distance is less than x (where x might be 5 for example) using the Company name field to do the comparison.
I've tried the following, but it returns thousands of duplicate rows.
SELECT * FROM Contacts c1, Contacts c2
WHERE dbo.ufnCompareString(c1.Company, c2.Company) < 5
AND c1.id <> c2.id
I would like it to show a list like this:
1 Apple Experts
20 Apple Experts Inc.
240 H&K Paving
21 H and K Paving
98 HK Paving
189 H.K. Paving
5 J.M. Lawn Care
105 JM Lawn Care
Is it possible to do something like this? What am I doing wrong?
EDIT
I ended up with a query that looks something like this. I found that there were some "invalid" entries causing the problems I was having:
SELECT c1.ContactId, c1.Company, c1.LastName, c1.FirstName,
c2.ContactId, c2.Company, c2.LastName, c2.FirstName
FROM Contacts c1, Contacts c2
WHERE Cast(c1.ContactId AS INT) < Cast(c2.ContactId AS INT)
AND c1.Company IS NOT NULL
AND Replace(c1.Company, ' ', '') <> ''
AND c2.Company IS NOT NULL
AND Replace(c2.Company, ' ', '') <> ''
AND Len(c1.Company) > 6
AND Len(c2.Company) > 6
AND dbo.ufnCompareString(c1.Company, c2.Company) < 5
Note that the query is pretty slow running (on about 12,000 records) and I also have a different query that is more effective. The goal was to find duplicate companies that had been entered using slightly different company names and this query returned too many false positives. As to the query I actually used, it's too complicated to show here and outside the scope of this question.
To reduce the duplicates, use this instead:
SELECT * FROM Contacts c1, Contacts c2
WHERE dbo.ufnCompareString(c1.Company, c2.Company) < 5
AND c1.id < c2.id
It returns all unique pairs of contacts, whose distance is less than 5.
The query you have there should work properly; if you are getting duplicates, look at the content of the Contacts table.
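The effect of comparing each pair only once can be seen in a small Python sketch. The `levenshtein` function below is a standard dynamic-programming implementation written for this example, not the dbo.ufnCompareString function from the question, and the four contacts are taken from the desired output list.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


contacts = [
    (5, "J.M. Lawn Care"),
    (105, "JM Lawn Care"),
    (98, "HK Paving"),
    (189, "H.K. Paving"),
]


def similar_pairs(rows, max_distance=5):
    """Each unordered pair once, lower id first, like c1.id < c2.id."""
    return [
        (first[0], second[0])
        for first, second in combinations(sorted(rows), 2)
        if levenshtein(first[1], second[1]) < max_distance
    ]


pairs = similar_pairs(contacts)
print(pairs)  # [(5, 105), (98, 189)]
```

With `<>` instead of `<`, every pair would show up twice, once in each order.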

Moving Data from a Grid into a Database

I have the following lookup grid
x A B C D
A 0 2 1 1
B 2 0 1 1
C 1 1 0 1
D 1 1 1 0
Think of this as similar to the travelling salesman problem with point-to-point times, although the algorithm isn't relevant to this problem. It is more like a lookup from A->B.
What would be the best way to store this in a database, given that the time is the same in both directions? A to B is 2, and B to A is 2.
Start End Time
A B 2
A C 1
B A 2
etc
Doing this seems like it would duplicate all the data, which wouldn't be a good design.
Any thoughts on the best way to implement this?
Don't store the duplicate rows. Just do a select like this:
select *
from LookupTable
where (Start = 'A' and End = 'B')
or (Start = 'B' and End = 'A')
Agree with OrbMan. You may adopt a convention to store either the upper triangle or the lower triangle, and after loading that triangle from the database just mirror it. Doing this in the db streamer and loader should encapsulate/localize the behavior in one place.
Oh, another thing: you should probably use a matrix implementation that works the same way, so that a[i,j] returns a[j,i] if i>j, and 0 if i==j. You get the point... Then you just have to save and load the items where i<j.
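A minimal Python sketch of such a symmetric lookup, storing only one triangle of the grid from the question. The class name `SymmetricTimes` and its methods are invented for illustration.

```python
class SymmetricTimes:
    """Stores each unordered pair once; get(a, b) == get(b, a)."""

    def __init__(self):
        self._times = {}

    @staticmethod
    def _key(start, end):
        # Normalize the pair so (B, A) and (A, B) share one entry.
        return (start, end) if start <= end else (end, start)

    def set(self, start, end, time):
        if start != end:  # the diagonal is always 0 and is never stored
            self._times[self._key(start, end)] = time

    def get(self, start, end):
        if start == end:
            return 0
        return self._times[self._key(start, end)]

    def stored_rows(self):
        """The rows that would be saved to the database (one triangle only)."""
        return sorted((a, b, t) for (a, b), t in self._times.items())


grid = SymmetricTimes()
for start, end, time in [("A", "B", 2), ("A", "C", 1), ("A", "D", 1),
                         ("B", "C", 1), ("B", "D", 1), ("C", "D", 1)]:
    grid.set(start, end, time)

print(grid.get("B", "A"))       # 2
print(len(grid.stored_rows()))  # 6
```

Six stored rows cover all sixteen cells of the 4x4 grid.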
