Does this require recursive CTE, just creative window functions, a loop? - sql-server
I cannot for the life of me figure out how to get a weighted ranking for scores across X categories. For example, a student needs to answer 10 questions across 3 categories (both the number of questions and the number of categories will eventually be variable). To get a total score, the top score in each of the X (3) categories is taken first, and the highest remaining scores are then added until there are 10 question scores in total.
Here is the data. I used a CASE WHEN Row_Number() to get the TopInCat
http://sqlfiddle.com/#!6/e6e9f/1
The fiddle has more students.
| Question | Student | Category | Score | TopInCat |
|----------|---------|----------|-------|----------|
| 120149 | 125 | 6 | 1 | 1 |
| 120127 | 125 | 6 | 0.9 | 0 |
| 120124 | 125 | 6 | 0.8 | 0 |
| 120125 | 125 | 6 | 0.7 | 0 |
| 120130 | 125 | 6 | 0.6 | 0 |
| 120166 | 125 | 6 | 0.5 | 0 |
| 120161 | 125 | 6 | 0.4 | 0 |
| 120138 | 125 | 4 | 0.15 | 1 |
| 120069 | 125 | 4 | 0.15 | 0 |
| 120022 | 125 | 4 | 0.15 | 0 |
| 120002 | 125 | 4 | 0.15 | 0 |
| 120068 | 125 | 2 | 0.01 | 1 |
| 120050 | 125 | 3 | 0.05 | 1 |
| 120139 | 125 | 2 | 0 | 0 |
| 120156 | 125 | 2 | 0 | 0 |
This is how I envision it needs to look, but it doesn't have to be exactly this. I just need the 10-questions-by-3-categories detail data in a form that lets me sum and average the rows with Sort 1-10 below. The 999s could be NULL or whatever, as long as I can sum what's important and present the details.
| Question | Student | Category | Score | TopInCat | Sort |
|----------|---------|----------|-------|----------|------|
| 120149 | 125 | 6 | 1 | 1 | 1 |
| 120138 | 125 | 4 | 0.15 | 1 | 2 |
| 120068 | 125 | 2 | 0.01 | 1 | 3 |
| 120127 | 125 | 6 | 0.9 | 0 | 4 |
| 120124 | 125 | 6 | 0.8 | 0 | 5 |
| 120125 | 125 | 6 | 0.7 | 0 | 6 |
| 120130 | 125 | 6 | 0.6 | 0 | 7 |
| 120166 | 125 | 6 | 0.5 | 0 | 8 |
| 120161 | 125 | 6 | 0.4 | 0 | 9 |
| 120069 | 125 | 4 | 0.15 | 0 | 10 |
| 120022 | 125 | 4 | 0.15 | 0 | 999 |
| 120002 | 125 | 4 | 0.15 | 0 | 999 |
| 120050 | 125 | 3 | 0.05 | 1 | 999 |
| 120139 | 125 | 2 | 0 | 0 | 999 |
| 120156 | 125 | 2 | 0 | 0 | 999 |
One last thing, the category no longer matters once the X (3) threshold is met. So a 4th category would just sort normally.
| Question | Student | Category | Score | TopInCat | Sort |
|----------|---------|----------|-------|----------|------|
| 120149 | 126 | 6 | 1 | 1 | 1 |
| 120138 | 126 | 4 | 0.75 | 1 | 2 |
| 120068 | 126 | 2 | 0.50 | 1 | 3 |
| 120127 | 126 | 6 | 0.9 | 0 | 4 |
| 120124 | 126 | 6 | 0.8 | 0 | 5 |
| 120125 | 126 | 6 | 0.7 | 0 | 6 |
| 120130 | 126 | 6 | 0.6 | 0 | 7 |
| 120166 | 126 | 6 | 0.5 | 0 | 8 |
| 120050 | 126 | 3 | 0.45 | 1 | 9 |********
| 120161 | 126 | 6 | 0.4 | 0 | 10 |
| 120069 | 126 | 4 | 0.15 | 0 | 999 |
| 120022 | 126 | 4 | 0.15 | 0 | 999 |
| 120002 | 126 | 4 | 0.15 | 0 | 999 |
| 120139 | 126 | 2 | 0 | 0 | 999 |
| 120156 | 126 | 2 | 0 | 0 | 999 |
I really appreciate any help. Been banging my head on this for a few days.
With such matters I like to proceed with a 'building blocks' approach. Following the maxim 'first make it work, then, if you need to, make it fast', this first step is often enough.
So, given
CREATE TABLE WeightedScores
([Question] int, [Student] int, [Category] int, [Score] dec(3,2))
;
and your sample data
INSERT INTO WeightedScores
([Question], [Student], [Category], [Score])
VALUES
(120161, 123, 6, 1), (120166, 123, 6, 0.64), (120138, 123, 4, 0.57), (120069, 123, 4, 0.5),
(120068, 123, 2, 0.33), (120022, 123, 4, 0.18), (120061, 123, 6, 0), (120002, 123, 4, 0),
(120124, 123, 6, 0), (120125, 123, 6, 0), (120137, 123, 6, 0), (120154, 123, 6, 0),
(120155, 123, 6, 0), (120156, 123, 6, 0), (120139, 124, 2, 1), (120156, 124, 2, 1),
(120050, 124, 3, 0.88), (120068, 124, 2, 0.87), (120161, 124, 6, 0.87), (120138, 124, 4, 0.85),
(120069, 124, 4, 0.51), (120166, 124, 6, 0.5), (120022, 124, 4, 0.43), (120002, 124, 4, 0),
(120130, 124, 6, 0), (120125, 124, 6, 0), (120124, 124, 6, 0), (120127, 124, 6, 0),
(120149, 124, 6, 0), (120149, 125, 6, 1), (120127, 125, 6, 0.9), (120124, 125, 6, 0.8),
(120125, 125, 6, 0.7), (120130, 125, 6, 0.6), (120166, 125, 6, 0.5), (120161, 125, 6, 0.4),
(120138, 125, 4, 0.15), (120069, 125, 4, 0.15), (120022, 125, 4, 0.15), (120002, 125, 4, 0.15),
(120068, 125, 2, 0.01), (120050, 125, 3, 0.05), (120139, 125, 2, 0), (120156, 125, 2, 0),
(120149, 126, 6, 1), (120138, 126, 4, 0.75), (120068, 126, 2, 0.50), (120127, 126, 6, 0.9),
(120124, 126, 6, 0.8), (120125, 126, 6, 0.7), (120130, 126, 6, 0.6), (120166, 126, 6, 0.5),
(120050, 126, 3, 0.45), (120161, 126, 6, 0.4), (120069, 126, 4, 0.15), (120022, 126, 4, 0.15),
(120002, 126, 4, 0.15), (120139, 126, 2, 0), (120156, 126, 2, 0)
;
let's proceed.
The complicated part here is identifying the top three top-in-category questions; the others of the ten questions of interest per student are simply sorted by score, which is easy. So let's start with identifying the top three top-in-category questions.
First, assign to each row a row number giving the ordering of that score within the category, for the student:
;WITH Numbered1 ( Question, Student, Category, Score, SeqInStudentCategory ) AS
(
SELECT Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student, Category ORDER BY Score DESC) SeqInStudentCategory
FROM WeightedScores
)
Now we are only interested in rows where SeqInStudentCategory is 1. Considering only such rows, let's order them by score within student, and number those rows:
-- within the preceding WITH
, Numbered2 ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT
Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student ORDER BY Score DESC) SeqInStudent
FROM
Numbered1
WHERE
SeqInStudentCategory = 1
)
Now we are only interested in rows where SeqInStudent is at most 3. Let's pull them out, so that we know to include them (and to exclude them from the simple sort by score that we will use to make up the remaining seven rows):
-- within the preceding WITH
, TopInCat ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT Question, Student, Category, Score, SeqInStudent FROM Numbered2 WHERE SeqInStudent <= 3
)
Now we have the three top-in-category questions for each student. We now need to identify and order by score the not top-in-category questions for each student:
-- within the preceding WITH
, NotTopInCat ( Question, Student, Category, Score, SeqInStudent ) AS
(
SELECT
Question, Student, Category, Score
, ROW_NUMBER() OVER (PARTITION BY Student ORDER BY Score DESC) SeqInStudent
FROM
WeightedScores WS
WHERE
NOT EXISTS ( SELECT 1 FROM TopInCat T WHERE T.Question = WS.Question AND T.Student = WS.Student )
)
Finally we combine TopInCat with NotTopInCat, applying an appropriate offset and restriction to NotTopInCat.SeqInStudent - we need to add 3 to the raw value, and take the top 7 (which is 10 - 3):
-- within the preceding WITH
, Combined ( Question, Student, Category, Score, CombinedSeq ) AS
(
SELECT
Question, Student, Category, Score, SeqInStudent AS CombinedSeq
FROM
TopInCat
UNION
SELECT
Question, Student, Category, Score, SeqInStudent + 3 AS CombinedSeq
FROM
NotTopInCat
WHERE
SeqInStudent <= 10 - 3
)
To get our final results:
SELECT * FROM Combined ORDER BY Student, CombinedSeq
;
You can see the results on sqlfiddle.
Note that here I have assumed that every student will always have answers from at least three categories. Also, the final output doesn't have a TopInCat column, but hopefully you will see how to regain that if you want it.
Also, "(both # of questions and # of categories will be variable eventually)" should be relatively straightforward to deal with here. But watch out for my assumption that (in this case) 3 categories will definitely be present in the answers of each student.
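As a cross-check on the query's output, here is a minimal Python sketch of the same selection rule. It is not part of the answer above: the function name `rank_scores` and its parameters are hypothetical, and ties are broken by input order, which is an assumption (the SQL leaves tie order unspecified). Unlike the SQL, it fills the remaining slots with `total_questions - len(tops)`, so a student with fewer than three categories still gets up to ten rows.

```python
from collections import defaultdict

def rank_scores(rows, top_categories=3, total_questions=10):
    """rows: iterable of (question, student, category, score) tuples.

    Returns {student: [(question, category, score, top_in_cat), ...]}
    with each list already in final Sort order.
    """
    by_student = defaultdict(list)
    for question, student, category, score in rows:
        by_student[student].append((question, category, score))

    result = {}
    for student, answers in by_student.items():
        # Best answer per category; on ties the first one seen wins.
        best = {}
        for answer in answers:
            category, score = answer[1], answer[2]
            if category not in best or score > best[category][2]:
                best[category] = answer
        # Keep the highest-scoring category winners, at most top_categories.
        tops = sorted(best.values(), key=lambda a: -a[2])[:top_categories]
        top_questions = {a[0] for a in tops}
        # Everything else competes on score alone for the remaining slots.
        rest = sorted((a for a in answers if a[0] not in top_questions),
                      key=lambda a: -a[2])
        rest = rest[: total_questions - len(tops)]
        result[student] = [a + (1,) for a in tops] + [a + (0,) for a in rest]
    return result

# Student 126 from the sample data.
student_126 = [
    (120149, 126, 6, 1.0), (120138, 126, 4, 0.75), (120068, 126, 2, 0.50),
    (120127, 126, 6, 0.9), (120124, 126, 6, 0.8), (120125, 126, 6, 0.7),
    (120130, 126, 6, 0.6), (120166, 126, 6, 0.5), (120050, 126, 3, 0.45),
    (120161, 126, 6, 0.4), (120069, 126, 4, 0.15), (120022, 126, 4, 0.15),
    (120002, 126, 4, 0.15), (120139, 126, 2, 0.0), (120156, 126, 2, 0.0),
]
ranked = rank_scores(student_126)[126]
for sort, (question, category, score, top_in_cat) in enumerate(ranked, start=1):
    print(sort, question, category, score, top_in_cat)
```

Changing `top_categories` or `total_questions` corresponds to replacing the literals 3 and 10 in the CTEs.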
Related
PostgreSQL query to iterate through array and perform product calculation on each cell
I have a cell with an array of integers (column created using the syntax: some_arrays integer[])
| id | some_arrays     |
|----+-----------------|
| 0  | 10, 15, 20, 25  |
| 1  | 1, 3, 5, 7      |
| 2  | 33, 674, 2, 4   |
For any given row, I want to iterate through each array and, for each element, return the product of all the elements in the array other than itself.
ex: some_arrays[0] -> [10, 15, 20, 25] -> [15 * 20 * 25, 10 * 20 * 25, 10 * 15 * 25, 10 * 15 * 20] -> [7500, 5000, 3750, 3000]
In Python we could do something like:
def get_prod(array, n):
    prod = 1
    # calculate the product of all elements in array
    for i in range(n):
        prod *= array[i]
    # replace each element by the product divided by that element
    for i in range(n):
        array[i] = prod // array[i]
array = [10, 15, 20, 25]
n = len(array)
get_prod(array, n)
for i in range(n):
    print(array[i], end=" ")
In PostgreSQL, what's the correct query to select a cell, iterate through each element, and obtain the product of the other elements?
Summing the elements of each cell, or doubling each element and displaying the result in a new column, is possible:
SELECT id, (SELECT SUM(a) FROM unnest(some_arrays) AS a) AS total FROM this_table;
SELECT some_arrays, ARRAY(SELECT x * 2 FROM unnest(some_arrays) AS x) AS doubles FROM this_table;
This yields:
| some_arrays     | doubles         |
|-----------------+-----------------|
| 10, 15, 20, 25  | 20, 30, 40, 50  |
| 1, 3, 5, 7      | 2, 6, 10, 14    |
| 33, 674, 2, 4   | 66, 1348, 4, 8  |
(Using Postgres 11.10)
Thank you for any tips on writing a comprehensive for loop!
Explicit looping would be one way to approach this task, but I prefer to use relational calculus. PostgreSQL doesn't know how to do a product aggregate, but you can teach it:
create aggregate product(integer) (stype=bigint,sfunc=int84mul,initcond=1);
Here int84mul is a built-in function that multiplies an 8-byte integer by a 4-byte integer. It's not described in the documentation but is visible via introspection, e.g. with the psql command \dfS+ *mul*
Then
select cell.*, product(a) from cell, lateral unnest(some_arrays) as a group by 1,2 order by 1;
will get you the result:
 id |  some_arrays  | product
----+---------------+---------
  0 | {10,15,20,25} |   75000
  1 | {1,3,5,7}     |     105
  2 | {33,674,2,4}  |  177936
This is close. If you then divide the product by each element of the cell:
with b as (
    select cell.*, product(a) from cell, lateral unnest(some_arrays) as a group by 1,2
)
select id, some_arrays,
       ARRAY(SELECT product / x FROM unnest(some_arrays) AS x) AS exclusive_products
from b order by 1;
you get the result:
 id |  some_arrays  |   exclusive_products
----+---------------+------------------------
  0 | {10,15,20,25} | {7500,5000,3750,3000}
  1 | {1,3,5,7}     | {105,35,21,15}
  2 | {33,674,2,4}  | {5392,264,88968,44484}
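Dividing the total product by each element breaks down if an array contains a zero. As a hedged alternative, and a different technique from the aggregate above, the following Python sketch builds each result from a running prefix product and a running suffix product, so no division is needed. The function name `exclusive_products` is only illustrative.

```python
def exclusive_products(values):
    """For each position, return the product of all other elements.

    Uses prefix and suffix products instead of division, so zeros are safe.
    """
    n = len(values)
    result = [1] * n
    prefix = 1
    for i in range(n):
        result[i] = prefix          # product of everything left of i
        prefix *= values[i]
    suffix = 1
    for i in range(n - 1, -1, -1):
        result[i] *= suffix         # times product of everything right of i
        suffix *= values[i]
    return result

for row in ([10, 15, 20, 25], [1, 3, 5, 7], [33, 674, 2, 4]):
    print(row, exclusive_products(row))
```

For arrays without zeros this gives the same values as dividing the full product by each element.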
SQL Grouping columns by row
Struggling with the SQL query to convert the data I have into the required format. I have an event log for machines and would like to associate the start and stop time and event outcome in the same row. I am unable to use LAG due to the version of SQL Server. Any help appreciated.
Current dataset:
+----------+----------+------------+---------------------+---------------------+
| MACHINE  | EVENT_ID | EVENT_CODE | DATE_TIME           | EVENT_DESCRIPTOR    |
+----------+----------+------------+---------------------+---------------------+
| 1        | 1        | 1          | 2020-08-06 14:59:26 | SCAN : START : z1 : |
| 1        | 2        | 6          | 2020-08-06 15:00:18 | SCAN : END : z1 :   |
| 1        | 3        | 1          | 2020-08-06 15:00:45 | SCAN : START : z1 : |
| 1        | 4        | 5          | 2020-08-06 15:01:54 | SCAN : ABORT : z1 : |
| 2        | 5        | 1          | 2020-08-06 15:02:15 | SCAN : START : z1 : |
| 2        | 6        | 6          | 2020-08-06 15:05:07 | SCAN : END : z1 :   |
| 1        | 7        | 1          | 2020-08-06 15:05:13 | NEST : START : z1 : |
| 1        | 8        | 6          | 2020-08-06 15:05:22 | NEST : END : z1 :   |
| 1        | 9        | 1          | 2020-08-06 15:07:17 | CUT : START : z1 :  |
| 1        | 10       | 6          | 2020-08-06 15:10:40 | CUT : END : z1 :    |
+----------+----------+------------+---------------------+---------------------+
The outcome I am trying to achieve:
+----------+------------------------------+------------------------------+----------+
| Machine  | SCAN:START:Z1 _TIME          | SCAN:STOP_OR_ABORT:Z1 _TIME  | OUTCOME  |
+----------+------------------------------+------------------------------+----------+
| 1        | Thu Aug 06 14:59:26 BST 2020 | 2020-08-06 15:00:18          | END      |
| 1        | Thu Aug 06 15:00:45 BST 2020 | 2020-08-06 15:01:54          | ABORT    |
| 1        | Thu Aug 06 15:02:15 BST 2020 | 2020-08-06 15:05:07          | END      |
+----------+------------------------------+------------------------------+----------+
You can select the starting events and join them to the ending events with a subquery, for example in the form of an outer apply.
select L1.Machine,
       L1.date_time as Start,
       L2.date_time as Stop_Or_Abort,
       case L2.Event_Code when 5 then 'ABORT' when 6 then 'END' end as Outcome
from MyLogs L1
outer apply (select top 1 L2.date_time, L2.Event_Code
             from MyLogs L2
             where L2.Machine = L1.Machine
               and L2.Event_ID > L1.Event_ID
               and L2.Event_Code in (5, 6)
             order by L2.Event_ID) as L2
where L1.Event_Descriptor like 'SCAN%'
  and L1.Event_Code = 1
You can use an Outer Apply to get the next record after the start, filtering only the SCAN events.
Select Machine = L.MACHINE,
       [SCAN:START:Z1 _TIME] = L.DATE_TIME,
       [SCAN:STOP_OR_ABORT:Z1 _TIME] = E.DATE_TIME,
       Outcome = Case when E.EVENT_CODE = 5 then 'ABORT'
                      when E.EVENT_CODE = 6 then 'END'
                 End
From Logs L
Outer Apply
(
    Select top 1 L1.DATE_TIME, L1.EVENT_CODE
    From Logs L1
    where L1.MACHINE = L.MACHINE
      and L1.EVENT_CODE in (5, 6)
      and L1.DATE_TIME > L.DATE_TIME
    order by L1.EVENT_ID
) E
where L.EVENT_CODE = 1
  and L.EVENT_DESCRIPTOR like 'SCAN%'
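The pairing logic both answers rely on (for each SCAN start, take the first later END or ABORT event on the same machine) can be sketched outside SQL to check the expected rows. This is a minimal Python model under the assumption that EVENT_ID increases with time; the function name `pair_scan_events` and the tuple layout are hypothetical.

```python
def pair_scan_events(events):
    """events: list of (machine, event_id, event_code, date_time, descriptor).

    Returns (machine, start_time, stop_time, outcome) for every SCAN start;
    stop_time and outcome are None when no closing event exists yet.
    """
    ordered = sorted(events, key=lambda e: e[1])  # order by EVENT_ID
    outcomes = {5: "ABORT", 6: "END"}
    pairs = []
    for i, (machine, _event_id, code, started, descriptor) in enumerate(ordered):
        if code != 1 or not descriptor.startswith("SCAN"):
            continue
        # First later END/ABORT on the same machine, like TOP 1 ... ORDER BY EVENT_ID.
        closing = next(
            (e for e in ordered[i + 1:] if e[0] == machine and e[2] in outcomes),
            None,
        )
        pairs.append((
            machine,
            started,
            closing[3] if closing else None,
            outcomes[closing[2]] if closing else None,
        ))
    return pairs

log = [
    (1, 1, 1, "2020-08-06 14:59:26", "SCAN : START : z1 :"),
    (1, 2, 6, "2020-08-06 15:00:18", "SCAN : END : z1 :"),
    (1, 3, 1, "2020-08-06 15:00:45", "SCAN : START : z1 :"),
    (1, 4, 5, "2020-08-06 15:01:54", "SCAN : ABORT : z1 :"),
    (2, 5, 1, "2020-08-06 15:02:15", "SCAN : START : z1 :"),
    (2, 6, 6, "2020-08-06 15:05:07", "SCAN : END : z1 :"),
    (1, 7, 1, "2020-08-06 15:05:13", "NEST : START : z1 :"),
    (1, 8, 6, "2020-08-06 15:05:22", "NEST : END : z1 :"),
]
pairs = pair_scan_events(log)
for pair in pairs:
    print(pair)
```

Like the queries, the sketch does not check that the closing event is itself a SCAN event; it only looks at the machine and the event code.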
Lateral flatten array with multiple JSON objects in Snowflake
I have an array with multiple JSON objects. The max number of elements in any JSON array located in the table is 8. Here's an example of the raw value of the array:
variants
----------------------------------------------------------------
[
  { "id": 12388362846279, "inventory_quantity": 10, "sku": "sku1" },
  { "id": 12388391387207, "inventory_quantity": 31, "sku": "sku2" },
  { "id": 12394420142151, "inventory_quantity": 12, "sku": "sku3" },
  { "id": 12394426007623, "inventory_quantity": 4, "sku": "sku4" },
  { "id": 12394429022279, "inventory_quantity": 9, "sku": "sku5" },
  { "id": 12394431414343, "inventory_quantity": 15, "sku": "sku6" },
  { "id": 12394455597127, "inventory_quantity": 22, "sku": "sku7" },
  { "id": 12394459856967, "inventory_quantity": 0, "sku": "sku8" }
]
My query attempts to flatten and parse the array to return a row for each object:
select
    variants[0]:sku, variants[0]:inventory_quantity,
    variants[1]:sku, variants[1]:inventory_quantity,
    variants[2]:sku, variants[2]:inventory_quantity,
    variants[3]:sku, variants[3]:inventory_quantity,
    variants[4]:sku, variants[4]:inventory_quantity,
    variants[5]:sku, variants[5]:inventory_quantity,
    variants[6]:sku, variants[6]:inventory_quantity,
    variants[7]:sku, variants[7]:inventory_quantity
from table
, lateral flatten(input => variants)
However, my output is returning duplicate/repeated values:
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
| sku1 | 10 | sku2 | 31 | sku3 | 12 | sku4 | 4 | sku5 | 9 | sku6 | 15 | sku7 | 22 | sku8 | 0 |
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
| sku1 | 10 | sku2 | 31 | sku3 | 12 | sku4 | 4 | sku5 | 9 | sku6 | 15 | sku7 | 22 | sku8 | 0 |
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
| sku1 | 10 | sku2 | 31 | sku3 | 12 | sku4 | 4 | sku5 | 9 | sku6 | 15 | sku7 | 22 | sku8 | 0 |
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
| sku1 | 10 | sku2 | 31 | sku3 | 12 | sku4 | 4 | sku5 | 9 | sku6 | 15 | sku7 | 22 | sku8 | 0 |
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
| sku1 | 10 | sku2 | 31 | sku3 | 12 | sku4 | 4 | sku5 | 9 | sku6 | 15 | sku7 | 22 | sku8 | 0 |
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
| sku1 | 10 | sku2 | 31 | sku3 | 12 | sku4 | 4 | sku5 | 9 | sku6 | 15 | sku7 | 22 | sku8 | 0 |
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
| sku1 | 10 | sku2 | 31 | sku3 | 12 | sku4 | 4 | sku5 | 9 | sku6 | 15 | sku7 | 22 | sku8 | 0 |
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
| sku1 | 10 | sku2 | 31 | sku3 | 12 | sku4 | 4 | sku5 | 9 | sku6 | 15 | sku7 | 22 | sku8 | 0 |
+------+----+------+----+------+----+------+---+------+---+------+----+------+----+------+---+
I would like my output to look similar to the following:
+------+----+
| sku1 | 10 |
+------+----+
| sku2 | 31 |
+------+----+
| sku3 | 12 |
+------+----+
| sku4 | 4  |
+------+----+
| sku5 | 9  |
+------+----+
| sku6 | 15 |
+------+----+
| sku7 | 22 |
+------+----+
| sku8 | 0  |
+------+----+
Using LATERAL FLATTEN removes the need for you to explicitly reference array locations. Each member of the array becomes its own row. So to obtain the results you want above, simply use:
select v.value:sku::varchar, v.value:inventory_quantity
from table, lateral flatten(input => table.variants) v
;
If there are columns from table that are outside of the array that you want to reference in each row, simply include them in the SELECT. Essentially the flattened rows from the array are "joined" to the non-nested columns of the table implicitly...
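To make the row-per-element behaviour concrete, here is a small Python model of flattening. It is only an analogy, not Snowflake code: `flatten_variants` and `row_id` are hypothetical names, and the sample array is shortened to three elements. Each source row yields one output row per array element, which is also why the original query (which ignored the flattened value and indexed `variants` directly) returned the same wide row once per element.

```python
import json

def flatten_variants(rows):
    """rows: iterable of (row_id, variants_json).

    Yields (row_id, index, sku, inventory_quantity), one tuple per array
    element, with the non-array column row_id repeated on every output row.
    """
    for row_id, variants_json in rows:
        for index, value in enumerate(json.loads(variants_json)):
            yield row_id, index, value["sku"], value["inventory_quantity"]

table = [
    (1, '[{"id": 12388362846279, "inventory_quantity": 10, "sku": "sku1"},'
        ' {"id": 12388391387207, "inventory_quantity": 31, "sku": "sku2"},'
        ' {"id": 12394420142151, "inventory_quantity": 12, "sku": "sku3"}]'),
]
flattened = list(flatten_variants(table))
for row in flattened:
    print(row)
```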
SQL Server : Islands And Gaps
I'm struggling with an "Islands and Gaps" issue. This is for SQL Server 2008 / 2012 (we have databases on both).
I have a table which tracks "available" Serial-#'s for a Pass Outlet; i.e., Bus Passes, Admissions Tickets, Disneyland Tickets, etc. Those Serial-#'s are VARCHAR, and can be any combination of numbers and characters... any length, up to the max value of the defined column... which is VARCHAR(30). And this is where I'm mightily struggling with the syntax/design of a VIEW.
The table (IM_SER) which contains all this data has a primary key consisting of:
ITEM_NO...VARCHAR(20)
SERIAL_NO...VARCHAR(30)
In many cases... particularly with different types of the "Bus Passes" involved, those Serial-#'s could easily track into the TENS of THOUSANDS.
What is needed... is a simple view in SQL Server... which simply outputs the CONSECUTIVE RANGES of Available Serial-#'s... until a GAP is found (i.e. a BREAK in the sequences). For example, say we have the following Serial-#'s on hand, for a given Item-#:
123
124
125
139
140
ABC123
ABC124
ABC126
XYZ240003
XYY240004
In my example above, the output would be displayed as follows:
123 -to- 125
139 -to- 140
ABC123 -to- ABC124
ABC126 -to- ABC126
XYZ240003 -to- XYZ240004
In total, there would be 10 Serial-#'s... but since we're outputting the sequential ranges... only 5 lines of output would be necessary. Does this make sense? Please let me know... and, again, THANK YOU!... Mark
This should get you started... the fun part will be determining if there are gaps or not. You will have to handle each serial format a little bit differently to determine if there are gaps or not...
select x.item_no, x.s_format, x.s_length, x.serial_no,
       LAG(x.serial_no) OVER (PARTITION BY x.item_no, x.s_format, x.s_length
                              ORDER BY x.item_no, x.s_format, x.s_length, x.serial_no) PreviousValue,
       LEAD(x.serial_no) OVER (PARTITION BY x.item_no, x.s_format, x.s_length
                               ORDER BY x.item_no, x.s_format, x.s_length, x.serial_no) NextValue
from (
    select item_no, serial_no,
           len(serial_no) as S_LENGTH,
           case
               WHEN PATINDEX('%[0-9]%', serial_no) > 0 AND PATINDEX('%[a-z]%', serial_no) = 0 THEN 'NUMERIC'
               WHEN PATINDEX('%[0-9]%', serial_no) > 0 AND PATINDEX('%[a-z]%', serial_no) > 0 THEN 'ALPHANUMERIC'
               ELSE 'ALPHA'
           end as S_FORMAT
    from table1
) x
order by item_no, s_format, s_length, serial_no
http://sqlfiddle.com/#!3/5636e2/7
| item_no | s_format | s_length | serial_no | PreviousValue | NextValue |
|---------|--------------|----------|-----------|---------------|-----------|
| 1 | ALPHA | 4 | ABCD | (null) | ABCF |
| 1 | ALPHA | 4 | ABCF | ABCD | (null) |
| 1 | ALPHANUMERIC | 6 | ABC123 | (null) | ABC124 |
| 1 | ALPHANUMERIC | 6 | ABC124 | ABC123 | ABC126 |
| 1 | ALPHANUMERIC | 6 | ABC126 | ABC124 | (null) |
| 1 | ALPHANUMERIC | 9 | XYY240004 | (null) | XYZ240003 |
| 1 | ALPHANUMERIC | 9 | XYZ240003 | XYY240004 | (null) |
| 1 | NUMERIC | 3 | 123 | (null) | 124 |
| 1 | NUMERIC | 3 | 124 | 123 | 125 |
| 1 | NUMERIC | 3 | 125 | 124 | 139 |
| 1 | NUMERIC | 3 | 139 | 125 | 140 |
| 1 | NUMERIC | 3 | 140 | 139 | (null) |
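The gap detection that the answer leaves open could look roughly like the Python sketch below. It rests on assumptions that are not in the question: a serial is treated as an optional non-digit prefix followed by a run of digits, and two serials count as consecutive only when prefix and digit width match and the number increases by exactly one. Serials that do not fit that shape become ranges of one. The name `serial_ranges` is hypothetical.

```python
import re

def serial_ranges(serials):
    """Return (first, last) pairs for runs of consecutive serial numbers."""
    parsed, ranges = [], []
    for serial in serials:
        match = re.fullmatch(r"(\D*)(\d+)", serial)
        if match:
            prefix, digits = match.groups()
            parsed.append((prefix, len(digits), int(digits), serial))
        else:
            ranges.append((serial, serial))  # no trailing number: range of one
    parsed.sort()

    island_start = previous = None
    for item in parsed:
        # A new island starts when prefix or width changes, or the number skips.
        new_island = (
            previous is None
            or item[:2] != previous[:2]
            or item[2] != previous[2] + 1
        )
        if new_island:
            if previous is not None:
                ranges.append((island_start[3], previous[3]))
            island_start = item
        previous = item
    if previous is not None:
        ranges.append((island_start[3], previous[3]))
    return ranges

on_hand = ["123", "124", "125", "139", "140",
           "ABC123", "ABC124", "ABC126", "XYZ240003", "XYY240004"]
for first, last in serial_ranges(on_hand):
    print(first, "-to-", last)
```

With the serials exactly as listed in the question, XYY240004 and XYZ240003 have different prefixes, so under these assumptions they end up in separate ranges.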
Database structure for metaquery
I have a database that has the following structure:
+------+------+--------+-------+--------+-------+
| item | type | color  | speed | length | width |
+------+------+--------+-------+--------+-------+
| 1    | 1    | 3      | 1     | 2      | 2     |
| 2    | 1    | 5      | 3     | 1      | 1     |
| 3    | 1    | 6      | 3     | 1      | 1     |
| 4    | 2    | 2      | 1     | 3      | 1     |
| 5    | 2    | 2      | 2     | 2      | 1     |
| 6    | 2    | 4      | 2     | 3      | 1     |
| 7    | 2    | 5      | 1     | 1      | 2     |
| 8    | 3    | 1      | 2     | 2      | 2     |
| 9    | 4    | 4      | 3     | 1      | 2     |
| 10   | 4    | 6      | 3     | 3      | 2     |
+------+------+--------+-------+--------+-------+
I would like to efficiently query what combinations of fields are valid. So, for example, I'd like to query the database for the following:
What values of color are valid if type is 1? ans: [3, 5, 6]
What values of speed are valid if type is 2 and color is 2? ans: [1, 2]
What values of type are valid if length is 2 and width is 2? ans: [1, 3]
The SQL equivalents are:
SELECT DISTINCT `color` FROM `cars` WHERE `type` =1
SELECT DISTINCT `speed` FROM `cars` WHERE `type` =2 AND `width` =2
SELECT DISTINCT `type` FROM `cars` WHERE `length` =2 AND `width` =2
I'm planning on using a cloud-based database (Cloudant DBaaS, based on CouchDB). How would this best be implemented, keeping in mind that there may be thousands of items with tens of fields?
I haven't put too much thought into this question, so there may be errors in the approach, but one option is to represent each row with a document:
{
  "_id": "1db91338150bfcfe5fcadbd98fe77d56",
  "_rev": "1-83daafc1596c2dabd4698742c2d8b0cf",
  "item": 1,
  "type": 1,
  "color": 3,
  "speed": 1,
  "length": 2,
  "width": 2
}
Note the _id and _rev fields have been automatically generated by Cloudant for this example.
You could then create a secondary index on the type field:
function(doc) {
  if(doc.type)
    emit(doc.type);
}
To search using the type field:
https://accountname.cloudant.com/dashboard.html#database/so/_design/ddoc/_view/col_for_type?key=1&include_docs=true
A secondary index on the type and width fields:
function(doc) {
  if( doc.type && doc.width)
    emit([doc.type, doc.width]);
}
To search using the type and width fields:
https://accountname.cloudant.com/dashboard.html#database/so/_design/ddoc/_view/speed_for_type_and_width?key=[1,2]&include_docs=true
A secondary index on the length and width fields:
function(doc) {
  if (doc.length && doc.width)
    emit([doc.length, doc.width]);
}
To search using the length and width fields:
https://accountname.cloudant.com/dashboard.html#/database/so/_design/ddoc/_view/type_for_length_and_width?key=[2,2]&include_docs=true
The complete design document is here:
{
  "_id": "_design\/ddoc",
  "_rev": "3-c87d7c3cd44dcef35a030e23c1c91711",
  "views": {
    "col_for_type": {
      "map": "function(doc) {\n if(doc.type)\n emit(doc.type);\n}"
    },
    "speed_for_type_and_width": {
      "map": "function(doc) {\n if( doc.type && doc.width)\n emit([doc.type, doc.width]);\n}"
    },
    "type_for_length_and_width": {
      "map": "function(doc) {\n if (doc.length && doc.width)\n emit([doc.length, doc.width]);\n}"
    }
  },
  "language": "javascript"
}
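The views above return whole documents for a key, so the distinct values still have to be collected by the caller. As a rough illustration of that last step, here is a plain-Python model of a composite-key index that maps each key combination to the set of values seen with it. It does not use any Cloudant or CouchDB API; `build_index` and the in-memory `cars` list are hypothetical stand-ins for the view and the database.

```python
from collections import defaultdict

def build_index(docs, key_fields, value_field):
    """Map each combination of key_fields to the set of value_field values."""
    index = defaultdict(set)
    for doc in docs:
        # Skip documents that lack any of the fields, like the map functions do.
        if all(field in doc for field in key_fields) and value_field in doc:
            key = tuple(doc[field] for field in key_fields)
            index[key].add(doc[value_field])
    return dict(index)

fields = ("item", "type", "color", "speed", "length", "width")
cars = [dict(zip(fields, row)) for row in [
    (1, 1, 3, 1, 2, 2), (2, 1, 5, 3, 1, 1), (3, 1, 6, 3, 1, 1),
    (4, 2, 2, 1, 3, 1), (5, 2, 2, 2, 2, 1), (6, 2, 4, 2, 3, 1),
    (7, 2, 5, 1, 1, 2), (8, 3, 1, 2, 2, 2), (9, 4, 4, 3, 1, 2),
    (10, 4, 6, 3, 3, 2),
]]

color_for_type = build_index(cars, ("type",), "color")
speed_for_type_and_color = build_index(cars, ("type", "color"), "speed")
type_for_length_and_width = build_index(cars, ("length", "width"), "type")

print(sorted(color_for_type.get((1,), set())))
print(sorted(speed_for_type_and_color.get((2, 2), set())))
print(sorted(type_for_length_and_width.get((2, 2), set())))
```

One index is needed per combination of key fields, which mirrors having one view per question in the design document.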