Let's say you have two document types: customers and orders. A customer document contains basic information like name, address, etc., and an order document is created each time a customer orders something. Each document stores its type, either type = "order" or type = "customer".
If I run a map function over a set of 10 customers and 30 orders, it will output 40 rows: some rows will be customers, some will be orders.
The question is: how do I write the reduce so that the order information is "stuffed" inside the rows that have the customer information? The result would be 10 rows (one per customer), each containing all the relevant orders for that customer.
Basically, I don't want separate records in the output; I want to combine them (orders into one customer row), and I think reduce is the way?
This is called view collation and it is a very useful CouchDB technique.
Fortunately, you don't even need a reduce step. Just use map to get the customers and their orders "clumped" together.
Setup
The key is that you need a unique id for each customer, and it has to be known both from customer docs and from order docs.
Example customer:
{ "_id": "customer me#example.com"
, "type": "customer"
, "name": "Jason"
}
Example order:
{ "_id": "abcdef123456"
, "type": "order"
, "for_customer": "customer me#example.com"
}
I have conveniently used the customer ID as the document _id but the important thing is that both docs know the customer's identity.
Payoff
The goal is a map query, where if you specify ?key="customer me#example.com" then you will get back (1) first, the customer info, and (2) any and all orders placed.
This map function would do that:
function(doc) {
  var CUSTOMER_VAL = 1;
  var ORDER_VAL = 2;
  var key;

  if(doc.type === "customer") {
    key = [doc._id, CUSTOMER_VAL];
    emit(key, doc);
  }

  if(doc.type === "order") {
    key = [doc.for_customer, ORDER_VAL];
    emit(key, doc);
  }
}
All rows will sort primarily on the customer the document is about, and the "tiebreaker" sort is either the integer 1 or 2. That makes customer docs always sort above their corresponding order docs.
["customer me#example.com", 1], ...customer doc...
["customer me#example.com", 2], ...customer's order...
["customer me#example.com", 2], ...customer's other order.
... etc...
["customer another#customer.com", 1], ... different customer...
["customer another#customer.com", 2], ... different customer's order
P.S. If you follow all that: instead of 1 and 2, a better pair of values might be null for the customer and the order timestamp for the order. They will sort the same way as before, except now you also get a chronological list of each customer's orders.
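A sketch of that variant, assuming orders carry an ISO 8601 timestamp field (here called placed_at, a hypothetical name; substitute whatever your order docs actually have). In CouchDB's view collation, null sorts before any string, so the customer row still comes first:

```javascript
// Map function variant: null key part for customers, order timestamp for
// orders. The customer row sorts first for each customer, followed by that
// customer's orders in chronological order. Named `map` here so the sketch
// is self-contained; in a design doc it would be an anonymous function.
function map(doc) {
  if (doc.type === "customer") {
    emit([doc._id, null], doc);
  }
  if (doc.type === "order") {
    // placed_at is assumed to be an ISO 8601 string like
    // "2024-01-02T00:00:00Z" -- such strings sort chronologically
    // under plain string collation, which is what makes this work.
    emit([doc.for_customer, doc.placed_at], doc);
  }
}
```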
I'm trying to subtract a value in one field from a value in another table, matched by ID number, using FileMaker Pro 19.x. I thought I'd done this before without any problems, but now it isn't working and I don't know why.
First, I want to do this in a calculation field, not in a script. I don't understand scripting at all and use it only when there is no alternative. In this case, I think a calculation field should work.
I have 2 tables; I'll call them "Data" and "Categories".
Both tables have the field "CID" ("Category ID").
The CID fields in both tables are linked in the Relationship Editor.
The Data table has a field "Product ID".
The Categories table has several fields related to products. Two of those are "MIN PID" and "MAX PID": the minimum and maximum product ID numbers.
Product IDs are assigned in ranges: if an ID falls within a certain range, it belongs to a certain category. Each category has a CID number.
To assign the CID number to the products listed in the Data table, I ran a script that essentially recreated all the data within the Categories table. It was inefficient (in my eyes) because the data was sitting right there in the table; I couldn't figure out how to reference it, so I gave up and ran the script instead. The other problem is that if the CID ever changes for a product, I have to rerun the script (or someone else does, who might not know about the script).
That said, I now have the correct CID assigned for all 62 product categories. What I want to do now is use the CID MIN and CID MAX values (among others) in calculation fields in the Data table.
For instance, if the product ID is "45,001", it belongs to Category "16", which has a MIN value of "30,000" and a MAX value of "50,000". I want to subtract the "30,000" from the "45,001" (PID - CID MIN) to return the result "15,001". This allows me to locate a product within the category. Instead of being "Product 45,001", it will be "Product 16.15001".
The calculation I tried is:
If (CID = CID::CID ; PID - CID::CID MIN)
The result is a huge group of records where this field is empty.
I also tried:
PID - CID::CID MIN
Again, no results.
Both tables have the field "CID"
The two CID fields are linked in the relationship editor.
I have tried this with a test file and it works perfectly. It does not work in the database I am working in.
What am I doing wrong?
I have an index for songs, and each song has a collection of customers.
Filtering songs that contain a specific customer works great, but the result includes all the customers for each song.
I need to get the song with only the customer I'm filtering on in the collection.
My index is something like:
{
  SongId: 1,
  Title: "My song",
  Artist: "Artist",
  Customers: [
    {
      CustomerId: 1,
      ...more customer data
    },
    {
      CustomerId: 2,
      ...more customer data
    }
  ]
}
I need to get the song, filtering by title, and only get customer 1's data (or no customer at all if customer 1 is not in the collection for that song).
Is that possible?
This is not possible today. You'd have to filter out the complex collection elements on the client-side.
Also, you didn't mention this in your question, but it seems the cardinality of the relationship from songs to customers could be quite high. If that's the case, you should consider a different data model, because there is a hard limit on the number of elements you can have in complex collections per document. Even without that limit, there are practical limits to complex collections, for example around document size and the inability to incrementally update them during indexing.
If your model will have at most a few hundred customers per song, you're probably fine, but if it's in the thousands or higher, you should reconsider your design.
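Until the service supports it, the client-side filtering can be a one-liner. A minimal sketch in plain JavaScript (the function name is an assumption; the field names mirror the index shape in the question):

```javascript
// Client-side workaround: after the search service returns matching songs,
// keep only the requested customer in each song's Customers collection.
// Songs where that customer is absent end up with an empty Customers array.
// Returns new objects; the input array is not mutated.
function projectCustomer(songs, customerId) {
  return songs.map(song => ({
    ...song,
    Customers: (song.Customers || []).filter(c => c.CustomerId === customerId)
  }));
}
```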
I need advice on the following topic:
I am developing a DW/BI solution in SQL Server and reports are published in Power BI.
The main part of my question starts here: I have a large table that collects measurement data for multiple product attributes. Products can be of multiple types, recognisable by the item number in this table; measurements can be done multiple times and are identified by measurement date. Usually we refer to the latest dates; if that complicates things, I can filter the data to the latest dates only. This is a dense table (multi-million rows), and the attribute count is about 200.
I want to include specifications for these attributes, most likely in a dimension table, and there may be tens of such specifications. The intention is that the user selects any one specification name in the report and sees each product with its attributes passing/failing, as well as which products pass all of the specification's attributes.
I currently have this measurement table and a dim table with test names; I can add a table for specifications if needed. A specification can define a few or all test names, with lower/upper spec limits:
Sample measurement table:
Sample dim table for test names:
I can add a table for specification as below and user will select any of one:
e.g. if the user selects ID_spec = 1, then the measurement table may look like:
Some specs may contain all attributes and some only a few.
Please suggest a strategy to design a spec table that is efficient for such large tables. Please let me know if any further details are needed.
Later, I will also have to calculate the % of passing products, counting only products that have been tested for all tests required by the selected specification.
For large tables, the best thing to do is choose the right key. That means dumping the "Id" column (nothing more than a row identifier) and replacing it with something that:
Guarantees uniqueness
Facilitates searches
That often means composite keys, which are fine.
It also means dumping the whole "fact/dimension" mindset and just focusing on the relations. This is also fine.
Based on your description, this is the first draft of a data model for your warehouse. If you are unfamiliar with IDEF1X diagrams, please read this.
I've added a unique constraint to SpecCd so you could specify the value directly instead of having to check both the ProductId and SpecCd to return a result.
ProductTest exists so you can provide integrity for ProductTestCriteria and ensure tests are limited to only those products that can be measured by them. If all products are subject to all tests, this can be removed and Test can relate directly to ProductMeasurement and ProductTestCriteria.
If you want to subject the latest test of "Product A" to "Spec S" your query would look like:
SELECT
    Measurement.ProductId
    ,Measurement.TestCd
    ,Measurement.TestDt
    ,Criteria.SpecCd
    ,Measurement.Value
    ,CASE
         WHEN Measurement.Value BETWEEN Criteria.LowerValue AND Criteria.UpperValue THEN 'Pass'
         ELSE 'Fail'
     END AS Result
FROM
    ProductMeasurement Measurement
INNER JOIN
    ProductTestCriteria Criteria
        ON  Criteria.ProductId = Measurement.ProductId
        AND Criteria.TestCd = Measurement.TestCd
WHERE
    Measurement.ProductId = 'A'
    AND Criteria.SpecCd = 'S'
    AND Measurement.TestDt =
    (
        SELECT
            MAX(TestDt)
        FROM
            ProductMeasurement
        WHERE
            ProductId = Measurement.ProductId
    )
You could remove the filters for ProductId and SpecCd and roll this into a view; users could then specify whichever products and specifications they want.
If you want results as of a given date, the query is easily modified (or incorporated into a TVF):
SELECT
    Measurement.ProductId
    ,Measurement.TestCd
    ,Measurement.TestDt
    ,Criteria.SpecCd
    ,Measurement.Value
    ,CASE
         WHEN Measurement.Value BETWEEN Criteria.LowerValue AND Criteria.UpperValue THEN 'Pass'
         ELSE 'Fail'
     END AS Result
FROM
    ProductMeasurement Measurement
INNER JOIN
    ProductTestCriteria Criteria
        ON  Criteria.ProductId = Measurement.ProductId
        AND Criteria.TestCd = Measurement.TestCd
WHERE
    Measurement.ProductId = 'A'
    AND Criteria.SpecCd = 'S'
    AND Measurement.TestDt =
    (
        SELECT
            MAX(TestDt)
        FROM
            ProductMeasurement
        WHERE
            ProductId = Measurement.ProductId
            AND TestDt <= <Your Date>
    )
Need help figuring out a good way to store data effectively and efficiently.
I'm using Parse (JavaScript SDK); here's an example of what I'm trying to store.
Predictions of football (soccer) matches, so an example of one match would be:
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 predicts the score will be Team A 2-0 Team B -> so 2-0
User456 predicts the score will be Team A 1-3 Team B -> so 1-3
Each event has information attached to it, like an eventId, several categories, a start time, an end time, a result, and more.
I need to record a score prediction per user for each event (usually 10 events at a time, so a lot of predictions will be coming in).
I need to store these so I can cross-reference the correct result against each user's prediction and award points based on the prediction, the teams in the match, and the categories of the event. Instead of adding to a single total, I need all the awarded points stored separately per category and per user, so I can then filter based on predictions between set dates and in certain categories, e.g.:
Team A v Team B
EventID = "abc"
Categories = ["League-1","Sunday-League"]
User123 prediction = 2-0
Actual result = 2-0
So now I need to award X points to User123 for Team A, Team B, "League-1", and "Sunday-League", and record it against the event date too.
I would suggest you create a table for games and a table for users, and then an associative table between them to handle the many-to-many relationship. This is a pretty standard many-to-many setup.
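As a sketch of how rows for that associative table might be produced (plain JavaScript, not Parse API calls; the field names and the exact-score rule are assumptions): scoring one prediction emits one row per team and per category, which is what lets you filter points by category and date later.

```javascript
// Score one prediction against the final result. An exact score match
// yields one award row per team and per event category; each row is what
// you would save into the associative table/class.
function scorePrediction(event, prediction, points) {
  const exact = prediction.home === event.result.home &&
                prediction.away === event.result.away;
  if (!exact) return [];
  // One row per team and per category, all tied to the event date.
  const tags = [event.homeTeam, event.awayTeam, ...event.categories];
  return tags.map(tag => ({
    userId: prediction.userId,
    eventId: event.eventId,
    category: tag,
    points: points,
    eventDate: event.date
  }));
}
```

For the example above, User123's exact 2-0 prediction produces four rows (Team A, Team B, "League-1", "Sunday-League"), each carrying the event date for later date-range queries.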
This is my Excel spreadsheet, and I am trying to convert it to a proper database. Dry and Sensors are products/goods. I'd like to build a database that is flexible enough to add more products. Also, attributes like 'type' and 'kV' can hold multiple values.
I am not experienced with designing databases. My current design is:
table_company
company_id
company_name
etc...
table_dry_ctd_type
company_id
record_id
dry_ctd_type
table_dry_vtd_kV
company_id
record_id
dry_vtd_kV
I ended up with 13 tables.
business rules (if needed):
-a company can have 0 to many products
-products can have 0 to many sub-categories
-product type and kV can hold multiple values
-a type is not required to have a kV
Based on the information you gave, I'd create (sort of) the following 3 tables:
company data
product types (fields: [uniqueID]; ptID; pName; pSubName; subPType) so data is:
1, 1, "DRY", NULL, NULL (pSubName NULL == Product Type "definition" row)
2, 1, "DRY", "CTD", ??? -- whatever the 'type' field's value is
3, 1, "DRY", "VTD", ???
...
6, 2, "Sensors", "VCS", ???
If you need to get data several times based on product sub-type, also add a numeric ptSubID field and query based on that.
a data table storing companyID, productTypeUniqueID, KV, and e.g. an amount or other "value data" you may or may not have
If you added productTypeUniqueID to the product type table, add it here too!
Adding a new company, product type, or product sub-type is obvious. Getting data would look something like:
SELECT
    CMP.compName AS 'Company Name',
    PRD.pName AS 'Product',
    PRD.pSubName AS 'subProduct',
    PRD.subPType AS 'Type',
    DTA.KV AS 'KV',
    DTA.amount
FROM data_table AS DTA
LEFT OUTER JOIN company_table AS CMP ON CMP.compID = DTA.compID
LEFT OUTER JOIN product_table AS PRD ON (PRD.ptID = DTA.ptID [AND PRD.ptSubID = DTA.ptSubID])
WHERE PRD.ptID = 1 -- all "DRY" products
/** or: **/ WHERE (PRD.ptID = 1 AND PRD.ptSubID <> 2) AND CMP.compID = 8
But it may vary based on:
the most common data retrieval patterns that will be used
details you didn't share but that may be important search keywords (e.g. sub-products' dimensions)
the frequency of product updates/changes and product type/sub-type updates/changes
typical changes that may come, based on business experience, such as "we don't use product sub-types anymore"
Hope this helps!