I am building a SQLite database and am not sure how to proceed with this scenario.
I'll use a real-world example to explain what I need:
I have a list of products that are sold by many stores in various states. Not every Store sells a particular Product at all, and those that do may only sell it in one State or another. Most stores sell a product in most states, but not all.
For example, let's say I am trying to buy a vacuum cleaner in Hawaii. Joe's Hardware sells vacuums in 18 states, but not in Hawaii. Walmart sells vacuums in Hawaii, but not microwaves. Burger King does not sell vacuums at all, but will give me a Whopper anywhere in the US.
So if I am in Hawaii and search for a vacuum, I should only get Walmart as a result. While other stores may sell vacuums, and may sell in Hawaii, they don't do both but Walmart does.
How do I efficiently create this type of relationship in a relational database? (Specifically, I am currently using SQLite, but need to be able to convert to MySQL in the future.)
Obviously, I would need tables for Product, Store, and State, but I am at a loss on how to create and query the appropriate join tables...
If I, for example, query a certain Product, how would I determine which Store would sell it in a particular State, keeping in mind that Walmart may not sell vacuums in Hawaii, but they do sell tea there?
I understand the basics of 1:1, 1:n, and M:n relationships in relational databases, but I am not sure how to handle this complexity where there is a many-to-many-to-many situation.
If you could show some SQL statements (or DDL) that demonstrates this, I would be very grateful. Thank you!
An accepted and common way is to use a table that has a column referencing the product and another referencing the store. There are many names for such a table: reference table, associative table, or mapping table, to name a few.
You want these to be efficient, so try to reference by a number, which of course has to uniquely identify what it is referencing. With SQLite, by default a table has a special, normally hidden, column that is such a unique number: the rowid. It is typically the most efficient way of accessing rows, as SQLite has been designed with this common usage in mind.
SQLite allows you to create a column per table that is an alias of the rowid; you simply declare the column as INTEGER PRIMARY KEY, and typically you'd name it id.
So, utilising these, the reference table would have a column for the product's id and another for the store's id, catering for every combination of product/store.
As an example, three tables are created (stores, products, and a reference/mapping table), the first two being populated using :-
CREATE TABLE IF NOT EXISTS _products(id INTEGER PRIMARY KEY, productname TEXT, productcost REAL);
CREATE TABLE IF NOT EXISTS _stores (id INTEGER PRIMARY KEY, storename TEXT);
CREATE TABLE IF NOT EXISTS _product_store_relationships (productreference INTEGER, storereference INTEGER);
INSERT INTO _products (productname,productcost) VALUES
('thingummy',25.30),
('Sky Hook',56.90),
('Tartan Paint',100.34),
('Spirit Level Bubbles - Large', 10.43),
('Spirit Level bubbles - Small',7.77)
;
INSERT INTO _stores (storename) VALUES
('Acme'),
('Shops-R-Them'),
('Harrods'),
('X-Mart')
;
The resultant _stores and _products tables contain the rows inserted above; _product_store_relationships would be empty.
Placing products into stores (for example) could be done using :-
-- Build some relationships/references/mappings
INSERT INTO _product_store_relationships VALUES
(2,2), -- Sky Hooks are in Shops-R-Them
(2,4), -- Sky Hooks in X-Mart
(1,3), -- thingummys in Harrods
(1,1), -- and Acme
(1,2), -- and Shops-R-Them
(4,4), -- Spirit Level Bubbles - Large in X-Mart
(5,4), -- Spirit Level Bubbles - Small in X-Mart
(3,3) -- Tartan Paint in Harrods
;
The _product_store_relationships table would then contain the eight mappings listed above.
A query such as the following would list the products in stores sorted by store and then product :-
SELECT storename, productname, productcost FROM _stores
JOIN _product_store_relationships ON _stores.id = storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
ORDER BY storename, productname
;
The resultant output lists each store alongside the name and cost of every product it stocks.
The following query will only list rows where the product name contains an s or S (LIKE is case-insensitive for ASCII characters by default in SQLite), the output being sorted according to productcost in ASCending order, then storename, then productname :-
SELECT storename, productname, productcost FROM _stores
JOIN _product_store_relationships ON _stores.id = storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
WHERE productname LIKE '%s%'
ORDER BY productcost,storename, productname
;
Expanding the above to consider states.
Two new tables are added: _states and _store_state_references.
There's no real need for a reference table here (a store would only be in one state), but if you consider a chain of stores to be a single store, this approach copes with that as well.
The SQL could be :-
CREATE TABLE IF NOT EXISTS _states (id INTEGER PRIMARY KEY, statename TEXT);
INSERT INTO _states (statename) VALUES
('Texas'),
('Ohio'),
('Alabama'),
('Queensland'),
('New South Wales')
;
CREATE TABLE IF NOT EXISTS _store_state_references (storereference INTEGER, statereference INTEGER);
INSERT INTO _store_state_references VALUES
(1,1),
(2,5),
(3,1),
(4,3)
;
If the following query were run :-
SELECT storename,productname,productcost,statename
FROM _stores
JOIN _store_state_references ON _stores.id = _store_state_references.storereference
JOIN _states ON _store_state_references.statereference =_states.id
JOIN _product_store_relationships ON _stores.id = _product_store_relationships.storereference
JOIN _products ON _product_store_relationships.productreference = _products.id
WHERE statename = 'Texas' AND productname = 'Sky Hook'
;
With the data above, the output would be empty, as no store that sells Sky Hooks yet has a presence in Texas.
Without the WHERE clause, the query lists each store's products along with the store's state.
The following would make Shops-R-Them have a presence in all states :-
INSERT INTO _store_state_references VALUES
(2,1),(2,2),(2,3),(2,4)
;
Now the Sky Hooks in Texas query returns a row for Shops-R-Them.
Note: this just covers the basics of the topic.
You will need to create a combined mapping table of products, states, and stores, e.g. tbl_product_states_stores, which stores the mapping between a product, a state, and a store. The columns would be id, product_id, state_id, and store_id.
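A hedged sketch of that idea (the products/states/stores tables and their column names here are assumptions, as is the sample query):
CREATE TABLE tbl_product_states_stores (
    id         INTEGER PRIMARY KEY,
    product_id INTEGER,
    state_id   INTEGER,
    store_id   INTEGER
);

-- Which stores sell a given product in a given state?
SELECT st.name
FROM   tbl_product_states_stores pss
JOIN   stores   st ON st.id = pss.store_id
JOIN   products p  ON p.id  = pss.product_id
JOIN   states   s  ON s.id  = pss.state_id
WHERE  p.name = 'vacuum cleaner'
AND    s.name = 'Hawaii';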
For each user in my webapp, there are n related Widgets. Each widget is represented in the database in a Widgets table. Users can sort their widgets, they'll never have more than a couple dozen widgets, and they will frequently sort widgets.
I haven't dealt with database items that have an inherent order to them very frequently. What's a good strategy for ordering them? At first, I thought a simple "sortIndex" column would work just fine, but then I started wondering how to initialize this value. It presumably has to be a unique value, and it should be greater or less than every other sort index. I don't want to have to check all of the other sort indexes for that user every time I create a new widget, though. That seems unnecessary.
Perhaps I could have a default "bottom-priority" sort index? But then how do I differentiate between those? I suppose I could use a creation date flag, but then what if a user wants to insert a widget in the middle of all of those bottom-priority widgets?
What's the standard way to handle this sort of thing?
If you have users sorting widgets for their own personal tastes, you want to create a lookup table, like so:
create table widgets_sorting
(
SortID int primary key,
UserID int,
WidgetID int,
SortIndex int
)
Then, to sort a user's widgets:
select
w.*
from
widgets w
inner join widgets_sorting s on
w.WidgetID = s.WidgetID
inner join users u on
s.UserID = u.UserID
order by
s.SortIndex asc
This way, all you'll have to do for new users is add new rows to the widgets_sorting table. Make sure you put a foreign key constraint and an index on both the WidgetID and the UserID columns.
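For reference, those constraints and indexes might look something like this (a sketch, assuming users(UserID) and widgets(WidgetID) tables exist):
ALTER TABLE widgets_sorting ADD FOREIGN KEY (UserID)   REFERENCES users (UserID);
ALTER TABLE widgets_sorting ADD FOREIGN KEY (WidgetID) REFERENCES widgets (WidgetID);
CREATE INDEX ix_widgets_sorting_user   ON widgets_sorting (UserID);
CREATE INDEX ix_widgets_sorting_widget ON widgets_sorting (WidgetID);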
These lookup tables are really the best way to solve the many-to-many relationships that are common with this sort of personalized listing. Hopefully this points you in the right direction!
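One more note on the initialisation question: a simple way to give a brand-new widget bottom priority is to insert it with MAX(SortIndex) + 1 for that user. A sketch (it assumes SortID is auto-generated and :user_id/:widget_id are bind parameters):
INSERT INTO widgets_sorting (UserID, WidgetID, SortIndex)
SELECT :user_id, :widget_id, COALESCE(MAX(SortIndex), 0) + 1
FROM   widgets_sorting
WHERE  UserID = :user_id;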
The best way for user-editable sorting is to keep the id's in a linked list:
user_id  widget_id  prev_widget_id
-------  ---------  --------------
      1          1               0
      1          2               8
      1          3               7
      1          7               1
      1          8               3
      2          3               0
      2          2               3
This will make 5 widgets for user 1 in this order: 1, 7, 3, 8, 2; and 2 widgets for user 2 in this order: 3, 2
You should make UNIQUE indexes on (user_id, widget_id) and (user_id, prev_widget_id).
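A sketch of the table and unique indexes this implies (the table name widget_orders matches the queries below; the column types are assumptions):
CREATE TABLE widget_orders (
    user_id        INT NOT NULL,
    widget_id      INT NOT NULL,
    prev_widget_id INT NOT NULL   -- 0 marks the head of the list
);

CREATE UNIQUE INDEX ux_widget_orders_widget ON widget_orders (user_id, widget_id);
CREATE UNIQUE INDEX ux_widget_orders_prev   ON widget_orders (user_id, prev_widget_id);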
To get widgets in intended order, you can query like this, say, in Oracle:
SELECT w.*
FROM (
SELECT widget_id, level AS widget_order
FROM widget_orders
START WITH
user_id = :myuser
AND prev_widget_id = 0
CONNECT BY
user_id = PRIOR user_id
AND prev_widget_id = PRIOR widget_id
) o
JOIN widgets w
ON w.widget_id = o.widget_id
ORDER BY
widget_order
To update the order, you will need to update at most 3 rows (even if you move the whole block of widgets).
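For example, moving widget :x so that it sits immediately after widget :y for user :u could be sketched as the three updates below (placeholders and ordering are illustrative; with the UNIQUE indexes enforced immediately you may need deferrable constraints or a temporary placeholder value, since the intermediate states can collide, and in MySQL the self-referencing subquery in step 1 needs to be wrapped in a derived table):
-- 1. Whatever followed :x now follows :x's old predecessor
UPDATE widget_orders
SET    prev_widget_id = (SELECT prev_widget_id FROM widget_orders
                         WHERE  user_id = :u AND widget_id = :x)
WHERE  user_id = :u
AND    prev_widget_id = :x;

-- 2. Whatever currently follows :y now follows :x instead
UPDATE widget_orders
SET    prev_widget_id = :x
WHERE  user_id = :u
AND    prev_widget_id = :y
AND    widget_id <> :x;

-- 3. Point :x at :y
UPDATE widget_orders
SET    prev_widget_id = :y
WHERE  user_id = :u
AND    widget_id = :x;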
SQL Server and PostgreSQL 8.4 implement this functionality using recursive CTEs:
WITH
-- RECURSIVE
-- uncomment the previous line in PostgreSQL
q AS
(
SELECT widget_id, prev_widget_id, 1 AS widget_order
FROM widget_orders
WHERE user_id = @user_id AND prev_widget_id = 0
UNION ALL
SELECT wo.widget_id, wo.prev_widget_id, q.widget_order + 1
FROM q
JOIN widget_orders wo
ON wo.user_id = @user_id
AND wo.prev_widget_id = q.widget_id
)
SELECT w.*
FROM q
JOIN widgets w
ON w.widget_id = q.widget_id
ORDER BY
widget_order
See this article in my blog on how to implement this functionality in MySQL:
Sorting lists
I like to use a two-table approach - which can be a bit confusing but if you're using an ORM such as ActiveRecord it's easy, and if you write a bit of clever code it can be manageable.
Use one table to link user to sorting, and one table to link widget, position, and sorting. This way it's a lot clearer what's going on, and you can use an SQL join or a separate query to pull the various data from the various tables. Your structure should look like this:
-- Standard user + widgets tables; make sure they both have unique IDs
CREATE TABLE users   (id INT PRIMARY KEY /* ...other user columns... */);
CREATE TABLE widgets (id INT PRIMARY KEY /* ...other widget columns... */);

-- The sorting tables
CREATE TABLE sortings (
  id INT PRIMARY KEY,  -- autoincrement etc.
  user_id INT
);

CREATE TABLE sorting_positions (
  sorting_id INT,
  widget_id INT,
  position INT
);
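To pull a user's widgets in order with this structure, a query along these lines should do it (a sketch; :user_id is a placeholder and the widgets id column is assumed):
SELECT w.*
FROM   sortings s
JOIN   sorting_positions sp ON sp.sorting_id = s.id
JOIN   widgets w            ON w.id = sp.widget_id
WHERE  s.user_id = :user_id
ORDER BY sp.position;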
Hopefully this makes sense, if you're still confused, comment on this message and I'll write you up some basic code.
Jamie
If you mean that each user assigns his own sort order to the widgets, then Eric's answer is correct. Presumably you then have to give the user a way to assign the sort value. But if the number is modest as you say, then you can just give him a screen listing all the widgets, and either let him type in the order number, or display them in order and put up and down buttons beside each, or, if you want to be fancy, give him a way to drag and drop.
If the order is the same for all users, the question becomes, Where does this order come from? If it's arbitrary, just assign a sequence number as new widgets are created.
I have what I'd thought would be a simple query, but it takes 'forever'. I'm not great with SQL optimizations, so I thought I could ask you guys.
Here's the query, with EXPLAIN:
EXPLAIN SELECT *
FROM `firms_firmphonenumber`
INNER JOIN `firms_location` ON (
`firms_firmphonenumber`.`location_id` = `firms_location`.`id`
)
ORDER BY
`firms_location`.`name_en` ASC,
`firms_firmphonenumber`.`location_id` ASC LIMIT 100;
Result:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, 'SIMPLE', 'firms_location', 'ALL', 'PRIMARY', '', '', '', 73030, 'Using temporary; Using filesort'
1, 'SIMPLE', 'firms_firmphonenumber', 'ref', 'firms_firmphonenumber_firm_id', 'firms_firmphonenumber_firm_id', '4', 'citiadmin.firms_location.id', 1, ''
Keys on firms_location:
Keyname                  Type   Unique  Packed  Field    Cardinality
PRIMARY                  BTREE  Yes     No      id       65818
firms_location_name_en   BTREE  No      No      name_en  65818
Keys on firms_firmphonenumber:
Keyname                         Type   Unique  Packed  Field        Cardinality
PRIMARY                         BTREE  Yes     No      id           85088
firms_firmphonenumber_firm_id   BTREE  No      No      location_id  85088
It seems (to me) that MySQL refuses to use the firms_location table's primary key - but I have no idea why.
Any help would be much appreciated.
Edit after solution posted
With the altered order by:
EXPLAIN SELECT *
FROM `firms_firmphonenumber`
INNER JOIN `firms_location` ON (
`firms_firmphonenumber`.`location_id` = `firms_location`.`id`
)
ORDER BY
`firms_location`.`name_en` ASC,
`firms_location`.id ASC LIMIT 100;
#`firms_firmphonenumber`.`location_id` ASC LIMIT 100;
Result:
"id","select_type","table","type","possible_keys","key","key_len","ref","rows","Extra"
1,"SIMPLE","firms_location","index","PRIMARY","firms_location_name_en","767","",100,""
1,"SIMPLE","firms_firmphonenumber","ref","firms_firmphonenumber_firm_id","firms_firmphonenumber_firm_id","4","citiadmin.firms_location.id",1,""
Why did it decide to use these now? MySQL makes some odd choices... Any insight would help again :)
Edit with detail from django
Originally, I had these (abbreviated) models:
class Location(models.Model):
    id = models.AutoField(primary_key=True)
    name_en = models.CharField(max_length=255, db_index=True)

    class Meta:
        ordering = ("name_en", "id")

class FirmPhoneNumber(models.Model):
    location = models.ForeignKey(Location, db_index=True)
    number = PhoneNumberField(db_index=True)

    class Meta:
        ordering = ("location", "number")
Changing the Location class's Meta.ordering field to ("name_en",) fixed the query so that it no longer had the spurious order by.
These things tend to be a matter of trial and error, but try ordering on firms_location.id rather than firms_firmphonenumber.location_id. They are the same value, but MySQL may then pick up on the index.
It is using it, for the join; that's the 'citiadmin.firms_location.id' value in the ref column. It isn't appearing in possible_keys and key because you have no WHERE clause and it's only reflecting keys it has available for the ORDER BY clause.
If you want to speed up your query, try indexing name_en.
Because there's no WHERE, and because the cardinality of the join field is higher than that of the joining field, it's calculating that it might as well get everything. Using the index on the join won't speed that up, so it's resorting to the lesser optimization of using an index for sorting.
First, you can use USE INDEX to force it to use the index you specify. Also, try running OPTIMIZE TABLE to make sure the cardinality is correctly estimated. (I'm guessing you're using InnoDB, which estimates it with a series of random "dives"; if this is MyISAM, which actually knows, then I wonder why the cardinality looks as it does.)
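For illustration, the index hint goes right after the table name in the join, and OPTIMIZE TABLE refreshes the statistics; a sketch (whether the hint actually helps here is another matter):
SELECT *
FROM `firms_firmphonenumber`
INNER JOIN `firms_location` USE INDEX (`firms_location_name_en`) ON (
    `firms_firmphonenumber`.`location_id` = `firms_location`.`id`
)
ORDER BY `firms_location`.`name_en` ASC, `firms_location`.`id` ASC
LIMIT 100;

OPTIMIZE TABLE `firms_location`, `firms_firmphonenumber`;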
Don't bother indexing the name, etc. MySQL will only ever use one index per table per join, and the extra index will just bulk things up.
How much data? If there are only a few rows, most databases will just do a table scan no matter what indexes you have.
I have a requirement to produce a list of possible duplicates before a user saves an entity to the database and warn them of the possible duplicates.
There are 7 criteria on which we should check the for duplicates and if at least 3 match we should flag this up to the user.
The criteria will all match on ID, so there is no fuzzy string matching needed, but my problem comes from the fact that there are many possible ways (99 ways, if I've done my sums correctly) for at least 3 items to match from the list of 7 possibles.
I don't want to have to do 99 separate db queries to find my search results and nor do I want to bring the whole lot back from the db and filter on the client side. We're probably only talking of a few tens of thousands of records at present, but this will grow into the millions as the system matures.
Anyone got any thoughts on a nice efficient way to do this?
I was considering a simple OR query to get the records where at least one field matches from the db and then doing some processing on the client to filter it some more, but a few of the fields have very low cardinality and won't actually reduce the numbers by a huge amount.
Thanks
Jon
OR and CASE summing will work but are quite inefficient, since they don't use indexes.
You need to make UNION for indexes to be usable.
If a user enters name, phone, email and address into the database, and you want to check all records that match at least 3 of these fields, you issue:
SELECT i.*
FROM (
SELECT id, COUNT(*)
FROM (
SELECT id
FROM t_info t
WHERE name = 'Eve Chianese'
UNION ALL
SELECT id
FROM t_info t
WHERE phone = '+15558000042'
UNION ALL
SELECT id
FROM t_info t
WHERE email = '42@example.com'
UNION ALL
SELECT id
FROM t_info t
WHERE address = '42 North Lane'
) q
GROUP BY
id
HAVING COUNT(*) >= 3
) dq
JOIN t_info i
ON i.id = dq.id
This will use indexes on these fields and the query will be fast.
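The single-column indexes this relies on aren't shown above; they would be something like (the index names are assumptions):
CREATE INDEX ix_info_name    ON t_info (name);
CREATE INDEX ix_info_phone   ON t_info (phone);
CREATE INDEX ix_info_email   ON t_info (email);
CREATE INDEX ix_info_address ON t_info (address);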
See this article in my blog for details:
Matching 3 of 4: how to match a record which matches at least 3 of 4 possible conditions
Also see this question the article is based upon.
If you want to have a list of DISTINCT values in the existing data, you just wrap this query into a subquery:
SELECT i1.*
FROM t_info i1
WHERE EXISTS
(
SELECT 1
FROM (
SELECT id
FROM t_info t
WHERE name = i1.name
UNION ALL
SELECT id
FROM t_info t
WHERE phone = i1.phone
UNION ALL
SELECT id
FROM t_info t
WHERE email = i1.email
UNION ALL
SELECT id
FROM t_info t
WHERE address = i1.address
) q
GROUP BY
id
HAVING COUNT(*) >= 3
)
Note that this DISTINCT is not transitive: if A matches B and B matches C, this does not mean that A matches C.
You might want something like the following:
SELECT id
FROM
(select id, CASE fld1 WHEN input1 THEN 1 ELSE 0 END "rule1",
            CASE fld2 WHEN input2 THEN 1 ELSE 0 END "rule2",
            ...,
            CASE fld7 WHEN input7 THEN 1 ELSE 0 END "rule7"
 FROM table) t
WHERE rule1+rule2+rule3+...+rule7 >= 3
This isn't tested, but it shows a way to tackle this.
Which DBMS are you using? Some support enforcing such constraints with server-side code.
Have you considered using a stored procedure with a cursor? You could then do your OR query and then step through the records one-by-one looking for matches. Using a stored procedure would allow you to do all the checking on the server.
However, I think a table scan with millions of records is always going to be slow. I think you should work out which of the 7 fields are most likely to match and make sure these are indexed.
I'm assuming your system is trying to match tag ids of a certain post, or something similar. This is a many-to-many relationship and you should have three tables to handle it: one for the post, one for tags, and one for the post-and-tags relationship.
If my assumptions are correct then the best way to handle this is:
SELECT postid, count(tagid) as common_tag_count
FROM posts_to_tags
WHERE tagid IN (tag1, tag2, tag3, ...)
GROUP BY postid
HAVING count(tagid) >= 3;
I have a postgres database with a user table (userid, firstname, lastname) and a usermetadata table (userid, code, content, created datetime). I store various information about each user in the usermetadata table by code and keep a full history. So, for example, a user (userid 15) has the following metadata:
15, 'QHS', '20', '2008-08-24 13:36:33.465567-04'
15, 'QHE', '8', '2008-08-24 12:07:08.660519-04'
15, 'QHS', '21', '2008-08-24 09:44:44.39354-04'
15, 'QHE', '10', '2008-08-24 08:47:57.672058-04'
I need to fetch a list of all my users and the most recent value of each of various usermetadata codes. I did this programmatically and it was, of course godawful slow. The best I could figure out to do it in SQL was to join sub-selects, which were also slow and I had to do one for each code.
This is actually not that hard to do in PostgreSQL because it has the "DISTINCT ON" clause in its SELECT syntax (DISTINCT ON isn't standard SQL).
SELECT DISTINCT ON (code) code, content, createtime
FROM metatable
WHERE userid = 15
ORDER BY code, createtime DESC;
That will limit the returned results to the first result per unique code, and if you sort the results by the create time descending, you'll get the newest of each.
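To get this for every user at once (which is what the question asks for), the same idea extends by adding the user to the DISTINCT ON list; a sketch, with the users table name and the created column name assumed from the question:
SELECT DISTINCT ON (m.userid, m.code)
       u.userid, u.firstname, u.lastname, m.code, m.content, m.created
FROM   users u
JOIN   usermetadata m ON m.userid = u.userid
ORDER BY m.userid, m.code, m.created DESC;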
I suppose you're not willing to modify your schema, so I'm afraid my answer might not be of much help, but here goes...
One possible solution would be to have the time field empty until it was replaced by a newer value, when you insert the 'deprecation date' instead. Another way is to expand the table with an 'active' column, but that would introduce some redundancy.
The classic solution would be to have both 'Valid-From' and 'Valid-To' fields where the 'Valid-To' fields are blank until some other entry becomes valid. This can be handled easily by using triggers or similar. Using constraints to make sure there is only one item of each type that is valid will ensure data integrity.
Common to these approaches is that there is a single way of determining the set of current values: you simply select all entries for the user in question that have a NULL 'Valid-To' (or 'deprecation date'), or a true 'active' flag.
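For instance, with the Valid-From/Valid-To variant, fetching a user's current values might look like this (a sketch; the valid_to column is an assumption):
SELECT code, content
FROM   usermetadata
WHERE  userid = 15
AND    valid_to IS NULL;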
You might be interested in taking a look at the Wikipedia entry on temporal databases and the article A consensus glossary of temporal database concepts.
A subselect is the standard way of doing this sort of thing. You just need a Unique Constraint on UserId, Code, and Date - and then you can run the following:
SELECT *
FROM Table
JOIN (
SELECT UserId, Code, MAX(Date) as LastDate
FROM Table
GROUP BY UserId, Code
) as Latest ON
Table.UserId = Latest.UserId
AND Table.Code = Latest.Code
AND Table.Date = Latest.Date
WHERE
Table.UserId = @userId