How to calculate Jaccard similarity coefficient with sqlite - database

I have a database made with sqlite3 where each user has 3 possible hobbies, which are saved as a boolean value (1 if the user likes it, 0 if he doesn't).
I want to get a list of the pairs that are similar ordered by their Jaccard similarity coefficient, which means I have to count the number of hobbies that are true for both of them and divide it by the number of hobbies that either of them chose.
I have created this VIEW
All of the pairs must contain wonka in the view. Carros, tecnologia and comida are hobbies.

Instead of storing all hobbies in a single row per user, joining them (as your view appears to do), and then trying to add them up, it's a lot easier to calculate this with a database design that expresses the relationship between users and hobbies in a separate table. (Think about what you would need to do to add a fourth hobby.) Look up terms like many-to-many relationship and junction table for more, and/or find a good resource on database design.
With a design like that, given these tables:
CREATE TABLE users(userID INTEGER PRIMARY KEY, userName TEXT UNIQUE);
CREATE TABLE hobbies(hobbyID INTEGER PRIMARY KEY, hobbyName TEXT UNIQUE);
CREATE TABLE interests(userID INTEGER REFERENCES users(userID) ON DELETE CASCADE
, hobbyID INTEGER REFERENCES hobbies(hobbyID) ON DELETE CASCADE
, liked INTEGER
, PRIMARY KEY(userID, hobbyID)) WITHOUT ROWID;
you can calculate the similarity coefficient for all pairs with something like:
SELECT u1.userName AS "Person 1", u2.userName AS "Person 2"
, ifnull(total(i1.liked AND i2.liked) / total(i1.liked OR i2.liked), 0.0) AS Similarity
FROM users AS u1
JOIN users AS u2 ON u1.userID <> u2.userID
LEFT JOIN interests AS i1 ON u1.userID = i1.userID
LEFT JOIN interests AS i2 ON u2.userID = i2.userID AND i1.hobbyID = i2.hobbyID
GROUP BY u1.userID, u2.userID;
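To try the query end to end, here is a minimal sketch using Python's built-in sqlite3 module. The user names alice and bob, the sample likes, and the WHERE filter on wonka are assumptions made for illustration. It also assumes every user has one interests row per hobby (liked = 0 or 1), because a hobby that only the second user has a row for would otherwise be missing from the union.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users(userID INTEGER PRIMARY KEY, userName TEXT UNIQUE);
CREATE TABLE hobbies(hobbyID INTEGER PRIMARY KEY, hobbyName TEXT UNIQUE);
CREATE TABLE interests(userID INTEGER REFERENCES users(userID) ON DELETE CASCADE
                     , hobbyID INTEGER REFERENCES hobbies(hobbyID) ON DELETE CASCADE
                     , liked INTEGER
                     , PRIMARY KEY(userID, hobbyID)) WITHOUT ROWID;
-- Hypothetical sample data: one row per (user, hobby), liked is 1 or 0.
INSERT INTO users VALUES (1, 'wonka'), (2, 'alice'), (3, 'bob');
INSERT INTO hobbies VALUES (1, 'carros'), (2, 'tecnologia'), (3, 'comida');
INSERT INTO interests VALUES
  (1, 1, 1), (1, 2, 1), (1, 3, 0),
  (2, 1, 1), (2, 2, 0), (2, 3, 1),
  (3, 1, 0), (3, 2, 0), (3, 3, 0);
""")

# Same query as above, restricted to pairs that start with wonka
# and ordered by similarity (both additions are assumptions).
rows = conn.execute("""
SELECT u1.userName, u2.userName
     , ifnull(total(i1.liked AND i2.liked) / total(i1.liked OR i2.liked), 0.0) AS Similarity
FROM users AS u1
JOIN users AS u2 ON u1.userID <> u2.userID
LEFT JOIN interests AS i1 ON u1.userID = i1.userID
LEFT JOIN interests AS i2 ON u2.userID = i2.userID AND i1.hobbyID = i2.hobbyID
WHERE u1.userName = 'wonka'
GROUP BY u1.userID, u2.userID
ORDER BY Similarity DESC
""").fetchall()

# Map the other person's name to the coefficient.
similarity = {other: score for _, other, score in rows}
for person1, person2, score in rows:
    print(person1, person2, round(score, 3))
```

Here wonka and alice share one hobby out of three that either of them likes, while bob likes nothing, so the pair with bob falls back to 0.0.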

Related

Postgis: Query all rows within radius for given table of geometries. Nearest Neighbor modification

In my Postgres 9.5 database with PostGIS 2.2.0 installed, I have a table buildings with geometric data (points) in the column centroid. The table contains about 3 million buildings, of which about 300,000 contain special information.
Now, for each buildings.gid I want to know how many other buildings of the same table are within a certain radius (I want to test this for different radii: 20 m, 50 m, 100 m, 200 m, 500 m, if it can be done in a reasonable amount of time) and add this information to a column of buildings. The related columns are N20, N50, ...
Query
I came up with something like:
UPDATE buildings
SET N50=sub.N
FROM (SELECT Count(n.gid) AS N
FROM buildings n, buildings b
WHERE ST_DWithin(b.centroid, n.centroid, 50) -- distance in meter
) sub
This is related to this solution by @ErwinBrandstetter, where a coordinate is given and the radius is drawn around it. But even when testing for only one gid, I did not receive a result in an acceptable amount of time.
The difference in my case is that I want this to be done for every building.
Table definitions
CREATE TABLE public.buildings
(
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
osm_id character varying(11),
name character varying(48),
type character varying(16),
geom geometry(MultiPolygon,4326),
centroid geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr numeric,
capac double precision,
instdate date,
pvid integer,
dist double precision,
gemewz integer,
n50 integer,
n100 integer,
n200 integer,
n500 integer,
n1000 integer,
IBASE numeric,
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.buildings
OWNER TO postgres;
CREATE INDEX build_centroid_gix
ON public.buildings
USING gist
(st_transform(centroid, 31467));
CREATE INDEX buildings_geom_idx
ON public.buildings
USING gist
(geom);
Advanced Problem
(The following might be another problem, hence should be another question on stackoverflow, but there might be the chance to implement this in the first question)
Furthermore, referring to the "special information", 268238 of the buildings contain information about dist,instdate,capac. These columns of the remaining buildings are NULL.
instdate is the date on which a building had a "PV" installed. I need to transform the table buildings into a panel data table, which means that for each period (in my case 11 periods) there is one row per building.
Now I need to check, how many other buildings within the radius already had a "PV" installed.
To do so, I want to query all buildings within a radius (like in first question) where for example capac IS NOT NULL, but now the buildings shall not be counted, but their information about dist,instdate,capac shall be added as a string to IBASE.
Try building an index on a geography cast, which can be used by ST_DWithin (so you can calculate metric distances with geographic data):
-- a cast expression in an index definition needs its own pair of parentheses
CREATE INDEX buildings_geog_idx ON buildings USING gist ((geom::geography));
UPDATE buildings SET n50=c.count
FROM (
SELECT a.gid, count(b.gid)
FROM buildings a
LEFT JOIN buildings b ON ST_DWithin(a.geom::geography, b.geom::geography, 50.0)
AND a.gid <> b.gid
GROUP BY a.gid
) c
WHERE c.gid = buildings.gid;
You could also try calculating on a sphere for faster performance, at the cost of small errors relative to spheroid distances:
ST_DWithin(a.geom::geography, b.geom::geography, 50.0, false)

modeling correct star schema for ssas tabular

I'm using ssas tabular (powerpivot) and need to design a data-model and write some DAX.
I have 4 tables in my relational database-model:
Orders(order_id, order_name, order_type)
Spots (spot_id,order_id, spot_name, spot_time, spot_price)
SpotDiscount (spot_id, discount_id, discount_value)
Discounts (discount_id, discount_name)
One order can include multiple spots but one spot (spot_id 1) can only belong to one order.
One spot can include different discounts and every discount have one discount_value.
Ex:
Order_1 has spot_1 (spot_price 10), spot_2 (spot_price 20)
Spot_1 has discount_name_1(discount_value 10) and discount_name_2 (discount_value 20)
Spot_2 has discount_name_1(discount_value 15) and discount_name_3 (discount_value 30)
I need to write two measures: price(sum) and discount_value(average)
How do I correctly design a star schema with a fact table (or maybe two fact tables) so that in my PowerPivot cube I get the following:
If I choose discount_name_1 I should get
order_1 with spot_1 and spot_2 and price on order_1 level will have value 50 and discount_value = 12,5
If I choose discount_name_3 I should get
order_1 with only spot_2 and price on order level = 20 and discount_value = 30
Fact(OrderKey, SpotKey, DiscountKey, DateKey, TimeKey, Spot_Price, Discount_Value,...)
DimOrder, DimSpot, DimDiscount, etc....
TotalPrice:=
SUMX(
    SUMMARIZE(
        Fact
        ,Fact[OrderKey]
        ,Fact[SpotKey]
        ,Fact[Spot_Price]
    )
    ,Fact[Spot_Price]
)
AverageDiscount:=
AVERAGE(Fact[Discount_Value])
The fact table is denormalized, and you end up with the simplest star schema you can have.
The first measure deserves some explanation. [Spot_Price] is duplicated for any spot with multiple discounts, so a simple SUM() would give wrong results. SUMMARIZE() does a group-by on all the columns passed to it, following relationships if necessary (we're looking at a single table here, so there is nothing to follow).
SUMX() iterates over this table and accumulates the value of the expression in its second argument. The SUMMARIZE() has removed our duplicate [Spot_Price]s so we accumulate the unique ones (per unique combination of [OrderKey] and [SpotKey]) in a sum.
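The same dedupe-then-sum idea can be checked outside of DAX. The sketch below is only a rough SQL analogue run through Python's sqlite3 module, not the tabular engine itself: SELECT DISTINCT stands in for SUMMARIZE() and the outer SUM for SUMX(). The numeric keys and the helper name measures are hypothetical; the prices and discount values are the ones from the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Fact(OrderKey INTEGER, SpotKey INTEGER, DiscountKey INTEGER,
                  Spot_Price REAL, Discount_Value REAL);
-- One row per (spot, discount), so every spot price appears twice.
INSERT INTO Fact VALUES
  (1, 1, 1, 10, 10), (1, 1, 2, 10, 20),
  (1, 2, 1, 20, 15), (1, 2, 3, 20, 30);
""")

def measures(discount_key=None):
    """Return (TotalPrice, AverageDiscount), optionally filtered by one discount."""
    where = "" if discount_key is None else "WHERE DiscountKey = ?"
    params = () if discount_key is None else (discount_key,)
    # DISTINCT removes the duplicated prices before they are summed.
    total_price = conn.execute(
        "SELECT SUM(Spot_Price) FROM "
        f"(SELECT DISTINCT OrderKey, SpotKey, Spot_Price FROM Fact {where})",
        params,
    ).fetchone()[0]
    average_discount = conn.execute(
        f"SELECT AVG(Discount_Value) FROM Fact {where}", params
    ).fetchone()[0]
    return total_price, average_discount

# A plain SUM counts each spot once per discount attached to it.
naive_sum = conn.execute("SELECT SUM(Spot_Price) FROM Fact").fetchone()[0]

print(naive_sum)    # 60.0
print(measures())   # (30.0, 18.75)
print(measures(1))  # (30.0, 12.5)
print(measures(3))  # (20.0, 30.0)
```

With no filter, the naive sum double-counts both spots, while the deduplicated sum adds each spot price once.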
You say
One order can include multiple spots but one spot (spot_id 1) can only
belong to one order.
That is not supported by the table definitions you give just above that statement. In the table definitions, one order has only one spot but (unless you've added a unique index to Orders on spot_id) each Spot can have multiple orders. Each Spot can also have multiple discounts.
If you want to have the relationship described in your words, the table definitions should be:
Orders(order_id, order_name, order_type)
OrderSpot(order_id, spot_id) -- with a unique index on spot_id
Spots (spot_id, spot_name, spot_time, price)
or:
Orders(order_id, order_name, order_type)
Spots (spot_id, spot_name, spot_time, order_id, price)
You can create the SSAS cube with Order as the fact table, with one dimension in the Spot table. If you then add the SpotDiscount and Discount tables with their relations (SpotDiscount to Spot, Discount to SpotDiscount), you have a one-dimensional cube.
EDIT as per comments
Well, the Fact table would have order_id, order_name, order_type
The Dimension would be made up of the other 3 tables and have the columns you're interested in: probably spot_name, spot_time, spot_price, discount_name, discount_value.

Data model for unique sets

I am hunting for the best way to implement a data model for "recipes"
Think of a pizza app where you can compose your own pizza: you select maybe 5 out of 100 ingredients and choose an amount for each. I need to check whether I've "seen" that pizza combination before, assign an ID if I have not, and retrieve the ID if I have.
We have n ingredients.
A recipe is defined by a set of ingredients and a corresponding amount.
Could look like:
Ingr1 90
Ingr2 10
or
Ingr1 90
Ingr2 10
Ingr3 10
I want to store this in a structure where I give each unique recipe an ID, and so it's possible for me to query for the ID given the recipe data set.
I want a stored procedure that takes a data set as a parameter and returns an ID that is new if the recipe was unknown and existing if the recipe already exists.
I am looking for the most efficient way of doing this. My best idea so far is to either encode the recipe as a string (json) and use this as a unique constraint, or have a stored procedure that iterates through the recipe data set and constructs a n level deep if exists statement.
So, I'm confident I can solve the problem, but am looking for a beautiful method.
As far as I can see, you have entities Recipe and Ingredient and M:M relation between them. Data model can look like this (PK in bold):
Recipe (RecipeID, RecipeName)
Ingredient(IngredientID, IngredientName)
RecipeIngredients(RecipeID, IngredientID, Amount)
You can find out whether the same recipe is already present in the database with a query, but that query won't be simple. It is a well-known problem: relational division. There are several approaches; one of the most popular is counting. If a stored recipe has the same number of ingredients as the target one and all ingredients match, then the two are equal. Such queries often involve aggregation and do not perform very fast on large amounts of data.
You can help solve this problem from the application side, and you are thinking in the right direction. Represent the recipe as a string, ordering the values by IngredientID (to get the same string even if ingredients were added in a different order) and converting Amount to some stable form (so you don't get 0.499999 instead of 0.5); then calculate a hash of the string and store this value in Recipe. In its simplest form the hash is an integer value, so you can find duplicates very fast.
So it is your call. Every approach has its own issues: a heavy query in the first case, and the hassle of keeping the hash up to date in the second (plus possible collisions). I'd stick with the first option as long as it works OK and start optimizing only when it becomes unavoidable.
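The application-side fingerprint described above could look roughly like this. It is a minimal sketch: the function name recipe_fingerprint, the three-decimal precision, and the choice of SHA-256 (a hex digest rather than the integer hash mentioned above) are all assumptions, not part of the original suggestion.

```python
import hashlib
from decimal import Decimal

def recipe_fingerprint(ingredients):
    """Hash a recipe given as an iterable of (ingredient_id, amount) pairs."""
    parts = []
    # Sort by ingredient ID so the insertion order does not change the result.
    for ingredient_id, amount in sorted(ingredients, key=lambda pair: pair[0]):
        # Quantize to a fixed number of decimals so 10, 10.0 and 9.9999999
        # all end up with the same spelling (assumed precision: 0.001).
        stable = Decimal(str(amount)).quantize(Decimal("0.001"))
        parts.append(f"{ingredient_id}:{stable}")
    canonical = ";".join(parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = recipe_fingerprint([(1, 90), (2, 10)])
b = recipe_fingerprint([(2, 10.0), (1, 90.0)])    # same recipe, different order
c = recipe_fingerprint([(1, 90), (2, 10), (3, 10)])
print(a == b, a == c)
```

Storing the digest in an indexed column of Recipe turns the lookup into a single equality search; a unique constraint on that column would additionally guard against concurrent inserts of the same recipe.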
Query example (new recipe is in #tmp):
;with totals as
(
    select RecipeID, count(*) totals
    from RecipeIngredients
    group by RecipeID
), matched_totals as
(
    select i.RecipeID, count(*) matched_totals
    from RecipeIngredients i
    join #tmp t
        on i.IngredientID = t.IngredientID
        and i.Amount = t.Amount
    group by i.RecipeID
)
select t.*
from totals t
join matched_totals m
    on m.RecipeID = t.RecipeID
where
    totals = matched_totals
    and totals = (select count(*) from #tmp)
This solution is more elegant but much less intuitive:
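select *
from Recipe r
where
    -- every ingredient of the stored recipe is in #tmp with the same amount
    not exists
    ( select 1
      from RecipeIngredients ri
      where
          r.RecipeID = ri.RecipeID
          and not exists
          (select 1 from #tmp t
           where t.IngredientID = ri.IngredientID and t.Amount = ri.Amount)
    )
    -- and every row of #tmp is present in the stored recipe
    and not exists
    ( select 1
      from #tmp t
      where not exists
          (select 1 from RecipeIngredients ri
           where ri.RecipeID = r.RecipeID
             and ri.IngredientID = t.IngredientID and ri.Amount = t.Amount)
    )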
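To experiment with the counting approach without a SQL Server instance, here is a sketch that ports the first query to SQLite through Python's sqlite3 module. The temp table is called tmp instead of #tmp, the count columns are renamed to total and matched, and the helper find_recipe and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RecipeIngredients(RecipeID INTEGER, IngredientID INTEGER, Amount INTEGER,
                               PRIMARY KEY (RecipeID, IngredientID));
-- Recipe 1 is a strict subset of recipe 2.
INSERT INTO RecipeIngredients VALUES
  (1, 1, 90), (1, 2, 10),
  (2, 1, 90), (2, 2, 10), (2, 3, 10);
CREATE TEMP TABLE tmp(IngredientID INTEGER PRIMARY KEY, Amount INTEGER);
""")

def find_recipe(ingredients):
    """Return the RecipeID whose ingredient set equals `ingredients`, or None."""
    conn.execute("DELETE FROM tmp")
    conn.executemany("INSERT INTO tmp VALUES (?, ?)", ingredients)
    row = conn.execute("""
        WITH totals AS (
            SELECT RecipeID, COUNT(*) AS total
            FROM RecipeIngredients
            GROUP BY RecipeID
        ), matched_totals AS (
            SELECT i.RecipeID, COUNT(*) AS matched
            FROM RecipeIngredients AS i
            JOIN tmp AS t
              ON i.IngredientID = t.IngredientID AND i.Amount = t.Amount
            GROUP BY i.RecipeID
        )
        SELECT t.RecipeID
        FROM totals AS t
        JOIN matched_totals AS m ON m.RecipeID = t.RecipeID
        WHERE t.total = m.matched                      -- every stored row matched
          AND t.total = (SELECT COUNT(*) FROM tmp)     -- and nothing extra in tmp
    """).fetchone()
    return row[0] if row else None

print(find_recipe([(1, 90), (2, 10)]))
print(find_recipe([(1, 90), (2, 20)]))
```

The two count comparisons are what rule out subsets and supersets: recipe 1 does not match a three-ingredient input, and recipe 2 does not match a two-ingredient one.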

Designing a multilevel company structure

I try to build a database model for the following structure:
I have companies with up to 3 hierarchical levels. Each unit has a value (these values are assigned arbitrarily, and duplicates are possible between companies but not within one). For example: Level 1: 222-Amazon; Level 2: 441-Amazon: Germany, 542-Britain; Level 3: 6-Distribution, 99-Shop, 124-Programming, 5-HR.
Of course for each company this is different. What I did is:
Table1:
ID_Worker
CompanyName
ID_CompanyLvL1
ID_CompanyLvL2
ID_CompanyLvL3
...
Table2:
ID_CompanyLevel1
Slot1
Slot2
...
Table3:
ID_CompanyLevel2
Slot1
Slot2
...
But with this approach I have the following problem: if two companies have the same number for a CompanyLevel1 (or 2 or 3) unit, I cannot distinguish them anymore.
Another approach that is not working is
Table1:
ID_Company
ID_Worker
ID_CompanyLevel1
...
Table2:
ID_CompanyLevel1
Slot1
ID_CompanyLevel2
...
Table3:
ID_CompanyLevel2
Slot
ID_CompanyLevel3
...
With this approach I cannot identify which person is in, e.g., which level 2 unit. Could anyone help me with this? I just cannot come up with the right design.
You need to decide whether the organization structure is purely hierarchical (an org unit can only belong to 0 or 1 other org unit), or whether it is graphical (an org unit can belong to 0, 1, or 1+ org units).
Your limit of three is a business rule, and should be enforced by database logic (trigger) and not the database schema.
Why the codes with the names?
If hierarchical, this is your schema:
create table organizations (
organization_id int primary key,
name varchar(whatever) not null,
parent_id int null references organizations(organization_id)
);
Use Recursive Common Table Expressions to query them.
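As an illustration of such a query, the sketch below loads a small hierarchy into SQLite via Python's sqlite3 module and walks it with a recursive CTE. The rows, the parent assignments, and the level and path columns are assumptions made for the example; the column types are adapted to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organizations (
  organization_id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  parent_id INTEGER REFERENCES organizations(organization_id)
);
-- Hypothetical hierarchy: a root company, two countries, three departments.
INSERT INTO organizations VALUES
  (1, 'Amazon', NULL),
  (2, 'Germany', 1),
  (3, 'Britain', 1),
  (4, 'Distribution', 2),
  (5, 'Shop', 2),
  (6, 'HR', 3);
""")

# Start at the roots (no parent) and repeatedly join children onto the rows
# found so far, tracking the depth and the path from the root.
tree = conn.execute("""
WITH RECURSIVE org_tree(organization_id, name, level, path) AS (
  SELECT organization_id, name, 1, name
  FROM organizations
  WHERE parent_id IS NULL
  UNION ALL
  SELECT o.organization_id, o.name, t.level + 1, t.path || ' > ' || o.name
  FROM organizations AS o
  JOIN org_tree AS t ON o.parent_id = t.organization_id
)
SELECT name, level, path FROM org_tree ORDER BY path
""").fetchall()

levels = {name: level for name, level, _ in tree}
paths = {name: path for name, _, path in tree}
for name, level, path in tree:
    print(level, path)
```

The level column also gives a simple way to check the "at most three levels" rule, for example by looking for rows with level greater than 3.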
If graphical, this is your schema:
create table organizations (
organization_id int primary key,
name varchar(whatever) not null
);
create table organizations_structure (
parent_organization_id int references organizations(organization_id),
child_organization_id int references organizations(organization_id),
primary key (parent_organization_id, child_organization_id),
check (parent_organization_id <> child_organization_id)
);
For anything like that, make sure you do not paint yourself into a corner. For example:
I have companies with up to 3 hierarchical levels
No. You have companies with CURRENTLY up to 3 hierarchical levels. And you do not want them to scream at you when one of them decides to have 4.
I would suggest reading The Data Model Resource Book, Volume 1. It describes all kinds of standard data schemas, among them entity organizations (entity as in "legal, human or organizational entity"), which includes organigrams. Things are a lot more complex than you think when you do not want to paint yourself into a corner that WILL force a rewrite of the program in the not too distant future.

Select exclusively a field from a table

I have to add a coupon table to my db. There are 3 types of coupons : percentage, amount or 2 for 1.
So far I've come up with a coupon table that contains these 3 fields. If there's a percentage value not set to null then it's this kind of coupon.
I feel it's not the proper way to do it. Should I create a CouponType table and how would you see it? Where would you store these values?
Any help or cue appreciated!
Thanks,
Teebot
You're correct, I think a CouponType table would be fit for your problem.
Two tables: Coupons and CouponTypes. Store the CouponTypeId inside the Coupons table.
So, for example, you'll have a Coupon record called "50% off"; it would reference the percent-off CouponType record, and from there you could determine the logic to take 50% off the cost of the item.
So now you can create unlimited coupons, if it's a dollar amount coupon type it will take the "amount" column and treat it as a dollar amount. If it's a percent off it will treat it as a percentage and if it's an "x for 1" deal, it will treat the value as x.
- Table Coupons
- ID
- name
- coupon_type_id # (or whatever fits your style guidelines)
- amount # Example: 10.00 (treated as $10 off for amount type, treated as
# 10% for percent type or 10 for 1 with the final type)
- expiration_date
- Table CouponTypes
- ID
- type # (amount, percent, <whatever you decided to call the 2 for 1> :))
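A sketch of how the two tables and the type-dependent logic might fit together, using Python's sqlite3 module. The type names, the sample coupons, the helper discounted_total, and in particular the reading of "x for 1" as "every group of x items is charged as one item" are assumptions for illustration.

```python
import sqlite3
from math import ceil

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CouponTypes(ID INTEGER PRIMARY KEY, type TEXT UNIQUE NOT NULL);
CREATE TABLE Coupons(
  ID INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  coupon_type_id INTEGER NOT NULL REFERENCES CouponTypes(ID),
  amount REAL NOT NULL,
  expiration_date TEXT
);
INSERT INTO CouponTypes VALUES (1, 'amount'), (2, 'percent'), (3, 'x_for_1');
INSERT INTO Coupons VALUES
  (1, '$10 off', 1, 10, NULL),
  (2, '50% off', 2, 50, NULL),
  (3, '2 for 1', 3, 2, NULL);
""")

def discounted_total(coupon_id, unit_price, quantity):
    """Apply the coupon to `quantity` items; `amount` is interpreted per type."""
    coupon_type, amount = conn.execute("""
        SELECT t.type, c.amount
        FROM Coupons AS c
        JOIN CouponTypes AS t ON t.ID = c.coupon_type_id
        WHERE c.ID = ?""", (coupon_id,)).fetchone()
    total = unit_price * quantity
    if coupon_type == "amount":
        return max(total - amount, 0)           # fixed amount off, never below zero
    if coupon_type == "percent":
        return total * (1 - amount / 100)       # percentage off the total
    if coupon_type == "x_for_1":
        # Assumed reading: each group of `amount` items costs one unit price.
        return unit_price * ceil(quantity / amount)
    raise ValueError(f"unknown coupon type: {coupon_type}")

print(discounted_total(1, 30, 1))
print(discounted_total(2, 30, 2))
print(discounted_total(3, 30, 4))
```

Adding a new kind of coupon then means one new row in CouponTypes plus one new branch in the application logic, with no change to the Coupons table.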
In the future you might have many more coupon types. You could also have different business logic associated with them; you never know. It's always useful to do things right in a case like this, so yes, definitely create a coupon type field and an associated dictionary table to go with it.
I would definitely create a CouponType lookup table. That way you avoid all the NULL's and allow for more coupon types in the future.
Coupon
coupon_id INT
name VARCHAR
coupon_type_id INT <- Foreign Key
CouponType
coupon_type_id INT
type_description VARCHAR
...
Or I suppose you could have a coupon type column in your coupon table CHAR(1)
