Efficiently counting strength of relationship between rows in Postgres

I have a table that looks similar to this:
session_id | sku
-----------+-----
a          | 1
a          | 2
a          | 3
a          | 4
b          | 2
b          | 3
c          | 3
I want to pivot this into a table similar to this:
sku1 | sku2 | score
-----+------+------
1    | 2    | 1
1    | 3    | 1
1    | 4    | 1
2    | 3    | 2
2    | 4    | 1
3    | 4    | 1
The idea is to store a denormalised table that lets me look up, for a given sku, which other skus have shared a session with it, and how many sessions both skus have appeared in together.
What algorithms, patterns or strategies could you suggest for implementing this in PostgreSQL or other technologies?
I realise that this kind of lookup can be done on the original table using counts, or using a faceted search engine. However, I want to make the reads more performant, and just want to keep the overall statistics. The idea is that I will be performing this pivot regularly on the newest few thousand rows in the first table, then storing the result in the second. I'm only concerned with approximate statistics for the second table.
I've got some SQL that works, but VERY slowly. Also looking into the potential for using a graph database of some sort, but wanted to avoid adding another technology for a small part of the app.
Update: The SQL below seems performant enough. I can convert 1.2 million rows in the first table (tags) into 250k rows in the second table (product_relations) with around 2-3k variations of sku in about 5 minutes on my iMac. I will realistically be denormalising only up to 10k rows per day. Question is whether this is actually the best approach. Seems a little dirty to me.
BEGIN;

CREATE TEMPORARY TABLE working_tags (
    tag_id     int,
    session_id varchar,
    sku        varchar
) ON COMMIT DROP;

INSERT INTO working_tags
SELECT id, session_id, sku
FROM tags
WHERE time < now() - interval '12 hours'
  AND processed_product_relation IS NULL
  AND sku IS NOT NULL
LIMIT 200000;

CREATE TEMPORARY TABLE working_relations (
    sku1  varchar,
    sku2  varchar,
    score int
) ON COMMIT DROP;

INSERT INTO working_relations
SELECT a.sku AS sku1,
       b.sku AS sku2,
       count(DISTINCT a.session_id) AS score
FROM working_tags AS a
INNER JOIN working_tags AS b
        ON a.session_id = b.session_id
       AND a.sku < b.sku
WHERE a.sku IS NOT NULL
  AND b.sku IS NOT NULL
GROUP BY a.sku, b.sku;

UPDATE product_relations
SET score = working_relations.score + product_relations.score
FROM working_relations
WHERE working_relations.sku1 = product_relations.sku1
  AND working_relations.sku2 = product_relations.sku2;

INSERT INTO product_relations (sku1, sku2, score)
SELECT working_relations.sku1,
       working_relations.sku2,
       working_relations.score
FROM working_relations
LEFT OUTER JOIN product_relations
       ON (working_relations.sku1 = product_relations.sku1
      AND  working_relations.sku2 = product_relations.sku2)
WHERE product_relations.sku1 IS NULL;

UPDATE tags
SET processed_product_relation = TRUE
WHERE id IN (SELECT tag_id FROM working_tags);

COMMIT;

If I've interpreted your intention correctly (per comments) this should do it:
SELECT
s1.sku AS sku1,
s2.sku AS sku2,
count(session_id)
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku
ORDER BY 1,2;
See: http://sqlfiddle.com/#!15/2e0b2/1
In other words: Self-join session, then find all pairings of SKUs for each session ID, excluding ones where the left is greater than or equal to the right in order to avoid repeating pairings - if we have (1,2,count) we don't want (2,1,count) as well. Then group by the SKU pairings and count how many rows are found for each pairing.
You may want to count(distinct session_id) instead, if your SKU pairings can repeat and you want to exclude duplicates. There will probably be more efficient ways to do that, but that's the simplest.
An index on at least session_id will be very useful. You may also want to mess with planner cost parameters to make sure it chooses a good plan - in particular, make sure effective_cache_size is accurate and random_page_cost vs seq_page_cost reflects your caching and I/O costs. Finally, throw as much work_mem at it as you can afford.
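As a sketch only (the values here are illustrative, not recommendations), those settings could be adjusted just for the session that runs the batch:
-- Session-level planner/memory settings; tune the values to your hardware
SET work_mem = '256MB';              -- more memory for the sort/hash in the self-join
SET effective_cache_size = '4GB';    -- roughly how much of the data is likely to be cached
SET random_page_cost = 1.5;          -- lower relative to seq_page_cost if data is mostly cached or on SSD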
If you're creating a materialized view, just use CREATE UNLOGGED TABLE whatever AS SELECT .... That way you minimise the number of writes/rewrites/overwrites.
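For example, a sketch of that applied to the query above (the table name is illustrative):
CREATE UNLOGGED TABLE product_relation_scores AS
SELECT s1.sku AS sku1,
       s2.sku AS sku2,
       count(session_id) AS score
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku;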

Related

I wonder how to design a good database that sorts data when inserting values in the middle or top

I have a lot of data, and I need the sort order to be maintained whenever a value is inserted in the middle or at the top.
For example, a table with increasing data (group_order) is shown below.
code  | group_id | group_order | depth
------+----------+-------------+-------
c1    | Group1   | 1           | 1
<- c6 | Group1   | 2           | 1
c2    | Group1   | 2           | 1
c3    | Group1   | 3           | 1
c4    | Group1   | 4           | 1
c5    | Group1   | 5           | 1
As shown in the table above, I insert a row with group_order 2 in the second position, and then try to increase the group_order of the rows below it (c2, c3, c4, c5) by 1.
Of course, it works, but as I said before, the update takes a lot of time because I have a lot of data.
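The shift presumably looks something like this sketch ('mytable' is a placeholder name; the columns are those shown above):
-- Make room at position 2, then insert the new row there
UPDATE mytable
SET group_order = group_order + 1
WHERE group_id = 'Group1'
  AND group_order >= 2;

INSERT INTO mytable (code, group_id, group_order, depth)
VALUES ('c6', 'Group1', 2, 1);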
When I insert the data into the desired location, the values should be sorted in that order.
Please help me.
The database I use is postgresql.
Thank you.
demo:db<>fiddle
I personally think it is never a good idea to manipulate the stored data for reasons such as sort order. The final order is a concern of the view, not of the model, so in my opinion you should calculate the order when you need it rather than maintaining it generally. Another potential problem is insert performance, which I am not quite sure about but which should be investigated: since you update all the following rows, those records are locked by the transaction, and during that time you will not be able to insert another record. Even without a blocking transaction, you should think about race conditions (what happens if two records are inserted simultaneously?).
I would introduce a second sort column like an insert number which stores the number at which the record was inserted. Could be simply a serial or auto-increment sequence.
And then you could simply order by group_order ASC followed by insert_nr DESC. Benefit: You don't need to manipulate your original data.
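For example (a sketch using the columns above):
SELECT code, group_order
FROM mytable
ORDER BY group_order ASC, insert_nr DESC;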
If you still need a correct order number, you could create a (materialized) view and add the order number with incrementing numbers using row_number() window function.
Introducing an (auto-)incrementing column insert_nr:
CREATE TABLE mytable (
code text,
group_order int,
insert_nr serial
);
Using it within a view:
CREATE VIEW v_myview AS
SELECT
code, group_order,
row_number() OVER (ORDER BY group_order, insert_nr DESC)
FROM (
SELECT * FROM mytable
) s;

How to find the last 100k rows from a 10,000K-row table in Oracle?

When I use this query it takes more than 5 minutes; please give me some other suggestion.
SELECT * FROM
( SELECT id,name,rownum AS RN$$_RowNumber FROM MILLION_1) INNER_TABLE where
RN$$_RowNumber > (V_total_count - V_no_of_rows)
ORDER BY RN$$_RowNumber DESC;
Try the offset clause.
I have a table with about 16M records in it. If I just want the last 100,000 rows, I order them via the ORDER BY clause, and then I use the OFFSET clause, which basically says: read this many rows first before you return any data.
select *
from SHERI; -- 15,691,544 Rows
select *
from SHERI
order by COLUMN4 asc
offset 15591444 rows; -- my math was bad, should have offset 15591544 rows to get just the last 100,000
The FETCH FIRST and OFFSET clauses are new for 12c (docs)
If we look at the plan under this query, we can see how the database makes it work:
PLAN_TABLE_OUTPUT
SQL_ID 7wd4ra8pfu1vb, child number 0
-------------------------------------
select * from SHERI order by COLUMN4 asc offset 15591444 rows
Plan hash value: 3535161482
----------------------------------------------
| Id | Operation | Name | E-Rows |
----------------------------------------------
| 0 | SELECT STATEMENT | | |
|* 1 | VIEW | | 15M|
| 2 | WINDOW SORT | | 15M|
| 3 | TABLE ACCESS FULL| SHERI | 15M|
----------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber">15591444)
Note
-----
- Warning: basic plan statistics not available. These are only collected when:
* hint 'gather_plan_statistics' is used for the statement or
* parameter 'statistics_level' is set to 'ALL', at session or system level
'window sort' basically translates to an analytic function.
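Since 12c also supports FETCH FIRST, a variation that avoids computing the offset by hand would be to flip the sort and take the first 100,000 rows (a sketch against the same table; it returns the same rows, just in descending order):
select *
from SHERI
order by COLUMN4 desc
fetch first 100000 rows only;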
There are some very thorough answers at this similar question, but I'll try to make them specific to your case.
First, when you say "last 100k rows", what do you mean? It looks like you just want to pull the last 100k rows from an unsorted query, but that doesn't make a lot of sense. If you want the 100k most recent rows, Oracle doesn't guarantee that they'll be at the end of your unsorted query. So you want to order by something which will have the most recent ones at the end.
Also, part of the reason your query is slow is that you're sorting/filtering on the rownum pseudo-column, which can't be indexed. Sorting on a column that has an index would drastically speed this up. So I'd guess you want to order by the id column, which is probably a unique/primary key.
So this is the old (11g and earlier) way to do this.
select id, name
from (select id, name
from MILLION_1
order by id desc)
where rownum < 100000;
If you're on 12c or later, there's a newer way to do it.
select id, name
from MILLION_1
order by id desc
fetch first 100000 rows only;

Metadata database design

I am trying to store metadata about a document in SQL Server. The documents are stored in a document archive, which returns an identifier so I can later retrieve a document by asking the archive for it by that identifier.
Our users would like to be able to search for documents based on different metadata. The metadata could be 1 attribute or 5 depending on the document type, and users should be able to create new document types from an admin site.
I can see two solutions here. One is that each document type gets its own metadata table, where all metadata attributes are predefined; if an attribute should be added, a new column needs to be created, and if a new document type is created, a new metadata table needs to be created. Our DBA will freak out with a solution like this, and I also see a problem with indexes: if the document type has 5 different metadata attributes, it needs to be searchable with 1 or 4 of them specified in the search, so I would need to write indexes for all the different combinations of possible searches.
Here is a (fictitious) example:
|documentId | Name | InsertDate | CustomerId | City
| 1 | John | 2014-01-01 | 2 | London
| 2 | John | 2014-01-20 | 5 | New York
| 3 | Able | 2014-01-01 | 10 | Paris
Here I could say:
Give me all documents where Name = 'John'
Give me all documents where Name = 'John' and CustomerId = 5
Give me all documents where InsertDate = '2014-01-01' and City = 'London'
That would be 3 different indexes, and it still wouldn't cover all possible combinations. This isn't practical.
So I am looking into the evil 'EAV' (anti)pattern.
Instead of having the metadata as columns, I can have them as rows.
|documentId | MetaAttribute | MetaValue
| 1 | Name | John
| 1 | InsertDate | 2014-01-01
| 1 | CustomerId | 2
| 1 | City | London
| 2 | Name | John
| 2 | InsertDate | 2014-01-20
| 2 | CustomerId | 5
| 2 | City | New York
| 3 | Name | Able
| 3 | InsertDate | 2014-01-01
| 3 | CustomerId | 10
| 3 | City | Paris
Here it's simple to create one index on MetaAttribute and MetaValue, and the searches are covered. If a new document type is created, its metadata attributes can be registered in a MetaAttribute table (which contains all the MetaAttributes for the different document types). So there is no need to create new tables or columns when a new document type is added, or when a new attribute is added to a document type. The downside is that all MetaValues must be strings :( and the SQL query to find the document id is a bit more complicated.
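That covering index might look something like this sketch (the index name and the INCLUDE column are illustrative):
CREATE INDEX IX_MetaData_Attribute_Value
    ON MetaData (MetaAttribute, MetaValue)
    INCLUDE (documentId);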
This is what I figured out. (In this example the MetaAttribute is a string, but would be an ID to the MetaAttribute Table)
SELECT * FROM [Document]
WHERE ID IN (SELECT documentId FROM [MetaData]
WHERE ((MetaAttribute = 'Name' AND MetaValue = 'John')
OR (MetaAttribute = 'CustomerId' and MetaValue = '5'))
GROUP BY [documentId]
HAVING Count(1) = 2)
Here I need to ask for Name = 'John' and CustomerId = 5. I do that by finding all metadata rows that match either Name = 'John' or CustomerId = '5', grouping them on documentId, and counting the number of rows in each group. If I get 2, then both Name = 'John' and CustomerId = '5' are true for that document. I return the documentId and use it to retrieve information about the document, such as the document archive storage id.
There should be a better SQL statement for this, shouldn't there?
So my question is: is there a better approach than these two? Is the EAV pattern so bad that I should stick with the first approach, with a freaked-out DBA and "ten millions of indexes"?
We are talking about a system that will get around 10-20 million new records each month and retain data for at least 3 years, so the tables will be pretty big and good indexes are necessary for performance.
Best Regards
Magnus
The EAV model is appealing if you have unbounded attributes--that is, anyone can set up anything as an attribute. However, it sounds from your description that this is not the case--the possible document attributes come from a known and fairly limited set. If this is the case, routine normalization suggests the following:
-- One per document
CREATE TABLE Document
(
DocumentId -- primary key
,DocumentType
,<etc>
)
-- One per "type" of document
CREATE TABLE DocumentType
(
DocumentTypeId -- primary key
,Name
)
-- One per possible document attribute.
-- Note that multiple document types can reference the same attribute
CREATE TABLE DocumentAttributes
(
AttributeId -- primary key
,Name
)
-- This lists which attributes are used by a given type
CREATE TABLE DocumentTypeAttributes
(
DocumentTypeId
,AttributeId
-- compound primary key on both columns
-- foreign keys on both columns
)
-- This contains the final association of document and attributes
CREATE TABLE DocumentAttributeValues
(
DocumentId
,AttributeId
,Value
-- compound primary key on DocumentId, AttributeId
-- foreign keys on both columns to their respective parent tables
)
A tighter model with more robust keys could be implemented to ensure at the database level that an attribute cannot be assigned to a document with an “inappropriate” type.
Queries have to use joins, but (presumably) only the Documents and DocumentAttributeValues tables will ever be large. An index on (AttributeId + Value) facilitates lookups by attribute type, and depending on cardinality an index on (Value + AttributeId) could make searches for specific attributes quite efficient.
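Those two indexes might look something like this (a sketch; the index names are illustrative):
CREATE INDEX IX_DocAttrValues_Attribute_Value
    ON DocumentAttributeValues (AttributeId, Value);

CREATE INDEX IX_DocAttrValues_Value_Attribute
    ON DocumentAttributeValues (Value, AttributeId);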
(Edit)
Ooh, clever, I created two tables with the same name. I've renamed the last one to DocumentAttributeValues. (Free advice is clearly worth what you paid for it!)
This shows how ugly these systems can get in SQL, as you have to “look up” both attributes separately. On the plus side you don’t have to worry about “does this type go with this document”, as those rules have (better had) been applied when the data was loaded. Two examples:
This one spells everything out in joins, and as such I think it might perform worse than the next:
-- Top-down
SELECT do.DocumentId
from Documents do
inner join DocumentAttributes da1
    on da1.Name = 'Name'
inner join DocumentAttributeValues dav1
    on dav1.DocumentId = do.DocumentId
    and dav1.AttributeId = da1.AttributeId
    and dav1.Value = 'John'
inner join DocumentAttributes da2
    on da2.Name = 'CustomerId'
inner join DocumentAttributeValues dav2
    on dav2.DocumentId = do.DocumentId
    and dav2.AttributeId = da2.AttributeId
    and dav2.Value = '5'
This one picks out the attributes, then finds which documents have all of them. It might perform better, as there’s one less table to process:
-- Bottom-up
SELECT xx.DocumentId
from (-- All documents with name "John"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'Name'
and dav.Value = 'John'
-- This combines the two sets, with "all" keeping any duplicate entries
union all
-- All documents with CustomerId = "5"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'CustomerId'
and dav.Value = '5') xx -- Have to give the subquery an alias
group by xx.DocumentId
having count(*) = 2
While further refinements might be possible, the more attributes you’re filtering on, the uglier the queries will be. Five attributes max might work ok in SQL, but if you’ve got tons of attributes, a NoSQL solution might be what you’re looking for.
(Please note that, as with my original post, I have not tested this code, so there may be typos or subtle--or not so subtle--errors in here.)
SQL Server 2008+ offers three related features for dealing with such cases:
Sparse Columns which allow you to define hundreds of columns even if only a subset are used at a time
Column Sets allow you to group these columns and treat them as a group
Filtered indexes can index only the rows that actually have values in them.
These features allow you to work with more-or-less normal SQL statements to handle all metadata columns.
These features were specifically added to address the EAV/metadata scenario.
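A minimal sketch of those features used together (the table and column names are illustrative, not from the question):
CREATE TABLE DocumentMetadata
(
    DocumentId    int PRIMARY KEY,
    Name          nvarchar(100) SPARSE NULL,
    CustomerId    int SPARSE NULL,
    City          nvarchar(100) SPARSE NULL,
    AllAttributes xml COLUMN_SET FOR ALL_SPARSE_COLUMNS  -- column set: exposes all sparse columns as one XML value
);

-- Filtered index: only the rows that actually have a City are indexed
CREATE INDEX IX_DocumentMetadata_City
    ON DocumentMetadata (City)
    WHERE City IS NOT NULL;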
EDIT
If you have a limited set of attributes that are always filled, there is no need for Sparse Columns or the EAV anti-pattern either.
You can create your tables as you normally would and add indexes to optimize the real workload you encounter. Certain types of queries will occur far more often than others and SQL Server's Index tuning advisor can propose the indexes and statistics to use based on a trace captured using SQL Server's Profiler.
It's quite possible that only a subset of the columns will accelerate searches and the rest can be added as include columns in the index.
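For example, a sketch against the fictitious document table from the question (the table and index names are illustrative):
CREATE INDEX IX_Document_Name_CustomerId
    ON Document (Name, CustomerId)
    INCLUDE (InsertDate, City);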
Full Text Search
A more powerful option is to use SQL Server's Full Text Search. This will allow you to execute queries using arbitrary attributes. This is another technique used by document/content management systems, ERPs and CRMs to handle arbitrary attributes.
With FTS you simply specify the columns to include in one FTS index and don't have to create separate indexes for each attribute.
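Creating the FTS index might look something like this sketch (the catalog name is illustrative, and the KEY INDEX must be an existing unique index on the table):
CREATE FULLTEXT CATALOG DocumentCatalog;

CREATE FULLTEXT INDEX ON Production.Product (Name)
    KEY INDEX PK_Product_ProductID
    ON DocumentCatalog;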
You can use FTS predicates in SELECT queries like this:
SELECT Name, ListPrice
FROM Production.Product
WHERE ListPrice = 80.99
AND CONTAINS(Name, 'Mountain')
This can result in much simpler queries (you just write a modified SELECT) and simpler administration (no worries about column order in indexes; only one FTS index to manage).

Partitioning results in a running totals query

I'm looking for a fast way to create cumulative totals in a large SQL Server 2008 data set that partition by a particular column, potentially by using a multiple assignment variable solution. As a very basic example, I'd like to create the "cumulative_total" column below:
user_id | month | total | cumulative_total
1 | 1 | 2.0 | 2.0
1 | 2 | 1.0 | 3.0
1 | 3 | 3.5 | 8.5
2 | 1 | 0.5 | 0.5
2 | 2 | 1.5 | 2.0
2 | 3 | 2.0 | 4.0
We have traditionally done this with correlated subqueries, but over large amounts of data (200,000+ rows and several different categories of running total) this isn't giving us ideal performance.
I recently read about using multiple assignment variables for cumulative summing here:
http://sqlblog.com/blogs/paul_nielsen/archive/2007/12/06/cumulative-totals-screencast.aspx
In the example in that blog the cumulative variable solution looks like this:
UPDATE my_table
SET @CumulativeTotal = cumulative_total = @CumulativeTotal + ISNULL(total, 0)
This solution seems brilliantly fast for summing for a single user in the above example (user 1 or user 2). However, I need to effectively partition by user - give me the cumulative total by user by month.
Does anyone know of a way of extending the multiple assignment variable concept to solve this, or any other ideas other than correlated subqueries or cursors?
Many thanks for any tips.
If you don't need to STORE the data (which you shouldn't, because you need to update the running totals any time any row is changed, added or deleted), and if you don't trust the quirky update (which you shouldn't, because it isn't guaranteed to work and its behavior could change with a hotfix, service pack, upgrade, or even an underlying index or statistics change), you can try this type of query at runtime. This is a method fellow MVP Hugo Kornelis coined "set-based iteration" (he posted something similar in one of his chapters of SQL Server MVP Deep Dives). Since running totals typically requires a cursor over the entire set, a quirky update over the entire set, or a single non-linear self-join that becomes more and more expensive as the row counts increase, the trick here is to loop through some finite element in the set (in this case, the "rank" of each row in terms of month, for each user - and you process only each rank once for all user/month combinations at that rank, so instead of looping through 200,000 rows, you loop up to 24 times).
DECLARE @t TABLE
(
  [user_id]    INT,
  [month]      TINYINT,
  total        DECIMAL(10,1),
  RunningTotal DECIMAL(10,1),
  Rnk          INT
);

INSERT @t SELECT [user_id], [month], total, total,
  RANK() OVER (PARTITION BY [user_id] ORDER BY [month])
FROM dbo.my_table;

DECLARE @rnk INT = 1, @rc INT = 1;

WHILE @rc > 0
BEGIN
  SET @rnk += 1;

  UPDATE c SET RunningTotal = p.RunningTotal + c.total
    FROM @t AS c INNER JOIN @t AS p
    ON c.[user_id] = p.[user_id]
    AND p.rnk = @rnk - 1
    AND c.rnk = @rnk;

  SET @rc = @@ROWCOUNT;
END

SELECT [user_id], [month], total, RunningTotal
FROM @t
ORDER BY [user_id], rnk;
Results:
user_id month total RunningTotal
------- ----- ----- ------------
1 1 2.0 2.0
1 2 1.0 3.0
1 3 3.5 6.5 -- I think your calculation is off
2 1 0.5 0.5
2 2 1.5 2.0
2 3 2.0 4.0
Of course you can update the base table from this table variable, but why bother, since those stored values are only good until the next time the table is touched by any DML statement?
UPDATE mt
SET cumulative_total = t.RunningTotal
FROM dbo.my_table AS mt
INNER JOIN @t AS t
ON mt.[user_id] = t.[user_id]
AND mt.[month] = t.[month];
Since we're not relying on implicit ordering of any kind, this is 100% supported and deserves a performance comparison relative to the unsupported quirky update. Even if it doesn't beat it but comes close, you should consider using it anyway IMHO.
As for the SQL Server 2012 solution, Matt mentions RANGE but since this method uses an on-disk spool you should also test with ROWS instead of just running with RANGE. Here is a quick example for your case:
SELECT
[user_id],
[month],
total,
RunningTotal = SUM(total) OVER
(
PARTITION BY [user_id]
ORDER BY [month] ROWS UNBOUNDED PRECEDING
)
FROM dbo.my_table
ORDER BY [user_id], [month];
Compare this with RANGE UNBOUNDED PRECEDING, or with no ROWS/RANGE clause at all (which will also use the RANGE on-disk spool). The above will have lower overall duration and way less I/O, even though the plan looks slightly more complex (an additional sequence project operator).
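For reference, the RANGE version being compared against differs only in the frame clause:
SELECT
  [user_id],
  [month],
  total,
  RunningTotal = SUM(total) OVER
  (
    PARTITION BY [user_id]
    ORDER BY [month] RANGE UNBOUNDED PRECEDING
  )
FROM dbo.my_table
ORDER BY [user_id], [month];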
I've recently published a blog post outlining some performance differences I observed for a specific running totals scenario:
http://www.sqlperformance.com/2012/07/t-sql-queries/running-totals
Your options in SQL Server 2008 are reasonably limited - in that you can either do something based on the method as above (which is called a 'quirky update') or you can do something in the CLR.
Personally I would go with the CLR because it's guaranteed to work, while the quirky update syntax isn't something that's formally supported (so might break in future versions).
The variation on quirky update syntax you're looking for would be something like:
UPDATE my_table
SET @CumulativeTotal = cumulative_total = ISNULL(total, 0) +
    CASE WHEN [user_id] = @lastUser THEN @CumulativeTotal ELSE 0 END,
    @lastUser = [user_id]
It's worth noting that SQL Server 2012 introduces RANGE support in its windowing functions, so this becomes expressible in a way that is the most efficient while being 100% supported.

Detecting Correlated Columns in Data

Suppose I have the following data:
OrderNumber | CustomerName | CustomerAddress | CustomerCode
1 | Chris | 1234 Test Drive | 123
2 | Chris | 1234 Test Drive | 123
How can I detect that the columns "CustomerName", "CustomerAddress", and "CustomerCode" all correlate perfectly? I'm thinking that Sql Server data mining is probably the right tool for the job, but I don't have too much experience with that.
Thanks in advance.
UPDATE:
By "correlate", I mean in the statistics sense, that whenever column a is x, column b will be y. In the above data, The last three columns correlate with each other, and the first column does not.
The input of the operation would be the name of the table, and the output would be something like :
Column 1 | Column 2 | Certainty
CustomerName | CustomerAddress | 100%
CustomerAddress | CustomerCode | 100%
There is a 'functional dependency' test built in to the SQL Server Data Profiling component (which is an SSIS component that ships with SQL Server 2008). It is described pretty well on this blog post:
http://blogs.conchango.com/jamiethomson/archive/2008/03/03/ssis-data-profiling-task-part-7-functional-dependency.aspx
I have played a little bit with accessing the data profiler output via some (under-documented) .NET APIs and it seems doable. However, since my requirement dealt with the distribution of column values, I ended up going with something much simpler based on the output of DBCC SHOW_STATISTICS. I was quite impressed by what I saw of the profiler component and the output viewer.
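If you go that route, DBCC SHOW_STATISTICS takes the table and a statistics object (usually an index name) to inspect; for example (the names here are illustrative):
DBCC SHOW_STATISTICS ('dbo.MyTable', 'IX_MyTable_CustomerName');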
What do you mean by correlate? Do you just want to see if they're equal? You can do that in T-SQL by joining the table to itself:
select distinct
case when a.OrderNumber < b.OrderNumber then a.OrderNumber
else b.OrderNumber
end as FirstOrderNumber,
case when a.OrderNumber < b.OrderNumber then b.OrderNumber
else a.OrderNumber
end as SecondOrderNumber
from
MyTable a
inner join MyTable b on
a.CustomerName = b.CustomerName
and a.CustomerAddress = b.CustomerAddress
and a.CustomerCode = b.CustomerCode
This would return you:
FirstOrderNumber | SecondOrderNumber
1 | 2
Correlation is defined on metric spaces, and your values are not metric.
This will give you percent of customers that don't have customerAddress uniquely defined by customerName:
SELECT AVG(perfect)
FROM (
SELECT
customerName,
CASE
WHEN COUNT(customerAddress) = COUNT(DISTINCT customerAddress)
THEN 0
ELSE 1
END AS perfect
FROM orders
GROUP BY
customerName
) q
Substitute other columns instead of customerAddress and customerName into this query to find discrepancies between them.
