SQLite delete rows based on multiple columns

I'm pretty new to SQLite, hence this question!
I need to remove rows from a table so that, for each unique value in column X (colour), only the earliest occurrence based on column Y (time) remains.
Basically I have this:
test | colour | time(s)
one | Yellow | 8
one | Red | 6
one | Yellow | 10
two | Red | 4
I want to remove rows so that it looks like this:
test | colour | time(s)
one | Yellow | 8
two | Red | 4
Thanks in advance!
EDIT: To be clearer, I need to retain the earliest occurrence in time of each colour, regardless of the test.
EDIT: I can select the rows I want to keep by doing this:
select * from ( select * from COL_TABLE order by time desc) x group by colour;
which produces the desired result, but I want to remove everything that is not in the result of that select.
EDIT: The following worked, thanks to @JimmyB:
DELETE FROM COL_TABLE WHERE EXISTS (SELECT * FROM COL_TABLE t2 WHERE COL_TABLE.colour = t2.colour AND COL_TABLE.test = t2.test AND COL_TABLE.time > t2.time)

You can include subqueries (EXISTS/NOT EXISTS) in the WHERE clause of a DELETE statement.
Like subqueries in SELECTs, these can refer to the table in the outer statement to create matches.
In your case, try this:
DELETE FROM my_table
WHERE EXISTS (
    SELECT *
    FROM my_table t2
    WHERE my_table.colour = t2.colour
      AND my_table.test = t2.test
      AND my_table.time > t2.time
)
This statement uses three noteworthy constructs:
Subquery in DELETE
Self-join
Emulation of MIN(...) via a self-join
The subquery with EXISTS is mentioned above.
The self-join is required whenever one row of a table must be compared against other rows of the same table. Finding the minimum value of some column is exactly that.
Normally, you'd use the MIN(...) function to find the minimum. But the minimum can also be defined as the single value for which no lower value exists, and that is the definition used here, because we're not actually interested in the value itself but only want to identify the record that contains it.
(Since we're deleting, our SELECT yields all the non-minimum rows, which we want to delete to keep only the minimums.)
So, what the statement says is:
Delete all records from my_table for which there is at least one record in my_table with the same colour and the same test but a lower time.
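Note that the clarification in the question asks for the earliest time per colour regardless of the test. Under that reading, the same pattern works with the test comparison dropped; this is a sketch only, assuming the question's COL_TABLE schema (rows tied for the minimum time would all survive):

-- Keep only the single earliest row per colour, ignoring test.
DELETE FROM COL_TABLE
WHERE EXISTS (
    SELECT *
    FROM COL_TABLE t2
    WHERE COL_TABLE.colour = t2.colour
      AND COL_TABLE.time > t2.time   -- a strictly earlier row exists, so delete this one
);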


I wonder how to design a good database that sorts data when inserting values in the middle or top

I have a lot of data, and I need the sort order to be reflected when a value is added in the middle or at the top.
For example, a table with increasing data (group_order) is shown below.
     code | group_id | group_order | depth
     -----+----------+-------------+-------
     c1   | Group1   | 1           | 1
  <- c6   | Group1   | 2           | 1
     c2   | Group1   | 2           | 1
     c3   | Group1   | 3           | 1
     c4   | Group1   | 4           | 1
     c5   | Group1   | 5           | 1
As the table above shows, I inserted a row (c6) with group_order 2 into the second position, and then tried to increase the group_order of the rows below it (c2, c3, c4, c5) by 1.
That works, of course, but as I said, the update takes a long time because I have a lot of data.
When I insert the data into the desired location, the values should be sorted in that order.
Please help me.
The database I use is postgresql.
Thank you.
demo:db<>fiddle
I personally think it is never a good idea to manipulate the stored data just for the sake of sort order. Final order is a concern of the view, not of the model, so in my opinion you should calculate the order when you need it rather than maintain it in the data. Insert performance could be another problem, which I am not quite sure about but which should be investigated: since you update all the following rows, those records are locked by the transaction, and during that time you cannot insert another record. Even without a blocking transaction, you would have to think about race conditions (what happens if two records are inserted simultaneously?).
I would introduce a second sort column, an insert number, which records the position at which each row was inserted. It could simply be a serial (auto-increment) column.
And then you could simply order by group_order ASC followed by insert_nr DESC. Benefit: You don't need to manipulate your original data.
If you still need a correct order number, you could create a (materialized) view and add the order number with incrementing numbers using row_number() window function.
Introducing an (auto-)incrementing column insert_nr:
CREATE TABLE mytable (
    code text,
    group_order int,
    insert_nr serial
);
Using it within a view:
CREATE VIEW v_myview AS
SELECT
    code, group_order,
    row_number() OVER (ORDER BY group_order, insert_nr DESC)
FROM (
    SELECT * FROM mytable
) s;
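A quick usage sketch against the table above (the values are illustrative only): a row inserted "in the middle" simply reuses an existing group_order, and the view resolves the final position.

-- Illustrative data: c1..c5 inserted first.
INSERT INTO mytable (code, group_order) VALUES
  ('c1', 1), ('c2', 2), ('c3', 3), ('c4', 4), ('c5', 5);
-- Later insert: reuse group_order = 2 instead of shifting c2..c5.
INSERT INTO mytable (code, group_order) VALUES ('c6', 2);
-- The view lists c6 before c2, because within group_order = 2
-- the newer insert_nr sorts first (DESC).
SELECT * FROM v_myview;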

Report Builder Multiple Group Matrix Calculation

I have a matrix created with the Matrix Wizard in Report Builder 3.0 (2014), with 2 row groups, 1 column group and 2 values. After creating the matrix (including total and subtotal), it looks just fine. But now I want to add one more cell per column group (one row) to hold the value below.
Value = Total of 1st row group + Total of 2nd row group - Total of 3rd row group ...
The matrix as built just shows me the subtotal of each row group, which I don't need.
How do I retrieve the totals calculated by the matrix itself, and how do I identify them by their row group value in an expression? And how do I do this for every column group, each of which has different data?
I tried looking at the expressions in the design view of the matrix; they just show [Sum(MyField)] for every cell in the matrix (total & subtotal).
Or should I do it with another dataset using another query? If so, what query should I use, and how do I put two datasets into one matrix?
My matrix looks something like this:

                                    Column Group
ROW GROUP 1 | ROW GROUP 2          | VALUE 1                      | VALUE 2
Row Group 1 | Row Group 2          | [Sum(MyField)]               | [Sum(MyField)]
            | TOTAL OF ROW GROUP 1 | [Value]                      | [Value]
            | ROW PLAN TO ADD      | [Value(0)+Value(1)-Value(2)] | [Value(0)+Value(1)-Value(2)]

CAPITALS    : column name, constant
[bracketed] : calculated value
Normal      : data inside the table
I am new to Report Builder, sorry if I made any mistake. In case I didn't make myself clear, please do comment and let me know. Thank you in advance.
EDIT: I have figured out an approach that achieves my purpose; see the answer section below. If anyone has another solution, please feel free to post it. Thanks.
I solved this problem by changing my SQL query to add a dummy column that holds a negative value when its row group is meant to be deducted from the subtotal (Value(2)), and placing that column inside the matrix. In the display expression, another IIF statement converts it back to positive for display purposes.
SQL:
SELECT *, (CASE WHEN COL_B = 'VAL_2' THEN -COL_A ELSE COL_A END) AS DMY_COL_A FROM TABLE
Expression:
=IIF(Sum(Fields!DMY_COL_A.Value)<0,-Sum(Fields!DMY_COL_A.Value),Sum(Fields!DMY_COL_A.Value))
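A hedged illustration of why this works (the table and values below are made up; only COL_A, COL_B, 'VAL_2' and DMY_COL_A come from the solution above): once the rows that should be deducted are stored with a negative sign, a plain SUM over all row groups yields Value(0) + Value(1) - Value(2) with no per-group arithmetic in the report.

-- Hypothetical sample: COL_B identifies the row group, COL_A is the measure.
-- 'VAL_2' rows are negated in DMY_COL_A, so SUM(DMY_COL_A) = 10 + 20 - 5 = 25.
SELECT COL_B,
       COL_A,
       CASE WHEN COL_B = 'VAL_2' THEN -COL_A ELSE COL_A END AS DMY_COL_A
FROM (VALUES ('VAL_0', 10), ('VAL_1', 20), ('VAL_2', 5)) AS t (COL_B, COL_A);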

How to find the last 100k rows from a 10000K-row table in Oracle?

When I use this query it takes more than 5 minutes; please give me some other suggestion.
SELECT *
FROM (SELECT id, name, rownum AS RN$$_RowNumber FROM MILLION_1) INNER_TABLE
WHERE RN$$_RowNumber > (V_total_count - V_no_of_rows)
ORDER BY RN$$_RowNumber DESC;
Try the offset clause.
I have a table with about 16M records in it. If I just want the last 100,000 rows, I order them via the ORDER BY clause and then use the OFFSET clause, which basically says: skip past this many rows before returning any data.
select *
from SHERI; -- 15,691,544 Rows
select *
from SHERI
order by COLUMN4 asc
offset 15591444 rows; -- my math was bad, should have offset 15591544 rows to get just the last 100,000
The FETCH FIRST and OFFSET clauses are new for 12c (docs)
If we look at the plan under this query, we can see how the database makes it work:
PLAN_TABLE_OUTPUT
SQL_ID 7wd4ra8pfu1vb, child number 0
-------------------------------------
select * from SHERI order by COLUMN4 asc offset 15591444 rows
Plan hash value: 3535161482
----------------------------------------------
| Id | Operation | Name | E-Rows |
----------------------------------------------
| 0 | SELECT STATEMENT | | |
|* 1 | VIEW | | 15M|
| 2 | WINDOW SORT | | 15M|
| 3 | TABLE ACCESS FULL| SHERI | 15M|
----------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber">15591444)
Note
-----
- Warning: basic plan statistics not available. These are only collected when:
* hint 'gather_plan_statistics' is used for the statement or
* parameter 'statistics_level' is set to 'ALL', at session or system level
'window sort' basically translates to an analytic function.
There are some very thorough answers at this similar question, but I'll try to make them specific to your case.
First, when you say "last 100k rows", what do you mean? It looks like you just want to pull the last 100k rows from an unsorted query, but that doesn't make a lot of sense. If you want the 100k most recent rows, Oracle doesn't guarantee that they'll be at the end of your unsorted query. So you want to order by something which will have the most recent ones at the end.
Also, part of the reason your query is slow is that you're sorting/filtering on the rownum pseudo-column, which can't be indexed. Sorting on a column that has an index would drastically speed this up. So I'd guess you want to order by the id column, which is probably a unique/primary key.
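If id isn't already indexed (it usually is, as the primary key), a hypothetical supporting index would be as simple as:

-- Hypothetical index name; unnecessary if MILLION_1.id is already the primary key.
CREATE INDEX million_1_id_idx ON MILLION_1 (id);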
So this is the old (11g and earlier) way to do this.
select id, name
from (select id, name
from MILLION_1
order by id desc)
where rownum <= 100000;
If you're on 12c or later, there's a newer way to do it.
select id, name
from MILLION_1
order by id desc
fetch first 100000 rows only;
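For completeness, a hedged alternative that also works on 11g expresses the same thing with the ROW_NUMBER() analytic function:

-- Sketch: number the rows from the newest id down, then keep the first 100k.
SELECT id, name
FROM (SELECT id,
             name,
             ROW_NUMBER() OVER (ORDER BY id DESC) AS rn
      FROM MILLION_1)
WHERE rn <= 100000;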

SQL Server 2008 - False error for "Msg 8120"?

I am writing a query in SQL Server 2008 (Express I believe?). I am currently getting this error:
Msg 8120, Level 16, State 1, Line 16
Column 'AIM.dbo.AggTicket.TotDirectHrs' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I am trying to do a historical analysis of our production WIP (Work In Process).
I have created a standalone calendar table (actually located in a separate database called BAS on the same server to not interfere with the ERP that operates the AIM database). I've been overwhelmed for days with some of the examples for creating running total queries/views/tables, so for now I'll just plan on taking care of that part inside of Crystal Reports 2016. My thinking was that I wanted to return records for each order each day of my calendar table (to be narrowed down in the future to only days that match records in the AIM database). The values I think I will need are:
Record Date (not unique)
Order Number (unique for each day)
Estimated hours for the job
The total number of hours worked on the job current as of today's date (in case the estimated hours were drastically underbudgeted)
The SUM of the direct labor hours charged to the job on said record date
The COUNT of the number of employees in attendance on said record date.
The SUM of the hours attended by employees on said record date.
The tables I use are as follows:
BAS Database:
dbo.DateDimension - Used for complete calendar of dates from 1/1/1987 to 12/31/2036
AIM Database:
dbo.AggAttend - Contains one or more records for each employee's attendance duration on a given date (i.e. One record for each punch-in / punch-out. Should be equal to indirect + direct labor)
dbo.AggTicket - Contains one or more records for each employee's direct labor duration charged to a particular order number
dbo.ModOrders - Contains one record for each order including the estimated hours, start date, and end date (I will worry about using the start and end dates later for figuring out how many available hours there were on each date)
Here is the code I'm using in my query:
;WITH OrderTots AS
(
SELECT
AggTicket.OrderNo,
SUM(AggTicket.TotDirectHrs) AS TotActHrs
FROM
AIM.dbo.AggTicket
GROUP BY
AggTicket.OrderNo
)
SELECT
d.Date,
t.OrderNo,
o.EstHrs,
OrderTots.TotActHrs,
SUM(t.TotDirectHrs) OVER (PARTITION BY t.TicketDate) AS DaysDirectHrs,
COUNT(a.EmplCode) AS NumEmployees,
SUM(a.TotHrs) AS DaysAttendHrs
FROM
BAS.dbo.DateDimension d
INNER JOIN
AIM.dbo.AggAttend a ON d.Date = a.TicketDate
LEFT OUTER JOIN
AIM.dbo.AggTicket t ON d.Date = t.TicketDate
LEFT OUTER JOIN
AIM.dbo.ModOrders o ON t.OrderNo = o.OrderNo
LEFT OUTER JOIN
OrderTots ON t.OrderNo = OrderTots.OrderNo
GROUP BY
d.Date, t.TicketDate, t.OrderNo, o.EstHrs,
OrderTots.TotActHrs
ORDER BY
d.Date
When I run that query in SQL Server Management Studio 2017, I get the above error.
These are my questions for the community:
Does this error message correctly describe an error in my code?
If so, why is that error an error? (To the best of my knowledge, everything is already contained in either an aggregate function or in the GROUP BY clause...smh)
What is a better way to write this query so that it will function?
Much appreciation to everyone in advance!
I am writing a query in SQL Server 2008 (Express I believe?).
SELECT @@VERSION will tell you which version you are on.
Column 'AIM.dbo.AggTicket.TotDirectHrs' is invalid in the select list
because it is not contained in either an aggregate function or the
GROUP BY clause.
The problem is with your SUM OVER() statement:
SUM(t.TotDirectHrs) OVER (PARTITION BY t.TicketDate) AS DaysDirectHrs
Here, since you are using the OVER clause in a grouped query, the column it operates on must also be covered by the GROUP BY. The OVER clause determines the partitioning and ordering of a rowset for a window function, so although you are using the SUM aggregate, you are using it as a window function. Window functions belong to a class of functions known as 'set functions', meaning functions that apply to a set of rows. The word 'window' refers to the set of rows that the function works on.
Thus, add t.TotDirectHrs to the GROUP BY
GROUP BY
d.Date, t.TicketDate, t.OrderNo, o.EstHrs,
OrderTots.TotActHrs, t.TotDirectHrs
If this narrows your results into a grouping that you don't want, then you can wrap it in another CTE or use a correlated sub-query. Potentially like the below:
(SELECT DISTINCT SUM(t2.TotDirectHrs) OVER (PARTITION BY t2.TicketDate)
 FROM AIM.dbo.AggTicket t2
 WHERE t2.TicketDate = t.TicketDate) AS DaysDirectHrs,
EXAMPLE
if object_id('tempdb..#test') is not null
drop table #test
create table #test(id int identity(1,1), letter char(1))
insert into #test
values
('a'),
('b'),
('b'),
('c'),
('c'),
('c')
Given the data set above, suppose we wanted to get a count of all rows. That's simple right?
select
TheCount = count(*)
from
#test
+----------+
| TheCount |
+----------+
| 6 |
+----------+
Here, no GROUP BY is needed: since no non-aggregated columns appear in the SELECT list, the whole table is treated as a single group. Remember, GROUP BY groups the SELECT statement results according to the values in a list of one or more column expressions. If aggregate functions are included in the SELECT list, GROUP BY calculates a summary value for each group. These are known as vector aggregates. [MSDN]
Now, suppose we wanted to count each letter in the table. We could do that in at least two ways: using COUNT(*) with the letter column in the select list, or using COUNT(letter) with the letter column in the select list. However, in order to attribute each count to its letter, we need to return the letter column. Thus, we must include letter in the GROUP BY to tell SQL Server what to apply the summary to.
select
letter
,TheCount = count(*)
from
#test
group by
letter
+--------+----------+
| letter | TheCount |
+--------+----------+
| a | 1 |
| b | 2 |
| c | 3 |
+--------+----------+
Now, what if we wanted to return this same count, but return all rows as well? This is where window functions come in. The window function works similarly to GROUP BY in this case by telling SQL Server the set of rows to apply the aggregate to. Its value is then returned for every row in that window/partition. Thus, it produces a column that is applied to every row, making it just like any other column or calculated column returned from the select list.
select
letter
,TheCountOfTheLetter = count(*) over (partition by letter)
from
#test
+--------+---------------------+
| letter | TheCountOfTheLetter |
+--------+---------------------+
| a | 1 |
| b | 2 |
| b | 2 |
| c | 3 |
| c | 3 |
| c | 3 |
+--------+---------------------+
Now we get to your case, where you want to use both a regular aggregate and an aggregate in a window function. Remember that the return value of the window function is treated like any other column, so it would have to appear in the GROUP BY. Pseudo-code would look something like this, but window functions aren't allowed in the GROUP BY clause.
select
letter
,TheCount = count(*)
,TheCountOfTheLetter = count(*) over (partition by letter)
from
#test
group by
letter
,count(*) over (partition by letter)
--returns an error
Thus, we must use a correlated sub-query, a CTE, or some other method.
select
t.letter
,TheCount = count(*)
,TheCountOfTheLetter = (select distinct count(*) over (partition by letter) from #test t2 where t2.letter = t.letter)
from
#test t
group by
t.letter
+--------+----------+---------------------+
| letter | TheCount | TheCountOfTheLetter |
+--------+----------+---------------------+
| a | 1 | 1 |
| b | 2 | 2 |
| c | 3 | 3 |
+--------+----------+---------------------+
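Applied to the original query, one possible shape looks like the sketch below (it reuses the question's table and column names; note that it computes DaysDirectHrs straight from AggTicket for the matching date rather than from the joined rows, which may or may not be the intended figure):

;WITH OrderTots AS
(
    SELECT AggTicket.OrderNo,
           SUM(AggTicket.TotDirectHrs) AS TotActHrs
    FROM AIM.dbo.AggTicket
    GROUP BY AggTicket.OrderNo
)
SELECT
    d.Date,
    t.OrderNo,
    o.EstHrs,
    OrderTots.TotActHrs,
    -- Correlated subquery instead of SUM(...) OVER (...), so nothing extra
    -- has to be added to the GROUP BY below.
    (SELECT SUM(t2.TotDirectHrs)
     FROM AIM.dbo.AggTicket t2
     WHERE t2.TicketDate = t.TicketDate) AS DaysDirectHrs,
    COUNT(a.EmplCode) AS NumEmployees,
    SUM(a.TotHrs) AS DaysAttendHrs
FROM BAS.dbo.DateDimension d
INNER JOIN AIM.dbo.AggAttend a ON d.Date = a.TicketDate
LEFT OUTER JOIN AIM.dbo.AggTicket t ON d.Date = t.TicketDate
LEFT OUTER JOIN AIM.dbo.ModOrders o ON t.OrderNo = o.OrderNo
LEFT OUTER JOIN OrderTots ON t.OrderNo = OrderTots.OrderNo
GROUP BY d.Date, t.TicketDate, t.OrderNo, o.EstHrs, OrderTots.TotActHrs
ORDER BY d.Date;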

Efficiently counting strength of relationship between rows in Postgres

I have a table that looks similar to this:
session_id | sku
------------|-----
a | 1
a | 2
a | 3
a | 4
b | 2
b | 3
c | 3
I want to pivot this into a table similar to this:
sku1 | sku2 | score
------|------|------
1 | 2 | 1
1 | 3 | 1
1 | 4 | 1
2 | 3 | 2
2 | 4 | 1
3 | 4 | 1
The idea is to store a denormalised table that allows one to look up, for a given sku, which other skus have shared a session with it, and how many times both skus were related to the same session.
What algorithms, patterns or strategies could you suggest for implementing this in PostgreSQL or other technologies?
I realise that this kind of lookup can be done on the original table using counts, or using a facetting search engine. However, I want to make the reads more performant, and just want to keep the overall statistics. The idea is that I will be performing this pivot regularly on the newest few thousand rows in the first table, then storing the result in the second. I'm only concerned with approximate statistics for the second table.
I've got some SQL that works, but VERY slowly. Also looking into the potential for using a graph database of some sort, but wanted to avoid adding another technology for a small part of the app.
Update: The SQL below seems performant enough. I can convert 1.2 million rows in the first table (tags) into 250k rows in the second table (product_relations) with around 2-3k variations of sku in about 5 minutes on my iMac. I will realistically be denormalising only up to 10k rows per day. Question is whether this is actually the best approach. Seems a little dirty to me.
BEGIN;
CREATE TEMPORARY TABLE working_tags (tag_id int, session_id varchar, sku varchar) ON COMMIT DROP;
INSERT INTO working_tags
SELECT id,
session_id,
sku
FROM tags
WHERE time < now() - interval '12 hours'
AND processed_product_relation IS NULL
AND sku IS NOT NULL LIMIT 200000;
CREATE TEMPORARY TABLE working_relations (sku1 varchar, sku2 varchar, score int) ON COMMIT DROP;
INSERT INTO working_relations
SELECT a.sku AS sku1,
b.sku AS sku2,
count(DISTINCT a.session_id) AS score
FROM working_tags AS a
INNER JOIN working_tags AS b ON a.session_id = b.session_id
AND a.sku < b.sku
WHERE a.sku IS NOT NULL
AND b.sku IS NOT NULL
GROUP BY a.sku,
b.sku;
UPDATE product_relations
SET score = working_relations.score+product_relations.score
FROM working_relations
WHERE working_relations.sku1 = product_relations.sku1
AND working_relations.sku2 = product_relations.sku2;
INSERT INTO product_relations (sku1, sku2, score)
SELECT working_relations.sku1,
working_relations.sku2,
working_relations.score
FROM working_relations
LEFT OUTER JOIN product_relations ON (working_relations.sku1 = product_relations.sku1
AND working_relations.sku2 = product_relations.sku2)
WHERE product_relations.sku1 IS NULL;
UPDATE tags
SET processed_product_relation = TRUE
WHERE id IN
(SELECT tag_id
FROM working_tags);
COMMIT;
If I've interpreted your intention correctly (per comments) this should do it:
SELECT
s1.sku AS sku1,
s2.sku AS sku2,
count(session_id)
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku
ORDER BY 1,2;
See: http://sqlfiddle.com/#!15/2e0b2/1
In other words: Self-join session, then find all pairings of SKUs for each session ID, excluding ones where the left is greater than or equal to the right in order to avoid repeating pairings - if we have (1,2,count) we don't want (2,1,count) as well. Then group by the SKU pairings and count how many rows are found for each pairing.
You may want to count(distinct session_id) instead, if your SKU pairings can repeat and you want to exclude duplicates. There will probably be more efficient ways to do that, but that's the simplest.
An index on at least session_id will be very useful. You may also want to mess with planner cost parameters to make sure it chooses a good plan - in particular, make sure effective_cache_size is accurate and random_page_cost vs seq_page_cost reflects your caching and I/O costs. Finally, throw as much work_mem at it as you can afford.
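For example, a hedged session-level starting point might look like this (the values are placeholders to tune for your hardware):

SET work_mem = '256MB';            -- memory available to each sort/hash step
SET effective_cache_size = '8GB';  -- roughly the RAM available for caching
SET random_page_cost = 1.1;        -- bring closer to seq_page_cost on SSD-backed storage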
If you're creating a materialized view, just CREATE UNLOGGED TABLE whatever AS SELECT ... . That way you minimise the number of writes/rewrites/overwrites.
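A minimal sketch of that, reusing the pairing query above (the table name product_relations_cache is just an example):

-- UNLOGGED skips WAL, so the table is cheap to build but not crash-safe.
CREATE UNLOGGED TABLE product_relations_cache AS
SELECT s1.sku AS sku1,
       s2.sku AS sku2,
       count(DISTINCT session_id) AS score
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku;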
