Identify first time visitor in Snowflake - snowflake-cloud-data-platform

So, say, I have the following table structure:
VisitorID Date
Kristin `2020-03-01`
Kristin `2020-03-05`
Kristin `2020-03-07`
PL `2020-03-07`
Mithil `2020-03-06`
Kristin `2020-03-02`
The user Kristin visits the page for the first time on 1 March 2020. I need a query that flags this visit as 1; every later visit by Kristin should be flagged 0. By flag I mean a new column that indicates whether the user is a new visitor or not.
I tried using ROW_NUMBER in Snowflake, but I'm not able to group by both date and visitor ID.
select
date,
visitorid,
ROW_NUMBER() OVER (ORDER BY visitorid asc) AS row_number
from PRODUCT_VISITS_UNIQ

Maybe something like this?
select visit_date, name,
IFF((ROW_NUMBER() OVER (PARTITION BY name ORDER BY visit_date asc)) = 1, 'First Visit', '') as status
from
values ('2020-02-01','Hakan'),
('2020-02-01','Gokhan'),
('2020-02-02','Cenk'),
('2020-02-03','Gokhan'),
('2020-02-04','Cenk') p (visit_date, name)
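As a rough, runnable sketch of the same idea applied to the question's own sample rows, the snippet below produces the 1/0 flag instead of a text label. It uses Python's built-in sqlite3 module as a stand-in for Snowflake (an assumption made only so the example can be executed locally; it needs SQLite 3.25 or newer for window functions), and the column alias is_first_visit is a made-up name.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for Snowflake (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_visits_uniq (visitorid TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO product_visits_uniq VALUES (?, ?)",
    [
        ("Kristin", "2020-03-01"),
        ("Kristin", "2020-03-05"),
        ("Kristin", "2020-03-07"),
        ("PL", "2020-03-07"),
        ("Mithil", "2020-03-06"),
        ("Kristin", "2020-03-02"),
    ],
)

# Number each visitor's visits by date; the earliest visit gets row number 1
# and is flagged 1, every later visit is flagged 0.
rows = conn.execute(
    """
    SELECT visitorid,
           date,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY visitorid ORDER BY date) = 1
                THEN 1 ELSE 0 END AS is_first_visit  -- hypothetical alias
    FROM product_visits_uniq
    ORDER BY visitorid, date
    """
).fetchall()

for row in rows:
    print(row)
```

The PARTITION BY restarts the numbering for each visitor, which is what the original attempt with only ORDER BY visitorid was missing. If a visitor can have two rows with the same date, ROW_NUMBER flags only one of them, chosen arbitrarily.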

Related

Create an Identity column [duplicate]

I have a data set that contains the columns Date, Cat, and QTY. What I want to do is add a unique column that only counts unique Cat values when it does the row count. This is what I want my result set to look like:
By using the SQL query below, I'm able to get the row number using the row_number() function.
However, I can't get the unique column that I have depicted above. When I add group by to the OVER clause, it does not work. Does anybody have any ideas as to how I could get this unique count column to work?
SELECT
Date,
ROW_NUMBER() OVER (PARTITION BY Date ORDER By Date, Cat) as ROW,
Cat,
Qty
FROM SOURCE
Here is a solution.
You need not worry about the ordering of Cat. Using the following SQL you will be able to get unique values for your Date & Cat combination.
SELECT
Date,
ROW_NUMBER() OVER (PARTITION BY Date, Cat ORDER By Date, Cat) as ROW,
Cat,
Qty
FROM SOURCE
DENSE_RANK() OVER (PARTITION BY date ORDER BY cat)
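To make the difference between the two functions concrete, here is a small sketch that runs both side by side. The rows are invented for illustration (the question's sample data did not survive), the aliases row_num and cat_num are hypothetical, and SQLite (3.25+) via Python's sqlite3 module stands in for the asker's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (date TEXT, cat TEXT, qty INTEGER)")
# Invented sample rows: two dates, with repeated Cat values within each date.
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        ("2020-01-01", "A", 5),
        ("2020-01-01", "A", 3),
        ("2020-01-01", "B", 2),
        ("2020-01-02", "A", 1),
        ("2020-01-02", "C", 4),
        ("2020-01-02", "C", 7),
    ],
)

# ROW_NUMBER counts every row within a date;
# DENSE_RANK only moves to the next number when Cat changes.
rows = conn.execute(
    """
    SELECT date, cat, qty,
           ROW_NUMBER() OVER (PARTITION BY date ORDER BY cat, qty) AS row_num,
           DENSE_RANK() OVER (PARTITION BY date ORDER BY cat)      AS cat_num
    FROM source
    ORDER BY date, cat, qty
    """
).fetchall()

for row in rows:
    print(row)
```

Rows that share a Cat within the same Date receive the same cat_num, while row_num keeps increasing, which appears to be the "unique" counter the question describes.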

min(count(*)) over... behavior?

I'm trying to understand the behavior of
select ..... ,MIN(count(*)) over (partition by hotelid)
VS
select ..... ,count(*) over (partition by hotelid)
Ok.
I have a list of hotels (1,2,3)
Each hotel has departments.
In each department there are workers.
My Data looks like this :
select * from data
Ok. Looking at this query :
select hotelid,departmentid , cnt= count(*) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
I can perfectly understand what's going on here. On that result set, partitioning by hotelId , we are counting visible rows.
But look what happens with this query :
select hotelid,departmentid , min_cnt = min(count(*)) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
Question:
Where do those numbers come from? I don't understand how adding min caused that result. Min of what?
Can someone please explain how the calculation is made?
fiddle
The two statements are very different. The first query counts the rows that remain after the grouping and then applies the PARTITION. So, for example, Hotel 1 returns 1 row (all rows for Hotel 1 have the same department, A), and so COUNT(*) OVER (PARTITION BY hotelid) returns 1. Hotel 2, however, has 2 departments, 'B' and 'C', and hence returns 2.
In your second query, the COUNT(*) is not within the OVER clause. That means it counts the rows within each group of the GROUP BY specified in your query: GROUP BY hotelid, departmentid. For Hotel 1 there are 4 rows for department A, hence 4; the minimum of that single value is, unsurprisingly, 4. Each of the other hotels has at least one hotel and department combination with only 1 row, and so the minimum returned is 1.
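The two stages described above can be spelled out explicitly by doing the GROUP BY in a subquery and applying the window functions to its result. This is only a sketch: the rows are invented to match the description (the original data is not shown), the names worker, dept_rows and grouped are hypothetical, and SQLite (3.25+) via Python's sqlite3 module is used instead of SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (hotelid INTEGER, departmentid TEXT, worker TEXT)")
# Invented rows: hotel 1 has 4 workers in department A,
# hotel 2 has departments B (2 workers) and C (1 worker), hotel 3 has D (1 worker).
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (1, "A", "w1"), (1, "A", "w2"), (1, "A", "w3"), (1, "A", "w4"),
        (2, "B", "w5"), (2, "B", "w6"), (2, "C", "w7"),
        (3, "D", "w8"),
    ],
)

rows = conn.execute(
    """
    SELECT hotelid,
           departmentid,
           dept_rows,
           -- counts the grouped rows (one per department) in each hotel
           COUNT(*)       OVER (PARTITION BY hotelid) AS cnt,
           -- smallest per-department row count in each hotel
           MIN(dept_rows) OVER (PARTITION BY hotelid) AS min_cnt
    FROM (
        -- stage 1: the plain aggregate, evaluated per GROUP BY group
        SELECT hotelid, departmentid, COUNT(*) AS dept_rows
        FROM data
        GROUP BY hotelid, departmentid
    ) AS grouped
    ORDER BY hotelid, departmentid
    """
).fetchall()

for row in rows:
    print(row)
```

Here min_cnt plays the role of MIN(COUNT(*)) OVER (PARTITION BY hotelid): the inner COUNT(*) is computed per group first, and the window MIN then picks the smallest of those counts within each hotel.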

SQL to identify initial dates for a product that changes from active to cancelled status

I have a table that records the following items:
product_id
product_status
date
Products can exist in the following product statuses: pending, active, or canceled. Only one status can exist per date per product code. A status and product code is inserted for each and every day a product exists.
Utilizing SQL I'd like to be able to identify the initial cancellation dates for a product that cancels more than once in a given time frame.
For example, a product is active for 3 days, then cancelled for 3 days, then active again for 3 days, and then cancelled for another 3 days.
I'd like to be able to identify day 1 of each of the 2 cancellation periods.
Thought I'd get the crystal ball out for this one. This sounds like a Gaps and Islands question. There are plenty of answers on how to do this on the internet; however, this might be what you're after:
CREATE TABLE #Sample (product_id int,
product_status varchar(10),
[date] date); --blargh
INSERT INTO #Sample
VALUES (1,'active', '20170101'),
(1,'active', '20170102'),
(1,'active', '20170103'),
(1,'cancelled', '20170104'),
(1,'cancelled', '20170105'),
(1,'cancelled', '20170106'),
(1,'active', '20170107'),
(1,'pending', '20170108'),
(1,'active', '20170109'),
(1,'cancelled', '20170110'),
(2,'pending', '20170101'),
(2,'active', '20170102'),
(2,'cancelled', '20170103'),
(2,'cancelled', '20170104');
GO
SELECT *
FROM #Sample;
WITH Groups AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY product_id
ORDER BY [date]) -
ROW_NUMBER() OVER (PARTITION BY product_id, product_status
ORDER BY [date]) AS Grp
FROM #Sample)
SELECT product_id, MIN([date]) AS cancellation_start
FROM Groups
WHERE product_status = 'cancelled'
GROUP BY Grp, product_id
ORDER BY product_id, cancellation_start;
GO
DROP TABLE #Sample;
If not, then see Patrick Artnet's comment.
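For readers without a SQL Server instance at hand, the same row-number-difference trick can be tried with Python's sqlite3 module (SQLite 3.25+). This is a sketch using the answer's sample rows with ISO-formatted dates; the table name sample and the CTE name islands are stand-ins chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (product_id INTEGER, product_status TEXT, date TEXT)"
)
statuses_1 = ["active", "active", "active", "cancelled", "cancelled",
              "cancelled", "active", "pending", "active", "cancelled"]
statuses_2 = ["pending", "active", "cancelled", "cancelled"]
data = [(1, s, f"2017-01-{d:02d}") for d, s in enumerate(statuses_1, start=1)]
data += [(2, s, f"2017-01-{d:02d}") for d, s in enumerate(statuses_2, start=1)]
conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", data)

# Within one product, consecutive days with the same status share the same
# difference between the two row numbers, so (product_id, product_status, grp)
# identifies one "island".
rows = conn.execute(
    """
    WITH islands AS (
        SELECT product_id, product_status, date,
               ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY date)
             - ROW_NUMBER() OVER (PARTITION BY product_id, product_status
                                  ORDER BY date) AS grp
        FROM sample
    )
    SELECT product_id, MIN(date) AS cancellation_start
    FROM islands
    WHERE product_status = 'cancelled'
    GROUP BY product_id, grp
    ORDER BY product_id, cancellation_start
    """
).fetchall()

for row in rows:
    print(row)
```

Product 1 yields two cancellation start dates and product 2 yields one, matching the two separate cancellation periods in the sample data.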

SQL Delete Rows with Duplicate Key Keeping Most Recent

Dearest Professionals,
I have a table that sometimes has rows created with duplicate Invoice #'s (EMP_ID). In these rows, there are separate date (FILE_DATE) and time (FILE_TIME) columns (genius database design there). I need to remove the older rows of any duplicated EMP_IDs in this database, keeping the row with the most recent date (from FILE_DATE) + time (from FILE_TIME).
Both FILE_DATE and FILE_TIME are date/time fields in the database. The software we use writes to this table, adding the date of the invoice to the FILE_DATE column as YYYY-MM-DD 00:00:00.000 (the zeros all hard coded). The FILE_TIME field then has 1900-01-01 HH:mm:ss.SSS, with the 1900-01-01 hard coded (the time stamp comes from the time the row was written to the database).
So, long story short, I need to marry these two together, taking the date portion of FILE_DATE and the time portion of FILE_TIME, to find the most recent row (if duplicates of EMP_ID exist) and delete all duplicates that are not the most recent by the married FILE_DATE & FILE_TIME.
Here is a sample of what a Before & After situation would look like.
BEFORE:
AFTER:
Any and all help would be insanely appreciated.
Using some good old CTE "magic":
WITH CTE AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY EMP_ID
ORDER BY FILE_DATE DESC, FILE_TIME DESC) AS RN
FROM YourTable)
DELETE FROM CTE
WHERE RN > 1;
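A hedged sketch of the same keep-the-newest logic, runnable with Python's sqlite3 module (SQLite 3.25+), follows. SQLite cannot delete through a CTE the way SQL Server can, so this version deletes by key using an IN subquery; the surrogate id column, the table name your_table, and the sample rows are all assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE your_table (
        id        INTEGER PRIMARY KEY,  -- hypothetical surrogate key for the demo
        emp_id    INTEGER,
        file_date TEXT,
        file_time TEXT
    )
    """
)
# Invented rows: emp_id 100 is duplicated three times, emp_id 200 is unique.
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2020-01-01 00:00:00.000", "1900-01-01 09:00:00.000"),
        (2, 100, "2020-01-01 00:00:00.000", "1900-01-01 15:30:00.000"),
        (3, 100, "2020-01-02 00:00:00.000", "1900-01-01 08:00:00.000"),
        (4, 200, "2020-01-05 00:00:00.000", "1900-01-01 10:00:00.000"),
    ],
)

# Rank each emp_id's rows newest first; everything ranked after the first
# row is an older duplicate and gets deleted.
conn.execute(
    """
    DELETE FROM your_table
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY emp_id
                                      ORDER BY file_date DESC, file_time DESC) AS rn
            FROM your_table
        )
        WHERE rn > 1
    )
    """
)

rows = conn.execute("SELECT id, emp_id FROM your_table ORDER BY id").fetchall()
print(rows)
```

Because FILE_DATE always carries a midnight time and FILE_TIME always carries the same 1900-01-01 date, ordering by the two columns in turn gives the same result as ordering by a combined timestamp, so the columns do not need to be merged first.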
I think this can be accomplished using MAX and GROUP BY:
select B.EMP_ID
, B.File_date
, Max(B.File_Time) as MaxFileTime
, B.DESC_TEXT_1
from Before B
group
by B.EMP_ID
, B.File_date
, B.DESC_TEXT_1

SQL - Insert multiple records from select

I have been searching all day but could not find answer to this:
I have a table on SQL Server:
dbo.Program
with fields:
Program.id...PK autoincrement
Program.user...varchar
Program.program...varchar
Program.installed...boolean
Program.department...varchar
Program.wheninstalled...date
Now, I want to insert a new record for every distinct user and copy the department from his latest(Program.wheninstalled) record with other values the same for every user:
Program.user...every unique user
Program.program...MyMostAwesomeProgram
Program.installed...false
Program.department...department of the record with the latest program.wheninstalled field of all the records of the unique user in program.user
Program.wheninstalled...null
I know how to do it in an ugly way:
1) select the latest records for every user and their department in that record
2) extract values from 1) and make it into insert into
(field1, field2...fieldX) values
(1records_value1, 1records_value2...1records_valueX),
(2records_value1, 2records_value2...2records_valueX),
...
(Nrecords_value1, Nrecords_value2...Nrecords_valueX)
but I would like to know how to do it in a better way. Oh, and I cannot use a proper HR database to make my life easier, so this is what I've got to work with for now.
I'm a postgres guy, but something akin to the below should work:
insert into Program (user, program, installed, department, wheninstalled)
select user,
'MyMostAwesomeProgram',
false,
(select department from someTable where u.user = ...),
null
from users as u;
from https://stackoverflow.com/a/23905173/3430807
You said you know how to do the select:
insert into Program (user, program, installed, department, wheninstalled)
select user, 'MyMostAwesomeProgram', 'false' , department, null
from ...
This should do it:
insert into Program ([user], program, installed, department, whenInstalled)
select [user], program, installed, department, whenInstalled
from
(
-- [user] is bracketed because USER is a reserved keyword in SQL Server
select [user]
, 'MyMostAwesomeProgram' program
, 0 installed
, department
, null whenInstalled
, row_number() over (partition by [user] order by whenInstalled desc, id desc) r
from Program
) p
where p.r = 1
The interesting bit is the row_number() over (partition by [user] order by whenInstalled desc, id desc) r.
This says to return a column, r, which holds values 1..n for each user, counting up according to the order by clause (i.e. starting with the most recent whenInstalled and working backwards).
I also included the id field in the order by clause in case there were two installs for the same user on the same date; in such a case the most recently added (the one with the higher id) is used first.
We then put this in a subquery, and select only the first record; thus we have 1 record per user, and it's the most recent.
The only values we use from this record are the user and department fields; all else is defined per your defaults.
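The approach above can be tried end to end with Python's sqlite3 module (SQLite 3.25+) standing in for SQL Server. This is only a sketch: the column is named user_name here to sidestep keyword issues, and the sample users, programs and departments are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE program (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        user_name     TEXT,     -- stands in for Program.user (renamed for the demo)
        program       TEXT,
        installed     INTEGER,  -- 0 / 1 used as a boolean
        department    TEXT,
        wheninstalled TEXT
    )
    """
)
# Invented rows; bob has two installs on the same date to exercise the id tie-break.
conn.executemany(
    "INSERT INTO program (user_name, program, installed, department, wheninstalled) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("alice", "Editor", 1, "Sales", "2019-01-10"),
        ("alice", "Browser", 1, "Marketing", "2020-05-01"),
        ("bob", "Editor", 1, "IT", "2018-03-03"),
        ("bob", "Mail", 1, "Support", "2018-03-03"),
    ],
)

# One new row per user, copying the department from that user's latest record.
conn.execute(
    """
    INSERT INTO program (user_name, program, installed, department, wheninstalled)
    SELECT user_name, 'MyMostAwesomeProgram', 0, department, NULL
    FROM (
        SELECT user_name,
               department,
               ROW_NUMBER() OVER (PARTITION BY user_name
                                  ORDER BY wheninstalled DESC, id DESC) AS r
        FROM program
    ) AS p
    WHERE p.r = 1
    """
)

rows = conn.execute(
    "SELECT user_name, department FROM program "
    "WHERE program = 'MyMostAwesomeProgram' ORDER BY user_name"
).fetchall()
print(rows)
```

Each user ends up with exactly one new MyMostAwesomeProgram row; for bob, whose two installs share a date, the department comes from the row with the higher id.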
So I am gonna answer this for some other people who might google this:
To copy records from one table into another existing table you use an INSERT INTO ... SELECT statement (SELECT ... INTO creates a new table instead).
I like using WITH XXX AS (some select) to define a, say, "virtual" table that you can work with during the query.
As JohnLBevan mentioned, a very useful function is ROW_NUMBER with OVER and PARTITION BY; what I had missed is RANK.
So once you read on these you should be able to understand how to do what I wanted to do.
