PostgreSQL: inserted rows differ from SELECT

I have a problem with an INSERT in PostgreSQL. I have this query:
INSERT INTO track_segments(tid, gdid1, gdid2, distance, speed)
SELECT * FROM (
SELECT DISTINCT ON (pga.gdid)
pga.tid as ntid,
pga.gdid as gdid1, pgb.gdid as gdid2,
ST_Distance(pga.geopoint, pgb.geopoint) AS segdist,
(ST_Distance(pga.geopoint, pgb.geopoint) / EXTRACT(EPOCH FROM (pgb.timestamp - pga.timestamp + interval '0.1 second'))) as speed
FROM fl_pure_geodata AS pga
LEFT OUTER JOIN fl_pure_geodata AS pgb ON (pga.timestamp < pgb.timestamp AND pga.tid = pgb.tid)
ORDER BY pga.gdid ASC) AS sq
WHERE sq.gdid2 IS NOT NULL;
to fill a table with pairwise connected segments of geopoints. When I run the SELECT alone I get the correct pairs, but when I use it in the INSERT statement above, some are paired the wrong way or not at all. Here's what I mean:
result of SELECT alone:
tid;gdid1;gdid2;distance;speed
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";10;11;34.105058803;31.0045989118182
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";11;12;90.099603143;14.7704267447541
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";12;13;23.331326565;21.2102968772727
result after INSERT with the same SELECT:
tid;gdid1;gdid2;distance;speed
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";10;12;122.574;17.2639603638028
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";11;12;90.0996;14.7704267447541
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";12;13;23.3313;21.2102968772727
What could be the cause of that? It's exactly the same SELECT statement inside the INSERT, so why does it give different results?

DISTINCT ON (pga.gdid) can pick any row from a set of rows with equal pga.gdid. You can get different results even by executing the same query several times. Add additional ordering to get consistent results, something like: pga.gdid ASC, pgb.gdid ASC.
BTW You may want to order by pga.gdid ASC, pgb.timestamp - pga.timestamp ASC to get the "next" point, as in the sketch below.
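A minimal adjustment of the original query along those lines (only the ORDER BY changes, so that ties on pga.gdid are broken by the smallest time difference; an untested sketch):
INSERT INTO track_segments(tid, gdid1, gdid2, distance, speed)
SELECT * FROM (
SELECT DISTINCT ON (pga.gdid)
pga.tid as ntid,
pga.gdid as gdid1, pgb.gdid as gdid2,
ST_Distance(pga.geopoint, pgb.geopoint) AS segdist,
(ST_Distance(pga.geopoint, pgb.geopoint) / EXTRACT(EPOCH FROM (pgb.timestamp - pga.timestamp + interval '0.1 second'))) as speed
FROM fl_pure_geodata AS pga
LEFT OUTER JOIN fl_pure_geodata AS pgb ON (pga.timestamp < pgb.timestamp AND pga.tid = pgb.tid)
-- the added tie-break: among all later points of the same track, prefer the nearest in time
ORDER BY pga.gdid ASC, pgb.timestamp - pga.timestamp ASC) AS sq
WHERE sq.gdid2 IS NOT NULL;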
BTW2 It may be easier to use the lead() or lag() window functions to calculate differences between the current row and the next/previous one. That way you won't need a self-join and will likely get better performance.
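A rough sketch of the lead() variant, using the table and column names from the question (untested, a starting point rather than a drop-in replacement):
INSERT INTO track_segments(tid, gdid1, gdid2, distance, speed)
SELECT tid, gdid1, gdid2, distance,
       distance / EXTRACT(EPOCH FROM (t2 - t1 + interval '0.1 second')) AS speed
FROM (
    SELECT g.tid,
           g.gdid AS gdid1,
           lead(g.gdid) OVER w AS gdid2,
           -- distance to the next point of the same track, NULL for the last point
           ST_Distance(g.geopoint, lead(g.geopoint) OVER w) AS distance,
           g.timestamp AS t1,
           lead(g.timestamp) OVER w AS t2
    FROM fl_pure_geodata AS g
    WINDOW w AS (PARTITION BY g.tid ORDER BY g.timestamp)
) AS sq
WHERE gdid2 IS NOT NULL;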

You are ordering your query results only by the column pga.gdid, so rows that tie on pga.gdid are left in an unspecified relative order; Postgres can return them in a different order each time you run the query, and DISTINCT ON will then keep a different row from each group of ties.

Related

How to select first rows distinct by a column name in a sub-query in SQL Server?

Actually I am building a Skype-like tool wherein I have to show the last 10 distinct users who have logged in to my web application.
I have maintained a table in SQL Server with a field called last_active_time. So, my requirement is to sort the table by last_active_time and show all the columns of the last 10 distinct users.
There is another field called WWID which uniquely identifies a user.
I am able to find the distinct WWIDs but not able to select all the columns of those rows.
I am using the query below for finding the distinct WWIDs:
select distinct(wwid) from(select top 100 * from dbo.rvpvisitors where last_active_time!='' order by last_active_time DESC) as newView;
But how do I find those distinct rows? I want to show how long they have been away from the web app, using the difference between the current time and last_active_time.
I am new to SQL, maybe the question is naive, but I am struggling to get it right.
If you are using proper data types for your columns you won't need a subquery to get that result; the following query should do the trick:
SELECT TOP 10
    [wwid],
    MAX([last_active_time]) AS [last_active_time]
FROM [dbo].[rvpvisitors]
WHERE [last_active_time] != ''
GROUP BY [wwid]
ORDER BY MAX([last_active_time]) DESC
If the column [last_active_time] is of type varchar/nvarchar (which is probably the case, since you check for empty strings in the WHERE clause) you might need to use CAST or CONVERT to treat it as an actual date, and be able to use functions like MIN/MAX on it.
In general I would suggest using proper data types for your columns; if you have date or timestamp data, use the "date" or "datetime2" data types.
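For example, a sketch of the same query with an explicit conversion (style 120 assumes strings like 'yyyy-mm-dd hh:mi:ss'; adjust to the actual format stored in your table):
SELECT TOP 10
    [wwid],
    MAX(CONVERT(datetime2, [last_active_time], 120)) AS [last_active_time]
FROM [dbo].[rvpvisitors]
WHERE [last_active_time] != ''
GROUP BY [wwid]
ORDER BY MAX(CONVERT(datetime2, [last_active_time], 120)) DESC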
Edit:
The query aggregates the data by the column [wwid], and for each wwid returns the maximum [last_active_time].
The result is then sorted and filtered.
In order to add more columns "as-is" (without aggregating them), just add them in the SELECT and GROUP BY sections.
If you need more aggregated columns, add them in the SELECT with the appropriate aggregation function (MIN/MAX/SUM/etc.).
I suggest you have a look at GROUP BY on W3.
To learn more about the "execution order" of the clauses, you can have a look here.
You can solve problems like this by rank-ordering the rows by a key and keeping only the latest row for each key; this removes duplicates while preserving the key order.
;WITH RankOrdered AS
(
    SELECT *,
           wwidRank = ROW_NUMBER() OVER (PARTITION BY wwid ORDER BY last_active_time DESC)
    FROM dbo.rvpvisitors
    WHERE last_active_time != ''
)
SELECT TOP(10) * FROM RankOrdered WHERE wwidRank = 1
If my understanding is right, the query below will give the desired output. You can add conditions according to your need. (DISTINCT cannot be combined with an ORDER BY column that is not in the select list, so GROUP BY is used instead.)
select top 10 wwid from dbo.rvpvisitors group by wwid order by max(last_active_time) desc

SQL Server: random number in WHERE clause

As far as I am aware, the only way to get a random value per row in a SELECT statement is by using the newid() function, as the rand() function doesn't generate new values for each row.
This leads to the following awkward construction to get a random number from, say, 0 - 9:
abs(checksum(newid())) % 10
If I use this expression in the SELECT clause, it behaves as expected. However, if I try something like the following:
select *
from table
where abs(checksum(newid())) % 10>4;
I should have thought that I would get roughly half the rows. Instead I get all or none of them. Apparently newid() is only evaluated once, instead of once for each row.
The question is, how can I use a random number in the WHERE clause?
More
There is a similar question which asks for a fixed number of rows at random. In the above example I could have used:
select top 50 percent * from table order by newid();
which will get me what I am looking for.
The question remains: how can I use a random number in the WHERE clause? For example, is it possible to do something like this?
select *
from table
where code={random number};
Here is one way to get around the problem:
SELECT *
FROM (SELECT *,
             Abs(Checksum(Newid())) % 10 AS ran
      FROM yourtable) a
WHERE ran > 4;
For some reason newid() in the WHERE clause is executed only once and compared with the resulting constant.
When I check the execution plans, your query is missing a Compute Scalar operator, whereas my query has one present.
The function newid() is calculated only once in the WHERE clause, not row by row. The trick is to force it to run row by row.
Of course it is possible to include it in a SELECT clause, and, in turn, include that in a CTE or a subquery, as per the other answers.
Microsoft offer a solution here: https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms189108(v=sql.105)?redirectedfrom=MSDN
The trick is to force newid() to recalculate by combining it with some row value. This is easily done in the checksum() function.
For example:
SELECT *
FROM table
WHERE abs(checksum(newid(),id)) % 10>4;
I should have thought that I would get roughly half the rows. Instead I get all or none of them
You may get all of the rows or none of them, since NEWID() is executed once per query when you use it in the WHERE clause. This is explained here by Conor Cunningham; the technical term for this is runtime constants.
You can look at your execution plan and look out for the expression below
Const ConstValue
which you can see is calculated once and used throughout; since you are then doing just a boolean comparison against a constant, you will end up with all rows or none.
You have to use a CTE like the one stated in another answer, or use TOP with ORDER BY NEWID(), or TABLESAMPLE to return random rows.
You may find the TABLESAMPLE option more helpful, since it may not have to go through all the table data to get a sample set of rows, unlike NEWID().
Below is one example on a table having 1000000 rows:
select * from Orders
TABLESAMPLE (50 PERCENT)
(execution plan screenshot omitted)

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example: inside the UnitID column there are two separate records for 105. I only want the returned data set to include the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, albeit I am not that experienced, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example (I can't read your first column name, so I'm calling it JobUnitK):
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use the RANK function. Rank the rows OVER (PARTITION BY UnitId) and pick the rows with rank 2.
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
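A quick sketch of that approach, assuming the column names used elsewhere on this page (JobUnitKeyID as the ordering key):
WITH Ranked AS
(
    SELECT *,
           RANK() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID) AS rnk
    FROM JobUnits
)
SELECT * FROM Ranked WHERE rnk = 2;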
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
   SELECT
      DupID = Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID),
      *
   FROM JobUnits
   WHERE DispatchDate = '20151004'
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2
ORDER BY UnitID DESC;
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate, in which case using Min(JobUnitKeyID) in conjunction with UnitID to join back on the rows where JobUnitKeyID <> MinJobUnitKeyID is required.
Also, using Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.

Select query optimisation

I have a large table with ID, date, and some other columns. ID is indexed and sequential.
I want to select all rows after a certain date. Given that the IDs are sequential, if the rows are ordered by ID in decreasing order, then once the first row fails the date test there is no need to carry on checking. How can I make use of the index to optimise this?
You could do something like this:
With FirstFailDate AS
(
    -- You start by selecting the first row (highest ID) that fails the date test
    SELECT TOP 1 * FROM YOUR_TABLE WHERE /* DATE TEST FAILING */ ORDER BY ID DESC
)
SELECT *
FROM YOUR_TABLE t
-- Then, you join your table with the first fail date, and get all the records
-- that come after it (by ID)
JOIN FirstFailDate f
   ON t.ID > f.ID
I don't think there is a good "legal" way to do this without actually indexing date.
However, you could try something like this:
Issue the following query to the DBMS: SELECT * FROM YOUR_TABLE ORDER BY ID DESC.
Start fetching the rows in your client application.
As you fetch, check the date.
Stop fetching (and close the cursor) when the date passes the limit.
The idea is that a DBMS sometimes doesn't have to finish the whole query before starting to send partial results to the client. In this case, the hope is that the DBMS will perform an index scan on ID (due to the ORDER BY ID DESC), and you'll be able to get the results as they are produced and stop the query before it has even finished.
NOTE: If your DBMS gives you an option to balance between getting the first row fast and getting the whole result fast, pick the first option (such as the /*+ FIRST_ROWS */ hint under Oracle).
Of course, perform measurements on realistic amounts of data, to make sure this actually works in your particular situation.
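For example, on SQL Server the closest analogue of Oracle's FIRST_ROWS hint is the FAST query hint; a sketch with placeholder names:
SELECT *
FROM YOUR_TABLE
ORDER BY ID DESC
OPTION (FAST 100); -- optimize the plan for returning the first 100 rows quickly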

Microsoft SQL Server Paging

There are a number of SQL Server paging questions on Stack Overflow and many of them talk about using ROW_NUMBER() OVER (ORDER BY ...) and a CTE. Once you get into the hundreds of thousands of rows and start adding sorting on non-primary-key values and custom WHERE clauses, these methods become very inefficient. I have a dataset of several million rows I am trying to page through with custom sorting and filtering, but I am getting poor performance, even with indexes on all the fields that I sort and filter by. I even went as far as to include my SELECT columns in each of the indexes, but this barely helped and severely bloated my database.
I noticed the stackoverflow paging only takes about 500 milliseconds no matter what sorting criteria or page number you click on. Anyone know how to make paging work efficiently in SQL Server 2008 with millions of rows? This would include getting the total rows as efficiently as possible.
My current query has the exact same logic as this stackoverflow question about paging:
Best paging solution using SQL Server 2005?
Anyone know how to make paging work efficiently in SQL Server 2008 with millions of rows?
If you want accurate perfect paging, there is no substitute for building an index key (position row number) for each record. However, there are alternatives.
(1) total number of pages (records)
You can use an approximation from sysindexes.rows (almost instant), assuming the rate of change is small; see the sketch below.
You can use triggers to maintain a completely accurate, to the second, table row count
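A sketch of the catalog approach (sys.partitions is the modern replacement for sysindexes; the table name is a placeholder):
SELECT SUM(p.rows) AS approx_rows
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID('dbo.YourTable')
  AND p.index_id IN (0, 1); -- 0 = heap, 1 = clustered index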
(2) paging
(a)
You can show page jumps within, say, the next five pages to either side of a record. These need to scan at most {page size} x 5 rows on each side. If your underlying query lends itself to travelling along the sort order quickly, this should not be slow. So given a record X, you can go to the previous page using (assuming the sort order is a asc, b desc):
select top(@pagesize) t.*
from tbl x
inner join tbl t on (t.a = x.a and t.b > x.b) OR
                    (t.a < x.a)
where x.id = @X
order by t.a asc, t.b desc
(i.e. the last {page size} of records prior to X)
To go five pages back, you increase it to TOP(@pagesize*5), then take a further TOP(@pagesize) from that subquery.
Downside: this option does not let you jump directly to a particular location; your options are only FIRST (easy), LAST (easy), NEXT/PRIOR, or up to 5 pages to either side.
(b)
If the paging is always going to be quite specific and predictable, maintain an INDEXED view or trigger-updated table that does not contain gaps in the row number. This may be an option if the tables normally only see updates at one end of the spectrum, with gaps from deletes easily filled quickly by shifting not-so-many records.
This approach gives you a rowcount (last row) and also direct access to any page.
Try this; let's say you have a country table as below:
DECLARE @pageIndex INT = 0;
DECLARE @pageSize INT = 10;
DECLARE @sortByColumn NVARCHAR(200) = 'Code';
DECLARE @sortByDesc BIT = 0;

;WITH tbl AS (
    SELECT COUNT(Id) OVER() [RowTotal], c.Id, c.Code, c.Name
    FROM dbo.[Country] c
    ORDER BY
        CASE WHEN @sortByColumn='Code' AND @sortByDesc=0  THEN c.Code END ASC,
        CASE WHEN @sortByColumn='Code' AND @sortByDesc<>0 THEN c.Code END DESC,
        CASE WHEN @sortByColumn='Name' AND @sortByDesc=0  THEN c.Name END ASC,
        CASE WHEN @sortByColumn='Name' AND @sortByDesc<>0 THEN c.Name END DESC,
        c.Name ASC --DEFAULT SORTING ORDER
    OFFSET @pageIndex*@pageSize ROWS
    FETCH NEXT @pageSize ROWS ONLY
)
SELECT (@pageIndex*@pageSize)+(ROW_NUMBER() OVER(ORDER BY Id)) [RowNo], * FROM tbl;
