We have started using BigQuery for event logging from our games.
We collect events from App Engine nodes and enqueue them in chunks now and then, placing them in a task queue.
A backend then processes this queue and uploads the events to BigQuery.
Today we store roughly 60 million daily events from one of our games and 6 million from another.
We have also built cron jobs that process these events to gather various gaming KPIs (e.g. second-day retention, active users, etc.).
Everything has gone quite smoothly, but we now face a tricky problem!
======== Question 1 ===============================================
For some reason the deletion of the queue tasks fails now and then. Not very often, but when it happens it often comes in bursts.
TransientFailureException is probably the cause ... I say probably because we are deleting processed events in batch mode, i.e. ...
List<Boolean> Queue.deleteTask(List<TaskHandle> tasksToDelete)
... so we actually don't know why we failed to delete a task.
We have now added retry code that tries to delete the failed tasks again.
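For reference, a simplified sketch of what that retry code looks like (the queue name and retry limit here are made up for illustration):

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskHandle;
import java.util.ArrayList;
import java.util.List;

// Delete processed tasks in batch; retry whatever deleteTask() reported as failed.
static void deleteWithRetry(List<TaskHandle> tasks) {
    Queue queue = QueueFactory.getQueue("event-upload"); // hypothetical queue name
    List<TaskHandle> remaining = tasks;
    for (int attempt = 0; attempt < 3 && !remaining.isEmpty(); attempt++) { // retry limit is arbitrary
        List<Boolean> results = queue.deleteTask(remaining); // true = deleted, false = failed
        List<TaskHandle> failed = new ArrayList<TaskHandle>();
        for (int i = 0; i < results.size(); i++) {
            if (!results.get(i)) {
                failed.add(remaining.get(i)); // keep for the next attempt
            }
        }
        remaining = failed;
    }
}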
Is there a best practice to deal with this kind of problem?
========= Question 2 =======================================================
Duplicate detection
The following SQL succeeds in finding duplicates for our smaller game but exceeds resources for
the bigger one.
SELECT DATE(ts) date, SUM(duplicates) - COUNT(duplicates) as duplicates
FROM (
SELECT ts, eventId, userId, count(*) duplicates
FROM [analytics_davincigameserver.events_app1_v2_201308]
GROUP EACH BY ts, eventId, userId
HAVING duplicates > 1
)
GROUP EACH BY date
Is there a way to detect duplicates even for our bigger game?
I.e. a query that BigQuery can run over our 60 million daily rows to locate duplicates.
Thanks in advance!
For question #2 (I'd prefer these were posted as separate questions, to skip this step and avoid confusion):
Resources are exhausted on the inner query, or the outer query?
Does this work?
SELECT ts, eventId, userId, count(*) duplicates
FROM [analytics_davincigameserver.events_app1_v2_201308]
GROUP EACH BY ts, eventId, userId
HAVING duplicates > 1
What about reducing the cardinality? I'm guessing that since you are grouping by timestamp, there might be too many different buckets to group by. Does this work better?
SELECT ts, eventId, userId, count(*) duplicates
FROM [analytics_davincigameserver.events_app1_v2_201308]
WHERE ABS(HASH(ts) % 10) = 1
GROUP EACH BY ts, eventId, userId
HAVING duplicates > 1
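If the sampled version works, you could repeat it for each remainder (0 through 9) and add up the counts to cover the whole table, e.g. for the next bucket:

SELECT ts, eventId, userId, count(*) duplicates
FROM [analytics_davincigameserver.events_app1_v2_201308]
WHERE ABS(HASH(ts) % 10) = 2
GROUP EACH BY ts, eventId, userId
HAVING duplicates > 1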
Related
I have a query I'm running daily, and I'd like to exclude certain items that I've already identified as not wanting to see, without adding another table to the database. Below is my attempt at this, which works with one TaskID but not with multiple TaskIDs as I'm trying to do.
I feel it's also important to note that this list could grow to roughly 150 IDs, but not necessarily over 200, if that makes a difference. Obviously the way I did it is not the best way. Can anyone recommend the best way to accomplish this?
Direct question: what is the best way to exclude a large number of TaskIDs from the below query without creating another table?
SELECT
TaskID, MAX(timeended) AS 'Last Run'
FROM
[moveitautomationagain].[dbo].taskruns
WHERE
TaskID <> 222300 OR TaskID <> 103439128
GROUP BY
TaskID
HAVING
DATEDIFF(HOUR, MAX(timeended), SYSDATETIME()) > 24
For your query you need AND, not OR. However, I would use NOT IN, i.e.
where TaskID not in (222300, 103439128)
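Applied to your query (just a sketch; extend the list as your set of excluded IDs grows):

SELECT
    TaskID, MAX(timeended) AS 'Last Run'
FROM
    [moveitautomationagain].[dbo].taskruns
WHERE
    TaskID NOT IN (222300, 103439128)
GROUP BY
    TaskID
HAVING
    DATEDIFF(HOUR, MAX(timeended), SYSDATETIME()) > 24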
I am currently developing a program to retrieve records from a database based on a number of tables.
Here is my SQL command string.
select distinct a.id,
a.name,
b.status,
b.date
from table1 a
left join table2 b
on a.id = b.id where #$%^$#%#
Some of the tables have around 50 million records or more. Most of the time the sockets do not time out, because users supply WHERE clauses for what they want. However, when the program tries to retrieve all the records from the database, it hits a socket error because the retrieval takes more time than is allowed.
One of my thoughts is to limit the rows retrieved by using rownum, because users might not really want that much data from the tables. For example, a user could specify the maximum number of rows they want, say 10 thousand records, and I would return 10,000 records to them. But I fail to retrieve exactly that number of records using rownum < 10000, and I don't know how to make it work either...
So here I am asking for suggestions from the professional developers here. Please help! Thanks so much!
First of all you have to make it clear (to yourself) what data you need.
If you need to generate overall statistics, then you need all data. Saving intermediate results may help, but you still have to read everything. In that case set the socket timeout to some 24 hours, just make sure your SELECTs don't block other processes.
But if you are making a client application (displaying some data in an HTML table), then you definitely do not need everything. Design your application so that users apply filters first, then they receive the first result page, then the second... See how Google search or e-shops work - you get an empty homepage first (possibly with some promotion), after that you start filtering.
Secondly, technical ideas:
Row limiting was introduced in Oracle 12c, so you can use SELECT * FROM table OFFSET 0 ROWS FETCH NEXT 10000 ROWS ONLY. For older versions you have to use the old WHERE rownum <= 10000, which does not combine well with ORDER BY.
Save intermediate results when using aggregations, etc.
Minimize the need of JOINs (denormalize).
I can think of using an optimizer hint to tell Oracle that you want the first n rows fast, as described here: https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm#4942
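For example (a sketch against the question's table1, capping the result at 10000 rows):

SELECT /*+ FIRST_ROWS(10000) */ *   -- ask the optimizer to return the first 10000 rows quickly
FROM table1
WHERE rownum <= 10000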
The other two answers already mention that you can implement paging using an ORDER BY clause and rownum, like this:
select * from (
SELECT t.*, rownum rnum FROM (SELECT foo FROM Foo ORDER BY OFR_ID) t WHERE rownum <= stopOffset
) where rnum >= startOffset
or by using OFFSET in modern Oracle.
I want to point out an additional thing that shows up when you want to retrieve many rows (hundreds of thousands to millions) to process in an application: be sure to set a large enough fetch size (usually in the range 1000 to 5000) in your application when you do it. There is typically a big difference in execution time between retrieving results with the default fetch size and retrieving them with a larger one when you know there will be a lot of rows. For example, in Java you can explicitly set the fetch size on your Statement object when crafting a query:
PreparedStatement statement = connection.prepareStatement(query);
statement.setFetchSize(1000); // ask the driver to fetch 1000 rows per round trip instead of its default
I am running SQL Server 2012 and this one query is killing my database performance.
My text message provider does not support scheduled text messages so I have a text message engine that picks up messages from the database and sends them at the scheduled time. I put this query together that gets the messages from the database and also changes their status so that they do not get picked up again.
The query works fine; it is just causing wait times on the CPU, especially since it runs every other second. I installed database performance monitoring software, and it says this query accounts for 92% of instance execution time. It also says that every single execution does 347,267 logical reads.
Any ideas on how to make this perform better?
Should I maybe select into a temporary table and update those results before returning them?
Here is the current query:
UPDATE TOP (30) dbo.Outgoing
SET Status = 2
OUTPUT INSERTED.OutgoingID, INSERTED.[Message], n.PhoneNumber, c.OptInStatus
FROM dbo.Outgoing o
JOIN Numbers n on n.NumberID = o.NumberID
LEFT JOIN Contacts c on c.ContactID = o.ContactID
WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1
Here is the execution plan
There are three tables involved in this query: Outgoing, Numbers, & Contacts
Outgoing is the main table that this query deals with. There are only two indexes right now, a clustered primary key index on OutgoingID [PK, bigint, not null] and a non-clustered, non-unique index on SmsId [varchar(255), null] which is an identifier sent back from our text message provider once the messages are successfully received in their system. The Status column is just an integer column that relates to a few different statuses (Scheduled, Queued, Sent, Failed, Etc)
Numbers is just a simple table where we store unique cell phone numbers, some different formats of that number, and some basic information identifying the customer such as First name, carrier, etc. It just has a clustered primary key index on NumberID [bigint]. The PhoneNumber column is just a varchar(15).
The Contacts table just connects the individual person (phone number) to one of our merchants and keeps up with the number's opt in status, and other information related to the customer/merchant relationship. The only columns related to this query are OptInStatus [bit, not null] and ContactID [PK, bigint, not null]
--UPDATE--
Added a non-clustered index on the Outgoing table with columns (Scheduled, SmsId, Status) and that seems to have brought the execution time down from 2+ seconds to milliseconds. I will check in with my performance monitoring software tomorrow to see how it has improved. Thank you everyone for the help so far!
As several commenters have already pointed out, you need a new index on the dbo.Outgoing table. The server is struggling to find the rows to update/output. This is most probably where the problem is:
WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1
To improve performance you should create an index on dbo.Outgoing that covers these columns. This will make it easier for SQL Server to find the correct rows. On the other hand, it will create some more work for the actual update, since there will be a new index that needs attention when updating.
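A minimal sketch of such an index (the index name is made up, and the best column order depends on your data):

CREATE NONCLUSTERED INDEX IX_Outgoing_Scheduled_SmsId_Status
ON dbo.Outgoing (Scheduled, SmsId, Status);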
While you're working on this, it would likely be a good idea to shorten the SmsId column unless you actually need it to be 255 chars long. Preferably before you create the index.
As an alternative solution, you might think about having separate tables for the messages that are outgoing and those that have gone out. Then you can:
insert all records from Outgoing into Outgone
delete all records from Outgoing, with an OUTPUT clause like you are currently using.
Make sure, though, that the insert and the delete are done in one transaction, or you will soon have weird inconsistencies in the database.
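A rough sketch of that pattern (assuming a hypothetical dbo.Outgone table with matching columns; the lock hints keep the selected rows stable between the two statements):

BEGIN TRANSACTION;

-- copy the due messages into the archive table
INSERT INTO dbo.Outgone (OutgoingID, NumberID, ContactID, [Message], Scheduled, Status)
SELECT OutgoingID, NumberID, ContactID, [Message], Scheduled, Status
FROM dbo.Outgoing WITH (UPDLOCK, HOLDLOCK)
WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1;

-- remove exactly those rows from Outgoing, returning them to the caller
DELETE o
OUTPUT DELETED.OutgoingID, DELETED.[Message], n.PhoneNumber, c.OptInStatus
FROM dbo.Outgoing o
JOIN dbo.Outgone g ON g.OutgoingID = o.OutgoingID
JOIN Numbers n ON n.NumberID = o.NumberID
LEFT JOIN Contacts c ON c.ContactID = o.ContactID;

COMMIT TRANSACTION;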
it is just causing wait times on the CPU especially since it runs every other second.
Get rid of the TOP 30 and run it much less often than once every other second... maybe every two or three minutes.
You can also increase the max degree of parallelism setting on your SQL Server for faster processing.
I have 4 million records in one of my tables. I need to get the last 25 records that have been added in the last week.
This is how my current query looks
SELECT TOP(25) [t].[EId],
[t].[DateCreated],
[t].[Message]
FROM [dbo].[tblEvent] AS [t]
WHERE ( [t].[DateCreated] >= Dateadd(DAY, Datediff(DAY, 0, Getdate()) - 7, 0)
AND [t].[EId] = 1 )
ORDER BY [t].[DateCreated] DESC
Now, I do not have any indexes on this table and do not intend to add one. This query takes about 10-15 seconds to run and my app times out; is there a way to make it better?
You should create an index on (EId, DateCreated), or at least on DateCreated.
Without an index, the only way of optimising this that I can think of would be to maintain the last 25 rows in a separate table via an insert trigger (and possibly update and delete triggers as well).
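A minimal sketch of such an index (the index name is just a placeholder):

CREATE NONCLUSTERED INDEX IX_tblEvent_EId_DateCreated
ON dbo.tblEvent (EId, DateCreated DESC);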
If you have an autoincrement ID in the table (not the EId but a separate PK), you can order by that ID desc instead of DateCreated; that might make your ORDER BY faster.
Otherwise you do need an index (but your question says you do not want that).
If the table has no indexes to support the query you are going to be forced to perform a table scan.
You are going to struggle to get around the table scan aspect of that, and as the table grows, the response time will get slower.
You are going to have to endeavour to educate your client about the problems they face going forward, and that they should consider an index. If they are saying no, you need to show the evidence to support the reasoning: show them timings with and without the index, and make sure the impact on record insertion is also shown. It's a relatively simple cost/benefit/detriment case for adding or not adding the index. If they insist on no index, then you have no choice but to extend your timeouts.
You could also try a query hint:
http://msdn.microsoft.com/en-us/library/ms181714.aspx
with OPTION (FAST n), where n is the number of rows.
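Applied to the query above, that would look something like this (a sketch; FAST 25 only nudges the optimizer towards returning the first rows quickly, it does not remove the scan):

SELECT TOP(25) [t].[EId],
       [t].[DateCreated],
       [t].[Message]
FROM [dbo].[tblEvent] AS [t]
WHERE [t].[DateCreated] >= DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()) - 7, 0)
  AND [t].[EId] = 1
ORDER BY [t].[DateCreated] DESC
OPTION (FAST 25)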
I have a table, Sheet1$, that contains 616 records. I have another table, Rates$, that contains 47,880 records. Rates contains a response rate for a given record in the sheet for each of the 90 days following a mailing date. Across all 90 days of a record's Rates relation, the total response is ALWAYS 1 (100%).
Example:
Sheet1$: Record 1, 1000 QTY, 5% Response, Mail 1/1/2009
Rates$: Record 1, Day 1, 2% Response
Record 1, Day 2, 3% Response
Record 1, Day 90, 1% Response
Record N, Day N, N Response
Given that, I've written a view that right-joins these tables on the rates to expand the data, so I can perform some math to get a return per day for any given record.
SELECT s.[Mail Date] + r.Day AS Mail_Date,
       s.Quantity * s.[Expected Response Rate] * r.Response AS Pieces,
       s.[Bounce Back Card], s.Customer, s.[Point of Entry]
FROM Sheet1$ AS s
RIGHT OUTER JOIN Rates$ AS r
ON s.[Appeal Code] = r.Appeal
WHERE s.[Mail Date] IS NOT NULL
AND s.Quantity <> 0
AND s.[Expected Response Rate] <> 0
AND s.Quantity IS NOT NULL
AND s.[Expected Response Rate] IS NOT NULL;
So I save this as a view called Test_Results. Using SQL Server Management Studio I run this query and get a result of 211,140 records. Elapsed time was 4.121 seconds, Est. Subtree Cost was 0.751.
Now I run a query against this view to aggregate a piece count on each day.
SELECT Mail_Date, SUM(Pieces) AS Piececount
FROM Test_Results
GROUP BY Mail_Date
That returns 773 rows and it only took 0.452 seconds to execute! 1.458 Est. Subtree Cost.
My question is: with a higher estimated cost, how did this execute SO much faster than the original view itself?! I would assume part of it might be that the first query is returning all of its rows to Management Studio. If that is the case, how would I go about seeing the true cost of this query without having to account for the time spent returning results?
SELECT * FROM view1 will have a plan
SELECT * FROM view2 (where view2 is based on view1) will have its own complete plan
The optimizer is smart enough to combine/collapse the operations in the plan for view2 into the most efficient operation. It only has to observe the semantics of the design of view1; it is not required to use the plan for SELECT * FROM view1 and then apply another plan on top for view2. In general the result will be a completely different plan, and it will do whatever it can to get the most efficient results.
Typically, it's going to push the aggregation down to improve selectivity and reduce the data requirements, and that's going to speed up the operation.
I think that Cade has covered the most important part - selecting from a view doesn't necessarily entail returning all of the view rows and then selecting against that. SQL Server will optimize the overall query.
To answer your question, though: if you want to avoid the network and display costs, you can simply select each query result into a table. Just add "INTO Some_Table" after the column list in the SELECT clause.
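For instance, for the aggregate query above (Some_Table is just a placeholder name):

SELECT Mail_Date, SUM(Pieces) AS Piececount
INTO Some_Table
FROM Test_Results
GROUP BY Mail_Date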
You should also be able to separate things out by showing client statistics or by using Profiler, but the SELECT...INTO method is quick and easy.
Query costs are unitless, and are just used by the optimizer to choose what it thinks the most efficient execution path for a particular query is. They can't really be compared between queries. This, although old, is a good quick read. Then you'll probably want to look around for some books or articles on the MSSQL optimizer and about reading query plans if you're really interested.
(Also, make sure you're viewing the actual execution plan, and not the estimated plan ... they can be different)