Rails ActiveRecord - Performance - How to destroy invalid belongs_to records - query-optimization

I have two models:
Asset has_many Assethistories
Assethistory belongs_to Asset
Unfortunately, when the migration was created, no foreign_key was added to Assethistories. Some Assethistories records exist where Assethistories.asset_id has a value that does not exist in Asset.id (probably someone used delete on the Asset table instead of destroy).
There are approximately 10 million Asset records and 25 million Assethistories records.
Using this query takes a VERY LONG time:
Assethistory.where("asset_id NOT IN (select id from assets)").delete_all
or in rails syntax:
Assethistory.where.not(asset_id: Asset.select(:id)).delete_all
NOTE: we can use delete_all since there are no callbacks or nested models.
In fact, just doing a COUNT of the invalid records takes VERY LONG.
Assethistory.where.not(asset_id: Asset.select(:id)).count
Is there any way to destroy the invalid records that would be more performant?

I've got some data which is about 20x smaller than your case, e.g 5M assethistory and 500K asset like your case, which could reproduce your issue by using where.not. By following Anti-Join pattern, where joins records that don't exist, would bring you the result you want with better performance.
Assethistory.left_outer_joins(:asset).where(assets: { id: nil })
# => 3500 rows (415.3ms)
or
Assethistory.where('NOT EXISTS (:assets)', assets: Asset.select('1').where('assets.id = assethistories.asset_id'))
# => 3500 rows (701.6ms)
By using LEFT OUTER JOIN instead of WHERE NOT, DB scans the table in a very different way. WHERE NOT loops over all rows in assethistories where assethistories.asset_id do not match all existing assets.id.
LEFT OUTER JOIN queries the records from the joined-table(assethistories and assets as a big one table) where the rows have null value in asset.id column, which is much more efficient.

Related

Power BI combine results from two SQL-Server tables

While using Power BI for a few months now, we (the user group) encountered an issue that is not really clear to us.
We use Power-BI with a remote SQL-Server data source, we access the data source through direct query.
Let's pretend we have 2 Tables as below-
Table name: Issue
Column:
ResolutionTime(Date/Time)
IssueID(Unique Numbers)
Table Name: WorkItem
Column:
start (Date/Time)
end (Date/Time)
IssueID (Unique Numbers, Foreign Key to "Issue" table)
Table WorkItem also contain a calculated column "WorkTime" which uses this DAX-expression as below-
WorkTime = WorkItem[end] - WorkItem[start]
The two tables are configured through Power-Bi having a two-way 1:n relationship that can be queried to collect all "WorkItem"(s) assigned to an "Issue" entry, using the "IssueID" as correlation column.
To be able to compute the aggregated "work-time" for each "WorkItem", we use a new/calculated table with the following DAX expression to aggregate the total amount of time invested for a single "Issue":
SumWork =
SUMMARIZE(
WorkItem, WorkItem[IssueID], "All work per item", SUM(WorkItem[WorkTime])
)
The above table computes the total invested work-time for a particular issue, grouping/summarizing results based on the "IssueID" foreign key. This new calculated table is also configured to have a relationship with the "Issue" table, this time a "1:1" relationship, using the IssueID as correlation column.
Now to compute the time that the issue was worked on + the time for Resolution should be summarized in a calculated column inside "Issue", but this does not work:
ResolutionAndWorkTime = Issue[ResolutionTime] + SumWork["All work per item"]
But the above DAX expression fails to compile, as it always reports that it returns "more than one result", thus not being a singular result. But that is suprising, as the two table ("Issue" and "SumWork" are related to each other with a "1:1" relationship).
Tables:
Issues
IssueID ResolutionTime ResolutionAndWorkTime
1 03:20:20 ???
2 01:20:20 ???
3 00:20:20 ???
WorkItem
IssueID start end WorkTime
1 1-2-2020 3:20:20 1-2-2020 3:25:20 00:05:00
1 2-2-2020 6:20:20 2-2-2020 7:20:20 01:00:00
3 1-3-2020 3:20:20 1-3-2020 3:29:20 00:09:00
Any ideas what to look for? Data-types? Table-definition? Table-relationships? We checked other Stackoverflow questions/answers, but no good ideas retrieved so far.
NOTE that a lot of join/merge features of Power BI are not available if direct-query is used and thus joining the tables is not really an option (we think).
You need this following code for your new Calculated column.
Visit HERE To know more about RELATED.
ResolutionAndWorkTime = Issues[ResolutionTime] + RELATED(SumWork[All work per item])
Based on input provided by "mkRabbani" (see other answer) we investigated why "RELATED" does not function as expected. The problem originates in the access to the database. As suspected earlier the function delivers the expected results once the database access is switched to "import" instead of "direct-query".
As a workaround we now joins the data inside the SQL server by using traditional database views. Of course this only works for scenarios where the database is under control of the data analytics team.

Keep order in a key-value use case with additions and deletions

I receive big 4 bit hashes list from an API that I need to store.
The list must always be ordered alphanumerically.
Here's how things are flowing :
I get the full list of hashes from the API (Approx 1,000,000). I store this list.
I request the API periodically to get a list of additions and deletions to the list.
For now I store the hashes list with a 'order' row in a table :
0 - 6717
1 - 7fcd
2 - 88c6
3 - 9e63
4 - dcb0
5 - fb44
Now let's say I receive the deletions :
1
4
And the additions :
7bd7
0e33
I need to delete the rows 1 and 4 :
0 - 6717
2 - 88c6
3 - 9e63
5 - fb44
And I need to add the additions to the list AND to rebuild the roder row to keep the alphaneumeric order, to be able to to this again :
0 - 0e33
1 - 6717
2 - 7bd7
3 - 88c6
4 - 9e63
5 - fb44
I need this for a PHP Symfony application, I have implemented this with MySQL but it's pretty slow to create the full list and to rebuild the id row...
As I have a key->value dataset Redis seems to be a good choice but there is no bulk rename function for the keys.
I am also thinking about MongoDB and to create one document for each hash but I'm not really sure.
what would you do ? Thanks
In relational data model (YesSQL world) a table is an non-ordered set of rows. Hence, the order of stored values is unpredictable "by design" in general case (I wouldn't say about clustered indexes here). The ordered list is only guaranteed when using ORDER BY clause
SELECT key, value FROM my_store WHERE ... ORDER BY key
For the performance purposes you need to have an index/primary key/unique constraint (depend on DMBS and database design) on the affected column(s). 1M of rows is a relatively small amount which shouldn't make some performance issues. Be aware also of data/index fragmentation when deletion and insertion are frequent.

SQL join running slow

I have 2 sql queries doing the same thing, first query takes 13 sec to execute while second takes 1 sec to execute. Any reason why ?
Not necessary all the ids in ProcessMessages will have data in ProcessMessageDetails
-- takes 13 sec to execute
Select * from dbo.ProcessMessages t1
join dbo.ProcessMessageDetails t2 on t1.ProcessMessageId = t2.ProcessMessageId
Where Id = 4 and Isdone = 0
--takes under a sec to execute
Select * from dbo.ProcessMessageDetails
where ProcessMessageId in ( Select distinct ProcessMessageId from dbo.ProcessMessages t1
Where Where Id = 4 and Isdone = 0 )
I have clusterd index on t1.processMessageId(Pk) and non clusterd index on t2.processMessageId (FK)
I would need the actual execution plans to tell you exactly what SqlServer is doing behind the scenes. I can tell you these queries aren't doing the exact same thing.
The first query is going through and finding all of the items that meet the conditions for t1 and finding all of the items for t2 and then finding which ones match and joining them together.
The second one is saying first find all of the items that are meet my criteria from t1, and then find the items in t2 that have one of these IDs.
Depending on your statistics, available indexes, hardware, table sizes: Sql Server may decide to do different types of scans or seeks to pick data for each part of the query, and it also may decide to join together data in a certain way.
The answer to your question is really simple the first query which have used will generate more number of rows as compared to the second query so it will take more time to search those many rows that's the reason your first query took 13 seconds and the second one to only one second
So it is generally suggested that you should apply your conditions before making your join or else your number of rows will increase and then you will require more time to search those many rows when joined.

Postgresql - performance of using array in big database

Let say we have a table with 6 million records. There are 16 integer columns and few text column. It is read-only table so every integer column have an index.
Every record is around 50-60 bytes.
The table name is "Item"
The server is: 12 GB RAM, 1,5 TB SATA, 4 CORES. All server for postgres.
There are many more tables in this database so RAM do not cover all database.
I want to add to table "Item" a column "a_elements" (array type of big integers)
Every record would have not more than 50-60 elements in this column.
After that i would create index GIN on this column and typical query should look like this:
select * from item where ...... and '{5}' <# a_elements;
I have also second, more classical, option.
Do not add column a_elements to table item but create table elements with two columns:
id_item
id_element
This table would have around 200 mln records.
I am able to do partitioning on this tables so number of records would reduce to 20 mln in table elements and 500 K in table item.
The second option query looks like this:
select item.*
from item
left join elements on (item.id_item=elements.id_item)
where ....
and 5 = elements.id_element
I wonder what option would be better at performance point of view.
Is postgres able to use many different indexes with index GIN (option 1) in a single query ?
I need to make a good decision because import of this data will take me a 20 days.
I think you should use an elements table:
Postgres would be able to use statistics to predict how many rows will match before executing query, so it would be able to use the best query plan (it is more important if your data is not evenly distributed);
you'll be able to localize query data using CLUSTER elements USING elements_id_element_idx;
when Postgres 9.2 would be released then you would be able to take advantage of index only scans;
But I've made some tests for 10M elements:
create table elements (id_item bigint, id_element bigint);
insert into elements
select (random()*524288)::int, (random()*32768)::int
from generate_series(1,10000000);
\timing
create index elements_id_item on elements(id_item);
Time: 15470,685 ms
create index elements_id_element on elements(id_element);
Time: 15121,090 ms
select relation, pg_size_pretty(pg_relation_size(relation))
from (
select unnest(array['elements','elements_id_item', 'elements_id_element'])
as relation
) as _;
relation | pg_size_pretty
---------------------+----------------
elements | 422 MB
elements_id_item | 214 MB
elements_id_element | 214 MB
create table arrays (id_item bigint, a_elements bigint[]);
insert into arrays select array_agg(id_element) from elements group by id_item;
create index arrays_a_elements_idx on arrays using gin (a_elements);
Time: 22102,700 ms
select relation, pg_size_pretty(pg_relation_size(relation))
from (
select unnest(array['arrays','arrays_a_elements_idx']) as relation
) as _;
relation | pg_size_pretty
-----------------------+----------------
arrays | 108 MB
arrays_a_elements_idx | 73 MB
So in the other hand arrays are smaller, and have smaller index. I'd do some 200M elements tests before making a decision.

Get "surrounding" rows in NHibernate query

I am looking for a way to retrieve the "surrounding" rows in a NHibernate query given a primary key and a sort order?
E.g. I have a table with log entries and I want to display the entry with primary key 4242 and the previous 5 entries as well as the following 5 entries ordered by date (there is no direct relation between date and primary key). Such a query should return 11 rows in total (as long as we are not close to either end).
The log entry table can be huge and retrieving all to figure it out is not possible.
Is there such a concept as row number that can be used from within NHibernate? The underlying database is either going to be SQlite or Microsoft SQL Server.
Edited Added sample
Imagine data such as the following:
Id Time
4237 10:00
4238 10:00
1236 10:01
1237 10:01
1238 10:02
4239 10:03
4240 10:04
4241 10:04
4242 10:04 <-- requested "center" row
4243 10:04
4244 10:05
4245 10:06
4246 10:07
4247 10:08
When requesting the entry with primary key 4242 we should get the rows 1237, 1238 and 4239 to 4247. The order is by Time, Id.
Is it possible to retrieve the entries in a single query (which obviously can include subqueries)? Time is a non-unique column so several entries have the same value and in this example is it not possible to change the resolution in a way that makes it unique!
"there is no direct relation between date and primary key" means, that the primary keys are not in a sequential order?
Then I would do it like this:
Item middleItem = Session.Get(id);
IList<Item> previousFiveItems = Session.CreateCriteria((typeof(Item))
.Add(Expression.Le("Time", middleItem.Time))
.AddOrder(Order.Desc("Time"))
.SetMaxResults(5);
IList<Item> nextFiveItems = Session.CreateCriteria((typeof(Item))
.Add(Expression.Gt("Time", middleItem.Time))
.AddOrder(Order.Asc("Time"))
.SetMaxResults(5);
There is the risk of having several items with the same time.
Edit
This should work now.
Item middleItem = Session.Get(id);
IList<Item> previousFiveItems = Session.CreateCriteria((typeof(Item))
.Add(Expression.Le("Time", middleItem.Time)) // less or equal
.Add(Expression.Not(Expression.IdEq(middleItem.id))) // but not the middle
.AddOrder(Order.Desc("Time"))
.SetMaxResults(5);
IList<Item> nextFiveItems = Session.CreateCriteria((typeof(Item))
.Add(Expression.Gt("Time", middleItem.Time)) // greater
.AddOrder(Order.Asc("Time"))
.SetMaxResults(5);
This should be relatively easy with NHibernate's Criteria API:
List<LogEntry> logEntries = session.CreateCriteria(typeof(LogEntry))
.Add(Expression.InG<int>(Projections.Property("Id"), listOfIds))
.AddOrder(Order.Desc("EntryDate"))
.List<LogEntry>();
Here your listOfIds is just a strongly typed list of integers representing the ids of the entries you want to retrieve (integers 4242-5 through 4242+5 ).
Of course you could also add Expressions that let you retrieve Ids greater than 4242-5 and smaller than 4242+5.
Stefan's solution definitely works but better way exists using a single select and nested Subqueries:
ICriteria crit = NHibernateSession.CreateCriteria(typeof(Item));
DetachedCriteria dcMiddleTime =
DetachedCriteria.For(typeof(Item)).SetProjection(Property.ForName("Time"))
.Add(Restrictions.Eq("Id", id));
DetachedCriteria dcAfterTime =
DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
.Add(Subqueries.PropertyGt("Time", dcMiddleTime));
DetachedCriteria dcBeforeTime =
DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
.Add(Subqueries.PropertyLt("Time", dcMiddleTime));
crit.AddOrder(Order.Asc("Time"));
crit.Add(Restrictions.Eq("Id", id) || Subqueries.PropertyIn("Id", dcAfterTime) ||
Subqueries.PropertyIn("Id", dcBeforeTime));
return crit.List<Item>();
This is NHibernate 2.0 syntax but the same holds true for earlier versions where instead of Restrictions you use Expression.
I have tested this on a test application and it works as advertised

Resources