How would you optimize pagination queries in a Neo4j graph database?

For my graph database, I have users and transactions. Each user has an id. Each transaction has an id, a date, a sender, and a receiver. The sender and receiver of a transaction are of type User.id.
There is a sender/receiver type edge that connects users and transactions together.
I would like to query the 10 most recent transactions for a particular user, user_id, before they sent an arbitrary transaction with the id of txn_id.
How can I optimize the performance of this pagination query? I was thinking of creating an index on User.id to find user_id quickly. If I also index Transaction.date and Transaction.id, would that make it performant to search for transactions older than txn_id for an individual user?

If you're talking about a composite index (create index on :Transaction(date, id)) then no, as the lookup would need exact values for all of the indexed properties. Any other means of lookup (including range scans) won't use the composite index.
An index on just :Transaction(date) won't help either, as this wouldn't be used once you found your user in question and expanded out the transactions. Starting from transactions by date wouldn't be wise, since you would receive many that weren't associated with the user in question.
You'll need an index on :User(id) for that quick lookup of the user; then you'll just need to use ordering with a limit to find the 10 most recent for that user.
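A rough Cypher sketch of that approach (the :SENDER relationship type and the date property are assumptions based on your description, so adjust them to your actual model):
// Lookup index for the starting user (older CREATE INDEX syntax, as in the question)
CREATE INDEX ON :User(id);
// Find the anchor transaction, then page backwards from its date
MATCH (u:User {id: $user_id})-[:SENDER]->(anchor:Transaction {id: $txn_id})
MATCH (u)-[:SENDER]->(t:Transaction)
WHERE t.date < anchor.date
RETURN t
ORDER BY t.date DESC
LIMIT 10;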
If there can be a great many transactions per user, you might consider creating a linked list between transactions in date order so they can be iterated through much faster. Keep in mind that this would require maintaining that order as transactions are added, and you'd need to use proper locking techniques to avoid race conditions as you append to the transaction list.
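If you do maintain such a list, the paginated read becomes a short traversal rather than a sort. A minimal Cypher sketch, assuming a hypothetical PREV_TXN relationship pointing from each transaction to the previous one in date order:
// Walk backwards up to 10 hops from the anchor transaction
MATCH (:User {id: $user_id})-[:SENDER]->(anchor:Transaction {id: $txn_id})
MATCH (anchor)-[:PREV_TXN*1..10]->(older:Transaction)
RETURN older
ORDER BY older.date DESC;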

Related

Performance strategies for selecting large amounts of data from child tables

I have a fairly standard database structure for storing invoices and their line items. We have an AccountingDocuments table for header details and an AccountingDocumentItems child table. The child table has an AccountingDocumentId foreign key with an associated non-clustered index.
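For reference, the existing index described above would look something like this (the index name is illustrative):
-- Non-clustered index supporting the foreign key on the child table
CREATE NONCLUSTERED INDEX IX_AccountingDocumentItems_AccountingDocumentId
    ON AccountingDocumentItems (AccountingDocumentId);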
Most of our Sales reports need to select almost all the columns in the child table, based on a passed date range and branch. So the basic query would be something like:
SELECT
    adi.TotalInclusive,
    adi.Description, etc.
FROM
    AccountingDocumentItems adi
    LEFT JOIN AccountingDocuments ad ON ad.AccountingDocumentId = adi.AccountingDocumentId
WHERE
    ad.DocumentDate > '1 Jan 2022' AND ad.DocumentDate < '5 May 2022'
    AND ad.SiteId = 'x'
This results in an index seek on the AccountingDocuments table, but due to the select requirements on AccountingDocumentItems we're seeing a full index scan here. I'm not certain how to index the child table given the filtering is taking place on the parent table.
At present there are only around 2m rows in the line items table, and the query performs quite well. My concern is how this will work once the table is significantly larger and what strategies there are to ensure good performance on these reports going forward.
I'm aware the answer may be to simply keep adding hardware, but I want to investigate alternative strategies first.

How can I store an indefinite amount of stuff in a field of my database table?

Here's a simple version of the website I'm designing: Users can belong to one or more groups. As many groups as they want. When they log in they are presented with the groups they belong to. Ideally, in my Users table I'd like an array or something unbounded to which I can keep adding the IDs of the groups that user joins.
Additionally, although I realize this isn't necessary, I might want a column in my Group table that holds an indefinite number of user IDs belonging to that group. (Side question: would that be more efficient than getting all the users of the group by querying the user table for users belonging to a certain group ID?)
Does my question make sense? Mainly I want to be able to fill a column up with an indefinite list of IDs... The only way I can think of is making it like some super long varchar and having the list JSON encoded in there or something, but ewww
Please and thanks
Oh, and it's a mysql database (my website is in php), but after 2 years of php development I've recently decided php sucks and I hate it and ASP .NET web applications are the only way for me, so I guess I'll be implementing this on whatever kind of database I'll need for that.
Your intuition is correct; you don't want to have one column of unbounded length just to hold the user's groups. Instead, create a table such as user_group_membership with the columns:
user_id
group_id
A single user_id could have multiple rows, each with the same user_id but a different group_id. You would represent membership in multiple groups by adding multiple rows to this table.
What you have here is a many-to-many relationship. A "many-to-many" relationship is represented by a third, joining table that contains both primary keys of the related entities. You might also hear this called a bridge table, a junction table, or an associative entity.
You have the following relationships:
A User belongs to many Groups
A Group can have many Users
In database design, this might be represented as follows:
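A minimal SQL sketch of that representation (table and column names are illustrative; the junction table is the user_group_membership / UserGroup described above):
CREATE TABLE users (
    user_id INT PRIMARY KEY
);
CREATE TABLE user_groups (
    group_id INT PRIMARY KEY
);
-- Junction (bridge) table: one row per user/group membership
CREATE TABLE user_group_membership (
    user_id  INT NOT NULL,
    group_id INT NOT NULL,
    PRIMARY KEY (user_id, group_id),
    FOREIGN KEY (user_id) REFERENCES users (user_id),
    FOREIGN KEY (group_id) REFERENCES user_groups (group_id)
);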
This way, a UserGroup represents any combination of a User and a Group without the problem of having "infinite columns."
If you store an indefinite amount of data in one field, your design does not conform to First Normal Form (FNF). FNF is the first step in data normalization, which is a major aspect of database design. Normalized design is usually good design, although there are some situations where a different design pattern is better adapted.
If your data is not in FNF, you will end up doing sequential scans for some queries where a normalized database would be accessed via a quick lookup. For a table with a billion rows, this could mean a delay of an hour rather than a few seconds. FNF guarantees a direct-access lookup path for each item of data.
As other responders have indicated, such a design will involve more than one table, to be joined at retrieval time. Joining takes some time, but it's tiny compared to the time wasted in sequential scans, if the data volume is large.

Indicating primary/default record in database

I've been struggling with how I should indicate that a certain record in a database is the "fallback" or default entry. I've also been struggling with how to reduce my problem to a simple problem statement. I'm going to have to provide an example.
Suppose that you are building a very simple shipping application. You'll take orders and will need to decide which warehouse to ship them from.
Let's say that you have a few cities that have their own dedicated warehouses*; if an order comes in from one of those cities, you'll ship from that city's warehouse. If an order comes in from any other city, you want to ship from a certain other warehouse. We'll call that certain other warehouse the fallback warehouse.
You might decide on a schema like this:
Warehouses
  WarehouseId
  Name
WarehouseCities
  WarehouseId
  CityName
The solution must enforce zero or one fallback warehouses.
You need a way to indicate which warehouse should be used if there aren't any warehouses specified for the city in question. If it really matters, you're doing this on SQL Server 2008.
EDIT: To be clear, all valid cities are NOT present in the WarehouseCities table. It is possible for an order to be received for a City not listed in WarehouseCities. In such a case, we need to be able to select the fallback warehouse.
If any number of default warehouses were allowed, or if I was assigning default warehouses to, say, states, I would use a DefaultWarehouse table. I could use such a table here, but I would need to limit it to exactly one row, which doesn't feel right.
How would you indicate the fallback warehouse?
*Of course, in this example we discount the possibility that there might be multiple cities with the same name. The country you are building this application for rigorously enforces a uniqueness constraint on all city names.
I understand your problem, but have questions about parts of it, so I'll be a bit more general.
If at all possible I would store the warehouse/backup-warehouse data with your inventory data (either hanging directly off Warehouses or, if it's product specific, off the inventory tables).
If the setup has to be calculated through your business logic, then the records should hang off the order/order_item table.
In terms of how to implement the structure in SQL, I'll assume that all orders ship out of a single warehouse and that the shipping must be hung off the Orders table (but the ideas should be applicable elsewhere):
The older way to enforce zero/one backup warehouses would be to hang a Warehouse_Source record off the Orders table, include an "IsPrimary" field or a "ShippingPriority", and then add a composite unique index that includes OrderID and IsPrimary/ShippingPriority.
If you will only ever have one backup warehouse, you could add ShippingSource_WareHouseID and ShippingSource_Backup_WareHouseID fields to the order. Although, this isn't the route I would go.
In SQL 2008 and up we have the wonderful addition of Filtered Indexes. These allow you to add a WHERE clause to your index -- resulting in a more compact index. It also has the added benefit of allowing you to accomplish some things that could only be done through triggers in the past.
You could put a Unique filtered index on OrderID & IsPrimary/ShippingPriority (WHERE IsPrimary = 0).
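A sketch of that filtered index (the Warehouse_Source table and column names follow the description above and are assumptions):
-- Allows at most one backup (non-primary) warehouse source per order
CREATE UNIQUE NONCLUSTERED INDEX UX_WarehouseSource_OneBackupPerOrder
    ON Warehouse_Source (OrderID, IsPrimary)
    WHERE IsPrimary = 0;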
Add a comment or such if you want me to explain further.
re: how I should indicate that a certain record in a database is the "fallback" or default entry
use another column, isFallback, holding a binary value. I'm assuming your fallback warehouse won't have any cities associated with it.
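Combining this column with the filtered-index idea from the previous answer, a sketch that enforces the "zero or one fallback" rule at the database level (column and index names are illustrative):
ALTER TABLE Warehouses ADD IsFallback bit NOT NULL DEFAULT 0;
-- Only rows with IsFallback = 1 enter the index, so at most one such row can exist
CREATE UNIQUE NONCLUSTERED INDEX UX_Warehouses_SingleFallback
    ON Warehouses (IsFallback)
    WHERE IsFallback = 1;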
As I see it, the fallback warehouse is, after all, just another warehouse, and if I understood correctly, every record in WarehouseCities references one record in Warehouses:
WarehouseCities(*)...(1)Warehouses
Which means that if there are a hundred cities without a dedicated warehouse, they will all reference the id of a specific fallback warehouse. So I don't see any problem (which makes me think I didn't understand the problem); the model even looks well defined.
Now you could identify if a warehouse is fallback warehouse with an attribute like type_warehouse on Warehouses.
EDIT after comment
Assuming there is only one fallback warehouse for the cities not present in WarehouseCities, I suggest keeping the fallback warehouse as just another warehouse and storing its id (WarehouseId) as an application parameter (a table for parameters, maybe?). Of course, this solution is programmatic and not tied to your database platform.
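A sketch of that parameter approach (table, column and value names are illustrative):
CREATE TABLE ApplicationParameters (
    ParameterName  varchar(100) NOT NULL PRIMARY KEY,
    ParameterValue varchar(100) NOT NULL
);
-- The id of the warehouse to fall back to; read by the application, not enforced by the schema
INSERT INTO ApplicationParameters (ParameterName, ParameterValue)
VALUES ('FallbackWarehouseId', '42');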

How to design data storage for partitioned tagging system?

How to design data storage for huge tagging system (like digg or delicious)?
There is already discussion about it, but that is about a centralized database. Since the data is supposed to grow, we'll need to partition the data into multiple shards sooner or later. So the question becomes: how to design data storage for a partitioned tagging system?
The tagging system basically has 3 tables:
Item (item_id, item_content)
Tag (tag_id, tag_title)
TagMapping(map_id, tag_id, item_id)
That works fine for finding all items for a given tag and finding all tags for a given item, as long as the table is stored in one database instance. If we need to partition the data into multiple database instances, it is not that easy.
For table Item, we can partition its content by its key item_id. For table Tag, we can partition its content by its key tag_id. For example, if we want to partition table Tag into K databases, we can simply choose database number (tag_id % K) to store a given tag.
But, how to partition table TagMapping?
The TagMapping table represents the many-to-many relationship. The only approach I can imagine is duplication. That is, the same TagMapping content would have two copies: one partitioned by tag_id and the other partitioned by item_id. To find the tags for a given item, we use the copy partitioned by item_id; to find the items for a given tag, we use the copy partitioned by tag_id.
As a result, there is data redundancy, and the application level has to keep all the copies consistent. It looks hard.
Is there any better solution to solve this many-to-many partition problem?
I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!
Finding Tags for a Given Item
A1. You're going to display all of the tags for a given item at once
A2. You're going to ensure that all of an item's tags are unique
Finding Items for a Given Tag
B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)
B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags
B3. You're going to sort the items for a given tag (or tags) by some measure of popularity
Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).
Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have as many records as needed to fill a page of results. How many partitions you have to query will depend on how many partitions you have, how many results you want to display and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.
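A sketch of the per-partition structure (generic SQL; the index name is illustrative and the paging syntax varies by engine):
-- Created on every partition of TagMapping (which is itself partitioned by item_id)
CREATE INDEX IX_TagMapping_TagId ON TagMapping (tag_id);
-- Ask one partition at a time and move on until the page is full
SELECT item_id
FROM TagMapping
WHERE tag_id = 123
LIMIT 20;  -- or TOP (20) / FETCH FIRST 20 ROWS ONLY, depending on the engine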
The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google Page Rank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyways.
The rule is that you partition by the field you are going to query by; otherwise you'll have to look through all partitions. Are you sure you'll need to query the Tag table by tag_id only? I believe not; you'll also need to query by tag title. It's not so obvious for the Item table, but you'll probably also want to query by something like URL, to find the item_id when another user assigns tags to an existing item.
But note that the Tag and Item tables have immutable title and URL values. That means you can use the following technique:
Choose the partition from the title (for Tag) or the URL (for Item).
Use a sequence within this partition to generate the id.
Either use the partition-localID pair as the global identifier, or use non-overlapping number sets per partition. Either way, you can now compute the partition from both the id and the title/URL fields. Don't know the number of partitions in advance, or worried it might change in the future? Create more of them and join them into groups, so that you can regroup them later.
Sure, you can't do the same for the TagMapping table, so you have to duplicate. You need to query it by map_id, by tag_id and by item_id, right? So even without partitioning you have to duplicate data by creating 3 indexes. The only difference is that you use different partitioning (by a different field) for each index. I see no reason to worry about it.
Most likely your queries are going to be related to a user or a topic, meaning that you should have all the info related to those in one place.
You're talking about distributing the DB; usually this is mostly an issue of synchronization. Reading, which is about 90% of the work, can usually be done on a replicated database. The issue is how to update one DB and remain consistent with all the others without killing performance. This depends on your scenario details.
The other possibility is to partition, like you asked, all the data without overlapping. You would probably partition by user ID or topic ID. If you partition by topic ID, one database could reference all topics and just tell which dedicated DB holds the data. You can then query the correct one. Since you partition by ID, all info related to that topic can live on that specialized database. You could also partition by language or country for an international website.
Last but not least, you'll probably end up mixing the two: some non-overlapping data, and some overlapping (replicated) data. First find the usual operations, then work out how to perform them on one DB with the fewest possible queries.
PS: Don't forget about caching; it'll save you more than a distributed DB will.

Maintaining audit log for entities split across multiple tables

We have an entity split across 5 different tables. Records in 3 of those tables are mandatory. Records in the other two tables are optional (based on sub-type of entity).
One of the tables is designated the entity master. Records in the other four tables are keyed by the unique id from master.
An AFTER UPDATE/DELETE trigger is present on each table, and a change to a record saves off history (from the deleted table inside the trigger) into a related history table. Each history table contains the related entity fields plus a timestamp.
So, live records are always in the live tables and history/changes are in history tables. Historical records can be ordered based on the timestamp column. Obviously, timestamp columns are not related across history tables.
Now, for the more difficult part.
1. Records are initially inserted in a single transaction. Either 3 or 5 records will be written in a single transaction.
2. Individual updates can happen to any or all of the 5 tables.
3. All records are updated as part of a single transaction. Again, either 3 or 5 records will be updated in a single transaction.
4. Number 2 can be repeated multiple times.
5. Number 3 can be repeated multiple times.
The application is supposed to display a list of point-in-time history entries based on records written as single transactions only (points 1, 3 and 5 only).
I'm currently having problems with an algorithm that will retrieve historical records based on timestamp data alone.
Adding a HISTORYMASTER table to hold the extra information about transactions seems to partially address the problem. A new record is added into HISTORYMASTER before every transaction. New HISTORYMASTER.ID is saved into each entity table during a transaction.
Point-in-time history can be retrieved by selecting the first record for a particular HISTORYMASTER.ID (ordered by timestamp).
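A minimal T-SQL-flavoured sketch of that structure (names and types are illustrative):
-- One row per application-level transaction; every history row carries its HistoryMasterId
CREATE TABLE HistoryMaster (
    HistoryMasterId BIGINT IDENTITY(1,1) PRIMARY KEY,
    ChangedAt       DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
-- The point-in-time entries are then the HistoryMaster rows themselves, in timestamp order
SELECT HistoryMasterId, ChangedAt
FROM HistoryMaster
ORDER BY ChangedAt;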
Is there any more optimal way to manage audit tables based on AFTER (UPDATE, DELETE) TRIGGERs for entities spanning multiple tables?
Your HistoryMaster seems similar to how we have addressed the history of multiple related items in one of our systems. By having a single point in the history to hang all the related changes from, it is easy to then create a view that uses the history master as the hub and attaches the related information. It also allows you to avoid creating history records where an audit is not desired.
In our case the primary tables were called EntityAudit (where Entity was the "primary" item being retained) and all data was stored in EntityHistory tables related back to the audit record. In our case we were using a data layer for business rules, so it was easy to insert the audit rules into the data layer itself. I feel that the data layer is an optimal point for such tracking if and only if all modifications use that data layer. If you have multiple applications using distinct data layers (or none at all), then I suspect that a trigger that creates the master record is pretty much the only way to go.
If you don't have additional information to track in the audit (we track the user who made the change, for example, something not on the main tables), then I would contemplate putting the extra audit ID on the "primary" record itself. Your description does not seem to indicate you are interested in the minor changes to individual tables, but only in changes that update the entire entity set (although I may be misreading that). I would only do so if you don't care about the minor edits, though. In our case, we needed to track all changes, even to the related records.
Note that the use of an Audit/Master table has the advantage that you are making minimal changes to the history tables as compared to the source tables: a single AuditID (in our case a Guid, although autonumbers would be fine in non-distributed databases).
Can you add a TimeStamp / RowVersion datatype column to the entity master table, and associate all the audit records with that?
But an Update to any of the "child" tables will need to update the Master entity table to force the TimeStamp / RowVersion to change :(
Or stick a GUID in there that you freshen whenever one of the associated records changes.
Thinking that through out loud, it may be better to have a table joined 1:1 to the Master Entity that only contains the Master Entity ID and the "version number" of the record - either TimeStamp / RowVersion, GUID, incremented number, or something else.
I think it's a symptom of trying to capture "abstract" audit events at the lowest level of your application stack - the database.
If it's possible consider trapping the audit events in your business layer. This would allow you to capture the history per logical transaction rather than on a row-by-row basis. The date/time is unreliable for resolving things like this as it can be different for different rows, and the same for concurrent (or closely spaced) transactions.
I understand that you've asked how to do this in DB triggers though. I don't know about SQL Server, but in Oracle you can overcome this by using the DBMS_TRANSACTION.LOCAL_TRANSACTION_ID system package to return the ID for the current transaction. If you can retrieve an equivalent SQLServer value, then you can use this to tie the record updates for the current transaction together into a logical package.
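For illustration only, an Oracle-flavoured sketch of stamping that transaction id from inside an audit trigger (table and column names are made up):
CREATE OR REPLACE TRIGGER trg_entity_audit
AFTER UPDATE OR DELETE ON entity_master
FOR EACH ROW
DECLARE
    v_txn_id VARCHAR2(200);
BEGIN
    -- Same value for every row touched in this transaction, so the rows can be grouped later
    v_txn_id := DBMS_TRANSACTION.LOCAL_TRANSACTION_ID;
    INSERT INTO entity_master_history (entity_id, changed_at, txn_id)
    VALUES (:OLD.entity_id, SYSTIMESTAMP, v_txn_id);
END;
/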
