Modelling hierarchical data in RavenDb - database

In RavenDb, I have to store hierarchical data and I need to query it recursively. The performance is the biggest concern here.
What I have is similar to the following:
public class Category
{
    public int Id { get; set; }
    public string Name { get; set; }
    public Category Parent { get; set; }
}
In this case, if I store the parent category inside the document itself, it will be hard for me to manage the data, as I will be duplicating categories all over the place.
So, to make that easy, I can store this as below:
public class Category
{
    public int Id { get; set; }
    public int? ParentId { get; set; }
    public string Name { get; set; }
}
But in that case I'm not sure how it will perform, as I will have millions of records and I need to build the category tree from these references.
Is there an established approach in RavenDb for modelling this type of data when performance is the biggest concern?

Hierarchies are usually best modeled in one document that defines the hierarchy. In your situation that would mean defining the categories tree in one document, where the categories themselves can be represented by standalone documents (and thus hold Name, Description etc., and allow other collections to reference them), or not.
Modeled from code a Category document would look something like this:
public class Category
{
    public string Id { get; set; }
    public string Name { get; set; }
    // other meta-data that you want to store per category, like image etc.
}
And the hierarchy tree document can be serialized from a class like the following, where this class can have methods for making nodes in it easily accessible:
public class CategoriesHierarchyTree
{
    public class Node
    {
        public string CategoryId { get; set; }
        public List<Node> Children { get; set; }
    }

    public List<Node> RootCategories { get; private set; }

    // various methods for looking up and updating the tree structure
}
This approach of hierarchy-tree has several important advantages:
One transactional scope - when the tree changes, it changes in one transaction, always. You cannot be affected by multiple concurrent changes to the tree, since you can leverage optimistic concurrency when editing this one document. With the approach you propose it is impossible to guarantee that, and it is therefore harder to guarantee the completeness and correctness of the hierarchy tree over time. If you think of a hierarchy as a tree, it actually makes a lot of sense to have each change lock the entire tree until it completes. The hierarchy tree is one entity.
Caching - the entire hierarchy can be quickly and efficiently cached, even using aggressive caching, which will minimize the number of times the server is accessed with queries on the hierarchy.
All operations are done entirely in-memory - since it's one document, hence one object, all queries on the hierarchy (who is the parent of a category, what are its children, etc.) are made entirely in-memory and effectively cost close to nothing to perform. Using an index with Recurse() to answer such queries is orders of magnitude costlier (network and computational costs). You mention performance is the biggest concern - so this is a winner.
Multiple parents per category, no denormalization - if a category document is saved outside the hierarchy tree, like demonstrated above, you can effectively put a category under multiple parents without the need to denormalize. All category data is in one place, in a document outside of the tree, and the tree only holds a reference to the category.
I highly recommend going with this approach. It is a bit of a shift from the relational mindset, but it's so worth it, even when the tree grows big.
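As a rough illustration of the points above (the document id, the category id and the exact RavenDB client calls are written from memory and are assumptions, so treat this as a sketch rather than exact API usage):

using (var session = store.OpenSession())   // store: an already configured IDocumentStore
{
    // Concurrent edits of the tree fail on SaveChanges instead of silently
    // overwriting each other (one transactional scope for the whole tree).
    session.Advanced.UseOptimisticConcurrency = true;

    var tree = session.Load<CategoriesHierarchyTree>("categories/hierarchy");

    // Purely in-memory traversal (requires System.Linq): find the node that is
    // the parent of a given category.
    CategoriesHierarchyTree.Node FindParent(CategoriesHierarchyTree.Node node, string categoryId)
    {
        if (node.Children == null) return null;
        if (node.Children.Any(c => c.CategoryId == categoryId)) return node;
        return node.Children.Select(c => FindParent(c, categoryId))
                            .FirstOrDefault(p => p != null);
    }

    var parent = tree.RootCategories
                     .Select(r => FindParent(r, "categories/42"))
                     .FirstOrDefault(p => p != null);

    // Any structural change is persisted as one document, i.e. one transaction.
    session.SaveChanges();
}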

Related

Should I split a DbContext with multiple DbSets with hundreds of thousands of records in each of them?

My question: what determines the speed (performance) of calling the DbContext.SaveChanges() method? And is it bad practice to put all the DbSets in a single DbContext?
I have a c#/WPF/MS SQL Server/Entity Framework Core project, which is actually for my company's wholesale business.
I implemented a single DbContext which contains dozens of DbSets, each of which, of course, represents a table in the database. There are about 10 major tables representing orders, order details, customers, products, etc., and each of the major DbSets/tables contains about 50,000 to 150,000 records. The problem is that when the DbContext.SaveChanges method is called, it takes over 9,000 ms (9 sec) to execute! I put ALL of the DbSets in the same DbContext. Is this a bad habit and the cause of the slow speed?
For a test, I created a separate DbContext and put only one DbSet in it. The DbSet has about 100,000 records, but calling SaveChanges for that took about 500ms, which was a significant improvement.
Given my situation, what is the best practice for database performance? Please help.
public class MyDbContext : DbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder.UseLazyLoadingProxies().UseSqlServer(DbConn.GetConnStr());
        base.OnConfiguring(optionsBuilder);
    }

    public DbSet<Order> Orders { get; set; }             // This has 100k+ records.
    public DbSet<OrderDetail> OrderDetails { get; set; } // This has 150k+ records.
    public DbSet<Ship> Ships { get; set; }               // 100k+ records
    public DbSet<ShipDetail> ShipDetails { get; set; }   // 150k+ records
    public DbSet<Customer> Customers { get; set; }       // 100k records
    public DbSet<Product> Products { get; set; }         // 10k+ records
    public DbSet<ProductStock> ProductStocks { get; set; }
    public DbSet<ProductPrice> ProductPrices { get; set; }
    public DbSet<PriceType> PriceTypes { get; set; }
    public DbSet<Claim> Claims { get; set; }
    public DbSet<Carrier> Carriers { get; set; }
    public DbSet<Channel> Channels { get; set; }
    public DbSet<Import> Imports { get; set; }
    public DbSet<ImportDetail> ImportDetails { get; set; }
}
No, quite the opposite. You should encapsulate one database per DbContext-derived class in your app. If it is just one db (or rather one schema), then you should not split the class at all.
Instead, make it a partial class and define the different DbSets in domain-oriented files that together form the concrete class.
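For example, something along these lines, with the context split across domain-oriented files (the file names and grouping are just an illustration, using the DbSets from the question):

// MyDbContext.Orders.cs
public partial class MyDbContext : DbContext
{
    public DbSet<Order> Orders { get; set; }
    public DbSet<OrderDetail> OrderDetails { get; set; }
    public DbSet<Ship> Ships { get; set; }
    public DbSet<ShipDetail> ShipDetails { get; set; }
}

// MyDbContext.Products.cs
public partial class MyDbContext
{
    public DbSet<Product> Products { get; set; }
    public DbSet<ProductStock> ProductStocks { get; set; }
    public DbSet<ProductPrice> ProductPrices { get; set; }
    public DbSet<PriceType> PriceTypes { get; set; }
}

// MyDbContext.Configuration.cs
public partial class MyDbContext
{
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder.UseLazyLoadingProxies().UseSqlServer(DbConn.GetConnStr());
        base.OnConfiguring(optionsBuilder);
    }
}

The compiler merges the partial declarations into the single concrete MyDbContext class, so nothing changes at runtime; it only keeps the files organized by domain.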
The speed depends roughly on the number of changes made times the number of items loaded (a very rough abstraction).
The more changes and the more rows you are affecting or loading, the slower things get.
The biggest hit for you would be the SQL updates. When you want to manage very big datasets, skip loading them into memory entirely. Work with .FromSqlRaw and do everything at the db level, returning the minimum that you need.
Mass updates, for example, are a great case for this.
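As an illustration, a set-based update executed at the database level instead of loading rows into the change tracker (the table, column, and variable names here are made up):

// Runs a single UPDATE on the server; nothing is loaded or tracked by EF.
int affected = context.Database.ExecuteSqlRaw(
    "UPDATE Products SET Discontinued = 1 WHERE ChannelId = {0}", channelId);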
Also take care that you are not loading unneeded objects (relations that you are not using).
Credit to Gert Arnold, rantri, and MKougiouris for your replies and comments. All of you are dead right. Here's what I've figured out. As all of you mentioned, the problem was not the fact that a single DbContext has all tables in it. The problem was that I was using and passing around a single "instance" of my DbContext across multiple operations throughout the lifetime of the running application. This should NEVER be done with a DbContext.
I figured out that a DbContext is supposed to be instantiated for a single unit of work or a single operation, and the instance should be disposed as soon as the operation is over. I was reading ALL of the DbSets into the DbContext and querying as much as possible with that single DbContext instance. That is a guarantee of slow performance.
I said it took 9 seconds (9,000 ms) to persist changes to the DB by calling SaveChanges. Now it takes 250 ms (0.25 sec) to get the same job done. I hope this helps anyone with the same issue.
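In code, the one-context-per-operation pattern boils down to something like this (the method and entity names are just an illustration):

public void AddOrder(Order order, List<OrderDetail> details)
{
    // One DbContext instance per unit of work: create, use, dispose.
    using (var db = new MyDbContext())
    {
        db.Orders.Add(order);
        db.OrderDetails.AddRange(details);
        db.SaveChanges();   // only the entities touched in this operation are tracked
    }
}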

Why am I getting DbUpdateException: OptimisticConcurrencyException?

I have a Category class:
public class Category
{
    public int CategoryId { get; set; }
    public string CategoryName { get; set; }
}
I also have a Subcategory class:
public class Subcategory
{
    public int SubcategoryId { get; set; }
    public Category Category { get; set; }
    public string SubcategoryName { get; set; }
}
And a Flavor class:
public class Flavor
{
    public int FlavorId { get; set; }
    public Subcategory Subcategory { get; set; }
    public string FlavorName { get; set; }
}
Then I also have Filling and Frosting classes just like the Flavor class that also have Category and Subcategory navigation properties.
I have a Product class that has a Flavor navigation property.
An OrderItem class represents each row in an order:
public class OrderItem
{
    public int OrderItemId { get; set; }
    public string OrderNo { get; set; }
    public Product Product { get; set; }
    public Frosting Frosting { get; set; }
    public Filling Filling { get; set; }
    public int Quantity { get; set; }
}
I'm having issues when trying to save an OrderItem object. I keep getting DbUpdateException: An error occurred while saving entities that do not expose foreign key properties for their relationships. with the Inner Exception being OptimisticConcurrencyException: Store update, insert, or delete statement affected an unexpected number of rows (0). Entities may have been modified or deleted since entities were loaded. I've stepped through my code several times and I can't find anything that modifies or deletes any entities loaded from the database.
I've been able to save the OrderItem, but it creates duplicate entries of Product, Flavor, Subcategory and Category items in the DB. I changed the EntityState of the OrderItem to Modified, but that throws the above exception. I thought it might have been the fact that I have Product, Frosting and Filling objects all referencing the same Subcategory and Category objects, so I tried Detaching Frosting and Filling, saving, attaching, changing the OrderItem entity state to Modified and saving again, but that also throws the above exception.
The following statement creates duplicates in the database:
db.OrderItems.Add(orderItem);
Adding any of the following statements after the above line all cause db.SaveChanges(); to throw the mentioned exception (both Modified and Detached states):
db.Entry(item).State = EntityState.Modified;
db.Entry(item.Product.Flavor.Subcategory.Category).State = EntityState.Modified;
db.Entry(item.Product.Flavor.Subcategory).State = EntityState.Modified;
db.Entry(item.Product.Flavor).State = EntityState.Modified;
db.Entry(item.Product).State = EntityState.Modified;
Can someone please give me some insight? Are my classes badly designed?
The first thing to check would be how the entity relationships are mapped. Generally the navigation properties should be marked as virtual to ensure EF can proxy them. One other observation: if an entity references a Subcategory, then since subcategories already reference a Category, that entity does not need both. You would only need both if subcategories are optional. Having both won't necessarily cause issues, but it can lead to scenarios where a Frosting's Category does not match the category of the Frosting's Subcategory. (I've seen more than enough bugs like this, depending on whether the code went frosting.CategoryId vs. frosting.Subcategory.CategoryId.) Your Flavor definition seemed to only use Subcategory, which is good; just something to be cautious of.
The error detail seems to point at EF knowing about the entities but not being told about their relationships. You'll want to ensure that you have mapping details telling EF how entities like Flavor and Frosting relate to Subcategory. EF can deduce some of these automatically, but my preference is always to be explicit. (I hate surprises!)
public class FlavorConfiguration : EntityTypeConfiguration<Flavor>
{
    public FlavorConfiguration()
    {
        ToTable("Flavors");
        HasKey(x => x.FlavorId)
            .Property(x => x.FlavorId)
            .HasDatabaseGeneratedOption(DatabaseGeneratedOption.Identity);
        HasRequired(x => x.Subcategory)
            .WithMany()
            .Map(x => x.MapKey("SubcategoryId"));
    }
}
Given your Flavor entity didn't appear to have a property for the SubCategoryId, it helps to tell EF about it. EF may be able to deduce this, but with IDs and the automatic naming conventions it looks for, I don't bother trying to remember what works automagically.
Now if this is EF Core, you can replace the .Map() statement with:
.HasForeignKey("SubcategoryId");
which will set up a shadow property for the FK.
If SubCats are optional, then replace HasRequired with HasOptional. The WithMany() just denotes that while a Flavor references a sub category, SubCategory does not maintain a list of flavours.
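For completeness, a configuration class like that is typically registered in the context's OnModelCreating (EF6 style shown; how your actual context is set up is an assumption here):

// Inside your DbContext-derived class:
protected override void OnModelCreating(DbModelBuilder modelBuilder)
{
    modelBuilder.Configurations.Add(new FlavorConfiguration());
    // Register the Frosting and Filling configurations the same way.
    base.OnModelCreating(modelBuilder);
}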
The next point of caution is passing entities outside the scope of the DbContext they were loaded in. While EF does support detaching entities from one context and reattaching them to another, I would argue that this practice is almost always far more trouble than it is worth. Mapping entities to POCO ViewModels/DTOs, then loading them on demand again when performing updates, is simpler and less error-prone than attempting to reattach them. Data state may have changed between the time they were initially loaded and when you go to re-attach them, so fail-safe code needs to handle that scenario anyway. It also saves the hassle of messing around with modified state in the entity sets.
While it may seem efficient not to load the entities a second time, by adopting view models you can optimize reads far more effectively by only pulling back and transporting the meaningful data rather than entire entity graphs. (Systems generally read far more than they update.) Even for update-heavy operations you can utilize bounded contexts to represent large tables as smaller, simpler entities, to load and update a few key fields more efficiently.

best event sourcing db strategy

I want to setup a small event sourcing lib.
I read a few tutorials online, everything understood so far.
The only problem is, in these different tutorials, there are two different database strategies, but without any comments why they use the one they use.
So, I want to ask for your opinion.
And important, why do you prefer the solution you choose.
Solution 1 is the db structure where you create one table for each event.
Solution 2 is the db structure where you create only one generic table, and save the events as a serialized string in one column.
In both cases I'm not sure how they handle event changes, maybe they create a whole new one.
Kind regards
I built my own event sourcing lib and I opted for option 2. Here's why:
You query the event stream by aggregate id, not event type.
Reproducing the events in order would be a pain if they were all in different tables.
It would make upgrading events a bit of a pain.
There is an argument that you can store events on a per-aggregate basis, but that depends on the requirements of the project.
I do have some posts about how event streams are used that you may find helpful.
6 Code Smells With Your CQRS Events and How to Avoid Them
Aggregate Root – How to Build One for CQRS and Event Sourcing
How to Upgrade CQRS Events Without Busting Your Event Stream
Solution 2 is the db structure where you create only one generic table, and save the events as a serialized string in one column
This is by far the best approach, as replaying events is simpler. Now my two cents on event sourcing: it is a great pattern, but you should be careful because not everything is as simple as it seems. In a system I was working on we saved the stream of events per aggregate, but we still had a set of normalized tables, because we just could not accept that in order to get the latest state of an object we would have to run through all the events (snapshots help, but are not a perfect solution). So yes, event sourcing is a fine pattern: it gives you complete versioning of your entities and a full audit log, and it should be used just for that, not as a replacement for a set of normalized tables. But this is just my two cents.
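For illustration, a minimal sketch of what the single generic table (solution 2) and a small store interface around it could look like; all names here are assumptions, not taken from the answers above:

using System;
using System.Collections.Generic;

// One row per event, with the serialized payload in a single column.
public class StoredEvent
{
    public long Sequence { get; set; }      // global append-only ordering
    public Guid AggregateId { get; set; }   // which stream the event belongs to
    public int Version { get; set; }        // position within that aggregate's stream
    public string EventType { get; set; }   // event class name, including its version
    public string Payload { get; set; }     // serialized event body (e.g. JSON)
    public DateTime CreatedUtc { get; set; }
}

// Replaying a stream is then one query by AggregateId ordered by Version,
// regardless of how many different event types exist.
public interface IEventStore
{
    void Append(Guid aggregateId, int expectedVersion, IEnumerable<StoredEvent> newEvents);
    IReadOnlyList<StoredEvent> LoadStream(Guid aggregateId);
}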
I think the best solution is to go with #2. And you can even save your current state together with the related event at the same time if you use a transactional db like MySQL.
I really don't like or recommend solution #1.
If your concern with #1 is event versioning/upgrading, then declare a new class for each change. Don't be too lazy, or too obsessed with reuse. Let the subscribers know about changes; give them the event version.
If your concern with #1 is querying/interpreting events, then you can easily push your events to a NoSQL db or event store later at any time (from the original db).
Also, the pattern I use for the event sourcing lib is something like this:
public interface IUserCreated : IEventModel
{
}

public class UserCreatedV1 : IUserCreated
{
    public string Email { get; set; }
    public string Password { get; set; }
}

public class UserCreatedV2 : IUserCreated
{
    // Fullname added to user creation. Wrt issue: OA-143
    public string Email { get; set; }
    public string Password { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public class EventRecord<T> where T : IEventModel
{
    public string SessionId { get; set; }     // Can be set in emitter.
    public string RequestId { get; set; }     // Can be set in emitter.
    public DateTime CreatedDate { get; set; } // Can be set in emitter.
    public string EventName { get; set; }     // Extract from class or interface name.
    public string EventVersion { get; set; }  // Extract from class name.
    public T EventModel { get; set; }         // Can be set in emitter.
}

public interface IEventModel { }
So, make event versioning and upgrading explicit, both in the domain and in the codebase. Implement handling of new events in subscribers before deploying the origin of the new events. And, if not required, don't allow direct consumption of domain events by external subscribers; put an integration layer or something like that in between.
I hope my thoughts are useful for you.
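To make the "upgrading" part concrete, a hypothetical upcaster (not part of the answer above, just one way to make event upgrading explicit) that converts old V1 events to the latest version before domain handlers see them could look like this:

public static class UserCreatedUpcaster
{
    public static UserCreatedV2 Upgrade(IUserCreated evt)
    {
        switch (evt)
        {
            case UserCreatedV2 v2:
                return v2;                    // already the latest version
            case UserCreatedV1 v1:
                return new UserCreatedV2
                {
                    Email = v1.Email,
                    Password = v1.Password,
                    FirstName = string.Empty, // not present in V1; the default is an assumption
                    LastName = string.Empty
                };
            default:
                throw new NotSupportedException("Unknown UserCreated version: " + evt.GetType().Name);
        }
    }
}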
I read about an event-sourcing approach that consists of:
having two tables: aggregate and event;
based on your use case, either:
a. create a record in the aggregate table, generating an ID, version = 0 and an event type, and create an event in the event table;
b. retrieve the record and its events from the aggregate table by ID or event type, apply the business cases, then update the aggregate table (version and event type) and create an event in the event table.
Although this approach updates some fields in the aggregate table, it leaves the event table append-only and improves performance, as you have the latest version of an aggregate in the aggregate table.
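A rough sketch of the two tables described above (the shapes and column names are illustrative only, not from the answer):

// "aggregate" table: one row per aggregate, pointing at its latest version.
public class AggregateRow
{
    public Guid Id { get; set; }
    public int Version { get; set; }          // bumped each time an event is appended
    public string LastEventType { get; set; }
}

// "event" table: append-only history for each aggregate.
public class EventRow
{
    public Guid AggregateId { get; set; }
    public int Version { get; set; }          // the aggregate version this event produced
    public string EventType { get; set; }
    public string Payload { get; set; }
}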
I would go with #2, and if you really want an efficient way of searching by event type, I would just add an index on that column.
Here are the two strategies for accessing the data about a subject involved in this case: 1) current state and 2) event sequencing.
With current state we process the events but keep only the last state of the subject.
With event sequencing we keep the events and rebuild the current state by processing the events every time we need the state.
Event sequencing is more reliable, as we can track everything that happened to cause the current state, but it's definitely not efficient. It's common sense to also keep intermediate states (snapshots), not only the last one, to avoid reprocessing all the events all the time. Now we have reliability and performance.
In cryptocurrencies there are event sequencing and local snapshots - "local" in the name because blockchains are distributed and the data is replicated.

Which nosql database for heterogeneous records?

I'm looking at different options for storing log entries for easier querying/reporting.
Currently I write scripts that parse and find the data, but the data is becoming more and more in demand, so it's becoming worth it to put the log data in a database.
Log entries are composed of key-value pairs, such as {"timestamp":"2012-04-24 12:34:56.789", "Msg":"OK"} (simplified example).
I'm sure that eventually the log format will be extended to, say, {"timestamp":"2012-04-24 12:34:56.789", "Msg":"OK", "Hostname":"Bubba"}, which means that the "schema" or "document definition" will need to change. Also, we're a Windows + .NET shop.
Hence, I was primarily looking for some NoSQL engine and found RavenDB attractive to use from .NET.
However, I have a hard time finding information about how it, and other NoSQL databases, work with heterogeneous records.
What would be a good fit in your opinion?
With RavenDB you can just store the different types of docs and it will be able to handle the "changes" in schema. Because it is in fact "schema-free", you can write indexes that will only index the fields that are there. See this blog post for some extra info. It's talking about migrations, but the same applies here.
Also, the dynamic fields option will help you here. So, given a doc with arbitrary properties:
public class LogEntry
{
    public string Id { get; set; }
    public List<Attribute> Attributes { get; set; }
}

public class Attribute
{
    public string Name { get; set; }
    public string Value { get; set; }
}
You can write queries like this:
var logs = session.Advanced.LuceneQuery<LogEntry>("LogEntry/ByAttribute")
    .WhereEquals("Msg", "OK")
    .ToList();
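The query above assumes an index named "LogEntry/ByAttribute" that turns each attribute into its own index field. A hedged sketch of such a dynamic-fields index using RavenDB's CreateField (written from memory, so verify against your client version):

public class LogEntry_ByAttribute : AbstractIndexCreationTask<LogEntry>
{
    public LogEntry_ByAttribute()
    {
        // Each attribute becomes a dynamically named index field, so "Msg",
        // "Hostname", etc. can be queried without a fixed schema.
        Map = logs => from log in logs
                      select new
                      {
                          _ = log.Attributes.Select(a => CreateField(a.Name, a.Value))
                      };
    }
}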

RIA Services Invoke Operation return Complex Type with Entity properties

Look at this complex type, which is basically a DTO that wraps some entities. I don't need to track these entities or use them for updating or any of that stuff, I just want to send them down to the client. The stuff at the top are non-entities, just to show that I'm not crazy.
public class ResultDetail
{
    // non-entities (some are even complex) - this works GREAT!
    public string WTF { get; set; }
    public IEnumerable<int> WTFs { get; set; }
    public SomethingElse StoneAge { get; set; }
    public IEnumerable<SomethingElse> StoneAgers { get; set; }

    // these are entities - none of this works
    public EntityA EntityA { get; set; }
    public IEnumerable<EntityB> EntityB { get; set; }
}

public class SomethingElse
{
    public int ShoeString { get; set; }
}
Now look at this:
http://i.snag.gy/tI9O9.jpg
Not a single entity property shows up in the client-side generated types. Are there attributes or something I can use, or do I really need to create DTO objects for every one of these entity types? There are more than the 2 in my sample, and they have many properties.
By the way these entity types have been generated on the client because of the normal query operations in the domain service that work with them.
This is not possible, as the current RIA Services framework is mainly designed for tracking entities. RIA Services cannot detect which properties to serialize and which not to: since every entity has navigation properties, serializing those properties may cause infinite or very long loops, as there is no control over how the object graph is navigated.
Instead, you are expected to program your client in such a way that you load relations on demand correctly.
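If you do go the DTO route the question mentions, the flattening can stay fairly mechanical; a hypothetical sketch (member names invented) where only non-entity shapes are exposed, so the client-side types get generated:

public class EntityADto
{
    public int Id { get; set; }
    public string Name { get; set; }          // copy just the fields the client needs
}

public class EntityBDto
{
    public int Id { get; set; }
    public string Description { get; set; }
}

public class ResultDetail
{
    public string WTF { get; set; }
    public IEnumerable<int> WTFs { get; set; }
    public SomethingElse StoneAge { get; set; }
    public IEnumerable<SomethingElse> StoneAgers { get; set; }

    // Entity data copied into plain DTOs instead of exposing the entities themselves.
    public EntityADto EntityA { get; set; }
    public IEnumerable<EntityBDto> EntityBs { get; set; }
}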
