How to prevent transactions from violating application invariants in Datomic

To elaborate: most relational databases have the concept of a database constraint (see, for example, the Postgres documentation on constraints). What tools does Datomic offer to constrain data or maintain a set of invariants on the stored data?

EDIT 2019-06-28: Since 0.9.5927 (Datomic On-Prem) / 480-8770 (Datomic Cloud), Datomic supports finer write-time validation via Attribute Predicates, Entity Specs and Entity Predicates. This makes most of the initial answer invalid or irrelevant.
In particular, observe that Entity Predicates accept a database value as a parameter, so they can actually enforce invariants that span several Entities.
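For example, an Entity Spec combining required attributes with an Entity Predicate might look like the following sketch (the :myapp idents and the myapp.preds/valid-user? function are hypothetical; the predicate is a function of the database value and the entity id that returns true or false):

;; installed as part of the schema
{:db/ident        :myapp/user-spec
 :db.entity/attrs [:myapp.user/id :myapp.user/email]
 :db.entity/preds 'myapp.preds/valid-user?}

;; requested per entity in a transaction via :db/ensure
[:db/add "new-user" :db/ensure :myapp/user-spec]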
By default, Datomic enforces only a very limited set of constraints on what data may be written, mostly the following (illustrated in the short schema sketch after this list):
uniqueness constraints: see Identity and Uniqueness
type constraints, e.g. you may not write a number to an attribute whose type is :db.type/string
entity resolution: operations like [:db/add [:my.app/id "fjdsklfjsl"] :my.app/content 42] will fail if the [:my.app/id "fjdsklfjsl"] lookup ref does not resolve to an existing entity
conflicts, e.g. Datomic won't let you :db/add two different values for the same entity-attribute pair if the attribute's cardinality is one.
(I may be forgetting some, if so please comment.)
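For illustration, here is a minimal schema sketch (attribute names assumed, matching the example below) on which Datomic would enforce the uniqueness and type constraints listed above:

[{:db/ident       :my.app/id
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}  ;; uniqueness: two entities cannot share the same :my.app/id
 {:db/ident       :my.app/content
  :db/valueType   :db.type/long         ;; type: transacting a string here would fail
  :db/cardinality :db.cardinality/one}]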
In particular, at the time of writing there is no built-in way to attach custom validation or a 'foreign-key' constraint to a given attribute.
However, combining Transaction Functions and speculative writes (a.k.a. db.with()) gives you a powerful way of enforcing arbitrary invariants. For instance, you can wrap a transaction in a transaction function that applies the transaction speculatively using db.with(), then searches the speculative result for invariant violations, throwing an exception if it finds any. You can even make this transaction function very generic by implementing the 'search for invariant violations' part in Datalog.
Here's an example of what the API may look like:
[:myapp.fns/checking-invariants
 ;; a description of the invariant
 {:query
  [:find ?message ?user-id
   :in $db-before $db-after ?tx-data ?tempids ?user-id
   :where
   [$db-before ?user :myapp.user/id ?user-id]
   [$db-before ?user :myapp.user/email ?email-before]
   [$db-after ?user :myapp.user/email ?email-after]
   [(not= ?email-before ?email-after)]
   [(ground "A user may not change her email") ?message]]
  :inputs ["user-id-12342141"]}
 ;; the wrapped transaction
 [[:db/add 125315815291 :myapp.user/email "hello.world@yopmail.com"]
  [:db/add 125315815291 :myapp.user/name "Foo Bar"]]]
Here's an (untested) implementation of :myapp.fns/checking-invariants:
{:db/ident :myapp.fns/checking-invariants,
 :db/fn #db/fn{:lang :clojure,
               :imports [],
               :requires [[datomic.api :as d]],
               :params [db invariant-q tx],
               :code
               ;; apply the wrapped tx speculatively, then run the invariant
               ;; query against the before/after db values; any result is a
               ;; violation and aborts the transaction
               (let [{:keys [query inputs]} invariant-q
                     {:keys [db-before db-after tx-data tempids]} (d/with db tx)]
                 (when-some [violations (not-empty
                                          (apply d/q query
                                                 db-before db-after tx-data tempids
                                                 inputs))]
                   (throw (ex-info
                            "Transaction would violate invariants."
                            {:tx tx
                             :violations violations
                             :t (d/basis-t db-before)})))
                 tx)}}
Limitations:
you can only protect externally: the client has to opt in to using this invariant-checking transaction function.
be careful about performance: abusing this approach may put too much load on the Transactor. In cases where it is safe to do so, you may prefer to perform validation on the Peer using db.invoke() (see the sketch after this list).
make sure your transaction is deterministic, as it will be run twice (more precisely, make sure that whether your transaction violates the invariant is deterministic)
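For instance, a minimal sketch of Peer-side validation (reusing the hypothetical function above) would call the installed function locally with d/invoke before submitting the transaction:

;; run the check on the Peer against a recent db value, then transact as usual;
;; d/invoke throws the ex-info above if the invariant query finds violations
(let [db (d/db conn)]
  (d/invoke db :myapp.fns/checking-invariants db invariant-q tx)
  @(d/transact conn tx))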

One way is to use transaction functions that modify data and do constraint validation during modification: http://docs.datomic.com/database-functions.html#uses-for-transaction-functions
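As a minimal sketch (hypothetical names), such a transaction function receives the current database value and can abort the transaction by throwing:

{:db/ident :myapp.fns/add-positive
 :db/fn #db/fn{:lang :clojure
               :params [db e a v]
               ;; emit the datom if the value passes the check, otherwise abort
               :code (if (pos? v)
                       [[:db/add e a v]]
                       (throw (ex-info "Value must be positive" {:e e :a a :v v})))}}

It would then be used in transaction data as [[:myapp.fns/add-positive some-eid :myapp.user/score 10]].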

Related

How to limit amount of associations in Elixir Ecto

I have this app where there is a Games table and a Players table, and they share an n:n association.
This association is mapped in Phoenix through a GamesPlayers schema.
What I'm wondering how to do is actually quite simple: I'd like there to be an adjustable limit of how many players are allowed per game.
If you need more details, carry on reading, but if you already know an answer feel free to skip the rest!
What I've Tried
I've taken a look at adding check constraints, but without much success. Here's roughly what the check constraint would have to look like:
create constraint("games_players", :limit_players, check: "count(players) <= player_limit")
The problem is that this check syntax is simply invalid, and I don't think there actually is a valid way to achieve this using this call.
I've also looked into adding a trigger to the Postgres database directly in order to enforce this (something very similar to what this answer proposes), but I am very wary of directly fiddling with the DB since I should only be using ecto's interface.
Table Schemas
For the purposes of this question, let's assume this is what the tables look like:
Games
  Property        Type
  id              integer
  player_limit    integer

Players
  Property        Type
  id              integer

GamesPlayers
  Property        Type
  game_id         references(Games)
  player_id       references(Players)
As I mentioned in my comment, I think the cleanest way to enforce this is via business logic inside the code, not via a database constraint. I would approach this using a database transaction, which Ecto supports via Ecto.Repo.transaction/2. This will prevent any race conditions.
In this case I would do something like the following:
begin the transaction
perform a SELECT query counting the number of players in the given game; if the game is already full, abort the transaction, otherwise, continue
perform an INSERT query to add the player to the game
complete the transaction
In code, this would boil down to something like this (untested):
import Ecto.Query
alias MyApp.Repo
alias MyApp.GamesPlayers

@max_allowed_players 10

def add_player_to_game(player_id, game_id, opts \\ []) do
  max_allowed_players = Keyword.get(opts, :max_allowed_players, @max_allowed_players)

  case is_game_full?(game_id, max_allowed_players) do
    false ->
      %GamesPlayers{
        game_id: game_id,
        player_id: player_id
      }
      |> Repo.insert!()

    # Raising an error causes the transaction to fail and roll back
    true ->
      raise "Game #{inspect(game_id)} full; cannot add player #{inspect(player_id)}"
  end
end

defp is_game_full?(game_id, max_allowed_players) do
  current_players =
    from(r in GamesPlayers,
      where: r.game_id == ^game_id,
      select: count(r.id)
    )
    |> Repo.one()

  current_players >= max_allowed_players
end

Avoiding in Filter in AppEngine Datastore

I have two Entities with me on Google DataStore:
A:
id,
name,
.... a lot of other columns
B:
id,
Key(A, <id_of_A_record>) --> Reference Property A,
URL,
size
... and more columns
I have a bunch of A ids with me and I want to query with those ids.
Now I am able to achieve this using
A_ids = [1, 2, 3, 4, 5, 6]
B.all().filter('A IN', A_ids).fetch(None)
However, if there are 6 A_ids, there are 6 db calls, which defeats the purpose of IN. Is there any way I can achieve this while avoiding the IN filter (and the extra DB calls)?
Thank you!
The IN filter is actually just a convenience function provided by your ORM. Behind the scenes, it's doing a series of ORs to make it happen.
The filter name in (value1, ..., valueN) is converted into (name = value1) OR ... OR (name = valueN) (also a DisjunctionNode)
https://googleapis.dev/python/python-ndb/latest/query.html#google.cloud.ndb.query.FilterNode
Even ORs aren't supported by Datastore itself; they're just emulated by the client library:
The nature of the index query mechanism imposes certain restrictions on what a query can do. Datastore mode queries do not support substring matches, case-insensitive matches, or so-called full-text search. The NOT, OR, and != operators are not natively supported, but some client libraries may add support on top of Datastore mode.
https://cloud.google.com/datastore/docs/concepts/queries#restrictions_on_queries
I think the client library performs ORs by launching separate queries, which is why you are seeing those 6 db calls. I don't think you'll be able to escape it without restructuring your data.
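If the extra round trips are the main concern, one mitigation (a sketch using the ndb client library with assumed model definitions, since the question uses the older db API) is to launch the per-key equality queries asynchronously so they at least run in parallel, and merge the results client-side:

from google.appengine.ext import ndb

class B(ndb.Model):
    a = ndb.KeyProperty(kind='A')  # reference to an A entity
    url = ndb.StringProperty()
    size = ndb.IntegerProperty()

def fetch_b_for_a_keys(a_keys):
    # one equality query per A key, started without blocking
    futures = [B.query(B.a == k).fetch_async() for k in a_keys]
    results = []
    for future in futures:
        results.extend(future.get_result())  # blocks until that query finishes
    return results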

cross-group transaction error with only one entity

I'm trying to execute the code below. Sometimes it works fine, but sometimes it does not.
@db.transactional
def _add_data_to_site(self, key):
    site = models.Site.get_by_key_name('s:%s' % self.site_id)
    if not site:
        site = models.Site()
    if key not in site.data:
        site.data.append(key)
        site.put()
        memcache.delete_multi(['', ':0', ':1'],
                              key_prefix='s%s' % self.site_id)
I'm getting the error:
File "/base/data/home/apps/xxxxxxx/1-7-1.366398694339889874xxxxxxx.py", line 91, in _add_data_to_site
site.put()
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 1070, in put
return datastore.Put(self._entity, **kwargs)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore.py", line 579, in Put
return PutAsync(entities, **kwargs).get_result()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
return self.__get_result_hook(self)
File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1569, in __put_hook
self.check_rpc_success(rpc)
File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1224, in check_rpc_success
raise _ToDatastoreError(err)
BadRequestError: cross-group transaction need to be explicitly specified, see TransactionOptions.Builder.withXG
So, my question is:
If I'm changing only one entity (models.Site) why am I getting a cross-group transaction error?
As mentioned in the logs: "Cross-group transaction need to be explicitly specified".
Try specifying it by using
@db.transactional(xg=True)
instead of:
@db.transactional
Does this work if you specify parent=None in your get_by_key_name() query?
Essentially, in order to use a transaction, all entities in the transaction must share the same parent (i.e. you query using one parent, and create a new entity with the same parent), or you must use an XG transaction. You're seeing a problem because you didn't specify a parent.
You may need to create artificial entities to behave as parents in order to do what you're trying to do.
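A sketch of that idea with the db API (model and attribute names taken from the question, the root kind is made up): put every Site under the same artificial parent key, so the transactional function only ever touches one entity group. Note that a single entity group also serializes its writes, so this trades write throughput for transactional safety:

from google.appengine.ext import db

# artificial parent; no entity of this kind ever needs to be put()
ROOT_KEY = db.Key.from_path('SiteRoot', 'root')

@db.transactional
def _add_data_to_site(self, key):
    key_name = 's:%s' % self.site_id
    site = models.Site.get_by_key_name(key_name, parent=ROOT_KEY)
    if not site:
        site = models.Site(key_name=key_name, parent=ROOT_KEY)
    if key not in site.data:
        site.data.append(key)
        site.put()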
I had the same issue. By stepping through the client code, I made the following two observations:
1) Setting a parent of (None) seems to still indicate a parent of that kind, even if there's no specific record elected as that parent.
2) Your transaction will include all ReferenceProperty properties as well.
Therefore, you should, theoretically, get the cross-group transaction exception whenever you haven't declared a parent (by either omitting it or setting it to None) on any of the kinds you're affecting, as soon as at least two kinds are involved (because if you're touching kind A and kind B, you're using two different entity groups: one for the A records and one for the B records), as well as on any of the kinds referred to by ReferenceProperty properties.
To fix this, you must create at least one kind, even without any properties, that can be set as the parent of all of your previously parentless records, as well as of all the ReferenceProperty properties they declare.
If that's not sufficient, then set the flag for the cross-group transaction.
Also, the text of the exception, for me, was: "cross-groups transaction need to be explicitly specified" (plural "groups"). I have version 1.7.6 of the Python AppEngine client.
Please upvote this answer if it fits your scenario.
A cross group transaction error refers to the entity groups being used, not the kind used (here Site).
When it occurs, it's because you are attempting a transaction on entities with different parents, hence putting them in different entity groups.
SHAMELESS PLUG:
You should stop using db and move your code to ndb, especially since it seems you're in the development phase.

BadArgumentError: _MultiQuery with cursors requires __key__ order in ndb

I can't understand what this error means, and apparently no one else on the internet has ever had the same error:
BadArgumentError: _MultiQuery with cursors requires __key__ order
This happens here:
return SocialNotification.query().order(-SocialNotification.date).filter(SocialNotification.source_key.IN(nodes_list)).fetch_page(10)
The property source_key is obviously a key and nodes_list is a list of entity keys previously retrieved.
What I need is to find all the SocialNotifications that have a field source_key that match one of the keys in the list.
The error message is trying to tell you that queries involving IN and cursors must be ordered by __key__ (which is the internal name for the key of the entity). This is needed so that the results can be properly merged and made unique. In this case you have to replace your .order() call with .order(SocialNotification._key).
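So a minimal fix for the query in the question (untested sketch) would be:

return (SocialNotification.query()
        .filter(SocialNotification.source_key.IN(nodes_list))
        .order(SocialNotification._key)
        .fetch_page(10))

Note that this drops the ordering by date; the answers below show how to keep it as the primary sort with the key as a tie-breaker.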
It seems that this also happens when you filter on an inequality and try to fetch a page, e.g. MyModel.query(MyModel.prop != 'value').fetch_page(...). This basically means (unless I missed something) that you can't fetch_page when using an inequality filter, because on one hand you need the sort to be MyModel.prop but on the other hand you need it to be MyModel._key, which is hard :)
I found the answer here: https://developers.google.com/appengine/docs/python/ndb/queries#cursors
You can change your query to:
SocialNotification.query().order(-SocialNotification.date, SocialNotification.key).filter(SocialNotification.source_key.IN(nodes_list)).fetch_page(10)
in order to get this to work. Note that it seems to be slow (18 seconds) when nodes_list is large (1000 entities), at least on the development server. I don't have a large amount of test data on a test server.
You need both the property you want to order on and the key:
.order(-SocialNotification.date, SocialNotification.key)
I had the same error when filtering without a group.
The error occurred every time my filter returned more than one result.
To fix it I actually had to add ordering by key.

Implementing check constraints with SQL CLR integration

I'm implementing 'check' constraints that simply call a CLR function for each constrained column.
Each CLR function is one or two lines of code that attempts to construct an instance of the user-defined C# data class associated with that column. For example, a "Score" class has a constructor which throws a meaningful error message when construction fails (i.e. when the score is outside a valid range).
First, what do you think of that approach? For me, it centralizes my data types in C#, making them available throughout my application, while also enforcing the same constraints within the database, so it prevents invalid manual edits in management studio that non-programmers may try to make. It's working well so far, although updating the assembly causes constraints to be disabled, requiring a recheck of all constraints (which is perfectly reasonable). I use DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS to make sure the data in all tables is still valid for enabled and disabled constraints, making corrections as necessary, until there are no errors. Then I re-enable the constraints on all the tables via ALTER TABLE [tablename] WITH CHECK CHECK CONSTRAINT ALL. Is there a T-SQL statement to re-enable with check all check constraints on ALL tables, or do I have to re-enable them table by table?
Finally, for the CLR functions used in the check constraints, I can either:
Include a try/catch in each function to catch data construction errors, returning false on error, and true on success, so that the CLR doesn't raise an error in the database engine, or...
Leave out the try/catch, just construct the instance and return true, allowing that aforementioned 'meaningful' error message to be raised in the database engine.
I prefer 2, because my functions are simpler without the error code, and when someone using management studio makes an invalid column edit, they'll get the meaningful message from the CLR like "Value for type X didn't match regular expression '^p[1-9]\d?$'" instead of some generic SQL error like "constraint violated". Are there any severe negative consequences of allowing CLR errors through to SQL Server, or is it just like any other insert/update failure resulting from a constraint violation?
For example, a "Score" class has a constructor which throws a meaningful error message when construction fails (i.e. when the score is outside a valid range). First, what do you think of that approach?
It worries me a bit, because calling a ctor requires memory allocation, which is relatively expensive. For each row inserted, you're calling a ctor -- and only for its side-effects.
Also expensive are exceptions. They're great when you need them, but this is a case where you would use them in a ctor context, but not in a check context.
A refactoring could reduce both costs, by having the check exist as a class static or free function, then both the check constraint and the ctor could call that:
class Score {
private:
    int score;
public:
    static bool valid( int score ) {
        return score > 0;
    }
    Score( int s ) {
        if( !valid( s ) ) {
            throw InvalidParameter();
        }
        score = s;
    }
};
Check constraint calls Score::valid(), no construction or exception needed.
Of course, you still have the overhead, for each row, of a CLR call. Whether that's acceptable is something you'll have to decide.
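As a sketch (untested, names assumed) of how the pieces could be wired together: a C# static check exposed as a SQL CLR function, which the check constraint then calls, with no construction and no exception on the happy path:

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // thin wrapper over a static check such as Score.IsValid (the C# counterpart
    // of the valid() method sketched above)
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlBoolean IsValidScore(SqlInt32 value)
    {
        if (value.IsNull)
            return SqlBoolean.True;  // leave NULL handling to the column definition
        return Score.IsValid(value.Value) ? SqlBoolean.True : SqlBoolean.False;
    }
}

The constraint itself would then be declared with something like ALTER TABLE Scores ADD CONSTRAINT CK_Scores_Score CHECK (dbo.IsValidScore(Score) = 1).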
Is there a T-SQL statement to re-enable with check all check constraints on ALL tables, or do I have to re-enable them table by table?
No, but you can do this to generate the commands:
select 'ALTER TABLE ' + name + ' WITH CHECK CHECK CONSTRAINT ALL;'
from sys.tables;
and then run the resultset against the database.
Comments from the OP:
I use base classes called ConstrainedNumber and RegexConstrainedString for all my data types. I could easily move those two classes' simple constructor code to a separate public boolean IsValueValid method as you suggested, and probably will.
The CLR overhead (and memory allocation) would only occur for inserts and updates. Given the simplicity of the methods, and the rate at which table updates will occur, I don't think the performance impact will be anything to worry about for my system.
I still really want to raise exceptions for the information they'll provide to management studio users. I like the IsValueValid method, because it gives me the 'option' of not throwing errors. Within applications using my data types, I could still get the exception by constructing an instance :)
I'm not sure I agree with the exception throwing, but again, the "take-home message" is that by decomposing the problem into parts, you can select which parts you're willing to pay for, without paying for the parts you don't use. You don't use the ctor in the check constraint, because you were only calling it for its side effect. So we decomposed creation and checking. We can further decompose throwing:
class Score {
private:
    int score;
public:
    static bool isValid( int score ) {
        return score > 0;
    }
    static void checkValid( int score ) {
        if( !isValid( score ) ) {
            throw InvalidParameter();
        }
    }
    Score( int s ) {
        checkValid( s );
        score = s;
    }
};
Now a user can call the ctor, and get the check and possible exception and construction, call checkValid and get the check and exception, or isValid to just get the validity, paying the runtime cost for only what he needs.
Some clarification. These data classes sit one level above the primitive types, constraining data to make it meaningful.
Actually, they sit just above the RegexConstrainedString and ConstrainedNumber<T> classes, which is where we're talking about refactoring the constructor's validation code into a separate method.
The problem with refactoring the validation code is that the Regex necessary for validation exists only in the subclasses of RegexConstrainedString, since each subclass has a different Regex. This means that the validation data is only available to RegexConstrainedString's constructor, not to any of its methods. So, if I factor out the validation code, callers would need access to the Regex.
public class Password: RegexConstrainedString
{
    internal static readonly Regex regex = CreateRegex_CS_SL_EC( @"^[\w!""#\$%&'\(\)\*\+,-\./:;<=>\?#\[\\\]\^_`{}~]{3,20}$" );

    public Password( string value ): base( value.TrimEnd(), regex ) {} //length enforced by regex, so no min/max values specified
    public Password( Password original ): base( original ) {}
    public static explicit operator Password( string value ) {return new Password( value );}
}
So, when reading a value from the database or reading user input, the Password constructor forwards the Regex to the base class to handle the validation. Another trick is that it trims the end characters automatically, in case the database type is char rather than varchar, so I don't have to remember to do it. Anyway, here is what the main constructor for RegexConstrainedString looks like:
protected RegexConstrainedString( string value, Regex subclasses_static_regex, int? min_length, int? max_length )
{
    _value = (value ?? String.Empty);
    if (min_length != null)
        if (_value.Length < min_length)
            throw new Exception( "Value doesn't meet minimum length of " + min_length + " characters." );
    if (max_length != null)
        if (_value.Length > max_length)
            throw new Exception( "Value exceeds maximum length of " + max_length + " characters." );
    value_match = subclasses_static_regex.Match( _value ); //Match.Synchronized( subclasses_static_regex.Match( _value ) );
    if (!value_match.Success)
        throw new Exception( "Invalid value specified (" + _value + "). \nValue must match regex:" + subclasses_static_regex.ToString() );
}
Since callers would need access to the subclass's Regex, I think my best bet is to implement a IsValueValid method in the subclass, which forwards the data to the IsValueValid method in the RegexConstrainedString base class. In other words, I would add this line to the Password class:
public static bool IsValueValid( string value ) {return IsValueValid( value.TrimEnd(), regex, min_length, max_length );}
I don't like this, however, because I'm replicating the subclass's constructor code, having to remember to trim the string again and pass the same min/max lengths when necessary. This requirement would be forced upon all subclasses of RegexConstrainedString, and it's not something I want to do. Data classes like Password are only so simple because RegexConstrainedString handles most of the work, implementing operators, comparisons, cloning, etc.
Furthermore, there are other complications with factoring out the code. The validation involves running and storing a Regex match in the instance, since some data types may have properties that report on specific elements of the string. For example, my SessionID class contains properties like TimeStamp, which return a matched group from the Match stored in the data class instance. The bottom line is that this static method is an entirely different context. Since it's essentially incompatible with the constructor context, the constructor cannot use it, so I would end up replicating code once again.
So... I could factor out the validation code by replicating it and tweaking it for a static context and imposing requirements on subclasses, or I could keep things much simpler and just perform the object construction. The relative extra memory allocated would be minimal, as only a string and Match reference is stored in the instance. Everything else, such as the Match and the string itself would still be generated by the validation anyway, so there's no way around that. I could worry about the performance all day, but my experience has been that correctness is more important, because correctness often leads to numerous other optimizations. For example, I don't ever have to worry about improperly formatted or sized data flowing through my application, because only meaningful data types are used, which forces validation to the point-of-entry into the application from other tiers, be it database or UI. 99% of my validation code was removed as a no-longer-necessary artifact, and I find myself only checking for nulls nowadays. Incidentally, having reached this point, I now understand why including nulls was the billion dollar mistake. Seems to be the only thing I have to check for anymore, even though they are essentially non-existent in my system. Complex objects having these data types as fields cannot be constructed with nulls, but I have to enforce that in the property setters, which is irritating, because they otherwise would never need validation code... only code that runs in response to changes in values.
UPDATE:
I simulated the CLR function calls both ways, and found that when all data is valid, the performance difference is only fractions of a millisecond per thousand calls, which is negligible. However, when roughly half the passwords are invalid, throwing exceptions in the "instantiation" version, it's three orders of magnitude slower, which equates to about 1 extra second per 1000 calls. The magnitude of the difference will of course multiply as multiple CLR calls are made for multiple columns in the table, but that's a factor of 3 to 5 for my project. So, is an extra 3-5 seconds per 1000 updates acceptable to me, as a trade-off for keeping my code very simple and clean? Well, that depends on the update rate. If my application were getting 1000 updates per second, a 3-5 second delay would be devastating. If, on the other hand, I were getting 1000 updates a minute or an hour, it may be perfectly acceptable. In my situation, I can tell you now that it's quite acceptable, so I think I'll just go with the instantiation method and allow the errors through. Of course, in this test I handled the errors in the CLR instead of letting SQL Server handle them. Marshalling the error info to SQL Server, and then possibly back to the application, could definitely slow things down much more. I guess I will have to fully implement this to get a real test, but from this preliminary test, I'm pretty sure what the results will be.
