pyDatalog: is it possible to define multiple independent datalog sessions? - logic-programming

I'm working on some code that assesses data in a database to see if instances in a stream of incoming events comply with a set of protocols. The idea is to use pyDatalog to do this. Ideally, we would like to be able to assess the data against several independent rule sets, which define separate protocols the events should comply with.
In other words, is it possible to create several logically independent pyDatalog sessions which each have their own sets of rules, but take data from the same underlying database?

Support for multiple rule sets is planned for release 0.14, together with thread safety.
With the current and previous releases, you can store the different rule sets in the same pyDatalog session, provided that there are no predicate name conflicts. For example, you could prefix each predicate with an identifier of the rule set it belongs to. Then, by calling the appropriate predicate, you'll activate the relevant rule set, without visible performance loss.
For prefixed predicates (those referring to a Python class, e.g. Employee.id[X]==Y), you would need to create Python subclasses with the appropriate prefix. You could see some performance drop, but it should be small.
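A minimal sketch of the prefixing idea for plain (non-class) predicates; the predicate names here are purely illustrative:

from pyDatalog import pyDatalog

pyDatalog.create_terms('X, event, protoA_ok, protoB_ok')

+ event('e1')                 # facts loaded from the shared database

protoA_ok(X) <= event(X)      # rule belonging to rule set A
protoB_ok(X) <= event(X)      # rule belonging to rule set B

print(protoA_ok(X))           # querying protoA_* exercises only rule set A's rules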

Related

SCD-2 in data modelling: how do I detect changes?

I know the concept of SCD-2 and I'm trying to improve my skills with it by doing some practice exercises.
I have the following scenario/experiment:
I'm calling a REST API daily to extract information about companies.
In my initial load to the DB everything is new, so everything is very easy.
The next day I call the same REST API, which might return the same companies, but some of them might (or might not) have changes (e.g., they changed their size, profits, location, ...).
I know SCD-2 might be really simple if the REST API returned just the records with changes, but in this case it may also return records without changes.
In this scenario, how do people detect whether a company's data has changed in order to apply SCD-2? Do they compare all the fields?
Is there any example out there that I can see?
There is no standard SCD-2, nor even a unique concept of it. It is a general term for a large number of possible approaches. The only way is to practice and see what is suitable for your use case.
In any case you must identify the natural key of the dimension and the set of attributes for which you want to keep history.
You may of course make it more complex by deciding to use your own surrogate key.
You mentioned that there are two main types of interface for the process:
• You periodically get a full set of the dimension data
• You get the "changes only" (aka a delta interface)
Paradoxically, the former is much simpler to handle than the latter.
First of all, in the full dimensional snapshot the natural key is unique, in contrast to the delta interface (where you may get multiple changes for one entity).
Additionally, with a delta interface you have to handle late delivery of changes, or even changes delivered in the wrong order.
The next important decision is whether you expect deletes to occur. This is again trivial in the full interface; for the delta interface you must define some convention for how this information is passed.
A connected question is whether a previously deleted entity can be reused (i.e. reappear in the data).
If you support delete/reuse you'll have to think about how to show them in your dimension table.
In any case you will need some additional columns in the dimension to cover the historical information.
Some implementations use a change_timestamp; others use a validity interval, valid_from and valid_to.
Still other implementations claim that an additional sequence number is required, so you avoid the trap of multiple changes with an identical timestamp.
So you see that before you look for some particular implementation, you need to carefully decide on the options above. For example, the full and delta interfaces lead to completely different implementations.
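To make the full-snapshot case concrete, here is a minimal sketch in SQL (PostgreSQL flavor; the table and column names are assumed, not from the question). It detects changed companies by comparing the tracked attributes against the current dimension row, closes the current version, and opens a new one using a valid_from/valid_to interval:

-- close the current version of companies whose tracked attributes changed
UPDATE dim_company d
SET    valid_to = CURRENT_DATE
FROM   stg_company s                          -- today's full snapshot
WHERE  d.company_nk = s.company_nk            -- natural key
AND    d.valid_to = DATE '9999-12-31'         -- current version
AND    (d.size, d.profits, d.location)
       IS DISTINCT FROM (s.size, s.profits, s.location);

-- open a new version for every company without a current row
-- (brand new companies, and the ones just closed above)
INSERT INTO dim_company (company_nk, size, profits, location, valid_from, valid_to)
SELECT s.company_nk, s.size, s.profits, s.location, CURRENT_DATE, DATE '9999-12-31'
FROM   stg_company s
LEFT   JOIN dim_company d
       ON  d.company_nk = s.company_nk
       AND d.valid_to = DATE '9999-12-31'
WHERE  d.company_nk IS NULL;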

Need to establish Database connectivity in DRL file

Need to establish Oracle database connectivity in Drools to get some data as and when required while executing the rules. How do I go about that?
You shouldn't do this. Instead, you should query your data out of the database first, then pass it into the rules as facts in working memory.
I tried to write a detailed answer about all the reasons you shouldn't do this, but it turns out that StackOverflow has a character limit. So I'm going to give you the high level reasons.
• Latency
• Data consistency
• Lack of DB access hardening
• Extreme design constraints for rules
• High maintenance burden
• Potential security issues
Going in order ...
Latency. Database queries aren't free. Regardless of how good your connection management is, you will incur overhead every time you make a database call. If you have a solid understanding of the Drools execution lifecycle and how it executes rules, and you design your rules to explicitly query the database only in ways that minimize the number and size of calls, you could consider this an OK risk. A good caching layer wouldn't be amiss. Note that having to properly design your rules this way is not trivial, and you'll incur perpetual overhead in having to make sure all of your rules remain compliant.
(Hint: this means you must never ever call the database from the 'when' clause.)
Data consistency. A database is a shared resource. If you make the same query in two different 'when' clauses, there is no guarantee that you'll get back the same result. Again, you could potentially work around this with a deep understanding of how Drools evaluates and executes rules, and designing your rules appropriately. But the same issues from 'latency' will affect you here -- namely the burden of perpetual maintenance. Further the rule design restrictions -- which are quite strict -- will likely make your other rules and use cases less efficient as well because of the contortions you need to pull to keep your database-dependent rules compatible.
Lack of hardening. The Java code you can write in a DRL function is not the same as the Java code you can write in a Java class. DRL files are parsed as strings and then interpreted and then compiled; many language features are simply not available. (Some examples: try-with-resources, annotations, etc.) This makes properly hardening your database access extremely complicated and in some cases impossible. Libraries which rely on annotations like Spring Data are not available to you for use in your DRL functions. You will need to manage your connection pooling, transaction management, connection management (close everything!), error handling, and so on manually using a subset of the Java language that is roughly equivalent to Java 5.
This is, of course, specific to writing your code to access the database as a function in your DRL. If you instead implement your database access in a service which acts like a database access layer, you can leverage the full JDK and its features and functionality in that external service which you then pass into the rules as an input. But in terms of DRL functions, this point remains a major concern.
Rule design constraints. As I mentioned previously, you need to have an in-depth understanding of how Drools evaluates and executes rules in order to write effective rules that interact with the database. If you're not aware that all left hand sides ("when" clauses) are evaluated first, then the matches are ordered by salience, and then the right hand sides ("then" clauses) are executed sequentially in order .... well, you absolutely should not be trying to do this from the rules. Not only do you as the initial implementor need to understand the rules execution lifecycle, but everyone who comes after you who is going to be maintaining your rules needs to also understand this and continue implementing the rules based on these restrictions. This is your high maintenance burden.
As an example, here are two rules. Let's assume that "DataService" is a properly implemented data access layer with all the necessary connection and transaction management, and it is passed into working memory as a fact.
rule "Record Student Tardiness"
when
$svc: DataService() // data access layer
Tardy( $id: studentId )
$student: Student($tardy: tardyCount) from $svc.getStudentById($id)
then
$student.setTardyCount($tardy + 1)
$svc.save($student)
end
rule "Issue Demerit for Excessive Tardiness"
when
$svc: DataService() // data access layer
Tardy( $id: studentId )
$student: Student(tardyCount > 3) from $svc.getStudentById($id)
then
AdminUtils.issueDemerit($student, "excessive tardiness")
end
If you understand how Drools executes rules, you'll quickly realize the problems with these rules. Namely:
• we call getStudentById twice (latency, consistency)
• the changes to the student's tardy count are not visible to the second rule
So if our student, Alice, has 3 tardies recorded in the database, and we pass in a new Tardy instance for her, the first rule will hit and her tardy count will increment and be saved (Alice will have 4 tardies in the database.) But the second rule will not hit! Because at the time the matches are calculated, Alice only had 3 tardies, and the "issue demerit" rule only triggers for more than 3. So while she has 4 tardies now, she didn't then.
The solution to the second problem is, of course, to call update to let Drools know to reevaluate all matches with the new data in working memory. This of course exacerbates the first issue -- now we'll be calling getStudentById four times!
Finally, the last problem is potential security issues. This really depends on how you implement your queries, but you'll need to be doubly sure you're not accidentally exposing any connection configuration (URL, credentials) in your DRLs, and that you've properly sanitized all query inputs to protect yourself against SQL injection.
The right way to do this, of course, is not to do it at all. Call the database first, then pass it to your rules.
As an example, let's say we have a set of rules which is designed to determine if a customer purchase is "suspicious" by comparing it to trends from the previous 3 months' worth of purchases.
// Assume this class serves as our data access layer and does proper connection
// and transaction management. It might be something like a Spring Data JPA
// repository, or something from another library; the specifics are not relevant.
private PurchaseService purchaseService;

public boolean isSuspiciousPurchase(Purchase purchase) {
    List<Purchase> previous = purchaseService.getPurchasesForCustomerAfterDate(
        purchase.getCustomerId(),
        LocalDate.now().minusMonths(3));

    KieBase kBase = ...;
    KieSession session = kBase.newKieSession();
    session.insert(purchase);
    session.insert(previous);
    // insert other facts as needed

    session.fireAllRules();
    // ...
}
As you can see, we call the database and pass the result into working memory. Then we can write the rules such that they do work against that existing list, without needing to interact with the database at all.
If our use case requires modifying the database -- e.g. saving updates -- we can pass those commands back to the caller and they can be invoked after fireAllRules completes. Not only will that keep us from having to interact with the database in the rules, but it'll give us better control over our transaction management (you can probably group the updates into a single transaction, even if they originally came from multiple rules). And since we don't need to understand anything about how Drools evaluates and executes rules, it'll be a little more robust in case a rule with a database "update" is triggered twice.
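One common way to pass those commands back (a sketch, not from the original answer; the global name and rule are illustrative) is to collect pending updates in a Drools global and let the caller flush them after fireAllRules():

// In the DRL: record what should be saved instead of touching the database.
global java.util.List pendingSaves;

rule "Flag Suspicious Purchase"
when
    $p: Purchase( suspicious == true )
then
    pendingSaves.add($p);
end

// In the calling Java code:
List<Purchase> pendingSaves = new ArrayList<>();
session.setGlobal("pendingSaves", pendingSaves);
session.fireAllRules();
pendingSaves.forEach(purchaseService::save); // one place, one transaction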
You can use a function like the one below to get details from the DB. Here I have written the function in the DRL file, but it is suggested to put such code in a Java file and call the specific method from the DRL file.
function String ConnectDB(String connectionClass, String url, String user, String password) {
    try {
        Class.forName(connectionClass);
        java.sql.Connection con = java.sql.DriverManager.getConnection(url, user, password);
        java.sql.Statement st = con.createStatement();
        java.sql.ResultSet rs = st.executeQuery("select * from Employee where employee_id=199");
        rs.first();
        String name = rs.getString("employee_name");
        con.close();
        return name;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

rule "DBConnection"
when
    person: PersonPojo( name == ConnectDB("com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/root", "root", "redhat1!") )
    // ...
then
    // ...
end

Bad practice to have IDs that are not defined in the database?

I am working on an application that someone else wrote and it appears that they are using IDs throughout the application that are not defined in the database. For a simplified example, let's say there is a table called Question:
Question
------------
Id
Text
TypeId
SubTypeId
Currently the SubTypeId column is populated with a set of IDs that do not reference another table in the database. In the code these SubTypeIds are mapped to a specific string in a configuration file.
In the past when I have had these types of values I would create a lookup table and insert the appropriate values, but in this application there is a mapping between the IDs and their corresponding text values in a configuration file.
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Is it bad practice to define a lookup table in a configuration file rather than in the database itself?
Absolutely, yes. It brings in a heavy dependence on the code to manage and maintain references, fetch the necessary values, etc. In a situation where you now need to create additional functionality, you would rely on copy-pasting the mapping (or importing it, etc.), which is more likely to cause an issue.
It's similar to why DB constraints should be in the DB rather than in the program/application that's accessing it: any maintenance or new application needs to replicate all the behaviour and rules. Having things this way has similar side-effects to the ones I've mentioned here in another answer.
Good reasons to have a lookup table (a sketch follows this list):
Since DBs can naturally represent these kinds of relations, it is the obvious place to put them.
Without a lookup table, queries first have to be constructed in code to translate between the Type/SubType text and IDs, instead of handling that as part of the WHERE/HAVING clause of the query that is actually executed.
Speed/performance: with the right indexes and table structures, you'd benefit from this (and reduce the code complexity that manages it).
You don't need to update your code to add a new Type or SubType, or to edit/delete them.
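For illustration, a minimal sketch of such a lookup table and foreign key (only the Question table and its columns come from the question; everything else is assumed):

CREATE TABLE SubType (
    Id   int PRIMARY KEY,
    Name varchar(100) NOT NULL
);

ALTER TABLE Question
    ADD CONSTRAINT FK_Question_SubType
    FOREIGN KEY (SubTypeId) REFERENCES SubType (Id);

-- human-readable results become a simple join
SELECT q.Text, st.Name AS SubType
FROM   Question q
JOIN   SubType st ON st.Id = q.SubTypeId;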
Possible reasons it was done that way, which I don't think are valid reasons:
The TypeID and SubTypeID are related and the original designer did not know how to create a complex foreign key. (Not a good reason though.)
Another could be 'translation' but that could also be handled using foreign key relations.
In some pieces of code, there may not be a strict TypeID-to-SubTypeID relation, and that logic was handled in code rather than in the DB. Again, this can be managed using 'flag' values or NULLs where possible. Those specific cases could be handled by designing the DB right and then working around the unique/odd situation in code, instead of putting all the dependence on the code.
NoSQL: Original designer may be under the impression that such foreign keys or relations cannot be done in a NoSQL db.
And the obvious 'people' problem vs. technical challenge: the original designer may not have had a proper understanding of databases and may have been a programmer who built that application (or was made to build it) without the right knowledge or assistance.
Just to put it out there: If the previous designer was an external contractor, he may have used the code maintenance complexity or 'support' clause as a means to get more business/money.
As a general rule of thumb, I'd say that keeping all the related data in a DB is a better practice since it removes a tacit dependency between the DB and your app, and because it makes the DB more "comprehensible." If the definitions of the SubTypeIDs are in a lookup table it becomes possible to create queries that return human-readable results, etc.
That said, the right answer probably depends a bit on the specifics of the application. If there's very tight coupling between the DB and app to begin with (e.g., if the DB isn't going to be accessed by other clients), this is probably a minor concern, particularly if the set of SubTypeIDs is small and seldom changes.

Grouping postgresql tables in schemas

I'm currently building an app that contains around 60 or so tables, some with meta information, some with actual data and a couple of views on top of them.
To keep things organized I'm prefixing all table names with meta_ or view_ respectively, but might it be worth it to just put them in different schemas inside the same database?
Is this common practice?
Any reasons not to do this?
Can I create FK-constraints over different schemas?
PS: There seems to be no performance penalty judging from this answer: PostgreSQL: Performance penalty for joining two tables in separate schemas
You can use schemas in any way you like. There is no limitation in principle. They are especially useful if you need to GRANT certain rights on groups of objects, for instance to separate users inside a database. You can largely treat them like directories in a file system (the analogy has its limits, though).
I use v_ prefix for views and f_ prefix for functions. But that is basically just notational convenience to spot those objects quickly in a text search - if I hack the dump for instance. Make such prefixes short, you will have to type them for the rest of your database life. Multiple prefixes from various semantic layers could have to apply to a single object.
I would not mix functional prefixes (v_ or view_) with semantic prefixes (meta_) on the same level. Rather create a separate schema for meta-objects and use prefixes to denote object types throughout all your schemata.
Whichever system you choose, stay consistent! Otherwise it will do more harm than good.
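As for the foreign-key question: PostgreSQL does allow FK constraints between tables in different schemas of the same database. A minimal sketch (the schema and table names are made up):

CREATE SCHEMA meta;
CREATE SCHEMA data;

CREATE TABLE meta.category (
    id serial PRIMARY KEY
);

CREATE TABLE data.item (
    id          serial PRIMARY KEY,
    category_id int NOT NULL REFERENCES meta.category (id)  -- FK across schemas
);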

How do you structure config data in a database?

What is people's preferred method of storing application configuration data in a database? Having done this in the past myself, I've used two ways of doing it.
You can create a table where you store key/value pairs, where the key is the name of the config option and the value is its value. The pros of this are that adding new values is easy and you can use the same routines to set/get data. The downside is that you have untyped data as the value.
Alternatively, you can hardcode a configuration table, with each column being the name of a value and its datatype. The downside to this is more maintenance when setting up new values, but it allows you to have typed data.
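For concreteness, minimal sketches of both options (the table and column names are made up):

-- Option 1: key/value pairs; values are untyped strings
CREATE TABLE Config (
    Name  varchar(100) PRIMARY KEY,
    Value varchar(4000) NOT NULL
);

-- Option 2: one typed column per setting, a single row
CREATE TABLE AppConfig (
    MaxUsers     int          NOT NULL,
    SmtpHost     varchar(255) NOT NULL,
    DebugEnabled bit          NOT NULL
);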
Having used both, my preference lies with the first option, as it's quicker to set things up; however, it's also riskier and can (slightly) reduce performance when looking up data. Does anyone have any alternative methods?
Update
It's necessary to store the information in a database because as noted below, there may be multiple instances of the program that require configuring the same way, as well as stored procedures potentially using the same values.
You can expand option 1 with a third column giving a data type. Your application can then use this data-type column to cast the value.
But yeah, I would go with option 1 if config files are not an option. Another advantage of option 1 is that you can read it into a Dictionary object (or equivalent) really easily for use in your application.
Since configuration typically can be stored in a text file, the string data type should be more than enough to store the configuration values. If you're using a managed language, it's the code that knows what the data type should be, not the database.
More importantly, consider these things with configuration:
• Hierarchy: Obviously, configuration will benefit from a hierarchy.
• Versioning: Consider the benefit of being able to roll back to the configuration that was in effect at a certain date.
• Distribution: Sometimes it might be nice to be able to cluster an application. Some properties should probably be local to each node in a cluster.
• Documentation: Depending on whether you have a web tool or something, it is probably nice to store the documentation about a property close to the code that uses it. (Code annotations are very nice for this.)
• Notification: How is the code going to know that a change has been made somewhere in the configuration repository?
Personally, I like an inverted way of handling configuration, where the configuration properties are injected into the modules, which don't know where the values came from. This way, the configuration management system can be very complex or very simple depending on your (current) needs.
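A tiny sketch of that inversion (Python just for brevity; the class and key names are made up):

# The module receives plain values; it has no idea whether they
# came from a database table, a file, or an environment variable.
class ReportModule:
    def __init__(self, input_path: str):
        self.input_path = input_path

# Only the composition root knows the configuration source.
config = {'inputPath': '/data/incoming'}  # e.g. read from the config table
module = ReportModule(input_path=config['inputPath'])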
I use option 1.
My project uses a database table with four columns:
ID [pk]
Scope (default 'Application')
Setting
Value
Settings with a Scope of 'Application' are global settings, such as Maximum number of simultaneous users.
Each module has its own scope; so our ResultsLoader and UserLoader have different scopes, but both have a Setting named 'inputPath'.
Defaults are either provided in the source code or are injected via our IoC container. If no value is injected or provided in the database, the default from the code is used (if one exists). Therefore, defaults are never stored in the database.
This works out quite well for us. Each time we backup the database we get a copy of the Configuration which is quite handy. The two are always in sync.
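A sketch of that table in SQL (SQL Server flavor; only the four columns come from the description above, the constraints are assumed):

CREATE TABLE Settings (
    ID      int IDENTITY PRIMARY KEY,
    Scope   varchar(50)   NOT NULL DEFAULT 'Application',
    Setting varchar(100)  NOT NULL,
    Value   varchar(4000) NOT NULL,
    CONSTRAINT UQ_Settings UNIQUE (Scope, Setting)
);

-- the ResultsLoader module reads its own 'inputPath'
SELECT Value FROM Settings
WHERE Scope = 'ResultsLoader' AND Setting = 'inputPath';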
It seems overkill to use the DB for config data.
EDIT (sorry too long for comment box):
Of course there are no strict rules on how you implement any part of your program. For the sake of argument, slotted screwdrivers work on some Phillips screws! I guess I judged too early, before knowing what your scenario is.
Relational databases excel at massive data stores with quick storing, updating, and retrieval, so if your config data is updated and read constantly, then by all means use a DB.
Another scenario where a DB may make sense is when you have a server farm and want your database to store the central config, but then you could do the same with a shared network drive pointing to an XML config file.
An XML file is better when your config is hierarchically structured. You can easily organize, locate, and update what you need, and as a bonus you can version-control the config file along with your source code!
All in all, it all depends on how the config data is used.
That concludes my opinion with limited knowledge of your application. I am sure you can make the right decision.
I guess this is more of a poll, so I'll say the column approach (option 2). However it will depend on how often your config changes, how dynamic it is, and how much data there is, etc.
I'd certainly use this approach for user configurations / preferences, etc.
Go with option 2.
Option 1 is really a way of implementing a database on top of a database, and that is a well-known antipattern, which is just going to give you trouble in the long run.
I can think of at least two more ways:
(a) Create a table with key, string-value, date-value, int-value, real-value columns, leaving unused types NULL (sketched after this list).
(b) Use a serialization format like XML, YAML or JSON and store it all in a blob.
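A sketch of variant (a), with made-up names:

CREATE TABLE Config (
    Name        varchar(100) PRIMARY KEY,
    StringValue varchar(4000) NULL,
    DateValue   datetime NULL,
    IntValue    int NULL,
    RealValue   float NULL    -- unused type columns stay NULL
);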
Where do you store the configuration settings your app needs to connect to the database?
Why not store the other config info there too?
I'd go with option 1, unless the number of config options were VERY small (seven or less)
At my company, we're working on using option one (a simple dictionary-like table) with a twist. We're allowing for string substitution using tokens which contain the name of the config variable to be substituted.
For example, the table might contain rows ('database connection string', 'jdbc://%host%...') and ('host', 'foobar'). Encapsulating that with a simple service or stored procedure layer allows for an extremely simple, but flexible, recursive configuration. It supports our need to have multiple isolated environments (dev, test, prod, etc).
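A sketch of that recursive substitution (Python just for brevity; in our case it lives in the service or stored procedure layer):

import re

def resolve(config, name):
    # expand %token% references recursively against the key/value rows
    value = config[name]
    return re.sub(r'%(\w+)%', lambda m: resolve(config, m.group(1)), value)

config = {'database connection string': 'jdbc://%host%/app', 'host': 'foobar'}
print(resolve(config, 'database connection string'))  # jdbc://foobar/app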
I've used both 1 and 2 in the past, and I think they're both terrible solutions. I think option 2 is better because it allows typing, but it's a lot uglier than option 1. The biggest problem I have with either is versioning the config: you can version SQL reasonably well using standard version control systems, but merging changes is usually problematic. Given an opportunity to do this "right", I'd probably create a bunch of tables, one for each type of configuration parameter (not necessarily for each parameter itself), thus getting the benefit of typing and the benefit of the key/value paradigm where appropriate. You can also implement more advanced structures this way, such as lists and hierarchies, which will then be directly queryable by the app instead of having to load the config and then transform it somehow in memory.
I vote for option 2. Easy to understand and maintain.
Option 1 is good for an easily expandable, central storage location. In addition to some of the great column suggestions by folks like RB, Hugo, and elliott, you might also consider:
Include a Global/User setting flag with a user field or even a user/machine field (for machine-specific UI type settings).
Those can, of course, be stored in a local file, but since you are using the database anyway, that makes these settings available for aliasing a user when debugging, which can be important if the bug is settings-related. It also allows an admin to manage settings when necessary.
I use a mix of option 2 and XML columns in SQL Server.
You may also want to add a check constraint to keep the table at one row.
CREATE TABLE [dbo].[MyOption] (
    [GUID] uniqueidentifier CONSTRAINT [dfMyOptions_GUID] DEFAULT newsequentialid() ROWGUIDCOL NOT NULL,
    [Logo] varbinary(max) NULL,
    [X] char(1) CONSTRAINT [dfMyOptions_X] DEFAULT 'X' NOT NULL,
    CONSTRAINT [MyOptions_pk] PRIMARY KEY CLUSTERED ([GUID]),
    CONSTRAINT [MyOptions_ck] CHECK ([X]='X')
)
For settings that have no relation to any DB tables, I'd probably go for the EAV approach if you need the DB to work with the values. Otherwise a serialized field value is good if it's really just a store for app code.
But what about a format for a single field that stores multiple config settings to be used by the DB?
For example, one field per user that contains all their settings related to their message board view (like default sort order, blocked topics, etc.), and maybe another with all their settings for their theme (like text color, background color, etc.).
Storing hierarchy and documents in a relational DB is madness. Either you have to shred them, only to recombine them at some later stage, or they get bunged inside a BLOB, which is even more stupid.
Don't use a relational DB for non-relational data; the tool does not fit. Consider something like MongoDB or CouchDB for this: schema-less, non-relational data stores. Store it as JSON if it's coming down the wire in any way to a client; use XML for the server side.
CouchDB gives you versioning out of the box.
Don't store configuration data in a database unless you have a very good reason to. If you do have a very good reason, and are absolutely certain you are going to do it, you should probably store it in a data serialization format like JSON or YAML (not XML, unless you actually need a markup language to configure your app -- trust me, you don't) as a string. Then you can just read the string, and use tools in whatever language you work in to read and modify it. Store the strings with timestamps, and you have a simple versioning scheme with the ability to store hierarchical data in a very simple system. Even if you don't need hierarchical config data, at least now if you need it in the future you won't have to change your config interface to get it. Of course you lose the ability to do relational queries on your config data, but if you're storing that much config data, then you're probably doing something very wrong anyway.
Companies tend to store lots of configuration data for their systems in a database; I'm not sure why, and I don't think much thought goes into these decisions. I don't see this kind of thing done too often in the OSS world. Even large OSS programs that need lots of configuration, like Apache, don't need a connection to a database containing an apache_config table in order to work. Having a huge amount of configuration to deal with in your apps is a bad code smell; storing that data in a database just causes more problems (as this thread illustrates).
