How to store mathematical values in PDDL? - artificial-intelligence

I need to create a plan in PDDL to visit a subset of n places, each of which has a score. I need to maximize the utility, which is defined as the sum of the individual scores. How do I represent this domain in PDDL? Specifically, how do I store the score for each place?

I assume that you are familiar with action costs and plan metrics. If not, please state so in the comments.
The easiest way would be, I guess, via action costs. The difficulty in your case is that the quality of a plan depends on the places that have been visited after the plan has been executed, so it is not directly tied to the costs of the actions you perform but to the state variables you produce. Say you increased the plan's quality every time an action causes the agent to visit a location: then you could get wrong plan qualities, because the same location can be visited multiple times. However, you can fix this problem as follows:
You simply add an action increase-plan-quality(?location) of the following form:
(1) in each location it is executable exactly once
(2) in each location, it is only executable if the agent is currently at that location
(3) the effects increase the quality of the plan by the score of that location
Then, you only need to set the plan metric to maximize and you are done.
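A minimal sketch of this encoding in PDDL 2.1 with numeric fluents follows; the domain name, the move action, the connected predicate, and the counted/score/total-score names are illustrative assumptions rather than anything prescribed by the question:

(define (domain visit-places)
  (:requirements :typing :fluents :negative-preconditions)
  (:types location)
  (:predicates
    (at ?l - location)                 ; the agent is currently at ?l
    (connected ?from ?to - location)
    (counted ?l - location))           ; the score of ?l has already been collected
  (:functions
    (score ?l - location)              ; per-location score, set in the problem file
    (total-score))                     ; the plan quality accumulated so far

  (:action move
    :parameters (?from ?to - location)
    :precondition (and (at ?from) (connected ?from ?to))
    :effect (and (not (at ?from)) (at ?to)))

  ;; (1) applicable at most once per location, (2) only while the agent is there,
  ;; (3) increases the plan quality by that location's score
  (:action increase-plan-quality
    :parameters (?l - location)
    :precondition (and (at ?l) (not (counted ?l)))
    :effect (and (counted ?l)
                 (increase (total-score) (score ?l)))))

In the problem file you would then set each score in :init, e.g. (= (score place1) 5), and declare (:metric maximize (total-score)).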
Why does this work?
(A) If your agent is at a location, the maximize metric will cause the planner to apply the action that increases the quality (which, due to (2), is applicable there).
(B) These additional actions cannot produce wrong plan qualities because, due to (1), each such action is applicable only once per location. The only thing that can happen is that you have visited a location but the planner does not apply the action that increases the plan's quality (although it could do so). But that's the planner's choice and pretty unlikely, I guess.
Another possibility would be to rely on so-called state-dependent action costs. But that concept is quite new (about 2 years old, if I remember correctly), so I guess only a limited number of planners can handle it, and I am also assuming that it needs a specialized syntax that is not part of the standard PDDL specification.

Related

Concurrent queries in PostgreSQL - what is actually happening?

Let us say we have two users running a query against the same table in PostgreSQL. So,
User 1: SELECT * FROM table WHERE year = '2020' and
User 2: SELECT * FROM table WHERE year = '2019'
Are they going to be executed at the same time as opposed to executing one after the other?
I would expect that if I have 2 processors, I can run both at the same time. But I am thinking that matters become far more complicated depending on where the data is located (e.g. disk) given that it is the same table, whether there is partitioning, configurations, transactions, etc. Can someone help me understand how I can ensure that I get my desired behaviour as far as PostgreSQL is concerned? Under which circumstances will I get my desired behaviour and under which circumstances will I not?
EDIT: I have found this other question which is very close to what I was asking - https://dba.stackexchange.com/questions/72325/postgresql-if-i-run-multiple-queries-concurrently-under-what-circumstances-wo. It is a bit old and doesn't have many answers; I would appreciate a fresh outlook on it.
If the two users have two independent connections and they don't go out of their way to block each other, then the queries will execute at the same time. If they need to access the same buffer at the same time, or read the same disk page into a buffer at the same time, they will use very fast locking/coordination methods (LWLocks, spin locks, or atomic operations like CAS) to coordinate that. The exact techniques vary from version to version, as better methods become widely available on supported platforms and as people find the time to change the implementation to use those better methods.
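If you want to see for yourself that the two statements run concurrently, you can watch pg_stat_activity from a third session while they are running (a quick sanity check, not a benchmark):

-- each active connection (backend) appears as its own row, so you should see
-- both SELECTs listed at the same time if they overlap
SELECT pid, state, wait_event_type, query
FROM pg_stat_activity
WHERE state = 'active';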
How can I ensure that I get my desired behaviour as far as PostgreSQL is concerned?
You should always get the correct answer to your query (or possibly some kind of ERROR indicating a failure to serialize if you are using the highest, non-default, isolation level; but that doesn't seem to be a risk if each of these queries runs in a single-statement transaction).
I think you are overthinking this. The point of using a database management system is that you don't need to micromanage it.
Also, "parallel-query" refers to a single query using multiple CPUs, not to different queries running at the same time.

How to determine what is more resource expensive, a write/read or a computation

I scrape stock data. I scrape the following:
opening price, stock price, volume traded, shares in issue
The sites I scrape also have a few other derived quantities available. By derived I mean these can be calculated from the quantities above. These include:
value traded, market cap, price change
Whilst the latter can just be scraped into my database and then read off later, I could also just write methods that calculate them on the fly when requested. So instead of writing them to the database and reading them later, I could just have methods like
calculate_value_traded(), calculate_market_cap() and calculate_price_change()
My question is, which is the more efficient way? How do I determine "more efficient" in practice? I know it may depend on the amount of data being written/read and also on the nature of the calculation, but I am wondering how one even benchmarks which is more resource-efficient and ultimately less expensive.
Am I looking at memory used, bandwidth, I/O, or what? What would be the things I need to measure to ultimately choose one over the other?
In general, you need not store calculated values unless they are used extremely often or must be served extremely fast. The reason is that you have several places where these calculations can be done. First, the database engine itself, which usually has built-in support for calculated columns. Second, you can perform the calculations on the application/client side, reducing I/O as well as bandwidth. Both of these also reduce your storage costs, which you should take into account as well. Third, you can keep this data in a cache layer, for example an in-memory data grid (IMDG).
Please note that this answer is very general, because we have no information on your performance and cost requirements, or on the technology you are using.
But be careful with storing calculated data: you need to implement a mechanism to recalculate it whenever the source data is updated, to be sure that your data remains consistent.
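To make the first option concrete: on PostgreSQL 12 or later (column names and formulas below are illustrative guesses at your schema), stored generated columns let the engine maintain the derived quantities, which also takes care of the recalculation concern:

CREATE TABLE stock_quotes (
    ticker          text    NOT NULL,
    quote_date      date    NOT NULL,
    opening_price   numeric NOT NULL,
    stock_price     numeric NOT NULL,
    volume_traded   bigint  NOT NULL,
    shares_in_issue bigint  NOT NULL,
    -- derived quantities, recomputed by the engine whenever the source columns change
    value_traded    numeric GENERATED ALWAYS AS (stock_price * volume_traded)   STORED,
    market_cap      numeric GENERATED ALWAYS AS (stock_price * shares_in_issue) STORED,
    price_change    numeric GENERATED ALWAYS AS (stock_price - opening_price)   STORED,
    PRIMARY KEY (ticker, quote_date)
);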

What does 'soft-state' in BASE mean?

BASE stands for 'Basically Available, Soft state, Eventually consistent'
So, I've come this far: "Basically Available: the system is available, but not necessarily all items in it at any given point in time" and "Eventually Consistent: after a certain time all nodes are consistent, but at any given time this might not be the case" (please correct me if I'm wrong).
But, what is meant exactly by 'Soft State'? I haven't been able to find any decent explanations on the internet yet.
This page (originally here, now available only from the web archive) may help:
[soft state] is information (state) the user put into the system that
will go away if the user doesn't maintain it. Stated another way, the
information will expire unless it is refreshed.
By contrast, the position of a typical simple light-switch is
"hard-state". If you flip it up, it will stay up, possibly forever. It
will only change back to down when you (or some other user) explicitly
comes back to manipulate it.
The BASE acronym is a bit contrived, and most NoSQL stores don't actually require data to be refreshed in this way. There's another explanation suggesting that soft-state means that the system will change state without user intervention due to eventual consistency (but then the soft-state part of the acronym is redundant).
There are some specific usages where state must indeed be refreshed by the user; for example, in the Cassandra NoSQL database, one can give all rows a time-to-live to make them completely soft-state (they will expire unless refreshed), but this is an unusual mode of usage (a transient cache, essentially).
"Soft-state" might also apply to the gossip protocol within Cassandra; a new node can determine the state of the cluster from the gossip messages it receives, and this cluster state must be constantly refreshed to detect unresponsive nodes.
I was taught in classes that "soft state" means that the state of the system could change over time (even during periods without input), because there may be changes going on due to "eventual consistency". That's why it is called "soft" state.
Some source: link
Soft state means data that is not persisted on the disk, yet in case of failure it could be possible to restore it (e.g. recreate a lower quality image from a high quality one). A good article that addresses this and other interesting issues is Cluster-Based Scalable Network Services
A BASE system gives up on consistency to improve the performance of the database. Hence most of the well-known NoSQL databases are more highly available and scalable than ACID-compliant relational databases.
Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual consistency indicates that the system will become consistent over time.
For example, consider two systems A and B. If a user writes data to system A, there will be some delay, usually within milliseconds (depending on the network speed and the design of the syncing), before the written data is reflected on B.

2 Questions about Philosophy and Best Practices in Database Development

Which is better for a web application's database: a lean, very small database with only the basic information, paired with an application that "recalculates" all the secondary information on demand from that basic data, OR a database filled with all that secondary information already calculated, but possibly outdated?
Obviously there is a trade-off there, and I think anyone would say that the best answer to this question is "it depends" or "a mix between the two". But I'm really not comfortable or experienced enough to reason about this subject alone. Could someone share some thoughts?
Also, another different question:
Should a database be a "snapshot" of a particular moment in time, or should it accumulate all the information from the past, allowing what happened to be retraced? For instance, let's say that I'm modeling a bank account. Should I only keep the account's balance for that day, or should I keep all of the account's transactions and infer the balance from them?
Any pointers on this kind of deeper database-design question?
Thanks
My quick answer would be to store everything in the database. The cost of storage is far lower than the cost of processing when talking about very large scale applications. On small scale applications, the data would be far less, so storage would still be an appropriate solution.
Most RDBMSes are extremely good at handling vast amounts of data, so even when there are millions/trillions of records, the data can still be extracted relatively quickly, which can't be said about processing the data manually each time.
If you choose to calculate data rather than store it, bear in mind that the processing cost does not grow only with the size of the data: more data usually also comes with more users. Roughly, processing time scales with the data's size multiplied by the number of users:
processing_time = data_size * num_users
To answer your other question, I think it would be best practice to introduce a "snapshot" of a particular moment only when data amounts to such a high value that processing time will be significant.
When calculating large sums, such as bank balances, it would be good practice to store the result of any heavy calculations, along with their date stamp, to the database. This would simply mean that they will not need calculating again until it becomes out of date.
There is no reason to ever have out-of-date pre-calculated values. That's what triggers are for (among other things). However, for most applications I would not start precalculating until you need to. It may be that on-the-fly calculation is always fast enough. But in a banking application, where you need to pre-calculate from thousands or even millions of records almost immediately, yes, design a precalculation process based on triggers that adjust the values every time the underlying data changes.
As to whether to store just a picture in time or historical values, that depends largely on what you are storing. If it has anything to do with financial data, store the history. You will need it when you are audited. Incidentally, design to store some data as of the date of the action (this is not denormalization). For instance, if you have an order, do not rely on the customer address table or the product table to get data about where the products were shipped or what they cost at the time of the order. This data changes over time, and then your orders are no longer accurate. You don't want your financial reports to change the dollar amount sold because the price changed 6 months later.
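A tiny sketch of that "as of the date of the action" idea (hypothetical table and column names):

CREATE TABLE order_items (
    order_id            bigint  NOT NULL,
    product_id          bigint  NOT NULL,
    quantity            integer NOT NULL,
    -- copied from the product table at the moment the order is placed,
    -- so later price changes cannot rewrite financial history
    unit_price_at_order numeric NOT NULL,
    PRIMARY KEY (order_id, product_id)
);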
There are other things that may not need to be stored historically. In most applications we don't need to know that you were Judy Jones 2 years ago and are Judy Smith now (HR applications are usually an exception).
I'd say start off just tracking the data you need and perform the calculations on the fly, but throughout the design process and well into the test/production of the software keep in mind that you may have to switch to storing the pre-calculated values at some point. Design with the ability to move to that model if the need arises.
Adding the pre-calculated values is one of those things that sounds good (because in many cases it is good) but might not be needed. Keep the design as simple as it needs to be. If performance becomes an issue in doing the calculations on the fly, then you can add fields to the database to store the calculations and run a batch overnight to catch up and fill in the legacy data.
As for the banking metaphor, definitely store a complete record of all transactions. Store any data that's relevant. A database should be a store of data, past and present. Audit trails, etc. The "current state" can either be calculated on the fly or it can be maintained in a flat table and re-calculated during writes to other tables (triggers are good for that sort of thing) if performance demands it.
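A sketch of that transactions-first design in SQL (PostgreSQL-flavoured; all names are made up). The balance is derived from the transaction history, and the optional trigger shows how a materialised balance could be kept in sync during writes if on-the-fly sums ever become too slow:

-- every movement of money is a row; balances are derived, not authoritative state
CREATE TABLE account_transactions (
    transaction_id bigserial   PRIMARY KEY,
    account_id     bigint      NOT NULL,
    amount         numeric     NOT NULL,   -- positive = deposit, negative = withdrawal
    created_at     timestamptz NOT NULL DEFAULT now()
);

-- current balance, computed on the fly
SELECT sum(amount) AS balance
FROM account_transactions
WHERE account_id = 42;

-- balance as of any past date, e.g. for an audit
SELECT sum(amount) AS balance_at_year_start
FROM account_transactions
WHERE account_id = 42
  AND created_at < date '2020-01-01';

-- optional: a trigger-maintained running balance (needs PostgreSQL 11+ for EXECUTE FUNCTION)
CREATE TABLE account_balances (
    account_id bigint  PRIMARY KEY,
    balance    numeric NOT NULL DEFAULT 0
);

CREATE FUNCTION apply_transaction() RETURNS trigger AS $$
BEGIN
    INSERT INTO account_balances (account_id, balance)
    VALUES (NEW.account_id, NEW.amount)
    ON CONFLICT (account_id)
    DO UPDATE SET balance = account_balances.balance + NEW.amount;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_apply_transaction
AFTER INSERT ON account_transactions
FOR EACH ROW EXECUTE FUNCTION apply_transaction();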
It depends :) Persisting derived data in the database can be useful because it enables you to implement constraints and other logic against it. Also it can be indexed or you may be able to put the calculations in a view. In any case, try to stick to Boyce-Codd / 5th Normal Form as a guide for your database design. Contrary to what you may sometimes hear, normalization does not mean you cannot store derived data - it just means data shouldn't be derived from nonkey attributes in the same table.
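A small illustration of the view option, assuming a hypothetical order_items table with quantity and unit_price columns:

CREATE VIEW order_totals AS
SELECT order_id,
       sum(quantity * unit_price) AS order_total
FROM order_items
GROUP BY order_id;

-- the derived total is always current, because it is computed at query time
SELECT * FROM order_totals WHERE order_id = 1001;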
Fundamentally any database is a record of the known facts at a particular point in time. Most databases include some time component and some data is preserved whereas some is not - requirements should dictate this.
You've answered your own question.
Any choices that you make depend on the requirements of the application.
Sometimes speed wins, sometimes space wins. Sometime data accuracy wins, sometimes snapshots win.
While you may not have the ability to tell what's important, the person you're solving the problem for should be able to answer that for you.
I like dynamic programming (never calculate anything twice). If you're not limited on space and are fine with slightly outdated data, then precalculate it and store it in the DB. This gives you the additional benefit of being able to run sanity checks and ensure that the data is always consistent.
But as others already replied, it depends :)

Design of database to record failed logins to prevent brute forcing

There have been a couple of questions about limiting login attempts, but none have really discussed the advantages or disadvantages or different ways of storing the record of login attempts (most have focused on the issue of throttling vs captchas, for instance, which is not what I'm interested in). Instead, I'm trying to figure out the best way to record this information in a database that would be useful but not impact performance too much.
Approach 1) Create a table that would record: ip address, username tried, and attempt time. Then when a user tries to login, I run a select to determine how many times a username was tried in the last 5 minutes (or how many times a particular ip address tried in the last 5 minutes). If over a certain amount, then either throttle the login attempts or display a captcha.
This allows throttling based both on username and on ip address (if the attack is trying many different usernames with a single password), and creates an easily auditable record of what happened, but requires an INSERT for every login attempt, and at least one extra SELECT.
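A possible shape for Approach 1 (PostgreSQL-flavoured, names invented), with indexes so the five-minute lookups stay cheap:

CREATE TABLE login_attempts (
    attempt_id   bigserial    PRIMARY KEY,
    ip_address   inet         NOT NULL,
    username     varchar(255) NOT NULL,
    succeeded    boolean      NOT NULL DEFAULT false,
    attempted_at timestamptz  NOT NULL DEFAULT now()
);

CREATE INDEX idx_attempts_by_user ON login_attempts (username, attempted_at);
CREATE INDEX idx_attempts_by_ip   ON login_attempts (ip_address, attempted_at);

-- how many failures for this username in the last 5 minutes?
SELECT count(*)
FROM login_attempts
WHERE username = 'some_user'
  AND succeeded = false
  AND attempted_at > now() - interval '5 minutes';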
Approach 2) Create a similar table that records the number of failed attempts and the last login time. This could result in a smaller database table and faster SELECT statements (no COUNT needed), but offers a little less control and also less data to analyze.
Approach 3) Similar to approach 2 but store this information in the User table itself. This means that an additional SELECT statement would not be needed, although it would add extra, perhaps unnecessary information to the users table. It also wouldn't allow the extra control and information that approach 1 would offer.
Additional approaches: store the login attempts in a MySQL MEMORY table or an SQLite in-memory table. This would reduce performance loss due to disk performance, but still requires database calls, and wouldn't allow long-term auditing of login attempts because the data wouldn't be persistent.
Any thoughts on a good way to do this or what you yourself have implemented?
It sounds like you're optimizing prematurely.
Implement it in the way that other developers would expect it to be implemented, and then measure performance. If you find you need an optimization at that point, then take some steps based on your measurements.
I would expect Approach #1 to be the most natural solution.
As a side note: I'd neatly separate the portion of code that a) decides if someone is hacking, and b) decides what action to take. Chances are you'll make several attempts at getting this portion correct, long before query optimization becomes a concern.
A couple of extra simple statements aren't going to make any noticeable difference. Logins are rare - relatively speaking. If you want to make sure automated login attempts don't cause a performance problem, you just have to have a second threshold after which you:
Timestamp the abuse
Stop counting (and thus stop inserting/updating)
Block the IP for a few hours.
This threshold must be much higher than the one that throttles/introduces a captcha.
