Should I represent database data with immutable or mutable data structures?

I'm currently programming in Scala, but I guess this applies to any functional programming language, or rather, any programming language that recommends immutability and can interact with a database.
When I fetch data from my database, I map it to a model data structure. In functional programming, data structures tend to be immutable. But the data in a database is mutable, so I wonder whether or not my model should be mutable as well. In general, what would be a good and well-accepted practice in such a case?
Following Scala courses by Martin Odersky on Coursera, I remember he said something like:
It's better to use immutable data structures, but when you want to
interact with the real world, it can be useful to use mutable data
structures.
So, again, I wonder what I should do. As of now, my data structures are immutable, and this leads to a lot of boilerplate code when I want to update a record in my database. Would using a mutable model help reduce this boilerplate?
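For example (the model here is made up just to illustrate the shape of the problem), updating one nested field means rebuilding every enclosing case class with copy:
case class Address(street: String, city: String)
case class User(id: Long, name: String, address: Address)

// Changing one nested field means copying every enclosing structure
def moveUser(user: User, newStreet: String): User =
  user.copy(address = user.address.copy(street = newStreet))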
(I already asked a similar question which was quite specific to the technologies I use, but I wasn't satisfied with the actual answers, so I've generalized it here.)

Why is a database mutable? Is it a fundamental nature of databases to be mutable? The relational model and using it as a persistence store for your application data might steer you towards this conclusion, but it may not be a fundamental property.
Given that you may have other options such as storing a new version of your data when you update it, perhaps the premise of the question is undermined somewhat. Perhaps, even if you do have a 'mutable' database, you still need to provide a new value for the update function that is separate from the old value – consider for instance an optimistic lock where the update should only occur if the old value has not in the meantime changed.
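To make that last point concrete, here is a minimal sketch (an in-memory stand-in, not any particular database API) of an update function that takes both the expected old value and the new one, and fails if the stored value has changed in the meantime:
final class OptimisticStore[K, V](initial: Map[K, V]) {
  private var data: Map[K, V] = initial

  // Succeeds only if the stored value still equals `expected`,
  // i.e. the classic compare-and-set shape of an optimistic lock.
  def update(key: K, expected: V, updated: V): Boolean = synchronized {
    if (data.get(key).contains(expected)) { data = data.updated(key, updated); true }
    else false
  }
}
Against a real database the same shape typically shows up as a version or timestamp column checked in the UPDATE's WHERE clause.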
In other words, the mutability or otherwise of the database should not matter at all: you are dealing with a separate domain layer in your application. If you need to ask, then the answer will always be immutable. Mutability is a complexity vector that experts should only introduce as a performance optimisation when it has been demonstrated to be necessary.

In the trading app I'm currently working on, almost everything is immutable - certainly the model is.
Our experience is that this has greatly simplified how we work with the model, including persistence.
I don't yet fully understand why things have become simpler; they just have. I need to ponder this more. Reasoning about the code and working with it is simpler.
Yes, you need to use things like lenses, but I tend to write them - a mechanical process - and move on. It's a tiny part which I am sure can be finessed.

"Interacting with the real world" has nothing to do with whether you use mutable or immutable data structures. This is a furfy that is repeated all too often and it is great that you have questioned it.
While it is typically more healthy to dismiss garbage like this, you might be interested in a cursory debunking:
http://blog.higher-order.com/blog/2012/09/13/what-purity-is-and-isnt/
However, I strongly recommend dismissing it and moving on.
Onto your question, you say you have boilerplate when you want to perform operations on your immutable data structures. In fact, there is very well established theory that solves this problem to a large extent. Here is a paper written about it using Scala:
http://dropbox.tmorris.net/media/doc/lenses.pdf
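As a taste of the idea (this is my own hand-rolled simplification, not the API from the paper), a lens pairs a getter with a copying setter, and composed lenses replace the nested copy(...) boilerplate:
case class Lens[A, B](get: A => B, set: (A, B) => A) {
  def compose[C](inner: Lens[B, C]): Lens[A, C] =
    Lens[A, C](a => inner.get(get(a)), (a, c) => set(a, inner.set(get(a), c)))
}

case class Address(street: String, city: String)
case class User(name: String, address: Address)

val address = Lens[User, Address](_.address, (u, a) => u.copy(address = a))
val street  = Lens[Address, String](_.street, (a, s) => a.copy(street = s))

// One composed lens updates the nested field in a single call
val userStreet: Lens[User, String] = address.compose(street)
val moved = userStreet.set(User("Jo", Address("Main St", "Oslo")), "High St")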
Hope that helps.

Is it bad design to use arrays within a database?

So I'm making a database for a personal project just to get more than my feet wet with PostgreSQL and certain languages and applications that can use a PostgreSQL database.
I've come to the realization that using an array isn't necessarily even compliant with 1NF (arrays are not atomic, right?). So my question is: is there a lack of efficiency or data safety this way? Should I learn early on not to use arrays?
Short answer to the title: No
A bit longer answer:
You should learn to use arrays when appropriate. Arrays are not bad design in themselves: they are as atomic as a character varying field (an array of characters, no?), and they exist to make our lives easier and our databases faster and lighter. There are issues concerning portability (most database systems don't support arrays, or do so in a different way than Postgres).
Example:
You have a blog with posts and tags, and each post may have 0 or more tags. The first thing that comes to mind is to make a different table with two columns postid and tagid and assign the tags in that table.
If we need to search through posts with tagid, then the extra table is necessary (with appropriate indexes of course).
But if we only want the tag information to be shown as the post's extra info, then we can easily add an integer array column to the table of posts and extract the information from there. This can still be done with the extra table, but using an array reduces the size of the database (no extra tables or extra rows needed) and simplifies the query by joining one less table, and it seems easier for the human eye to understand (the last part is in the eye of the beholder, but I think I speak for a majority here). If our tags are preloaded, then not even one join is necessary.
The example may be poor but it's the first that came to mind.
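If it helps, here is a rough sketch of reading such an array column from application code (Scala over plain JDBC; the table, column and connection details are invented for the example):
import java.sql.DriverManager

// Assumes something like: CREATE TABLE post (id serial PRIMARY KEY, title text, tag_ids integer[])
@main def printPostTags(): Unit = {
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/blog", "blog", "secret")
  try {
    val rs = conn.createStatement().executeQuery("SELECT title, tag_ids FROM post")
    while (rs.next()) {
      // The PostgreSQL JDBC driver exposes integer[] columns as java.lang.Integer[]
      val tagIds = rs.getArray("tag_ids").getArray.asInstanceOf[Array[Integer]]
      println(s"${rs.getString("title")} -> tags ${tagIds.mkString(", ")}")
    }
  } finally conn.close()
}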
Conclusion:
Arrays are not necessary. They can be harmful if you use them wrongly. You can live without them and have a great, fast and optimized database. When you are considering portability (e.g. rewriting your system to work with other databases), then you must not use arrays.
If you are sure you'll stick with Postgres, then you can safely use arrays where you find appropriate. They exist for a reason and are neither bad design nor non-compliant. When you use them in the right places, they can help a little with simplicity of database structures and your code, as well as space and speed optimization. That is all.
Whether an array is atomic depends on what you're interested in. If you generally want the whole array, then it's atomic. If you are more interested in the individual elements, then it is being used as a structure. A text field is basically a list of characters; however, we're usually interested in the whole string.
Now, from a practical viewpoint, many frameworks and ORMs don't automatically unpack PostgreSQL's array types. Also, if you want to port the database to e.g. MySQL, you'll run into trouble, since most other database systems have no comparable array type.
Likewise, foreign-key constraints can't be applied to the elements of an array (EDIT: this is still true as of 2021).
Short answer: Yes, it is bad design. Using arrays will guarantee that your design is not 1NF, because to be 1NF there must be no repeating values. Proper design is unequivocal: make another table for the array's values and join when you need them all.
Arrays may be the right tool for the job in certain limited circumstances, but I would still try hard to avoid them. They're a feature of last resort.
The biggest problem with arrays is that they're a crutch. You know them already and you want to use them because they're familiar to you. But they do not work quite like you expect, and they will only allow you to postpone a true understanding of SQL and relational databases. You're much better off waiting until you're forced to use them than learning them and looking for opportunities to rely on them.
I believe arrays are a useful and appropriate design in cases where you're working with array-like data and want to use the power of SQL for efficient queries and analysis. I've begun using PostgreSQL arrays regularly for data science purposes, as well as in PostGIS for edge cases, as examples.
In addition to the well-explained challenges mentioned above, I'm finding the biggest problem is getting third-party client apps to handle the array fields in the ways I'd expect. In Tableau and QGIS, for example, arrays are treated as strings, so array operations are unavailable.
Arrays are a first class data type in the SQL standard, and generally allow for a simpler schema and more efficient queries. Arrays, in general, are a great data type. If your implementation is self-contained, and doesn't need to rely on third-party tools without an API or some other middleware that can deal with incompatibilities, then use the array field.
IF, however, you interface with third-party software that directly queries the DB, and arrays are used to produce queries, then I'd avoid them in favor of simpler lookup tables and other traditional relational approaches.

Convenience of a PostgreSQL custom C function vs plpgsql

I'll say up front that my answer to the question in the title is yes, in my case it is convenient, but I'm asking the experts here.
I have developed a lot of plpgsql functions and just one in C, but I've already understood that the learning curve is definitely steeper.
In my case I need a real development language, which plpgsql sometimes is not, but I also need performance; otherwise I'd have looked at Python.
But here is the question.
Mainly I need to retrieve data with some selects and joins, perform elaborations on them, sometimes complex ones, and return a table of data.
From an execution-time point of view, is a C function quicker for this kind of use?
I appreciate any comment.
luca
But here is the question. Mainly I need to retrieve data with some selects and joins, perform elaborations on them, sometimes complex ones, and return a table of data.
I would go with pl/pgsql for this, as that's what it is designed for. In general, pl/pgsql performs very well within its problem domain, and I doubt you are likely to get significantly better performance by going with C. To the extent you can push your elaborations into the main query, all the better performance-wise.
This is assuming that your elaborations can be done with existing functions and not a huge amount of complex data manipulation (in particular, say, converting between datatypes, like arrays and sets). If that is not the case, I would still put the main query and light manipulation in the pl/pgsql, and put the specific operations that need to be tuned in C. There are two reasons for doing this:
It means less C code, which means the C code is easier to read, follow, and prove correct.
It separates concerns so that you can use similar manipulations elsewhere.
There's a lot of performance tuning that has gone into pl/pgsql for its problem domain and reinventing all of that would be a lot of work both in development and testing. To the extent you can leverage tools that are already there you can get the performance you need with a lot less effort and a lot more in the way of guarantees.
EDIT
If you want to write PL/PGSQL code that performs well, you want it to be a large main query with modest support logic. The more you can push into your query, the better, and the more of your elaborations you can do in SQL (with possible C functions as mentioned above), the better. Not only does this mean better performance, it also means better maintainability. As ArtemGr mentioned, certain operations are very expensive in PL/PGSQL, and in these cases you want to supplement with C code in order to get the performance you need.
I know C/C++ well, and for me it's easier to write a PostgreSQL function in C++ than to learn the intricacies of pgSQL syntax and work around its limitations. I'd say go with the language you (and the rest of your team) are more familiar with. C should be faster than pgSQL (and Tcl, Perl, Python) for complex data manipulation. Usually 5-10 times faster. Javascript (http://code.google.com/p/plv8js/) might be nearly as fast as C if it has a chance to spin its JIT. Python code can actually use a Cython extension under the hood, which might be nearly as fast as C.
You should probably measure how much time is spent in the data manipulation in question relative to the time spent in I/O before making a decision. In some domains C isn't faster; for example, Tcl and Javascript have very good regular expression engines.

Using a relational database and a key-value store in combination

The requirements for the project I'm working on seem to point to using a relational database (e.g. PostgreSQL, MySQL) in combination with a key-value store (e.g. HBase, Cassandra). Our data breaks almost nicely into one of the two data models, with the exception of a small amount of interdependence.
This is not an attempt to cram a relational database into a key-value store; they are independent of each other.
Are there any serious reasons to not do this?
It should work fine.
There are a couple of things you need to be aware of / watch out for:
Your program is now responsible for the data consistency between the stores, not the relational model.
Depending on your technology, you may or may not have transactions that span the data stores. Where you don't, you might have to program some manual clean-up work in the case of a failure (a rough sketch follows below).
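A rough sketch of that manual clean-up in Scala (the two store traits below are hypothetical placeholders, not a real SQL or key-value client API):
// Hypothetical stand-ins for a relational client and a key-value client
trait RelationalStore { def insertOrder(id: String, payload: String): Unit; def deleteOrder(id: String): Unit }
trait KeyValueStore   { def put(key: String, value: String): Unit }

// With no transaction spanning both stores, the application compensates by hand
def saveOrder(db: RelationalStore, kv: KeyValueStore, id: String, payload: String): Unit = {
  db.insertOrder(id, payload)
  try kv.put(s"order:$id", payload)
  catch {
    case e: Exception =>
      db.deleteOrder(id) // best-effort compensation; may itself need retries
      throw e
  }
}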
I work in SQL DBMS territory, so take that bias into account, but...
As with Shiraz Bhaiji, I worry about the "except for a small amount of interdependence". There are a number of things to think about, the answers to which will help you determine what to do.
What happens if something goes wrong with the interdependence? (Customers lose money - then you need to use a DBMS throughout; you lose money - probably the same; someone gets reported as having 3045 points when they really have 3046 - maybe it doesn't matter.)
How hard is it to fix up the 'mess' when something goes wrong?
How much of the work is on the key-value store and how much is on the DBMS?
Can the interdependence be removed by moving some stuff from key-value store to DBMS?
How slow is the DBMS when used as a key-value store? (Are you sure there's no way to bring it close enough to parity?)
What happens in disaster recovery scenarios? Synchronized backups?
If you have adequate answers to these and related questions, then it is OK to go with the mixed setup - you've thought it through, weighed the risks, formed a judgement, and it is reasonable to go ahead. If you don't have answers, get them.
When you say key-value store, do you mean something like a session or a cache type of implementation? There are always reasons to do such things... reading from and writing to a database is generally your most resource-intensive operation. More details?

schema design

Let's say you are a GM DBA and you have to design around the GM models.
Is it better to do this?
table_model
type {cadillac, saturn, chevrolet}
Or this?
table_cadillac_model
table_saturn_model
table_chevrolet_model
Let's say that the business lines have the same columns for a model and that there are over a million records for each subtype.
EDIT:
there is a lot of CRUD
there are a lot of very processor intensive reports
in either schema, there is a model_detail table that contains 3-5 records for each model and the details for each model differ (you can't add a cadillac detail to a saturn model)
the dev team doesn't have any issues with db complexity
I'm not really sure that this is a normalization question. Even though the structures are the same, they might be thought of as different entities.
EDIT:
Reasons for partitioning the structure into multiple tables
- business lines may have different business rules regarding parts
- addModelDetail() could be different for each business line (even though the data format is the same)
- high add/update activity - better performance with partitioned structure instead of single structure (I'm guessing and not sure here)?
I think this is a variation of the EAV problem. When posed as an EAV design, the single table structure generally gets voted as a bad idea. When posed in this manner, the single table structure generally gets voted as a good idea. Interesting...
I think the most interesting answer is having two different structures - one for crud and one for reporting. I think I'll try concatenated/flattened view for reporting and multiple tables for crud and see how that works.
Definitely the former example. Do you want to be adding tables to your database whenever you add a new model to your product range?
On data with a lot of writes (e.g. an OLTP application), it is better to have more, narrower tables (e.g. tables with fewer fields). There will be less lock contention because you're only writing small amounts of data into different tables.
So, based on the criteria you have described, the table structure I would have is:
Vehicle
    VehicleType
    Other common fields
CadillacVehicle
    Fields specific to a Caddy
SaturnVehicle
    Fields specific to a Saturn
For reporting, I'd have an entirely different database on an entirely different server that does not have the normalized structure (e.g. just has CadillacVehicle and SaturnVehicle tables with all of the fields from the Vehicle table duplicated into them).
With proper indexes, even the OLTP database could be performant in your SELECTs, regardless of the fact that there are tens of millions of rows. However, since you mentioned that there are processor-intensive reports, I would have a completely separate reporting database.
One last comment. About the business rules... the data store cares not about the business rules. If the business rules are different between models, that really shouldn't factor into your design decisions about the database schema (other than to help dictate which fields are nullable and their data types).
Use the former. Setting up separate tables for the specialisations will complicate your code and doesn't bring any advantages that can't be achieved in other ways. A single table will also massively simplify your reports.
If the tables really do have the same columns, then the former is the best way to do it. Even if they had different columns, you'd probably still want to have the common columns be in their own table, and store a type designator.
You could try having two separate databases.
One is an OLTP (OnLine Transaction Processing) system, which should be highly normalized so that the data model is highly correct. Reporting performance is not a concern there, and you would deal with non-reporting query performance with indexes/denormalization etc. on a case-by-case basis. The data model should try to match up very closely with the conceptual model.
The other is a Reports system which should pull data from the OLTP system periodically, and massage and rearrange that data in a way that makes report-generation easier and more performant. The data model should not try to match up too closely with the conceptual model. You should be able to regenerate all the data in the reporting database at any time from the data currently in the main database.
I would say the first way looks better.
Are there reasons you would want to do it the second way?
The first way follows normalization better and is closer to how most relational database schemas are developed.
The second way seems to be harder to maintain.
Unless there is a really good reason for doing it the second way I would go with the first method.
Given the description that you have given us, the answer is either.
In other words you haven't given us enough information to give a decent answer. Please describe what kind of queries you expect to perform on the data.
[Having said that, I think the answer is going to be the first one ;-)
As I imagine, even though they are different models, the data for each model is probably going to be quite similar.
But this is a complete guess at the moment.]
Edit:
Given your updated edit, I'd say the first one definitely. As they have all the same data then they should go into the same table.
Another thing to consider in defining "better"--will end users be querying this data directly? Highly normalized data is difficult for end-users to work with. Of course this can be overcome with views but it's still something to think about as you're finalizing your design.
I do agree with the other two folks who answered: which form is "better" is subjective and dependent on what you're hoping to achieve. If you're hoping to achieve very quick queries that's one thing. If you're hoping to achieve high programmer productivity--that's a different goal again and possibly conflicts with quick queries.
The choice depends on the required performance.
The best database is a normalized database. But there can be performance issues in a normalized database, and then you have to denormalize it.
The principle "Normalize first, denormalize for performance" works well.
It depends on the data model and the use case. If you ever need to report on a query that wants data out of all the "models", then the former is preferable, because otherwise (with the latter) you'd have to change the query (to include the new table) every time you added a new model.
Oh and by "former" we mean this option:
table_model
* type {cadillac, saturn, chevrolet}
#mson has asked the question "What do you do when a question is not satisfactorily answered on SO?", which is a direct reference to the existing answers to this question.
I contributed the following answer to that discussion, primarily critiquing the way the question was asked.
Quote (verbatim):
I looked at the original question yesterday, and decided not to contribute an answer.
One problem was the use of the term 'model' as in 'GM models' - which cited 'Chevrolet, Saturn, Cadillac' as 'models'. To my understanding, these are not models at all; they are 'brands', though there might also be an industry-insider term for them that I'm not familiar with, such as 'division'. A model would be a 'Saturn Vue' or 'Chevrolet Impala' or 'Cadillac Escalade'. Indeed, there could well be models at a more detailed level than that - different variants of the Saturn Vue, for example.
So, I didn't think that the starting point was well framed. I didn't critique it; it wasn't quite compelling enough, and there were answers coming in, so I let other people try it.
The next problem is that it is not clear what your DBMS is going to be storing as data. If you're storing a million records per 'model' ('brand'), then what sorts of data are you dealing with? Lurking in the background is a different scenario - the real scenario - and your question has used an analogy that failed to be sufficiently realistic. That means that the 'it depends' parts of the answer are far more voluminous than the 'this is how to do it' ones. There is just woefully too little background information on the data to be modelled to allow us to guess what might be best.
Ultimately, it will depend on what uses people have for the data. If the information is going to go flying off in all different directions (different data structures in different brands; different data structures at the car model levels; different structures for the different dealerships - the Chevrolet dealers are handled differently from the Saturn dealers and the Cadillac dealers), then the integrated structure provides limited benefit. If everything is the same all the way down, then the integrated structure provides a lot of benefit.
Are there legal reasons (or benefits) to segregating the data? To what extent are the different brands separate legal entities where shared records could be a liability? Are there privacy issues, such that it will be easier to control access to the data if the data for the separate brands is stored separately?
Without a lot more detail about the scenario being modelled, no-one can give a reliable general answer - at least, not more than the top-voted one already gives (or doesn't give).
Data modelling is not easy.
Data modelling without sufficient information is impossible to do reliably.
I have copied the material here since it is more directly relevant. I do think that to answer this question satisfactorily, a lot more context should be given. And it is possible that there needs to be enough extra context to make SO the wrong place to ask it. SO has its limitations, and one of those is that it cannot deal with questions which require long explanations.
From the SO FAQs page:
What kind of questions can I ask here?
Programming questions, of course! As long as your question is:
detailed and specific
written clearly and simply
of interest to at least one other programmer somewhere
...
What kind of questions should I not ask here?
Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!
This question is, IMO, close to the 'require extended discussion' limit.

Is Functional to Relational mapping easier than Object to Relational? [closed]

Object-relational mapping has been well discussed, including on here. I have experience with a few approaches and the pitfalls and compromises. True resolution seems like it requires changes to the OO or relational models themselves.
If using a functional language, does the same problem present itself? It seems to me that these two paradigms should fit together better than OO and RDBMS. The idea of thinking in sets in an RDBMS seems to mesh with the automatic parallelism that functional approaches seem to promise.
Does anyone have any interesting opinions or insights? What's the state of play in the industry?
What's the purpose of an ORM?
The main purpose of using an ORM is to bridge between the networked model (object orientation, graphs, etc.) and the relational model. And the main difference between the two models is surprisingly simple. It's whether parents point to children (networked model) or children point to parents (relational model).
With this simplicity in mind, I believe there is no such thing as an "impedance mismatch" between the two models. The problems people usually run into are purely implementation specific, and should be solvable, if there were better data transfer protocols between clients and servers.
How can SQL address the problems we have with ORMs?
In particular, The Third Manifesto tries to address the shortcomings of the SQL language and relational algebra by allowing for nested collections, which have been implemented in a variety of databases, including:
Oracle (probably the most sophisticated implementation)
PostgreSQL (to some extent)
Informix
SQL Server, MySQL, etc. (through "emulation" via XML or JSON)
In my opinion, if all databases implemented the SQL standard MULTISET() operator (as Oracle does), people would no longer use ORMs for mapping (perhaps still for object graph persistence), because they could materialise nested collections directly from within the databases, e.g. with this query:
SELECT actor_id, first_name, last_name,
  MULTISET (
    SELECT film_id, title
    FROM film AS f
    JOIN film_actor AS fa USING (film_id)
    WHERE fa.actor_id = a.actor_id
  ) AS films
FROM actor AS a
Would yield all the actors and their films as a nested collection, rather than a denormalised join result (where actors are repeated for each film).
Functional paradigm at the client side
The question of whether a functional programming language at the client side is better suited for database interactions is really orthogonal. ORMs help with object graph persistence, so if your client-side model is a graph, and you want it to be a graph, you will need an ORM, regardless of whether you're manipulating that graph using a functional programming language.
However, because object orientation is less idiomatic in functional programming languages, you are less likely to shoehorn every data item into an object. For someone writing SQL, projecting arbitrary tuples is very natural. SQL embraces structural typing. Each SQL query defines its own row type without the need to previously assign a name to it. That resonates very well with functional programmers, especially when type inference is sophisticated, in case of which you won't ever think of mapping your SQL result to some previously defined object / class.
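To illustrate (a plain-JDBC sketch in Scala; the query and table are just examples), each projection can simply become an ad-hoc tuple, with no named class declared up front:
import java.sql.Connection

// Each query yields its own row shape; here it is just a pair of strings
def actorNames(conn: Connection): Vector[(String, String)] = {
  val rs = conn.createStatement().executeQuery("SELECT first_name, last_name FROM actor")
  Iterator.continually(rs).takeWhile(_.next())
    .map(r => (r.getString("first_name"), r.getString("last_name")))
    .toVector
}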
An example in Java using jOOQ from this blog post could be:
// Higher order, SQL query producing function:
public static ResultQuery<Record2<String, String>> actors(Function<Actor, Condition> p) {
    return ctx.select(ACTOR.FIRST_NAME, ACTOR.LAST_NAME)
              .from(ACTOR)
              .where(p.apply(ACTOR));
}
This approach leads to a much better compositionality of SQL statements than if the SQL language were abstracted by some ORM, or if SQL's natural "string based" nature were used. The above function can now be used e.g. like this:
// Get only actors whose first name starts with "A"
for (Record rec : actors(a -> a.FIRST_NAME.like("A%")))
    System.out.println(rec);
FRM abstraction over SQL
Some FRMs try to abstract over the SQL language, usually for these reasons:
They claim SQL is not composable enough (jOOQ disproves this, it's just very hard to get right).
They claim that API users are more used to "native" collection APIs, so e.g. JOIN is translated to flatMap() and WHERE is translated to filter(), etc.
To answer your question
FRM is not "easier" than ORM; it solves a different problem. In fact, FRM doesn't really solve any problem at all, because SQL, being a declarative programming language itself (which is not so different from functional programming), is a very good match for other functional client programming languages. So, if anything at all, an FRM simply bridges the gap between SQL, the external DSL, and your client language.
(I work for the company behind jOOQ, so this answer is biased)
The hard problems of extending the relational database are extended transactions, data-type mismatches, automated query translation and things like N+1 selects. These are fundamental problems of leaving the relational system and, in my opinion, do not change by changing the receiving programming paradigm.
That depends on your needs
If you want to focus on the data structures, use an ORM like JPA/Hibernate
If you want to focus on the processing itself, take a look at FRM libraries: QueryDSL or jOOQ
If you need to tune your SQL requests for specific databases, use JDBC and native SQL requests
The strength of the various "Relational Mapping" technologies is portability: you ensure your application will run on most ACID databases.
Otherwise, you will have to cope with the differences between SQL dialects if you write the SQL requests manually.
Of course you can restrict yourself to the SQL92 standard (and then do some functional programming), or you can reuse some concepts of functional programming with ORM frameworks.
The ORM's strengths are built around a session object, which can act as a bottleneck:
it manages the lifecycle of the objects as long as the underlying database transaction is running.
it maintains a one-to-one mapping between your Java objects and your database rows (and uses an internal cache to avoid duplicate objects).
it automatically detects association updates and the orphan objects to delete.
it handles concurrency issues with optimistic or pessimistic locking.
Nevertheless, its strengths are also its weaknesses:
The session must be able to compare objects, so you need to implement equals/hashCode methods.
But object equality must be rooted in "business keys" and not the database id (new transient objects have no database ID!).
However, some reified concepts have no business equality (an operation, for instance).
A common workaround relies on GUIDs, which tend to upset database administrators.
The session must spy on relationship changes, but its mapping rules push the use of collections that are unsuitable for the business algorithms.
Sometimes you would like to use a HashMap, but the ORM will require the key to be another "rich domain object" instead of a light one...
Then you have to implement object equality on the rich domain object acting as a key...
But you can't, because this object has no counterpart in the business world.
So you fall back to a simple list that you have to iterate over (and performance issues result).
The ORM APIs are sometimes unsuitable for real-world use.
For instance, real-world web applications try to enforce session isolation by adding some "WHERE" clauses when you fetch data...
Then "Session.get(id)" doesn't suffice and you need to turn to a more complex DSL (HQL, Criteria API) or go back to native SQL.
The database objects conflict with other objects dedicated to other frameworks (like OXM frameworks = Object/XML Mapping).
For instance, if your REST services use the Jackson library to serialize a business object,
but this Jackson object maps exactly to a Hibernate one,
then either you merge both, and a strong coupling between your API and your database appears,
or you must implement a translation, and all the code you saved thanks to the ORM is lost there...
On the other hand, FRM is a trade-off between "Object Relational Mapping" (ORM) and native SQL queries (with JDBC).
The best way to explain the differences between FRM and ORM is to adopt a DDD approach.
Object Relational Mapping empowers the use of "rich domain objects", which are Java classes whose state is mutable during the database transaction.
Functional Relational Mapping relies on "poor domain objects", which are immutable (so much so that you have to build a new one each time you want to alter its content).
It releases the constraints put on the ORM session and relies most of the time on a DSL over SQL (so portability doesn't matter).
But on the other hand, you have to look into the transaction details and the concurrency issues.
List<Person> persons = queryFactory.selectFrom(person)
    .where(
        person.firstName.eq("John"),
        person.lastName.eq("Doe"))
    .fetch();
I'd guess functional to relational mapping should be easier to create and use than OO to RDBMS. As long as you only query the database, that is. I don't really see (yet) how you could do database updates without side effects in a nice way.
The main problem I see is performance. Today's RDBMSs are not designed to be used with functional queries, and will probably behave poorly in quite a few cases.
I haven't done functional-relational mapping, per se, but I have used functional programming techniques to speed up access to an RDBMS.
It's quite common to start with a dataset, do some complex computation on it, and store the results, where the results are a subset of the original with additional values, for example. The imperative approach dictates that you store your initial dataset with extra NULL columns, do your computation, then update the records with the computed values.
Seems reasonable. But the problem with that is it can get very slow. If your computation requires another SQL statement besides the update query itself, or even needs to be done in application code, you literally have to (re-)search for the records that you are changing after the computation to store your results in the right rows.
You can get around this by simply creating a new table for results. This way, you can just always insert instead of update. You end up having another table, duplicating the keys, but you no longer need to waste space on columns storing NULL – you only store what you have. You then join your results in your final select.
I (ab)used an RDBMS this way and ended up writing SQL statements that looked mostly like this...
create table temp_foo_1 as select ...;
create table temp_foo_2 as select ...;
...
create table foo_results as
select * from temp_foo_n inner join temp_foo_1 ... inner join temp_foo_2 ...;
What this is essentially doing is creating a bunch of immutable bindings. The nice thing, though, is you can work on entire sets at once. Kind of reminds you of languages that let you work with matrices, like Matlab.
I imagine this would also allow for parallelism much more easily.
An extra perk is that types of columns for tables created this way don't have to be specified because they are inferred from the columns they're selected from.
I'd think that, as Sam mentioned, if the DB is to be updated, the same concurrency issues have to be faced as in the OO world. The functional nature of the program could maybe be even a little more problematic than the object nature, because of the state of the data, transactions, etc. in the RDBMS.
But for reading, the functional language could be more natural with some problem domains (as it seems to be regardless of the DB)
The functional<->RDBMS mapping should have no big differences from OO<->RDBMS mappings. But I think that depends a lot on what kind of data types you want to use, whether you want to develop a program against a brand new DB schema or against a legacy DB schema, etc.
Lazy fetches for associations, for example, could probably be implemented quite nicely with some lazy-evaluation-related concepts. (Even though they can be done quite nicely with OO also.)
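A small sketch of that idea in Scala (the loader function is a placeholder for whatever actually runs the query):
// The association query runs only if, and when, `books` is first accessed
final class Author(val id: Long, val name: String, loadBooks: Long => List[String]) {
  lazy val books: List[String] = loadBooks(id)
}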
Edit : With some googling I found HaskellDB (SQL library for Haskell) - that could be worth trying?
Databases and Functional Programming can be fused.
for example:
Clojure is a functional programming language based on relational database theory.
Clojure -> DBMS, Super Foxpro
STM -> Transaction, MVCC
Persistent Collections -> db, table, col
hash-map -> indexed data
Watch -> trigger, log
Spec -> constraint
Core API -> SQL, Built-in function
function -> Stored Procedure
Meta Data -> System Table
Note: In the latest spec2, spec is more like RMDB.
see: spec-alpha2 wiki: Schema-and-select
I advocate: building a relational data model on top of hash-maps to achieve a combination of NoSQL and RMDB advantages. This is actually a reverse implementation of PostgreSQL.
Duck Typing: If it looks like a duck and quacks like a duck, it must be a duck.
If Clojure's data model is like an RMDB, Clojure's facilities are like an RMDB, and Clojure's data manipulation is like an RMDB, then Clojure must be an RMDB.
Clojure is a functional programming language based on relational database theory
Everything is RMDB
Implement relational data model and programming based on hash-map (NoSQL)
Being functional and being OO are two orthogonal concepts. The issue of mapping flat tables to trees of objects is orthogonal to Functional vs Imperative.
However, functional vs imperative does solve one particular mismatch, namely the mismatch between imperative updates and MVCC. In imperative programming, locking the table you are working with while you update it is the most intuitive approach, and anything non-sequential is extremely counterintuitive.
In FP, MVCC is much more natural than locks. The natural way to write is to compute the result set, compute the diff with read data, write (i.e. pick the updated dataset as the new one, sharing the data they have in common using persistent data structures), and do a rollback & retry if there is a write-write conflict. This matches exactly what MVCC does.
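As a toy illustration of that read/compute/compare/retry shape (an in-memory snapshot store in Scala, not a real MVCC engine):
import java.util.concurrent.atomic.AtomicReference

// Readers see an immutable snapshot; writers compute a new snapshot from it and
// retry if another writer committed first (a write-write conflict).
final class SnapshotStore[K, V](initial: Map[K, V]) {
  private val current = new AtomicReference(initial)

  def read: Map[K, V] = current.get()

  @annotation.tailrec
  def commit(transform: Map[K, V] => Map[K, V]): Map[K, V] = {
    val snapshot = read
    val updated  = transform(snapshot)
    if (current.compareAndSet(snapshot, updated)) updated
    else commit(transform) // conflict: discard our attempt and retry on the new snapshot
  }
}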
