Pluggable database interface - database

I am working on a project where we are scoping out the specs for an interface to the backend systems of multiple wholesalers. Here is what we are working with:
Each wholesaler has multiple products, upwards of 10,000. And each wholesaler has customized prices for their products.
The list of wholesalers being accessed will keep growing in the future, so potentially 1000s of wholesalers could be accessed by the system.
Wholesalers are geographically dispersed.
The interface to this system will allow the user to select the wholesaler they wish and browse their products.
Product price updates should be reflected on the site in real time. So, if the wholesaler updates the price it should immediately be available on the site.
System should be database agnostic.
The system should be easy to set up on the wholesaler's end, and be minimally intrusive in their daily activities.
Initially, I thought about creating databases for each wholesaler on our end, but with potentially 1000s of wholesalers in the future, is this the best option as far as performance and storage are concerned?
Would it be better to query the wholesalers' databases directly instead of storing their data locally? Can we do this and still remain database agnostic?
What would be the best technology stack for such an implementation? I need some kind of ORM tool.
Java based frameworks and technologies preferred.
Thanks.

If you want to build software that can switch databases, I would suggest using Hibernate (or NHibernate if you use .NET).
Hibernate is an ORM that does not depend on a specific database, which lets you switch the DB very easily. It is proven in large applications and well integrated with the Spring framework (but can be used without Spring, too). Spring.NET is the equivalent if you are using .NET.
Spring is a good technology stack for building large, scalable applications (it contains an IoC container, a database access layer, transaction management, AOP support and much more).
Wikipedia gives you a short overview:
http://en.wikipedia.org/wiki/Hibernate_(Java)
http://en.wikipedia.org/wiki/Spring_Framework
Would it be better to query the wholesalers database directly instead of storing their data locally?
This depends on the availability and latency of access to the remote data. Databases themselves offer several possibilities for keeping data in sync across multiple server instances. Ask yourself what should happen if a wholesaler database goes (partly) offline. Maybe not all data needs to be duplicated.
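To make the offline question concrete, here is a minimal sketch in Python (not the Java stack the question prefers) of querying the remote system first and falling back to the last locally stored value when it is unreachable. The `PriceCache` class and the `fetch_remote` callable are hypothetical names invented for illustration; a real system would replace `fetch_remote` with whatever access layer reaches the wholesaler.

```python
import time
from typing import Callable, Dict, Tuple


class PriceCache:
    """Serve remote prices, falling back to the last local copy on failure."""

    def __init__(self, fetch_remote: Callable[[str], float]):
        # Hypothetical callable that asks the wholesaler's system for a price.
        self._fetch_remote = fetch_remote
        # sku -> (price, time the price was fetched)
        self._local: Dict[str, Tuple[float, float]] = {}

    def get_price(self, sku: str) -> Tuple[float, bool]:
        """Return (price, is_fresh); is_fresh is False for a fallback value."""
        try:
            price = self._fetch_remote(sku)
        except ConnectionError:
            if sku not in self._local:
                # Nothing stored locally, so the outage reaches the caller.
                raise
            return self._local[sku][0], False
        # Remember the fresh value so it can be served during an outage.
        self._local[sku] = (price, time.time())
        return price, True
```

The second element of the result lets the site mark a price as possibly stale, which matters if prices are supposed to be shown in real time. Only products that were actually requested get duplicated locally.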
Can we do this and still remain database agnostic?
Yes, see my answer related to the ORM (N)Hibernate.
What would be best technology stack for such an implementation?
"Best" depends on your requirements. I like Spring. If you go with .Net the built-in ADO.NET Entity Framework might be fit, too.

Related

DB recommendation - Portable, Concurrent (multiple read only, one write)

I'm looking for a portable database solution I can use with a website that is designed to handle service outages. Every night I need to retrieve a list of users from SQL Server and upsert their details into a portable database. There are roughly 250,000 users (and growing), and each one has probably 25 required fields. Of those fields, I'd say fewer than 5 need to be searched on. The rest just need retrieving.
The idea is that, in times of a service outage, we can use a website that's designed to work from the portable database rather than SQL Server. Our long-term goal is to move to the cloud and handle things in an entirely different way, but for the short term this is our aim.
The website is going to be a .NET Core web API, so it will be accessed by multiple users on multiple threads. The website will only ever need read access; it will not update these details at all.
To keep the portable database up to date, I'm thinking of having another application that just runs nightly to update the data. Our business runs 24 hours (albeit quieter overnight), so there is a chance the updater is running while the website is in use. While a service outage would suggest SQL Server is down, this may not be the case. There are other factors in play that could cause what we would describe as outages. The updater will be the only piece of software updating the database.
I've tried using LiteDB, but I couldn't get it working in a way that met my concurrency requirements. It did seem to do some of the job, and was easy to get running. However, I'd often run into locked files due to the nature of a web API. I did work out a solution for that, but then the updater app couldn't access the database file.
Does anyone have any recommendations I can look into?
Given the description of the problem (1 table, 250k rows with, I assume, a relatively fast growth rate) and the requirements, I don't think a relational database is what you are looking for.
I think NoSQL databases or, more specifically, document-oriented databases are better suited to your requirements. There are many NoSQL choices: MongoDB, CouchDB, Cassandra (a wide-column store rather than a document store), ... the choice is yours.
Personally I have some experience with Elasticsearch (https://www.elastic.co/elasticsearch), which is quite easy to learn, portable (runs on Linux, Windows, containers, etc.), scalable and fast. I mean really, really fast: you can get results in 10-20 milliseconds (sometimes even less).
The NEST NuGet package acts as a high-level client for working with Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/client/net-api/7.x/nest-getting-started.html).
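For the nightly upsert, one option is the Elasticsearch bulk endpoint, which accepts newline-delimited JSON. The sketch below is in Python rather than .NET and only builds the request body; it does not talk to a server. The index name `users` and the `user_id` field are assumptions made for illustration.

```python
import json
from typing import Dict, Iterable


def build_bulk_body(index: str, users: Iterable[Dict], id_field: str = "user_id") -> str:
    """Build a newline-delimited JSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for user in users:
        # An "index" action creates the document, or replaces an existing
        # document that has the same _id, which gives upsert behaviour.
        action = {"index": {"_index": index, "_id": str(user[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(user))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent with `POST /_bulk` and the `Content-Type: application/x-ndjson` header. Because each action replaces a whole document, readers see either the old or the new version of a user while the updater runs.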

Database for a java application in cluster

I'd like to play around with Kubernetes. I'm able to start a simple app, but now I'd like to design something more complex. However, I can't figure out how to handle database access in such an architecture.
Let's say I have 100 pod replicas of some simple chat application. They all need to access the same database (or rather the same data set) and perform CRUD operations on it. How do I design it to keep the data consistent and eliminate the risk of deadlocks?
If possible, I'd like to use an SQL-like database, so I can comfortably use Hibernate and other tools I'm familiar with.
Is this even possible, or do I have to use a totally different approach? What is the name of the technology or architecture I'm searching for?
1) You can use a connection pool to reduce the number of database connections and make the connection settings more aggressive/elastic.
2) Split your microservices so that access to the persistence layer is itself a microservice, exposing a CRUD service in front of your store (MySQL, another RDBMS, NoSQL, etc.). That way you most likely don't need hundreds of pod replicas.
3) Deadlocks and locking strategies: as Andrew mentioned in the comments, this relates more to your software architecture than to K8s itself. There are plenty of ways to deal with it, each with pros and cons.

Data Migration from Legacy Data Structure to New Data Structure

OK, so here is the problem we are facing.
Currently:
We have a ton of Legacy Applications that have direct database access
The data structure in the database is not normalized
The current process / structure is used by almost all applications
What we are trying to implement:
Move all functionality to a RESTful service so no application has direct database access
Implement a normalized data structure
The problem we are having is how to implement this migration not only with the Applications but with the Database as well.
Our current solution is to:
Identify all the CRUD functionality and implement this in the new Web Service
Create the new Applications to replace the Legacy Apps
Point the New Applications to the new Web Service ( Still Pointing to the Old Data Structure )
Migrate the data in the databases to the new Structure
Point the New Applications to the new Web Service ( Point to new Data Structure )
But as we discuss this process, we are looking at having to write the new Web Service twice: once for the old data structure and once for the new data structure, because we currently cannot represent the old data structure in a form that fits the new data structure for the new Web Service.
I wanted to know if anyone has faced any challenges like this and how did you overcome these types of issues/implementation and such.
EDIT: More explanation of synchronization using bi-directional triggers; updates for syntax, language and clarity.
Preamble
I have faced similar problems on a data model upgrade on a large web application I worked on for 7 years, so I feel your pain. From this experience, I would propose something a bit different, but hopefully something that will be a lot easier to implement. But first, an observation:
Value to the organisation is the data - data will long outlive all your current applications. The business will constantly invent new ways of getting value out of the data it has captured which will engender new reports, applications and ways of doing business.
So getting the new data structure right should be your most important goal. Don't trade getting the structure right against other short-term development goals, especially:
Operational goals such as rolling out a new service
Report performance (use materialized views, triggers or batch jobs instead)
This structure will change over time so your architecture must allow for frequent additions and infrequent normalizations to it. This means that your data structure and any shared APIs to it (including RESTful services) must be properly versioned.
Why RESTful web services?
You mention that you will "Move all functionality to a RESTful service so no application has direct database access". I need to ask a very important question with respect to the legacy apps: Why is this important and what value will it bring?
I ask because:
You lose ACID transactions (each call is a single transaction unless you implement some horrifically complicated WS-* standards)
Performance degrades: Direct database connections will be faster (no web server work and translations to do) and have less latency (typically 1ms rather than 50-100ms) which will visibly reduce responsiveness in applications written for direct DB connections
The database structure is not abstracted from the RESTful service, because you acknowledge that with the database normalization you have to rewrite the web services and rewrite the applications calling them.
And the other cross-cutting concerns are unchanged:
Manageability: Direct database connections can be monitored and managed with many generic tools here
Security: direct connections are more secure than the web services your developers will write
Authorization: The database permission model is very advanced and as fine-grained as you could want
Scalability: The web service is a (the only?) direct-connected database application and so scales only as far as the database does
You can migrate the database and keep the legacy applications running by maintaining a legacy RESTful API. But what if we can keep the legacy apps without introducing a 'legacy' RESTful service?
Database versioning
Presumably the majority of the 'legacy' applications use SQL to directly access data tables; you may have a number of database views as well.
One approach to the data migration is that the new database (with the new normalized structure in a new schema) presents the old structure as views to the legacy applications, typically from a different schema.
This is actually quite easy to implement, but solves only reporting and read-only functionality. What about legacy application DML? DML can be solved using
Updatable views for simple transformations
Introducing stored procedures where updatable views are not possible (e.g. "CALL insert_emp(?, ?, ?)" rather than "INSERT INTO EMP (col1, col2, col3) VALUES (?, ?, ?)").
Have a 'legacy' table that synchronizes with the new database with triggers and DB links.
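The view-based compatibility layer can be sketched as follows, using Python and SQLite purely so the example is self-contained. The normalized `person` and `address` tables and the flat legacy `emp` shape are hypothetical. SQLite views are read-only by themselves, so `INSTEAD OF` triggers play the role that updatable views or stored procedures would play in other databases.

```python
import sqlite3

SCHEMA = """
-- New normalized structure.
CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE address (person_id INTEGER PRIMARY KEY REFERENCES person(id), city TEXT);

-- Legacy shape: one flat EMP record, presented as a view over the new tables.
CREATE VIEW emp AS
    SELECT p.id AS id, p.name AS name, a.city AS city
    FROM person p LEFT JOIN address a ON a.person_id = p.id;

-- Legacy INSERT statements against EMP are redirected to the new tables.
CREATE TRIGGER emp_insert INSTEAD OF INSERT ON emp
BEGIN
    INSERT INTO person (id, name) VALUES (NEW.id, NEW.name);
    INSERT INTO address (person_id, city) VALUES (NEW.id, NEW.city);
END;

-- Legacy UPDATE statements are translated the same way.
CREATE TRIGGER emp_update INSTEAD OF UPDATE ON emp
BEGIN
    UPDATE person SET name = NEW.name WHERE id = OLD.id;
    UPDATE address SET city = NEW.city WHERE person_id = OLD.id;
END;
"""


def open_compat_db() -> sqlite3.Connection:
    """Create an in-memory database with the new schema and the legacy view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

A legacy application can keep issuing `INSERT INTO emp (id, name, city) VALUES (?, ?, ?)` and `UPDATE emp SET city = ? WHERE id = ?` unchanged, while the data actually lives in the normalized tables.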
Having a legacy-format table with bi-directional synchronization to the new format table(s) using triggers is a brute-force solution and relatively ugly.
You end up with identical data in two different schemas (or databases) and the possibility of data going out-of-sync if the synchronization code has bugs - and then you have the classic issues of the "two master" problem. As such, treat this as a last resort, for example when:
The fundamental structure has changed (for example, the cardinality of a relation has changed), or
The translation to the legacy format is a complex function (eg if the legacy column is the square of the new-format column value and is set to "4", an updatable view cannot determine if the correct value is +2 or -2).
When such changes are required in your data, there will be some significant change in code and logic somewhere. You could implement it in a compatibility layer (advantage: no change to legacy code) or change the legacy app (advantage: the data layer stays clean). This is a technical decision for the engineering team.
Creating a compatibility database with the legacy structure, using the approaches outlined above, minimizes changes to legacy applications (in some cases, the legacy application continues without any code change at all). This greatly reduces development and testing costs (for which there is no net functional gain to the business), and greatly reduces rollout risk.
It also allows you to concentrate on the real value to the organisation:
The new database structure
New RESTful web services
New applications (potentially built using the RESTful web services)
Positive aspect of web services
Please don't read the above as a diatribe against web services, especially RESTful web services. When used for the right reason, such as for enabling web applications or integration between disparate systems, this is a good architectural solution. However, it might not be the best solution for managing your legacy apps during the data migration.
What it seems like you ought to do is define a new data model ("normalized") and build a mapping from the normalized model back to the legacy model. Then you can replace legacy direct calls with calls on the normalized one at your leisure. This breaks no code.
In parallel, you need to define what amounts to a (centralized) legacy DB API, and map it to your normalized model. Now, at your leisure, replace the original legacy DB calls with calls on the legacy DB API. This breaks no code.
Once the original calls are completely replaced, you can switch the data model over to the real normalized one. This should break no code, since everything is now going against the legacy db API or the normalized db API.
Finally, you can replace the legacy db API calls and related code, with revised code that uses the normalized data API. This requires careful recoding.
To speed all this up, you want an automated code transformation tool to implement the code replacements.
This document seems to have a good overview: http://se-pubs.dbs.uni-leipzig.de/files/Cleve2006CotransformationsinDatabaseApplicationsEvolution.pdf
This seems like a very messy situation, and I don't think there's a "clean" solution. I've been through similar situations a couple of times - they weren't much fun.
Firstly, the effort of changing your client apps is going to be significant - if the underlying domain changes (by introducing the concept of an address that is separate from a person, for instance), the client apps also change - it's not just a change in the way you access the data. The best way to avoid this pain is to write your API layer to reflect the business domain model of the future, and glue your old database schema into that; if there are new concepts you cannot reflect using the old data (e.g. "get /app/addresses/addressID"), throw a NotImplemented error. Where you can reflect the new model with the old data, wire it together as best you can, and then re-factor under the covers.
Secondly, that means you need to build versioning into your API as a first-class concern - so you can tell clients that in version 1, features x, y and z throw "NotImplemented" exceptions. Each version should be backwards compatible, but add new features. That way, you can refactor features in version 1 as long as you don't break the service, and implement feature x in version 1.1, feature y in version 1.2 etc. Ideally, have a roadmap for your versions, and notify the client app owners if you're going to stop supporting a version, or release a breaking change.
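A minimal sketch of that versioning idea, in Python and independent of any web framework: each feature records the first API version that implements it, and calls made under an earlier version raise `NotImplementedError` (which a REST layer could translate to HTTP 501). The `VersionedApi` class and the feature names are hypothetical.

```python
from typing import Any, Callable, Dict, Tuple

Version = Tuple[int, int]  # (major, minor)


class VersionedApi:
    """Registry that maps features to the first version that implements them."""

    def __init__(self) -> None:
        self._features: Dict[str, Tuple[Version, Callable[..., Any]]] = {}

    def register(self, feature: str, since: Version, handler: Callable[..., Any]) -> None:
        self._features[feature] = (since, handler)

    def call(self, version: Version, feature: str, *args: Any) -> Any:
        since, handler = self._features[feature]
        if version < since:
            # The feature is on the roadmap but not served at this version.
            raise NotImplementedError(
                f"{feature} is not available before version {since[0]}.{since[1]}"
            )
        # Later versions keep serving everything earlier versions offered.
        return handler(*args)
```

Registering `get_person` at `(1, 0)` and `get_address` at `(1, 1)` means a client pinned to 1.0 gets a clear "not implemented" signal for addresses, while the handler behind `get_person` can be refactored freely as long as its results stay the same.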
Thirdly, a set of automated integration tests for your API is the best investment you can make - they confirm that you've not broken features as you refactor.
Hope this is of some use - I don't think there's a single, straightforward answer to your question.

To CouchDB or not to?

Note: I have investigated CouchDB for some time and would like to hear about some actual experiences.
I have an Oracle database for a fleet tracking service; some stats are:
100 GB db
Huge insertion/sec (our received messages)
Reliable replication (via Oracle streams on 4 servers)
Heavy complex queries.
Now the question: Can CouchDB be used in this case?
Note: Why I thought of CouchDB?
I have read about its ability to scale horizontally very well. That's very important in our case.
Since it's schema-free, we could handle changes more easily; we have a lot of changes in different tables and stored procedures.
Thanks
Edit I:
I need transactions too, but I can tolerate other solutions. And if there is a little delay in replication, that would be no problem as long as the replication is guaranteed.
You are enjoying the following features with your database:
Using it in production
The data is naturally relational (related to itself)
Huge insertion rate (no MVCC concerns)
Complex queries
Transactions
These are all reasons not to switch to CouchDB.
Of course, the story is not so simple. I think you have discovered what many people never learn: complex problems require complex solutions. We cannot simply replace our database and take the rest of the month off. Sure, CouchDB (and BigCouch) supports excellent horizontal scaling (and cross-datacenter replication too!) but the cost will be rewriting a production application. That is not right.
So, where can CouchDB benefit you?
I suggest that you begin augmenting your application with CouchDB applications. Deploy CouchDB, import your data into it, and build non mission-critical applications. See where it fits best.
For your project, these are the key CouchDB strengths:
It is a small, simple tool—easy for you to set up on a workstation or server
It is a web server. It integrates very well with your infrastructure and security policies.
For example, if you have a flexible policy, just set it up on your LAN
If you have a strict network and firewall policy, you can set it up behind a VPN, or with your SSL certificates
With that step done, it is very easy to access. Just make HTTP or HTTPS requests. Whether you are importing data from Oracle with a custom tool, or using your web browser, it's all the same.
Yes! CouchDB is an app server too! It has a built-in administrative app, to explore data, change the config, etc. (like a built-in phpmyadmin). But for you, the value will be building admin applications and reports as simple, traditional HTML/Javascript/CSS applications. You can get as fancy or as simple as you like.
As your project grows and becomes valuable, you are in a great position to grow, using replication
Either expand the core with larger CouchDB clusters
Or, replicate your data and applications into different data centers, or onto individual workstations, or mobile phones, etc. (The strategy will be more obvious when the time comes.)
CouchDB gives you a simple web server and web site. It gives you a built-in web services API to your data. It makes it easy to build web apps. Therefore, CouchDB seems ideal for extending your core application, not replacing it.
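To show what "just make HTTP requests" looks like, here is a small Python sketch that builds requests for CouchDB's document API (`PUT /{db}/{docid}` to store a document, `GET /{db}/{docid}` to read it). It assumes a CouchDB instance on `localhost` at the default port 5984 and a hypothetical database called `fleet`. Authentication is left out, and the requests are only constructed, not sent.

```python
import json
import urllib.request

COUCH_URL = "http://localhost:5984"  # assumption: local CouchDB on its default port


def put_document(db: str, doc_id: str, doc: dict) -> urllib.request.Request:
    """Build a request that stores `doc` under `doc_id` in database `db`."""
    return urllib.request.Request(
        url=f"{COUCH_URL}/{db}/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def get_document(db: str, doc_id: str) -> urllib.request.Request:
    """Build a request that reads the document back as JSON."""
    return urllib.request.Request(url=f"{COUCH_URL}/{db}/{doc_id}", method="GET")
```

With a server running, each request would be sent with `urllib.request.urlopen(request)`. An import tool reading from Oracle would do the same thing in a loop, or batch documents through the `_bulk_docs` endpoint.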
I don't agree with this answer.
I think CouchDB suits the fleet tracking use case especially well because of its distributed nature. Moreover, the unreliable nature of the GPRS connections used for transmitting position data makes the offline-first paradigm of CouchApps the perfect partner for your application.
For uploading data from the trucks, the insertion rate can benefit greatly from CouchDB replication and bulk inserts, especially on SSD-based CouchDB hosting.
For downloading data to the trucks, CouchDB provides filtered replication, allowing each truck to download only the data it really needs instead of the whole database.
Regarding complex queries, NoSQL databases are more flexible and can perform much faster than relational databases. It's only a matter of structuring and querying your data reasonably.

In Memory Database

I'm using SQL Server to drive a WPF application. I'm currently using NHibernate and pre-read all the data so it's cached for performance reasons. That works for a single client app, but I was wondering if there's an in-memory database that I could use so I can share the information across multiple apps on the same machine. Ideally this would sit below my NHibernate stack, so my code wouldn't have to change. Effectively I'm looking to move my DB from its traditional format on the server to an in-memory DB on the client.
Note I only need select functionality.
I would be incredibly surprised if you even need to load all your information in memory. I say this because, just as one example, I'm working on a Web app at the moment that (for various reasons) loads thousands of records on many pages. This is PHP + MySQL. And even so it can do it and render a page in well under 100ms.
Before you go down this route, make sure that you have to. First make your database as performant as possible. Obviously this includes things like having appropriate indexes and tuning your database, but even those are putting the cart before the horse.
First and foremost you need to make sure you have a good relational data model: one that lends itself to performant queries. This is as much art as it is science.
Also, you may like NHibernate, but ORMs are not always the best choice. There are some corner cases, for example, in which hand-coded SQL will be vastly superior.
Now assuming you have a good data model and assuming you've then optimized your indexes and database parameters and then you've properly configured NHibernate, then and only then should you consider storing data in memory if and only if performance is still an issue.
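As a small illustration of the "indexes first" step, the following Python sketch uses SQLite as a stand-in (the question uses SQL Server, where you would inspect the execution plan instead) and compares the query plan for the same query before and after adding an index. The `orders` table and its data are made up.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return SQLite's plan for `sql`; the last column holds the description."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 100, i * 1.5) for i in range(1000)],
    )
    sql = "SELECT * FROM orders WHERE customer_id = ?"
    before = query_plan(conn, sql, (42,))  # no index yet: full table scan
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    after = query_plan(conn, sql, (42,))   # planner can now search the index
    return before, after
```

Calling `demo()` returns the two plan descriptions. The first reports a scan of the whole table and the second a search using `idx_orders_customer`; the exact wording differs between SQLite versions. Checking plans like this is much cheaper than introducing a cache.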
To put this in perspective, the only times I've needed to do this are on systems that need to perform millions of transactions per day.
One reason to avoid in-memory caching is because it adds a lot of complexity. You have to deal with issues like cache expiry, independent updates to the underlying data store, whether you use synchronous or asynchronous updates, how you give the client a consistent (if not up-to-date) view of your data, how you deal with failover and replication and so on. There is a huge complexity cost to be paid.
Assuming you've done all the above and you still need it, it sounds to me like what you need is a cache or grid solution. Here is an overview of Java grid/cluster solutions but many of them (eg Coherence, memcached) apply to .Net as well. Another choice for .Net is Velocity.
It needs to be pointed out and stressed that something like NHibernate is only consistent so long as nothing externally updates the database and that there is exactly one NHibernate-enabled process (barring clustered solutions). If two desktop apps on two different PCs are both updating the same database with NHibernate the caching simply won't work because the persistence units simply won't be aware of the changes the other is making.
http://www.db4o.com/ can be your friend!
Velocity is an out of process object caching server designed by Microsoft to do pretty much what you want although it's only in CTP form at the moment.
I believe there are also wrappers for memcached, which can also be used to cache objects.
You can use HANA, express edition. You can download it for free, it's in-memory, columnar and allows for further analytics capabilities such as text analytics, geospatial or predictive. You can also access with ODBC, JDBC, node.js hdb library, REST APIs among others.

Resources