I am looking for strategic advice about our operational reporting. We have an effective monolith that has existed for decades. Over the past couple of years we have tried to adopt a microservice architecture, and we are now in an awkward spot: one monolith (with a large relational database) and several microservices (with small relational databases), and we need to do operational reporting on the microservices. The microservice data has to be combined with data from the monolith to be valuable, so we are looking at options to achieve this.
So far, we are considering the following:
data replication to the monolith to leverage its reporting functionality (either dual write OR event sourcing)
sync both to a data warehouse and use a BI tool for both. This path may not be ideal for operational reporting since it’s OLAP not OLTP.
What design patterns should we consider? What are some pros and cons of the options we are considering?
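To make the event-based variant of the first option more concrete, here is a minimal sketch, assuming each microservice publishes events that carry a per-service sequence number. The event names, fields and the `ReportingProjection` class are all hypothetical, and the reporting table is an in-memory dict rather than the monolith's real database.

```python
class ReportingProjection:
    """Keeps a reporting copy of order data up to date from microservice events."""

    def __init__(self):
        self.orders = {}    # order_id -> reporting row (stand-in for a monolith table)
        self.last_seq = {}  # source service -> highest sequence number applied

    def apply(self, event):
        """Apply one event; return False if it was already applied (redelivery)."""
        source, seq = event["source"], event["seq"]
        if seq <= self.last_seq.get(source, 0):
            return False  # duplicate or replayed event, safe to ignore
        if event["type"] == "OrderPlaced":
            self.orders[event["order_id"]] = {"status": "placed", "total": event["total"]}
        elif event["type"] == "OrderShipped":
            self.orders.setdefault(event["order_id"], {})["status"] = "shipped"
        self.last_seq[source] = seq
        return True


events = [
    {"source": "orders", "seq": 1, "type": "OrderPlaced", "order_id": "o1", "total": 50.0},
    {"source": "orders", "seq": 2, "type": "OrderShipped", "order_id": "o1"},
]
projection = ReportingProjection()
for event in events + events:  # deliver everything twice to simulate redelivery
    projection.apply(event)
```

Tracking the last applied sequence number per source is what makes redelivery harmless. A dual write has no equivalent safety net: if the second write fails, the two databases silently diverge.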
I am evaluating the best approach for migrating our current on-premises Java web app to a SaaS platform. Application multi-tenancy seems straightforward, but the database less so. We're probably all aware of the database-per-tenant pros at this point: isolation, performance, reduced backup/restore complexity, and much lower retrofit complexity. Naturally the row-per-tenant approach has its benefits as well, reduced infrastructure costs being a major one.
Is it unheard of to combine the two approaches? That way the database-per-tenant approach gives a faster time-to-market while the development changes to support a multi-tenant database are made gradually. Once both approaches are operational, customers with particularly heavy workloads or security constraints could have their own isolated database, but the default would be a shared common database (for cost/efficiency reasons). Does anyone have any experience using or seeing this combination of approaches in the real world?
Whether requests are routed to a datasource by tenant ID or the tenant ID is an argument to the SQL queries, the major differences should be contained within the persistence layer/database, which somewhat limits the added complexity of combining the two approaches.
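A minimal sketch of how the two approaches might sit behind one persistence layer, assuming in-memory SQLite databases stand in for the real datasources. `TenantRouter`, `InvoiceRepository` and the `invoices` table are hypothetical names, not anything from an existing framework.

```python
import sqlite3

SCHEMA = "CREATE TABLE invoices (tenant_id TEXT NOT NULL, number TEXT, total REAL)"


class TenantRouter:
    """Sends each tenant to the shared database or to its own isolated one."""

    def __init__(self):
        self.shared = sqlite3.connect(":memory:")
        self.shared.execute(SCHEMA)
        self.isolated = {}  # tenant_id -> dedicated connection

    def isolate(self, tenant_id):
        """Give a tenant its own database (this sketch does not move existing data)."""
        conn = sqlite3.connect(":memory:")
        conn.execute(SCHEMA)
        self.isolated[tenant_id] = conn

    def connection_for(self, tenant_id):
        return self.isolated.get(tenant_id, self.shared)


class InvoiceRepository:
    """Uses the same SQL for both models: tenant_id is always a query argument."""

    def __init__(self, router):
        self.router = router

    def add(self, tenant_id, number, total):
        conn = self.router.connection_for(tenant_id)
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO invoices VALUES (?, ?, ?)",
                         (tenant_id, number, total))

    def list_numbers(self, tenant_id):
        conn = self.router.connection_for(tenant_id)
        rows = conn.execute(
            "SELECT number FROM invoices WHERE tenant_id = ? ORDER BY number",
            (tenant_id,))
        return [row[0] for row in rows]


router = TenantRouter()
router.isolate("big")
repo = InvoiceRepository(router)
repo.add("small-a", "A-1", 10.0)
repo.add("small-b", "B-1", 20.0)
repo.add("big", "X-1", 99.0)
```

Keeping the `tenant_id` column in the isolated databases as well is a deliberate choice here: the queries stay identical in both models, so only the router knows which model a tenant is on.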
There are complexities when we scale out a tenant, i.e. when we move a tenant's data from the shared database to an isolated database.
Automating this process takes effort and testing, because you have to identify the entity tables and the mapping tables and order the steps so that the migration succeeds. The data access strategy used for the database (e.g. an ORM or ADO.NET) also needs to be considered.
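The ordering problem can be sketched as a topological sort over the foreign-key dependencies. This is only an illustration, assuming two hypothetical tables (`customers`, `orders`) that both carry a `tenant_id` column, with SQLite standing in for the real databases.

```python
import sqlite3
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical schema: each table maps to the tables it references.
DEPENDS_ON = {"customers": set(), "orders": {"customers"}}

DDL = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, tenant_id TEXT, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, tenant_id TEXT,
                     customer_id INTEGER REFERENCES customers(id));
"""


def new_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # make a wrong copy order fail loudly
    conn.executescript(DDL)
    return conn


def migrate_tenant(src, dst, tenant_id, depends_on):
    """Copy one tenant's rows parents-first, then delete them children-first."""
    order = list(TopologicalSorter(depends_on).static_order())
    with dst:  # a single transaction on the target database
        for table in order:  # table names come from the trusted mapping above
            rows = src.execute(
                f"SELECT * FROM {table} WHERE tenant_id = ?", (tenant_id,)).fetchall()
            if rows:
                marks = ", ".join("?" * len(rows[0]))
                dst.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    with src:  # remove from the shared database only after the copy committed
        for table in reversed(order):
            src.execute(f"DELETE FROM {table} WHERE tenant_id = ?", (tenant_id,))
    return order


src, dst = new_db(), new_db()
src.execute("INSERT INTO customers VALUES (1, 't1', 'Acme')")
src.execute("INSERT INTO customers VALUES (2, 't2', 'Other')")
src.execute("INSERT INTO orders VALUES (10, 't1', 1)")
src.commit()
order = migrate_tenant(src, dst, "t1", DEPENDS_ON)
```

The copy and the delete are two separate transactions on two databases, so a real migration would still need a way to resume or undo a half-finished move.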
Compared to a row-wise tenant ID, a schema per tenant within the same database makes this kind of migration easier.
We did try this out initially, but because there was both framework data and application/business data, it was a little difficult to get the migration to run automatically in the short time frame we had. With enough time and a proper plan, however, this can be achieved.
I need to design a NoSQL database for a system that uses an FSA (Fare Service Aggregator) and is under very heavy load, including major scenarios, database aggregates, and queries.
Are there any references (of about 10-15 pages) on how to design a NoSQL database?
Video tutorials or examples would do. Thank you.
There is no database design in NoSQL; it's literally a document dump. The major problem with NoSQL is how to query all the garbage in it. Most of them are key-value stores, totally unsuitable for business requirements. If a document is too big, maybe split it. I would suggest Couchbase to play with, because it has almost got SQL for querying objects (they must have called it BackSQL), which seems to be done right, so there is a chance you will be able to implement something with it. It will be 30x slower than an RDBMS, but that is the trade-off for NoSQL "scalability" (you can horizontally add a dozen servers to compensate for slow index building and scans).
I am looking at rewriting a VB-based on-premise (locally installed) application (invoicing + inventory) as a web-based Clojure application for small enterprise customers. I intend to offer it as a SaaS application for customers in a similar trade.
I was looking at database options. My choice was an RDBMS: PostgreSQL or MySQL. I might scale up to 400 users in the first year, with typically 20-40 page views per day per user, mostly for transactions rather than static views. Each view will involve fetching and updating data. ACID compliance is necessary (or so I think). So the transaction volume is not huge.
It would have been a no-brainer to pick either of these based on my preference, but for one requirement, which I believe is typical of a SaaS app: the schema will keep changing as I add more customers/users and as each customer's business requirements change (I will be offering only limited flexibility to start with). As I am not a DB expert, based on what I can think of and have read, I can handle that in a number of ways:
Have a traditional RDBMS schema design in MySQL/PostgreSQL with a single DB hosting multiple tenants, and add enough "free-floating" columns to each table to allow for future changes as I add more customers or make changes for an existing customer. The downside might be having to propagate changes to the DB every time a small change is made to the schema. I remember reading that in PostgreSQL schema updates can be done in real time without locking, but I am not sure how painful or how practical that is in this use case, especially as schema changes might also require new or minor SQL changes.
Have an RDBMS, but design the database schema in a flexible manner: close to an entity-attribute-value model, or just as a key-value store (Workday and FriendFeed, for example).
Have the entire thing in memory as objects and store them in log files periodically (e.g., edval, LMAX).
Go for a NoSQL DB like MongoDB or Redis. But based on what I can gather, they are not suitable for this use case and not fully ACID compliant.
Go for a NewSQL DB like VoltDB or JustOneDB (cloud based), which retain SQL and ACID-compliant behaviour and are "new-gen" RDBMSs.
I looked at Neo4j (a graph DB), but I am not sure whether it fits this use case.
In my use case, more than scalability or distributed computing, I am looking for a better way to achieve "flexibility in schema + ACID + some reasonable performance". Most of the articles I could find on the net speak of schema flexibility as something that leads to performance (in the case of NoSQL DBs) and scalability, while leaving out the ACID/transactions side.
Is this an "either/or" case of schema flexibility vs. ACID transactions, or is there a better way out?
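For what it is worth, the entity-attribute-value option above does not by itself give up ACID, because the flexible attributes still live in ordinary tables. Here is a minimal sketch, assuming SQLite as a stand-in for PostgreSQL/MySQL and hypothetical `invoices` and `invoice_attrs` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (
    id INTEGER PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    total REAL NOT NULL
);
-- Per-customer custom fields: one row per (invoice, attribute name).
CREATE TABLE invoice_attrs (
    invoice_id INTEGER NOT NULL REFERENCES invoices(id),
    name TEXT NOT NULL,
    value TEXT,
    PRIMARY KEY (invoice_id, name)
);
""")


def create_invoice(conn, tenant_id, total, custom_attrs):
    """Insert the fixed columns and the custom attributes in one transaction."""
    with conn:  # commits on success, rolls everything back on any error
        cur = conn.execute(
            "INSERT INTO invoices (tenant_id, total) VALUES (?, ?)", (tenant_id, total))
        invoice_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO invoice_attrs VALUES (?, ?, ?)",
            [(invoice_id, name, str(value)) for name, value in custom_attrs.items()])
    return invoice_id


def load_invoice(conn, invoice_id):
    """Merge the fixed columns and the custom attributes into one dict."""
    row = conn.execute(
        "SELECT tenant_id, total FROM invoices WHERE id = ?", (invoice_id,)).fetchone()
    attrs = dict(conn.execute(
        "SELECT name, value FROM invoice_attrs WHERE invoice_id = ?", (invoice_id,)))
    return {"tenant_id": row[0], "total": row[1], **attrs}


invoice_id = create_invoice(conn, "t1", 120.0, {"po_number": "PO-7"})
```

The cost shows up elsewhere: every custom value is stored as text, and reporting on custom attributes needs extra joins or pivoting, which is the usual performance objection to EAV.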
I think Tarantool can help you. It has transactions, Lua, MsgPack, etc. Also see that video.
I want to build a web-application similar to Google-Analytics, in which I collect statistics on my customers' end-users, and show my customers analysis based on that data.
High scalability, handle very large volume
Compartmentalized - Queries always run on a single customer's data
Support analytical queries (drill-down, slices, etc.)
Because of the analytical needs, I'm considering using an OLAP/BI suite, but I'm not sure it's meant for this scale. Would a NoSQL database be a better fit? Would a simple RDBMS do?
This is what I am using at work in a production environment, and it works like a charm.
I coupled three things:
PostgreSQL + LucidDB + Mondrian (More generally the whole Pentaho BI suite components)
PostgreSQL: I am not going to describe PostgreSQL at length; it is a really strong open-source RDBMS that will almost certainly let you do everything you need. I use it to store my operational data.
LucidDB: LucidDB is an open-source column-store database. It is highly scalable and gives a real gain in processing time compared to PostgreSQL when retrieving large amounts of data. It is optimized for intensive reads, not for transaction processing. This is my data warehouse database.
Mondrian: Mondrian is an open-source R-OLAP cube engine. LucidDB made it easy to connect those two programs together.
I would recommend looking at the whole Pentaho BI Suite; it is worth it, and you might want to use some of its components.
Hope I could help,
There are two main architectures you could opt for at true web scale:
1. "BI" architecture
Event journaller (e.g. LWES Journaller) or immutable event store (e.g. HDFS) feeds
Analytics/column-store database (e.g. Greenplum, InfiniDB, LucidDB, Infobright) feeds
Business intelligence reporting tool (e.g. Microstrategy, Pentaho Business Analytics)
2. "NoSQL" architecture
(Optional) Event journaller or immutable event store feeds
NoSQL database (e.g. Cassandra, Riak, HBase) feeds
A custom analytics UI (e.g. using D3.js)
The immutable event store or journaller is there because in most cases you want to be batching your analytics events and doing bulk updates to your database (even with something like HDFS) - rather than doing an atomic write for every single page view etc.
For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.
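The batching idea can be sketched in a few lines. This is a minimal illustration, assuming an in-memory buffer and SQLite as the target store; `EventBatcher` and the `events` table are hypothetical, and a real journaller would also persist the buffer so that a crash does not lose pending events.

```python
import sqlite3


class EventBatcher:
    """Buffers analytics events and writes them to the database in bulk."""

    def __init__(self, conn, batch_size=1000):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []

    def record(self, customer_id, event_type, timestamp):
        self.pending.append((customer_id, event_type, timestamp))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all buffered events in one transaction; return how many were written."""
        if not self.pending:
            return 0
        with self.conn:  # one commit per batch instead of one per page view
            self.conn.executemany("INSERT INTO events VALUES (?, ?, ?)", self.pending)
        written = len(self.pending)
        self.pending.clear()
        return written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id TEXT, event_type TEXT, ts INTEGER)")
batcher = EventBatcher(conn, batch_size=3)
for i in range(7):
    batcher.record("cust-1", "page_view", 1000 + i)
```

With `batch_size=3`, the seven calls above trigger two bulk writes and leave one event buffered until the next `flush()`.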
Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture, you can always shard by customer if the volumes grow truly epic (billions of rows per customer) - because there's no need (I'm guessing) for cross-customer analytics.
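Sharding by customer can be as simple as hashing the customer ID. This is a rough sketch, assuming a fixed number of shards and string customer IDs; `shard_for` and `partition` are hypothetical helpers, and changing the shard count later would require moving data.

```python
import hashlib
from collections import defaultdict


def shard_for(customer_id: str, num_shards: int) -> int:
    """Map a customer to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(events, num_shards):
    """Group events by shard; all events of one customer land on the same shard."""
    shards = defaultdict(list)
    for event in events:
        shards[shard_for(event["customer_id"], num_shards)].append(event)
    return dict(shards)


events = [
    {"customer_id": "acme", "page": "/home"},
    {"customer_id": "globex", "page": "/pricing"},
    {"customer_id": "acme", "page": "/checkout"},
]
by_shard = partition(events, num_shards=8)
```

Python's built-in `hash()` is avoided on purpose: it is randomized per process for strings, so it would not route a customer to the same shard consistently.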
I'd say that having OLAP analysis in place is always nice, and it has great potential for sophisticated data analysis using MDX.
What do you mean by large volume?
Where is your customers' user information stored?
What kind of front-end and reporting are you going to use?
Disclaimer: I'll do some publicity for my own solution. Have a look at www.icCube.com and contact me for more details.
I am trying to decide whether to use Voldemort or CouchDB for an upcoming healthcare project. I want a storage system that has high availability and fault tolerance, and that can scale for the massive amounts of data being thrown at it.
What are the pros and cons of each?
Project Voldemort looks nice, but I haven't looked deeply into it so far.
In its current state CouchDB might not be the right thing for "massive amounts of data". Distributing data between nodes and routing queries accordingly is on the roadmap but not implemented so far. The biggest known production setups of CouchDB use "tables" ("databases" in Couch-speak) of about 200 GB.
HA is not natively supported by CouchDB but can be built easily: all CouchDB nodes replicate the databases between each other in a multi-master setup. We put two Varnish proxies in front of the CouchDB machines, and the Varnish boxes are made redundant with CARP. CouchDB's "built from the Web" design makes such things very easy.
The most pressing issue in our setup is the fact that there are still issues with the replication of large (multi MB) attachments to CouchDB documents.
I suggest you also check the traditional RDBMS route. There are huge issues with available talent outside the RDBMS approach and there are very capable offerings available from Oracle & Co.
Not knowing enough from your question, I would nevertheless say that Project Voldemort, or distributed key-value and document stores in general (CouchDB included), are a solution to your HA problem.
Those stores are very nice for high availability, but when it comes to consistency they are harder to write code for than traditional relational databases (RDBMS).
They are quite good at storing document-type information, which may fit nicely with your healthcare project, but they make working with the data harder.
The biggest limitation of most of these stores is that they are not transactionally safe (see Scalaris for a transactionally safe store), so you need to ensure data consistency yourself; most use read-time consistency, merging conflicting data when it is read. An RDBMS is much easier to use for consistency of data (ACID).
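To give an idea of what "merging conflicting data at read time" means for application code, here is a minimal sketch, assuming each stored version carries a vector clock (a dict mapping node name to update counter). The functions are illustrative only and are not the API of Voldemort, CouchDB or any other store.

```python
def dominates(a, b):
    """True if vector clock a has seen every update in b plus at least one more."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) > b.get(n, 0) for n in nodes))


def resolve(versions, merge):
    """Reconcile conflicting versions of one key at read time.

    versions: non-empty list of (clock, value) pairs returned by the replicas.
    merge:    application-specific function that combines two concurrent values.
    """
    # Discard versions that are strictly older than some other version.
    live = [(clock, value) for clock, value in versions
            if not any(dominates(other, clock) for other, _ in versions)]
    clock, value = live[0]
    for other_clock, other_value in live[1:]:
        value = merge(value, other_value)  # only truly concurrent versions reach here
        clock = {n: max(clock.get(n, 0), other_clock.get(n, 0))
                 for n in clock.keys() | other_clock.keys()}
    return clock, value


# Three versions of a patient's medication list from different replicas.
v1 = ({"n1": 1}, {"aspirin"})
v2 = ({"n1": 2}, {"aspirin", "ibuprofen"})          # supersedes v1
v3 = ({"n1": 1, "n2": 1}, {"aspirin", "codeine"})   # concurrent with v2
clock, value = resolve([v1, v2, v3], merge=lambda x, y: x | y)
```

The `merge` function is where the real difficulty lies: a set union never loses an addition but can resurrect a deleted item, and deciding what is acceptable is left to the application.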
Joining data is much harder too. In an RDBMS you can easily query data across several tables, whereas in CouchDB you need to write code to aggregate data. For other stores, Hadoop may be a good choice for aggregating information.
Read about BASE and the CAP theorem on consistency vs. availability.
Is MemcacheDB an option? I've heard that's how Digg handled HA issues.