advantage of noSql over newSql - database

It seems now that Google bet on NewSql solutions for big data storages.
I'm wondering if there is still some advantages of a NoSql solution comparing to a newSql solution ? (Like memory managment or others things)

NewSql databases are a new "strain" of databases if you will that are attempting to take the long established benefits of the traditional Relationl Database Management System (RDBMS) and make it compete with the highlights of NoSql data stores. They are not updates or improvements to RDBMS but more often rewrites that include middleware that abstracts the practice of database "sharding" or the ability to distribute a database over a grid of computers like NoSql does.
The power of the RDBMS comes mostly from queryability via the Structured Query Language (SQL), their transactionality and adhereance to the ACID principal (Atomicity, Consistency, Isolation, Durability) and the powerful tools developed over time to manage them. A lesser benefit comes from the fact that the relational model eliminates repetitive storage of the same information in multiple places.
The benefits of the NoSql is high speed, the ability to scale laterally across a comuting grid, and the lack of schema to maintain. This makes them very highly performant even against hugh data stores. But they lack the benefits that you get from the traditional RDBMS in that the query language to manipulate data isn't really there (yet), they can't be transactional across a computing grid, and they lack the tools to work against them like MS Sql Server Management Studio.
NewSql is attempting to take the best parts of both worlds and I think it eventually will. Here is a great write up of the RDBMS V.s. NoSql V.s. NewSql on bananagunprogramming.com.

Related

Are Relational DBMS actually difficult to scale?

I know there are similar questions on SO but some are a decade old and others don't have helpful answers.
I am a newbie in the system designing world. I have some experience with relational DBMS where I created some small scale projects.
Every post on 'Relational vs Non-Relational DBMS' points out that because of ACID transactions, Ref integrity constraints and consistency, Relational DBMS are difficult to scale. But on the other hand giants like Amazon and financial services continue to use relational DBMS and they don't seem to have any problems with scalability.
I just want to theoretically understand if relational DBMS are actually difficult to scale? If it is, how are these companies using it with terabytes of data?
Thank you!
As long as the data is in Third Normal Form, and appropriate indexes are applied, there should be no problem. This requires knowing what data is to be stored and how it needs to be accessed.
At some scale (e.g. I had a system which added up to 18 million rows per day to 1 table) you may want to have a process in place to transfer new data to an Analysis database (OLAP - e.g. SQL Server Analysis Database, MSAS). An OLAP is relatively easy to design, but the process to keep it up-to-date on even a daily basis can be difficult to design and manage without some experience. Queries to an OLAP database are optimised for reporting; I don't believe that any query for the system I mentioned ever took more than an average of 3 seconds to complete.

Can we use snowflake as database for Data driven web application?

I am Asp.Net MVC/SQLSERVER developer and I am very new to all these and so I may be on compelete wrong path.
I came to know by googling that Snowwflake can put/get data from AWS-S3, Google Storage and Azure. And Snowflake has their database and tables as well.
I have following questions,
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
Can we use Snowflake as database for data driven web application? and if yes, could you provide link or something to start?
Once again I am very new to all these and expecting from you to get ideas and best way to work arround this.
Thak you in advance.
Why one should use Snowflake when you can compute your data with Cloud Storage(S3 etc) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform, similar to other database technologies it provides data storage and metadata and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports the use of SQL queries, as mentioned above there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as database for data driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of updates and inserts of small amounts of data. For example Snowflake doesn't support features like referential integrity.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
I agree with much of what Nathan said, to add to that, from my experience every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem of using MPP/Distributed Databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use MPP/Distributed Databases.
Snowflake and other MPP/Distributed Databases are NOT meant for OLTP workloads but instead for OLAP workloads. No matter what snake oil those companies like databricks and snowflake try to sell you MPP/Distributed databases are NOT meant for OLTP. The costs alone would be tremendous even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms because the longer a query runs the more money they make. For them to make money they have to optimize performance but not too much otherwise it will effect their income.
This can be an in-depth topic so I would recommend doing more research into OLTP Vs. OLAP.
Enforcing Referential integrity is a double edged sword, the downside being as the data volume grows the referential violation check significantly slows down the inserts and deletes. This results in the developer having to put the RI check in the program (with a dirty read) and turn off the RI enforcement by the database, finally ending up with a Snowflake like situation.
Bottom line is Snowflake not enforcing RI should not be a limitation for OLTP applications.

Replicating PostgreSQL data for analytics

I'm currently scoping out at a potential development project where we will develop an analytical solution to support a production application. Obviously we want to run queries on reasonably up-to-date data, but we don't want the operational risk of querying the main database directly with (possibly expensive) analytical queries.
To do this I believe we would like to do the following:
Make a replica of a "production" PostgreSQL database into a separate "analytics" database
Add additional tables / views etc to the "analytics" database, which will support the analytics solution only and not be part of the application DB.
Maintain the replica copy of the production data in a reasonably up-to-date fashion (realtime replication not strictly needed, but no more than a few seconds lag would be good)
The database will not be excessively large (it is a web/mobile application with a lot of users but most not likely to be active at any one time).
Is this likely to be feasible with PostgreSQL, and if so what is the best strategy / replication technique to use?
You cannot use streaming replication for that, because you cannot add tables to a read-only database. But you might rethink the requirement to not add the additional tables to the production database.
However, there are other replication techniques like Slony, Bucardo or Londiste.
One thing that you should keep in mind is that a data model that is suitable for an online transaction processing database is usually not well suited for analytical applications, and you might end up being pretty unhappy with the performance of your analytical queries. For these, the normal thing to do is to build some sort of data warehouse where data are stored in a more denormalized form, usually in something like a star schema.
But for that you cannot have “no more than a few seconds lag”. Double check if that is really essential, it usually isn't for analytical queries.

When to build a separate reporting database?

We're building an application that has a database (yeah, pretty exciting huh :). The database is mainly transactional (to support the app) and also does a bit of "reporting" as part of the app - but nothing too strenuous.
Above and beyond that we have some reporting requirements - but they're pretty vague and high-level at the moment. We have a standard reporting tool that we-use in-house which we'll use to do the "heavier" reporting as the requirements solidify.
My question is: how do you know when a separate database for reporting is required?
What sort of questions need to be asked? What sort of things would make you decide a separate reporting database was necessary?
In general, the more mission critical the transactional app and the more sophisticated the reporting requirements, the more splitting makes sense.
When transaction performance is critical.
When it's hard to get a maintenance window on the transactional app.
If reporting needs to correlate results not only from this app, but from other application silos.
If the reports need to support trending or other types of reporting that are best suited for a star schema/Business Intelligence environment.
If the reports are long running.
If the transactional app is on an expensive hardware resource (cluster, mainframe, etc.)
If you need to do data cleansing/extract-transform-load operations on the transactional data (e.g., state names to canonical state abbreviations).
It adds non-trivial complexity, so imo, there has to be a good reason to split.
Typically, I would try to report off the transactional database initially.
Ensure that any indexes you add to facilitate efficient reporting are all frequently used. The more indexes you add, the poorer performance is going to be on inserts and (if you alter keys) updates.
When you do go to a reporting database, remember there are only a few reasons you are going there:
Ultimately, the number one thing about reporting databases is that you are removing locking contention from the OLTP database. So if your reporting database is a straight copy of the same database, you're simply using delayed snapshots which won't interfere with production transactions.
Next, you can have a separate indexing strategy to support the reporting usage scenarios. These extra indexes are OK to maintain in the reporting database, but would cause unnecessary overhead in the OLTP database.
Now both the above could be done on the same server (even the same instance in a separate database or even just in a separate schema) and still see benefits. When CPU and IO are completely pegged, at that point, you definitely need to have it on a completely separate box (or upgrade your single box).
Finally, for ultimate reporting flexibility, you denormalize the data (usually into a dimensional model or star schemas) so that the reporting database is the same data in a different model. Reporting of large amounts of data (particularly aggregates) is extremely fast in dimensional models because the star schemas are very efficient for that. It also is efficient for a larger variety of queries without a lot of re-indexing or analysis to change indexes, because the dimensional model lends itself better to unforeseen usage patterns (the old "slice and dice every which way" request). You could view this is a kind of mini-data warehouse where you use data warehousing techniques, but aren't necessarily implementing a full-blown data warehouse. Also, star schemas are particular easy for users to get to grips with, and data dictionaries are much simpler and easier to build for BI tools or reporting tools from star schemas. You could do this on the same box or different box etc, just like discussed earlier.
This question requires experience rather than science.
As a BI architect, the approach I take on designing each BI solution for my clients are very different. I don't go through a checklist. It requires a general understanding of their system, their reporting requirements, budget and man power.
I personally prefer to keep the reporting processes as much as possible on the database side (Best practice in BI world). REPORTING TOOLS ARE FOR DISPLAYING PURPOSE ONLY (MAXIMUM FOR SMALL CALCULATIONS). This approach requires a lot of pre-processing of data which requires different staging tables, triggers and etc.
When you said:
I work on projects with hundreds of millions of rows with real time reporting along with hundreds of users accessing the application/database at the same time with out issue.
There are a few things wrong with your statement.
Hundreds of millions of rows are A LOT. even today's in memory tools like Cognos TM1 or Qlikview would struggle to get such a results. (look at SAP HANA from SAP to understand how giants in the industry handle it).
If you have Hundreds of millions of rows in database, it doesn't necessarily mean that the report needs to go through all those records. maybe the report worked on 1000s not millions. probably that's what you saw.
Transactional reports are very different than dashboards. Most dashboard tools pre-processing and cache the data.
My point is that it all comes to experience for deciding when to:
design a new schema
create a semantic database
work on the same transactional database
or even use a reporting tool (Sometimes handwritten dashboards with Java/JSF/Ajax/jQuery or JSP would work fine for client)
The main reason you would need a separate database for reporting issues is when the generation of the reports interferes with the transactional responsibilities of the app. E.g. if a report takes 20 minutes to generate and utilizes 100% of the CPU/Disk/etc... during a time of high activity you might think of using a separate database for reporting.
As for questions, here are some basic one:
Can I do the high intensity reports during non-peak hours?
Does it interfere with the users using the system?
If yes to #2, what are the costs of the interference Vs the cost of another database server, refactoring code, etc...?
I would also add another reason for which you might use a reporting database, and that is: CQRS pattern (Command Query Responsibility Separation).
If you have a large number of users accessing and writing to a small set of data, you would do wise to consider this pattern. It basicly, in its simplest form, means that all your commands (Create, Update, Delete) are pushed to the transactional database.
All of your queries (Read) are from your reporting database. This lets you freely scopy your architecture and upgrade function.
There are MUCH more to it in the pattern, I just mentioned the bit which was interesting due to your question regarding reporting database.
Basically, when the database load from the app becomes incompatible with the database load for reporting. This could be due to:
Reporting consuming inordinate amount of database server resources impacting the app's DB performance.
A part of this category would be the app DB work having to wait on a majorly slow report query due to locking, though it might be possible to resolve with less drastic methods like locking tuning.
Reporting queries being very incompatible with app queries as far as tuning (e.g. indices but not limited to that) - the most dumb example would be something like a hot spot affecting app inserts because of the reporting-purpose index.
Timing issues. E.g. the only small windows for DB maintenance available (due to application usage) are the times of heavy reporting work
Reporting data's sheer volume (e.g. logging, auditing, statistics) is so big that your primary DB server architecture is a bad solution for such reporting (see Sybase ASE vs. Sybase IQ). BTW, this is a real scenario - we moved our performance reporting to IQ because of this.
I would also add that transactional databases are meant to hold current state and oftentimes do so to be self-maintaining. You don't want transactional databases growing beyond their necessary means. When a workflow or transaction is complete then move that data out and into a Reporting database, which is much better designed to hold historical data.

Swapping out databases?

It seems like the goal of a lot of ORM tools and custom data access layers (DAO pattern, etc.) is to abstract the database to the point where you could supposedly swap out the entire database system with minimal work.
Following the common DAL patterns is usually a good idea in code, but it seems like it would never be minimal work to swap out a database. (Cost, training, data migration, etc.)
Does anyone have any experience with swapping out one database for another in a large system, and dealing with the implications in code? Is it worth it to worry about abstracting the actual database from your code?
Question 1: Does anyone have any experience with
swapping out one database for another
in a large system, and dealing with
the implications in code?
Yes we tried it. Our customer is using a large MS Access based Delphi client server application. After about five years we considered switching to SQL Server. We analyzed the problem and concluded that swapping the database would be very costly and provide only a few advantages. Customer decided not to swap the database. The application is still running fine and the customer is still happy.
Note that:
MS Access is only being used for data storage and report generation.
The server application ensures that MS Access is only being accessed on the server. Normal multi-user MS Access applications will transfer large chunks of the Access database over the network - resulting in slow and unreliable database functionality. This is not the case for this application. Client <> Server <> MS Access. Only the server application communicates with the MS Access database. Actually the Server has exclusive access to the MS Access database. No other computer can open to the MS Access database. Conclusion: MS Access is being used as a true RDBMS, Relational DataBase Management System - please no flaming about MS Access being inferior and unstable - it has been running fine for more than 10 years.
The most important issues you will have to consider:
SQL statements: (SELECT, UPDATE, DELETE, INSERT, CREATE TABLE) and make sure they would be compatible with the SQL database. It's amazing how much all the RDBMS differ in the details (date formats, number formats, search formats, string formats, join syntax, create table syntax, stored procedures, user defined functions, (auto) primary keys, etc.)
Report generation: Depending on your database you might be using a different reporting tool. Our customer has over 200 complex reports. Converting all these reports is very time consuming.
Performance: all RDBMS have different performances in different environments. Normally performance optimalisations are very much RDBMS dependent.
Costs: the costs of tools, developers, server and user licenses varies greatly. It ranges from free to very expensive. Free does not mean cheap and expensive does not always equate to good. A cost/value comparison will have to be made.
Experience: making the best use of your RDBMS requires experience. If you have to develop for an "unknown" RDBMS your productivity will suffer.
Question 2: Is it worth it to worry about
abstracting the actual database from
your code?
Yes. In an ideal world, swapping a database would just be adjusting the data connection string. In the real world this is not possible because all databases are different. They all have tables and SQL support but the differences are in the details. If you can keep the differences of the databases shielded through abstraction - please do so. Make a list of the databases you need to support. Check the selected database systems for the differences. Provide centralized code to handle the differences. Support one RDBMS and provide stubs for future support of other RDBMS.
I disagree that the purpose is to be able to swap out databases, and I think you are correct in showing some suspicion about ORMs leading towards that goal.
However, I would still use an ORM, as it abstracts away the details of data access. Isn't this the goal of object oriented programming? Keep your concerns separated.
I think the primary use case for database abstraction (via ORM tools) is to be able to ship a product that works with multiple database brands. I believe it's a rarer occurrence for a company to switch between database vendors, but that's still one of the use cases.
I've worked jobs where we started out using MySQL for monetary reasons (think a startup) and, one we started making money, wanted to switch to Oracle. We didn't end up making the switch, but it was nice to have the option.
Still, ORM tools are not a completely leak-less abstractions and I know our migration still would have been painful and costly. It totally depends on what you are building, but it has been my experience that -- for performance reasons, usually -- you end up either working around your ORM solution or exploiting vendor-specific features at some point.
The only time I've seen a database switch was from HSQL during early development to Oracle as the project progressed. The ORM made this easy.
I often use the DAO pattern to swap out data services (from a database to web service or to swap a web service to a test stub).
For ORM I don't think the goal is to enable you to switch databases - it is to hide you from the complexities of different database implementations and removing the need to worry about the fine details of translating from relational to object represenations of your data.
By having someone smart write an ORM that handles caching, only updates fields that have changed, groups updates, etc I don't need to. Although in the cases where I need something special I can still revert to SQL if I want.

Resources