Is SQL Server 2014's In-Memory OLTP (Hekaton) the same as, or a similar concept to, Redis?
I use Redis for in-memory storage (storage in RAM) and caching, while having a separate SQL Server database (like StackExchange does). Can Hekaton do the same thing?
They're similar in that both are primarily in-memory, but that's about it.
Redis is an in-memory key-value database. It can persist data to disk if you configure it to, but it keeps the entire dataset in memory, so you need enough RAM to hold it all. The key-value architecture supports various data types, so you can store a value as a simple string, or as a list, set, hash, etc. Basically, most of the data structures you can use inside a programming language are available natively in Redis.
SQL Server Hekaton (In-Memory OLTP) is a new engine designed to run relational tables in memory. All the data for these tables is kept in RAM but also stored to disk so they are fully durable.
Hekaton can take individual tables in a SQL Server database and run them in a separate in-memory engine that uses MVCC (instead of pages and locks) and other optimizations, so operations are dramatically faster than with the traditional disk-based engine. A lot of research went into this, and the primary use case is to take a table that is under heavy load and switch it to run in memory to increase performance and scalability.
Hekaton was not meant to run an entire database in memory (although you can do that if you really want to) but rather as a new engine designed to handle specific cases while keeping the interface the same. Everything to the end-user is identical to the rest of SQL Server: you can use SQL, stored procedures, triggers, indexes, atomic operations with ACID properties and you can work seamlessly with data in both regular and in-memory tables.
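To illustrate that "same interface" point, here is a minimal, hedged T-SQL sketch (table and column names are invented, and the database needs a MEMORY_OPTIMIZED_DATA filegroup before this will run):

```sql
-- Hypothetical memory-optimized table (SQL Server 2014+).
CREATE TABLE dbo.ShoppingCart
(
    CartId     INT       NOT NULL,
    UserId     INT       NOT NULL,
    CreatedUtc DATETIME2 NOT NULL,
    TotalPrice MONEY     NULL,
    CONSTRAINT pk_ShoppingCart
        PRIMARY KEY NONCLUSTERED HASH (CartId) WITH (BUCKET_COUNT = 1000000),
    INDEX ix_ShoppingCart_UserId NONCLUSTERED (UserId)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- Ordinary T-SQL works against it, including joins to disk-based tables
-- (dbo.Users here is assumed to be a regular table).
SELECT c.CartId, c.TotalPrice, u.Email
FROM dbo.ShoppingCart AS c
JOIN dbo.Users        AS u ON u.UserId = c.UserId
WHERE c.UserId = 42;
```

The only visible differences from a normal table are the index declarations and the MEMORY_OPTIMIZED/DURABILITY options; queries, joins, and transactions look the same.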
Because of the performance potential of Hekaton, you can use it to replace Redis if you need the speed and want to model your data within traditional relational tables. If you need the other key-value and data structure features of Redis, you're better off staying with that.
With SQL Server 2016 SP1 and newer, all editions of SQL Server have access to the same programmability features (including In-Memory OLTP); the differences are mainly pricing, support, and capacity limits.
Firstly, you need the Enterprise edition (very expensive) of SQL Server to use Hekaton (In-Memory OLTP). Note that SQL Server is licensed per core, so adding more workload to SQL Server may require more cores and therefore a lot more licence cost.
But unlike Redis, you can have a trigger or stored proc update your “in memory cache” as part of the database transaction. You may also find that Hekaton is fast enough that you don’t need a separate cache layer in front of your main tables.
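As a hedged sketch of that idea (table names are invented; dbo.Product is a regular disk-based table and dbo.ProductCache is assumed to be a memory-optimized table acting as the cache):

```sql
-- The trigger keeps the in-memory cache table in sync inside the same
-- transaction as the original UPDATE, which Redis cannot do.
-- WITH (SNAPSHOT) is needed for interpreted T-SQL against a memory-optimized
-- table unless MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT is ON for the database.
CREATE TRIGGER dbo.trg_Product_SyncCache
ON dbo.Product
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE pc
    SET    pc.Price       = i.Price,
           pc.LastUpdated = SYSUTCDATETIME()
    FROM   dbo.ProductCache AS pc WITH (SNAPSHOT)
    JOIN   inserted         AS i ON i.ProductId = pc.ProductId;
END;
```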
So yes, Hekaton can do the same job as Redis, but it is unlikely to be sensible to use it that way unless the licensing cost is not an issue for you.
Hekaton comes into its own when it allows you to process a lot more data without having to invest in the programming cost of re-designing your system to make use of caching with Redis or otherwise.
I am an ASP.NET MVC/SQL Server developer and I am very new to all of this, so I may be on a completely wrong path.
I came to know by googling that Snowflake can put/get data from AWS S3, Google Cloud Storage and Azure. And Snowflake has its own databases and tables as well.
I have the following questions:
Why should one use Snowflake when you can process your data with cloud storage (S3, etc.) and Talend or any other ETL tool?
Can we use Snowflake as the database for a data-driven web application? And if yes, could you provide a link or something to get started?
Once again, I am very new to all of this and hoping to get ideas and the best way to work around it.
Thank you in advance.
Why should one use Snowflake when you can process your data with cloud storage (S3, etc.) and Talend or any other ETL tool?
You're talking about three different classes of technology product there, which are not equivalent:
Snowflake is a database platform. Like other database technologies, it provides data storage, metadata, and a SQL interface for data manipulation and management.
AWS S3 (and similar products) provides scalable cloud storage for files of any kind. You generally need to implement an additional technology such as Spark, Presto, or Amazon Athena to query data stored as files in cloud storage. Snowflake can also make use of data files in cloud storage, either querying the files directly as an "external table" or using a COPY statement to load the data into Snowflake itself.
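As a rough sketch of those two options in Snowflake SQL (bucket, stage, and table names are made up; authentication via a storage integration or credentials is omitted):

```sql
-- External stage over files already sitting in S3.
CREATE OR REPLACE STAGE raw_events_stage
  URL = 's3://my-bucket/events/'
  FILE_FORMAT = (TYPE = PARQUET);

-- Option 1: query the files in place (no load), similar to an external table.
SELECT $1:event_id::STRING        AS event_id,
       $1:event_ts::TIMESTAMP_NTZ AS event_ts
FROM @raw_events_stage
LIMIT 10;

-- Option 2: load the files into a native Snowflake table for faster queries.
COPY INTO analytics.events
FROM @raw_events_stage
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```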
Talend and other ETL or data integration tools are used to move data between source and target platforms. Usually this will be from a line of business application, such as an ERP system, to a data warehouse or data lake.
So you need to think about three things when considering Snowflake:
Where is your analytical data going to be stored? Is it going to be files in cloud storage, loaded into a database or a mix of both? There are advantages and disadvantages to each scenario.
How do you want to query the data? It's fairly likely you'll want something that supports the use of SQL queries, as mentioned above there are numerous technologies that support SQL on files in cloud storage. Query performance will generally be significantly better if the data is loaded into a dedicated analytical database though.
How will the data get from the data sources to the analytical data repository, whatever that may be? Typically this will involve either a third party ETL tool, or rolling your own solution (which can be a cheaper option initially but can become a significant management and support overhead).
Can we use Snowflake as the database for a data-driven web application?
The answer to that is yes, in theory. It very much depends on what your web application does, because Snowflake is a database designed for analytics, i.e. crunching through large amounts of data to find answers to questions. It's not designed as a transactional database for a system that involves lots of small inserts and updates. For example, Snowflake doesn't enforce referential integrity constraints.
However, if your web application is an analytical one (for example it has embedded reports that would query a large amount of data and users will typically be reading data and not adding it) then you could use Snowflake as a backend for the analytical part, although you would probably still want a traditional database to manage data for things like users and sessions.
You can connect your web application to Snowflake with one of the connectors, like https://docs.snowflake.com/en/user-guide/odbc.html
Snowflake excels for large analytic workloads that are difficult to scale and tune. If, for example, you have many (millions/billions) of events that you want to aggregate into dashboards, then Snowflake might be a good fit.
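For example, a typical dashboard-style rollup over a hypothetical events table (names are illustrative):

```sql
-- Aggregate billions of raw events into daily dashboard numbers.
SELECT DATE_TRUNC('day', event_ts) AS event_day,
       event_type,
       COUNT(*)                    AS event_count,
       COUNT(DISTINCT user_id)     AS unique_users
FROM analytics.events
WHERE event_ts >= DATEADD(month, -3, CURRENT_DATE)
GROUP BY 1, 2
ORDER BY 1, 2;
```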
I agree with much of what Nathan said. To add to that, in my experience every time I've created a database for an application it's been with an OLTP database like PostgreSQL, Azure SQL Database, or SQL Server.
One big problem with MPP/distributed databases is that they don't enforce referential integrity, so if that's important to you then you don't want to use them.
Snowflake and other MPP/distributed databases are NOT meant for OLTP workloads but for OLAP workloads. No matter what snake oil companies like Databricks and Snowflake try to sell you, MPP/distributed databases are NOT meant for OLTP. The costs alone would be tremendous, even with auto-scaling.
If you think about it, Databricks, Snowflake, etc. have a limit to how much they want to optimize their platforms, because the longer a query runs the more money they make. To make money they have to optimize performance, but not too much, otherwise it will affect their income.
This can be an in-depth topic, so I would recommend doing more research into OLTP vs. OLAP.
Enforcing referential integrity is a double-edged sword: the downside is that as the data volume grows, the referential-violation check significantly slows down inserts and deletes. This often results in the developer putting the RI check in the application code (with a dirty read) and turning off RI enforcement in the database, ending up in a Snowflake-like situation anyway.
The bottom line is that Snowflake not enforcing RI should not, by itself, be a limitation for OLTP applications.
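For what it's worth, the application-level pattern described above looks roughly like this (hypothetical tables; note it is not race-free without extra locking, which is part of the trade-off):

```sql
-- Instead of relying on a FOREIGN KEY, the application verifies that the
-- parent row exists as part of the insert.
INSERT INTO orders (order_id, customer_id, amount)
SELECT 1001, 42, 99.95
WHERE EXISTS (SELECT 1 FROM customers WHERE customer_id = 42);
```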
Could someone get into the technicals as to why NoSQL equivalents of MySQL joins are so much more expensive than SQL ones?
I'm not sure it is more expensive.
Joins in a non-distributed system are straightforward. The data's on the same server, and you simply go and find it.
In a distributed system - which just about every NoSQL is - the data might be local, or it might be remote. You have to do at least a calculation to determine the location of the data, then get the data from the other servers.
It is a hard problem (maybe even a fool's errand) to optimize the cluster so that associated data lives on the "same server", because what counts as "associated" changes over time. Some distributed NoSQL databases, such as my company's Aerospike database, make a point of randomly distributing data so joins all have equal cost, even if that is slower. Some cases, like small, well-known tables with commonly read and infrequently written data, could easily be replicated to every server, but few NoSQL databases allow that (currently).
To create a comparison, you'll have to create a distributed SQL cluster, using your favorite distributed SQL technology (such as MySQL Cluster), and compare client join patterns vs the overhead of SQL parsing, optimizer runs, and then fetching from different parts of the cluster.
I've never seen such a benchmark. Most NoSQL use patterns eschew the normalize-and-join SQL patterns, so it is not well tested.
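To make the comparison concrete, here is a hedged sketch of the two patterns using invented tables:

```sql
-- Server-side join: one statement, one round trip, the optimizer does the work.
SELECT o.order_id, c.name
FROM orders    AS o
JOIN customers AS c ON c.customer_id = o.customer_id
WHERE o.order_id = 1001;

-- The "client join" pattern typical of key-value stores, written as SQL for
-- comparison: two keyed reads that the application stitches together itself,
-- each of which may land on a different server in a distributed system.
SELECT order_id, customer_id, amount FROM orders    WHERE order_id    = 1001;
SELECT customer_id, name             FROM customers WHERE customer_id = 42; -- value taken from the first result
```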
We have a medium-sized web application (multiple instances), querying against a single SQL Server 2014 database.
It is not the most robust architecture: no clustering/failover, and we have been getting a few deadlocks recently.
I'm looking at how I can improve the performance and availability of the database, reduce these deadlocks, and have a better backup/failover strategy.
I'm not a DBA, so looking for some advice here.
We currently have the following application architecture:
Multiple web servers reading and writing to a single SQL Server DB
Multiple background services reading and writing to the same single SQL Server DB
I'm contemplating making the following changes:
Split the single DB into two DBs, one read-only and another read-write. The read-write DB replicates the data to the read-only DB using SQL Server replication
Web servers connect to the given DB depending on the operation.
Background services connect to the read-write DB (most of the writes happen here)
Most of the DB queries on the web servers are reads (and a lot of the writes can be offloaded to the background services), so that's the reason for my thoughts here.
I could then also potentially add clustering to the read-only databases.
Is this a good SQL Server database architecture? Or would the DBAs out there simply suggest a clustering approach?
Goals: performance, scalability, reliability
Without more specific details about your server, it's tough to give you specific advice (for example, what's a medium-sized web application? what are the specs on your database server? What's your I/O latency like? CPU contention? Memory utilization?)
At a high level of abstraction, deadlocks usually occur because of two reasons:
Your reads are too slow, and
Your writes are too slow.
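Before deciding which of those applies, it's worth pulling the deadlock graphs SQL Server already records in the default system_health Extended Events session. This is a common diagnostic query (file paths and retention vary by instance):

```sql
-- Read recent deadlock reports from the system_health session's event files.
WITH deadlocks AS
(
    SELECT CAST(event_data AS XML) AS event_xml
    FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
    WHERE object_name = 'xml_deadlock_report'
)
SELECT event_xml.value('(event/@timestamp)[1]', 'datetime2') AS deadlock_time_utc,
       event_xml.query('event/data/value/deadlock')          AS deadlock_graph
FROM deadlocks
ORDER BY deadlock_time_utc DESC;
```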
There are lots of ways to address both of those issues, but in general:
You can cover a lot of coding sins with good hardware, and
Don't re-architect a solution until you've pursued performance tuning options (including indexing strategies and/or procedure rewrites).
Clustering is generally considered to be used as a strategy for High Availability/Disaster Recovery, not performance augmentation (there are always exceptions).
We have a Delphi application which can connect to either Oracle or SQL Server. We use Devart components to connect to the databases, and everything is very generic when it comes to database access, i.e. we use the lowest common denominator. Ultimately we use the databases as data stores and do not use any of the more "advanced" features which may be specific to the database.
However, we have a serious performance issue with Oracle. It is to do with inserting data. I know that inserting data by running off a load of individual insert statements is not great for performance, but due to some business logic that needs to be applied to the raw data before it gets uploaded to the database, we are somewhat restricted to multiple inserts. To give an idea of the performance difference: a recent test we did inserts 1000 items into our database and takes 5 minutes in SQL Server (acceptable) but 44 minutes in Oracle.
Is there anything we could do to improve performance? The inserting of data needs to be done by the user and NOT an Oracle DBA, so requiring absolutely no Oracle skills is one of the prerequisites for any solution. Basically, the users need to press a button and everything is done.
Edit: the business logic happens before the insert (although there is a little going on during the actual insert), so a more realistic comparison would be 2 minutes for SQL Server and 40 or so minutes for Oracle. Bear in mind we are inserting a few large BLOBs per record, so perhaps that explains the slowish performance, but not why there is such a difference. The 1000 items are part of a single transaction.
Oracle supports array DML, which can speed up inserts considerably. Also, if BLOBs are involved, performance may depend on caching settings and on how the BLOBs are set up in the destination table. Some tuning of the DB client parameters may also be beneficial to increase network throughput.
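The same batching idea that Devart-style components expose as array DML can be sketched server-side in PL/SQL (table and column names are made up): instead of 1000 single-row round trips, the rows are bound as collections and inserted with one statement per batch.

```sql
DECLARE
  -- Collections holding the pre-processed rows (illustrative table "items").
  TYPE t_id_tab      IS TABLE OF items.id%TYPE      INDEX BY PLS_INTEGER;
  TYPE t_payload_tab IS TABLE OF items.payload%TYPE INDEX BY PLS_INTEGER;
  l_ids      t_id_tab;
  l_payloads t_payload_tab;
BEGIN
  -- ... populate l_ids / l_payloads from the prepared business data ...

  -- One bulk-bound INSERT instead of 1000 individual statements.
  FORALL i IN 1 .. l_ids.COUNT
    INSERT INTO items (id, payload)
    VALUES (l_ids(i), l_payloads(i));

  COMMIT;
END;
/
```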
Anyway, without knowing which version of Oracle you're using, how it is configured, your table definitions (and their tablespaces), how large the BLOBs are, and the SQL actually used (did you trace it?), it's very difficult to diagnose the real problem.
Oracle has some powerful diagnostic tools to identify bottlenecks, but they may not be easy to use and require knowing enough about how Oracle works. From the Enterprise Manager Console you can access some of them in a more readable format; did you check it?
Update, because I can't comment on other answers: Oracle supports different types of LOB storage:
LOBs stored in the database (under transaction management)
BFILEs: LOBs kept in the external file system, yet still managed by Oracle (the LOB data is not under transaction control)
SecureFiles (11g onwards): a newer storage format for in-database LOBs, with full transaction support and extra features such as compression, deduplication and encryption
Oracle is designed for and can manage large LOBs; it just needs to be configured properly. Parameters that affect LOB performance include:
ENABLE/DISABLE STORAGE IN ROW
CACHE/NOCACHE/CACHE READS
LOGGING/NOLOGGING
CHUNK
PCTVERSION/RETENTION (especially for updates and deletes)
TABLESPACE (usually, a dedicated tablespace for lobs is advisable)
These parameters need to be set taking into account the average LOB size, how the LOBs are accessed, and how often they are modified. There's no "one size fits all".
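A hedged example of how some of those parameters appear in the DDL (11g+ SecureFiles syntax, with an invented table and tablespace; the right settings depend entirely on the points above):

```sql
CREATE TABLE documents
(
    doc_id  NUMBER PRIMARY KEY,
    payload BLOB
)
LOB (payload) STORE AS SECUREFILE payload_lob
(
    TABLESPACE lob_ts        -- dedicated tablespace for the LOB segment
    DISABLE STORAGE IN ROW   -- keep large LOB data out of the row itself
    CACHE READS              -- cache reads, write around the buffer cache
    NOLOGGING                -- less redo, faster inserts, reduced recoverability
    CHUNK 32768
);
```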
But there is also the client side: OCI can buffer LOBs on the client, so small read/write operations are cached, minimizing the number of network round trips and LOB versioning; whether that is used is up to the OCI wrapper you're using.
Array DML (only available with FireDAC, ODAC, DOA and our SynDbOracle unit, AFAIK) won't change much if your problem is about BLOB transfer.
First idea is to compress the data before transmission.
Try several access libraries. Our open-source SynDBOracle accesses the oci.dll client directly and may be slightly faster.
But perhaps the problem is on the server side. Oracle does not like transactions involving huge amounts of data, since they tend to overflow its redo and undo logs. Try tuning the redo log configuration for this kind of load.
IMHO an RDBMS is not the best option for storing huge blobs. Plain files, indexed via an RDBMS for the metadata, are usually better. Or switch to storage designed for big blobs, like a key/value store or MongoDB's blob API (GridFS).
Remember that both Oracle and MSSQL ask for money roughly in proportion to the data size...
Just for end-of-day data there will be billions of rows. What is the best way to store all that data? Is SQL Server 2008 good enough for that, or should I look towards a NoSQL solution like MongoDB? Any suggestions?
It would be great to have one master DB with read/write permissions and one or more replicas of it for read-only operations. Only the master database would be used for adding new prices to the storage. It would also be nice to be able to replicate OHLC prices for the most popular securities individually, in order to optimize read access.
This data then will be streamed to a trading platform on clients' machines.
You should consider Oracle Berkeley DB which is in production doing this within the infrastructure of a few well known stock exchanges. Berkeley DB will allow you to record information at a master as simple key/value pairs, in your case I'd imagine a timestamp for the key and an encoded OHLC set for the value. Berkeley DB supports single master multi-replica replication (called "HA" for High Availability) to support exactly what you've outlined - read scalability. Berkeley DB HA will automatically fail-over to a new master if/when necessary. Using some simple compression and other basic features of Berkeley DB you'll be able to meet your scalability and data volume targets (billions of rows, tens of thousands of transactions per second - depending on your hardware, OS, and configuration of BDB - see the 3n+1 benchmark with BDB for help) without issue.
When you start working on accessing that OHLC data, consider Berkeley DB's support for bulk get, and make sure you use the B-tree access method (because your data has order, and locality will give much faster access). Also consider the Berkeley DB partitioning API to split your data (perhaps based on symbol, or even on time). Finally, because you'll be replicating the data, you can relax the durability constraint to DB_TXN_WRITE_NOSYNC as long as your replication acknowledgement policy requires a quorum of replicas to ACK a write before it is considered durable. You'll find that a fast network beats a fast disk in this case. Also, to offload some work from your master, enable peer-to-peer log replica distribution.
But, first read the replication manager getting started guide and review the rep quote example - which already implements some of what you're trying to do (handy, eh?).
Just for the record, full disclosure: I work as a product manager at Oracle on the Berkeley DB products, and have for the past nine years, so I'm a tad biased. I'd guess that the other solutions, SQL-based or not, might eventually give you a working system, but I'm confident that Berkeley DB can do it without too much effort.
If you're really talking billions of new rows a day (Federal Express' data warehouse isn't that large), then you need an SQL database that can partition across multiple computers, like Oracle or IBM's DB2.
Another alternative would be a heavy-duty system managed storage like IBM's DFSMS.