Database Design for continuous data stream

I am currently developing a tool where one client A can send a continuous data stream (just text) to a server and another client B should be able to watch the data stream in real time, i.e. fetch the same data back from the server.
Of course the server should not send all of the available data to client B, since it can be a lot of text, so I am currently thinking about how to design it so that client B only fetches the newest data.
My first approach was to do it similarly to pagination: client B sends an additional attribute client_lines = 10 to the server indicating how many lines of data it already possesses, and then we can query our database with where lines > client_lines.
But the database can grow quite large, since we would have one database for many users, each sending data that can contain a lot of text lines. So querying the complete database, holding data from different users, does not seem like a very efficient solution.
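Roughly what I had in mind for the fetch (just a sketch using the Node MongoDB driver; the collection and field names are placeholders), relying on a compound index on (streamId, line) so only the new lines of that one stream are touched:

```typescript
// Sketch only: fetch the lines client B does not have yet.
// A compound index on { streamId: 1, line: 1 } keeps this from scanning
// other users' data.
import { MongoClient } from "mongodb";

async function fetchNewLines(streamId: string, clientLines: number) {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  try {
    return await client
      .db("streams")
      .collection("stream_lines")
      .find({ streamId, line: { $gt: clientLines } })
      .sort({ line: 1 })
      .toArray();
  } finally {
    await client.close();
  }
}
```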
Is there any smarter approach? Maybe using a NoSQL database like MongoDB?

You are looking for a Topic, a pub-sub implementation in which multiple subscribers can receive messages that are published, and consumers may consume just the incremental bits. You can find good implementations of this in products like ActiveMQ, JMS, Kafka, Amazon SNS, Kinesis, and many more. It is occasionally implemented in a relational database, but it is rarely implemented well in a relational database. You are generally far better off using a dedicated solution.
Note that often a database will subscribe to the topic in order to receive updates and bridge them to the relational model.
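For instance, with Kafka each consumer group tracks its own offset, so a subscriber only receives messages it has not consumed yet. A minimal sketch using the kafkajs client (topic name, broker address and group id are placeholders, not anything prescribed above):

```typescript
// Sketch: the consumer group remembers its offset, so client B only
// receives text lines it has not seen before.
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "client-b", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "client-b-group" });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: "text-stream", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      console.log(message.value?.toString()); // newest text lines only
    },
  });
}

run().catch(console.error);
```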

Related

Transfer data between NoSQL and SQL databases on different servers

Currently, I'm working on a MERN web application that'll need to communicate with a Microsoft SQL Server database on a different server but on the same network.
Data will only be "transferred" from the Mongo database to the MSSQL one based on a user action. I think I can accomplish this by simply transforming the data to be transferred into the appropriate format on my Express server and connecting to MSSQL via the matching API.
On the flip side, data will be transferred from the MSSQL database to the Mongo one when a certain field is updated in a record. I think I can accomplish this with a Trigger, but I'm not exactly sure how.
Do either of these solutions sound reasonable, or are there better/industry-standard methods that I should be employing? Any and all help is much appreciated!
There are (in general) two ways of doing this.
If the data transfer needs to happen immediately, you may be able to use triggers to accomplish this, although be aware of your error handling.
The other option is to develop some form of worker process in your favourite scripting language and run it on a schedule. (This would be my preferred option, as my personal familiarity with triggers is fairly limited.) If option 1 isn't viable, you could set your schedule to be very frequent, say once per minute or every x seconds, as long as a new task doesn't spawn before the previous one has completed.
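As a very rough sketch of such a worker (all names, the connection details and the UpdatedAt change-tracking column are assumptions; your real schema will differ), using the node mssql and mongodb drivers:

```typescript
// Sketch of a polling worker: every minute, copy rows changed since the
// last run from MSSQL into MongoDB. Assumes an UpdatedAt column to poll on.
import sql from "mssql";
import { MongoClient } from "mongodb";

const mongo = new MongoClient("mongodb://localhost:27017");
let lastRun = new Date(0);

async function syncOnce(pool: sql.ConnectionPool) {
  const changed = await pool
    .request()
    .input("since", sql.DateTime, lastRun)
    .query("SELECT Id, Name, Status, UpdatedAt FROM dbo.Records WHERE UpdatedAt > @since");

  const records = mongo.db("app").collection("records");
  for (const row of changed.recordset) {
    // Upsert so repeated runs stay idempotent.
    await records.updateOne({ _id: row.Id }, { $set: row }, { upsert: true });
  }
  lastRun = new Date();
}

async function main() {
  await mongo.connect();
  const pool = await sql.connect("Server=mssql-host;Database=app;User Id=worker;Password=secret");
  let busy = false;
  // Run once a minute; skip a tick if the previous run has not finished.
  setInterval(async () => {
    if (busy) return;
    busy = true;
    try { await syncOnce(pool); } catch (err) { console.error(err); } finally { busy = false; }
  }, 60_000);
}

main().catch(console.error);
```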
The broader question, though, is: do you need to have data duplicated across two different sources? The obvious pitfall with this approach is consistency; should anything fail, you can end up with two data sources wildly out of sync with each other, and your approach will have to account for this.

Database Bottleneck In Distributed Application

I hear about SOA and distributed applications everywhere now. I would like to know about some best practices for keeping a single data source responsive, or, in case you have a copy of the data on every server, how best to synchronise those databases to keep them up to date.
There are many answers to this question and in order to choose the most appropriate solution, you need to carefully consider what kind of data you are storing and what you want to do with it.
Replication
This is the traditional mechanism for many RDBMS, and normally relies on features provided by the RDBMS. Replication has a latency, which means that although servers can handle load independently, they may not necessarily be reading the latest data. This may or may not be a problem for a particular system.
When replication is bidirectional, simultaneous changes on two databases can lead to conflicts that need resolving somehow. Depending on your data, the choice might be easy (i.e. audit log => append both) or difficult (i.e. hotel room booking - cancel one? select an alternative hotel?). You also have to consider what to do in the event that the replication network link is down (i.e. do you deny updates on both databases, on one database, or allow the databases to diverge and sort out the conflicts later). This is all dependent on the exact type of data you have. One possible compromise, for read-heavy systems, is to use unidirectional replication to many databases for reading, and send all write operations to the source database. This is always a trade-off between Availability and Consistency (see CAP Theorem).
The advantage of RDBMS and replication is that you can easily query your entire dataset in complex ways and have greater opportunity to remove duplication by using relational links to data items.
Sharding
If your data can be cleanly partitioned into disjoint subsets (e.g. different customers), such that all possible relational links between data items are contained within each subset (e.g. customers -> orders), then you can put each subset in a separate database. This is the principle behind NoSQL databases, or as Martin Fowler calls them, 'Aggregate-Oriented Databases'. The downside of this approach is that it requires more work to run queries over your entire dataset, as you have to query all your databases and then combine the results (e.g. map-reduce). Another disadvantage is that in separating your data you may need to duplicate some of it (e.g. sharding by customers -> orders might mean product data is duplicated). It is also hard to manage the data schema, as it lives independently on multiple databases, which is why most NoSQL databases are schema-less.
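As a trivial illustration of routing by the aggregate key (the shard layout and hashing here are purely hypothetical):

```typescript
// Sketch: pick the database that owns a customer's aggregate. All of that
// customer's orders live in the same shard, so single-customer queries hit
// one database; cross-customer queries must fan out to every shard.
const shards = ["mongodb://shard-0", "mongodb://shard-1", "mongodb://shard-2"];

function shardFor(customerId: string): string {
  // Simple hash of the customer id onto the shard list.
  let hash = 0;
  for (const ch of customerId) hash = (hash * 31 + ch.charCodeAt(0)) | 0;
  return shards[Math.abs(hash) % shards.length];
}

// ordersFor("cust-42") would connect to shardFor("cust-42") and query there;
// "total revenue across all customers" must query every shard and combine.
```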
Database-per-service
In the microservice approach, it is advised that each microservice should have its own dedicated database, which is not allowed to be accessed by any other microservice (of a different type). Hence, a microservice that manages customer contact information stores the data in a separate database from the microservice that manages customer orders. Links can be made between the databases using globally unique ids, or URIs (especially if the microservices are RESTful), etc. The downside, again, is that it is even harder to perform complex queries on the entire dataset (especially since all access should go via the microservice API, not directly to the databases).
Polyglot storage
So many of my projects in the past have involved a single RDBMS in which all data was placed. Some of this data was well suited to the relational model; much of it was not. For example, hierarchical data might be better stored in a graph database, stock ticks in a column-oriented database, HTML templates in a NoSQL database. The trend with microservices is to move towards a model where different parts of your dataset are placed in storage providers that are chosen according to need.
If you are thinking of keeping a separate copy of the database for each microservice and you want to achieve eventual consistency, you can use Kafka Connect. Briefly, Kafka Connect watches your databases; whenever there are any changes, it reads the change log and publishes the logged events as messages to a queue (topic), and the databases that subscribe to this queue can then apply the same changes on their side.
Kafka Connect isn't the only framework; you can search for and find other frameworks or applications that provide the same kind of implementation.
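As a rough sketch of what the consuming side of such a pipeline might look like (assuming the change events arrive as JSON messages on a topic; the event shape and all names are invented for illustration):

```typescript
// Sketch: apply change-data-capture events from a topic to this service's
// own copy of the data. The { op, id, doc } event format is hypothetical.
import { Kafka } from "kafkajs";
import { MongoClient } from "mongodb";

const kafka = new Kafka({ clientId: "orders-service", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "orders-service" });
const mongo = new MongoClient("mongodb://localhost:27017");

async function run() {
  await mongo.connect();
  await consumer.connect();
  await consumer.subscribe({ topic: "customers-changes", fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value!.toString());
      const customers = mongo.db("orders").collection("customers");
      if (event.op === "delete") {
        await customers.deleteOne({ _id: event.id });
      } else {
        // Insert or update: eventual consistency with the source database.
        await customers.updateOne({ _id: event.id }, { $set: event.doc }, { upsert: true });
      }
    },
  });
}

run().catch(console.error);
```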

Best high-performance database for a simple read/write (no update) scenario

I'm interested in opinions on what database system to select for this project, where I basically need to persist a constant stream of messages at potentially high speed. There are basically four types of messages with some commonalities. No relations needed. I guess you could call it an event store.
I will need to read (query by a non-unique key), but I don't need to update any data. I will have to delete old data though.
Considerations:
Database must be able to scale out
Performance is crucial, as well as up-time (a system allowing live updates would be nice)
Preferably something running on Windows Server, but this is not a requirement
I'm familiar with document databases (MongoDB), but don't know what other kinds of NoSQL solutions would fit my problem, or how they compare.
MongoDB would be ideal. But if all you want to do is read from the stream and serve up content, then more than the database choice (use any db - MySQL, Access, SQL Server Express, XML files), I would suggest you look at putting all your data in memory (maybe at app startup) and then serving data from memory.
You should also look at some caching solutions like Memcached (http://memcached.org/)
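A sketch of the serve-from-memory idea (the types and names are invented; the point is only that reads never touch the database):

```typescript
// Sketch: load events once at startup, index them in memory by the
// non-unique query key, and serve all reads from that map.
type StoredEvent = { id: string; key: string; payload: string; createdAt: Date };

const byKey = new Map<string, StoredEvent[]>();

function loadAtStartup(allEvents: StoredEvent[]) {
  for (const e of allEvents) {
    const bucket = byKey.get(e.key) ?? [];
    bucket.push(e);
    byKey.set(e.key, bucket);
  }
}

function query(key: string): StoredEvent[] {
  return byKey.get(key) ?? []; // no database round-trip on the read path
}
```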

What is couchdb, for what and how should I use it?

I hear a lot about CouchDB, but after reading some documents about it, I still don't get why I would use it and how.
Could you clarify this mystery for me?
It's a non-relational database, open-source, distributed (incremental, bidirectional replication), schema-free. A CouchDB database is a collection of documents; each document is a bunch of string "keys" and corresponding "values" (which can be numbers, strings, lists, dates, ...). You can have indices, queries, views.
If a relational DB feels confining to you (you find schemas too rigid, you can't spread the DB engine work around a very large number of servers, etc.), CouchDB is worth considering (it's one of the most interesting of the many non-relational DBs that are emerging these days).
But if all of your work happily fits in a relational database, that's what you probably want to continue using for production work (even though "playing around" with some non-relational DB is still well worth your time, just for personal growth and edification, that's quite different from transferring huge production systems over from a relational DB!-).
It sounds like you should be reading Why CouchDB
To quote from wikipedia
It is not a relational database management system. Instead of storing data in rows and columns, the database manages a collection of JSON documents. The documents in a collection need not share a schema, but retain query abilities via views.
CouchDB provides a different model for data storage than a traditional relational database, in that it does not represent data as rows within tables; instead it stores data as "documents" in JSON format.
This difference in data storage model is what differentiates CouchDB from products like MySQL and SQL Server.
In terms of programmatic access to CouchDB, it exposes a REST API which you can access by sending HTTP requests from your code.
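For example, creating a document and reading it back is just two HTTP calls (the database name and document id below are arbitrary):

```typescript
// Sketch: talk to CouchDB directly over HTTP. PUT creates/updates a
// document at a known id; GET reads it back as JSON.
const base = "http://localhost:5984/mydb";

async function demo() {
  await fetch(`${base}/person-001`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ name: "Ada", email: "ada@example.com" }),
  });

  const res = await fetch(`${base}/person-001`);
  const doc = await res.json(); // includes _id and _rev alongside the fields
  console.log(doc);
}

demo().catch(console.error);
```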
I hope this has been somewhat helpful, though I acknowledge it may not be, given my minimal familiarity with the product.
I'm far from an expert (all I've done is play around with it some...) but here's how I'm thinking of using it:
Usually when I'm designing an app I've got a bunch of app servers behind a load balancer. Oftentimes, I've got sticky sessions so that each user will go back to the same app server during that session. What I'm thinking of doing is having a CouchDB instance tied to each app server.
That way you can use that local CouchDB to access user preferences, product data... whatever data you've got that doesn't have to be perfectly up to date.
So... now you've got data on these local CouchDBs. CouchDB allows replication. So, every fixed time period (every X seconds?), merge the data back into its peers to keep them up to date.
On the whole you shouldn't have to worry about conflicts, because each app server has its own CouchDB and users are attached to that app server; and you've got eventual consistency because you've got replication.
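Kicking off that periodic merge is just another HTTP call to CouchDB's _replicate endpoint (host and database names are placeholders):

```typescript
// Sketch: ask one CouchDB node to replicate its local database to a peer.
// Running this in both directions on a timer gives the eventual consistency
// described above.
async function replicateToPeer() {
  await fetch("http://appserver-1:5984/_replicate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      source: "userdata",
      target: "http://appserver-2:5984/userdata",
    }),
  });
}

setInterval(() => replicateToPeer().catch(console.error), 30_000);
```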
Does that answer your question?
A good example is when you, say, have to deal with people data in either a website or an application. If you set off wishing to design the data and keep the individuals' information separate, that makes a good case for CouchDB, which stores data in documents rather than relational tables. In a production deployment, my users may end up adding ad-hoc data about 10% of the people and some other funny details for another selected 5%. In a relational context, this could add up to loads of redundancy, but not for CouchDB.
And it's not just about the fact that CouchDB is non-relational: if you're too focused on that, you're missing the point. CouchDB is plugged into the web; all you need to start with is HTTP for creating documents and making queries (GET/PUT/POST/DELETE...), and it's RESTful, plus the fact that it's portable and great for peer-to-peer sharing. It can also serve up web applications in what are termed 'CouchApps', where CouchDB holds the images, CSS and markup as data stored under special documents called design documents.
Check out this collection of videos introducing non-relational databases, the one on CouchDB should give you a better idea.

What is the best approach for decoupled database design in terms of data sharing?

I have a series of Oracle databases that need to access each other's data. The most efficient way to do this is to use database links - setting up a few database links I can get data from A to B with the minimum of fuss. The problem for me is that you end up with a tightly-coupled design and if one database goes down it can bring the coupled databases with it (or perhaps part of an application on those databases).
What alternative approaches have you tried for sharing data between Oracle databases?
Update after a couple of responses...
I wasn't thinking so much about replication, more about accessing "master data". For example, if I have a central database with currency conversion rates and I want to pull a rate into a separate database (application). For such a small dataset, igor-db's suggestion of materialized views over DB links would work beautifully. However, when you are dynamically sampling from a very large dataset, the option of local caching starts to become trickier. What options would you go for in these circumstances? I wondered about an XML service, but tuinstoel (in a comment to le dorfier's reply) rightly questioned the overhead involved.
Summary of responses...
On the whole I think igor-db is closest, which is why I've accepted that answer, but I thought I'd add a little to bring out some of the other answers.
For my purposes, where I'm looking at data replication only, it looks like Oracle BASIC replication (as opposed to ADVANCED replication) is the one for me. Using materialized view logs on the master site and materialized views on the snapshot site looks like an excellent way forward.
Where this isn't an option, perhaps where the data volumes make full table replication an issue, a messaging solution seems the most appropriate Oracle solution. Oracle Advanced Queuing seems the quickest and easiest way to set up a messaging solution.
The least preferable approach seems to be rolling your own XML web services, to be used only where the relative ease of Advanced Queuing isn't an option.
Streams is the Oracle replication technology.
You can use MVs over database links (so database 'A' has a materialized view of the data from database 'B'. If 'B' goes down, the MV can't be refreshed but the data is still in 'A').
Mileage may depend on DB volumes, change volumes...
It looks to me like it's by definition tightly coupled if you need simultaneous synchronous access to multiple databases.
If this is about transferring data, for instance, and it can be asynchronous, you can install a message queue between the two and have two processes, with one reading from the source and the other writing to the sink.
The OP has provided more information. He states that the dataset is very large. Well how large is large? And how often are the master tables changed?
With the use of materialized view logs Oracle will only propagate the changes made in the master table. A complete refresh of the data isn't necessary. Oracle streams also only communicate the modifications to the other side.
Buying storage is cheap, so why not local caching? Much cheaper than programming your own solutions.
An XML service doesn't help you when its database is not available, so I don't understand how it would help. Oracle has many options for replication; explore them.
edit
I've built XML services. They provide interoperability between different systems with a clear interface (contract). You can build an XML service in C# and consume the service from Java. However, XML services are not fast.
Why not use Advanced Queuing? Why roll your own XML service to move messages (DML) between Oracle instances? It's already there. You can have propagation move messages from one instance to another when they are both up. You can process them as needed on the destination servers. AQ is really rather simple to set up and use.
Why do they need to be separate databases?
Having a single database/instance with multiple schemas might be easier.
Keeping one database up (with appropriate standby databases etc) will be easier than keeping N up.
What kind of immediacy do you need and how much bi-directionality? If the data can be a little older and can be pulled from one "master source", create a series of simple ETL scripts run on a schedule to pull the data from the "source" database into the others.
You can then tailor the structure of the data to feed the needs of the client database(s) more precisely and you can change the structure of the source data until you're blue in the face.
