State Backend for Flink tables - apache-flink

Can I checkpoint and store dynamic tables in Apache Flink in RocksDB as a persistent backend?
If so, can I have 20+ GB there?

Flink SQL will store whatever state is needed to satisfy the query being executed in the configured state backend (which can be RocksDB). There's no problem having 20+ GB there. (Some users have tens of TB.)
But keep in mind that you cannot directly access this state. You will need to send the results of your query to an external sink.
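For example, RocksDB can be selected in flink-conf.yaml, and incremental checkpoints are worth enabling for state of this size. A minimal sketch (the checkpoint directory is a placeholder):

state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: hdfs:///flink/checkpoints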

Related

Memgraph is an in-memory database. Does that mean that data is lost when I shut down the computer?

Memgraph is an in-memory database. Does that mean that data is lost when I shut down the computer?
Do I need to use the GQLAlchemy library as an on-disk storage solution to ensure data persistence?
The data is not lost when you shut down your computer.
Memgraph uses two mechanisms to ensure data durability:
write-ahead logs (WAL)
periodic snapshots
Snapshots are taken periodically during the entire runtime of Memgraph, based on the interval defined with the --storage-snapshot-interval-sec configuration flag. The whole data storage is written to disk when a snapshot is triggered. Write-ahead logs save all database modifications to a file as they happen.
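For example, to take a snapshot every five minutes, you could start Memgraph with (the interval value is illustrative):

memgraph --storage-snapshot-interval-sec=300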
If you want to generate a snapshot for the current database state instantly, use the following query:
CREATE SNAPSHOT;
If you are using Docker, check how to specify volumes for data persistence.
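A minimal sketch, assuming the default image layout where Memgraph keeps its data under /var/lib/memgraph (the volume name is illustrative):

docker run -p 7687:7687 -v mg_lib:/var/lib/memgraph memgraph/memgraph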
Therefore, you don't need to use GQLAlchemy to ensure data persistence. The GQLAlchemy library provides an on-disk storage solution for large properties not used in graph algorithms. This is useful when nodes or relationships have metadata that doesn't need to be used in any of the graph algorithms carried out in Memgraph but can be fetched afterwards. You can check out the how-to guide to learn how to use an SQL database to store node properties seamlessly, as if they were being stored in Memgraph.

How to ensure data consistency between local and remote databases when providing a smartphone and web app

I am developing a Flutter app that should also be able to run in offline mode. Because I am using Flutter, I want to offer a web version of my application as well. The application I want to build is data-reliant, so to make it work offline I need some kind of local database; but for the web version to work, I also need to store the data in a remote database so it can be accessed from the web. The problem this poses is how to make sure that the local and remote databases are always on the same page. If something is changed from the web, it needs to also affect the local database on the device, and if something is changed locally, it also has to affect the remote database. I simply need a tip on how to generally achieve what I am looking for.
This can be easily achieved using Firebase Firestore, which supports offline persistence.
Or
If you plan to do this from scratch, one way is to keep a separate local database and execute every transaction on it. If online, run the same transactions on the remote database as well. If offline, keep a record of the pending transactions on the local device (preferably in a separate DB from the main one) and execute them once the remote database is reachable again (see the sketch after this answer).
You can use Hive or sqflite for the local DB.
Or
Keep track of the records that transactions were performed on. Before synchronizing, merge these records from both local (changes made offline) and remote (changes made on the web while the phone was not connected). If multiple transactions were performed on the same record, update both the remote and local DB records to the latest record state from the DB where the latest transaction was performed (if the latest transaction was performed on the remote DB, update the local DB, and vice versa).
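A minimal sketch of the pending-transactions idea from the second suggestion, written in Java to stay language-neutral (all type and method names here are hypothetical; in Flutter the same pattern maps onto Hive or sqflite):

import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical shapes for the two stores and the pending-transactions log.
record Transaction(String recordId, String payload, long timestamp) {}
interface Store { void apply(Transaction tx); }
interface PendingLog {
    void append(Transaction tx);   // backed by its own local table/box
    List<Transaction> drain();     // returns and clears the pending entries
}

class SyncingRepository {
    private final Store local, remote;
    private final PendingLog pending;
    private final BooleanSupplier isOnline;

    SyncingRepository(Store local, Store remote, PendingLog pending, BooleanSupplier isOnline) {
        this.local = local;
        this.remote = remote;
        this.pending = pending;
        this.isOnline = isOnline;
    }

    // Every write goes to the local DB first; if offline, queue it for later.
    void execute(Transaction tx) {
        local.apply(tx);
        if (isOnline.getAsBoolean()) {
            remote.apply(tx);
        } else {
            pending.append(tx);
        }
    }

    // Call when connectivity is restored: replay the queued transactions in order.
    void flushPending() {
        for (Transaction tx : pending.drain()) {
            remote.apply(tx);
        }
    }
}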

Spark MapWithState to manage session state

I am working on a use case where I need to continuously collect and process information for on-going user sessions on smartphones. The smartphone application contacts the server and keeps sending session-related parameters throughout the session, typically reporting session metrics every 15-20 seconds. A typical session lasts 15-20 minutes but may go up to 1-2 hours as well. The session metrics have to be available on a dashboard which pulls metrics not only for on-going sessions but for historical sessions as well (last 30 days). I am using Spark Streaming with the mapWithState functionality to manage session state, and I push the updated state to an external database after every Spark batch. Currently the dashboard only queries the external database.
I am worried about the performance of such a system, since the database upserts become far too many when the system is under heavy load. The latest session info has to be available on the dashboard (a strict business requirement).
What refinements can I work on? Spark has the concept of a JDBC server; can I make use of that somehow? If so, I will have to juggle between the database (for historical sessions) and Spark (for on-going/recent sessions).
FYI: I can't afford to use Spark Structured Streaming, as state management is quite complex in my case.
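For reference, a minimal sketch of the mapWithState setup described in the question. SessionMetrics, SessionState, and Dashboard.upsertBatch are hypothetical names, and the 120-minute timeout mirrors the stated upper bound on session length:

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Hypothetical types: one report from the phone, and the accumulated session totals.
class SessionMetrics { /* metrics reported every 15-20 seconds */ }
class SessionState {
    SessionState merge(SessionMetrics m) { /* fold the new report in */ return this; }
}
class Dashboard {
    static void upsertBatch(java.util.Iterator<SessionState> batch) { /* JDBC upsert */ }
}

class SessionJob {
    // `events` is the stream of metric reports, keyed by session id.
    static void wire(JavaPairDStream<String, SessionMetrics> events) {
        Function3<String, Optional<SessionMetrics>, State<SessionState>, SessionState> update =
            (sessionId, metrics, state) -> {
                SessionState current = state.exists() ? state.get() : new SessionState();
                if (metrics.isPresent()) {
                    current = current.merge(metrics.get());
                }
                if (!state.isTimingOut()) {
                    state.update(current);        // updating is not allowed while a key times out
                }
                return current;
            };

        JavaMapWithStateDStream<String, SessionMetrics, SessionState, SessionState> sessions =
            events.mapWithState(
                StateSpec.function(update)
                         .timeout(Durations.minutes(120)));  // evict sessions idle past the max length

        // After each micro-batch, upsert the updated session states into the external
        // database so the dashboard always sees the latest values.
        sessions.foreachRDD(rdd ->
            rdd.foreachPartition(partition -> Dashboard.upsertBatch(partition)));
    }
}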

Load balancer and multiple instance of database design

The current single application server can handle about 5,000 concurrent requests. However, the user base will be over a million, and I may need two application servers to handle the requests.
So the plan is to add a load balancer in the hope of handling over 10,000 concurrent requests. However, each user's data is stored in one single database. Given two or more servers, shall I do the following?
Have two database instances
Sync the two databases in real time
Is this correct? If so, will the sync process lower the performance of the servers, since database replication seems costly?
Thank you.
You probably want to think of your service in "tiers". In this instance, you've got two tiers: the application tier and the database tier.
Typically, your application tier is going to be considerably easier to scale horizontally (i.e. by adding more application servers behind a load balancer) than your database tier.
With that in mind, the best approach is probably to overprovision your database (i.e. put it on its own, meaty server) and have your application servers all connect to that same database. Depending on the database software you're using, you could also look at using read replicas (AWS docs) to reduce the strain on your database.
You can also look at caching via Memcached / Redis to reduce the amount of load you're placing on the database.
So, tl;dr: put your DB on its own big server, and spread your application code across many small servers, all connecting to that same DB server.
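A minimal cache-aside sketch of the Memcached/Redis suggestion above, using Jedis (the Database type and loadUser call are hypothetical; the 300-second TTL is illustrative):

import redis.clients.jedis.Jedis;

// Hypothetical database lookup.
interface Database {
    String loadUser(String userId);
}

class UserCache {
    // Cache-aside read: check Redis first, fall back to the database on a miss.
    static String getUser(Jedis cache, Database db, String userId) {
        String key = "user:" + userId;
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                  // cache hit: no load on the database
        }
        String row = db.loadUser(userId);   // cache miss: read from the DB...
        cache.setex(key, 300, row);         // ...and keep it warm for later requests
        return row;
    }
}
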
Synchronizing a standby node with data from the active node could be the most cost-effective option, since it is achievable with an open-source relational database (e.g. MariaDB).
Do not store computed results and statistics that can easily be derived at run time; this helps reduce the data size.
If historical data is not needed urgently for queries, it can be written to a text file in a format that is easy to import back into the database (e.g. .csv).
Data objects that are updated very often can be kept in an in-memory database as key-value pairs; use a scheduled task to perform batch updates/inserts to the relational database to achieve persistence (see the sketch after this list).
Implement retry logic for the batch update tasks to handle database downtime or network errors.
Consider writing data to the relational database as serialized objects.
Cache configuration data from the database in memory, refreshing the parts that change either periodically or via an API.
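A minimal write-behind sketch of the in-memory plus scheduled-batch idea above (the Database type and batchUpsert call are hypothetical; the 30-second interval is illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical persistence target.
interface Database {
    void batchUpsert(Map<String, String> batch) throws Exception;
}

class WriteBehindCache {
    private final Map<String, String> dirty = new ConcurrentHashMap<>();

    WriteBehindCache(Database db) {
        // Flush accumulated changes to the relational DB on a fixed schedule.
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(() -> flush(db), 30, 30, TimeUnit.SECONDS);
    }

    // Writes land in memory only; persistence happens in the background flush.
    void put(String key, String value) {
        dirty.put(key, value);
    }

    private void flush(Database db) {
        Map<String, String> batch = new HashMap<>(dirty);
        if (batch.isEmpty()) return;
        try {
            db.batchUpsert(batch);
            // Drop only entries that did not change again while the flush ran.
            batch.forEach(dirty::remove);
        } catch (Exception e) {
            // DB downtime or network error: keep the entries and retry on the next tick.
        }
    }
}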

Joining SQL query data with REST service data on the fly

I need to merge data from an MS SQL server and a REST service on the fly. I have been asked not to store the data permanently in the MS SQL database, as it changes periodically (caching would be OK, I believe, as long as the cache time is adjustable).
At the moment, I query for data, then pull joined data from a memory cache. If the data is not in the cache, I call a REST service and store the result in the cache.
This can be cumbersome and slow. Are there any patterns, applications, or solutions that would help me solve this problem?
My thought is that I should move the cached data to a database table, which would speed up the joins, and have the application periodically refresh the data in that table. Any thoughts?
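One way to implement that table-refresh idea is a scheduled job that upserts the REST data into a staging table, so the join happens inside SQL Server. A minimal sketch, assuming a rest_items(id, payload) table; the RestClient, Item, and 15-minute interval are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.sql.DataSource;

// Hypothetical shapes for the REST side.
record Item(String id, String payload) {}
interface RestClient { List<Item> fetchItems(); }

class RestTableRefresher {
    // Upserts the current REST data into a staging table so joins stay inside SQL Server.
    static void start(DataSource ds, RestClient rest) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            String merge =
                "MERGE INTO rest_items AS t " +
                "USING (VALUES (?, ?)) AS s(id, payload) ON t.id = s.id " +
                "WHEN MATCHED THEN UPDATE SET payload = s.payload " +
                "WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);";
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement(merge)) {
                for (Item i : rest.fetchItems()) {
                    ps.setString(1, i.id());
                    ps.setString(2, i.payload());
                    ps.addBatch();
                }
                ps.executeBatch();
            } catch (SQLException e) {
                // Transient failure: keep serving the previous snapshot and retry next tick.
            }
        }, 0, 15, TimeUnit.MINUTES);  // refresh interval = the acceptable cache staleness
    }
}
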
You can try Denodo. It allows connecting multiple data sources and has a built-in caching feature.
http://www.denodo.com/en
