Spark Streaming and database caching

I have a Spark Streaming application which does a lot of database access. We use an abstraction layer (basically a Java library) which maintains a client-side cache for database access. This cache layer helps speed up query processing.
I know that in the context of a normal (non-Spark) host this really helps. But with Spark Streaming I doubt whether this client-side caching would help, since data may not stay cached for long given how differently Spark Streaming behaves.
Can somebody please clarify?

Related

Distributed database with hundreds of read-only replicas which can synchronise asynchronously through HTTP

I have a service running as a sidecar next to a variety of applications.
This service needs to be extremely fast and must not make remote calls.
It has to have an in-memory database. The contents of this database have to be populated and kept up to date (although a lag is acceptable) from a central component.
The service does not accept writes.
Of course this could be done with a mechanism such as long polling, but that brings the complexity of managing such a solution and some intrinsic inefficiencies.
Is there a lightweight, ephemeral, in-process and preferably in-memory database that can synchronise asynchronously with a central replica, preferably over regular HTTP so that no extra ports need to be opened?
Maybe Couchbase Lite/Mobile is what you are after. At least Mobile syncs over a WebSocket; I am not sure which protocol Lite uses (or whether there is actually a difference between the products).
It seems Couchbase Lite replaced TouchDB, which was a mobile version of CouchDB, IIRC.
Another variant might be running PouchDB and using CouchDB as the master backend. You don't say which platform the application will run on, which is relevant if you want an in-process solution.
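If the central component ends up being CouchDB (the backend PouchDB replicates against), the pull side can stay on regular HTTP: CouchDB exposes changes over its _changes endpoint. A rough Java sketch of that polling loop, with the host, database name and document handling as placeholders:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Rough sketch of pull-based sync against a CouchDB-style central replica:
    // poll the _changes feed over plain HTTP, apply changed documents to the local
    // in-memory copy, and remember last_seq so the next poll only sees new changes.
    // "http://central:5984/rules" is a placeholder; auth and error handling omitted.
    public class ChangesPoller {
        private final HttpClient http = HttpClient.newHttpClient();
        private final ObjectMapper mapper = new ObjectMapper();
        private String lastSeq = "0";

        public void pollOnce() throws Exception {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://central:5984/rules/_changes"
                            + "?include_docs=true&since=" + lastSeq))
                    .GET().build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());
            JsonNode body = mapper.readTree(response.body());
            for (JsonNode change : body.get("results")) {
                // upsert change.get("doc") into the local in-memory store here
            }
            lastSeq = body.get("last_seq").asText();
        }
    }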

Does akka-streams support clustering? If yes, please share an example

I am using Akka Streams in my application to handle real-time data.
The data volume is very high, so I want to scale my application horizontally.
Can someone help me understand whether akka-streams supports clustering? If yes, please share an example.
I have found no examples or documentation indicating that akka-stream can run directly in a clustered mode. However, there are a few work-arounds you may be able to deploy.
Integration with Akka Cluster
You could have a Flow.map or Flow.mapAsync send an incoming object to the cluster and wait for the response.
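A minimal sketch of that idea with the Java streams DSL, assuming clusterRouter is a cluster-aware router or shard region obtained from your Akka Cluster setup (String messages stand in for your own types):

    import akka.NotUsed;
    import akka.actor.ActorRef;
    import akka.pattern.Patterns;
    import akka.stream.javadsl.Flow;
    import java.time.Duration;

    // Each stream element is "asked" to a cluster-aware router (or shard region)
    // and the stage waits for the reply. mapAsync keeps up to 8 asks in flight
    // while preserving element order. clusterRouter is assumed to come from your
    // Akka Cluster setup; String messages are placeholders for your own types.
    static Flow<String, String, NotUsed> viaCluster(ActorRef clusterRouter) {
        return Flow.<String>create()
            .mapAsync(8, msg ->
                Patterns.ask(clusterRouter, msg, Duration.ofSeconds(5))
                        .thenApply(reply -> (String) reply));
    }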
Sharding of Incoming Data
The source data could be broken up by a sharding function and sent to independent services which process the data in parallel. Each service could operate with a single akka-stream but the cluster of applications would allow for multiple streams running independently.
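The sharding function itself can be as simple as a stable hash of a key; a sketch, with the key choice and shard count as assumptions:

    // Stable shard selection: records that share a key always go to the same service
    // instance, so each instance can run its own independent akka-stream over its
    // slice of the data. The key (e.g. a device or user id) and the shard count are
    // up to you.
    static int shardFor(String key, int shardCount) {
        return Math.floorMod(key.hashCode(), shardCount);
    }

    // e.g. forward the record to the service instance at index
    // shardFor(record.deviceId, numberOfInstances)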

How to transfer rules and configuration to edge devices?

In our application we have a server whose DB stores entities along with their relations and processing rules. Any number of clients, such as Raspberry Pis, gateways and Android apps, are connected to that server.
I want to push configuration and processing rules to those clients, so that when they read some data they can process it on their own. This is to make the edge devices self-sustaining and to avoid outages when the server or network is down.
How should I push/pull the configuration? I don't want to maintain DBs on the clients and configure replication, because maintaining and patching DBs on that many clients would be tough.
Is there a better alternative?
At the same time, I have to push logs upstream to the server.
Thanks in advance.
I have been there. You need an on-device data store. For this range of embedded Linux devices, in order of growing development complexity:
Variables: Fast to change and retrieve, makes sense if the data fits in memory. Lost if the process ends.
Filesystem: Requires no special libraries, just read/write access somewhere. Workable if the data is small enough to fit in memory and does not change much during execution (read on startup when lacking network, write on update from the server). If your data can be structured as a few object variables, you could write them to JSON files (see the sketch after this list), and there is plenty of documentation on other file storage options for Android apps.
In-memory datastore like Redis: Lightweight dependency, can automate messaging and filesystem-stored backup. Provides a managed framework/hybrid of the previous two.
Lightweight databases, especially SQLite: Lightweight SQL database, stored in one file and popular with Android apps (probably already installed on many of the target devices). It could work for frequent changes on a larger block of data in a memory-constrained environment, but does not look like a great fit. It gets worse for anything heavier.
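A minimal sketch of the filesystem/JSON option above, using Jackson; the file path is a placeholder and a Map stands in for the real rules model:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.File;
    import java.util.Map;

    // Sketch of the filesystem option: read the last known rules on startup and
    // overwrite the file whenever a fresh copy arrives from the server.
    // "/data/rules.json" is a placeholder path.
    public class RulesStore {
        private static final File FILE = new File("/data/rules.json");
        private final ObjectMapper mapper = new ObjectMapper();

        @SuppressWarnings("unchecked")
        public Map<String, Object> load() throws Exception {
            return mapper.readValue(FILE, Map.class);     // used when the network is down
        }

        public void save(Map<String, Object> fresh) throws Exception {
            mapper.writeValue(FILE, fresh);               // called after a successful sync
        }
    }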
Redis replication is easy, but indiscriminate, so mainly sensible if your devices receive a changing but identical ruleset. Otherwise, in all these cases, the easiest transfer option may be to request and receive the whole configuration (GET a string, download a JSON file, etc.) and parse the received values.
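And a matching sketch of that whole-configuration transfer: one HTTP GET plus a parse, with the URL as a placeholder:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Map;

    // Pull the whole configuration in one GET and parse it; the device keeps
    // working from its last saved copy if this call fails. The URL is a
    // placeholder, and a Map again stands in for the real rules model.
    public class RulesFetcher {
        private final HttpClient http = HttpClient.newHttpClient();
        private final ObjectMapper mapper = new ObjectMapper();

        @SuppressWarnings("unchecked")
        public Map<String, Object> fetch() throws Exception {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("https://server.example/api/rules")).GET().build();
            HttpResponse<String> response =
                    http.send(request, HttpResponse.BodyHandlers.ofString());
            return mapper.readValue(response.body(), Map.class);
        }
    }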

Improving speed in winform application and WCF with Caching

We provide a critical application for a customer. It's a ClickOnce WinForms application which consumes several WCF services that communicate with an Oracle database.
The services are hosted on Oracle Application Server with two Web Cache servers in front for load balancing. The database is on a separate machine.
The thing is, the application now has poor performance and we need to speed it up. We have tried many techniques: optimizing queries by adding indexes after analyzing explain plans, reducing service calls from the client, and profiling the client application for bottlenecks.
But I would really like to set up a caching layer over the database or the WCF services. The data is critical and changes quite often, so every request needs to see the latest data.
So when data changes in the database the cache should be expired immediately. The queries are complex, with up to 14-15 joins...
What is the right way to do this, and which tools/frameworks should I use? I have heard of memcached... is it any good?
Because your code sees all updates to the data, you can have a very effective caching layer, since the cache can be updated at the same time as the database.
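A minimal sketch of that write-through idea; the loader and writer functions are placeholders for your real data access, and in a multi-server deployment the map itself would live in a shared cache rather than in each server's process (which is the coherency point below):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.function.Consumer;
    import java.util.function.Function;

    // Write-through idea: every update goes to the database and to the cache in the
    // same operation, so reads can be served from the cache without going stale.
    // dbLoad and dbSave are placeholders for your real data access code.
    public class WriteThroughCache<K, V> {
        private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
        private final Function<K, V> dbLoad;      // e.g. id -> SELECT ... FROM ...
        private final Consumer<V> dbSave;         // e.g. row -> UPDATE ...

        public WriteThroughCache(Function<K, V> dbLoad, Consumer<V> dbSave) {
            this.dbLoad = dbLoad;
            this.dbSave = dbSave;
        }

        public V get(K key) {
            return cache.computeIfAbsent(key, dbLoad);   // fill on first read
        }

        public void update(K key, V value) {
            dbSave.accept(value);      // database first
            cache.put(key, value);     // then refresh the cached copy in the same call
        }
    }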
With your requirement for absolute cache coherency, you need to make sure all servers see the same cache. There are two approaches you could take:
Have a cache server, using something like the ASP.NET cache, which the application servers talk to in order to get and update the data
Use a caching product to maintain the cache
If you use a caching product there are a number on the market: memcached, GemFire, Coherence, Windows Server AppFabric Caching and more.
The nice thing about AppFabric Caching (the project formerly known as Velocity) is that it is free with Windows Server and is very .NET friendly (although it is newer than some of the others, so you might say less proven).
Before adding a new tool you should make sure you're correctly using all of the Oracle caching that is available to you.
There's the buffer cache, the PL/SQL function result cache, the client query result cache, the SQL query result cache, and materialized views; bind variables will also help with caching query plans.
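As one concrete example, the SQL query result cache is enabled per statement with a hint in the SQL text, and Oracle invalidates the cached result when the underlying tables change, which matches the freshness requirement. Shown from JDBC purely for brevity (the same statement works from the services' .NET data layer); table and column names are placeholders:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Illustration of the server-side SQL query result cache: the RESULT_CACHE hint
    // is part of the SQL text, so it applies regardless of which client stack issues
    // the query. Oracle drops the cached result when the dependent tables change.
    // Table and column names are placeholders.
    static double totalForCustomer(Connection conn, long customerId) throws Exception {
        String sql = "SELECT /*+ RESULT_CACHE */ SUM(o.total) "
                   + "FROM orders o WHERE o.customer_id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getDouble(1) : 0.0;
            }
        }
    }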

Charts or Stats comparing Database vs. HTTP vs. Direct File Access Performance?

I am wondering what the stats are for different ways of storing (and therefore retrieving) content. Are there any charts out there, or do you have any quick tests that show the requests per second, etc., of:
Direct (local) database access, vs.
HTTP Access to cached data, vs.
HTTP Access to uncached data (remote database), vs.
Direct File access
I am trying to judge how necessary it is to cache data locally if I'm using remote services.
Thanks!
.. what the stats are ...
Although some people may have published their findings, this will not map directly to your experience; you may find the opposite of what they discovered.
Sometimes it may be faster to retrieve a file from a database than from the filesystem. It depends on the size of the file, the filesystem or DBMS it resides on, the other data that affects the access path (e.g. indexes, the number of I/O operations needed to dereference the start of the file), the underlying hardware, the amount of caching available, whether the data or information about its location is already in the cache, and the interaction between all of these factors.
And that's before you start considering the additional variables introduced when you start talking about HTTP, which also implies remote network access.
While ultimately any file would need to be read from the filesystem at some point, this suggests that direct file access would be the fastest method (but only on the local machine); however, once you consider centralized caching and concurrency, this is not necessarily the case.
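If you just want rough numbers of the kind asked about, a crude timing sketch; the file path and URL are placeholders, and the results only describe your own machine, network and whatever caches happen to be warm:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Crude comparison: time N direct file reads against N HTTP fetches of the same
    // content. The path and URL are placeholders; numbers only reflect this machine,
    // this network and whatever caches happen to be warm.
    public class QuickCompare {
        public static void main(String[] args) throws Exception {
            int n = 100;
            HttpClient http = HttpClient.newHttpClient();

            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) {
                Files.readAllBytes(Path.of("/var/data/content.json"));
            }
            long fileNanos = System.nanoTime() - t0;

            long t1 = System.nanoTime();
            for (int i = 0; i < n; i++) {
                HttpRequest req = HttpRequest.newBuilder(
                        URI.create("https://service.example/content.json")).GET().build();
                http.send(req, HttpResponse.BodyHandlers.ofString());
            }
            long httpNanos = System.nanoTime() - t1;

            System.out.printf("file: %.1f req/s, http: %.1f req/s%n",
                    n / (fileNanos / 1e9), n / (httpNanos / 1e9));
        }
    }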
I am trying to judge how necessary it is to cache data locally if I'm using remote services.
Rather hard to say. How remote? What are your bandwidth costs? Latency? What level of service do you hope to provide? Does the remote system provide caching information already? How do you deal with cache invalidations?
If we knew everything about your application, the data source, your customers and networks connecting them and your budget for implementing the service then we might hazard a guess. And, yes, caching on the MITM server probably is a good idea but only if you know that you're not breaking anything by using caching.
C.
