Spark MapWithState to manage session state - database

I am working on a use case where I need to continuously collect and process information for ongoing user sessions on smartphones. The smartphone application contacts the server and keeps sending session-related parameters to it throughout the session, typically reporting session metrics every 15-20 seconds. A typical session lasts 15-20 minutes but may run up to 1-2 hours as well. The session metrics have to be available on a dashboard which pulls metrics not only for ongoing sessions but for historical sessions as well (last 30 days). I am using Spark Streaming with mapWithState to manage session state, and I push the updated state info to an external database after every Spark batch. Currently the dashboard only queries the external database.
I am worried about the performance of such a system, since the database upserts become far too many when the system is under heavy load. The latest session info has to be available on the dashboard (a strict business requirement).
What refinements can I work on? Spark has a concept of a JDBC server. Can I make use of that somehow? If so, I will have to juggle between the database (for historical sessions) and Spark (for ongoing/recent sessions).
FYI: I can't afford to use Spark Structured Streaming, as state management is quite complex in my case.
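For reference, a minimal Scala sketch of the kind of pipeline described above. The SessionMetrics/SessionState shapes, the socket source, the checkpoint path and the 15-second batch interval are placeholders (the real source and schema are not shown here), and the per-batch upsert is left as a stub.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical shapes for the reported metrics and the accumulated state.
case class SessionMetrics(durationMs: Long, bytesUsed: Long)
case class SessionState(totalDurationMs: Long, totalBytes: Long)

object SessionTracker {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("session-tracker")
    val ssc  = new StreamingContext(conf, Seconds(15))
    ssc.checkpoint("/tmp/session-checkpoints")

    // In the real job the metrics would come from Kafka or similar; a socket
    // source is used here only to keep the sketch self-contained.
    val metrics: DStream[(String, SessionMetrics)] =
      ssc.socketTextStream("localhost", 9999).map { line =>
        val Array(id, dur, bytes) = line.split(",")
        (id, SessionMetrics(dur.toLong, bytes.toLong))
      }

    // mapWithState accumulates per-session totals; idle sessions time out after 2 hours.
    def update(id: String, m: Option[SessionMetrics],
               state: State[SessionState]): (String, SessionState) = {
      val prev = state.getOption().getOrElse(SessionState(0L, 0L))
      val next = m.fold(prev)(x =>
        SessionState(prev.totalDurationMs + x.durationMs, prev.totalBytes + x.bytesUsed))
      if (!state.isTimingOut()) state.update(next)
      (id, next)
    }

    val sessions = metrics.mapWithState(
      StateSpec.function(update _).timeout(Seconds(2 * 60 * 60)))

    // One upsert per updated session per batch (this is where the write volume comes from).
    sessions.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // open a single connection per partition and batch the upserts here
        partition.foreach { case (id, s) => /* upsert(id, s) */ () }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Every batch emits one updated state per active session, so the upsert volume grows with the number of concurrent sessions rather than with the number of dashboard queries.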

Related

Azure SQL GeoReplication - Queries on secondary DB are slower

I've set up two SQL DBs on Azure with geo-replication. The primary is in Brazil and a secondary is in West Europe.
Similarly, I have two web apps running the same web API: a Brazilian web app that reads and writes on the Brazilian DB, and a European web app that reads from the European DB and writes to the Brazilian DB.
When I test response times on read-only queries with Postman from Europe, I first notice that on a "cold" first call the European web app is twice as fast as the Brazilian one. However, on the immediately following calls, response times on the Brazilian web app are 10% of the initial "cold" call, whereas response times on the European web app remain the same. I also notice that after a few minutes of inactivity, results are back to the "cold" case.
So:
1. Why do query response times drop in Brazil?
2. Whatever the answer to 1 is, why doesn't it happen in Europe?
3. Why doesn't the response-time improvement from 1 last after a few minutes of inactivity?
Note that both web apps and DBs are created as copies of each other (except for geo-replication) from an Azure ARM JSON file.
Both web apps are alwaysOn.
Thank you.
UPDATE
Actually there are several parts in action in what I see as an end user: the web apps and the DBs. I wrote this question thinking the issue was around the DBs and geo-replication; however, after trying Alberto's script (see below) I couldn't see any differences in wait times when querying Brazil or Europe, so the problem may be in the web apps. I don't know how to further analyse/test that.
UPDATE 2
This may (or may not) be related to the query store. I asked a new, more specific question on that subject.
UPDATE 3
Queries on secondary database are not slower. My question was raised on false conclusions. I won't delete it as others took time to answer it and I thank them.
I was comparing query response times through REST calls to a web API running EF queries on a SQL Server DB. As REST calls to the web API located in the region querying the DB replica are slower than REST calls to the same web API deployed in another region targeting the primary DB, I concluded the problem was on the DB side. However, when I run the queries in SSMS directly, bypassing the web API, I observe almost no difference in response times between the primary and replica DB.
I still have a problem but it's not the one raised in that question.
On Azure SQL Database your database's memory allocation may be dynamically reduced after some minutes of inactivity; in this behavior Azure SQL differs from SQL Server on-premises. If you run a query two or three times, it then starts to execute faster again.
If you examine the query execution plan and its wait stats, you may find a wait named MEMORY_ALLOCATION_EXT for queries executed after the memory allocation has been shrunk by the Azure SQL Database service. Databases with a lot of activity and query execution may not see their memory allocation reduced. For more detailed information please read this StackOverflow thread.
Also take into consideration that both databases should have the same service tier assigned.
Use the script below to determine query waits and see what the difference is in terms of waits between both regions.
DROP TABLE IF EXISTS #before;
SELECT [wait_type], [waiting_tasks_count], [wait_time_ms], [max_wait_time_ms],
[signal_wait_time_ms]
INTO #before
FROM sys.[dm_db_wait_stats];
-- Execute test query here
SELECT *
FROM [dbo].[YourTestQuery]
-- Finish test query
DROP TABLE IF EXISTS #after;
SELECT [wait_type], [waiting_tasks_count], [wait_time_ms], [max_wait_time_ms],
[signal_wait_time_ms]
INTO #after
FROM sys.[dm_db_wait_stats];
-- Show accumulated wait time
SELECT [a].[wait_type], ([a].[wait_time_ms] - [b].[wait_time_ms]) AS [wait_time]
FROM [#after] AS [a]
INNER JOIN [#before] AS [b] ON
[a].[wait_type] = [b].[wait_type]
ORDER BY ([a].[wait_time_ms] - [b].[wait_time_ms]) DESC;

Load balancer and multiple instance of database design

The current single application server can handle about 5,000 concurrent requests. However, the user base will be in the millions and I may need two application servers to handle the requests.
So the design is to have a load balancer, in the hope that it will handle over 10,000 concurrent requests. However, the data of each user is stored in one single database. If the design is to have two or more servers, shall I do the following?
Have two database instances
Sync the two databases in real time
Is this correct?
However, if so, will the sync process lower the performance of the servers, since database replication seems costly?
Thank you.
You probably want to think of your service in "tiers". In this instance, you've got two tiers; the application tier and the database tier.
Typically, your application tier is going to be considerably easier to scale horizontally (i.e. by adding more application servers behind a load balancer) than your database tier.
With that in mind, the best approach is probably to overprovision your database (i.e. put it on its own, meaty server) and have your application servers all connect to that same database. Depending on the database software you're using, you could also look at using read replicas (AWS docs) to reduce the strain on your database.
You can also look at caching via Memcached / Redis to reduce the amount of load you're placing on the database.
So – tl;dr – put your DB on its own, big, server, and spread your application code across many small servers, all connecting to that same DB server.
The best option could be synchronizing a standby node with data from the active node as a cost-effective solution, since it is achievable using an open-source relational database (e.g. MariaDB).
Do not store computable results and statistics that can easily be derived at run time; this may help reduce the data size.
If historical data is not needed urgently for enquiries, it can be written to text files in a format that is easily importable into the database (e.g. .csv).
Data objects that are updated very often can be kept in an in-memory store as key-value pairs; use a scheduled task to perform batch updates/inserts to the relational database to achieve persistence (see the sketch after this list).
Implement retry logic for the database batch-update tasks to handle DB downtime or network errors.
Consider writing data to the relational database as serialized objects.
Cache configuration data from the database in memory, refreshing the changing parts either periodically or via an API.
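A minimal Scala sketch of the in-memory buffer and retry ideas above. The connection URL, table name, flush interval and retry count are placeholders, not recommendations.

import java.sql.DriverManager
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import scala.jdk.CollectionConverters._
import scala.util.{Success, Try}

// Frequently updated objects live in an in-memory map and are flushed to the
// relational database on a schedule, with a simple retry for transient failures.
object WriteBehindBuffer {
  private val buffer = new ConcurrentHashMap[String, String]() // key -> serialized value

  def put(key: String, value: String): Unit = buffer.put(key, value)

  private def flushOnce(): Try[Unit] = Try {
    val conn = DriverManager.getConnection("jdbc:mariadb://dbhost/app")
    try {
      // Idempotent upsert (MariaDB/MySQL syntax), so re-running the batch is safe.
      val stmt = conn.prepareStatement(
        "INSERT INTO kv_store (k, v) VALUES (?, ?) ON DUPLICATE KEY UPDATE v = VALUES(v)")
      buffer.asScala.foreach { case (k, v) =>
        stmt.setString(1, k)
        stmt.setString(2, v)
        stmt.addBatch()
      }
      stmt.executeBatch()
    } finally conn.close()
  }

  def start(): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(() => {
      // retry up to 3 times per flush to ride out DB downtime or network errors
      Iterator.continually(flushOnce()).take(3).collectFirst { case Success(_) => () }
    }, 30, 30, TimeUnit.SECONDS)
  }
}

Because the upsert is idempotent, re-writing the whole buffer after a failed attempt is safe; a real implementation would also evict or snapshot entries once they are persisted.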

Optimised Dashboard and Publisher/Distributor design

We have a database that is not performing well and I am hoping to get some advice on the best way to re-design it. The database/application comes from a third-party vendor, and for the moment cannot be changed.
Currently we have a local distributor set up to serve about 80,000 reports per month of differing complexity (I know - how complex or simple is each one? - the number is more by way of indication than an actual load assessment). We are pulling data from a number of different real-time (x4) and transactional (x3) databases across a WAN on a minute-by-minute basis and then transforming that data into the schema. We have dashboards (.NET installed client) and MSRS reporting. There is also some minor data entry.
As you may have guessed, the server is struggling.
We are looking to move to SQL Server 2014.
There are two options we are considering:
AG separation of Primary and active Secondary.
Splitting Publisher and Distributor and using some form of replication (Transactional?) to push the data to the distributors.
Which would make more sense?
Also, each object on every dashboard calls its own query. If 10 people from 3 different geographic locations are running the same dashboard, they will each be running the same query, and these will refresh every 2 mins.

Caching to a local SQL instance on a web server

I run a very high traffic (10M impressions a day), high revenue generating web site built with .NET. The core metadata is stored on a SQL Server. My team and I have a unique caching strategy that involves querying the database for new metadata at regular intervals from a middle-tier server, serializing the data to files and sending those to the web nodes. The web application uses the data in these files (some are actually serialized objects) to instantiate objects and caches those in memory to use for real-time requests.
The advantage of this model is that it:
Allows the web nodes to cache all data in memory and not incur any IO overhead querying a database.
If the database ever goes down either unexpectedly or for maintenance windows, the web servers will continue to run and generate revenue. You can even fire up a web server without having to retrieve its initial data from the DB because all the data it needs are in files on its own disks.
Allows us to be completely horizontally scalable. If throughput suffers, we can just add a web server.
The disadvantage is that these caching and persistence layers add complexity to the code that queries the database, packages the data and unpackages it on the web server. Any time our domain model requires us to add entities, more of this "plumbing" has to be coded. This architecture has been in place for four years and there are probably better ways to tackle this.
One strategy I have been considering is using replication to replicate our master SQL Server database to local database instances installed on each web server. The web server application would use normal SQL/ORM techniques to instantiate objects. Here, we could still sustain a master database outage, and we would not have to write specialized caching code; we could instead use NHibernate to handle the persistence.
This seems like a more elegant solution, and I would like to see what others think, or whether anyone else has alternatives to suggest.
I think you're overthinking this. SQL Server already has mechanisms available to you to handle these kinds of things.
First, implement a SQL Server cluster to protect your main database. You can fail over from node to node in the cluster without losing data, and downtime is a matter of seconds, max.
Second, implement database mirroring to protect from a cluster failure. Depending on whether you use synchronous or asynchronous mirroring, your mirrored server will either be updated in realtime or a few minutes behind. If you do it in realtime, you can fail over to the mirror automatically inside your app - SQL Server 2005 & above support embedding the mirror server's name in the connection string, so you don't even have to lift a finger. The app just connects to whatever server's live.
Between these two things, you're protected from just about any main database failure short of a datacenter-wide power outage or network outage, and there's none of the complexity of the replication stuff. That covers your high availability issue, and lets you answer the scaling question separately.
My favorite starting point for scaling is using three separate connection strings in your application, choosing the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of read-only SQL Servers that are getting updated by replication or log shipping. In your original design, these lived on the web servers, but that's dangerous practice and a maintenance nightmare. SQL Server needs a lot of memory (not to mention money for licensing) and you don't want to be tied into adding a database server for every single web server.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, and doesn't involve any coding complexity - just pick the right connection string.
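As a rough illustration of that idea in JDBC/Scala (server names and URLs are made up; the same routing applies equally to ADO.NET connection strings):

import java.sql.{Connection, DriverManager}

// Placeholder URLs for the three tiers described above.
object ConnectionStrings {
  val Realtime         = "jdbc:sqlserver://master-sql;databaseName=App"
  val NearRealtime     = "jdbc:sqlserver://readpool-sql;databaseName=App"
  val DelayedReporting = "jdbc:sqlserver://reporting-sql;databaseName=App"
}

sealed trait QueryIntent
case object Write        extends QueryIntent
case object CriticalRead extends QueryIntent
case object NormalRead   extends QueryIntent
case object Reporting    extends QueryIntent

object Db {
  // Writes and mission-critical reads go to the master; everything else is
  // routed to the read pool or the delayed reporting pool.
  def connectionFor(intent: QueryIntent): Connection = {
    val url = intent match {
      case Write | CriticalRead => ConnectionStrings.Realtime
      case NormalRead           => ConnectionStrings.NearRealtime
      case Reporting            => ConnectionStrings.DelayedReporting
    }
    DriverManager.getConnection(url)
  }
}

The routing decision lives in one place, so moving normal-read or reporting traffic to new servers later is a configuration change rather than a code change.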
Have you considered memcached? Since it is:
in memory
can run locally
fully scalable horizontally
prevents the need to re-cache on each web server
It may fit the bill. Check out Google for lots of details and usage stories.
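For illustration, a minimal Scala sketch using the spymemcached client (host, key, TTL and the cached value are arbitrary):

import java.net.InetSocketAddress
import net.spy.memcached.MemcachedClient

object MemcachedExample {
  def main(args: Array[String]): Unit = {
    // Connect to a memcached node (in practice a pool shared by all web servers).
    val client = new MemcachedClient(new InetSocketAddress("cache-host", 11211))

    // Cache a serializable object for 10 minutes so every web server reuses it
    // instead of re-querying the database.
    client.set("meta:site-config", 600, "serialized metadata goes here")

    // Reads hit memory; a miss (null) means fall back to the database and repopulate.
    val cached = Option(client.get("meta:site-config"))
    println(cached.getOrElse("cache miss: load from DB and set again"))

    client.shutdown()
  }
}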
Just some additions to what RickNZ proposed above.
Since the master data you are currently caching won't change very frequently, and probably only during some maintenance window, here is what you should do first on the database side:
Create a SNAPSHOT replication for the master tables which you want to cache. Adding new entities will be equally easy.
On all the web servers, install SQL Express and subscribe to this publication.
Since this is not frequently changing data, you can rest assured there won't be much of a server resource usage issue, apart from the network trips for the master data.
All the caching which was available via the previous mechanism is still available, minus all the headache which comes when you add new entities.
Next, you can leverage the .NET mechanisms suggested above. You won't face a memcached cluster failure unless your web server itself goes down. There is a lot available in .NET which a .NET pro can point out after this stage.
It seems to me that Windows Server AppFabric (a.k.a. "Velocity") is exactly what you are looking for. From the introductory documentation:
Windows Server AppFabric provides a distributed in-memory application cache platform for developing scalable, available, and high-performance applications. AppFabric fuses memory across multiple computers to give a single unified cache view to applications. Applications can store any serializable CLR object without worrying about where the object gets stored. Scalability can be achieved by simply adding more computers on demand. The cache also allows for copies of data to be stored across the cluster, thus protecting data against failures. It runs as a service accessed over the network. In addition, Windows Server AppFabric provides seamless integration with ASP.NET that enables ASP.NET session objects to be stored in the distributed cache without having to write to databases. This increases both the performance and scalability of ASP.NET applications.
Have you considered using SqlDependency caching?
You could also write the data to the local disk at the web tier, if you're concerned about initial start-up time or DB outages. But at least with a SqlDependency, you shouldn't have to poll the DB to look for changes. It can also be made relatively transparent.
In my experience, adding a DB instance on web servers generally doesn't work out too well from a scalability or performance perspective.
If you're concerned about performance and scalability, you might consider partitioning your data tier. The specifics depend on your app, but as an example, you could move read-only data onto a couple of SQL Express servers that are populated with replication.
In case it helps, I talk about this subject at length in my book (Ultra-Fast ASP.NET).

Pattern for very slow DB Server

I am building an ASP.NET MVC site where I have a fast dedicated server for the web app, but the database is stored on a very busy MS SQL Server used by many other applications.
Even though the web server is very fast, the application response time is slow, mainly because of the slow response from the DB server.
I cannot change the DB server, as all data entered in the web application needs to arrive there in the end (for backup reasons).
The database is used only by the web app, and I would like to find a cache mechanism where all the data is cached on the web server and the updates are sent to the DB asynchronously.
It is not important for me to have an immediate correspondence between the data read from the DB and the inserted data: think of reading questions on StackOverflow, where newly inserted questions don't necessarily have to show up immediately after insertion.
I thought of building an in-between WCF service that would exchange and sync the data between the slow DB server and a local one (maybe SQLite or SQL Express).
What would be the best pattern for this problem?
What is your bottleneck: reading data or writing data?
If you are concerned about reading data, using a memory-based caching mechanism like memcached would be a performance booster, as most of the mainstream, largest web sites do this. "Scaling Facebook, hi5 with memcached" is a good read. Also, implementing application-side page caches would cut down the queries made by the application, leading to lower DB load and better response times. But this will not have much effect on the database server's load, as your database has some other heavy users. (A minimal sketch of an application-side cache follows below.)
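A minimal sketch of such an application-side cache in Scala (in-process and time-based; the key, TTL and loader shown in the usage comment are placeholders):

import java.util.concurrent.ConcurrentHashMap

// Tiny time-based read-through cache: a query result is reused for ttlMs
// milliseconds before the database is asked again.
class TtlCache[V](ttlMs: Long) {
  private val entries = new ConcurrentHashMap[String, (Long, V)]()

  def getOrLoad(key: String)(load: => V): V = {
    val now = System.currentTimeMillis()
    val cached = Option(entries.get(key)).collect {
      case (storedAt, value) if now - storedAt < ttlMs => value
    }
    cached.getOrElse {
      val fresh = load        // hit the database only on a miss or after expiry
      entries.put(key, (now, fresh))
      fresh
    }
  }
}

// Usage: cache the rendered question list for 60 seconds.
// val cache = new TtlCache[String](60000)
// val html = cache.getOrLoad("questions:front-page") { queryDatabaseAndRender() }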
If writing data is the bottleneck, implementing some kind of asynchronous middleware storage service seems like a necessity. If you have fast and slow response-time data storage on the frontend server, going with a lightweight database like MySQL or PostgreSQL (maybe not that lightweight ;) ) and using your real database as a slave replication server for your site is a good choice for you.
I would do what you are already considering. Use another database for the application and only use the current one for backup purposes.
I had this problem once, and we decided to go for a combination of data warehousing (i.e. pulling data from the database every once in a while and storing this in a separate read-only database) and message queuing via a Windows service (for the updates.)
This worked surprisingly well, because MSMQ ensured reliable message delivery (updates weren't lost) and the data warehousing made sure that data was available in a local database.
It still will depend on a few factors though. If you have tons of data to transfer to your web application it might take some time to rebuild the warehouse and you might need to consider data replication or transaction log shipping. Also, changes are not visible until the warehouse is rebuilt and the messages are processed.
On the other hand, this solution is scalable and can be relatively easy to implement. (You can use Integration Services to pull the data into the warehouse, for example, and use a BL layer for processing changes.)
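As a language-neutral sketch of the queued-update half of that pattern: the original setup used MSMQ and a Windows service, so this Scala version with an in-process queue only shows the shape of the idea and gives none of MSMQ's durability or redelivery guarantees.

import java.util.concurrent.LinkedBlockingQueue

// Hypothetical message shape for an update that must eventually reach the slow DB.
case class UpdateMessage(entityId: String, payload: String)

object UpdateQueueWorker {
  private val queue = new LinkedBlockingQueue[UpdateMessage]()

  // Web tier: enqueue and return immediately, so the slow DB never blocks a request.
  def enqueue(msg: UpdateMessage): Unit = queue.put(msg)

  // Background worker: drain the queue and apply updates to the real database.
  def start(): Thread = {
    val worker = new Thread(() => {
      while (true) {
        val msg = queue.take()   // blocks until a message is available
        applyToDatabase(msg)     // in the real setup, MSMQ handles retries and persistence
      }
    })
    worker.setDaemon(true)
    worker.start()
    worker
  }

  private def applyToDatabase(msg: UpdateMessage): Unit =
    println(s"writing ${msg.entityId} to the backing database")
}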
There are many replication techniques that should give you proper results. By installing a SQL Server instance on the 'web' side of your configuration, you'll have the choice between:
Making snapshot replication from the web side (publisher) to the database-server side (subscriber). You'll need a paid version of SQL Server on the web server. I have never worked on this kind of configuration, but it might use a lot of the web server's resources at scheduled synchronization times.
Making merge (or transactional, if required) replication between the database-server side (publisher) and the web side (subscriber). You can then use the free version of MS SQL Server and schedule the synchronization process to run according to your tolerance for potential loss of data if the web server goes down.
I wonder if you could improve it by adding an MDF file on your web side instead of dealing with the server at another IP...
Just add a SQL Server 2008 Express Edition file and try; as long as you don't exceed 4 GB of data you will be OK. Of course there are more restrictions, but just for the speed of it, why not try?
You should also consider the network switches involved. If the DB server is talking to a number of web servers, then it may be constrained by the network connection speed. If they are only connected via a 100 Mb network switch, then you may want to look at upgrading that too.
The WCF service would be a very poor engineering solution to this problem - why build your own when you can use the standard SQL Server connectivity mechanisms to ensure data is transferred correctly? Log shipping will send the data across at selected intervals.
This way, you get the fast local sql server, and the data is preserved correctly in the slow backup server.
You should investigate the slow SQL server though; the performance problem could have nothing to do with its load, and more to do with the queries and indexes you're asking it to work with.
