EF Core - adding range - possible duplicate - sql-server

I have a service that posts data to a web server (ASP.NET Core 3.1) every second, and I store that data in SQL Server using EF Core 3.1.
Up until now, when trying to store new data, I have done the following for each new data row separately:
Checked whether the entity already exists in the database (the entity type is configured with a unique index via .IsUnique() in OnModelCreating())
If it does not exist, added the single entity
Called DbContext.SaveChanges()
However, this seems a bit "heavy" on the SQL server, with quite a lot of calls. It is running on Azure, and sometimes the database has trouble keeping up and the web server starts returning 500 (internal server error, as far as I understand). This tends to happen when someone calls another controller on the web server and tries to retrieve larger chunks of data from the SQL server. (That is perhaps a separate question about Azure SQL reliability.)
Is it better to keep a buffer on the web server and save everything in one go, e.g. DbContext.AddRange(entities), with a coarser time resolution (say, once a minute)? I do not know exactly what happens if one or more of the rows are duplicates. Are the non-duplicates stored, or are all inserts rejected? (I can't seem to find an explanation for this.)
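To make the idea concrete, here is a rough sketch of the kind of batched save I have in mind (filtering out rows that already exist before calling AddRange); the Reading entity and context are hypothetical stand-ins for my actual types:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Hypothetical entity and context, just to make the sketch self-contained.
public class Reading
{
    public int Id { get; set; }
    public DateTime Timestamp { get; set; }   // configured with .IsUnique() in OnModelCreating()
    public double Value { get; set; }
}

public class MyDbContext : DbContext
{
    public MyDbContext(DbContextOptions<MyDbContext> options) : base(options) { }
    public DbSet<Reading> Readings { get; set; }
}

public static class BatchSaver
{
    public static async Task SaveBatchAsync(MyDbContext db, IReadOnlyCollection<Reading> buffered)
    {
        var keys = buffered.Select(r => r.Timestamp).ToList();

        // One round trip: which of the buffered keys are already stored?
        var existing = new HashSet<DateTime>(
            await db.Readings
                .Where(r => keys.Contains(r.Timestamp))
                .Select(r => r.Timestamp)
                .ToListAsync());

        // Only the rows that are not duplicates get inserted.
        db.Readings.AddRange(buffered.Where(r => !existing.Contains(r.Timestamp)));
        await db.SaveChangesAsync();   // one batched round of INSERTs instead of one call per row
    }
}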
Any help on the matter is much appreciated.
EDIT 2021-02-08:
I'll try to expand a bit on the situation:
outside my control: MQTT Broker (publishing messages)
in my control:
MQTT client (currently an Azure WebJob), subscribes to the MQTT Broker
ASP.NET server
SQL Database
The MQTT client collects and groups messages from different sensors on the MQTT broker into a format that (more or less) can be stored directly in the database.
The ASP.NET server acts as a middle-man between the MQTT client and the SQL database, BUT ALSO continuously sends "live" updates to anyone visiting the website. So currently the web server has many jobs (perhaps the problem arises here??):
receive data from the MQTT client
store/retrieve data to/from the database
serve visitors with "live" data from the MQTT client as well as historic data from the database
Hope this helps with the understanding.

I ended up with a buffer service built around a ConcurrentDictionary that I use in my ASP.NET controller. That way I can make sure duplicates are handled in my code in a controlled way (the existing entry is updated or the incoming one discarded, based on the quality of the received data). Each minute I empty the last minute's data into the database, so I always keep one minute of data in the buffer. Bonus: I can also serve current data to visitors much more quickly from the buffer service instead of going to the database.
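A rough sketch of the buffer service, with Reading, Timestamp and Quality as placeholders for my actual types:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Placeholder type standing in for the real data row.
public class Reading
{
    public DateTime Timestamp { get; set; }
    public double Value { get; set; }
    public int Quality { get; set; }
}

public class ReadingBufferService
{
    // Keyed on the unique timestamp, so duplicates are resolved in code, not by the database.
    private readonly ConcurrentDictionary<DateTime, Reading> _buffer =
        new ConcurrentDictionary<DateTime, Reading>();

    public void Add(Reading incoming)
    {
        // Keep the existing row unless the new one has better quality.
        _buffer.AddOrUpdate(
            incoming.Timestamp,
            incoming,
            (key, current) => incoming.Quality > current.Quality ? incoming : current);
    }

    // Called once a minute (e.g. from a hosted service): drain the buffer and hand the rows to EF Core.
    public IReadOnlyList<Reading> Drain()
    {
        var rows = new List<Reading>();
        foreach (var key in _buffer.Keys.ToList())
        {
            if (_buffer.TryRemove(key, out var row))
                rows.Add(row);
        }
        return rows;
    }

    // Visitors get "live" data straight from the buffer instead of the database.
    public IReadOnlyList<Reading> Current() => _buffer.Values.ToList();
}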

Related

Do I need a separate database to store user input data for security reason

We are building a web application and plan to run it on AWS. We created an RDS instance with MySQL. The proposed architecture is as follows:
Data is uploaded from the company data mart to the Core DB in RDS. On the other side, users post data through our REST API. This user input data will be saved in a separate DB within the same RDS instance, as one of our architects suggested. The data will then be periodically copied to a table inside the Core DB. We will have a rule engine running based on the Core DB. Whenever an exception is detected, a notification will be sent to customers.
The overall structure seems fine. One thing I would change, though, is that instead of having two separate DBs, we could just have one DB and keep the user input data in a table in that same database. The logic behind separate DBs, according to our architect, is security. Since the Core DB will hold data from our company, it is better for it to be on its own; that way the HTTP requests from clients only touch the user input DB.
While that makes sense, I am not sure it is really necessary. First, all user input is authenticated. Second, the web API provides another protection layer in front of the database, since it only allows certain requests (in this case a couple of POST endpoints). Besides, if someone can somehow still hack into the User Input DB in RDS, then since it resides on the same RDS instance and there is data transfer between the DBs, it is not impossible that they could get to Core as well.
That said, do we really need separate DBs? And if this is the way to go, what is the best way to sync from the User Input DB to a User Input table in the Core DB?
In terms of security, separating the DBs does not magically make things secure. My suggestions:
Restrict the API layer, e.g. give it only write access (just in case, to avoid accidentally deleting data)
For credentials, don't put them in source code; put them in environment variables, for example Elastic Beanstalk environment variables
For RDS itself, put it inside a VPC
In terms of synchronizing data, if you have to go with two DBs:
If your two databases have exactly the same schema, you can use the DB replication capability (such as MySQL replication)
If not, you can send the data to a message broker service (SQS), then create a worker that pulls it and saves it to the target database (see the sketch after this list)
Or you can use another service such as AWS Data Pipeline
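A minimal sketch of the SQS-plus-worker option, assuming the AWS SDK for .NET; the queue URL, the payload format and SaveToCoreDbAsync are placeholders, not something prescribed by this answer:

using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public class UserInputSyncWorker
{
    // Hypothetical queue that the REST API writes user input to.
    private const string QueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/user-input";

    public async Task RunAsync(IAmazonSQS sqs)
    {
        while (true)
        {
            var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = QueueUrl,
                MaxNumberOfMessages = 10,
                WaitTimeSeconds = 20            // long polling instead of hammering SQS
            });

            foreach (var message in response.Messages)
            {
                await SaveToCoreDbAsync(message.Body);                         // copy into the Core DB table
                await sqs.DeleteMessageAsync(QueueUrl, message.ReceiptHandle); // delete only after a successful save
            }
        }
    }

    private Task SaveToCoreDbAsync(string body)
    {
        // Placeholder: insert the payload into the user input table in the Core DB.
        return Task.CompletedTask;
    }
}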

Point Connection String to custom utility

Currently we have our ASP.NET MVC LOB web application talking to a SQL Server database. This is set up through a connection string in the web.config as usual.
We are having performance issues with some of our bigger customers that run some really large reports and KPIs on the database, which choke it up and cause performance issues for the rest of the users.
Our solution so far is to set up replication on the database and pass all the report and KPI data calls off to the replicated server, leaving the main server for the common critical use.
Without having to add another connection string to the config for the replicated server and going through the application to direct the report, KPI and other read-only calls to the secondary DB, is there a way I can point the web.config connection string to an intermediary node that analyses each data request and routes it to the appropriate DB? I.e., if the call is a standard update it is routed to the main DB, and if a report is being loaded it is passed off to the secondary replicated server.
We will only need to add this node for the bigger customers with larger DBs, so if we can get away with adding a node outside the current application setup it will save us a lot of code changes and testing.
Thanks in advance
I would say it may be easier for you to add a second connection string for reports, etc. instead of trying to analyse the request.
The reasons are as follows:
You probably have a fairly good idea which areas of your system need to go to the second database. Once you identify them, you can just point them to the second database and not worry about switching them back and forth.
You can just create two connection strings in your config file. If you have only one database for smaller customers, you can point both connections to that same database. For bigger customers, you can use two different connection strings. This way the system stays flexible and configurable (see the sketch at the end of this answer).
Analysing requests usually turns out to be complex, and adding this additional complexity seems unwarranted in this case.
All my comments are based on what you wrote above and may not be absolutely valid; you know the system better, so just use them if you want.
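A minimal sketch of the two-connection-string idea; the names "Main" and "Reports" are placeholders for whatever you put in web.config:

using System.Configuration;
using System.Data.SqlClient;

public static class Db
{
    public static SqlConnection OpenMain() => Open("Main");

    // Report/KPI code asks for the reporting connection explicitly,
    // so nothing has to analyse the request at runtime.
    public static SqlConnection OpenReports() => Open("Reports");

    private static SqlConnection Open(string name)
    {
        // For smaller customers both entries can simply point at the same database.
        var cs = ConfigurationManager.ConnectionStrings[name].ConnectionString;
        var connection = new SqlConnection(cs);
        connection.Open();
        return connection;
    }
}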

Service Broker -- how to send the table with collected data? XML messages or not?

A technology machine uses SQL Server Express to collect data like temperature... The database is updated, say, every few seconds (i.e. low traffic). The machine and its SQL Server must work independently of everything else. Sometimes the technology computer (i.e. the SQL Server) is switched off together with the machine.
The central SQL Server Standard Edition should collect the data from many machines like the one above.
What scenario would you recommend?
The machine sends each new row of a table as it is created (i.e. one row every few seconds).
The machine activates the sending process, say, each hour, and sends all the rows with the newly collected data.
Any other approach?
If I understand the Service Broker idea well, there will be a single request message type name, a single reply message type name, and a single contract name. The related databases on the SQL machines will each have one message queue and the related service.
I am very new to Service Broker. The tutorial examples show how to send messages as XML fragments. Is that the real-world way to send the rows of a table? How can I reliably convert the result of a SELECT command to an XML fragment and back?
Thanks, Petr
Attention: The related question Service Broker — how to extract the rows from the XML message? was created.
I also recommend sending as the database is updated, one message per row inserted (one message per statement, from a trigger as JAnis suggests, also works fine as long as your INSERT statements do not insert a huge number of rows in a single statement).
The reason it is better to send many small messages rather than one big message is that processing one big XML payload on RECEIVE is less efficient than processing many small XML payloads.
I've seen deployments similar to what you're trying to do, see High Volume Contigous Real Time Audit And ETL. I also recommend you read Reusing Conversations.
Service Broker will handle very well the Express instance going on and off or being disconnected for long periods. It will simply retain the messages sent and deliver them once connectivity is restored. You just have to make sure the 10 GB max database size limit of Express is enough to retain messages for as long as needed, until connectivity is back.
What Service Broker will handle poorly is the Express instance appearing under a different DNS name every time it connects back (think of a laptop that connects from home, from the office and from Starbucks). Service Broker uses database routing to send acks from the central server to your Express instance, and the routing requires a static DNS name. It is also important that the Express instance is reachable from the central server (i.e. it is not behind Network Address Translation). For a mobile instance (the laptop mentioned before), both the dynamic name and the fact that most of the time it will connect from behind NAT will cause problems. Using a VPN can address these issues, but it is cumbersome.
Well, if you implement row sending in a trigger, then the sending happens immediately (not once in a while). You can always process the collected messages on the receiver side with an activation procedure (RECEIVE) once in a while.
To send row(s) as XML you can use a statement like this in the trigger:
DECLARE @msg XML;
SET @msg =
(
    SELECT * FROM Inserted FOR XML RAW, TYPE
);
And then just send a message.

Is it better to send data to hbase via one stream or via several servers concurrently?

I'm sorry if this question is basic (I'm new to NoSQL). Basically I have a large mathematical process that I'm splitting up, having different servers process pieces and send the results to an HBase database. Each server computing the data is also an HBase region server and has Thrift on it.
I was thinking of each server processing the data and then updating HBase locally (via Thrift). I'm not sure whether this is the best approach, because I don't fully understand how the master (name) node will handle the upload/splitting.
I'm wondering what the best practice is when uploading large amounts of data (in total I suspect it will be several million rows). Is it okay to send it to the region servers, or should everything go through the master?
From this blog post,
The general flow is that a new client contacts the Zookeeper quorum (a separate cluster of Zookeeper nodes) first to find a particular row key. It does so by retrieving the server name (i.e. host name) that hosts the -ROOT- region from Zookeeper. With that information it can query that server to get the server that hosts the .META. table. Both of these two details are cached and only looked up once. Lastly it can query the .META. server and retrieve the server that has the row the client is looking for.
Once it has been told where the row resides, i.e. in what region, it caches this information as well and contacts the HRegionServer hosting that region directly. So over time the client has a pretty complete picture of where to get rows from without needing to query the .META. server again.
I am assuming you use the Thrift interface directly. In that case, even if you call a mutation from a particular region server, that region server only acts as a client. It will contact the ZooKeeper quorum and the .META. catalog to find the region to write the data to, and proceed in the same way as if the write came from any other machine.
Is it okay to send it to the region servers, or should everything go through the master?
Both are the same. There is no such thing as writing directly to a region server: the catalog (.META., found via ZooKeeper) has to be consulted to determine which region to write the output to.
If you are using a Hadoop map-reduce job with the Java API, then you can use HFileOutputFormat (bulk loading) to write HFiles directly without going through the HBase API. It is about 10x faster than using the API.

Pattern for very slow DB Server

I am building an ASP.NET MVC site where I have a fast dedicated server for the web app, but the database is stored on a very busy MS SQL Server used by many other applications.
Even though the web server is very fast, the application response time is slow, mainly because of the slow response from the DB server.
I cannot change the DB server, as all data entered in the web application needs to end up there eventually (for backup reasons).
The database is used only by the web app, and I would like to find a caching mechanism where all the data is cached on the web server and the updates are sent to the DB asynchronously.
It is not important for me to have an immediate correspondence between the data read from the DB and newly inserted data (think of reading questions on StackOverflow, where newly inserted questions do not need to show up immediately after insertion).
I thought of building an in-between WCF service that would exchange and sync the data between the slow DB server and a local one (maybe SQLite or SQL Server Express).
What would be the best pattern for this problem?
What is your bottleneck? Reading data or Writing data?
If your concern is reading data, using a memory-based caching mechanism like memcached would be a performance booster, as most of the biggest mainstream web sites do this. Scaling facebook hi5 with memcached is a good read. Implementing application-side page caches would also cut the number of queries made by the application, giving lower DB load and better response times. But this will not have much effect on the database server's load, as your database has other heavy users.
If writing data is the bottleneck, implementing some kind of asynchronous middleware storage service seems like a necessity. Putting a fast local data store on the frontend server, i.e. going with a lightweight database like MySQL or PostgreSQL (maybe not that lightweight ;) ), and using your real database as a replication slave of it, is a good choice for you.
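As a rough illustration of the read-caching idea, here is a sketch using the in-process MemoryCache just to keep it self-contained (memcached works the same way conceptually); LoadReportFromDb is a placeholder for the real, slow database call:

using System;
using System.Runtime.Caching;

public static class ReportCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static string GetReport(string reportKey)
    {
        // Serve from cache when possible so the busy DB server is hit far less often.
        var cached = Cache.Get(reportKey) as string;
        if (cached != null)
            return cached;

        var fresh = LoadReportFromDb(reportKey);

        // Data that is a few minutes stale is acceptable for this kind of page.
        Cache.Set(reportKey, fresh, DateTimeOffset.UtcNow.AddMinutes(5));
        return fresh;
    }

    private static string LoadReportFromDb(string reportKey)
    {
        // Placeholder for the expensive query against the shared SQL Server.
        return "report for " + reportKey;
    }
}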
I would do what you are already considering. Use another database for the application and only use the current one for backup purposes.
I had this problem once, and we decided to go for a combination of data warehousing (i.e. pulling data from the database every once in a while and storing it in a separate read-only database) and message queuing via a Windows service (for the updates).
This worked surprisingly well, because MSMQ ensured reliable message delivery (updates weren't lost) and the data warehousing made sure that data was available in a local database.
It still will depend on a few factors though. If you have tons of data to transfer to your web application it might take some time to rebuild the warehouse and you might need to consider data replication or transaction log shipping. Also, changes are not visible until the warehouse is rebuilt and the messages are processed.
On the other hand, this solution is scalable and relatively easy to implement. (You can use Integration Services to pull the data into the warehouse, for example, and use a BL layer for processing changes.)
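A rough sketch of the MSMQ half of that approach; the queue path, the QuestionUpdate type and WriteToCentralDb are placeholders, not the exact setup we used:

using System.Messaging;

// Placeholder payload type.
public class QuestionUpdate
{
    public int QuestionId { get; set; }
    public string Body { get; set; }
}

public static class UpdateQueue
{
    private const string Path = @".\Private$\pending-updates";

    private static MessageQueue Open()
    {
        var queue = MessageQueue.Exists(Path) ? new MessageQueue(Path) : MessageQueue.Create(Path);
        queue.Formatter = new XmlMessageFormatter(new[] { typeof(QuestionUpdate) });
        return queue;
    }

    // Called by the web application: returns immediately, MSMQ guarantees delivery.
    public static void Enqueue(QuestionUpdate update)
    {
        using (var queue = Open())
            queue.Send(update);
    }

    // Called in a loop by the Windows service that writes to the slow central database.
    public static void ProcessNext()
    {
        using (var queue = Open())
        using (var message = queue.Receive())   // blocks until a message arrives
        {
            var update = (QuestionUpdate)message.Body;
            WriteToCentralDb(update);           // placeholder for the actual DB write
        }
    }

    private static void WriteToCentralDb(QuestionUpdate update)
    {
        // Placeholder: persist the update to the central SQL Server.
    }
}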
There are many replication techniques that should give you proper results. By installing a SQL Server instance on the 'web' side of your configuration, you'll have the choice between:
Making snapshot replication from the web side (publisher) to the database-server side (subscriber). You'll need a paid version of SQL Server on the web server. I have never worked with this kind of configuration, but it might use a lot of the web server's resources at the scheduled synchronization times
Making merge (or transactional, if needed) replication between the database-server side (publisher) and the web side (subscriber). You can then use the free version of MS SQL Server and schedule the synchronization process to run according to your tolerance for potential loss of data if the web server goes down.
I wonder if you could improve things by adding an MDF file on your web side instead of dealing with the server at another IP...
Just add a SQL Server 2008 Express Edition database file and try it; as long as you don't exceed 4 GB of data you will be OK. Of course there are more restrictions, but just for the speed of it, why not try?
You should also consider the network switches involved. If the DB server is talking to a number of web servers, then it may be constrained by the network connection speed. If they are only connected via a 100 Mb network switch, you may want to look at upgrading that too.
The WCF service would be a very poor engineering solution to this problem; why build your own when you can use the standard SQL Server connectivity mechanisms to ensure data is transferred correctly? Log shipping will send the data across at selected intervals.
This way you get the fast local SQL Server, and the data is preserved correctly on the slow backup server.
You should investigate the slow SQL Server though; the performance problem may have nothing to do with its load and more to do with the queries and indexes you're asking it to work with.

Resources