RavenDB Database Configuration

How can I configure RavenDB so that a client is allowed to send more requests per session, or receive larger responses, than the defaults permit?
By default, RavenDB will not allow operations that might compromise the stability of either the server or the client, and a RavenDB session automatically enforces the following limitations:
If a page size value is not specified, the results are limited to 128. On the server side there is also a hard limit on the page size of 1,024 results.
The number of remote calls to the server per session is limited to 30.
I want to configure the DocumentStore/DocumentSession on the client to increase the page size limit and the number of remote calls to the server per session.

You can adjust the max page size with the Raven/MaxPageSize setting as described here. You can adjust the max number of session requests via IDocumentStore.Conventions.MaxNumberOfRequestsPerSession. However, a better approach is to architect your application such that you don't need large sessions. Instead, prefer to create sessions for small units of work and dispose of them. If a certain operation requires a large number of requests, batch them into groups of 1,024 or so.
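If the work genuinely requires thousands of calls, the batching advice above can be sketched as follows. This is illustrative Python, not the RavenDB API: `chunks` and the per-batch session body are placeholders for whatever each short-lived session actually does.

```python
# Split a large workload into batches of at most 1,024 items so that
# each RavenDB session stays well under its request limit.

def chunks(items, size=1024):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def process_all(doc_ids, batch_size=1024):
    batch_sizes = []
    for batch in chunks(doc_ids, batch_size):
        # In a real application, open a fresh short-lived session here
        # (e.g. `with store.open_session() as session: ...`), do this
        # batch's work, and let the session be disposed before the next.
        batch_sizes.append(len(batch))
    return batch_sizes
```

Each batch gets its own session, so no single session ever approaches the 30-request or page-size safety limits.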

Related

Fill Factor 80 vs 100

I am rebuilding some indexes in Azure SQL using a fill factor of 80 (recommended by the company that developed the application, who are not database experts) and after doing this, queries got a LOT slower. We noticed that they were now spending much longer in "Network I/O" waits. Does anybody know what the problem might be?
Fill factor is not a silver bullet and has its trade-offs. https://www.mssqltips.com/sqlservertip/5908/what-is-the-best-value-for-fill-factor-in-sql-server/
It is important to note what effect the lower fill factor has on the underlying data pages and index pages that comprise your table:
With a fill factor of 80, each page is left 20% empty, so roughly 25% more pages are allocated for the same number of records!
This causes increased I/O. Depending on your Azure storage/compute plan you may be hitting a ceiling and need to bump up your IOPS.
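A quick back-of-envelope calculation shows where the 25% comes from. This sketch assumes SQL Server's roughly 8,060 usable bytes of row data per 8 KB page; the exact figure varies with row overhead.

```python
import math

PAGE_BYTES = 8060  # approximate usable row bytes per SQL Server data page

def pages_needed(total_row_bytes, fill_factor):
    """Pages required to hold the rows at a given fill factor (1-100)."""
    usable_per_page = PAGE_BYTES * fill_factor / 100
    # Round up: a partially filled page still costs a full page of I/O.
    return math.ceil(total_row_bytes / usable_per_page)
```

For any fixed amount of row data, `pages_needed(n, 80)` comes out 25% higher than `pages_needed(n, 100)`, and every extra page is an extra page to read, cache, and back up.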
Now, if you are not running out of IOPS, there's more to look into. Is it possible that the index rebuild operation had not completed yet and the index is not being used for query optimization? A Profiler trace or the execution plan can confirm this.
I'd say that if you have a very large table and want to speed things up dramatically, your best bet is partitioning on the column most commonly used to address the data.
See also: https://www.sqlshack.com/database-table-partitioning-sql-server/
Try to identify queries returning large data sets to client hosts. Large result sets may lead to unnecessary network utilization and client application processing. Make sure queries are returning only what is needed using filtering and aggregations, and make sure no duplicates are being returned unnecessarily.
Another possible cause of that wait on Azure SQL may be that the client application doesn't fetch results fast enough and doesn't notify Azure SQL that the result set has been received. On the client application side, store the results in memory first and only then do further processing. Make sure the client application is not under stress that makes it unable to consume the results faster.
One last thing: make sure Azure SQL and the application are in the same region, and there is no transfer of data across different regions or zones.

10,000 HTTP requests per minute performance

I'm fairly experienced with web crawlers; however, this question is in regards to performance and scale. I need to request and crawl 150,000 URLs over an interval (most URLs are every 15 minutes, which makes it about 10,000 requests per minute). These pages have a decent amount of data (around 200 KB per page). Each of the 150,000 URLs exists in our database (MSSQL) with a timestamp of the last crawl date, and an interval so we know when to crawl again.
This is where we get an extra layer of complexity. They do have an API which allows for up to 10 items per call. The information we need exists partially only in the API, and partially only on the web page. The owner is allowing us to make web calls and their servers can handle it; however, they cannot update their API or provide direct data access.
So the flow should be something like: Get 10 records from the database that intervals have passed and need to be crawled, then hit the API. Then each item in the batch of 10 needs their own separate web-requests. Once the request returns the HTML we parse it and update records in our database.
I am interested in getting some advice on the correct way to handle the infrastructure. Assuming a multi-server environment some business requirements:
Once a URL record is ready to be crawled, we want to ensure it is only grabbed and run by a single server. If two servers check it out simultaneously and run it, it can corrupt our data.
The workload can vary, currently, it is 150,000 url records, but that can go much lower or much higher. While I don't expect more than a 10% change per day, having some sort of auto-scale would be nice.
After each request returns the HTML we need to parse it and update records in our database with the individual data pieces. Some host providers allow free incoming data but charge for outgoing. So ideally the code base that requests the webpage and then parses the data also has direct SQL access. (As opposed to a micro-service approach)
Something like a multi-server blocking collection(Azure queue?), autoscaling VMs that poll the queue, single database host server which is also queried by MVC app that displays data to users.
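The "only one server may grab a record" requirement is commonly solved with an atomic check-out column rather than a distributed lock. A minimal sketch, using SQLite purely for illustration (on SQL Server you would typically use `UPDATE ... OUTPUT` or `UPDLOCK, READPAST` hints for the same effect; the `urls` table and its columns are hypothetical):

```python
import sqlite3

def claim_batch(conn, worker_id, limit=10):
    """Atomically mark up to `limit` unclaimed records as ours."""
    candidates = conn.execute(
        "SELECT id FROM urls WHERE claimed_by IS NULL LIMIT ?",
        (limit,),
    ).fetchall()
    claimed = []
    for (rid,) in candidates:
        # The WHERE clause re-checks claimed_by IS NULL, so if another
        # server claimed the row first, rowcount is 0 and we skip it.
        cur = conn.execute(
            "UPDATE urls SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
            (worker_id, rid),
        )
        if cur.rowcount == 1:
            claimed.append(rid)
    return claimed
```

Because the update is conditional on the row still being unclaimed, two servers racing for the same record can never both win it; the loser just picks up different rows on its next pass.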
Any advice or critique is greatly appreciated.
Messaging
I echo Evandro's comment and would explore Service Bus Message Queues or Event Hubs for loading a queue to be processed by your compute nodes. Message Queues support record locking which, based on your write-up, might be attractive.
Compute Options
I also agree that Azure Functions would provide a good platform for scaling your compute/processing operations (calling the API and scraping HTML). In addition, Azure Functions can be triggered by Message Queues, Event Hubs, or Event Grid. [Note: Event Grid allows you to connect various Azure services (pub/sub) with durable messaging. So it might play a helpful middle-man role in your scenario.]
Another option for compute could be Azure Container Instances (ACI) as you could spin up containers on demand to process your records. This does not have the same auto-scaling capability that Functions does though and also does not support the direct binding operations.
Data Processing Concern (Ingress/Egress)
Indeed Azure does not charge for data ingress but any data leaving Azure will have an egress charge after the initial 5 GB each month. [https://azure.microsoft.com/en-us/pricing/details/bandwidth/]
You should be able to have the Azure Functions handle calling the API, scraping the HTML, and writing to the database. You might have to break those up into separate Functions, but you can chain Functions together easily, either directly or with Logic Apps.

How to handle issues when we run out of connections to database?

I am looking for what are the commonly used / best practices in industry.
Assume the following hypothetical scenario:
Suppose my app server accepts 200 user requests, and each of them needs DB access.
But my DB max_connections is 100.
If all 200 users send requests at the same time, what happens to the other requests that cannot get one of the 100 connections?
In real world:
will the remaining 100 requests be stored in some sort of queue on the app servers, waiting for DB connections?
do we error out ?
Basically, if your database server can only handle 100 connections, and all of your web connections "require database access," you must ensure that no more than 100 requests are allowed to be active at any one instant. This is the “ruling constraint” of this scenario. (It may be one of many.)
You can accept "up to 200 simultaneous connections" on your server, but you must enqueue these requests so that the limit of 100 active requests is not exceeded.
There are many, many ways to do that: load balancers, application servers, even Apache/nginx directives. Sometimes the web page is only a front-end to a process that is broken out among many different servers working in parallel. No matter how you do it, there must be a way to regulate how many requests are active, and to queue the remainder.
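The "regulate and queue" idea can be sketched in a few lines, assuming a threaded app server: a semaphore caps how many requests may hold a database connection at once, and the rest simply wait in line.

```python
import threading

MAX_DB_CONNECTIONS = 100

# At most MAX_DB_CONNECTIONS requests may be inside the `with` block
# at any instant; the rest block here, queued, until a slot frees up.
db_gate = threading.BoundedSemaphore(MAX_DB_CONNECTIONS)

def handle_request(do_db_work):
    with db_gate:
        return do_db_work()
```

Real stacks get this behavior from a connection pool (HikariCP, pgbouncer, SQLAlchemy's QueuePool, and so on) rather than a raw semaphore, but the principle is the same: the pool size is the hard cap, and callers queue for a free connection.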
Also note that, even though you might have “200 active connections” to your web server, it is highly unlikely that all 200 of these clients “clicked their mouse at precisely the same time.” Requests come in at random rates and therefore might never encounter any sort of delay at all. But your system must nonetheless be engineered to handle the worst case.

Indexing about 300,000 triples in Sesame using Camel

I have a Camel context configured to do some manipulation of input data in order to build RDF triples.
There's a final route with a processor that, using Sesame Client API, talks to a separate Sesame instance (running on Tomcat with 3GB of RAM) and sends add commands (each command contains about 5 - 10 statements).
The processor is running as a singleton and the corresponding "from" endpoint has 10 concurrentConsumers (I tried with 1, then 5, then 10 - more or less the same behaviour).
I'm using HttpRepository from my processor for sending add commands and, while running, I observe a (rapid and) progressive degrade of performance in indexing. The overall process starts indexing triples very quickly but after a little bit the committed statements grow very slowly.
On the Sesame side I used both MemoryStore and NativeStore, but the (performance) behaviour seems more or less the same.
The questions:
Which kind of store is recommended if I want to speed up the indexing phase?
Is the Repository.getConnection doing some kind of connection pooling? In other words, can I open and close a connection each time the "add" processor does its work?
Given that I first need to create a store with all those triples, is it preferable to create a "local" Sail store instead of one managed by a remote Sesame server (in which case I wouldn't use an HTTPRepository)?
I am assuming that you're adding using transactions of 4 or 5 statements for good reason, but if you have a way to do larger transactions, that will significantly boost speed. Ideal (and quickest) would be to just send all 300,000 triples to the store in a single transaction.
Your questions, in order:
If you're only storing 300,000 statements the choice of store is not that important, as both native and memory can easily handle this kind of scale at good speed. I would expect the memory store to be slightly more performant, especially if you have configured it to use a non-zero sync delay for persistence, but native has a lower memory footprint and is of course more robust.
HTTPRepository.getConnection does not pool the actual RepositoryConnection itself, but internally pools resources (so the actual HttpConnections that Sesame uses internally are pooled). So getConnection is relatively cheap, and opening and closing multiple connections is fine - though you might consider reusing the same connection for multiple adds, so that you can batch multiple adds in a single transaction.
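The "reuse one connection and batch adds into larger transactions" suggestion looks roughly like this. This is Python pseudocode of the pattern only: `conn` stands in for a Sesame RepositoryConnection with begin/add/commit methods, which in reality is a Java API.

```python
class BatchingAdder:
    """Buffer adds on one connection and commit in large transactions."""

    def __init__(self, conn, batch_size=5000):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = 0

    def add(self, statement):
        if self.pending == 0:
            self.conn.begin()          # open a transaction lazily
        self.conn.add(statement)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        """Commit whatever is buffered; call once more when done."""
        if self.pending:
            self.conn.commit()
            self.pending = 0
```

With a batch size of a few thousand, 300,000 triples become a few dozen commits instead of tens of thousands, which is where most of the indexing speed-up comes from.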
Whether to store locally or on a remote server really depends on you. Obviously a local store will be quicker because you eliminate network latency as well as the cost of (de)serializing, but the downside is that a local store is not easily made available outside your own application.

Caching in Google App Engine/Cloud Based Hosting

I am curious as to how caching works in Google App Engine or any cloud based application. Since there is no guarantee that requests are sent to same sever, does that mean that if data is cached on 1st request on Server A, then on 2nd requests which is processed by Server B, it will not be able to access the cache?
If that's the case (cache only local to a server), won't it be unlikely (depending on the number of users) that a request hits the cache? E.g. Google probably has thousands of servers.
With App Engine you cache using memcached. This means that a cache server will hold the data in memory (rather than each application server). The application servers (for a given application) all talk to the same cache server (conceptually; there could be sharding or replication going on under the hood).
In-memory caching on the application server itself will potentially not be very effective, because there is more than one of those (although for your given application there are only a few instances active, it is not spread out over all of Google's servers), and also because Google is free to shut them down all the time (which is a real problem for Java apps that take some time to boot up again, so now you can pay to keep idle instances alive).
In addition to these performance/effectiveness issues, in-memory caching on the application server could lead to consistency problems (every refresh shows different data when the caches are not in sync).
Depends on the type of caching you want to achieve.
Caching on the application server itself can be interesting if you have complex in-memory object structure that takes time to rebuild from data loaded from the database. In that specific case, you may want to cache the result of the computation. It will be faster to use a local cache than a shared memcache to load if the structure is large.
If having a consistent value between in-memory cache and the database is paramount, you can do a checksum/timestamp check against a stored value in the datastore every time you use the cached value. Storing the checksum/timestamp on a small object or in a global cache will speed up that check.
One big issue when using global memcache is ensuring proper synchronization on "refilling" it, when a value is not yet present or has been flushed. If you have multiple servers doing the check at the exact same time, you may end up having several distinct servers doing the refill at once. If the operation is idempotent, this is not a problem; if not, it is a potential and very hard to trace bug.
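One common way to get that synchronization is to use memcache's atomic `add` as a cheap lock, so exactly one server performs the refill. A sketch, assuming a client object exposing `get`/`add`/`set`/`delete` (python-memcached and pymemcache have roughly this shape, though exact signatures vary):

```python
import time

def get_or_refill(cache, key, compute, lock_ttl=30):
    value = cache.get(key)
    if value is not None:
        return value
    # `add` is atomic: it succeeds only if the key does not exist yet,
    # so exactly one caller wins the lock and recomputes the value.
    if cache.add(key + ":lock", 1, lock_ttl):
        try:
            value = compute()
            cache.set(key, value)
        finally:
            cache.delete(key + ":lock")
        return value
    # Losers poll briefly for the winner's result instead of recomputing.
    # (A production version would also give up after a deadline, in case
    # the winner died before writing the value.)
    while value is None:
        time.sleep(0.05)
        value = cache.get(key)
    return value
```

If the refill is expensive and non-idempotent, this turns a thundering herd of simultaneous recomputes into a single one, with everyone else reading the freshly cached result.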
