Solr Qtime difference to real time - solr

I am trying to find out what causes the difference between the QTime and the actual response time in my Solr application.
The SolrServer is running on the same machine as the program generating the queries.
I am getting QTimes averaging around 19 ms, but it takes 30 ms to actually get my response.
This may not sound like much, but I am using Solr for some obscure stuff where every millisecond counts.
I figured that the time difference is not caused by disk I/O, since using RAMDirectoryFactory did not speed up anything.
Using a SolrEmbeddedServer instead of a SolrHttpServer did not cause a speedup either (so it is not Jetty that causes the difference?).
Is the data transfer between the query program and the Solr instance causing the time difference? And even more important, how can I minimize this time?
Regards

This is a well-known FAQ:
Why is the QTime Solr returns lower than the amount of time I'm measuring in my client?
"QTime" only reflects the amount of time Solr spent processing the request. It does not reflect any time spent reading the request from the client across the network, or writing the response back to the client. (This should hopefully be obvious since the QTime is actually included in the body of the response.)
The time spent on this network I/O can be a non-trivial contribution to the total time as observed from clients, particularly because there are many cases where Solr can stream "stored fields" for the response (ie: requested by the "fl" param) directly from the index as part of the response writing, in which case disk I/O reading those stored field values may contribute to the total time observed by clients outside of the time measured in QTime.
How to minimize this time?
Not sure if it will have any effect, but make sure you are using the javabin format, not JSON or XML (wt=javabin).
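For what it's worth, here is a minimal SolrJ sketch that forces the binary (javabin) response format and compares Solr's reported QTime against the wall-clock time seen by the client. The URL and core name are placeholders; SolrJ already defaults to the binary parser, so setting it is mostly to make the intent visible.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.BinaryResponseParser;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QTimeVsWallClock {
    public static void main(String[] args) throws Exception {
        // Placeholder base URL and core name; BinaryResponseParser = javabin.
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore")
                .withResponseParser(new BinaryResponseParser())
                .build();

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(10);

        long start = System.nanoTime();
        QueryResponse rsp = client.query(query);
        long clientMs = (System.nanoTime() - start) / 1_000_000;

        // QTime is what Solr measured internally; clientMs additionally includes
        // network I/O and response writing, so it will normally be larger.
        System.out.println("QTime=" + rsp.getQTime() + "ms, client side=" + clientMs + "ms");
        client.close();
    }
}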

Related

Why does the Kinesis shard iterator fall behind when using BoundedOutOfOrdernessTimestampExtractor

I'm using KDA with a Flink job which should analyse messages emitted by different IoT device sources. There is a Kinesis stream with 4 shards, each of which contains more or less the same amount of data (there are no hot shards). The Kinesis stream is filled by AWS Greengrass Streammanager, which uses an increasing sequence number as the partition key. Each message contains a single value (something like temperature = 5).
With this setup, the stream read by the Kinesis consumer in Flink is unordered, but I need to preserve the order of the messages. To do so I have written a small buffer function, which is more or less the logic from CepOperator, to buffer messages and restore the order. For that, the stream is keyed by the id of a message. Let's say a temperature message always has a unique id, so the stream is keyed by this id.
To create the respective watermarks I'm using the FlinkKinesisConsumer and register a BoundedOutOfOrdernessTimestampExtractor there. If I use an out-of-orderness time of 10 seconds, everything works fine except that I get almost 50% late arrivals, which is not the desired behaviour. But if I increase the time to 60 seconds, the iterator of the Kinesis stream falls significantly behind (growing linearly over time). The documentation of the Kinesis consumer says little about the settings here. I have also tried to register a JobManagerWatermarkTracker, but it does not seem to change the behaviour.
I do not understand why a higher out-of-orderness makes the iterator fall increasingly behind, while a smaller setting drops a significant amount of messages. What measures do I need to take to find the proper settings, or is my implementation wrong?
UPDATE:
While investigating the issue I found that if the JobManagerWatermarkTracker isn't properly configured (I still don't understand how to configure it), the alignment to the global watermark stops subtasks from reading from the Kinesis stream, which causes the iterator to fall behind. I calculated a delta of how much "latency" a dropped event has and set this as the out-of-orderness (in this case 60 seconds). With the JobManagerWatermarkTracker deactivated, everything works as expected.
Furthermore, it seems that AWS Greengrass Streammanager isn't optimal for such use cases, as it distributes the load evenly across shards; with an increasing number of shards this isn't optimal, since one temperature datapoint might be spread across all shards of a stream. That introduces a lot of unnecessary latency. I appreciate any input on how to configure the JobManagerWatermarkTracker.
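For reference, here is a minimal sketch of the consumer setup the update describes: a 60-second out-of-orderness bound registered directly on the FlinkKinesisConsumer, with no JobManagerWatermarkTracker. The stream name, region, and the timestamp/id parsing helpers are placeholders, and the downstream buffering/reordering logic from the question is not reproduced.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class KinesisOrderingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties config = new Properties();
        config.put(AWSConfigConstants.AWS_REGION, "eu-west-1"); // placeholder region
        config.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        FlinkKinesisConsumer<String> consumer =
                new FlinkKinesisConsumer<>("temperature-stream", new SimpleStringSchema(), config);

        // Per-shard watermarks with a 60-second out-of-orderness bound.
        consumer.setPeriodicWatermarkAssigner(
                new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(60)) {
                    @Override
                    public long extractTimestamp(String record) {
                        return parseEventTimeMillis(record); // hypothetical parsing helper
                    }
                });

        env.addSource(consumer)
           .keyBy(KinesisOrderingSketch::extractId) // key by message id, buffer/reorder downstream
           .print();

        env.execute("kinesis-ordering-sketch");
    }

    // Stubs standing in for message-format-specific parsing, which the question does not show.
    private static long parseEventTimeMillis(String record) { return System.currentTimeMillis(); }

    private static String extractId(String record) { return record; }
}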

Why does my SnappyData cluster face slow queries about once a day

When my cluster has been running for a certain time (maybe a day, maybe two), some queries become very slow, taking about 2-10 minutes to finish. When this happens I need to restart the whole cluster and the queries become normal again, but after some time the very slow queries come back.
The query response time depends on multiple factors, including:
1. Table size: if the table size grows with time, then response time will also increase
2. If it is the open source version, time spent in GC pauses, which in turn depends on the number of objects/garbage present in the JVM heap
3. The number of concurrent queries being run
4. The amount of data overflowed to disk
You will need to describe your usage pattern of SnappyData in detail. Only then would it be possible to characterise the issue.
Some of the questions that should be answered are:
1. What is the cluster size?
2. What are the table sizes?
3. Are writes happening continuously on the tables, or are only queries being executed?
You can engage us on the Slack channel to provide information about your clusters.

GDAX Websocket API - Level 2 timestamp accuracy

I'm currently using the level2 order book channel via the GDAX WebSocket API. Quite recently a "time" field started appearing on the l2update JSON messages, and this doesn't appear to be documented on the API reference pages. Some questions:
What does this time field represent, and is it reliable enough to use? Is it the message sending time from GDAX?
If it is the sending time, I am occasionally seeing latencies of up to two minutes - is this expected?
Thanks!
I am playing with the L2 APIs right now and have the same question. I'm seeing a range of timestamps from 4000 ms delayed to -300 ms (in the future).
The negative number makes me feel that the time can't be trusted. I attempted connecting from 2 different datacenters and from home, and I can replicate both sides of the problem.
I've been using this field reliably for a couple of months, assuming it is the time the order was received, and it was typically coming out as a pretty consistent 0.05 second lag relative to my system time; however, in the last few days it's been increasing - over 1 second yesterday and 2.02 seconds right now. I note (https://status.gdax.com/) maintenance was carried out on May 9th, but it was fine for me for a few days after that.
To answer the question more directly: no, 2 minute latencies are not expected. I would check that your system time is correct. A quick Google search brings up https://time.is/.
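A quick sanity check along those lines, as a sketch: parse the ISO-8601 "time" field from an l2update message and compare it against the local clock. This assumes the local clock is NTP-synced and that the field is an ISO-8601 timestamp, as it appears to be.

import java.time.Duration;
import java.time.Instant;

public class L2LagCheck {
    // Returns how far behind (positive) or ahead (negative) the message's
    // "time" field is relative to the local clock, in milliseconds.
    static long lagMillis(String timeField) {
        Instant sent = Instant.parse(timeField); // e.g. "2018-05-15T12:34:56.789Z"
        return Duration.between(sent, Instant.now()).toMillis();
    }

    public static void main(String[] args) {
        // Placeholder timestamp; in practice, feed in the "time" value of each l2update.
        System.out.println(lagMillis("2018-05-15T12:34:56.789Z") + " ms");
    }
}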

SOLR Streaming VS Search

What are the key differences between streaming and normal search in terms of internal implementation? A normal search would also work in a distributed manner, right? How does streaming improve performance? The documentation is not helping out.
Distributed search issues requests, calculates results, delivers results and handles merging. Each step is processed fully before continuing to the next step. This works well enough when the amount of data is small. For larger requests, such as delivering millions of documents, this requires huge memory buffers. It also means that the caller has to wait until the very last step (delivery of the result to the caller) before the result can be processed.
With streaming, all this is ongoing: Calculation, delivery and merging happens at the same time, with a fixed upper memory overhead. You can ask for 10K results or you can ask for 10 billion, the only difference is how long it takes. As every part of the process is active at the same time, including delivery to the caller, this also means that the first result data will be delivered to the caller very quickly.
Internally, streaming is basically paging through the search result. Each page (10K documents, if I remember correctly) is passed along to the stream as soon as it has been calculated. Ignoring optimizations, the same behaviour could be emulated from the outside with deep paging and a custom merger.
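As an illustration, a streaming request can be issued from SolrJ against the /stream handler and read tuple by tuple as results arrive. This is only a sketch; the collection name, fields, and URL are placeholders, and it assumes a SolrCloud setup with the /export handler available (fields used in fl/sort must have docValues).

import java.io.IOException;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class StreamingSketch {
    public static void main(String[] args) throws IOException {
        // A streaming expression that exports the full sorted result set via /export.
        String expr = "search(mycollection, q=\"*:*\", fl=\"id,price\", sort=\"id asc\", qt=\"/export\")";

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("expr", expr);
        params.set("qt", "/stream");

        SolrStream stream = new SolrStream("http://localhost:8983/solr/mycollection", params);
        try {
            stream.open();
            // Tuples are consumed as they are produced; the EOF tuple marks the end.
            for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
                System.out.println(tuple.getString("id"));
            }
        } finally {
            stream.close();
        }
    }
}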

Recommended Document Structure for CouchDB

We are currently considering a change from Postgres to CouchDB for a usage monitoring application. Some numbers:
Approximately 2000 connections, polled every 5 minutes, for approximately 600,000 new rows per day. In Postgres, we store this data, partitioned by day:
t_usage {service_id, timestamp, data_in, data_out}
t_usage_20100101 inherits t_usage.
t_usage_20100102 inherits t_usage. etc.
We write data with an optimistic stored proc that presumes the partition exists and creates it if necessary. We can insert very quickly.
For reading of the data, our use cases, in order of importance and current performance are:
* Single Service, Single Day Usage : Good Performance
* Multiple Services, Month Usage : Poor Performance
* Single Service, Month Usage : Poor Performance
* Multiple Services, Multiple Months : Very Poor Performance
* Multiple Services, Single Day : Good Performance
This makes sense because the partitions are optimised for days, which is by far our most important use case. However, we are looking at methods of improving the secondary requirements.
We often need to parameterise the query by hours as well, for example, only giving results between 8am and 6pm, so summary tables are of limited use. (These parameters change with enough frequency that creating multiple summary tables of data is prohibitive).
With that background, the first question is: Is CouchDB appropriate for this data? If it is, given the above use cases, how would you best model the data in CouchDB documents? Some options I've put together so far, which we are in the process of benchmarking are (_id, _rev excluded):
One Document Per Connection Per Day
{
  "service_id": 555,
  "day": 20100101,
  "usage": {"1265248762": {"in": 584, "out": 11342}, "1265249062": {"in": 94, "out": 1242}}
}
Approximately 60,000 new documents a month. Most new data would be updates to existing documents, rather than new documents.
(Here, the objects in usage are keyed on the timestamp of the poll, and the values are the bytes in and bytes out.)
One Document Per Connection Per Month
{
  "service_id": 555,
  "month": 201001,
  "usage": {"1265248762": {"in": 584, "out": 11342}, "1265249062": {"in": 94, "out": 1242}}
}
Approximately 2,000 new documents a month. Moderate updates to existing documents required.
One Document Per Row of Data Collected
{
  "service_id": 555,
  "timestamp": 1265248762,
  "in": 584,
  "out": 11342
}
{
  "service_id": 555,
  "timestamp": 1265249062,
  "in": 94,
  "out": 1242
}
Approximately 15,000,000 new documents a month. All data would be an insert into a new document. Faster inserts, but I have questions about how efficient it's going to be after a year or two with hundreds of millions of documents. The file I/O would seem prohibitive (though I'm the first to admit I don't fully understand the mechanics of it).
I'm trying to approach this in a document-oriented way, though breaking the RDBMS habit is difficult :) The fact that you can only minimally parameterise views also has me a bit concerned. That said, which of the above would be the most appropriate? Are there other formats I haven't considered that would perform better?
Thanks in advance,
Jamie.
I don't think it's a horrible idea.
Let's consider your Connection/Month scenario.
Given that an entry is ~40 characters long (that's generous), and you get ~8,200 entries per month, your final document size will be ~350K at the end of the month.
That means, going full bore, you'll be reading and writing 2000 350K documents every 5 minutes.
I/O-wise, this is less than 6 MB/s, considering read and write, averaged over the 5-minute window. That's well within even low-end hardware today.
However, there is another issue. When you store that document, Couch is going to evaluate its contents in order to build its views, so Couch will be parsing 350K documents. My fear is that (at last check, but it's been some time) Couch doesn't scale well across CPU cores, so this could easily pin the single CPU core that Couch will be using. I would like to hope that Couch can read, parse, and process 2 MB/s, but I frankly don't know. With all its benefits, Erlang isn't the best haul-ass-in-a-straight-line computer language.
The final concern is keeping up with the database. At the end of the month this will mean writing 700 MB every 5 minutes. With Couch's append-only architecture, you will be writing 700 MB of data every 5 minutes, which is 8.1 GB per hour, and 201 GB after 24 hours.
After DB compaction, it crushes down to 700 MB (for a single month), but during that process the file will be getting big, and quite quickly.
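For convenience, here is the arithmetic above as a tiny back-of-the-envelope program. It is only a sketch; rounding and MB-vs-MiB conventions account for the small differences from the figures quoted above.

public class BackOfEnvelope {
    public static void main(String[] args) {
        int connections = 2000;
        double docMB = 350.0 / 1024;          // ~350K document at month end
        double windowSeconds = 5 * 60;        // each document rewritten every 5 minutes

        // Read + write of every month-end document per 5-minute window, averaged to MB/s.
        double ioMBPerWindow = connections * docMB * 2;
        System.out.printf("Average I/O: %.1f MB/s%n", ioMBPerWindow / windowSeconds);

        // Append-only writes accumulated over time.
        double writtenMBPerWindow = connections * docMB;      // roughly 700 MB per 5 minutes
        double writtenGBPerHour = writtenMBPerWindow * 12 / 1024;
        System.out.printf("Appended: ~%.0f MB / 5 min, ~%.1f GB / hour, ~%.0f GB / day%n",
                writtenMBPerWindow, writtenGBPerHour, writtenGBPerHour * 24);
    }
}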
On the retrieval side, these large documents don't scare me. Loading up a 350K JSON document, yes it's big, but it's not that big, not on modern hardware. There are avatars on bulletin boards bigger than that. So anything you want to do regarding the activity of a connection over a month will be pretty fast, I think. Across connections, obviously the more you grab, the more expensive it will get (700 MB for all 2000 connections). 700 MB is a real number that has real impact. Plus, your process will need to be aggressive in throwing out the data you don't care about so it can discard the chaff (unless you want to load up 700 MB of heap in your report process).
Given these numbers, Connection/Day may be a better bet, as you can control the granularity a bit better. However, frankly, I would go for the coarsest document you can, because I think that gives you the best value from the database, solely because today all the head seeks and platter rotations are what kill a lot of I/O performance; many disks stream data very well. Larger documents (assuming well-located data; since Couch is constantly compacted, this shouldn't be a problem) stream more than they seek. Seeking in memory is "free" compared to a disk.
By all means run your own tests on your hardware, but take all these considerations to heart.
EDIT:
After more experiments...
A couple of interesting observations.
During import of large documents, CPU is just as important as I/O speed. This is because of the amount of marshalling and CPU consumed by converting the JSON into the internal model for use by the views. Using the large (350K) documents, my CPUs were pretty much maxed out (350%). In contrast, with the smaller documents they were humming along at 200%, even though, overall, it was the same information, just chunked up differently.
For I/O, with the 350K docs I was charting 11 MB/sec, but with the smaller docs it was only 8 MB/sec.
Compaction appeared to be almost I/O bound. It's hard for me to get good numbers on my I/O potential: a copy of a cached file pushes 40+ MB/sec, while compaction ran at about 8 MB/sec. But that's consistent with the raw load (assuming Couch is moving stuff message by message). The CPU usage is lower, as it's doing less processing (it's not interpreting the JSON payloads or rebuilding the views), plus it was a single CPU doing the work.
Finally, for reading, I tried to dump out the entire database. A single CPU was pegged for this, and my I/O was pretty low. I made it a point to ensure that the CouchDB file wasn't actually cached; my machine has a lot of memory, so a lot of stuff is cached. The raw dump through _all_docs was only about 1 MB/sec. That's almost all seek and rotational delay, more than anything else. When I did the same with the large documents, the I/O was hitting 3 MB/sec, which just shows the streaming effect I mentioned as a benefit of larger documents.
It should also be noted that there are techniques on the Couch website for improving performance that I was not following. Notably, I was using random IDs. Also, this wasn't done to gauge what Couch's performance is, but rather where the load appears to end up. I thought the large vs. small document differences were interesting.
Finally, ultimate performance isn't as important as simply performing well enough for your application on your hardware. As you mentioned, you're doing your own testing, and that's all that really matters.
