I have a join query which I need to optimize using OR and stats; I am new to Splunk and confused about how to start - query-optimization

index="index1" sourcetype=sourcetype1 | join commonfield [ search <br>index="index2" sourcetype=sourcetype2 ] | sort _time | stats <br>last(index1field1) as state by index2field1, index1field2, index1field3 <br>| where index1field1 != "UP" | dedup index2field1 | stats count
I want to optimize this query without join, using stats and OR. Can anyone help me?

(index="index1" sourcetype=sourcetype1) OR (index="index2" sourcetype=sourcetype2)
| stats values(*) AS *, values(_*) AS _* by commonfield
This will be a fairly good starting point: bring in both sets of data initially, then merge all the fields from both sources, "joining" based on the commonfield.
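From there, the rest of the original pipeline can be folded into the merged results. Treat the following as a rough sketch rather than a drop-in replacement: latest() uses _time directly (standing in for the original sort _time | stats last(...) pattern), the where clause filters on the renamed state field (which appears to be what where index1field1 != "UP" intended), and whether the grouping should be by commonfield or by index2field1/index1field2/index1field3 depends on how the two sourcetypes actually relate, so compare the counts against the join version:

(index="index1" sourcetype=sourcetype1) OR (index="index2" sourcetype=sourcetype2)
| stats latest(index1field1) AS state
        values(index2field1) AS index2field1
        values(index1field2) AS index1field2
        values(index1field3) AS index1field3
        by commonfield
| where state != "UP"
| dedup index2field1
| stats count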

Related

Advice on refactoring/scaling enormous events-based production table in PostgreSQL

I am hoping to get some advice regarding next steps to take for our production database that, in hindsight, has been very ill thought out.
We have an events based system that generates payment information based on "contracts" (a "contract" defines how/what the user gets paid).
Example schema:
Approx            Quantity is essentially      Quantity is slightly
5M rows           1-to-1 with "Events",        less, approx.
                  i.e. ~5M rows                3M rows

----------        ----------------             --------------
| Events |        | Contract     |             | Payment    |
----------        ----------------             --------------
| id     |        | id           |             | id         |
| userId |   ->   | eventId      |   ---->     | contractId |
| jobId  |        | <a bunch of  |             | amount     |
| date   |        |  data to calc|             | currency   |
----------        |  payments>   |             --------------
                  ----------------
There is some additional data on each table too such as generic "created_at", "updated_at", etc.
These are the questions I have currently come up with (please let me know if I'm barking up the wrong tree):
Should I denormalise the payment information?
Currently, in order to get a payment based on an event ID we need to do a minimum of one join, but realistically we have to do two (there are rarely scenarios where we want a payment based only on some contract information). With 5M rows (and growing daily) it takes well over 20s to do a plain select with a sort, let alone a join. My thought is that by moving the payment info directly onto the Events table we can skip the joins needed to get payment information.
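Concretely, the lookup being described looks roughly like this; table and column names are taken from the schema sketch above, so treat it as illustrative rather than the real query:

SELECT p.amount, p.currency
FROM event e
JOIN contract c ON c."eventId" = e.id        -- join 1: event -> contract
JOIN payment  p ON p."contractId" = c.id     -- join 2: contract -> payment
WHERE e.id = 12345;                          -- 12345 is a placeholder event id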
Should I be "archiving" old rows from Events table?
Since the Events table has so many new rows constantly (and also needs to be read from for other business logic purposes) should I be implementing some strategy to move events where payments have already been calculated into some "archive" table? My thought is that this will keep the very I/O-heavy Events table more "freed up" in order to be read/written to faster.
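For example, assuming an event_archive table with the same columns and some flag marking events whose payments are settled (the processed flag here is hypothetical), PostgreSQL can move rows in a single statement via a data-modifying CTE; native partitioning by date is another way to get a similar effect without moving rows by hand:

WITH moved AS (
    DELETE FROM event
    WHERE processed              -- hypothetical "payment already calculated" flag
    RETURNING *
)
INSERT INTO event_archive        -- hypothetical archive table, same columns as event
SELECT * FROM moved;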
What other optimisation strategies should I be thinking about?
The application in question is currently live to paying customers, and our team is growing concerned about the manpower that will be required to keep this unoptimised setup in check. Are there any "quick wins" we could apply to the tables/database to speed things up? (I've read about "indexing" but haven't grasped it well enough to feel confident deploying it to a production database.)
Thank you so much in advance for any advice you can give!
Edit to include DB information:
explain (analyze, buffers) select * from "event" order by "updatedAt" desc
Gather Merge  (cost=1237571.93..1699646.19 rows=3960360 width=381) (actual time=71565.082..76242.131 rows=5616130 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=107649 read=211715, temp read=805207 written=805973
  I/O Timings: read=76365.059
  ->  Sort  (cost=1236571.90..1241522.35 rows=1980180 width=381) (actual time=69698.164..70866.085 rows=1872043 loops=3)
        Sort Key: "updatedAt" DESC
        Sort Method: external merge  Disk: 725448kB
        Worker 0:  Sort Method: external merge  Disk: 685968kB
        Worker 1:  Sort Method: external merge  Disk: 714432kB
        Buffers: shared hit=107649 read=211715, temp read=805207 written=805973
        I/O Timings: read=76365.059
        ->  Parallel Seq Scan on event  (cost=0.00..339111.80 rows=1980180 width=381) (actual time=0.209..26506.333 rows=1872043 loops=3)
              Buffers: shared hit=107595 read=211715
              I/O Timings: read=76365.059
Planning Time: 1.167 ms
Execution Time: 76799.910 ms
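One note on the plan above, relating to the "quick wins" question: the Sort node spills roughly 700MB per worker to disk because there is no index matching the ORDER BY on "updatedAt". A hedged sketch (the index name is made up; CONCURRENTLY avoids locking the live table but cannot run inside a transaction):

CREATE INDEX CONCURRENTLY event_updated_at_idx ON event ("updatedAt" DESC);

Bear in mind that the example query still returns all ~5.6M rows, so the index pays off mainly once the query also carries a LIMIT or a range filter on "updatedAt".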

Database Design - Need help scaling query

I'm trying to find the best data-structure/data store solution (highest performance) for the following request:
I have a list of attributes that I need to store for every individual in the US, for example:
+------------+-------+-------------+
| Attribute | Value | SSN |
+------------+-------+-------------+
| hair color | black | 123-45-6789 |
| eye color | brown | 123-45-6789 |
| height | 175 | 123-45-6789 |
| sex | M | 123-45-6789 |
| shoe size | 42 | 123-45-6789 |
As you can guess, across the general population there is nothing uniquely identifying about those attributes.
However, let's assume that if we were to fetch from a combination of 3 or 4 attributes, then I would be able to uniquely identify a person (find their SSN).
Now here's the difficulty: the set of combinations that can uniquely identify a person will evolve and be adjusted over time.
What would be my best bet for storing and querying the data with the scenario mentioned above, that will remain highly performant (<100ms) at scale?
Current attempt with combining two attributes:
SELECT * FROM (SELECT * FROM people WHERE hair='black') p1
JOIN (SELECT * FROM people WHERE height=175) p2
ON p1.SSN = p2.SSN
But with a database of millions of rows, as you can guess... NOT performant.
Thank you!
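For what it's worth, if the data stays relational, the self-join-per-attribute pattern above can usually be replaced by a single GROUP BY / HAVING query over the one-row-per-attribute layout shown in the table. This is only a sketch (column names follow that table, values stored as text), and it wants an index on (attribute, value, ssn) to stay fast:

SELECT ssn
FROM people
WHERE (attribute = 'hair color' AND value = 'black')
   OR (attribute = 'height'     AND value = '175')
GROUP BY ssn
HAVING COUNT(DISTINCT attribute) = 2;   -- number of attribute/value pairs requested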
If the data store is not a constraint, I would use a DocumentDB, something like MongoDB, CosmosDB or even ElasticSearch.
With Mongo, for example, you could leverage its schemaless nature and have a collection of People with one property per "attribute":
{
"SSN": "123-45-6789",
"eyeColor": "brown",
"hairColor" "blond",
"sex": "M"
}
Documents in this collection might have different properties, but that's not an issue. All you have to do now is put an index on each attribute field and run your queries.
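As a sketch of what that looks like in the mongo shell (the people collection name is assumed):

// One single-field index per attribute:
db.people.createIndex({ hairColor: 1 })
db.people.createIndex({ eyeColor: 1 })
db.people.createIndex({ height: 1 })

// Any combination of attributes can then be queried directly:
db.people.find({ hairColor: "black", eyeColor: "brown", height: 175 })

If the useful combinations turn out to be stable, a compound index per combination will serve those queries better than single-field indexes alone.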

Work around the SQLite parameter limit in a select query

I have a GUI application with a list of people which contains the person's database id and their attributes. Something like this:
+----+------+
| ID | Name |
+----+------+
| 1 | John |
| 2 | Fred |
| 3 | Mary |
[...]
This list can be filtered, so the number and set of people shown varies from time to time. To get a list of Peewee Person objects, I first get the list of visible IDs and use the following query:
ids = [row[0] for row in store]
Person.select().where(Person.id.in_(ids))
Which in turn translates to the following SQL:
('SELECT "t1"."id", "t1"."name" FROM "person" AS "t1" WHERE ("t1"."id" IN (?, ?, ?, ...))', [1, 2, 3, ...])
This throws an OperationalError: too many SQL variables error on Windows with more than 1000 people. This is documented in the Peewee and SQLite docs. Workarounds given online usually relate to bulk inserts and ways to split the action in chunks. Is there any way to work around this limitation with the mentioned SELECT ... WHERE ... IN query?
Getting the separate objects in a list comprehension is too slow:
people = [Person.get_by_id(row[0]) for row in store]
Maybe split the list of IDs in max 1000 items, use the select query on each chunk and then combine those somehow?
Where are the IDs coming from? The best answer is to avoid using that many parameters, of course. For example, if your list of IDs could be represented as a query of some sort, then you can just write a subquery, e.g.
my_friends = (Relationship
              .select(Relationship.to_user)
              .where(Relationship.from_user == me))
tweets_by_friends = Tweet.select().where(Tweet.user.in_(my_friends))
In the above, we could get all the user IDs from the first query and pass them en-masse as a list into the second query. But since the first query ("all my friends") is itself a query, we can just compose them. You could also use a JOIN instead of a subquery, but hopefully you get the point.
If this is not possible and you seriously have a list of >1000 IDs...how is such a list useful in a GUI application? Over 1000 anything is quite a lot of things.
To try and answer the question you asked -- you'll have to chunk them up. Which is fine. Just:
user_ids = list_of_user_ids
accum = []
# 100 at a time.
for i in range(0, len(user_ids), 100):
    query = User.select().where(User.id.in_(user_ids[i:i+100]))
    accum.extend([user for user in query])
return accum
But seriously, I think there's a problem with the way you're implementing this if it's even necessary to filter on so many IDs.

Best way to apply FIR filter to data stored in a database

I have a PostgreSQL database with a few tables that store several million rows of data from different sensors. The data is stored in one column of each row, like:
+----+------+---------+
| ID | Data | Comment |
+----+------+---------+
| 1  | 19   | Sunny   |
| 2  | 315  | Sunny   |
| 3  | 127  | Sunny   |
| 4  | 26   | Sunny   |
| 5  | 82   | Rainy   |
+----+------+---------+
I want to apply an FIR filter to the data and store the result in another table so I can work with it, but because of the amount of data I'm not sure of the best way to do it. So far I've got the coefficients in Octave and have worked with some extracts of the data: basically I export the Data column to a CSV, then run csvimport in Octave to get it into an array and filter it. The problem is that this method doesn't let me work with more than a few thousand data points at a time.
Things I've been looking at so far:
PostgreSQL: I've been looking for some way to do it directly in the database, but I haven't been able to find one so far.
Java: Another possibility is a small program that extracts chunks of data at a time, recalculates them using the coefficients, and stores the results back in another table in the database.
C/C++: I've seen some questions and answers on Stack Overflow (here, here or here) about how to implement the filter, but they seem to be aimed at processing data in real time rather than taking advantage of already having all the data.
I think the best way would be to do it directly in PostgreSQL, and that Java or C/C++ would be too slow, but I don't have much experience working with this much data, so I may well be wrong. I just need to know why, and where to point myself.
What's the best way to apply an FIR filter to data stored in a database, and why?
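One option worth noting: since an FIR filter output is just a weighted sum of the current and previous N samples, it can be expressed directly in PostgreSQL with LAG() window functions. This is a sketch with made-up coefficients and made-up table names (sensor_data, sensor_data_filtered); the id/data columns follow the example above, and the first two output rows come out NULL because they have no history yet:

-- 3-tap FIR filter: y[n] = 0.25*x[n] + 0.50*x[n-1] + 0.25*x[n-2], ordered by id
CREATE TABLE sensor_data_filtered AS
SELECT id,
       0.25 * data
     + 0.50 * lag(data, 1) OVER w
     + 0.25 * lag(data, 2) OVER w AS data_filtered
FROM sensor_data
WINDOW w AS (ORDER BY id);

For filters with many taps, generating the LAG terms programmatically (or doing the convolution in a PL/pgSQL function) is usually more practical than writing them out by hand.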

Relevance and Solr Grouping

Say I have the following collection of webpages in a Solr index:
+-----+----------+----------------+--------------+
| ID | Domain | Path | Content |
+-----+----------+----------------+--------------+
| 1 | 1.com | /hello1.html | Hello dude |
| 2 | 1.com | /hello2.html | Hello man |
| 3 | 1.com | /hello3.html | Hello fella |
| 4 | 2.com | /hello1.html | Hello sir |
...
And I want a query for hello to show results grouped by domain like:
Results from 1.com:
/hello1.html
/hello2.html
/hello3.html
Results from 2.com:
/hello1.html
How is ordering determined if I sort by score? I normally use a combination of TF/IDF and PageRank for my results, but since that calculates scores for each individual item, how does it determine how to order the groups? What if 1.com/hello3.html and 1.com/hello2.html have very low relevance but there are two results, while 2.com/hello1.html has really high relevance and only one result? Or vice versa? Or is relevance summed when there are multiple items in a grouping field?
I've looked around, but haven't been able to find a good answer to this.
Thanks.
It sounds to me like you are using Result Grouping. If that's the case, then the groups are sorted according to the sort parameter, and the records within each group are sorted according to the group.sort parameter. If you sort the groups by sort=score desc (this is the default, so you wouldn't actually need to specify it), then it sorts the groups according to the score of each group. How this score is determined isn't made very clear, but if you look through the examples in the linked documentation you can see this statement:
The groups are sorted by the score of the top document within each group.
So, in your example, if 2.com's hello1.html was the most relevant document in your result set, "Results from 2.com" would be your most relevant group even though "Results from 1.com" includes three times the document count.
If this isn't what you want, your best options are to provide a different sort parameter or result post-processing. For example, for one project I was involved in, (where we had a very modest number of groups,) we chose to pull the top three results for each group and in post processing we calculated our own sort order for the groups based on the combination of their scores and numFound values. This sort of strategy might have been prohibitive for cases with too many groups, and may not be a good idea if the more numerous groups run the risk of making the most relevant documents harder to find.
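For concreteness, a grouped request along the lines discussed above might use parameters like these (the domain field name comes from the example table; group.limit and the two sorts are the knobs in question):

q=hello
group=true
group.field=domain
group.limit=3            (documents returned per group)
sort=score desc          (orders the groups by their top document's score; the default)
group.sort=score desc    (orders the documents within each group)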
