I have a big dataset in Elasticsearch and I'm trying to find a way to check whether a document from the last 14 days also exists with similar fields (four fields must be the same) in the last 15-90 days.
So I have two time intervals, and I would like to check whether similar documents exist in both of them.
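Roughly, what I have in mind is: for each document from the last 14 days, run one query against the 15-90 day window that has to match all four fields. A minimal sketch with the Elasticsearch Python client, where the index name, field names, and timestamp field are placeholders for my real mapping:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

def has_older_duplicate(doc):
    """Check whether a doc from the last 14 days also exists (same four fields)
    somewhere in the 15-90 day window."""
    resp = es.search(
        index="my-index",  # placeholder index name
        size=1,
        query={
            "bool": {
                "filter": [
                    {"term": {"field_a": doc["field_a"]}},
                    {"term": {"field_b": doc["field_b"]}},
                    {"term": {"field_c": doc["field_c"]}},
                    {"term": {"field_d": doc["field_d"]}},
                    {"range": {"@timestamp": {"gte": "now-90d/d", "lt": "now-15d/d"}}},
                ]
            }
        },
    )
    return resp["hits"]["total"]["value"] > 0
```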
I am working on a Flutter app (similar to dating apps with swipeable cards) that shows one user at a time.
I have a Firestore collection called users from which I'd like to query the nearby users first (based on the current user's location). Every user has a GeoPoint(Long + Lat) and a Geohash (8 characters) saved.
What I'd like:
Query the closest users first (20 or 30 users per query call) since I don't want the user to wait a long time.
The area in which users are queried grows bigger and bigger each time the web service is called again (since the close-by users have already been shown).
There are a few filters that are applied to the query like: only get users of a certain age group, or certain gender, etc.
I thought about Duncan Campbell's solution of having multiple variants of the GeoHash per user. For instance, every user would have a GeoHash of 2 characters, 3 characters, and 4 characters. And I'd search with a Where In clause containing the neighboring geohashes or the parents of the neighboring geohashes which should get me the users quickly.
However, this won't work for my purpose. Let's say we got all the close-by users by using the neighboring geohash4 (4-character) cells. Once done, if the user keeps swiping, we have to search with a larger-area geohash3 (3 characters), which will contain almost all the users that were searched previously. Yes, I can filter them afterward, but that is not efficient (and costly). Also, if we keep going, we'd reach geohash2, but then, in order to get the rest of the users, I'd have to search with a where-not-in clause containing all the neighboring geohash2 areas that we already searched. I do understand that this is a huge area to cover, but with the data that I currently have, users will cover it really quickly.
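To make the precision-expansion idea concrete, here is roughly how I picture it (a Python sketch, not real Firestore/Dart code; query_users_with_prefixes is a placeholder for the actual where-in query, and it assumes a geohash library such as python-geohash):

```python
import geohash  # assumes a geohash library such as python-geohash

def candidate_prefixes(lat, lng, precision):
    """The current cell plus its neighbors at the given geohash precision."""
    center = geohash.encode(lat, lng, precision)
    return [center] + geohash.neighbors(center)

def fetch_batch(lat, lng, precision, already_seen, query_users_with_prefixes):
    """Query one precision level and drop users returned by a more precise pass.

    query_users_with_prefixes stands in for the actual
    'where geohashN in [...]' query (plus the age/gender filters).
    """
    users = query_users_with_prefixes(candidate_prefixes(lat, lng, precision))
    return [u for u in users if u["id"] not in already_seen]

# Start at precision 4, then fall back to 3, then 2 as the user keeps swiping;
# the already_seen filtering is exactly the part that feels wasteful to me.
```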
I also think that GeoFire doesn't do the job for me since it queries all the users in a given area at once which doesn't satisfy my requirements (as discussed in the linked article).
To me, this solution sounds a bit complicated, so I am seeking suggestions for making it more efficient and effective. Any ideas on how to tackle this issue?
Is there a way to sort documents in Solr by the number of fields in each document?
The Solr core in question has about 200 different fields, and not every field has to be present in every document. To single out datasets that don't contain enough fields to be correct, I'd like to work through a *:* query sorted from the lowest number of fields per document upwards.
I didn't find anything on this specific use case. Most results I found were about the relevance of individual fields, which doesn't really help here given the large number of fields in the core.
It might be possible by sorting on a function query. That function would return a value that is higher the more fields the doc has. But I am afraid that function would be huge (and slow), as it would need to enumerate all the fields.
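For example (assuming the field names are known up front; the core name and field list below are placeholders), the sort could be built as a sum of exists() checks:

```python
import requests

# Placeholder field list; in practice this could be pulled from the Schema API.
fields = ["title", "author", "category"]  # ... up to ~200 fields

# Each present field contributes 1, each missing field contributes 0.
count_expr = "sum(" + ",".join(f"if(exists({f}),1,0)" for f in fields) + ")"

params = {
    "q": "*:*",
    "sort": f"{count_expr} asc",  # lowest field count first
    "rows": 100,
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
print(resp.json()["response"]["docs"][:5])
```

With ~200 fields this expression gets very long, which is exactly the "huge (and slow)" concern above.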
By far the easiest thing would be to, at index time, add a 'nbFields' field containing the number of fields. Then you can sort easily on that one.
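A minimal sketch of that index-time approach, assuming Python with requests, a placeholder core name, and an integer nbFields field defined in the schema:

```python
import requests

doc = {"id": "doc-1", "title": "Some title", "author": "Some author"}  # the real document

# Count the populated fields at index time and store the count with the doc.
doc["nbFields"] = sum(1 for v in doc.values() if v not in (None, "", []))

# Index via the JSON update handler (placeholder core name).
requests.post(
    "http://localhost:8983/solr/mycore/update?commit=true",
    json=[doc],
)

# Later: /select?q=*:*&sort=nbFields asc  puts the sparsest documents first.
```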
I'm evaluating CouchBase for an application, and trying to figure out something about range queries on views. I know I can do a view get for a single key, multiple keys, or a range. Can I do a get for multiple ranges? i.e. I want to retrieve items with view key 0-10, 50-100, 5238-81902. I might simultaneously need 100 different ranges, so having to make 100 requests to the database seems like a lot of overhead.
As far as I know, in Couchbase there is no way to get values from multiple ranges with one view query. Maybe there are (or will be) some such features in Couchbase N1QL, but I haven't worked with it.
To answer your question: 100 requests will not be a big overhead. Couchbase is quite fast and designed to handle a lot of operations per second. Also, if your view is correctly designed, it will not be "recalculated" on each query.
Also there is another way:
1. Determine the minimum and maximum values of your ranges (it will be 0..81902 according to your example).
2. Query a view that returns only document IDs and the value the range is based on, without including the full docs in the result.
3. On the client side, filter the array of results from the previous step according to your ranges (0-10, 50-100, 5238-81902), and then use getMulti with the document IDs that are left in the array (see the sketch below).
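A rough sketch of that client-side step in Python; the view rows are made up, and fetch_multi() stands in for whatever multi-get call your SDK provides (e.g. getMulti):

```python
# (doc_id, value) pairs as returned by the view, queried without include_docs
view_rows = [("doc::1", 3), ("doc::2", 75), ("doc::3", 20000), ("doc::4", 6000)]

ranges = [(0, 10), (50, 100), (5238, 81902)]

def in_any_range(value, ranges):
    """True if the value falls inside at least one of the wanted ranges."""
    return any(lo <= value <= hi for lo, hi in ranges)

wanted_ids = [doc_id for doc_id, value in view_rows if in_any_range(value, ranges)]

# docs = fetch_multi(wanted_ids)  # placeholder for the SDK's multi-get
```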
I don't know your data structure, so you can try both ways, test them and choose the best one that will fit your demands.
I have a table to keep flights, and I want to record which days of the week each flight operates.
There is no need for a date here, since I only need day names.
At first I thought of having a column in the flight table that keeps a single string with the day names inside, and of using my application logic to unravel the information.
This seems OK, since the only operation on the days will be to retrieve them.
The thing is, I don't find this "clean" enough, so I thought of making a separate table to keep all 7 day names and a many-to-many (auto-generated) table to keep the flight_id and day_id.
Still, there are only 7 fixed values in the days table, and I am not so sure about the second approach either.
What I would like is some other opinions on how to handle this.
A flight can operate on many different days of a week
Only day names are needed - so, 7 in total.
Sorry for the bad English and if this is a trivial question for some. I am not too experienced in either English or databases.
Some databases support arrays. PostgreSQL for example supports arrays.
You could store the days in an array of integers and use a function to translate the integers to day names. You could also use an array of a custom enum type (PostgreSQL example).
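A minimal sketch of the integer-array variant, assuming PostgreSQL with psycopg2 (the connection string, table, and column names are just placeholders):

```python
import psycopg2

conn = psycopg2.connect("dbname=flights user=app")  # placeholder connection string
cur = conn.cursor()

# Days stored as integers 1-7 (ISO convention: 1 = Monday ... 7 = Sunday)
cur.execute("""
    CREATE TABLE IF NOT EXISTS flight (
        id             serial PRIMARY KEY,
        flight_number  text NOT NULL,
        operating_days integer[] NOT NULL
    )
""")

# psycopg2 adapts a Python list to a PostgreSQL array
cur.execute(
    "INSERT INTO flight (flight_number, operating_days) VALUES (%s, %s)",
    ("AB123", [1, 3, 5]),
)

# All flights that operate on Wednesday (day 3)
cur.execute("SELECT flight_number FROM flight WHERE %s = ANY(operating_days)", (3,))
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```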
We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that with XX% certainty Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say that, based on its content, it is most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
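A minimal sketch of that idea, assuming the known types can be expressed as regexes with the changing parts wildcarded (the type names, patterns, and threshold below are made up):

```python
import re
import difflib

# Hypothetical "known types": a regex with the changing parts wildcarded,
# plus one example line per type for fuzzy scoring.
known_types = {
    "transaction_assignment": (
        re.compile(r"Change Transaction \S+ Assigned To Server \S+"),
        "Change Transaction ABC123 Assigned To Server US91",
    ),
}

def classify(record, threshold=0.8):
    """Return (type_name, score); a score of 1.0 means an exact regex match."""
    # High confidence: the record matches a known type's regex exactly.
    for name, (pattern, _) in known_types.items():
        if pattern.fullmatch(record):
            return name, 1.0
    # Lower confidence: fuzzy similarity against each type's example line.
    best_name, best_score = None, 0.0
    for name, (_, example) in known_types.items():
        score = difflib.SequenceMatcher(None, record, example).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score  # probably a new type to add to known_types

print(classify("Change Transaction XYZ789 Assigned To Server GB47"))
```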
I would suggest indexing your data with a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a sequence of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in a similar way; the search engine should give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
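Generating those n-gram terms is straightforward; a small sketch (plain Python, independent of whichever search engine you index them into):

```python
def word_ngrams(text, n):
    """Build word n-grams joined with underscores (n=2 gives bigrams, n=3 trigrams)."""
    words = text.split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

line = "Change Transaction XYZ789 Assigned To Server GB47"
terms = line.split() + word_ngrams(line, 2) + word_ngrams(line, 3)
print(word_ngrams(line, 2))
# ['Change_Transaction', 'Transaction_XYZ789', 'XYZ789_Assigned',
#  'Assigned_To', 'To_Server', 'Server_GB47']
```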
Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach. Build an index over the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. Usually, though, the "information retrieval" approach is only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (which may be sparse), e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above) and investigate their nature. You might need to tweak this a little so that it doesn't, for example, cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast, but it will always just return some similar existing log lines, without much quantitative information about how common such a line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
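A minimal sketch of the clustering strategy, assuming scikit-learn is available (the sample lines and the number of clusters are made up; in practice you would tune both the tokenization and the clustering algorithm):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

log_lines = [
    "Change Transaction ABC123 Assigned To Server US91",
    "Change Transaction XYZ789 Assigned To Server GB47",
    "Backup Job 42 Completed With Status OK",
    "Backup Job 57 Completed With Status FAILED",
]

# Keep purely alphabetic tokens only, so mixed tokens like ABC123 or US91 are
# dropped and the vectors describe the message "shape" rather than the IDs.
vectorizer = TfidfVectorizer(token_pattern=r"\b[A-Za-z]{2,}\b", lowercase=True)
X = vectorizer.fit_transform(log_lines)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(list(zip(labels, log_lines)))  # the two transaction lines should share a cluster
```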
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there, you can train a classifier or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().