We have many years of weather data that we need to build a reporting app on. Weather data has many fields of different types, e.g. city, state, country, zipcode, latitude, longitude, temperature (hi/lo), temperature (avg), precipitation, wind speed, date, etc.
Our reports require that we choose combinations of these fields then sort, search and filter on them e.g.
WeatherData.all().filter('avg_temp =', 20).filter('city =', 'palo alto').filter('hi_temp =', 30).order('date').fetch(100)
or
WeatherData.all().filter('lo_temp =', 20).filter('city =', 'palo alto').filter('hi_temp =', 30).order('date').fetch(100)
It may be easy to see that these queries require different indexes. It may also be obvious that the 200-index limit can be crossed very easily with any such data model where a combination of fields is used to filter, sort and search entities. Finally, the number of entities in such a data model can obviously run into millions, considering that there are many cities and we could collect hourly data instead of daily.
Can anyone recommend a way to model this data that allows all of these queries to still be run, while staying well under the 200-index limit? Write cost in this model is not as big a deal, but we need super fast reads.
Your best option is to rely on the built-in support for merge join queries, which can satisfy these queries without an index per combination. All you need to do is define one index per field you want to filter on, paired with the sort order (if that's always date, then you're down to one index per field). See this part of the docs for details.
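A minimal sketch of how this could look with the Python db API (the property types and the index layout comment are my assumptions, not from the question):

from google.appengine.ext import db

class WeatherData(db.Model):
    city = db.StringProperty()
    avg_temp = db.FloatProperty()
    hi_temp = db.FloatProperty()
    lo_temp = db.FloatProperty()
    date = db.DateTimeProperty()

# With one composite index per (field, date) pair defined in index.yaml,
# the datastore can merge those per-field indexes to serve any combination
# of equality filters ordered by date -- no index per combination needed.
q = (WeatherData.all()
     .filter('avg_temp =', 20)
     .filter('city =', 'palo alto')
     .filter('hi_temp =', 30)
     .order('date'))
results = q.fetch(100)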
I know it seems counter-intuitive, but you can use a full-text search system that supports categories (properties, facets, whatever the engine calls them) to do something like this, as long as you are primarily using equality filters. There are ways to get inequality filters to work, but they are often limited. The faceting features can be useful too.
The upcoming Google Search API
IndexTank is the service I currently use
EDIT:
Yup, this is totally a hackish solution. The documents I am using it for are already in my search index and I am almost always also filtering on search terms.
I have a use case where I want to design a hashtag ranking system, in which the 10 most popular hashtags should be selected. My idea is something like this:
[hashtag, rateofhitsperminute, rateofhitsper5minutes]
Then I will query to find the 10 most popular #hashtags, i.e. those whose rateofhitsperminute is highest.
My question is: what sort of database can I use to provide me with statistics like 'rateofhitsperminute'?
What is a good way to calculate such a value and store it in a db? Do some DBs offer these features?
First of all, "rate of hits per minute" is calculated:
[hits during period]/[length of period]
So the rate will vary depending on how long the period is. (The last minute? The last 10 minutes? Since the hits started being recorded? Since the hashtag was first used?)
So what you really want to store is the count of hits, not the rate. It is better to either:
Store the hashtags and their hit counts during a certain period (less memory/CPU required but less flexible)
OR store the timestamp and hashtag of each hit (more memory/CPU required but more flexible)
Now it is a matter of selecting the time period of interest, and querying the database to find the top 10 hashtags with the most hits during that period.
If you need to display the rate, use the formula above, but notice it does not change the order of the top hashtags because the period is the same for every hashtag.
You can apply the algorithm above to almost any DB. You can even do it without using a database at all (just use a programming language's built-in hashmap).
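As a minimal sketch of the no-database variant (a plain in-memory hashmap; all names here are illustrative):

from collections import Counter
import time

hits = []  # one (timestamp, hashtag) pair per hit

def record_hit(hashtag):
    hits.append((time.time(), hashtag))

def top_hashtags(period_seconds=60, k=10):
    # Count hits within the period of interest, then take the top k.
    cutoff = time.time() - period_seconds
    counts = Counter(tag for ts, tag in hits if ts >= cutoff)
    return counts.most_common(k)

Dividing each count by (period_seconds / 60) would give the rate per minute, but as noted above it does not change the ordering.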
If performance is a concern and there will be many different hashtags, I suggest using an OLAP database. OLAP databases are specially designed for top-k queries (over a certain time period) like this.
Having said that, here is an example of how to accomplish your use case in Solr: Solr as an Analytics Platform. Solr is not an OLAP database, but this example uses Solr like an OLAP DB and seems to be the easiest to implement and adapt to your use case:
Your Solr schema would look like:
<fields>
<field name="hashtag" type="string"/>
<field name="hit_date" type="date"/>
</fields>
An example document would be:
{
"hashtag": "java",
"hit_date": '2012-12-04T10:30:45Z'
}
A query you could use would be:
http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=hashtag&facet.mincount=1&facet.limit=10&facet.range=hit_date&facet.range.start=2012-01-01T00:00:00Z&facet.range.end=2013-01-01T00:00:00Z&facet.range.gap=%2B1YEAR
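If it helps, here is a hedged sketch of the same idea from Python with the requests library; note it swaps facet.range for a plain filter query (fq) on the date window, since for a top-10 list you only need counts per hashtag within one window:

import requests

params = {
    'q': '*:*',
    'wt': 'json',
    'facet': 'true',
    'facet.field': 'hashtag',
    'facet.mincount': 1,
    'facet.limit': 10,
    'fq': 'hit_date:[2012-01-01T00:00:00Z TO 2013-01-01T00:00:00Z]',
}
resp = requests.get('http://localhost:8983/solr/select', params=params).json()
# facet_fields comes back as a flat [tag1, count1, tag2, count2, ...] list.
counts = resp['facet_counts']['facet_fields']['hashtag']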
Finally, here are some advanced resources related to this question:
Similar question: Implementing twitter and facebook like hashtags
What is the best way to compute trending topics or tags? An interesting idea I got from these answers is to use the derivative of the hit counts over time to calculate the "instantaneous" hit rate.
HyperLogLog can be used to estimate the hit counts if an approximate calculation is acceptable.
Look into Sliding-Window Top-K if you want to get really academic on this topic.
No database has rate-per-minute statistics just built in, but with any modern database you could quite easily calculate rate per minute or any other derived values you need.
Your question is like asking which kind of car can drive from New York to LA - well no car can drive itself or refuel itself along the way (I should be careful with this analogy because I guess cars are almost doing this now!), but you could drive any car you like from New York to LA, some will be more comfortable, some more fuel efficient and some faster than others, but you're going to have to do the driving and refueling.
You can use InfluxDB. It's well suited for your use case, since it was created to handle time series data (for example "hits per minute").
In your case, every time there is a hit, you could send a record containing the name of the hashtag and a timestamp.
The data is queryable, and there are already tools that can help you process or visualize it (like Grafana).
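A hedged sketch of what that could look like with the influxdb-python client (database, measurement and field names are my own placeholders):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='hashtags')

def record_hit(hashtag):
    # One point per hit; InfluxDB timestamps it on arrival.
    client.write_points([{
        'measurement': 'hits',
        'tags': {'hashtag': hashtag},
        'fields': {'value': 1},
    }])

# Hits per hashtag over the last minute (InfluxQL).
result = client.query(
    'SELECT count("value") FROM "hits" '
    'WHERE time > now() - 1m GROUP BY "hashtag"')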
If you are happy with a large data set, you could store and calculate this information yourself.
I believe Mongo is fairly fast when it comes to index-based queries, so you could structure something like this.
Every time a tag is "hit" or accessed you could store this information as a row
[Tag][Timestamp]
Storing it in such a fashion allows you, first of all, to run simple group, count and sort operations, which gives you your first requirement: the 10 most popular tags.
With the information in this format you can then perform further queries based on tag and timestamp to count the number of hits for a specific tag between times X and Y, which gives you your hits per period (see the sketch below).
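A hedged sketch of both steps with pymongo (collection and field names are illustrative):

from datetime import datetime, timedelta
from pymongo import MongoClient

hits = MongoClient().analytics.tag_hits  # documents of the form {tag, ts}

def record_hit(tag):
    hits.insert_one({'tag': tag, 'ts': datetime.utcnow()})

def top_tags(minutes=1, k=10):
    # Group, count and sort hits within the window, then keep the top k.
    since = datetime.utcnow() - timedelta(minutes=minutes)
    return list(hits.aggregate([
        {'$match': {'ts': {'$gte': since}}},
        {'$group': {'_id': '$tag', 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
        {'$limit': k},
    ]))

An index on ts (or on tag plus ts) keeps the $match stage fast.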
Benefits of doing it this way:
High information granularity depending on time frames supplied via query
These queries are rather fast in MongoDB or similar databases, even on large data sets
Negatives of doing it this way:
You have to store many rows of data
You have to perform queries to retrieve the information you need rather than returning a single data row
I'm evaluating CouchBase for an application, and trying to figure out something about range queries on views. I know I can do a view get for a single key, multiple keys, or a range. Can I do a get for multiple ranges? i.e. I want to retrieve items with view key 0-10, 50-100, 5238-81902. I might simultaneously need 100 different ranges, so having to make 100 requests to the database seems like a lot of overhead.
As far as I know, in Couchbase there is no way to fetch values from multiple ranges with one view query. Maybe there are (or will be) such features in Couchbase N1QL, but I haven't worked with it.
To answer your question: 100 requests will not be a big overhead. Couchbase is quite fast and it's designed to handle a lot of operations per second. Also, if your view is correctly designed, it will not be "recalculated" on each query.
Also there is another way:
1. Determine the minimum and maximum values across your ranges (0..81902 according to your example)
2. Query a view that returns only document ids and the value the range is based on, without including the full documents in the result.
3. On the client side, filter the array of results from the previous step according to your ranges (0-10, 50-100, 5238-81902),
and then use getMulti with the document ids that remain in the array.
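A hedged sketch of steps 2 and 3 with the Couchbase Python SDK (bucket, design document and view names are made up for illustration):

from couchbase.bucket import Bucket

bucket = Bucket('couchbase://localhost/default')
ranges = [(0, 10), (50, 100), (5238, 81902)]

# The view is assumed to emit the numeric value as its key; rows carry
# only ids and keys, not the full documents.
rows = bucket.query('design_doc', 'by_value')

ids = [row.docid for row in rows
       if any(lo <= row.key <= hi for lo, hi in ranges)]

docs = bucket.get_multi(ids)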
I don't know your data structure, so you can try both ways, test them and choose the best one that will fit your demands.
Forgive this super basic question, from a search newbie.
I want to implement a site that makes use of faceted search. For example, it's a site with a database of hotels, and I want to allow users to search for hotels within a price range, with a swimming pool, with either three or four stars.
Clearly I can return results to users with a simple database query.
Should I use ElasticSearch or Solr to implement this instead of using a database query? If so, why?
Yes, you should use ES or Solr. Reasons: primarily performance, and the ability to change (think config) the 'types of faceting' easily.
Faceting is no small feat, and although you could do it with an RDBMS, making it fast requires hard thinking. Why do it yourself when you can build on the gazillions of hours the Solr / ES (+ Lucene) teams have spent optimizing it?
As for the 'types of faceting' I mentioned:
Perhaps you want to do hierarchical faceting: select a price category > display smaller price categories. How are they bucketed: fixed ranges, evenly distributed, etc.? Solr / ES provide these options from within a config.
Perhaps instead you implement price faceting with a slider with min/max handles? Do you want to display the number of hotels while you slide (histogram/facet stats in Solr / ES)?
While you've faceted on price, perhaps you still want to know the min and max values of the price slider as if you DIDN'T filter on price. This is needed if you want to be able to draw the slider handles proportionally. (See my question on SO, asked while considering a switch from Solr to ES: Elasticsearch: excluding filters while faceting possible? (like in Solr))
Faceting on stars? Perhaps you want to show the best price per stars facet if the user were to select that star rating (again, histogram/stats).
Seriously, don't even consider doing the above with an RDBMS. You'll go insane.
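To make that concrete, here is a hedged sketch of the hotel search with elasticsearch-py (modern bool/filter syntax; the index and field names are my assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch()
body = {
    'query': {'bool': {'filter': [
        {'range': {'price': {'gte': 50, 'lte': 200}}},
        {'term': {'has_pool': True}},
        {'terms': {'stars': [3, 4]}},
    ]}},
    'aggs': {
        # Facet counts per star rating, plus a price histogram for a slider.
        'stars': {'terms': {'field': 'stars'}},
        'price_histogram': {'histogram': {'field': 'price', 'interval': 25}},
    },
}
results = es.search(index='hotels', body=body)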
Hope that helps, and yes I'm familiar with the domain :)
Additional questions, just ask.
In my Solr queries, I want to sort the most recently accessed documents to the top ("accessed" meaning opened by user action). No other search criterion has weight for me: of the documents with text matching the query, I want them in order of recent use. I can only think of two ways to do this:
1) Include a 'last accessed' date field in each doc to have Solr sort upon. Trie Date fields can be sorted very quickly, I'm told. The problem of course is keeping the field up to date, which would require storing each document's text so I can delete and re-add any document with an updated 'last accessed' field. Mutable fields would obviate this, but Lucene/Solr still doesn't offer mutable fields.
2) Alternatively, store the mutable 'last accessed' dates and keep them updated in another db. This would require Solr to return the full list of matching documents, which could be upwards of hundreds of thousands of documents. This huge list of document ids would then be matched up against dates in the db and then sorted. It would work OK for uncommon search terms, but not for broad, common search terms.
So the trade-off is between 1) index size plus a processing cost every time a document is accessed, and 2) big query overhead, especially for unfocused search terms.
Do I have any alternatives?
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles
http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html
You should be able to do this with the atomic update functionality.
http://wiki.apache.org/solr/Atomic_Updates
This functionality is available as of Solr 4.0. It allows you to update a single field in a document without having to reindex the entire document. I only know about this functionality from the documentation. I have not used it myself, so I can't say how well it works or if there are any pitfalls.
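For what it's worth, a minimal sketch of what such an atomic update looks like on the wire (the core URL and field name are illustrative):

import requests

# 'set' replaces just this field; the rest of the document is left alone.
doc = {'id': 'doc123', 'last_accessed': {'set': '2013-01-15T10:30:00Z'}}
requests.post(
    'http://localhost:8983/solr/update?commit=true',
    json=[doc],
)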
Definitely use option 1, using SOLR queries and updating the lastAccessed field as needed.
Since SOLR 4.0, partial document updates are supported in several flavours: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
For your application it seems that a simple atomic update would be sufficient.
With respect to performance, this should work very well for large collections and fast document updates.
Is it better, in Lucene, to use
a lot of indexes (e.g. one for every user, as your application allows that)
or just one, having every document in it
... if you think about:
performance
disk space
health
I am using elasticsearch, therefore I am using Lucene.
In Elasticsearch, based on your information, I think I would use 1 index. My understanding is that users are only searching their own documents, and the documents seem to be relatively similar.
Performance - When searching, you can use a filtered query to restrict results to the documents matching the user. The user id filter is cacheable, and fast.
Scalable - In Elasticsearch, you control sharding and replication at the index level. Elasticsearch can handle large numbers of indexes, but with a single index you can configure appropriate shards and replicas once, for the whole dataset.
In a single index, you can still easily wipe away data (see delete-by-query), and there should be little concern about seeing others' data unless you write your queries wrong. A filtered query that restricts results to those associated with a user id is very simple (see the sketch below), and similar in complexity to searching a different index per user.
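A hedged sketch of such a query with elasticsearch-py (newer versions fold the old filtered query into the bool query's filter clause; field names are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch()
body = {
    'query': {'bool': {
        'must': [{'match': {'body': 'quarterly report'}}],
        # Term filters like this are cacheable and cheap.
        'filter': [{'term': {'user_id': 'u42'}}],
    }},
}
results = es.search(index='documents', body=body)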
Your exact needs might fit a different approach better, but based on what you have described so far, I would choose one index.