store the number of times a document was seen in a given time period - solr

I am parsing documents from the web and storing them in a Solr database. Every day I see thousands of documents, and some of them are repeats.
I'd like to give users an option to see which documents were seen the most on a given date or in a given timespan. Queries of interest correspond to:
- show me which documents were seen the most on 16/10/2022,
- show me which documents were seen the most between 16/10/2022 and 23/10/2022.
When writing Solr queries, you specify a field name to search on. What field type should I use, and in what format should I store the number of times a document was seen on a given date?

How I would try it:
Create a separate collection - a very simple one with just these fields:
- view_time
- doc_id
- title or body (whatever you're querying)
...and index one such document for EVERY view (a sketch of what that could look like follows below).
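For illustration, here is a minimal Python sketch (using the requests library) of logging one such view document per view through Solr's standard /update handler. The collection name "views" and the exact field names are assumptions for this example, not anything Solr prescribes.

import datetime
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/views/update"  # assumed collection name

def log_view(doc_id, title):
    """Index one small document per view; Solr expects ISO-8601 UTC dates."""
    view_doc = {
        "doc_id": doc_id,
        "title": title,
        "view_time": datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commitWithin": 10000},  # let Solr batch commits instead of committing per view
        json=[view_doc],                 # /update accepts a JSON array of documents
    )
    resp.raise_for_status()

log_view("doc-123", "abc")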
You can then query it, bucketed by whatever date gap you want:
curl http://localhost:8983/solr/views/query -d 'q=title:abc&rows=0&json.facet={
  per_month: {
    type: "range",
    field: "view_time",
    start: "2022-01-01T00:00:00Z",
    end: "2022-12-31T23:59:59Z",
    gap: "+1MONTH"
  }
}'
This returns view counts bucketed by MONTH (you can change the gap to DAY, YEAR, etc.).
But your documents are probably too big to copy the title or body into every view record. If you want to normalize this, use a JOIN query: since Solr 8.6 you can do cross-collection joins across multiple shards. There are good articles and videos on how to set those queries up, and it's not that hard to do.
The JOIN query would be much faster.
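As a rough, hedged sketch of what the normalized query could look like in Python with requests: assume a "documents" collection (id, title, body, ...) and a "views" collection (doc_id, view_time). Those names, and the exact crossCollection join parameters, are assumptions to check against the Solr reference guide for your version.

import requests

# Query the "views" collection; the join restricts views to those whose doc_id
# matches a document with title:abc in the "documents" collection (Solr 8.6+).
query = {
    "query": '{!join method="crossCollection" fromIndex="documents" from="id" to="doc_id"}title:abc',
    "filter": "view_time:[2022-10-16T00:00:00Z TO 2022-10-23T23:59:59Z]",
    "limit": 0,
    "facet": {
        # Most-viewed documents within the filtered time range.
        "top_docs": {"type": "terms", "field": "doc_id", "limit": 10}
    },
}

resp = requests.post("http://localhost:8983/solr/views/query", json=query)
resp.raise_for_status()
print(resp.json()["facets"]["top_docs"])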
If you don't want to do the JOIN query:
If the views change often, do not store view counts on the main documents. Partial (atomic) updates in Solr still re-index the whole document under the hood (except in a few special in-place cases), so if you're updating views every day, you'll effectively rewrite every document that's been viewed. That's going to cause a lot of unnecessary disk thrashing.
Other thoughts:
Can you use a database? That's a far better place to keep view counts; Solr isn't good as the master record for views.
Another suggestion is to send the views to an analytics engine - a far better solution, since you can get rich analytics about the actual users. An analytics engine does a lot that a raw view counter does not - especially filtering out false positives (like bots!). It's not fun to maintain an accurate view count if you have a high-traffic site.
In the past I've used an analytics engine to collect the data and then exported that data into Solr. This way the view logic is handled by the software component that knows views best (an analytics engine like Google Analytics or the Salesforce marketing engine), and an hourly process updates the views in Solr using one of the tactics above.

Related

Why are Solr log time series stored in different collections based on time instead of different shards based on time

If you look at Lucidworks' Time Based Partitioning or Large Scale Log Analytics with Solr, multiple Solr "collections" are created, partitioned on time.
My questions are:
Why not, in such cases, just create multiple shards based on time?
In the case of multiple collections, how would a query spanning multiple collections/time ranges be done?
There is not much difference between multiple shards with implicit routing and multiple collections. When you issue a query, you can (optionally) specify which shards or which collections to search.
Alternatively, you can set up an alias containing multiple collections, thus hiding the logistics from the search client. This makes it easy to create custom views over the full data set, such as an alias for each year, one for everything, and one for the last quarter. If you later decide to slice your data differently, e.g. make a collection for each week instead of each month, the change will be transparent to the client application. Aliases do not work for shards, so that is one reason to prefer collections.
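As a concrete illustration, here is a small Python sketch (using requests) that creates an alias over several time-partitioned collections with the Collections API and then queries the alias like an ordinary collection; the collection and alias names are made up for the example.

import requests

SOLR = "http://localhost:8983/solr"

# Hypothetical monthly log collections that already exist.
monthly = ["logs_2022_10", "logs_2022_11", "logs_2022_12"]

# CREATEALIAS makes "logs_q4" resolve to all three collections.
resp = requests.get(
    f"{SOLR}/admin/collections",
    params={
        "action": "CREATEALIAS",
        "name": "logs_q4",
        "collections": ",".join(monthly),
    },
)
resp.raise_for_status()

# Clients query the alias exactly like a single collection; Solr fans the
# request out to every collection behind the alias and merges the results.
results = requests.get(
    f"{SOLR}/logs_q4/select",
    params={"q": "level:ERROR", "rows": 10},
).json()
print(results["response"]["numFound"])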

Updating documents with SOLR

I have a community website with 20,000 members who use a search form every day to find other members. The results are sorted by the most recently connected members. I'd like to use Solr for my search (right now it's MySQL), but first I'd like to know whether it's good practice to update the document of every member who logs in, in order to change their login date and time. There would be around 20,000 document updates a day; I don't really know if that's too much updating and could hurt performance. Thank you for your help.
20k updates/day is not unreasonable at all for Solr.
OTOH, for very frequently updated fields (imagine one user logging in multiple times a day, so you might want to update the field each time), you can use an ExternalFileField to keep that field outside the index (in a text file) and still use it for sorting in Solr.
Generally, Solr should not be used for this purpose; using your database is still better.
However, if you want to use Solr, you will treat it much like a database: every user document should have a unique field, id for example. When the user logs in, you can run a partial (atomic) update against that user's last_login_date field, addressed by its id. You can read more about partial updates in the Solr documentation on atomic updates.
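For illustration, a minimal Python sketch of such an atomic update using requests; the collection name "members" and the field name last_login_date are assumptions, and last_login_date would need to be defined as a date field in the schema.

import datetime
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/members/update"  # assumed collection

def record_login(user_id):
    """Atomically set last_login_date on one user document without resending the rest."""
    partial_doc = {
        "id": user_id,
        # The {"set": ...} modifier marks this as an atomic update of a single field.
        "last_login_date": {"set": datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")},
    }
    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commitWithin": 10000},  # avoid a hard commit on every login
        json=[partial_doc],
    )
    resp.raise_for_status()

record_login("user-42")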

Google app engine - help in query optimization

I have run into a scenario while running a query in App Engine that is increasing my cost considerably.
I am writing the query below to fetch book names:
Iterable<Entity> entities =
datastore.prepare(query).asIterable(DEFAULT_FETCH_OPTIONS);
After that I run a loop to match the name with the name the user has requested. This causes data reads for all the books in the datastore, and as the number of books grows day by day, it impacts the cost further, since the entire list is read every time.
Is there an alternative that fetches data only for the book the user requested, so that I don't have to read the complete datastore? Would SQL or filters help? I would appreciate it if someone could provide the query.
You have two options:
If you match the title exactly, make it an indexed field and use a filter to fetch only books with exactly the same title.
If you search within titles too:
a. You can use Search API to index all titles and use it to find the books your users are looking for.
b. A less optimal but quick solution is to create a projection query that reads only the book titles (option 1 and this projection approach are both sketched below).
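The question's snippet uses the Java low-level datastore API, but here is a rough sketch of option 1 and the projection approach using the Python db API, with a made-up Book model; adapt the idea to the Java Query/Filter classes as needed.

from google.appengine.ext import db

class Book(db.Model):
    title = db.StringProperty()    # indexed by default, so equality filters are cheap
    author = db.StringProperty()
    pages = db.IntegerProperty()

# Option 1: exact-match filter - the datastore reads only matching entities
# instead of scanning every Book and comparing titles in application code.
requested = "The Pragmatic Programmer"
books = Book.all().filter('title =', requested).fetch(20)

# Projection query - reads only the title property out of the index,
# which is cheaper than fetching full entities.
titles_only = db.Query(Book, projection=('title',)).fetch(1000)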

Indexes for google app engine data models

We have many years of weather data that we need to build a reporting app on. Weather data has many fields of different types, e.g. city, state, country, zipcode, latitude, longitude, temperature (hi/lo), temperature (avg), precipitation, wind speed, date, etc.
Our reports require that we choose combinations of these fields then sort, search and filter on them e.g.
WeatherData.all().filter('avg_temp =',20).filter('city','palo alto').filter('hi_temp',30).order('date').fetch(100)
or
WeatherData.all().filter('lo_temp =',20).filter('city','palo alto').filter('hi_temp',30).order('date').fetch(100)
It may be easy to see that these queries require different indexes. It may also be obvious that the 200-index limit can be crossed very easily with any such data model where a combination of fields is used to filter, sort, and search entities. Finally, the number of entities in such a data model can obviously run into the millions, considering that there are many cities and we could store hourly data instead of daily.
Can anyone recommend a way to model this data that still allows all the queries to be run while staying well under the 200-index limit? The write cost in this model is not a big deal, but we need super fast reads.
Your best option is to rely on the built-in support for merge join queries, which can satisfy these queries without an index per combination. All you need to do is define one index per field you want to filter on plus the sort order (if the sort order is always date, then you're down to one index per field). See the App Engine datastore documentation on indexes for details.
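To make that concrete, here is a sketch in the same Python db query style as the question; the index list in the comment just illustrates the one-index-per-filtered-field pattern described above and is not a complete index.yaml.

from google.appengine.ext import db

class WeatherData(db.Model):        # the model from the question, abbreviated
    city = db.StringProperty()
    avg_temp = db.IntegerProperty()
    hi_temp = db.IntegerProperty()
    lo_temp = db.IntegerProperty()
    date = db.DateTimeProperty()

# Each equality filter combined with the 'date' sort order only needs a small
# composite index of the form (property, date); the merge join combines them at
# query time, so no index per *combination* of filters is required.
# Indexes assumed, one per filterable field:
#   (city, date), (avg_temp, date), (hi_temp, date), (lo_temp, date), ...
q = (WeatherData.all()
     .filter('city =', 'palo alto')   # served by the (city, date) index
     .filter('avg_temp =', 20)        # served by the (avg_temp, date) index
     .filter('hi_temp =', 30)         # served by the (hi_temp, date) index
     .order('date'))
results = q.fetch(100)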
I know it seems counter-intuitive but you can use a full-text search system that supports categories (properties/whatever) to do something like this as long as you are primarily using equality filters. There are ways to get inequality filters to work but they are often limited. The faceting features can be useful too.
- The upcoming Google Search API
- IndexTank (the service I currently use)
EDIT:
Yup, this is totally a hackish solution. The documents I am using it for are already in my search index and I am almost always also filtering on search terms.

What Database / Technology should be used to calculate unique visitors in a time span

I've got a problem with the performance of my reporting database (tables have millions of records, 50+) when I want to calculate a distinct count on the column that indicates visitor uniqueness, let's say some hashkey.
For example:
I have these columns:
hashkey, name, surname, visit_datetime, site, gender, etc...
I need to get the distinct count over a time span of one year, in less than 5 seconds:
SELECT COUNT(DISTINCT hashkey) FROM table WHERE visit_datetime BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
This query is fast for short time ranges, but if the range is bigger than one month, it can take more than 30 seconds.
Is there a better technology to calculate something like this than relational databases?
I'm wondering what Google Analytics uses to calculate its unique visitors on the fly.
For reporting and analytics (the type of thing you're describing), these sorts of statistics tend to be pulled out, aggregated, and stored in a data warehouse. They are stored in a form chosen for query performance, rather than the nicely normalized relational layout optimized for OLTP (online transaction processing). This pre-aggregated approach is called OLAP (online analytical processing).
You could have another table store the count of unique visitors for each day, updated daily by a cron job or something similar.
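As a minimal sketch of that pre-aggregation idea, assuming a raw "visits" table with the hashkey and visit_datetime columns from the question (sqlite3 is used here only so the example is self-contained; the same SQL applies to MySQL or Postgres):

import sqlite3

conn = sqlite3.connect("reporting.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS daily_unique_visitors (
        visit_date TEXT PRIMARY KEY,        -- 'YYYY-MM-DD'
        unique_visitors INTEGER NOT NULL
    );
""")

def aggregate_day(day):
    """Run once per day (e.g. from cron) after that day's visits have been loaded."""
    conn.execute(
        """
        INSERT OR REPLACE INTO daily_unique_visitors (visit_date, unique_visitors)
        SELECT DATE(visit_datetime), COUNT(DISTINCT hashkey)
        FROM visits
        WHERE DATE(visit_datetime) = ?
        """,
        (day,),
    )
    conn.commit()

# Reports over a year now scan ~365 small rows instead of millions of visits.
# Caveat: distinct counts are not additive across days (the same visitor can
# appear on several days), so exact multi-day uniques still need the raw data
# or an approximate structure such as HyperLogLog.
rows = conn.execute(
    "SELECT visit_date, unique_visitors FROM daily_unique_visitors "
    "WHERE visit_date BETWEEN '2022-01-01' AND '2022-12-31'"
).fetchall()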
Google Analytics uses a first-party cookie, which you can see if you log Request Headers using LiveHTTPHeaders, etc.
All GA analytics parameters are packed into the Request URL, e.g.,
http://www.google-analytics.com/__utm.gif?utmwv=4&utmn=769876874&utmhn=example.com&utmcs=ISO-8859-1&utmsr=1280x1024&utmsc=32-bit&utmul=en-us&utmje=1&utmfl=9.0%20%20r115&utmcn=1&utmdt=GATC012%20setting%20variables&utmhid=2059107202&utmr=0&utmp=/auto/GATC012.html?utm_source=www.gatc012.org&utm_campaign=campaign+gatc012&utm_term=keywords+gatc012&utm_content=content+gatc012&utm_medium=medium+gatc012&utmac=UA-30138-1&utmcc=__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B...
Within that URL is a piece keyed to utmcc; this carries the GA cookies. Within utmcc is a string keyed to __utma, which is comprised of six fields, each delimited by a '.'. The second field is the Visitor ID, a random number generated and set by the GA server after looking for GA cookies and not finding them:
__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1
In this example, 1774621898 is the Visitor ID, intended by Google Analytics as a unique identifier for each visitor.
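As a small illustration of that structure, here is a Python sketch that pulls the Visitor ID out of a URL-encoded __utma value like the one above:

from urllib.parse import unquote

utmcc = "__utma%3D97315849.1774621898.1207701397.1207701397.1207701397.1%3B"

# Decode the URL escapes (%3D is '=', %3B is ';'), isolate the __utma value,
# and split its six dot-delimited fields.
decoded = unquote(utmcc)                      # "__utma=97315849.1774621898...;"
utma_value = decoded.split("=", 1)[1].rstrip(";")
fields = utma_value.split(".")

visitor_id = fields[1]                        # the second field is the Visitor ID
print(visitor_id)                             # -> 1774621898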
So you can see the flaws of this technique for identifying unique visitors: entering the site with a different browser or a different device, or after deleting the cookies, will make you appear to GA as a new unique visitor (i.e., it looks for its cookies, doesn't find any, and so sets them).
There is an excellent article by the EFF on this topic, i.e., how uniqueness can be established, with what degree of certainty, and how it can be defeated.
Finally, one technique I have used to determine whether someone has visited our site before (assuming the hard case, in which they have deleted their cookies, etc.) is to examine the client's request for our favicon. The directories that store favicons are quite often overlooked, whether during a manual sweep or programmatically using a script.
