Google App Engine - help with query optimization

I have run into a scenario, while running a query in App Engine, that is increasing my cost considerably.
I am using the query below to fetch book names:
Iterable<Entity> entities =
datastore.prepare(query).asIterable(DEFAULT_FETCH_OPTIONS);
After that I run a loop to match each name against the name the user has requested. This reads every book in the datastore, and because the number of books grows day by day, the cost keeps rising with it.
Is there a way to fetch only the book the user requested, so that I don't have to read the complete datastore? Would SQL or filters help? I would appreciate it if someone could provide the query.

You have two options:
1. If you match the title exactly, make it an indexed property and put an equality filter on the query (in the low-level Java API, a Query.FilterPredicate with FilterOperator.EQUAL) so that only books with exactly that title are fetched.
2. If you also search within titles:
a. You can use the Search API to index all titles and use it to find the books your users are looking for.
b. A less optimal but quick solution is to create a projection query that reads only the book titles.

Related

store the number of times a document was seen in a given time period

I am parsing documents on the web and storing them in a Solr database. Every day I see thousands of documents, and some of them repeat.
I'd like to give users the option to see which document was seen most on a given date or in a given timespan. The queries of interest look like:
- show me which documents were seen the most on 16/10/2022
- show me which documents were seen the most between 16/10/2022 and 23/10/2022
When writing Solr queries, you specify the field name to search on. What field type should I use, and in what format should I store the number of times a document was seen on a given date?
How I would try it:
Create a separate, very simple collection with these fields:
view time
doc id
title or body (whatever you're querying)
Add one record to it for EVERY view.
You can then query it by whatever gap you want:
curl http://localhost:8983/solr/query -d 'q=title:abc&rows=0&json.facet={
  per_month: { range: {
    field: last_modified,
    start: "2022-01-01T00:00:00Z",
    end: "2022-12-31T23:59:59Z",
    gap: "+1MONTH"
  }}
}'
This would return all views grouped by MONTH (you can change the gap to DAY, YEAR, etc.).
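As a rough sketch of the same range facet built programmatically, the snippet below assembles the request parameters in Python. The field name `view_time`, the facet label `views_over_time`, and the function name are assumptions for illustration and are not from the answer above. It also URL-encodes the parameters, which sidesteps the usual problem of a literal `+` in a form body being decoded as a space.

```python
import json
from urllib.parse import urlencode


def build_views_facet_query(start, end, gap="+1DAY", field="view_time", q="*:*"):
    """Return URL-encoded Solr parameters for a JSON range facet over `field`.

    `field` is a hypothetical date field holding the time of each view.
    """
    facet = {
        "views_over_time": {
            "type": "range",
            "field": field,
            "start": start,
            "end": end,
            "gap": gap,
        }
    }
    params = {
        "q": q,
        "rows": 0,  # only the facet counts are needed, not the documents
        "json.facet": json.dumps(facet),
    }
    # urlencode escapes '+' as %2B, so the gap should arrive as "+1DAY"
    return urlencode(params)


query_string = build_views_facet_query(
    "2022-10-16T00:00:00Z", "2022-10-24T00:00:00Z"
)
```

The resulting string can be sent as the POST body (or appended to the URL) of the collection's query endpoint. Changing `gap` to `+1MONTH` or `+1YEAR` changes the bucket size.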
But your documents are probably too big to copy into every view record. If you want to normalize this:
Use a JOIN query. Since Solr 8.6, you can do cross-collection joins across multiple shards. This is a good article about how to write those queries, and this is a decent video on how to set it up. It's not that hard to do.
The JOIN query would be much faster.
If you don't want to do the JOIN query:
If the views change often, do not store them on the documents themselves. Solr has no true partial update for indexed fields: even an atomic update reindexes the whole document internally. If you're updating views every day, every document that's been viewed has to be rewritten, and that causes a lot of unnecessary disk thrashing.
Other thoughts:
Can you use a database? It is a far better place to keep views. Solr isn't good as a master record for views.
Another suggestion is to send the views to an analytics engine - a far better solution, since you can get rich analytics about the actual users. An analytics engine does a lot that simply counting rendered views does not, especially filtering out false positives (like bots!). It's not fun to maintain an accurate view count on a high-traffic site.
In the past I've used an analytics engine to collect the data and to export that data into Solr. This way the view logic is handled by the software component that knows views best (an analytics engine such as Google Analytics or the Salesforce marketing engine), and an hourly process updates the views in Solr using one of the tactics above.

Google Search API Wildcard

I have a Python project running on Google App Engine, with a set of data currently stored in the datastore. On the user side, I fetch the records from my API and show them in a Google Visualization table with client-side search. Because of the limits, I can only fetch 1000 records per query, but I want my users to search across all the records I have. I could fetch them with multiple queries before showing them, but fetching 1000 records already takes 5-6 seconds, so this could exceed the 30-second timeout, and I don't think putting around 20,000 records in a table is a good idea.
So I decided to put my records into the Google Search API. I wrote a script to sync the important data between the datastore and a Search API index. When performing a search, I couldn't find anything like a wildcard character. For example, say I have a user field that stores a string containing the value "Ilhan". When a user searches for "Ilha", that record does not show up. I want to show records that contain "Ilhan" even if the value is only partially typed. So the SQL equivalent of my search would be something like "select * from users where user like '%ilh%'".
Is there a way to do that, or is this not how the Search API works?
I set up similar functionality purely within the datastore. I have a repeated computed property that contains all the search substrings that can be formed for a given object.
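from google.appengine.ext import ndb

class User(ndb.Model):
    # ... other fields
    # Repeated property holding every lowercase substring to match against.
    # all_substrings is a helper function (not shown here).
    search_strings = ndb.ComputedProperty(
        lambda self: [i.lower() for i in all_substrings(strings=[
            self.email,
            self.first_name,
            self.last_name,
        ])],
        repeated=True)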
Your search query would then look like this:
User.query(User.search_strings == search_text.strip().lower()).fetch_page(20)
If you don't need the other features of the Google Search API, and if the number of substrings per entity won't put you at risk of hitting the 900 properties limit, then I'd recommend doing this instead, as it's pretty simple and straightforward.
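The `all_substrings` helper used in the model is not shown in the answer. The following is one possible version, given purely as an assumption about what it might look like. The `min_length` parameter is an addition of this sketch, meant to keep the number of stored substrings down.

```python
def all_substrings(strings, min_length=2):
    """Return every distinct substring of at least `min_length` characters.

    Hypothetical stand-in for the helper referenced in the answer.
    """
    result = set()
    for text in strings:
        if not text:  # skip None or empty fields
            continue
        for start in range(len(text)):
            for end in range(start + min_length, len(text) + 1):
                result.add(text[start:end])
    return sorted(result)


substrings = all_substrings(["Ilhan"], min_length=3)
```

The number of substrings grows roughly with the square of the string length, so long fields can produce a large repeated property. Raising `min_length` or indexing only prefixes are two ways to keep that in check.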
As for taking 5-6 seconds to fetch 1000 records, do you need to fetch that many? Why not fetch only 100, or even 20, and use a query cursor so the user pulls the next page only when they need it?

Search engine with few documents just imported

I'm wondering whether it's possible to search with the Retrieve and Rank service over a few PDF documents that were just imported. For example, I want to search for information in 4 PDF documents, so I import the documents into the system and use my search engine to find that information.
Any idea about the feasibility?
It'll work - there aren't minimum requirements.
But I'd need to know a little more about your use case to know if it's a sensible idea.
For example, how long are your documents? The smallest production cluster that R&R provides is 32GB, so you'll be paying the monthly fee for that even if you only put 4 tiny documents in. That may not be a very cost-effective way to solve your particular problem.
What do you want it to return in response to queries? If it's the whole document that you want it to return, then every query could end up returning the same 4 documents, just in a different order each time... which doesn't sound like a very helpful thing to do.

CakePHP search by relevance

I am developing a job site where I want to search through job ads by relevance. I have fields such as job title and job_text, for example. Now let's say a person searches for cakephp: I would like to get the results for cakephp first, followed by those for php, which also matches, since cakephp is obviously the most relevant. How can I do this?
My suggestion is to run multiple queries to get the ordering.
For example, first find the jobs whose title contains, say, 'php' (ordered by title desc), then run a query to find jobs where 'php' appears in the job's keywords, and lastly run a query to find jobs whose description contains the word 'php'.
Then you can combine the results for these queries.
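Combining the results could look something like the sketch below, which assumes each query returns a list of job records with a unique `id` key (the record shape and the function name are hypothetical). Earlier lists take priority, and a job already seen is not repeated.

```python
def merge_by_priority(*result_lists):
    """Concatenate result lists in priority order, keeping the first
    occurrence of each job id."""
    seen = set()
    merged = []
    for results in result_lists:
        for job in results:
            if job["id"] not in seen:
                seen.add(job["id"])
                merged.append(job)
    return merged


# Made-up sample results from the three queries described above
title_hits = [{"id": 1, "title": "CakePHP developer"}]
keyword_hits = [
    {"id": 2, "title": "Backend engineer"},
    {"id": 1, "title": "CakePHP developer"},
]
description_hits = [
    {"id": 3, "title": "Web developer"},
    {"id": 2, "title": "Backend engineer"},
]

ranked = merge_by_priority(title_hits, keyword_hits, description_hits)
```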
The best way I found to do what I was trying to do at that stage was to integrate with Apache Solr or a similar search engine.
If you want to sort by relevance, you will have to come up with criteria that define what relevance means for you, and then calculate it. For example, an article with more views might be more relevant than another because it was seen by more people. Combine that number with a few other variables (the average rating, for example, if there is rating functionality), calculate a relevance value from them, store it in your table, and order by that relevance field. Update it every time one of the variables in the calculation changes, or do it once a day via a cron job; it all depends on your requirements and performance.
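A minimal sketch of such a calculation follows. The formula and the weights are arbitrary assumptions chosen only to show the shape of the idea. The view count is log-damped so that very popular articles do not drown out the rating.

```python
import math


def relevance_score(views, average_rating, view_weight=1.0, rating_weight=1.0):
    """Combine views and rating into a single sortable number.

    The weights and the log damping are illustrative choices, not a
    recommended formula.
    """
    return view_weight * math.log1p(views) + rating_weight * average_rating


# Made-up sample rows, as they might come out of the table
articles = [
    {"id": 2, "views": 10, "rating": 4.5},
    {"id": 1, "views": 1000, "rating": 3.0},
]

# In practice the score would be stored in a column and used in ORDER BY
for article in articles:
    article["relevance"] = relevance_score(article["views"], article["rating"])

by_relevance = sorted(articles, key=lambda a: a["relevance"], reverse=True)
```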

How to store document vectors in a database for a search engine?

I have implemented a search engine in Java. It has a database that stores the inverted index, i.e. the mapping from each term to the list of documents the term appears in. There is a feature that lets a user upload a document, which is then added to the collection for indexing. The problem I'm facing is that every time a new document is added, the index is reconstructed in memory instead of being updated. To update it, I would need a database that stores document vectors, which are essentially the tf-idf values (term frequency * inverse document frequency) of each and every term in the index. I can't work out a database structure for this, i.e. which rows and columns, or which tables, would be needed to store such a structure.
I need to store
1. Document ID
2. Document Title
3. N dimensional Document vector where N is the number of unique terms
4. N terms
5. IDF of each term
6. TF of each term for every document.
I need it so that at query-matching time I can extract this vector and calculate its similarity with the query vector. If you want any additional information, please let me know.
Thank you very much, I'm sure I will get some help here.
Are you sure you want to use a database to implement a search engine?
You may want to take a look at this Java framework, which does an excellent job and is very simple to learn.
Lucene Tutorial in 5 mins
It uses the Vector Space Model and there's no need for you to worry about all the above fields you mentioned in your post, since Lucene stores them along with much more advanced ranking factors.
I'm sorry if my reply doesn't help because you are intentionally using a database.
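If a database is a hard requirement, one possible layout is sketched below with SQLite. The table and column names are assumptions for illustration. The sketch stores only raw term frequencies and derives IDF at query time, so adding a document inserts a few rows instead of rebuilding the index.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE terms (term_id INTEGER PRIMARY KEY, term TEXT UNIQUE NOT NULL);
CREATE TABLE postings (
    term_id INTEGER NOT NULL REFERENCES terms(term_id),
    doc_id  INTEGER NOT NULL REFERENCES documents(doc_id),
    tf      INTEGER NOT NULL,
    PRIMARY KEY (term_id, doc_id)
);
""")


def add_document(conn, title, tokens):
    """Insert one document and its term frequencies; nothing else is touched."""
    doc_id = conn.execute(
        "INSERT INTO documents (title) VALUES (?)", (title,)
    ).lastrowid
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    for term, tf in counts.items():
        conn.execute("INSERT OR IGNORE INTO terms (term) VALUES (?)", (term,))
        term_id = conn.execute(
            "SELECT term_id FROM terms WHERE term = ?", (term,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO postings (term_id, doc_id, tf) VALUES (?, ?, ?)",
            (term_id, doc_id, tf),
        )
    return doc_id


def document_vector(conn, doc_id):
    """Return {term: tf * idf} for one document, with idf = log(N / df)."""
    n_docs = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    rows = conn.execute(
        """
        SELECT t.term, p.tf,
               (SELECT COUNT(*) FROM postings p2
                 WHERE p2.term_id = p.term_id) AS df
          FROM postings p
          JOIN terms t ON t.term_id = p.term_id
         WHERE p.doc_id = ?
        """,
        (doc_id,),
    ).fetchall()
    return {term: tf * math.log(n_docs / df) for term, tf, df in rows}


first = add_document(conn, "Doc A", ["search", "engine", "search"])
second = add_document(conn, "Doc B", ["database", "engine"])
vector = document_vector(conn, first)
```

The vector is sparse: only terms that occur in the document are stored, and every other dimension is implicitly zero. Other IDF variants (smoothed, for instance) would only change the expression in `document_vector`.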
