I am developing a job site where I want to search job ads by relevance. I have fields such as job_title and job_text, for example. Now let's say a person searches for 'cakephp': I would like to get the results for 'cakephp' first, followed by results that only match 'php', since 'cakephp' is obviously the most relevant. How can I do this?
My suggestion is to run multiple queries for sorting purposes.
For example, first find the jobs whose title matches 'php' (ordering by title), then run a query to find jobs where 'php' appears in the keywords, and lastly run a query to find jobs where the description contains the word 'php'.
Then combine the results of these queries, as sketched below.
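As a rough illustration, the same per-field ranking can also be collapsed into a single SQL query with a CASE expression. This is a minimal sketch, assuming a hypothetical jobs table with job_title, keywords, and job_text columns:
SELECT id, job_title,
       CASE
         WHEN job_title LIKE '%cakephp%' THEN 3  -- term in the title: most relevant
         WHEN keywords  LIKE '%cakephp%' THEN 2  -- term in the keywords
         ELSE 1                                  -- term only in the body text
       END AS relevance
FROM jobs
WHERE job_title LIKE '%cakephp%'
   OR keywords  LIKE '%cakephp%'
   OR job_text  LIKE '%cakephp%'
ORDER BY relevance DESC;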
The best way I found to do what I was trying to do at that stage was to integrate with Apache Solr or a similar search engine.
If you want to sort by relevance, you will have to come up with some criteria for how relevance is defined for you, and calculate it. For example, if a certain article got more views, it might be more relevant than another article because it was seen by more people. Combine that number with a few other variables (average rating, for example, if there is rating functionality), calculate a relevance value from them, store it in your table, and order by the relevance-value field. Update it every time one of the variables in the calculation changes, or recalculate it via a cron job once per day; it all depends on your requirements and performance.
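As a minimal sketch of that recalculation, assuming hypothetical views and avg_rating columns and a stored relevance column (the weights are purely illustrative):
UPDATE jobs
SET relevance = (views * 0.7) + (avg_rating * 0.3);
A cron job can run this once a day, and the search query then simply orders by relevance DESC.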
I am parsing documents on the web and storing them in a Solr database. Every day I see thousands of documents, and some of them repeat.
I'd like to give the user an option to see which document was seen the most on a given date, or in a given timespan. The queries of interest are:
- show me which documents were seen the most on 16/10/2022
- show me which documents were seen the most between 16/10/2022 and 23/10/2022
When writing Solr queries, you specify the field name to search on. What field type should I use, and in what format should I store the number of times the document was seen on a given date?
How I would try it:
Create a separate, very simple collection with fields:
- view_time
- doc_id
- title or body (whatever you're querying)
... and add a document to it for EVERY view, as in the sketch below.
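Recording one view is then a single small document. A sketch, assuming the hypothetical collection name views and the field names above, with view_time as a date field (e.g. pdate):
curl 'http://localhost:8983/solr/views/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"doc_id": "doc-123", "title": "Some Title", "view_time": "2022-10-16T09:30:00Z"}]'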
You can then query it by whatever gap you want:
curl http://localhost:8983/solr/views/query -d 'q=title:abc&rows=0&json.facet={
  per_month: { range: {
    field: view_time,
    start: "2022-01-01T00:00:00Z",
    end:   "2022-12-31T23:59:59Z",
    gap:   "+1MONTH"
  }}
}'
This would return all views by MONTH (you can change it to DAY, YEAR, etc.).
But your documents are probably too big to copy into every view record. If you want to normalize this instead, use a JOIN query: since Solr 8.6 you can do cross-collection joins across multiple shards, and there are good articles and videos about how to set those queries up. It's not that hard to do.
The JOIN query would be much faster; a sketch follows below.
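For illustration, a cross-collection join that finds the documents viewed in a given week might look something like this (a sketch assuming hypothetical collections docs and views joined on doc_id; check the join query parser documentation for your Solr version):
curl http://localhost:8983/solr/docs/select -d 'q={!join method=crossCollection fromIndex=views from=doc_id to=id}view_time:[2022-10-16T00:00:00Z TO 2022-10-23T23:59:59Z]'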
If you don't want to do the JOIN query:
If the views change often, do not store them in the document store. Solr has no true in-place partial updates (even atomic updates rewrite the whole document internally), so if you're updating views every day, you'll need to reindex every document that's been viewed. That's going to cause a lot of unnecessary disk thrashing.
Other thoughts:
Can you use a database? That's a far better home for view counts; Solr isn't good as the master record for views.
Another suggestion is to send the views to an analytics engine - a far better solution, since you can get rich analytics about the actual users. An analytics engine does a lot that simply counting views does not, especially filtering out false positives (like bots!). Maintaining an accurate view count is no fun if you have a high-traffic site.
In the past I've used an analytics engine to collect the data and then exported that data into Solr. This way the view logic is done by the software component that knows views best (the analytics engine, like Google Analytics or the Salesforce marketing engine), and an hourly process updates the views in Solr using one of the above tactics.
Not sure if this is a relevant query to post, but I want to understand whether auto-suggestion is a suitable option for location-based search, as I have a specific requirement. The requirement is: from a specified geo location, I want to search for providers (be it doctors with a specialty, or hospitals) using auto-suggestion.
As part of the suggestion, I need to pass the geo location along with the search key; the search key would be a doctor's name, a doctor's specialty, a hospital name, or a hospital address, and the suggester should return results ordered by geo distance, ascending.
The weight would be calculated as the inverse of the distance.
I posted a query here earlier (solr autosuggestion with tokenization); this post is related to that one.
If you want to add more logic to the suggestions you're going to show, it's probably a good idea to use normal queries instead of the suggest component.
For instance, take a look at this repo: it's a (bit outdated) example of using a normal Solr core to store suggestions and run suggest-like queries. That means you can do partial-match queries on that index and add whatever custom scoring logic you want. Keep in mind that it doesn't need to be a separate core; you could just copy data from the fields you already have into a separate field used only for generating suggestions.
In this case, you'll only need to add/edit the score function to include your own logic (geodist) or even sort hard on the distance, as in the sketch below.
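A minimal sketch of such a query, assuming a hypothetical providers core with a spatial location field (e.g. a LatLonPointSpatialField) and a name field analyzed for partial matches (e.g. edge n-grams):
curl http://localhost:8983/solr/providers/select \
  -d 'q=name:cardio&fq={!geofilt}&sfield=location&pt=17.385,78.486&d=50&sort=geodist()+asc'
Sorting by geodist() ascending returns the nearest providers first; with edismax you could instead fold the inverse distance into the score with a boost function like recip(geodist(),1,1000,1000), which matches the inverse-distance weighting described in the question.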
I'm wondering if it's possible to search with the Retrieve and Rank service over a few PDF documents you've just imported. Example: I want to search for information in 4 PDF documents, so I import the documents into the system and use my search engine to find the information.
Any thoughts about the feasibility?
It'll work - there are no minimum requirements.
But I'd need to know a little more about your use case to know whether it's a sensible idea.
For example, how long are your documents? The smallest production cluster that R&R provides is 32GB, so you'll be paying the monthly fee for that even if you only put 4 tiny documents in. That may not be a very cost-effective way to solve your particular problem.
What do you want it to return in response to queries? If it's the whole document that you want it to return, then every query could end up returning the same 4 documents, just in a different order each time... which doesn't sound like a very helpful thing to do.
I have run into a scenario while running a query in App Engine which is increasing my cost considerably.
I am writing the query below to fetch book names:
// Fetches every entity matched by the query and iterates over all of them:
Iterable<Entity> entities =
    datastore.prepare(query).asIterable(DEFAULT_FETCH_OPTIONS);
After that I run a loop to match the name against the name the user requested. This causes reads for all the books in the datastore, and with the number of books growing day by day, it impacts cost further, since the entire list is read every time.
Is there an alternative that fetches data only for the book the user requested, so that I don't have to read the complete datastore? Would SQL or filters help? I would appreciate it if someone could provide the query.
You have two options:
If you match the title exactly, make it an indexed field and use a filter to fetch only books with exactly the same title (see the sketch after this list).
If you search within titles too:
a. You can use the Search API to index all titles and use it to find the books your users are looking for.
b. A less optimal but quick solution is to create a projection query that reads only the book titles.
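A minimal sketch of the exact-match option, using the App Engine Datastore Java API and assuming a hypothetical Book kind with an indexed title property:
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

public class BookLookup {
    public static Iterable<Entity> findByTitle(String requestedTitle) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        // The filter is evaluated against the datastore index, so only the
        // matching entities are read (and billed), not the whole Book kind.
        Query query = new Query("Book")
                .setFilter(new FilterPredicate("title", FilterOperator.EQUAL, requestedTitle));
        return datastore.prepare(query).asIterable(FetchOptions.Builder.withDefaults());
    }
}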
I am stuck on a database problem for a client and wondering if someone could help me out. I am currently trying to implement filtering functionality so that a user can filter results after they have searched for something. We are using SQL Server 2008. I am working on an electronics e-commerce site, and the database is quite large (500,000+ records). The scenario is this: a user goes to our website, types in 'laptop', and clicks search. This brings up the first page of several thousand results. What I want to do is then filter these results further and present the user with options such as:
Filter By Manufacturer
Dell (10,000)
Acer (2,000)
Lenovo (6,000)
Filter By Colour
Black (7000)
Silver (2000)
The main columns of the table are as follows (the primary key is an integer ID):
ID | Title | Manufacturer | Colour
The key part of the question is how to get the counts for the various categories efficiently. The only way I currently know is with separate queries; however, should we wish to filter by further categories, this will become very slow, especially as the database grows. My current SQL is this:
select count(*) as ManufacturerCount, Manufacturer from [ProductDB.Product] GROUP BY Manufacturer;
select count(*) as ColourCount, Colour from [ProductDB.Product] GROUP BY Colour;
My question is whether I can get the results as a single table using some kind of join or union, and whether this would be faster than my current method of issuing multiple queries with COUNT(*). Thanks for your help; if you require any further information, please ask. PS: I am wondering how sites like eBay and Amazon manage to do this so fast. To understand my problem better, go onto eBay and type in 'laptop': you will see a number of filters on the left, and that is basically what I am trying to achieve. I don't know how it can be done efficiently when there are many filters. E.g., to get functionality equivalent to eBay's I would need about 10 queries, and I'm sure that would be slow. I was thinking of creating an intermediate table with all the counts; however, the intermediate table would have to be continuously updated to reflect changes to the database, and that would be a problem if there are multiple updates per minute. Thanks.
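For reference, SQL Server 2008 can return both of the counts above in a single round trip with GROUPING SETS; a sketch against the table described in the question:
SELECT Manufacturer, Colour, COUNT(*) AS Cnt
FROM [ProductDB.Product]
GROUP BY GROUPING SETS ((Manufacturer), (Colour));
-- Rows grouped by Manufacturer have Colour = NULL, and vice versa.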
The "intermediate table" is exactly the way to go. I can guarantee you that no e-commerce site with substantial traffic and large number of products would do what you are suggesting on the fly at every inquiry.
If you are worried about keeping track of changes to products, just do all changes to the product catalog thru stored procs (my preferred method) or else use triggers.
One complication is how you will group things in the intermediate table. If you are only grouping on pre-defined categories and sub-categories that are built into the product hierarchy, then it's fairly easy. It sounds like you are allowing free-text search, though; if so, how will you manage multiple keywords that result in an unexpected intersection of different categories?
One way is to save the keywords searched, along with the counts and a timestamp. The next time someone searches on the same keywords, check the intermediate table: if the timestamp is older than some predetermined threshold (say, 5 minutes), return your results to a temp table, query the category counts from the temp table, overwrite the previous counts with the new timestamp, and return the whole enchilada to the web app. Otherwise, skip the temp table and just return the pre-aggregated counts and data records. A sketch of such an intermediate table follows below.
In this case, you might get some quirky front-end count behavior: it might say "10 results" in a particular category, but when the user drills down they actually find 9 or 11. It's happened to me on different sites as a customer, and it's really not a big deal.
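A minimal sketch of that cached-counts table and its staleness check, with purely illustrative names and threshold:
CREATE TABLE SearchFacetCounts (
    Keywords    NVARCHAR(200) NOT NULL,  -- the search phrase, e.g. 'laptop'
    FacetName   NVARCHAR(50)  NOT NULL,  -- 'Manufacturer' or 'Colour'
    FacetValue  NVARCHAR(100) NOT NULL,  -- e.g. 'Dell', 'Black'
    ResultCount INT           NOT NULL,
    RefreshedAt DATETIME      NOT NULL
);

-- Serve the cached counts if they are still fresh (under 5 minutes old):
SELECT FacetName, FacetValue, ResultCount
FROM SearchFacetCounts
WHERE Keywords = 'laptop'
  AND RefreshedAt > DATEADD(MINUTE, -5, GETDATE());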
BTW, I used to work for a well-known e-commerce company and we did things like this.