Search engine with few documents just imported - ibm-watson

I'm wondering whether it's possible to use the Retrieve and Rank service to search a few PDF documents that I have just imported. For example: I want to search for information in 4 PDF documents, so I import the documents into the system and use my search engine to find that information.
Is this feasible?

It'll work - there aren't minimum requirements.
But I'd need to know a little more about your use case to know if it's a sensible idea.
For example, how long are your documents? The smallest production cluster that R&R provides is 32GB, so you'll be paying the monthly fee for that even if you only put 4 tiny documents in. That may not be a very cost-effective way to solve your particular problem.
What do you want it to return in response to queries? If it's the whole document that you want it to return, then every query could end up returning the same 4 documents, just in a different order each time... which doesn't sound like a very helpful thing to do.

Related

Is it possible to get a list of similar and/or identical documents?

This is a general question on which I would like some input from the search community, so I don't have a piece of code to share just yet.
The objective is for a single document to get a list of similar and/or identical documents indexed by Azure Search - is that possible?
So given a document_id = 1, how do I get a list of the documents in the index that are most similar to the specified id? Ideally the outcome would be a list of documents ordered by a match score of 0-100, where 100 (%) would be an identical match.
I am considering taking the content of a given document and submitting that as part of the search, but that doesn't seem very elegant. It is also error prone in terms of constructing the query, and the size of a document can be significant.
Thank you in advance for any suggestions or comments.
You could try using the preview feature "moreLikeThis" -> https://learn.microsoft.com/en-us/azure/search/search-more-like-this
I believe that's the closest Azure Search has to offer to what you want.
Edit 1: Be advised that this feature has limitations like non-support for complex types. Make sure it meets your requirements before taking a production dependency.

Solr near real time search: impact of frequently reindexing the same documents

We want to use Solr in a Near Real Time scenario. Say, for example, we want to filter / rank our results by number of views.
Solr's soft commit was made for this use case, but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know, each update, even a partial one, is implemented as a full delete and full re-addition of the document in Lucene.
It seems to me that having the same docs in the tlog many times is inefficient and might also be problematic during the merge process (is the doc marked n times as deleted and added?)
Any advice / good practice?
Two things you could use for supporting this scenario:
In-place updates: only that field is updated, not the whole doc. Check the conditions a field must meet before you can use them.
ExternalFileField: you keep the values in an external file.
If the scenario is critical, I would test both in real-world conditions if possible, and assess.
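For the in-place update option, the request body that increments a single field can be sketched as below. This is only an illustration: it assumes nb_view (the field from the question) meets the in-place update conditions (single-valued, numeric, docValues enabled, neither indexed nor stored), and the class and method names are hypothetical. The resulting JSON would be sent to the collection's update handler.

```java
final class InPlaceUpdateBody {
    /**
     * Builds a JSON update body that uses Solr's "inc" modifier on one field.
     * Assumption: id and field contain no characters that need JSON escaping.
     */
    static String incrementBody(String id, String field, int delta) {
        return "[{\"id\":\"" + id + "\",\"" + field + "\":{\"inc\":" + delta + "}}]";
    }

    public static void main(String[] args) {
        // prints [{"id":"doc1","nb_view":{"inc":1}}]
        System.out.println(incrementBody("doc1", "nb_view", 1));
    }
}
```

Whether Solr applies this as a true in-place update or falls back to a regular atomic update depends on the field definition, so it is worth verifying against your schema.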

Google app engine - help in query optimization

I have run into a scenario while running a query in App Engine that is increasing my cost considerably.
I am writing the below query to fetch book names -
Iterable<Entity> entities =
datastore.prepare(query).asIterable(DEFAULT_FETCH_OPTIONS);
After that I run a loop to match each name against the name the user has requested. This causes data reads for all the books in the datastore, and since the number of books grows day by day, the cost keeps rising because the entire list is read every time.
Is there an alternative that fetches only the book the user requested, so that I don't have to read the complete datastore? Will SQL help, or filters? I would appreciate it if someone could provide the query.
You have two options:
If you match the title exactly, make it an indexed field and use a filter to fetch only books with exactly the same title.
If you search within titles too:
a. You can use Search API to index all titles and use it to find the books your users are looking for.
b. A less optimal but quick solution is to create a projection query that reads only the book titles.

CakePHP search by relevance

I am developing a job site where I want to search through job ads by relevance. I have fields such as job title and job_text, for example. Now let's say a person searches for cakephp: I would like to get results for cakephp first, followed by results for php, which also matches, although cakephp is obviously the most relevant. How can I do this?
My suggestion is that you should run multiple queries for the sorting purposes.
For example, first find the jobs where the title is, say, php, using order by title desc. Then run a query to find jobs where 'php' appears in the job's keywords, and lastly run a query to find jobs where the description has the word 'php' in it.
Then you can combine the results for these queries.
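Combining the results of those queries can be sketched as below. This is a plain in-memory illustration, not CakePHP code: each loop stands in for one database query, and the Job class and its field names are assumptions. Earlier tiers win, and a job that matches in several fields is listed only once.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TieredSearch {
    /** Hypothetical job ad; the field names are assumptions for illustration. */
    static final class Job {
        final int id;
        final String title;
        final String keywords;
        final String description;

        Job(int id, String title, String keywords, String description) {
            this.id = id;
            this.title = title;
            this.keywords = keywords;
            this.description = description;
        }
    }

    private static boolean matches(String text, String term) {
        return text.toLowerCase().contains(term.toLowerCase());
    }

    /** Returns job ids ordered by tier: title matches, then keywords, then description. */
    static List<Integer> search(List<Job> jobs, String term) {
        // LinkedHashSet keeps insertion order and ignores ids that were already added.
        Set<Integer> ranked = new LinkedHashSet<>();
        for (Job j : jobs) {
            if (matches(j.title, term)) ranked.add(j.id);       // tier 1
        }
        for (Job j : jobs) {
            if (matches(j.keywords, term)) ranked.add(j.id);    // tier 2
        }
        for (Job j : jobs) {
            if (matches(j.description, term)) ranked.add(j.id); // tier 3
        }
        return new ArrayList<>(ranked);
    }
}
```

Running three queries costs more round trips than one, so this trades some performance for simple, predictable ordering.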
The best way I found to do what I was trying to do at that stage was to integrate with Apache Solr or a similar search engine.
If you want to sort by relevance you will have to come up with some criteria for how relevance is defined for you, and calculate it. For example, if a certain article got more views it might be more relevant than another article because it was seen by more people. Combine that number with a few other variables (the average rating, for example, if there is a rating functionality), calculate a relevance value based on them, store it in your table and order by the relevance value field. Update it every time one of the variables in the calculation changes, or do it via a cron job once per day; it all depends on your requirements and performance.
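A stored relevance value of that kind could be computed along these lines. The formula and the weight are assumptions chosen only to show the idea of combining views with an average rating; your own criteria will differ.

```java
final class RelevanceScore {
    // Hypothetical weight; tune it for your own data.
    static final double RATING_WEIGHT = 0.5;

    /** Combines a view count and an average rating into one sortable number. */
    static double score(long views, double averageRating) {
        // log10 dampens the view count so that very popular items
        // do not completely drown out the rating.
        return Math.log10(1 + views) + RATING_WEIGHT * averageRating;
    }

    public static void main(String[] args) {
        System.out.println(score(999, 4.0));
        System.out.println(score(9, 5.0));
    }
}
```

The result would be written to the relevance column whenever views or ratings change, or by the daily cron job, and the listing query then simply orders by that column.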

Faceted search: ElasticSearch/Solr or a simple database query?

Forgive this super basic question, from a search newbie.
I want to implement a site that makes use of faceted search. For example, it's a site with a database of hotels, and I want to allow users to search for hotels within a price range, with a swimming pool, with either three or four stars.
Clearly I can return results to users with a simple database query.
Should I use ElasticSearch or Solr to implement this instead of using a database query? If so, why?
Yes you should use ES or Solr. Reasons: primarily performance and the ability to change (think config) 'types of faceting' easily.
Faceting is no small feat, and although you could do it with an RDBMS, doing it fast requires hard thinking. Why do it yourself if you can use the gazillions of hours the Solr / ES (+ Lucene) teams have spent optimizing it?
As for the 'types of faceting' I mentioned:
Perhaps you want to do hierarchical faceting: select a price category > display smaller price categories. How are they bucketed: fixed range, evenly distributed, etc.? Solr / ES provide these options from within a config.
Perhaps instead you implement price faceting with a slider with min/max handles. Do you want to display the number of hotels while you slide? (histogram / facet stats in Solr / ES)
While you've faceted on price, perhaps you still want to know the min and max value of the price slider as if you didn't filter on price. This is needed if you want to be able to draw the slider handles proportionally. (See my question on SO, asked while considering a switch from Solr to ES: "Elasticsearch: excluding filters while faceting possible? (like in Solr)".)
Faceting on stars? Perhaps you want to show the best price per stars facet, i.e. what the user would get by selecting that star rating (again histogram / stats).
Seriously, don't even consider doing the above with an RDBMS. You'll go insane.
Hope that helps, and yes I'm familiar with the domain :)
Additional questions, just ask.
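To make the "excluding filters while faceting" point concrete, here is a small in-memory sketch. It is not how Solr or ES implement faceting; the Hotel class and method names are hypothetical. It only shows that the star counts respect the price filter, while the slider bounds are computed with the price filter left out.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class FacetSketch {
    /** Hypothetical hotel; the fields are assumptions for illustration. */
    static final class Hotel {
        final int stars;
        final int price;

        Hotel(int stars, int price) {
            this.stars = stars;
            this.price = price;
        }
    }

    /** Star facet: number of hotels per star rating within the selected price range. */
    static Map<Integer, Integer> starCounts(List<Hotel> hotels, int minPrice, int maxPrice) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Hotel h : hotels) {
            if (h.price >= minPrice && h.price <= maxPrice) {
                counts.merge(h.stars, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Slider bounds: {min, max} price with the star filter applied but the
     * price filter deliberately excluded. If no hotel matches, min stays
     * Integer.MAX_VALUE and max stays Integer.MIN_VALUE.
     */
    static int[] priceBounds(List<Hotel> hotels, Set<Integer> selectedStars) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (Hotel h : hotels) {
            if (selectedStars.contains(h.stars)) {
                min = Math.min(min, h.price);
                max = Math.max(max, h.price);
            }
        }
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        List<Hotel> hotels = List.of(
            new Hotel(3, 80), new Hotel(3, 120), new Hotel(4, 150), new Hotel(5, 300));
        System.out.println(starCounts(hotels, 100, 200)); // prints {3=1, 4=1}
        int[] bounds = priceBounds(hotels, Set.of(3, 4));
        System.out.println(bounds[0] + " to " + bounds[1]); // prints 80 to 150
    }
}
```

Every extra facet adds another pass with a different combination of filters, which is the bookkeeping a search engine handles for you.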
