Getting most frequent terms in a subset of indexed Lucene documents - Solr

Let's assume the following scenario.
Lucene document: ArticleDocument
Fields: {Id, text, publisherId}
A publisher can publish multiple articles.
Problem
I would like to build word clouds (most frequent words, shingles) for each Publisher Id.
After some investigation, I could find ways to get the most frequent terms for the entire index or for a single document, but not for a subset of documents. I found a similar question, but that is for Lucene 2.x and I'm hoping there is an effective way in recent Lucene.
Could you please point me to a way to do this in Lucene 4.x (preferred) or 3.x (the latest 3.x release)?
Please note that I cannot make each publisher a single document with all of its articles appended to one field.
That's because I want the words in the cloud to be searchable, with the corresponding articles (of the same publisher id) returned as the results.
I'm also not sure whether maintaining two types of Lucene documents (article and publisher) is a good idea in terms of maintenance and performance.

Use pivot faceting, available in Solr 4.x releases. Pivot faceting allows you to facet within the results of the parent facet.
Generate shingled tokens for the "text" field at indexing time using ShingleFilterFactory.
For faceting, add the facet=true&facet.pivot=publisherid,text parameters to your query.
Sample query:
http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true&facet=true&facet.pivot=publisherid,text
The query will return the frequent shingles/words, with their frequency, for each "publisherid".
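If you query from Java, a rough SolrJ 4.x sketch of issuing that request and reading the nested pivot counts might look like the following (the URL, collection name and facet limit are just placeholders for your setup):
SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
SolrQuery query = new SolrQuery("*:*");
query.setFacet(true);
query.set("facet.pivot", "publisherid,text");  // parent facet: publisherid, child facet: shingled text
query.setFacetLimit(50);                       // top 50 shingles per publisher for the word cloud
QueryResponse response = solr.query(query);
for (Map.Entry<String, List<PivotField>> entry : response.getFacetPivot()) {
    for (PivotField publisher : entry.getValue()) {        // one bucket per publisherid
        for (PivotField shingle : publisher.getPivot()) {  // nested shingle -> count
            System.out.println(publisher.getValue() + " : " + shingle.getValue() + " = " + shingle.getCount());
        }
    }
}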

Related

Migrating SOLR fq to Elasticsearch

I am currently migrating a SOLR app to Elasticsearch and have become stuck on a particular query. The ElasticSearch documentation is rather vague on how to achieve my desired result.
Currently I am trying to convert tagged "fq"s (filter queries) from SOLR into Elasticsearch. I need to be able to return facets (now known as aggregations) from Elasticsearch based on my query and filters, but also show aggregations for the other options in a search.
Although this sounds complicated, it is achieved in SOLR simply by adding an "fq" parameter and tagging the filter as follows:
q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=on&facet.field={!ex=dt}doctype
According to the main SOLR help docs, this will filter on "doctype:pdf" but also include counts for other doc types in the facet output. Again, this works fine for me; I am simply trying to recreate it in Elasticsearch.
So far I have tried a "post_filter", which does the job until I wish to apply any more than one filter (again, something SOLR handles with no problems). You can see an example of how this works and how I want to achieve it at:
https://www.jobsinhealthcare.co.uk/search?latitude=&longitude=&title=&location=&radius=5&type=&salary=0&frequency=year&since=&jobtype=&keywords=&company=&sort=Most+recent&filter[contract_type_estr][33d5667c]=Temporary&filter[job_type_estr][5d370027]=Part+time&filter[job_type_estr][4b45bd05]=Full+time
In the filters/facets on the right of the results you can select multiple "contract type" and/or "job type" and/or "location" values and still be shown the facet counts for the unselected filters. Please note that Hourly Salary, Annual Salary and Date Added do NOT have this functionality; this is by design.
Any pointers as to how I should be structuring my query would be greatly appreciated.
I think what you need is a global aggregation (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-global-aggregation.html). Inside the top-level aggregation you should use a filter aggregation (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filter-aggregation.html) as a sub-aggregation to filter only on "status:public".
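A rough sketch of that aggregation tree with the Elasticsearch 1.x Java client (the index, field names and query below are only placeholders; the JSON request body nests the same way):
SearchResponse response = client.prepareSearch("jobs")
        .setQuery(QueryBuilders.matchQuery("title", "mainquery"))    // main query
        .setPostFilter(FilterBuilders.termFilter("doctype", "pdf"))  // narrows the hits but not the aggregations
        .addAggregation(
                AggregationBuilders.global("all_docs")               // step outside the main query scope
                        .subAggregation(AggregationBuilders.filter("public_only")
                                .filter(FilterBuilders.termFilter("status", "public"))
                                .subAggregation(AggregationBuilders.terms("doctype_counts").field("doctype"))))
        .execute().actionGet();
The "doctype_counts" buckets then reflect all public documents, regardless of which doctype the user has already selected in the post_filter.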

Similarity/approximate queries in Solr

What is the simplest way to query Solr for documents that contain text similar to a (longish) passage? This is similar to what Elasticsearch match queries do, or what probabilistic search engines like Indri do by default. It is something between an AND and an OR query: none of the terms is required, but you get documents that contain many of the terms. You can also just pass a passage of raw text to the engine, and it returns documents with high term overlap with the passage, without having to parse or tokenize the text in the client. The best option I can see in the Solr query reference is to tokenize the query text myself, insert an OR between each pair of terms, and return the top N results. Is there a more concise way of doing this with Solr?
Solr's MoreLikeThis (MLT) covers this. You can choose to find documents similar to another document in the index, similar to a given external URL, or similar to some given text. You can choose which field(s) to target and various other parameters. Here's the official Solr Reference Guide documentation page for MLT: https://cwiki.apache.org/confluence/display/solr/MoreLikeThis
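If the /mlt request handler is enabled (an assumption; the handler name and the "text" field below are placeholders for your own setup), you can pass the raw passage directly via stream.body and skip client-side tokenizing, e.g.:
http://localhost:8983/solr/collection1/mlt?mlt.fl=text&mlt.mintf=1&mlt.mindf=1&stream.body=paste+the+raw+passage+here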

Lucene and SQL Server - best practice

I am pretty new to Lucene, so would like to get some help from you guys :)
BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents in SQL Server.
Q1) In this case, in order to do keyword search on the documents, should I insert all of those documents into the Lucene index? Does this mean there will be data duplication (one copy in SQL Server and the other in the Lucene index)? That could be a concern since we have a massive amount of documents (about 100GB). Is it inevitable?
Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for the tag search? If so, how do I do it?
Thanks,
Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here for a brief introduction. A typical implementation is to index anything you wish to support searching on, store only a unique identifier in the Lucene index, and pull any records found by a search from the database based on that ID. If you want to reduce DB load, you can store in Lucene just enough information to display a list of search results, and only query the database to fetch the full document.
As for saving on space, there will be some measure of duplication. This is true even if you only use Lucene, though: Lucene stores the inverted index used for searching entirely separately from stored data. To save space, I'd recommend being very deliberate about what data you choose to index and what you need to store and retrieve later. What you store is particularly important, since indexed-only values tend to be very space-efficient in most cases.
Lucene can certainly implement a tag search. The simple way to implement it would be to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense) while building the document, such as:
document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));
Then I could simply add a required term to any query to search only within a particular tag. For instance, if I were to search for "some stuff", but only with the tag "forkids", I could write a query like:
some stuff +tags:forkids
Documents can also be stored in Lucene itself; you can retrieve and reference them using the document ID.
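Putting the pieces together, here is a rough sketch (Lucene 3.x Java API; the path and field names are hypothetical, and the "id" field must have been indexed with Field.Store.YES) of running the tag-restricted search and then looking the hits up in SQL Server:
IndexSearcher searcher = new IndexSearcher(IndexReader.open(FSDirectory.open(new File("/path/to/index"))));
QueryParser parser = new QueryParser(Version.LUCENE_36, "content", new StandardAnalyzer(Version.LUCENE_36));
Query query = parser.parse("some stuff +tags:forkids");
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc hit : hits.scoreDocs) {
    String id = searcher.doc(hit.doc).get("id");  // the stored unique identifier
    // SELECT ... FROM Documents WHERE Id = @id   -- fetch the full record from SQL Server
}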
I would suggest using Solr (http://lucene.apache.org/solr/) on top of Lucene; it is more user friendly and has multiValued fields (for the tags) available by default.
http://wiki.apache.org/solr/SchemaXml

Create a Solr Index using Lucene IndexWriter

I need to index vast amounts of content in extremely short order. I have tried various techniques using SolrNet/Solr with threading and the TPL, but the speeds leave a lot to be desired. Hence I am considering a move to using the Lucene.Net IndexWriter to create the index (preliminarily I see almost an order of magnitude of speed improvement). Any "gotchas" to be aware of?
I am not too sure whether:
1. Trie-based numeric range queries would continue to be available via Solr (I am using NumericFields in Lucene)?
2. Faceting etc. would continue to be available?
Anything else I need to watch out for?
Please see Scaling Lucene and Solr for advice on improving run times.
If you decide to go with Lucene:
You need a unique id field for the index to be a valid Solr index.
The schema must match the Solr schema.
The Lucene version must be the same as in Solr.
I think the range query and faceting will be available, as long as you index the respective fields according to the requirements in Solr, and use the same analyzers.
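As a rough illustration of those points, here is a Lucene 3.6 sketch (Java shown; Lucene.Net's API is similar but not identical, and every path, field name and precisionStep below is hypothetical and must mirror your Solr schema.xml):
Directory dir = FSDirectory.open(new File("/path/to/solr/collection1/data/index"));
IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));
Document doc = new Document();
doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));   // Solr's <uniqueKey>
doc.add(new Field("name", "some searchable content", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new NumericField("price", 8, Field.Store.YES, true).setIntValue(500));  // trie-encoded; precisionStep must match the Solr TrieIntField
writer.addDocument(doc);
writer.close();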

SolrNet Newbie - How to handle multiple Where Clauses

I just started exploring SolrNet. Previously I have been using MSSQL FULL TEXT.
In SQL Server, my query does full-text searches and also has multiple JOINs and WHERE clauses. I am also using custom paging to return only 10 rows out of millions.
I have read a few SolrNet docs and run the sample apps provided on the blogs. All worked well so far. I just need to get an idea: what do I do with JOINs and WHERE clauses?
e.g. If a user searches for Samsung, the db would return 100k records, but if the user searches for Samsung && City='New york' && Price > '500', then he would only get a couple of thousand records.
Do I add all columns in Solr and write WHERE clauses in Solr?
What do I do about SQL JOINS?
Thanks in Advance!
There are no joins in Solr. From the Solr wiki:
Solr provides one table. Storing a set of database tables in an index generally requires denormalizing some of the tables. Attempts to avoid denormalizing usually fail.
About WHERE clauses (i.e. filtering), see Querying in SolrNet, Solr query syntax, and Common Solr query parameters.
The Solr equivalent of your where clauses is to map your columns to fields and run queries based on the query syntax. A query like your example:
Samsung && City='New york' && Price >'500'
could be translated to something like this in Solr:
q=Samsung AND city:"new york" AND price:[500 TO *]
You need to take some care when you map your database to a Solr schema; specifically, you will probably have to denormalize your data. See this page on the Solr wiki for more information. Basically, you can't really do complex JOINs in Solr; it's a "flat" index.
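Since the city and price constraints behave like reusable WHERE clauses, one common refinement (a suggestion, not a requirement) is to keep the text search in q and move them into filter queries, which Solr caches separately; in SolrNet these map to the FilterQueries query option:
q=Samsung&fq=city:"new york"&fq=price:[500 TO *]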
