MALLET for automatic topic tagging - with training data - tagging

I have a corpus of documents, which I have already tagged. I have a fixed list of about 400 tags relating to different topics. Each document has been tagged with one or more tags and given a short title. (I also have a much larger list of titles, which I often re-use if a document contains very similar content.)
I want to make an interface that will suggest tags/titles (from my existing lists) for new documents that I add to the corpus, based on how I have tagged the existing documents.
I have read about the probabilistic topic model LDA classes, which look great for analyzing text when you don't have any existing tagged data. But I don't see any way I can incorporate my existing work.
Any suggestions would be appreciated.
Kind Regards
Swami

For tag suggestion, in our experience a search engine is enough; there is no need for topic modeling.
Try the steps below:
1) Set up an index on the title and abstract of all your documents.
2) Use the title or abstract of the new document as a query against the index to get a list of similar documents.
3) Take the first few most similar documents from that list and aggregate all of their tags into a tag bundle.
4) Sort the tag bundle by the frequency of each tag; the most frequent tags are the final result.
This solution is workable.
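To make the steps above concrete, here is a minimal sketch in Python. It is not a real search engine: a simple term-frequency cosine similarity stands in for the index query, and the corpus layout (dicts with "text" and "tags" keys) and the function names are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_tags(new_text, corpus, top_docs=3, top_tags=5):
    """Suggest tags for new_text from the tags of its most similar documents.

    corpus is a list of dicts with the hypothetical keys "text"
    (title plus abstract) and "tags" (list of tag strings).
    """
    query = Counter(tokenize(new_text))
    # Stand-in for the search-engine query: score every tagged document.
    scored = [(cosine(query, Counter(tokenize(doc["text"]))), doc) for doc in corpus]
    # Drop documents with no overlap, then rank the rest by similarity.
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Aggregate the tags of the top hits into one bundle and count them.
    bundle = Counter()
    for _, doc in scored[:top_docs]:
        bundle.update(doc["tags"])
    # The most frequent tags in the bundle are the suggestions.
    return [tag for tag, _ in bundle.most_common(top_tags)]
```

With a real search engine, the scoring part would be replaced by a query against the index; the tag aggregation at the end stays the same. The same idea could be reused for title suggestions by counting titles instead of tags.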

Related

One to one relationship vs one to many

I have two collections: movies collection and comments collection, I want users to be able to post comments about a movie.
I can either have each movie contain an array holding the ids of its comments, or I can have each comment contain the id of the movie it belongs to. What are the downsides and advantages of each method?
This is more of a theoretical question, so let's assume that comments are too large and cannot be embedded into the movies collection.
This question is difficult to answer. In a NoSQL database (your "mongodb" tag indicates that is what you are using), the choice between two collections, a collection with the comments' _id values embedded in an array, or one single collection with the comment data embedded really depends on your use cases.
With an SQL database you can create a movie table and a comment table, with the movie's id stored in each comment row.
With NoSQL, you have to choose based on your use cases: does your page display a list of movies first, with their associated comments? Do you have a page that lists the latest comments regardless of movie? You also have to factor technical requirements and restrictions into your thinking. For example, MongoDB has one main restriction:
BSON Document Size - The maximum BSON document size is 16 megabytes.
The maximum document size helps ensure that a single document cannot
use excessive amount of RAM or, during transmission, excessive amount
of bandwidth. To store documents larger than the maximum size, MongoDB
provides the GridFS API. See mongofiles and the documentation for your
driver for more information about GridFS.
Check https://docs.mongodb.com/manual/reference/limits/ for more details.
My first thought, given your needs and my general picture of what you want to do with your app, is based on the following use case:
A page lists all movies (you can optionally filter on different movie flags). So your entry point is a movie, not a comment. A comment is related to exactly one movie, never to more than one.
For each movie, a user can display the associated comments and add a new comment.
For this use case, a performant DB organisation is one single collection for movies. Each movie embeds its comments directly, as an array of JSON objects, like:
{
  "_id": "m001",
  "title": "Movie1",
  "synopsis": "A young girl wants to learn chess and becomes the best player in the world; her name: Beth Harmon",
  "comments": [
    {
      "_id": "c001",
      "title": "Good movie",
      "commentText": "This is a very good movie"
    },
    {
      "_id": "c002",
      "title": "Annoying movie",
      "commentText": "This is a very annoying movie"
    }
  ]
}
You don't need to create another collection to store comments; doing so would cost you responsiveness, because reading a movie would require a join against a separate comment collection. BUT this is a good choice only if you think no single movie document, comments included, will grow beyond 16 MB (you can also integrate the GridFS API as indicated by the MongoDB docs, but that is not the subject here...).
Alternatively, IF you think millions and millions of comments, each carrying a lot of information, can be added to a single movie, you will hit that technical limit. In this case it is better to split the data into two collections so that the limit no longer hurts you: each comment will be its own document in the "comment" collection and will certainly not reach 16 MB.
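As a rough illustration of the two-collection layout, here is a sketch in plain Python, with lists of dicts standing in for the two collections. The "movie_id" field name and the helper function are assumptions for illustration; in MongoDB itself the lookup would be a query on the comment collection filtered by that field, ideally backed by an index on it.

```python
# Stand-ins for the two collections: each comment references its movie.
movies = [
    {"_id": "m001", "title": "Movie1"},
]
comments = [
    {"_id": "c001", "movie_id": "m001", "title": "Good movie",
     "commentText": "This is a very good movie"},
    {"_id": "c002", "movie_id": "m001", "title": "Annoying movie",
     "commentText": "This is a very annoying movie"},
]


def comments_for_movie(all_comments, movie_id, limit=20):
    """Return up to `limit` comments that reference the given movie.

    This plays the role of a filtered query on the comment collection;
    the movie document itself never grows when comments are added.
    """
    matching = [c for c in all_comments if c["movie_id"] == movie_id]
    return matching[:limit]
```

The trade-off is the one described above: displaying a movie with its comments now takes two lookups instead of one, but no single document can approach the 16 MB limit, and a "latest comments" page can read the comment collection directly.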
Finally, NoSQL DB performance can be much better than SQL DB performance, but you have to design your DB model around your use case.
I hope this is clear.
Useful links :
https://www.mongodb.com/basics/embedded-mongodb
https://fosterelli.co/collections-and-embedded-documents-in-mongodb (particularly "Example: comments on a blog" which seems to be your use case)

Is it possible to get a list of similar and/or identical documents?

This is a general question on which I would like to get some input from the search community, so I don't have a piece of code to share just yet.
The objective is for a single document to get a list of similar and/or identical documents indexed by Azure Search - is that possible?
So given a document_id = 1, how do I get a list of the documents in the index that are most similar to the one with that id? Ideally the outcome would be a list of documents ordered by a match score of 0-100, where 100 (%) would be an identical match.
I am considering taking the content of a given document and submitting it as part of the search, but that doesn't seem very elegant. It is also error prone in terms of constructing the query, and the size of a document can be significant.
Thank you in advance for any suggestions or comments.
You could try using the preview feature "moreLikeThis" -> https://learn.microsoft.com/en-us/azure/search/search-more-like-this
I believe that's the closest Azure Search has to offer to what you want.
Edit 1: Be advised that this feature has limitations like non-support for complex types. Make sure it meets your requirements before taking a production dependency.
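For orientation, here is a small sketch of how such a request URL could be assembled in Python. The helper name is made up, the service and index names are placeholders, and the api-version is deliberately left to the caller because the feature is only available in preview API versions; check the linked page for the version that applies to you.

```python
from urllib.parse import urlencode


def more_like_this_url(service, index, doc_key, api_version, search_fields=None):
    """Build a GET URL that asks for documents similar to the one with doc_key.

    service, index and api_version are placeholders supplied by the caller;
    search_fields optionally restricts the comparison to the listed fields.
    """
    params = {"moreLikeThis": doc_key, "api-version": api_version}
    if search_fields:
        params["searchFields"] = ",".join(search_fields)
    return (
        f"https://{service}.search.windows.net/indexes/{index}/docs?"
        + urlencode(params)
    )
```

The request itself would still need the usual authentication for your service. Note that the results come back with a relevance score rather than a 0-100 percentage, so an "identical match = 100" scale would have to be derived on your side, if at all.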

Solr multilingual search

I'm currently working on a project where we have indexed text content in Solr. Each piece of content is written in one specific language (we have 4 different European languages), but we would like to add a feature so that if the primary search (the search text entered by the user) doesn't return many results, we also try to look for documents in the other languages. Thus we would somehow need to translate the query.
Our starting point is that we can maintain a mapping list of translated words commonly used in the project's field.
One solution that came to me was to use synonym search feature. But this might not provide the best results.
Does anyone have pointers to existing modules that could help us achieve this multilingual search feature? Or design ideas we could try to investigate?
Thanks
It seems like multi-lingual search is not a unique problem.
Please take a look
http://lucene.472066.n3.nabble.com/Multilingual-Search-td484201.html
and
Solr index and search multilingual data
Those two links suggest having dedicated fields for each language, but you can also have a field that states the language and add a filter query (&fq=) for the language you have detected from the user query. This is a more scalable solution, I think.
One option would be to translate your terms at index time. This could probably be done at the Solr level, or before Solr at the application level. You then store the translated texts in different fields, so you would have fields like:
text_en: "Hello",
text_fi: "Hei"
Then you can just query text_en:Hello and it would match.
And if you want to score primary language matches higher, you could have a primary_language field and then boost documents where it matches the search language higher.
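Here is a minimal sketch of how the query parameters for this could be built in Python, assuming the edismax query parser and the text_<lang> / primary_language field names used above. The function name and the default boost factor are assumptions for illustration, not tuned values.

```python
from urllib.parse import urlencode


def build_solr_params(text, user_lang, boost=2.0):
    """Build the query string for a search in the user's language.

    Searches the translated field for that language and boosts documents
    whose primary_language matches it, so originals rank above translations.
    """
    params = [
        ("q", text),
        ("defType", "edismax"),
        # Field holding the text translated into the user's language.
        ("qf", f"text_{user_lang}"),
        # Boost query: prefer documents originally written in that language.
        ("bq", f"primary_language:{user_lang}^{boost}"),
    ]
    return urlencode(params)
```

The resulting string would be appended to the collection's select URL. For the filter-query variant mentioned earlier, an ("fq", "language:en") style pair could be added instead, with "language" being whatever field you use to record the document language.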

Solr documents with multiple parents

I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in their own fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would make the search a lot easier. The downside, though, is that when e.g. a user updates their name, I have to run through all of their blog posts and update the corresponding documents in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent on which I need to search as well.
Do you have any recommendations on how to handle this use case? Maybe my Google-fu is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The absolutely most performant and easiest solution would be to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than documents are updated. And even if one of the values that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and then going backwards will appear rather performant to most users. This assumes that you actually have to search the values and don't just need to display them, in which case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
Another option is to treat this as a multi-search problem: one collection for blog posts, one for users and one for categories. You then search each collection for the relevant data and merge the results in your search model. You can also use Streaming Expressions to hand off most of this processing to a Solr cluster.
The reason I always recommend flattening if possible is that most features in Solr (and Lucene) are written for a flat document structure, so flattening lets you fully leverage the features available. Since Lucene is by design a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if they are possible at all). If the documents are flat, it just works.
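To illustrate the multi-search option, here is a rough sketch in Python with in-memory lists standing in for the three collections. The field names (id, name, user_id, category_id) and the substring matching are assumptions for illustration; in practice each step would be a query against the corresponding Solr collection.

```python
def search_blogs(term, blogs, users, categories):
    """Search blog fields plus parent names by querying each collection separately."""
    term = term.lower()
    # Step 1: query the parent collections for matching names.
    user_ids = {u["id"] for u in users if term in u["name"].lower()}
    category_ids = {c["id"] for c in categories if term in c["name"].lower()}
    # Step 2: query the blog collection on its own fields and on the
    # parent ids found in step 1, then return the merged result.
    return [
        b["id"]
        for b in blogs
        if term in b["title"].lower()
        or term in b["content"].lower()
        or b["user_id"] in user_ids
        or b["category_id"] in category_ids
    ]
```

With this layout a renamed user only requires updating one user document, at the cost of an extra round trip per parent type at query time and of having to combine relevance scores yourself.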

Using Solr to store user specified information in documents

I have an application that contains a set of text documents that users can search. Every user must be able to search based on the text of the documents. In addition, users must be able to define custom tags and associate them with a document. Those tags are used in two ways:
1) Users must be able to search for documents based on specific tag ids.
2) There must be facets available for the tags.
My solution was to add a multivalued field to each document that acts as an array of the tag ids the document has been tagged with. So far so good. I was able to perform queries based on text and tag ids (for example text:hi AND tagIds:56).
My question is: would that solution work in production, in an environment where users add but also remove tags from documents? Remember, I have to have the data available in real time, so whenever a user adds or removes a tag I have to reindex that document and commit immediately. If that's not a good solution, what would be an alternative?
Stack Overflow uses Solr, in case you doubt Solr's abilities in production.
And although I couldn't find much information on how they have implemented tags, I don't think your approach sounds wrong. Yes, tagged documents will have to be reindexed (which means a slight delay), but other than that I don't see anything wrong with it.
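If resending the whole document for every tag change is a concern, Solr's atomic updates may be worth a look. Below is a minimal sketch in Python that builds such an update payload with the "add" and "remove" modifiers on the tagIds field. The helper name and the document id are made up for illustration, and atomic updates come with schema requirements (the document's fields generally need to be stored or have docValues), so check that against your setup.

```python
import json


def tag_update(doc_id, tag_id, action):
    """Build a Solr atomic-update document that adds or removes one tag id."""
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    # Only the id and the modified field are sent; the modifier name
    # ("add" or "remove") tells Solr how to change the multivalued field.
    return {"id": doc_id, "tagIds": {action: [tag_id]}}


# JSON body for a request to the collection's update handler.
payload = json.dumps([tag_update("doc-1", 56, "add")])
```

Solr still reindexes the document internally, so this mostly saves your application from rebuilding and resending the full text. Instead of a hard commit after every change, soft commits or the commitWithin update parameter are usually the gentler way to get near-real-time visibility.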

Resources