I have an Apache Solr core from which I need to pull the popular terms. I am already aware of Luke, facets, and Apache Solr stopwords, but I am not getting what I want. For example, when I try to use Luke to get the popular terms and then apply the stopwords to the result set, I get a bunch of words like:
http, img, que ...etc
While what I really want is:
Obama, Metallica, Samsung ...etc
Is there any better way to implement this in Solr? Am I missing something that should be used to do this?
Thank you
Finding relevant words in a text is not easy. The first thing I would take a deeper look at is Natural Language Processing (NLP) with Solr. The article in Solr's wiki is a starting point for this. Reading the page, you will stumble over the Full Example, which extracts nouns and verbs; probably that already helps you.
During the process of getting this running, you will need to install additional software (Apache's OpenNLP project), so after reading Solr's wiki, that project's home page may be the next step.
To get a feeling for what is possible with it, you should have a look at the demonstration by the Searchbox guy. There you can paste a sample text and get relevant words and terms extracted from it.
There are several tutorials out there you may have a look at for further reading.
If you go down that path and the results are not as expected or not as good as required, you may go down that road even further and start thinking about text mining with Apache Mahout. Again, there are several tutorials out there on combining it with Solr.
In any case, you should then search Stack Overflow or the web for the tutorials and how-tos you will certainly need.
Update about Arabic
If you are going to use OpenNLP for unsupported languages (which Arabic, unfortunately, is out of the box as of version 1.5), you will need to train OpenNLP for the language. The reference for this is found in the developer docs of OpenNLP. Probably there is already something out there from the Arabic community, but my Arabic google-fu is not that good.
Should you decide to do the work and train it for Arabic, why not share your training with the project?
Update about integration in Solr/Lucene
There is work going on to integrate it as a module. In my humble opinion, this is as far as it will, and should, get. Compared to this problem field, stemming appears rather easy, and even stemming got complex when supporting different languages. Analysing a language to the level where you can extract nouns, verbs and so forth is so complex that a whole project evolved around it.
Having a module/contrib at hand that you could simply copy to solr_home/lib would already be very handy, as there would be no need to run a separate installer and so forth.
Well, this is a bit open-ended.
First you will need to facet and find the "popular terms" in your index, then add all the non-useful items (such as http, img, time, what, when, etc.) to your stopword list and re-index to get the cream of the data you care about. I do not think there is an easier way of finding popular names, unless you can bounce your data against a custom dictionary of nouns during indexing (that is an option, by the way): you can choose to index only names by writing a custom token filter (look at how the stopword filter works) backed by your own nouns.txt file, so that only words in your dictionary make it into the index. This approach is only possible if you have a finite, known list of nouns.
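In fact, Solr already ships a filter that implements exactly this whitelist idea: KeepWordFilterFactory, which discards every token not present in a given word file. A rough, untested schema.xml sketch (the field type name and file name are my own illustrative choices):

```xml
<!-- Keep only tokens listed in nouns.txt; everything else is dropped at index time -->
<fieldType name="nouns_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="nouns.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Faceting on a field of this type would then only ever surface terms from your curated list.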
I've been trying my very best not to ask any nosy questions here on Stack Overflow, but it has been almost a week since I got stuck on this problem, and I couldn't find any solution.
I already have my working website built with CakePHP 3.2. What the website basically does is scrape Twitter for tweets containing a given search term, check if it's already in my database, and store it if it doesn't yet exist. Twitter's JSON response has this "tweet_id" property, and I've been using that value to check for whether I should ignore or append a specific tweet to my DB. While this might be okay while my database is small, I suspect it's going to slow things down considerably when my tables grow bigger. Thus my need for ElasticSearch.
My ElasticSearch server is running on my Arch Linux install, and I've configured my app to point to the said server. Also, I have my "Type" object named the same way as my "Tweets" table (I followed the documentation up to the overview part: http://book.cakephp.org/3.0/en/elasticsearch.html). This craps out an "Unknown method 'alias'" error, and following Google searches led me to creating an alternate pagination class, since that was what some found to be the cause of the error (https://github.com/lorenzo/audit-stash/issues/4), but that still doesn't fix things.
I'm not sure if I got this right. I installed the ElasticSearch plugin with the assumption that all I have to do is name the Types the same name as my tables, since to me the documentation "implies" that this should be done on top of the Blog Tutorial they did to "improve query performance".
TL;DR: how is this supposed to work? Is my above assumption right? Do I name the Types differently and index everything myself? I'm not sure if there's just too much automagic, or if I'm just poor at this sort of thing. And yes, I'm new to frameworks (but not to PHP, among other languages).
Thanks in advance!
I am trying to formulate a proposal for an application that allows a user to print a batch of documents based on data stored in a SQL table. The SQL table indicates which documents are due and also contains all demographic information. This is outside of what I normally do, and I am trying to see if a platform/application already exists to do such a task.
For example
List of all documents: Document #1 - Document #10
Person 1 is due for document #: 1,5,7,8
Person 2 is due for document #: 2,6
Person 3 is due for document #: 7,8,10
etc
Ideally, what I would like is for the user to be able to push a button and get a printed stack of documents that have been customized for each person, including basic demographic info like name, DOB, etc.
Like I said at the top, I already have all of the needed information in a database; I am just trying to figure out the best approach to move that information onto a document.
I have done some research and found that some people have used mail merge in Word or Access as a front end, but I don't know if this is the best way. I've also found this document. Any advice would be greatly appreciated.
If I understand your problem correctly, it is two-fold: firstly, you need to find a way to generate documents based on data (mail merge), and secondly, you may need to print them, too.
For document generation you have two basic approaches: template-based, or programmatically from scratch. I suppose you will opt for the template-based approach, which basically means you design (in MS Word) a template document (Word, RTF, ...) containing placeholders and other tags that designate the »dynamic« parts of the document. Then, at document generation time, you pass this template document and the data to a .NET library/processor, which populates the template with the data and returns the resulting document.
One way to achieve this functionality would be employing MS Words' native mail-merge, but you should know that this would involve using Office COM and Word Application Automation which should be avoided almost always.
Another option is to build such a system on top of the Open XML SDK. This is a valid option, but it will be a pretty demanding task and will most probably cost you much more than buying a commercial .NET library that does mail merge out of the box (been there, done that). But of course, the good side here is that you will be able to tailor the solution to your needs. If you go down this road, I recommend that you use Content Controls for tagging documents/templates. A solution with CCs will be much easier to implement than one with bookmarks.
I'm not very familiar with the open-source solutions, and I'm not sure how many can do mail merge. One I know of is FlexDoc (on CodePlex), but its problem is that it uses a construct (XmlControl) for tagging that is deprecated in Word 2010+.
Then there are commercial solutions. Again, I don't know them in detail, but I know that the majority of them are general-purpose document processing libraries. Our company has been using this document generation toolkit for some time now, and I can say it covers all our »template-based document generation« needs. It doesn't require MS Word at doc generation time, has a really helpful add-in for MS Word, and takes only a few lines of code to integrate into your project. Templating is very powerful, and you can set up a template in a very short time. While templates are Word documents, you can generate PDF or XPS docs as well. XPS is useful because you can print documents via the .NET/WPF printing framework, which works with XPS docs. This is a very high-end solution; the downside, of course, is that it is not free.
I want to create a search engine, so I have been using Nutch and Solr to develop it.
But Nutch is not able to crawl each and every URL of the website, and the search results are not as good as Google's, so I started using JCrawler to get a list of URLs.
Now I have a list of URLs, but I have to index them.
So is there any way I can index a list of URLs stored line by line in a file, and show the results via Lucene or Solr or any other Java API?
How you do something programmatically really depends on which language you plan on writing your code in; fetching content from a URL and making sense of that content before indexing will be largely dependent on the libraries available for your programming language of choice.
You can still use Nutch with the Solr backend: give it the list of URLs as input and set --depth to 1 (so that it doesn't spider anything further).
There are also other "ready" options, such as Crawl Anywhere (which has a Solr backend) and Scrapy.
"Not as good as Google" is not a good description of what you want to accomplish or how to approach it (keep in mind that search is a core product for Google, and they have a very, very large set of custom technologies for handling it). If you have specific issues with your own data and how to display it (usually you can produce more useful results, since you have domain knowledge of the task you're trying to solve), ask concrete, specific questions.
You can use the Data Import Handler (DIH) to load the list of URLs from a file and then fetch and index them.
You'd need to use a nested entity, with the outer entity having its rootEntity flag set to false.
You'd need to practice a little bit with DIH, so I recommend that you first learn how to import just the URLs into individual Solr documents, and then enhance it with actual parsing of the URL content.
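As a rough, untested sketch of what that nested-entity DIH configuration could look like (the processor names are real DIH classes; the file path and field names are illustrative):

```xml
<dataConfig>
  <dataSource name="fileReader" type="FileDataSource"/>
  <dataSource name="web" type="URLDataSource"/>
  <document>
    <!-- Outer entity: one row per line of the URL file.
         rootEntity="false" so the inner entity produces the Solr documents. -->
    <entity name="urls" dataSource="fileReader" rootEntity="false"
            processor="LineEntityProcessor" url="/path/to/urls.txt">
      <!-- Inner entity: fetch each URL and index its raw content. -->
      <entity name="page" dataSource="web"
              processor="PlainTextEntityProcessor" url="${urls.rawLine}">
        <field column="plainText" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

LineEntityProcessor exposes each line as rawLine, which the inner entity uses as its URL; swapping PlainTextEntityProcessor for TikaEntityProcessor would get you parsed HTML instead of raw text.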
I'm implementing auto-suggestion in a web page (ASP.NET MVC) with Solr and have understood that there are a number of ways to do this, including:
jQuery Autocomplete, the Suggester, facets, or NGramFilterFactory.
Which one is the fastest one to use for auto-suggestion?
Any good information about this?
You should take a good look at 'AJAX Solr' at https://github.com/evolvingweb/ajax-solr .
AJAX Solr has autocomplete widget among other things. Demo site - http://evolvingweb.github.io/ajax-solr/examples/reuters/index.html .
Here's my shot at addressing your need, with this disclaimer:
'Fastest' is a very vague term and extends to a broader spectrum, i.e. browser used, page weight, network, etc. These need to be optimized outside the search implementation, if need be.
I would go for the straightforward implementation first and then optimize it based on performance stats.
OK, now to the implementation, at a high level:
1) Create a Solr index with a field analyzed by an NGram tokenizer or filter (e.g. NGramTokenizerFactory).
- to reduce chatter, keep the minimum NGram length at 2, and only fire autosuggest once the user has typed 2 characters
2) Depending on the technology used, you can either route search requests through your application or hit Solr directly. Hitting Solr directly could be faster (ref. AjaxSolr, as mentioned already).
3) Use something like jQuery UI to get an autosuggest implementation, backed by AJAX requests to Solr.
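A rough, untested schema.xml sketch of the field type from step 1 (the name and gram sizes are illustrative; the query-side analyzer deliberately omits the n-gram filter so the user's typed prefix matches the indexed grams):

```xml
<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- minGramSize=2 matches the "fire autosuggest at 2 characters" advice -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```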
Here are couple of reference implementations:
http://isotope11.com/blog/autocomplete-with-solr-and-jquery-ui
https://gist.github.com/pduey/2648514
Note that there are similar implementations that work well for live sites, so I would be tempted to try this out and see if there is still a bottleneck, rather than doing any premature optimization.
'AJAX Solr' has limitations with respect to autosuggestions, as it provides only word-level suggestions. Internally it uses faceting to generate them.
But Solr provides different suggesters which we can leverage to generate intelligent autosuggestions (words/phrases).
Check out this blog post to learn more:
http://lucidworks.com/blog/solr-suggester/
For implementation, you can use a combination of suggesters (FST + AnalyzingInfix) and jQuery autocomplete.
I'm implementing an Apache 2.0.x module in C, to interface with an existing product we have. I need to handle FORM data, most likely using POST but I want to handle the GET case as well.
Nick Kew's Apache Modules book has a section on handling form data. It provides code examples for POST and GET, which return an apr_hash_t of the key/value pairs in the form. parse_form_from_POST marshals the bucket brigade and flattens it into a buffer, while parse_form_from_GET can simply reference the URL. Both routines rely on a parse_form_from_string routine to walk through each delimited field and extract the information into the hash table.
That would be fine, but it seems like there should be an easier way to do this than adding a couple hundred lines of code to my module. Is there an existing module or routines within apache, apr, or apr-util to extract the field names and associated data from a GET or POST FORM into a structure which C code can more easily access? I cannot find anything relevant, but this seems like a common need for which there should be a solution.
I switched to G-WAN, which offers a transparent ANSI C scripting interface for GET and POST forms (and many other goodies like charts, GIF I/O, etc.).
A couple of AJAX examples are available on the G-WAN developer page.
Hope it helps!
While, on its surface, this may seem common, CGI-style content handlers written in C on Apache are pretty rare. Most people just use CGI, FastCGI, or one of the myriad frameworks such as mod_perl.
Most of the C Apache modules I've written are targeted at modifying the particular behavior of the web server in specific, targeted ways that are applicable to every request.
If it's at all possible to write your handler outside of an apache module, I would encourage you to pursue that strategy.
I have not yet tried any solution, since I found this SO question as a result of my own frustration with the example in the "Apache Modules" book as well. But here's what I've found, so far. I will update this answer when I have researched more.
Luckily, it looks like this is now a solved problem in Apache 2.4, using the ap_parse_form_data function.
No idea how well this works compared to your example, but here is a much more concise read_post function.
It is also possible that mod_form could be of value.