Automatic document tagging

I started working on a project in which I must tag documents with keywords, and it is really hard and time-consuming if you do it manually (especially if you have thousands of documents). So I am planning to automate the process, knowing that the result will not be perfect, but at least it will give you some suggested tags.
The latest Firefox version implements a system like this: when you bookmark a page, it suggests some tags.
The Yahoo Term Extraction service is also a great example.
So if anybody can help me get around this problem I would really appreciate the help. Or if someone knows about the Firefox tagging system, a little bit of help would be great.

Would a statistical algorithm work? Something Bayesian perhaps? I know they're used in spam filtering, maybe you can adapt a Bayes filter to suit your needs.
At the very least, you could suggest words that are used frequently but are not common words in English (he, she, I, and, it, then, or, etc...)
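One way to sketch that "frequent but not common" idea is TF-IDF, e.g. with scikit-learn (assuming scikit-learn is available; this is not how Firefox or the Yahoo service works, just one possible implementation):

```python
# Suggest tags per document: words that occur often in that document but
# are rare across the rest of the collection score highest under TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

def suggest_tags(documents, top_n=5):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)
    terms = vectorizer.get_feature_names_out()
    suggestions = []
    for row in tfidf:
        scores = row.toarray().ravel()
        best = scores.argsort()[::-1][:top_n]
        suggestions.append([terms[i] for i in best if scores[i] > 0])
    return suggestions

docs = [
    "Apache Solr is a search platform built on the Lucene library.",
    "Bayesian filters are often used to classify spam email messages.",
]
print(suggest_tags(docs))
```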

Related

Is there a way to access Apache Solr (8.3.1) search queries without crawling the logs?

For statistical purposes I have to save and analyse all search queries made to a server running Solr (version 8.3.1). Maybe it's just because I haven't worked with Solr until today, but I couldn't find a simpler way to access these queries other than crawling the logs.
I've only found one article to help me, in which the following is stated:
I think that Solr by itself doesn't store the queries (correct me if I'm wrong about this), but you can accomplish what you want by processing the Solr log (it's the only way, I think).
(source: https://lucene.472066.n3.nabble.com/is-it-possible-to-save-the-search-query-td4018925.html)
Is there any more convenient way to do this?
I actually found a good way to achieve this in another SO question. Well, at least kind of.
Note: It is only useful if you have enough resources on the same server or another server to properly handle a second Solr-Core.
Link to original answer
That SO question is about Elasticsearch, but its methodology can also be applied to this case, with a second Solr core that indexes the queries made. (One can also add additional fields, such as when a query was last searched, its total search count, ...)
The functionality of search auto-complete is also achievable with this solution.
In short:
The basic idea is to use a second Solr core or instance to quickly save the queries as they come in (instead of, say, a database).
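As a rough illustration (not taken from the linked answer), logging each query to a separate core could look something like this using Solr's JSON update API. The core name queries, the field name query_s and the host are placeholders to adapt to your own setup:

```python
# Minimal sketch: whenever the application runs a search against the main
# core, it also posts the raw query string to a second "queries" core.
# Core names, field names and the host are placeholders for illustration.
import requests

SOLR = "http://localhost:8983/solr"

def log_query(query_string):
    doc = {"query_s": query_string}              # assumed dynamic string field
    r = requests.post(
        f"{SOLR}/queries/update/json/docs",      # hypothetical 'queries' core
        json=doc,
        params={"commitWithin": 5000},           # commit within 5 seconds
        timeout=2,
    )
    r.raise_for_status()

def search(query_string):
    log_query(query_string)
    resp = requests.get(f"{SOLR}/mycore/select", params={"q": query_string})
    return resp.json()
```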
Remark: I'm not going to accept this as the best answer because it is a rather specific solution to the question I originally asked. But I nonetheless felt it could be useful for any programmer trying to achieve this while also thinking about search auto-completion.

How can I relate search indexes to models in MVC?

I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present there's just a quick interim solution in place, which is fine for flexibility, but speed was always going to be a problem. Modules register models (and relationships and columns) which they'd like to be searchable. Upon search, the search functionality queries data using those relationships and applies Levenshtein distance, removes stop words, does character replacements, etc. Clearly this will slow down as the volume of data increases, so it's not viable to keep: it is effectively a select * from x, y, z and then mining through the data.
The benefit of the above is that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code I can use Model_Product::url() to send the result off to the relevant location, or Model_Product::find(other data) to show, say, the image or description if the keyword was found in the title.
Another benefit of the above is that it's already database-specific, and can therefore just be thrown up onto a virtual host and it works.
I have read about the various options, and they all seem very similar, so it's unlikely that people will be able to suggest the 'right' one without inciting discussion or debate, but for the record, of the following options Solr seems to be the one I'm leaning toward. I'm not set in stone, so if anyone has any advice they'd like to share, or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.
What I don't understand is how any of these indexes relate to my other code. If I index data, search, and in turn find a result with, say, Solr, how do I know how to get all of the other information related to the bit it found?
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.
Granted, I haven't read as extensively on this as I typically would, because I'm a bit stuck in terms of direction at the moment and I'd rather seek some advice from those who know than read everything about everything before taking a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you find related information, e.g. if you run a DVD rental shop and a user has misspelled a movie name, Lucene will figure this out for you and (unlike a DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc. And obviously you can provide your own if you need to.
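As a sketch of that first point (using the pysolr client for illustration; the dynamic field names and core URL are assumptions), each indexed document can carry the model name and primary key of the row it came from, so a hit can be routed straight back to the right model:

```python
# Each indexed document stores which model and database row it came from,
# so a search hit can be mapped back to application code like
# Model_Product::find(42) / Model_Product::url().
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", always_commit=True)

# Indexing: a module registers its searchable rows roughly like this.
solr.add([
    {"id": "product-42", "model_s": "Model_Product", "db_id_i": 42,
     "title_t": "Blue widget", "description_t": "A very blue widget."},
])

# Searching: the stored model name and key tell you which model to load.
for doc in solr.search("blue widget"):
    print(doc["model_s"], doc["db_id_i"])
```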
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost?
Lucene is a low-level library and will have to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server; you can call it from as many clients as you want. More scaling options are explained here.

Semantic Search Engine

I want to design a Semantic Search engine for my final year Master's degree. I have been doing a fair amount of reading both casually on the web and academic papers so I am not a total noob in this field.
My aim is to build a semantic search engine which parses HTML content into its equivalent RDF triples, stores the triples in a triplestore, and tries to answer queries fired against it using SPARQL. I want to do something out of the box, unlike the other students, so I decided to build a semantic search engine.
Right now I have a running search engine using Solr which performs keyword search; what I want to do is semantic search. I know some open source Semantic Web (Web 3.0) tools, but I'm not sure whether they will be compatible with Solr or not.
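For reference, the triplestore-and-SPARQL part of such a pipeline can be sketched with rdflib (the namespace and triples below are made up; a real system would generate them from the parsed HTML):

```python
# Toy illustration of storing RDF triples and querying them with SPARQL.
# The example.org namespace and the triples are invented placeholders.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.page1, EX.hasTopic, Literal("photography")))
g.add((EX.page2, EX.hasTopic, Literal("semantic search")))

results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?page WHERE { ?page ex:hasTopic "photography" . }
""")
for row in results:
    print(row.page)   # -> http://example.org/page1
```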
So, can you please provide me some help for building the same.
Thanks.
Regards
This may sound harsh, but you will not be able to capture everything.
You need a lot of data. Of course, there already is a lot of data arranged in formats like OWL and RDF which you can use (e.g. WordNet, YAGO, GeoNames, etc.), but although these datasets are huge, they each cover only a very small portion of a possible universe of discourse.
Developing a good semantic search engine takes a lot of resources and brain power. Projects like KompParse at the German Research Center for Artificial Intelligence, which focus only on a small part of human conversation (gossip or buying furniture), have been running for several years with several employees and are still just "ok".
Understanding semantics has already been implemented in different search engines; take Google, for example, or Wolfram Alpha. So this topic might not even be as "out of the box" as you think.
So I will go with user723630 and strongly advise you to focus on a smaller topic. You will still achieve a lot, but you will not get frustrated.

How to identify ideas and concepts in a given text

I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic/idea is mentioned in a body of text. For instance, if the text contained:
Maybe if you tell me a little more about who Mr Jones is, that would help. It would also be useful if I could have a description of his appearance, or even better a photograph?
It'd be great to be able to detect that the person has asked for a photograph of Mr Jones. I could take a really naïve approach and just look for the word "photo" or "photograph", but this would obviously be no good if they wrote something like:
Please, never send me a photo of Mr Jones.
Does anyone know where to start with this? Is it even possible?
I've looked into things like nltk, but I've yet to find an example of someone doing something similar and am still not entirely sure what this kind of analysis is called. Any help that can get me off the ground would be great.
Thanks!
The best thing out there that might be useful to you is automatic sentiment analysis. This is used, for example, to judge whether, say, a customer review is positive or negative. I cannot give you direct pointers to available tools, but this is what you are looking for.
I must say, though, that this is a current hot topic in natural language processing and I’ve seen a number of papers at conferences. It’s definitely quite a complex matter and if you’re starting from scratch, it might take quite some time before you get the results that you want.
NLTK is not a bad framework for parsing natural language but beware that this is not a simple matter. Doing stuff like this is really research level programming.
A good thing that makes it much easier is if you have a very limited domain - say your application focuses on information about famous writers, then you can avoid some complexities of natural language like certain types of ambiguities.
Where to start? Good question. I don't know of any tutorials on the topic (and I presume you tried the Google option), but I'd imagine that iTunes U would have a course on it. If not, here is a link to a course I've done that mentions the subject and wasn't completely horrible: http://www.inf.ed.ac.uk/teaching/courses/inf2a/lecturematerials/index.html#lecture01
The problem you are tackling is very challenging.
I would start by first identifying the entities in the text (a problem referred to as Named Entity Recognition; google it), and then I would try to identify concepts.
If you want to roughly identify what the text is about, I suggest you start with WordNet and use the words and their places in its hierarchy to identify the concepts involved.
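A very rough sketch of those two suggestions with NLTK (assuming the standard NLTK data packages such as punkt, averaged_perceptron_tagger, maxent_ne_chunker, words and wordnet have been downloaded; the negation check is deliberately naive and only illustrative):

```python
# Find person names with NLTK's NE chunker, use WordNet to cover words
# related to "photograph" (photo, picture, ...), and do a crude negation
# check so "never send me a photo" is not treated as a request.
import nltk
from nltk.corpus import wordnet as wn

PHOTO_WORDS = {lemma.name().lower()
               for synset in wn.synsets("photograph", pos=wn.NOUN)
               for lemma in synset.lemmas()}
NEGATIONS = {"never", "not", "n't", "no"}

def photo_requests(sentence):
    tokens = nltk.word_tokenize(sentence)
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    people = [" ".join(word for word, tag in chunk.leaves())
              for chunk in tree
              if hasattr(chunk, "label") and chunk.label() == "PERSON"]
    lowered = {t.lower() for t in tokens}
    if people and (PHOTO_WORDS & lowered) and not (NEGATIONS & lowered):
        return people
    return []

print(photo_requests("It would be even better to have a photograph of Mr Jones."))
print(photo_requests("Please, never send me a photo of Mr Jones."))
```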
If you want to produce a system which shows real intelligence, then you should start researching resources such as Cyc (OpenCyc), which will allow you to convert the sentences into first-order logic (FOL) sentences.
That is the hardcore AI approach to solving your problem. For a simple chat bot, it would be easier to rely on simple statistical methods.
Good luck.

Looking for an example of when screen scraping might be worthwhile

Screen scraping seems like a useful tool - you can go onto someone else's site and steal their data - how wonderful!
But I'm having a hard time with how useful this could be.
Most application data is pretty specific to that application even on the web. For example, let's say I scrape all of the questions and answers off of StackOverflow or all of the results off of Google (assuming this were possible) - I'm left with data that is not very useful unless I either have a competing question and answer site (in which case the stolen data will be immediately obvious) or a competing search engine (in which case, unless I have an algorithm of my own, my data is going to be stale pretty quickly).
So my question is, under what circumstances could the data from one app be useful to some external app? I'm looking for a practical example to illustrate the point.
It's useful when a site publicly provides data that is (still) not available as an XML service. I had a client who used scraping to pull flight tracking data into one of his company's intranet applications.
The technique is also used for research. I had a client who wanted to compare the contents of several online dictionaries by part of speech, and all of these sites had to be scraped.
It is not a technique for "stealing" data. All ordinary usage restrictions apply. Many sites implement CAPTCHA mechanisms to prevent scraping, and it is inappropriate to work around these.
A good example is StackOverflow - no need to scrape data as they've released it under a CC license. Already the community is crunching statistics and creating interesting graphs.
There's a whole bunch of popular mashup examples on ProgrammableWeb. You can even meet up with fellow mashupers (O_o) at events like BarCamps and Hack Days (take a sleeping bag). Have a look at the wealth of information available from Yahoo APIs (particularly Pipes) and see what developers are doing with it.
Don't steal and republish, build something even better with the data - new ways of understanding, searching or exploring it. Always cite your data sources and thank those who helped you. Use it to learn a new language or understand data or help promote the semantic web. Remember it's for fun not profit!
Hope that helps :)
If the site has data that would benefit from being accessible through an API (and it would be free and legal to do so), but they just haven't implemented one yet, screen scraping is a way of essentially creating that functionality for yourself.
Practical example -- screen scraping would allow you to create some sort of mashup that combines information from the entire SO family of sites, since there's currently no API.
Well, to collect data from a mainframe. That's one reason why some people use screen scraping. Mainframes are still in use in the financial world, and often they run software that was written in the previous century. The people who wrote it might already be retired, and since this software is very critical for these organizations, they really hate it when new code needs to be added. So screen scraping offers an easy interface to communicate with the mainframe: collect information from it and send it onwards to any process that needs it.
Rewrite the mainframe application, you say? Well, software on mainframes can be very old. I've seen software on mainframes that was over 30 years old, written in COBOL. Often those applications work just fine, and companies don't want to risk rewriting parts because it might break code that has been working for over 30 years! Don't fix things if they're not broken, please. Of course, additional code can be written, but it takes a long time for mainframe code to reach a production environment, and experienced mainframe developers are hard to find.
I had to use screen scraping myself in a software project. This was a scheduling application which had to capture the console output of every child process it started. It's the simplest form of screen scraping, actually, and many people don't even realize that if you redirect the output of one application to the input of another, that's still a kind of screen scraping. :)
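In Python, that "simplest form" of scraping, capturing a child process's console output, is only a few lines (the command here is just an example):

```python
# Capture the console output of a child process, as the scheduler above
# had to do. The command being run is only an example.
import subprocess

result = subprocess.run(
    ["df", "-h"],            # whatever child process the scheduler starts
    capture_output=True,
    text=True,
    check=True,
)
for line in result.stdout.splitlines():
    print("captured:", line)
```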
Basically, screen scraping allows you to connect one (web) application with another one. It's often a quick solution, used when other solutions would cost too much time. Everyone hates it, but the amount of time it saves still makes it very efficient.
Let's say you wanted to get scores from a popular sports site that did not offer the information available with an XML feed or API.
For one project we found a (cheap) commercial vendor that offered translation services for a specific file format. The vendor didn't offer an API (it was, after all, a cheap vendor) and instead had a web form to upload and download from.
With hundreds of files a day, the only way to do this was to use WWW::Mechanize in Perl, screen scrape our way through the login and upload boxes, submit the file, and save the returned file. It's ugly and definitely fragile (if the vendor changes the site in the least it could break the app), but it works. It's been working now for over a year.
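The original used Perl with WWW::Mechanize; a comparable sketch in Python with requests (all URLs and form field names below are invented placeholders) shows the same flow: log in through the form, upload the file, save what comes back:

```python
# Log in through a web form, upload a file, and save the returned file.
# URLs, field names and credentials are placeholders for illustration.
import requests

with requests.Session() as session:
    session.post("https://vendor.example.com/login",
                 data={"username": "me", "password": "secret"})
    with open("input.dat", "rb") as f:
        resp = session.post("https://vendor.example.com/upload",
                            files={"file": f})
    resp.raise_for_status()
    with open("output.dat", "wb") as out:
        out.write(resp.content)
```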
One example from my experience.
I needed a list of major cities throughout the world with their latitude and longitude for an iPhone app I was building. The app would use that data along with the geolocation feature on the iPhone to show which major city each user of the app was closest to (so as not to show exact location), and plot them on a 3D globe of the earth.
I couldn't find an appropriate list in XML/Excel/CSV type format anywhere easily, but I did find this wikipedia page with (roughly) the info I needed. So I wrote up a quick script to scrape that page and load the data into a database.
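A sketch of that kind of one-off scrape with requests and BeautifulSoup (the URL is a placeholder and the column order is an assumption; a real page has to be inspected first):

```python
# Scrape a Wikipedia-style table of cities with coordinates and load it
# into SQLite. URL and column order are assumptions for illustration.
import sqlite3
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Example_list_of_cities"   # placeholder

html = requests.get(URL, headers={"User-Agent": "city-scraper/0.1"}).text
soup = BeautifulSoup(html, "html.parser")

conn = sqlite3.connect("cities.db")
conn.execute("CREATE TABLE IF NOT EXISTS cities (name TEXT, lat REAL, lon REAL)")

for row in soup.select("table.wikitable tr")[1:]:
    cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
    if len(cells) >= 3:
        try:
            # assumed column order: city, latitude, longitude
            conn.execute("INSERT INTO cities VALUES (?, ?, ?)",
                         (cells[0], float(cells[1]), float(cells[2])))
        except ValueError:
            continue   # skip rows whose coordinates aren't plain numbers
conn.commit()
```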
Any time you need a computer to read the data on a website. Screen scraping is useful in exactly the same instances that any website API is useful. Some websites, however, don't have the resources to create an API themselves; screen scraping is the developer's way around that.
For instance, in the earlier days of Stack Overflow, someone built a tool to track changes to your reputation over time, before Stack Overflow itself provided that feature. The only way to do that, since Stack Overflow has no API, was to screen scrape.
The obvious case is when a webservice doesn't offer reverse search. You can implement that reverse search over the same data set, but it requires scraping the entire dataset.
This may be fair use if the reverse search also requires significant pre-processing, e.g. because you need to support partial matching. The data source may not have the technical skills or computing resources to provide the reverse search option.
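One way to picture that pre-processing: build an n-gram index over the scraped records so partial-match reverse lookups don't rescan the whole dataset (a toy sketch, not tied to any particular service):

```python
# Build a trigram index over scraped records so partial / reverse lookups
# can be answered without scanning every record each time.
from collections import defaultdict

def trigrams(text):
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(records):
    index = defaultdict(set)
    for rec_id, text in records:
        for gram in trigrams(text):
            index[gram].add(rec_id)
    return index

def reverse_search(index, fragment):
    grams = trigrams(fragment)
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))

records = [(1, "John Smith, Acme Corp"), (2, "Jane Doe, Widget Ltd")]
index = build_index(records)
print(reverse_search(index, "acme"))   # -> {1}
```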
I use screen scraping daily. I run some eCommerce sites and have screen-scraping scripts running every day to gather product lists automatically from my suppliers' wholesale sites. This allows me to have up-to-date information on all the products available to me from several suppliers, and lets me flag uneconomical margins caused by price changes.
