Tagging / Entity extraction using an entity list - tagging

I'm looking for a good solution to extract entities from a text.
In my case, entities are movie titles (so they could be quite long strings) and I have them stored in a database.
What could be a good way to do this ? Is there any already developed software to perform this kind of task ?
I've seen nltk, but what I need is way less complex: Given a (huge) list of strings, identify them in the input text.
Thanks!

Related

Thought on creating a searchable database

I guess what I need is two things. First a way to input data into an Excel like application or a form builder, then a way to search those entries. For example.. CAR PART put a car Part A into Field 1 the next Field 2 would be car Type, followed by make and model. The fields would need to be made into a form consisting of preset inputs such as ( Title/Type ) and (Variable Categories) so a drop down menu, icons, or checkboxes would help narrow down the list of results. What pieces need to be in place to build/use a lightweight database/application design like this that allows inputting new information and then being able to search for latet search for variables? Also is there any application that does this already, a programming code to learn, or estimated cost and requirements to have it built?
First, there might be something off the shelf that does this already, and there are applications like this. Microsoft's Access would be a good place to start to see if it would fit your needs -- you can build forms and store data without much programming effort. As time goes on, you can scale up to a SQL Server.
It's not clear to me if your data is relational or not, and it might not matter much at first (any database will likely handle your queries to start). I originally thought your data was not relational, but re-reading your post, I'm not so sure now.
If that doesn't work, or you want more flexibility, then I'd start looking at NoSQL as an option. Some good choices include Mongo and RavenDB (there are many others).
You can program it yourself with just about any major language -- some provide more or less functionality based on the tie-in to the data.

Designing a database for an e-commerce store

Hi I am trying to design a database for an e-commerce website but I can't seem to find a way to do this right, this is what I have so far:
The problem appears at the products.I have 66 types of products most of them having different fields.I have to id's but both of them don't seem very practical:
OPTION A:
At first I thought I to make a table for each product type, but that would result in 66 tables which is not very easy to maintain. I already started to do that I created the Product_Notebook and Product_NotebookBag tables. And then I stopped and thought about it a bit and this solution is not very good.
OPTION B
After thinking about it a bit more I came up with option B which is storing the data into a separate field called description. For example:
"Color : Red & Compatibility : 15.6 & CPU : Intel"
In this approach I could take the string and manipulate it after retrieving it from the database.
I know this approach is also not a very good idea, that's why I am asking for a more practical approach.
See my answer to this question here on Stack Overflow. For your situation I recommend using Entity Attribute Value (EAV).
As I explain in the linked answer, EAV is to be avoided almost all of the time for many good reasons. However, tracking product attributes for an online catalog is one application where the problems with EAV are minimal and the benefits are extensive.
Simply create a ProductProperties table and put all the possible fields there. (You can actually just add more fields to your Products table)
Then, when you list your products, just use the fields you need.
Surely, there are many fields in common as well.
By the way, if you're thinking of storing the data in array (option B?) you'll regret it later. You won't be able to easily sort your table that way.
Also, that option will make it hard to find a particular item by a specific characteristic.

What's the best way to store unstructured text file for data mining

I have millions of text news on my machine. I want to do some text mining on it.
I want first to store thest text news in a more structured way. what's the best way to do it ? so It will become more convenient to do data mining later on.
Currently I just store these news file in database indexed by the news headlines and the file path.
Any suggestion will be really appreciated. Thanks!
That depends greatly on what you want to achive with the more structured data.
If the data size is not heavy, you could use "in text" search on your database and you are aldready done.
A category or "tag" like here on stackoverflow would help greatly to categorize and group your content, but I guess it is very hard to extract that from your pure text base now.
Also a simple timestamp (you could get from the file itself, but be wary some systems alter that date when files get copied...) could help too.
For content extraction, have a look at http://www.opencalais.com/ , it offers an api for "text" analysis you might find interesting.
What do you mean by "do some text mining"? Are you just looking to store the text? Or, are you looking for a solution?
Many databases offer the capability to store text and do fast retrievals on them.
However, text mining typically covers a broader range of themes. Here are some examples:
Finding documents with similar themes.
Exposing sentiment in the documents.
Answering questions posed in natural language.
Summarizing documents.
Filling in data structures with information from documents.
Using information from documents for predictive modeling purposes.
Assigning codes to documents.
For such analyses, you would normally use text mining tools (you can look for these on kdnuggets.com, for instance). The tool then affects how the text is stored.
The last chapter of "Data Mining Techniques for Marketing, Sales, and Customer Support" is about text mining and has a very good case study on text mining applied to customer service records.
[In response to comment]
Is this an academic project or "real world"? Is the text monolingual? If so, is it English? You definitely need to do some research. Text analysis/mining has been an area of rather intense study since, at least, the time when Alan Turing proposed the Turing test in the 1930s.
As an example, I can readily think of four very different options for storing text for analysis. The first is "as is", which is most useful if you have lots of processors and memory. The second would be "grammatically", with text tagged with grammar and meanings, which is most effective if you have a team with lots of PhDs. Third is as an inverted index, which is the basic form for searching and some proximity matching. The fourth is by projecting onto an orthogonal space, using singular value decomposition (most useful if you want to use the text as input to other statistical techniques).

Best way to store large searchable text files

I am developing an online Bible search program. The Bible is a pretty large book, taking up nearly 5MB of space in plain text. I am planning on implementing an API in the program as well allowing other websites to include their own Bible search widgets and programs without having to develop the search queries or storing Bibles on their own servers.
With this in mind, I am going to expect that eventually I will have a moderate flow of queries passing through the program. Also, for those not familiar with the Bible, it has 2 methods of formatting the text. It can contain both red text and italics. I need a way to store the Scriptures along with the red letter and italics formatting but allowing the search queries to ignore the formatting.
It also needs to be fast and as efficient (memory and cpu usage) as possible. Any storage format will be considered (MySQL, JSON or XML text files, etc) as long as the querying can be done ignoring the formatting. File size and count doesn't really matter, so splitting up the books or even chapters into separate files is fine by me.
One more important thing to keep in mind though, is that I want to have some form of search method that can search across multiple verses. So a search for "but have everlasting life for God sent not his Son" would return John 3:16,17. Thanks for all ideas!
There are a bunch of different open source document search engines which are made for precisely what you're trying to do. Solr, Elastic Search, Xapian, Whoosh, Haystack (made for Django) and others. There are other posts on S.O. and elsewhere that go into the benefits of using one vs another, but your requirements are simple enough that any of them will be more than fine (and easily scale with very minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you - Solr is arguably the most popular and it's the only one I've worked with, but Elastic Search uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.
As for the actual implementation, you'll want to index each verse as a separate "document" if the single verse (or just verse number) is what you want to return. The search engine handles the ranking of the results based on relevancy (usually using a tf/idf algorithm, in case you're interested).
The way I'd handle the italics and red text is to include some kind of markup in the text (i.e. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters - there may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The queries spanning multiple verses requirement is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
A word of caution - if you're not familiar with search indexing, even something designed to be plug-and-play like Elastic Search will probably still require some time and effort to set up, so if you absolutely need to get this up and running quickly and you're already familiar with MySQL I suppose it could work (it does do fulltext search). But it's certainly not the best tool for the job, so if this is a project that you're invested in you will thank yourself later if you put in a little bit of work to learn one of these search frameworks. It may be overkill in terms of the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search on that text which seems to be what you want. For instance, adding other requirements later on would be very straightforward (for instance, you could let people limit their search to only matches in the red text).
I didn't know the bible had formatting. What is it used for? If it is for the verses, I'd suggest you store every verse in a database. In a highly normalized form, you got a table with books, a table with chapters and a table with verses. Each verse consists of a verse number and a verse text.
Now, I think the chapters don't have titles so they are actually just a number as well. In that case it it silly to store them separately, so you got just your table of books and a table of verses, in which each verse has a chapter number and a verse number and a verse text. That text I think of to be plain text, isn't it?
If the verse is plain text, you can easily make it searchable by storing it in MySQL and create a FULLTEXT index for it. That way, you can search quite efficiently and even use wildcards and such.
If the verse was to have formatting, you could choose to create two columns, one with the plain text for searching, and one with the formatted text for display, but I doubt you would need this.
PS: 5 MB of text is nothing really. If you got a dedicated program, you could keep it in memory in a single string and use strpos or a similar function to find a text. What language, database and platform are you using?

Tagging schema for AppEngine

Hey,
I'm using AppEngine for an application that I'm writing. So I need to assign tags each object. I wanted to know what is the best way of doing this.
Should I create a space seperated string of tags and then query something like %search_tag% (I'm not sure if you can do that in JDOQL)?
What other options do I have ?
Should I create another class which will map every object to a tag?
Which would be the best from the point of view of scalability, performance and ease of use?
Thanks
First, '%search_tag%' type 'LIKE' queries do not work on App Engine's datastore. The best you can do is a prefix search.
It is difficult to answer very general questions like this. The best solution will depend on several factors, how many tags do you expect per entity? Is there a limit to the number of tags? How will you use the tags? For searching? For display only? The answers to all these questions impact how you should design your models.
One general solution for tagging is to use a multi-valued property, such as a list of tags.
http://code.google.com/appengine/docs/java/datastore/dataclasses.html#Collections
Be aware, if you will have many tags on your entities it will add overhead at write time, since the indexes writes need time too. Also, you should try to avoid using multi-valued properties multiple times (or multiple multi-value properties) in queries with inequalities or orders. That can lead to 'exploding indexes,' since one index row gets written for every combination of the indexed fields.

Resources