Parsing Word Documents Into a DB for Analysis - database

So there are over 200 documents at this link (http://goo.gl/IdvhMf). Each document has over a hundred pages of questions and answers, and each document represents the answers from one respondent. I want to create a table in a db (not dependent on any db technology) that would have a schema something like this:
Respondent | Question Number | Answer
e.g
UBS, 1, "Our opinion is that..."
I can then query the db to say, for example: "show me all responses to Question 34 for respondents A, B, C".
The step after that might include some form of sentiment analysis on the responses.
So what is the best way for someone who is not a programmer by day to do this? Are there any off-the-shelf configurable tools I could use?

Split your problem in two.
The first is how you parse out the question and answer pairs from the document.
Storing those in the database is a second unrelated problem.
Addressing the first problem (and not having looked at the documents), that would typically be done on the basis of styles (eg Question style, Answer style), magic text ("Question", "Answer"), or formatting (eg questions are bold/red). If I had control over the creation of the documents, I'd probably use content controls.
How you do this in code depends to some extent on your preferred language, but things are easier if the documents are in docx format (as opposed to legacy binary doc format, or RTF). Assuming docx format, there are Open XML toolsets for most languages.
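As an illustration, here is a minimal Python sketch, assuming the documents are docx and that question and answer paragraphs can be told apart by paragraph style name. The "Question"/"Answer" style names, the file name and the respondent id are hypothetical placeholders; check the actual styles in one of the documents first.

    import sqlite3
    from docx import Document  # pip install python-docx

    def extract_answers(path):
        """Collect the answer text following each question paragraph, based on styles."""
        doc = Document(path)
        answers, current = [], None
        for para in doc.paragraphs:
            if para.style.name == "Question":        # assumed style name
                current = []
                answers.append(current)
            elif para.style.name == "Answer" and current is not None:  # assumed style name
                current.append(para.text)
        return [" ".join(parts) for parts in answers]

    conn = sqlite3.connect("responses.db")
    conn.execute("CREATE TABLE IF NOT EXISTS responses "
                 "(respondent TEXT, question_number INTEGER, answer TEXT)")
    for number, answer in enumerate(extract_answers("UBS.docx"), start=1):
        conn.execute("INSERT INTO responses VALUES (?, ?, ?)", ("UBS", number, answer))
    conn.commit()

Querying is then a plain SQL statement, e.g. SELECT answer FROM responses WHERE question_number = 34 AND respondent IN ('A', 'B', 'C').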

Related

Datasets of conversations of people in text format which have been labelled positive and negative

I am currently doing a project on sentiment analysis. For this, I am searching for datasets of conversations of people in text format which have been labelled positive and negative. Can someone give me links to such datasets?
There are some good data sets with sentiment tags, but they are not from conversations.
Check out this answer to see some of them.
Another good corpus for sentiment analysis that is also used for question answering (QA) is MPQA.
A good dataset is downloadable from Stanford CoreNLP website (http://nlp.stanford.edu/sentiment/, then look at the box "Dataset downloads" on the right). This dataset is labelled at different levels (single word, phrases, entire sentences): if you only look for a sentence-level annotation you can simply take the outermost label of every sentence.
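If it helps, here is a minimal Python sketch for the sentence-level case. It assumes the downloaded trees are PTB-style s-expressions with a 0-4 sentiment label at every node and one tree per line; the file name and layout are assumptions, so check the actual contents of the download.

    def sentence_labels(path="train.txt"):        # file name is an assumption
        """Yield the outermost (sentence-level) sentiment label of each tree."""
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line.startswith("("):
                    yield int(line[1])            # e.g. "(3 (2 It) (4 ...))" -> 3

    for label in sentence_labels():
        print(label)                              # 0 (very negative) .. 4 (very positive)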

Adding new Excel files to MS Access database as they come in

I am in the situation where I have a questionnaire that is basically just a plain excel spreadsheet with two columns:
one column with the questions and
a second column next to it where users can fill in their answers.
Each respondent has been sent a copy of the file and they will email back their files individually over a long time period. I can't wait until I have all files back; instead I would like to collect (and use) the data in Access as the files come in.
Two questions:
What is the best setup in terms of the manual steps required when a new data file comes in? Can one just save the file in a specific folder and somehow have the column (column B) with responses "automatically" added to the main database? If not fully automatically, what could be done with just a few manual steps involved?
I realize that the shape of the questionnaire is not ideal (variables are in rows, not in columns). What's the best way to deal with that?
Thanks in advance for any pointers!
PS: I'd be open to (simple) alternatives if Access is not the best choice for this. Analysis of the data will be done in Excel again in the end.
Update, to clarify the questions below:
1) In the short to medium term, we are expecting 50-100 replies. In the long term it will be more, as people will be asked to send updates when their situation changes; these will have to be added as new entries with a new date attached to them. I.e. it will be a continuous process with a few answers coming in every few weeks.
2) There are 80 questions on the questionnaire.
3) The Excel files come back as email attachments.
4) I was contemplating using Access, as I thought it would a) make things a bit cleaner and less error-prone, especially as project managers might change in the future, b) allow for better handling of the data, as it will have to be mashed up and reshaped in different ways for the analysis (e.g. it has to be un-pivoted, which I don't even know if Excel can do), and c) give us more flexibility in the future when it comes to using different tools for analysis, i.e. each tool can just query the database. I am open to other suggestions, including Excel-only solutions, if that makes things easier, though.
5) I envision the base table to have all the 80 variables in different columns, and the answers as rows (i.e. each new column that comes with each Excel file will need to be transposed and added as a new row). There will be other data tables with the same primary key as the row identifier in this table.
6) I haven't worked on the analysis part yet, but I know that it will require a lot of reshaping and merging of data sets.
Answer 1 - Questions
You do not provide enough information to allow anyone to give you pointers. Some initial questions:
How many questionnaires are you expecting: 10, 100, 1000?
How many questions are there per questionnaire?
How are the questionnaires reaching you? You say "email back". Does this mean as an attachment or as a table in the body of the email?
You say the data is arriving as Excel files and you intend to do the analysis in Excel. Why are you storing the answers in Access? I am not saying you are wrong to store the results in Access; I just want to be convinced you have a reason.
Have you designed the planned table structure for Access?
Have you designed the structure of the Excel workbook(s) on which you will perform the analysis?
Answer 2
Firstly, I should say that I agree with Mat. I am not an expert on questionnaires but my understanding is that there are companies that will host online questionnaires and provide the results in a convenient form.
Most of the rest of this answer assumes it is too late to consider an online questionnaire or you have, for whatever reason, rejected that approach.
An Access project is, to a degree, self-documenting. You can look at its list of tables and see that Table 1 has columns A, B and C. If created properly you can see the relationships between tables. With an Excel workbook you just have a number of worksheets which can contain anything. There is no automatic documentation.
However, with both Excel and Access the author can create complete documentation that explains each table, worksheet, report and macro. If this project is going to last indefinitely and have a succession of project managers, such documentation will be essential. I can tell you from bitter experience that trying to understand a complex Access project or Excel workbook that you have inherited without proper documentation is at best difficult and at worst impossible.
Don’t even start this unless you plan to create and maintain proper documentation. I do not mean: “We will knock up something when we have finished.” Once it is finished, people will be moving onto their next projects and will have little time for boring stuff like documentation. After-the-event documentation also loses all the decisions and the reasons for those decisions. The next team is left wondering why their predecessors did it that way. The reason will not matter in many cases but I have seen a product destroyed by a new team removing “unnecessary complexity” they did not understand. I always kept a notebook in which I recorded what I was doing and why during the day. I encouraged my staff to do the same. I insisted on something for the project log every week. The level of detail depends on the project. The question I asked myself was: “If I had just inherited this project, what happened during the last week that I would need to know?” This was in addition to an up-to-date specification for each component.
Sorry, I will get off my hobby-horse.
“In the short to medium term, we are expecting 50-100 replies. In the long term it will be more, as people will be asked to send updates when their situation changes; these will have to be added as new entries with a new date attached to them.”
If you are going to keep a history of answers then Access will probably be a better repository than Excel. However, who is going to maintain the Access project and the central Excel workbooks? Access does not operate in the same way as Excel. Access VBA is not quite the same as Excel VBA. This will not matter if you are employing professionals experienced in both Access and Excel. But if you are employing amateurs who are picking up the necessary skills on the job then using both Access and Excel will increase what they have to learn and the likelihood that they will get confused.
If there are only 100 people/organisations submitting responses, you could merge responses and maintain one workbook per respondent to create something like:
                Answers -->
    Question    1May2014    20Jun2014    7Nov2014
    Aaaaaa      aa          bb           cc
    Bbbbbb      dd          ee           ff
I am not necessarily recommending an Excel approach but it will have benefits in some circumstances. Personally, unless I was using professional programmers, I would start with an Excel only solution until I knew why I needed Access.
“I envision the base table to have all the 80 variables in different columns, and the answers as rows (i.e. each new column that comes with each Excel file will need to be transposed and added as a new row).” I interpret this to mean a row will contain:
Respondent identifier
Date
Answer to Q1
Answer to Q2
: :
Answer to Q80.
My Access is very rusty. Is there a way of accessing attribute “Answer to Q(n)” or are you going to need 80 statements to move answers in and out? I hope there is no possibility of new questions. I found updating the database when a row changed a pain. I always favoured small rows such as:
Respondent identifier
Date
Question number
Answer
There are disadvantages to having lots of small rows but I always found the advantages outweighed them.
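For what it's worth, here is a minimal Python/pandas sketch of reshaping one returned workbook (questions in column A, answers in column B, as described in the question) into those narrow rows. The file name, respondent id and output file are hypothetical placeholders.

    import datetime
    import pandas as pd   # pip install pandas openpyxl

    def workbook_to_rows(path, respondent, received):
        """One returned workbook -> narrow rows (respondent, date, question number, answer)."""
        sheet = pd.read_excel(path, header=None, names=["question", "answer"])
        sheet["question_number"] = range(1, len(sheet) + 1)
        sheet["respondent"] = respondent
        sheet["date"] = received
        return sheet[["respondent", "date", "question_number", "answer"]]

    rows = workbook_to_rows("ACME_reply.xlsx", "ACME", datetime.date(2014, 5, 1))
    # Append to a running master table (a CSV here; an Access/SQL table would work the same way)
    rows.to_csv("all_answers.csv", mode="a", header=False, index=False)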
Hope this helps.

What's the best way to store unstructured text file for data mining

I have millions of text news articles on my machine. I want to do some text mining on them.
I want to first store these news articles in a more structured way. What's the best way to do it, so that it becomes more convenient to do data mining later on?
Currently I just store these news files in a database indexed by the news headlines and the file paths.
Any suggestion will be really appreciated. Thanks!
That depends greatly on what you want to achieve with the more structured data.
If the data size is not heavy, you could use "in text" search on your database and you are already done.
A category or "tag", like here on Stack Overflow, would help greatly to categorize and group your content, but I guess it is very hard to extract that from your pure text base now.
Also a simple timestamp (which you could get from the file itself, but be wary: some systems alter that date when files get copied...) could help too.
For content extraction, have a look at http://www.opencalais.com/ , it offers an api for "text" analysis you might find interesting.
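As a small illustration, here is a sketch of that minimal structure (headline, path, file timestamp) in SQLite. The folder layout and the assumption that the headline is the first line of each file are illustrative, not taken from the question.

    import datetime
    import glob
    import os
    import sqlite3

    conn = sqlite3.connect("news_index.db")
    conn.execute("CREATE TABLE IF NOT EXISTS articles "
                 "(headline TEXT, path TEXT, modified TIMESTAMP)")

    for path in glob.glob("news/*.txt"):                 # assumed folder layout
        with open(path, encoding="utf-8") as fh:
            headline = fh.readline().strip()             # assumes the headline is the first line
        modified = datetime.datetime.fromtimestamp(os.path.getmtime(path))
        conn.execute("INSERT INTO articles VALUES (?, ?, ?)", (headline, path, modified))
    conn.commit()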
What do you mean by "do some text mining"? Are you just looking to store the text? Or, are you looking for a solution?
Many databases offer the capability to store text and do fast retrievals on them.
However, text mining typically covers a broader range of themes. Here are some examples:
Finding documents with similar themes.
Exposing sentiment in the documents.
Answering questions posed in natural language.
Summarizing documents.
Filling in data structures with information from documents.
Using information from documents for predictive modeling purposes.
Assigning codes to documents.
For such analyses, you would normally use text mining tools (you can look for these on kdnuggets.com, for instance). The tool then affects how the text is stored.
The last chapter of "Data Mining Techniques for Marketing, Sales, and Customer Support" is about text mining and has a very good case study on text mining applied to customer service records.
[In response to comment]
Is this an academic project or "real world"? Is the text monolingual? If so, is it English? You definitely need to do some research. Text analysis/mining has been an area of rather intense study since at least 1950, when Alan Turing proposed the Turing test.
As an example, I can readily think of four very different options for storing text for analysis. The first is "as is", which is most useful if you have lots of processors and memory. The second would be "grammatically", with text tagged with grammar and meanings, which is most effective if you have a team with lots of PhDs. Third is as an inverted index, which is the basic form for searching and some proximity matching. The fourth is by projecting onto an orthogonal space, using singular value decomposition (most useful if you want to use the text as input to other statistical techniques).
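To make the third option concrete, here is a toy inverted index in Python. Real engines such as Lucene add ranking, compression and much more, but the underlying structure is the same idea; the sample documents are invented.

    import re
    from collections import defaultdict

    documents = {
        "doc1.txt": "Stocks rose sharply on Monday",
        "doc2.txt": "Oil prices fell as stocks wavered",
    }

    # Map each token to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)

    def search(*terms):
        """Simple conjunctive query: documents containing every search term."""
        sets = [index.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    print(search("stocks", "rose"))   # -> {'doc1.txt'}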

Is there any free database which stores keywords with other relevant keywords, for applications to determine semantic relevance? [closed]

This looks like a search for a valuable asset, but since we have a free alternative for many things, I am optimistic about this one.
A database which stores key-value pairs like
key-value
or
key-context-value
would be very useful for web developers who collect data and want to tag it or search for records which may be relevant.
A data table like this would even be the normalized form of what they would want to store.
If you have ever heard of an available free to copy data table like this, please share. Thank you.
You could use WordNet: it contains general relationships between (English) words (divided into noun, verb, adjective and adverb). The relationships are among synsets (synonym sets) and describe such relations as "bus" is-a "vehicle", "wheel" is-part-of "car".
Note: To look up words in the WordNet dictionary you need to use lemmas (the base form of the word), so if you want to look words up from a free text (such as a website), you will have to calculate the lemmas of the words first. You could do this by applying some Natural Language Processing (NLP) techniques, or creating your own heuristics.
Besides the synset relationships, WordNet also contains short definitions (glosses) of the synsets, which you could use to gain more context. Also, word sense disambiguation techniques can help you decide which sense of a multi-sense word to use, which is also a form of providing context.
If you need more context than what WordNet provides (general relationships between general meanings of English words), you should find a suitable ontology that describes semantic relationships between concepts. You will have to map the text to the concepts it is about (again, NLP techniques can help with this).
Example ontologies: SUMO, MSO, etc.
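A minimal sketch of the WordNet lookup via NLTK follows; the wordnet corpus has to be downloaded once with nltk.download("wordnet"), and the example word is arbitrary.

    from nltk.corpus import wordnet as wn

    word = "buses"
    lemma = wn.morphy(word) or word          # lemmatize the surface form: "buses" -> "bus"
    for synset in wn.synsets(lemma, pos=wn.NOUN):
        print(synset.name(), "-", synset.definition())
        for hypernym in synset.hypernyms():  # is-a relations, e.g. bus -> public transport
            print("   is-a:", hypernym.name())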
You could use Lucene (or any text-search engine) to store your documents, combined with a custom stemmer to index your document text based on meaning (rather than word variations).
Normally, stemmers are used to convert all variations of a word to the base word stem. For example, although the document is stored and retrieved with text as-is, any of the words "sing, singing, sang, sung" would be indexed as "sing", so when a search is made using the search term "sing", you get a hit on all documents containing sing, singing, sang or sung.
Similarly, the search terms may also be stemmed, so searching for any of "sing, singing, sang or sung" will search as if "sing" is the search term.
Standard stemmers deal with the usual English variations of words, but you could create one that stems based on meaning. For example, you might create a stemmer that stems any of "problem, issue or complaint" to "problem", etc for all words you want to "link".
The advantage of using a stemmer is all the search-related heavy lifting is done for you by the text search engine (and besides, text search engines are incredibly fast!).
When it comes to implementation, you could make the linkages data-driven: either generate the code for the stemmer from data in a database, or make it dynamic and look up the database whenever a search/index operation is done, or somewhere in between, caching the values and refreshing them periodically.
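A minimal sketch of the data-driven variant, shown here as plain Python normalization rather than a real Lucene analyzer (in Lucene/Solr/Elasticsearch the equivalent is usually a synonym filter); the word mappings are illustrative and could just as well be loaded from a database table.

    MEANING_STEMS = {                 # illustrative data; could be loaded from a DB table
        "issue": "problem", "issues": "problem",
        "complaint": "problem", "complaints": "problem",
        "problems": "problem",
    }

    def normalize(tokens):
        """Collapse related words onto one canonical term before indexing/searching."""
        return [MEANING_STEMS.get(t.lower(), t.lower()) for t in tokens]

    print(normalize("Customer raised several complaints".split()))
    # -> ['customer', 'raised', 'several', 'problem']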
Depending on your requirements, you can look at different implementations of the map-reduce paradigm. The most famous one is Hadoop, specifically Hadoop MapReduce. Though this is a framework rather than a database, it does exactly what you ask: storing and processing data as key=value pairs. It is a product for building large, scalable systems. If you need something simpler, there are smaller implementations, such as PHP-based ones (on top of MySQL), and even plain MySQL aggregation can mimic MapReduce in most cases where you do not need a distributed system with loads of data.
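For intuition, here is a toy map/reduce over key=value pairs in plain Python; Hadoop applies the same map, group-by-key, reduce pattern, just distributed across many machines. The sample records are invented.

    from itertools import groupby

    records = ["bus vehicle", "wheel car", "bus coach"]

    def map_phase(record):
        # Emit (key, value) pairs for one input record.
        key, value = record.split(" ", 1)
        yield key, value

    # Shuffle/sort: group all emitted pairs by key.
    mapped = sorted(pair for record in records for pair in map_phase(record))
    # Reduce: collect the values for each key.
    reduced = {key: [v for _, v in group] for key, group in groupby(mapped, key=lambda kv: kv[0])}
    print(reduced)   # -> {'bus': ['coach', 'vehicle'], 'wheel': ['car']}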
It sounds very much like you are talking about an ontology. See What is an Ontology (Database?)?
It seems to me that ontologies provide a very powerful way of building up complex models of real-world entities and relationships in a natural and organic way. Relationships between entities/concepts can be captured in the model, and as the number of types of relationship grows, more and more sophisticated rules can be encoded to exploit this body of knowledge.
The format sounds like JSON objects, so I looked at Wikipedia and found CouchDB - an open source database that uses JSON to store data.

Best way to store large searchable text files

I am developing an online Bible search program. The Bible is a pretty large book, taking up nearly 5MB of space in plain text. I am planning on implementing an API in the program as well allowing other websites to include their own Bible search widgets and programs without having to develop the search queries or storing Bibles on their own servers.
With this in mind, I am going to expect that eventually I will have a moderate flow of queries passing through the program. Also, for those not familiar with the Bible, it has 2 methods of formatting the text. It can contain both red text and italics. I need a way to store the Scriptures along with the red letter and italics formatting but allowing the search queries to ignore the formatting.
It also needs to be fast and as efficient (memory and cpu usage) as possible. Any storage format will be considered (MySQL, JSON or XML text files, etc) as long as the querying can be done ignoring the formatting. File size and count doesn't really matter, so splitting up the books or even chapters into separate files is fine by me.
One more important thing to keep in mind though, is that I want to have some form of search method that can search across multiple verses. So a search for "but have everlasting life for God sent not his Son" would return John 3:16,17. Thanks for all ideas!
There are a bunch of different open source document search engines which are made for precisely what you're trying to do. Solr, Elastic Search, Xapian, Whoosh, Haystack (made for Django) and others. There are other posts on S.O. and elsewhere that go into the benefits of using one vs another, but your requirements are simple enough that any of them will be more than fine (and easily scale with very minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you - Solr is arguably the most popular and it's the only one I've worked with, but Elastic Search uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.
As for the actual implementation, you'll want to index each verse as a separate "document" if the single verse (or just verse number) is what you want to return. The search engine handles the ranking of the results based on relevancy (usually using a tf/idf algorithm, in case you're interested).
The way I'd handle the italics and red text is to include some kind of markup in the text (i.e. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters - there may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The queries spanning multiple verses requirement is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
A word of caution - if you're not familiar with search indexing, even something designed to be plug-and-play like Elastic Search will probably still require some time and effort to set up, so if you absolutely need to get this up and running quickly and you're already familiar with MySQL I suppose it could work (it does do fulltext search). But it's certainly not the best tool for the job, so if this is a project that you're invested in you will thank yourself later if you put in a little bit of work to learn one of these search frameworks. It may be overkill in terms of the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search on that text which seems to be what you want. For instance, adding other requirements later on would be very straightforward (for instance, you could let people limit their search to only matches in the red text).
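To make that concrete, here is a minimal sketch with Whoosh (one of the engines listed above, and pure Python); the schema and field names are just one reasonable layout, and the same structure maps directly onto Solr or Elasticsearch.

    import os
    from whoosh import index
    from whoosh.fields import Schema, ID, NUMERIC, TEXT
    from whoosh.qparser import QueryParser

    # One document per verse; stored fields come back with each hit.
    schema = Schema(ref=ID(stored=True), book=TEXT(stored=True),
                    chapter=NUMERIC(stored=True), verse=NUMERIC(stored=True),
                    text=TEXT(stored=True))

    os.makedirs("bible_index", exist_ok=True)
    ix = index.create_in("bible_index", schema)
    writer = ix.writer()
    writer.add_document(ref="John 3:16", book="John", chapter=3, verse=16,
                        text="For God so loved the world, that he gave his only begotten "
                             "Son, that whosoever believeth in him should not perish, "
                             "but have everlasting life.")
    writer.commit()

    with ix.searcher() as searcher:
        query = QueryParser("text", ix.schema).parse("everlasting life")
        for hit in searcher.search(query):
            print(hit["ref"])          # -> John 3:16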
I didn't know the Bible had formatting. What is it used for? If it is for the verses, I'd suggest you store every verse in a database. In a highly normalized form, you'd have a table with books, a table with chapters and a table with verses. Each verse consists of a verse number and a verse text.
Now, I think the chapters don't have titles, so they are actually just a number as well. In that case it is silly to store them separately, so you just have your table of books and a table of verses, in which each verse has a chapter number, a verse number and a verse text. That text, I assume, is plain text, isn't it?
If the verse is plain text, you can easily make it searchable by storing it in MySQL and create a FULLTEXT index for it. That way, you can search quite efficiently and even use wildcards and such.
If the verse was to have formatting, you could choose to create two columns, one with the plain text for searching, and one with the formatted text for display, but I doubt you would need this.
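If that were needed, a minimal sketch of the two-column layout could look like this: the plain-text column carries the FULLTEXT index and is what gets searched, while the marked-up copy is what you display. Connection details and table/column names are placeholders, and this uses the mysql-connector-python driver.

    import mysql.connector  # pip install mysql-connector-python

    conn = mysql.connector.connect(user="bible", password="...", database="bible")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS verses (
            book        VARCHAR(30),
            chapter     INT,
            verse       INT,
            text_plain  TEXT,        -- searched
            text_marked TEXT,        -- displayed (keeps red-letter/italic markup)
            FULLTEXT KEY ft_plain (text_plain)
        ) ENGINE=InnoDB
    """)

    cur.execute(
        "SELECT book, chapter, verse, text_marked FROM verses "
        "WHERE MATCH(text_plain) AGAINST (%s IN NATURAL LANGUAGE MODE)",
        ("everlasting life",))
    for row in cur.fetchall():
        print(row)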
PS: 5 MB of text is nothing really. If you got a dedicated program, you could keep it in memory in a single string and use strpos or a similar function to find a text. What language, database and platform are you using?

Resources