Extracting information from millions of simple but inconsistent text files

We have millions of simple .txt documents containing various data structures that we extracted from PDFs. The text is printed line by line, so all formatting is lost (the tools we tried for preserving the layout just mangled it). We need to extract the fields and their values from these text documents, but there is some variation in the structure of the files (a new line here and there, noise on some sheets so spellings are incorrect).
I was thinking we would create some sort of template structure with information about the coordinates (line and word number) of keywords and values, and use that information to locate and collect keyword values, with various algorithms to make up for the inconsistent formatting.
Is there a standard way of doing this? Any links that might help, or any other ideas?

The noise can be corrected or ignored by using fuzzy text-matching tools like agrep: http://www.tgries.de/agrep/
However, the problem with extra new-lines will remain.
One technique I would suggest is to limit error propagation the way compilers do. For example, you try to match your template or a pattern, and you can't. Later on in the text there is a sure match, but it might be part of the current unmatched pattern.
In that case, the sure match should be accepted and the unmatched chunk of text set aside for future processing. This will let you skip errors that are too hard to parse.
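A minimal sketch of how the template-coordinate idea from the question could be combined with fuzzy matching, using only Python's standard library. The field names, expected line numbers, slack window, and the 0.8 similarity threshold are all invented for illustration:

    import difflib

    # Hypothetical template: keyword -> line number where it is expected.
    TEMPLATE = {"Invoice Number": 2, "Total Amount": 10}
    LINE_SLACK = 3  # tolerate a few stray newlines shifting the layout

    def extract_fields(lines):
        found = {}
        for keyword, expected in TEMPLATE.items():
            n = len(keyword.split())
            lo = max(0, expected - LINE_SLACK)
            hi = min(len(lines), expected + LINE_SLACK + 1)
            for i in range(lo, hi):  # search a window of nearby lines
                words = lines[i].split()
                for j in range(len(words) - n + 1):
                    candidate = " ".join(words[j:j + n])
                    # SequenceMatcher.ratio() tolerates OCR misspellings
                    if difflib.SequenceMatcher(None, candidate.lower(),
                                               keyword.lower()).ratio() > 0.8:
                        # take the rest of the line as the field's value
                        found[keyword] = " ".join(words[j + n:])
        return found

In the same spirit as the compiler analogy above, anything that never crosses the threshold can go into a reject pile for a second, more expensive pass.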

Larry Wall's Perl is your friend here. This is precisely the sort of problem domain at which it excels.
Sed is OK, but for this sort of thing, Perl is the bee's knees.

While I second the recommendations for the Unix command line and for Perl, a higher-level tool that may help is Google Refine (now OpenRefine). It is meant for handling messy real-world data.

I would recommend using graph regular expressions here, with very weak rules and a final acceptance predicate. That way you can write fuzzy matching at the token level, then at the line level, and so on.

I suggest the Talend data integration tool. It is open source (i.e. free!). It is built on Java, and you can customize your data integration project any way you like by modifying the underlying Java code.
I used it and found it very helpful on low-budget, highly complex data integration projects. Here's the link to their website: Talend.
Good luck.

Related

Best way to store large searchable text files

I am developing an online Bible search program. The Bible is a pretty large book, taking up nearly 5 MB of space in plain text. I am planning on implementing an API in the program as well, allowing other websites to include their own Bible search widgets and programs without having to develop the search queries or store Bibles on their own servers.
With this in mind, I expect that eventually I will have a moderate flow of queries passing through the program. Also, for those not familiar with the Bible, its text has two kinds of formatting: red text and italics. I need a way to store the Scriptures along with the red-letter and italics formatting, while allowing search queries to ignore the formatting.
It also needs to be fast and as efficient (in memory and CPU usage) as possible. Any storage format will be considered (MySQL, JSON or XML text files, etc.) as long as querying can ignore the formatting. File size and count don't really matter, so splitting up the books or even chapters into separate files is fine by me.
One more important thing to keep in mind though, is that I want to have some form of search method that can search across multiple verses. So a search for "but have everlasting life for God sent not his Son" would return John 3:16,17. Thanks for all ideas!
There are a bunch of different open-source document search engines made for precisely what you're trying to do: Solr, Elasticsearch, Xapian, Whoosh, Haystack (made for Django), and others. There are other posts on S.O. and elsewhere that go into the benefits of using one vs. another, but your requirements are simple enough that any of them will be more than fine (and will easily scale with very minimal effort should your project take off, which is always nice to know). So look at their examples and see which one looks most intuitive to you. Solr is arguably the most popular and it's the only one I've worked with, but Elasticsearch uses the same popular Lucene backend and is apparently much easier to get up and running, so I would start there.
As for the actual implementation, you'll want to index each verse as a separate "document" if the single verse (or just verse number) is what you want to return. The search engine handles the ranking of the results based on relevancy (usually using a tf/idf algorithm, in case you're interested).
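As a rough sketch of that verse-per-document approach, here is what indexing and querying could look like with Whoosh (one of the engines listed above); the index directory, field names, and sample verse are illustrative:

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    # One "document" per verse: a stored reference plus the searchable text.
    schema = Schema(ref=ID(stored=True), text=TEXT(stored=True))
    os.makedirs("bible_index", exist_ok=True)
    ix = create_in("bible_index", schema)

    writer = ix.writer()
    writer.add_document(ref="John 3:16",
                        text="For God so loved the world, that he gave his only "
                             "begotten Son, that whosoever believeth in him should "
                             "not perish, but have everlasting life.")
    writer.commit()

    with ix.searcher() as searcher:  # relevancy ranking is handled for you
        query = QueryParser("text", ix.schema).parse("everlasting life")
        for hit in searcher.search(query):
            print(hit["ref"])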
The way I'd handle the italics and red text is to include some kind of markup in the text (e.g. wrap the phrase in single asterisks for italics, double asterisks for red) and then tell the analyzer to ignore those characters. There may be a simpler way in the framework you end up choosing, though, so take that with a grain of salt. The queries-spanning-multiple-verses requirement is more complicated, but the answer will probably involve indexing each whole chapter as a document instead of (or maybe in addition to? I'd have to think about it more) each verse.
A word of caution: if you're not familiar with search indexing, even something designed to be plug-and-play like Elasticsearch will probably still require some time and effort to set up, so if you absolutely need to get this up and running quickly and you're already familiar with MySQL, I suppose it could work (it does do full-text search). But it's certainly not the best tool for the job, so if this is a project you're invested in, you will thank yourself later if you put in a little bit of work to learn one of these search frameworks. It may be overkill in terms of the amount of text you're dealing with, as others have pointed out, but it will be extremely flexible in how you can search that text, which seems to be what you want. For instance, adding other requirements later on would be very straightforward (for example, you could let people limit their search to matches in the red text only).
I didn't know the Bible had formatting. What is it used for? If it is for the verses, I'd suggest you store every verse in a database. In a highly normalized form, you'd have a table with books, a table with chapters, and a table with verses. Each verse consists of a verse number and a verse text.
Now, I think the chapters don't have titles, so they are actually just a number as well. In that case it is silly to store them separately, so you'd have just your table of books and a table of verses, in which each verse has a chapter number, a verse number, and a verse text. That text, I take it, is plain text, isn't it?
If the verse is plain text, you can easily make it searchable by storing it in MySQL and creating a FULLTEXT index for it. That way you can search quite efficiently and even use wildcards and such.
If the verse was to have formatting, you could choose to create two columns, one with the plain text for searching, and one with the formatted text for display, but I doubt you would need this.
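If you do go the MySQL route, a minimal sketch of the FULLTEXT approach could look like the following; it assumes the mysql-connector-python driver and a local server, and the schema is invented for illustration:

    import mysql.connector

    conn = mysql.connector.connect(user="root", password="...",  # your credentials
                                   database="bible")
    cur = conn.cursor()

    # Plain-text verses with a FULLTEXT index so queries ignore formatting.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS verses (
            book    VARCHAR(32),
            chapter INT,
            verse   INT,
            text    TEXT,
            FULLTEXT KEY ft_text (text)
        )
    """)

    cur.execute(
        "SELECT book, chapter, verse FROM verses "
        "WHERE MATCH(text) AGAINST (%s IN NATURAL LANGUAGE MODE)",
        ("everlasting life",))
    for book, chapter, verse in cur.fetchall():
        print(f"{book} {chapter}:{verse}")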
PS: 5 MB of text is nothing, really. If you have a dedicated program, you could keep it in memory as a single string and use strpos or a similar function to find text. What language, database, and platform are you using?

Why does Google Wave Operational Transform need annotations?

The operational transform stuff used in Google Wave has a rather curious document format. A document is basically just an XML-subset document: characters, start tags, and end tags. In addition, the document has "annotations", which are metadata associated with ranges, i.e. a start position and an end position. The white paper justifies their presence with:
Wave document operations also support annotations. An annotation is some meta-data associated with an item range, i.e., a start position and an end position. This is particularly useful for describing text formatting and spelling suggestions, as it does not unnecessarily complicate the underlying structured document format.
I can certainly see how it would be somewhat difficult if an arbitrary range of a document were selected and, for example, bolded: XML tag nesting is strict, and that would cause a mess of open- and close-tag insertions.
However, is this really a problem in practice? I mean, does one necessarily have to support such an operation, unless one is making an editor that basically mimics the years-old word-processing paradigm instead of being a structured editor? Would pure XML operational transform, with the document structure simply being HTML5, be that terrible? Is it a performance issue that styles would be in the document as tags? Or does the operational transform model somehow produce unsatisfactory results on text formatting if it is represented with tags?
Also, a side question - how good would the pure "insert character, remove character, retain" operational transform model be on plain text representations? For example, editing HTML5 as text - or editing Wikipedia articles?
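To make that side question concrete, here is a toy sketch of applying a retain/insert/delete operation list to a plain string (the names are illustrative, not Wave's actual API):

    # Apply a list of (op, arg) pairs to a document string.
    def apply_ops(doc, ops):
        out, pos = [], 0
        for op, arg in ops:
            if op == "retain":    # copy `arg` characters unchanged
                out.append(doc[pos:pos + arg])
                pos += arg
            elif op == "insert":  # splice in new text
                out.append(arg)
            elif op == "delete":  # skip `arg` characters of the original
                pos += arg
        out.append(doc[pos:])     # keep any trailing text
        return "".join(out)

    print(apply_ops("hello world", [("retain", 6), ("delete", 5),
                                    ("insert", "Wave")]))  # -> "hello Wave"

Transforming concurrent edits in this model only has to reconcile pairs of such lists; the question is what breaks once tags and nesting constraints enter the picture.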
There are fundamental problems with using a hierarchical markup language with OT. See below for a worked example:
Does operational transformation work on structured documents such as HTML if simply treated as plain text?
This choice makes sense to me as an optimization on several fronts:
The underlying document remains as human readable and parse-able as possible
The algorithms to parse the underlying XML remain as simple as possible (useful for compatibility with non-google attempts at parsing the resulting documents, and for maintenance)
The extra collected garbage, after multiple edits, can lead to large performance hits - due to the sheer number of tags and/or additional passes on the document to attempt to simplify it.

Game Design and Architecture Advice for Text Adventures

I am trying to create an old-school Text Adventure Game. I'm a bit stuck on creating the World Map and rooms.
Should the room descriptions be part of the source code, or should they be separated out? I was thinking of placing all such descriptions and room properties in a MySQL database and then having code to organize the logic of each room; putting each room description in with the actual source code seems a bit untidy.
Is this the preferred method of organising descriptions in an adventure game? I was also thinking that this might be preferable since I could then query the database to find common properties in the data.
Any comments would be appreciated.
No, don't include level/room descriptions within the code; that way they are not dynamic.
Many development frameworks now tend to separate code from data. So, in the usual case, we put the game's room data in files and read those to build the level, and perhaps let the user construct a new level on his own and eventually create a new file to carry the room data.
I work at a company that builds games, and they keep the rooms separate from the code; they have them in MySQL. The items that go in each room are in a table as well, and there is also a table that says which item is in which room at a given moment.
Besides, if you want to expand your game or gather statistics about it, it is much easier with a database.
I will address two issues here. First, you are right to keep the data that defines the game away from the engine that uses it. This means you don't have to recompile everything in order to fix a typo or the like, in the case of a text-based game.
Secondly, though, I would question the use of MySQL. If you are making a DOS-style game to be installed on people's systems, you don't want a prerequisite to be 'Install MySQL', hehe. There is a little program written in C, free for all to use, called SQLite that would suit your needs much better. If, on the other hand, the web is the medium for releasing this text-based game, then have at it :)
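A minimal sketch of that SQLite route, using Python's standard-library sqlite3 module (the schema and room names are made up):

    import sqlite3

    conn = sqlite3.connect("game.db")  # a single file shipped with the game
    conn.execute("CREATE TABLE IF NOT EXISTS rooms "
                 "(id TEXT PRIMARY KEY, description TEXT)")
    conn.execute("INSERT OR REPLACE INTO rooms VALUES (?, ?)",
                 ("cave", "A damp cave."))
    conn.commit()

    (desc,) = conn.execute("SELECT description FROM rooms WHERE id = ?",
                           ("cave",)).fetchone()
    print(desc)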
You could just use a system like ADRIFT; then all you need to worry about are the descriptions and logic.
Should the room descriptions be part of the source code or should it
be separated out?
Separated out.
Try the Prolog language.
It has a database similar to SQL's (actually logical predicates).
With some skill, you may be able to check whether your adventure is still finishable after some change.
You could easily generate the descriptions from logical predicates, if you don't mind them sounding very "computer-like".
You can find examples of Prolog text adventures with a simple Google search.
I suggest using engines that already have a vibrant community around them. That way, your source code is only that: the source code of the game. I'd go with either TADS 3 or Inform 7.
I would construct such a game as an interpreter which reads in room data and, based on that data, allows a set of valid commands (move, take, drop, change, ...). For movement you would have a pre-built graph whose nodes are rooms and whose edges are allowed moves.
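A minimal sketch of that interpreter idea, with the movement graph stored as data rather than code (the room names, fields, and JSON layout are invented):

    import json

    # Room data would normally live in an external file; inlined here for brevity.
    ROOMS_JSON = """
    {
      "cave":   {"description": "A damp cave.",     "exits": {"north": "forest"}},
      "forest": {"description": "A sunlit forest.", "exits": {"south": "cave"}}
    }
    """

    rooms = json.loads(ROOMS_JSON)

    def move(current, direction):
        # Follow an edge of the room graph if that exit exists; stay put otherwise.
        return rooms[current]["exits"].get(direction, current)

    current = "cave"
    print(rooms[current]["description"])
    current = move(current, "north")
    print(rooms[current]["description"])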
I would separate the descriptions from the code, with a Room object that owns a Description object that calls a "database" through some Facade, so that you may use a file, a database, or anything you wish. It would also eventually allow you to add some scripting to the room itself, like having objects in your descriptions that have behaviors.

Creating a new file type

I have been assigned a project to create a new binary file structure for storing coordinate information coming out of 3D CAD/CAM software. I would highly appreciate it if you could kindly point out some articles or books about designing new binary file formats. Thanks for your time :)
I would start by taking a look at similar file formats on wotsit.org. The site covers many different file formats and contains links to their specifications.
By looking at other file formats you'll get ideas about how best to format and present information about your specification.
There's a universal binary (and compact) notation called ASN.1. It's widely used, and there are books about it available. ASN.1 can be compared to XML, but it works at a lower (more primitive yet more flexible) level than XML. And XML, especially the binary XML mentioned in another answer, could be of great help to you too.
Also, if you have more than just one sequence of data to hold in your file, take a look at Solid File System as a container for several data streams in one file.
If I had the same assignment, I would inspect something already existing, like .OBJ, and then try to implement something similar, probably with minor changes.
Short answer: don't. Use XML or a text format instead, for readability, extensibility, and portability.
Longer answer: CAD/CAM has loads of 'legacy' formats around. I'd look to use one of those (possibly extending it if necessary). And if there's nothing suitable, and XML is considered too bloated and slow, look at binary XML formats instead.
I think that what you really need is to figure out what data you have to save. Then you load it into memory and serialize that memory.
Here is a tutorial on serialization in C++. That page also addresses many issues with saving data.
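As a hedged illustration of what "serialize that memory" can look like, here is a sketch using Python's struct module; the magic number, version field, and record layout are invented for this example:

    import struct

    MAGIC, VERSION = b"C3DF", 1
    points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]

    # Header: 4-byte magic, 1-byte version, 4-byte point count (little-endian).
    with open("points.bin", "wb") as f:
        f.write(struct.pack("<4sBI", MAGIC, VERSION, len(points)))
        for x, y, z in points:  # each record: three float64 coordinates
            f.write(struct.pack("<3d", x, y, z))

    with open("points.bin", "rb") as f:
        magic, version, count = struct.unpack("<4sBI",
                                              f.read(struct.calcsize("<4sBI")))
        assert magic == MAGIC  # sanity-check the file type
        for _ in range(count):
            print(struct.unpack("<3d", f.read(24)))

Versioning the header from day one is cheap insurance: readers can refuse, or adapt, when they meet a newer layout.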

Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?

We are evaluating technologies to store the data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per translation unit (TU).
After reading the following SO answer, I began to consider HDF5 as a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide, I understand that HDF5 is similar to XML or a DB in that information is associated with a tag/column, and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding of this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach it. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical, i.e. a unique path describes each one's location (at least in filesystems without hard links), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute. (Or anywhere else as a string; you could have lookup tables galore if you wanted.)
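A small sketch of that path-as-string-attribute idea using h5py (the group names are invented; HDF5 also has native object references, but string paths keep the example simple):

    import h5py
    import numpy as np

    with h5py.File("scopes.h5", "w") as f:
        parent = f.create_group("/scopes/global")
        child = f.create_group("/scopes/global/main")
        # Store paths as attributes to "point" between objects.
        child.attrs["parent"] = parent.name  # "/scopes/global"
        parent.attrs["children"] = np.array([child.name],
                                            dtype=h5py.string_dtype())

    with h5py.File("scopes.h5", "r") as f:
        child = f["/scopes/global/main"]
        print(f[child.attrs["parent"]].name)  # follow the "pointer" back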
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write-once, read-many-times model, and the format seems to handle this well. I know of a project that used to write both to an Oracle database and to HDF5. Eventually they removed the Oracle output, since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited to the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.
