I have a database of documents in which fast search for keywords and patterns would be very useful.
I know of "Burrows–Wheeler transform"/FM-index. I wonder if there are any programs or database programs based on BWT or similar methods in order to search a corpus in O(1) and hopefully more advantages.
Any ideas?
There is a great book by Witten/Moffat/Bell (1994), Managing Gigabytes; it describes in detail everything you need to know about indexing and retrieval. I think their source code is also available, or has been made available in an information retrieval library.
However, it doesn't include the Burrows-Wheeler transform, as that was only invented in the same year.
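For what it's worth, the transform itself is short enough to sketch; the FM-index used in practice adds rank/select structures and a sampled suffix array on top of it to support fast substring search. Below is a minimal, unoptimized Python sketch; it builds all rotations explicitly, so it is only sensible for small inputs.

```python
# Minimal, unoptimized sketch of the Burrows-Wheeler transform and its inverse.
# Fine for experimenting with small strings; real indexes use suffix arrays
# and rank/select structures instead of materializing all rotations.

def bwt(s, sentinel="$"):
    s = s + sentinel                       # unique end-of-string marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last, sentinel="$"):
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(sentinel))[:-1]

if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                     # "annb$aa"
    print(inverse_bwt(transformed))        # "banana"
```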
I'm trying to create a small web application for a "personal information manager" / wiki kind of tool where I can take notes in the form of HTML snippets (or maybe Markdown), annotate them with some https://schema.org/ microdata and store both the snippet and the metadata somewhere for querying.
My understanding so far is that most semantic data stores (triple/quad stores, or databases supporting RDF) are better suited for storing and querying mainly the metadata. So I'll probably also want some traditional store of some sort (relational, document store, key-value, or even a non-rdf graph db) where I can store the full text of each note and maybe some other bits like time of last access, user-id that owns the note, etc, and also perform traditional (non-semantic) fulltext queries.
I started looking for stores that would allow me to store both data and metadata in a single place. I found a few: Ontotext GraphDB, Stardog, MarkLogic, etc. All of these seem to do exactly what I want, but have some pretty limiting free license terms that really discourage me from studying them in depth: I prefer to study open technologies that I could potentially use on a real product.
Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
If there's any setup involving free/open-source software that developers with experience in RDF/SPARQL can recommend, given the problem I describe.
Right now I'm just leaning towards using Apache Jena for the RDF store and SPARQL queries, and something completely independent for the rest of the data (PostgreSQL most likely).
Before digging deeper, I was wondering:
If my assumption is correct: that I'll need to use one store for the data and another for the metadata.
Not necessarily, no, though there certainly are some cases in which that distinction may be useful. But most RDF databases offer scalable storage for both data and metadata. The only requirement is that your (meta)data is represented as RDF. If you're worried about performance of things like text queries, most of them offer support for full-text indexing through Lucene, Solr, or Elasticsearch.
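For illustration, here is a minimal sketch of that idea using Python's rdflib; the note URI, the example property values, and the in-memory graph are all placeholders for whatever store you end up using (Jena or RDF4J would look similar in Java).

```python
# Minimal sketch: the note body and its schema.org metadata live in one graph.
# The ex:note1 URI and the literal values are made up for the example.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

SCHEMA = Namespace("https://schema.org/")
EX = Namespace("http://example.org/notes/")

g = Graph()
note = EX.note1
g.add((note, RDF.type, SCHEMA.NoteDigitalDocument))
g.add((note, SCHEMA.text, Literal("Remember to renew the domain next month.")))
g.add((note, SCHEMA.dateModified, Literal("2015-06-01", datatype=XSD.date)))
g.add((note, SCHEMA.author, EX.alice))

# Both the "data" (note text) and the metadata are queried via SPARQL.
query = """
PREFIX schema: <https://schema.org/>
SELECT ?note ?text WHERE {
    ?note a schema:NoteDigitalDocument ;
          schema:text ?text .
    FILTER(CONTAINS(LCASE(?text), "domain"))
}
"""
for row in g.query(query):
    print(row.note, row.text)
```

A dedicated triple store with a full-text index (Lucene/Solr/Elasticsearch connectors, as mentioned above) would replace the naive CONTAINS filter for anything beyond toy sizes.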
If there's any setup involving free/open-source software that developers with experience in RDF/SPARQL can recommend, given the problem I describe.
This is really not the right place to ask this question. Tool recommendations are considered off-topic on Stack Overflow since they attract biased answers. But as said, there are plenty of tools, both open-source/free and commercial, that you can look into. I suggest you pick one you like the look of, experiment a bit, and perhaps talk to that particular tool's community to explain what you're trying to do. Apache Jena and Eclipse RDF4J are two popular open-source projects, but there are plenty of others.
Imagine that you feed a system a bunch of PDFs and you, and only you, know how they are related (e.g. they are all dissertations, or news articles, or invoices). The system knows that the batch is connected, but does not know how the documents relate.
The system then scans these PDFs and suggests indexes, with their respective values, for each document.
Here's an example: you feed the system all the invoices your company receives. The system processes these documents and suggests "Supplier", "Invoice Cost" and "Due Date" as indexes. For each PDF the system also extracts the value of each field.
So my question is: what kind of artificial intelligence system is most adequate for this scenario? A Neural Network? A combination?
You are looking for unsupervised learning algorithms. More specifically, yours is a clustering problem, since your system does not know anything about the data it is going to analyze and it has to come up with a correct classification of the documents (or their properties).
In your example, by using clustering algorithms, your system can learn to distinguish the documents you provide and to extract fields such as "Supplier" and "Invoice Cost".
The wiki page I linked should be enough to give you a general idea of the class of algorithms you need. On Google you will find a plethora of lecture slides on the topic.
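For a concrete starting point, document clustering is often prototyped with TF-IDF vectors and k-means. Here is a rough sketch using scikit-learn, assuming the PDF text has already been extracted; the example documents and the number of clusters are placeholders.

```python
# Sketch: cluster plain-text documents extracted from the PDFs.
# Assumes the text has already been pulled out of the PDFs;
# the documents below and n_clusters=3 are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Invoice 4711 from ACME Corp, total 120.00 EUR, due 2014-05-01",
    "Dissertation on distributed consensus protocols",
    "Breaking news: markets rally after central bank announcement",
    # ... more extracted document texts
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)          # TF-IDF document vectors

kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X)                   # cluster id per document

for doc, label in zip(documents, labels):
    print(label, doc[:60])
```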
You could do this with just a keyword search, if you know what keywords the machine should be looking for and the documents all follow the same format.
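If the documents really do share one format, that keyword/pattern route can be as simple as a handful of regular expressions over the extracted text. A rough sketch; the field names and patterns here are made up.

```python
# Rough sketch of keyword/pattern extraction for uniformly formatted invoices.
# The field names and regexes are made up; real invoices need far more care.
import re

FIELD_PATTERNS = {
    "Supplier": re.compile(r"Supplier:\s*(.+)"),
    "Invoice Cost": re.compile(r"Total:\s*([\d.,]+\s*\w{3})"),
    "Due Date": re.compile(r"Due Date:\s*(\d{4}-\d{2}-\d{2})"),
}

def extract_fields(text):
    """Return whichever of the known fields can be found in the document text."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[name] = match.group(1).strip()
    return result

sample = "Supplier: ACME Corp\nTotal: 120.00 EUR\nDue Date: 2014-05-01"
print(extract_fields(sample))
# {'Supplier': 'ACME Corp', 'Invoice Cost': '120.00 EUR', 'Due Date': '2014-05-01'}
```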
If the formats are non-uniform within each category, however, then you would need to consider some kind of language processing in order for the machine to be able to understand what's going on.
Try doing some research into natural language processing; this is probably along the lines of what you're looking for:
NLP Wiki
I am trying to implement a storage system that supports tagging of data. A very simple application of this system is like questions on Stack Overflow, which are tagged with multiple tags, and a query may consist of multiple tags. This is also similar to searching Google with multiple keywords.
The data set maintained by this system will be very large, like several or tens of terabytes with billions of entries.
So what data structures and algorithms should I use in this system for maintaining and querying the data? The data may be stored across a cluster of machines.
Are there any guides or papers that describe such problems and their solutions?
You might want to read the two books below:
Collective Intelligence in Action
Satnam Alag (ISBN: 1933988312)
http://www.manning.com/alag/
"Capter 3. Extracting intelligence from tags" covers:
Three forms of tagging and the use of tags
A working example of how intelligence is extracted from tags
Database architecture for tagging
Developing tag clouds
Programming Collective Intelligence
Toby Segaran (ISBN: 978-0-596-52932-1)
http://shop.oreilly.com/product/9780596529321.do
 "Chapter 4. Searching and Ranking" covers:
Basic concepts of algorithms for search engine index
Design of a click-tracking neural network
Hope it helps.
Your problem is very difficult, but there are plenty of related papers and books. The Amazon Dynamo paper, Yahoo's PNUTS, and this Hadoop paper are good examples.
First, you must decide how your data will be distributed across the cluster. Data must be evenly distributed across the network, without hot spots; consistent hashing is a good solution for this problem. Also, data must be redundant: any entry needs to be stored in several places to tolerate failures of individual nodes.
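A bare-bones sketch of a consistent-hash ring, just to make the idea concrete; the node names and the number of virtual nodes below are arbitrary.

```python
# Bare-bones consistent-hash ring with virtual nodes; node names are arbitrary.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = []                              # (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's hash to the first virtual node."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("entry:12345"))   # a given key always maps to the same node
```

Adding or removing a node only remaps the keys adjacent to its virtual nodes, which is exactly the property that avoids wholesale reshuffling and hot spots.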
Next, you must decide how writes will occur in your system. Every write must be replicated to the nodes that contain the updated data entry. You might want to read about the CAP theorem and the concept of eventual consistency (Wikipedia has good articles on both). There is also a tradeoff between consistency and latency. You can use different mechanisms for write replication: some kind of gossip protocol, or state machine replication.
I don't know what kind of tagging you mean: are the tags manually assigned to entries, or learned from the data? Either way, this is the field of information retrieval (IR). You might use some kind of inverted index to search entries efficiently by tags or keywords. You will also need some query-result ranking algorithm.
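To make the inverted-index idea concrete: map each tag to the set of entry IDs carrying it, and answer a multi-tag query by intersecting the posting sets. A toy in-memory sketch follows; a real system would compress these posting lists and shard them across nodes.

```python
# Toy in-memory inverted index: tag -> set of entry ids.
# Real systems compress the posting lists and shard them across the cluster.
from collections import defaultdict

index = defaultdict(set)

def add_entry(entry_id, tags):
    for tag in tags:
        index[tag].add(entry_id)

def query(tags):
    """Entries carrying *all* of the given tags (AND semantics)."""
    postings = [index[t] for t in tags]
    return set.intersection(*postings) if postings else set()

add_entry(1, ["python", "database"])
add_entry(2, ["python", "nlp"])
add_entry(3, ["database", "sharding"])

print(query(["python", "database"]))   # {1}
```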
I've been tasked with implementing a blacklist-based profanity filter for a Rails app. I know there are a ton of issues with blacklist-based filtering, but the decision was made above my head. The challenge: I'm looking for a good list of Spanish profanity to feed into the filter. For English, we're building on a list that exhaustively lists conjugations/plurals/etc., one per line of a text file. Does such a list exist in the public domain for Spanish?
Finding good lists and keeping them tuned is difficult. It also sounds like you are doing a lot of manual work that could be automated (e.g. conjugation). I did a lot of this for my company's profanity filter, CleanSpeak; much of it can be automated using part-of-speech (POS) identifiers for words, and in many cases you can do the POS tagging manually or find a POS source.
You'll also need to consider the quality of the lists and the upkeep and management of the filter. A lot of people think it is simple and then realize that it is extremely difficult to prevent false positives.
All that said, we found the majority of our lists for other languages difficult to come by online, and we ended up paying to have many of them built or purchasing them from other companies. The lists we did find online turned out to be nearly worthless once we had them translated. We also attempted to take our blacklist and have it translated, which was a complete failure, because most English profanities don't have equivalents in other languages. I would suggest purchasing lists or working with students at your local university to generate them. A number of our customers found this method relatively good and not overly expensive.
I would also suggest that you take a look at some of the resources out there that define the best ways to manage User Generated Content. These will help guide you through any build vs. buy decisions.
I understand that full-text indexing and search for a database can be enabled by a lot of pre-packaged products. However, just out of academic curiosity, I wonder how those full-text indexes are actually implemented. I have tried googling for answers with little success. Any feedback would be much appreciated.
Full text searches are supported by quite a few database engines these days as a core feature.
As for the implementation, I think your best bet is to check out PostgreSQL full-text search, since you can find a lot of material on how it is implemented and you can actually change and play with the parsers (for example, to optimize them for a certain domain).
There are further details and concepts explained on Wikipedia under full-text indexes. You can also check out open-source and free full-text search engines, since you will normally find supporting documentation explaining their inner workings too (I have heard good things about Lucene/Solr from that list).
Probably by creating dictionaries of "words" and doing a bit of lexical analysis. (Note that full-text search matches whole words rather than parts of words, so the index can be constrained to that.)
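To make that concrete, here is a crude sketch of the word-dictionary idea; real engines add stemming, stop-word removal, ranking, and compressed posting lists.

```python
# Crude sketch of a word dictionary for full-text indexing.
# Real engines add stemming, stop-word removal, ranking and compressed postings.
import re
from collections import defaultdict

def tokenize(text):
    """Lowercased whole words only; parts of words are not indexed."""
    return re.findall(r"[a-z0-9]+", text.lower())

dictionary = defaultdict(set)   # word -> set of document ids

def index_document(doc_id, text):
    for word in tokenize(text):
        dictionary[word].add(doc_id)

def search(query):
    words = tokenize(query)
    return set.intersection(*(dictionary[w] for w in words)) if words else set()

index_document(1, "Full text indexing builds a dictionary of words.")
index_document(2, "Databases often expose full-text search as a core feature.")

print(search("full text"))       # {1, 2}
print(search("dictionary"))      # {1}
```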