Databases: Effectively implement string contains query - database

I need a way to efficiently do a string-contains query like:
# In SQL
LIKE '%some-string%'
# In mongo
{ $regex: /some-string/ }
But it's very slow when the dataset is big. E.g. I tried it in a dummy DB (with and without an index; no index is surprisingly faster on Mongo) with 100M generated rows (in reality there are more). ElasticSearch seems reasonable, but I am wondering if there's a DB or a way I can structure my data to optimise this use case. I asked, and I really need contains rather than a prefix match ...

PostgreSQL offers so-called trigram indexes (via its pg_trgm extension). Those indexes can accelerate SQL col LIKE '%search%' predicates efficiently enough. Note that in all makes of server, an ordinary index can speed up col LIKE 'string%' (without the leading wildcard character).
MySQL / MariaDB have FULLTEXT indexes that work with a distinctive SQL syntax. That feature works word-by-word, unlike LIKE, which works character-by-character. Microsoft SQL Server has a similar feature with different syntax. It also works word-by-word.
So, there's no SQL standard way to do this efficiently, and different makes of server do it differently.
If you haven't yet chosen a particular make of server, you should figure out whether one of the full text schemes will serve your purpose. If you must get good performance from LIKE, PostgreSQL's trigram indexing is the way to go.
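To make the trigram idea concrete, here is a minimal in-memory sketch in Python. It is not how PostgreSQL implements it (the function names and sample rows are made up for illustration); it only shows the principle: index every 3-character substring, intersect the row sets for the needle's trigrams, then recheck the surviving candidates.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character substrings of text (lowercased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_trigram_index(rows):
    """Map each trigram to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for gram in trigrams(text):
            index[gram].add(row_id)
    return index


def contains_search(rows, index, needle):
    """Emulate col LIKE '%needle%' (case-insensitively) using the index."""
    grams = trigrams(needle)
    if not grams:
        # Needles shorter than 3 characters yield no trigrams: fall back to a full scan.
        candidates = range(len(rows))
    else:
        # A matching row must contain every trigram of the needle.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Candidates can be false positives, so recheck each one against the real text.
    needle = needle.lower()
    return sorted(i for i in candidates if needle in rows[i].lower())


# Hypothetical sample data.
rows = ["some-string here", "another row", "prefix some-string", "strings only"]
index = build_trigram_index(rows)
print(contains_search(rows, index, "some-string"))  # [0, 2]
```

The recheck step matters: a row can contain all the trigrams without containing the needle as one contiguous substring, so the index only narrows the candidate set.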

I don't think there's a general solution to this that works for all database systems. As another answer already explains, many popular database systems have fulltext search extensions that, while far from being able to do what something like Lucene/ElasticSearch can do, should be enough to massively speed up your use case.
Let me explain this from a database internals perspective. Let's say that your predicate is highly selective, i.e. only a very small percentage of your tuples actually match your condition. Then you would generally want some kind of index structure. The index structure you would need for this kind of query is some kind of radix tree/trie, but that's not a standard data structure implemented in all SQL databases. The only data structure that is implemented in almost all SQL databases is a B-tree, and a B-tree can only answer prefix queries, something like LIKE 'test%'. If your database doesn't have such indexes, the only chance you have for LIKE '%test%' is a very fast runtime system, which none of the traditional (open source) database systems has...
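A rough way to see why an ordered index helps with prefixes but not with contains, sketched in Python with a sorted list standing in for a B-tree's key order (the key set and function names are invented for illustration):

```python
import bisect

# A sorted list stands in for the ordered keys of a B-tree index.
keys = sorted(["attest", "contest", "test", "testing", "tests", "text", "zebra"])


def prefix_search(sorted_keys, prefix):
    """LIKE 'prefix%': matches are contiguous, so two binary searches bound them."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after any character used here, so it closes the prefix range.
    hi = bisect.bisect_right(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


def contains_search(sorted_keys, needle):
    """LIKE '%needle%': matches are scattered through the order, so every key is checked."""
    return [k for k in sorted_keys if needle in k]


print(prefix_search(keys, "test"))    # ['test', 'testing', 'tests']
print(contains_search(keys, "test"))  # ['attest', 'contest', 'test', 'testing', 'tests']
```

The prefix lookup touches only a logarithmic number of keys plus the matches themselves, while the contains lookup has to visit every key because "attest" and "contest" sort nowhere near "test".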

Related

Choosing the right DBM-like C++ library for sequential data

I am trying to choose a database for a newly developing application. There are so many alternatives and it’s so easy to choose a wrong one. First of all, there is a requirement to not use database servers. The database must be a static or dynamic C++ library. The data that needs to be stored is an array of records. Their format varies but is fixed for a given dataset (so they can be stored in a table). Each row could hold from several hundred bytes up to several megabytes. The number of rows may be in the millions for now and is expected to grow.
The index of the row could be used as a key. No need to maintain a separate key column.
Data is inserted sequentially. Read access will be performed only by iterating all the data or some segment of it sequentially (May need to iterate with steps like each 5th).
I don’t think that relational DBs are a good fit, for several reasons.
a. They are mostly server-based. I know about SQLite but as far as I know, it stores data in one file, which I assume may lead to issues related to maximum file size.
b. We don’t need the power that SQL provides; instead we would like to have more flexibility in stored data types.
There are key/value non-SQL DBMSs like BerkeleyDB and RocksDB, or something like luxio for lighter alternatives. The functionality they provide is more than enough for the task, and this might be the right choice. However, I don’t know how well they are optimized for a case like ours, where we have continuous integer keys. The associative key access (which we do not require) may add some performance overhead.
I know there is a type of non-SQL database called “wide-column”, which I am not familiar with. However, the name sounds like it is perfect for our task. All the databases I can find are server or cloud based. If you know a dbm-like library for this type of database, please advise.
I am not experienced in databases, so please correct me if I am wrong in any of the 3 statements above.
If your row data can grow to megabytes, and you're talking about only millions of records, maybe just figure out a way to lay it out in a filesystem? If you need a more database-like index, use SQLite for the keys, and have the data records point to a location on the filesystem. This kind of thing will be far quicker to implement and get right than trying to do it all in one giant database.
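As a rough sketch of that layout (in Python rather than C++, purely for brevity): payloads go into an append-only file, and SQLite stores only each row's offset and length. The class name, file names and table schema are all assumptions made up for this example; splitting the data across several files instead of one would work the same way.

```python
import os
import sqlite3
import tempfile


class RecordStore:
    """Append-only blob file plus a SQLite table of (row id, offset, length)."""

    def __init__(self, directory):
        self.data = open(os.path.join(directory, "records.bin"), "a+b")
        self.db = sqlite3.connect(os.path.join(directory, "index.sqlite"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS idx ("
            "id INTEGER PRIMARY KEY, offset INTEGER, length INTEGER)"
        )

    def append(self, payload):
        """Write payload at the end of the data file and record where it landed."""
        self.data.seek(0, os.SEEK_END)
        offset = self.data.tell()
        self.data.write(payload)
        self.data.flush()
        cur = self.db.execute(
            "INSERT INTO idx (offset, length) VALUES (?, ?)", (offset, len(payload))
        )
        self.db.commit()
        return cur.lastrowid  # ids are assigned sequentially, starting at 1

    def scan(self, start=1, stop=None, step=1):
        """Yield payloads for ids start, start+step, ... up to stop (inclusive)."""
        rows = self.db.execute(
            "SELECT offset, length FROM idx "
            "WHERE id >= ? AND (? IS NULL OR id <= ?) AND (id - ?) % ? = 0 ORDER BY id",
            (start, stop, stop, start, step),
        ).fetchall()
        for offset, length in rows:
            self.data.seek(offset)
            yield self.data.read(length)

    def close(self):
        self.data.close()
        self.db.close()


with tempfile.TemporaryDirectory() as tmp:
    store = RecordStore(tmp)
    for i in range(10):
        store.append(f"record-{i}".encode())
    print(list(store.scan(start=1, step=5)))  # every 5th record
    store.close()
```

Since the row index is the key, sequential and strided reads reduce to a range query on the integer primary key followed by seeks into the data file.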

Flat data with struct type vs document store

I know this is a 'soft' question, which is usually frowned upon on SO, but I have been using BigQuery to do data analysis on (obviously) flat data, which contains both structs and repeated data. Let's just use a very basic example, a row might look like this:
ID
Title (str)
ReleaseYear (int)
Genres (str[])
Credits (struct[])
And an example piece of data might look like:
{
  "ID": "T-1997",
  "Title": "Titanic",
  "ReleaseYear": 1997,
  "Genres": ["Drama", "Romance"],
  "Credits": {
    "Actors": ["Leonardo DiCaprio", "Kate Winslet"],
    "Directors": ["James Cameron"]
  }
}
My question is basically what type of operations or queries can be done in a native document store, such as MongoDB or CouchBase, that couldn't be done in a relational DB that supports arbitrarily-nested data. In other words, my assumption (and I hope I'm wrong or misguided) is that as long as a DB supports structs, it can do everything that a document-store can do. If not, what are some places where it is either: (1) something that can be done in MongoDB (or any other document-store) that cannot be done in BigQuery (or any other database that supports structs)? and (2) something that can be done much more easily in MongoDB than in a relational DB?
what type of operations or queries can be done in a native document
store, such as MongoDB or CouchBase, that couldn't be done in a
relational DB that supports arbitrarily-nested data.
Even though it does support nested data, BigQuery allows only limited nesting compared to MongoDB, which supports more levels.
In BigQuery, your schema cannot contain more than 15 levels of nested STRUCTs. MongoDB supports up to 100 levels of nesting for BSON documents.
In other words, my assumption (and I hope I'm wrong or misguided) is
that as long as a DB supports structs, it can do everything that a
document-store can do.
Not exactly. Nested columns are just columns within columns, but sharding in an RDBMS is a complex endeavor compared to a NoSQL database like Mongo. Technically you can do it, but it wasn't designed for the same purpose. It's like using a wrench as a hammer: sure you can, but its purpose was something different. You should use the right tool for the right purpose.
If not, what are some places where it is either: (1) something that
can be done in MongoDB (or any other document-store) that cannot be
done in BigQuery (or any other database that supports structs)? and
(2) something that can be done much more easily in MongoDB than in a
relational DB?
The crux of the matter is that an RDBMS may tack on features to "technically" allow you to do some things that you can do in a NoSQL database, but that doesn't mean they will work just as well. For example, because of the features that make an RDBMS an RDBMS (ACID compliance, transactions, etc.), there will always be an additional performance hit compared to a NoSQL database. If an RDBMS removes these features, then it is no longer an RDBMS!
This answer illustrates how MongoDB achieves better performance because it doesn't need to support RDBMS features :
https://softwareengineering.stackexchange.com/questions/54373/when-would-someone-use-mongodb-or-similar-over-a-relational-dbms
- MongoDB has a lower latency per query & spends less CPU time per query because it is doing a lot less work (e.g. no joins, transactions). As a result, it can handle a higher load in terms of queries per second and is thus often used if you have a massive # of users.
- MongoDB is easier to shard (use in a cluster) because it doesn't have to worry about transactions and consistency.
- MongoDB has a faster write speed because it does not have to worry about transactions or rollbacks (and thus does not have to worry about locking).
- MongoDB does not have a schema in case you have a special use case that can take advantage of that.
Another feature is sharding - sharding is easier with mongodb because it doesn't need to support many of the features which make an RDBMS an RDBMS, such as being ACID compliant. In contrast, sharding is complex for an RDBMS because an RDBMS must remain ACID compliant.
Take a look at the following two images:
The speed boat would outperform the "amphibious car" in the water 10/10 times. The amphibious car technically can navigate in water, but it wasn't designed to, hence it is much slower and unsuited for the purpose.
Likewise, look at the difference in aerodynamics between the speed boat and this sweet automobile. Even if you tacked wheels onto the boat, it's not going to perform as well as this car on land. (As an analogy, you could say that NoSQL databases don't do joins, so you have to implement them yourself. But will that perform better than an RDBMS for join-heavy operations?)
The point I'm making with the analogies, is that each kind of database was initially designed for a specific goal, and over time features have been added to try and make it solve problems it was not designed for (hence it doesn't do it as well as something specifically designed for that purpose).
Hence in your question, even if BigQuery or some RDBMS can do something, it doesn't mean that you should use them for the job. The same applies for NoSQL databases. You should use the best tool for the job.
Disclaimer: I don't have experience in MongoDB or CouchBase. My answer is based on BigQuery's capability on STRUCT.
Performance
BigQuery's STRUCT is optimized for querying. For example, if you run select a.nested_b.nested_c.nested_d from table_t, the query scans only the data for the leaf STRUCT field nested_d, so it is fast and cheap.
Usability
If your data is write-once or append-only, then a STRUCT column is comparable with a document store AFAIK.
But if you later want to update only a certain nested field, a nested STRUCT makes that pretty difficult, because there is no way to update a single item in a REPEATED field: you have to load the whole array, scan and change it, and repack it to update the column. You will be writing something like:
UPDATE table
SET Credits.Actors = (SELECT ARRAY_AGG(...) FROM UNNEST(Credits.Actors) WHERE ...)
WHERE ...
It may become a bigger problem when there is an array of structs of arrays (or even more nesting levels). Based on my understanding of document stores, updating a single nested field of a document should be easier than this. Basically, this is the price you pay for the performance benefit mentioned earlier.

Algorithms for key value pair, where key is string

I have a problem where there is a huge list of strings or phrases; it might scale from 100,000 to 100 million. When I search for a phrase and it is found, it gives me the ID or index into the database for further operations. I know a hash table can be used for this, but I am looking for another algorithm that could generate an index based on strings and could also be useful for other features like autocomplete.
I read about suffix trees/arrays in some SO threads; they serve the purpose but consume a lot more memory than I can afford. Are there any alternatives?
My search is only over a huge list of millions of strings: no docs, no web pages, and I'm not interested in a search engine like Lucene.
I also read about inverted indexes, which sound helpful, but which algorithm do I need to study for that?
If this Database index is within MS SQL Server you may get good results with SQL Full Text Indexing. Other SQL providers may have a similar function but I would not be able to help with those.
Check out: http://www.simple-talk.com/sql/learn-sql-server/understanding-full-text-indexing-in-sql-server/
and
http://msdn.microsoft.com/en-us/library/ms142571.aspx
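For the inverted-index part of the question, a small in-memory sketch in Python may help show the idea behind word-based full-text indexes. It is not SQL Server's implementation; the sample phrases and function names are invented for illustration. Each word maps to the ids of the phrases containing it, and a sorted vocabulary gives prefix lookups for autocomplete.

```python
import bisect
from collections import defaultdict

# Hypothetical sample data; a phrase's id is its position in the list.
phrases = ["new york city", "york minster", "new delhi", "city of london"]

# Inverted index: each word maps to the ids of the phrases that contain it.
inverted = defaultdict(set)
for phrase_id, phrase in enumerate(phrases):
    for word in phrase.split():
        inverted[word].add(phrase_id)

# Sorted vocabulary, used for prefix lookups (autocomplete).
vocabulary = sorted(inverted)


def search(query):
    """Return ids of phrases containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    return sorted(set.intersection(*(inverted.get(w, set()) for w in words)))


def autocomplete(prefix):
    """Return vocabulary words starting with prefix, found by binary search."""
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_right(vocabulary, prefix + "\uffff")
    return vocabulary[lo:hi]


print(search("new york"))  # [0]
print(autocomplete("c"))   # ['city']
```

Memory grows with the number of distinct words plus one posting per word occurrence, which is usually far less than a suffix tree over the same text, at the cost of matching whole words rather than arbitrary substrings.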

full index checkbox while creating new database

I am creating a new database, which I am designing basically for logging/history purposes. I'll make around 8-10 tables in this database, which will keep the data, and I'll retrieve it to show history information to the user.
I am creating the database in SQL Server 2005, and I can see that there is a check box labelled " Use full Indexing". I am not sure whether I should check it or leave it unchecked. As I am not too familiar with databases, please tell me: will checking it improve the retrieval performance of my database?
I think that is the check box for FULLTEXT indexing.
You turn it on only if you plan to do some natural language queries or a lot of text-based queries.
See here for a description of what it is used to support.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
From that base link, you can follow through to http://msdn.microsoft.com/en-us/library/ms142547.aspx (amongst others). Interesting is this quote
Comparison of LIKE to Full-Text Search
In contrast to full-text search, the LIKE Transact-SQL predicate works
on character patterns only. Also, you cannot use the LIKE predicate to
query formatted binary data. Furthermore, a LIKE query against a large
amount of unstructured text data is much slower than an equivalent
full-text query against the same data. A LIKE query against millions
of rows of text data can take minutes to return; whereas a full-text
query can take only seconds or less against the same data, depending
on the number of rows that are returned.
There is a cost for this of course, which is the storage of the patterns and relationships between words in the same record. It is really useful if you are storing articles, for example, where you want to enable searching by "contains a, b and c". A LIKE pattern for that would be complicated and extremely slow to process, like LIKE '%A%B%C%' OR LIKE '%B%A%C%' OR ... and all the permutations for the order of appearance of A, B and C.

database vs flat file, which is a faster structure for "regex" matching with many simultaneous requests

Which structure returns results faster and/or is less taxing on the host server: a flat file or a database (MySQL)?
Assume many users (100 users) are simultaneously querying the file/db.
Searches involve pattern matching against a static file/db.
The file has 50,000 unique lines (same data type).
There could be many matches.
There is no writing to the file/db, just reading.
Is it possible to have a duplicate of the file/db and write a logic switch to use the backup file/db if the main file is in use?
Which language is best for each type of structure? Perl for flat file and PHP for db?
Additional info:
Say I want to find all the cities that have the pattern "cis" in their names.
Which is better/faster, using regex or string functions?
Please recommend a strategy.
TIA
I am a huge fan of simple solutions, and thus prefer -- for simple tasks -- flat file storage. A relational DB with its indexing capabilities won't help you much with arbitrary regex patterns at all, and the filesystem's caching ensures that this rather small file is in memory anyway. I would go the flat file + perl route.
Edit: (taking your new information into account) If it's really just about finding a substring in one known attribute, then using a fulltext index (which a DB provides) will help you a bit (depending on the type of index applied) and might provide an easy and reasonably fast solution that fits your requirements. Of course, you could implement an index yourself on the file system, e.g. using a variation of a Suffix Tree, which is hard to beat speed-wise.
Still, I would go the flat file route (and if it fits your purpose, have a look at awk), because if you had started implementing it, you'd be finished already ;) Further, I suspect that the number of users you talk about won't make the system feel the difference (your CPU will be bored most of the time anyway).
If you are uncertain, just try it! Implement that regex+perl solution, it takes a few minutes if you know perl, loop 100 times and measure with time. If it is sufficiently fast, use it, if not, consider another solution. You have to keep in mind that your 50,000 unique lines are really a low number in terms of modern computing. (compare with this: Optimizing Mysql Table Indexing for Substring Queries )
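The "just try it" measurement could look roughly like this. It is sketched in Python rather than Perl, and the 50,000 lines are random synthetic strings rather than real city names, so the timings only indicate the order of magnitude on your own machine.

```python
import random
import re
import string
import time

random.seed(0)
# 50,000 synthetic "city names" (random lowercase strings, purely illustrative).
lines = ["".join(random.choices(string.ascii_lowercase, k=10)) for _ in range(50_000)]

pattern = re.compile("cis")


def find_with_regex():
    """Scan every line with a compiled regular expression."""
    return [line for line in lines if pattern.search(line)]


def find_with_substring():
    """Scan every line with a plain substring test."""
    return [line for line in lines if "cis" in line]


def time_it(func, repeats=100):
    """Run func `repeats` times and return (last result, elapsed seconds)."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = func()
    return result, time.perf_counter() - start


regex_hits, regex_secs = time_it(find_with_regex)
substr_hits, substr_secs = time_it(find_with_substring)
print(f"regex:     {len(regex_hits)} matches, {regex_secs:.3f}s for 100 passes")
print(f"substring: {len(substr_hits)} matches, {substr_secs:.3f}s for 100 passes")
```

Both functions return the same matches for a fixed pattern like "cis", so the comparison is purely about speed; a regex only becomes necessary once the pattern is more than a literal substring.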
HTH,
alexander
Depending on what your queries and your data look like, a full-text search engine like Lucene or Sphinx could be a good idea.

Resources