When using a DBM database (e.g. Berkeley or GDBM), is it better to store data using fewer long strings or more short strings? I can easily structure my data either way. I'm looking for 'better' in the performance sense, but I'm interested in other implications as well.
Berkeley DB, or any other DBM implementation, will incur overhead for each key/value pair. If you're dealing with millions of k/v pairs, the overhead will matter; otherwise it's noise, and you should choose whatever is easiest for you, the programmer, and let the database deal with the data. Overhead and access time will also depend on the access method. Hash tables and B-Trees are totally different algorithmic animals. If your data has any degree of key ordering, or access patterns that depend on keys, then 99% of the time B-Trees are the way to go.
I think you're asking a great design question, but for anyone to give you a perfect answer we'd all have to know a lot more about the amount of data you're dealing with, access patterns, and many other factors.
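To make the per-pair trade-off concrete, here is a minimal sketch of the two layouts. It assumes Python's standard-library `dbm.dumb` module as a stand-in for Berkeley DB or GDBM, and the key scheme (`user_id:field`), the helper names, and the sample record are all hypothetical.

```python
import dbm.dumb
import json
import os
import tempfile


def store_many_short(db, user_id, record):
    """Layout A: one key/value pair per field (many short strings)."""
    for field, value in record.items():
        db[f"{user_id}:{field}"] = value


def store_one_long(db, user_id, record):
    """Layout B: one key/value pair per record (a single longer string)."""
    db[user_id] = json.dumps(record)


# Hypothetical sample data, only for illustration.
record = {"name": "Ada", "email": "ada@example.com", "city": "London"}

with tempfile.TemporaryDirectory() as tmp:
    with dbm.dumb.open(os.path.join(tmp, "short"), "c") as short_db:
        store_many_short(short_db, "u1", record)
        short_count = len(short_db)           # pairs the database must track
        city = short_db["u1:city"].decode()   # one field, one lookup

    with dbm.dumb.open(os.path.join(tmp, "long"), "c") as long_db:
        store_one_long(long_db, "u1", record)
        long_count = len(long_db)
        # Reading a single field means fetching and parsing the whole record.
        city_from_long = json.loads(long_db["u1"])["city"]
```

Layout A pays the per-pair overhead once per field but can read or rewrite one field on its own; layout B pays it once per record but has to deserialize and reserialize the whole string for any change.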
If you will be frequently searching or modifying the data, a greater number of short strings will provide better performance.
That is, you don't want to be frequently searching for a substring of one of those long strings, or modifying some value in the middle of a string.
I think this question is really hard to answer in a completely generic way. There are so many variables here that you would really need to test some common scenarios to determine the answer that is best for you.
Some factors to consider:
Will larger strings require substring searches?
What kind of searches will you perform over the data?
In the end, it's generally better to go with the approach that yields the most normalized schema. Optimization can start from there, and depending upon your DB, there are probably better alternatives than restructuring the underlying schema purely for performance.
Apologies if this is a beginner question. I’m building a text-to-speech model. I was wondering if my training dataset should be “realistically” distributed (i.e. same distribution as the data it will be used on), or should it be uniformly distributed to make sure it performs well on all kinds of sentences. Thanks.
I’d say that this depends on the dataset size. If you have a really, really small dataset, which is common in some domains and rare in others, then you’d want to ensure that all the “important kinds of data” (whatever that means for your task) would be represented there even if they’re relatively rare, but a realistic distribution is better if you have a large enough dataset that all the key scenarios would be adequately represented anyway.
Also, if mistakes on certain data items are more important than others (which is likely in some domains), then it may make sense to overrepresent them, as you’re not optimizing for the average case of the real distribution.
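One way to overrepresent the important items is weighted sampling when building training batches. The following is only a sketch using the standard library: the `oversample` helper, the "common"/"rare" labels, the example sentences, and the weight of 9.0 are all made-up assumptions, not values from any real text-to-speech setup.

```python
import random
from collections import Counter


def oversample(items, weight_of, k, seed=0):
    """Draw k items with replacement, proportionally to weight_of(item)."""
    rng = random.Random(seed)
    weights = [weight_of(item) for item in items]
    return rng.choices(items, weights=weights, k=k)


# Hypothetical dataset: 90% everyday sentences, 10% high-stakes ones.
data = [("common", "hello there")] * 90 + [("rare", "take 5 mg daily")] * 10

# Weighting each rare item 9x makes both groups carry equal total weight.
sample = oversample(data, lambda item: 9.0 if item[0] == "rare" else 1.0, k=10_000)
counts = Counter(kind for kind, _ in sample)
```

With these weights the rare sentences make up about half of the drawn sample instead of a tenth, so the model is no longer optimized for the average case of the real distribution.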
There’s also the case of the targeted annotation where you look at the errors your model is making and specifically annotate extra data to overrepresent those cases - because there are scenarios where some types of data happen to be both very common and trivial to solve, so adding extra training data for them takes effort but doesn’t change the results in any way.
I've seen many debates and articles about whether auto-incrementing integers or UUIDs should be used for IDs in a database.
They list some pros and cons of both the integer and the UUID.
For example,
integer: fast, but available size is limited (unless you use bigint)
uuid: very unique and much more secure, but slow and storage-consuming
Then I wondered whether random strings of about 10 characters (varchar(10)), made up of upper- and lower-case letters and digits, would solve these problems, because they are not very big and can cover a wide range of values (62^10 possibilities for 10 characters).
So my question is: is it good or bad to do that?
There is no absolute bad or good when it comes to database design. You should design your database based on your needs.
You mentioned some pros and cons of using int and uuid, so I recommend that you now list your needs and choose between them on that basis.
Also keep in mind that you can use some tricks to get around the limits of both ints and uuids.
For example, if uuid seems the right option for you but the speed of looking them up in the database is bothering you, then you can simply index the column to speed up uuid lookups. And if you have many writes and need them to be fast, you can use pre-generated uuids (generate some uuids, index them, and pick one up each time you need one).
And for ints, you can use two ints that together form the id, or some other mathematical scheme that makes the id a little more secure while staying fast enough.
These are just two examples of how you can optimize your system so that it is fast enough and still meets your needs in the best way possible.
As for whether it is okay to use both ints and uuids in the same database design: it is completely fine if that is the best way to both satisfy your needs and get the best performance.
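For the random 10-character ids raised in the question, the size of the id space and the chance of a collision can be estimated with a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, ids are drawn uniformly with Python's `secrets` module, and the collision estimate uses the standard birthday-bound approximation rather than an exact count.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 26 + 26 + 10 = 62 symbols


def random_id(length=10):
    """Return a random id drawn uniformly from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(n_ids, length=10):
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2N))."""
    space = len(ALPHABET) ** length
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


space = len(ALPHABET) ** 10               # 62^10, on the order of 8.4e17
p_million = collision_probability(10**6)  # well under one in a million
p_billion = collision_probability(10**9)  # roughly 45%
```

The id space is large but not collision-proof: around a billion rows, a duplicate somewhere becomes likely, so the column would still need a unique constraint and a retry on conflict.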
So I'm making a database for a personal project just to get more than my feet wet with PostgreSQL and certain languages and applications that can use a PostgreSQL database.
I've come to the realization that using an array isn't necessarily even compliant (Arrays are not atomic, right?) with 1NF. So my question is: Is there a lack of efficiency or data safety this way? Should I learn early to not use arrays?
Short answer to the title: No
A bit longer answer:
You should learn to use arrays when appropriate. Arrays are not bad design in themselves; they are as atomic as a character varying field (an array of characters, no?), and they exist to make our lives easier and our databases faster and lighter. There are issues concerning portability (most database systems don't support arrays, or do so in a different way than Postgres).
Example:
You have a blog with posts and tags, and each post may have 0 or more tags. The first thing that comes to mind is to make a different table with two columns postid and tagid and assign the tags in that table.
If we need to search through posts with tagid, then the extra table is necessary (with appropriate indexes of course).
But if we only want the tag information to be shown as the post's extra info, then we can easily add an integer array column to the posts table and extract the information from there. This can still be done with the extra table, but using an array reduces the size of the database (no extra tables or rows needed), simplifies our select queries by joining one fewer table, and seems easier for a human to understand (the last part is in the eye of the beholder, but I think I speak for a majority here). If our tags are preloaded, then not even one join is necessary.
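The separate-table variant, which is the one needed when searching posts by tagid, can be sketched as follows. This uses Python's built-in `sqlite3` purely as a runnable stand-in: SQLite has no array type, so the PostgreSQL array-column variant is not shown, and every table name, column name other than postid and tagid, and sample row here is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (postid INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tags  (tagid  INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    -- The extra table: one row per (post, tag) assignment.
    CREATE TABLE post_tags (
        postid INTEGER NOT NULL REFERENCES posts(postid),
        tagid  INTEGER NOT NULL REFERENCES tags(tagid),
        PRIMARY KEY (postid, tagid)
    );
    -- Index that makes "find posts by tag" cheap.
    CREATE INDEX post_tags_by_tag ON post_tags (tagid);
""")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "Intro"), (2, "Arrays"), (3, "Untagged")])
conn.executemany("INSERT INTO tags VALUES (?, ?)",
                 [(10, "sql"), (11, "postgres")])
conn.executemany("INSERT INTO post_tags VALUES (?, ?)",
                 [(1, 10), (2, 10), (2, 11)])


def posts_with_tag(conn, tagid):
    """Return titles of all posts carrying the given tag, via one join."""
    rows = conn.execute(
        "SELECT p.title FROM posts p "
        "JOIN post_tags pt ON pt.postid = p.postid "
        "WHERE pt.tagid = ? ORDER BY p.postid",
        (tagid,),
    )
    return [title for (title,) in rows]
```

The join and the extra rows are the cost the array column avoids; the indexed lookup by tagid and the foreign keys are what it gives up.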
The example may be poor but it's the first that came to mind.
Conclusion:
Arrays are not necessary. They can be harmful if you use them wrong. You can live without them and have a great, fast and optimized database. When you are considering portability (e.g. rewriting your system to work with other databases) then you must not use arrays.
If you are sure you'll stick with Postgres, then you can safely use arrays where you find appropriate. They exist for a reason and are neither bad design nor non-compliant. When you use them in the right places, they can help a little with simplicity of database structures and your code, as well as space and speed optimization. That is all.
Whether an array is atomic depends on what you're interested in. If you generally want the whole array then it's atomic. If you are more interested in the individual elements then it is being used as structure. A text field is basically a list of characters. However, we're usually interested in the whole string.
Now - from a practical viewpoint, many frameworks and ORMs don't automatically unpack PostgreSQL's array types. Also, if you want to port the database to e.g. MySQL then you'll
Likewise foreign-key constraints can't be added to an array (EDIT: this is still true as of 2021).
Short answer: Yes, it is bad design. Using arrays will guarantee that your design is not 1NF, because to be 1NF there must be no repeating groups. Proper design is unequivocal: make another table for the array's values and join when you need them all.
Arrays may be the right tool for the job in certain limited circumstances, but I would still try hard to avoid them. They're a feature of last resort.
The biggest problem with arrays is that they're a crutch. You know them already and you want to use them because they're familiar to you. But they do not work quite like you expect, and they will only allow you to postpone a true understanding of SQL and relational databases. You're much better off waiting until you're forced to use them than learning them and looking for opportunities to rely on them.
I believe arrays are a useful and appropriate design in cases where you're working with array-like data and want to use the power of SQL for efficient queries and analysis. For example, I've begun using PostgreSQL arrays regularly for data science purposes, as well as in PostGIS for edge cases.
In addition to the well-explained challenges mentioned above, I'm finding the biggest problem in getting third-party client apps to be able to handle the array fields in ways I'd expect. In Tableau and QGIS, for example, arrays are treated as strings, so array operations are unavailable.
Arrays are a first class data type in the SQL standard, and generally allow for a simpler schema and more efficient queries. Arrays, in general, are a great data type. If your implementation is self-contained, and doesn't need to rely on third-party tools without an API or some other middleware that can deal with incompatibilities, then use the array field.
If, however, you interface with third-party software that directly queries the DB, and arrays are used to produce queries, then I'd avoid them in favor of simpler lookup tables and other traditional relational approaches.
Most triple stores I read about are said to be scalable to around 0.5 billion triples.
I am interested to know if people think there is a theoretical reason why they have to have an upper limit, and whether you know of any particular ways to make them more scalable.
I am curious to know if existing triple stores do things like this:
Represent URIs with integers
Integers in order
Search the integers instead of the URIs which I would imagine must be faster (because you can do things like a binary search etc.)
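The three ideas listed above can be sketched in a few lines. This is an illustration only, not how any particular triple store is implemented: the `TripleIndex` class, its method names, and the `ex:` URIs are hypothetical, and the sketch keeps a single subject-predicate-object ordering in memory.

```python
from bisect import bisect_left


class TripleIndex:
    """Toy index: URIs mapped to integers, triples kept sorted for binary search."""

    def __init__(self):
        self._ids = {}      # URI -> integer id
        self._uris = []     # integer id -> URI
        self._triples = []  # sorted list of (subject, predicate, object) ids

    def _encode(self, uri):
        """Assign the next free integer to a URI the first time it is seen."""
        if uri not in self._ids:
            self._ids[uri] = len(self._uris)
            self._uris.append(uri)
        return self._ids[uri]

    def load(self, triples):
        """Encode every term, drop duplicates, and sort the integer triples."""
        encoded = {tuple(self._encode(term) for term in t) for t in triples}
        self._triples = sorted(encoded)

    def objects(self, subject, predicate):
        """Return all objects for (subject, predicate) via binary search."""
        s = self._ids.get(subject)
        p = self._ids.get(predicate)
        if s is None or p is None:
            return []
        # Ids start at 0, so (s, p, -1) sorts just before the first match.
        i = bisect_left(self._triples, (s, p, -1))
        result = []
        while i < len(self._triples) and self._triples[i][:2] == (s, p):
            result.append(self._uris[self._triples[i][2]])
            i += 1
        return result


index = TripleIndex()
index.load([
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:knows", "ex:alice"),
])
friends = index.objects("ex:alice", "ex:knows")
```

Comparing small integer tuples is cheaper than comparing long URI strings, and the sorted order is what makes the binary search possible. A single ordering only answers lookups that start from the subject; other query shapes would need additional orderings.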
Thoughts ...
Just to get to 500 million, a triple store has to do all of that and more. I have spent several years working on a triple store implementation, and I can tell you that breaking 1 billion triples is not as simple as it may seem.
The problem is that many RDF queries are 2nd or 3rd order (and higher orders are far from unheard of). This means that you are not only querying a set of entities, but simultaneously the data about the set of entities, data about the entities' schemas, and data describing the schema language used to describe the entities' schemas.
All of this without any of the constraints available to a relational database to allow it to make assumptions about the shape of this data/metadata/metametadata/etc.
There are ways to get beyond 500 million, but they are far from trivial, and the low-hanging fruit (i.e. the approaches you have mentioned) was required just to get to where we are now.
That being said, the flexibility provided by an RDF store, combined with a denotational semantics available via its interpretation in Description Logics, makes it all worthwhile.
Imagine you're dealing with many strings of text that are about 10,000 characters long entered by users. Would it be more efficient to write those automatically onto pages or input them onto a table in a database? I hope that question is clear enough...
It depends on what sort of "efficiency" you're aiming for.
Here's what I mean:
will you be changing the content of your text strings?
what sorts of searches will you be doing?
when you extract the text, what do you do with it?
My opinion is that provided you're not going to change the content much, nor perform much analysis, you're better off with the database.
10k isn't particularly large, so either is fine. I would personally use the database, as it will allow you to easily search through the text.
It depends on how you're accessing them, but normally using the FS would result in better performance. That's for the obvious reason that the DB is another layer built on top of the FS. Using the FS directly would save you the DBMS operations, assuming it doesn't require extra heavy processing of its own (for example, use 100s of named files rather than one big bloated file that is ordered in a special way and has to be parsed).
I'm wondering if SQLite would be the best of both worlds, or at least, the best database for that size of job.
The real answer here depends on what you're going to do with these strings.
Databases are meant to be able to quickly return specific records. If you're just going to SELECT * FROM Table and then concat it all together, there's no point in using a database.
However, if you have a relation between your data that you want to be able to search, then a database will likely be more efficient.
E.g., do you want to be able to pull up all the text records from a set of users on a set of dates? Find all records from users who match some records?
A database will likely handle these kinds of loads more efficiently than a naive file-based implementation, and will probably still be faster than a decent one, even though the file-based approach avoids some access layers.
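A query of that first kind might look like the sketch below, using Python's built-in `sqlite3`. The `entries` table, its columns, the index, the `entries_for` helper, and the sample rows are all hypothetical; dates are stored as ISO strings so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entries (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        created TEXT    NOT NULL,  -- ISO date, e.g. 2024-01-05
        body    TEXT    NOT NULL   -- the ~10,000-character text
    )
""")
# Composite index so the user/date filter does not scan every row.
conn.execute("CREATE INDEX entries_by_user_date ON entries (user_id, created)")
conn.executemany(
    "INSERT INTO entries (user_id, created, body) VALUES (?, ?, ?)",
    [
        (1, "2024-01-05", "a" * 10_000),
        (2, "2024-01-06", "b" * 10_000),
        (1, "2024-02-01", "c" * 10_000),
        (3, "2024-01-07", "d" * 10_000),
    ],
)


def entries_for(conn, user_ids, start, end):
    """Return (user_id, created, body length) for the users and date range."""
    placeholders = ",".join("?" for _ in user_ids)
    query = (
        "SELECT user_id, created, length(body) FROM entries "
        f"WHERE user_id IN ({placeholders}) AND created BETWEEN ? AND ? "
        "ORDER BY created"
    )
    return conn.execute(query, [*user_ids, start, end]).fetchall()


january = entries_for(conn, [1, 2], "2024-01-01", "2024-01-31")
```

Doing the same with one file per text would mean maintaining your own mapping from users and dates to file names, which is the bookkeeping the index does here.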
There are a lot of considerations. As others said - either approach would work fine for a small number of 10k rows (thousands).
But what does the rest of your app do? If it does everything in the database, then I'd be inclined to put this there as well; the opposite is true as well.
And how will you be selecting these? Do you need to do complex text searches? If so, a database might not be the best. Or, would you be adding new attributes, searching on those attributes - or matching them against data in other tables? In this common case a database would be better.
And if your data is really vast (many millions of 10k rows) and your performance requirements aren't terribly high - you may want to compress them and store them in the file system.
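Compressing each text before writing it to the file system takes only a few lines. This is a minimal sketch with the standard library; the helper names, the `.txt.gz` naming scheme, and the sample text are assumptions for illustration.

```python
import gzip
import tempfile
from pathlib import Path


def save_text(directory, text_id, text):
    """Write one gzip-compressed file per text and return its path."""
    path = Path(directory) / f"{text_id}.txt.gz"
    path.write_bytes(gzip.compress(text.encode("utf-8")))
    return path


def load_text(path):
    """Read a file written by save_text back into a string."""
    return gzip.decompress(Path(path).read_bytes()).decode("utf-8")


# Hypothetical sample of roughly 10,000 characters.
text = "The quick brown fox jumps over the lazy dog. " * 220

with tempfile.TemporaryDirectory() as tmp:
    path = save_text(tmp, "entry-1", text)
    compressed_size = path.stat().st_size
    restored = load_text(path)
```

The sample above is highly repetitive, so it compresses far better than real user text would; measure on your own data before counting on a particular ratio.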
Lastly, how important is data quality? Given the features of a good database it's much easier to guarantee good data quality with a database.