Fileserver vs DB query speed

I have very simple data that I need to retrieve as quickly as possible:
I have JSON data that is associated with a hash of an email address. The table looks like this:
email_sha256, json
and has millions of rows.
I was wondering if one of the following two options would be faster:
1. Split the single large table into many smaller ones (split by alphabetical order).
2. Do not use a DB at all and serve the data as files, i.e. every email hash is the name of a separate file that contains the JSON data.

Creating a file for each user (for each email address) seems wrong in many respects:
If you need good performance, you need to keep the number of files per directory small.
Databases were created for exactly this: an index lets you retrieve the information very quickly.
Without a DB you need your own lock/synchronization mechanism.
If you are using a DB, why use JSON to store the data?
If you are looking for performance, do not serialize the data to JSON.
What do you mean by "fast"? Can you quantify that duration/delay?
Unless (maybe) the information associated with each user is huge (much larger than a single disk sector). But again, in that case, what do you mean by fast?
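To make the indexed-lookup point concrete, here is a minimal sketch using SQLite from Python; the table and column names follow the question, everything else (file name, helper names) is illustrative.

# Minimal sketch: keyed JSON lookup by email hash in SQLite (helper names are made up).
import hashlib
import json
import sqlite3

conn = sqlite3.connect("emails.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS emails ("
    "  email_sha256 TEXT PRIMARY KEY,"  # the primary key is indexed, so lookups stay fast at millions of rows
    "  json TEXT NOT NULL)"
)

def put_record(email, data):
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO emails (email_sha256, json) VALUES (?, ?)",
        (digest, json.dumps(data)),
    )
    conn.commit()

def get_record(email):
    digest = hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT json FROM emails WHERE email_sha256 = ?", (digest,)
    ).fetchone()
    return json.loads(row[0]) if row else None

A single table with a primary key on email_sha256 should comfortably handle millions of rows for this access pattern; neither splitting the table alphabetically nor one file per hash should be necessary.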

Related

How to store datapoints in the magnitudes of 1+ trillion?

So, I have astronomical spectroscopy data in the following format:
{
    "molecule": "CO2",
    "blahblah": "...",
    ...5 more simple fields...
    "arrayofvalues": [lengths can go up to 2 million]
}
Of this data, I have 600,000 files, which means there are on the order of 1 trillion individual datapoints that I want to search through and do computations with.
Can someone please direct me to a resource, maybe on big data or BigQuery, on how I can efficiently look up this data for computations and graphing? For example, I want to search for certain molecules under certain conditions and see what data they show.
I want to make a website where people can pick some variables and a value range and get graphical or textual data.
I tried to put some of this data into PostgreSQL, but when I do a GET request (even with just 5 files stored) it crashes Postman, because it's too much data.
Without knowing more details, you can take advantage of the data modeling options available in BigQuery, such as:
nested data
arrays and structs
partitioned tables
clustering
Take a look at the data types: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
And also the partitioning and clustering techniques.
https://towardsdatascience.com/how-to-use-partitions-and-clusters-in-bigquery-using-sql-ccf84c89dd65?gi=cd1bc7f704cc
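As a rough illustration of how those pieces fit together, here is a sketch using the BigQuery Python client; the dataset, table, and field names (including the date column used for partitioning) are assumptions for illustration, not part of the original question.

# Sketch only: one row per input file, simple fields as columns, the value array kept nested.
from google.cloud import bigquery

client = bigquery.Client()

# Partitioning by a date-like column and clustering by molecule means a query that
# filters on molecule and a date range only scans the matching blocks.
client.query("""
CREATE TABLE IF NOT EXISTS spectra.measurements (
  molecule      STRING,
  temperature   FLOAT64,
  observed_at   DATE,
  arrayofvalues ARRAY<FLOAT64>
)
PARTITION BY observed_at
CLUSTER BY molecule
""").result()

# Example lookup: aggregate over the nested array for one molecule and date range.
rows = client.query("""
SELECT observed_at, AVG(v) AS mean_value
FROM spectra.measurements, UNNEST(arrayofvalues) AS v
WHERE molecule = 'CO2'
  AND observed_at BETWEEN '2020-01-01' AND '2020-12-31'
GROUP BY observed_at
""").result()

for row in rows:
    print(row.observed_at, row.mean_value)

A website can then issue these aggregate queries directly instead of shipping whole files to the client, which is what was overwhelming Postman.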

How to store write-once, read-rarely data

I have an application which produces a large amount of data, that is all written once and then unchangeable (by law), and is rarely ever read. When it is read, it is always read in its entirety, as in, all the data for 2012 is read in one shot, and either processed for reporting or output in a different format for export (or gasp printed). The only way to access the data is to access an entire day's worth of data, or more than one day.
This data is easily represented as either two or three relational tables, or as a long list of self-contained documents.
What is the most storage-space-efficient way to store such data in a file system? Specifically, we're thinking of using Amazon S3 (object storage), though we could use something like RDS (their managed MySQL offering).
My current best bet is a gzipped file with JSON data for the entire day, one file per day.
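For what it's worth, a minimal sketch of that gzipped-file-per-day idea, assuming S3 via boto3; the bucket name, key layout, and record shape are invented for illustration.

# Rough sketch: write one day's records as a single gzipped JSON Lines object (names are made up).
import gzip
import io
import json
import boto3

def archive_day(records, day, bucket="my-archive-bucket"):
    """Write all of one day's records (e.g. day='2012-06-01') as one compressed object."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for record in records:
            gz.write((json.dumps(record) + "\n").encode("utf-8"))
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key="archive/{}.json.gz".format(day),
        Body=buf.getvalue(),
    )

Reading a whole year back is then just listing the keys under archive/ for that year and streaming each object through gzip, which fits the read-everything-at-once access pattern.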
Unless my data was pure ASCII (and even if it was), I would probably choose a binary storage method like one of
BSON
Protocol Buffers
Bencode
I would use Windows Azure's Table Storage because it allows heterogeneous structured data to be stored in a single table. Having database-like storage will allow you to append data as needed. You can easily create a new table for each year.

Another database table or a JSON object

I have two tables: stores and users. Every user is assigned to a store. I thought, "What if I could just save all the users assigned to a store as a JSON object and save that JSON object in a field of the store?" In other words, the users' data would be stored in a field instead of its own table. There will be around 10 people per store. I would like to know which method will require the least amount of processing for the server.
Most databases are relational, meaning there's no reason to be putting multiple different fields in one column. Besides being more work for you having to put them together and take them apart, you'd be basically ignoring the strength of the database.
If you were ever to try to access the data from another app, you'd have to make yourself go through additional steps. It also limits sorting and greatly adds to your querying difficulties (i.e. can't say where field = value because one field contains many values)
In your specific example, if the users at a store change, rather than being able to do a very efficient delete from the users table (or modify which store they're assigned to) you'd have to fetch the data and edit it, which would double your overhead.
Joins exist for a reason, and they are efficient. So, don't fear them!
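As a small illustration of what the join-based design looks like in practice, here is a sketch using SQLite; the table and column names are made up.

# Sketch of the two-table, foreign-key design (names are illustrative).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores (
  id   INTEGER PRIMARY KEY,
  name TEXT NOT NULL
);
CREATE TABLE users (
  id       INTEGER PRIMARY KEY,
  name     TEXT NOT NULL,
  store_id INTEGER NOT NULL REFERENCES stores(id)
);
CREATE INDEX idx_users_store ON users(store_id);
""")

# Listing a store's ~10 users is a cheap indexed join, and reassigning a user is a
# one-row UPDATE instead of fetching, editing, and rewriting a JSON blob.
users_at_store = conn.execute("""
    SELECT u.name
    FROM users u
    JOIN stores s ON s.id = u.store_id
    WHERE s.name = ?
""", ("Downtown",)).fetchall()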

Reading HTML data from database is slow? Need a better approach?

We have an 18 GB MySQL table with a column "html_view" that stores HTML source data, which we display on the page. Fetching the HTML from the "html_view" column now takes too long, which makes the page load slowly.
We want an approach that simplifies our existing structure so the HTML data loads faster, from the DB or in any other way.
One idea we are considering is to store the HTML data in .txt files, keep only the path of the file in the DB, and fetch the data by reading that file. But we fear this will cause extensive read/write operations on our server and may slow the server down.
Is there a better approach for making this faster?
First of all, why store HTML in the database at all? Why not render it on demand?
For big text tables, you could store the compressed text in a byte array, or compressed and base64-encoded as plain text.
When you have a table with a large text column, how many other columns does it have? If there are not too many, you could partition the table and create a two-column key-value store. That should be faster and simpler than reading files from disk.
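A rough sketch of that compressed key-value idea, shown with SQLite for brevity (a MySQL BLOB/LONGBLOB column would work the same way); the table and column names are invented.

# Store compressed HTML in a two-column key-value table (names are made up).
import sqlite3
import zlib

conn = sqlite3.connect("pages.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS html_store ("
    "  page_id INTEGER PRIMARY KEY,"
    "  html_gz BLOB NOT NULL"  # compressed HTML is usually a fraction of the raw size
    ")"
)

def save_page(page_id, html):
    conn.execute(
        "INSERT OR REPLACE INTO html_store (page_id, html_gz) VALUES (?, ?)",
        (page_id, zlib.compress(html.encode("utf-8"))),
    )
    conn.commit()

def load_page(page_id):
    row = conn.execute(
        "SELECT html_gz FROM html_store WHERE page_id = ?", (page_id,)
    ).fetchone()
    return zlib.decompress(row[0]).decode("utf-8") if row else None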
Have a look at the Apache caching guide.
It explains disk and memory caching. From my point of view, if the content is static (as the database table suggests), you should use Apache's capabilities instead of writing your own slower mechanisms, because you would be adding multiple layers on top.
The usual "measure instead of estimating" still applies, though ;-).

Storing email messages in a database

What sort of database schema would you use to store email messages, with as much header information as practical/possible, into a database?
Assume that they have been fed into a script from the MTA and parsed into the relevant headers/body/attachments.
Would you store the message body whole in the database table, or split any MIME-parts apart? What about attachments?
You may want to check the architecture and the DB schema of "Archiveopteryx".
You may want to use a schema where the message body and attachment records can be shared between multiple recipients on the message. It's not uncommon to see email servers where fully 50% of the disk storage is used by duplicate emails.
A simple hash of the body/attachment would be enough to see if that record was already in the database. However, you would still need to keep separate headers.
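A minimal sketch of that hash-based deduplication, assuming SQLite; the table layout and helper name are illustrative, not a prescribed schema.

# Store each distinct body/attachment once, keyed by its content hash (schema is made up).
import hashlib
import sqlite3

conn = sqlite3.connect("mail.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS blobs (
  sha256  TEXT PRIMARY KEY,   -- content hash identifies the stored part
  content BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS message_parts (
  message_id  INTEGER NOT NULL,
  blob_sha256 TEXT NOT NULL REFERENCES blobs(sha256)
);
""")

def store_part(message_id, content):
    """Insert a body or attachment once; duplicate content is only referenced, not re-stored."""
    digest = hashlib.sha256(content).hexdigest()
    conn.execute("INSERT OR IGNORE INTO blobs (sha256, content) VALUES (?, ?)",
                 (digest, content))
    conn.execute("INSERT INTO message_parts (message_id, blob_sha256) VALUES (?, ?)",
                 (message_id, digest))
    conn.commit()
    return digest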
Depends on what you're going to be doing with it. If you're going to need to do frequent searching against certain bits of it, you'll want to break it up in a way that makes sense for your use case. If it's just for something like storage of e-mail for Sarbanes-Oxley compliance, you'd probably be okay storing the whole thing - headers, parts, etc. - as one big text field.
Suggestion: create a well-defined table for storing e-mail, with a column for each relevant part of a message: sender, header, subject, body. It is going to be much simpler later if you want to query, for example, by the subject field. In the same table you can define a field to keep the path of an attachment and store the attached file on the file system, rather than storing it in blob fields.
An important step in database schema design is to figure out what types of entity you want to model. For this application the entities might be:
Messages
E-mail addresses
Conversation threads (perhaps: if you want to do efficient threading)
Attachments (perhaps: as suggested in other answers)
...
Once you know the entities, you can identify relationships between entities, which can be represented by tables:
Messages have a many-many relationship to messages (In-Reply-To and References headers).
Messages have a many-many relationship to e-mail addresses (From, To, Cc etc headers).
Messages have a many-one relationship with threads.
Messages have a many-many relationship with attachments.
...
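One way those entities and relationships could be laid out, as a non-definitive sketch in SQLite syntax; every table and column name below is illustrative.

# Possible tables for the entities and many-to-many relationships above (illustrative only).
import sqlite3

conn = sqlite3.connect("mailstore.db")
conn.executescript("""
CREATE TABLE threads (
  id INTEGER PRIMARY KEY
);
CREATE TABLE messages (
  id          INTEGER PRIMARY KEY,
  message_id  TEXT UNIQUE,                      -- RFC 5322 Message-ID header
  thread_id   INTEGER REFERENCES threads(id),   -- many messages to one thread
  subject     TEXT,
  body        TEXT,
  raw_headers TEXT
);
CREATE TABLE addresses (
  id    INTEGER PRIMARY KEY,
  email TEXT UNIQUE NOT NULL
);
-- many-to-many: which address appears on which message, and in what role
CREATE TABLE message_addresses (
  message_id INTEGER NOT NULL REFERENCES messages(id),
  address_id INTEGER NOT NULL REFERENCES addresses(id),
  role       TEXT NOT NULL                      -- 'from', 'to', 'cc', ...
);
-- many-to-many: In-Reply-To / References links between messages
CREATE TABLE message_references (
  message_id    INTEGER NOT NULL REFERENCES messages(id),
  referenced_id INTEGER NOT NULL REFERENCES messages(id)
);
CREATE TABLE attachments (
  id       INTEGER PRIMARY KEY,
  filename TEXT,
  content  BLOB
);
-- many-to-many: the same attachment can hang off several messages
CREATE TABLE message_attachments (
  message_id    INTEGER NOT NULL REFERENCES messages(id),
  attachment_id INTEGER NOT NULL REFERENCES attachments(id)
);
""")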
It all depends on what you want to do with the data, but in general I would want to store all of the data and also make sure that the semantics interpreted by the MUA are preserved in the DB, so for example:
- All headers that are parsed should have their own column
- A column should contain the complete raw headers
- The attachments (including body/multipart parts) should be in a many-to-one table with the email table.
You'll probably want to at least store attachments separately to optimize storage. It's astonishing to see the size and quantity of attachments (videos, etc.) that most users unhesitatingly attach to emails.
In the case of outgoing emails you may have multiple emails sending the same attachment. It's far more efficient to store a single copy of the attachment that is referenced by all emails that share it.
Another reason for storing attachments separately is that it gives you some archiving options later on. Should storage space become an issue, you can always go back and delete large attachments older than a given date in order to compact the database.
If it is already split up, and you can be sure that the routine that splits the data is sound, then I would split the table up as granularly as possible. You can always parse it back together in your middle tier. If space is not an issue, you could always store it twice: once split up into the relevant fields, and once more in a field that holds the whole thing as one blob, in case putting it back together is hard.
