Storing a database of series data with associated metadata using pandas

My dataset is of the form of instances of series data, each with associated metadata. Similar to a CD collection where each CD track has metadata (artist, album, length, etc.) and a series of audio data. Or imagine a road condition survey dataset - each time a survey is conducted the metadata of location, date, time, operator, etc. is recorded, as well as some physical series data of the road condition for each unit length of road. The collection of surveys ({metadata, data} pairs) is the dataset.
I'd like to take advantage of pandas to help import, store, search and analyse that dataset. pandas does not have built-in support for this type of dataset, but many have tried to add it.
The typical solutions are:
- Adding metadata to a pandas DataFrame, but this is the wrong way around - I want a collection of metadata records each with associated data, not data with associated metadata.
- Casting the data into something that is valid as a DataFrame field and storing it as one of the metadata fields, but the casting process sacrifices a significant amount of the data's integrity.
- Using multiple indices to create a 3D DataFrame, but this imposes design details on your choice of index, which limits experimentation.
This sort of dataset is very common, and I see a lot of people trying to bend pandas to accommodate it. I wonder what the right approach is, or even if pandas is the right tool.

I now have a working solution, but since I haven't seen this method documented I wonder if there be dragons ahead.
My "database" is a pandas DataFrame that looks something like this:
|   | Description      | Time                | Length | data_uuid      |
|---|------------------|---------------------|--------|----------------|
| 0 | My first record  | 2017-03-09 11:00:00 | 502    | f7ee-11e6-b702 |
| 1 | My second record | 2017-03-10 11:00:00 | 551    | f7ee-11e6-a996 |
That is, my metadata are the rows of a DataFrame, which gives me all the power of pandas, but each data series is given a uuid on importation. The data for each metadata record is actually a separate DataFrame, serialised to a file whose name is that uuid.
That way, an illustrative example of looking up a record and pulling out the data looks like this:
# show all records whose Length is at least 550
display(df_database[df_database['Length'] >= 550.0])
# take the first matching record and load its data from the pickled DataFrame
idx = df_database[df_database['Length'] >= 550.0].index[0]
df_data = pd.read_pickle(DATA_DIR + str(df_database.at[idx, 'data_uuid']))
display(df_data)
With suitable importation, storage and lookup functions, this seems to give me the power (with associated cumbersomeness!) of pandas without pulling too many restrictive tricks.
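For reference, a minimal sketch of what such importation and lookup helpers might look like (the helper names, the DATA_DIR layout and the choice of uuid1 below are illustrative assumptions, not a documented pattern):
import os
import uuid
import pandas as pd

DATA_DIR = 'data/'  # one pickled DataFrame per record lives here

def add_record(df_database, metadata, df_data):
    # serialise the data under a fresh uuid and append a metadata row pointing at it
    data_uuid = uuid.uuid1().hex
    df_data.to_pickle(os.path.join(DATA_DIR, data_uuid))
    row = dict(metadata, data_uuid=data_uuid)
    return pd.concat([df_database, pd.DataFrame([row])], ignore_index=True)

def load_data(df_database, idx):
    # the same lookup shown above, wrapped as a function
    return pd.read_pickle(os.path.join(DATA_DIR, str(df_database.at[idx, 'data_uuid'])))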

Related

Optimising WordPress query speed by grouping metadata

I am building an eCommerce site and I have about 10 different custom post types. For each custom post type, I have about 20 custom fields (with a single meta_value assigned to each meta_key).
What I was thinking about doing is grouping all 20 custom fields into a single custom field so that I'll have 20 meta_values (array) assigned to a single meta_key. This would essentially reduce the number of rows inside the database table from 20 rows down to a single row for each custom post type.
For example, let's say I have post type "book".
Currently my post_meta table for this post type might look like this:
meta_key   | meta_value
-----------|-----------
chapters   | 5
paragraphs | 10
author     | john_smith
price      | 19
If I convert it to this:
meta_key    | meta_value
------------|----------------------------------------------------------
book_detail | [chapters:5, paragraphs:10, author:john_smith, price:19]
Would doing something like this optimize WordPress query speed or would I just be wasting time?
The tables should be kept as normalized as possible; storing CSV or any other delimited values in a single meta_value is not recommended.
Also, if you need to query a specific aspect of a post, such as only the chapters, normalized data comes in handy: you just need a simple predicate:
where meta_key = 'chapters';

Optimal View Design To Find Mismatches Between Two Sets of Data

A bit of background...my company utilizes a piece of software that stores information about a mortgage loan in independent fields. These fields are broken up across many tables in the loan database.
My current dilemma revolves around designing a view(s) that will allow me to find mismatched data on a subset of loans from the underwriting side of our software and the lock side of our software.
Here is a quick example of the data returned from the two views that already exist:
UW View
transID | DTIField | LTVField | MIField
50000 | 37.5 | 85.0 | 1
Lock View
transID | DTIField | LTVField | MIField
50000 | 42.0 | 85.0 | 0
In the above situation, the view should return the fields that are not matching (in this case the DTIField and the MIField). I have built a comparison view that uses a series of CASE statements to return either a 0 for not matched or a 1 for matched already:
transID | DTIField | LTVField | MIField
50000 | 0 | 1 | 0
This is fine in itself but it is creating a bit of an issue downstream on the reporting side. We want to be able to build a report that would display only those transIDs that have mismatched data and show which columns are not matched. Crystal Reports is the reporting solution in question.
Some specifics about the data sets: we have 27 items of the loan that we are comparing (so a total of 54 fields). There are over 4,000 loans in the system and growing. There are already indexes on the transID fields.
How would you structure the view to return all the data needed for the report? We can do a good amount of work in Crystal Reports but ideally much of the logic would be handled in MSSQL.
Thanks for any assistance.
I think there should be no issue in comparing the 27 columns for a given row. Since you'll be reading the row just once and comparing the columns of that row in both tables, it shouldn't really pose any performance problems. You can use a hash function such as HASHBYTES to assign a hash value to the combination of these 27 fields in both tables, and then compare that field to decide which rows the view should return. This should result in some performance improvement; testing will reveal more.
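Purely as an illustration of that hashing idea (sketched in pandas/hashlib rather than as the MSSQL view itself; the function names and the SHA-1 choice are assumptions), the comparison could look like this:
import hashlib
import pandas as pd

def row_hash(df, cols):
    # one hash per row over the concatenated compared fields (the analogue of HASHBYTES)
    return df[cols].astype(str).agg('|'.join, axis=1).map(
        lambda s: hashlib.sha1(s.encode()).hexdigest())

def mismatches(uw, lock, cols):
    # uw and lock are DataFrames indexed by transID with the same compared columns;
    # assumes every transID appears in both views
    changed = uw.index[row_hash(uw, cols) != row_hash(lock, cols).reindex(uw.index)]
    # 1 = matched, 0 = mismatched, mirroring the existing CASE-based comparison view
    return (uw.loc[changed, cols] == lock.loc[changed, cols]).astype(int)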

Best way to apply FIR filter to data stored in a database

I have a PostgreSQL database with a few tables that store several million data points from different sensors. The data is stored in one column of each row, like:
| ID | Data | Comment |
|----|------|---------|
| 1  | 19   | Sunny   |
| 2  | 315  | Sunny   |
| 3  | 127  | Sunny   |
| 4  | 26   | Sunny   |
| 5  | 82   | Rainy   |
I want to apply a FIR filter to the data and store it in another table so I can work with it, but because of the amount of data I'm not sure of the best way to do it. So far I've got the coefficients in Octave and work with extracts of the data: basically I export the column Data to a CSV, then run csvimport in Octave to get it into an array and filter it. The problem is that this method doesn't let me work with more than a few thousand data points at a time.
Things I've been looking at so far:
- PostgreSQL: I've been looking for some way to do it directly in the database, but I haven't been able to find one so far.
- Java: Another possibility is a small program that extracts chunks of data at a time, recalculates them using the coefficients and stores the result back in another table of the database (sketched below).
- C/C++: I've seen some questions and answers on StackOverflow about how to implement the filter, here, here or here, but they seem to be aimed at working with data in real time rather than taking advantage of having all the data already.
I think the best way would be to do it directly in PostgreSQL, and that Java or C/C++ would be too slow, but I don't have much experience working with this much data, so I may well be wrong. I just need to know why, and where to point myself.
What's the best way to apply a FIR filter to data stored on a database, and why?
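For illustration, the chunked approach from the Java option could look roughly like the following in Python instead (assuming scipy and psycopg2; the table and column names sensor_data / sensor_data_filtered are made up). The important detail is carrying the filter state zi between chunks so that chunk boundaries don't distort the output:
import numpy as np
import psycopg2
from scipy.signal import lfilter

b = np.loadtxt('fir_coefficients.txt')     # FIR taps exported from Octave, one per line
zi = np.zeros(len(b) - 1)                  # filter state carried across chunks

conn = psycopg2.connect("dbname=sensors")
read_cur = conn.cursor(name='fir_stream')  # server-side cursor: streams rows in chunks
read_cur.execute("SELECT id, data FROM sensor_data ORDER BY id")
write_cur = conn.cursor()

while True:
    rows = read_cur.fetchmany(10000)
    if not rows:
        break
    ids, values = zip(*rows)
    filtered, zi = lfilter(b, [1.0], np.asarray(values, dtype=float), zi=zi)
    write_cur.executemany(
        "INSERT INTO sensor_data_filtered (id, data) VALUES (%s, %s)",
        list(zip(ids, filtered.tolist())))
conn.commit()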

Can anyone suggest a method of versioning ATTRIBUTE (rather than OBJECT) data in DB

Taking MySQL as an example DB to perform this in (although I'm not restricted to Relational flavours at this stage) and Java style syntax for model / db interaction.
I'd like the ability to allow versioning of individual column values (and their corresponding types) as and when users edit objects. This is primarily in an attempt to drop the amount of storage required for frequent edits of complex objects.
A simple example might be
- Food (Table)
- id (INT)
- name (VARCHAR(255))
- weight (DECIMAL)
So we could insert an object into the database that looks like...
Food banana = new Food("Banana",0.3);
giving us
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
if we then want to update the weight we might use
banana.weight = 0.4;
banana.save();
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.4 |
+----+--------+--------+
Obviously though this is going to overwrite the data.
I could add a revision column to this table, which could be incremented as items are saved, and set a composite key that combines id/revision, but this would still mean storing ALL attributes of the object for every single revision:
- Food (Table)
- id (INT)
- name (VARCHAR(255))
- weight (DECIMAL)
- revision (INT)
+----+--------+--------+----------+
| id | name | weight | revision |
+----+--------+--------+----------+
| 1 | Banana | 0.3 | 1 |
| 1 | Banana | 0.4 | 2 |
+----+--------+--------+----------+
But in this instance we're going to be storing every single piece of data about every single item. This isn't massively efficient if users are making minor revisions to larger objects where Text fields or even BLOB data may be part of the object.
What I'd really like would be the ability to selectively store data discretely, so the weight could possibly be saved in a separate DB in its own right, one able to reference the table, row and column that it relates to.
This could then be smashed together with a VIEW of the table, which would impose any later revisions of individual column data into the mix to create the latest version, but without the need to store ALL data for each small revision.
+----+--------+--------+
| id | name | weight |
+----+--------+--------+
| 1 | Banana | 0.3 |
+----+--------+--------+
+-----+------------+-------------+-----------+-----------+----------+
| ID | TABLE_NAME | COLUMN_NAME | OBJECT_ID | BLOB_DATA | REVISION |
+-----+------------+-------------+-----------+-----------+----------+
| 456 | Food | weight | 1 | 0.4 | 2 |
+-----+------------+-------------+-----------+-----------+----------+
Not sure how successful storing any data as a blob and then CASTing it back to the original data type might be, but I thought that since I was inventing functionality here, why not go nuts.
This method of storage would also be fairly dangerous, since table and column names are entirely subject to change, but hopefully this at least outlines the sort of behaviour I'm thinking of.
A table in 6NF has one CK (candidate key) (in SQL, a PK) and at most one other column. Essentially, 6NF allows each pre-6NF table column's update time/version and value to be recorded in an anomaly-free way. You decompose a table by dropping a non-prime column while adding a table with that column plus the old CK's columns. For temporal/versioning applications you further add a time/version column, and the new CK is the old one plus it.
Adding a column of time/whatever intervals (in SQL, start time and end time columns) to a CK instead of a time allows a kind of data compression, by recording the longest uninterrupted stretches of time (or of another dimension) through which a column had the same value. One queries by an original CK plus the time whose value you want. You don't need this for your purposes, but the initial process of normalizing to 6NF and the addition of a time/whatever column should be explained in temporal tutorials.
Read about temporal databases (which deal both with "valid" data that is times and time intervals but also "transaction" times/versions of database updates) and 6NF and its role in them. (Snodgrass/TSQL2 is bad, Date/Darwen/Lorentzos is good and SQL is problematic.)
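To make that decomposition concrete for the Food example above, a rough sketch (here in Python with SQLite purely for illustration; the table and view names are made up) could move the weight column into its own table keyed by (id, revision) and use a view to impose the latest revision:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 6NF-style decomposition: one non-key attribute per table
    CREATE TABLE food (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE food_weight (
        id       INTEGER REFERENCES food(id),
        revision INTEGER,
        weight   DECIMAL,
        PRIMARY KEY (id, revision)          -- the old CK plus the version column
    );
    -- view that imposes the latest revision of the attribute
    CREATE VIEW food_current AS
    SELECT f.id, f.name, w.weight
    FROM food f
    JOIN food_weight w ON w.id = f.id
    WHERE w.revision = (SELECT MAX(revision) FROM food_weight WHERE id = f.id);
""")
conn.execute("INSERT INTO food (id, name) VALUES (1, 'Banana')")
conn.executemany("INSERT INTO food_weight VALUES (?, ?, ?)", [(1, 1, 0.3), (1, 2, 0.4)])
print(conn.execute("SELECT * FROM food_current").fetchall())    # [(1, 'Banana', 0.4)]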
Your final suggested table is an example of EAV. This is usually an anti-pattern. It encodes a database into one or more tables that are effectively metadata. But since the DBMS doesn't know that, you lose much of its functionality. EAV is not called for if DDL is sufficient to manage tables with the columns that you need. Just declare appropriate tables in each database. Which is really one database, since you expect transactions affecting both. From that link:
You are using a DBMS anti-pattern EAV. You are (trying to) build part of a DBMS into your program + database. The DBMS already exists to manage data and metadata. Use it.
Do not have a class/table of metadata. Just have attributes of movies be fields/columns of Movies.
The notion that one needs to use EAV "so every entity type can be extended with custom fields" is mistaken. Just implement via calls that update metadata tables sometimes instead of just updating regular tables: DDL instead of DML.

Large amount of timecourses in database

I have a rather large amount of data (~400 million data points) which is organized in a set of ~100,000 timecourses. This data may change every day and, for reasons of revision-safety, has to be archived daily.
Obviously we are talking about way too much data to be handled efficiently, so I did some analysis on sample data. Approx. 60 to 80% of the courses do not change at all between two days, and for the rest only a very limited number of the elements changes. All in all I expect much less than 10 million data points to change.
The question is, how do I make use of this knowledge? I am aware of concepts like the Delta-Trees used by SVN and similar techniques, however I would prefer, if the database itself would be capable of handling such semantic compression. We are using Oracle 11g for storage and the question is, is there a better way than a homebrew solution?
Clarification
I am talking about timecourses representing hourly energy-currents. Such a timecourse might start in the past (like 2005), contains 8760 elements per year and might end any time up to 2020 (currently). Each timecourse is identified by one unique string.
The courses themselves are more or less boring:
"Course_XXX: 1.1.2005 0:00 5; 1.1.2005 1:00 5;1.1.2005 2:00 7,5;..."
My task is making day-to-day changes in these courses visible and, to do so, a snapshot has to be taken each day at a given time. My hope is that some loss-free semantic compression will spare me from archiving ~20 GB per day.
Basically my source data looks like this:
Key | Value0 | ... | Value23
To archive that data I need to add an additional dimension which directly or indirectly tells me the time at which the data was loaded from the source system, so my archive database is
Key | LoadID | Value0 | ... | Value23
Where LoadID is more or less the time the source-DB was accessed.
Now, compression in my scenario is easy. LoadIDs are growing with each run and I can give a range, i.e.
Key | LoadID1 | LoadID2 | Value0 | ... | Value23
Where LoadID1 gives me the ID of the first load where the 24 values were observed and LoadID2 gives me the ID of the last consecutive load where the 24 values were observed.
In my scenario, this reduces the amount of data stored in the database to 1/30th.
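For what it's worth, a sketch of that collapsing step (done here in pandas rather than inside Oracle, purely for illustration; the column names follow the Key | LoadID | Value0 | ... | Value23 layout above) could look like:
import pandas as pd

VALUE_COLS = [f"Value{i}" for i in range(24)]

def compress_loads(df):
    # collapse consecutive loads with identical values into [LoadID1, LoadID2] ranges
    df = df.sort_values(["Key", "LoadID"])
    # a new run starts whenever any value differs from the previous load of the same Key
    prev = df.groupby("Key")[VALUE_COLS].shift()
    run_id = (df[VALUE_COLS] != prev).any(axis=1).cumsum()
    out = df.groupby(["Key", run_id]).agg(
        LoadID1=("LoadID", "min"),   # first load of the unchanged stretch
        LoadID2=("LoadID", "max"),   # last consecutive load with the same values
        **{c: (c, "first") for c in VALUE_COLS})
    return out.reset_index(level=1, drop=True).reset_index()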
