My dataset consists of instances of series data, each with associated metadata. It is similar to a CD collection, where each track has metadata (artist, album, length, etc.) and a series of audio data. Or imagine a road condition survey dataset: each time a survey is conducted, the metadata (location, date, time, operator, etc.) is recorded, along with physical series data describing the road condition for each unit length of road. The collection of surveys ({metadata, data} pairs) is the dataset.
I'd like to take advantage of pandas to help import, store, search and analyse that dataset. pandas does not have built-in support for this type of dataset, but many have tried to add it.
The typical solutions are either:
Adding metadata to a pandas DataFrame, but this is the wrong way around: I want a collection of metadata records each with associated data, not data with associated metadata.
Casting the data to be a valid field in a DataFrame and storing it as one of the metadata fields, but the casting process discards significant integrity.
Using multiple indices to create a 3D DataFrame, but this imposes design constraints on your choice of index, which limits experimentation.
This sort of dataset is very common, and I see a lot of people trying to bend pandas to accommodate it. I wonder what the right approach is, or even if pandas is the right tool.
I now have a working solution, but since I haven't seen this method documented I wonder if there be dragons ahead.
My "database" is a pandas DataFrame that looks something like this:
| | Description | Time | Length | data_uuid |
| 0 | My first record | 2017-03-09 11:00:00 | 502 | f7ee-11e6-b702 |
| 1 | My second record | 2017-03-10 11:00:00 | 551 | f7ee-11e6-a996 |
That is, my metadata records are rows of a DataFrame, which gives me all the power of pandas, but each record's data is given a UUID on import. The data for each metadata record is actually a separate DataFrame, serialised to a file whose name is the UUID.
That way, an illustrative example of looking up a record and pulling out the data looks like this:
# Select the metadata rows of interest
matches = df_database[df_database['Length'] >= 550.0]
display(matches)
# Load the series data for the first match from the file named by its UUID
idx = matches.index[0]
df_data = pd.read_pickle(DATA_DIR + str(df_database.at[idx, 'data_uuid']))
display(df_data)
With suitable importation, storage and lookup functions, this seems to give me the power (with associated cumbersomeness!) of pandas without pulling too many restrictive tricks.
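A minimal sketch of what such storage and lookup helpers might look like, assuming one pickle file per record in a single directory. The names `store_data` and `load_data`, the `value` column, and the temporary directory are hypothetical, chosen only for illustration.

```python
import tempfile
import uuid
from pathlib import Path

import pandas as pd


def store_data(data_dir, df_data):
    """Pickle df_data under a fresh UUID and return that UUID as a string."""
    data_uuid = str(uuid.uuid4())
    df_data.to_pickle(Path(data_dir) / data_uuid)
    return data_uuid


def load_data(df_database, data_dir, idx):
    """Read back the series DataFrame referenced by metadata row idx."""
    return pd.read_pickle(Path(data_dir) / df_database.at[idx, "data_uuid"])


data_dir = tempfile.mkdtemp()  # stand-in for DATA_DIR

# Each metadata record carries the UUID of its separately stored data.
records = [
    {"Description": "My first record", "Length": 502,
     "data_uuid": store_data(data_dir, pd.DataFrame({"value": [0.1, 0.2, 0.3]}))},
    {"Description": "My second record", "Length": 551,
     "data_uuid": store_data(data_dir, pd.DataFrame({"value": [1.5, 1.7]}))},
]
df_database = pd.DataFrame(records)

# Search the metadata with ordinary pandas filtering, then pull the data.
matches = df_database[df_database["Length"] >= 550]
df_data = load_data(df_database, data_dir, matches.index[0])
```

Keeping the metadata table free of the bulk data means it stays small enough to filter and sort quickly; the cost is that the files and the table can drift out of sync if either is edited by hand.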
I have a legacy database that I am doing some ETL work on. I have columns in the old table that are conditionally mapped to columns in my new table. The conditions are based on an associated column (a column in the same table that represents the shape of an object, we can call that column SHAPE). For example:
Column dB4D is mapped to column:
B4 if SHAPE=5
B3 if SHAPE=1
X if SHAPE=10
or else Y
I am using a Conditional Split to split the table based on SHAPE, and then 10-15 "copy column" transformations to take each old column (dB4D) and map it to the new column (B4, B3, X, etc).
Some of these columns "overlap". For example, I have multiple legacy columns (dB4D, dB3D, dB2D, dB1D, dC1D, dC2D, etc) and multiple new columns (A, B, C, D, etc). In one of the "copy columns" (which are broken up by SHAPE) I could have something like:
If SHAPE=10
+--------------+--------------+
| Input Column | Output Alias |
+--------------+--------------+
| dB4D | B |
+--------------+--------------+
If SHAPE=5
+--------------+--------------+
| Input Column | Output Alias |
+--------------+--------------+
| dB4D | C |
+--------------+--------------+
I now need to bring these all together into one final staging table (or "destination"). No two rows will have the same size, so there is no conflict. But I need to map dB4D (and other columns) to different new columns based on a value in another column. I have tried to merge them but can't merge multiple data sources. I have tried to join them, but not all columns (or output aliases) show up in the destination. Can anyone recommend how to resolve this issue?
Here is the current design that may help:
As inputs to your data flow, you have a set of columns dB4D, dB3D, dB2D, etc.
Your destination will only have column names that do not exist in your source data.
Based on the Shape column, you'll project the dB columns into different mappings for your target table.
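The projection described above is independent of SSIS itself. As a language-neutral illustration only (this is not SSIS code), here is a small Python sketch that uses the dB4D example from the question. The names `COLUMN_MAP_BY_SHAPE`, `DEFAULT_COLUMN_MAP` and `project_row`, and the sample values, are hypothetical.

```python
# Hypothetical mapping table: SHAPE value -> {legacy column: new column}.
# The entries for dB4D follow the example in the question; a real table
# would list every legacy column for every shape.
COLUMN_MAP_BY_SHAPE = {
    5: {"dB4D": "B4"},
    1: {"dB4D": "B3"},
    10: {"dB4D": "X"},
}
DEFAULT_COLUMN_MAP = {"dB4D": "Y"}  # the "or else" branch


def project_row(row):
    """Return a destination row whose column names depend on row['SHAPE']."""
    mapping = COLUMN_MAP_BY_SHAPE.get(row["SHAPE"], DEFAULT_COLUMN_MAP)
    target = {"SHAPE": row["SHAPE"]}
    for source_col, target_col in mapping.items():
        target[target_col] = row[source_col]
    return target


legacy_rows = [
    {"SHAPE": 5, "dB4D": 12.5},
    {"SHAPE": 10, "dB4D": 7.0},
    {"SHAPE": 3, "dB4D": 1.25},  # no explicit rule, falls through to Y
]
staged_rows = [project_row(r) for r in legacy_rows]
```

Each output row only carries the target columns that apply to its shape, which is why the branches never conflict and do not need to be merged column by column.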
If the Conditional Split logic makes sense as you have it, don't try to Union All it back together. Instead, just wire up 8 OLE DB Destinations. You'll probably have to change them from the "fast load" option to the table name option. This means it will perform singleton inserts, so hopefully the data volumes won't be an issue. If they are, then create 8 staging tables for which you do use the "Fast Load" option, and then have a successor task to your Data Flow perform set-based inserts into the final table.
The challenge you'll run into with the Union All component is that if you make any changes to the source, the Union All rarely picks up on the change (the column changed from varchar to int, sorry!).
Here is what I am trying to do: I want to store a list of values within a DB record, something like this:
| id | tags |
| 1 | 1,3,5 |
| 2 | 121,4,6 |
| 3 | 3,101,2 |
Most of the suggestions I have found so far recommend creating a separate join table to establish a many-to-many relationship, but in my case I don't think a separate table is suitable, because the tag values are just a list of numbers.
The best I can think of right now is to store the data as a CSV string and parse it when it is retrieved, but I'm still looking for a way to get the values as an array when I retrieve them from the DB. Even better would be a way to restrict the number of elements in the list. Is there a better way to do this?
I haven't decided which database to use yet, most probably PostgreSQL, but I'm open to others if they can help me implement this better.
On PostgreSQL you can use an array type.
On MySQL you can use the SET type.
Then it depends on what you really need.
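If neither native type fits and the CSV-string fallback from the question is used instead, the parsing and the element limit can live in two small helpers. This is only a sketch under that assumption; `MAX_TAGS`, `encode_tags` and `decode_tags` are hypothetical names, and the cap of 5 is arbitrary.

```python
MAX_TAGS = 5  # hypothetical cap on the number of elements per record


def encode_tags(tags):
    """Serialise a list of ints to a CSV string, enforcing the cap."""
    if len(tags) > MAX_TAGS:
        raise ValueError(f"at most {MAX_TAGS} tags allowed, got {len(tags)}")
    return ",".join(str(int(t)) for t in tags)


def decode_tags(text):
    """Parse a CSV string back into a list of ints (empty string -> [])."""
    return [int(part) for part in text.split(",")] if text else []


stored = encode_tags([121, 4, 6])   # value written to the tags column
restored = decode_tags(stored)      # value handed back to the application
```

The database cannot index or validate individual elements stored this way, which is the main trade-off against a native array type or a join table.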
I need to represent graph information in a relational database.
Let's say, a is connected to b, c, and d.
a -- b
|_ c
|_ d
I can have a node table for a, b, c, and d, and I can also have a link table (FROM, TO) -> (a,b), (a,c), (a,d).
In another implementation there might be a way to store the link info as (a,b,c,d), but then the number of elements in a row is variable.
Q1 : Is there a way to represent variable elements in a table?
Q2 : Is there any general way to represent the graph structure using a relational database?
Q1 : Is there a way to represent variable elements in a [database] table?
I assume you mean something like this?
from | to_1 | to_2 | to_3 | to_4 | to_5 | etc...
1 | 2 | 3 | 4 | NULL | NULL | etc...
This is not a good idea. It violates first normal form.
Q2 : Is there any general way to represent the graph structure using database?
For a directed graph you can use a table edges with two columns:
nodeid_from nodeid_to
1 2
1 3
1 4
If there is any extra information about each node (such as a node name) this can be stored in another table nodes.
If your graph is undirected you have two choices:
store both directions (i.e. store 1->2 and 2->1)
use a constraint that nodeid_from must be less than nodeid_to (i.e. store 1->2 but 2->1 is implied).
The former requires twice the storage space but can make querying easier and faster.
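A minimal sketch of the two-table layout, using Python's built-in `sqlite3` module so it runs without a server. It shows the second undirected option (one row per edge, enforced by a CHECK constraint); for a directed graph the CHECK would simply be dropped. The `neighbours` helper and the node names a to d are taken from the question or invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (
        nodeid INTEGER PRIMARY KEY,
        name   TEXT NOT NULL
    );
    CREATE TABLE edges (
        nodeid_from INTEGER NOT NULL REFERENCES nodes(nodeid),
        nodeid_to   INTEGER NOT NULL REFERENCES nodes(nodeid),
        PRIMARY KEY (nodeid_from, nodeid_to),
        -- Undirected variant: store each edge once, smaller id first.
        CHECK (nodeid_from < nodeid_to)
    );
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
conn.executemany("INSERT INTO edges VALUES (?, ?)", [(1, 2), (1, 3), (1, 4)])


def neighbours(conn, nodeid):
    """Names of all nodes adjacent to nodeid, checking both edge columns."""
    rows = conn.execute(
        """
        SELECT n.name AS name
        FROM edges e JOIN nodes n ON n.nodeid = e.nodeid_to
        WHERE e.nodeid_from = ?
        UNION
        SELECT n.name AS name
        FROM edges e JOIN nodes n ON n.nodeid = e.nodeid_from
        WHERE e.nodeid_to = ?
        ORDER BY 1
        """,
        (nodeid, nodeid),
    ).fetchall()
    return [name for (name,) in rows]
```

The UNION over both columns is the extra query cost of storing each undirected edge only once; storing both directions would reduce the lookup to a single WHERE clause.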
In addition to the two-table route mentioned by Mark, take a look at the following link:
http://articles.sitepoint.com/article/hierarchical-data-database/2
This article basically preorders the elements in the tree assigning left and right values. You are then able to select portions or all of the tree using a single select statement.
Node | lft | rght
-----------------
A | 0 | 7
B | 1 | 2
C | 3 | 4
D | 5 | 6
EDIT: If you are going to be updating the tree heavily, this is not an optimal solution, as the whole tree must be renumbered.
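As a rough illustration of the single-select property, the table above can be loaded into an in-memory SQLite database and queried for a subtree. The table name `tree` and the helper `descendants` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tree (node TEXT PRIMARY KEY, lft INTEGER, rght INTEGER)"
)
conn.executemany(
    "INSERT INTO tree VALUES (?, ?, ?)",
    [("A", 0, 7), ("B", 1, 2), ("C", 3, 4), ("D", 5, 6)],
)


def descendants(conn, node):
    """Nodes whose (lft, rght) interval lies strictly inside node's interval."""
    rows = conn.execute(
        """
        SELECT child.node
        FROM tree AS parent
        JOIN tree AS child
          ON child.lft > parent.lft AND child.rght < parent.rght
        WHERE parent.node = ?
        ORDER BY child.lft
        """,
        (node,),
    ).fetchall()
    return [name for (name,) in rows]


subtree_of_a = descendants(conn, "A")
```

Note that this encoding describes a tree, so it only applies to graphs in which every node has at most one parent.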
I have stored multiple "TO" nodes in a relational representation of a graph structure. I was able to do this because my graph was directed. This meant that if I wanted to know what nodes "A" was connected to, I only needed to select a single record from my table of connections. I stored the TO nodes in an easy-to-parse string and it worked great, with a class that could manage the conversion from string to collection and back.
I recommend looking at dedicated graph databases, as nawroth suggests. One example would be the "Trinity" Database, suited for very large datasets. But there are others.
Listen to the podcast by Scott Hanselman on Hanselminutes about Trinity. Here is the text transcript.
I am new to PostgreSQL and have been experimenting with it.
I have created a simple table:
CREATE table items_tags (
ut_id SERIAL Primary KEY,
item_id integer,
item_tags_weights text[]
);
where:
item_id - ID of the item these tags are associated with
item_tags_weights - Tags associated with the item, including their weights
Example entry:
--------------------
ut_id | item_id | item_tags_weights
---------+---------+-------------------------------------------------------------------------------------------------------------------------------
3 | 2 | {{D,1},{B,9},{W,3},{R,18},{F,9},{L,15},{G,12},{T,17},{0,3},{I,7},{E,14},{S,2},{O,5},{M,4},{V,3},{H,2},{X,14},{Q,9},{U,6},{P,16},{N,11},{J,1},{A,12},{Y,15},{C,15},{K,4},{Z,17}}
1000003 | 3 | {{Q,4},{T,19},{P,15},{M,14},{O,20},{S,3},{0,6},{Z,6},{F,4},{U,13},{E,18},{B,14},{V,14},{X,10},{K,18},{N,17},{R,14},{J,12},{L,15},{Y,3},{D,20},{I,18},{H,20},{W,15},{G,7},{A,11},{C,14}}
4 | 4 | {{Q,2},{W,7},{A,6},{T,19},{P,8},{E,10},{Y,19},{N,11},{Z,13},{U,19},{J,3},{O,1},{C,2},{L,7},{V,2},{H,12},{G,19},{K,15},{D,7},{B,4},{M,9},{X,6},{R,14},{0,9},{I,10},{F,12},{S,11}}
5 | 5 | {{M,9},{B,3},{I,6},{L,12},{J,2},{Y,7},{K,17},{W,6},{R,7},{V,1},{0,12},{N,13},{Q,2},{G,14},{C,2},{S,6},{O,19},{P,19},{F,4},{U,11},{Z,17},{T,3},{E,10},{D,2},{X,18},{H,2},{A,2}}
(4 rows)
where:
{D,1} - D = tag, 1 = tag weight
I want to list the item_id values whose tags include 'U', ordered by tag weight.
One way is to select all the tags from the database, do the processing and sorting in a high-level language, and use the resulting set.
For this, I can do the following:
1) SELECT * FROM items_tags WHERE 'U' = ANY (item_tags_weights)
2) Extract and sort the information and display.
But considering that multiple items can be associated with a single tag, and assuming 10 million entries, this method will surely be sluggish.
Is there a way to produce this listing inside the database, for example with CREATE FUNCTION?
Any pointers will be helpful.
Many thanks.
Have you considered normalization, i.e. moving the array field into another table? Apart from being easy to query and extend, it's likely to have better performance on larger databases.
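A minimal sketch of the normalized layout, run against in-memory SQLite through Python's `sqlite3` module so that it is self-contained; the same schema and query should carry over to PostgreSQL with little change. The table name `item_tags`, the index name, and the descending sort order are assumptions; the weights are the 'U' entries from the sample rows in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_tags (
        item_id INTEGER NOT NULL,
        tag     TEXT    NOT NULL,
        weight  INTEGER NOT NULL,
        PRIMARY KEY (item_id, tag)
    );
    -- Lets a lookup by tag, ordered by weight, be served from the index.
    CREATE INDEX item_tags_tag_weight ON item_tags (tag, weight);
""")
conn.executemany(
    "INSERT INTO item_tags VALUES (?, ?, ?)",
    [(2, "U", 6), (3, "U", 13), (4, "U", 19), (5, "U", 11), (2, "D", 1)],
)

# Items carrying tag 'U', heaviest weight first.
rows = conn.execute(
    "SELECT item_id, weight FROM item_tags WHERE tag = ? ORDER BY weight DESC",
    ("U",),
).fetchall()
```

With one row per (item, tag) pair, filtering and sorting happen in the database, so nothing has to be unpacked and sorted in application code.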