How is data wrangling different than a data cube processing? - cube

I know that a data cube is a transforming multi-dimensional data which keeps changing with time, whereas data wrangling definition says it's transforming the data making it more valuable.
Isn't a data cube more meaningful and valuable piece of denormalized data ?
I haven't been able to find any example to clear the symmetry, they both sound same to me.. Please help!

I found this article which claims to bring a perspective to this question -
I did not find true example but after reading and speaking to data analysts following line calls it a closure for me -
Data Wrangling is applied by functional experts on data in question to clean it off of it's veracity
On the other hand
Data Cube Processing is when a data analyst does a
projection on structured data to output a report with some KPIs (Key
Performance Indicators)
One is a 'cleanup' whereas another is a 'projection'

Related

What counts as ETL?

I know that ETL stands for Extract, Transform and Load data into a new target database. But in what scope does it still count as ETL? For example, if i want to move a contact database with 7000 records into a CRM software, does this process count as ETL as well?
ETL stands for Extract, Transform, Load stages for the data. Extract from a data source, TRANSFORM the extracted data and LOAD into target data source.
Whenever you do EXTRACT in one place and LOAD in another place, your process still comes into ETL. ETL may not involve TRANSFORM in every scenario, where it is straight forward data load. Most of the scenarios, there will be TRANSFORM to the data to suit the target environment/schema.
To answer your question, yes. your loading of records fall under the purview of ETL. But, in your case, it is not having TRANSFORM stage.
As stated by Venkataraman R, you don't have a transform stage that is why your job can't really be considered ETL.
Normally the transform portion would include some sort of data mapping (EG. standardize country codes or extract country codes USA -> US; TUR -> TR). Aside from lots of lookup verification and mapping you would do some general cleaning like removal of bad data, proper formatting like title caps, reworking of keys in the case of data warehouse). You can also do imputation, binning and normalization in the case of preparation of machine learning training. But i think the most important one would be removal of duplicates as it can cause issues regarding aggregation.
It is also considered transformation if you derive a new set of data from your existing data into aggregate form. This means that you have somehow group your data together (SUM/AVG/MAX) so that when a tool uses the data, it would no longer need to perform the aggregation themselves minimizing the computational and bandwidth requirements.
I think it's interesting that, since this question was asked, a whole new set of tools has emerged that call themselves "Reverse ETL" and they sync data in the direction you are talking about: from the database/warehouse into things like CRM systems. For example, out of Postgres and into Salesforce or Marketo.
The "Reverse" piece seems to be a acknowledgement that this is going in the opposite direction as ETL usually went in historically.

Relational database versus R/Python data frames

I was exposed to the world of tables and data structures in R before the RDBMS systems and other database systems. It is quite elegant in R/Python to create tables and lists from stuctured data (.csv or other formats) and then do data manipulations programmatically.
Last year, I attended a course in Database management and learnt all about structured and unstructured databases. I also noticed that it is the norm to feed data from multiple sources of data into databases rather than directly use them in R (for convenience and discipline?).
For research purposes, R seems to suffice, for joining, appending or even complicated data manipulations.
The questions that keeps arising is:
When to use R directly by using commands such as read.csv, when to use R by creating database and querying from tables using the R-SQL interface?
For instance, if I have a multi-source data, like (a) Person level information (age, gender, smoking habits), (b) Outcome variables (such as surveys taken by them in real time), (c) Covariate information (environment characteristics), (d) Treatment input (occurrence of an event that modifies the outcome - survey response) (d) Time and space information of participants taking survey
How to approach the data collection and processing in this case. There may be standard industry procedures, but I put this question forward here, to understand list of feasible and optimal approaches that individuals and small group of researchers can adopt.
What you're describing when you say "that it is the norm to feed data from multiple sources of data into databases" sounds more specifically like a data warehouse. Databases are used for many reasons, and in plenty of situations they will hold data from one source - for instance, a database used as the data store of a transactional system will often only hold the data needed to run that system, and the data produced by that system.
The process you're describing is commonly called Extract, Transform, Load (ETL), and you might find looking up information about ETL and data warehousing helpful if you decide to go in the direction of combining your data prior to working with it in R.
I can't tell you which you should choose, or the optimal way of accomplishing it, because it will vary in different situations and might even come down to opinion. What I can tell you are some of the reasons why people create data warehouses, and you can decide for yourself whether it might be useful in your situation:
A data warehouse can provide a central location to hold combined data. This means that people do not need to combine the data themselves each time they need to use that specific combination of data. Unlike something like a simple one-off report or extract of combined data, it should provide some flexibility, letting people obtain the combined set of data they need for a specific task. Very often, in enterprise situations, multiple things are then be run on top of the same combined set of data - multidimensional data analysis tools (cubes), reports, data mining, etc.
Some of the benefits of this might include:
Individuals saving time when they otherwise would have needed to combine the data themselves.
If the data which needs to be combined is complex, or some people do not have proficiency at handling that part of the process, then there is less risk of data being combined incorrectly; you can be sure that different pieces of work have used the same source data.
If the data suffers from data quality issues, you resolve this once in the data warehouse, rather than working around it or resolving it repeatedly in code.
If new data is constantly being received, collection and integration of this into the data warehouse can be carried out automatically.
Like I say, I can't decide for you whether this is a useful direction or not - as with any decision of this kind you'll need to weigh up the costs of implementing such a solution against the benefits, and both will be specific to your individual case. But hopefully this answers your core question of why someone might choose to do this work in a database instead of in their code, and gives you a starting point to work from.

Databasing to feed ~9k data points to High Charts

I am working on a freelance project that captures an audio file, runs some fourier analysis, and spits out three charts (x-y plots). Each chart has about ~3000 data points, which I plan to display with High Charts in the browser.
What database techniques do you recommend for storing and accessing this much data? Should I be storing the points in an array or in multiple rows? I'm considering Mongo too. Plan is to use Rails, so I was hoping to use a single database for both data and authentication.
I haven't dealt with queries accessing this much data for a single page, and this may very well be a tiny overall amount of data. In addition this is an MVP for demonstration to investors, so making it scalable to huge levels isn't of immediate concern.
My initial thought is that using Postgres and having one large table of data points, stored per-row, will be fine, and that that a bunch of doubles is not going to be too memory-intensive relative to images and such.
Realistically, I may just pull 100 evenly-spaced data points to make the chart, but the original data must still be stored.
I've done a lot of Mongo work and I can tell you what I would do if I were you.
One of the very nice properties about your data is that the x,y coordinates are of a fixed size generally. In other words it's not like you are storing comments from users, which can vary greatly in size.
With Mongo I would first make a sample document with the 3,000 points. Just a simple array of x,y points. I would see how big that document is and how my front end handled it - in other words can High Charts handle that?
I would also try to stick to the easiest conceptual model to manage, which is one document per chart, each chart having 3k points. This is a natural way to think of the data and I would start there and see if there were any performance hits. Mongo can easily store those documents, so I think the biggest pain would be in the UI with rendering the data.
Mongo would handle authentication well. I think it's a good choice for general data storage for an MVP.

Designing a database with periodic sensor data

I'm designing a PostgreSQL database that takes in readings from many sensor sources. I've done a lot of research into the design and I'm looking for some fresh input to help get me out of a rut here.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
The basic structure of the data coming in is as follows:
For each data logging device, there are several channels.
For each channel, the logger reads data and attaches it to a record with a timestamp
Different channels may have different data types, but generally a float4 will suffice.
Users should (through database functions) be able to add different value types, but this concern is secondary.
Loggers and channels will also be added through functions.
The distinguishing characteristic of this data layout is that I've got many channels associating data points to a single record with a timestamp and index number.
Now, to describe the data volume and common access patterns:
Data will be coming in for about 5 loggers, each with 48 channels, for every minute.
The total data volume in this case will be 345,600 readings per day, 126 million per year, and this data needs to be continually read for the next 10 years at least.
More loggers & channels will be added in the future, possibly from physically different types of devices but hopefully with similar storage representation.
Common access will include querying similar channel types across all loggers and joining across logger timestamps. For example, get channel1 from logger1, channel4 from logger2, and do a full outer join on logger1.time = logger2.time.
I should also mention that each logger timestamp is something that is subject to change due to time adjustment, and will be described in a different table showing the server's time reading, the logger's time reading, transmission latency, clock adjustment, and resulting adjusted clock value. This will happen for a set of logger records/timestamps depending on retrieval. This is my motivation for RecordTable below but otherwise isn't of much concern for now as long as I can reference a (logger, time, record) row from somewhere that will change the timestamps for associated data.
I have considered quite a few schema options, the most simple resembling a hybrid EAV approach where the table itself describes the attribute, since most attributes will just be a real value called "value". Here's a basic layout:
RecordTable DataValueTable
---------- --------------
[PK] id <-- [FK] record_id
[FK] logger_id [FK] channel_id
record_number value
logger_time
Considering that logger_id, record_number, and logger_time are unique, I suppose I am making use of surrogate keys here but hopefully my justification of saving space is meaningful here. I have also considered adding a PK id to DataValueTable (rather than the PK being record_id and channel_id) in order to reference data values from other tables, but I am trying to resist the urge to make this model "too flexible" for now. I do, however, want to start getting data flowing soon and not have to change this part when extra features or differently-structured-data need to be added later.
At first, I was creating record tables for each logger and then value tables for each channel and describing them elsewhere (in one place), with views to connect them all, but that just felt "wrong" because I was repeating the same thing so many times. I guess I'm trying to find a happy medium between too many tables and too many rows, but partitioning the bigger data (DataValueTable) seems strange because I'd most likely be partitioning on channel_id, so each partition would have the same value for every row. Also, partitioning in that regard would require a bit of work in re-defining the check conditions in the main table every time a channel is added. Partitioning by date is only applicable to the RecordTable, which isn't really necessary considering how relatively small it will be (7200 rows per day with the 5 loggers).
I also considered using the above with partial indexes on channel_id since DataValueTable will grow very large but the set of channel ids will remain small-ish, but I am really not certain that this will scale well after many years. I have done some basic testing with mock data and the performance is only so-so, and I want it to remain exceptional as data volume grows. Also, some express concern with vacuuming and analyzing a large table, and dealing with a large number of indexes (up to 250 in this case).
On a very small side note, I will also be tracking changes to this data and allowing for annotations (e.g. a bird crapped on the sensor, so these values were adjusted/marked etc), so keep that in the back of your mind when considering the design here but it is a separate concern for now.
Some background on my experience/technical level, if it helps to see where I'm coming from: I am a CS PhD student, and I work with data/databases on a regular basis as part of my research. However, my practical experience in designing a robust database for clients (this is part of a business) that has exceptional longevity and flexible data representation is somewhat limited. I think my main problem now is I am considering all the angles of approach to this problem instead of focusing on getting it done, and I don't see a "right" solution in front of me at all.
So In conclusion, I guess these are my primary queries for you: if you've done something like this, what has worked for you? What are the benefits/drawbacks I'm not seeing of the various designs I've proposed here? How might you design something like this, given these parameters and access patterns?
I'll be happy to provide clarification/details where needed, and thanks in advance for being awesome.
It is no problem at all to provide all this in a Relational database. PostgreSQL is not enterprise class, but it is certainly one of the better freeware SQLs.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
That is your biggest obstacle. Contrary to program design, which allows decomposition and isolated analysis/design of components, databases need to be designed as a single unit. Normalisation and other design techniques need to consider both the whole, and the component in context. The data, the descriptions, the metadata have to be evaluated together, not as separate parts.
Second, when you start off with surrogate keys, implying that you know the data, and how it relates to other data, it prevents you from genuine modelling of the data.
I have answered a very similar set of questions, coincidentally re very similar data. If you could read those answers first, it would save us both a lot of typing time on your question/answer.
Answer One/ID Obstacle
Answer Two/Main
Answer Three/Historical
I did something like this with seismic data for a petroleum exploration company.
My suggestion would be to store the meta-data in a database, and keep the sensor data in flat files, whatever that means for your computer's operating system.
You would have to write your own access routines if you want to modify the sensor data. Actually, you should never modify the sensor data. You should make a copy of the sensor data with the modifications so that you can show later what changes were made to the sensor data.

Advantages of keeping to a protocol for a data model

The question title is probably not correct because part of my question is to try and get some more understanding on the problem.
I am looking for the advantages of making sure data that is imported to a database (simple example: Excel table to Access database) should be given using the same schema and should also be valid to the business requirements.
I have an Excel table containing none normalised data and an Access database with normalised tables.
The Excel table comes from multiple third parties, none of which stick to the same format as each other or the database.
Some of the sources also do not supply all the relevant data.
Example of what could be supplied
contact_key, date, contact_title, reject_name, reject_cost, count_of_unique_contact
count_of_unique_contact is derived from distinct contact_title's and should not be imported.
contact_key is sometimes not supplied.
title is sometimes unknown and passed in as such "n/a", "name = ??1342", "#N/A" etc. rather random.
reject_name is often miss spelled.
the fields are sometimes not even supplied, e.g. date and contact_key are missing.
I am trying to find information to help explain the issues with the above.
Issues only related to incorrect data or fields making it difficult to have useful data in the database such as not being able to report a trend on reject costs in a month when the date is not supplied. Normalising the excel file is not an option available to me.
Requesting the values and fields in the Excel files to match the business requirements and the format to be the same for every third party that sends them is what I want to do but the request is falling on deaf ears.
I want to explain to the client that inputting fake data and checking for invalid/existing rejects/contacts all the time is wrong and doing it is going to fail or at the best be difficult without constant maintenance of a poor system.
Does anyone have any information on this problem?
Thanks
This is a common problem; this gets referred to in data processing circles as "garbage in, garbage out". Essentially, what you're running up against is that the data as given is of poor quality; you're correct to recognize that the problem is that it will be hard (if not impossible) to use this data to extract any useful information.
To some extent, this is a problem that should be fixed at the source; whatever your source of your data is, they need to be convinced that the data quality must improve. In the short term, you can sanitize your data; the term refers to removing or cleaning the bad entries to make the remainder of the data (the "good" data) importable into your database. Depending on just what percentage of your data is bad, you may or may not be able to do useful things with the sanitized data once you import it.
At some point, since you're not getting traction with management about the quality of the data, you will simply have to show them that the system is not working as intended because the quality of the data is bad. They'll need to improve their processes at that point to improve the quality of the data you get in at that point. Until then, though, keep pressing for better data; investigate the process of sanitizing the data and see what you can do with the remaining data. Good luck!

Resources