What counts as ETL? - database

I know that ETL stands for Extract, Transform, Load: moving data into a new target database. But how broadly does the term apply? For example, if I want to move a contact database with 7,000 records into a CRM system, does that process count as ETL as well?

ETL stands for the Extract, Transform and Load stages that the data goes through: EXTRACT from a data source, TRANSFORM the extracted data, and LOAD it into the target data store.
Whenever you EXTRACT in one place and LOAD in another, your process still counts as ETL. ETL does not have to involve a TRANSFORM in every scenario; sometimes it is a straightforward data load. In most scenarios, though, the data is transformed to suit the target environment/schema.
To answer your question: yes, loading your records falls under the purview of ETL. In your case, however, there is no TRANSFORM stage.

As stated by Venkataraman R, you don't have a transform stage, which is why your job can't really be considered ETL.
Normally the transform portion would include some sort of data mapping (e.g. standardizing or converting country codes: USA -> US; TUR -> TR). Aside from lots of lookup verification and mapping, you would do some general cleaning, such as removing bad data, applying proper formatting like title case, and reworking keys in the case of a data warehouse. You can also do imputation, binning and normalization when preparing machine learning training data. But I think the most important one is the removal of duplicates, as they can cause problems with aggregation.
It is also considered transformation if you derive a new set of data from your existing data in aggregate form. This means that you have grouped your data together somehow (SUM/AVG/MAX) so that a tool using the data no longer needs to perform the aggregation itself, which minimizes the computational and bandwidth requirements.
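To make those transform steps a bit more concrete, here is a minimal Python sketch of code mapping, title-casing, duplicate removal and a small pre-aggregation. The record fields, the COUNTRY_MAP table and the function names are assumptions made up for this illustration, not part of any particular ETL tool.

```python
from collections import Counter

# Hypothetical lookup table: three-letter -> two-letter country codes.
COUNTRY_MAP = {"USA": "US", "TUR": "TR"}

def transform(records):
    """Clean extracted contact records before loading them."""
    seen = set()
    cleaned = []
    for rec in records:
        email = rec["email"].strip().lower()
        if not email or email in seen:  # drop bad rows and duplicates
            continue
        seen.add(email)
        country = rec["country"].strip().upper()
        cleaned.append({
            "name": rec["name"].strip().title(),
            "email": email,
            "country": COUNTRY_MAP.get(country, country),
        })
    return cleaned

def contacts_per_country(cleaned):
    """Pre-aggregate so downstream tools do not have to."""
    return dict(Counter(rec["country"] for rec in cleaned))

extracted = [
    {"name": "ada lovelace", "email": "ada@example.com", "country": "USA"},
    {"name": "Ada Lovelace", "email": "ADA@example.com ", "country": "usa"},
    {"name": "alan turing", "email": "alan@example.com", "country": "TUR"},
    {"name": "no email", "email": "", "country": "TUR"},
]
cleaned = transform(extracted)
summary = contacts_per_country(cleaned)
```

Here duplicates are detected by normalized e-mail address, which is only one possible choice of key; a real contact list may need fuzzier matching.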

I think it's interesting that, since this question was asked, a whole new set of tools has emerged that call themselves "Reverse ETL" and they sync data in the direction you are talking about: from the database/warehouse into things like CRM systems. For example, out of Postgres and into Salesforce or Marketo.
The "Reverse" piece seems to be an acknowledgement that this goes in the opposite direction from the one ETL historically went in.

Related

Relational database versus R/Python data frames

I was exposed to the world of tables and data structures in R before RDBMSs and other database systems. It is quite elegant in R/Python to create tables and lists from structured data (.csv or other formats) and then do data manipulations programmatically.
Last year, I attended a course in Database management and learnt all about structured and unstructured databases. I also noticed that it is the norm to feed data from multiple sources of data into databases rather than directly use them in R (for convenience and discipline?).
For research purposes, R seems to suffice, for joining, appending or even complicated data manipulations.
The question that keeps arising is:
When should I use R directly, with commands such as read.csv, and when should I create a database and query its tables through the R-SQL interface?
For instance, suppose I have multi-source data, like (a) person-level information (age, gender, smoking habits), (b) outcome variables (such as surveys taken by them in real time), (c) covariate information (environment characteristics), (d) treatment input (occurrence of an event that modifies the outcome, i.e. the survey response), (e) time and space information of participants taking the survey.
How should I approach data collection and processing in this case? There may be standard industry procedures, but I put this question forward here to understand the feasible and optimal approaches that individuals and small groups of researchers can adopt.
What you're describing when you say "that it is the norm to feed data from multiple sources of data into databases" sounds more specifically like a data warehouse. Databases are used for many reasons, and in plenty of situations they will hold data from one source - for instance, a database used as the data store of a transactional system will often only hold the data needed to run that system, and the data produced by that system.
The process you're describing is commonly called Extract, Transform, Load (ETL), and you might find looking up information about ETL and data warehousing helpful if you decide to go in the direction of combining your data prior to working with it in R.
I can't tell you which you should choose, or the optimal way of accomplishing it, because it will vary in different situations and might even come down to opinion. What I can tell you are some of the reasons why people create data warehouses, and you can decide for yourself whether it might be useful in your situation:
A data warehouse can provide a central location to hold combined data. This means that people do not need to combine the data themselves each time they need to use that specific combination of data. Unlike something like a simple one-off report or extract of combined data, it should provide some flexibility, letting people obtain the combined set of data they need for a specific task. Very often, in enterprise situations, multiple things are then run on top of the same combined set of data: multidimensional data analysis tools (cubes), reports, data mining, etc.
Some of the benefits of this might include:
Individuals saving time when they otherwise would have needed to combine the data themselves.
If the data which needs to be combined is complex, or some people do not have proficiency at handling that part of the process, then there is less risk of data being combined incorrectly; you can be sure that different pieces of work have used the same source data.
If the data suffers from data quality issues, you resolve this once in the data warehouse, rather than working around it or resolving it repeatedly in code.
If new data is constantly being received, collection and integration of this into the data warehouse can be carried out automatically.
Like I say, I can't decide for you whether this is a useful direction or not - as with any decision of this kind you'll need to weigh up the costs of implementing such a solution against the benefits, and both will be specific to your individual case. But hopefully this answers your core question of why someone might choose to do this work in a database instead of in their code, and gives you a starting point to work from.
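For a small research group, the "combine once, query many times" idea can be tried at very low cost before committing to anything heavier. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module in place of a real warehouse, and the table names, column names and sample rows are invented for the example.

```python
import csv
import io
import sqlite3

# Two hypothetical sources, inlined as CSV text so the example is self-contained.
PERSONS_CSV = "person_id,age,smoker\n1,34,yes\n2,51,no\n"
SURVEYS_CSV = (
    "person_id,taken_at,score\n"
    "1,2024-01-01T09:00,7\n"
    "1,2024-01-02T09:00,5\n"
    "2,2024-01-01T10:00,8\n"
)

def load_csv(conn, table, text):
    """Load one CSV source into its own table (all columns stored as text)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys())
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in rows],
    )

conn = sqlite3.connect(":memory:")
load_csv(conn, "persons", PERSONS_CSV)
load_csv(conn, "surveys", SURVEYS_CSV)

# Anyone can now ask for the combination they need without re-joining files.
combined = conn.execute("""
    SELECT p.person_id, p.smoker, AVG(CAST(s.score AS REAL))
      FROM persons p
      JOIN surveys s ON s.person_id = p.person_id
     GROUP BY p.person_id, p.smoker
     ORDER BY p.person_id
""").fetchall()
```

The same joined result could of course be produced directly in R or pandas; the database version starts to pay off when several people or scripts need the same combined data.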

Collecting data which wasn't predicted when the system was designed

How do you go about collecting and storing data which was not part of the initial database and software design? For example, if you've come up with a points system, you have to collect the points for every user who has already registered. For new users that would be easy, because the changed business logic will reflect the points system ... but what about the old ones?
In general, how does one deal with data, which should have been there from the beginning, but wasn't? Writing manual queries to collect the missing pieces? Using crons?
Well, you are asking for something that is by definition not possible, I think.
"deal with data, which should have been there from the beginning, but wasn't?"
If you are able to deduce the number of points from the existing data in the database, then there is obviously no missing data. Storing the points separately would make them redundant (still a fine option in case you need that for performance).
For example: Stack Overflow rewards a number of consecutive visits. Let's say they did not do that from the start. If they were already logging the date of each visit, you can recalculate the points. So there is no missing data.
So if that is not possible, you need another solution: either get data from other sources (parse a webserver log) or get the business to draft some extra business rules for the determination of the default values for the existing users (difficult in this particular example).
"Writing manual queries to collect the missing pieces? Using crons?"
I would populate that in a conversion script or even in a special conversion application if very complex.
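As a rough idea of what such a conversion script could look like, here is a minimal sketch using Python's built-in sqlite3 module. The users and visits tables, the points column and the "10 points per distinct visit day" rule are all assumptions invented for the example; the real rule would come from the business.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visits (user_id INTEGER, visit_date TEXT);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO visits VALUES (1, '2024-01-01'), (1, '2024-01-02'),
                              (1, '2024-01-02'), (2, '2024-01-05');
""")

def backfill_points(conn, points_per_day=10):
    """One-off conversion: add the new column, then derive it from history."""
    conn.execute(
        "ALTER TABLE users ADD COLUMN points INTEGER NOT NULL DEFAULT 0"
    )
    with conn:  # commits the UPDATE on success
        conn.execute(
            """
            UPDATE users
               SET points = ? * (SELECT COUNT(DISTINCT visit_date)
                                   FROM visits
                                  WHERE visits.user_id = users.id)
            """,
            (points_per_day,),
        )

backfill_points(conn)
points = dict(conn.execute("SELECT name, points FROM users"))
```

Run once as part of the deployment that introduces the feature; after that, the normal business logic keeps the column up to date for old and new users alike.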

What is a good relational database design for stock market data?

Suppose there are two types of messages, QUOTE and TRADE. They have different fields: for example, TRADE has only a single price, while QUOTE has both a bid and an ask price. I want to process messages in time order to do something like the following:
if (QUOTE) {
...
}
if (TRADE) {
...
}
My problem is that the two messages are in different formats, so I can't get them into the same database table. If I can't get them into the same table, how do I process them sequentially? Any ideas for a suitable design?
The answer depends entirely on what you're doing and on where your app plugs into the data streams.
At one extreme, you might merely be answering customer quotes that you're pulling from an API, and basically implementing a cache. In this case two tables are fine.
At the other extreme, you might be monitoring real-time quotes for a high-frequency trading platform, in which case the throughput will probably rule out using a database at all (things built around Lisp, such as AllegroGraph, might be more appropriate), except to periodically collect aggregate statistics.
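For the two-table case, sequential processing does not require a shared table: if each table is read in timestamp order, the two streams can be merged in application code. The sketch below is one way to do that under assumptions; the Quote and Trade classes, their fields and the sample values are hypothetical, and in practice each list would be the result of a query ordered by timestamp.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Quote:
    ts: float
    symbol: str
    bid: float
    ask: float

@dataclass(frozen=True)
class Trade:
    ts: float
    symbol: str
    price: float

def in_time_order(quotes, trades):
    """Lazily merge two streams that are each already sorted by timestamp."""
    return heapq.merge(quotes, trades, key=lambda msg: msg.ts)

# Stand-ins for rows fetched from the two tables, each sorted by ts.
quotes = [Quote(1.0, "ABC", 9.9, 10.1), Quote(3.0, "ABC", 10.0, 10.2)]
trades = [Trade(2.0, "ABC", 10.0), Trade(4.0, "ABC", 10.1)]

log = []
for msg in in_time_order(quotes, trades):
    if isinstance(msg, Quote):
        log.append(("QUOTE", msg.ts, msg.bid, msg.ask))
    elif isinstance(msg, Trade):
        log.append(("TRADE", msg.ts, msg.price))
```

Because heapq.merge is lazy, it also works with database cursors or generators, so neither table has to be loaded into memory in full.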
The short answer is 'not really'. For stock market and other time series data, a key-value store like Berkeley DB or a document store like MongoDB is pretty good. Also, a data format like NetCDF (http://en.wikipedia.org/wiki/NetCDF) will likely serve you better in the long run. It also depends on what kind of access you want and how long a history you want to store.
You didn't indicate what you were doing with the data, which should inform your choices of storage more than anything. For example, a high-speed trading application will have different storage tradeoffs than a historical batch processing system (where Hadoop + NetCDF would be great). YMMV
Kdb+/q is a very good option for tick data. It is used by major banks.
You can install a trial version and play with it.

Designing a database with periodic sensor data

I'm designing a PostgreSQL database that takes in readings from many sensor sources. I've done a lot of research into the design and I'm looking for some fresh input to help get me out of a rut here.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
The basic structure of the data coming in is as follows:
For each data logging device, there are several channels.
For each channel, the logger reads data and attaches it to a record with a timestamp
Different channels may have different data types, but generally a float4 will suffice.
Users should (through database functions) be able to add different value types, but this concern is secondary.
Loggers and channels will also be added through functions.
The distinguishing characteristic of this data layout is that I've got many channels associating data points to a single record with a timestamp and index number.
Now, to describe the data volume and common access patterns:
Data will be coming in for about 5 loggers, each with 48 channels, for every minute.
The total data volume in this case will be 345,600 readings per day, 126 million per year, and this data needs to be continually read for the next 10 years at least.
More loggers & channels will be added in the future, possibly from physically different types of devices but hopefully with similar storage representation.
Common access will include querying similar channel types across all loggers and joining across logger timestamps. For example, get channel1 from logger1, channel4 from logger2, and do a full outer join on logger1.time = logger2.time.
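The volume figures above follow from simple arithmetic, reproduced here as a small Python check. The only inputs are the numbers stated in the question (5 loggers, 48 channels, one reading per minute, a 10-year horizon); leap days are ignored.

```python
LOGGERS = 5
CHANNELS_PER_LOGGER = 48
RECORDS_PER_LOGGER_PER_DAY = 24 * 60  # one timestamped record per minute

# One record per logger per minute.
records_per_day = LOGGERS * RECORDS_PER_LOGGER_PER_DAY

# One value per channel per record.
readings_per_day = records_per_day * CHANNELS_PER_LOGGER
readings_per_year = readings_per_day * 365
readings_per_decade = readings_per_year * 10

print(f"{records_per_day:,} records/day")
print(f"{readings_per_day:,} readings/day")
print(f"{readings_per_year:,} readings/year")
print(f"{readings_per_decade:,} readings over 10 years")
```

So the value table grows to roughly 1.26 billion rows over ten years with the current set of loggers, before any new loggers or channels are added.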
I should also mention that each logger timestamp is something that is subject to change due to time adjustment, and will be described in a different table showing the server's time reading, the logger's time reading, transmission latency, clock adjustment, and resulting adjusted clock value. This will happen for a set of logger records/timestamps depending on retrieval. This is my motivation for RecordTable below but otherwise isn't of much concern for now as long as I can reference a (logger, time, record) row from somewhere that will change the timestamps for associated data.
I have considered quite a few schema options, the simplest resembling a hybrid EAV approach where the table itself describes the attribute, since most attributes will just be a real value called "value". Here's a basic layout:
RecordTable              DataValueTable
-----------              --------------
[PK] id            <--   [FK] record_id
[FK] logger_id           [FK] channel_id
record_number            value
logger_time
Considering that logger_id, record_number, and logger_time are unique, I suppose I am making use of surrogate keys here but hopefully my justification of saving space is meaningful here. I have also considered adding a PK id to DataValueTable (rather than the PK being record_id and channel_id) in order to reference data values from other tables, but I am trying to resist the urge to make this model "too flexible" for now. I do, however, want to start getting data flowing soon and not have to change this part when extra features or differently-structured-data need to be added later.
At first, I was creating record tables for each logger and then value tables for each channel and describing them elsewhere (in one place), with views to connect them all, but that just felt "wrong" because I was repeating the same thing so many times. I guess I'm trying to find a happy medium between too many tables and too many rows, but partitioning the bigger data (DataValueTable) seems strange because I'd most likely be partitioning on channel_id, so each partition would have the same value for every row. Also, partitioning in that regard would require a bit of work in re-defining the check conditions in the main table every time a channel is added. Partitioning by date is only applicable to the RecordTable, which isn't really necessary considering how relatively small it will be (7200 rows per day with the 5 loggers).
I also considered using the above with partial indexes on channel_id, since DataValueTable will grow very large but the set of channel ids will remain small-ish, but I am really not certain that this will scale well after many years. I have done some basic testing with mock data and the performance is only so-so, and I want it to remain exceptional as data volume grows. Also, some people express concern about vacuuming and analyzing a large table, and about dealing with a large number of indexes (up to 250 in this case).
On a very small side note, I will also be tracking changes to this data and allowing for annotations (e.g. a bird crapped on the sensor, so these values were adjusted/marked etc), so keep that in the back of your mind when considering the design here but it is a separate concern for now.
Some background on my experience/technical level, if it helps to see where I'm coming from: I am a CS PhD student, and I work with data/databases on a regular basis as part of my research. However, my practical experience in designing a robust database for clients (this is part of a business) that has exceptional longevity and flexible data representation is somewhat limited. I think my main problem now is I am considering all the angles of approach to this problem instead of focusing on getting it done, and I don't see a "right" solution in front of me at all.
So, in conclusion, I guess these are my primary queries for you: if you've done something like this, what has worked for you? What are the benefits/drawbacks I'm not seeing of the various designs I've proposed here? How might you design something like this, given these parameters and access patterns?
I'll be happy to provide clarification/details where needed, and thanks in advance for being awesome.
It is no problem at all to provide all this in a Relational database. PostgreSQL is not enterprise class, but it is certainly one of the better freeware SQLs.
"To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types)."
That is your biggest obstacle. Contrary to program design, which allows decomposition and isolated analysis/design of components, databases need to be designed as a single unit. Normalisation and other design techniques need to consider both the whole, and the component in context. The data, the descriptions, the metadata have to be evaluated together, not as separate parts.
Second, when you start off with surrogate keys, implying that you know the data, and how it relates to other data, it prevents you from genuine modelling of the data.
I have answered a very similar set of questions, coincidentally re very similar data. If you could read those answers first, it would save us both a lot of typing time on your question/answer.
Answer One/ID Obstacle
Answer Two/Main
Answer Three/Historical
I did something like this with seismic data for a petroleum exploration company.
My suggestion would be to store the meta-data in a database, and keep the sensor data in flat files, whatever that means for your computer's operating system.
You would have to write your own access routines if you want to modify the sensor data. Actually, you should never modify the sensor data. You should make a copy of the sensor data with the modifications so that you can show later what changes were made to the sensor data.

How do you verify the correct data is in a data mart?

I'm working on a data warehouse and I'm trying to figure out how to best verify that data from our data cleansing (normalized) database makes it into our data marts correctly. I've done some searches, but the results so far talk more about ensuring things like constraints are in place and that you need to do data validation during the ETL process (E.g. dates are valid, etc.). The dimensions were pretty easy as I could easily either leverage the primary key or write a very simple and verifiable query to get the data. The fact tables are more complex.
Any thoughts? We're trying to make it very easy for a subject matter expert to run a couple of queries, see some data from both the data cleansing database and the data marts, and visually compare the two to ensure they are correct.
You test your fact table loads by implementing a simplified, pared-down subset of the same data manipulation elsewhere, and comparing the results.
You calculate the same totals, counts, or other figures at least twice. Once from the fact table itself, after it has finished loading, and once from some other source:
the source data directly, controlling for all the scrubbing steps in between source and fact
a source system report that is known to be correct
etc.
If you are doing this in the database, you could write each test as a query that returns no records if everything is correct. Any records that get returned are exceptions: count of x by (y,z) does not match.
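A minimal sketch of such a "returns nothing when correct" test is shown here, using Python's built-in sqlite3 module so it can be run as-is. The staging_sales and fact_sales tables, their columns and the sample rows are assumptions for illustration; a real check would target your own cleansing database and fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_sales (region TEXT, amount INTEGER);
    CREATE TABLE fact_sales    (region TEXT, amount INTEGER);
    INSERT INTO staging_sales VALUES ('north', 10), ('north', 5), ('south', 7);
    INSERT INTO fact_sales    VALUES ('north', 10), ('north', 5), ('south', 7);
""")

# Compute the same counts and totals from both sides, then keep only the
# groups that do not match exactly. An empty result means the load agrees.
RECONCILE_SQL = """
    WITH src AS (SELECT region, COUNT(*) AS n, SUM(amount) AS total
                   FROM staging_sales GROUP BY region),
         fct AS (SELECT region, COUNT(*) AS n, SUM(amount) AS total
                   FROM fact_sales GROUP BY region)
    SELECT 'only or different in source' AS problem, * FROM
        (SELECT * FROM src EXCEPT SELECT * FROM fct)
    UNION ALL
    SELECT 'only or different in fact' AS problem, * FROM
        (SELECT * FROM fct EXCEPT SELECT * FROM src)
"""

def exceptions(conn):
    """Return one row per group whose count or total does not match."""
    return conn.execute(RECONCILE_SQL).fetchall()

clean = exceptions(conn)

# Simulate a faulty load that dropped a region from the fact table.
conn.execute("DELETE FROM fact_sales WHERE region = 'south'")
broken = exceptions(conn)
```

Each returned row names the side it came from along with the group, count and total, which gives a subject matter expert something concrete to compare by eye.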
See this excellent post by ConcernedOfTunbridgeWells for more recommendations.
Although it has some drawbacks and potential problems if you do a lot of cleansing or transforming, I've found you can round trip an input file by re-generating the input file from the star schema(s), then simply comparing the input file to the output file. It might require some massaging to make them match (one is left padded, the other right padded).
Typically, I had a program which used the same layout the ETL used and did a compare, ignoring alignment within a field. Also, the files might have to be sorted - there is a command-line sort I used.
If your ETL does a transform incorrectly and you transform out incorrectly, it's still possible that this method doesn't show every problem in the DW, and I wouldn't claim it has complete coverage, but it's a pretty good first whack at a regression unit test for each load.
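A comparison program of the kind described above might look roughly like this in Python. It is a sketch under assumptions: the fixed-width LAYOUT, the field names and the sample lines are invented, and a real version would read the layout the ETL itself uses. Counting records instead of comparing line by line also removes the need to sort the files first.

```python
from collections import Counter

# Hypothetical fixed-width layout: (field name, start column, end column).
LAYOUT = [("id", 0, 6), ("name", 6, 16), ("amount", 16, 24)]

def parse(line):
    """Split a fixed-width line into fields, ignoring padding and alignment."""
    return tuple(line[start:end].strip() for _, start, end in LAYOUT)

def compare(input_lines, regenerated_lines):
    """Return (records missing from the output, records only in the output)."""
    original = Counter(parse(l) for l in input_lines if l.strip())
    rebuilt = Counter(parse(l) for l in regenerated_lines if l.strip())
    missing = sorted((original - rebuilt).elements())
    unexpected = sorted((rebuilt - original).elements())
    return missing, unexpected

# Same records, different padding and a different line order.
input_lines = [
    f"{'1':>6}{'Alice':<10}{'12.50':>8}",
    f"{'2':>6}{'Bob':<10}{'7.00':>8}",
]
regenerated = [
    f"{'2':<6}{'Bob':>10}{'7.00':<8}",
    f"{'1':<6}{'Alice':>10}{'12.50':<8}",
]
missing, unexpected = compare(input_lines, regenerated)

# A load that altered an amount shows up on both sides of the comparison.
regenerated_bad = [regenerated[0], f"{'1':<6}{'Alice':>10}{'12.05':<8}"]
missing_bad, unexpected_bad = compare(input_lines, regenerated_bad)
```

As noted above, this only shows that the data survives the round trip; a transform that is wrong in both directions would still pass.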
