compare two large files to reconciliate financial transactions

compare two large files to reconciliate financial transactions - file

I am trying to write a program to compare two large files: two files should compare financial transactions every day. files can be xml or csv format.
there are between 3 and 4 million lines per file and fifty columns. Reconciliation occurs on the basis of an area defined by a set of key fields.
output must identify rows with the same key but where data is different
I used SQL comparison (each file in a table), it works but it requires a database such as Oracle and a powerful server
t there is a solution using MapReduce concepts or bases nosql

I don't think it's a problem to do a daily comparison of 2 3-4m indexed tables in any rdbms (oracle, sql server, mysql, postgre) for that matter and it wouldn't take too long too.
You may use a MapReduce based data processing system such as Hadoop to do the same as well. There are a few Hadoop-as-a-service platforms out there including ours (Xplenty) which can help you do it quickly and with pay-per-use pricing so that you can lower your costs to do this type of processing. I wouldn't recommend using a MapReduce based solution for a simple comparison of a few million records but if the comparison is complex then you may give it a try.

Related

Looking for a TSBD (Time Series Database) that supports multiple observations for the same Timestamp

I am looking for a time series database that supports storing multiple values for each observation.
To be precise, it is about weather forecasts that change multiple times and thus multiple values exist for each TimeStamp. Therefore, multiple values must be stored for each TimeStamp, which themselves have a timestamp (namely the time at which the forecast was created).
Does anyone know of a TSDB that supports such functionality? Or do I have to assemble something myself?
Many thanks and best regards
Johannes

Disclaimer: I'm working for TimescaleDB
TimescaleDb (timescaledb.com) is an extension built upon Postgresql. That said, you can use all the existing functionality and analytical tools from Postgres itself. Upon that, Timescaledb adds an extension to the query planner which makes sure that time-series data (stored in Timescale's Hypertables) are stored in time-based chunks. That optimizes the ingest and query speeds.
In terms of Hypertable design, you have multiple options. Hypertables are an extension of the standard database tables and can have multiple value columns per row, but you can also store the values one-per-row.
If you have multiple rows per timestamp and location (or however you have grouped your observations), you can use standard SQL (such as max) or advanced Timescale-provided functions (such as last) to retrieve the latest forecast.

How to compare two database Teradata and BigQuery tables

We have 2 database - Teradata and BigQuery.
We need to compare data from all tables in Teradata to BigQuery
Due to large number of tables and volumes of data, its not possible to extract data and perform diff operation.
There are tools available like Jetbrains DataGrip which can help connect to teradata via JDBC (in absence of db connector). and the same for Big Query via simba driver and OAUTH connectivity. But these are still very time consuming activity.
https://www.jetbrains.com/datagrip/quick-start/
https://blog.jetbrains.com/datagrip/2018/07/10/using-bigquery-from-intellij-based-ide/
Is there any other less expensive option available for comparing two databases ?

How about making a copy of Teradata-DB in BQ and compare both of them as follow:
https://medium.com/google-cloud/bigquery-table-comparison-cea802a3c64d

Provided content of source (Teradata) DB have not changed since the migration, the common way to quickly verify the two tables are identical is to compute some hash of each table in source and destination.
You would need some fingerprint or hash function that produces the same results in both systems. I think MD5 and SHA1 (available as extensions for Teradata) are good options, but you need to validate they produce identical values for each type, which might be tricky for some types like floating point.
Thus, for each row I would compute something along the lines of
SHA1(CONCAT(string_col1, "**", CAST(int_col2 AS STRING), "**", string_col3))
and then aggregate it using BIT_XOR to get a single value for the table, which does not depend on the order of the rows. Since all computations are done by the DBMS and data does not leave it, the computations should be relatively fast.
I would do something like the following for each table
Compare COUNT(*) to make sure we have same number of rows
Compute for both databases the single fingerprint as described above.
If the two fingerprints match, the tables are likely identical.
If counts or fingerprints don't match, go slow route and find mismatch by downloading data.

Should I normalize a 1.3 million record flat file for analysis?

I've been handed an immense flat file of health insurance claims data. It contains 1.3 million rows and 154 columns. I need to do a bunch of different analyses on these data. This will be in SQL Server 2012.
The file has 25 columns for diagnosis codes (DIAG_CD01
through DIAG_CD_25), 8 for billing codes (ICD_CD1 through ICD_CD8), and 4 for procedure modifier codes (MODR_CD1 through MODR_CD4). It looks like it was dumped from a relational database. The billing and diagnosis codes are going to be the basis for much of the analysis.
So my question is whether I should split the file into a mock relational database. Writing analysis queries on a table like this will be a nightmare. If I split it into a parent table and three child tables (Diagnoses, Modifiers, and Bill_codes) my query code will much easier. But if I do that I'll have, on top of the 1.3 million parent records, up to 32.5 million diagnosis records, up to 10.4 million billing code records, and up to 5.2 million modifier records. On the other hand, a huge portion of the flat data of the three sets is null fields, which are supposed to screw up query performance.
What are the likely performance consequences of querying these data as a mock relational database vs. as the giant flat file? Reading about normalization it sounds like performance should be better, but the sheer number of records in a four table split gives me pause.

Seems like if you keep it denormalized you would have to repeat query logic a whole bunch of times (25 for Diagnoses), and even worse, you have to somehow aggregate all those pieces together.
Do like you suggested and split the data into logical tables like Diagnosis Codes, Billing Codes, etc. and your queries will be much easier to handle.
If you have a decent machine these row counts should not be a performance problem for sql server. Just make sure you have indexes to help with your joins, etc.
Good luck!

Storing time-series data, relational or non?

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
What is better, a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) with regard to performance when querying data for graphing.
Relational
Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id fk_to_device fk_to_metric metric_value timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d is the number of devices, m_d is the accumulative number of metrics being recorded for all devices, f is the frequency at which data is polled for and t is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.
Indexes
Without indexes on fk_to_device and fk_to_metric scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.
Non-Relational (NoSQL)
MongoDB has the concept of a collection, unlike tables these can be created programmatically without setup. With these I could partition the storage of data for each device, or even each metric recorded for each device.
I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?

Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
(Device, Metric, DateTime)
The Id column does nothing, it is totally and completely redundant.
An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.
You can get rid of it. Please.
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.
(I have one index only, not three; on the Non-SQLs you may need three indices).
I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.
(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
Monitor Statistics Data Model.
(Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)
It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
One More Thing
Last but not least, SQL is a IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.

Found very interesting the above answers.
Trying to add a couple more considerations here.
1) Data aging
Time-series management usually need to create aging policies. A typical scenario (e.g. monitoring server CPU) requires to store:
1-sec raw samples for a short period (e.g. for 24 hours)
5-min detail aggregate samples for a medium period (e.g. 1 week)
1-hour detail over that (e.g. up to 1 year)
Although relational models make it possible for sure (my company implemented massive centralized databases for some large customers with tens of thousands of data series) to manage it appropriately, the new breed of data stores add interesting functionalities to be explored like:
automated data purging (see Redis' EXPIRE command)
multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
2) Real-time collection
Even more importantly some non-relational data stores are inherently distributed and allow for a much more efficient real-time (or near-real time) data collection that could be a problem with RDBMS because of the creation of hotspots (managing indexing while inserting in a single table). This problem in the RDBMS space is typically solved reverting to batch import procedures (we managed it this way in the past) while no-sql technologies have succeeded in massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).

You table has data in single table. So relational vs non relational is not the question. Basically you need to read a lot of sequential data. Now if you have enough RAM to store a years worth data then nothing like using Redis/MongoDB etc.
Mostly NoSQL databases will store your data on same location on disk and in compressed form to avoid multiple disk access.
NoSQL does the same thing as creating the index on device id and metric id, but in its own way. With database even if you do this the index and data may be at different places and there would be a lot of disk IO.
Tools like Splunk are using NoSQL backends to store time series data and then using map reduce to create aggregates (which might be what you want later). So in my opinion to use NoSQL is an option as people have already tried it for similar use cases. But will a million rows bring the database to crawl (maybe not , with decent hardware and proper configurations).

Create a file, name it 1_2.data. weired idea? what you get:
You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
You save up even more space because you don't need any indices.
Save pairs of (timestamp,metric_value) to the file by appending the data so you get a order by timestamp for free. (assuming that your sources don't send out of order data for a device)
=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.
if you like it even more optimized start thinking about splitting your files like that;
1_2_january2014.data
1_2_february2014.data
1_2_march2014.data
or use kdb+ from http://kx.com because they do all this for you:) column-oriented is what may help you.
There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru

You should look into Time series database. It was created for this purpose.
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
Popular example of time-series database InfluxDB

I think that the answer for this kind of question should mainly revolve about the way your Database utilize storage.
Some Database servers use RAM and Disk, some use RAM only (optionally Disk for persistency), etc.
Most common SQL Database solutions are using memory+disk storage and writes the data in a Row based layout (every inserted raw is written in the same physical location).
For timeseries stores, in most cases the workload is something like: Relatively-low interval of massive amount of inserts, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric)
I have found Columnar Databases (google it, you'll find MonetDB, InfoBright, parAccel, etc) are doing terrific job for time series.
As for your question, which personally I think is somewhat invalid (as all discussions using the fault term NoSQL - IMO):
You can use a Database server that can talk SQL on one hand, making your life very easy as everyone knows SQL for many years and this language has been perfected over and over again for data queries; but still utilize RAM, CPU Cache and Disk in a Columnar oriented way, making your solution best fit Time Series

5 Millions of rows is nothing for today's torrential data. Expect data to be in the TB or PB in just a few months. At this point RDBMS do not scale to the task and we need the linear scalability of NoSql databases. Performance would be achieved for the columnar partition used to store the data, adding more columns and less rows kind of concept to boost performance. Leverage the Open TSDB work done on top of HBASE or MapR_DB, etc.

I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.

Column Stores: Comparing Column Based Databases

I've really been struggling to make SQL Server into something that, quite frankly, it will never be. I need a database engine for my analytical work. The DB needs to be fast and does NOT need all the logging and other overhead found in typical databases (SQL Server, Oracle, DB2, etc.)
Yesterday I listened to Michael Stonebraker speak at the Money:Tech conference and I kept thinking, "I'm not really crazy. There IS a better way!" He talks about using column stores instead of row oriented databases. I went to the Wikipedia page for column stores and I see a few open source projects (which I like) and a few commercial/open source projects (which I don't fully understand).
My question is this: In an applied analytical environment, how do the different column based DB's differ? How should I be thinking about them? Anyone have practical experience with multiple column based systems? Can I leverage my SQL experience with these DBs or am I going to have to learn a new language?
I am ultimately going to be pulling data into R for analysis.
EDIT: I was requested for some clarification in what exactly I am trying to do. So, here's an example of what I would like to do:
Create a table that has 4 million rows and 20 columns (5 dims, 15 facts). Create 5 aggregation tables that calculate max, min, and average for each of the facts. Join those 5 aggregations back to the starting table. Now calculate the percent deviation from mean, percent deviation of min, and percent deviation from max for each row and add it to the original table. This table data does not get new rows each day, it gets TOTALLY replaced and the process is repeated. Heaven forbid if the process must be stopped. And the logs... ohhhhh the logs! :)

The short answer is that for analytic data, a column store will tend to be faster, with less tuning required.
A row store, the traditional database architecture, is good at inserting small numbers of rows, updating rows in place, and querying small numbers of rows. In a row store, these operations can be done with one or two disk block I/Os.
Analytic databases typically load thousands of records at a time; sometimes, as in your case, they reload everything. They tend to be denormalized, so have a lot of columns. And at query time, they often read a high proportion of the rows in the table, but only a few of these columns. So, it makes sense from an I/O standpoint to store values of the same column together.
Turns out that this gives the database a huge opportunity to do value compression. For instance, if a string column has an average length of 20 bytes but has only 25 distinct values, the database can compress to about 5 bits per value. Column store databases can often operate without decompressing the data.
Often in computer science there is an I/O versus CPU time tradeoff, but in column stores the I/O improvements often improve locality of reference, reduce cache paging activity, and allow greater compression factors, so that CPU gains also.
Column store databases also tend to have other analytic-oriented features like bitmap indexes (yet another case where better organization allows better compression, reduces I/O, and allows algorithms that are more CPU-efficient), partitions, and materialized views.
The other factor is whether to use a massively parallel (MMP) database. There are MMP row-store and column-store databases. MMP databases can scale up to hundreds or thousands of nodes, and allow you to store humungous amounts of data, but sometimes have compromises like a weaker notion of transactions or a not-quite-SQL query language.
I'd recommend that you give LucidDB a try. (Disclaimer: I'm a committer to LucidDB.) It is open-source column store database, optimized for analytic applications, and also has other features such as bitmap indexes. It currently only runs on one node, but utilizes several cores effectively and can handle reasonable volumes of data with not much effort.

4 million rows times 20 columns times 8 bytes for a double is 640 mb. Following the rule of thumb that R creates three temporary copies for every object, we get to around 2 gb. That is not a lot by today's standard.
So this should be doable in memory on a suitable 64-bit machine with a 'decent' amount of ram (say 8 gb or more). Installing Ubuntu or Debian (possibly in the server version) can be done in a few minutes.

I have some experience with Infobright Community edition --- column-or. db, based on mysql.
Pro:
you can use mysql interfaces/odbc mysql drivers, from R too
fast enough queries on big chunks of data selection (because of KnowledgeGrid & data packs)
very fast native data loader and connectors for ETL (talend, kettle)
optimized exactly that operations what I (and I think most of us) use (selection by factor levels, joining etc)
special "lookup" option for optimized storing R factor variables ;) (ok, char/varchar variables with relatively small levels number/rows number)
FOSS
paid support option
?
Cons:
no insert/update operations in Community edition (yet?), data loading only via native data loader/ETL connectors
no utf-8 official support (collation/sort etc), planned for q3 2009
no functions in aggregate queries f.e. select month (date) from ...) yet, planned for July(?) 2009, but because of column storage, I prefer simply create date columns for every aggregation levels (week number, month, ...) I need
cannot installed on existing mysql server as storage engine (because of own optimizer, if I understood correctly), but you may install Infobright & mysql on different ports if you need
?
Resume:
Good FOSS solution for daily analytical tasks, and, I think, your tasks as well.

Here is my 2 cents: SQL server does not scale well. We attempted to use SQL server to store financial data in real time (i.e. prices ticks coming in for 100 symbols). It worked perfectly for the first 2 weeks - then it went slower and slower as the database size increased, and finally ground to a halt, too slow to insert each price as it was received. We tried to work around it by moving data from the active database to offline storage every night, but ultimately the project was abandoned as it just didn't work.
Bottom line: if you're planning on storing a lot of data ( >1GB) you need something that scales properly, and that probably means a column database.

It looks like an implementation change (2-D array in column-major order, instead of row-major order), rather than an interface change.
Think "strategy" pattern, rather than being an entire paradigm shift. Of course, I've never used these products, so they may in fact force a paradigm shift down your throat. I don't know why, though.

We might be better able to help you reach an informed decision if you described [1] your specific goal and [2] the issues you're running into with SQL Server.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight