How to handle huge DM lookup tables in Spark ETLs

We have a requirement to process ETLs (data loading) with Spark, moving data from source applications into data mart tables.
While performing these loads, we need lookups against the target table and against other dimension tables. Pulling all of those lookup tables into Spark memory would be a heavy operation, and holding thousands of lookup/DM tables in memory would be a significant infrastructure cost for the company.
Is there a better way to handle this scenario? Example:
Loading Table A -> Table B
Lookup tables: Table C (~3 million rows), Table D (~500 million rows), Table E (~1 billion rows)
The end goal is to move the ETL loads off the traditional approach and onto open-source technology.
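For illustration, here is a minimal PySpark sketch of one way to handle this, assuming the lookups live in a JDBC-accessible database; all table names, join keys, and connection details below are placeholders. Only the small lookup is broadcast, while the very large one is read lazily (selecting just the join columns) and joined with a shuffle rather than being cached in memory.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dm-load-sketch").getOrCreate()

# Hypothetical JDBC source; URL, credentials, and table names are placeholders.
jdbc_url = "jdbc:postgresql://dbhost:5432/dm"
props = {"user": "etl_user", "password": "secret", "driver": "org.postgresql.Driver"}

source = spark.read.jdbc(jdbc_url, "source.table_a", properties=props)

# Small-ish lookup (~3M rows): broadcast it so the join avoids a shuffle.
table_c = spark.read.jdbc(jdbc_url, "dm.table_c", properties=props)
enriched = source.join(F.broadcast(table_c), on="c_key", how="left")

# Huge lookup (~1B rows): do not broadcast or cache it. Read only the join
# column and the surrogate key, and let Spark perform a shuffle (sort-merge)
# join; the table is read lazily instead of being pulled into memory.
table_e = (spark.read.jdbc(jdbc_url, "dm.table_e", properties=props)
                .select("e_key", "e_surrogate_id"))
enriched = enriched.join(table_e, on="e_key", how="left")

enriched.write.mode("append").jdbc(jdbc_url, "dm.table_b", properties=props)
```

The same pattern applies to target lookups: join against a pruned projection of the target table rather than collecting it to the driver.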
Thanks in advance.

Related

SQL Server - Inserting new data worsens query performance

We have a 4-5 TB SQL Server database. The largest table is around 800 GB and contains 100 million rows; 4-5 other comparable tables are 1/3-2/3 of this size. We went through a process of creating new indexes to optimize performance. While performance certainly improved, we saw that the newly inserted data was the slowest to query.
It's a financial reporting application with a BI tool working on top of the database. The data is loaded overnight, continuing into the late morning, though the majority of the data is loaded by 7 am. Users start to query data around 8 am through the BI tool and are most concerned with the latest (daily) data.
I wanted to know whether newly inserted data causes indexes to go out of order. Is there anything we can do to get better performance on the newly inserted data than on the old data? I hope I have explained the issue well; let me know if any information is missing. Thanks
Edit 1
Let me describe the architecture a bit.
I have a base table (let's call it Base) with (Date, Id) as the clustered index.
It has around 50 columns.
Then we have 5 derived tables (Derived1, Derived2, ...), according to different metric types, which also have (Date, Id) as the clustered index and a foreign key constraint referencing the Base table.
Tables Derived1 and Derived2 have 350+ columns; Derived3, 4, and 5 have around 100-200 columns. There is one large view that joins all the data tables, due to limitations of the BI tool. (Date, Id) are the joining columns for all the tables forming the view (hence the clustered index on those columns). The main concern is the BI tool's performance; the BI tool always uses the view and generally sends similar queries to the server.
There are other indexes as well on other filtering columns.
The main question remains: how do we prevent performance from deteriorating?
In addition, I would like to know:
Whether an NCI (nonclustered index) on (Date, Id) on all tables would be a better bet, in addition to the clustered index on (Date, Id).
Does it make sense to have 150 columns as included columns in the NCI for the derived tables?
You have about 100 million rows, growing every day with new portions, and those new portions are what is usually selected. With those numbers I would use partitioned indexes rather than regular indexes.
Your solution within SQL Server would be partitioning; take a look at SQL Server partitioning and see whether you can adopt it. Partitioning is a form of clustering where groups of data share a physical block. If you partition by year and month, for example, all 2018-09 records will share the same physical space and be easy to find, so if you select records with those filters (plus others) it is as if the table were only the size of the 2018-09 records. That is not exactly accurate, but it is close. Be careful with the values you partition on: unlike a standard PK cluster, where each value is unique, the partitioning column(s) should produce a reasonable set of distinct combinations, and thus partitions.
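As a rough illustration of the partitioned setup, the DDL looks roughly like the sketch below; all object names, boundary dates, and connection details are made up for the example, and the statements can just as well be run directly in SSMS as plain T-SQL.

```python
import pyodbc

# Placeholder connection string for the example.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes",
    autocommit=True,
)

# Monthly partition function and scheme on the Date column. In practice you
# would generate one boundary per month and keep adding boundaries ahead of
# the incoming data (a sliding window).
conn.execute("""
    CREATE PARTITION FUNCTION pf_BaseByMonth (date)
    AS RANGE RIGHT FOR VALUES ('2018-07-01', '2018-08-01', '2018-09-01');
""")
conn.execute("""
    CREATE PARTITION SCHEME ps_BaseByMonth
    AS PARTITION pf_BaseByMonth ALL TO ([PRIMARY]);
""")

# Rebuild the clustered index onto the partition scheme so each month's rows
# are stored together; queries filtering on Date then read only the relevant
# partitions. 'cix_Base' stands in for the existing clustered index name.
conn.execute("""
    CREATE CLUSTERED INDEX cix_Base
    ON dbo.Base ([Date], [Id])
    WITH (DROP_EXISTING = ON)
    ON ps_BaseByMonth ([Date]);
""")
```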
If you cannot use partitioning, you have to create 'partitions' yourself using regular indexes. This will require some experimentation. The basic idea is a column (a number, say) indicating a wave, or set of waves, of imported data. For example, data imported today and over the next 10 days is wave '1', the next 10 days are wave '2', and so on. By filtering on the latest 10 waves you work on the latest 100 days of imports and effectively skip all the rest of the data. Roughly, if you divided your existing 100 million rows into 100 waves, started at wave 101, and searched for waves 90 or greater, you would have about 10 million rows to search, provided SQL Server is persuaded to use the new index first (it will eventually).
This is a broad question, especially without knowing your system, but one thing I would try is manually updating your statistics on the indexes/tables once you are done loading data. With tables that big, it is unlikely that you will manipulate enough rows to trigger an auto-update. Without fresh statistics, SQL Server won't have an accurate histogram of your data.
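For example, a post-load step could refresh the statistics explicitly; the table names and connection string below are placeholders, and the UPDATE STATISTICS statements could just as easily run as the last step of the overnight load job.

```python
import pyodbc

# Placeholder connection string for the example.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes",
    autocommit=True,
)

# Refresh statistics on the tables the overnight load touched, so the
# morning's queries see an up-to-date histogram for the newest dates.
for table in ("dbo.Base", "dbo.Derived1", "dbo.Derived2"):
    conn.execute(f"UPDATE STATISTICS {table} WITH FULLSCAN;")
```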
Next, dive into your execution plans and see what operators are the most expensive.

Should I normalize a 1.3 million record flat file for analysis?

I've been handed an immense flat file of health insurance claims data. It contains 1.3 million rows and 154 columns. I need to do a bunch of different analyses on these data. This will be in SQL Server 2012.
The file has 25 columns for diagnosis codes (DIAG_CD01 through DIAG_CD_25), 8 for billing codes (ICD_CD1 through ICD_CD8), and 4 for procedure modifier codes (MODR_CD1 through MODR_CD4). It looks like it was dumped from a relational database. The billing and diagnosis codes are going to be the basis for much of the analysis.
So my question is whether I should split the file into a mock relational database. Writing analysis queries against a table like this will be a nightmare. If I split it into a parent table and three child tables (Diagnoses, Modifiers, and Bill_codes), my query code will be much easier. But if I do that I'll have, on top of the 1.3 million parent records, up to 32.5 million diagnosis records, up to 10.4 million billing code records, and up to 5.2 million modifier records. On the other hand, a huge portion of the flat data in those three column sets is null fields, which are supposed to hurt query performance.
What are the likely performance consequences of querying these data as a mock relational database vs. as the giant flat file? Reading about normalization it sounds like performance should be better, but the sheer number of records in a four table split gives me pause.
It seems like if you keep it denormalized you will have to repeat query logic a whole bunch of times (25 times for the diagnoses), and even worse, you will have to somehow aggregate all those pieces together.
Do as you suggested and split the data into logical tables like Diagnosis Codes, Billing Codes, etc., and your queries will be much easier to handle.
If you have a decent machine, these row counts should not be a performance problem for SQL Server. Just make sure you have indexes to help with your joins, etc.
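As an illustration of the split, the repeated code columns can be unpivoted into long child tables before loading them into SQL Server. The sketch below assumes the flat file is a CSV and that CLAIM_ID is the parent key column (an assumption; adjust to your actual key).

```python
import pandas as pd

# Assumption: the flat file is a CSV and CLAIM_ID is the parent key column.
claims = pd.read_csv("claims_flat_file.csv")

def unpivot(prefix: str, value_name: str) -> pd.DataFrame:
    """Turn the repeated <prefix>NN columns into one row per non-null code."""
    code_cols = [c for c in claims.columns if c.startswith(prefix)]
    long_df = claims.melt(id_vars=["CLAIM_ID"], value_vars=code_cols,
                          var_name="SLOT", value_name=value_name)
    return long_df.dropna(subset=[value_name]).drop(columns="SLOT")

diagnoses  = unpivot("DIAG_CD", "DIAG_CD")   # up to 25 rows per claim
bill_codes = unpivot("ICD_CD", "ICD_CD")     # up to 8 rows per claim
modifiers  = unpivot("MODR_CD", "MODR_CD")   # up to 4 rows per claim
```

Each child table then holds one row per non-null code, which is the shape the analysis joins want, and the nulls simply disappear.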
Good luck!

Slow lookup and data loading in Vertica

I am evaluating Vertica 8.1. I am really happy with its performance on the BI side; my reports are running 20 times faster. But it is very slow when it does lookups in the fact-loading transformation, and also when loading the dimension tables. The more lookup steps I have in a transformation, the slower it gets. I have created a lot of projections (one for each column in the dimension tables). I am not using Vertica's clustering feature.
My dimension table surrogate keys are integers. When loading a dimension table, I check the table's maximum key and add 1 to it for the next row. Could that be the problem? The lookups and data loading with Pentaho run at roughly 30 rows per second. To load the dimensions I am using the Dimension lookup/update step for SCD Type 2, and to load the combination dimensions I am using the Combination lookup/update step. I am loading the fact table using the Vertica bulk loader. I read the documentation on Vertica's website about connections and best practices for using Pentaho with Vertica, but it didn't seem helpful.
Suggestions?

SQLite performance advice for .NET

I am using SQLite in my application. The scenario is that I have stock market data, and each company is a database with one table. That table stores records that can range from a couple of thousand to half a million.
Currently, when I update the data in real time, I open a connection, check whether that particular record already exists, insert it if it does not, and close the connection. This is done in a loop so that each database (representing a company) is updated. The number of records inserted is low and is not the problem, but is the process okay?
An alternative is to have one database with many tables (one table per company), where each table can have a lot of records. Is this better or not?
You can expect around 500 companies. I am coding in VS 2010; the language is VB.NET.
The optimal organization for your data is to make it properly normalized, i.e., put all data into a single table with a company column.
This is better for performance because the table- and database-related overhead is reduced.
Queries can be sped up with indexes, but what indexes you need depends on the actual queries.
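A minimal sketch of that single-table layout, using Python's built-in sqlite3 module for brevity (the same SQL applies from VB.NET); the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect("market_data.db")

# One table for all companies; the company is just another column, and the
# primary key doubles as the "does this record already exist?" check.
conn.execute("""
    CREATE TABLE IF NOT EXISTS quotes (
        company TEXT NOT NULL,
        ts      TEXT NOT NULL,   -- timestamp of the quote
        price   REAL,
        volume  INTEGER,
        PRIMARY KEY (company, ts)
    )
""")

# Real-time update: INSERT OR IGNORE skips rows that already exist, so there
# is no separate SELECT before each insert.
rows = [("ACME", "2012-05-01T10:00:00", 12.34, 1500),
        ("ACME", "2012-05-01T10:00:05", 12.36, 900)]
conn.executemany(
    "INSERT OR IGNORE INTO quotes (company, ts, price, volume) VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()
conn.close()
```

Queries for a single company then simply filter on the company column, which is the leading column of the primary key.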
I did something similar, with similarly sized data in another field. It depends a lot on your indexes. Ultimately, separating each large table was best (one table per file, each representing a cohesive unit, in your case one company). You also gain the advantage of every company table having the same name, versus having x tables with different names but the same schema (and no sanitizing of company names is required to create new tables).
Internally, other DBMSs often keep at least one file per table in their internal structure; SQL is thus just a layer of abstraction above that. SQLite (despite its creators' boasting) is meant for smaller projects, and querying larger data models gets more finicky to make work well.

Choosing table design for database performance

I am developing a job application which executes multiple parallel jobs. Every job pulls data from a third-party source and processes it; the minimum is 100,000 records. So I am creating a new table for each job (like Job123, where 123 is the job ID) and processing it. When a job starts, it clears the old records, gets the new records, and processes them. Now the problem is that I have 1,000 jobs and the DB has 1,000 tables. The DB size has drastically increased due to the large number of tables.
My question is whether it is okay to create a new table for each job, or to have only one table called Job with a jobId column, then load the data into it and process it. The only concern is that every job will have 100,000+ records. If we have only one table, will DB performance be affected?
Please let me know which approach is better.
Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine, that's what databases are for. But...I suspect that you don't need 100 million persistent records, do you? It looks like you only process one Job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing. But even after the delete it will take up as much space as 100k records.
I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.
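A rough sketch of that single-table design, with the Job ID leading the clustered index; all object names and the connection string are illustrative, and the T-SQL is shown here submitted through Python's pyodbc.

```python
import pyodbc

# Placeholder connection string for the example.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=jobs;Trusted_Connection=yes",
    autocommit=True,
)

# One table for all jobs. JobId leads the clustered index, so each job's rows
# are stored together and single-job reads stay cheap.
conn.execute("""
    IF OBJECT_ID('dbo.JobData') IS NULL
        CREATE TABLE dbo.JobData (
            JobId    int           NOT NULL,
            RecordId bigint        NOT NULL,
            Payload  nvarchar(max) NULL,
            CONSTRAINT PK_JobData PRIMARY KEY CLUSTERED (JobId, RecordId)
        );
""")

# When a job restarts: clear only that job's rows, then reload them.
job_id = 123
conn.execute("DELETE FROM dbo.JobData WHERE JobId = ?;", job_id)
conn.execute(
    "INSERT INTO dbo.JobData (JobId, RecordId, Payload) VALUES (?, ?, ?);",
    job_id, 1, "first record of this batch",
)
```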
