In-memory Database in Excel

I am looking for a way to import a data table from Access into an Excel variable and then run queries against this variable to speed up the process. I am trying to migrate from C# .NET, where I read a data table from an Access database into memory and then used LINQ to query this dataset. It is MUCH faster than how I currently have it coded in VBA, where I must make lots of calls to the actual database, which is slow. I have seen QueryTable mentioned, but it appears that this requires pasting the data into the Excel sheet. I would like to keep everything in memory and minimize the interaction between the Excel sheet and the VBA code as much as possible.
I wish we didn't need to use Excel+VBA to do this, but we're kind of stuck with that for now. Thanks for the help!

I don't know of anything like LINQ for VBA.
If you keep the ADO Connection object in scope by making it Public, you can execute commands against it. It's not as fast as LINQ, but it's definitely faster than creating and destroying Connection objects for every call.
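For illustration, a minimal sketch of that pattern, keeping a Public connection open in a standard module and reusing it across calls (the database path, table and field names are made-up placeholders):

    'In a standard module; requires a reference to Microsoft ActiveX Data Objects
    Public gConn As ADODB.Connection

    Public Sub OpenDb()
        Set gConn = New ADODB.Connection
        'Placeholder path - point this at your own .accdb
        gConn.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Data\MyDb.accdb"
    End Sub

    Public Function GetOrderCount(ByVal CustomerID As Long) As Long
        Dim rs As ADODB.Recordset
        'Reuse the open connection instead of creating a new one per call
        Set rs = gConn.Execute("SELECT COUNT(*) FROM Orders WHERE CustomerID = " & CustomerID)
        GetOrderCount = rs.Fields(0).Value
        rs.Close
    End Function

    Public Sub CloseDb()
        If Not gConn Is Nothing Then gConn.Close
        Set gConn = Nothing
    End Sub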
If the tables aren't too huge, I tend to read the tables into custom classes in VBA with the appropriate parent/child relationships set up. The very obvious downside to this is that you can't use SQL to get a recordset of data from your classes, so you have to use a lot of looping when you need more than one specific record. That means if you have 1M records, it would be quicker to call the database.
If you're interested in the last approach, you can read some of the stuff I've written on it here:
http://www.dailydoseofexcel.com/archives/2008/12/07/vba-framework/
http://www.dailydoseofexcel.com/archives/2008/11/15/creating-classes-from-access-tables/
http://www.dailydoseofexcel.com/archives/2007/12/28/terminating-dependent-classes/ (read Rob Bruce's comment)
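For a rough idea of what that custom-class approach can look like, here is a minimal sketch (the class, field and table names are made up, and it assumes a reference to Microsoft ActiveX Data Objects):

    'Class module CCustomer (a made-up example)
    Public CustomerID As Long
    Public Name As String
    Public Orders As Collection              'child records, loaded once up front

    'Standard module: load all customers in a single query, then work in memory
    Public Function LoadCustomers(ByVal cn As ADODB.Connection) As Collection
        Dim rs As ADODB.Recordset, cust As CCustomer
        Dim result As New Collection
        Set rs = cn.Execute("SELECT CustomerID, Name FROM Customers")  'placeholder SQL
        Do While Not rs.EOF
            Set cust = New CCustomer
            cust.CustomerID = rs!CustomerID
            cust.Name = rs!Name
            Set cust.Orders = New Collection   'fill from an Orders query in the same way
            result.Add cust, CStr(cust.CustomerID)
            rs.MoveNext
        Loop
        rs.Close
        Set LoadCustomers = result
    End Function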

I would just read it into an ADO recordset, then get the data I need from the recordset as I need it. Of course this will depend on the size of the table you want to read.
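One way to do that while keeping the data out of the worksheet entirely is a disconnected, client-side recordset: pull the table once, drop the connection, and then use Filter/Find/Sort on the in-memory copy. A minimal sketch, assuming a reference to Microsoft ActiveX Data Objects (the path, table and criteria are placeholders):

    Public Sub DemoDisconnectedRecordset()
        Dim cn As ADODB.Connection
        Dim rs As ADODB.Recordset

        Set cn = New ADODB.Connection
        cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Data\MyDb.accdb"  'placeholder path

        Set rs = New ADODB.Recordset
        rs.CursorLocation = adUseClient          'client-side cursor, so the data lives in memory
        rs.Open "SELECT * FROM Customers", cn, adOpenStatic, adLockBatchOptimistic

        Set rs.ActiveConnection = Nothing        'disconnect - no further round trips to the .accdb
        cn.Close

        'Query the in-memory copy as often as you like
        rs.Filter = "Country = 'UK' AND Balance > 1000"  'placeholder criteria
        Do While Not rs.EOF
            Debug.Print rs!CustomerName
            rs.MoveNext
        Loop
        rs.Filter = adFilterNone                 'clear the filter for the next query
    End Sub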

Related

Multiple Tables Load from Snowflake to Snowflake using ADF

I have source tables in Snowflake and Destination tables in Snowflake.
I need to load data from source to destination using ADF.
Requirement: I need to load the data using a single pipeline for all the tables.
E.g., suppose I have 40 tables in the source and need to load all 40 tables' data into the destination tables. I need to create a single pipeline to load all the tables at once.
Can anyone help me in achieving this?
Thanks,
P.
This is a fairly broad question. So take this all as general thoughts, more than specific advice.
Feel free to ask more specific questions, and I'll try to update/expand on this.
ADF is useful as an orchestration/monitoring process, but it can be tricky to manage the actual copying and maneuvering of data in Snowflake. My high-level recommendation is to write your logic and loading code in Snowflake stored procedures, then use ADF to orchestrate by simply calling those stored procedures. You get the benefits of using ADF for what it is good at, and you let Snowflake do the heavy lifting, which is what it is good at.
Hopefully you can parameterize the procedures so that you have one procedure (or a few) that takes a table name and dynamically figures out column names and the like to run your loading process.
Assorted notes on implementation:
ADF has a native Snowflake connector. It is fairly new, so a lot of online posts will tell you how to set up a custom ODBC connector. You don't need to do this. Use the native connector with the auto-resolve integration runtime and it should work for you.
You can write a query in an ADF lookup activity to output your list of tables, along with any needed parameters (like primary key, order by column, procedure name to call, etc.), then feed that list into an ADF foreach loop.
foreach loops are a little limited in that there are some things you can't nest inside a loop (like conditionals). If you need extra functionality, you can have the foreach loop call a child ADF pipeline (passing in those parameters) and have the child pipeline manage your table-processing logic.
Snowflake has pretty good options for querying metadata based on a table name. See INFORMATION_SCHEMA. Between that and just a tiny bit of JavaScript logic, it's not too bad to generate dynamic queries (e.g. with column names specific to a provided table name).
If you do want to use ADF's Copy activities, I think you'll need to set up an intermediary Azure Storage account connection. I believe this is because it uses COPY INTO under the hood, which requires external storage.
ADF doesn't have many good options for preventing one pipeline from running multiple times at once. Either make sure your code can handle edge cases like this, or make sure your scheduling/timeouts won't allow that scenario with a pipeline running too long.
Extra note:
I don't know how tied you are to ADF, but without more context, I might suggest a quick look into DBT for this use case. It's a great tool for this specific scenario of Snowflake to Snowflake processing/transforming. My team's been much happier since moving some of our projects from ADF to DBT. (not sponsored :P )

Creating Excel files dynamically based on a parameter

The task: based on data returned by a SQL query, produce automated periodic reports individually for each customer and save each as a separate Excel file named customer_name+[YY-MM].xlsx (or as one timestamped Excel file with a separate worksheet for each customer).
The CustomerIDs will likely vary each month; the target customer list is generated by a SQL query.
What would be the best technology? Studied so far:
SSIS: Names of Excel files or tabs must be pre-defined in the Control Flow and Data Flow tasks and cannot be set dynamically. Output is unformatted CSV (bad, but I can cope with it).
Excel: I can embed the bulk data for all customers into Excel, but I'm not sure it's possible to write a macro that will fetch the unique customers, create corresponding tabs, and put the corresponding data in each one (a rough sketch of such a macro appears after the answer below).
SSRS: The subscription function is disabled at the corporate level, so I cannot use it :( Even so, I'm not sure it accepts dynamic parameters like this.
Did I miss any option? Or maybe there is other technology available?
SSIS is going to be your best option here for the following reasons:
You can set up a set of excel templates if you have a finite number of patterns, then use SSIS to populate and rename the template pretty easily.
If you need finer control, you can use Interop and C# to do pretty much whatever you want in Excel; however, this is very slow to execute and time-consuming to build.
It's easy to set up an execution loop over a dataset (your list of customers) and include which templates or tabs they need as part of the data set, which makes maintenance and adding new customers simpler and possibly something that can be moved to the business side.
Of course SSIS has a few drawbacks as well, such as the difficulty of getting the Excel column formats to cooperate when using the Excel connector, but it is still probably the best option.
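As an aside on the Excel option raised in the question: such a macro is certainly possible. A minimal VBA sketch, assuming the bulk data sits on a sheet named "Data" with a header row and the customer ID in column A (all names are placeholders):

    Public Sub SplitByCustomer()
        Dim wsData As Worksheet, wsOut As Worksheet
        Dim dict As Object
        Dim r As Long, lastRow As Long
        Dim id As String

        Set wsData = ThisWorkbook.Worksheets("Data")       'placeholder sheet name
        Set dict = CreateObject("Scripting.Dictionary")    'tracks the unique customers
        lastRow = wsData.Cells(wsData.Rows.Count, "A").End(xlUp).Row

        For r = 2 To lastRow                               'row 1 is the header
            id = CStr(wsData.Cells(r, "A").Value)
            If Not dict.Exists(id) Then
                'New customer: add a tab and copy the header row
                Set wsOut = ThisWorkbook.Worksheets.Add( _
                    After:=ThisWorkbook.Worksheets(ThisWorkbook.Worksheets.Count))
                wsOut.Name = Left$(id, 31)                 'sheet names are capped at 31 characters
                wsData.Rows(1).Copy wsOut.Rows(1)
                dict.Add id, wsOut
            End If
            Set wsOut = dict(id)
            'Append this row below the last used row on the customer's tab
            wsData.Rows(r).Copy wsOut.Cells(wsOut.Rows.Count, "A").End(xlUp).Offset(1, 0)
        Next r
    End Sub

From there you could either keep the tabs in one timestamped workbook or save each sheet out as its own customer_name+[YY-MM].xlsx file.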

Difference between using Access.Application object vs. a database connection

I've been scouring for information on how to add and query data within Excel VBA from an ACCDB. I've come across many answers: OpenDatabase() from my coworker, database connections, and using an Access.Application object. What I couldn't figure out is: is there a benefit to using the Access object instead of creating a connection to a database with a connection string and such? I did read that with the Access Application object I didn't need to have the Access engine on the computer running the VBA, and I opted for it for that reason. Plus, it looked a lot simpler than using a connection string and going that route. I've implemented the Access object and it worked like a charm. So my question is: what's the benefit or disadvantage of the Access object approach vs. doing it another way? Thanks all!
Is the 10k an incremental addition to the DB, or is your CSV input increasing by 10k?
If it's the former, then yes, storing it in a database is a good idea, and I would use the DAO route. You'll notice that not many people are fans of firing up the Access application, mainly because you're not really using MS Access's features (it's much more than a data store).
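For reference, the DAO route from Excel VBA looks roughly like this, late-bound so no project reference is strictly required (the path, table and SQL are placeholders):

    Public Sub DaoExample()
        Dim dbe As Object, db As Object, rs As Object

        'Late-bound ACE DAO engine (installed with Access or the Access Database Engine redistributable)
        Set dbe = CreateObject("DAO.DBEngine.120")
        Set db = dbe.OpenDatabase("C:\Data\MyDb.accdb")       'placeholder path

        'Read
        Set rs = db.OpenRecordset("SELECT * FROM Customers")  'placeholder SQL
        Do While Not rs.EOF
            Debug.Print rs.Fields("CustomerName").Value
            rs.MoveNext
        Loop
        rs.Close

        'Write
        db.Execute "INSERT INTO Customers (CustomerName) VALUES ('Test')", 128  '128 = dbFailOnError
        db.Close
    End Sub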
As an alternative, skip Excel and put your macro inside Access, since you have the app. There are a lot of goodies in Access that you can take advantage of.
However, if your CSV always arrives at full volume, you may just want to process the data yourself within Excel/VBA. I assume that the "other" table is a reference table.

Dataset retrieving data from another dataset

I work with an application that is switching from file-based data storage to database-based storage. It has a very large amount of code written specifically for the file-based system. To make the switch, I am implementing functionality that works like the old system; the plan is then to make more optimal use of the database in new code.
One problem is that the file-based system often read single records, and read them repeatedly for reports. This has turned into a lot of queries to the database, which is slow.
The idea I have been trying to flesh out is using two datasets. One dataset to retrieve an entire table, and another dataset to query against the first, thereby decreasing communication overhead with the database server.
I've tried to look at the DataSource property of TADODataSet but the dataset still seems to require a connection, and it asks the database directly if Connection is assigned.
The reason I would prefer to get the result in another dataset, rather than navigating the first one, is that there is already implemented a good amount of logic for emulating the old system. This logic is based on having a dataset containing only the results as queried with the old interface.
The functionality only has to support reading data, not writing it back.
How can I use one dataset to supply values for another dataset to select from?
I am using Delphi 2007 and MSSQL.
You can use a ClientDataSet/DataSetProvider pair to fetch data from an existing DataSet. You can use filters on the source dataset, filters on the ClientDataSet and provider events to trim the dataset only to the interesting records.
I've used this technique with success in a couple of migration projects, and to mitigate a similar situation where an old SQL Server 7 database was queried thousands of times to retrieve individual records, with painful performance costs. Querying it only once and then fetching individual records from the client dataset was, at the time, not only an elegant solution but a great performance boost for that particular application: the best example was an 8-hour process reduced to 15 minutes... the poor users loved me that time.
A ClientDataSet is just a TDataSet you can seamlessly integrate into existing code and UI.

How do you typically import data from a spreadsheet to multiple database columns?

For whatever reason, I have a lot of clients that have existing data that's stored in spreadsheets. Often there are hundreds, if not thousands of items in each spreadsheet, and asking the client to manually enter them through a website (or heaven forbid importing them that way myself) is out of the question. Typically, this data doesn't simply map spreadsheet column to database column. That would be too easy. Often, the data needs to be manipulated before going into the database (data needs to be split by commas, etc) or the data needs to be spread out across multiple tables. Or both.
I'm asking this question, not because I don't know of a multitude of ways to do it, but because I haven't settled on a way that doesn't feel like it takes more work than it should. So far I've taken all of the following approaches (and probably more that I've forgotten):
Using Excel to modify the data, so it's a little bit easier to import
Importing the entire spreadsheet into a temporary table and then importing with SQL
Writing a script and importing the data with it (I've used VBScript, C# and now Ruby)
So far, using a script has been the way that seemed most flexible, but it still feels a little clunky. I have to perform this task enough that I've even contemplated writing a little DSL for it, just to speed things up.
But before I do that, I'm curious, is there a better way?
You have to set boundaries if you can. You should try to provide a template for them to use with the expected data, which includes the file type (Excel, CSV, etc.), column names, valid values, and so on. You should allow the user to browse for the file and upload it on your page/form.
Once the file is uploaded, you need to do validation and import. You can use ADO.NET, file streams, DTS/SSIS, or Office automation to do this (if you are using the Microsoft stack). In the validation portion, you should tell the user exactly what they did wrong or need to change. This might mean having the validation page show the actual data in a data grid and providing red labels with errors on the exact row/column. If you use Office automation, you can give them the exact cell number, but the Office PIA is a pain in the neck.
Once validation is accepted, you can import the information however you like. I prefer putting it into a staging table and using a stored proc to load it, but that's just me. Some prefer to use the object model, but this can be very slow if you have a lot of data.
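Staying with the Excel/VBA theme elsewhere on this page, here is a rough sketch of that staging-table approach driven from VBA with ADO; the connection string, sheet, table, column and procedure names are all made-up placeholders, and it assumes a reference to Microsoft ActiveX Data Objects:

    Public Sub LoadStagingTable()
        Dim cn As ADODB.Connection, cmd As ADODB.Command
        Dim ws As Worksheet
        Dim r As Long, lastRow As Long

        Set cn = New ADODB.Connection
        'Placeholder connection string - point this at your own server/database
        cn.Open "Provider=SQLOLEDB;Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=SSPI;"

        Set ws = ThisWorkbook.Worksheets("Import")         'placeholder sheet name
        lastRow = ws.Cells(ws.Rows.Count, "A").End(xlUp).Row

        Set cmd = New ADODB.Command
        Set cmd.ActiveConnection = cn
        cmd.CommandText = "INSERT INTO stg_Items (ItemCode, ItemName) VALUES (?, ?)"  'placeholder table
        cmd.Parameters.Append cmd.CreateParameter("ItemCode", adVarChar, adParamInput, 50)
        cmd.Parameters.Append cmd.CreateParameter("ItemName", adVarChar, adParamInput, 200)

        For r = 2 To lastRow                               'row 1 is the header
            cmd.Parameters(0).Value = ws.Cells(r, "A").Value
            cmd.Parameters(1).Value = ws.Cells(r, "B").Value
            cmd.Execute
        Next r

        'Hand off to a stored proc for validation and the real load (placeholder name)
        cn.Execute "EXEC dbo.usp_LoadItemsFromStaging"
        cn.Close
    End Sub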
If you are personally loading these files manually and having to go in and manipulate them, I would suggest finding the commonality among them and coming up with a standard to follow. Once you have that, you can make it so the user can do it themselves, or you can do it a lot faster yourself.
Yes, this is a lot of work, but in the long run, when there is a program that works 95% of the time, everybody wins.
If this is going to be a situation that just can't be automated, then you will probably just have to have a vanilla staging table and have SQL do the import. You will have to load the data into one staging table, do the basic manipulation, and then load it into the staging table that your SQL expects.
I've done so many imports and used so many ETL tools, and there really is no easy way to handle it. The only way is to come up with a standard that is reasonable, stick to it, and program around that.
yeah.. that just sucks.
I would go with the script. I assume you have repeating columns that have to match a single row in another table. I would do reasonable matching, and if you encounter a row that the script can't deal with when moving the data, log it and make someone handle it manually.
It's the little details that'll kill you on this, of course, but in general, I've had success exporting the data as CSV from Excel, then reading it using a tool or script, munging it as needed, and inserting it. Depending on the wonderfulness of my environment, that can be done with a database interface from the scripting language, down to and including writing SQL INSERT statements into a script file.
There are good CSV packages available for Python, Ruby, and Perl.
A DSL is the way to go.
Create a domain model for your problem. You talk about cells, columns, rows, database tables, splitting fields, combining fields, and mapping from cells to database columns, so those are the concepts you need. In addition, you probably want ranges (of cells) and sheets.
A simple view looks only at the values in the spreadsheets, not the underlying formulas. Exporting the spreadsheet as tab-separated text gives you access to that. If you need access to the formulas, you're better off with the XML representation, either the XML spreadsheet or the Office XML format.
You might be able to come up with a DSL in Excel. That could allow your smarter users to do (part of) the mapping.
