The task is: based on data returned by a SQL query, produce automated periodic reports individually for each customer and save each one as a separate Excel file named customer_name+[YY-MM].xlsx (or one timestamped Excel file with a separate worksheet for each customer).
CustomerIDs may vary each month; the target customer list is generated by a SQL query.
What would be the best technology? Studied so far:
SSIS: names of Excel files or tabs must be pre-defined in the Control Flow and Data Flow tasks and cannot be set dynamically. Output is unformatted CSV (bad, but I can cope with it).
Excel: I can embed the bulk data for all customers into Excel, but I'm not sure whether it is possible to write a macro that will fetch the unique customers, create corresponding tabs and put the corresponding data in each one.
SSRS: the subscription function is disabled at the corporate level, so I cannot use it :( Even if it were available, I'm not sure it accepts dynamic parameters like this.
Did I miss any option? Or maybe there is other technology available?
SSIS is going to be your best option here for the following reasons:
You can set up a set of Excel templates if you have a finite number of patterns, then use SSIS to populate and rename the template pretty easily.
If you need finer control, you can use Interop and C# to do pretty much whatever you want in Excel, however this is very slow to execute and is time consuming to build.
It's easy to set up an execution loop over a dataset (your list of customers) and include which templates or tabs they need as part of the data set, which makes maintenance and adding new customers simpler and possibly something that can be moved to the business side.
Of course SSIS has a few drawbacks as well, such as the difficulty in getting the Excel column formats to cooperate when using the Excel connector, but it is still probably the best option.
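Just to illustrate the shape of that driver dataset for the foreach loop, the source query might look something like this (every table and column name here is invented; adjust to your own schema):

    -- Hypothetical driver query for the SSIS foreach loop:
    -- one row per customer, carrying the output file name and which template to use.
    SELECT c.CustomerID,
           c.CustomerName,
           c.CustomerName + '_' + FORMAT(GETDATE(), 'yy-MM') + '.xlsx' AS OutputFileName,
           c.ReportTemplate AS TemplateName
    FROM dbo.Customers AS c
    WHERE c.HasDataThisMonth = 1;   -- whatever rule defines this month's customer list

SSIS can then loop over the result set and, for each row, copy the matching template to OutputFileName before populating it.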
I'm one of several analysts and data engineers working within a Snowflake database. We all often have to write small ad-hoc bits of code to check tables and views. These are often quite repetitive tasks (e.g. filtering data based on a certain reference, joining FACT and DIM tables to add context).
I'd like to create a worksheet that we can all periodically add useful bits of code to, just to save us time making joins or to serve as a good starting place for writing longer bits of code.
I've previously used SQL Server where I was able to save template files with useful bits of code. I could open these files directly within SQL Server so it was really easy to open, edit and run these files. What are some similar features in Snowflake?
Thanks
Yes, but you must use the new web UI, Snowsight. In the new UI you can share worksheets with team members and give them edit access. You can also import your worksheets from the classic UI.
Keep in mind that team editing doesn't work like Google Docs, where you can see each other typing, so be mindful of that. However, every version of the worksheet is automatically saved and available from the upper right of the interface.
https://docs.snowflake.com/en/user-guide/ui-snowsight-gs.html
https://docs.snowflake.com/en/user-guide/ui-snowsight-worksheets-gs.html
I have source tables in Snowflake and Destination tables in Snowflake.
I need to load data from source to destination using ADF.
Requirement: I need to load the data using a single pipeline for all the tables.
E.g., suppose I have 40 tables in the source; I need a single pipeline that loads all 40 tables' data into the destination tables at once.
Can anyone help me in achieving this?
Thanks,
P.
This is a fairly broad question. So take this all as general thoughts, more than specific advice.
Feel free to ask more specific questions, and I'll try to update/expand on this.
ADF is useful as an orchestration/monitoring process, but it can be tricky to manage the actual copying and maneuvering of data in Snowflake with it. My high-level recommendation is to write your logic and loading code in Snowflake stored procedures.
Then you can use ADF to orchestrate by simply calling those stored procedures. You get the benefits of using ADF for what it is good at, and allow Snowflake to do the heavy lifting, which is what it is good at.
Hopefully you'd be able to parameterize the procedures so that you can have one procedure (or a few) that takes a table name and dynamically figures out column names and the like to run your loading process.
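As a very rough sketch of such a procedure in Snowflake Scripting (the trivial INSERT ... SELECT is just a placeholder for your real loading logic, and all names are made up):

    CREATE OR REPLACE PROCEDURE load_table(source_table VARCHAR, target_table VARCHAR)
    RETURNS VARCHAR
    LANGUAGE SQL
    AS
    $$
    BEGIN
        -- Placeholder loading logic: build and run a dynamic statement for this table pair.
        -- A real version would derive column lists from metadata, merge, de-duplicate, etc.
        LET stmt VARCHAR := 'INSERT INTO ' || target_table || ' SELECT * FROM ' || source_table;
        EXECUTE IMMEDIATE :stmt;
        RETURN 'Loaded ' || source_table || ' into ' || target_table;
    END;
    $$;

ADF then only needs to issue something like CALL load_table('SRC.MY_TABLE', 'TGT.MY_TABLE') for each table it wants processed.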
Assorted Notes on implementation.
ADF does have a native Snowflake connector. It is fairly new, so a lot of online posts will tell you how to set up a custom ODBC connector; you don't need to do this. Use the native connector with the auto-resolve integration runtime and it should work for you.
You can write a query in an ADF lookup activity to output your list of tables, along with any needed parameters (like primary key, order by column, procedure name to call, etc.), then feed that list into an ADF foreach loop.
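For example (the schema name and extra parameter columns are purely illustrative), the lookup's source query could be as simple as:

    -- One row per table to process, plus whatever per-table parameters the procedures need.
    SELECT table_schema,
           table_name,
           'LOAD_TABLE' AS procedure_name   -- could also come from a hand-maintained control table
    FROM   my_db.information_schema.tables
    WHERE  table_schema = 'SRC'
      AND  table_type   = 'BASE TABLE';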
foreach loops are a little limited in that there are some things you can't nest inside a loop (like conditionals). If you need extra functionality, you can have the foreach loop call a child ADF pipeline (passing in those parameters) and have the child pipeline manage your table-processing logic.
Snowflake has pretty good options for querying metadata based on a tablename. See INFORMATION_SCHEMA. Between that and just a tiny bit of javascript logic, it's not too bad to generate dynamic queries (e.g. with column names specific to a provided tablename).
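For instance, a procedure could pull an ordered column list for a given table with something along these lines (database, schema and table names are placeholders):

    -- Build a comma-separated column list, e.g. for a dynamic INSERT ... SELECT.
    SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY ordinal_position) AS column_list
    FROM   my_db.information_schema.columns
    WHERE  table_schema = 'SRC'
      AND  table_name   = 'MY_TABLE';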
If you do want to use ADF's Copy activities, I think you'll need to set up an intermediary Azure Storage account connection. I believe this is because it uses COPY INTO under the hood, which requires external storage.
ADF doesn't have many good options for preventing multiple concurrent runs of the same pipeline. Either be careful to make sure your code can handle edge cases like this, or make sure your scheduling/timeouts don't allow a pipeline that runs too long to overlap with the next run.
Extra note:
I don't know how tied you are to ADF, but without more context, I might suggest a quick look into DBT for this use case. It's a great tool for this specific scenario of Snowflake to Snowflake processing/transforming. My team's been much happier since moving some of our projects from ADF to DBT. (not sponsored :P )
This is linked to my other question when to move from a spreadsheet to RDBMS
Having decided to move to an RDBMS from an Excel workbook, here is what I propose to do.
The existing data is loosely structured across two sheets in a workbook. The first sheet contains the main records. The second sheet holds additional data.
My target DBMS is MySQL, but I'm open to suggestions.
Define RDBMS schema
Define, say, web services to interface with the database, so the same interface can be used for both the UI and the migration.
Define a migration script to
Read each group of affiliated rows from the spreadsheet
Apply validation/constraints
Write to RDBMS using the web-service
Define macros/functions/modules in the spreadsheet to enforce validation where possible. This will allow use of the existing system while the new one comes up. At the same time, I hope it will reduce migration failures when the move is eventually made.
What strategy would you follow?
There are two aspects to this question.
Data migration
Your first step will be to "Define RDBMS schema" but how far are you going to go with it? Spreadsheets are notoriously un-normalized and so have lots of duplication. You say in your other question that "Data is loosely structured, and there are no explicit constraints." If you want to transform that into a rigorously defined schema (at least 3NF) then you are going to have to do some cleansing. SQL is the best tool for data manipulation.
I suggest you build two staging tables, one for each worksheet. Define the columns as loosely as possible (big strings basically) so that it is easy to load the spreadsheets' data. Once you have the data loaded into the staging tables you can run queries to assess the data quality:
how many duplicate primary keys?
how many different data formats?
what are the look-up codes?
do all the rows in the second worksheet have parent records in the first?
how consistent are code formats, data types, etc?
and so on.
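To make a couple of those checks concrete, and assuming staging tables called stg_main and stg_detail with a supposed key column customer_id (names invented for the example):

    -- Duplicate primary key candidates in the first worksheet's staging table
    SELECT customer_id, COUNT(*) AS occurrences
    FROM   stg_main
    GROUP  BY customer_id
    HAVING COUNT(*) > 1;

    -- Rows in the second worksheet's staging table with no parent in the first
    SELECT d.*
    FROM   stg_detail AS d
           LEFT JOIN stg_main AS m ON m.customer_id = d.customer_id
    WHERE  m.customer_id IS NULL;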
These investigations will give you a good basis for writing the SQL with which you can populate your actual schema.
Or it might be that the data is so hopeless that you decide to stick with just the two tables. I think that is an unlikely outcome (most applications have some underlying structure, we just have to dig deep enough).
Data Loading
Your best bet is to export the spreadsheets to CSV format. Excel has a wizard to do this; use it (rather than doing Save As...). If the spreadsheets contain any free text at all, the chances are you will have sentences which contain commas, so make sure you choose a really safe separator, such as ^^~
Most RDBMS tools have a facility to import data from CSV files. PostgreSQL and MySQL are the obvious options for an NGO (I presume cost is a consideration), but both SQL Server and Oracle come in free (if restricted) Express editions. SQL Server obviously has the best integration with Excel. Oracle has a nifty feature called external tables, which allows you to define a table whose data is held in a CSV file, removing the need for staging tables.
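In MySQL, for example, pulling one of those CSV exports into a staging table might look roughly like this (the file path and table name are invented, and LOCAL requires local_infile to be enabled):

    -- Load the exported file into a loosely-typed staging table
    LOAD DATA LOCAL INFILE '/tmp/main_sheet.csv'
    INTO TABLE stg_main
    FIELDS TERMINATED BY '^^~'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;    -- skip the header row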
One other thing to consider is Google App Engine. This uses Big Table rather than an RDBMS but that might be more suited to your loosely-structured data. I suggest it because you mentioned Google Docs as an alternative solution. GAE is an attractive option because it is free (more or less, they start charging if usage exceeds some very generous thresholds) and it would solve the app sharing issue with those other NGOs. Obviously your organisation may have some qualms about Google hosting their data. It depends on what field they are operating in, and the sensitivity of the information.
Obviously, you need to create a target DB and the necessary table structure.
I would skip the web services and write a Groovy script which reads the .xls (using the POI library), validates it and saves the data in the database.
In my view, anything more involved (web services, GUI...) is not justified: these kinds of tasks are very well suited for scripts because they're concise and extremely flexible while things like performance, code base scalability and such are less of an issue here. Once you have something that works, you will be able to adapt the script to any future document with different data anomalies you run into in a matter of minutes or a few hours.
This is all assuming your data isn't in perfect order and needs to be filtered and/or cleaned.
Alternatively, if the data and validation rules aren't too complex, you can probably get good results with a visual data transfer tool like Kettle: you just define the .xls as your source and the database table as the target, add some validation/filter rules if needed, and trigger the loading process. Quite painless.
If you'd rather use a tool than roll your own, check out SeekWell, which lets you write to your database from Google Sheets. Once you define your schema, select the tables into a Sheet, then edit or insert the records and mark them for the appropriate action (e.g., update, insert, etc.). Set the schedule for the update and you're done. Read more about it here. Disclaimer: I'm a co-founder.
Hope that helps!
You might be doing more work than you need to. Excel spreadsheets can be saved as CSV or XML files, and many RDBMS clients support importing these files directly into tables.
This could allow you to skip writing web service wrappers and migration scripts. Your database constraints would still be properly enforced during any import. If your RDBMS data model or schema is very different from your Excel spreadsheets, however, then some translation would of course have to take place via scripts or XSLT.
Hopefully someone has been down this road before and can offer some sound advice as far as which direction I should take. I am currently involved in a project in which we will be utilizing a custom database to store data extracted from Excel files based on pre-established templates (to maintain consistency). We currently have a process (written in C#.Net 2008) that can extract the necessary data from the spreadsheets and import it into our custom database.
What I am primarily interested in is figuring out the best method for integrating that process with our portal. What I would like to do is let SharePoint keep track of the metadata about the spreadsheet itself and let the custom database keep track of the data contained within the spreadsheet. So, one thing I need is a way to link spreadsheets from SharePoint to the custom database and vice versa. As these spreadsheets will be updated periodically, I need a tried and true way of ensuring that the data remains synchronized between SharePoint and the custom database.
I am also interested in finding out how to use the data from the custom database to create reports within the SharePoint portal. Any and all information will be greatly appreciated.
I have actually written a similar system in SharePoint for a large Financial institution as well.
The way we approached it was to have an event receiver on the Document library. Whenever a file was uploaded or updated the event receiver was triggered and we parsed through the data using Aspose.Cells.
The key to matching data in the excel sheet with the data in the database was a small header in a hidden sheet that contained information about the reporting period and data type. You could also use the SharePoint Item's unique ID as a key or the file's full path. It all depends a bit on how the system will be used and your exact requirements.
I think this might be awkward. The Business Data Catalog (BDC) functionality will enable you to tightly integrate with your database, but simultaneously trying to remain perpetually in sync with a separate spreadsheet might be tricky. I guess you could do it by catching the update events for the document library that handles the spreadsheets themselves and subsequently pushing the right info into your database. If you're going to do that, though, it's not clear to me why you can't choose just one or the other:
Spreadsheets in a document library, or
BDC integration with your database
If you go with #1, then you still have the ability to search within the documents themselves and updating them is painless. If you go with #2, you don't have to worry about sync'ing with an actual sheet after the initial load, and you could (for example) create forms as needed to allow people to modify the data.
Also, depending on your use case, you might benefit from the MOSS server-side Excel services. I think the "right" decision here might require more information about how you and your team expect to interact with these sheets and this data after it's initially uploaded into your SharePoint world.
So... I'm going to assume that you are leveraging Excel because it is an easy way to define, build, and test the required math. Your spreadsheet has a set of input data elements, a bunch of math, and then some output elements. Have you considered using Excel Services? In this scenario you would avoid running a batch process to generate your output elements; instead, you can call Excel Services directly in SharePoint and run through your calculations. More information is available online.
You can also surface information in SharePoint directly from the spreadsheet. For example, if you have a graph in the spreadsheet, you can link to that graph and expose it. When the data changes, so does the graph.
There are also some High Performance Computing (HPC) Excel options coming out from Microsoft in the near future. If your spreadsheet is really, really big then the Excel Services route might not work. There is some information available online (search for HPC excel - I can't post the link).
For whatever reason, I have a lot of clients that have existing data that's stored in spreadsheets. Often there are hundreds, if not thousands of items in each spreadsheet, and asking the client to manually enter them through a website (or heaven forbid importing them that way myself) is out of the question. Typically, this data doesn't simply map spreadsheet column to database column. That would be too easy. Often, the data needs to be manipulated before going into the database (data needs to be split by commas, etc) or the data needs to be spread out across multiple tables. Or both.
I'm asking this question, not because I don't know of a multitude of ways to do it, but because I haven't settled on a way that doesn't feel like it takes more work than it should. So far I've taken all of the following approaches (and probably more that I've forgotten):
Using Excel to modify the data, so it's a little bit easier to import
Importing the entire spreadsheet into a temporary table and then importing with SQL
Writing a script and importing the data with it (I've used VBScript, C# and now Ruby)
So far, using a script has been the way that seemed most flexible, but it still feels a little clunky. I have to perform this task enough that I've even contemplated writing a little DSL for it, just to speed things up.
But before I do that, I'm curious, is there a better way?
You have to set boundaries, if you can. You should try to provide a template for them to use with the expected data, which includes file type (Excel, CSV, etc.), column names, valid values, etc. You should allow the user to browse for the file and upload it on your page/form.
Once the file is uploaded, you need to do validation and importing. You can use ADO.NET, file streams, DTS/SSIS, or Office Automation to do this (if you are using the Microsoft stack). In the validation portion, you should tell the user exactly what they did wrong or need to change. This might include showing the actual data in a datagrid on the validation page and flagging errors with red labels on the exact row/column. If you use Office Automation, you can give them the exact cell number, but the Office PIA is a pain in the neck.
Once validation is accepted, you can import the information however you like. I prefer putting it into a staging table and using a stored proc to load it, but that's just me. Some prefer to use the object model, but this can be very slow if you have a lot of data.
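A minimal version of that staging-plus-proc pattern in T-SQL might look like this (table and column names are invented for the example):

    -- Move validated rows from the staging table into the real table, with light cleanup
    CREATE PROCEDURE dbo.LoadCustomersFromStaging
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.Customer (CustomerName, Email)
        SELECT LTRIM(RTRIM(s.CustomerName)),
               NULLIF(LTRIM(RTRIM(s.Email)), '')
        FROM   dbo.stg_Customer AS s
        WHERE  s.IsValid = 1;   -- flag set during the validation step
    END;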
If you are personally loading these files manually and having to go in and manipulate them, I would suggest finding the commonality among them and coming up with a standard to follow. Once you have that, you can make it so the user can do it themselves or you can do it a lot faster yourself.
Yes, this is a lot of work, but in the long run, when there is a program that works 95% of the time, everybody wins.
If this is going to be a situation which just can't be automated, then you will probably just have to have a vanilla staging table and have SQL do the import. You will have to load the data into one staging table, do the basic manipulation, and then load it into the staging table that your SQL expects.
I've done so many imports and worked with so many ETL tools, and there really is no easy way to handle it. The only way is to really come up with a standard that is reasonable, stick to it, and program around that.
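For the "basic manipulation" step, e.g. splitting a comma-separated column out into rows, something like this works on SQL Server 2016+ (all names here are made up):

    -- Explode a comma-separated Tags column into one row per value
    INSERT INTO dbo.stg_Expected (SourceId, Tag)
    SELECT r.Id,
           LTRIM(RTRIM(t.value))
    FROM   dbo.stg_Raw AS r
           CROSS APPLY STRING_SPLIT(r.Tags, ',') AS t;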
yeah.. that just sucks.
I would go with the script. And I assume you have repeating columns that have to match a single row in another table. I would do reasonable matching, and if you encounter a row that the script can't deal with or can't move the data for, then log it and make someone do it manually.
It's the little details that'll kill you on this, of course, but in general I've had success with exporting the data as CSV from Excel, then reading it using a tool or script, munging it as needed, and inserting it. Depending on the wonderfulness of my environment, that can be done with a database interface to the scripting language, down to and including writing SQL INSERT statements into a script file.
There are good CSV packages available for Python, Ruby, and Perl.
A DSL is the way to go.
Create a domain model for your problem. You talk about cells, columns, rows, database tables, splitting fields, combining fields, and mapping from cells to database columns, so those are the concepts you need. In addition you probably want ranges (of cells), and sheets.
A simple view looks only at the values in the spreadsheets, not the underlying formulas. Exporting the spreadsheet as tab-separated text gives you access to that. If you need access to the formulas, you're better off with the XML representation, either the XML spreadsheet format or the Office XML format.
You might be able to come up with a DSL in Excel. That could allow your smarter users to do (part of) the mapping.