Issues with using AWS-Data-Wrangler in Glue to SQL Server

I've run into an issue with AWS-Data-Wrangler in AWS Glue that I can't seem to find a resolution to. In my searches, I came across a Stack Overflow discussion that accurately describes my issue, but the answer isn't helpful because I'm writing the script in Glue Studio. The answer repeats what the source documentation says about installing unixODBC-devel before installing pyodbc, but nothing seems to cover using the Job parameters feature in Glue Studio to install awswrangler and pyodbc together.
To describe what I'm trying to accomplish: we have a bunch of data stored as parquet files that came from our on-premises SQL Server instance. I am trying to load this data into an RDS SQL Server instance so we can run our existing T-SQL scripts against it, rather than rewriting all of our SQL to match Aurora/Redshift/PostgreSQL/etc. The results would then be stored back in S3 or in another permanent SQL Server instance for use in an application, and the RDS instance would be deleted once it is no longer in use. This is a first step in migrating our processes to cloud-native technologies: we want to minimize the changes we would have to make (this would be the ideal end state of our processes) while also minimizing costs (permanent SQL Server instances can become quite expensive).
I've tried the following using --additional-python-modules:
pyarrow==2,awswrangler
This results in pyodbc being missing; the job reports "ModuleNotFoundError: You need to install pyodbc respectively the AWS Data Wrangler package with the sqlserver extra for using the sqlserver module"
pyarrow==2,awswrangler,pyodbc
This results in awswrangler not being found: "ModuleNotFoundError: No module named 'awswrangler'"
pyodbc,pyarrow==2,awswrangler
This is the same as above.
pyarrow==2,awswrangler[sqlserver]
This is the same as above.
pyarrow==2,awswrangler==2.14.0
This is the same as above.
s3://<path>/<to>/<wheel>.whl
I manually built my own wheel file that contained pyarrow, awswrangler, and pyodbc (with all necessary dependencies), but it resulted in the same issue as above.
I've done the above configurations using both Glue 2.0 and 3.0, and ran one test using Glue 1.0, which took far too long to fail to be viable. At this point I'm stuck and can't figure out how to get this working. As long as I'm not trying to make a connection to SQL Server, awswrangler works fine reading my S3 files and creating dataframes. I've seen mention of using awswrangler[sqlserver], but I can't find any actual usage of it in this kind of use case. Does anyone have any ideas?
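For reference, the flow I'm aiming for looks roughly like the sketch below, assuming the modules eventually install cleanly; the bucket path, Glue catalog connection name, and target table are all placeholders:

import awswrangler as wr

# Read the parquet extracts that were landed in S3 (path is a placeholder)
df = wr.s3.read_parquet(path="s3://my-bucket/extracts/")

# This is the call that needs pyodbc (i.e. the sqlserver extra); it opens a
# connection through a Glue catalog connection to the RDS instance
con = wr.sqlserver.connect(connection="rds-sqlserver-connection")
try:
    # Stage the dataframe so the existing T-SQL scripts can query it
    wr.sqlserver.to_sql(df=df, con=con, schema="dbo", table="staging_extract", mode="overwrite")
finally:
    con.close()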

Related

BIML OLE DB connection uses the wrong database (sometimes)

I'm using BIML to interrogate the schema of the source and destination databases, check that everything is configured correctly, and then generate a bunch of SSIS packages. My issue is that occasionally the OLE DB connection starts using the master system database instead of the one I've specified in the connection string. I can tell it is the master database by examining the tables that get returned.
I define the connection using BIML markup:
<OleDbConnection Name="appdb" ConnectionString="Server=<#=ReplicationConfig.appdbHostname#>;Database=<#=ReplicationConfig.appdbDatabaseName#>;Integrated Security=SSPI;Provider=SQLNCLI11;" CreateInProject="true" />
I've hit the issue mostly when trying to use the GetDatabaseSchema() method of the OleDbConnection object on the BIML root node, though I have also run across it when using an Execute SQL task. It was easy to work around in the Execute SQL task because I could fully qualify the table as [DatabaseName].[schema].[Table], but I don't have that option with the GetDatabaseSchema() method. I've also tried ImportDB() and GetTableNodes(), and they both exhibit the same issue.
When migrating my solution from the development environment to test, the issue cropped up again. In the past it has been fixed by restarting Visual Studio (or worked around in the SQL query), but that hasn't worked this time.
I'm using BIML Express with Visual Studio 2015.
Does anyone have any idea what could be wrong or how to get around this?
I ended up working around the issue by setting the default database for my login to the one I needed to work with. For some reason BIML seems to ignore the database specified in the connection string.
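For anyone who wants to script that change rather than do it in SSMS, the statement is standard T-SQL; here is a minimal sketch issued through pyodbc, with the server, login, and database names all as placeholders:

import pyodbc

# autocommit so the DDL takes effect immediately (server name is a placeholder)
conn = pyodbc.connect("Driver={SQL Server Native Client 11.0};Server=appdb-host;Trusted_Connection=yes;", autocommit=True)
# Point the login's default database at the one BIML should be using
conn.execute("ALTER LOGIN [DOMAIN\\biml_user] WITH DEFAULT_DATABASE = [AppDb]")
conn.close()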

Visual Studio Load Test w/o Agents - Manually Executing and Aggregating Results

My team recently adopted Visual Studio's Web Performance Test/Load Testing solution. Our test plans are developed, and we are preparing to begin collecting baseline performance and stress-test results against a corporate MVC application.
Due to corporate network security "features", Microsoft's Agents/Controller on-premises test distribution solution is not an option. Furthermore, the TFS Virtual Lab and Azure Virtual Lab load test distribution solutions are also not viable options due to security infrastructure and resource limitations.
Because of these constraints, it seems our only option is to run a Visual Studio load test from each developer machine (at a coordinated time, through different internet connections). If anyone has another solution, I'm certainly receptive.
Assuming we take this approach, I'm concerned that the results Visual Studio stores in the "LoadTest2010" SQL repository will not accurately reflect the combined results of all the developer machines' load tests.
My questions are:
Is this approach even viable?
If so, what is the best way to combine the separate Load Test SQL repositories into a single SQL Database (keeping in mind connecting to a central SQL Server during test execution is not an option)?
Assuming we import all the testers' results into a central database, does anyone have an idea of how to report on composite test results? I'm assuming they'll all have different TestRunIds, which seems like it would break Microsoft's built-in views and stored procedures for analyzing test results.
Putting all the test runs into one database can be done by exporting the results from each secondary repository and importing them into a single database. Use the "Open and manage load test results" command. See https://sqa.stackexchange.com/a/14503/6752 for more details.
Combining the results from several runs cannot, as far as I know, be done within Visual Studio. However, each graph can be exported to Excel, where you can manually merge the results. The rows of each table (but, unfortunately, not the headers) can be copied and pasted into Excel.
I prefer the "Export graph data to Excel" and "Export graph data to text (.csv)" commands over "Create Excel report". (The two Export... commands are not available for tables.) The reason is that "Create Excel report" requires Visual Studio to be run as an Administrator, and I have not found a sensible way of letting the Administrator user have access to my non-Administrator load test database.
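If you go the .csv route, merging the per-machine exports is easy to script. A small sketch, where the folder layout and column names are assumptions to adjust against the actual exports:

import glob
import pandas as pd

frames = []
for path in glob.glob("exports/*/requests_per_sec.csv"):  # one folder per tester (hypothetical)
    df = pd.read_csv(path)
    df["source_run"] = path.split("/")[1]  # tag rows with the machine that produced them
    frames.append(df)

# One combined file that can feed a report across all the runs
pd.concat(frames, ignore_index=True).to_csv("combined_results.csv", index=False)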

Quickly changing SSIS package data source parameters for easy migration

I need to migrate a SQL database from Sybase to MS SQL Server. Before doing the actual migration on the production server, I first created an SSIS package with SQL Server Management Studio's Import/Export Wizard for testing against other databases. The test migration was successful, and I would now like to deploy my SSIS package to the real servers.
However, it seems I cannot simply run the package in Management Studio choosing different data sources for it - it only runs against the same databases for which it was created. It can be edited in something called SQL Server Business Intelligence Development Studio (BIDS for short; I am using the SQL Server 2008 version), but going through every data flow task and changing the destination manually for each of the ~150 tables I am moving is inefficient and also introduces the possibility of error.
Is there a way to quickly change which data source is used for ALL destinations in ALL the data flow tasks of an SSIS package? If not, what simple method is there for testing the migration with test databases first and simply changing the data sources when deploying?
I am using ODBC data sources, but for some of them the package shows OLE DB sources in BIDS instead.
I hope I was clear enough. If you have additional questions, please ask! Thank you!
I would use a variable for the ConnectionString property of the connection manager. A package-level configuration can be very useful for accomplishing this task. There are several ways to do it; I prefer to use a table in SQL Server that holds all the configurations for all packages. This can be especially effective if you have multiple packages and need to dynamically change a set of connection managers across those packages.
The basic steps are:
Right-click your SSIS design surface and select "Package Configurations..."
Create a package level configuration of Configuration Type "SQL Server"
Store your connection in a Configuration table in SQL Server
Alter your Connection Manager to use a variable for the ConnectionString Property
Populate that variable from the Configuration table via your package level configuration
When it comes time to switch from Test to Production, simply update the connection string in your configuration table (a sketch of this step follows below)
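As a sketch of that last step, assuming the default [SSIS Configurations] table layout (the server, database, filter name, and connection string here are placeholders; verify them against your own setup):

import pyodbc

# Connect to the server that holds the configuration table
conn = pyodbc.connect("Driver={SQL Server};Server=config-server;Database=SSISConfig;Trusted_Connection=yes;")
# Repoint every ConnectionString entry under this filter at production
conn.execute(
    "UPDATE [SSIS Configurations] "
    "SET ConfiguredValue = ? "
    "WHERE ConfigurationFilter = ? AND PackagePath LIKE '%ConnectionString%'",
    "Data Source=prod-server;Initial Catalog=AppDb;Provider=SQLNCLI10;Integrated Security=SSPI;",
    "MigrationPackages",
)
conn.commit()
conn.close()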
This is part of a larger package management framework that I implemented using this book:
Microsoft SQL Server 2008 Integration Services: Problem, Design, Solution
I highly recommend it. It should take less than a day to set up, and the book has step-by-step instructions.
This question and its associated answers are also helpful.

Extract from Progress Database to SQL Server

I'm looking for the best approach (or a couple of good ones to choose from) for extracting data from a Progress database (v10.2b). The eventual target will be SQL Server (v2008). I say "eventual target" because I don't necessarily have to connect directly to Progress from within SQL Server; i.e., I'm not averse to extracting from Progress to a text file and then importing that into SQL Server.
My research on approaches came up with scenarios that don't match mine:
Migrating an entire Progress DB to SQL Server
Exporting entire tables from Progress to SQL Server
Using Progress-specific tools, something to which I do not have access
I am able to connect to Progress using ODBC, and have written some queries from within Visual Studio (v2010). I've also done a bit of custom programming against the Progress database, building a simple web interface to prove out a few things.
So, my requirement is to use ODBC and build a routine that runs a specific query on a daily basis. The results of this query will then be imported into a SQL Server database. Thanks in advance for your help.
Update
After some additional research, I found that a linked server is what I'm looking for. Some notes for others working with SQL Server Express:
If it's SQL Server Express that you are working with, you may not see a program on your desktop or in the Start Menu for DTS. I found DTSWizard.exe nested in my SQL Server Program Files (for me, C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn), and was able to simply create a shortcut.
Also, because I'm using SQL Server Express, I wasn't able to save the package I'd created. So, after creating the package and running it once, I simply re-ran it and saved off my SQL for use in the future.
Bit of a late answer, but in case anyone else was looking to do this...
You can use a linked server, but you will find that the performance won't be as good as connecting directly via the ODBC drivers, and the translation of data types may mean that you cannot access some tables. The linked server might be handy, though, for exploring the data.
If you use SSIS with the ODBC drivers (you will have to use ADO.NET data sources), this will perform the most efficiently, and you should also get more accurate data types (remember that data types within Progress can change dynamically).
If you have to extract a lot of tables, I would look at BIML to help you achieve this. BIML (Business Intelligence Markup Language) can help you dynamically create many SSIS packages on the fly, which can be called from a master package. The master package can then be scheduled or run ad hoc, as can any of the child packages.
Can you connect to the Progress DB using OLE DB? If so, you could use a SQL Server linked server to bypass the need for extracting to a file that would then be loaded into SQL Server. Alternatively, you could extract to Excel and then import from Excel into SQL Server.
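To make the ODBC route the question asks for concrete, here is a minimal sketch of the daily routine; the DSNs, query, and target table are all placeholders:

import pyodbc

src = pyodbc.connect("DSN=ProgressDB")      # Progress 10.2b via its ODBC driver
dst = pyodbc.connect("DSN=SqlServer2008")   # target SQL Server instance

# Run the fixed daily query against Progress (PUB is the usual Progress schema)
rows = src.execute("SELECT cust_id, name, updated_at FROM PUB.customer").fetchall()

# Bulk-insert the results into the SQL Server staging table
cur = dst.cursor()
cur.executemany(
    "INSERT INTO dbo.customer_extract (cust_id, name, updated_at) VALUES (?, ?, ?)",
    [tuple(r) for r in rows],
)
dst.commit()
src.close()
dst.close()

Scheduled via Task Scheduler or a SQL Agent job, a script like this covers the once-a-day requirement without a linked server.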

AppFabric without SQL Server whatsoever

I have a VPS with limited memory, and my WCF service is hosted using AppFabric.
Since memory is limited and I am not using SQL Server for anything other than the AppFabric prerequisite, I'm thinking about uninstalling SQL Server (an instance can eat up to 200 MB of memory at times). I am not using any DB-related features of AppFabric like the dashboard or caching. I do like the IIS extensions and the simplicity of WCF service manipulation, however, and I'm thinking those do not actually require SQL Server.
I am unable to just try it out, so I wonder if someone has experience with this or can predict the result of uninstalling SQL Server on AppFabric's behaviour.
Instead of uninstalling SQL Server, you could just stop the SQL Server process and set it to manual startup.
That way, if you need SQL Server in the future, you can just start the process.
As @Shiraz Bhajiji alludes to, if you are using SQL Server as the configuration store, you will need to reconfigure AppFabric to use file-based configuration instead. It sounds like you are only using a single AppFabric instance, but if you are using (or needed to use) multiple instances, the config file would need to be accessible to all of them.
Again, it isn't necessarily relevant to you, but if you have multiple AppFabric instances, the SQL Server configuration option is a much more robust approach. With the file-based approach, if you configure things incorrectly, one AppFabric node going down can take down the entire cluster. The SQL Server approach does represent a single point of failure; however, if you are using clustering etc., you can easily mitigate this. Again, I appreciate I'm getting a little off-topic here.
