I have a SQL table with close to 2 million rows and I am trying to export this data into an Excel file so the stakeholders can manipulate the data, see charts, and so on.
The issue is that when I hit refresh, it fails after fetching all the data, saying the number of rows exceeds Excel's maximum row limit. This table is going to keep growing every day.
What I am looking for here is a way to refresh data, then add rows to Sheet 1 until max rows limitation is reached. Once maxed out, I want the rows to start getting inserted into Sheet 2. Once maxed out, move to 3rd sheet, all from the single SQL table, from a single refresh.
This does not have to happen in Excel (Data -> Refresh option), I can have this as a part of the SSIS package that I am already using to populate rows in the SQL table.
I am also open to any alternate ways to export SQL table into a different format that can be used by said stakeholders to create charts, analyze data, and whatever else pleases them.
Without sounding too facetious, you are suggesting a very inefficient method.
The best approach is not to use .xlsx files for the data storage at all.
Assuming your destination stakeholders don't have read access to the SQL server, export the data to .csv and then use Power Query in some sort of 'Dashboard.xlsx' type file to load the .csv into the data model, which can handle hundreds of millions of rows instead of just 1.05 million.
This will allow for the use of Power Pivot and DAX for analysis and the data will also be visible in the data model table view if users do want raw rows (or they can refer to the csv file..).
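As a rough sketch of the export step, the table can be streamed to a .csv in batches so that the 2 million rows never sit in memory at once. This assumes any Python DB-API connection (for SQL Server that would typically be a driver such as pyodbc); the function name, batch size and the SQLite stand-in in the demo are illustrative assumptions, not part of the original setup.

```python
import csv
import os
import sqlite3
import tempfile


def export_table_to_csv(conn, query, csv_path, batch_size=50_000):
    """Stream the result of `query` into a CSV file and return the row count."""
    cursor = conn.cursor()
    cursor.execute(query)
    # cursor.description holds one tuple per column; the name is the first item.
    header = [column[0] for column in cursor.description]
    written = 0
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)
            written += len(batch)
    return written


if __name__ == "__main__":
    # SQLite stands in for the real server here (assumption for the demo only).
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
    demo.executemany("INSERT INTO sales VALUES (?, ?)", [(i, i * 1.5) for i in range(10)])
    target = os.path.join(tempfile.mkdtemp(), "sales.csv")
    print(export_table_to_csv(demo, "SELECT id, amount FROM sales", target), "rows written")
```

The 'Dashboard.xlsx' file would then point Power Query at the resulting .csv and load it to the data model rather than to a worksheet.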
If they do have SQL read access then you can query the server directly so you don't need to store any rows whatsoever as it will read directly.
Failing all that, if you decide to do it your way, I would suggest the following.
Read your table into a pandas DataFrame and iterate over its rows, writing each one to xlsx[sheet1] using openpyxl; once the row number reaches 1,048,576, simply move on to xlsx[sheet2].
In short: openpyxl allows you to create workbooks, worksheets, and write to cells directly.
But depending on how many columns you have it could take incredibly long.
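A minimal sketch of that sheet-splitting idea follows. The helper names are made up for illustration, and it assumes the rows arrive as any iterable of lists (for example from a database cursor) rather than specifically from a DataFrame. The splitting logic is plain Python; only `write_workbook` needs the third-party openpyxl package, and it uses openpyxl's write-only mode, which is usually faster than setting cells one by one.

```python
from itertools import islice

EXCEL_MAX_ROWS = 1_048_576  # per-sheet row limit in Excel 2007 and later


def split_for_sheets(rows, header, max_rows=EXCEL_MAX_ROWS):
    """Yield (sheet_title, sheet_rows); every sheet starts with the header row."""
    iterator = iter(rows)
    per_sheet = max_rows - 1  # one row on each sheet is reserved for the header
    sheet_number = 1
    while True:
        chunk = list(islice(iterator, per_sheet))
        if not chunk:
            break
        yield f"Sheet{sheet_number}", [header] + chunk
        sheet_number += 1


def write_workbook(rows, header, path, max_rows=EXCEL_MAX_ROWS):
    """Write the chunks to an .xlsx file, one worksheet per chunk."""
    from openpyxl import Workbook  # imported lazily; the splitter works without it

    workbook = Workbook(write_only=True)  # streaming mode keeps memory use low
    for title, sheet_rows in split_for_sheets(rows, header, max_rows):
        sheet = workbook.create_sheet(title=title)
        for row in sheet_rows:
            sheet.append(row)
    workbook.save(path)
```

Calling `write_workbook(cursor, ["id", "amount"], "export.xlsx")` would fill Sheet1, then Sheet2, and so on from a single pass over the data. It does not address the question of how users would then analyse data that is spread over several sheets.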
Product Limitations
Excel 2007+: 1,048,576 rows by 16,384 columns
A challenge with your suggestion of filling a worksheet with the max number of rows and then splitting is "How are they going to work with that data?" and "Did you split data that should have been together to make an informed choice?"
If Excel is the tool the users want to use and they must have access to all the data, then you're going to need to put the data into a Power Pivot data model (and yes, that's going to impact the availability of some data visualizations). A Power Pivot model is an in-memory tabular data set. What that means is that the data engine, xVelocity, is going to use a bunch of memory but can get past the 1 million row limitation. Depending on how much memory is required, you might need to switch from the default 32-bit Office install to a 64-bit install (and I've seen clients have to max out the RAM on old, low-end desktops because they went cheap for business users).
Power Pivot will have a connection to your SQL Server (or other provider). When it refreshes data, it's going to fire off queries and determine the unique values in columns and then create a dictionary of unique values. This allows it to compress the data with low cardinality really well - sales dates are likely going to be repeated heavily within your set so the compression is good. Assuming your customers are typically not-repeat customers, a customer surrogate key would have high cardinality and thus not compress well since there's little to no repeat. The refresh is going to be dependent on your use case and environment. Maybe the user has to manually kick it off, maybe you have SharePoint with Excel services installed and then you can have it refresh the data on various intervals.
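To make the cardinality point concrete, here is a toy illustration of dictionary encoding. It is a simplified sketch of the general idea only, not how xVelocity is actually implemented, and the column contents are invented.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): the unique values plus one integer code per row."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def distinct_ratio(values):
    """Share of rows holding a distinct value: near 0 is low cardinality, 1.0 is all unique."""
    return len(set(values)) / len(values)


if __name__ == "__main__":
    # Hypothetical columns: dates repeat heavily, customer keys never do.
    sale_dates = ["2024-01-01"] * 500 + ["2024-01-02"] * 500
    customer_keys = list(range(1000))
    for name, column in (("sale_date", sale_dates), ("customer_key", customer_keys)):
        dictionary, codes = dictionary_encode(column)
        print(name, "dictionary entries:", len(dictionary), "ratio:", distinct_ratio(column))
```

The date column needs only a two-entry dictionary plus small integer codes, while the customer key column needs a dictionary as long as the column itself, which is why it compresses poorly.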
If they're good analysts, you might try turning them on to Power BI. Same-ish engine behind the scenes but built from the ground up to be a responsive reporting tool. If they're just wading through tables of data, they're not ready for PBI. If they are making visuals out of the data, PBI is likely a better fit.
Related
I have a SQL View that I'm working on that spits out some important information for my boss's boss's boss. The view includes a field called Item ID, which can be in several different formats.
Here are some examples (that may or may not be made up to protect the innocent):
ATS-LC-PLN-RT-RH-0.3125-18-3X2.125X1.5-1
012345.012345
01234567.0123
123456789012
000000.000000
000000.000002
I'd like to take the view and use it to (eventually) produce an excel spreadsheet, but I'm not confident that there's a way to format this column in a way that will work for all of these different Item ID's.
When playing around with Excel, these numbers drop their trailing zeroes and switch to scientific notation, among other shenanigans. I just need to format this column in a way that will preserve the Item ID.
If you know of a way to programmatically create an excel spreadsheet in a way that allows me to assign a format based on the data in the cell, that would work great. The problem that I'm mainly suffering from is that this spreadsheet naturally has hundreds of lines, soon to be thousands, and there's no feasible way to hand-format these lines one at a time on a daily or weekly basis.
I've got SQL-Server 2014 and Excel via Microsoft Office Standard 2013, which may offer more options.
Permit me to suggest another way of framing your issue. I don't think you really want to analyze (either manually or programmatically) each item ID and determine whether it is an integer, a decimal, or alphanumeric text. Since your item ID data varies, the only Excel formatting that will work for all of your cases is 'Text.' So my suggestion is look for a way to automate the export of your data to Excel while making sure that the formatting in Excel is set to 'Text' for all cells to contain your item ID data. As you've noticed, if you are pasting data in Excel, if the target cells are not first set to 'Text' formatting, Excel will make its own 'corrections' to each pasted value, including removal of leading and trailing zeros.
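A small sketch of why 'Text' is the only safe choice: it checks whether an ID would survive being parsed as a number and printed back. Python's number parsing is used here only as a rough stand-in for Excel's automatic conversion (the two do not follow identical rules), and the function name is made up for illustration.

```python
def survives_numeric_coercion(item_id):
    """True only if parsing item_id as a number and printing it back gives the same text."""
    for parse in (int, float):
        try:
            return str(parse(item_id)) == item_id
        except ValueError:
            continue  # not parseable this way; try the next numeric type
    return False  # not numeric at all, so it has to be stored as text anyway


if __name__ == "__main__":
    examples = [
        "ATS-LC-PLN-RT-RH-0.3125-18-3X2.125X1.5-1",
        "012345.012345",
        "01234567.0123",
        "123456789012",
        "000000.000000",
        "000000.000002",
    ]
    for item_id in examples:
        print(item_id, survives_numeric_coercion(item_id))
```

Only the plain 12-digit ID round-trips here, and Excel may still display a number that long in scientific notation, so treating the whole column as text avoids having to classify each value.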
The best solution is to use SQL Server Reporting Services (SSRS). You can set the field formatting in SSRS, and then (if you choose) automate the export of your data to Excel by calling the report server by URL with &rs:Format=excel. (There is a learning curve for SSRS, but if you plan to continue doing things like this, it will be worth it.)
Other options
The easiest manual option is to 1) export the data to .csv format, 2) Open Excel and use the Text Import Wizard, and during Step 3 make sure to click the data column and then choose 'Text' as the data format. (You could automate this somewhat with an Excel VBA macro.)
The most complicated method involves programming using Excel VBA and ADO to automate the connection and querying of the data from your database view, and then rendering that data to a spreadsheet, using VBA to set the formatting to 'Text.'
I am working on a project that asked me to develop a system to provide JSON output, the following is the flow:
1a) Some tables will be updated via the administration panel (my company side)
1b) Some related tables will be updated via the administration panel (partner side)
-- Let say SuperHeroes & Males were updated in 1a), Studios & Years were updated in 1b)
2) Client browse our site and request an information set which:
Has an enabled and not deleted row in SuperHeroes (Ant-man)
Has an enabled and not deleted row in Males linked to SuperHeroes (Scott Lang)
Joined the above records then look if they are linked to and exists in Studios (Marvel)
Linked to an existing row in Years (2015)
3) A very small amount of data will be output as a JSON string like the following: { id:1,type:marvel },{ id:1,type:dc }
All rows in the above 4 tables can be updated or deleted at any time without notification [there are no foreign keys either]
I am thinking of updating the information in a flat file every time 1a) is performed. (We can change the system on my company's side but not on the partner's side, and the partner has refused to save any extra information to a flat file, so we have no easy way to know whether the Studios or Years tables have been modified.)
The JSON request would then first load the information from the flat file (all output data will be stored in this file), and then use a simple SQL statement to check whether a linked record exists in Studios & Years
I have done my research and am getting confused. I concluded that a flat file works well when the amount of data is small, but that one should beware of the file growing larger and larger. (The flat file in question will never hold more than 50 rows at a time, and it should not be modified frequently.)
Some answers said a database is good at querying data (I think so too, and the requirement involves an SQL check anyway)
So I don't know whether a flat file is a good choice when my data is small but still needs some communication with the database.
I appreciate your time and your help, all idea & hints are welcome, thanks!
Your conclusion regarding amount of data is absolutely correct, and file should handle those 50 rows, but..
Using database as storage should give you more options in the future, e.g.:
you'll be able to produce any output thanks to the separation of data and representation (today it's JSON, but what if you have to produce XML at some point? Will you add files that store XML next to the JSON files? And then CSV or any other format?)
transactions will give you ACID guarantees
scalability and performance: if your dataset gets bigger (you never know :)), many databases offer options like table partitioning, partitioning-based clustering, or replication
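The first point can be sketched like this: once the rows come back from a query as plain data, each output format is just another small function. The rows below are hard-coded stand-ins modelled on the example output in the question; the function names are illustrative.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical rows, as they might come back from the database query.
ROWS = [{"id": 1, "type": "marvel"}, {"id": 1, "type": "dc"}]


def to_json(rows):
    """Render the rows as a JSON array of objects."""
    return json.dumps(rows)


def to_xml(rows):
    """Render the rows as <items><item>...</item></items>."""
    root = ET.Element("items")
    for row in rows:
        item = ET.SubElement(root, "item")
        for key, value in row.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    for render in (to_json, to_xml, to_csv):
        print(render(ROWS))
```

With a flat file that already stores the finished JSON, adding XML or CSV output would mean maintaining a second and third file alongside it.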
Wrong choices regarding architecture and technology made at the beginning of the project will always backfire.
Do you know how to transfer only new records between two different databases (e.g. Oracle and MSSQL) using SSIS? There is no problem transferring only new data between two tables in the same database and server, but is it possible to do this between completely different servers and databases?
Ps. I know about solution using Lookup but it is not very efficient if anybody needs to check and add a lot of records (50k and more) several times per day. I would like to operate with new data only.
You have several options:
Timestamp based solution
If you have a column which stores the insertion time in the source system, you can select only the new records created since the last load. With the same logic, you can transfer modified records too: just mark the records with the timestamp value when they change.
Sequence based solution
If there is a sequence in the source table, you can load the new records based on that sequence. Query the last value from the destination system, then load everything which is larger than that value.
CDC based solution
If you have CDC (Change Data Capture) in your source system, you can track the changes and you can load them based on the CDC entries.
Full load
This is the most resource hungry solution: you have to copy all data from the source to the destination. If you do not have any column which marks the new records, you should use this solution.
You have several options to achieve this:
TRUNCATE the destination table and reload it from source
Use a Lookup component to determine which records are missing
Load all data from source to a temporary table and write a query which retrieves the new/changed records.
Summary
If you have at least one column, which marks the new/modified records, you can use it to implement a differential/incremental load with SSIS. If you do not have any clue, which columns/rows are changed, you have to load (or at least query) all of them.
There is no solution which enables a one-query (INSERT .. SELECT) load across multiple servers without transferring all the data. (Please note that a multi-server query using Linked Servers also transfers the data from the source system.)
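The sequence-based option above can be sketched outside SSIS as three steps: read the high-water mark from the destination, pass it as a parameter to the source query, and insert only what comes back. The table and column names (`source_table`, `target_table`, `id`, `payload`) are hypothetical, and two in-memory SQLite connections stand in for the two separate servers.

```python
import sqlite3


def load_new_rows(source, destination):
    """Copy rows whose id is greater than the highest id already in the destination."""
    # Step 1: ask the destination for its high-water mark (0 if it is empty).
    last_id = destination.execute(
        "SELECT COALESCE(MAX(id), 0) FROM target_table"
    ).fetchone()[0]

    # Step 2: pull only newer rows from the source, passing the value as a parameter.
    new_rows = source.execute(
        "SELECT id, payload FROM source_table WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()

    # Step 3: insert them into the destination.
    destination.executemany(
        "INSERT INTO target_table (id, payload) VALUES (?, ?)", new_rows
    )
    destination.commit()
    return len(new_rows)


if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    destination = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE source_table (id INTEGER, payload TEXT)")
    destination.execute("CREATE TABLE target_table (id INTEGER, payload TEXT)")
    source.executemany("INSERT INTO source_table VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 6)])
    print(load_new_rows(source, destination), "rows copied")
```

In an SSIS package the same flow is usually built from an Execute SQL Task that stores the last value in a package variable, which then parameterizes the source query in the data flow. Only the single value crosses between the servers before the load, not the whole table.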
What about variables? Is it possible to use the same variable between different databases and servers in SSIS?
I would like to transfer last id number from a destination table and transfer it to the source table (different server!).
I can set a variable in a database scope like this:
-- Highest Id already loaded into the destination table
DECLARE @Last int
SET @Last = (SELECT TOP 1 Id FROM dbo.Table_1 ORDER BY Id DESC)
-- Only rows newer than that Id
SELECT *
FROM dbo.Table_2
WHERE ID > @Last;
However, this only works between two tables in the same database (as a SQL command). I can create a variable for an entire SSIS package in Variables --> Add variable, but I don't know whether it is possible to use that variable in a similar way as above: to hold the last id from the destination table and pass it to another table on the source server as a data limit.
I have 12 very-large Excel sheets.
Each one is 102 columns wide, and an average of 600K rows long.
They're all identical in structure.
Quite often I need to run a specific query, from only a subset of columns, with a specific criteria. And the process of doing so by opening each file and filtering/sorting/finding what I need has become very tedious.
If they were all in a database, such queries would be so much easier.
I tried MS Access, and I tried SQL Server Express.
Both die on me during the respective Import wizards.
And the failure is certainly due to the size of the data set, because if I manually trim a file to say 10 rows, the import works fine.
Any ideas how to do this? I'm open to using Access, or SQL Server Express, or any other tool that does the job really.
Note: Some of the columns contain records that contain commas, hence several suggestions I found to turn the files into CSVs prior to import, ended up with broken structure.
Edit 1: They're 12 separate workbooks, with 1 sheet in each. And the aim is create a single table where all the data is appended.
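One possible route, sketched under several assumptions: each workbook is saved as CSV with quoting intact (Excel's own CSV export wraps fields that contain commas in double quotes), a proper CSV parser is used instead of splitting on commas, and SQLite is chosen as the target simply because it ships with Python. The function name and file layout are hypothetical.

```python
import csv
import sqlite3


def append_csv_files(conn, table, csv_paths):
    """Append every CSV file into one table and return the number of rows inserted."""
    total = 0
    for index, path in enumerate(csv_paths):
        # utf-8-sig also copes with the byte-order mark some exports add.
        with open(path, newline="", encoding="utf-8-sig") as handle:
            reader = csv.reader(handle)  # honours quoted fields that contain commas
            header = next(reader)
            if index == 0:
                # Table and column names come from trusted files, so they are
                # interpolated directly; the values go through placeholders.
                columns = ", ".join(f'"{name}"' for name in header)
                conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
            placeholders = ", ".join("?" for _ in header)
            cursor = conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader
            )
            total += cursor.rowcount
    conn.commit()
    return total
```

Usage would be along the lines of `append_csv_files(sqlite3.connect("combined.db"), "records", list_of_twelve_paths)`, after which the subset-of-columns queries become ordinary SELECT statements. Rows are streamed from the reader, so the 600K-row files are not loaded into memory in one go.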
I would like to construct a cross-reference distance chart similar to the one here (the example is a road-distance cross-reference chart) and, ideally, store the data in SQL Server 2008 (preferably the Express version). It needs these properties / abilities:
Every column has a corresponding row with the same name (ie. not misspelled like my example).
Changing the value at one Row-Column intersection would update the mirror intersection (Column-Row) or the mirror data could be ignored.
The distance-values would need to be end-user editable.
The end-user would need to be able to add, delete or rename a column/row pair.
The end-user needs to be able to sort the columns and have the rows move automatically.
There could be hundreds of pairs.
a look-up query needs to find a distance given a start & destination (Row & Column)
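One way to picture the mirror and look-up requirements above is to store each pair only once under an order-independent key, so there is no second intersection to keep in sync. This is a sketch of the data model only (the class and method names are invented); in SQL Server the same idea would be a table of two place columns plus a distance, with the two places always stored in a fixed order.

```python
class DistanceChart:
    """Symmetric distance table: each pair is stored once, so both intersections agree."""

    def __init__(self):
        self._distances = {}

    @staticmethod
    def _key(a, b):
        # Sorting makes (A, B) and (B, A) map to the same entry.
        return tuple(sorted((a, b)))

    def set(self, start, destination, distance):
        self._distances[self._key(start, destination)] = distance

    def get(self, start, destination):
        if start == destination:
            return 0
        return self._distances[self._key(start, destination)]

    def rename(self, old, new):
        """Rename a row/column pair everywhere it appears."""
        renamed = {}
        for (a, b), distance in self._distances.items():
            pair = self._key(new if a == old else a, new if b == old else b)
            renamed[pair] = distance
        self._distances = renamed

    def remove(self, name):
        """Delete a row/column pair together with all of its distances."""
        self._distances = {
            pair: d for pair, d in self._distances.items() if name not in pair
        }


if __name__ == "__main__":
    chart = DistanceChart()
    chart.set("Leeds", "York", 25)  # hypothetical places and distance
    print(chart.get("York", "Leeds"))
```

Sorting the columns then becomes a presentation concern of whatever UI displays the chart, since the stored data has no row or column order of its own.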
The distance chart is reasonably straightforward to implement in Excel. Considering this, am I better off...
Using Excel as the user editing UI and then updating an SQL 'thing' with the new data?
Using Excel as the data-source even if it means performance issues with querying the data?
Using an as-yet undiscovered stroke of genius detailed here in an answer?
Sure looks like an Excel application to me, start to end. (heh)
I can't imagine your users typing enough data in to make performance an issue. Excel 2007 and later will take 1,048,576 rows by 16,384 columns (the older .xls format is limited to 65,536 rows by 256 columns). If that's enough, I'd say you're golden.