Excel VBA: More Efficient Way of Removing Rows From Worksheet - arrays

Please note the edit after the original function code block
I've got this data set in Excel that I download from my company's cost management system each month. On average, this data set is around 100,000 rows with 32 columns. One of my job functions is to filter out line items that don't belong to my work group and arrange the data in the required format for a separate analysis system. Typically, I re-arrange the columns, enter a bunch of formulas into cells, and then use a series of autofilter checks to identify line items that need to be moved to other worksheets. This normally takes me about a couple of hours tops, but it's quite arduous and I'd rather automate the process to save time and reduce chances for me to make mistakes.
So I went ahead and wrote a VBA procedure that satisfies all of the requirements and everything seems to be checking out. The only problem is that the procedure itself takes about an hour to check 10,000 line items (I stopped it at that point). Wasting 10 hours watching a progress bar tick is not going to cut it. So now I'm trying to re-think how I've written this procedure to see if there's a better way (I'm certain there is).
Here's the code as it stands (I omitted a lot of code before and after the main loop for clarity, but I left comments there so you can see what happens in a 'pseudo-code' manner. The vast majority of time is spent in that loop, so it's really my main concern):
ORIGINAL FUNCTION
Function Prepare_CICTDF()
    'Rename and set worksheet
    wbRawFile.Worksheets("Sheet1").Name = "Excluded"
    Set wsSheet = wbRawFile.Worksheets("Excluded")

    'Update progress bar
    status_message = "Rearranging columns in CICT Dedicated Facility. This may take several minutes."
    Call Progress_Bar(current_row, status_message)

    'Rearrange columns
    'Omitted to shorten code block

    'Create worksheet for included rows
    wbRawFile.Worksheets.Add().Name = "Self Service"

    'Copy header row to other worksheets
    wsSheet.Rows("4").Copy Destination:=Sheets("Self Service").Range("A4")

    'Import Lookup List
    Dim wbLookupList As Workbook
    Set wbLookupList = Workbooks.Open("\\server\path\to\file\Dedicated Facility Lookup List.xlsx")
    Dim wsLookupList As Worksheet
    Set wsLookupList = wbLookupList.Worksheets("Lookup List")
    wsLookupList.Copy Before:=wbRawFile.Worksheets("Excluded")
    wbLookupList.Close SaveChanges:=False

    'Get first and last data row
    Dim FirstRow As Long
    Dim LastRow As Long
    FirstRow = 5
    LastRow = wsSheet.UsedRange.Rows.Count - 1

    'Update progress bar
    status_message = "Preparing rows in CICT Dedicated Facility."
    Call Progress_Bar(current_row, status_message)

    'Loop through the rows to add formulas
    Dim NextBlankRow As Long
    Dim RowDeleted As Boolean
    Dim i As Long
    i = FirstRow
    '-------------------------LOOP STARTS HERE-------------------------
    Do While i <= LastRow
        RowDeleted = False

        'Add "CICTDF" before project ID
        wsSheet.Range("B" & i).Value = "CICTDF" & wsSheet.Range("B" & i).Text

        'Add formula for "Total Impact" column in column T
        wsSheet.Range("T" & i).FormulaR1C1 = "=IF(AND(RC[-10]=""Complete"",RC[7]=""Manual Part Number Line Item""),RC[5],IF(AND(RC[-10]=""Complete"",RC[5]=0),0,IF(RC[-10]=""Complete"",RC[5]/RC[-5]*RC[4],RC[5])))"

        'Add formula for rows with blank "Cost Impact - Part" column
        If wsSheet.Range("V" & i).Value = "" Then
            wsSheet.Range("V" & i).FormulaR1C1 = "=IF(RC[-7]>0,RC[3]/RC[-7]*-1,0)"
        End If

        'Change GLOBAL SUPPLY NETWORK to GLOBAL PURCHASING
        If wsSheet.Range("F" & i).Value = "GLOBAL SUPPLY NETWORK" Then
            wsSheet.Range("F" & i).Value = "GLOBAL PURCHASING"
        End If

        'Change numbers stored as text back to numbers
        wsSheet.Range("M" & i).NumberFormat = "General"
        wsSheet.Range("M" & i).Value = wsSheet.Range("M" & i).Value
        wsSheet.Range("P" & i).NumberFormat = "General"
        wsSheet.Range("P" & i).Value = wsSheet.Range("P" & i).Value
        wsSheet.Range("AB" & i).NumberFormat = "General"
        wsSheet.Range("AC" & i).NumberFormat = "General"
        wsSheet.Range("AD" & i).NumberFormat = "General"
        wsSheet.Range("AE" & i).NumberFormat = "General"

        'Insert Cab Part # Formula
        wsSheet.Range("AB" & i).Formula = "=VLOOKUP(M" & i & ",'Lookup List'!A:A,1,FALSE)"
        'Insert Cabs DC formula
        wsSheet.Range("AC" & i).Formula = "=VLOOKUP(N" & i & ",'Lookup List'!B:B,1,FALSE)"
        'Insert Cab Localization HEX & MG Formula
        wsSheet.Range("AD" & i).Formula = "=VLOOKUP(B" & i & ",'Lookup List'!C:C,1,FALSE)"
        'Insert Already in MOASS formula
        wsSheet.Range("AE" & i).Formula = "=VLOOKUP(B" & i & ",'Lookup List'!D:D,1,FALSE)"

        'Include part numbers that match the inclusion criteria
        If wsSheet.Range("AB" & i).Text <> "#N/A" And wsSheet.Range("AC" & i).Text = "#N/A" And wsSheet.Range("AD" & i).Text = "#N/A" _
            And wsSheet.Range("AE" & i).Text = "#N/A" And wsSheet.Range("P" & i).Value = "14" Then
            NextBlankRow = Worksheets("Self Service").UsedRange.Rows.Count + 1
            wsSheet.Rows(i).Copy Destination:=Worksheets("Self Service").Range("A" & NextBlankRow)
            wsSheet.Rows(i).Delete
            RowDeleted = True
        End If

        'Check if the row was included or not
        If RowDeleted = True Then
            LastRow = LastRow - 1
        Else
            i = i + 1
        End If

        'Update the progress completion
        current_row = current_row + 1
        Call Progress_Bar(current_row, status_message)
    Loop
    '-------------------------LOOP STOPS HERE-------------------------

    'Autofilter header row in Self Service tab
    Worksheets("Self Service").Range("B4:AG4").AutoFilter

    'Save as new file format
    Worksheets("Self Service").Select
    wbRawFile.SaveAs Filename:=output_directory & "CICT 2014 Dedicated Facility.xlsx", FileFormat:=51
    wbRawFile.Application.DisplayAlerts = True
    wbRawFile.Close SaveChanges:=False
End Function
Basically I loop through all of the lines in the file. For each line, I enter the formulas and values that I need and then check if they satisfy the inclusion requirements. If they do, I move the line to the "Self Service" worksheet, delete the line from the "Excluded" worksheet, and move on to the next line.
After running the first 10,000 lines of data, the elapsed time was just over 58 minutes. I think most of this can be attributed to the copy/paste/delete processes at the tail end of the loop. I've read that a common suggestion is to work within arrays instead of manipulating cells/rows/ranges in Excel, but I'm not exactly sure how I would go about doing this.
----------Edit:----------
After some input from Ron Rosenfeld, I revisited my process a little bit and made a bunch of changes. In the end, the new procedure processes and prepares over 100,000 rows (of 32 columns) in just over 49 minutes. The original procedure would have taken over 9.75 hours, so the changes have resulted in a procedure that's over 10x faster than its predecessor. Rather than paste the entire code block again, I'll describe the procedure in "pseudo-code":
Rearrange columns (takes the raw server download and puts it in the order I need).
Create a new worksheet for the included rows. Note that for my purposes, I process over 100,000 rows but end up keeping only about 10,000. Thus, I made the decision to look for those that I would INCLUDE instead of those that I would EXCLUDE.
Enter formulas in the first row of data and fill them down the column. I used Ron's suggestion of e.g. Range("A" & FirstRow & ":A" & LastRow) = "=B1+C1" for any columns that I could.
There was one column that only needed formulas if the cell was blank. So I used the SpecialCells(xlCellTypeBlanks) method to enter these.
AutoFiltered the data so that only the rows I wanted to include were visible. Again I used the SpecialCells(xlCellTypeVisible) method to find these and stored them in an array. This array was then entered into the new worksheet.
Finally, I did a little bit of massaging format-wise to make sure everything looked consistent (since storing the values in the array lost the cell formats).
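The steps above might look roughly like the sketch below. It is not the actual procedure: the workbook reference, the use of column B to find the last row, and the single filter on column P are assumptions made for illustration, while the sheet names and formulas are taken from the original function. The remaining inclusion criteria would need their own AutoFilter calls.

```vba
Sub PrepareByFilterSketch()
    ' Assumptions: both sheets already exist in this workbook, the header is
    ' in row 4, data starts in row 5, and column B is filled for every row.
    Dim wsData As Worksheet, wsOut As Worksheet
    Dim firstRow As Long, lastRow As Long
    Dim rngBlanks As Range

    Set wsData = ThisWorkbook.Worksheets("Excluded")
    Set wsOut = ThisWorkbook.Worksheets("Self Service")
    firstRow = 5
    lastRow = wsData.Cells(wsData.Rows.Count, "B").End(xlUp).Row
    If lastRow < firstRow Then Exit Sub

    ' One write fills the whole column; the relative reference written for
    ' the first row is adjusted automatically for each row below it.
    wsData.Range("AB" & firstRow & ":AB" & lastRow).Formula = _
        "=VLOOKUP(M" & firstRow & ",'Lookup List'!A:A,1,FALSE)"

    ' Only the blank cells in column V receive a formula.
    ' SpecialCells raises an error when there are no blanks, hence the guard.
    On Error Resume Next
    Set rngBlanks = wsData.Range("V" & firstRow & ":V" & lastRow) _
        .SpecialCells(xlCellTypeBlanks)
    On Error GoTo 0
    If Not rngBlanks Is Nothing Then
        rngBlanks.FormulaR1C1 = "=IF(RC[-7]>0,RC[3]/RC[-7]*-1,0)"
    End If

    ' Filter on one inclusion criterion (column P = 14), copy the header and
    ' the visible rows to the output sheet, then remove the filter again.
    With wsData.Range("A4:AG" & lastRow)
        .AutoFilter Field:=16, Criteria1:="14"
        .SpecialCells(xlCellTypeVisible).Copy Destination:=wsOut.Range("A4")
        .AutoFilter
    End With
End Sub
```

Copying the visible cells directly, rather than going through an array, keeps the cell formats, so the final format clean-up step may not be needed with this variant.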
It should also be noted that I think Tim's suggestion of using SQL in this scenario could be a very efficient alternative--I simply wasn't versed well enough on the topic to try it out. I'll be looking for ways to use it in the future, though!
Thanks everyone for the help!

Without knowing the exact layout of your worksheets, it is hard to say. In general, with regard to values, reading a large data set into an array, looping through the array to decide which rows/items to keep, and then writing the results back to a new sheet is usually at least an order of magnitude (10x) faster than looping through worksheet rows. Sometimes the challenge is figuring out how big the results array needs to be. If that cannot be done with some simple formulas, I have taken the interim step of gathering each row into a collection before dimensioning the results array.
Another thought after looking at your code: why not just filter on the values for columns 15, 27-30, and then copy/paste the visible cells to your new worksheet?
After you write the data to the worksheet, you can select all the blanks in the range with the SpecialCells method and write the formula that way:
R.Columns(X).SpecialCells(xlCellTypeBlanks).FormulaR1C1 = "=RC[1]+RC[2]"
To get the size of arrIncluded, it seems you could either use CountIfs, or add the desired rows to a Collection, use its Count property to get the size of arrIncluded, and then write the Collection to the array. I prefer the Collection method, but test to see which way is faster.
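As a rough illustration of the array-plus-Collection approach, here is a minimal sketch. The sheet names, the A:AF data range starting in row 5, and the single test on column 16 are assumptions standing in for your real layout and inclusion criteria.

```vba
Sub FilterWithArraySketch()
    ' Assumptions: 32 columns of data (A:AF) starting in row 5 of "Excluded";
    ' rows whose 16th column equals 14 are the ones to keep.
    Dim wsSrc As Worksheet, wsDst As Worksheet
    Dim data As Variant, result() As Variant
    Dim keep As Collection, item As Variant
    Dim r As Long, c As Long, n As Long, lastRow As Long

    Set wsSrc = ThisWorkbook.Worksheets("Excluded")
    Set wsDst = ThisWorkbook.Worksheets("Self Service")
    lastRow = wsSrc.Cells(wsSrc.Rows.Count, "B").End(xlUp).Row
    If lastRow < 5 Then Exit Sub

    ' One read pulls the whole block into a 1-based 2-D Variant array
    data = wsSrc.Range("A5:AF" & lastRow).Value

    ' First pass: remember the index of every row to keep
    Set keep = New Collection
    For r = 1 To UBound(data, 1)
        If CStr(data(r, 16)) = "14" Then keep.Add r
    Next r
    If keep.Count = 0 Then Exit Sub

    ' The Collection's Count gives the size of the results array
    ReDim result(1 To keep.Count, 1 To UBound(data, 2))
    n = 0
    For Each item In keep
        n = n + 1
        For c = 1 To UBound(data, 2)
            result(n, c) = data(item, c)
        Next c
    Next item

    ' One write puts all kept rows on the destination sheet
    wsDst.Range("A5").Resize(UBound(result, 1), UBound(result, 2)).Value = result
End Sub
```

The worksheet is touched only twice (one read, one write), which is where most of the speed-up comes from. Note that only values are transferred, so formats and formulas are not carried over.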

Related

VBA SumIF between two Arrays

I have two large tables of data that I have been playing around with to try and minimize computing resources. The two tables roughly have the following information:
1) Freight Billing information
InvoiceNum | TrackingNum | Weight | BilledAmt | IncentiveCredit
2) ERP Generated Invoice Information
InvoiceNum | Weight | ProductRev | FreightRev | TaxAmt | DiscountAmt
I'm conducting an analysis between what the freight company is charging per order (Freight Cost) compared to what we charge customers per order (Freight Revenue). Additionally, I'm hoping to compare reported weights for both the freight company and our company to audit accuracy.
There will only be one instance of the invoice in the ERP data, but there can be multiple lines in the freight data with the same invoice number. A simple SumIF column would work well if the tables weren't both approximately 100,000 lines. I know that working with Arrays is generally much faster than working directly with ranges, so I started there.
Option Explicit

Sub compareFreightData()
    Application.ScreenUpdating = False

    'Declare Sheets
    Dim systemSht As Worksheet
    Dim billingSht As Worksheet
    Set systemSht = ThisWorkbook.Sheets("SystemData")
    Set billingSht = ThisWorkbook.Sheets("UPSData")

    'Declare Arrays
    Dim billingArray As Variant
    Dim systemArray As Variant

    'Declare timer (Timer returns fractional seconds, so use Double)
    Dim t As Double
    t = Timer

    'Add additional header info
    systemSht.Range("K3").Value = "BilledTotal"
    systemSht.Range("L3").Value = "IncentiveCredits"
    systemSht.Range("M3").Value = "OrderWeight"

    'Put info into arrays
    billingArray = billingSht.ListObjects("UPSCSVFiles").DataBodyRange.Value
    systemArray = systemSht.Range("A4:J" & systemSht.Cells(systemSht.Rows.Count, "A").End(xlUp).Row).Value

    'ReDim to have the three additional columns in the array
    ReDim Preserve systemArray(1 To UBound(systemArray, 1), 1 To UBound(systemArray, 2) + 3)

    '****************************CALCULATE THE SUMIF*******************
    '
    '

    'Timer
    Debug.Print "Timer", Timer - t

    Application.ScreenUpdating = True
End Sub
Some ideas I had were the following:
If there is a way to sum all the items in a filtered array then print the amount. (I think Filter() only works with single dimension arrays)
Loop through billing array, find the location in system array using MATCH, then add to system array (this seems very slow)
Mostly, I believe there is some easier way that I am missing and any input or ideas would be greatly appreciated.
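One possible way to approach the second idea without MATCH is to total the billing lines per invoice in a Scripting.Dictionary and then look each ERP invoice up in it. The following is only a sketch under assumptions: it takes InvoiceNum to be column 1 and BilledAmt column 4 of the billing table (following the header order listed above), InvoiceNum to be column 1 of the ERP data, and it relies on Scripting.Dictionary, which is available on Windows only.

```vba
Sub SumByInvoiceSketch()
    ' Assumed column positions: billing InvoiceNum = 1, BilledAmt = 4;
    ' system InvoiceNum = 1. Adjust to the real table layout.
    Dim billingArray As Variant, systemArray As Variant
    Dim out() As Variant
    Dim totals As Object
    Dim r As Long, key As String

    billingArray = ThisWorkbook.Sheets("UPSData") _
        .ListObjects("UPSCSVFiles").DataBodyRange.Value
    With ThisWorkbook.Sheets("SystemData")
        systemArray = .Range("A4:J" & .Cells(.Rows.Count, "A").End(xlUp).Row).Value
    End With

    ' Pass 1: accumulate the billed amount per invoice number.
    ' Reading a missing key returns Empty, which behaves as 0 in the addition.
    Set totals = CreateObject("Scripting.Dictionary")
    For r = 1 To UBound(billingArray, 1)
        key = CStr(billingArray(r, 1))
        totals(key) = totals(key) + billingArray(r, 4)
    Next r

    ' Pass 2: look up each ERP invoice; invoices without billing lines get 0
    ReDim out(1 To UBound(systemArray, 1), 1 To 1)
    For r = 1 To UBound(systemArray, 1)
        key = CStr(systemArray(r, 1))
        If totals.Exists(key) Then
            out(r, 1) = totals(key)
        Else
            out(r, 1) = 0
        End If
    Next r

    ' Write the BilledTotal column in one operation
    ThisWorkbook.Sheets("SystemData").Range("K4").Resize(UBound(out, 1), 1).Value = out
End Sub
```

Each table is scanned once, so the work grows with the sum of the two row counts rather than their product. The incentive credit and weight columns could be handled the same way with additional dictionaries.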

Excel - setting a dynamic print area for a range of cells covered by an array formula

I have a spreadsheet with a page (Sheet 3) which I would like to export as a PDF (using a macro to export the pdf).
Currently, this links in to another worksheet, in which a user can put in a date range to pull out relevant data from a larger worksheet. This uses the following array formula to populate the data on Sheet 3:
=IFERROR(INDEX(Sheet1!$V$3:$W$5998,SMALL(IF((Sheet1!$C$3:$C$5998>=Crynodeb!$D$3)*(Sheet1!$C$3:$C$5998<=Crynodeb!$F$3)*(Sheet1!$V$3:$V$5998<>""),ROW(Sheet1!$V$3:$W$5998)-2),ROW(18:18)),1),"")
The array formula on Sheet 3 has been applied to around 6000 rows of data. So there is potential for 6000 lines of data to be returned. However, depending on the criteria the user has put in, maybe only 5 rows of data will be returned.
In addition to this, I've applied cell formatting to the 6000 rows so that there's a print-friendly line in between the rows of data.
However, because this has been applied to the 6000 rows of data, there could be 61 or so pages exported, when in reality only a page worth of data is displaying.
Is there an easy way to continue having the array formula applied across a large range, while limiting the print function to only apply to pages containing data that is returned from the array formula?
I'm also using the Format > AutoFit Row Height function to adjust the row height in accordance to the length of the returned items, but at the moment I think I have to do this manually every time I return data. Is there a way of applying that automatically to adjust around the content of the page?
Many thanks
Since you want to export the data to PDF via macro, we can just put everything you want into that one macro.
The macro needs to find the range that has values. From your example, I'm assuming there will always be data returned starting in A3, and that there will be no blanks in column A until the end of the data.
So, this macro starts at A3 and looks at each cell until it finds a blank value to determine where the data ends.
I'm also assuming you have a fixed number of columns of data. I can only see 2 columns, so I've used that figure in the macro. You can update the code where indicated to the actual number of columns.
Then, from A1 to the last row and column of the data, it autofits the rows and then saves that range as a PDF.
Public Sub SaveCopy()
    'Long rather than Integer, so row numbers above 32,767 cannot overflow
    Dim lastRow As Long

    'Loop through column A to find the last row with data
    lastRow = 3
    Do While Cells(lastRow + 1, 1).Value <> ""
        lastRow = lastRow + 1
    Loop

    'With the range of data...
    'NOTE: REPLACE '2' WITH ACTUAL NUMBER OF COLUMNS OF DATA
    With Range("A1", Cells(lastRow, 2))
        'Ensure text is wrapping - required to autofit
        .WrapText = True
        'Autofit row height
        .Rows.AutoFit
        'Save as PDF
        .ExportAsFixedFormat Type:=xlTypePDF, _
            Filename:=ThisWorkbook.FullName, _
            Quality:=xlQualityStandard, _
            IncludeDocProperties:=True, _
            IgnorePrintAreas:=False, _
            OpenAfterPublish:=True
    End With
End Sub
This is a basic export: it saves the PDF in the same location as the workbook, using the same filename (including the Excel file extension) with the .PDF extension appended.
There are various options for handling the filename and location of the PDF. With more information about what is required, I could help provide a more robust solution.
Alternatively, you can remove the line of code for exporting to PDF and replace it with just ".Select"
This will select the range of data and then you can manually save to PDF. Just use Save As... change the type to PDF, click Options and choose "Selection" in the "Publish what" section.
If any of my assumptions are incorrect, please let me know.

VBA group by process using Public Types

I want to create a counter in a routine that will count how many times a specific entry has appeared so far.
The routine that I have created so far populates data in a spreadsheet through a For..Next loop. For each of these rows I have an extra column that will act as the counter and count how many times a characteristic of the entry row has appeared so far in the previous rows. For that, I am using the Application.WorksheetFunction.CountIf function, but the reference range has to be dynamic.
For example, I have the following table
Example Table
The overall idea is to group by month and expense type and get the summed amount. The role of the counter is to identify the rows that can be grouped together, so that I can loop through their values and sum them. The table has approximately 10,000 rows and 53 columns. For this process, I have created the following public type:
Public Type OP
    Month As String
    expense_type As String
    amount As Double
End Type

Sub NewOuput()
    'output is the existing table that I get the data from; I want to manipulate
    'the data and then populate it into another table of the same format.
    'output and lastrow are set up elsewhere in the routine.
    Dim rec As OP
    Dim i As Long
    With Sheet1
        For i = 1 To lastrow
            rec.Month = output(i, 1)
            rec.expense_type = output(i, 2)
            rec.amount = output(i, 3)
            '----------------------------
            .Cells(i, 1) = rec.Month 'this is the population of the data in the new table
            .Cells(i, 2) = rec.expense_type
            .Cells(i, 3) = rec.amount
        Next i
    End With
End Sub
Through functions, I try to identify the rows that need to be summed up and then call the respective functions in the output part of the loop.
The Excel CountIf function cannot be applied to arrays, so it is now out of the question. I have read many posts on various ways of grouping, including data connections, collections and other customised approaches. Collections appeared to be the best option, but I am missing some of the background there.
Does this make any sense? Any suggestions are appreciated.
I didn't fully grasp your exact needs, but judging from the example table image, I'd go as follows:
Sub NewOuput()
    With Sheet1
        'fill in the voids of 1st column
        With .Range("A1:A" & .Cells(.Rows.Count, "B").End(xlUp).Row) '<--| change "A" and "B" to your actual 1st and 2nd columns index
            .SpecialCells(xlCellTypeBlanks).FormulaR1C1 = "=R[-1]C"
            .Value = .Value
        End With

        'more code to exploit a "full" database structure
    End With
End Sub

Excel VBA: Chart-making macro that will loop through unique name groups and create corresponding charts?

Alright, I've been racking my brain, reading up excel programming for dummies, and looking all over the place but I'm stressing over this little problem I have here. I'm completely new to vba programming, or really any programming language but I'm trying my best to get a handle on it.
The Scenario and what my goal is:
The picture below is a sample of a huge long list of data I have from different stream stations. The sample only holds two (niobrara and snake) to illustrate my problem, but in reality I have a little over 80 stations worth of data, each varying in the amount of stress periods (COLUMN B).
COLUMN A, is the station name column.
COLUMN B, stress period number
COLUMN C, modeled rate
COLUMN D, estimated rate
What I have been TRYING to figure out is how to make a macro that will loop through the station names (COLUMN A) and for each UNIQUE Group of station names, make a chart that will pop out to the right of the group, say in the COLUMN E area.
The chart is completely simple, it just needs two series scatterplot/line chart; one series with COLUMN B as x-value and COLUMN C as y-value; and the other series needs COLUMN B as x-value and COLUMN D as y-value.
Now my main ordeal, is that I don't know how to make the macro distinguish between station names, use all the data relating to that name to make the chart, then looping on to the next Station group and creating a chart that corresponds for that, and to continue looping through all 80+ station names in COLUMN A and to make the corresponding 80+ charts to the right of it all in somewhere like the COLUMN E.
If I had enough points to "bounty" this, I would in a heartbeat. But since I do not, whoever can solve my dilemma will receive my sincere gratitude for helping me understand and solve this problem, and hopefully improve my understanding of scenarios like this in the future. If there is any more information that I need to clarify to make my question more understandable, please comment and I'd be happy to explain in more detail.
Cheers.
Oh, and for extra credit: now that I think about it, I manually entered the numbers in COLUMN B. Since the loop would need to use that column as the x-value, it would be important for the macro to fill that column on its own before it makes the chart (I would imagine it has to do with something as simple as "counting out the rows that correspond to the station name"). But again, I don't know the proper terminology for matching on the station name, hence the pickle I'm in. I'd imagine such a piece of code would be simple enough for a veteran programmer, yet crucial to the success of the macro I seek.
Try this
Sub MakeCharts()
    Dim sh As Worksheet
    Dim rAllData As Range
    Dim rChartData As Range
    Dim cl As Range
    Dim rwStart As Long, rwCnt As Long
    Dim chrt As Chart

    Set sh = ActiveSheet

    With sh
        ' Get reference to all data
        Set rAllData = .Range(.[A1], .[A1].End(xlDown)).Resize(, 4)

        ' Get reference to first cell in data range
        rwStart = 1
        Set cl = rAllData.Cells(rwStart, 1)

        Do While cl <> ""
            ' cl points to first cell in a station data set
            ' Count rows in current data set
            rwCnt = Application.WorksheetFunction. _
                CountIfs(rAllData.Columns(1), cl.Value)

            ' Get reference to current data set range
            Set rChartData = rAllData.Cells(rwStart, 1).Resize(rwCnt, 4)

            With rChartData
                ' Auto fill sequence number
                .Cells(1, 2) = 1
                .Cells(2, 2) = 2
                .Cells(1, 2).Resize(2, 1).AutoFill _
                    Destination:=.Columns(2), Type:=xlFillSeries
            End With

            ' Create Chart next to data set
            Set chrt = .Shapes.AddChart(xlXYScatterLines, _
                rChartData.Width, .Range(.[A1], cl).Height).Chart

            With chrt
                .SetSourceData Source:=rChartData.Offset(0, 1).Resize(, 3)

                ' --> Set any chart properties here

                ' Add Title
                .SetElement msoElementChartTitleCenteredOverlay
                .ChartTitle.Caption = cl.Value

                ' Adjust plot size to allow for title
                .PlotArea.Height = .PlotArea.Height - .ChartTitle.Height
                .PlotArea.Top = .PlotArea.Top + .ChartTitle.Height

                ' Name series
                .SeriesCollection(1).Name = "=""Modeled"""
                .SeriesCollection(2).Name = "=""Estimated"""

                ' turn off markers
                .SeriesCollection(1).MarkerStyle = -4142
                .SeriesCollection(2).MarkerStyle = -4142
            End With

            ' Get next data set
            rwStart = rwStart + rwCnt
            Set cl = rAllData.Cells(rwStart, 1)
        Loop
    End With
End Sub

Loop Running VERY Slow

Please help me make my app a little faster; it's taking forever to loop through and give me results right now.
Here is what I'm doing:
1. Load a gridview from an uploaded Excel file (this would probably be about 300 records or so).
2. Compare manufacturer, model and serial No to my MS SQL database (about 20K records) to see if there is a match.
'find source ID based on make/model/serial No combination.
Dim cSource As New clsSource()
Dim ds As DataSet = cSource.GetSources()
Dim found As Boolean = False

'populate db datatables
Dim dt As DataTable = ds.Tables(0)
Dim rows As Integer = gwResults.Rows.Count()

For Each row As GridViewRow In gwResults.Rows
    'move through rows and check data in each row against the dataset
    '1 - make
    For Each dataRow As DataRow In dt.Rows
        found = False
        If dataRow("manufacturerName") = row.Cells(1).Text Then
            If dataRow("modelName") = row.Cells(2).Text Then
                If dataRow("serialNo") = row.Cells(3).Text Then
                    found = True
                End If
            End If
        End If

        'display results
        If found Then
            lblResults.Text += row.Cells(1).Text & "/" & row.Cells(2).Text & "/" & row.Cells(3).Text & " found"
        Else
            lblResults.Text += row.Cells(1).Text & "/" & row.Cells(2).Text & "/" & row.Cells(3).Text & " not found "
        End If
    Next
Next
Is there a better way to find a match between the two? I'm dying here.
For each of your 300 gridview rows, you are looping through all 20k datarows. That makes 300 * 20k = 6 million loop iterations. No wonder your loop is slow. :-)
Let me suggest the following algorithm instead (pseudo-code):
For Each gridviewrow
    Execute a SELECT statement on your DB with a WHERE clause that compares all three components
    If the SELECT statement returns a row
        --> found
    Else
        --> not found
    End If
Next
With this solution, you only have 300 loop iterations. Within each loop iteration, you make a SELECT on the database. If you have indexed your database correctly (i.e., if you have a composite index on the fields manufacturerName, modelName and serialNo), then this SELECT should be very fast -- much faster than looping through all 20k datarows.
From a mathematical point of view, this would reduce the time complexity of your algorithm from O(n * m) to O(n * log m), with n denoting the number of rows in your gridview and m the number of records in your database.
While Heinzi's answer is correct, it may be more beneficial to carry out the expensive SQL query once before the loop and then filter the in-memory data, so you aren't hitting the DB 300 times:
Execute a SELECT statement on your DB and load the result into a DataTable (dt)
For Each gridviewrow
    matches = dt.Select(String.Format("manufacturerName = '{0}'", row.Cells(1).Text))
    If matches contains a row
        --> found
    Else
        --> not found
    End If
Next
NOTE: I only compared a single criterion to illustrate the point; you could filter on all three here.
Hmm... how about loading the data from the spreadsheet into a table in tempdb and then writing a select that compares the rows in the way that you want to compare them? This way, all of the data comparisons happen server-side and you'll be able to leverage all of the power of your SQL instance.
