Compare strings to identify duplicates - arrays

I have to write an isDup function that compares two tweets based on their count of shared words to determine whether the tweets are duplicates, given a chosen decimal threshold (0-1).
My process is to write a sub with two hardcoded tweets my prof has provided (just to get an understanding before converting to a function). I encountered run-time error 5.
Option Explicit
Sub isDup()
Dim tweet1 As String
Dim tweet2 As String
Dim threshold As Double
threshold = 0.7
tweet1 = "Hours of planning can save weeks of coding"
tweet2 = "Weeks of programming can save you hours of planning"
Dim tweet1Split() As String
tweet1Split = Split(tweet1, " ")
Dim tweet2Split() As String
tweet2Split = Split(tweet2, " ")
Dim i As Integer
Dim j As Integer
Dim sameCount As Integer
'my thought process below was to compare strings i and j to see if equal, and if true add 1 to sameCount,
'but the If StrComp line is where the error is
For i = LBound(tweet1Split) To UBound(tweet1Split) Step 1
For j = LBound(tweet2Split) To UBound(tweet2Split) Step 1
If StrComp(i, j, vbDatabaseCompare) = 0 Then
sameCount = sameCount + 1
Exit For
End If
Next j
Next i
End Sub
'here i wanted to get a total count of the first tweet to compare, the duplicate tweet is true based on the number of
'similar words
Function totalWords(tweet1 As String) As Integer
totalWords = 0
Dim stringLength As Integer
Dim currentCharacter As Integer
stringLength = Len(tweet1)
For currentCharacter = 1 To stringLength
If (Mid(tweet1, currentCharacter, 1)) = " " Then
totalWords = totalWords + 1
End If
Next currentCharacter
End Function
'this is where i compute an "isDup score" based on similar words compared to total words in tweet1, in this
'example the threshold was stated above at 0.7
Dim score As Double
score = sameCount / totalWords
If score > threshold Then
MsgBox "isDup Score: " & score & " ...This is a duplicate"
Else
MsgBox "isDup Score: " & score & " ...This is not a duplicate"
End If
End Sub

First issue:
i and j are just indexes. You want to compare the string that your index relates to so:
If StrComp(tweet1Split(i), tweet2Split(j), vbDatabaseCompare) = 0 Then
Second issue:
As noted in the Microsoft documentation for StrComp, vbDatabaseCompare is reserved for Access, which you are not using, hence the source of your second error. You need to switch to a different comparison, such as vbTextCompare (case-insensitive) or vbBinaryCompare (case-sensitive).
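As a small illustration of both points, the sketch below compares the array elements rather than the loop indexes and uses comparison modes that work outside Access. The Sub name CompareModesDemo and the sample strings are made up for this example.

```vba
Option Explicit

' Hypothetical demo: the Sub name and sample strings are illustrative only.
Sub CompareModesDemo()
    Dim words1() As String
    Dim words2() As String

    words1 = Split("Hours of planning", " ")
    words2 = Split("hours OF coding", " ")

    ' vbBinaryCompare is case-sensitive: "Hours" and "hours" differ.
    Debug.Print StrComp(words1(0), words2(0), vbBinaryCompare) = 0   ' False

    ' vbTextCompare ignores case: "Hours" and "hours" compare as equal.
    Debug.Print StrComp(words1(0), words2(0), vbTextCompare) = 0     ' True

    ' Different words are unequal in either mode.
    Debug.Print StrComp(words1(2), words2(2), vbTextCompare) = 0     ' False
End Sub
```

Whether case should matter for tweets is a design choice; vbTextCompare is probably the closer fit if "Weeks" and "weeks" should count as the same word.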


Count the number of uppercase words in a text review

I want to count the number of uppercase words in a data set of movie reviews (A2:A1001) and paste the results to column E.
I am having trouble getting any response from the sub on the spreadsheet.
Sub UppercaseWordCount()
Dim ArraySplit() As String
Dim X As Integer
Dim Count As Integer
Dim NextWord As String
Dim Line As Integer
Count = 0
ArraySplit = Split("A2:A1001", " ")
Line = 2
'splitting each review into an array and using lower
'and upper bounds with Ucase function to get a count
'len function used to split up words from single characters
For X = LBound(ArraySplit) To UBound(ArraySplit)
NextWord = ArraySplit(X)
If NextWord = UCase(NextWord) And Len(NextWord) >= 2 Then
Count = Count + 1
End If
Next
'calling the sub to column E with a count for each review
Range("E" & Line).Value = Count
Line = Line + 1
End Sub
Try this. You could convert to a custom function if doing repeatedly.
Sub UppercaseWordCount()
    Dim ArraySplit() As String
    Dim X As Long 'long better than integer
    Dim Count As Long
    Dim NextWord As String
    Dim r As Range
    'splitting each review into an array and using lower
    'and upper bounds with Ucase function to get a count
    'len function used to split up words from single characters
    For Each r In Range("A2:A1001") 'loop through defined range
        ArraySplit = Split(r, " ") 'split each cell using space delimiter
        For X = LBound(ArraySplit) To UBound(ArraySplit) 'everything else as before
            NextWord = ArraySplit(X)
            If NextWord = UCase(NextWord) And Len(NextWord) >= 2 Then
                Count = Count + 1
            End If
        Next
        'calling the sub to column E with a count for each review
        r.Offset(, 4).Value = Count 'can use offset rather than defining a new variable
        Count = 0 'reset count
    Next r
End Sub
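The custom-function version mentioned above might look like the following sketch. The name CountUppercaseWords is hypothetical; the rule (a word of two or more characters that equals its own uppercase form) is carried over from the Sub unchanged. StrComp with vbBinaryCompare is used here instead of "=" so that the test stays case-sensitive even if the module sets Option Compare Text.

```vba
Option Explicit

' Hypothetical UDF: counts words of 2+ characters that are unchanged by UCase.
' Like the Sub above, it also counts tokens without letters, such as "10".
Public Function CountUppercaseWords(ByVal reviewText As String) As Long
    Dim words() As String
    Dim i As Long

    words = Split(reviewText, " ")
    For i = LBound(words) To UBound(words)
        If Len(words(i)) >= 2 Then
            ' Case-sensitive comparison of the word with its uppercase form.
            If StrComp(words(i), UCase$(words(i)), vbBinaryCompare) = 0 Then
                CountUppercaseWords = CountUppercaseWords + 1
            End If
        End If
    Next i
End Function
```

Placed in a standard module, it could be entered in E2 as =CountUppercaseWords(A2) and filled down. For the text "I LOVED this MOVIE" it returns 2, since "I" is too short to count.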

Function That Detects if Tweets are similar

I am trying to create a function in VBA that takes two strings and a threshold (a percentage in decimal form) and returns True if the strings share a higher percentage of the same words than the threshold.
Here is the code that I have so far...
Function isDup(tweet1 As String, tweet2 As String, threshold As Double) As Boolean
'Declare variables to store words from each tweet
Dim C1 As String
Dim C2 As String
'Use split function to split each tweet into single words. The " " is the delimiter, each space creates a new word
C1 = Split(tweet1, " ")
C2 = Split(tweet2, " ")
'Loop through each word from tweet1 and each word from tweet2
For i = LBound(C1) To UBound(C1)
For j = LBound(C2) To UBound(C2)
'Declare variable to store result from StrComp Function
Dim Cresult As Double
'Use StrComp Function to compare the current word from tweet1 to the current word from tweet2
Cresult = StrComp(i, j, vbTextCompare)
Next i
Next j
'Use If Then to return true if the tweets are more similar than the percentage given by the threshold
If Cresult > threshold Then
isDup = True
End Function
I am pretty new to VBA so there are some errors, specifically I keep running into the Expected: Array Error. Any help would be greatly appreciated, thank you!
Here's a quick rewrite fixing up the things I noted in my comments above. If this isn't exactly what you were after, it should get you in the ballpark.
Function isDup(tweet1 As String, tweet2 As String, threshold As Double) As Boolean
    'Declare variables to store words from each tweet
    Dim C1 As Variant
    Dim C2 As Variant
    'Declare the loop counters (required when Option Explicit is on)
    Dim i As Long
    Dim j As Long
    'Use split function to split each tweet into single words. The " " is the delimiter, each space creates a new word
    C1 = Split(tweet1, " ")
    C2 = Split(tweet2, " ")
    'Declare variable to store result from StrComp Function
    Dim Cresult As Double
    'Loop through each word from tweet1 and each word from tweet2
    For i = LBound(C1) To UBound(C1)
        For j = LBound(C2) To UBound(C2)
            'Use StrComp Function to compare the current word from tweet1 to the current word from tweet2
            If StrComp(C1(i), C2(j), vbTextCompare) = 0 Then
                Cresult = Cresult + 1
            End If
        Next j
    Next i
    'Use If Then to return true if the tweets are more similar than the percentage given by the threshold
    If Cresult > threshold Then
        isDup = True
    End If
End Function
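Note that the rewrite above compares the raw number of matching word pairs with the threshold, so a single shared word already exceeds any threshold between 0 and 1. Below is a sketch that divides by the word count instead, assuming the score is meant to be the share of tweet1's words that also occur in tweet2. The name IsDupByRatio is hypothetical, and counting each tweet1 word at most once is an assumption, not something the question specifies.

```vba
Option Explicit

' Hypothetical variant: returns True when the share of tweet1's words
' that also appear in tweet2 exceeds threshold (0 to 1).
Function IsDupByRatio(ByVal tweet1 As String, ByVal tweet2 As String, _
                      ByVal threshold As Double) As Boolean
    Dim words1() As String
    Dim words2() As String
    Dim i As Long
    Dim j As Long
    Dim sameCount As Long

    words1 = Split(Trim$(tweet1), " ")
    words2 = Split(Trim$(tweet2), " ")

    ' Split returns an empty array (UBound = -1) for an empty string;
    ' bail out to avoid dividing by zero.
    If UBound(words1) < 0 Then Exit Function

    For i = LBound(words1) To UBound(words1)
        For j = LBound(words2) To UBound(words2)
            If StrComp(words1(i), words2(j), vbTextCompare) = 0 Then
                sameCount = sameCount + 1
                Exit For            ' count each tweet1 word at most once
            End If
        Next j
    Next i

    ' Score = shared words / total words in tweet1.
    IsDupByRatio = (sameCount / (UBound(words1) + 1)) > threshold
End Function
```

With "Hours of planning can save weeks of coding" as tweet1 and "Weeks of programming can save you hours of planning" as tweet2, seven of the eight words in tweet1 are found in tweet2, so the score is 0.875 and a threshold of 0.7 gives True. Runs of several spaces produce empty "words", which this sketch does not filter out.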

Remove characters from a word - VBA

I want to remove characters from a string based on their position. With the code below I manage to:
Insert all of the word's letters into an array
Loop over the array and replace the letters I want with nothing
BUT I don't know how to remove the empty entries from the array, rebuild the array index, and recreate the word without those characters.
'Create an array with all letters
For j = 1 To Len(SheetName2)
ReDim Preserve arrWord(k)
arrWord(k) = Mid(SheetName2, j, 1)
k = k + 1
Next j
Counter = (Len(SheetName2) - 31) * 2
'Loop array and replace with nothing
For w = Counter - 1 To 0 Step -2
arrWord(w) = ""
Next w
You don't need an array of the characters. Just read every second character into a new string.
Public Sub OutputEverySecondChar()
Dim TestString As String
TestString = "abcdefghij"
Dim OutputString As String
Dim iChar As Long
For iChar = 1 To Len(TestString) Step 2
OutputString = OutputString & Mid$(TestString, iChar, 1)
Next iChar
Debug.Print OutputString 'output is: acegi
End Sub
Or, even better, as a reusable function:
Public Sub test()
Debug.Print ReturnEveryXcharcter(InputString:="abcdefghij", FirstChar:=1, xChar:=2, CharLength:=1)
End Sub
Public Function ReturnEveryXcharcter(InputString As String, FirstChar As Long, xChar As Long, CharLength As Long) As String
Dim iChar As Long
For iChar = FirstChar To Len(InputString) Step xChar
ReturnEveryXcharcter = ReturnEveryXcharcter & Mid$(InputString, iChar, CharLength)
Next iChar
End Function
So the function is even more flexible and you can, e.g., use the following to delete every third character (beginning with the first):
Debug.Print ReturnEveryXcharcter(InputString:="abcdefghij", FirstChar:=2, xChar:=3, CharLength:=2)
'output will be: bcefhi
Edit, following the comment:
Then loop until it is no longer than 31 characters:
Public Sub test()
Dim SheetName As String
SheetName = "1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
Do
SheetName = ReturnEveryXcharcter(InputString:=SheetName, FirstChar:=1, xChar:=2, CharLength:=1, LengthLimit:=31)
Loop While Len(SheetName) > 31
Debug.Print SheetName
End Sub
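The call in the loop above names a LengthLimit argument that ReturnEveryXcharcter, as posted earlier, does not declare, so a variant of the function must have been intended. A sketch of what such a variant could look like, assuming the goal is to stop dropping characters as soon as the result fits the limit. The name ReturnEveryXcharLimited and the exact stopping rule are assumptions; xChar and LengthLimit are both assumed to be at least 1.

```vba
Option Explicit

' Hypothetical variant with a LengthLimit parameter (not from the original answer).
' Keeps CharLength characters every xChar positions, but once the kept part plus
' the untouched remainder fits within LengthLimit, the remainder is appended as-is.
Public Function ReturnEveryXcharLimited(ByVal InputString As String, _
        ByVal FirstChar As Long, ByVal xChar As Long, _
        ByVal CharLength As Long, ByVal LengthLimit As Long) As String
    Dim Result As String
    Dim iChar As Long

    ' Nothing to do if the string already fits.
    If Len(InputString) <= LengthLimit Then
        ReturnEveryXcharLimited = InputString
        Exit Function
    End If

    For iChar = FirstChar To Len(InputString) Step xChar
        ' Length if we stopped dropping here: kept so far + rest of the string.
        If Len(Result) + Len(InputString) - iChar + 1 <= LengthLimit Then
            Result = Result & Mid$(InputString, iChar)
            Exit For
        End If
        Result = Result & Mid$(InputString, iChar, CharLength)
    Next iChar

    ReturnEveryXcharLimited = Result
End Function
```

Called from the Do ... Loop While Len(SheetName) > 31 loop with FirstChar:=1, xChar:=2, CharLength:=1, a single pass can at most halve the string, so the loop is still needed for long names; the last pass then stops at exactly 31 characters instead of overshooting.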
My solution:
If Len(SheetName1) > 31 Then
SheetName2 = Replace(SheetName1, " ", "")
'Third check - Remove letter by letter starting from the beginning
If Len(SheetName2) > 31 Then
Counter = (Len(SheetName2) - 31) * 2
Part_1 = Left(SheetName2, Counter)
Part_2 = Mid(SheetName2, Counter + 1, Len(SheetName2) - (Counter))
Semi_Part = ""
For j = 1 To Counter Step 2
'Use only Part_1 because it will be the same in every occurrence
Semi_Part = Semi_Part + Mid(Part_1, j, 1)
Next j
FinalSheetname = Semi_Part & Part_2
Else
FinalSheetname = SheetName2
End If
Else
FinalSheetname = SheetName1
End If
Just for interest, you can also do this with a worksheet function if you have one of the more recent versions of Excel (O365 or 2016+) that has the CONCAT function.
The "trick" is to create an array that includes the positions of the characters you wish to retain.
In the case of deleting every character in an even position (retaining the odd positions):
=CONCAT(MID(A1,(ROW(INDEX($A:$A,1,1):INDEX($A:$A,ROUNDUP(LEN(A1)/2,0),1))-1)*2+1,1))
Since this is an array formula, you need to "confirm" it by holding down Ctrl + Shift while hitting Enter. If you do this correctly, Excel will place braces {...} around the formula, as shown in the formula bar.

Regex expression and match function in text file

I have a problem with my matching function. I have to count the number of lines that contain a specific string and return the line number. I have a one-dimensional array of strings that contains the unique strings of a text file, {33,7,77,3 23,6}, and I have read the lines of the same text file, which contains duplicates of these strings, into another array. Regex.Match mostly works, except that when I check whether line 2 contains 3, the function returns True, because it treats the 3 in 23 as a match for 3. The explanation above is just an example of what I need. Any help, please?
Module Module1
Sub Main()
Dim txt As String = File.ReadAllText("e:\ii.txt")
' Use regular expressions to replace characters
' that are not letters or numbers with spaces.
Dim reg_exp As New Regex("[^a-zA-Z0-9]")
txt = reg_exp.Replace(txt, " ")
' Split the text into words.
'Dim words() As String = txt.Split( _
' New Char() {" "c}, _
' StringSplitOptions.RemoveEmptyEntries)
Dim words = txt.Split(New String() {" ", Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries)
' Use LINQ to get the unique words.
Dim word_query = _
(From word As String In words _
Order By word _
Select word).Distinct()
Dim stra() As String = word_query.ToArray()
For i = 0 To stra.Length - 1
Console.WriteLine(" " & stra(i))
Next
Dim lines() As String = IO.File.ReadAllLines("e:\ii.txt")
For i = 0 To lines.Length - 1
Console.WriteLine(lines(1))
Dim linecount = 0
Dim regex As New Regex(stra(i), RegexOptions.ExplicitCapture)
Dim match As Match = regex.Match(lines(1))
If match.Success Then
linecount += 1
Console.WriteLine("linecount= " & linecount)
Else
Console.WriteLine("false")
End If
Next
End Sub
End Module
You may not have to split the text into words. Is your word list very long? From what I understand, you want the following:
1. Read a text file and return the line number for a given word or phrase.
Is the word or phrase complex? If not, why not use the Contains method?
For example:
Dim myString = "Hello World"
If myString.Contains("World") Then
'Add line number to line count.
End If
If you are using this as an opportunity to learn regular expressions, I highly recommend "Mastering Regular Expressions" by Jeffrey Friedl. When I first began, I invested in a program called RegexBuddy, which is worth the money. But there are now so many online regex testers that one of those could serve as a free alternative.
Enhance your regex with anchors. These ensure that the whole test string matches instead of a substring. The following code also assembles all match patterns of interest into a single regex pattern, which is then used against each line of the target file:
Dim strall As String
strall = ""
For i = 0 To stra.Length - 1
If i > 0 Then
strall = strall & "|"
End If
strall = strall & stra(i)
Console.WriteLine(" " & stra(i))
Next
strall = "^(" & strall & ")$"
Dim regexall As New Regex(strall, RegexOptions.ExplicitCapture)
'...
Dim linecount = 0
Dim match As Match = regexall.Match(lines(i)) '... was 'lines(1)', probably a typo
If match.Success Then
'...
This code is working for me, thanks to all.
Public Function countlines(ByVal st As String) As Integer
Dim count As Integer
Dim linecount As Integer = 0
Dim substrings() As String = Regex.Split(st, " ")
Dim stt() As String = {"23", "7", "3"}
For i = 0 To stt.Length - 1
'For j = 0 To substrings.Length - 1
'Console.WriteLine(substrings(0))
'For i = 0 To substrings.Length - 1
'Console.Write(substrings(i))
Dim matchQuery = From word In substrings Where word.ToLowerInvariant() = stt(i).ToLowerInvariant() Select word
' ' Count the matches.
count = matchQuery.Count()
Console.WriteLine("count=" & count)
If count > 0 Then
linecount += 1
Else
Console.WriteLine(" linecount=" & linecount)
End If
Next
Console.WriteLine("linecount= " & linecount)
Return linecount
End Function
Sub Main()
Dim lines() As String = IO.File.ReadAllLines("e:\ii.txt")
For Each line As String In lines
countlines(line)
Next
End Sub

How to count integers that are greater than a number and then search array to correlate the number to a name

I have two arrays (_intCholesterolLevel and _strPatientName). What I am trying to accomplish is to look through an array and count the number of integers in it that are larger than 200, which I have already done below. But additionally I need to search the other array (_strPatientName) and correlate each number found to be above 200 with a name, such as Bob 272, and then write the name and high number to a file. How do I accomplish the search and correlation?
Dim intCount As Integer = 0
Dim objWriter As New IO.StreamWriter("e:/consult.txt")
For Each intCholesterolLevel In _intCholesterolLevel
If intCholesterolLevel > 200 Then
intCount += 1
End If
Next
lblOutliers.Visible = True
lblOutliers.Text = "There were " & intCount & " people with levels above 200"
The code I used ended up being:
Dim objWriter As New IO.StreamWriter("E:\consult.txt")
' See if file exists.
If IO.File.Exists("E:\consult.txt") Then
' Run loop for numbers over 200 and write the file.
For intCholesterolIndex = 0 To (_intCholesterolLevel.Length - 1)
If _intCholesterolLevel(intCholesterolIndex) > 200 Then
objWriter.WriteLine(_strPatientName(intCholesterolIndex))
objWriter.WriteLine(_intCholesterolLevel(intCholesterolIndex))
End If
Next
Else
MsgBox("The file is not available, try again")
End If
objWriter.Close()
I think what you may be looking for is a Dictionary instead of 2 arrays.
Possible example:
Dim cholesterolLevels As New Dictionary(Of String, Integer)
cholesterolLevels.Add("Bob", 272)
cholesterolLevels.Add("John", 190)
cholesterolLevels.Add("Joe", 205)
cholesterolLevels.Add("Bill", 165)
For Each patient As KeyValuePair(Of String, Integer) In cholesterolLevels
    Dim name As String = patient.Key
    Dim level As Integer = patient.Value
    If level > 200 Then
        intCount += 1
        objWriter.WriteLine(name & " - " & level)
    End If
Next
objWriter.Flush()
objWriter.Close()
If intCount > 0 Then
    lblOutliers.Visible = True
    lblOutliers.Text = "There were " & intCount & " people with levels above 200"
End If
