PDFBox adding white spaces within words - solr

When I try to extract text from my PDF files, it seems to insert white spaces within several words at random.
I am using pdfbox-app-1.6.0.jar (the latest version) on the following sample file from the Downloads section of this page:
http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training
I've tried with several other PDF files and it seems to do the same thing on several pages.
I do the following:
java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/"ped training pdf.pdf"
on the downloaded file, and you will see spaces wrongly inserted in the result on the console, as in the following:
"• If ch ildren are able to walk to
schoo l safely this could reduce the
congestion. "
"• Develops good hab its for later life."
"www.sheff ield.gov.uk"
"Think Ahead!, wh ich is based on the"
etc.
As you can see, several of the words above contain spaces for no reason I can fathom.
I am on Ubuntu, running Sun's JDK 1.6.
I've tried this on several different PDF files and searched the forums for a solution; there were similar bugs, but all of them seemed to have been resolved.
Any help would be appreciated, and if anyone else has the same problem, please comment. This is causing a big problem for indexing the content properly for searching.

Unfortunately there is currently no easy solution for this.
Internally PDF documents simply contain instructions like "place characters 'abc' in position X" and "place characters 'def' in position Y", and PDFBox tries to reason whether the resulting extracted text should be "abc def" or "abcdef" based on things like the distance between X and Y. These heuristics are generally pretty accurate, but as you can see they don't always produce the correct result.
One way to improve the quality of the extracted text is to try a dictionary lookup on each extracted word or token. If the lookup fails, try combining the token with the next one. If a dictionary lookup on the combined token succeeds, then it's fairly likely that the text extractor has mistakenly added an extra space inside the word. Unfortunately such a feature does not yet exist in PDFBox. See https://issues.apache.org/jira/browse/PDFBOX-1153 for the feature request filed for this. Patches welcome!
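As a rough illustration of that idea, here is a minimal sketch in Java (this is not PDFBox code; the dictionary is a stand-in for whatever word list you trust, and the token list would come from the extractor's output):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SpaceRepair {
    /**
     * Re-joins adjacent tokens that were probably split by a spurious space.
     * The dictionary is a placeholder for any word list you trust.
     */
    public static List<String> repair(List<String> tokens, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            // If this token is unknown but current + next forms a known word,
            // assume the extractor inserted a bogus space and merge the two.
            if (!dictionary.contains(current) && i + 1 < tokens.size()
                    && dictionary.contains(current + tokens.get(i + 1))) {
                result.add(current + tokens.get(i + 1));
                i++; // skip the token we just consumed
            } else {
                result.add(current);
            }
        }
        return result;
    }
}

With a dictionary containing "children", the tokens ["If", "ch", "ildren", "are"] would come back as ["If", "children", "are"].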

The class org.apache.pdfbox.util.PDFTextStripper (pdfbox-1.7.1) lets you adjust how readily it decides that two strings are part of the same word.
Increasing the spacing tolerance will reduce the number of inserted spaces.
/**
 * Set the space width-based tolerance value that is used
 * to estimate where spaces in text should be added. Note that the
 * default value for this has been determined from trial and error.
 * Setting this value larger will reduce the number of spaces added.
 *
 * @param spacingToleranceValue tolerance / scaling factor to use
 */
public void setSpacingTolerance(float spacingToleranceValue) {
    this.spacingTolerance = spacingToleranceValue;
}
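For example, a minimal sketch of how you might use it (the file name and the tolerance value 0.7 are placeholders; the right tolerance depends on your documents and has to be found by trial and error):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class ExtractWithTolerance {
    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("ped training pdf.pdf"));
        try {
            PDFTextStripper stripper = new PDFTextStripper();
            // Larger than the default means fewer spurious spaces inside words.
            stripper.setSpacingTolerance(0.7f);
            System.out.println(stripper.getText(document));
        } finally {
            document.close();
        }
    }
}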

Related

CSV file not recognized as csv, reason nominal value not declared in header

I am trying to load a dataset in Weka. I have tried many things, such as converting to ARFF format, fixing the commas, etc., but it all failed. Could any of you give me a working solution for loading this dataset in the correct format?
Here is a link to the dataset
Instead of using Weka's functionality for reading CSV files, you could use ADAMS (developed at the same university; I'm the lead developer) instead.
Download the adams-ml-app snapshot and then use the Weka Investigator to load/save the file:
Load it as ADAMS Spreadsheets (.csv, .csv.gz)
Save it as Arff data files (.arff, .arff.gz) or Simple ARFF data files (.arff, .arff.gz)
The Reviews column contains an erroneous 3.0M, which prevents it from becoming numeric.
If you want an introduction to the Weka Investigator, take a look at my talk from the Weka User Conference 2021: Taking Weka to the next level with ADAMS.
There are too many issues with lines in this file.
In line 23, I eliminated the odd-looking brackets.
I removed all single quotes (').
I eliminated all repeated double quotes ("").
In line 10474, the first two fields (before the number) didn't seem to be separated, so I added a comma.
This allowed the file to get through the initial screening, but...
The file contains a lot of odd emojis. I started eliminating them one by one, but there are clearly more of them than I wish to deal with.
Each time I got rid of one, the parser would read farther into the file, then stop at the next one.
If I just try to read the top of the file, the first 20 lines before we get to any of these problems, it reads fine.
My partial editing can be found here: https://www.dropbox.com/s/ij707mb23dt1jvz/googleplaystore3.csv?dl=0
I think if you clear up the remaining emojis the file should be usable.
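If you would rather automate that last step, here is a minimal sketch in Java (the file names are placeholders; it bluntly strips every character outside printable ASCII, which gets rid of the emojis in one pass):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripNonAscii {
    public static void main(String[] args) throws IOException {
        // Placeholder paths: point these at your copy of the file.
        String text = new String(
                Files.readAllBytes(Paths.get("googleplaystore3.csv")),
                StandardCharsets.UTF_8);
        // Keep tabs, newlines, carriage returns and printable ASCII; drop the rest.
        String cleaned = text.replaceAll("[^\\x09\\x0A\\x0D\\x20-\\x7E]", "");
        Files.write(Paths.get("googleplaystore3-clean.csv"),
                cleaned.getBytes(StandardCharsets.US_ASCII));
    }
}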

SSIS Fuzzy Lookup Transform not matching values with spaces in one and not in the other

I'm working on an SSIS project to transfer data from a legacy system to a new system. This is the first time I've used SSIS, and all had gone well, until now.
I have to match product names from the old and new systems. Some are clean matches, and a standard lookup catches those. Some are not, and I'm using a fuzzy lookup on the no-match output of the main lookup to try to catch those afterwards. This works on some things, but it completely misses what seem like the most obvious matches. For example:
Source data: FG 45J
Target data: FG45J
This is not matched by the fuzzy lookup. I've tried ticking and unticking the space delimiter box, to no avail. The threshold is set to zero so everything gets through, but similarity and confidence are zero on the relevant output records. Some others do return non-zero similarity, but they don't have spaces. Matches to return is set to one, although I've tried setting it up to four to see if that made any difference, and it didn't. I expect I've missed something, but I can't work out what.
Any help would be greatly appreciated.

Parsing text in Julia - invalid ASCII sequence

I'm trying to format a data file so that my other program will properly handle it. I am trying to handle the following data and I am getting a very weird error that I can't seem to put my finger on.
https://snap.stanford.edu/data/wiki-RfA.html
I am trying to format the data as [SRC TGT VOT], so I'd like the first two lines of my output file to be
1 2 1
3 2 1
because user 1 (stored first in the dictionary of users) votes for user 2 with VOT 1, and then user 3 votes for user 2 with VOT 1. My problem is that when I try to run my code below, I always end up getting a very strange "invalid ascii sequence" error. Can anyone help me identify the issue or perhaps find a way around it? It'd obviously be best if I could learn what I am doing wrong. Thank you!
Note: I understand that this is a rather specific question and I appreciate any help. I'm somewhat baffled by this error and don't know how to resolve it at the moment.
f = open("original_vote_data.txt") # this is the file linked above
arr = readlines(f)
i = edge_count = src = tgt = vot = 1
dict = Dict{ASCIIString, Int64}()
edges = ["" for k = 1:198275]
while i < 1586200
    src_temp = (arr[i])[5:end-2]
    if haskey(dict, src_temp)
        new_src = dict[src_temp]
    else
        dict[src_temp] = src
        new_src = src
        src = src + 1
    end
    tgt_temp = (arr[i+1])[5:end-2]
    if haskey(dict, tgt_temp)
        new_tgt = dict[tgt_temp]
    else
        dict[tgt_temp] = tgt
        new_tgt = tgt
        tgt = tgt + 1
    end
    vot_temp = (arr[i+2])[5]
    edges[edge_count] = string(new_src) * " " * string(new_tgt) * " " * string(vot_temp)
    edge_count = edge_count + 1
    i = i + 8
end
Here we go - I'll write up my comment as an answer since it seems to have solved the question.
My hunch that the error stemmed from the fourth line (dict = Dict{ASCIIString, Int64}()) was based on the fact that ASCIIStrings will error if you try to store non-ASCII characters in them. Since this file comes from an international site, it's not unlikely that there are users with unicode characters in their names (or elsewhere in the data). So the simple fix is to change all instances of ASCIIString to UTF8String.
Just to make this answer a bit more complete, I downloaded the file and tried running the program. The simplest way to debug this is to run the script at top-level in the REPL and then inspect the program state after the error. After the error is thrown, i==3017. Now just try running each line of the while loop incrementally. You'll quickly see that line 3017 contains "SRC:Guðsþegn\n" — unicode, as I suspected. When you try to create a new entry in dict with that as the key, the error should have a backtrace to setindex! in dict.jl, where you'll see that it's trying to convert the key (a UTF8String) to an ASCIIString. So changing the dictionary type to have UTF8String keys solves the problem.
As it turns out, the edges array only contains strings of three integers (or sometimes a hyphen), so the ASCIIString there is ok, but still a little dangerous. I'd probably store that information in a more dedicated array of ints instead of converting it to a space-separated string: you know the first two elements in the string are ints, but the last element is unvalidated text from the file itself… which may be unicode or a space itself (which could mess up processing down the line).

Fix CSV file with new lines

I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I chose to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't; I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they contain only one. Thanks, Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines: it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
The problem with this is that it matches the last element of one line and the first element of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}

$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
See an online demo
If you want a programmatic solution in Java, open the file using the OpenCSV library. If it is a one-off manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl one-liner to clean up the CRLFs.
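For the OpenCSV route, a minimal sketch (the file name is a placeholder and this assumes a reasonably recent OpenCSV version; note that OpenCSV copes with quoted fields that span several lines, but rows broken by unquoted newlines would still need a repair pass like the regex above first):

import java.io.FileReader;
import com.opencsv.CSVReader;

public class ReadCsv {
    public static void main(String[] args) throws Exception {
        // A quoted field that spans several lines comes back as one value here.
        try (CSVReader reader = new CSVReader(new FileReader("export.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                System.out.println(String.join(" | ", fields));
            }
        }
    }
}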

Full text catalog/index search for %book%

I'm trying to wrap my head around how to search for something that appears in the middle of a word or expression - something like searching for "LIKE %book%" - but in a SQL Server (2005) full text catalog.
How can I do that? It almost appears as if both CONTAINS and FREETEXT really don't support wildcard at the beginning of a search expression - can that really be?
I would have imagined that FREETEXT(*, "book") would find anything with "book" inside, including "rebooked" or something like that.
Unfortunately, CONTAINS only supports prefix wildcards:
CONTAINS(*, '"book*"')
SQL Server Full Text Search is based on tokenizing text into words. There is no smaller unit than a word, so the smallest things you can look for are words.
You can use prefix searches to look for matches that start with certain characters, which is possible because the word lists are kept in alphabetical order and all the server has to do is scan through the list to find matches.
To do what you want a query with a LIKE '%book%' clause would probably be just as fast (or slow).
If you want to do some serious full text searching then I would (and have) use Lucene.Net. MS SQL Full Text search never seems to work that well for anything other than the basics.
Here's a suggestion that works around that wildcard limitation. You create a computed column that contains the same content as the column(s) you are searching, but reversed.
If, for example, you are searching on a column named ProductTitle, then create a column named ProductsRev and set that field's 'Computed Column Specification' value to:
(reverse([ProductTitle]))
Include the ProductsRev column in your search and you should now be able to return results with a wildcard at the beginning of the word: a search for words ending in 'book' becomes a prefix search for 'koob' against the reversed column. Good luck!
Full text maintains a table that lists all the words the engine has found. It should have orders of magnitude fewer rows than your full-text-indexed table. You could select from that table where field LIKE '%book%' to get all the words that contain 'book', then use that list to write a fulltext query. It's cumbersome, but it would work, and it would be OK in the speed department. HOWEVER, ultimately you are using fulltext wrong when you do this. It might actually be better to educate the source of these feature requests about what fulltext is doing. You want them to understand what it WANTS to do, so they can get high value from fulltext. For example, only use wildcards at the end of a word, which means thinking of the words as an ordered list.
Why not program an assembly in C# to compute all the non-repeated suffixes? For example, if you have the text "eat the red meat", you can store in a field "eat at t the he e red ed d meat" (note that it is not necessary to add "eat", "at" and "t" again) and then use full text search on this field. A function for doing that can easily be written in C#.
x) I know it seems odd... it's a workaround.
x) I know I'm adding overhead on insert/update... this is only justified if that overhead is insignificant compared with the improvement in the search function.
x) I know there is also an overhead in the size of the stored data.
But I'm pretty confident that it will be quite fast.
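The answer proposes C#, but the idea is language-agnostic. Here is a minimal sketch of the suffix computation in Java (the class and method names are made up for illustration):

import java.util.LinkedHashSet;
import java.util.Set;

public class SuffixExpander {
    /** Returns the input words plus all their distinct proper suffixes. */
    public static String expand(String text) {
        Set<String> suffixes = new LinkedHashSet<>();
        for (String word : text.split("\\s+")) {
            // Add the word itself and every suffix of it; the set skips duplicates.
            for (int i = 0; i < word.length(); i++) {
                suffixes.add(word.substring(i));
            }
        }
        return String.join(" ", suffixes);
    }

    public static void main(String[] args) {
        // Prints: eat at t the he e red ed d meat
        System.out.println(expand("eat the red meat"));
    }
}

You would store the expanded string in a separate column and full-text index that column; a prefix search such as CONTAINS(*, '"ook*"') against it then behaves like a '%ook%' search on the original text.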
