A customer asked to create a custom character mapper function from specific names to ASCII in their SQL database.
Here is a simplified fragment that works (shortened for brevity):
select TRANSLATE(N'àáâãäåāąæậạả',
N'àáâãäåāąæậạả',
N'aaaaaaaaaaaa');
While analyzing the results on customer's dataset, I noticed one more unmapped symbol ă. So I added it to the mapper as follows:
select TRANSLATE(N'àáâãäåāąæậạảă',
N'àáâãäåāąæậạảă',
N'aaaaaaaaaaaaa');
Unexpectedly, it started failing with the message:
The second and third arguments of the TRANSLATE built-in function must contain an equal number of characters.
Obviously, TRANSLATE thinks that ă is special and consists of more than one character. Actually, even Notepad thinks the same (copy ă and try to delete it using Backspace key - something unusual will happen. Delete key works normally, though).
Then I thought - if TRANSLATE considers it a two-char symbol, let's add a two char mapping then:
select TRANSLATE(N'àáâãäåāąæậạảă',
N'àáâãäåāąæậạảă',
N'aaaaaaaaaaaaaa');
No errors this time, yay. But the input string was not processed correctly, ă was not replaced with a.
What is the correct (case-sensitive) way to replace such "double symbols"? Can it be done using TRANSLATE at all? I don't want to add a bunch of REPLACE for every such symbol I find.
I currently have the following sample text file:
http://pastebin.com/BasTiD4x
and I need to duplicate the CDS blocks. Essentially the line that has the word "CDS" and the 4 lines after it are part of the CDS block.
I need to insert this duplicated CDS block right before a line that says CDS, and I need to change the word CDS in the duplicated block to mRNA.
Of course this needs to happen every time there is an instance CDS.
A sample output would be here:
http://pastebin.com/mEMAB50t
Essentially for every CDS block, I need an mRNA block that says exactly the same thing.
Would appreciate help with this, never done 4 line insertions and replacements.
Thanks,
Adrian
Sorry for the very specific question. Here is a working solution provided by someone else:
perl -ne 'if (! /^\s/){$ok=0;if($mem){$mem=~s/CDS/mRNA/;print $mem;$mem="";}}$ok=1 if (/\d+\s+CDS/);if($ok){$mem.=$_};print;' exemple
I'm trying to insert an email address with a comma in it (and it has to be entered that way) and I'm currently stumped on how to properly insert the exact text into this cell.
The e-mail address that needs to be entered as/is:
joe,.E.Blow#someemail.com
What I've tried is work different escape patterns into the code. For example, this...
'''joe'''+','+'''.E.Blow#someemail.com'''
gets entered into the database as...
'joe','E.Blow#someemail.com'
I've tried different combinations with single and double quotes and it can't even get past a simple parse without errors.
What is the best way to go about getting the email address entered as noted above? I know it's something simple but I'm just spinning my wheels here.
the solution is simple:
for example: joe,.E.Blow#someemail.com
SEE FIDDLE
I ran a query on a MS SQL database using SQL Server Management Studio, and some the fields contained new lines. I selected to save the result as a csv, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't, I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they only contain one new line, thanks Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines, it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
Problem with this is it matches the last element of one line and the 1st of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
'"$0"',
$matches[0]);
}
$row_regex = '~^(?:(?:(?:"[^"*]")+|[^,]*)(?:,|$)){5}$~m';
$result=preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"*]")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
See an online demo
If you want a java programmatic solution, open the file using the OpenCSV library. If it is a manual operation, then open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a perl command to cleanup the CRLFs.
When I try to extract text from my PDF files, it seems to insert white spaces between severl words randomly.
I am using pdfbox-app-1.6.0.jar (latest version) on following sample file in Downloads section of this page :
http://www.sheffield.gov.uk/roads/children/parents/6-11/pedestrian-training
I've tried with several other PDF files and it seems to be doing same on several pages.
I do the following:
java -jar pdfbox-app-1.6.0.jar ExtractText -force -console ~/Desktop/ped training pdf.pdf
on the downloaded file and you will see spaces in following inserted wrongly in the result on console:
"• If ch ildren are able to walk to
schoo l safely this could reduce the
congestion. "
"• Develops good hab its for later life."
"www.sheff ield.gov.uk"
"Think Ahead!, wh ich is based on the"
etc etc.
As you can see several of words above have spaces between them for no reason I can fathom.
I am on ubuntu and running Sun's JDK 1.6.
I've tried this on several different PDF files and tried searching for solution on forums, there were similar bugs but all seemed to have been resolved.
Any help or if anyone else has same problem please comment. This is causing big problem in indexing the content properly for searching.
Unfortunately there is currently no easy solution for this.
Internally PDF documents simply contain instructions like "place characters 'abc' in position X" and "place characters 'def' in position Y", and PDFBox tries to reason whether the resulting extracted text should be "abc def" or "abcdef" based on things like the distance between X and Y. These heuristics are generally pretty accurate, but as you can see they don't always produce the correct result.
One way to improve the quality of the extracted text is to try a dictionary lookup on each extracted word or token. If the lookup fails, try combining the token with the next one. If a dictionary lookup on the combined token succeeds, then it's fairly likely that the text extractor has mistakenly added an extra space inside the word. Unfortunately such a feature does not yet exist in PDFBox. See https://issues.apache.org/jira/browse/PDFBOX-1153 for the feature request filed for this. Patches welcome!
The class org.apache.pdfbox.util.PDFTextStripper (pdfbox-1.7.1) allows to modify the propensity to decide if two strings are part of the same word or not.
Increasing spacingTolerance will reduce the number of inserted spaces.
/**
* Set the space width-based tolerance value that is used
* to estimate where spaces in text should be added. Note that the
* default value for this has been determined from trial and error.
* Setting this value larger will reduce the number of spaces added.
*
* #param spacingToleranceValue tolerance / scaling factor to use
*/
public void setSpacingTolerance(float spacingToleranceValue) {
this.spacingTolerance = spacingToleranceValue;
}