Duplicate Block in Tab Delimited File and replace a word - text-parsing

I currently have the following sample text file:
http://pastebin.com/BasTiD4x
and I need to duplicate the CDS blocks. Essentially the line that has the word "CDS" and the 4 lines after it are part of the CDS block.
I need to insert this duplicated CDS block right before a line that says CDS, and I need to change the word CDS in the duplicated block to mRNA.
Of course, this needs to happen for every instance of CDS.
A sample output would be here:
http://pastebin.com/mEMAB50t
Essentially for every CDS block, I need an mRNA block that says exactly the same thing.
I'd appreciate help with this; I've never done four-line insertions and replacements.
Thanks,
Adrian

Sorry for the very specific question. Here is a working solution provided by someone else:
perl -ne 'if (! /^\s/){$ok=0;if($mem){$mem=~s/CDS/mRNA/;print $mem;$mem="";}}$ok=1 if (/\d+\s+CDS/);if($ok){$mem.=$_};print;' exemple
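For readers who don't parse Perl one-liners, here is a rough equivalent as a commented Python sketch. It follows the requirement as stated (a five-line block starting at the CDS line, with the mRNA copy emitted immediately before the original); the input name matches the one-liner's, and the output name is hypothetical.
import re

with open("exemple") as f:          # same input file the one-liner reads
    lines = f.readlines()

out = []
i = 0
while i < len(lines):
    # A CDS block starts at a line like "...123  CDS" and includes
    # the four lines after it (five lines in total).
    if re.search(r"\d+\s+CDS", lines[i]):
        block = lines[i:i + 5]
        # Emit the duplicate first, renaming CDS to mRNA in the copy.
        out.extend(l.replace("CDS", "mRNA", 1) for l in block)
        out.extend(block)
        i += 5
    else:
        out.append(lines[i])
        i += 1

with open("exemple.out", "w") as f:  # hypothetical output name
    f.writelines(out)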

Related

CSV file not recognized as csv, reason nominal value not declared in header

I am trying to load a dataset in Weka. I have tried many things, such as converting to ARFF format, adjusting commas, etc., but all of them failed. Could any of you give me a working solution, or load this dataset into the correct format?
Here is a link to dataset
Instead of using Weka's functionality for reading CSV files, you could use ADAMS (developed at the same university; I'm the lead developer) instead.
Download the adams-ml-app snapshot and then use the Weka Investigator to load/save the file:
Load it as ADAMS Spreadsheets (.csv, .csv.gz)
Save it as Arff data files (.arff, .arff.gz) or Simple ARFF data files (.arff, .arff.gz)
The Reviews column contains an erroneous 3.0M, which prevents it from becoming numeric.
If you want to have an introduction to the Weka Investigator, then take a look at my talk from the Weka User Conference 2021: Taking Weka to the next level with ADAMS.
There are too many issues with lines in this file.
In line 23, I eliminated the odd looking brackets.
I removed all single quotes (')
I eliminated all repeated double quotes ("")
In line 10474 the first two fields (before the number) didn't seem to be separated, so I added a comma.
This allowed the file to go through initial screening, but...
The file contains a lot of odd emojis. I started to eliminate them one by one, but there are clearly more of these than I wish to deal with.
Each time I got rid of one, it would read farther into the file, then stop at the next one.
If I just try to read the top of the file, the first 20 lines before we get to any of these problems, it reads fine.
My partial editing can be found here: https://www.dropbox.com/s/ij707mb23dt1jvz/googleplaystore3.csv?dl=0
I think if you clear up the remaining emojis the file should be usable.
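If it helps, here is a small sketch of that cleanup in Python: it keeps only printable ASCII (plus newlines), which throws the emojis away wholesale instead of chasing them one by one. The input name refers to the partially cleaned copy above; the output name is a placeholder.
# Keep only printable ASCII plus newlines; everything else
# (emojis included) is dropped.
with open("googleplaystore3.csv", encoding="utf-8", errors="replace") as f:
    text = f.read()

cleaned = "".join(ch for ch in text if ch == "\n" or 32 <= ord(ch) < 127)

with open("googleplaystore3_ascii.csv", "w") as f:  # placeholder output name
    f.write(cleaned)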

Parsing text in Julia - invalid ASCII sequence

I'm trying to format a data file so that my other program will properly handle it. I am trying to handle the following data and I am getting a very weird error that I can't seem to put my finger on.
https://snap.stanford.edu/data/wiki-RfA.html
I am trying to format the data as [SRC TGT VOT], so I'd like the first two lines of my output file to be
1 2 1
3 2 1
because user 1 (stored first in the dictionary of users) votes for user 2 with VOT 1, and then user 3 votes for user 2 with VOT 1. My problem is that when I run my code below, I always get a very strange "invalid ascii sequence" error. Can anyone help me identify the issue, or perhaps find a way around it? It'd obviously be best if I could learn what I am doing wrong. Thank you!
Note: I understand this is a rather specific question and I appreciate any help; I'm somewhat baffled by this error and don't know how to resolve it at the moment.
f = open("original_vote_data.txt")   # this is the file linked above
arr = readlines(f)
i = edge_count = src = tgt = vot = 1
dict = Dict{ASCIIString, Int64}()    # username -> integer id
edges = ["" for k = 1:198275]
while i < 1586200
    src_temp = (arr[i])[5:end-2]     # strip the leading "SRC:" and trailing characters
    if haskey(dict, src_temp)
        new_src = dict[src_temp]
    else
        dict[src_temp] = src
        new_src = src
        src = src + 1
    end
    tgt_temp = (arr[i+1])[5:end-2]   # strip the leading "TGT:" and trailing characters
    if haskey(dict, tgt_temp)
        new_tgt = dict[tgt_temp]
    else
        dict[tgt_temp] = tgt
        new_tgt = tgt
        tgt = tgt + 1
    end
    vot_temp = (arr[i+2])[5]         # the single vote character after "VOT:"
    edges[edge_count] = string(new_src) * " " * string(new_tgt) * " " * string(vot_temp)
    edge_count = edge_count + 1
    i = i + 8                        # each record spans 8 lines
end
Here we go - I'll write up my comment as an answer since it seems to have solved the question.
My hunch that the error stemmed from the fourth line (dict=Dict{ASCIIString, Int64}()) was based on the fact that an ASCIIString will error if you try to store non-ASCII characters in it. Since this file comes from an international site, it's not unlikely that there are users with Unicode characters in their names (or elsewhere in the data). So the simple fix is to change all instances of ASCIIString to UTF8String.
Just to make this answer a bit more complete, I downloaded the file and tried running the program. The simplest way to debug this is to run the script at top-level in the REPL and then inspect the program state after the error. After the error is thrown, i==3017. Now just try running each line of the while loop incrementally. You'll quickly see that line 3017 contains "SRC:Guðsþegn\n" — unicode, as I suspected. When you try to create a new entry in dict with that as the key, the error should have a backtrace to setindex! in dict.jl, where you'll see that it's trying to convert the key (a UTF8String) to an ASCIIString. So changing the dictionary type to have UTF8String keys solves the problem.
As it turns out, the edges array only contains strings of three integers (or sometimes a hyphen), so the ASCIIString there is ok, but still a little dangerous. I'd probably store that information in a more dedicated array of ints instead of converting it to a space-separated string: you know the first two elements in the string are ints, but the last element is unvalidated text from the file itself… which may be unicode or a space itself (which could mess up processing down the line).
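To round this out, the same extraction is easy to express in a language whose default string type is Unicode-aware. Here is a minimal Python 3 sketch; the 8-line record stride and the SRC/TGT/VOT field layout are taken from the original loop, and the vote is stored as an int rather than a raw character.
ids = {}     # username -> integer id, assigned in order of first appearance
edges = []   # (src_id, tgt_id, vote) triples

def uid(name):
    # setdefault assigns the next id only if the name is new
    return ids.setdefault(name, len(ids) + 1)

with open("original_vote_data.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

for i in range(0, len(lines) - 2, 8):         # each record spans 8 lines
    src = lines[i].split(":", 1)[1]           # "SRC:<username>"
    tgt = lines[i + 1].split(":", 1)[1]       # "TGT:<username>"
    vot = int(lines[i + 2].split(":", 1)[1])  # "VOT:" followed by -1, 0, or 1
    edges.append((uid(src), uid(tgt), vot))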

Reading specified lines below a found string

I'm going to create a simple bank-account database in C, but I haven't quite figured out how I'm going to fetch the data for a specific account that has already been created and written to a file. I was thinking of searching from the beginning of the file using fseek for a specified account number, since all account numbers will be unique. Is there a way to read a specified number of lines below that account number once it is found? For example, in my file accounts.txt there will be the accounts:
Account # : 13398
First Name : Eric
Last Name : Walters
Parish : St.tofu
Year of Birth : 1980
Age : 34
Savings Period : 5 year(s)
Password : Eric1
Account # : 13398
Account balance: $0.00
====================================
I want to search through the file for the account number and fetch it, along with everything else in the 10 lines below it, and display it on the screen. If this is possible, then say 'aye' and point me to a certain area I should study to achieve this; when I'm successful I'll post my code here to show what I have done.
fseek() allows you to skip a certain number of bytes in a file. If your lines are not always the same length, you will have to read the entire file, not just to search for the account numbers, but also to find the newlines that delimit each account. To do this, you are better off using fgets().
The steps would be something like this
foreach line in file
    if line starts with "Account Number"
        if the number is the one you want
            print the next 10 lines
        else
            skip the next 10 lines
    else
        keep looking
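A minimal sketch of that loop, shown in Python for brevity (a C version would do the same with fgets() and strncmp()); it prints the matching "Account # :" line plus the 10 lines below it.
def show_account(path, number):
    with open(path) as f:
        lines = f.read().splitlines()
    for i, line in enumerate(lines):
        # Match the header line, then compare the account number
        if line.startswith("Account # :") and line.split(":", 1)[1].strip() == number:
            for detail in lines[i:i + 11]:   # the match plus the 10 lines below it
                print(detail)
            return True
    return False

show_account("accounts.txt", "13398")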
Firstly, fseek is used to move the file pointer, not for searching. For searching text (the account ID in your case), there are some examples in "Trying to find and replace a string from file in C". To write your own code, learning the basic file-handling functions is enough. Furthermore, since your data is structured (every 11 lines represent one account), your code can be accelerated. Finally, what you are trying to do is what database software offers, and it is hard to implement your own database as fast as commercial software.
You could search in the file, but that would be a bit tedious. Even more tedious if you wanted to modify the account details.
Why don't you use SQLite? It is designed to replace fopen().
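To give a sense of why: with SQLite the lookup becomes a single indexed query instead of line counting. A minimal sketch follows (Python's built-in sqlite3 module for brevity; in C you would call the SQLite C API, and the schema below is just an illustration).
import sqlite3

conn = sqlite3.connect("accounts.db")   # placeholder database file
conn.execute("""CREATE TABLE IF NOT EXISTS accounts (
                    number     INTEGER PRIMARY KEY,
                    first_name TEXT,
                    last_name  TEXT,
                    balance    REAL)""")
conn.execute("INSERT OR IGNORE INTO accounts VALUES (13398, 'Eric', 'Walters', 0.0)")
conn.commit()

# Fetching one account is a single indexed query, no line counting needed
print(conn.execute("SELECT * FROM accounts WHERE number = ?", (13398,)).fetchone())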

Fix CSV file with new lines

I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I chose to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't; I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they contain only one; thanks Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines, it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
The problem with this is that it matches the last field of one line and the first field of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
    // Quote any field that contains a raw (unquoted) line break
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
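If you'd rather avoid the regex entirely, the same repair can be sketched procedurally under the same assumption (a known column count). This Python sketch glues broken physical lines back together until the field count and the quote characters balance; note it joins the pieces with a space rather than re-quoting the field, so embedded newlines are not preserved exactly.
def repair_lines(lines, ncols):
    def nfields(s):
        # Count commas outside double quotes; fields = commas + 1
        n, in_quotes = 1, False
        for ch in s:
            if ch == '"':
                in_quotes = not in_quotes
            elif ch == ',' and not in_quotes:
                n += 1
        return n

    buf = ""
    for line in lines:
        buf = line if not buf else buf + " " + line
        # A logical row is complete once it has enough fields
        # and an even number of quote characters
        if nfields(buf) >= ncols and buf.count('"') % 2 == 0:
            yield buf
            buf = ""
    if buf:
        yield buf   # trailing fragment, if any

# "broken.csv" is a placeholder for the malformed export
for row in repair_lines(open("broken.csv").read().splitlines(), ncols=5):
    print(row)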
If you want a programmatic Java solution, open the file using the OpenCSV library. If it is a manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl one-liner to clean up the CRLFs.

How can I use SQL Server's full text search across multiple rows at once?

I'm trying to improve the search functionality on my web forums. I've got a table of posts, and each post has (among other less interesting things):
PostID, a unique ID for the individual post.
ThreadID, an ID of the thread the post belongs to. There can be any number of posts per thread.
Text, because a forum would be really boring without it.
I want to write an efficient query that will search the threads in the forum for a series of words, and it should return a hit for any ThreadID for which there are posts that include all of the search words. For example, let's say that thread 9 has post 1001 with the word "cat" in it, and also post 1027 with the word "hat" in it. I want a search for cat hat to return a hit for thread 9.
This seems like a straightforward requirement, but I don't know of an efficient way to do it. Using the regular FREETEXT and CONTAINS capabilities for N'cat AND hat' won't return any hits in the above example because the words exist in different posts, even though those posts are in the same thread. (As far as I can tell, when using CREATE FULLTEXT INDEX I have to give it my index on the primary key PostID, and can't tell it to index all posts with the same ThreadID together.)
The solution that I currently have in place works, but sucks: maintain a separate table that contains the entire concatenated post text of every thread, and make a full text index on THAT. I'm looking for a solution that doesn't require me to keep a duplicate copy of the entire text of every thread in my forums. Any ideas? Am I missing something obvious?
As far as I can see there is no "easy" way of doing this.
I would create a stored procedure that splits up the search words, looks for the first word, and puts the matching ThreadIDs in a table variable. Then you look for the other words (if any) within the ThreadIDs you just collected (an inner join).
If interested, I can write a few bits of code, but I'm guessing you won't need it.
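Not the poster's code, but a rough sketch of that intersection idea, generating plain T-SQL: one full-text probe per word, with INTERSECT keeping only the ThreadIDs common to all of them (Posts, ThreadID, and Text are the names from the question).
def build_query(words):
    # One CONTAINS probe per word; INTERSECT keeps ThreadIDs matching all.
    # NB: interpolating user input into SQL invites injection; a real
    # implementation would sanitize or parameterize each word.
    probes = [
        "SELECT ThreadID FROM Posts WHERE CONTAINS(Text, '\"%s\"')" % w
        for w in words
    ]
    return "\nINTERSECT\n".join(probes)

print(build_query(["cat", "hat"]))
# SELECT ThreadID FROM Posts WHERE CONTAINS(Text, '"cat"')
# INTERSECT
# SELECT ThreadID FROM Posts WHERE CONTAINS(Text, '"hat"')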
What are you searching for?
CAT HAT as a complete phrase, in which case:
CONTAINS(*, '"CAT HAT"')
CAT or HAT, in which case:
CONTAINS(*, 'CAT OR HAT')
Searching for "CAT HAT" and expecting a hit from the post that contains only CAT doesn't make any sense. If the problem is parsing what the user types, you could just replace spaces with OR (to search for any of the words; use AND if all are required). The OR will give you both posts for thread 9.
SELECT DISTINCT ThreadId
FROM Posts
WHERE CONTAINS(*, 'CAT OR HAT')
Better still, you could, if it helps, use the brilliant Irony (http://irony.codeplex.com/), which translates (parses) a search string into a full-text query. It might help you.
It requires Google syntax for the original search, which can only be a good thing, as most people are used to typing Google searches.
Plus here is an article on how to use it.
http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/
