How to parse a FASTA file using kseq.h - c

I have known about this library from Heng Li for a while, but I have not attempted to use it until now, mostly because Python has been fast enough for me so far.
Here is the link to the header: http://lh3lh3.users.sourceforge.net/kseq.shtml
When I use the following code to parse a FASTA file, kseq_read() returns -1 instead of the length of the sequence. I have looked over Li's code and it seems to be designed mainly for FASTQ parsing, but he does say on his webpage that it also supports the FASTA format.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include "kseq.h"
// STEP 1: declare the type of file handler and the read() function
KSEQ_INIT(FILE*, read)
int main(int argc, char** argv) {
FILE* fp = fopen(argv[1], "r"); // STEP 2: open the file handler
kseq_t *seq = kseq_init(fp); // STEP 3: initialize seq
int l;
while ((l = kseq_read(seq)) >= 0) { // STEP 4: read sequence
printf("name: %s\n", seq->name.s);
if (seq->comment.l) printf("comment: %s\n", seq->comment.s);
printf("seq: %s\n", seq->seq.s);
if (seq->qual.l) printf("qual: %s\n", seq->qual.s);
}
printf("return value: %d\n", l);
kseq_destroy(seq); // STEP 5: destroy seq
fclose(fp);
return (0);
}
The FASTA I have been using to test with is the Hg19 GRCH37 ChrY.fa file available from multiple sources including the Broad Institute.
Any help would be appreciated.

First you should check the return value of fopen():
FILE* fp = fopen(argv[1], "r"); // STEP 2: open the file handler
if(fp == 0) {
perror("fopen");
exit(1);
}
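Second, I looked at the header file: kseq_init() takes whatever type you passed to KSEQ_INIT and hands it to the read function you named there. Since that function is read(), it needs a file descriptor (fd), not a FILE *, so the declaration should also become KSEQ_INIT(int, read).
You can get an fd from a FILE * with fileno().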
kseq_t *seq = kseq_init(fp); // STEP 3: initialize seq
Should be:
kseq_t *seq = kseq_init(fileno(fp)); // STEP 3: initialize seq

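Here is the complete code that is working for me:
#include <stdio.h>
#include <unistd.h> /* read() */
#include "kseq.h"
/* kseq calls read(fd, buf, size), so the handle type must be int */
KSEQ_INIT(int, read)
int main(int argc, char **argv)
{
    FILE *fp;
    kseq_t *seq;
    int n = 0, slen = 0, qlen = 0;
    if (argc < 2) {
        fprintf(stderr, "usage: %s <in.fa>\n", argv[0]);
        return 1;
    }
    fp = fopen(argv[1], "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    seq = kseq_init(fileno(fp));
    while (kseq_read(seq) >= 0)
        ++n; /* slen += seq->seq.l, qlen += seq->qual.l; */
    printf("%d\t%d\t%d\n", n, slen, qlen);
    kseq_destroy(seq);
    fclose(fp);
    return 0;
}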

Related

How to print a substring in C

Simple C question here!
So I am trying to parse a string, let's say: 1234567W
#include <stdlib.h>
#include <string.h>
int main(int argc, char *argv[]) {
//pointer to open file
FILE *op;
//open file of first parameter and read it "r"
op = fopen("TestCases.txt", "r");
//make an array of 1000
char x[1000];
char y[1000];
//declare variable nums as integer
int nums;
//if file is not found then exit and give error
if (!op) {
perror("Failed to open file!\n");
exit(1);
}
else {
while (fgets(x, sizeof(x), op)) {
//pointer to get the first coordinate to W
char *p = strtok(x, "W");
//print the first 3 digits of the string
printf("%.4sd\n", p);
}
}
return 0;
}
My output so far shows: "123d" because of the "%.4sd" in the printf function.
I now need to get the next two numbers, "45". Is there a regex expression I can use that will allow me to get the next two digits of a string?
I am new to C, so I was thinking more like "%(ignore the first 4 characters)(print next 2 digits)(ignore the last two digits)"
Please let me know.
Thanks all.
printf("Next two: %.2s\n", p + 4); should work.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
//pointer to open file
FILE *op;
//open file of first parameter and read it "r"
op = fopen("TestCases.txt", "r");
//make an array of 1000
char x[1000];
char y[1000];
//declare variable nums as integer
int nums;
//if file is not found then exit and give error
if (!op) {
perror("Failed to open file!\n");
exit(1);
}
else {
while (fgets(x, sizeof(x), op)) {
//pointer to get the first coordinate to W
char *p = strtok(x, "W");
//print the first 3 digits of the string
printf("%.4sd\n", p);
printf("Next two: %.2s\n", p + 4);
}
}
return 0;
}
Side note: I added a missing stdio.h include. Please turn on compiler warnings, since this error would've been caught by them.

read a text file, make some trivial transformation character by character (swapping the case of all letters), write result to text file

I have to read a text file, make some trivial transformation character by character (swapping the case of all letters), and write the result to a text file. I wrote this code, but it's not working. Please guide me in this regard. Thanks in advance.
#include <stdio.h>
#include <stdlib.h>
int main() {
char c[1000];
char x[100];
char var;
int i;
FILE *fptr;
if ((fptr = fopen("text.txt", "r")) == NULL) {
printf("Error! opening file");
// Program exits if file pointer returns NULL...
exit(1);
}
// reads text until a newline is encountered...
fscanf(fptr, "%[^\n]", c);
printf("Data from the file:\n%s", c);
// Convert the file to upper case....
for( i=0;i<= strlen(c);i++){
if(c[i]>=65&&c[i]<=90)
c[i]=c[i]+32;
}
fptr = fopen("program.txt","w");
fprintf(fptr,"%[^\n]",c);
fclose(fptr);
return 0;
}
Edit: added #include <stdlib.h>, removed static describing main()
My proposal, based on a file-copying example given at my uni.
I used toupper() from ctype.h; if you don't want to use it, you can shift the character code by 32 under a range condition, similarly to your solution (subtract 32 from 'a'..'z' to get upper case).
Note: c could be declared char instead of int here, since EOF is detected with feof() rather than by comparing c. (In the original version it actually was char; I changed it because all the functions dealing with c take and return int, not char. In your version it would matter more as you keep an array; in my program it changes very little. Still, int is the safer choice, because toupper() expects a value in the unsigned char range or EOF.)
Note 2: I never really delved into the difference between "w"/"r" (write/read) and "wb"/"rb" (write/read binary). The code works either way: on POSIX systems the two modes behave identically, and on Windows text mode only translates line endings, which does not affect this transformation.
(For further assurance that both versions work, note that the code uses feof() to handle EOF.)
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
int main(void) {
FILE *from, *to;
int c;//could be char
/* opening the source file */
if ((from = fopen("text.txt", "rb")) == NULL) {
printf("no such source file\n");
exit(1);
}
/* opening the target file */
if ((to = fopen("program.txt", "wb")) == NULL) {
printf("error while opening target file\n");
exit(1);
}
while (!feof(from)) {
c = fgetc(from);
if (ferror(from)) {
printf("error while reading from the source file\n");
exit(1);
}
if (!feof(from)) {//we avoid writing EOF
fputc(toupper(c), to);
if (ferror(to)) {
printf("error while writing to the target file\n");
exit(1);
}
}
}
if (fclose(from) == EOF) {
printf("error while closing...\n");
exit(1);
}
if (fclose(to) == EOF) {
printf("error while closing...\n");
exit(1);
}
return 0;
}
For a version that takes its arguments from the command line (works on Windows too), replace the beginning of main with:
int main(int argc, char *argv[]) {
FILE *from, *to;
int c; /* int, to match what fgetc() returns and toupper() expects */
/* checking the number of arguments in the command line */
if (argc != 3) {
printf("usage: name_of_executable_of_this_main <f1> <f2>\n");//name_of_exe could be copy_to_upper, for example; change adequately
exit(1);
}
/* opening the source file */
if ((from = fopen(argv[1], "rb")) == NULL) {
printf("no such source file\n");
exit(1);
}
/* opening the target file */
if ((to = fopen(argv[2], "wb")) == NULL) {
printf("error while opening the target file\n");
exit(1);
}
I don't know how to code in that language (I think it's C++), but basically what you should be doing is a for loop that iterates through every character in the string. In Python it would look like:
x = open("text.txt", "r")
y = open("new text.txt", "w")
z = ""
for char in x.read():  # read() returns the whole file; loop over its characters
    z += char.upper()
y.write(z)
x.close()
y.close()
I hope I was able to give an idea of how to solve your problem. I'm a newbie as well, but in Python.

wzip.c from OSTEP (Operating Systems: Three Easy Pieces)

I am a newbie in OS. I'm currently working through Operating Systems: Three Easy Pieces (OSTEP).
I found this code for the first project of the course.
wzip is a file compression tool, and wunzip is a file decompression tool.
input:
aaaaaaaaaabbbb
correct output:
10a4b
instructions: write out a 4-byte integer in binary format followed by the single character in ASCII.
current output:
ab
I type in the shell:
prompt> gcc -o wzip wzip.c -Wall -Werror
prompt> ./wzip file1.txt > file1.z
This is the link for the project:
https://github.com/remzi-arpacidusseau/ostep-projects/tree/master/initial-utilities
This is the code I found for this specific part of the project:
wzip:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>
/*
wzip: is a file compresion tool.
*/
void writeFile(int , char *);
int main(int argc, char *argv[]){
FILE *fp;
char newbuff[2], oldbuff[2];
int count;
if (argc < 2){
printf("wzip: file1 [file2 ...]\n");
exit(EXIT_FAILURE);
}
// open files
for (size_t i = 1; i < argc; i++){
if ((fp = fopen(argv[i], "r")) == NULL){
printf("wzip: cannot open file\n");
exit(EXIT_FAILURE);
}
while (fread(newbuff, 1, 1, fp)){
if (strcmp(newbuff, oldbuff) == 0){
count++;
} else {
if (oldbuff[0] != '\0'){
writeFile(count, oldbuff);
}
count = 1;
strcpy(oldbuff, newbuff);
}
}
fclose(fp);
}
writeFile(count, oldbuff);
return 0;
}
void writeFile(int count, char *oldbuff){
// write as network byte order
count = htonl(count);
fwrite(&count, 4, 1, stdout);
fwrite(oldbuff, 1, 1, stdout);
}
wunzip:
#include <stdio.h>
#include <stdlib.h> // exit
#include <string.h> // memset
#include <arpa/inet.h> // ntohl
int
main(int argc, char *argv[])
{
FILE *fp;
char buff[5];
if (argc <= 1) {
printf("wunzip: file1 [file2 ...]\n");
exit(EXIT_FAILURE);
}
for (size_t i = 1; i < argc; i++) {
if ((fp = fopen(argv[i], "r")) == NULL) {
printf("wunzip: cannot open file\n");
exit(EXIT_FAILURE);
}
int count = 0;
while (fread(&count, 4, 1, fp)) {
count = ntohl(count); // read from network byte order
memset(buff, 0, strlen(buff));
fread(buff, 1, 1, fp);
for (size_t i = 0; i < count; i++) {
printf("%s", buff);
}
}
fclose(fp);
}
return 0;
}
Can someone please give me a hand understanding this better?
Thanks in advance.
How is your solution coming along?
The first thing that jumps out about the code posted here is that you write the wrong number of bytes. In other words, you don't follow the compression guidelines of the exercise. The instructions for the OSTEP project were to write a 32-bit int (4 bytes) followed by a single ASCII character (1 byte).
It helps to set up a test before writing the compressed output. Make sure that what is being written does indeed equal 5 bytes.
#include <stddef.h> /* size_t */
#include <stdint.h> /* uint32_t, uint8_t */
struct token {
    uint32_t count;
    uint8_t ch;
};
size_t numbytes(struct token t) {
    /* calculates the number of bytes in a single token */
    /* assumes a token has two attributes and sums their sizes, */
    /* because sizeof(struct token) would also count padding */
    /* should return a value of 5 */
    size_t mybytes = sizeof(t.count) + sizeof(t.ch);
    return mybytes;
}
I personally found it easier to think about the read procedure in terms of a parser. This is due to the fact that some of the test cases get slightly more complex. You need to effectively manage state between line reads and also between files.
It's a little bit late, but I just did this program for my course's assignment and I want to share my solution. It passes all the available test cases but it's not official from any party (neither from my course nor the project owner).
The code:
#include <stdio.h> // FILE, stdout, fprintf, fwrite, fgetc, EOF, fclose
#include <stdlib.h> // EXIT_*, exit
struct rle_t
{
int l;
char c;
};
void
writerle(struct rle_t rleobj)
{
fwrite((int *)(&(rleobj.l)), sizeof(int), 1, stdout);
fwrite((char *)(&(rleobj.c)), sizeof(char), 1, stdout);
}
struct rle_t
process(FILE *stream, struct rle_t prev)
{
int curr;
struct rle_t rle;
while ((curr = fgetc(stream)) != EOF)
{
if (prev.c != '\0' && curr != prev.c)
{
rle.c = prev.c;
rle.l = prev.l;
prev.l = 0;
writerle(rle);
}
prev.l++;
prev.c = curr;
}
rle.c = prev.c;
rle.l = prev.l;
return rle;
}
int
main(int argc, const char *argv[])
{
if (argc < 2)
{
fprintf(stdout, "wzip: file1 [file2 ...]\n");
exit(EXIT_FAILURE);
}
struct rle_t prev;
prev.c = '\0';
prev.l = 0;
for (int i = 1; i < argc; ++i)
{
FILE *fp = fopen(argv[i], "r");
if (fp == NULL)
{
fprintf(stdout, "wzip: cannot open files\n");
exit(EXIT_FAILURE);
}
struct rle_t rle = process(fp, prev);
prev.c = rle.c;
prev.l = rle.l;
fclose(fp);
}
writerle(prev);
return 0;
}
From the post, I think there is no need to use any function like htonl(3) for this problem. The tricky part is the writerle() function, which you already have. But let me explain what I understand:
Use fwrite(3) to write binary format to the stream, in this case, stdout.
First parameter is a pointer, that's why I need &(rleobj.l) and &(rleobj.c).
This line
fwrite((int *)(&(rleobj.l)), sizeof(int), 1, stdout);
says that I want to write sizeof(int) bytes (4 on most platforms) of the first parameter (&(rleobj.l)) to the standard output (stdout) one time (1). The typecast is optional (depending on your compiler and how you want to read your code).
The reason they require this is that it keeps the run-length part separate from the character part.
Let's say you have a simple input file like this:
333333333333333333333333333333333aaaaaaaaaaaa
After encoding without the binary format:
33312a
This is wrong, because now it looks like the run length of the character a is 33312, instead of 33 of 3 and 12 of a.
However, with binary format, those parts are separated:
❯ xxd -b output.z
00000000: 00100001 00000000 00000000 00000000 00110011 00001100 !...3.
00000006: 00000000 00000000 00000000 01100001 ...a
Here, the first four bytes represent the run-length and the next one byte represents the character.
I hope this will help.

How can I carve out one binary file from a concatenated binary

Basically, I'm combining two binaries using the "cat" command on Linux, and I want to be able to separate them again using C.
This is the code I have so far:
int main(int argc, char *argv[]) {
// Getting this file
FILE *localFile = fopen(argv[0], "rb");
// Naming a new file to save our carved binary
FILE *newFile = fopen(argv[1], "wb+");
// Moving the cursor to the offset: 19672 which is the size of this file
fseek(localFile, 19672, SEEK_SET);
// Copying to the new file
char ch;
while ( ( ch = fgetc(localFile) ) != EOF ) {
fputc(ch, newFile);
}
}
Assuming that you already know where the second file starts, you can proceed as follows. (This is the bare minimum.)
#include <stdio.h>
#include <unistd.h>
int main(void)
{
    FILE *f1 = fopen("f1.bin", "rb"); /* the concatenated input */
    FILE *f2 = fopen("f2.bin", "wb"); /* receives the second file */
    if (f1 == NULL || f2 == NULL) {
        perror("fopen");
        return 1;
    }
    long file1_size = 1; /* offset where the second file starts */
    lseek(fileno(f1), file1_size, SEEK_SET);
    char fbuf[100];
    ssize_t rd_status;
    for ( ; ; ) {
        rd_status = read(fileno(f1), fbuf, sizeof(fbuf));
        if (rd_status <= 0)
            break;
        write(fileno(f2), fbuf, rd_status);
    }
    fclose(f1);
    fclose(f2);
    return 0;
}
Input File -- f1.bin
1F 2A
Output File -- f2.bin
2A
Please modify the file names and file sizes according to your example.

Trying to make program that counts number of bytes in a specified file (in C)

I am currently attempting to write a program that will tell its user how many times the specified 8-bit byte appears in the specified file.
I have some groundwork laid out, but when it comes to making sure that the file makes it into an array or buffer or whatever format I should put the file data into to check for the bytes, I feel I'm probably very far off from using the correct methods.
After that, I need to check whatever the file data gets put into for the byte specified, but I am also unsure how to do this.
I think I may be over-complicating this quite a bit, so explaining anything that needs to be changed or that can just be scrapped completely is greatly appreciated.
Hopefully didn't leave out any important details.
Everything seems to be running (this code compiles), but when I try to printf the final statement at the bottom, it does not spit out the statement.
I have a feeling I just did not set up the final for loop correctly at all..
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
//#define BUFFER_SIZE (4096)
main(int argc, char *argv[]){ //argc = arg count, argv = array of arguements
char buffer[4096];
int readBuffer;
int b;
int byteCount = 0;
b = atoi(argv[2]);
FILE *f = fopen(argv[1], "rb");
unsigned long count = 0;
int ch;
if(argc!=3){ /* required number of args = 3 */
fprintf(stderr,"Too few/many arguements given.\n");
fprintf(stderr, "Proper usage: ./bcount path byte\n");
exit(0);
}
else{ /*open and read file*/
if(f == 0){
fprintf(stderr, "File could not be opened.\n");
exit(0);
}
}
if((b <= -1) || (b >= 256)){ /*checks to see if the byte provided is between 0 & 255*/
fprintf(stderr, "Byte provided must be between 0 and 255.\n");
exit(0);
}
else{
printf("Byte provided fits in range.\n");
}
int i = 0;
int k;
int newFile[i];
fseek(f, 0, SEEK_END);
int lengthOfFile = ftell(f);
for(k = 0; k < sizeof(buffer); k++){
while(fgets(buffer, lengthOfFile, f) != NULL){
newFile[i] = buffer[k];
i++;
}
}
if(newFile[i] = buffer[k]){
printf("same size\n");
}
for(i = 0; i < sizeof(newFile); i++){
if(b == newFile[i]){
byteCount++;
}
printf("Final for loop is working???\n");
}
}
OP is mixing fgets() with binary reads of a file.
fgets() reads a file up to the buffer size provided or until it reaches a \n byte. It is intended for text processing. The typical way to determine how much data was read via fgets() is to look for a final \n, which may or may not be there. The data read could have embedded NUL bytes in it, so it becomes problematic to know whether to stop scanning the buffer at a NUL byte or at a \n.
Fortunately this can all be dispensed with, including the file seek and buffers.
// "rb" should be used when looking at a file in binary. C11 7.21.5.3 3
FILE *f = fopen(argv[1], "rb");
b = atoi(argv[2]);
unsigned long byteCount = 0;
int ch;
while ((ch = fgetc(f)) != EOF) {
if (ch == b) {
byteCount++;
}
}
OP's error checking is good, but the for(k = 0; k < sizeof(buffer); k++){ loop and its contents had various issues. OP had if(b = newFile[i]){ which should have been if(b == newFile[i]){
Not really an ANSWER --
Chux corrected the code; this is just more than fits in a comment.
#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
struct stat st;
int rc=0;
if(argv[1])
{
rc=stat(argv[1], &st);
if(rc==0)
printf("bytes in file %s: %lld\n", argv[1], (long long)st.st_size);
else
{
perror("Cannot stat file");
exit(EXIT_FAILURE);
}
return EXIT_SUCCESS;
}
return EXIT_FAILURE;
}
The stat() call is handy for getting file size and for determining file existence at the same time.
Applications use stat instead of reading the whole file, which is great for gigantic files.

Resources