How to read from unknown file including numbers, letters, and symbols? - c

How can you read through a file of unknown line lengths (about 1500 lines, so no malloc/alloc and the like is needed because memory is sufficient...luckily, because I don't understand those array/pointer commands yet) including float numbers, signs, and letters, extract specific numbers from it, do some calculations and write them back in another file?
Three example-lines:
02060 6.1 0.15 K14C9 134.52612 339.34971 209.27800 6.93836 0.3820989 0.01956864 13.6383665 0 MPO319108 1304 45 1895-2014 0.53 M-v 38h MPCLINUX 000A (2060) Chiron 20141030
05145 7.1 0.15 K14C9 90.96884 354.94362 119.25398 24.73205 0.5736395 0.01074547 20.3385073 0 MPO169571 319 21 1977-2009 0.58 M-v 38h MPCMEL 400A (5145) Pholus 20090418
07066 9.6 0.15 K14C9 67.95075 170.25614 31.23622 15.65639 0.5195581 0.00813869 24.4774642 1 MPO135426 105 9 1993-2004 0.48 M-v 38h MPCW 400A (7066) Nessus 20040526

Welcome! Including your existing code is always a good idea, so thanks for doing so. It would help to see it as part of the question, however. Could you edit the question to include your code, and format it nicely as well?
One problem I see is the format string in your fscanf() call. You always specify %lf, but that is only appropriate for double values; you also need to parse integers (for example with %d) and strings (for example with %s).
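The idea of using one conversion per field type can be sketched briefly. The sketch below uses Python rather than C purely for brevity. The field positions and the dictionary keys (`ident`, `leading`, `code`, `values`, `flag`, `rest`) are assumptions read off the three example lines, not a documented layout.

```python
def parse_line(line):
    """Split one whitespace-separated record and convert each field by type."""
    fields = line.split()
    return {
        "ident": fields[0],                          # kept as text so the leading zero survives
        "leading": [float(x) for x in fields[1:3]],  # two small decimals
        "code": fields[3],                           # mixed letters and digits, kept as text
        "values": [float(x) for x in fields[4:11]],  # seven floating-point columns
        "flag": int(fields[11]),                     # integer column
        "rest": fields[12:],                         # remaining tokens left as text
    }


sample = (
    "07066 9.6 0.15 K14C9 67.95075 170.25614 31.23622 15.65639 0.5195581 "
    "0.00813869 24.4774642 1 MPO135426 105 9 1993-2004 0.48 M-v 38h MPCW "
    "400A (7066) Nessus 20040526"
)
record = parse_line(sample)
print(record["ident"], record["values"], record["flag"])
```

In C the same split would map to %s for the text columns, %lf for the doubles and %d for the integers, with the return value of fscanf() checked against the number of conversions requested.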


Is it safe to use the first 22 characters of an NFT pubkey as a primary key for a DB?

I am wondering if it is safe to use only the first 22 characters, instead of all 44 characters, of an NFT pubkey as the primary key of a MySQL DB. I have a DB with a huge amount of data and could save a lot of space with this approach. For instance, given the following pubkey:
AQoKYV7tYpTrFZN6P5oUufbQKAUr9mNYGe1TTJC9wajM
Would it be safe to use the first 22 characters:
AQoKYV7tYpTrFZN6P5oUuf
Would it be safer to use the first 11 characters plus the trailing 11 characters, or does it make no difference?
AQoKYV7tYpTe1TTJC9wajM
A public key is 32 bytes, so those "44 characters" are actually the base-58 representation of those 32 bytes.
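If you're only storing 22 characters, let's simplify things and say that you're storing 16 bytes out of 32 total. The chance of two given pubkeys sharing the same 16-byte sequence is 1 / 256^16 = 1 / 2^128 ≈ 2.9 * 10^-39, which is very unlikely, but possible. With many rows, the chance that some pair collides is larger than this per-pair figure (the birthday effect), although at 128 bits it stays tiny for any realistic table size, assuming the keys are effectively random.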
Here's another way to approach the problem -- how about storing the full pubkey as 32 bytes instead of as a string? Then you won't ever lose any precision.
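To put a number on the collision risk for a whole table rather than a single pair, here is a minimal sketch of the standard birthday approximation. It assumes the stored prefix behaves like a uniformly random value of the stated bit length, which is the same simplification the answer makes. The function name and the row counts are illustrative only.

```python
import math


def collision_probability(n_keys, bits):
    """Birthday approximation: chance that any two of n_keys uniformly
    random values of the given bit length are equal."""
    space = 2 ** bits
    exponent = n_keys * (n_keys - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


# One billion rows, compared for a 128-bit and a 64-bit prefix.
p128 = collision_probability(10**9, 128)
p64 = collision_probability(10**9, 64)
print(p128, p64)
```

Under these assumptions a 128-bit prefix leaves a negligible risk even at a billion rows, while halving it again to 64 bits would not.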

Difference between constants defined in Go's syscall and C's stat.h [duplicate]

Update: based on the comment and response so far, I guess I should make it explicit that I understand 0700 is the octal representation of the decimal number 448. My concern here is that when an octal mode parameter or when a decimal number is recast as octal and passed to the os.FileMode method the resulting permissions on the file created using WriteFile don't seem to line up in a way that makes sense.
I worked as hard as I could to reduce the size of the question to its essence, maybe I need to go thru another round of that
Update2: after re-re-reading, I think I can more succinctly state my issue. Calling os.FileMode(700) should be the same as calling it with the binary value 1-010-111-100. With those 9 least significant bits there should be permissions of:
--w-rwxr-- or 274 in octal (and translates back to
Instead, that FileMode results in WriteFile creating the file with:
--w-r-xr-- which is 254 in octal.
When using an internal utility written in go, there was a file creation permission bug caused by using decimal 700 instead of octal 0700 when creating the file with ioutil.WriteFile(). That is:
ioutil.WriteFile("decimal.txt", []byte("filecontents"), 700) // wrong!
ioutil.WriteFile("octal.txt", []byte("filecontents"), 0700) // correct!
When using the decimal number (i.e. no leading zero to mark it as octal in Go), the file that should have had permissions
0700 -> '-rwx------' had 0254 -> '--w-r-xr--'
After it was fixed, I noticed that when I converted 700 decimal to octal, I got “1274” instead of the experimental result of "0254".
When I converted 700 decimal to binary, I got: 1-010-111-100 (I added dashes where the rwx’s are separated). This looks like a permission of "0274" except for that leading bit being set.
I went looking at the go docs for FileMode and saw that under the covers FileMode is a uint32. The nine smallest bits map onto the standard unix file perm structure. The top 12 bits indicate special file features. I think that one leading bit in the tenth position is in unused territory.
I was still confused, so I tried:
package main

import (
	"fmt"
	"io/ioutil"
	"os"
)

func main() {
	content := []byte("temporary file's content")
	modes := map[string]os.FileMode{
		"700":   os.FileMode(700),
		"0700":  os.FileMode(0700),
		"1274":  os.FileMode(1274),
		"01274": os.FileMode(01274),
	}
	for name, mode := range modes {
		if err := ioutil.WriteFile(name, content, mode); err != nil {
			fmt.Println("error creating ", name, " as ", mode)
		}
		if fi, err := os.Lstat(name); err == nil {
			mode := fi.Mode()
			fmt.Println("file\t", name, "\thas ", mode.String())
		}
	}
}
And now I'm even more confused. The results I got are:
file 700 has --w-r-xr--
file 0700 has -rwx------
file 1274 has --wxr-x---
file 01274 has --w-r-xr--
and was confirmed by looking at the filesystem:
--w-r-xr-- 1 rfagen staff 24 Jan 5 17:43 700
-rwx------ 1 rfagen staff 24 Jan 5 17:43 0700
--wxr-x--- 1 rfagen staff 24 Jan 5 17:43 1274
--w-r-xr-- 1 rfagen staff 24 Jan 5 17:43 01274
The first one is the broken situation that triggered the original bug in the internal application.
The second one is the corrected code working as expected.
The third one is bizarre, as 1274 decimal seems to translate into 0350
The fourth one kind of makes a twisted sort of sense, given that dec(700)->oct(1274) and explicitly asking for 01274 gives the same puzzling 0254 as the first case.
I have a vague suspicion that the extra part of the number larger than 2^9 is somehow messing it up but I can't figure it out, even after looking at the source for FileMode. As far as I can tell, it only ever looks at the 12 MSB and 9 LSB.
os.FileMode only knows about integers; it doesn't care whether the literal representation is octal or not.
The fact that 0700 is interpreted in base 8 comes from the language spec itself:
An integer literal is a sequence of digits representing an integer
constant. An optional prefix sets a non-decimal base: 0 for octal, 0x
or 0X for hexadecimal. In hexadecimal literals, letters a-f and A-F
represent values 10 through 15.
This is a fairly standard way of representing literal octal numbers in programming languages.
So your file mode was changed from the requested 0274 to the actual on-disk 0254. I'll bet that your umask is 0022. Sounds to me like everything is working fine.
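The arithmetic behind all four results can be checked with a short sketch. It assumes two things taken from this thread rather than verified on the asker's machine: that only the nine low permission bits of the requested mode reach the file, and that the umask is 0022 as guessed above. The helper name `effective_mode` is made up for illustration, and Python is used only as a calculator here.

```python
def effective_mode(requested, umask=0o022):
    """Keep the nine permission bits, then clear the bits the umask masks out."""
    return requested & 0o777 & ~umask


cases = [("700", 700), ("0700", 0o700), ("1274", 1274), ("01274", 0o1274)]
for label, value in cases:
    # Show the literal, its value written in octal, and the resulting mode.
    print(label, oct(value), oct(effective_mode(value)))
```

Decimal 700 and octal 01274 are the same integer, so both end up as 0254; decimal 1274 is 02372 in octal, whose low nine bits 0372 become 0350 once the umask is applied.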

Error in Matrices concatenation, inconsistent dimensions

%% INPUT DATA
input_data = [200 10.0 0.0095; %C1
240 7.0 0.0070; %C2
200 11.0 0.0090; %C3
220 8.5 0.0090; %C4
220 10.5 0.0080; %C5
0.0015 0.0014 -0.0001 0.0009 -0.0004 %Power Loss
];
pd=830; %Power Demand
%%
lambda = input ('Enter initial lambda:')
Could anyone help me fix this? I've checked the row and column data but still can't fix the error.
Your 6th row, power, contains 5 entries, as opposed to c1 to c5, which contain only three entries. MATLAB doesn't do Swiss cheese; I'd suggest making power a separate variable.

Replace values only if they are different

I have a vcf file like this:
http://www.1000genomes.org/node/101
Here's the example from that site:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
After the header lines, each line has fields that contain genotypes starting with the 10th field. The 10th field is below the NA00001 heading; the 11th field is genotype NA00002, etc. I have a file with 123 different genotypes, so going from position 10 to 133 (NA00001 until NA00123). What is shown in these fields can be 0/0, 0/1, 0/2 ... up to 8/9, for instance. Now I want to replace all the non-equal ones. So I would like to keep 0/0, 1/1, 2/2, etc., and replace 0/1, 0/2, 1/2, 4/5, 4/6, etc. with ./.
I would like to write this in a C script. I thought about using sed y/regexp/replacement/ but have no idea how to write all those unequal values in a regular expression. These values could also appear at other positions in the file, so really only positions 10 to 133 should be replaced. And the values need to be replaced in place; I will need the rest of the file together with the new values.
Hope it is clear. Anyone any idea how to do this?
This regex should do what you want:
\s(\d)[|\/](?!\1)\d:
Replace each match with ./. (keeping the leading space and the trailing colon).
Breakdown:
\s(\d) matches a space followed by a single digit, capturing the digit in capture group #1
[|\/] matches a pipe or slash (since it seems that the VCF format allows either)
(?!\1)\d uses a negative lookahead to ensure that the next character is not the same as capture group #1, and matches the digit
Caveats:
I matched a leading space and trailing : to try to ensure it matches only the intended values. I couldn't work out a good way to limit it to fields 10 and after.
Example using perl:
perl -pe 's#\s(\d)[|/](?!\1)\d:# ./.:#g' testfile.vcf > testfile_afterchange.vcf
Note: I used # as the delimiter to avoid having to escape the / characters in the regex.
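One way around the caveat about limiting the change to fields 10 and after is to split each line into columns first and only touch the sample columns. The following is a minimal Python sketch of that approach, not a C or sed solution. It assumes the file is tab-delimited, as the VCF format specifies, and that the genotype is the first sub-field of each sample column. The function name `mask_unequal` is made up for illustration.

```python
import re

# Genotype at the start of a sample field, e.g. "0|1" in "0|1:3:5:65,3".
GENOTYPE = re.compile(r"(\d+)([|/])(\d+)")


def mask_unequal(line, first_sample_field=10):
    """Replace genotypes whose two alleles differ with ./. in sample columns only."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    for i in range(first_sample_field - 1, len(fields)):
        match = GENOTYPE.match(fields[i])
        if match and match.group(1) != match.group(3):
            # Keep everything after the genotype (":GQ:DP..." and so on).
            fields[i] = "./." + fields[i][match.end():]
    return "\t".join(fields) + newline


row = "20\t17330\t.\tT\tA\t3\tq10\tNS=3\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3\n"
print(mask_unequal(row), end="")
```

Because the columns are rejoined with tabs, the delimiters are preserved, and a value such as 0/1 appearing in one of the first nine columns is left alone.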

What are some good ways to compress data across time?

I have an array of objects with time and value property. Looks something like this.
UPDATE: dataset with epoch times rather than time strings
[{datetime:1383661634, value: 43},{datetime:1383661856, value: 40}, {datetime:1383662133, value: 23}, {datetime:1383662944, value: 23}]
The array is far larger than this. Possibly a 6 digit length. I intend to build a graph to represent this array. Due to obvious reasons, I cannot use every bit of the data to build this graph (value vs time); so I need to normalize it across time.
So here's the main problem - There is no trend in the timestamp for these objects; so I need to dynamically choose slots of time in which I either average out the values or show counts of objects in that slot.
How can I calculate slots that are user friendly, i.e. per minute, per hour, per day, per eight hours or so? I want at most 25 slots built from the array, which I then show on the graph.
I hope this helps get my point through.
You can convert the date/time into epoch and use numpy.histogram to get the ranges:
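import random
import numpy
# 1000 random sample values standing in for epoch timestamps
l = [random.randint(0, 1000) for x in range(1000)]
# split the value range into 25 equal-width bins
num_items_bins, bin_ranges = numpy.histogram(l, 25)
print(num_items_bins)
print(bin_ranges)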
Gives:
[34 38 42 41 43 50 34 29 37 46 31 47 43 29 30 42 38 52 42 44 42 42 51 34 39]
[ 1. 40.96 80.92 120.88 160.84 200.8 240.76 280.72
320.68 360.64 400.6 440.56 480.52 520.48 560.44 600.4
640.36 680.32 720.28 760.24 800.2 840.16 880.12 920.08
960.04 1000. ]
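Equal-width histogram bins rarely land on round boundaries such as whole minutes or hours. A minimal sketch of one alternative follows: pick the smallest width from a fixed list of "friendly" widths that covers the data in at most 25 aligned slots, then average the values per slot. The list of candidate widths and both function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Candidate slot widths in seconds: 1 min, 5 min, 15 min, 1 h, 8 h, 1 day, 1 week.
FRIENDLY_WIDTHS = [60, 300, 900, 3600, 8 * 3600, 86400, 7 * 86400]


def choose_width(start, end, max_slots=25):
    """Return the smallest friendly width whose aligned slots cover [start, end]."""
    for width in FRIENDLY_WIDTHS:
        if end // width - start // width + 1 <= max_slots:
            return width
    # Fallback: the widest candidate, which may still exceed max_slots.
    return FRIENDLY_WIDTHS[-1]


def bucket_average(points, max_slots=25):
    """Group points into aligned slots and average the values in each slot."""
    times = [p["datetime"] for p in points]
    width = choose_width(min(times), max(times), max_slots)
    sums = defaultdict(lambda: [0, 0])  # slot start -> [total, count]
    for p in points:
        slot = p["datetime"] // width * width
        sums[slot][0] += p["value"]
        sums[slot][1] += 1
    averages = {slot: total / count for slot, (total, count) in sorted(sums.items())}
    return width, averages


points = [
    {"datetime": 1383661634, "value": 43},
    {"datetime": 1383661856, "value": 40},
    {"datetime": 1383662133, "value": 23},
    {"datetime": 1383662944, "value": 23},
]
print(bucket_average(points))
```

Slots are keyed by their start time in epoch seconds, so empty slots simply do not appear in the result; to show counts per slot instead of averages, return the count part of each entry.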
It is hard to say without knowing the nature of your values; compressing values for display is a matter of what you can afford to discard and what you can't. Some ideas, though:
histogram
candlestick chart
Is this JSON and the DateTimes transmitted as text?
Why not transmit the date as a long (Int64) and use a method to convert to and from DateTime? Depending on the language, you could use these implementations:
DateTime to Long in C#
Date to long using Unix timestamp in Java
That alone would save you a considerable amount of space: C# and Java strings use 16 bits per character in memory, while a long timestamp is just 64 bits in total.
