What are some good ways to compress data across time?

I have an array of objects with a time and a value property. It looks something like this.
UPDATE: dataset with epoch times rather than time strings
[{datetime:1383661634, value: 43},{datetime:1383661856, value: 40}, {datetime:1383662133, value: 23}, {datetime:1383662944, value: 23}]
The array is far larger than this, possibly six digits long. I intend to build a graph to represent this array. For obvious reasons, I cannot use every data point to build this graph (value vs. time), so I need to normalize it across time.
So here's the main problem: there is no regular pattern to the timestamps of these objects, so I need to dynamically choose slots of time in which I either average out the values or show counts of objects in that slot.
How can I calculate slots that are user friendly, i.e. per minute, hour, eight hours, day, and so on? I am looking at a maximum of 25 slots out of the array, which I then show on the graph.
I hope this helps get my point through.
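A sketch of the kind of slotting I have in mind, in Python (the list of "nice" widths is just an example):
import math

# Candidate "user friendly" slot widths in seconds:
# minute, 5 minutes, 15 minutes, hour, 4 hours, 8 hours, day, week.
NICE_WIDTHS = [60, 300, 900, 3600, 4 * 3600, 8 * 3600, 86400, 7 * 86400]

def choose_slot_width(timestamps, max_slots=25):
    # Pick the smallest nice width that covers the whole range in <= max_slots slots.
    span = max(timestamps) - min(timestamps)
    for width in NICE_WIDTHS:
        if math.ceil(span / width) <= max_slots:
            return width
    return NICE_WIDTHS[-1]

print(choose_slot_width([1383661634, 1383661856, 1383662133, 1383662944]))
# 60 -> the 1310-second span of the sample fits into 22 one-minute slots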

You can convert the date/time into epoch seconds and use numpy.histogram to get the bin ranges:
import random, numpy
l = [random.randint(0, 1000) for x in range(1000)]
num_items_bins, bin_ranges = numpy.histogram(l, 25)
print(num_items_bins)
print(bin_ranges)
Gives:
[34 38 42 41 43 50 34 29 37 46 31 47 43 29 30 42 38 52 42 44 42 42 51 34 39]
[   1.     40.96   80.92  120.88  160.84  200.8   240.76  280.72  320.68
  360.64  400.6   440.56  480.52  520.48  560.44  600.4   640.36  680.32
  720.28  760.24  800.2   840.16  880.12  920.08  960.04 1000.  ]
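Applied to your actual data, you can reuse the bin edges to average the values per slot rather than just count them; here is a minimal sketch (the tiny data list only mirrors the sample in the question):
import numpy

data = [{"datetime": 1383661634, "value": 43}, {"datetime": 1383661856, "value": 40},
        {"datetime": 1383662133, "value": 23}, {"datetime": 1383662944, "value": 23}]

times = numpy.array([d["datetime"] for d in data])
values = numpy.array([d["value"] for d in data])

# At most 25 slots, fewer if there are fewer points.
counts, edges = numpy.histogram(times, bins=min(25, len(times)))

# Sum of values per slot divided by the count per slot gives the average;
# empty slots are reported as NaN.
sums, _ = numpy.histogram(times, bins=edges, weights=values)
averages = numpy.where(counts > 0, sums / numpy.maximum(counts, 1), numpy.nan)

print(counts)    # number of points in each slot
print(averages)  # average value in each slot
print(edges)     # slot boundaries (epoch seconds)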

Hard to say without knowing the nature of your values; compressing values for display is a matter of what you can afford to discard and what you can't. Some ideas, though:
histogram
candlestick chart (see the sketch below)
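For the candlestick idea, a rough Python sketch of the per-slot reduction (the function name and slot fields are just illustrative): for each time slot keep only the first, highest, lowest and last value instead of every point.
import numpy

def candlestick_bins(times, values, max_slots=25):
    times = numpy.asarray(times)
    values = numpy.asarray(values)
    order = numpy.argsort(times)               # make sure points are in time order
    times, values = times[order], values[order]
    edges = numpy.linspace(times[0], times[-1], min(max_slots, len(times)) + 1)
    slots = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        last = (i == len(edges) - 2)           # last slot includes its upper edge
        mask = (times >= lo) & ((times <= hi) if last else (times < hi))
        if mask.any():
            v = values[mask]
            slots.append({"start": lo, "open": v[0], "high": v.max(),
                          "low": v.min(), "close": v[-1]})
    return slots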

Is this JSON, with the date/times transmitted as text?
Why not transmit the date as a long (Int64) and use a method to convert to/from DateTime? Depending on your language, you could use one of these implementations:
DateTime to Long in C#
Date to long using Unix timestamp in Java
That alone would save you a considerable amount of space, since those strings cost 16 bits per character while the long timestamp is just 64 bits.
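For illustration, the same round trip in Python (the implementations linked above are for C# and Java):
from datetime import datetime, timezone

# datetime -> epoch seconds (fits comfortably in a 64-bit integer)
ts = int(datetime(2013, 11, 5, 14, 27, 14, tzinfo=timezone.utc).timestamp())

# epoch seconds -> datetime
dt = datetime.fromtimestamp(ts, tz=timezone.utc)

print(ts)   # 1383661634, the first timestamp in the question's data
print(dt)   # 2013-11-05 14:27:14+00:00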

Related

Is it safe to use the first 22 characters of an NFT pubkey as a primary key for a DB

I am wondering if it is safe to use only the first 22 characters, instead of the full 44 characters, of an NFT pubkey as the primary key of a MySQL DB. I have a DB with a huge amount of data and could save a lot of space with this approach. For instance, given the following pubkey:
AQoKYV7tYpTrFZN6P5oUufbQKAUr9mNYGe1TTJC9wajM
Would it be safe to use just the first 22 characters:
AQoKYV7tYpTrFZN6P5oUuf
Would it be safer to use the first 11 characters plus the trailing 11 characters, or does it not make any difference?
AQoKYV7tYpTe1TTJC9wajM
A public key is 32 bytes, so those "44 characters" are actually the base-58 representation of those 32 bytes.
If you're only storing 22 characters, let's simplify things and say that you're storing 16 bytes out of 32 total. The chance of two pubkeys sharing the same 16-byte sequence is 1 / 256^16 = 1 / 2^128 ≈ 2.9 × 10^-39, which is very unlikely, but possible.
Here's another way to approach the problem -- how about storing the full pubkey as 32 bytes instead of as a string? Then you won't ever lose any precision.
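A quick sketch of that idea in Python, assuming the third-party base58 package (pip install base58); on the MySQL side the decoded value fits a BINARY(32) column:
import base58

pubkey_str = "AQoKYV7tYpTrFZN6P5oUufbQKAUr9mNYGe1TTJC9wajM"

raw = base58.b58decode(pubkey_str)          # 32 raw bytes, nothing lost
assert len(raw) == 32

# Back to the familiar string form whenever you need to display it.
assert base58.b58encode(raw).decode() == pubkey_str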

3-byte array for time format {Hour}{Minute}{Second}

I am reading the system time (register: TM11), and I want to get the minute value from it.
The system time has this format, 3 bytes: {Hour}{Minute}{Second}.
I am not sure how to extract the "minute" byte using an array; my C code is below.
In my C code I use the read_register function to read the system time, and the pointer (byte*)&systime[1] to try to extract the "minute" byte. I am not sure this is the correct way to do it.
Let's say the time now is 07:48:29 AM; then TM11 will contain 07, 48, 29.
I want to extract "48", the minute, from TM11.
Time interval: 15 minutes.
Time passed = 48 % 15 = 3 minutes.
Putting this calculation into C:
byte systime[3];              // "systime" is a 3-byte array that stores TM11
byte time_interval = 15;      // time interval is 15 minutes

read_register(TM11, (byte*)&systime[1]);
// "read_register" reads the value of the TM11 register;
// let's say the value read is 07:48:29 AM.
// (byte*)&systime[1] tries to point at the "minute" byte of TM11, so systime[1] = 48.
// I am not sure whether hour ends up in systime[0] and minute in systime[1].

elaps_time = systime[1] % time_interval;   // elapsed time = 48 % 15 = 3

MATLAB sort function yields tampered results

I have a vector of 126 elements which is usually correctly sorted; however, I always sort it to make sure everything is okay.
The problem is that, when the array is already sorted, performing a sort destroys the original values of the array.
I attached the array in a CSV file and executed the script below, where I put the vector in the first column of 'a', the sorted version in the second, and then checked for differences in the third column.
a = csvread('a.csv')
a(:,2)=sort(a(:,1))
a(:,3)=a(:,2)-a(:,1)
result=sum(a(:,3).^2)
You can easily see that the first two columns aren't identical, and the third column has some non-zero values.
The array, in MATLAB syntax:
a = [17.4800
18.6800
19.8800
21.0800
22.2800
23.4800
24.6800
25.8800
27.0800
28.2800
29.4800
30.6800
46.1600
47.3600
48.5600
49.7600
50.9600
52.1600
53.3600
54.5600
55.7600
56.9600
58.1600
59.3600
74.8400
76.0400
77.2400
78.4400
79.6400
80.8400
103.5200
104.7200
105.9200
107.1200
108.3200
109.5200
110.7200
111.9200
113.1200
114.3200
115.5200
116.7200
132.2000
133.4000
134.6000
135.8000
137.0000
138.2000
139.4000
140.6000
141.8000
143.0000
144.2000
145.4000
165.4200
166.6200
167.8200
169.0200
170.2200
171.4200
172.6200
173.8200
175.0200
176.2200
177.4200
178.6200
179.9300
181.1300
182.3300
183.5300
184.7300
185.9300
187.1300
188.3300
189.5300
201.3700
202.5700
203.7700
204.9700
206.1700
207.3700
236.1100
237.3100
238.5100
239.7100
240.9100
242.1100
243.3100
244.5100
245.7100
246.9100
248.1100
249.3100
239.8400
241.0400
242.2400
276.9900
278.1900
279.3900
280.5900
281.7900
282.9900
284.1900
285.3900
286.5900
287.7900
288.9900
290.1900
277.8200
279.0200
280.2200
281.4200
282.6200
283.8200
285.0200
286.2200
287.4200
288.6200
289.8200
291.0200
291.0700
292.2700
293.4700
295.6900
296.8900
298.0900];
Your original vector is unfortunately not sorted. Therefore, sorting it will obviously not give you back the original vector, because the values that were out of order get moved into order.
You can check this by using diff on the vector read in from the CSV file and seeing if there are any negative differences. diff takes the difference between the (i+1)th value and the ith value; if your values are monotonically increasing, you should get positive differences throughout. We can see which locations are affected by finding the negative differences:
a = csvread('a.csv');
ind = find(diff(a) < 0);
We get:
>> ind
ind =
93
108
This says that locations 93 and 108 are where the out-of-order runs start; locations 94 and 109 are where the drops actually happen. Let's check out portions 90 - 110 of your vector to be sure:
>> a(90:110)
ans =
245.7100 % 90
246.9100 % 91
248.1100 % 92
249.3100 % 93
239.8400 %<-------
241.0400
242.2400
276.9900
278.1900
279.3900
280.5900
281.7900
282.9900
284.1900
285.3900
286.5900
287.7900 % 106
288.9900 % 107
290.1900 % 108
277.8200 % <------
279.0200
As you can see, the values dip right after locations 93 and 108, so if you sort the vector and then take the difference against the original, locations 1 up to 93 show a difference of 0, but after location 93 the two columns become unequal.
I'm frankly surprised you didn't notice that the values are out of order, because the listing clearly shows the decrease in value at those locations.
Therefore, either check your data to see whether you have input it correctly, or modify whatever process you're working on to ensure that it can handle unsorted data.

How can I sum values with multiple conditions including different dates

I have some data as follows (columns A:D contain the data, column E is the sum I created):
NO
SE
Date         Country  ID       Value  Sum
30-01-2014   SE       B-08888  10     10
05-02-2014   SE       B-08888  23
06-02-2014   SE       B-08888  20
13-05-2014   SE       B-08888  17     27
14-05-2014   SE       B-08888  10
13-05-2014   NO       A-07777  15     35
14-05-2014   NO       A-07777  20
I would like to sum all values that have the same country and the same ID, for: 1) dates greater than 1/5; and 2) dates less than 1/5.
I am using SUMIFS, but it doesn't give correct results when I include the date condition "less than 1/5".
=SUMIFS($D$5:$D$11;$A$5:$A$11;"<="&DATE(2014;5;1);$B$5:$B$11;A2;$C$5:$C$11;C5) ==> gives incorrect result (=10)
=SUMIFS($D$5:$D$11;$A$5:$A$11;">="&DATE(2014;5;1);$B$5:$B$11;A2;$C$5:$C$11;C8) ==> gives correct result (=27)
Is there a way I can take into account both date conditions (i.e. date greater than and less than 1/5) and make the formula general, so I don't have to go through every cell to change references?
Thank you.
Using your data, the second formula returns 27 for me, so I assume the cell references you have not mentioned are as I have guessed. The first formula returns 53 for me, which I suspect is the result you want, though you have not said.
Something is wrong with your data (not the formulae). The most likely cause is that there is a trailing space in C6 and C7 that is not in C5. Copying C5 down to C9 should fix that. There might however be a data issue in other cells in those two rows.
It might make things easier for you if the formulae were in separate columns.

Multi Pattern Matching Algorithm

Suppose you have several patterns of dates, P1 - Pn.
Some of them are simple, like P1 - all Mondays, P2 - all Tuesdays; others are more complex, like P4 - all working days, etc.
For a custom array of dates (V1, V2) I have to create the shortest possible result string.
For any such array we have to create a string which represents the dates in the array. The simplest method is to create a string like 1.5.2013, 2.5.2013, 3.5.2013 ..., but the result string will be very long.
Using several predefined patterns we can create a shorter result string.
For the result string I use the following rules:
Single date format: DD.MM.YYYY (10 characters)
Enumeration (dates and patterns): comma and space (2 characters)
Interval of dates: DD.MM.YYYY-DD.MM.YYYY (21 characters)
Interval of pattern names: Px-Py (5 characters)
Special words: except (6 characters)
Examples of result strings:
V1 using P4 pattern:
P4 except 01.05.2013-03.05.2013, 09.05.2013, 10.05.2013, 16.05.2013, 17.05.2013 (80 characters)
V1 using Pn pattern:
Pn 06.05.2013-08.05.2013, 13.05.2013-15.05.2013, 20.05.2013-24.05.2013, 27.05.2013-31.05.2013 (94 characters)
V1 using best patterns match:
P1-P3 01.05.2013-19.05.2013, P4 20.05.2013-31.05.2013 (54 characters)
The main goal is to create the shortest result string. As I understand it, we can achieve this by finding the best-matching pattern or patterns.
Currently I'm trying to adapt the knapsack problem and the longest common subsequence problem, but I'm not sure this is the right direction.
I would appreciate any ideas.
Update:
Thanks to Jan Dvorak for his extra-short description of my problem:
The goal is to describe V using a predefined dictionary (P1..Pn and all intervals and single dates) where intersection, union and subtraction are all allowed and each operation and atom have a predefined cost (number of characters in result string).
After a long time of searching we finally found a solution which is very close to what we want:
http://www.fpz.unizg.hr/traffic/index.php/PROMTT/article/view/1287
Thanks to all who participated.
This is just a suggestion, but if you want a really short string that represents the arrays of dates, you could solve this problem in a totally different way that is very simple and efficient.
Let 1 represent a "selected" day and 0 an "unselected" day; then you can construct a binary number that represents the custom date array for a month. For example, for the V1 case you can generate this binary number:
V1 = 0000011100001110000111110011111
So the first 0 means that the date 1.5.2013 is "unselected", the next 0 means that 2.5.2013 is "unselected", etc. If you split this number into 8-bit groups (dividing the binary number into bytes) you get this byte array:
V1(starting in May 1, 2013) = 00000111 - 00001110 - 00011111 - 00111110 (4 bytes)
With this method you represent V1 using just 4 bytes. This is the only info you need if you know that V1 starts on 1.5.2013, so you need to store the initial date as well; the month and year can be represented using just 3 bytes. For instance, May 2013 can be represented this way:
May is the 5th month, and 5 in binary is 101.
2013 in binary is 11111011101.
So using 3 bytes you can represent May 2013 like this:
00000101 00000111 11011101
[   5  ] [     2013       ]
So you can represent V1 like this:
V1 = 00000101 - 00000111 - 11011101 - 00000111 - 00001110 - 00011111 - 00111110
     [ month ]   [       year      ]   [     V1 custom array of dates        ]
So V1 can be totally represented using just 7 bytes!!
If you need a string instead of a byte array, you can convert this byte array into a Base64 string, so V1 can be represented as:
V1 in Base64 is Cg+6Dhw+Pg== (using just 12 characters!)
In the case of V2:
V2 = 00000101 - 00000111 - 11011101 - 11111111 - 11111111 - 11111111 - 11101110
     [ month ]   [       year      ]   [     V2 custom array of dates        ]
V2 in Base64 is Cg+7////bg== (using just 12 characters again!!)
With this method, a month's custom array of dates can be represented in 7 bytes (or 12 characters if you use the Base64 string).
To store the custom array info for a whole year you just need 3 bytes for the start month and year, plus 365/8 = 45.625 (rounded up to 46) bytes for the day bits. That is 49 bytes for the whole year, which in Base64 is at most 68 characters.
This is simple to implement, easy to maintain in code, and better than a complex pattern-matching algorithm; it smells like a good solution to me. I hope this recommendation fits your requirements.
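A rough Python sketch of this encoding (the helper names and the exact layout, 1 byte for the month, 2 for the year, 4 for the day bits, are just one possible choice):
import base64
from calendar import monthrange

def encode_month(year, month, selected_days):
    # selected_days: set of day-of-month numbers that are "selected"
    days_in_month = monthrange(year, month)[1]
    bits = "".join("1" if d in selected_days else "0"
                   for d in range(1, days_in_month + 1))
    bits = bits.ljust(32, "0")                    # pad to 4 whole bytes
    payload = bytes([month]) + year.to_bytes(2, "big") + int(bits, 2).to_bytes(4, "big")
    return base64.b64encode(payload).decode()

def decode_month(token):
    raw = base64.b64decode(token)
    month, year = raw[0], int.from_bytes(raw[1:3], "big")
    bits = bin(int.from_bytes(raw[3:], "big"))[2:].zfill(32)
    days = {i + 1 for i, b in enumerate(bits) if b == "1"}
    return year, month, days

# V1 from the question: 6-8, 13-15, 20-24 and 27-31 May 2013 are selected.
v1_days = set(range(6, 9)) | set(range(13, 16)) | set(range(20, 25)) | set(range(27, 32))
token = encode_month(2013, 5, v1_days)
print(token, len(token))     # a 12-character string for the whole month
print(decode_month(token))   # (2013, 5, {6, 7, 8, 13, 14, 15, 20, ...})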
