Multi Pattern Matching Algorithm - arrays
Suppose you have several date patterns, P1 to Pn.
Some are simple, such as P1 (all Mondays) and P2 (all Tuesdays); others are more complex, such as P4 (all working days).
For a custom array of dates (V1, V2) I have to create the shortest possible result string, as shown in the picture:
For any array we have to create a string that represents the dates in it. The simplest method is to list every date, like 1.5.2013, 2.5.2013, 3.5.2013 ..., but the result string will be very long.
Using several predefined patterns we can create a shorter result string.
For the result string I use the following rules:
Single date format: DD.MM.YYYY (10 characters)
Enumeration (dates and patterns): comma and space (2 characters)
Interval of dates: DD.MM.YYYY-DD.MM.YYYY (21 characters)
Interval of pattern names: Px-Py (5 characters)
Special words: except (6 characters)
Examples of result strings:
V1 using P4 pattern:
P4 except 01.05.2013-03.05.2013, 09.05.2013, 10.05.2013, 16.05.2013, 17.05.2013 (80 characters)
V1 using Pn pattern:
Pn 06.05.2013-08.05.2013, 13.05.2013-15.05.2013, 20.05.2013-24.05.2013, 27.05.2013-31.05.2013 (94 characters)
V1 using best patterns match:
P1-P3 01.05.2013-19.05.2013, P4 20.05.2013-31.05.2013 (54 characters)
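To make the baseline concrete, here is a minimal Python sketch of the pattern-free encoding used in the second example: consecutive days are merged into intervals and everything is joined with a comma and a space. The function name format_dates is made up for illustration, and the V1 days are read off the intervals listed in that example.

```python
from datetime import date, timedelta


def format_dates(dates):
    """Render dates as DD.MM.YYYY items, merging consecutive days into intervals."""
    days = sorted(set(dates))
    parts = []
    i = 0
    while i < len(days):
        j = i
        # Extend the run while the next date is exactly one day later.
        while j + 1 < len(days) and days[j + 1] - days[j] == timedelta(days=1):
            j += 1
        first = days[i].strftime("%d.%m.%Y")
        if j == i:
            parts.append(first)
        else:
            parts.append(first + "-" + days[j].strftime("%d.%m.%Y"))
        i = j + 1
    return ", ".join(parts)


# V1 as listed in the second example: 6-8, 13-15, 20-24 and 27-31 May 2013.
v1_days = [6, 7, 8, 13, 14, 15, 20, 21, 22, 23, 24, 27, 28, 29, 30, 31]
v1 = [date(2013, 5, d) for d in v1_days]
print(format_dates(v1))
```

Note that this sketch also writes a two-day run as an interval, because 21 characters is one fewer than two single dates plus a separator (22). Any pattern-based description has to beat the length of this string to be worth using.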
The main goal is to create the shortest result string. As I understand it, we can achieve this by finding the best-matching pattern or patterns.
I'm currently trying to adapt the knapsack problem and the longest common subsequence problem, but I'm not sure this is the right direction.
I would appreciate any ideas.
updated
Thanks to Jan Dvorak for his extra short description of my problem:
The goal is to describe V using a predefined dictionary (P1..Pn and all intervals and single dates) where intersection, union and subtraction are all allowed and each operation and atom have a predefined cost (number of characters in result string).
After a long search we finally found a solution that is very close to what we want.
http://www.fpz.unizg.hr/traffic/index.php/PROMTT/article/view/1287
Thanks to all who participated.
This is just a suggestion, but if you want a really short string that represents the array of dates, you could solve the problem in a completely different way that is simple and efficient.
Let 1 represent a "selected" day and 0 an "unselected" day; then you can construct a binary number that represents the custom date array for a month. For example, for the V1 case you can generate this binary number:
V1 = 0000011100001110000111110011111
So the first 0 means that the date 1.5.2013 is "unselected", the next 0 means that 2.5.2013 is "unselected", and so on. If you split this number into 8-bit groups (padding the last group with a trailing 0, since May has 31 days), you get this byte array:
V1(starting in May 1, 2013) = 00000111 - 00001110 - 00011111 - 00111110 (4 bytes)
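As a rough Python sketch of this step (the helper names month_mask and pack_bits are invented for the example), the bit string and the four bytes can be produced like this, assuming day 1 is the leftmost bit and the last byte is padded with zeros on the right:

```python
def month_mask(selected_days, days_in_month):
    """Return the 0/1 string for a month; day 1 is the leftmost bit."""
    chosen = set(selected_days)
    return "".join("1" if d in chosen else "0" for d in range(1, days_in_month + 1))


def pack_bits(bits):
    """Pad the bit string with zeros on the right to whole bytes and pack it."""
    padded = bits + "0" * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


# V1: 6-8, 13-15, 20-24 and 27-31 May 2013.
v1_days = [6, 7, 8, 13, 14, 15, 20, 21, 22, 23, 24, 27, 28, 29, 30, 31]
bits = month_mask(v1_days, 31)
print(bits)                   # 0000011100001110000111110011111
print(pack_bits(bits).hex())  # 070e1f3e
```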
With this method you represent V1 using just 4 bytes. That is all the information you need, provided you know that V1 starts on 1.5.2013, so the initial date has to be stored as well. The month and year fit in 3 bytes; for instance, May 2013 can be represented this way:
May = 5th month so 5 in binary is 101
2013 in binary is 11111011101
So using 3 bytes you can represent May 2013 as this:
00000101 00000111 11011101
[   5  ] [      2013     ]
So you can represent V1 like this:
V1 = 00000101 - 00000111 - 11011101 00000111 - 00001110 - 00011111 - 00111110
     [Month ]   [      Year       ] [       V1 custom array of dates        ]
So V1 can be fully represented using just 7 bytes.
If you need a string instead of a byte array, you can convert the byte array into a Base64 string.
V1 in Base64 is BQfdBw4fPg== (the bytes 05 07 DD 07 0E 1F 3E in hex, using just 12 characters).
In the case of V2:
V2 = 00000101 - 00000111 - 11011101 11111111 - 11111111 - 11111111 - 11101110
     [Month ]   [      Year       ] [       V2 custom array of dates        ]
V2 in Base64 is BQfd////7g== (the bytes 05 07 DD FF FF FF EE in hex, again just 12 characters).
With this method, a month's custom array of dates can always be represented in 7 bytes (or 12 characters as a Base64 string).
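A minimal round-trip sketch in Python, assuming the layout described above with the month stored in one full byte, the year in two big-endian bytes and the day mask padded on the right. The function names encode_month and decode_month are hypothetical, and the decoder has to be told the number of days in the month, since the padding bit is not distinguishable otherwise.

```python
import base64


def encode_month(month, year, selected_days, days_in_month):
    """Pack [month: 1 byte][year: 2 bytes][day mask] and Base64-encode it."""
    mask = 0
    for day in selected_days:
        mask |= 1 << (days_in_month - day)  # day 1 is the most significant bit
    n_bytes = (days_in_month + 7) // 8
    mask <<= n_bytes * 8 - days_in_month     # right-pad to whole bytes
    raw = bytes([month]) + year.to_bytes(2, "big") + mask.to_bytes(n_bytes, "big")
    return base64.b64encode(raw).decode("ascii")


def decode_month(text, days_in_month):
    """Inverse of encode_month: return (month, year, selected days)."""
    raw = base64.b64decode(text)
    month, year = raw[0], int.from_bytes(raw[1:3], "big")
    bits = "".join(f"{b:08b}" for b in raw[3:])[:days_in_month]
    return month, year, [i + 1 for i, bit in enumerate(bits) if bit == "1"]


v1_days = [6, 7, 8, 13, 14, 15, 20, 21, 22, 23, 24, 27, 28, 29, 30, 31]
encoded = encode_month(5, 2013, v1_days, 31)
print(len(encoded), decode_month(encoded, 31) == (5, 2013, v1_days))
```

The exact characters of the Base64 string depend on how the bits are aligned into bytes, so both sides have to agree on this layout.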
To store the custom array for a whole year you just need:
3 bytes for the start month and year, plus 365/8 = 45.625 (rounded up to 46) bytes for the days. That is 49 bytes for the whole year, which in Base64 takes 4 × ceil(49/3) = 68 characters.
This is simple to implement and easy to maintain in code, and it avoids a complex pattern-matching algorithm, so it looks like a good solution to me. I hope this recommendation fits your requirements.
Related
Weird string formatting with 2 vs 4 decimal places in currency field
I have the following piece of code. I want to format numbers with string templates. One variable has 2 decimal places, the other has 4, but they represent the same number, 50000 (fifty thousand). The first number is correctly formatted (German representation) as 50.000,00; the other one, however, is formatted as 5 million: 5.000.000,00!
    DATA: lv_p2 TYPE p LENGTH 9 DECIMALS 2,
          lv_p4 TYPE p LENGTH 14 DECIMALS 4.
    START-OF-SELECTION.
      lv_p2 = '50000'.
      lv_p4 = lv_p2.
      SET COUNTRY 'DE'.
      "This is correctly formatted as 50.000,00
      WRITE |{ lv_p2 NUMBER = ENVIRONMENT CURRENCY = 'EUR' }|.
      "This is on the other hand interpreted as five million! 5.000.000,00
      WRITE |{ lv_p4 NUMBER = ENVIRONMENT CURRENCY = 'EUR' }|.
Is this documented somewhere? What am I doing wrong here?
EDIT: It looks like the problem is with the addition CURRENCY. If I don't use it, the number is correctly formatted:
    WRITE |{ lv_p4 NUMBER = ENVIRONMENT }|.
or
    WRITE |{ lv_p4 NUMBER = ENVIRONMENT DECIMALS = 2 }|.
Anyway, it looks like some kind of bug.
I believe this behaviour is documented: ABAP Documentation - WRITE, format_options - CURRENCY cur.
When CURRENCY is added: "For data objects of type p, the decimal places determined by the definition of the data type are ignored completely. Independently of the actual value and without rounding, decimal separators and thousands separators are inserted between the digits in the places determined by cur."
In short: if CURRENCY is added (by WRITE), the number of decimal places is determined by the currency (EUR has 2 decimal places), so the value 50.000,0000 becomes 5.000.000,00. The digits are the same (9 of them); only the position of the decimal separator differs.
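To see the arithmetic behind the quoted rule, here is a small Python sketch that mimics it. This is not ABAP and not SAP's implementation: the function format_with_currency is invented for illustration, and Python's Decimal digit tuple stands in for the stored digits of the packed number.

```python
from decimal import Decimal


def format_with_currency(value, currency_decimals, thousands=".", decimal_sep=","):
    """Ignore the declared decimals and place separators by the currency only."""
    sign, digits, _exponent = value.as_tuple()
    # Only the raw digits are used; the exponent (declared decimals) is dropped.
    text = "".join(str(d) for d in digits).rjust(currency_decimals + 1, "0")
    if currency_decimals:
        whole, frac = text[:-currency_decimals], text[-currency_decimals:]
    else:
        whole, frac = text, ""
    groups = []
    while whole:
        groups.insert(0, whole[-3:])
        whole = whole[:-3]
    result = ("-" if sign else "") + thousands.join(groups)
    return result + decimal_sep + frac if frac else result


print(format_with_currency(Decimal("50000.00"), 2))    # 50.000,00
print(format_with_currency(Decimal("50000.0000"), 2))  # 5.000.000,00
```

The second call carries two extra stored digits, and because the separator position comes from the currency alone, those digits shift the value by a factor of 100.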
Formula to write milliseconds in hh:mm:ss.000 format gives wrong values
I'm trying to convert durations in one column, given in milliseconds (e.g. 600, 2101, 1110, ...), to hh:mm:ss.000 format (e.g. 00:00:00.600, 00:00:02.101, ...) using the formula below in Google Sheets:
    =CONCATENATE(TEXT(INT(A1/1000)/86400,"hh:mm:ss"),".",A1-(INT(A1/1000)*1000))
It gives correct values for almost all inputs, except durations that have 0 as their second digit (e.g. 2010, 3056, 1011). When 0 is the second digit, the millisecond part loses its leading zero (rows 1 and 2 in the table below). For other durations it gives the right value (row 3). I need a formula that works for all values, i.e. 1080 → 00:00:01.080 and not 00:00:01.80. Can someone please help with this?
    Duration in milliseconds | hh:mm:ss.000 format
    1080                     | 00:00:01.80 (wrong)
    2010                     | 00:00:02.10 (wrong)
    1630                     | 00:00:01.630 (correct)
try: =INDEX(IF(A2:A="",,TEXT(A2:A/86400000, "hh:mm:ss.000")))
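For comparison, the same conversion outside the spreadsheet, as a short Python sketch (format_ms is just an illustrative name). The key point is the same as in the formula above: the millisecond part must be zero-padded to three digits instead of being concatenated as a bare number.

```python
def format_ms(total_ms):
    """Format a non-negative millisecond count as hh:mm:ss.mmm."""
    seconds, ms = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    # :03d keeps the leading zeros that plain concatenation drops.
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


for value in (1080, 2010, 1630):
    print(value, format_ms(value))
```

Unlike the "hh" format in the sheet, this sketch does not wrap the hour count at 24.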
T-SQL: SUM Number between Delimiters from String
I need to get numbers of variable length out of a string and sum them. The strings have the following format:
    EH:NUMBER=SomeOtherStuff->Code
I'm extracting the code via RIGHT() and joining with another table to get the group right. At the moment I'm using SUM to aggregate by date:
    SUM(CASE WHEN (MONTH(data.DATE1) = 5 AND YEAR(data.DATE1) = YEAR(GETDATE())) THEN 1 ELSE 0 END) N'Mai',
I then need to sum the numbers from the string, not the number of rows. Some examples:
    Month1 EH:1=24->ZTM
    Month1 EH:4=13-21->LKm
    Month2 EH:3=34,33,43->LKm
    Month2 EH:7=12,92-29,29->LKm
    Month2 EH:5=24-26,11,21,22->ZOL
What I need:
    Material - Month1 - Month2
    ZTM - 1 - 0
    LKM - 4 - 10
    ZOL - 0 - 5
Could you help me please? Greetings
Short version: what you are looking for is SUBSTRING.
Longer version: to get the sum of the numerical values of NUMBER you need to think about how to break the problem down. I'd recommend following these steps:
1. Extract the NUMBER part from the string. This should be done with SUBSTRING (much like you extract Code with RIGHT). To get the start and length of your substring use CHARINDEX (or PATINDEX if you like).
2. Convert the NUMBER part to a numerical value with CAST (or CONVERT, or whatever you are familiar with).
3. Now you can do your aggregation.
So: SUM(CAST(SUBSTRING(*this part you will have to figure out by yourself*) AS correct numerical data type)). I'll let you figure out the values to insert by yourself, and I would recommend first finding the positions of the delimiting characters, then extracting the NUMBER part, then getting the numerical value ... you get it. This is to gain a better understanding of what you are actually doing.
Cheers, and good luck with your assignment
Martin
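The steps above, sketched in Python rather than T-SQL so the SQL itself is still left to work out. The helper name extract and the in-memory rows are assumptions made for the example; str.index plays the role of CHARINDEX and slicing plays the role of SUBSTRING.

```python
from collections import defaultdict


def extract(entry):
    """Split EH:NUMBER=SomeOtherStuff->Code into (NUMBER as int, upper-cased Code)."""
    start = entry.index(":") + 1      # position right after the colon
    end = entry.index("=", start)     # position of the first equals sign
    code = entry.rsplit("->", 1)[1]   # everything after the last arrow
    return int(entry[start:end]), code.upper()


rows = [
    ("Month1", "EH:1=24->ZTM"),
    ("Month1", "EH:4=13-21->LKm"),
    ("Month2", "EH:3=34,33,43->LKm"),
    ("Month2", "EH:7=12,92-29,29->LKm"),
    ("Month2", "EH:5=24-26,11,21,22->ZOL"),
]

# Sum NUMBER per code and month instead of counting rows.
totals = defaultdict(lambda: defaultdict(int))
for month, entry in rows:
    number, code = extract(entry)
    totals[code][month] += number

for code, per_month in totals.items():
    print(code, per_month["Month1"], per_month["Month2"])
```

With the sample rows from the question this prints 1 and 0 for ZTM, 4 and 10 for LKM, and 0 and 5 for ZOL.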
algorithm for finding date in sorted array of dates
Here is my problem. I have a sorted array of dates that is stored in a circular buffer, and I have a pointer to the last date in the buffer. Some dates may be missing. The client requests a range of dates. If the lower limit date is missing, the program should return the closest date that is higher than the requested one, and vice versa for the upper limit date. Here is an example:
Dates in circular buffer (int[18]):
    1,2,3,4,5,11,12,13,14,15,21,22,23,24,25,26,27,28
If the client wants the range from 8 to 23, the program should return 11,12,13,14,15,21,22,23.
I tried it like this.
Notes:
- The number between two stars is the current date, and diff is the number of steps to go to find 8.
- The pointer cannot be less than 0 or higher than 17.
    {1,2,3,4,5,11,12,13,14,15,21,22,23,24,25,26,27,*28*}, diff = -20
    {*1*,2,3,4,5,11,12,13,14,15,21,22,23,24,25,26,27,28}, diff = +7
    {1,2,3,4,5,11,12,*13*,14,15,21,22,23,24,25,26,27,28}, diff = -5
    {1,2,*3*,4,5,11,12,13,14,15,21,22,23,24,25,26,27,28}, diff = +5 -> (5/2)+1=+3
    (if I detect that I will just go x steps forward and x steps backward, I split x in half)
    {1,2,3,4,5,*11*,12,13,14,15,21,22,23,24,25,26,27,28}, diff = -3 -> (-3/2)-1 = -2
    {1,2,3,*4*,5,11,12,13,14,15,21,22,23,24,25,26,27,28}, diff = 4
    {1,2,3,4,5,11,12,*13*,14,15,21,22,23,24,25,26,27,28}, diff = -5
    {1,2,*3*,4,5,11,12,13,14,15,21,22,23,24,25,26,27,28}, diff = +5 -> (5/2)+1=+3
If we continue like this we will get 13, 3, 11, 4 over and over again.
Notes:
- It is only a coincidence that we get 11 here. When I use real examples with more dates, this algorithm jumps between some other 4 (or 3) numbers.
- Dates are stored in the EEPROM of a uC, so reading dates takes a while, and I need to find the date as quickly as possible (with a minimum number of reads).
Please help.
Set p1 to the start of the buffer and p2 to the end; X is the date you're looking for. If p1Date is after X, return p1. If p2Date is before X, return p2. Otherwise look at the midpoint m between p1 and p2: if mDate is before X, set p1 = m + 1; otherwise set p2 = m. Repeat until p1 = p2, which is then the first date that is not before X.
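A minimal Python sketch of that binary search, adapted to a circular buffer. Several things here are assumptions rather than part of the answer: the names lower_bound and date_range, the read callback that stands in for one EEPROM read, dates being plain integers as in the question, and start being the physical slot of the oldest date (for a full buffer that is one past the "last date" pointer, modulo the size).

```python
def lower_bound(read, size, start, x):
    """Logical index of the first date >= x, or size if there is none.

    read(i) fetches the date in physical slot i; logical index k lives in
    physical slot (start + k) % size.
    """
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        if read((start + mid) % size) < x:
            lo = mid + 1   # everything up to mid is too small
        else:
            hi = mid       # mid itself may be the answer
    return lo


def date_range(buffer, start, low, high):
    """All stored dates d with low <= d <= high, in order."""
    size = len(buffer)
    first = lower_bound(buffer.__getitem__, size, start, low)
    last = lower_bound(buffer.__getitem__, size, start, high + 1)  # one past the end
    return [buffer[(start + k) % size] for k in range(first, last)]


dates = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25, 26, 27, 28]
print(date_range(dates, 0, 8, 23))  # [11, 12, 13, 14, 15, 21, 22, 23]
```

Each bound costs about log2(n) reads (roughly 5 for 18 entries), and the search cannot oscillate because the interval between lo and hi shrinks on every step.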
What are some good ways to compress data across time?
I have an array of objects with time and value properties. It looks something like this.
UPDATE: dataset with epoch times rather than time strings
    [{datetime:1383661634, value: 43},{datetime:1383661856, value: 40}, {datetime:1383662133, value: 23}, {datetime:1383662944, value: 23}]
The array is far larger than this, possibly with a six-digit length. I intend to build a graph to represent this array. For obvious reasons I cannot use every data point to build the graph (value vs. time), so I need to normalize it across time. So here's the main problem: there is no trend in the timestamps of these objects, so I need to dynamically choose time slots in which I either average the values or show the count of objects in that slot. How can I calculate slots that are user-friendly, i.e. per minute, hour, day, eight hours or so? I am looking at a maximum of 25 slots computed from the array, which I then show on the graph. I hope this gets my point across.
You can convert the date/time into epoch time and use numpy.histogram to get the ranges:
    import random
    import numpy
    # 1000 random sample values standing in for the epoch timestamps
    l = [random.randint(0, 1000) for x in range(1000)]
    # split the value range into 25 equal-width bins
    num_items_bins, bin_ranges = numpy.histogram(l, 25)
    print(num_items_bins)
    print(bin_ranges)
Gives (for one random sample):
    [34 38 42 41 43 50 34 29 37 46 31 47 43 29 30 42 38 52 42 44 42 42 51 34 39]
    [ 1. 40.96 80.92 120.88 160.84 200.8 240.76 280.72 320.68 360.64 400.6 440.56 480.52 520.48 560.44 600.4 640.36 680.32 720.28 760.24 800.2 840.16 880.12 920.08 960.04 1000. ]
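numpy.histogram splits the range into equal-width bins, so the edges land on arbitrary values such as 40.96. If the slots should be user-friendly widths instead, one possible approach is sketched below in plain Python. The list of candidate widths and the names choose_width and bucket are assumptions made for the example, not something from the question.

```python
# Candidate slot widths in seconds: 1 min, 5 min, 15 min, 1 h, 8 h, 1 day, 1 week.
NICE_WIDTHS = [60, 300, 900, 3600, 8 * 3600, 86400, 7 * 86400]


def choose_width(timestamps, max_slots=25):
    """Smallest candidate width that covers the data in at most max_slots slots."""
    lo, hi = min(timestamps), max(timestamps)
    for width in NICE_WIDTHS:
        # Slots are aligned to multiples of the width, so count aligned slots.
        if hi // width - lo // width + 1 <= max_slots:
            return width
    return NICE_WIDTHS[-1]  # fall back to the coarsest width


def bucket(points, width):
    """Average the values of (timestamp, value) pairs per aligned slot."""
    sums = {}
    for ts, value in points:
        slot = ts - ts % width
        total, count = sums.get(slot, (0, 0))
        sums[slot] = (total + value, count + 1)
    return {slot: total / count for slot, (total, count) in sorted(sums.items())}


points = [(1383661634, 43), (1383661856, 40), (1383662133, 23), (1383662944, 23)]
width = choose_width([ts for ts, _ in points])
print(width, bucket(points, width))
```

For the four sample points, which span about 22 minutes, this picks one-minute slots; a dataset spanning a week would end up with eight-hour slots.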
Hard to say without knowing the nature of your values; compressing values for display is a matter of what you can afford to discard and what you can't. Some ideas though:
- histogram
- candlestick chart
Is this JSON, with the DateTimes transmitted as text? Why not transmit the date as a long (Int64) and use a method to convert to/from DateTime? Depending on the language, you could use these implementations:
- DateTime to Long in C#
- Date to long using Unix timestamp in Java
That alone would save you a considerable amount of space, since strings are 16 bits per character and the long timestamp would be just 64 bits.
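The same idea sketched in Python, for illustration only (the implementations referred to above are for C# and Java). The helper names to_epoch and from_epoch are made up, and the sketch assumes timestamps are whole seconds in UTC.

```python
from datetime import datetime, timezone


def to_epoch(dt):
    """Whole seconds since 1970-01-01T00:00:00Z for a timezone-aware datetime."""
    return int(dt.timestamp())


def from_epoch(seconds):
    """Timezone-aware UTC datetime for an epoch-seconds value."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


stamp = 1383661634  # first sample value from the question above
moment = from_epoch(stamp)
print(moment.isoformat(), to_epoch(moment) == stamp)
```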