The Problem
I recently started working on a small backend application to track my staking rewards, and I ran into floating-point math issues for the first time in my life.
After some research, two options seem to be preferred in general:
DECIMAL(27,18): Very useful for ETH tokens (which mostly have 18 decimals).
BIGINT: Useful but I think it will run out of space for tokens like Shiba Inu.
The Question(s)
If I store the tokens as DECIMAL, will I have to perform the operations in the database to avoid the floating point math issue?
Would it be a better approach to store everything as an integer and then divide by 10 raised to the token's number of decimal digits, stored separately? (Maybe something like: NUMERIC(72,0)?)
Is there a third option or other problems I'm not taking into account?
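To make the integer option concrete, here is a minimal sketch in Python, assuming non-negative amounts. The helper names to_base_units and to_display are hypothetical and do not come from any library. It uses only integer arithmetic, so no floating-point rounding is involved, and it also shows why a signed 64-bit BIGINT is tight for 18-decimal tokens.

```python
BIGINT_MAX = 2**63 - 1  # largest value of a signed 64-bit integer column


def to_base_units(amount: str, decimals: int) -> int:
    """Parse a non-negative decimal string into integer base units."""
    whole, _, frac = amount.partition(".")
    if len(frac) > decimals:
        raise ValueError("more fractional digits than the token supports")
    # Right-pad the fraction so it always has exactly `decimals` digits.
    frac_units = int(frac.ljust(decimals, "0") or "0")
    return int(whole or "0") * 10**decimals + frac_units


def to_display(raw: int, decimals: int) -> str:
    """Format integer base units back into a decimal string."""
    if decimals == 0:
        return str(raw)
    whole, frac = divmod(raw, 10**decimals)
    return f"{whole}.{frac:0{decimals}d}"


raw = to_base_units("1.5", 18)
print(raw)                    # integer base units
print(to_display(raw, 18))    # back to a decimal string
print(to_base_units("10", 18) > BIGINT_MAX)  # ten tokens already overflow BIGINT
```

With this layout the database only ever stores and sums integers; the division by 10**decimals happens once, at display time.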
Hello, dear reader
I have been thinking about how to store data efficiently since the beginning of my studies and while taking a shower, I came up with the following idea:
For example, you take a picture and convert it into 0s and 1s. Then you take this extremely long number and divide it by, e.g., 10, then again by 10, then again by 10, and so on, until you end up with a small number. The small number and the calculation path are then stored, and anyone who wants to read the data only has to perform the inverse operations to get the result.
My gut feeling tells me the idea is too good to be true, but I would still like to know why it should not work.
Kind regards
Fun theorem: no bijection on the natural numbers can map every number to a smaller one. Proof by contradiction: consider F(F(1)). With the naturals starting at 0, F(1) would have to be 0, and F(F(1)) = F(0) would then have to be smaller than 0.
There are many ways to map numbers one-to-one to other numbers such that many of them map to smaller numbers. These are lossless compression algorithms. Most have the property that repeated application of the algorithm makes the data larger or leaves it unchanged.
In your proposal, to the extent I understand it, you would have to store all the remainders of the division, which would be as large as the original data.
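The point about remainders can be checked with a short Python sketch. The function names shrink and restore are made up for illustration, and the base of 10 follows the example in the question. To make the division invertible, every remainder has to be kept, and there are as many remainders as digits were removed.

```python
def shrink(n, base=10):
    """Divide repeatedly by `base`, recording every remainder."""
    remainders = []
    while n >= base:
        n, r = divmod(n, base)
        remainders.append(r)
    return n, remainders


def restore(small, remainders, base=10):
    """Undo shrink(): multiply back and re-add each remainder."""
    n = small
    for r in reversed(remainders):
        n = n * base + r
    return n


original = 9876543210123
small, remainders = shrink(original)
print(small, remainders)
print(restore(small, remainders) == original)
# The small number plus the remainders hold one digit per original digit.
print(len(remainders) + 1 == len(str(original)))
```

The "calculation path" is therefore just the original number's digits written in a different order, so nothing is saved.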
I am working with MongoDB v3.6.3.
I have seen a similar question that received a good answer. So why am I asking this question?
Because I am working with a different version of MongoDB.
Because I have just stored a decimal number in my DB without registering any serializers, as instructed in the answer to the similar question, and no error was thrown.
My MongoDB schema looks like this:
rating: {
    type: Number,
    required: true
}
So my question is: is there anything wrong with the way I have implemented this, considering that I have already stored a decimal number in my DB? Is it okay to store decimal numbers with the current schema? Or is this a setup for errors in the future because I am missing a step?
Thank you.
The Number type is a floating point numeric representation that cannot accurately represent decimal values. This may be fine if your use case does not require precision for floating point numbers, but would not be suitable if accuracy matters (for example, for fractional currency values).
If you want to store and work with decimal values with accuracy you should instead use the Decimal128 type in Mongoose. This maps to the Decimal 128 (aka NumberDecimal) BSON data type available in MongoDB 3.4+, which can also be manipulated in server-side calculations using the Aggregation Framework.
If your rating field doesn't require exact precision, you could continue to use the Number type. One reason to do so is that the Number type is native to JavaScript, while Decimal128 is not.
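As a rough illustration of the difference, here is a small Python sketch. It uses Python's built-in float and the standard decimal module as stand-ins, which is an assumption for demonstration only: it is not Mongoose or MongoDB code, but binary floating point behaves the same way in JavaScript's Number.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact representation.
float_total = 0.1 + 0.2

# Decimal arithmetic on values given as strings stays exact.
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```

For a rating that is only displayed or averaged, the tiny binary error is usually harmless; for values that must add up exactly, it is not.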
For more details, see A Node.js Perspective on MongoDB 3.4: Decimal Type.
I am attempting to use Entity Framework and have a contact database that has Longitude and Latitude data from Google Maps.
The guide says that this should be stored as float.
I have created my POCO entity which includes Longitude as a float and Latitude as a float.
I have just noticed that in the database, these both come up as real.
There is still a lot of work to be done before I can get any data back from Google, and I am very far from testing. Can anyone tell me whether this is going to be a problem later on?
Nope, that should be fine. Note that you may not get the exact same value back as you received from Google Maps, as that would have been expressed in decimal-formatted text. However, this is effectively a physical quantity, and thus more appropriate as a float/double/real/(whatever binary floating point type you like) than as a decimal. (Decimals are more appropriate for man-made quantities - particularly currency.)
If you look at the documentation for float and real you'll see that real is effectively float(24) - a 32-bit floating binary point type, just like float in C#.
EDIT: As noted in comments, if you want more than the 7-8 significant digits of accuracy provided by float, then you probably want double instead in C#, which would mean float(53) in SQL Server.
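To get a feel for what 7-8 significant digits means for a coordinate, here is a small Python sketch that round-trips a value through 4-byte and 8-byte IEEE 754 storage using the standard struct module. The sample longitude is an arbitrary illustrative value, not data from the question.

```python
import struct


def through_float32(value):
    """Round-trip a value through 4-byte storage (SQL Server real, C# float)."""
    return struct.unpack("f", struct.pack("f", value))[0]


def through_float64(value):
    """Round-trip a value through 8-byte storage (SQL Server float(53), C# double)."""
    return struct.unpack("d", struct.pack("d", value))[0]


longitude = -122.4194155  # hypothetical sample coordinate

stored32 = through_float32(longitude)
stored64 = through_float64(longitude)

print(stored32, abs(stored32 - longitude))  # small but non-zero error
print(stored64 == longitude)                # True: double keeps the value
```

For magnitudes between 64 and 128 degrees, 32-bit values are spaced 2**-17 degrees apart, so the rounding error is at most 2**-18 degrees, which is on the order of a few tens of centimetres on the ground. Whether that matters depends on the application.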
This link:
http://msdn.microsoft.com/en-us/library/aa258876(v=sql.80).aspx
explains that, in SQL Server, real is a synonym for float(24), using 4 bytes of data. In .NET a Single precision floating point number also uses 4 bytes, so these are pretty much equivalent:
http://msdn.microsoft.com/en-us/library/47zceaw7(v=vs.71).aspx
All the examples I have seen of neural networks are for a fixed set of inputs, which works well for images and fixed-length data. How do you deal with variable-length data such as sentences, queries or source code? Is there a way to encode variable-length data into fixed-length inputs and still get the generalization properties of neural networks?
I have been there, and I faced this problem.
ANNs were designed for a fixed feature vector length, and so were many other classifiers such as KNN, SVM, Bayesian, etc.
That is, the input layer must be well defined and must not vary; this is a design problem.
However, some researchers opt to pad with zeros to fill the gap. I personally think this is not a good solution, because those zeros (unreal values) will affect the weights the net converges to. In addition, a real signal might end with zeros.
An ANN is not the only classifier; there are others, such as the random forest, which many researchers regard as one of the strongest. It builds hundreds of decision trees using bootstrapping and bagging, each considering a small random subset of features (typically about the square root of the feature vector size). Each tree reaches its own decision, and the most likely class is then chosen by majority vote.
Another solution is to use dynamic time warping (DTW) or, even better, hidden Markov models (HMMs).
Another solution is interpolation: interpolate (compensate for missing values along the shorter signal) all the shorter signals so they are the same size as the longest signal. Interpolation methods include, but are not limited to, averaging, B-spline, and cubic.
Another solution is to use a feature extraction method to obtain the best (most distinctive) features, this time making them a fixed size. Such methods include PCA, LDA, etc.
Another solution is to use feature selection (normally after feature extraction), an easy way to select the features that give the best accuracy.
That's all for now. If none of those work for you, please contact me.
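The interpolation option mentioned above can be sketched in a few lines of Python. This is a minimal linear-interpolation version, assuming one-dimensional numeric signals. The function name resample_linear is hypothetical, and a B-spline or cubic method would need a numerical library.

```python
def resample_linear(signal, target_len):
    """Stretch or shrink a 1-D signal to target_len samples by linear interpolation."""
    if not signal or target_len < 1:
        raise ValueError("need a non-empty signal and a positive target length")
    if len(signal) == 1 or target_len == 1:
        return [float(signal[0])] * target_len
    # Distance, in source samples, between consecutive output samples.
    step = (len(signal) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


short_signal = [0, 10]
print(resample_linear(short_signal, 5))  # [0.0, 2.5, 5.0, 7.5, 10.0]
```

Every signal is resampled to the length of the longest one, so the classifier always sees the same number of inputs.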
You would usually extract features from the data and feed those to the network. It is not advisable to just take some data and feed it to the net. In practice, pre-processing and choosing the right features will determine your success and the performance of the neural net. Unfortunately, IMHO it takes experience to develop a sense for that, and it's nothing one can learn from a book.
Summing up: "Garbage in, garbage out"
Some problems could be solved by a recurrent neural network.
For example, it is good for calculating parity over a sequence of inputs.
The recurrent neural network for calculating parity would have just one input feature.
The bits could be fed into it over time. Its output is also fed back to the hidden layer.
That allows the network to learn the parity with just two hidden units.
A normal feed-forward two-layer neural network would require 2**sequence_length hidden units to represent the parity. This limitation holds for any architecture with just 2 layers (e.g., SVM).
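As a sketch of the recurrent idea, here is a tiny hand-wired network in Python with one input, two hidden threshold units and the output fed back as state. The weights are set by hand for illustration (an assumption; nothing here is learned), but the same fixed set of weights handles sequences of any length.

```python
def step(x):
    """Threshold activation: fires when the weighted input is positive."""
    return 1 if x > 0 else 0


def parity(bits):
    """Feed bits one at a time; the previous output is fed back as state."""
    state = 0
    for bit in bits:
        h_or = step(bit + state - 0.5)    # hidden unit 1: bit OR state
        h_and = step(bit + state - 1.5)   # hidden unit 2: bit AND state
        state = step(h_or - h_and - 0.5)  # output: OR and not AND, i.e. XOR
    return state


print(parity([1, 0, 1, 1]))  # 1, an odd number of ones
print(parity([1, 1, 0, 0, 0, 0]))  # 0, an even number of ones
```

The sequence length never appears in the network definition, which is what makes the recurrent formulation suitable for variable-length input.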
I guess one way to do it is to add a temporal component to the input (a recurrent neural net) and stream the input to the net a chunk at a time, basically creating the neural network equivalent of a lexer and parser. This would allow the input to be quite large, but it would have the disadvantage that there would not necessarily be a stop symbol to separate different sequences of input from each other (the equivalent of a period in sentences).
To use a neural net on images of different sizes, the images themselves are often cropped and up or down scaled to better fit the input of the network. I know that doesn't really answer your question but perhaps something similar would be possible with other types of input, using some sort of transformation function on the input?
I'm not entirely sure, but I'd say use the maximum number of inputs. For words, for example, let's say no word will be longer than 45 characters (the longest word found in a dictionary, according to Wikipedia); if a shorter word is encountered, set the other inputs to a whitespace character.
Or, with binary data, set them to 0. The only problem with this approach is if an input filled with whitespace characters/zeros/whatever collides with a valid full-length input (not so much a problem with words as it is with numbers).
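A minimal sketch of this padding scheme in Python follows. The function name encode_word and the choice of character codes as input values are assumptions for illustration. The last line shows the collision problem described above: trailing padding is indistinguishable from real trailing spaces.

```python
MAX_LEN = 45  # longest word length assumed in the text above


def encode_word(word, max_len=MAX_LEN, pad=" "):
    """Pad a word with whitespace to a fixed length and return character codes."""
    if len(word) > max_len:
        raise ValueError("word is longer than the fixed input size")
    padded = word.ljust(max_len, pad)
    return [ord(ch) for ch in padded]


vector = encode_word("cat")
print(len(vector))   # always MAX_LEN inputs
print(vector[:4])    # codes for 'c', 'a', 't', then the pad character

# Collision: an input that really ends in spaces looks the same after padding.
print(encode_word("cat") == encode_word("cat  "))  # True
```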
I inherited a project that uses SQL Server 200x, in which a column whose value is always treated as a percentage in the problem domain is stored as a whole-number percentage rather than as its decimal fraction. For example, 70% (0.7, literally) is stored as 70, 100% as 100, etc. Aside from the need to remember to * 0.01 on retrieved values and * 100 before persisting values, it doesn't seem to be a problem in and of itself. It does make my head explode though... so is there a good reason for it that I'm missing? Are there compelling reasons to fix it, given that there is a fair amount of code written to work with the pseudo-percentages?
There are a few cases where greater than 100% occurs, but I don't see why the value wouldn't just be stored as 1.05, for example, in those cases.
EDIT: Head feeling better, and slightly smarter. Thanks for all the insights.
There are actually four good reasons I can think of that you might want to store—and calculate with—whole-number percentage values rather than floating-point equivalents:
Depending on the data types chosen, the integer value may take up less space.
Depending on the data type, the floating-point value may lose precision (remember that not all languages have a data type equivalent to SQL Server's decimal type).
If the value will be input from or output to the user very frequently, it may be more convenient to keep it in a more user-friendly format (decision between convert when you display and convert when you calculate ... but see the next point).
If the principal values are also integers, then
principal * integerPercentage / 100
which uses all integer arithmetic, is usually faster than its floating-point equivalent (likely significantly faster than the same calculation with a type equivalent to T-SQL's decimal type).
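The integer calculation in the last point can be sketched as follows in Python. The function name apply_percentage is hypothetical. Multiplying before dividing keeps as much precision as integer arithmetic allows, and the floor division truncates any remaining fraction, which is a design choice you may want to replace with explicit rounding.

```python
def apply_percentage(principal, percent):
    """Apply a whole-number percentage using only integer arithmetic."""
    # Multiply first, then divide, so nothing is lost before the final step.
    return principal * percent // 100


print(apply_percentage(200, 70))     # 140
print(apply_percentage(200, 105))    # 210, values above 100% work the same way
print(apply_percentage(19999, 70))   # 13999, the fractional 0.3 is truncated
```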
If it's a byte field, then it takes up less room in the db than floating-point numbers, but unless you have millions and millions of records, you'll hardly see a difference.
Since floating-point values can't reliably be compared for equality, an integer may have been used to make the SQL simpler.
For example
(0.3==3*.1)
is usually False.
However
abs( 0.3 - 3*.1 )
is a tiny number (5.55e-17). But it's a pain to have to do everything with (column-SomeValue) BETWEEN -0.0001 AND 0.0001 or ABS(column-SomeValue) < 0.0001. You'd rather do column = SomeValue in your WHERE clause.
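The comparison above can be reproduced in Python, which uses the same IEEE 754 doubles. This is only a sketch of the arithmetic, not SQL; math.isclose plays the role of the ABS(...) < tolerance test, and the whole-number version needs no tolerance at all.

```python
import math

# Fractions stored as binary floating point: equality fails.
fraction_equal = (0.3 == 3 * 0.1)
difference = abs(0.3 - 3 * 0.1)

# The tolerance-based workaround.
tolerant_equal = math.isclose(0.3, 3 * 0.1)

# Whole-number percentages: plain equality works.
percent_equal = (30 == 3 * 10)

print(fraction_equal)   # False
print(difference)       # about 5.55e-17
print(tolerant_equal)   # True
print(percent_equal)    # True
```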
Floating point numbers are prone to rounding errors and, therefore, can act "funny" in comparisons. If you always want to deal with it as fixed decimal, you could either choose a decimal type, say decimal(5,2), or do the convert and store as int thing that your db does. I'd probably go the decimal route, even though the int would take up less space.
A good guess is that anything you do with integers (storing, calculating, stuffing into an edit box for a user, etc.) is marginally easier and more efficient than doing the same with floating-point numbers. And the rounding issues aren't so obvious when you look at the data.
If these are numbers that end users are likely to see and interact with, percentages are easier to understand than decimals.
This is one of those situations where a notation aid can help; in the program, be consistent in using a prefix (Hungarian) or postfix to specify values that are percentages vs. those that are decimal. If you can extend a naming convention to the database fields themselves, so much the better.
And to add to the data storage issue, if you can use integer arithmetic for whatever processing you are doing, the performance is much better than when doing floating-point arithmetic. So storing the percentages as integer values may allow the processing logic to utilize integer arithmetic.
If you're actually using them as a coefficient (or expect users of the database to do this sort of thing in reports), there's a case for storing them as a coefficient - particularly if there's a reason to do calculations involving more than one.
However, if you do this you should be consistent - either all percentages or all coefficients.