Where to fix character encoding issues: before db insertion - sql-server

I'm working on a web project with an MSSQL backend that is not under my control, except for tapping into it.
A lot of the data is littered with garbage characters in place of trademark symbols, registered-trademark symbols, and curly quotes.
The HTML I'm rendering into is set to UTF-8.
Is this encoding problem something that should be taken care of when the data is inserted into the DB?

The problem should be fixed where it goes wrong. If you try to take care of it anywhere else, you don't fix the problem; you only try to recover lost data.
When you handle encoded text, you have to do it correctly all the way. If you encode it or decode it incorrectly at one point in the process, you can't reliably fix it at any other point.
You have to find out where the text gets encoded or decoded incorrectly by examining what's happening to the data, and apply the fix there.
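A quick way to examine the data is to test for the classic mojibake pattern, UTF-8 bytes that were decoded as Windows-1252 somewhere along the way (which is exactly what turns ™ into â„¢ and ® into Â®). A minimal PHP sketch, assuming that specific corruption:

    <?php
    // Diagnostic/repair sketch: undo UTF-8-decoded-as-Windows-1252 mojibake.
    // Re-encoding the garbled string back to CP1252 recovers the original
    // UTF-8 byte sequence; if iconv fails, the text wasn't corrupted this way.
    function repairMojibake(string $garbled): string
    {
        $bytes = @iconv('UTF-8', 'Windows-1252', $garbled);
        if ($bytes !== false && mb_check_encoding($bytes, 'UTF-8')) {
            return $bytes; // the recovered, correctly encoded text
        }
        return $garbled;   // not the suspected corruption; leave untouched
    }

    echo repairMojibake("Acmeâ„¢ Â®"); // prints "Acme™ ®"

If the rows already match this pattern, the corruption happened at or before insertion, which tells you where to apply the real fix.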

Related

Black dot of Death SQL potential issue

I would like to ask about a recent black dot of death that is causing crash issues on messaging apps on Android and iOS devices.
More information on this issue can be found in this video:
https://www.youtube.com/watch?v=0fepqIO57f8
I would like to know whether this issue poses a potential risk to businesses everywhere if such text is inserted into a database without the correct validation check. And how do we perform a validation check for this? There is a lot of hidden text that, I think, normal validation (such as string validation on a text box) cannot detect. If the input gets through, it might crash the database, and even running a query could take forever to load because of this issue.
Well, I don't think you have to worry too much about that black dot. It isn't a virus, as far as I'm aware from my research into it. This is how it works:
The secret to the message is not actually the black dot, but rather the blank space that comes right after it. When the WhatsApp prank message is converted to HTML, it contains repeated instances of the "&rlm;" entity, which is the code for right-to-left text directional formatting.
The right-to-left text format is used for languages such as Arabic and Urdu, while the more common left-to-right format is used for languages such as English. The hidden "&rlm;" code in the black dot prank message instructs WhatsApp to change the direction of the text repeatedly, which requires a lot of processing power.
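As for validation: these are ordinary Unicode bidirectional control characters (U+200F RIGHT-TO-LEFT MARK and friends), so they won't crash a database, and they can be stripped or flagged before insertion. A rough sketch, with an arbitrary threshold:

    <?php
    // Validation sketch: strip or reject Unicode bidi control characters
    // (LRM, RLM, embeddings, overrides, isolates) before the text is stored.
    const BIDI_CONTROLS = '/[\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]/u';

    function sanitizeBidi(string $text, int $limit = 50): string
    {
        $count = preg_match_all(BIDI_CONTROLS, $text);
        if ($count > $limit) {
            // Dozens of direction switches in one message suggest the prank pattern.
            throw new InvalidArgumentException("$count bidi control characters");
        }
        return preg_replace(BIDI_CONTROLS, '', $text);
    }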

How can I monitor a text file for *any* change whatsoever?

I'm trying to record how a file changes over time, down to the smallest details. Reading the changes once per file save wouldn't give sufficient data for my use case; it needs to be per keystroke.
I just tried out inotify, but it only notifies me on file save, not on in-editor modification.
I then realized (I'm quite inexperienced with file-system stuff) that this is because text editors store pending changes in a buffer and only commit the buffer's contents to the file on save (I think, at least).
So, it seems to me that I could:
- read the buffer of a text editor (which seems like it would have to be code specific to each particular editor; not viable)
- force a save to the file on every keypress (which again seems like it would require editor-specific code)
- monitor the user's keypresses, store my own buffer of them, and then, once the file is saved, correlate the keypresses with the diff (this seems way too hard and prone to error)
- read the contents of the file on an interval faster than a person is likely to press any keys (this is hacky and ridiculous; see the sketch below)
I can't think of any other ways. It seems that to properly get the behavior I want I'd need to have editing occur within the terminal, or within a web form. I'd love to hear other ideas though, or a potential editor-agnostic solution to this problem. Thank you for your time.
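For reference, a minimal sketch of that last, polling-based option (placeholder path, 200 ms interval); hacky, as noted, but editor-agnostic:

    <?php
    // Polling sketch: re-read the file on a short interval and report any
    // change. "notes.txt" is a placeholder; 200 ms is comfortably faster
    // than most typing.
    $path = 'notes.txt';
    $previous = file_get_contents($path);
    while (true) {
        usleep(200000);
        $current = file_get_contents($path);
        if ($current !== $previous) {
            printf("[%s] changed: %d -> %d bytes\n",
                date('H:i:s'), strlen($previous), strlen($current));
            $previous = $current;
        }
    }

Note that this still only sees what the editor has flushed to disk, so it records saves at a fine grain rather than true keystrokes.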

CakePHP debug() isn't working but Debugger::dump() is fine

Ever since PHP4 and Cake 1.3 I have been using debug($data); to debug things such as model output in CakePHP.
However, since upgrading to PHP 5.4, I have noticed that debug($data) doesn't always seem to work. For example, today I did a straightforward $data = $this->Model->find('all'); and the output of debug($data); appears to be empty. No error; the HTML output just notes that I called debug, gives the line number, and then shows no debug output.
However, if I run Debugger::dump($data); on the exact same find, it works fine and I see the entire output.
It seems to only happen when $data holds a significant amount of data (say, 100+ records), but I worked with datasets this size prior to PHP 5.4 without any problem. There are no errors, inline or in the Apache/PHP logs, indicating memory issues, and I have debugging set to 3.
Does anyone have any idea why this is? I can obviously start using Debugger::dump($data); easily enough, but it's just a little extra to have to type each time, and I'd like to know why I can't just use debug() anymore.
This can happen with non-UTF-8 encoded data in your DB records (if the rest of your application is UTF-8, that is).
debug() will then just output "nothing". var_dump(), print_r(), and other PHP-internal methods should still print the output, though.
You can usually re-encode the records to UTF-8 using iconv() or similar.
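A sketch of that repair step, assuming the records are Windows-1252/Latin-1 (adjust the source charset to whatever the table actually uses):

    <?php
    // Sketch: normalize suspect DB fields to UTF-8 before handing them to
    // debug(). Assumes the stored text is Windows-1252; adjust as needed.
    function toUtf8(string $value): string
    {
        if (mb_check_encoding($value, 'UTF-8')) {
            return $value; // already valid UTF-8
        }
        return iconv('Windows-1252', 'UTF-8//TRANSLIT', $value);
    }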

Characters for code-readable files

I'm making a program, and I want to be able to store questions and answers in a text file. What I'm doing right now is separating the different question/answer combos with a newline, and separating the questions from the answers by putting the question first, then /// (triple slash), and then the answer. The problem with this is that the user can't enter a triple slash, or slashes at the end of a question, without breaking the program.
So what I'm looking for is some character I can use as a separator that can't be entered by the user in any way, shape, or form. What I'm leaning towards right now is some Unicode symbol like a smiley face, but anything better suited would be appreciated.
As a side question, where should I store this document on Mac OS X? Should it be in Application Support, or part of the app bundle itself?
There are some well-documented, pre-existing text formats you might want to consider. I would look into CSV (comma-separated values), TSV (tab-separated values), JSON, and XML. Most likely XML and JSON are overkill for you, but they solve this problem nicely, and if you already have libraries for dealing with them in your code stack, you can get going more quickly by using those.
But for a DIY approach, I would recommend TSV. It's hard for users to enter the TAB character (ASCII 9) in web forms, and you can usually convert it to a space without losing much information if they do. Then you just have to work out what your record separator will be; on the Mac you will probably want CR (ASCII 13), but NL (ASCII 10) is a fine choice too.
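A short sketch of the DIY TSV approach (helper names are illustrative, and the format logic carries over to any language):

    <?php
    // DIY TSV sketch: one question/answer pair per line, TAB between the
    // two fields. Stray tabs/newlines in user input become spaces, as
    // suggested above, so the separators stay unambiguous.
    function writeQA(string $file, array $pairs): void
    {
        $lines = [];
        foreach ($pairs as [$q, $a]) {
            $lines[] = str_replace(["\t", "\r", "\n"], ' ', $q) . "\t"
                     . str_replace(["\t", "\r", "\n"], ' ', $a);
        }
        file_put_contents($file, implode("\n", $lines) . "\n");
    }

    function readQA(string $file): array
    {
        $pairs = [];
        foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
            $pairs[] = explode("\t", $line, 2);
        }
        return $pairs;
    }

    writeQA('qa.txt', [['What is 2+2?', '4'], ['Capital of France?', 'Paris']]);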

How to implement lossless URL shortening

First, a bit of context:
I'm trying to implement URL shortening on my own server (in C, if that matters). The aim is to avoid long URLs while being able to restore a context from a shortened URL.
Currently I have an implementation that creates a session on the server, identified by a certain ID. This works, but consumes memory on the server (which is not desirable, since it's an embedded server with limited resources, and the main purpose of the device isn't serving web pages but doing other cool stuff).
Another option would be to use cookies or HTML5 webstorage to store the session information in the client.
But what I'm searching for is a way to store the URL parameters in a single parameter attached to the URL, and to be able to reconstruct the original parameters from that one.
My first thought was to use Base64 encoding to pack all the parameters into one, but that produces an even larger URL.
Currently, I'm thinking of compressing the URL parameters (using some compression algorithm like zip, bz2, ...), Base64-encoding the compressed binary blob, and using that as the context. When I get the parameter back, I can Base64-decode it, decompress the result, and have the original parameters in hand.
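That pipeline is easy to sketch. Using raw DEFLATE rather than a zip container keeps the header overhead down, and a URL-safe Base64 alphabet avoids percent-encoding; shown here in PHP for brevity, though the question targets C:

    <?php
    // Sketch of the compress-then-encode scheme: raw-DEFLATE the query
    // string, then URL-safe Base64 so it fits in a single parameter.
    function shortenParams(string $query): string
    {
        $packed = gzdeflate($query, 9); // raw DEFLATE: no zip/zlib headers
        return rtrim(strtr(base64_encode($packed), '+/', '-_'), '=');
    }

    function restoreParams(string $token): string
    {
        return gzinflate(base64_decode(strtr($token, '-_', '+/')));
    }

    $query = 'mode=edit&user=42&lang=en&theme=dark&page=settings';
    assert(restoreParams(shortenParams($query)) === $query);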
The question is: is there any other possibility that I'm overlooking that I could use to losslessly compress a large list of URL parameters into a single smaller one?
Update:
After the comments from home, I realized I had overlooked that compression itself adds overhead to the compressed data (the headers that zipping adds to the content, for example), which can make the compressed output even larger than the original.
So (as home states in his comments), I'm starting to think that compressing the whole list of URL parameters is only really useful if the parameters exceed a certain length, because otherwise I could end up with an even larger URL than before.
You can always roll your own compression. If you apply some Huffman coding, the result will usually be smaller (though Base64-encoding it afterwards will grow it again, so the net effect may not be optimal).
I'm using a custom compression strategy on an embedded project I work on: I first apply lzjb (a Lempel-Ziv derivative; follow the link for the source code, a really tight implementation from OpenSolaris) and then Huffman-code the compressed result.
The lzjb algorithm doesn't perform too well on very short inputs, though (~16 bytes), in which case I leave the data uncompressed.
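A sketch of that same guard, applied to the compress-and-encode idea from the question: a one-byte flag records whether compression actually paid off, so decoding stays lossless either way:

    <?php
    // Sketch: compress only when it helps; short inputs, where DEFLATE's
    // overhead outweighs any savings, are stored as-is behind a flag byte.
    function packParams(string $query): string
    {
        $packed = gzdeflate($query, 9);
        return strlen($packed) < strlen($query)
            ? 'C' . $packed  // 'C': stored compressed
            : 'U' . $query;  // 'U': short input left uncompressed
    }

    function unpackParams(string $blob): string
    {
        $body = substr($blob, 1);
        return $blob[0] === 'C' ? gzinflate($body) : $body;
    }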
