On-the-fly decryption of data in Symfony database queries

Consider this situation: I have a project that involves critical data. The server side is managed by Laravel/Symfony to retrieve, process and store this data.
The data is sent to the server through the API, encrypted, and finally stored in the database.
My question is: if the data is encrypted in my database, can I still retrieve it using a WHERE clause? I'm thinking of something like on-the-fly decryption, but I've found nothing about that term on Google.
What is the best way to encrypt data in a database to improve data protection?

The trick is to store a searchable index alongside the encrypted values (a "blind index"), but this does limit what you can search for: essentially exact matches rather than arbitrary LIKE queries. You can improve things a bit by normalising the data beforehand, for example by forcing it to lower case before encryption, to make matches more likely.
However, this is all somewhat academic, because rather than reinventing a potentially complicated and difficult wheel, the best way to do this is to use a library that does it for you, and the library you need is CipherSweet by Scott Arciszewski.
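To make the idea concrete, here is a minimal hand-rolled sketch of a blind index (CipherSweet does this properly, with key separation and index truncation, so prefer it in real code). The `users` table, its columns, the keys and the `$pdo` connection are all assumptions for illustration:

```php
<?php
// Minimal blind-index sketch. Assumptions: a `users` table with an
// `email_encrypted` column and an indexed `email_index` column; $pdo is an
// existing PDO connection; $encKey is a 32-byte encryption key and $indexKey
// a separate key used only for indexing.

function blindIndex(string $value, string $indexKey): string
{
    // Normalise before hashing so "Foo@Example.com" matches "foo@example.com"
    return hash_hmac('sha256', mb_strtolower(trim($value)), $indexKey);
}

function encryptValue(string $value, string $encKey): string
{
    $iv  = random_bytes(12);
    $tag = '';
    $ciphertext = openssl_encrypt($value, 'aes-256-gcm', $encKey, OPENSSL_RAW_DATA, $iv, $tag);
    return base64_encode($iv . $tag . $ciphertext);
}

// Store the ciphertext together with its blind index
$stmt = $pdo->prepare('INSERT INTO users (email_encrypted, email_index) VALUES (:ct, :idx)');
$stmt->execute([
    ':ct'  => encryptValue($email, $encKey),
    ':idx' => blindIndex($email, $indexKey),
]);

// The WHERE clause matches on the blind index, never on the plaintext
$stmt = $pdo->prepare('SELECT email_encrypted FROM users WHERE email_index = :idx');
$stmt->execute([':idx' => blindIndex($searchedEmail, $indexKey)]);
```

Decryption then happens in PHP after the row comes back, which is what "on the fly" amounts to in practice.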

Related

Message storage duplication for messaging systems

In many sub-system designs for messaging applications (Twitter, Facebook, etc.) I notice duplication in where user message history is stored. On the one hand they use a tokenizing indexer like Elasticsearch or Solr, which is good for search. On the other hand they still use some sort of DB for history. Why duplicate? Why can't the same instance of ES/Solr/EarlyBird be used for history? It is in fact able to store it.
The usual problem is the following: you want to search, and ideally you also want to be able to index the data in a different manner later (e.g. wipe the index and try a new, better analyzer that you forgot to include initially). Separating the data source and the index from each other makes the system less coupled, and you are not afraid of losing data that only lives in Elasticsearch/Solr.
I am usually strongly against calling Elasticsearch/Solr a database, since in fact it is not one. For example, neither of them supports transactions, which makes your life harder if you want to update multiple documents following standard relational logic.
Last, but not least, one of the hardest operations in Elasticsearch/Solr is retrieving stored values, since they are not well optimised for it, especially if you want to return 10k documents at once. In this case a separate data source also helps: you return only the matched document ids from Elasticsearch/Solr, then retrieve the needed content from the data source and return it to the user.
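A minimal sketch of that two-step retrieval, assuming a `messages` index in a local Elasticsearch, a `messages` table reachable through an existing `$pdo` connection, and a placeholder search query:

```php
<?php
// Ask Elasticsearch only for matching ids, then fetch the rows from the
// primary datastore. Index name, table name and the local ES URL are
// placeholders; $pdo is assumed to be an existing PDO connection.

$query = json_encode([
    '_source' => false,   // ids are enough, don't return stored fields
    'size'    => 50,
    'query'   => ['match' => ['body' => 'database encryption']],
]);

$ch = curl_init('http://localhost:9200/messages/_search');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => $query,
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
]);
$response = curl_exec($ch);
curl_close($ch);

$body = json_decode((string) $response, true);
$ids  = array_map(fn ($hit) => $hit['_id'], $body['hits']['hits'] ?? []);

if ($ids !== []) {
    // The authoritative content comes from the relational store
    $placeholders = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $pdo->prepare("SELECT id, body FROM messages WHERE id IN ($placeholders)");
    $stmt->execute($ids);
    $messages = $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```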
The summary is simple: Elasticsearch/Solr should be thought of as a search engine, not as data storage.
True, ES is NOT a database per se and never will be. But nobody says you cannot use it as such, and many people actually do. It really depends on your specific use case(s), and in the end it is all a question of the trade-offs you are ready to make to support your specific needs. As with pretty much any technology, there is no one-size-fits-all approach, and with ES (and the like) it's no different.
A primary source of truth is not necessarily a relational DBMS, and these systems are not necessarily "duplicating" the data in the sense that you mean: the source of truth can be anything that holds a copy of your data and allows you to rebuild your ES indexes if something goes wrong. I've seen many, many different "sources of truth". It could simply be:
your raw flat files containing your historical logs or business data
Kafka topics that you can replay anytime easily
a snapshot that you take from ES on a regular basis
a relational DB
you name it...
The point is that if something goes wrong for any reason (and that happens), you want to be able to recreate your ES indexes, be it from a real DB, from backups or from raw data. You should see that as a safety net. Even if all you have is a MySQL DB, you usually have a backup of it, so you're already "duplicating" the data in some way.
One thing you do need to think about when architecting your system, though, is that you might not need the entirety of your data in ES. Since ES is a search and analytics engine, you should only store in it what is necessary to support your search and analytics needs, and you should be able to recreate that information at any time. In the end, ES is just a subsystem of your whole architecture, just like your DB, your messaging queue or your web server.
Also worth reading: Using Elasticsearch as primary source for part of my DB

Which is the best database to use to index the internet?

If you had to make provision for 80 million records (one for each page on the internet) and store the relationships between those records (which is 80 billion to the nth power), which database would be the best for this?
I started this project thinking we would only map a portion of the internet, but unfortunately it has gone far beyond the limits of MySQL. I need a better way to keep track of this data. The frontend is PHP, but I suppose the backend can be anything, as long as it can handle that amount of data.
I won't say there is one holy database for your needs; maybe it would be better for your company to split your database into logical parts to handle the amount of data in a better way. Maybe you could also move some data out into the file system, since you won't need everything in your database all the time.
If you scan the interwebs, you probably want to save the HTML, CSS or any other big data you crawl into your filesystem, while you keep the connections and everything meta-related in your database (as sketched below), but I think you mentioned that already.
The best advice I want to give here is to make sure the structure of your database fits your processes as well as possible before you think about switching databases. If you really do need to switch (because MySQL won't give you more performance), there are MongoDB and/or WebScaleSQL; WebScaleSQL seems to be used by Facebook to handle the amount of data they have.
A big question is whether you can improve performance just by improving your hardware. You should check that too, AFTER you have checked your structure and processes!
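A rough sketch of that filesystem/database split, assuming a hypothetical `pages` table, a local `crawl-store/` directory and an existing `$pdo` connection (all names are made up for illustration):

```php
<?php
// Keep the bulky crawled HTML on disk; keep only metadata and link
// relationships in the database. The `pages` table, its columns and the
// storage directory are assumptions, and $pdo is an existing PDO connection.

function storeCrawledPage(PDO $pdo, string $url, string $html, array $outLinks): void
{
    // Content-addressed filename keeps the DB row small and de-duplicates bodies
    $hash = sha1($html);
    $path = __DIR__ . '/crawl-store/' . substr($hash, 0, 2) . '/' . $hash . '.html';

    if (!is_dir(dirname($path))) {
        mkdir(dirname($path), 0755, true);
    }
    file_put_contents($path, $html);

    // Only the metadata goes into the database
    $stmt = $pdo->prepare(
        'INSERT INTO pages (url, body_path, outlink_count, crawled_at)
         VALUES (:url, :path, :links, NOW())'
    );
    $stmt->execute([
        ':url'   => $url,
        ':path'  => $path,
        ':links' => count($outLinks),
    ]);
}
```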

Best practice to implement cache

I have to implement caching for a function that processes strings of varying lengths (from a couple of bytes up to a few kilobytes). My intention is to use a database for this: basically one big table with input and output columns and an index on the input column. The cache would try to find the string in the input column and return the output column, probably one of the simplest database applications imaginable.
What database would be best for this application? A fully-featured database like MySQL, or a simple one like SQLite3? Or is there an even better way that doesn't use a database at all?
Document stores are made for this. I highly recommend Redis for this specific problem. It is a "key-value" store, meaning it has no relations and no schemas; all it does is map keys to values, which sounds like just what you need.
Alternatives are MongoDB and CouchDB. Look around and see what suits you best. My recommendation stays with Redis, though.
Reading: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Joe has some good recommendations for data stores that are commonly used for caching. I would say Redis, Couchbase (not CouchDB though; it goes to disk fairly frequently and is not that fast in my experience) and plain Memcached.
MongoDB can be used for caching, but I don't think it's as tuned for pure caching as something like Redis is. Mongo can hit the disk quite a bit.
Also, I highly recommend using a time to live (TTL) as your main caching strategy: give a value some time to expire and then re-populate it later. Proactively finding every instance of some data in a cache and refreshing it is a very hard problem.
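A minimal sketch of that TTL-based cache-aside pattern with the phpredis extension; the key prefix, the one-hour TTL and the `processString()` function being cached are placeholders:

```php
<?php
// Cache-aside with a TTL using phpredis. `processString()`, the key prefix
// and the 3600-second TTL are placeholders for illustration.

function cachedProcess(Redis $redis, string $input): string
{
    // Inputs can be several kilobytes, so hash them to get a fixed-size key
    $key = 'strcache:' . hash('sha256', $input);

    $cached = $redis->get($key);
    if ($cached !== false) {
        return $cached;              // cache hit
    }

    $output = processString($input); // the expensive function being cached

    // Let the entry expire on its own instead of invalidating it proactively
    $redis->setex($key, 3600, $output);

    return $output;
}

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
echo cachedProcess($redis, 'some input string');
```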

Database design to manage data from Serial Port?

I'm writing code to capture serial port data (e.g. from a micrometer) and do the following:
real-time display and graphing
delete/modify/replace existing data by focus and re-measurement
save the data somewhere for additional statistical analysis (e.g. in Excel), so output as .csv is also an option
Because each measurement may capture hundreds to thousands of data points, I'm not sure how to go about designing my database: shall I create a new row for each reading received, or shall I push all the readings into an array and store them as one very long comma-separated string in the database? For such an application, do I need SQL Server 2008, or would SQL Server 2008 Express be sufficient? What are their pros/cons in terms of performance?
Is it possible to create such an application where the client won't need to install SQL Server?
To answer your last question first: look at SQLite, because it is free and open source, and its public-domain license imposes essentially nothing on you. Plus there is nothing for your users to install, since SQLite can be compiled into your code.
To answer your primary question, I would encourage using a database only if it really makes sense. Are you going to be running queries against the data, or are you just looking to store and retrieve entries by some sort of identifier? I would discourage storing it as one very long comma-separated string; instead, look into the BLOB type. With BLOBs you can put binary data into the database and easily get it back out, and I believe it is much more efficient that way. I would only suggest the TEXT type if you need to do some sort of text query on it, for example full-text searches.
If you decide that the giant-string approach makes sense for your application, you would probably be better off just using text files. Relational databases provide benefits only when your data is structured. If you go with a database, it should be because you are storing discrete values in rows and columns.
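Here is a small sketch of the BLOB approach with SQLite through PDO: one row per measurement run, with the readings packed into a binary blob. The table layout, the `g` (little-endian float) pack format and the database file name are assumptions:

```php
<?php
// One row per measurement run, readings packed into a BLOB.
// Table/column names, the pack format and the db file are assumptions.

$pdo = new PDO('sqlite:' . __DIR__ . '/measurements.db');
$pdo->exec(
    'CREATE TABLE IF NOT EXISTS runs (
         id          INTEGER PRIMARY KEY AUTOINCREMENT,
         captured_at TEXT NOT NULL,
         samples     BLOB NOT NULL
     )'
);

// Store one run: pack the readings as little-endian 32-bit floats
$readings = [1.012, 1.015, 1.011, 1.014];   // values read from the serial port
$stmt = $pdo->prepare('INSERT INTO runs (captured_at, samples) VALUES (?, ?)');
$stmt->bindValue(1, date('c'));
$stmt->bindValue(2, pack('g*', ...$readings), PDO::PARAM_LOB);
$stmt->execute();

// Read the latest run back, e.g. for analysis or CSV export
$row    = $pdo->query('SELECT samples FROM runs ORDER BY id DESC LIMIT 1')->fetch();
$values = array_values(unpack('g*', $row['samples']));
```

Storing one row per reading instead would make SQL-level filtering and aggregation possible, which is exactly the trade-off discussed above.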
An out-of-process database strikes me as a mismatch for this project. Have you considered SQLite or some other compile-in or link-in solution? You'll probably get faster throughput for those large measurement vectors with something that runs in process.
What do you want (or what do you think you want) a relational database for?
Based on the simple requirements you expressed here (and assuming there are no others), I would suggest flat files (either plain text with one sample per line, or binary). But I really don't think I have enough information to make a decision on your behalf. Some key questions are:
Do you need to share the data with others, who uses the data, where, over the network? (this will dictate the format you save the data in and a few other things)
How much data do you need to store, how fast is it measured, and over what period of time do you need to measure and store it? (This will tell you whether to use compression or some other space-saving scheme, and will also give you pointers to resource requirements.)
How easily retrievable does it need to be? (this will give pointers as to what data management you need: how to name files, where to store them...)
What sort of analysis do you need to do on it? (do you want to analyse several files together, just one file, data from a given day, data for a given port... ?)
... and many more such questions.
Depending on the answers to these questions, you might be happy with an old 386, or you might need a modern 8-core machine.

How would you approach this data processing task? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I have a file containing 250 million website URLs, each with an IP address, page title, country name, server banner (e.g. "Apache"), response time (in ms), number of images and so on. At the moment, these records are in a 25 GB flat file.
I'm interested in generating various statistics from this file, such as:
number of IP addresses represented per country
average response time per country
number of images v response time
etc etc.
My question is: how would you achieve this type and scale of processing, and what platform and tools would you use (in a reasonable time)?
I am open to all suggestions, from MS SQL on Windows to Ruby on Solaris :-) Bonus points for DRY (don't repeat yourself); I'd prefer not to write a new program each time a different cut is required.
Any comments on what works and what should be avoided would be greatly appreciated.
Step 1: get the data into a DBMS that can handle the volume of data. Index appropriately.
Step 2: use SQL queries to determine the values of interest.
You'll still need to write a new query for each separate question you want answered. However, I think that is unavoidable. It should save you replicating the rest of the work.
Edited:
Note that although you can probably do a simple upload into a single table, you might well get better performance out of the queries if you normalize the data after loading it into that single table. This isn't completely trivial, but it will likely reduce the volume of data. Making sure you have a good procedure (which will probably not be a stored procedure) for normalizing the data will help.
Load the data into a table in a SQL Server (or any other mainstream db) database, and then write queries to generate the statistics you need. You would not need any tools other than the database itself and whatever UI is used to interact with the data (e.g. SQL Server Management Studio for SQL Server, TOAD or SqlDeveloper for Oracle, etc.).
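To give a feel for that second step, here are the statistics from the question written as plain aggregate queries, run through PDO. The `pages` table, its column names and the connection details are assumptions about how the flat file gets loaded:

```php
<?php
// Each statistic becomes a short aggregate query once the file is in a table.
// Table name, column names and connection details are assumptions.

$pdo = new PDO('mysql:host=localhost;dbname=crawl', 'user', 'secret');

// Number of distinct IP addresses represented per country
$ipsPerCountry = $pdo->query(
    'SELECT country, COUNT(DISTINCT ip_address) AS ip_count
       FROM pages
      GROUP BY country'
)->fetchAll(PDO::FETCH_ASSOC);

// Average response time per country
$avgResponse = $pdo->query(
    'SELECT country, AVG(response_time_ms) AS avg_ms
       FROM pages
      GROUP BY country'
)->fetchAll(PDO::FETCH_ASSOC);

// Number of images versus response time, grouped by image count
$imagesVsTime = $pdo->query(
    'SELECT image_count, AVG(response_time_ms) AS avg_ms
       FROM pages
      GROUP BY image_count
      ORDER BY image_count'
)->fetchAll(PDO::FETCH_ASSOC);
```

An index on country (and on image_count) makes these GROUP BY queries considerably cheaper, which ties in with the indexing advice elsewhere in this thread.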
If you happen to use Windows, take a look at Log Parser. It is available as a standalone download and is also included as part of the IIS Resource Kit.
Log Parser can read your logs and upload them to the database.
Database Considerations:
For your database server you will want something fast (Microsoft SQL Server, IBM DB2, PostgreSQL or Oracle). MySQL might be useful too, but I have no experience with large databases on it.
You will want all the memory you can afford. If you will be using the database regularly, I'd say at least 4 GB. It can be done with less, but you WILL notice a big difference in performance.
Also, go for multi-core/multi-CPU servers if you can afford it and, again, if you will be using this database regularly.
Another recommendation is to analyze the kind of queries you will be doing and plan the indexes accordingly. Remember: every index you create will require additional storage space.
Of course, turn off indexing, or even drop the indexes, before massive data-load operations; that will make the load a lot faster. Re-index or re-create the indexes after the data load.
Now, if this database will be an ongoing operation (i.e. not just something to investigate/analyze and then discard), you may want to design a database schema with catalog and detail tables. This is called database normalization, and the exact amount of normalization you will want depends on the usage pattern (data-load operations versus query operations). An experienced DBA is a must if this database will be used on an ongoing basis and has performance requirements.
P.S.
I will take the risk to include something obvious here but...
I think you may be interested in a log analyzer. These are programs that generate statistics from web server log files (some can also analyze FTP, SFTP and mail server log files).
Web log analyzers generate reports with the statistics. Usually the reports are generated as HTML files and include graphics. There is a fair amount of variety in depth of analysis and options; some are very customizable and some are not. You will find both commercial products and open source ones.
For the amount of data you will be managing, double-check each candidate product and take a closer look at its speed and its ability to handle that volume.
One thing to keep in mind when you're importing the data is to try to create indexes that will allow you to do the kinds of queries you want to do. Think about which fields you will be querying on and what those queries might look like. That should help you decide what indexing you will need.
25 GB of flat file: I don't think writing your own component to read this file would be a good idea.
I would suggest you go for a SQL import and load all the data into SQL Server. I agree that it may take ages to get this data into SQL Server, but once it is there you can do anything you want with it.
I would also hope that once you have this data in the DB, from then on you only receive deltas of information, not another 25 GB flat file.
You haven't said how the data in your flat file is organised. The RDBMS suggestions are sensible, but they presume that your flat file is formatted in some delimited way and that a DB import is a relatively simple task. If that is not the case, then you first have the daunting task of decompiling the data cleanly into a set of fields on which you can do your analysis.
I'm going to presume that your data is not a nice CSV or TXT file, since you haven't said either way and nobody else has answered this part of the possible problem.
If the data has a regular structure, even without nice clean field delimiters, you may be able to turn an ETL tool such as Informatica onto the job. Since you are a techie and this is a one-off job, you should definitely consider writing some code of your own that does some regex comparisons to extract the parts you want and spits out a file you can then load into a database. Either way you are going to have to invest significant effort in parsing and cleansing your data, so don't think of this as an easy task.
If you do write your own code, then I would suggest you choose a compiled language and make sure you process the data a single row at a time (or in a way that buffers the reads into manageable chunks).
Either way you are going to have a pretty big job making sure that the results of any process you apply to the data have been consistently executed; you don't want IP addresses turning up as decimal numbers in your calculations. On data of that scale it can be hard to detect a fault like that.
Once you have parsed it, I think an RDBMS is the right choice to store and analyse your data.
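As an illustration of the row-at-a-time, regex-and-running-counts approach, here is a sketch in PHP (the question's front-end language; a compiled language or Perl, as suggested elsewhere in this thread, would be faster). The pipe-delimited line layout and file name are made up:

```php
<?php
// Stream the flat file one line at a time, extract fields with a regex and
// keep running aggregates. The assumed pipe-delimited layout is:
// url|ip|title|country|banner|response_ms|image_count

$countryIps   = [];   // country => set of distinct IPs (as array keys)
$countryTimes = [];   // country => [total_ms, row_count]

$fh = fopen('urls.dat', 'rb');
while (($line = fgets($fh)) !== false) {
    if (!preg_match('/^[^|]+\|([\d.]+)\|[^|]*\|([^|]+)\|[^|]*\|(\d+)\|\d+$/', trim($line), $m)) {
        continue;   // in a real run, count or log malformed rows
    }
    [, $ip, $country, $ms] = $m;

    $countryIps[$country][$ip] = true;   // distinct-IP sets can get large; a DB scales better
    $countryTimes[$country][0] = ($countryTimes[$country][0] ?? 0) + (int) $ms;
    $countryTimes[$country][1] = ($countryTimes[$country][1] ?? 0) + 1;
}
fclose($fh);

foreach ($countryTimes as $country => [$totalMs, $rows]) {
    printf(
        "%s: %d distinct IPs, %.1f ms average response\n",
        $country,
        count($countryIps[$country]),
        $totalMs / $rows
    );
}
```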
Is this a one-time thing, or will you be processing things on a daily or weekly basis? Either way, check out vmarquez's answer; I've heard great things about Log Parser. Also check out http://awstats.sourceforge.net/, which is a full-fledged web stats application.
SQL Server Analysis Services is designed for doing exactly that type of data analysis. The learning curve is a bit steep, but once you set up your schema you will be able to do any kind of cross-cutting queries that you want very quickly.
If you have more than one computer at your disposal, this is a perfect job for MapReduce.
Sounds like a job for Perl to me. Just keep running counts of the stats you want, and use a regex to parse each line. It would probably take less than 10 minutes to parse a file of that size; my computer reads through a 2 GB file (13 million lines) in about 45 seconds with Perl.
