I have to compare certain stored images (saved in .dat format) in my MATLAB Fingerprint Recognition System. Now, I can perform 1:1 matching, however, it isn't feasible for a relatively larger database of, say, 300 employees.
Any suggestions for what I could do?
(a) Something like, adding the address of where the complete database is stored
(b) Functions to employ best possible matching algorithm
c = [';#3EA4:aei7]ced.CFHE;4\T>*Y>,dL0,HOQQMJLJE9PX[[Q.ZF.\JTCA1dd'
'<A;FB:;bfj8^df//DGIF<5]UF+ZH-eM>-IorRPNMPIE-Y\\R8[I8]SUDW2e+'
'=4BGC;<cgk9_e00DEOJG=6^VG,[I.fN?5jpsSQPNQPF.Z,]S9`S9cTWVX:+,'
':5CHD<=4hlh`f11EFPKHA7&WH-\J/gOC?kqtTRRORQJ8--^TB+T=dWYWY;,_'
';6D3E=>7imiag2IFOQLID8''XI.]K0"PD#l32UZhP//P988_WC,U>+Z^Y\<2`'
'<82BF>?8jnjbhLJGPRMJE9/YJ/`L1#QMC$;;V[iv09QE99,XD.YB,[_\]=3a'
'>9;CG?#9kokc2MKHQSOKF:0ZL0aM2$RNG%AAW\jw9E.FEE-_G8aG.d`]_W5+'
'?:CDH#A:lpld3NLIRTPLG=1[M1bN3%SOH4BBX]kx:J9LLL8`H9bJ/+d_dX6,'
'#;DEIAB;mqmePOMJSUQMJ>2\N2cO4&TPP#HCY^lyDKEMMN9+I#+S8,+deY7^'
'8#EFJBC<4rnfQPNPTVRNKB3]O3dP5''UQQCIDZ_mzEPFNNOE,RA,T9/,++\8_'
'9A2G3CD=544gRQPQUWUOLE4^P4"Q6(VRRIJE[`n{KQKOOPK-SE.W:F/,,]Z+'
':BDH4DE>655hSRQRVXVPMF5_Q5#R>)eSSJKF\ao0L.L-WUL.VF8XCH001_[,'
';3EI<EO?766iTSRSWYWQNG6$R6''S?*fTTlLQ]bp1M/P.XVP8[H9]DIDA=`\]'
'?4D3=FP#877jUTSTXZXROK7%S7(TF+gUUmMR^cq:N9Q8YZQ9_I>cIJEB>d_^'
'#5E#>GQA98b3VUTUY*YSPL8&T>)UI,hVhnNS_dr;PE.9Z[RCaR?+JTFC?e`+'
'79FA?HRB:9c4WVUVZ+ZWQM=,WG*VJ-"gi4OT`es<QL9E[\TD+SA,SWUVW+d,'
'8:3B#JSX;:dVXWVW[,[XRN>-XH+bK.#hj#PUvftDRMEF,]UH,UB.TYVWX,e\'
'9;ECAKTY<;eWYXWX\:)YSOE.YI,cL/$ikCqV1guE/PFL-^XI-YG/WZWXY1+]'
':AFDBLUZ=<fXZYXY,;*ZTPF/ZJ-dM0%j#Jrt2hxH0QKM8,YJ.ZI8[^YY\2,,'
';B3ECMV[>jgY[ZYZ-<7[XQG0[K.eN1&"$K2u:iyO9.PN9-_K8aJ9\_]\]82['
'?CEFDNW\?khZ\[Z[==8\YRH1\M/!O2''#%m31Bw0PE/QXE8+R9bS;da^]_93\'
'#2FGEOX]ali[]\[\>>9(ZSL2]N0"P3($&n;2Cx1QN9--L9,SA+T<+d__`:4,'
'A3GHFPY^bmj\^]\]??:)[TM3^O1%Q4)%''oA:D0:0OE.8ME-TE,XB,+`da;5['
'643IGQZ_cnk]_^]^##;5\UN4_P2&R6*&(3B;E1<1PN99NL8WF.^C/,a+bY6,'
'7:F3HR[`dol^`_^_AA<6]VO5`Q3''S>+'');CBF:=:QOEEOO9_G8aH6/d,cZ[Y'
'8;G4IS\aep4_a`_-BD=7''XP6aR4(T?,(5#DCHCC;RPFLPPD`H9bJ70+0d\\Z'
'9BH>JT^bf45`ba`.CE#8(YQ7#S5)UD-)?AEDIDDD/QKMVQJ+S?cSDF,1e]a,'
':C3?K4_cg5[acbaADFA92ZR8$T6*VE.*#JFEJEEE0.NNWTK,U#+TEG0?+_bX'
';2D#L9`dh6\bdcbBEGD:3[S=)U7+cK/+CKGFLIKI9/OWZUL-VA,WIHB#,`cY'];
i = double(c(:)-32);
j = cumsum(diff([0; i])<=0) + 1;
S = sparse(i,j,1)';
spy(S)
Related
Why should you use an external database (e.g. Mysql) when working with (large/growing) Data?
I know of some projects which use SQL databases, but I can't see the advantage you get from doing this in contrast to just storing everything in .mat files (as for example stated here: http://www.matlabtips.com/how-to-store-large-datasets/)
Where is this necessary? Where does this approach simplify things?
Regarding growing data, let's take an example where, on a production line, you would measure different sources with different sensors:
Experiment.Date = '2014-07-18 # 07h28';
Experiment.SensorType = 'A';
Experiment.SensorSerial = 'SENSOR-00012-A';
Experiment.SourceType = 'C';
Experiment.SourceSerial = 'SOURCE-00143-C';
Experiment.SensorPositions = 180 * linspace(0, 359, 360) / pi;
Experiment.SensorResponse = rand(1, 360);
And store these experiments on disk using .mat files:
experiment.2013-01-02.0001.mat
experiment.2013-01-02.0002.mat
experiment.2013-01-02.0003.mat
experiment.2013-01-03.0004.mat
...
experiment.2014-07-18.0001.mat
experiment.2014-07-18.0002.mat
So now, if I ask you:
"what is the typical response of sensors of type B when the source is of type E" ?
Or:
"Which sensor has best performances to measure sources of type C ? Sensors A or sensors B ?"
"How performances of these sensors degrade with time ?"
"Did modification we made last july to production line improved lifetime of sensors A ?"
Loading in memory all these .mat files, to check if date, sensor and source type are correct and then calculate min,mean,max responses, and other statistics is gonna be very painful and time consuming + writing custom code for file selection!
Building a data-base on top of these .mat files can be very useful to "SELECT/JOIN/..." elements of interest and then perform further statistic or operations.
NB: The database does not replace .mat files (i.e. the information), it just a practical and standard way to quickly select some of them upon conditions without having to load and parse everything.
After working with neo4j and now coming to the point of considering to make my own entity manager (object manager) to work with the fetched data in the application, i wonder about neo4j's output format.
When i run a query it's always returned as tabular data. Why is this??
Sure tables keep a big place in data and processing, but it seems so strange that a graph database can only output in this format.
Now when i want to create an object graph in my application i would have to hydrate all the objects and this is not really good for performance and doesn't leverage true graph performace.
Consider MATCH (A)-->(B) RETURN A, B when there is one A and three B's, it would return:
A B
1 1
1 2
1 3
That's the same A passed down 3 times over the database connection, while i only need it once and i know this before the data is fetched.
Something like this seems great http://nigelsmall.com/geoff
a load2neo is nice, a load-from-neo would also be nice! either in the geoff format or any other formats out there https://gephi.org/users/supported-graph-formats/
Each language could then implement it's own functions to create the objects directly.
To clarify:
Relations between nodes are lost in tabular data
Redundant (non-optimal) format for graphs
Edges (relations) and vertices (nodes) are usually not in the same table. (makes queries more complex?)
Another consideration (which might deserve it's own post), what's a good way to model relations in an object graph? As objects? or as data/method inside the node objects?
#Kikohs
Q: What do you mean by "Each language could then implement it's own functions to create the objects directly."?
A: With an (partial) graph provided by the database (as result of a query) a language as PHP could provide a factory method (in C preferably) to construct the object graph (this is usually an expensive operation). But only if the object graph is well defined in a standard format (because this function should be simple and universal).
Q: Do you want to export the full graph or just the result of a query?
A: The result of a query. However a query like MATCH (n) OPTIONAL MATCH (n)-[r]-() RETURN n, r should return the full graph.
Q: you want to dump to the disk the subgraph created from the result of a query ?
A: No, existing interfaces like REST are prefered to get the query result.
Q: do you want to create the subgraph which comes from a query in memory and then request it in another language ?
A: no i want the result of the query in another format then tabular (examples mentioned)
Q: You make a query which only returns the name of a node, in this case, would you like to get the full node associated or just the name ? Same for the edges.
A: Nodes don't have names. They have properties, labels and relations. I would like enough information to retrieve A) The node ID, it's labels, it's properties and B) the relation to other nodes which are in the same result.
Note that the first part of the question is not a concrete "how-to" question, rather "why is this not possible?" (or if it is, i like to be proven wrong on this one). The second is a real "how-to" question, namely "how to model relations". The two questions have in common that they both try to find the answer to "how to get graph data efficiently in PHP."
#Michael Hunger
You have a point when you say that not all result data can be expressed as an object graph. It reasonable to say that an alternative output format to a table would only be complementary to the table format and not replacing it.
I understand from your answer that the natural (rawish) output format from the database is the result format with duplicates in it ("streams the data out as it comes"). I that case i understand that it's now left to an alternative program (of the dev stack) to do the mapping. So my conclusion on neo4j implementing something like this:
Pro's - not having to do this in every implementation language (of the application)
Con's - 1) no application specific mapping is possible, 2) no performance gain if implementation language is fast
"Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results."
I don't understand this point entirely, are you saying that these formats are no able to hold deduplicated results (in certain cases)?? So infact that there is no possible textual format with which a graph can be described without duplication??
"There is also the questions on what you want to include in your output?"
I was under the assumption that the cypher language was powerful enough to specify this in the query. And so the output format would have whatever the database can provide as result.
"You could just return the paths that you get, which are unique paths through the graph in themselves".
Useful suggestion, i'll play around with this idea :)
"The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it".
Does the enriching process fetch additional data from the database or is the data already contained in the initial result?
There is more to it.
First of all as you said tabular results from queries are really commonplace and needed to integrate with other systems and databases.
Secondly oftentimes you don't actually return raw graph data from your queries, but aggregated, projected, sliced, extracted information out of your graph. So the relationships to the original graph data are already lost in most of the results of queries I see being used.
The only time that people need / use the raw graph data is when to export subgraph-data from the database as a query result.
The problem of doing that as a de-duplicated graph is that the db has to fetch all the result data data in memory first to deduplicate, extract the needed relationships etc.
Normally it just streams the data out as it comes and uses little memory with that.
Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results (which are returned as paths with potential duplicate nodes and relationships).
There is also the questions on what you want to include in your output? Just the nodes and rels returned? Or additionally all the other rels between the nodes that you return? Or all the rels of the returned nodes (but then you also have to include the end-nodes of those relationships).
You could just return the paths that you get, which are unique paths through the graph in themselves:
MATCH p = (n)-[r]-(m)
WHERE ...
RETURN p
Another way to address this problem in Neo4j is to use sensible aggregations.
E.g. what you can do is to use collect to aggregate data per node (i.e. kind of subgraphs)
MATCH (n)-[r]-(m)
WHERE ...
RETURN n, collect([r,type(r),m])
or use the new literal map syntax (Neo4j 2.0)
MATCH (n)-[r]-(m)
WHERE ...
RETURN {node: n, neighbours: collect({ rel: r, type: type(r), node: m})}
The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it and then outputting it as cypher create statement(s).
A similar approach can be used for other output formats too if you need it. But so far there hasn't been the need.
If you really need this functionality it makes sense to write a server-extension that uses cypher for query specification, but doesn't allow return statements. Instead you would always use RETURN *, aggregate the data into an in-memory structure (SubGraph in the org.neo4j.cypher packages). And then render it as a suitable format (e.g. JSON or one of those listed above).
These could be a starting points for that:
https://github.com/jexp/cypher-rs
https://github.com/jexp/cypher_websocket_endpoint
https://github.com/neo4j-contrib/rabbithole/blob/master/src/main/java/org/neo4j/community/console/SubGraph.java#L123
There are also other efforts, like GraphJSON from GraphAlchemist: https://github.com/GraphAlchemist/GraphJSON
And the d3 json format is also pretty useful. We use it in the neo4j console (console.neo4j.org) to return the graph visualization data that is then consumed by d3 directly.
I've been working with neo4j for a while now and I can tell you that if you are concerned about memory and performances you should drop cypher at all, and use indexes and the other graph-traversal methods instead (e.g. retrieve all the relationships of a certain type from or to a start node, and then iterate over the found nodes).
As the documentation says, Cypher is not intended for in-app usage, but more as a administration tool. Furthermore, in production-scale environments, it is VERY easy to crash the server by running the wrong query.
In second place, there is no mention in the docs of an API method to retrieve the output as a graph-like structure. You will have to process the output of the query and build it.
That said, in the example you give you say that there is only one A and that you know it before the data is fetched, so you don't need to do:
MATCH (A)-->(B) RETURN A, B
but just
MATCH (A)-->(B) RETURN B
(you don't need to receive A three times because you already know these are the nodes connected with A)
or better (if you need info about the relationships) something like
MATCH (A)-[r]->(B) RETURN r
I read in "hadoop design pattern" book, "HBase supports batch queries, so it would be ideal to buffer all the queries we want to execute up to some predetermined size. This constant depends on how many records you can comfortably store in memory before querying HBase."
I tried to search some examples online but couldn't find any, can someone show me the example using java map reduce?
Thanks.
Dan
Is this what you want? You can save HBase Get object in a list and submit the list at the same time. It's a little better than invoke table.get(get) multiple times.
Configuration conf = HBaseConfiguration.create();
pool = new HTablePool(conf, 5);
HTableInterface table = pool.getTable('table');
List<Get> gets = new ArrayList<Get>();
table.get(gets);
I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more then a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using somethings a key-value store like MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mentions of using such a database for time series data. ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one date point per day (first, last, closed to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
it seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance...anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single database.)
How to do your three queries:
retrieve all the data for object XYZ
between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one date
point per day (first, last, closed to
time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for
a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However if your application requires the speed it might be worth looking it.
There is an open source timeseries database under active development (.NET only for now) that I wrote. It can store massive amounts (terrabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for the stock ticks storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
file.UniqueIndexes = true; // enforces index uniqueness
file.InitializeNewFile(); // create file and write header
file.AppendData(data); // append data (stream of ArraySegment<>)
}
// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStrut>) BinaryFile.Open("data.bts", false))
{
// Enumerate one item at a time maxitum 10 items starting at 2011-1-1
// (can also get one segment at a time with StreamSegments)
foreach (var val in file.Stream(new UtcDateTime(2011,1,1), maxItemCount = 10)
Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1 minute bar format for the symbol in question in a Space delimited file which has roughly 80,000 1 minute bar readings. The code to load and parse from disk was under 1ms. The code to calculate a 100 minute SMA for every period in the file was 530ms. I can pull any slice I want from the SMA sequence once calculated in under 1ms. I am just learning F# so there are probably ways to optimize. Note this was after multiple test runs so it was already in the windows Cache but even when loaded from disk it never adds more than 15ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with \n delimiter and it generally takes less than 0.5ms to load and parse when in the windows file cache. Simple iteration across the full time series data to return the set of records inside a date range in a sub 3ms operation with a full year of 1 minute bars. I also keep the daily bars in a separate file which loads even faster because of the lower data volumes.
I use the .net4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series and with a couple gig's of RAM dedicated to cache I get nearly a 100% cache hit rate so my access to any pre-computed indicator set for any symbol generally runs under 1ms.
Pulling any slice of data I want from the indicator is typically less than 1ms so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1 minute bar in less than 20ms.
// Parse a \n delimited file into RAM then
// then split each line on space to into a
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
System.IO.File.ReadAllLines(fname)
|> Array.map (fun line -> line.Split [|' '|])
// Based on a two dimensional array
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
[| for aLine in tarr do
//printfn "aLine=%A" aLine
let closep = float(aLine.[5])
yield closep
|]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.
Currently, I've got images (max. 6MB) stored as BLOB in a InnoDB table.
As the size of the data is growing, the nightly backup is growing slower and slower hindering normal performance.
So, the binary data needs to go to the file system. (pointers to the files will be kept in the DB.)
The data has a tree like relation:
- main site
- user_0
- album_0
- album_1
- album_n
- user_1
- user_n
etc...
Now I want the data to be distributed evenly trough the directory structure. How should I accomplish this?
I guess I could try MD5('userId, albumId, imageId'); and slice up the resulting string to get my directory path:
/var/imageStorage/f/347e/013b/c042/51cf/985f7ad0daa987d.jpeg
This would allow me to map the first character to a server and evenly distribute the directory structure over multiple servers.
This would however not keep images organised per user, likely spreading the images for 1 album over multiple servers.
My question is:
What is the best way to store the image data in the file system in a balanced way, while keeping user/album data together ?
Am I thinking in the right direction? or is this the wrong way of doing things altogether?
Update:
I will go for the md5(user_id) string slicing for the split up on highest level.
And then put all user data in that same bucket. This will ensure an even distribution of data while keeping user data stored close together.
/var
- imageStorage
- f/347e/013b
- f347e013bc04251cf985f7ad0daa987d
- 0
- album1_10
- picture_1.jpeg
- 1
- album1_1
- picture_2.jpeg
- picture_3.jpeg
- album1_11
- picture_n.jpeg
- n
- album1_n
I think I will use albumId splitted up from behind (I like that idea!) as to keep the number of albums per directory smaller (although it won't be necessary for most users).
Thanks!
Just split your userid from behind. e.g.
UserID = 6435624
Path = /images/24/56/6435624
As for the backup you could use MySQL Replication and backup the slave
database to avoid problems (e.g. locks) while backuping.
one thing about distributing the filenames into different directories, if you consider splitting your md5 filenames into different subdirectories (which is generally a good idea), I would suggest keeping the complete hash as filename and duplicate the first few chars as directory names. This way you will make it easier to identify files e.g. when you have to move directories.
e.g.
abcdefgh.jpg -> a/ab/abc/abcdefgh.jpg
if your filenames are not evenly distributed (not a hash), try to choose a splitting method that gets an even distribution, e.g. the last characters if it is an incrementing user-id
I'm using this strategy given a unique picture ID
reverse the string
zerofill it with leading zero if there's an odd number of digits
chunk the string into two-digits substrings
build the path as below
17 >> 71 >> /71.jpg
163 >> 0361 >> /03/61.jpg
6978 >> 8796 >> /87/96.jpg
1687941 >> 01497861 >> /01/49/78/61.jpg
This method ensures that each folder contains up to 100 pictures and 100 sub-folders and the load is evenly distributed between the left-most folders.
Moreover, you just need the ID of the picture to reach the file, no need to read picture table containing other metadata.
User data are not stored close together indeed and the ID-Path relation is predictable, it depends on your needs.