Efficient file comparison of two large text files - database

In our use case, we get large snapshot text files (tsv, csv etc.) from our customer (size around 30GB) with millions of records. The data looks like this:
ItemId (unique), Title, Description, Price etc.
shoe-id1, "title1", "desc1", 10
book-id-2, "title2", "desc2", 5
Whenever we get a snapshot from a customer, we need to compute a "delta":
Inserted - records that are only present in the latest file and not the previous one,
Updated - same id but a different value in any of the other columns,
Deleted - records that are only present in the previous file and not the latest one.
(The data may be out of order in subsequent files and is not really sorted on any column).
We need to be able to run this multiple times a day for different customers.
We currently store all our data from snapshot file 1 in SQL Server (12 shards partitioned by customerId, containing a billion rows in all) and compute diffs using multiple queries when snapshot file 2 is received. This is proving to be very inefficient (it takes hours, and deletes are particularly tricky). I was wondering if there are any faster solutions out there. I am open to any technology (e.g. Hadoop, a NoSQL database). The key is speed (minutes, preferably).

Normally, the fastest way to tell if an id appears in a dataset is by hashing, so I would make a hash which uses the id as the key and an MD5 checksum (or CRC) of the remaining columns as the value stored at that key. That should relieve the memory pressure if your data has many columns. Why do I think that? Because you say you have 30GB of data for millions of records, so I deduce each record must be of the order of kilobytes - i.e. quite wide.
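As a rough illustration of building such an id-to-checksum hash while streaming a snapshot file, here is a Python sketch (my illustration, not part of the original answer; it assumes a CSV with the id in the first column):
import csv
import hashlib

def build_checksum_map(path):
    # Map each id (assumed to be the first column) to an MD5 of the remaining
    # columns, so a wide record collapses to a 16-byte digest per id.
    checksums = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            item_id, rest = row[0], row[1:]
            checksums[item_id] = hashlib.md5("\x1f".join(rest).encode("utf-8")).digest()
    return checksums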
So, I can synthesise a hash of 13M old values in Perl and a hash of 15M new values and then find the added, changed and removed as below.
#!/usr/bin/perl
use strict;
use warnings;

# Set $verbose=1 for copious output
my $verbose = 0;

my $million = 1000000;
my $nOld    = 13 * $million;
my $nNew    = 15 * $million;

my %oldHash;
my %newHash;
my $key;
my $cksum;
my $i;
my $found;

print "Populating oldHash with $nOld entries\n";
for ($i = 1; $i <= $nOld; $i++) {
    $key   = $i - 1;
    $cksum = int(rand(2));
    $oldHash{$key} = $cksum;
}

print "Populating newHash with $nNew entries\n";
$key = $million;
for ($i = 1; $i <= $nNew; $i++) {
    $cksum = 1;
    $newHash{$key} = $cksum;
    $key++;
}

print "Part 1: Finding new ids (present in newHash, not present in oldHash) ...\n";
$found = 0;
for $key (keys %newHash) {
    if (!defined($oldHash{$key})) {
        print "New id: $key, cksum=$newHash{$key}\n" if $verbose;
        $found++;
    }
}
print "Total new: $found\n";

print "Part 2: Finding changed ids (present in both but cksum different) ...\n";
$found = 0;
for $key (keys %oldHash) {
    if (defined($newHash{$key}) && ($oldHash{$key} != $newHash{$key})) {
        print "Changed id: $key, old cksum=$oldHash{$key}, new cksum=$newHash{$key}\n" if $verbose;
        $found++;
    }
}
print "Total changed: $found\n";

print "Part 3: Finding deleted ids (present in oldHash, but not present in newHash) ...\n";
$found = 0;
for $key (keys %oldHash) {
    if (!defined($newHash{$key})) {
        print "Deleted id: $key, cksum=$oldHash{$key}\n" if $verbose;
        $found++;
    }
}
print "Total deleted: $found\n";
That takes 53 seconds to run on my iMac.
./hashes
Populating oldHash with 13000000 entries
Populating newHash with 15000000 entries
Part 1: Finding new ids (present in newHash, not present in oldHash) ...
Total new: 3000000
Part 2: Finding changed ids (present in both but cksum different) ...
Total changed: 6000913
Part 3: Finding deleted ids (present in oldHash, but not present in newHash) ...
Total deleted: 1000000
For the purposes of testing, I made the keys in oldHash run from 0..12,999,999 and the keys in newHash run from 1,000,000..15,999,999, so I can tell easily whether it worked: the new keys should be 13,000,000..15,999,999 and the deleted keys should be 0..999,999. I also made the old checksums a random 0 or 1 and the new checksums all 1, so that about 50% of the overlapping ids should appear changed.
Having done it in a relatively simple way, I can now see that you only need the checksums to find the altered ids, so you could do parts 1 and 3 without checksums to save memory. You could also do part 2 one element at a time as you load the data, so you wouldn't need to load all the old and all the new ids into memory up front. Instead, you would load the smaller of the two datasets and then check for changes one id at a time as you read the other set, which is less demanding of memory.
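A sketch of that leaner, single-pass variant in Python (again my illustration, not the original Perl; it assumes CSV snapshots with the id in the first column and, for simplicity, loads the old snapshot's map rather than choosing the smaller of the two):
import csv
import hashlib

def row_checksum(columns):
    # MD5 of the non-id columns, as in the id-to-checksum idea above.
    return hashlib.md5("\x1f".join(columns).encode("utf-8")).digest()

def diff_snapshots(old_path, new_path):
    # Only the old id -> checksum map is held in memory; the new file is streamed once.
    old = {}
    with open(old_path, newline="") as f:
        for row in csv.reader(f):
            old[row[0]] = row_checksum(row[1:])
    inserted, updated = [], []
    with open(new_path, newline="") as f:
        for row in csv.reader(f):
            item_id, cksum = row[0], row_checksum(row[1:])
            old_cksum = old.pop(item_id, None)
            if old_cksum is None:
                inserted.append(item_id)
            elif old_cksum != cksum:
                updated.append(item_id)
    deleted = list(old)  # ids never seen in the new file
    return inserted, updated, deleted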
Finally, if the approach works, it could readily be re-done in C++ for example, to speed it up further and reduce further the memory demands.

Related

DynamoDB query row number

Can I somehow get the index of a query response in DynamoDB?
[hashKey exists, sortKey exists]
query { KeyCondExp = "hashKey = smthin1", FilterExp = "nonPrimeKey = smthin2" }
I need the index of the row according to the sortKey for that selected document.
When a DynamoDB Query request returns an item - in your example chosen by a specific filter - it will return the full item, including the sort key. If that is what you call "the index of row according to sortKey", then you are done.
If, however, by "index" you mean the numeric index - i.e., if the item is the 100th sort key in this partition (hash key), you want to return the number 100 - well, that you can't do. DynamoDB keeps rows inside a partition sorted by the sort key, but not numbered. You can insert an item in the middle of a million-row partition, and it will be inserted in the right place but DynamoDB won't bother to renumber the million-row list just to maintain numeric indexes.
But there is something else you should know. In the query you described, you are using a FilterExpression to return only specific rows out of the entire partition. With such a request, Amazon will charge you for reading the entire partition, not just the specific rows returned after the filter. If you're charged for reading the entire partition, you might as well just read it all, without a filter, and then you can actually count the rows and get the numeric index of the match if that's what you want. Reading the entire partition will cause you more work at the client (and more network traffic), but will not increase your DynamoDB RCU bill.
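If you do go down that road, a rough boto3 sketch (the table name, key names and values are placeholders taken from the pseudocode in the question) could page through the partition in sort-key order and compute the position client-side:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

def numeric_index(hash_value, attr_name, attr_value):
    # Read the whole partition in sort-key order and return the 0-based position
    # of the first item whose attr_name equals attr_value, or None if not found.
    index, kwargs = 0, {}
    while True:
        page = table.query(KeyConditionExpression=Key("hashKey").eq(hash_value), **kwargs)
        for item in page["Items"]:
            if item.get(attr_name) == attr_value:
                return index
            index += 1
        if "LastEvaluatedKey" not in page:
            return None
        kwargs = {"ExclusiveStartKey": page["LastEvaluatedKey"]}

print(numeric_index("smthin1", "nonPrimeKey", "smthin2"))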

MarkLogic - How to know size of database, size of Index, Total indexs

We are using MarkLogic 9.0.8.2.
We have set up a MarkLogic cluster, ingested around 18M XML documents, and created a few indexes (Fields, PathRange and so on).
Now, while setting up another environment with the same configuration, indexes and number of records, I am not able to understand why the total size on the database status page is different from the previous environment.
So I started comparing the database status pages of both clusters, where I can see the size per forest/replica forest and so on.
In this case, I would like to know the size of each:
Database
Index
I would also like to know the total number of indexes in a given database (instead of expanding each through the admin interface).
An option within the Admin interface or through XQuery will do.
MarkLogic does not break down the index sizes separately from the Database size. One reason for this is because the data is stored together with the Universal Index.
You could approximate the size of the other indexes by creating them one at a time, and checking the size before and after the reindexer runs and the deleted fragments are merged out. We usually don't find a lot of benefit in trying to determine the exact index sizes, since the benefits they provide typically outweigh the cost of storage.
It's hard to say exactly why there is a size discrepancy. One common cause would be the number of deleted fragments in each database. Deleted fragments are pieces of data that have been marked for deletion (usually due to an update, delete or other change). Deleted fragments will continue to consume database space until they are merged out. This happens by default, or it can be manually started at the forest or database level.
The database size, and configured indexes can be determined through the Admin UI, Query Console (QConsole) or via the MarkLogic REST Management API (RMA) endpoints. QConsole supports a number of languages, but server side Javascript and XQuery are the most common. RMA can return results in XML or JSON.
Database Size:
REST: http://[host-name]:8002/manage/v2/databases/[database-name]?view=status
QConsole: Sum the disk size elements for the stands from xdmp.forestStatus(javascript) or xdmp:forest-status(XQuery) for all the forests in the database.
Configured Indexes:
REST: http://[host-name]:8002/manage/v2/databases/[database-name]?view=package
QConsole: Use xdmp.getConfiguration (JavaScript) or xdmp:get-configuration (XQuery) in conjunction with the xdmp.databaseGet[index type] or xdmp:database-get-[index type] functions. For example, the following XQuery sums the stand sizes for every forest in each database:
xquery version "1.0-ml";
declare namespace forest = "http://marklogic.com/xdmp/status/forest";

for $db-id in xdmp:databases()
let $db-name := xdmp:database-name($db-id)
let $db-size :=
  fn:sum(
    for $f-id in xdmp:database-forests($db-id)
    let $f-status := xdmp:forest-status($f-id)
    let $f-size :=
      fn:sum(
        for $stand in $f-status/forest:stands/forest:stand
        let $stand-size := $stand/forest:disk-size/fn:data(.)
        return $stand-size
      )
    return $f-size
  )
order by $db-size descending
return $db-name || " = " || $db-size
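If you would rather script this from outside the server, a minimal sketch against the RMA status endpoint above might look like the following (host, credentials and database name are placeholders; the Manage app server listens on port 8002 with digest authentication by default):
import requests
from requests.auth import HTTPDigestAuth

resp = requests.get(
    "http://localhost:8002/manage/v2/databases/Documents",  # placeholder host and database
    params={"view": "status", "format": "json"},
    auth=HTTPDigestAuth("admin", "admin"),                   # placeholder credentials
)
resp.raise_for_status()
status = resp.json()
# Inspect the returned structure for the size figures; the exact nesting
# varies by MarkLogic version, so print the top-level keys first.
print(sorted(status.keys()))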

Neo4j add huge number of relationships to already existing nodes

I have labels Person and Company with millions of nodes.
I am trying to create a relationship:
(person)-[:WORKS_AT]->(company) based on a unique company number property that exists in both labels.
I am trying to do that with the following query:
MATCH (company:Company), (person:Person)
WHERE company.companyNumber=person.comp_number
CREATE (person)-[:WORKS_AT]->(company)
but the query takes too long to execute and eventually fails.
I have indexes on companyNumber and comp_number.
So, my question is: is there a way to create the relationships in batches (for example 50000, then another 50000, etc.)?
Use a temporary label to mark things as completed, and add a limit step before creating the relationship. When you are all done, just remove the label from everyone.
MATCH (company:Company)
WITH company
MATCH (p:Person {comp_number: company.companyNumber} )
WHERE NOT p:Processed
WITH company, p
LIMIT 50000
MERGE (p) - [:WORKS_AT] -> (company)
SET p:Processed
RETURN COUNT(*) AS processed
That will return the number (usually 50000) of rows that were processed; when it returns less than 50000 (or whatever you set the limit to), you are all done. Run this guy then:
MATCH (n:Processed)
WITH n LIMIT 50000
REMOVE n:Processed
RETURN COUNT(*) AS processed
until you get a result less than 50000. You can probably turn all of these numbers up to 100000 or maybe more, depending on your db setup.
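If you would rather not re-run the statement by hand, you can drive the batching from a small script. A sketch using the official Neo4j Python driver (the URI, credentials and batch size of 50000 are assumptions):
from neo4j import GraphDatabase

BATCH = 50000
LINK_BATCH = """
MATCH (company:Company)
WITH company
MATCH (p:Person {comp_number: company.companyNumber})
WHERE NOT p:Processed
WITH company, p
LIMIT 50000
MERGE (p)-[:WORKS_AT]->(company)
SET p:Processed
RETURN COUNT(*) AS processed
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
with driver.session() as session:
    while True:
        # Each run links at most one batch and marks those people as Processed.
        processed = session.run(LINK_BATCH).single()["processed"]
        print(f"linked {processed}")
        if processed < BATCH:
            break
driver.close()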

Sphinx. How fast are Random results?

Does anybody have experience with getting random results from an index with 100,000,000+ (100 million) records?
The goal is getting 30 results ordered randomly, at least 100 times per second.
My records are actually in MySQL, but selecting with ORDER BY RAND() from huge tables is one of the easiest ways to kill MySQL.
Sphinxsearch or something else - what do you recommend?
I don't have that big an index to try.
barry#server:~/modules/sphinx-2.0.1-beta/api# time php test.php -i gi_stemmed --sortby #random --select id
Query '' retrieved 20 of 3067775 matches in 0.081 sec.
Query stats:
Matches:
<SNIP>
real 0m0.100s
user 0m0.010s
sys 0m0.010s
This is on a reasonably powerful dedicated server - that is serving live queries (~20qps)
But to be honest, if you don't need filtering (i.e. each query having a 'WHERE' clause), you can just set up a system that returns random results - you can do this with MySQL alone. Just using ORDER BY RAND() is evil (and Sphinx, while better at sorting than MySQL, is still doing basically the same thing).
How 'sparse' is your data? If most of your ids are used, you can just do something like
$ids = array();
$max = getOne("SELECT MAX(id) FROM table");
foreach (range(1, 30) as $idx) {
    $ids[] = rand(1, $max);
}
$query = "SELECT * FROM table WHERE id IN (".implode(',', $ids).")";
(You may want to use shuffle() in PHP on the results afterwards, as you are likely to get the rows back from MySQL in id order.)
That will be much more efficient. If you do have holes, perhaps look up 33 rows; sometimes you will get more than you need (just discard the extras), but you should still get 30 most of the time.
(Of course you could cache '$max' somewhere, so it doesn't have to be looked up every time.)
Otherwise you could set up a dedicated 'shuffled' list. Basically a FIFO buffer: have one thread filling it with random results (perhaps using the above system, fetching 3000 ids at a time) and then the consumers just read random results directly out of this queue.
A FIFO is not particularly easy to implement with MySQL, so maybe use a different system - maybe Redis, or even just memcache.
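With redis-py, for example, that shuffled list boils down to a producer pushing random ids and consumers popping them (a sketch; the connection details, key name and id range are placeholders):
import random
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection
QUEUE = "random_ids"                          # placeholder key name

def refill(max_id, batch=3000):
    # Producer: keep the FIFO topped up with random candidate ids.
    ids = [random.randint(1, max_id) for _ in range(batch)]
    r.rpush(QUEUE, *ids)

def take(n=30):
    # Consumer: pop n ids off the front of the queue.
    out = []
    while len(out) < n:
        item = r.lpop(QUEUE)
        if item is None:  # queue drained; wait for the producer to refill
            break
        out.append(int(item))
    return out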

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more than a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using something like a key-value store such as MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mention of using such a database for time series data. Ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one data point per day (first, last, closest to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
It seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance. Anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? Or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single collection.)
How to do your three queries:
retrieve all the data for object XYZ
between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one data
point per day (first, last, closest to
time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for
a particular timestamp
db.data.find({ts : timestamp})
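For reference, roughly the same setup from Python with pymongo (the database name, the compound index and the concrete dates are my assumptions; the collection and field names follow the shell examples above):
from datetime import datetime
from pymongo import ASCENDING, DESCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]  # placeholder connection and db name
data = db["data"]
data.create_index([("object", ASCENDING), ("ts", ASCENDING)])

# All values for object XYZ between t1 and t2.
t1, t2 = datetime(2011, 1, 3), datetime(2011, 1, 4)
rows = data.find({"object": "XYZ", "ts": {"$gt": t1, "$lt": t2}}).sort("ts", ASCENDING)

# First and last reading of the day for object XYZ.
first = data.find({"object": "XYZ", "ts": {"$gt": t1}}).sort("ts", ASCENDING).limit(1)
last = data.find({"object": "XYZ", "ts": {"$lt": t2}}).sort("ts", DESCENDING).limit(1)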
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast, but I imagine very expensive. However, if your application requires the speed, it might be worth looking into.
There is an open source time series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
    file.UniqueIndexes = true;  // enforces index uniqueness
    file.InitializeNewFile();   // create file and write header
    file.AppendData(data);      // append data (stream of ArraySegment<>)
}

// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
    // Enumerate one item at a time, a maximum of 10 items, starting at 2011-1-1
    // (can also get one segment at a time with StreamSegments)
    foreach (var val in file.Stream(new UtcDateTime(2011, 1, 1), maxItemCount: 10))
        Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1 minute bar format for the symbol in question in a space-delimited file which has roughly 80,000 1 minute bar readings. The code to load and parse from disk was under 1ms. The code to calculate a 100 minute SMA for every period in the file was 530ms. I can pull any slice I want from the SMA sequence, once calculated, in under 1ms. I am just learning F#, so there are probably ways to optimize. Note this was after multiple test runs, so it was already in the Windows cache, but even when loaded from disk it never adds more than 15ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with a \n delimiter, and it generally takes less than 0.5ms to load and parse when in the Windows file cache. Simple iteration across the full time series to return the set of records inside a date range is a sub-3ms operation with a full year of 1 minute bars. I also keep the daily bars in a separate file, which loads even faster because of the lower data volumes.
I use the .NET 4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series, and with a couple of gigs of RAM dedicated to cache I get nearly a 100% cache hit rate, so my access to any pre-computed indicator set for any symbol generally runs under 1ms.
Pulling any slice of data I want from the indicator is typically less than 1ms, so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1 minute bars in less than 20ms.
// Parse a newline-delimited file into RAM, then
// split each line on spaces into an array of
// tokens. Return the entire file as string[][].
let readSpaceDelimFile fname =
    System.IO.File.ReadAllLines(fname)
    |> Array.map (fun line -> line.Split [|' '|])

// From the two-dimensional array, pull out the
// single column holding the bar close, convert
// every value in every row to a float and
// return the array of floats.
let GetArrClose (tarr : string[][]) =
    [| for aLine in tarr do
           //printfn "aLine=%A" aLine
           let closep = float (aLine.[5])
           yield closep
    |]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.
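For flavour, here is a minimal sketch of a compressed HDF5 time series dataset from Python (h5py here, rather than the C++ tooling described above; the dataset layout, ticker and field names are invented for the example):
import h5py
import numpy as np

# One compound row per bar: epoch-seconds timestamp plus OHLC prices.
bar = np.dtype([("ts", "i8"), ("open", "f8"), ("high", "f8"),
                ("low", "f8"), ("close", "f8")])

with h5py.File("bars.h5", "w") as f:
    ds = f.create_dataset("AAPL/1min", shape=(0,), maxshape=(None,),
                          dtype=bar, chunks=True, compression="gzip")
    rows = np.array([(1294041600, 94.38, 94.38, 93.66, 93.66)], dtype=bar)
    ds.resize(ds.shape[0] + len(rows), axis=0)  # append by growing the dataset
    ds[-len(rows):] = rows

with h5py.File("bars.h5", "r") as f:
    arr = f["AAPL/1min"][...]                   # read the (small) dataset into numpy
    print(arr[arr["ts"] >= 1294041600])         # time-range selection done in numpy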
