Exporting Large Data Sets - export

I work for a mid-sized online fashion company, and we have a very large report that we export with a FileMaker Server-side script: approx. 75 fields / 15k records.
The problem I have is, as you can imagine, that this takes an awfully long time! We have a lot of calculation fields, and the script performs finds, replaces, etc. before the export in order to build and update the report each day with new data. I should also state that there is a lot of related data from other tables in the export.
I'm acutely aware that this won't be a fast export; however, if anyone has any tips on how to minimise export time I would be most grateful. At present we are exporting as CSV, so, for example, would XLSX be quicker?
Any advice at all on how to speed this up would be most welcome!
Thank you -
S

A bit of a generic question, probably better suited for a FileMaker users forum.
15k is not that many records even for FileMaker.
CSV will be quicker than an Excel export, but I do not think this is your problem.
Try to break your export into stages and log the time spent on every stage. This will show you which stage you need to optimise.

It's likely that the reason for the slow export is the related fields and any unstored calculations you are exporting. You can get around this by using cached fields (plain, non-related, non-calculated fields) that get updated just before the export with a series of Replace Field Contents script steps performed on the server.
Separate the script into three scripts: Find Records to Export, Update Caches, and Export Report. They would look something like this:
Find Records to Export
Go to Layout [ "LayoutWithFieldsToReplace" ]
Enter Find Mode [ Pause: Off ]
Set Field [ TABLE::field ; // find criterion for this field ]
Set Field [ ...
Set Error Capture [ On ]
Perform Find []
Update Caches
Perform Script [ "Find Records to Export" ]
Replace Field Contents [ TABLE::cache_1 ; RELATED::field ]
Replace Field Contents [ TABLE::cache_2 ; TABLE::unstored_calc ]
Replace Field Contents [ ...
Export Report
Perform Script on Server [ "Update Caches" ; Wait for completion: On ]
Perform Script [ "Find Records to Export" ]
Export Records [ ...
I'm not sure if FileMaker has changed this requirement, but I think the fields whose contents you replace need to be on the layout during the replace. So make sure the layout you navigate to has them; in the script above, these would be TABLE::cache_1 and TABLE::cache_2.

Related

ImportJSON for Google Sheets Can't Handle File Without Properties?

I'm trying to pull historical pricing data from CoinGecko's free API to use in a Google Sheet. It presents OHLC numbers in the following format:
[
[1589155200000,0.05129,0.05129,0.047632,0.047632],
[1589500800000,0.047784,0.052329,0.047784,0.052329],
[1589846400000,0.049656,0.053302,0.049656,0.053302],
...
]
As you can see, this isn't typical JSON format since there are no property names. So that everyone is on the same page, for this data the properties of each subarray in order are Time (in UNIX epoch format), Open Price, High Price, Low Price, and Close Price.
I'm using the ImportJSON code found here to try and pull this data, but it does not work. Instead of putting each subarray into a separate row, split into columns for the 5 properties, it prints everything out into a single cell like so:
1589155200000,0.05129,0.05129,0.047632,0.047632,1589500800000,0.047784,0.052329,0.047784,0.052329,1589846400000,0.049656,0.053302,0.049656,0.053302,...
This is incredibly unhelpful. I'm trying to avoid using a paid API add-on since I really don't want to have to pay the frankly exorbitant fees they want to charge, but I can't figure out a way to get ImportJSON to play nicely with this data. Does anyone know of a solution?
It's simpler: your data is in an array structure. I put
[
[1589155200000,0.05129,0.05129,0.047632,0.047632],
[1589500800000,0.047784,0.052329,0.047784,0.052329],
[1589846400000,0.049656,0.053302,0.049656,0.053302]
]
in A1, and I get the individual values in this simpler way:
function myArray() {
  var sheet = SpreadsheetApp.getActiveSheet();
  // Evaluate the bracketed text stored in A1 into a real 2-D JavaScript array.
  var result = eval(sheet.getRange('A1').getValue());
  // Write one row per sub-array, starting at row 2, column 1.
  sheet.getRange(2, 1, result.length, result[0].length).setValues(result);
}
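If you would rather not paste the raw API response into A1 at all, a small custom function can fetch and parse the array directly, since the response is valid JSON even without property names. This is only a sketch: the CoinGecko endpoint path, the coin id, and the query parameters below are assumptions you should verify against their API documentation.
/**
 * Returns OHLC rows from CoinGecko as a 2-D array, so it can be used as a
 * custom function in a cell, e.g. =COINGECKO_OHLC("bitcoin", 30).
 * NOTE: the endpoint and parameters are assumptions; check the current docs.
 */
function COINGECKO_OHLC(coinId, days) {
  var url = 'https://api.coingecko.com/api/v3/coins/' + coinId +
            '/ohlc?vs_currency=usd&days=' + days;
  var response = UrlFetchApp.fetch(url);
  // The body is a JSON array of [time, open, high, low, close] sub-arrays.
  var rows = JSON.parse(response.getContentText());
  // Returning a 2-D array makes Sheets spill it across rows and columns.
  return rows;
}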

doc indexing - how to detect and store new followers and people who stopped following

I'm indexing a list like this:
doc_userid123
{
  "followers": [
    {"id": 5, "name": "john"},
    {"id": 6, "name": "mari"},
    {"id": 7, "name": "bart"}
  ]
}
So now I want to update this list every day and detect who started following and who stopped following.
The problem is that the list can have millions of IDs, so comparing the entire list would consume a lot of RAM and take too long to complete.
One possible way is indexing one doc per day, like this:
doc_userid123_2014-29-04
{
followers:[...]
}
But this will store a lot of repeated info.
I'm trying to find a better way to store this info without overconsuming RAM/CPU/disk. Any ideas?
You can create a new index per day; this way you can query each day separately or all of them together. This is also what Logstash does by default when combined with Elasticsearch.
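Whichever way you store the daily snapshots, the actual diff (who started following, who stopped) can be computed with two set lookups rather than a full pairwise comparison. A minimal sketch in plain JavaScript, assuming you can pull the two days' follower ids out of your index as arrays:
function diffFollowers(yesterdayIds, todayIds) {
  var before = new Set(yesterdayIds);
  var after = new Set(todayIds);
  var started = [];   // ids present today but not yesterday
  var stopped = [];   // ids present yesterday but not today
  after.forEach(function (id) {
    if (!before.has(id)) started.push(id);
  });
  before.forEach(function (id) {
    if (!after.has(id)) stopped.push(id);
  });
  return { started: started, stopped: stopped };
}

// Example: diffFollowers([5, 6, 7], [6, 7, 8]) -> { started: [8], stopped: [5] }
For millions of ids you would stream the lists in chunks rather than hold both arrays in memory, but the lookup structure stays the same.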

Paging arrays in mongodb subdocument

I have a mongo collection with documents that have a schema structured like the following:
{
  _id : bla,
  fname : foo,
  lname : bar,
  subdocs : [
    {
      subdocname : doc1,
      field1 : one,
      field2 : two,
      potentially_huge_array : [...]
    },
    ...
  ]
}
I'm using the Ruby Mongo driver, which currently does not support $elemMatch. I use an aggregation with a $project, $unwind and $match pipeline when extracting from subdocs.
What I would now like to do is to page results from the potentially_huge_array array contained in the subdocument. I have not been able to figure out how to grab just a subset of the array without dragging the entire subdoc, huge array and all, out of the db into my app.
Is there some way to do this?
Would a different schema be a better way to handle this?
Depending on how huge is huge, you definitely don't want it embedded into another document.
The main reason is that unless you always want the array returned with the document, you probably don't want to store it as part of the document. How you can store it in another collection would depend on exactly how you want to access it.
Reviewing the types of queries you most often perform on your data will usually suggest the best schema - one that will allow you to be efficient about number of queries, the amount of data returned and ease of indexing the data.
If your field really is huge and changes often, just place it in a separate collection.
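To make the separate-collection suggestion a little more concrete, here is a rough mongo-shell sketch; the collection and field names (huge_array_items, owner_id, seq) are hypothetical. Each array element becomes its own document keyed back to the parent, and paging becomes an ordinary query:
// One document per former array element, linked back to its parent subdoc.
db.huge_array_items.insert({
  owner_id: ObjectId("..."),   // _id of the parent document
  subdocname: "doc1",
  seq: 42,                     // position of the element in the original array
  value: { }                   // the original array element goes here
});

// Fetch page 3 of that array, 20 items per page:
var page = 3, pageSize = 20;
db.huge_array_items.find({ owner_id: ObjectId("..."), subdocname: "doc1" })
  .sort({ seq: 1 })
  .skip(page * pageSize)
  .limit(pageSize);
An index on { owner_id: 1, subdocname: 1, seq: 1 } keeps those paged reads cheap.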

Is there a sql only way to turn a hierarchical table into a set of json strings?

I have a hierarchical table a simplified version of which might look like this:
id  parentid  text
--  --------  ----
1   null      A
2   1         Ax
3   1         Ay
4   3         Ay2
5   null      B
6   5         Bx
I want to migrate all the data from this table into JSON form. The result for the table above should end up looking like:
{
"text":"A",
"children":
[
{
"text":"Ax",
"children":[]
},
{
"text":"Ay",
"children":
[
{
"text":"Ay2",
"children":[]
}
]
}
]
}
(next record)
{
"text":"B",
"children":
[
{
"text":"Bx",
"children":[]
}
]
}
The table has several hundred thousand records, and I can't really make an assumption as to how deep it recurses, as this might change before I finally run it. I've looked into using WITH to try to concatenate the child records, but I'm really struggling. Can this be done with WITH, or is there another way?
In the end I went with a recursive SQL function with a cursor that concatenates the results of each recursive call. Another SQL function escapes the strings that go into the JSON to make sure there are no nasty characters in there.
This last function is based on some of the code in the link @Martin provided in the comments above: http://www.simple-talk.com/sql/t-sql-programming/consuming-json-strings-in-sql-server/
I normally keep well away from cursors, but as this is a data migration script it proved the simplest way forward.
It's a bit difficult doing this type of thing in SQL, as it is expecting to return a set of rows (i.e. a grid or table). If you are using PL/SQL or another procedural language built into SQL you could approximate it, but the best solution would be to call SQL statements from within a script or other code, as you did.
PS: Cursors are very useful, once you get used to them. Think of them as a loop: on each row you can perform some logic.
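To make that script-side route a bit more concrete, here is a rough JavaScript sketch (table and variable names are hypothetical) of turning the flat id/parentid/text rows into the nested structure from the question once they have been read out of the database:
// Rows as they might come back from SELECT id, parentid, text FROM the table.
var rows = [
  { id: 1, parentid: null, text: 'A' },
  { id: 2, parentid: 1,    text: 'Ax' },
  { id: 3, parentid: 1,    text: 'Ay' },
  { id: 4, parentid: 3,    text: 'Ay2' },
  { id: 5, parentid: null, text: 'B' },
  { id: 6, parentid: 5,    text: 'Bx' }
];

function buildTrees(rows) {
  var byId = {};
  var roots = [];
  // First pass: create a node for every row.
  rows.forEach(function (row) {
    byId[row.id] = { text: row.text, children: [] };
  });
  // Second pass: attach each node to its parent, or keep it as a root.
  rows.forEach(function (row) {
    if (row.parentid === null) {
      roots.push(byId[row.id]);
    } else {
      byId[row.parentid].children.push(byId[row.id]);
    }
  });
  return roots;
}

// One JSON string per top-level record, matching the desired output above.
buildTrees(rows).forEach(function (root) {
  console.log(JSON.stringify(root));
});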

Creating logger in CouchDB?

I would like to create a logger using CouchDB. Basically, every time someone accesses the file, I would like to write the username and the time the file was accessed to the database. If this were MySQL, I would just add a row for every access corresponding to the user. I am not sure what to do in CouchDB. Would I need to store each access in an array? Then what do I do during an update; is there a way to append to the document? Would each user have his own document?
I couldn't find any documentation on how to append to an existing document or array without retrieving and updating the entire document. So for every event you log, you'll have to retrieve the entire document, update it and save it to the database. So you'll want to keep the documents small for two reasons:
Log files/documents tend to grow big. You don't want to send large documents across the wire for each new log entry you add.
Log files/documents tend to get updated a lot. If all log entries are stored in a single document and you're trying to write a lot of concurrent log entries, you're likely to run into mismatching document revisions on updates.
Your suggestion of user-based documents sounds like a good solution, as it will keep the documents small. Also, a single user is unlikely to generate concurrent log entries, minimizing any race conditions.
Another option would be to store a new document for each log entry. Then you'll never have to update an existing document, eliminating any race conditions and the need to send large documents between your application and the database.
Niels' answer is going down the right path with transactions. As he said, you will want to create a different document for each access - think of them as actions. Here's what one of those documents might look like:
{
  "_id": "32 char hash",
  "_rev": "32 char hash",
  "at": Unix time stamp,
  "by": "some unique identifier"
}
If you were tracking multiple files, then you'd want to add a "file" field and include a unique identifier.
Now the power of Map/Reduce begins to really shine, as it's extremely good at aggregating multiple pieces of data. Here's how to get the total number of views:
Map:
function(doc)
{
emit(doc.at, 1);
}
Reduce:
function(keys, values, rereduce)
{
return sum(values);
}
The reason I threw the time stamp (doc.at) into the key is that it allows us to get total views for a range of time. Ex., /dbName/_design/designDocName/_view/viewName?startkey=1000&endkey=2000&group=true gives us the total number of views between those two time stamps.
Cheers.
Although Sam's answer is an OK pattern to follow, I wanted to point out that there is, indeed, a nice way to append to a Couch document. It just isn't very well documented yet.
By defining an update function in your design document and using that to append to an array inside a couch document you may be able to save considerable disk space. Plus, you end up with a 1:1 correlation between the file you're logging accesses on and the couch doc that represents that file. This is how I imagine a doc might look:
{
"_id": "some/file/path/name.txt",
"_rev": "32 char hash",
"accesses": [
{"at": 1282839291, "by": "ben"},
{"at": 1282839305, "by": "kate"},
{"at": 1282839367, "by": "ozone"}
]
}
One caveat: You will need to encode the "/" as %2F when you request it from CouchDB or you'll get an error. Using slashes in document ids is totally ok.
And here is a pair of map/reduce functions:
function(doc)
{
  if (doc.accesses) {
    for (var i = 0; i < doc.accesses.length; i++) {
      var event = doc.accesses[i];
      emit([doc._id, event.by, event.at], 1);
    }
  }
}
function(keys, values, rereduce)
{
return sum(values);
}
And now we can see another benefit of storing all accesses for a given file in one JSON document: to get a list of all accesses on a document just make a get request for the corresponding document. In this case:
GET http://127.0.0.1:5984/dbname/some%2Ffile%2Fpath%2Fname.txt
If you wanted to count the number of times each file was accessed by each user you'll query the view like so:
GET http://127.0.0.1:5984/test/_design/touch/_view/log?group_level=2
Use group_level=1 if you just want to count total accesses per file.
Finally, here is the update function you can use to append onto that doc.accesses array:
function(doc, req) {
  var whom = req.query.by;
  var when = Math.round(new Date().getTime() / 1000);
  if (!doc.accesses) doc.accesses = [];
  var event = {"at": when, "by": whom};
  doc.accesses.push(event);
  var message = 'Logged ' + event.by + ' accessing ' + doc._id + ' at ' + event.at;
  return [doc, message];
}
Now whenever you need to log an access to a file issue a request like the following (depending on how you name your design document and update function):
http://127.0.0.1:5984/my_database/_design/my_designdoc/_update/update_function_name/some%2Ffile%2Fpath%2Fname.txt?by=username
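For completeness, here is a rough sketch of invoking that update handler from script code instead of a browser; the database, design document and handler names are just the placeholders from the URL above, and it assumes an environment with fetch available (a recent browser or Node.js).
// Slashes in the document id must be encoded, as noted earlier.
var docId = encodeURIComponent('some/file/path/name.txt'); // some%2Ffile%2Fpath%2Fname.txt
var url = 'http://127.0.0.1:5984/my_database/_design/my_designdoc' +
          '/_update/update_function_name/' + docId + '?by=username';

// Update handlers addressed at an existing document id are invoked with PUT.
fetch(url, { method: 'PUT' })
  .then(function (res) { return res.text(); })
  .then(function (message) { console.log(message); }); // "Logged username accessing ..."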
A comment on the last two answers is that they refer to Couchbase, not Apache CouchDB.
It is, however, possible to define update handlers in CouchDB, but I have not used them.
http://wiki.apache.org/couchdb/Document_Update_Handlers
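For anyone following that link, here is a rough sketch of what a design document holding both the view and the update handler from the answer above could look like. The names (_design/touch, the log view, update_function_name) are a mix of those used earlier in this thread and are only illustrative, and the reduce here uses CouchDB's built-in _sum as a shortcut for the hand-written sum(values) reduce:
{
  "_id": "_design/touch",
  "language": "javascript",
  "views": {
    "log": {
      "map": "function(doc) { if (doc.accesses) { for (var i = 0; i < doc.accesses.length; i++) { var e = doc.accesses[i]; emit([doc._id, e.by, e.at], 1); } } }",
      "reduce": "_sum"
    }
  },
  "updates": {
    "update_function_name": "function(doc, req) { if (!doc.accesses) doc.accesses = []; doc.accesses.push({at: Math.round(Date.now() / 1000), by: req.query.by}); return [doc, 'logged ' + req.query.by]; }"
  }
}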
