I have been pulling Transaction data from Coinbase and storing the data for over a year.
I also get all data from the Buys and Sells endpoints.
My understanding is that Buys.transaction.id maps to Transaction.id and, conversely, Transaction.buy.id maps to the Buy's id.
I use Buys.transaction.id to merge data with the stored Transaction data.
I recently noticed that the ids are sometimes different and the relationships seem wrong, which is causing issues when merging the Buy data into the Transaction data.
Looking into the data, I see this high-level pattern.
Transaction
{
  "id": "A",
  "buy": {
    "id": "B"
  }
}
Buy
{
  "id": "B",
  "transaction": {
    "id": "C"
  }
}
"C" doesn't exist as far as I can tell.
I get a Not Found when trying to access it from the resource path.
I think it should be (and at one time was?) "A".
I don't store the original response, so there could be something I am missing that would be highlighted there.
Has anyone else experienced this?
A potential solution is to go off of the Transaction data only and ignore the relationship information on the Buy, but I am worried there is something I am missing about how the API works.
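Roughly, the workaround I have in mind looks like this (Python, using the field names from the responses above; the function and variable names are just illustrative):

# Merge Buy data into Transaction data using only the Transaction side of the
# relationship (Transaction.buy.id -> Buy.id), ignoring Buy.transaction.id.
def merge_buys_into_transactions(transactions, buys):
    buys_by_id = {buy["id"]: buy for buy in buys}
    merged = []
    for txn in transactions:
        buy_ref = txn.get("buy") or {}
        matched_buy = buys_by_id.get(buy_ref.get("id"))
        merged.append({**txn, "buy_details": matched_buy})
    return merged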
Related
I have a Flink SQL data-processing scenario with many Kafka messages (JSON format). All of the messages are similar in content but not identical. I created many Flink tables, one per message type, and I am realising that this makes the Kafka topics hard to maintain.
event1:
{
"id": "id_001",
"type": "in",
"create_time": 1635116133089,
"uid":"uid_001"
}
event2:
{
"id":"id_002",
"status":"login",
"type":"in"
}
My Flink SQL jobs generally use watermarks, windows, and CEP.
My pain point is that this variety of tables seems redundant. Is there a better table design for my case?
My initial thought would be to see if you can define a common schema for your events while using optional fields. You could then post all your events to one topic, since some fields can remain null.
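As a rough sketch (Python, with the field list taken from your two example events; anything beyond that is illustrative), the normalisation step before producing to the single topic could look like:

# Normalise heterogeneous events into one common schema so a single Flink
# table with nullable columns can cover every variant.
COMMON_FIELDS = ["id", "type", "status", "create_time", "uid"]

def to_common_schema(event):
    # Fields missing from a given event simply become None (null in JSON).
    return {field: event.get(field) for field in COMMON_FIELDS}

event1 = {"id": "id_001", "type": "in", "create_time": 1635116133089, "uid": "uid_001"}
event2 = {"id": "id_002", "status": "login", "type": "in"}
print(to_common_schema(event1))
print(to_common_schema(event2))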
So in our Cloudant database, we store multiple "schedule" documents, and each one has a number of associated "event" documents. Here's a sample schema:
Schedule:
{
"_id": "123",
"_rev": "456",
"type": "schedule",
"userId": "sampleUser"
}
Event:
{
"_id": "1234",
"_rev": "5678",
"type": "event",
"scheduleId": "123"
}
If we know the _id of the schedule ahead of time, it's easy enough to get the schedule and associated events in a single query. What I'm wondering is if this is possible to do in a single query if we only know the userId. Is there any way to obtain all schedules associated with a userId, and all events associated with those schedules, in a single query? Or are we stuck doing two queries here? I looked at a handful of join tutorials and they didn't explain how it might be feasible, but I'm not sure it's totally impossible.
If this isn't possible, I can just include an array of all event IDs in the schedule doc, but that's not an ideal solution since it requires us updating the schedule doc every time a new event is added.
You will need two queries unless you are able to change your data model and make Events part of the Schedule document.
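If you are using Cloudant Query (the _find endpoint), the two queries could look roughly like this in Python; the URL, credentials, and index setup are placeholders, and this assumes the type/userId and type/scheduleId fields are indexed:

import requests

BASE = "https://ACCOUNT.cloudant.com/mydb"  # placeholder account/database
AUTH = ("user", "password")                 # placeholder credentials

# Query 1: all schedules belonging to the user
schedules = requests.post(
    BASE + "/_find",
    json={"selector": {"type": "schedule", "userId": "sampleUser"}},
    auth=AUTH,
).json()["docs"]

# Query 2: all events belonging to those schedules
schedule_ids = [doc["_id"] for doc in schedules]
events = requests.post(
    BASE + "/_find",
    json={"selector": {"type": "event", "scheduleId": {"$in": schedule_ids}}},
    auth=AUTH,
).json()["docs"]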
I need to be able to search records that have any user ID within a group of user IDs in the query.
However, the number of user IDs that must be searched will grow substantially over time. Therefore, I must be able to add thousands of user IDs to a single query and search across all of them.
I'm considering using ElasticSearch for this via a managed service like bonsai.
How well does ElasticSearch perform when queried with thousands of conditions?
The answer depends on lots of things (number of servers, RAM, CPU, etc), and it will probably take some experimentation to figure out what works best for you. I'm confident that Elasticsearch can solve your problem, but it's hard to predict performance in general.
You might want to investigate terms lookup. Basically, you store all the terms you want to search for in a document in the same index (or in another one); you can then reference that list in your search.
So you could save the IDs you want to search for as
PUT /test_index/idlist/1
{
  "ids": [2, 1982, 939, 1982, 98716, 7611, 983838, ...]
}
Then you can search another type using that list with something like this, for example, with a top-level filter:
POST /test_index/doc/_search
{
  "filter": {
    "terms": {
      "id": {
        "index": "test_index",
        "type": "idlist",
        "id": "1",
        "path": "ids"
      }
    }
  }
}
This probably only makes sense if you're going to run the same query more than once. You could have more than one list of IDs, though, and give the documents holding lists descriptive IDs if it helps.
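For what it's worth, here is roughly the same pair of calls from Python using the requests library; the host is a placeholder for wherever your cluster lives, and the id values are just examples:

import requests

ES = "http://localhost:9200"  # placeholder host

# Store the list of ids you want to look up
requests.put(
    ES + "/test_index/idlist/1",
    json={"ids": [2, 1982, 939, 98716, 7611, 983838]},
)

# Search the doc type, filtering on ids contained in the stored list
query = {
    "filter": {
        "terms": {
            "id": {
                "index": "test_index",
                "type": "idlist",
                "id": "1",
                "path": "ids",
            }
        }
    }
}
response = requests.post(ES + "/test_index/doc/_search", json=query)
print(response.json()["hits"]["total"])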
Using a managed service makes it easy to experiment with different cluster setups (number of nodes, size of machines, data center, and so on). I would suggest you take a look at Qbox (I'm biased, since I work with Qbox). New customers get a $40 introductory credit, which is usually enough to experiment with a proof of concept.
Being new to CouchDB, I just wanted to discuss best practices for structuring a database and documents. My background is in MySQL, so I'm still trying to get a handle on document-driven databases.
To outline the system: we have several clients, each of whom accesses a separate website with separate data. Each client's data will be split into its own database. Each database will have data constantly inserted (every 5 minutes, for at least a year) for logging events: a new document is created every 5 minutes with a timestamp and a value. We also need to store some information about the client, which is a single document that never gets updated (or only very rarely).
Below is an example of how one client database looks...
{
  "_id": "client_info",
  "name": "Client Name",
  "role": "admin",
  ....
},
{
  "_id": "1199145600",
  "alert_1_value": 0.150,
  "alert_2_value": 1.030,
  "alert_3_value": 12.500,
  ...
  ...
},
{
  "_id": "1199145900",
  "alert_1_value": 0.150,
  "alert_2_value": 1.030,
  "alert_3_value": 12.500,
  ...
  ...
},
{
  "_id": "1199146200",
  "alert_1_value": 0.150,
  "alert_2_value": 1.030,
  "alert_3_value": 12.500,
  ...
  ...
},
etc. ...literally millions more of these, one every 5 minutes...
My question is: is this sort of structure correct? I understand CouchDB keeps documents in a single flat space, but there will literally be millions of these timestamp/value documents in the database. I may just be being picky, but it just seems a little disorganized to me.
Thanks!
Use the timestamp as your _id if it's guaranteed to be unique. This dramatically improves couch's ability to maintain its b-tree for things like building and maintaining views as well as docs, and it will also save you roughly len(_id) bytes of space per document.
Each doc you add (for such small data) adds some overhead in b-tree space. In your view (the logical equivalent of your SQL query) you can always parse the doc fields and emit them separately, or multiple times, if needed.
This type of unchanging data is a great fit for CouchDB. As the data is added to couch, you can trigger a view update periodically, and the view will build the query data in advance. This means that, unlike SQL, where you'd calculate that aggregate data on the fly each time, couch will simply read that data, cached in the view b-tree's intermediate nodes. Much faster.
So the typical CouchDB approach is:
- model your transactions to minimise the # of docs (i.e. denormalise)
- use different views if needed to filter or sort results differently.
I guess you'll want to produce aggregate stats across that time period. Likely this will be much more efficient (CPU-wise) in erlang, so take a look at https://github.com/apache/couchdb/blob/trunk/src/couchdb/couch_query_servers.erl#L172-205 to see how they're done.
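As a rough sketch (Python over CouchDB's HTTP API; the database URL and design doc name are placeholders), a view that emits each alert value and uses the built-in erlang _stats reduce would give you sum/count/min/max per alert over a key range:

import requests

DB = "http://localhost:5984/client_db"  # placeholder database URL

design_doc = {
    "_id": "_design/alerts",
    "views": {
        "by_time": {
            # Emit one row per alert field per document, keyed by
            # [alert name, timestamp] (the doc _id is the timestamp).
            "map": (
                "function (doc) {"
                "  if (doc._id !== 'client_info') {"
                "    for (var key in doc) {"
                "      if (key.indexOf('alert_') === 0) {"
                "        emit([key, parseInt(doc._id, 10)], doc[key]);"
                "      }"
                "    }"
                "  }"
                "}"
            ),
            # Built-in erlang reduce: sum, count, min, max, sumsqr
            "reduce": "_stats",
        }
    },
}
requests.put(DB + "/_design/alerts", json=design_doc)

# Aggregate alert_1_value across a time range, grouped by alert name
params = {
    "startkey": '["alert_1_value", 1199145600]',
    "endkey": '["alert_1_value", 1199232000]',
    "group_level": 1,
}
print(requests.get(DB + "/_design/alerts/_view/by_time", params=params).json())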
What choices are there for document-store databases that allow for relational data to be retrieved? To give a real example, say you have a database to store blog posts. I'd like to have the data look something like:
{
  id: 12345,
  title: "My post",
  body: "The body of my post",
  author: {
    id: 123,
    name: "Joe Bloggs",
    email: "joe.bloggs@example.com"
  }
}
Now, you will likely have a number of these records that all share the author details. What I'd really like is to have the author itself stored as a separate record in the database, so that if you update that one record, every post record that links to it gets the updates as well. To date, the only way I've seen mentioned to do this is to have the post record store the ID of the author record instead, so that the calling code has to make two queries of the data store - one for the post and another for the author record that is linked to the post.
Are there any document store databases that will allow me to make a single query and get back a structured document containing the linked records? And preferably allow me to edit an inner part of the document, persist the document as a whole, and have the correct thing happen (i.e. in the above, if I retrieved the entire document, changed the value of email, and persisted the entire document, then the email address of the author record is changed, and reflected in all posts that have that author).
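For illustration, the two-query pattern I'm describing looks roughly like this (plain Python dicts standing in for the data store; the lookup names are hypothetical, not any particular database's API):

# Posts store only the author's id; resolving a post's author takes a second lookup.
posts = {
    12345: {"id": 12345, "title": "My post", "body": "The body of my post", "author_id": 123},
}
authors = {
    123: {"id": 123, "name": "Joe Bloggs", "email": "joe.bloggs@example.com"},
}

def get_post_with_author(post_id):
    post = posts[post_id]                 # query 1: fetch the post
    author = authors[post["author_id"]]   # query 2: fetch the linked author
    return dict(post, author=author)

print(get_post_with_author(12345))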
First, let me acknowledge: This particular type of data is somewhat relational by nature. It just depends on exactly how you want to structure this type of data, and what technologies that you have easy access to for this particular project. That said, how do you want your data structured?
If you can structure your data any way you want, you could go with something like this:
{
  name: 'Joe',
  email: 'joe.bloggs@ex.com',
  posts: [
    {
      id: 123,
      title: "My post"
    },
    {..}
  ]
}
Here all the posts are contained in one particular key/value pair. This particular type of data, I would say, is uniquely suited to Riak (due to it being able to query internally against JSON using JavaScript natively), though you could probably come at it from just about any of the NoSQL data stores' points of view (Cassandra, Couch, Mongo, et al.), as most of them can store straight-up JSON. I just have a tendency towards Riak at this point, due to my personal experience with it.
The more interesting things you'll probably run up against relate to how you deal with the data store. For instance, I really like using Ripple for Ruby, which lets me deal with this kind of data in Riak very easily. But if you're in Java land, that might make adoption of this technique a bit more difficult (though I haven't spent a lot of time looking into Java adoption of Riak), since it tends to lag on 'edge'-style data storage techniques.
More than that, getting your brain to start thinking in NoSQL terms, without 'relations', is what usually takes the longest when structuring data. Because there isn't a schema, and there aren't any preconceptions that come with one, you can do a lot of things that would be considered simply wrong in the relational DB world, like storing all of the blog posts for a single user in one document, which just wouldn't work in the standard schema-heavy, strongly table-based relational world.