is it better to define one topic with different types or to define typed-topic for each type? - google-cloud-pubsub

is it better to define one topic with different types or to define typed-topic for each type?
For example,
[topic]
foo
[message]
{
type: 'x',
data: [1,2,3]
}
OR
[topic]
foo-x
[message]
{
data: [1,2,3]
}
According to pub/sub-pricing, it seems that latter case is a little bit cheaper because of the message size given that gross message sent and delivered is the same.
But this is only pricing, I think there may be some empirical design or schema to define message. What could be considered in this case of message?

Prefer the second solution. Indeed, if you want to plug processing (Function, Cloud Run, AppEngine or other compute platform) on a particular topic, you can do it because the topic is typed.
If the topic is not typed, you have to implement a filter in your process to do: "If my type is X then do else exit". And you are charged for this. For example, for function and Cloud Run, you are charged per request and by processing time, the minimal facturation in 100ms -> too much for a simple IF
However, if you want to plug the same function on several topics, you have to duplicate your function (except if you use HTTP trigger)
Hope this help!

There isn't one right or wrong answer here. In part, it depends on the nature of your subscribers. Are they going to be interested in messages of all types or in messages of only one type? If the former, then a single topic with the type in the message might be better. If the latter, then separate subscriptions might be better. Additionally, how many types do you have? For dozens or more types, managing individual topics may be a lot of overhead.
Price would likely not be a factor. If your messages are large, then the price will be dominated by the contents of the messages. If your messages are small, then they will be subject to the 1KB minimum billing volume per request. The only way the type would be a factor is if you have very small messages, but batch them such that they are at least 1KB.
If you do decide to encode the type in the message, consider putting it in the message attributes instead of the message data. That way, you can look at the type without having to decode the entire message, which can be helpful if you want to immediately ack messages of certain types instead of processing them.

Related

SCD-2 in data modelling: how do I detect changes?

I know the concept of SCD-2 and I'm trying to improve my skills about it doing some practices.
I have the next scenario/experiment:
I'm calling daily to a rest API to extract information about companies.
In my initial load to the DB everything is new, so everything is very easy.
Next day I call to the same rest API, which might returns the same companies, but some of them might have (or not) some changes (i.e., they changed the size, the profits, the location, ...)
I know SCD-2 might be really simple if the rest API returns just records with changes, but in this case it might returns as well records without changes.
In this scenario, how people detect if the data of a company has changes or not in order to apply SCD-2?, do they compare all the fields?.
Is there any example out there that I can see?
There is no standard SCD-2 nor even a unique concept of it. It is a general term for large number of possible approaches. The only chance is to practice and see what is suitable for your use case.
In any case you must identify the natural key of the dimension and the set of the attributes you want to keep the history.
You may of course make it more complex by the decision to use your own surrogate key.
You mentioned that there are two main types of the interface for the process:
• You get periodically a full set of the dimension data
• You get the “changes only” (aka delta interface)
Paradoxically the former is much simple to handle than the latter.
First of all, in the full dimensional snapshot the natural key holds, contrary to the delta interface (where you may get more changes for one entity).
Additionally you have to handle the case of late change delivery or even the wrong order of changes delivery.
Next important decision is if you expect deletes to occur. This is again trivial in the full interface, you must define some convention, how this information would be passed in the delta interface.
Connected is the question whether a previously deleted entity can be reused (i.e. reappear in the data).
If you support delete/reuse you'll have to thing about how to show them in your dimension table.
In any case you will need some additional columns in the dimension to cover the historical information.
Some implementation use a change_timestamp, some other use validity interval valid_from and valid_to.
Even other implementation claim that additional sequence number is required – so you avoid the trap of more changes with the identical timestamp.
So you see that before you look for some particular implementation you need carefully decide the options above. For example the full and delta interface leads to a completely different implementations.

What are some of the problems with GTFS?

I am intersted in replacing my current data format that I use with GTFS, but I hear and read from here and there that there are flaws in GTFS file format.
Most of the time I see that you can't somehow predict some things such as delays or some real-time stuff. They say you can't get the "whole picture" with them.
So what I am asking is there anyone more experienced with GTFS , since I am seeing them only for first time, that could have possibly used GTFS in some kind of application and could tell the problems they have faced while developing?
Maybe someone has a suggestion about a better kind of file format? Or a combination of some formats?
It's hard to say whether GTFS is a good fit or not for your application without knowing what your application's requirements are, but I can offer a few remarks.
If your goal is to provide real-time data to users you should take a look at GTFS-realtime, a complementary data format designed specifically for issuing real-time updates. For most public-transit applications, using a GTFS and a GTFS-realtime feed together does indeed give the "whole picture" about a transit network, or near enough.
In terms of GTFS itself, my main complaint is that it seems designed specifically for route-planning applications and using data in this format for any other purpose can be difficult. For example, while a GTFS feed records information about transit stops and routes, there is no requirement that each of these have a single, canonical entry—if the data spans multiple board periods, there will almost always be (seemingly) duplicate entries for each.
This doesn't matter if you're plotting a route based on where and when a person is travelling, since the links between objects ensure you'll always generate the right result. If you're starting with only a person's location and want to know, "What transit resources are available nearby?", reliably producing an accurate answer requires some contortions.
It depends on your needs for importing existing feeds. If yes, then you need to be able to handle it anyhow. In my case, import was required, so I use the same for data that stems from other formats like PDF timetables. Otherwise you need to supoprt two formats. If you do not need it for import (or export) you may consider your own format : I find GTFS does not reveal the actual network.
GTFS needs quite some interpretation and digesting in order to end up with the whole picture that you can answer planning questions on.
I merge stops together if they are close, like a few meters apart, and assume 'trivial walk' if 10-50 meters. That automatically handles combining multipe feeds.
Apart from that, I turn the stop_times roughly inside-out to create a 'link' table'. The end result is that for each stop you have a list of departures and their destinations.
Biggest problem until now is that GTFS feeds can record the trips from an operator point of view. Passengers can remain sitting in the bus if it flips the headsign from 351 to 285, takes a new driver onboard and continues. That means you need to know what trips actually need to be seen as joined in passenger terms.
I solved minor problem for manual feed entry by having my GTFS parser accept a handful of constructs that ease editing, such as leaving out the sequence numbers to have it generated incrementally, and recognising 02.13+1 as 26.13.

Which is better: sending many small messages or fewer large ones?

I have an app whose messaging granularity could be written two ways - sending many small messages vs. (possibly far) fewer larger ones. Conceptually what moves around is a set of 'alive' vertex IDs that might get filtered at each superstep based on a processed list (vertex value) that vertexes manage. The ones that survive to the end are the lucky winners. compute() calculates a set of 'new-to-me' incoming IDs that are perfect for the outgoing message, but I could easily send each ID one at a time. My guess is that sending fewer messages is more important, but then each set might contain thousands of IDs. Thank you.
P.S. A side question: The few custom message type examples I've found are relatively simple objects with a few primitive instance variables, rather than collections. Is it nutty to send around a collection of IDs as a message?
I have used lists and even maps to be sent or just stored as vertex data, so that isn’t a problem. I think it shouldn’t matter for giraph which you want to choose, and I’d rather go with many simple small messages, as you will use Giraph appropriately. Instead you will need to go in the compute function through the list of messages and for each message through the list of IDs.
Performance-wise it shouldn’t make any difference. What I’ve rather found to make a big difference is, try to compute as much as possible in on cycle, as the switching between cycles and synchronising the messages, ... takes a lot of time. As long as that doesn’t change it should be more or less the same and probably much easier to read and maintain when you keep the size of messages small.
In order to answer your question, you need understand the MessageStore interface and its implementations.
In a nutshell, under the hood, it took the following steps:
The worker receive the byte raw input of the messages and the destination IDs
The worker sort the messages and put them into A Map of A Map. The first map's key is the partition ID, the section map's key is the vertex ID. (It is kind of like the post office. The work is like the center hub, and it sort the letters into different zip code first, then in each zip code sorted by address)
When it is the vertex's turn of compute, a Iterable of that vertex's messages are passed to the vertex's compute method, and that's where you get the messages and use it.
So less and bigger messages are better because of less sorting if the total amount of bytes is the same for both cases.
Also, you could send many small messages, but let Giraph convert this into a long one (almost) automatically. You can use Combiners.
The documentation on this subject is terrible on Giraph site, but you maybe could extract an example from the book Practical Graph Analytics with Apache Giraph.
This depends on the type of messages that you are sending, mainly.

What is a good relational database design for stock market data?

Suppose there are two types of messages, QUOTE and TRADE. Both have different fields. For example TRADE has only a single price. QUOTE has both a bid and ask price. I want process messages in time order to do something like the following:
if (QUOTE) {
...
}
if (TRADE) {
...
}
My problem is the two messages are in different formats so I can't get them into the same database table. If I can't get them into the same database table how do I process sequentially? Any ideas for a suitable design?
The answer depends entirely on what you're doing and on where your app plugs into the data streams.
At one extreme, you might merely be answering customer quotes that you're pulling from an API, and basically implementing a cache. In this case two tables are fine.
At the other extreme, you might be monitoring real-time quotes for a high frequency trading platform, in which case the throughput will probably rule out using a database at all (things built around lisp, such as allegrograph, might be more appropriate), except to periodically collect aggregate statistics.
The short answer is, 'not really' For stock market and other time series data a key value store like Berkley DB or Mongo is pretty good. Also, a data format like NetCDF (http://en.wikipedia.org/wiki/NetCDF) will likely serve you better in the long run. It also depends on what kind of access you want and how much time you want to store.
You didn't indicate what you were doing with the data, which should inform your choices of storage more than anything. For example, a high-speed trading application will have different storage tradeoffs than a historical batch processing system (where Hadoop + NetCDF would be great). YMMV
Kdb+/q
Is a very good option for tick data. Used by major banks.
here is the info about that.
You can install a trail version and play with it.

Designing a database with periodic sensor data

I'm designing a PostgreSQL database that takes in readings from many sensor sources. I've done a lot of research into the design and I'm looking for some fresh input to help get me out of a rut here.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
The basic structure of the data coming in is as follows:
For each data logging device, there are several channels.
For each channel, the logger reads data and attaches it to a record with a timestamp
Different channels may have different data types, but generally a float4 will suffice.
Users should (through database functions) be able to add different value types, but this concern is secondary.
Loggers and channels will also be added through functions.
The distinguishing characteristic of this data layout is that I've got many channels associating data points to a single record with a timestamp and index number.
Now, to describe the data volume and common access patterns:
Data will be coming in for about 5 loggers, each with 48 channels, for every minute.
The total data volume in this case will be 345,600 readings per day, 126 million per year, and this data needs to be continually read for the next 10 years at least.
More loggers & channels will be added in the future, possibly from physically different types of devices but hopefully with similar storage representation.
Common access will include querying similar channel types across all loggers and joining across logger timestamps. For example, get channel1 from logger1, channel4 from logger2, and do a full outer join on logger1.time = logger2.time.
I should also mention that each logger timestamp is something that is subject to change due to time adjustment, and will be described in a different table showing the server's time reading, the logger's time reading, transmission latency, clock adjustment, and resulting adjusted clock value. This will happen for a set of logger records/timestamps depending on retrieval. This is my motivation for RecordTable below but otherwise isn't of much concern for now as long as I can reference a (logger, time, record) row from somewhere that will change the timestamps for associated data.
I have considered quite a few schema options, the most simple resembling a hybrid EAV approach where the table itself describes the attribute, since most attributes will just be a real value called "value". Here's a basic layout:
RecordTable DataValueTable
---------- --------------
[PK] id <-- [FK] record_id
[FK] logger_id [FK] channel_id
record_number value
logger_time
Considering that logger_id, record_number, and logger_time are unique, I suppose I am making use of surrogate keys here but hopefully my justification of saving space is meaningful here. I have also considered adding a PK id to DataValueTable (rather than the PK being record_id and channel_id) in order to reference data values from other tables, but I am trying to resist the urge to make this model "too flexible" for now. I do, however, want to start getting data flowing soon and not have to change this part when extra features or differently-structured-data need to be added later.
At first, I was creating record tables for each logger and then value tables for each channel and describing them elsewhere (in one place), with views to connect them all, but that just felt "wrong" because I was repeating the same thing so many times. I guess I'm trying to find a happy medium between too many tables and too many rows, but partitioning the bigger data (DataValueTable) seems strange because I'd most likely be partitioning on channel_id, so each partition would have the same value for every row. Also, partitioning in that regard would require a bit of work in re-defining the check conditions in the main table every time a channel is added. Partitioning by date is only applicable to the RecordTable, which isn't really necessary considering how relatively small it will be (7200 rows per day with the 5 loggers).
I also considered using the above with partial indexes on channel_id since DataValueTable will grow very large but the set of channel ids will remain small-ish, but I am really not certain that this will scale well after many years. I have done some basic testing with mock data and the performance is only so-so, and I want it to remain exceptional as data volume grows. Also, some express concern with vacuuming and analyzing a large table, and dealing with a large number of indexes (up to 250 in this case).
On a very small side note, I will also be tracking changes to this data and allowing for annotations (e.g. a bird crapped on the sensor, so these values were adjusted/marked etc), so keep that in the back of your mind when considering the design here but it is a separate concern for now.
Some background on my experience/technical level, if it helps to see where I'm coming from: I am a CS PhD student, and I work with data/databases on a regular basis as part of my research. However, my practical experience in designing a robust database for clients (this is part of a business) that has exceptional longevity and flexible data representation is somewhat limited. I think my main problem now is I am considering all the angles of approach to this problem instead of focusing on getting it done, and I don't see a "right" solution in front of me at all.
So In conclusion, I guess these are my primary queries for you: if you've done something like this, what has worked for you? What are the benefits/drawbacks I'm not seeing of the various designs I've proposed here? How might you design something like this, given these parameters and access patterns?
I'll be happy to provide clarification/details where needed, and thanks in advance for being awesome.
It is no problem at all to provide all this in a Relational database. PostgreSQL is not enterprise class, but it is certainly one of the better freeware SQLs.
To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).
That is your biggest obstacle. Contrary to program design, which allows decomposition and isolated analysis/design of components, databases need to be designed as a single unit. Normalisation and other design techniques need to consider both the whole, and the component in context. The data, the descriptions, the metadata have to be evaluated together, not as separate parts.
Second, when you start off with surrogate keys, implying that you know the data, and how it relates to other data, it prevents you from genuine modelling of the data.
I have answered a very similar set of questions, coincidentally re very similar data. If you could read those answers first, it would save us both a lot of typing time on your question/answer.
Answer One/ID Obstacle
Answer Two/Main
Answer Three/Historical
I did something like this with seismic data for a petroleum exploration company.
My suggestion would be to store the meta-data in a database, and keep the sensor data in flat files, whatever that means for your computer's operating system.
You would have to write your own access routines if you want to modify the sensor data. Actually, you should never modify the sensor data. You should make a copy of the sensor data with the modifications so that you can show later what changes were made to the sensor data.

Resources