How to update GTFS static data stored locally - static

How should the regular updates to the GTFS static data provided by the Agencies through their text files be handled?
Should all of this static data be deleted from the data stores and then completely reloaded from the Agency's new GTFS text files?
This method would be used if identifiers such as route_id, trip_id or stop_id can be reassigned between updates. For example, the new GTFS data files might show that stop_id "x", which was assigned to trip "Y", is now assigned to trip "Z".
If these entity identifiers are never reassigned, then the new GTFS files need to be compared to the local data and, based on the results, records need to be removed, updated or added in each table.
Erick.

The only safe way is to load the new feed completely and then switch over to it on successful completion. While GTFS best practices at
https://gtfs.org/best-practices/#dataset-publishing--general-practices
do recommend that providers maintain persistent identifiers for stop_id, route_id and agency_id where possible, this is not a requirement of the specification, and in practice identifiers (particularly stop_ids) do often change between feed versions.
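A minimal sketch of that load-and-switch approach, assuming the feed is kept in a relational database with transactional DDL (e.g. PostgreSQL) and that the new feed has already been imported into hypothetical *_staging tables and validated; the table names are only illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class GtfsFeedSwap {

    // The new feed is assumed to have been loaded into *_staging tables
    // (agency_staging, routes_staging, ...) and validated before this runs.
    public static void switchToNewFeed(String jdbcUrl) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                for (String table : new String[] {"agency", "routes", "trips", "stops", "stop_times"}) {
                    st.executeUpdate("DROP TABLE IF EXISTS " + table + "_old");
                    st.executeUpdate("ALTER TABLE " + table + " RENAME TO " + table + "_old");
                    st.executeUpdate("ALTER TABLE " + table + "_staging RENAME TO " + table);
                }
            }
            conn.commit(); // readers only ever see a complete, consistent feed
        }
    }
}

Readers keep querying the old tables until the commit, so they never see a half-loaded feed; if the import or validation fails, nothing is swapped.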

Related

How to access java variables declared outside in flink map?

I am creating a List in Java and want to use that List inside a Flink map function. How can I share variables across Flink processes?
Requirement
I have static data records (fewer than 1000 records). I want to join these records with the data stream.
Flink is designed with scalability in mind, and so it embodies a share-nothing approach. An operator can be stateful, but its state is strictly local and not visible to other operators.
In this particular case, a good solution might be to use a RichMapFunction (or RichFlatMapFunction) to do the join, and to load a copy of all of the static data records into a local transient data structure in its open method.
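A minimal sketch of that suggestion, using a RichFlatMapFunction whose open() method builds the lookup table; the hard-coded entries below stand in for however you actually load your (fewer than 1000) static records:

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Enriches a stream of String keys against a small static lookup table.
// The table is built once per parallel instance in open(), not shared between operators.
public class StaticJoinFunction extends RichFlatMapFunction<String, String> {

    // transient: built in open() on each task slot rather than serialized with the function
    private transient Map<String, String> staticData;

    @Override
    public void open(Configuration parameters) throws Exception {
        staticData = new HashMap<>();
        // In a real job this would read from a file, a database, etc.
        staticData.put("key-1", "static-value-1");
        staticData.put("key-2", "static-value-2");
    }

    @Override
    public void flatMap(String key, Collector<String> out) {
        String match = staticData.get(key);
        if (match != null) {
            out.collect(key + " -> " + match);  // emit only keys that join
        }
    }
}

You would then call stream.flatMap(new StaticJoinFunction()); each parallel instance gets its own copy of the map, which is fine for a small static dataset.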

AngularJS: Store localized user input data in translation json files or database

I have an architecture issue related to localization. My concern is what is the best approach to store and manage localized user data. Let me explain:
I have an AngularJS webapp with a MySQL database. For text translations we are using angular-translate with files. For labels, static text, etc. it is working great.
On the other hand, the user can create items (e.g. houses for rent) and fill in a title and description for them. He is also able to edit that information. This information is gathered by a form and stored in the DB at the moment.
We would like to provide translations for these user input data and with this scenario in mind, I see two approaches:
User stores data in his language in the DB. We store the translations in the DB (translation tables...) and provide translations from there.
User stores data in his language in the DB. We store the translations in locale.json files and create a key in the database to get those translations (angular-translate).
In both scenarios we need to translate whenever the user creates or updates a title or description. But if you store it in the database, at least you already have one default translation. If you store it in a json file, you are keeping the default translation data in two places.
From the maintenance point of view, using the translation files looks a little more complex at first sight. Also, take into account that each time a user's input text is added or updated, a deployment needs to be done.
However, from the performance point of view, the translation files are probably the better approach. You are probably saving at least one query to the DB when the user changes the language.
From the architectural point of view, I would say the user data should be stored in database.
What do you think?
Always store the user input.
Store the translation in the DB only if you ALWAYS need it.
If you rarely need it, offer a Translation button to the user.
Do what's cheaper. If only one in a thousand inputs is in another language and it's rarely visited, there's no sense in wasting precious DB space; let it be done on the fly, on demand.
Also, how do you know it needs to be translated? Some people are bilingual, and there are cases where a tourist abroad is (struggling while) using a device set in another language.
Note:
You do know that automatic translations are crap, don't you? So how are you translating?
TL;DR: option 1. You may cache access to the translation tables or create materialised views (if your DBMS supports them) to denormalise your Property entity and have one readily-translated row per language.
Personally, I do not see the need for caching - how many times is the user going to change language, in production?
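As a rough sketch of option 1 (the question doesn't say what the backend looks like, so the JPA entities and column names below are only illustrative; the same two-table shape works as plain MySQL DDL):

import java.util.ArrayList;
import java.util.List;
import javax.persistence.*;

// One row per item the user creates, stored in the user's own language.
@Entity
public class Property {
    @Id @GeneratedValue
    private Long id;

    private String defaultLanguage;   // e.g. "es"
    private String title;             // the text exactly as the user typed it
    private String description;

    // One translated row per target language (see PropertyTranslation below).
    @OneToMany(mappedBy = "property", cascade = CascadeType.ALL)
    private List<PropertyTranslation> translations = new ArrayList<>();
}

// Translation table: (property_id, language) -> translated title/description.
@Entity
@Table(uniqueConstraints = @UniqueConstraint(columnNames = {"property_id", "language"}))
class PropertyTranslation {
    @Id @GeneratedValue
    private Long id;

    @ManyToOne
    @JoinColumn(name = "property_id")
    private Property property;

    private String language;          // e.g. "en"
    private String title;
    private String description;
}

Retrieval is then a join on property_id plus the requested language, falling back to the row the user originally typed when no translation exists; a materialised view can pre-join this into one row per (property, language) if reads dominate.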

Data Modeling - modeling an Append-only list in NDB

I'm trying to make a general purpose data structure. Essentially, it will be an append-only list of updates that clients can subscribe to. Clients can also send updates.
I'm curious for suggestions on how to implement this. I could have an ndb.Model, 'Update', that contains the data and an index, or I could use a StructuredProperty with repeated=True on the main Entity. I could also just store a list of keys somehow and then the actual update data in a not-strongly-linked structure.
I'm not sure how the repeated properties work - does appending to the list of them (via the Python API) have to rewrite them all?
I'm also worried about consistency. Since multiple clients might be sending updates, I don't want them to overwrite each other and lose an update, or somehow end up with two updates with the same index.
The problem is that you have a maximum total size for each entity in the datastore.
So any single model that accumulates updates (storing the data directly or via collecting keys) will eventually run out of space (not sure how the limit applies with regard to structured properties however).
Why not have a model "Update", as you say? A simple version would be to have each provided update create and save a new entity. If you track the save date as a field in the model, you can sort them by time when you query for them (presumably there is an upper limit at some level anyway).
Also that way you don't have to worry about simultaneous client updates overwriting each other, the data-store will worry about that for you. And you don't need to worry about what "index" they've been assigned, it's done automatically.
As that might be costly in datastore reads, I'm sure you could implement a version that uses repeated properties in a single entity, moving to a new entity after N updates are stored, but then you'd have to wrap it in a transaction to be sure multiple updates don't clash, and so on.
You can also cache the query generating the results and invalidate it only when a new update is saved. Look at NDB too, as it provides some automatic caching (though not for queries).
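The question is about Python NDB, but to keep the examples in this thread in one language, here is a rough sketch of the "one entity per update, ordered by save time" idea using the Cloud Datastore Java client (the kind and property names are made up); the NDB equivalent is a small model with a DateTimeProperty(auto_now_add=True):

import com.google.cloud.Timestamp;
import com.google.cloud.datastore.*;

public class AppendOnlyLog {
    private final Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
    private final KeyFactory keyFactory = datastore.newKeyFactory().setKind("Update");

    // Each update becomes its own entity; the datastore assigns the id,
    // so concurrent clients never overwrite each other.
    public void append(String payload) {
        FullEntity<IncompleteKey> update = FullEntity.newBuilder(keyFactory.newKey())
                .set("payload", payload)
                .set("created", Timestamp.now())
                .build();
        datastore.add(update);
    }

    // "Index" order is recovered by sorting on the stored timestamp.
    public QueryResults<Entity> readAll() {
        EntityQuery query = Query.newEntityQueryBuilder()
                .setKind("Update")
                .setOrderBy(StructuredQuery.OrderBy.asc("created"))
                .build();
        return datastore.run(query);
    }
}

Because every append creates a brand-new entity with a datastore-assigned id, concurrent clients cannot clash; ordering comes from the timestamp rather than a hand-maintained index.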

Another database table or a json object

I have two tables: stores and users. Every user is assigned to a store. I thought "What if I could just save all the users assigned to a store as a json object and save that json object in a field of the store?" So, in other words, the users' data would be stored in a field instead of its own table. There will be around 10 people per store. I would like to know which method will require the least amount of processing for the server.
Most databases are relational, meaning there's no reason to put multiple different fields into one column. Besides it being more work for you to put them together and take them apart, you'd basically be ignoring the strength of the database.
If you ever tried to access the data from another app, you'd have to go through additional steps. It also limits sorting and greatly adds to your querying difficulties (e.g. you can't say WHERE field = value because one field contains many values).
In your specific example, if the users at a store change, rather than being able to do a very efficient delete from the users table (or modify which store they're assigned to) you'd have to fetch the data and edit it, which would double your overhead.
Joins exist for a reason, and they are efficient. So, don't fear them!
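As a rough sketch of the normalized alternative (table and column names are assumed, not taken from the question), keeping users in their own table with a store_id foreign key makes both the lookup and the reassignment a single indexed statement:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StoreUsers {

    // Fetch the ~10 users of a store with one indexed join.
    public void printUsersOfStore(Connection conn, long storeId) throws Exception {
        String sql = "SELECT u.id, u.name FROM users u "
                   + "JOIN stores s ON s.id = u.store_id WHERE s.id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, storeId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                }
            }
        }
    }

    // Moving a user to another store is a one-row UPDATE,
    // not a read-modify-write of a JSON blob.
    public void reassignUser(Connection conn, long userId, long newStoreId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE users SET store_id = ? WHERE id = ?")) {
            ps.setLong(1, newStoreId);
            ps.setLong(2, userId);
            ps.executeUpdate();
        }
    }
}

With roughly ten users per store this join is trivial for the database, and reassigning or removing a user never requires rewriting a serialized blob.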

How to keep historic details of modification in a database (Audit trail)?

I'm a J2EE developer & we are using hibernate mapping with a PostgreSQL database.
We have to keep track of any changes that occur in the database; in other words, all previous & current values of any field should be saved. Each field can be of any type (bytea, int, char...).
With a simple table it is easy, but with a graph of objects things are more difficult.
So, speaking from a UML point of view, we have a graph of objects to store in the database along with every change & the user who made it.
Any idea or pattern how to do that?
A common way to do this is by storing versions of objects.
If you add a "version" and a "deleted" field to each table that you want to store an audit trail for, then instead of doing normal updates and deletes, follow these rules:
Insert - Set the version number to 0 and insert as normal.
Update - Increment the version number and do an insert instead.
Delete - Increment the version number, set the deleted field to true and do an insert instead.
Retrieve - Get the record with the highest version number and return that.
If you follow this pattern, every time you update you will create a new record rather than overwriting the old data, so you will always be able to track back and see all the old objects.
This will work exactly the same for graphs of objects, just add the new fields to each table within the object graph, and handle each insert/update/delete for each table as described above.
If you need to know which user made the modification, you just add a "ModifiedBy" field as well.
(You can either do this processing in your DA layer code, or if you prefer you can use database triggers to catch your update/delete/retrieve calls and re-process them following the rules.)
Obviously, you need to consider space requirements, as every single update will result in a completely new record. If your application is update heavy, you are going to generate a lot of data. It's common to also include a "last modified time" field so you can process the database offline and delete data older than required.
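A minimal sketch of that pattern as a single JPA/Hibernate entity (the class, table and audited fields here are invented for illustration; in the real application each table in the object graph would get the same extra columns):

import java.util.Date;
import javax.persistence.*;

// One row per change; rows are only ever inserted, never updated or deleted in place.
@Entity
@Table(name = "customer_versions",
       uniqueConstraints = @UniqueConstraint(columnNames = {"business_id", "version"}))
public class CustomerVersion {

    @Id
    @GeneratedValue
    private Long id;                 // surrogate key for this particular version row

    @Column(name = "business_id")
    private Long businessId;         // identity of the logical customer across all versions

    private int version;             // 0 on insert, incremented on every change
    private boolean deleted;         // set to true instead of physically deleting

    private String name;             // ...the actual audited fields go here...

    private String modifiedBy;       // who made the change
    @Temporal(TemporalType.TIMESTAMP)
    private Date modifiedAt;         // when, so old versions can be purged offline later
}

The current state of a logical record is then the row with the highest version for its businessId (treated as absent if deleted is true), which is a simple ORDER BY version DESC query in the DA layer.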
Current RDBMS implementations are not very good at handling temporal data. That's one reason why maintaining separate journalling tables through triggers is the usual approach. (The other is that audit trails frequently have different use cases to regular data, and having them in separate tables makes it easier to manage access to them). Oracle does a pretty slick job of hiding the plumbing in its Total Recall product, but being Oracle it charges $$$ for this.
Scott Bailey has published a presentation on temporal data in PostgreSQL. Alas, it won't help you right now, but it seems like some features planned for 8.5 and 8.6 will enable transparent storage of time-related data.
