Migrating existing infrastructure & scaling with Terraform - vSphere

We are planning to automate the creation & deletion of VMs in our DCs which power our cloud service. The service is such that every new customer gets dedicated VMs (at least 3), so the number of VMs keeps growing. We already have about 2000 VMs running on ESXi. So we now have two problems to solve before adopting Terraform:
How do we migrate existing VMs to be managed by Terraform (or should we, at all)?
Generating the resource specifications could be scripted, but verifying the plan to ensure nothing is affected will be a challenge - the volume of VMs, and the fact that they are all LIVE, puts extra pressure on the engineers.
As the number of VMs increases, the number of .tf files on disk will keep increasing. We could club multiple VMs into a single file, but that would make programmatic deletion of individual VMs a bit tricky. Splitting the files across multiple directories is the simplest workaround I can think of, but...
Is there a better way to handle scale with terraform?
I couldn't find any blogs which discuss these problems, hence I'm looking for some advice based on practical experience here.

Good to see the community starting to ask Terraform-related questions on Stack Overflow more and more.
Regarding your questions:
Migrating existing VMs to be managed by Terraform means updating the tfstate file. As of now there is no way to automatically create resource definitions for already-created resources and put them into the state file (though there are tools like Terraforming, which do this partially, and only for AWS resources). So you will have to describe the resources in *.tf files, update the tfstate file manually, and then verify that tfstate == tf by running terraform plan, which should report that there are no changes to apply. Regarding what exactly to put into the tfstate file: I would recommend creating the resource definition in tf first, then creating a dummy VM (terraform apply) based on it, finding the relevant objects in the updated tfstate file, and replacing those dummy values with the real values of your VMs in that tfstate file (you will also need to update the serial to prevent a local/remote state inconsistency error).
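Since the question notes that generating the resource specifications could be scripted, here is a minimal Python sketch of that generation step, assuming a JSON inventory exported from vCenter; the vsphere_virtual_machine attribute names are illustrative only and depend on the provider version you actually run:

    import json
    from pathlib import Path

    # Attribute names in this template are illustrative; adjust them to match
    # the vSphere provider version you are running.
    TEMPLATE = """resource "vsphere_virtual_machine" "{name}" {{
      name   = "{name}"
      vcpu   = {vcpu}
      memory = {memory_mb}
    }}
    """

    def write_tf_files(inventory_path: str, out_dir: str = "vms") -> None:
        """Generate one .tf file per existing VM so each can be planned/destroyed individually."""
        Path(out_dir).mkdir(exist_ok=True)
        for vm in json.loads(Path(inventory_path).read_text()):
            tf = TEMPLATE.format(name=vm["name"], vcpu=vm["vcpu"], memory_mb=vm["memory_mb"])
            (Path(out_dir) / f"{vm['name']}.tf").write_text(tf)

Keeping one .tf file per VM also keeps the programmatic deletion mentioned in the question simple: removing a VM is just deleting its file and running a targeted plan/apply.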
I don't know of a smarter way of handling a large number of related resources other than grouping them into directories. That way you can run plan/apply just for specific, logically separated directories, but you will have to keep separate state files. It can also easily become overkill (a kind-of warning, so do not try at home).
These are the main suggestions I keep in mind when working with Terraform (especially with a large number of resources, as you have):
Organize your code so that you have modules in one place and pass parameters into them in another place - code reusability, as it is called now :)
Use the -target flag on commands like terraform plan and terraform apply to limit the resources you want to touch (see the sketch below for driving this per directory).
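As a sketch of driving this programmatically, assuming a hypothetical customers/<id>/ directory layout where each directory holds its own .tf files and state (nothing here is prescribed by Terraform itself, and the resource address is made up):

    import subprocess
    from pathlib import Path

    CUSTOMERS_ROOT = Path("customers")  # hypothetical layout: customers/<customer_id>/*.tf + its own state

    def terraform(customer_id: str, *args: str) -> None:
        """Run a Terraform command inside one customer's directory, i.e. against its own state.

        Assumes `terraform init` has already been run in that directory.
        """
        subprocess.run(["terraform", *args], cwd=CUSTOMERS_ROOT / customer_id, check=True)

    # Plan only one customer's infrastructure:
    terraform("acme", "plan")

    # Destroy a single VM for that customer without touching anything else:
    terraform("acme", "destroy", "-target=vsphere_virtual_machine.vm3")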
Hope it helps - and that more people will enjoy Terraform.

Related

Is it a good practice to have the database within the same container as the app?

We have several sites running under a CMS using virtual machines. Basically we have three VMs running the CMS and a SQL instance to store data. We plan to transition to containers, but to be honest I don't have much idea about it, and my boss plans to have the full app (CMS and DB) within an image and then deploy as many containers as needed (initially three). My doubt here is that, as far as I know, containers work better when you separate the different parts and use them as microservices, so I don't know if it's a good idea to have the full app within the container.
Short answer is: No.
It's best practice with containers to have one process per container. The container has an entrypoint, basically a command that is executed when starting the container. This entrypoint will be the command that starts your process. If you want more than one process, you need to have a script in the container that starts them and puts them in the background, complicating the whole setup. See also docker docs.
There are some more downsides.
A container should only contain what it needs to run its process. If you have more than one process, you'll have one big container. Also, you're no longer independent in your choice of base image; you need to find one that fits all the processes you want to run. You might also have trouble with dependencies, because the different processes might need different versions of a dependency (like a library).
You're unable to scale independently. E.g. you could have 5 CMS containers that all use the same database, for redundancy and performance. That's not possible when you have everything in the same container.
Detecting/debugging faults. If more than one process runs in a container, the container might fail because one of the processes failed, but you can't be sure which one. If you have one process and the container fails, you know exactly why. It's also easier to monitor health, because there is one health-check endpoint for that container. Last but not least, the container's logs represent the logs of one process, not of multiple ones.
Updating becomes easier. When updating your CMS to the next version or updating the database, you only need to update the container image of that process. E.g. the database doesn't need to be stopped/started when you update the CMS.
The container can be reused easier. You can e.g. use the same container everywhere and mount the customer specifics from a volume, configmap or environment variable.
If you want both your CMS and database together, you can use the sidecar pattern in Kubernetes: simply have a pod with multiple containers in the manifest. Note that this too will not make it horizontally scalable.
That's a fair question that most of us face at some point. One tends to put everything in the same container for convenience, but then later regrets that choice.
So, best to do it right from the start and to have one container for the app and one for the database.
According to Docker's documentation,
Up to this point, we have been working with single container apps. But, we now want to add MySQL to the application stack. The following question often arises - “Where will MySQL run? Install it in the same container or run it separately?” In general, each container should do one thing and do it well.
(...)
So, we will update our application to work like this:
It's not clear what you mean by CMS (content/customer/... management system). Nonetheless, milestones on the way to creating/separating an application (monolith/microservices) would probably be:
if the application is a smaller one, start with a monolithic structure (the whole application as an executable on an application/web server)
otherwise, determine which parts should be separated (-> Domain-Driven Design)
if the smaller monolithic structure is getting bigger and you add more domain-related services, pull it apart with well-defined separations according to your domain landscape:
Once you have a firm grasp on why you think microservices are a good idea, you can use this understanding to help prioritize which microservices to create first. Want to scale the application? Functionality that currently constrains the system’s ability to handle load is going to be high on the list. Want to improve time to market? Look at the system’s volatility to identify those pieces of functionality that change most frequently, and see if they would work as microservices. You can use static analysis tools like CodeScene to quickly find volatile parts of your codebase.
"Building Microservices" - S. Newman
Database
According to the principle of "hiding internal state", every microservice should have its own database.
If a microservice wants to access data held by another microservice, it should go and ask that second microservice for the data. ... which allows us to clearly separate functionality
In the long run this could be perfected into completely separated end-to-end slices backed by their own databases (UI-LOGIC-DATA). In the context of microservices:
sharing databases is one of the worst things you can do if you’re trying to achieve independent deployability
So the general way of choice would be, more or less, separate services each backed by their own database.

How to cache batches of IDs "locally" in a serverless environment?

Traditionally, in a non-serverless environment, I would have the following system. Say I have a custom ID-generation protocol for all my models. Say I also have 20 servers scattered around. I give each server a slice of IDs to work with from the whole stack of IDs. When a server is done, or goes down, it returns its IDs to the system so they don't get wasted. The reason for sending each server a batch of IDs is so that every time a new record is created you don't need to fetch from a central ID server to get the next ID. Instead, each server has a local set it can work with freely.
How would you do this sort of thing in a serverless system? I am deploying to Vercel and wondering what the appropriate architecture might be for such an ID-batching system. There are other use cases for needing a persistent copy of data on a local server, so if you don't like the ID example, just imagine another sort of system. How do you solve this optimization problem in a serverless environment?
Serverless is an approach. Like all such things (solutions), it should be matched to the problem - not the other way around. Is this simply a case where serverless is a good solution choice for dealing with 80% of your problem, and all you need to do is choose something appropriate to deal with the other 20%?
Assuming you have the freedom to do this, can't you just have the serverless parts of the solution consume non-serverless services - e.g. an ID Service?
Separately from this, caching comes to mind - just the general idea of having some data close by which might be mastered somewhere else. Caching patterns like Write-Behind would allow you to work with local copies (i.e. immediate consumption) whilst farming out the cache-master communication.
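To make the "central ID service plus local batch" idea concrete, here is a minimal Python sketch assuming a centrally reachable Redis (any store with an atomic increment would do); the key name and batch size are made up, and unlike the scheme in the question, unused IDs in a batch are simply lost when the instance is recycled, trading a few gaps for fewer round trips:

    import os
    import redis  # assumes a centrally reachable Redis; any store with an atomic increment works

    BATCH_SIZE = 100
    _local_ids = []  # survives for the lifetime of this (warm) function instance

    _store = redis.Redis(host=os.environ.get("ID_STORE_HOST", "localhost"))

    def next_id() -> int:
        """Return the next ID, leasing a fresh batch from the central counter when the local one runs out."""
        global _local_ids
        if not _local_ids:
            # INCRBY is atomic, so concurrent instances never receive overlapping ranges.
            upper = _store.incrby("id_counter", BATCH_SIZE)
            _local_ids = list(range(upper - BATCH_SIZE + 1, upper + 1))
        return _local_ids.pop(0)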

How to implement continuous delivery on a platform consisting of multiple applications which all depend on one database and on each other?

We are working on an old project which consists of multiple applications that all use the same database and strongly depend on each other. Because of the size of the project, we can't refactor the code so that they all use the API as a single database source. The platform contains the following applications:
Website
Admin / CMS
API
Cronjobs
Right now we want to start implementing a CI/CD pipeline using GitLab. We are currently experiencing problems because we can't update the database for the deployment of one application without breaking all the other applications (unless we deploy all applications).
I was thinking about a solution where one pipeline triggers all other pipelines. Every pipeline would execute all newly added database migrations and test whether it still works as it should. If all pipelines succeed, the deployment of all applications would be started.
I doubt whether this is a good solution, because this change will only increase the already high coupling between our applications. Does anybody know a better way to implement CI/CD for our platform?
You have to stop thinking about these as separate applications. You have a monolith with multiple modules, and until they can be decoupled, they are all one application and will have to be deployed as such.
Fighting this by pretending they aren't is likely a waste of time; your efforts would be better spent actually decoupling these systems.
There are likely a lot of solutions, but one that I've done in the past is create a separate repository for the CI/CD of the entire system.
Each individual repo builds that component, and then you can create tags as they are released or ready for CI at a system level.
The separate CI/CD repo pulls in the appropriate tags for each item and runs CI/CD against all of them as one unit. This lets you pin which tag of each repo you want to use, which should prevent this pipeline from failing when changes are made to the individual components.
Ask yourself why these "distinct applications" are using "one and the same database". Is it because every single one of those "distinct applications" deals with "one and the same business semantics"? If so, then, as Rob already stated, you simply have one single application (and on top of that, there will be no decoupling, precisely because your business semantics are singular/atomic/...).
Or are there discernible portions in the DB structure such that a highly accurate mapping could be identified saying "this component uses that portion", etc.? In that case, what is it that causes you to say things like "can't update the database for the deployment of ..."? (BTW, "update the database" is not the same thing as "restructure the database". Please, please, please be precise.) The answer to that will identify what you've got to tackle.

How do you implement version control in a database application?

I'm working on a web-based Java project that stores end-user data in a MySQL database. I'd like to implement something that gives the user functionality similar to what I have for my source code version control (e.g. Subversion). In other words, I'd like to implement code that allows the user to commit and roll back work and return to an existing branch. Is there an existing framework for this? It seems like putting the database data into version control and exposing the version control functionality to the end user (i.e. writing code that allows the user to commit, roll back, etc.) could be a reasonable approach, but it also seems there might be some problems with it. For example, how would you allow one user to view a rolled-back version of the data (i.e. you can't just replace the data the database is pointing to if only one user wants to look at a rolled-back version)? If given the choice of completely rebuilding the system using any persistence architecture, what could be used to store the data that would make this type of functionality easy to implement?
There are 2 very common solutions for what you need:
http://www.liquibase.org/
https://flywaydb.org/
Branching and merging the user data
Your question is about solutions for versioning the user data in an application, to give your users capabilities such as branching and merging. You pondered exposing a real version control system such as SVN.
The side-effects I can foresee are:
You will have to index things by directory and filename, maybe using an abstraction of directories as entities and filenames as the primary key.
Operating systems (Linux, Mac and Windows alike) do not handle directories with millions of files well. You will have to partition the entity, usually by hashing the ID (MD5, for example) and taking the beginning of the hash to create a subdirectory (see the sketch after this list). The number of digits to take from the hash depends on the expected size of the entity.
Operating systems (Linux, Mac and Windows alike) are not prepared for huge quantities of files. I did a test on that: it took me days to back up and finally remove a file tree with hundreds of millions of files.
You will not be able to have additional indexes beyond the primary key; however, you can work around that by creating a data mart, as I will describe below.
You will not have database constraints, but similar functionality can be implemented through git/svn/cvs triggers.
You will not have strong transactions, but similar functionality can be implemented through git/svn/cvs triggers.
You will have a working copy for each user; this will consume space depending on the size of the repositories. That way each user will be at a single point in time.
Git is fast enough to switch from one branch to another, so going back in time and returning will take only seconds (unless the user data is big, of course).
I saw a Linus interview where he warned about low performance in huge Git repositories. Maybe it is best to have a repository for each user, or some other means of avoiding a single humongous repository for your application.
Resolution of the changes. I bet that if you create gazillions of versions, any version control system will complain. I do not know what gazillions means here; you will have to test it.
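A minimal sketch of the hash-based directory partitioning mentioned in the list above; the two-level, two-characters-per-level layout is just an illustrative choice:

    import hashlib

    def shard_path(entity_id: str, levels: int = 2, chars_per_level: int = 2) -> str:
        """Map an entity ID to a nested subdirectory so no single directory grows too large."""
        digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
        parts = [digest[i * chars_per_level:(i + 1) * chars_per_level] for i in range(levels)]
        return "/".join(parts + [entity_id])

    # Example: shard_path("customer-42") returns "<2 hex chars>/<2 hex chars>/customer-42"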
Query database
A version control working copy will be limited to primary-key queries using the "=" operator and sequential scans. This is not enough to produce good reports and statistics for any usage pattern I can think of. That is why you need to build a data mart from your application data, and you have two ways of doing that:
A batch process, which reads the whole repository history and builds cubes and other views to allow easier querying.
Git/SVN/CVS triggers, which can call programs written by you on file addition, modification, deletion, branch creation and merging. This could be used to update the database when a change happens.
The batch approach is easier to implement, but it takes time for the reports and statistics to be synchronized with the activity. You will probably want to go that way for the 1.0 version and, in time, move to triggers to make things more dynamic.
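A minimal sketch of the batch approach, assuming a local Git repository and SQLite as the data mart (the table layout and log format string are illustrative, not a prescribed design):

    import sqlite3
    import subprocess

    def build_data_mart(repo_path: str, db_path: str = "mart.db") -> None:
        """Batch job: walk the repository history and load one row per changed file into SQLite."""
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--name-status", "--pretty=format:@%H|%aI"],
            capture_output=True, text=True, check=True,
        ).stdout

        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS changes (commit_hash TEXT, committed_at TEXT, status TEXT, path TEXT)")
        commit_hash = committed_at = None
        for line in log.splitlines():
            if line.startswith("@"):
                commit_hash, committed_at = line[1:].split("|", 1)
            elif line.strip():
                status, _, path = line.partition("\t")
                conn.execute("INSERT INTO changes VALUES (?, ?, ?, ?)",
                             (commit_hash, committed_at, status, path))
        conn.commit()
        conn.close()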
Simulating constraints and transactions
Git, SVN and CVS support triggers that execute programs when a new version is submitted. The relationships and consistency can then be checked to accept or reject the change.
Alternative Solutions
Since you did not specify the kind of application you want, I will talk about blogs, content portals and online stores. For those kinds of applications I see little reason to reinvent the wheel and build a custom database. Most of the versioning needed can be anticipated in the database model; a good event-oriented database design will be enough.
For example, a revision of a blog post could be modeled by marking the end date/time of the post and creating a new row for the revised post, increasing the version number and setting the previous-version id (a minimal sketch follows). The same strategy can be used for sales and the catalog of an online store. If you model your application with good logs, you do not need version control.
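A minimal sketch of that row-per-revision modeling, using SQLite and made-up column names:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE post_version (
            post_id          INTEGER NOT NULL,
            version          INTEGER NOT NULL,
            previous_version INTEGER,          -- NULL for the first version
            body             TEXT NOT NULL,
            valid_from       TEXT NOT NULL,
            valid_to         TEXT,             -- NULL means "current version"
            PRIMARY KEY (post_id, version)
        )
    """)

    def revise_post(post_id: int, new_body: str, now: str) -> None:
        """Close the current version and insert a new one pointing back to it."""
        row = conn.execute(
            "SELECT version FROM post_version WHERE post_id = ? AND valid_to IS NULL",
            (post_id,)).fetchone()
        previous = row[0] if row else None
        if previous is not None:
            conn.execute(
                "UPDATE post_version SET valid_to = ? WHERE post_id = ? AND version = ?",
                (now, post_id, previous))
        conn.execute(
            "INSERT INTO post_version VALUES (?, ?, ?, ?, ?, NULL)",
            (post_id, (previous or 0) + 1, previous, new_body, now))
        conn.commit()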
Some developers also use a row-level trigger that records everything that has changed in the database. This is a bit harder for an auditor, who would need to reconstruct the past from badly designed logs. I personally do not like this approach because it is very difficult to index these kinds of queries. I prefer to build my whole application around a well-designed and meaningful log.
For example:
History Table
10/10/2010 [new process] process_id=1; name=john
11/10/2010 [change name] process_id=1; old_name=john; new_name=john doe
12/10/2010 [change name] process_id=1; old_name=john doe; new_name=john doe junior
Process Table after 12/10/2010:
process_id=1; name=john doe junior
That way I can reconstruct almost everything in the past and still have my operational data in an easy-to-use format.
However, this is not close to the usage pattern you want (branching and merging).
Conclusion
The applicability of version control as a database seems to me very powerful on the one hand, and very limited and dangerous on the other. It is very inspiring for auditing and error-correction purposes, but my main concerns would be scale and reliability.
It seems like you want version control for your data rather than the database schema. I could find two databases that implement most of the version control features such as fork, clone, branch, merge, push, and pull:
https://github.com/dolthub/dolt - SQL based
https://github.com/terminusdb/terminusdb - graph based
You mentioned Subversion, which is a Centralized Version Control System. But let us focus on Git, because of reasons. Git is a Decentralized Version Control System. A local copy of a Git repository is the same as a remote copy of the repository, if a remote copy exists at all (services such as GitLab and GitHub provide remote housing and management of Git projects). With Git you can have version control in an arbitrary directory on your machine. You can do whatever you are accustomed to doing with SVN, and more, in this arbitrary directory.
What I am getting at, is that you could possibly create per user directories/repositories in your server programmatically, and apply version control in these directories/repositories, keeping a separate repository per user (the specifics of the architecture would be decided later, though, depending on the structure of the user's "work"). Your application would be in charge of adding and removing files on behalf of the user (e.g. Biography, My Sample Project, etc.), editing files, committing the changes, presenting a file history, etc., essentially issuing Git commands. Your application would, thus, interface with the Git repository, exploiting the advanced version control that Git provides. Your database would just make sure that the user is linked to the directory/repository that contains their "work".
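A minimal sketch of that "application drives Git on behalf of the user" idea, shelling out to plain git (JGit, mentioned further down, would be the in-process Java equivalent); the repository root, commit identity and function names are assumptions for illustration:

    import subprocess
    from pathlib import Path

    REPO_ROOT = Path("/srv/user-repos")  # hypothetical location for per-user repositories

    def _git(repo: Path, *args: str) -> str:
        return subprocess.run(["git", "-C", str(repo), *args],
                              capture_output=True, text=True, check=True).stdout

    def save_work(user_id: str, filename: str, content: str, message: str) -> None:
        """Write the user's file into their personal repository and commit the change."""
        repo = REPO_ROOT / user_id
        if not (repo / ".git").exists():
            repo.mkdir(parents=True, exist_ok=True)
            subprocess.run(["git", "init", str(repo)], check=True)
        (repo / filename).write_text(content, encoding="utf-8")
        _git(repo, "add", filename)
        # The -c options supply a commit identity so this works without global git config.
        _git(repo, "-c", "user.name=app", "-c", "user.email=app@example.invalid",
             "commit", "-m", message)

    def file_history(user_id: str, filename: str) -> str:
        """Return a one-line-per-commit history of the given file."""
        return _git(REPO_ROOT / user_id, "log", "--oneline", "--follow", "--", filename)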
To provide a critical analogy, the GitLab project is an open source web-based Git repository manager with wiki and issue tracking features. GitLab is written in Ruby and uses PostgreSQL (preferably). It is a typical (as in Code - Database - Data directories and files) multiuser web-based application. Its purpose is to manage Git repositories. These Git repositories are stored in a designated directory in the server. Part of the code is responsible for accessing the Git repositories that the logged-in user is authorized to access (as the owner or as a collaborator). An interesting use case is of a user editing a file online, which will result in a commit in some branch in some repository. Another interesting use case is of a user checking the history of a file. A final interesting use case is of a user reverting a specific commit. All of these actions are performed online, via a web browser.
To provide an interesting real-world use case, Atlas by O'Reilly is an online platform for publishing-related collaboration using GitLab as the backend.
For Java there is JGit, a lightweight, pure Java library implementing the Git version control system. JGit is used by Eclipse for all actions related to managing Git repositories. Maybe you could look into it. It is an extremely active project, supported by many, Google included.
All of the above makes sense if the "work" you refer to is more than some fields in a database table which the user fills in and may later change the values of. For instance, it would make sense for structured text, HTML, etc.
If this "work" is not so large-scale, maybe doing something like what is described above is overkill. In that case, you could employ some of the version control concepts in your database design, such as calculating diffs and applying patches (also in reverse, for viewing past versions / rolling back). Your tables should allow for a tree-like structure, to store the diffs, so you could allow for branches. You could have the active version of a file readily available, as well as the active index (what Git calls HEAD), and navigate to another indexed/hashed/tagged version in the file's history by applying all patches sequentially, if moving forward, or applying patches in reverse, and in the reverse chronological order, if moving backwards. If this "work" is really small-scale, you could even ditch the diff concept, and store the whole version of the "work" in the tree-like structure.
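For the small-scale variant at the end of the previous paragraph (whole snapshots kept in a tree-like structure), a minimal in-memory sketch; a real implementation would persist the versions in tables with a parent-id column, and the class and method names here are purely illustrative:

    import difflib
    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class Version:
        content: str
        parent: Optional[str]  # parent version id, None for the root

    @dataclass
    class WorkHistory:
        """Small-scale variant: store whole snapshots in a tree keyed by version id."""
        versions: Dict[str, Version] = field(default_factory=dict)
        head: Optional[str] = None

        def commit(self, version_id: str, content: str) -> None:
            self.versions[version_id] = Version(content, self.head)
            self.head = version_id

        def diff(self, old_id: str, new_id: str) -> str:
            """Show what changed between two versions as a unified diff."""
            old = self.versions[old_id].content.splitlines()
            new = self.versions[new_id].content.splitlines()
            return "\n".join(difflib.unified_diff(old, new, old_id, new_id, lineterm=""))

        def checkout(self, version_id: str) -> str:
            """'Roll back' simply by reading the requested snapshot; a branch is just another child of a parent."""
            return self.versions[version_id].content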
Pure fun.

Configuration vs Database storage

I will keep this short. I am looking to store product plan data; these are the plans that users would pick for their payment options. This data includes how much the plan costs and what the unit details of the plan are, like what makes up a unit (day/week/month), and fairly simple data about the plan. These plans may or may not change once a month or once a year; the company is a start-up and things are always changing at the 11th hour, and constantly, so there is no real way to predict when they will change. A co-worker and I are discussing whether these values should be stored in the web.config (where they currently are) or moved to the database.
I have done some googling and I have not found any good resource that helps draw a clear line on when something should be in the database versus in the web.config. I wanted to know your thoughts on this and see if someone could clearly define when data should be stored in config and when in the database.
Thanks for the help!
From the brief description you provide, it seems to me that the configuration data, eventually, may be accessed not just by your web server-based application running on one computer, but also by other supporting applications, such as end-of-month batch jobs, that you may want to run on other computers. To support that possibility, it would be a good idea to store the data in some sort of centralized repository that can be accessed remotely from multiple computers.
Storing the configuration data in a database is the obvious way to meet that requirement. But if you don't want to do that, then another approach would be to store the configuration data in a file on a company-internal (rather than public) web/ftp server. Then an application can use a utility such as curl to retrieve the configuration file from the web/ftp server.
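A minimal sketch of that fetch-from-an-internal-server approach in Python; the URL and the JSON keys are hypothetical:

    import json
    import urllib.request

    CONFIG_URL = "http://config.internal.example.com/plans.json"  # hypothetical internal endpoint

    def load_plan_config() -> dict:
        """Fetch the shared plan configuration from the company-internal server."""
        with urllib.request.urlopen(CONFIG_URL, timeout=5) as response:
            return json.load(response)

    plans = load_plan_config()
    print(plans.get("basic", {}).get("price_per_unit"))  # key names are made up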
Of those two approaches, I think using a database is probably best, because it provides an ergonomic way to not just read the configuration data, but also update it.
