[Breaking] HA-support via Deployment #437

Merged
pat-s merged 75 commits from refs/pull/437/head into main 2023-07-17 19:09:46 +00:00
pat-s commented 2023-04-14 09:24:44 +00:00 (Migrated from gitea.com)

Changes

A big shoutout to @luhahn for all his work in #205 which served as the base for this PR.

Documentation

  • After thinking for some time about it, I still prefer the distinct option (as started in #350), i.e. having a standalone "HA" doc under docs/ha-setup.md to not have a very long README (which is already quite long).
    Most of the information below should go into it with more details and explanations behind all of the individual components.

Chart deps

  • ~~Adds meilisearch as a chart dependency for a HA-ready issue indexer. Only works with Gitea >= 1.20~~
  • ~~Adds redis as a chart dependency for a HA-ready session and queue store.~~
  • Adds redis-cluster as a chart dependency for a HA-ready session and queue store (alternative to redis). Only works with Gitea >= 1.19.2.
  • Removes memcached in favor of redis-cluster
  • Adds postgresql-ha as the default DB dependency, replacing postgres

Adds smart HA chart logic

The goal is to set smart config values that result in an HA-ready Gitea deployment if replicaCount > 1 (a template sketch follows the list below).

  • If replicaCount > 1,
    • gitea.config.session.PROVIDER is automatically set to redis-cluster
    • gitea.config.indexer.REPO_INDEXER_ENABLED is automatically set to false unless the value is elasticsearch or meilisearch
    • redis-cluster is used for [queue], [cache], and [session], regardless of whether running in HA mode or not
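For illustration, here is a minimal Helm template sketch of that kind of conditional logic (the helper name is hypothetical; the chart's real implementation wires this through its existing helpers in templates/_helpers.tpl):

```yaml
{{- /* Sketch: derive the session provider from replicaCount. */ -}}
{{- define "gitea.ha.sessionProvider" -}}
{{- if gt (int .Values.replicaCount) 1 -}}
redis-cluster
{{- else -}}
{{- dig "session" "PROVIDER" "memory" (.Values.gitea.config | default dict) -}}
{{- end -}}
{{- end -}}
```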

Configuration of external instances of meilisearch and minio is documented in a new markdown doc.

Deployment vs Statefulset

Given all the discussions about this lately (#428), I think we could use both.
In the end, we do not have the requirement for a sequential pod scale up/scale down as it would happen in statefulsets.
On the other hand, we do not have actual stateless pods, as we are attaching a RWX volume to the deployment.
Yet because we do not have a leader-election requirement, spawning the pods as a deployment makes "Rolling Updates" easier and also signals users that there is no "leader election" logic and that each pod can be "destroyed" at any time without causing interruption.

Hence I think we should be able to switch from a statefulset to a deployment, even in the single-replica case.

This change also brought up a templating/linting issue: the definition of .Values.gitea.config.server.SSH_LISTEN_PORT in ssh-svc.yaml only "luckily" worked so far due to naming-related lint processing. The change from "statefulset" to "deployment" altered the processing order and caused a failure complaining about config.server.SSH_LISTEN_PORT not being defined yet.
The only way I could see to fix this was to "properly" define the value in values.yaml instead of conditionally defining it in helpers.tpl. Maybe there's a better way? (A sketch of such an explicit default follows below.)
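For illustration, a minimal values.yaml excerpt of what such an explicit definition could look like (the port numbers are assumptions, e.g. 2222 for the rootless image, not necessarily the chart's final defaults):

```yaml
gitea:
  config:
    server:
      SSH_PORT: 22           # port advertised in clone URLs (assumed default)
      SSH_LISTEN_PORT: 2222  # port the container actually listens on (assumed default)
```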

Chart PVC Creation

I've adapted the automated PVC creation from another chart to be able to provide the storageClassName as I couldn't get dynamic provisioning for EFS going with the current implementation.
In addition, the naming and approach within the Gitea chart for PV creation is a bit unusual, and aligning it with other charts might be beneficial.

This is a semi-unrelated change that will result in a breaking change for existing users, but this PR already includes a lot of breaking changes, so including another one might not make it much worse...

  • New persistence.mount: whether to mount an existing PVC (via persistence.existingClaim)
  • New persistence.create: whether to create a new PVC (see the values sketch below)
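A hedged example of how these two flags could be combined to re-attach existing storage (key names as described above; the claim name is a placeholder):

```yaml
persistence:
  enabled: true
  create: false                         # do not let the chart create a new PVC
  mount: true                           # mount the claim referenced below instead
  existingClaim: gitea-shared-storage   # placeholder: name of the pre-existing RWX PVC
  accessModes:
    - ReadWriteMany
```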

Testing

As this PR does a lot of things, we need proper testing.
The helm chart can be installed from the Git branch via helm-git as follows:

```sh
helm repo add gitea-charts git+https://gitea.com/gitea/helm-chart@/?ref=deployment
helm install gitea gitea-charts/gitea --version 0.0.0
```
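Note that the git+ URL scheme requires the helm-git plugin; if it is not installed yet, something along these lines should work (assuming the upstream aslafy-z/helm-git repository):

```sh
helm plugin install https://github.com/aslafy-z/helm-git
```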

It is highly recommended to test the chart in a dedicated namespace.

I've tested this myself with both redis and redis-cluster and it seemed to work fine.
I just did some basic operations though and we should do more niche testing before merging.

Example values.yaml for testing (only needs a valid RWX storage class):

```yaml
image:
  tag: "dev"
  pullPolicy: "Always"
  rootless: true

replicaCount: 2

persistence:
  enabled: true
  accessModes:
    - ReadWriteMany
  storageClass: FIXME

redis-cluster:
  enabled: false
  global:
    redis:
      password: gitea

gitea:
  config:
    indexer:
      ISSUE_INDEXER_ENABLED: true
      REPO_INDEXER_ENABLED: false
```
Preferred setup

The preferred HA setup with respect to performance and stability might currently be as follows:

  • Repos: RWX (e.g. EFS or Azurefiles NFS)
  • Issue indexer: Meilisearch (HA)
  • Session and cache: Redis Cluster (HA)
  • Attachments/Avatars: Minio (HA)

This will result in a ~ 10-pod HA setup overall.
All pods have very low resource requests.

fix #98

viceice commented 2023-04-15 12:56:54 +00:00 (Migrated from gitea.com)

Your sample values seem wrong: indexer settings below minio persistence?

pat-s commented 2023-04-15 13:01:54 +00:00 (Migrated from gitea.com)

~~I don't understand what you mean. Can you be more specific or comment directly in the diff?~~

Ah got it now! Indeed, a c/p mistake.

justusbunsi commented 2023-04-18 11:09:50 +00:00 (Migrated from gitea.com)

We could drop the memcached dependency and use redis by default. I mean, this PR adds both redis and redis-cluster as dependencies. For non-HA setups, the redis dependency can be used where memcached is used today. What's your opinion, @pat-s?

luhahn commented 2023-04-18 13:22:35 +00:00 (Migrated from gitea.com)

> We could drop the memcached dependency and use redis by default. I mean, this PR adds both redis and redis-cluster as dependencies. For non-HA setups, the redis dependency can be used where memcached is used today. What's your opinion, @pat-s?

Dropping memcached is a good idea; redis can do everything memcached is currently doing, and we would have to manage fewer dependencies. (Especially since this introduces redis + redis-ha as dependencies.)

pat-s commented 2023-04-18 20:02:52 +00:00 (Migrated from gitea.com)

> We could drop the memcached dependency and use redis by default. I mean, this PR adds both redis and redis-cluster as dependencies. For non-HA setups, the redis dependency can be used where memcached is used today. What's your opinion, @pat-s?

Yeah, I guess so. Also, having fewer decisions to make makes things easier for users.
The only "downside" is that the default deployment would then have 5-6 pods instead of 2 (gitea + memcached).

> Especially since this introduces redis + redis-ha as dependencies

I am not 100% sure, but atm I think both are HA-ready; the second option, i.e. redis-cluster, is just a different way of how the redis components/workers talk to each other and are set up in the first place. I'm only mentioning this because you labeled the second one as redis-ha, implying that the other one would not be HA-ready 🙂 (but please correct me if I'm wrong!)

We should probably use the "normal" redis as the default, as redis-cluster was not functional until recently and only just got fixed in Gitea HEAD. I did a backport, but this means one needs at least 1.19.2; to use 1.18 or older, one would need to use redis and not redis-cluster.

justusbunsi commented 2023-04-18 20:22:27 +00:00 (Migrated from gitea.com)

> I am not 100% sure, but atm I think both are HA-ready; the second option, i.e. redis-cluster, is just a different way of how the redis components/workers talk to each other and are set up in the first place. I'm only mentioning this because you labeled the second one as redis-ha, implying that the other one would not be HA-ready 🙂 (but please correct me if I'm wrong!)

Oops. I meant redis-cluster instead of redis-ha. Both redis variants can have multiple replicas. But when it comes to true high availability, I experienced a downtime with the non-cluster redis when updating the deployment and replacing the pods. With the cluster setup I did not experience that. That's why I drew the thin line between redis and redis-cluster in my previous comment.

For non-HA Gitea the common redis would work fine. Both dependencies have their use cases.

The only "downside" is that the default deployment would then have 5-6 pods by default instead of 2 (gitea + memcached).

Non-HA would have 3: 1xGitea, 1xPostgres, 1xRedis
HA would have: Gitea, Postgres, Redis(-cluster), MinIO. All replicated by at least 2. MinIO must have at least 3 or 4 to be stable. Not sure about the recommended replicas for the others.

Is that what you mean by 5-6 Pods?

pat-s commented 2023-04-19 07:41:30 +00:00 (Migrated from gitea.com)

> I experienced a downtime with the non-cluster redis when updating the deployment and replacing the pods. With the cluster setup I did not experience that.

Ah, that's interesting! I will double-check that, as I had it in my mind that I had no disconnects with either of them. But if that is the case, we could also think about only keeping redis-cluster, backporting the Gitea-core support to 1.18, and keeping the dependency list "small"?

EDIT: just tested again and did not face any disconnects when using redis with 1 master and 2 replicas while destroying/deleting one of the two Gitea replicas. Was your setup different in some way?

> Is that what you mean by 5-6 Pods?

I left minio out here as it's optional when a RWX volume keeps the storage (minio is probably more performant, but not mandatory). It's certainly recommended, but that's why I arrived at 5-6 (Gitea + redis). In addition, I forgot to count a PG-HA deployment, as I always work with an external Postgres.
So I'd see the following pod counts:

  • All: Gitea HA (2) + PG HA (2) + Redis Cluster (4) + Minio (4) + Meilisearch (2) = 14

  • Minimum: Gitea HA (2) + PG HA (2) + Redis Cluster (4) + Meilisearch (2) = 10

justusbunsi commented 2023-04-19 18:06:23 +00:00 (Migrated from gitea.com)

> EDIT: just tested again and did not face any disconnects when using redis with 1 master and 2 replicas while destroying/deleting one of the two Gitea replicas. Was your setup different in some way?

I had disconnects to Redis when actually upgrading Redis, not when replacing Gitea pods. IIRC, this was not the case when using redis-cluster. Although I could not test Gitea itself due to missing cluster support back then, so I tested it another way. But maybe I did something wrong.

> So I'd see the following pod counts:
>
>   • All: Gitea HA (2) + PG HA (2) + Redis Cluster (4) + Minio (4) + Meilisearch (2) = 14
>   • Minimum: Gitea HA (2) + PG HA (2) + Redis Cluster (4) + Meilisearch (2) = 10

It would be great if we keep the simplicity, even if Gitea is HA capable. Consider a new user, ready to get Gitea up and running in a Kubernetes cluster, trying to get their code self-hosted. What you need for that:

  • Gitea (1x)
  • Database (1x)
  • Redis (1x)

Everything else is optional, IMO. Especially repository code search.

Therefore, we should keep both redis dependencies, IMO. Or is it possible to "misuse" redis-cluster as a single-instance? In that case we could drop the redis dependency.

Additionally, I don't think we should add MinIO as a dependency. If I see it right, attachments and avatars would work within the RWX storage that is shared across the Deployment replicas. If someone wants to use MinIO for that, they should deploy it outside of this Helm Chart. MinIO itself is a massive setup: for a truly HA setup, you have to run at least 4 replicas in distributed mode (the absolute minimal requirement for leader election). The non-default production values set for that would be ~130 lines (no blank lines). This includes different buckets for avatars and attachments each. That's probably nothing we want to have in a Gitea Helm Chart that should be much more lightweight than those of unnamed alternatives 😉. Currently the proposed setup does not look simpler.

pat-s commented 2023-04-20 09:33:42 +00:00 (Migrated from gitea.com)

> I had disconnects to Redis when actually upgrading Redis, not when replacing Gitea pods. IIRC, this was not the case when using redis-cluster.

Ah! Did you have redis running in HA with multiple replicas? I just tried it and the problem seems to be the redis-master-0 pod. One can destroy the replicas without issues, but once the master pod goes down, Gitea has an issue.

There is experimental support for multiple masters in the chart (https://github.com/bitnami/charts/blob/main/bitnami/redis/README.md#multiple-masters-experimental).

Overall I guess it's just easier to go with redis-cluster only then and remove redis -> fewer choices and less work for both HA and non-HA.

> Everything else is optional, IMO. Especially repository code search.

Definitely agree. The question is only: do we want to provide a skeleton in the chart to "easily" add these additional components, or do we "just" provide documentation on how this can be done?

Or is it possible to "misuse" redis-cluster as a single-instance? In that case we could drop the redis dependency.

I guess we could just go with redis-cluster even for single-instance (see the values sketch below). Yes, it requires some additional pods, but users can then also easily go HA if they want to, and they don't need to switch between redis and redis-cluster.
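For reference, a hedged values sketch of what a slimmed-down redis-cluster dependency could look like (parameter names as documented by the Bitnami redis-cluster chart; exact defaults may differ):

```yaml
redis-cluster:
  enabled: true
  cluster:
    nodes: 3      # Bitnami default is 6; 3 masters is the minimum for a functioning cluster
    replicas: 0   # no replica per master, i.e. smallest possible (non-HA) footprint
  global:
    redis:
      password: gitea
```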

> Additionally I don't think we should add MinIO as a dependency. If I see it right, attachments and avatars would work within the RWX storage that is shared across the Deployment replicas.

I see your point. I am fine with not adding it and only documenting it as an alternative to the RWX storage.

> for a truly HA setup, you have to run at least 4 replicas in distributed mode (the absolute minimal requirement for leader election).

That's true, yes.

> The non-default production values set for that would be ~130 lines (no blank lines). This includes different buckets for avatars and attachments each.

Not sure if you always need distinct buckets for each, and in my test install right now I don't have 130 config lines, but again, I am fine with leaving minio out and just documenting it.

> That's probably nothing we want to have in a Gitea Helm Chart that should be much more lightweight than those of unnamed alternatives 😉.

I think this is also a question of how you look at it: "lightweight" in terms of defaults and/or resources but rich in terms of expansion options, or lightweight in terms of general capabilities and chart dependencies.

Overall I think we might make our life easier by not adding too many dependencies to the chart and instead just documenting which services one needs to provision additionally to go fully HA, and only adding redis-cluster as a required dependency, which would then also directly work for HA. Sounds like we all agree on this? If so, I would strip the PR down to only redis-cluster and just document all other optional HA deps.

(for testing purposes, it was still good/nice to be able to bootstrap a HA setup with all deps in this PR.)

justusbunsi commented 2023-04-20 20:36:16 +00:00 (Migrated from gitea.com)

> Ah! Did you have redis running in HA with multiple replicas? I just tried it and the problem seems to be the redis-master-0 pod. One can destroy the replicas without issues, but once the master pod goes down, Gitea has an issue.

I tested both the master/replica and the sentinel approach. In both cases the k8s service did not properly switch from the downscaled leader to a replica or sentinel, causing the disconnect.

> I guess we could just go with redis-cluster even for single-instance. Yes, it requires some additional pods, but users can then also easily go HA if they want to, and they don't need to switch between redis and redis-cluster.

Sounds great. Let's go this way. 👍

> > Additionally I don't think we should add MinIO as a dependency....
>
> I see your point. I am fine with not adding it and only documenting it as an alternative to the RWX storage.

+1 for documenting MinIO as an alternative.

> Overall I think we might make our life easier by not adding too many dependencies to the chart and instead just documenting which services one needs to provision additionally to go fully HA, and only adding redis-cluster as a required dependency, which would then also directly work for HA. Sounds like we all agree on this? If so, I would strip the PR down to only redis-cluster and just document all other optional HA deps.

A yes from my side. Stripping it down to redis-cluster with a simple templating skeleton to easily toggle HA and scale the necessary components properly will be a huge improvement for the Chart.

> The question is only: do we want to provide a skeleton in the chart to "easily" add these additional components, or do we "just" provide documentation on how this can be done?

Such a large-scale skeleton would require the other dependencies, but it would be neat in the long run. Maybe we should start with documenting them first. As soon as we have more stability in the chart (e.g. tests, automated e2e deploys, values integrity validation, renovate automation), it will become much easier to add new dependencies and handle such complexity.

> (for testing purposes, it was still good/nice to be able to bootstrap a HA setup with all deps in this PR.)

💯. I appreciate the time you already spent on this PR. It's awesome to see Gitea becoming a solid competitor to the well-known big players.
In other Charts I've seen example values for various use cases. Those, paired with documentation of the MinIO and Meilisearch options, may help users find the right setup for their needs. But such example values are nothing that needs to be done in a first step.

pat-s (Migrated from gitea.com) reviewed 2023-05-02 13:48:03 +00:00
@ -3,6 +3,26 @@
Expand the name of the chart.
*/}}
{{- /* multiple replicas assertions */ -}}
pat-s (Migrated from gitea.com) commented 2023-05-02 13:48:03 +00:00

@justusbunsi I've found a neat (but hacky) solution to apply some assertions. Unfortunately helm does not provide a native way to check for certain incompatibilities of settings - or at least I couldn't find one.

I've tested these and they work as expected. I think in the long term these checks can save quite a few users from faulty HA deployments, as they specifically look for settings that are incompatible in an HA setup.

What do you think?
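For readers following along, a minimal sketch of the kind of assertion meant here (the concrete check and message are hypothetical; the real checks live in the chart's templates):

```yaml
{{- /* Sketch: fail fast on an HA-incompatible setting when more than one replica is requested. */ -}}
{{- if gt (int .Values.replicaCount) 1 }}
{{- if eq (dig "session" "PROVIDER" "" (.Values.gitea.config | default dict)) "memory" }}
{{- fail "replicaCount > 1 requires a shared session store (e.g. redis-cluster); PROVIDER=memory is not HA-safe" }}
{{- end }}
{{- end }}
```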

pat-s (Migrated from gitea.com) reviewed 2023-05-02 13:51:45 +00:00
pat-s (Migrated from gitea.com) commented 2023-05-02 13:51:45 +00:00

@justusbunsi I've exposed these settings in values.yaml as these are their actual defaults. They are applied conditionally in helpers.tpl (https://gitea.com/gitea/helm-chart/src/branch/main/templates/_helpers.tpl#L262) without clearly exposing them to the user.

With these settings being defined, we could also remove the condition in helpers.tpl?

pat-s commented 2023-05-02 14:15:57 +00:00 (Migrated from gitea.com)

@justusbunsi @luhahn I wonder if we should also swap postgres for postgres-ha, similar to how we are switching to redis-cluster instead of redis.

This would make a functional HA deployment easier right from the start for new users as they would not need to migrate from a single-instance PG deployment to a HA-ready PG one (and only move storage).

The downside is that it would be a hard break for all current users as they would be forced to migrate their DB. We could also have both side by side but then we are again entering "maintenance hell" to some degree.

This PR is breaking in so many ways (cache dependency, PVC creation, statefulset -> deployment) that introducing another breaking change would arguably not be too disruptive.

What are your thoughts?


Overall I am pretty much done (aside from the PG discussion above) and everything works "fine" so far in all of my tests. The biggest challenge will most likely be to provide migration instructions to users and also test this ourselves.

pat-s commented 2023-05-13 19:55:50 +00:00 (Migrated from gitea.com)

@justusbunsi

I made a brave move and switched our production instance to this branch (ofc after having tested in dev before) and I can report that everything seems to work well so far.

I am running with the following setup now

  • AWS EKS 1.26
  • 3 replicas
  • AWS RDS Postgres 14.x
  • AWS Elasticache Redis 7.x
  • AWS EFS RWX
  • Issue indexer: 'db'
  • Repo indexer: disabled

The transition was not that complicated; I only had to back up and move /data to the new RWX PVC.

This PR also adds topologySpreadConstraints, which help to distribute the replicas across nodes (see the values excerpt below).
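As an illustration, a hedged values excerpt for such a constraint (the label selector and the exact placement of the key are assumptions; the field names follow the upstream Kubernetes API):

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # spread replicas across nodes
    whenUnsatisfiable: ScheduleAnyway     # prefer spreading, but do not block scheduling
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: gitea     # assumed pod label
```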

Site operations are slower now due to EFS compared to the former RWO but that was expected. And is nothing Gitea can do anything about.

I have added a note about the required PV transition. Users will need to specify existingClaim, as otherwise the chart will (probably) create a new PVC due to the new resource.

Last, I'd like to point out again that I added a few assertions in config.yaml which apply some checks during startup and should prevent erroneous configurations. AFAICS there is no other, more elegant way to do this in charts at the moment.

Currently there are no blockers for merging this. Meilisearch will be supported with Gitea > 1.20, but 'db' for the ISSUE_INDEXER works in the meantime.

The only open question is whether we should switch from postgres to postgres-HA as the former might cause issues when going HA. And given that we have already opted for redis-cluster by default, I think we should also switch to postgres-HA.
For backward compatibility I think we need to keep the normal postgres chart as a dependency for a bit so that users do not have to make a forced DB transition.
My suggestion would be to have postgres-ha as the default and remove the deprecated postgres dependency (which is off by default) at some point in the future.

lunny commented 2023-05-15 06:10:01 +00:00 (Migrated from gitea.com)

Ref: https://github.com/go-gitea/gitea/issues/13791

There are still some problems when running in HA mode.

pat-s commented 2023-05-15 14:25:09 +00:00 (Migrated from gitea.com)

Good to know, but without checking how it goes in a running instance, we won't know whether they are "real" issues or just "would be good to have/clean up" 🙂

The good news is that the first day(s) have been without issues in my org - haven't checked on the CRON things yet though.

pat-s commented 2023-07-17 06:22:11 +00:00 (Migrated from gitea.com)

Merging after confirmation from within Discord.

techknowlogick (Migrated from gitea.com) approved these changes 2023-07-17 06:24:25 +00:00
lunny (Migrated from gitea.com) approved these changes 2023-07-17 13:07:31 +00:00