knative-extensions / eventing-natss

NATS streaming integration with Knative Eventing.

Broker field `spec.delivery.retry` is ignored

norbjd opened this issue

Describe the bug
When configuring a broker with fields under spec.delivery (retry, backoffPolicy, backoffDelay, ...), those fields seem to be ignored. There are no retries if the Trigger fails to deliver the message to the subscriber. In my case, the subscriber is a ksvc, and there are no retries if that ksvc returns an error.

Expected behavior
When configuring a broker with fields under spec.delivery, the message should be redelivered (multiple retries) if the receiver (ksvc) returns an error.

To Reproduce

  1. Have a fresh Kubernetes cluster with Knative Serving and Eventing installed via YAML files (version 1.4.0).

  2. Install components to work with NATS (natsjsm.yaml, channel messaging layer eventing-natss.yaml, broker layer mt-channel-broker.yaml):

# https://github.com/knative-sandbox/eventing-natss/blob/release-1.4/config/broker/README.md
kubectl apply -f https://raw.githubusercontent.com/knative-sandbox/eventing-natss/knative-v1.4.0/config/broker/natsjsm.yaml

# https://knative.dev/docs/install/yaml-install/eventing/install-eventing-with-yaml/#optional-install-a-default-channel-messaging-layer
kubectl apply -f https://github.com/knative-sandbox/eventing-natss/releases/download/knative-v1.4.0/eventing-natss.yaml

# https://knative.dev/docs/install/yaml-install/eventing/install-eventing-with-yaml/#optional-install-a-broker-layer
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.4.0/mt-channel-broker.yaml
  3. Use NatsJetStreamChannel as the default channel:
kubectl patch configmap/config-br-default-channel \
  --namespace knative-eventing \
  --patch '{"data":{"channel-template-spec": "apiVersion: messaging.knative.dev/v1alpha1\nkind: NatsJetStreamChannel"}}'
  4. Deploy a simple ksvc (kn service create error-service --image=myimage --port=8080) that always returns a 503 error, to test how delivery failures are handled. The code is the following:
package main

import (
	"log"
	"net/http"
)

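// handler rejects every request with a 503 so that each delivery fails.
func handler(w http.ResponseWriter, req *http.Request) {
	log.Println("It does not work, returning 503")
	http.Error(w, "Does not work", http.StatusServiceUnavailable)
}

func main() {
	http.HandleFunc("/", handler)
	// ListenAndServe only returns on failure, so log the error and exit.
	log.Fatal(http.ListenAndServe(":8080", nil))
}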
  5. Create a Broker, a Trigger pointing to that ksvc, and a PingSource to send an event every minute:
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  namespace: default
  annotations:
    eventing.knative.dev/broker.class: MTChannelBasedBroker
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: config-br-default-channel # NatsJetStreamChannel defined in the ConfigMap
    namespace: knative-eventing
  delivery:
    retry: 5
    backoffPolicy: exponential
    backoffDelay: "PT1S"
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: my-service-trigger
  namespace: default
spec:
  broker: default
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: error-service
---
apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: ping
  namespace: default
spec:
  schedule: "* * * * *"
  contentType: "application/json"
  data: '{"msg": "ping"}'
  sink:
    ref:
      apiVersion: eventing.knative.dev/v1
      kind: Broker
      name: default
  6. Wait for the logs of error-service:
2022/05/29 14:22:00 It does not work, returning 503
2022/05/29 14:23:00 It does not work, returning 503
2022/05/29 14:24:00 It does not work, returning 503

As you can see from the logs of the error-service pod, there is no sign that the message is sent again after an error: there is one log line per minute, not 1 + 5 retries as defined in the Broker spec.delivery.retry field.

Knative release version
v1.4.0

Additional context
I have also tried with NatssChannel (https://github.com/knative-sandbox/eventing-natss/blob/release-1.4/config/broker/README.md#1-nats-streaming-deprecation-notice-) and got the same result.

The docs (https://github.com/knative-sandbox/eventing-natss/blob/knative-v1.4.0/config/README.md#nats-streaming-channels) state:

If downstream rejects an event, that request is attempted again.

And the Knative docs state that NATS Channels do not support any of the delivery fields (https://knative.dev/docs/eventing/event-delivery/#channel-support):

Nats Channel does not support delivery fields

So I don't know if this is the normal behavior or if I'm doing something wrong or misunderstanding something.

But what I would like to do is retry the PingSource event if the ksvc returns an error.

Thanks for your help 🙌

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

/remove-lifecycle stale

MTChannelBasedBroker retries are delegated to the Channel implementation, so it depends on the implementation.

I see #376 is merged, maybe @astelmashenko can clarify where things are on the retry front for the Nats Channel ?

@pierDipi ,

maybe @astelmashenko can clarify where things are on the retry front for the Nats Channel ?

Retries should be working, do you mean I need to test if it's working with retries configured?

@astelmashenko yes, please check whether you can reproduce this issue with the newer NATS channel versions.

@pierDipi, yes, it is working properly now on version 1.3.5. I see that this issue is for 1.4, where there are no fixes.

So, are you planning to port the fix to any 1.4+ version?

I can create an MR for 1.4/1.5 from #376 as a patch. I'm not sure I will have time to test the 1.4/1.5 versions locally.

@pierDipi, should I just cherry-pick #376, or all the latest changes up to 1.3.5, which include the port of PR #267?
