helm / chartmuseum

helm chart repository server

Home Page: https://chartmuseum.com

PER_CHART_LIMIT name prefix filtering deletes unrelated charts

poblish opened this issue

We've noticed a strange issue where charts are - for some reason we can't determine - disappearing from our S3 bucket.

Versions:

  • ChartMuseum 0.16.0 / Helm chart 3.10.1
  • K8s 1.27.x

We're using "<service_name>" and "<service_name>-" as the names of the charts we upload, to distinguish production charts from test charts, and to cap the number of charts within each we have PER_CHART_LIMIT set to 10. We know that this mechanism is working fine: we've watched the oldest of 10 charts get deleted before the newest one is stored for a particular service/branch. All good.
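
For reference, the behaviour we expect from PER_CHART_LIMIT is sketched below as a small illustrative Go program (not ChartMuseum's actual code; the type and function names are made up): group stored charts by their exact name, and delete only the oldest versions of that same name once the limit is exceeded.

package main

import (
    "fmt"
    "sort"
)

// storedChart is an illustrative stand-in for an object in the bucket.
type storedChart struct {
    Name    string // exact chart name, e.g. "clients" or "clients-dev"
    Version int    // simplified to a build number for this sketch
}

// toDelete returns the charts that should be removed so that at most
// limit versions of exactly this chart name remain. Note the exact
// name comparison: "clients-dev" must never be touched by "clients".
func toDelete(objects []storedChart, name string, limit int) []storedChart {
    var same []storedChart
    for _, o := range objects {
        if o.Name == name {
            same = append(same, o)
        }
    }
    sort.Slice(same, func(i, j int) bool { return same[i].Version < same[j].Version })
    if len(same) <= limit {
        return nil
    }
    return same[:len(same)-limit] // the oldest versions beyond the limit
}

func main() {
    objects := []storedChart{{"clients", 1}, {"clients", 2}, {"clients-dev", 7}}
    fmt.Println(toDelete(objects, "clients", 1)) // [{clients 1}]; "clients-dev" is untouched
}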

What is different is that when we check, say, "clients-dev", the folder is unexpectedly empty. This seems to be completely independent of the above mechanism; e.g. I doubt there have ever been 10 charts in that particular folder.

So are there any other expiry mechanisms working behind the scenes? Eyeballing the source suggests not, but I can't be 100% sure.

This clearly isn't related to "age" of chart per se, as the very first / oldest charts we added are still present. The deleted charts are likely much younger.

We're looking to see if we can get an audit log enabled to track the deletes. Before you ask, I can pretty much 100% rule out manual deleting...

I set DEBUG to true, but that doesn't give us any additional visibility of deletes.

How common is this? Not common, no. The two occasions this was noticed were 21 days apart, but each time several charts were found to be missing. I can't be any more specific about timings without the audit log.


Addendum: we don't have any lifecycle rules on the S3 bucket

You can unset PER_CHART_LIMIT and see if the issue still exists. But yeah, the auto-expiration does not produce DEBUG or other audit logs.

I'd rather not do that, as it's weakening one of the features that I'm relying on with ChartMuseum, and it could easily be weeks (3 last time) until we see the issue again, in which case I'll need to perform manual cleanups.

... unless there's some reason to believe this may be related? It's not clear to me that it is, since we have a number of charts that are quite happy with exactly the right number of versions retained (PER_CHART_LIMIT = 10).

Is it possible that there's some kind of race condition in this area of the code that might (say) cause an unrelated entry to be expired when a different entry hits its limit? My argument against that would be that we're generally not pushing chart versions concurrently, but I wouldn't rule that out 100%.

I deployed a fork of the binary with added logging of all delete attempts, and pushed a lot of main builds.

Of the 15 builds that triggered PER_CHART_LIMIT handling, 9 correctly removed the oldest <service>, while the other 6 removed an unrelated branch's chart: <service>-<something else>

This suggests to me that exact matching of the name is not being used to identify deletion candidates, and that some other kind of matching (prefix? regex?) is being used instead, so our services and builds are not actually isolated. Presumably there is the potential for service builds to delete the charts of other services, not just other branches of the same service, regardless of their age or count. Seems alarming!?
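
To illustrate the suspicion (the chart names below are just examples, and this is not the actual ChartMuseum code): with prefix matching, an unrelated chart whose name merely starts with the same string becomes a deletion candidate.

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Two objects belonging to two *different* charts.
    objects := []string{
        "payment-244.tgz",
        "payment-queries-myunrelatedbranch-244.tgz",
    }

    // Prefix matching on the chart name "payment" picks up both of them,
    // so pruning "payment" can delete the other chart's versions too.
    for _, o := range objects {
        fmt.Printf("%s matches prefix %q: %v\n", o, "payment", strings.HasPrefix(o, "payment"))
    }
}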

I'd need to check the code for how this actually happens.

Actually, it's clear in the code that this is prefix-based matching. This 100% explains what we've been seeing, where an update to a payment service can cause a payment-queries-myunrelatedbranch chart to get deleted.

Was this deliberate? Presumably, this was done because the Storage API returns object names like payment-queries-myunrelatedbranch-244, and checking the prefix avoids the need to parse and extract the version number. Probably what you really want to do is strip everything past the final - => payment-queries-myunrelatedbranch

Then you should be able to do an exact match of the chart name and the object name. That would solve our issue completely.
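
A minimal sketch of that proposal (the helper name is hypothetical, and it assumes the version itself never contains a "-", which holds for build numbers like 244 but not for e.g. 1.0.0-rc1):

package main

import (
    "fmt"
    "strings"
)

// nameFromObject strips the trailing "-<version>.tgz" from a stored object
// name. Hypothetical helper; assumes the version contains no "-".
func nameFromObject(object string) string {
    trimmed := strings.TrimSuffix(object, ".tgz")
    i := strings.LastIndex(trimmed, "-")
    if i < 0 {
        return trimmed
    }
    return trimmed[:i]
}

func main() {
    name := nameFromObject("payment-queries-myunrelatedbranch-244.tgz")
    fmt.Println(name == "payment")                           // false: no longer a deletion candidate for "payment"
    fmt.Println(name == "payment-queries-myunrelatedbranch") // true: exact match for the right chart
}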

UPDATE: out of interest, why does the code list all objects, pulling all that info back into memory only to throw 99% of it away, rather than using a filter on the AWS API call? Or alternatively, it should be easy to work out precisely which storage object to delete from the in-memory index data, without needing to load anything at all (if the object has gone missing from storage, meh, so be it.)
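
For the S3 case specifically, a hedged sketch of what a server-side filter could look like with aws-sdk-go-v2 (the bucket and prefix are placeholders; an exact chart-name check on each returned key would still be needed, and other backends would need their own equivalent):

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        log.Fatal(err)
    }
    client := s3.NewFromConfig(cfg)

    // List only the objects under the chart's prefix instead of the whole
    // bucket. A prefix alone still over-matches (that is the bug above), so
    // each key must still be checked for an exact chart-name match.
    out, err := client.ListObjectsV2(context.TODO(), &s3.ListObjectsV2Input{
        Bucket: aws.String("my-chart-bucket"),           // placeholder
        Prefix: aws.String("payment-queries-mybranch-"), // placeholder prefix
    })
    if err != nil {
        log.Fatal(err)
    }
    for _, obj := range out.Contents {
        fmt.Println(*obj.Key)
    }
}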

UPDATE 2: we're now running a version with the above change, and it does just what we want. I can raise a PR if it's of interest, but I totally get that the fix might not necessarily be appropriate for all storage types 🤷

At least it wasn't a race condition! 😃

👋 @poblish nice work! thanks for looking into this

out of interest, why does the code list all objects, pulling all that info back into memory only to throw 99% of it away, rather than using a filter on the AWS API call? Or alternatively, it should be easy to work out precisely which storage object to delete from the in-memory index data, without needing to load anything at all (if the object has gone missing from storage, meh, so be it.)

This sounds like a good optimization. It just depends on whether all of the storage backends support the kind of filtering we'd need. They likely do, but we'd need to research and implement the filtering per storage backend.
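
One hedged way to stage that work (purely a design sketch; these are not the actual chartmuseum/storage types): make server-side filtering an optional capability that a backend can advertise, with a fallback to listing everything and filtering in memory.

package storage

import "strings"

// Object and Backend are illustrative stand-ins for a storage abstraction.
type Object struct {
    Path string
}

type Backend interface {
    ListObjects() ([]Object, error)
}

// PrefixLister is an optional capability: backends that can filter on the
// server side (e.g. S3 ListObjectsV2 with a Prefix) implement it.
type PrefixLister interface {
    ListObjectsWithPrefix(prefix string) ([]Object, error)
}

// listForPrefix prefers server-side filtering when available and otherwise
// falls back to listing everything and filtering in memory.
func listForPrefix(b Backend, prefix string) ([]Object, error) {
    if pl, ok := b.(PrefixLister); ok {
        return pl.ListObjectsWithPrefix(prefix)
    }
    all, err := b.ListObjects()
    if err != nil {
        return nil, err
    }
    var filtered []Object
    for _, o := range all {
        if strings.HasPrefix(o.Path, prefix) {
            filtered = append(filtered, o)
        }
    }
    return filtered, nil
}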

Was this deliberate? Presumably, this was done because the Storage API returns object names like payment-queries-myunrelatedbranch-244, and checking the prefix avoids the need to parse and extract the version number. Probably what you really want to do is strip everything past the final - => payment-queries-myunrelatedbranch

I'm not totally sure, but I agree the prefix matching here causes problems.

It looks like we normalize the filenames before uploading the charts, so parsing the exact chart name should work for all storage backends:

func ChartPackageFilenameFromContent(content []byte) (string, error) {
    chart, err := chartFromContent(content)
    if err != nil {
        return "", err
    }
    meta := chart.Metadata
    filename := fmt.Sprintf("%s-%s.%s", meta.Name, meta.Version, ChartPackageFileExtension)
    return filename, nil
}
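
For instance (the metadata values and extension here are assumed for illustration), the normalized filename comes out as <name>-<version>.<ext>:

package main

import "fmt"

func main() {
    // Mirrors the Sprintf above, with assumed metadata values and extension.
    name, version, ext := "payment-queries-myunrelatedbranch", "244", "tgz"
    fmt.Printf("%s-%s.%s\n", name, version, ext) // payment-queries-myunrelatedbranch-244.tgz
}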

wdyt @scbizu?

@poblish Thank you for pointing out the root cause :).

Was this deliberate? Presumably, this was done because the Storage API returns object names like payment-queries-myunrelatedbranch-244, and checking the prefix avoids the need to parse and extract the version number. Probably what you really want to do is strip everything past the final - => payment-queries-myunrelatedbranch

It seems that the prefix matching causes the bug here, but this kind of parsing logic is already implemented in the codebase:

for idx := lastIndex; idx >= 1; idx-- {
    if _, err := strconv.Atoi(string(parts[idx][0])); err == nil { // see if this part looks like a version (starts w int)
        version = strings.Join(parts[idx:], "-")
        name = strings.Join(parts[:idx], "-")
        break
    }
}

I want to reuse this kind of logic rather than add a new block of logic that implements the same thing.
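
For example, a sketch of how that existing loop could back an exact-name comparison (the wrapper function is hypothetical; the loop body is the same heuristic as the snippet above, i.e. the version is assumed to start with a digit):

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// chartNameFromFilename is a hypothetical wrapper around the loop quoted
// above: walk the "-"-separated parts from the right and treat the first
// part that starts with a digit as the beginning of the version.
func chartNameFromFilename(filename string) (name, version string) {
    trimmed := strings.TrimSuffix(filename, ".tgz")
    parts := strings.Split(trimmed, "-")
    lastIndex := len(parts) - 1
    name = trimmed // fallback if nothing version-like is found
    for idx := lastIndex; idx >= 1; idx-- {
        if _, err := strconv.Atoi(string(parts[idx][0])); err == nil { // part looks like a version (starts with an int)
            version = strings.Join(parts[idx:], "-")
            name = strings.Join(parts[:idx], "-")
            break
        }
    }
    return name, version
}

func main() {
    name, version := chartNameFromFilename("payment-queries-myunrelatedbranch-244.tgz")
    fmt.Println(name, version)     // payment-queries-myunrelatedbranch 244
    fmt.Println(name == "payment") // false: exact matching keeps the charts isolated
}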

/cc @cbuto