nhsx / open-source-policy

Open Source Policy development for the NHS

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Database schema

ChrisBeeley opened this issue · comments

When sharing analytic code it's important to establish the extent to which it is safe or desirable to share database schema (raw SQL statements, column names, DB and table names...)

I have been told that we cannot share material of this sort because a) sharing DB and table names poses a security risk b) database vendors hold copyright of schema and names, although these objections appear to be contradicted by other government guidance

I can't find this addressed in the policy or any of the linked documents. It would be very useful to have this explicitly addressed in this guidance

Absolutely, it's not addressed here (yet), and it would be useful to include that kind of detail. I fully agree with the CDDO guidance but I'll leave this issue open while we investigate the copyright concern further.

Hi @ChrisBeeley,

Do you think we could tackle some of those challenges head on with a view to debunking them? I've encountered similar pushback where the risk being highlighted is scary enough to block progress but also vague enough that you can't make sensible decisions about risk.

Could we trace back to the source of the objection around table and column names and try to pin that down (within scope of NHS)?

Hi @ChrisBeeley,

Do you think we could tackle some of those challenges head on with a view to debunking them? I've encountered similar pushback where the risk being highlighted is scary enough to block progress but also vague enough that you can't make sensible decisions about risk.

Could we trace back to the source of the objection around table and column names and try to pin that down (within scope of NHS)?

Absolutely. A definitive statement on security seems relatively straightforward and I would hope that it is added to this guidance in due course.

As for the copyright issue, somebody needs to confront the vendors about this and get them to make a statement on it one way or the other. I tried and was met with a wall of silence. It's a simple question with a simple answer and the vendors need to play fair and answer it

Hi folks, if you'd like to work together to propose a security statement on this topic I can check it through and find the right space for it next week?

Hi @otlah, @ChrisBeeley

I'll try to get hold of someone from security who can give a steer.

While table names should never be risky, I could imagine a situation where the CNAME address for a database might be risky. Security tends to take the onion philosophy - so the more protection the better. We just need to be able to identify a sensible balance of risk versus value in sharing code.

For that type of environment configuration though (database address), our open code guidance should explain how to avoid storing those things in open code. E.g. using a secrets manager or using environment variables. That type of thing would go along other code security practices like running githooks to avoid credentials being pushed to git and using .gitignore files.

The value of open code comes from seeing the business logic - not the database details. So it is no loss to mask that info.

Indeed, yes, I have no problem with running odbcConnect(Sys.getenv("DB_NAME")) or whatever it is it's column names that are problematic. Short of renaming the entire table they can pop up throughout the DB and analytic code. It's hard to believe that there would be any problems caused by my publishing time_from_referral = df$date_seen - df$date_referred (assuming they were real column names) but we have mountains of unpublished code that looks just like that

Hi @ChrisBeeley @otlah
I've not been able to get hold of a security person for this unfortunately.

I have seen that the Goldacre report is out and sets out a very strong expectation that code should be open: https://www.gov.uk/government/publications/better-broader-safer-using-health-data-for-research-and-analysis/better-broader-safer-using-health-data-for-research-and-analysis

But is itself frustratingly short on detail

That's true - but probably expected for a report like this. The report does waggle its eyebrows suggestively at the right organisations to get involved in setting out the specifics.

E.g. governance people
image

and the ICO
image

Perhaps we could convene a wider group to focus on this draft policy and try to elaborate on some of the missing pieces?

Indeed, yes. Not a criticism of the report, which I think is great. Just saying we need a different approach here. A technocratic, this is the answer, tablets of stone, job done type thing.

100% up for convening a "detail" group, you know where to find me

I'd also be more than happy to contribute to a group like this.

Hi folks, we've still got a couple of major groups to get feedback from (including in UK research funding), but once that feedback's in we're intending to gather a list of issues that have to be solved before we can publish a version 1.0, and a list that we can take longer over to unstick. I'm stoked to have a bunch of volunteers to help populate those lists!

In OpenSAFELY, we got permission from our two main vendors (EMIS and TPP) to publish their schemas (explicitly, and/or implicitly via code). So there's a good precedent, at least.

At HDR UK we have technical metadata (database schemas) for 472 out of 766 datasets (61%) that we have been publishing since 2020 on the Innovation Gateway - https://web.www.healthdatagateway.org/search?search=&technicaldetails=Contains+Technical+Metadata&tab=Datasets

In OpenSAFELY, we got permission from our two main vendors (EMIS and TPP) to publish their schemas (explicitly, and/or implicitly via code). So there's a good precedent, at least.

My Trust does use a TPP database, so that might be an angle for me to try @sebbacon . Don't suppose the permission is published anywhere, or generalisable, or anything I can use to try to push things on where I am?

At HDR UK we have technical metadata (database schemas) for 472 out of 766 datasets (61%) that we have been publishing since 2020 on the Innovation Gateway - https://web.www.healthdatagateway.org/search?search=&technicaldetails=Contains+Technical+Metadata&tab=Datasets

This is amazing. I had no idea this existed. I need to go and read this carefully. Thanks!