biocore / qurro

Visualize differentially ranked features (taxa, metabolites, ...) and their log-ratios across samples

Home Page: https://biocore.github.io/qurro

Escape brackets/periods/backslashes/quotes in input rank IDs and sample metadata fields, and in any "inputs" to plot JSONs

fedarko opened this issue

Apparently Vega treats these characters specially. See this page for context.

This is causing a problem with the rank names in the Byrd data example -- trying to switch to a rank that isn't "Intercept" brings up an error.

I guess we have to apply this not only to each column but to every possible string that's passed to Vega: every feature/sample ID, augmented feature ID, sample metadata field, and probably more. Sheesh.

Ideally, we should have tests that verify that our measures to keep Vega from misinterpreting these strings actually work (#2).

Note: you can escape these either with a ton of backslashes or by enclosing the field names in square brackets. The latter sounds easier.
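A minimal sketch of the backslash approach, in Python. The helper name `escape_field` is hypothetical (it is not part of Qurro), and this assumes that only periods and square brackets need escaping; it does not handle backslashes or quotes already present in the name.

```python
import re

# Characters assumed to be treated as path syntax inside a field name.
_SPECIAL = re.compile(r"([.\[\]])")


def escape_field(name: str) -> str:
    """Put a backslash in front of every period and square bracket.

    Hypothetical helper for illustration only.
    """
    return _SPECIAL.sub(r"\\\1", name)


print(escape_field("Intercept"))
print(escape_field("C(Timepoint)[T.B]"))
```

Names without any of these characters (like "Intercept") pass through unchanged, which would match the behavior described above where only the non-"Intercept" ranks break.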

Note: related to vega/vega-lite#4965

Note: should also ensure that field names (when passed into the plot JSONs, e.g. for things like setting an encoding field of the sample plot's color or setting the encoding field of the rank plot's y-axis) are escaped in JS via something like vega.stringValue().

So I think that due to our use of json.dump(), we shouldn't have to worry about most of these aside from the Vega-Lite-specific ones (periods and brackets). But again, it's still a good idea to be sure.
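As a quick check of that claim, here is a small sketch using only the standard `json` module. The example ID is made up for illustration. Quotes and backslashes get escaped in the serialized text and survive a round trip, while periods and brackets are written out untouched (so anything Vega-Lite does with them is a separate problem).

```python
import json

# Made-up ID containing every character class mentioned in this issue.
rank_id = "C(Timepoint, Treatment('F'))[T.B] \"quoted\" back\\slash"

encoded = json.dumps({"field": rank_id})
decoded = json.loads(encoded)

# JSON serialization is lossless for these characters.
assert decoded["field"] == rank_id

# Periods and brackets are not altered by json.dumps().
assert "[T.B]" in encoded

print(encoded)
```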

If we want to be 100% safe, we'll need to escape all of the following:

  • Rank IDs
  • Feature IDs
  • Sample IDs
  • Sample Metadata IDs
  • Feature Metadata IDs

In practice, I'm not sure that this is necessary for feature metadata IDs, feature IDs, or sample IDs (since I've used periods in these IDs before without issue). I think json.dump takes care of those. The main issue seems to be with fields that end up being set as an axis/encoding/etc. in Vega/Vega-Lite (e.g. ranks).

Still worth adding lots of test cases that verify that this all works as intended.

ahsdfiusdoifjsdfioj

so it looks like even if you escape a rank ID properly for the axis stuff, you still need to use the non-escaped ID in the underlying dataset???? bluhg
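To illustrate the mismatch, here is a sketch of a toy Vega-Lite-style spec built as a Python dict. This is not Qurro's actual plot JSON: the column names, values, and mark type are made up, and the escaping is done with plain backslashes as an assumption. The point is only that the data rows keep the raw ID as their key, while the encoding's `field` carries the escaped form.

```python
import json

raw_id = "C(Timepoint)[T.B]"

# Assumed escaping: backslash before each period and square bracket.
escaped_id = (
    raw_id.replace(".", "\\.").replace("[", "\\[").replace("]", "\\]")
)

spec = {
    "data": {
        "values": [
            # The dataset uses the raw, unescaped ID as the column name.
            {"Feature ID": "f1", raw_id: 1.5},
            {"Feature ID": "f2", raw_id: -0.3},
        ]
    },
    "mark": "bar",
    "encoding": {
        "x": {"field": "Feature ID", "type": "nominal"},
        # The encoding refers to that column by its escaped name.
        "y": {"field": escaped_id, "type": "quantitative"},
    },
}

print(json.dumps(spec, indent=2))
```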

@mortonjt small question: is preserving the patsy formulas in rank IDs (e.g. C(Timepoint, Treatment('F'))[T.B] in the Byrd data) helpful when looking at the ranks? It looks like periods, brackets, and quotes all cause problems when you pass them into Vega-Lite as field IDs.

I've implemented a basic solution that converts periods to colons and square brackets to parentheses (along with filtering out quotes and backslashes). This takes care of the problem for now, but if you think it's worth it I can come back to this later (probably after exams are over) and add back in support for some of these weird characters.
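For reference, the conversion described above can be sketched roughly like this. The function name `sanitize_id` is hypothetical and this is not necessarily how the actual implementation is written; it just applies the same four rules (periods to colons, square brackets to parentheses, quotes and backslashes dropped).

```python
# Translation table: map or delete each problematic character.
_TRANSLATION = str.maketrans(
    {
        ".": ":",
        "[": "(",
        "]": ")",
        "'": None,   # None means "delete this character"
        '"': None,
        "\\": None,
    }
)


def sanitize_id(raw: str) -> str:
    """Replace or strip characters that cause trouble as field IDs."""
    return raw.translate(_TRANSLATION)


print(sanitize_id("C(Timepoint, Treatment('F'))[T.B]"))
# prints: C(Timepoint, Treatment(F))(T:B)
```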

note to self: if we go with the solution of filtering out/converting certain special characters in IDs, ensure that they're still unique afterwards.
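A sketch of what that uniqueness check could look like. Both function names are hypothetical, and the `sanitize` function here is a stand-in that only covers the period-to-colon rule. The idea is to group the raw IDs by their sanitized form and report any group with more than one member.

```python
from collections import defaultdict


def sanitize(raw: str) -> str:
    """Stand-in sanitizer for this example: periods become colons."""
    return raw.replace(".", ":")


def find_collisions(raw_ids, sanitize_fn):
    """Return {sanitized ID: [raw IDs]} for every sanitized ID that
    more than one raw ID maps to."""
    groups = defaultdict(list)
    for raw in raw_ids:
        groups[sanitize_fn(raw)].append(raw)
    return {clean: raws for clean, raws in groups.items() if len(raws) > 1}


collisions = find_collisions(["A.B", "A:B", "Intercept"], sanitize)
print(collisions)
# prints: {'A:B': ['A.B', 'A:B']}
```

If the returned dict is non-empty, the input could be rejected with an error listing the clashing IDs, rather than silently merging two ranks into one column.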