biocore / empress

A fast and scalable phylogenetic tree viewer for microbiome data analysis

Document QIIME 2 metadata merging complications

fedarko opened this issue

When multiple sample* / feature metadata files are provided to Empress through QIIME 2, they are merged so that only the IDs present in every metadata file are kept. See here for details.

The problem is that this can sharply reduce the amount of metadata passed to Empress. For example, in the Q2 tutorial data, feature_importance.qza contains only 566 features, while taxonomy.qza contains 770. Passing both to Empress therefore "removes" taxonomy data for many features, making taxonomy coloring look much sparser.
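To make the effect concrete, here is a minimal pandas sketch. The feature IDs, column names, and values are all made up, and this is only an illustration of an intersection-style merge versus a union-style merge, not a description of how QIIME 2 implements merging internally.

```python
import pandas as pd

# Toy stand-ins for taxonomy and feature importance metadata.
# Feature IDs and values are invented for illustration.
tax = pd.DataFrame(
    {"Taxon": ["k__Bacteria; p__A", "k__Bacteria; p__B", "k__Bacteria; p__C"]},
    index=["f1", "f2", "f3"],
)
fi = pd.DataFrame({"importance": [0.7, 0.1]}, index=["f1", "f3"])

# Intersection of IDs: features missing from either table are dropped.
inner = pd.concat([tax, fi], axis=1, join="inner")

# Union of IDs: every feature is kept, and gaps become NaN.
outer = pd.concat([tax, fi], axis=1, join="outer")

print(len(inner), "feature(s) after the intersection-style merge")
print(len(outer), "feature(s) after the union-style merge")
```

In the intersection-style result, the taxonomy for f2 is lost even though it was present in the taxonomy table. This is the sparsity problem described above.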

Since it might be a while until there is built-in QIIME 2 support for other merging methods, in the interim we should ideally:

  1. Update the README to mention this problem
  2. Add an example Python script (or similar) for merging metadata files that users can easily start from

For task 2, here is a rough transcript of the code I used to merge the feature metadata files in this directory:

import pandas as pd

# Load each tab-separated results file, using the first column (feature IDs) as the index
aldex = pd.read_csv("aldex2_results.txt", sep="\t", index_col=0)
sb = pd.read_csv("differentials.csv", sep="\t", index_col=0)
ancom = pd.read_csv("ancom_results_mixed.csv", sep="\t", index_col=0)

# Remove leading Xs added by R
# (note: this also strips the X from any ID that legitimately starts with one)
aldex.index = [i[1:] if i.startswith("X") else i for i in aldex.index]
# Join on feature ID; features missing from a file get NaNs in that file's columns
diff = pd.concat([sb, aldex, ancom], axis=1)
# Replace NaNs with empty strings
diffe = diff.fillna("")
# Make the name of the index column "valid" for QIIME 2
diffe.index.name = "FeatureID"

diffe.to_csv("merged_diffabund.tsv", sep="\t")

Should be decent enough.
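As a possible starting point for the example script in task 2, the transcript above could be wrapped in a small helper. This is only a sketch: `merge_feature_metadata` is a hypothetical name, the toy DataFrames are invented, and the duplicate-column check is an added precaution (assuming the merged file needs unique column names) rather than something from the original transcript.

```python
import pandas as pd


def merge_feature_metadata(frames, index_name="FeatureID"):
    """Union-merge DataFrames on their indices (hypothetical helper)."""
    merged = pd.concat(frames, axis=1, join="outer", sort=False)
    # Duplicate column names would make the merged file ambiguous,
    # so fail early instead of writing them out.
    dupes = merged.columns[merged.columns.duplicated()].tolist()
    if dupes:
        raise ValueError(f"Duplicate column names: {dupes}")
    merged.index.name = index_name
    return merged


# Toy inputs; in practice these would come from pd.read_csv(...).
a = pd.DataFrame({"effect": [1.5, -0.2]}, index=["f1", "f2"])
b = pd.DataFrame({"W": [10, 3]}, index=["f2", "f3"])

merged = merge_feature_metadata([a, b])
# Missing values are written as empty fields by default.
print(merged.to_csv(sep="\t"))
```

If two input files share a column name, renaming the columns before merging (for example, by prefixing them with the tool name) avoids the error.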

* I think this might also affect sample metadata files, but feature metadata files are the bigger problem right now

A simpler example, merging a taxonomy.qza file and a feature_importance.qza file:

from qiime2 import Artifact
import pandas as pd
fi = Artifact.load("feature_importance.qza").view(pd.DataFrame)
tax = Artifact.load("taxonomy.qza").view(pd.DataFrame)
merged_df = pd.concat([tax, fi], axis=1, sort=False)

# Assign index a name to allow us to use this as a Q2 feature metadata file
merged_df.index.name = "FeatureID"

# Missing values are, by default, represented as NaNs.
# .to_csv() represents them in the TSV as empty values by default (see the
# na_rep parameter:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)

merged_df.to_csv("merged_fm.tsv", sep="\t")

After this, the merged_fm.tsv file can be passed to Empress via --m-feature-metadata-file in place of the two original QZAs. This lets us visualize all available taxonomy data and all available feature importance data, even though feature importances are not provided for some of the features in the dataset:

[Image: fi]
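Before passing a merged file to Empress, it may be worth checking how many features ended up without a value in each column. The sketch below uses an in-memory stand-in for the TSV so that it runs on its own. The feature IDs and the `importance` column name are invented; with a real file, the path to merged_fm.tsv would be passed to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# Stand-in for the contents of a merged TSV (IDs and columns are invented).
tsv = io.StringIO(
    "FeatureID\tTaxon\timportance\n"
    "f1\tk__Bacteria; p__A\t0.7\n"
    "f2\tk__Bacteria; p__B\t\n"
    "f3\tk__Bacteria; p__C\t0.1\n"
)
merged = pd.read_csv(tsv, sep="\t", index_col=0)

# Empty fields are read back as NaN, so isna() counts the gaps per column.
missing = merged.isna().sum()
print(missing)
```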