brentp / peddy

genotype :: ped correspondence check, ancestry check, sex check. directly, quickly on VCF

KeyError in PCA.py

skgttjo opened this issue · comments

Hi Brent,

I cloned the peddy package yesterday and have got this error:

Traceback (most recent call last):
  File "/home/torme/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/torme/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/torme/Desktop/TOOLS/peddy/peddy/__main__.py", line 14, in <module>
    sys.exit(cli())
  File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "peddy/cli.py", line 207, in peddy
    in ("ped_check", "het_check", "sex_check")]):
  File "peddy/cli.py", line 43, in run
    prefix=prefix, **kwargs)
  File "peddy/peddy.py", line 878, in het_check
    pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
  File "peddy/pca.py", line 46, in pca
    idxs = np.array([kgsites_index[s] for s in sites])
KeyError: u'1:900505:G:C'

I get output files ped_check and ped_check_rel-difference, but no files for sex or PCA.

The error is at position 1:900505. I'm not sure in what order peddy processes the variants in the file, but this is not the first variant position in my VCF. (Perhaps for speed the program does not run through the variants in file order, so I'm not sure this is relevant anyway.) This position looks fine in my VCF: it is present and has the same ref and alt alleles as in the 1000 Genomes data. Is there anything obvious that would cause this error?

Sorry if this is an obvious oversight on my part... The output that I did get looks beautiful, so thank you!

Many thanks,
Tatiana

I wonder if you are getting a mix of peddy modules. the new version has a different set of sites and so maybe you're somehow getting the old PCA module? I suspect this is the case because you're running it out of the source directory.

this should be fixed with local imports in peddy, but, pending that fix, you can also check that you only have 1 peddy module available.
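
One way to check which copy of a module Python would actually import is to ask the import system for its location. This is only a sketch, assuming Python 3; `module_origin` is a hypothetical helper, not part of peddy. Running it once from inside the cloned source directory and once from elsewhere shows whether a local checkout is shadowing the installed package.

```python
import importlib.util


def module_origin(name):
    """Return the file Python would import `name` from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # A path inside a cloned source tree (rather than site-packages) suggests
    # that the local checkout is being picked up instead of the installed copy.
    print(module_origin("peddy"))
```

If the two runs print different paths, more than one peddy module is reachable and the one in the current directory wins.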

thanks for reporting.

Hi Brent,

Thank you for getting back to me. Just to let you know (in case it helps with troubleshooting), I removed everything related to peddy, reinstalled it, and got the same error as above. I then ran it on a different VCF, in case the problem was with the original VCF file, and got the same traceback up to the last frame. Instead of the error at line 46 in pca, I got:
  File "peddy/pca.py", line 60, in pca
    clf = make_pipeline(PCA(n_components=4, whiten=True, copy=True, svd_solver="randomized"),
TypeError: __init__() got an unexpected keyword argument 'svd_solver'

Does this seem like a problem at my end (with the installation or the VCF)? I don't want to waste your time!

All the best

you must have a very old version of scikit-learn. I would update that, re-run and report the error (if there is one).
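
A quick way to confirm this before re-running is to print the installed scikit-learn version and check whether `PCA` accepts the `svd_solver` keyword. The sketch below assumes Python 3 (it relies on `inspect.signature`); `accepts_keyword` is a hypothetical helper written for illustration.

```python
import inspect


def accepts_keyword(func, keyword):
    """True if `func` declares a parameter named `keyword` or takes **kwargs."""
    params = inspect.signature(func).parameters
    if keyword in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


if __name__ == "__main__":
    try:
        import sklearn
        from sklearn.decomposition import PCA
    except ImportError:
        print("scikit-learn is not installed")
    else:
        print("scikit-learn", sklearn.__version__)
        print("PCA accepts svd_solver:", accepts_keyword(PCA, "svd_solver"))
```

If the last line prints `False`, the installed scikit-learn predates the `svd_solver` argument and needs upgrading.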

commented

Which version of scikit-learn should I use to fix this bug?

@ruizgo which error? the svd_solver error? Any recent version is fine.

If you are seeing an error like in the first message (KeyError: u'1:900505:G:C'), then you must be using a set of sites that doesn't match the ones used to create the thousand genomes labeled set. Can you share the command you ran along with the full error message?
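
To illustrate the mismatch: the failing line in the traceback looks every site up in an index, so a single site missing from that index raises the KeyError. A minimal sketch of listing the missing keys up front follows. The data is made up, `missing_sites` is a hypothetical helper, and the `chrom:pos:ref:alt` key form is inferred from the error message rather than from peddy's documentation.

```python
def missing_sites(sites, index):
    """Return the site keys from `sites` that are absent from `index`, in order."""
    return [s for s in sites if s not in index]


# Hypothetical data: an index built from one sites file, and sites read from another.
index = {"1:900505:G:C": 0, "1:1000000:A:T": 1}
sites = ["1:900505:G:C", "1:1234567:C:T"]

print(missing_sites(sites, index))  # ['1:1234567:C:T']
```

A non-empty result means the two site sets were not generated from the same source.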

commented

2023-05-18 03:06:57 01a458450dfc peddy.cli[786] INFO Running Peddy version 0.4.8
2023-05-18 03:06:57 01a458450dfc peddy.cli[786] INFO ped_check
2023-05-18 03:06:57 01a458450dfc peddy.cli[786] INFO ran in 0.3 seconds
2023-05-18 03:06:57 01a458450dfc peddy.cli[786] INFO het_check
2023-05-18 03:06:58 01a458450dfc peddy.pca[786] INFO loaded and subsetted thousand-genomes genotypes (shape: (2504, 1)) in 0.4 seconds
Traceback (most recent call last):
  File "/usr/local/bin/peddy", line 33, in <module>
    sys.exit(load_entry_point('peddy', 'console_scripts', 'peddy')())
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/peddy-0.4.8/peddy/cli.py", line 208, in peddy
    for check, df, background_df in map(run, [(check, ped, vcf, plot, prefix, each, procs, sites) for check
  File "/opt/peddy-0.4.8/peddy/cli.py", line 42, in run
    df = getattr(p, check)(vcf, plot=plot, each=each, ncpus=ncpus,
  File "/opt/peddy-0.4.8/peddy/peddy.py", line 879, in het_check
    pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
  File "/opt/peddy-0.4.8/peddy/pca.py", line 64, in pca
    clf.fit(genos1kg, background_target)
  File "/usr/local/lib/python3.10/site-packages/sklearn/pipeline.py", line 401, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/usr/local/lib/python3.10/site-packages/sklearn/pipeline.py", line 359, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/usr/local/lib/python3.10/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/sklearn/pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/usr/local/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/sklearn/decomposition/_pca.py", line 462, in fit_transform
    U, S, Vt = self._fit(X)
  File "/usr/local/lib/python3.10/site-packages/sklearn/decomposition/_pca.py", line 514, in _fit
    return self._fit_truncated(X, n_components, self._fit_svd_solver)
  File "/usr/local/lib/python3.10/site-packages/sklearn/decomposition/_pca.py", line 587, in _fit_truncated
    raise ValueError(
ValueError: n_components=4 must be between 1 and min(n_samples, n_features)=1 with svd_solver='randomized'
None

This is the error output I get. It appears when reading certain VCF files.

This means that there was only 1 SNP in the thousand genomes set that was in your set. So you either have data that is too sparse or you're using the wrong genome build most likely.
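
The log line "shape: (2504, 1)" shows the same thing: 2504 reference samples but only one shared site, which is fewer than the 4 components the PCA asks for. A rough pre-check along these lines could catch the problem before the fit. The function names and the site keys are hypothetical and the default of 4 is taken from the error message; this is not how peddy itself validates its input.

```python
def shared_site_count(sample_sites, reference_sites):
    """Number of distinct site keys present in both collections."""
    return len(set(sample_sites) & set(reference_sites))


def require_enough_sites(sample_sites, reference_sites, n_components=4):
    """Return the shared-site count, or raise if it is too small to fit the PCA."""
    shared = shared_site_count(sample_sites, reference_sites)
    if shared < n_components:
        raise ValueError(
            "only %d site(s) shared with the reference panel; need at least %d. "
            "Check the genome build and how sparse the calls are."
            % (shared, n_components)
        )
    return shared
```

Calling `require_enough_sites` with the site keys from the VCF and from the reference panel fails early, with a more direct message, when the overlap is too small.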

commented

I will check the upstream operation, thank you for your reply!