brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"

Reading .somalier binary files - how is the SNP order determined?

Austin-s-h opened this issue

Hi! I like somalier and I think it's a useful tool. I'm trying to understand the format of the data stored in .somalier files. I've implemented read_somalier and included a clear example result below.

I'm trying to understand how entries within .somalier files are ordered, and whether it is possible to recreate position-based labels that map each set of counts back to the corresponding variant in the VCF file. My input VCF file has 27 entries, and the resulting .somalier file contains 27 sites in the sites array. Are sites ever filtered?

Thank you!

from pathlib import Path
from typing import Union

import numpy as np


def read_somalier(path: Union[str, Path]) -> dict:
    """Take a path to a single .somalier file and return sample information.

    Args:
        path (str): path to a .somalier file

    Returns:
        dict:
            sample (str): sample name
            sites (np.ndarray): array of sites [n_other, n_ref, n_alt]
            x_sites (np.ndarray): array of x_sites [n_other, n_ref, n_alt]
            y_sites (np.ndarray): array of y_sites [n_other, n_ref, n_alt]
    """
    path = Path(path)
    data = path.read_bytes()

    # Header: a 1-byte format version, then a 1-byte sample-name length and the name.
    version = int.from_bytes(data[:1], byteorder="little")
    assert version == 2, ("bad version for:", path)
    data = data[1:]
    sample_strlen = int.from_bytes(data[:1], byteorder="little")
    data = data[1:]
    sample = data[:sample_strlen].decode()
    data = data[sample_strlen:]

    # Three 2-byte counts: number of sites, x_sites, and y_sites.
    nsites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]
    nxsites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]
    nysites = int.from_bytes(data[:2], byteorder="little")
    data = data[2:]

    # Each site is three uint32 counts, i.e. 3 * 4 = 12 bytes per row.
    sites = np.frombuffer(data[: nsites * 3 * 4], dtype=np.uint32).reshape((nsites, 3))
    data = data[nsites * 3 * 4 :]
    x_sites = np.frombuffer(data[: nxsites * 3 * 4], dtype=np.uint32).reshape((nxsites, 3))
    data = data[nxsites * 3 * 4 :]
    y_sites = np.frombuffer(data[: nysites * 3 * 4], dtype=np.uint32).reshape((nysites, 3))

    return dict(sample=sample, sites=sites, x_sites=x_sites, y_sites=y_sites)

result = read_somalier("example1.somalier")
result
{'sample': 'example1', 'sites': array([[    8,  6266,    22],
       [13001,     2,   149],
       [ 8261,     1,    48],
       [ 5669,  5170,    44],
       [    2,  3859,    32],
       [  879,   892,     6],
       [ 4469,  4409,    19],
       [14939,     0,    35],
       [   17, 12815,    19],
       [ 5569,  4525,    40],
       [ 6269,  7049,    30],
       [ 4039,     1,   107],
       [ 9360,     0,    38],
       [ 2694,  3256,    45],
       [ 6447,  5873,    63],
       [ 1913,  1656,    35],
       [    9, 10316,    38],
       [ 5209,  5061,   106],
       [16302,     0,    58],
       [  251,   260,    18],
       [ 6184,     0,    17],
       [  868,   761,    17],
       [    0, 13716,    21],
       [    0, 11615,   108],
       [ 1211,  1046,    20],
       [ 3572,  3742,    45],
       [ 8187,  7505,    47]], dtype=uint32), 'x_sites': array([], shape=(0, 3), dtype=uint32), 'y_sites': array([], shape=(0, 3), dtype=uint32)}
len(result['sites'])
27
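
To make the byte layout that read_somalier assumes explicit, here is a minimal sketch of the inverse operation, which can be handy for building small test files. The name pack_somalier is hypothetical, and the layout is inferred only from the reader above (version byte 2, one-byte name length, three uint16 counts, then little-endian uint32 triples), not from somalier's own documentation.

```python
import struct
from typing import Sequence, Tuple

Counts = Sequence[Tuple[int, int, int]]


def pack_somalier(sample: str, sites: Counts,
                  x_sites: Counts = (), y_sites: Counts = ()) -> bytes:
    """Serialize counts in the layout that read_somalier above expects.

    Hypothetical helper: the layout is an assumption taken from the reader,
    not from somalier itself.
    """
    name = sample.encode()
    if len(name) > 255:
        raise ValueError("sample name must fit in a one-byte length field")
    chunks = [
        struct.pack("<BB", 2, len(name)),  # version, sample-name length
        name,
        struct.pack("<HHH", len(sites), len(x_sites), len(y_sites)),
    ]
    # Blocks follow in the order sites, x_sites, y_sites; 12 bytes per row.
    for block in (sites, x_sites, y_sites):
        for row in block:
            chunks.append(struct.pack("<3I", *row))
    return b"".join(chunks)
```

Writing the returned bytes to a file and passing that file to read_somalier should give back the same sample name and counts, assuming a little-endian machine.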

Hi, sites are never excluded, so every site in your sites VCF is saved in the .somalier file. The order is also preserved: the nth variant in your sites VCF is the nth entry in your .somalier file.
So you can recover the positions, but they are not stored in the .somalier file, only in your original sites VCF.
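
As a rough sketch of what this means in practice, the labels can be read from the sites VCF in file order and paired with the counts row by row. The helper names read_site_labels and label_sites are made up for illustration. The sketch assumes the sites VCF is plain or gzip-compressed text, and it only covers the sites array, as in the example above where x_sites and y_sites are empty; how records are split into x_sites and y_sites is not covered in this thread.

```python
import gzip
from pathlib import Path
from typing import List, Sequence, Tuple, Union

SiteLabel = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def read_site_labels(vcf_path: Union[str, Path]) -> List[SiteLabel]:
    """Return (chrom, pos, ref, alt) for each record of a sites VCF, in file order."""
    vcf_path = Path(vcf_path)
    opener = gzip.open if vcf_path.suffix == ".gz" else open
    labels = []
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            labels.append((chrom, int(pos), ref, alt))
    return labels


def label_sites(labels: Sequence[SiteLabel], sites) -> list:
    """Pair the nth VCF record with the nth row of counts."""
    if len(labels) != len(sites):
        raise ValueError(
            f"{len(labels)} VCF records but {len(sites)} rows of counts; "
            "is this the sites VCF that was used for extraction?"
        )
    return [(label, tuple(int(c) for c in row)) for label, row in zip(labels, sites)]
```

With the reader from the first post, this would be used as label_sites(read_site_labels("sites.vcf"), result["sites"]), where "sites.vcf" stands in for whatever sites file was used to create the .somalier file. The length check is there because a mismatch is the quickest sign that the wrong sites file was supplied.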

Great, thank you for the information!