A Python and HDF5-based VLBI correlator for CHIME Outriggers. It is based on the BBData data format used by CHIME, and uses difxcalc11, the delay model developed for the VLBA.
It natively supports frequency-dependent pulsar gating and coherent dedispersion.
```
git clone https://github.com/leungcalvin/pyfx-public.git
cd pyfx-public
python setup.py develop # one way (legacy editable install), or...
pip install -e . # ...another way, the pip equivalent of an editable install
```
You will also need `pycalc` (the delay model), and you will probably need `coda` (the visibility container). Use the `main` branch for both.
You will want to set up a `CorrJob` to handle the correlation over all possible baselines. The `CorrJob` will specify the correlation start and stop times as a function of frequency, time, and pointing at a given reference station in three arrays of shape `(nfreq=1024, npointing, ntime)`. In most cases you will have only one pointing, but for widefield VLBI you might want multiple pointings.
- $t$, an `astropy.Time`, which specifies the topocentric Unix time at the reference station.
- $w$, an `int`, which specifies the scan duration in frames of 2.56 μs. Since the frequency resolution is the inverse of the scan duration, this effectively sets the frequency resolution of the correlation to $390.625 / w$ kHz.
- $r$, a `float` between 0 and 1, which specifies the fraction of the integration to correlate (used in pulsar gating mode), centered on `t + 2.56e-6 * (w // 2)`.
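As a rough illustration of how $w$ and $r$ translate into physical quantities, the sketch below converts a scan length in frames into a duration, an approximate frequency resolution (taken here as the inverse of the scan duration), and a gate window. The helper `scan_summary` and its dictionary keys are hypothetical and not part of `pyfx`; the gate is assumed to be centered on the middle of the scan, as described in the list above.

```python
FRAME_S = 2.56e-6  # duration of one baseband frame, in seconds


def scan_summary(t_unix, w, r):
    """Summarize one scan (hypothetical helper, not a pyfx function).

    t_unix: scan start time in seconds, w: scan length in frames,
    r: fraction of the scan that is actually correlated (0 to 1).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie between 0 and 1")
    duration_s = w * FRAME_S
    center = t_unix + FRAME_S * (w // 2)   # middle of the scan
    half_gate = 0.5 * r * duration_s       # half-width of the gated region
    return {
        "duration_s": duration_s,
        "freq_resolution_khz": 1e-3 / duration_s,  # 1 / duration, in kHz
        "gate_start": center - half_gate,
        "gate_stop": center + half_gate,
    }


# A 2000-frame scan starting at t = 0 with half of the scan gated in.
summary = scan_summary(0.0, w=2000, r=0.5)
print(summary)
```

With `w = 2000` the scan lasts 5.12 ms, so the resolution within each 390.625 kHz channel is roughly 0.2 kHz.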
```python
import numpy as np
from pyfx import corr_job

chime_file = '/home/calvin/public/astro_255011898_multibeam_B0136+57_chime.h5'  # a singlebeam file, ignore the name
kko_file = '/home/calvin/public/astro_255011898_multibeam_B0136+57_kko.h5'  # a singlebeam file from another station, ignore the name
pulsar_job = corr_job.CorrJob(
    [chime_file, kko_file],          # pass the file names into CorrJob
    ras=np.array([24.83225]),        # where do we point (J2000)?
    decs=np.array([58.24217194]),    # where do we point (J2000)?
)

# 1) define correlation job
t, w, r = pulsar_job.define_scan_params(
    ref_station='chime',
    t0f0=(1670732187.8666291, 800.0),  # reference time & frequency in MHz
    start_or_toa='start',              # is the reference time a "start time" (left edge) or a "toa" (center of the integration)?
    time_spacing='even',               # evenly spaced integrations as a function of time. Later on, need to implement hardcore polyco gating.
    period_frames=2000,                # spacing between integrations as a function of time
    freq_offset_mode='bbdata',         # how does the integration time change vs frequency? Can either be "bbdata" or "dm"
    Window=np.ones(1024) * 2000,       # duration of each integration in frames
    r_ij=np.ones(1024) * 1,            # duration of the gate applied to each integration
    dm=73.81141,                       # DM used for de-smearing, needs to be good to ~1% unless you're using freq_offset_mode = 'dm'
    num_scans_before=0,                # how many integrations before the reference time, either an int or 'max'
    num_scans_after=2,                 # how many integrations after the reference time, either an int or 'max'
)  # these parameters completely define t, w, r.

# 2) optional: check that the job has reasonable parameters
# does a waterfall plot to double check that you're integrating over the pulse.
pulsar_job.visualize_twr(chime_file, t[..., 0:5], w[..., 0:5], r[..., 0:5], dm=73.81141)

# 3) press go
# performs the correlations by 1) reading in all frequencies for all stations, 2) correlating, and 3) writing out a VLBIVis.
vlbivis = pulsar_job.run_correlator_job(t[..., 0:3], w[0, :, 0:3].astype(int), r[..., 0:3], dm=73.81141,
                                        out_h5_file=False)
# analysis and calibration of visibilities follows hereafter -- see `coda` repo
```
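For intuition about why the integration start times depend on frequency (and why the `dm` argument matters), the following sketch evaluates the standard cold-plasma dispersion delay relative to the 800 MHz reference frequency. It is not the `pyfx` implementation of `freq_offset_mode = 'dm'`; `dm_delay_s` and `channel_freq_mhz` are hypothetical helpers, and the dispersion constant is one commonly used value.

```python
K_DM = 4.148808e3  # dispersion constant in s MHz^2 pc^-1 cm^3 (commonly used value)
FRAME_S = 2.56e-6  # duration of one baseband frame, in seconds


def channel_freq_mhz(k):
    """Center frequency of CHIME channel k, using nu_k = 800 - 0.390625 k MHz."""
    return 800.0 - 0.390625 * k


def dm_delay_s(dm, freq_mhz, ref_freq_mhz=800.0):
    """Dispersive arrival-time delay at freq_mhz relative to ref_freq_mhz, in seconds."""
    return K_DM * dm * (freq_mhz ** -2 - ref_freq_mhz ** -2)


# Delay across the band for the DM used in the example above.
for k in (0, 512, 1023):
    delay = dm_delay_s(73.81141, channel_freq_mhz(k))
    print(f"freq_id {k:4d}: {channel_freq_mhz(k):10.6f} MHz, "
          f"delay {delay:.6f} s = {round(delay / FRAME_S)} frames")
```

At this DM the pulse arrives more than a second later at the bottom of the band than at the top, which is far longer than a 2000-frame integration.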
If you want to work under the hood, you can call `crosscorr_core()` or `autocorr_core()` directly and accomplish similar things without the `CorrJob` interface layer.
One of the reasons we developed `pyfx` was for native support of the strange data format of CHIME. Since we are working at low frequencies where the DM sweep is long, we need to record the baseband data at reasonably high frequency resolution to follow the long DM sweep. This means we use lots of channels (1024) and no sub-bands, since CHIME directly digitizes the voltage data without using a local oscillator to mix signals down to baseband. These data are saved to `.h5` files, which are then processed by offline (and later, real-time) beamformers. The format specification for `singlebeam` data as used by `pyfx` is summarized here.
To open `singlebeam` files, one can use `h5py` directly to get started. However, slicing data in `pyfx` requires using the `BBData` data format specified in CHIME's `baseband_analysis` repo.
```python
from baseband_analysis.core import BBData

data_all_freqs = BBData.from_file('/path/to/baseband_EVENTID_*.h5')  # to load all frequencies
data_first_beam = BBData.from_file('/path/to/baseband_EVENTID_*.h5', beam_sel=[0, 1])  # to load all frequencies, just one dual-pol beam
print(data_all_freqs.index_map['freq']['centre'])  # what frequency channels do we have?
print(data_all_freqs['tiedbeam_locations'][:])  # what pointings do we have?
print(list(data_all_freqs.keys()))  # what metadata do we have?

# to load data from FPGA freq_ids = 0, 1, 3 explicitly
data_first_three_freqs_explicit = BBData.from_file(['/path/to/baseband_EVENTID_0.h5',
                                                    '/path/to/baseband_EVENTID_1.h5',
                                                    '/path/to/baseband_EVENTID_3.h5'])
# to load data implicitly from the first three files available
data_first_three_freqs_implicit = BBData.from_file('/path/to/baseband_EVENTID_*.h5', freq_sel=[0, 1, 2])
```
As one can see, `caput.memh5` does the I/O management under the hood for us, allowing downselection along arbitrary axes (e.g. `freq_sel` as shown above, but `beam_sel` or `time_sel` can also be used).
`BBData.from_file` also:
- Handles the offset encoding of raw baseband data (4 bits real + 4 bits imaginary).
- Reads the metadata that keep track of sign flips in the complex conjugate convention taken by the beamformer upstream, and changes the sign convention accordingly when the data are loaded into memory.
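To make the "4 + 4 bit offset encoding" concrete, here is a minimal sketch of unpacking one byte into a complex sample. The nibble order (real part in the upper four bits) and the offset of 8 are assumptions made for illustration, and any special handling of reserved values (for example, a flag for invalid samples) is omitted; `decode_4plus4` is a hypothetical helper. In practice `BBData.from_file` does this for you.

```python
def decode_4plus4(byte):
    """Unpack one offset-encoded byte into a complex sample (assumed layout)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte (0-255)")
    real = (byte >> 4) - 8    # upper nibble, offset by 8 (assumption)
    imag = (byte & 0x0F) - 8  # lower nibble, offset by 8 (assumption)
    return complex(real, imag)


raw = bytes([0x88, 0xF1, 0x19])
samples = [decode_4plus4(b) for b in raw]
print(samples)
```

Each 4-bit field spans the integers from -8 to 7 after the offset is removed, so a byte of `0x88` corresponds to a zero-valued sample under these assumptions.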
A complete `singlebeam` file should have data and metadata attributes as described below, and `multibeam` files are quite closely related. **Bolded** refers to features that do not exist or are irrelevant for `singlebeam` files, but which would be a natural way to extend the data format for the pulsar beam data.
- `data.index_map`: a dictionary-like data structure for users to interpret the axes which exist in the `BBData` dataset. The `BBData` dataset holds `np.ndarray`s of data. Here is a list of axes, and metadata describing them:
  - Observing Frequency: `data.index_map['freq']` ($N_{\nu} \leq 1024$): `data.index_map['freq']['centre']` holds the center frequency of each PFB channel, in MHz. Similarly, `data.index_map['freq']['id']` holds the frequency ID of each frequency channel as an integer $k$. The mapping from frequency IDs to frequencies (in MHz) is $\nu_k = 800 - 0.390625k$, for $k = 0 \ldots 1023$. Because every channel center and frequency ID is specified, the frequency axis is not assumed to be continuous. Note that we do not have the channel centered at 400.0 MHz.
  - Telescope array element: `data.index_map['input']['id']` ($N_e \leq 2048$) holds the serial numbers of each antenna used to form the synthesized beam. This axis is no longer present in beamformed baseband datasets, but the metadata still exist to inform the end user which antennas were combined into a tied-array beam at each station.
  - Polarization/Pointing ($N_p$ assumed to be even): `data.index_map['beam']` is supposed to hold the information about where each station's beams are formed. Currently it just holds integers $0, 1, \ldots, 2n-1$, where $n$ is the number of unique sky locations which are beamformed. The beams and antenna polarization (either 'S' or 'E') are recorded in `data['tiedbeam_locations'][:]`. It is possible to do hundreds of pointings offline in multiple phase center mode in the beamformer, limited only by the size of the file produced per frequency. When $N_p = 2$, we refer to this as a `singlebeam` file, whereas when $N_p > 2$ we call it a `multibeam` file; the `multibeam` files are typically broken down along the frequency axis to reduce the size of each file.
  - Time ($N_t \sim 10^4$): `data.index_map['time']['offset_fpga']` holds the index of every FPGA frame after `data['time0']['fpga_count']`. Only one record of the `fpga_offset` is recorded for all frequency channels, since we do not want to record `data.index_map['time']['fpga_offset']` independently for each channel (which would double our data volume). Therefore, for a particular element of baseband data in an array of shape `(nfreq, ntime)`, the Unix time at which the `data['tiedbeam_baseband'][k,:,m]` element was recorded is `data['time0']['ctime'][k] + 2.56e-6 * data.index_map['time']['fpga_offset'][m]`.
- `data['tiedbeam_baseband']`: array of shape ($N_{\nu}, N_{p}, N_t$). Holds the actual baseband data in an array of complex numbers. The baseband data should be flux-calibrated such that the mean of the power obtained by squaring the data is in units of Janskys $\times f_{good}^2$, where $f_{good}$ is the fraction of antennas that are not flagged. The baseband data have an ambiguous complex conjugate convention. Data that obey the same complex conjugate convention as raw PFB output from the F-engine also have the attribute `data['tiedbeam_baseband'].attrs['conjugate_beamform'] = 1`, whereas data that have the opposite convention (data processed prior to October 2020) lack this attribute.
- `data['time0']`: array of shape $(N_{\nu})$. Holds the absolute start time of each baseband dump as a function of frequency channel as a pair of `float64`s, in `data['time0']['ctime']` and `data['time0']['ctime_offset']` respectively. Times are formatted as a UNIX timestamp in seconds (since midnight on January 1 1970 in UTC time). Since the baseband dumps start at a different time in each frequency channel, `ctime` is recorded as a function of frequency channel, disciplined via a GPS-disciplined crystal oscillator, to the nearest nanosecond. The precision of `ctime` is $\approx 100$ ns because it is stored as a `float64`. Therefore, for most applications using `ctime` alone is sufficient. However, since a `float64` cannot hold UNIX timestamps to nanosecond precision ($\approx$ 19 decimal digits are needed), a second `float64` holds the last few relevant decimal places of the full UNIX time in seconds. Because of the limitations of a `float64`, it is often the case that `ctime_offset` is less than several hundreds of nanoseconds. `data['time0']['ctime']` and `data['time0']['ctime_offset']` can be easily converted to `astropy.Time` objects using the `val2` keyword. If you do high precision arithmetic, you will find that `ctime + ctime_offset` mod 2.56e-6 is a constant over all frequency channels. In addition, `data['time0']['fpga_count']` can be used to calculate the start time of the dump to within a nanosecond. This calculation can be performed for each frequency channel, and the results should be consistent to $10^{-10}$ seconds.
- `data['tiedbeam_locations']['ra', 'dec', or 'pol']`: array of shape $(N_p)$, where $N_p$ is even. Holds the sky locations and polarizations used to phase up each station. It will also include `data['tiedbeam_locations']['X_400MHz', 'Y_400MHz']`, which refer to local beam-model coordinates computed via the `beam_model` package.
- `data['centroid']`: Holds the position of the telescope's effective centroid, measured from (0,0,0) in local telescope coordinates, in meters, measured in either an Easting/Northing coordinate system (TONE) or in a $F_\perp, F_\parallel$ coordinate system (perpendicular or parallel to the focal line) as a function of frequency channel. This is a function of frequency because the telescope's centroid is a sensitivity-weighted average of antenna positions (post-beamforming). This is not yet used in VLBI, but we have the machinery to perform small baseline corrections using this field if necessary.
- `data['telescope'].attrs['name']` (Not implemented yet?): Holds the name of the station (`chime`, `pathfinder`, `tone`, `allenby`, `greenbank`, or `hatcreek`). (Kenzie please update?)
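The frequency mapping and timestamp arithmetic described above can be sketched in a few lines of plain Python. The helper names `freq_id_to_mhz` and `sample_unix_time` are hypothetical, and the arguments stand in for values that would be read from a `BBData` object. The last line shows why a second `float64` is needed: near a present-day Unix timestamp, adjacent `float64` values are a few hundred nanoseconds apart.

```python
import math

FRAME_S = 2.56e-6  # duration of one FPGA frame, in seconds


def freq_id_to_mhz(k):
    """Center frequency of channel k, using nu_k = 800 - 0.390625 k MHz."""
    if not 0 <= k <= 1023:
        raise ValueError("frequency ID must be between 0 and 1023")
    return 800.0 - 0.390625 * k


def sample_unix_time(ctime_k, fpga_offset_m):
    """Approximate Unix time of sample m in channel k (float64 precision only)."""
    return ctime_k + FRAME_S * fpga_offset_m


ctime = 1670732187.8666291  # example Unix timestamp, in seconds
print(freq_id_to_mhz(0), freq_id_to_mhz(1023))
print(sample_unix_time(ctime, 1000) - ctime)
print(math.ulp(ctime))  # spacing between adjacent float64 values near ctime
```

The spacing printed at the end is $2^{-22}$ s, about 238 ns, so rounding a timestamp to the nearest `float64` introduces an error of up to about 119 ns. That is consistent with the quoted $\approx 100$ ns precision of `ctime` alone, and it is the reason `ctime_offset` is stored separately.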
CHIME Outriggers will have a small number of stations collecting full-array baseband dumps and forming multiple synthesized beams. Since each baseline must be correlated and calibrated independently, we store each baseline and each station as its own independent HDF5 group within an HDF5 container (again inherited from `caput`) called `VLBIVis`. Each station group contains station-related metadata copied from the `singlebeam` data (via `coda.core.VLBIVis.copy_station_metadata`, which copies all attributes stored in the `BBData` to its corresponding station HDF5 group).
The station groups also hold autocorrelation visibilities up to some maximum lag (20 * 2.56 μs by default), while each baseline holds per-baseline (e.g. calibration) metadata and cross-correlation visibilities. For example, processing data from CHIME and KKO would result in two autocorrelation HDF5 groups (`vis['chime']`, `vis['kko']`), and one cross-correlation HDF5 group `vis['chime-kko']` (not `kko-chime`, since we alphabetize the two stations in a baseline).
The cross-correlation visibilities, stored in `vis['chime-gbo']['vis']`, are packed in `np.ndarray`s of shape $(N_k, N_c, N_p, N_p, N_{\ell}, N_t)$, where:
- $N_k$ enumerates the number of frequency channels. Since we have a preference for working at the native frequency resolution of CHIME, this is fixed to 1024 for now, and infilled with zeros where frequency channels are corrupted by e.g. RFI. We always correlate at high frequency resolution, but this information is contained in the lag axis, which is easy to downselect if we don't mind binning the visibilities in frequency.
- $N_{c} \lesssim 10$ enumerates the number of correlation phase centers. Usually one or several ($<10$) phase centers will be used per beam. We use the `pycalc` wrapper around `difxcalc11` to evaluate delays. Currently, we can assign a single (or multiple) VLBI "pointing" to each tied-array "beam" whose width is $0.25 \times 0.25$ degrees, in anticipation of science cases for assigning multiple VLBI pointings per synthesized beam (which a tracking beam may have the sensitivity to see).
- $N_p \times N_p$ indicates all possible combinations of antenna polarizations. There are two antenna polarizations for each telescope, and they will be labeled "south" and "east" to denote the "parallel to the cylinder axis" and "perpendicular to the cylinder axis" directions respectively. Since CHIME/FRB Outriggers have co-aligned, dual-polarization antennas, correlating in a linear basis is straightforward and removes the need for polarization calibration.
- $N_{\ell} \sim 100$ indicates the number of integer time lags saved (in units of 2.56 μs). In principle, only a few ($<10$) are needed, but it is not difficult to compute and save roughly 100 integer lags, which also allows for some post-correlation upchannelization of the visibilities.
- $N_{t} \sim 10^{1-4}$ enumerates successive scans. At the time of this writing we work with short scans (< 1 second), but pending upgrades to the beamformers at each station, we will soon be able to record hundreds of seconds of beamformed data at each station. In that case $N_{t}$ might approach $\approx 10^4$ in a long observation.
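To illustrate the post-correlation upchannelization mentioned for the lag axis, the sketch below takes a discrete Fourier transform along the lags of a single visibility, which yields one value per fine sub-channel within a 390.625 kHz channel. This is a toy example rather than the `coda` or `pyfx` procedure: `lags_to_subchannels` is a hypothetical helper, and the lag ordering (starting at lag 0) and the Fourier sign convention are assumptions.

```python
import cmath

CHANNEL_KHZ = 390.625  # native CHIME channel width, in kHz


def lags_to_subchannels(vis_lags):
    """Naive DFT of one visibility's lag axis into fine sub-channels."""
    n = len(vis_lags)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * j * lag / n)
            for lag, v in enumerate(vis_lags))
        for j in range(n)
    ]


# A signal with a residual delay of exactly 2 frames puts all its power in lag 2.
vis_lags = [0j, 0j, 1 + 0j, 0j, 0j, 0j, 0j, 0j]
spectrum = lags_to_subchannels(vis_lags)

print(f"sub-channel width: {CHANNEL_KHZ / len(vis_lags):.3f} kHz")
print([round(abs(s), 6) for s in spectrum])
```

A visibility concentrated in a single lag becomes a unit-amplitude phase ramp across the sub-channels, which is the usual signature of a residual delay. Saving more lags gives finer sub-channels at the cost of a larger visibility array.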
In addition to the visibilities, we also save the following metadata. At the time of cross-correlation, two `singlebeam` (or `multibeam`) files are processed to produce one visibility dataset. In addition to the metadata in both input `singlebeam` files (as described above), we will save...
- Software metadata: `github` commit hash denoting what version of the correlator produced the file.
- `vis['chime']['time_a']`: The topocentric start time of each integration at each station to nanosecond precision (see `BBData['time0']`) as a function of frequency and time.
- `vis['chime-kko']['vis'].attrs['station_a','station_b']`: `astropy.coordinates.EarthLocation` objects denoting the geocentric `(X,Y,Z)` positions of the stations fed into `difxcalc11`.
- `vis['chime-kko']['vis'].attrs['calibrated']`: a boolean attribute denoting whether phase + delay calibration has been applied to the visibilities via `coda.calibration.apply_phase_cal`.
- `vis['chime-kko']['vis'].attrs['clock_jitter_corrected']` and `['clock_drift_corrected']` refer to whether one-second timescale clock jitter (between the GPS and maser) has been calibrated out, and whether weeks-long timescale clock drift (between masers at two stations) has been calibrated out using the CHIME/FRB `maser` pipeline. Use `coda.clock.apply_clock_jitter` and `coda.clock.apply_clock_drift` to apply/unapply these corrections.
Eventually, we will write data converters to port this over to MeasurementSet or an appropriate container for visibilities that is more widely used. We don't have as much experience with more conventional containers, and we are keen to collaborate with people who have some familiarity in this area.
Developing a VLBI correlator brings enlightenment to radio interferometrists. Please contact one of us with suggestions for improvements, or questions about the documentation. PRs are very welcome!

Calvin Leung, Shion Andrew