HDF5 from MeerKAT data --> "taylor_flt has no effect?", empty DAT & LOG

texadactyl opened this issue 3 years ago · comments

File: /datax/scratch/jzhang/20200917_guppi_59143_54557_000456_J1939-6342_offset_0001.rawspec.0000.fil in the data centre.

Convert the file to HDF5 with fil2h5 or implicitly with turbo_seti. Then, either run turboSETI executable or use FindDoppler.search() in a Python program (non-GPU mode). The result is the same. For each coarse channel, the following is observed:

find_doppler.N  ERROR    taylor_flt has no effect?

where N is the coarse channel number.

The DAT and LOG files have no entries beyond the usual boiler-plate information.
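
For reference, the Python route described above can be sketched roughly as follows. The import path and keyword arguments follow the parameter names that turbo_seti echoes in its own log output; the wrapper name `run_search` and the default values are illustrative assumptions, not the exact call used for the failing file.

```python
def run_search(datafile, out_dir, max_drift=5.0, snr=25.0):
    """Run a CPU-only turbo_seti Doppler search on one file (illustrative wrapper)."""
    # Imported lazily so this sketch can be loaded without turbo_seti installed.
    from turbo_seti.find_doppler.find_doppler import FindDoppler

    finder = FindDoppler(
        datafile,
        max_drift=max_drift,   # Hz/s, assumed value
        snr=snr,               # assumed value
        out_dir=out_dir,
        gpu_backend=False,     # non-GPU mode, as in the report above
    )
    # Writes the DAT and LOG files into out_dir.
    finder.search()
```

Calling `run_search(path_to_file, output_directory)` on the file named above should reproduce the symptom, assuming the same turbo_seti version.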

This is the first time that I have seen this symptom. Either:

- A new situation has turned up with MeerKAT data that the turbo_seti FindDoppler class cannot handle, or
- Something is rotten in the MeerKAT header or data.

cc: @ZaynAmell

Richard Elkins · Answer 1 · Sat May 29 2021 01:39:15 GMT+0800 (China Standard Time)

Source code: https://github.com/UCBerkeleySETI/turbo_seti/blob/master/turbo_seti/find_doppler/find_doppler.py,
lines 449-450.

Richard Elkins · Answer 2 · Sat May 29 2021 02:36:04 GMT+0800 (China Standard Time)

@telegraphic @david-macmahon @luigifcruz

The code in question is the Taylor tree processing. I find it perplexing that it has been successful for a number of years on a number of different files but now, with MeerKAT data, it fails.

Will try to get more info as to why tree_findoppler == tree_findoppler_original at the end of each coarse channel's processing. Hopefully, this will become obvious.

Richard Elkins · Answer 3 · Sat May 29 2021 03:42:52 GMT+0800 (China Standard Time)

search_coarse_channel() and Taylor tree processing are innocent! (:
The data_handler.py module somehow calculated a drift rate resolution of 20989.59 Hz/s.
More logger.debug insertions needed.

Richard Elkins · Answer 4 · Sat May 29 2021 03:58:20 GMT+0800 (China Standard Time)

@telegraphic Your favorite code (issue #98) in the DATAH class instantiation:

        #EE To check if swapping tsteps_valid and tsteps is more appropriate.
        self.tsteps_valid = header['NAXIS2']
        self.tsteps = int(math.pow(2, math.ceil(np.log2(math.floor(self.tsteps_valid)))))

        self.obs_length = self.tsteps_valid * header['DELTAT']
        self.drift_rate_resolution = (1e6 * np.abs(header['DELTAF'])) / self.obs_length   # in Hz/sec
        self.header['baryv'] = 0.0
        self.header['barya'] = 0.0
        self.header['coarse_chan'] = coarse_chan

Decoder ring for data_handler.py class DATAH using Voyager values:

header:  
    {'SOURCE': 'Voyager1',                  = source_name
    'MJD': 57650.78209490741,               = tstart
    'DEC': '12d10m58.8s',                   = src_dej
    'RA': '17h10m03.984s',                  = src_raj
    'DELTAF': -2.7939677238464355e-06,      = foff
    'DELTAT': 18.253611008,                 = tsamp
    'NAXIS1': 1048576,                      = nchans / n_coarse_chans
                                            = nchans, if n_coarse_chans is not supplied (=1)
    'FCNTR': 8419.921873603016,             = frequency at the center of the list of channels
    'NAXIS': 2,                             NEVER USED
    'NAXIS2': 16,                           = n_ints_in_file
    'baryv': 0.0,                           NEVER USED
    'barya': 0.0,                           NEVER USED
    'coarse_chan': 0                        NEVER USED, not the same as n_coarse_chans
    } 

# Currently needed in search_coarse_channel:
fftlen=1048576,                             = NAXIS1
tsteps_valid=16,                            = NAXIS2 = n_ints_in_file
tsteps=16,                                  = 2 ^ ( ceil ( log2 ( floor ( tsteps_valid ) ) ) ) = n_ints_in_file    <----- !!!
obs_length=292.057776128,                   = tsteps_valid * DELTAT = n_ints_in_file * tsamp, units in seconds
shoulder_size=0                             ALWAYS 0
drift_rate_resolution=0.009566489757225039, = 1e6 * |DELTAF| / obs_length = 1e6 * |foff| / obs_length, units in Hz/s
tdwidth=1048576                             = fftlen + shoulder_size * tsteps = NAXIS1 + 0 = NAXIS1
                                            i.e. tdwidth ALWAYS = fftlen
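
The decoder ring can be checked with a standalone sketch that recomputes the derived values from the header fields. This is not the DATAH class itself; the function name `derive_search_params` and the returned dictionary are assumptions made for illustration, while the formulas are the ones quoted above.

```python
import math


def derive_search_params(naxis1, naxis2, deltat, deltaf):
    """Recompute the DATAH-derived values from header fields (standalone sketch)."""
    tsteps_valid = naxis2
    # Round the number of integrations up to the next power of two.
    tsteps = int(math.pow(2, math.ceil(math.log2(math.floor(tsteps_valid)))))
    obs_length = tsteps_valid * deltat                      # seconds
    drift_rate_resolution = 1e6 * abs(deltaf) / obs_length  # Hz/s (deltaf is in MHz)
    shoulder_size = 0                                       # always 0, per the notes above
    tdwidth = naxis1 + shoulder_size * tsteps
    return {
        "fftlen": naxis1,
        "tsteps_valid": tsteps_valid,
        "tsteps": tsteps,
        "obs_length": obs_length,
        "shoulder_size": shoulder_size,
        "drift_rate_resolution": drift_rate_resolution,
        "tdwidth": tdwidth,
    }


# Voyager header values from the decoder ring.
voyager = derive_search_params(
    naxis1=1048576,
    naxis2=16,
    deltat=18.253611008,
    deltaf=-2.7939677238464355e-06,
)
for name, value in voyager.items():
    print(f"{name} = {value}")
```

Feeding in a different header makes it easy to see which derived value goes out of range.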

I feel like a linguist and a historian, now.

Richard Elkins · Answer 5 · Sat May 29 2021 04:19:21 GMT+0800 (China Standard Time)

Checking the Math:

DELTAF = foff = 0.208984375 MHz
tsteps_valid = 2048
DELTAT = tsamp = 0.00489988785046729 s i.e. every 5 milliseconds! (really?)

obs_length = tsteps_valid * DELTAT = 2048 * 0.00489988785046729 = 10.035 s
drift_rate_resolution = 1e6 * 0.208984375 / 10.035 = 20825.6
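
The same arithmetic as a small sketch. The second call is an assumption on my part: it uses 2032 integrations (the sample count that shows up in the fbgen mimic of this file later in the thread) instead of 2048, which would be one possible explanation for the slightly different 20989.59 reported by data_handler.py earlier.

```python
def drift_rate_resolution(foff_mhz, n_ints, tsamp_s):
    """Drift rate resolution in Hz/s: 1e6 * |foff| / (n_ints * tsamp)."""
    obs_length = n_ints * tsamp_s  # seconds
    return 1e6 * abs(foff_mhz) / obs_length


FOFF_MHZ = 0.208984375          # DELTAF = foff
TSAMP_S = 0.00489988785046729   # DELTAT = tsamp

# Using tsteps_valid = 2048, as in the hand calculation above.
res_2048 = drift_rate_resolution(FOFF_MHZ, 2048, TSAMP_S)

# Assumption: the file actually holds 2032 integrations.
res_2032 = drift_rate_resolution(FOFF_MHZ, 2032, TSAMP_S)

print(f"2048 integrations: {res_2048:.2f} Hz/s")
print(f"2032 integrations: {res_2032:.2f} Hz/s")
```

Either way, a single drift step is on the order of 20 kHz/s.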

Logs for Voyager and MeerKAT are attached.

log.meerkat.txt
log.voyager.txt

Richard Elkins · Answer 6 · Sat May 29 2021 05:07:05 GMT+0800 (China Standard Time)

IMO: Someone should write a spec for what the Filterbank header is for MeerKAT. Data mapping is encouraged.

Richard Elkins · Answer 7 · Sat May 29 2021 05:08:05 GMT+0800 (China Standard Time)

Reviewed & finalised specifications -> less guessing and wasted time by multiple people!

Cherry Ng · Answer 8 · Sat May 29 2021 08:23:58 GMT+0800 (China Standard Time)

The MeerKAT data format is documented for example in the figures here. In summary, we target a final frequency resolution of 1.59 Hz and a time resolution of 0.627 s. The data file that is causing trouble in this GitHub issue is definitely non-standard. However, I did not expect the drift rate search algorithm to be tied to specific data shapes. Is it possible to track down and remove any assumptions the code is making, and have the code adapt to any frequency/time combination? That would certainly increase the use cases for the code base for future telescopes that might have different formats.

Richard Elkins · Answer 9 · Sat May 29 2021 20:27:41 GMT+0800 (China Standard Time)

I will re-test with an artificial Filterbank file with frequency resolution of 1.59Hz and time resolution of 0.627s.

Regarding the current turbo_seti search algorithm, I wish that I had better news. It is fairly fixed in terms of its limitations. I'd point you to a design specification but there isn't one. I've learned how it works through reverse engineering, bottom up.

Richard Elkins · Answer 10 · Sat May 29 2021 21:44:47 GMT+0800 (China Standard Time)

@cherryng
Much better, using my fbgen tool to mimic the original file specs, modified to use the resolutions you requested:

Using blimpy version 2.0.11
Using turbo_seti version 2.0.19
Using h5py version 3.2.1

doppler_search: Calling FindDoppler(/home/elkins/BASIS/seti_data/meerkat/cherry_ng.h5)

--- File Info ---
DIMENSION_LABELS :   ['time' 'feed_id' 'frequency']
        az_start :                              0.0
       data_type :                                1
            fch1 :                     1511.375 MHz
            foff :                      0.00159 MHz
           ibeam :                                1
      machine_id :                                0
          nbeams :                                1
           nbits :                               32
          nchans :                             3712
            nifs :                                1
     rawdatafile :                              N/A
     source_name :                            fbgen
         src_dej :                       12:10:58.8
         src_raj :                     17:10:03.984
    telescope_id :                                0
           tsamp :                            0.627
   tstart (ISOT) :          2021-05-29T12:00:00.000
    tstart (MJD) :                          59363.5
        za_start :                              0.0

Num ints in file :                             2032
      File shape :                  (2032, 1, 3712)
--- Selection Info ---
Data selection shape :                  (2032, 1, 3712)
Minimum freq (MHz) :                         1511.375
Maximum freq (MHz) :                       1517.27549
doppler_search: Output directory: /home/elkins/BASIS/seti_data/meerkat/

turbo_seti version 2.0.19
blimpy version 2.0.11
h5py version 3.2.1

find_doppler    INFO     {'DIMENSION_LABELS': array(['time', 'feed_id', 'frequency'], dtype=object), 'az_start': 0.0, 'data_type': 1, 'fch1': 1511.375, 'foff': 0.00159, 'ibeam': 1, 'machine_id': 0, 'nbeams': 1, 'nbits': 32, 'nchans': 3712, 'nifs': 1, 'rawdatafile': 'N/A', 'source_name': 'fbgen', 'src_dej': <Angle 12.183 deg>, 'src_raj': <Angle 17.16777333 hourangle>, 'telescope_id': 0, 'tsamp': 0.627, 'tstart': 59363.5, 'za_start': 0.0}
find_doppler    INFO     File: /home/elkins/BASIS/seti_data/meerkat/cherry_ng.h5
 drift rates (min, max): (0.100000, 5.000000)
 SNR: 25.000000

find_doppler    DEBUG    Recreating DAT and LOG files
Starting ET search using /home/elkins/BASIS/seti_data/meerkat/cherry_ng.h5
find_doppler    INFO     Parameters: datafile=/home/elkins/BASIS/seti_data/meerkat/cherry_ng.h5, max_drift=5, min_drift=0.1, snr=25.0, out_dir=/home/elkins/BASIS/seti_data/meerkat/, coarse_chans=None, flagging=False, n_coarse_chan=1, kernels=None, gpu_backend=False, precision=2, append_output=False, log_level_int=10, obs_info={'pulsar': 0, 'pulsar_found': 0, 'pulsar_dm': 0.0, 'pulsar_snr': 0.0, 'pulsar_stats': array([0., 0., 0., 0., 0., 0.]), 'RFI_level': 0.0, 'Mean_SEFD': 0.0, 'psrflux_Sens': 0.0, 'SEFDs_val': [0.0], 'SEFDs_freq': [0.0], 'SEFDs_freq_up': [0.0]}
find_doppler.0  DEBUG    ===== coarse_channel=0, f_start=1511.375, f_stop=1517.27708
find_doppler.0  DEBUG    flagging=False, tsteps=2048, tsteps_valid=2048, tdwidth=3712, fftlen=3712, nframes=2048, shoulder_size=0, drift_rate_resolution=1.2479749839882455
find_doppler.0  DEBUG    median_flag=[0.]
find_doppler.0  DEBUG    specstart=0, specend=3712
find_doppler.0  DEBUG    comp_stats the_mean_val=13207926304512.0, the_stddev=70567931213.83704
find_doppler.0  DEBUG    BEGIN looping over drift_rate_nblock, drift_rate_nblock=0.
find_doppler.0  DEBUG    Drift_block 0 (in range from 0 through 0)
find_doppler.0  DEBUG    populate_tree() roll=0
find_doppler.0  DEBUG    done...
find_doppler.0  DEBUG    ***** drift_block <= 0 selected drift range:
[-4.99189994 -3.74392495 -2.49594997 -1.24797498]
find_doppler.0  DEBUG    Start searching for hits at drift rate: -4.991900
find_doppler.0  DEBUG    Start searching for hits at drift rate: -3.743925
find_doppler.0  DEBUG    Start searching for hits at drift rate: -2.495950
find_doppler.0  DEBUG    Start searching for hits at drift rate: -1.247975
find_doppler.0  DEBUG    populate_tree() roll=0
find_doppler.0  DEBUG    done...
find_doppler.0  DEBUG    tree_findoppler changed
find_doppler.0  DEBUG    ***** drift_block >= 0 selected drift range:
[1.24797498 2.49594997 3.74392495 4.99189994]
find_doppler.0  DEBUG    Start searching for hits at drift rate: 1.247975
find_doppler.0  DEBUG    Hit found at SNR 41.763764! 	Spectrum index: 1133, Drift rate: 1.247975	Uncorrected frequency: 1513.176470	Corrected frequency: 1513.176470
find_doppler.0  DEBUG    Hit found at SNR 42.849693! 	Spectrum index: 1134, Drift rate: 1.247975	Uncorrected frequency: 1513.178060	Corrected frequency: 1513.178060
find_doppler.0  DEBUG    Start searching for hits at drift rate: 2.495950
find_doppler.0  DEBUG    Hit found at SNR 31.414133! 	Spectrum index: 1133, Drift rate: 2.495950	Uncorrected frequency: 1513.176470	Corrected frequency: 1513.176470
find_doppler.0  DEBUG    Start searching for hits at drift rate: 3.743925
find_doppler.0  DEBUG    Start searching for hits at drift rate: 4.991900
find_doppler.0  DEBUG    END looping over drift_rate_nblock.
find_doppler.0  DEBUG    original matrix size: 7602176	(2048, 3712)
find_doppler.0  DEBUG    tree_orig shape: (2048, 3712)
find_doppler.0  DEBUG    SNR not big enough... 41.763764 pass... index: 1133
find_doppler.0  INFO     Top hit found! SNR 42.849693, Drift Rate 1.247975, index 1134
find_doppler.0  DEBUG    Total number of candidates for coarse channel 0 is: 3
doppler_search: search elapsed time = 1.996117115020752 seconds

The drift rate resolution (about 1.25 Hz/s) is vastly improved compared to >20,000 Hz/s in the previous file.
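
To illustrate why the resolution matters, here is a simplified sketch that lists the drift rates available as whole multiples of the resolution within the search limits. It is not turbo_seti's actual stepping code; the helper `drift_rate_steps` is hypothetical, and applying max_drift=5 to the original file is an assumption, since the limits of the failing run are not stated here. Note that foff = 0.00159 MHz in the File Info is 1590 Hz per fine channel, which is what the logged resolution corresponds to.

```python
def drift_rate_steps(resolution, max_drift, min_drift=0.0):
    """List positive multiples of `resolution` within [min_drift, max_drift]."""
    steps = []
    k = 1
    while k * resolution <= max_drift:
        if k * resolution >= min_drift:
            steps.append(k * resolution)
        k += 1
    return steps


# fbgen file: foff = 0.00159 MHz, 2032 integrations, tsamp = 0.627 s.
fbgen_resolution = 1e6 * 0.00159 / (2032 * 0.627)

# Search limits taken from the log above: min_drift=0.1, max_drift=5.
good = drift_rate_steps(fbgen_resolution, max_drift=5.0, min_drift=0.1)

# Original file, using the hand-calculated resolution of about 20825.6 Hz/s.
bad = drift_rate_steps(20825.6, max_drift=5.0, min_drift=0.1)

print(f"fbgen resolution: {fbgen_resolution}")
print(f"fbgen file drift steps: {good}")
print(f"original file drift steps: {bad}")
```

With the fbgen file the positive steps match the drift range in the log; with the original file the very first step already exceeds the limit, so under these assumptions there is nothing left to search.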

Note that fbgen is not nearly as good at producing realistic signals and noise compared to setigen but has the advantage of being able to code header specs without having to write a Python program.

Richard Elkins · Answer 11 · Sat May 29 2021 21:46:28 GMT+0800 (China Standard Time)

fbgen configuration used:

[Meerkat-ish]

# Number of samples (time integrations)
nsamples = 2032

# Number of polarisation arrays per sample
nifs = 1

# Number of channels per polarisation array
nchans = 3712

# First frequency channel (MHz)
fch1 = 1511.375

# Difference between each of the nchans frequencies (MHz)
foff = 1.59e-03

# First sample start time in ISO time zone and format
tstart_iso = 2021-05-29T12:00:00.000

# Time interval between samples (s)
tsamp = 0.627

# Number of bits per piece of data to write
# 32: float
# 8: integer in range of 0 to 255
nbits = 32

# Signal boundaries
signal_low = 4e9
signal_high = 9e9

# Maximum noise as a fraction
max_noise = 1.0

# Frequency buffer size
# Maximum number of frequencies to write at a time
max_freq_write = 3712

Richard Elkins · Answer 12 · Fri Jun 11 2021 06:23:50 GMT+0800 (China Standard Time)

Resolved.
The fine channels in the original file were too wide (too coarse a frequency resolution) for turbo_seti, which is narrow-band in nature.