tomvansteijn / xsboringen

plot borehole data in cross-sections

More efficient raster sampling for cross-section lines

tomvansteijn opened this issue · comments

Rasters are currently sampled at station points spaced at a given step distance along the cross-section line (the "res" parameter in the input yaml files). The step distance should be small compared to the raster cell size, so that no raster cell is missed; a missed cell can cause inconsistencies in the plot, for example when plotting groundlayers.

However, when the step size is small, each raster cell is sampled many times over, which makes this method inefficient.

I am looking for a better method to find the raster cells (row, col) that intersect the cross-section line and to plot a nice stepped line. Perhaps using Bresenham's line algorithm?
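For reference, a minimal standalone sketch of Bresenham's algorithm on grid indices (the function name and row/col convention are mine, not xsboringen's); the line endpoints would first have to be mapped to raster indices, e.g. with rasterio's `index()`. Note that plain Bresenham skips cells that the line only clips near a corner, so a supercover variant may be needed for strict intersection:

```python
def bresenham_cells(row0, col0, row1, col1):
    """Yield the (row, col) cells visited on the way from (row0, col0)
    to (row1, col1), each cell exactly once."""
    dcol = abs(col1 - col0)
    drow = -abs(row1 - row0)
    scol = 1 if col0 < col1 else -1
    srow = 1 if row0 < row1 else -1
    err = dcol + drow
    while True:
        yield row0, col0
        if row0 == row1 and col0 == col1:
            return
        e2 = 2 * err
        if e2 >= drow:
            err += drow
            col0 += scol
        if e2 <= dcol:
            err += dcol
            row0 += srow
```

Each yielded (row, col) pair would then be sampled once, instead of sampling many station points that land in the same cell.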

I have been playing around with this recently because I felt some things could be optimized, but I didn't have a clear picture of how. I ended up doing this (not entirely finished and internally consistent yet):

import numpy as np
import rasterio


class RasterFileSubsurfaceLayer(SubsurfaceLayer):
    def __init__(self, cache=True, **kwargs):
        super().__init__(**kwargs)
        self._cache = {}
        # read the grid geometry once, so coordinates can be snapped to cell centers
        with rasterio.open(self.top) as top:
            self.xul, self.yul = top.bounds.left, top.bounds.top
            self.delx = (top.bounds.right - top.bounds.left) / top.width
            self.dely = (top.bounds.top - top.bounds.bottom) / top.height

    def sample(self, coords):
        grid_coords = list(zip(*self.grid_coord(coords)))
        # deduplicate while keeping a fixed order, so the zip with the
        # sampled values below stays aligned
        unknown_coords = [coord for coord in dict.fromkeys(grid_coords)
                          if coord not in self._cache]
        sample_top_base = zip(
            self._sample_raster(self.top, unknown_coords),
            self._sample_raster(self.bottom, unknown_coords),
        )
        for coord, top_base in zip(unknown_coords, sample_top_base):
            self._cache[coord] = top_base
        for coord in grid_coords:
            yield self._cache[coord]

    def grid_coord(self, coords):
        # snap world coordinates to the center of the cell they fall in
        x, y = coords[:, 0], coords[:, 1]
        x = x - (x - self.xul) % self.delx + self.delx / 2
        y = y + (self.yul - y) % self.dely - self.dely / 2
        return x, y

    @staticmethod
    def _sample_raster(rasterfile, coords):
        with rasterio.open(rasterfile) as src:
            for value in src.sample(coords):
                if value[0] in src.nodatavals:
                    yield np.nan
                else:
                    yield float(value[0])

This does two things: it keeps a cache of previously queried cells, and it rounds the queried coordinates to the center of the cell they fall in. Sampling therefore becomes a little more involved: first compute the "grid coordinates" of each requested coordinate, then find out which of those are already cached, extract the remaining ones from the file, and finally return the results in the original order.
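The snapping step can be shown in isolation with made-up grid values (origin and cell size below are toy numbers): two query points that fall in the same cell map to the same center, and thus to the same cache key.

```python
import numpy as np

# Toy grid: upper-left corner (0, 100), square 25 m cells.
xul, yul = 0.0, 100.0
delx = dely = 25.0

coords = np.array([[37.0, 63.0], [42.0, 55.0], [60.0, 63.0]])
x, y = coords[:, 0], coords[:, 1]
# snap each coordinate to the center of its cell
x_snap = x - (x - xul) % delx + delx / 2
y_snap = y + (yul - y) % dely - dely / 2
# (37, 63) and (42, 55) share a cell -> both snap to (37.5, 62.5)
```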

Hi Hugo,

Good thinking! As it turns out, the sample method is often slower than reading the whole raster array into memory and indexing it, like so:

import os

import numpy as np
import rasterio


def sample_raster(rasterfile, coords, band=1):
    '''sample raster file at coords'''
    log.debug('reading rasterfile {}'.format(os.path.basename(rasterfile)))
    with rasterio.open(rasterfile) as src:
        xs, ys = zip(*coords)
        rows, cols = rasterio.transform.rowcol(src.transform, xs, ys)
        array = src.read(band)
        # cast to float so nodata cells can be replaced by NaN
        values = array[rows, cols].astype(float)
        nodata = [val for val in src.nodatavals if val is not None]
        values[np.isin(values, nodata)] = np.nan
        for value in values:
            yield value

This way the for loop is actually unnecessary as well: the values array could be returned directly. I haven't tested this thoroughly though. How would it compare?
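As a pure-numpy illustration of the indexing idea (grid geometry and values are made up here, and the affine transform is reduced to a north-up grid with square cells), the coordinate-to-index transform plus one fancy-indexing call replaces per-point sampling:

```python
import numpy as np

# Toy north-up grid: origin (0, 100), 25 m cells, 4 x 4 raster.
xul, yul, cell = 0.0, 100.0, 25.0
array = np.arange(16, dtype=float).reshape(4, 4)

xs = np.array([12.0, 37.0, 88.0])
ys = np.array([90.0, 63.0, 10.0])
# world coordinates -> integer grid indices
cols = ((xs - xul) // cell).astype(int)
rows = ((yul - ys) // cell).astype(int)
# one vectorized lookup for all points at once
values = array[rows, cols]
```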

One of the major advantages of using the cache is that repeated calls to sample_raster become a lot faster. I was doing some interactive plotting of variants of the same cross-section, and after initially loading the relevant data, each subsequent call was significantly faster. Using the internal rowcol transformation is undoubtedly a lot faster than the Python code I wrote.

Perhaps it would be interesting to cache the relevant section of the file (I currently have REGIS rasters for the entire country, while I need only a small section for a single cross-section). A more complete version of the implementation referenced above is shown here; perhaps this caching could also be achieved by loading the relevant part of the raster into GridSubsurfaceLayer objects instead.
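A rough sketch of how the relevant window could be determined (the function name and the one-cell padding are my assumptions, not existing code); the resulting bounds could then feed rasterio's windowed reading, e.g. `src.read(1, window=rasterio.windows.from_bounds(left, bottom, right, top, transform=src.transform))`:

```python
import numpy as np


def section_bounds(coords, cellsize):
    """Bounding box of the cross-section line, padded by one cell so the
    edge cells are included when reading a window from the raster."""
    xs, ys = coords[:, 0], coords[:, 1]
    pad = cellsize
    return (xs.min() - pad, ys.min() - pad, xs.max() + pad, ys.max() + pad)


# example line through a (hypothetical) country-wide 0.5 m raster
line = np.array([[1000.0, 5000.0], [1800.0, 5600.0]])
left, bottom, right, top = section_bounds(line, cellsize=0.5)
```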

Sampling from a 0.5 m DTM took ages using the old implementation, so I needed a quick solution. I work a lot with Xarray, so I used that: this method of sampling is very fast on the 30000 x 10000 raster I tested it on. I tried to keep the code as unintrusive as possible by sticking to the original functions. It does add a dependency on Xarray, of course; I used xarray 0.16.0.

Also note: I already changed the isnan check in my last PR, since np.nan == np.nan returns False. The same applies to value in nodatavals: due to floating-point errors the equality check can fail if the nodata value is e.g. -3.402823e+38.
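A small demonstration of both failure modes (the nodata value is made up, but representative of a float32 band):

```python
import numpy as np

nodata = -3.402823e+38
# value as it comes out of a float32 raster band, converted back to float64
stored = float(np.float32(nodata))
exact_match = stored == nodata            # False: float32 rounding changed it
close_match = np.isclose(stored, nodata)  # True: tolerance absorbs the rounding
nan_equal = np.nan == np.nan              # False: NaN never equals itself
```

So both checks need the tolerant forms: `np.isclose` for nodata and `np.isnan` for NaN.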

If you want I can make a PR, but from what I understood you also have an improved sampling method in place already.

import os

import numpy as np
import xarray as xr


def sample_raster(rasterfile, coords):
    log.debug('reading rasterfile {}'.format(os.path.basename(rasterfile)))
    da = xr.open_rasterio(rasterfile).squeeze()
    x_samples = [c[0] for c in coords]
    y_samples = [c[1] for c in coords]
    # sel builds the full x/y cross-product; the diagonal holds the
    # pointwise nearest-neighbour samples
    profile_y = da.sel(y=y_samples, x=x_samples, method='nearest').values.diagonal()

    for value in profile_y:
        if any(np.isclose(value, da.nodatavals)):
            yield np.nan
        elif np.isnan(value):
            yield np.nan
        else:
            yield value
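As a possible refinement (my sketch, not part of the PR): wrapping the sample coordinates in DataArrays that share a dimension makes `sel` do true pointwise selection, avoiding the N x N intermediate that `.diagonal()` has to cut through. The toy raster below stands in for the opened file:

```python
import numpy as np
import xarray as xr

# stand-in for xr.open_rasterio(...).squeeze(): 4 x 4 grid, y descending
da = xr.DataArray(
    np.arange(16.0).reshape(4, 4),
    coords={'y': [40.0, 30.0, 20.0, 10.0], 'x': [0.0, 10.0, 20.0, 30.0]},
    dims=('y', 'x'),
)
# shared "points" dimension triggers vectorized (pointwise) selection
xi = xr.DataArray([1.0, 22.0, 29.0], dims='points')
yi = xr.DataArray([39.0, 21.0, 12.0], dims='points')
values = da.sel(x=xi, y=yi, method='nearest').values  # shape (3,), not (3, 3)
```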