oceanmodeling / StormEvents

Python interfaces for observational data surrounding named storm events, born from @jreniel's ADCIRCpy

Home Page:https://stormevents.readthedocs.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Use StormEvents / Searvey to Compare ADCIRC Results with Real-world Data

SorooshMani-NOAA opened this issue · comments

@g7192, please document your progress in this "issue". This would then serve as a good showcase for using StormEvents and/or Searvey.

Just for the sake of completeness:

  • Simulation data is taken from Global ESTOFS
    • CWL Point Data is to be used
  • The simulated point data is to be compared with actual station measurements
  • Comparison is going to be in the form of basic timeseries visualization as well as statistics of timeseries

@brey @g7192 @SorooshMani-NOAA

Hi George,
I will paste your suggestions for Jane here:

** Regarding the work for Jane and I don’t know what she is working on, but I would suggest the time-series clean up problem. I am posting some resources below and I am available for elaboration. This could be a literature review and/or prototyping exercise. Cheers, G **

https://arxiv.org/pdf/2004.08284.pdf
https://medium.com/analytics-vidhya/cleaning-and-understanding-multivariate-time-series-data-6554eefbda9c

@saeed-moghimi-noaa @SorooshMani-NOAA @brey

Currently, I've been reading documentation about the OceanModeling packages, specifically StormEvents, as well as about time series, xarrays, and statistical methods in general.

Here is my working code in Python:

import stormevents
import xarray
import numpy as np
from stormevents.coops import coops_stations
from stormevents.coops import COOPS_Station
from datetime import datetime, timedelta
from shapely.geometry import Polygon
from stormevents.coops import coops_stations_within_region
from stormevents.coops import coops_product_within_region
from matplotlib import pyplot as plt

# Retrieve simulation data (combined water level)
file = '/Desktop/estofs.t06z.points.cwl.nc'
sim_data = xarray.open_dataset(file, engine="netcdf4")
sim_data.data_vars
sim_data.dims

# Obtain time series for a station at random from simulation data
sim_station_name = sim_data.station_name[4]
sim_station_lon = sim_data.x[4]
sim_station_lat = sim_data.y[4]
sim_timeseries = sim_data.zeta[:, 4]
sim_timeseries_df = sim_timeseries.to_dataframe()
station_id = int(np.reshape(sim_station_name.astype(str).str.split("")[2].to_numpy(), 1)[0])

# Use StormEvents to obtain time series for the same station using the real data
station = COOPS_Station(station_id)
start_date = sim_timeseries_df.index[0]
end_date = sim_timeseries_df.index[len(sim_timeseries_df.index) - 1]
real_df = station.product('water_level',  start_date=start_date, end_date=end_date).to_dataframe()
real_timeseries_df = real_df.iloc[:, 0]
real_timeseries_df = real_timeseries_df.reset_index(level='nos_id')
real_timeseries_df = real_timeseries_df.iloc[:, 1]

# Re-index simulation data time series based on real data time series
sim_timeseries_df = sim_timeseries_df.reindex(index=real_timeseries_df.index, method='nearest').drop_duplicates()

# Plot both time series (simulation data and real data) on the same graph
plt.plot(sim_timeseries_df, label='Simulation Data Time Series')
plt.plot(real_timeseries_df, label='Real Data Time Series')
plt.title('Water Level Time Series Comparison')
plt.xlabel('Date/Time')
plt.legend()
plt.show()

# Compute some statistics to compare simulation data and real data
# Can convert both time series from dataframes to numpy arrays first
real_timeseries_array = real_timeseries_df.to_numpy()
sim_timeseries_array = sim_timeseries_df.to_numpy()

# Bias
# Positive values of bias indicate negatively-biased modelled data
# Negative values of bias indicate positively-biased modelled data
bias = real_timeseries_df.mean() - sim_timeseries_df.mean()
# Or, can compute bias or mean error in this way since expectation is linear
mean_error = np.nanmean(real_timeseries_array - sim_timeseries_array)
# Now, remove bias between the two time series and re-plot
plt.plot(sim_timeseries_df + bias, label='Simulation Data Time Series (Bias-corrected)')
plt.plot(real_timeseries_df, label='Real Data Time Series')
plt.title('Water Level Time Series Comparison (Bias-corrected)')
plt.xlabel('Date/Time')
plt.legend()
plt.show()

# Mean Absolute Error
mean_abs_error = np.nanmean(abs(real_timeseries_array - sim_timeseries_array))

# Root Mean Square Error (RMSE)
rmse = np.sqrt(np.nanmean((real_timeseries_array - sim_timeseries_array) ** 2))

# Scatter Index or Percentage RMSE
scatter_index = (rmse / np.nanmean(real_timeseries_array)) * 100


# Side note: we can also obtain water level data for stations within a given region using the Polygon() function
polygon = Polygon([(-67.204002, 44.654499), (-70.563301, 43.32), (-71.16481, 41.707001)])
coops_product_within_region(
    'water_level',
    region=polygon,
    start_date=datetime(2021, 2, 2),
    end_date=datetime(2021, 2, 9, 18)
)

water_level_timeseries

water_level_timeseries_corrected

@g7192

I took the jpg out of code section and it showed up. Looks very good. To compare the signal you also can remove the bias. Something like:

bias = data.mean() - model.mean()
plt.plot(sim_timeseries_df + bias, label='Simulation Data Timeseries')
plt.plot(real_timeseries_df, label='Real Data Timeseries')

My new code reflects the following changes that I made:

  • Looped through multiple stations
  • Improved quality of code by generalizing code to work for any station (parsed simulation data to obtain parameters of interest such as the station id and the start and end dates, rather than manually hardcoding them)
  • Created dictionary to store results, especially the calculated statistics, for easy access
  • Tested the total time it took to loop through a certain number of stations
  • Re-indexed the real time series dataframe based on the index of the simulation time series dataframe instead of the other way around (since the simulation time series data is sampled more frequently)

Updated code:

import xarray
import numpy as np
import time
from stormevents.coops import coops_stations
from stormevents.coops import COOPS_Station
from datetime import datetime, timedelta
from shapely.geometry import Polygon
from stormevents.coops import coops_stations_within_region
from stormevents.coops import coops_product_within_region
from matplotlib import pyplot as plt

# Retrieve simulation data (combined water level)
file = '/Desktop/estofs.t06z.points.cwl.nc'
sim_data = xarray.open_dataset(file, engine="netcdf4")
sim_data.data_vars
sim_data.dims


def main():
    # Obtain time series for a station at random from simulation data
    stats_dict = {}
    for i in range(100):
        sim_station_name = sim_data.station_name[i]
        sim_station_lon = sim_data.x[i]
        sim_station_lat = sim_data.y[i]
        sim_timeseries = sim_data.zeta[:, i]
        sim_timeseries_df = sim_timeseries.to_dataframe()
        station_id = int(np.reshape(sim_station_name.astype(str).str.split("")[2].to_numpy(), 1)[0])
        stats_dict[station_id] = {}

        # Use StormEvents to obtain time series for the same station using the real data
        station = COOPS_Station(station_id)
        start_date = sim_timeseries_df.index[0]
        end_date = sim_timeseries_df.index[len(sim_timeseries_df.index) - 1]
        real_df = station.product('water_level',  start_date=start_date, end_date=end_date).to_dataframe()
        
        if real_df.empty:
            continue
        else:
            real_timeseries_df = real_df.iloc[:, 0]
            real_timeseries_df = real_timeseries_df.reset_index(level='nos_id')
            real_timeseries_df = real_timeseries_df.iloc[:, 1]

            # Re-index real data time series based on simulation data time series
            real_timeseries_df = real_timeseries_df.reindex(index=sim_timeseries_df.index, method='nearest').drop_duplicates()
            # Plot both time series (simulation data and real data) on the same graph
            plt.plot(sim_timeseries_df, label='Simulation Data Time Series')
            plt.plot(real_timeseries_df, label='Real Data Time Series')
            plt.title('Water Level Time Series Comparison')
            plt.xlabel('Date/Time')
            plt.legend()
            plt.show()

            # Compute some statistics to compare simulation data and real data
            # Can convert both time series from dataframes to numpy arrays first
            real_timeseries_array = real_timeseries_df.to_numpy()
            sim_timeseries_array = sim_timeseries_df.to_numpy()

            # Bias or Mean Error
            # Positive values of bias indicate negatively-biased modelled data
            # Negative values of bias indicate positively-biased modelled data
            bias = np.nanmean(real_timeseries_array - sim_timeseries_array)
            stats_dict[station_id]['Bias'] = bias
            # Now, remove bias between the two time series and re-plot
            plt.plot(sim_timeseries_df + bias, label='Simulation Data Time Series (Bias-corrected)')
            plt.plot(real_timeseries_df, label='Real Data Time Series')
            plt.title('Water Level Time Series Comparison (Bias-corrected)')
            plt.xlabel('Date/Time')
            plt.legend()
            plt.show()

            # Mean Absolute Error
            mean_abs_error = np.nanmean(abs(real_timeseries_array - sim_timeseries_array))
            stats_dict[station_id]['Mean Absolute Error'] = mean_abs_error

            # Root Mean Square Error (RMSE)
            rmse = np.sqrt(np.nanmean((real_timeseries_array - sim_timeseries_array) ** 2))
            stats_dict[station_id]['RMSE'] = rmse

            # Scatter Index  or Percentage RMSE
            scatter_index = (rmse / np.nanmean(real_timeseries_array)) * 100
            stats_dict[station_id]['Scatter Index'] = scatter_index

    print(stats_dict)

# Side note: we can also obtain water level data for stations within a given region using the Polygon() function
polygon = Polygon([(-67.204002, 44.654499), (-70.563301, 43.32), (-71.16481, 41.707001)])
coops_product_within_region(
    'water_level',
    region=polygon,
    start_date=datetime(2021, 2, 2),
    end_date=datetime(2021, 2, 9, 18)
)


if __name__ == '__main__':
    starttime = time.time()
    main()
    endtime = time.time()
    print(f"Total time = {-starttime+endtime}")

Thank you for the update @g7192, keep up the good work! Let's discuss the next steps in our next tag up.

#!/usr/bin/python

""" Import libraries and script options """
import os, sys
import warnings
import time
import pyschism
import numpy as np
import pandas as pd
from pyschism.mesh import Hgrid
from tqdm.notebook import tqdm_notebook

os.system("clear")  # Clear the screen
warnings.filterwarnings("ignore")  # turn off warnings

# ========================== Globals ========================== #
# Path to reqiured files:
#case = 'irene_delaware_2011/tide_only/spinup_run/'  # 'irene_delaware_2011' 'florence_atlantic_2018'
case = 'irene_delaware_2011/VIMS_domain/'

PathtoWork = '/scratch2/COASTAL/coastal/save/Bahram.Khazaei/work/'  # work directory
PathToCodes = PathtoWork + 'source_codes/schism_src/my_schism_utility/'  # schism utility functions
PathToMyFunctions = PathToCodes + 'utility_functions/'
PathToMySchismFunctions = PathtoWork +'source_codes/schism_src/my_schism_utility/utility_functions/'

PathtoRunDir = '/scratch2/COASTAL/coastal/save/Bahram.Khazaei/work/run_dir/'  # my run directory
#PathToPriorOutputs = PathtoRunDir + case + '../ensemble/outputs_ensemble0/'
PathToPriorOutputs = PathtoRunDir + case + 'outputs_HOTSTART@2022.06.10/'
#PathToPriorOutputs = PathtoRunDir + case + 'files_backup_original/outputs_original/'

PathToObsrv = PathToCodes + 'validation_tools/coops_water_level_observations/water_levels_delaware_2011_months_4_10.nc'

# Ensemble options:
Nens = 10 # Number of ensembles

# selected points for time series plotting:
#  1) can be defined manually, for example: points = [(-75.20,39), (-74, 38), (-65,25)]
#  2) can be read from water level observation files
points=[]

# Define station list for stations argument or for all stations in obs file use: stations=None
select_stations = [8548989, 8539094, 8545240, 8540433, 8551762, 8551910, 8537121, 8555889, 8536110, 8557380]
pointsx = [-74.7517, -74.8700, -75.1407, -75.4100, -75.5883, -75.5717, -75.3750, -75.1133, -74.9668, -75.1200]
pointsy = [ 40.1367,  40.0817,  39.9333,  39.8117,  39.5817,  39.5583,  39.3050,  38.9867,  38.9667,  38.7816]
ajustments= [-5.5, -6.25, -2.25, -7.75, -7.85, -1.45, -6.75, -6.75, -1.70, -1.65]

# COLOR Coding:
C_RED   = '\33[3;1;31m'
C_BLUE  = '\33[3;1;34m'
C_BOLD  = '\33[1m'
C_CLEAR = '\33[0m'
# ====================== End of Globals ======================= #

""" Simulations """
start_time = time.time()
print(C_RED+"☑ Script started ... "+C_CLEAR)

# Append path and import required functions:
sys.path.append(PathToMyFunctions)
sys.path.append(PathToMySchismFunctions)
from SchismFunctions import read_SchismWaterElevation2D
from SchismFunctions import read_ObservedWaterLevel
from PythonFunctions import find_indices2D, ModelSkillsAll, PlotSkillHeatmap, PlotEnsembleSkillHeatmap

# Read the grid first:
print('  → Reading the \033[1m'+'hgrid.gr3'+'\033[0m file for grid information ...')
hgrid = Hgrid.open(PathtoRunDir + case + 'hgrid.gr3', crs='EPSG:4326')

# Read observations:
checkpoint_time = time.time()
df_obs, data_obs, NOSstationIDs = read_ObservedWaterLevel(PathToObsrv, stations=select_stations, obs_adjustment=ajustments) # stations=None

# Read simulations:
# Create an array of coordinates for stations locations and get their mesh node indices:
#pointsx, pointsy = data_obs.x.values, data_obs.y.values

points = np.column_stack([pointsx, pointsy])
points_indices = find_indices2D(points, hgrid.x, hgrid.y) # Indices for slected stations at the base run
df_sim, times, t0, dt = read_SchismWaterElevation2D(PathToPriorOutputs, points_indices, stationIDs=NOSstationIDs, old_version=None)

checkpoint_time = time.time() - checkpoint_time
print('☑ Total time to read the nc files: \033[1m', int(checkpoint_time), 'seconds\033[0m!')

# Calculate Model Skills for all stations:
SkillStats , ij = [], 0
for nos_station in df_obs.columns:
    # merge obs and sim by time:
    merge = df_sim[[nos_station]].merge(df_obs[[nos_station]], on = 'date', how = 'inner',suffixes=('_sim', '_obs')).dropna()
    # get sim and obs values of merged dataframe:
    sim = merge.iloc[:][str(nos_station)+'_sim'].values
    obs = merge.iloc[:][str(nos_station)+'_obs'].values
    # calculate skills:
    SkillStats0 = ModelSkillsAll(sim, obs, stationID=nos_station, Index=ij)
    SkillStats.append(SkillStats0)
    ij = ij + 1
    
# Put all stations in one dataframe:
SkillStats= pd.concat(SkillStats, axis=1)    

# Plot Skill Matrix:
ModelSkillFileName = 'irene_skills_tide_only.png'
PlotSkillHeatmap(SkillStats, units='m', cmap_type='coolwarm', Nbins=5, DPI=None, save_file_name=ModelSkillFileName)

checkpoint_time = time.time() - checkpoint_time
print('☑ Total time to calculate model skills: \033[1m', int(checkpoint_time), 'seconds\033[0m!')

# Calculate Model Skills for all stations and in ensemble of simulations:
# Read simulations:
# Create an array of coordinates for stations locations and get their mesh node indices:
pointsx, pointsy = data_obs.x.values, data_obs.y.values
points = np.column_stack([pointsx, pointsy])
points_indices = find_indices2D(points, hgrid.x, hgrid.y) # Indices for slected stations at the base run

# Read all ensembles:
Nens = 10 ##########
BD0, MSE0, RMSE0, NRMSE0, NSE0, RHO0 = [], [], [], [], [], [] 
    
for i in range(0, Nens):
    print('☑ Reading \033[1m'+'Ensemble Number: '+str(int(i))+'\033[0m')
    PathToEnsembleOutputs = PathtoRunDir + case + '/../ensemble/outputs_ensemble'+str(i)+'/'
    df_sim, times, t0, dt = read_SchismWaterElevation2D(PathToEnsembleOutputs, points_indices, stationIDs=NOSstationIDs)

    # Calculate Stats:
    SkillStats , ij = [], 0
    for nos_station in df_obs.columns:
        # merge obs and sim by time:
        merge = df_sim[[nos_station]].merge(df_obs[[nos_station]], on = 'date', how = 'inner',suffixes=('_sim', '_obs')).dropna()
        # get sim and obs values of merged dataframe:
        sim = merge.iloc[:][str(nos_station)+'_sim'].values
        obs = merge.iloc[:][str(nos_station)+'_obs'].values
        # calculate skills:
        SkillStats0 = ModelSkillsAll(sim, obs, stationID=nos_station, Index=ij)
        SkillStats.append(SkillStats0)
        ij = ij + 1

    # Put all stations in one dataframe and read desired skill metrics for all stations,
    # then append the previous ensembles:
    SkillStats= pd.concat(SkillStats, axis=1)
    BD0.append(np.array(SkillStats.iloc[7,:]))
    MSE0.append(np.array(SkillStats.iloc[9,:]))
    RMSE0.append(np.array(SkillStats.iloc[10,:]))
    NRMSE0.append(np.array(SkillStats.iloc[11,:]))
    NSE0.append(np.array(SkillStats.iloc[8,:]))
    RHO0.append(np.array(SkillStats.iloc[5,:]))

# convert metrics to array:
# For version with 6 metrics
# SkillStats3D = [NOSstationIDs, 
#                np.array(BD0), np.array(MSE0), np.array(RMSE0),
#                np.array(NRMSE0), np.array(NSE0), np.array(RHO0)]
SkillStats3D = [NOSstationIDs, 
                np.array(BD0), np.array(RMSE0),
                np.array(NSE0)+0.5, np.array(RHO0)+0.5]

checkpoint_time = time.time() - checkpoint_time
print('☑ Total time to calculate model skills: \033[1m', int(checkpoint_time), 'seconds\033[0m!')

# Plot Skill Matrix:
ModelSkillFileName = 'skill_test_ensmeble.png'
PlotEnsembleSkillHeatmap(SkillStats3D, units='m', cmap_type='coolwarm', Nbins=5, DPI=None, save_file_name=ModelSkillFileName)
print(C_RED+"☑ Script completed successfully!"+C_CLEAR)

@g7192 following our discussion with @saeed-moghimi-noaa, let's do this:

  • First plot points of the ADCIRC points file on a global map and see how they are distributed against IOC stations.
  • Stick with IOC stations available in ADCIRC point file: look at all the available global stations in the point-cwl file and pick the ones that match IOC stations
  • Manually find when a given storm hits a station and download data for forecasts 6 to 12 hours before and after that time to plot/analyze against measurements.
  • When all above is done, you can start looking into automating parts of this one by one, e.g. automating downloading from GESTOFS based on the date, or automating finding storm landfall on a station (I'm not sure how the latter is possible without some more data, maybe if you find a storm track for non-US storms then you can use shapely), etc.

@saeed-moghimi-noaa @SorooshMani-NOAA
I have a question regarding StormEvents: I tried to retrieve data for a particular storm, following the examples provided in the StormEvents documentation:

storm = StormEvent('FLORENCE', 2018)

Upon running the above line of code, I get the following error:
Screen Shot 2022-08-07 at 12 36 25 PM

Do you have any suggestions? Thanks!

@g7192 can you try installing the certifi package and see if it makes any difference? Just run the following in your environment

pip install certifi

and then try getting storm information again

@saeed-moghimi-noaa @SorooshMani-NOAA
I was able to resolve the above error. Here is my most recent code, for the storm Elsa:

import xarray
import numpy as np
import re
from stormevents.coops import coops_stations
from stormevents.coops import COOPS_Station
from datetime import datetime
from matplotlib import pyplot as plt
from stormevents import StormEvent

# Identify stations affected by the storm Elsa using StormEvents and pick a particular station for further analysis
storm = StormEvent('elsa', 2021)
storm.track()
elsa_data = storm.coops_product_within_isotach('water_level', wind_speed=34, start_date='2021-07-02 12:00:00', end_date='2021-07-10 06:00:00')
station_id = np.array([elsa_data.nos_id[1]])[0]
station = COOPS_Station(station_id)

# Retrieve simulation data (combined water level) for specific station chosen above
file = '/Desktop/estofs.t00z.points.cwl-4.nc'
sim_data = xarray.open_dataset(file, engine="netcdf4")
sim_station_names = np.array(sim_data.station_name.astype(str)).tolist()
sim_station_index = sim_station_names.index([i for i in sim_station_names if re.findall(station_id.astype(str), i)][0])
sim_timeseries = sim_data.zeta[:, sim_station_index]
sim_timeseries_df = sim_timeseries.to_dataframe()

# Use StormEvents to obtain time series for the same station using Elsa data
start_date = sim_timeseries_df.index[0]
end_date = sim_timeseries_df.index[len(sim_timeseries_df.index) - 1]
elsa_df = station.product('water_level', start_date=start_date, end_date=end_date).to_dataframe()
elsa_timeseries_df = elsa_df.iloc[:, 0]
elsa_timeseries_df = elsa_timeseries_df.reset_index(level='nos_id')
elsa_timeseries_df = elsa_timeseries_df.iloc[:, 1]

# Re-index Elsa data time series based on simulation data time series
elsa_timeseries_df = elsa_timeseries_df.reindex(index=sim_timeseries_df.index, method='nearest').drop_duplicates()

# Plot both time series (simulation data and Elsa data) together
plt.plot(sim_timeseries_df, label='Simulation Data Time Series')
plt.plot(elsa_timeseries_df, label='Elsa Data Time Series')
plt.title('Water Level Time Series Comparison')
plt.xlabel('Date/Time')
plt.legend()
plt.show()

# Compute some statistics to compare simulation data and Elsa data
# Can convert both time series from dataframes to numpy arrays first
elsa_timeseries_array = elsa_timeseries_df.to_numpy()
sim_timeseries_array = sim_timeseries_df.to_numpy()

# Bias or Mean Error
# Positive values of bias indicate negatively-biased modelled data
# Negative values of bias indicate positively-biased modelled data
bias = np.nanmean(elsa_timeseries_array - sim_timeseries_array)
bias
# Now, remove bias between the two time series and re-plot
plt.plot(sim_timeseries_df + bias, label='Simulation Data Time Series (Bias-corrected)')
plt.plot(elsa_timeseries_df, label='Elsa Data Time Series')
plt.title('Water Level Time Series Comparison (Bias-corrected)')
plt.xlabel('Date/Time')
plt.legend()
plt.show()

# Mean Absolute Error
mean_abs_error = np.nanmean(abs(elsa_timeseries_array - sim_timeseries_array))
mean_abs_error

# Root Mean Square Error (RMSE)
rmse = np.sqrt(np.nanmean((elsa_timeseries_array - sim_timeseries_array) ** 2))
rmse

# Scatter Index or Percentage RMSE
scatter_index = (rmse / np.nanmean(elsa_timeseries_array)) * 100
scatter_index

@g7192
Good Work, Jane! Please attach couple of figures and try to write a line describing them. What is the one catching your eye in each of them. Thanks.

@saeed-moghimi-noaa @SorooshMani-NOAA
Yes, here are the original and bias-corrected figures that resulted from the code above:

elsa_orig
elsa_corrected

It seems that, even though these figures are from a station affected by Elsa, simulation data and real (Elsa) data line up quite well.