Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library

Home Page: http://unidata.github.io/netcdf4-python


Reading Dataset from memory

m-novikov opened this issue · comments

Can netcdf4 read Dataset from memory?
Something like this:

from netCDF4 import Dataset
f = open('netcdf_file.nc', 'rb')
df = Dataset(f)

It would be a quite convenient and useful API.
As it stands, to read a netCDF4 Dataset from a URL I have to explicitly save it to a named temporary file.

Greetings!

We are working on this on the C side at Unidata, and hopefully it will be
finished soon. Once it's in the C lib, there should be very little work to
expose it in the netcdf4-python lib.

Cheers!

Sean


To be clear, though, this would involve passing the entire "string" of data, not a file-like object.

you can create a Dataset in memory, using 'diskless=True'.

Create, yes, but read - no.

Let's say you use the NetcdfSubset service from the THREDDS Data Server.
The server will return a netCDF file. When you use urllib2 to make the
request, you end up reading the server response, which is the bytes of a
netcdf file already in memory. The idea is to read in memory to remove the
need to write a temporary file to disk.

Sean


you could copy the data directly to a diskless file (without first writing to disk) couldn't you?

Define "diskless file".

NetCDF requires a filename to read data from.

from netCDF4 import Dataset
nc = Dataset(URL)
ncm = Dataset('inmemory.nc',diskless=True,mode='w')
.....logic to copy data from nc to ncm...

then you have a 'diskless' or in-memory version of the dataset at URL

Is the nc = Dataset(URL) supposed to work for anything besides opendap? It didn't for me when I just tried to hit a THREDDS server using NCSS. What I'm picturing is:

from netCDF4 import Dataset
from urllib.request import urlopen
url = urlopen(URL)
ncdata = url.read()
nc = Dataset(ncdata, diskless=True, mode='r')

Passing a string seems like a fine idea; in my use case it would be convenient enough (I read 10 MB netCDF files, which are no problem to hold in Python memory).
But reading the whole dataset into Python memory will increase memory consumption, and that string will need to be garbage collected, which does not always happen quickly in CPython.
Why not stick to a file interface? That would give the developer the flexibility to handle the data source in whatever way is most convenient; a plain string can simply be wrapped with cStringIO/StringIO.

PS. I don't really know the binary structure of the netCDF4 format; even if it cannot be read/written incrementally, handling buffer consumption on the C side of the Python extension could be good for large files.

Well right now, the C netcdf library only takes a filename or an opendap URL; there's not even the option of taking any kind of file pointer. Even if the latter was possible, you still wouldn't be able to turn a StringIO instance into such a thing for standard C.

What they're adding to the netCDF C library is an API to point to an existing in-memory buffer and eliminate all file I/O; HDF5 already has such an API. It would be possible to add a Python API to netcdf4-python to take a file-like object, but at some level here all of the data needs to be read into a buffer, with a single pointer to be handed to the C-library. This is likely not to actually be a str, but I'm not sure if it's bytes, bytearray, memoryview or what (direct conversion to non-Python buffer?). I still need to learn a bit more Cython...

There's been some discussion about this over on the h5py issue tracker (h5py/h5py#552). It sounds like some changes to the HDF5 libraries may be necessary to make this work entirely smoothly.

In the meantime, if you're working with netCDF3 files, using a file like object is already possible with the scipy.io.netcdf interface.
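For example, with the scipy interface a BytesIO wrapper is enough. In this sketch a tiny netCDF3 file is first written into an in-memory buffer just to have bytes to read back; in practice the payload would come from a urllib/requests response.

```python
import io
from scipy.io import netcdf_file

# Write a tiny netCDF3 file into an in-memory buffer so we have bytes to work with.
buf = io.BytesIO()
w = netcdf_file(buf, mode='w')
w.createDimension('t', 2)
v = w.createVariable('t', 'f8', ('t',))
v[:] = [0.0, 1.0]
w.flush()
payload = buf.getvalue()

# Read it straight back from memory -- no temporary file needed.
ds = netcdf_file(io.BytesIO(payload), mode='r')
print(ds.variables['t'][:])
```

Note this only works for classic-format (netCDF3) files; netCDF4/HDF5 files still need the C library support discussed above.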

Thank you. As I read netCDF4 files, for now I'll settle for the NamedTemporaryFile workaround.

I looked into using a local OPeNDAP server but couldn't find anything that easily worked for serving local netCDF files. This would be an option if anyone can get it to work.

argh, did the work to create #652, however found: Unidata/netcdf-c#394 :(

btw, I'm guessing this is a dup of #295

btw, there may be another bug in netcdf-c with in-memory files, I just tried with a 2d array of data, and all rows after 100 returned garbage data. Investigating this now

Sorry for the cross post, but this isn't working for me as per the docs. I've tried Python 2.7 and 3.7 and get the same error:

[ec2-user@ip-172-31-12-20 project]$ python3 inmem.py
Traceback (most recent call last):
  File "inmem.py", line 5, in <module>
    netCDF4.Dataset("in-mem-file", mode='r', memory=data)
  File "netCDF4/_netCDF4.pyx", line 2285, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1855, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'in-mem-file'

code:

import netCDF4
with open('./db8d6757c80a3fa51779a325ba76336451ea0344.nc','rb') as fp:
    data = fp.read()
ds = netCDF4.Dataset("in-mem-file", mode='r', memory=data)
print(ds)

netCDF4 version '1.5.0'

A FileNotFoundError seems odd since I'm trying to read from memory. Help much appreciated.

@tam203 - this should work. I can only think of two reasons it might not.

  1. you have an older version of netcdf-c that either doesn't support in-memory access, or has a bug that has since been fixed, or
  2. there's something about your file that the library doesn't like.

Can you tell us what version of netcdf-c you have, and post that file somewhere? (if it's small enough you can tar/zip it and attach it to this ticket).

Could be related to Unidata/netcdf-c#394 which I believe was fixed in netcdf-c 4.5.0

Thanks. I'm using what came from python3 -m pip install netcdf4 -t .

I think you are correct about the bug; it looks like I'm on version 4.4.1.1 of netcdf-c:

>>> netCDF4.getlibversion()
'4.4.1.1 of Mar 23 2019 19:51:19 $'

How do I go about getting version 4.5 and will the pip version be updated shortly?

I'm packaging this up on a AWS EC2 machine to use in lambda so I need the C library to be packaged with the python not just installed somewhere on the system, if that makes sense.

Ah - I see the linux and osx wheels are built using 4.4.1.1. I will update that and create a new release (1.5.0.1) with new binary wheels. If you have a newer version of the library on your system you can follow the build instructions in the docs to rebuild from source and link against the newer library.

wheels for 1.5.0.1 are available (using netcdf-c 4.6.3). Please let me know if this fixes the problem.

@jswhit Perfect that's fixed it thanks.

For any one's reference:

python3 -m pip install "netcdf4>=1.5.0.1" -t .

Is what I ran to ensure I got the new version. Ta.


Solved my problem

Didn't solve my problem, unfortunately.

Installing collected packages: numpy, cftime, netcdf4
Successfully installed cftime-1.0.3.4 netcdf4-1.5.2 numpy-1.17.2

Any other ideas?

@kmfweb If you're installing using pip, that means you're using your system's version of netcdf-c (libnetcdf). What version of that is installed?

@dopplershift I have checked using the commands `ncdump` and `nc-config --version`, which give me the last line of output: netcdf library version 4.4.1.1 of Jun 8 2018 03:08:32

I have some old netCDF data which could, and still can, be read. But with the new data I would like to read in and work with, I receive the "FileNotFoundError: [Errno 2] No such file or directory: b". The file path and name are correct, and I am able to access the file via ncview as well.

I'm confused. Is this data you have in a file on disk or data that's already in a buffer in memory? Can you provide sample code for what's not working?

I have been reading in decades of data, e.g. file19701979.nc, file19801989.nc, file19901999.nc, etc., using a loop. Within this loop I have a function "New_Data,Latitudes,Longitudes = GetGrid4Slice(FileName,ReadInfo,SliceInfo,LatInfo,LonInfo)" which includes "ncf=netcdf.netcdf_file(FileName,'r')". For those decadal netCDF files it runs through without any problems.

I have got rid of the decades loop, as now I am working with only 1 single netCDF file. For this netCDF file I receive the FileNotFoundError when calling "ncf=netcdf.netcdf_file(FileName,'r')". I am quite sure that my single netCDF file I am trying to read is a proper netCDF file as I am able to have a look using ncview.

I am not sure if this is about my loop or the library's version. The file name and path are definitely correct, and the error shows the path with a "b" prefix.

Ok, then I think you should open a new issue. This issue is about reading datasets from a buffer that already exists in memory, not a file on disk.

I installed the most recent version of netcdf4 and am getting the error "OSError: [Errno -128] NetCDF: Attempt to use feature that was not turned on when netCDF was built."

import io
import requests
from requests.auth import HTTPBasicAuth
import netCDF4
import os
import xarray as xr

username = ""
password = ""
session = requests.session()
session.auth = HTTPBasicAuth(username, password)

link = "https://urs.earthdata.nasa.gov/oauth/authorize?scope=uid&app_type=401&client_id=ijpRZvb9qeKCK5ctsn75Tg&response_type=code&redirect_uri=https%3A%2F%2Fe4ftl01.cr.usgs.gov%2Foauth&state=aHR0cHM6Ly9lNGZ0bDAxLmNyLnVzZ3MuZ292Ly9NT0RWNl9DbXBfQi9NT0xUL01PRDEzQTEuMDA2LzIwMTkuMDIuMDIvTU9EMTNBMS5BMjAxOTAzMy5oMDF2MDcuMDA2LjIwMTkwNTAyMDA0NTMuaGRm"

bio = io.BytesIO()
with session.get(link, stream=True) as resp:
    for chunk in resp.iter_content(chunk_size=2 ** 20):
        bio.write(chunk)
bio.seek(0)
 
test_bytes = bio.read()
netcdf_files = netCDF4.Dataset('in-mem-file', mode='r', memory=test_bytes)

@cpaton8 That error message says:

Attempt to use feature that was not turned on when netCDF was built.

I'm not sure how you installed the netcdf-c package (libnetcdf.so or libnetcdf.dylib), but that message means it did not have the memory-based reading enabled when it was compiled.

@dopplershift all packages installed via conda-forge. libnetcdf 4.7.3 and netcdf 1.4.3
I am getting the same error when trying the above.
FileNotFoundError: [Errno 2] No such file or directory: b'in-mem-file'

@cpaton8 A couple things:

  1. What OS and Python version are you on? conda (with conda-forge configured) has informed me in no uncertain terms that it will NOT install libnetcdf 4.7.3 and netcdf4 1.4.3 together.
  2. The sample code you provided does fail for me, but the message I get is OSError: [Errno -51] NetCDF: Unknown file format: b'in-mem-file', which is because in this case test_bytes is: b'HTTP Basic: Access denied.\n'

So, on macOS, in this environment:

>conda list
# packages in environment at /Users/rmay/miniconda3/envs/test-env:
#
# Name                    Version                   Build  Channel
brotlipy                  0.7.0           py37h9bfed18_1000    conda-forge
bzip2                     1.0.8                h0b31af3_2    conda-forge
ca-certificates           2020.4.5.1           hecc5488_0    conda-forge
certifi                   2020.4.5.1       py37hc8dfbb8_0    conda-forge
cffi                      1.14.0           py37h356ff06_0    conda-forge
cftime                    1.1.1.2          py37h10e2902_0    conda-forge
chardet                   3.0.4           py37hc8dfbb8_1006    conda-forge
cryptography              2.9.2            py37he655712_0    conda-forge
curl                      7.69.1               h2d98d24_0    conda-forge
hdf4                      4.2.13            h84186c3_1003    conda-forge
hdf5                      1.10.6          nompi_h3e39495_100    conda-forge
idna                      2.9                        py_1    conda-forge
jpeg                      9c                h1de35cc_1001    conda-forge
krb5                      1.17.1               h1752a42_0    conda-forge
libblas                   3.8.0               16_openblas    conda-forge
libcblas                  3.8.0               16_openblas    conda-forge
libcurl                   7.69.1               hc0b9707_0    conda-forge
libcxx                    10.0.0               h1af66ff_2    conda-forge
libedit                   3.1.20170329      hcfe32e1_1001    conda-forge
libffi                    3.2.1             h4a8c4bd_1007    conda-forge
libgfortran               4.0.0                         2    conda-forge
liblapack                 3.8.0               16_openblas    conda-forge
libnetcdf                 4.7.4           nompi_ha11d67f_102    conda-forge
libopenblas               0.3.9                h3d69b6c_0    conda-forge
libssh2                   1.8.2                hcdc9a53_2    conda-forge
llvm-openmp               10.0.0               h28b9765_0    conda-forge
ncurses                   6.1               h0a44026_1002    conda-forge
netcdf4                   1.5.3           nompi_py37hf55ae24_105    conda-forge
numpy                     1.18.1           py37h7687784_1    conda-forge
openssl                   1.1.1g               h0b31af3_0    conda-forge
pip                       20.1               pyh9f0ad1d_0    conda-forge
pycparser                 2.20                       py_0    conda-forge
pyopenssl                 19.1.0                     py_1    conda-forge
pysocks                   1.7.1            py37hc8dfbb8_1    conda-forge
python                    3.7.6           h90870a6_5_cpython    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
readline                  8.0                  hcfe32e1_0    conda-forge
requests                  2.23.0             pyh8c360ce_2    conda-forge
setuptools                46.1.3           py37hc8dfbb8_0    conda-forge
six                       1.14.0                     py_1    conda-forge
sqlite                    3.30.1               h93121df_0    conda-forge
tk                        8.6.10               hbbe82c9_0    conda-forge
urllib3                   1.25.9                     py_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.5                h0b31af3_0    conda-forge
zlib                      1.2.11            h0b31af3_1006    conda-forge

this code works fine:

import requests
import netCDF4

link = ('https://thredds.ucar.edu/thredds/fileServer/satellite/goes/east/grb/ABI/Mesoscale-2/Channel08/'
        'current/OR_ABI-L1b-RadM2-M6C08_G16_s20201221854495_e20201221854553_c20201221855015.nc')
with requests.get(link) as resp:
    netcdf_file = netCDF4.Dataset('in-mem-file', mode='r', memory=resp.content)

print(netcdf_file.title)

@cpaton8 I am having a similar problem with netCDF version 4.6.0. When I run your above example:

import requests
import netCDF4

link = ('https://thredds.ucar.edu/thredds/fileServer/satellite/goes/east/grb/ABI/Mesoscale-2/Channel08/'
        'current/OR_ABI-L1b-RadM2-M6C08_G16_s20201221854495_e20201221854553_c20201221855015.nc')
with requests.get(link) as resp:
    netcdf_file = netCDF4.Dataset('in-mem-file', mode='r', memory=resp.content)

print(netcdf_file.title)

I get the error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "netCDF4/_netCDF4.pyx", line 2358, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1926, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -51] NetCDF: Unknown file format: b'in-mem-file'

My python 3.6.9 environment looks like this:

Package          Version
---------------- ------------
affine           2.3.0
asciitree        0.3.3
attrs            19.3.0
backcall         0.2.0
beautifulsoup4   4.9.1
boto3            1.14.39
botocore         1.17.39
Bottleneck       1.3.2
certifi          2020.6.20
cffi             1.14.1
cfgrib           0.9.8.4
cftime           1.2.1
chardet          3.0.4
click            7.1.2
click-plugins    1.1.1
cligj            0.5.0
cycler           0.10.0
decorator        4.4.2
docopt           0.6.2
docutils         0.15.2
fasteners        0.15
Fiona            1.8.13.post1
geopandas        0.8.1
idna             2.10
ipykernel        5.3.4
ipython          7.16.1
ipython-genutils 0.2.0
jedi             0.17.2
Jinja2           2.11.2
jmespath         0.10.0
jupyter-client   6.1.6
jupyter-core     4.6.3
kiwisolver       1.2.0
llvmlite         0.33.0
MarkupSafe       1.1.1
matplotlib       3.3.0
monotonic        1.5
munch            2.5.0
nc-time-axis     1.2.0
netCDF4          1.5.4
numba            0.50.1
numbagg          0.1
numcodecs        0.6.4
numpy            1.19.1
pandas           1.1.0
parso            0.7.1
pexpect          4.8.0
pickleshare      0.7.5
Pillow           7.2.0
pip              20.2.2
pkg-resources    0.0.0
prompt-toolkit   3.0.6
protobuf         4.0.0rc2
psycopg2-binary  2.8.5
ptyprocess       0.6.0
pycparser        2.20
Pydap            3.2.2
Pygments         2.6.1
pyparsing        2.4.7
pyproj           2.6.1.post1
python-dateutil  2.8.1
pytz             2020.1
pyzmq            19.0.2
rasterio         1.1.5
requests         2.24.0
s3transfer       0.3.3
setuptools       49.3.1
Shapely          1.7.0
siphon           0.8.0
six              1.15.0
snuggs           1.4.7
soupsieve        2.0.1
tornado          6.0.4
traitlets        4.3.3
urllib3          1.25.10
wcwidth          0.2.5
WebOb            1.8.6
xarray           0.16.0
zarr             2.4.0

Thoughts?

@nicksilver I find it really useful in cases like this to look at what's being returned by requests. If I take the code that's failing for you and print out the response:

import requests
import netCDF4

link = ('https://thredds.ucar.edu/thredds/fileServer/satellite/goes/east/grb/ABI/Mesoscale-2/Channel08/'
        'current/OR_ABI-L1b-RadM2-M6C08_G16_s20201221854495_e20201221854553_c20201221855015.nc')
with requests.get(link) as resp:
    print(resp.content.decode('utf-8'))

I see:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>TDS - Error report</title>
    <link rel="stylesheet" href="/thredds/tds.css" type="text/css"/>
  </head>
  <body>
    <h1>HTTP Status 404 - Not Found</h1>
    <HR size="1" noshade="noshade">
    <p><b>Status</b> 404 - Not Found</p>
    <HR size="1" noshade="noshade">
    <h3>THREDDS Data Server Version 4.6
      -- <a href='https://www.unidata.ucar.edu/software/thredds/v4.6/tds/TDS.html'>Documentation</a></h3>
  </body>
</html>

So the original data file has aged off the server. If I update to a currently available file:

import requests
import netCDF4

link = ('https://thredds.ucar.edu/thredds/fileServer/satellite/goes/east/grb/ABI/Mesoscale-2/Channel08/'
        'current/OR_ABI-L1b-RadM2-M6C08_G16_s20202261740546_e20202261741003_c20202261741040.nc')
with requests.get(link) as resp:
    netcdf_file = netCDF4.Dataset('in-mem-file', mode='r', memory=resp.content)

print(netcdf_file.title)

I get ABI L1b Radiances.

@nicksilver not sure if this is the issue you are running into, but the MODIS files we've been working with are HDF-EOS v2, which is based on HDF4. They would need to be converted (there's a tool called h4toh5) before they're compatible with netCDF-4.

Beautiful...thank you!