“NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is also a community standard for sharing scientific data.”

NetCDF is developed by UCAR/Unidata and is widely used for climate and weather data, as well as for other environmental data sets. The netcdf library has been ported to a wide variety of operating systems and platforms, from laptop computers to large mainframes. Data sets are typically large arrays with axes for longitude, latitude and time, with other axes, such as depth, added according to the nature of the data. Other types of data are also commonly found in netCDF resources.
Importantly, “a netCDF file includes information about the data it contains”. This metadata comes in two flavours:

- Structural metadata are defined by the netcdf library. These describe the basic building blocks of the data set, its variables, and the dimensions of the variables and how the pieces fit together. With just this information one can read the data from the resource.
- Descriptive metadata are attributes attached to those building blocks. They describe what the dimensions and variables actually represent.

Both types of metadata are necessary to “understand” the netCDF resource.

The descriptive metadata are not defined by the netcdf library. To ensure interoperability, several “conventions” have been developed over the years so that users of netCDF data can correctly interpret what data producers have put in the resource. The most important of these conventions is the CF Metadata Conventions, which define a large number of standards that help interpret netCDF resources.
Other common conventions are related to climate prediction data, such as CMIP-5 and CMIP-6.
The RNetCDF package is developed and maintained by the same team that develops and maintains the netcdf library. It provides an interface to the netcdf library that stays very close to the API of the C library. As a result, it lacks the intuitive user experience and workflow that R users are familiar with.
Package ncdf4, the most widely used package to access netCDF resources, does one better by performing the tedious task of reading the structural metadata from the resource that is needed for a basic understanding of its contents, such as dimension and variable details. But the library API concept remains, with functions that map fairly directly onto the netcdf library functions.
One would really need to understand the netCDF data model and implementation details to use these packages effectively. For instance, most data describing a dimension is stored as a variable, so to read the dimnames() of a dimension you’d have to call var.get.nc() or ncvar_get(). Neither package loads the attributes of the dimensions, the variables or the data set (“global” attributes), which are essential to understand what the dimensions and variables represent.
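As a minimal sketch with ncdf4 (assuming fn holds the path to a netCDF resource with a “longitude” dimension):

library(ncdf4)
nc <- nc_open(fn)
lon <- ncvar_get(nc, "longitude")   # dimension values live in a variable of the same name
lon_units <- ncatt_get(nc, "longitude", "units")$value   # attributes are read separately
nc_close(nc)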
While both packages are very good at what they do, for a fluent R workflow that is clearly not enough.
Several packages have been developed to address some of these issues and make access to the data easier. Unfortunately, none of these packages provide a comprehensive R-style solution to accessing and interpreting netCDF resources in an intuitive way.
Package ncdfCF provides a high-level interface using functions and methods that are familiar to the R user. It reads the structural metadata and also the attributes upon opening the resource. In the process, the ncdfCF package also applies the CF Metadata Conventions to interpret the data. This currently applies to:

- Groups, as used in the netcdf4 format, allowing for a directory-like structure in the netCDF resource. The specific scoping rules to find related objects distributed over multiple groups are supported.
- The time axis, with time encoded as offsets from a datum. Using the CFtime package these offsets can be turned into intelligible dates and times, for all 9 defined calendars.
- Auxiliary coordinate variables, which enable ncdfCF to automatically align data variables that are not on a Cartesian grid to a longitude-latitude grid.

Opening and inspecting the contents of a netCDF resource is very straightforward:
library(ncdfCF)
# Get a netCDF file, here hourly data for 2016-01-01 over Rwanda
fn <- system.file("extdata", "ERA5land_Rwanda_20160101.nc", package = "ncdfCF")
# Open the file, all metadata is read
ds <- open_ncdf(fn)
# Easy access in understandable format to all the details
ds
#> <Dataset> ERA5land_Rwanda_20160101
#> Resource : /tmp/RtmpjGP4uo/Rinst7f02c36d4ca/ncdfCF/extdata/ERA5land_Rwanda_20160101.nc
#> Format : offset64
#> Conventions: CF-1.6
#> Keep open : FALSE
#>
#> Variables:
#> name long_name units data_type axes
#> t2m 2 metre temperature K NC_SHORT longitude, latitude, time
#> pev Potential evaporation m NC_SHORT longitude, latitude, time
#> tp Total precipitation m NC_SHORT longitude, latitude, time
#>
#> Axes:
#> id axis name length unlim values
#> 0 T time 24 U [2016-01-01 00:00:00 ... 2016-01-01 23:00:00]
#> 1 X longitude 31 [28 ... 31]
#> 2 Y latitude 21 [-1 ... -3]
#> unit
#> hours since 1900-01-01 00:00:00.0
#> degrees_east
#> degrees_north
#>
#> Attributes:
#> id name type length
#> 0 CDI NC_CHAR 64
#> 1 Conventions NC_CHAR 6
#> 2 history NC_CHAR 482
#> 3 CDO NC_CHAR 64
#> value
#> Climate Data Interface version 2.4.1 (https://m...
#> CF-1.6
#> Tue May 28 18:39:12 2024: cdo seldate,2016-01-0...
#> Climate Data Operators version 2.4.1 (https://m...
# Variables can be accessed through standard list-type extraction syntax
t2m <- ds[["t2m"]]
t2m
#> <Variable> t2m
#> Long name: 2 metre temperature
#>
#> Axes:
#> id axis name length unlim values
#> 1 X longitude 31 [28 ... 31]
#> 2 Y latitude 21 [-1 ... -3]
#> 0 T time 24 U [2016-01-01 00:00:00 ... 2016-01-01 23:00:00]
#> unit
#> degrees_east
#> degrees_north
#> hours since 1900-01-01 00:00:00.0
#>
#> Attributes:
#> id name type length value
#> 0 long_name NC_CHAR 19 2 metre temperature
#> 1 units NC_CHAR 1 K
#> 2 add_offset NC_DOUBLE 1 292.664569285614
#> 3 scale_factor NC_DOUBLE 1 0.00045127252204996
#> 4 _FillValue NC_SHORT 1 -32767
#> 5 missing_value NC_SHORT 1 -32767
# Same with dimensions, but now without first assigning the object to a symbol
ds[["longitude"]]
#> <Longitude axis> [1] longitude
#> Length : 31
#> Axis : X
#> Values : 28, 28.1, 28.2 ... 30.8, 30.9, 31 degrees_east
#> Bounds : (not set)
#>
#> Attributes:
#> id name type length value
#> 0 standard_name NC_CHAR 9 longitude
#> 1 long_name NC_CHAR 9 longitude
#> 2 units NC_CHAR 12 degrees_east
#> 3 axis NC_CHAR 1 X
# Regular base R operations simplify life further
dimnames(ds[["pev"]]) # A variable: list of dimension names
#> [1] "longitude" "latitude" "time"
dimnames(ds[["longitude"]]) # A dimension: vector of dimension element values
#> [1] 28.0 28.1 28.2 28.3 28.4 28.5 28.6 28.7 28.8 28.9 29.0 29.1 29.2 29.3 29.4
#> [16] 29.5 29.6 29.7 29.8 29.9 30.0 30.1 30.2 30.3 30.4 30.5 30.6 30.7 30.8 30.9
#> [31] 31.0
# Access attributes
ds[["pev"]]$attribute("long_name")
#> [1] "Potential evaporation"
Note the list-like syntax with the $ operator in the last command. The base objects in the package are built on the R6 object-oriented model. R6 is a lightweight but powerful and efficient framework to build object models. Access to the public fields and methods of an object is provided through the $ operator. Common base R operators and functions, such as shown above, are supported as well, to facilitate integration of ncdfCF in frameworks built on base R or S3.
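The two styles can be mixed freely on the same object (both calls were demonstrated above):

t2m$attribute("units")   # R6 method access via the $ operator
dimnames(t2m)            # base R generic on the same object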
One of the perpetual headaches for users of netCDF files is extracting the data. If you want to get all the data for a variable then neither RNetCDF nor ncdf4 is particularly troublesome:
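A minimal sketch with ncdf4 (using fn and the “t2m” variable from the example above):

library(ncdf4)
nc <- nc_open(fn)
vals <- ncvar_get(nc, "t2m")   # reads the entire array into memory
nc_close(nc)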
But what if you are interested in only a small area or a month of data while the resource has global data spanning multiple years? In both the RNetCDF and ncdf4 packages you’d have to work out how your real-world boundaries translate to indices into the variable array of interest, and then populate the start and count arguments to pass on to var.get.nc() or ncvar_get(). That may be feasible for longitude and latitude axes, but for a time axis this becomes more complicated (reading and parsing the “units” attribute of the axis) or nigh-on impossible when non-standard calendars are used for the axis. Many R users default to simply reading the entire array and then extracting the area of interest using standard R tools. That is wasteful at best (lots of I/O, RAM usage and CPU cycles) and practically impossible with some larger netCDF resources that have variables upwards of 1GB in size.
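The index bookkeeping looks something like this (a sketch with RNetCDF; the axis names and the bounding box are illustrative, and the time axis is not even addressed here):

library(RNetCDF)
nc <- open.nc(fn)
lon <- var.get.nc(nc, "longitude")
lat <- var.get.nc(nc, "latitude")
xi <- which(lon >= 29 & lon <= 30)   # translate real-world coordinates to indices
yi <- which(lat >= -2 & lat <= -1)
# one start/count entry per dimension; NA in count means "to the end of the dimension"
vals <- var.get.nc(nc, "t2m",
                   start = c(min(xi), min(yi), 1),
                   count = c(length(xi), length(yi), NA))
close.nc(nc)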
Enter ncdfCF. With ncdfCF you have three options to extract data for a variable:

- data(): the data() method extracts all the data of the variable into a CFData object, including information on the axes, the coordinate reference system and the attributes.
- []: using R’s standard extraction operator [ you work directly with the index values into the array dimensions: d <- t2m[3:5, 1:4, 1:10]. You can leave out dimensions to extract everything from that dimension (but you have to indicate the position, just like in regular arrays). So to get the first 5 “time” slices from t2m: d <- t2m[, , 1:5]. This works for any number of dimensions; you simply adjust the number of positions that you specify. You still need to know the indices into the arrays, but ncdfCF has some helper functions to get you those. Not specifying anything gets you the whole array: d <- t2m[].
- subset(): the subset() method is more flexible than [] because it requires less knowledge of how the data in the variable is structured, particularly the order of the axes. While many netCDF resources “behave” in their dimension order, there is no guarantee. With subset() you supply a list with an item for each axis (identified by axis orientation or name, in any order), each item containing a vector of real-world coordinates to extract. As an example, to extract values of a variable x for Australia for the year 2020 you call x$subset(list(X = 112:154, Y = -9:-44, T = c("2020-01-01", "2021-01-01"))). The result is returned in a CFData object.

# Extract a time series for a specific location
ts <- t2m[5, 4, ]
str(ts)
#> num [1, 1, 1:24] 293 292 292 291 291 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ : chr "28.4"
#> ..$ : chr "-1.3"
#> ..$ : chr [1:24] "2016-01-01 00:00:00" "2016-01-01 01:00:00" "2016-01-01 02:00:00" "2016-01-01 03:00:00" ...
#> - attr(*, "axis")= Named chr [1:3] "X" "Y" "T"
#> ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "time"
#> - attr(*, "time")=List of 1
#> ..$ time:Formal class 'CFtime' [package "CFtime"] with 4 slots
#> .. .. ..@ datum : hours since 1900-01-01 00:00:00.0 [ gregorian calendar ]
#> .. .. ..@ resolution: num 1
#> .. .. ..@ offsets : num [1:24] 1016832 1016833 1016834 1016835 1016836 ...
#> .. .. ..@ bounds : logi FALSE
# Extract the full spatial extent for one time step
ts <- t2m[, , 12]
str(ts)
#> num [1:31, 1:21, 1] 300 300 300 300 300 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ : chr [1:31] "28" "28.1" "28.2" "28.3" ...
#> ..$ : chr [1:21] "-1" "-1.1" "-1.2" "-1.3" ...
#> ..$ : chr "2016-01-01 11:00:00"
#> - attr(*, "axis")= Named chr [1:3] "X" "Y" "T"
#> ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "time"
#> - attr(*, "time")=List of 1
#> ..$ time:Formal class 'CFtime' [package "CFtime"] with 4 slots
#> .. .. ..@ datum : hours since 1900-01-01 00:00:00.0 [ gregorian calendar ]
#> .. .. ..@ resolution: num NA
#> .. .. ..@ offsets : num 1016843
#> .. .. ..@ bounds : logi FALSE
Note that the results contain degenerate dimensions (of length 1). This is by design, because it allows attributes to be attached and then inspected by the user in a consistent manner.
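If a compact result is preferred, base R’s drop() removes the degenerate dimensions (a small illustration; note that the metadata attributes that ncdfCF attaches may be lost in the process):

ts <- t2m[5, 4, ]   # 1 x 1 x 24 array
tsv <- drop(ts)     # numeric vector of length 24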
# Extract a specific region, full time dimension
ts <- t2m$subset(list(X = 29:30, Y = -1:-2))
ts
#> <Data> t2m
#> Long name: 2 metre temperature
#>
#> Values: [283.0182 ... 299.917] K
#> NA: 0 (0.0%)
#>
#> Axes:
#> id axis name length unlim values
#> -1 X longitude 10 [29 ... 29.9]
#> -1 Y latitude 10 [-1.1 ... -2]
#> 0 T time 24 U [2016-01-01 00:00:00 ... 2016-01-01 23:00:00]
#> unit
#>
#>
#> hours since 1900-01-01 00:00:00.0
#>
#> Attributes:
#> id name type length value
#> 0 long_name NC_CHAR 19 2 metre temperature
#> 1 units NC_CHAR 1 K
#> 2 add_offset NC_DOUBLE 1 292.664569285614
#> 3 scale_factor NC_DOUBLE 1 0.00045127252204996
#> 4 _FillValue NC_SHORT 1 -32767
#> 5 missing_value NC_SHORT 1 -32767
# Extract specific time slices for a specific region
# Note that the axes are specified out of order and using alternative
# specifications: only the extreme values are used.
ts <- t2m$subset(list(T = c("2016-01-01 09:00", "2016-01-01 15:00"),
X = c(29.6, 28.8),
Y = seq(-2, -1, by = 0.05)))
ts
#> <Data> t2m
#> Long name: 2 metre temperature
#>
#> Values: [288.2335 ... 299.917] K
#> NA: 0 (0.0%)
#>
#> Axes:
#> id axis name length values
#> -1 X longitude 8 [28.8 ... 29.5]
#> -1 Y latitude 10 [-1.1 ... -2]
#> -1 T time 6 [2016-01-01 09:00:00 ... 2016-01-01 14:00:00]
#>
#> Attributes:
#> id name type length value
#> 0 long_name NC_CHAR 19 2 metre temperature
#> 1 units NC_CHAR 1 K
#> 2 add_offset NC_DOUBLE 1 292.664569285614
#> 3 scale_factor NC_DOUBLE 1 0.00045127252204996
#> 4 _FillValue NC_SHORT 1 -32767
#> 5 missing_value NC_SHORT 1 -32767
These methods will read data from the netCDF resource, but only as much as is requested. The approaches lend themselves well to the apply() family of functions for processing. Importantly, these methods give access to the data in the netCDF resource piecewise, so you can tweak the size of your request to the capacity of your computer without exhausting RAM.
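As a small illustration (using the t2m variable from before, which has 24 hourly time steps), the data can be processed one time slice at a time:

# Spatial mean temperature for each hour, reading one slice per iteration
hourly_mean <- sapply(1:24, function(i) mean(t2m[, , i], na.rm = TRUE))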
The data() and subset() methods return data from a variable in a CFData instance. The CFData instance holds the actual data, as well as important metadata, including the axes, the coordinate reference system and the attributes, among others. The CFData instance also lets you manipulate the data in a way that is informed by that metadata. This overcomes a typical issue when working with netCDF data that adheres to the CF Metadata Conventions.

The ordering of the axes in a typical netCDF resource is different from the way that R orders its data. That leads to surprising results if you are not aware of this issue:
# Open a file and read the data from a variable into a CFData instance
fn <- system.file("extdata", "tasmax_NAM-44_day_20410701-vncdfCF.nc", package = "ncdfCF")
ds <- open_ncdf(fn)
tx <- ds[["tasmax"]]$data()
tx
#> <Data> tasmax
#> Long name: Daily Maximum Near-Surface Air Temperature
#>
#> Values: [263.4697 ... 313.2861] K
#> NA: 0 (0.0%)
#>
#> Axes:
#> id axis name long_name length unlim
#> 2 X x x-coordinate in Cartesian system 148
#> 3 Y y y-coordinate in Cartesian system 140
#> 0 T time 1 U
#> Z height 1
#> values unit
#> [0 ... 7350000] m
#> [0 ... 6950000] m
#> [2041-07-01 12:00:00] days since 1949-12-1 00:00:00
#> [2] m
#>
#> Attributes:
#> id name type length value
#> 0 standard_name NC_CHAR 15 air_temperature
#> 1 long_name NC_CHAR 42 Daily Maximum Near-Surface Air Temperature
#> 2 units NC_CHAR 1 K
#> 3 grid_mapping NC_CHAR 17 Lambert_Conformal
#> 5 _FillValue NC_FLOAT 1 1.00000002004088e+20
#> 6 missing_value NC_FLOAT 1 1.00000002004088e+20
#> 7 original_name NC_CHAR 11 TEMP at 2 M
#> 8 cell_methods NC_CHAR 13 time: maximum
#> 9 FieldType NC_INT 1 104
#> 10 MemoryOrder NC_CHAR 3 XY
# Use the terra package for plotting
# install.packages("terra")
library(terra)
#> terra 1.7.83
# Get the data in exactly the way it is stored in the file, using `raw()`
tx_raw <- tx$raw()
str(tx_raw)
#> num [1:148, 1:140, 1] 301 301 301 301 301 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ x : chr [1:148] "0" "50000" "1e+05" "150000" ...
#> ..$ y : chr [1:140] "0" "50000" "1e+05" "150000" ...
#> ..$ time: chr "2041-07-01 12:00:00"
# Plot the data
r <- terra::rast(tx_raw)
r
#> class : SpatRaster
#> dimensions : 148, 140, 1 (nrow, ncol, nlyr)
#> resolution : 1, 1 (x, y)
#> extent : 0, 140, 0, 148 (xmin, xmax, ymin, ymax)
#> coord. ref. :
#> source(s) : memory
#> name : lyr.1
#> min value : 263.4697
#> max value : 313.2861
plot(r)
North America is lying on its side. This is because the data is stored differently in the netCDF resource than R expects. There is, in fact, no single way of storing data in a netCDF resource; the dimensions may be stored in any order. The CF Metadata Conventions add metadata to interpret the file storage. The array() method uses that information to produce an array in the familiar R storage arrangement:
tx_array <- tx$array()
str(tx_array)
#> num [1:140, 1:148, 1] 277 277 277 277 277 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ y : chr [1:140] "6950000" "6900000" "6850000" "6800000" ...
#> ..$ x : chr [1:148] "0" "50000" "1e+05" "150000" ...
#> ..$ time: chr "2041-07-01 12:00:00"
r <- terra::rast(tx_array)
terra::plot(r)
Ok, so now North America is looking right again: the data has been oriented the way R expects. Behind the scenes that may have involved transposing and flipping the data, depending on the storage arrangement in the netCDF resource.
But the coordinate system is still not right: these are just ordinal values along both axes, and the terra::SpatRaster object does not show a CRS either. All of this can be fixed by simply calling the terra() method on the data object. This returns a terra::SpatRaster for a data object with three axes, or a terra::SpatRasterDataset for a data object with four axes, including scalar axes if present:
r <- tx$terra()
r
#> class : SpatRaster
#> dimensions : 140, 148, 1 (nrow, ncol, nlyr)
#> resolution : 50000, 50000 (x, y)
#> extent : -25000, 7375000, -25000, 6975000 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=lcc +lat_0=46.0000038146973 +lon_0=-97 +lat_1=35 +lat_2=60 +x_0=3675000 +y_0=3475000 +datum=WGS84 +units=m +no_defs
#> source(s) : memory
#> name : 2041-07-01 12:00:00
#> min value : 263.4697
#> max value : 313.2861
terra::plot(r)
So that’s a fully specified terra::SpatRaster from netCDF data.

(Disclaimer: Package terra can do this too, with simply terra::rast(fn) and then selecting a layer to plot (which is not always trivial if you are looking for a specific layer; e.g. what does “lyr.1” represent?). The whole point of the above examples is to demonstrate the different steps in processing netCDF data. There are also some subtle differences, such as the names of the layers. Furthermore, ncdfCF doesn’t insert the attributes of the variable into the SpatRaster. terra can only handle netCDF resources that “behave” properly (especially regarding the axis order), and it has no particular consideration for the different calendars that can be used with CF data.)
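The direct route, for comparison (a quick sketch; fn is the tasmax file opened earlier in this section):

library(terra)
r2 <- terra::rast(fn)   # let terra read the resource directly
names(r2)               # inspect the layer names to find the layer you need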