The CF Metadata Conventions (“CF” henceforth) are being developed by academics and practitioners in the climate and forecasting field, with the objective to facilitate interoperability between data producers and data consumers (whether human or computer systems).
Reading the CF documentation can be a daunting task because it is written as a standards document, and not as a guideline or tutorial for people who want to understand the concept and the structure but who are not looking for all of the low-level detail.
This vignette presents a view of CF from the perspective of R. The main elements of CF are introduced in terms that should be easily understood by anyone who has worked with matrices and arrays in R.
By definition, a CF-compliant data set is stored in a netCDF file. The CF elements and their relationship to the building blocks of the netCDF file are given in the figure below:
If you’ve worked with netCDF data in R before, probably using package
ncdf4
(but what follows is equally true when you use
package RNetCDF
), you should be familiar with the yellow
boxes in the above figure, especially “NC::Variable” (the specific
notation used in the figure is not important here), corresponding to the
ncvar4
class in ncdf4
. CF, however, recognizes
eight different objects that are each based on an “NC::Variable”. This
ranges from axes, or generic coordinate variables in the white
box in the figure (to get the dimension values of your data array when
not loaded upon opening the file you have to call
ncdf4::ncvar_get()
, i.e. you read a variable to get the
values of a dimension), to grid mapping variables (that define
the coordinate reference system of your data) to the actual data array
in a data variable. This is the source of all those surprising
“variables” that you see when you do names(nc$var)
:
library(ncdf4)
fn <- system.file("extdata", "tasmax_NAM-44_day_20410701-vncdfCF.nc", package = "ncdfCF")
nc <- nc_open(fn)
# The "variables"
names(nc$var)
#> [1] "time_bnds" "lon" "lat"
#> [4] "Lambert_Conformal" "height" "tasmax"
# The dimensions
names(nc$dim)
#> [1] "time" "bnds" "x" "y"
So which of these variables is the one that holds your data? And how
do you tell it apart from the other “variables”? And why are there
x
and y
dimensions and lon
and
lat
“variables”?
You may not be as familiar with the attributes in the netCDF file,
the “NC::Attribute” object in the above figure, as ncdf4
does not make these as easily accessible as the other information in the
file. That is a shame because attributes are the thing that makes netCDF
stand apart from pretty much any other gridded data format. CF is
actually all about attributes, with some of them organized in
“NC::Variable” objects that do not hold any other data! (Such as the
grid mapping variables mentioned above.) Put the other way
around, you are not really using a CF-compliant netCDF data set unless
you read the proper attributes.
Looking at the same netCDF file as above, but now using the
ncdfCF
package:
library(ncdfCF)
fn <- system.file("extdata", "tasmax_NAM-44_day_20410701-vncdfCF.nc", package = "ncdfCF")
ds <- open_ncdf(fn)
# The data variable
names(ds)
#> [1] "tasmax"
# The details of the data variable
tmax <- ds[["tasmax"]]
tmax$print(width = 25)
#> <Variable> tasmax
#> Long name: Daily Maximum Near-Surface Air Temperature
#>
#> Coordinate reference system:
#> name grid_mapping
#> Lambert_Conformal lambert_conformal_conic
#>
#> Axes:
#> id axis name long_name length unlim values
#> 2 X x x-coordinate in Cartes... 148 [0 ... 7350000]
#> 3 Y y y-coordinate in Cartes... 140 [0 ... 6950000]
#> 0 T time 1 U [2041-07-01 12:00:00]
#> Z height 1 [2]
#> unit
#> m
#> m
#> days since 1949-12-1 0...
#> m
#>
#> Auxiliary longitude-latitude grid:
#> orientation axis name extent unit
#> longitude X lon [-174.382 ... -19.618] degrees_east
#> latitude Y lat [10.154 ... 76.459] degrees_north
#>
#> Attributes:
#> id name type length value
#> 0 standard_name NC_CHAR 15 air_temperature
#> 1 long_name NC_CHAR 42 Daily Maximum Near-Sur...
#> 2 units NC_CHAR 1 K
#> 3 grid_mapping NC_CHAR 17 Lambert_Conformal
#> 4 coordinates NC_CHAR 14 height lat lon
#> 5 _FillValue NC_FLOAT 1 1.00000002004088e+20
#> 6 missing_value NC_FLOAT 1 1.00000002004088e+20
#> 7 original_name NC_CHAR 11 TEMP at 2 M
#> 8 cell_methods NC_CHAR 13 time: maximum
#> 9 FieldType NC_INT 1 104
#> 10 MemoryOrder NC_CHAR 3 XY
# The "time" axis
ds[["time"]]
#> <Time axis> [0] time
#> Length : 1 (unlimited)
#> Axis : T
#> Calendar : 365_day
#> Range : 2041-07-01 12:00:00
#> Bounds : 2041-07-01 ... 2041-07-02
#>
#> Attributes:
#> id name type length value
#> 0 standard_name NC_CHAR 4 time
#> 1 long_name NC_CHAR 4 time
#> 2 bounds NC_CHAR 9 time_bnds
#> 3 units NC_CHAR 29 days since 1949-12-1 00:00:00
#> 4 calendar NC_CHAR 7 365_day
#> 5 axis NC_CHAR 1 T
# Parameters in the grid mapping variable
ds[["Lambert_Conformal"]]
#> <Grid mapping> Lambert_Conformal
#> Grid mapping name: lambert_conformal_conic
#>
#> Attributes:
#> id name type length value
#> 0 grid_mapping_name NC_CHAR 23 lambert_conformal_conic
#> 1 longitude_of_central_meridian NC_DOUBLE 1 -97
#> 2 latitude_of_projection_origin NC_DOUBLE 1 46.0000038146973
#> 3 standard_parallel NC_DOUBLE 2 35, 60
#> 4 false_easting NC_DOUBLE 1 3675000
#> 5 false_northing NC_DOUBLE 1 3475000
Ok, that’s a lot more information. Let’s focus on a few important things:
x
, y
, time
and
height
. That is not all clear from the ncdf4
output which lists the first three dimensions and excludes
height
, but includes bnds
instead.Lambert_Conformal
that, by definition, applies to the
X
and Y
axes of the data variable
tasmax
. As suggested by CF, the data set also includes two
ancillary coordinate variables lat
and
lon
, each a matrix with the same dimension as the length of
the Y
and X
axes of the data variable
they are associated with, where each cell gives the latitude and
longitude, respectively, of the corresponding cell in the data
variable (whose axes have coordinates in the Lambert Conformal
coordinate reference system). You can identify these two aspects by
looking at the attributes of tasmax
: attribute
grid_mapping
points to the grid mapping variable
Lambert_Conformal
, attribute coordinates
lists
both the scalar coordinate variable height
and the
auxiliary coordinate variables lat
and
lon
.time
axis has an attribute bounds
which points to time_bnds
, a boundary variable
that indicates for each coordinate along the axis its lower and higher
values. For the time
axis there is just one coordinate at
2041-07-01 12:00:00
with range values
2041-07-01 ... 2041-07-02
(midnight to midnight). (It is
the time_bnds
variable that uses the bnds
dimension that ncdf4
reports.) The calendar
used by this axis is 365_day
, meaning that no year uses the
leap day of 29 February. This is a so-called “model calendar” and the
standard date-time functions will not work properly with this data; use
the built-in time methods instead.It should be obvious that correctly interpreting a netCDF file requires a package that not only provides easy access to the attributes of the file, but that also applies the CF conventions to put all the pieces together in such a way that the data set is presented to the user of the data - that would be you - as the data producers intended.
The issues arising from the particularities of CF on top of the
netCDF format are presented and discussed in the remainder of this
document, seen from the perspective of the R user, with examples using
the ncdfCF
package. At the end of this vignette is a
feature matrix that indicates the support for each CF element in package
ncdfCF
.
Please note that I have nothing against
ncdf4
orRNetCDF
packages. In fact, packagencdfCF
is built on top ofRNetCDF
. My point is merely to demonstrate how CF-compliant netCDF files, of which there are very many out there on the internet, require you to look beyond the basic netCDF building blocks and use the attributes. If your netCDF files do not use the CF conventions, then by all means usencdf4
if you prefer. I would suggest, though, to read the next section to understand better how data structures in netCDF differ from R standard practices.
In R, matrices and arrays are stored in column-major ordering, meaning that successive values go from the top-left down down the column, then across the columns to the bottom-right. This is easily seen when printing a matrix to the console:
matrix(1:12, nrow = 3, ncol = 4)
#> [,1] [,2] [,3] [,4]
#> [1,] 1 4 7 10
#> [2,] 2 5 8 11
#> [3,] 3 6 9 12
In netCDF files matrices and arrays are stored in row-major ordering,
starting from the top-left, progressing to the end of the row and then
down the rows to the bottom-left. This can be performed in R with the
byrow
argument of the matrix()
function:
matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
#> [,1] [,2] [,3] [,4]
#> [1,] 1 2 3 4
#> [2,] 5 6 7 8
#> [3,] 9 10 11 12
CF-compliant data sets, however, are free to store the data in any possible organization of the axes, although the recommendation is to use the longitude - latitude - vertical - time ordering. But even when that recommendation is followed, the latitude coordinates in the row are often stored in ascending order, that is from the bottom-left working upwards. That looks like this:
matrix(c(9, 5, 1, 10, 6, 2, 11, 7, 3, 12, 8, 4), nrow = 3, ncol = 4)
#> [,1] [,2] [,3] [,4]
#> [1,] 9 10 11 12
#> [2,] 5 6 7 8
#> [3,] 1 2 3 4
This is the cause of lots of headaches and “patches” involving
t()
and rev()
in some order and
general exasperation (“why does it have to be so
complicated?”).
The bottom line is that there is no guarantee that the data are stored in a particular ordering. So how can you make sure that you get the data from the netCDF file in a predictable format so that you can use your favourite analysis routines with peace of mind? The answer should be obvious: read the attributes of the netCDF file and apply them following CF guidelines.
Lucky for you package ncdfCF
does all the dirty work
behind the scene to give you easy access to either the raw array data,
in the ordering found in the netCDF file, or an oriented array that
follows the standard R array layout:
# The raw data using the ordering in the netCDF file (or as modified by
# a processing method such as `summarise()`)
tmax_raw <- tmax$data()$raw()
str(tmax_raw)
#> num [1:148, 1:140, 1] 301 301 301 301 301 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ x : chr [1:148] "0" "50000" "1e+05" "150000" ...
#> ..$ y : chr [1:140] "0" "50000" "1e+05" "150000" ...
#> ..$ time: chr "2041-07-01 12:00:00"
# The same data but now in standard R ordering
tmax_R <- tmax$data()$array()
str(tmax_R)
#> num [1:140, 1:148, 1] 277 277 277 277 277 ...
#> - attr(*, "dimnames")=List of 3
#> ..$ y : chr [1:140] "6950000" "6900000" "6850000" "6800000" ...
#> ..$ x : chr [1:148] "0" "50000" "1e+05" "150000" ...
#> ..$ time: chr "2041-07-01 12:00:00"
Note how in the array()
version the x
and
y
axes are reversed and the y
values decrease,
i.e. working from the top-left down to the bottom-right, just like a
standard R array.
The netCDF file format is very flexible, as well as CF, and the
ncdfCF
package uses a layered structure to capture it
all.
This structure enables a full
support of the features provided by the
netcdf
library and
the CF Conventions.
This section provides an overview of how this package implements the various elements of the CF Conventions.
Any filename extension is allowed.
Any allowable data type can be read to and written from a netCDF file. In R, only the standard data types are used.
It is required that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. The maximum length of a name is 255 characters.
Fully implemented. Axes of length 1 are automatically encoded as a scalar coordinate variable, thus avoiding the use of a dimension.
The valid_range
attribute is determined when a
CFArray
is created or its values modified.
Upon saving a CFArray
to a netCDF file, data may be
packed and the appropriate attributes will be written along with the
data.
This package only interprets attributes that are defined by CF to have a special meaning that is relevant to reading, processing or writing data. Other attributes are read and presented to the user.
The attribute Conventions
is written to file with value
“CF-1.12”.
The attribute history
is created in the group of the
data variable or prepended with information on any processing that has
been applied to the data.
Other global, group or data variable attributes are not written to file automatically; the user has to add to amend these attributes explicitly.
Fully implemented.
The package does not use the UDUNITS library to manage units or convert between them.
The units
attribute is interpreted for axes and
effectively required for the dimensions “that receive special treatment”
(see sections 4.1 - 4.4) or they will not be recognized as such.
Otherwise presence or contents of the attribute is not tested.
The units_metadata
attribute is not managed by this
package. It is the responsibility of the user to set the appropriate
value.
Attributes scale_factor
and add_offset
are
only used for packing data (section 8.1).
Supported for all objects that have this attribute.
The standard_name
attribute is read but not checked
against the standard name table.
Not supported.
Not supported.
When opening a netCDF resource, axes are scanned for using the
units
or axis
attribute, then the
standard_name
, and finally, as an extension to CF, by its
name
attribute.
Fully implemented.
Fully implemented.
A vertical coordinate will be read but there is no complete support.
The positive
and standard_name
attributes are
not yet interpreted, nor is the units
attribute assessed
for a pressure or length unit.
Parametric vertical coordinates are not yet supported.
The time coordinate is managed by the CFtime package. All
defined calendars, including the recently added utc
and
tai
calendars, are supported, with some restrictions. Leap
seconds are fully supported in notation and arithmetic when using the
utc
calendar.
Time coordinates with no annual cycle and explicitly defined calendars are not supported.
Fully supported.
The coordinates
attribute is interpreted, both for
coordinate variables and auxiliary coordinate
variables.
In ncdfCF
, auxiliary coordinate variables will
have axis orientations attached to them for ease-of-use and
identification. This orientation will not be written to file as an
axis
attribute.
Fully supported.
Fully supported.
Not supported. But do note that one significant data collection of
netCDF files using a reduced horizontal grid, MODIS level-3 binned
format, uses a different format than the CF arrangement.
ncdfCF
does read this format without problem (extending CF
by also supporting netCDF user-defined types).
The simple grid_mapping
format is fully supported. The
extended, second format is not supported. The standard_name
on coordinate variables is not used; instead use is made of the axis
orientation (either through the axis
attribute or through
another supported mechanism; see section 4.).
ncdfCF
uses OGC WKT2 strings to report on coordinate
reference systems, rather than the stock-old and obsolete PROJ format as
is suggested by CF.
The crs_wkt
attribute is used when found.
Fully supported.
Not implemented.
Not implemented.
Generic labels are supported. Geographic regions are also supported
but not referenced against the list of standardized region
names and neither is the standard_name
attribute
referenced in this context.
Taxon names and identifiers and their specific attributes are not specifically interpreted.
These are not specifically supported but they will be read correctly.
Fully supported for 1D and 2D coordinate variables.
Not implemented.
Not implemented.
Not implemented.
Not implemented.
Fully supported on reading and writing.
Not implemented.
Not implemented.
Not implemented.
Not implemented.