Weather and Climate Data Sets

Lesson 6. Reanalysis Data Sets


During your exploration of weather and climate data, you might have encountered data sets in either "NetCDF" (pronounced "net C-D-F") or "GRIB" (sounds like "rib") formats. If so, you can be assured that you have left the realm of the casual data user and have entered the domain of the true data scientist. These file types were created to provide a machine-independent way to store and transport vast arrays of meteorological or geophysical data. The GRIB standard (formally "General Regularly-distributed Information in Binary form," though you will often see it glossed as "GRIdded Binary") was developed by the World Meteorological Organization (WMO) and is the most basic format for gridded meteorological data, from satellite images to model output. NetCDF, on the other hand, is what I would describe as a high-level data standard that was developed by Unidata (a consortium of research and educational partners). NetCDF files are highly versatile, are also machine independent, and (more importantly) are self-describing, which means that all the information you need to know about the data is packaged right along with the data itself.

I'll provide links with more details on these file standards later in the lesson. However, before you dive in, let me point out a few things to keep in mind. First, both file types are binary files. That means that they are only readable by a computer (you can't just open the files and expect to see numbers). As a result, you must have a special set of library functions and/or programs to read both NetCDF and GRIB files (remember that these files are meant for scientists, not the casual consumer). These libraries and programs are machine dependent and can be a real pain to install on your particular system, especially if you are not running Linux (data scientists must all run Linux systems). We'll need to overcome some technological challenges in order to develop a workflow that suits our needs. However, there are tools out there if you look around. Hopefully, with the programs that I point you to, you'll be able to perform most (if not all) of the analyses you need.
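To make the "binary file" point concrete, here is a minimal Python sketch using only the standard library. The four temperature values and the byte layout (big-endian 32-bit floats) are my own assumptions, chosen purely for illustration; neither NetCDF nor GRIB stores data exactly this way.

```python
import struct

# Four temperatures (in kelvin) as they might appear in a text file.
temps = [271.5, 273.25, 280.0, 285.75]
as_text = " ".join(str(t) for t in temps)

# The same values packed as big-endian 32-bit floats.
# This layout is an assumption made for the example only.
as_binary = struct.pack(">4f", *temps)

print(as_text)    # human-readable numbers
print(as_binary)  # raw bytes that mean nothing to the eye

# Decoding only works if the reader already knows the layout:
# how many values, what type, and what byte order.
decoded = list(struct.unpack(">4f", as_binary))
print(decoded)
```

The last step is the whole story in miniature: the bytes are only useful to software that knows how they were written, which is exactly the job the NetCDF and GRIB libraries do for you.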

[Figure: animated 3D simulation]
Data can have complex relationships. Consider this simulation of the Nested Air Quality Prediction Modeling System over China. How might data like this be packaged so that it can be queried and analyzed?
Credit: Institute of Atmospheric Physics / Chinese Academy of Sciences

If I had to identify a significant difference between NetCDF and GRIB files, it would be in how the data are stored and accessed in each. GRIB files are simply a series of sequential records packed into a single file. Each record represents a grid of a single variable at a single time (like pages in an atlas). Each page has a description at the top (called the "header") describing the grid... what it is, how big, what the numbers are, etc. Grids can be extracted from the file and displayed/processed using the information in the header, but you are limited to extracting an entire record at a time. Extracting a value at a particular latitude/longitude (for example) requires much more work because, for the sake of file size, the coordinates of each grid point are not stored alongside the data values; you have to work out the point's position in the grid from the header's description. If this seems annoying, I agree... it is. But remember that GRIB format files are simply containers for large data sets that are likely processed automatically by some heavy-duty computer power. That's not to say we can't do likewise, but know that if you need GRIB data of any kind, you'd better need a lot of it to make up for the time you will spend writing the scripts needed to process the data.
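The "pages in an atlas" idea can be sketched with a toy record format in Python. To be clear, this is not the real GRIB encoding: the header layout, the `pack_record`, `read_records`, and `value_at` helpers, and the sample grids are all hypothetical, invented only to show why pulling out a single point means scanning records and then computing an index from the header.

```python
import struct

# Hypothetical header (NOT the real GRIB layout): 4-byte variable name,
# forecast hour, rows, cols, first lat, first lon, lat step, lon step.
HEADER = struct.Struct(">4sHHHffff")

def pack_record(name, hour, lat0, lon0, dlat, dlon, grid):
    """Serialize one 2D grid as a header followed by row-major float32 values."""
    nlat, nlon = len(grid), len(grid[0])
    flat = [v for row in grid for v in row]
    header = HEADER.pack(name.encode().ljust(4), hour, nlat, nlon,
                         lat0, lon0, dlat, dlon)
    return header + struct.pack(f">{len(flat)}f", *flat)

def read_records(blob):
    """Walk the file front to back, yielding (header fields, values) per record."""
    offset = 0
    while offset < len(blob):
        fields = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        count = fields[2] * fields[3]          # rows * cols
        values = struct.unpack_from(f">{count}f", blob, offset)
        offset += 4 * count                    # 4 bytes per float32
        yield fields, values

def value_at(blob, name, hour, lat, lon):
    """Find the matching record, then derive the grid index from its header."""
    for fields, values in read_records(blob):
        rname, rhour, nlat, nlon, lat0, lon0, dlat, dlon = fields
        if (rname.decode().strip(), rhour) != (name, hour):
            continue
        i = round((lat - lat0) / dlat)         # row index
        j = round((lon - lon0) / dlon)         # column index
        if not (0 <= i < nlat and 0 <= j < nlon):
            raise ValueError("point lies outside the grid")
        return values[i * nlon + j]
    raise KeyError((name, hour))

# Two "pages": temperature at forecast hours 0 and 6 on a 2 x 3 grid.
grid_00 = [[280.0, 281.0, 282.0], [283.0, 284.0, 285.0]]
grid_06 = [[290.0, 291.0, 292.0], [293.0, 294.0, 295.0]]
blob = (pack_record("TMP", 0, 40.0, -80.0, 1.0, 1.0, grid_00)
        + pack_record("TMP", 6, 40.0, -80.0, 1.0, 1.0, grid_06))

point = value_at(blob, "TMP", 6, 41.0, -79.0)
print(point)
```

Notice that no latitude or longitude is stored with any individual value. The position has to be reconstructed from the starting corner and grid spacing in the header, and the record itself has to be found by reading through the file in order.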

The good news is that NetCDF files are much more friendly. They are still in binary format, however, and you will need the proper set of tools to read them. The thing that makes NetCDF files better than their GRIB cousins is that NetCDF files are totally self-contained. That is, they have ALL the information you would need to know about the data included within their description section. This includes variable dimensions/units, a time-base definition, and the specific data structure within the file. What's more, NetCDF files are not sequential records of 2D data. Unlike the atlas analogy, where the data are stored on individual pages, NetCDF files represent a hypercube (more than 3 dimensions) of data that you may interact with in any way you wish. For example, in a NetCDF file, you might have model temperatures over a 3D domain (latitude, longitude, height) AND in time. In this 4-dimensional configuration, you might retrieve a horizontal grid, a vertical slice, a profile, or a time-series at a particular point in space, all with relative ease. These are pretty amazing files, and I think once you get the hang of it, you are going to wish everything were in NetCDF format.
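Here is a rough Python sketch of the hypercube idea, again using only the standard library. It does not read or write actual NetCDF files; the `dataset` dictionary, the `select` helper, and the synthetic temperature values are all hypothetical stand-ins for what a real NetCDF library would give you.

```python
# A toy "self-describing" data set: dimension names, coordinate values, and
# units travel with the data. All names and numbers here are made up.
dataset = {
    "name": "temperature",
    "units": "K",
    "dims": ("time", "level", "lat", "lon"),
    "coords": {
        "time": [0, 6, 12],           # forecast hour
        "level": [1000, 850, 500],    # pressure level in hPa
        "lat": [40.0, 41.0],
        "lon": [-80.0, -79.0, -78.0],
    },
    # Synthetic values stored as one flat, row-major list (time varies slowest).
    "values": [300 - 25 * k + t + 2 * i + j
               for t in range(3) for k in range(3)
               for i in range(2) for j in range(3)],
}

def select(ds, **fixed):
    """Pin some dimensions to coordinate values; keep the rest as nested lists."""
    dims = ds["dims"]
    shape = [len(ds["coords"][d]) for d in dims]
    # Row-major strides: the jump in the flat list per step along each dimension.
    strides = [1] * len(dims)
    for n in range(len(dims) - 2, -1, -1):
        strides[n] = strides[n + 1] * shape[n + 1]

    def build(n, offset):
        if n == len(dims):
            return ds["values"][offset]
        d = dims[n]
        if d in fixed:
            idx = ds["coords"][d].index(fixed[d])
            return build(n + 1, offset + idx * strides[n])
        return [build(n + 1, offset + i * strides[n]) for i in range(shape[n])]

    return build(0, 0)

horizontal = select(dataset, time=6, level=850)               # lat x lon grid
vertical = select(dataset, time=12, lat=40.0)                 # level x lon slice
profile = select(dataset, time=0, lat=41.0, lon=-79.0)        # values by level
series = select(dataset, level=500, lat=40.0, lon=-80.0)      # values by time

print(dataset["name"], dataset["units"])
print(horizontal)
print(vertical)
print(profile)
print(series)
```

The same small function returns a horizontal grid, a vertical slice, a profile, or a time series depending only on which dimensions you pin down, and the request is phrased in terms of coordinate values rather than record numbers. That is the convenience that the self-describing structure buys you.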

So, in this lesson, we are going to start off by working with the NetCDF file structure, then move on to GRIB. Along the way you'll also be introduced to two new data sources: model reanalyses (specifically the North American Regional Reanalysis) and the National Digital Forecast Database. This will give you even more choices when it comes to finding data that will suit your needs.

Read on!