In this lesson, we will look at data and the databases in which you might store all your lovely data. Many sources (Franklin, 1992) estimate that 80% of all data has a spatial component. This makes almost everyone's data a potential geographic resource (good news if you need to grow your empire at work, or just want some job security). In many cases, however, saying that a dataset has a geographic component only means that each item of data carries an address; to treat the data as spatial, you will need a geocoder, which converts an address into a spatial location. Geocoding is beyond the scope of this lesson, but a quick Google search will give you plenty of starting points.
The first, and in many ways most important, question in GIS data handling is: what format is your data available in, and what format would you really like it to be in? Once you have coerced the data into a useful format, you need to store it somewhere. You could leave it in a file on your hard disk, but that can become slow, it is messy if you need to share the data, and some of those files can be huge. It is usually better to find a server (somewhere central) and put all of your data into a database living on that server. How to choose that database is what we'll consider in the second half of this lesson. A key concern when choosing is whether you can pull your data back out: many organizations have found themselves paying millions of dollars a year in license fees because an expensive database is effectively holding their data hostage.
Finally, we will end the lesson with a look at some interesting datasets from a wide range of sources. In each case, we'll look at the data itself and its metadata, and explore how you could make use of it.
Can you bring your data into the GIS?
Once you have selected the dataset that you need for your project, the most pressing task is to import it into your GIS. There are two main problems here. The first is what format your data is in, and which formats your GIS can handle. For many years, the lingua franca of the vector GIS world has been the Esri shapefile, in large part because Esri published the specification online for all to read and implement. This has led to many implementations, both proprietary and open source, which are widely used in GIS tools today. In the raster GIS world there is less agreement on a single favorite format. One good choice is GeoTIFF, a public-domain standard that describes how to embed georeferencing data in an image file's metadata. For more complex earth science data, a popular choice is the netCDF format, which also supports compression of the data.
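Because the shapefile specification is public, you can see for yourself how little magic there is in the format: every main `.shp` file begins with a fixed 100-byte header. The sketch below (plain Python, entirely synthetic data, field offsets taken from the published spec) builds such a header in memory and then decodes the fields a GIS reads first.

```python
import struct

def make_header(shape_type, xmin, ymin, xmax, ymax, length_words=50):
    """Build a minimal 100-byte shapefile main-file header (illustration only)."""
    return (
        struct.pack(">i", 9994)                 # file code, big-endian
        + b"\x00" * 20                          # five unused int32 fields
        + struct.pack(">i", length_words)       # file length in 16-bit words, big-endian
        + struct.pack("<ii", 1000, shape_type)  # version and shape type, little-endian
        + struct.pack("<4d", xmin, ymin, xmax, ymax)  # bounding box
        + struct.pack("<4d", 0, 0, 0, 0)        # Z and M ranges (unused here)
    )

def parse_header(header):
    """Decode the fields a GIS reads first from a .shp file."""
    file_code, = struct.unpack(">i", header[0:4])
    version, shape_type = struct.unpack("<ii", header[28:36])
    xmin, ymin, xmax, ymax = struct.unpack("<4d", header[36:68])
    return {"file_code": file_code, "version": version,
            "shape_type": shape_type, "bbox": (xmin, ymin, xmax, ymax)}

# Shape type 5 is "Polygon" in the spec; the bounding box here is invented.
hdr = make_header(shape_type=5, xmin=-84.3, ymin=33.6, xmax=-84.2, ymax=33.8)
print(parse_header(hdr))
```

Note that the real format quirkily mixes byte orders: the file code and length are big-endian, while everything else is little-endian, which is exactly the kind of detail an open specification lets you check.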
If you are stuck with a dataset in a format your GIS cannot import, I can highly recommend the open source GDAL/OGR suite of tools, which provides conversion utilities that read and write a wide variety of raster (GDAL) and vector (OGR) formats.
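As a sketch of what that looks like in practice (assuming GDAL/OGR is installed and that `input.shp`, `input.img`, and `input.tif` stand in for your own files), a typical conversion is a one-liner:

```shell
# Vector: convert a shapefile to GeoJSON with OGR
ogr2ogr -f GeoJSON output.geojson input.shp

# Raster: convert an ERDAS Imagine file to GeoTIFF with GDAL
gdal_translate -of GTiff input.img output.tif

# Not sure what a file contains? Ask the tools to describe it:
gdalinfo input.tif
ogrinfo -so input.shp
```

The `-f` and `-of` flags name the target format; `gdalinfo` and `ogrinfo` are also a quick way to discover what projection metadata (if any) a mystery file carries.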
Another common problem when loading data is working out which projection and coordinate system your dataset uses. While logically you might think that nobody would ever create a dataset without producing matching metadata, by now you know that this is all too common, and the problem is getting worse: as creating geographic data becomes easier (think of the so-called "neogeographers"), more and more data turns up as a bare text file, or a shapefile with no projection information. As a well-trained geographer, you would never make that mistake, but if you do get stuck with a projection-less dataset, you will probably want to visit spatialreference.org, which provides a searchable list of projections and, for each one, 14 different formats describing it.
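To make that metadata concrete: a shapefile's projection lives in a `.prj` sidecar file holding one line of "well-known text" (WKT). Below is a simplified WGS 84 definition (EPSG:4326, trimmed for readability, so treat it as illustrative rather than canonical) and a few lines of plain Python that pull out the human-readable name and the EPSG code.

```python
import re

# A typical .prj file holds a single line of WKT like this (simplified WGS 84):
wkt = ('GEOGCS["WGS 84",DATUM["WGS_1984",'
       'SPHEROID["WGS 84",6378137,298.257223563]],'
       'PRIMEM["Greenwich",0],UNIT["degree",0.0174532925199433],'
       'AUTHORITY["EPSG","4326"]]')

def describe_wkt(wkt):
    """Pull the CRS name and (if present) the EPSG code out of a WKT string."""
    name = re.match(r'(?:PROJCS|GEOGCS)\["([^"]+)"', wkt)   # projected or geographic
    epsg = re.search(r'AUTHORITY\["EPSG","(\d+)"\]\]$', wkt)  # trailing authority tag
    return (name.group(1) if name else None,
            int(epsg.group(1)) if epsg else None)

print(describe_wkt(wkt))  # ('WGS 84', 4326)
```

In real work you would hand the whole WKT string to a proper library rather than a regex, but seeing the structure makes sites like spatialreference.org, which can emit a CRS in WKT and many other formats, much less mysterious.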
Can you pull your data out of the GIS?
Simply put, if you can't get your data back out of a database, then you have to keep using that database even when better and/or cheaper options are available, because the cost of recreating your data outweighs the benefits of switching. As an example, this problem has cost the US taxpayer up to $25 million at the US Air Force Academy alone. There is no telling how much this sort of vendor lock-in is costing the broader industry.
The moral of this story is to always plan ahead and make sure you have a clear exit strategy before importing your precious data into any database. If the database doesn't implement open standards and allow you to easily export (all of) your data, then there needs to be a really compelling reason to implement your system with it.
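One concrete form of exit strategy is a scripted export to an open, vendor-neutral format. The sketch below is a minimal illustration, with hypothetical rows and a made-up column layout (id, name, longitude, latitude) standing in for whatever your database returns: it turns point records into a GeoJSON FeatureCollection using only the standard library. The same idea scales up to a scheduled dump job that proves, on an ongoing basis, that your data can leave the building.

```python
import json

# Hypothetical rows as they might come back from any database query.
rows = [
    (1, "Depot A", -84.39, 33.75),
    (2, "Depot B", -84.29, 33.77),
]

def rows_to_geojson(rows):
    """Serialize (id, name, lon, lat) point records into a GeoJSON dict."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                # GeoJSON coordinate order is [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"id": fid, "name": name},
            }
            for fid, name, lon, lat in rows
        ],
    }

print(json.dumps(rows_to_geojson(rows), indent=2))
```

Because GeoJSON is an open standard, anything you export this way can be re-imported by essentially any modern GIS, which is exactly the leverage you want when negotiating with a database vendor.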
Franklin, C., 1992. An introduction to geographic information systems: linking maps to databases. Database 15 (2), 12-21.