Species distribution modeling for Solanum Acaule

An example of a Python data analysis project in a Jupyter notebook

1. Overview

This notebook creates a species distribution model for Solanum Acaule, a plant species growing in the western countries of South America, with the help of the R dismo package and illustrates the predicted distribution with an interactive map based on ESRI's ArcGIS API for Python and ArcGIS Online (AGOL). The notebook adapts an example from https://rspatial.org/raster/sdm/index.html by Robert J. Hijmans and Jane Elith. The main goal is to demonstrate how Python, R, and various third-party Python packages can be combined in a Jupyter notebook to solve and document a spatial data analysis task. To limit length and complexity for the purposes of this Python programming course, we made various simplifications from the perspective of species distribution modeling.

The project and remainder of this notebook are organized into the following five steps/sections: project preparation (Section 2), obtaining and cleaning observation data for Solanum Acaule (Section 3), preparation of other input data (Section 4), creating and running the model in R (Section 5), and postprocessing and visualization of the results with AGOL (Section 6).

The code has been implemented and tested with Python 3.6. The last part (Section 6) requires a Penn State AGOL account (https://www.arcgis.com/home/index.html).

Python packages used: pandas, geopandas, shapely, rpy2, GDAL/OGR (osgeo), matplotlib, cartopy, and the ArcGIS API for Python (arcgis), plus the standard library modules os, re, and zipfile.

R packages used: dismo (which loads the raster package), maptools, and rgdal.

2. Preparation

To run this notebook, you will have to make sure that a number of Python packages are installed as described above and that you downloaded and extracted the input data we are going to use to a new folder (referred to as the workspace folder in the following). Furthermore, you will have to adapt a number of input variables in Section 2.3.1 of this notebook.

2.1 Python packages to install

The notebook has been developed in the fresh Anaconda Python 3.6 environment created in Section 3.2. of the lesson materials. Installing the Python packages needed for the project (see Section 1 above) required adding two additional channels, conda-forge and esri, in Anaconda. If you followed the instructions from Section 3.2 of the lesson materials, you should already have all the required Python packages installed.

2.2 Data for this project

The data for this project is available as a .zip file linked in Section 3.11 of the lesson materials. You should extract the files contained in the .zip file to a new folder. You will then need to set the workspace variable in Section 2.3.1 of this notebook to refer to this folder.

The data consists of a world borders shapefile and a set of raster files for different bioclimatic variables available from https://www.worldclim.org/data/index.html [1].

[1] Fick, S.E. and R.J. Hijmans, 2017. WorldClim 2: New 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology.

2.3 Importing packages and preparing notebook

The following subsections contain some first Python code to initialize the project, starting with some important input variables that you need to adapt for your purposes.

2.3.1 Set up input variables that need to be adapted by user

The workspace variable should point to the folder to which you extracted the input data. Please adapt the path accordingly.

Username for your Penn State AGOL organization account. We will need this later to visualize our results. Please fill in your PSU ID (user name) in place of the asterisk.

The following two input variables contain the names of output shapefiles produced in later parts of the notebook that will be uploaded to your AGOL account. Please replace the asterisks in the names with your initials or PSU ID to make sure the names are unique in the organization.
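As a rough sketch, the three cells in this subsection could look as follows. The variable names agolUsername and bioclimPredictionShapefile (as well as the placeholder path) are assumptions we will reuse in later examples; acauleObservationShapefile is the variable name referred to in Section 6.1.

    # folder to which you extracted the project data (adapt to your system)
    workspace = r'C:\species_distribution\workspace'

    # your Penn State AGOL user name; replace the asterisk with your PSU ID
    agolUsername = '*'

    # output shapefile names; replace the asterisks with your initials or PSU ID
    acauleObservationShapefile = 'acauleObservations*.shp'
    bioclimPredictionShapefile = 'bioclimPrediction*.shp'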

2.3.2 Set up other input variables

The following input variables are for names of input and output files, etc. They can be left unchanged:
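For example (worldCountryShapefile is the variable referenced in Section 3.3.3; predictionRasterFile is an assumed name for the variable holding the prediction raster file name used in Sections 5 and 6):

    worldCountryShapefile = 'TM_WORLD_BORDERS-0.3.shp'   # world borders shapefile used in Section 3.3.3
    predictionRasterFile = 'bioclim_prediction.tif'      # prediction raster produced in Section 5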

2.3.3 Import general Python packages needed

Here we import some Python packages that we will use in different places throughout the project:
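A plausible version of this import cell, assuming the standard library modules used by the later code sketches in this notebook:

    import os
    import re
    import zipfile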

2.3.4 R interface and packages

To make sure that we can access R and the 'dismo' package from within our Python notebook, we have to set up rpy2 and load 'dismo', 'maptools', and 'rgdal'. We recently had cases where loading rpy2 failed on some systems due to the R_HOME environment variable not being set correctly. We therefore added the first line below which you will have to adapt to point to the lib\R folder in your AC Python environment (at the very least, you will have to replace username with your actual Windows user name).
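A sketch of this setup cell; the exact path depends on your system and the name of your environment, so both placeholders below need to be adapted:

    # set R_HOME to the lib\R folder of your Anaconda environment before loading rpy2
    os.environ['R_HOME'] = r'C:\Users\<username>\Anaconda3\envs\<environment name>\lib\R'
    %load_ext rpy2.ipython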

Our Anaconda environment currently already contains the dismo and maptools R packages but not the rgdal package (see explanation in Section 3.11 of the lesson materials). To install the rgdal R package, run the following %R command. A window will open up that allows you to pick the mirror from which to install. The installation may take a little while.
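The installation command could simply be:

    %R install.packages("rgdal")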

The next commands make sure all three R packages needed are loaded. If one of these commands returns "array([0], dtype=int32)" instead of "array([1], dtype=int32)", there is a problem with this R package and you may have to skip over some parts of this notebook, see note in Section 3.11 of the lesson materials.
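One way these load commands could look (we use require(...) here, which returns a value indicating whether loading succeeded; the exact form used in the lesson materials may differ slightly):

    %R require(dismo)
    %R require(maptools)
    %R require(rgdal)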

3. Obtaining and cleaning observation data

In most species distribution modeling projects, you would probably already have a data set available that contains information on where that species has been observed (presence data) and maybe also where it hasn't been observed (absence data). In this project, however, we are going to instead use data from the Global Biodiversity Information Facility (GBIF) at www.gbif.org that provides free access to a large repository of biodiversity data.

Dismo provides a function gbif(...) to directly download data from GBIF for convenience (see page 22 of the dismo package documentation at https://cran.r-project.org/web/packages/dismo/dismo.pdf ). The function allows for specifying the genus (solanum) and species (acaule) that we are interested in. However, for this walkthrough we won't be using a newly downloaded data set from GBIF, but rather a smaller example data set for Solanum Acaule provided as part of the dismo package, which is better suited to illustrate the different steps in this walkthrough. We will then maintain the obtained presence data in a pandas data frame, inspect it, and apply different approaches to clean up the data a bit before we make use of it.

3.1 Acquiring the observation data and transferring it to a pandas data frame

Note: In case you run into any issues with executing the R commands in this subsection (which most likely means you weren't able to load the dismo package in Section 2.3.4 above), we are providing a workaround at the end of this subsection that reads in the data from a csv file in Python rather than using R and the dismo package.

To get the data for Solanum Acaule as an R data frame, we simply have to call the data(...) function provided by dismo with the name of the variable (acaule) that we want to store the data frame in. We use the R function head(...) to only show the first six rows of the table here. If you run the second line without the head(..) and scroll down to the end of the table, you will see that the data frame has 1366 rows and 25 columns in total.
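This cell could look as follows (the table display comes from the head(...) call):

    %R data(acaule)
    %R head(acaule)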

Next, we want to move the entire table to Python to work with it there, more precisely we want to transfer it to a pandas DataFrame object. To achieve this we need to import pandas as well as activate some component of rpy2 that will make sure that R data frames are correctly translated into pandas data frames:
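A minimal sketch of this setup, assuming the rpy2 version used here provides the pandas2ri conversion module:

    import pandas as pd
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()    # translate R data frames to pandas data frames automatically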

Now we can move the table to Python by using '%R' with the '-o' option as explained in Section 3.7 of the lesson materials:
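For instance (a sketch; the -o option copies the R variable acaule into a Python variable of the same name):

    %R -o acaule head(acaule)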

For now we are done with using R. We now have a Python variable called acaule that contains the entire observation table and that is a pandas DataFrame object as can be easily verified. We use the pandas head() method to only show the first five rows of the table here:
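A quick check of the type followed by the first five rows could look like this:

    print(type(acaule))
    acaule.head()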

Workaround for 3.1 (skip this part if everything above worked ok for you)

As we noted in Section 3.11 of the lesson materials, it is possible that you run into issues with running the R commands in this section due to some issues with the R packages involved. If this is the case, you can run the following Python code to read in the acaule observation data from the csv file we are providing in Section 3.11. Depending on where you have placed that file on your harddisk, you may need to adapt the path in the second line of the code.
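A sketch of this workaround; the file name acaule.csv is a placeholder for the csv file provided in the lesson materials:

    import pandas as pd
    acaule = pd.read_csv(os.path.join(workspace, 'acaule.csv'))   # adapt path and name to where you saved the provided csv file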

3.2 Investigating the observation data

Now that we are back in Python, let's apply what we learned about working with pandas data frames to look at the data a bit in more detail. Let's start by looking at the column labels to get a better idea of what information we have available:
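For example:

    acaule.columns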

The most important columns for our species distribution modeling task are the latitude and longitude WGS84 coordinates stored in the 'lat' and 'lon' columns. Furthermore, we will later use the 'country' column to perform a quick plausibility check of the coordinates. Let's see again how many rows we now have in our table:

1366 rows. Unfortunately, when we take a closer look at the 'lat' and 'lon' columns, we see that not all the rows contain latitude and longitude coordinates. In the following command, we use [ ['lat', 'lon'] ] to restrict the columns to the 'lat' and 'lon' columns and [300:306] to slice the rows to only those with indices 300 to 305 (= row numbers 301 to 306 because the row labels start with 1, while the index starts counting from 0).
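The corresponding cell could look like this:

    acaule[['lat', 'lon']][300:306]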

The NaN stands for 'not a number' and is the result of no value being provided for these entries in the GBIF data. This is something we will have to take care of because we cannot use these records when applying a distribution model. We picked rows 301 to 306 in the previous cell because we already knew these rows contain NaN entries. Luckily, it's easy to test whether a column contains such entries using the pandas isnull() method:
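For instance (a sketch matching the description in the next paragraph):

    print(acaule.lat.isnull().value_counts())
    print(acaule.lon.isnull().value_counts())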

What happens here is that we apply isnull() to the acaule.lat and acaule.lon columns which gives us a boolean vector with True for rows that have NaN in the respective column and False otherwise, and then we apply the value_counts() method which gives us an overview of how often each value occurs in a data frame or, in this case, boolean vector. So we now know that the table contains 284 rows with a latitude value of NaN, and the same for the longitude values.

Finally, to get a quick idea of where the observations are located, we will use matplotlib in combination with cartopy to create a scatterplot of the 'lon' and 'lat' coordinate pairs on top of a simple world map (if you get some string output instead of a map image, run the code in this cell a second time):
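A sketch of such a plot with matplotlib and cartopy (figure size and point styling are arbitrary choices):

    import matplotlib.pyplot as plt
    import cartopy.crs as ccrs
    %matplotlib inline

    fig = plt.figure(figsize=(12, 8))
    ax = plt.axes(projection=ccrs.PlateCarree())   # simple unprojected lat/lon world map
    ax.set_global()
    ax.coastlines()
    ax.scatter(acaule.lon, acaule.lat, color='red', s=8, transform=ccrs.PlateCarree())
    plt.show()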

As you can see, the resulting plot is directly embedded as an image into the notebook. This is a very simple but also very useful approach to quickly plot some data, and we will use it a few times more in this notebook. When we look at the data, we can see that there are a few suspicious outliers that we should better take a closer look at before using this data to create our model.

3.3 Cleaning the observation data

When looking at the observation data in the previous section, we already identified a number of problematic entries which is often the case due to human error when creating the data or when compiling it by combining data from different sources. While 100% perfect data is often unachievable, in particular for large data sets, there are a number of measures that can be taken to clean up the data by correcting mistakes or removing incomplete or erroneous entries. In the following we will show a few examples of how Python code can be used to validate and clean up our observation data.

3.3.1 Removing rows without lon or lat coordinates

As we saw above, there exist entries in our observation table that do not have longitude and latitude coordinates specified, showing up as NaN ("not a number") in the table data. Let's start the cleaning process by removing these. As we already showed, the pandas method notnull() can be used to test that cells do not contain NaN and can be applied to a column or data frame to give us a boolean column/data frame as result. So if we want to test how many rows need to be removed we can use the filtering based on boolean expressions approach discussed in Section 3.8.6:
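A sketch of this cell; the variable name for the intermediate result is hypothetical, and the print on the second line produces the row count discussed below:

    missing = acaule[~(acaule.lat.notnull() & acaule.lon.notnull())]   # rows lacking lat or lon
    print(len(missing))
    missing[['lat', 'lon']].head()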

This may look a bit complicated but what we do here within the square brackets is just producing two boolean vectors: one that has True for all rows that have a latitude value (lat) that satisfies the notnull criterion, and one that does the same for the longitude (lon) column, and then the & operator is used to combine these into a single boolean vector using logical AND. This gives us a boolean vector for the rows that we actually want to keep, so to get those that need to be removed we invert the vector using the ~ operator for logical negation. Finally, by writing acaule[...] the resulting boolean vector is used to filter the data frame to only show those rows we are interested in. The output from the print command in line 2 tells us that there are 284 rows in total that need to be removed.

Now to actually remove the rows, we do the same thing but without the negation part and assign the result to a new variable acauleWithCoordinates:
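For example:

    acauleWithCoordinates = acaule[acaule.lat.notnull() & acaule.lon.notnull()]
    print(len(acauleWithCoordinates))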

Of the original 1366 rows we now have 1082 rows remaining, so we correctly removed the 284 rows that have NaN for the lat or lon column.

3.3.2 Duplicate removal

One thing that can often happen when data is compiled from different sources with different people contributing is that records are duplicated and end up multiple times in the table data. Pandas provides a method for checking for duplicates in a data frame, so let's quickly apply that to our data. The method is called duplicated(...) (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html ). It gives back a boolean vector that contains True if the corresponding row is a duplicate of another one and False otherwise. When calling the method, we will not provide the 'subset' parameter so that pandas will look at all columns for deciding whether two rows are identical or not. We will, however, use the parameter 'keep' to say that the first entry of those that are deemed identical will not be assigned True in the output. That means if there are five rows that are the same, the first of these will appear as False in the output boolean vector, while all the others will appear as True. So if we would use the result to filter the rows, we would keep exactly one (the first) of the identical rows in our table.
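The corresponding cell could be as simple as this (keep='first' is the default but we spell it out to match the description above):

    acauleWithCoordinates.duplicated(keep='first').value_counts()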

By applying the value_counts() method to the result of duplicated(...) we can immediately see that the vector consists of only False, so there are actually no duplicates in our data that we need to filter out.

3.3.3 Plausibility check of countries and coordinates

When we plotted the lat, lon coordinates on a map, we already noticed some weird outliers such as points in the Atlantic and Indian Ocean and in Antarctica. This can often be the result of mistakes made when entering the coordinates such as accidentally swapping lat and lon coordinates, forgetting a minus sign, etc.

Our data provides a column with the name of the country in which the observation took place, so this gives us the option to run a simple plausibility test to check whether country and coordinates are consistent. However, to do this, we need a way to geocode WGS84 coordinates to the country that contains these coordinates. There are multiple options for how we can realize this kind of geocoding in our Python code. One option would be to use an existing geocoding service such as Google's geocoding API or ESRI's reverse_geocode(...) function provided as part of the ArcGIS API for Python package. However, these options have certain disadvantages such as having a limit on how many requests can be made freely in a certain time span (which is problematic given the large number of points in our data set and the fact that we may want to be able to run the notebook code repeatedly) and the fact that these typically work by finding the closest address to the given point, which does not always give the correct results for sparsely populated areas close to the border of two countries.

We therefore opted to use an approach based on a local shapefile with polygons for all countries to implement the geocoding of coordinates to country names. The shapefile is called TM_WORLD_BORDERS-0.3.shp (see variable worldCountryShapefile defined in Section 2.3.2) and is part of the project data you had to extract, so it should be in your workspace folder now. Fortunately, the GDAL/OGR package (Section 3.9 of the lesson materials) contains all the methods we will need to query a shapefile with a point to get the polygon from the shapefile that contains that point. As the first step, we will use the OGR package of GDAL to open the world borders shapefile and access its only layer. Then we will determine the indices of the columns we are interested in ('NAME' for the countries' names) and for our observation data frame ('country' for the names of the country the observation was made in, 'lat' and 'lon' for the coordinates). To get the index of a column in a vector data set with OGR, we need to call the GetLayerDefn() method and then invoke GetFieldIndex(...) to get the index. For our observation data frame, we use the get_loc(...) method on the columns property of our data frame.
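A sketch of this preparation step, using the method calls named above (the index variable names are assumptions):

    from osgeo import ogr

    dataSource = ogr.Open(os.path.join(workspace, worldCountryShapefile))
    layer = dataSource.GetLayer(0)                                  # the shapefile's only layer
    nameFieldIndex = layer.GetLayerDefn().GetFieldIndex('NAME')     # country name field

    countryColIndex = acauleWithCoordinates.columns.get_loc('country')
    latColIndex = acauleWithCoordinates.columns.get_loc('lat')
    lonColIndex = acauleWithCoordinates.columns.get_loc('lon')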

We will now loop through the rows in our observation data frame with itertuples() to create a list of country names in variable geocodingResults that for each row provides the name of the country from the world borders shapefile that contains the corresponding coordinates. We will use the OGR's ability to set a spatial filter for a layer with layer.SetSpatialFilter(point) to reduce the country polygons to only that polygon that intersects with a given point feature. The other relevant GDAL/OGR methods we will use for this are:
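A sketch of this loop, illustrating the relevant OGR methods (creating a point geometry with ogr.Geometry(...) and AddPoint(...), retrieving the filtered feature with GetNextFeature(), reading its name with GetField(...), and removing the spatial filter again at the end):

    geocodingResults = []
    for row in acauleWithCoordinates.itertuples():
        point = ogr.Geometry(ogr.wkbPoint)
        point.AddPoint(row.lon, row.lat)              # point geometry for this observation
        layer.SetSpatialFilter(point)                 # only keep the polygon containing the point
        layer.ResetReading()
        feature = layer.GetNextFeature()
        if feature is not None:
            geocodingResults.append(feature.GetField(nameFieldIndex))
        else:
            geocodingResults.append('UNKNOWN')        # no containing country polygon found
    layer.SetSpatialFilter(None)                      # remove the spatial filter again

    print(geocodingResults[:10])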

Please note that with the way the code in the previous cell is written, the geocoding result will be "UNKNOWN" for rows for which no containing polygon could be found in the world borders shapefile. You will actually find such entries if you change the last line to show the complete list.

The reason why we are storing the results of looking up the containing polygon for the observation points in a list is that we want to add a new column with these results to our data frame. While this isn't absolutely necessary, it means we will always have these intermediate results available and we can archive everything together as a single csv or shapefile on disk. However, it is important to note that our result list only contains entries for the active rows in acauleWithCoordinates but not the rows that we filtered out in Section 3.3.1 because they didn't contain lat/lon coordinates. These rows still exist, they are just not visible in acauleWithCoordinates, but this means that we would run into issues when we add geocodingResults as a new column to acauleWithCoordinates because of mismatching row numbers. Therefore, we are first going to clone the acauleWithCoordinates data frame which will give us a new data frame with only the active rows and columns that we then store in variable acauleWithGeocoding. We then add the geocoding results as a new column called 'country_geocoding' to this new data frame.
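A sketch of this step:

    acauleWithGeocoding = acauleWithCoordinates.copy()                 # clone with only the active rows
    acauleWithGeocoding['country_geocoding'] = geocodingResults        # add geocoding results as a new column
    acauleWithGeocoding.head()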

The new column appears as the very last column of the data frame, so you will have to scroll the table output from the previous cell all the way to the right to see it. Now, it's finally time to check for entries for which the geocoding result and the name of the country we find in the 'country' column do not match. We again use boolean indexing to achieve this. The boolean vector is constructed by the expression

acauleWithGeocoding.country_geocoding != acauleWithGeocoding.country

which compares corresponding cells of the 'country_geocoding' and 'country' columns. The resulting boolean vector will contain True for rows where the two names do not match (because we are using not equal (!=) as the comparison operator) and False otherwise. So we can directly use it to get only the mismatches by using the resulting vector to filter acauleWithGeocoding:
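For example:

    mismatches = acauleWithGeocoding[acauleWithGeocoding.country_geocoding != acauleWithGeocoding.country]
    mismatches[['country', 'country_geocoding']]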

From printing out the 'country' and 'country_geocoding' columns of the mismatches, we can see that most of them are UNKNOWN cases where no containing country was found. In addition, we have cases where the two involved countries are neighboring countries, so a possible explanation could be that these observations were made very close to the common border between the countries, so that small deviations or imprecisions in the world country polygons could have led to the differing results.

Let's quickly use the approach we used before and draw the mismatches on a map with matplotlib:

We can see that the suspicious cases we already observed before are contained in these mismatches. We get UNKNOWN results because most of them fall into water. One other thing we can see is that the observation with geocoding result 'Brazil' is actually quite far away from the Bolivian border.

In principle, we could now further inspect these mismatching cases and try to correct them, e.g. by figuring out whether some of them are due to swapping lat/lon coordinates or similar reasons. But instead, we are going to go with the easier approach of simply removing all of them, except for those Peru-Bolivia pairs where we assumed these were collected close to the common border. That means we want to keep all rows for which 'country_geocoding' and 'country' match (comparison operator ==) or (logical operator |) for which 'country_geocoding' equals 'Bolivia'. The resulting expression for the boolean indexing (Section 3.8.6 from the lesson materials) can be seen in the code below that also prints out the number of remaining rows and draws the result on a map.
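A sketch of the filtering part of this cell (the map drawing is analogous to the earlier cartopy plot and omitted here):

    keep = ((acauleWithGeocoding.country_geocoding == acauleWithGeocoding.country)
            | (acauleWithGeocoding.country_geocoding == 'Bolivia'))
    acauleClean = acauleWithGeocoding[keep]
    print(len(acauleClean))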

We will use the resulting cleaned data in acauleClean in Section 5 as the main input for the species distribution modeling. However, it should be emphasized that the data cleaning in this section can be significantly extended and improved. Nevertheless, we hope that what we showed here gives a good idea of the powerful functionality that in particular pandas provides for realizing the table access and manipulation operations needed for typical data cleaning.

4. Preparing other input data

We will also keep things rather simple with respect to the data used for the predictor variables of our species distribution model. Predictor variables in a species distribution model are variables that define the ecological niche of a species and typically involve climate related variables, land cover/use, elevation, etc. Usually they are provided as raster data, one raster per predictor variable. We will restrict ourselves to climate data acquired from the WorldClim web portal (https://www.worldclim.org/data/index.html) and the only preparation we will apply to this data is clipping the worldclim rasters to our study area of South America.

We will be using the WorldClim "Bioclimatic variables" rasters for this project. You already downloaded and extracted the data to your workspace folder. The data consists of 19 raster files with values averaged over the years 1970-2000 for bioclimatic variables often used as predictor variables in species distribution modeling. The meaning of the 19 different bioclimatic variables can be looked up at https://www.worldclim.org/data/bioclim.html . We will be using the 10m resolution raster files but higher resolution versions are also available from WorldClim. If you check out your workspace folder again, you will see that the raster files all start with wc2.0_bio_10m_... .

To clip the rasters in the different .tif files to South America, we will use GDAL again, more specifically the Translate(...) function for converting raster data sets which (among many other parameters) can be given a projWin parameter specifying the coordinates of a rectangular area to extract from the raster. The following code clips all the wc2 .tif files in the workspace folder and produces output files with '_clipped' appended to the file name before the .tif extension. The code starts with the list of files in the workspace folder created with os.listdir(workspace) and then uses the higher-order function filter(...) (Section 3.3 of lesson materials) with a boolean auxiliary function isRasterFile(...) to reduce the list to wc2 .tif files based on the regular expression based pattern defined at the beginning of the code (see Section 3.2 of lesson materials). The regular expression used with re.match(...) is a bit complicated because we are making sure the filename does not end with 'clipped' to avoid clipping already clipped rasters that may still be in the workspace folder.

We then use map(...) (Section 3.3. of lesson materials) to apply a second auxiliary function clipRaster(...) to each element in the filtered list. clipRaster(...) applies the Translate(...) function to the given raster file producing a new output raster and returns the name of the output raster so that we end up with a list of the names of clipped raster files in variable clippedRasters for later use. For testing, we print out this list at the end.
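A sketch of what this cell could look like; the regular expression, the pattern variable name, and the projWin extent values for South America are assumptions:

    from osgeo import gdal

    rasterFilePattern = r'wc2.*(?<!clipped)\.tif$'   # wc2 .tif files whose name does not end in 'clipped'
    southAmericaExtent = [-85, 15, -30, -60]         # assumed projWin: ulx, uly, lrx, lry in degrees

    def isRasterFile(filename):
        return re.match(rasterFilePattern, filename) is not None

    def clipRaster(filename):
        outputName = filename[:-4] + '_clipped.tif'
        gdal.Translate(os.path.join(workspace, outputName),
                       os.path.join(workspace, filename),
                       projWin=southAmericaExtent)
        return outputName

    clippedRasters = list(map(clipRaster, filter(isRasterFile, os.listdir(workspace))))
    print(clippedRasters)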

The output of the previous cell should list the names of the 19 clipped raster files and these files should now appear in the workspace folder in addition to the wc2 input files. Typically, preparing other input data for the species distribution model would involve more steps (e.g. reprojection, adapting resolution) and additional data sets, but again we are keeping things somewhat simple here. That means we are now ready to create the actual model with the help of the dismo R package.

5. Creating and running the model in R

We already briefly discussed the dismo package and creating a model and using it to predict a species distribution in Section 3.7 of the lesson materials. As also discussed there, many species distribution modeling approaches exist, so in principle we could now test out different ones and compare the results. However, since we just want to illustrate the process and we only have presence data at hand, no absence data (see the dismo documentation for details on other modeling approaches, in particular ones that also make use of absence data), we will use one of the simpler models, the bioclim model, in this notebook. This model requires the predictor variables as a raster stack and the presence data as an R data frame with just the lon and lat coordinates for each observation record.

5.1 Moving data to R

First we have to move our observation data and list of predictor raster files to R. We start by restricting our data frame to just the longitude and latitude columns. Then we use the %R magic command with just the -i option to create a similarly named variable acaulePoints in R. The output shows that this R variable indeed contains a data frame with the lon and lat values.
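For example (a sketch; passing R code after the -i option on the same line lets us directly display the transferred data frame):

    acaulePoints = acauleClean[['lon', 'lat']]
    %R -i acaulePoints head(acaulePoints)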

To be able to create the needed raster stack in R, we need to transfer the list of file names in clippedRasters to R as well:

Finally, we also transfer our workspace variable to R so that we know in which folder to look for the raster files and where to put the model output...

... and the same for our input variable containing the name of the prediction raster file we want to produce as the final result of this section:
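These three transfers could simply look as follows (predictionRasterFile is the assumed variable name from Section 2.3.2):

    %R -i clippedRasters
    %R -i workspace
    %R -i predictionRasterFile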

5.2 Creating prediction with bioclim

The code in this section is completely in R, so please don't forget the leading %R in all the commands.

Note: If the following command does not work, there is a problem with the rgdal R package and you won't be able to perform the steps in this subsection yourself. In this case, one option is that you just read through this part (e.g. in the HTML export of this notebook in Section 3.11 of the lesson materials), and then take the 'bioclim_prediction.tif' file from the .zip file you can download from the workaround note in Section 3.11 of the lesson materials, place it in your workspace folder for this walkthrough and then continue with Section 6 in this notebook.

We are first using the stack(...) function (provided by the raster package that dismo builds on) to create the RasterStack object that dismo uses for managing the predictor raster files:
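A sketch of this cell, using the variable name predictors that is also referred to further below:

    %R predictors <- stack(file.path(workspace, clippedRasters))
    %R predictors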

We can use the plot command to show the rasters in our raster stack in R:

Now comes the step in which we create the model with bioclim by calling bioclim(...) with our raster stack and observation points. The result is a Bioclim object that contains all the properties of the model:
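For example:

    %R bc <- bioclim(predictors, acaulePoints)
    %R bc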

The pairs(...) function can be used to plot the observation data for pairs of predictor variables (left bottom half of the matrix), together with correlation coefficients (top right half of the matrix):

We can also use the response(...) function to have a look at what the model predicts about the likelihood of Solanum Acaule occurring for different values of each individual predictor variable (assuming all other variables are at their median value):

Now we finally use the model we created to make a prediction of the species distribution over the project area of South America. In principle, we could use a different input raster stack, for instance to produce a prediction for a different time or a hypothetical one assuming bioclimatic variables would change in a certain way. However, we will simply use our rasters from the predictors variable again, meaning we will use the observations to make a prediction for where else one may be able to find Solanum Acaule. For this, we use the dismo predict(...) function with predictors and our bioclim model in variable bc:
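A sketch of the prediction step; the variable name prediction for the resulting raster is an assumption:

    %R prediction <- predict(predictors, bc)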

The result is a new RasterLayer object that contains the prediction in the form of likelihood values for the cells. We can use plot(...) to visualize the output raster. In the following, we show two plots, the raw likelihood values and a version showing all cells with a raster value > 0.1 in green.
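The two plots could, for instance, be produced like this (the color choices are assumptions):

    %R plot(prediction)
    %R plot(prediction > 0.1, col=c('lightgrey', 'green'))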

To be able to use the prediction raster in a map visualization, we now write it to a new output raster file (GeoTIFF) in our workspace folder:
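For example, using writeRaster(...) from the raster package:

    %R writeRaster(prediction, file.path(workspace, predictionRasterFile), overwrite=TRUE)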

With this output file of the bioclim prediction created, the R part of this project is completed and we move back to Python for creating a final map visualization of the outcome.

6. Postprocessing and visualization of the results with ArcGIS Online

We now have created our bioclim species distribution model and created a raster predicting the likelihood of Solanum Acaule over the entire area of South America. The last things we want to do are creating a shapefile version of our observation data frame and prediction raster and then using these to generate a simple interactive map representation in our notebook with the help of ESRI's ArcGIS API for Python. The map will have two layers, one with the observation point features and one with a polygonized version of our prediction raster. The reason why we are using a polygonized version of the raster is that there is no simple support for including a raster data set in an ArcGIS Online based map. There is the option to include tile maps and some other image based layer approaches but all of these would require more processing and hosting work than we want to cover in this walkthrough. The map visualization will require that the shapefiles for the two layers are uploaded to ArcGIS Online as .zip files. So we will first produce shapefiles and zipped versions for these two layers.

6.1 Create zipped point shapefile for acaule observation data using GeoPandas

Pandas is great for manipulating general table data but it does not provide support for handling geometry information and reading from or writing to shapefiles. For that we will need the geopandas package which has been built on top of pandas and, in addition, uses the Shapely package for creating and maintaining point, polyline, polygon, and other geometric objects. This will allow us to replace the 'lat' and 'lon' columns in acauleClean with a geometry column containing actual Point objects. We start by importing geopandas and the Shapely Point class, and setting up a dictionary representation of an EPSG:4326 coordinate reference system to be used with geopandas.
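A sketch of this cell (the dictionary-style CRS specification is the one used by older geopandas versions, which fits the Python 3.6 environment assumed here):

    import geopandas
    from shapely.geometry import Point

    crs4326 = {'init': 'epsg:4326'}   # WGS84 lat/lon coordinate reference system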

Now, we will create a list of Point objects based on the lon and lat coordinates in each row. For that we first apply the zip(...) function that takes two lists and turns them into a list of pairs and then use list comprehension to give each pair to the Shapely Point(...) function to get a list of corresponding Point objects.
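For example (the two print statements correspond to the output described in the next paragraph):

    geometryList = [Point(pair) for pair in zip(acauleClean.lon, acauleClean.lat)]
    print(geometryList[:6])
    print(*geometryList[:6])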

The output above should consist of a list of six Shapely Point objects, the first six elements in our geometry list, followed by a more readable version showing the actual coordinates produced by the second print using the star to unpack the list first. We can now create a geopandas GeoDataFrame object (stored in variable acauleGeo) from our pandas data frame by first dropping the lat and lon columns and then invoking GeoDataFrame(...) with the CRS specification we prepared in variable crs4326 and the list of Point objects in variable geometryList:
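A sketch of this step:

    acauleGeo = geopandas.GeoDataFrame(acauleClean.drop(['lat', 'lon'], axis=1),
                                       crs=crs4326, geometry=geometryList)
    acauleGeo.head()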

If you scroll all the way to the right in the table output produced above, you will see the new 'geometry' column with the Point objects. Just as a sidenote, here is how we can plot the content of our Point based geopandas data frame with matplotlib:

We can now use the geopandas to_file(...) method to write the data frame to a new shapefile. The name of the output shapefile has been defined at the beginning of the notebook in variable acauleObservationShapefile.
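For example:

    acauleGeo.to_file(os.path.join(workspace, acauleObservationShapefile))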

The new shapefile should now be in your workspace folder. Feel free to check it out by opening it in a GIS software of your choice.

As mentioned, we also need a zip file version of the shapefile to upload to ArcGIS Online. Let's use Python code to produce this as well with the help of the zipfile package. Since we will need to do this for two different shapefiles, we define a function zipShapefile(name) that takes a filename without extension and then creates a .zip file from those files that have the same name but different three letter extensions. We will use a regular expression again to get all these files, while at the same time making sure not to include a .zip file with that name that may already be in that folder.
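A sketch of how zipShapefile(...) could be implemented and called; the regular expression and the assumption that acauleObservationShapefile ends with '.shp' are ours:

    def zipShapefile(name):
        # zip all files in the workspace called <name>.<three letter extension>, except a <name>.zip
        pattern = re.escape(name) + r'\.(?!zip)[a-z]{3}$'
        with zipfile.ZipFile(os.path.join(workspace, name + '.zip'), 'w') as zf:
            for f in os.listdir(workspace):
                if re.match(pattern, f):
                    zf.write(os.path.join(workspace, f), f)

    zipShapefile(acauleObservationShapefile[:-4])   # file name without the .shp extension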

If you check your workspace folder, you should now see the newly created .zip file there and it should contain the different files making up the shapefile.

6.2 Polygonize prediction raster to visualize it with AGOL

As we mentioned, the simplest way to get our prediction result included in our final map visualization is by creating a vector version of it that polygonizes the raster based on the cell values. For this, we will first apply a reclassification to the raster that classifies the raster values into six classes:

The resulting polygon vector data set will then consist of polygons formed by adjacent raster cells with the same reclassified value. To perform the reclassification and polygonization we will use GDAL again. For the reclassification we will use the raster algebra operators provided by GDAL and for the polygonization we will apply the Polygonize(...) function (Section 3.9.2 of the lesson materials). We start by opening the prediction raster file with GDAL:
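For example:

    predictionRaster = gdal.Open(os.path.join(workspace, predictionRasterFile))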

Next, we prepare the output shapefile by creating the shapefile itself, followed by creating a layer in the shapefile, and adding a new attribute 'DN' to that layer for storing the raster value for each of the polygon features.
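A sketch of this preparation step (the layer name and the explicit EPSG:4326 spatial reference are assumptions):

    from osgeo import ogr, osr

    shpDriver = ogr.GetDriverByName('ESRI Shapefile')
    outDataSource = shpDriver.CreateDataSource(os.path.join(workspace, bioclimPredictionShapefile))

    srs = osr.SpatialReference()
    srs.ImportFromEPSG(4326)                       # the prediction raster is in WGS84 lat/lon
    outLayer = outDataSource.CreateLayer('bioclimPrediction', srs=srs, geom_type=ogr.wkbPolygon)

    outLayer.CreateField(ogr.FieldDefn('DN', ogr.OFTInteger))   # attribute holding the class value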

Now we perform the raster reclassification for the raster band in variable band. Since we do not want to modify the original raster file, we first create an in memory copy of the file. Then we get the first band and turn it into an array of values in variable data to which we can then apply raster algebra operations. Once the array has been reclassified, we write the data back to the raster band in band and call the Polygonize(...) function to perform the polygonization of band and write the resulting polygon features to outLayer, the only layer in our newly created shapefile.
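A sketch of the reclassification and polygonization; the class breaks used in the comparison expression are assumed values, not necessarily the ones used in the original notebook:

    memRaster = gdal.GetDriverByName('MEM').CreateCopy('', predictionRaster)   # in-memory copy
    band = memRaster.GetRasterBand(1)
    data = band.ReadAsArray()

    # reclassify likelihood values into six classes 0-5 (assumed breaks 0.1, 0.2, 0.3, 0.4, 0.5)
    data = ((data > 0.1).astype(int) + (data > 0.2) + (data > 0.3)
            + (data > 0.4) + (data > 0.5))

    band.WriteArray(data)
    gdal.Polygonize(band, None, outLayer, 0)   # field index 0 refers to the 'DN' attribute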

Right now, the polygons created from the raster will cover the entire rectangular area containing South America with most of this area having extremely low likelihood (class 0 in our reclassification). For the map visualization we want to produce, it's better to remove these polygons and just work with those from classes 1 and above. As a last step before closing the shapefile, we therefore loop through the polygon features in outLayer and remove those that have a 'DN' attribute value of 0:
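For example:

    fidsToDelete = [feature.GetFID() for feature in outLayer if feature.GetField('DN') == 0]
    for fid in fidsToDelete:
        outLayer.DeleteFeature(fid)

    outDataSource = None   # close the shapefile so that everything gets written to disk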

This would be a good moment to check out the shapefile we produced in a GIS. When coloring the polygons based on the 'DN' field, the result should look somewhat similar to this:

The last thing we need to do is produce a zipped version of the shapefile by calling our previously defined zipShapefile function:
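For example (again assuming the shapefile name variable ends with '.shp'):

    zipShapefile(bioclimPredictionShapefile[:-4])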

6.3 Upload and publish to AGOL

We now have our two shapefiles ready in zipped form, so we can use the ArcGIS API for Python to establish a connection to your ArcGIS Online account and publish them as feature services that we can use in our map visualization. We start by creating a GIS object connected to your ArcGIS Online account using the approach described in Section 3.10 of the lesson materials of authenticating with our PSU login. Remember that you will get a code that you will have to paste in at the prompt. If everything works correctly, the following code should produce the output "GIS@https://www.arcgis.com".
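A sketch of this connection step; the organization URL and client_id below are placeholders that need to be replaced with the values given in the lesson materials:

    from arcgis.gis import GIS

    gis = GIS('https://pennstate.maps.arcgis.com', client_id='<client id from the lesson materials>')
    print(gis)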

The following piece of code is for making sure that we don't have any acaule observation and prediction shapefiles or feature services already on AGOL from previous runs of the notebook. This would lead to errors in the following code cells when trying to upload and publish under the same names, so we better start from scratch. The code we use is built around the ArcGIS search(...) function for finding content items and two auxiliary functions: The first one, searchAGOL(...), simply builds a query from given title, owner, and itemType parameters and then returns a list of content items found by calling gis.content.search(...). The second auxiliary function, deleteIfExistsOnAGOL(...), uses this function and then deletes all items found on AGOL by invoking the delete(...) method of the items. We apply this function to shapefiles and feature services of both acaule observation and prediction.
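A sketch of how these two functions and their application could look; the query string, the print messages, and the agolUsername variable are assumptions:

    def searchAGOL(title, owner, itemType):
        query = 'title:{0} AND owner:{1}'.format(title, owner)
        return gis.content.search(query, item_type=itemType)

    def deleteIfExistsOnAGOL(title, owner, itemType):
        items = searchAGOL(title, owner, itemType)
        print(items)
        for item in items:
            item.delete()
            print('Item {0} has been deleted'.format(item.title))

    for itemType in ['Shapefile', 'Feature Service']:
        deleteIfExistsOnAGOL(acauleObservationShapefile[:-4], agolUsername, itemType)
        deleteIfExistsOnAGOL(bioclimPredictionShapefile[:-4], agolUsername, itemType)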

The output of the previous cell will tell you which shapefiles and feature services were deleted. If this is the first time you run this part of the notebook code, no items should be found on AGOL, so you will see empty lists [] in the output and no "Item ... has been deleted" messages. Just to make absolutely sure, let's use searchAGOL(...) to check again. The output at this point should be four empty lists [] or else something went wrong and you may have to manually delete the items using the AGOL web site.

Now, we can upload our two zipped shapefiles to AGOL using the gis.content.add(...) method. Since we only want to use these layers privately for this notebook and AGOL is smart enough to figure out that we are adding a zipped shapefile, we do not need to provide any property information and, hence, the first parameter of the calls is an empty dictionary {}:
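For example (the item variable names are assumptions used again further below):

    observationShapefileItem = gis.content.add({}, os.path.join(workspace, acauleObservationShapefile[:-4] + '.zip'))
    predictionShapefileItem = gis.content.add({}, os.path.join(workspace, bioclimPredictionShapefile[:-4] + '.zip'))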

Just to be extra careful, let's check the gis content again for shapefiles. This time it should show us the two items for the uploaded shapefiles as

[<Item title:"acauleObservations*" type:Shapefile owner:<your AGOL ID>>] [<Item title:"bioclimPrediction*" type:Shapefile owner:<your AGOL ID>>]

To use the shapefiles in an AGOL map, they need to be published as feature services first. This is done by calling the publish() method for each of them:
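For example:

    observationFeatureService = observationShapefileItem.publish()
    predictionFeatureService = predictionShapefileItem.publish()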

Again we make sure that this worked out correctly, this time by checking for the newly published feature services, which should show up in the output as

[<Item title:"acauleObservations*" type:Feature Layer Collection owner::<your AGOL ID>>] [<Item title:"bioclimPrediction*" type:Feature Layer Collection owner:<your AGOL ID>>]

Since the ArcGIS Python API includes widgets for content items, we can also show these in a nice way in this notebook including links and additional information:

6.4 Final interactive map visualization

With the two shapefiles published as feature services, the last step that remains is to create a map widget that embeds an interactive AGOL map with our two layers into this notebook. This is straightforward: We first create the ArcGIS map object with gis.map() using streets as the basemap option and then add the map as a widget to the notebook:
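For example (the widget variable name arcgisMap is an assumption):

    arcgisMap = gis.map()
    arcgisMap.basemap = 'streets'
    arcgisMap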

After executing the cell above, you should now see a world map that you can pan and zoom with your mouse. Since this interactive map will not be visible in an HTML export or other static export of this notebook, we are including an image showing the map below:

The static image already shows the map with our observation and prediction layers but we actually have not added these yet. We add the prediction layer with the add_layer(...) method using the ClassedColorRenderer to color the polygons based on the values in the DN field and with 0.7 opacity. The layer will appear in the map widget above. (If for some reason any of the layers in this and the next step don't show up immediately, wait a moment and run the cell again. Also the colors may be different in your case.)
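This step could look as follows, using the renderer options dictionary of add_layer(...):

    arcgisMap.add_layer(predictionFeatureService,
                        {'renderer': 'ClassedColorRenderer', 'field_name': 'DN', 'opacity': 0.7})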

As a last step, we add the acaule observation point layer on top of the prediction:
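For example:

    arcgisMap.add_layer(observationFeatureService)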

We now have a map visualization of our little species modeling project that shows the observation data we started with and the resulting distribution prediction and that can be used to inspect the data and results. Clicking on any of the observation points in the map will open a popup window showing all the attributes of that observation point. Clicking any polygon of the prediction layer allows us to check out the DN class this polygon belongs to from the reclassification we applied in Section 6.2.