GEOG 489
Advanced Python Programming for GIS

3.5 Python for Data Science


Python has firmly established itself as one of the main programming languages used in Data Science. Many freely available Python packages exist for working with all kinds of data and performing different kinds of analysis, from general statistics to very domain-specific procedures. The same holds true for the spatial data that we deal with in typical GIS projects: there are various packages for importing and exporting data in different GIS formats and for manipulating, analyzing, and visualizing the data with Python code. You will get to know quite a few of these packages in this lesson. We provide a short overview of the packages we consider most important below.

In Data Science, one common principle is that projects should be cleanly and exhaustively documented, including all data used, how the data has been processed and analyzed, and the results of the analyses. The underlying point of view is that science should be easily reproducible, both to ensure high quality and to benefit future research as well as application in practice. One way to achieve full transparency and reproducibility is to combine descriptive text, code, and analysis results into a single report that can be published, shared, and used by anyone to rerun the steps of the analysis.

In the Python world, such executable reports are very commonly created in the form of Jupyter Notebooks. Jupyter Notebook is an open-source, web-based software tool that allows you to create documents that combine runnable Python code (or code in other languages), the output of that code, and formatted text, images, etc., as in a normal text document. Figure 3.1 shows a small part of a Jupyter Notebook, the one we are going to create in this lesson’s walkthrough.

A screen capture to show a bit of a Jupyter Notebook
Figure 3.1: Part of a Jupyter Notebook 
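
As a rough illustration of how a notebook combines text and code in one document, the following sketch builds a minimal notebook using only Python's standard library. It assumes the version 4 notebook file layout, in which a notebook file (.ipynb) is a JSON document containing a list of cells. The helper names markdown_cell, code_cell, and build_notebook are hypothetical and made up for this example; they are not part of any Jupyter API.

```python
import json


# Hypothetical helpers for illustration only; the dictionary layout is
# assumed to follow the version 4 notebook JSON format.
def markdown_cell(text):
    """Return a cell holding formatted (Markdown) text."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(code):
    """Return a code cell that has not been run yet (no outputs)."""
    return {
        "cell_type": "code",
        "execution_count": None,  # filled in once the cell has been run
        "metadata": {},
        "outputs": [],            # results are stored here after running
        "source": code,
    }


def build_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure."""
    return {
        "cells": cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = build_notebook([
    markdown_cell("# Example analysis\nDescriptive text goes in Markdown cells."),
    code_cell("values = [3, 5, 8]\nprint(sum(values) / len(values))"),
])

# Serialize the notebook to the JSON text that would be stored in a file.
notebook_json = json.dumps(notebook, indent=1)
```

Writing notebook_json to a file with the .ipynb extension should give a file that Jupyter can open. In practice, notebooks are normally created and edited in the Jupyter web interface rather than by hand, so this sketch is only meant to show that text, code, and (once run) output live side by side in a single document.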

While Jupyter Notebook has been developed within the Python ecosystem, it can be used with other programming languages as well, for instance R, which you may have heard of as one of the main languages used for statistical computing. One of the things you will see in this lesson is how to combine Python and R code within a Jupyter Notebook to realize a somewhat complex spatial data science project in the area of species distribution modeling, also termed ecological niche modeling.

You may be interested to know that Esri also supports Jupyter Notebook as a platform for conducting GIS projects with the help of their ArcGIS API for Python library, and that Jupyter Notebook has been integrated into several Esri products, including ArcGIS Pro.

After a quick look at the Python packages most commonly used in data science projects, we will provide a more detailed overview of what is coming in the remainder of the lesson, so that you can follow along easily without getting confused by all the different software packages we are going to use.