GEOG 489
Advanced Python Programming for GIS

3.8.1 Creating a new data frame

PrintPrint

In our examples, we will be using pandas in combination with the numpy package, the package that provides many fundamental scientific computing functionalities for Python and that many other packages are built on. So we start by importing both packages:

import pandas as pd
import numpy as np

A data frame consists of cells that are organized into rows and columns. The rows and columns have names that serve as indices to access the data in the cells. Let us start by creating a data frame with some random numbers that simulate a time series of different measurements (columns) taken on consecutive days (rows) from January 1, 2017 to January 6, 2017. The first step is to create a pandas series of dates that will serve as the row names for our data frame. For this, we use the pandas function date_range(…):

dates = pd.date_range('20170101' , periods=6, freq='D')
dates

The first parameter given to date_range is the starting date. The ‘periods’ parameter tells the function how many dates we want to generate, while we use ‘freq’ to tell it that we want a date for every consecutive day. If you look at the output from the second line we included, you will see that the object returned by the function is a DatetimeIndex object which is a special class defined in pandas.

Next, we generate random numbers that should make up the content of our date frame with the help of the numpy function randn(…) for creating a set of random numbers that follow a standard normal distribution:

numbers = np.random.randn(6,4)
numbers

The output is a two-dimensional numpy array of random numbers normally distributed around 0 with 4 columns and 6 rows. We create a pandas data frame object from it with the following code:

df = pd.DataFrame(numbers, index=dates, columns=['m1', 'm2', 'm3', 'm4'])
df

Note that we provide our array of random numbers as the first parameter, followed by the DatetimeIndex object we created earlier for the row index. For the columns, we simply provide a list of the names with ‘m1’ standing for measurement 1, ‘m2’ standing for measurement 2, and so on. Please also note how the resulting data frame is displayed as a nicely formatted table in your Jupyter Notebook because it makes use of IPython widgets. Please keep in mind that because we are using random numbers for the content of the cells, the output produced by commands used in the following examples will look different in your notebook because the numbers are different.