GEOG 489
Advanced Python Programming for GIS

4.2 Collections and Sorting

PrintPrint

In programming, you are often dealing with collections of items of the same data type, e.g. collections of integer numbers, collections of Point objects, etc. You are already familiar with the built-in collection types list, tuple, and dictionary, but there exist more data structures for storing collections of items. Which data structure is best suited for a specific task depends on what operations exactly you need to perform with the data structure and items in it. For instance, a dictionary is the right choice if you mainly need to access the stored items based on their key. In contrast, a list is a good choice if you have a static collection of items that you need to iterate over or that you want to access based on their index. In general, there often exist several collection data types that you can use for a given task but some of them will be more efficient and better choices than the others.

Here is a little introductory example:

Let’s say that, in your Python program, you have different assignments or jobs coming in that need to be performed in the order in which they arrive. Since performing an assignment can take some time, you need to store the assignments in some sort of waiting queue: the next assignment to be performed is always taken from the front of the queue, while the new assignments arriving are added at the end of the queue. This approach is often referred to as the first-in-first-out (FIFO) approach.

To implement the waiting queue in this program, we could use a normal list. New assignments are added to the end of the list with list method append(), while items can be removed at the beginning of the list by calling the list method pop(...) with parameter 0. The following code simulates the arriving of new assignments and removing of the next assignment to be performed, starting with a queue with three assignments in it. For simplicity we alternate between an assignment being removed and a new assignment arriving, meaning the queue will always contain two or three assignments. We also simply use strings with an increasing number at the end for the assignments, while in a real application these would be more complex objects with attributes describing the assignment.

waitingQueue = ["Assignment 1", "Assignment 2", "Assignment 3"]

for count in range(3,100001): 
	waitingQueue.pop(0)                             # remove assignment at the beginning of the list/queue
	waitingQueue.append("Assignment " + str(count)) # add new assignment at the end of the list/queue

Run this program with basic code profiling as explained in Section 1.7.2.1. One thing you should see in the profiling results is that while the Python list implementation is able to perform append operations (adding at the end of the list) rather efficiently, it is not particularly well suited for removing (and also adding) elements at the beginning. There exist data structures that are much better suited for this task as the following version shows. It uses the collection class deque (standing for “double-ended queue”) that is defined in the collections module of the Python standard library, which contains several specialized collection data structures. Deque is optimized for adding and removing elements at the start and end of the collection.

import collections 

waitingQueue = collections.deque(["Assignment 1", "Assignment 2", "Assignment 3"]) 

for count in range(3,100001): 
	waitingQueue.popleft()                          # remove assignment at the beginning of the deque
	waitingQueue.append("Assignment " + str(count)) # add new assignment at the end of the deque

Please note that the deque method for removing the first element from the queue is called popleft(), while pop() removes the last element. The method append() adds an element at the end, while appendleft() adds an element at the start (we don’t need pop() and appendleft() in this example). The initial deque is created by giving the list of three assignments as a parameter to collections.deque(...).

If you profile this second version and compare the results with those of the first version using a list, you should see that deque is by far the better choice for implementing this kind of waiting queue. More precisely, adding elements at the end takes about the same time as for lists but removing elements at the front is approximately three times as fast (and as fast as adding at the end).

While we cannot go into the implementation details of lists and deque here (you may want to check out a book on algorithms and data structures in Python to learn how to implement such collections yourself), hopefully this example makes it clear that it’s a good idea to have some understanding of what collection data structure are available and which operations are fast with them and which are slow.

In the following, we are going to take a quick look at sets and priority queues (or heaps) as two examples of other specialized Python collections, and we talk about the common operation of sorting collections.