GEOG 865
Cloud and Server GIS

Moving data to the cloud


One of the most challenging aspects of moving to a cloud deployment is transferring data from your local (on-premises) environment onto the cloud. In this section of the lesson, we'll look at special problems that arise in data transfer scenarios. We'll also discuss ways data can be moved to Amazon EC2, and you'll copy some GIS data to your own instance in preparation for publishing a web service.

Challenges of data transfer

For your data to go from your machine to commercial cloud services such as Amazon EC2 or Amazon S3, it must go "across the wire", meaning it is transferred through the Internet onto the cloud-based server. This can pose the following issues:

  • Your datasets may be so large that they are not feasible to transfer across the Internet in a reasonable amount of time.
  • A slow Internet connection or low bandwidth may make it impossible to transfer your data in a reasonable amount of time.
  • Your data may be sensitive enough that transferring it across the Internet would require extra security measures or is not an option altogether.

Let's examine these problems one at a time.

Large datasets

GIS data collections can be very large: up to terabytes in size. This is often the case when imagery is involved, but even vector datasets with a broad amount of coverage or detail can prove unwieldy for an Internet transfer.

When moving large datasets to the cloud, you have to plan for enough time to move the dataset and, if possible, increase your bandwidth. After doing a test transfer of a few hours or days, you should be able to get an idea of the rate of data transfer, and you can thereby extrapolate how long it would take to transfer the entire dataset.
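
The extrapolation described above is simple arithmetic. The following is a minimal sketch in Python; the function name and the sample numbers are illustrative assumptions, and it assumes the transfer rate observed during the test stays roughly constant for the whole dataset.

```python
def estimate_transfer_days(total_gb, sample_gb, sample_hours):
    """Extrapolate the total transfer time, in days, from a timed test transfer."""
    if sample_gb <= 0 or sample_hours <= 0:
        raise ValueError("sample size and duration must be positive")
    # Rate observed during the test transfer, in gigabytes per hour.
    rate_gb_per_hour = sample_gb / sample_hours
    # Hours needed for the full dataset at that rate, converted to days.
    return total_gb / rate_gb_per_hour / 24.0


# Illustrative numbers only: a 2,000 GB collection, with a test
# transfer that moved 10 GB in 4 hours.
days = estimate_transfer_days(total_gb=2000, sample_gb=10, sample_hours=4)
print(f"Estimated transfer time: {days:.1f} days")
```

If the estimate comes out in weeks or months, that is a signal to look at increasing bandwidth or at the shipping option discussed next.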

If this amount of time is unreasonable (say, months) you may consider shipping the data directly to the cloud provider on a piece of hard media. The cloud provider can then load the data directly onto the cloud much faster than you could send it over the Internet. Amazon provides such a service called AWS Snowball. You load up your data on a ruggedized secure device called a "Snowball" and ship it to Amazon. In the old days of computing this technique was called "sneakernet", since you could sometimes put your data on a floppy disk and walk it across the office to another computer faster than you could send it electronically.

Internet connection limitations

Cloud-based data centers like Amazon's are built to handle high levels of data traffic coming in and out. However, your connection going out to the cloud may be limited by a slow connection or lack of available bandwidth. Some IT departments and internet service providers (ISPs) throttle or cap the amount of data that can be transferred from any one machine or node in the network. These types of policies are sometimes put in place to prevent the use of file-sharing services such as BitTorrent that violate company policy or simply monopolize the organization's available bandwidth. However, these policies can sometimes negatively affect legitimate business needs such as transferring data to the cloud. If you find yourself in a situation with low bandwidth, it might be helpful to visit with your IT department to understand whether your machines are being throttled and could be granted an exception. If an exception is not possible due to other bandwidth needs within the company, you might explore whether your data transfer could occur during off-hours such as nights or weekends.

Sensitive data

Confidential or proprietary datasets, such as health records, may require extra security measures for transfer to the cloud. When dealing with sensitive data, the first question to answer is whether it is legal or feasible for the data to be hosted in the cloud in the first place. For example, some government organizations responsible for national security may possess classified or secret data that could never be uploaded to Amazon's data centers no matter the measures taken to ensure secure data transfer. Also, some organizations may not have the desire or permission to host datasets on servers that are physically located in a different country.

Other types of datasets may be okay to host on the cloud but must be encrypted during transfer, to prevent a malicious party from using any data that may be stolen en route to the cloud server. Secure Sockets Layer (SSL) connections (HTTPS) and secure FTP are two techniques for encrypting data for Internet transfer.

Techniques for data transfer

Sometimes the ability for one computer to directly "see" or communicate with another computer is hindered by firewalls or network architectures. For example, your computer at work is probably allowed to only access the file systems of other computers on your internal network. You could potentially open up a folder on your Amazon EC2 instance for access by anyone but this opens a security risk that malicious parties could find the folder and copy items into it.

People use a number of strategies to get around these limitations when transferring data into Amazon EC2 and other cloud environments. These include:

  • Copy and paste through Windows Remote Desktop. This is the technique we'll use in this course because it's convenient. However, it may not be appropriate for highly sensitive data.
  • Use of a "digital locker" type of site like Dropbox.com, where you are allowed to upload a certain amount of data onto the site (for example, 2 GB). You can then log into your instance and download the data onto whatever drive you choose. You could even use your allotted Penn State PASS storage for this technique. Upload the data to your PASS space using your local computer, then log in to your instance and download the data from your PASS space.
  • A secure FTP (file transfer protocol) connection configured by your IT department. FTP is an Internet protocol designed for transfer of files, but if the data is sensitive, you should encrypt it before you send it this way.

The ArcGIS Server on Amazon EC2 help has an overview of data transfer techniques. Please take some time right now to read Strategies for data transfer to Amazon Web Services.

Copying the Appalachian Trail data to your EC2 instance

In this part of the lesson, you'll copy some data to your EC2 instance in preparation for publishing a web service. Before you attempt these steps, you should be logged in to your EC2 instance through Windows Remote Desktop Connection. If you followed the steps earlier in the lesson for connecting via Remote Desktop then your local disk drives should be available to the instance.

  1. Download and unzip the Appalachian Trail dataset to a location on your local computer (not your EC2 instance).

    This is National Park Service data obtained from the Pennsylvania Spatial Data Access (PASDA) website. In this exercise we'll pretend this is a dataset that you've been using for years at work that you now want to transfer to the cloud.
  2. Open Remote Desktop Connection to your EC2 instance and then open Windows Explorer.

    You should see something like the following, where you have a set of drives listed for your instance and a set of drives listed for your local computer. The drives on the local computer will be followed by the computer name. For example, in Figure 2.2 below, the local computer is named ASROCK, and the A, C, and D drives are available from it. There are also two drives available to the EC2 instance, which are C and D.
     
    Screen capture to show Drives available to EC2 instance
    Figure 2.2: Available drives
  3. Browse to the folder on your local computer where you downloaded the Appalachian Trail data, right-click the folder, and click Copy.
  4. Browse to D:\data on your instance, right-click, and click Paste. This should put your data at D:\data\AppalachianTrail.
  5. Open and explore the AppalachianTrail folder. It contains a map document displaying the Appalachian Trail and shelters along the trail. The trail and shelter datasets are feature classes in an Esri file geodatabase. You will publish this map as a web service in the next part of the lesson.
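
The copy in steps 3 and 4 can also be scripted. Below is a minimal sketch using Python's standard library; the function name is an assumption for illustration, and the example paths in the comment are hypothetical (the location of a redirected local drive depends on your Remote Desktop setup).

```python
import shutil
from pathlib import Path


def copy_dataset(source_folder, destination_root):
    """Copy a dataset folder into destination_root, keeping its folder name."""
    source = Path(source_folder)
    target = Path(destination_root) / source.name
    # copytree copies the whole folder tree, including the file geodatabase
    # (which is itself a folder). It raises an error if target already exists.
    shutil.copytree(source, target)
    return target


# Hypothetical usage on the instance, with made-up source path:
# copy_dataset(r"\\tsclient\C\Downloads\AppalachianTrail", r"D:\data")
```

Copying the folder as a whole keeps the map document and the geodatabase together, which matters for the path handling discussed in the next section.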

Maintaining correct paths

A map document (MXD) stores the paths to all of the datasets referenced in the Table of Contents. If you move the map document and its data to another computer, you may see broken data sources when you open the map document on the destination computer. For example, the map document may be expecting to get the trails from C:\data\AppalachianTrail\Trail.gdb and now that you have copied everything to another computer, the data is located at D:\data\AppalachianTrail\Trail.gdb. In this situation, you need to open the map document in ArcMap and repair the data source on each layer to point at the new location.

You can avoid this problem if you are able to use relative paths in your map document. A relative path does not store the full path to the data. Instead, it stores information about where the data lies relative to the map document. For example, instead of saying "the data is at C:\data\AppalachianTrail\Trails.gdb", the relative path says, "the data is in the same folder as this map document in a geodatabase called Trails.gdb". The benefit of this is that you can move the folder containing the map and data to any machine, and as long as you keep the map and data in the same folders relative to each other, it does not matter if the full data paths change.
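
To make the idea concrete, here is a small sketch using Python's pathlib. It only illustrates the concept and is not how ArcMap stores paths internally; the function names are assumptions, and the paths are the ones from the example above.

```python
from pathlib import PureWindowsPath


def to_relative(data_path, map_folder):
    """Express data_path relative to the folder that holds the map document."""
    # Raises ValueError if the data does not sit inside map_folder.
    return PureWindowsPath(data_path).relative_to(PureWindowsPath(map_folder))


def resolve(relative_path, map_folder):
    """Rebuild the full data path from wherever the map document lives now."""
    return PureWindowsPath(map_folder) / relative_path


# On the original computer, map and data sit under C:\data\AppalachianTrail.
rel = to_relative(r"C:\data\AppalachianTrail\Trails.gdb",
                  r"C:\data\AppalachianTrail")
print(rel)

# After the folder is moved to D:\data\AppalachianTrail, the same relative
# path still resolves to the data.
print(resolve(rel, r"D:\data\AppalachianTrail"))
```

The stored value never mentions a drive letter or parent folder, which is why moving the whole folder does not break it.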

In this exercise, relative paths have been used in Trail.mxd so that you did not have to fix any paths. For future reference, if you want to configure an MXD to use relative paths, in ArcMap click File > Map Document Properties and check the box Store relative pathnames to data sources.

Registering your data with ArcGIS Server

For simplicity in this course, you'll follow the workflow of transferring all data to your EC2 instance, working with ArcGIS for Desktop on your EC2 instance, and publishing to ArcGIS Server on your EC2 instance. Theoretically, you could do most of the desktop work on your own computer and then publish up to the server when you were ready. However, any time you introduce separate computers into the architecture, especially on different networks (in the case of your home computer and your EC2 instance), things can get more complicated. Because you have a limited time available to learn about ArcGIS Server, I want you to spend the time experimenting with the capabilities of the server, not worrying about network issues or which machine contains the data.

However, in large organizations, these challenges of distributed architectures are inevitable. Some GIS shops might have a GIS server administrator who controls access to ArcGIS Server, and a number of cartographers and desktop GIS users who just prepare the maps for publishing. This latter group of "publishers" work on machines that are separate from the server and may even reside on a different subnet than the server. In some cases, the publisher machines and the server machines use different copies of the data that are kept in sync by an automated process, and the paths to the data used by the publishers may be different than the paths used by the server.

To help manage these scenarios, ArcGIS Server 10.1 introduced the ability to "register" a data location, meaning that you provide ArcGIS Server with a list of data locations you typically use. If the publishers use a different path to the data than the server uses, you can provide both the paths. Then, when you publish a service, the map is copied to the server and all the paths in the map are switched to use the server's path instead of the publisher's path.

This can be a difficult concept to grasp from a verbal explanation alone, so please take a few minutes to read the help topic About registering your data with ArcGIS Server. This has some diagrams of different situations where data registration can be particularly useful. It is one of the most important help topics for ArcGIS Server.

Please note that if you try to publish a service and ArcGIS Server does not find any of the data paths in your map in its list of registered folders and databases, the data will be packaged up and copied to the server at the time you publish. The copying ensures that no data paths will be broken in the published service. This automatic data copying is an interesting feature in some scenarios where the publishers do not have the rights to log in to the server machine, but it is not an appropriate workflow for managing large amounts of data. The best approach is to make sure you set up workable data locations on the publisher's machine and the server machines, and then carefully register those locations with ArcGIS Server. In some cases, like ours, the publisher's machine and the server machine will be viewing the same path to the data.
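
The path switching and the copy fallback described above can be sketched in a few lines of Python. This is a conceptual model only, not ArcGIS Server's implementation; the registry contents, the C:\gisdata publisher folder, and the function name are all hypothetical.

```python
from pathlib import PureWindowsPath

# Hypothetical registry: publisher folder path -> server folder path.
REGISTERED_FOLDERS = {
    r"C:\gisdata": r"D:\data",  # publisher and server use different paths
    r"D:\data": r"D:\data",     # same path on both, as in this course
}


def server_path_for(publisher_path, registered=REGISTERED_FOLDERS):
    """Return the server-side path for a dataset, or None if unregistered."""
    path = PureWindowsPath(publisher_path)
    for publisher_root, server_root in registered.items():
        try:
            remainder = path.relative_to(PureWindowsPath(publisher_root))
        except ValueError:
            continue  # dataset is not under this registered folder
        # Swap the publisher's folder for the server's folder.
        return str(PureWindowsPath(server_root) / remainder)
    # No match: in the real workflow, the data would be copied to the server.
    return None


print(server_path_for(r"C:\gisdata\AppalachianTrail\Trails.gdb"))
print(server_path_for(r"E:\scratch\roads.gdb"))
```

A return value of None here stands in for the case where the data gets packaged and copied at publish time, which is the situation to avoid for large datasets.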

Follow the steps below to register your d:\data folder with ArcGIS Server:

  1. On your EC2 instance, start ArcMap to a new empty map and make sure the Catalog window is visible.

    Although you could register a data location using Manager, we're going to do it in ArcMap. You'll be working with ArcMap in the next section of the lesson, and it's beneficial to see how a connection to your instance can be used in ArcGIS for Desktop.

    When you start ArcMap, you may see errors indicating that some windows are being blocked. This is because Internet Explorer enhanced security configuration is enabled on your instance, and this affects some ArcMap windows. These warnings do not inhibit workflows in our course, and you can dismiss them.
  2. In the Catalog tree, expand GIS Servers and double-click Add ArcGIS Server.

    Notice there are three levels of connection you can make to ArcGIS Server, based on the privileges that the server administrator has granted you. You'll make the highest privileged connection, which is an administrator connection.
  3. Choose Administer GIS server and click Next.
  4. Enter the Server URL. This uses the format http://<Elastic Load Balancer name>/arcgis. An easy way to derive this URL is to copy the Manager URL from Cloud Builder and delete the "/manager" portion at the end.
  5. Enter the User Name and Password, which are the now-familiar primary site administrator credentials that you entered when you went through Cloud Builder and when you logged in to Manager. Then click Finish.

    From my experience, it can take a few minutes to connect. If the connection fails, you may try making a new connection using the URL http://localhost:6080/arcgis, which means "connect directly to ArcGIS Server on this computer".
  6. In the Catalog window, right-click your connection and click Server Properties.
  7. Click the Data Store tab to see the list of folders and databases registered with ArcGIS Server.
  8. In the lower section of the dialog box, click the plus (+) button next to the Registered Folders section.
  9. Type a Name such as My Geog 865 data folder.
  10. For the Publisher folder path, type d:\data. You'll be publishing from ArcMap on your EC2 instance and this is the path you'll use to access the data from ArcMap.
  11. In the section of the dialog box for the Server folder path, ensure that Same as publisher folder path is checked. This means that ArcGIS Server will also see your data using the path d:\data.
  12. Click OK to register the folder and OK again to exit the Server Properties dialog box.
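
The URL derivation in step 4 amounts to trimming the "/manager" suffix from the Manager URL. Here is a minimal Python sketch; the function name and the host name in the example are made up for illustration.

```python
def server_url_from_manager_url(manager_url):
    """Derive the ArcGIS Server URL by removing a trailing '/manager'."""
    trimmed = manager_url.rstrip("/")
    suffix = "/manager"
    if not trimmed.lower().endswith(suffix):
        raise ValueError("expected a URL ending in /manager")
    return trimmed[: -len(suffix)]


# Hypothetical load balancer name, for illustration only.
print(server_url_from_manager_url("http://my-elb.example.com/arcgis/manager"))
```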

Now you're ready to publish a map web service using your AppalachianTrail dataset that you placed in D:\data. You'll do this in the next section of the lesson.