
Resource Management

Information can be processed most efficiently when the appropriate resources are allocated and utilized effectively. To help with this, we'll go over some methods of determining your resource needs and techniques for making the best use of the resources available to you.

To some this process is a dark art. At its core is capacity planning: determining the amount of CPU, RAM, and disk your information processing needs. To achieve efficient resource utilization, you may need to optimize your workflow and potentially leverage techniques like parallel processing. To ensure you are using your computing capacity effectively, you'll need to monitor your resource utilization at points throughout your work.

System Resources

The CPU, or Central Processing Unit, is the heart of your computer. As the name implies, virtually all data processing is handled within the CPU. In simplest terms, a CPU's performance is measured in two ways: its clock speed (the frequency at which the CPU operates, measured in megahertz or gigahertz) and the number of cores it has. Each core can perform a single operation (or calculation) at a time, so the more cores you have, the more calculations can be performed simultaneously.

RAM, or Random Access Memory, is very fast but short-term memory. Its job is to hold the data that the CPU is actively working with. If your needs call for large amounts of RAM, you're in luck: it's relatively easy to upgrade and fairly cheap.

Disks are used to store data for the long term, but not all disks are created equal. There are two main types of disk: Solid State Drives (SSDs) and Hard Disk Drives (HDDs). An SSD is many times faster than a hard disk, but that speed comes with a much heftier price tag for large amounts of storage. Hard disks, on the other hand, have been around for years and are cheap even for storing several terabytes of data.

There are three basic ways to use disks. The most common is a standalone drive, either internal or external to your computer. The next most common is network storage, where the disks live somewhere else on the network; it is good for large capacity, but rarely good for high performance. The third is a disk array: several disks working together, usually in a redundant fashion, so that data is retained if a disk fails.
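As an example, on a Linux system you can take a quick inventory of these resources from the command line (tool availability and options may vary by distribution):

$ nproc                          # number of CPU cores available
$ lscpu | grep MHz               # CPU clock speed
$ free -h                        # total and available RAM
$ df -h                          # disk capacity and free space per filesystem
$ lsblk -d -o NAME,ROTA,SIZE     # ROTA=1 indicates a rotational (hard) disk, 0 an SSD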

Optimization

To optimize your data and workflow, you need to identify what resources you are using, identify bottlenecks, and eliminate them.

Every operating system has tools to track resource usage. On Windows, the Performance Monitor (example shown below) is the most helpful; it gives a breakdown of RAM, CPU, disk, and network utilization by application. On a Mac, the Activity Monitor is the most user-friendly way to track resource usage, and in recent versions of OS X it uses color coding to help you see when you are hitting limits. On Linux you have many choices, but htop works well for CPU and RAM monitoring, and iostat is useful for disk activity.

[Screenshot: Windows 7 Performance Monitor (image: Microsoft)]
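For instance, on a Linux system the following commands give a quick picture of current utilization (assuming htop and the sysstat package, which provides iostat, are installed):

$ htop           # interactive, per-process view of CPU and RAM usage
$ iostat -x 5    # extended disk I/O statistics, refreshed every 5 seconds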

Once you have determined your usage, you can try to identify bottlenecks. A bottleneck could be caused by your available resources or by your software. If you aren't maxing out the CPU, memory, or disk, then the bottleneck is likely within the software itself. However, if you are maxing out a particular resource, increasing that resource should help. For example, if your system has a single hard disk drive and it's being maxed out, replacing it with a solid state drive should speed things up.

Memory Utilization

Random Access Memory or RAM is a finite resource. The amount you can install into a system is limited by the CPU, motherboard, and sometimes the operating system you are running.

For a system to perform efficiently, you must use your available RAM effectively. This means ensuring that each application or process is allocated enough memory while avoiding exceeding the total available. When you exceed that limit, your computer will begin to swap, or even thrash, resulting in poor performance and possible loss of data.
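On Linux, one quick way to check whether a job is swapping is vmstat (part of the procps package on most distributions); sustained non-zero values in the si and so columns indicate memory being swapped in from and out to disk:

$ vmstat 5    # report memory, swap, and CPU activity every 5 seconds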

Allocating memory is made a bit more difficult by the behavior of some applications. Software such as R, MATLAB, and Excel does all of its data processing in RAM by default, which means you need enough memory to hold the raw data set and any changes being made to it.

If you plan on working with large data sets, you have a few options. The first is to purchase enough memory to do all your work in RAM; however, this can be costly and doesn't scale well. The second is to break the data into smaller pieces and do your calculations in sections (see the sketch below). The third is to use a plugin or add-on for your software that lets you work from disk, though this option isn't available for all software.
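As a rough sketch of the second option, a large line-oriented file can be split into chunks and processed one piece at a time. Here big_dataset.csv and the per-chunk script summarize.sh are hypothetical stand-ins for your own data file and analysis step:

$ split -l 1000000 big_dataset.csv chunk_                        # break the file into 1,000,000-line pieces
$ for f in chunk_*; do ./summarize.sh "$f" >> results.csv; done  # process each piece, appending the results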

Case Study: Climate Change Project

To help provide context, we will look at a real research project conducted by the University of Washington.

Let’s start the scenario with a support request made by a research assistant to the departmental IT staff. Here is the actual request.

I'm running a SAS program that involves building a large dataset
from the meteorology files. Each meteorology file is 864 KB, and
I'm merging together about 5,000 of these separate meteorology
files into one giant dataset. So far I've been running this program
since Wednesday, and it's still running. I'm thinking that perhaps
my computer does not have enough processing memory to run this
task efficiently. Also, I'm running this off of my C:/drive, so the
slow processing time is not due to sending information over the
network. I think that if I continue to let my program run over the
weekend, it should definitely be done by Monday. However, I'm
concerned because I will need to run this program again for another
set of about 5,000 files later.

This provides sufficient detail to get an idea that this is a resource management problem. Either more memory needs to be installed, or the (685-line) SAS program needs to be modified to use less memory.

After looking further into this issue, we discovered that there were 5335 pairs of files, with each pair consuming 2284 kilobytes.

Let’s calculate the memory needed to load the entire dataset into RAM:

5335 * 2284 KB = 12185140 KB

12185140 KB is about 12.2 GB of RAM. In addition, as the files were being combined in memory, a second (combined) copy of the data was being collected in a separate table, also stored in memory, before being written out to disk. So the total memory requirement may have been over 24 GB of RAM (assuming SAS does not store the data any more efficiently than ASCII text). The computer running the analysis did not have nearly that much RAM installed, so the main reason this was running slowly was that the hard disk was used as "swap space" once the RAM had been consumed. This "swapping out" of memory to and from disk is extremely slow.

But was it really necessary to load the entire dataset into RAM (twice)?

We ended up writing a short script that used negligible resources and did the job in less than half an hour, even on a modest computer. So, "no", we don't have to load all of the data into RAM. We can just "stream" the data to disk, joining and streaming one pair of files at a time, continually appending the stream of data to an ever-growing output file.

How did we do it?

Let’s start by looking at the input data files. We have two folders of files. For every (space-delimited) file in our data/ folder, there is a corresponding (tab-delimited) file in our force/ folder, sharing the same location coordinates (stored in the filenames). So, we want to link each pair of files on the location, combine the rows line-by-line, then collect the combined rows into a single output data file.

Here is a listing showing the first 9 lines of each type of input file:

$ head -9 data/data_45.59375_-122.15625
27.950 8.170 0.570 3.100
1.950 6.130 -1.000 3.030
3.925 5.650 0.490 3.020
2.400 5.680 -0.880 2.980
8.200 6.330 0.020 2.990
1.600 5.560 -0.360 3.040
14.650 5.420 0.310 3.410
37.775 5.080 -0.340 3.280
6.575 6.550 -0.500 3.440

$ head -9 force/force_45.59375_-122.15625
1915	1	1	237.83	41.655	79.851	0.1660
1915	1	2	244.82	40.886	77.705	0.1666
1915	1	3	248.43	32.984	84.425	0.1174
1915	1	4	246.30	39.536	79.572	0.1494
1915	1	5	245.51	38.740	81.598	0.1405
1915	1	6	249.16	37.549	81.220	0.1393
1915	1	7	254.88	33.961	84.068	0.1190
1915	1	8	262.68	35.973	82.808	0.1245
1915	1	9	258.02	43.198	79.247	0.1583

Here is an example of the desired output, showing the header and first 9 data records of the output CSV file:

$ head met.csv
lat,lng,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3
45.59375,-122.15625,27.950,8.170,0.570,3.100,1915,1,1,237.83,41.655,79.851,0.1660
45.59375,-122.15625,1.950,6.130,-1.000,3.030,1915,1,2,244.82,40.886,77.705,0.1666
45.59375,-122.15625,3.925,5.650,0.490,3.020,1915,1,3,248.43,32.984,84.425,0.1174
45.59375,-122.15625,2.400,5.680,-0.880,2.980,1915,1,4,246.30,39.536,79.572,0.1494
45.59375,-122.15625,8.200,6.330,0.020,2.990,1915,1,5,245.51,38.740,81.598,0.1405
45.59375,-122.15625,1.600,5.560,-0.360,3.040,1915,1,6,249.16,37.549,81.220,0.1393
45.59375,-122.15625,14.650,5.420,0.310,3.410,1915,1,7,254.88,33.961,84.068,0.1190
45.59375,-122.15625,37.775,5.080,-0.340,3.280,1915,1,8,262.68,35.973,82.808,0.1245
45.59375,-122.15625,6.575,6.550,-0.500,3.440,1915,1,9,258.02,43.198,79.247,0.1583

Here is a simple version of a Bash script which can do the job.[1]

#!/bin/bash

# Print the CSV header line.
echo 'lat,lng,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3'

# Loop over the location coordinates extracted from the "data" file names.
for LATLONG in `find data -type f | sed -n "s/^data\/data_//p"`
do
    DATA=data/data_$LATLONG
    FORCE=force/force_$LATLONG

    # If the matching "force" file exists, join the two files line-by-line,
    # prepend the coordinates, and convert all delimiters to commas.
    if [ -e "$FORCE" ]; then
        paste -d, "$DATA" "$FORCE" | sed -e "s/^/$LATLONG,/" -e 's/[ \t_,]\+/,/g'
    fi
done

Requiring only a small fraction of the lines of code of the SAS version, this script can combine all of the files in a matter of minutes.

Aside: Minimal code, maximal ugliness

At the risk of code obfuscation, we can reduce the lines of code to three:

echo 'lat,lng,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3'
for D in data/data_*; do L=${D/#data\/data_/}; F=force/force_$L
[ -e $F ] && paste -d, $D $F | sed -e "s/^/$L,/" -e 's/[ \t_,]\+/,/g'; done

We don’t recommend this approach, however, as it’s very hard to read and won’t run any faster!

We can run our script using these Bash commands:

$ cd /path/to/met
$ bash metmerge.sh > met.csv

How does it work? We join each "data" file with its corresponding "force" file, sequentially, line-by-line, pairing files on location (latitude and longitude) as found in the file names. (Here is an example of a "data" file name: "data_45.59375_-122.46875".) As we process the files, we continually write the combined output "stream" in CSV format to a text file.

A Python version running on a decent server runs about twice as fast as this Bash script running on an old laptop. So the total run time can be reduced to about 15 minutes, using a single Python process (one CPU core) and less than 25 MB of RAM (peak).

#!/usr/bin/python

import os

# Print the CSV header line.
hdr = "lat lng precip tmax tmin windsp year month day b1 b2 rh b3".split()
print(",".join(hdr))

def process_latlong(latlong):
    # The location coordinates become the first two output columns.
    latlong_out = latlong.replace("_", ",")
    data_file = open("data/data_" + latlong, "r")
    force_file = open("force/force_" + latlong, "r")

    # Join the paired files line-by-line and write each combined row as CSV.
    for data_line in data_file:
        data_out = data_line.strip().replace(" ", ",")
        force_line = force_file.readline()
        force_out = force_line.strip().replace(" ", "").replace("\t", ",")
        print(latlong_out + "," + data_out + "," + force_out)

    data_file.close()
    force_file.close()

# Process every file in the data/ folder, removing the "data_" prefix
# from the file name to get the location coordinates.
for fname in os.listdir("data/"):
    process_latlong(fname[len("data_"):])

CPU Utilization

With modern CPUs having multiple cores, parallel processing is the only effective way to use all of the CPU power available. To take advantage of it, your data and workflow may need adjustment: the data is divided into pieces, and calculations are performed on several pieces simultaneously.

Thankfully, there are some well-developed tools and techniques to help with this. One of the more common is MapReduce, a framework for processing large volumes of data in parallel, popularized by Apache Hadoop. A lesser-known tool is GNU Parallel, which runs and manages command-line programs in parallel.

As an example, climate data can be processed in a parallel fashion. The data can be divided up by area, and computation then performed on a per-area basis. Thus, instead of doing calculations for one ZIP code at a time, you could process data for 4, 8, or more areas at once, as sketched below.
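For instance, if the per-area computation were wrapped in a hypothetical script called process_area.sh, GNU Parallel could run it on several location files at once (here -j 8 runs eight jobs in parallel, and {} is replaced by each input file name):

$ find data -type f | parallel -j 8 ./process_area.sh {}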

Summary

In summary, there are three main components to resource management:

  • Capacity planning: Identification and allocation of necessary resources.

  • Utilization monitoring: Verifying you are using the resources you’ve allocated.

  • Bottleneck resolution: Identification, and correction of performance bottlenecks.


1. This Bash script will also run in POSIX mode.