Alan Crosswell edited this page Apr 12, 2018 · 20 revisions

Analyzing APRS Big Data

Introduction

Bill Howe's Introduction to Data Science class provides an opportunity for organizations in need of big data research to solicit students in the class to help them. This was the motivation for proposing this project.

Background

Many world-wide volunteer amateur radio groups and individuals deploy digipeaters: radio transceivers and computers that repeat the digital signals transmitted by transceivers typically located in moving vehicles. These signals are generally encoded with GPS-based location data. This technology (called APRS) is used both for fun and for public service. For example, my ham radio club tracks the location of assets in the field during local disaster responses as well as for public service events such as marathons, bikeathons, walkathons and so on. The goal of the networks of APRS digipeaters is to have complete digipeater coverage such that a mobile asset can always be located.

The general approach taken to confirming that a proposed or installed digipeater is correctly located (or to determine where a new digipeater is needed) is to use RF propagation modeling given the elevation and gain of the digipeater and the terrain in the intended coverage area. This is followed by performing "drive tests" in which individuals drive around and essentially say "can you hear me now?". This is then coupled with feedback from end-users (both those in the vehicles and others tracking them on the map) as well as a lot of staring at realtime maps.

The Project Idea

A large number of APRS digipeaters world wide are interconnected via the Internet (APRS-IS), resulting in a realtime structured livestream (similar to the Twitter stream experimented with in the first programming assignment, but with nowhere near the data volume). This data could potentially be analyzed in a number of ways to identify and visualize which digipeaters "heard" a vehicle and, over time, to map digipeater coverage as well as identify holes in it. Who knows what other insights a talented data munger could develop?

The APRS livestream and some historic data is freely available, but please do not overwhelm the servers with 74,000 users querying them. That would make me very unpopular. If there's interest in this project, I expect I would be able to set up a proxy to be dedicated for this purpose to avoid impact to the APRS-IS system.

The Big Data

I have obtained two years' worth of data collected from the APRS-IS system from Hessu, the author of http://aprs.fi, under the conditions that a paper be written and "published" about the work performed and that any software developed be made available as open source. There are a number of Amateur Radio magazines and web sites that might like to see a well-written paper. I can certainly post any papers on the web as well.

The data is available at http://data.w2aee.columbia.edu/~n2ygk/aprsis-archive/ and at s3://aprs-is/. The archive consists of over 600 bzip2-compressed files of about 80MB each, totaling about 40-50GB and representing close to 2 billion observations. Each file, when uncompressed, consists of lines of APRS-IS packets, each preceded by an integer timestamp in Unix format (seconds since 1 January 1970), a space, and then the APRS-IS packet as received. This stuff is on a relatively small server, so please don't just do a massive download. Feel free to use it from Amazon S3 as much as you want. Start by grabbing a sample to work with.
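If you work in Python (the language of the project's aprspig.py parser), a minimal sketch for walking one archive file and splitting off the timestamp might look like this (the encoding choice is my assumption; APRS-IS traffic is mostly ASCII but can carry arbitrary bytes):

```python
import bz2

def read_packets(path):
    """Yield (unix_timestamp, packet) pairs from one bzip2-compressed log file."""
    # Each line looks like: "<seconds-since-1970> <APRS-IS packet>"
    with bz2.open(path, "rt", encoding="latin-1", errors="replace") as f:
        for line in f:
            ts, _, packet = line.rstrip("\n").partition(" ")
            yield int(ts), packet
```

For example, read_packets('aprsis-20121231.log.bz2') would stream that day's observations without uncompressing the file to disk first.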

What is represented by this APRS data?

To answer a few questions that have been asked and to provide a little more information:

Is APRS a massive sensor network? Yes. Perhaps several at once. There are moving vehicles (cars, trucks, boats, airplanes, balloons, bicycles, pedestrians, search and rescue dogs, and more) transmitting their locations, headings, speeds, battery voltage, and temperature; weather stations supplying meteorological data (Citizen Weather Observer Program); and more. Coverage is global yet local: local line-of-sight radio paths, generally in the VHF frequency band, lead to generally fixed-location digital radio repeaters (digipeaters) and/or to Internet Gateways (IGATEs) which collect this data globally (APRS-IS). The system is 100% voluntary, generally using personally-owned equipment belonging to amateur radio operators, their hobby clubs, or sometimes their served agencies (governmental and NGO).

Information about stationary stations (digipeaters) is generally transmitted by those stations in their own beacon packets. Digipeaters can be stealthy or easily identified. One stealth approach is to not transmit beacons showing their location accurately. Another is to not perform alias substitution: such digipeaters will re-transmit a packet with the digipeater identified as, for example, WIDE2-1, rather than as an actual call sign like W2AEE.

The APRS protocol is confusing, especially if you've seen more logically-defined networking protocols (e.g. you have a computer science background;-). It is built on top of a piece of another confusing protocol, AX.25 (Amateur X.25), which is based on the X.25 protocol. AX.25 addresses are ASCII-encoded Amateur Radio Service call signs (an ITU standard), concatenated with a numeric Substation ID (SSID). For example, N2YGK is a call sign (with implied SSID zero). N2YGK-3 is a different address (same call sign, SSID 3).
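For illustration, here is a hypothetical helper (not part of any APRS library mentioned on this page) that splits such an address into its callsign and SSID, applying the implied-zero rule:

```python
def split_address(addr):
    """Split an AX.25 address like 'N2YGK-3' into (callsign, ssid).

    A bare callsign carries an implied SSID of zero, so 'N2YGK'
    and 'N2YGK-0' are the same address.
    """
    call, sep, ssid = addr.partition("-")
    return call, int(ssid) if sep else 0
```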

APRS only uses the datagram mode of AX.25 (Unnumbered Information -- UI frames). This is analogous to IP/UDP; Connection-oriented AX.25 is not used by APRS. AX.25 has a source-routing model of addressing. Each packet transmitted has a Source and Destination call sign, along with an optional Digipeater list. APRS overloads the meaning of the destination call sign as well as the "digi list" by using functional aliases such as "WIDE" that any APRS digipeater in earshot will respond to.

You can find out more about APRS at http://www.aprs.org/aprs12.html. The 1.01 protocol document was an attempt at having a group other than the protocol's inventor document it and is a good start. See also this (outdated) overview: aprs.pdf. Be aware that position reports come in many flavors including the "compressed" Mic-Encoder (Mic-E) format which is used by Kenwood transceivers among others.

Please read up on APRS-IS, which is the Internet data collector for over-the-air APRS data transmissions, among other things. See http://www.aprs-is.net/. Over-the-air APRS packets, when IGATEd onto APRS-IS, have the method of reception and the IGATE's identity tacked onto the end of the digipeater list. For example, "qAR,WD0BIA" in the following sample packet means the packet was heard over-the-air by IGATE WD0BIA. The APRS-IS site describes the q-codes.

N0AYK-10>BEACON,qAR,WD0BIA:RPTR 147.030 + PL146.2  

In the actual binary over-the-air AX.25 packet, each entry in the digipeater list is marked with a flag bit that indicates whether that digipeater was already "used", so that the next digipeater to be used in the source-route is clear. In the APRS-IS representation, this is a * tacked onto the end of the digipeater callsign. From this you can infer when a packet was heard directly from a source station rather than via one or more digipeaters. In the example above, WD0BIA heard N0AYK-10 directly, as there is no * marking a last digipeater to transmit. Compare this example:

N0AN>APAGW,WA0ROI-1,W0AK-1,WIDE2*,qAR,K0SXY:>AGWTRACKER

The functional alias WIDE2 was the last digipeater to transmit this packet, which was heard directly from N0AN by WA0ROI-1 before being digipeated several hops. So the first-hop digipeater is the one that "heard" the source transmission.
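That first-hop rule can be sketched in a few lines of Python (a simplification: it assumes the first entry has had alias substitution performed; a digipeater that doesn't substitute would leave a generic alias like WIDE2* there, as noted earlier):

```python
def first_hop(digis, igate):
    """Infer which station first heard the source's transmission.

    digis: the digipeater entries before the q-construct,
           e.g. ["WA0ROI-1", "W0AK-1", "WIDE2*"]
    igate: the IGATE callsign from the end of the APRS-IS path.

    No '*' anywhere means no digipeater has transmitted the packet,
    so the IGATE heard the source directly; otherwise the first
    entry in the list is the digipeater that heard it first.
    """
    if not any(d.endswith("*") for d in digis):
        return igate
    return digis[0].rstrip("*")
```

Applied to the two examples above: first_hop([], "WD0BIA") gives WD0BIA, and first_hop(["WA0ROI-1", "W0AK-1", "WIDE2*"], "K0SXY") gives WA0ROI-1.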

How do I parse this stuff?

Here is a Python regexp that might help, applied after splitting off the timestamp:

import re
s = re.match('^(?P<from_call>[^>]+)>(?P<to_call>[^,]+),*(?P<digis>.*),(?P<gtype>[^,]+),(?P<gate>[^:]+)(:)(?P<info>.*)$', l)
...
fc = s.group('from_call')
...
From_call is the source address. That's which station was transmitting.
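As a sanity check, here is that regexp with the group names spelled out (from_call, to_call, digis, gtype, gate, and info, matching the field descriptions in this section), applied to the N0AN sample packet shown earlier:

```python
import re

APRSIS_RE = re.compile(
    r'^(?P<from_call>[^>]+)>(?P<to_call>[^,]+),*'
    r'(?P<digis>.*),(?P<gtype>[^,]+),(?P<gate>[^:]+)(:)(?P<info>.*)$')

m = APRSIS_RE.match('N0AN>APAGW,WA0ROI-1,W0AK-1,WIDE2*,qAR,K0SXY:>AGWTRACKER')
fields = m.groupdict()
# fields['from_call'] == 'N0AN'
# fields['to_call']   == 'APAGW'
# fields['digis']     == 'WA0ROI-1,W0AK-1,WIDE2*'
# fields['gtype']     == 'qAR'
# fields['gate']      == 'K0SXY'
# fields['info']      == '>AGWTRACKER'
```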

Alternatively, see the parser module, aprspig.py, which can be used either as a Pig UDF or as a plain-old Python module, or, as a main program, it acts as a Hadoop streaming mapper. The module defines two functions:

  • aprs(l) which parses a line of APRS-IS data, and
  • position(to_call,info) which returns a canonical position in decimal degrees of latitude and longitude, among other things. When operating as a mapper, it emits a key of firsthop and value of from_call,latitude,longitude.

To_call is a destination address, but since APRS is a broadcast protocol, the destination address is not used as such. So the to_call's meaning is overloaded in a number of ways, pasted here from page 13 of the APRS specification:

The AX.25 Destination Address field can contain 6 different types of APRS information:

  • A generic APRS address.
  • A generic APRS address with a symbol.
  • An APRS software version number.
  • Mic-E encoded data.
  • A Maidenhead Grid Locator (obsolete).
  • An Alternate Net (ALTNET) address.

In all of these cases, the Destination Address SSID may specify a generic APRS digipeater path. I've highlighted the reason why you can't throw away the to_call: the Mic-E encoded data puts part of the lat/lon here.

Digis is a list of digipeater callsigns (up to but not including the qAR, etc.). These can be, and are, a mix of actual callsigns and generic aliases. They are not so well documented in the protocol spec, but see page 11. The digipeater list is an implementation of strict source-routing, consuming digipeaters from left to right in the list. So, for the following transmitted packet:

A>B,C,D,E:hi there

A is the identifier of the transmitting station. B is the identifier of the intended destination. C,D,E is the path by which A wants the packet to be routed to B. You might read this in words as, "This is station A. I want to send the message 'hi there' to station B via stations C, D, and E, in that order." The expectation is that station C will retransmit the received packet (digipeat) which will then be heard by station D which will do the same and so on for station E, which station B presumably can hear.

Now here's the trick -- since radio is a broadcast medium, everybody within range can hear this message whether or not it's strictly destined for them. APRS explicitly uses this broadcast feature and repurposes the meaning of the to_call and digis to implement a constrained flooding algorithm to allow A's packet to be more widely disseminated than "just" to stations who can hear him directly. This is what the digipeaters do. They receive packets, make some decisions, and then retransmit the packet if appropriate. Since you don't want to cause a flooding collapse by repeating all packets indefinitely and by all digipeaters, APRS invests a lot of work in limiting which digipeaters in fact will retransmit a received packet.

The next trick is that digipeaters have their own radio call sign which uniquely identifies them, but they also have alias call signs that they will respond to. So, if I am digipeater C, I might also be listening for (and repeating packets) that are addressed to one of a number of generic aliases such as WIDE. This is a critical feature of APRS: As I drive around, my Kenwood TM-D710 radio transmits my location via "WIDE1-1, WIDE2-1". The digipeaters that hear me are listening for these aliases and make a decision to repeat. When a digipeater repeats a packet it does one or both of two things:

  1. It marks its callsign in the digis list as used up (shown as a * in APRS-IS; actually a flag bit set in the on-air AX.25 protocol).
  2. It optionally replaces the generic alias with its own callsign in the digis list as a means of tracing which digipeater repeated the packet.
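Those two actions can be modeled in a few lines (a hypothetical sketch based on the examples that follow; real digipeaters also decrement WIDEn-N hop counts, suppress duplicates, and so on):

```python
def digipeat(digis, my_call, substitute=True):
    """Return the digis list as it would appear after this digipeater repeats.

    In APRS-IS form the '*' marks only the LAST digipeater to transmit;
    everything to its left is implicitly already used.
    """
    last_used = max((i for i, d in enumerate(digis) if d.endswith("*")),
                    default=-1)
    nxt = last_used + 1
    if nxt >= len(digis):
        return digis                       # path exhausted: do not repeat
    out = [d.rstrip("*") for d in digis]   # only the newest hop keeps the mark
    out[nxt] = (my_call if substitute else out[nxt]) + "*"
    return out
```

For instance, after K2PUT-15 digipeats a packet sent via "WIDE1-1,WIDE2-1" with callsign substitution, the list reads "K2PUT-15*,WIDE2-1".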

In the first example below, K2PUT-15 heard me driving around and digipeated my packet. Then WB2ZII-15 heard K2PUT-15's repeat of that packet. So the digipeater that first heard me was K2PUT-15.

 N2YGK-3>TQQY8Q,K2PUT-15*,WIDE2-1,qAR,WB2ZII-15:`eLXmJ0k/]"5"}=
 N2YGK-3>TQQY8Q,WIDE1-1,WIDE2-1,qAR,WB2ZII-15:`eL\n 0k/]"5$}147.060MHz=

In the second example, WB2ZII-15, which is both a digipeater and IGATE, directly heard me. You can tell that because there are no *'s appended in the digis list. So the digipeater that first heard me was WB2ZII-15.

So, given the positions of all stations like N2YGK-3 heard directly by WB2ZII-15, one should be able to create a visualization of WB2ZII-15's direct RF coverage.
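For a modest sample, that grouping is straightforward (a sketch using the in-memory set() approach; as the MapReduce notes later on this page explain, this does not scale to the full two-year archive):

```python
from collections import defaultdict

def coverage_by_firsthop(records):
    """Group unique (from_call, lat, lon) position reports by the station
    that first heard them.

    records: iterable of (firsthop, from_call, lat, lon) tuples -- the
    kind of <key, value> pairs the aprspig.py mapper emits.
    """
    cov = defaultdict(set)
    for firsthop, from_call, lat, lon in records:
        cov[firsthop].add((from_call, lat, lon))
    return cov
```

Plotting each firsthop's set of positions on a map would approximate that station's direct RF coverage.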

The gtype and gate are where APRS-IS comes in. An IGATE (such as WB2ZII-15) hears a packet over the air and injects it into the Internet-based system which results in the log files.

How safe is it to assume that digipeaters are permanent/static?

I would say this is something you might learn from the data. My assumption is that there are perhaps 3 or 4 classes of digipeaters:

  1. Relatively-permanent long-lived "WIDE" digipeaters that have been in place for years. These are generally at "sites" (radio towers, mountain tops) that hams have access to. For example my county government hosts equipment at the main county radio tower as we hams provide voluntary emergency communication services to the county and NGOs.
  2. Come-and-go home "RELAY" digipeaters. These are generally at hams' home stations and may only be turned on from time to time. Others are pretty much permanent. RELAYs used to respond to the RELAY alias but now use WIDE1-1 due to improvements made in the APRS flooding protocol. If you see RELAY in the data, somebody hasn't updated to the latest practices (as of 10 years ago;-).
  3. Event-specific digipeaters that are temporarily located to service a marathon, bike-a-thon, etc. I've been known to take a 50-foot crank-up tower on a trailer to a high point in the center of a bike tour.
  4. Mobile digipeaters. These are probably rare, but there is nothing to stop a mobile station from operating as a digipeater as well. For instance, I can set the Kenwood TM-D710 in my vehicle to operate as a digipeater. This practice really skews things;-)

It is also the case that even permanent digipeaters may have issues and go on/off the air, changing the characteristics of the APRS network. For instance, the W2AEE digipeater's 1/4-wave ground-plane antenna was broken in a storm several years ago (but still sort-of working) and was replaced with a trombone folded-dipole antenna last fall. This presumably changed the radiation pattern. Then the Motorola Micor transceiver went on the fritz and the digipeater was turned off for about a month. Two weeks ago, I replaced the Micor with a Kenwood TM-733, which is susceptible to front-end overload and not a good choice. Next week, a Vertex Standard VX-4500 transceiver will replace that.

So the short answer is: keep the timestamps;-) It would likely make an interesting visualization to scroll through time to see if the coverage pattern changes.

A Good Example of APRS Visualization

A good example of APRS visualization is at http://aprs.fi.

Check this out: http://aprs.fi/#!mt=roadmap&z=14&call=a%2FWB2ZII-2&timerange=43200&tail=43200

This is a mobile vehicle following the course of a bike/run event. If you click on any of the dots along the course, aprs.fi will show more information, including the list of digis. It will also draw a line to the IGATE that gated the packet from RF to the Internet. Straight segments that appear not to follow road contours generally indicate either that the tracker was not transmitting or that it was not heard by any digipeater. These trackers are generally set to transmit their position every 30 seconds or when a significant change in heading happens, so timestamps, last headings, etc. could perhaps be used to infer the situation. For example, if you see regular timestamps every 30 seconds and then a gap of 300 seconds with a long straight segment, that probably indicates an RF dead zone, which would be interesting information for those planning to improve coverage.
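That kind of gap detection is easy to sketch (the 30-second interval and the 5x threshold below are illustrative guesses, not parameters taken from the data):

```python
def find_gaps(timestamps, expected=30, factor=5):
    """Return (start, end) pairs where the spacing between consecutive
    beacons is much longer than the expected interval (in seconds),
    suggesting an RF dead zone (or a tracker that stopped transmitting).
    """
    return [(prev, cur)
            for prev, cur in zip(timestamps, timestamps[1:])
            if cur - prev > expected * factor]
```

Correlating the flagged intervals with the straight map segments would help separate dead zones from trackers that were simply switched off.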

The raw packets are no longer available for this particular event from 2012-09-02 in aprs.fi but they should be in the archived data made available for analysis.

If you look at: http://aprs.fi/#!mt=roadmap&z=12&call=a%2FN2YGK-3&timerange=604800&tail=604800 you will see me driving around. See http://aprs.fi/?c=raw&call=N2YGK-3&limit=1000&view=normal for the raw data (if you do this within the next day before the data is aged out of aprs.fi). Most of my packets are heard directly by an IGATE as there's no * in the digipeater list. I've pasted these packets here: http://pastebin.com/M0WxsvFf. Note that WB2ZII, WB2ZII-13, WB2ZII-14, WB2ZII-15 are different digipeaters; there can be up to 16 substation IDs (SSIDs) per callsign (WB2ZII-0 == WB2ZII).

Storing the Data in Amazon S3 and reducing it with Amazon EMR and Pig

I've stored the aprs-is compressed log files in Amazon S3. I've done a little googling around and it appears that the various Hadoop tools like Pig can read bzip2-compressed files and a simple Python UDF can potentially be supplied to a Pig loader to parse the data appropriately. I'm not sure Pig is the right tool, but I figured it's a start with Hadoop. I imagine one MapReduce might be to map all packets heard by a particular digipeater and reduce to something useful....

I've created a bucket called 'aprs-is' which currently contains over 600 files:

  • s3n://aprs-is/aprsis-20121231.log.bz2 -- this is one of the approximately 600 log files and contains 3,252,596 observations. All 600+ files can be accessed as s3n://aprs-is/aprsis-*.
  • s3n://aprs-is/small-sample.log -- this is a short uncompressed sample (40,010 observations)
  • s3n://aprs-is/reduced/ -- Contains the reduced APRS-IS data created by a MapReduce streaming job consisting of the aprspig.py Mapper and aprsreducer.py Reducer. The Reducer output is in JSON format. There's a small sample there called small-sample.txt.
  • s3n://aprs-is/code -- contains the aprspig.py, aprsreducer.py and (unsuccessful) aprs.pig scripts that I've used to reduce the data.

These files are not directly readable via the above URLs as I've configured Requester Pays in S3 so somebody's runaway code doesn't deplete my bank account (or at least run down the Free Tier balance). Notably, MapReduce access to S3 incurs no data transfer charges, transfer in is free, and storage is nearly free. Transfer out has a cost, so if the data can be reduced first, the transfer out won't be too bad.

Reducing the data with Python and Hadoop MapReduce Streaming.

After Pig didn't "just work," I decided to move down a layer in the software stack and just use Hadoop. This task is a classic MapReduce: map each input record to the "first_hop" as the key with (from_call,latitude,longitude) as the value, then reduce the list of tuples attached to the same first hop to a unique set and output it to a JSON-encoded file for subsequent use by the visualization code (to be developed by me or another project team member).

This turned out to be relatively easy to do: the Pig UDF module I had written was easily modified to run as a simple main program that reads lines from stdin, parses them, and emits <key,value> pairs to stdout. I then wrote a simple reducer script to take the Map-Shuffle output on stdin, and build up a set() of unique position records. This tested out well, including being able to run it simply as "./aprspig.py <small-sample.log | sort | ./aprsreducer.py >small.json".

As with Pig, I ran into a number of problems at scale. Given the large number of input files, the job generated a comparable number of map tasks and was blowing up with Java memory errors after successfully mapping for 2-3 hours (much faster than Pig!). I suspect the distinctness code in my reducer was where things blew up. I learned a lot about parsing Hadoop log files (1014 of them generated for this job), tuning Hadoop options, and making the mapper and reducer write to stderr periodically so that the controller doesn't think the mapper/reducer has hung -- and so that I could look at the stderr files after a job failure to get an idea of where things broke.

My most recent Hadoop run of this was a map-only job just to do the parsing and reduction. This dataset is now at s3://aprs-is/reduced/digipeaters.txt/. I am now running a simplified reducer that simply emits the <firsthop,position> tuples without trying to reduce duplicates as I suspect this is what was causing the job to hit a Java memory error as my reducer was doing the set() in-memory. A next step would likely be to look at how, in a Hadoop context, to write to a temporary file before emitting the final set.

Solving the Reducer Out-of-Memory Problem

It became clear that I had to find a better way to find the distinct position reports. I did a little more reading and hit on the solution of letting Hadoop do the work for me:

  • Mapper: Map the input to key=<firsthop,from_call,latitude,longitude> & data=nothing. This allows the shuffle process to sort all the identical position reports together.
  • Partitioner: Configure Hadoop to use the KeyFieldBasedPartitioner, which guarantees the affinity of partial keys to a single reducer. In this case, partition on just the first key subfield, firsthop, guaranteeing that a single reducer gets all records for a given firsthop.
  • Reducer: Input to the reducer is now sorted, so only one prior key value has to be remembered in order to eliminate all duplicates. The reducer output can also be generated incrementally, again without keeping anything in memory. The final output of the reducer is JSON-encoded records, hopefully suitable for use by a visualization step.
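The reducer half of that scheme might be sketched like this (a hypothetical stand-in for the revised aprsreducer.py; the comma-separated composite key and tab-separated empty value are my assumptions about the mapper's output format):

```python
import json

def reduce_sorted(lines, out):
    """Streaming reducer over shuffle-sorted composite keys.

    Each line is 'firsthop,from_call,lat,lon<TAB>'.  Because the input is
    sorted, duplicates are adjacent, so remembering only the previous key
    eliminates them, and output is emitted incrementally with nothing
    accumulated in memory.  Wire it to Hadoop streaming by calling
    reduce_sorted(sys.stdin, sys.stdout).
    """
    prev = None
    for line in lines:
        key = line.rstrip("\n").split("\t", 1)[0]
        if key == prev:
            continue                      # duplicate of the previous record
        prev = key
        firsthop, from_call, lat, lon = key.split(",")
        out.write(json.dumps({"firsthop": firsthop, "from_call": from_call,
                              "lat": float(lat), "lon": float(lon)}) + "\n")
```

On the Hadoop side, the partitioner would be selected with something like -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner plus the key-field options that restrict partitioning to the first subfield; check the Hadoop streaming documentation for the exact option names in your Hadoop version.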

Related Projects

See Eric Yeh's github for code that generates KML files suitable for use with Google Earth.

See Paul Munday's github as well.