datafiles

Data Files

Learning Objectives

This module will help you:

Understand the difference between various file types.
Learn how to name files to facilitate opening in suitable software.
Understand common character encoding standards and how to work with them.
Learn how to characterize a document’s structure to identify processing strategies.
Become familiar with common data file formats and their advantages and disadvantages.
Understand the importance of good table layout and how to achieve it.

Data File Types: Binary vs. Plain Text

There are essentially two main categories of digital file types, binary and plain text.

If a file is binary, the file just contains "zeros and ones". While this is technically true of any digital file stored within a binary computer system, the contents of a binary file will not necessarily conform to any standard character encoding system. The binary format may be highly efficient for storage or processing, but is essentially opaque, in that by simply looking at a binary file’s contents, you can’t really know what the format is or how to read it.^[1] Examples of binary files are database files, most multimedia files, and compressed files (such as zip files).

Plain text files, on the other hand, are composed of characters. Typically they are ASCII or Unicode characters represented by one or more bytes, where a byte is (generally) 8 bits. A bit can be considered either zero (off) or one (on). Plain text file formats are usually open and standard. Examples are web pages (HTML), XML, and CSV (comma separated value) data files.

File Name Extensions

Filenames generally have an extension, which is the part at the end ("suffix") of the filename, consisting of the last dot (.) and the characters that follow it.^[2]

Examples of binary filename extensions for images are .png and .jpeg. To launch "executable" programs on Windows systems you will often launch an .exe file. The .dmg ("disk image") filename extension is used on OS X. Common extensions for binary data files are .xls and .sas7bdat.

Plain text file formats for data files include .csv, .tsv, .txt, .xml, and .json, among others. Program source code is usually stored in plain text files, with extensions such as .R, .py, .pl, .c, .sh, .bat, and .do.

The extension is used to determine which "default application" should open it. Within the operating system, the extention is mapped to default applications. Mappings such as these are called file type associations.^[3]

Viewing Binary and Plain Text Files

When viewing the raw contents of files, whether they are binary or text files, we will often make use of a hexidecimal dump.

Hexdecimal is a base-16 number system with digits 0-F:

0 1 2 3 4 5 6 7 8 9 A B C D E F

Whereas binary has two possibilities, 0 and 1, hexadecimal has 16, including the ten decimal digits plus the letters A-F.

Let’s "dump" files in "hex" with hexdump…

$ hexdump -C -n 64 filename

Where the options we are using is this example are:

-C = display in hex and ASCII
-n 64 = show the first 64 characters
filename = name of file to view

In this example, we will view the first 64 bytes of an SVG image file. The file format stores information about the image in text, even though the file is displayed as a graphical image. Our filename is pie.svg.^[4]

$ hexdump -C -n 64 pie.svg

00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 75 74  |.0" encoding="ut|
00000020  66 2d 38 22 3f 3e 0a 3c  21 44 4f 43 54 59 50 45  |f-8"?>.<!DOCTYPE|
00000030  20 73 76 67 20 50 55 42  4c 49 43 20 22 2d 2f 2f  | svg PUBLIC "-//|
00000040

We see a column of numbers on the left which show the character numbers in hexadecimal. Each line shows the hexadecimal number for each of 16 characters, in the center of the output, and on the right is the text equivalent (in "ASCII") of those character numbers.

We can see that this is an "XML" document with a version number, and the character encoding is shown as "UTF-8". The document type ("DOCTYPE") is "svg". All of this is contained in XML tags, similar in structure to HTML (the language of most web pages). The hexadecimal numbers correlate to the ASCII characters because the first 128 characters of the UTF-8 encoding scheme are the same as the ASCII character set. (We go into more detail on this matter later in this module.)

Even if the file was not a text file, and the ASCII printout looked like random characters, we would still be able to look at the hexadecimal dump to learn about the file.

For example, here is the PNG (binary) version of that same image. We will use the same syntax with hexdump, but look inside the pie.png file.

$ hexdump -C -n 64 pie.png

00000000  89 50 4e 47 0d 0a 1a 0a  00 00 00 0d 49 48 44 52  |.PNG........IHDR|
00000010  00 00 01 2c 00 00 02 26  10 04 00 00 00 13 97 a3  |...,...&........|
00000020  46 00 00 00 04 67 41 4d  41 00 00 b1 8f 0b fc 61  |F....gAMA......a|
00000030  05 00 00 00 20 63 48 52  4d 00 00 7a 26 00 00 80  |.... cHRM..z&...|
00000040

We have the same format of output. On the right, we see that the file is identified^[5] as a PNG file, as shown in the first few ASCII characters, but all other ASCII characters appear random (meaningless). Dots are shown for "non-printing" characters. Since the file is binary, and not encoded as characters, the ASCII which has been interpreted by hexdump is not very useful for learning anything more about the image. (As the file is not ASCII encoded, the ASCII interpretation is invalid.) We will just have to open the image in a graphics viewer to see what it is. Although both image files would display the same, you can see that there is a big difference between the contents of plain text and binary file formats.

Character Encodings

We will now take a closer look at the most popular character encoding standards for text files: ASCII, Extended ASCII, and Unicode.

ASCII was the primary standard text file encoding for many years. ASCII was then extended to include more characters, giving us "Extended ASCII". Lately, in the last couple of decades, Unicode has become dominant because it allows for thousands of characters (because it uses more bytes). Whereas ASCII was originally a seven bit encoding system, Extended ASCII just adds one more bit (to make a complete eight bit "byte"), and Uncode uses up to four bytes.

ASCII

The ASCII ("American Standard Code for Information Interchange"^[6]) standard was published a long time ago in 1963 and the current version is from 1986 (ANSI X3.4-1986). It was internationalized in 1983 (ISO 646:1983). ASCII originally used a seven-bit character set, so there were 128 characters (which is two to the seventh power).

ASCII Table

Often we will refer to an ASCII table. There is a command that you can use to generate one. The ascii command prints all 128 ASCII characters.

$ ascii

Usage: ascii [-dxohv] [-t] [char-alias...]
   -t = one-line output  -d = Decimal table  -o = octal table  -x = hex table
   -h = This help screen -v = version information
Prints all aliases of an ASCII character. Args may be chars, C \-escapes,
English names, ^-escapes, ASCII mnemonics, or numerics in decimal/octal/hex.

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

You will see that there is a header showing the command usage followed by an ASCII table listing. The listing is arranged in 8 sets of columns, with each set showing the decimal (Dec) and hexadecimal (Hex) value for each character. Starting from zero (0), the first 32 characters (and the 128th) are the so-called "non-printing" characters, so those are shown with 2-3 letter codes describing the character. The 33rd character is the "Space" so nothing is shown. All other characters are symbols which appear on the standard US keyboard. The punctuation characters and decimal digits are followed by capital letters, more punctuation, lower-case letters, more punctuation, and finally ending with "DEL" (Delete), the 128th character (numbered 127, or 7F in hexadecimal). To have more characters, we would need more bits in our encoding standard, which we will look into next.

Extended ASCII

Extended ASCII, first published as the ISO-8859-1 ("ISO Latin 1") standard in 1987, adds another bit to ASCII, allowing for 191 characters, adding several rows of new characters to the end of the table.

ISO-8859-1 (Latin1) - Image: Roman Czyborra

Windows Latin 1 (Windows-1252)

Extended ASCII (ISO-8859-1) was extended even further to Windows-1252, sometimes (incorrectly) called "ANSI"^[7]

Here is a listing of the Windows-1252 character set. You will see that there are extra characters at the top of this table that we did not have in the Extended ASCII ("ISO Latin 1") character set, beginning with the the relatively new euro sign (€).

Windows-1252 (WinLatin1) - Image: Roman Czyborra

Why Character Encoding Matters

Mojibake example: garbled Smart Quotes in email reply

If a file is created using one character encoding, but is viewed using another, the characters are likely to display incorrectly.^[8] The resulting garbled text is sometimes called mojibake.

Mojibake example: in MS-Word

We can see how differences in character encodings can matter with a few simple examples. Let’s first generate a table of characters with Python.

Example 1. Printing the Windows-1252 character set with Python

The following Python script will show the printable characters of the Windows-1252 character set when run on a Windows system using a graphical Python interpreter such as IDLE or PyScripter.

Windows-1252 Table Python Script

Here is the full code listing for that Python script.

asciitable.py

# If run on a Windows system in a graphical environment such as
# IDLE's Python Shell, by default, this will print the Windows
# Latin 1 character set, a.k.a. Windows-1252 (WinLatin1).

import sys

# Print Extended ASCII table from character 32 to 256.
# (Skip non-printing characters numbered 1-31.)
start = 32
for i in range(start, 256):
    # Replace each non-printing character with a space.
    if i not in [129, 127, 141, 143, 144, 157, 160]:
        sys.stdout.write(chr(i))
    else:
        sys.stdout.write(" ")
    # Print a newline every 16 characters.
    if i > start and (i + 1) % 16 == 0:
        print

Example 2. Changing the Character Encoding within your application

We can see the characters properly in a non-Windows environment if we specifically set the character encoding in the application.

Changing Character Encoding in Mac OS X Terminal

Changing Character Encoding in Linux Terminal

However, this is not the default setting. Without knowing the output was encoded as Windows-1252, we might have thought our program had a bug.

So, how can we know the character encoding of "plain text" output? Let’s save the output as a file and test the file for it’s character encoding.

Example 3. Saving the output to a file with redirection

To save program output as a file, we can use file redirection. We will run the program on the Windows computer in a DOS shell and redirect with the '>' operator.^[9]

C:\> python asciitable.py > asciitable.txt

Redirection allows us to save to a file, but that file just contains the numeric codes for the characters. There is nothing in the file stating the actual character encoding format. We will have to guess, using the file command.

Example 4. Testing the file type with file

We can check the character encoding and other file properties using the file command. This command is available on Unix, Linux, and OS X systems. Here we will run the file command from a Bash shell.

$ file asciitable.txt
asciitable.txt: Non-ISO extended-ASCII text, with CRLF, NEL line terminators

While this tells us a little about the text format, we still don’t know the specific encoding standard used.

As you can see, dealing with various character encodings on different computing systems can be tricky.^[10] Is there a universal character encoding standard? Yes, Unicode! In the next section, we’ll see how we can convert our file to Unicode.^[11]

Unicode

Unicode provides an internationalized character encoding standard, to "encompass the characters of all the world’s living languages".^[12]

Like ASCII, but supports over 110,000 characters
Unicode standard was published in 1991
Most commonly used encodings are UTF-8 and UTF-16^[13]

You can browse the Unicode code charts to get an idea of the many character sets available.

Unicode Symbol Example: the Micro Sign

Example 5. Encoding the Micro Sign

The character µ, with Unicode^[14] name "MICRO SIGN" is encoded:

Encodings	Decimal	Hex
Unicode	181	U+00B5
Extended ASCII	181	B5
HTML numeric character reference	µ	µ
HTML named character entity	µ

Example 6. Typing the Micro Sign

How do you type the µ character into your computer?

Use these character codes:

Name	Decimal	Hex
MICRO SIGN	181	00B5

With these operating systems:^[15]

Windows: [Alt]decimal (using numeric keypad) … or … hex[Alt][x] (does not require numeric keypad)
OS X: for µ, you can simply use [Opt][m] … or … [Command][Ctrl][Space] … Search by name … or … use Unicode Hex Input (Input Source) and hex
Linux: [Shift][Ctrl]hex

Some Other Useful Symbols

Table 1. HTML Entities for Common Math Symbol Characters
Character Name	Char.	Entity	Num. Entity	Hex. Entity
DEGREE SYMBOL	°	°	°	°
MICRO MU SYMBOL	µ	µ	µ	µ
LOWER CASE SIGMA	σ	σ	σ	σ
N-ARY SUMMATION	∑	∑	∑	∑
GREEK SMALL LETTER PI	π	π	π	π
GREEK SMALL LETTER ALPHA	α	α	α	α
GREEK SMALL LETTER BETA	β	β	β	β
GREEK SMALL LETTER GAMMA	γ	γ	γ	γ
INCREMENT	Δ	Δ	∆	∆
GREEK SMALL LETTER EPSILON	ε	ε	ε	ε
INFINITY	∞	∞	∞	∞
PLUS OR MINUS	±	±	±	±
NOT EQUALS	≠	≠	≠	≠
ALMOST EQUAL	≈	≈	≈	≈
GREATER THAN OR EQUAL TO	≥	≥	≥	≥
LESS THAN OR EQUAL TO	≤	≤	≤	≤
DIVISION SIGN	÷	÷	÷	÷
SUPERSCRIPT TWO	²	²	²	²
SUPERSCRIPT THREE	³	³	³	³

For example, in Windows, you can use the "Num. Entity" column for [Alt] codes such as [Alt]946 for β (beta).

UTF-8: Encoding the Unicode Code Space

As Unicode is an encoding system, it depends on various character sets, such as UTF-8 (and UTF-16) for practical use.

UTF-8 (1993) is a variable-length 8-bit character encoding, which means that it can use one to four 8-bit bytes to represent each character. The first group of 128 characters in UTF-8 are the original 128 ASCII characters. This means that software configured to use Unicode will also be able to work with ASCII (ISO 646:1983) files.

The popularity of UTF-8 has increased since it was released. As of 2007, UTF-8 has became more dominant on the web than ASCII itself. UTF-8 is also the default encoding for HTML5 and JSON.

UTF-8 and UTF-16 are the standard encodings for Unicode text in HTML documents, with UTF-8 as the preferred and most used encoding.^[16]

— Wikipedia
UTF-8

Character Encoding Conversion Example

We can convert a file encoded as Windows-1252 into UTF-8 with iconv.^[17]

Example 7. Converting a Windows-1252 file into UTF-8

$ iconv -f windows-1252 -t utf-8 asciitable.txt > asciitable2.txt
$ file asciitable2.txt
asciitable2.txt: UTF-8 Unicode text

As you can see, you can use file to verify that this is a Unicode file encoded as UTF-8.

Tip	"Normalize" text datafiles to a common, universal encoding format like UTF-8 to ensure characters are displayed with the intended symbols.

Data Structure

Structured: Formal and rigorous design
- Example: Relational database
Semi-structured: Self-describing, validatable
- Markup using tags or key-value pairs
- Examples: XML and JSON
Unstructured:
- Multimedia and text document files
- Any internal structure, if present, is assumed or unreliable
- Example: email "body" ("header" is semi-structured)
- May have "implied" structure, like "delimited text"

You can store data in structured, semi-structured, or unstructured formats.

A structured data formal is a rigorous design like you would have in a relational database. Semi-structed is self-describing and can be automatically validated using software. It is validated through markup or key-value pairs. Examples of semi-structured file formats are XML and JSON. By having this self-describing format, programs can be run which can validate the document and make sure that it’s in the proper format. Unstructured documents include the majority of documents that you might be using in typical office work, such as multimedia and text document files.

You might say, "My documents are structured — with paragraphs, sentences and words." The idea is that any type of internal structure such as this is assumed. You might not have been consistent in using your "format". Or maybe you and your office-mates do not agree exactly on the "company standard" format’s details. Maybe you indent your paragraphs, but your co-worker does not. Maybe you leave one space between sentences or maybe you have two. Perhaps you made a "typo" by having two periods at the end of a sentence instead of one. A person would know this was just one sentence, but a program might see two. (A normal sentence and one with no words, but just a period.)

There is really no viable way to validate these kinds of documents, because there is no assurance that they conform to any specific and sufficiently detailed standard, so they are simply called "unstructured". They may be electronically parsed using heuristics and "artificial intelligence", but not as reliably as is if they were "self-describing" as to the meaning (symantics) of each part of the document. A self-describing format like XML would allow you to embed meta-data "tags" to indicate a person’s name as an "employee", a place name as a "location", or a postal code as such and not just a series of seemingly random digits.

Another example of unstructured text is the body of an email, where you could type any text content into the body that you wanted. The email header, however, is semi-structured because an email header does consist of key-value pairs, much like a JSON document. So the email, taken is as a whole, is semi-structured, but the email body is unstructured.

A controversial example might be a comma-separated-value (CSV) file. There are delimiters (commas) separating the fields and there is a header line at the top "describing" the values in each column, (much like key-value pairs). So, since this seems sort of like a database table (structured) or at least similar to a JSON file (semi-structured) you might not think it is unstructured. But the structure is implied, not explicit. The document does not tell you that it is in any particular format. You are assuming this format based on the filename suffix (extension) of .csv, the prevalence of a lot of commas, or the assurance of the person who gave you the file. It might just contain a list of items that does not conform to any particular standard. You have some rows with more or less values, or that don’t match the header line (if present). Any sort of structure is applied or assumed, but isn’t reliable. You could open a CSV file with a variable number of columns per row, or without a header in many software packages without getting an error or warning. You would have to discover and address any such issues yourself. The CSV data "format" is not (universally) validated.

XML

Self-describing^[18]
Structured
Standard
Parsable with libraries
Examples:
- XHTML
- KML
- XLSX
- p-list
- SVG

XML - Image: Dreftymac, CC BY 2.5

XML is self-describing, standard, and parsable, meaning there are automated means (libraries) of reading it in — loading in the values and the keys into a data structure. Examples are XHTML and KML (used with maps), XLSX is the (newer) MS-Excel format, the p-list is an OS X "property list" used for configuration, and SVG is an image file format.

For example, XHTML is an XML-compliant form of HTML that includes an XML version number, with a DOCTYPE specification, and an "xmlns" namespace. The rest of the document looks a like like any other HTML, such as HTML5, but all of the tags are also XML-compliant so that the file can be validated as XML by an automated parser. It is self describing because the structure of the document and the meaning of the elements are encoded in these tags.

JSON

JavaScript Object Notation
Open format (ISO and ECMA standards)
Human-readable text
For transmitting data objects
Attribute–value pairs
Often used in Ajax web applications

Another self-describing, semi-structired file format is JSON. It is an open format and standard recognized by two standards bodies (ISO and ECMA). Like XML, is is human-readable because it is plain-text, (though some may find it more readable than XML). It is used for transmitting data objects, meaning sending data structures between programs. It is organized into a nested structure of attribute:value (key-value) pairs. It is often used in Ajax web applications for transmitting data back and forth between the clients (web browsers or locally-installed "apps") and the server (server-hosted web application).

Example 8. JSON data structure for a person (John Smith)^[19]

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "height_cm": 167.6,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null
}

In this example, we have braces enclosing information about a person. Sub-sections are also enclosed by braces and/or brackets. The "address" attribute has it’s value enclosed in braces and contains a list of attribute:value pairs. The "phoneNumbers" attribute has a value which is an array of lists of attribute:value pairs. The "children" attribute’s value is an empty array and the value for the "spouse" attribute is "null" which means "no value". So instead of having tags to enclose data values, we have a braces, brackets, and attribute names to provide a nested structure and describe the data values.

Delimited Text Files

Files formatted with delimiter separated values use:

Comma (e.g., "CSV")
Tab (e.g., "TSV")
Pipe (vertical bar: |)

… or other single character as a separator (delimiter) between values.

The CSV is the best-known of these, but TSVs are still very common, with pipe-delimited files being less common. A benefit of the latter type, however, is that the pipe symbol is the least likely to be found in most typical data values. Unlike the tab, the pipe (|) and comma (,) are visible to the eye.

Taking the example of a CSV, since commas are often found in data values, there needs to be a way to differentate the comma as the delimter in a CSV file and the comma present merely as data. This is done by "escaping" — adding more characters to indicate the special treatment. In a CSV, this is often called "comma-quote-delimited". But what happens when the escape character also needs to be escaped? Now you can see why using a tab or pipe makes the job a lot simpler, because there is far less need for adding escape characters. As you might imagine, the addition and parsing of these characters is more complicated and thus more prone to error (bugs). For this reason, a TSV may be a better choice than a CSV in many situations. Many data-import functions assume TSV by default.

The records (rows) are separated by line-ending characters (newlines):

Carriage-return (CR)
Line-feed (LF)
Carriage-return, Line-feed (CRLF)

On the original MacOS systems, the standard line-ending character was the Carriage-return (CR), on Unix, OS X, and Linux systems, the line-ending character is a Line-feed (LF), and on Windows, both are combined together as Carriage-return, Line-feed (CRLF). As you might guess, (or have experienced first hand), sharing files between these systems can create problems, so various automatic conversions are built into file-transfer applications. You may see a prompt during software installation (such as for Git) as to whether or not you wish for such conversions to take place by default.

For example, if the server is running Unix, and some of the clients run OS X and some run Windows, the technique is often to configure the client software to standardize on the line-ending format of the server. In this case, the server running Unix and the client running OS X would both use LF, but the Windows client would need to convert LF to CRLF when receiving files and then convert CRLF back again to LF when sending files back to the server.

Fixed-Width Text Files

Instead of using a single character as a delimiter, you can also neatly line up your columns into a format called "fixed-width". The space between columns is filled with as many "whitespace" characters (spaces or tabs) as needed. Because the number of these delimiting characters can vary from one field to the next, this format is a little harder to parse (automatically), but has the benefit of greater (human) readability. The lines (records, rows) are separated by newlines. Since the columns must line up, the fields must be limited in length, often to some arbitrary value. For this reason, this format is not as commonly used as a data exchange format, but is often seen in text-based tabular reports. Since databases often enforce field-length restrictions anyway, a fixed-width format may be used by such systems for data storage.

Here is an example:

                   mpg cyl disp
Mazda RX4         21.0   6  160
Mazda RX4 Wag     21.0   6  160
Datsun 710        22.8   4  108
Hornet 4 Drive    21.4   6  258
Hornet Sportabout 18.7   8  360
Valiant           18.1   6  225

(Data from mtcars, The R Datasets Package, R Core Team.)^[20]

You can see that the columns are neatly aligned and the table is more readable than if each value was sepated by a single comma or space.

Multi-line Text Files

You should know that there are also multi-line text file formats. Some popular genomics file formats use multi-line records, for example.

If you look carefully at these, you’ll see that each record starts off with a header line.

FASTA^[21]

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

In the case of the FASTA file, the record’s first line begins with an angle bracket (>) followed by a pipe-delimited string containing various codes, and finally ends with a name enclosed in brackets. After this first header line, you have a few lines that look like random gibberish, but they are meant to represent the biomolecular sequence information.

FASTQ^[22]

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

This FASTQ record’s header line begins with an "at symbol" (@) followed by an alphanumeric code and a space-delimited list of values. The first value appears to be another code containing underscores and colons as delimiters, but it is hard to be sure from just looking at it. The last value is actually a key=value pair specifying that the string of nucleotides listed on the next line will contain 36 letters. The third line is another header, just like the first line, but starting instead with a "+". That difference can be used when parsing the file to know the difference between the two types of headers.

Unfortunately, the fourth line, which is the "Phred quality" line, contains a quality-character for each nucleotide listed in the second line. A variety of alphanumeric and punctuation characters are used for Phred quality codes.^[23] This could lead to parsing problems, especially with fragmented (incomplete) records. What if a particular quality line starts with a "+" or an "@"? A parsing program might assume this is a header line and attempt to read it as a new record.

Are these multi-line text files structured, semi-structured, or unstructured? As before, there is no explicit information in the file telling you which format or standard the file structure would adhere to, and there is nothing self-describing about the contents of the file, with the exception of the "length=36" part. So, like delimited and fixed-width files, these are actually "unstructured", despite a lot of implied structure, or "structure by convention".

Regardless, unstructured files are still very useful and are usually structured enough to serve their purpose. The only reason we have been drawing these distinctions about structure has been to help us categorize and evaluate the various alternatives — to appreciate the tradeoffs and provide clues as to possible processing strategies.

Data File Layout: Tidy Data

Now that we have an understanding of the types of file formats, we will discuss a little of what to do with data. In particular, we will see that following a few simple layout guidelines can help immensely when it comes time for data analysis later on.

Whether your data comes from an XML, CSV, or other file format, you can hopefully view it as a single table of columns and rows, with a single value in each "cell" (row-column intersection). If so, that should be fairly tidy.

A Table of Columns and Rows

We want to structure each data file as a single table of "columns and rows" …

Table 2. A table of columns and rows
subID	height	weight
1	58	115
2	59	117
3	60	120

… to make them easier to import and analyze.

(Data from women, The R Datasets Package, R Core Team.)^[24]

If you need to, you can link multiple tables together on common "key" columns, once you have imported them into your database or analysis software.

Tidy Data Basic Tenets

The basic tenets of tidy data are:^[25]

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table

Tidy or Not?

Is this spreadsheet^[26] tidy data or not? Why or why not?

spreadsheet

We have what looks like columns and rows, but you might notice that there are apparently several rows of headers at the top (with levels and sublevels) and a nested hierarchy of "causes" on the left (listed in outline form). It would be confusing to import this table into data analysis software that did not know how to interpret those various levels.

There is also color-coding for various columns and rows, which we presume must carry some additional meaning, but which might easily be lost upon import into our preferred software application.

How would you go about tidying this dataset?

Wide and Long: Which Table is Tidier?

Once we have a table in a nice column and row format, we may further tidy the data by checking that each column should indeed be a column — or rearrange the table as needed to achieve this.

A table layout can be seen as "wide" or "long". Wide data has lots of columns (variables) and long data has lots of rows (cases, records, observations).

Here is an example in "wide" format:

Table 3. Iris data in "wide" format
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
5.7	2.8	4.1	1.3	versicolor

Here the same table has been rearranged into "long" format:

Table 4. Iris data in "long" format
Flower.Id	Species	Flower.Part	Length	Width
1	setosa	Petal	1.4	0.2
1	setosa	Sepal	5.1	3.5
100	versicolor	Petal	4.1	1.3
100	versicolor	Sepal	5.7	2.8

(Data from iris, The R Datasets Package, R Core Team.)^[27] '''

The "long" layout may be tidier if you consider the "species" such as versicolor or the "part" of a flower such as Sepal as observations rather than as a variables. But in many cases, both wide and long formats are fine if they make sense and suit your analytical needs. The main idea is that you want to spend a little time tidying data so that you don’t waste a lot of time trying to analyze untidy data.

Why does tidiness matter?

If "wide" or "long" layout choices seem rather "subjective", consider which layout will better facilitate your analysis.

Because we have "tidied" our example data into a "long" layout, we can easily "facet" our plot by Species and Flower.Part.

ggplot(data=iris, aes(x=Width, y=Length)) +
    geom_point() + facet_grid(Species ~ Flower.Part, scale="free") +
    geom_smooth(method="lm") + theme_bw(base_size=16)

Plot using tidy iris data

Conclusion

In this module, we have explored file naming, types, and organization. We hope that by gaining an understanding of key concepts of data files, that you are better equipped to tackle your data-wrangling needs. We will build upon this knowledge in later data management modules, such as "databases", "programming", and "analysis tools".

1. Actually, byte sequences called magic numbers or file signatures may be used to identify file formats, but their use is not completely standardized or universal.

2. Filename extension, Wikipedia, CC BY-SA 3.0

3. file association, Wikipedia, CC BY-SA 3.0

4. While OS X, Linux and Unix systems come with hexdump, Windows systems no longer come with something comparable. Instead, you may wish use PowerShell, which does come with recent versions of Windows, download the hexdump.exe utility and use a command like this: hexdump.exe pie.svg | select -First 5 … though it will not produce exactly the same output as hexdump.

5. The first few bytes of a file are often used to identify the file type.

6. ASCII, Wikipedia, CC BY-SA 3.0

7. Windows-1252, Wikipedia, CC BY-SA 3.0

8. Our first example of mojibake which begins this section is from the reply window of Thunderbird when replying to a message from a Mailman email-list post. The original message was sent from a person using the Mutt email client. The message was PGP-signed. A quick internet search for the unintended characters brings up a useful help page. Apparently, this character sequence representing "curly quotes" or "Smart Quotes" is a common problem.

9. We would use the same python command from the Bash shell in Unix, Linux, or OS X.

10. You will find many cases of this issue discussed on help forums such as stackoverflow. There are modules in programming languages such as Ruby and Python which help address these problems. Applications like MS-Word also allow you to set or convert encodings.

11. For a more thorough treatment of dealing with character encoding problems, see "Bad Data Lurking in Plain Text" by Josh Levy, PhD, which is the fourth chapter of Q. Ethan McCallum’s Bad Data Handbook (O’Reilly Media, Inc., 2012).

12. Joe Becker, Unicode 88

13. Unicode, Wikipedia, CC BY-SA 3.0

14. Unicode, Wikipedia, CC BY-SA 3.0

15. Unicode input, Wikipedia, CC BY-SA 3.0

16. UTF-8, Wikipedia, CC BY-SA 3.0

17. iconv is another tool originally developed for Unix, Linux and OS X systems, though Windows versions are available and can be found with an Internet search.

18. XML, Wikipedia, CC BY-SA 3.0

19. JSON, Wikipedia, CC BY-SA 3.0

20. Extracted from the 1974 Motor Trend US magazine. Source: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

21. FASTA format, Wikipedia, CC BY-SA 3.0

22. FASTQ format, Wikipedia, CC BY-SA 3.0

23. See FASTQ format encoding, Wikipedia

24. The World Almanac and Book of Facts, 1975. Reference: McNeil, D. R. (1977) Interactive Data Analysis. Wiley.

25. Hadley Wickham, Tidy Data

26. WHO

27. Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5. Reference: Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.

The latest version of this document is online at: https://github.com/brianhigh/research-computing/wiki Copyright © The Research Computing Team. This information is provided for educational purposes only. See LICENSE for more information. Creative Commons Attribution 4.0 International Public License.

datafiles

Data Files

Learning Objectives

Data File Types: Binary vs. Plain Text

File Name Extensions

Viewing Binary and Plain Text Files

Character Encodings

ASCII

ASCII Table

Extended ASCII

Windows Latin 1 (Windows-1252)

Why Character Encoding Matters

Unicode

Unicode Symbol Example: the Micro Sign

Some Other Useful Symbols

UTF-8: Encoding the Unicode Code Space

Character Encoding Conversion Example

Data Structure

XML

JSON

Delimited Text Files

Fixed-Width Text Files

Multi-line Text Files

Data File Layout: Tidy Data

A Table of Columns and Rows

Tidy Data Basic Tenets

Tidy or Not?

Wide and Long: Which Table is Tidier?

Why does tidiness matter?

Conclusion

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Main Topics

Clone this wiki locally