Breaking File for Fun and Profit
or: How I Learned to Stop Worrying and Love the Magic Bytes


Reading time: 5 minutes



There are only two hard things in Computer Science: cache invalidation and naming things.

– Phil Karlton

1. What is file

file, the aptly named Unix command, can easily be fooled. If you’re unfamiliar with its purpose, let’s get you acquainted. First, we create a simple text file:

echo "I am a simple text file" > foo.txt

Now, here is what file does. It takes a file as an argument and tells us what type of file it is. For the text file we just created:

file foo.txt
foo.txt: ASCII text

As we can see, file has, correctly, identified this as an ASCII text file. But not all text files are ASCII. Does file know this?

echo "I am a UTF-8 file 😝" > bar.txt
file bar.txt
bar.txt: Unicode text, UTF-8 text

It seems like it does.

file doesn’t go by extensions. Rather, as file(1) describes, it runs three sets of tests in order, and the first test that succeeds determines what gets printed:

  1. Filesystem tests.
  2. Magic tests.
  3. Language tests.

In short, filesystem tests are based on the stat system call, and identify native file types such as sockets, symbolic links, named pipes, etc.
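To get a feel for what that first stage has to work with, here is a minimal sketch that classifies a path from filesystem metadata alone. fs_kind is a made-up helper, not part of file, and the labels it prints are only my approximation of what file reports. It leans on bash’s test operators, which consult the same metadata that stat does.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of file): classify a path using only
# filesystem metadata, roughly the kind of answer the first stage can give.
fs_kind() {
  local path="$1"
  if   [ -L "$path" ]; then echo "symbolic link"      # checked first: the other tests follow symlinks
  elif [ -d "$path" ]; then echo "directory"
  elif [ -p "$path" ]; then echo "fifo (named pipe)"
  elif [ -S "$path" ]; then echo "socket"
  elif [ -f "$path" ] && [ ! -s "$path" ]; then echo "empty"
  elif [ -f "$path" ]; then echo "regular file"       # contents unknown: needs the later stages
  else echo "other or missing"
  fi
}

# Demo in a scratch directory.
tmp="$(mktemp -d)"
mkfifo "$tmp/pipe"
ln -s "$tmp/pipe" "$tmp/link"
: > "$tmp/empty"
for p in "$tmp" "$tmp/pipe" "$tmp/link" "$tmp/empty"; do
  printf '%s: %s\n' "$p" "$(fs_kind "$p")"
done
rm -rf "$tmp"
```

Anything that comes out as "regular file" here is exactly the case where the magic and language tests have to take over.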

Next come the magic tests, which are the easily broken bit. They test for magic numbers by looking for known constants at fixed offsets in the file.

One example of such a magic number: a file that starts with the hex bytes 23 21. That is “#!” in ASCII (and in ISO 8859-1, which shares those code points). In other words, all those crunchbangs we type at the top of scripts are themselves an example of file type magic bytes.
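A single magic test is easy to sketch by hand. The two helpers below are hypothetical names of my own, not anything file provides: one dumps the first few bytes of a file as hex, the other hard-codes the 23 21 check. The real thing is driven by a large database of such rules rather than one hard-coded comparison.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first COUNT bytes of a file as lowercase hex.
first_bytes_hex() {
  local path="$1" count="$2"
  head -c "$count" "$path" | od -An -v -tx1 | tr -d ' \n'
}

# One hard-coded "magic test": does the file start with 23 21 ("#!")?
has_shebang_magic() {
  [ "$(first_bytes_hex "$1" 2)" = "2321" ]
}

# Demo on a throwaway script.
script="$(mktemp)"
printf '#!/bin/sh\necho hello\n' > "$script"
first_bytes_hex "$script" 2; echo
if has_shebang_magic "$script"; then
  echo "starts with the #! magic bytes"
fi
rm -f "$script"
```

Note that the check never looks past the second byte, which is the whole weakness: any file that happens to begin with the right bytes passes.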

If none of the magic tests match, file checks whether the contents can reasonably be considered text. If so, the language tests run, and they determine encoding-related information as well as, where possible, the programming language.
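The "is this reasonably text" question can be approximated in a few lines. This is a deliberately strict sketch under my own assumptions: is_ascii_text is a hypothetical helper that only accepts tab, LF, CR and the printable range 0x20 to 0x7e, whereas the real file is more lenient about which control characters count as text.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file is non-empty and every byte is
# tab, LF, CR, or printable ASCII (0x20-0x7e). Stricter than file itself.
is_ascii_text() {
  local path="$1" leftover
  [ -s "$path" ] || return 1
  # Delete every allowed byte and count whatever is left over.
  leftover=$(( $(LC_ALL=C tr -d '\11\12\15\40-\176' < "$path" | wc -c) ))
  [ "$leftover" -eq 0 ]
}

# Demo: one pure ASCII file, one with a UTF-8 encoded character.
a="$(mktemp)"; b="$(mktemp)"
printf 'I am a simple text file\n' > "$a"
printf 'caf\303\251\n' > "$b"
for f in "$a" "$b"; do
  if is_ascii_text "$f"; then echo "$f: ASCII text"; else echo "$f: not plain ASCII"; fi
done
rm -f "$a" "$b"
```

A check like this has no notion of magic numbers at all, which is why it cannot be tricked by a sentence that happens to start with the wrong word.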

If file can’t crack it, it’s just data.

2. How to break file

Some years ago I had to create my own file implementation for uni. Part of the assignment was writing tests. I thought I’d use fortune to generate some ASCII and ensure that my file implementation (with very limited magic tests) always gave the same answer as file.

These tests scaled easily to the entire fortune corpus, and that led me to what I thought was a problem with my implementation: it would often diverge from file. A bit of investigation, however, led me to realize that my implementation was often correct when file wasn’t [2].

And not just, like, somewhat sensible mistakes, such as this example-fail.txt file, which I’m assuming is caused by a language test:

The first time, it's a KLUDGE!
The second, a trick.
Later, it's a well-established technique!
        -- Mike Broido, Intermetrics

This is, obviously, without a doubt text? Haha, think again!

file example-fail.txt
example-fail.txt: CSV text

Ohh, right, so actually that is a CSV file? Well, it is a list of key-values separated by commas… until the end at least.

Another example, here of an actual magic test breaking:

PARDON me, am I speaking ENGLISH?

As we can tell, this is not an ASCII text file, ohh no, it is…

pardon.txt: Par archive data

It’s par archive data :p

Warning: file can be wrong in ways that are a lot more wacky. We’re not supposed to understand file, just break it.

3. How to systematically break file

So there is some C code needed, namely, my trivial^w useless^w correct version of file. Well, actually, a butchered version of it that only supports ASCII. You can find that in this repo, which has all the stuff you need to start finding problems with file.

However, here’s the code for the actual script that finds these:

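#!/usr/bin/env bash
# Compare sfile (the ASCII-only clone) against the system file command
# on inputs generated by fortune.

set -e
dir="test_files/gen_tests"

echo "[*] Generating a $dir directory.."
mkdir -p "$dir"

# Clear out files left over from a previous run.
rm -f "$dir"/*

echo "[*] Generating test files.."

for i in {1..100}
do
    fortune > "$dir/ascii$i.input"
done

echo "[*] Running the type tests..."

exitcode=0

for f in "$dir"/*.input
do
  echo "[+] >>> Testing ${f}..."
  # Strip qualifiers after "ASCII text" (e.g. ", with very long lines").
  file    "${f}" | sed 's/ASCII text.*/ASCII text/' > "${f}.expected"
  sfile  "${f}" > "${f}.actual"

  if ! diff -u "${f}.expected" "${f}.actual"
  then
    echo "[-] >>> Failed :-(" | lolcat
    exitcode=1
  fi
done

# Report failure to the caller if any file diverged.
exit "$exitcode"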

This creates 100 test files from fortune output, and then tests my non-magic file-clone against the normal file.
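If you only care about the jackpots, you can also skip the clone and filter file’s own output. The sketch below is not part of the repo: suspicious_types is a hypothetical helper that reads "name: type" lines on stdin and keeps the ones whose type is neither ASCII nor Unicode text. On real data you would feed it with something like file test_files/gen_tests/*.input; here it runs on canned lines so it works without fortune or file installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "name: type" lines (the format file prints) and
# keep only those whose type is not plain ASCII or Unicode text.
suspicious_types() {
  # file may pad with extra spaces after the colon, so split on ": +".
  awk -F': +' '$2 !~ /^(ASCII|Unicode) text/'
}

# Demo on canned lines; the first one is made up, the other two are the
# kind of output shown in this post.
suspicious_types <<'EOF'
test_files/gen_tests/ascii1.input: ASCII text
test_files/gen_tests/ascii20.input: Unicode text, UTF-8 text
test_files/gen_tests/ascii59.input: Monkey's Audio compressed format version 29557
EOF
```

Since every input came from fortune, anything that survives the filter is a candidate for file being fooled. A file name containing ": " would confuse the field splitting, which is fine for generated names like these.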

If you wanna try this out yourself (and you’re using nix), it’s as simple as running:

nix run github:cafkafk/file-fuzzer

Sometimes you’ll get output like:

[+] >>> Testing test_files/gen_tests/ascii20.input...
--- test_files/gen_tests/ascii20.input.expected 2023-07-28 14:43:46.127390610 +0200
+++ test_files/gen_tests/ascii20.input.actual   2023-07-28 14:43:46.128390617 +0200
@@ -1 +1 @@
-test_files/gen_tests/ascii20.input: Unicode text, UTF-8 text
+test_files/gen_tests/ascii20.input: data
[-] >>> Failed :-(
[+] >>> Testing test_files/gen_tests/ascii21.input...

This is just because the file clone doesn’t support Unicode; it’s not a real correctness issue. However, something like:

[+] >>> Testing test_files/gen_tests/ascii59.input...
--- test_files/gen_tests/ascii59.input.expected 2023-07-28 14:44:46.571761662 +0200
+++ test_files/gen_tests/ascii59.input.actual   2023-07-28 14:44:46.572761668 +0200
@@ -1 +1 @@
-test_files/gen_tests/ascii59.input: Monkey's Audio compressed format version 29557
+test_files/gen_tests/ascii59.input: ASCII text
[-] >>> Failed :-(
[+] >>> Testing test_files/gen_tests/ascii5.input...

Now that’s a jackpot. In this case, we just cat the file to see what file thinks is a “Monkey’s Audio compressed format version 29557”.

cat test_files/gen_tests/ascii59.input
MAC user's dynamic debugging list evaluator?  Never heard of that.

So true bestie.
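The oddly specific version number is not random either. Here is one plausible reading, and it is an assumption on my part rather than something I checked against file’s magic database: the rule matches the four bytes "MAC " at the start, then reads the next two bytes as an unsigned 16-bit little-endian version. Those two bytes are "us", from "user's". le16_at is a hypothetical helper that does that read.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read an unsigned 16-bit little-endian integer
# from a file at the given byte offset.
le16_at() {
  local path="$1" offset="$2" lo hi
  # od prints the two bytes as hex, low byte first, e.g. " 75 73".
  read -r lo hi <<< "$(od -An -v -tx1 -j "$offset" -N 2 "$path")"
  echo $(( (0x$hi << 8) | 0x$lo ))
}

sample="$(mktemp)"
printf "MAC user's dynamic debugging list evaluator?  Never heard of that.\n" > "$sample"

# The first four bytes: 4d 41 43 20, i.e. "MAC ".
od -An -v -tx1 -N 4 "$sample"

# Bytes 4 and 5 are "u" (0x75) and "s" (0x73): 0x7375 = 115 * 256 + 117.
le16_at "$sample" 4   # prints 29557
rm -f "$sample"
```

The arithmetic lands exactly on the 29557 that file reported, so the fortune was apparently one space and two letters away from being plain text.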

4. Concluding Thoughts

The Unix file command is a useful tool that, while unassuming, harbors peculiar behaviors due to its reliance on magic bytes and language tests to discern file types. As demonstrated, this can sometimes result in strange, misleading, or downright comical outputs.

These inconsistencies offer an opportunity for experimentation, even amusement, as we “break” file to yield the most absurd file type descriptions.

However, there’s a flip side to this coin. If manipulated maliciously, these discrepancies in file type identification could potentially create security vulnerabilities in systems that heavily depend on file. As developers or system administrators, it’s crucial to understand these potential risks, ensuring our systems can handle such scenarios effectively.

In conclusion, while it’s engaging and educational to explore the quirks of tools like file, it’s essential to remember that these peculiarities can have implications beyond their surface amusement.

As Phil Karlton famously observed, there are two hard things in Computer Science: cache invalidation and naming things. With the peculiarities of file identification in mind, we might consider adding a third to this list.


Footnotes:

[2] This doesn’t mean mine was better, just that mine was trivial.

Author: Christina E. Sørensen

Created: 2024-04-14 Sun 10:06