Breaking File for Fun and Profit
or: How I Learned to Stop Worrying and Love the Magic Bytes
There are only two hard things in Computer Science: cache invalidation and naming things.
– Phil Karlton
1. What is file
File, the aptly named Unix command, can easily be fooled. If you’re unfamiliar with its purpose, let’s get you acquainted. First, we create a simple text file:
echo "I am a simple text file" > foo.txt
Now, here is what file does. It takes a file as an argument and tells us what type of file it is. For the text file we just created:
file foo.txt
foo.txt: ASCII text
As we can see, file has, correctly, identified this as an ASCII text file. But not all text files are ASCII. Does file know this?
echo "I am a UTF-8 file 😝" > bar.txt
file bar.txt
bar.txt: Unicode text, UTF-8 text
It seems like it does.
File doesn’t work by extensions. Rather, as FILE(1) describes, it goes through a few stages.
- Filesystem tests.
- Magic tests.
- Language tests.
In short, filesystem tests are based on the stat system call and identify native file types such as sockets, symbolic links, and named pipes.
Next come the magic tests; this is the easily broken bit. They test for magic numbers [1] by looking for constants at various fixed offsets in the file.
One example of a magic number: a file that starts with the hex bytes 23 21. This is “#!” in ISO 8859-1 (and in ASCII). That is, all those crunchbangs we type at the top of scripts are actually an example of file type magic bytes.
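To make that concrete, here is a minimal shell sketch of such a check. The helper names first_bytes_hex and has_shebang are made up for illustration and are not part of file; the sketch only assumes a POSIX od and tr.

```shell
# Hypothetical helper (not from file's source): print the first N bytes
# of a file as one lowercase hex string without spaces.
first_bytes_hex() {
    # $1 = path, $2 = number of bytes to read
    od -An -tx1 -N "$2" "$1" | tr -d ' \n'
}

# Succeeds if the file starts with the bytes 23 21, i.e. "#!".
has_shebang() {
    [ "$(first_bytes_hex "$1" 2)" = "2321" ]
}

# Small demo on a throwaway file.
tmp=$(mktemp)
printf '#!/bin/sh\necho hi\n' > "$tmp"
has_shebang "$tmp" && echo "starts with #!"
rm -f "$tmp"
```

The point is that nothing here looks at the file name or at anything past the second byte, which is exactly why this kind of test is easy to fool.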
If none of the magic tests match, file checks whether the contents can reasonably be considered text. If so, it runs the language tests, which determine encoding-related information and, where possible, the programming language.
If file can’t crack it, it’s just data.
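The stage order described above can be imitated in a few lines of shell. This is a toy sketch, not how file is implemented: the classify function, its single magic test, and its output labels are all assumptions of mine, and the real file knows far more patterns than this.

```shell
# Hypothetical, heavily simplified imitation of file's stage order.
classify() {
    f=$1

    # 1. Filesystem tests: ask the filesystem what kind of object this is.
    if [ -L "$f" ]; then echo "symbolic link"; return; fi
    if [ ! -e "$f" ]; then echo "cannot open"; return 1; fi
    if [ -d "$f" ]; then echo "directory"; return; fi
    if [ -p "$f" ]; then echo "fifo (named pipe)"; return; fi
    if [ -S "$f" ]; then echo "socket"; return; fi
    if [ ! -s "$f" ]; then echo "empty"; return; fi

    # 2. Magic tests: compare bytes at a fixed offset with a known constant.
    #    Only one pattern here: 23 21 ("#!") at offset 0.
    magic=$(od -An -tx1 -N 2 "$f" | tr -d ' \n')
    if [ "$magic" = "2321" ]; then echo "script (starts with #!)"; return; fi

    # 3. Text test: delete tab, LF, CR and printable ASCII, then count
    #    what is left. Zero leftover bytes means plain ASCII text.
    count=$(LC_ALL=C tr -d '\11\12\15\40-\176' < "$f" | wc -c | tr -d ' ')
    if [ "$count" -eq 0 ]; then echo "ASCII text"; return; fi

    # Nothing matched: it's just data.
    echo "data"
}
```

Called as classify foo.txt, it walks the same three stages in the same order, and falls through to data when nothing applies.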
2. How to break file
Some years ago I had to write my own file implementation for uni. Part of the assignment was writing tests. I thought I’d use fortune to generate some ASCII and ensure that my file implementation (with very limited magic tests) always gave the same answer as file.
These tests were easily scaled to the entirety of the fortune corpus, and that led me to what I thought was a problem with my implementation: it often diverged from file. A bit of investigation, however, led me to realize that my implementation was often correct when file wasn’t [2].
And not just, like, somewhat sensible mistakes, such as this example-fail.txt file, which I’m assuming is caused by a language test:
The first time, it's a KLUDGE! The second, a trick. Later, it's a well-established technique! -- Mike Broido, Intermetrics
This is, obviously, without a doubt text? Haha, think again!
file example-fail.txt
example-fail.txt: CSV text
Ohh, right, so actually that is a CSV file? Well, it is a list of key-values separated by commas… until the end at least.
Another example, here of an actual magic test breaking:
PARDON me, am I speaking ENGLISH?
As we can tell, this is not an ASCII text file, ohh no, it is…
file pardon.txt
pardon.txt: Par archive data
It’s par archive data :p
Warning: file can be wrong in ways that are a lot more wacky. We’re not supposed to understand file, just break it.
3. How to systematically break file
So there is some C code needed, namely, my trivial^w useless^w correct version of file. Well, actually, a butchered version of it that only supports ASCII. You can find that in this repo, which has all the stuff you need to start finding problems with file.
However, here’s the code for the actual script that finds these:
#!/usr/bin/env bash
set -e

dir="test_files/gen_tests"

echo "[*] Generating a $dir directory.."
mkdir -p "$dir"
rm -f "$dir"/*

echo "[*] Generating test files.."
for i in {0..100}
do
    fortune > "$dir/ascii$i.input"
done

echo "[*] Running the type tests..."
exitcode=0
for f in "$dir"/*.input
do
    echo "[+] >>> Testing ${f}..."
    # Strip anything file appends after "ASCII text" so only the type is compared.
    file "${f}" | sed 's/ASCII text.*/ASCII text/' > "${f}.expected"
    # sfile is the file clone under test.
    sfile "${f}" > "${f}.actual"
    if ! diff -u "${f}.expected" "${f}.actual"
    then
        echo "[-] >>> Failed :-(" | lolcat
        exitcode=1
    fi
done

# Report failure to the caller if any test diverged.
exit "$exitcode"
This creates 101 test files (ascii0 through ascii100) from fortune output, and then tests my non-magic file clone against the normal file.
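Once the script has run, the mismatches worth a closer look are the ones where file reported something other than text. Here is a small triage sketch for that. The list_jackpots name is made up, and it assumes the .expected files written by the script above: one line each, of the form path: description, with no colon in the path. It also assumes that the only "boring" descriptions start with ASCII text or Unicode text, as in the outputs seen so far.

```shell
# Hypothetical triage helper: list generated inputs that file did NOT
# describe as plain text.
list_jackpots() {
    # $1 = directory holding the *.expected files
    for e in "$1"/*.expected; do
        [ -f "$e" ] || continue
        # Drop the "path: " prefix, keeping only file's description.
        desc=$(sed 's/^[^:]*: *//' "$e")
        case $desc in
            "ASCII text"*|"Unicode text"*) ;;  # boring: really is text
            *) printf '%s\t%s\n' "${e%.expected}" "$desc" ;;
        esac
    done
}
```

Running list_jackpots test_files/gen_tests then prints one tab-separated line per suspicious input: the path of the input file, followed by what file claimed it was.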
If you wanna try this out yourself (and you’re using nix), it’s as simple as running:
nix run github:cafkafk/file-fuzzer
Sometimes you’ll get output like:
[+] >>> Testing test_files/gen_tests/ascii20.input...
--- test_files/gen_tests/ascii20.input.expected 2023-07-28 14:43:46.127390610 +0200
+++ test_files/gen_tests/ascii20.input.actual 2023-07-28 14:43:46.128390617 +0200
@@ -1 +1 @@
-test_files/gen_tests/ascii20.input: Unicode text, UTF-8 text
+test_files/gen_tests/ascii20.input: data
[-] >>> Failed :-(
[+] >>> Testing test_files/gen_tests/ascii21.input...
This is just because the file clone doesn’t support Unicode; it’s not a real correctness issue. However, something like:
[+] >>> Testing test_files/gen_tests/ascii59.input...
--- test_files/gen_tests/ascii59.input.expected 2023-07-28 14:44:46.571761662 +0200
+++ test_files/gen_tests/ascii59.input.actual 2023-07-28 14:44:46.572761668 +0200
@@ -1 +1 @@
-test_files/gen_tests/ascii59.input: Monkey's Audio compressed format version 29557
+test_files/gen_tests/ascii59.input: ASCII text
[-] >>> Failed :-(
[+] >>> Testing test_files/gen_tests/ascii5.input...
Now that’s a jackpot. In this case, we just cat the file to see what file thinks is a “Monkey’s Audio compressed format version 29557”.
cat test_files/gen_tests/ascii59.input
MAC user's dynamic debugging list evaluator? Never heard of that.
So true bestie.
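That version number isn't random, by the way. My reading, which is an assumption based on the output rather than something checked against the magic database, is that the magic test matches the four bytes "MAC " at the start and then reads the next two bytes as a 16-bit little-endian version number. In this fortune those two bytes are "us". The le16_at helper below is made up for illustration:

```shell
# Hypothetical helper: read the 16-bit little-endian unsigned integer
# stored at byte offset $2 of file $1.
le16_at() {
    # od prints the two bytes as decimal numbers, low byte first.
    od -An -tu1 -j "$2" -N 2 "$1" | {
        read -r lo hi
        echo $(( $lo + $hi * 256 ))
    }
}

tmp=$(mktemp)
printf '%s\n' "MAC user's dynamic debugging list evaluator? Never heard of that." > "$tmp"
le16_at "$tmp" 4    # 'u' is 117 and 's' is 115, so this prints 29557
rm -f "$tmp"
```

So a fortune that happens to start with "MAC " gets its fifth and sixth characters reported as a version number.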
4. Concluding Thoughts
The Unix file command is a useful tool that, while unassuming, harbors peculiar behaviors due to its reliance on magic bytes and language tests to discern file types. As demonstrated, this can sometimes result in strange, misleading, or downright comical outputs.
These inconsistencies offer an opportunity for experimentation, even amusement, as we “break” file to yield the most absurd file type descriptions.
However, there’s a flip side to this coin. If manipulated maliciously, these discrepancies in file type identification could potentially create security vulnerabilities in systems that heavily depend on file. As developers or system administrators, it’s crucial to understand these potential risks, ensuring our systems can handle such scenarios effectively.
In conclusion, while it’s engaging and educational to explore the quirks of tools like file, it’s essential to remember that these peculiarities can have implications beyond their surface amusement.
As Phil Karlton famously observed, there are two hard things in Computer Science: cache invalidation and naming things. With the peculiarities of file identification in mind, we might consider adding a third to this list.
Footnotes:
[1] A topic we won’t cover here; see https://en.wikipedia.org/wiki/List_of_file_signatures.
[2] This doesn’t mean mine was better, just that mine was trivial.