Arbitrary Limits Avoided by GNU Coreutils

Introduction

The GNU Coding Standards say the following:

Avoid arbitrary limits on the length or number of any data structure, including file names, lines, files, and symbols, by allocating all data structures dynamically. In most Unix utilities, “long lines are silently truncated”. This is not acceptable in a GNU utility.

GNU Coreutils adheres to this guideline, avoiding arbitrary limits that other implementations often impose. This page is a work-in-progress attempt at listing the limits that are avoided.

rm -r can remove deeply nested directories

POSIX states the following about rm:

The rm utility shall be able to descend to arbitrary depths in a file hierarchy, and shall not fail due to path length limitations (unless an operand specified by the user exceeds system limitations).

GNU Coreutils complies with this specification, as the example below demonstrates:


$ mkdir -p $(yes a/ | head -n $((32 * 1024)) | tr -d '\n')
$ rm -rf a
$ ls a
ls: cannot access 'a': No such file or directory

This feature is important because, without it, deeply nested directory trees created by other programs could not be removed. Naive implementations construct the full path of the deepest entry and call unlink() and/or rmdir() on it directly. This fails with errno set to ENAMETOOLONG when the path is longer than PATH_MAX, causing the program to exit without removing any directory entries. Despite POSIX requiring otherwise, some implementations behave this way.
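
The failure mode is easy to reproduce in isolation. The following minimal sketch (assuming a Linux-like system where PATH_MAX is defined in <limits.h>) builds a relative path longer than PATH_MAX and shows the kernel rejecting it by length alone:

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Build a relative path "a/a/a/...a" longer than PATH_MAX. */
    size_t len = PATH_MAX + 15;
    char *path = malloc(len + 1);
    if (!path)
        return 1;
    memset(path, 'a', len);
    for (size_t i = 1; i < len; i += 2)
        path[i] = '/';
    path[len] = '\0';

    /* The kernel rejects the path before it even looks at the
       file system, so nothing can be removed this way. */
    if (unlink(path) == -1 && errno == ENAMETOOLONG)
        puts("unlink failed: File name too long");

    free(path);
    return 0;
}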

Avoiding the PATH_MAX limit is done using openat() and friends, which operate on file names relative to file descriptors.
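
A heavily simplified sketch of the technique (not the actual Coreutils code) removes each entry relative to a file descriptor of its parent directory, so every system call takes a single path component and the depth of the tree never matters to path resolution:

#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Remove "name", interpreted relative to the directory fd "parent". */
static int remove_recursive(int parent, const char *name)
{
    /* First try to remove it as a non-directory. */
    if (unlinkat(parent, name, 0) == 0)
        return 0;

    /* Otherwise, descend into it as a directory. */
    int fd = openat(parent, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
    if (fd < 0)
        return -1;
    DIR *dir = fdopendir(fd);   /* takes ownership of fd */
    if (!dir) {
        close(fd);
        return -1;
    }

    struct dirent *e;
    while ((e = readdir(dir)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        remove_recursive(dirfd(dir), e->d_name);
    }
    closedir(dir);

    /* The directory should now be empty. */
    return unlinkat(parent, name, AT_REMOVEDIR);
}

int main(int argc, char **argv)
{
    if (argc != 2)
        return 2;
    return remove_recursive(AT_FDCWD, argv[1]) == 0 ? 0 : 1;
}

Note that this naive recursion still keeps one open file descriptor and one stack frame per directory level, so on a 32768-deep tree it would run into those limits instead; the fts-based traversal in GNU rm avoids them as well. The essential idea, though, is that every system call names a file relative to an already-open directory.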

pwd works even when getcwd() fails

Many implementations of pwd will fail if the name of the current working directory is longer than PATH_MAX. GNU Coreutils does not have this limitation, since it does not use a statically allocated buffer. In addition, it replaces the system's getcwd() function with one from Gnulib if the system's version does not handle file names longer than PATH_MAX. The Gnulib implementation first tries the system's getcwd(). If that fails because the name is too long, it opens a file descriptor for the parent directory, finds the current directory's name by comparing inode numbers against readdir() entries, and prepends that name to the result, repeating the process for each ancestor until the root directory is reached.
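
A heavily simplified sketch of this walk-up algorithm (not the Gnulib code; it skips most error handling and the mount-point cases a real implementation must deal with):

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    char *path = strdup("");
    int fd = open(".", O_RDONLY | O_DIRECTORY);
    if (fd < 0 || !path)
        return 1;

    for (;;) {
        struct stat self, up;
        if (fstat(fd, &self) != 0)
            return 1;

        int parent = openat(fd, "..", O_RDONLY | O_DIRECTORY);
        if (parent < 0 || fstat(parent, &up) != 0)
            return 1;

        /* When "." and ".." are the same inode, we have reached "/". */
        if (self.st_dev == up.st_dev && self.st_ino == up.st_ino) {
            close(parent);
            break;
        }

        /* Find our own name in the parent by matching inode numbers.
           (Crossing a mount point needs extra care, ignored here.) */
        DIR *dir = fdopendir(parent);   /* takes ownership of parent */
        if (!dir)
            return 1;
        struct dirent *e;
        char *name = NULL;
        while ((e = readdir(dir)) != NULL)
            if (e->d_ino == self.st_ino
                && strcmp(e->d_name, ".") != 0
                && strcmp(e->d_name, "..") != 0) {
                name = strdup(e->d_name);
                break;
            }
        if (!name)
            return 1;

        /* Prepend "/name" to the result built so far. */
        char *longer = malloc(strlen(name) + strlen(path) + 2);
        if (!longer)
            return 1;
        sprintf(longer, "/%s%s", name, path);
        free(path);
        free(name);
        path = longer;

        /* Move one level up. */
        close(fd);
        fd = dup(dirfd(dir));
        closedir(dir);
    }

    puts(*path ? path : "/");
    free(path);
    return 0;
}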

You can confirm this behavior using the following commands:


$ mkdir -p $(yes a/ | head -n $((32 * 1024)) | tr -d '\n')
$ while cd $(yes a/ | head -n 1024 | tr -d '\n'); do :; done 2>/dev/null
$ env pwd | tr '/' '\n' | tail -n +2 | grep -vE "(home|$(id -un))" | wc -l
32768

Note that invoking pwd through env ensures that the standalone program is executed rather than the shell's builtin.

ls -1 --sort=none works even when the file names do not all fit into memory

Typically, invocations of the ls command require that all file names in the directory being printed, along with some additional per-file metadata, fit into the system's memory. This is unavoidable in most situations, because ls has to read every file name before it can sort and align its output. However, none of that is needed when the -1 --sort=none options are given, so in that case GNU Coreutils does not buffer file names in memory; it prints each one as it is read.
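
Streaming output needs nothing more than a readdir() loop that prints entries as they arrive. A minimal sketch of the idea (not the actual ls source):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    DIR *dir = opendir(argc > 1 ? argv[1] : ".");
    if (!dir) {
        perror("opendir");
        return 1;
    }

    /* Print each entry as soon as readdir() returns it; nothing is
       accumulated, so memory use is constant for any directory size. */
    struct dirent *e;
    while ((e = readdir(dir)) != NULL)
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
            puts(e->d_name);

    closedir(dir);
    return 0;
}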

We can use the following commands to compare the memory usage of ls with its default options against that of ls -1 --sort=none in a directory containing 10 million files:


$ mkdir -p tmp && cd tmp
$ seq 10000000 | parallel -j 16 -m 'truncate -s 0 {}'
$ numfmt --to=iec --from=iec \
    $(env time --format=%M ls 2>&1 > /dev/null)k
2.4G
$ numfmt --to=iec --from=iec \
    $(env time --format=%M ls -1 --sort=none 2>&1 > /dev/null)k
2.7M