Introduction
The GNU Coding Standards say the following:
Avoid arbitrary limits on the length or number of any data structure, including file names, lines, files, and symbols, by allocating all data structures dynamically. In most Unix utilities, “long lines are silently truncated”. This is not acceptable in a GNU utility.
GNU Coreutils adheres to this guideline, avoiding arbitrary limits that other implementations often do not. This page is a work-in-progress attempt at listing the limitations that are avoided.
rm -r can remove deeply nested directories
POSIX states the following about rm:
The rm utility shall be able to descend to arbitrary depths in a file hierarchy, and shall not fail due to path length limitations (unless an operand specified by the user exceeds system limitations).
GNU Coreutils complies with this specification, as the example below shows:
$ mkdir -p $(yes a/ | head -n $((32 * 1024)) | tr -d '\n')
$ rm -rf a
$ ls a
ls: cannot access 'a': No such file or directory
This feature is important because, without it, you would not be able to
delete deeply nested directories that other programs may create. A naive
implementation constructs the deepest path and calls
unlink() and/or rmdir() on it directly.
However, this would fail and set errno to
ENAMETOOLONG if the path is larger than
PATH_MAX, leading the program to exit without removing
any directory entries. Despite POSIX requiring otherwise, some
implementations behave this way.
The PATH_MAX limit is avoided by using
openat() and friends, which operate on file names
relative to file descriptors.
pwd works even when getcwd() fails
Many implementations of pwd will fail if the absolute name of the
current working directory is longer than PATH_MAX bytes.
GNU Coreutils does not have this limitation since it does not use a
statically allocated buffer. In addition, it replaces the system's
getcwd() function with one from Gnulib if the system's version cannot
handle file names longer than PATH_MAX. The Gnulib
implementation first tries the system's getcwd() function. If
that fails because the name is too long, it opens a file descriptor for
each parent directory in turn, using readdir() and
inode-number comparisons to recover each component's name, building up
the path piece by piece until the root directory is reached.
You can confirm this behavior using the following commands:
$ mkdir -p $(yes a/ | head -n $((32 * 1024)) | tr -d '\n')
$ while cd $(yes a/ | head -n 1024 | tr -d '\n'); do :; done 2>/dev/null
$ env pwd | tr '/' '\n' | tail -n +2 | grep -vE "(home|$(id -un))" | wc -l
32768
Note that the env invocation is needed to ensure that the
shell builtin, if it exists, is not used.
ls -1 --sort=none works even when the file names do not fit into memory
Typical invocations of the ls command require that all
file names in the directory being printed, along with some additional
per-file metadata, fit into the system's memory. This is unavoidable in
most situations, because ls has to read all file names
for sorting and output alignment. However, none of that is needed
when the -1 --sort=none options are given. In this case,
GNU Coreutils does not buffer the file names in memory; it prints
them as they are read.
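A minimal sketch of the streaming approach in C (list_streaming is a hypothetical name, and the real ls additionally handles quoting, metadata, and many options):

```c
#include <dirent.h>
#include <stdio.h>

/* Stream a directory listing: print each name as soon as readdir()
   returns it, holding only one entry in memory at a time. This mirrors
   the effect of `ls -1 --sort=none`, whose memory use is independent
   of the number of entries in the directory. */
static int list_streaming(const char *dirname, FILE *out)
{
    DIR *dir = opendir(dirname);
    if (dir == NULL)
        return -1;

    struct dirent *e;
    while ((e = readdir(dir)) != NULL) {
        if (e->d_name[0] == '.')
            continue;  /* like ls, skip dot files by default */
        fputs(e->d_name, out);
        fputc('\n', out);
    }

    closedir(dir);
    return 0;
}
```

Because nothing is buffered, the names appear in whatever order the filesystem returns them, which is exactly what --sort=none means.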
We can use the following commands to compare the memory usage of
ls's default options with the memory usage of
ls -1 --sort=none in a directory of 10 million files:
$ mkdir -p tmp && cd tmp
$ seq 10000000 | parallel -j 16 -m 'truncate -s 0 {}'
$ numfmt --to=iec --from=iec \
$(env time --format=%M ls 2>&1 > /dev/null)k
2.4G
$ numfmt --to=iec --from=iec \
$(env time --format=%M ls -1 --sort=none 2>&1 > /dev/null)k
2.7M