🧼 String Sanitizing

Guide to sanitizing strings.

File Name

                              Linux    macOS    Windows
/                             ❌       ❌ [1]   ❌
\                             ❌       ✅       ❌
:                             ✅       ✅ [1]   ❌
*                             ✅       ✅       ❌
?                             ✅       ✅       ❌
"                             ✅       ✅       ❌
<                             ✅       ✅       ❌
>                             ✅       ✅       ❌
|                             ✅       ✅       ❌
ASCII 0–31                    ❌       ❌       ❌
Leading/trailing space/dot    ✅       ✅       ❌
Case sensitivity              ✅       ❌       ✅ [2]
Reserved names                -        -        CON, PRN, AUX, NUL, COM1–COM9, LPT1–LPT9
Unicode normalization         NFC      NFD      NFC
Max length (bytes)            255      255      255
Max path length (bytes)       4096 [3] 1024     260

You may also want to avoid shell special characters: $, !, &, ;, ', (, ), ~, #, %, @
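As a rough illustration of avoiding the characters listed above, the sketch below replaces each one with an underscore. The helper name `replace_shell_special` and the constant `SHELL_SPECIAL` are hypothetical, and the character set is simply the list from this guide, not a complete list of shell metacharacters.

```python
# Characters from the list above that have special meaning in common shells.
# This set is an assumption taken from this guide, not an exhaustive list.
SHELL_SPECIAL = set("$!&;'()~#%@")


def replace_shell_special(name: str, replacement: str = "_") -> str:
    """Replace each shell special character in `name` with `replacement`."""
    return "".join(replacement if ch in SHELL_SPECIAL else ch for ch in name)


print(replace_shell_special("report (final) #2 & notes.txt"))
# report _final_ _2 _ notes.txt
```

Replacing rather than dropping the characters keeps the name length stable and makes it obvious where something was substituted.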

NFD normalization decomposes certain characters into multiple code points, such as é (U+0065 U+0301) instead of é (U+00E9).
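The difference between the two forms can be checked with Python's standard `unicodedata` module. This is a small sketch using the é example above; the variable names are only illustrative.

```python
import unicodedata

composed = "\u00e9"  # é as a single code point (NFC form)
decomposed = unicodedata.normalize("NFD", composed)  # e + combining acute accent

print([f"U+{ord(c):04X}" for c in composed])    # ['U+00E9']
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0065', 'U+0301']

# The two strings render identically but are not equal until normalized.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

This is why normalizing to one form before comparing or storing file names avoids "duplicate" names that look the same on screen.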

To make a filename work across platforms, it is easiest to lowercase everything.
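Lowercasing helps because names that differ only in case can collide on a case-insensitive file system. The sketch below groups such names; `find_case_collisions` is a hypothetical helper, and it assumes plain `str.lower()` is a good enough approximation of how a file system folds case, which real file systems may not follow exactly.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Return groups of names that become identical once lowercased."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: str.lower() approximates the file system's case folding.
        groups[name.lower()].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(find_case_collisions(["README.md", "readme.md", "notes.txt"]))
# [['README.md', 'readme.md']]
```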

Other relevant limits:

  • Git is case sensitive
  • Dropbox does not support emojis

A sample implementation in Python:

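import re
import unicodedata


def sanitize_filename(name: str, replacement: str = "_", max_length: int = 255, lower: bool = True) -> str:
    # Normalize to NFC so visually identical names compare equal.
    name = unicodedata.normalize("NFC", name)
    # Replace characters that are invalid on at least one platform.
    name = re.sub(r'[\\/:*?"<>|\x00-\x1F]', replacement, name)
    # Windows rejects leading/trailing spaces and dots.
    name = name.strip(" .")

    if lower:
        name = name.lower()

    # Windows reserved device names; they stay reserved with an extension (e.g. "con.txt").
    reserved = {"con", "prn", "aux", "nul"} | {f"{dev}{i}" for dev in ("com", "lpt") for i in range(1, 10)}
    stem, dot, ext = name.partition(".")
    if stem.lower() in reserved:
        name = stem + replacement + dot + ext

    # Truncate by bytes, dropping any multi-byte character that was cut in half.
    encoded = name.encode("utf-8")
    if len(encoded) > max_length:
        name = encoded[:max_length].decode("utf-8", errors="ignore")
        # Truncation can expose a trailing space or dot again.
        name = name.rstrip(" .")

    return name or "untitled"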

HTML

Escape <, >, and & as &lt;, &gt;, and &amp;.
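In Python, the standard library's `html.escape` handles these characters. A minimal sketch follows; the sample string is made up for illustration.

```python
import html

raw = '<a href="x">Tom & Jerry</a>'

# quote=False escapes only &, <, and >.
print(html.escape(raw, quote=False))
# &lt;a href="x"&gt;Tom &amp; Jerry&lt;/a&gt;

# The default (quote=True) also escapes " and ', which matters inside attribute values.
print(html.escape(raw))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```

The default is usually the safer choice, since the same escaped string can then be placed in both element content and quoted attribute values.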

Footnotes

  1. Finder treats : as /.

  2. Internally yes, but some software may treat them as case insensitive.

  3. Can be configured via the PATH_MAX limit, but don't.