Unicode or character encoding error

A CI step failed because a file or data stream contains characters that cannot be interpreted in the assumed encoding.

encoding-unicode · medium confidence · build

Matched signals

  • UnicodeDecodeError
  • UnicodeEncodeError
  • codec can't decode
  • codec can't encode
  • invalid byte sequence
  • invalid multibyte character
  • invalid character in identifier
  • character encoding


What this failure means

A CI step failed because a file or data stream contains characters that cannot be interpreted in the assumed encoding. This commonly manifests as a Python UnicodeDecodeError, a Java MalformedInputException, or a locale warning that a build tool escalates to a fatal error.

Symptoms

Faultline looks for one or more of these log fragments:

UnicodeDecodeError
UnicodeEncodeError
codec can't decode
codec can't encode
invalid byte sequence
invalid multibyte character
invalid character in identifier
character encoding

Diagnosis

Encoding failures occur when:

  1. The CI runner’s locale is C or POSIX rather than a UTF-8 locale such as en_US.UTF-8, so tools default to ASCII
  2. A source file, test fixture, or input data contains non-ASCII characters
  3. A file was saved with Latin-1 or Windows-1252 encoding rather than UTF-8
  4. A string conversion assumes ASCII but encounters an emoji, accented character, or non-Latin script
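The two error families in the matched signals can be reproduced in a few lines. This is a minimal sketch, not taken from any particular failing build; the sample strings are illustrative assumptions.

```python
# The same text stored in two different encodings.
text = "café"
latin1_bytes = text.encode("latin-1")   # one byte for the accented character
utf8_bytes = text.encode("utf-8")       # two bytes for the accented character

# Cause 3: Latin-1 bytes read as UTF-8 fail to decode.
try:
    latin1_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)

# Causes 1 and 4: non-ASCII text pushed through an ASCII codec fails to encode.
try:
    "build ok ✓".encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = str(exc)

print(decode_error)   # message contains "codec can't decode"
print(encode_error)   # message contains "codec can't encode"
```

The UTF-8 bytes decode cleanly as UTF-8, which is why converting the file or declaring its real encoding resolves the first case.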

Check the runner’s locale:

locale
echo "LANG=$LANG LC_ALL=$LC_ALL LC_CTYPE=$LC_CTYPE"

A minimal CI runner may show LANG=C or no locale set at all.
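When the failing step is a Python process, it can help to see which encodings the interpreter itself has settled on, since these may differ from what the shell reports. A small sketch follows; the function name encoding_report is an assumption made for illustration.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python assumes when none is given explicitly."""
    return {
        # Default used by open() when no encoding argument is passed
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names
        "filesystem": sys.getfilesystemencoding(),
        # Encoding used by print(); may be None if stdout was replaced
        "stdout": getattr(sys.stdout, "encoding", None),
        # True when Python's UTF-8 mode is active (Python 3.7+)
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

A "preferred" value such as ANSI_X3.4-1968 or US-ASCII suggests the runner locale is the cause.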

Find non-UTF-8 files:

find . -name "*.py" -exec file {} + | grep -vE "ASCII|UTF-8|empty"
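The file command guesses from content heuristics. A stricter alternative is to attempt a real UTF-8 decode of each file. The sketch below assumes that approach; find_non_utf8 and its suffixes parameter are hypothetical names, not part of Faultline.

```python
import os


def find_non_utf8(root, suffixes=(".py",)):
    """Return sorted paths under root whose bytes are not valid UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue  # only check the file types we care about
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode("utf-8")  # strict decode; pure ASCII also passes
            except UnicodeDecodeError:
                bad.append(path)
    return sorted(bad)
```

Calling find_non_utf8(".") from the repository root returns the offending paths. Failing the job when the list is non-empty turns this into a pre-flight check.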

Fix steps

  1. Set UTF-8 locale in CI:

    # GitHub Actions / most CI
    # If en_US.UTF-8 is not generated in the image, C.UTF-8 is often available instead
    env:
      LANG: en_US.UTF-8
      LC_ALL: en_US.UTF-8
      PYTHONIOENCODING: utf-8
    
  2. In Python, open files with explicit encoding:

    # GOOD
    with open("file.txt", encoding="utf-8") as f:
        data = f.read()
    
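    # Python 3.10+ can emit EncodingWarning when open() relies on the
    # locale default. The warning is opt-in: set PYTHONWARNDEFAULTENCODING=1
    # (or pass -X warn_default_encoding) to audit existing code.
    # UTF-8 mode is planned to become the default in Python 3.15 (PEP 686).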
    
  3. Convert non-UTF-8 files to UTF-8:

    # Detect encoding
    file -i suspicious-file.txt
    chardetect suspicious-file.txt  # CLI installed by the chardet package
    
    # Convert Latin-1 to UTF-8
    iconv -f latin1 -t utf-8 input.txt > output.txt
    
  4. In Java, specify charset explicitly:

    Files.readString(path, StandardCharsets.UTF_8);
    new InputStreamReader(stream, StandardCharsets.UTF_8);
    
  5. Ensure the database connection and ORM use UTF-8:

    # MySQL
    jdbc:mysql://host/db?characterEncoding=UTF-8
    
    # PostgreSQL: set client_encoding = 'UTF8' in the connection
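Where iconv is not installed on the runner, the conversion in step 3 can be done in Python with explicit encodings on both sides, as in step 2. This is a sketch under the assumption that the source encoding is already known; convert_to_utf8 is a hypothetical helper name.

```python
def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode src (assumed to be in source_encoding) as UTF-8 at dst."""
    # newline="" preserves the original line endings on both read and write
    with open(src, encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Latin-1 accepts every byte value, so a wrong guess about the source encoding produces garbled characters rather than an exception. Inspect the output before committing it.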
    

Validation

  • Re-run the failing step after setting the locale.
  • Confirm locale shows LANG=en_US.UTF-8.
  • Run the failing test or build and confirm no codec errors.
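A quick smoke check can confirm that the runner's default encoding handles non-ASCII text before the full build is re-run. The sketch below is one possible check; SAMPLE and can_round_trip are illustrative names, and the sample text is arbitrary.

```python
import locale

SAMPLE = "naïve café ☃ 日本語"  # arbitrary non-ASCII sample text


def can_round_trip(text, encoding):
    """Return True if text survives an encode/decode cycle in encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except (UnicodeError, LookupError):
        # UnicodeError: the codec cannot represent the text
        # LookupError: the encoding name is unknown
        return False


default = locale.getpreferredencoding(False)
print(f"{default}: {'ok' if can_round_trip(SAMPLE, default) else 'FAILS'}")
```

With the locale fixed, the default should be reported as UTF-8 and the check should print ok.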

Why it matters

Encoding failures are rare locally because most developer machines default to a UTF-8 locale, but they are common in minimal CI containers that default to the C locale. They are tricky to diagnose because the error surfaces where data is read or written, far from the locale configuration that caused it.

Prevention

  • Set LANG=en_US.UTF-8 and PYTHONIOENCODING=utf-8 in all CI environment blocks.
  • Enforce UTF-8 in .editorconfig: charset = utf-8.
  • Use file -i in a CI pre-flight check to detect non-UTF-8 source files.

How Faultline detects it

Use faultline explain encoding-unicode to see the full playbook.

faultline analyze build.log
faultline explain encoding-unicode

Generated from playbooks/bundled/log/runtime/encoding-unicode.yaml. Do not edit directly.
