Unicode or character encoding error

What this failure means

A CI step failed because a file or data stream contains characters that cannot be interpreted in the assumed encoding. This commonly manifests as a Python UnicodeDecodeError, a Java MalformedInputException, or a locale warning that a build tool escalates into a fatal error.

Symptoms

Faultline looks for one or more of these log fragments:

UnicodeDecodeError
UnicodeEncodeError
codec can't decode
codec can't encode
invalid byte sequence
invalid multibyte character
invalid character in identifier
character encoding
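
To check a log by hand against the fragments above, a plain substring scan is enough. This is a minimal sketch, not Faultline's actual matcher: the source does not say whether matching is case-sensitive or regex-based, and the names FRAGMENTS and matching_fragments are hypothetical.

```python
# Fragments copied from the Symptoms list above.
FRAGMENTS = (
    "UnicodeDecodeError",
    "UnicodeEncodeError",
    "codec can't decode",
    "codec can't encode",
    "invalid byte sequence",
    "invalid multibyte character",
    "invalid character in identifier",
    "character encoding",
)


def matching_fragments(log_text):
    """Return the fragments that occur in log_text (case-sensitive substring match)."""
    return [fragment for fragment in FRAGMENTS if fragment in log_text]


# Made-up log line for illustration.
sample = "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3"
print(matching_fragments(sample))
```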

Diagnosis

Encoding failures occur when:

The CI runner’s locale is C or POSIX rather than en_US.UTF-8, causing tools to default to ASCII
A source file, test fixture, or input data contains non-ASCII characters
A file was saved with Latin-1 or Windows-1252 encoding rather than UTF-8
A string conversion assumes ASCII but encounters an emoji, accented character, or non-Latin script
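
Two of the causes above can be reproduced in a few lines of Python. The sample string is made up for illustration; the point is only to show which operation raises which error.

```python
# Illustrative only: the sample text is invented for this sketch.
text = "café"

latin1_bytes = text.encode("latin-1")  # the "é" becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")      # the "é" becomes the two bytes 0xC3 0xA9

# Decoding Latin-1 bytes as UTF-8 fails: 0xE9 opens a three-byte UTF-8
# sequence, but no continuation bytes follow it.
try:
    latin1_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)

# Encoding to ASCII fails on any character outside the 7-bit range.
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = str(exc)

print(decode_error)
print(encode_error)
```

Both messages contain the "codec can't decode" and "codec can't encode" fragments listed under Symptoms.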

Check the runner’s locale:

locale
echo $LANG $LC_ALL $LC_CTYPE

A minimal CI runner may show LANG=C or no locale set at all.
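
For Python steps it can also help to print what the interpreter itself derived from the locale, since that determines the default for open(). The sketch below uses only standard-library calls; the function name encoding_report is hypothetical.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derived from the current environment."""
    return {
        # Default used by open() when no encoding is given (outside UTF-8 mode).
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used for file names and command-line arguments.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Encoding of standard output; may be None if stdout was replaced.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # True when Python UTF-8 mode is active (-X utf8 or PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

On a runner with a C or POSIX locale, the preferred encoding is typically reported as an ASCII variant rather than UTF-8, although the exact value depends on the Python version and platform.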

Find non-UTF-8 files:

find . -name "*.py" -exec file {} \; | grep -v "ASCII\|UTF-8"
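
Where the file utility is unavailable, or its heuristics are not trusted, the same check can be sketched in Python by attempting a strict UTF-8 decode of each file. The function name find_non_utf8 and the "*.py" default pattern are assumptions chosen to mirror the command above.

```python
from pathlib import Path


def find_non_utf8(root, pattern="*.py"):
    """Return (path, byte_offset) pairs for files that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        try:
            # Strict decode: raises at the first invalid byte sequence.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            bad.append((path, exc.start))
    return bad


if __name__ == "__main__":
    for path, offset in find_non_utf8("."):
        print(f"{path}: invalid UTF-8 at byte offset {offset}")
```

Pure-ASCII files pass this check because ASCII is a subset of UTF-8, which matches the intent of the grep filter above.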

Fix steps

Set UTF-8 locale in CI:

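# GitHub Actions / most CI
# The en_US.UTF-8 locale must exist in the image; C.UTF-8 is a common fallback.
# PYTHONIOENCODING covers stdin/stdout/stderr only, not open().
env:
  LANG: en_US.UTF-8
  LC_ALL: en_US.UTF-8
  PYTHONIOENCODING: utf-8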

In Python, open files with explicit encoding:

# GOOD
with open("file.txt", encoding="utf-8") as f:
    data = f.read()

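# Python 3.10+ can emit EncodingWarning when open() is called without an encoding (PEP 597)
# set PYTHONWARNDEFAULTENCODING=1 (or pass -X warn_default_encoding) to audit existing code
# PEP 686 plans to make UTF-8 mode the default from Python 3.15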

Convert non-UTF-8 files to UTF-8:

# Detect encoding
file -i suspicious-file.txt
chardetect suspicious-file.txt  # CLI installed by the chardet package

# Convert Latin-1 to UTF-8
iconv -f latin1 -t utf-8 input.txt > output.txt
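
If iconv is not installed on the runner, the conversion can be sketched in Python. This assumes the source encoding is already known (Latin-1 here, as in the iconv example); the function name convert_to_utf8 is hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode src from source_encoding to UTF-8 and write the result to dst."""
    # Work on bytes so that line endings are preserved exactly.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Note that Latin-1 decoding never fails, because every byte value maps to a character. Passing the wrong source encoding therefore produces mojibake rather than an error, so detect the encoding first.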

In Java, specify charset explicitly:

Files.readString(path, StandardCharsets.UTF_8);
new InputStreamReader(stream, StandardCharsets.UTF_8);

Ensure the database connection and ORM use UTF-8:

# MySQL
jdbc:mysql://host/db?characterEncoding=UTF-8

# PostgreSQL: set client_encoding = 'UTF8' in the connection

Validation

Re-run the failing step after setting the locale.
Confirm locale shows LANG=en_US.UTF-8.
Run the failing test or build and confirm no codec errors.

Why it matters

Encoding failures are rare locally (developer machines usually default to a UTF-8 locale) but common in minimal CI containers that default to the C locale. They are tricky to diagnose because the error surfaces at whichever read or write first hits a non-ASCII byte, far from where the locale is configured.

Prevention

Set LANG=en_US.UTF-8 and PYTHONIOENCODING=utf-8 in all CI environment blocks.
Enforce UTF-8 in .editorconfig: charset = utf-8.
Use file -i in a CI pre-flight check to detect non-UTF-8 source files.

How Faultline detects it

Use faultline explain encoding-unicode to see the full playbook.

faultline analyze build.log
faultline explain encoding-unicode

Generated from playbooks/bundled/log/runtime/encoding-unicode.yaml. Do not edit directly.
