Unicode or character encoding error
What this failure means
A CI step failed because a file or data stream contains characters that
cannot be interpreted in the assumed encoding. This commonly manifests as a
Python UnicodeDecodeError, a Java MalformedInputException, or a locale
warning that a build tool treats as a fatal error.
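As a minimal illustration of the failure described above, the sketch below decodes the same bytes under two assumed encodings. The sample string and the helper name `try_decode` are made up for this example.

```python
# Bytes as they might sit on disk: "café" saved as UTF-8.
raw = "café".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error text if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc}"


# A tool that assumes ASCII cannot interpret the two-byte sequence for "é".
ascii_result = try_decode(raw, "ascii")
# The same bytes decode cleanly once UTF-8 is assumed.
utf8_result = try_decode(raw, "utf-8")

print(ascii_result)
print(utf8_result)
```

The bytes are identical in both calls; only the assumed encoding changes the outcome.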
Symptoms
Faultline looks for one or more of these log fragments:
UnicodeDecodeError
UnicodeEncodeError
codec can't decode
codec can't encode
invalid byte sequence
invalid multibyte character
invalid character in identifier
character encoding
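The fragment list above can be checked against a log with plain substring matching. This is only a rough sketch of the idea and not Faultline's actual matcher; the function name `matched_signals` and the sample log line are hypothetical.

```python
# Log fragments copied from the list above.
SIGNALS = [
    "UnicodeDecodeError",
    "UnicodeEncodeError",
    "codec can't decode",
    "codec can't encode",
    "invalid byte sequence",
    "invalid multibyte character",
    "invalid character in identifier",
    "character encoding",
]


def matched_signals(log_text: str) -> list[str]:
    """Return the fragments that occur in the log, in list order."""
    return [signal for signal in SIGNALS if signal in log_text]


# Hypothetical log excerpt, used only to exercise the function.
sample_log = (
    "Traceback (most recent call last):\n"
    "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3\n"
)
hits = matched_signals(sample_log)
print(hits)
```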
Diagnosis
Encoding failures occur when:
- The CI runner’s locale is C or POSIX rather than en_US.UTF-8, causing tools to default to ASCII
- A source file, test fixture, or input data contains non-ASCII characters
- A file was saved with Latin-1 or Windows-1252 encoding rather than UTF-8
- A string conversion assumes ASCII but encounters an emoji, accented character, or non-Latin script
Check the runner’s locale:
locale
echo $LANG $LC_ALL $LC_CTYPE
A minimal CI runner may show LANG=C or no locale set at all.
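The shell commands above show what the environment declares. To see what Python actually derived from it, a small report like the following can be run on the runner. It is a sketch using only the standard library; the function name `encoding_report` is an assumption for this example.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python derived from the current environment."""
    return {
        # Encoding used by open() when no encoding= argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding used when printing; may be None if stdout is replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # True when Python's UTF-8 mode is active.
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


report = encoding_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

A `preferred` or `stdout` value other than UTF-8 suggests that the runner's locale, rather than the code, is the place to start.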
Find non-UTF-8 files:
find . -name "*.py" -exec file {} \; | grep -v "ASCII\|UTF-8"
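Where the `file` utility is unavailable, or a strict check is preferred over its heuristics, the same search can be sketched in Python by attempting a UTF-8 decode of each file. The function name `non_utf8_files` and its default pattern are assumptions for this example.

```python
from pathlib import Path


def non_utf8_files(root: str, pattern: str = "*.py") -> list[Path]:
    """Return files under root matching pattern that are not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        try:
            # Pure ASCII is valid UTF-8, so only other encodings fail here.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Calling `non_utf8_files(".")` from the repository root returns the offending paths. The check reads each file fully into memory, which is fine for source files but may be slow for large data fixtures.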
Fix steps
- Set a UTF-8 locale in CI:
    # GitHub Actions / most CI
    env:
      LANG: en_US.UTF-8
      LC_ALL: en_US.UTF-8
      PYTHONIOENCODING: utf-8
- In Python, open files with an explicit encoding:
    # GOOD
    with open("file.txt", encoding="utf-8") as f:
        data = f.read()
    # Python 3.10+ can emit EncodingWarning when open() is called without an encoding;
    # set PYTHONWARNDEFAULTENCODING=1 to audit existing code
- Convert non-UTF-8 files to UTF-8:
    # Detect encoding
    file -i suspicious-file.txt
    chardetect suspicious-file.txt  # requires chardet package
    # Convert Latin-1 to UTF-8
    iconv -f latin1 -t utf-8 input.txt > output.txt
- In Java, specify the charset explicitly:
    Files.readString(path, StandardCharsets.UTF_8);
    new InputStreamReader(stream, StandardCharsets.UTF_8);
- Ensure the database connection and ORM use UTF-8:
    # MySQL
    jdbc:mysql://host/db?characterEncoding=UTF-8
    # PostgreSQL: set client_encoding = 'UTF8' in the connection
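If `iconv` is not installed on the runner, the conversion step above can be approximated in Python. This is a sketch that assumes the source encoding is already known; the function name `convert_to_utf8` and its parameters are hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "latin-1") -> None:
    """Re-encode src from source_encoding to UTF-8 and write it to dst."""
    # Work on bytes so that line endings are preserved exactly.
    data = Path(src).read_bytes()
    text = data.decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Latin-1 maps every possible byte to a character, so decoding with it never raises an error. A wrong guess about the source encoding therefore produces garbled text silently, which makes it worth inspecting the output before committing it.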
Validation
- Re-run the failing step after setting the locale.
- Confirm locale shows LANG=en_US.UTF-8.
- Run the failing test or build and confirm no codec errors.
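The locale check can also be automated so that a job fails early when the setting is missing. The sketch below applies the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and accepts any UTF-8 locale, not only en_US.UTF-8; the function names are assumptions for this example.

```python
import os
from typing import Mapping, Optional


def effective_ctype_locale(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty value among LC_ALL, LC_CTYPE, and LANG."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


def locale_is_utf8(env: Mapping[str, str] = os.environ) -> bool:
    """Report whether the effective character locale names a UTF-8 codeset."""
    value = effective_ctype_locale(env)
    if value is None:
        return False
    # Drop an optional "@modifier", then take the part after the last dot.
    codeset = value.split("@")[0].rpartition(".")[2]
    return codeset.lower().replace("-", "") == "utf8"


print(locale_is_utf8())
```

Because LC_ALL takes precedence, a runner with LANG=en_US.UTF-8 but LC_ALL=C still fails this check, which matches how most tools resolve the locale.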
Why it matters
Encoding failures are rare locally, where developer machines usually default to UTF-8, but common in minimal CI containers that default to the C locale. They are tricky to diagnose because the error often surfaces far from where the encoding is configured.
Prevention
- Set LANG=en_US.UTF-8 and PYTHONIOENCODING=utf-8 in all CI environment blocks.
- Enforce UTF-8 in .editorconfig: charset = utf-8.
- Use file -i in a CI pre-flight check to detect non-UTF-8 source files.
How Faultline detects it
Use faultline explain encoding-unicode to see the full playbook.
faultline analyze build.log
faultline explain encoding-unicode
Generated from playbooks/bundled/log/runtime/encoding-unicode.yaml. Do not edit directly.